Data management
Finding consensus ASVs
Say we have used several softwares to generate amplicon sequence variants (ASVs) from a sequencing dataset. We can generate a set of consensus ASVs that were detected by all methods and trim our abundance table to those ASVs. We load the Saheb-Alam2019 that was processed using both DADA2 and Deblur.
[2]:
import qdiv
obj1 = qdiv.MicrobiomeData.load_example("Saheb-Alam_2019_DADA2")
obj2 = qdiv.MicrobiomeData.load_example("Saheb-Alam_2019_Deblur")
obj1.info()
obj2.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
MicrobiomeData object summary
----------------------------------------
Abundance table: 952 features x 16 samples
Total reads: 572710.0
Min reads/sample: 11347.0
Max reads/sample: 88846.0
Taxonomy table: 952 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 952 features
Tree: 1901 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
We see that the DADA2 data has 672 ASVs (features) while the Deblur data has 952. Now we will find a consensus data set, which retains ASVs detected by both methods. Let’s call the sequences.consensus function.
[3]:
consensus_obj, consensus_info = qdiv.sequences.consensus([obj1, obj2], different_lengths=True, keep_cutoff=0.2, name_type="ASV", keep_object=0)
Running consensus...
Aligning features in 2 objects: 1.. 2..
Changing feature names in 2 objects: 1.. 2..
Done with align
Done with consensus.
[4]:
consensus_obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 635 features x 16 samples
Total reads: 503752.0
Min reads/sample: 9742.0
Max reads/sample: 82587.0
Taxonomy table: 635 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 635 features
Tree: 1303 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
We can see that the consensus data has 635 ASVs. We can also get more information about what was happening when the consensus object was generated. Let’s print the consensus_info variable.
[5]:
print(consensus_info)
### consensus ###
There are 630 features in common between all objects.
In obj.0, the consensus features account for:
93.75 % of the features
99.15 % of the total abundance counts
97.8 - 100.0 % of the abundance counts per sample
In obj.1, the consensus features account for:
66.18 % of the features
98.35 % of the total abundance counts
97.43 - 99.19 % of the abundance counts per sample
The object with position 0 in the input list is retained as the consensus object.
This object also includes 241 features with a relative abundance above the keep_cutoff threshold.
All in all, the consensus object retains 94.49 % of its original features,
which represent 99.91 % of the total abundance counts,
and 99.58 - 100.0 % of the abundance counts per sample.
The single most abundant feature that was removed had a relative abundance of to 0.0 - 0.17 % in the samples. The taxonomic classification of this feature was Bacteria;Pseudomonadota;Alphaproteobacteria;Rickettsiales;Mitochondria;;
--------------
Subsetting
We can subset object based on information provided in the meta data. Let’s have a look at the Saheb-Alam2019 example data. We load the data and look at the meta data. (Here we also use rename_features because this example data has weird feature names and it’s nice if all features are named with the ASV prefix).
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load_example("Saheb-Alam2019_DADA2")
obj.rename_features(name_type="ASV", inplace=True)
print(obj.meta)
location feed mfc
sample
S4 anode acetate B
S5 anode acetate B
S6 anode acetate B
S7 anode acetate B
S10 cathode acetate B
S11 cathode acetate B
S12 cathode acetate B
S13 cathode acetate B
S20 anode glucose D
S21 anode glucose D
S22 anode glucose D
S23 anode glucose D
S26 cathode glucose D
S27 cathode glucose D
S28 cathode glucose D
S29 cathode glucose D
These are samples collected from microbial fuel cells (see https://doi.org/10.1111/1751-7915.13449). Biofilms growing on anodes and cathodes in systems fed with either acetate or glucose were sequenced. Let’s say that we’re only interested in the anode sample. We can subset the object to the anode samples using subset_samples:
[2]:
obj.subset_samples(by="location", values=["anode"], inplace=True)
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 316 features x 8 samples
Total reads: 241039.0
Min reads/sample: 13103.0
Max reads/sample: 55054.0
Taxonomy table: 316 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 316 features
Tree: 1342 nodes
Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
Now, we can see that we only have 8 samples remaining. There are 316 features detected in this samples. By setting inplace=True, we specify that we want to modify the object in place and not generate a new object.
We can also subset to specific taxa. Desulfobacterota, especially Geobacter spp., tend to be abundant in microbial fuel cells. Let’s subset the object to those taxa using subset_taxa and see what we get.
[3]:
obj_geobacter = obj.subset_taxa(subset_levels=["Genus"], subset_patterns=["Geobacter"])
obj_geobacter.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 2 features x 8 samples
Total reads: 824.0
Min reads/sample: 0.0
Max reads/sample: 350.0
Taxonomy table: 2 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 2 features
Tree: 1342 nodes
Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
[4]:
print(obj_geobacter.tax)
Domain Phylum Class Order \
Feature
ASV127 Bacteria Desulfobacterota_G_459546 Desulfuromonadia Geobacterales
ASV147 Bacteria Desulfobacterota_G_459546 Desulfuromonadia Geobacterales
Family Genus Species
Feature
ASV127 Geobacteraceae_439684 Geobacter
ASV147 Geobacteraceae_439684 Geobacter
We can also check their read counts in the samples:
[5]:
print(obj_geobacter.tab)
S4 S5 S6 S7 S20 S21 S22 S23
Feature
ASV127 0.0 0.0 0.0 0.0 133.0 103.0 66.0 150.0
ASV147 0.0 47.0 34.0 0.0 0.0 91.0 0.0 200.0
[6]:
obj_abund = obj.subset_abundant(n=10)
print(obj_abund.tax[["Phylum", "Genus"]])
Phylum Genus
Feature
ASV1 Desulfobacterota_G_459546 Trichloromonas
ASV2 Bacillota_I Trichococcus
ASV3 Desulfobacterota_G_459546
ASV6 Bacteroidota JALNZU01
ASV18 Spirochaetota UBA12059
ASV20 Bacteroidota Perlabentimonas
ASV21 Bacteroidota Bact-08
ASV23 Bacteroidota Lentimicrobium
ASV25 Pseudomonadota Sulfuritalea
ASV38 Bacillota_I Trichococcus
Here we see that among the most abundant ASVs, there are two Desulfobacterota species. But one is classified as Trichloromonas and the other is unclassified at the genus level. Let’s check their read counts.
[7]:
print(obj_abund.tab)
S4 S5 S6 S7 S20 S21 S22 S23
Feature
ASV1 11302.0 21046.0 11153.0 9637.0 1013.0 1693.0 522.0 2851.0
ASV2 0.0 0.0 0.0 0.0 21102.0 16385.0 4777.0 17839.0
ASV3 0.0 0.0 0.0 0.0 19489.0 13748.0 8570.0 25646.0
ASV6 123.0 296.0 123.0 105.0 381.0 271.0 80.0 234.0
ASV18 194.0 557.0 257.0 224.0 56.0 53.0 18.0 123.0
ASV20 316.0 647.0 294.0 0.0 191.0 98.0 61.0 197.0
ASV21 173.0 327.0 168.0 142.0 441.0 411.0 179.0 388.0
ASV23 0.0 0.0 0.0 0.0 1235.0 989.0 383.0 1141.0
ASV25 343.0 838.0 427.0 255.0 0.0 0.0 0.0 0.0
ASV38 0.0 0.0 0.0 0.0 1147.0 726.0 0.0 771.0
It looks like ASV1 is very abundant in the first four samples (acetate-fed) and ASV2 and ASV3 in the last four (glucose-fed).
It is also possible to merge samples based on information the meta data. Let’s use merge_samples to combine samples based on the feed they received.
[8]:
obj_abund.merge_samples(by="feed", inplace=True)
print(obj_abund.tab)
feed acetate glucose
Feature
ASV1 53138.0 6079.0
ASV2 0.0 60103.0
ASV3 0.0 67453.0
ASV6 647.0 966.0
ASV18 1232.0 250.0
ASV20 1257.0 547.0
ASV21 810.0 1419.0
ASV23 0.0 3748.0
ASV25 1863.0 0.0
ASV38 0.0 2644.0
This shows the sum of the read counts for the acetate- and glucose fed anodes.
Rarefy
For amplicon sequencing data with uneven read counts per sample, it is good practice to rarefy the abundance table before diversity analysis.
[2]:
import qdiv
obj = qdiv.MicrobiomeData.load_example("Saheb-Alam_DADA2")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
In this example data, we can see that the read counts in the samples range from 9783 to 82777. We can rarefy the object to the lowest read count.
[3]:
obj.rarefy(inplace=True)
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 649 features x 16 samples
Total reads: 156528
Min reads/sample: 9783
Max reads/sample: 9783
Taxonomy table: 649 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 649 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
S4 anode acetate B
----------------------------------------
Now all samples have the same count: 9783 reads. The rarefy function automatically rarefies to the read count in the sample with the lowest count. It is possible to specify a read count with the depth parameter, and to ensure reproducibility with the seed parameter.