Load and save data
Load and create a MicrobiomeData object
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load() #This creates an empty instance of the class MicrobiomeData. Here, the instance is saved in the variable name obj
obj.info() #This prints information about the content of the object. As we can see, below, it is empty.
MicrobiomeData object summary
----------------------------------------
Abundance table: None
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------
Let’s add some data to the object. We will use the Saheb-Alam2019_DADA2 example data. We specify the filename. We can also specify the folder in which the file is located using the path variable.
[2]:
obj.add_tab("Saheb-Alam2019_tab_dada2.tsv", path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------
Now, when calling obj.info(), we can see that an abundance table with 672 features and 16 samples has been added to the object. We continue and add sequences, taxonomic information, a phylogenetic tree, and meta data as well.
[3]:
obj.add_tax("Saheb-Alam2019_tax_dada2.tsv", sep="\t", path="../../../qdiv/example_data") #Note that I use the sep variable to specify the column separator used in the input file.
obj.add_seq_from_fasta("Saheb-Alam2019_seq_dada2.fasta", path="../../../qdiv/example_data")
obj.add_tree("Saheb-Alam2019_tree_dada2.nwk", path="../../../qdiv/example_data")
obj.add_meta("Saheb-Alam2019_meta.csv", sep=",", path="../../../qdiv/example_data")
obj.info(3) #Here is specify that I want to see 3 lines of the meta data in the printed output.
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
S5 anode acetate B
S6 anode acetate B
----------------------------------------
Now we can see that we have lot more data in the object. It would also have been possible to load all the data in one go with the MicrobiomeData.load command.
[4]:
obj = qdiv.MicrobiomeData.load(tab="Saheb-Alam2019_tab_dada2.tsv", tax="Saheb-Alam2019_tax_dada2.tsv", fasta="Saheb-Alam2019_seq_dada2.fasta", \
tree="Saheb-Alam2019_tree_dada2.nwk", meta="Saheb-Alam2019_meta.csv", path="../../../qdiv/example_data", \
tab_sep="\t", tax_sep="\t", meta_sep=",") #Note that I can specify different separators for different input table files.
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
All data is loaded into the object as pandas dataframes. We can have a look at the content of a specific data table like this.
[5]:
print(obj.tab.head()) #This shows the first few lines of the abundance table
S4 S5 S6 S7 S10 S11 S12 \
Feature
0a4f6af6827b404e874581539d79f24c 0.0 0.0 0.0 0.0 0.0 0.0 7.0
0a8520a771e224611a2eef9f67e4aac9 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0 0.0 0.0 0.0 0.0 27.0 0.0
0b6a7fb1e3b155264debe582497d69d2 47.0 111.0 0.0 0.0 94.0 60.0 323.0
0b085a4fec1a492a30b46aeeb05a66bc 42.0 86.0 39.0 18.0 10.0 0.0 51.0
S13 S20 S21 S22 S23 S26 S27 S28 \
Feature
0a4f6af6827b404e874581539d79f24c 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0a8520a771e224611a2eef9f67e4aac9 0.0 0.0 0.0 0.0 0.0 16.0 26.0 0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0b6a7fb1e3b155264debe582497d69d2 102.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0b085a4fec1a492a30b46aeeb05a66bc 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S29
Feature
0a4f6af6827b404e874581539d79f24c 0.0
0a8520a771e224611a2eef9f67e4aac9 0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0
0b6a7fb1e3b155264debe582497d69d2 29.0
0b085a4fec1a492a30b46aeeb05a66bc 0.0
The Saheb-Alam2019 data was generated from PCR amplification of the V4 regions of 16S rRNA gene. The fastq file were processed in qiime2 using dada2. The resulting abundance table and amplicon sequence variants were exported using qiime export and converted to a .tsv file and a fasta file. As you can see, the features have weird names. We can rename them using the rename_features function.
[6]:
obj.rename_features(name_type="ASV", inplace=True) #inplace specifies that the object is modified in place.
print(obj.tab.head())
S4 S5 S6 S7 S10 S11 S12 S13 \
Feature
ASV1 11302.0 21046.0 11153.0 9637.0 45.0 50.0 97.0 84.0
ASV2 0.0 0.0 0.0 0.0 0.0 0.0 14.0 0.0
ASV3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ASV4 84.0 241.0 91.0 0.0 5375.0 8202.0 11419.0 3635.0
ASV5 0.0 0.0 0.0 44.0 1103.0 466.0 4092.0 97.0
S20 S21 S22 S23 S26 S27 S28 S29
Feature
ASV1 1013.0 1693.0 522.0 2851.0 0.0 0.0 0.0 107.0
ASV2 21102.0 16385.0 4777.0 17839.0 1941.0 8230.0 3829.0 7037.0
ASV3 19489.0 13748.0 8570.0 25646.0 0.0 138.0 0.0 68.0
ASV4 0.0 36.0 0.0 0.0 73.0 177.0 0.0 134.0
ASV5 24.0 0.0 0.0 16.0 108.0 1034.0 1170.0 4598.0
Now the names look better, and they are sorted starting with the ASV having the highest mean relative abundance in the samples (ASV1).
Save files
Let’s say we have worked on the files, perhaps renamed features, subsetted samples, or rarefied the read counts. We want to save the resulting file.
[7]:
obj.save(savename="Fixed_files", path=".")
[7]:
['.\\Fixed_files_tab.csv',
'.\\Fixed_files_tax.csv',
'.\\Fixed_files_meta.csv',
'.\\Fixed_files_seq.fa',
'.\\Fixed_files_tree.nwk']
We can see a list of files saved. Note that the savename variable is used a prefix for all the output files. The path specifies the folder where the files are saved (“.” is the current folder).
Load other types of files
Relative abundance tables and taxonomy data can be generated in a variety of ways. There are some functions to load other types of data. For example, the following:
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_tab_from_coverm("Modin2025_CoverM.tsv", path="../../../qdiv/example_data")
obj.info(3)
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
unmapped_reads_perc
all_mags_contigs.fa/v2114_R1_QF.fq.gz 34.778670
all_mags_contigs.fa/v2115_R1_QF.fq.gz 38.521270
all_mags_contigs.fa/v2116_R1_QF.fq.gz 29.976946
----------------------------------------
As you can see in the metadata preview, information about the unmapped reads has been placed in the metadata, and the sample names are quite ugly. We can recreate the object and extract the correct sample names using first_sep and second_sep variables.
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_tab_from_coverm("Modin2025_CoverM.tsv", path="../../../qdiv/example_data", first_sep="fa/", second_sep="_R1")
obj.info(3)
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
unmapped_reads_perc
v2114 34.778670
v2115 38.521270
v2116 29.976946
----------------------------------------
[2]:
obj.add_tax_from_gtdbtk(["Modin2025_gtdbtk.ar53.summary.tsv", "Modin2025_gtdbtk.bac120.summary.tsv"], path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
unmapped_reads_perc
v2114 34.77867
----------------------------------------
We can have look at the taxonomy table:
[3]:
print(obj.tax.head())
Domain Phylum Class Order \
Feature
KM1 Bacteria Planctomycetota Planctomycetia Pirellulales
Km2 Bacteria Actinobacteriota Actinomycetia Propionibacteriales
Km3 Bacteria Desulfobacterota Syntrophia Syntrophales
KM4 Bacteria Chloroflexota Anaerolineae UBA1429
Km5 Bacteria Actinobacteriota Actinomycetia Actinomycetales
Family Genus Species
Feature
KM1 Thermoguttaceae Thermogutta
Km2 Propionibacteriaceae Brevilactibacter
Km3 Smithellaceae UBA4810 UBA4810 sp002418825
KM4 UBA1429 UBA6128
Km5 Bifidobacteriaceae Bifidobacterium Bifidobacterium tibiigranuli
[4]:
obj.tax_prefix(add=True, inplace=True)
print(obj.tax.head())
Domain Phylum Class \
Feature
KM1 d__Bacteria p__Planctomycetota c__Planctomycetia
Km2 d__Bacteria p__Actinobacteriota c__Actinomycetia
Km3 d__Bacteria p__Desulfobacterota c__Syntrophia
KM4 d__Bacteria p__Chloroflexota c__Anaerolineae
Km5 d__Bacteria p__Actinobacteriota c__Actinomycetia
Order Family Genus \
Feature
KM1 o__Pirellulales f__Thermoguttaceae g__Thermogutta
Km2 o__Propionibacteriales f__Propionibacteriaceae g__Brevilactibacter
Km3 o__Syntrophales f__Smithellaceae g__UBA4810
KM4 o__UBA1429 f__UBA1429 g__UBA6128
Km5 o__Actinomycetales f__Bifidobacteriaceae g__Bifidobacterium
Species
Feature
KM1
Km2
Km3 s__UBA4810 sp002418825
KM4
Km5 s__Bifidobacterium tibiigranuli
[6]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_ebd_tab_from_singlem("Modin2025_SingleM.ebd", path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 1137 features x 7 samples
Total reads: 5640.0
Min reads/sample: 630.0
Max reads/sample: 960.0
Taxonomy table: 1137 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------
Let’s have a closer look at the count table.
[7]:
print(obj.tab.head())
v2126_1 v2136_1 v2145_1 v2151_1 v2204_1 v2117_1 v2114_1
Feature
OTU1 0.0 0.0 0.0 0.0 0.0 0.0 1.0
OTU2 0.0 0.0 0.0 0.0 0.0 1.0 1.0
OTU3 3.0 0.0 0.0 0.0 0.0 1.0 1.0
OTU4 0.0 0.0 0.0 0.0 0.0 2.0 2.0
OTU5 0.0 0.0 0.0 0.0 0.0 0.0 2.0
We can see that the sample names all contain a 1. To remove that part of the sample names, we can specify second_sep=”” when we load the file. Then, everyhting after the “_” in the names will be dropped.
[8]:
obj = qdiv.MicrobiomeData.load()
obj.add_ebd_tab_from_singlem("Modin2025_SingleM.ebd", path="../../../qdiv/example_data", second_sep="_")
print(obj.tab.head())
v2126 v2136 v2145 v2151 v2204 v2117 v2114
Feature
OTU1 0.0 0.0 0.0 0.0 0.0 0.0 1.0
OTU2 0.0 0.0 0.0 0.0 0.0 1.0 1.0
OTU3 3.0 0.0 0.0 0.0 0.0 1.0 1.0
OTU4 0.0 0.0 0.0 0.0 0.0 2.0 2.0
OTU5 0.0 0.0 0.0 0.0 0.0 0.0 2.0