Load and save data

Load and create a MicrobiomeData object

[1]:
import qdiv
obj = qdiv.MicrobiomeData.load() #This creates an empty instance of the class MicrobiomeData. Here, the instance is saved in the variable name obj
obj.info() #This prints information about the content of the object. As we can see, below, it is empty.
MicrobiomeData object summary
----------------------------------------
Abundance table: None
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------

Let’s add some data to the object. We will use the Saheb-Alam2019_DADA2 example data. We specify the filename. We can also specify the folder in which the file is located using the path variable.

[2]:
obj.add_tab("Saheb-Alam2019_tab_dada2.tsv", path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------

Now, when calling obj.info(), we can see that an abundance table with 672 features and 16 samples has been added to the object. We continue and add sequences, taxonomic information, a phylogenetic tree, and meta data as well.

[3]:
obj.add_tax("Saheb-Alam2019_tax_dada2.tsv", sep="\t", path="../../../qdiv/example_data") #Note that I use the sep variable to specify the column separator used in the input file.
obj.add_seq_from_fasta("Saheb-Alam2019_seq_dada2.fasta", path="../../../qdiv/example_data")
obj.add_tree("Saheb-Alam2019_tree_dada2.nwk", path="../../../qdiv/example_data")
obj.add_meta("Saheb-Alam2019_meta.csv", sep=",", path="../../../qdiv/example_data")
obj.info(3) #Here is specify that I want to see 3 lines of the meta data in the printed output.
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
       location     feed mfc
sample
S4        anode  acetate   B
S5        anode  acetate   B
S6        anode  acetate   B
----------------------------------------

Now we can see that we have lot more data in the object. It would also have been possible to load all the data in one go with the MicrobiomeData.load command.

[4]:
obj = qdiv.MicrobiomeData.load(tab="Saheb-Alam2019_tab_dada2.tsv", tax="Saheb-Alam2019_tax_dada2.tsv", fasta="Saheb-Alam2019_seq_dada2.fasta", \
                               tree="Saheb-Alam2019_tree_dada2.nwk", meta="Saheb-Alam2019_meta.csv", path="../../../qdiv/example_data", \
                              tab_sep="\t", tax_sep="\t", meta_sep=",") #Note that I can specify different separators for different input table files.
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
       location     feed mfc
sample
S4        anode  acetate   B
----------------------------------------

All data is loaded into the object as pandas dataframes. We can have a look at the content of a specific data table like this.

[5]:
print(obj.tab.head()) #This shows the first few lines of the abundance table
                                    S4     S5    S6    S7   S10   S11    S12  \
Feature
0a4f6af6827b404e874581539d79f24c   0.0    0.0   0.0   0.0   0.0   0.0    7.0
0a8520a771e224611a2eef9f67e4aac9   0.0    0.0   0.0   0.0   0.0   0.0    0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f   0.0    0.0   0.0   0.0   0.0  27.0    0.0
0b6a7fb1e3b155264debe582497d69d2  47.0  111.0   0.0   0.0  94.0  60.0  323.0
0b085a4fec1a492a30b46aeeb05a66bc  42.0   86.0  39.0  18.0  10.0   0.0   51.0

                                    S13  S20  S21  S22  S23   S26   S27  S28  \
Feature
0a4f6af6827b404e874581539d79f24c    0.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0
0a8520a771e224611a2eef9f67e4aac9    0.0  0.0  0.0  0.0  0.0  16.0  26.0  0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f    0.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0
0b6a7fb1e3b155264debe582497d69d2  102.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0
0b085a4fec1a492a30b46aeeb05a66bc   21.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0

                                   S29
Feature
0a4f6af6827b404e874581539d79f24c   0.0
0a8520a771e224611a2eef9f67e4aac9   0.0
0a9495dc2a22f1eefcd9e4e4ebd2617f   0.0
0b6a7fb1e3b155264debe582497d69d2  29.0
0b085a4fec1a492a30b46aeeb05a66bc   0.0

The Saheb-Alam2019 data was generated from PCR amplification of the V4 regions of 16S rRNA gene. The fastq file were processed in qiime2 using dada2. The resulting abundance table and amplicon sequence variants were exported using qiime export and converted to a .tsv file and a fasta file. As you can see, the features have weird names. We can rename them using the rename_features function.

[6]:
obj.rename_features(name_type="ASV", inplace=True) #inplace specifies that the object is modified in place.
print(obj.tab.head())
              S4       S5       S6      S7     S10     S11      S12     S13  \
Feature
ASV1     11302.0  21046.0  11153.0  9637.0    45.0    50.0     97.0    84.0
ASV2         0.0      0.0      0.0     0.0     0.0     0.0     14.0     0.0
ASV3         0.0      0.0      0.0     0.0     0.0     0.0      0.0     0.0
ASV4        84.0    241.0     91.0     0.0  5375.0  8202.0  11419.0  3635.0
ASV5         0.0      0.0      0.0    44.0  1103.0   466.0   4092.0    97.0

             S20      S21     S22      S23     S26     S27     S28     S29
Feature
ASV1      1013.0   1693.0   522.0   2851.0     0.0     0.0     0.0   107.0
ASV2     21102.0  16385.0  4777.0  17839.0  1941.0  8230.0  3829.0  7037.0
ASV3     19489.0  13748.0  8570.0  25646.0     0.0   138.0     0.0    68.0
ASV4         0.0     36.0     0.0      0.0    73.0   177.0     0.0   134.0
ASV5        24.0      0.0     0.0     16.0   108.0  1034.0  1170.0  4598.0

Now the names look better, and they are sorted starting with the ASV having the highest mean relative abundance in the samples (ASV1).

Save files

Let’s say we have worked on the files, perhaps renamed features, subsetted samples, or rarefied the read counts. We want to save the resulting file.

[7]:
obj.save(savename="Fixed_files", path=".")
[7]:
['.\\Fixed_files_tab.csv',
 '.\\Fixed_files_tax.csv',
 '.\\Fixed_files_meta.csv',
 '.\\Fixed_files_seq.fa',
 '.\\Fixed_files_tree.nwk']

We can see a list of files saved. Note that the savename variable is used a prefix for all the output files. The path specifies the folder where the files are saved (“.” is the current folder).

Load other types of files

Relative abundance tables and taxonomy data can be generated in a variety of ways. There are some functions to load other types of data. For example, the following:

CoverM
Relative abundance in tab-separated files generated with CoverM can be loaded like this:
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_tab_from_coverm("Modin2025_CoverM.tsv", path="../../../qdiv/example_data")
obj.info(3)
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
                                       unmapped_reads_perc
all_mags_contigs.fa/v2114_R1_QF.fq.gz            34.778670
all_mags_contigs.fa/v2115_R1_QF.fq.gz            38.521270
all_mags_contigs.fa/v2116_R1_QF.fq.gz            29.976946
----------------------------------------

As you can see in the metadata preview, information about the unmapped reads has been placed in the metadata, and the sample names are quite ugly. We can recreate the object and extract the correct sample names using first_sep and second_sep variables.

[1]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_tab_from_coverm("Modin2025_CoverM.tsv", path="../../../qdiv/example_data", first_sep="fa/", second_sep="_R1")
obj.info(3)
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: None
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
       unmapped_reads_perc
v2114            34.778670
v2115            38.521270
v2116            29.976946
----------------------------------------
GTDB-Tk
We also want to add taxonomic information, which in this case was generated with GTDB-Tk. We have two input files one for archaea and one for bacteria, they can be added like this:
[2]:
obj.add_tax_from_gtdbtk(["Modin2025_gtdbtk.ar53.summary.tsv", "Modin2025_gtdbtk.bac120.summary.tsv"], path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1761.7865439559
Min reads/sample: 61.478759796599995
Max reads/sample: 73.8195703024
Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: None
Metadata table: 26 samples, columns: ['unmapped_reads_perc']
Metadata preview:
       unmapped_reads_perc
v2114             34.77867
----------------------------------------

We can have look at the taxonomy table:

[3]:
print(obj.tax.head())
           Domain            Phylum           Class                Order  \
Feature
KM1      Bacteria   Planctomycetota  Planctomycetia         Pirellulales
Km2      Bacteria  Actinobacteriota   Actinomycetia  Propionibacteriales
Km3      Bacteria  Desulfobacterota      Syntrophia         Syntrophales
KM4      Bacteria     Chloroflexota    Anaerolineae              UBA1429
Km5      Bacteria  Actinobacteriota   Actinomycetia      Actinomycetales

                       Family             Genus                       Species
Feature
KM1           Thermoguttaceae       Thermogutta
Km2      Propionibacteriaceae  Brevilactibacter
Km3             Smithellaceae           UBA4810           UBA4810 sp002418825
KM4                   UBA1429           UBA6128
Km5        Bifidobacteriaceae   Bifidobacterium  Bifidobacterium tibiigranuli
tax_prefix
We can add prefix to the taxonomic names, to indicate the level. That can be done with the tax_prefix function (which also can remove the prefix if the add variable is False).
[4]:
obj.tax_prefix(add=True, inplace=True)
print(obj.tax.head())
              Domain               Phylum              Class  \
Feature
KM1      d__Bacteria   p__Planctomycetota  c__Planctomycetia
Km2      d__Bacteria  p__Actinobacteriota   c__Actinomycetia
Km3      d__Bacteria  p__Desulfobacterota      c__Syntrophia
KM4      d__Bacteria     p__Chloroflexota    c__Anaerolineae
Km5      d__Bacteria  p__Actinobacteriota   c__Actinomycetia

                          Order                   Family                Genus  \
Feature
KM1             o__Pirellulales       f__Thermoguttaceae       g__Thermogutta
Km2      o__Propionibacteriales  f__Propionibacteriaceae  g__Brevilactibacter
Km3             o__Syntrophales         f__Smithellaceae           g__UBA4810
KM4                  o__UBA1429               f__UBA1429           g__UBA6128
Km5          o__Actinomycetales    f__Bifidobacteriaceae   g__Bifidobacterium

                                 Species
Feature
KM1
Km2
Km3               s__UBA4810 sp002418825
KM4
Km5      s__Bifidobacterium tibiigranuli
SingleM
SingleM can be used to generate OTU tables from metagenomic reads. SingleM also a convertToEBD.py function to convert the OTU table to a tab-separated file with taxonomic information and read counts. That ebd file can be loaded using the add_ebd_tab_from_singlem function.
[6]:
import qdiv
obj = qdiv.MicrobiomeData.load()
obj.add_ebd_tab_from_singlem("Modin2025_SingleM.ebd", path="../../../qdiv/example_data")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 1137 features x 7 samples
Total reads: 5640.0
Min reads/sample: 630.0
Max reads/sample: 960.0
Taxonomy table: 1137 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: None
Metadata table: None
----------------------------------------

Let’s have a closer look at the count table.

[7]:
print(obj.tab.head())
         v2126_1  v2136_1  v2145_1  v2151_1  v2204_1  v2117_1  v2114_1
Feature
OTU1         0.0      0.0      0.0      0.0      0.0      0.0      1.0
OTU2         0.0      0.0      0.0      0.0      0.0      1.0      1.0
OTU3         3.0      0.0      0.0      0.0      0.0      1.0      1.0
OTU4         0.0      0.0      0.0      0.0      0.0      2.0      2.0
OTU5         0.0      0.0      0.0      0.0      0.0      0.0      2.0

We can see that the sample names all contain a 1. To remove that part of the sample names, we can specify second_sep=”” when we load the file. Then, everyhting after the “_” in the names will be dropped.

[8]:
obj = qdiv.MicrobiomeData.load()
obj.add_ebd_tab_from_singlem("Modin2025_SingleM.ebd", path="../../../qdiv/example_data", second_sep="_")
print(obj.tab.head())
         v2126  v2136  v2145  v2151  v2204  v2117  v2114
Feature
OTU1       0.0    0.0    0.0    0.0    0.0    0.0    1.0
OTU2       0.0    0.0    0.0    0.0    0.0    1.0    1.0
OTU3       3.0    0.0    0.0    0.0    0.0    1.0    1.0
OTU4       0.0    0.0    0.0    0.0    0.0    2.0    2.0
OTU5       0.0    0.0    0.0    0.0    0.0    0.0    2.0