{ "cells": [ { "cell_type": "markdown", "id": "e339c625-6f80-41f8-8d32-7aff261064cc", "metadata": {}, "source": [ "# Load and save data" ] }, { "cell_type": "markdown", "id": "afcd3095-9822-4936-a7ae-04daf54768c1", "metadata": {}, "source": [ "## Load and create a MicrobiomeData object" ] }, { "cell_type": "code", "execution_count": 1, "id": "66cd8a0e-bc7c-44f7-9bea-876ea21dc357", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: None\n", "Taxonomy table: None\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: None\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load() #This creates an empty instance of the class MicrobiomeData. Here, the instance is saved in the variable name obj\n", "obj.info() #This prints information about the content of the object. As we can see, below, it is empty." ] }, { "cell_type": "markdown", "id": "ed3d45e4-6a9d-4eb1-b37d-a69b0c7d9e2a", "metadata": {}, "source": [ "Let's add some data to the object. We will use the *Saheb-Alam2019_DADA2* example data. We specify the filename. We can also specify the folder in which the file is located using the *path* variable." ] }, { "cell_type": "code", "execution_count": 2, "id": "069de085-0a56-42e7-8c9c-e091f4e5c2b1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 672 features x 16 samples\n", "Total reads: 504186.0\n", "Min reads/sample: 9783.0\n", "Max reads/sample: 82777.0\n", "Taxonomy table: None\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: None\n", "----------------------------------------\n" ] } ], "source": [ "obj.add_tab(\"Saheb-Alam2019_tab_dada2.tsv\", path=\"../../../qdiv/example_data\")\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "b20db437-8965-4f27-a9ba-bb275076d7d5", "metadata": {}, "source": [ "Now, when calling obj.info(), we can see that an abundance table with 672 features and 16 samples has been added to the object.\n", "We continue and add sequences, taxonomic information, a phylogenetic tree, and meta data as well." ] }, { "cell_type": "code", "execution_count": 3, "id": "7081bed6-5d89-4840-8186-e8028b6765ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 672 features x 16 samples\n", "Total reads: 504186.0\n", "Min reads/sample: 9783.0\n", "Max reads/sample: 82777.0\n", "Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']\n", "Sequence table: 672 features\n", "Tree: 1342 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "S5 anode acetate B\n", "S6 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "obj.add_tax(\"Saheb-Alam2019_tax_dada2.tsv\", sep=\"\\t\", path=\"../../../qdiv/example_data\") #Note that I use the sep variable to specify the column separator used in the input file.\n", "obj.add_seq_from_fasta(\"Saheb-Alam2019_seq_dada2.fasta\", path=\"../../../qdiv/example_data\")\n", "obj.add_tree(\"Saheb-Alam2019_tree_dada2.nwk\", path=\"../../../qdiv/example_data\")\n", "obj.add_meta(\"Saheb-Alam2019_meta.csv\", sep=\",\", path=\"../../../qdiv/example_data\")\n", "obj.info(3) #Here is specify that I want to see 3 lines of the meta data in the printed output." ] }, { "cell_type": "markdown", "id": "380dd255-d2a3-4bb4-895f-801e7b524ec6", "metadata": {}, "source": [ "Now we can see that we have lot more data in the object. It would also have been possible to load all the data in one go with the **MicrobiomeData.load** command." ] }, { "cell_type": "code", "execution_count": 4, "id": "4c619430-65f2-495a-8945-d34af0dab961", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 672 features x 16 samples\n", "Total reads: 504186.0\n", "Min reads/sample: 9783.0\n", "Max reads/sample: 82777.0\n", "Taxonomy table: 672 features, levels: ['Taxon', 'Confidence']\n", "Sequence table: 672 features\n", "Tree: 1342 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "obj = qdiv.MicrobiomeData.load(tab=\"Saheb-Alam2019_tab_dada2.tsv\", tax=\"Saheb-Alam2019_tax_dada2.tsv\", fasta=\"Saheb-Alam2019_seq_dada2.fasta\", \\\n", " tree=\"Saheb-Alam2019_tree_dada2.nwk\", meta=\"Saheb-Alam2019_meta.csv\", path=\"../../../qdiv/example_data\", \\\n", " tab_sep=\"\\t\", tax_sep=\"\\t\", meta_sep=\",\") #Note that I can specify different separators for different input table files.\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "c417e0e9-3d4e-4a5a-8b53-2e5844fde55f", "metadata": {}, "source": [ "All data is loaded into the object as pandas dataframes. We can have a look at the content of a specific data table like this." ] }, { "cell_type": "code", "execution_count": 5, "id": "8ced22ee-70ad-458f-9f1e-a28f43d88c3b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S4 S5 S6 S7 S10 S11 S12 \\\n", "Feature \n", "0a4f6af6827b404e874581539d79f24c 0.0 0.0 0.0 0.0 0.0 0.0 7.0 \n", "0a8520a771e224611a2eef9f67e4aac9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0 0.0 0.0 0.0 0.0 27.0 0.0 \n", "0b6a7fb1e3b155264debe582497d69d2 47.0 111.0 0.0 0.0 94.0 60.0 323.0 \n", "0b085a4fec1a492a30b46aeeb05a66bc 42.0 86.0 39.0 18.0 10.0 0.0 51.0 \n", "\n", " S13 S20 S21 S22 S23 S26 S27 S28 \\\n", "Feature \n", "0a4f6af6827b404e874581539d79f24c 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "0a8520a771e224611a2eef9f67e4aac9 0.0 0.0 0.0 0.0 0.0 16.0 26.0 0.0 \n", "0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "0b6a7fb1e3b155264debe582497d69d2 102.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "0b085a4fec1a492a30b46aeeb05a66bc 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " S29 \n", "Feature \n", "0a4f6af6827b404e874581539d79f24c 0.0 \n", "0a8520a771e224611a2eef9f67e4aac9 0.0 \n", "0a9495dc2a22f1eefcd9e4e4ebd2617f 0.0 \n", "0b6a7fb1e3b155264debe582497d69d2 29.0 \n", "0b085a4fec1a492a30b46aeeb05a66bc 0.0 \n" ] } ], "source": [ "print(obj.tab.head()) #This shows the first few lines of the abundance table" ] }, { "cell_type": "markdown", "id": "4a801b05-dbe0-4c19-9861-1981fefb5450", "metadata": {}, "source": [ "The Saheb-Alam2019 data was generated from PCR amplification of the V4 regions of 16S rRNA gene. The fastq file were processed in qiime2 using dada2.\n", "The resulting abundance table and amplicon sequence variants were exported using *qiime export* and converted to a .tsv file and a fasta file.\n", "As you can see, the features have weird names. We can rename them using the **rename_features** function." ] }, { "cell_type": "code", "execution_count": 6, "id": "27fd7fd2-406b-4883-9b22-1fe6cb2dbb3b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S4 S5 S6 S7 S10 S11 S12 S13 \\\n", "Feature \n", "ASV1 11302.0 21046.0 11153.0 9637.0 45.0 50.0 97.0 84.0 \n", "ASV2 0.0 0.0 0.0 0.0 0.0 0.0 14.0 0.0 \n", "ASV3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "ASV4 84.0 241.0 91.0 0.0 5375.0 8202.0 11419.0 3635.0 \n", "ASV5 0.0 0.0 0.0 44.0 1103.0 466.0 4092.0 97.0 \n", "\n", " S20 S21 S22 S23 S26 S27 S28 S29 \n", "Feature \n", "ASV1 1013.0 1693.0 522.0 2851.0 0.0 0.0 0.0 107.0 \n", "ASV2 21102.0 16385.0 4777.0 17839.0 1941.0 8230.0 3829.0 7037.0 \n", "ASV3 19489.0 13748.0 8570.0 25646.0 0.0 138.0 0.0 68.0 \n", "ASV4 0.0 36.0 0.0 0.0 73.0 177.0 0.0 134.0 \n", "ASV5 24.0 0.0 0.0 16.0 108.0 1034.0 1170.0 4598.0 \n" ] } ], "source": [ "obj.rename_features(name_type=\"ASV\", inplace=True) #inplace specifies that the object is modified in place.\n", "print(obj.tab.head())" ] }, { "cell_type": "markdown", "id": "3a06475d-9f2d-4133-a611-c3cda275c738", "metadata": {}, "source": [ "Now the names look better, and they are sorted starting with the ASV having the highest mean relative abundance in the samples (ASV1)." ] }, { "cell_type": "markdown", "id": "207b3aa2-ad89-4468-ae1e-c9dfa33f8be6", "metadata": {}, "source": [ "## Save files\n", "Let's say we have worked on the files, perhaps renamed features, subsetted samples, or rarefied the read counts. We want to save the resulting file." ] }, { "cell_type": "code", "execution_count": 7, "id": "01e45e73-4392-4b13-8b50-7e9e27226e85", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['.\\\\Fixed_files_tab.csv',\n", " '.\\\\Fixed_files_tax.csv',\n", " '.\\\\Fixed_files_meta.csv',\n", " '.\\\\Fixed_files_seq.fa',\n", " '.\\\\Fixed_files_tree.nwk']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.save(savename=\"Fixed_files\", path=\".\")" ] }, { "cell_type": "markdown", "id": "a76facd0-7262-468b-895c-b9574c1e6076", "metadata": {}, "source": [ "We can see a list of files saved. Note that the *savename* variable is used a prefix for all the output files. \n", "The path specifies the folder where the files are saved (\".\" is the current folder)." ] }, { "cell_type": "markdown", "id": "b7b80b27-95d5-4217-913d-335379f135e5", "metadata": {}, "source": [ "## Load other types of files\n", "Relative abundance tables and taxonomy data can be generated in a variety of ways. \n", "There are some functions to load other types of data. For example, the following:" ] }, { "cell_type": "markdown", "id": "297a99b9-34ed-4efc-b3a8-86f7ad4ccd6d", "metadata": {}, "source": [ "**CoverM** \n", "Relative abundance in tab-separated files generated with CoverM can be loaded like this:" ] }, { "cell_type": "code", "execution_count": 1, "id": "4dff2eb7-1714-49bc-9745-565c6a17caaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 304 features x 26 samples\n", "Total reads: 1761.7865439559\n", "Min reads/sample: 61.478759796599995\n", "Max reads/sample: 73.8195703024\n", "Taxonomy table: None\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: 26 samples, columns: ['unmapped_reads_perc']\n", "Metadata preview:\n", " unmapped_reads_perc\n", "all_mags_contigs.fa/v2114_R1_QF.fq.gz 34.778670\n", "all_mags_contigs.fa/v2115_R1_QF.fq.gz 38.521270\n", "all_mags_contigs.fa/v2116_R1_QF.fq.gz 29.976946\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load()\n", "obj.add_tab_from_coverm(\"Modin2025_CoverM.tsv\", path=\"../../../qdiv/example_data\")\n", "obj.info(3)" ] }, { "cell_type": "markdown", "id": "bdc74762-d834-4d5e-9cbd-8a8dcb8f9795", "metadata": {}, "source": [ "As you can see in the metadata preview, information about the unmapped reads has been placed in the metadata, and the sample names are quite ugly. \n", "We can recreate the object and extract the correct sample names using *first_sep* and *second_sep* variables. " ] }, { "cell_type": "code", "execution_count": 1, "id": "22da1e4f-b9e6-446f-b96b-4c6c16327c3a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 304 features x 26 samples\n", "Total reads: 1761.7865439559\n", "Min reads/sample: 61.478759796599995\n", "Max reads/sample: 73.8195703024\n", "Taxonomy table: None\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: 26 samples, columns: ['unmapped_reads_perc']\n", "Metadata preview:\n", " unmapped_reads_perc\n", "v2114 34.778670\n", "v2115 38.521270\n", "v2116 29.976946\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load()\n", "obj.add_tab_from_coverm(\"Modin2025_CoverM.tsv\", path=\"../../../qdiv/example_data\", first_sep=\"fa/\", second_sep=\"_R1\")\n", "obj.info(3)" ] }, { "cell_type": "markdown", "id": "28782b50-2a00-4c2b-923a-f6c98ea9d0b9", "metadata": {}, "source": [ "**GTDB-Tk** \n", "We also want to add taxonomic information, which in this case was generated with GTDB-Tk. \n", "We have two input files one for archaea and one for bacteria, they can be added like this:" ] }, { "cell_type": "code", "execution_count": 2, "id": "1a51f0c9-c909-440c-9b96-ba74980a5c65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 304 features x 26 samples\n", "Total reads: 1761.7865439559\n", "Min reads/sample: 61.478759796599995\n", "Max reads/sample: 73.8195703024\n", "Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: 26 samples, columns: ['unmapped_reads_perc']\n", "Metadata preview:\n", " unmapped_reads_perc\n", "v2114 34.77867\n", "----------------------------------------\n" ] } ], "source": [ "obj.add_tax_from_gtdbtk([\"Modin2025_gtdbtk.ar53.summary.tsv\", \"Modin2025_gtdbtk.bac120.summary.tsv\"], path=\"../../../qdiv/example_data\")\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "18cb0c74-f4a2-4a52-8587-853f1319dd9b", "metadata": {}, "source": [ "We can have look at the taxonomy table:" ] }, { "cell_type": "code", "execution_count": 3, "id": "42bb069f-af49-4ad6-8451-601ee91f7e20", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Domain Phylum Class Order \\\n", "Feature \n", "KM1 Bacteria Planctomycetota Planctomycetia Pirellulales \n", "Km2 Bacteria Actinobacteriota Actinomycetia Propionibacteriales \n", "Km3 Bacteria Desulfobacterota Syntrophia Syntrophales \n", "KM4 Bacteria Chloroflexota Anaerolineae UBA1429 \n", "Km5 Bacteria Actinobacteriota Actinomycetia Actinomycetales \n", "\n", " Family Genus Species \n", "Feature \n", "KM1 Thermoguttaceae Thermogutta \n", "Km2 Propionibacteriaceae Brevilactibacter \n", "Km3 Smithellaceae UBA4810 UBA4810 sp002418825 \n", "KM4 UBA1429 UBA6128 \n", "Km5 Bifidobacteriaceae Bifidobacterium Bifidobacterium tibiigranuli \n" ] } ], "source": [ "print(obj.tax.head())" ] }, { "cell_type": "markdown", "id": "18b34e6e-632d-409f-9a81-d74692d38c9c", "metadata": {}, "source": [ "**tax_prefix** \n", "We can add prefix to the taxonomic names, to indicate the level. That can be done with the **tax_prefix** function\n", "(which also can remove the prefix if the *add* variable is False)." ] }, { "cell_type": "code", "execution_count": 4, "id": "bc6ac620-1c48-498e-a82b-2fe00b48fc27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Domain Phylum Class \\\n", "Feature \n", "KM1 d__Bacteria p__Planctomycetota c__Planctomycetia \n", "Km2 d__Bacteria p__Actinobacteriota c__Actinomycetia \n", "Km3 d__Bacteria p__Desulfobacterota c__Syntrophia \n", "KM4 d__Bacteria p__Chloroflexota c__Anaerolineae \n", "Km5 d__Bacteria p__Actinobacteriota c__Actinomycetia \n", "\n", " Order Family Genus \\\n", "Feature \n", "KM1 o__Pirellulales f__Thermoguttaceae g__Thermogutta \n", "Km2 o__Propionibacteriales f__Propionibacteriaceae g__Brevilactibacter \n", "Km3 o__Syntrophales f__Smithellaceae g__UBA4810 \n", "KM4 o__UBA1429 f__UBA1429 g__UBA6128 \n", "Km5 o__Actinomycetales f__Bifidobacteriaceae g__Bifidobacterium \n", "\n", " Species \n", "Feature \n", "KM1 \n", "Km2 \n", "Km3 s__UBA4810 sp002418825 \n", "KM4 \n", "Km5 s__Bifidobacterium tibiigranuli \n" ] } ], "source": [ "obj.tax_prefix(add=True, inplace=True)\n", "print(obj.tax.head())" ] }, { "cell_type": "markdown", "id": "86b2cfb2-1779-464d-b202-f16118298017", "metadata": {}, "source": [ "**SingleM** \n", "SingleM can be used to generate OTU tables from metagenomic reads. SingleM also a *convertToEBD.py* function to convert the OTU table to a tab-separated file with taxonomic information and read counts. That ebd file can be loaded using the **add_ebd_tab_from_singlem** function." ] }, { "cell_type": "code", "execution_count": 6, "id": "4e712a50-56db-4b65-b338-df88696a70da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 1137 features x 7 samples\n", "Total reads: 5640.0\n", "Min reads/sample: 630.0\n", "Max reads/sample: 960.0\n", "Taxonomy table: 1137 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: None\n", "Tree: None\n", "Metadata table: None\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load()\n", "obj.add_ebd_tab_from_singlem(\"Modin2025_SingleM.ebd\", path=\"../../../qdiv/example_data\")\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "b10fe377-64b1-4948-9dbf-aea299571afd", "metadata": {}, "source": [ "Let's have a closer look at the count table." ] }, { "cell_type": "code", "execution_count": 7, "id": "179f5206-8f14-47ac-9135-1d1a922214ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " v2126_1 v2136_1 v2145_1 v2151_1 v2204_1 v2117_1 v2114_1\n", "Feature \n", "OTU1 0.0 0.0 0.0 0.0 0.0 0.0 1.0\n", "OTU2 0.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "OTU3 3.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "OTU4 0.0 0.0 0.0 0.0 0.0 2.0 2.0\n", "OTU5 0.0 0.0 0.0 0.0 0.0 0.0 2.0\n" ] } ], "source": [ "print(obj.tab.head())" ] }, { "cell_type": "markdown", "id": "93237de5-0b6c-4961-ac4e-e25b5eb6a15f", "metadata": {}, "source": [ "We can see that the sample names all contain a _1. To remove that part of the sample names, we can specify *second_sep*=\"_\" when we load the file. Then, everyhting after the \"_\" in the names will be dropped." ] }, { "cell_type": "code", "execution_count": 8, "id": "2600dc0a-2ccd-4434-9981-5443d99615ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " v2126 v2136 v2145 v2151 v2204 v2117 v2114\n", "Feature \n", "OTU1 0.0 0.0 0.0 0.0 0.0 0.0 1.0\n", "OTU2 0.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "OTU3 3.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "OTU4 0.0 0.0 0.0 0.0 0.0 2.0 2.0\n", "OTU5 0.0 0.0 0.0 0.0 0.0 0.0 2.0\n" ] } ], "source": [ "obj = qdiv.MicrobiomeData.load()\n", "obj.add_ebd_tab_from_singlem(\"Modin2025_SingleM.ebd\", path=\"../../../qdiv/example_data\", second_sep=\"_\")\n", "print(obj.tab.head())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }