{ "cells": [ { "cell_type": "markdown", "id": "ce17e1b5-9f97-4f2c-926e-eab55eebb77f", "metadata": {}, "source": [ "# Data management" ] }, { "cell_type": "markdown", "id": "68d71fee-ebfd-4898-ba14-be707b302f17", "metadata": {}, "source": [ "## Finding consensus ASVs" ] }, { "cell_type": "markdown", "id": "7ca22531-fc38-44c8-b913-592fc7367f7d", "metadata": {}, "source": [ "Say we have used several softwares to generate amplicon sequence variants (ASVs) from a sequencing dataset.\n", "We can generate a set of consensus ASVs that were detected by all methods and trim our abundance table to those ASVs.\n", "We load the Saheb-Alam2019 that was processed using both DADA2 and Deblur." ] }, { "cell_type": "code", "execution_count": 2, "id": "64735e0d-4659-455b-9b67-e631efd0bb19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 672 features x 16 samples\n", "Total reads: 504186.0\n", "Min reads/sample: 9783.0\n", "Max reads/sample: 82777.0\n", "Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 672 features\n", "Tree: 1342 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n", "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 952 features x 16 samples\n", "Total reads: 572710.0\n", "Min reads/sample: 11347.0\n", "Max reads/sample: 88846.0\n", "Taxonomy table: 952 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 952 features\n", "Tree: 1901 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj1 = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_2019_DADA2\")\n", "obj2 = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_2019_Deblur\")\n", "obj1.info()\n", "obj2.info()" ] }, { "cell_type": "markdown", "id": "efeac0ec-8548-4109-a844-afaec9419fd8", "metadata": {}, "source": [ "We see that the DADA2 data has 672 ASVs (features) while the Deblur data has 952. Now we will find a consensus data set, which retains ASVs detected by both methods. Let's call the **sequences.consensus** function." ] }, { "cell_type": "code", "execution_count": 3, "id": "4e0db438-0516-4dc9-a98b-0d21570ca1de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running consensus...\n", "Aligning features in 2 objects: 1.. 2.. \n", "Changing feature names in 2 objects: 1.. 2.. \n", "Done with align\n", "Done with consensus.\n" ] } ], "source": [ "consensus_obj, consensus_info = qdiv.sequences.consensus([obj1, obj2], different_lengths=True, keep_cutoff=0.2, name_type=\"ASV\", keep_object=0) " ] }, { "cell_type": "markdown", "id": "d37bc27e-872b-4cae-89de-c9adea1b49fa", "metadata": {}, "source": [ "Note that we used the *different_lengths*=True. Deblur sets a specific length for all ASVs while DADA2 allows different lengths. This means that a 253 bp long ASV detected by DADA2 may be truncated to 250 bp when detected by Deblur. Setting *different_lengths* to True means that these two sequences will be treated as the same ASV if the shorter one can be found within the longer one. \n", "We also set *keep_cutoff* to 0.2. This means that all ASVs having a relative abundance of at least 0.2% in a sample will be retained, regardless if they are a consensus ASV. Setting *name_type* to \"ASV\" means that will be used as prefix for the renamed features.\n", "Setting *keep_object*=0, means that we will retain the first object in the input list (i.e. obj1) as the consensus object. \n", "What the line of code essentially is saying is: Take the DADA2 object (obj1); keep all ASVs with a relative abundance >0.2% in a sample; among the ASVs with lower relative abundance, keep those it has in common with the Deblur object (obj2). \n", "Let's look at the results:" ] }, { "cell_type": "code", "execution_count": 4, "id": "0cb7822f-2b8a-497a-bb93-1afa085dc544", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 635 features x 16 samples\n", "Total reads: 503752.0\n", "Min reads/sample: 9742.0\n", "Max reads/sample: 82587.0\n", "Taxonomy table: 635 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 635 features\n", "Tree: 1303 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "consensus_obj.info()" ] }, { "cell_type": "markdown", "id": "4d308577-22f8-443a-9d28-6e928f8b5e60", "metadata": {}, "source": [ "We can see that the consensus data has 635 ASVs. We can also get more information about what was happening when the consensus object was generated. Let's print the consensus_info variable." ] }, { "cell_type": "code", "execution_count": 5, "id": "51279711-f1fb-4f18-8d22-9d3a8f89364e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### consensus ###\n", "There are 630 features in common between all objects.\n", "\n", "In obj.0, the consensus features account for:\n", " 93.75 % of the features\n", " 99.15 % of the total abundance counts\n", " 97.8 - 100.0 % of the abundance counts per sample\n", "\n", "In obj.1, the consensus features account for:\n", " 66.18 % of the features\n", " 98.35 % of the total abundance counts\n", " 97.43 - 99.19 % of the abundance counts per sample\n", "\n", "The object with position 0 in the input list is retained as the consensus object.\n", "This object also includes 241 features with a relative abundance above the keep_cutoff threshold.\n", "All in all, the consensus object retains 94.49 % of its original features,\n", " which represent 99.91 % of the total abundance counts,\n", " and 99.58 - 100.0 % of the abundance counts per sample.\n", "The single most abundant feature that was removed had a relative abundance of to 0.0 - 0.17 % in the samples. The taxonomic classification of this feature was Bacteria;Pseudomonadota;Alphaproteobacteria;Rickettsiales;Mitochondria;;\n", "--------------\n", "\n" ] } ], "source": [ "print(consensus_info)" ] }, { "cell_type": "markdown", "id": "5c4cfe57-6c89-4db3-8cc7-d652dfa13a86", "metadata": {}, "source": [ "## Subsetting" ] }, { "cell_type": "markdown", "id": "f3d4dcbd-d1ce-46f3-99c1-4bc9802fded8", "metadata": {}, "source": [ "We can subset object based on information provided in the meta data. Let's have a look at the Saheb-Alam2019 example data. We load the data and look at the meta data. (Here we also use **rename_features** because this example data has weird feature names and it's nice if all features are named with the ASV prefix)." ] }, { "cell_type": "code", "execution_count": 1, "id": "e1deb121-777a-49a5-9822-16ae6cbe9d41", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "S5 anode acetate B\n", "S6 anode acetate B\n", "S7 anode acetate B\n", "S10 cathode acetate B\n", "S11 cathode acetate B\n", "S12 cathode acetate B\n", "S13 cathode acetate B\n", "S20 anode glucose D\n", "S21 anode glucose D\n", "S22 anode glucose D\n", "S23 anode glucose D\n", "S26 cathode glucose D\n", "S27 cathode glucose D\n", "S28 cathode glucose D\n", "S29 cathode glucose D\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam2019_DADA2\")\n", "obj.rename_features(name_type=\"ASV\", inplace=True)\n", "print(obj.meta)" ] }, { "cell_type": "markdown", "id": "9b6876f2-acd1-48af-8659-45dcd432a05d", "metadata": {}, "source": [ "These are samples collected from microbial fuel cells (see ). Biofilms growing on anodes and cathodes in systems fed with either acetate or glucose were sequenced. Let's say that we're only interested in the anode sample. We can subset the object to the anode samples using **subset_samples**:" ] }, { "cell_type": "code", "execution_count": 2, "id": "8d7bfbd2-198d-4532-97f9-8fc48ce9f10c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 316 features x 8 samples\n", "Total reads: 241039.0\n", "Min reads/sample: 13103.0\n", "Max reads/sample: 55054.0\n", "Taxonomy table: 316 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 316 features\n", "Tree: 1342 nodes\n", "Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "obj.subset_samples(by=\"location\", values=[\"anode\"], inplace=True)\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "e6d39b3b-dc17-419d-90e8-f8d04069c129", "metadata": {}, "source": [ "Now, we can see that we only have 8 samples remaining. There are 316 features detected in this samples. By setting *inplace*=True, we specify that we want to modify the object in place and not generate a new object.\n", "\n", "We can also subset to specific taxa. Desulfobacterota, especially Geobacter spp., tend to be abundant in microbial fuel cells. Let's subset the object to those taxa using **subset_taxa** and see what we get." ] }, { "cell_type": "code", "execution_count": 3, "id": "ea8ac6ff-9ccf-4298-b93e-144a4646c1ae", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 2 features x 8 samples\n", "Total reads: 824.0\n", "Min reads/sample: 0.0\n", "Max reads/sample: 350.0\n", "Taxonomy table: 2 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 2 features\n", "Tree: 1342 nodes\n", "Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "obj_geobacter = obj.subset_taxa(subset_levels=[\"Genus\"], subset_patterns=[\"Geobacter\"])\n", "obj_geobacter.info()" ] }, { "cell_type": "markdown", "id": "9fe58406-7ef0-410e-9ac3-be647a0f8602", "metadata": {}, "source": [ "We specified the taxonomic levels to search (*subset_levels*) and the text patterns to search for (*subset_patterns*).\n", "Here we didn't use *inplace*, so we keep the original object with all the anode-affiliated taxa and create a new object just for Geobacter. \n", "We can see that only 2 features remain. Let's have a look at who they are." ] }, { "cell_type": "code", "execution_count": 4, "id": "13952936-3ae2-4738-9eed-af1815a6e58a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Domain Phylum Class Order \\\n", "Feature \n", "ASV127 Bacteria Desulfobacterota_G_459546 Desulfuromonadia Geobacterales \n", "ASV147 Bacteria Desulfobacterota_G_459546 Desulfuromonadia Geobacterales \n", "\n", " Family Genus Species \n", "Feature \n", "ASV127 Geobacteraceae_439684 Geobacter \n", "ASV147 Geobacteraceae_439684 Geobacter \n" ] } ], "source": [ "print(obj_geobacter.tax)" ] }, { "cell_type": "markdown", "id": "51b1d58d-4c96-4cb3-ade6-dd0d6bda2363", "metadata": {}, "source": [ "We can also check their read counts in the samples:" ] }, { "cell_type": "code", "execution_count": 5, "id": "2aebddf8-5b96-4dd1-9e46-801498a39ed4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S4 S5 S6 S7 S20 S21 S22 S23\n", "Feature \n", "ASV127 0.0 0.0 0.0 0.0 133.0 103.0 66.0 150.0\n", "ASV147 0.0 47.0 34.0 0.0 0.0 91.0 0.0 200.0\n" ] } ], "source": [ "print(obj_geobacter.tab)" ] }, { "cell_type": "markdown", "id": "14573321-52bc-4fd1-85f2-a5acaa1664e3", "metadata": {}, "source": [ "These ASVs are not super abundant. So, which taxa are actually dominating in the data set? \n", "We can use **subset_abundant** to pick out the most abundant features." ] }, { "cell_type": "code", "execution_count": 6, "id": "f5c0225b-c1a0-4305-b98b-4d681d1db70a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Phylum Genus\n", "Feature \n", "ASV1 Desulfobacterota_G_459546 Trichloromonas\n", "ASV2 Bacillota_I Trichococcus\n", "ASV3 Desulfobacterota_G_459546 \n", "ASV6 Bacteroidota JALNZU01\n", "ASV18 Spirochaetota UBA12059\n", "ASV20 Bacteroidota Perlabentimonas\n", "ASV21 Bacteroidota Bact-08\n", "ASV23 Bacteroidota Lentimicrobium\n", "ASV25 Pseudomonadota Sulfuritalea\n", "ASV38 Bacillota_I Trichococcus\n" ] } ], "source": [ "obj_abund = obj.subset_abundant(n=10)\n", "print(obj_abund.tax[[\"Phylum\", \"Genus\"]])" ] }, { "cell_type": "markdown", "id": "57a9a100-f8b6-4b46-85e5-01af40e0907a", "metadata": {}, "source": [ "Here we see that among the most abundant ASVs, there are two Desulfobacterota species. But one is classified as Trichloromonas and the other is unclassified at the genus level. Let's check their read counts." ] }, { "cell_type": "code", "execution_count": 7, "id": "272e7ac7-b154-4187-9850-728fa93460bd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " S4 S5 S6 S7 S20 S21 S22 S23\n", "Feature \n", "ASV1 11302.0 21046.0 11153.0 9637.0 1013.0 1693.0 522.0 2851.0\n", "ASV2 0.0 0.0 0.0 0.0 21102.0 16385.0 4777.0 17839.0\n", "ASV3 0.0 0.0 0.0 0.0 19489.0 13748.0 8570.0 25646.0\n", "ASV6 123.0 296.0 123.0 105.0 381.0 271.0 80.0 234.0\n", "ASV18 194.0 557.0 257.0 224.0 56.0 53.0 18.0 123.0\n", "ASV20 316.0 647.0 294.0 0.0 191.0 98.0 61.0 197.0\n", "ASV21 173.0 327.0 168.0 142.0 441.0 411.0 179.0 388.0\n", "ASV23 0.0 0.0 0.0 0.0 1235.0 989.0 383.0 1141.0\n", "ASV25 343.0 838.0 427.0 255.0 0.0 0.0 0.0 0.0\n", "ASV38 0.0 0.0 0.0 0.0 1147.0 726.0 0.0 771.0\n" ] } ], "source": [ "print(obj_abund.tab)" ] }, { "cell_type": "markdown", "id": "dadc52ef-b76c-460b-a860-8b58366520f9", "metadata": {}, "source": [ "It looks like ASV1 is very abundant in the first four samples (acetate-fed) and ASV2 and ASV3 in the last four (glucose-fed)." ] }, { "cell_type": "markdown", "id": "97d0cc8a-2cc4-4619-bd8f-2f1cce563f0a", "metadata": {}, "source": [ "It is also possible to merge samples based on information the meta data. Let's use **merge_samples** to combine samples based on the feed they received." ] }, { "cell_type": "code", "execution_count": 8, "id": "819162a2-eb76-424e-9f8e-30d22bc53d04", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feed acetate glucose\n", "Feature \n", "ASV1 53138.0 6079.0\n", "ASV2 0.0 60103.0\n", "ASV3 0.0 67453.0\n", "ASV6 647.0 966.0\n", "ASV18 1232.0 250.0\n", "ASV20 1257.0 547.0\n", "ASV21 810.0 1419.0\n", "ASV23 0.0 3748.0\n", "ASV25 1863.0 0.0\n", "ASV38 0.0 2644.0\n" ] } ], "source": [ "obj_abund.merge_samples(by=\"feed\", inplace=True)\n", "print(obj_abund.tab)" ] }, { "cell_type": "markdown", "id": "bae2ddc3-fa55-403b-bd91-6f0ac9b9d51a", "metadata": {}, "source": [ "This shows the sum of the read counts for the acetate- and glucose fed anodes." ] }, { "cell_type": "markdown", "id": "749b97e9-7f9c-4498-b1ac-214809bd91c4", "metadata": {}, "source": [ "## Rarefy\n", "For amplicon sequencing data with uneven read counts per sample, it is good practice to rarefy the abundance table before diversity analysis." ] }, { "cell_type": "code", "execution_count": 2, "id": "3285cdd7-cc2c-4fd1-a574-98c11acbb1d8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 672 features x 16 samples\n", "Total reads: 504186.0\n", "Min reads/sample: 9783.0\n", "Max reads/sample: 82777.0\n", "Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 672 features\n", "Tree: 1342 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "sample \n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "import qdiv\n", "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_DADA2\")\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "f82a06cd-1fc0-48d9-959e-e4d2c33c812b", "metadata": {}, "source": [ "In this example data, we can see that the read counts in the samples range from 9783 to 82777. We can rarefy the object to the lowest read count." ] }, { "cell_type": "code", "execution_count": 3, "id": "39b23d2d-9819-4c7a-8bc7-c1b7dd6afed7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobiomeData object summary\n", "----------------------------------------\n", "Abundance table: 649 features x 16 samples\n", "Total reads: 156528\n", "Min reads/sample: 9783\n", "Max reads/sample: 9783\n", "Taxonomy table: 649 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n", "Sequence table: 649 features\n", "Tree: 1342 nodes\n", "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n", "Metadata preview:\n", " location feed mfc\n", "S4 anode acetate B\n", "----------------------------------------\n" ] } ], "source": [ "obj.rarefy(inplace=True)\n", "obj.info()" ] }, { "cell_type": "markdown", "id": "d4996ac5-faf5-446d-ac7b-e0e5a6347e69", "metadata": {}, "source": [ "Now all samples have the same count: 9783 reads. The **rarefy** function automatically rarefies to the read count in the sample with the lowest count. It is possible to specify a read count with the *depth* parameter, and to ensure reproducibility with the *seed* parameter." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }