{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ce17e1b5-9f97-4f2c-926e-eab55eebb77f",
   "metadata": {},
   "source": [
    "# Data management"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68d71fee-ebfd-4898-ba14-be707b302f17",
   "metadata": {},
   "source": [
    "## Finding consensus ASVs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ca22531-fc38-44c8-b913-592fc7367f7d",
   "metadata": {},
   "source": [
    "Say we have used several softwares to generate amplicon sequence variants (ASVs) from a sequencing dataset.\n",
    "We can generate a set of consensus ASVs that were detected by all methods and trim our abundance table to those ASVs.\n",
    "We load the Saheb-Alam2019 that was processed using both DADA2 and Deblur."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "64735e0d-4659-455b-9b67-e631efd0bb19",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 672 features x 16 samples\n",
      "Total reads: 504186.0\n",
      "Min reads/sample: 9783.0\n",
      "Max reads/sample: 82777.0\n",
      "Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 672 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n",
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 952 features x 16 samples\n",
      "Total reads: 572710.0\n",
      "Min reads/sample: 11347.0\n",
      "Max reads/sample: 88846.0\n",
      "Taxonomy table: 952 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 952 features\n",
      "Tree: 1901 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj1 = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_2019_DADA2\")\n",
    "obj2 = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_2019_Deblur\")\n",
    "obj1.info()\n",
    "obj2.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efeac0ec-8548-4109-a844-afaec9419fd8",
   "metadata": {},
   "source": [
    "We see that the DADA2 data has 672 ASVs (features) while the Deblur data has 952. Now we will find a consensus data set, which retains ASVs detected by both methods. Let's call the **sequences.consensus** function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4e0db438-0516-4dc9-a98b-0d21570ca1de",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running consensus...\n",
      "Aligning features in 2 objects: 1.. 2.. \n",
      "Changing feature names in 2 objects: 1.. 2.. \n",
      "Done with align\n",
      "Done with consensus.\n"
     ]
    }
   ],
   "source": [
    "consensus_obj, consensus_info = qdiv.sequences.consensus([obj1, obj2], different_lengths=True, keep_cutoff=0.2, name_type=\"ASV\", keep_object=0) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d37bc27e-872b-4cae-89de-c9adea1b49fa",
   "metadata": {},
   "source": [
    "Note that we used the *different_lengths*=True. Deblur sets a specific length for all ASVs while DADA2 allows different lengths. This means that a 253 bp long ASV detected by DADA2 may be truncated to 250 bp when detected by Deblur. Setting *different_lengths* to True means that these two sequences will be treated as the same ASV if the shorter one can be found within the longer one.  \n",
    "We also set *keep_cutoff* to 0.2. This means that all ASVs having a relative abundance of at least 0.2% in a sample will be retained, regardless if they are a consensus ASV. Setting *name_type* to \"ASV\" means that will be used as prefix for the renamed features.\n",
    "Setting *keep_object*=0, means that we will retain the first object in the input list (i.e. obj1) as the consensus object.  \n",
    "What the line of code essentially is saying is: Take the DADA2 object (obj1); keep all ASVs with a relative abundance >0.2% in a sample; among the ASVs with lower relative abundance, keep those it has in common with the Deblur object (obj2).   \n",
    "Let's look at the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0cb7822f-2b8a-497a-bb93-1afa085dc544",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 635 features x 16 samples\n",
      "Total reads: 503752.0\n",
      "Min reads/sample: 9742.0\n",
      "Max reads/sample: 82587.0\n",
      "Taxonomy table: 635 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 635 features\n",
      "Tree: 1303 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "consensus_obj.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d308577-22f8-443a-9d28-6e928f8b5e60",
   "metadata": {},
   "source": [
    "We can see that the consensus data has 635 ASVs. We can also get more information about what was happening when the consensus object was generated. Let's print the consensus_info variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "51279711-f1fb-4f18-8d22-9d3a8f89364e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### consensus ###\n",
      "There are 630 features in common between all objects.\n",
      "\n",
      "In obj.0, the consensus features account for:\n",
      " 93.75 % of the features\n",
      " 99.15 % of the total abundance counts\n",
      " 97.8 - 100.0 % of the abundance counts per sample\n",
      "\n",
      "In obj.1, the consensus features account for:\n",
      " 66.18 % of the features\n",
      " 98.35 % of the total abundance counts\n",
      " 97.43 - 99.19 % of the abundance counts per sample\n",
      "\n",
      "The object with position 0 in the input list is retained as the consensus object.\n",
      "This object also includes 241 features with a relative abundance above the keep_cutoff threshold.\n",
      "All in all, the consensus object retains 94.49 % of its original features,\n",
      " which represent 99.91 % of the total abundance counts,\n",
      " and 99.58 - 100.0 % of the abundance counts per sample.\n",
      "The single most abundant feature that was removed had a relative abundance of to 0.0 - 0.17 % in the samples. The taxonomic classification of this feature was Bacteria;Pseudomonadota;Alphaproteobacteria;Rickettsiales;Mitochondria;;\n",
      "--------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(consensus_info)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c4cfe57-6c89-4db3-8cc7-d652dfa13a86",
   "metadata": {},
   "source": [
    "## Subsetting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3d4dcbd-d1ce-46f3-99c1-4bc9802fded8",
   "metadata": {},
   "source": [
    "We can subset object based on information provided in the meta data. Let's have a look at the Saheb-Alam2019 example data. We load the data and look at the meta data. (Here we also use **rename_features** because this example data has weird feature names and it's nice if all features are named with the ASV prefix)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e1deb121-777a-49a5-9822-16ae6cbe9d41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "S5        anode  acetate   B\n",
      "S6        anode  acetate   B\n",
      "S7        anode  acetate   B\n",
      "S10     cathode  acetate   B\n",
      "S11     cathode  acetate   B\n",
      "S12     cathode  acetate   B\n",
      "S13     cathode  acetate   B\n",
      "S20       anode  glucose   D\n",
      "S21       anode  glucose   D\n",
      "S22       anode  glucose   D\n",
      "S23       anode  glucose   D\n",
      "S26     cathode  glucose   D\n",
      "S27     cathode  glucose   D\n",
      "S28     cathode  glucose   D\n",
      "S29     cathode  glucose   D\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam2019_DADA2\")\n",
    "obj.rename_features(name_type=\"ASV\", inplace=True)\n",
    "print(obj.meta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b6876f2-acd1-48af-8659-45dcd432a05d",
   "metadata": {},
   "source": [
    "These are samples collected from microbial fuel cells (see <https://doi.org/10.1111/1751-7915.13449>). Biofilms growing on anodes and cathodes in systems fed with either acetate or glucose were sequenced. Let's say that we're only interested in the anode sample. We can subset the object to the anode samples using **subset_samples**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8d7bfbd2-198d-4532-97f9-8fc48ce9f10c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 316 features x 8 samples\n",
      "Total reads: 241039.0\n",
      "Min reads/sample: 13103.0\n",
      "Max reads/sample: 55054.0\n",
      "Taxonomy table: 316 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 316 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "obj.subset_samples(by=\"location\", values=[\"anode\"], inplace=True)\n",
    "obj.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6d39b3b-dc17-419d-90e8-f8d04069c129",
   "metadata": {},
   "source": [
    "Now, we can see that we only have 8 samples remaining. There are 316 features detected in this samples. By setting *inplace*=True, we specify that we want to modify the object in place and not generate a new object.\n",
    "\n",
    "We can also subset to specific taxa. Desulfobacterota, especially Geobacter spp., tend to be abundant in microbial fuel cells. Let's subset the object to those taxa using **subset_taxa** and see what we get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ea8ac6ff-9ccf-4298-b93e-144a4646c1ae",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 2 features x 8 samples\n",
      "Total reads: 824.0\n",
      "Min reads/sample: 0.0\n",
      "Max reads/sample: 350.0\n",
      "Taxonomy table: 2 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 2 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 8 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "obj_geobacter = obj.subset_taxa(subset_levels=[\"Genus\"], subset_patterns=[\"Geobacter\"])\n",
    "obj_geobacter.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fe58406-7ef0-410e-9ac3-be647a0f8602",
   "metadata": {},
   "source": [
    "We specified the taxonomic levels to search (*subset_levels*) and the text patterns to search for (*subset_patterns*).\n",
    "Here we didn't use *inplace*, so we keep the original object with all the anode-affiliated taxa and create a new object just for Geobacter.  \n",
    "We can see that only 2 features remain. Let's have a look at who they are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "13952936-3ae2-4738-9eed-af1815a6e58a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           Domain                     Phylum             Class          Order  \\\n",
      "Feature                                                                         \n",
      "ASV127   Bacteria  Desulfobacterota_G_459546  Desulfuromonadia  Geobacterales   \n",
      "ASV147   Bacteria  Desulfobacterota_G_459546  Desulfuromonadia  Geobacterales   \n",
      "\n",
      "                        Family      Genus Species  \n",
      "Feature                                            \n",
      "ASV127   Geobacteraceae_439684  Geobacter          \n",
      "ASV147   Geobacteraceae_439684  Geobacter          \n"
     ]
    }
   ],
   "source": [
    "print(obj_geobacter.tax)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51b1d58d-4c96-4cb3-ade6-dd0d6bda2363",
   "metadata": {},
   "source": [
    "We can also check their read counts in the samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2aebddf8-5b96-4dd1-9e46-801498a39ed4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "          S4    S5    S6   S7    S20    S21   S22    S23\n",
      "Feature                                                 \n",
      "ASV127   0.0   0.0   0.0  0.0  133.0  103.0  66.0  150.0\n",
      "ASV147   0.0  47.0  34.0  0.0    0.0   91.0   0.0  200.0\n"
     ]
    }
   ],
   "source": [
    "print(obj_geobacter.tab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14573321-52bc-4fd1-85f2-a5acaa1664e3",
   "metadata": {},
   "source": [
    "These ASVs are not super abundant. So, which taxa are actually dominating in the data set?  \n",
    "We can use **subset_abundant** to pick out the most abundant features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f5c0225b-c1a0-4305-b98b-4d681d1db70a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            Phylum            Genus\n",
      "Feature                                            \n",
      "ASV1     Desulfobacterota_G_459546   Trichloromonas\n",
      "ASV2                   Bacillota_I     Trichococcus\n",
      "ASV3     Desulfobacterota_G_459546                 \n",
      "ASV6                  Bacteroidota         JALNZU01\n",
      "ASV18                Spirochaetota         UBA12059\n",
      "ASV20                 Bacteroidota  Perlabentimonas\n",
      "ASV21                 Bacteroidota          Bact-08\n",
      "ASV23                 Bacteroidota   Lentimicrobium\n",
      "ASV25               Pseudomonadota     Sulfuritalea\n",
      "ASV38                  Bacillota_I     Trichococcus\n"
     ]
    }
   ],
   "source": [
    "obj_abund = obj.subset_abundant(n=10)\n",
    "print(obj_abund.tax[[\"Phylum\", \"Genus\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57a9a100-f8b6-4b46-85e5-01af40e0907a",
   "metadata": {},
   "source": [
    "Here we see that among the most abundant ASVs, there are two Desulfobacterota species. But one is classified as Trichloromonas and the other is unclassified at the genus level. Let's check their read counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "272e7ac7-b154-4187-9850-728fa93460bd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              S4       S5       S6      S7      S20      S21     S22      S23\n",
      "Feature                                                                      \n",
      "ASV1     11302.0  21046.0  11153.0  9637.0   1013.0   1693.0   522.0   2851.0\n",
      "ASV2         0.0      0.0      0.0     0.0  21102.0  16385.0  4777.0  17839.0\n",
      "ASV3         0.0      0.0      0.0     0.0  19489.0  13748.0  8570.0  25646.0\n",
      "ASV6       123.0    296.0    123.0   105.0    381.0    271.0    80.0    234.0\n",
      "ASV18      194.0    557.0    257.0   224.0     56.0     53.0    18.0    123.0\n",
      "ASV20      316.0    647.0    294.0     0.0    191.0     98.0    61.0    197.0\n",
      "ASV21      173.0    327.0    168.0   142.0    441.0    411.0   179.0    388.0\n",
      "ASV23        0.0      0.0      0.0     0.0   1235.0    989.0   383.0   1141.0\n",
      "ASV25      343.0    838.0    427.0   255.0      0.0      0.0     0.0      0.0\n",
      "ASV38        0.0      0.0      0.0     0.0   1147.0    726.0     0.0    771.0\n"
     ]
    }
   ],
   "source": [
    "print(obj_abund.tab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dadc52ef-b76c-460b-a860-8b58366520f9",
   "metadata": {},
   "source": [
    "It looks like ASV1 is very abundant in the first four samples (acetate-fed) and ASV2 and ASV3 in the last four (glucose-fed)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97d0cc8a-2cc4-4619-bd8f-2f1cce563f0a",
   "metadata": {},
   "source": [
    "It is also possible to merge samples based on information the meta data. Let's use **merge_samples** to combine samples based on the feed they received."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "819162a2-eb76-424e-9f8e-30d22bc53d04",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "feed     acetate  glucose\n",
      "Feature                  \n",
      "ASV1     53138.0   6079.0\n",
      "ASV2         0.0  60103.0\n",
      "ASV3         0.0  67453.0\n",
      "ASV6       647.0    966.0\n",
      "ASV18     1232.0    250.0\n",
      "ASV20     1257.0    547.0\n",
      "ASV21      810.0   1419.0\n",
      "ASV23        0.0   3748.0\n",
      "ASV25     1863.0      0.0\n",
      "ASV38        0.0   2644.0\n"
     ]
    }
   ],
   "source": [
    "obj_abund.merge_samples(by=\"feed\", inplace=True)\n",
    "print(obj_abund.tab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bae2ddc3-fa55-403b-bd91-6f0ac9b9d51a",
   "metadata": {},
   "source": [
    "This shows the sum of the read counts for the acetate- and glucose fed anodes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "749b97e9-7f9c-4498-b1ac-214809bd91c4",
   "metadata": {},
   "source": [
    "## Rarefy\n",
    "For amplicon sequencing data with uneven read counts per sample, it is good practice to rarefy the abundance table before diversity analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3285cdd7-cc2c-4fd1-a574-98c11acbb1d8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 672 features x 16 samples\n",
      "Total reads: 504186.0\n",
      "Min reads/sample: 9783.0\n",
      "Max reads/sample: 82777.0\n",
      "Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 672 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_DADA2\")\n",
    "obj.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f82a06cd-1fc0-48d9-959e-e4d2c33c812b",
   "metadata": {},
   "source": [
    "In this example data, we can see that the read counts in the samples range from 9783 to 82777. We can rarefy the object to the lowest read count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "39b23d2d-9819-4c7a-8bc7-c1b7dd6afed7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 649 features x 16 samples\n",
      "Total reads: 156528\n",
      "Min reads/sample: 9783\n",
      "Max reads/sample: 9783\n",
      "Taxonomy table: 649 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 649 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "   location     feed mfc\n",
      "S4    anode  acetate   B\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "obj.rarefy(inplace=True)\n",
    "obj.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4996ac5-faf5-446d-ac7b-e0e5a6347e69",
   "metadata": {},
   "source": [
    "Now all samples have the same count: 9783 reads. The **rarefy** function automatically rarefies to the read count in the sample with the lowest count. It is possible to specify a read count with the *depth* parameter, and to ensure reproducibility with the *seed* parameter."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}