{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c33aca62-4b83-487f-8bfe-ed103132d103",
   "metadata": {},
   "source": [
    "# Diversity  \n",
    "Diversity in ecological studies refers to different aspects of variation within and across communities:  \n",
    "- **Alpha diversity:** The richness and evenness of taxa (or \"features\") within a single sample or site. It measures how many different taxa are present and how evenly they are distributed.\n",
    "- **Beta diversity:** The degree of difference in community composition between two or more samples. It captures how similar or dissimilar communities are across environments.\n",
    "- **Gamma diversity:** The overall diversity across a collection of samples or an entire region, considering both the number of taxa and their distribution at a broader spatial scale.\n",
    "    \n",
    "**Naive-, phylogenetic-, or functional diversity**  \n",
    "One way to calculate diversity is to consider all taxa as separate entities and not care about their relationships. We call these diversity values *naive*.  \n",
    "A second way to calculate diversity takes the relationships in a phylogenetic tree into account. Two taxa that share many branches in the tree contribute less to diversity than two that are very far from each other in the tree. We call the diversity values that are calculated with a tree as input *phylogenetic*.  \n",
    "A third way to calculate diversity takes the pairwise distances between taxa into account. Two taxa that have a small pairwise distance contribute less to diversity than two that have a high pairwise distance. We call the diversity values that are calculated with a distance matrix as input *functional*.\n",
    "\n",
    "Naive alpha diversity is simply the effective number of taxa, phylogenetic alpha diversity is the effective branch diversity in the phylogenetic tree associated with the taxa in the sample, and functional alpha diversity is the effective total distance between taxa in the sample. The word \"effective\" used above refers to the relative abundance weighting by the diversity order, q.\n",
    "\n",
    "**Hill numbers (or effective numbers)**  \n",
    "Imagine a community with 10 species, all equally abundant. The most intuitive measure of diversity here is 10, and Hill numbers give exactly that result. Now consider a community with 2 dominant species and 8 rare ones. What should the diversity be?  \n",
    "- If we ignore relative abundances, the answer is 10.\n",
    "- If we ignore rare species, the answer is 2.\n",
    "\n",
    "Hill numbers solve this by introducing a single parameter, the diversity order (q), which controls how much weight is given to species’ relative abundances:  \n",
    "- When q=0, all species count equally (richness).\n",
    "- When q=1, species are weighted by their proportional abundance (Shannon diversity).\n",
    "- When q=2, common species dominate the measure (Simpson diversity).\n",
    "\n",
    "This flexibility makes Hill numbers a powerful and intuitive framework for comparing communities. And it doesn’t stop at 0, 1, or 2 — diversity can be explored across a continuum of q values, offering a nuanced view of community structure."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbe46b08-f6ec-4a30-a2d1-9b18c24da1b0",
   "metadata": {},
   "source": [
    "## Alpha diversity\n",
    "Let's calculate alpha diversity using qdiv. First, we will load and rarefy some example data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5b3b973a-97e3-4d1b-b010-f0ca3ea98ff5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_DADA2\")\n",
    "rarefied_obj = obj.rarefy() #I rarefy the object before calculating alpha diversity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63fe3ec8-0057-4764-987f-e2a5ed76f800",
   "metadata": {},
   "source": [
    "Next, we will calculate the naive alpha diversity values for the samples using the **diversity.naive_alpha** function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d818b5ee-68bf-42bc-b35b-6a6ecaf3b71a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S4      4.917773\n",
      "S5      8.926633\n",
      "S6      5.934636\n",
      "S7      4.894088\n",
      "S10    47.964757\n",
      "S11    39.977780\n",
      "S12    91.900986\n",
      "S13    55.938017\n",
      "S20     7.297213\n",
      "S21     6.967200\n",
      "S22     4.626569\n",
      "S23     6.013025\n",
      "S26    47.436514\n",
      "S27    47.834682\n",
      "S28    39.662705\n",
      "S29    59.220636\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "alpha = qdiv.diversity.naive_alpha(rarefied_obj, q=1)\n",
    "print(alpha)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5953500-04cd-4a97-bfaa-4f83d5d94563",
   "metadata": {},
   "source": [
    "This was the alpha diversity for diversity order q=1. Perhaps we want to calculate it for multiple q values and save it into a csv file together with the meta data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "53d2872f-b480-42ac-9f9c-27c7c826a603",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    location     feed mfc  Div_q0     Div_q1     Div_q2\n",
      "S4     anode  acetate   B    75.0   4.917773   1.862620\n",
      "S5     anode  acetate   B   138.0   8.926633   2.448316\n",
      "S6     anode  acetate   B    78.0   5.934636   2.080424\n",
      "S7     anode  acetate   B    74.0   4.894088   1.845467\n",
      "S10  cathode  acetate   B   165.0  47.964757  13.685347\n",
      "S11  cathode  acetate   B   194.0  39.977780  10.033863\n",
      "S12  cathode  acetate   B   315.0  91.900986  32.390060\n",
      "S13  cathode  acetate   B   186.0  55.938017  21.741220\n",
      "S20    anode  glucose   D   150.0   7.297213   3.327548\n",
      "S21    anode  glucose   D   107.0   6.967200   3.379743\n",
      "S22    anode  glucose   D    57.0   4.626569   2.650418\n",
      "S23    anode  glucose   D   116.0   6.013025   3.044453\n",
      "S26  cathode  glucose   D   165.0  47.436514  16.851130\n",
      "S27  cathode  glucose   D   222.0  47.834682  14.928833\n",
      "S28  cathode  glucose   D   145.0  39.662705  14.361918\n",
      "S29  cathode  glucose   D   241.0  59.220636  20.035359\n"
     ]
    }
   ],
   "source": [
    "meta = rarefied_obj.meta.copy() #We make a copy of the metadata to avoid modifying the object\n",
    "for q in [0, 1, 2]: #A for loop to calculate alpha diversity for several values of q\n",
    "    alpha = qdiv.diversity.naive_alpha(rarefied_obj, q=q)\n",
    "    meta[\"Div_q\"+str(q)] = alpha #Add the calculated values to the meta dataframe\n",
    "meta.to_csv(\"Metadata_with_alpha_diversity.csv\") #This lines saves the dataframe as a CSV file.\n",
    "print(meta) #This prints the file content to show it below"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d4f1d5b-cb02-40ce-a411-563086267bb6",
   "metadata": {},
   "source": [
    "The results seem to indicate higher diversity on cathode than anodes, and higher diversity in glucose-fed than acetate-fed microbial fuel cells.\n",
    "Higher diversity orders (q values) result in lower diversity. The change in diversity as q increases is an indication of the evenness. We can calculate evenness parameters using the function below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8ee0be08-6b12-474d-a610-7f41e6d0200a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    location     feed mfc  Div_q0     Div_q1     Div_q2  Evenness\n",
      "S4     anode  acetate   B    75.0   4.917773   1.862620  0.368931\n",
      "S5     anode  acetate   B   138.0   8.926633   2.448316  0.444272\n",
      "S6     anode  acetate   B    78.0   5.934636   2.080424  0.408750\n",
      "S7     anode  acetate   B    74.0   4.894088   1.845467  0.368960\n",
      "S10  cathode  acetate   B   165.0  47.964757  13.685347  0.758031\n",
      "S11  cathode  acetate   B   194.0  39.977780  10.033863  0.700156\n",
      "S12  cathode  acetate   B   315.0  91.900986  32.390060  0.785859\n",
      "S13  cathode  acetate   B   186.0  55.938017  21.741220  0.770080\n",
      "S20    anode  glucose   D   150.0   7.297213   3.327548  0.396655\n",
      "S21    anode  glucose   D   107.0   6.967200   3.379743  0.415426\n",
      "S22    anode  glucose   D    57.0   4.626569   2.650418  0.378876\n",
      "S23    anode  glucose   D   116.0   6.013025   3.044453  0.377384\n",
      "S26  cathode  glucose   D   165.0  47.436514  16.851130  0.755862\n",
      "S27  cathode  glucose   D   222.0  47.834682  14.928833  0.715895\n",
      "S28  cathode  glucose   D   145.0  39.662705  14.361918  0.739523\n",
      "S29  cathode  glucose   D   241.0  59.220636  20.035359  0.744106\n"
     ]
    }
   ],
   "source": [
    "evenness = qdiv.diversity.evenness(rarefied_obj, index=\"pielou\")\n",
    "meta[\"Evenness\"] = evenness\n",
    "print(meta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f3282d9-286c-4d1f-8cbf-02e6c20afe86",
   "metadata": {},
   "source": [
    "The cathode samples have, indeed, higher evenness than the anode samples."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5ea8255-9ac6-4fee-a3d9-5ae0d52873ae",
   "metadata": {},
   "source": [
    "**Phylogenetic diversity**  \n",
    "Since a phylogenetic tree is included in the example data object, we can also calculate phylogenetic diversity using the **diversity.phyl_alpha** function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2ff9f00c-b0ec-462d-b2b9-c4a8082cfdc2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    location     feed mfc  Div_q0     Div_q1     Div_q2  Evenness      PD_q0  \\\n",
      "S4     anode  acetate   B    75.0   4.917773   1.862620  0.368931   8.758063   \n",
      "S5     anode  acetate   B   138.0   8.926633   2.448316  0.444272  13.629279   \n",
      "S6     anode  acetate   B    78.0   5.934636   2.080424  0.408750  10.161938   \n",
      "S7     anode  acetate   B    74.0   4.894088   1.845467  0.368960   9.647654   \n",
      "S10  cathode  acetate   B   165.0  47.964757  13.685347  0.758031  16.782338   \n",
      "S11  cathode  acetate   B   194.0  39.977780  10.033863  0.700156  19.017742   \n",
      "S12  cathode  acetate   B   315.0  91.900986  32.390060  0.785859  26.608242   \n",
      "S13  cathode  acetate   B   186.0  55.938017  21.741220  0.770080  18.946945   \n",
      "S20    anode  glucose   D   150.0   7.297213   3.327548  0.396655  15.332092   \n",
      "S21    anode  glucose   D   107.0   6.967200   3.379743  0.415426  11.947341   \n",
      "S22    anode  glucose   D    57.0   4.626569   2.650418  0.378876   8.704764   \n",
      "S23    anode  glucose   D   116.0   6.013025   3.044453  0.377384  12.915055   \n",
      "S26  cathode  glucose   D   165.0  47.436514  16.851130  0.755862  19.834035   \n",
      "S27  cathode  glucose   D   222.0  47.834682  14.928833  0.715895  23.010236   \n",
      "S28  cathode  glucose   D   145.0  39.662705  14.361918  0.739523  16.086969   \n",
      "S29  cathode  glucose   D   241.0  59.220636  20.035359  0.744106  22.700699   \n",
      "\n",
      "        PD_q1     PD_q2  \n",
      "S4   0.982211  0.830020  \n",
      "S5   1.186406  0.869423  \n",
      "S6   1.053486  0.848903  \n",
      "S7   1.011143  0.837457  \n",
      "S10  1.745365  0.899019  \n",
      "S11  1.611911  0.878486  \n",
      "S12  2.014242  0.942826  \n",
      "S13  1.814265  0.930605  \n",
      "S20  1.010650  0.964344  \n",
      "S21  0.972128  0.961420  \n",
      "S22  0.892799  0.917759  \n",
      "S23  0.912382  0.933111  \n",
      "S26  2.227134  1.020376  \n",
      "S27  2.013565  1.011542  \n",
      "S28  1.835422  0.996555  \n",
      "S29  1.760901  0.985247  \n"
     ]
    }
   ],
   "source": [
    "for q in [0, 1, 2]: #A for loop to calculate alpha diversity for several values of q\n",
    "    alpha = qdiv.diversity.phyl_alpha(rarefied_obj, q=q)\n",
    "    meta[\"PD_q\"+str(q)] = alpha #Add the calculated values to the meta dataframe\n",
    "print(meta) #Show the results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec56b35f-730a-489f-92bf-7281a3dc6f44",
   "metadata": {},
   "source": [
    "**Functional diversity**  \n",
    "To calculate functional diversity, we need a distance matrix with pairwise functional distances between features in the dataset. The distance matrix should be in the form of a pandas dataframe with feature names as row- and column indices. For the purpose of demonstrating the use of the function, we will calculate such as distance matrix based on the phylogenetic distances between features. We use the **sequences.tree_distance_matrix** function to get a distance matrix saved in the variable *distmat*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ada41805-d4f5-494d-93d5-f15ca9a9c78f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Leaves: 100%|██████████████████████████████████████████████████████████████████████| 671/671 [00:00<00:00, 1503.42it/s]\n"
     ]
    }
   ],
   "source": [
    "distmat = qdiv.sequences.tree_distance_matrix(rarefied_obj, save=False) #By setting save=False, no file is saved"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8248a134-6b05-496b-b60c-ef0123553c1b",
   "metadata": {},
   "source": [
    "Next, we call the **diversity.func_alpha** function to calculate functional alpha diversity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "72b92d27-d90c-4160-9e01-94f8e75a69cd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S4      109.939446\n",
      "S5      267.304695\n",
      "S6      130.601954\n",
      "S7      117.680498\n",
      "S10    1629.952535\n",
      "S11    1477.581434\n",
      "S12    5009.610236\n",
      "S13    2030.422759\n",
      "S20      56.720516\n",
      "S21      45.512687\n",
      "S22      19.199088\n",
      "S23      29.713493\n",
      "S26    1471.048419\n",
      "S27    1419.489605\n",
      "S28     912.282687\n",
      "S29    2104.082254\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "alpha = qdiv.diversity.func_alpha(rarefied_obj, distmat, q=1)\n",
    "print(alpha)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35b120f1-71f7-4bf1-bd9d-af1c48b98a03",
   "metadata": {},
   "source": [
    "## Beta diversity\n",
    "**Relationship between alpha, beta, and gamma diversity**  \n",
    "One common way to define beta diversity is as a ratio between gamma and alpha: $D_\\beta=\\frac{D_\\gamma}{D_\\alpha}$  \n",
    "Gamma is the diversity of the pooled samples and alpha is the mean diversity of the individual samples.  \n",
    "Beta diversity takes a value between 1 and N (the number of samples being compared). For functional beta diversity it goes between 1 and NxN.\n",
    "\n",
    "Beta diversity can be converted to a dissimilarity index, which takes a value between 0 (if the compared samples are identical) and 1 (if the compared samples are completely different). Dissimilarity can be calculated with a local or regional viewpoint. The local viewpoint quantifies the effective proportion of species in a sample that is not shared across all samples. The regional viewpoint quantifies the effective proportion of species in the pooled samples that is not shared across all samples.\n",
    "Just like alpha diversity, beta diversity can be calculated for any diveristy order, q, which means we can decide the emphasis we want to put on the relative abundance of the species.\n",
    "\n",
    "Just like the alpha diversity indices, beta diversity and dissimilarity can be calculated using **naive-, phylogenetic-, and functional** indices.  \n",
    "Let's calculate beta diversity for the example data. First, we'll calculate pairwise dissimilarities between all samples in our data using the **diversity.naive_beta** function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f67141f7-aac4-45f5-bd8e-49bc71b52c08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           S4        S5        S6        S7       S10\n",
      "S4   0.000000  0.058442  0.049793  0.075557  0.839512\n",
      "S5   0.058442  0.000000  0.057203  0.073835  0.783948\n",
      "S6   0.049793  0.057203  0.000000  0.067338  0.821311\n",
      "S7   0.075557  0.073835  0.067338  0.000000  0.853798\n",
      "S10  0.839512  0.783948  0.821311  0.853798  0.000000\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam_DADA2\")\n",
    "rarefied_obj = obj.rarefy()\n",
    "\n",
    "dissimilarity = qdiv.diversity.naive_beta(rarefied_obj, q=1)\n",
    "print(dissimilarity.iloc[:5, :5]) #Here we print the results for five samples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5016faf-56a3-42ca-ae12-de536dde3a95",
   "metadata": {},
   "source": [
    "The dissimilarity matrix shows the pairwise dissimilarity between the samples. It can, e.g., be used later for plotting an ordination.  \n",
    "We can also check the beta diversity across sample groups with multiple samples. Let's group the samples based on feed (one recieved acetate and one glucose). We'll use the **diversity.naive_multi_beta** function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3c2f8574-47ac-4ec2-93c6-2db6ea4e8734",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           N      beta  local_dis  regional_dis\n",
      "acetate  8.0  1.967056   0.325346      0.325346\n",
      "glucose  8.0  1.666429   0.245587      0.245587\n"
     ]
    }
   ],
   "source": [
    "multi_beta = qdiv.diversity.naive_multi_beta(rarefied_obj, by=\"feed\", q=1)\n",
    "print(multi_beta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "197ce02f-86d5-4125-83eb-a7ced330ca99",
   "metadata": {},
   "source": [
    "Each sample group contains 8 samples (N=8). The beta diversity value indicates the effective number of distinct communities represented by the data. In other words, it reflects how much differentiation exists among the samples within a group.\n",
    "To provide additional perspective, beta diversity is also expressed as dissimilarity indices, calculated using either:  \n",
    "- local view, which considers differences relative to individual samples, or\n",
    "- regional view, which considers differences relative to the entire pooled set of samples.  \n",
    "\n",
    "The analysis can also be done with **diversity.phyl_multi_beta** or **diversity.func_multi_beta**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}