{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c41951cc-868a-4399-99c0-b0f674ec67c7",
   "metadata": {},
   "source": [
    "# Other statistics\n",
    "A few commonly used statistical tests to analyse matrices with pairwise dissimilarities are implemented in the qdiv.stats subpackage. Here the Mantel test, PERMANOVA, and bootstrap_sample_matrix are introduced.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "354e085a-4390-4fa2-a66b-f4738e01b648",
   "metadata": {},
   "source": [
    "## Mantel test  \n",
    "The *Mantel test* examines the association between two dissimilarity matrices defined on the same set of samples (Mantel, 1967). The test asks whether pairs of samples that are similar (or dissimilar) according to one criterion also tend to be similar (or dissimilar) according to another.\n",
    "In microbial ecology, the Mantel test is often used to relate community dissimilarity to external gradients such as environmental distance (e.g. pH differences), spatial separation, or temporal distance. For example, we may ask whether samples that are further apart in time also tend to differ more in microbial community composition.\n",
    "\n",
    "To demonstrate the Mantel test function, we will load the Modin2025 dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "91ee1540-fd5d-47e4-8818-724e6c74b59b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 304 features x 26 samples\n",
      "Total reads: 1707.8699786349998\n",
      "Min reads/sample: 59.920679878\n",
      "Max reads/sample: 71.30233359299999\n",
      "Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: None\n",
      "Tree: 958 nodes\n",
      "Metadata table: 26 samples, columns: ['order', 'day', 'phase']\n",
      "Metadata preview:\n",
      "        order  day  phase\n",
      "sample                   \n",
      "v2114       1    6      1\n",
      "----------------------------------------\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Modin2025\")\n",
    "obj.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c2eca19-6164-4268-95fc-baf8b2a0bc85",
   "metadata": {},
   "source": [
    "In the metadata of this example dataset, there is a column with information about the day samples were taken. We will calculate a distance matrix with the time difference between pairs of samples. To illustrate the use of the Mantel function, we will test the correlation between this time difference and the pairwise dissimilarity in microbial community composition calculate with the **diversity.naive_beta** function.  \n",
    "First, let's calculate pairwise sampling time difference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a739944e-feb5-4965-8db6-c7366f54f193",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sample  v2114  v2115  v2116  v2117  v2118\n",
      "sample                                   \n",
      "v2114       0      7     14     21     28\n",
      "v2115       7      0      7     14     21\n",
      "v2116      14      7      0      7     14\n",
      "v2117      21     14      7      0      7\n",
      "v2118      28     21     14      7      0\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "time = obj.meta[\"day\"]\n",
    "distmat_time = (time.to_numpy()[:, None] - time.to_numpy()[None, :])\n",
    "distmat_time = pd.DataFrame(abs(distmat_time), index=time.index, columns=time.index)\n",
    "print(distmat_time.iloc[:5, :5])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c9f5b25-db45-4039-b6b1-74453124435d",
   "metadata": {},
   "source": [
    "Now the variable *distmat_time* is a pandas dataframe with pairwise time differences between samples in the dataset.\n",
    "Next, we'll calculate dissimilarity in microbial community composition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "38818282-e17d-4a91-aee1-e53b47aee43b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "          v2114     v2115     v2116     v2117     v2118\n",
      "v2114  0.000000  0.124359  0.185928  0.281471  0.376723\n",
      "v2115  0.124359  0.000000  0.040544  0.112424  0.201890\n",
      "v2116  0.185928  0.040544  0.000000  0.056570  0.130441\n",
      "v2117  0.281471  0.112424  0.056570  0.000000  0.043726\n",
      "v2118  0.376723  0.201890  0.130441  0.043726  0.000000\n"
     ]
    }
   ],
   "source": [
    "dis = qdiv.diversity.naive_beta(obj, q=1)\n",
    "print(dis.iloc[:5, :5])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d27bdb5-1f94-4699-984d-ab44b98b56d6",
   "metadata": {},
   "source": [
    "Now *dis* is a dissimilarity matrix, calculated using the *naive_beta* function with a diversity order of 1. Let's run the Mantel test. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4bd0e50f-9ba2-446a-9381-e41cf0814c42",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.1792632415525296, np.float64(0.001)]\n"
     ]
    }
   ],
   "source": [
    "res = qdiv.stats.mantel(dis, distmat_time, method=\"spearman\", permutations=999)\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5910818b-5ca1-43f0-b505-0c20fec0b5e4",
   "metadata": {},
   "source": [
    "The result is a list where the first element is the Mantel test statistic and the second element is the permutation‑based p‑value. For correlation‑based methods, the statistic is reported as a dissimilarity (1 − ρ or 1 − r), meaning that smaller values indicate a stronger association between the two distance matrices.\n",
    "In this example, the low p‑value suggests a statistically significant relationship between temporal distance and microbial community dissimilarity: samples taken further apart in time tend to differ more in community composition."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "127565db-e30f-4bc4-acfa-80a5c70edeaf",
   "metadata": {},
   "source": [
    "## PERMANOVA  \n",
    "Permutational multivariate analysis of variance (PERMANOVA) tests whether groups of samples differ in their multivariate structure, as defined by a chosen dissimilarity measure (Anderson, 2001). Rather than operating directly on raw multivariate observations, PERMANOVA partitions variation in a dissimilarity matrix according to an ANOVA‑style experimental design and assesses significance using permutations.\n",
    "Conceptually, PERMANOVA asks whether samples belonging to different groups are, on average, more dissimilar from each other than expected by chance, given the overall structure of the distance matrix. \n",
    "\n",
    "For each factor included in the model, PERMANOVA tests the null hypothesis that group centroids are identical in the space defined by the dissimilarity matrix. In practice, this corresponds to asking whether the observed partitioning of pairwise dissimilarities among groups is stronger than would be expected under random reassignment of samples to groups.\n",
    "\n",
    "Here we demonstrate PERMANOVA using the Saheb‑Alam2019_DADA2 example dataset. First, we load the data and compute a dissimilarity matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3641b41e-86fc-4378-a316-fd64bc10c608",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MicrobiomeData object summary\n",
      "----------------------------------------\n",
      "Abundance table: 672 features x 16 samples\n",
      "Total reads: 504186.0\n",
      "Min reads/sample: 9783.0\n",
      "Max reads/sample: 82777.0\n",
      "Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']\n",
      "Sequence table: 672 features\n",
      "Tree: 1342 nodes\n",
      "Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']\n",
      "Metadata preview:\n",
      "       location     feed mfc\n",
      "sample                      \n",
      "S4        anode  acetate   B\n",
      "----------------------------------------\n",
      "           S4        S5        S6        S7       S10\n",
      "S4   0.000000  0.058848  0.049525  0.074329  0.837658\n",
      "S5   0.058848  0.000000  0.057854  0.073645  0.778168\n",
      "S6   0.049525  0.057854  0.000000  0.067191  0.819105\n",
      "S7   0.074329  0.073645  0.067191  0.000000  0.853965\n",
      "S10  0.837658  0.778168  0.819105  0.853965  0.000000\n"
     ]
    }
   ],
   "source": [
    "import qdiv\n",
    "obj = qdiv.MicrobiomeData.load_example(\"Saheb-Alam2019_DADA2\")\n",
    "obj.info()\n",
    "\n",
    "dis = qdiv.diversity.naive_beta(obj, q=1)\n",
    "print(dis.iloc[:5, :5])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96875dad-2dc3-47de-9c40-10132e286457",
   "metadata": {},
   "source": [
    "The metadata include information about both location (anode vs. cathode) and feed (acetate vs. glucose). We can test the effects of these two factors, as well as their interaction, on microbial community structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "94f21708-9c1e-4ca4-8525-fd95d07729dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               df        SS        MS           F      p        R2\n",
      "term                                                              \n",
      "location        1  1.295689  1.295689  173.112268  0.001  0.331991\n",
      "feed            1  1.397021  1.397021  186.650861  0.001  0.357955\n",
      "location:feed   1  0.472099  0.472099   63.075462  0.001  0.120965\n",
      "Residual       12  0.089816  0.007485         NaN    NaN  0.023013\n"
     ]
    }
   ],
   "source": [
    "res = qdiv.stats.permanova(dis, obj, by=[\"location\", \"feed\"], include_interaction=True, permutations=999)\n",
    "print(res[\"table\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64be62e7-b500-4c4f-8ab1-c8134d9b4dac",
   "metadata": {},
   "source": [
    "Each row of the table corresponds to a model term, including main effects and interactions. The columns report:\n",
    "\n",
    "- df: degrees of freedom for the term,\n",
    "- SS and MS: sums and mean squares derived from the distance matrix,\n",
    "- F: pseudo‑F statistic,\n",
    "- p: permutation‑based p‑value,\n",
    "- R2: proportion of total variation explained by the term.\n",
    "\n",
    "In this example, both location and feed have statistically significant effects on microbial community composition, and there is also a significant interaction between the two factors (p = 0.001 for all terms)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb3937f0-0d02-4a0b-be07-79b1c25721ca",
   "metadata": {},
   "source": [
    "## Bootstrap summaries of dissimilarity matrices  \n",
    "While Mantel tests and PERMANOVA are useful for hypothesis testing, it is often equally important to quantify the magnitude and uncertainty of dissimilarity-based effects. The function **bootstrap_sample_matrix** estimates within‑group and between‑group dissimilarity summaries together with confidence intervals. Given a square dissimilarity matrix and one or more metadata variables, bootstrap_sample_matrix answers questions such as:\n",
    "\n",
    "- Are samples more similar within groups than between groups?\n",
    "- How large is this difference, and how uncertain is it?\n",
    "- Does the answer depend on how samples are grouped (or crossed)?\n",
    "\n",
    "*Why not just compute mean and standard deviation of distances?*  \n",
    "A naïve approach would be to compute the mean (and standard deviation) of all pairwise dissimilarities, or to compare the mean of distances within groups to those between groups. However, this approach has several limitations:\n",
    "\n",
    "- Pairwise distances are not independent\n",
    "- Each sample appears in many distance pairs, so treating distances as independent observations leads to misleading uncertainty estimates.\n",
    "- No principled uncertainty estimates: Standard deviations of pairwise dissimilarities do not correspond to sampling uncertainty of group‑level summaries.\n",
    "\n",
    "**bootstrap_sample_matrix** overcomes these issues by using a nested bootstrap that resamples samples within the structure defined by metadata.\n",
    "While PERMANOVA tests whether group structure in the dissimilarity matrix is stronger than expected by chance (hypothesis testing), bootstrap_sample_matrix estimates the magnitude and uncertainty of within‑ and between‑group dissimilarities (effect size estimation). Together, they provide a more complete picture of the data.\n",
    "\n",
    "Let's use the function on the data we used for PERMANOVA above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7d5921c3-9c3e-4f42-86ac-2cd3b9b1511a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'location': {'within': {'mean': 0.4621659257585619, 'ci': (0.4325898452362084, 0.48039868557728466), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8164422478698461, 'ci': (0.798834829662802, 0.8311924760509425), 'total_pairs': 64, 'boot': None}}, 'feed': {'within': {'mean': 0.4407172121543751, 'ci': (0.41995037102282134, 0.4574687873569055), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8352098722735096, 'ci': (0.8129252399380823, 0.854929994893634), 'total_pairs': 64, 'boot': None}}, 'Crossed': {'within': {'mean': 0.07748822634606517, 'ci': (0.049742913842534354, 0.09997940745202433), 'total_pairs': 24, 'boot': None}, 'between': {'mean': 0.7945195653525423, 'ci': (0.775871459763348, 0.8107586184803848), 'total_pairs': 96, 'boot': None}}}\n"
     ]
    }
   ],
   "source": [
    "res = qdiv.stats.bootstrap_sample_matrix(\n",
    "    dis,\n",
    "    obj,\n",
    "    by=[\"location\", \"feed\"],\n",
    "    n_boot=1000,\n",
    "    alpha=0.05\n",
    ")\n",
    "\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50bb9f08-df91-4386-a1be-80700c708114",
   "metadata": {},
   "source": [
    "The output is a dictionary containing results for the variables \"location\" and \"feed\", as well as the interaction term. For each variable, 'within' and 'between' dissimilarities are shown as mean values and confidence intervals (ci). Let's look specifically at the confidence intervals of dissimilarities between samples that are from same location and receive the same feed, and samples that are from different locations or receive different feeds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e4833d9a-37b7-4ff7-8e31-9b7755b6f383",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0.049742913842534354, 0.09997940745202433)\n",
      "(0.775871459763348, 0.8107586184803848)\n"
     ]
    }
   ],
   "source": [
    "print(res[\"Crossed\"][\"within\"][\"ci\"])\n",
    "print(res[\"Crossed\"][\"between\"][\"ci\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}