Other statistics

A few commonly used statistical tests to analyse matrices with pairwise dissimilarities are implemented in the qdiv.stats subpackage. Here the Mantel test, PERMANOVA, and bootstrap_sample_matrix are introduced.

Mantel test

The Mantel test examines the association between two dissimilarity matrices defined on the same set of samples (Mantel, 1967). The test asks whether pairs of samples that are similar (or dissimilar) according to one criterion also tend to be similar (or dissimilar) according to another. In microbial ecology, the Mantel test is often used to relate community dissimilarity to external gradients such as environmental distance (e.g. pH differences), spatial separation, or temporal distance. For example, we may ask whether samples that are further apart in time also tend to differ more in microbial community composition.

To demonstrate the Mantel test function, we will load the Modin2025 dataset.

[1]:

import qdiv
obj = qdiv.MicrobiomeData.load_example("Modin2025")
obj.info()

MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1707.8699786349998
Min reads/sample: 59.920679878
Max reads/sample: 71.30233359299999
Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: 958 nodes
Metadata table: 26 samples, columns: ['order', 'day', 'phase']
Metadata preview:
        order  day  phase
sample
v2114       1    6      1
----------------------------------------

In the metadata of this example dataset, there is a column with information about the day samples were taken. We will calculate a distance matrix with the time difference between pairs of samples. To illustrate the use of the Mantel function, we will test the correlation between this time difference and the pairwise dissimilarity in microbial community composition calculate with the diversity.naive_beta function.

First, let’s calculate pairwise sampling time difference:

[3]:

import pandas as pd
time = obj.meta["day"]
distmat_time = (time.to_numpy()[:, None] - time.to_numpy()[None, :])
distmat_time = pd.DataFrame(abs(distmat_time), index=time.index, columns=time.index)
print(distmat_time.iloc[:5, :5])

sample  v2114  v2115  v2116  v2117  v2118
sample
v2114       0      7     14     21     28
v2115       7      0      7     14     21
v2116      14      7      0      7     14
v2117      21     14      7      0      7
v2118      28     21     14      7      0

Now the variable distmat_time is a pandas dataframe with pairwise time differences between samples in the dataset. Next, we’ll calculate dissimilarity in microbial community composition.

[4]:

dis = qdiv.diversity.naive_beta(obj, q=1)
print(dis.iloc[:5, :5])

          v2114     v2115     v2116     v2117     v2118
v2114  0.000000  0.124359  0.185928  0.281471  0.376723
v2115  0.124359  0.000000  0.040544  0.112424  0.201890
v2116  0.185928  0.040544  0.000000  0.056570  0.130441
v2117  0.281471  0.112424  0.056570  0.000000  0.043726
v2118  0.376723  0.201890  0.130441  0.043726  0.000000

Now dis is a dissimilarity matrix, calculated using the naive_beta function with a diversity order of 1. Let’s run the Mantel test.

[5]:

res = qdiv.stats.mantel(dis, distmat_time, method="spearman", permutations=999)
print(res)

[0.1792632415525296, np.float64(0.001)]

The result is a list where the first element is the Mantel test statistic and the second element is the permutation‑based p‑value. For correlation‑based methods, the statistic is reported as a dissimilarity (1 − ρ or 1 − r), meaning that smaller values indicate a stronger association between the two distance matrices. In this example, the low p‑value suggests a statistically significant relationship between temporal distance and microbial community dissimilarity: samples taken further apart in time tend to differ more in community composition.

PERMANOVA

Permutational multivariate analysis of variance (PERMANOVA) tests whether groups of samples differ in their multivariate structure, as defined by a chosen dissimilarity measure (Anderson, 2001). Rather than operating directly on raw multivariate observations, PERMANOVA partitions variation in a dissimilarity matrix according to an ANOVA‑style experimental design and assesses significance using permutations. Conceptually, PERMANOVA asks whether samples belonging to different groups are, on average, more dissimilar from each other than expected by chance, given the overall structure of the distance matrix.

For each factor included in the model, PERMANOVA tests the null hypothesis that group centroids are identical in the space defined by the dissimilarity matrix. In practice, this corresponds to asking whether the observed partitioning of pairwise dissimilarities among groups is stronger than would be expected under random reassignment of samples to groups.

Here we demonstrate PERMANOVA using the Saheb‑Alam2019_DADA2 example dataset. First, we load the data and compute a dissimilarity matrix.

[11]:

import qdiv
obj = qdiv.MicrobiomeData.load_example("Saheb-Alam2019_DADA2")
obj.info()

dis = qdiv.diversity.naive_beta(obj, q=1)
print(dis.iloc[:5, :5])

MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
       location     feed mfc
sample
S4        anode  acetate   B
----------------------------------------
           S4        S5        S6        S7       S10
S4   0.000000  0.058848  0.049525  0.074329  0.837658
S5   0.058848  0.000000  0.057854  0.073645  0.778168
S6   0.049525  0.057854  0.000000  0.067191  0.819105
S7   0.074329  0.073645  0.067191  0.000000  0.853965
S10  0.837658  0.778168  0.819105  0.853965  0.000000

The metadata include information about both location (anode vs. cathode) and feed (acetate vs. glucose). We can test the effects of these two factors, as well as their interaction, on microbial community structure:

[12]:

res = qdiv.stats.permanova(dis, obj, by=["location", "feed"], include_interaction=True, permutations=999)
print(res["table"])

               df        SS        MS           F      p        R2
term
location        1  1.295689  1.295689  173.112268  0.001  0.331991
feed            1  1.397021  1.397021  186.650861  0.001  0.357955
location:feed   1  0.472099  0.472099   63.075462  0.001  0.120965
Residual       12  0.089816  0.007485         NaN    NaN  0.023013

Each row of the table corresponds to a model term, including main effects and interactions. The columns report:

df: degrees of freedom for the term,
SS and MS: sums and mean squares derived from the distance matrix,
F: pseudo‑F statistic,
p: permutation‑based p‑value,
R2: proportion of total variation explained by the term.

In this example, both location and feed have statistically significant effects on microbial community composition, and there is also a significant interaction between the two factors (p = 0.001 for all terms).

Bootstrap summaries of dissimilarity matrices

While Mantel tests and PERMANOVA are useful for hypothesis testing, it is often equally important to quantify the magnitude and uncertainty of dissimilarity-based effects. The function bootstrap_sample_matrix estimates within‑group and between‑group dissimilarity summaries together with confidence intervals. Given a square dissimilarity matrix and one or more metadata variables, bootstrap_sample_matrix answers questions such as:

Are samples more similar within groups than between groups?
How large is this difference, and how uncertain is it?
Does the answer depend on how samples are grouped (or crossed)?

Why not just compute mean and standard deviation of distances?

A naïve approach would be to compute the mean (and standard deviation) of all pairwise dissimilarities, or to compare the mean of distances within groups to those between groups. However, this approach has several limitations:

Pairwise distances are not independent
Each sample appears in many distance pairs, so treating distances as independent observations leads to misleading uncertainty estimates.
No principled uncertainty estimates: Standard deviations of pairwise dissimilarities do not correspond to sampling uncertainty of group‑level summaries.

bootstrap_sample_matrix overcomes these issues by using a nested bootstrap that resamples samples within the structure defined by metadata. While PERMANOVA tests whether group structure in the dissimilarity matrix is stronger than expected by chance (hypothesis testing), bootstrap_sample_matrix estimates the magnitude and uncertainty of within‑ and between‑group dissimilarities (effect size estimation). Together, they provide a more complete picture of the data.

Let’s use the function on the data we used for PERMANOVA above.

[13]:

res = qdiv.stats.bootstrap_sample_matrix(
    dis,
    obj,
    by=["location", "feed"],
    n_boot=1000,
    alpha=0.05
)

print(res)

{'location': {'within': {'mean': 0.4621659257585619, 'ci': (0.4325898452362084, 0.48039868557728466), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8164422478698461, 'ci': (0.798834829662802, 0.8311924760509425), 'total_pairs': 64, 'boot': None}}, 'feed': {'within': {'mean': 0.4407172121543751, 'ci': (0.41995037102282134, 0.4574687873569055), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8352098722735096, 'ci': (0.8129252399380823, 0.854929994893634), 'total_pairs': 64, 'boot': None}}, 'Crossed': {'within': {'mean': 0.07748822634606517, 'ci': (0.049742913842534354, 0.09997940745202433), 'total_pairs': 24, 'boot': None}, 'between': {'mean': 0.7945195653525423, 'ci': (0.775871459763348, 0.8107586184803848), 'total_pairs': 96, 'boot': None}}}

The output is a dictionary containing results for the variables “location” and “feed”, as well as the interaction term. For each variable, ‘within’ and ‘between’ dissimilarities are shown as mean values and confidence intervals (ci). Let’s look specifically at the confidence intervals of dissimilarities between samples that are from same location and receive the same feed, and samples that are from different locations or receive different feeds:

[14]:

print(res["Crossed"]["within"]["ci"])
print(res["Crossed"]["between"]["ci"])

(0.049742913842534354, 0.09997940745202433)
(0.775871459763348, 0.8107586184803848)