Other statistics
The qdiv.stats subpackage implements a few commonly used statistical methods for analysing matrices of pairwise dissimilarities. This section introduces the Mantel test, PERMANOVA, and bootstrap_sample_matrix.
Mantel test
The Mantel test examines the association between two dissimilarity matrices defined on the same set of samples (Mantel, 1967). The test asks whether pairs of samples that are similar (or dissimilar) according to one criterion also tend to be similar (or dissimilar) according to another. In microbial ecology, the Mantel test is often used to relate community dissimilarity to external gradients such as environmental distance (e.g. pH differences), spatial separation, or temporal distance. For example, we may ask whether samples that are further apart in time also tend to differ more in microbial community composition.
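To make the permutation logic concrete, here is a minimal NumPy sketch of a Mantel test. It is an illustration only, not the qdiv implementation: the function name mantel_sketch, the use of Pearson's r, the one-sided comparison, and the toy data are all assumptions made for this example.

```python
import numpy as np

def mantel_sketch(d1, d2, permutations=999, seed=0):
    """Illustrative Mantel test (Pearson r, one-sided); not qdiv's code."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)  # count each sample pair once
    x = d1[iu]
    r_obs = np.corrcoef(x, d2[iu])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        # Shuffle sample labels: rows and columns of d2 move together
        d2_perm = d2[np.ix_(perm, perm)]
        r_perm = np.corrcoef(x, d2_perm[iu])[0, 1]
        hits += int(r_perm >= r_obs)
    # The observed arrangement counts as one of the permutations
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value

# Toy data (hypothetical): five samples taken on days 0, 7, 14, 21, 28
days = np.array([0, 7, 14, 21, 28])
time_dist = np.abs(days[:, None] - days[None, :])
community_dist = time_dist / time_dist.max()  # tracks time by construction
r, p = mantel_sketch(community_dist, time_dist)
print(r, p)
```

The key point is that whole samples are permuted, not individual matrix entries, which preserves the dependence among distances that share a sample.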
To demonstrate the Mantel test function, we will load the Modin2025 dataset.
[1]:
import qdiv
obj = qdiv.MicrobiomeData.load_example("Modin2025")
obj.info()
MicrobiomeData object summary
----------------------------------------
Abundance table: 304 features x 26 samples
Total reads: 1707.8699786349998
Min reads/sample: 59.920679878
Max reads/sample: 71.30233359299999
Taxonomy table: 304 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: None
Tree: 958 nodes
Metadata table: 26 samples, columns: ['order', 'day', 'phase']
Metadata preview:
order day phase
sample
v2114 1 6 1
----------------------------------------
[3]:
import pandas as pd
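# Sampling day for each sample (pandas Series indexed by sample name)
time = obj.meta["day"]
# Broadcast the day vector against itself to get all pairwise differences
distmat_time = time.to_numpy()[:, None] - time.to_numpy()[None, :]
# Take absolute values and restore the sample labels on both axes
distmat_time = pd.DataFrame(abs(distmat_time), index=time.index, columns=time.index)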
print(distmat_time.iloc[:5, :5])
sample v2114 v2115 v2116 v2117 v2118
sample
v2114 0 7 14 21 28
v2115 7 0 7 14 21
v2116 14 7 0 7 14
v2117 21 14 7 0 7
v2118 28 21 14 7 0
The variable distmat_time is now a pandas DataFrame holding the pairwise time differences between the samples in the dataset. Next, we'll calculate the dissimilarity in microbial community composition.
[4]:
dis = qdiv.diversity.naive_beta(obj, q=1)
print(dis.iloc[:5, :5])
v2114 v2115 v2116 v2117 v2118
v2114 0.000000 0.124359 0.185928 0.281471 0.376723
v2115 0.124359 0.000000 0.040544 0.112424 0.201890
v2116 0.185928 0.040544 0.000000 0.056570 0.130441
v2117 0.281471 0.112424 0.056570 0.000000 0.043726
v2118 0.376723 0.201890 0.130441 0.043726 0.000000
Now dis is a dissimilarity matrix, calculated using the naive_beta function with a diversity order of 1. Let’s run the Mantel test.
[5]:
res = qdiv.stats.mantel(dis, distmat_time, method="spearman", permutations=999)
print(res)
[0.1792632415525296, np.float64(0.001)]
The result is a list whose first element is the Mantel test statistic and whose second element is the permutation-based p-value. For correlation-based methods, the statistic is reported as a dissimilarity (1 − ρ or 1 − r), so smaller values indicate a stronger positive association between the two distance matrices. Here, the statistic of about 0.18 corresponds to a Spearman correlation of ρ ≈ 0.82. The low p-value (0.001) suggests a statistically significant relationship between temporal distance and microbial community dissimilarity: samples taken further apart in time tend to differ more in community composition.
PERMANOVA
Permutational multivariate analysis of variance (PERMANOVA) tests whether groups of samples differ in their multivariate structure, as defined by a chosen dissimilarity measure (Anderson, 2001). Rather than operating directly on raw multivariate observations, PERMANOVA partitions variation in a dissimilarity matrix according to an ANOVA‑style experimental design and assesses significance using permutations. Conceptually, PERMANOVA asks whether samples belonging to different groups are, on average, more dissimilar from each other than expected by chance, given the overall structure of the distance matrix.
For each factor included in the model, PERMANOVA tests the null hypothesis that group centroids are identical in the space defined by the dissimilarity matrix. In practice, this corresponds to asking whether the observed partitioning of pairwise dissimilarities among groups is stronger than would be expected under random reassignment of samples to groups.
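The partitioning described above can be sketched for the simplest case, a single factor, using the sums of squared distances from Anderson (2001). This is a minimal illustration and not the qdiv implementation: the function names pseudo_f and permanova_sketch and the toy data are assumptions, and multi-factor designs with interactions need more machinery than is shown here.

```python
import numpy as np

def pseudo_f(dist, labels):
    """One-way pseudo-F computed from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = dist.shape[0]
    groups = np.unique(labels)
    sq = dist ** 2
    # Total SS: squared distances over all pairs, divided by N
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group SS: the same quantity computed inside each group
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    df_among, df_resid = len(groups) - 1, n - len(groups)
    return (ss_among / df_among) / (ss_within / df_resid)

def permanova_sketch(dist, labels, permutations=999, seed=0):
    """Permutation p-value obtained by shuffling group labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(dist, labels)
    f_perm = np.array(
        [pseudo_f(dist, rng.permutation(labels)) for _ in range(permutations)]
    )
    p_value = (np.sum(f_perm >= f_obs) + 1) / (permutations + 1)
    return f_obs, p_value

# Toy data (hypothetical): two well-separated groups of four points on a line
coords = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
dist = np.abs(coords[:, None] - coords[None, :])
labels = np.array(["a"] * 4 + ["b"] * 4)
f_obs, p = permanova_sketch(dist, labels)
print(f_obs, p)
```

Because only the distance matrix and the labels enter the calculation, the same code works for any dissimilarity measure, which is the main appeal of the method.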
Here we demonstrate PERMANOVA using the Saheb‑Alam2019_DADA2 example dataset. First, we load the data and compute a dissimilarity matrix.
[11]:
import qdiv
obj = qdiv.MicrobiomeData.load_example("Saheb-Alam2019_DADA2")
obj.info()
dis = qdiv.diversity.naive_beta(obj, q=1)
print(dis.iloc[:5, :5])
MicrobiomeData object summary
----------------------------------------
Abundance table: 672 features x 16 samples
Total reads: 504186.0
Min reads/sample: 9783.0
Max reads/sample: 82777.0
Taxonomy table: 672 features, levels: ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
Sequence table: 672 features
Tree: 1342 nodes
Metadata table: 16 samples, columns: ['location', 'feed', 'mfc']
Metadata preview:
location feed mfc
sample
S4 anode acetate B
----------------------------------------
S4 S5 S6 S7 S10
S4 0.000000 0.058848 0.049525 0.074329 0.837658
S5 0.058848 0.000000 0.057854 0.073645 0.778168
S6 0.049525 0.057854 0.000000 0.067191 0.819105
S7 0.074329 0.073645 0.067191 0.000000 0.853965
S10 0.837658 0.778168 0.819105 0.853965 0.000000
The metadata include information about both location (anode vs. cathode) and feed (acetate vs. glucose). We can test the effects of these two factors, as well as their interaction, on microbial community structure:
[12]:
res = qdiv.stats.permanova(dis, obj, by=["location", "feed"], include_interaction=True, permutations=999)
print(res["table"])
df SS MS F p R2
term
location 1 1.295689 1.295689 173.112268 0.001 0.331991
feed 1 1.397021 1.397021 186.650861 0.001 0.357955
location:feed 1 0.472099 0.472099 63.075462 0.001 0.120965
Residual 12 0.089816 0.007485 NaN NaN 0.023013
Each row of the table corresponds to a model term, including main effects and interactions. The columns report:
df: degrees of freedom for the term,
SS and MS: sums of squares and mean squares derived from the distance matrix,
F: pseudo‑F statistic,
p: permutation‑based p‑value,
R2: proportion of total variation explained by the term.
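The columns are linked by the usual ANOVA relationships, which can be checked against the printed table. The sketch below copies df and SS from the output above and assumes that MS = SS / df and that each pseudo-F is the term's MS divided by the residual MS. The printed values are consistent with this, but it is a sanity check rather than a description of qdiv's internals.

```python
import pandas as pd

# df and SS copied from the PERMANOVA table above
table = pd.DataFrame(
    {"df": [1, 1, 1, 12], "SS": [1.295689, 1.397021, 0.472099, 0.089816]},
    index=["location", "feed", "location:feed", "Residual"],
)
# Mean square: sum of squares per degree of freedom
table["MS"] = table["SS"] / table["df"]
# Pseudo-F: term mean square relative to the residual mean square
table["F"] = table["MS"] / table.loc["Residual", "MS"]
table.loc["Residual", "F"] = float("nan")  # no F for the residual row
print(table.round(3))
```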
In this example, both location and feed have statistically significant effects on microbial community composition, and there is also a significant interaction between the two factors (p = 0.001 for all terms).
Bootstrap summaries of dissimilarity matrices
While Mantel tests and PERMANOVA are useful for hypothesis testing, it is often equally important to quantify the magnitude and uncertainty of dissimilarity-based effects. The function bootstrap_sample_matrix estimates within‑group and between‑group dissimilarity summaries together with confidence intervals. Given a square dissimilarity matrix and one or more metadata variables, bootstrap_sample_matrix answers questions such as:
Are samples more similar within groups than between groups?
How large is this difference, and how uncertain is it?
Does the answer depend on how samples are grouped (or crossed)?
Simple summaries of pairwise dissimilarities, such as the mean and standard deviation of all within-group distances, have two limitations:
Pairwise distances are not independent: Each sample appears in many distance pairs, so treating distances as independent observations leads to misleading uncertainty estimates.
No principled uncertainty estimates: Standard deviations of pairwise dissimilarities do not correspond to sampling uncertainty of group-level summaries.
bootstrap_sample_matrix overcomes these issues by using a nested bootstrap that resamples samples within the structure defined by metadata. While PERMANOVA tests whether group structure in the dissimilarity matrix is stronger than expected by chance (hypothesis testing), bootstrap_sample_matrix estimates the magnitude and uncertainty of within‑ and between‑group dissimilarities (effect size estimation). Together, they provide a more complete picture of the data.
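The idea of resampling samples rather than distances can be illustrated with a simplified, single-factor sketch. It is not qdiv's nested bootstrap: the function name within_between_bootstrap, the stratified resampling scheme, the percentile intervals, and the toy data are all assumptions chosen to keep the example short.

```python
import numpy as np

def within_between_bootstrap(dist, labels, n_boot=1000, alpha=0.05, seed=0):
    """Sample-level bootstrap of mean within/between-group dissimilarity."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    members = [np.flatnonzero(labels == g) for g in np.unique(labels)]

    def _mean(x):
        return x.mean() if x.size else np.nan

    def summarise(idx):
        sub = dist[np.ix_(idx, idx)]
        lab = labels[idx]
        i, j = np.triu_indices(len(idx), k=1)
        distinct = idx[i] != idx[j]  # skip a sample paired with its own copy
        same = lab[i] == lab[j]
        d = sub[i, j]
        return _mean(d[distinct & same]), _mean(d[distinct & ~same])

    observed = summarise(np.arange(len(labels)))
    boot = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Resample whole samples with replacement, separately in each group
        idx = np.concatenate(
            [rng.choice(m, size=len(m), replace=True) for m in members]
        )
        boot[b] = summarise(idx)
    # Percentile confidence limits for the within and between columns
    lo, hi = np.nanpercentile(
        boot, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return {
        "within": {"mean": observed[0], "ci": (lo[0], hi[0])},
        "between": {"mean": observed[1], "ci": (lo[1], hi[1])},
    }

# Toy data (hypothetical): two groups of four points on a line
coords = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
toy_dist = np.abs(coords[:, None] - coords[None, :])
toy_labels = np.array(["a"] * 4 + ["b"] * 4)
toy_res = within_between_bootstrap(toy_dist, toy_labels)
print(toy_res)
```

Because every bootstrap replicate redraws samples and then recomputes all affected distances, the dependence among distances that share a sample is carried into the interval instead of being ignored.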
Let’s use the function on the data we used for PERMANOVA above.
[13]:
res = qdiv.stats.bootstrap_sample_matrix(
    dis,
    obj,
    by=["location", "feed"],
    n_boot=1000,
    alpha=0.05
)
print(res)
{'location': {'within': {'mean': 0.4621659257585619, 'ci': (0.4325898452362084, 0.48039868557728466), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8164422478698461, 'ci': (0.798834829662802, 0.8311924760509425), 'total_pairs': 64, 'boot': None}}, 'feed': {'within': {'mean': 0.4407172121543751, 'ci': (0.41995037102282134, 0.4574687873569055), 'total_pairs': 56, 'boot': None}, 'between': {'mean': 0.8352098722735096, 'ci': (0.8129252399380823, 0.854929994893634), 'total_pairs': 64, 'boot': None}}, 'Crossed': {'within': {'mean': 0.07748822634606517, 'ci': (0.049742913842534354, 0.09997940745202433), 'total_pairs': 24, 'boot': None}, 'between': {'mean': 0.7945195653525423, 'ci': (0.775871459763348, 0.8107586184803848), 'total_pairs': 96, 'boot': None}}}
The output is a dictionary containing results for the variables "location" and "feed", as well as for their crossed combination (key "Crossed"). For each entry, the "within" and "between" dissimilarities are reported as a mean value and a confidence interval (ci), together with the number of sample pairs involved (total_pairs). Let's look specifically at the confidence intervals for two sets of pairs: samples that are from the same location and receive the same feed ("within"), and samples that are from different locations or receive different feeds ("between"):
[14]:
print(res["Crossed"]["within"]["ci"])
print(res["Crossed"]["between"]["ci"])
(0.049742913842534354, 0.09997940745202433)
(0.775871459763348, 0.8107586184803848)
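The nested dictionary can be easier to read as a table. The sketch below flattens two of the entries into a pandas DataFrame. The dictionary literal is copied from the printed output above, with values rounded to four decimals and the boot field omitted, so the example runs without qdiv. The column names in the table are choices made for this example.

```python
import pandas as pd

# Two entries copied from the printed result above (rounded to 4 decimals)
res_copy = {
    "location": {
        "within": {"mean": 0.4622, "ci": (0.4326, 0.4804), "total_pairs": 56},
        "between": {"mean": 0.8164, "ci": (0.7988, 0.8312), "total_pairs": 64},
    },
    "Crossed": {
        "within": {"mean": 0.0775, "ci": (0.0497, 0.1000), "total_pairs": 24},
        "between": {"mean": 0.7945, "ci": (0.7759, 0.8108), "total_pairs": 96},
    },
}

# One row per (grouping, within/between) combination
rows = [
    {
        "grouping": name,
        "kind": kind,
        "mean": stats["mean"],
        "ci_low": stats["ci"][0],
        "ci_high": stats["ci"][1],
        "pairs": stats["total_pairs"],
    }
    for name, parts in res_copy.items()
    for kind, stats in parts.items()
]
summary = pd.DataFrame(rows).set_index(["grouping", "kind"])
print(summary)

# Check whether the within interval lies entirely below the between interval
crossed = summary.loc["Crossed"]
separated = crossed.loc["within", "ci_high"] < crossed.loc["between", "ci_low"]
print(separated)
```

For the crossed grouping the two intervals do not overlap, so samples sharing both location and feed are clearly more similar to each other than samples that differ in at least one of the two factors.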