qdiv.stats

stats

Provides functions for multivariate statistical tests and ordinations.

qdiv.stats.mantel(dis1, dis2, method='spearman', getOnlyStat=False, permutations=999, *, random_state=None, **kwargs)[source]

Perform a Mantel test to assess the association between two dissimilarity matrices.

The Mantel test evaluates whether pairs of samples that are close (or far apart) in one dissimilarity matrix tend to be close (or far apart) in another. The test statistic is computed by comparing the lower‑triangular entries of the two matrices, and statistical significance is assessed using a permutation test.

For correlation-based methods, the association is quantified as a dissimilarity (1 − r), where r is the Pearson or Spearman correlation between the vectorized distance matrices.

Parameters:
  • dis1 (pandas.DataFrame) – First square distance or dissimilarity matrix (samples × samples) with identical row and column labels.

  • dis2 (pandas.DataFrame) – Second square distance or dissimilarity matrix (samples × samples) with identical row and column labels matching dis1.

  • method ({'spearman', 'pearson', 'absDist'}, default='spearman') –

    Measure used to quantify association between distance matrices:

    • ‘spearman’ :

      Spearman rank correlation between distances (reported as 1 − ρ).

    • ‘pearson’ :

      Pearson correlation between distances (reported as 1 − r).

    • ‘absDist’ :

      Mean absolute difference between corresponding distances.

  • getOnlyStat (bool, default=False) – If True, return only the observed test statistic without performing permutations.

  • permutations (int, default=999) – Number of permutations used to approximate the null distribution.

  • random_state (int | numpy.random.Generator | None) – Random seed or NumPy random generator for reproducible permutations.

Returns:

  • If getOnlyStat=True, returns the observed statistic only.

  • Otherwise, returns a list containing the observed statistic and its permutation-based p-value.

Return type:

float or list [statistic, p_value]

Notes

  • The test uses only the lower triangular part of each distance matrix (excluding the diagonal), avoiding double counting of pairwise distances.

  • Sample labels are permuted in dis1 while dis2 is held fixed to generate the null distribution.

  • For correlation-based methods (‘pearson’, ‘spearman’), the reported statistic is a dissimilarity (1 − r or 1 − ρ), so smaller values indicate stronger association between the two matrices.

  • p-values are computed using a standard permutation test with a +1 correction: (count + 1) / (permutations + 1).
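The notes above can be illustrated with a minimal NumPy sketch. This is not the qdiv implementation: the names `mantel_sketch` and `average_ranks` are hypothetical, only the Spearman variant is shown, and counting permuted statistics that are less than or equal to the observed one is an assumption that follows from the statistic being a dissimilarity.

```python
import numpy as np


def average_ranks(x):
    """Ranks with ties replaced by their average rank (needed for Spearman)."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)  # highest rank within each group of ties
    return (last - (counts - 1) / 2.0)[inverse]


def mantel_sketch(dis1, dis2, permutations=999, seed=None):
    """Illustrative Mantel test returning (1 - rho, p-value)."""
    a = np.asarray(dis1, dtype=float)
    b = np.asarray(dis2, dtype=float)  # assumed to share the sample order of dis1
    rng = np.random.default_rng(seed)
    low = np.tril_indices(a.shape[0], k=-1)  # lower triangle, diagonal excluded
    ranks_b = average_ranks(b[low])

    def stat(m):
        rho = np.corrcoef(average_ranks(m[low]), ranks_b)[0, 1]
        return 1.0 - rho  # dissimilarity: smaller means stronger association

    observed = stat(a)
    count = 0
    for _ in range(permutations):
        p = rng.permutation(a.shape[0])
        # Permute rows and columns of the first matrix together; dis2 stays fixed.
        if stat(a[np.ix_(p, p)]) <= observed:
            count += 1
    return observed, (count + 1) / (permutations + 1)
```

With two identical matrices the statistic is 0 and the p-value approaches its minimum of 1 / (permutations + 1).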

qdiv.stats.permanova(dis, meta, by, *, permutations=999, include_interaction=False, strata=None, random_state=None, perm_scheme='freedman-lane', **kwargs)[source]

PERMANOVA (Anderson, 2001) implemented via projection matrices on the Gower‑centered distance matrix.

This function fits a distance‑based linear model with one or two categorical factors (optionally including their interaction) and tests each term using permutation-based pseudo‑F statistics. Tests are marginal (partial): each term is evaluated conditional on all other included terms.

Permutation inference can be performed either by permuting sample labels or by permuting residuals from reduced models (Freedman–Lane scheme), with optional restriction of permutations within exchangeability blocks (strata).

Parameters:
  • dis ((n x n) pandas.DataFrame) – Symmetric distance or dissimilarity matrix with identical row and column labels. Rows/columns correspond to samples.

  • meta (pandas.DataFrame | dict | MicrobiomeData-like) – Sample metadata indexed by sample IDs matching dis.index.

  • by (str or list[str]) – One or two column names in meta defining the categorical factor(s).

  • permutations (int, default 999) – Number of permutations used to approximate the null distribution.

  • include_interaction (bool, default False) – If by contains two factors and both have more than one level, include and test their interaction term.

  • strata (str | list[str] | None) – Column name(s) in meta defining exchangeability blocks. When given, permutations are restricted to occur within each stratum only (i.e. blocked permutations).

  • random_state (int | numpy.random.Generator | None) – Random seed or generator for reproducible permutations.

  • perm_scheme ({'labels', 'freedman-lane'}, default 'freedman-lane') –

    Permutation scheme used to generate the null distribution:

    • ‘labels’:

      Classical label permutation. Sample labels (factor assignments) are permuted across samples while the distance matrix is kept fixed. When two factors are provided, their labels are permuted jointly, preserving observed factor combinations. Permutations may be restricted within strata if specified.

    • ‘freedman-lane’:

      Residual-based permutation (Freedman & Lane, 1983). For each tested term, residuals from the reduced model excluding that term are permuted (optionally within strata), added back to the fitted values of the reduced model, and the full model is refitted. This scheme yields valid partial tests in the presence of nuisance factors and allows testing main effects even when a factor is constant within strata.
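Restricting permutations to exchangeability blocks, as described for the strata parameter, can be sketched as follows. The helper name `permute_within_strata` is hypothetical and the code only illustrates the idea of blocked permutations; it is not taken from qdiv.

```python
import numpy as np


def permute_within_strata(strata, rng):
    """Return a permutation of 0..n-1 that shuffles indices only inside each stratum."""
    strata = np.asarray(strata)
    perm = np.arange(strata.size)
    for level in np.unique(strata):
        idx = np.flatnonzero(strata == level)  # positions belonging to this block
        perm[idx] = rng.permutation(idx)       # shuffle them among themselves
    return perm
```

Applying the returned index vector to labels (or to residuals) keeps every sample inside its original block, so between-block differences never enter the null distribution.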

Returns:

A dictionary with the following entries:

  • ‘by’:

    List of tested term names (main effects and, if included, interaction).

  • ‘table’:

    pandas.DataFrame with rows corresponding to model terms and the residual, and columns: [‘df’, ‘SS’, ‘MS’, ‘F’, ‘p’, ‘R2’].

  • ‘permutations’:

    Number of permutations performed.

  • ‘strata’:

    List of strata column names used for restricted permutations, or None.

  • ‘perm_scheme’:

    The permutation scheme used (‘labels’ or ‘freedman-lane’).

Return type:

dict

Notes

  • The analysis follows the geometric partitioning of sums of squares described by Anderson (2001), using projection (hat) matrices on the Gower‑centered distance matrix.

  • P‑values are estimated from the permutation distribution using a standard +1 correction: (count + 1) / (permutations + 1).

  • If a tested factor does not vary within strata under label permutation, the corresponding null distribution may be degenerate and p‑values will be returned as NaN.
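As a rough illustration of the projection-matrix approach in the notes above, the sketch below computes a one-factor pseudo-F from the Gower-centered matrix and a label-permutation p-value. The function names are hypothetical, only a single factor without strata is covered, and the code should not be read as the qdiv implementation.

```python
import numpy as np


def gower_center(dis):
    """Gower-centered matrix G = -1/2 * C * D^2 * C, with C the centering matrix."""
    d = np.asarray(dis, dtype=float)
    n = d.shape[0]
    c = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * c @ (d ** 2) @ c


def pseudo_f(g, groups):
    """Pseudo-F for one categorical factor from a Gower-centered matrix g."""
    groups = np.asarray(groups)
    levels = np.unique(groups)
    x = (groups[:, None] == levels[None, :]).astype(float)  # indicator matrix
    h = x @ np.linalg.pinv(x)                                # hat (projection) matrix
    n, a = x.shape
    ss_among = np.trace(h @ g)
    ss_resid = np.trace(g) - ss_among
    return (ss_among / (a - 1)) / (ss_resid / (n - a))


def permanova_sketch(dis, groups, permutations=999, seed=None):
    """Return (pseudo-F, p-value) using plain label permutation."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    g = gower_center(dis)
    f_obs = pseudo_f(g, groups)
    count = sum(pseudo_f(g, rng.permutation(groups)) >= f_obs
                for _ in range(permutations))
    return f_obs, (count + 1) / (permutations + 1)
```

For Euclidean distances on a single variable, the pseudo-F computed this way equals the classical one-way ANOVA F, which is a convenient sanity check.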

qdiv.stats.gower(meta=None, *, by=None, return_similarity=False)[source]

Compute the Gower distance matrix for a pandas DataFrame containing mixed variable types (numeric, categorical/boolean, datetime).

Parameters:
  • meta (pd.DataFrame, dict, or MicrobiomeData object) – Input data. Rows are samples; columns are variables.

  • by (Sequence[str] or str, optional) – Variable names (columns) to include. If None, all columns are included.

  • return_similarity (bool, optional) – If True, return Gower similarity (1 - distance). Default False (distance).

Returns:

Pairwise Gower distances (or similarities) between samples (rows).

Return type:

pandas.DataFrame

Notes

  • Numerical variables are scaled by their range (max - min). If the range is 0 (constant column), that variable contributes 0 for all pairs.

  • Datetime variables are converted to days (float) and treated as numeric.

  • Categorical/boolean variables contribute 0 when equal, 1 when different.

  • Missing values: a variable only contributes for row pairs where it is present in both rows; the per-pair denominator is the count of contributing variables for that pair.
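The rules in the notes above can be sketched in a few lines of pandas and NumPy. `gower_sketch` is a hypothetical name, and details such as measuring datetimes in days from the earliest timestamp are assumptions made for illustration rather than a description of qdiv internals.

```python
import numpy as np
import pandas as pd


def gower_sketch(df):
    """Illustrative Gower distance for a DataFrame with mixed column types."""
    n = len(df)
    num = np.zeros((n, n))  # sum of per-variable distances for each pair
    den = np.zeros((n, n))  # number of variables contributing to each pair
    for col in df.columns:
        s = df[col]
        present = s.notna().to_numpy()
        both = np.outer(present, present)  # pairs where the variable is present twice
        if pd.api.types.is_datetime64_any_dtype(s):
            s = (s - s.min()) / pd.Timedelta(days=1)  # datetimes as float days
        if pd.api.types.is_numeric_dtype(s) and not pd.api.types.is_bool_dtype(s):
            v = s.to_numpy(dtype=float)
            value_range = np.nanmax(v) - np.nanmin(v)
            diff = np.abs(v[:, None] - v[None, :])
            # A constant column (range 0) contributes 0 to every pair.
            d = diff / value_range if value_range > 0 else np.zeros((n, n))
        else:
            v = s.to_numpy(dtype=object)
            d = (v[:, None] != v[None, :]).astype(float)  # 0 if equal, 1 if different
        num += np.where(both, d, 0.0)
        den += both
    with np.errstate(invalid="ignore", divide="ignore"):
        out = num / den  # NaN where no variable is shared by the pair
    return pd.DataFrame(out, index=df.index, columns=df.index)
```

A pair of samples that shares no observed variable ends up with NaN in this sketch, because its denominator is zero.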

qdiv.stats.pcoa_lingoes(dis)[source]

Perform Principal Coordinates Analysis (PCoA) using the Lingoes correction.

The Lingoes correction transforms a non‑Euclidean distance matrix into a Euclidean one by adding a constant to all squared distances, ensuring that all eigenvalues are non‑negative. PCoA is then performed on the corrected matrix to obtain principal coordinate axes.

Parameters:

dis (pandas.DataFrame) – Square distance matrix (rows and columns represent samples). Values must be non‑negative and the matrix must be symmetric.

Returns:

  • coords_df (pandas.DataFrame) – Principal coordinate scores (samples × axes), ordered by decreasing eigenvalue magnitude.

  • eigvals (pandas.Series) – Eigenvalues associated with each axis (only the positive eigenvalues after Lingoes correction).

  • pct_explained (pandas.Series) – Percentage of total variance explained by each axis (positive eigenvalues only).

  • total_variance (float) – Sum of all positive eigenvalues after correction.

Return type:

tuple (pandas.DataFrame, pandas.Series, pandas.Series, float)

Notes

  • The Lingoes correction is applied only if negative eigenvalues are detected.

  • The output coordinates are centered and scaled according to standard PCoA conventions.
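A compact NumPy sketch of the procedure is given below, under the usual formulation of the Lingoes correction: the constant c1 is the absolute value of the most negative eigenvalue of the Gower-centered matrix, and 2·c1 is added to every off-diagonal squared distance. The function name, the tolerance, and the returned tuple layout are assumptions for illustration only.

```python
import numpy as np


def pcoa_lingoes_sketch(dis, tol=1e-10):
    """Illustrative PCoA with Lingoes correction; returns (coords, eigvals, pct, c1)."""
    d = np.asarray(dis, dtype=float)
    n = d.shape[0]
    center = np.eye(n) - np.ones((n, n)) / n
    g = -0.5 * center @ (d ** 2) @ center
    c1 = 0.0
    smallest = np.linalg.eigvalsh(g).min()
    if smallest < -tol:
        # Correction is applied only when a negative eigenvalue is present.
        c1 = -smallest
        d2 = d ** 2 + 2.0 * c1 * (1.0 - np.eye(n))  # off-diagonal entries only
        g = -0.5 * center @ d2 @ center
    vals, vecs = np.linalg.eigh(g)
    order = np.argsort(vals)[::-1]  # decreasing eigenvalue
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > tol               # positive eigenvalues only
    coords = vecs[:, keep] * np.sqrt(vals[keep])
    pct = 100.0 * vals[keep] / vals[keep].sum()
    return coords, vals[keep], pct, c1
```

After the correction, Euclidean distances between the returned coordinates reproduce the corrected distances sqrt(d² + 2·c1) rather than the original ones.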

qdiv.stats.dbrda(dis=None, meta=None, *, by=None, condition=None, n_axes=2, scale='site', perm_n=999, perm_seed=42, pcoa_fn=<function pcoa_lingoes>, per_var_perm=False, interactions=None, drop_first=True)[source]

Distance‑based Redundancy Analysis (db‑RDA).

This function performs constrained ordination on a distance matrix by:

  1. Converting the distance matrix into principal coordinates (PCoA) using the specified PCoA function (default: Lingoes correction).

  2. Regressing the PCoA coordinates onto explanatory variables.

  3. Extracting constrained axes, biplot scores, and variance components.

  4. Performing a global permutation test (Freedman–Lane).

  5. Optionally computing per‑variable permutation p‑values.

  6. Optionally including categorical interaction terms.

Parameters:
  • dis (pandas.DataFrame) – Square distance matrix (samples × samples). Must have matching row/column labels.

  • meta (pandas.DataFrame) – Metadata table containing explanatory variables (rows = samples).

  • by (str or list of str, optional) – Subset of metadata columns to use as explanatory variables. If None, all columns in meta are used.

  • condition (pandas.DataFrame, optional) – Conditioning variables for partial db‑RDA. Must align with meta.

  • n_axes (int, default=2) – Number of constrained axes to return.

  • scale ({'site', 'species'}, default='site') – Scaling for biplot scores.

  • perm_n (int, default=999) – Number of permutations for the global test.

  • perm_seed (int, default=42) – Random seed for reproducibility.

  • pcoa_fn (callable, default=pcoa_lingoes) – Function used to compute PCoA. Must return a dict with ‘site_scores’ and ‘eigenvalues’.

  • per_var_perm (bool, default=False) – If True, compute permutation p‑values for each predictor.

  • interactions (list of str, optional) – Variables for which interaction terms should be generated.

  • drop_first (bool, default=True) – Whether to drop the first dummy level when encoding categorical variables.

Returns:

A dictionary with the following entries:

  • ‘site_scores’ : pandas.DataFrame

  • ‘biplot_scores’ : pandas.DataFrame

  • ‘variable_contributions’ : pandas.DataFrame

  • ‘eigenvalues’ : numpy.ndarray

  • ‘explained_ratio’ : numpy.ndarray

  • ‘total_inertia’ : float

  • ‘constrained_inertia’ : float

  • ‘unconstrained_inertia’ : float

  • ‘F_global’ : float

  • ‘p_global’ : float

Return type:

dict

Notes

  • The global permutation test uses the Freedman–Lane procedure.

  • Partial db‑RDA is performed by residualizing both the response coordinates and the design matrix against the conditioning variables.

  • Interaction terms are constructed before dummy encoding.
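The regression and permutation steps can be sketched as below for the simplest case: a numeric design matrix and no conditioning variables. In that case the reduced model contains only the intercept, so permuting reduced-model residuals amounts to permuting rows of the centered response. Scaling inertia by n − 1 and the function names `rda_fit` and `dbrda_sketch` are assumptions for illustration; this is not the qdiv code.

```python
import numpy as np


def rda_fit(y, x):
    """Least-squares fit of centered coordinates y on centered predictors x."""
    n = y.shape[0]
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    fitted = x @ beta
    q = np.linalg.matrix_rank(x)  # model degrees of freedom
    constrained = (fitted ** 2).sum() / (n - 1)
    unconstrained = ((y - fitted) ** 2).sum() / (n - 1)
    f = (constrained / q) / (unconstrained / (n - 1 - q))
    return fitted, constrained, unconstrained, f


def dbrda_sketch(coords, design, permutations=999, seed=None):
    """Constrained ordination of PCoA coordinates with a global permutation test."""
    rng = np.random.default_rng(seed)
    y = np.asarray(coords, dtype=float)
    y = y - y.mean(axis=0)
    x = np.asarray(design, dtype=float)
    x = x - x.mean(axis=0)
    n = y.shape[0]
    fitted, constrained, unconstrained, f_obs = rda_fit(y, x)
    # Constrained axes come from the singular values of the fitted values.
    singular = np.linalg.svd(fitted, compute_uv=False)
    eigenvalues = singular ** 2 / (n - 1)
    count = sum(rda_fit(y[rng.permutation(n)], x)[3] >= f_obs
                for _ in range(permutations))
    return {
        "eigenvalues": eigenvalues,
        "total_inertia": (y ** 2).sum() / (n - 1),
        "constrained_inertia": constrained,
        "unconstrained_inertia": unconstrained,
        "F_global": f_obs,
        "p_global": (count + 1) / (permutations + 1),
    }
```

In this sketch, constrained and unconstrained inertia add up to the total inertia, and the constrained eigenvalues sum to the constrained inertia.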

qdiv.stats.summarize_dbrda(dis, meta, *, by=None, condition=None, interactions=None, pcoa_fn=<function pcoa_lingoes>, perm_n=999, perm_seed=42, drop_first=True, include_interpretation=True, include_alone=True)[source]

Summarize db‑RDA (global model + marginal factor tests).

This function:
  1. Runs db‑RDA once (global model).

  2. Runs marginal (partial) permutation tests per factor (Freedman–Lane).

  3. Aggregates % explained by original factors (from the full model).

  4. Computes R² and adjusted R².

  5. Returns a tidy DataFrame, optionally with textual interpretation.

Parameters:
  • dis (pandas.DataFrame) – Square distance matrix (rows/cols = samples). Index must match columns.

  • meta (pandas.DataFrame) – Metadata indexed by sample IDs.

  • by (str or list of str, optional) – Subset of metadata columns to use as explanatory variables. If None, all columns in meta are used.

  • condition (pandas.DataFrame, optional) – Covariates to partial out (same index as meta).

  • interactions (list of str, optional) – Variables for which interaction terms should be generated.

  • pcoa_fn (callable, default=pcoa_lingoes) – Function for the PCoA step; must return ‘site_scores’ and ‘eigenvalues’.

  • perm_n (int, default=999) – Number of permutations for marginal tests.

  • perm_seed (int, default=42) – Random seed for permutations.

  • drop_first (bool, default=True) – Drop first level in categorical encoding (reference coding).

  • include_interpretation (bool, default=True) – If True, adds a textual interpretation column.

  • include_alone (bool, default=True) – If True, keeps “alone” diagnostics (factor-alone %-explained, p-alone).

Returns:

Columns (by default):
  • Factor

  • pct_explained (full model)

  • df_added

  • delta_inertia

  • pct_explained (marginal)

  • F

  • p-marginal

  • inertia_alone

  • pct_explained (alone)

  • p-alone

  • Interpretation (optional)

Attributes (df.attrs):
  • ‘R²’ : float

  • ‘Adjusted R²’ : float

  • ‘F_global’ : float

  • ‘p_global’ : float

  • ‘Total inertia’ : float

  • ‘Constrained inertia’ : float

  • ‘Unconstrained inertia’ : float

  • ‘n’ : int (samples)

  • ‘df_model’ : int (approx. number of fitted parameters)

Return type:

pandas.DataFrame
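The R² and adjusted R² attributes relate to the inertia components in a simple way. The sketch below assumes the common Ezekiel adjustment; the source does not state which formula qdiv uses, and `r2_and_adjusted` is a hypothetical helper.

```python
def r2_and_adjusted(constrained_inertia, total_inertia, n, df_model):
    """Return (R2, adjusted R2) for n samples and df_model fitted parameters."""
    r2 = constrained_inertia / total_inertia
    # Ezekiel adjustment (assumed): penalizes R2 for the number of parameters.
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - df_model - 1)
    return r2, adjusted
```

For example, a constrained inertia of 3 out of a total of 10 with n = 21 and df_model = 4 gives R² = 0.3 and an adjusted R² of 0.125.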

qdiv.stats.corr(meta, columns=None, column_types=None, method='spearman', permutations=999, random_state=None, *, return_coding=True)[source]

Compute a mixed-type correlation/association matrix together with permutation p-values, Benjamini–Hochberg FDR-adjusted p-values (padj), and pairwise sample counts.

Pairwise rules:
  • num <-> num => Pearson/Spearman (pandas)

  • cat (binary) <-> num => point-biserial

  • cat (multi) <-> num => correlation ratio (eta)

  • cat <-> cat => bias-corrected Cramér’s V
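Two of the less familiar measures in the rules above can be sketched with NumPy alone: the correlation ratio (eta) and Cramér's V with the bias correction proposed by Bergsma (2013). The function names are hypothetical, and whether qdiv applies exactly this correction is an assumption.

```python
import numpy as np


def correlation_ratio(categories, values):
    """Eta: square root of between-group over total sum of squares."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        (categories == k).sum() * (values[categories == k].mean() - grand) ** 2
        for k in np.unique(categories)
    )
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0


def cramers_v_corrected(a, b):
    """Bias-corrected Cramér's V for two categorical vectors."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(table, (ai, bi), 1)  # contingency table of counts
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    phi2 = ((table - expected) ** 2 / expected).sum() / n
    r, k = table.shape
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0
```

Both measures range from 0 (no association) to 1 (perfect association) and carry no sign, unlike the Pearson, Spearman, and point-biserial coefficients.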

Parameters:
  • meta (dict-like or object recognized by get_df(meta, "meta")) – Must contain a pandas DataFrame with metadata under key/name “meta”.

  • columns (list[str], optional) – Columns to include. Defaults to all columns.

  • column_types (list[str], optional) – List of types corresponding to the data columns. Each entry is ‘cat’ for categorical data or ‘num’ for numerical data. If not provided, column types are inferred automatically.

  • method ({"pearson", "spearman"}, default "spearman") – Method for numeric–numeric pairs.

  • permutations (int, default 999) – Number of permutations per pair for p-values.

  • random_state (int, default None) – Random seed for reproducibility.

  • return_coding (bool, default True) – If True, include the ‘coding’ entry in the returned dictionary.

Returns:

A dictionary with the following entries:

  • R : effect sizes (symmetric; diag=1.0)

  • P : raw permutation p-values (symmetric; diag=NaN)

  • padj : BH-FDR adjusted p-values (symmetric; diag=NaN)

  • N : pairwise complete case counts (symmetric; diag=#non-missing in column)

  • coding : [(‘var1’, ‘var2’)] shows information about how to interpret the correlations

Return type:

{‘R’: DataFrame, ‘p’: DataFrame, ‘padj’: DataFrame, ‘N’: DataFrame, ‘coding’: dict (included if return_coding=True)}

qdiv.stats.bootstrap_sample_matrix(df, meta, by, *, n_boot=1000, alpha=0.05, random_state=None, return_boot=False, warn_small=True, **kwargs)[source]

Compute bootstrap confidence intervals for within‑ and between‑group summaries of a square sample×sample matrix.

For each variable listed in by, the function computes two aggregated summary statistics:

  • within: the average pairwise value among samples sharing the same category of that variable, and

  • between: the average pairwise value among samples belonging to different categories of that variable.

These summaries are estimated using a nested bootstrap. Samples are resampled with replacement within each fully crossed cell defined by by. For each bootstrap replicate, category‑level values are pooled across the remaining factor(s), and summary statistics are computed using pair‑count weighting.

To avoid artificial inflation of zero or repeated distances caused by bootstrap duplicates, resampled indices are deduplicated prior to computing pairwise distances. As a consequence, the effective number of contributing sample pairs may vary across bootstrap replicates.

In addition to per‑variable summaries, the function returns a single “Crossed” aggregation that pools across all fully crossed cells, providing one overall within‑cell and one overall between‑cell summary computed using the same bootstrap resamples.

Parameters:
  • df ((n x n) pandas.DataFrame) – Symmetric sample×sample matrix with identical row/column labels and order.

  • meta (pandas.DataFrame | dict | MicrobiomeData-like) – Metadata indexed by sample IDs matching df.index. Metadata will be aligned to df.index without reordering df.

  • by (str | list[str]) – One or more categorical metadata columns defining the fully crossed cells used for nested resampling.

  • n_boot (int, default 1000) – Number of bootstrap replicates.

  • alpha (float, default 0.05) – Significance level of the percentile confidence interval; the interval covers 1 − alpha (95% CI if alpha = 0.05).

  • random_state (int | numpy.random.Generator | None) – Random seed or NumPy random generator for reproducible resampling.

  • return_boot (bool, default False) – If False, the returned dictionary omits the bootstrap replicate arrays to reduce memory usage.

  • warn_small (bool, default True) – If True, issue warnings when cells or categories contain very few samples, which may lead to unstable bootstrap estimates.

Returns:

out – Dictionary with the following structure:

    {
        "<var>": {
            "within":  {"mean": float, "ci": (lo, hi), "total_pairs": int, "boot": np.ndarray | None},
            "between": {"mean": float, "ci": (lo, hi), "total_pairs": int, "boot": np.ndarray | None}
        },
        …,
        "Crossed": {
            "within":  {"mean": float, "ci": (lo, hi), "total_pairs": int, "boot": np.ndarray | None},
            "between": {"mean": float, "ci": (lo, hi), "total_pairs": int, "boot": np.ndarray | None}
        }
    }

One entry is returned for each variable in by, each containing a single within‑ and between‑category summary. If more than one variable is supplied, an additional “Crossed” entry provides summaries pooled across all fully crossed cells.

Return type:

dict

Notes

  • Bootstrap resampling is performed at the sample level within fully crossed cells, but duplicate resampled indices are deduplicated before computing pairwise distances. This yields a bootstrap of pairwise distance summaries rather than a bootstrap of raw samples.

  • Because of deduplication, the effective number of contributing sample pairs can differ between bootstrap replicates. The reported total_pairs value corresponds to the average number of distinct pairs contributing per bootstrap replicate, not a fixed property of the original dataset.

  • Pairwise summaries are computed using the upper triangle for within‑group blocks and the full rectangular block for between‑group blocks.
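The resampling idea in the notes above can be sketched for a single grouping variable as follows. The names `within_between` and `bootstrap_sketch` are hypothetical, the point estimate is taken from the full data as an assumption, and the crossed-cell pooling of the real function is not reproduced.

```python
import numpy as np


def within_between(mat, groups):
    """Pair-count-weighted means of within-group and between-group entries."""
    mat = np.asarray(mat, dtype=float)
    groups = np.asarray(groups)
    rows, cols = np.triu_indices(len(groups), k=1)  # each unordered pair once
    same = groups[rows] == groups[cols]
    vals = mat[rows, cols]
    within = vals[same].mean() if same.any() else np.nan
    between = vals[~same].mean() if (~same).any() else np.nan
    return within, between


def bootstrap_sketch(mat, groups, n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap CIs for within/between means, one grouping variable."""
    rng = np.random.default_rng(seed)
    mat = np.asarray(mat, dtype=float)
    groups = np.asarray(groups)
    boots = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Resample with replacement inside each group, then drop duplicates so
        # that a sample never forms a zero-distance pair with its own copy.
        idx = np.concatenate([
            np.unique(rng.choice(np.flatnonzero(groups == g),
                                 size=int((groups == g).sum()), replace=True))
            for g in np.unique(groups)
        ])
        boots[b] = within_between(mat[np.ix_(idx, idx)], groups[idx])
    lo, hi = np.nanpercentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    mean_within, mean_between = within_between(mat, groups)
    return {
        "within": {"mean": mean_within, "ci": (lo[0], hi[0])},
        "between": {"mean": mean_between, "ci": (lo[1], hi[1])},
    }
```

Because duplicates are removed, the number of pairs behind each replicate varies, which is why the replicate summaries are averaged over whatever pairs remain.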