qdiv.stats.data_stats module
Statistical calculations on the metadata.
- qdiv.stats.data_stats.corr(meta, columns=None, column_types=None, method='spearman', permutations=999, random_state=None, *, return_coding=True)[source]
Compute a mixed-type correlation/association matrix together with permutation p-values, Benjamini-Hochberg FDR-adjusted p-values (padj), and pairwise sample counts.
- Pairwise rules:
num <-> num => Pearson/Spearman (pandas)
cat (binary) <-> num => point-biserial
cat (multi) <-> num => correlation ratio (eta)
cat <-> cat => bias-corrected Cramér’s V
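The two less familiar rules above, the correlation ratio and the bias-corrected Cramér’s V, can be sketched in plain Python as follows. This is an illustration rather than qdiv’s implementation: the function names are hypothetical, and the Bergsma (2013) form of the bias correction is an assumption, since the documentation does not say which correction is used.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta): sqrt(between-group SS / total SS)."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V (assumed: Bergsma 2013 correction)."""
    n = len(x)
    if n < 2:
        return float("nan")
    table = Counter(zip(x, y))
    row, col = Counter(x), Counter(y)
    # Pearson chi-square statistic of the contingency table.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (table.get((a, b), 0) - expected) ** 2 / expected
    r, k = len(row), len(col)
    # Shrink phi^2 and the table dimensions to reduce small-sample bias.
    phi2_corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return math.sqrt(phi2_corr / denom) if denom > 0 else 0.0
```

Both statistics lie in [0, 1] and carry no sign, unlike the Pearson, Spearman and point-biserial coefficients.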
- Parameters:
meta (dict-like or object recognized by get_df(meta, "meta")) – Must contain a pandas DataFrame with metadata under key/name “meta”.
columns (list[str], optional) – Columns to include. Defaults to all columns.
column_types (list[str], optional) – List of types corresponding to the data columns. Each entry is either ‘cat’ for categorical data or ‘num’ for numerical data. If not provided, column types are inferred automatically.
method ({"pearson", "spearman"}, default "spearman") – Method for numeric–numeric pairs.
permutations (int, default 999) – Number of permutations per pair used to compute p-values.
random_state (int, default None) – Random seed for reproducibility.
return_coding (bool, default True) – Whether to include the ‘coding’ entry in the returned dict.
- Returns:
Dictionary with the following entries:
R : effect sizes (symmetric; diag=1.0)
p : raw permutation p-values (symmetric; diag=NaN)
padj : BH-FDR adjusted p-values (symmetric; diag=NaN)
N : pairwise complete-case counts (symmetric; diag=number of non-missing values in the column)
coding : [(‘var1’, ‘var2’)] gives information about how to interpret the correlation for that pair; present only if return_coding=True
- Return type:
dict – {‘R’: DataFrame, ‘p’: DataFrame, ‘padj’: DataFrame, ‘N’: DataFrame, ‘coding’: dict}
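To make the relation between the raw and adjusted p-values concrete, here is a minimal standard-library sketch of a permutation p-value for one pair, followed by a Benjamini-Hochberg adjustment across pairs. It is not taken from qdiv: the helper names are hypothetical, and the two-sided test with a +1 correction is an assumed, commonly used convention.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation, used here as the example statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, stat, permutations=999, seed=None):
    """Two-sided permutation p-value; y is shuffled to break the pairing."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(y_perm)
        if abs(stat(x, y_perm)) >= observed:
            hits += 1
    # The +1 counts the observed arrangement, so p is never exactly 0.
    return (hits + 1) / (permutations + 1)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With this convention and the default of 999 permutations, the smallest attainable raw p-value is 1/1000, which limits how small the adjusted values can become when many pairs are tested.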
- qdiv.stats.data_stats.bootstrap_sample_matrix(df, meta, by, *, n_boot=1000, alpha=0.05, random_state=None, return_boot=False, warn_small=True, **kwargs)[source]
Compute bootstrap confidence intervals for within‑ and between‑group summaries of a square sample×sample matrix.
For each variable listed in by, the function computes two aggregated summary statistics:
within: the average pairwise value among samples sharing the same category of that variable, and
between: the average pairwise value among samples belonging to different categories of that variable.
These summaries are estimated using a nested bootstrap. Samples are resampled with replacement within each fully crossed cell defined by by. For each bootstrap replicate, category‑level values are pooled across the remaining factor(s), and summary statistics are computed using pair‑count weighting.
To avoid artificial inflation of zero or repeated distances caused by bootstrap duplicates, resampled indices are deduplicated prior to computing pairwise distances. As a consequence, the effective number of contributing sample pairs may vary across bootstrap replicates.
In addition to per‑variable summaries, the function returns a single “Crossed” aggregation that pools across all fully crossed cells, providing one overall within‑cell and one overall between‑cell summary computed using the same bootstrap resamples.
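The within and between summaries described above can be illustrated without any resampling. The sketch below is an assumption-laden simplification, not the library’s code: it handles a single grouping variable, takes the matrix as nested lists, and uses a hypothetical function name. Each unordered sample pair is visited once, so the means are weighted by pair count.

```python
def within_between_means(matrix, labels):
    """Mean pairwise value within and between categories of one variable.

    matrix: symmetric n x n nested list; labels: one category per sample.
    Returns {"within": (mean, n_pairs), "between": (mean, n_pairs)}.
    """
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair exactly once
            if labels[i] == labels[j]:
                within.append(matrix[i][j])
            else:
                between.append(matrix[i][j])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {
        "within": (mean(within), len(within)),
        "between": (mean(between), len(between)),
    }


# Toy 4 x 4 dissimilarity matrix with two samples per category.
toy = [
    [0, 1, 4, 5],
    [1, 0, 6, 7],
    [4, 6, 0, 2],
    [5, 7, 2, 0],
]
summary = within_between_means(toy, ["A", "A", "B", "B"])
print(summary)  # within mean 1.5 over 2 pairs, between mean 5.5 over 4 pairs
```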
- Parameters:
df ((n x n) pandas.DataFrame) – Symmetric sample×sample matrix with identical row/column labels and order.
meta (pandas.DataFrame | dict | MicrobiomeData-like) – Metadata indexed by sample IDs matching df.index. Metadata will be aligned to df.index without reordering df.
by (str | list[str]) – One or more categorical metadata columns defining the fully crossed cells used for nested resampling.
n_boot (int, default 1000) – Number of bootstrap replicates.
alpha (float, default 0.05) – Significance level for the percentile confidence intervals; the intervals have coverage 1 − alpha (95% CI if alpha = 0.05).
random_state (int | numpy.random.Generator | None) – Random seed or NumPy random generator for reproducible bootstrap resampling.
return_boot (bool, default False) – If False, the returned dictionary omits the bootstrap replicate arrays to reduce memory usage.
warn_small (bool, default True) – If True, issue warnings when cells or categories contain very few samples, which may lead to unstable bootstrap estimates.
- Returns:
out – Dictionary with the following structure:
{
    “<var>”: {
        “within”: {“mean”: float, “ci”: (lo, hi), “total_pairs”: int, “boot”: np.ndarray | None},
        “between”: {“mean”: float, “ci”: (lo, hi), “total_pairs”: int, “boot”: np.ndarray | None}
    },
    …,
    “Crossed”: {
        “within”: {“mean”: float, “ci”: (lo, hi), “total_pairs”: int, “boot”: np.ndarray | None},
        “between”: {“mean”: float, “ci”: (lo, hi), “total_pairs”: int, “boot”: np.ndarray | None}
    }
}
One entry is returned for each variable in by, each containing a single within‑ and between‑category summary. If more than one variable is supplied, an additional “Crossed” entry provides summaries pooled across all fully crossed cells.
- Return type:
dict
Notes
Bootstrap resampling is performed at the sample level within fully crossed cells, but duplicate resampled indices are deduplicated before computing pairwise distances. This yields a bootstrap of pairwise distance summaries rather than a bootstrap of raw samples.
Because of deduplication, the effective number of contributing sample pairs can differ between bootstrap replicates. The reported total_pairs value corresponds to the average number of distinct pairs contributing per bootstrap replicate, not a fixed property of the original dataset.
Pairwise summaries are computed using the upper triangle for within‑group blocks and the full rectangular block for between‑group blocks.
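The resample-then-deduplicate step from these notes can be sketched as below for the within-cell summary only. This is a simplified illustration under stated assumptions, not qdiv’s algorithm: the function name is hypothetical, the percentile rule is a basic index lookup, and replicates that yield no pairs are skipped.

```python
import random
from collections import defaultdict


def bootstrap_within_ci(matrix, cells, n_boot=1000, alpha=0.05, seed=None):
    """Percentile CI for the mean within-cell pairwise value.

    matrix: symmetric n x n nested list; cells: one cell label per sample
    (for crossed factors, a tuple of factor levels).
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cell in enumerate(cells):
        members[cell].append(idx)

    boot = []
    for _ in range(n_boot):
        values = []
        for idxs in members.values():
            # Resample with replacement inside the cell, then drop duplicates
            # so that no sample is paired with a copy of itself.
            drawn = sorted(set(rng.choices(idxs, k=len(idxs))))
            for pos, i in enumerate(drawn):
                for j in drawn[pos + 1:]:
                    values.append(matrix[i][j])
        if values:  # a replicate may contain no pairs after deduplication
            boot.append(sum(values) / len(values))
    if not boot:
        raise ValueError("no replicate produced any within-cell pair")

    boot.sort()
    lo = boot[int((alpha / 2) * (len(boot) - 1))]
    hi = boot[int((1 - alpha / 2) * (len(boot) - 1))]
    return sum(boot) / len(boot), (lo, hi), boot
```

Because the number of pairs in `values` changes from replicate to replicate, any pair count reported alongside such an interval is naturally an average over replicates.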