qdiv package
Overview
The qdiv package provides tools for microbial diversity analysis built around the Hill number framework. Functionality is organized into specialized subpackages for diversity, statistics, plotting, modeling, and sequence handling, unified by the MicrobiomeData container class.
Core subpackages
Main data container
- class qdiv.MicrobiomeData(tab=None, tax=None, meta=None, seq=None, tree=None)[source]
Bases: object
Container for microbiome data tables (abundance, taxonomy, metadata, sequences, tree).
- Parameters:
tab (Optional[pd.DataFrame])
tax (Optional[pd.DataFrame])
meta (Optional[pd.DataFrame])
seq (Optional[pd.DataFrame])
tree (Optional[pd.DataFrame])
- tab
Abundance table (features x samples).
- Type:
pd.DataFrame, optional
- tax
Taxonomy table (features x taxonomy levels).
- Type:
pd.DataFrame, optional
- meta
Metadata table (samples x variables).
- Type:
pd.DataFrame, optional
- seq
Sequence table (features x sequence).
- Type:
pd.DataFrame, optional
- tree
Phylogenetic tree.
- Type:
pd.DataFrame, optional
- classmethod load(**kwargs)[source]
Load microbiome data from files and return a MicrobiomeData object.
- Parameters:
kwargs (dict) – Arguments for file paths and parsing options, passed to the loader.
- Returns:
Loaded data object.
- Return type:
MicrobiomeData
Examples
>>> data = MicrobiomeData.load(tab="otu_table.csv", meta="metadata.csv")
- add_tab(tab, *, path='', sep=None, taxonomy_levels=None)[source]
Add or update self.tab (and self.tax if included in the file).
- Parameters:
tab (str) – File name of the frequency table (.csv/.tsv, optionally gzipped, e.g. .csv.gz). Feature names (OTU/ASV/bin/MAG) should be in the first column (index).
path (str, default "") – Directory path (absolute or relative) containing tab. Can be “” for CWD.
sep (str or None, default None) – Column separator. If None, pandas will attempt to auto-detect (engine=’python’).
taxonomy_levels (list of str, optional) – Case-insensitive taxonomy column names to extract. Defaults to a broad set.
- Raises:
ValueError – If the file cannot be read or has invalid format.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
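The sep=None behaviour described above relies on pandas' Python parsing engine, which pandas documents as using Python's built-in csv.Sniffer to detect the separator. The following is a minimal standard-library sketch of that kind of detection, not qdiv's loader; the helper name guess_separator and the sample text are made up for illustration.

```python
import csv

def guess_separator(text, candidates=",\t;"):
    """Guess the column separator of delimited text (hypothetical helper)."""
    # Restrict the sniffer to a few plausible separators.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter

# Tiny made-up frequency tables: features in the first column, samples after.
tsv_text = "Feature\tS1\tS2\nOTU1\t10\t0\nOTU2\t3\t7\n"
csv_text = "Feature,S1,S2\nOTU1,10,0\nOTU2,3,7\n"

tsv_sep = guess_separator(tsv_text)
csv_sep = guess_separator(csv_text)
```

Passing sep explicitly avoids relying on detection when a file mixes several candidate characters.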
- add_tax(tax, *, path='', sep=None, add_taxon_prefix=True)[source]
Add or update self.tax.
- Parameters:
tax (str) – File name of the taxonomy table (.csv/.tsv, optionally gzipped, e.g. .csv.gz). Feature names (OTU/ASV/bin/MAG) should be in the first column (index).
path (str, default "") – Directory path (absolute or relative) containing tax. Can be “” for CWD.
sep (str or None, default None) – Column separator. If None, pandas will attempt to auto-detect (engine=’python’).
add_taxon_prefix (bool, default True) – If True, add letters and two underscores before taxon names to indicate taxonomic level.
- Raises:
ValueError – If the file cannot be read or has invalid format.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_seq_from_fasta(fasta, *, path='', name_splitter=None)[source]
Add or update self.seq.
- Parameters:
fasta (str) – Name of the FASTA file with sequences of OTUs or ASVs (.fa, .fasta, optionally gzipped).
path (str, default "") – Directory path (absolute or relative) containing fasta. Can be “” for CWD.
name_splitter (str, optional) – If provided, splits sequence names on this delimiter and keeps the first part.
- Raises:
ValueError – If fasta is missing or file cannot be read. If no sequences are found.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
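The effect of name_splitter can be illustrated with a small FASTA reader. This is a sketch under the assumption that the split is applied to the header text after ">"; read_fasta and the example headers are hypothetical and are not qdiv's parser.

```python
import io

def read_fasta(handle, name_splitter=None):
    """Parse FASTA text into {name: sequence} (illustrative only)."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            if name_splitter is not None:
                # Keep only the part before the first delimiter.
                name = name.split(name_splitter)[0]
            seqs[name] = ""
        elif name is not None:
            # Sequences may span several lines; concatenate them.
            seqs[name] += line
    if not seqs:
        raise ValueError("No sequences found")
    return seqs

fasta_text = ">ASV1;size=120\nACGT\nTTGA\n>ASV2;size=80\nGGCC\n"
seqs = read_fasta(io.StringIO(fasta_text), name_splitter=";")
```

Trimming headers this way helps sequence names line up with the feature names used in the abundance table.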
- add_tree(tree, *, path='')[source]
Load tree from a Newick file into a dictionary with a pandas DataFrame.
- Parameters:
tree (str) – Name of the Newick file with the tree.
path (str, default "") – Directory path (absolute or relative) containing tree. Can be “” for CWD.
- Raises:
ValueError – If tree is missing or file cannot be read, or if no nodes are found.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_meta(meta, *, path='', sep=',')[source]
Load metadata into a dictionary with a pandas DataFrame.
- Parameters:
meta (str) – Name of the metadata file.
path (str, default "") – Directory path (absolute or relative) containing meta. Can be “” for CWD.
sep (str or None, default ",") – Column separator. If None, pandas will attempt to auto-detect (engine=’python’).
- Raises:
ValueError – If meta is missing or file cannot be read, or if no samples are found.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_tax_from_sintax(filename, *, path='')[source]
Add or update taxonomy from a SINTAX output file.
- Parameters:
filename (str) – Path to the SINTAX output file.
path (str, default "") – Directory path (absolute or relative) containing filename. Can be “” for CWD.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_tax_from_qiime(filename, *, path='')[source]
Add or update taxonomy from a QIIME2-style taxonomy file.
- Parameters:
filename (str) – File name of the taxonomy table (.tsv, e.g. from QIIME2 export).
path (str, default "") – Directory path (absolute or relative) containing filename. Can be “” for CWD.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_tax_from_gtdbtk(filenames, *, path='')[source]
Add or update taxonomy from one or more GTDB-Tk summary files.
- Parameters:
filenames (str or list of str) – Path(s) to GTDB-Tk summary .tsv file(s).
path (str, default "") – Directory path (absolute or relative) containing file(s). Can be “” for CWD.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_tab_from_coverm(filename, *, path='', first_sep=None, second_sep=' ', detection_threshold=None)[source]
Add a relative abundance table from a CoverM file.
- Parameters:
filename (str) – Path to coverm .tsv or .csv file.
path (str, default "") – Directory path (absolute or relative) containing file(s). Can be “” for CWD.
first_sep (str, optional) – Separator to help extract sample names from column headings.
second_sep (str, optional) – Second separator to help extract sample names from column headings.
detection_threshold (float, optional) – Detection threshold for relative abundance (default: None).
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- add_ebd_tab_from_singlem(filename, *, path='', first_sep=None, second_sep=' ')[source]
Add count data (tab) and taxonomic information (tax) from an EBD file generated with SingleM.
- Parameters:
filename (str) – Path to the file (TSV).
path (str, default "") – Directory path (absolute or relative) containing file(s). Can be “” for CWD.
first_sep (str or None, default None) – Separator used to find the sample name in the column headings of the file.
second_sep (str, default " ") – Second separator used to find the sample name in the column headings of the file.
- Returns:
The updated object (self).
- Return type:
MicrobiomeData
- save(path='', savename='output', sep=',')[source]
Save frequency table, taxonomy, metadata, sequences, and tree to disk.
- Parameters:
path (str, optional) – Directory path where files will be saved. Defaults to the current directory.
savename (str, optional) – Base name for output files. Defaults to “output”.
sep (str, optional) – Field separator for CSV files. Defaults to “,”.
- Returns:
List of file paths that were saved.
- Return type:
list of str
Examples
>>> files = data.save(path="results", savename="mydata")
- copy()[source]
Copy MicrobiomeData object.
- Returns:
A copy of the object.
- Return type:
MicrobiomeData
Examples
>>> obj_copy = obj.copy()
- info(preview_rows=1)[source]
Print summary information about the MicrobiomeData object.
- Parameters:
preview_rows (int, optional) – Number of rows to preview from metadata (default: 1).
- Return type:
None
- summarize_taxa(savename=None, *, path='')[source]
Summarize the number of taxa at each taxonomic level per sample.
- Parameters:
savename (str or None, default=None) – If provided, save the output table as CSV in the given path.
path (str, default "") – Directory path where the output table is saved if savename is provided.
- Returns:
Summary table with:
number of features per sample
total reads per sample
number of unique taxa at each taxonomic level
- Return type:
pandas.DataFrame
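The summary described above (features, reads, and unique taxa per sample) can be sketched with plain dictionaries. This is an illustration under assumed inputs, not qdiv's implementation; the summarize helper, the level names, and the data are made up, and missing taxonomy entries are assumed not to count as taxa.

```python
# features x samples counts and features x taxonomy levels (made-up data)
tab = {
    "OTU1": {"S1": 10, "S2": 0},
    "OTU2": {"S1": 5, "S2": 3},
    "OTU3": {"S1": 0, "S2": 7},
}
tax = {
    "OTU1": {"Phylum": "p__Firmicutes", "Genus": "g__Bacillus"},
    "OTU2": {"Phylum": "p__Firmicutes", "Genus": "g__Clostridium"},
    "OTU3": {"Phylum": "p__Bacteroidota", "Genus": None},
}

def summarize(tab, tax, levels=("Phylum", "Genus")):
    """Per-sample feature count, read total, and unique taxa per level."""
    samples = sorted({s for counts in tab.values() for s in counts})
    summary = {}
    for s in samples:
        # Only features detected in this sample contribute.
        present = [f for f, counts in tab.items() if counts.get(s, 0) > 0]
        row = {
            "features": len(present),
            "reads": sum(tab[f][s] for f in present),
        }
        for level in levels:
            # Unclassified (None) entries are skipped in this sketch.
            names = {tax[f][level] for f in present if tax[f][level]}
            row[level] = len(names)
        summary[s] = row
    return summary

summary = summarize(tab, tax)
```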
- subset_samples(*, by='index', values=None, exclude=False, keep_absent=False, inplace=False)[source]
Subset samples in the MicrobiomeData object using io.subset_samples.
- Parameters:
by (str, default "index") – How to select samples: “index” for sample names, or a column name in meta.
values (list or scalar, optional) – Values to include (or exclude if exclude=True).
exclude (bool, default False) – If True, exclude samples that match values.
keep_absent (bool, default False) – If False, drop features (rows) with zero counts after subsetting.
inplace (bool, default False) – If True, modify the object in place. If False, return a new object.
- Returns:
The filtered object (self if inplace=True, otherwise a new object).
- Return type:
MicrobiomeData
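How by, values, exclude, and keep_absent interact can be shown with a small dictionary-based sketch. It only mirrors the parameter descriptions above; filter_samples and the data are hypothetical, and the real method operates on the pandas tables held by the object.

```python
def filter_samples(tab, meta, by, values, exclude=False, keep_absent=False):
    """Keep (or drop) samples whose metadata value in `by` is in `values`."""
    values = set(values)
    # A sample is kept when its match status differs from `exclude`.
    keep = [s for s, row in meta.items() if (row[by] in values) != exclude]
    sub = {f: {s: counts[s] for s in keep} for f, counts in tab.items()}
    if not keep_absent:
        # Drop features that have no counts left in the kept samples.
        sub = {f: c for f, c in sub.items() if sum(c.values()) > 0}
    return sub

meta = {"S1": {"Site": "A"}, "S2": {"Site": "B"}, "S3": {"Site": "A"}}
tab = {
    "OTU1": {"S1": 4, "S2": 0, "S3": 1},
    "OTU2": {"S1": 0, "S2": 9, "S3": 0},
}

kept = filter_samples(tab, meta, by="Site", values=["A"])
kept_all = filter_samples(tab, meta, by="Site", values=["A"], keep_absent=True)
excluded = filter_samples(tab, meta, by="Site", values=["A"], exclude=True)
```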
- subset_features(*, featurelist=None, exclude=False, inplace=False)[source]
Subset features (OTUs/ASVs/bins/MAGs) from a MicrobiomeData object using io.subset_features.
- Parameters:
featurelist (list) – List of feature (OTU/ASV/bin) identifiers to keep or exclude.
exclude (bool, default False) – If True, exclude values in featurelist instead of including them.
inplace (bool, default False) – If True, mutate and return the same object. If False, return a new object.
- Returns:
The filtered object (self if inplace=True, otherwise a new object).
- Return type:
MicrobiomeData
- subset_abundant(*, n=25, method='mean', cutoff=None, exclude=False, inplace=False)[source]
Subset features (OTUs/ASVs/bins/MAGs) from a MicrobiomeData object using io.subset_abundant.
- Parameters:
n (int, default 25) – Number of top features to keep (or exclude if exclude=True). Values outside [0, n_features] are clamped to the valid range.
method ({'sum','mean','max','frequency'}, default 'mean') – Reduction across samples of relative abundance per feature.
- ‘sum’ : total relative abundance across samples
- ‘mean’ : mean relative abundance across samples
- ‘max’ : max relative abundance in a sample
- ‘frequency’ : proportion of samples in which the feature is detected
cutoff (float, default None) – If cutoff is specified as a percentage (from 0 to 100%), all features with a ‘sum’ or ‘mean’ relative abundance or ‘frequency’ of detection above this value will be kept, and the parameter n will be ignored.
exclude (bool, default False) – If False (default), keep the top features. If True, exclude the top features (keep the rest).
inplace (bool, default False) – Only relevant for MicrobiomeData input. If True, mutate the object and return it; otherwise, return a new object.
- Returns:
The filtered object (self if inplace=True, otherwise a new object).
- Return type:
MicrobiomeData
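The ranking logic described by n, method, and cutoff can be sketched as follows. This is an illustration, not qdiv's code: top_features is a hypothetical helper, relative abundances and frequency are both expressed in percent (an assumption), and every sample is assumed to have a nonzero read total.

```python
def top_features(tab, n=25, method="mean", cutoff=None):
    """Rank features by a reduction of per-sample relative abundance (%)."""
    samples = sorted({s for c in tab.values() for s in c})
    totals = {s: sum(c[s] for c in tab.values()) for s in samples}
    scores = {}
    for f, c in tab.items():
        rel = [100 * c[s] / totals[s] for s in samples]  # percent per sample
        if method == "sum":
            scores[f] = sum(rel)
        elif method == "mean":
            scores[f] = sum(rel) / len(rel)
        elif method == "max":
            scores[f] = max(rel)
        elif method == "frequency":
            scores[f] = 100 * sum(r > 0 for r in rel) / len(rel)
        else:
            raise ValueError(f"Unknown method: {method}")
    ranked = sorted(scores, key=scores.get, reverse=True)
    if cutoff is not None:
        # With a cutoff, n is ignored and all features above it are kept.
        return [f for f in ranked if scores[f] > cutoff]
    n = max(0, min(n, len(ranked)))  # clamp n to the valid range
    return ranked[:n]

tab = {
    "OTU1": {"S1": 80, "S2": 10},
    "OTU2": {"S1": 20, "S2": 60},
    "OTU3": {"S1": 0, "S2": 30},
}
top_two = top_features(tab, n=2)
above_cutoff = top_features(tab, cutoff=30)
top_by_max = top_features(tab, n=1, method="max")
clamped = top_features(tab, n=99)
```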
- merge_samples(*, by, values=None, method='sum', keep_absent=False, inplace=False)[source]
Merge samples in the MicrobiomeData object based on metadata grouping.
- Parameters:
by (str or list) – Column(s) in metadata used for grouping samples.
values (list, optional) – Metadata values to keep. If None, all unique values in by are used.
method ({'sum', 'mean'}, default 'sum') – Aggregation method for counts.
keep_absent (bool, default False) – If False, remove features with zero counts after merging.
inplace (bool, default False) – If True, modify this object in place; if False, return a new object.
- Returns:
Object with merged samples. Returns self if inplace=True, otherwise a new MicrobiomeData instance.
- Return type:
MicrobiomeData
- Raises:
ValueError – If metadata or the specified column is missing, or if no samples match the specified values.
Examples
>>> obj.merge_samples(by="Treatment", method="sum", inplace=True)
>>> merged = obj.merge_samples(by="Site", method="mean")
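The grouping behind merge_samples can be sketched with dictionaries: samples sharing a metadata value are combined into one column, either summed or averaged. The helper merge_by_group and the data are hypothetical; the real method also updates the metadata table and can drop absent features.

```python
def merge_by_group(tab, meta, by, method="sum"):
    """Combine samples that share the same metadata value in `by`."""
    groups = {}
    for sample, row in meta.items():
        groups.setdefault(row[by], []).append(sample)
    merged = {}
    for feature, counts in tab.items():
        merged[feature] = {}
        for group, members in groups.items():
            total = sum(counts[s] for s in members)
            # 'mean' divides the group total by the number of samples in it.
            merged[feature][group] = total if method == "sum" else total / len(members)
    return merged

meta = {
    "S1": {"Treatment": "Control"},
    "S2": {"Treatment": "Control"},
    "S3": {"Treatment": "Treated"},
}
tab = {
    "OTU1": {"S1": 4, "S2": 6, "S3": 1},
    "OTU2": {"S1": 0, "S2": 2, "S3": 8},
}

summed = merge_by_group(tab, meta, by="Treatment", method="sum")
averaged = merge_by_group(tab, meta, by="Treatment", method="mean")
```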
- subset_taxa(*, subset_levels=None, subset_patterns=None, exclude=False, case=False, regex=False, na=False, match_type='contains', inplace=False)[source]
Subset features (OTUs/ASVs/bins/MAGs) from the MicrobiomeData object based on taxonomic classification.
- Parameters:
subset_levels (str or sequence of str, optional) – Taxonomic column(s) in which to search for patterns. If None, all columns in tax are used.
subset_patterns (str or sequence of str) – Text patterns to identify taxa to keep. If a single string is passed, it is used as the only pattern.
exclude (bool, default False) – If True, return taxa that do NOT match the given patterns (i.e., complement).
case (bool, default False) – If True, pattern matching is case-sensitive.
regex (bool, default False) – If True, patterns are treated as regex. If False, patterns are escaped (literal match).
na (bool, default False) – If True, missing values (NA) are treated as matches. If False, they are treated as non-matches. Empty or whitespace-only taxonomy entries are treated as missing (NA) during subsetting.
match_type ({'contains','fullmatch','startswith','endswith'}, default 'contains') – Matching behavior applied to the strings in selected columns.
inplace (bool, default False) – If True, mutate and return the same object. If False, return a new object.
- Returns:
Filtered object with updated ‘tab’, ‘tax’, and ‘seq’. ‘meta’ and ‘tree’ are passed through.
- Return type:
MicrobiomeData
- Raises:
ValueError – If taxonomy table is missing, no patterns are provided, or no matches are found.
Examples
>>> obj.subset_taxa(subset_levels="Genus", subset_patterns="Bacteroides", inplace=True)
>>> filtered_obj = obj.subset_taxa(subset_patterns=["Bacteroides", "Clostridium"], exclude=True)
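The combination of case, regex, na, and match_type can be illustrated for a single taxonomy string using the standard re module. This sketch follows the parameter descriptions above; the matches helper is hypothetical, and the real method presumably applies equivalent logic column-wise through pandas string methods.

```python
import re

def matches(value, pattern, match_type="contains", case=False, regex=False, na=False):
    """Test one taxonomy entry against one pattern (illustrative only)."""
    if value is None or not value.strip():
        return na  # empty or whitespace-only entries count as missing
    flags = 0 if case else re.IGNORECASE
    # Without regex=True the pattern is escaped and matched literally.
    pat = pattern if regex else re.escape(pattern)
    if match_type == "contains":
        return re.search(pat, value, flags) is not None
    if match_type == "fullmatch":
        return re.fullmatch(pat, value, flags) is not None
    if match_type == "startswith":
        return re.match(pat, value, flags) is not None
    if match_type == "endswith":
        return re.search(r"(?:%s)\Z" % pat, value, flags) is not None
    raise ValueError(f"Unknown match_type: {match_type}")

contains_hit = matches("g__Bacteroides", "bacteroides")
case_miss = matches("g__Bacteroides", "bacteroides", case=True)
prefix_hit = matches("g__Bacteroides", "g__", match_type="startswith")
full_miss = matches("g__Bacteroides", "Bacteroides", match_type="fullmatch")
suffix_hit = matches("g__Bacteroides", "oides", match_type="endswith")
regex_hit = matches("g__Clostridium", "g__(Bact|Clos)", regex=True)
literal_miss = matches("g__Clostridium", "g__(Bact|Clos)", regex=False)
missing_as_match = matches("  ", "Bacteroides", na=True)
```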
- rarefy(*, depth='min', random_state=None, replacement=False, inplace=False, **kwargs)[source]
Rarefy the abundance table to a fixed sequencing depth.
This method is a thin wrapper around io.subset.rarefy(). It performs random subsampling (with or without replacement) to equalize sequencing depth across samples, then drops features and samples that become zero.
- Parameters:
depth (int or 'min', default 'min') – Target sequencing depth per sample. If ‘min’, the minimum depth across samples is used.
random_state (int | numpy.random.Generator, optional) – Random seed or Generator for reproducibility.
replacement (bool, default False) – If True, sample with replacement (multinomial); otherwise sample without replacement.
inplace (bool, default False) – If True, modify this object in place; if False, return a new object.
- Returns:
The rarefied object. Returns self if inplace=True, otherwise a new MicrobiomeData instance.
- Return type:
MicrobiomeData
Notes
Rarefaction reduces sequencing depth variance across samples to facilitate certain diversity and dissimilarity analyses.
The exact algorithm and post-processing (feature/sample pruning) are implemented in io.subset.rarefy().
Index alignment and integrity are enforced via _autocorrect() and _validate() in the underlying implementation.
Examples
>>> obj.rarefy(depth=10000, random_state=42, inplace=True)
>>> rarefied_obj = obj.rarefy(depth='min', replacement=True)
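Rarefaction itself can be sketched with the standard library: each sample's reads are expanded into a pool and a fixed number are drawn, with or without replacement. This is a conceptual illustration only, not the algorithm in io.subset.rarefy(); rarefy_counts and the data are made up, and a real implementation would use a more memory-efficient sampler.

```python
import random

def rarefy_counts(counts, depth, rng, replacement=False):
    """Subsample one sample's {feature: count} mapping to `depth` reads."""
    # One pool entry per read, labelled by the feature it belongs to.
    pool = [f for f, c in counts.items() for _ in range(c)]
    if replacement:
        drawn = rng.choices(pool, k=depth)  # draws proportional to counts
    else:
        drawn = rng.sample(pool, depth)     # raises ValueError if depth > reads
    out = {}
    for f in drawn:
        out[f] = out.get(f, 0) + 1
    return out

rng = random.Random(42)  # fixed seed for reproducibility
samples = {
    "S1": {"OTU1": 50, "OTU2": 30, "OTU3": 20},
    "S2": {"OTU1": 5, "OTU2": 40, "OTU3": 15},
}
# depth='min' corresponds to the smallest read total across samples.
depth = min(sum(c.values()) for c in samples.values())
rarefied = {s: rarefy_counts(c, depth, rng) for s, c in samples.items()}
```

Because S2 already sits at the minimum depth, sampling it without replacement returns its original counts, while S1 is reduced to the same total.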
- prune_tree(featurelist=None, reroot=False, inplace=False)[source]
Prune the tree to retain only branches whose leaves intersect with a given feature set, plus always keep the root branch.
- Parameters:
featurelist (list of str or set of str or iterable of str, optional) – A collection of feature names to match against the leaves of each branch. If None, the method will attempt to use self.tab.index.tolist().
reroot (bool, default False) – If True, reroot the pruned tree at midpoint.
inplace (bool, default False) – If True, modify this object in place; if False, return a new object.
- Returns:
The object with pruned tree. Returns self if inplace=True, otherwise a new MicrobiomeData instance.
- Return type:
MicrobiomeData
- rename_features(name_type='OTU', name_dict=None, inplace=False)[source]
Rename feature identifiers (row indices) based on their relative abundance or taxonomic order.
The renaming is based either on the rank of each feature after sorting by relative abundance, or on a dictionary with existing names as keys and new names as values. If ‘name_dict’ is None, the features are renamed in the format {name_type}{i} and sorted:
- By mean relative abundance if tab (abundance table) is present.
- By taxonomic order if tax is present and tab is absent.
- Parameters:
name_type (str, default='OTU') – Prefix for new feature names, e.g., ‘OTU’, ‘ASV’ (used if name_dict is None).
name_dict (dict, default=None) – Dictionary mapping existing feature names to new names, {‘Old_name’: ‘New_name’, …}.
inplace (bool, default=False) – If True, modify object in place.
- Returns:
The updated object. If inplace=True, returns self; otherwise, a new instance.
- Return type:
MicrobiomeData
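The abundance-based naming scheme can be sketched as follows: features are ranked by mean relative abundance and named {name_type}{i}. The helper rename_by_abundance and the data are hypothetical, and starting the numbering at 1 with the most abundant feature first is an assumption rather than documented behaviour.

```python
def rename_by_abundance(tab, name_type="OTU"):
    """Map old feature names to rank-based names (illustrative only)."""
    samples = sorted({s for c in tab.values() for s in c})
    totals = {s: sum(c[s] for c in tab.values()) for s in samples}
    mean_rel = {
        f: sum(c[s] / totals[s] for s in samples) / len(samples)
        for f, c in tab.items()
    }
    # Most abundant feature first; numbering from 1 is an assumption.
    ranked = sorted(mean_rel, key=mean_rel.get, reverse=True)
    return {old: f"{name_type}{i}" for i, old in enumerate(ranked, start=1)}

tab = {
    "a1f3": {"S1": 10, "S2": 10},
    "9bc2": {"S1": 80, "S2": 50},
    "77de": {"S1": 10, "S2": 40},
}
mapping = rename_by_abundance(tab, name_type="ASV")
```

A mapping of this shape has the same form as the name_dict argument, with existing names as keys and new names as values.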
- tax_prefix(add=True, inplace=False, custom_prefix=None)[source]
Add or remove prefix (e.g. d__, p__) to taxonomic classifications.
- Parameters:
add (bool, default=True) – If True, add prefix. If False, remove prefix.
inplace (bool, default=False) – If True, modify object in place.
custom_prefix (dict, default=None) – A dictionary with taxonomic levels as keys and prefix as values.
- Returns:
The updated object. If inplace=True, returns self; otherwise, a new instance.
- Return type:
MicrobiomeData
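Adding and removing level prefixes can be sketched for a single taxonomy row. The level names and default prefixes in DEFAULT_PREFIX are assumptions modelled on the d__, p__ convention mentioned above, and set_prefix is a hypothetical helper, not the package's implementation.

```python
# Assumed level names and prefixes, following the d__/p__ convention.
DEFAULT_PREFIX = {
    "Domain": "d__", "Phylum": "p__", "Class": "c__", "Order": "o__",
    "Family": "f__", "Genus": "g__", "Species": "s__",
}

def set_prefix(row, add=True, prefixes=DEFAULT_PREFIX):
    """Add or strip level prefixes in one {level: name} taxonomy row."""
    out = {}
    for level, name in row.items():
        prefix = prefixes.get(level, "")
        if name is None or not prefix:
            out[level] = name  # leave missing names and unknown levels alone
        elif add:
            # Avoid doubling a prefix that is already present.
            out[level] = name if name.startswith(prefix) else prefix + name
        else:
            out[level] = name[len(prefix):] if name.startswith(prefix) else name
    return out

added = set_prefix({"Phylum": "Firmicutes", "Genus": "g__Bacillus"})
removed = set_prefix({"Phylum": "p__Firmicutes", "Genus": None}, add=False)
custom = set_prefix({"Genus": "Bacillus"}, prefixes={"Genus": "G:"})
```

The prefixes argument plays the role of custom_prefix: a dictionary with taxonomic levels as keys and prefixes as values.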
- clean_tax(inplace=False)[source]
Clean and standardize Greengenes2/GTDB taxonomy within a MicrobiomeData object.
This function processes taxonomy derived from Greengenes2 or GTDB. It normalizes missing or ambiguous labels, preserves GTDB letter suffixes (e.g., _A, _B), and removes GTDB numeric node identifiers (e.g., _368345) at all taxonomic ranks.
- Parameters:
inplace (bool, default=False) – If True, modify object in place.
- Returns:
The updated object. If inplace=True, returns self; otherwise, a new instance.
- Return type:
MicrobiomeData
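The suffix handling described for clean_tax can be sketched with a regular expression that removes an underscore followed by digits while leaving letter suffixes alone. The pattern, the helper strip_node_ids, and the example names are assumptions for illustration; the actual cleaning rules may differ and also cover missing or ambiguous labels.

```python
import re

# Underscore plus digits at the end of a word: a numeric node identifier.
NUMERIC_SUFFIX = re.compile(r"_\d+(?=\s|$)")

def strip_node_ids(name):
    """Remove numeric node identifiers but keep letter suffixes such as _A."""
    return NUMERIC_SUFFIX.sub("", name)

genus = strip_node_ids("g__Bacillus_A_368345")
species = strip_node_ids("s__Escherichia_710834 coli")
unchanged = strip_node_ids("g__Bacillus_A")
```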
- classmethod load_example(example_name='Modin_et_al_2025')[source]
Load a MicrobiomeData object from packaged example files.
- Parameters:
example_name (str) – Name of the example to load. Options:
- “Modin2025”: Uses CoverM and GTDB-Tk output files from the Modin et al. study https://doi.org/10.1111/1751-7915.70238.
- “Saheb-Alam2019_DADA2”: Uses qiime2-dada2 output files from the Saheb-Alam et al. study https://doi.org/10.1111/1751-7915.13449.
- “Saheb-Alam2019_Deblur”: Uses qiime2-deblur output files from the Saheb-Alam et al. study https://doi.org/10.1111/1751-7915.13449.
- Returns:
An instance loaded with example data.
- Return type:
MicrobiomeData
- Raises:
ValueError – If the example_name is not recognized.
- to_dict()[source]
Return the data as a dictionary.
- Returns:
Dictionary with keys: ‘tab’, ‘tax’, ‘meta’, ‘seq’, ‘tree’.
- Return type:
dict
- classmethod from_dict(data)[source]
Create a MicrobiomeData object from a dictionary.
- Parameters:
data (dict) – Dictionary with keys:
- ‘tab’ : pd.DataFrame (required)
- ‘tax’ : pd.DataFrame, optional
- ‘meta’ : pd.DataFrame, optional
- ‘seq’ : pd.DataFrame, optional
- ‘tree’ : pd.DataFrame, optional
- Returns:
A new MicrobiomeData object initialized from the dictionary.
- Return type:
MicrobiomeData
- Raises:
ValueError – If ‘tab’ is missing or not a pandas DataFrame.
Examples
>>> my_dict = {
...     "tab": pd.DataFrame(...),
...     "tax": pd.DataFrame(...),
...     "meta": pd.DataFrame(...)
... }
>>> obj = MicrobiomeData.from_dict(my_dict)