qdiv.sequences.sequence_comparisons module
- qdiv.sequences.sequence_comparisons.sequence_distance_matrix(obj, *, savename='SeqDistMat', path='', band_width=12, save=True, use_numba=True)[source]
Compute pairwise Levenshtein distances using a parallelized, Numba-accelerated banded Wagner–Fischer algorithm when use_numba=True; otherwise fall back to a pure Python implementation.
- Parameters:
obj (MicrobiomeData or dict) – Must provide a DataFrame in obj.seq or obj[‘seq’] with index=sequence IDs and a column containing the sequences (default name: ‘seq’).
savename (str, optional) – Base filename for CSV outputs. Default ‘SeqDistMat’.
path (str, default "") – Directory path (absolute or relative) where output is saved. Can be “” for CWD.
band_width (int, optional) – Sakoe–Chiba band half-width (expanded automatically to |len1-len2| when two sequences differ in length by more than the band). Larger values bring the result closer to the exact dynamic-programming distance but reduce speed. Default 12.
save (bool, optional) – If True, writes two CSV files: one with edit counts and one with normalized distances.
use_numba (bool, optional) – If True, uses Numba path; otherwise uses pure Python implementation.
- Returns:
- {
‘edits’: pd.DataFrame, # int distances
‘normalized’: pd.DataFrame, # float in [0, 1]
‘meta’: {‘backend’: ‘numba’|‘python’}
}
- Return type:
Dict[str, Any]
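The banded dynamic programme described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the names banded_levenshtein and normalized_distance are hypothetical, and dividing by the longer sequence length is only an assumption about how the normalized values might be obtained.

```python
def banded_levenshtein(a: str, b: str, band_width: int = 12) -> int:
    """Edit distance computed only inside a diagonal band (illustrative sketch)."""
    n, m = len(a), len(b)
    # Widen the band so the bottom-right cell is always reachable.
    band = max(band_width, abs(n - m))
    inf = n + m + 1  # larger than any real edit distance
    # Row 0: distance from the empty prefix of a to each prefix of b.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i  # deleting the first i characters of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
        prev = cur
    return prev[m]


def normalized_distance(a: str, b: str, band_width: int = 12) -> float:
    """Assumed normalization: edits divided by the longer sequence length."""
    longest = max(len(a), len(b))
    return banded_levenshtein(a, b, band_width) / longest if longest else 0.0


edits = banded_levenshtein("kitten", "sitting")
```

Cells outside the band are treated as unreachable, so the result is exact whenever the true distance does not exceed the band and an upper bound otherwise.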
- qdiv.sequences.sequence_comparisons.tree_distance_matrix(obj, *, savename='TreeDistMat', path='', save=True, file_format='csv', use_tqdm=True)[source]
Compute pairwise phylogenetic distances between leaf nodes in a tree.
- Parameters:
obj (MicrobiomeData or dict) – Must provide a DataFrame in obj.tree or obj[‘tree’]. The DataFrame must contain at least: [‘nodes’, ‘parent’, ‘dist_to_root’]. Optionally it can have ‘branchL’; ‘dist_to_root’ is assumed correct.
savename (str, optional) – Base filename for outputs. Default ‘TreeDistMat’.
path (str, default "") – Directory path (absolute or relative) where output is saved. Can be “” for CWD.
save (bool, optional) – If True, writes a CSV or compressed npz file.
file_format (str {'csv'|'compressed'|'npz'}) – If ‘csv’, saves a full distance matrix CSV. If ‘compressed’ or ‘npz’, saves a triangular matrix in a compressed npz.
use_tqdm (bool, default=True) – Use tqdm for progress bars.
- Returns:
Symmetric distance matrix with leaf node names as both rows and columns.
- Return type:
pandas.DataFrame
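As a rough illustration of how leaf-to-leaf distances can be derived from the ‘nodes’, ‘parent’ and ‘dist_to_root’ columns, the sketch below uses the identity d(a, b) = dist_to_root(a) + dist_to_root(b) - 2 * dist_to_root(LCA). The helper names, the rule that a leaf is any node never listed as a parent, and the use of None as the root's parent are assumptions made for this example, not documented behaviour.

```python
import pandas as pd


def _path_to_root(node, parent):
    """Return [node, parent, grandparent, ...] up to and including the root."""
    path, seen = [], set()
    while node in parent and node not in seen:
        path.append(node)
        seen.add(node)
        node = parent[node]
    return path


def leaf_distances(tree: pd.DataFrame) -> pd.DataFrame:
    """Pairwise leaf distances via the lowest common ancestor (sketch)."""
    parent = dict(zip(tree["nodes"], tree["parent"]))
    depth = dict(zip(tree["nodes"], tree["dist_to_root"]))
    internal = set(tree["parent"].dropna())
    # Assumption: a leaf is a node that is nobody's parent.
    leaves = [n for n in tree["nodes"] if n not in internal]
    out = pd.DataFrame(0.0, index=leaves, columns=leaves)
    for i, a in enumerate(leaves):
        ancestors_a = set(_path_to_root(a, parent))
        for b in leaves[i + 1:]:
            # First ancestor of b that is also an ancestor of a.
            lca = next(n for n in _path_to_root(b, parent) if n in ancestors_a)
            d = depth[a] + depth[b] - 2 * depth[lca]
            out.loc[a, b] = out.loc[b, a] = d
    return out


# Toy tree: root -> (n1 -> (A, B), C)
toy_tree = pd.DataFrame({
    "nodes": ["root", "n1", "A", "B", "C"],
    "parent": [None, "root", "n1", "n1", "root"],
    "dist_to_root": [0.0, 1.0, 1.5, 2.0, 3.0],
})
dist = leaf_distances(toy_tree)
```

This nested loop is quadratic in the number of leaves, which is adequate for a toy tree but says nothing about how the library itself performs the computation.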
- qdiv.sequences.sequence_comparisons.load_compressed_matrix(filename, path='')[source]
Load a distance matrix DataFrame saved in NumPy compressed format (npz). The filename is assumed to end with npz.
- Parameters:
filename (str) – Filename of the distance matrix.
path (str, default "") – Directory path (absolute or relative) where the file is located. Can be “” for CWD.
- Return type:
DataFrame
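The idea of storing only one triangle of a symmetric matrix in a compressed npz archive can be sketched with plain NumPy. The array keys ‘names’ and ‘upper’ and both helper functions are hypothetical; the layout that tree_distance_matrix actually writes is not documented here, so this block does not read or write qdiv's own files.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_triangular(df: pd.DataFrame, filename: str) -> None:
    """Store labels plus the strict upper triangle (hypothetical layout)."""
    iu = np.triu_indices(len(df), k=1)
    np.savez_compressed(
        filename,
        names=np.asarray(df.index, dtype=str),
        upper=df.to_numpy()[iu],
    )


def load_triangular(filename: str) -> pd.DataFrame:
    """Rebuild the full symmetric matrix from the stored triangle."""
    with np.load(filename) as data:
        names = data["names"]
        upper = data["upper"]
    n = len(names)
    full = np.zeros((n, n), dtype=upper.dtype)
    full[np.triu_indices(n, k=1)] = upper
    full = full + full.T  # mirror; the diagonal stays zero
    return pd.DataFrame(full, index=names, columns=names)


original = pd.DataFrame(
    [[0.0, 1.5, 4.5], [1.5, 0.0, 5.0], [4.5, 5.0, 0.0]],
    index=list("ABC"),
    columns=list("ABC"),
)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "TreeDistMat.npz")
    save_triangular(original, target)
    restored = load_triangular(target)
```

For n leaves this stores n(n-1)/2 values instead of n², which is the motivation for a triangular format.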
- qdiv.sequences.sequence_comparisons.align(objlist, *, different_lengths=False, name_type='OTU')[source]
Align feature names across multiple objects containing sequences (‘seq’).
Works with either dicts or MicrobiomeData objects. Returns the same types as inputs.
- Parameters:
objlist (list of dict or MicrobiomeData) –
List of input objects to align. Each object must contain at least:
‘seq’: sequence table (features x sequence)
Optionally:
‘tab’: abundance table (features x samples)
‘tax’: taxonomy table
‘meta’: sample metadata
‘tree’: sample tree data
different_lengths (bool, optional) – If True, allows alignment of features with different sequence lengths (substring matching, O(n^2) over unique sequences). Default is False.
name_type (str, optional) – Prefix for renaming aligned features (e.g., ‘OTU’, ‘ASV’). Default is ‘OTU’.
- Returns:
aligned_objects – List of aligned objects, of the same types as the input.
- Return type:
list of dict or MicrobiomeData
Notes
Duplicate sequences within each object are collapsed (abundance tables summed, taxonomy taken from the first occurrence).
Feature names are harmonized across all objects.
If different_lengths=True, substring matching is used for alignment.
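The two steps in the notes above (collapsing duplicate sequences, then giving identical sequences the same name in every object) can be sketched for the exact-match case with pandas. The helper names and the alphabetical numbering of sequences are assumptions for illustration; align may order and name features differently, and substring matching is not covered.

```python
import pandas as pd


def collapse_duplicates(tab: pd.DataFrame, seq: pd.DataFrame) -> pd.DataFrame:
    """Sum abundance rows that share a sequence; the result is indexed by sequence."""
    return tab.groupby(seq.loc[tab.index, "seq"]).sum()


def harmonize(tabs_by_seq, name_type="OTU"):
    """Give each distinct sequence one shared name across all tables."""
    all_seqs = sorted(set().union(*[t.index for t in tabs_by_seq]))
    # Assumption: names are numbered in alphabetical order of the sequences.
    names = {s: f"{name_type}{i + 1}" for i, s in enumerate(all_seqs)}
    return [t.rename(index=names) for t in tabs_by_seq], names


tab1 = pd.DataFrame({"s1": [1, 2, 3]}, index=["f1", "f2", "f3"])
seq1 = pd.DataFrame({"seq": ["ACGT", "ACGT", "TTTT"]}, index=["f1", "f2", "f3"])
tab2 = pd.DataFrame({"s2": [4, 5]}, index=["g1", "g2"])
seq2 = pd.DataFrame({"seq": ["TTTT", "GGGG"]}, index=["g1", "g2"])

collapsed1 = collapse_duplicates(tab1, seq1)
collapsed2 = collapse_duplicates(tab2, seq2)
aligned, names = harmonize([collapsed1, collapsed2])
```

After this step the sequence TTTT carries the same feature name in both tables, which is what allows later functions to compare or merge them by index.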
- qdiv.sequences.sequence_comparisons.consensus(objlist, *, keep_object='best', already_aligned=False, different_lengths=False, name_type='OTU', keep_cutoff=0.2, only_return_seq=False, return_type='auto')[source]
Build a consensus object based on features found in all input objects.
This function aligns features (e.g., ASVs/OTUs) across multiple microbiome data objects, identifies features shared by all, and constructs a consensus abundance table, sequences, taxonomy, and metadata. Optionally, features with high abundance in any object are retained even if not shared. The result can be returned as a dictionary or as a MicrobiomeData object.
- Parameters:
objlist (list of dict or MicrobiomeData) –
List of input objects to merge. Each object must contain at least:
‘tab’ : pd.DataFrame (abundance table, features x samples)
‘seq’ : pd.DataFrame (sequences, indexed by feature IDs)
Optionally:
‘tax’ : pd.DataFrame (taxonomy annotations)
‘meta’: pd.DataFrame (sample metadata)
Objects can be either plain dicts or MicrobiomeData instances.
keep_object ({'best', int}, default 'best') –
Determines which input object to use as the template for consensus:
‘best’: the object with the largest fraction of reads mapped to shared features.
int: index of the object to use (0 = first, 1 = second, etc.).
already_aligned (bool, default False) – If True, assumes that features are already aligned across objects. If False, runs the alignment step.
different_lengths (bool, default False) – If True, allows alignment of features with different sequence lengths (substring matching).
name_type (str, default 'OTU') – Prefix for renaming consensus features (e.g., “OTU1”, “OTU2”, …).
keep_cutoff (float, default 0.2) – Relative abundance cutoff (%) for retaining features that are not shared by all objects, but are highly abundant in at least one object.
only_return_seq (bool, default False) – If True, only returns a DataFrame of shared sequences (plus the info dictionary). No consensus object is constructed.
return_type ({'auto', 'dict', 'microbiome'}, default 'auto') –
Determines the type of object returned (unless only_return_seq is True):
‘microbiome’: always return a MicrobiomeData object (except when only_return_seq=True).
‘dict’: always return a dictionary (legacy behavior).
‘auto’: return a MicrobiomeData object if any input was a MicrobiomeData; otherwise, return a dict.
- Returns:
cons_obj (dict or MicrobiomeData or pd.DataFrame) –
The consensus object containing:
‘tab’: abundance table (features x samples)
‘seq’: sequence table (features x sequence)
‘tax’: taxonomy table (optional)
‘meta’: metadata table (optional)
If return_type=‘microbiome’ or ‘auto’ (with any MicrobiomeData input), returns a MicrobiomeData object. If return_type=‘dict’, returns a dictionary. If only_return_seq=True, returns a DataFrame of shared sequences.
info (dict) –
Dictionary with summary statistics about consensus construction, including:
‘kept_object_index’: index of the selected template object
‘all_objects’: per-object statistics (consensus abundance, lost reads/features)
‘selected_object’: statistics for the selected object
- Return type:
Tuple[Dict[str, DataFrame] | MicrobiomeData | DataFrame, Dict[str, Any]]
Notes
The consensus object does not include a phylogenetic tree, even if present in the inputs.
Feature indices are re-ordered by average abundance and renamed using name_type.
If only_return_seq is True, only the shared sequences DataFrame and info are returned.
The function automatically aligns features unless already_aligned is True.
Examples
>>> cons_obj, info = consensus([obj1, obj2], keep_object='best')
>>> print(type(cons_obj))
<class 'MicrobiomeData'>
>>> cons_obj.info()
>>> print(info)
>>> # To get a dict instead of MicrobiomeData:
>>> cons_dict, info = consensus([obj1, obj2], return_type='dict')
>>> # To get only the shared sequences:
>>> seq_df, info = consensus([obj1, obj2], only_return_seq=True)
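The keep_object='best' rule, choosing the object with the largest fraction of reads mapped to shared features, can be approximated with a short pandas sketch. The function shared_read_fractions is hypothetical, it assumes the abundance tables are already aligned so that equal index labels mean equal features, and it ignores keep_cutoff.

```python
import pandas as pd


def shared_read_fractions(tabs):
    """Return shared features, per-table shared-read fraction, and the best index."""
    shared = set(tabs[0].index)
    for tab in tabs[1:]:
        shared &= set(tab.index)
    shared = sorted(shared)
    # Fraction of each table's total reads that fall on the shared features.
    fractions = [
        float(tab.loc[shared].to_numpy().sum() / tab.to_numpy().sum())
        for tab in tabs
    ]
    best = max(range(len(tabs)), key=fractions.__getitem__)
    return shared, fractions, best


tab_a = pd.DataFrame({"s1": [50, 30, 20]}, index=["OTU1", "OTU2", "OTU3"])
tab_b = pd.DataFrame({"s2": [40, 50, 10]}, index=["OTU1", "OTU2", "OTU4"])
shared, fractions, best = shared_read_fractions([tab_a, tab_b])
```

In this toy case the second table keeps 90 of its 100 reads on shared features against 80 of 100 for the first, so index 1 would be the template under the stated rule.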
- qdiv.sequences.sequence_comparisons.merge_objects(objlist, *, already_aligned=False, different_lengths=False, name_type='OTU', return_type='auto')[source]
Merge multiple microbiome objects and retain all features (OTUs/ASVs/etc).
This function aligns features across all input objects (unless already_aligned=True), concatenates abundance tables (columns are sample names; they are suffix-annotated per input object to avoid collisions), and concatenates sequences and taxonomy tables while removing duplicates. Feature rows are re-ordered by total abundance and renamed using name_type + rank (e.g., “OTU1”, “OTU2”, …).
- Parameters:
objlist (list of dict or MicrobiomeData) –
List of input objects to merge. Each item must contain:
‘tab’ : pd.DataFrame (features × samples abundance table)
‘seq’ : pd.DataFrame (sequences, indexed by feature IDs)
Optionally may contain:
‘tax’ : pd.DataFrame (taxonomy annotations)
‘meta’: pd.DataFrame (sample metadata, indexed by sample names)
Items can be plain dicts or MicrobiomeData instances.
already_aligned (bool, default False) – If True, assumes align has already been applied to the objects and feature names/sequences are harmonized. If False, calls align(…).
different_lengths (bool, default False) – If True, allows alignment of features with different sequence lengths (substring matching). Passed to align when alignment is performed.
name_type (str, default 'OTU') – Prefix for renaming merged features in descending order of total abundance (e.g., “OTU1”, “OTU2”, …).
return_type ({'auto', 'dict', 'microbiome'}, default 'auto') –
Determines the type of object returned:
‘microbiome’: return a MicrobiomeData object.
‘dict’ : return a dict.
‘auto’ : if any input in objlist is a MicrobiomeData, return MicrobiomeData, otherwise return dict.
- Returns:
out – The merged object containing:
‘tab’ : merged abundance table
‘seq’ : merged sequence table
‘tax’ : merged taxonomy table (if present)
‘meta’: merged metadata table (if present)
Return type depends on return_type (see above).
- Return type:
dict or MicrobiomeData
Notes
Sample names in the merged ‘tab’ and ‘meta’ are suffixed with _i where i is the index of the source object in objlist, to avoid collisions.
Features are sorted by total abundance (sum across all samples) and then renamed using name_type and their new rank.
Sequences/taxonomy are de-duplicated with drop_duplicates() after concatenation.
The merged output does not include a phylogenetic tree.
Examples
>>> out = merge_objects([obj1, obj2])
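The table handling described in the notes (suffixing sample names with the source index, concatenating, sorting by total abundance, renaming by rank) can be sketched with pandas as follows. The function merge_tabs is hypothetical and covers only abundance tables that are already aligned; filling features absent from an object with zeros is an assumption.

```python
import pandas as pd


def merge_tabs(tabs, name_type="OTU"):
    """Concatenate aligned abundance tables and rename features by rank (sketch)."""
    # Suffix sample names with the index of the source object to avoid collisions.
    suffixed = [tab.add_suffix(f"_{i}") for i, tab in enumerate(tabs)]
    # Assumption: a feature missing from an object has zero abundance there.
    merged = pd.concat(suffixed, axis=1).fillna(0)
    # Sort by total abundance across all samples, largest first.
    order = merged.sum(axis=1).sort_values(ascending=False).index
    merged = merged.loc[order]
    merged.index = [f"{name_type}{rank}" for rank in range(1, len(merged) + 1)]
    return merged


tab_x = pd.DataFrame({"s1": [5, 1]}, index=["A", "B"])
tab_y = pd.DataFrame({"s1": [10, 2]}, index=["B", "C"])
merged = merge_tabs([tab_x, tab_y])
```

Both inputs have a sample called s1, and the suffixes _0 and _1 keep the two columns distinct in the merged table.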