qdiv.sequences.sequence_comparisons module

qdiv.sequences.sequence_comparisons.sequence_distance_matrix(obj, *, savename='SeqDistMat', path='', band_width=12, save=True, use_numba=True)

Compute pairwise Levenshtein distances between sequences with a banded Wagner–Fischer algorithm. If use_numba=True, a parallelized, Numba-accelerated implementation is used; otherwise a pure-Python fallback is used.

Parameters:
  • obj (MicrobiomeData or dict) – Must provide a DataFrame in obj.seq or obj[‘seq’], indexed by sequence ID, with a column containing the sequences (default column name: ‘seq’).

  • savename (str, optional) – Base filename for CSV outputs. Default ‘SeqDistMat’.

  • path (str, default "") – Directory path (absolute or relative) where output is saved. Use “” for the current working directory.

  • band_width (int, optional) – Sakoe–Chiba band half-width. It is expanded automatically to |len1 - len2| when two sequences differ in length by more than this value. Larger values bring the result closer to the exact DP distance but reduce speed. Default 12.

  • save (bool, optional) – If True, writes two CSV files: the edit distances and the normalized distances. Default True.

  • use_numba (bool, optional) – If True, uses the Numba implementation; otherwise uses the pure-Python implementation. Default True.

Returns:

{
  ‘edits’: pd.DataFrame,        # int distances
  ‘normalized’: pd.DataFrame,   # float in [0, 1]
  ‘meta’: {‘backend’: ‘numba’|‘python’}
}

Return type:

Dict[str, Any]
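The banded dynamic program described above can be sketched in plain Python. This is only an illustration of the technique, not the library's implementation: the names banded_levenshtein and distance_matrices are hypothetical, plain dicts stand in for DataFrames, and normalizing by the longer sequence length is an assumption (the normalization used by qdiv is not stated here).

```python
def banded_levenshtein(a: str, b: str, band_width: int = 12) -> int:
    """Levenshtein distance restricted to a diagonal band (Wagner-Fischer DP)."""
    n, m = len(a), len(b)
    # The band must cover the length difference, otherwise the final
    # cell of the DP table would be unreachable.
    w = max(band_width, abs(n - m))
    inf = n + m + 1  # larger than any possible edit distance
    prev = [j if j <= w else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        if i <= w:
            cur[0] = i
        # Only cells with |i - j| <= w are evaluated.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # match or substitution
        prev = cur
    return prev[m]


def distance_matrices(seqs: dict, band_width: int = 12):
    """Return (edits, normalized) as nested dicts keyed by sequence ID."""
    ids = list(seqs)
    edits = {i: {j: 0 for j in ids} for i in ids}
    normalized = {i: {j: 0.0 for j in ids} for i in ids}
    for x, i in enumerate(ids):
        for j in ids[x + 1:]:
            d = banded_levenshtein(seqs[i], seqs[j], band_width)
            longest = max(len(seqs[i]), len(seqs[j]))
            edits[i][j] = edits[j][i] = d
            # Assumption: normalize by the longer of the two sequences.
            normalized[i][j] = normalized[j][i] = d / longest if longest else 0.0
    return edits, normalized


edits, normalized = distance_matrices({"a": "ACGT", "b": "ACGA", "c": "ACG"})
```

In this sketch a band that is too narrow can only overestimate a distance, because it removes candidate alignments from the search; widening the band trades speed for exactness.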

qdiv.sequences.sequence_comparisons.tree_distance_matrix(obj, *, savename='TreeDistMat', path='', save=True, file_format='csv', use_tqdm=True)

Compute pairwise phylogenetic distances between leaf nodes in a tree.

Parameters:
  • obj (MicrobiomeData or dict) – Must provide a DataFrame in obj.tree or obj[‘tree’]. The DataFrame must contain at least the columns ‘nodes’, ‘parent’, and ‘dist_to_root’. A ‘branchL’ column is optional; ‘dist_to_root’ is assumed to be correct.

  • savename (str, optional) – Base filename for outputs. Default ‘TreeDistMat’.

  • path (str, default "") – Directory path (absolute or relative) where output is saved. Use “” for the current working directory.

  • save (bool, optional) – If True, writes a CSV or compressed npz file, depending on file_format.

  • file_format (str {'csv'|'compressed'|'npz'}) – If ‘csv’, saves a full distance matrix CSV. If ‘compressed’ or ‘npz’, saves a triangular matrix in a compressed npz.

  • use_tqdm (bool, default=True) – Use tqdm for progress bars.

Returns:

Symmetric distance matrix with leaf node names as both rows and columns.

Return type:

pandas.DataFrame
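One common way to obtain leaf-to-leaf distances from a table of nodes, parents, and root distances is through the lowest common ancestor (LCA): d(a, b) = dist_to_root(a) + dist_to_root(b) - 2 * dist_to_root(LCA). The sketch below illustrates that idea under stated assumptions and is not taken from the qdiv source: leaf_distances is a hypothetical name, plain lists stand in for the DataFrame columns, leaves are taken to be nodes that never appear as a parent, and the root is assumed to have a parent that is None, itself, or absent from ‘nodes’.

```python
def leaf_distances(nodes, parent, dist_to_root):
    """Pairwise distances between leaves via their lowest common ancestor."""
    parent_of = dict(zip(nodes, parent))
    depth = dict(zip(nodes, dist_to_root))
    parents = set(parent)
    # Assumption: a leaf is a node that is nobody's parent.
    leaves = [n for n in nodes if n not in parents]

    def path_to_root(node):
        # Walk upwards until the parent is unknown (e.g. None) or a self-loop.
        path = [node]
        while parent_of.get(path[-1]) in parent_of and parent_of[path[-1]] != path[-1]:
            path.append(parent_of[path[-1]])
        return path

    paths = {leaf: path_to_root(leaf) for leaf in leaves}
    dist = {a: {b: 0.0 for b in leaves} for a in leaves}
    for x, a in enumerate(leaves):
        on_path_a = set(paths[a])
        for b in leaves[x + 1:]:
            # First ancestor of b that also lies on a's path is the LCA.
            lca = next(n for n in paths[b] if n in on_path_a)
            d = depth[a] + depth[b] - 2 * depth[lca]
            dist[a][b] = dist[b][a] = d
    return dist


tree_dist = leaf_distances(
    nodes=["root", "A", "leaf1", "leaf2", "leaf3"],
    parent=[None, "root", "A", "A", "root"],
    dist_to_root=[0.0, 1.0, 1.5, 2.0, 3.0],
)
```

The nested dict returned here is symmetric with zeros on the diagonal, mirroring the shape of the DataFrame described under Returns.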

qdiv.sequences.sequence_comparisons.load_compressed_matrix(filename, path='')

Load a distance matrix DataFrame saved in NumPy compressed format (npz). The filename is assumed to end with npz.

Parameters:
  • filename (str) – Filename of distance matrix

  • path (str, default "") – Directory path (absolute or relative) where the file is located. Use “” for the current working directory.

Return type:

DataFrame
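Because a distance matrix is symmetric with a zero diagonal, only one triangle needs to be stored. The sketch below shows that idea with plain lists. It is an assumption-laden illustration: pack_upper and unpack_upper are hypothetical helpers, and the actual array names and layout inside the npz file written by tree_distance_matrix are not documented here.

```python
def pack_upper(matrix):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def unpack_upper(flat, n):
    """Rebuild the full symmetric matrix (zero diagonal) from the triangle."""
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            full[i][j] = full[j][i] = flat[k]
            k += 1
    return full


full = [[0.0, 1.5, 4.5],
        [1.5, 0.0, 5.0],
        [4.5, 5.0, 0.0]]
flat = pack_upper(full)              # n * (n - 1) / 2 values instead of n * n
restored = unpack_upper(flat, n=3)
```

Row and column labels would have to be stored alongside the packed values so that the loader can restore a labelled DataFrame.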

qdiv.sequences.sequence_comparisons.align(objlist, *, different_lengths=False, name_type='OTU')

Align feature names across multiple objects containing sequences (‘seq’).

Works with either dicts or MicrobiomeData objects. Returns the same types as inputs.

Parameters:
  • objlist (list of dict or MicrobiomeData) –

    List of input objects to align. Each object must contain at least:

    • ‘seq’: sequence table (features x sequence)

    Optionally:

    • ‘tab’: abundance table (features x samples)

    • ‘tax’: taxonomy table

    • ‘meta’: sample metadata

    • ‘tree’: sample tree data

  • different_lengths (bool, optional) – If True, allows alignment of features with different sequence lengths (substring matching, O(n^2) over unique sequences). Default is False.

  • name_type (str, optional) – Prefix for renaming aligned features (e.g., ‘OTU’, ‘ASV’). Default is ‘OTU’.

Returns:

aligned_objects – List of aligned objects, of the same types as the input.

Return type:

list of dict or MicrobiomeData

Notes

  • Duplicate sequences within each object are collapsed (abundance tables summed, taxonomy taken from the first occurrence).

  • Feature names are harmonized across all objects.

  • If different_lengths=True, substring matching is used for alignment.
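The name harmonization described in these notes can be sketched as follows. This is a simplified illustration, not the library's code: align_names is a hypothetical helper, each sequence table is a plain dict mapping feature ID to sequence, and treating the first-seen sequence as the representative in substring matching is an assumption.

```python
def align_names(seq_tables, different_lengths=False, name_type="OTU"):
    """Give identical sequences the same feature name in every table."""
    reps = []  # representative sequences, in first-seen order

    def find(seq):
        # Linear scan over known sequences: O(n^2) overall.
        for k, rep in enumerate(reps):
            if seq == rep:
                return k
            # Assumption: with different_lengths, one sequence containing
            # the other counts as the same feature.
            if different_lengths and (seq in rep or rep in seq):
                return k
        return None

    aligned = []
    for table in seq_tables:
        renamed = {}
        for seq in table.values():
            k = find(seq)
            if k is None:
                reps.append(seq)
                k = len(reps) - 1
            # Duplicate sequences within one table collapse to one entry.
            renamed.setdefault(f"{name_type}{k + 1}", seq)
        aligned.append(renamed)
    return aligned


table1 = {"x1": "ACGT", "x2": "TTTT", "x3": "ACGT"}
table2 = {"y1": "TTTT", "y2": "ACG"}
exact = align_names([table1, table2])
fuzzy = align_names([table1, table2], different_lengths=True)
```

With exact matching, "ACG" in the second table becomes a new feature; with different_lengths=True it is matched to "ACGT" and shares its name. A real implementation would also sum the abundance rows of collapsed duplicates, which this sketch leaves out.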

qdiv.sequences.sequence_comparisons.consensus(objlist, *, keep_object='best', already_aligned=False, different_lengths=False, name_type='OTU', keep_cutoff=0.2, only_return_seq=False, return_type='auto')

Build a consensus object based on features found in all input objects.

This function aligns features (e.g., ASVs/OTUs) across multiple microbiome data objects, identifies features shared by all, and constructs a consensus abundance table, sequences, taxonomy, and metadata. Optionally, features with high abundance in any object are retained even if not shared. The result can be returned as a dictionary or as a MicrobiomeData object.

Parameters:
  • objlist (list of dict or MicrobiomeData) –

    List of input objects to merge. Each object must contain at least:

    • ‘tab’ : pd.DataFrame (abundance table, features x samples)

    • ‘seq’ : pd.DataFrame (sequences, indexed by feature IDs)

    Optionally:

    • ‘tax’ : pd.DataFrame (taxonomy annotations)

    • ‘meta’: pd.DataFrame (sample metadata)

    Objects can be either plain dicts or MicrobiomeData instances.

  • keep_object ({'best', int}, default 'best') –

    Determines which input object to use as the template for consensus:

    • ‘best’: the object with the largest fraction of reads mapped to shared features.

    • int: index of the object to use (0 = first, 1 = second, etc.).

  • already_aligned (bool, default False) – If True, assumes that features are already aligned across objects. If False, runs the alignment step.

  • different_lengths (bool, default False) – If True, allows alignment of features with different sequence lengths (substring matching).

  • name_type (str, default 'OTU') – Prefix for renaming consensus features (e.g., “OTU1”, “OTU2”, …).

  • keep_cutoff (float, default 0.2) – Relative abundance cutoff (%) for retaining features that are not shared by all objects, but are highly abundant in at least one object.

  • only_return_seq (bool, default False) – If True, only returns a DataFrame of shared sequences (plus the info dictionary). No consensus object is constructed.

  • return_type ({'auto', 'dict', 'microbiome'}, default 'auto') –

    Determines the type of object returned (unless only_return_seq is True):

    • ‘microbiome’: always return a MicrobiomeData object (except when only_return_seq=True).

    • ‘dict’: always return a dictionary (legacy behavior).

    • ‘auto’: return a MicrobiomeData object if any input was a MicrobiomeData; otherwise, return a dict.

Returns:

  • cons_obj (dict or MicrobiomeData or pd.DataFrame) –

    The consensus object containing:

    • ‘tab’: abundance table (features x samples)

    • ‘seq’: sequence table (features x sequence)

    • ‘tax’: taxonomy table (optional)

    • ‘meta’: metadata table (optional)

    If return_type=‘microbiome’ or ‘auto’ (with any MicrobiomeData input), returns a MicrobiomeData object. If return_type=‘dict’, returns a dictionary. If only_return_seq=True, returns a DataFrame of shared sequences.

  • info (dict) –

    Dictionary with summary statistics about consensus construction, including:

    • ‘kept_object_index’: index of the selected template object

    • ‘all_objects’: per-object statistics (consensus abundance, lost reads/features)

    • ‘selected_object’: statistics for the selected object

Return type:

Tuple[Dict[str, DataFrame] | MicrobiomeData | DataFrame, Dict[str, Any]]

Notes

  • The consensus object does not include a phylogenetic tree, even if present in the inputs.

  • Feature indices are re-ordered by average abundance and renamed using name_type.

  • If only_return_seq is True, only the shared sequences DataFrame and info are returned.

  • The function automatically aligns features unless already_aligned is True.
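The keep_object='best' rule can be illustrated with a small sketch: find the features present in every object, then pick the object in which those shared features account for the largest fraction of reads. The helper pick_template is hypothetical, each table is reduced to a dict mapping an already aligned feature name to its total read count, and keep_cutoff handling is left out because its exact definition of relative abundance is not given here.

```python
def pick_template(tables):
    """Return (shared features, per-object shared-read fraction, best index)."""
    # Features found in every input object.
    shared = set.intersection(*(set(t) for t in tables))
    fractions = []
    for t in tables:
        total = sum(t.values())
        mapped = sum(t[f] for f in shared)
        fractions.append(mapped / total if total else 0.0)
    # 'best' = object with the largest fraction of reads in shared features.
    best = max(range(len(tables)), key=fractions.__getitem__)
    return shared, fractions, best


tab1 = {"OTU1": 80, "OTU2": 15, "OTU3": 5}
tab2 = {"OTU1": 50, "OTU2": 20, "OTU4": 30}
shared, fractions, best = pick_template([tab1, tab2])
```

Here the first object would be selected, since 95 of its 100 reads fall in shared features compared with 70 of 100 for the second.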

Examples

>>> cons_obj, info = consensus([obj1, obj2], keep_object='best')
>>> print(type(cons_obj))
<class 'MicrobiomeData'>
>>> cons_obj.info()
>>> print(info)
>>> # To get a dict instead of MicrobiomeData:
>>> cons_dict, info = consensus([obj1, obj2], return_type='dict')
>>> # To get only the shared sequences:
>>> seq_df, info = consensus([obj1, obj2], only_return_seq=True)

qdiv.sequences.sequence_comparisons.merge_objects(objlist, *, already_aligned=False, different_lengths=False, name_type='OTU', return_type='auto')

Merge multiple microbiome objects and retain all features (OTUs/ASVs/etc).

This function aligns features across all input objects (unless already_aligned=True), concatenates abundance tables (columns are sample names; they are suffix-annotated per input object to avoid collisions), and concatenates sequences and taxonomy tables while removing duplicates. Feature rows are re-ordered by total abundance and renamed using name_type + rank (e.g., “OTU1”, “OTU2”, …).

Parameters:
  • objlist (list of dict or MicrobiomeData) –

    List of input objects to merge. Each item must contain:

    • ‘tab’ : pd.DataFrame (features × samples abundance table)

    • ‘seq’ : pd.DataFrame (sequences, indexed by feature IDs)

    Optionally may contain:

    • ‘tax’ : pd.DataFrame (taxonomy annotations)

    • ‘meta’: pd.DataFrame (sample metadata, indexed by sample names)

    Items can be plain dicts or MicrobiomeData instances.

  • already_aligned (bool, default False) – If True, assumes align has already been applied to the objects and feature names/sequences are harmonized. If False, calls align(…).

  • different_lengths (bool, default False) – If True, allows alignment of features with different sequence lengths (substring matching). Passed to align when alignment is performed.

  • name_type (str, default 'OTU') – Prefix for renaming merged features in descending order of total abundance (e.g., “OTU1”, “OTU2”, …).

  • return_type ({'auto', 'dict', 'microbiome'}, default 'auto') –

    Determines the type of object returned:

    • ‘microbiome’: return a MicrobiomeData object.

    • ‘dict’ : return a dict.

    • ‘auto’ : if any input in objlist is a MicrobiomeData, return MicrobiomeData, otherwise return dict.

Returns:

out – The merged object containing:

  • ‘tab’ : merged abundance table

  • ‘seq’ : merged sequence table

  • ‘tax’ : merged taxonomy table (if present)

  • ‘meta’: merged metadata table (if present)

Return type depends on return_type (see above).

Return type:

dict or MicrobiomeData

Notes

  • Sample names in the merged ‘tab’ and ‘meta’ are suffixed with _i where i is the index of the source object in objlist, to avoid collisions.

  • Features are sorted by total abundance (sum across all samples) and then renamed using name_type and their new rank.

  • Sequences/taxonomy are de-duplicated with drop_duplicates() after concatenation.

  • The merged output does not include a phylogenetic tree.
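The merge steps listed in these notes (suffix the sample names, combine the tables, rank by total abundance, rename) can be sketched with nested dicts. This is an illustrative simplification, not the library's code: merge_tabs is a hypothetical helper, feature names are assumed to be aligned already, the suffix index is assumed to be zero-based, and missing counts are filled with 0.

```python
def merge_tabs(tabs, name_type="OTU"):
    """Merge abundance tables given as {feature: {sample: count}} dicts."""
    merged = {}
    samples = []
    for i, tab in enumerate(tabs):
        for feature, counts in tab.items():
            for sample, count in counts.items():
                # Suffix with the object's index to avoid name collisions.
                col = f"{sample}_{i}"
                if col not in samples:
                    samples.append(col)
                merged.setdefault(feature, {})[col] = count
    # A feature absent from an object gets zero counts for its samples.
    for row in merged.values():
        for col in samples:
            row.setdefault(col, 0)
    # Rank by total abundance (descending) and rename by rank.
    ranked = sorted(merged, key=lambda f: sum(merged[f].values()), reverse=True)
    return {f"{name_type}{rank}": merged[f] for rank, f in enumerate(ranked, start=1)}


tab_a = {"f1": {"S1": 5, "S2": 1}, "f2": {"S1": 10, "S2": 10}}
tab_b = {"f1": {"S1": 2}, "f3": {"S1": 40}}
merged_tab = merge_tabs([tab_a, tab_b])
```

Both inputs contain a sample called S1; after merging they appear as S1_0 and S1_1, and the most abundant feature (f3 in this toy input) is renamed OTU1.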

Examples

>>> out = merge_objects([obj1, obj2])