Theoretical Foundations of Mycelia Assembly Algorithms

Mycelia's assembly algorithms are built on rigorous mathematical foundations derived from extensive research documented in the Mycelia-Dev notebooks. This document provides the theoretical background, design rationales, and mathematical principles that guide the implementation choices in Mycelia.

Mathematical Framework for Assembly

Core Assembly Paradigm: Probabilistic Expectation-Maximization

Mycelia treats genome assembly as a probabilistic inference problem, unifying the strengths of both De Bruijn graph and overlap-layout-consensus approaches through a generalized framework:

Algorithm Overview:

Create De Bruijn assembly graph from sequencing reads
Convert graph structure into probabilistic Hidden Markov Model (HMM)
Apply Viterbi algorithm for maximum likelihood error correction
Iterate until convergence at current k-mer size
Increment k-mer size and repeat until final convergence

Theoretical Foundation: The approach treats assembly as an Expectation-Maximization (EM) problem:

E-step: Calculate maximum likelihood paths through the assembly graph
M-step: Update graph structure based on error-corrected reads

This probabilistic framework provides mathematical rigor that replaces heuristic graph cleaning decisions with principled, optimal solutions.

K-mer Size Selection Theory

Error Rate Mathematical Foundation

The optimal k-mer size selection follows mathematically derived principles:

Lower Bound Formula:

lower_bound_k = max(3, floor(1/error_rate - 1))

Rationale: This ensures that error-prone k-mers remain distinguishable from true genomic k-mers based on frequency distributions.

Examples:

1% error rate → k ≥ 99
5% error rate → k ≥ 19
10% error rate → k ≥ 9

Sequence Length Optimization

Log₄ Pattern Discovery: Empirical analysis revealed that optimal starting k-mer sizes follow a log₄(sequence_length) pattern:

100bp sequences → k = 3
1,000bp sequences → k = 5
10,000bp sequences → k = 7
100,000bp sequences → k = 8

Biological Rationale: This pattern corresponds to the divergence point where erroneous k-mers begin to dominate true k-mers in the frequency spectrum.

Coverage vs. Error Rate Trade-offs

Key Finding: A 10x increase in sequencing coverage contributes more noise to the k-mer spectrum than a 10x increase in error rate.

Implication: Conservative error rate assumptions with moderate coverage are preferable to high coverage with relaxed quality filtering.

Dynamic Prime Pattern Algorithm

Mathematical Elegance of Prime Selection

Mycelia employs a dynamic k-mer selection algorithm that exploits the mathematical properties of prime number distribution:

Algorithm:

function dynamic_k_prime_pattern(start_prime=11, max_k=101, initial_step=2)
    k_sequence = [start_prime]
    current_k = start_prime
    step = initial_step
    
    while true
        next_k = current_k + step
        if next_k > max_k || !isprime(next_k)
            break
        end
        push!(k_sequence, next_k)
        current_k = next_k
        step += 2  # Progressive spacing
    end
    
    return k_sequence
end

Advantages:

Twin Prime Avoidance: Automatically skips one member of twin prime pairs, reducing redundant analysis
Progressive Spacing: Increasing gaps minimize computational overlap while maintaining coverage
Hardware Optimization: k=31 fits optimally in 64-bit integers (2 bits per base × 32 bases)

Biological Rationale for Prime K-mers

Theoretical Foundation: Prime k-mer lengths cannot form perfect repeats, reducing spurious matches and improving assembly specificity. Additionally, odd k-mers cannot be reverse complements of themselves, avoiding palindromic complications.

Graph Theory Foundations

Explicit vs. Inferred Edge Storage

Core Principle: Mycelia stores only edges that were actually observed in the data, not all theoretically possible k-mer neighbors.

Mathematical Justification:

As k increases, the probability of unobserved neighboring k-mers existing decreases asymptotically toward zero
For small k and large datasets, false positive edge rates can be extremely high
Explicit storage provides O(n) space complexity vs. O(4^k) for complete neighbor graphs

Statistical Tip Clipping Algorithm

Methodology: Remove graph tips using statistical thresholds relative to connected component coverage distributions.

Algorithm:

function statistical_tip_clipping(graph; std_dev_multiplier=3.0)
    for component in connected_components(graph)
        coverage_values = get_coverage_values(graph, component)
        median_coverage = median(coverage_values)
        coverage_std = std(coverage_values)
        threshold = median_coverage - std_dev_multiplier * coverage_std
        
        for tip in identify_tips(component)
            if get_coverage(tip) < threshold || get_coverage(tip) == 1
                remove_node!(graph, tip)
            end
        end
    end
end

Statistical Foundation: Uses 3σ rule to identify coverage outliers, preserving high-confidence tips while removing error artifacts.

Metagenomic Classification Theory

MAPQ Score Reinterpretation

Critical Insight: In metagenomic contexts, MAPQ=0 indicates taxonomic ambiguity rather than poor alignment quality.

Theoretical Correction:

Traditional approach: Filter out MAPQ=0 alignments
Mycelia approach: Retain MAPQ=0 alignments, weight by alignment scores

Mathematical Framework:

function weighted_taxonomic_assignment(alignments)
    weighted_assignments = []
    for alignment in alignments
        weight = alignment.score  # NOT alignment.mapq
        taxon = get_taxonomy(alignment.reference)
        push!(weighted_assignments, (taxon, weight))
    end
    return normalize_weights(weighted_assignments)
end

Impact: Prevents loss of biologically meaningful information from closely related taxa that receive MAPQ=0 due to mapping ambiguity.

Strain-Level Clustering with FastANI

Threshold Selection: 99.5% Average Nucleotide Identity (ANI) threshold for strain-level clustering.

Biological Rationale: This threshold corresponds to the conventional species boundary, enabling strain-resolution within species while maintaining taxonomic coherence.

Assembly Quality Assessment

Graph Distance Metrics

Approach: Use Jaccard similarity between k-mer sets and edge sets to quantify assembly accuracy.

Metrics:

kmer_distance = 1 - jaccard(assembly_kmers, reference_kmers)
edge_distance = 1 - jaccard(assembly_edges, reference_edges)

Theoretical Advantage: Provides quantitative, parameter-free assessment of assembly quality relative to ground truth.

Viterbi-Based Error Correction

Hidden Markov Model Framework:

States: K-mer positions with associated quality scores
Transitions: Observed k-mer adjacencies in assembly graph
Emissions: Quality score-weighted k-mer observations

Likelihood Calculation:

error_probability = 10^(quality_score / -10)
state_likelihood = (1 - error_probability)^matches × error_probability^mismatches

Genomic Grammar and Zipf's Law

Linguistic Properties of Genomic Sequences

Theoretical Discovery: Genomic sequences follow Zipf's law (power-law distribution) similar to natural languages.

Implications:

Frequency Distributions: K-mer frequencies exhibit log-log linear patterns
Biological Grammar: Non-coding regions particularly follow linguistic-like structures
Assembly Strategy: Rare k-mers may represent genuine biological signal rather than noise

Mathematical Model:

frequency(k-mer) ∝ rank^(-α)

where α ≈ 1 for natural languages and genomic sequences.

Strain-Resolved Assembly Framework

Comprehensive Quality Metrics

Strain-Specific Metrics:

strain_recall = true_strains_recovered / total_true_strains
strain_precision = correct_strain_calls / total_strain_calls
strain_f1_score = 2 × (precision × recall) / (precision + recall)

Assembly Contiguity:

NGA50: N50 based on aligned contigs to reference
largest_alignment: Longest single alignment length
total_aligned_length: Total successfully aligned sequence

Technology-Specific Assessments:

homopolymer_accuracy: Accuracy in homopolymer-rich regions
repeat_resolution_rate: Fraction of repetitive elements properly resolved

Confidence-Aware Assembly

Per-Base Confidence Scoring: Each assembled base receives a confidence score based on:

Supporting read depth
Base quality scores of supporting reads
Graph topology confidence (branching vs. linear regions)
Strain assignment probability

Computational Complexity Considerations

Progressive K-mer Strategy

Complexity Reduction: Starting with small k-values and progressively increasing reduces computational complexity from O(4^k) to O(log k) in practice.

Convergence Criteria:

No corrections made in current iteration
Graph structure stabilizes between iterations
Quality metrics plateau across k-increments

Memory-Efficient Graph Representation

Canonical K-mer Storage: Store only canonical form of each k-mer, reducing memory by ~50%.

Sparse Edge Representation: Use adjacency lists rather than dense matrices, achieving O(E) vs. O(V²) space complexity.

Integration with Modern Sequencing Technologies

Long-Read Assembly Considerations

Overlap-Based Framework: For error-prone long reads, minimum overlap length follows:

minimum_overlap = log₄(genome_size)

Hybrid Strategy: Combine exact k-mer matching (short reads) with approximate overlaps (long reads) using probabilistic framework.

Real-Time Assembly

Streaming Algorithm Design: Progressive assembly enables real-time processing of sequencing data as it becomes available.

Hardware Acceleration: GPU/FPGA implementations can accelerate Viterbi algorithm computations for production-scale applications.

Validation and Benchmarking

Cross-Validation Framework

Statistical Validation: Bootstrap resampling with confidence interval calculation for robust parameter optimization.

Multi-Scale Validation: Simultaneous analysis using prime k-mer sizes (3,5,7,11,13,17,19) for comprehensive assessment.

Benchmarking Methodology

Ground Truth Comparison: Use simulated data with known ground truth for algorithm validation.

Comparative Analysis: Quantitative comparison against established assemblers using standardized metrics.

Research Foundations

This theoretical framework is derived from extensive research documented in the Mycelia-Dev repository, representing several years of algorithm development and validation. The mathematical foundations provide:

Reproducible Results: Deterministic algorithms with well-defined parameters
Scalable Solutions: Theoretical framework adapts to different sequencing technologies
Optimal Performance: Maximum likelihood approaches replace heuristic approximations
Biological Relevance: Algorithms account for genuine biological sequence properties

The integration of probability theory, graph algorithms, information theory, and computational biology provides a comprehensive foundation for next-generation genome assembly tools that achieve both accuracy and biological relevance.