API Reference

This page provides comprehensive documentation for all public functions in Mycelia, automatically generated from docstrings.

Table of Contents

Assembly Functions

Intelligent Assembly

Mycelia.mycelia_assembleMethod

Main Mycelia intelligent assembly algorithm. Implements iterative prime k-mer progression with error correction.

source
Mycelia.mycelia_assembleMethod
mycelia_assemble(
    input_file::Union{String, Vector{String}};
    output_dir,
    max_k,
    memory_limit,
    verbose
) -> Dict

User-friendly wrapper for intelligent assembly that accepts file paths.

Arguments

  • input_file::Union{String, Vector{String}}: Path to FASTQ file(s) or directory containing FASTQ files
  • output_dir::String: Directory to write assembly results (default: "mycelia_assembly")
  • max_k::Int: Maximum k-mer size to try (default: 101)
  • memory_limit::Int: Memory limit in bytes (default: 32GB)
  • verbose::Bool: Print progress information (default: true)

Returns

  • Assembly results dictionary with contigs, statistics, and metadata

Examples

# Assemble single file
results = Mycelia.mycelia_assemble("reads.fastq")

# Assemble with custom parameters
results = Mycelia.mycelia_assemble("reads.fastq", 
                                  output_dir="my_assembly",
                                  max_k=51, 
                                  memory_limit=4_000_000_000)

# Assemble multiple files (will be merged)
results = Mycelia.mycelia_assemble(["reads1.fastq", "reads2.fastq"])

Notes

  • Input files can be gzipped (.gz extension)
  • Uses intelligent k-mer selection and error correction
  • Automatically creates output directory if it doesn't exist
  • Progress is saved incrementally for long assemblies
source

Assembly Utilities

Mycelia.dynamic_k_prime_patternFunction
dynamic_k_prime_pattern() -> Vector{Int64}
dynamic_k_prime_pattern(start_prime::Int64) -> Vector{Int64}
dynamic_k_prime_pattern(
    start_prime::Int64,
    max_k::Int64
) -> Vector{Int64}
dynamic_k_prime_pattern(
    start_prime::Int64,
    max_k::Int64,
    initial_step::Int64
) -> Vector{Int64}

Generate optimal k-mer sizes using dynamic prime pattern algorithm.

Based on the novel k-primes pattern from Mycelia-Dev research. This algorithm generates a sequence of prime numbers with progressively increasing gaps, providing computational efficiency through:

  • Twin prime avoidance (skips one member of twin prime pairs)
  • Progressive spacing reduces computational overlap
  • Built-in prime discovery without pre-computation

Arguments

  • start_prime::Int: Initial prime number (default: 11, optimal for strain-level resolution)
  • max_k::Int: Maximum k-mer size to consider (default: 101)
  • initial_step::Int: Initial step size (default: 2)

Returns

  • Vector{Int}: Sequence of prime k-mer sizes with progressive spacing

Algorithm

  1. Start with initial prime and step size
  2. Each iteration: current_prime += step, then step += 2
  3. Continue while result remains prime and ≤ max_k
  4. Natural stopping when non-prime encountered

Example

# Generates: [11, 13, 17, 23, 31, 41, 53, 67, 83, 101]
ks = dynamic_k_prime_pattern(11, 101, 2)

# For error-prone data, start smaller
ks = dynamic_k_prime_pattern(7, 51, 2)  # [7, ...]

References

Based on k-primes-pattern.ipynb from Mycelia-Dev research demonstrating mathematical elegance of using prime number distribution for genomic analysis optimization.

source
Mycelia.error_optimized_k_sequenceFunction
error_optimized_k_sequence(
    error_rate::Float64
) -> Vector{Int64}
error_optimized_k_sequence(
    error_rate::Float64,
    max_k::Int64
) -> Vector{Int64}
error_optimized_k_sequence(
    error_rate::Float64,
    max_k::Int64,
    sequence_length::Union{Nothing, Int64}
) -> Vector{Int64}

Generate k-mer sizes optimized for specific error rates using mathematical foundation.

Combines the error rate formula from Mycelia-Dev research with dynamic prime pattern generation for optimal k-mer selection.

Arguments

  • error_rate::Float64: Expected sequencing error rate (0.0-1.0)
  • max_k::Int: Maximum k-mer size to consider (default: 101)
  • sequence_length::Union{Int, Nothing}: Optional sequence length for log4 optimization

Returns

  • Vector{Int}: Optimally spaced prime k-mer sizes

Algorithm

  1. Calculate lower bound: lower_bound_k = max(3, Int(floor(1/error_rate - 1)))
  2. If sequencelength provided, optimize using log4(sequencelength) pattern
  3. Generate dynamic prime pattern starting from optimal lower bound

Example

# For 1% error rate data
ks = error_optimized_k_sequence(0.01)  # Starts around k=99

# For 5% error rate with known sequence length  
ks = error_optimized_k_sequence(0.05, 101, 10000)  # Optimized for 10kb sequences

Mathematical Foundation

  • Lower bound formula: k >= 1/error_rate - 1
  • Log4 optimization: optimal_k ≈ log4(sequence_length) for divergence point
  • Progressive spacing reduces redundant analysis

References

Based on 2020-12-22 k-mer size selection and 2020-12-19 error rate analysis from Mycelia-Dev research.

source

Qualmer Analysis

Core Qualmer Types

Mycelia.QualmerType
struct Qualmer{KmerT, K}

A quality-aware k-mer that pairs sequence k-mers with their corresponding quality scores.

Fields

  • kmer::Any

  • qualities::Tuple{Vararg{UInt8, K}} where K

Examples

using Mycelia
using Kmers

# Create a DNA k-mer with quality scores
kmer = DNAKmer("ATCG")
qualities = [30, 35, 32, 28]  # Phred quality scores
qmer = Qualmer(kmer, qualities)

# Access individual bases and qualities
base, quality = qmer[1]  # Returns ('A', 30)

This structure is essential for quality-aware sequence analysis and assembly algorithms that consider both sequence content and sequencing quality information.

source
Mycelia.QualmerVertexDataType
struct QualmerVertexData{QualmerT<:Mycelia.Qualmer}

Vertex data for quality-aware k-mer graphs that stores k-mer observations with quality scores.

Fields

  • canonical_qualmer::Mycelia.Qualmer

  • observations::Array{Mycelia.QualmerObservation{QualmerT}, 1} where QualmerT<:Mycelia.Qualmer

  • joint_probability::Float64

  • coverage::Int64

  • mean_quality::Float64

This structure aggregates multiple observations of the same k-mer and computes quality-based statistics for use in quality-aware assembly algorithms.

source
Mycelia.QualmerEdgeDataType
struct QualmerEdgeData

Edge data for quality-aware k-mer graphs that connects k-mer vertices with strand information.

Fields

  • observations::Vector{Tuple{Int64, Int64}}

  • src_strand::Mycelia.StrandOrientation

  • dst_strand::Mycelia.StrandOrientation

  • weight::Float64

  • quality_weight::Float64

This structure represents connections between k-mers in quality-aware graphs, including strand orientation and quality-weighted edge weights for assembly algorithms.

source

Qualmer Generation

Mycelia.qualmers_fwFunction
qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

source
qualmers_fw(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate forward qualmers from a FASTQ record. Extracts all k-mers in the forward direction with their associated quality scores from the FASTQ record.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
Mycelia.qualmers_canonicalFunction
qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#264#265"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate canonical DNA qualmers from sequence and quality scores. For each k-mer, returns the lexicographically smaller of the k-mer and its reverse complement, ensuring consistent representation regardless of strand.

Arguments

  • sequence: DNA sequence as LongDNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
source
qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate canonical amino acid qualmers from protein sequence and quality scores. For amino acid sequences, canonical representation is the same as forward representation since proteins don't have reverse complements.

Arguments

  • sequence: Protein sequence as LongAA
  • quality: Vector of quality scores corresponding to each amino acid
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
qualmers_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate canonical qualmers from a FASTQ record. Extracts k-mers in canonical representation (lexicographically smaller of forward and reverse complement) with their associated quality scores.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
source
Mycelia.qualmers_unambiguousFunction
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#260#261"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous DNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, T, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

  • sequence: DNA sequence as LongDNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#262#263"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous RNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, U, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

  • sequence: RNA sequence as LongRNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate amino acid qualmers from protein sequence and quality scores. For amino acid sequences, all k-mers are considered unambiguous, so this falls back to the forward qualmer generation.

Arguments

  • sequence: Protein sequence as LongAA
  • quality: Vector of quality scores corresponding to each amino acid
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
qualmers_unambiguous(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate unambiguous qualmers from a FASTQ record. Extracts only k-mers that contain unambiguous nucleotides (no N bases) with their associated quality scores.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
Mycelia.qualmers_unambiguous_canonicalFunction
qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#266#267"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

source

Quality Analysis

Mycelia.phred_to_probabilityFunction
phred_to_probability(phred_score::UInt8) -> Float64

Convert PHRED quality score to probability of correctness. PHRED score Q relates to error probability P by: Q = -10 * log10(P) Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)

source
Mycelia.qualmer_correctness_probabilityFunction
qualmer_correctness_probability(
    qmer::Mycelia.Qualmer
) -> Float64

Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))

source
Mycelia.joint_qualmer_probabilityFunction
joint_qualmer_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Joint probability that all observations are correct
source
Mycelia.position_wise_joint_probabilityFunction
position_wise_joint_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Position-wise joint probability
source

Graph Construction

Mycelia.build_qualmer_graphFunction
build_qualmer_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    k,
    graph_mode,
    min_quality,
    min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.

Arguments

  • fastq_records: Vector of FASTQ records with quality scores
  • k: K-mer size
  • graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
  • min_quality: Minimum average PHRED quality to include k-mer (default: 10)
  • min_coverage: Minimum coverage (observations) to include k-mer (default: 1)

Returns

  • MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData
source

String Graphs

N-gram Analysis

Mycelia.ngramsFunction
ngrams(s::AbstractString, n::Int64) -> Any

Extract all n-grams from a string. Returns a vector containing all contiguous substrings of length n from the input string.

Arguments

  • s: Input string to extract n-grams from
  • n: Length of each n-gram (must be positive)

Returns

  • Vector{String}: All n-grams of length n from the string

Examples

ngrams("hello", 3)  # Returns ["hel", "ell", "llo"]
ngrams("abcd", 2)   # Returns ["ab", "bc", "cd"]
source
Mycelia.string_to_ngram_graphFunction
string_to_ngram_graph(; s, n)

Build a directed graph from string n-grams with edge and vertex weights. Creates a MetaGraph where vertices represent unique n-grams and edges represent transitions between consecutive n-grams in the original string.

Arguments

  • s: Input string to build graph from
  • n: N-gram size (length of each substring)

Returns

  • MetaGraphsNext.MetaGraph: Directed graph with:
    • Vertex labels: String n-grams
    • Vertex data: Integer counts of n-gram occurrences
    • Edge data: Integer counts of n-gram transitions

Details

The graph construction process:

  1. Extract all n-grams from the string
  2. Count occurrences of each unique n-gram (stored as vertex data)
  3. Count transitions between consecutive n-grams (stored as edge data)
  4. Build a directed graph representing the n-gram sequence structure

Examples

graph = string_to_ngram_graph(s="abcabc", n=2)
# Creates graph with vertices "ab", "bc", "ca" and weighted edges
source
Mycelia.plot_ngram_graphFunction
plot_ngram_graph(g) -> Any

Create a visual plot of an n-gram graph using GraphPlot.jl. Displays the graph structure with vertex labels showing the n-grams.

Arguments

  • g: MetaGraph containing n-gram data (typically from string_to_ngram_graph)

Returns

  • Plot object showing the graph layout with labeled vertices

Details

Uses a spring layout algorithm to position vertices and displays n-gram strings as vertex labels for easy interpretation of the graph structure.

Examples

graph = string_to_ngram_graph(s="hello", n=2)
plot_ngram_graph(graph)  # Shows visual representation
source

Graph Simplification

Mycelia.find_connected_componentsFunction

Find all connected components in the graph. Returns a vector of vectors of vertex indices.

For directed graphs, uses weakly connected components. For undirected graphs, uses standard connected components.

source
Mycelia.assemble_stringsFunction
assemble_strings(
    graph::MetaGraphsNext.MetaGraph
) -> Vector{String}

Assemble strings by performing graph walks from each connected component.

Traverses the string graph to reconstruct sequences by walking through connected components and assembling overlapping n-grams back into full strings.

Arguments

  • graph: MetaGraphsNext.MetaGraph containing n-gram vertices and transition edges

Returns

  • Vector{String}: Assembled sequences, one for each connected component

Examples

# Create a string graph from overlapping sequences
graph = build_string_ngram_metagraph("ATCGATCG", 3)
sequences = assemble_strings(graph)

This function is useful for reconstructing original sequences from n-gram decompositions and for string-based genome assembly approaches.

source

Sequence I/O

File Handling

Mycelia.open_fastxFunction
open_fastx(path::AbstractString) -> Any

Open and return a reader for FASTA or FASTQ format files.

Arguments

  • path::AbstractString: Path to input file. Can be:
    • Local file path
    • HTTP/FTP URL
    • Gzip compressed (.gz extension)

Supported formats

  • FASTA (.fasta, .fna, .faa, .fa)
  • FASTQ (.fastq, .fq)

Returns

  • FASTX.FASTA.Reader for FASTA files
  • FASTX.FASTQ.Reader for FASTQ files
source

Format Conversion

Mycelia.convert_sequenceFunction
convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

  • DNA sequences are converted using BioSequences.LongDNA
  • RNA sequences are converted using BioSequences.LongRNA
  • AA sequences are converted using BioSequences.LongAA
source

K-mer Analysis

K-mer Counting

For k-mer counting functions, see Sequence Analysis Workflow.

Distance Metrics

Mycelia.jaccard_distanceFunction

Compute the Jaccard distance between columns of a binary matrix.

Arguments

  • M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples

Returns

  • Matrix{Float64}: Symmetric distance matrix with Jaccard distances
source
jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$

Arguments

  • set1: First set to compare
  • set2: Second set to compare

Returns

  • Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
source
Mycelia.kmer_counts_to_js_divergenceFunction
kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

  • kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
  • kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

  • Normalized Jensen-Shannon divergence score between 0 and 1, where:
    • 0 indicates identical distributions
    • 1 indicates maximally different distributions

Notes

  • The measure is symmetric: JS(P||Q) = JS(Q||P)
  • Counts are automatically normalized to probability distributions
source

Simulation and Testing

Read Simulation

See Data Acquisition Workflow for read simulation functions.

Assembly Testing

Quality Control

FASTQ Analysis

Mycelia.calculate_gc_contentFunction
calculate_gc_content(
    sequence::BioSequences.LongSequence
) -> Float64

Calculate GC content (percentage of G and C bases) from a BioSequence.

This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.

Arguments

  • sequence::BioSequences.LongSequence: Input DNA or RNA sequence

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)

# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG") 
gc_percent = calculate_gc_content(rna_seq)
source
calculate_gc_content(sequence::AbstractString) -> Float64

Calculate GC content from a string sequence.

Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.

Arguments

  • sequence::AbstractString: Input DNA or RNA sequence as string

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")
source
calculate_gc_content(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64

Calculate GC content from FASTA/FASTQ records.

Processes multiple records and calculates overall GC content across all sequences.

Arguments

  • records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records

Returns

  • Float64: Overall GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)
source
Mycelia.assess_duplication_ratesFunction
assess_duplication_rates(fastq; results_table) -> Any

Analyze sequence duplication rates in a FASTQ file.

This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.

Arguments

  • fastq::String: Path to the input FASTQ file to analyze
  • results_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"

Returns

  • String: Path to the results table file

Output

Generates a tab-separated file containing the following metrics:

  • total_records: Total number of sequence records in the file
  • total_unique_observations: Count of unique sequence strings
  • total_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)
  • percent_unique_observations: Percentage of sequences that are unique
  • percent_unique_canonical_observations: Percentage of sequences that are unique after canonicalization
  • percent_duplication_rate: Percentage of sequences that are duplicates (100 - percentuniqueobservations)
  • percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization

Notes

  • If the specified results file already exists and is not empty, the function will return early without recomputing.
  • Progress is displayed during processing with a progress bar showing speed.

Example

# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")

# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")
source

For FASTQ quality analysis functions, see Quality Control Workflow.

External Tool Integration

For quality control and filtering functions, see Quality Control Workflow.

Assembly Validation

Quality Assessment

Mycelia.assess_assembly_qualityFunction
assess_assembly_quality(contigs_file)

Assess basic assembly quality metrics from a FASTA file.

Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.

Arguments

  • contigs_file: Path to FASTA file containing assembly contigs

Returns

  • Tuple of (ncontigs, totallength, n50, l50)
    • n_contigs: Number of contigs in the assembly
    • total_length: Total length of all contigs in base pairs
    • n50: N50 statistic (length of shortest contig in the set covering 50% of assembly)
    • l50: L50 statistic (number of contigs needed to reach 50% of assembly length)

Example

n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")

See Also

  • assess_assembly_kmer_quality: For k-mer based assembly quality assessment
source
assess_assembly_quality(; assembly, observations, ks)

Assess assembly quality using k-mer based metrics including QV scores.

This function provides a comprehensive quality assessment of genome assemblies by comparing k-mer distributions between the assembly and source sequencing data. It calculates multiple quality metrics across different k-mer sizes, with the QV (Quality Value) score being the primary measure of assembly accuracy.

Arguments

  • assembly: Path to assembly file (FASTA format) or assembly sequences
  • observations: Path to sequencing data file(s) (FASTQ format) or sequence observations
  • ks::Vector{Int}: K-mer sizes to analyze (default: 11,13,17,19,23,31,53 for comprehensive evaluation)

Returns

DataFrame containing quality metrics for each k-mer size:

  • k::Int: K-mer length used
  • cosine_distance::Float64: Cosine similarity between k-mer distributions
  • js_divergence::Float64: Jensen-Shannon divergence between distributions
  • qv::Float64: Merqury-style Quality Value score (primary assembly accuracy metric)

QV Score Interpretation

  • QV ≥ 40: High quality assembly (>99.99% base accuracy)
  • QV 30-40: Good quality assembly (99.9-99.99% base accuracy)
  • QV 20-30: Moderate quality assembly (99-99.9% base accuracy)
  • QV < 20: Lower quality assembly (<99% base accuracy)

Implementation Details

The QV score follows the Merqury methodology:

  1. Count k-mers in both assembly and sequencing data
  2. Calculate shared k-mers between datasets
  3. Estimate base-level accuracy: P = (Kshared/Ktotal)^(1/k)
  4. Convert to Phred scale: QV = -10log₁₀(1-P)

Performance Notes

  • Uses multiple k-mer sizes for robust quality assessment
  • Larger k-mer sizes provide more discriminative power
  • Parallel k-mer counting for computational efficiency
  • Progress tracking for long-running analyses

References

Rhie, A. et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies." Genome Biology 21, 245 (2020).

source
Mycelia.validate_assemblyFunction
validate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}

Validate assembly quality using various metrics and optional reference comparison.

Arguments

  • assembly: Assembly result to validate
  • reference: Optional reference sequence for comparison

Returns

  • Dict{String, Any}: Comprehensive validation metrics

Details

Computes assembly quality metrics including:

  • N50, N90 statistics
  • Total assembly length and number of contigs
  • Coverage uniformity (if reference provided)
  • Structural variant detection (if reference provided)
  • Gap analysis and repeat characterization
source

External Validators

Mycelia.run_quastFunction
run_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)

Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="quast_results": Output directory for QUAST results
  • reference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metrics
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • min_contig::Int=500: Minimum contig length to consider
  • gene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)

Returns

  • String: Path to the output directory containing QUAST results

Output Files

  • report.html: Interactive HTML report
  • report.txt: Text summary report
  • report.tsv: Tab-separated values report for programmatic access
  • transposed_report.tsv: Transposed TSV format
  • icarus.html: Icarus contig browser (if reference provided)

Examples

# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)

# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)

# Custom parameters
quast_dir = Mycelia.run_quast(assemblies, 
                             outdir="my_quast_results",
                             min_contig=1000,
                             threads=8)

Notes

  • Requires QUAST to be installed via Bioconda
  • Without reference: provides basic metrics (N50, total length, # contigs, etc.)
  • With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
  • Gene finding requires additional dependencies and is disabled by default
source
run_quast(assembly_file::String; kwargs...)

Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.

source
Mycelia.run_buscoFunction
run_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)

Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="busco_results": Output directory for BUSCO results
  • lineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteriaodb10", "eukaryotaodb10", "auto")
  • mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • force::Bool=false: Force overwrite existing results

Returns

  • String: Path to the output directory containing BUSCO results

Output Files

  • short_summary.specific.lineage.txt: Summary statistics
  • full_table.tsv: Complete BUSCO results table
  • missing_busco_list.tsv: List of missing BUSCOs
  • run_lineage/: Detailed results directory

Examples

# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)

# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")

# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
                             outdir="my_busco_results",
                             lineage="enterobacterales_odb10",
                             threads=8)

Notes

  • Requires BUSCO to be installed via Bioconda
  • Auto lineage detection requires internet connection for first run
  • Available lineages: bacteriaodb10, archaeaodb10, eukaryotaodb10, fungiodb10, etc.
  • Results provide Complete, Fragmented, and Missing BUSCO counts
source
run_busco(assembly_file::String; kwargs...)

Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.

source
Mycelia.run_mummerFunction
run_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)

Run MUMmer for genome comparison and alignment between reference and query sequences.

Arguments

  • reference::String: Path to reference genome FASTA file
  • query::String: Path to query genome FASTA file
  • outdir::String="mummer_results": Output directory for MUMmer results
  • prefix::String="out": Prefix for output files
  • mincluster::Int=65: Minimum cluster length for nucmer
  • minmatch::Int=20: Minimum match length for nucmer
  • threads::Int=1: Number of threads (note: MUMmer has limited multithreading)

Returns

  • String: Path to the output directory containing MUMmer results

Output Files

  • prefix.delta: Delta alignment file (main output)
  • prefix.coords: Human-readable coordinates file
  • prefix.snps: SNP/indel report (if show-snps is run)
  • prefix.plot.png: Dot plot visualization (if mummerplot is run)

Examples

# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)

# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
                               outdir="comparison_results",
                               prefix="comparison",
                               mincluster=100,
                               minmatch=30)

Notes

  • Requires MUMmer to be installed via Bioconda
  • nucmer is used for DNA sequence alignment
  • show-coords generates human-readable coordinate output
  • Results include alignment coordinates, percent identity, and coverage
  • For visualization, use mummerplot (requires gnuplot)
source

Annotation

Gene Prediction

Functions for gene prediction and annotation (using external tools):

  • Pyrodigal integration
  • BLAST+ searches
  • MMSeqs2 searches
  • TransTerm terminator prediction
  • tRNAscan-SE tRNA detection
  • MLST typing

Comparative Genomics

Pangenome Analysis

Mycelia.analyze_pangenome_kmersFunction
analyze_pangenome_kmers(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, distance_metric=:jaccard)

Perform comprehensive k-mer based pangenome analysis using existing Mycelia infrastructure.

Leverages existing count_canonical_kmers and distance metric functions to analyze genomic content across multiple genomes, identifying core, accessory, and unique regions.

Arguments

  • genome_files: Vector of FASTA file paths containing genome sequences
  • kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
  • distance_metric: Distance metric (:jaccard, :bray_curtis, :cosine, :js_divergence)

Returns

  • PangenomeAnalysisResult with comprehensive pangenome statistics

Example

genome_files = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.analyze_pangenome_kmers(genome_files, kmer_type=Kmers.DNAKmer{31})
println("Core k-mers: $(length(result.core_kmers))")
println("Total pangenome size: $(size(result.presence_absence_matrix, 1)) k-mers")
source
Mycelia.build_genome_distance_matrixFunction
build_genome_distance_matrix(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)

Build a distance matrix between all genome pairs using existing distance metrics.

Creates a comprehensive pairwise distance matrix using established k-mer distance functions, suitable for phylogenetic analysis and clustering.

Arguments

  • genome_files: Vector of genome FASTA file paths
  • kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
  • metric: Distance metric (:js_divergence, :cosine, :jaccard)

Returns

  • Named tuple with distance matrix and genome names

Example

genomes = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.build_genome_distance_matrix(genomes, kmer_type=Kmers.DNAKmer{31})
println("Distance matrix: $(result.distance_matrix)")
source

Visualization

Plotting Functions

Mycelia.visualize_genome_coverageFunction
visualize_genome_coverage(
    coverage_table::DataFrames.AbstractDataFrame;
    entity
) -> Makie.Figure

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

  • coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Keyword Arguments

  • entity: Optional string to specify the entity being analyzed. If provided, chromosome titles become "$(entity): $(chromosome)", otherwise just "$(chromosome)"

Returns

  • CairoMakie.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

source
Mycelia.plot_embeddingsFunction

Plot embeddings with optional true and fitted cluster labels using Makie.jl, with legend outside, and color by fit labels, shape by true labels.

Arguments

  • embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data point
  • title::String: Title of the plot
  • xlabel::String: Label for the x-axis
  • ylabel::String: Label for the y-axis
  • true_labels::Vector{<:Integer}: Vector of true cluster labels (optional)
  • fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)

Returns

  • Makie.Figure: Figure object that can be displayed or saved
source
Mycelia.plot_taxa_abundancesFunction
plot_taxa_abundances(
    df::DataFrames.DataFrame, 
    taxa_level::String; 
    top_n::Int = 10,
    sample_id_col::String = "sample_id",
    filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
    figure_width::Int = 1500,
    figure_height::Int = 1000,
    bar_width::Float64 = 0.7,
    x_rotation::Int = 45,
    sort_samples::Bool = true,
    color_seed::Union{Int, Nothing} = nothing,
    legend_fontsize::Float64 = 12.0,
    legend_itemsize::Float64 = 12.0,
    legend_padding::Float64 = 5.0,
    legend_rowgap::Float64 = 1.0,
    legend_labelwidth::Union{Nothing, Float64} = nothing,
    legend_titlesize::Float64 = 15.0,
    legend_nbanks::Int = 1
)

Create a stacked bar chart showing taxa relative abundances for each sample.

Arguments

  • df: DataFrame with sample_id and taxonomic assignments at different levels
  • taxa_level: Taxonomic level to analyze (e.g., "genus", "species")
  • top_n: Number of top taxa to display individually, remainder grouped as "Other"
  • sample_id_col: Column name containing sample identifiers
  • filter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)
  • figure_width: Width of the figure in pixels
  • figure_height: Height of the figure in pixels
  • bar_width: Width of each bar (between 0 and 1)
  • x_rotation: Rotation angle for x-axis labels in degrees
  • sort_samples: Whether to sort samples alphabetically
  • color_seed: Seed for reproducible color generation
  • legend_fontsize: Font size for legend entries
  • legend_itemsize: Size of the colored marker/icon in the legend
  • legend_padding: Padding around legend elements
  • legend_rowgap: Space between legend rows
  • legend_labelwidth: Maximum width for legend labels (truncation)
  • legend_titlesize: Font size for legend title
  • legend_nbanks: Number of legend columns

Returns

  • fig: CairoMakie figure object
  • ax: CairoMakie axis object
  • taxa_colors: Dictionary mapping taxa to their assigned colors
source

For k-mer frequency plotting, see Sequence Analysis Workflow.

Utility Functions

Memory Management

File Operations

Helper functions for robust file handling and data validation.

Progress Tracking

Functions for user feedback during long-running operations.

Graph Types and Enums

Graph Modes

Mycelia.GraphModeType

Graph mode for handling strand information.

  • SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)
  • DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation
source
Mycelia.StrandOrientationType

Strand orientation for k-mer observations and transitions.

  • Forward: k-mer as observed (5' to 3')
  • Reverse: reverse complement of k-mer (3' to 5')
source

Used to control how graphs are constructed and oriented.

Data Structures

Assembly Results

The result of assembly operations typically includes:

  • :final_assembly: Vector of assembled sequences
  • :k_progression: K-mer sizes used
  • :metadata: Assembly statistics and parameters

Quality Metrics

Standard quality metrics returned by analysis functions:

  • Coverage statistics
  • Quality score distributions
  • Joint probability calculations
  • Assembly completeness measures

Function Categories

Core Assembly Pipeline

  1. Input Processing: open_fastx, convert_sequence
  2. Quality Analysis: analyze_fastq_quality, qualmer functions
  3. Assembly: mycelia_assemble, graph construction
  4. Validation: assess_assembly_quality, validate_assembly
  5. Output: Assembly statistics and final sequences

Supporting Workflows

  • Data Acquisition: Download and simulation functions
  • Quality Control: Filtering and trimming tools
  • Analysis: K-mer analysis and distance calculations
  • Visualization: Plotting and reporting functions

Type System

Mycelia uses a strict type system based on BioSequences.jl:

  • DNA sequences: BioSequences.LongDNA{4}
  • RNA sequences: BioSequences.LongRNA{4}
  • Protein sequences: BioSequences.LongAA
  • K-mers: Kmers.DNAKmer{K}, Kmers.RNAKmer{K}, Kmers.AAKmer{K}

Quality-Aware Types

  • Qualmer: Combines k-mer with quality scores
  • QualmerVertexData: Graph vertex with quality metadata
  • QualmerEdgeData: Graph edge with transition weights

Performance Characteristics

Scalability

Most functions scale as follows with input size:

  • Linear: File I/O, k-mer counting
  • Quadratic: Distance matrix calculations
  • Variable: Assembly (depends on complexity)

Memory Usage

Typical memory requirements:

  • Small genomes (< 10 Mb): 1-4 GB RAM
  • Bacterial genomes (1-10 Mb): 4-16 GB RAM
  • Large genomes (> 100 Mb): 32-128 GB RAM

Use memory_limit parameters to control usage.

Parallelization

Functions that support parallel execution:

  • K-mer counting (automatic)
  • Assembly (multi-threaded)
  • Quality analysis (vectorized)

Set JULIA_NUM_THREADS environment variable for best performance.

Error Handling

Common Exceptions

  • ArgumentError: Invalid parameters
  • MethodError: Wrong types
  • OutOfMemoryError: Insufficient RAM
  • SystemError: File I/O issues

Debugging Tips

  1. Check inputs: Use typeof() and length()
  2. Monitor memory: Use memory_limit parameters
  3. Enable logging: Set verbose=true in functions
  4. Test subset: Use smaller datasets for debugging

Integration Points

External Tools

Mycelia integrates with numerous bioinformatics tools via Bioconda:

  • Assemblers: SPAdes, Megahit, Miniasm, Raven
  • QC Tools: FastP, Filtlong, Trim Galore
  • Validators: QUAST, BUSCO, MUMmer
  • Annotators: Pyrodigal, BLAST+, MMSeqs2

File Formats

Supported formats:

  • Input: FASTA, FASTQ (including gzipped)
  • Output: FASTA, CSV, JSON, JLD2
  • Intermediate: Graph formats, quality scores

Workflow Integration

Functions are designed to work together in pipelines:

# Typical workflow
data = open_fastx("reads.fastq")
quality_report = analyze_fastq_quality(data)
assembly = mycelia_assemble("reads.fastq")
validation = assess_assembly_quality(assembly)

Version Compatibility

This documentation corresponds to Mycelia version 0.1.0+.

  • Breaking changes will be noted in release notes
  • Deprecated functions include migration guidance
  • Experimental features are clearly marked

For the latest updates, see the CHANGELOG.