API Reference

This page provides comprehensive documentation for all public functions in Mycelia, automatically generated from docstrings.

Table of Contents

Assembly Functions

Main Assembly Interface

Mycelia.assemble_genomeFunction
assemble_genome(reads; k=31, kwargs...) -> AssemblyResult
assemble_genome(reads, config::AssemblyConfig) -> AssemblyResult

Unified genome assembly interface with auto-detection and type-stable dispatch.

Auto-Detection Convenience Method

# Auto-detect sequence type and format, use k-mer approach
result = assemble_genome(fasta_records; k=25)

# Auto-detect sequence type and format, use overlap approach
result = assemble_genome(fasta_records; min_overlap=100)

# Override auto-detected parameters
result = assemble_genome(reads; k=31, graph_mode=SingleStrand, error_rate=0.005)

Type-Stable Direct Method

# Explicit configuration for maximum performance
config = AssemblyConfig(k=25, sequence_type=BioSequences.LongDNA{4},
                       graph_mode=DoubleStrand, use_quality_scores=true)
result = assemble_genome(reads, config)

Arguments

  • reads: Vector of FASTA/FASTQ records or file paths
  • config: Assembly configuration (for type-stable version)

Keyword Arguments (auto-detection version)

  • k: k-mer size (mutually exclusive with min_overlap)
  • min_overlap: Minimum overlap length (mutually exclusive with k)
  • graph_mode: SingleStrand or DoubleStrand (auto-detected if not specified)
  • error_rate, min_coverage, etc.: Assembly parameters

Returns

  • AssemblyResult: Structure containing contigs, names, and assembly metadata

Details

This interface automatically:

  1. Detects sequence type: DNA, RNA, AA, or String from input
  2. Chooses assembly method: k-mer vs overlap-based on parameters
  3. Validates compatibility: AA/String sequences → SingleStrand only
  4. Dispatches optimally: Based on input type and quality scores

Method Selection Logic:

  • String input + k → N-gram graph
  • String input + min_overlap → String graph
  • BioSequence input + k + quality → Qualmer graph
  • BioSequence input + k → K-mer graph
  • BioSequence input + min_overlap + quality → Quality BioSequence graph
  • BioSequence input + min_overlap → BioSequence graph
source

Type-stable main assembly function that dispatches based on configuration.

source

Third-Party Assembly Tool Wrappers

See Assembly Suite for comprehensive documentation of production-ready assembly wrappers including:

  • Short-read assemblers: MEGAHIT, SPAdes, metaSPAdes, SKESA, Velvet
  • Long-read assemblers: Flye, metaFlye, Canu, hifiasm, MetaMDBG
  • Hybrid assemblers: Unicycler
  • Polishing tools: Apollo, HyPo, Strainy

Experimental Assembly Algorithms

Mycelia includes experimental assembly algorithms in src/development/ that are not currently part of the public API:

  • Intelligent Assembly: mycelia_assemble - Self-optimizing k-mer progression
  • Assembly Utilities: dynamic_k_prime_pattern, error_optimized_k_sequence, calculate_assembly_reward

These are research implementations subject to change. For production use, see the third-party assembler wrappers above.

Qualmer Analysis

Core Qualmer Types

Mycelia.QualmerType
struct Qualmer{KmerT, K}

A quality-aware k-mer that pairs sequence k-mers with their corresponding quality scores.

Fields

  • kmer::Any

  • qualities::Tuple{Vararg{UInt8, K}} where K

Examples

using Mycelia
using Kmers

# Create a DNA k-mer with quality scores
kmer = DNAKmer("ATCG")
qualities = [30, 35, 32, 28]  # Phred quality scores
qmer = Qualmer(kmer, qualities)

# Access individual bases and qualities
base, quality = qmer[1]  # Returns ('A', 30)

This structure is essential for quality-aware sequence analysis and assembly algorithms that consider both sequence content and sequencing quality information.

source
Mycelia.QualmerVertexDataType
struct QualmerVertexData{QualmerT<:Mycelia.Qualmer}

Vertex data for quality-aware k-mer graphs that stores k-mer observations with quality scores.

Fields

  • canonical_qualmer::Mycelia.Qualmer

  • observations::Array{Mycelia.QualmerObservation{QualmerT}, 1} where QualmerT<:Mycelia.Qualmer

  • joint_probability::Float64

  • coverage::Int64

  • mean_quality::Float64

This structure aggregates multiple observations of the same k-mer and computes quality-based statistics for use in quality-aware assembly algorithms.

source
Mycelia.QualmerEdgeDataType
struct QualmerEdgeData

Edge data for quality-aware k-mer graphs that connects k-mer vertices with strand information.

Fields

  • observations::Vector{Tuple{Int64, Int64}}

  • src_strand::Mycelia.StrandOrientation

  • dst_strand::Mycelia.StrandOrientation

  • weight::Float64

  • quality_weight::Float64

This structure represents connections between k-mers in quality-aware graphs, including strand orientation and quality-weighted edge weights for assembly algorithms.

source

Qualmer Generation

Mycelia.qualmers_fwFunction
qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

source
qualmers_fw(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate forward qualmers from a FASTQ record. Extracts all k-mers in the forward direction with their associated quality scores from the FASTQ record.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
Mycelia.qualmers_canonicalFunction
qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#293#294"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate canonical DNA qualmers from sequence and quality scores. For each k-mer, returns the lexicographically smaller of the k-mer and its reverse complement, ensuring consistent representation regardless of strand.

Arguments

  • sequence: DNA sequence as LongDNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
source
qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate canonical amino acid qualmers from protein sequence and quality scores. For amino acid sequences, canonical representation is the same as forward representation since proteins don't have reverse complements.

Arguments

  • sequence: Protein sequence as LongAA
  • quality: Vector of quality scores corresponding to each amino acid
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
qualmers_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate canonical qualmers from a FASTQ record. Extracts k-mers in canonical representation (lexicographically smaller of forward and reverse complement) with their associated quality scores.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
source
Mycelia.qualmers_unambiguousFunction
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#289#290"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous DNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, T, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

  • sequence: DNA sequence as LongDNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#291#292"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous RNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, U, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

  • sequence: RNA sequence as LongRNA
  • quality: Vector of quality scores corresponding to each base
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate amino acid qualmers from protein sequence and quality scores. For amino acid sequences, all k-mers are considered unambiguous, so this falls back to the forward qualmer generation.

Arguments

  • sequence: Protein sequence as LongAA
  • quality: Vector of quality scores corresponding to each amino acid
  • ::Val{K}: K-mer size as a type parameter

Returns

  • Iterator yielding (Qualmer, position) tuples for each k-mer
source
qualmers_unambiguous(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate unambiguous qualmers from a FASTQ record. Extracts only k-mers that contain unambiguous nucleotides (no N bases) with their associated quality scores.

Arguments

  • record: FASTQ record containing sequence and quality data
  • k: K-mer size

Returns

  • Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
source
Mycelia.qualmers_unambiguous_canonicalFunction
qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#295#296"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

source

Quality Analysis

Mycelia.phred_to_probabilityFunction
phred_to_probability(phred_score::UInt8) -> Float64

Convert PHRED quality score to probability of correctness. PHRED score Q relates to error probability P by: Q = -10 * log10(P) Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)

source
Mycelia.qualmer_correctness_probabilityFunction
qualmer_correctness_probability(
    qmer::Mycelia.Qualmer
) -> Float64

Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))

source
Mycelia.joint_qualmer_probabilityFunction
joint_qualmer_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Joint probability that all observations are correct
source
Mycelia.position_wise_joint_probabilityFunction
position_wise_joint_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Position-wise joint probability
source

Graph Construction

Mycelia.build_qualmer_graphFunction
build_qualmer_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    k,
    graph_mode,
    min_quality,
    min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.

Arguments

  • fastq_records: Vector of FASTQ records with quality scores
  • k: K-mer size
  • graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
  • min_quality: Minimum average PHRED quality to include k-mer (default: 10)
  • min_coverage: Minimum coverage (observations) to include k-mer (default: 1)

Returns

  • MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData
source

String Graphs

N-gram Analysis

Mycelia.ngramsFunction
ngrams(s::AbstractString, n::Int64) -> Vector{String}

Extract all n-grams from a string. Returns a vector containing all contiguous substrings of length n from the input string.

Arguments

  • s: Input string to extract n-grams from
  • n: Length of each n-gram (must be positive)

Returns

  • Vector{String}: All n-grams of length n from the string

Examples

ngrams("hello", 3)  # Returns ["hel", "ell", "llo"]
ngrams("abcd", 2)   # Returns ["ab", "bc", "cd"]
source
Mycelia.string_to_ngram_graphFunction
string_to_ngram_graph(; s, n)

Build a directed graph from string n-grams with edge and vertex weights. Creates a MetaGraph where vertices represent unique n-grams and edges represent transitions between consecutive n-grams in the original string.

Arguments

  • s: Input string to build graph from
  • n: N-gram size (length of each substring)

Returns

  • MetaGraphsNext.MetaGraph: Directed graph with:
    • Vertex labels: String n-grams
    • Vertex data: Integer counts of n-gram occurrences
    • Edge data: Integer counts of n-gram transitions

Details

Backward compatibility function - now delegates to the multi-string version. The graph construction process:

  1. Extract all n-grams from the string
  2. Count occurrences of each unique n-gram (stored as vertex data)
  3. Count transitions between consecutive n-grams (stored as edge data)
  4. Build a directed graph representing the n-gram sequence structure

Examples

graph = string_to_ngram_graph(s="abcabc", n=2)
# Creates graph with vertices "ab", "bc", "ca" and weighted edges
source
Mycelia.plot_ngram_graphFunction
plot_ngram_graph(g) -> Any

Create a visual plot of an n-gram graph using GraphPlot.jl. Displays the graph structure with vertex labels showing the n-grams.

Arguments

  • g: MetaGraph containing n-gram data (typically from string_to_ngram_graph)

Returns

  • Plot object showing the graph layout with labeled vertices

Details

Uses a spring layout algorithm to position vertices and displays n-gram strings as vertex labels for easy interpretation of the graph structure.

Examples

graph = string_to_ngram_graph(s="hello", n=2)
plot_ngram_graph(graph)  # Shows visual representation
source

Graph Simplification

Mycelia.find_connected_componentsFunction

Find all connected components in the graph. Returns a vector of vectors of vertex indices.

For directed graphs, uses weakly connected components. For undirected graphs, uses standard connected components.

source

Note: Additional graph simplification function assemble_strings is available but currently lacks documentation.

Sequence I/O

File Handling

Mycelia.open_fastxFunction
open_fastx(path::AbstractString) -> Any

Open and return a reader for FASTA or FASTQ format files.

Arguments

  • path::AbstractString: Path to input file. Can be:
    • Local file path
    • HTTP/FTP URL
    • Gzip compressed (.gz extension)

Supported formats

  • FASTA (.fasta, .fna, .faa, .fa)
  • FASTQ (.fastq, .fq)

Returns

  • FASTX.FASTA.Reader for FASTA files
  • FASTX.FASTQ.Reader for FASTQ files
source
Mycelia.write_fastqFunction
write_fastq(;records, filename, gzip=false)
write_fastq(; records, filename, gzip)

Write FASTQ records to file using FASTX.jl. Validates extension: .fastq, .fq, .fastq.gz, or .fq.gz. If gzip is true or filename endswith .gz, output is gzipped. records must be an iterable of FASTX.FASTQ.Record.

source
Mycelia.write_fastaFunction
write_fasta(; outfile, records, gzip)

Writes FASTA records to a file, optionally gzipped.

Arguments

  • outfile::AbstractString: Path to the output FASTA file. Will append ".gz" if gzip is true and ".gz" isn't already the extension.
  • records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.
  • gzip::Bool: Optionally force compression of the output with gzip. By default will use the file name to infer.

Returns

  • outfile::String: The path to the output FASTA file (including ".gz" if applicable).
source

Format Conversion

Mycelia.convert_sequenceFunction
convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

  • DNA sequences are converted using BioSequences.LongDNA
  • RNA sequences are converted using BioSequences.LongRNA
  • AA sequences are converted using BioSequences.LongAA
source

K-mer Analysis

K-mer Counting

For k-mer counting functions, see Sequence Analysis Workflow.

Distance Metrics

Mycelia.jaccard_distanceFunction

Compute the Jaccard distance between columns of a binary matrix.

Arguments

  • M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples

Returns

  • Matrix{Float64}: Symmetric distance matrix with Jaccard distances
source
jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$

Arguments

  • set1: First set to compare
  • set2: Second set to compare

Returns

  • Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
source
Mycelia.kmer_counts_to_js_divergenceFunction
kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

  • kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
  • kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

  • Normalized Jensen-Shannon divergence score between 0 and 1, where:
    • 0 indicates identical distributions
    • 1 indicates maximally different distributions

Notes

  • The measure is symmetric: JS(P||Q) = JS(Q||P)
  • Counts are automatically normalized to probability distributions
source

Genome Size Estimation

Mycelia.estimate_genome_size_from_kmersFunction
estimate_genome_size_from_kmers(
    sequence::Union{AbstractString, BioSequences.LongSequence},
    k::Integer
) -> Dict

Estimate genome size from k-mer analysis using total k-mer count.

This function estimates genome size using the basic relationship: genomesize ≈ totalkmers - k + 1, where total_kmers is the sum of all k-mer counts. This is a simple estimation method; more sophisticated approaches accounting for sequencing depth, repeats, and errors may be more accurate.

Arguments

  • sequence::Union{BioSequences.LongSequence, AbstractString}: Input sequence or string
  • k::Integer: K-mer size for analysis

Returns

  • Dict{String, Any}: Dictionary containing:
    • "unique_kmers": Number of unique k-mers observed
    • "total_kmers": Total k-mer count (sum of all frequencies)
    • "estimatedgenomesize": Estimated genome size
    • "actual_size": Length of input sequence (if provided)

Examples

# Estimate genome size from a sequence
sequence = "ATCGATCGATCGATCG"
result = estimate_genome_size_from_kmers(sequence, 5)
source
estimate_genome_size_from_kmers(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1},
    k::Integer
) -> Dict

Estimate genome size from FASTQ/FASTA records using k-mer analysis.

Overload for processing FASTQ or FASTA records directly.

Arguments

  • records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records
  • k::Integer: K-mer size for analysis

Returns

  • Dict{String, Any}: Dictionary with k-mer statistics and genome size estimate
source

K-mer Result Storage

Mycelia.save_kmer_resultsFunction
save_kmer_results(
;
    filename,
    kmers,
    counts,
    fasta_list,
    k,
    alphabet
)

Save the kmer counting results (kmers vector, counts sparse matrix) and the input FASTA file list to a JLD2 file for long-term storage and reproducibility.

Arguments

  • filename::AbstractString: Path to the output JLD2 file.
  • kmers::AbstractVector{<:Kmers.Kmer}: The sorted vector of unique kmer objects.
  • counts::AbstractMatrix: The (sparse or dense) matrix of kmer counts.
  • fasta_list::AbstractVector{<:AbstractString}: The list of FASTA file paths used as input.
  • k::Integer: The kmer size used.
  • alphabet::Symbol: The alphabet used (:AA, :DNA, :RNA).
source
Mycelia.load_kmer_resultsFunction
load_kmer_results(
    filename::AbstractString
) -> Union{Nothing, NamedTuple{(:kmers, :counts, :fasta_list, :metadata), <:Tuple{Any, Any, Any, Dict{String, Any}}}}

Load kmer counting results previously saved with save_kmer_results.

Arguments

  • filename::AbstractString: Path to the input JLD2 file.

Returns

  • NamedTuple: Contains the loaded kmers, counts, fasta_list, and metadata. Returns nothing if the file cannot be loaded or essential keys are missing.
source

Simulation and Testing

Sequence Generation

Mycelia.generate_test_sequencesFunction
generate_test_sequences(
    genome_size::Int64
) -> Vector{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}
generate_test_sequences(
    genome_size::Int64,
    n_sequences::Int64
) -> Vector{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}

Generate test DNA sequences for k-mer analysis benchmarking.

Creates random DNA sequences suitable for k-mer counting and analysis performance testing.

Arguments

  • genome_size::Int: Size of each generated sequence in base pairs
  • n_sequences::Int: Number of sequences to generate (default: 1)

Returns

  • Vector of BioSequences.LongDNA{4} sequences

See Also

  • random_fasta_record: For generating FASTA records with random sequences
  • BioSequences.randdnaseq: For generating random DNA sequences
source

Read Simulation

See Data Acquisition Workflow for read simulation functions.

Assembly Testing (Experimental)

Experimental assembly testing function test_intelligent_assembly is available in src/development/ but not currently part of the public API.

Quality Control

FASTQ Analysis

Mycelia.calculate_gc_contentFunction
calculate_gc_content(
    sequence::BioSequences.LongSequence
) -> Float64

Calculate GC content (percentage of G and C bases) from a BioSequence.

This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.

Arguments

  • sequence::BioSequences.LongSequence: Input DNA or RNA sequence

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)

# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG") 
gc_percent = calculate_gc_content(rna_seq)
source
calculate_gc_content(sequence::AbstractString) -> Float64

Calculate GC content from a string sequence.

Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.

Arguments

  • sequence::AbstractString: Input DNA or RNA sequence as string

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")
source
calculate_gc_content(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64

Calculate GC content from FASTA/FASTQ records.

Processes multiple records and calculates overall GC content across all sequences.

Arguments

  • records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records

Returns

  • Float64: Overall GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)
source
Mycelia.assess_duplication_ratesFunction
assess_duplication_rates(fastq; results_table) -> Any

Analyze sequence duplication rates in a FASTQ file.

This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.

Arguments

  • fastq::String: Path to the input FASTQ file to analyze
  • results_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"

Returns

  • String: Path to the results table file

Output

Generates a tab-separated file containing the following metrics:

  • total_records: Total number of sequence records in the file
  • total_unique_observations: Count of unique sequence strings
  • total_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)
  • percent_unique_observations: Percentage of sequences that are unique
  • percent_unique_canonical_observations: Percentage of sequences that are unique after canonicalization
  • percent_duplication_rate: Percentage of sequences that are duplicates (100 - percentuniqueobservations)
  • percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization

Notes

  • If the specified results file already exists and is not empty, the function will return early without recomputing.
  • Progress is displayed during processing with a progress bar showing speed.

Example

# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")

# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")
source

For FASTQ quality analysis functions, see Quality Control Workflow.

External Tool Integration

For quality control and filtering functions, see Quality Control Workflow.

Assembly Validation

Quality Assessment

Mycelia.assess_assembly_qualityFunction
assess_assembly_quality(contigs_file)

Assess basic assembly quality metrics from a FASTA file.

Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.

Arguments

  • contigs_file: Path to FASTA file containing assembly contigs

Returns

  • Tuple of (ncontigs, totallength, n50, l50)
    • n_contigs: Number of contigs in the assembly
    • total_length: Total length of all contigs in base pairs
    • n50: N50 statistic (length of shortest contig in the set covering 50% of assembly)
    • l50: L50 statistic (number of contigs needed to reach 50% of assembly length)

Example

n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")

See Also

  • assess_assembly_kmer_quality: For k-mer based assembly quality assessment
source
assess_assembly_quality(; assembly, observations, ks)

Assess assembly quality using k-mer based metrics including QV scores.

This function provides a comprehensive quality assessment of genome assemblies by comparing k-mer distributions between the assembly and source sequencing data. It calculates multiple quality metrics across different k-mer sizes, with the QV (Quality Value) score being the primary measure of assembly accuracy.

Arguments

  • assembly: Path to assembly file (FASTA format) or assembly sequences
  • observations: Path to sequencing data file(s) (FASTQ format) or sequence observations
  • ks::Vector{Int}: K-mer sizes to analyze (default: 11,13,17,19,23,31,53 for comprehensive evaluation)

Returns

DataFrame containing quality metrics for each k-mer size:

  • k::Int: K-mer length used
  • cosine_distance::Float64: Cosine similarity between k-mer distributions
  • js_divergence::Float64: Jensen-Shannon divergence between distributions
  • qv::Float64: Merqury-style Quality Value score (primary assembly accuracy metric)

QV Score Interpretation

  • QV ≥ 40: High quality assembly (>99.99% base accuracy)
  • QV 30-40: Good quality assembly (99.9-99.99% base accuracy)
  • QV 20-30: Moderate quality assembly (99-99.9% base accuracy)
  • QV < 20: Lower quality assembly (<99% base accuracy)

Implementation Details

The QV score follows the Merqury methodology:

  1. Count k-mers in both assembly and sequencing data
  2. Calculate shared k-mers between datasets
  3. Estimate base-level accuracy: P = (Kshared/Ktotal)^(1/k)
  4. Convert to Phred scale: QV = -10log₁₀(1-P)

Performance Notes

  • Uses multiple k-mer sizes for robust quality assessment
  • Larger k-mer sizes provide more discriminative power
  • Parallel k-mer counting for computational efficiency
  • Progress tracking for long-running analyses

References

Rhie, A. et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies." Genome Biology 21, 245 (2020).

source
Mycelia.validate_assemblyFunction
validate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}

Validate assembly quality using various metrics and optional reference comparison.

Arguments

  • assembly: Assembly result to validate
  • reference: Optional reference sequence for comparison

Returns

  • Dict{String, Any}: Comprehensive validation metrics

Details

Computes assembly quality metrics including:

  • N50, N90 statistics
  • Total assembly length and number of contigs
  • Coverage uniformity (if reference provided)
  • Structural variant detection (if reference provided)
  • Gap analysis and repeat characterization
source

External Validators

Mycelia.run_quastFunction
run_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)

Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="quast_results": Output directory for QUAST results
  • reference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metrics
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • min_contig::Int=500: Minimum contig length to consider
  • gene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)

Returns

  • String: Path to the output directory containing QUAST results

Output Files

  • report.html: Interactive HTML report
  • report.txt: Text summary report
  • report.tsv: Tab-separated values report for programmatic access
  • transposed_report.tsv: Transposed TSV format
  • icarus.html: Icarus contig browser (if reference provided)

Examples

# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)

# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)

# Custom parameters
quast_dir = Mycelia.run_quast(assemblies, 
                             outdir="my_quast_results",
                             min_contig=1000,
                             threads=8)

Notes

  • Requires QUAST to be installed via Bioconda
  • Without reference: provides basic metrics (N50, total length, # contigs, etc.)
  • With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
  • Gene finding requires additional dependencies and is disabled by default
source
run_quast(assembly_file::String; kwargs...)

Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.

source
Mycelia.run_buscoFunction
run_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)

Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="busco_results": Output directory for BUSCO results
  • lineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteriaodb10", "eukaryotaodb10", "auto")
  • mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • force::Bool=false: Force overwrite existing results

Returns

  • String: Path to the output directory containing BUSCO results

Output Files

  • short_summary.specific.lineage.txt: Summary statistics
  • full_table.tsv: Complete BUSCO results table
  • missing_busco_list.tsv: List of missing BUSCOs
  • run_lineage/: Detailed results directory

Examples

# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)

# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")

# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
                             outdir="my_busco_results",
                             lineage="enterobacterales_odb10",
                             threads=8)

Notes

  • Requires BUSCO to be installed via Bioconda
  • Auto lineage detection requires internet connection for first run
  • Available lineages: bacteriaodb10, archaeaodb10, eukaryotaodb10, fungiodb10, etc.
  • Results provide Complete, Fragmented, and Missing BUSCO counts
source
run_busco(assembly_file::String; kwargs...)

Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.

source
Mycelia.run_mummerFunction
run_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)

Run MUMmer for genome comparison and alignment between reference and query sequences.

Arguments

  • reference::String: Path to reference genome FASTA file
  • query::String: Path to query genome FASTA file
  • outdir::String="mummer_results": Output directory for MUMmer results
  • prefix::String="out": Prefix for output files
  • mincluster::Int=65: Minimum cluster length for nucmer
  • minmatch::Int=20: Minimum match length for nucmer
  • threads::Int=1: Number of threads (note: MUMmer has limited multithreading)

Returns

  • String: Path to the output directory containing MUMmer results

Output Files

  • prefix.delta: Delta alignment file (main output)
  • prefix.coords: Human-readable coordinates file
  • prefix.snps: SNP/indel report (if show-snps is run)
  • prefix.plot.png: Dot plot visualization (if mummerplot is run)

Examples

# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)

# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
                               outdir="comparison_results",
                               prefix="comparison",
                               mincluster=100,
                               minmatch=30)

Notes

  • Requires MUMmer to be installed via Bioconda
  • nucmer is used for DNA sequence alignment
  • show-coords generates human-readable coordinate output
  • Results include alignment coordinates, percent identity, and coverage
  • For visualization, use mummerplot (requires gnuplot)
source

Annotation

Gene Prediction

Functions for gene prediction and annotation (using external tools):

  • Pyrodigal integration
  • BLAST+ searches
  • MMSeqs2 searches
  • TransTerm terminator prediction
  • tRNAscan-SE tRNA detection
  • MLST typing

Comparative Genomics

Visualization

Plotting Functions

Mycelia.visualize_genome_coverageFunction
visualize_genome_coverage(
    coverage_table::DataFrames.AbstractDataFrame;
    entity
) -> Makie.Figure

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

  • coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Keyword Arguments

  • entity: Optional string to specify the entity being analyzed. If provided, chromosome titles become "$(entity): $(chromosome)", otherwise just "$(chromosome)"

Returns

  • CairoMakie.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

source
Mycelia.plot_embeddingsFunction

Plot embeddings with optional true and fitted cluster labels using Makie.jl, with legend outside, and color by fit labels, shape by true labels.

Arguments

  • embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data point
  • title::String: Title of the plot
  • xlabel::String: Label for the x-axis
  • ylabel::String: Label for the y-axis
  • true_labels::Vector{<:Integer}: Vector of true cluster labels (optional)
  • fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)

Returns

  • Makie.Figure: Figure object that can be displayed or saved
source
Mycelia.plot_taxa_abundancesFunction
plot_taxa_abundances(
    df::DataFrames.DataFrame, 
    taxa_level::String; 
    top_n::Int = 10,
    sample_id_col::String = "sample_id",
    filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
    figure_width::Int = 1500,
    figure_height::Int = 1000,
    bar_width::Float64 = 0.7,
    x_rotation::Int = 45,
    sort_samples::Bool = true,
    color_seed::Union{Int, Nothing} = nothing,
    legend_fontsize::Float64 = 12.0,
    legend_itemsize::Float64 = 12.0,
    legend_padding::Float64 = 5.0,
    legend_rowgap::Float64 = 1.0,
    legend_labelwidth::Union{Nothing, Float64} = nothing,
    legend_titlesize::Float64 = 15.0,
    legend_nbanks::Int = 1
)

Create a stacked bar chart showing taxa relative abundances for each sample.

Arguments

  • df: DataFrame with sample_id and taxonomic assignments at different levels
  • taxa_level: Taxonomic level to analyze (e.g., "genus", "species")
  • top_n: Number of top taxa to display individually, remainder grouped as "Other"
  • sample_id_col: Column name containing sample identifiers
  • filter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)
  • figure_width: Width of the figure in pixels
  • figure_height: Height of the figure in pixels
  • bar_width: Width of each bar (between 0 and 1)
  • x_rotation: Rotation angle for x-axis labels in degrees
  • sort_samples: Whether to sort samples alphabetically
  • color_seed: Seed for reproducible color generation
  • legend_fontsize: Font size for legend entries
  • legend_itemsize: Size of the colored marker/icon in the legend
  • legend_padding: Padding around legend elements
  • legend_rowgap: Space between legend rows
  • legend_labelwidth: Maximum width for legend labels (truncation)
  • legend_titlesize: Font size for legend title
  • legend_nbanks: Number of legend columns

Returns

  • fig: CairoMakie figure object
  • ax: CairoMakie axis object
  • taxa_colors: Dictionary mapping taxa to their assigned colors
source

For k-mer frequency plotting, see Sequence Analysis Workflow.

Utility Functions

Memory Management (Experimental)

Experimental memory management functions (estimate_memory_usage, check_memory_limits) are available in src/development/ but not currently part of the public API.

File Operations

Helper functions for robust file handling and data validation.

Progress Tracking

Functions for user feedback during long-running operations.

Graph Types and Enums

Graph Modes

Mycelia.GraphModeType

Graph mode for handling strand information.

  • SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)
  • DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation
source
Mycelia.StrandOrientationType

Strand orientation for k-mer observations and transitions.

  • Forward: k-mer as observed (5' to 3')
  • Reverse: reverse complement of k-mer (3' to 5')
source

Used to control how graphs are constructed and oriented.

Data Structures

Assembly Results

The result of assembly operations typically includes:

  • :final_assembly: Vector of assembled sequences
  • :k_progression: K-mer sizes used
  • :metadata: Assembly statistics and parameters

Quality Metrics

Standard quality metrics returned by analysis functions:

  • Coverage statistics
  • Quality score distributions
  • Joint probability calculations
  • Assembly completeness measures

Function Categories

Core Assembly Pipeline

  1. Input Processing: open_fastx, convert_sequence
  2. Quality Analysis: analyze_fastq_quality, qualmer functions
  3. Assembly: mycelia_assemble, graph construction
  4. Validation: assess_assembly_quality, validate_assembly
  5. Output: Assembly statistics and final sequences

Supporting Workflows

  • Data Acquisition: Download and simulation functions
  • Quality Control: Filtering and trimming tools
  • Analysis: K-mer analysis and distance calculations
  • Visualization: Plotting and reporting functions

Type System

Mycelia uses a strict type system based on BioSequences.jl:

  • DNA sequences: BioSequences.LongDNA{4}
  • RNA sequences: BioSequences.LongRNA{4}
  • Protein sequences: BioSequences.LongAA
  • K-mers: Kmers.DNAKmer{K}, Kmers.RNAKmer{K}, Kmers.AAKmer{K}

Quality-Aware Types

  • Qualmer: Combines k-mer with quality scores
  • QualmerVertexData: Graph vertex with quality metadata
  • QualmerEdgeData: Graph edge with transition weights

Performance Characteristics

Scalability

Most functions scale as follows with input size:

  • Linear: File I/O, k-mer counting
  • Quadratic: Distance matrix calculations
  • Variable: Assembly (depends on complexity)

Memory Usage

Typical memory requirements:

  • Small genomes (< 10 Mb): 1-4 GB RAM
  • Bacterial genomes (1-10 Mb): 4-16 GB RAM
  • Large genomes (> 100 Mb): 32-128 GB RAM

Use memory_limit parameters to control usage.

Parallelization

Functions that support parallel execution:

  • K-mer counting (automatic)
  • Assembly (multi-threaded)
  • Quality analysis (vectorized)

Set JULIA_NUM_THREADS environment variable for best performance.

Error Handling

Common Exceptions

  • ArgumentError: Invalid parameters
  • MethodError: Wrong types
  • OutOfMemoryError: Insufficient RAM
  • SystemError: File I/O issues

Debugging Tips

  1. Check inputs: Use typeof() and length()
  2. Monitor memory: Use memory_limit parameters
  3. Enable logging: Set verbose=true in functions where applicable
  4. Test subset: Use smaller datasets for debugging

Integration Points

External Tools

Mycelia integrates with numerous bioinformatics tools via Bioconda:

  • Assemblers: SPAdes, Megahit, Miniasm, Raven
  • QC Tools: FastP, Filtlong, Trim Galore
  • Validators: QUAST, BUSCO, MUMmer
  • Annotators: Pyrodigal, BLAST+, MMSeqs2

File Formats

Supported formats:

  • Input: FASTA, FASTQ (including gzipped)
  • Output: FASTA, CSV, JSON, JLD2
  • Intermediate: Graph formats, quality scores

Version Compatibility

This documentation corresponds to Mycelia version 0.1.0+.

  • Breaking changes will be noted in release notes
  • Deprecated functions include migration guidance
  • Experimental features are clearly marked

For the latest updates, see the CHANGELOG.