API Reference
This page provides comprehensive documentation for all public functions in Mycelia, automatically generated from docstrings.
Table of Contents
- API Reference
- Table of Contents
- Assembly Functions
- Qualmer Analysis
- String Graphs
- Sequence I/O
- K-mer Analysis
- Simulation and Testing
- Quality Control
- Assembly Validation
- Annotation
- Comparative Genomics
- Visualization
- Utility Functions
- Graph Types and Enums
- Data Structures
- Function Categories
- Type System
- Performance Characteristics
- Error Handling
- Integration Points
- Version Compatibility
Assembly Functions
Main Assembly Interface
Mycelia.assemble_genome — Functionassemble_genome(reads; k=31, kwargs...) -> AssemblyResult
assemble_genome(reads, config::AssemblyConfig) -> AssemblyResultUnified genome assembly interface with auto-detection and type-stable dispatch.
Auto-Detection Convenience Method
# Auto-detect sequence type and format, use k-mer approach
result = assemble_genome(fasta_records; k=25)
# Auto-detect sequence type and format, use overlap approach
result = assemble_genome(fasta_records; min_overlap=100)
# Override auto-detected parameters
result = assemble_genome(reads; k=31, graph_mode=SingleStrand, error_rate=0.005)Type-Stable Direct Method
# Explicit configuration for maximum performance
config = AssemblyConfig(k=25, sequence_type=BioSequences.LongDNA{4},
graph_mode=DoubleStrand, use_quality_scores=true)
result = assemble_genome(reads, config)Arguments
reads: Vector of FASTA/FASTQ records or file pathsconfig: Assembly configuration (for type-stable version)
Keyword Arguments (auto-detection version)
k: k-mer size (mutually exclusive with min_overlap)min_overlap: Minimum overlap length (mutually exclusive with k)graph_mode: SingleStrand or DoubleStrand (auto-detected if not specified)error_rate,min_coverage, etc.: Assembly parameters
Returns
AssemblyResult: Structure containing contigs, names, and assembly metadata
Details
This interface automatically:
- Detects sequence type: DNA, RNA, AA, or String from input
- Chooses assembly method: k-mer vs overlap-based on parameters
- Validates compatibility: AA/String sequences → SingleStrand only
- Dispatches optimally: Based on input type and quality scores
Method Selection Logic:
- String input + k → N-gram graph
- String input + min_overlap → String graph
- BioSequence input + k + quality → Qualmer graph
- BioSequence input + k → K-mer graph
- BioSequence input + min_overlap + quality → Quality BioSequence graph
- BioSequence input + min_overlap → BioSequence graph
Type-stable main assembly function that dispatches based on configuration.
Third-Party Assembly Tool Wrappers
See Assembly Suite for comprehensive documentation of production-ready assembly wrappers including:
- Short-read assemblers: MEGAHIT, SPAdes, metaSPAdes, SKESA, Velvet
- Long-read assemblers: Flye, metaFlye, Canu, hifiasm, MetaMDBG
- Hybrid assemblers: Unicycler
- Polishing tools: Apollo, HyPo, Strainy
Experimental Assembly Algorithms
Mycelia includes experimental assembly algorithms in src/development/ that are not currently part of the public API:
- Intelligent Assembly:
mycelia_assemble- Self-optimizing k-mer progression - Assembly Utilities:
dynamic_k_prime_pattern,error_optimized_k_sequence,calculate_assembly_reward
These are research implementations subject to change. For production use, see the third-party assembler wrappers above.
Qualmer Analysis
Core Qualmer Types
Mycelia.Qualmer — Typestruct Qualmer{KmerT, K}A quality-aware k-mer that pairs sequence k-mers with their corresponding quality scores.
Fields
kmer::Anyqualities::Tuple{Vararg{UInt8, K}} where K
Examples
using Mycelia
using Kmers
# Create a DNA k-mer with quality scores
kmer = DNAKmer("ATCG")
qualities = [30, 35, 32, 28] # Phred quality scores
qmer = Qualmer(kmer, qualities)
# Access individual bases and qualities
base, quality = qmer[1] # Returns ('A', 30)This structure is essential for quality-aware sequence analysis and assembly algorithms that consider both sequence content and sequencing quality information.
Mycelia.QualmerVertexData — Typestruct QualmerVertexData{QualmerT<:Mycelia.Qualmer}Vertex data for quality-aware k-mer graphs that stores k-mer observations with quality scores.
Fields
canonical_qualmer::Mycelia.Qualmerobservations::Array{Mycelia.QualmerObservation{QualmerT}, 1} where QualmerT<:Mycelia.Qualmerjoint_probability::Float64coverage::Int64mean_quality::Float64
This structure aggregates multiple observations of the same k-mer and computes quality-based statistics for use in quality-aware assembly algorithms.
Mycelia.QualmerEdgeData — Typestruct QualmerEdgeDataEdge data for quality-aware k-mer graphs that connects k-mer vertices with strand information.
Fields
observations::Vector{Tuple{Int64, Int64}}src_strand::Mycelia.StrandOrientationdst_strand::Mycelia.StrandOrientationweight::Float64quality_weight::Float64
This structure represents connections between k-mers in quality-aware graphs, including strand orientation and quality-weighted edge weights for assembly algorithms.
Qualmer Generation
Mycelia.qualmers_fw — Functionqualmers_fw(
sequence::BioSequences.LongSequence{A},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Create an iterator that yields DNA qualmers from the given sequence and quality scores.
qualmers_fw(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate forward qualmers from a FASTQ record. Extracts all k-mers in the forward direction with their associated quality scores from the FASTQ record.
Arguments
record: FASTQ record containing sequence and quality datak: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
Mycelia.qualmers_canonical — Functionqualmers_canonical(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#293#294"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate canonical DNA qualmers from sequence and quality scores. For each k-mer, returns the lexicographically smaller of the k-mer and its reverse complement, ensuring consistent representation regardless of strand.
Arguments
sequence: DNA sequence as LongDNAquality: Vector of quality scores corresponding to each base::Val{K}: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
qualmers_canonical(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
Generate canonical amino acid qualmers from protein sequence and quality scores. For amino acid sequences, canonical representation is the same as forward representation since proteins don't have reverse complements.
Arguments
sequence: Protein sequence as LongAAquality: Vector of quality scores corresponding to each amino acid::Val{K}: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
qualmers_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate canonical qualmers from a FASTQ record. Extracts k-mers in canonical representation (lexicographically smaller of forward and reverse complement) with their associated quality scores.
Arguments
record: FASTQ record containing sequence and quality datak: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
Mycelia.qualmers_unambiguous — Functionqualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#289#290"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate unambiguous DNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, T, G, C). Skips k-mers containing ambiguous bases like N.
Arguments
sequence: DNA sequence as LongDNAquality: Vector of quality scores corresponding to each base::Val{K}: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
qualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#291#292"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate unambiguous RNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, U, G, C). Skips k-mers containing ambiguous bases like N.
Arguments
sequence: RNA sequence as LongRNAquality: Vector of quality scores corresponding to each base::Val{K}: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
qualmers_unambiguous(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#287#288"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
Generate amino acid qualmers from protein sequence and quality scores. For amino acid sequences, all k-mers are considered unambiguous, so this falls back to the forward qualmer generation.
Arguments
sequence: Protein sequence as LongAAquality: Vector of quality scores corresponding to each amino acid::Val{K}: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
qualmers_unambiguous(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate unambiguous qualmers from a FASTQ record. Extracts only k-mers that contain unambiguous nucleotides (no N bases) with their associated quality scores.
Arguments
record: FASTQ record containing sequence and quality datak: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
Mycelia.qualmers_unambiguous_canonical — Functionqualmers_unambiguous_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator{I, Mycelia.var"#295#296"} where I
Generate unambiguous canonical qualmers from the given FASTQ record.
Quality Analysis
Mycelia.phred_to_probability — Functionphred_to_probability(phred_score::UInt8) -> Float64
Convert PHRED quality score to probability of correctness. PHRED score Q relates to error probability P by: Q = -10 * log10(P) Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)
Mycelia.qualmer_correctness_probability — Functionqualmer_correctness_probability(
qmer::Mycelia.Qualmer
) -> Float64
Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))
Mycelia.joint_qualmer_probability — Functionjoint_qualmer_probability(
qualmers::Vector{<:Mycelia.Qualmer};
use_log_space
) -> Float64
Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.
Arguments
qualmers: Vector of Qualmer observations of the same k-mer sequenceuse_log_space: Use log-space arithmetic for numerical stability (default: true)
Returns
- Float64: Joint probability that all observations are correct
Mycelia.position_wise_joint_probability — Functionposition_wise_joint_probability(
qualmers::Vector{<:Mycelia.Qualmer};
use_log_space
) -> Float64
Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.
Arguments
qualmers: Vector of Qualmer observations of the same k-mer sequenceuse_log_space: Use log-space arithmetic for numerical stability (default: true)
Returns
- Float64: Position-wise joint probability
Graph Construction
Mycelia.build_qualmer_graph — Functionbuild_qualmer_graph(
fastq_records::Vector{FASTX.FASTQ.Record};
k,
graph_mode,
min_quality,
min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}
Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.
Arguments
fastq_records: Vector of FASTQ records with quality scoresk: K-mer sizegraph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)min_quality: Minimum average PHRED quality to include k-mer (default: 10)min_coverage: Minimum coverage (observations) to include k-mer (default: 1)
Returns
- MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData
Mycelia.get_qualmer_statistics — Functionget_qualmer_statistics(
graph::MetaGraphsNext.MetaGraph
) -> Dict{String, Any}
Get comprehensive statistics about a qualmer graph.
String Graphs
N-gram Analysis
Mycelia.ngrams — Functionngrams(s::AbstractString, n::Int64) -> Vector{String}
Extract all n-grams from a string. Returns a vector containing all contiguous substrings of length n from the input string.
Arguments
s: Input string to extract n-grams fromn: Length of each n-gram (must be positive)
Returns
Vector{String}: All n-grams of lengthnfrom the string
Examples
ngrams("hello", 3) # Returns ["hel", "ell", "llo"]
ngrams("abcd", 2) # Returns ["ab", "bc", "cd"]Mycelia.string_to_ngram_graph — Functionstring_to_ngram_graph(; s, n)
Build a directed graph from string n-grams with edge and vertex weights. Creates a MetaGraph where vertices represent unique n-grams and edges represent transitions between consecutive n-grams in the original string.
Arguments
s: Input string to build graph fromn: N-gram size (length of each substring)
Returns
MetaGraphsNext.MetaGraph: Directed graph with:- Vertex labels: String n-grams
- Vertex data: Integer counts of n-gram occurrences
- Edge data: Integer counts of n-gram transitions
Details
Backward compatibility function - now delegates to the multi-string version. The graph construction process:
- Extract all n-grams from the string
- Count occurrences of each unique n-gram (stored as vertex data)
- Count transitions between consecutive n-grams (stored as edge data)
- Build a directed graph representing the n-gram sequence structure
Examples
graph = string_to_ngram_graph(s="abcabc", n=2)
# Creates graph with vertices "ab", "bc", "ca" and weighted edgesMycelia.plot_ngram_graph — Functionplot_ngram_graph(g) -> Any
Create a visual plot of an n-gram graph using GraphPlot.jl. Displays the graph structure with vertex labels showing the n-grams.
Arguments
g: MetaGraph containing n-gram data (typically fromstring_to_ngram_graph)
Returns
- Plot object showing the graph layout with labeled vertices
Details
Uses a spring layout algorithm to position vertices and displays n-gram strings as vertex labels for easy interpretation of the graph structure.
Examples
graph = string_to_ngram_graph(s="hello", n=2)
plot_ngram_graph(graph) # Shows visual representationGraph Simplification
Mycelia.collapse_unbranching_paths — FunctionCollapse linear paths (vertices with one incoming and one outgoing edge) into a simpler graph where sequences are concatenated.
Mycelia.find_connected_components — FunctionFind all connected components in the graph. Returns a vector of vectors of vertex indices.
For directed graphs, uses weakly connected components. For undirected graphs, uses standard connected components.
Note: Additional graph simplification function assemble_strings is available but currently lacks documentation.
Sequence I/O
File Handling
Mycelia.open_fastx — Functionopen_fastx(path::AbstractString) -> Any
Open and return a reader for FASTA or FASTQ format files.
Arguments
path::AbstractString: Path to input file. Can be:- Local file path
- HTTP/FTP URL
- Gzip compressed (.gz extension)
Supported formats
- FASTA (.fasta, .fna, .faa, .fa)
- FASTQ (.fastq, .fq)
Returns
FASTX.FASTA.Readerfor FASTA filesFASTX.FASTQ.Readerfor FASTQ files
Mycelia.write_fastq — Functionwrite_fastq(;records, filename, gzip=false)write_fastq(; records, filename, gzip)
Write FASTQ records to file using FASTX.jl. Validates extension: .fastq, .fq, .fastq.gz, or .fq.gz. If gzip is true or filename endswith .gz, output is gzipped. records must be an iterable of FASTX.FASTQ.Record.
Mycelia.write_fasta — Functionwrite_fasta(; outfile, records, gzip)
Writes FASTA records to a file, optionally gzipped.
Arguments
outfile::AbstractString: Path to the output FASTA file. Will append ".gz" ifgzipis true and ".gz" isn't already the extension.records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.gzip::Bool: Optionally force compression of the output with gzip. By default will use the file name to infer.
Returns
outfile::String: The path to the output FASTA file (including ".gz" if applicable).
Format Conversion
Mycelia.convert_sequence — Functionconvert_sequence(seq::AbstractString) -> Any
Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:
- DNA sequences are converted using
BioSequences.LongDNA - RNA sequences are converted using
BioSequences.LongRNA - AA sequences are converted using
BioSequences.LongAA
K-mer Analysis
K-mer Counting
For k-mer counting functions, see Sequence Analysis Workflow.
Distance Metrics
Mycelia.jaccard_distance — FunctionCompute the Jaccard distance between columns of a binary matrix.
Arguments
M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples
Returns
Matrix{Float64}: Symmetric distance matrix with Jaccard distances
jaccard_distance(set1, set2) -> Any
Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.
The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$
Arguments
set1: First set to compareset2: Second set to compare
Returns
Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
Mycelia.kmer_counts_to_js_divergence — Functionkmer_counts_to_js_divergence(
kmer_counts_1,
kmer_counts_2
) -> Any
Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.
Arguments
kmer_counts_1: Dictionary mapping k-mers to their counts in first sequencekmer_counts_2: Dictionary mapping k-mers to their counts in second sequence
Returns
- Normalized Jensen-Shannon divergence score between 0 and 1, where:
- 0 indicates identical distributions
- 1 indicates maximally different distributions
Notes
- The measure is symmetric: JS(P||Q) = JS(Q||P)
- Counts are automatically normalized to probability distributions
Genome Size Estimation
Mycelia.estimate_genome_size_from_kmers — Functionestimate_genome_size_from_kmers(
sequence::Union{AbstractString, BioSequences.LongSequence},
k::Integer
) -> Dict
Estimate genome size from k-mer analysis using total k-mer count.
This function estimates genome size using the basic relationship: genomesize ≈ totalkmers - k + 1, where total_kmers is the sum of all k-mer counts. This is a simple estimation method; more sophisticated approaches accounting for sequencing depth, repeats, and errors may be more accurate.
Arguments
sequence::Union{BioSequences.LongSequence, AbstractString}: Input sequence or stringk::Integer: K-mer size for analysis
Returns
Dict{String, Any}: Dictionary containing:- "unique_kmers": Number of unique k-mers observed
- "total_kmers": Total k-mer count (sum of all frequencies)
- "estimatedgenomesize": Estimated genome size
- "actual_size": Length of input sequence (if provided)
Examples
# Estimate genome size from a sequence
sequence = "ATCGATCGATCGATCG"
result = estimate_genome_size_from_kmers(sequence, 5)estimate_genome_size_from_kmers(
records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1},
k::Integer
) -> Dict
Estimate genome size from FASTQ/FASTA records using k-mer analysis.
Overload for processing FASTQ or FASTA records directly.
Arguments
records::AbstractVector{T}where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input recordsk::Integer: K-mer size for analysis
Returns
Dict{String, Any}: Dictionary with k-mer statistics and genome size estimate
K-mer Result Storage
Mycelia.save_kmer_results — Functionsave_kmer_results(
;
filename,
kmers,
counts,
fasta_list,
k,
alphabet
)
Save the kmer counting results (kmers vector, counts sparse matrix) and the input FASTA file list to a JLD2 file for long-term storage and reproducibility.
Arguments
filename::AbstractString: Path to the output JLD2 file.kmers::AbstractVector{<:Kmers.Kmer}: The sorted vector of unique kmer objects.counts::AbstractMatrix: The (sparse or dense) matrix of kmer counts.fasta_list::AbstractVector{<:AbstractString}: The list of FASTA file paths used as input.k::Integer: The kmer size used.alphabet::Symbol: The alphabet used (:AA, :DNA, :RNA).
Mycelia.load_kmer_results — Functionload_kmer_results(
filename::AbstractString
) -> Union{Nothing, NamedTuple{(:kmers, :counts, :fasta_list, :metadata), <:Tuple{Any, Any, Any, Dict{String, Any}}}}
Load kmer counting results previously saved with save_kmer_results.
Arguments
filename::AbstractString: Path to the input JLD2 file.
Returns
NamedTuple: Contains the loadedkmers,counts,fasta_list, andmetadata. Returnsnothingif the file cannot be loaded or essential keys are missing.
Simulation and Testing
Sequence Generation
Mycelia.generate_test_sequences — Functiongenerate_test_sequences(
genome_size::Int64
) -> Vector{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}
generate_test_sequences(
genome_size::Int64,
n_sequences::Int64
) -> Vector{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}
Generate test DNA sequences for k-mer analysis benchmarking.
Creates random DNA sequences suitable for k-mer counting and analysis performance testing.
Arguments
genome_size::Int: Size of each generated sequence in base pairsn_sequences::Int: Number of sequences to generate (default: 1)
Returns
- Vector of
BioSequences.LongDNA{4}sequences
See Also
random_fasta_record: For generating FASTA records with random sequencesBioSequences.randdnaseq: For generating random DNA sequences
Read Simulation
See Data Acquisition Workflow for read simulation functions.
Assembly Testing (Experimental)
Experimental assembly testing function test_intelligent_assembly is available in src/development/ but not currently part of the public API.
Quality Control
FASTQ Analysis
Mycelia.calculate_gc_content — Functioncalculate_gc_content(
sequence::BioSequences.LongSequence
) -> Float64
Calculate GC content (percentage of G and C bases) from a BioSequence.
This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.
Arguments
sequence::BioSequences.LongSequence: Input DNA or RNA sequence
Returns
Float64: GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)
# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG")
gc_percent = calculate_gc_content(rna_seq)calculate_gc_content(sequence::AbstractString) -> Float64
Calculate GC content from a string sequence.
Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.
Arguments
sequence::AbstractString: Input DNA or RNA sequence as string
Returns
Float64: GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")calculate_gc_content(
records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64
Calculate GC content from FASTA/FASTQ records.
Processes multiple records and calculates overall GC content across all sequences.
Arguments
records::AbstractVector{T}where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records
Returns
Float64: Overall GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)Mycelia.assess_duplication_rates — Functionassess_duplication_rates(fastq; results_table) -> Any
Analyze sequence duplication rates in a FASTQ file.
This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.
Arguments
fastq::String: Path to the input FASTQ file to analyzeresults_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"
Returns
String: Path to the results table file
Output
Generates a tab-separated file containing the following metrics:
total_records: Total number of sequence records in the filetotal_unique_observations: Count of unique sequence stringstotal_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)percent_unique_observations: Percentage of sequences that are uniquepercent_unique_canonical_observations: Percentage of sequences that are unique after canonicalizationpercent_duplication_rate: Percentage of sequences that are duplicates (100 - percentuniqueobservations)percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization
Notes
- If the specified results file already exists and is not empty, the function will return early without recomputing.
- Progress is displayed during processing with a progress bar showing speed.
Example
# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")
# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")For FASTQ quality analysis functions, see Quality Control Workflow.
External Tool Integration
For quality control and filtering functions, see Quality Control Workflow.
Assembly Validation
Quality Assessment
Mycelia.assess_assembly_quality — Functionassess_assembly_quality(contigs_file)Assess basic assembly quality metrics from a FASTA file.
Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.
Arguments
contigs_file: Path to FASTA file containing assembly contigs
Returns
- Tuple of (ncontigs, totallength, n50, l50)
n_contigs: Number of contigs in the assemblytotal_length: Total length of all contigs in base pairsn50: N50 statistic (length of shortest contig in the set covering 50% of assembly)l50: L50 statistic (number of contigs needed to reach 50% of assembly length)
Example
n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")See Also
assess_assembly_kmer_quality: For k-mer based assembly quality assessment
assess_assembly_quality(; assembly, observations, ks)
Assess assembly quality using k-mer based metrics including QV scores.
This function provides a comprehensive quality assessment of genome assemblies by comparing k-mer distributions between the assembly and source sequencing data. It calculates multiple quality metrics across different k-mer sizes, with the QV (Quality Value) score being the primary measure of assembly accuracy.
Arguments
assembly: Path to assembly file (FASTA format) or assembly sequencesobservations: Path to sequencing data file(s) (FASTQ format) or sequence observationsks::Vector{Int}: K-mer sizes to analyze (default: 11,13,17,19,23,31,53 for comprehensive evaluation)
Returns
DataFrame containing quality metrics for each k-mer size:
k::Int: K-mer length usedcosine_distance::Float64: Cosine similarity between k-mer distributionsjs_divergence::Float64: Jensen-Shannon divergence between distributionsqv::Float64: Merqury-style Quality Value score (primary assembly accuracy metric)
QV Score Interpretation
- QV ≥ 40: High quality assembly (>99.99% base accuracy)
- QV 30-40: Good quality assembly (99.9-99.99% base accuracy)
- QV 20-30: Moderate quality assembly (99-99.9% base accuracy)
- QV < 20: Lower quality assembly (<99% base accuracy)
Implementation Details
The QV score follows the Merqury methodology:
- Count k-mers in both assembly and sequencing data
- Calculate shared k-mers between datasets
- Estimate base-level accuracy: P = (Kshared/Ktotal)^(1/k)
- Convert to Phred scale: QV = -10log₁₀(1-P)
Performance Notes
- Uses multiple k-mer sizes for robust quality assessment
- Larger k-mer sizes provide more discriminative power
- Parallel k-mer counting for computational efficiency
- Progress tracking for long-running analyses
References
Rhie, A. et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies." Genome Biology 21, 245 (2020).
Mycelia.validate_assembly — Functionvalidate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}Validate assembly quality using various metrics and optional reference comparison.
Arguments
assembly: Assembly result to validatereference: Optional reference sequence for comparison
Returns
Dict{String, Any}: Comprehensive validation metrics
Details
Computes assembly quality metrics including:
- N50, N90 statistics
- Total assembly length and number of contigs
- Coverage uniformity (if reference provided)
- Structural variant detection (if reference provided)
- Gap analysis and repeat characterization
External Validators
Mycelia.run_quast — Functionrun_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.
Arguments
assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluateoutdir::String="quast_results": Output directory for QUAST resultsreference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metricsthreads::Int=Sys.CPU_THREADS: Number of threads to usemin_contig::Int=500: Minimum contig length to considergene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)
Returns
String: Path to the output directory containing QUAST results
Output Files
report.html: Interactive HTML reportreport.txt: Text summary reportreport.tsv: Tab-separated values report for programmatic accesstransposed_report.tsv: Transposed TSV formaticarus.html: Icarus contig browser (if reference provided)
Examples
# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)
# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)
# Custom parameters
quast_dir = Mycelia.run_quast(assemblies,
outdir="my_quast_results",
min_contig=1000,
threads=8)Notes
- Requires QUAST to be installed via Bioconda
- Without reference: provides basic metrics (N50, total length, # contigs, etc.)
- With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
- Gene finding requires additional dependencies and is disabled by default
run_quast(assembly_file::String; kwargs...)Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.
Mycelia.run_busco — Functionrun_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.
Arguments
assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluateoutdir::String="busco_results": Output directory for BUSCO resultslineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteriaodb10", "eukaryotaodb10", "auto")mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")threads::Int=Sys.CPU_THREADS: Number of threads to useforce::Bool=false: Force overwrite existing results
Returns
String: Path to the output directory containing BUSCO results
Output Files
short_summary.specific.lineage.txt: Summary statisticsfull_table.tsv: Complete BUSCO results tablemissing_busco_list.tsv: List of missing BUSCOsrun_lineage/: Detailed results directory
Examples
# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)
# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")
# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
outdir="my_busco_results",
lineage="enterobacterales_odb10",
threads=8)Notes
- Requires BUSCO to be installed via Bioconda
- Auto lineage detection requires internet connection for first run
- Available lineages: bacteriaodb10, archaeaodb10, eukaryotaodb10, fungiodb10, etc.
- Results provide Complete, Fragmented, and Missing BUSCO counts
run_busco(assembly_file::String; kwargs...)Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.
Mycelia.run_mummer — Functionrun_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)Run MUMmer for genome comparison and alignment between reference and query sequences.
Arguments
reference::String: Path to reference genome FASTA filequery::String: Path to query genome FASTA fileoutdir::String="mummer_results": Output directory for MUMmer resultsprefix::String="out": Prefix for output filesmincluster::Int=65: Minimum cluster length for nucmerminmatch::Int=20: Minimum match length for nucmerthreads::Int=1: Number of threads (note: MUMmer has limited multithreading)
Returns
String: Path to the output directory containing MUMmer results
Output Files
prefix.delta: Delta alignment file (main output)prefix.coords: Human-readable coordinates fileprefix.snps: SNP/indel report (if show-snps is run)prefix.plot.png: Dot plot visualization (if mummerplot is run)
Examples
# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)
# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
outdir="comparison_results",
prefix="comparison",
mincluster=100,
minmatch=30)Notes
- Requires MUMmer to be installed via Bioconda
- nucmer is used for DNA sequence alignment
- show-coords generates human-readable coordinate output
- Results include alignment coordinates, percent identity, and coverage
- For visualization, use mummerplot (requires gnuplot)
Annotation
Gene Prediction
Functions for gene prediction and annotation (using external tools):
- Pyrodigal integration
- BLAST+ searches
- MMSeqs2 searches
- TransTerm terminator prediction
- tRNAscan-SE tRNA detection
- MLST typing
Comparative Genomics
Visualization
Plotting Functions
Mycelia.visualize_genome_coverage — Functionvisualize_genome_coverage(
coverage_table::DataFrames.AbstractDataFrame;
entity
) -> Makie.Figure
Creates a multi-panel visualization of genome coverage across chromosomes.
Arguments
coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data
Keyword Arguments
entity: Optional string to specify the entity being analyzed. If provided, chromosome titles become "$(entity): $(chromosome)", otherwise just "$(chromosome)"
Returns
CairoMakie.Figure: A composite figure with coverage plots for each chromosome
Details
Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.
Mycelia.plot_embeddings — FunctionPlot embeddings with optional true and fitted cluster labels using Makie.jl, with legend outside, and color by fit labels, shape by true labels.
Arguments
embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data pointtitle::String: Title of the plotxlabel::String: Label for the x-axisylabel::String: Label for the y-axistrue_labels::Vector{<:Integer}: Vector of true cluster labels (optional)fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)
Returns
Makie.Figure: Figure object that can be displayed or saved
Mycelia.plot_taxa_abundances — Functionplot_taxa_abundances(
df::DataFrames.DataFrame,
taxa_level::String;
top_n::Int = 10,
sample_id_col::String = "sample_id",
filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
figure_width::Int = 1500,
figure_height::Int = 1000,
bar_width::Float64 = 0.7,
x_rotation::Int = 45,
sort_samples::Bool = true,
color_seed::Union{Int, Nothing} = nothing,
legend_fontsize::Float64 = 12.0,
legend_itemsize::Float64 = 12.0,
legend_padding::Float64 = 5.0,
legend_rowgap::Float64 = 1.0,
legend_labelwidth::Union{Nothing, Float64} = nothing,
legend_titlesize::Float64 = 15.0,
legend_nbanks::Int = 1
)Create a stacked bar chart showing taxa relative abundances for each sample.
Arguments
df: DataFrame with sample_id and taxonomic assignments at different levelstaxa_level: Taxonomic level to analyze (e.g., "genus", "species")top_n: Number of top taxa to display individually, remainder grouped as "Other"sample_id_col: Column name containing sample identifiersfilter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)figure_width: Width of the figure in pixelsfigure_height: Height of the figure in pixelsbar_width: Width of each bar (between 0 and 1)x_rotation: Rotation angle for x-axis labels in degreessort_samples: Whether to sort samples alphabeticallycolor_seed: Seed for reproducible color generationlegend_fontsize: Font size for legend entrieslegend_itemsize: Size of the colored marker/icon in the legendlegend_padding: Padding around legend elementslegend_rowgap: Space between legend rowslegend_labelwidth: Maximum width for legend labels (truncation)legend_titlesize: Font size for legend titlelegend_nbanks: Number of legend columns
Returns
fig: CairoMakie figure objectax: CairoMakie axis objecttaxa_colors: Dictionary mapping taxa to their assigned colors
For k-mer frequency plotting, see Sequence Analysis Workflow.
Utility Functions
Memory Management (Experimental)
Experimental memory management functions (estimate_memory_usage, check_memory_limits) are available in src/development/ but not currently part of the public API.
File Operations
Helper functions for robust file handling and data validation.
Progress Tracking
Functions for user feedback during long-running operations.
Graph Types and Enums
Graph Modes
Mycelia.GraphMode — TypeGraph mode for handling strand information.
SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation
Mycelia.StrandOrientation — TypeStrand orientation for k-mer observations and transitions.
Forward: k-mer as observed (5' to 3')Reverse: reverse complement of k-mer (3' to 5')
Used to control how graphs are constructed and oriented.
Data Structures
Assembly Results
The result of assembly operations typically includes:
:final_assembly: Vector of assembled sequences:k_progression: K-mer sizes used:metadata: Assembly statistics and parameters
Quality Metrics
Standard quality metrics returned by analysis functions:
- Coverage statistics
- Quality score distributions
- Joint probability calculations
- Assembly completeness measures
Function Categories
Core Assembly Pipeline
- Input Processing:
open_fastx,convert_sequence - Quality Analysis:
analyze_fastq_quality, qualmer functions - Assembly:
mycelia_assemble, graph construction - Validation:
assess_assembly_quality,validate_assembly - Output: Assembly statistics and final sequences
Supporting Workflows
- Data Acquisition: Download and simulation functions
- Quality Control: Filtering and trimming tools
- Analysis: K-mer analysis and distance calculations
- Visualization: Plotting and reporting functions
Type System
Mycelia uses a strict type system based on BioSequences.jl:
- DNA sequences:
BioSequences.LongDNA{4} - RNA sequences:
BioSequences.LongRNA{4} - Protein sequences:
BioSequences.LongAA - K-mers:
Kmers.DNAKmer{K},Kmers.RNAKmer{K},Kmers.AAKmer{K}
Quality-Aware Types
- Qualmer: Combines k-mer with quality scores
- QualmerVertexData: Graph vertex with quality metadata
- QualmerEdgeData: Graph edge with transition weights
Performance Characteristics
Scalability
Most functions scale as follows with input size:
- Linear: File I/O, k-mer counting
- Quadratic: Distance matrix calculations
- Variable: Assembly (depends on complexity)
Memory Usage
Typical memory requirements:
- Small genomes (< 10 Mb): 1-4 GB RAM
- Bacterial genomes (1-10 Mb): 4-16 GB RAM
- Large genomes (> 100 Mb): 32-128 GB RAM
Use memory_limit parameters to control usage.
Parallelization
Functions that support parallel execution:
- K-mer counting (automatic)
- Assembly (multi-threaded)
- Quality analysis (vectorized)
Set JULIA_NUM_THREADS environment variable for best performance.
Error Handling
Common Exceptions
ArgumentError: Invalid parametersMethodError: Wrong typesOutOfMemoryError: Insufficient RAMSystemError: File I/O issues
Debugging Tips
- Check inputs: Use
typeof()andlength() - Monitor memory: Use
memory_limitparameters - Enable logging: Set
verbose=truein functions where applicable - Test subset: Use smaller datasets for debugging
Integration Points
External Tools
Mycelia integrates with numerous bioinformatics tools via Bioconda:
- Assemblers: SPAdes, Megahit, Miniasm, Raven
- QC Tools: FastP, Filtlong, Trim Galore
- Validators: QUAST, BUSCO, MUMmer
- Annotators: Pyrodigal, BLAST+, MMSeqs2
File Formats
Supported formats:
- Input: FASTA, FASTQ (including gzipped)
- Output: FASTA, CSV, JSON, JLD2
- Intermediate: Graph formats, quality scores
Version Compatibility
This documentation corresponds to Mycelia version 0.1.0+.
- Breaking changes will be noted in release notes
- Deprecated functions include migration guidance
- Experimental features are clearly marked
For the latest updates, see the CHANGELOG.