API Reference
This page provides comprehensive documentation for all public functions in Mycelia, automatically generated from docstrings.
Table of Contents
- API Reference
- Table of Contents
- Assembly Functions
- Qualmer Analysis
- String Graphs
- Sequence I/O
- K-mer Analysis
- Simulation and Testing
- Quality Control
- Assembly Validation
- Annotation
- Comparative Genomics
- Visualization
- Utility Functions
- Graph Types and Enums
- Data Structures
- Function Categories
- Type System
- Performance Characteristics
- Error Handling
- Integration Points
- Version Compatibility
Assembly Functions
Intelligent Assembly
Mycelia.mycelia_assemble
— MethodMain Mycelia intelligent assembly algorithm. Implements iterative prime k-mer progression with error correction.
Mycelia.mycelia_assemble
— Methodmycelia_assemble(
input_file::Union{String, Vector{String}};
output_dir,
max_k,
memory_limit,
verbose
) -> Dict
User-friendly wrapper for intelligent assembly that accepts file paths.
Arguments
input_file::Union{String, Vector{String}}
: Path to FASTQ file(s) or directory containing FASTQ filesoutput_dir::String
: Directory to write assembly results (default: "mycelia_assembly")max_k::Int
: Maximum k-mer size to try (default: 101)memory_limit::Int
: Memory limit in bytes (default: 32GB)verbose::Bool
: Print progress information (default: true)
Returns
- Assembly results dictionary with contigs, statistics, and metadata
Examples
# Assemble single file
results = Mycelia.mycelia_assemble("reads.fastq")
# Assemble with custom parameters
results = Mycelia.mycelia_assemble("reads.fastq",
output_dir="my_assembly",
max_k=51,
memory_limit=4_000_000_000)
# Assemble multiple files (will be merged)
results = Mycelia.mycelia_assemble(["reads1.fastq", "reads2.fastq"])
Notes
- Input files can be gzipped (.gz extension)
- Uses intelligent k-mer selection and error correction
- Automatically creates output directory if it doesn't exist
- Progress is saved incrementally for long assemblies
Assembly Utilities
Mycelia.dynamic_k_prime_pattern
— Functiondynamic_k_prime_pattern() -> Vector{Int64}
dynamic_k_prime_pattern(start_prime::Int64) -> Vector{Int64}
dynamic_k_prime_pattern(
start_prime::Int64,
max_k::Int64
) -> Vector{Int64}
dynamic_k_prime_pattern(
start_prime::Int64,
max_k::Int64,
initial_step::Int64
) -> Vector{Int64}
Generate optimal k-mer sizes using dynamic prime pattern algorithm.
Based on the novel k-primes pattern from Mycelia-Dev research. This algorithm generates a sequence of prime numbers with progressively increasing gaps, providing computational efficiency through:
- Twin prime avoidance (skips one member of twin prime pairs)
- Progressive spacing reduces computational overlap
- Built-in prime discovery without pre-computation
Arguments
start_prime::Int
: Initial prime number (default: 11, optimal for strain-level resolution)max_k::Int
: Maximum k-mer size to consider (default: 101)initial_step::Int
: Initial step size (default: 2)
Returns
Vector{Int}
: Sequence of prime k-mer sizes with progressive spacing
Algorithm
- Start with initial prime and step size
- Each iteration: current_prime += step, then step += 2
- Continue while result remains prime and ≤ max_k
- Natural stopping when non-prime encountered
Example
# Generates: [11, 13, 17, 23, 31, 41, 53, 67, 83, 101]
ks = dynamic_k_prime_pattern(11, 101, 2)
# For error-prone data, start smaller
ks = dynamic_k_prime_pattern(7, 51, 2) # [7, ...]
References
Based on k-primes-pattern.ipynb from Mycelia-Dev research demonstrating mathematical elegance of using prime number distribution for genomic analysis optimization.
Mycelia.error_optimized_k_sequence
— Functionerror_optimized_k_sequence(
error_rate::Float64
) -> Vector{Int64}
error_optimized_k_sequence(
error_rate::Float64,
max_k::Int64
) -> Vector{Int64}
error_optimized_k_sequence(
error_rate::Float64,
max_k::Int64,
sequence_length::Union{Nothing, Int64}
) -> Vector{Int64}
Generate k-mer sizes optimized for specific error rates using mathematical foundation.
Combines the error rate formula from Mycelia-Dev research with dynamic prime pattern generation for optimal k-mer selection.
Arguments
error_rate::Float64
: Expected sequencing error rate (0.0-1.0)max_k::Int
: Maximum k-mer size to consider (default: 101)sequence_length::Union{Int, Nothing}
: Optional sequence length for log4 optimization
Returns
Vector{Int}
: Optimally spaced prime k-mer sizes
Algorithm
- Calculate lower bound:
lower_bound_k = max(3, Int(floor(1/error_rate - 1)))
- If sequencelength provided, optimize using log4(sequencelength) pattern
- Generate dynamic prime pattern starting from optimal lower bound
Example
# For 1% error rate data
ks = error_optimized_k_sequence(0.01) # Starts around k=99
# For 5% error rate with known sequence length
ks = error_optimized_k_sequence(0.05, 101, 10000) # Optimized for 10kb sequences
Mathematical Foundation
- Lower bound formula:
k >= 1/error_rate - 1
- Log4 optimization:
optimal_k ≈ log4(sequence_length)
for divergence point - Progressive spacing reduces redundant analysis
References
Based on 2020-12-22 k-mer size selection and 2020-12-19 error rate analysis from Mycelia-Dev research.
Mycelia.calculate_assembly_reward
— FunctionCalculate reward for current k-mer processing iteration. Higher rewards indicate better assembly quality progress.
Qualmer Analysis
Core Qualmer Types
Mycelia.Qualmer
— Typestruct Qualmer{KmerT, K}
A quality-aware k-mer that pairs sequence k-mers with their corresponding quality scores.
Fields
kmer::Any
qualities::Tuple{Vararg{UInt8, K}} where K
Examples
using Mycelia
using Kmers
# Create a DNA k-mer with quality scores
kmer = DNAKmer("ATCG")
qualities = [30, 35, 32, 28] # Phred quality scores
qmer = Qualmer(kmer, qualities)
# Access individual bases and qualities
base, quality = qmer[1] # Returns ('A', 30)
This structure is essential for quality-aware sequence analysis and assembly algorithms that consider both sequence content and sequencing quality information.
Mycelia.QualmerVertexData
— Typestruct QualmerVertexData{QualmerT<:Mycelia.Qualmer}
Vertex data for quality-aware k-mer graphs that stores k-mer observations with quality scores.
Fields
canonical_qualmer::Mycelia.Qualmer
observations::Array{Mycelia.QualmerObservation{QualmerT}, 1} where QualmerT<:Mycelia.Qualmer
joint_probability::Float64
coverage::Int64
mean_quality::Float64
This structure aggregates multiple observations of the same k-mer and computes quality-based statistics for use in quality-aware assembly algorithms.
Mycelia.QualmerEdgeData
— Typestruct QualmerEdgeData
Edge data for quality-aware k-mer graphs that connects k-mer vertices with strand information.
Fields
observations::Vector{Tuple{Int64, Int64}}
src_strand::Mycelia.StrandOrientation
dst_strand::Mycelia.StrandOrientation
weight::Float64
quality_weight::Float64
This structure represents connections between k-mers in quality-aware graphs, including strand orientation and quality-weighted edge weights for assembly algorithms.
Qualmer Generation
Mycelia.qualmers_fw
— Functionqualmers_fw(
sequence::BioSequences.LongSequence{A},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Create an iterator that yields DNA qualmers from the given sequence and quality scores.
qualmers_fw(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate forward qualmers from a FASTQ record. Extracts all k-mers in the forward direction with their associated quality scores from the FASTQ record.
Arguments
record
: FASTQ record containing sequence and quality datak
: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
Mycelia.qualmers_canonical
— Functionqualmers_canonical(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#264#265"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate canonical DNA qualmers from sequence and quality scores. For each k-mer, returns the lexicographically smaller of the k-mer and its reverse complement, ensuring consistent representation regardless of strand.
Arguments
sequence
: DNA sequence as LongDNAquality
: Vector of quality scores corresponding to each base::Val{K}
: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
qualmers_canonical(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
Generate canonical amino acid qualmers from protein sequence and quality scores. For amino acid sequences, canonical representation is the same as forward representation since proteins don't have reverse complements.
Arguments
sequence
: Protein sequence as LongAAquality
: Vector of quality scores corresponding to each amino acid::Val{K}
: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
qualmers_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate canonical qualmers from a FASTQ record. Extracts k-mers in canonical representation (lexicographically smaller of forward and reverse complement) with their associated quality scores.
Arguments
record
: FASTQ record containing sequence and quality datak
: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples with canonical k-mer representation
Mycelia.qualmers_unambiguous
— Functionqualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#260#261"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate unambiguous DNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, T, G, C). Skips k-mers containing ambiguous bases like N.
Arguments
sequence
: DNA sequence as LongDNAquality
: Vector of quality scores corresponding to each base::Val{K}
: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
qualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#262#263"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
Generate unambiguous RNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, U, G, C). Skips k-mers containing ambiguous bases like N.
Arguments
sequence
: RNA sequence as LongRNAquality
: Vector of quality scores corresponding to each base::Val{K}
: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
qualmers_unambiguous(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#258#259"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
Generate amino acid qualmers from protein sequence and quality scores. For amino acid sequences, all k-mers are considered unambiguous, so this falls back to the forward qualmer generation.
Arguments
sequence
: Protein sequence as LongAAquality
: Vector of quality scores corresponding to each amino acid::Val{K}
: K-mer size as a type parameter
Returns
- Iterator yielding (Qualmer, position) tuples for each k-mer
qualmers_unambiguous(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Generate unambiguous qualmers from a FASTQ record. Extracts only k-mers that contain unambiguous nucleotides (no N bases) with their associated quality scores.
Arguments
record
: FASTQ record containing sequence and quality datak
: K-mer size
Returns
- Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer
Mycelia.qualmers_unambiguous_canonical
— Functionqualmers_unambiguous_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator{I, Mycelia.var"#266#267"} where I
Generate unambiguous canonical qualmers from the given FASTQ record.
Quality Analysis
Mycelia.phred_to_probability
— Functionphred_to_probability(phred_score::UInt8) -> Float64
Convert PHRED quality score to probability of correctness. PHRED score Q relates to error probability P by: Q = -10 * log10(P) Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)
Mycelia.qualmer_correctness_probability
— Functionqualmer_correctness_probability(
qmer::Mycelia.Qualmer
) -> Float64
Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))
Mycelia.joint_qualmer_probability
— Functionjoint_qualmer_probability(
qualmers::Vector{<:Mycelia.Qualmer};
use_log_space
) -> Float64
Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.
Arguments
qualmers
: Vector of Qualmer observations of the same k-mer sequenceuse_log_space
: Use log-space arithmetic for numerical stability (default: true)
Returns
- Float64: Joint probability that all observations are correct
Mycelia.position_wise_joint_probability
— Functionposition_wise_joint_probability(
qualmers::Vector{<:Mycelia.Qualmer};
use_log_space
) -> Float64
Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability
as it considers the quality at each position across all observations.
Arguments
qualmers
: Vector of Qualmer observations of the same k-mer sequenceuse_log_space
: Use log-space arithmetic for numerical stability (default: true)
Returns
- Float64: Position-wise joint probability
Graph Construction
Mycelia.build_qualmer_graph
— Functionbuild_qualmer_graph(
fastq_records::Vector{FASTX.FASTQ.Record};
k,
graph_mode,
min_quality,
min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}
Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.
Arguments
fastq_records
: Vector of FASTQ records with quality scoresk
: K-mer sizegraph_mode
: SingleStrand or DoubleStrand mode (default: DoubleStrand)min_quality
: Minimum average PHRED quality to include k-mer (default: 10)min_coverage
: Minimum coverage (observations) to include k-mer (default: 1)
Returns
- MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData
Mycelia.get_qualmer_statistics
— Functionget_qualmer_statistics(
graph::MetaGraphsNext.MetaGraph
) -> Dict{String, Any}
Get comprehensive statistics about a qualmer graph.
String Graphs
N-gram Analysis
Mycelia.ngrams
— Functionngrams(s::AbstractString, n::Int64) -> Any
Extract all n-grams from a string. Returns a vector containing all contiguous substrings of length n
from the input string.
Arguments
s
: Input string to extract n-grams fromn
: Length of each n-gram (must be positive)
Returns
Vector{String}
: All n-grams of lengthn
from the string
Examples
ngrams("hello", 3) # Returns ["hel", "ell", "llo"]
ngrams("abcd", 2) # Returns ["ab", "bc", "cd"]
Mycelia.string_to_ngram_graph
— Functionstring_to_ngram_graph(; s, n)
Build a directed graph from string n-grams with edge and vertex weights. Creates a MetaGraph where vertices represent unique n-grams and edges represent transitions between consecutive n-grams in the original string.
Arguments
s
: Input string to build graph fromn
: N-gram size (length of each substring)
Returns
MetaGraphsNext.MetaGraph
: Directed graph with:- Vertex labels: String n-grams
- Vertex data: Integer counts of n-gram occurrences
- Edge data: Integer counts of n-gram transitions
Details
The graph construction process:
- Extract all n-grams from the string
- Count occurrences of each unique n-gram (stored as vertex data)
- Count transitions between consecutive n-grams (stored as edge data)
- Build a directed graph representing the n-gram sequence structure
Examples
graph = string_to_ngram_graph(s="abcabc", n=2)
# Creates graph with vertices "ab", "bc", "ca" and weighted edges
Mycelia.plot_ngram_graph
— Functionplot_ngram_graph(g) -> Any
Create a visual plot of an n-gram graph using GraphPlot.jl. Displays the graph structure with vertex labels showing the n-grams.
Arguments
g
: MetaGraph containing n-gram data (typically fromstring_to_ngram_graph
)
Returns
- Plot object showing the graph layout with labeled vertices
Details
Uses a spring layout algorithm to position vertices and displays n-gram strings as vertex labels for easy interpretation of the graph structure.
Examples
graph = string_to_ngram_graph(s="hello", n=2)
plot_ngram_graph(graph) # Shows visual representation
Graph Simplification
Mycelia.collapse_unbranching_paths
— FunctionCollapse linear paths (vertices with one incoming and one outgoing edge) into a simpler graph where sequences are concatenated.
Mycelia.find_connected_components
— FunctionFind all connected components in the graph. Returns a vector of vectors of vertex indices.
For directed graphs, uses weakly connected components. For undirected graphs, uses standard connected components.
Mycelia.assemble_strings
— Functionassemble_strings(
graph::MetaGraphsNext.MetaGraph
) -> Vector{String}
Assemble strings by performing graph walks from each connected component.
Traverses the string graph to reconstruct sequences by walking through connected components and assembling overlapping n-grams back into full strings.
Arguments
graph
: MetaGraphsNext.MetaGraph containing n-gram vertices and transition edges
Returns
Vector{String}
: Assembled sequences, one for each connected component
Examples
# Create a string graph from overlapping sequences
graph = build_string_ngram_metagraph("ATCGATCG", 3)
sequences = assemble_strings(graph)
This function is useful for reconstructing original sequences from n-gram decompositions and for string-based genome assembly approaches.
Sequence I/O
File Handling
Mycelia.open_fastx
— Functionopen_fastx(path::AbstractString) -> Any
Open and return a reader for FASTA or FASTQ format files.
Arguments
path::AbstractString
: Path to input file. Can be:- Local file path
- HTTP/FTP URL
- Gzip compressed (.gz extension)
Supported formats
- FASTA (.fasta, .fna, .faa, .fa)
- FASTQ (.fastq, .fq)
Returns
FASTX.FASTA.Reader
for FASTA filesFASTX.FASTQ.Reader
for FASTQ files
Format Conversion
Mycelia.convert_sequence
— Functionconvert_sequence(seq::AbstractString) -> Any
Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:
- DNA sequences are converted using
BioSequences.LongDNA
- RNA sequences are converted using
BioSequences.LongRNA
- AA sequences are converted using
BioSequences.LongAA
K-mer Analysis
K-mer Counting
For k-mer counting functions, see Sequence Analysis Workflow.
Distance Metrics
Mycelia.jaccard_distance
— FunctionCompute the Jaccard distance between columns of a binary matrix.
Arguments
M::AbstractMatrix{<:Integer}
: Binary matrix where rows are features and columns are samples
Returns
Matrix{Float64}
: Symmetric distance matrix with Jaccard distances
jaccard_distance(set1, set2) -> Any
Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.
The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$
Arguments
set1
: First set to compareset2
: Second set to compare
Returns
Float64
: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
Mycelia.kmer_counts_to_js_divergence
— Functionkmer_counts_to_js_divergence(
kmer_counts_1,
kmer_counts_2
) -> Any
Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.
Arguments
kmer_counts_1
: Dictionary mapping k-mers to their counts in first sequencekmer_counts_2
: Dictionary mapping k-mers to their counts in second sequence
Returns
- Normalized Jensen-Shannon divergence score between 0 and 1, where:
- 0 indicates identical distributions
- 1 indicates maximally different distributions
Notes
- The measure is symmetric: JS(P||Q) = JS(Q||P)
- Counts are automatically normalized to probability distributions
Simulation and Testing
Read Simulation
See Data Acquisition Workflow for read simulation functions.
Assembly Testing
Mycelia.test_intelligent_assembly
— FunctionTest function to verify the implementation with sample data.
Quality Control
FASTQ Analysis
Mycelia.calculate_gc_content
— Functioncalculate_gc_content(
sequence::BioSequences.LongSequence
) -> Float64
Calculate GC content (percentage of G and C bases) from a BioSequence.
This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.
Arguments
sequence::BioSequences.LongSequence
: Input DNA or RNA sequence
Returns
Float64
: GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)
# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG")
gc_percent = calculate_gc_content(rna_seq)
calculate_gc_content(sequence::AbstractString) -> Float64
Calculate GC content from a string sequence.
Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.
Arguments
sequence::AbstractString
: Input DNA or RNA sequence as string
Returns
Float64
: GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")
calculate_gc_content(
records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64
Calculate GC content from FASTA/FASTQ records.
Processes multiple records and calculates overall GC content across all sequences.
Arguments
records::AbstractVector{T}
where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records
Returns
Float64
: Overall GC content as a percentage (0.0-100.0)
Examples
# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)
Mycelia.assess_duplication_rates
— Functionassess_duplication_rates(fastq; results_table) -> Any
Analyze sequence duplication rates in a FASTQ file.
This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.
Arguments
fastq::String
: Path to the input FASTQ file to analyzeresults_table::String
: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"
Returns
String
: Path to the results table file
Output
Generates a tab-separated file containing the following metrics:
total_records
: Total number of sequence records in the filetotal_unique_observations
: Count of unique sequence stringstotal_unique_canonical_observations
: Count of unique canonical sequences (after normalizing for reverse complements)percent_unique_observations
: Percentage of sequences that are uniquepercent_unique_canonical_observations
: Percentage of sequences that are unique after canonicalizationpercent_duplication_rate
: Percentage of sequences that are duplicates (100 - percentuniqueobservations)percent_canonical_duplication_rate
: Percentage of sequences that are duplicates after canonicalization
Notes
- If the specified results file already exists and is not empty, the function will return early without recomputing.
- Progress is displayed during processing with a progress bar showing speed.
Example
# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")
# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")
For FASTQ quality analysis functions, see Quality Control Workflow.
External Tool Integration
For quality control and filtering functions, see Quality Control Workflow.
Assembly Validation
Quality Assessment
Mycelia.assess_assembly_quality
— Functionassess_assembly_quality(contigs_file)
Assess basic assembly quality metrics from a FASTA file.
Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.
Arguments
contigs_file
: Path to FASTA file containing assembly contigs
Returns
- Tuple of (ncontigs, totallength, n50, l50)
n_contigs
: Number of contigs in the assemblytotal_length
: Total length of all contigs in base pairsn50
: N50 statistic (length of shortest contig in the set covering 50% of assembly)l50
: L50 statistic (number of contigs needed to reach 50% of assembly length)
Example
n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")
See Also
assess_assembly_kmer_quality
: For k-mer based assembly quality assessment
assess_assembly_quality(; assembly, observations, ks)
Assess assembly quality using k-mer based metrics including QV scores.
This function provides a comprehensive quality assessment of genome assemblies by comparing k-mer distributions between the assembly and source sequencing data. It calculates multiple quality metrics across different k-mer sizes, with the QV (Quality Value) score being the primary measure of assembly accuracy.
Arguments
assembly
: Path to assembly file (FASTA format) or assembly sequencesobservations
: Path to sequencing data file(s) (FASTQ format) or sequence observationsks::Vector{Int}
: K-mer sizes to analyze (default: 11,13,17,19,23,31,53 for comprehensive evaluation)
Returns
DataFrame containing quality metrics for each k-mer size:
k::Int
: K-mer length usedcosine_distance::Float64
: Cosine similarity between k-mer distributionsjs_divergence::Float64
: Jensen-Shannon divergence between distributionsqv::Float64
: Merqury-style Quality Value score (primary assembly accuracy metric)
QV Score Interpretation
- QV ≥ 40: High quality assembly (>99.99% base accuracy)
- QV 30-40: Good quality assembly (99.9-99.99% base accuracy)
- QV 20-30: Moderate quality assembly (99-99.9% base accuracy)
- QV < 20: Lower quality assembly (<99% base accuracy)
Implementation Details
The QV score follows the Merqury methodology:
- Count k-mers in both assembly and sequencing data
- Calculate shared k-mers between datasets
- Estimate base-level accuracy: P = (Kshared/Ktotal)^(1/k)
- Convert to Phred scale: QV = -10log₁₀(1-P)
Performance Notes
- Uses multiple k-mer sizes for robust quality assessment
- Larger k-mer sizes provide more discriminative power
- Parallel k-mer counting for computational efficiency
- Progress tracking for long-running analyses
References
Rhie, A. et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies." Genome Biology 21, 245 (2020).
Mycelia.validate_assembly
— Functionvalidate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}
Validate assembly quality using various metrics and optional reference comparison.
Arguments
assembly
: Assembly result to validatereference
: Optional reference sequence for comparison
Returns
Dict{String, Any}
: Comprehensive validation metrics
Details
Computes assembly quality metrics including:
- N50, N90 statistics
- Total assembly length and number of contigs
- Coverage uniformity (if reference provided)
- Structural variant detection (if reference provided)
- Gap analysis and repeat characterization
External Validators
Mycelia.run_quast
— Functionrun_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)
Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.
Arguments
assembly_files::Vector{String}
: Vector of paths to assembly FASTA files to evaluateoutdir::String="quast_results"
: Output directory for QUAST resultsreference::Union{String,Nothing}=nothing
: Optional reference genome for reference-based metricsthreads::Int=Sys.CPU_THREADS
: Number of threads to usemin_contig::Int=500
: Minimum contig length to considergene_finding::Bool=false
: Whether to run gene finding (requires GeneMark-ES/ET)
Returns
String
: Path to the output directory containing QUAST results
Output Files
report.html
: Interactive HTML reportreport.txt
: Text summary reportreport.tsv
: Tab-separated values report for programmatic accesstransposed_report.tsv
: Transposed TSV formaticarus.html
: Icarus contig browser (if reference provided)
Examples
# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)
# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)
# Custom parameters
quast_dir = Mycelia.run_quast(assemblies,
outdir="my_quast_results",
min_contig=1000,
threads=8)
Notes
- Requires QUAST to be installed via Bioconda
- Without reference: provides basic metrics (N50, total length, # contigs, etc.)
- With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
- Gene finding requires additional dependencies and is disabled by default
run_quast(assembly_file::String; kwargs...)
Run QUAST on a single assembly file. See run_quast(::Vector{String})
for details.
Mycelia.run_busco
— Functionrun_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)
Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.
Arguments
assembly_files::Vector{String}
: Vector of paths to assembly FASTA files to evaluateoutdir::String="busco_results"
: Output directory for BUSCO resultslineage::String="auto"
: BUSCO lineage dataset to use (e.g., "bacteriaodb10", "eukaryotaodb10", "auto")mode::String="genome"
: BUSCO mode ("genome", "transcriptome", "proteins")threads::Int=Sys.CPU_THREADS
: Number of threads to useforce::Bool=false
: Force overwrite existing results
Returns
String
: Path to the output directory containing BUSCO results
Output Files
short_summary.specific.lineage.txt
: Summary statisticsfull_table.tsv
: Complete BUSCO results tablemissing_busco_list.tsv
: List of missing BUSCOsrun_lineage/
: Detailed results directory
Examples
# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)
# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")
# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
outdir="my_busco_results",
lineage="enterobacterales_odb10",
threads=8)
Notes
- Requires BUSCO to be installed via Bioconda
- Auto lineage detection requires internet connection for first run
- Available lineages: bacteriaodb10, archaeaodb10, eukaryotaodb10, fungiodb10, etc.
- Results provide Complete, Fragmented, and Missing BUSCO counts
run_busco(assembly_file::String; kwargs...)
Run BUSCO on a single assembly file. See run_busco(::Vector{String})
for details.
Mycelia.run_mummer
— Functionrun_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)
Run MUMmer for genome comparison and alignment between reference and query sequences.
Arguments
reference::String
: Path to reference genome FASTA filequery::String
: Path to query genome FASTA fileoutdir::String="mummer_results"
: Output directory for MUMmer resultsprefix::String="out"
: Prefix for output filesmincluster::Int=65
: Minimum cluster length for nucmerminmatch::Int=20
: Minimum match length for nucmerthreads::Int=1
: Number of threads (note: MUMmer has limited multithreading)
Returns
String
: Path to the output directory containing MUMmer results
Output Files
prefix.delta
: Delta alignment file (main output)prefix.coords
: Human-readable coordinates fileprefix.snps
: SNP/indel report (if show-snps is run)prefix.plot.png
: Dot plot visualization (if mummerplot is run)
Examples
# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)
# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
outdir="comparison_results",
prefix="comparison",
mincluster=100,
minmatch=30)
Notes
- Requires MUMmer to be installed via Bioconda
- nucmer is used for DNA sequence alignment
- show-coords generates human-readable coordinate output
- Results include alignment coordinates, percent identity, and coverage
- For visualization, use mummerplot (requires gnuplot)
Annotation
Gene Prediction
Functions for gene prediction and annotation (using external tools):
- Pyrodigal integration
- BLAST+ searches
- MMSeqs2 searches
- TransTerm terminator prediction
- tRNAscan-SE tRNA detection
- MLST typing
Comparative Genomics
Pangenome Analysis
Mycelia.analyze_pangenome_kmers
— Functionanalyze_pangenome_kmers(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, distance_metric=:jaccard)
Perform comprehensive k-mer based pangenome analysis using existing Mycelia infrastructure.
Leverages existing count_canonical_kmers
and distance metric functions to analyze genomic content across multiple genomes, identifying core, accessory, and unique regions.
Arguments
genome_files
: Vector of FASTA file paths containing genome sequenceskmer_type
: K-mer type from Kmers.jl (default:Kmers.DNAKmer{21}
)distance_metric
: Distance metric (:jaccard
,:bray_curtis
,:cosine
,:js_divergence
)
Returns
PangenomeAnalysisResult
with comprehensive pangenome statistics
Example
genome_files = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.analyze_pangenome_kmers(genome_files, kmer_type=Kmers.DNAKmer{31})
println("Core k-mers: $(length(result.core_kmers))")
println("Total pangenome size: $(size(result.presence_absence_matrix, 1)) k-mers")
Mycelia.build_genome_distance_matrix
— Functionbuild_genome_distance_matrix(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)
Build a distance matrix between all genome pairs using existing distance metrics.
Creates a comprehensive pairwise distance matrix using established k-mer distance functions, suitable for phylogenetic analysis and clustering.
Arguments
genome_files
: Vector of genome FASTA file pathskmer_type
: K-mer type from Kmers.jl (default:Kmers.DNAKmer{21}
)metric
: Distance metric (:js_divergence
,:cosine
,:jaccard
)
Returns
- Named tuple with distance matrix and genome names
Example
genomes = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.build_genome_distance_matrix(genomes, kmer_type=Kmers.DNAKmer{31})
println("Distance matrix: $(result.distance_matrix)")
Visualization
Plotting Functions
Mycelia.visualize_genome_coverage
— Functionvisualize_genome_coverage(
coverage_table::DataFrames.AbstractDataFrame;
entity
) -> Makie.Figure
Creates a multi-panel visualization of genome coverage across chromosomes.
Arguments
coverage_table
: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data
Keyword Arguments
entity
: Optional string to specify the entity being analyzed. If provided, chromosome titles become "$(entity): $(chromosome)", otherwise just "$(chromosome)"
Returns
CairoMakie.Figure
: A composite figure with coverage plots for each chromosome
Details
Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.
Mycelia.plot_embeddings
— FunctionPlot embeddings with optional true and fitted cluster labels using Makie.jl, with legend outside, and color by fit labels, shape by true labels.
Arguments
embeddings::Matrix{<:Real}
: 2D embedding matrix where each column is a data pointtitle::String
: Title of the plotxlabel::String
: Label for the x-axisylabel::String
: Label for the y-axistrue_labels::Vector{<:Integer}
: Vector of true cluster labels (optional)fit_labels::Vector{<:Integer}
: Vector of fitted cluster labels (optional)
Returns
Makie.Figure
: Figure object that can be displayed or saved
Mycelia.plot_taxa_abundances
— Functionplot_taxa_abundances(
df::DataFrames.DataFrame,
taxa_level::String;
top_n::Int = 10,
sample_id_col::String = "sample_id",
filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
figure_width::Int = 1500,
figure_height::Int = 1000,
bar_width::Float64 = 0.7,
x_rotation::Int = 45,
sort_samples::Bool = true,
color_seed::Union{Int, Nothing} = nothing,
legend_fontsize::Float64 = 12.0,
legend_itemsize::Float64 = 12.0,
legend_padding::Float64 = 5.0,
legend_rowgap::Float64 = 1.0,
legend_labelwidth::Union{Nothing, Float64} = nothing,
legend_titlesize::Float64 = 15.0,
legend_nbanks::Int = 1
)
Create a stacked bar chart showing taxa relative abundances for each sample.
Arguments
df
: DataFrame with sample_id and taxonomic assignments at different levelstaxa_level
: Taxonomic level to analyze (e.g., "genus", "species")top_n
: Number of top taxa to display individually, remainder grouped as "Other"sample_id_col
: Column name containing sample identifiersfilter_taxa
: Taxa to exclude from visualization (default: nothing - no filtering)figure_width
: Width of the figure in pixelsfigure_height
: Height of the figure in pixelsbar_width
: Width of each bar (between 0 and 1)x_rotation
: Rotation angle for x-axis labels in degreessort_samples
: Whether to sort samples alphabeticallycolor_seed
: Seed for reproducible color generationlegend_fontsize
: Font size for legend entrieslegend_itemsize
: Size of the colored marker/icon in the legendlegend_padding
: Padding around legend elementslegend_rowgap
: Space between legend rowslegend_labelwidth
: Maximum width for legend labels (truncation)legend_titlesize
: Font size for legend titlelegend_nbanks
: Number of legend columns
Returns
fig
: CairoMakie figure objectax
: CairoMakie axis objecttaxa_colors
: Dictionary mapping taxa to their assigned colors
For k-mer frequency plotting, see Sequence Analysis Workflow.
Utility Functions
Memory Management
Mycelia.estimate_memory_usage
— FunctionEstimate memory usage for a graph with given number of k-mers. Provides rough estimate for memory monitoring.
Mycelia.check_memory_limits
— FunctionCheck if memory usage is within acceptable limits.
File Operations
Helper functions for robust file handling and data validation.
Progress Tracking
Functions for user feedback during long-running operations.
Graph Types and Enums
Graph Modes
Mycelia.GraphMode
— TypeGraph mode for handling strand information.
SingleStrand
: Sequences are single-stranded (RNA, amino acids, or directional DNA)DoubleStrand
: Sequences are double-stranded DNA/RNA with canonical representation
Mycelia.StrandOrientation
— TypeStrand orientation for k-mer observations and transitions.
Forward
: k-mer as observed (5' to 3')Reverse
: reverse complement of k-mer (3' to 5')
Used to control how graphs are constructed and oriented.
Data Structures
Assembly Results
The result of assembly operations typically includes:
:final_assembly
: Vector of assembled sequences:k_progression
: K-mer sizes used:metadata
: Assembly statistics and parameters
Quality Metrics
Standard quality metrics returned by analysis functions:
- Coverage statistics
- Quality score distributions
- Joint probability calculations
- Assembly completeness measures
Function Categories
Core Assembly Pipeline
- Input Processing:
open_fastx
,convert_sequence
- Quality Analysis:
analyze_fastq_quality
, qualmer functions - Assembly:
mycelia_assemble
, graph construction - Validation:
assess_assembly_quality
,validate_assembly
- Output: Assembly statistics and final sequences
Supporting Workflows
- Data Acquisition: Download and simulation functions
- Quality Control: Filtering and trimming tools
- Analysis: K-mer analysis and distance calculations
- Visualization: Plotting and reporting functions
Type System
Mycelia uses a strict type system based on BioSequences.jl:
- DNA sequences:
BioSequences.LongDNA{4}
- RNA sequences:
BioSequences.LongRNA{4}
- Protein sequences:
BioSequences.LongAA
- K-mers:
Kmers.DNAKmer{K}
,Kmers.RNAKmer{K}
,Kmers.AAKmer{K}
Quality-Aware Types
- Qualmer: Combines k-mer with quality scores
- QualmerVertexData: Graph vertex with quality metadata
- QualmerEdgeData: Graph edge with transition weights
Performance Characteristics
Scalability
Most functions scale as follows with input size:
- Linear: File I/O, k-mer counting
- Quadratic: Distance matrix calculations
- Variable: Assembly (depends on complexity)
Memory Usage
Typical memory requirements:
- Small genomes (< 10 Mb): 1-4 GB RAM
- Bacterial genomes (1-10 Mb): 4-16 GB RAM
- Large genomes (> 100 Mb): 32-128 GB RAM
Use memory_limit
parameters to control usage.
Parallelization
Functions that support parallel execution:
- K-mer counting (automatic)
- Assembly (multi-threaded)
- Quality analysis (vectorized)
Set JULIA_NUM_THREADS
environment variable for best performance.
Error Handling
Common Exceptions
ArgumentError
: Invalid parametersMethodError
: Wrong typesOutOfMemoryError
: Insufficient RAMSystemError
: File I/O issues
Debugging Tips
- Check inputs: Use
typeof()
andlength()
- Monitor memory: Use
memory_limit
parameters - Enable logging: Set
verbose=true
in functions - Test subset: Use smaller datasets for debugging
Integration Points
External Tools
Mycelia integrates with numerous bioinformatics tools via Bioconda:
- Assemblers: SPAdes, Megahit, Miniasm, Raven
- QC Tools: FastP, Filtlong, Trim Galore
- Validators: QUAST, BUSCO, MUMmer
- Annotators: Pyrodigal, BLAST+, MMSeqs2
File Formats
Supported formats:
- Input: FASTA, FASTQ (including gzipped)
- Output: FASTA, CSV, JSON, JLD2
- Intermediate: Graph formats, quality scores
Workflow Integration
Functions are designed to work together in pipelines:
# Typical workflow
data = open_fastx("reads.fastq")
quality_report = analyze_fastq_quality(data)
assembly = mycelia_assemble("reads.fastq")
validation = assess_assembly_quality(assembly)
Version Compatibility
This documentation corresponds to Mycelia version 0.1.0+.
- Breaking changes will be noted in release notes
- Deprecated functions include migration guidance
- Experimental features are clearly marked
For the latest updates, see the CHANGELOG.