API Reference

This page provides comprehensive documentation for all public functions in Mycelia, automatically generated from docstrings.

API Reference

Assembly Functions

Main Assembly Interface

Mycelia.assemble_genome — Function

assemble_genome(reads; k=31, kwargs...) -> AssemblyResult
assemble_genome(reads, config::AssemblyConfig) -> AssemblyResult

Unified genome assembly interface with auto-detection and type-stable dispatch.

Auto-Detection Convenience Method

# Auto-detect sequence type and format, use k-mer approach
result = assemble_genome(fasta_records; k=25)

# Auto-detect sequence type and format, use overlap approach
result = assemble_genome(fasta_records; min_overlap=100)

# Override auto-detected parameters
result = assemble_genome(reads; k=31, graph_mode=SingleStrand, error_rate=0.005)

Type-Stable Direct Method

# Explicit configuration for maximum performance
config = AssemblyConfig(k=25, sequence_type=BioSequences.LongDNA{4},
                       graph_mode=DoubleStrand, use_quality_scores=true)
result = assemble_genome(reads, config)

Arguments

reads: Vector of FASTA/FASTQ records or file paths
config: Assembly configuration (for type-stable version)

Keyword Arguments (auto-detection version)

k: k-mer size (mutually exclusive with min_overlap)
min_overlap: Minimum overlap length (mutually exclusive with k)
graph_mode: SingleStrand or DoubleStrand (auto-detected if not specified)
error_rate, min_coverage, etc.: Assembly parameters

Returns

AssemblyResult: Structure containing contigs, names, and assembly metadata

Details

This interface automatically:

Detects sequence type: DNA, RNA, AA, or String from input
Chooses assembly method: k-mer vs overlap-based on parameters
Validates compatibility: AA/String sequences → SingleStrand only
Dispatches optimally: Based on input type and quality scores

Method Selection Logic:

String input + k → N-gram graph
String input + min_overlap → String graph
BioSequence input + k + quality → Qualmer graph
BioSequence input + k → K-mer graph
BioSequence input + min_overlap + quality → Quality BioSequence graph
BioSequence input + min_overlap → BioSequence graph

source

Type-stable main assembly function that dispatches based on configuration.

source

Third-Party Assembly Tool Wrappers

See Assembly Suite for comprehensive documentation of production-ready assembly wrappers including:

Short-read assemblers: MEGAHIT, SPAdes, metaSPAdes, SKESA, Velvet
Long-read assemblers: Flye, metaFlye, Canu, hifiasm, MetaMDBG
Hybrid assemblers: Unicycler
Polishing tools: Apollo, HyPo, Strainy

Experimental Assembly Algorithms

Mycelia includes experimental assembly algorithms in src/development/ that are not currently part of the public API:

Intelligent Assembly: mycelia_assemble - Self-optimizing k-mer progression
Assembly Utilities: dynamic_k_prime_pattern, error_optimized_k_sequence, calculate_assembly_reward

These are research implementations subject to change. For production use, see the third-party assembler wrappers above.

Qualmer Analysis

Core Qualmer Types

Mycelia.Qualmer — Type

struct Qualmer{KmerT, K}

A quality-aware k-mer that pairs sequence k-mers with their corresponding quality scores.

Fields

kmer::Any
qualities::Tuple{Vararg{UInt8, K}} where K

Examples

using Mycelia
using Kmers

# Create a DNA k-mer with quality scores
kmer = DNAKmer("ATCG")
qualities = [30, 35, 32, 28]  # Phred quality scores
qmer = Qualmer(kmer, qualities)

# Access individual bases and qualities
base, quality = qmer[1]  # Returns ('A', 30)

This structure is essential for quality-aware sequence analysis and assembly algorithms that consider both sequence content and sequencing quality information.

source

Mycelia.QualmerVertexData — Type

struct QualmerVertexData{QualmerT<:Mycelia.Qualmer}

Vertex data for quality-aware k-mer graphs that stores k-mer observations with quality scores.

Fields

canonical_qualmer::Mycelia.Qualmer
observations::Array{Mycelia.QualmerObservation{QualmerT}, 1} where QualmerT<:Mycelia.Qualmer
joint_probability::Float64
coverage::Int64
mean_quality::Float64

This structure aggregates multiple observations of the same k-mer and computes quality-based statistics for use in quality-aware assembly algorithms.

source

Mycelia.QualmerEdgeData — Type

struct QualmerEdgeData

Edge data for quality-aware k-mer graphs that connects k-mer vertices with strand information.

Fields

observations::Vector{Tuple{Int64, Int64}}
src_strand::Mycelia.StrandOrientation
dst_strand::Mycelia.StrandOrientation
weight::Float64
quality_weight::Float64

This structure represents connections between k-mers in quality-aware graphs, including strand orientation and quality-weighted edge weights for assembly algorithms.

source

Qualmer Generation

Mycelia.qualmers_fw — Function

qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#293#294"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

source

qualmers_fw(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate forward qualmers from a FASTQ record. Extracts all k-mers in the forward direction with their associated quality scores from the FASTQ record.

Arguments

record: FASTQ record containing sequence and quality data
k: K-mer size

Returns

Iterator yielding (Qualmer, position) tuples for each k-mer

source

Mycelia.qualmers_canonical — Function

qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#299#300"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate canonical DNA qualmers from sequence and quality scores. For each k-mer, returns the lexicographically smaller of the k-mer and its reverse complement, ensuring consistent representation regardless of strand.

Arguments

sequence: DNA sequence as LongDNA
quality: Vector of quality scores corresponding to each base
::Val{K}: K-mer size as a type parameter

Returns

Iterator yielding (Qualmer, position) tuples with canonical k-mer representation

source

qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#293#294"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate canonical amino acid qualmers from protein sequence and quality scores. For amino acid sequences, canonical representation is the same as forward representation since proteins don't have reverse complements.

Arguments

sequence: Protein sequence as LongAA
quality: Vector of quality scores corresponding to each amino acid
::Val{K}: K-mer size as a type parameter

Returns

Iterator yielding (Qualmer, position) tuples for each k-mer

source

qualmers_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate canonical qualmers from a FASTQ record. Extracts k-mers in canonical representation (lexicographically smaller of forward and reverse complement) with their associated quality scores.

Arguments

record: FASTQ record containing sequence and quality data
k: K-mer size

Returns

Iterator yielding (Qualmer, position) tuples with canonical k-mer representation

source

Mycelia.qualmers_unambiguous — Function

qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#295#296"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous DNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, T, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

sequence: DNA sequence as LongDNA
quality: Vector of quality scores corresponding to each base
::Val{K}: K-mer size as a type parameter

Returns

Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer

source

qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#297#298"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Generate unambiguous RNA qualmers from sequence and quality scores. Only includes k-mers that contain unambiguous nucleotides (A, U, G, C). Skips k-mers containing ambiguous bases like N.

Arguments

sequence: RNA sequence as LongRNA
quality: Vector of quality scores corresponding to each base
::Val{K}: K-mer size as a type parameter

Returns

Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer

source

qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#293#294"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Generate amino acid qualmers from protein sequence and quality scores. For amino acid sequences, all k-mers are considered unambiguous, so this falls back to the forward qualmer generation.

Arguments

sequence: Protein sequence as LongAA
quality: Vector of quality scores corresponding to each amino acid
::Val{K}: K-mer size as a type parameter

Returns

Iterator yielding (Qualmer, position) tuples for each k-mer

source

qualmers_unambiguous(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Generate unambiguous qualmers from a FASTQ record. Extracts only k-mers that contain unambiguous nucleotides (no N bases) with their associated quality scores.

Arguments

record: FASTQ record containing sequence and quality data
k: K-mer size

Returns

Iterator yielding (Qualmer, position) tuples for each unambiguous k-mer

source

Mycelia.qualmers_unambiguous_canonical — Function

qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#301#302"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

source

Quality Analysis

Mycelia.phred_to_probability — Function

phred_to_probability(phred_score::UInt8) -> Float64

Convert PHRED quality score to probability of correctness. PHRED score Q relates to error probability P by: Q = -10 * log10(P) Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)

source

Mycelia.qualmer_correctness_probability — Function

qualmer_correctness_probability(
    qmer::Mycelia.Qualmer
) -> Float64

Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))

source

Mycelia.joint_qualmer_probability — Function

joint_qualmer_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.

Arguments

qualmers: Vector of Qualmer observations of the same k-mer sequence
use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

Float64: Joint probability that all observations are correct

source

Mycelia.position_wise_joint_probability — Function

position_wise_joint_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.

Arguments

qualmers: Vector of Qualmer observations of the same k-mer sequence
use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

Float64: Position-wise joint probability

source

Graph Construction

Mycelia.build_qualmer_graph — Function

build_qualmer_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    k,
    graph_mode,
    min_quality,
    min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.

Arguments

fastq_records: Vector of FASTQ records with quality scores
k: K-mer size
graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
min_quality: Minimum average PHRED quality to include k-mer (default: 10)
min_coverage: Minimum coverage (observations) to include k-mer (default: 1)

Returns

MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData

source

Mycelia.get_qualmer_statistics — Function

get_qualmer_statistics(
    graph::MetaGraphsNext.MetaGraph
) -> Dict{String, Any}

Get comprehensive statistics about a qualmer graph.

source

String Graphs

N-gram Analysis

Mycelia.ngrams — Function

ngrams(s::AbstractString, n::Int64) -> Vector{String}

Extract all n-grams from a string. Returns a vector containing all contiguous substrings of length n from the input string.

Arguments

s: Input string to extract n-grams from
n: Length of each n-gram (must be positive)

Returns

Vector{String}: All n-grams of length n from the string

Examples

ngrams("hello", 3)  # Returns ["hel", "ell", "llo"]
ngrams("abcd", 2)   # Returns ["ab", "bc", "cd"]

source

Mycelia.string_to_ngram_graph — Function

string_to_ngram_graph(; s, n)

Build a directed graph from string n-grams with edge and vertex weights. Creates a MetaGraph where vertices represent unique n-grams and edges represent transitions between consecutive n-grams in the original string.

Arguments

s: Input string to build graph from
n: N-gram size (length of each substring)

Returns

MetaGraphsNext.MetaGraph: Directed graph with:
- Vertex labels: String n-grams
- Vertex data: Integer counts of n-gram occurrences
- Edge data: Integer counts of n-gram transitions

Details

Backward compatibility function - now delegates to the multi-string version. The graph construction process:

Extract all n-grams from the string
Count occurrences of each unique n-gram (stored as vertex data)
Count transitions between consecutive n-grams (stored as edge data)
Build a directed graph representing the n-gram sequence structure

Examples

graph = string_to_ngram_graph(s="abcabc", n=2)
# Creates graph with vertices "ab", "bc", "ca" and weighted edges

source

Mycelia.plot_ngram_graph — Function

plot_ngram_graph(g) -> Any

Create a visual plot of an n-gram graph using GraphPlot.jl. Displays the graph structure with vertex labels showing the n-grams.

Arguments

g: MetaGraph containing n-gram data (typically from string_to_ngram_graph)

Returns

Plot object showing the graph layout with labeled vertices

Details

Uses a spring layout algorithm to position vertices and displays n-gram strings as vertex labels for easy interpretation of the graph structure.

Examples

graph = string_to_ngram_graph(s="hello", n=2)
plot_ngram_graph(graph)  # Shows visual representation

source

Graph Simplification

Mycelia.collapse_unbranching_paths — Function

Collapse linear paths (vertices with one incoming and one outgoing edge) into a simpler graph where sequences are concatenated.

source

Mycelia.find_connected_components — Function

Find all connected components in the graph. Returns a vector of vectors of vertex indices.

For directed graphs, uses weakly connected components. For undirected graphs, uses standard connected components.

source

Note: Additional graph simplification function assemble_strings is available but currently lacks documentation.

Sequence I/O

File Handling

Mycelia.open_fastx — Function

open_fastx(path::AbstractString) -> Any

Open and return a reader for FASTA or FASTQ format files.

Arguments

path::AbstractString: Path to input file. Can be:
- Local file path
- HTTP/FTP URL
- Gzip compressed (.gz extension)

Supported formats

FASTA (.fasta, .fna, .faa, .fa)
FASTQ (.fastq, .fq)

Returns

FASTX.FASTA.Reader for FASTA files
FASTX.FASTQ.Reader for FASTQ files

source

Mycelia.write_fastq — Function

write_fastq(;records, filename, gzip=false)

write_fastq(; records, filename, gzip)

Write FASTQ records to file using FASTX.jl. Validates extension: .fastq, .fq, .fastq.gz, or .fq.gz. If gzip is true or filename endswith .gz, output is gzipped. records must be an iterable of FASTX.FASTQ.Record.

source

Mycelia.write_fasta — Function

write_fasta(
;
    outfile,
    records,
    gzip,
    show_progress,
    use_pigz,
    pigz_threads
)

Writes FASTA records to a file, optionally gzipped.

Arguments

outfile::AbstractString: Path to the output FASTA file. Will append ".gz" if gzip is true and ".gz" isn't already the extension.
records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.
gzip::Bool: Optionally force compression of the output with gzip. By default will use the file name to infer.
show_progress::Union{Bool,Nothing}: Whether to show a progress meter. If nothing (default), automatically shows progress for ≥1000 records.
use_pigz::Union{Bool,Nothing}: If true, uses external pigz for compression (faster for large files). If nothing (default), automatically uses pigz for files estimated to be ≥50 MB if pigz is available. Requires pigz to be installed and in PATH.
pigz_threads::Int: Number of threads for pigz to use. Defaults to all available cores.

Returns

outfile::String: The path to the output FASTA file (including ".gz" if applicable).

source

Format Conversion

Mycelia.convert_sequence — Function

convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

DNA sequences are converted using BioSequences.LongDNA
RNA sequences are converted using BioSequences.LongRNA
AA sequences are converted using BioSequences.LongAA

source

K-mer Analysis

K-mer Counting

For k-mer counting functions, see Sequence Analysis Workflow.

Distance Metrics

Mycelia.jaccard_distance — Function

Compute the Jaccard distance between columns of a binary matrix.

Arguments

M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples

Returns

Matrix{Float64}: Symmetric distance matrix with Jaccard distances

source

jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$

Arguments

set1: First set to compare
set2: Second set to compare

Returns

Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets

source

Mycelia.kmer_counts_to_js_divergence — Function

kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

Normalized Jensen-Shannon divergence score between 0 and 1, where:
- 0 indicates identical distributions
- 1 indicates maximally different distributions

Notes

The measure is symmetric: JS(P||Q) = JS(Q||P)
Counts are automatically normalized to probability distributions

source

Genome Size Estimation

Mycelia.estimate_genome_size_from_kmers — Function

estimate_genome_size_from_kmers(
    sequence::Union{AbstractString, BioSequences.LongSequence},
    k::Integer
) -> Dict

Estimate genome size from k-mer analysis using total k-mer count.

This function estimates genome size using the basic relationship: genomesize ≈ totalkmers - k + 1, where total_kmers is the sum of all k-mer counts. This is a simple estimation method; more sophisticated approaches accounting for sequencing depth, repeats, and errors may be more accurate.

Arguments

sequence::Union{BioSequences.LongSequence, AbstractString}: Input sequence or string
k::Integer: K-mer size for analysis

Returns

Dict{String, Any}: Dictionary containing:
- "unique_kmers": Number of unique k-mers observed
- "total_kmers": Total k-mer count (sum of all frequencies)
- "estimatedgenomesize": Estimated genome size
- "actual_size": Length of input sequence (if provided)

Examples

# Estimate genome size from a sequence
sequence = "ATCGATCGATCGATCG"
result = estimate_genome_size_from_kmers(sequence, 5)

source

estimate_genome_size_from_kmers(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1},
    k::Integer
) -> Dict

Estimate genome size from FASTQ/FASTA records using k-mer analysis.

Overload for processing FASTQ or FASTA records directly.

Arguments

records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records
k::Integer: K-mer size for analysis

Returns

Dict{String, Any}: Dictionary with k-mer statistics and genome size estimate

source

K-mer Result Storage

Mycelia.save_kmer_results — Function

save_kmer_results(
;
    filename,
    kmers,
    counts,
    fasta_list,
    k,
    alphabet
)

Save the kmer counting results (kmers vector, counts sparse matrix) and the input FASTA file list to a JLD2 file for long-term storage and reproducibility.

Arguments

filename::AbstractString: Path to the output JLD2 file.
kmers::AbstractVector{<:Kmers.Kmer}: The sorted vector of unique kmer objects.
counts::AbstractMatrix: The (sparse or dense) matrix of kmer counts.
fasta_list::AbstractVector{<:AbstractString}: The list of FASTA file paths used as input.
k::Integer: The kmer size used.
alphabet::Symbol: The alphabet used (:AA, :DNA, :RNA).

source

Mycelia.load_kmer_results — Function

load_kmer_results(
    filename::AbstractString
) -> Union{Nothing, NamedTuple{(:kmers, :counts, :fasta_list, :metadata), <:Tuple{Any, Any, Any, Dict{String, Any}}}}

Load kmer counting results previously saved with save_kmer_results.

Arguments

filename::AbstractString: Path to the input JLD2 file.

Returns

NamedTuple: Contains the loaded kmers, counts, fasta_list, and metadata. Returns nothing if the file cannot be loaded or essential keys are missing.

source

Simulation and Testing

Sequence Generation

Mycelia.random_fasta_record — Function

random_fasta_record(
;
    moltype,
    seed,
    L
) -> FASTX.FASTA.Record

Generates a random FASTA record with a specified molecular type and sequence length.

Arguments

moltype::Symbol=:DNA: The type of molecule to generate (:DNA, :RNA, or :AA for amino acids).
seed: The random seed used for sequence generation (default: a random integer).
L: The length of the sequence (default: a random integer up to typemax(UInt16)).

Returns

A FASTX.FASTA.Record containing:
- A randomly generated UUID identifier.
- A randomly generated sequence of the specified type.

Errors

Throws an error if moltype is not one of :DNA, :RNA, or :AA.

source

Read Simulation

See Data Acquisition Workflow for read simulation functions.

Assembly Testing (Experimental)

Experimental assembly testing function test_intelligent_assembly is available in src/development/ but not currently part of the public API.

Quality Control

FASTQ Analysis

Mycelia.calculate_gc_content — Function

calculate_gc_content(
    sequence::BioSequences.LongSequence
) -> Float64

Calculate GC content (percentage of G and C bases) from a BioSequence.

This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.

Arguments

sequence::BioSequences.LongSequence: Input DNA or RNA sequence

Returns

Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)

# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG") 
gc_percent = calculate_gc_content(rna_seq)

source

calculate_gc_content(sequence::AbstractString) -> Float64

Calculate GC content from a string sequence.

Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.

Arguments

sequence::AbstractString: Input DNA or RNA sequence as string

Returns

Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")

source

calculate_gc_content(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64

Calculate GC content from FASTA/FASTQ records.

Processes multiple records and calculates overall GC content across all sequences.

Arguments

records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}`: Input records

Returns

Float64: Overall GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)

source

Mycelia.assess_duplication_rates — Function

assess_duplication_rates(fastq; results_table) -> Any

Analyze sequence duplication rates in a FASTQ file.

This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.

Arguments

fastq::String: Path to the input FASTQ file to analyze
results_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"

Returns

String: Path to the results table file

Output

Generates a tab-separated file containing the following metrics:

total_records: Total number of sequence records in the file
total_unique_observations: Count of unique sequence strings
total_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)
percent_unique_observations: Percentage of sequences that are unique
percent_unique_canonical_observations: Percentage of sequences that are unique after canonicalization
percent_duplication_rate: Percentage of sequences that are duplicates (100 - percentuniqueobservations)
percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization

Notes

If the specified results file already exists and is not empty, the function will return early without recomputing.
Progress is displayed during processing with a progress bar showing speed.

Example

# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")

# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")

source

For FASTQ quality analysis functions, see Quality Control Workflow.

External Tool Integration

For quality control and filtering functions, see Quality Control Workflow.

Assembly Validation

Quality Assessment

Mycelia.assess_assembly_quality — Function

assess_assembly_quality(contigs_file)

Assess basic assembly quality metrics from a FASTA file.

Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.

Arguments

contigs_file: Path to FASTA file containing assembly contigs

Returns

Tuple of (ncontigs, totallength, n50, l50)
- n_contigs: Number of contigs in the assembly
- total_length: Total length of all contigs in base pairs
- n50: N50 statistic (length of shortest contig in the set covering 50% of assembly)
- l50: L50 statistic (number of contigs needed to reach 50% of assembly length)

Example

n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")

See Also

assess_assembly_kmer_quality: For k-mer based assembly quality assessment

source

assess_assembly_quality(; assembly, observations, ks)

Assess assembly quality using k-mer based metrics including QV scores.

This function provides a comprehensive quality assessment of genome assemblies by comparing k-mer distributions between the assembly and source sequencing data. It calculates multiple quality metrics across different k-mer sizes, with the QV (Quality Value) score being the primary measure of assembly accuracy.

Arguments

assembly: Path to assembly file (FASTA format) or assembly sequences
observations: Path to sequencing data file(s) (FASTQ format) or sequence observations
ks::Vector{Int}: K-mer sizes to analyze (default: 11,13,17,19,23,31,53 for comprehensive evaluation)

Returns

DataFrame containing quality metrics for each k-mer size:

k::Int: K-mer length used
cosine_distance::Float64: Cosine similarity between k-mer distributions
js_divergence::Float64: Jensen-Shannon divergence between distributions
qv::Float64: Merqury-style Quality Value score (primary assembly accuracy metric)

QV Score Interpretation

QV ≥ 40: High quality assembly (>99.99% base accuracy)
QV 30-40: Good quality assembly (99.9-99.99% base accuracy)
QV 20-30: Moderate quality assembly (99-99.9% base accuracy)
QV < 20: Lower quality assembly (<99% base accuracy)

Implementation Details

The QV score follows the Merqury methodology:

Count k-mers in both assembly and sequencing data
Calculate shared k-mers between datasets
Estimate base-level accuracy: P = (Kshared/Ktotal)^(1/k)
Convert to Phred scale: QV = -10log₁₀(1-P)

Performance Notes

Uses multiple k-mer sizes for robust quality assessment
Larger k-mer sizes provide more discriminative power
Parallel k-mer counting for computational efficiency
Progress tracking for long-running analyses

References

Rhie, A. et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies." Genome Biology 21, 245 (2020).

source

Mycelia.validate_assembly — Function

validate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}

Validate assembly quality using various metrics and optional reference comparison.

Arguments

assembly: Assembly result to validate
reference: Optional reference sequence for comparison

Returns

Dict{String, Any}: Comprehensive validation metrics

Details

Computes assembly quality metrics including:

N50, N90 statistics
Total assembly length and number of contigs
Coverage uniformity (if reference provided)
Structural variant detection (if reference provided)
Gap analysis and repeat characterization

source

External Validators

Mycelia.run_quast — Function

run_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)

Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.

Arguments

assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
outdir::String="quast_results": Output directory for QUAST results
reference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metrics
threads::Int=Sys.CPU_THREADS: Number of threads to use
min_contig::Int=500: Minimum contig length to consider
gene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)

Returns

String: Path to the output directory containing QUAST results

Output Files

report.html: Interactive HTML report
report.txt: Text summary report
report.tsv: Tab-separated values report for programmatic access
transposed_report.tsv: Transposed TSV format
icarus.html: Icarus contig browser (if reference provided)

Examples

# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)

# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)

# Custom parameters
quast_dir = Mycelia.run_quast(assemblies, 
                             outdir="my_quast_results",
                             min_contig=1000,
                             threads=8)

Notes

Requires QUAST to be installed via Bioconda
Without reference: provides basic metrics (N50, total length, # contigs, etc.)
With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
Gene finding requires additional dependencies and is disabled by default

source

run_quast(assembly_file::String; kwargs...)

Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.

source

Mycelia.run_busco — Function

run_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)

Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.

Arguments

assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
outdir::String="busco_results": Output directory for BUSCO results
lineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteriaodb10", "eukaryotaodb10", "auto")
mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")
threads::Int=Sys.CPU_THREADS: Number of threads to use
force::Bool=false: Force overwrite existing results

Returns

String: Path to the output directory containing BUSCO results

Output Files

short_summary.specific.lineage.txt: Summary statistics
full_table.tsv: Complete BUSCO results table
missing_busco_list.tsv: List of missing BUSCOs
run_lineage/: Detailed results directory

Examples

# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)

# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")

# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
                             outdir="my_busco_results",
                             lineage="enterobacterales_odb10",
                             threads=8)

Notes

Requires BUSCO to be installed via Bioconda
Auto lineage detection requires internet connection for first run
Available lineages: bacteriaodb10, archaeaodb10, eukaryotaodb10, fungiodb10, etc.
Results provide Complete, Fragmented, and Missing BUSCO counts

source

run_busco(assembly_file::String; kwargs...)

Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.

source

Mycelia.run_mummer — Function

run_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)

Run MUMmer for genome comparison and alignment between reference and query sequences.

Arguments

reference::String: Path to reference genome FASTA file
query::String: Path to query genome FASTA file
outdir::String="mummer_results": Output directory for MUMmer results
prefix::String="out": Prefix for output files
mincluster::Int=65: Minimum cluster length for nucmer
minmatch::Int=20: Minimum match length for nucmer
threads::Int=1: Number of threads (note: MUMmer has limited multithreading)

Returns

String: Path to the output directory containing MUMmer results

Output Files

prefix.delta: Delta alignment file (main output)
prefix.coords: Human-readable coordinates file
prefix.snps: SNP/indel report (if show-snps is run)
prefix.plot.png: Dot plot visualization (if mummerplot is run)

Examples

# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)

# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
                               outdir="comparison_results",
                               prefix="comparison",
                               mincluster=100,
                               minmatch=30)

Notes

Requires MUMmer to be installed via Bioconda
nucmer is used for DNA sequence alignment
show-coords generates human-readable coordinate output
Results include alignment coordinates, percent identity, and coverage
For visualization, use mummerplot (requires gnuplot)

source

Annotation

Gene Prediction

Functions for gene prediction and annotation (using external tools):

Pyrodigal integration
BLAST+ searches
MMSeqs2 searches
TransTerm terminator prediction
tRNAscan-SE tRNA detection
MLST typing

Comparative Genomics

Visualization

Plotting Functions

Mycelia.visualize_genome_coverage — Function

visualize_genome_coverage(
    coverage_table::DataFrames.AbstractDataFrame;
    entity
) -> Makie.Figure

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Keyword Arguments

entity: Optional string to specify the entity being analyzed. If provided, chromosome titles become "$(entity): $(chromosome)", otherwise just "$(chromosome)"

Returns

CairoMakie.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

source

Mycelia.plot_embeddings — Function

Plot embeddings with optional true and fitted cluster labels using Makie.jl, with legend outside, and color by fit labels, shape by true labels.

Arguments

embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data point
title::String: Title of the plot
xlabel::String: Label for the x-axis
ylabel::String: Label for the y-axis
true_labels::Vector{<:Integer}: Vector of true cluster labels (optional)
fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)

Returns

Makie.Figure: Figure object that can be displayed or saved

source

Mycelia.plot_taxa_abundances — Function

plot_taxa_abundances(
    df::DataFrames.DataFrame, 
    taxa_level::String; 
    top_n::Int = 10,
    sample_id_col::String = "sample_id",
    filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
    figure_width::Int = 1500,
    figure_height::Int = 1000,
    bar_width::Float64 = 0.7,
    x_rotation::Int = 45,
    sort_samples::Bool = true,
    color_seed::Union{Int, Nothing} = nothing,
    legend_fontsize::Float64 = 12.0,
    legend_itemsize::Float64 = 12.0,
    legend_padding::Float64 = 5.0,
    legend_rowgap::Float64 = 1.0,
    legend_labelwidth::Union{Nothing, Float64} = nothing,
    legend_titlesize::Float64 = 15.0,
    legend_nbanks::Int = 1
)

Create a stacked bar chart showing taxa relative abundances for each sample.

Arguments

df: DataFrame with sample_id and taxonomic assignments at different levels
taxa_level: Taxonomic level to analyze (e.g., "genus", "species")
top_n: Number of top taxa to display individually, remainder grouped as "Other"
sample_id_col: Column name containing sample identifiers
filter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)
figure_width: Width of the figure in pixels
figure_height: Height of the figure in pixels
bar_width: Width of each bar (between 0 and 1)
x_rotation: Rotation angle for x-axis labels in degrees
sort_samples: Whether to sort samples alphabetically
color_seed: Seed for reproducible color generation
legend_fontsize: Font size for legend entries
legend_itemsize: Size of the colored marker/icon in the legend
legend_padding: Padding around legend elements
legend_rowgap: Space between legend rows
legend_labelwidth: Maximum width for legend labels (truncation)
legend_titlesize: Font size for legend title
legend_nbanks: Number of legend columns

Returns

fig: CairoMakie figure object
ax: CairoMakie axis object
taxa_colors: Dictionary mapping taxa to their assigned colors

source

For k-mer frequency plotting, see Sequence Analysis Workflow.

SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)
DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation

source

Mycelia.StrandOrientation — Type

Strand orientation for k-mer observations and transitions.