Mycelia Documentation

An experimental Julia package for bioinformatics and computational biology

Mycelia is a research-oriented package exploring novel approaches to genomic analysis, with a focus on graph-based genome assembly and quality-aware sequence processing. Currently in early development, it provides both experimental algorithms and integrations with established bioinformatics tools.

Quick Start

New to Mycelia? Start with our Getting Started Guide to install the package and complete your first genomic analysis in minutes.

Key Features & Research Areas

Currently Available

  • 🧬 Sequence Processing: Basic FASTA/FASTQ I/O and read simulation
  • 📊 K-mer Analysis: Canonical k-mer counting and distance metrics
  • 🔧 Tool Integration: Wrappers for established assemblers (MEGAHIT, SPAdes, hifiasm)
  • ⚡ HPC Support: SLURM job submission and rclone integration

In Active Development

  • 🧪 Novel Assembly Algorithms: Graph-based approaches with quality awareness
  • 🌐 Pangenome Analysis: K-mer based comparative genomics
  • 📈 Quality Control: Integration with QC tools (fastp, filtlong, trim_galore)

Planned Features

  • 🔍 Annotation: Gene prediction and functional annotation
  • 🌳 Phylogenetics: Tree construction from pangenome data
  • 📊 Visualization: Interactive plots for genomic data

Documentation Contents

Installation

Quick Install

import Pkg
Pkg.add(url="https://github.com/cjprybol/Mycelia.git")

Development Install

import Pkg
Pkg.develop(url="git@github.com:cjprybol/Mycelia.git")

For detailed installation instructions including HPC setup, see the Getting Started Guide.

Function Docstrings

Mycelia.AssemblyAction - Type
AssemblyAction

Action representation for reinforcement learning decisions during assembly.

Fields

  • decision::Symbol: Primary decision (:continue_k, :next_k, :terminate)
  • viterbi_params::Dict{Symbol, Float64}: Viterbi algorithm parameters
  • correction_threshold::Float64: Quality threshold for error correction
  • batch_size::Int: Batch size for processing (memory management)
  • max_iterations::Int: Maximum iterations at current k before forced progression

Mycelia.AssemblyEnvironment - Type
AssemblyEnvironment

Reinforcement learning environment for training assembly decision policies.

Fields

  • current_state::AssemblyState: Current environment state
  • training_datasets::Vector{String}: Paths to training FASTQ files
  • validation_datasets::Vector{String}: Paths to validation FASTQ files
  • episode_length::Int: Maximum steps per training episode
  • step_count::Int: Current step in episode
  • reward_history::Vector{Float64}: Reward history for current episode
  • action_history::Vector{AssemblyAction}: Action history for experience replay
  • assembly_cache::Dict{String, Any}: Cache for intermediate assembly results

Mycelia.AssemblyState - Type
AssemblyState

State representation for reinforcement learning environment containing all information needed to make assembly decisions.

Fields

  • current_k::Int: Current k-mer size being processed
  • assembly_quality::Float64: Current assembly quality score (QV-based)
  • correction_rate::Float64: Rate of successful error corrections in recent iterations
  • memory_usage::Float64: Current memory utilization (fraction of limit)
  • graph_connectivity::Float64: Graph connectivity metric (proportion of strongly connected components)
  • coverage_uniformity::Float64: Uniformity of k-mer coverage distribution
  • error_signal_clarity::Float64: Clarity of error signal detection (sparsity-based)
  • iteration_history::Vector{Float64}: Recent reward history for trend analysis
  • k_progression::Vector{Int}: Sequence of k-mer sizes processed so far
  • corrections_made::Int: Total corrections made at current k
  • time_elapsed::Float64: Time spent on current k-mer size (seconds)

Mycelia.DQNPolicy - Type
DQNPolicy

Deep Q-Network policy for high-level assembly decisions.

This is a placeholder structure for the neural network architecture that will be implemented with a machine learning framework like Flux.jl or MLJ.jl.

Fields

  • state_dim::Int: Dimension of state representation
  • action_dim::Int: Number of possible actions
  • hidden_dims::Vector{Int}: Hidden layer dimensions
  • learning_rate::Float64: Learning rate for training
  • epsilon::Float64: Exploration rate for epsilon-greedy policy
  • experience_buffer::Vector{Any}: Experience replay buffer
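
The role of the epsilon field can be illustrated with a minimal sketch of epsilon-greedy action selection. The function name epsilon_greedy_sketch and the Q-values are hypothetical; because DQNPolicy is described above as a placeholder, this is not the package's implementation.

```julia
import Random

# Hypothetical helper: choose an action index from a vector of Q-values.
# With probability `epsilon` a random action is explored; otherwise the
# action with the highest estimated value is exploited.
function epsilon_greedy_sketch(q_values::Vector{Float64}, epsilon::Float64;
                               rng::Random.AbstractRNG = Random.default_rng())
    if rand(rng) < epsilon
        return rand(rng, 1:length(q_values))  # explore
    end
    return argmax(q_values)                   # exploit
end

q_values = [0.1, 0.7, 0.2]  # illustrative values for three actions
greedy_choice = epsilon_greedy_sketch(q_values, 0.0)  # epsilon = 0 always exploits
```

With epsilon set to 0.0 the choice is always the highest-valued action; larger values trade some of that exploitation for exploration.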

Mycelia.GraphMode - Type

Graph mode for handling strand information.

  • SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)
  • DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation

Mycelia.KmerEdgeData - Type

Type-stable metadata for k-mer graph edges.

Edges represent valid strand-aware transitions between canonical k-mers. The transition is valid only if the strand orientations allow for proper overlap.

Fields:

  • coverage: Vector of edge traversal observations with strand information
  • weight: Edge weight/confidence score based on coverage
  • src_strand: Required strand orientation of source k-mer for this transition
  • dst_strand: Required strand orientation of destination k-mer for this transition

Mycelia.KmerVertexData - Type

Type-stable metadata for k-mer graph vertices.

Vertices always represent canonical k-mers for memory efficiency and cleaner graphs. Strand information is tracked in the coverage data and edge transitions.

Fields:

  • coverage: Vector of observation coverage data as (observation_id, position, strand_orientation) tuples
  • canonical_kmer: The canonical k-mer (BioSequence type - NO string conversion)

Mycelia.RewardComponents - Type
RewardComponents

Structured representation of reward signal components for training the RL agent.

Fields

  • accuracy_reward::Float64: Primary reward based on assembly accuracy (weighted 1000x)
  • efficiency_reward::Float64: Secondary reward for computational efficiency (weighted 10x)
  • error_penalty::Float64: Penalty for false positives/negatives (weighted -500x)
  • progress_bonus::Float64: Bonus for making meaningful progress
  • termination_reward::Float64: Reward for appropriate termination timing
  • total_reward::Float64: Weighted sum of all components
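
As a rough sketch of how the weights listed above might combine, the following assumes the total is a plain sum of the weighted components, with unit weights on progress_bonus and termination_reward. The function name, keyword names, and those unit weights are assumptions, not the package's code.

```julia
# Hypothetical weights taken from the field descriptions above; the unit
# weights on the last two components are an assumption.
function total_reward_sketch(; accuracy, efficiency, error_rate,
                             progress_bonus = 0.0, termination_reward = 0.0)
    accuracy_reward = 1000.0 * accuracy      # primary signal
    efficiency_reward = 10.0 * efficiency    # secondary signal
    error_penalty = -500.0 * error_rate      # penalty enters with a negative weight
    return accuracy_reward + efficiency_reward + error_penalty +
           progress_bonus + termination_reward
end

total = total_reward_sketch(accuracy = 0.99, efficiency = 0.5, error_rate = 0.01)
```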

Mycelia.StrandOrientation - Type

Strand orientation for k-mer observations and transitions.

  • Forward: k-mer as observed (5' to 3')
  • Reverse: reverse complement of k-mer (3' to 5')

Mycelia.JLD2_read_table - Method
JLD2_read_table(filename::String) -> Any

Read a DataFrame from a JLD2 file without needing to know the internal name. If the file contains multiple DataFrames, returns the first one found.


Mycelia._add_observation_to_graph! - Method
_add_observation_to_graph!(
    graph,
    observation,
    obs_idx,
    canonical_kmers,
    graph_mode
)

Add a sequence observation to an existing k-mer graph with strand-aware edge creation.

Arguments

  • graph: MetaGraphsNext k-mer graph with canonical vertices
  • observation: FASTA/FASTQ record
  • obs_idx: Observation index
  • canonical_kmers: Vector of canonical k-mers in the graph
  • graph_mode: SingleStrand or DoubleStrand mode

Mycelia._add_strand_aware_edge! - Method

Helper function to add strand-aware coverage data to an edge.

This function creates edges that respect strand orientation constraints. Each edge represents a biologically valid transition between k-mers.


Mycelia._calculate_l_statistic - Method
_calculate_l_statistic(sorted_lengths, threshold)

Calculate L-statistic (number of contigs needed to reach a given percentage of total assembly length). For example, L50 is the number of contigs needed to reach 50% of the total assembly length.

Arguments

  • sorted_lengths: Vector of contig lengths sorted in descending order
  • threshold: Fraction of total length (e.g., 0.5 for L50, 0.9 for L90)

Returns

  • Int: Number of contigs needed to reach the threshold
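
A minimal sketch of the calculation, assuming integer contig lengths already sorted in descending order. The name l_statistic_sketch is hypothetical and the internal helper may differ in detail.

```julia
# Count how many of the longest contigs are needed before their cumulative
# length reaches `threshold` times the total assembly length.
function l_statistic_sketch(sorted_lengths::Vector{Int}, threshold::Float64)
    target = threshold * sum(sorted_lengths)
    cumulative = 0
    for (i, len) in enumerate(sorted_lengths)
        cumulative += len
        if cumulative >= target
            return i
        end
    end
    return length(sorted_lengths)  # only reached for empty input
end

lengths = sort([100, 400, 50, 250, 200]; rev = true)  # 400, 250, 200, 100, 50
l50 = l_statistic_sketch(lengths, 0.5)
l90 = l_statistic_sketch(lengths, 0.9)
```

For these five contigs (1000 bp in total), two contigs reach 50% of the length and four reach 90%.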

Mycelia._detect_sequence_extension - Method
_detect_sequence_extension(sequence_type::Symbol) -> String

Internal helper function to convert sequence type to file extension.

Arguments

  • sequence_type: Symbol representing sequence type (:DNA, :RNA, or :AA)

Returns

  • String: Appropriate file extension

Mycelia._is_valid_transition - Method
_is_valid_transition(
    src_kmer,
    dst_kmer,
    src_strand,
    dst_strand,
    k
) -> Any

Validate that a transition between two k-mers with given strand orientations is biologically valid.

For a transition to be valid, the suffix of the source k-mer must match the prefix of the destination k-mer when accounting for strand orientations.

Arguments

  • src_kmer: Source canonical k-mer
  • dst_kmer: Destination canonical k-mer
  • src_strand: Strand orientation of source k-mer
  • dst_strand: Strand orientation of destination k-mer
  • k: K-mer size

Returns

  • Bool: true if transition is biologically valid
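
The overlap rule can be sketched with plain strings. This is an illustration under stated assumptions: k-mers are ASCII DNA strings, and the symbols :forward and :reverse stand in for the package's StrandOrientation values. All function names here are hypothetical.

```julia
# Reverse complement of a DNA string (A/C/G/T only).
function revcomp_sketch(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

# Orient a canonical k-mer as it would be read on the given strand.
oriented_sketch(kmer, strand::Symbol) =
    strand == :forward ? kmer : revcomp_sketch(kmer)

function is_valid_transition_sketch(src_kmer, dst_kmer,
                                    src_strand::Symbol, dst_strand::Symbol, k::Int)
    src = oriented_sketch(src_kmer, src_strand)
    dst = oriented_sketch(dst_kmer, dst_strand)
    # The last k-1 bases of the source must equal the first k-1 of the destination.
    return src[2:k] == dst[1:k-1]
end

# "ATGC" contains ATG followed by TGC; the canonical form of TGC is GCA,
# so the destination must be read on the reverse strand.
valid = is_valid_transition_sketch("ATG", "GCA", :forward, :reverse, 3)
invalid = is_valid_transition_sketch("ATG", "GCA", :forward, :forward, 3)
```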

Mycelia._iterative_viterbi_paths - Method

Iterative Viterbi algorithm for finding optimal paths through a qualmer graph. Uses dynamic programming with quality scores as emission/transition probabilities.


Mycelia._sequence_to_canonical_path - Method
_sequence_to_canonical_path(
    canonical_kmers,
    sequence,
    graph_mode
) -> Vector{<:Tuple{Any, Mycelia.StrandOrientation}}

Convert a sequence to a path through canonical k-mer space with strand awareness.

This is the key function that handles the distinction between single-strand and double-strand modes while maintaining canonical k-mer vertices.

Arguments

  • canonical_kmers: Vector of canonical k-mers available in the graph
  • sequence: DNA/RNA sequence to convert
  • graph_mode: SingleStrand or DoubleStrand mode

Returns

  • Vector of (canonical_kmer, strand_orientation) pairs representing the path

Details

In DoubleStrand mode:

  • Each observed k-mer is converted to its canonical form
  • Strand orientation tracks whether the canonical form matches the observed k-mer
  • Edges respect the biological constraint that transitions must maintain proper overlap

In SingleStrand mode:

  • K-mers are used as-is (no reverse complement consideration)
  • All strand orientations are Forward
  • Suitable for RNA, amino acids, or directional DNA analysis
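
The two modes can be illustrated with a string-based sketch. It assumes the common convention that the canonical form is the lexicographically smaller of a k-mer and its reverse complement, and it omits the lookup against canonical_kmers. The names and the :forward/:reverse symbols are hypothetical stand-ins.

```julia
# Reverse complement of a DNA string (A/C/G/T only).
function revcomp_sketch(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

function canonical_path_sketch(sequence::AbstractString, k::Int; double_strand::Bool = true)
    path = Tuple{String, Symbol}[]
    for i in 1:(length(sequence) - k + 1)
        observed = sequence[i:i+k-1]
        if double_strand
            rc = revcomp_sketch(observed)
            # Record the canonical form and whether it matches the observed k-mer.
            if observed <= rc
                push!(path, (observed, :forward))
            else
                push!(path, (rc, :reverse))
            end
        else
            # Single-strand mode: k-mers are used as-is, always Forward.
            push!(path, (observed, :forward))
        end
    end
    return path
end

double_path = canonical_path_sketch("ATGCA", 3)
single_path = canonical_path_sketch("ATGCA", 3; double_strand = false)
```

In double-strand mode the observed k-mers ATG, TGC, GCA map to the canonical vertices ATG, GCA, GCA, so the second and third steps visit the same vertex in opposite orientations.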

Mycelia._simulate_fastq_reads_from_sequence - Method
_simulate_fastq_reads_from_sequence(
    sequence,
    identifier::String;
    coverage,
    read_length,
    error_rate,
    sequence_type
) -> Vector{FASTX.FASTQ.Record}

Generate simulated FASTQ reads from a reference sequence.

Arguments

  • sequence: Reference sequence (String or BioSequences type)
  • identifier: Sequence identifier
  • coverage::Int: Desired coverage depth
  • read_length::Int: Length of individual reads
  • error_rate::Float64: Simulated error rate
  • sequence_type::String: Type of sequence ("DNA", "RNA", "AA")

Returns

  • Vector{FASTX.FASTQ.Record}: Vector of simulated FASTQ reads with quality scores

Mycelia.add_bioconda_env - Method
add_bioconda_env(pkg; force) -> Union{Nothing, Base.Process}

Create a new Conda environment with a specified Bioconda package.

Arguments

  • pkg::String: Package name to install. Can include channel specification using the format "channel::package"

Keywords

  • force::Bool=false: If true, recreates the environment even if it already exists

Details

The function creates a new Conda environment named after the package and installs the package into it. It uses channel priority: conda-forge > bioconda > defaults. If CONDA_RUNNER is set to 'mamba', it will ensure mamba is installed first.

Examples

# Install basic package
add_bioconda_env("blast")

# Install from specific channel
add_bioconda_env("bioconda::blast")

# Force reinstallation
add_bioconda_env("blast", force=true)

Notes

  • Requires Conda.jl to be installed and configured
  • Uses CONDA_RUNNER global variable to determine whether to use conda or mamba
  • Cleans conda cache after installation

Mycelia.add_edgemer_to_graph! - Method
add_edgemer_to_graph!(
    graph,
    record_identifier,
    index,
    observed_edgemer
) -> Any

Add an observed edgemer to a graph with its associated metadata.

Arguments

  • graph::MetaGraph: The graph to modify
  • record_identifier: Identifier for the source record
  • index: Position where edgemer was observed
  • observed_edgemer: The biological sequence representing the edgemer

Details

Processes the edgemer by:

  1. Splitting it into source and destination kmers
  2. Converting kmers to their canonical forms
  3. Creating or updating an edge with orientation metadata
  4. Storing observation details (record, position, orientation)
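
The four steps can be sketched without a graph library by using a dictionary keyed on (source, destination) pairs in place of MetaGraph edges. Everything here is a hypothetical stand-in: the names, the string k-mers, and the lexicographic definition of the canonical form.

```julia
# Reverse complement of a DNA string (A/C/G/T only).
function revcomp_sketch(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

# Return the canonical form and whether the input was already canonical.
function canonical_sketch(kmer::AbstractString)
    rc = revcomp_sketch(kmer)
    return kmer <= rc ? (String(kmer), true) : (rc, false)
end

function add_edgemer_sketch!(edges::Dict, record_id::String, index::Int, edgemer::String)
    k = length(edgemer) - 1
    # Steps 1-2: split into source/destination k-mers and canonicalize each.
    src, src_forward = canonical_sketch(edgemer[1:k])
    dst, dst_forward = canonical_sketch(edgemer[2:k+1])
    # Step 3: create the edge entry if it does not exist yet.
    observations = get!(edges, (src, dst)) do
        NamedTuple[]
    end
    # Step 4: append this observation to the edge metadata.
    push!(observations, (record = record_id, index = index,
                         orientations = (src_forward, dst_forward)))
    return edges
end

edges = Dict{Tuple{String, String}, Vector{NamedTuple}}()
add_edgemer_sketch!(edges, "read1", 1, "ATGC")
add_edgemer_sketch!(edges, "read2", 5, "ATGC")  # same edge, second observation
```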

Returns

Modified graph with the new edge and metadata

Note

If the edge already exists, the observation is added to the existing metadata.


Mycelia.add_fastx_records_to_graph! - Method
add_fastx_records_to_graph!(graph, fastxs) -> Any

Add FASTX records from multiple files as a graph property.

Arguments

  • graph: A MetaGraph that will store the FASTX records
  • fastxs: Collection of FASTA/FASTQ file paths to process

Details

Creates a dictionary mapping sequence descriptions to their corresponding FASTX records, then stores this dictionary as a graph property under the key :records. Multiple input files are merged, with later files overwriting records with duplicate descriptions.

Returns

The modified graph with added records property.


Mycelia.add_record_edgemers_to_graph! - Method
add_record_edgemers_to_graph!(graph) -> Any

Processes DNA sequence records stored in the graph and adds their edgemers (k+1 length subsequences) to build the graph structure.

Arguments

  • graph: A Mycelia graph object containing DNA sequence records and graph properties

Details

  • Uses the k-mer size specified in graph.gprops[:k] to generate k+1 length edgemers
  • Iterates through each record in graph.gprops[:records]
  • For each record, generates all possible overlapping edgemers
  • Adds each edgemer to the graph with its position and record information

Returns

  • The modified graph object with added edgemer information

Mycelia.alphabet_to_biosequence_type - Method
alphabet_to_biosequence_type(
    alphabet::Symbol
) -> Union{Type{BioSequences.LongAA}, Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}}

Determine the BioSequence type from an alphabet symbol.

Maps alphabet symbols to the corresponding BioSequences.jl type for type-safe sequence operations throughout the codebase.

Arguments

  • alphabet::Symbol: The alphabet symbol (:DNA, :RNA, or :AA)

Returns

  • Type{<:BioSequences.BioSequence}: The corresponding BioSequence type

Examples

alphabet_to_biosequence_type(:DNA)  # Returns BioSequences.LongDNA{4}
alphabet_to_biosequence_type(:RNA)  # Returns BioSequences.LongRNA{4}
alphabet_to_biosequence_type(:AA)   # Returns BioSequences.LongAA

Mycelia.amino_acids_to_codons - Method
amino_acids_to_codons() -> Dict{BioSymbols.AminoAcid, DataType}

Creates a mapping from amino acids to representative DNA codons using the standard genetic code.

Returns

  • Dictionary mapping each amino acid (including stop codon AA_Term) to a valid DNA codon that encodes it

Mycelia.analyze_fastq_quality - Method
analyze_fastq_quality(fastq_file::String)

Analyzes quality metrics for a FASTQ file.

Calculates comprehensive quality statistics including read count, quality scores, length distribution, GC content, and quality threshold percentages.

Arguments

  • fastq_file: Path to FASTQ file (can be gzipped)

Returns

FastqQualityResults with the following fields:

  • n_reads: Total number of reads
  • mean_quality: Average Phred quality score across all reads
  • mean_length: Average read length
  • gc_content: GC content percentage
  • quality_distribution: QualityDistribution with Q20+, Q30+, Q40+ percentages

Example

quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
println("Total reads: $(quality_stats.n_reads)")
println("Mean quality: $(quality_stats.mean_quality)")
println("Q30+ reads: $(quality_stats.quality_distribution.q30_percent)%")

Mycelia.analyze_pangenome_kmers - Method
analyze_pangenome_kmers(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, distance_metric=:jaccard)

Perform comprehensive k-mer based pangenome analysis using existing Mycelia infrastructure.

Leverages existing count_canonical_kmers and distance metric functions to analyze genomic content across multiple genomes, identifying core, accessory, and unique regions.

Arguments

  • genome_files: Vector of FASTA file paths containing genome sequences
  • kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
  • distance_metric: Distance metric (:jaccard, :bray_curtis, :cosine, :js_divergence)

Returns

  • PangenomeAnalysisResult with comprehensive pangenome statistics

Example

genome_files = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.analyze_pangenome_kmers(genome_files, kmer_type=Kmers.DNAKmer{31})
println("Core k-mers: $(length(result.core_kmers))")
println("Total pangenome size: $(size(result.presence_absence_matrix, 1)) k-mers")

Mycelia.annotate_aa_fasta - Method
annotate_aa_fasta(
;
    fasta,
    identifier,
    basedir,
    mmseqsdb,
    threads
)

Annotate amino acid sequences in a FASTA file using MMseqs2 search against UniRef50 database.

Arguments

  • fasta: Path to input FASTA file containing amino acid sequences
  • identifier: Name for the output directory (defaults to FASTA filename without extension)
  • basedir: Base directory for output (defaults to current directory)
  • mmseqsdb: Path to MMseqs2 formatted UniRef50 database (defaults to ~/workspace/mmseqs/UniRef50)
  • threads: Number of CPU threads to use (defaults to system thread count)

Returns

  • Path to the output directory containing MMseqs2 search results

The function creates a new directory named by identifier under basedir, copies the input FASTA file, and runs MMseqs2 easy-search against the specified database. If the output directory already exists, the function skips processing and returns the directory path.


Mycelia.annotate_fasta - Method
annotate_fasta(
;
    fasta,
    identifier,
    basedir,
    mmseqsdb,
    threads
)

Perform comprehensive annotation of a FASTA file including gene prediction, protein homology search, and terminator prediction.

Arguments

  • fasta::String: Path to input FASTA file
  • identifier::String: Unique identifier for output directory (default: FASTA filename without extension)
  • basedir::String: Base directory for output (default: current working directory)
  • mmseqsdb::String: Path to MMseqs2 UniRef50 database (default: joinpath(homedir(), "workspace/mmseqs/UniRef50"))
  • threads::Int: Number of CPU threads to use (default: all available). Note: in this version of the function the argument is not explicitly passed to Pyrodigal or MMseqs2. Those tools may use their own defaults unless run_pyrodigal or run_mmseqs_easy_search is modified to respect it.

Processing Steps

  1. Creates output directory and copies input FASTA.
  2. Runs Pyrodigal for gene prediction (nucleotide, amino acid, and GFF output).
  3. Performs MMseqs2 homology search against UniRef50.
  4. Predicts terminators using TransTerm.
  5. Combines annotations into a unified GFF file.
  6. Generates GenBank format output.

Returns

  • String: Path to the output directory containing all generated files.

Files Generated (within the output directory specified by identifier)

  • (basename(fasta)).pyrodigal.fna: Predicted genes (nucleotide) from Pyrodigal.
  • (basename(fasta)).pyrodigal.faa: Predicted proteins from Pyrodigal.
  • (basename(fasta)).pyrodigal.gff: Pyrodigal GFF annotations.
  • (basename(fasta)).gff: Combined GFF annotations (MMseqs2 and TransTerm).
  • (basename(fasta)).gff.genbank: Final GenBank format from the first combined GFF.
  • (basename(fasta)).transterm_raw.gff: Combined GFF (MMseqs2 and a second TransTerm run).
  • (basename(fasta)).transterm_raw.gff.genbank: Final GenBank format from the second combined GFF.

Mycelia.apply_learned_policy - Method
apply_learned_policy(policy::DQNPolicy, input_fastq::String; output_dir="rl_assembly")

Apply a trained RL policy to perform genome assembly.

This function uses a trained policy to make autonomous assembly decisions.

Arguments

  • policy::DQNPolicy: Trained assembly policy
  • input_fastq::String: Path to input FASTQ file
  • output_dir::String: Output directory for assembly results

Returns

  • Dict{String, Any}: Assembly results and metadata

Example

results = apply_learned_policy(trained_policy, "genome.fastq")

Mycelia.assemble_genome - Method
assemble_genome(reads; method=StringGraph, config=AssemblyConfig()) -> AssemblyResult

Unified genome assembly interface using Phase 2 next-generation algorithms.

Arguments

  • reads: Vector of FASTA/FASTQ records or file paths
  • method: Assembly strategy (StringGraph, KmerGraph, HybridOLC, MultiK)
  • config: Assembly configuration parameters

Returns

  • AssemblyResult: Structure containing contigs, names, and assembly metadata

Details

This is the main entry point for the unified assembly pipeline, leveraging:

  • Phase 1: MetaGraphsNext strand-aware graph construction
  • Phase 2: Probabilistic algorithms, enhanced Viterbi, and graph algorithms
  • Phase 3: Integrated workflow with polishing and validation

Examples

# Basic assembly with default parameters
reads = load_fastq_records("reads.fastq")
result = assemble_genome(reads)

# Custom assembly with specific k-mer size and error rate
config = AssemblyConfig(k=25, error_rate=0.005, polish_iterations=5)
result = assemble_genome(reads; method=KmerGraph, config=config)

# Access results
contigs = result.contigs
stats = result.assembly_stats

Mycelia.assess_alignment - Method
assess_alignment(
    a,
    b
) -> @NamedTuple{total_matches::Int64, total_edits::Int64}

Aligns two sequences using the Levenshtein distance and returns the total number of matches and edits.

Arguments

  • a::AbstractString: The first sequence to be aligned.
  • b::AbstractString: The second sequence to be aligned.

Returns

  • NamedTuple{(:total_matches, :total_edits), Tuple{Int, Int}}: A named tuple containing:
    • total_matches::Int: The total number of matching bases in the alignment.
    • total_edits::Int: The total number of edits (insertions, deletions, substitutions) in the alignment.

Mycelia.assess_alignment_accuracy - Method
assess_alignment_accuracy(alignment_result) -> Any

Calculate the accuracy of a sequence alignment as the proportion of matched bases among all alignment operations (matches + edits).

Arguments

  • alignment_result: Alignment result object containing total_matches and total_edits fields

Returns

Float64 between 0.0 and 1.0 representing alignment accuracy, where:

  • 1.0 indicates perfect alignment (all matches)
  • 0.0 indicates no matches
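
As a minimal sketch, the ratio can be computed from any object with total_matches and total_edits fields. A NamedTuple stands in for the alignment result here, and the function name is hypothetical.

```julia
# Accuracy = matches / (matches + edits).
alignment_accuracy_sketch(result) =
    result.total_matches / (result.total_matches + result.total_edits)

result = (total_matches = 18, total_edits = 2)
accuracy = alignment_accuracy_sketch(result)
```

Note that an input with zero matches and zero edits would produce NaN in this sketch; how the package handles that case is not stated here.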

Mycelia.assess_assembly_kmer_quality - Method
assess_assembly_kmer_quality(; assembly, observations, ks)

Evaluate genome assembly quality by comparing k-mer distributions between assembled sequences and raw observations.

Arguments

  • assembly: Input assembled sequences to evaluate
  • observations: Raw sequencing data for comparison
  • ks::Vector{Int}: Vector of k-mer sizes to analyze (default: k=17 to 23)

Returns

DataFrame containing quality metrics for each k-mer size:

  • k: K-mer length used
  • cosine_distance: Cosine distance between k-mer distributions
  • js_divergence: Jensen-Shannon divergence between distributions
  • qv: MerQury-style quality value score
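
For orientation, the published Merqury consensus QV can be sketched as follows. This assumes the qv column follows that definition, which the description above suggests but does not confirm. The function and argument names are hypothetical.

```julia
# Merqury-style QV: k-mers found only in the assembly (absent from the reads)
# are treated as evidence of base errors.
function merqury_style_qv_sketch(assembly_only_kmers::Int, total_assembly_kmers::Int, k::Int)
    # Probability that a single base is correct, inferred from k-mer survival.
    p_correct = (1 - assembly_only_kmers / total_assembly_kmers)^(1 / k)
    error_rate = 1 - p_correct
    return -10 * log10(error_rate)  # Phred scale; Inf when no errors are seen
end

qv = merqury_style_qv_sketch(100, 1_000_000, 21)
```

With 100 unsupported k-mers out of one million at k = 21, the sketch gives a QV slightly above 53.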

Mycelia.assess_assembly_quality - Method
assess_assembly_quality(contigs_file)

Assess basic assembly quality metrics from a FASTA file.

Calculates standard assembly quality metrics including contig count, total length, and N50 statistic for assembly evaluation.

Arguments

  • contigs_file: Path to FASTA file containing assembly contigs

Returns

  • Tuple of (n_contigs, total_length, n50, l50)
    • n_contigs: Number of contigs in the assembly
    • total_length: Total length of all contigs in base pairs
    • n50: N50 statistic (length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the assembly)
    • l50: L50 statistic (number of contigs needed to reach 50% of assembly length)
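
A minimal sketch of how these four values relate, computed from a vector of contig lengths rather than a FASTA file. The function name is hypothetical, and the package's own handling of edge cases may differ.

```julia
function assembly_metrics_sketch(contig_lengths::Vector{Int})
    sorted = sort(contig_lengths; rev = true)
    total_length = sum(sorted)
    cumulative = 0
    n50, l50 = 0, 0
    for (i, len) in enumerate(sorted)
        cumulative += len
        # Stop at the first contig where the running total covers half the assembly.
        if 2 * cumulative >= total_length
            n50, l50 = len, i
            break
        end
    end
    return (n_contigs = length(sorted), total_length = total_length, n50 = n50, l50 = l50)
end

metrics = assembly_metrics_sketch([100, 400, 50, 250, 200])
```

Here the two longest contigs (400 and 250 bp) cover 650 of 1000 bp, so N50 is 250 and L50 is 2.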

Example

n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")

See Also

  • assess_assembly_kmer_quality: For k-mer based assembly quality assessment

Mycelia.assess_dnamer_saturation - Method
assess_dnamer_saturation(
    fastx::AbstractString;
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess
)

Analyzes k-mer saturation in a FASTA/FASTQ file to determine optimal k-mer size.

Arguments

  • fastx::AbstractString: Path to input FASTA/FASTQ file
  • power::Int=10: Exponent for downsampling k-mers (2^power)
  • outdir::String="": Output directory for results. Uses current directory if empty
  • min_k::Int=3: Minimum k-mer size to evaluate
  • max_k::Int=17: Maximum k-mer size to evaluate
  • threshold::Float64=0.1: Saturation threshold for k-mer assessment
  • kmers_to_assess::Int=10_000_000: Maximum number of k-mers to sample

Returns

Dict{Int,Float64}: Dictionary mapping k-mer sizes to their saturation scores


Mycelia.assess_dnamer_saturation - Method
assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString},
    kmer_type;
    kmers_to_assess,
    power,
    min_count
) -> Union{@NamedTuple{sampling_points::Vector{Int64}, unique_kmer_counts::Vector{Int64}}, NamedTuple{(:sampling_points, :unique_kmer_counts, :eof), <:Tuple{Vector, Vector{Int64}, Bool}}}

Assess k-mer saturation in DNA sequences from FASTX files.

Arguments

  • fastxs::AbstractVector{<:AbstractString}: Vector of paths to FASTA/FASTQ files
  • kmer_type: Type of k-mer to analyze (e.g., DNAKmer{21})
  • kmers_to_assess=Inf: Maximum number of k-mers to process
  • power=10: Base for exponential sampling intervals
  • min_count=1: Minimum count threshold for considering a k-mer

Returns

Named tuple containing:

  • sampling_points::Vector{Int}: K-mer counts at which samples were taken
  • unique_kmer_counts::Vector{Int}: Number of unique canonical k-mers at each sampling point
  • eof::Bool: Whether the entire input was processed

Details

Analyzes k-mer saturation by counting unique canonical k-mers at exponentially spaced intervals (powers of power). Useful for assessing sequence complexity and coverage. Returns early if all possible k-mers are observed.
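
The sampling scheme can be sketched on plain strings. For brevity this illustration counts k-mers as observed rather than canonical k-mers, and it omits min_count, kmers_to_assess, and the early return. The function name is hypothetical.

```julia
# Record the number of distinct k-mers seen after 1, power, power^2, ... k-mers.
function saturation_curve_sketch(sequences::Vector{String}, k::Int; power::Int = 10)
    seen = Set{String}()
    sampling_points = Int[]
    unique_kmer_counts = Int[]
    observed = 0
    next_sample = 1
    for seq in sequences
        for i in 1:(length(seq) - k + 1)
            push!(seen, seq[i:i+k-1])
            observed += 1
            if observed == next_sample
                push!(sampling_points, observed)
                push!(unique_kmer_counts, length(seen))
                next_sample *= power  # exponentially spaced sampling
            end
        end
    end
    return (sampling_points = sampling_points, unique_kmer_counts = unique_kmer_counts)
end

curve = saturation_curve_sketch(["ACGTACGTACGT"], 2; power = 2)
```

In this repetitive toy sequence the unique count stops growing after four k-mers, which is the flattening that a saturation curve is meant to reveal.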


Mycelia.assess_dnamer_saturation - Method
assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString};
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess,
    plot
)

Analyze k-mer saturation in DNA sequences to determine optimal k value.

Arguments

  • fastxs: Vector of paths to FASTA/FASTQ files to analyze
  • power: Base of logarithmic sampling points (default: 10)
  • outdir: Optional output directory for plots and results
  • min_k: Minimum k-mer size to test (default: 7)
  • max_k: Maximum k-mer size to test (default: 17)
  • threshold: Saturation threshold to determine optimal k (default: 0.1)
  • kmers_to_assess: Maximum number of k-mers to sample (default: 10M)
  • plot: Whether to generate saturation curves (default: true)

Returns

Integer representing the first k value that achieves saturation below threshold. If no k value meets the threshold, returns the k with minimum saturation.

Details

  • Tests only prime k values between min_k and max_k
  • Generates saturation curves using logarithmic sampling
  • Fits curves to estimate maximum unique k-mers
  • If outdir is provided, saves plots as SVG and chosen k value to text file

Mycelia.assess_duplication_rates - Method
assess_duplication_rates(fastq; results_table) -> Any

Analyze sequence duplication rates in a FASTQ file.

This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.

Arguments

  • fastq::String: Path to the input FASTQ file to analyze
  • results_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"

Returns

  • String: Path to the results table file

Output

Generates a tab-separated file containing the following metrics:

  • total_records: Total number of sequence records in the file
  • total_unique_observations: Count of unique sequence strings
  • total_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)
  • percent_unique_observations: Percentage of sequences that are unique
  • percent_unique_canonical_observations: Percentage of sequences that are unique after canonicalization
  • percent_duplication_rate: Percentage of sequences that are duplicates (100 - percent_unique_observations)
  • percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization
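
The metrics can be sketched on an in-memory vector of sequences. This assumes the canonical form is the lexicographically smaller of a sequence and its reverse complement. The function names are hypothetical, and file I/O and progress reporting are omitted.

```julia
# Reverse complement of a DNA string (A/C/G/T only).
function revcomp_sketch(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

function duplication_rates_sketch(sequences::Vector{String})
    total = length(sequences)
    unique_count = length(Set(sequences))
    # Treat a sequence and its reverse complement as the same observation.
    canonical_count = length(Set(min(s, revcomp_sketch(s)) for s in sequences))
    percent_unique = 100 * unique_count / total
    percent_unique_canonical = 100 * canonical_count / total
    return (
        total_records = total,
        total_unique_observations = unique_count,
        total_unique_canonical_observations = canonical_count,
        percent_unique_observations = percent_unique,
        percent_unique_canonical_observations = percent_unique_canonical,
        percent_duplication_rate = 100 - percent_unique,
        percent_canonical_duplication_rate = 100 - percent_unique_canonical,
    )
end

# "CCGT" is the reverse complement of "ACGG", so it only counts as a
# duplicate after canonicalization.
rates = duplication_rates_sketch(["ACGG", "CCGT", "ACGG", "TTTT"])
```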

Notes

  • If the specified results file already exists and is not empty, the function will return early without recomputing.
  • Progress is displayed during processing with a progress bar showing speed.

Example

# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")

# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")

Mycelia.assess_optimal_kmer_alignment - Method
assess_optimal_kmer_alignment(
    kmer,
    observed_kmer
) -> Tuple{@NamedTuple{total_matches::Int64, total_edits::Int64}, Union{Missing, Bool}}

Determines which orientation provides the optimal alignment, for use when initiating path likelihood calculations in Viterbi analysis.

Compare alignment scores between a query k-mer and an observed k-mer in both forward and reverse complement orientations to determine optimal alignment.

Arguments

  • kmer: Query k-mer sequence to align
  • observed_kmer: Target k-mer sequence to align against

Returns

A tuple containing:

  • alignment_result: The alignment result object for the optimal orientation
  • orientation: Boolean indicating orientation (true = forward, false = reverse complement, missing = tied scores)

Details

  • Performs pairwise alignment in both orientations using assess_alignment()
  • Calculates accuracy scores using assess_alignment_accuracy()
  • For tied alignment scores, randomly selects one orientation
  • Uses BioSequences.reverse_complement for reverse orientation comparison

Mycelia.bam_to_fastq - Method
bam_to_fastq(; bam, fastq)

Convert a BAM file to FASTQ format with gzip compression.

Arguments

  • bam: Path to input BAM file
  • fastq: Optional output path. Defaults to input path with ".fq.gz" extension

Returns

  • Path to the generated FASTQ file

Details

  • Uses samtools through conda environment
  • Automatically skips if output file exists
  • Output is gzip compressed
  • Requires samtools to be available via conda

Mycelia.bandage_visualize - Method
bandage_visualize(; gfa, img)

Generate a visualization of a genome assembly graph using Bandage.

Arguments

  • gfa: Path to input GFA (Graphical Fragment Assembly) file
  • img: Optional output image path. Defaults to GFA filename with .png extension

Returns

  • Path to the generated image file

Mycelia.batch_download_viroid_references - Method
batch_download_viroid_references(
    species_list::Vector{String};
    base_outdir,
    download_genome,
    download_cds,
    download_protein,
    max_per_species
) -> Dict{String, Any}

Download reference data for multiple viroid species in batch.

Arguments

  • species_list::Vector{String}: List of viroid species to download
  • base_outdir::String: Base output directory (subdirectories created per species)
  • download_genome::Bool: Download genomic sequences (default: true)
  • download_cds::Bool: Download CDS sequences (default: true)
  • download_protein::Bool: Download protein sequences (default: true)
  • max_per_species::Int: Maximum sequences per species per type (default: 5)

Returns

  • Dict{String, NamedTuple}: Dictionary mapping species names to their downloaded file info

Examples

# Download data for all known viroid species
species_list = get_viroid_species_list()
results = batch_download_viroid_references(species_list; base_outdir="viroid_database/")

# Download only genomes for a subset
pstv_like = ["Potato spindle tuber viroid", "Tomato planta macho viroid"]
results = batch_download_viroid_references(pstv_like; base_outdir="pstv_group/",
                                         download_cds=false, download_protein=false)
source
Mycelia.benchmark_graph_constructionFunction
benchmark_graph_construction()
benchmark_graph_construction(
    config::Mycelia.BenchmarkConfig
)

Benchmark graph construction performance: Legacy vs Next-generation.

Compares:

  • MetaGraphs.jl (legacy) vs MetaGraphsNext.jl (next-gen)
  • Memory allocation patterns
  • Construction time
  • Type stability

Arguments

  • config: BenchmarkConfig for test parameters

Returns

  • NamedTuple with benchmark results
source
Mycelia.benchmark_memory_patternsFunction
benchmark_memory_patterns()
benchmark_memory_patterns(config::Mycelia.BenchmarkConfig)

Benchmark memory usage patterns for different graph representations.

Compares memory usage of:

  • Stranded vertices (legacy) vs canonical vertices (next-gen)
  • Edge metadata structures
  • Coverage tracking efficiency

Arguments

  • config: BenchmarkConfig for test parameters

Returns

  • NamedTuple with memory analysis results
source
Mycelia.benchmark_type_stabilityFunction
benchmark_type_stability()
benchmark_type_stability(config::Mycelia.BenchmarkConfig)

Benchmark type stability and allocation patterns.

Measures:

  • Type inference success
  • Runtime allocations
  • Performance predictability

Arguments

  • config: BenchmarkConfig for test parameters

Returns

  • NamedTuple with type stability metrics
source
Mycelia.bernoulli_pca_epcaMethod

bernoulli_pca_epca(M::AbstractMatrix{Bool}; k::Int=0)

Perform Bernoulli (logistic) EPCA on a 0/1 matrix M (features × samples).

When to use

Use for binary (0/1) data, such as presence/absence or yes/no features.

Returns

A NamedTuple with

  • model : the fitted ExpFamilyPCA.BernoulliEPCA object
  • scores : k×n_samples matrix of low‐dimensional sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.best_label_mappingMethod
best_label_mapping(true_labels, pred_labels)

Finds the optimal mapping from predicted labels to true labels using the Hungarian algorithm, so that the total overlap (confusion matrix diagonal) is maximized. Returns the remapped predicted labels and the mapping as a Dict.
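
The idea of maximizing the confusion-matrix diagonal can be illustrated with a brute-force search over label permutations. This sketch is a stand-in for the Hungarian algorithm, not an implementation of it, and it is only practical for a handful of labels. It assumes both label vectors use the same number of distinct labels; permutations_of and brute_force_label_mapping are hypothetical names.

```julia
# All permutations of a small vector (illustrative helper, factorial cost).
function permutations_of(items::Vector{T}) where {T}
    length(items) <= 1 && return [copy(items)]
    result = Vector{Vector{T}}()
    for i in eachindex(items)
        rest = vcat(items[1:i-1], items[i+1:end])
        for tail in permutations_of(rest)
            push!(result, vcat(items[i], tail))
        end
    end
    return result
end

# Try every assignment of predicted labels to true labels and keep the one
# with the largest number of agreeing positions.
function brute_force_label_mapping(true_labels, pred_labels)
    true_set = sort(unique(true_labels))
    pred_set = sort(unique(pred_labels))
    length(true_set) == length(pred_set) || error("sketch assumes equal label counts")
    best_overlap, best_perm = -1, true_set
    for perm in permutations_of(true_set)
        mapping = Dict(zip(pred_set, perm))
        overlap = count(mapping[p] == t for (t, p) in zip(true_labels, pred_labels))
        if overlap > best_overlap
            best_overlap, best_perm = overlap, perm
        end
    end
    mapping = Dict(zip(pred_set, best_perm))
    return [mapping[p] for p in pred_labels], mapping
end
```

The Hungarian algorithm reaches the same optimum in polynomial time, which matters once the number of clusters grows beyond a few.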

source
Mycelia.binary_matrix_to_jaccard_distance_matrixMethod
binary_matrix_to_jaccard_distance_matrix(binary_matrix::Union{BitMatrix, Matrix{Bool}})

Pairwise Jaccard distance between columns of a binary matrix (BitMatrix or Matrix{Bool}). Throws an error if the input is not strictly a binary matrix.
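
As a rough illustration of the computation, the Jaccard distance between two columns is one minus the ratio of shared true entries to entries that are true in either column. The sketch below uses only Base Julia; jaccard_distance_columns is a hypothetical name, and treating two all-false columns as distance 0.0 is an assumption of this sketch.

```julia
# Pairwise Jaccard distance between the columns of a Bool matrix.
function jaccard_distance_columns(m::AbstractMatrix{Bool})
    n = size(m, 2)
    d = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        a, b = view(m, :, i), view(m, :, j)
        shared = count(a .& b)   # rows true in both columns
        total = count(a .| b)    # rows true in at least one column
        # Assumption: two empty columns are treated as identical.
        d[i, j] = d[j, i] = total == 0 ? 0.0 : 1 - shared / total
    end
    return d
end
```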

source
Mycelia.binomial_pca_epcaMethod
binomial_pca_epca(M::AbstractMatrix{<:Integer}; k::Int=0, ntrials::Int=1)

Perform Binomial EPCA on a count matrix M (features × samples).

When to use

Use for integer count data representing the number of successes out of a fixed number of trials (e.g., number of mutated alleles out of total alleles).

Keyword arguments

  • k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)
  • ntrials : number of trials for the Binomial distribution (default=1)

Returns

NamedTuple with fields

  • model : the fitted ExpFamilyPCA.BinomialEPCA object
  • scores : k×n_samples matrix of sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.biosequences_to_counts_tableMethod
biosequences_to_counts_table(; biosequences, k)

Convert a collection of biological sequences into a k-mer count matrix.

Arguments

  • biosequences: Vector of biological sequences (DNA, RNA, or Amino Acids)
  • k: Length of k-mers to count

Returns

Named tuple with:

  • sorted_kmers: Vector of all unique k-mers found, lexicographically sorted
  • kmer_counts_matrix: Sparse matrix where rows are k-mers and columns are sequences

Details

  • For DNA sequences, counts canonical k-mers (both strands)
  • Uses parallel processing with thread-safe progress tracking
  • Memory efficient sparse matrix representation
  • Supports DNA, RNA and Amino Acid sequences
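
The shape of the returned table can be sketched with plain strings and the SparseArrays standard library. This is an illustration only: it is single-threaded, counts k-mers as plain substrings without canonicalization, and kmer_counts_table_sketch is a hypothetical name.

```julia
using SparseArrays

function kmer_counts_table_sketch(sequences::Vector{String}, k::Int)
    # Count k-mers per sequence (plain substrings, no canonicalization).
    per_sequence = map(sequences) do s
        counts = Dict{String, Int}()
        for i in 1:(length(s) - k + 1)
            kmer = s[i:(i + k - 1)]
            counts[kmer] = get(counts, kmer, 0) + 1
        end
        counts
    end
    # Union of all observed k-mers, sorted lexicographically.
    sorted_kmers = sort!(collect(mapreduce(keys, union, per_sequence; init = Set{String}())))
    row_of = Dict(kmer => i for (i, kmer) in enumerate(sorted_kmers))
    rows, cols, vals = Int[], Int[], Int[]
    for (j, counts) in enumerate(per_sequence), (kmer, n) in counts
        push!(rows, row_of[kmer]); push!(cols, j); push!(vals, n)
    end
    # Rows are k-mers, columns are sequences.
    matrix = sparse(rows, cols, vals, length(sorted_kmers), length(sequences))
    return (sorted_kmers = sorted_kmers, kmer_counts_matrix = matrix)
end
```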
source
Mycelia.biosequences_to_dense_counts_tableMethod
biosequences_to_dense_counts_table(; biosequences, k)

Convert a collection of biological sequences into a dense k-mer count matrix.

Arguments

  • biosequences: Collection of DNA, RNA, or amino acid sequences (BioSequence types)
  • k::Integer: Length of k-mers to count (must be ≤ 13)

Returns

Named tuple containing:

  • sorted_kmers: Vector of all possible k-mers in sorted order
  • kmer_counts_matrix: Dense matrix where rows are k-mers and columns are sequences

Details

  • For DNA sequences, counts canonical k-mers (both strands)
  • For RNA and protein sequences, counts exact k-mers
  • Uses parallel processing with threads
source
Mycelia.blastdb2tableMethod
blastdb2table(
;
    blastdb,
    ALL_FIELDS,
    sequence_sha256,
    sequence_hash,
    sequence_id,
    accession,
    gi,
    sequence_title,
    blast_name,
    taxid,
    taxonomic_super_kingdom,
    scientific_name,
    scientific_names_leaf_nodes,
    common_taxonomic_name,
    common_names_leaf_nodes,
    leaf_node_taxids,
    membership_integer,
    ordinal_id,
    pig,
    sequence_length,
    sequence
)

Convert a BLAST database to an in-memory table with sequence and taxonomy information.

Arguments

  • blastdb::String: Path to the BLAST database
  • outfile::String="": Optional output file path. If provided, results will be saved to this file
  • force::Bool=false: Whether to overwrite existing output file
  • ALL_FIELDS::Bool=true: If true, include all fields regardless of other flag settings
  • Field selection flags (default to false unless ALL_FIELDS is true):
    • sequence_sha256::Bool: Include SHA256 hash of the sequence
    • sequence_hash::Bool: Include sequence hash
    • sequence_id::Bool: Include sequence ID
    • accession::Bool: Include accession number
    • gi::Bool: Include GI number
    • sequence_title::Bool: Include sequence title
    • blast_name::Bool: Include BLAST name
    • taxid::Bool: Include taxid
    • taxonomic_super_kingdom::Bool: Include taxonomic super kingdom
    • scientific_name::Bool: Include scientific name
    • scientific_names_leaf_nodes::Bool: Include scientific names for leaf-node taxids
    • common_taxonomic_name::Bool: Include common taxonomic name
    • common_names_leaf_nodes::Bool: Include common taxonomic names for leaf-node taxids
    • leaf_node_taxids::Bool: Include leaf-node taxids
    • membership_integer::Bool: Include membership integer
    • ordinal_id::Bool: Include ordinal ID
    • pig::Bool: Include PIG
    • sequence_length::Bool: Include sequence length
    • sequence::Bool: Include the full sequence

Returns

  • DataFrame: DataFrame containing the requested columns from the BLAST database
source
Mycelia.blastdb_to_fastaMethod
blastdb_to_fasta(
;
    blastdb,
    entries,
    taxids,
    outfile,
    force,
    max_cores
)

Convert a BLAST database to FASTA format.

Arguments

  • blastdb::String: Name of the BLAST database to convert (e.g. "nr", "nt")
  • dbdir::String: Directory containing the BLAST database files
  • outfile::String: Path for the output FASTA file

Returns

  • Path to the generated FASTA file as String
source
Mycelia.bray_curtis_distanceMethod

Compute the Bray-Curtis distance between columns of a count matrix.

Arguments

  • M::AbstractMatrix{<:Integer}: Count matrix where rows are features and columns are samples

Returns

  • Matrix{Float64}: Symmetric distance matrix with Bray-Curtis distances
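
For reference, the Bray-Curtis distance between two samples is the sum of absolute count differences divided by the sum of all counts in both samples. The sketch below applies that standard definition to matrix columns; bray_curtis_columns is a hypothetical name, and returning 0.0 for two all-zero samples is an assumption of this sketch.

```julia
# Pairwise Bray-Curtis distance between the columns of a count matrix.
function bray_curtis_columns(m::AbstractMatrix{<:Integer})
    n = size(m, 2)
    d = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        numerator = sum(abs.(m[:, i] .- m[:, j]))
        denominator = sum(m[:, i]) + sum(m[:, j])
        # Assumption: two empty samples are treated as identical.
        d[i, j] = d[j, i] = denominator == 0 ? 0.0 : numerator / denominator
    end
    return d
end
```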
source
Mycelia.build_biosequence_graphMethod
build_biosequence_graph(
    fasta_records::Vector{FASTX.FASTA.Record};
    graph_mode,
    min_overlap
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#267#268", Float64} where {_A, _B, _C}

Build a BioSequence graph directly from FASTA records.

This creates a variable-length BioSequence graph where vertices are the input sequences and edges represent relationships between sequences (e.g., overlaps, containment).

Arguments

  • fasta_records: Vector of FASTA records
  • graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
  • min_overlap: Minimum overlap length for creating edges (default: 100)

Returns

  • MetaGraphsNext.MetaGraph with BioSequence vertices and overlap edges

Example

records = [FASTX.FASTA.Record("seq1", "ATCGATCGATCG"), 
           FASTX.FASTA.Record("seq2", "CGATCGATCGAA")]
graph = build_biosequence_graph(records)
source
Mycelia.build_directed_kmer_graphMethod
build_directed_kmer_graph(; fastq, k, plot)

Constructs a directed graph representation of k-mer transitions from FASTQ sequencing data.

Arguments

  • fastq: Path to input FASTQ file
  • k: K-mer size (default: 1). Must be odd and prime. If k=1, optimal size is auto-determined
  • plot: Boolean to display quality distribution plot (default: false)

Returns

MetaDiGraph with properties:

  • assembly_k: k-mer size used
  • kmer_counts: frequency of each k-mer
  • transition_likelihoods: edge weights between k-mers
  • kmer_mean_quality, kmer_total_quality: quality metrics
  • branching_nodes, unbranching_nodes: topological classification
  • likely_valid_kmer_indices: k-mers above mean quality threshold
  • likely_sequencing_artifact_indices: potential erroneous k-mers

Note

For DNA assembly, quality scores are normalized across both strands.

source
Mycelia.build_genome_distance_matrixMethod
build_genome_distance_matrix(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)

Build a distance matrix between all genome pairs using existing distance metrics.

Creates a comprehensive pairwise distance matrix using established k-mer distance functions, suitable for phylogenetic analysis and clustering.

Arguments

  • genome_files: Vector of genome FASTA file paths
  • kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
  • metric: Distance metric (:js_divergence, :cosine, :jaccard)

Returns

  • Named tuple with distance matrix and genome names

Example

genomes = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.build_genome_distance_matrix(genomes, kmer_type=Kmers.DNAKmer{31})
println("Distance matrix: $(result.distance_matrix)")
source
Mycelia.build_kmer_graph_nextMethod
build_kmer_graph_next(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}};
    graph_mode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Create a next-generation, type-stable k-mer graph using MetaGraphsNext.

This implementation uses canonical k-mers as vertices with strand-aware edges that respect biological transition constraints for both single-strand and double-strand sequences.

Arguments

  • kmer_type: Type of k-mer (e.g., DNAKmer{K})
  • observations: Vector of FASTA/FASTQ records
  • graph_mode: SingleStrand for directional sequences, DoubleStrand for DNA (default)

Returns

  • MetaGraphsNext.MetaGraph with canonical vertices and strand-aware edges

Details

  • Vertices: Always canonical k-mers (lexicographically smaller of kmer/reverse_complement)
  • Edges: Strand-aware transitions that respect biological constraints
  • SingleStrand mode: Only forward-strand transitions allowed
  • DoubleStrand mode: Both forward and reverse-complement transitions allowed
source
Mycelia.build_quality_biosequence_graphMethod
build_quality_biosequence_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    graph_mode,
    min_overlap,
    min_quality
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#277#278", Float64} where {_A, _B, _C}

Build a quality-aware BioSequence graph directly from FASTQ records.

This creates a variable-length quality-aware BioSequence graph where vertices are the input sequences with their quality scores and edges represent quality-weighted relationships between sequences.

Arguments

  • fastq_records: Vector of FASTQ records with quality scores
  • graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
  • min_overlap: Minimum overlap length for creating edges (default: 100)
  • min_quality: Minimum mean quality to include sequence (default: 20)

Returns

  • MetaGraphsNext.MetaGraph with quality-aware BioSequence vertices

Example

records = [FASTX.FASTQ.Record("seq1", "ATCGATCGATCG", "IIIIIIIIIIII")]
graph = build_quality_biosequence_graph(records)
source
Mycelia.build_qualmer_graphMethod
build_qualmer_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    k,
    graph_mode,
    min_quality,
    min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.

Arguments

  • fastq_records: Vector of FASTQ records with quality scores
  • k: K-mer size
  • graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
  • min_quality: Minimum average PHRED quality to include k-mer (default: 10)
  • min_coverage: Minimum coverage (observations) to include k-mer (default: 1)

Returns

  • MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData
source
Mycelia.build_stranded_kmer_graphMethod
build_stranded_kmer_graph(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}}
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Create a weighted, strand-specific k-mer (de Bruijn) graph from a k-mer type and a series of sequence observations in FASTA or FASTQ format.

source
Mycelia.calculate_assembly_quality_metricsMethod
calculate_assembly_quality_metrics(
    qualmer_graph;
    low_confidence_threshold
) -> NamedTuple{(:mean_coverage, :mean_quality, :mean_confidence, :low_confidence_fraction, :total_kmers), <:NTuple{5, Any}}

Calculate comprehensive assembly quality metrics for a qualmer graph.

Arguments

  • qualmer_graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
  • low_confidence_threshold::Float64=0.95: Threshold for identifying low-confidence k-mers

Returns

  • NamedTuple: Assembly quality metrics including coverage, quality, and confidence statistics

Details

Calculates mean values for coverage, quality scores, and joint probabilities. Identifies fraction of k-mers below confidence threshold as potential error indicators.

source
Mycelia.calculate_emission_probabilityMethod
calculate_emission_probability(state::ViterbiState, observation::String, config::ViterbiConfig) -> Float64

Calculate emission probability for a state given an observation.

source
Mycelia.calculate_gc_contentMethod
calculate_gc_content(sequence::AbstractString) -> Float64

Calculate GC content from a string sequence.

Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.

Arguments

  • sequence::AbstractString: Input DNA or RNA sequence as string

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")
source
Mycelia.calculate_gc_contentMethod
calculate_gc_content(
    sequence::BioSequences.LongSequence
) -> Float64

Calculate GC content (percentage of G and C bases) from a BioSequence.

This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.

Arguments

  • sequence::BioSequences.LongSequence: Input DNA or RNA sequence

Returns

  • Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)

# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG") 
gc_percent = calculate_gc_content(rna_seq)
source
Mycelia.calculate_gc_contentMethod
calculate_gc_content(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64

Calculate GC content from FASTA/FASTQ records.

Processes multiple records and calculates overall GC content across all sequences.

Arguments

  • records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input records

Returns

  • Float64: Overall GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)
source
Mycelia.calculate_sparsityMethod

Calculate k-mer sparsity for a given k-mer size. Returns fraction of possible k-mers that are NOT observed. Higher sparsity indicates errors become singletons.
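
As a small worked illustration, sparsity can be computed as one minus the number of distinct observed k-mers divided by the number of possible k-mers. The sketch assumes a DNA alphabet of size 4 and plain string input; kmer_sparsity_sketch is a hypothetical name and the actual function's signature is not shown here.

```julia
# Fraction of possible k-mers that are not observed in the input sequences.
function kmer_sparsity_sketch(sequences, k::Int; alphabet_size::Int = 4)
    observed = Set{String}()
    for s in sequences, i in 1:(length(s) - k + 1)
        push!(observed, s[i:(i + k - 1)])
    end
    # float() avoids integer overflow of alphabet_size^k for large k.
    return 1 - length(observed) / float(alphabet_size)^k
end
```

With the single sequence "ACGT", every 1-mer is observed (sparsity 0.0), while only 3 of the 16 possible 2-mers are observed (sparsity 0.8125).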

source
Mycelia.canonicalMethod
canonical(
    qmer::Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
) -> Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
source
Mycelia.canonicalMethod
canonical(
    qmer::Mycelia.Qualmer{KmerT<:Union{Kmers.DNAKmer, Kmers.RNAKmer}}
) -> Mycelia.Qualmer{KmerT} where KmerT<:Kmers.Kmer

Get the canonical representation of a DNA or RNA qualmer, considering both sequence and quality. This may involve reverse-complementing the k-mer and reversing the quality scores accordingly.
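
The pairing of sequence and quality can be sketched with a plain string and a vector of quality values. This is an illustration under simplifying assumptions (uppercase DNA, integer qualities, lexicographic comparison to pick the canonical strand); QUALMER_COMPLEMENT, qualmer_revcomp, and canonical_qualmer_sketch are hypothetical names, not the Mycelia.Qualmer API.

```julia
const QUALMER_COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

qualmer_revcomp(s::AbstractString) = join(QUALMER_COMPLEMENT[b] for b in reverse(s))

# Return the canonical (k-mer, qualities) pair. When the reverse complement is
# chosen, the qualities are reversed so each score stays attached to its base.
function canonical_qualmer_sketch(kmer::String, qualities::Vector{Int})
    rc = qualmer_revcomp(kmer)
    return rc < kmer ? (rc, reverse(qualities)) : (kmer, copy(qualities))
end
```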

source
Mycelia.canonicalize_kmer_counts!Method
canonicalize_kmer_counts!(kmer_counts) -> Any

Canonicalizes the k-mer counts in the given dictionary.

This function iterates over the provided dictionary kmer_counts, which maps k-mers to their respective counts. For each k-mer that is not in its canonical form, it converts the k-mer to its canonical form and updates the count in the dictionary accordingly. If the canonical form of the k-mer already exists in the dictionary, their counts are summed. The original non-canonical k-mer is then removed from the dictionary.

Arguments

  • kmer_counts::Dict{BioSequences.Kmer, Int}: A dictionary where keys are k-mers and values are their counts.

Returns

  • The input dictionary kmer_counts with all k-mers in their canonical form, sorted by k-mers.
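
The merging step can be sketched with string keys. Unlike the function documented here, this sketch does not mutate its input; it builds a new collection and returns it as a vector of pairs sorted by k-mer. COUNT_COMPLEMENT, revcomp_for_counts, and canonicalize_counts_sketch are hypothetical names.

```julia
const COUNT_COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

revcomp_for_counts(s::AbstractString) = join(COUNT_COMPLEMENT[b] for b in reverse(s))

# Sum the counts of each k-mer and its reverse complement under the
# lexicographically smaller of the two.
function canonicalize_counts_sketch(kmer_counts::Dict{String, Int})
    merged = Dict{String, Int}()
    for (kmer, n) in kmer_counts
        key = min(kmer, revcomp_for_counts(kmer))
        merged[key] = get(merged, key, 0) + n
    end
    return sort(collect(merged); by = first)
end
```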
source
Mycelia.canonicalize_kmer_countsMethod
canonicalize_kmer_counts(kmer_counts) -> Any

Normalize k-mer counts into a canonical form by creating a non-mutating copy.

Arguments

  • kmer_counts: Dictionary or collection of k-mer count data

Returns

  • A new normalized k-mer count collection
source
Mycelia.check_bioconda_env_is_installedMethod
check_bioconda_env_is_installed(pkg) -> Bool

Check whether a named Bioconda environment already exists.

Arguments

  • pkg::String: Name of the environment.

Returns

Bool indicating if the environment is present.

source
Mycelia.check_matrix_fits_in_memoryMethod
check_matrix_fits_in_memory(bytes_needed::Integer; severity::Symbol=:warn)

Checks whether the specified number of bytes can fit in the computer's memory.

  • bytes_needed: The number of bytes required (output from estimate_dense_matrix_memory or estimate_sparse_matrix_memory).
  • severity: What to do if there is not enough available memory. Can be :warn (default) or :error.

Returns a named tuple (will_fit_total, will_fit_available, total_memory, free_memory, bytes_needed), where:

  • will_fit_total: true if the matrix fits in total system memory.
  • will_fit_available: true if the matrix fits in currently available (free) system memory.
  • total_memory: Total system RAM in bytes.
  • free_memory: Currently available system RAM in bytes.
  • bytes_needed: Bytes requested for the matrix.

If will_fit_available is false, either warns or errors depending on severity.
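
A minimal sketch of the same check using Base's Sys.total_memory() and Sys.free_memory() follows. It omits the severity handling, and memory_check_sketch is a hypothetical name; whether Mycelia uses these exact calls is an assumption.

```julia
# Compare a byte requirement against total and currently free system memory.
function memory_check_sketch(bytes_needed::Integer)
    total_memory = Sys.total_memory()
    free_memory = Sys.free_memory()
    return (
        will_fit_total = bytes_needed <= total_memory,
        will_fit_available = bytes_needed <= free_memory,
        total_memory = total_memory,
        free_memory = free_memory,
        bytes_needed = bytes_needed,
    )
end
```

For instance, a dense 10,000 × 10,000 Float64 matrix needs 10_000 * 10_000 * 8 bytes, so memory_check_sketch(800_000_000) indicates whether roughly 0.8 GB is available.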

source
Mycelia.choose_top_n_markersMethod
choose_top_n_markers(N)

Return a vector of the top N most visually distinct marker symbols for plotting.

Arguments

  • N::Int: Number of distinct markers to return (max 17 for best differentiation).

Returns

  • Vector{Symbol}: Vector of marker symbol names.

Example

markers = choose_top_n_markers(7)
source
Mycelia.chromosome_coverage_table_to_plotMethod
chromosome_coverage_table_to_plot(cdf) -> Plots.Plot

Creates a visualization of chromosome coverage data with statistical thresholds.

Arguments

  • cdf::DataFrame: Coverage data frame containing columns:
    • index: Chromosome position indices
    • depth: Coverage depth values
    • chromosome: Chromosome identifier
    • mean_coverage: Mean coverage value
    • std_coverage: Standard deviation of coverage
    • : Boolean vector indicating +3 sigma regions
    • -3σ: Boolean vector indicating -3 sigma regions

Returns

  • A StatsPlots plot object showing:
    • Raw coverage data (black line)
    • Mean coverage and ±1,2,3σ thresholds (rainbow colors)
    • Highlighted regions exceeding ±3σ thresholds (red vertical lines)
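
The thresholds drawn on the plot can be illustrated without any plotting code. The sketch below computes the mean and standard deviation of a depth vector and flags positions beyond three standard deviations; it uses the Statistics standard library, and flag_coverage_outliers and its field names are hypothetical.

```julia
using Statistics

# Flag positions whose depth lies more than 3 standard deviations from the mean.
function flag_coverage_outliers(depth::AbstractVector{<:Real})
    mean_coverage = mean(depth)
    std_coverage = std(depth)
    return (
        mean_coverage = mean_coverage,
        std_coverage = std_coverage,
        above_3sigma = depth .> mean_coverage + 3 * std_coverage,
        below_3sigma = depth .< mean_coverage - 3 * std_coverage,
    )
end
```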
source
Mycelia.classify_reads_by_taxonomyMethod

Classify reads based on taxonomic alignments using individual alignment scoring.

This function takes taxonomy-aware alignment data and performs classification by:

  1. Loading the taxonomy-aware alignment data
  2. Analyzing alignment score distributions per read
  3. Identifying dominant taxonomic assignments
  4. Applying conservative taxonomy classification

Arguments

  • taxonomy_aware_file: Path to taxonomy-aware alignment file (.tsv.gz or .arrow)
  • min_relative_proportion::Float64=60.0: Minimum relative proportion threshold for accepting a taxonomic assignment
  • verbose::Bool=true: Whether to print progress information

Returns

A DataFrame with taxonomic classification results including individual alignment metrics

source
Mycelia.cleanup_directoryMethod
cleanup_directory(
    directory::AbstractString;
    verbose,
    force
) -> @NamedTuple{existed::Bool, files_deleted::Int64, bytes_freed::Int64, human_readable_size::String}

Clean up a directory by calculating its size and file count, then removing it.

Arguments

  • directory::AbstractString: Path to the directory to clean up
  • verbose::Bool=true: Whether to report cleanup results (default: true)
  • force::Bool=false: Whether to proceed without confirmation for large directories

Returns

  • Named tuple with fields:
    • existed: Whether the directory existed before cleanup
    • files_deleted: Number of files that were deleted
    • bytes_freed: Total bytes freed up
    • human_readable_size: Human-readable representation of bytes freed

Details

This function will:

  1. Check if the directory exists and is non-empty
  2. Calculate the total size and number of files recursively
  3. Remove the directory and all its contents
  4. Report the cleanup results unless verbose=false

For directories larger than 1GB or containing more than 10,000 files, confirmation is required unless force=true.
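
The size calculation and confirmation rule can be sketched with walkdir and filesize from Base. This sketch only measures; it never deletes anything. The names directory_usage and needs_confirmation are hypothetical, and interpreting 1GB as 10^9 bytes is an assumption.

```julia
# Recursively count files and bytes under a directory (no deletion).
function directory_usage(directory::AbstractString)
    isdir(directory) || return (existed = false, files = 0, bytes = 0)
    files, bytes = 0, 0
    for (root, _, filenames) in walkdir(directory)
        for name in filenames
            files += 1
            bytes += filesize(joinpath(root, name))
        end
    end
    return (existed = true, files = files, bytes = bytes)
end

# Assumption: "1GB" means 1_000_000_000 bytes.
needs_confirmation(usage; force::Bool = false) =
    !force && (usage.bytes > 1_000_000_000 || usage.files > 10_000)
```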

Examples

# Clean up a temporary directory with reporting
result = cleanup_directory("/tmp/myapp_temp")

# Silent cleanup
cleanup_directory("/tmp/cache", verbose=false)

# Force cleanup of large directory
cleanup_directory("/tmp/large_data", force=true)
source
Mycelia.codon_optimizeMethod
codon_optimize(
;
    normalized_codon_frequencies,
    protein_sequence,
    n_iterations
)

Optimizes the DNA sequence encoding for a given protein sequence using codon usage frequencies.

Arguments

  • normalized_codon_frequencies: Dictionary mapping amino acids to their codon frequencies
  • protein_sequence::BioSequences.LongAA: Target protein sequence to optimize
  • n_iterations::Integer: Number of optimization iterations to perform

Algorithm

  1. Creates initial DNA sequence through reverse translation
  2. Iteratively generates new sequences by sampling codons based on their frequencies
  3. Keeps track of the sequence with highest codon usage likelihood
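
The sampling loop in steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it skips the initial reverse translation, represents the frequency table as a Dict from amino acid Char to a Dict of codon strings and frequencies, and scores a sequence by the sum of log codon frequencies. sample_codon and sketch_codon_optimize are hypothetical names.

```julia
using Random

# Draw one codon with probability proportional to its frequency.
function sample_codon(rng, frequencies::Dict{String, Float64})
    codons = sort(collect(keys(frequencies)))
    r = rand(rng) * sum(frequencies[c] for c in codons)
    cumulative = 0.0
    for c in codons
        cumulative += frequencies[c]
        r <= cumulative && return c
    end
    return last(codons)
end

# Repeatedly sample a full coding sequence and keep the most likely one seen.
function sketch_codon_optimize(table::Dict{Char, Dict{String, Float64}}, protein::String;
                               n_iterations::Int = 100, rng = Random.default_rng())
    best_dna, best_score = "", -Inf
    for _ in 1:n_iterations
        codons = [sample_codon(rng, table[aa]) for aa in protein]
        score = sum(log(table[aa][c]) for (aa, c) in zip(protein, codons))
        if score > best_score
            best_dna, best_score = join(codons), score
        end
    end
    return best_dna
end
```

Because the score is a sum of log frequencies, the search tends toward the most frequent codon for each residue as n_iterations grows.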

Returns

  • BioSequences.LongDNA{2}: Optimized DNA sequence encoding the input protein
source
Mycelia.codons_to_amino_acidsMethod
codons_to_amino_acids() -> Dict

Creates a mapping between DNA codons and their corresponding amino acids using the standard genetic code.

Returns a dictionary where:

  • Keys are 3-letter DNA codons (e.g., "ATG")
  • Values are the corresponding amino acids from BioSequences.jl
source
Mycelia.compare_genome_kmer_similarityMethod
compare_genome_kmer_similarity(genome1_file::String, genome2_file::String; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)

Compare two genomes using existing k-mer distance metrics.

Leverages existing distance metric functions to compare genomic similarity between pairs of genomes using various distance measures.

Arguments

  • genome1_file: Path to first genome FASTA file
  • genome2_file: Path to second genome FASTA file
  • kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
  • metric: Distance metric (:js_divergence, :cosine, :jaccard)

Returns

  • Named tuple with similarity/distance metrics and k-mer statistics

Example

similarity = Mycelia.compare_genome_kmer_similarity(
    "genome1.fasta", "genome2.fasta", 
    kmer_type=Kmers.DNAKmer{31}, 
    metric=:js_divergence
)
println("JS divergence: $(similarity.distance)")
println("Shared k-mers: $(similarity.shared_kmers)")
source
Mycelia.concatenate_filesMethod
concatenate_files(; files, file)

Concatenate multiple FASTA files into a single output file by simple appending, without any regard to record uniqueness.

See also merge_fasta_files.

Arguments

  • files: Vector of paths to input FASTA files
  • file: Path where the concatenated output will be written

Returns

  • Path to the output concatenated file

Details

Platform-independent implementation of cat *.fasta > combined.fasta. Files are processed sequentially with a progress indicator.
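
Simple appending can be sketched with Base file I/O alone. The sketch has no progress indicator, and concatenate_into is a hypothetical name with positional arguments, unlike the keyword-based function documented here.

```julia
# Append the raw bytes of each input file to the output file, in order.
function concatenate_into(files::Vector{String}, outfile::String)
    open(outfile, "w") do out
        for f in files
            write(out, read(f))
        end
    end
    return outfile
end
```

Note that byte-level appending joins two records onto one line if an input file does not end with a newline.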

source
Mycelia.confusion_matrixMethod
confusion_matrix(true_labels, pred_labels)

Returns the confusion matrix as a Matrix{Int}, row = true, col = predicted. Also returns the list of unique labels in sorted order and a heatmap plot.
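
The counting part can be sketched in a few lines of Base Julia. The heatmap is omitted, and count_confusions is a hypothetical name.

```julia
# Build a confusion matrix with rows = true labels and columns = predicted labels.
function count_confusions(true_labels, pred_labels)
    labels = sort(unique(vcat(true_labels, pred_labels)))
    index = Dict(l => i for (i, l) in enumerate(labels))
    cm = zeros(Int, length(labels), length(labels))
    for (t, p) in zip(true_labels, pred_labels)
        cm[index[t], index[p]] += 1
    end
    return cm, labels
end
```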

source
Mycelia.contbernoulli_pca_epcaMethod
contbernoulli_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Continuous Bernoulli EPCA on a matrix M (features × samples).

When to use

Use for continuous data in the open interval (0, 1), such as probabilities or normalized intensities.

Keyword arguments

  • k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

  • model : the fitted ExpFamilyPCA.ContinuousBernoulliEPCA object
  • scores : k×n_samples matrix of sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.contig_is_circularMethod
contig_is_circular(
    graph_file::String,
    contig_name::String
) -> Any

Determine if a contig represents a circular structure in the assembly graph.

A circular contig is one where the sequence forms a complete loop in the assembly graph, typically representing structures like plasmids, circular chromosomes, or other circular DNA elements.

Arguments

  • graph_file::String: Path to the assembly graph in GFA format
  • contig_name::String: Name/identifier of the contig to check

Returns

  • Bool: true if the contig forms a circular structure, false otherwise
source
Mycelia.contig_is_cleanly_assembledMethod
contig_is_cleanly_assembled(
    graph_file::String,
    contig_name::String
) -> Bool

Check if a contig exists in isolation within its connected component in an assembly graph.

By cleanly assembled we mean that the contig does not have other contigs attached in the same connected component.

Arguments

  • graph_file::String: Path to the assembly graph file in GFA format
  • contig_name::String: Name/identifier of the contig to check

Returns

  • Bool: true if the contig exists alone in its connected component, false otherwise

Details

A contig is considered "cleanly assembled" if it appears as a single entry in the assembly graph's connected components. This function parses the GFA file and checks the contig's isolation status using the graph structure.
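
The isolation check can be sketched directly on GFA text. The sketch assumes tab-delimited GFA 1 input in which S lines name segments and L lines link a "from" segment (field 2) to a "to" segment (field 4). A segment is alone in its connected component exactly when it has no link to any other segment, so no full component search is needed. Treating a self-link as still isolated is an assumption, and contig_is_isolated is a hypothetical name that takes lines rather than a file path.

```julia
# Return true if the contig has no links to any other segment.
function contig_is_isolated(gfa_lines, contig_name::String)
    neighbors = Dict{String, Set{String}}()
    for line in gfa_lines
        fields = split(line, '\t')
        if fields[1] == "S"
            get!(neighbors, String(fields[2]), Set{String}())
        elseif fields[1] == "L"
            a, b = String(fields[2]), String(fields[4])
            push!(get!(neighbors, a, Set{String}()), b)
            push!(get!(neighbors, b, Set{String}()), a)
        end
    end
    haskey(neighbors, contig_name) || error("contig not found: $contig_name")
    # Assumption: a self-link (for example a circular contig) still counts as isolated.
    return isempty(setdiff(neighbors[contig_name], [contig_name]))
end
```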

source
Mycelia.convert_legacy_gfa_to_nextFunction
convert_legacy_gfa_to_next(
    gfa_file::AbstractString
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}
convert_legacy_gfa_to_next(
    gfa_file::AbstractString,
    graph_mode::Mycelia.GraphMode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}

Convert a legacy MetaGraphs GFA to next-generation MetaGraphsNext format.

This convenience function reads a GFA file using the legacy parser and converts it to the new strand-aware format.

Arguments

  • gfa_file: Path to GFA file
  • graph_mode: GraphMode for the output graph

Returns

  • MetaGraphsNext.MetaGraph in strand-aware format
source
Mycelia.convert_sequenceMethod
convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

  • DNA sequences are converted using BioSequences.LongDNA
  • RNA sequences are converted using BioSequences.LongRNA
  • AA sequences are converted using BioSequences.LongAA
source
Mycelia.copy_to_tempdirMethod
copy_to_tempdir(file_path::String) -> String

Create a copy of a file in a temporary directory while preserving the original filename.

Arguments

  • file_path::String: Path to the source file to be copied

Returns

  • String: Path to the newly created temporary file
source
Mycelia.copy_with_unique_identifierMethod
copy_with_unique_identifier(
    infile,
    out_directory,
    unique_identifier;
    force
) -> Any

Copy a file to a new location with a unique identifier prepended to the filename.

Arguments

  • infile::AbstractString: Path to the source file to copy
  • out_directory::AbstractString: Destination directory for the copied file
  • unique_identifier::AbstractString: String to prepend to the filename
  • force::Bool=true: If true, overwrite existing files

Returns

  • String: Path to the newly created file
source
Mycelia.correct_errors_nextFunction
correct_errors_next(graph::MetaGraph, sequences::Vector, config::ViterbiConfig) -> Vector{FASTX.FASTA.Record}

Correct errors in sequences using Viterbi algorithm and return corrected FASTA records.

source
Mycelia.count_canonical_kmersMethod
count_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any

Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer size and structure
  • sequences: Iterator of biological sequences to analyze

Returns

  • Dict{KMER_TYPE,Int}: Dictionary mapping canonical k-mers to their counts
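
To illustrate the canonical form, here is a sketch that works on plain ASCII DNA strings rather than Kmers.jl types. The helper names (revcomp_sketch, canonical_sketch, count_canonical_sketch) are hypothetical and not part of Mycelia.

```julia
# Reverse complement of a plain DNA string (A, C, G, T only).
function revcomp_sketch(s::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return String([complement[c] for c in reverse(s)])
end

# Canonical form: the lexicographically smaller of a k-mer and its reverse complement.
canonical_sketch(kmer::AbstractString) = min(String(kmer), revcomp_sketch(kmer))

# Count canonical k-mers across an iterable of sequences.
function count_canonical_sketch(k::Int, sequences)
    counts = Dict{String, Int}()
    for seq in sequences
        for i in 1:(length(seq) - k + 1)
            key = canonical_sketch(seq[i:i + k - 1])
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

counts = count_canonical_sketch(3, ["ATGCAT"])
```

In "ATGCAT", the 3-mers ATG and CAT are reverse complements of each other, as are TGC and GCA, so each pair collapses onto a single canonical key.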
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    fastx_file::AbstractString
) -> Any

Count k-mers in a FASTA/FASTQ file and return their frequencies.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
  • fastx_file: Path to input FASTA/FASTQ file

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping each k-mer to its frequency
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

  • Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
  • sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in a DNA sequence.

Arguments

  • ::Type{Kmers.Kmer{A,K}}: K-mer type with alphabet A and length K
  • sequence::BioSequences.LongSequence: Input DNA sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Type Parameters

  • A <: BioSequences.DNAAlphabet: DNA alphabet type
  • K: Length of k-mers
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in an RNA sequence.

Arguments

  • Kmer: Type parameter specifying the k-mer length K and RNA alphabet
  • sequence: Input RNA sequence to analyze

Returns

  • Dict{Kmers.Kmer, Int}: Sorted dictionary mapping each k-mer to its frequency count
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any

Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length and encoding (e.g., DNAKmer{4} for 4-mers)
  • sequences: A FASTA or FASTQ reader containing the biological sequences to analyze

Returns

A dictionary mapping k-mers to their counts in the input sequences

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any

Count k-mers in a single FASTA/FASTQ record.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
  • record: A FASTA or FASTQ record whose sequence will be analyzed

Returns

A sorted dictionary mapping each k-mer to its frequency count in the record's sequence.

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any

Count k-mers across multiple FASTA/FASTQ files and merge the results.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{4} for 4-mers)
  • fastx_files: Vector of paths to FASTA/FASTQ files

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping k-mers to their total counts across all files
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any

Count k-mers across multiple sequence records and return a sorted frequency table.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{3} for 3-mers)
  • records: Vector of FASTA/FASTQ records to analyze

Returns

  • Dict{KMER_TYPE, Int}: Sorted dictionary mapping k-mers to their frequencies
source
Mycelia.count_matrix_to_probability_matrixMethod
count_matrix_to_probability_matrix(counts_matrix) -> Any

Convert a matrix of counts into a probability matrix by normalizing each column to sum to 1.0.

Arguments

  • counts_matrix::Matrix{<:Number}: Input matrix where each column represents counts/frequencies

Returns

  • Matrix{Float64}: Probability matrix where each column sums to 1.0
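
The column normalization can be sketched with broadcasting. The function name normalize_columns_sketch is hypothetical. Note that an all-zero column would produce NaN values in this sketch, and how Mycelia handles that case is not stated here.

```julia
# Normalize each column of a counts matrix so that it sums to 1.0.
function normalize_columns_sketch(counts::AbstractMatrix{<:Real})
    column_totals = sum(counts, dims = 1)   # 1×n row of column sums
    return counts ./ column_totals          # divide every column by its own total
end

counts_matrix = [1 2; 3 2]
probability_matrix = normalize_columns_sketch(counts_matrix)
```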
source
Mycelia.count_predicted_genesMethod
count_predicted_genes(gff_file)

Count the number of predicted genes from a GFF file.

Parses a GFF/GTF file and counts the number of CDS (coding sequence) features, which correspond to predicted genes.

Arguments

  • gff_file: Path to GFF/GTF file

Returns

  • Integer count of predicted genes (CDS features)

See Also

  • run_pyrodigal: For gene prediction that generates GFF files
  • parse_transterm_output: For parsing other annotation tool outputs
source
Mycelia.count_recordsMethod
count_records(fastx) -> Int64

Counts the total number of records in a FASTA/FASTQ file.

Arguments

  • fastx: Path to a FASTA or FASTQ file (can be gzipped)

Returns

  • Number of records (sequences) in the file
source
Mycelia.countmap_columnsMethod
countmap_columns(table)

Generate and display frequency counts for all columns in a DataFrame.

Arguments

  • table::DataFrame: Input DataFrame to analyze

Details

Iterates through each column in the DataFrame and displays:

  1. The column name
  2. A Dict mapping unique values to their frequencies using StatsBase.countmap
source
Mycelia.create_assembly_environmentMethod
create_assembly_environment(training_data, validation_data; episode_length=100)

Create a new reinforcement learning environment for assembly training.

Arguments

  • training_data::Vector{String}: Paths to training FASTQ datasets
  • validation_data::Vector{String}: Paths to validation FASTQ datasets
  • episode_length::Int: Maximum steps per training episode (default: 100)

Returns

  • AssemblyEnvironment: Initialized RL environment ready for training

Example

training_files = ["genome1.fastq", "genome2.fastq", "genome3.fastq"]
validation_files = ["validation1.fastq", "validation2.fastq"]
env = create_assembly_environment(training_files, validation_files)
source
Mycelia.create_curriculum_scheduleMethod
create_curriculum_schedule(; stages=4, datasets_per_stage=10)

Create a curriculum learning schedule that progressively increases difficulty.

Arguments

  • stages::Int: Number of curriculum stages (default: 4)
  • datasets_per_stage::Int: Datasets per stage (default: 10)

Returns

  • Vector{Dict}: Curriculum schedule with parameters for each stage

Example

curriculum = create_curriculum_schedule(stages=5, datasets_per_stage=15)
source
Mycelia.create_databaseMethod
create_database(; database, address, username, password)

Creates a new Neo4j database instance if it doesn't already exist.

Arguments

  • database::String: Name of the database to create
  • address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
  • username::String: Neo4j authentication username (defaults to "neo4j")
  • password::String: Neo4j authentication password

Notes

  • Requires system database privileges to execute
  • Silently returns if database already exists
  • Temporarily switches to system database to perform creation
source
Mycelia.create_dqn_policyMethod
create_dqn_policy(; state_dim=11, action_dim=3, hidden_dims=[128, 64], learning_rate=0.001, epsilon=0.1)

Create a Deep Q-Network policy for assembly decisions.

Arguments

  • state_dim::Int: Dimension of state representation (default: 11)
  • action_dim::Int: Number of discrete actions (default: 3 for continue/next/terminate)
  • hidden_dims::Vector{Int}: Hidden layer sizes (default: [128, 64])
  • learning_rate::Float64: Learning rate (default: 0.001)
  • epsilon::Float64: Exploration probability (default: 0.1)

Returns

  • DQNPolicy: Initialized policy network

Example

policy = create_dqn_policy(hidden_dims=[256, 128, 64])
source
Mycelia.create_hmm_from_graphMethod
create_hmm_from_graph(graph::MetaGraph, config::ViterbiConfig) -> (states, transitions, emissions)

Create Hidden Markov Model parameters from a k-mer graph structure.

source
Mycelia.create_node_constraintsMethod
create_node_constraints(
    graph;
    address,
    username,
    password,
    database
)

Creates unique identifier constraints for each node type in a Neo4j database.

Arguments

  • graph: A MetaGraph containing nodes with TYPE properties
  • address: Neo4j server address
  • username: Neo4j username (default: "neo4j")
  • password: Neo4j password
  • database: Neo4j database name (default: "neo4j")

Details

Extracts unique node types from the graph and creates Neo4j constraints ensuring each node of a given type has a unique identifier property.

Failed constraint creation attempts are silently skipped.

source
Mycelia.create_tarchiveMethod
create_tarchive(; directory, tarchive)

Creates a gzipped tar archive of the specified directory along with verification files.

Arguments

  • directory: Source directory path to archive
  • tarchive: Optional output archive path (defaults to directory name with .tar.gz extension)

Generated Files

  • {tarchive}: The compressed tar archive
  • {tarchive}.log: Contents listing of the archive
  • {tarchive}.hashdeep.dfxml: Cryptographic hashes (MD5, SHA1, SHA256) of the archive

Returns

  • Path to the created tar archive file
source
Mycelia.current_unix_datetimeMethod
current_unix_datetime() -> Int64

Get the current time as a Unix timestamp (seconds since epoch).

Returns

  • Int: Current time as an integer Unix timestamp (seconds since January 1, 1970 UTC)

Examples

unix_time = current_unix_datetime()
# => 1709071368 (example value, will differ based on current time)
source
Mycelia.cypherMethod
cypher(
    cmd;
    address,
    username,
    password,
    format,
    database
) -> Cmd

Constructs a command to execute Neo4j Cypher queries via cypher-shell.

Arguments

  • cmd: The Cypher query command to execute
  • address::String="neo4j://localhost:7687": Neo4j server address
  • username::String="neo4j": Neo4j authentication username
  • password::String="password": Neo4j authentication password
  • format::String="auto": Output format (auto, verbose, or plain)
  • database::String="neo4j": Target Neo4j database name

Returns

  • Cmd: A command object ready for execution
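
As an illustration of how such a command object can be assembled, the sketch below builds a Cmd with Julia's backtick syntax. The function name cypher_cmd_sketch is hypothetical, and the exact flags Mycelia passes to cypher-shell are an assumption. The flags used are the long-form options of the cypher-shell CLI.

```julia
# Build (but do not run) a cypher-shell command for a Cypher query.
function cypher_cmd_sketch(query::AbstractString;
                           address = "neo4j://localhost:7687",
                           username = "neo4j",
                           password = "password",
                           format = "auto",
                           database = "neo4j")
    # Interpolated values become single arguments, so a query with spaces stays intact.
    return `cypher-shell --address $address --username $username --password $password --format $format --database $database $query`
end

cmd = cypher_cmd_sketch("RETURN 1;")
```

The returned Cmd can be passed to run(cmd) or read(cmd, String) once a Neo4j server is reachable.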
source
Mycelia.dataframe_to_ndjsonMethod
dataframe_to_ndjson(df::DataFrame; outfile::Union{String,Nothing}=nothing)

Converts a DataFrame df into a newline-delimited JSON (NDJSON) string. Each line in the returned string represents one DataFrame row in JSON format, suitable for upload to Google BigQuery.

Keyword Arguments

  • outfile::Union{String,Nothing}: If provided, writes the resulting NDJSON to the file path given.

Examples

using DataFrames, Dates

# Sample DataFrame
df = DataFrame(
    id = [1, 2, 3],
    name = ["Alice", "Bob", "Carol"],
    created = [DateTime(2025, 4, 8, 14, 30), DateTime(2025, 4, 8, 15, 0), missing]
)

ndjson_str = dataframe_to_ndjson(df)
println(ndjson_str)

# Optionally, write to a file
dataframe_to_ndjson(df; outfile="output.ndjson")

source
Mycelia.deduplicate_fasta_fileMethod
deduplicate_fasta_file(in_fasta, out_fasta) -> Any

Remove duplicate sequences from a FASTA file while preserving headers.

Arguments

  • in_fasta: Path to input FASTA file
  • out_fasta: Path where deduplicated FASTA will be written

Returns

Path to the output FASTA file (same as out_fasta parameter)

Details

  • Sequences are considered identical if they match exactly (case-sensitive)
  • For duplicate sequences, keeps the first header encountered
  • Input sequences are sorted by identifier before deduplication
  • Preserves the original sequence formatting
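
The rules above can be sketched on in-memory (identifier, sequence) pairs instead of a FASTA file. The function name deduplicate_records_sketch is hypothetical, and file I/O is deliberately left out.

```julia
# Keep the first record (after sorting by identifier) for each distinct sequence.
function deduplicate_records_sketch(records::Vector{Tuple{String, String}})
    seen = Set{String}()
    kept = Tuple{String, String}[]
    for (id, seq) in sort(records, by = first)
        if !(seq in seen)           # exact, case-sensitive comparison
            push!(seen, seq)
            push!(kept, (id, seq))
        end
    end
    return kept
end

records = [("seq3", "ACGT"), ("seq1", "ACGT"), ("seq2", "acgt"), ("seq4", "GGCC")]
unique_records = deduplicate_records_sketch(records)
```

Here "ACGT" and "acgt" are both kept because the comparison is case-sensitive, and seq1 is retained over seq3 because records are sorted by identifier first.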
source
Mycelia.detect_alphabetMethod
detect_alphabet(seq::AbstractString) -> Symbol

Determines the alphabet of a sequence. The function scans through seq only once:

  • If a 'T' or 't' is found (and no 'U/u'), the sequence is classified as DNA.
  • If a 'U' or 'u' is found (and no 'T/t'), it is classified as RNA.
  • If both T and U occur, an error is thrown.
  • If a character outside the canonical nucleotide and ambiguity codes is encountered, the sequence is assumed to be protein.
  • If neither T nor U are found, the sequence is assumed to be DNA.
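
A single-pass scan following these rules might look like the sketch below. The function name detect_alphabet_sketch is hypothetical, treating '-' and '.' as valid nucleotide characters is an assumption, and the point at which the real function raises the T/U error is not specified here.

```julia
# Classify a sequence string as :DNA, :RNA, or :AA in one pass.
function detect_alphabet_sketch(seq::AbstractString)
    # IUPAC nucleotide and ambiguity codes, plus gap characters (assumed).
    nucleotide_codes = Set("ACGTURYSWKMBDHVN-.")
    has_t = false
    has_u = false
    for c in seq
        uc = uppercase(c)
        uc in nucleotide_codes || return :AA   # non-nucleotide character: protein
        has_t |= (uc == 'T')
        has_u |= (uc == 'U')
    end
    has_t && has_u && error("Sequence contains both T and U")
    return has_u ? :RNA : :DNA                 # no T and no U also defaults to DNA
end
```

One consequence of these rules is that a short peptide made up only of letters that are also nucleotide codes (for example "MKV") would be classified as DNA.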
source
Mycelia.detect_alphabetMethod
detect_alphabet(sequence::BioSequences.LongAA) -> Symbol

Detect the alphabet of a LongAA sequence.

Always returns :AA.

source
Mycelia.detect_alphabetMethod
detect_alphabet(sequence::BioSequences.LongDNA) -> Symbol

Detect the alphabet of a LongDNA sequence.

Always returns :DNA.

source
Mycelia.detect_alphabetMethod
detect_alphabet(sequence::BioSequences.LongRNA) -> Symbol

Detect the alphabet of a LongRNA sequence.

Always returns :RNA.

source
Mycelia.detect_and_extract_sequenceMethod
detect_and_extract_sequence(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Tuple{Symbol, Union{BioSequences.LongAA, BioSequences.LongDNA{4}, BioSequences.LongRNA{4}}}

Detect alphabet and extract typed sequence from FASTX record in one step.

Convenience function that combines alphabet detection with type-safe sequence extraction, ideal for workflows that need to determine sequence type once at the beginning.

Arguments

  • record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input sequence record

Returns

  • Tuple{Symbol, BioSequences.BioSequence}: (alphabet_symbol, typed_sequence)

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
alphabet, sequence = detect_and_extract_sequence(record)
# alphabet = :DNA, sequence = LongDNA{4} object
source
Mycelia.detect_bubbles_nextMethod
detect_bubbles_next(graph::MetaGraphsNext.MetaGraph, min_bubble_length::Int=2, max_bubble_length::Int=100) -> Vector{BubbleStructure}

Detect bubble structures (alternative paths) in the assembly graph.

source
Mycelia.detect_sequence_extensionMethod
detect_sequence_extension(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> String

Detect sequence type from input and suggest appropriate file extension.

Arguments

  • record: A FASTA/FASTQ record
  • sequence: A string or BioSequence containing sequence data

Returns

  • String: Suggested file extension:
    • ".fna" for DNA
    • ".frn" for RNA
    • ".faa" for protein
    • ".fa" for unrecognized sequences
source
Mycelia.determine_fasta_coverage_from_bamMethod
determine_fasta_coverage_from_bam(bam) -> Any

Calculate per-base genomic coverage from a BAM file using bedtools.

Arguments

  • bam::String: Path to input BAM file

Returns

  • String: Path to the generated coverage file (.coverage.txt)

Details

Uses bedtools genomecov to compute per-base coverage. Creates a coverage file with the format: <chromosome> <position> <coverage_depth>. If the coverage file already exists, returns the existing file path.

Dependencies

Requires bedtools (automatically installed in conda environment)
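
For orientation, the sketch below shows how a per-base coverage command could be assembled without running it. The function name coverage_command_sketch and the output naming (appending ".coverage.txt" to the BAM path) are assumptions. The -d and -ibam options are standard bedtools genomecov flags.

```julia
# Build a bedtools command that writes per-base coverage to a text file.
function coverage_command_sketch(bam::AbstractString)
    coverage_file = bam * ".coverage.txt"   # assumed naming scheme
    # -d reports depth at every position; -ibam reads the BAM input directly.
    cmd = pipeline(`bedtools genomecov -d -ibam $bam`, stdout = coverage_file)
    return coverage_file, cmd
end

coverage_file, cmd = coverage_command_sketch("sample.bam")
# To reuse an existing result, one could write: isfile(coverage_file) || run(cmd)
```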

source
Mycelia.determine_max_canonical_kmersMethod
determine_max_canonical_kmers(k, ALPHABET) -> Any

Calculate the maximum number of possible canonical k-mers for a given alphabet.

Arguments

  • k::Integer: Length of k-mer
  • ALPHABET::Vector{Char}: Character set (nucleotides or amino acids)

Returns

  • Int: Maximum number of possible canonical k-mers

Details

  • For amino acids (AA_ALPHABET): returns total possible k-mers
  • For nucleotides: returns half of total possible k-mers (canonical form)
  • Requires odd k-mer length for nucleotide alphabets
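
The counting behind these rules can be sketched as follows. With odd k, no DNA k-mer equals its own reverse complement, so the 4ᵏ possible k-mers pair up exactly and half of them are canonical. The function names are hypothetical, and detecting a nucleotide alphabet by its size of 4 is a simplifying assumption.

```julia
# Total number of k-mers over an alphabet: |Σ|^k.
max_possible_kmers_sketch(k::Integer, alphabet) = length(alphabet)^k

# Number of canonical k-mers: half the total for nucleotides, all of them otherwise.
function max_canonical_kmers_sketch(k::Integer, alphabet)
    total = max_possible_kmers_sketch(k, alphabet)
    if length(alphabet) == 4   # assumed to indicate a nucleotide alphabet
        isodd(k) || error("k must be odd for nucleotide alphabets")
        return total ÷ 2
    end
    return total
end

dna = ['A', 'C', 'G', 'T']
amino_acids = collect("ACDEFGHIKLMNPQRSTVWY")
```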
source
Mycelia.determine_max_possible_kmersMethod
determine_max_possible_kmers(k, ALPHABET) -> Any

Calculate the total number of possible unique k-mers that can be generated from a given alphabet.

Arguments

  • k: Length of k-mers to consider
  • ALPHABET: Vector containing the allowed characters/symbols

Returns

  • Integer representing the maximum number of possible unique k-mers (|Σ|ᵏ)
source
Mycelia.determine_primary_contigMethod
determine_primary_contig(qualimap_results) -> Any

Identify the primary contig, i.e. the contig with the greatest number of total bases mapped to it, from Qualimap results.

Arguments

  • qualimap_results::DataFrame: DataFrame containing Qualimap alignment statistics with columns "Contig" and "Mapped bases"

Returns

  • String: Name of the contig with the highest number of mapped bases

Description

Takes Qualimap alignment results and determines which contig has the most total bases mapped to it, which often indicates the main chromosomal assembly.

source
Mycelia.determine_read_lengthsMethod
determine_read_lengths(
    fastq_file;
    total_reads
) -> Vector{Int64}

Calculate sequence lengths for reads in a FASTQ file.

Arguments

  • fastq_file::String: Path to input FASTQ file
  • total_reads::Integer=Inf: Number of reads to process (defaults to all reads)

Returns

  • Vector{Int}: Array containing the length of each sequence read
source
Mycelia.dictvec_to_dataframeMethod
dictvec_to_dataframe(dictvec::Vector{<:AbstractDict}; symbol_columns::Bool = true)

Convert a vector of dictionaries (with possibly non-uniform keys and any key type) into a DataFrame. Missing keys in a row are filled with missing.

Arguments

  • dictvec: Vector of dictionaries.
  • symbol_columns: If true (default), columns are named as Symbols (when possible), else as raw keys.

Returns

  • DataFrames.DataFrame with columns as the union of all keys.
source
Mycelia.distance_matrix_to_newickMethod
distance_matrix_to_newick(
;
    distance_matrix,
    labels,
    outfile
)

Convert a distance matrix into a Newick tree format using UPGMA hierarchical clustering.

Arguments

  • distance_matrix: Square matrix of pairwise distances between entities
  • labels: Vector of labels corresponding to the entities in the distance matrix
  • outfile: Path where the Newick tree file will be written

Returns

  • Path to the generated Newick tree file

Details

Performs hierarchical clustering using the UPGMA (average linkage) method and converts the resulting dendrogram into Newick tree format. The branch lengths in the tree represent the heights from the clustering.
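
To make the clustering-to-Newick step concrete, here is a small self-contained UPGMA sketch that does not depend on any clustering package. The function name upgma_newick_sketch is hypothetical. It uses the textbook convention of placing each merge at half the average inter-cluster distance, which may differ from how Mycelia scales branch lengths.

```julia
# Minimal UPGMA: repeatedly merge the two clusters with the smallest average distance.
function upgma_newick_sketch(D::AbstractMatrix{<:Real}, labels::Vector{String})
    clusters = [[i] for i in eachindex(labels)]  # member indices of each active cluster
    newick = copy(labels)                        # Newick text of each active cluster
    heights = zeros(length(labels))              # root height of each active cluster
    while length(clusters) > 1
        best = (Inf, 0, 0)
        for p in 1:length(clusters)-1, q in p+1:length(clusters)
            avg = sum(D[i, j] for i in clusters[p], j in clusters[q]) /
                  (length(clusters[p]) * length(clusters[q]))
            if avg < best[1]
                best = (avg, p, q)
            end
        end
        d, a, b = best
        h = d / 2                                # height of the new internal node
        newick[a] = "(" * newick[a] * ":" * string(h - heights[a]) * "," *
                    newick[b] * ":" * string(h - heights[b]) * ")"
        append!(clusters[a], clusters[b])
        heights[a] = h
        deleteat!(clusters, b); deleteat!(newick, b); deleteat!(heights, b)
    end
    return newick[1] * ";"
end

tree = upgma_newick_sketch([0 2 6; 2 0 6; 6 6 0], ["A", "B", "C"])
```

In this example A and B merge first at height 1.0, and the resulting cluster joins C at height 3.0.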

source
Mycelia.document_frequencyMethod
document_frequency(
    documents
) -> Dict{T, Int64} where T<:(SubString{_A} where _A)

Calculate the document frequency of tokens across a collection of documents.

Arguments

  • documents: Collection of text documents where each document is a string

Returns

  • Dictionary mapping each unique token to the number of documents it appears in

Description

Computes how many documents contain each unique token. Each document is tokenized by splitting on whitespace. Tokens are counted only once per document, regardless of how many times they appear within that document.
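
A minimal sketch of this computation, assuming whitespace tokenization via split. The function name document_frequency_sketch is hypothetical, and it returns String keys rather than the SubString keys of the real signature.

```julia
# Count, for each token, the number of documents that contain it at least once.
function document_frequency_sketch(documents)
    frequency = Dict{String, Int}()
    for document in documents
        # A Set removes repeats so each token counts once per document.
        for token in Set(String.(split(document)))
            frequency[token] = get(frequency, token, 0) + 1
        end
    end
    return frequency
end

frequencies = document_frequency_sketch(["a b a", "b c"])
```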

source
Mycelia.download_bandageFunction
download_bandage() -> String
download_bandage(outdir) -> Any

Downloads and installs Bandage, a bioinformatics visualization tool for genome assembly graphs.

Arguments

  • outdir="/usr/local/bin": Target installation directory for the Bandage executable

Returns

  • Path to the installed Bandage executable

Details

  • Downloads Bandage v0.8.1 for Ubuntu
  • Installs required system dependencies (libxcb-glx0, libx11-xcb-dev, libfontconfig, libgl1-mesa-glx)
  • Attempts installation with sudo, falls back to root if sudo fails
  • Skips download if Bandage is already installed at target location

Dependencies

Requires system commands: wget, unzip, apt

source
Mycelia.download_blast_dbMethod
download_blast_db(; db, dbdir, source, wait)

Downloads and sets up BLAST databases from various sources, adapting the download behavior to interactive and non-interactive contexts.

For a list of all available databases, run: Mycelia.list_blastdbs()

Arguments

  • db: Name of the BLAST database to download
  • dbdir: Directory to store the downloaded database (default: "~/workspace/blastdb")
  • source: Download source - one of ["", "aws", "gcp", "ncbi"]. Empty string auto-detects fastest source
  • wait: Whether to wait for download completion (default: true)

Returns

  • String path to the downloaded database directory
source
Mycelia.download_genome_by_accessionMethod
download_genome_by_accession(
;
    accession,
    outdir,
    compressed
)

Downloads a genomic sequence from NCBI's nucleotide database by its accession number.

Arguments

  • accession::String: NCBI nucleotide accession number (e.g. "NC_045512")
  • outdir::String: Output directory path. Defaults to current directory
  • compressed::Bool: If true, compresses output file with gzip. Defaults to true

Returns

  • String: Path to the downloaded file (.fna or .fna.gz)
source
Mycelia.download_genome_by_ftpMethod
download_genome_by_ftp(; ftp, outdir)

Downloads a genome file from NCBI FTP server to the specified directory.

Arguments

  • ftp::String: NCBI FTP path for the genome (e.g. "ftp://ftp.ncbi.nlm.nih.gov/.../")
  • outdir::String: Output directory path. Defaults to current working directory.

Returns

  • String: Path to the downloaded file

Notes

  • If the target file already exists, returns the existing file path without re-downloading
  • Downloads the genomic.fna.gz version of the genome
source
Mycelia.download_mmseqs_dbMethod
download_mmseqs_db(; db, dbdir, force, wait)

Downloads and sets up MMseqs2 reference databases for sequence searching and analysis.

Arguments

  • db::String: Name of database to download (see table below)
  • dbdir::String: Directory to store the downloaded database (default: "~/workspace/mmseqs")
  • force::Bool: If true, force re-download even if database exists (default: false)
  • wait::Bool: If true, wait for download to complete (default: true)

Returns

  • Path to the downloaded database as a String

Available Databases

Database      Type        Taxonomy  Description
UniRef100     Aminoacid   Yes       UniProt Reference Clusters - 100% identity
UniRef90      Aminoacid   Yes       UniProt Reference Clusters - 90% identity
UniRef50      Aminoacid   Yes       UniProt Reference Clusters - 50% identity
UniProtKB     Aminoacid   Yes       Universal Protein Knowledge Base
NR            Aminoacid   Yes       NCBI Non-redundant proteins
NT            Nucleotide  No        NCBI Nucleotide collection
GTDB          Aminoacid   Yes       Genome Taxonomy Database
PDB           Aminoacid   No        Protein Data Bank structures
Pfam-A.full   Profile     No        Protein family alignments
SILVA         Nucleotide  Yes       Ribosomal RNA database

  Name                  Type            Taxonomy        Url                                                           
- UniRef100             Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef90              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef50              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniProtKB             Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL      Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot  Aminoacid            yes        https://uniprot.org
- NR                    Aminoacid            yes        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                    Nucleotide             -        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                  Aminoacid            yes        https://gtdb.ecogenomic.org
- PDB                   Aminoacid              -        https://www.rcsb.org
- PDB70                 Profile                -        https://github.com/soedinglab/hh-suite
- Pfam-A.full           Profile                -        https://pfam.xfam.org
- Pfam-A.seed           Profile                -        https://pfam.xfam.org
- Pfam-B                Profile                -        https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD                   Profile                -        https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG                Profile                -        http://eggnog5.embl.de
- VOGDB                 Profile                -        https://vogdb.org
- dbCAN2                Profile                -        http://bcb.unl.edu/dbCAN2
- SILVA                 Nucleotide           yes        https://www.arb-silva.de
- Resfinder             Nucleotide             -        https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari              Nucleotide           yes        https://github.com/lskatz/Kalamari
source
Mycelia.download_sequence_by_accessionMethod
download_sequence_by_accession(
;
    accession,
    outdir,
    database,
    format,
    suffix,
    compressed
)

Download a sequence from NCBI by accession number with flexible format options.

Arguments

  • accession::String: NCBI accession number
  • outdir::String: Output directory
  • database::String: NCBI database ("nuccore", "protein", etc.)
  • format::String: Output format ("fasta", "fastacdsna", "fastacdsaa", etc.)
  • suffix::String: File suffix to append to accession for filename
  • compressed::Bool: Whether to gzip compress the output (default: true)

Returns

  • String: Path to the downloaded file
source
Mycelia.download_sra_dataMethod

Downloads sequencing reads from NCBI's Sequence Read Archive (SRA).

Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.

Arguments

  • srr_identifier: SRA run identifier (e.g., "SRR1234567")
  • outdir: Output directory for downloaded files (default: current directory)

Returns

Named tuple with:

  • srr_id: The SRA identifier
  • outdir: Output directory path
  • files: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)
  • is_paired: Boolean indicating if data is paired-end

Example

# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")

# Apply appropriate QC based on data type
if result.is_paired
    # Paired-end data - use paired-end QC
    Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
    # Single-end data - use single-end QC
    Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
end
source
Mycelia.download_viroid_reference_dataMethod
download_viroid_reference_data(
    viroid_name::String;
    outdir,
    download_genome,
    download_cds,
    download_protein,
    max_sequences
) -> @NamedTuple{genome_files::Vector{String}, cds_files::Vector{String}, protein_files::Vector{String}, output_directory::String}

Download comprehensive viroid reference data including genome, CDS, and protein sequences.

Arguments

  • viroid_name::String: Name or search term for the viroid (e.g., "Potato spindle tuber viroid")
  • outdir::String: Output directory for downloaded files (default: current directory)
  • download_genome::Bool: Download genomic DNA sequence (default: true)
  • download_cds::Bool: Download CDS transcript sequences (default: true)
  • download_protein::Bool: Download protein (FAA) sequences (default: true)
  • max_sequences::Int: Maximum number of sequences per type (default: 10)

Returns

  • NamedTuple: Paths to downloaded files (genome_files, cds_files, protein_files) and the output_directory

Examples

# Download all data for Potato spindle tuber viroid
files = download_viroid_reference_data("Potato spindle tuber viroid"; outdir="viroid_data/")

# Download only genome sequences for general viroid search
files = download_viroid_reference_data("viroid"; outdir="viroid_genomes/",
                                     download_cds=false, download_protein=false)
source
Mycelia.draw_dendrogram_treeMethod
draw_dendrogram_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a dendrogram visualization of hierarchical clustering results stored in a MetaDiGraph.

Arguments

  • mg::MetaGraphs.MetaDiGraph: Graph containing hierarchical clustering results. Must have :hcl in graph properties with clustering data and vertex properties containing :x, :y coordinates.

Keywords

  • width::Integer=500: Width of output image in pixels
  • height::Integer=500: Height of output image in pixels
  • fontsize::Integer=12: Font size for node labels in points
  • margins::Float64: Margin size in pixels, defaults to min(width,height)/20
  • mergenodesize::Float64=1: Size of circular nodes at merge points
  • lineweight::Float64=1: Thickness of dendrogram lines
  • filename::String: Output filename, defaults to timestamp with .dendrogram.png extension

Returns

The Luxor.Drawing object. The dendrogram image is also saved to disk and a preview is displayed.

source
Mycelia.draw_radial_treeMethod
draw_radial_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a radial hierarchical clustering tree visualization and save it as an image file.

Arguments

  • mg::MetaGraphs.MetaDiGraph: A meta directed graph containing hierarchical clustering data with required graph properties :hcl containing clustering information.

Keywords

  • width::Int=500: Width of the output image in pixels
  • height::Int=500: Height of the output image in pixels
  • fontsize::Int=12: Font size for node labels
  • margins::Float64: Margin size (automatically calculated as min(width,height)/20)
  • mergenodesize::Float64=1: Size of the merge point nodes
  • lineweight::Float64=1: Thickness of the connecting lines
  • filename::String: Output filename (defaults to timestamp with ".radial.png" suffix)

Details

The function creates a radial visualization of hierarchical clustering results where:

  • Leaf nodes are arranged in a circle with labels
  • Internal nodes represent merge points
  • Connections show the hierarchical structure through arcs and lines

The visualization is saved as a PNG file and automatically previewed.

Required Graph Properties

The input graph must have:

  • mg.gprops[:hcl].labels: Vector of leaf node labels
  • mg.gprops[:hcl].order: Vector of ordered leaf nodes
  • mg.gprops[:hcl].merges: Matrix of merge operations
  • mg.vprops[v][:x]: X coordinate for each vertex
  • mg.vprops[v][:y]: Y coordinate for each vertex
source
Mycelia.drop_empty_columns!Method
drop_empty_columns!(
    df::DataFrames.AbstractDataFrame
) -> DataFrames.AbstractDataFrame

Identify all columns that have only missing or empty values, and remove those columns from the dataframe in-place.

Returns a modified version of the original dataframe.

See also: drop_empty_columns

source
Mycelia.drop_empty_columnsMethod
drop_empty_columns(df::DataFrames.AbstractDataFrame) -> Any

Identify all columns that have only missing or empty values, and remove those columns from the dataframe.

Returns a modified copy of the dataframe.

See also: drop_empty_columns!

source
Mycelia.edge_path_to_sequenceMethod
edge_path_to_sequence(kmer_graph, edge_path) -> Any

Converts a path of edges in a kmer graph into a DNA sequence by concatenating overlapping kmers.

Arguments

  • kmer_graph: A directed graph where vertices represent kmers and edges represent overlaps
  • edge_path: Vector of edges representing a path through the graph

Returns

A BioSequences.LongDNASeq containing the merged sequence represented by the path

Details

The function:

  1. Takes the first kmer from the source vertex of first edge
  2. For each edge, handles orientation (forward/reverse complement)
  3. Verifies overlaps between consecutive kmers
  4. Concatenates unique bases to build final sequence
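
The overlap-and-append idea in steps 1, 3, and 4 can be sketched on plain strings. This sketch takes the ordered k-mers directly and skips the orientation handling of step 2. The function name path_to_sequence_sketch is hypothetical.

```julia
# Merge an ordered list of overlapping k-mers into one sequence.
function path_to_sequence_sketch(kmers::Vector{String})
    sequence = kmers[1]                  # start from the first k-mer
    for kmer in kmers[2:end]
        k = length(kmer)
        # Consecutive k-mers must share k-1 bases.
        sequence[end-k+2:end] == kmer[1:k-1] || error("k-mers do not overlap by k-1 bases")
        sequence *= kmer[end]            # append only the one new base
    end
    return sequence
end

merged = path_to_sequence_sketch(["ATG", "TGC", "GCA"])
```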
source
Mycelia.edge_probabilityMethod
edge_probability(stranded_kmer_graph, edge) -> Any

Calculate the probability of traversing a specific edge in a stranded k-mer graph.

The probability is computed as the ratio of this edge's coverage weight to the sum of all outgoing edge weights from the source vertex.

Arguments

  • stranded_kmer_graph: A directed graph where edges represent k-mer connections
  • edge: The edge for which to calculate the probability

Returns

  • Float64: Probability in the range [0, 1] representing the likelihood of traversing this edge. Returns 0.0 if the sum of all outgoing edge weights is zero.

Note

Probability is based on the :coverage property of edges, using their length as weights
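As a rough illustration of the ratio described above, the sketch below replaces the graph with a plain dictionary. The names `edge_weights` and `edge_probability_sketch` are hypothetical and not part of Mycelia; the real function derives its weights from the :coverage edge property.

```julia
# Hypothetical stand-in for a stranded k-mer graph:
# maps (source, destination) vertex pairs to a coverage weight.
edge_weights = Dict(
    (1, 2) => 3.0,
    (1, 3) => 1.0,
    (2, 3) => 0.0,
)

# Probability of an edge = its weight / total weight leaving its source vertex.
function edge_probability_sketch(edge_weights, edge)
    source = first(edge)
    total = 0.0
    for ((src, _), weight) in edge_weights
        if src == source
            total += weight
        end
    end
    # Guard against division by zero when the source has no outgoing weight.
    return total == 0 ? 0.0 : edge_weights[edge] / total
end
```

With these weights, edge (1, 2) carries three quarters of the weight leaving vertex 1, and edge (2, 3) falls back to 0.0 because vertex 2 has no outgoing weight.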

source
Mycelia.edgemer_to_vertex_kmersMethod
edgemer_to_vertex_kmers(
    edgemer
) -> Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}

Convert an edgemer to two vertex kmers.

This function takes an edgemer (a sequence of DNA nucleotides) and converts it into two vertex kmers. A kmer is a substring of length k from a DNA sequence. The first kmer is created from the first n-1 elements of the edgemer, and the second kmer is created from the last n-1 elements of the edgemer.

Arguments

  • edgemer::AbstractVector{T}: A vector of DNA nucleotides where T is a subtype of BioSequences.DNAAlphabet{2}.

Returns

  • Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}: A tuple containing two kmers.
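The split can be pictured on an ordinary string. This is a minimal sketch using a hypothetical helper `edgemer_to_vertex_strings`; the real function operates on Kmers.jl types rather than strings.

```julia
# Split an edge sequence of length n into its two overlapping
# vertex sequences of length n-1 (ASCII strings assumed).
function edgemer_to_vertex_strings(edgemer::AbstractString)
    n = length(edgemer)
    return (edgemer[1:n-1], edgemer[2:n])
end

edgemer_to_vertex_strings("ACGT")  # ("ACG", "CGT")
```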
source
Mycelia.ensure_next_graphMethod
ensure_next_graph(graph) -> Any

Automatically convert legacy graphs to next-generation format if needed.

This is a convenience function that checks the graph type and converts if necessary.

Arguments

  • graph: Graph in either legacy or next-generation format

Returns

  • MetaGraphsNext.MetaGraph: Graph in next-generation format
source
Mycelia.equally_spaced_samplesMethod
equally_spaced_samples(vector, n) -> Any

Sample n equally spaced elements from vector.

Arguments

  • vector: Input vector to sample from
  • n: Number of samples to return (must be positive)

Returns

A vector containing n equally spaced elements from the input vector.
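One plausible way to pick equally spaced elements is to spread n positions evenly between the first and last index and round each to an integer. The sketch below assumes that approach and n >= 2; `equally_spaced_samples_sketch` is a hypothetical name, and the packaged function may choose indices differently.

```julia
# Spread n positions evenly across the index range, then round to valid indices.
function equally_spaced_samples_sketch(vector, n)
    indices = round.(Int, range(1, length(vector), length = n))
    return vector[indices]
end

equally_spaced_samples_sketch(collect(1:10), 4)  # [1, 4, 7, 10]
```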

source
Mycelia.equivalent_fasta_sequencesMethod
equivalent_fasta_sequences(fasta_1, fasta_2) -> Bool

Compare two FASTA files to determine if they contain the same set of sequences, regardless of sequence order.

Arguments

  • fasta_1::String: Path to first FASTA file
  • fasta_2::String: Path to second FASTA file

Returns

  • Bool: true if both files contain exactly the same sequences, false otherwise

Details

Performs a set-based comparison of DNA sequences by hashing each sequence. Sequence order differences between files do not affect the result.
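The set-based comparison can be sketched without any bioinformatics dependencies. The minimal parser and both helper names below are assumptions made for illustration; the real function reads its input through FASTX. Because the comparison uses sets, record order and identifiers do not matter.

```julia
# Minimal FASTA reader: returns one string per record, joining wrapped lines.
function read_fasta_sequences(io::IO)
    sequences = String[]
    current = IOBuffer()
    started = false
    for line in eachline(io)
        if startswith(line, ">")
            started && push!(sequences, String(take!(current)))
            started = true
        elseif started
            print(current, strip(line))
        end
    end
    started && push!(sequences, String(take!(current)))
    return sequences
end

# Two inputs are equivalent when their sets of sequences are equal.
same_sequence_set(a::IO, b::IO) =
    Set(read_fasta_sequences(a)) == Set(read_fasta_sequences(b))
```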

source
Mycelia.error_rate_to_q_valueMethod
error_rate_to_q_value(error_rate) -> Any

Convert a sequencing error probability to a Phred quality score (Q-value).

The calculation uses the standard Phred formula: Q = -10 * log₁₀(error_rate)

Arguments

  • error_rate::Float64: Probability of error (between 0 and 1)

Returns

  • q_value::Float64: Phred quality score
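The Phred formula and its inverse are short enough to write out directly. The function names here are illustrative and are not the package's own.

```julia
# Q = -10 * log10(p): a 1-in-1000 error rate corresponds to Q30.
phred_q(error_rate) = -10 * log10(error_rate)

# Inverse: p = 10^(-Q / 10).
phred_error(q_value) = 10^(-q_value / 10)
```

For example, `phred_q(0.01)` is approximately 20 and `phred_error(30)` is approximately 0.001.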
source
Mycelia.errors_are_singletonsMethod

Analyze k-mer coverage distribution to detect if errors are singletons. Returns true if low-coverage k-mers (likely errors) are well-separated from high-coverage ones.

source
Mycelia.estimate_dense_matrix_memoryMethod
estimate_dense_matrix_memory(nrows::Integer, ncols::Integer)
estimate_dense_matrix_memory(T::DataType, nrows::Integer, ncols::Integer)

Estimate the memory required (in bytes) for a dense matrix.

  • If T is provided, estimate memory for a matrix with element type T.
  • If T is not provided, defaults to Float64.
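A dense matrix stores every element, so a first-order estimate is the element size multiplied by the number of entries. The sketch below assumes exactly that and ignores the small array header; `dense_bytes` is a hypothetical name.

```julia
# Bytes for the raw element storage of an nrows × ncols dense matrix.
dense_bytes(T::DataType, nrows::Integer, ncols::Integer) = sizeof(T) * nrows * ncols

# Default element type is Float64 (8 bytes per element).
dense_bytes(nrows::Integer, ncols::Integer) = dense_bytes(Float64, nrows, ncols)

dense_bytes(1000, 1000)  # 8000000
```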
source
Mycelia.estimate_genome_size_from_kmersMethod
estimate_genome_size_from_kmers(
    sequence::Union{AbstractString, BioSequences.LongSequence},
    k::Integer
) -> Dict

Estimate genome size from k-mer analysis using total k-mer count.

This function estimates genome size using the basic relationship: genome_size ≈ total_kmers - k + 1, where total_kmers is the sum of all k-mer counts. This is a simple estimation method; more sophisticated approaches accounting for sequencing depth, repeats, and errors may be more accurate.

Arguments

  • sequence::Union{BioSequences.LongSequence, AbstractString}: Input sequence or string
  • k::Integer: K-mer size for analysis

Returns

  • Dict{String, Any}: Dictionary containing:
    • "unique_kmers": Number of unique k-mers observed
    • "total_kmers": Total k-mer count (sum of all frequencies)
    • "estimated_genome_size": Estimated genome size
    • "actual_size": Length of input sequence (if provided)

Examples

# Estimate genome size from a sequence
sequence = "ATCGATCGATCGATCG"
result = estimate_genome_size_from_kmers(sequence, 5)
source
Mycelia.estimate_genome_size_from_kmersMethod
estimate_genome_size_from_kmers(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1},
    k::Integer
)

Estimate genome size from FASTQ/FASTA records using k-mer analysis.

Overload for processing FASTQ or FASTA records directly.

Arguments

  • records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input records
  • k::Integer: K-mer size for analysis

Returns

  • Dict{String, Any}: Dictionary with k-mer statistics and genome size estimate
source
Mycelia.estimate_sparse_matrix_memoryMethod
estimate_sparse_matrix_memory(nrows::Integer, ncols::Integer; nnz=nothing, density=nothing)
estimate_sparse_matrix_memory(T::DataType, nrows::Integer, ncols::Integer; nnz=nothing, density=nothing)

Estimate the memory required (in bytes) for a sparse matrix in CSC format.

  • If T is provided, estimate memory for a matrix with element type T.
  • If T is not provided, defaults to Float64.
  • You must specify either nnz (number of non-zeros) or density (proportion of non-zeros, between 0 and 1).
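In CSC format a sparse matrix holds three arrays: the non-zero values, one row index per non-zero, and one column pointer per column plus one. The sketch below sums those three, assuming `Int` indices as in SparseArrays' default; `sparse_csc_bytes` is a hypothetical name and the packaged estimate may include additional overhead.

```julia
function sparse_csc_bytes(T::DataType, nrows::Integer, ncols::Integer;
                          nnz = nothing, density = nothing)
    if nnz === nothing
        density === nothing && throw(ArgumentError("provide either nnz or density"))
        # Derive the non-zero count from the requested density.
        nnz = round(Int, density * nrows * ncols)
    end
    index_bytes = sizeof(Int)
    value_storage = nnz * sizeof(T)            # nzval
    row_storage = nnz * index_bytes            # rowval
    colptr_storage = (ncols + 1) * index_bytes # colptr
    return value_storage + row_storage + colptr_storage
end
```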
source
Mycelia.evaluate_assembly_agentMethod
evaluate_assembly_agent(policy::DQNPolicy, validation_data::Vector{String}; episodes=10)

Evaluate a trained assembly agent on validation data.

Arguments

  • policy::DQNPolicy: Trained policy to evaluate
  • validation_data::Vector{String}: Validation FASTQ files
  • episodes::Int: Number of evaluation episodes (default: 10)

Returns

  • Dict{String, Float64}: Evaluation metrics including mean reward, assembly quality, etc.

Example

# metrics = evaluate_assembly_agent(trained_policy, validation_files)
# println("Mean reward: $(metrics["mean_reward"])")
source
Mycelia.evaluate_classificationMethod
evaluate_classification(true_labels, pred_labels)

Runs confusion_matrix, precision_recall_f1, and accuracy. Pretty-prints macro metrics and accuracy. Returns a named tuple with all results and plots.

source
Mycelia.execute_continue_k_actionMethod
execute_continue_k_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "continue with current k" action by performing error correction iterations.

Arguments

  • env::AssemblyEnvironment: Current environment
  • action::AssemblyAction: Action specifying correction parameters

Returns

  • RewardComponents: Reward breakdown for the action
source
Mycelia.execute_next_k_actionMethod
execute_next_k_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "move to next k" action by progressing to the next prime k-mer size.

Arguments

  • env::AssemblyEnvironment: Current environment
  • action::AssemblyAction: Action specifying progression parameters

Returns

  • RewardComponents: Reward breakdown for the action
source
Mycelia.execute_terminate_actionMethod
execute_terminate_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "terminate assembly" action and assess final assembly quality.

Arguments

  • env::AssemblyEnvironment: Current environment
  • action::AssemblyAction: Termination action

Returns

  • RewardComponents: Final reward breakdown including assembly quality assessment
source
Mycelia.export_blast_dbMethod
export_blast_db(; path_to_db, fasta)

Export sequences from a BLAST database to a gzipped FASTA file.

Arguments

  • path_to_db: Path to the BLAST database
  • fasta: Output path for the gzipped FASTA file (default: path_to_db * ".fna.gz")

Details

Uses conda's BLAST environment to extract sequences using blastdbcmd. The output is automatically compressed using pigz. If the output file already exists, the function will skip extraction.

source
Mycelia.export_blast_db_taxonomy_tableMethod
export_blast_db_taxonomy_table(; path_to_db, outfile)

Exports a taxonomy mapping table from a BLAST database in seqid2taxid format.

Arguments

  • path_to_db::String: Path to the BLAST database
  • outfile::String: Output file path (defaults to input path + ".seqid2taxid.txt.gz")

Returns

  • String: Path to the created output file

Details

Creates a compressed tab-delimited file mapping sequence IDs to taxonomy IDs. Uses blastdbcmd without GI identifiers for better cross-referencing compatibility. If the output file already exists, returns the path without regenerating.

Dependencies

Requires BLAST+ tools installed via Bioconda.

source
Mycelia.extract_pacbiosample_informationMethod
extract_pacbiosample_information(
    xml
) -> DataFrames.DataFrame

Extract biosample and barcode information from a PacBio XML metadata file.

Arguments

  • xml: Path to PacBio XML metadata file

Returns

DataFrame with two columns:

  • BioSampleName: Name of the biological sample
  • BarcodeName: Associated DNA barcode identifier
source
Mycelia.extract_typed_sequenceMethod
extract_typed_sequence(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record},
    sequence_type::Type{<:BioSequences.BioSequence}
) -> Any

Extract sequence from FASTX record using dynamically determined type.

This function provides type-safe sequence extraction by using the appropriate BioSequence type, avoiding hardcoded sequence types and string conversions.

Arguments

  • record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input sequence record
  • sequence_type::Type{<:BioSequences.BioSequence}: Target BioSequence type

Returns

  • BioSequences.BioSequence: Typed sequence from the record

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
seq_type = alphabet_to_biosequence_type(:DNA)
sequence = extract_typed_sequence(record, seq_type)
source
Mycelia.fasta_and_gff_to_genbankMethod
fasta_and_gff_to_genbank(; fasta, gff, genbank)

Convert FASTA sequence and GFF annotation files to GenBank format using EMBOSS seqret.

Arguments

  • fasta::String: Path to input FASTA file containing sequence data
  • gff::String: Path to input GFF file containing genomic features
  • genbank::String: Path for output GenBank file

Details

Requires EMBOSS toolkit (installed via Bioconda). The function will:

  1. Create necessary output directories
  2. Run seqret to combine sequence and features
  3. Generate a GenBank format file at the specified location
source
Mycelia.fasta_genome_sizeMethod
fasta_genome_size(fasta_file) -> Any

Calculate the total size (in bases) of all sequences in a FASTA file.

Arguments

  • fasta_file::AbstractString: Path to the FASTA file

Returns

  • Int: Sum of lengths of all sequences in the FASTA file
source
Mycelia.fasta_list_to_dense_kmer_countsMethod
fasta_list_to_dense_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    result_file,
    force,
    cleanup_temp
)

Create a dense k-mer counts table for a set of FASTA files, with disk-backed temporary storage, custom element type, robust error handling, and optional output file caching.

source
Mycelia.fasta_list_to_sparse_kmer_countsMethod
fasta_list_to_sparse_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    rarefaction_data_filename,
    rarefaction_plot_basename,
    show_rarefaction_plot,
    result_file,
    out_dir,
    force,
    rarefaction_plot_kwargs...
)

Create a sparse kmer counts table (SparseMatrixCSC) from a list of FASTA files using a 3-pass approach. Pass 1 (Parallel): Counts kmers per file and writes to temporary JLD2 files. Pass 2 (Serial): Aggregates unique kmers, max count, nnz per file, and rarefaction data from temp files. Generates and saves a k-mer rarefaction plot. Pass 3 (Parallel): Reads temporary counts again to construct the final sparse matrix.

Optionally, a results filename can be provided to save/load the output. If the file exists and force is false, the result is loaded and returned. If force is true or the file does not exist, results are computed and saved.

Output Directory Behavior

  • All auxiliary output files (e.g., rarefaction data, plots) are written to a common output directory.
  • By default, this is:
    • The value of out_dir if provided.
    • Otherwise, the directory containing result_file (if provided and has a directory component).
    • Otherwise, the current working directory (pwd()).
  • If you provide an absolute path for an output file (e.g. rarefaction_data_filename), that path is used directly.
  • If both out_dir and a relative filename are given, the file is written to out_dir.

Arguments

  • fasta_list::AbstractVector{<:AbstractString}: A list of paths to FASTA files.
  • k::Integer: The length of the kmer.
  • alphabet::Symbol: The alphabet type (:AA, :DNA, :RNA).
  • temp_dir_parent::AbstractString: Parent directory for creating the temporary working directory. Defaults to Base.tempdir().
  • count_element_type::Union{Type{<:Unsigned}, Nothing}: Optional. Specifies the unsigned integer type for the counts. If nothing (default), the smallest UInt type capable of holding the maximum observed count is used.
  • rarefaction_data_filename::AbstractString: Filename for the TSV output of rarefaction data. If a relative path, will be written to out_dir.
  • rarefaction_plot_basename::AbstractString: Basename for the output rarefaction plots. If a relative path, will be written to out_dir.
  • show_rarefaction_plot::Bool: Whether to display the rarefaction plot after generation. Defaults to true.
  • result_file::Union{Nothing, AbstractString}: Optional. If provided, path to a file to save/load the full results (kmers, counts, etc) as a JLD2 file.
  • out_dir::Union{Nothing, AbstractString}: Optional. Output directory for auxiliary outputs. Defaults as described above.
  • force::Bool: If true, recompute and overwrite the output file even if it exists. Defaults to false.
  • rarefaction_plot_kwargs...: Keyword arguments to pass to plot_kmer_rarefaction for plot customization.

Returns

  • NamedTuple{(:kmers, :counts, :rarefaction_data_path)}:
    • kmers: A sorted Vector of unique kmer objects.
    • counts: A SparseArrays.SparseMatrixCSC{V, Int} storing kmer counts.
    • rarefaction_data_path: Path to the saved TSV file with rarefaction data.

Raises

  • ErrorException: If input fasta_list is empty, alphabet is invalid, or required Kmer/counting functions are not found.
source
Mycelia.fasta_table_to_fastaMethod
fasta_table_to_fasta(fasta_df) -> Any

Convert a DataFrame containing FASTA sequence information into a vector of FASTA records.

Arguments

  • fasta_df::DataFrame: DataFrame with columns "identifier", "description", and "sequence"

Returns

  • Vector{FASTX.FASTA.Record}: Vector of FASTA records
source
Mycelia.fasta_to_reference_kmer_countsMethod
fasta_to_reference_kmer_counts(; kmer_type, fasta)

Counts k-mer occurrences in a FASTA file, considering both forward and reverse complement sequences.

Arguments

  • kmer_type: Type specification for k-mers (e.g., DNAKmer{21})
  • fasta: Path to FASTA file containing reference sequences

Returns

  • Dict{kmer_type, Int}: Dictionary mapping each k-mer to its total count across all sequences
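One reading of "both forward and reverse complement" is that every k-mer on each strand is counted. The sketch below assumes that interpretation and works on plain uppercase DNA strings; the helper names are hypothetical, and the real function uses Kmers.jl types and reads from a FASTA file.

```julia
# Reverse complement of an uppercase A/C/G/T string.
function reverse_complement_sketch(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

# Count every k-mer observed on the forward and reverse-complement strands.
function count_kmers_both_strands(seq::AbstractString, k::Int)
    counts = Dict{String,Int}()
    for strand in (seq, reverse_complement_sketch(seq))
        for i in 1:(length(strand) - k + 1)
            kmer = strand[i:i+k-1]
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end
```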
source
Mycelia.fasta_to_tableMethod
fasta_to_table(fasta) -> DataFrames.DataFrame

Convert a FASTA file/record iterator to a DataFrame.

Arguments

  • fasta: FASTA record iterator from FASTX.jl

Returns

  • DataFrame with columns:
    • identifier: Sequence identifiers
    • description: Full sequence descriptions
    • sequence: Biological sequences as strings
source
Mycelia.fasta_xam_mapping_statsMethod
fasta_xam_mapping_stats(; fasta, xam)

Calculate mapping statistics by comparing sequence alignments (BAM/SAM) to a reference FASTA.

Arguments

  • fasta::String: Path to reference FASTA file
  • xam::String: Path to alignment file (BAM or SAM format)

Returns

DataFrame with columns:

  • contig: Reference sequence name
  • contig_length: Length of reference sequence
  • total_aligned_bases: Total number of bases aligned to reference
  • mean_depth: Average depth of coverage (total_aligned_bases/contig_length)
source
Mycelia.fastani_listMethod
fastani_list(
;
    query_list,
    reference_list,
    outfile,
    threads,
    force
) -> Union{Nothing, Base.Process}

Calculate Average Nucleotide Identity (ANI) between genome sequences using FastANI, given a list of query genomes and a list of reference genomes.

Arguments

  • query_list::String: Path to file containing list of query genome paths (one per line)
  • reference_list::String: Path to file containing list of reference genome paths (one per line)
  • outfile::String: Path to output file that will contain ANI results
  • threads::Int=Sys.CPU_THREADS: Number of parallel threads to use
  • force::Bool=false: If true, rerun analysis even if output file exists

Output

Generates a tab-delimited file with columns:

  • Query genome
  • Reference genome
  • ANI value (%)
  • Count of bidirectional fragment mappings
  • Total query fragments

Notes

  • Requires FastANI to be available via Bioconda
  • Automatically sets up required conda environment
source
Mycelia.fasterq_dumpMethod
fasterq_dump(
;
    outdir,
    srr_identifier
) -> NamedTuple{(:forward_reads, :reverse_reads, :unpaired_reads), <:Tuple{Union{Missing, String}, Union{Missing, String}, Union{Missing, String}}}

Download and compress sequencing reads from the SRA database using fasterq-dump.

Arguments

  • outdir::String="": Output directory for the FASTQ files. Defaults to current directory.
  • srr_identifier::String="": SRA run accession number (e.g., "SRR12345678")

Returns

Named tuple containing paths to the generated files:

  • forward_reads: Path to forward reads file (*_1.fastq.gz) or missing
  • reverse_reads: Path to reverse reads file (*_2.fastq.gz) or missing
  • unpaired_reads: Path to unpaired reads file (*.fastq.gz) or missing

Outputs

Creates compressed FASTQ files in the output directory:

  • {srr_identifier}_1.fastq.gz: Forward reads (for paired-end data)
  • {srr_identifier}_2.fastq.gz: Reverse reads (for paired-end data)
  • {srr_identifier}.fastq.gz: Unpaired reads (for single-end data)

Dependencies

Requires:

  • fasterq-dump from the SRA Toolkit (installed via Conda)
  • gzip for compression

Notes

  • Skips download if output files already exist
  • Uses up to 4 threads or system maximum, whichever is lower
  • Allocates 1GB memory for processing
  • Skips technical reads
  • Handles both paired-end and single-end data automatically
source
Mycelia.fasterq_dump_parallelMethod

Parallel FASTQ dump for multiple SRA files.

Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.

Arguments

  • srr_identifiers: Vector of SRA run identifiers
  • outdir: Output directory for FASTQ files (default: current directory)
  • max_parallel: Maximum number of parallel conversions (default: 2)

Returns

Vector of named tuples with conversion results

Example

runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")
source
Mycelia.fastq_recordMethod
fastq_record(; identifier, sequence, quality_scores)

Construct a FASTX FASTQ record from its components.

Arguments

  • identifier::String: The sequence identifier without the '@' prefix
  • sequence::String: The nucleotide sequence
  • quality_scores::Vector{Int}: Quality scores (0-93) as raw integers

Returns

  • FASTX.FASTQ.Record: A parsed FASTQ record

Notes

  • Quality scores are automatically capped at 93 to ensure FASTQ compatibility
  • Quality scores are converted to ASCII by adding 33 (Phred+33 encoding)
  • The record is constructed in standard FASTQ format with four lines:
    1. Header line (@ + identifier)
    2. Sequence
    3. Plus line
    4. Quality scores (ASCII encoded)
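The capping and Phred+33 encoding described in the notes can be sketched without FASTX. Both helper names are hypothetical, and the sketch only builds the four-line text, whereas the real function returns a parsed record.

```julia
# Cap each score at 93, then shift by 33 to get a printable ASCII character.
function phred33_string(quality_scores::AbstractVector{<:Integer})
    return join(Char(min(q, 93) + 33) for q in quality_scores)
end

# Assemble the four FASTQ lines: header, sequence, plus line, qualities.
function fastq_text(identifier, sequence, quality_scores)
    return join(["@" * identifier, sequence, "+", phred33_string(quality_scores)], "\n")
end

phred33_string([0, 40, 93, 100])  # "!I~~"
```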
source
Mycelia.fastx2normalized_tableMethod
fastx2normalized_table(fastx::AbstractString) -> DataFrames.DataFrame
fastx2normalized_table(fastx) -> DataFrames.DataFrame

Read a FASTA or FASTQ file and convert its records into a normalized DataFrames.DataFrame where each row represents a sequence record and columns provide standardized metadata and sequence statistics.

Arguments

  • fastx::AbstractString: Path to a FASTA or FASTQ file. The file must exist and be non-empty. The file type is inferred from the filename using Mycelia.FASTA_REGEX and Mycelia.FASTQ_REGEX.

Returns

  • DataFrames.DataFrame: A data frame where each row contains information for a record from the input file, and columns include:
    • fastx_path: Basename of the input file.
    • fastx_sha256: Aggregated SHA256 hash of all record SHA256s in the file.
    • record_identifier: Identifier from the record header.
    • record_description: Description from the record header.
    • record_sha256: SHA256 hash of the record sequence.
    • record_quality: Vector of quality scores (Vector{Float64}) for FASTQ, or missing for FASTA.
    • record_alphabet: Sorted, joined string of unique, uppercase characters in the record sequence.
    • record_type: Alphabet type detected by Mycelia.detect_alphabet (e.g., :DNA, :RNA, etc.).
    • mean_record_quality: Mean quality score (for FASTQ), or missing (for FASTA).
    • median_record_quality: Median quality score (for FASTQ), or missing (for FASTA).
    • record_length: Length of the sequence.
    • record_sequence: The sequence string itself.

Notes

  • The function asserts that the file exists and is not empty.
  • File type is determined by regex matching on the filename.
  • For FASTA files, quality-related columns are set to missing.
  • For FASTQ files, quality scores are extracted and statistics are computed.
  • Record SHA256 hashes are aggregated to compute a file-level SHA256 via Mycelia.metasha256.
  • Requires the following namespaces: DataFrames, Statistics, Mycelia, FASTX, and Base.basename.
  • The function returns the columns in the order: fastx_path, fastx_sha256, followed by all other record columns.

Example

import DataFrames
import Mycelia
import FASTX

table = fastx2normalized_table("example.fasta")
DataFrames.first(table, 3)
source
Mycelia.fastx_statsMethod
fastx_stats(fastx) -> DataFrames.DataFrame

Calculate basic statistics for FASTQ/FASTA sequence files using seqkit.

Arguments

  • fastx::String: Path to input FASTQ/FASTA file

Details

Automatically installs and uses seqkit from Bioconda to compute sequence statistics including number of sequences, total bases, GC content, average length, etc.

Dependencies

  • Requires Conda and Bioconda channel
  • Installs seqkit package if not present

Returns

A DataFrame containing the statistics table reported by seqkit stats.

https://bioinf.shenwei.me/seqkit/usage/#stats

source
Mycelia.fastx_to_contig_lengthsMethod
fastx_to_contig_lengths(
    fastx
) -> OrderedCollections.OrderedDict

Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.

Arguments

  • xam: Path to XAM file or XAM object

Returns

A DataFrame with per-contig statistics including:

  • n_aligned_reads: Number of aligned reads
  • total_aligned_bases: Sum of alignment lengths
  • total_alignment_score: Sum of alignment scores
  • Mapping quality statistics (mean, std, median)
  • Alignment length statistics (mean, std, median)
  • Alignment score statistics (mean, std, median)
  • Percent mismatches statistics (mean, std, median)

Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.

source
Mycelia.fastx_to_kmer_graphMethod
fastx_to_kmer_graph(
    KMER_TYPE,
    fastx::AbstractString
) -> MetaGraphs.MetaGraph

Constructs a k-mer graph from a single FASTX format string.

Arguments

  • KMER_TYPE: The k-mer type specification (e.g., DNAKmer{K} where K is k-mer length)
  • fastx::AbstractString: Input sequence in FASTX format (FASTA or FASTQ)

Returns

  • KmerGraph: A directed graph where vertices are k-mers and edges represent overlaps
source
Mycelia.fastx_to_kmer_graphMethod
fastx_to_kmer_graph(
    KMER_TYPE,
    fastxs::AbstractVector{<:AbstractString}
) -> MetaGraphs.MetaGraph

Create an in-memory kmer-graph that records:

  • all kmers
  • counts
  • all observed edges between kmers
  • edge orientations
  • edge counts

Construct a kmer-graph from one or more FASTX files (FASTA/FASTQ).

Arguments

  • KMER_TYPE: Type for kmer representation (e.g., DNAKmer{K})
  • fastxs: Vector of paths to FASTX files

Returns

A MetaGraph where:

  • Vertices represent unique kmers with properties:
    • :kmer => The kmer sequence
    • :count => Number of occurrences
  • Edges represent observed kmer adjacencies with properties:
    • :orientation => Relative orientation of connected kmers
    • :count => Number of observed transitions
source
Mycelia.fibonacci_numbers_less_thanMethod
fibonacci_numbers_less_than(
    n::Int64
) -> Union{Vector{Any}, Vector{Int64}}

Generate a sequence of Fibonacci numbers strictly less than the input value.

Arguments

  • n::Int: Upper bound (exclusive) for the Fibonacci sequence

Returns

  • Vector{Int}: Array containing Fibonacci numbers less than n
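A minimal sketch of the idea, assuming the sequence starts at 0, 1 (the docstring does not state the starting values, so the packaged function may differ). `fibonacci_below` is a hypothetical name.

```julia
# Collect Fibonacci numbers while they stay strictly below n.
function fibonacci_below(n::Int)
    numbers = Int[]
    a, b = 0, 1
    while a < n
        push!(numbers, a)
        a, b = b, a + b
    end
    return numbers
end

fibonacci_below(10)  # [0, 1, 1, 2, 3, 5, 8]
```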
source
Mycelia.filesize_human_readableMethod
filesize_human_readable(f) -> Any

Gets the size of a file and returns it in a human-readable format.

Arguments

  • f: The path to the file, either as a String or an AbstractString.

Returns

A string representing the file size in a human-readable format (e.g., "3.40 MB").

Details

This function internally uses filesize(f) to get the file size in bytes, then leverages Base.format_bytes to convert it into a human-readable format with appropriate units (KB, MB, GB, etc.).

Examples

julia> filesize_human_readable("my_image.jpg")
"2.15 MB"

See Also

  • filesize: Gets the size of a file in bytes.
  • Base.format_bytes: Converts a byte count into a human-readable string.
source
Mycelia.finalize_assemblyFunction

Finalize assembly by combining information from all k-mer sizes. Phase 5.1b: Enhanced with accuracy metrics and reward tracking.

source
Mycelia.find_contigs_nextMethod
find_contigs_next(graph::MetaGraphsNext.MetaGraph, min_contig_length::Int=500) -> Vector{ContigPath}

Extract linear contigs from the assembly graph.

source
Mycelia.find_eulerian_paths_nextMethod
find_eulerian_paths_next(graph::MetaGraphsNext.MetaGraph) -> Vector{Vector{String}}

Find Eulerian paths in the assembly graph. An Eulerian path visits every edge exactly once.

source
Mycelia.find_fasta_filesMethod
find_fasta_files(input_path::String) -> Vector{String}

Find all FASTA files in a directory or return single file if path is a file.

Uses the existing FASTA_REGEX constant to identify FASTA files.

Arguments

  • input_path: Path to directory or single FASTA file

Returns

  • Vector of FASTA file paths

Example

fasta_files = find_fasta_files("./genomes/")
source
Mycelia.find_initial_kMethod

Find the optimal starting k-mer size using sparsity detection. Only considers prime k-mer sizes for optimal performance.

source
Mycelia.find_matching_prefixMethod
find_matching_prefix(
    filename1::String,
    filename2::String;
    strip_trailing_delimiters
) -> String

Find the longest common prefix between two filenames.

Arguments

  • filename1::String: First filename to compare
  • filename2::String: Second filename to compare

Keywords

  • strip_trailing_delimiters::Bool=true: If true, removes trailing dots, hyphens, and underscores from the result

Returns

  • String: The longest common prefix found between the filenames
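The behavior can be approximated in a few lines of Base Julia. This is a sketch under the assumptions stated in the docstring (trailing dots, hyphens, and underscores are stripped by default); `common_prefix` is a hypothetical name.

```julia
function common_prefix(a::String, b::String; strip_trailing_delimiters::Bool = true)
    prefix_chars = Char[]
    # Walk both names in lockstep until the first mismatch.
    for (x, y) in zip(a, b)
        x == y || break
        push!(prefix_chars, x)
    end
    prefix = String(prefix_chars)
    return strip_trailing_delimiters ? String(rstrip(prefix, ['.', '-', '_'])) : prefix
end

common_prefix("run1_1.fq", "run1_2.fq")  # "run1"
```

This is useful for deriving a shared sample name from paired read files.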
source
Mycelia.find_nonempty_columnsMethod
find_nonempty_columns(df) -> Any

Identify all columns that contain at least one value that is neither missing nor empty

Returns as a bit array

See also: drop_empty_columns, drop_empty_columns!

source
Mycelia.find_quality_weighted_pathMethod
find_quality_weighted_path(
    graph,
    start_vertex;
    max_path_length
) -> Vector

Find a quality-weighted path through a qualmer graph starting from a given vertex. Uses joint probability as the primary weighting factor for path selection.

Arguments

  • graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
  • start_vertex: Starting vertex for path traversal
  • max_path_length::Int=1000: Maximum path length to prevent infinite loops

Returns

  • Vector{Int}: Path as sequence of vertex indices

Details

At each step, selects the unvisited neighbor with the highest joint probability. Terminates when no unvisited neighbors are available or max length is reached.
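The greedy traversal described in the details can be sketched on a toy adjacency list. Everything here is an assumption made for illustration: the dictionaries stand in for the qualmer graph, `score` stands in for each vertex's joint probability, and `greedy_weighted_path` is a hypothetical name.

```julia
function greedy_weighted_path(neighbors::Dict{Int,Vector{Int}},
                              score::Dict{Int,Float64},
                              start::Int; max_path_length::Int = 1000)
    path = [start]
    visited = Set(path)
    while length(path) < max_path_length
        # Only unvisited neighbors of the current vertex are candidates.
        candidates = [v for v in get(neighbors, last(path), Int[]) if !(v in visited)]
        isempty(candidates) && break
        # Take the candidate with the highest score.
        next_vertex = candidates[argmax([score[v] for v in candidates])]
        push!(path, next_vertex)
        push!(visited, next_vertex)
    end
    return path
end

neighbors = Dict(1 => [2, 3], 2 => [4], 3 => [4], 4 => Int[])
score = Dict(1 => 0.9, 2 => 0.2, 3 => 0.8, 4 => 0.5)
greedy_weighted_path(neighbors, score, 1)  # [1, 3, 4]
```

A greedy choice like this is fast but not guaranteed to find the globally best path.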

source
Mycelia.find_resampling_stretchesMethod
find_resampling_stretches(
;
    record_kmer_solidity,
    solid_branching_kmer_indices
)

Identifies sequence regions that require resampling based on kmer solidity patterns.

Arguments

  • record_kmer_solidity::BitVector: Boolean array where true indicates solid kmers
  • solid_branching_kmer_indices::Vector{Int}: Indices of solid branching kmers

Returns

  • Vector{UnitRange{Int64}}: Array of ranges (start:stop) indicating stretches that need resampling

Details

Finds continuous stretches of non-solid kmers and extends them to the nearest solid branching kmers on either side. These stretches represent regions that need resampling.

If a stretch doesn't have solid branching kmers on both sides, it is excluded from the result. Duplicate ranges are removed from the final output.

source
Mycelia.find_true_rangesMethod
find_true_ranges(
    bool_vec::AbstractVector{Bool};
    min_length
) -> Vector

Finds contiguous ranges of true values in a boolean vector.

Arguments

  • bool_vec::AbstractVector{Bool}: Input boolean vector to analyze
  • min_length=1: Minimum length requirement for a range to be included

Returns

Vector of tuples (start, end) where each tuple represents the indices of a contiguous range of true values meeting the minimum length requirement.
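A single pass that tracks the start of the current run is enough to produce these tuples. The sketch below assumes a 1-indexed vector; `true_ranges` is a hypothetical name, not the packaged function.

```julia
function true_ranges(bool_vec::AbstractVector{Bool}; min_length::Int = 1)
    ranges = Tuple{Int,Int}[]
    run_start = 0  # 0 means "not currently inside a run of true values"
    for (i, value) in enumerate(bool_vec)
        if value && run_start == 0
            run_start = i
        elseif !value && run_start != 0
            # The run covered run_start:(i - 1); keep it if it is long enough.
            (i - run_start) >= min_length && push!(ranges, (run_start, i - 1))
            run_start = 0
        end
    end
    # Close a run that extends to the end of the vector.
    if run_start != 0 && (length(bool_vec) - run_start + 1) >= min_length
        push!(ranges, (run_start, length(bool_vec)))
    end
    return ranges
end

true_ranges([true, true, false, true, false, true, true, true])  # [(1, 2), (4, 4), (6, 8)]
```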

source
Mycelia.first_of_each_groupMethod
first_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Return a DataFrame containing the first row from each group in a GroupedDataFrame.

Arguments

  • gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

  • DataFrame: A new DataFrame containing first row from each group
source
Mycelia.frequency_matrix_to_jensen_shannon_distance_matrixMethod
frequency_matrix_to_jensen_shannon_distance_matrix(probability_matrix)

Pairwise Jensen-Shannon divergence between columns of probability_matrix.

Arguments

  • probability_matrix: Matrix where each column is a probability distribution (sums to 1.0).

Returns

  • Symmetric matrix of Jensen-Shannon divergence values between columns.
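For reference, the Jensen-Shannon divergence between two distributions p and q is the average of their Kullback-Leibler divergences to the midpoint m = (p + q) / 2. The sketch below uses base-2 logarithms, which bounds the result in [0, 1]; the logarithm base and all function names are assumptions, not taken from the package.

```julia
# Kullback-Leibler divergence in bits, skipping zero-probability terms of p.
kl_divergence(p, q) = sum(a * log2(a / b) for (a, b) in zip(p, q) if a > 0)

function js_divergence(p, q)
    m = (p .+ q) ./ 2
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2
end

# Pairwise divergences between the columns of a probability matrix.
function js_matrix(probability_matrix::AbstractMatrix)
    n = size(probability_matrix, 2)
    result = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        d = js_divergence(probability_matrix[:, i], probability_matrix[:, j])
        result[i, j] = d
        result[j, i] = d
    end
    return result
end
```

Taking the square root of each entry turns the divergence into the Jensen-Shannon distance, which is a true metric.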
source
Mycelia.gamma_pca_epcaMethod
gamma_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Gamma EPCA on a matrix M (features × samples).

When to use

Use for positive continuous data, such as rates, times, or strictly positive measurements.

Keyword arguments

  • k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

  • model : the fitted ExpFamilyPCA.GammaEPCA object
  • scores : k×n_samples matrix of sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.gaussian_pca_epcaMethod
gaussian_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Gaussian EPCA on a matrix M (features × samples).

When to use

Use for real-valued continuous data (centered, can be negative or positive), such as normalized or standardized measurements.

Keyword arguments

  • k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

  • model : the fitted ExpFamilyPCA.GaussianEPCA object
  • scores : k×n_samples matrix of sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.genbank_to_codon_frequenciesMethod
genbank_to_codon_frequencies(
    genbank;
    allow_all
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Int64}}

Analyze codon usage frequencies from genes in a GenBank file.

Arguments

  • genbank: Path to GenBank format file containing genomic sequences and annotations
  • allow_all: If true, initializes frequencies for all possible codons with count=1 (default: true)

Returns

Nested dictionary mapping amino acids to their corresponding codon usage counts:

  • Outer key: AminoAcid (including stop codon)
  • Inner key: DNACodon
  • Value: Count of codon occurrences

Details

  • Only processes genes marked as ':misc_feature' in the GenBank file
  • Analyzes both forward and reverse complement sequences
  • Determines coding strand based on presence of stop codons and start codons
  • Skips ambiguous sequences that cannot be confidently oriented
source
Mycelia.genbank_to_fastaMethod
genbank_to_fasta(; genbank, fasta, force)

Convert a GenBank format file to FASTA format using EMBOSS seqret.

Arguments

  • genbank: Path to input GenBank format file
  • fasta: Optional output FASTA file path (defaults to input path with .fna extension)
  • force: If true, overwrites existing output file (defaults to false)

Returns

Path to the output FASTA file

Notes

  • Requires EMBOSS suite (installed automatically via Conda)
  • Will not regenerate output if it already exists unless force=true
source
Mycelia.generate_all_possible_canonical_kmersMethod
generate_all_possible_canonical_kmers(k, alphabet) -> Any

Generate all possible canonical k-mers of length k from the given alphabet.

For DNA/RNA sequences, returns unique canonical k-mers where each k-mer is represented by the lexicographically smaller of itself and its reverse complement. For amino acid sequences, returns all possible k-mers without canonicalization.

Arguments

  • k: Length of k-mers to generate
  • alphabet: Vector of BioSymbols (DNA, RNA or AminoAcid)

Returns

  • Vector of k-mers, canonicalized for DNA/RNA alphabets
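
The enumeration and canonicalization steps can be illustrated with plain strings instead of BioSequences types. This is a simplified sketch under that assumption; all names below are hypothetical and the DNA alphabet is hard-coded.

```julia
const DNA_COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

reverse_complement(kmer::AbstractString) =
    join(DNA_COMPLEMENT[base] for base in reverse(kmer))

# Every k-mer over the alphabet, in sorted order.
function all_kmers(k::Int, alphabet=['A', 'C', 'G', 'T'])
    kmers = vec([join(t) for t in Iterators.product(fill(alphabet, k)...)])
    return sort!(kmers)
end

# A k-mer's canonical form is the lexicographically smaller of itself
# and its reverse complement.
canonical(kmer::AbstractString) = min(kmer, reverse_complement(kmer))

all_canonical_kmers(k::Int) = unique!(sort!(canonical.(all_kmers(k))))

length(all_kmers(2))            # 4^2 = 16 k-mers
length(all_canonical_kmers(2))  # 10 canonical k-mers
```

For k = 2 there are 16 k-mers, of which 4 (AT, TA, CG, GC) are their own reverse complement, leaving (16 + 4) / 2 = 10 canonical forms.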
source
Mycelia.generate_all_possible_kmersMethod
generate_all_possible_kmers(k, alphabet) -> Any

Generate a sorted list of all possible k-mers for a given alphabet.

Arguments

  • k::Integer: Length of k-mers to generate
  • alphabet: Collection of symbols (DNA, RNA, or amino acids) from BioSymbols

Returns

  • Sorted Vector of Kmers of the appropriate type (DNA, RNA, or amino acid)
source
Mycelia.generate_and_save_kmer_countsMethod
generate_and_save_kmer_counts(; 
    bioalphabet, 
    fastas, 
    k, 
    output_dir=pwd(),
    filename=nothing
)

Generates and saves k-mer counts for a list of FASTA files for a single k.

Keyword Arguments

  • bioalphabet: Alphabet type (e.g., :DNA).
  • fastas: List of FASTA file paths.
  • k: Value of k (e.g., 9).
  • output_dir: (optional) Directory to write output files (default: current directory).
  • filename: (optional) Full file name for output (default: "{Mycelia.normalized_current_date()}.{lowercase(string(bioalphabet))}{k}mers.jld2").

Output

Saves a .jld2 file with the specified file name in output_dir if it does not already exist.

source
Mycelia.generate_binary_matrixMethod

Generate a binary (Bernoulli) matrix with given dimensions and probability.

Arguments

  • n_features::Int: Number of features (rows)
  • n_samples::Int: Number of samples (columns)
  • p::Float64: Probability of 1 in the Bernoulli distribution

Returns

  • Matrix{Bool}: Binary matrix with dimensions (n_features, n_samples)
source
Mycelia.generate_paired_end_readsMethod
generate_paired_end_reads(reference_seq, coverage, read_length, insert_size; error_rate=0.01)

Generate realistic paired-end sequencing reads from a reference sequence.

Simulates paired-end Illumina sequencing with realistic insert sizes, read lengths, and optional sequencing errors for assembly benchmarking.

Arguments

  • reference_seq: Reference sequence (BioSequences.LongDNA{4})
  • coverage: Target sequencing coverage depth
  • read_length: Length of each read in base pairs
  • insert_size: Insert size between paired reads
  • error_rate: Sequencing error rate (default: 0.01)

Returns

  • Tuple of (forward_reads, reverse_reads) as vectors of BioSequences.LongDNA{4}

See Also

  • simulate_illumina_paired_reads: For more sophisticated read simulation using ART
  • introduce_sequencing_errors: For adding realistic sequencing errors
source
Mycelia.generate_poisson_matrixMethod

Generate a Poisson matrix with given dimensions and rate parameter.

Arguments

  • n_features::Int: Number of features (rows)
  • n_samples::Int: Number of samples (columns)
  • λ::Float64: Rate parameter for the Poisson distribution

Returns

  • Matrix{Int}: Poisson matrix with dimensions (n_features, n_samples)
source
Mycelia.generate_polished_sequenceMethod
generate_polished_sequence(states::Vector{ViterbiState}, observations::Vector{String}, 
                          config::ViterbiConfig) -> (String, Vector{Tuple{Int, String, String}})

Generate polished sequence and track corrections made.

source
Mycelia.generate_taxa_abundances_plotMethod
generate_taxa_abundances_plot(
    joint_reads_to_taxon_lineage_table::DataFrames.DataFrame;
    taxa_level::String,
    top_n::Int = 30,
    kwargs...
)

Convenience wrapper function to generate taxa abundance visualization with default parameters and save to a file if requested.

Arguments

  • joint_reads_to_taxon_lineage_table: DataFrame with sample_id and taxonomic assignments
  • taxa_level: Taxonomic level to analyze
  • top_n: Number of top taxa to display individually
  • kwargs...: Additional parameters to pass to plot_taxa_abundances

Returns

  • fig: CairoMakie figure object
  • ax: CairoMakie axis object
  • taxa_colors: Dictionary mapping taxa to their assigned colors
source
Mycelia.generate_test_fastq_dataMethod
generate_test_fastq_data(n_reads::Int, read_length::Int, filename::String)

Generate test FASTQ data for benchmarking purposes.

Creates a FASTQ file with random DNA sequences and realistic quality scores suitable for performance testing and validation.

Arguments

  • n_reads::Int: Number of reads to generate
  • read_length::Int: Length of each read in base pairs
  • filename::String: Output filename for the FASTQ file

Details

  • Generates random DNA sequences using BioSequences.randdnaseq
  • Assigns realistic quality scores (Phred+33 encoding, range 20-40)
  • Uses existing Mycelia I/O functions for consistency

See Also

  • Mycelia.write_fastq: For writing FASTQ records
  • Mycelia.fastq_record: For creating FASTQ records
  • Mycelia.simulate_illumina_paired_reads: For more sophisticated read simulation
source
Mycelia.generate_test_genome_with_genesFunction
generate_test_genome_with_genes(genome_size, gene_density=0.02)

Generate a test genome with simulated gene positions for annotation benchmarking.

Creates a random DNA sequence with estimated gene positions based on gene density, suitable for testing gene prediction algorithms.

Arguments

  • genome_size: Size of the genome in base pairs
  • gene_density: Proportion of genome that consists of genes (default: 0.02)

Returns

  • Tuple of (genome_sequence, gene_positions) where gene_positions is a vector of (start, end) tuples

See Also

  • random_fasta_record: For generating random FASTA sequences
  • save_genome_as_fasta: For saving genomes to FASTA format
source
Mycelia.generate_test_sequencesFunction
generate_test_sequences(genome_size::Int, n_sequences::Int=1)

Generate test DNA sequences for k-mer analysis benchmarking.

Creates random DNA sequences suitable for k-mer counting and analysis performance testing.

Arguments

  • genome_size::Int: Size of each generated sequence in base pairs
  • n_sequences::Int: Number of sequences to generate (default: 1)

Returns

  • Vector of BioSequences.LongDNA{4} sequences

See Also

  • random_fasta_record: For generating FASTA records with random sequences
  • BioSequences.randdnaseq: For generating random DNA sequences
source
Mycelia.generate_test_sequencesMethod
generate_test_sequences(
    config::Mycelia.BenchmarkConfig
) -> Vector{FASTX.FASTA.Record}

Generate synthetic DNA sequences for benchmarking.

Arguments

  • config: BenchmarkConfig with test parameters

Returns

  • Vector of FASTA records for testing
source
Mycelia.generate_training_datasetsMethod
generate_training_datasets(; n_datasets=20, genome_sizes=[10000, 50000, 100000], 
                          error_rates=[0.001, 0.01, 0.05], coverage_levels=[20, 30, 50])

Generate diverse training datasets for RL agent training.

This function creates simulated genomic datasets with varying characteristics to provide comprehensive training scenarios for the RL agent.

Arguments

  • n_datasets::Int: Total number of datasets to generate (default: 20)
  • genome_sizes::Vector{Int}: Range of genome sizes to simulate (default: [10K, 50K, 100K])
  • error_rates::Vector{Float64}: Range of sequencing error rates (default: [0.1%, 1%, 5%])
  • coverage_levels::Vector{Int}: Range of coverage depths (default: [20x, 30x, 50x])

Returns

  • Vector{String}: Paths to generated training FASTQ files

Example

training_files = generate_training_datasets(n_datasets=50, genome_sizes=[50000, 100000, 200000])
source
Mycelia.generate_transterm_coordinates_from_fastaMethod
generate_transterm_coordinates_from_fasta(fasta) -> Any

Generate minimal coordinate files required for TransTermHP analysis from FASTA sequences.

Creates artificial gene annotations at sequence boundaries to enable TransTermHP to run without real gene annotations. For each sequence in the FASTA file, generates two minimal placeholder "genes" at positions 1-2 and (L-1)-L, where L is the sequence length.

Arguments

  • fasta: Path to input FASTA file containing sequences to analyze

Returns

  • Path to generated coordinate file (original path with ".coords" extension)

Format

Generated coordinate file follows TransTermHP format: gene_id start stop chromosome

where chromosome matches FASTA sequence identifiers.

See also: run_transterm

source
Mycelia.generate_transterm_coordinates_from_gffMethod
generate_transterm_coordinates_from_gff(gff_file) -> Any

Convert a GFF file to a coordinates file compatible with TransTermHP format.

Arguments

  • gff_file::String: Path to input GFF file

Processing

  • Converts 1-based to 0-based coordinates
  • Extracts gene IDs from the attributes field
  • Retains columns: gene_id, start, end, seqid

Returns

  • Path to the generated coordinates file (original filename with '.coords' suffix)

Output Format

Space-delimited file with columns: gene_id, start, end, seqid

source
Mycelia.get_base_extensionMethod
get_base_extension(filename::String) -> String

Extract the base file extension from a filename, handling compressed files.

For regular files, returns the last extension. For gzipped files, returns the extension before .gz.
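
A minimal sketch of this rule using Base.splitext. The name base_extension is hypothetical, and returning the extension with its leading dot is an assumption, since the docstring does not say which form the package returns.

```julia
function base_extension(filename::AbstractString)
    stem, ext = splitext(filename)
    # For gzipped files, look one level further in for the underlying extension.
    return ext == ".gz" ? splitext(stem)[2] : ext
end

base_extension("genome.fasta")     # ".fasta"
base_extension("reads.fastq.gz")   # ".fastq"
```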

source
Mycelia.get_biosequence_alphabetMethod
get_biosequence_alphabet(s::BioSequences.BioSequence) -> Any

Return the alphabet associated with a BioSequence type.

Arguments

  • s::BioSequences.BioSequence: A subtype instance.

Returns

BioSymbols.Alphabet of the sequence type.

source
Mycelia.get_correct_qualityMethod
get_correct_quality(tech::Symbol, pos::Int, read_length::Int) -> Int

Simulates a Phred quality score (using the Sanger convention) for a correctly observed base. For Illumina, the quality score is modeled to decay linearly from ~40 at the start to ~20 at the end of the read. For other technologies, the score is sampled from a normal distribution with parameters typical for that platform.

Returns an integer quality score.
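
The Illumina case can be pictured as a straight line from 40 down to 20 across the read. The sketch below shows only that deterministic trend; the function name is hypothetical, and any noise or clamping applied by the package is not modeled here.

```julia
# Linear decay from Q40 at the first base to Q20 at the last base.
function illumina_quality_trend(pos::Int, read_length::Int)
    read_length == 1 && return 40
    fraction = (pos - 1) / (read_length - 1)   # 0.0 at first base, 1.0 at last
    return round(Int, 40 - 20 * fraction)
end

illumina_quality_trend(1, 151)     # 40 at the start of the read
illumina_quality_trend(76, 151)    # 30 at the midpoint
illumina_quality_trend(151, 151)   # 20 at the end of the read
```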

source
Mycelia.get_error_qualityMethod
get_error_quality(tech::Symbol) -> Int

Simulates a Phred quality score (using the Sanger convention) for a base observed with an error. Error bases are assigned lower quality scores than correctly observed bases. For Illumina, scores typically range between 5 and 15; for nanopore and pacbio, slightly lower values are used; and for ultima, a modest quality score is assigned.

Returns an integer quality score.

source
Mycelia.get_fastq_contigsMethod
get_fastq_contigs(result::AssemblyResult) -> Vector{FASTX.FASTQ.Record}

Extract quality-aware FASTQ contigs from assembly result if available. Returns empty vector if no quality information was preserved during assembly.

source
Mycelia.get_genbankMethod
get_genbank(
;
    db,
    accession,
    ftp
) -> Union{Nothing, GenomicAnnotations.GenBank.Reader}

Retrieve GenBank records from NCBI or directly from an FTP site.

Arguments

  • db::String: NCBI database to query ("nuccore" for nucleotide or "protein" for protein sequences)
  • accession::String: NCBI accession number for the sequence
  • ftp::String: Direct FTP URL to a GenBank file (gzipped)

Returns

  • GenomicAnnotations.GenBank.Reader: A reader object containing the GenBank record

Details

When using NCBI queries (db and accession), the function sleeps for 0.5 s per request, which caps the request rate at 2 per second and keeps it within NCBI's API rate limits.

source
Mycelia.get_gffMethod
get_gff(; db, accession, ftp) -> Any

Retrieve GFF3 formatted genomic feature data from NCBI or direct FTP source.

Arguments

  • db::String: NCBI database to query ("nuccore" for DNA or "protein" for protein sequences)
  • accession::String: NCBI accession number
  • ftp::String: Direct FTP URL to GFF3 file (typically gzipped)

Returns

  • IO: IOBuffer containing uncompressed GFF3 data
source
Mycelia.get_kmer_indexMethod
get_kmer_index(kmers, kmer) -> Any

Returns the index position of a given k-mer in a sorted list of k-mers.

Arguments

  • kmers: A sorted vector of k-mers to search within
  • kmer: The k-mer sequence to find

Returns

Integer index position where kmer is found in kmers

Throws

  • AssertionError: If the k-mer is not found in the list
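
Because the list is sorted, the lookup can use binary search. The following sketch assumes that approach and uses plain strings as k-mers; kmer_index is a hypothetical name, and the package may implement the search differently.

```julia
function kmer_index(kmers::AbstractVector, kmer)
    # Binary search for the first position whose element is >= kmer.
    i = searchsortedfirst(kmers, kmer)
    @assert i <= length(kmers) && kmers[i] == kmer "k-mer $kmer not found"
    return i
end

sorted_kmers = ["AA", "AC", "AG", "AT"]
kmer_index(sorted_kmers, "AG")   # 3
```

Binary search keeps each lookup at O(log n), which matters when indexing into all 4^k possible k-mers.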
source
Mycelia.get_phred_scoresMethod
get_phred_scores(
    record::FASTX.FASTQ.Record
) -> Vector{UInt8}

Get numerical PHRED quality scores from a FASTQ record.

This is a convenience wrapper around FASTX.quality_scores() that returns the quality scores as a Vector{UInt8} representing PHRED scores.

Arguments

  • record::FASTX.FASTQ.Record: FASTQ record to extract quality scores from

Returns

  • Vector{UInt8}: PHRED quality scores (0-based, where 0 = lowest quality, 40+ = highest quality)

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
scores = get_phred_scores(record)  # Returns [40, 40, 40, 40]
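
For reference, the same decoding can be done by hand without FASTX: under Phred+33 encoding, each quality character's ASCII code minus 33 is the score. The helper names below are hypothetical and are not part of Mycelia or FASTX.

```julia
# Decode a Phred+33 quality string into numeric scores (0x21 == 33).
phred33(quality::AbstractString) = [UInt8(c) - 0x21 for c in quality]

# Convert a Phred score to an error probability: P = 10^(-Q / 10).
# Int(q) avoids unsigned wrap-around when negating a UInt8.
error_probability(q::Integer) = 10.0^(-Int(q) / 10)

phred33("IIII")              # four scores of 40
phred33("!5I")               # scores 0, 20, 40
error_probability(0x14)      # Q20 corresponds to roughly 0.01
```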
source
Mycelia.get_sequenceMethod
get_sequence(
;
    db,
    accession,
    ftp
) -> Union{Nothing, FASTX.FASTA.Reader}

Retrieve FASTA format sequences from NCBI databases (DNA with db = "nuccore", protein with db = "protein") or directly from an FTP URL.

Arguments

  • db::String: NCBI database type ("nuccore" for DNA or "protein" for protein sequences)
  • accession::String: NCBI sequence accession number
  • ftp::String: Direct FTP URL to a FASTA file (alternative to db/accession pair)

Returns

  • FASTX.FASTA.Reader: Reader object containing the requested sequence(s)
source
Mycelia.get_viroid_species_listMethod
get_viroid_species_list() -> Vector{String}

Get a comprehensive list of well-characterized viroid species for reference data download.

Returns

  • Vector{String}: List of viroid species names suitable for NCBI searches

Examples

viroid_species = get_viroid_species_list()
for species in viroid_species[1:5]  # Download first 5 species
    download_viroid_reference_data(species, "viroid_references/$species/")
end
source
Mycelia.gfa_to_fastaMethod
gfa_to_fasta(; gfa, fasta)

Convert a GFA (Graphical Fragment Assembly) file to FASTA format.

Arguments

  • gfa::String: Path to input GFA file
  • fasta::String=gfa * ".fna": Path for output FASTA file. Defaults to input filename with ".fna" extension

Returns

  • String: Path to the generated FASTA file

Details

Uses gfatools (via Conda) to perform the conversion. The function will:

  1. Ensure gfatools is available in the Conda environment
  2. Execute the conversion using gfatools gfa2fa
  3. Write sequences to the specified FASTA file
source
Mycelia.gfa_to_structure_tableMethod
gfa_to_structure_table(
    gfa
) -> NamedTuple{(:contig_table, :records), <:Tuple{DataFrames.DataFrame, Any}}

Convert a GFA (Graphical Fragment Assembly) file into a structured representation.

Arguments

  • gfa: Path to GFA file or GFA content as string

Returns

Named tuple containing:

  • contig_table: DataFrame with columns:
    • connected_component: Integer ID for each component
    • contigs: Comma-separated list of contig IDs
    • is_circular: Boolean indicating if component forms a cycle
    • is_closed: Boolean indicating if single contig forms a cycle
    • lengths: Comma-separated list of contig lengths
  • records: FASTA records from the GFA
source
Mycelia.githashMethod
githash(; short) -> SubString{String}

Returns the current git commit hash of the repository.

Arguments

  • short::Bool=false: If true, returns abbreviated 8-character hash

Returns

A string containing the git commit hash (full 40 characters by default)

source
Mycelia.graph_to_gfaMethod
graph_to_gfa(; graph, outfile)

Convert a Mycelia graph to GFA (Graphical Fragment Assembly) format.

Writes a graph to GFA format, including:

  • Header (H) line with GFA version
  • Segment (S) lines for each vertex with sequence and depth
  • Link (L) lines for edges with overlap size and orientations

Arguments

  • graph: MetaGraph containing sequence vertices and their relationships
  • outfile: Path where the GFA file should be written

Returns

  • Path to the written GFA file
source
Mycelia.has_quality_informationMethod
has_quality_information(result::AssemblyResult) -> Bool

Check if the assembly result preserves quality information from the original reads.

source
Mycelia.hclust_to_metagraphMethod
hclust_to_metagraph(
    hcl::Clustering.Hclust
) -> MetaGraphs.MetaDiGraph{Int64, Float64}

Convert a hierarchical clustering tree into a directed metagraph representation.

Arguments

  • hcl::Clustering.Hclust: Hierarchical clustering result object

Returns

  • MetaGraphs.MetaDiGraph: Directed graph with metadata representing the clustering hierarchy

Graph Properties

The resulting graph contains the following vertex properties:

  • :hclust_id: String identifier for each node
  • :height: Height/distance at each merge point (0.0 for leaves)
  • :x: Horizontal position for visualization (0-1 range)
  • :y: Vertical position based on normalized height
  • :hcl: Original clustering object (stored as graph property)
source
Mycelia.heirarchically_cluster_distance_matrixMethod
heirarchically_cluster_distance_matrix(
    distance_matrix
) -> Clustering.Hclust

Performs hierarchical clustering on a distance matrix using Ward's linkage method.

Arguments

  • distance_matrix::Matrix{<:Real}: A symmetric distance/dissimilarity matrix

Returns

  • Clustering.Hclust: A hierarchical clustering object from Clustering.jl

Details

Uses Ward's method (minimum variance) for clustering, which:

  • Minimizes total within-cluster variance
  • Produces compact, spherical clusters
  • Works well for visualization in radial layouts
source
Mycelia.identify_optimal_number_of_clustersMethod
identify_optimal_number_of_clusters(
    distance_matrix;
    min_k,
    max_k
) -> Any

Identifies the optimal number of clusters by hierarchically clustering the distance matrix and choosing the cluster count that maximizes the average silhouette score. Progress is displayed during the search.

Uses Clustering.clustering_quality for score calculation.

Arguments

  • distance_matrix: A square matrix of pairwise distances between items.

Returns

A tuple containing:

  • hcl: The hierarchical clustering result object.
  • optimal_number_of_clusters: The inferred optimal number of clusters (k).

source
Mycelia.identify_potential_errorsMethod
identify_potential_errors(
    graph;
    min_coverage,
    min_quality,
    min_confidence
) -> Vector{Int64}

Identify potential sequencing errors based on quality scores and coverage patterns.

Arguments

  • graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
  • min_coverage::Int=2: Minimum coverage for reliable k-mers
  • min_quality::Float64=20.0: Minimum mean quality score
  • min_confidence::Float64=0.95: Minimum joint probability threshold

Returns

  • Vector{Int}: Vertex indices of potential error k-mers

Details

Identifies k-mers that are likely errors based on:

  • Low coverage (singleton or few observations)
  • Low quality scores
  • Low joint probability (low confidence)
source
Mycelia.ids_to_accessionsMethod
ids_to_accessions(
    ids::Vector{String},
    database::String
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}

Convert NCBI sequence IDs to accession numbers.

source
Mycelia.improve_read_likelihoodMethod

Improve likelihood of a single read using maximum likelihood path finding. Returns improved read and boolean indicating if improvement was made.

source
Mycelia.improve_read_set_likelihoodMethod

Improve likelihood of entire read set using current graph and k-mer size. Returns updated reads and count of improvements made. Uses memory-efficient batch processing for large datasets.

source
Mycelia.include_all_filesMethod
include_all_files(dir::AbstractString; pattern)

Recursively include all files matching a pattern in a directory and its subdirectories.

Arguments

  • dir::AbstractString: Directory path to search recursively
  • pattern::Regex=r"\.jl$": Regular expression pattern to match files (defaults to .jl files)

Details

Files are processed in sorted order within each directory. This is useful for loading test files, examples, or other Julia modules in a predictable order.

Examples

# Include all Julia files in a directory tree
include_all_files("test/modules")

# Include all text files
include_all_files("docs"; pattern=r"\.txt$")
source
Mycelia.install_hashdeepMethod
install_hashdeep() -> Union{Nothing, Base.Process}

Ensures the hashdeep utility is installed on the system.

Checks if hashdeep is available in PATH and attempts to install it via apt package manager if not found. Will try with sudo privileges first, then without sudo if that fails.

Details

  • Checks PATH for existing hashdeep executable
  • Attempts installation using apt package manager
  • Requires a Debian-based Linux distribution

Returns

  • Nothing, but prints status messages during execution
source
Mycelia.introduce_sequencing_errorsMethod
introduce_sequencing_errors(sequence, error_rate)

Introduce realistic sequencing errors into a DNA sequence.

Simulates substitutions (70%), insertions (15%), and deletions (15%) at the specified error rate for realistic sequencing simulation.
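
The 70/15/15 split can be sketched on a plain string as follows. This is an illustration under simplifying assumptions: the name add_errors and the rng keyword are hypothetical, the sequence is a String rather than a BioSequences.LongDNA{4}, and an insertion is placed after the affected base.

```julia
import Random

function add_errors(sequence::AbstractString, error_rate::Real; rng=Random.default_rng())
    bases = ['A', 'C', 'G', 'T']
    out = IOBuffer()
    for base in sequence
        if rand(rng) >= error_rate
            print(out, base)                 # no error at this position
            continue
        end
        r = rand(rng)
        if r < 0.70
            # substitution: replace with one of the three other bases
            print(out, rand(rng, filter(!=(base), bases)))
        elseif r < 0.85
            # insertion: keep the base and add a random extra base after it
            print(out, base, rand(rng, bases))
        end
        # otherwise (remaining 15%): deletion, so nothing is written
    end
    return String(take!(out))
end

reference = "ACGT"^5
add_errors(reference, 0.01; rng=Random.MersenneTwister(1))
```

Passing an explicit random number generator makes a simulated data set reproducible across runs.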

Arguments

  • sequence: Input DNA sequence (BioSequences.LongDNA{4})
  • error_rate: Probability of error per base

Returns

  • Modified sequence with introduced errors (BioSequences.LongDNA{4})

See Also

  • observe: For more sophisticated error modeling with quality scores
  • mutate_string: For string-based mutation operations
source
Mycelia.introduce_sequencing_errorsMethod
introduce_sequencing_errors(reads::Vector, error_rate::Float64)

Introduce realistic sequencing errors into a set of reads.

Arguments

  • reads::Vector: Vector of FASTQ records
  • error_rate::Float64: Error rate (0.0 to 1.0)

Returns

  • Vector: Reads with introduced errors

Example

error_reads = introduce_sequencing_errors(clean_reads, 0.01)  # 1% error rate
source
Mycelia.is_equivalentMethod
is_equivalent(a, b) -> Any

Check if two biological sequences are equivalent, considering both direct and reverse complement matches.

Arguments

  • a: First biological sequence (BioSequence or compatible type)
  • b: Second biological sequence (BioSequence or compatible type)

Returns

  • Bool: true if sequences are identical or if one is the reverse complement of the other, false otherwise
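
A string-based sketch of the same check, assuming an unambiguous DNA alphabet. The names below are hypothetical; the package operates on BioSequences types rather than strings.

```julia
const BASE_PAIRS = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Complement each base, then reverse the result.
revcomp(seq::AbstractString) = reverse(map(base -> BASE_PAIRS[base], seq))

# Equivalent if identical, or if one is the reverse complement of the other.
equivalent(a::AbstractString, b::AbstractString) = a == b || a == revcomp(b)

equivalent("AACG", "AACG")   # true: identical
equivalent("AACG", "CGTT")   # true: reverse complements
equivalent("AACG", "AACC")   # false
```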
source
Mycelia.is_legacy_graphMethod
is_legacy_graph(graph) -> Bool

Check if a graph is using the legacy MetaGraphs format.

Arguments

  • graph: Graph to check

Returns

  • Bool: true if legacy format, false if next-generation format
source
Mycelia.isolate_normalized_primary_contigMethod
isolate_normalized_primary_contig(
    assembled_fasta,
    assembled_gfa,
    qualimap_report_txt,
    identifier,
    k::Int64;
    primary_contig_fasta
) -> String

The primary contig is defined as the contig with the most bases mapped to it.

When picking out phage from metagenomic assemblies, the longest contig is often bacterial, whereas the highest-coverage contigs are often primer-dimers or other PCR amplification artifacts.

The contig with the most mapped bases, computed as length × depth, is therefore selected as the phage.

Isolates and exports the primary contig from an assembly based on coverage depth × length.

The primary contig is defined as the contig with the highest total mapped bases (coverage depth × length). This method helps identify potential phage contigs in metagenomic assemblies, avoiding both long bacterial contigs and short high-coverage PCR artifacts.

Arguments

  • assembled_fasta: Path to the assembled contigs in FASTA format
  • assembled_gfa: Path to the assembly graph in GFA format
  • qualimap_report_txt: Path to Qualimap coverage report
  • identifier: String identifier for the output file
  • k: Integer representing k-mer size used in assembly
  • primary_contig_fasta: Optional output filepath (default: "{identifier}.primary_contig.fna")

Returns

  • Path to the output FASTA file containing the primary contig

Notes

  • For circular contigs, removes the k-mer closure scar if detected
  • Trims k bases from the end if they match the first k bases
  • Uses coverage × length to avoid both long bacterial contigs and short PCR artifacts
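
The selection rule and the scar trimming can be sketched independently of the file parsing. Everything below is illustrative: the contig records use made-up numbers, and the helper names are hypothetical rather than part of Mycelia.

```julia
# Made-up contigs for illustration only.
contigs = [
    (id = "contig_1", length = 4_500_000, depth = 12.0),     # long bacterial contig
    (id = "contig_2", length = 42_000,    depth = 2_500.0),  # candidate phage
    (id = "contig_3", length = 120,       depth = 90_000.0), # short high-coverage artifact
]

# Total mapped bases is approximated as length × mean depth.
mapped_bases(contig) = contig.length * contig.depth
primary = contigs[argmax(mapped_bases.(contigs))]

# Remove the closure scar: if the last k bases repeat the first k bases,
# drop the trailing copy.
function trim_closure_scar(seq::AbstractString, k::Int)
    if length(seq) > 2k && seq[1:k] == seq[end-k+1:end]
        return seq[1:end-k]
    end
    return seq
end

trim_closure_scar("ACGTTTGGACG", 3)   # "ACGTTTGG"
```

Here contig_2 wins with 1.05e8 mapped bases, ahead of the long low-depth contig (5.4e7) and the short high-depth one (1.08e7).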
source
Mycelia.iterative_polishingFunction
iterative_polishing(
    fastq
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k,
    plot
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})

Performs iterative error correction on FASTQ sequences using progressively larger k-mer sizes.

Starting with the default k-mer size, this function repeatedly applies polishing steps, incrementing the k-mer size until either reaching max_k or encountering instability.

Arguments

  • fastq: Path to input FASTQ file or FastqRecord object
  • max_k: Maximum k-mer size to attempt (default: 89)
  • plot: Whether to generate diagnostic plots (default: false)

Returns

Vector of polishing results, where each element contains:

  • k: k-mer size used
  • fastq: resulting polished sequences
source
Mycelia.jaccard_distanceMethod

Compute the Jaccard distance between columns of a binary matrix.

Arguments

  • M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples

Returns

  • Matrix{Float64}: Symmetric distance matrix with Jaccard distances
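
A minimal sketch of the column-wise computation, treating any nonzero entry as presence. The name jaccard_matrix is hypothetical, and assigning a distance of 0.0 to two all-zero columns is an assumption made here to avoid dividing by zero.

```julia
function jaccard_matrix(M::AbstractMatrix{<:Integer})
    n = size(M, 2)
    D = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        a = M[:, i] .> 0
        b = M[:, j] .> 0
        union_size = count(a .| b)
        # Distance is 1 - |A ∩ B| / |A ∪ B|.
        D[i, j] = D[j, i] = union_size == 0 ? 0.0 : 1 - count(a .& b) / union_size
    end
    return D
end

# Rows are features, columns are samples.
M = [1 1 0;
     1 0 0;
     0 1 1]
D = jaccard_matrix(M)
```

Samples 1 and 3 share no features, so their distance is 1.0; samples 2 and 3 share one of two features, giving 0.5.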
source
Mycelia.jaccard_distanceMethod
jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$

Arguments

  • set1: First set to compare
  • set2: Second set to compare

Returns

  • Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
source
Mycelia.jaccard_similarityMethod
jaccard_similarity(set1, set2) -> Any

Compute the Jaccard similarity coefficient between two sets.

The Jaccard similarity is defined as the size of the intersection divided by the size of the union of two sets:

J(A,B) = |A ∩ B| / |A ∪ B|

Arguments

  • set1: First set for comparison
  • set2: Second set for comparison

Returns

  • Float64: A value between 0.0 and 1.0, where:
    • 1.0 indicates identical sets
    • 0.0 indicates completely disjoint sets
source
Mycelia.jellyfish_countMethod
jellyfish_count(
;
    fastx,
    k,
    threads,
    max_mem,
    canonical,
    outfile,
    conda_check
)

Count k-mers in a FASTA/FASTQ file using Jellyfish.

Arguments

  • fastx::String: Path to input FASTA/FASTQ file (can be gzipped)
  • k::Integer: k-mer length
  • threads::Integer=Sys.CPU_THREADS: Number of threads to use
  • max_mem::Integer=Int(Sys.free_memory()): Maximum memory in bytes (defaults to system free memory)
  • canonical::Bool=false: Whether to count canonical k-mers (both strands combined)
  • outfile::String=auto: Output filename (auto-generated based on input and parameters)
  • conda_check::Bool=true: Whether to verify Jellyfish conda installation

Returns

  • String: Path to gzipped TSV file containing k-mer counts
source
Mycelia.jellyfish_counts_to_kmer_frequency_histogramFunction
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file,
    outfile
) -> Any

Convert a Jellyfish k-mer count file into a frequency histogram.

Arguments

  • jellyfish_counts_file::String: Path to the gzipped TSV file containing Jellyfish k-mer counts
  • outfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv"): Optional output file path

Returns

  • String: Path to the generated histogram file

Description

Processes a Jellyfish k-mer count file to create a frequency histogram where:

  • Column 1: Number of k-mers that share the same count
  • Column 2: The count they share

Uses system sorting with LC_ALL=C for optimal performance on large files.
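
The histogram itself is a count of counts. The in-memory sketch below shows the transformation on a small made-up dictionary; the package works on files with command line tools instead, and the name count_histogram and the ascending row order are assumptions.

```julia
function count_histogram(kmer_counts::AbstractDict)
    tally = Dict{Int,Int}()
    for c in values(kmer_counts)
        tally[c] = get(tally, c, 0) + 1   # how many k-mers were seen c times
    end
    # One row per distinct count: (number of k-mers, the count they share).
    return [(n_kmers = tally[c], count = c) for c in sort!(collect(keys(tally)))]
end

# Made-up k-mer counts for illustration.
kmer_counts = Dict("AAA" => 1, "AAC" => 1, "ACG" => 3, "CGT" => 3, "GTT" => 3, "TTT" => 7)
count_histogram(kmer_counts)
```

Two k-mers occur once, three occur three times, and one occurs seven times, giving three histogram rows.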

Notes

  • Requires gzip, sort, uniq, and sed command line tools
  • Uses intermediate disk storage for sorting large files
  • Skips processing if output file already exists
source
Mycelia.jitterMethod
jitter(x, n) -> Any

Add random noise to create a vector of jittered values.

Generates n values by adding random noise to the input value x. The noise is uniformly distributed between -1/3 and 1/3.

Arguments

  • x: Base value to add jitter to
  • n: Number of jittered values to generate

Returns

  • Vector of length n containing jittered values around x
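
One way to produce such values with Base Julia is shown below. The name jitter_sketch is hypothetical and the package's exact expression may differ.

```julia
# rand() is uniform on [0, 1), so rand() * 2/3 - 1/3 is uniform on [-1/3, 1/3).
jitter_sketch(x, n) = [x + rand() * 2 / 3 - 1 / 3 for _ in 1:n]

values = jitter_sketch(2.0, 5)   # five values scattered around 2.0
```

This is typically used to spread overlapping points in a categorical scatter plot without moving them into a neighbouring category.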
source
Mycelia.join_fastqs_with_uuidMethod
join_fastqs_with_uuid(
    fastq_files::Vector{String};
    fastq_out::String,
    tsv_out::String
)

Designed primarily to allow joint mapping of many long-read samples. Paired-end relationships are not tracked; all reads are assumed to be single-end.

Given a collection of FASTQ files, creates:

  • A gzipped TSV mapping original file and read_id to a new UUID per read
  • A gzipped joint FASTQ file with the new UUID as the read header

Returns

  • Tuple of output file paths (tsv_out, fastq_out)

source
Mycelia.joint_base_quality_scoreMethod
joint_base_quality_score(
    error_probabilities::Vector{Float64}
) -> Float64

Calculate the quality score for a single base given multiple observations.

This function implements the "Converting to Error Probabilities and Combining" method:

  1. Takes error probabilities from multiple reads covering the same base
  2. Calculates probability of ALL reads being wrong by multiplying probabilities
  3. Calculates final Phred score from this combined probability

To avoid numerical underflow with very small probabilities, the calculation is performed in log space.

Arguments

  • error_probabilities::Vector{Float64}: Vector of error probabilities from multiple reads covering the same base position

Returns

  • Float64: Phred quality score representing the combined confidence
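
The three steps above can be sketched as follows. `joint_phred_sketch` is a hypothetical name, and the sketch assumes the observations are independent.

```julia
# Hypothetical sketch: combine per-read error probabilities in log space.
function joint_phred_sketch(error_probabilities::Vector{Float64})
    # log10 of the product equals the sum of the log10 values,
    # which avoids underflow when many small probabilities are multiplied.
    log10_joint_error = sum(log10.(error_probabilities))
    return -10 * log10_joint_error
end

# Two independent Q20 observations (error probability 0.01 each) combine to roughly Q40
joint_phred_sketch([0.01, 0.01])
```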
source
Mycelia.joint_qualmer_probabilityMethod
joint_qualmer_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Joint probability that all observations are correct
source
Mycelia.jsonl_to_dataframeMethod
jsonl_to_dataframe(filepath::String) -> DataFrames.DataFrame

Parse a JSONL (or gzipped JSONL) file and return a DataFrame. Internally calls parse_jsonl for validation and parsing. Ensures that all rows have the same set of keys by inserting missing for any absent field before constructing the DataFrame.
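
The key-unification step can be sketched on plain dictionaries. This is an illustration only; `fill_missing_keys_sketch` is a hypothetical helper, and the package's parsing and DataFrame construction are not shown.

```julia
# Hypothetical helper: give every row the same key set, filling gaps with `missing`.
function fill_missing_keys_sketch(rows::Vector{Dict{String,Any}})
    all_keys = union((keys(row) for row in rows)...)
    return [Dict{String,Any}(key => get(row, key, missing) for key in all_keys) for row in rows]
end

rows = Dict{String,Any}[
    Dict{String,Any}("a" => 1),
    Dict{String,Any}("a" => 2, "b" => "x"),
]
filled = fill_missing_keys_sketch(rows)
```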

source
Mycelia.k_ladderMethod
k_ladder(
;
    max_k,
    seed_primes,
    ratio,
    min_fractional_gap,
    read_length,
    read_margin,
    only_odd,
    return_unique,
    min_k
) -> Vector{Int64}

Generate a √2-scaled ladder of odd primes for k-mer-based assembly / error-screening.

Starts with a user-supplied list of seed_primes (default [3, 5, 7]), then iteratively multiplies the last accepted k by ratio (default sqrt(2)), rounds up to the next odd prime, and appends it only if it differs from the previous accepted prime by at least min_fractional_gap.

Keyword arguments

  • max_k::Int = 10_000 : Absolute upper bound.
  • seed_primes::Vector{Int} : Initial primes (e.g. [3,5,7] for protein, [11,13,17] for nucleotides).
  • ratio::Float64 = sqrt(2) : Target geometric growth factor.
  • min_fractional_gap::Float64 = 0.30 : Minimum (k_new − k_prev)/k_prev to skip “sister” primes.
  • read_length::Union{Int,Nothing} = nothing : If set, cap at read_length − read_margin.
  • read_margin::Int = 20 : Safety margin for short-read data.
  • only_odd::Bool = true : Force odd k (recommended).
  • return_unique::Bool = true : De-duplicate before returning.
  • min_k::Int = 3 : Drop any k below this after generation.

Returns

Vector{Int} — ascending prime k values suitable for -k/--k-list.
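
A simplified sketch of the ladder construction described above. `k_ladder_sketch` and `is_prime_sketch` are hypothetical names; the sketch omits read_length, read_margin, only_odd, return_unique, and min_k, uses max_k=100 to keep the output short, and assumes the loop stops when a candidate fails the gap test (the package may handle that case differently).

```julia
# Trial-division primality test; adequate for the small k values used here.
is_prime_sketch(n::Int) = n > 1 && all(n % d != 0 for d in 2:isqrt(n))

function k_ladder_sketch(; seed_primes=[3, 5, 7], ratio=sqrt(2),
                         min_fractional_gap=0.30, max_k=100)
    ladder = copy(seed_primes)
    while true
        candidate = ceil(Int, last(ladder) * ratio)
        # Advance to the next odd prime at or above the scaled value.
        while !(isodd(candidate) && is_prime_sketch(candidate))
            candidate += 1
        end
        candidate > max_k && break
        if (candidate - last(ladder)) / last(ladder) >= min_fractional_gap
            push!(ladder, candidate)
        else
            break  # assumption: stop rather than search further
        end
    end
    return ladder
end

k_ladder_sketch()  # returns [3, 5, 7, 11, 17, 29, 43, 61, 89]
```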

source
Mycelia.kmer_counts_dict_to_vectorMethod
kmer_counts_dict_to_vector(
    kmer_to_index_map,
    kmer_counts
) -> Any

Convert a dictionary of k-mer counts to a fixed-length numeric vector based on a predefined mapping.

Arguments

  • kmer_to_index_map: Dictionary mapping k-mer sequences to their corresponding vector indices
  • kmer_counts: Dictionary containing k-mer sequences and their occurrence counts

Returns

  • A vector where each position corresponds to a k-mer count, with zeros for absent k-mers
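
A minimal sketch of this conversion, using String keys for simplicity. `counts_to_vector_sketch` is a hypothetical name, and the sketch assumes every counted k-mer is present in the index map.

```julia
# Hypothetical sketch: place each count at the index assigned to its k-mer.
function counts_to_vector_sketch(kmer_to_index_map::Dict{String,Int},
                                 kmer_counts::Dict{String,Int})
    vector = zeros(Int, length(kmer_to_index_map))
    for (kmer, count) in kmer_counts
        vector[kmer_to_index_map[kmer]] = count
    end
    return vector
end

index_map = Dict("AA" => 1, "AC" => 2, "AG" => 3)
counts_to_vector_sketch(index_map, Dict("AC" => 4, "AA" => 1))  # returns [1, 4, 0]
```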
source
Mycelia.kmer_counts_to_cosine_similarityMethod
kmer_counts_to_cosine_similarity(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the cosine distance (1 − cosine similarity) between two k-mer count dictionaries.

Arguments

  • kmer_counts_1::Dict{String,Int}: First dictionary mapping k-mer sequences to their counts
  • kmer_counts_2::Dict{String,Int}: Second dictionary mapping k-mer sequences to their counts

Returns

  • Float64: Cosine distance between the two k-mer count vectors, in range [0,1] where 0 indicates identical distributions and 1 indicates maximum dissimilarity

Details

Converts k-mer count dictionaries into vectors using a unified set of keys, then computes cosine distance. Missing k-mers are treated as count 0. Result is invariant to input order and total counts (normalized internally).
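
The computation described in Details can be sketched as follows. `cosine_distance_sketch` is a hypothetical name and the sketch uses String keys; it is an illustration under those assumptions, not the package's code.

```julia
# Hypothetical sketch: cosine distance over the union of observed k-mers.
function cosine_distance_sketch(a::Dict{String,Int}, b::Dict{String,Int})
    kmers = sort(collect(union(keys(a), keys(b))))
    va = [get(a, k, 0) for k in kmers]  # missing k-mers count as 0
    vb = [get(b, k, 0) for k in kmers]
    dot_ab = sum(va .* vb)
    return 1 - dot_ab / (sqrt(sum(abs2, va)) * sqrt(sum(abs2, vb)))
end

# Same proportions at different depths: distance is approximately 0
cosine_distance_sketch(Dict("AA" => 1, "AC" => 2), Dict("AA" => 2, "AC" => 4))
```

Because only the angle between the two vectors matters, scaling either set of counts leaves the result unchanged.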

source
Mycelia.kmer_counts_to_js_divergenceMethod
kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

  • kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
  • kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

  • Normalized Jensen-Shannon divergence score between 0 and 1, where:
    • 0 indicates identical distributions
    • 1 indicates maximally different distributions

Notes

  • The measure is symmetric: JS(P||Q) = JS(Q||P)
  • Counts are automatically normalized to probability distributions
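
A minimal sketch of a Jensen-Shannon divergence bounded by 0 and 1. `js_divergence_sketch` is a hypothetical name, and the use of base-2 logarithms to obtain that bound is an assumption about the normalization.

```julia
# Hypothetical sketch: Jensen-Shannon divergence in bits, so the result lies in [0, 1].
function js_divergence_sketch(a::Dict{String,Int}, b::Dict{String,Int})
    kmers = collect(union(keys(a), keys(b)))
    p = [get(a, k, 0) for k in kmers] ./ sum(values(a))  # normalize counts
    q = [get(b, k, 0) for k in kmers] ./ sum(values(b))
    m = (p .+ q) ./ 2
    # Kullback-Leibler divergence; zero-probability terms contribute nothing.
    kl(x, y) = sum(xi > 0 ? xi * log2(xi / yi) : 0.0 for (xi, yi) in zip(x, y))
    return (kl(p, m) + kl(q, m)) / 2
end

js_divergence_sketch(Dict("AA" => 3), Dict("CC" => 5))  # disjoint k-mer sets give 1.0
```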
source
Mycelia.kmer_counts_to_merqury_qvMethod
kmer_counts_to_merqury_qv(
;
    raw_data_counts,
    assembly_counts
)

Calculate assembly Quality Value (QV) score using the Merqury method.

Estimates base-level accuracy by comparing k-mer distributions between raw sequencing data and assembly. Higher QV scores indicate better assembly quality.

Arguments

  • raw_data_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from raw sequencing data
  • assembly_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from assembly

Returns

  • Float64: Quality Value score in Phred scale (-10log₁₀(error rate))

Method

QV is calculated using:

  1. Ktotal = number of unique kmers in assembly
  2. Kshared = number of kmers shared between raw data and assembly
  3. P = (Kshared/Ktotal)^(1/k) = estimated base-level accuracy
  4. QV = -10log₁₀(1-P)
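
The four steps above can be sketched with plain sets of k-mer strings. `merqury_qv_sketch` is a hypothetical name, and using sets instead of the count dictionaries the function accepts is a simplification.

```julia
# Hypothetical sketch of the QV formula using sets of distinct k-mers.
function merqury_qv_sketch(raw_kmers::Set{String}, assembly_kmers::Set{String}, k::Int)
    k_total = length(assembly_kmers)                         # distinct k-mers in the assembly
    k_shared = length(intersect(assembly_kmers, raw_kmers))  # also seen in the reads
    p = (k_shared / k_total)^(1 / k)                         # estimated base-level accuracy
    return -10 * log10(1 - p)
end

raw = Set(["ACG", "CGT", "GTA"])
assembly = Set(["ACG", "CGT", "GTA", "TAC"])
qv = merqury_qv_sketch(raw, assembly, 3)
```

When every assembly k-mer is supported by the reads, P is 1 and the formula yields an infinite QV, so a finite cap may be needed in practice.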

Reference

Rhie et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies" Genome Biology (2020)

source
Mycelia.kmer_graph_to_biosequence_graphMethod
kmer_graph_to_biosequence_graph(
    kmer_graph::MetaGraphsNext.MetaGraph;
    min_path_length
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#270#272", Float64} where {_A, _B, _C}

Convert a k-mer graph to a BioSequence graph by collapsing linear paths.

This is the primary method for creating BioSequence graphs from k-mer graphs, following the 6-graph hierarchy where BioSequence graphs are simplifications of k-mer graphs.

Arguments

  • kmer_graph: MetaGraphsNext k-mer graph to convert
  • min_path_length: Minimum path length to keep (default: 2)

Returns

  • MetaGraphsNext.MetaGraph with BioSequence vertices

Example

# Start with k-mer graph
kmer_graph = build_kmer_graph_next(BioSequences.DNAKmer{31}, fasta_records)

# Convert to BioSequence graph
biosequence_graph = kmer_graph_to_biosequence_graph(kmer_graph)
source
Mycelia.kmer_path_to_sequenceMethod
kmer_path_to_sequence(kmer_path) -> Any

Convert a path of overlapping k-mers into a single DNA sequence.

Arguments

  • kmer_path: Vector of k-mers (DNA sequences) where each consecutive pair overlaps by k-1 bases

Returns

  • BioSequences.LongDNA{2}: Assembled DNA sequence from the k-mer path

Description

Reconstructs the original DNA sequence by joining k-mers, validating that consecutive k-mers overlap correctly. The first k-mer is used in full, then each subsequent k-mer contributes its last base.
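
A minimal sketch of this reconstruction on plain strings. `path_to_sequence_sketch` is a hypothetical name; the package function operates on k-mer types and returns a BioSequences value instead.

```julia
# Hypothetical sketch: join overlapping k-mers, checking each k-1 overlap.
function path_to_sequence_sketch(kmer_path::Vector{String})
    sequence = first(kmer_path)
    for (previous, current) in zip(kmer_path, kmer_path[2:end])
        previous[2:end] == current[1:end-1] ||
            error("k-mers $previous and $current do not overlap by k-1 bases")
        sequence *= current[end]  # each later k-mer contributes only its last base
    end
    return sequence
end

path_to_sequence_sketch(["ACG", "CGT", "GTA"])  # returns "ACGTA"
```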

source
Mycelia.kmer_quality_scoreFunction
kmer_quality_score(base_qualities::Vector{Float64}) -> Any
kmer_quality_score(
    base_qualities::Vector{Float64},
    method::Symbol
) -> Any

Calculate kmer quality score using the specified aggregation method.

Available methods:

  • :min: Use the minimum base quality (default)
  • :mean: Use the mean of all base qualities
  • :geometric: Use the geometric mean (appropriate for probabilities)
  • :harmonic: Use the harmonic mean (emphasizes lower values)

Arguments

  • base_qualities::Vector{Float64}: Vector of quality scores for each base
  • method::Symbol: Method to use for aggregation

Returns

  • Float64: Overall quality score for the kmer
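
The four aggregation methods can be sketched as follows. `kmer_quality_sketch` is a hypothetical name; the formulas are the standard definitions of each mean and may differ in detail from the package's implementation.

```julia
# Hypothetical sketch of the four aggregation methods listed above.
function kmer_quality_sketch(base_qualities::Vector{Float64}, method::Symbol=:min)
    n = length(base_qualities)
    if method == :min
        return minimum(base_qualities)
    elseif method == :mean
        return sum(base_qualities) / n
    elseif method == :geometric
        return exp(sum(log.(base_qualities)) / n)  # computed in log space
    elseif method == :harmonic
        return n / sum(1 ./ base_qualities)
    else
        throw(ArgumentError("unknown method: $method"))
    end
end

qualities = [10.0, 20.0, 40.0]
kmer_quality_sketch(qualities)             # minimum: 10.0
kmer_quality_sketch(qualities, :harmonic)  # pulled toward the low values
```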
source
Mycelia.kmer_space_sizeFunction
kmer_space_size(k::Integer) -> Any
kmer_space_size(k::Integer, alphabet_size::Integer) -> Any

Calculate the theoretical k-mer space size for a given k-mer length and alphabet size.

Arguments

  • k::Integer: K-mer length
  • alphabet_size::Integer=4: Size of the alphabet (defaults to 4 for DNA: A,C,G,T)

Returns

  • Integer: Total number of possible k-mers (alphabet_size^k)

Details

For DNA sequences (alphabet_size=4), this computes 4^k. Useful for:

  • Memory estimation for k-mer analysis
  • Parameter validation and selection
  • Understanding computational complexity

Examples

# DNA 3-mers: 4^3 = 64 possible k-mers
kmer_space_size(3)

# Protein 5-mers: 20^5 = 3,200,000 possible k-mers  
kmer_space_size(5, 20)
source
Mycelia.ksMethod
ks(; min, max) -> Vector{Int64}

Generates a specialized sequence of prime numbers combining:

  • Odd primes up to 23 (flip_point)
  • Primes nearest to Fibonacci numbers above 23 up to max

Arguments

  • min::Int=0: Lower bound for the sequence
  • max::Int=10_000: Upper bound for the sequence

Returns

Vector of Int containing the specialized prime sequence

source
Mycelia.lawrencium_sbatchMethod
lawrencium_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    qos,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM scheduler on Lawrence Berkeley Lab's Lawrencium cluster.

Arguments

  • job_name: Name identifier for the SLURM job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
  • logdir: Directory for SLURM output and error logs
  • partition: Lawrencium compute partition
  • qos: Quality of Service level
  • account: Project account for billing
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to spawn
  • time: Wall time limit in format "days-hours:minutes:seconds"
  • cpus_per_task: CPU cores per task
  • mem_gb: Memory per node in GB
  • cmd: Shell command to execute

Returns

  • true if submission was successful

Note

Function includes 5-second delays before and after submission for queue stability.

source
Mycelia.legacy_to_next_graphFunction
legacy_to_next_graph(
    legacy_graph
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}
legacy_to_next_graph(
    legacy_graph,
    kmer_type
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}

Convert a legacy MetaGraphs-based k-mer graph to the next-generation MetaGraphsNext format.

This function provides a migration path from the deprecated MetaGraphs.jl implementation to the new type-stable MetaGraphsNext.jl format.

Arguments

  • legacy_graph: MetaGraphs.MetaDiGraph from the old implementation

Returns

  • MetaGraphsNext.MetaGraph with equivalent structure and type-stable metadata
source
Mycelia.list_blastdbsMethod
list_blastdbs(; source) -> DataFrames.DataFrame

Lists available BLAST databases from the specified source.

source
Mycelia.list_classesMethod
list_classes() -> DataFrames.DataFrame

Returns a DataFrame of all taxonomic classes in the database.

Classes represent a major taxonomic rank between phylum and order in biological classification.

Returns

  • DataFrames.DataFrame: Table of class names, sorted alphabetically
source
Mycelia.list_databasesMethod
list_databases(; address, username, password)

Lists all available Neo4j databases on the specified server.

Arguments

  • address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
  • username::String="neo4j": Neo4j authentication username
  • password::String: Neo4j authentication password

Returns

  • DataFrame: Contains database information with columns typically including:
    • name: Database name
    • address: Database address
    • role: Database role (e.g., primary, secondary)
    • status: Current status (e.g., online, offline)
    • default: Boolean indicating if it's the default database
source
Mycelia.list_familiesMethod
list_families() -> DataFrames.DataFrame

Returns a sorted DataFrame of all family names present in the database.

source
Mycelia.list_full_taxonomyMethod
list_full_taxonomy() -> DataFrames.DataFrame

Retrieves and formats the complete NCBI taxonomy hierarchy into a structured DataFrame.

Details

  • Automatically sets up taxonkit environment and downloads taxonomy database if needed
  • Starts from root taxid (1) and includes all descendant taxa
  • Reformats lineage information into separate columns for each taxonomic rank

Returns

DataFrame with columns:

  • taxid: Taxonomy identifier
  • lineage: Full taxonomic lineage string
  • taxid_lineage: Lineage with taxonomy IDs
  • Individual rank columns:
    • superkingdom, kingdom, phylum, class, order, family, genus, species
    • corresponding taxid columns (e.g., superkingdom_taxid)

Dependencies

Requires taxonkit (installed automatically via Bioconda)

source
Mycelia.list_generaMethod
list_genera() -> DataFrames.DataFrame

Returns a sorted DataFrame of all genus names present in the database.

source
Mycelia.list_kingdomsMethod
list_kingdoms() -> DataFrames.DataFrame

Lists all taxonomic kingdoms in the database.

Returns a DataFrame of kingdom names. Kingdoms are one of the highest major taxonomic ranks in biological classification, directly below superkingdom.

Returns

  • DataFrames.DataFrame: Table of kingdom names
source
Mycelia.list_ordersMethod
list_orders() -> DataFrames.DataFrame

Lists all orders in the taxonomic database.

Returns a DataFrame of the order names in the taxonomy database. Uses the underlying list_rank() function with rank="order".

Returns

  • DataFrames.DataFrame: Alphabetically sorted table of order names
source
Mycelia.list_phylumsMethod
list_phylums() -> DataFrames.DataFrame

Returns a sorted list of all unique phyla in the database.

source
Mycelia.list_rankMethod
list_rank(rank) -> DataFrames.DataFrame

List all taxonomic entries at the specified rank level.

Arguments

  • rank::String: Taxonomic rank to query. Must be one of:
    • "top" (top level)
    • "superkingdom"/"domain"
    • "kingdom"
    • "phylum"
    • "class"
    • "order"
    • "family"
    • "genus"
    • "species"

Returns

DataFrame with columns:

  • taxid: NCBI taxonomy ID
  • name: Scientific name at the specified rank
source
Mycelia.list_ranksMethod
list_ranks(; synonyms) -> Vector{String}

Return an ordered list of taxonomic ranks from highest (top) to lowest (species).

Arguments

  • synonyms::Bool=false: If true, includes alternative names for certain ranks (e.g. "domain" for "superkingdom")

Returns

  • Vector{String}: An array of taxonomic rank names in hierarchical order
source
Mycelia.list_speciesMethod
list_species() -> DataFrames.DataFrame

Returns a sorted DataFrame of all species names present in the database.

source
Mycelia.list_subtaxaMethod
list_subtaxa(taxid) -> Vector{Int64}

Returns an array of Integer taxon IDs representing all sub-taxa under the specified taxonomic ID.

Arguments

  • taxid: NCBI taxonomy identifier for the parent taxon

Returns

Vector{Int} containing all descendant taxon IDs

Details

  • Requires taxonkit to be installed via Bioconda
  • Automatically sets up taxonkit database if not present
  • Uses local taxonomy database in ~/.taxonkit/
source
Mycelia.list_superkingdomsMethod
list_superkingdoms() -> DataFrames.DataFrame

Returns a DataFrame of all taxonomic superkingdoms (e.g., Bacteria, Archaea, Eukaryota).

Returns

  • DataFrames.DataFrame: Table containing the names of all superkingdoms in the taxonomy database
source
Mycelia.list_toplevelMethod
list_toplevel() -> DataFrames.DataFrame

Returns a DataFrame containing the top-level taxonomic nodes.

The DataFrame has two fixed rows representing the most basic taxonomic classifications:

  • taxid=0: "unclassified"
  • taxid=1: "root"

Returns

DataFrame with columns:

  • taxid::Int: Taxonomic identifier
  • name::String: Node name

source
Mycelia.load_blast_db_taxonomy_tableMethod
load_blast_db_taxonomy_table(
    compressed_blast_db_taxonomy_table_file
) -> DataFrames.DataFrame

Loads a BLAST database taxonomy mapping table from a gzipped file into a DataFrame.

Arguments

  • compressed_blast_db_taxonomy_table_file::String: Path to a gzipped file containing BLAST taxonomy mappings

Returns

  • DataFrame: A DataFrame with columns :sequence_id and :taxid containing the sequence-to-taxonomy mappings

Format

Input file should be a space-delimited text file (gzipped) with two columns:

  1. sequence identifier
  2. taxonomy identifier (taxid)
source
Mycelia.load_df_jld2Method
load_df_jld2(filename::String; key) -> Any
load_df_jld2(filename::String; key::String="dataframe") -> DataFrames.DataFrame

Load a DataFrame from a JLD2 file.

Arguments

  • filename: Path to the JLD2 file (will add .jld2 extension if not present)
  • key: The name of the dataset within the JLD2 file (defaults to "dataframe")

Returns

  • The loaded DataFrame

Examples

df = load_df_jld2("mydata")
source
Mycelia.load_genbank_metadataMethod
load_genbank_metadata() -> DataFrames.DataFrame

Load metadata for GenBank sequences into a DataFrame.

This is a convenience wrapper around load_ncbi_metadata("genbank") that specifically loads metadata from the GenBank database.

Returns

  • DataFrame: Contains metadata fields like accession numbers, taxonomy, and sequence information from GenBank.

source
Mycelia.load_graphMethod
load_graph(file) -> Any

Load a graph structure from a serialized file.

Arguments

  • file::AbstractString: Path to the file containing the serialized graph data

Returns

  • The deserialized graph object
source
Mycelia.load_graphMethod
load_graph(file::String) -> Any

Loads a graph object from a serialized file.

Arguments

  • file::String: Path to the file containing the serialized graph data. The file should have been created using save_graph.

Returns

  • The deserialized graph object stored under the "graph" key.
source
Mycelia.load_jellyfish_countsMethod
load_jellyfish_counts(jellyfish_counts) -> Any

Load k-mer counts from a Jellyfish output file into a DataFrame.

Arguments

  • jellyfish_counts::String: Path to a gzipped TSV file (*.jf.tsv.gz) containing Jellyfish k-mer counts

Returns

  • DataFrame: Table with columns:
    • kmer: Biologically encoded k-mers as DNAKmer{k} objects
    • count: Integer count of each k-mer's occurrences

Notes

  • Input file must be a gzipped TSV with exactly two columns (k-mer sequences and counts)
  • K-mer length is automatically detected from the first entry
  • Filename must end with '.jf.tsv.gz'
source
Mycelia.load_jld2Method
load_jld2(filename) -> Any

Load data stored in a JLD2 file format.

Arguments

  • filename::String: Path to the JLD2 file to load

Returns

  • Dict: Dictionary containing the loaded data structures
source
Mycelia.load_kmer_resultsMethod
load_kmer_results(
    filename::AbstractString
) -> Union{Nothing, NamedTuple{(:kmers, :counts, :fasta_list, :metadata), <:Tuple{Any, Any, Any, Dict{String, Any}}}}

Load kmer counting results previously saved with save_kmer_results.

Arguments

  • filename::AbstractString: Path to the input JLD2 file.

Returns

  • NamedTuple: Contains the loaded kmers, counts, fasta_list, and metadata. Returns nothing if the file cannot be loaded or essential keys are missing.
source
Mycelia.load_matrix_jld2Method
load_matrix_jld2(filename) -> Any

Loads a matrix from a JLD2 file.

Arguments

  • filename::String: Path to the JLD2 file containing the matrix under the key "matrix"

Returns

  • Matrix: The loaded matrix data
source
Mycelia.load_ncbi_metadataMethod
load_ncbi_metadata(db::String) -> DataFrames.DataFrame

Load and parse NCBI assembly summary metadata (GenBank/RefSeq), using a daily cache.

Checks for homedir()/workspace/.ncbi/YYYY-MM-DD.assembly_summary_{db}.txt. Uses the cache if valid (exists, readable, not empty). Otherwise, downloads from NCBI using Downloads.download(), caches the result (replacing any previous version for the same day), and then parses the cached file.

Handles NCBI's header format and uses CSV.jl for parsing.

Arguments

  • db::String: Database source ("genbank" or "refseq").

Returns

  • DataFrames.DataFrame: Parsed metadata table.

Errors

  • Throws ArgumentError for invalid db.
  • Throws error if cache directory cannot be created.
  • Throws error if data cannot be obtained from cache or download.
  • Rethrows errors from Downloads.download or CSV parsing.
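
The daily cache lookup described above can be sketched as follows. `ncbi_cache_path_sketch` and `cache_is_valid_sketch` are hypothetical helpers; the path layout follows the description, while the validity check is a simplification that does not test readability.

```julia
import Dates

# Hypothetical helper: build the dated cache path for a given database.
function ncbi_cache_path_sketch(db::String; date=Dates.today())
    db in ("genbank", "refseq") ||
        throw(ArgumentError("db must be \"genbank\" or \"refseq\""))
    return joinpath(homedir(), "workspace", ".ncbi", "$(date).assembly_summary_$(db).txt")
end

# Hypothetical helper: a cache file counts as usable if it exists and is not empty.
cache_is_valid_sketch(path) = isfile(path) && filesize(path) > 0

path = ncbi_cache_path_sketch("refseq"; date=Dates.Date(2024, 1, 15))
```

Keying the file name on the date means a new download happens at most once per day per database.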
source
Mycelia.load_ncbi_taxonomyMethod
load_ncbi_taxonomy(
;
    path_to_taxdump
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Downloads and constructs a MetaDiGraph representation of the NCBI taxonomy database.

Arguments

  • path_to_taxdump: Directory path where taxonomy files will be downloaded and extracted

Returns

  • MetaDiGraph: A directed graph where:
    • Vertices represent taxa with properties:
      • :tax_id: NCBI taxonomy identifier
      • :scientific_name, :common_name, etc.: Name properties
      • :rank: Taxonomic rank
      • :division_id, :division_cde, :division_name: Division information
    • Edges represent parent-child relationships in the taxonomy

Dependencies

Requires internet connection for initial download. Uses DataFrames, MetaGraphs, and ProgressMeter.

source
Mycelia.load_refseq_metadataMethod
load_refseq_metadata() -> DataFrames.DataFrame

Loads NCBI RefSeq metadata into a DataFrame. RefSeq is NCBI's curated collection of genomic, transcript and protein sequences.

Returns

  • DataFrame: Contains metadata columns including accession numbers, taxonomic information, and sequence details from RefSeq.

source
Mycelia.local_blast_database_infoMethod
local_blast_database_info(; blastdbs_dir) -> Any

Query information about local BLAST databases and return a formatted summary.

Arguments

  • blastdbs_dir::String: Directory containing BLAST databases (default: "~/workspace/blastdb")

Returns

  • DataFrame with columns:
    • BLAST database path
    • BLAST database molecule type
    • BLAST database title
    • date of last update
    • number of bases/residues
    • number of sequences
    • number of bytes
    • BLAST database format version
    • human readable size

Dependencies

Requires NCBI BLAST+ tools. Will attempt to install via apt-get if not present.

Side Effects

  • May install system packages (ncbi-blast+, perl-doc) using sudo/apt-get
  • Filters out numbered database fragments from results
source
Mycelia.mash_distance_from_jaccardMethod
mash_distance_from_jaccard(jaccard_index::Float64, kmer_size::Int)

Calculates the Mash distance (an estimate of the per-base mutation rate, from which Average Nucleotide Identity can be approximated) from a given Jaccard Index and k-mer size.

Arguments

  • jaccard_index::Float64: The Jaccard similarity between the two k-mer sets. Must be between 0.0 and 1.0.
  • kmer_size::Int: The length of k-mers used to calculate the Jaccard index.

Returns

  • Float64: The estimated Mash distance D. The estimated ANI would be 1.0 - D.
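
A sketch using the standard Mash formula, D = -(1/k) ln(2j / (1 + j)). `mash_distance_sketch` is a hypothetical name, and returning 1.0 for a Jaccard index of 0 is an assumption about how that edge case is handled.

```julia
# Hypothetical sketch of the standard Mash distance formula.
function mash_distance_sketch(jaccard_index::Float64, kmer_size::Int)
    0.0 <= jaccard_index <= 1.0 || throw(ArgumentError("jaccard_index must be in [0, 1]"))
    jaccard_index == 0.0 && return 1.0  # assumption: cap the distance at 1
    # Equivalent to -log(2j / (1 + j)) / k, written to avoid returning -0.0
    return log((1 + jaccard_index) / (2 * jaccard_index)) / kmer_size
end

distance = mash_distance_sketch(0.5, 21)
estimated_ani = 1.0 - distance
```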
source
Mycelia.maximum_weight_walk_nextMethod
maximum_weight_walk_next(
    graph::MetaGraphsNext.MetaGraph,
    start_vertex::String,
    max_steps::Int64;
    weight_function
) -> Mycelia.GraphPath

Perform a maximum weight walk prioritizing highest confidence edges.

This greedy algorithm always chooses the edge with the highest weight (coverage) at each step, useful for finding high-confidence assembly paths.

Arguments

  • graph: MetaGraphsNext k-mer graph
  • start_vertex: Starting vertex label
  • max_steps: Maximum steps to take
  • weight_function: Function to extract weight from edge data (default: uses edge.weight)

Returns

  • GraphPath: Path following maximum weight edges
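
The greedy choice can be sketched on a plain adjacency list. `greedy_walk_sketch` is a hypothetical name, and the Dict of (neighbor, weight) tuples stands in for the MetaGraphsNext graph and GraphPath types, which are not reproduced here. The sketch requires Julia 1.7 or later for `argmax(f, itr)`.

```julia
# Hypothetical sketch: follow the heaviest outgoing edge at each step.
function greedy_walk_sketch(edges::Dict{String,Vector{Tuple{String,Float64}}},
                            start_vertex::String, max_steps::Int)
    path = [start_vertex]
    current = start_vertex
    for _ in 1:max_steps
        candidates = get(edges, current, Tuple{String,Float64}[])
        isempty(candidates) && break                 # dead end: stop early
        current = first(argmax(last, candidates))    # neighbor with the largest weight
        push!(path, current)
    end
    return path
end

edges = Dict("A" => [("B", 2.0), ("C", 5.0)], "C" => [("D", 1.0)])
greedy_walk_sketch(edges, "A", 5)  # returns ["A", "C", "D"]
```

A greedy walk is fast but can miss a globally heavier path, since it never revisits an earlier choice.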
source
Mycelia.merge_and_map_single_end_samplesMethod
merge_and_map_single_end_samples(; 
    fasta_reference::AbstractString, 
    fastq_list::Vector{<:AbstractString}, 
    minimap_index::AbstractString, 
    mapping_type::AbstractString,
    outbase::AbstractString = "results",
    outformats::Vector{<:AbstractString} = [".tsv.gz", ".jld2"]
) -> DataFrames.DataFrame

Merge and map single-end sequencing samples, then output results in one or more formats.

Arguments

  • fasta_reference: Path to the reference FASTA file.
  • fastq_list: Vector of paths to input FASTQ files to be merged.
  • minimap_index: Path to the minimap2 index file (.mmi).
  • mapping_type: Mapping type string for minimap2 (e.g., "map-ont").
  • outbase: Base name (optionally including path) for output files (default: Mycelia.normalized_current_date() * ".joint-minimap-mapping-results").
  • outformats: Vector of output file formats to write results to. Supported: ".tsv.gz", ".jld2".

Description

This function merges provided FASTQ files and assigns unique UUIDs to reads, maps the merged FASTQ against the provided reference using minimap2, reads mapping and UUID tables, joins them into a single DataFrame, writes this table to all requested output formats with filenames constructed from the outbase and the appropriate extension, and returns the resulting joined DataFrame.

Output Files

  • .tsv.gz: Tab-separated, gzip-compressed table of results.
  • .jld2: JLD2 file containing results.

Returns

  • The joined results as a DataFrames.DataFrame.
source
Mycelia.merge_colorsMethod
merge_colors(c1, c2) -> Any

Merge two colors by calculating their minimal color difference vector.

Arguments

  • c1::Color: First color input
  • c2::Color: Second color input

Returns

  • If colors are equal, returns the input color
  • Otherwise returns the color difference vector (c1-c2 or c2-c1) with minimal RGB sum

Details

Calculates two difference vectors:

  • mix_a = c1 - c2
  • mix_b = c2 - c1

Returns the difference vector with the smallest sum of RGB components.

source
Mycelia.merge_fasta_filesMethod
merge_fasta_files(; fasta_files, fasta_file)

Join FASTA files while adding origin prefixes to the identifiers.

Does not guarantee identifier uniqueness, but warns if conflicts arise.

source
Mycelia.merge_xam_with_taxonomiesMethod

Merge XAM alignment data with taxonomic information and calculate alignment metrics.

This function processes XAM alignment data by:

  1. Loading an accession-to-taxid mapping table
  2. Left-joining alignment data with taxonomic IDs
  3. Retrieving full taxonomic lineage information
  4. Calculating percent identity scores
  5. Calculating relative alignment score proportions per read
  6. Writing results to cached files

Arguments

  • xam: Path to XAM file or XAM data structure
  • accession2taxid_file: Path to accession-to-taxid mapping file (.tsv.gz or .arrow)
  • output_prefix: Prefix for output files (.tsv.gz and .arrow). Defaults to "xam"
  • verbose::Bool=true: Whether to print progress information
  • force_recalculate::Bool=false: Whether to force recalculation even if cached files exist

Returns

A NamedTuple with paths to the output files: (tsv_out, arrow_out)

source
Mycelia.metasha256Method
metasha256(
    vector_of_sha256s::Vector{<:AbstractString}
) -> String

Compute a single SHA256 hash from multiple SHA256 hashes.

Takes a vector of hex-encoded SHA256 hashes and produces a new SHA256 hash by:

  1. Sorting the input hashes lexicographically
  2. Concatenating them in sorted order
  3. Computing a new SHA256 hash over the concatenated data

Arguments

  • vector_of_sha256s: Vector of hex-encoded SHA256 hash strings

Returns

  • A hex-encoded string representing the computed meta-hash
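
The three steps above can be sketched with the SHA standard library. `metasha256_sketch` is a hypothetical name, and concatenating the hashes without a delimiter is an assumption.

```julia
import SHA

# Hypothetical sketch: sort, concatenate (no delimiter assumed), and re-hash.
function metasha256_sketch(hashes::Vector{<:AbstractString})
    return bytes2hex(SHA.sha256(join(sort(hashes))))
end

h1 = bytes2hex(SHA.sha256("first file contents"))
h2 = bytes2hex(SHA.sha256("second file contents"))
meta = metasha256_sketch([h1, h2])
```

Sorting first makes the result independent of the order in which the input hashes are supplied.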
source
Mycelia.minimap_indexMethod
minimap_index(
;
    fasta,
    mapping_type,
    mem_gb,
    threads,
    as_string,
    denominator
)

Create a minimap2 index for the provided reference sequence.

Arguments

  • fasta::String: Path to the reference FASTA.
  • mapping_type::String: Preset (e.g. "map-hifi").
  • mem_gb::Real: Memory available in GB.
  • threads::Integer: Number of threads.
  • as_string::Bool=false: If true, return the command string instead of Cmd.
  • denominator::Real: Scaling factor passed to system_mem_to_minimap_index_size.

Returns

Named tuple (cmd, outfile) where outfile is the generated .mmi index path.

source
Mycelia.minimap_mapMethod
minimap_map(
;
    fasta,
    fastq,
    mapping_type,
    as_string,
    mem_gb,
    threads,
    denominator
)

Generate a minimap2 alignment command for mapping reads to a reference.

The command aligns and compresses only; it does no sorting or filtering.

Use as_string=true to get the command as a string for submission to SLURM.

Creates a command to align reads in FASTQ format to a reference FASTA using minimap2, followed by SAM compression with pigz. Handles resource allocation and conda environment setup.

Arguments

  • fasta: Path to reference FASTA file
  • fastq: Path to query FASTQ file
  • mapping_type: Alignment preset ("map-hifi", "map-ont", "map-pb", "sr", or "lr:hq")
  • as_string: If true, returns shell command as string; if false, returns command array
  • mem_gb: Available memory in GB for indexing (defaults to system free memory)
  • threads: Number of CPU threads to use (defaults to system threads)
  • denominator: Divisor for calculating minimap2 index size

Returns

Named tuple containing:

  • cmd: Shell command (as string or array)
  • outfile: Path to compressed output SAM file
source
Mycelia.minimap_map_paired_end_with_indexMethod
minimap_map_paired_end_with_index(
;
    forward,
    reverse,
    mem_gb,
    threads,
    outdir,
    as_string,
    denominator,
    fasta,
    index_file
)

Map paired-end reads to a reference sequence using minimap2.

Arguments

  • fasta::String: Path to reference FASTA file
  • forward::String: Path to forward reads FASTQ file
  • reverse::String: Path to reverse reads FASTQ file
  • mem_gb::Integer: Available system memory in GB
  • threads::Integer: Number of threads to use
  • outdir::String: Output directory (defaults to forward reads directory)
  • as_string::Bool=false: Return command as string instead of Cmd array
  • mapping_type::String="sr": Minimap2 preset ["map-hifi", "map-ont", "map-pb", "sr", "lr:hq"]
  • denominator::Float64: Memory scaling factor for index size

Returns

Named tuple containing:

  • cmd: Command(s) to execute (String or Array{Cmd})
  • outfile: Path to compressed output SAM file (*.sam.gz)

Notes

  • Requires minimap2, samtools, and pigz conda environments
  • Automatically compresses output using pigz
  • Index file must exist at $(fasta).x$(mapping_type).I$(index_size).mmi
source
Mycelia.minimap_map_with_indexMethod
minimap_map_with_index(
;
    fasta,
    mapping_type,
    fastq,
    index_file,
    mem_gb,
    threads,
    as_string,
    denominator
)

Map reads using an existing minimap2 index file.

Arguments

  • fasta: Path to the reference FASTA (used only if an index must be created).
  • mapping_type: Minimap2 preset.
  • fastq: Input reads.
  • index_file::String="": Optional prebuilt index path. If empty, one is created.
  • mem_gb, threads, as_string, denominator: Parameters forwarded to minimap_index.

Returns

Named tuple (cmd, outfile), where running cmd writes the mapping results as a BAM file to outfile.

source
Mycelia.mmseqs_pairwise_searchMethod
mmseqs_pairwise_search(; fasta, output)

Perform all-vs-all sequence search using MMseqs2's easy-search command.

Arguments

  • fasta::String: Path to input FASTA file containing sequences to compare
  • output::String: Output directory path (default: input filename + ".mmseqs_easy_search_pairwise")

Returns

  • String: Path to the output directory

Details

Executes MMseqs2 with sensitive search parameters (7 sensitivity steps) and outputs results in tabular format with the following columns:

  • query, qheader: Query sequence ID and header
  • target, theader: Target sequence ID and header
  • pident: Percentage sequence identity
  • fident: Fraction of identical matches
  • nident: Number of identical matches
  • alnlen: Alignment length
  • mismatch: Number of mismatches
  • gapopen: Number of gap openings
  • qstart, qend, qlen: Query sequence coordinates and length
  • tstart, tend, tlen: Target sequence coordinates and length
  • evalue: Expected value
  • bits: Bit score

Requires MMseqs2 to be available through Bioconda.

source
Mycelia.mutate_sequenceMethod
mutate_sequence(reference_sequence) -> Tuple{Any, Any}

Generate a single random mutation in an amino acid sequence.

Arguments

  • reference_sequence: Input amino acid sequence to be mutated

Returns

  • mutant_sequence: The sequence after applying the mutation
  • haplotype: A SequenceVariation.Haplotype object containing the mutation details

Details

Performs one of three possible mutation types:

  • Substitution: Replace one amino acid with another
  • Insertion: Insert 1+ random amino acids at a position
  • Deletion: Remove 1+ amino acids from a position

Insertion and deletion sizes follow a truncated Poisson distribution (λ=1, min=1).
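
One way to draw such indel sizes is rejection sampling from an ordinary Poisson distribution. The sketch below assumes that approach. The helpers `sample_poisson` and `sample_truncated_poisson` are hypothetical names, not part of Mycelia, and the package's actual sampler may differ.

```julia
import Random

# Knuth's multiplication method for one Poisson(λ) draw; adequate for small λ.
function sample_poisson(rng::Random.AbstractRNG, λ::Float64)
    threshold = exp(-λ)
    k = 0
    p = rand(rng)
    while p > threshold
        k += 1
        p *= rand(rng)
    end
    return k
end

# Rejection sampling: redraw until the value reaches the minimum size.
function sample_truncated_poisson(rng::Random.AbstractRNG; λ::Float64 = 1.0, minimum_size::Int = 1)
    while true
        k = sample_poisson(rng, λ)
        k >= minimum_size && return k
    end
end

rng = Random.MersenneTwister(42)
indel_sizes = [sample_truncated_poisson(rng) for _ in 1:1000]
```

With λ=1 most draws are 1, so single-residue indels dominate and longer ones become progressively rarer.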

source
Mycelia.mycelia_assembleMethod

Main Mycelia intelligent assembly algorithm. Implements iterative prime k-mer progression with error correction.

source
Mycelia.mycelia_cross_validationMethod

Main cross-validation function for hybrid assembly quality assessment. Compares intelligent vs iterative assembly approaches across multiple validation folds.

source
Mycelia.mycelia_iterative_assembleMethod

Main iterative maximum likelihood assembly function. Processes entire read sets per iteration with complete FASTQ I/O tracking. Enhanced with performance optimizations, caching, and progress tracking.

source
Mycelia.n_maximally_distinguishable_colorsMethod
n_maximally_distinguishable_colors(n) -> Any

Generate n colors that are maximally distinguishable from each other.

Arguments

  • n::Integer: The number of distinct colors to generate

Returns

A vector of n RGB colors that are optimized for maximum perceptual distinction, using white (RGB(1,1,1)) and black (RGB(0,0,0)) as anchor colors.

source
Mycelia.name2taxidMethod
name2taxid(name) -> DataFrames.DataFrame

Convert scientific name(s) to NCBI taxonomy ID(s) using taxonkit.

Arguments

  • name::AbstractString: Scientific name(s) to query. Can be a single name or multiple names separated by newlines.

Returns

  • DataFrame with columns:
    • name: Input scientific name
    • taxid: NCBI taxonomy ID
    • rank: Taxonomic rank (e.g., "species", "genus")

Dependencies

Requires taxonkit package (installed automatically via Bioconda)

source
Mycelia.names2taxidsMethod
names2taxids(names::AbstractVector{<:AbstractString}) -> Any

Convert a vector of species/taxon names to their corresponding NCBI taxonomy IDs.

Arguments

  • names::AbstractVector{<:AbstractString}: Vector of scientific names or common names

Returns

  • Vector{Int}: Vector of NCBI taxonomy IDs corresponding to the input names

Progress is displayed using ProgressMeter.

source
Mycelia.ncbi_ftp_path_to_urlMethod
ncbi_ftp_path_to_url(; ftp_path, extension)

Constructs a complete NCBI FTP URL by combining a base FTP path with a file extension.

Arguments

  • ftp_path::String: Base FTP directory path for the resource
  • extension::String: File extension to append to the resource name

Returns

  • String: Complete FTP URL path to the requested resource

Extensions include:

  • genomic.fna.gz
  • genomic.gff.gz
  • protein.faa.gz
  • assembly_report.txt
  • assembly_stats.txt
  • cds_from_genomic.fna.gz
  • feature_count.txt.gz
  • feature_table.txt.gz
  • genomic.gbff.gz
  • genomic.gtf.gz
  • protein.gpff.gz
  • translated_cds.faa.gz
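
As a rough illustration of how the pieces combine, the sketch below assumes the resource name is the last component of `ftp_path` and that the extension is joined to it with an underscore, following NCBI's file naming. `sketch_ftp_path_to_url` is a hypothetical stand-in, not Mycelia's implementation.

```julia
# Hypothetical helper mirroring the documented behaviour.
function sketch_ftp_path_to_url(; ftp_path::String, extension::String)
    # The resource name is assumed to be the final path component.
    resource_name = basename(ftp_path)
    return ftp_path * "/" * resource_name * "_" * extension
end

url = sketch_ftp_path_to_url(
    ftp_path = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2",
    extension = "genomic.fna.gz",
)
```

The sketch expects `ftp_path` without a trailing slash; a trailing slash would leave the resource name empty.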
source
Mycelia.ncbi_genome_download_accessionMethod
ncbi_genome_download_accession(
;
    accession,
    outdir,
    outpath,
    include_string
)

Download and extract a genome from NCBI using the datasets command-line tool.

The .zip archive is written to outpath and then unzipped, and the path to the extracted folder is returned. NCBI's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report".

Arguments

  • accession: NCBI accession number for the genome
  • outdir: Directory where files will be downloaded (defaults to current directory)
  • outpath: Full path for the temporary zip file (defaults to outdir/accession.zip)
  • include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").

Returns

  • Path to the extracted genome data directory

Notes

  • Requires the ncbi-datasets-cli conda package (automatically installed if missing)
  • Downloaded zip file is automatically removed after extraction
  • If output folder already exists, download is skipped
  • Data is extracted to outdir/accession/ncbi_dataset/data/accession
source
Mycelia.ncbi_taxon_summaryMethod
ncbi_taxon_summary(taxa_id) -> DataFrames.DataFrame

Retrieve taxonomic information for a given NCBI taxonomy ID.

Arguments

  • taxa_id: NCBI taxonomy identifier (integer)

Returns

  • DataFrame: Taxonomy summary containing fields like tax_id, rank, species, etc.
source
Mycelia.nearest_primeMethod
nearest_prime(n::Int64) -> Int64

Find the closest prime number to the given integer n.

Returns the nearest prime number to n. If two prime numbers are equally distant from n, returns the smaller one.

Arguments

  • n::Int: The input integer to find the nearest prime for

Returns

  • Int: The closest prime number to n
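
The tie-breaking rule can be sketched with an outward search that checks the smaller candidate first. This is a minimal illustration using trial division; `is_prime_sketch` and `nearest_prime_sketch` are hypothetical names, and the real function may use a dedicated primes library instead.

```julia
# Trial-division primality test; sufficient for k-mer-sized integers.
function is_prime_sketch(n::Int)
    n < 2 && return false
    n < 4 && return true
    iseven(n) && return false
    for d in 3:2:isqrt(n)
        n % d == 0 && return false
    end
    return true
end

function nearest_prime_sketch(n::Int)
    n <= 2 && return 2
    offset = 0
    while true
        # Check the smaller candidate first so that ties resolve downward.
        is_prime_sketch(n - offset) && return n - offset
        is_prime_sketch(n + offset) && return n + offset
        offset += 1
    end
end

nearest_to_four = nearest_prime_sketch(4)   # 3 and 5 are equally close
```

Here 3 and 5 are both one step from 4, and the search returns 3 because the lower candidate is tested first.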
source
Mycelia.negbin_pca_epcaMethod
negbin_pca_epca(M::AbstractMatrix{<:Integer};
               k::Int=0,
               r::Int=1)

Perform Negative-Binomial EPCA on a count matrix M (features × samples).

When to use

Use for overdispersed count data (variance > mean), such as RNA-seq or metagenomic counts.

Keyword arguments

  • k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)
  • r : known NB “number of successes” parameter

Returns

NamedTuple with fields

  • model : the fitted ExpFamilyPCA.NegativeBinomialEPCA object
  • scores : k×n_samples matrix of sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.nersc_sbatchMethod
nersc_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    scriptdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a batch job to NERSC's SLURM workload manager.

Arguments

  • job_name: Identifier for the SLURM job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
  • logdir: Directory for storing job output/error logs
  • scriptdir: Directory for storing generated SLURM scripts
  • qos: Quality of Service level ("regular", "premium", or "preempt")
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to run
  • time: Maximum wall time in format "days-HH:MM:SS"
  • cpus_per_task: CPU cores per task
  • mem_gb: Memory per node in GB
  • cmd: Command(s) to execute (String or Vector{String})
  • constraint: Node type constraint ("cpu" or "gpu")

Returns

  • true if job submission succeeds
  • false if submission fails

QoS Options

  • regular: Standard priority queue
  • premium: High priority queue (5x throughput limit)
  • preempt: Reduced credit usage but jobs may be interrupted

Default is to use the shared QoS; see QoS Options above for the alternatives.

References

  • https://docs.nersc.gov/jobs/policy/
  • https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes
  • https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks

source
Mycelia.nersc_sbatch_sharedMethod
nersc_sbatch_shared(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a job to NERSC's SLURM scheduler using the shared QOS (Quality of Service).

Arguments

  • job_name: Identifier for the job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", "REQUEUE", "STAGE_OUT")
  • logdir: Directory for storing job output and error logs
  • qos: Quality of Service level ("shared", "regular", "preempt", "premium")
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to run
  • time: Maximum wall time in format "days-hours:minutes:seconds"
  • cpus_per_task: Number of CPUs per task
  • mem_gb: Memory per node in GB (default: 2GB per CPU)
  • cmd: Command to execute
  • constraint: Node type constraint ("cpu" or "gpu")

Resource Limits

  • Maximum memory per node: 512GB
  • Maximum cores per node: 128
  • Default memory allocation: 2GB per CPU requested

QOS Options

  • shared: Default QOS for shared node usage
  • regular: Standard priority
  • preempt: Reduced credit usage but preemptible
  • premium: 5x throughput priority (limited usage)

Returns

true if job submission succeeds

References

  • https://docs.nersc.gov/jobs/policy/
  • https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes
  • https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks

source
Mycelia.next_prime_kMethod

Find the next prime number greater than current_k. For k-mer progression, we prefer odd numbers and especially primes.
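
A minimal sketch of such a progression follows, assuming a plain upward search for the next prime. The names `is_prime_k`, `next_prime_k_sketch`, and `prime_k_progression` are hypothetical and only illustrate the idea.

```julia
# Simple trial-division test, fine for the small values used as k.
is_prime_k(k::Int) = k > 1 && all(k % d != 0 for d in 2:isqrt(k))

function next_prime_k_sketch(current_k::Int)
    candidate = current_k + 1
    while !is_prime_k(candidate)
        candidate += 1
    end
    return candidate
end

# Collect the next `steps` prime k values after `start_k`.
function prime_k_progression(start_k::Int, steps::Int)
    ks = Int[]
    k = start_k
    for _ in 1:steps
        k = next_prime_k_sketch(k)
        push!(ks, k)
    end
    return ks
end

progression = prime_k_progression(11, 4)
```

Starting from k=11, this yields 13, 17, 19, and 23.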

source
Mycelia.node_type_to_dataframeMethod
node_type_to_dataframe(; node_type, graph)

Convert all nodes of a specific type in a MetaGraph to a DataFrame representation.

Arguments

  • node_type: The type of nodes to extract from the graph
  • graph: A MetaGraph containing the nodes

Returns

A DataFrame where:

  • Each row represents a node of the specified type
  • Columns correspond to all unique properties found across nodes
  • Values are JSON-serialized strings for consistency

Notes

  • All values are normalized through JSON serialization
  • Dictionary values receive double JSON encoding
  • The TYPE column is converted using type_to_string
source
Mycelia.normalize_codon_frequenciesMethod
normalize_codon_frequencies(
    codon_frequencies
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Float64}}

Normalizes codon frequencies for each amino acid such that frequencies sum to 1.0.

Arguments

  • codon_frequencies: Nested dictionary mapping amino acids to their codon frequency distributions

Returns

  • Normalized codon frequencies where values for each amino acid sum to 1.0
source
Mycelia.normalize_countmapMethod
normalize_countmap(countmap) -> Dict

Normalize a dictionary of counts into a probability distribution where values sum to 1.0.

Arguments

  • countmap::Dict: Dictionary mapping keys to count values

Returns

  • Dict: New dictionary with same keys but values normalized by total sum
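
The normalization amounts to dividing each count by the total. A minimal sketch, with `normalize_countmap_sketch` as a hypothetical stand-in for the real function:

```julia
function normalize_countmap_sketch(countmap::AbstractDict)
    total = sum(values(countmap))
    # Keep the keys, replace each count with its share of the total.
    return Dict(key => count / total for (key, count) in countmap)
end

frequencies = normalize_countmap_sketch(Dict("A" => 2, "C" => 1, "G" => 1))
```

In this example "A" accounts for half of the observations, so its normalized value is 0.5.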
source
Mycelia.normalize_distance_matrixMethod
normalize_distance_matrix(distance_matrix) -> Any

Create a distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is proportional to total feature count magnitude (size) and cosine similarity (relative frequency).

Normalize a distance matrix by dividing all elements by the maximum non-NaN value.

Arguments

  • distance_matrix: A matrix of distance values that may contain NaN, nothing, or missing values

Returns

  • Normalized distance matrix with values scaled to [0, 1] range

Details

  • Filters out NaN, nothing, and missing values when finding the maximum
  • All elements are divided by the same maximum value to preserve relative distances
  • If all values are NaN/nothing/missing, may return NaN values
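
The scaling step can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `is_valid_distance` and `normalize_distance_matrix_sketch` are hypothetical names, invalid entries are passed through unchanged, and the all-invalid case is not handled.

```julia
# Treat NaN, nothing, and missing as invalid entries.
is_valid_distance(x) = !(x === nothing || x === missing || (x isa AbstractFloat && isnan(x)))

function normalize_distance_matrix_sketch(distance_matrix::AbstractMatrix)
    max_distance = maximum(filter(is_valid_distance, vec(distance_matrix)))
    # Divide valid entries by the shared maximum; leave invalid entries as they are.
    return map(x -> is_valid_distance(x) ? x / max_distance : x, distance_matrix)
end

D = [0.0 2.0 NaN;
     2.0 0.0 4.0;
     NaN 4.0 0.0]
normalized = normalize_distance_matrix_sketch(D)
```

Because every entry is divided by the same value, the ratio between any two valid distances is unchanged.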
source
Mycelia.normalize_kmer_countsMethod
normalize_kmer_counts(
    kmer_counts
) -> OrderedCollections.OrderedDict

Convert raw k‑mer counts into normalized frequencies.

Arguments

  • kmer_counts::Dict: Mapping of k-mers to counts.

Returns

OrderedDict with values scaled so the sum equals 1.

source
Mycelia.normalize_vcfMethod
normalize_vcf(; reference_fasta, vcf_file)

Normalize a VCF file using bcftools norm, with automated handling of compression and indexing.

Arguments

  • reference_fasta::String: Path to the reference FASTA file used for normalization
  • vcf_file::String: Path to input VCF file (can be gzipped or uncompressed)

Returns

  • String: Path to the normalized, sorted, and compressed output VCF file (*.sorted.normalized.vcf.gz)

Notes

  • Requires bioconda packages: htslib, tabix, bcftools
  • Creates intermediate files with extensions .tbi for indices
  • Skips processing if output file already exists
  • Performs left-alignment and normalization of variants
source
Mycelia.normalized_current_dateMethod
normalized_current_date() -> String

Returns the current date as a normalized string with all non-word characters removed.

The output format is based on ISO datetime (YYYYMMDD) but strips any special characters like hyphens, colons or dots.

source
Mycelia.normalized_current_datetimeMethod
normalized_current_datetime() -> String

Returns the current date and time as a normalized string with all non-word characters removed.

The output format is based on ISO datetime (YYYYMMDDThhmmss) but strips any special characters like hyphens, colons or dots.
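
Both date helpers follow the same idea: format the value in ISO style, then drop every non-word character. A minimal sketch using the Dates standard library, where `strip_nonword` is a hypothetical helper name:

```julia
import Dates

# Remove hyphens, colons, dots, and any other non-word characters.
strip_nonword(s::AbstractString) = replace(s, r"\W" => "")

date_stamp = strip_nonword(string(Dates.today()))
datetime_stamp = strip_nonword(Dates.format(Dates.now(), "yyyy-mm-dd\\THH:MM:SS"))
```

The resulting strings contain only digits (plus the literal T in the datetime form), which makes them safe to embed in file names.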

source
Mycelia.observeMethod
observe(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record};
    error_rate
)

Simulate sequencing of a DNA/RNA record by introducing random errors at the specified rate.

Arguments

  • record: A FASTA or FASTQ record containing the sequence to be "observed"
  • error_rate: Probability of error at each position (default: 0.0)

Returns

A new FASTQ.Record with:

  • Random UUID as identifier
  • Original record's description
  • Modified sequence with introduced errors
  • Generated quality scores
source
Mycelia.observeMethod
observe(sequence::BioSequences.LongSequence{T}; error_rate=nothing, tech::Symbol=:illumina) where T

Simulates the “observation” of a biological polymer (DNA, RNA, or protein) by introducing realistic errors along with base‐quality scores. The simulation takes into account both random and systematic error components. The technologies are modeled as follows:

  • illumina: (mostly substitution errors) the per‐base quality decays along the read (from ~Q40 at the start to ~Q20 at the end);
  • nanopore: errors are more frequent and include both substitutions and indels (with overall lower quality scores, and an extra “homopolymer” penalty);
  • pacbio: errors are dominated by indels (with quality scores typical of raw reads);
  • ultima: (UG 100/ppmSeq™) correct bases are assigned very high quality (~Q60) while errors are extremely rare and, if they occur, are given a modest quality.

An error is introduced at each position with a (possibly position‐dependent) probability. For Illumina, the error probability increases along the read; additionally, if a base is part of a homopolymer run (length ≥ 3) and the chosen technology is one that struggles with homopolymers (nanopore, pacbio, ultima), then the local error probability is multiplied by a constant factor.

Returns a tuple (new_seq, quality_scores) where:

  • new_seq is a BioSequences.LongSequence{T} containing the “observed” sequence (which may be longer or shorter than the input if insertions or deletions occur), and
  • quality_scores is a vector of integers representing the Phred quality scores (using the Sanger convention) for each base in the output sequence.
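
To make the Illumina case concrete, the sketch below assumes the quality decays linearly from Q40 to Q20 across the read; the text above only states the two endpoints, so the linear shape is an assumption. `illumina_quality_profile_sketch` is a hypothetical name.

```julia
# Assumed linear decay from q_start at the first base to q_end at the last.
function illumina_quality_profile_sketch(read_length::Int; q_start = 40.0, q_end = 20.0)
    read_length == 1 && return [round(Int, q_start)]
    return [round(Int, q_start + (q_end - q_start) * (i - 1) / (read_length - 1))
            for i in 1:read_length]
end

qualities = illumina_quality_profile_sketch(5)
# Per-base error probability implied by each Phred score: P = 10^(-Q/10).
error_probabilities = 10.0 .^ (-qualities ./ 10)
```

Under this assumption the per-base error probability rises from 1e-4 at the start of the read to 1e-2 at the end.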
source
Mycelia.open_fastxMethod
open_fastx(path::AbstractString) -> Any

Open and return a reader for FASTA or FASTQ format files.

Arguments

  • path::AbstractString: Path to input file. Can be:
    • Local file path
    • HTTP/FTP URL
    • Gzip compressed (.gz extension)

Supported formats

  • FASTA (.fasta, .fna, .faa, .fa)
  • FASTQ (.fastq, .fq)

Returns

  • FASTX.FASTA.Reader for FASTA files
  • FASTX.FASTQ.Reader for FASTQ files
source
Mycelia.open_genbankMethod
open_genbank(
    genbank_file
) -> Vector{GenomicAnnotations.Record}

Opens and parses a GenBank format file containing genomic sequence and annotation data.

Arguments

  • genbank_file::AbstractString: Path to the GenBank (.gb or .gbk) file

Returns

  • Vector{GenomicAnnotations.Record}: Vector containing the parsed GenBank records
source
Mycelia.open_gffMethod
open_gff(path::String) -> Any

Opens a GFF (General Feature Format) file for reading.

Arguments

  • path::String: Path to GFF file. Can be:
    • Local file path
    • HTTP/FTP URL (FTP URLs are automatically converted to HTTP)
    • Gzipped file (automatically decompressed)

Returns

  • IO: An IO stream ready to read the GFF content
source
Mycelia.optimal_subsequence_lengthMethod
optimal_subsequence_length(
;
    error_rate,
    threshold,
    sequence_length,
    plot_result
)

Calculate the optimal subsequence length based on error rate distribution.

Arguments

  • error_rate: Single error rate or array of error rates (between 0 and 1)
  • threshold: Desired probability that a subsequence is error-free (default: 0.95)
  • sequence_length: Maximum sequence length to consider for plotting
  • plot_result: If true, returns a plot of probability vs. length

Returns

  • If plot_result=false: Integer representing optimal subsequence length
  • If plot_result=true: Tuple of (optimal_length, plot)

Examples

# Single error rate
optimal_subsequence_length(error_rate=0.01)

# Array of error rates
optimal_subsequence_length(error_rate=[0.01, 0.02, 0.01])

# With more stringent threshold
optimal_subsequence_length(error_rate=0.01, threshold=0.99)

# Generate plot
length, p = optimal_subsequence_length(error_rate=0.01, plot_result=true)
Plots.display(p)
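
# Sketch of the underlying calculation (an assumption, not the package source):
# with independent per-base errors at rate e, a subsequence of length L is
# error-free with probability (1 - e)^L, so the longest L that still meets the
# threshold is floor(log(threshold) / log(1 - e)). An array of rates is
# averaged here, which is also an assumption.

```julia
# Hypothetical helper; expects 0 < error rate < 1.
function optimal_length_sketch(; error_rate, threshold::Float64 = 0.95)
    mean_rate = sum(error_rate) / length(error_rate)
    return floor(Int, log(threshold) / log(1 - mean_rate))
end

sketch_length = optimal_length_sketch(error_rate = 0.01)
```

# For a 1% error rate and a 0.95 threshold this gives 5, since 0.99^5 ≈ 0.951
# while 0.99^6 ≈ 0.941.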
source
Mycelia.pairwise_distance_matrixMethod
pairwise_distance_matrix(
    matrix;
    dist_func = Distances.euclidean,
    show_progress = true,
    progress_desc = "Computing distances"
)

Compute a symmetric pairwise distance matrix between columns of matrix using the supplied distance function.

Arguments

  • matrix: Column-major matrix (features as rows, entities as columns)
  • dist_func: Function of the form f(a, b) returning the distance between two vectors (default: Distances.euclidean)
  • show_progress: Display progress bar if true (default: true)
  • progress_desc: Progress bar description (default: "Computing distances")

Returns

  • Symmetric N×N matrix of pairwise distances between columns (entities)
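
The computation can be sketched by evaluating the distance once per column pair and mirroring it across the diagonal. This is a minimal illustration without the progress bar; `euclidean_sketch` and `pairwise_distance_sketch` are hypothetical names standing in for Distances.euclidean and the real function.

```julia
import LinearAlgebra

# Stand-in for Distances.euclidean, so the sketch needs only the standard library.
euclidean_sketch(a, b) = LinearAlgebra.norm(a .- b)

function pairwise_distance_sketch(matrix::AbstractMatrix; dist_func = euclidean_sketch)
    n = size(matrix, 2)
    distances = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        # Compute each pair once and fill both triangles.
        d = dist_func(view(matrix, :, i), view(matrix, :, j))
        distances[i, j] = d
        distances[j, i] = d
    end
    return distances
end

counts = [1.0 4.0 1.0;
          1.0 5.0 1.0]
D = pairwise_distance_sketch(counts)
```

Columns 1 and 3 are identical, so their distance is zero, while columns 1 and 2 differ by (3, 4) and are 5 apart.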
source
Mycelia.parallel_pyrodigalMethod
parallel_pyrodigal(normalized_fastas::Vector{String})

Runs Mycelia.run_pyrodigal on a list of FASTA files in parallel using Threads.

Arguments

  • normalized_fastas: A vector of strings, where each string is a path to a FASTA file.

Returns

A tuple containing two elements:

  1. successes (Vector{Tuple{String, Any}}): A vector of tuples, where each tuple contains the filename and the result returned by a successful Mycelia.run_pyrodigal call.
  2. failures (Vector{Tuple{String, String}}): A vector of tuples, where each tuple contains the filename and the error message string for a failed Mycelia.run_pyrodigal call.

source
Mycelia.parse_blast_reportMethod
parse_blast_report(blast_report) -> DataFrames.DataFrame

Parse a BLAST output file into a structured DataFrame.

Expects BLAST output format 7. Format 6 (tabular without comments) lacks the header comment lines and will not auto-parse.

Arguments

  • blast_report::AbstractString: Path to a BLAST output file in format 7 (tabular with comments)

Returns

  • DataFrame: Table containing BLAST results with columns matching the header fields. Returns empty DataFrame if no hits found.

Details

  • Requires BLAST output format 7 (-outfmt 7), which includes header comments
  • Handles missing values (encoded as "N/A") automatically
  • Infers column types based on BLAST field names
  • Supports standard BLAST tabular fields including sequence IDs, scores, alignments and taxonomic information
source
Mycelia.parse_gfaMethod
parse_gfa(gfa) -> MetaGraphs.MetaGraph{Int64, Float64}

Parse a GFA (Graphical Fragment Assembly) file into a MetaGraph representation.

Arguments

  • gfa: Path to GFA format file

Returns

A MetaGraph where:

  • Vertices represent segments (contigs)
  • Edges represent links between segments
  • Vertex properties include :id with segment identifiers
  • Graph property :records contains the original FASTA records

Format Support

Handles standard GFA v1 lines:

  • H: Header lines (skipped)
  • S: Segments (stored as nodes with FASTA records)
  • L: Links (stored as edges)
  • P: Paths (stored in paths dictionary)
  • A: hifiasm-specific lines (skipped)
source
Mycelia.parse_jsonlMethod
parse_jsonl(filepath::String) -> Vector{Dict{String, Any}}

Validate and parse a JSON Lines file (.jsonl or .ndjson, optionally gzipped) into a vector of dictionaries, reporting progress in bytes processed.

Validations performed:

  • Extension must be one of: .jsonl, .ndjson, .jsonl.gz, .ndjson.gz
  • File must exist
  • File size must be non-zero

Progress meter shows bytes read from the underlying file (compressed bytes for .gz). No second full pass is needed.
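
The three validations can be sketched as below. `validate_jsonl_path_sketch` is a hypothetical helper, and the choice of ArgumentError is an assumption; the real function may report failures differently.

```julia
function validate_jsonl_path_sketch(filepath::String)
    valid_extensions = (".jsonl", ".ndjson", ".jsonl.gz", ".ndjson.gz")
    any(ext -> endswith(filepath, ext), valid_extensions) ||
        throw(ArgumentError("unsupported extension: $(filepath)"))
    isfile(filepath) || throw(ArgumentError("file does not exist: $(filepath)"))
    filesize(filepath) > 0 || throw(ArgumentError("file is empty: $(filepath)"))
    return true
end

# Exercise the checks against a small temporary file.
checked = mktempdir() do dir
    path = joinpath(dir, "records.jsonl")
    write(path, "{\"id\": 1}\n")
    validate_jsonl_path_sketch(path)
end
```

Running the cheap checks first means a bad path fails before any decompression or parsing work begins.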

source
Mycelia.parse_mmseqs_easy_taxonomy_lca_tsvMethod
parse_mmseqs_easy_taxonomy_lca_tsv(
    lca_tsv
) -> DataFrames.DataFrame

Parse the taxonomic Last Common Ancestor (LCA) TSV output from MMseqs2's easy-taxonomy workflow.

Arguments

  • lca_tsv: Path to the TSV file containing MMseqs2 taxonomy results

Returns

DataFrame with columns:

  • contig_id: Sequence identifier
  • taxon_id: NCBI taxonomy identifier
  • taxon_rank: Taxonomic rank (e.g. species, genus)
  • taxon_name: Scientific name
  • fragments_retained: Number of fragments kept
  • fragments_taxonomically_assigned: Number of fragments with taxonomy
  • fragments_in_agreement_with_assignment: Fragments matching contig taxonomy
  • support -log(E-value): Statistical support score
source
Mycelia.parse_mmseqs_easy_taxonomy_tophit_reportMethod
parse_mmseqs_easy_taxonomy_tophit_report(
    tophit_report
) -> DataFrames.DataFrame

Parse an MMseqs2 easy-taxonomy tophit report into a structured DataFrame.

Arguments

  • tophit_report::String: Path to the MMseqs2 easy-taxonomy tophit report file (tab-delimited)

Returns

  • DataFrame: A DataFrame with columns:
    • target_id: Target sequence identifier
    • number of sequences aligning to target: Count of aligned sequences
    • unique coverage of target: Ratio of uniqueAlignedResidues to targetLength
    • Target coverage: Ratio of alignedResidues to targetLength
    • Average sequence identity: Mean sequence identity
    • taxon_id: Taxonomic identifier
    • taxon_rank: Taxonomic rank
    • taxon_name: Species name and lineage
source
Mycelia.parse_mmseqs_tophit_alnMethod
parse_mmseqs_tophit_aln(tophit_aln) -> DataFrames.DataFrame

Parse MMseqs2 tophit alignment output file into a structured DataFrame.

Arguments

  • tophit_aln::AbstractString: Path to tab-delimited MMseqs2 alignment output file

Returns

DataFrame with columns:

  • query: Query sequence/profile identifier
  • target: Target sequence/profile identifier
  • percent identity: Sequence identity percentage
  • alignment length: Length of alignment
  • number of mismatches: Count of mismatched positions
  • number of gaps: Count of gap openings
  • query start: Start position in query sequence
  • query end: End position in query sequence
  • target start: Start position in target sequence
  • target end: End position in target sequence
  • evalue: E-value of alignment
  • bit score: Bit score of alignment
source
Mycelia.parse_qualimap_contig_coverageMethod
parse_qualimap_contig_coverage(
    qualimap_report_txt
) -> DataFrames.DataFrame

Parse contig coverage statistics from a Qualimap BAM QC report file.

Arguments

  • qualimap_report_txt::String: Path to Qualimap bamqc report text file

Returns

  • DataFrame: Coverage statistics with columns:
    • Contig: Contig identifier
    • Length: Contig length in bases
    • Mapped bases: Number of bases mapped to contig
    • Mean coverage: Average coverage depth
    • Standard Deviation: Standard deviation of coverage
    • % Mapped bases: Percentage of total mapped bases on this contig

Supported Assemblers

Handles output from both SPAdes and MEGAHIT assemblers:

  • SPAdes format: NODE_X_length_Y_cov_Z
  • MEGAHIT format: kXX_Y

The relevant section of the Qualimap bamqc text report looks like the following:

# this is spades
>>>>>>> Coverage per contig

	NODE_1_length_107478_cov_9.051896	107478	21606903	201.0355886786133	60.39424208607496
	NODE_2_length_5444_cov_1.351945	5444	153263	28.152645113886848	5.954250612823136
	NODE_3_length_1062_cov_0.154390	1062	4294	4.043314500941619	1.6655384692688975
	NODE_4_length_776_cov_0.191489	776	3210	4.13659793814433	2.252009588980858

# below is megahit
>>>>>>> Coverage per contig

	k79_175	235	3862	16.43404255319149	8.437436249612457
	k79_89	303	3803	12.551155115511552	5.709975376279777
	k79_262	394	6671	16.931472081218274	7.579217802849293
	k79_90	379	1539	4.060686015831134	1.2929729111266581
	k79_91	211	3749	17.767772511848342	11.899185693011933
	k79_0	2042	90867	44.49902056807052	18.356525483516613

To make this more robust, consider reading the contig names from the assembled FASTA.
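
One assembler-agnostic way to read this section is to split each row on tabs instead of matching contig-name patterns. The sketch below assumes five tab-separated fields per row, as in the samples above; `coverage_rows_sketch` and its field names are hypothetical and do not reflect Mycelia's implementation.

```julia
report = """
>>>>>>> Coverage per contig

\tNODE_1_length_107478_cov_9.051896\t107478\t21606903\t201.0355886786133\t60.39424208607496
\tk79_175\t235\t3862\t16.43404255319149\t8.437436249612457
"""

function coverage_rows_sketch(text::AbstractString)
    rows = NamedTuple[]
    in_section = false
    for line in split(text, '\n')
        if startswith(line, ">>>>>>> Coverage per contig")
            in_section = true
        elseif in_section && !isempty(strip(line))
            fields = split(strip(line), '\t')
            # Stop at the first row that does not look like a coverage record.
            length(fields) == 5 || break
            push!(rows, (
                contig = String(fields[1]),
                contig_length = parse(Int, fields[2]),
                mapped_bases = parse(Int, fields[3]),
                mean_coverage = parse(Float64, fields[4]),
                standard_deviation = parse(Float64, fields[5]),
            ))
        end
    end
    return rows
end

rows = coverage_rows_sketch(report)
```

Because the contig name is taken as the first field, the same code handles both SPAdes and MEGAHIT naming.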

source
Mycelia.parse_rtg_eval_outputMethod
parse_rtg_eval_output(f) -> DataFrames.DataFrame

Parse RTG evaluation output from a gzipped tab-separated file.

Arguments

  • f: Path to a gzipped TSV file containing RTG evaluation output

Format

Expected file format:

  • Header line starting with '#' and tab-separated column names
  • Data rows in tab-separated format
  • Empty files return a DataFrame with empty columns matching header

Returns

A DataFrame where:

  • Column names are taken from the header line (stripped of '#')
  • Data is parsed as Float64 values
  • Empty files result in empty columns preserving header structure
source
Mycelia.parse_transterm_outputMethod
parse_transterm_output(
    transterm_output
) -> DataFrames.DataFrame

Parse TransTerm terminator prediction output into a structured DataFrame.

Takes a TransTerm output file path and returns a DataFrame containing parsed terminator predictions. Each row represents one predicted terminator with the following columns:

  • chromosome: Identifier of the sequence being analyzed
  • term_id: Unique terminator identifier (e.g. "TERM 19")
  • start: Start position of the terminator
  • stop: End position of the terminator
  • strand: Strand orientation ("+" or "-")
  • location: Context type, where:
    • G/g = in gene interior (≥50bp from ends)
    • F/f = between two +strand genes
    • R/r = between two -strand genes
    • T = between ends of +strand and -strand genes
    • H = between starts of +strand and -strand genes
    • N = none of the above
    Lowercase indicates opposite strand from region
  • confidence: Overall confidence score (0-100)
  • hairpin_score: Hairpin structure score
  • tail_score: Tail sequence score
  • notes: Additional annotations (e.g. "bidir")

Arguments

  • transterm_output::AbstractString: Path to TransTerm output file

Returns

  • DataFrame: Parsed terminator predictions with columns as described above

See TransTerm HP documentation for details on scoring and location codes.

source
Mycelia.parse_virsorter_score_tsvMethod
parse_virsorter_score_tsv(
    virsorter_score_tsv
) -> DataFrames.DataFrame

Parse a VirSorter score TSV file and return a DataFrame.

Arguments

  • virsorter_score_tsv::String: The file path to the VirSorter score TSV file.

Returns

  • DataFrame: A DataFrame containing the parsed data from the TSV file. If the file is empty, returns a DataFrame with the appropriate headers but no data.
source
Mycelia.parse_xam_to_summary_tableMethod
parse_xam_to_summary_table(xam) -> DataFrames.DataFrame

Parse a SAM/BAM file into a summary DataFrame containing alignment metadata.

Arguments

  • xam::AbstractString: Path to input SAM (.sam), BAM (.bam), or gzipped SAM (.sam.gz) file

Returns

DataFrame with columns:

  • template: Read name
  • flag: SAM flag
  • reference: Reference sequence name
  • position: Alignment position range (start:end)
  • mappingquality: Mapping quality score
  • alignment_score: Alignment score (AS tag)
  • isprimary: Whether alignment is primary
  • alignlength: Length of the alignment
  • ismapped: Whether read is mapped
  • mismatches: Number of mismatches (NM tag)

Note: Only mapped reads are included in the output DataFrame.

source
Mycelia.path_to_sequenceMethod
path_to_sequence(kmers, path) -> Any

Convert a path through k-mers into a single DNA sequence.

Takes a vector of k-mers and a path representing the order to traverse them, reconstructs the original sequence by joining the k-mers according to the path. The first k-mer is used in full, then only the last nucleotide from each subsequent k-mer is added.

Arguments

  • kmers: Vector of DNA k-mers (as LongDNA{4})
  • path: Vector of tuples representing the path through the k-mers

Returns

  • LongDNA{4}: The reconstructed DNA sequence
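
The reconstruction rule can be illustrated with plain strings. This sketch simplifies the inputs: k-mers are `String`s and the path is a vector of indices into `kmers`, whereas the real function takes LongDNA{4} k-mers and a vector of tuples. `path_to_sequence_sketch` is a hypothetical name.

```julia
function path_to_sequence_sketch(kmers::Vector{String}, path::Vector{Int})
    # Start with the first k-mer in full.
    sequence = kmers[first(path)]
    for index in path[2:end]
        # Each later k-mer overlaps by k-1, so only its last base is new.
        sequence *= kmers[index][end]
    end
    return sequence
end

kmers = ["ATG", "TGC", "GCA"]
sequence = path_to_sequence_sketch(kmers, [1, 2, 3])
```

Three overlapping 3-mers therefore collapse into the 5-base sequence ATGCA.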
source
Mycelia.pca_transformMethod
pca_transform(
  M::AbstractMatrix{<:Real};
  k::Int = 0,
  var_prop::Float64 = 1.0
)

Perform standard PCA on M (features × samples), returning enough PCs to either:

  • match a user‐supplied k > 0, or
  • explain at least var_prop of the total variance (0 < var_prop ≤ 1).

By default (k=0, var_prop=1.0), this will capture all variance, i.e. use min(n_samples-1, n_features) components.

When to use

Use for real-valued, continuous, and approximately Gaussian data. PCA is most suitable when features are linearly related and data is centered and scaled. Not ideal for count, binary, or highly skewed data.

Returns

A NamedTuple with fields

  • model : the fitted MultivariateStats.PCA object
  • scores : k×n_samples matrix of PC scores
  • loadings : k×n_features matrix of PC loadings
  • chosen_k : the number of components actually used
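
The component-selection rule can be sketched with a singular value decomposition of the centered matrix. This is an illustration under assumptions, not the package's code: `choose_num_components` is a hypothetical helper, and clamping a user-supplied k to the maximum is assumed behaviour.

```julia
import LinearAlgebra

function choose_num_components(M::AbstractMatrix{<:Real}; k::Int = 0, var_prop::Float64 = 1.0)
    n_features, n_samples = size(M)
    max_k = min(n_samples - 1, n_features)
    k > 0 && return min(k, max_k)          # assumed clamp
    var_prop >= 1.0 && return max_k        # default: keep all variance
    # Center each feature (row), then read variances off the singular values.
    centered = M .- sum(M, dims = 2) ./ n_samples
    variances = LinearAlgebra.svdvals(centered) .^ 2
    explained = cumsum(variances) ./ sum(variances)
    chosen = findfirst(>=(var_prop), explained)
    return min(something(chosen, max_k), max_k)
end

M = [1.0 2.0 3.0 4.0;
     2.0 4.0 6.0 8.0;
     0.0 0.1 0.0 -0.1]
k_for_90_percent = choose_num_components(M; var_prop = 0.9)
```

In this example the first two features are perfectly correlated and dominate the variance, so a single component already explains more than 90% of it.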
source
Mycelia.pcoa_from_distMethod
pcoa_from_dist(D::AbstractMatrix{<:Real}; maxoutdim::Int = 2)

Perform Principal Coordinates Analysis directly from a precomputed distance matrix D (nsamples×nsamples).

Keyword arguments

  • maxoutdim : target embedding dimension (default=2)

Returns

NamedTuple with fields

  • model : the fitted MultivariateStats.MDS model
  • coordinates: maxoutdim×n_samples matrix of embedded coordinates
source
Mycelia.phred_to_probabilityMethod
phred_to_probability(phred_score::UInt8) -> Float64

Convert a PHRED quality score to the probability that the base call is correct.

A PHRED score Q relates to the error probability P by Q = -10 * log10(P). Therefore, the correctness probability is 1 - P = 1 - 10^(-Q/10).
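
The relationship translates directly into code. A minimal sketch, with `phred_to_probability_sketch` as a hypothetical stand-in for the real function:

```julia
# Convert to Float64 before negating, since negating a UInt8 would wrap around.
phred_to_probability_sketch(phred_score::UInt8) = 1.0 - 10.0^(-Float64(phred_score) / 10.0)

correctness = [phred_to_probability_sketch(UInt8(q)) for q in (10, 20, 30)]
```

Each additional 10 PHRED points removes one more factor of ten from the error probability: Q10, Q20, and Q30 correspond to 90%, 99%, and 99.9% correctness.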

source
Mycelia.pixels_to_pointsMethod
pixels_to_points(pixels) -> Any

Convert pixel measurements to point measurements using the standard 4:3 ratio.

Points are the standard unit for typography (1 point = 1/72 inch), while pixels are used for screen measurements. This conversion uses the conventional 4:3 ratio where 3 points equal 4 pixels.

Arguments

  • pixels: The number of pixels to convert

Returns

  • The equivalent measurement in points
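
The 4:3 ratio reduces to a single multiplication. A minimal sketch, with `pixels_to_points_sketch` as a hypothetical name:

```julia
# 4 pixels correspond to 3 points, so scale by 3/4.
pixels_to_points_sketch(pixels) = pixels * 3 / 4

points = pixels_to_points_sketch(16)
```

A 16-pixel measurement therefore corresponds to 12 points.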
source
Mycelia.plot_embeddingsMethod

Plot embeddings with optional true and fitted cluster labels using Makie.jl. Points are colored by fitted label and shaped by true label, and the legend is placed outside the axes.

Arguments

  • embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data point
  • title::String: Title of the plot
  • xlabel::String: Label for the x-axis
  • ylabel::String: Label for the y-axis
  • true_labels::Vector{<:Integer}: Vector of true cluster labels (optional)
  • fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)

Returns

  • Makie.Figure: Figure object that can be displayed or saved
source
Mycelia.plot_graphMethod
plot_graph(graph) -> Any

Creates a visualization of a kmer graph where nodes represent kmers and their sizes reflect counts.

Arguments

  • graph: A MetaGraph where vertices have :kmer and :count properties

Returns

  • A Plots.jl plot object showing the graph visualization

Details

  • Node sizes are scaled based on kmer counts
  • Plot dimensions scale logarithmically with number of vertices
  • Each node is labeled with its kmer sequence
source
Mycelia.plot_kmer_frequency_spectraMethod
plot_kmer_frequency_spectra(
    counts;
    log_scale,
    kwargs...
) -> Plots.Plot

Creates a scatter plot visualizing the k-mer frequency spectrum: the relationship between k-mer frequencies and how many k-mers occur at each frequency. Returns the plot object so that additional layers can be added before saving.

Arguments

  • counts::AbstractVector{<:Integer}: Vector of k-mer counts/frequencies
  • log_scale::Union{Function,Nothing} = log2: Function to apply logarithmic scaling to both axes. Set to nothing to use linear scaling.
  • kwargs...: Additional keyword arguments passed to StatsPlots.plot()

Returns

  • Plots.Plot: A scatter plot object that can be further modified or saved

Details

The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.
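
To make the two axes concrete, the following sketch tallies a frequency spectrum from a toy vector of per-k-mer counts and applies the same log2 transform. The helper name frequency_spectrum is hypothetical and only the Julia standard library is used; no plot is drawn.

```julia
# Hypothetical helper: tally how many distinct k-mers occur at each frequency.
function frequency_spectrum(counts::AbstractVector{<:Integer})
    spectrum = Dict{Int, Int}()
    for c in counts
        spectrum[c] = get(spectrum, c, 0) + 1
    end
    freqs = sort(collect(keys(spectrum)))
    return freqs, [spectrum[f] for f in freqs]
end

counts = [1, 1, 1, 2, 2, 8]                   # toy per-k-mer counts
freqs, n_kmers = frequency_spectrum(counts)   # x values and y values before scaling
xs, ys = log2.(freqs), log2.(n_kmers)         # what a log2-scaled plot would show
```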

source
Mycelia.plot_kmer_rarefactionMethod
plot_kmer_rarefaction(
    rarefaction_data_path::AbstractString;
    output_dir,
    output_basename,
    display_plot,
    fig_size,
    title,
    xlabel,
    ylabel,
    line_color,
    line_style,
    marker,
    markersize,
    axis_kwargs...
) -> Union{Nothing, Makie.Figure}

Plots a k-mer rarefaction curve from data stored in a TSV file. The TSV file should contain two columns:

  1. Number of FASTA files processed.
  2. Cumulative unique k-mers observed at that point.

The plot is saved in PNG, PDF, and SVG formats and, when display_plot is true, displayed interactively.

Arguments

  • rarefaction_data_path::AbstractString: Path to the TSV file containing rarefaction data.
  • output_dir::AbstractString: Directory where the output plots will be saved. Defaults to the directory of rarefaction_data_path.
  • output_basename::AbstractString: Basename for the output plot files (without extension). Defaults to the basename of rarefaction_data_path without its original extension.
  • display_plot::Bool: Whether to display the plot interactively. Defaults to true.

Keyword Arguments

  • fig_size::Tuple{Int, Int}: Size of the output figure, e.g., (1000, 750).
  • title::AbstractString: Title of the plot.
  • xlabel::AbstractString: Label for the x-axis.
  • ylabel::AbstractString: Label for the y-axis.
  • line_color: Color of the plotted line.
  • line_style: Style of the plotted line (e.g. :dash, :dot).
  • marker: Marker style for points (e.g. :circle, :xcross).
  • markersize::Number: Size of the markers.
  • Any other keyword arguments will be passed to Makie.Axis.
source
Mycelia.plot_optimal_cluster_assessment_resultsMethod
plot_optimal_cluster_assessment_results(
    clustering_results
) -> Any

Visualizes cluster assessment metrics and saves the resulting plots.

Arguments

  • clustering_results: A named tuple containing:
    • ks_assessed: Vector of k values tested
    • within_cluster_sum_of_squares: Vector of WCSS scores
    • silhouette_scores: Vector of silhouette scores
    • optimal_number_of_clusters: Integer indicating optimal k

Details

Creates two plots:

  1. Within-cluster sum of squares (WCSS) vs number of clusters
  2. Silhouette scores vs number of clusters

Both plots include a vertical line indicating the optimal number of clusters.

Outputs

Saves two SVG files in the project directory:

  • wcss.svg: WCSS plot
  • silhouette.svg: Silhouette scores plot
source
Mycelia.plot_per_base_qualityMethod
plot_per_base_quality(fastq_file::String; max_position::Union{Int,Nothing}=nothing, sample_size::Union{Int,Nothing}=nothing)

Create per-base quality boxplots for FASTQ data, similar to FastQC output.

Arguments

  • fastq_file::String: Path to FASTQ file to analyze
  • max_position::Union{Int,Nothing}=nothing: Maximum read position to plot (default: auto-detect from data)
  • sample_size::Union{Int,Nothing}=nothing: Number of reads to sample for analysis (default: use all reads)

Returns

  • Plots.Plot: Boxplot showing quality distribution at each base position

Examples

# Basic per-base quality plot
p = Mycelia.plot_per_base_quality("reads.fastq")

# Plot first 100 positions only, sampling 10000 reads
p = Mycelia.plot_per_base_quality("reads.fastq", max_position=100, sample_size=10000)

Notes

  • Quality scores are displayed in Phred scale
  • Green zone: Q>=30 (high quality)
  • Yellow zone: Q20-29 (medium quality)
  • Red zone: Q<20 (low quality)
  • For large files, consider using sample_size to improve performance
source
Mycelia.plot_taxa_abundancesMethod
plot_taxa_abundances(
    df::DataFrames.DataFrame, 
    taxa_level::String; 
    top_n::Int = 10,
    sample_id_col::String = "sample_id",
    filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
    figure_width::Int = 1500,
    figure_height::Int = 1000,
    bar_width::Float64 = 0.7,
    x_rotation::Int = 45,
    sort_samples::Bool = true,
    color_seed::Union{Int, Nothing} = nothing,
    legend_fontsize::Float64 = 12.0,
    legend_itemsize::Float64 = 12.0,
    legend_padding::Float64 = 5.0,
    legend_rowgap::Float64 = 1.0,
    legend_labelwidth::Union{Nothing, Float64} = nothing,
    legend_titlesize::Float64 = 15.0,
    legend_nbanks::Int = 1
)

Create a stacked bar chart showing taxa relative abundances for each sample.

Arguments

  • df: DataFrame with sample_id and taxonomic assignments at different levels
  • taxa_level: Taxonomic level to analyze (e.g., "genus", "species")
  • top_n: Number of top taxa to display individually, remainder grouped as "Other"
  • sample_id_col: Column name containing sample identifiers
  • filter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)
  • figure_width: Width of the figure in pixels
  • figure_height: Height of the figure in pixels
  • bar_width: Width of each bar (between 0 and 1)
  • x_rotation: Rotation angle for x-axis labels in degrees
  • sort_samples: Whether to sort samples alphabetically
  • color_seed: Seed for reproducible color generation
  • legend_fontsize: Font size for legend entries
  • legend_itemsize: Size of the colored marker/icon in the legend
  • legend_padding: Padding around legend elements
  • legend_rowgap: Space between legend rows
  • legend_labelwidth: Maximum width for legend labels (truncation)
  • legend_titlesize: Font size for legend title
  • legend_nbanks: Number of legend columns

Returns

  • fig: CairoMakie figure object
  • ax: CairoMakie axis object
  • taxa_colors: Dictionary mapping taxa to their assigned colors
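
The top_n grouping can be pictured with the sketch below, which works on the taxon assignments of a single sample. It is an assumption about the general approach, not Mycelia's code; the helper name and the tie-breaking rule (alphabetical) are hypothetical.

```julia
# Hypothetical helper: keep the top_n taxa by count and pool the rest as "Other".
function top_taxa_relative_abundance(taxa::Vector{String}; top_n::Int = 10)
    totals = Dict{String, Int}()
    for t in taxa
        totals[t] = get(totals, t, 0) + 1
    end
    # Rank by descending count, breaking ties alphabetically
    ranked = sort(collect(totals); by = p -> (-p.second, p.first))
    kept = ranked[1:min(top_n, length(ranked))]
    result = Dict(p.first => p.second / length(taxa) for p in kept)
    other = sum((p.second for p in ranked[length(kept)+1:end]); init = 0)
    other > 0 && (result["Other"] = other / length(taxa))
    return result
end

sample_taxa = ["A", "A", "A", "B", "B", "C", "D"]
abundances = top_taxa_relative_abundance(sample_taxa; top_n = 2)
```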
source
Mycelia.points_to_pixelsMethod
points_to_pixels(points) -> Any

Convert typographic points to pixels using a 4:3 ratio (1 point = 4/3 pixels).

Arguments

  • points: Size in typographic points (pt)

Returns

  • Size in pixels (px)
source
Mycelia.poisson_pca_epcaMethod

poisson_pca_epca(M::AbstractMatrix{<:Integer}; k::Int=0)

Perform Poisson EPCA on a count matrix M (features × samples).

When to use

Use for non-negative integer count data, such as raw event or read counts.

Returns

A NamedTuple with

  • model : the fitted ExpFamilyPCA.PoissonEPCA object
  • scores : k×n_samples matrix of low‐dimensional sample scores
  • loadings : k×n_features matrix of feature loadings
source
Mycelia.polish_assemblyMethod
polish_assembly(assembly::AssemblyResult, reads; iterations=3) -> AssemblyResult

Polish assembled contigs using quality-aware error correction.

Arguments

  • assembly: Initial assembly result to polish
  • reads: Original reads for polishing (FASTQ with quality scores preferred)
  • iterations: Number of polishing iterations (default: 3)

Returns

  • AssemblyResult: Polished assembly with improved accuracy

Details

Uses Phase 2 enhanced Viterbi algorithms with quality score integration for:

  • Error correction based on k-mer graph traversals
  • Consensus calling from multiple observations
  • Iterative improvement until convergence
source
Mycelia.polish_fastqMethod
polish_fastq(; fastq, k)

Polish FASTQ reads using a k-mer graph-based approach to correct potential sequencing errors.

Arguments

  • fastq::String: Path to input FASTQ file
  • k::Int=1: Initial k-mer size parameter. Final assembly k-mer size may differ.

Process

  1. Builds a directed k-mer graph from input reads
  2. Processes each read through the graph to find optimal paths
  3. Writes corrected reads to a new FASTQ file
  4. Automatically compresses output with gzip

Returns

Named tuple with:

  • fastq::String: Path to output gzipped FASTQ file
  • k::Int: Final assembly k-mer size used
source
Mycelia.polish_sequence_nextFunction
polish_sequence_next(graph::MetaGraph, sequence::String, config::ViterbiConfig) -> ViterbiPath

Polish a single sequence using Viterbi algorithm on k-mer graph.

source
Mycelia.position_wise_joint_probabilityMethod
position_wise_joint_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.

Arguments

  • qualmers: Vector of Qualmer observations of the same k-mer sequence
  • use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

  • Float64: Position-wise joint probability
source
Mycelia.precision_recall_f1Method
precision_recall_f1(true_labels, pred_labels)

Returns dictionaries mapping each label to its precision, recall, and F1 score. Also returns macro-averaged (unweighted mean) precision, recall, F1, and a grouped bar plot.
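
A minimal sketch of the per-label and macro-averaged metrics follows. The helper name per_label_prf is hypothetical, the plot is omitted, and treating an undefined ratio (zero denominator) as 0.0 is an assumption rather than documented behavior.

```julia
# Hypothetical helper: per-label precision, recall, and F1 plus the macro-averaged F1.
function per_label_prf(true_labels, pred_labels)
    labels = sort(unique(vcat(true_labels, pred_labels)))
    T = eltype(labels)
    prec, rec, f1 = Dict{T, Float64}(), Dict{T, Float64}(), Dict{T, Float64}()
    for l in labels
        tp = count(i -> true_labels[i] == l && pred_labels[i] == l, eachindex(true_labels))
        fp = count(i -> true_labels[i] != l && pred_labels[i] == l, eachindex(true_labels))
        fn = count(i -> true_labels[i] == l && pred_labels[i] != l, eachindex(true_labels))
        p = tp + fp == 0 ? 0.0 : tp / (tp + fp)   # assumption: undefined ratios become 0.0
        r = tp + fn == 0 ? 0.0 : tp / (tp + fn)
        prec[l], rec[l] = p, r
        f1[l] = p + r == 0 ? 0.0 : 2 * p * r / (p + r)
    end
    macro_f1 = sum(values(f1)) / length(f1)       # unweighted mean over labels
    return (precision = prec, recall = rec, f1 = f1, macro_f1 = macro_f1)
end

metrics = per_label_prf([1, 1, 2, 2], [1, 2, 2, 2])
```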

source
Mycelia.prefetchMethod
prefetch(; SRR, outdir)

Downloads Sequence Read Archive (SRA) data using the prefetch tool from sra-tools.

Arguments

  • SRR: SRA accession number (e.g., "SRR12345678")
  • outdir: Directory where the downloaded data will be saved. Defaults to current directory.

Notes

  • Requires sra-tools which will be installed in a Conda environment
  • Downloads are saved in .sra format
  • Internet connection required
source
Mycelia.prefetch_sra_runsMethod

Prefetches multiple SRA runs in parallel.

Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.

Arguments

  • srr_identifiers: Vector of SRA run identifiers
  • outdir: Output directory for prefetched files (default: current directory)
  • max_parallel: Maximum number of parallel downloads (default: 4)

Returns

Vector of named tuples with prefetch results for each SRA run

Example

runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")
source
Mycelia.probabilistic_walk_nextMethod
probabilistic_walk_next(
    graph::MetaGraphsNext.MetaGraph,
    start_vertex::String,
    max_steps::Int64;
    seed
) -> Mycelia.GraphPath

Perform a probabilistic walk through the strand-aware k-mer graph.

This algorithm follows edges based on their probability weights, respecting strand orientation constraints. The walk continues until max_steps is reached or no valid transitions are available.

Arguments

  • graph: MetaGraphsNext k-mer graph with strand-aware edges
  • start_vertex: Starting k-mer (vertex label)
  • max_steps: Maximum number of steps to take
  • seed: Random seed for reproducibility (optional)

Returns

  • GraphPath: Complete path with probability information

Algorithm

  1. Start at given vertex with forward strand orientation
  2. At each step, calculate transition probabilities based on edge weights
  3. Sample next vertex according to probabilities
  4. Update cumulative probability and continue
  5. Respect strand orientation constraints from edge metadata
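
The sampling steps above can be sketched on a toy graph stored as a plain dictionary. This is a simplified illustration, not Mycelia's implementation: the function name weighted_walk and the edge representation are hypothetical, and strand orientation is not modeled.

```julia
using Random

# Hypothetical toy version: `edges` maps a vertex to its (neighbor, weight) pairs.
function weighted_walk(edges::Dict{String, Vector{Tuple{String, Float64}}},
                       start::String, max_steps::Int; seed = nothing)
    rng = seed === nothing ? Random.default_rng() : MersenneTwister(seed)
    path, logp, current = [start], 0.0, start
    for _ in 1:max_steps
        out = get(edges, current, Tuple{String, Float64}[])
        isempty(out) && break                    # no valid transitions left
        total = sum(w for (_, w) in out)
        r, acc, chosen = rand(rng) * total, 0.0, length(out)
        for (i, (_, w)) in enumerate(out)        # roulette-wheel selection
            acc += w
            if r < acc
                chosen = i
                break
            end
        end
        nxt, wt = out[chosen]
        logp += log(wt / total)                  # accumulate the path log-probability
        push!(path, nxt)
        current = nxt
    end
    return path, exp(logp)
end

toy_edges = Dict("A" => [("B", 1.0)], "B" => [("C", 2.0)])
toy_path, toy_prob = weighted_walk(toy_edges, "A", 10; seed = 1)
```

Accumulating the probability in log space avoids underflow on long walks, where the product of many small transition probabilities would otherwise round to zero.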

Example

graph = build_kmer_graph_next(DNAKmer{15}, observations)
path = probabilistic_walk_next(graph, "ATCGATCGATCGATC", 100)
println("Assembled sequence: $(path.sequence)")
println("Path probability: $(path.total_probability)")
source
Mycelia.process_fastq_recordMethod
process_fastq_record(
;
    record,
    kmer_graph,
    yen_k_shortest_paths_and_weights,
    yen_k
)

Process and error-correct a FASTQ sequence record using a kmer graph and path resampling.

Arguments

  • record: FASTQ record containing the sequence to process
  • kmer_graph: MetaGraph containing the kmer network and associated properties
  • yen_k_shortest_paths_and_weights: Cache of pre-computed k-shortest paths between nodes
  • yen_k: Number of alternative paths to consider during resampling (default: 3)

Description

Performs error correction by:

  1. Trimming low-quality sequence ends
  2. Identifying stretches requiring resampling between solid branching kmers
  3. Selecting alternative paths through the kmer graph based on:
    • Path quality scores
    • Transition likelihoods
    • Path length similarity to original sequence

Returns

  • Modified FASTQ record with error-corrected sequence and updated quality scores
  • Original record if no error correction was needed

Required Graph Properties

The kmer_graph must contain the following properties:

  • :ordered_kmers
  • :likely_valid_kmer_indices
  • :kmer_indices
  • :branching_nodes
  • :assembly_k
  • :transition_likelihoods
  • :kmer_mean_quality
  • :kmer_total_quality
source
Mycelia.q_value_to_error_rateMethod
q_value_to_error_rate(q_value) -> Any

Convert a Phred quality score (Q-value) to a probability of error.

Arguments

  • q_value: Phred quality score, typically ranging from 0 to 40

Returns

  • Error probability in range [0,1], where 0 indicates highest confidence

A Q-value of 10 corresponds to an error rate of 0.1 (10%), while a Q-value of 30 corresponds to an error rate of 0.001 (0.1%).
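
The relationship is the standard Phred definition, P = 10^(-Q/10). A small sketch with hypothetical helper names, including the inverse mapping:

```julia
# Hypothetical helpers for the Phred definition; not Mycelia's own functions.
q_to_error(q::Real) = 10.0^(-(q / 10))   # parenthesized so unsigned inputs do not wrap
error_to_q(p::Real) = -10 * log10(p)

q_to_error(10)     # error rate for Q10
q_to_error(30)     # error rate for Q30
error_to_q(0.01)   # Q-value for a 1% error rate
```

Dividing before negating matters when the score is stored as an unsigned integer such as UInt8, because negating an unsigned value wraps around instead of producing a negative number.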

source
Mycelia.qc_filter_long_reads_fastplongMethod
qc_filter_long_reads_fastplong(
;
    in_fastq,
    report_title,
    out_fastq,
    html_report,
    json_report,
    min_length,
    max_length
)

Perform QC filtering on long-read FASTQ files using fastplong.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • quality_threshold::Int: Minimum average quality to retain a read (default 10).
  • min_length::Int: Minimum read length (default 1000).
  • max_length::Int=0: Maximum read length (default 0, no maximum).

Returns

  • String: Path to the filtered FASTQ file.

Details

This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.

source
Mycelia.qc_filter_long_reads_filtlongMethod
qc_filter_long_reads_filtlong(
;
    in_fastq,
    out_fastq,
    min_mean_q,
    keep_percent
)

Filter and process long reads from a FASTQ file using Filtlong.

This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.
  • min_mean_q::Int: Minimum mean quality score for reads to be kept. Default is 20.
  • keep_percent::Int: Percentage of reads to keep after filtering. Default is 95.

Returns

  • out_fastq

Details

This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.

source
Mycelia.qc_filter_short_reads_fastpMethod
qc_filter_short_reads_fastp(
;
    forward_reads,
    reverse_reads,
    out_forward,
    out_reverse,
    report_title,
    html,
    json
)

Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • adapter_seq::String: Adapter sequence to trim.
  • quality_threshold::Int: Minimum phred score for trimming (default 20).
  • min_length::Int: Minimum read length to retain (default 50).

Returns

  • String: Path to the filtered and trimmed FASTQ file.

Details

This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length. It’s a simple wrapper that executes the external fastp command.

source
Mycelia.quality_biosequence_graph_to_fastqFunction
quality_biosequence_graph_to_fastq(
    graph::MetaGraphsNext.MetaGraph
) -> Vector{FASTX.FASTQ.Record}
quality_biosequence_graph_to_fastq(
    graph::MetaGraphsNext.MetaGraph,
    output_file::Union{Nothing, AbstractString}
) -> Vector{FASTX.FASTQ.Record}

Convert quality-aware BioSequence vertices back to FASTQ records.

This function demonstrates the key feature of FASTQ graphs - they maintain per-base quality information and can be converted back to FASTQ format.

Arguments

  • graph: Quality-aware BioSequence graph
  • output_file: Path to output FASTQ file (optional)

Returns

  • Vector of FASTX.FASTQ.Record objects

Example

# Convert graph back to FASTQ
fastq_records = quality_biosequence_graph_to_fastq(graph)

# Or write directly to file
quality_biosequence_graph_to_fastq(graph, "output.fastq")
source
Mycelia.quality_string_to_phredMethod
quality_string_to_phred(
    quality_string::AbstractString
) -> Vector{UInt8}

Convert FASTQ quality string to numerical PHRED scores.

Arguments

  • quality_string::AbstractString: Quality string from FASTQ record (e.g., "IIII")

Returns

  • Vector{UInt8}: PHRED quality scores

Examples

scores = quality_string_to_phred("IIII")  # Returns [40, 40, 40, 40]
scores = quality_string_to_phred("!#%+")  # Returns [0, 2, 4, 10]
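
The expected values above are consistent with the Phred+33 encoding, in which each score is the character's ASCII code minus 33. A stand-alone sketch under that assumption (the helper name is hypothetical):

```julia
# Hypothetical helper assuming Phred+33 encoding ('!' = 0, 'I' = 40).
phred33_scores(s::AbstractString) = UInt8[UInt8(c) - 0x21 for c in s]

phred33_scores("IIII")
phred33_scores("!#%+")
```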
source
Mycelia.qualmer_correctness_probabilityMethod
qualmer_correctness_probability(
    qmer::Mycelia.Qualmer
) -> Float64

Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))
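
As a worked instance of the product above, the sketch below evaluates it for a plain vector of quality scores. It sums logarithms rather than multiplying directly, which is an illustrative choice and not necessarily what Mycelia does; the helper name is hypothetical and the Qualmer type is not used.

```julia
# Hypothetical stand-alone version of ∏(1 - 10^(-qi/10)) over a vector of Phred scores.
function kmer_correctness(quals::AbstractVector{<:Integer})
    # log1p(-p) = log(1 - p); summing logs keeps long k-mers numerically stable
    log_p = sum((log1p(-10.0^(-(q / 10))) for q in quals); init = 0.0)
    return exp(log_p)
end

kmer_correctness([10, 10])         # two Q10 bases: 0.9 * 0.9
kmer_correctness(fill(40, 31))     # a 31-mer of Q40 bases
```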

source
Mycelia.qualmer_graph_to_quality_biosequence_graphMethod
qualmer_graph_to_quality_biosequence_graph(
    qualmer_graph::MetaGraphsNext.MetaGraph;
    min_path_length
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#280#282", Float64} where {_A, _B, _C}

Convert a Qualmer graph to a quality-aware BioSequence graph by collapsing linear paths.

This is the primary method for creating quality-aware BioSequence graphs from Qualmer graphs, following the 6-graph hierarchy where FASTQ graphs are simplifications of Qualmer graphs with quality retention.

Arguments

  • qualmer_graph: MetaGraphsNext Qualmer graph to convert
  • min_path_length: Minimum path length to keep (default: 2)

Returns

  • MetaGraphsNext.MetaGraph with quality-aware BioSequence vertices

Example

# Start with qualmer graph
qualmer_graph = build_qualmer_graph(fastq_records)

# Convert to quality-aware BioSequence graph
quality_graph = qualmer_graph_to_quality_biosequence_graph(qualmer_graph)
source
Mycelia.qualmers_canonicalMethod
qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
source
Mycelia.qualmers_canonicalMethod
qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#234#235"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
source
Mycelia.qualmers_fwMethod
qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}
source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#230#231"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#232#233"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}
source
Mycelia.qualmers_unambiguous_canonicalMethod
qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#236#237"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

source
Mycelia.rand_ascii_greek_stringMethod
rand_ascii_greek_string(len::Int) -> String

Generate a random string of printable ASCII and Greek characters of length len.

The string contains random printable ASCII characters and both uppercase and lowercase Greek letters.

source
Mycelia.rand_bmp_printable_stringMethod
rand_bmp_printable_string(len::Int) -> String

Generate a random string of printable Basic Multilingual Plane (BMP) characters of length len.

The string contains random printable BMP characters, excluding surrogate code points.

source
Mycelia.rand_latin1_stringMethod
rand_latin1_string(len::Int) -> String

Generate a random string of printable Latin-1 characters of length len.

The string contains random printable characters from the Latin-1 character set.

source
Mycelia.rand_of_each_groupMethod
rand_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Select one random row from each group in a grouped DataFrame.

Arguments

  • gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

  • DataFrame: A new DataFrame containing exactly one randomly sampled row from each group
source
Mycelia.rand_printable_unicode_stringMethod
rand_printable_unicode_string(len::Int) -> String

Generate a random string of printable Unicode characters of length len.

The string contains random printable Unicode characters, excluding surrogate code points.

source
Mycelia.random_fasta_recordMethod
random_fasta_record(
;
    moltype,
    seed,
    L
) -> FASTX.FASTA.Record

Generates a random FASTA record with a specified molecular type and sequence length.

Arguments

  • moltype::Symbol=:DNA: The type of molecule to generate (:DNA, :RNA, or :AA for amino acids).
  • seed: The random seed used for sequence generation (default: a random integer).
  • L: The length of the sequence (default: a random integer up to typemax(UInt16)).

Returns

  • A FASTX.FASTA.Record containing:
    • A randomly generated UUID identifier.
    • A randomly generated sequence of the specified type.

Errors

  • Throws an error if moltype is not one of :DNA, :RNA, or :AA.
source
Mycelia.random_symmetric_distance_matrixMethod
random_symmetric_distance_matrix(n) -> Any

Generate a random symmetric distance matrix of size n×n with zeros on the diagonal.

Arguments

  • n: Positive integer specifying the matrix dimensions

Returns

  • A symmetric n×n matrix with random values in [0,1), zeros on the diagonal

Details

  • The matrix is symmetric, meaning M[i,j] = M[j,i]
  • Diagonal elements M[i,i] are set to 0.0
  • Off-diagonal elements are uniformly distributed random values
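
One way to satisfy these three properties is to draw each value once and mirror it across the diagonal. The sketch below is an illustration with a hypothetical name, not Mycelia's implementation:

```julia
using Random

# Hypothetical helper: symmetric n×n matrix with a zero diagonal and uniform [0,1) entries.
function random_symmetric_sketch(n::Integer; rng = Random.default_rng())
    M = zeros(n, n)
    for j in 2:n, i in 1:j-1
        M[i, j] = M[j, i] = rand(rng)   # mirror each draw across the diagonal
    end
    return M
end

M = random_symmetric_sketch(4)
```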
source
Mycelia.rclone_copyMethod
rclone_copy(source, dest; config, max_attempts, sleep_timer)

Copy files between local and remote storage using rclone with automated retry logic.

Arguments

  • source::String: Source path or remote (e.g. "local/path" or "gdrive:folder")
  • dest::String: Destination path or remote (e.g. "gdrive:folder" or "local/path")

Keywords

  • config::String="": Optional path to rclone config file
  • max_attempts::Int=3: Maximum number of retry attempts
  • sleep_timer::Int=60: Initial sleep duration between retries in seconds (doubles after each attempt)

Details

Uses optimized rclone settings for large files:

  • 2GB chunk size
  • 1TB upload cutoff
  • Rate limited to 1 transaction per second
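
The retry behavior described by max_attempts and sleep_timer can be sketched as a generic wrapper. Everything here is hypothetical: `action` stands in for the rclone invocation, and the real function may log or fail differently.

```julia
# Hypothetical retry wrapper with a wait that doubles after each failed attempt.
function retry_with_backoff(action; max_attempts::Int = 3, sleep_timer::Real = 60)
    for attempt in 1:max_attempts
        try
            return action()
        catch err
            attempt == max_attempts && rethrow()   # give up after the last attempt
            @warn "Attempt $attempt failed; retrying in $sleep_timer s" exception = err
            sleep(sleep_timer)
            sleep_timer *= 2                       # double the wait for the next retry
        end
    end
end

# Toy action that fails twice before succeeding
calls = Ref(0)
result = retry_with_backoff(max_attempts = 3, sleep_timer = 0.01) do
    calls[] += 1
    calls[] < 3 && error("transient failure")
    "ok"
end
```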
source
Mycelia.rclone_copy2Method
rclone_copy2(
    source,
    dest;
    config,
    max_attempts,
    sleep_timer,
    includes,
    excludes,
    recursive
)

Copy files between local and remote storage using rclone with automated retry logic.

Arguments

  • source::String: Source path or remote (e.g. "local/path" or "gdrive:folder")
  • dest::String: Destination path or remote (e.g. "gdrive:folder" or "local/path")

Keywords

  • config::String="": Optional path to rclone config file
  • max_attempts::Int=3: Maximum number of retry attempts
  • sleep_timer::Int=60: Initial sleep duration between retries in seconds (doubles after each attempt)
  • includes::Vector{String}=[]: One or more include patterns (each will be passed using --include)
  • excludes::Vector{String}=[]: One or more exclude patterns (each will be passed using --exclude)
  • recursive::Bool=false: If true, adds the flag for recursive traversal
source
Mycelia.rclone_list_directoriesMethod
rclone_list_directories(path) -> Any

List all directories at the specified rclone path.

Arguments

  • path::String: Remote path to list directories from (e.g. "remote:/path/to/dir")

Returns

  • Vector{String}: Full paths to all directories found at the specified location
source
Mycelia.read_fastaniMethod
read_fastani(path::String) -> DataFrames.DataFrame

Reads and processes FastANI output results from a tab-delimited file.

Arguments

  • path::String: Path to the FastANI output file

Returns

DataFrame with columns:

  • query: Original query filepath
  • query_identifier: Extracted filename without extension
  • reference: Original reference filepath
  • reference_identifier: Extracted filename without extension
  • %_identity: ANI percentage identity
  • fragments_mapped: Number of fragments mapped
  • total_query_fragments: Total number of query fragments

Notes

  • Expects tab-delimited input file from FastANI
  • Automatically strips .fasta, .fna, or .fa extensions from filenames
  • Column order is preserved as listed above
source
Mycelia.read_gfa_nextFunction
read_gfa_next(
    gfa_file::AbstractString;
    ...
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}
read_gfa_next(
    gfa_file::AbstractString,
    graph_mode::Mycelia.GraphMode;
    force_biosequence_graph
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Read a GFA file and auto-detect whether to create a k-mer graph or BioSequence graph.

This function parses GFA format files and intelligently chooses between:

  1. Fixed-length k-mer graph (if all segments have the same length)
  2. Variable-length BioSequence graph (if segments have different lengths)

Arguments

  • gfa_file: Path to input GFA file
  • graph_mode: GraphMode (SingleStrand or DoubleStrand, default: DoubleStrand)
  • force_biosequence_graph: Force creation of variable-length BioSequence graph (default: false)

Returns

  • MetaGraphsNext.MetaGraph with either k-mer vertices or BioSequence vertices

Auto-Detection Logic

  • Fixed-length detection: If all segments are the same length k, creates DNAKmer{k}/RNAKmer{k}/AAKmer{k} graph
  • Variable-length fallback: If segments have different lengths, creates BioSequence graph
  • Override: Use force_biosequence_graph=true to force variable-length graph
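
The fixed-length check can be illustrated on raw GFA text. The sketch below only inspects segment (S) lines and returns the shared length or nothing; the helper name detect_fixed_k is hypothetical, and it does not build a graph or choose an alphabet.

```julia
# Hypothetical helper: return the common segment length k, or nothing if lengths differ.
function detect_fixed_k(gfa_text::AbstractString)
    lengths = Int[]
    for line in eachline(IOBuffer(gfa_text))
        fields = split(line, '\t')
        if length(fields) >= 3 && fields[1] == "S"
            push!(lengths, length(fields[3]))   # S <name> <sequence>
        end
    end
    return !isempty(lengths) && all(==(first(lengths)), lengths) ? first(lengths) : nothing
end

fixed_gfa = "H\tVN:Z:1.0\nS\t1\tACG\nS\t2\tCGT\nL\t1\t+\t2\t+\t2M\n"
mixed_gfa = "S\t1\tACG\nS\t2\tCGTA\n"
detect_fixed_k(fixed_gfa)   # all segments share one length
detect_fixed_k(mixed_gfa)   # lengths differ
```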

GFA Format Support

Supports GFA v1.0 with:

  • Header (H) lines (ignored)
  • Segment (S) lines: parsed as vertices (k-mer or BioSequence)
  • Link (L) lines: parsed as strand-aware edges
  • Path (P) lines: stored as metadata (future use)

Examples

# Auto-detect graph type
graph = read_gfa_next("assembly.gfa")

# Force variable-length BioSequence graph
graph = read_gfa_next("assembly.gfa", force_biosequence_graph=true)

# SingleStrand mode with auto-detection
graph = read_gfa_next("assembly.gfa", SingleStrand)
source
Mycelia.read_gfa_nextFunction
read_gfa_next(
    gfa_file::AbstractString,
    kmer_type::Type
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#151#152", Float64} where {_A, _B, _C}
read_gfa_next(
    gfa_file::AbstractString,
    kmer_type::Type,
    graph_mode::Mycelia.GraphMode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#151#152", Float64} where {_A, _B, _C}

Read a GFA file and convert it to a MetaGraphsNext k-mer graph with fixed-length vertices.

This function parses GFA format files and creates a strand-aware k-mer graph compatible with the next-generation implementation using fixed-length k-mer vertices.

Arguments

  • gfa_file: Path to input GFA file
  • kmer_type: Type of k-mer to use (e.g., Kmers.DNAKmer{31})
  • graph_mode: GraphMode (SingleStrand or DoubleStrand, default: DoubleStrand)

Returns

  • MetaGraphsNext.MetaGraph with k-mer vertices and strand-aware edges

GFA Format Support

Supports GFA v1.0 with:

  • Header (H) lines (ignored)
  • Segment (S) lines: parsed as fixed-length k-mer vertices
  • Link (L) lines: parsed as strand-aware edges
  • Path (P) lines: stored as metadata (future use)

Example

# Fixed-length k-mer graph
graph = read_gfa_next("assembly.gfa", Kmers.DNAKmer{31})
# Or with specific mode
graph = read_gfa_next("assembly.gfa", Kmers.DNAKmer{31}, SingleStrand)
source
Mycelia.read_gffMethod
read_gff(gff::AbstractString) -> DataFrames.DataFrame

Reads a GFF (General Feature Format) file and parses it into a DataFrame.

Arguments

  • gff::AbstractString: Path to the GFF file

Returns

  • DataFrame: A DataFrame containing the parsed GFF data with standard columns: seqid, source, type, start, end, score, strand, phase, and attributes
source
Mycelia.read_gffMethod
read_gff(gff_io) -> DataFrames.DataFrame

Read a GFF (General Feature Format) file into a DataFrame.

Arguments

  • gff_io: An IO stream containing GFF formatted data

Returns

  • DataFrame: A DataFrame with standard GFF columns:
    • seqid: sequence identifier
    • source: feature source
    • type: feature type
    • start: start position (1-based)
    • end: end position
    • score: numeric score
    • strand: strand (+, -, or .)
    • phase: phase (0, 1, 2 or .)
    • attributes: semicolon-separated key-value pairs
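
For orientation, a single GFF data line splits into these nine tab-separated columns. The sketch below parses one line into a NamedTuple of strings; it is a hypothetical helper, not the parser Mycelia uses, and it does not handle comment or directive lines.

```julia
# Column names in GFF order; Symbol("end") is spelled out because `end` is a Julia keyword.
const GFF_COLUMNS = (:seqid, :source, :type, :start, Symbol("end"),
                     :score, :strand, :phase, :attributes)

# Hypothetical helper: split one GFF data line into its nine named fields.
function parse_gff_line(line::AbstractString)
    fields = split(chomp(line), '\t')
    length(fields) == 9 || error("expected 9 tab-separated fields, got $(length(fields))")
    return NamedTuple{GFF_COLUMNS}(Tuple(String.(fields)))
end

rec = parse_gff_line("chr1\tRefSeq\tgene\t1\t9\t.\t+\t.\tID=gene1;Name=abc")
rec.seqid
rec.attributes
```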
source
Mycelia.read_kraken_reportMethod
read_kraken_report(kraken_report) -> DataFrames.DataFrame

Parse a Kraken taxonomic classification report into a structured DataFrame.

Arguments

  • kraken_report::AbstractString: Path to a tab-delimited Kraken report file

Returns

  • DataFrame: A DataFrame with the following columns:
    • percentage_of_fragments_at_or_below_taxon: Percentage of fragments covered
    • number_of_fragments_at_or_below_taxon: Count of fragments at/below taxon
    • number_of_fragments_assigned_directly_to_taxon: Direct fragment assignments
    • rank: Taxonomic rank
    • ncbi_taxonid: NCBI taxonomy identifier
    • scientific_name: Scientific name (whitespace-trimmed)

Notes

  • Scientific names are automatically stripped of leading/trailing whitespace
  • Input file must be tab-delimited
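
The whitespace note matters because Kraken reports indent scientific names to show taxonomic depth. A minimal sketch of parsing a single report row into the columns listed above, using a hypothetical helper and an invented example row, might look like this:

```julia
# Hypothetical helper, not Mycelia's implementation: parse one tab-delimited report row.
function parse_kraken_row(line::AbstractString)
    f = split(line, '\t')
    return (
        percentage_of_fragments_at_or_below_taxon = parse(Float64, strip(f[1])),
        number_of_fragments_at_or_below_taxon = parse(Int, strip(f[2])),
        number_of_fragments_assigned_directly_to_taxon = parse(Int, strip(f[3])),
        rank = String(strip(f[4])),
        ncbi_taxonid = parse(Int, strip(f[5])),
        scientific_name = String(strip(f[6])),  # drop the indentation
    )
end

# Illustrative row only; the counts are made up for the example
row = parse_kraken_row("  0.50\t5\t2\tS\t562\t      Escherichia coli")
```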
source
Mycelia.read_mmseqs_easy_searchMethod
read_mmseqs_easy_search(
    mmseqs_file::String
) -> DataFrames.DataFrame

Read results from an MMseqs2 easy-search output file (plain or gzipped) into a DataFrame with optimized memory usage. Gzipped input is detected automatically from the '.gz' extension.

Arguments

  • mmseqs_file::String: Path to the tab-delimited output file from MMseqs2 easy-search. Can be a plain text file or a gzipped file (ending in .gz).

Returns

  • DataFrame: Contains search results with columns:
    • query::String: Query sequence identifier (pooled)
    • target::String: Target sequence identifier (pooled)
    • seqIdentity::Float64: Sequence identity (0.0-1.0)
    • alnLen::Int: Alignment length
    • mismatch::Int: Number of mismatches
    • gapOpen::Int: Number of gap openings
    • qStart::Int: Query start position
    • qEnd::Int: Query end position
    • tStart::Int: Target start position
    • tEnd::Int: Target end position
    • evalue::Float64: Expected value
    • bits::Float64: Bit score

Remarks

  • Ensure the CodecZlib.jl package is installed for gzipped file support.
source
Mycelia.read_tsvgzMethod
read_tsvgz(filename::String; buffer_in_memory::Bool=false, threaded::Bool=true, bufsize::Int=10*1024*1024) -> DataFrames.DataFrame

Read a DataFrame from a gzipped TSV file.

Arguments

  • filename: Path to the gzipped TSV file (must have .tsv.gz extension)
  • buffer_in_memory: If false, uses temporary files for large data (default: false)
  • bufsize: Buffer size in bytes for decompression stream (default: 10MB)

Returns

  • The loaded DataFrame
source
Mycelia.repr_longMethod
repr_long(v) -> String

Return a string representation of the vector v with each element on a new line, mimicking valid Julia syntax. The output encloses the elements in square brackets and separates them with a comma followed by a newline.
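
One way to produce output of this shape is sketched below. The function name repr_long_sketch is hypothetical, and the exact placement of the brackets and newlines is an assumption; the real repr_long may lay out its output differently.

```julia
# Hypothetical sketch: one element per line, wrapped in square brackets.
# repr() keeps each element in valid Julia syntax (for example, strings stay quoted).
repr_long_sketch(v) = "[\n" * join(repr.(v), ",\n") * "\n]"

numbers_text = repr_long_sketch([1, 2, 3])
strings_text = repr_long_sketch(["a", "b"])
```

Because every element passes through repr, the result can be pasted back into a Julia session as an array literal.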

source
Mycelia.reset_environment!Function
reset_environment!(env::AssemblyEnvironment, dataset_idx::Int=1)

Reset the RL environment to start a new training episode.

Arguments

  • env::AssemblyEnvironment: Environment to reset
  • dataset_idx::Int: Index of training dataset to use (default: 1)

Returns

  • AssemblyState: Initial state for new episode

Example

initial_state = reset_environment!(env, 2)  # Use second training dataset
source
Mycelia.resolve_repeats_nextMethod
resolve_repeats_next(graph::MetaGraphsNext.MetaGraph, min_repeat_length::Int=10) -> Vector{RepeatRegion}

Identify and characterize repetitive regions in the assembly graph.

source
Mycelia.reverse_translateMethod
reverse_translate(
    protein_sequence::BioSequences.LongAA
) -> BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}

Convert a protein sequence back to a possible DNA coding sequence using weighted random codon selection.

Arguments

  • protein_sequence::BioSequences.LongAA: The amino acid sequence to reverse translate

Returns

  • BioSequences.LongDNA{2}: A DNA sequence that would translate to the input protein sequence

Details

Uses codon usage frequencies to randomly select codons for each amino acid, weighted by their natural occurrence. Each selected codon is guaranteed to translate back to the original amino acid.
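
The weighted selection described above can be sketched with the standard library alone. Everything named here is hypothetical: the codons come from the standard genetic code, but the table covers only four amino acids and the weights are placeholder values for illustration, not measured codon usage frequencies and not the table Mycelia uses.

```julia
import Random

# Toy table: amino acid => candidate codons with placeholder weights
const TOY_CODON_WEIGHTS = Dict(
    'M' => ["ATG" => 1.0],
    'K' => ["AAA" => 0.6, "AAG" => 0.4],
    'F' => ["TTT" => 0.5, "TTC" => 0.5],
    'W' => ["TGG" => 1.0],
)

# Draw one codon with probability proportional to its weight
function sample_weighted(rng::Random.AbstractRNG, choices)
    threshold = rand(rng) * sum(last, choices)
    cumulative = 0.0
    for (codon, weight) in choices
        cumulative += weight
        threshold <= cumulative && return codon
    end
    return first(last(choices))  # guard against floating-point round-off
end

# Hypothetical sketch: concatenate one sampled codon per amino acid
function reverse_translate_sketch(rng::Random.AbstractRNG, protein::AbstractString)
    return join(sample_weighted(rng, TOY_CODON_WEIGHTS[aa]) for aa in protein)
end

dna = reverse_translate_sketch(Random.MersenneTwister(1), "MKFW")
```

Since every candidate codon for an amino acid encodes that amino acid, any draw translates back to the input protein; only the choice among synonymous codons is random.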

source
Mycelia.rolling_centered_avgMethod
rolling_centered_avg(data::AbstractArray{T, 1}; window_size)

Compute a centered moving average over a vector using a sliding window.

Arguments

  • data::AbstractVector{T}: Input vector to be averaged
  • window_size::Int: Size of the sliding window (odd number recommended)

Returns

  • Vector{Float64}: Vector of same length as input containing moving averages

Details

  • For points near the edges, the window is truncated to available data
  • Window is centered on each point, using floor(window_size/2) points on each side
  • Result type is always Float64 regardless of input type T
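
The edge handling described in Details can be sketched as follows. This is an independent illustration under the stated rules (truncated windows, floor(window_size/2) points per side, Float64 output), with a hypothetical function name; it is not copied from the Mycelia source.

```julia
# Hypothetical sketch of a centered moving average with truncated edge windows
function rolling_centered_avg_sketch(data::AbstractVector{<:Real}; window_size::Int)
    half = div(window_size, 2)          # points taken on each side of the center
    n = length(data)
    out = Vector{Float64}(undef, n)
    for i in 1:n
        lo = max(1, i - half)           # clip the window at the left edge
        hi = min(n, i + half)           # clip the window at the right edge
        out[i] = sum(view(data, lo:hi)) / (hi - lo + 1)
    end
    return out
end

smoothed = rolling_centered_avg_sketch([1, 2, 3, 4, 5]; window_size=3)
```

With window_size=3, the first and last points average over two values while interior points average over three, which is why an odd window keeps the window symmetric around each point.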
source
Mycelia.run_amrfinderplusMethod
run_amrfinderplus(; fasta, output_dir, force)

Run AMRFinderPlus on FASTA input to identify antimicrobial resistance genes.

Arguments

  • fasta::String: Path to input FASTA file (must match Mycelia.FASTA_REGEX pattern)
  • output_dir::String: Output directory path (default: input filename + "_amrfinderplus")
  • force::Bool: Force rerun even if output files already exist (default: false)

Returns

Path to the output directory containing AMRFinderPlus results

Details

  • For nucleotide FASTA files, automatically runs Mycelia.run_pyrodigal to generate protein sequences
  • For protein FASTA files, runs AMRFinderPlus directly
  • Validates input file extension against Mycelia.FASTA_REGEX
  • Creates output directory if it doesn't exist
  • Skips processing if results already exist in output directory unless force=true
  • Uses the --plus flag for enhanced detection capabilities

Files Generated

  • <basename>.amrfinderplus.tsv: AMRFinderPlus results table
  • For nucleotide inputs: intermediate pyrodigal outputs in subdirectory
source
Mycelia.run_blastMethod
run_blast(
;
    out_dir,
    fasta,
    blast_db,
    blast_command,
    force,
    remote,
    wait
)

Run a BLAST (Basic Local Alignment Search Tool) command with the specified parameters.

Arguments

  • out_dir::String: The output directory where the BLAST results will be stored.
  • fasta::String: The path to the input FASTA file.
  • blast_db::String: The path to the BLAST database.
  • blast_command::String: The BLAST command to be executed (e.g., blastn, blastp).
  • force::Bool: If true, forces the BLAST command to run even if the output file already exists. Default is false.
  • remote::Bool: If true, runs the BLAST command remotely. Default is false.
  • wait::Bool: If true, waits for the BLAST command to complete before returning. Default is true.

Returns

  • outfile::String: The path to the output file containing the BLAST results.

Description

This function constructs and runs a BLAST command based on the provided parameters. It creates the necessary output directory, constructs the output file name, and determines whether the BLAST command needs to be run based on the existence and size of the output file. The function supports both local and remote execution of the BLAST command.

If force is set to true or the output file does not exist or is empty, the BLAST command is executed. The function logs the command being run and measures the time taken for execution. The output file path is returned upon completion.
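
The rerun decision described above reduces to a small predicate. The sketch below uses a hypothetical helper name, needs_run, and only standard-library file checks; it illustrates the rule as stated in this docstring rather than reproducing the actual run_blast code.

```julia
# Hypothetical helper: rerun when forced, or when the output is missing or empty
needs_run(outfile::AbstractString; force::Bool=false) =
    force || !isfile(outfile) || filesize(outfile) == 0

# Demonstrate the three cases in a temporary directory
decisions = mktempdir() do dir
    outfile = joinpath(dir, "results.tsv")
    missing_file = needs_run(outfile)            # no file yet
    touch(outfile)
    empty_file = needs_run(outfile)              # file exists but is empty
    write(outfile, "query\ttarget\n")
    finished = needs_run(outfile)                # non-empty output is reused
    forced = needs_run(outfile; force=true)      # force overrides the cache
    (missing_file, empty_file, finished, forced)
end
```

Treating an empty output file as missing guards against reusing the result of a run that was interrupted before it wrote anything.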

source
Mycelia.run_blastnMethod
run_blastn(
;
    outdir,
    fasta,
    blastdb,
    threads,
    task,
    force,
    remote,
    wait
)

Run the BLASTN (Basic Local Alignment Search Tool for Nucleotides) command with specified parameters.

Arguments

  • outdir::String: The output directory where the BLASTN results will be saved.
  • fasta::String: The path to the input FASTA file containing the query sequences.
  • blastdb::String: The path to the BLAST database to search against.
  • task::String: The BLASTN task to perform. Default is "megablast".
  • force::Bool: If true, forces the BLASTN command to run even if the output file already exists. Default is false.
  • remote::Bool: If true, runs the BLASTN command remotely. Default is false.
  • wait::Bool: If true, waits for the BLASTN command to complete before returning. Default is true.

Returns

  • outfile::String: The path to the output file containing the BLASTN results.

Description

This function constructs and runs a BLASTN command based on the provided parameters. It creates an output directory if it doesn't exist, constructs the output file path, and checks if the BLASTN command needs to be run based on the existence and size of the output file. The function supports running the BLASTN command locally or remotely, with options to force re-running and to wait for completion.

source
Mycelia.run_blastp_searchMethod
run_blastp_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    max_target_seqs
)

Perform BLASTP search between query and reference protein FASTA files.

Arguments

  • query_fasta::String: Path to query protein FASTA file
  • reference_fasta::String: Path to reference protein FASTA file
  • output_dir::String: Output directory (defaults to query filename + "_blastp")
  • threads::Int: Number of threads (defaults to system CPU count)
  • evalue::Float64: E-value threshold (default: 1e-3)
  • max_target_seqs::Int: Maximum target sequences (default: 500)

Returns

  • String: Path to the BLASTP results file (.tsv format)

Throws

  • AssertionError: If input files don't exist or are invalid
  • SystemError: If BLAST execution fails
source
Mycelia.run_buscoMethod
run_busco(assembly_file::String; kwargs...)

Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.

source
Mycelia.run_buscoMethod
run_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)

Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="busco_results": Output directory for BUSCO results
  • lineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteria_odb10", "eukaryota_odb10", "auto")
  • mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • force::Bool=false: Force overwrite existing results

Returns

  • String: Path to the output directory containing BUSCO results

Output Files

  • short_summary.specific.lineage.txt: Summary statistics
  • full_table.tsv: Complete BUSCO results table
  • missing_busco_list.tsv: List of missing BUSCOs
  • run_lineage/: Detailed results directory

Examples

# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)

# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")

# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
                             outdir="my_busco_results",
                             lineage="enterobacterales_odb10",
                             threads=8)

Notes

  • Requires BUSCO to be installed via Bioconda
  • Auto lineage detection requires internet connection for first run
  • Available lineages: bacteria_odb10, archaea_odb10, eukaryota_odb10, fungi_odb10, etc.
  • Results provide Complete, Fragmented, and Missing BUSCO counts
source
Mycelia.run_canuMethod
run_canu(; fastq, outdir, genome_size, read_type)

Run Canu assembler for long read assembly.

Arguments

  • fastq::String: Path to input FASTQ file containing long reads
  • outdir::String: Output directory path (default: "canu_output")
  • genome_size::String: Estimated genome size (e.g., "5m", "1.2g")
  • read_type::String: Type of reads ("pacbio", "nanopore")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • assembly::String: Path to final assembly file

Details

  • Automatically creates and uses a conda environment with canu
  • Includes error correction, trimming, and assembly stages
  • Skips assembly if output directory already exists
  • Utilizes all available CPU threads
source
Mycelia.run_checkmMethod
run_checkm(input_path::String; outdir::String=input_path * "_checkm", db_dir::String=joinpath(homedir(), "workspace", ".checkm"), extension::String="fasta")

Run CheckM on directory containing FASTA files.

CheckM requires a directory of genome files as input.

Arguments

  • input_path: Path to directory containing FASTA files
  • outdir: Output directory for CheckM results (default: input_path * "_checkm")
  • db_dir: CheckM database directory (default: ~/workspace/.checkm)
  • extension: File extension for genomes (default: "fasta")
  • threads: Number of threads to use (default: all available CPU threads)

Example

run_checkm("./genomes/")
source
Mycelia.run_checkm2Method
run_checkm2(input_path::String; outdir::String=input_path * "_checkm2", db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Run CheckM2 on FASTA file(s) or directory containing FASTA files.

Arguments

  • input_path: Path to FASTA file or directory containing FASTA files
  • outdir: Output directory for CheckM2 results (default: input_path * "_checkm2")
  • threads: Number of threads to use (default: all available CPU threads)
  • db_dir: CheckM2 database directory (default: ~/workspace/.checkm2)

Returns

A named tuple with the following fields:

  • outdir: Output directory path
  • quality_report: Path to quality_report.tsv
  • log_file: Path to checkm2.log
  • diamond_results: Path to DIAMOND_RESULTS.tsv
  • protein_file: Path to the single .faa file
source
Mycelia.run_checkm2_listMethod
run_checkm2_list(fasta_files::Vector{String}; outdir::String=normalized_current_datetime() * "_checkm2", db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Run CheckM2 on a list of FASTA files.

CheckM2 can automatically handle mixed lists of gzipped and non-gzipped files when given a list.

Arguments

  • fasta_files: Vector of FASTA file paths (can be mixed gzipped and non-gzipped)
  • outdir: Output directory for CheckM2 results (default: normalized_current_datetime() * "_checkm2")
  • threads: Number of threads to use (default: all available CPU threads)
  • db_dir: CheckM2 database directory (default: ~/workspace/.checkm2)

Example

files = ["genome1.fasta.gz", "genome2.fasta", "genome3.fasta.gz"]
run_checkm2_list(files)
source
Mycelia.run_checkvMethod
run_checkv(fasta_file::String; outdir::String=fasta_file * "_checkv", db_dir::String=joinpath(homedir(), "workspace", ".checkv"))

Run CheckV on a single genome FASTA file.

CheckV assesses the quality and completeness of viral genomes.

Arguments

  • fasta_file: Path to single FASTA file (can be gzipped)
  • outdir: Output directory for CheckV results (default: fasta_file * "_checkv")
  • threads: Number of threads to use (default: all available CPU threads)
  • db_dir: CheckV database directory (default: ~/workspace/.checkv)

Returns

  • Named tuple with fields:
    • outdir: Output directory path
    • complete_genomes: Path to complete_genomes.tsv
    • completeness: Path to completeness.tsv
    • contamination: Path to contamination.tsv
    • proviruses: Path to proviruses.fna
    • quality_summary: Path to quality_summary.tsv
    • viruses: Path to viruses.fna

Example

result = run_checkv("genome.fasta")
println("Quality summary: ", result.quality_summary)
source
Mycelia.run_clustal_omegaMethod
run_clustal_omega(; fasta, outfmt)

Run Clustal Omega multiple sequence alignment on a FASTA file.

Arguments

  • fasta::String: Path to input FASTA file
  • outfmt::String="clustal": Output format for the alignment

Returns

  • String: Path to the output alignment file

Supported Output Formats

  • "fasta": FASTA format
  • "clustal": Clustal format
  • "msf": MSF format
  • "phylip": PHYLIP format
  • "selex": SELEX format
  • "stockholm": Stockholm format
  • "vienna": Vienna format

Notes

  • Uses Bioconda to manage the Clustal Omega installation
  • Caches results - will return existing output file if already generated
  • Handles single sequence files gracefully by returning output path without error
source
Mycelia.run_diamond_searchMethod
run_diamond_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    block_size,
    sensitivity
)

Perform DIAMOND BLASTP search between query and reference protein FASTA files.

Arguments

  • query_fasta::String: Path to query protein FASTA file
  • reference_fasta::String: Path to reference protein FASTA file
  • output_dir::String: Output directory (defaults to query filename + "_diamond")
  • threads::Int: Number of threads (defaults to system CPU count)
  • evalue::Float64: E-value threshold (default: 1e-3)
  • block_size::Float64: Block size in GB (default: auto-calculated from system memory)
  • sensitivity::String: Sensitivity mode (default: "--iterate")

Returns

  • String: Path to the DIAMOND results file (.tsv format)

Throws

  • AssertionError: If input files don't exist or are invalid
  • SystemError: If DIAMOND execution fails
source
Mycelia.run_ectyperMethod
run_ectyper(fasta_file) -> Any

Run ECTyper for serotyping E. coli genome assemblies.

Arguments

  • fasta_file::String: Path to input FASTA file containing assembled genome(s)

Returns

  • String: Path to output directory containing ECTyper results
source
Mycelia.run_flyeMethod
run_flye(; fastq, outdir, genome_size, read_type)

Run Flye assembler for long read assembly.

Arguments

  • fastq::String: Path to input FASTQ file containing long reads
  • outdir::String: Output directory path (default: "flye_output")
  • genome_size::String: Estimated genome size (e.g., "5m", "1.2g")
  • read_type::String: Type of reads ("pacbio-raw", "pacbio-corr", "pacbio-hifi", "nano-raw", "nano-corr", "nano-hq")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • assembly::String: Path to final assembly file

Details

  • Automatically creates and uses a conda environment with flye
  • Supports various long read technologies and quality levels
  • Skips assembly if output directory already exists
  • Utilizes all available CPU threads
source
Mycelia.run_full_benchmark_suiteFunction
run_full_benchmark_suite()
run_full_benchmark_suite(config::Mycelia.BenchmarkConfig)

Run comprehensive performance benchmark suite.

Executes all benchmark tests and provides a summary report against our targets:

  • Memory Usage: 50% reduction through type-stable metadata
  • Construction Speed: 2x faster graph building
  • Type Stability: Zero allocations in hot paths

Arguments

  • config: BenchmarkConfig for test parameters

Returns

  • NamedTuple with comprehensive results
source
Mycelia.run_genomadMethod

Run geNomad mobile genetic element identification tool.

geNomad identifies viruses and plasmids in genomic and metagenomic data using machine learning and database comparisons.

Arguments

  • input_fasta: Path to input FASTA file
  • output_directory: Output directory path
  • genomad_dbpath: Path to geNomad database directory
  • threads: Number of CPU threads to use
  • cleanup: Remove intermediate files after completion
  • splits: Number of splits for memory management
  • force: Force rerun even if output files already exist

Returns

NamedTuple containing paths to all generated output files and directories

source
Mycelia.run_hifiasmMethod
run_hifiasm(; fastq, outdir)

Run the hifiasm genome assembler on PacBio HiFi reads.

Arguments

  • fastq::String: Path to input FASTQ file containing HiFi reads
  • outdir::String: Output directory path (default: "{basename(fastq)}_hifiasm")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • hifiasm_outprefix::String: Prefix used for hifiasm output files

Details

  • Automatically creates and uses a conda environment with hifiasm
  • Uses primary assembly mode (--primary) optimized for inbred samples
  • Skips assembly if output files already exist at the specified prefix
  • Utilizes all available CPU threads
source
Mycelia.run_mash_comparisonMethod
run_mash_comparison(fasta1::String, fasta2::String; k::Int=21, s::Int=10000, mash_path::String="mash")

Runs a genome-by-genome comparison using the mash command-line tool.

This function first creates sketch files for each FASTA input and then calculates the distance between them, capturing and parsing the result.

Arguments

  • fasta1::String: Path to the first FASTA file.
  • fasta2::String: Path to the second FASTA file.

Keyword Arguments

  • k::Int=21: The k-mer size to use for sketching. Default is 21.
  • s::Int=10000: The sketch size (number of hashes to keep). Default is 10000.
  • mash_path::String="mash": The path to the mash executable if not in the system PATH.

Returns

  • NamedTuple: A named tuple containing the parsed results, e.g., (reference=..., query=..., distance=..., p_value=..., shared_hashes=...)
  • nothing: Returns nothing if the mash command fails.
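
For orientation, mash dist prints one tab-delimited line per comparison with five fields: reference, query, distance, p-value, and shared hashes. A minimal sketch of turning such a line into the named tuple shape shown above follows; the helper name and the example values are hypothetical, and run_mash_comparison may parse its output differently.

```julia
# Hypothetical helper: parse one tab-delimited line of `mash dist` output
function parse_mash_dist_line(line::AbstractString)
    f = split(strip(line), '\t')
    return (
        reference = String(f[1]),
        query = String(f[2]),
        distance = parse(Float64, f[3]),
        p_value = parse(Float64, f[4]),
        shared_hashes = String(f[5]),   # kept as text, e.g. "456/1000"
    )
end

# Illustrative line only; the numbers are invented for the example
result = parse_mash_dist_line("genome1.fasta\tgenome2.fasta\t0.0222766\t0\t456/1000\n")
```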
source
Mycelia.run_megahitMethod
run_megahit(
;
    fastq1,
    fastq2,
    outdir,
    min_contig_len,
    k_list
)

Run MEGAHIT assembler for metagenomic short read assembly.

Arguments

  • fastq1::String: Path to first paired-end FASTQ file
  • fastq2::String: Path to second paired-end FASTQ file (optional for single-end)
  • outdir::String: Output directory path (default: "megahit_output")
  • min_contig_len::Int: Minimum contig length (default: 200)
  • k_list::String: k-mer sizes to use (default: "21,29,39,59,79,99,119,141")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • contigs::String: Path to final contigs file

Details

  • Automatically creates and uses a conda environment with megahit
  • Optimized for metagenomic assemblies with varying coverage
  • Skips assembly if output directory already exists
  • Utilizes all available CPU threads
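
Since k_list is passed as a comma-separated string, it can help to sanity-check it before launching a long assembly. The sketch below uses hypothetical helper names, and its two checks (odd values, strictly increasing order) are assumptions about what a sensible list looks like, not a reproduction of MEGAHIT's own validation. Odd k is the usual convention because an odd-length k-mer cannot equal its own reverse complement.

```julia
# Hypothetical helper: turn "21,29,39" into [21, 29, 39]
parse_k_list(k_list::AbstractString) = parse.(Int, split(k_list, ','))

# Hypothetical check: non-empty, all odd, strictly increasing
function is_plausible_k_list(ks::AbstractVector{<:Integer})
    return !isempty(ks) && all(isodd, ks) && all(diff(ks) .> 0)
end

default_ks = parse_k_list("21,29,39,59,79,99,119,141")
default_ok = is_plausible_k_list(default_ks)
even_ok = is_plausible_k_list(parse_k_list("21,30"))
```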
source
Mycelia.run_metaspadesMethod
run_metaspades(; fastq1, fastq2, outdir, k_list)

Run metaSPAdes assembler for metagenomic short read assembly.

Arguments

  • fastq1::String: Path to first paired-end FASTQ file
  • fastq2::String: Path to second paired-end FASTQ file (optional for single-end)
  • outdir::String: Output directory path (default: "metaspades_output")
  • k_list::String: k-mer sizes to use (default: "21,33,55,77")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • contigs::String: Path to contigs file
  • scaffolds::String: Path to scaffolds file

Details

  • Automatically creates and uses a conda environment with spades
  • Designed for metagenomic datasets with uneven coverage
  • Skips assembly if output directory already exists
  • Utilizes all available CPU threads
source
Mycelia.run_mlstMethod
run_mlst(fasta_file) -> String

Run Multi-Locus Sequence Typing (MLST) analysis on a genome assembly.

Arguments

  • fasta_file::String: Path to input FASTA file containing the genome assembly

Returns

  • Path to the output file containing MLST results (<input>.mlst.out)

Details

Uses the mlst tool to identify sequence types by comparing allelic profiles of housekeeping genes against curated MLST schemes from PubMLST.

Dependencies

  • Requires Bioconda and the mlst package
  • Automatically sets up conda environment if not present
source
Mycelia.run_mmseqs_easy_searchMethod
run_mmseqs_easy_search(
;
    query_fasta,
    target_database,
    out_dir,
    outfile,
    format_output,
    threads,
    force
)

Runs the MMseqs2 easy-search command on the given query FASTA file against the target database.

Arguments

  • query_fasta::String: Path to the query FASTA file.
  • target_database::String: Path to the target database.
  • out_dir::String: Directory to store the output file. Defaults to the directory of the query FASTA file.
  • outfile::String: Name of the output file. Defaults to a combination of the query FASTA and target database filenames.
  • format_output::String: Format of the output. Defaults to a predefined set of fields.
  • threads::Int: Number of CPU threads to use. Defaults to the number of CPU threads available.
  • force::Bool: If true, forces the re-generation of the output file even if it already exists. Defaults to false.

Returns

  • outfile_path::String: Path to the generated output file.

Notes

  • Adds the mmseqs2 environment using Bioconda if not already present.
  • Removes temporary files created during the process.
source
Mycelia.run_mmseqs_searchMethod
run_mmseqs_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    sensitivity
)

Perform MMseqs2 easy-search between query and reference FASTA files.

Arguments

  • query_fasta::String: Path to query FASTA file
  • reference_fasta::String: Path to reference FASTA file
  • output_dir::String: Output directory (defaults to query filename + "_mmseqs")
  • threads::Int: Number of threads (defaults to system CPU count)
  • evalue::Float64: E-value threshold (default: 1e-3)
  • sensitivity::Float64: Sensitivity parameter (default: 4.0)

Returns

  • String: Path to the MMseqs2 results file (.tsv format)

Throws

  • AssertionError: If input files don't exist or are invalid
  • SystemError: If MMseqs2 execution fails
source
Mycelia.run_mummerMethod
run_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)

Run MUMmer for genome comparison and alignment between reference and query sequences.

Arguments

  • reference::String: Path to reference genome FASTA file
  • query::String: Path to query genome FASTA file
  • outdir::String="mummer_results": Output directory for MUMmer results
  • prefix::String="out": Prefix for output files
  • mincluster::Int=65: Minimum cluster length for nucmer
  • minmatch::Int=20: Minimum match length for nucmer
  • threads::Int=1: Number of threads (note: MUMmer has limited multithreading)

Returns

  • String: Path to the output directory containing MUMmer results

Output Files

  • prefix.delta: Delta alignment file (main output)
  • prefix.coords: Human-readable coordinates file
  • prefix.snps: SNP/indel report (if show-snps is run)
  • prefix.plot.png: Dot plot visualization (if mummerplot is run)

Examples

# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)

# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
                               outdir="comparison_results",
                               prefix="comparison",
                               mincluster=100,
                               minmatch=30)

Notes

  • Requires MUMmer to be installed via Bioconda
  • nucmer is used for DNA sequence alignment
  • show-coords generates human-readable coordinate output
  • Results include alignment coordinates, percent identity, and coverage
  • For visualization, use mummerplot (requires gnuplot)
source
Mycelia.run_mummer_plotMethod
run_mummer_plot(delta_file::String; outdir::String="", prefix::String="plot", plot_type::String="png")

Generate dot plot visualization from MUMmer delta file using mummerplot.

Arguments

  • delta_file::String: Path to MUMmer delta file
  • outdir::String="": Output directory (defaults to same as delta file)
  • prefix::String="plot": Prefix for plot files
  • plot_type::String="png": Plot format ("png", "ps", "x11")

Returns

  • String: Path to the generated plot file

Notes

  • Requires gnuplot to be installed
  • Useful for visualizing genome alignments and rearrangements
source
Mycelia.run_padlocMethod
run_padloc(; fasta_file, outdir, threads)

Run the 'padloc' tool from the 'padlocbio' conda environment on a given FASTA file.

https://doi.org/10.1093/nar/gkab883

https://github.com/padlocbio/padloc

This function first ensures that the 'padloc' environment is available via Bioconda. It then attempts to update the 'padloc' database. If a 'padloc' output file (with a '_padloc.csv' suffix) does not already exist for the input FASTA file, it runs 'padloc' with the specified FASTA file as input.

source
Mycelia.run_parallel_progressMethod
run_parallel_progress(
    f::Function,
    items::AbstractVector
) -> Any

Run a function f in parallel over a collection of items with a progress meter.

Arguments

  • f::Function: The function to be applied to each item in the collection.
  • items::AbstractVector: A collection of items to be processed.

Description

This function creates a progress meter to track the progress of processing each item in the items collection. It uses multithreading to run the function f on each item in parallel, updating the progress meter after each item is processed.
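
The pattern described above, threaded work plus a shared progress count, can be sketched with Base threading primitives alone. The function name is hypothetical and an atomic counter stands in for the progress meter; the actual run_parallel_progress presumably uses a progress-bar package and may return something different.

```julia
# Hypothetical sketch: apply f to each item across threads and count completions.
# Assumes items uses standard 1-based indexing.
function run_parallel_counted(f::Function, items::AbstractVector)
    completed = Threads.Atomic{Int}(0)             # thread-safe progress counter
    results = Vector{Any}(undef, length(items))
    Threads.@threads for i in eachindex(items)
        results[i] = f(items[i])                   # each iteration writes its own slot
        Threads.atomic_add!(completed, 1)          # stands in for a progress-meter tick
    end
    return (results = results, completed = completed[])
end

squares = run_parallel_counted(x -> x^2, [1, 2, 3])
```

Start Julia with more than one thread (for example, julia --threads=4) for the loop to actually run in parallel; with a single thread it still produces the same results sequentially.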

source
Mycelia.run_phageboostMethod
run_phageboost(input_fasta::AbstractString, output_dir::AbstractString; force_reinstall::Bool=false)

Run PhageBoost on the provided FASTA file, automatically handling conda environment setup.

This function will:

  1. Check if the phageboost_env conda environment exists
  2. Create and set up the environment if it doesn't exist
  3. Validate that PhageBoost is properly installed
  4. Run PhageBoost on the input FASTA file
  5. Return the output directory path and list of generated files

Arguments

  • input_fasta::AbstractString: Path to the input FASTA file
  • output_dir::AbstractString: Directory where PhageBoost outputs will be saved
  • force_reinstall::Bool=false: If true, recreate the environment even if it exists

Returns

  • NamedTuple with fields:
    • output_dir::String: Path to the output directory
    • files::Vector{String}: List of files generated in the output directory
source
Mycelia.run_phispyMethod
run_phispy(input_file::String; output_dir::String="", 
       phage_genes::Int=2, color::Bool=false, prefix::String="",
       phmms::String="", threads::Int=1, metrics::Vector{String}=String[],
       expand_slope::Bool=false, window_size::Int=30, 
       min_contig_size::Int=5000, skip_search::Bool=false,
       output_choice::Int=3, training_set::String="", 
       prokka_args::NamedTuple=NamedTuple(), force::Bool=false)

Run PhiSpy to identify prophages in bacterial genomes.

PhiSpy identifies prophage regions in bacterial (and archaeal) genomes using multiple approaches including gene composition, AT/GC skew, and optional HMM searches.

Arguments

  • input_file::String: Path to input file (FASTA or GenBank format)
  • output_dir::String: Output directory (default: input_file * "_phispy")
  • phage_genes::Int: Minimum phage genes required per prophage region (default: 2, set to 0 for mobile elements)
  • color::Bool: Add color annotations for CDS based on function (default: false)
  • prefix::String: Prefix for output filenames (default: basename of input)
  • phmms::String: Path to HMM database for additional phage gene detection
  • threads::Int: Number of threads for HMM searches (default: 1)
  • metrics::Vector{String}: Metrics to use for prediction (default: all standard metrics)
  • expand_slope::Bool: Expand Shannon slope calculations (default: false)
  • window_size::Int: Window size for calculations (default: 30)
  • min_contig_size::Int: Minimum contig size to analyze (default: 5000)
  • skip_search::Bool: Skip HMM search if already done (default: false)
  • output_choice::Int: Bitmask for output files (default: 3 for coordinates + GenBank)
  • training_set::String: Path to custom training set
  • prokka_args::NamedTuple: Additional arguments to pass to Prokka if FASTA input is provided
  • force::Bool: Force rerun even if output files already exist (default: false)

Output Choice Codes (add values for multiple outputs)

  • 1: prophage_coordinates.tsv
  • 2: GenBank format output
  • 4: prophage and bacterial sequences
  • 8: prophage_information.tsv
  • 16: prophage.tsv
  • 32: GFF3 format (prophages only)
  • 64: prophage.tbl
  • 128: test data used in random forest
  • 256: GFF3 format (full genome)
  • 512: all output files
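
Because the codes are powers of two, output_choice works as a bitmask: adding codes selects several outputs at once, and a bitwise AND recovers them. The sketch below uses hypothetical helper names and the labels from the list above, and it assumes that code 512 simply selects every other entry.

```julia
# Codes and labels as listed in the run_phispy docstring
const PHISPY_OUTPUT_CODES = [
    1 => "prophage_coordinates.tsv",
    2 => "GenBank format output",
    4 => "prophage and bacterial sequences",
    8 => "prophage_information.tsv",
    16 => "prophage.tsv",
    32 => "GFF3 format (prophages only)",
    64 => "prophage.tbl",
    128 => "test data used in random forest",
    256 => "GFF3 format (full genome)",
]

# Hypothetical helper: list the outputs selected by an output_choice value
function decode_output_choice(choice::Integer)
    # Assumption: 512 ("all output files") selects every entry above
    (choice & 512) != 0 && return last.(PHISPY_OUTPUT_CODES)
    return [label for (code, label) in PHISPY_OUTPUT_CODES if (choice & code) != 0]
end

# Hypothetical helper: combine individual codes into one output_choice value
combine_output_codes(codes::Integer...) = reduce(|, codes; init=0)

default_outputs = decode_output_choice(3)          # the documented default
coords_and_info = combine_output_codes(1, 8)
```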

Returns

A NamedTuple with paths to generated output files (contents depend on output_choice):

  • prophage_coordinates: prophage_coordinates.tsv file path
  • genbank_output: Updated GenBank file with prophage annotations
  • prophage_sequences: Prophage and bacterial sequence files
  • prophage_information: prophage_information.tsv file path
  • prophage_simple: prophage.tsv file path
  • gff3_prophages: GFF3 file with prophage regions only
  • prophage_table: prophage.tbl file path
  • test_data: Random forest test data file path
  • gff3_genome: GFF3 file with full genome annotations
  • output_dir: Path to output directory
  • input_genbank: Path to GenBank file used (original or generated by Prokka)
source
Mycelia.run_prodigalMethod
run_prodigal(; fasta_file, out_dir)

Run Prodigal gene prediction software on input FASTA file to identify protein-coding genes in metagenomes or single genomes.

Arguments

  • fasta_file::String: Path to input FASTA file containing genomic sequences
  • out_dir::String=dirname(fasta_file): Directory for output files. Defaults to input file's directory

Returns

Named tuple containing paths to all output files:

  • fasta_file: Input FASTA file path
  • out_dir: Output directory path
  • gff: Path to GFF format gene predictions
  • gene_scores: Path to all potential genes and their scores
  • fna: Path to nucleotide sequences of predicted genes
  • faa: Path to protein translations of predicted genes
  • std_out: Path to captured stdout
  • std_err: Path to captured stderr
source
Mycelia.run_prokkaMethod
run_prokka(input_fasta::String; output_dir::String="", prefix::String="", 
       cpus::Int=0, kingdom::String="Bacteria", genus::String="", 
       species::String="", strain::String="", force_overwrite::Bool=false,
       addgenes::Bool=false, compliant::Bool=false, fast::Bool=false,
       evalue::Float64=1e-06, mincontiglen::Int=1, force::Bool=false)

Run Prokka for rapid prokaryotic genome annotation.

Prokka annotates bacterial, archaeal and viral genomes quickly and produces standards-compliant output files including GFF3, GenBank, and FASTA formats.

Arguments

  • input_fasta::String: Path to input FASTA file containing contigs
  • output_dir::String: Output directory (default: input_fasta * "_prokka")
  • prefix::String: Output file prefix (default: basename of input file)
  • cpus::Int: Number of CPUs to use, 0 for all available (default: 0)
  • kingdom::String: Annotation mode - "Bacteria", "Archaea", "Viruses", or "Mitochondria" (default: "Bacteria")
  • genus::String: Genus name for annotation
  • species::String: Species name for annotation
  • strain::String: Strain name for annotation
  • force_overwrite::Bool: Force overwrite existing output directory (default: false)
  • addgenes::Bool: Add 'gene' features for each 'CDS' feature (default: false)
  • compliant::Bool: Force GenBank/ENA/DDBJ compliance (default: false)
  • fast::Bool: Fast mode - skip CDS product searching (default: false)
  • evalue::Float64: Similarity e-value cut-off (default: 1e-06)
  • mincontiglen::Int: Minimum contig size (default: 1, NCBI needs 200)
  • force::Bool: Force rerun even if output files already exist (default: false)

Returns

A NamedTuple with paths to all generated output files:

  • gff: Master annotation in GFF3 format
  • gbk: Standard GenBank file
  • fna: Nucleotide FASTA of input contigs
  • faa: Protein FASTA of translated CDS sequences
  • ffn: Nucleotide FASTA of all transcripts
  • sqn: ASN1 Sequin file for GenBank submission
  • fsa: Nucleotide FASTA for tbl2asn
  • tbl: Feature table file
  • err: NCBI discrepancy report
  • log: Complete run log
  • txt: Annotation statistics
  • tsv: Tab-separated feature table
  • output_dir: Path to output directory
source
Mycelia.run_pyrodigalMethod
run_pyrodigal(; fasta_file, out_dir)

Run Pyrodigal gene prediction on a FASTA file using the meta procedure optimized for metagenomic sequences.

Pyrodigal is a reimplementation of the Prodigal gene finder, which identifies protein-coding sequences in bacterial and archaeal genomes.

Arguments

  • fasta_file::String: Path to input FASTA file containing genomic sequences
  • out_dir::String: Output directory path (default: input filename + "_pyrodigal")

Returns

Named tuple containing:

  • fasta_file: Input FASTA file path
  • out_dir: Output directory path
  • gff: Path to GFF output file with gene predictions
  • faa: Path to FASTA file with predicted protein sequences
  • fna: Path to FASTA file with nucleotide sequences

Notes

  • Uses metagenomic mode (-p meta) optimized for mixed communities
  • Masks runs of N nucleotides (-m flag)
  • Minimum gene length set to 33bp
  • Maximum overlap between genes set to 31bp
  • Requires Pyrodigal to be available in a Conda environment
  • Skips processing if output files already exist
source
Mycelia.run_quastMethod
run_quast(assembly_file::String; kwargs...)

Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.

source
Mycelia.run_quastMethod
run_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)

Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.

Arguments

  • assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
  • outdir::String="quast_results": Output directory for QUAST results
  • reference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metrics
  • threads::Int=Sys.CPU_THREADS: Number of threads to use
  • min_contig::Int=500: Minimum contig length to consider
  • gene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)

Returns

  • String: Path to the output directory containing QUAST results

Output Files

  • report.html: Interactive HTML report
  • report.txt: Text summary report
  • report.tsv: Tab-separated values report for programmatic access
  • transposed_report.tsv: Transposed TSV format
  • icarus.html: Icarus contig browser (if reference provided)

Examples

# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)

# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)

# Custom parameters
quast_dir = Mycelia.run_quast(assemblies, 
                             outdir="my_quast_results",
                             min_contig=1000,
                             threads=8)

Notes

  • Requires QUAST to be installed via Bioconda
  • Without reference: provides basic metrics (N50, total length, # contigs, etc.)
  • With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
  • Gene finding requires additional dependencies and is disabled by default
source
Mycelia.run_samtools_flagstatFunction
run_samtools_flagstat(xam) -> Any
run_samtools_flagstat(xam, samtools_flagstat) -> Any

Generate alignment statistics for a SAM/BAM/CRAM file using samtools flagstat.

Arguments

  • xam::AbstractString: Path to input SAM/BAM/CRAM alignment file
  • samtools_flagstat::AbstractString: Output path for flagstat results (default: input_path.samtools-flagstat.txt)

Returns

  • String: Path to the generated flagstat output file

Details

Runs samtools flagstat to calculate statistics on the alignment file, including:

  • Total reads
  • Secondary alignments
  • Supplementary alignments
  • Duplicates
  • Mapped/unmapped reads
  • Proper pairs
  • Read 1/2 counts

Requirements

  • Requires samtools to be available via Bioconda
  • Input file must be in SAM, BAM or CRAM format
source
Mycelia.run_transtermMethod
run_transterm(; fasta, gff)

Run TransTermHP to predict rho-independent transcription terminators in DNA sequences.

Arguments

  • fasta: Path to input FASTA file containing DNA sequences
  • gff: Optional path to GFF annotation file. If provided, improves prediction accuracy

Returns

  • String: Path to output file containing TransTermHP predictions

Details

  • Uses Conda environment 'transtermhp' for execution
  • Automatically generates coordinate file from FASTA or GFF input
  • Removes temporary coordinate file after completion
  • Requires Mycelia's Conda setup
source
Mycelia.run_trnascanMethod
run_trnascan(; fna_file, outdir)

Run tRNAscan-SE to identify and annotate transfer RNA genes in the provided sequence file.

Arguments

  • fna_file::String: Path to input FASTA nucleotide file
  • outdir::String: Output directory path (default: input file path + "_trnascan")

Returns

  • String: Path to the output directory containing tRNAscan-SE results

Output Files

Creates the following files in outdir:

  • *.trnascan.out: Main output with tRNA predictions
  • *.trnascan.bed: BED format coordinates
  • *.trnascan.fasta: FASTA sequences of predicted tRNAs
  • *.trnascan.struct: Secondary structure predictions
  • *.trnascan.stats: Summary statistics
  • *.trnascan.log: Program execution log

Notes

  • Uses the general tRNA model (-G flag) suitable for all domains of life
  • Automatically sets up tRNAscan-SE via Bioconda
  • Skips processing if output directory contains files
source
Mycelia.run_unicyclerMethod
run_unicycler(; short_1, short_2, long_reads, outdir)

Run hybrid assembly combining short and long reads using Unicycler.

Arguments

  • short_1::String: Path to first short read FASTQ file
  • short_2::String: Path to second short read FASTQ file (optional)
  • long_reads::String: Path to long read FASTQ file
  • outdir::String: Output directory path (default: "unicycler_output")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • assembly::String: Path to final assembly file

Details

  • Automatically creates and uses a conda environment with unicycler
  • Combines short read accuracy with long read scaffolding
  • Skips assembly if output directory already exists
  • Utilizes all available CPU threads
source
Mycelia.run_virsorter2Method

Run VirSorter2 viral sequence identification tool.

VirSorter2 identifies viral sequences in genomic and metagenomic data using machine learning models and database comparisons.

Arguments

  • input_fasta: Path to input FASTA file
  • output_directory: Output directory path
  • database_path: Path to VirSorter2 database directory
  • include_groups: Comma-separated viral groups to include
    • full set = dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae
    • tool's original default set = dsDNAphage,ssDNA
    • Lavidaviridae = A family of small double‑stranded DNA "virophages" that parasitize the replication machinery of certain NCLDVs
    • NCLDV = Nucleocytoplasmic Large DNA Viruses = An informal clade of large double‑stranded DNA viruses that replicate (at least in part) in the cytoplasm of eukaryotic cells.
  • min_score: Minimum score threshold for viral sequences
  • min_length: Minimum sequence length threshold
  • threads: Number of CPU threads to use
  • provirus_off: Disable provirus detection
  • max_orf_per_seq: Maximum ORFs per sequence
  • prep_for_dramv: Prepare output for DRAMv annotation
  • label: Label for output files
  • forceall: Force rerun all steps
  • force: Force rerun even if output files already exist

Returns

NamedTuple containing paths to all generated output files and directories

source
Mycelia.samtools_index_fastaMethod
samtools_index_fasta(; fasta)

Creates an index file (.fai) for a FASTA reference sequence using samtools.

The FASTA index allows efficient random access to the reference sequence. This is required by many bioinformatics tools that need to quickly fetch subsequences from the reference.

Arguments

  • fasta: Path to the input FASTA file

Side Effects

  • Creates a {fasta}.fai index file in the same directory as input
  • Installs samtools via conda if not already present
source
Mycelia.sanitize_inline_strings!Method
sanitize_inline_strings!(
    df::DataFrames.DataFrame
) -> DataFrames.DataFrame

Convert all InlineString columns in a DataFrame to standard Strings. Modifies the dataframe in-place and returns it.

source
Mycelia.sanitize_inline_stringsMethod
sanitize_inline_strings(v::AbstractVector) -> Any

Convert a column to standard Strings if it contains InlineStrings, otherwise return the original column unchanged.

source
Mycelia.sanity_check_matrixMethod
sanity_check_matrix(M::AbstractMatrix)

Checks matrix shape, value types, and distributional properties. Suggests the most appropriate ePCA function and distance metric.

Returns a NamedTuple with fields:

  • n_features, n_samples
  • value_type
  • range
  • is_binary
  • is_integer
  • is_nonnegative
  • is_strictly_positive
  • is_in_01
  • is_centered
  • is_overdispersed
  • suggested_epca
  • suggested_distance
source
Mycelia.save_df_jld2Method
save_df_jld2(;df::DataFrames.DataFrame, filename::String, key::String="dataframe")

Save a DataFrame to a JLD2 file.

Arguments

  • df: The DataFrame to save
  • filename: Path to the JLD2 file (will add .jld2 extension if not present)
  • key: The name of the dataset within the JLD2 file (defaults to "dataframe")
source
Mycelia.save_genome_as_fastaMethod
save_genome_as_fasta(genome, filename)

Save a genome sequence as a FASTA file.

Convenience function for saving a single genome sequence to FASTA format for annotation benchmarking.

Arguments

  • genome: DNA sequence (BioSequences.LongDNA{4})
  • filename: Output FASTA filename

See Also

  • write_fasta: For more flexible FASTA writing with multiple records
  • random_fasta_record: For generating random FASTA records
source
Mycelia.save_graphMethod
save_graph(
    graph::Graphs.AbstractGraph,
    outfile::String
) -> String

Saves the given graph to a file in JLD2 format.

Arguments

  • graph::Graphs.AbstractGraph: The graph to be saved.
  • outfile::String: The name of the output file. If the file extension is not .jld2, it will be appended automatically.

Returns

  • String: The name of the output file with the .jld2 extension.
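
The extension handling described above can be sketched in isolation. The helper name ensure_jld2_extension is hypothetical and only illustrates the documented behavior; it does not write any file:

```julia
# Hypothetical helper: append ".jld2" only when the path does not already end with it.
ensure_jld2_extension(path::AbstractString) =
    endswith(path, ".jld2") ? String(path) : string(path, ".jld2")

println(ensure_jld2_extension("assembly_graph"))
println(ensure_jld2_extension("assembly_graph.jld2"))
```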
source
Mycelia.save_kmer_resultsMethod
save_kmer_results(
;
    filename,
    kmers,
    counts,
    fasta_list,
    k,
    alphabet
)

Save the kmer counting results (kmers vector, counts sparse matrix) and the input FASTA file list to a JLD2 file for long-term storage and reproducibility.

Arguments

  • filename::AbstractString: Path to the output JLD2 file.
  • kmers::AbstractVector{<:Kmers.Kmer}: The sorted vector of unique kmer objects.
  • counts::AbstractMatrix: The (sparse or dense) matrix of kmer counts.
  • fasta_list::AbstractVector{<:AbstractString}: The list of FASTA file paths used as input.
  • k::Integer: The kmer size used.
  • alphabet::Symbol: The alphabet used (:AA, :DNA, :RNA).
source
Mycelia.save_matrix_jld2Method
save_matrix_jld2(; matrix, filename)

Saves a matrix to a JLD2 file format.

Arguments

  • matrix: The matrix to be saved
  • filename: String path where the file should be saved

Returns

  • The filename string that was used to save the matrix
source
Mycelia.save_reads_as_fastqFunction
save_reads_as_fastq(reads, filename, base_quality=30)

Save DNA reads as a FASTQ file with specified quality scores.

Converts a vector of DNA sequences to FASTQ format with uniform quality scores.

Arguments

  • reads: Vector of DNA sequences (BioSequences.LongDNA{4})
  • filename: Output FASTQ filename
  • base_quality: Base quality score for all positions (default: 30)

See Also

  • write_fastq: For more flexible FASTQ writing with records
  • fastq_record: For creating individual FASTQ records
  • generate_test_fastq_data: For generating test FASTQ data with variable quality scores
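
To illustrate what "uniform quality scores" means in the output, here is a minimal sketch that writes FASTQ records with one repeated quality character. It assumes Phred+33 encoding and a made-up read naming scheme, and it takes plain strings instead of BioSequences.LongDNA{4}; sketch_write_fastq is a hypothetical name, not the Mycelia implementation:

```julia
# Hypothetical stand-in operating on plain strings.
function sketch_write_fastq(io::IO, reads::Vector{String}; base_quality::Int = 30)
    quality_char = Char(33 + base_quality)  # Phred+33 encoding (assumed)
    for (i, read) in enumerate(reads)
        println(io, "@read_", i)            # identifier scheme is an assumption
        println(io, read)
        println(io, "+")
        println(io, repeat(quality_char, length(read)))
    end
end

buffer = IOBuffer()
sketch_write_fastq(buffer, ["ACGT", "GGA"])
fastq_text = String(take!(buffer))
print(fastq_text)
```

With base_quality = 30 and Phred+33, every position is encoded as the character with code point 63.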
source
Mycelia.scg_sbatchMethod
scg_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM using sbatch with specified parameters.

Arguments

  • job_name::String: Name identifier for the SLURM job
  • mail_user::String: Email address for job notifications
  • mail_type::String: Type of mail notifications (default: "ALL")
  • logdir::String: Directory for error and output logs (default: "~/workspace/slurmlogs")
  • partition::String: SLURM partition to submit job to
  • account::String: Account to charge for compute resources
  • nodes::Int: Number of nodes to allocate (default: 1)
  • ntasks::Int: Number of tasks to run (default: 1)
  • time::String: Maximum wall time in format "days-hours:minutes:seconds" (default: "1-00:00:00")
  • cpus_per_task::Int: CPUs per task (default: 12)
  • mem_gb::Int: Memory in GB, defaults to 96GB
  • cmd::String: Command to execute

Returns

  • Bool: Returns true if submission succeeds

Notes

  • Function includes 5-second delays before and after submission
  • Memory is automatically scaled with CPU count
  • Log files are named with job ID (%j) and job name (%x)
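
As a rough illustration of how the arguments above map onto sbatch options, the sketch below only builds a command object and never submits it. The function name sketch_sbatch_command, the log file naming pattern, and the use of --wrap are assumptions for illustration; the actual command assembled by Mycelia may differ:

```julia
# Hypothetical builder: returns a Cmd, does not run it.
function sketch_sbatch_command(; job_name, partition, account, cmd,
        mail_user = nothing, mail_type = "ALL", logdir = "slurmlogs",
        nodes = 1, ntasks = 1, time = "1-00:00:00",
        cpus_per_task = 12, mem_gb = 96)
    args = [
        "--job-name=$(job_name)",
        "--partition=$(partition)",
        "--account=$(account)",
        "--nodes=$(nodes)",
        "--ntasks=$(ntasks)",
        "--time=$(time)",
        "--cpus-per-task=$(cpus_per_task)",
        "--mem=$(mem_gb)G",
        "--output=$(logdir)/%j.%x.out",  # %j = job ID, %x = job name
        "--error=$(logdir)/%j.%x.err",
    ]
    if mail_user !== nothing
        push!(args, "--mail-user=$(mail_user)", "--mail-type=$(mail_type)")
    end
    return `sbatch $args --wrap=$cmd`
end

command = sketch_sbatch_command(job_name = "demo", partition = "batch",
                                account = "lab", cmd = "echo hello")
println(command)
```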
source
Mycelia.search_viroid_sequencesFunction
search_viroid_sequences(
;
    ...
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}
search_viroid_sequences(
    query::String;
    taxon,
    database,
    max_results
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}

Search for viroid sequences in NCBI databases using taxonomic and keyword filtering.

Arguments

  • query::String: Search query terms (default: "viroid")
  • taxon::String: Taxonomic restriction (default: "viruses[organism]")
  • database::String: NCBI database to search ("nuccore", "protein", etc.)
  • max_results::Int: Maximum number of results to return (default: 100)

Returns

  • Vector{String}: Vector of NCBI accession numbers matching the search criteria

Examples

# Find all viroid genome sequences
accessions = search_viroid_sequences("viroid"; taxon="viruses[organism]", database="nuccore")

# Find specific viroid species
pstv_accessions = search_viroid_sequences("Potato spindle tuber viroid"; taxon="viruses[organism]", database="nuccore")

# Find viroid proteins
protein_accessions = search_viroid_sequences("viroid"; taxon="viruses[organism]", database="protein")
source
Mycelia.select_actionMethod
select_action(policy::DQNPolicy, state::AssemblyState)

Select an action using the DQN policy with epsilon-greedy exploration.

This is a placeholder implementation that will be replaced with actual neural network inference once the ML framework is integrated.

Arguments

  • policy::DQNPolicy: Trained policy network
  • state::AssemblyState: Current environment state

Returns

  • AssemblyAction: Selected action for the given state

Example

action = select_action(policy, current_state)
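
Since the current implementation is described as a placeholder, the following sketch only illustrates the general epsilon-greedy rule: explore with probability epsilon, otherwise pick the action with the highest estimated value. The Q-value dictionary, the :next_k symbol spelling, and the function name are hypothetical and stand in for network output:

```julia
import Random

# Hypothetical epsilon-greedy selection over a table of estimated action values.
function sketch_epsilon_greedy(q_values::Dict{Symbol,Float64}, epsilon::Float64;
        rng = Random.default_rng())
    actions = sort!(collect(keys(q_values)))  # fixed order for reproducibility
    if rand(rng) < epsilon
        return rand(rng, actions)             # explore: uniform random action
    end
    return argmax(a -> q_values[a], actions)  # exploit: highest estimated value
end

q_values = Dict(:continue_k => 0.7, :next_k => 0.2, :terminate => 0.1)
greedy_action = sketch_epsilon_greedy(q_values, 0.0)
println(greedy_action)
```

With epsilon set to 0.0 the rule never explores, so the highest-valued action is always returned.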
source
Mycelia.seq2sha256Method
seq2sha256(seq::AbstractString) -> String

Compute the SHA-256 hash of a sequence string.

Arguments

  • seq::AbstractString: Input sequence to be hashed

Returns

  • String: Hexadecimal representation of the SHA-256 hash

Details

The input sequence is converted to uppercase before hashing.
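
A minimal sketch of the documented behavior using Julia's SHA standard library. The name sketch_seq2sha256 is hypothetical; it shows why sequences that differ only in letter case hash to the same value:

```julia
import SHA

# Hypothetical re-implementation: uppercase first, then hash, then hex-encode.
sketch_seq2sha256(seq::AbstractString) = bytes2hex(SHA.sha256(uppercase(seq)))

digest = sketch_seq2sha256("acgt")
println(digest)
println(digest == sketch_seq2sha256("ACGT"))
```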

source
Mycelia.seq2sha256Method
seq2sha256(seq::BioSequences.BioSequence) -> String

Convert a biological sequence to its SHA256 hash value.

Calculates a cryptographic hash of the sequence by first converting it to a string representation. This method dispatches to the string version of seq2sha256.

Arguments

  • seq::BioSequences.BioSequence: The biological sequence to hash

Returns

  • String: A 64-character hexadecimal string representing the SHA256 hash
source
Mycelia.sequence_to_stranded_pathMethod
sequence_to_stranded_path(
    stranded_kmers,
    sequence
) -> Vector{Pair{Int64, Bool}}

Convert a DNA sequence into a path through a collection of stranded k-mers.

Arguments

  • stranded_kmers: Collection of unique k-mers representing possible path vertices
  • sequence: Input DNA sequence to convert to a path

Returns

Vector of Pair{Int,Bool} where:

  • First element (Int) is the index of the k-mer in stranded_kmers
  • Second element (Bool) indicates orientation (true=forward, false=reverse)
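
The return format can be illustrated with a small string-based sketch. It assumes that each window of the sequence is looked up first as-is (forward) and otherwise as its reverse complement (reverse); the function names and the use of plain strings instead of k-mer types are assumptions, not the Mycelia implementation:

```julia
# Hypothetical string-based stand-in for k-mer types.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))

function sketch_stranded_path(kmers::Vector{String}, sequence::String)
    k = length(first(kmers))
    index = Dict(kmer => i for (i, kmer) in enumerate(kmers))
    path = Pair{Int,Bool}[]
    for start in 1:(length(sequence) - k + 1)
        window = sequence[start:start + k - 1]
        if haskey(index, window)
            push!(path, index[window] => true)            # forward orientation
        else
            push!(path, index[revcomp(window)] => false)  # throws KeyError if absent
        end
    end
    return path
end

path = sketch_stranded_path(["ACG", "CGA"], "ACGT")
println(path)
```

Here the window "CGT" is not in the collection, but its reverse complement "ACG" is, so the second step refers to index 1 in reverse orientation.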
source
Mycelia.setup_checkmMethod
setup_checkm(; db_dir::String=joinpath(homedir(), "workspace", ".checkm"))

Install CheckM via Bioconda and set up its database.

CheckM is a tool for assessing the quality and completeness of bacterial genomes.

Arguments

  • db_dir: Directory to store CheckM database (default: ~/workspace/.checkm)

Example

setup_checkm()
source
Mycelia.setup_checkm2Method
setup_checkm2(; db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Install CheckM2 via Bioconda and set up its database.

CheckM2 is a rapid tool for assessing the quality and completeness of bacterial genomes.

Arguments

  • db_dir: Directory to store CheckM2 database (default: ~/workspace/.checkm2)

Example

setup_checkm2()
source
Mycelia.setup_checkvMethod
setup_checkv(; db_dir::String=joinpath(homedir(), "workspace", ".checkv"))

Install CheckV via Bioconda and set up its database.

CheckV is a tool for assessing the quality and completeness of viral genomes.

Arguments

  • db_dir: Directory to store CheckV database (default: ~/workspace/.checkv)

Example

setup_checkv()
source
Mycelia.setup_padlocMethod
setup_padloc() -> Union{Nothing, Base.Process}

Ensure the padloc environment and database are installed.

Downloads the environment if missing and updates the padloc database.

source
Mycelia.setup_taxonkit_taxonomyMethod
setup_taxonkit_taxonomy(; force_update, max_age_days)

Downloads and extracts the NCBI taxonomy database required for taxonkit operations.

Downloads taxdump.tar.gz from the NCBI FTP server and extracts it to ~/.taxonkit/. This is a prerequisite for using taxonkit-based taxonomy functions.

Arguments

  • force_update::Bool=false: Force download even if taxdump already exists
  • max_age_days::Int=30: Maximum age in days before warning about stale data

Requirements

  • Working internet connection
  • Sufficient disk space (~100MB)
  • taxonkit must be installed separately

Returns

  • Nothing

Throws

  • SystemError if download fails or if unable to create directory
  • ErrorException if tar extraction fails
source
Mycelia.sha256_fileMethod
sha256_file(file::AbstractString) -> String

Compute the SHA-256 hash of the contents of a file.

Arguments

  • file::AbstractString: The path to the file for which the SHA-256 hash is to be computed.

Returns

  • String: The SHA-256 hash of the file contents, represented as a hexadecimal string.
source
Mycelia.shortest_probability_path_nextMethod
shortest_probability_path_next(
    graph::MetaGraphsNext.MetaGraph,
    source::String,
    target::String
) -> Union{Nothing, Mycelia.GraphPath}

Find the shortest path in probability space between two vertices.

Uses Dijkstra's algorithm where edge distances are -log(probability), so the shortest path corresponds to the highest probability path.

Arguments

  • graph: MetaGraphsNext k-mer graph
  • source: Source vertex label
  • target: Target vertex label

Returns

  • Union{GraphPath, Nothing}: Shortest probability path, or nothing if no path exists

Algorithm

  1. Convert edge weights to -log(probability) distances
  2. Run Dijkstra's algorithm with strand-aware edge traversal
  3. Reconstruct path maintaining strand information
  4. Convert back to probability space for final result
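
The four steps can be sketched on a toy graph. This is a simplified illustration under stated assumptions: edges are a plain dictionary of probabilities rather than a MetaGraphsNext graph, strand-aware traversal is omitted, and a dependency-free O(V²) Dijkstra replaces a priority queue. All names are hypothetical:

```julia
# Toy graph: (from, to) => transition probability (made-up values).
edge_prob = Dict(("A", "B") => 0.9, ("B", "C") => 0.9, ("A", "C") => 0.5)

function sketch_best_probability_path(edge_prob, source, target)
    # Step 1: convert probabilities to additive -log(probability) distances.
    neighbors = Dict{String,Vector{Tuple{String,Float64}}}()
    for ((u, v), p) in edge_prob
        push!(get!(neighbors, u, Tuple{String,Float64}[]), (v, -log(p)))
    end
    dist = Dict(source => 0.0)
    prev = Dict{String,String}()
    visited = Set{String}()
    # Step 2: Dijkstra, repeatedly settling the closest unvisited vertex.
    while true
        candidates = [(d, v) for (v, d) in dist if !(v in visited)]
        isempty(candidates) && break
        d, u = minimum(candidates)
        u == target && break
        push!(visited, u)
        for (v, w) in get(neighbors, u, Tuple{String,Float64}[])
            if d + w < get(dist, v, Inf)
                dist[v] = d + w
                prev[v] = u
            end
        end
    end
    haskey(dist, target) || return nothing
    # Step 3: reconstruct the path by walking predecessors back to the source.
    path = [target]
    while path[1] != source
        pushfirst!(path, prev[path[1]])
    end
    # Step 4: convert the summed distance back to a probability.
    return (path = path, probability = exp(-dist[target]))
end

result = sketch_best_probability_path(edge_prob, "A", "C")
println(result)
```

The two-hop route has probability 0.9 × 0.9 = 0.81, which beats the direct edge at 0.5, so minimizing the summed -log values selects the most probable path.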
source
Mycelia.should_continue_kMethod

Determine if we should continue processing the current k-mer size or move to the next. Uses accuracy-prioritized reward function for decision making.

source
Mycelia.simplify_graph_nextMethod
simplify_graph_next(graph::MetaGraphsNext.MetaGraph, bubbles::Vector{BubbleStructure}) -> MetaGraphsNext.MetaGraph

Simplify the graph by resolving bubbles and removing low-confidence paths.

source
Mycelia.simulate_illumina_paired_readsMethod
simulate_illumina_paired_reads(
;
    in_fasta,
    coverage,
    read_count,
    outbase,
    read_length,
    mflen,
    sdev,
    seqSys,
    amplicon,
    errfree,
    rndSeed
)

Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.

This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.

Arguments

  • in_fasta::String: Path to the input FASTA file.
  • coverage::Union{Nothing,Number}: Desired fold coverage (used with --fcov); ignored when read_count is provided. (Default: 20)
  • read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with --rcount instead of fold coverage). (Default: nothing)
  • outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").
  • read_length::Int: Length of reads to simulate (default: 150).
  • mflen::Int: Mean fragment length for paired-end simulations (default: 500).
  • sdev::Int: Standard deviation of fragment lengths (default: 10).
  • seqSys::String: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").
  • paired::Bool: Whether to simulate paired-end reads (default: true).
  • amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).
  • errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).
  • rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).

Outputs

Generates gzipped FASTQ files in the working directory:

  • For paired-end: $(outbase)1.fq.gz (forward) and $(outbase)2.fq.gz (reverse).
  • For single-end: $(outbase)1.fq.gz.

Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.

Details

This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.
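
The choice between --rcount and --fcov described above can be sketched as follows. The function only builds a command object for art_illumina and does not run it; the name sketch_art_command and the exact argument order are assumptions, not the Mycelia implementation:

```julia
# Hypothetical builder: returns a Cmd for art_illumina without executing it.
function sketch_art_command(; in_fasta, outbase, coverage = 20, read_count = nothing,
        read_length = 150, mflen = 500, sdev = 10, seqSys = "HS25",
        amplicon = false, errfree = false, rndSeed = nothing)
    args = ["--paired", "--in", in_fasta, "--out", outbase,
            "--len", string(read_length), "--mflen", string(mflen),
            "--sdev", string(sdev), "--seqSys", seqSys]
    # An explicit read count takes precedence over fold coverage.
    if read_count !== nothing
        push!(args, "--rcount", string(read_count))
    else
        push!(args, "--fcov", string(coverage))
    end
    amplicon && push!(args, "--amplicon")
    errfree && push!(args, "--errfree")
    rndSeed !== nothing && push!(args, "--rndSeed", string(rndSeed))
    return `art_illumina $args`
end

by_coverage = sketch_art_command(in_fasta = "ref.fasta", outbase = "ref.art.20x.")
by_count = sketch_art_command(in_fasta = "ref.fasta", outbase = "ref.art.",
                              read_count = 1000, rndSeed = 42)
println(by_coverage)
println(by_count)
```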

Dependencies

Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads

source
Mycelia.simulate_nanopore_readsMethod
simulate_nanopore_reads(; fasta, quantity, outfile)

Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.

Arguments

  • fasta::String: Path to input reference FASTA file
  • quantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")
  • outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension

Returns

  • String: Path to the generated output FASTQ file

See also: simulate_pacbio_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source
Mycelia.simulate_nearly_perfect_long_readsMethod
simulate_nearly_perfect_long_reads()

Simulate high-quality long reads with minimal errors using Badread.

Arguments

  • reference::String: Path to reference FASTA file
  • quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")
  • length_mean::Int=40000: Mean read length
  • length_sd::Int=20000: Standard deviation of read length

Returns

Vector of simulated reads in FASTQ format

Details

Generates nearly perfect long reads by setting error rates and artifacts to minimum values. Uses ideal quality scores and disables common sequencing artifacts like chimeras and adapters.

See also: simulate_pacbio_reads, simulate_nanopore_reads, simulate_short_reads

source
Mycelia.simulate_pacbio_readsMethod
simulate_pacbio_reads(; fasta, quantity, outfile)

Simulate PacBio HiFi reads using the Badread error model.

Arguments

  • fasta::String: Path to input FASTA file containing reference sequence
  • quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000"); this is not a number of reads
  • outfile::String: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix

Returns

  • String: Path to the generated output file

Notes

  • Requires Badread tool from Bioconda
  • Uses PacBio 2021 error and quality score models
  • Average read length ~15kb
  • Output is gzipped FASTQ format

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source
Mycelia.simulate_variantsMethod
simulate_variants(
    fasta_record::FASTX.FASTA.Record;
    n_variants,
    window_size,
    variant_size_disbribution,
    variant_type_likelihoods
) -> Any

Simulates genetic variants (substitutions, insertions, deletions, inversions) in a DNA sequence.

Arguments

  • fasta_record: Input DNA sequence in FASTA format

Keywords

  • n_variants=√(sequence_length): Number of variants to generate
  • window_size=sequence_length/n_variants: Size of windows for variant placement
  • variant_size_disbribution=Geometric(1/√window_size): Distribution for variant sizes
  • variant_type_likelihoods: Vector of pairs mapping variant types to probabilities
    • :substitution => 10⁻¹
    • :insertion => 10⁻²
    • :deletion => 10⁻²
    • :inversion => 10⁻²

Returns

DataFrame in VCF format containing simulated variants with columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE

Notes

  • Variants are distributed across sequence windows to ensure spread
  • Variant sizes are capped by window size
  • Equivalent variants are filtered out
  • FILTER column indicates variant type
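
To make the default keyword values concrete, the sketch below computes n_variants and window_size for a given sequence length and draws one candidate position per window. Rounding down with floor and integer division, and the one-position-per-window placement, are assumptions; sketch_variant_windows is a hypothetical name:

```julia
import Random

# Hypothetical illustration of the documented defaults and window-based spreading.
function sketch_variant_windows(sequence_length::Int; rng = Random.default_rng())
    n_variants = floor(Int, sqrt(sequence_length))  # default: √(sequence_length)
    window_size = sequence_length ÷ n_variants      # default: length / n_variants
    # One candidate position per window keeps variants spread along the sequence.
    positions = [(w - 1) * window_size + rand(rng, 1:window_size) for w in 1:n_variants]
    return (n_variants = n_variants, window_size = window_size, positions = positions)
end

layout = sketch_variant_windows(10_000; rng = Random.MersenneTwister(1))
println(layout.n_variants, " variants in windows of ", layout.window_size, " bp")
```

For a 10,000 bp sequence this gives 100 windows of 100 bp each, so the default size distribution would be Geometric(1/√100) = Geometric(0.1).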
source
Mycelia.simulate_variantsMethod
simulate_variants(fasta_file::String) -> String

Simulates genetic variants from sequences in a FASTA file and generates corresponding VCF records.

Arguments

  • fasta_file::String: Path to input FASTA file containing sequences to analyze

Details

  1. Processes each record in the input FASTA file
  2. Generates simulated variants for each sequence
  3. Creates a VCF file with the same base name as input file (.vcf extension)
  4. Updates sequences with simulated variants in a new FASTA file (.vcf.fna extension)

Returns

Path to the modified FASTA file containing sequences with simulated variants

source
Mycelia.sort_fastqFunction
sort_fastq(input_fastq) -> String
sort_fastq(input_fastq, output_fastq) -> Any

Sort a FASTQ file by read length. Each 4-line FASTQ entry is converted into a single tab-separated line, a column holding the read length is added, the lines are passed to Unix sort, the length column is removed, and the result is converted back into FASTQ.

Reads are sorted from longest to shortest.

http://thegenomefactory.blogspot.com/2012/11/sorting-fastq-files-by-sequence-length.html
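
The same idea can be sketched in pure Julia on an in-memory string, which may help clarify the record handling. This is not the Unix pipeline the function uses; it assumes strict 4-line records, and sketch_sort_fastq_by_length is a hypothetical name:

```julia
# Hypothetical in-memory equivalent of the linearize / sort / restore pipeline.
function sketch_sort_fastq_by_length(fastq_text::AbstractString)
    lines = split(chomp(fastq_text), '\n')
    # Collapse each 4-line record into one unit (the "single line" form).
    records = [Tuple(lines[i:i + 3]) for i in 1:4:length(lines)]
    # Sort on sequence length (second line of each record), longest first.
    sorted = sort(records; by = record -> length(record[2]), rev = true)
    # Expand back into 4-line FASTQ text.
    return join((join(record, '\n') for record in sorted), '\n') * "\n"
end

fastq_text = "@short\nACG\n+\nIII\n@long\nACGTAC\n+\nIIIIII\n"
sorted_text = sketch_sort_fastq_by_length(fastq_text)
print(sorted_text)
```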

source
Mycelia.split_gff_attributes_into_columnsMethod
split_gff_attributes_into_columns(gff_df) -> Any

Takes a GFF (General Feature Format) DataFrame and expands the attributes column into separate columns.

Arguments

  • gff_df::DataFrame: A DataFrame containing GFF data with an 'attributes' column formatted as key-value pairs separated by semicolons (e.g., "ID=gene1;Name=BRCA1;Type=gene")

Returns

  • DataFrame: The input DataFrame with additional columns for each attribute key found in the 'attributes' column
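
The per-row parsing that such an expansion relies on can be sketched without DataFrames. The helper below is hypothetical and assumes every field has the form key=value; a field without "=" would raise an error:

```julia
# Hypothetical parser for one GFF attributes string.
function sketch_parse_gff_attributes(attributes::AbstractString)
    parsed = Dict{String,String}()
    for field in split(attributes, ';'; keepempty = false)
        key, value = split(field, '='; limit = 2)  # split on the first '=' only
        parsed[String(strip(key))] = String(strip(value))
    end
    return parsed
end

attributes = sketch_parse_gff_attributes("ID=gene1;Name=BRCA1;Type=gene")
println(attributes["Name"])
```

Each key found this way would become a column name, with the parsed value filling that row.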
source
Mycelia.step_environment!Method
step_environment!(env::AssemblyEnvironment, action::AssemblyAction)

Execute an action in the RL environment and return the resulting state and reward.

Arguments

  • env::AssemblyEnvironment: Environment to step
  • action::AssemblyAction: Action to execute

Returns

  • Tuple{AssemblyState, RewardComponents, Bool}: (next_state, reward, done)
    • next_state: State after action execution
    • reward: Reward components for the action
    • done: Whether episode is complete

Example

action = AssemblyAction(:continue_k, Dict(), 0.95, 1000, 5)
next_state, reward, done = step_environment!(env, action)
source
Mycelia.subsample_reads_seqkitMethod
subsample_reads_seqkit(
;
    in_fastq,
    out_fastq,
    n_reads,
    proportion_reads
)

Subsample reads from a FASTQ file using seqkit.

Arguments

  • in_fastq::String: Path to input FASTQ file
  • out_fastq::String="": Path to output FASTQ file. If empty, auto-generated based on input filename
  • n_reads::Union{Missing,Int}=missing: Number of reads to sample
  • proportion_reads::Union{Missing,Float64}=missing: Proportion of reads to sample (0.0-1.0)

Returns

  • String: Path to the output FASTQ file
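The two sampling modes map onto the `-n` (number) and `-p` (proportion) options of `seqkit sample`. The sketch below only builds the command and does not run it. `seqkit_sample_command` is a hypothetical helper, and the rule that exactly one of the two modes must be set is an assumption, not documented behavior.

```julia
# Hypothetical helper: build (but do not run) a seqkit sample command.
function seqkit_sample_command(in_fastq::String, out_fastq::String;
                               n_reads::Union{Missing,Int} = missing,
                               proportion_reads::Union{Missing,Float64} = missing)
    # Assumption: exactly one sampling mode must be chosen
    if ismissing(n_reads) == ismissing(proportion_reads)
        throw(ArgumentError("set exactly one of n_reads or proportion_reads"))
    end
    mode = ismissing(n_reads) ? ["-p", string(proportion_reads)] : ["-n", string(n_reads)]
    # Interpolating a vector into a Cmd expands it into separate arguments
    return `seqkit sample $mode -o $out_fastq $in_fastq`
end

by_count = seqkit_sample_command("in.fq", "out.fq"; n_reads = 1000)
by_fraction = seqkit_sample_command("in.fq", "out.fq"; proportion_reads = 0.1)
```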
source
Mycelia.sufficient_improvementsFunction

Determines whether enough improvement was made to continue with the current k. Enhanced with convergence detection and adaptive thresholds.

source
Mycelia.system_mem_to_minimap_index_sizeMethod
system_mem_to_minimap_index_size(
;
    system_mem_gb,
    denominator
) -> String

Compute the minimap2 index size string based on available system memory.

Arguments

  • system_mem_gb::Real: Amount of memory (GB) to allocate for indexing.
  • denominator::Real: Factor to scale the memory usage (default Mycelia.DEFAULT_MINIMAP_DENOMINATOR).

Returns

  • String: Value such as "4G" suitable for the minimap2 -I option.
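One plausible reading of the computation is to divide the memory budget by the denominator and format the result in whole gigabytes. The rounding rule, the 1G floor, and the helper name `minimap_index_size` are assumptions made for this sketch; the value of `Mycelia.DEFAULT_MINIMAP_DENOMINATOR` is not stated here, so the denominator is passed explicitly.

```julia
# Hypothetical helper: derive a minimap2 -I value from a memory budget.
function minimap_index_size(system_mem_gb::Real, denominator::Real)
    denominator > 0 || throw(ArgumentError("denominator must be positive"))
    # Assumption: round down to whole gigabytes, but never below 1G
    gigabytes = max(1, floor(Int, system_mem_gb / denominator))
    return string(gigabytes, "G")
end

index_size = minimap_index_size(32, 8)
```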
source
Mycelia.system_overviewMethod
system_overview(
;
    path
) -> @NamedTuple{system_threads::Int64, julia_threads::Int64, total_memory::String, available_memory::String, occupied_memory::String, total_storage::String, available_storage::String, occupied_storage::String}
source
Mycelia.tar_extractMethod
tar_extract(; tarchive, directory)

Extract contents of a gzipped tar archive file to a specified directory.

Arguments

  • tarchive::AbstractString: Path to the .tar.gz file to extract
  • directory::AbstractString=dirname(tarchive): Target directory for extraction (defaults to the archive's directory)

Returns

  • AbstractString: Path to the directory where contents were extracted
source
Mycelia.tar_gz_filesMethod
tar_gz_files(output_file::String, input_files::Vector{String})

Creates a tar.gz archive from input_files. Handles many files by writing a file list and using tar -czf ... -T filelist. Appends .tar.gz if missing.
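The file-list approach avoids command-line length limits when archiving many files. The sketch below shows the general pattern: it builds the `tar` command without running it. The helper names `ensure_tar_gz` and `tar_command` are hypothetical, and the use of a temporary file for the list is an assumption.

```julia
# Hypothetical helpers illustrating the extension check and the file-list pattern.
ensure_tar_gz(path::AbstractString) = endswith(path, ".tar.gz") ? String(path) : path * ".tar.gz"

function tar_command(output_file::String, input_files::Vector{String})
    archive = ensure_tar_gz(output_file)
    # Write one input path per line so tar can read them with -T
    list_path, io = mktemp()
    foreach(file -> println(io, file), input_files)
    close(io)
    return `tar -czf $archive -T $list_path`, list_path
end

cmd, list_path = tar_command("bundle", ["a.txt", "b.txt"])
```

Passing `cmd` to `run` would create the archive; the temporary list file can be removed afterwards.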

source
Mycelia.taxids2lcaMethod
taxids2lca(ids::Vector{Int64}) -> Int64

Calculate the Lowest Common Ancestor (LCA) taxonomic ID for a set of input taxonomic IDs.

Arguments

  • ids::Vector{Int}: Vector of NCBI taxonomic IDs

Returns

  • Int: The taxonomic ID of the lowest common ancestor

Details

Uses taxonkit to compute the LCA. Automatically sets up the required taxonomy database if not already present in ~/.taxonkit/.

Dependencies

  • Requires taxonkit (installed via Bioconda)
  • Requires taxonomy database (downloaded automatically if missing)
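To make the LCA concept concrete, here is a small self-contained sketch over a toy taxonomy stored as a child-to-parent map. It does not call taxonkit, and the taxonomic IDs below are made up for illustration.

```julia
# Walk from a taxid up to the root, collecting every ancestor on the way.
function lineage(parent::Dict{Int,Int}, taxid::Int)
    path = [taxid]
    while haskey(parent, path[end]) && parent[path[end]] != path[end]
        push!(path, parent[path[end]])
    end
    return path
end

# The LCA is the deepest node shared by all lineages.
function lowest_common_ancestor(parent::Dict{Int,Int}, ids::Vector{Int})
    isempty(ids) && throw(ArgumentError("need at least one taxid"))
    shared = lineage(parent, first(ids))
    for id in Iterators.drop(ids, 1)
        ancestors = Set(lineage(parent, id))
        # Keep only the ancestors also present in this lineage (order is preserved)
        filter!(in(ancestors), shared)
    end
    return first(shared)
end

# Toy tree: 1 is the root; 2 is under 1; 3 and 4 are under 2; 5 and 6 are under 3
toy_parent = Dict(2 => 1, 3 => 2, 4 => 2, 5 => 3, 6 => 3)
lca_siblings = lowest_common_ancestor(toy_parent, [5, 6])
lca_cousins = lowest_common_ancestor(toy_parent, [5, 4])
```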
source
Mycelia.taxids2ncbi_taxonomy_tableMethod
taxids2ncbi_taxonomy_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of NCBI taxonomy IDs into a detailed taxonomy table using NCBI Datasets CLI.

Arguments

  • taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs to query

Returns

  • DataFrame: Table containing taxonomy information with columns including:
    • tax_id
    • species
    • genus
    • family
    • order
    • class
    • phylum
    • kingdom

Dependencies

Requires ncbi-datasets-cli Conda package (automatically installed if missing)

source
Mycelia.taxids2taxonkit_summarized_lineage_tableMethod
taxids2taxonkit_summarized_lineage_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of taxonomy IDs to a summarized lineage table using taxonkit.

Arguments

  • taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs

Returns

DataFrame with the following columns:

  • taxid: Original input taxonomy ID
  • species_taxid, species: Species level taxonomy ID and name
  • genus_taxid, genus: Genus level taxonomy ID and name
  • family_taxid, family: Family level taxonomy ID and name
  • superkingdom_taxid, superkingdom: Superkingdom level taxonomy ID and name

Missing values are used when a taxonomic rank is not available.

source
Mycelia.taxids2taxonkit_taxid2lineage_ranksMethod
taxids2taxonkit_taxid2lineage_ranks(
    taxids::AbstractVector{Int64}
) -> Dict{Int64, Dict{String, @NamedTuple{lineage::String, taxid::Union{Missing, Int64}}}}

Convert taxonomic IDs to a structured lineage rank mapping.

Takes a vector of taxonomic IDs and returns a nested dictionary mapping each input taxid to its complete taxonomic lineage information. For each taxid, creates a dictionary where:

  • Keys are taxonomic ranks (e.g., "species", "genus", "family")
  • Values are NamedTuples containing:
    • lineage::String: The taxonomic name at that rank
    • taxid::Union{Int, Missing}: The corresponding taxonomic ID (if available)

Excludes "no rank" entries from the final output.

Returns: Dict{Int, Dict{String, NamedTuple{(:lineage, :taxid), Tuple{String, Union{Int, Missing}}}}}
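The nested return structure can be easier to read as a literal. The sketch below constructs a value of the same shape by hand for NCBI taxid 9606; it does not call taxonkit, the "family" entry uses a missing taxid purely to show the Union type, and `RankEntry` is an alias introduced for this example.

```julia
# Alias for the per-rank entry type (name chosen for this example only)
const RankEntry = @NamedTuple{lineage::String, taxid::Union{Missing,Int}}

ranks = Dict{Int,Dict{String,RankEntry}}(
    9606 => Dict{String,RankEntry}(
        "species" => (lineage = "Homo sapiens", taxid = 9606),
        "genus" => (lineage = "Homo", taxid = 9605),
        # A rank whose taxid could not be resolved carries `missing`
        "family" => (lineage = "Hominidae", taxid = missing),
    ),
)

genus_name = ranks[9606]["genus"].lineage
```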

source
Mycelia.taxonomic_id_to_childrenMethod
taxonomic_id_to_children(
    tax_id;
    DATABASE_ID,
    USERNAME,
    PASSWORD
)

Query Neo4j database to find all descendant taxonomic IDs for a given taxonomic ID.

Arguments

  • tax_id: Source taxonomic ID to find children for
  • DATABASE_ID: Neo4j database identifier (required)
  • USERNAME: Neo4j database username (default: "neo4j")
  • PASSWORD: Neo4j database password (required)

Returns

Vector{Int}: Sorted array of unique child taxonomic IDs

source
Mycelia.test_rl_frameworkMethod
test_rl_framework()

Test the reinforcement learning framework with minimal examples.

This function provides a comprehensive test of the RL infrastructure using small synthetic datasets.

Returns

  • Bool: Whether all tests passed

Example

success = test_rl_framework()
source
Mycelia.train_assembly_agentMethod
train_assembly_agent(training_data::Vector{String}, validation_data::Vector{String}; 
                    episodes=1000, episode_length=100)

Train a reinforcement learning agent for assembly optimization.

This function implements the complete training loop for the hierarchical RL system.

Arguments

  • training_data::Vector{String}: Paths to training FASTQ files
  • validation_data::Vector{String}: Paths to validation FASTQ files
  • episodes::Int: Number of training episodes (default: 1000)
  • episode_length::Int: Maximum steps per episode (default: 100)

Returns

  • Tuple{DQNPolicy, Vector{Float64}}: (trained_policy, training_rewards)

Example

training_files = ["train1.fastq", "train2.fastq", "train3.fastq"]
validation_files = ["val1.fastq", "val2.fastq"]
policy, rewards = train_assembly_agent(training_files, validation_files, episodes=500)
source
Mycelia.train_with_curriculumMethod
train_with_curriculum(curriculum_schedule::Vector{Dict}, validation_data::Vector{String})

Train the RL agent using curriculum learning.

Arguments

  • curriculum_schedule::Vector{Dict}: Curriculum stages
  • validation_data::Vector{String}: Validation datasets

Returns

  • Tuple{DQNPolicy, Vector{Float64}}: (trained_policy, stage_rewards)

Example

curriculum = create_curriculum_schedule()
policy, rewards = train_with_curriculum(curriculum, validation_files)
source
Mycelia.translate_nucleic_acid_fastaMethod
translate_nucleic_acid_fasta(
    fasta_nucleic_acid_file,
    fasta_amino_acid_file
) -> Any

Translates nucleic acid sequences from a FASTA file into amino acid sequences.

Arguments

  • fasta_nucleic_acid_file::String: Path to input FASTA file containing nucleic acid sequences
  • fasta_amino_acid_file::String: Path where the translated amino acid sequences will be written

Returns

  • String: Path to the output FASTA file containing translated amino acid sequences
source
Mycelia.transterm_output_to_gffMethod
transterm_output_to_gff(transterm_output) -> Any

Convert TransTerm terminator predictions output to GFF3 format.

Parses TransTerm output and generates a standardized GFF3 file with the following transformations:

  • Sets source field to "transterm"
  • Sets feature type to "terminator"
  • Converts terminator IDs to GFF attributes
  • Renames fields to match GFF3 spec

Arguments

  • transterm_output::String: Path to the TransTerm output file

Returns

  • String: Path to the generated GFF3 file (original filename with .gff extension)
source
Mycelia.trim_galore_pairedMethod
trim_galore_paired(; forward_reads, reverse_reads, outdir)

Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.

Arguments

  • forward_reads::String: Path to forward reads FASTQ file
  • reverse_reads::String: Path to reverse reads FASTQ file
  • outdir::String: Output directory for trimmed files

Returns

  • Tuple{String, String}: Paths to trimmed forward and reverse read files

Dependencies

Requires trim_galore conda environment:

  • mamba create -n trim_galore -c bioconda trim_galore
source
Mycelia.type_to_stringMethod
type_to_string(T::AbstractString) -> Any

Converts an AbstractString type to its string representation.

Arguments

  • T::AbstractString: The string type to convert

Returns

A string representation of the input type

source
Mycelia.type_to_stringMethod
type_to_string(T) -> Any

Convert a type to its string representation, with special handling for Kmer types.

Arguments

  • T: The type to convert to string

Returns

  • String representation of the type
    • For Kmer types: Returns "Kmers.DNAKmer{K}" where K is the kmer length
    • For other types: Returns the standard string representation
source
Mycelia.umap_embedMethod
umap_embed(
    X::AbstractMatrix{<:Real};
    n_neighbors,
    min_dist,
    n_components
) -> UMAP.UMAP_{S, M, N, D, Matrix{Int64}, I} where {S<:Real, M<:AbstractMatrix{S}, N<:AbstractMatrix{S}, D<:AbstractMatrix{S}, I<:AbstractMatrix{S}}
umap_embed(scores::AbstractMatrix{<:Real};
           n_neighbors::Int=15,
           min_dist::Float64=0.1,
           n_components::Int=2)

Embed PC/EPCA scores (k × n_samples) into n_components dimensions via UMAP.

When to use

Use for visualizing high-dimensional data in 2 or 3 dimensions, especially when the data may have nonlinear structure. UMAP is suitable for both continuous and discrete data, and is robust to non-Gaussian distributions. Input should be a matrix of features or dimensionally-reduced scores (e.g., from PCA or EPCA).

Arguments

  • scores : (components × samples) matrix
  • n_neighbors : UMAP neighborhood size
  • min_dist : UMAP min_dist
  • n_components: output dimension (2 or 3)

Returns

  • model : the trained UMAP.UMAP model
source
Mycelia.update_bioconda_envMethod
update_bioconda_env(pkg) -> Base.Process

Update a package and its dependencies in its dedicated Conda environment.

Arguments

  • pkg::String: Name of the package/environment to update
source
Mycelia.update_fasta_with_vcfMethod
update_fasta_with_vcf(; in_fasta, vcf_file, out_fasta)

Apply variants from a VCF file to a reference FASTA sequence.

Arguments

  • in_fasta: Path to input reference FASTA file
  • vcf_file: Path to input VCF file containing variants
  • out_fasta: Optional output path for the modified FASTA. Defaults to the VCF path with '.vcf' replaced by '.normalized.vcf.fna'

Details

  1. Normalizes indels in the VCF using bcftools norm
  2. Applies variants to the reference sequence using bcftools consensus
  3. Handles temporary files and compression with bgzip/tabix
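The steps above can be sketched as a sequence of external commands. The sketch only constructs the commands and does not run them. It assumes the input VCF is already bgzip-compressed, and the exact flags Mycelia passes may differ. `consensus_commands` is a hypothetical helper.

```julia
# Hypothetical helper: list the commands for normalize-then-apply.
function consensus_commands(in_fasta::String, vcf_gz::String, out_fasta::String)
    # Assumption: the normalized VCF is written next to the input
    normalized = replace(vcf_gz, r"\.vcf\.gz$" => ".normalized.vcf.gz")
    return [
        `tabix -p vcf $vcf_gz`,                                       # index the input
        `bcftools norm -f $in_fasta -O z -o $normalized $vcf_gz`,     # normalize indels
        `tabix -p vcf $normalized`,                                   # index the normalized VCF
        `bcftools consensus -f $in_fasta -o $out_fasta $normalized`,  # apply variants
    ]
end

cmds = consensus_commands("ref.fna", "sample.vcf.gz", "sample.consensus.fna")
```

Each command could then be executed in order with `run`, provided the tools are on the PATH.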

Requirements

Requires bioconda packages: htslib, tabix, bcftools

Returns

Path to the output FASTA file containing the modified sequence

source
Mycelia.update_gff_with_mmseqsMethod
update_gff_with_mmseqs(
    gff_file,
    mmseqs_file
) -> DataFrames.DataFrame

Update GFF annotations with protein descriptions from MMseqs2 search results.

Arguments

  • gff_file::String: Path to input GFF3 format file
  • mmseqs_file::String: Path to MMseqs2 easy-search output file

Returns

  • DataFrame: Modified GFF table with updated attribute columns containing protein descriptions

Details

Takes sequence matches from MMseqs2 and adds their descriptions as 'label' and 'product' attributes in the GFF file. Only considers top hits from MMseqs2 results. Preserves existing GFF attributes while prepending new annotations.

source
Mycelia.upload_edge_type_over_url_from_graphMethod
upload_edge_type_over_url_from_graph(
;
    src_type,
    dst_type,
    edge_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload edges of a specific type from a MetaGraph to a Neo4j database, batching uploads in windows.

Arguments

  • src_type: Type of source nodes to filter
  • dst_type: Type of destination nodes to filter
  • edge_type: Type of edges to upload
  • graph: MetaGraph containing the nodes and edges
  • ADDRESS: Neo4j server URL
  • USERNAME: Neo4j username (default: "neo4j")
  • PASSWORD: Neo4j password
  • DATABASE: Neo4j database name (default: "neo4j")
  • window_size: Number of edges to upload in each batch (default: 100)

Details

  • Filters edges based on source, destination and edge types
  • Preserves all edge properties except :TYPE when uploading
  • Uses MERGE operations to avoid duplicate nodes/relationships
  • Uploads are performed in batches for better performance
  • Progress is shown via ProgressMeter
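Windowed batching of this kind can be expressed with `Iterators.partition` from Base. The sketch below shows only the batching step, with integers standing in for edges; `batch_windows` is a hypothetical helper and no database connection is involved.

```julia
# Hypothetical helper: split a collection into consecutive windows.
function batch_windows(items, window_size::Int)
    window_size > 0 || throw(ArgumentError("window_size must be positive"))
    # The final window may be shorter than window_size
    return [collect(window) for window in Iterators.partition(items, window_size)]
end

edges = collect(1:250)  # stand-in for 250 edges
batches = batch_windows(edges, 100)
batch_sizes = length.(batches)
```

Each batch would then be sent to the server as a single request.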

Returns

Nothing

source
Mycelia.upload_node_over_apiMethod
upload_node_over_api(
    graph,
    v;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Upload a single node from a MetaGraph to a Neo4j database using the HTTP API.

Arguments

  • graph: MetaGraph containing the node to be uploaded
  • v: Vertex identifier in the graph
  • ADDRESS: Neo4j server address (e.g. "http://localhost:7474")
  • USERNAME: Neo4j authentication username (default: "neo4j")
  • PASSWORD: Neo4j authentication password
  • DATABASE: Target Neo4j database name (default: "neo4j")

Details

Generates and executes a Cypher MERGE command using the node's properties. The node's :TYPE and :identifier properties are used for node labeling, while other non-empty properties are added as node properties.

source
Mycelia.upload_node_tableMethod
upload_node_table(
;
    table,
    window_size,
    address,
    password,
    username,
    database,
    neo4j_import_dir
)

Upload a DataFrame to Neo4j as nodes in batched windows.

Arguments

  • table::DataFrame: Input DataFrame where each row becomes a node. Must contain a "TYPE" column.
  • address::String: Neo4j server address (e.g. "bolt://localhost:7687")
  • password::String: Neo4j database password
  • neo4j_import_dir::String: Directory path accessible to Neo4j for importing files
  • window_size::Int=1000: Number of rows to process in each batch
  • username::String="neo4j": Neo4j database username
  • database::String="neo4j": Target Neo4j database name

Notes

  • All rows must have the same node type (specified in "TYPE" column)
  • Column names become node properties
  • Requires write permissions on neo4j_import_dir
  • Large tables are processed in batches of size window_size
source
Mycelia.upload_node_type_over_url_from_graphMethod
upload_node_type_over_url_from_graph(
;
    node_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload nodes of a specific type from a graph to a Neo4j database using MERGE operations.

Arguments

  • node_type: The type label for the nodes to upload
  • graph: Source MetaGraph containing the nodes
  • ADDRESS: Neo4j server address (e.g. "bolt://localhost:7687")
  • PASSWORD: Neo4j database password
  • USERNAME="neo4j": Neo4j username (defaults to "neo4j")
  • DATABASE="neo4j": Target Neo4j database name (defaults to "neo4j")
  • window_size=100: Number of nodes to upload in each batch (defaults to 100)

Details

Performs batched uploads of nodes using Neo4j MERGE operations. Node properties are automatically extracted from the graph vertex properties, excluding the 'TYPE' property.

source
Mycelia.upload_nodes_over_apiMethod
upload_nodes_over_api(
    graph;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Uploads all nodes from the given graph to a specified API endpoint.

Arguments

  • graph: The graph containing the nodes to be uploaded.
  • ADDRESS: The API endpoint address.
  • USERNAME: The username for authentication (default: "neo4j").
  • PASSWORD: The password for authentication.
  • DATABASE: The database name (default: "neo4j").
source
Mycelia.upload_nodes_to_neo4jMethod
upload_nodes_to_neo4j(
;
    graph,
    address,
    username,
    password,
    format,
    database,
    neo4j_import_directory
)

Upload all nodes from a MetaGraph to a Neo4j database, processing each unique node type separately.

Arguments

  • graph: A MetaGraph containing nodes to be uploaded
  • address: Neo4j server address (e.g., "bolt://localhost:7687")
  • username: Neo4j authentication username (default: "neo4j")
  • password: Neo4j authentication password
  • format: Data format for upload (default: "auto")
  • database: Target Neo4j database name (default: "neo4j")
  • neo4j_import_directory: Path to Neo4j's import directory for bulk loading
source
Mycelia.validate_assemblyMethod
validate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}

Validate assembly quality using various metrics and optional reference comparison.

Arguments

  • assembly: Assembly result to validate
  • reference: Optional reference sequence for comparison

Returns

  • Dict{String, Any}: Comprehensive validation metrics

Details

Computes assembly quality metrics including:

  • N50, N90 statistics
  • Total assembly length and number of contigs
  • Coverage uniformity (if reference provided)
  • Structural variant detection (if reference provided)
  • Gap analysis and repeat characterization
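As an illustration of the N50 and N90 statistics, the sketch below computes them from a vector of contig lengths using the standard definition: the length of the contig at which the running total of lengths, sorted longest first, reaches the given fraction of the assembly. `nxx` is a hypothetical helper, not part of the Mycelia API.

```julia
# Hypothetical helper: Nxx statistic (N50 for fraction = 0.5, N90 for 0.9).
function nxx(contig_lengths::Vector{Int}, fraction::Real)
    isempty(contig_lengths) && throw(ArgumentError("no contigs given"))
    sorted = sort(contig_lengths; rev = true)
    threshold = fraction * sum(sorted)
    running = 0
    for len in sorted
        running += len
        # The first contig that pushes the running total past the threshold
        running >= threshold && return len
    end
    return last(sorted)
end

contig_lengths = [100, 80, 60, 40, 20]
n50 = nxx(contig_lengths, 0.5)
n90 = nxx(contig_lengths, 0.9)
```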
source
Mycelia.vcat_with_missingMethod
vcat_with_missing(
    dfs::DataFrames.AbstractDataFrame...
) -> Union{DataFrames.DataFrame, Vector{Any}}

Vertically concatenate DataFrames with different column structures by automatically handling missing values.

Arguments

  • dfs: Variable number of DataFrames to concatenate vertically

Returns

  • DataFrame: Combined DataFrame containing all rows and columns from input DataFrames, with missing values where columns didn't exist in original DataFrames
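The union-of-columns behavior can be illustrated without DataFrames by treating each table as a vector of row dictionaries. This is a conceptual sketch only; `union_rows` is a hypothetical helper and the real function works on DataFrames.

```julia
# Hypothetical helper: stack row-dictionaries, filling absent columns with missing.
function union_rows(tables::Vector{Dict{Symbol,Any}}...)
    columns = Set{Symbol}()
    for table in tables, row in table
        union!(columns, keys(row))
    end
    ordered = sort!(collect(columns))
    # Every output row has every column; absent values become `missing`
    return [Dict{Symbol,Any}(column => get(row, column, missing) for column in ordered)
            for table in tables for row in table]
end

table_a = [Dict{Symbol,Any}(:id => 1, :x => 2.0)]
table_b = [Dict{Symbol,Any}(:id => 2, :y => "z")]
combined = union_rows(table_a, table_b)
```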
source
Mycelia.viroid_assembly_workflowMethod
viroid_assembly_workflow(
    viroid_name::String;
    outdir,
    k,
    simulate_coverage,
    read_length,
    error_rate,
    download_references,
    run_assembly
) -> NamedTuple{(:viroid_name, :output_directory, :reference_data, :observations, :assembly_results, :workflow_summary), <:Tuple{String, String, Any, Any, Any, String}}

Complete viroid assembly workflow that downloads reference data and performs quality-aware assembly using Rhizomorph algorithms.

The workflow runs the following steps:

  1. Downloads viroid reference data from NCBI (genome, CDS transcript, FAA protein)
  2. Generates simulated FASTQ observations for DNA, RNA, and amino acid sequences
  3. Uses Rhizomorph Qualmer assembly workflows with quality score propagation
  4. Outputs FASTQ files with consensus quality scores
  5. Performs iterative Viterbi, probabilistic walks, and heaviest path algorithms

Arguments

  • viroid_name::String: Name of viroid to study (e.g., "Potato spindle tuber viroid")
  • outdir::String: Output directory for all results (default: "viroid_assembly_workflow/")
  • k::Int: K-mer size for assembly (default: 21)
  • simulate_coverage::Int: Coverage depth for simulated reads (default: 10)
  • read_length::Int: Length of simulated reads (default: 150)
  • error_rate::Float64: Simulated sequencing error rate (default: 0.01)
  • download_references::Bool: Whether to download reference data (default: true)
  • run_assembly::Bool: Whether to run assembly analysis (default: true)

Returns

  • NamedTuple: Comprehensive results including reference files, simulated data, and assembly results

Examples

# Complete viroid assembly workflow for PSTV
results = viroid_assembly_workflow("Potato spindle tuber viroid"; outdir="pstv_analysis/")

# Quick analysis with existing reference data
results = viroid_assembly_workflow("Hop stunt viroid"; outdir="hsv_analysis/",
                                 download_references=false)

Workflow Details

This function demonstrates the novel quality-aware assembly capabilities:

  • Quality Propagation: Per-base PHRED scores maintained throughout assembly
  • Consensus Scoring: Multiple observations combined using weighted averages
  • Advanced Algorithms: Iterative Viterbi, probabilistic walks, heaviest path finding
  • Multi-sequence Types: Handles DNA, RNA, and amino acid sequences
  • FASTQ Output: Final assemblies include quality scores for downstream analysis
source
Mycelia.visualize_genome_coverageMethod
visualize_genome_coverage(coverage_table) -> Any

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

  • coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Returns

  • Plots.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

source
Mycelia.viterbi_batch_processFunction
viterbi_batch_process(graph::MetaGraph, sequences::Vector, config::ViterbiConfig) -> Vector{ViterbiPath}

Process multiple sequences in batches for memory efficiency.

source
Mycelia.viterbi_decode_nextFunction
viterbi_decode_next(graph::MetaGraph, observations::Vector, config::ViterbiConfig) -> ViterbiPath

Enhanced Viterbi decoding with strand awareness and memory efficiency.

source
Mycelia.viterbi_maximum_likelihood_traversalsMethod
viterbi_maximum_likelihood_traversals(
    stranded_kmer_graph;
    error_rate,
    verbosity
) -> Vector{FASTX.FASTA.Record}

Finds maximum likelihood paths through a stranded k-mer graph using the Viterbi algorithm to correct sequencing errors.

Arguments

  • stranded_kmer_graph: A directed graph where vertices represent k-mers and edges represent overlaps
  • error_rate::Float64: Expected per-base error rate (default: 1/(k+1)). Must be < 0.5
  • verbosity::String: Output detail level ("debug", "reads", or "dataset")

Returns

Vector of FASTX.FASTA.Record containing error-corrected sequences

Details

  • Uses dynamic programming to find most likely path through k-mer graph
  • Accounts for matches, mismatches, insertions and deletions
  • State likelihoods based on k-mer coverage counts
  • Transition probabilities derived from error rate
  • Progress tracking based on verbosity level
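The dynamic-programming idea can be shown on a generic hidden Markov model. The sketch below is a textbook log-space Viterbi over a small dense model, not Mycelia's k-mer graph implementation: states stand in for true k-mers, observations for observed k-mers, and the emission and transition values are illustrative assumptions.

```julia
# Generic log-space Viterbi; returns the most likely state path and its log score.
function viterbi(observations::Vector{Int}, log_initial::Vector{Float64},
                 log_transition::Matrix{Float64}, log_emission::Matrix{Float64})
    n_states, n_obs = length(log_initial), length(observations)
    score = fill(-Inf, n_states, n_obs)
    backpointer = zeros(Int, n_states, n_obs)
    for s in 1:n_states
        score[s, 1] = log_initial[s] + log_emission[s, observations[1]]
    end
    for t in 2:n_obs, s in 1:n_states
        # Best predecessor for state s at step t
        candidates = score[:, t-1] .+ log_transition[:, s]
        best_prev = argmax(candidates)
        score[s, t] = candidates[best_prev] + log_emission[s, observations[t]]
        backpointer[s, t] = best_prev
    end
    # Trace back from the best final state
    path = zeros(Int, n_obs)
    path[end] = argmax(score[:, end])
    for t in n_obs:-1:2
        path[t-1] = backpointer[path[t], t]
    end
    return path, maximum(score[:, end])
end

error_rate = 0.1                                      # probability of a miscalled observation
log_emission = log.([1-error_rate error_rate; error_rate 1-error_rate])
log_transition = log.([0.99 0.01; 0.01 0.99])         # states rarely switch
log_initial = log.([0.5, 0.5])
path, log_score = viterbi([1, 1, 2, 1, 1], log_initial, log_transition, log_emission)
```

With these values, the isolated observation of state 2 is cheaper to explain as an error than as two state switches, so the decoded path stays in state 1.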

Notes

  • Error rate should be probability of error (e.g. 0.01 for 1%), not accuracy
  • Higher verbosity levels ("debug", "reads") provide detailed path finding information
  • "dataset" verbosity shows only summary statistics
source
Mycelia.wcssMethod
wcss(clustering_result) -> Any

Calculate the Within-Cluster Sum of Squares (WCSS) for a clustering result.

Arguments

  • clustering_result: A clustering result object containing:
    • counts: Vector with number of points in each cluster
    • assignments: Vector of cluster assignments for each point
    • costs: Vector of distances/costs from each point to its cluster center

Returns

  • Float64: The total within-cluster sum of squared distances

Description

WCSS measures the compactness of clusters by summing the squared distances between each data point and its assigned cluster center.
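For reference, the quantity can be computed directly from points, centers, and assignments. The sketch below uses squared Euclidean distance with points stored as columns; `within_cluster_sum_of_squares` is a hypothetical helper that does not use the clustering result object described above.

```julia
# Hypothetical helper: WCSS from raw points (one per column) and cluster centers.
function within_cluster_sum_of_squares(points::Matrix{Float64},
                                       centers::Matrix{Float64},
                                       assignments::Vector{Int})
    total = 0.0
    for (i, cluster) in enumerate(assignments)
        # Squared Euclidean distance from point i to its assigned center
        total += sum(abs2, points[:, i] .- centers[:, cluster])
    end
    return total
end

points = [0.0 2.0 10.0 10.0;
          0.0 0.0 10.0 12.0]
centers = [1.0 10.0;
           0.0 11.0]
total_wcss = within_cluster_sum_of_squares(points, centers, [1, 1, 2, 2])
```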

source
Mycelia.write_biosequence_gfaMethod
write_biosequence_gfa(
    graph::MetaGraphsNext.MetaGraph,
    output_file::AbstractString
)

Write a BioSequence graph to GFA format.

Arguments

  • graph: MetaGraphsNext BioSequence graph
  • output_file: Path to output GFA file

Example

write_biosequence_gfa(graph, "assembly.gfa")
source
Mycelia.write_fastaMethod
write_fasta(; outfile, records, gzip)

Writes FASTA records to a file, optionally gzipped.

Arguments

  • outfile::AbstractString: Path to the output FASTA file. Will append ".gz" if gzip is true and ".gz" isn't already the extension.
  • records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.
  • gzip::Bool: Force gzip compression of the output. By default, compression is inferred from the file name.

Returns

  • outfile::String: The path to the output FASTA file (including ".gz" if applicable).
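The file-name handling described above can be sketched as a small pure function. `resolve_fasta_outfile` is a hypothetical helper, and the default of inferring `gzip` from a ".gz" suffix is an assumption based on the argument descriptions.

```julia
# Hypothetical helper: decide the final output path for a FASTA file.
function resolve_fasta_outfile(outfile::AbstractString;
                               gzip::Bool = endswith(outfile, ".gz"))
    # Append ".gz" only when compression is requested and the suffix is absent
    return (gzip && !endswith(outfile, ".gz")) ? outfile * ".gz" : String(outfile)
end

forced = resolve_fasta_outfile("genome.fna"; gzip = true)
inferred = resolve_fasta_outfile("genome.fna.gz")
plain = resolve_fasta_outfile("genome.fna")
```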
source
Mycelia.write_fastas_from_normalized_fastx_tablesMethod
write_fastas_from_normalized_fastx_tables(
    table_paths::Vector{String};
    output_dir::String = pwd(),
    show_progress::Bool = true,
    overwrite::Bool = false,
    error_handler = (e, table_path)->display((e, table_path))
) -> NamedTuple

Given a vector of normalized fastx table paths, writes gzipped FASTA files in parallel. Each table must have the columns "fastx_sha256", "record_sha256", and "record_sequence". Input files ending in ".gz" are decompressed automatically. Returns a summary NamedTuple with successes, failures, failed tables, and output files.

Keyword Arguments

  • output_dir: Directory to write .fna.gz files to.
  • show_progress: Show a progress bar (default: true).
  • overwrite: Overwrite existing files (default: false).
  • error_handler: Function called with (exception, table_path) on error.
source
Mycelia.write_fastqMethod
write_fastq(;records, filename, gzip=false)
write_fastq(; records, filename, gzip)

Write FASTQ records to a file using FASTX.jl. The filename extension must be one of .fastq, .fq, .fastq.gz, or .fq.gz. If gzip is true or the filename ends with .gz, the output is gzipped. records must be an iterable of FASTX.FASTQ.Record.

source
Mycelia.write_fastq_contigsMethod
write_fastq_contigs(result::AssemblyResult, output_file::String)

Write quality-aware contigs to a FASTQ file if quality information is available.

source
Mycelia.write_gfa_nextMethod
write_gfa_next(
    graph::MetaGraphsNext.MetaGraph,
    outfile::AbstractString
) -> AbstractString

Write a MetaGraphsNext k-mer graph to GFA (Graphical Fragment Assembly) format.

This function exports strand-aware k-mer graphs to standard GFA format, handling:

  • Canonical k-mer vertices as segments (S lines)
  • Strand-aware edges as links (L lines) with proper orientations
  • Coverage information as depth annotations

Arguments

  • graph: MetaGraphsNext k-mer graph with strand-aware edges
  • outfile: Path where the GFA file should be written

Returns

  • Path to the written GFA file

GFA Format

The output follows GFA v1.0 specification:

  • Header (H) line with version
  • Segment (S) lines: vertex_id, canonical_k-mer_sequence, depth
  • Link (L) lines: source_id, source_orientation, target_id, target_orientation, overlap
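A minimal sketch of the line layout, written to any IO stream, is shown below. It is independent of MetaGraphsNext: segments and links are plain tuples, `write_minimal_gfa` is a hypothetical helper, and using a `DP:f` tag for depth is an assumption about the annotation (it is a convention read by some GFA viewers).

```julia
# Hypothetical helper: emit header, segment, and link lines in GFA v1 layout.
function write_minimal_gfa(io::IO, segments, links)
    println(io, "H\tVN:Z:1.0")
    for (id, sequence, depth) in segments
        # Assumption: depth is stored in a DP:f optional tag
        println(io, join(("S", id, sequence, "DP:f:$(depth)"), '\t'))
    end
    for (source, source_orientation, target, target_orientation, overlap) in links
        println(io, join(("L", source, source_orientation, target, target_orientation, overlap), '\t'))
    end
end

io = IOBuffer()
# Two 3-mers that overlap by k - 1 = 2 bases
write_minimal_gfa(io, [(1, "ATG", 2.0), (2, "TGC", 1.0)], [(1, '+', 2, '+', "2M")])
gfa_lines = split(String(take!(io)), '\n'; keepempty = false)
```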

Example

graph = build_kmer_graph_next(DNAKmer{3}, observations)
write_gfa_next(graph, "assembly.gfa")
source
Mycelia.write_gffMethod
write_gff(; gff, outfile)

Write GFF (General Feature Format) data to a tab-delimited file.

Arguments

  • gff: DataFrame/Table containing GFF formatted data
  • outfile: String path where the output file should be written

Returns

  • String: Path to the written output file
source
Mycelia.write_quality_biosequence_gfaMethod
write_quality_biosequence_gfa(
    graph::MetaGraphsNext.MetaGraph,
    output_file::AbstractString
)

Write a quality-aware BioSequence graph to GFA format with quality information.

Arguments

  • graph: Quality-aware BioSequence graph
  • output_file: Path to output GFA file

Example

write_quality_biosequence_gfa(graph, "assembly_with_quality.gfa")
source
Mycelia.write_tsvgzMethod
write_tsvgz(df::DataFrames.DataFrame, filename::String; buffer_in_memory::Bool=false, threaded::Bool=true, bufsize::Int=10*1024*1024, force::Bool=false)

Write a DataFrame to a gzipped TSV file.

Arguments

  • df: The DataFrame to write
  • filename: Path to the output file (will add .tsv.gz extension as needed)
  • buffer_in_memory: If false, uses temporary files for large data (default: false)
  • bufsize: Buffer size in bytes for compression stream (default: 10MB)
  • force: If true, overwrite existing non-empty files (default: false)

Returns

  • String: The final filename with proper extension
source
Mycelia.write_vcf_tableMethod
write_vcf_table(; vcf_file, vcf_table, fasta_file)

Write variant data to a VCF v4.3 format file.

Arguments

  • vcf_file::String: Output path for the VCF file
  • vcf_table::DataFrame: Table containing variant data with standard VCF columns
  • fasta_file::String: Path to the reference genome FASTA file

Details

Automatically filters out equivalent variants where REF == ALT. Includes standard VCF headers for substitutions, insertions, deletions, and inversions. Adds GT (Genotype) and GQ (Genotype Quality) format fields.

source
Mycelia.xam_to_contig_mapping_statsMethod
xam_to_contig_mapping_stats(xam) -> Any

Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.

Arguments

  • xam: Path to XAM file or XAM object

Returns

A DataFrame with per-contig statistics including:

  • n_aligned_reads: Number of aligned reads
  • total_aligned_bases: Sum of alignment lengths
  • total_alignment_score: Sum of alignment scores
  • Mapping quality statistics (mean, std, median)
  • Alignment length statistics (mean, std, median)
  • Alignment score statistics (mean, std, median)
  • Percent mismatches statistics (mean, std, median)

Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.
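The filtering and per-contig aggregation can be sketched on plain named tuples. This is a simplified illustration that covers only a few of the listed statistics; the field names of the input records and the helper `contig_mapping_stats` are assumptions made for the example, and no XAM file is read.

```julia
using Statistics

# Hypothetical helper: aggregate a few per-contig statistics.
function contig_mapping_stats(alignments::Vector)
    # Keep only mapped, primary alignments
    kept = filter(a -> a.ismapped && a.isprimary, alignments)
    stats = Dict{String,NamedTuple}()
    for contig in unique(a.contig for a in kept)
        group = filter(a -> a.contig == contig, kept)
        lengths = [a.alignlength for a in group]
        stats[contig] = (
            n_aligned_reads = length(group),
            total_aligned_bases = sum(lengths),
            mean_mapq = mean(a.mapq for a in group),
            median_alignlength = median(lengths),
        )
    end
    return stats
end

alignments = [
    (contig = "c1", ismapped = true,  isprimary = true,  mapq = 60, alignlength = 100),
    (contig = "c1", ismapped = true,  isprimary = true,  mapq = 40, alignlength = 150),
    (contig = "c1", ismapped = true,  isprimary = false, mapq = 0,  alignlength = 50),
    (contig = "c2", ismapped = true,  isprimary = true,  mapq = 30, alignlength = 80),
    (contig = "c2", ismapped = false, isprimary = true,  mapq = 0,  alignlength = 0),
]
stats = contig_mapping_stats(alignments)
```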

source
Mycelia.xam_to_dataframeMethod
xam_to_dataframe(xam_path::String) -> DataFrames.DataFrame

Convert a SAM/BAM file to a DataFrame using the open_xam function.

Parameters:

  • xam_path: Path to the SAM/BAM file
  • header: Whether to include the header (default: false)

Returns:

  • A DataFrame containing the parsed data
source
Mycelia.xam_to_dataframeMethod
xam_to_dataframe(
    reader::XAM.SAM.Reader
) -> DataFrames.DataFrame

Convert SAM/BAM records from a XAM.SAM.Reader into a DataFrame.

Parameters:

  • reader: A XAM.SAM.Reader object for iterating over records

Returns:

  • A DataFrame containing all record data in a structured format
source