Mycelia Documentation

An experimental Julia package for bioinformatics and computational biology

Mycelia is a research-oriented package exploring novel approaches to genomic analysis, with a focus on graph-based genome assembly and quality-aware sequence processing. Currently in early development, it provides both experimental algorithms and integrations with established bioinformatics tools.

Quick Start

New to Mycelia? Start with our Getting Started Guide to install the package and complete your first genomic analysis in minutes.

Key Features & Research Areas

Currently Available

🧬 Sequence Processing: Basic FASTA/FASTQ I/O and read simulation
📊 K-mer Analysis: Canonical k-mer counting and distance metrics
🔧 Tool Integration: Wrappers for established assemblers (MEGAHIT, SPAdes, hifiasm)
⚡ HPC Support: SLURM job submission and rclone integration

In Active Development

🧪 Novel Assembly Algorithms: Graph-based approaches with quality awareness
🌐 Pangenome Analysis: K-mer based comparative genomics
📈 Quality Control: Integration with QC tools (fastp, filtlong, trim_galore)

Planned Features

🔍 Annotation: Gene prediction and functional annotation
🌳 Phylogenetics: Tree construction from pangenome data
📊 Visualization: Interactive plots for genomic data


Installation

Quick Install

import Pkg
Pkg.add(url="https://github.com/cjprybol/Mycelia.git")

Development Install

import Pkg
Pkg.develop(url="git@github.com:cjprybol/Mycelia.git")

For detailed installation instructions including HPC setup, see the Getting Started Guide.

Function Docstrings

Mycelia.AssemblyAction — Type

AssemblyAction

Action representation for reinforcement learning decisions during assembly.

Fields

decision::Symbol: Primary decision (:continue_k, :next_k, :terminate)
viterbi_params::Dict{Symbol, Float64}: Viterbi algorithm parameters
correction_threshold::Float64: Quality threshold for error correction
batch_size::Int: Batch size for processing (memory management)
max_iterations::Int: Maximum iterations at current k before forced progression

Mycelia.AssemblyConfig — Type

Assembly configuration structure.

Mycelia.AssemblyEnvironment — Type

AssemblyEnvironment

Reinforcement learning environment for training assembly decision policies.

Fields

current_state::AssemblyState: Current environment state
training_datasets::Vector{String}: Paths to training FASTQ files
validation_datasets::Vector{String}: Paths to validation FASTQ files
episode_length::Int: Maximum steps per training episode
step_count::Int: Current step in episode
reward_history::Vector{Float64}: Reward history for current episode
action_history::Vector{AssemblyAction}: Action history for experience replay
assembly_cache::Dict{String, Any}: Cache for intermediate assembly results

Mycelia.AssemblyMethod — Type

Assembly method enumeration for unified interface.

Mycelia.AssemblyResult — Type

Assembly result structure containing contigs and metadata.

Mycelia.AssemblyState — Type

AssemblyState

State representation for reinforcement learning environment containing all information needed to make assembly decisions.

Fields

current_k::Int: Current k-mer size being processed
assembly_quality::Float64: Current assembly quality score (QV-based)
correction_rate::Float64: Rate of successful error corrections in recent iterations
memory_usage::Float64: Current memory utilization (fraction of limit)
graph_connectivity::Float64: Graph connectivity metric (proportion of strongly connected components)
coverage_uniformity::Float64: Uniformity of k-mer coverage distribution
error_signal_clarity::Float64: Clarity of error signal detection (sparsity-based)
iteration_history::Vector{Float64}: Recent reward history for trend analysis
k_progression::Vector{Int}: Sequence of k-mer sizes processed so far
corrections_made::Int: Total corrections made at current k
time_elapsed::Float64: Time spent on current k-mer size (seconds)

Mycelia.BenchmarkConfig — Type

Benchmark configuration for consistent testing.

Mycelia.BioSequenceEdgeData — Type

Edge data for variable-length BioSequence graphs.

Mycelia.BioSequenceVertexData — Type

Vertex data for variable-length BioSequence graphs.

Mycelia.BubbleStructure — Type

BubbleStructure

Represents a bubble (alternative path) in the assembly graph.

Mycelia.ContigPath — Type

ContigPath

Represents a linear path through the graph forming a contig.

Mycelia.DQNPolicy — Type

DQNPolicy

Deep Q-Network policy for high-level assembly decisions.

This is a placeholder structure for the neural network architecture that will be implemented with a machine learning framework like Flux.jl or MLJ.jl.

Fields

state_dim::Int: Dimension of state representation
action_dim::Int: Number of possible actions
hidden_dims::Vector{Int}: Hidden layer dimensions
learning_rate::Float64: Learning rate for training
epsilon::Float64: Exploration rate for epsilon-greedy policy
experience_buffer::Vector{Any}: Experience replay buffer
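
The epsilon field above refers to epsilon-greedy exploration. The following is a minimal sketch of that selection rule in plain Julia, assuming a vector of Q-values with one entry per action. The helper name epsilon_greedy is hypothetical and not part of Mycelia, and no neural network is involved here.

```julia
import Random

# Hypothetical helper (not a Mycelia function): choose an action index.
# With probability `epsilon` a random action is explored; otherwise the
# action with the highest Q-value is exploited.
function epsilon_greedy(q_values::Vector{Float64}, epsilon::Float64;
                        rng = Random.default_rng())
    if rand(rng) < epsilon
        return rand(rng, 1:length(q_values))  # explore
    else
        return argmax(q_values)               # exploit
    end
end

q_values = [0.2, 0.9, 0.1]                    # toy Q-values for three actions
greedy_index = epsilon_greedy(q_values, 0.0)  # epsilon = 0 always exploits
random_index = epsilon_greedy(q_values, 1.0)  # epsilon = 1 always explores
```

Lowering epsilon over the course of training is a common way to shift from exploration to exploitation.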

Mycelia.FastqQualityResults — Type

Stores comprehensive FASTQ quality analysis results.

Mycelia.GraphMode — Type

Graph mode for handling strand information.

SingleStrand: Sequences are single-stranded (RNA, amino acids, or directional DNA)
DoubleStrand: Sequences are double-stranded DNA/RNA with canonical representation

Mycelia.GraphPath — Type

Represents a complete path through the k-mer graph.

Mycelia.KmerEdgeData — Type

Type-stable metadata for k-mer graph edges.

Edges represent valid strand-aware transitions between canonical k-mers. The transition is valid only if the strand orientations allow for proper overlap.

Fields:

coverage: Vector of edge traversal observations with strand information
weight: Edge weight/confidence score based on coverage
src_strand: Required strand orientation of source k-mer for this transition
dst_strand: Required strand orientation of destination k-mer for this transition

Mycelia.KmerVertexData — Type

Type-stable metadata for k-mer graph vertices.

Vertices always represent canonical k-mers for memory efficiency and cleaner graphs. Strand information is tracked in the coverage data and edge transitions.

Fields:

coverage: Vector of observation coverage data as (observation_id, position, strand_orientation) tuples
canonical_kmer: The canonical k-mer (BioSequence type - NO string conversion)

Mycelia.PangenomeAnalysisResult — Type

PangenomeAnalysisResult

Results of k-mer based pangenome analysis.

Mycelia.QualityBioSequenceEdgeData — Type

Quality-aware BioSequence edge data.

Mycelia.QualityBioSequenceVertexData — Type

Quality-aware BioSequence vertex data.

Mycelia.QualityDistribution — Type

Stores quality distribution statistics for FASTQ analysis.

Mycelia.QualmerEdgeData — Type

Edge data for quality-aware k-mer graphs.

Mycelia.QualmerObservation — Type

Qualmer observation for tracking k-mer occurrences in sequences.

Mycelia.QualmerVertexData — Type

Vertex data for quality-aware k-mer graphs.

Mycelia.RepeatRegion — Type

RepeatRegion

Represents a repetitive region in the assembly graph.

Mycelia.RewardComponents — Type

RewardComponents

Structured representation of reward signal components for training the RL agent.

Fields

accuracy_reward::Float64: Primary reward based on assembly accuracy (weighted 1000x)
efficiency_reward::Float64: Secondary reward for computational efficiency (weighted 10x)
error_penalty::Float64: Penalty for false positives/negatives (weighted -500x)
progress_bonus::Float64: Bonus for making meaningful progress
termination_reward::Float64: Reward for appropriate termination timing
total_reward::Float64: Weighted sum of all components
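
As a rough illustration of how the weights listed above could combine into total_reward, here is a sketch in plain Julia. It assumes the error term is supplied as a non-negative magnitude and that the progress and termination terms carry a weight of 1; neither assumption is confirmed by the docstring, and the function name combine_rewards is hypothetical.

```julia
# Hypothetical helper (not a Mycelia function). The weights 1000, 10 and -500
# come from the field descriptions; unit weights on the remaining terms are
# an assumption made for this sketch.
function combine_rewards(; accuracy::Float64, efficiency::Float64,
                         error_magnitude::Float64,
                         progress::Float64 = 0.0, termination::Float64 = 0.0)
    return 1000.0 * accuracy +
           10.0 * efficiency -
           500.0 * error_magnitude +
           progress + termination
end

# 1000 * 0.9 + 10 * 0.5 - 500 * 0.1 + 1.0
example_total = combine_rewards(accuracy = 0.9, efficiency = 0.5,
                                error_magnitude = 0.1, progress = 1.0)
```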

Mycelia.ScaffoldResult — Type

ScaffoldResult

Results from scaffolding analysis.

Mycelia.StrandOrientation — Type

Strand orientation for k-mer observations and transitions.

Forward: k-mer as observed (5' to 3')
Reverse: reverse complement of k-mer (3' to 5')

Mycelia.ViterbiConfig — Type

ViterbiConfig

Configuration for Viterbi algorithm parameters.

Mycelia.ViterbiPath — Type

ViterbiPath

Complete Viterbi path through the k-mer graph.

Mycelia.ViterbiState — Type

ViterbiState

State information for Viterbi algorithm on k-mer graphs.

Mycelia.WalkStep — Type

Represents a step in a probabilistic walk through the graph.

Mycelia.JLD2_read_table — Method

JLD2_read_table(filename::String) -> Any

Read a DataFrame from a JLD2 file without needing to know the internal name. If the file contains multiple DataFrames, returns the first one found.

Mycelia.JLD2_write_table — Method

JLD2_write_table(; df, filename)

Write a DataFrame to a JLD2 file using a standardized internal name.

Mycelia._add_biosequence_edges! — Method

Add edges to BioSequence graph based on sequence overlaps.

Mycelia._add_biosequence_edges_from_paths! — Method

Add edges between BioSequences based on k-mer path relationships.

Mycelia._add_observation_to_graph! — Method

_add_observation_to_graph!(
    graph,
    observation,
    obs_idx,
    canonical_kmers,
    graph_mode
)

Add a sequence observation to an existing k-mer graph with strand-aware edge creation.

Arguments

graph: MetaGraphsNext k-mer graph with canonical vertices
observation: FASTA/FASTQ record
obs_idx: Observation index
canonical_kmers: Vector of canonical k-mers in the graph
graph_mode: SingleStrand or DoubleStrand mode

Mycelia._add_quality_biosequence_edges! — Method

Add quality-weighted edges to BioSequence graph.

Mycelia._add_quality_biosequence_edges_from_qualmer_paths! — Method

Add edges between quality-aware BioSequences based on qualmer path relationships.

Mycelia._add_qualmer_edges! — Method

Add edges to the qualmer graph based on k-mer adjacency in sequences.

Mycelia._add_simulated_errors — Method

Add simulated sequencing errors to a sequence string.

Mycelia._add_strand_aware_edge! — Method

Helper function to add strand-aware coverage data to an edge.

This function creates edges that respect strand orientation constraints. Each edge represents a biologically valid transition between k-mers.

Mycelia._add_vertex_coverage! — Method

Helper function to add coverage data to a vertex.

Mycelia._assemble_biosequence_graph — Method

BioSequence graph assembly implementation (variable-length simplified from k-mer graphs).

Mycelia._assemble_hybrid_olc — Method

Hybrid OLC assembly (placeholder for future implementation).

Mycelia._assemble_kmer_graph — Method

K-mer graph assembly implementation (fixed-length k-mer foundation).

Mycelia._assemble_multi_k — Method

Multi-k assembly (placeholder for future implementation).

Mycelia._assemble_ngram_graph — Method

N-gram graph assembly implementation (fixed-length Unicode character analysis).

Mycelia._assemble_quality_biosequence_graph — Method

Quality-aware BioSequence graph assembly implementation (variable-length simplified from qualmer graphs).

Mycelia._assemble_qualmer_graph — Method

Quality-aware k-mer graph assembly implementation (fixed-length qualmer foundation).

Mycelia._assemble_string_graph — Method

String graph assembly implementation (variable-length simplified from N-gram graphs).

Mycelia._calculate_consensus_quality_from_observations — Method

Calculate consensus quality from multiple observations at a specific position. Combines the observations through their joint probability to compute a single consensus quality score.
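
One common way to combine Phred scores through a joint probability is sketched below. It assumes the observations are independent and all support the same base, so their error probabilities multiply. The cap of 60 and the helper names are assumptions for illustration and may not match what Mycelia does internally.

```julia
# Phred <-> error-probability conversions.
phred_to_prob(q) = 10.0^(-q / 10)
prob_to_phred(p) = -10.0 * log10(p)

# Hypothetical helper: joint error probability of independent, agreeing
# observations, converted back to a (capped) Phred score.
function consensus_phred(quals::Vector{<:Real}; max_q = 60.0)
    joint_error = prod(phred_to_prob.(quals))
    return min(prob_to_phred(joint_error), max_q)
end

two_obs = consensus_phred([20, 30])  # 1e-2 * 1e-3 = 1e-5, i.e. about Q50
capped  = consensus_phred([40, 40])  # Q80 before capping, Q60 after
```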

Mycelia._calculate_l_statistic — Method

_calculate_l_statistic(sorted_lengths, threshold)

Calculate L-statistic (number of contigs needed to reach a given percentage of total assembly length). For example, L50 is the number of contigs needed to reach 50% of the total assembly length.

Arguments

sorted_lengths: Vector of contig lengths sorted in descending order
threshold: Fraction of total length (e.g., 0.5 for L50, 0.9 for L90)

Returns

Int: Number of contigs needed to reach the threshold
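
The definition above can be sketched in a few lines of plain Julia. This is an illustrative re-implementation under the stated definition, not the package's source; the function name l_statistic is hypothetical.

```julia
# Count how many of the longest contigs are needed to reach
# `threshold` (a fraction) of the total assembly length.
function l_statistic(sorted_lengths::Vector{Int}, threshold::Float64)
    target = threshold * sum(sorted_lengths)
    running = 0
    for (i, len) in enumerate(sorted_lengths)
        running += len
        running >= target && return i
    end
    return length(sorted_lengths)
end

lengths = sort([100, 400, 200, 300]; rev = true)  # [400, 300, 200, 100]
l50 = l_statistic(lengths, 0.5)  # 400 + 300 = 700 >= 500
l90 = l_statistic(lengths, 0.9)  # 400 + 300 + 200 = 900 >= 900
```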

Mycelia._calculate_n_statistic — Method

Calculate N-statistic (N50, N90, etc.) for contig lengths.
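
For reference, a minimal sketch of the N-statistic in plain Julia: contigs are sorted from longest to shortest, and the length of the contig at which the running total first reaches the requested fraction of the assembly is returned. The function name n_statistic is hypothetical and the code is not taken from Mycelia.

```julia
# Length of the contig at which the cumulative length (longest first)
# first reaches `threshold` (a fraction) of the total assembly length.
function n_statistic(lengths::Vector{Int}, threshold::Float64)
    sorted = sort(lengths; rev = true)
    target = threshold * sum(sorted)
    running = 0
    for len in sorted
        running += len
        running >= target && return len
    end
    return 0  # only reached for an empty input
end

n50 = n_statistic([100, 400, 200, 300], 0.5)  # running total hits 700 at the 300 bp contig
```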

Mycelia._calculate_path_coverage — Method

Calculate coverage for a path from constituent k-mers.

Mycelia._calculate_qualmer_path_coverage — Method

Calculate coverage for qualmer path.

Mycelia._calculate_transition_probabilities — Method

Helper function to calculate normalized transition probabilities.

Mycelia._check_conda_env_exists — Method

_check_conda_env_exists(env_name::AbstractString) -> Bool

Check if a conda environment exists.

Mycelia._contigs_to_records — Method

Convert contigs to FASTA records for graph construction.

Mycelia._create_empty_kmer_graph — Method

Helper function to create an empty k-mer graph.

Mycelia._detect_sequence_extension — Method

_detect_sequence_extension(sequence_type::Symbol) -> String

Internal helper function to convert sequence type to file extension.

Arguments

sequence_type: Symbol representing sequence type (:DNA, :RNA, or :AA)

Returns

String: Appropriate file extension

Mycelia._determine_kmer_type — Method

Determine appropriate k-mer type from observations.

Mycelia._determine_strand — Method

Determine strand orientation by comparing original and canonical qualmers.

Mycelia._find_heaviest_eulerian_path — Method

Find the heaviest (highest confidence) Eulerian path starting from a given vertex.

Mycelia._find_linear_paths — Method

Find linear paths in a k-mer graph.

Mycelia._find_linear_qualmer_paths — Method

Find linear paths in a qualmer graph.

Mycelia._find_quality_sequence_overlap — Method

Find quality-weighted overlap between two sequences.

Mycelia._find_qualmer_paths — Method

Find quality-aware paths in a qualmer graph.

Mycelia._find_sequence_overlap — Method

Find overlap between two sequences.

Mycelia._generate_contigs_from_qualmer_graph — Method

Generate contigs from qualmer graph using probabilistic walks.

Mycelia._generate_contigs_probabilistic — Method

Generate contigs using probabilistic walks when Eulerian paths are not available.

Mycelia._generate_fastq_contigs_from_qualmer_graph — Method

Generate FASTQ contigs from qualmer graph using probabilistic walks when no paths are found.

Mycelia._get_output_files — Method

_get_output_files(output_dir::AbstractString) -> Vector{String}

Get a list of all files in the output directory.

Mycelia._get_valid_transitions — Method

Helper function to get valid transitions from a vertex with given strand orientation.

Mycelia._is_valid_qualmer_transition — Method

Validate that the transition between two k-mers is biologically valid.

Mycelia._is_valid_transition — Method

_is_valid_transition(
    src_kmer,
    dst_kmer,
    src_strand,
    dst_strand,
    k
) -> Any

Validate that a transition between two k-mers with given strand orientations is biologically valid.

For a transition to be valid, the suffix of the source k-mer must match the prefix of the destination k-mer when accounting for strand orientations.

Arguments

src_kmer: Source canonical k-mer
dst_kmer: Destination canonical k-mer
src_strand: Strand orientation of source k-mer
dst_strand: Strand orientation of destination k-mer
k: K-mer size

Returns

Bool: true if transition is biologically valid
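
The overlap rule described above can be sketched with plain strings. This version assumes DNA, represents strand orientation as a Bool (true for forward) rather than Mycelia's StrandOrientation type, and uses hypothetical helper names throughout.

```julia
# Reverse complement of a DNA string (A, C, G, T only).
function reverse_complement(s::AbstractString)
    comp = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(comp[c] for c in reverse(s))
end

# Sequence actually traversed for a canonical k-mer in a given orientation.
oriented(kmer, forward::Bool) = forward ? kmer : reverse_complement(kmer)

# Valid when the (k-1)-suffix of the oriented source equals the
# (k-1)-prefix of the oriented destination.
function valid_transition(src, dst, src_forward::Bool, dst_forward::Bool, k::Int)
    s = oriented(src, src_forward)
    d = oriented(dst, dst_forward)
    return s[2:k] == d[1:k-1]
end

forward_ok = valid_transition("AAC", "ACC", true, true, 3)    # AAC -> ACC
reverse_ok = valid_transition("AAC", "TAA", false, false, 3)  # GTT -> TTA
mismatch   = valid_transition("AAC", "TAA", true, true, 3)    # AAC -> TAA
```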

Mycelia._iterative_viterbi_paths — Method

Iterative Viterbi algorithm for finding optimal paths through qualmer graph. Uses dynamic programming with quality scores as emission/transition probabilities.

Mycelia._path_to_sequence — Method

Convert a path of vertices to a DNA sequence.

Mycelia._polish_contig_viterbi — Method

Polish a single contig using Viterbi error correction.

Mycelia._prepare_fastq_observations — Method

Convert observations to FASTQ records for quality processing.

Mycelia._prepare_observations — Method

Prepare observations from various input formats (FASTA/FASTQ records or file paths).

Mycelia._quality_weighted_walk — Function

Perform quality-weighted walk through qualmer graph.

Mycelia._qualmer_path_to_consensus_fastq — Method

Enhanced qualmer path to sequence with consensus quality calculation. Uses joint probability from multiple observations to compute consensus quality.

Mycelia._qualmer_path_to_fastq_record — Method

Enhanced qualmer path to FASTQ record conversion with quality propagation. This is the core function that enables quality-aware assembly output.

Mycelia._qualmer_path_to_sequence — Method

Convert qualmer path to DNA sequence.

Mycelia._reconstruct_sequence_and_quality_from_qualmer_path — Method

Reconstruct sequence and quality from qualmer path.

Mycelia._reconstruct_sequence_from_kmer_path — Method

Reconstruct a BioSequence from a path of k-mers.

Mycelia._reconstruct_sequence_from_path — Method

Helper function to reconstruct sequence from a graph path.

Mycelia._reconstruct_shortest_path — Method

Helper function to reconstruct shortest path from Dijkstra's algorithm.

Mycelia._sample_transition — Method

Helper function to sample a transition based on probabilities.
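
Normalizing edge weights into probabilities and then sampling one transition is a standard pattern for probabilistic walks. The sketch below shows one way to do it with the Random standard library; the helper names and the cumulative-sum approach are assumptions, not Mycelia's implementation.

```julia
import Random

# Turn non-negative weights into probabilities that sum to 1.
normalize_weights(weights::Vector{Float64}) = weights ./ sum(weights)

# Draw an index with probability proportional to its entry in `probs`.
function sample_index(probs::Vector{Float64}; rng = Random.default_rng())
    r = rand(rng)
    cumulative = 0.0
    for (i, p) in enumerate(probs)
        cumulative += p
        r < cumulative && return i
    end
    return length(probs)  # guard against floating-point round-off
end

probs = normalize_weights([3.0, 1.0])               # [0.75, 0.25]
choice = sample_index(probs; rng = Random.MersenneTwister(42))
```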

Mycelia._sequence_to_canonical_path — Method

_sequence_to_canonical_path(
    canonical_kmers,
    sequence,
    graph_mode
) -> Vector{<:Tuple{Any, Mycelia.StrandOrientation}}

Convert a sequence to a path through canonical k-mer space with strand awareness.

This is the key function that handles the distinction between single-strand and double-strand modes while maintaining canonical k-mer vertices.

Arguments

canonical_kmers: Vector of canonical k-mers available in the graph
sequence: DNA/RNA sequence to convert
graph_mode: SingleStrand or DoubleStrand mode

Returns

Vector of (canonical_kmer, strand_orientation) pairs representing the path

Details

In DoubleStrand mode:

Each observed k-mer is converted to its canonical form
Strand orientation tracks whether the canonical form matches the observed k-mer
Edges respect the biological constraint that transitions must maintain proper overlap

In SingleStrand mode:

K-mers are used as-is (no reverse complement consideration)
All strand orientations are Forward
Suitable for RNA, amino acids, or directional DNA analysis
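
To make the two modes concrete, here is a simplified sketch using plain strings. It assumes the canonical form is the lexicographically smaller of a k-mer and its reverse complement (a common convention), uses Symbols in place of StrandOrientation, and omits the canonical_kmers lookup that the real function takes. All names are hypothetical.

```julia
function reverse_complement(s::AbstractString)
    comp = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(comp[c] for c in reverse(s))
end

# Walk the sequence k-mer by k-mer and record (k-mer, orientation) pairs.
function canonical_path(sequence::AbstractString, k::Int; double_strand::Bool = true)
    path = Tuple{String, Symbol}[]
    for i in 1:(length(sequence) - k + 1)
        kmer = String(sequence[i:i+k-1])
        if double_strand
            rc = reverse_complement(kmer)
            # Keep the smaller of the two forms and remember which one was observed.
            push!(path, kmer <= rc ? (kmer, :forward) : (rc, :reverse))
        else
            push!(path, (kmer, :forward))  # single-strand: use k-mers as observed
        end
    end
    return path
end

double_path = canonical_path("GTTA", 3)
single_path = canonical_path("GTTA", 3; double_strand = false)
```

In the double-strand case both observed k-mers (GTT and TTA) are larger than their reverse complements, so the path stores AAC and TAA with a reverse orientation.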

Mycelia._setup_phageboost_environment — Function

_setup_phageboost_environment(force_reinstall::Bool=false)

Set up the PhageBoost conda environment if it doesn't exist or if force_reinstall is true.

Mycelia._simplify_ngram_graph — Method

Simplify N-gram graph by removing unnecessary complexity.

Mycelia._simplify_ngram_to_string_graph — Method

Convert N-gram graph to variable-length string graph.

Mycelia._simplify_string_graph — Method

Simplify string graph by removing unnecessary complexity.

Mycelia._simulate_fastq_reads_from_sequence — Method

_simulate_fastq_reads_from_sequence(
    sequence,
    identifier::String;
    coverage,
    read_length,
    error_rate,
    sequence_type
) -> Vector{FASTX.FASTQ.Record}

Generate simulated FASTQ reads from a reference sequence.

Arguments

sequence: Reference sequence (String or BioSequences type)
identifier: Sequence identifier
coverage::Int: Desired coverage depth
read_length::Int: Length of individual reads
error_rate::Float64: Simulated error rate
sequence_type::String: Type of sequence ("DNA", "RNA", "AA")

Returns

Vector{FASTX.FASTQ.Record}: Vector of simulated FASTQ reads with quality scores

Mycelia._validate_against_reference — Method

Validate assembly against reference sequence (placeholder).

Mycelia._viterbi_optimal_path — Method

Find optimal path using Viterbi-like dynamic programming on quality scores.

Mycelia.accuracy — Method

accuracy(true_labels, pred_labels)

Returns the overall accuracy.
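
Overall accuracy is conventionally the fraction of predictions that equal the true label. A minimal sketch under that assumption, with a hypothetical function name:

```julia
# Fraction of positions where the predicted label equals the true label.
function label_accuracy(true_labels, pred_labels)
    length(true_labels) == length(pred_labels) ||
        throw(ArgumentError("label vectors must have equal length"))
    return count(true_labels .== pred_labels) / length(true_labels)
end

acc = label_accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 labels agree
```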

Mycelia.add_bioconda_env — Method

add_bioconda_env(pkg; force) -> Union{Nothing, Base.Process}

Create a new Conda environment with a specified Bioconda package.

Arguments

pkg::String: Package name to install. Can include channel specification using the format "channel::package"

Keywords

force::Bool=false: If true, recreates the environment even if it already exists

Details

The function creates a new Conda environment named after the package and installs the package into it. It uses channel priority: conda-forge > bioconda > defaults. If CONDA_RUNNER is set to 'mamba', it will ensure mamba is installed first.

Examples

# Install basic package
add_bioconda_env("blast")

# Install from specific channel
add_bioconda_env("bioconda::blast")

# Force reinstallation
add_bioconda_env("blast", force=true)

Notes

Requires Conda.jl to be installed and configured
Uses CONDA_RUNNER global variable to determine whether to use conda or mamba
Cleans conda cache after installation

Mycelia.add_edgemer_to_graph! — Method

add_edgemer_to_graph!(
    graph,
    record_identifier,
    index,
    observed_edgemer
) -> Any

Add an observed edgemer to a graph with its associated metadata.

Arguments

graph::MetaGraph: The graph to modify
record_identifier: Identifier for the source record
index: Position where edgemer was observed
observed_edgemer: The biological sequence representing the edgemer

Details

Processes the edgemer by:

Splitting it into source and destination kmers
Converting kmers to their canonical forms
Creating or updating an edge with orientation metadata
Storing observation details (record, position, orientation)

Returns

Modified graph with the new edge and metadata

Note

If the edge already exists, the observation is added to the existing metadata.

Mycelia.add_fastx_records_to_graph! — Method

add_fastx_records_to_graph!(graph, fastxs) -> Any

Add FASTX records from multiple files as a graph property.

Arguments

graph: A MetaGraph that will store the FASTX records
fastxs: Collection of FASTA/FASTQ file paths to process

Details

Creates a dictionary mapping sequence descriptions to their corresponding FASTX records, then stores this dictionary as a graph property under the key :records. Multiple input files are merged, with later files overwriting records with duplicate descriptions.

Returns

The modified graph with added records property.

Mycelia.add_record_edgemers_to_graph! — Method

add_record_edgemers_to_graph!(graph) -> Any

Processes DNA sequence records stored in the graph and adds their edgemers (k+1 length subsequences) to build the graph structure.

Arguments

graph: A Mycelia graph object containing DNA sequence records and graph properties

Details

Uses the k-mer size specified in graph.gprops[:k] to generate k+1 length edgemers
Iterates through each record in graph.gprops[:records]
For each record, generates all possible overlapping edgemers
Adds each edgemer to the graph with its position and record information

Returns

The modified graph object with added edgemer information
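
The edgemer idea can be illustrated without a graph: each window of length k+1 contains a source k-mer (its first k characters) and a destination k-mer (its last k characters). The sketch below works on plain strings and skips canonicalization and metadata; the function name edgemers is hypothetical.

```julia
# Enumerate all overlapping (k+1)-length windows of a sequence and split
# each one into its source and destination k-mers.
function edgemers(sequence::AbstractString, k::Int)
    return [(index = i,
             src = sequence[i:i+k-1],
             dst = sequence[i+1:i+k])
            for i in 1:(length(sequence) - k)]
end

edges = edgemers("ACGTA", 3)
# index 1: ACG -> CGT
# index 2: CGT -> GTA
```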

Mycelia.adjust_quality_scores — Method

Adjust quality scores based on likelihood improvement.

Mycelia.alphabet_to_biosequence_type — Method

alphabet_to_biosequence_type(
    alphabet::Symbol
) -> Union{Type{BioSequences.LongAA}, Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}}

Determine the BioSequence type from an alphabet symbol.

Maps alphabet symbols to the corresponding BioSequences.jl type for type-safe sequence operations throughout the codebase.

Arguments

alphabet::Symbol: The alphabet symbol (:DNA, :RNA, or :AA)

Returns

Type{<:BioSequences.BioSequence}: The corresponding BioSequence type

Examples

alphabet_to_biosequence_type(:DNA)  # Returns BioSequences.LongDNA{4}
alphabet_to_biosequence_type(:RNA)  # Returns BioSequences.LongRNA{4}
alphabet_to_biosequence_type(:AA)   # Returns BioSequences.LongAA

Mycelia.amino_acids_to_codons — Method

amino_acids_to_codons() -> Dict{BioSymbols.AminoAcid, DataType}

Creates a mapping from amino acids to representative DNA codons using the standard genetic code.

Returns

Dictionary mapping each amino acid (including stop codon AA_Term) to a valid DNA codon that encodes it

Mycelia.analyze_fastq_quality — Method

analyze_fastq_quality(fastq_file::String)

Analyzes quality metrics for a FASTQ file.

Calculates comprehensive quality statistics including read count, quality scores, length distribution, GC content, and quality threshold percentages.

Arguments

fastq_file: Path to FASTQ file (can be gzipped)

Returns

FastqQualityResults with the following fields:

n_reads: Total number of reads
mean_quality: Average Phred quality score across all reads
mean_length: Average read length
gc_content: GC content percentage
quality_distribution: QualityDistribution with Q20+, Q30+, Q40+ percentages

Example

quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
println("Total reads: $(quality_stats.n_reads)")
println("Mean quality: $(quality_stats.mean_quality)")
println("Q30+ reads: $(quality_stats.quality_distribution.q30_percent)%")
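
For readers unfamiliar with the underlying metrics, the following sketch computes them by hand on two toy reads. It assumes Phred+33 quality encoding and treats the Q30+ percentage as the share of reads whose mean quality is at least 30; the docstring does not say whether Mycelia counts per read or per base, so that choice is an assumption.

```julia
# Decode a Phred+33 quality string into integer scores.
phred_scores(quality::AbstractString) = [Int(c) - 33 for c in quality]
mean_value(xs) = sum(xs) / length(xs)
gc_count(seq::AbstractString) = count(c -> c == 'G' || c == 'C', seq)

# Toy (sequence, quality) pairs: 'I' encodes Q40 and '5' encodes Q20.
reads = [("ACGT", "IIII"), ("GGCC", "5555")]

read_means   = [mean_value(phred_scores(q)) for (_, q) in reads]
mean_quality = mean_value(read_means)
q30_percent  = 100 * count(m -> m >= 30, read_means) / length(reads)
gc_percent   = 100 * sum(gc_count(s) for (s, _) in reads) /
               sum(length(s) for (s, _) in reads)
```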

Mycelia.analyze_pangenome_kmers — Method

analyze_pangenome_kmers(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, distance_metric=:jaccard)

Perform comprehensive k-mer based pangenome analysis using existing Mycelia infrastructure.

Leverages existing count_canonical_kmers and distance metric functions to analyze genomic content across multiple genomes, identifying core, accessory, and unique regions.

Arguments

genome_files: Vector of FASTA file paths containing genome sequences
kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
distance_metric: Distance metric (:jaccard, :bray_curtis, :cosine, :js_divergence)

Returns

PangenomeAnalysisResult with comprehensive pangenome statistics

Example

genome_files = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.analyze_pangenome_kmers(genome_files, kmer_type=Kmers.DNAKmer{31})
println("Core k-mers: $(length(result.core_kmers))")
println("Total pangenome size: $(size(result.presence_absence_matrix, 1)) k-mers")
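
The core and pangenome concepts reduce to set operations on k-mers. The sketch below uses plain string k-mers (not canonical k-mers and not Kmers.jl types) on toy sequences, so it only illustrates the idea; helper names are hypothetical.

```julia
# Set of all k-mers observed in a sequence.
kmer_set(seq::AbstractString, k::Int) =
    Set(seq[i:i+k-1] for i in 1:(length(seq) - k + 1))

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
jaccard_distance(a::Set, b::Set) =
    1 - length(intersect(a, b)) / length(union(a, b))

genomes = ["ACGTAC", "ACGTTT", "ACGGGG"]
sets = [kmer_set(g, 3) for g in genomes]

core_kmers = intersect(sets...)  # present in every genome
pangenome  = union(sets...)      # present in at least one genome
d12 = jaccard_distance(sets[1], sets[2])
```

Here only ACG is shared by all three toy genomes, while the pangenome holds eight distinct 3-mers.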

Mycelia.analyze_repeat_region — Method

Analyze a potential repeat region starting from a vertex.

Mycelia.annotate_aa_fasta — Method

annotate_aa_fasta(; fasta, identifier, basedir, mmseqsdb, threads)

Annotate amino acid sequences in a FASTA file using MMseqs2 search against UniRef50 database.

Arguments

fasta: Path to input FASTA file containing amino acid sequences
identifier: Name for the output directory (defaults to FASTA filename without extension)
basedir: Base directory for output (defaults to current directory)
mmseqsdb: Path to MMseqs2 formatted UniRef50 database (defaults to ~/workspace/mmseqs/UniRef50)
threads: Number of CPU threads to use (defaults to system thread count)

Returns

Path to the output directory containing MMseqs2 search results

The function creates a new directory named by identifier under basedir, copies the input FASTA file, and runs MMseqs2 easy-search against the specified database. If the output directory already exists, the function skips processing and returns the directory path.

Mycelia.annotate_fasta — Method

annotate_fasta(; fasta, identifier, basedir, mmseqsdb, threads)

Perform comprehensive annotation of a FASTA file including gene prediction, protein homology search, and terminator prediction.

Arguments

fasta::String: Path to input FASTA file
identifier::String: Unique identifier for output directory (default: FASTA filename without extension)
basedir::String: Base directory for output (default: current working directory)
mmseqsdb::String: Path to MMseqs2 UniRef50 database (default: joinpath(homedir(), "workspace/mmseqs/UniRef50"))
threads::Int: Number of CPU threads to use (default: all available). Note: in this version of the function, this argument is not explicitly passed to Pyrodigal or MMseqs2. They may fall back to their own defaults, and run_pyrodigal or run_mmseqs_easy_search may need modification to respect it.

Processing Steps

Creates output directory and copies input FASTA.
Runs Pyrodigal for gene prediction (nucleotide, amino acid, and GFF output).
Performs MMseqs2 homology search against UniRef50.
Predicts terminators using TransTerm.
Combines annotations into a unified GFF file.
Generates GenBank format output.

Returns

String: Path to the output directory containing all generated files.

Files Generated (within the output directory specified by identifier)

(basename(fasta)).pyrodigal.fna: Predicted genes (nucleotide) from Pyrodigal.
(basename(fasta)).pyrodigal.faa: Predicted proteins from Pyrodigal.
(basename(fasta)).pyrodigal.gff: Pyrodigal GFF annotations.
(basename(fasta)).gff: Combined GFF annotations (MMseqs2 and TransTerm).
(basename(fasta)).gff.genbank: Final GenBank format from the first combined GFF.
(basename(fasta)).transterm_raw.gff: Combined GFF (MMseqs2 and a second TransTerm run).
(basename(fasta)).transterm_raw.gff.genbank: Final GenBank format from the second combined GFF.

Mycelia.apply_learned_policy — Method

apply_learned_policy(policy::DQNPolicy, input_fastq::String; output_dir="rl_assembly")

Apply a trained RL policy to perform genome assembly.

This function uses a trained policy to make autonomous assembly decisions.

Arguments

policy::DQNPolicy: Trained assembly policy
input_fastq::String: Path to input FASTQ file
output_dir::String: Output directory for assembly results

Returns

Dict{String, Any}: Assembly results and metadata

Example

results = apply_learned_policy(trained_policy, "genome.fastq")

Mycelia.are_equivalent_bubbles — Method

Check if two bubbles are equivalent.

Mycelia.assemble_genome — Method

assemble_genome(reads; method=StringGraph, config=AssemblyConfig()) -> AssemblyResult

Unified genome assembly interface using Phase 2 next-generation algorithms.

Arguments

reads: Vector of FASTA/FASTQ records or file paths
method: Assembly strategy (StringGraph, KmerGraph, HybridOLC, MultiK)
config: Assembly configuration parameters

Returns

AssemblyResult: Structure containing contigs, names, and assembly metadata

Details

This is the main entry point for the unified assembly pipeline, leveraging:

Phase 1: MetaGraphsNext strand-aware graph construction
Phase 2: Probabilistic algorithms, enhanced Viterbi, and graph algorithms
Phase 3: Integrated workflow with polishing and validation

Examples

# Basic assembly with default parameters
reads = load_fastq_records("reads.fastq")
result = assemble_genome(reads)

# Custom assembly with specific k-mer size and error rate
config = AssemblyConfig(k=25, error_rate=0.005, polish_iterations=5)
result = assemble_genome(reads; method=KmerGraph, config=config)

# Access results
contigs = result.contigs
stats = result.assembly_stats

Mycelia.assemble_strings — Method

Assemble strings by performing graph walks from each connected component.

Mycelia.assembly_summary — Method

Generate a summary report of the assembly process.

Mycelia.assess_alignment — Method

assess_alignment(
    a,
    b
) -> @NamedTuple{total_matches::Int64, total_edits::Int64}

Aligns two sequences using the Levenshtein distance and returns the total number of matches and edits.

Arguments

a::AbstractString: The first sequence to be aligned.
b::AbstractString: The second sequence to be aligned.

Returns

NamedTuple{(:total_matches, :total_edits), Tuple{Int, Int}}: A named tuple containing:
- total_matches::Int: The total number of matching bases in the alignment.
- total_edits::Int: The total number of edits (insertions, deletions, substitutions) in the alignment.
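
One way such counts could be derived is a Levenshtein dynamic program that tracks matches alongside edits. The sketch below is an assumption about the general technique, not Mycelia's implementation; among alignments with the minimum number of edits it prefers the one with the most matches.

```julia
# Hypothetical helper: Levenshtein DP where each cell stores
# (edits, matches) for the best alignment of the two prefixes.
function edit_alignment_counts(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    m, n = length(x), length(y)
    dp = Matrix{Tuple{Int, Int}}(undef, m + 1, n + 1)
    for i in 0:m
        dp[i + 1, 1] = (i, 0)  # delete i characters
    end
    for j in 0:n
        dp[1, j + 1] = (j, 0)  # insert j characters
    end
    # Fewer edits wins; ties are broken by more matches.
    better(p, q) = (p[1], -p[2]) <= (q[1], -q[2]) ? p : q
    for i in 1:m, j in 1:n
        diag = dp[i, j]
        step = x[i] == y[j] ? (diag[1], diag[2] + 1) : (diag[1] + 1, diag[2])
        del = (dp[i, j + 1][1] + 1, dp[i, j + 1][2])
        ins = (dp[i + 1, j][1] + 1, dp[i + 1, j][2])
        dp[i + 1, j + 1] = better(step, better(del, ins))
    end
    edits, matches = dp[m + 1, n + 1]
    return (total_matches = matches, total_edits = edits)
end

counts = edit_alignment_counts("ACGT", "ACGA")  # one substitution, three matches
```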

Mycelia.assess_alignment_accuracy — Method

assess_alignment_accuracy(alignment_result) -> Any

Calculate the accuracy of a sequence alignment as the ratio of matched bases to total alignment operations (matches + edits).

Arguments

alignment_result: Alignment result object containing total_matches and total_edits fields

Returns

Float64 between 0.0 and 1.0 representing alignment accuracy, where:

1.0 indicates perfect alignment (all matches)
0.0 indicates no matches
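
The ratio itself is simple to express. In this sketch the result object is a NamedTuple with the two documented fields, and returning 0.0 for an empty alignment is an assumption rather than documented behavior; the function name is hypothetical.

```julia
# matches / (matches + edits), with an assumed 0.0 for empty alignments.
function alignment_accuracy(result)
    total = result.total_matches + result.total_edits
    return total == 0 ? 0.0 : result.total_matches / total
end

acc = alignment_accuracy((total_matches = 3, total_edits = 1))
```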

Mycelia.assess_assembly_kmer_quality — Method

assess_assembly_kmer_quality(; assembly, observations, ks)

Evaluate genome assembly quality by comparing k-mer distributions between assembled sequences and raw observations.

Arguments

assembly: Input assembled sequences to evaluate
observations: Raw sequencing data for comparison
ks::Vector{Int}: Vector of k-mer sizes to analyze (default: k=17 to 23)

Returns

DataFrame containing quality metrics for each k-mer size:

k: K-mer length used
cosine_distance: Cosine distance between k-mer distributions
js_divergence: Jensen-Shannon divergence between distributions
qv: Merqury-style quality value (QV) score
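
To make the two distribution metrics concrete, here is a minimal sketch that works on plain count vectors indexed by the same k-mers. The function names are hypothetical and the base-2 logarithm is an assumption; the package may compute these quantities differently.

```julia
import LinearAlgebra

# Cosine distance between two count vectors indexed by the same k-mers.
cosine_distance(a, b) =
    1 - LinearAlgebra.dot(a, b) / (LinearAlgebra.norm(a) * LinearAlgebra.norm(b))

# Kullback-Leibler divergence in bits; terms with x == 0 contribute nothing.
kl_divergence(p, q) = sum(x > 0 ? x * log2(x / y) : 0.0 for (x, y) in zip(p, q))

# Jensen-Shannon divergence between two count vectors (normalized internally).
function js_divergence(a, b)
    p = a ./ sum(a)
    q = b ./ sum(b)
    m = (p .+ q) ./ 2
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2
end

identical = js_divergence([4, 0, 2, 1], [4, 0, 2, 1])
disjoint = js_divergence([1, 0], [0, 1])
```

Identical distributions give a divergence of 0, while distributions with no k-mers in common give the maximum of 1 bit.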

Mycelia.assess_assembly_quality — Method

assess_assembly_quality(contigs_file)

Assess basic assembly quality metrics from a FASTA file.

Calculates standard assembly quality metrics, including contig count, total length, and the N50 and L50 statistics, for assembly evaluation.

Arguments

contigs_file: Path to FASTA file containing assembly contigs

Returns

Tuple of (n_contigs, total_length, n50, l50)
- n_contigs: Number of contigs in the assembly
- total_length: Total length of all contigs in base pairs
- n50: N50 statistic (length of the shortest contig in the smallest set of longest contigs that covers at least 50% of the assembly)
- l50: L50 statistic (number of contigs needed to reach 50% of assembly length)

Example

n_contigs, total_length, n50, l50 = assess_assembly_quality("assembly.fasta")
println("Assembly has $n_contigs contigs, $total_length bp total, N50=$n50, L50=$l50")

See Also

assess_assembly_kmer_quality: For k-mer based assembly quality assessment
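
The N50 and L50 definitions can be illustrated on a bare vector of contig lengths. `n50_l50` is a hypothetical helper written for this sketch; it does not read FASTA files and is not the package's code.

```julia
# Hypothetical helper: N50 and L50 from a vector of contig lengths.
function n50_l50(lengths::AbstractVector{<:Integer})
    sorted = sort(lengths; rev = true)      # longest contigs first
    half = sum(sorted) / 2
    running = 0
    for (i, len) in enumerate(sorted)
        running += len
        if running >= half
            # `len` is the shortest contig in the covering set; `i` is its size.
            return (n50 = len, l50 = i)
        end
    end
    return (n50 = 0, l50 = 0)               # empty assembly
end

stats = n50_l50([100, 80, 50, 30, 20, 10, 10])
```

The lengths sum to 300, so the two longest contigs (100 + 80) are the first to reach half of the assembly, giving N50 = 80 and L50 = 2.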

Mycelia.assess_dnamer_saturation — Method

assess_dnamer_saturation(
    fastx::AbstractString;
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess
)

Analyzes k-mer saturation in a FASTA/FASTQ file to determine optimal k-mer size.

Arguments

fastx::AbstractString: Path to input FASTA/FASTQ file
power::Int=10: Exponent for downsampling k-mers (2^power)
outdir::String="": Output directory for results. Uses current directory if empty
min_k::Int=3: Minimum k-mer size to evaluate
max_k::Int=17: Maximum k-mer size to evaluate
threshold::Float64=0.1: Saturation threshold for k-mer assessment
kmers_to_assess::Int=10_000_000: Maximum number of k-mers to sample

Returns

Dict{Int,Float64}: Dictionary mapping k-mer sizes to their saturation scores

Mycelia.assess_dnamer_saturation — Method

assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString},
    kmer_type;
    kmers_to_assess,
    power,
    min_count
) -> Union{@NamedTuple{sampling_points::Vector{Int64}, unique_kmer_counts::Vector{Int64}}, NamedTuple{(:sampling_points, :unique_kmer_counts, :eof), <:Tuple{Vector, Vector{Int64}, Bool}}}

Assess k-mer saturation in DNA sequences from FASTX files.

Arguments

fastxs::AbstractVector{<:AbstractString}: Vector of paths to FASTA/FASTQ files
kmer_type: Type of k-mer to analyze (e.g., DNAKmer{21})
kmers_to_assess=Inf: Maximum number of k-mers to process
power=10: Base for exponential sampling intervals
min_count=1: Minimum count threshold for considering a k-mer

Returns

Named tuple containing:

sampling_points::Vector{Int}: K-mer counts at which samples were taken
unique_kmer_counts::Vector{Int}: Number of unique canonical k-mers at each sampling point
eof::Bool: Whether the entire input was processed

Details

Analyzes k-mer saturation by counting unique canonical k-mers at exponentially spaced intervals (powers of power). Useful for assessing sequence complexity and coverage. Returns early if all possible k-mers are observed.
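
A rough sketch of exponentially spaced sampling follows. It is a simplification under stated assumptions: it reads one in-memory string rather than FASTX files, counts plain rather than canonical k-mers, and `saturation_curve` is a hypothetical name.

```julia
# Hypothetical sketch: count distinct k-mers at checkpoints 1, power, power^2, ...
function saturation_curve(seq::String, k::Int; power::Int = 10)
    seen = Set{String}()
    sampling_points = Int[]
    unique_kmer_counts = Int[]
    next_checkpoint = 1                      # power^0
    for i in 1:(length(seq) - k + 1)
        push!(seen, seq[i:i+k-1])
        if i == next_checkpoint
            push!(sampling_points, i)
            push!(unique_kmer_counts, length(seen))
            next_checkpoint *= power
        end
    end
    return (sampling_points = sampling_points, unique_kmer_counts = unique_kmer_counts)
end

curve = saturation_curve("ACGT"^30, 3; power = 10)
```

For this repetitive input the unique count plateaus at 4 by the second checkpoint, which is the flattening that a saturation analysis looks for.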

Mycelia.assess_dnamer_saturation — Method

assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString};
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess,
    plot
)

Analyze k-mer saturation in DNA sequences to determine optimal k value.

Arguments

fastxs: Vector of paths to FASTA/FASTQ files to analyze
power: Base of logarithmic sampling points (default: 10)
outdir: Optional output directory for plots and results
min_k: Minimum k-mer size to test (default: 7)
max_k: Maximum k-mer size to test (default: 17)
threshold: Saturation threshold to determine optimal k (default: 0.1)
kmers_to_assess: Maximum number of k-mers to sample (default: 10M)
plot: Whether to generate saturation curves (default: true)

Returns

Integer representing the first k value that achieves saturation below threshold. If no k value meets the threshold, returns the k with minimum saturation.

Details

Tests only prime k values between min_k and max_k
Generates saturation curves using logarithmic sampling
Fits curves to estimate maximum unique k-mers
If outdir is provided, saves plots as SVG and chosen k value to text file
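
As a small illustration of the first point, the candidate k values can be enumerated with a trial-division primality test. Both helper names are hypothetical, and the bounds are the documented defaults.

```julia
# Trial-division primality test; adequate for the small k values used here.
is_prime(n::Integer) = n > 1 && all(n % d != 0 for d in 2:isqrt(n))

# Candidate k values between the lower and upper bounds, inclusive.
candidate_ks(min_k, max_k) = filter(is_prime, min_k:max_k)

ks = candidate_ks(7, 17)
```

With the default bounds this yields 7, 11, 13, and 17. Every prime above 2 is odd, and an odd-length DNA k-mer can never equal its own reverse complement.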

Mycelia.assess_duplication_rates — Method

assess_duplication_rates(fastq; results_table) -> Any

Analyze sequence duplication rates in a FASTQ file.

This function processes a FASTQ file to quantify both exact sequence duplications and canonical duplications (considering sequences and their reverse complements as equivalent). The function makes two passes through the file: first to count total records, then to analyze unique sequences.

Arguments

fastq::String: Path to the input FASTQ file to analyze
results_table::String: Optional. Path where the results will be saved as a tab-separated file. Defaults to the same path as the input file but with extension changed to ".duplication_rates.tsv"

Returns

String: Path to the results table file

Output

Generates a tab-separated file containing the following metrics:

total_records: Total number of sequence records in the file
total_unique_observations: Count of unique sequence strings
total_unique_canonical_observations: Count of unique canonical sequences (after normalizing for reverse complements)
percent_unique_observations: Percentage of sequences that are unique
percent_unique_canonical_observations: Percentage of sequences that are unique after canonicalization
percent_duplication_rate: Percentage of sequences that are duplicates (100 - percent_unique_observations)
percent_canonical_duplication_rate: Percentage of sequences that are duplicates after canonicalization

Notes

If the specified results file already exists and is not empty, the function will return early without recomputing.
Progress is displayed during processing with a progress bar showing speed.

Example

# Analyze a FASTQ file and save results to default location
result_path = assess_duplication_rates("data/sample.fastq")

# Specify custom output path
result_path = assess_duplication_rates("data/sample.fastq", results_table="results/duplication_analysis.tsv")
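
The relationship between the exact and canonical metrics can be sketched on an in-memory vector of sequences. The helpers below are hypothetical, handle only uppercase A/C/G/T/N, and skip the file I/O and progress reporting described above.

```julia
# Complement of a single uppercase DNA base; anything else maps to 'N'.
complement_base(b::Char) =
    b == 'A' ? 'T' : b == 'C' ? 'G' : b == 'G' ? 'C' : b == 'T' ? 'A' : 'N'

reverse_complement(seq::String) = join(complement_base(b) for b in reverse(seq))

# Canonical form: the lexicographically smaller of a sequence and its reverse complement.
canonical_form(seq::String) = min(seq, reverse_complement(seq))

function duplication_summary(seqs::Vector{String})
    total = length(seqs)
    pct_unique = 100 * length(unique(seqs)) / total
    pct_unique_canonical = 100 * length(unique(canonical_form.(seqs))) / total
    return (
        percent_unique_observations = pct_unique,
        percent_duplication_rate = 100 - pct_unique,
        percent_unique_canonical_observations = pct_unique_canonical,
        percent_canonical_duplication_rate = 100 - pct_unique_canonical,
    )
end

report = duplication_summary(["AAAC", "GTTT", "AAAC", "CCGG"])
```

"GTTT" is the reverse complement of "AAAC", so it counts as a new sequence in the exact metrics (75% unique) but as a duplicate after canonicalization (50% unique).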

Mycelia.assess_optimal_kmer_alignment — Method

assess_optimal_kmer_alignment(
    kmer,
    observed_kmer
) -> Tuple{@NamedTuple{total_matches::Int64, total_edits::Int64}, Union{Missing, Bool}}

Determine which orientation provides the optimal alignment for initiating path likelihood calculations in Viterbi analysis.

Compare alignment scores between a query k-mer and an observed k-mer in both forward and reverse complement orientations to determine optimal alignment.

Arguments

kmer: Query k-mer sequence to align
observed_kmer: Target k-mer sequence to align against

Returns

A tuple containing:

alignment_result: The alignment result object for the optimal orientation
orientation: Boolean indicating orientation (true = forward, false = reverse complement, missing = tied scores)

Details

Performs pairwise alignment in both orientations using assess_alignment()
Calculates accuracy scores using assess_alignment_accuracy()
For tied alignment scores, randomly selects one of the two alignment results and reports the orientation as missing
Uses BioSequences.reverse_complement for reverse orientation comparison

Mycelia.attempt_error_correction — Method

Attempt to correct a specific k-mer using probabilistic path finding. Returns true if correction was applied.

Mycelia.bam_to_fastq — Method

bam_to_fastq(; bam, fastq)

Convert a BAM file to FASTQ format with gzip compression.

Arguments

bam: Path to input BAM file
fastq: Optional output path. Defaults to input path with ".fq.gz" extension

Returns

Path to the generated FASTQ file

Details

Uses samtools through conda environment
Automatically skips if output file exists
Output is gzip compressed
Requires samtools to be available via conda

Mycelia.bandage_visualize — Method

bandage_visualize(; gfa, img)

Generate a visualization of a genome assembly graph using Bandage.

Arguments

gfa: Path to input GFA (Graphical Fragment Assembly) file
img: Optional output image path. Defaults to GFA filename with .png extension

Returns

Path to the generated image file

Mycelia.batch_download_viroid_references — Method

batch_download_viroid_references(
    species_list::Vector{String};
    base_outdir,
    download_genome,
    download_cds,
    download_protein,
    max_per_species
) -> Dict{String, Any}

Download reference data for multiple viroid species in batch.

Arguments

species_list::Vector{String}: List of viroid species to download
base_outdir::String: Base output directory (subdirectories created per species)
download_genome::Bool: Download genomic sequences (default: true)
download_cds::Bool: Download CDS sequences (default: true)
download_protein::Bool: Download protein sequences (default: true)
max_per_species::Int: Maximum sequences per species per type (default: 5)

Returns

Dict{String, NamedTuple}: Dictionary mapping species names to their downloaded file info

Examples

# Download data for all known viroid species
species_list = get_viroid_species_list()
results = batch_download_viroid_references(species_list; base_outdir="viroid_database/")

# Download only genomes for a subset
pstv_like = ["Potato spindle tuber viroid", "Tomato planta macho viroid"]
results = batch_download_viroid_references(pstv_like;
                                           base_outdir="pstv_group/",
                                           download_cds=false, download_protein=false)

Mycelia.benchmark_graph_construction — Function

benchmark_graph_construction()
benchmark_graph_construction(
    config::Mycelia.BenchmarkConfig
)

Benchmark graph construction performance: Legacy vs Next-generation.

Compares:

MetaGraphs.jl (legacy) vs MetaGraphsNext.jl (next-gen)
Memory allocation patterns
Construction time
Type stability

Arguments

config: BenchmarkConfig for test parameters

Returns

NamedTuple with benchmark results

Mycelia.benchmark_memory_patterns — Function

benchmark_memory_patterns()
benchmark_memory_patterns(config::Mycelia.BenchmarkConfig)

Benchmark memory usage patterns for different graph representations.

Compares memory usage of:

Stranded vertices (legacy) vs canonical vertices (next-gen)
Edge metadata structures
Coverage tracking efficiency

Arguments

config: BenchmarkConfig for test parameters

Returns

NamedTuple with memory analysis results

Mycelia.benchmark_type_stability — Function

benchmark_type_stability()
benchmark_type_stability(config::Mycelia.BenchmarkConfig)

Benchmark type stability and allocation patterns.

Measures:

Type inference success
Runtime allocations
Performance predictability

Arguments

config: BenchmarkConfig for test parameters

Returns

NamedTuple with type stability metrics

Mycelia.bernoulli_pca_epca — Method

bernoulli_pca_epca(M::AbstractMatrix{Bool}; k::Int=0)

Perform Bernoulli (logistic) EPCA on a 0/1 matrix M (features × samples).

When to use

Use for binary (0/1) data, such as presence/absence or yes/no features.

Returns

A NamedTuple with

model : the fitted ExpFamilyPCA.BernoulliEPCA object
scores : k×n_samples matrix of low‐dimensional sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.best_label_mapping — Method

best_label_mapping(true_labels, pred_labels)

Finds the optimal mapping from predicted labels to true labels using the Hungarian algorithm, so that the total overlap (confusion matrix diagonal) is maximized. Returns the remapped predicted labels and the mapping as a Dict.

Mycelia.binary_matrix_to_jaccard_distance_matrix — Method

binary_matrix_to_jaccard_distance_matrix(binary_matrix::Union{BitMatrix, Matrix{Bool}})

Pairwise Jaccard distance between columns of a binary matrix (BitMatrix or Matrix{Bool}). Throws an error if the input is not strictly a binary matrix.
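
For reference, the Jaccard distance between two binary columns is 1 minus the size of their intersection divided by the size of their union. A minimal sketch, with a hypothetical function name and an assumed convention for all-false columns:

```julia
# Pairwise Jaccard distance between the columns of a binary matrix.
function jaccard_distance_matrix(M::AbstractMatrix{Bool})
    n = size(M, 2)
    D = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        a, b = view(M, :, i), view(M, :, j)
        union_size = count(a .| b)
        intersection_size = count(a .& b)
        # Assumption: two all-false columns are treated as identical (distance 0).
        d = union_size == 0 ? 0.0 : 1 - intersection_size / union_size
        D[i, j] = D[j, i] = d
    end
    return D
end

M = Bool[1 1 0; 1 0 0; 0 1 1; 0 0 1]
D = jaccard_distance_matrix(M)
```

Columns 1 and 3 share no features, so their distance is 1.0; columns 1 and 2 share one of three features, giving 2/3.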

Mycelia.binomial_pca_epca — Method

binomial_pca_epca(M::AbstractMatrix{<:Integer}; k::Int=0, ntrials::Int=1)

Perform Binomial EPCA on a count matrix M (features × samples).

When to use

Use for integer count data representing the number of successes out of a fixed number of trials (e.g., number of mutated alleles out of total alleles).

Keyword arguments

k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)
ntrials : number of trials for the Binomial distribution (default=1)

Returns

NamedTuple with fields

model : the fitted ExpFamilyPCA.BinomialEPCA object
scores : k×n_samples matrix of sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.biosequences_to_counts_table — Method

biosequences_to_counts_table(; biosequences, k)

Convert a collection of biological sequences into a k-mer count matrix.

Arguments

biosequences: Vector of biological sequences (DNA, RNA, or Amino Acids)
k: Length of k-mers to count

Returns

Named tuple with:

sorted_kmers: Vector of all unique k-mers found, lexicographically sorted
kmer_counts_matrix: Sparse matrix where rows are k-mers and columns are sequences

Details

For DNA sequences, counts canonical k-mers (both strands)
Uses parallel processing with thread-safe progress tracking
Memory efficient sparse matrix representation
Supports DNA, RNA and Amino Acid sequences
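
The shape of the returned table can be sketched with the SparseArrays standard library. This simplified version takes plain strings, counts k-mers exactly as written (no canonicalization), and runs on a single thread; `kmer_counts_table` is a hypothetical name.

```julia
import SparseArrays

# Hypothetical sketch: rows are sorted k-mers, columns are input sequences.
function kmer_counts_table(seqs::Vector{String}, k::Int)
    per_seq = [Dict{String,Int}() for _ in seqs]
    for (j, seq) in enumerate(seqs), i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]
        per_seq[j][kmer] = get(per_seq[j], kmer, 0) + 1
    end
    all_kmers = Set{String}()
    for d in per_seq
        union!(all_kmers, keys(d))
    end
    sorted_kmers = sort!(collect(all_kmers))
    row_of = Dict(kmer => r for (r, kmer) in enumerate(sorted_kmers))
    counts = SparseArrays.spzeros(Int, length(sorted_kmers), length(seqs))
    for (j, d) in enumerate(per_seq), (kmer, n) in d
        counts[row_of[kmer], j] = n
    end
    return (sorted_kmers = sorted_kmers, kmer_counts_matrix = counts)
end

table = kmer_counts_table(["ACAC", "CGAC"], 2)
```

Only observed k-mers get a row, which is what keeps the sparse representation small compared with a dense table over all possible k-mers.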

Mycelia.biosequences_to_dense_counts_table — Method

biosequences_to_dense_counts_table(; biosequences, k)

Convert a collection of biological sequences into a dense k-mer count matrix.

Arguments

biosequences: Collection of DNA, RNA, or amino acid sequences (BioSequence types)
k::Integer: Length of k-mers to count (must be ≤ 13)

Returns

Named tuple containing:

sorted_kmers: Vector of all possible k-mers in sorted order
kmer_counts_matrix: Dense matrix where rows are k-mers and columns are sequences

Details

For DNA sequences, counts canonical k-mers (both strands)
For RNA and protein sequences, counts exact k-mers
Uses parallel processing with threads

Mycelia.blastdb2table — Method

blastdb2table(
;
    blastdb,
    ALL_FIELDS,
    sequence_sha256,
    sequence_hash,
    sequence_id,
    accession,
    gi,
    sequence_title,
    blast_name,
    taxid,
    taxonomic_super_kingdom,
    scientific_name,
    scientific_names_leaf_nodes,
    common_taxonomic_name,
    common_names_leaf_nodes,
    leaf_node_taxids,
    membership_integer,
    ordinal_id,
    pig,
    sequence_length,
    sequence
)

Convert a BLAST database to an in-memory table with sequence and taxonomy information.

Arguments

blastdb::String: Path to the BLAST database
outfile::String="": Optional output file path. If provided, results will be saved to this file
force::Bool=false: Whether to overwrite existing output file
ALL_FIELDS::Bool=true: If true, include all fields regardless of other flag settings
Field selection flags (default to false unless ALL_FIELDS is true):
- sequence_sha256::Bool: Include SHA256 hash of the sequence
- sequence_hash::Bool: Include sequence hash
- sequence_id::Bool: Include sequence ID
- accession::Bool: Include accession number
- gi::Bool: Include GI number
- sequence_title::Bool: Include sequence title
- blast_name::Bool: Include BLAST name
- taxid::Bool: Include taxid
- taxonomic_super_kingdom::Bool: Include taxonomic super kingdom
- scientific_name::Bool: Include scientific name
- scientific_names_leaf_nodes::Bool: Include scientific names for leaf-node taxids
- common_taxonomic_name::Bool: Include common taxonomic name
- common_names_leaf_nodes::Bool: Include common taxonomic names for leaf-node taxids
- leaf_node_taxids::Bool: Include leaf-node taxids
- membership_integer::Bool: Include membership integer
- ordinal_id::Bool: Include ordinal ID
- pig::Bool: Include PIG
- sequence_length::Bool: Include sequence length
- sequence::Bool: Include the full sequence

Returns

DataFrame: DataFrame containing the requested columns from the BLAST database

Mycelia.blastdb_to_fasta — Method

blastdb_to_fasta(
;
    blastdb,
    entries,
    taxids,
    outfile,
    force,
    max_cores
)

Convert a BLAST database to FASTA format.

Arguments

blastdb::String: Name of the BLAST database to convert (e.g. "nr", "nt")
dbdir::String: Directory containing the BLAST database files
outfile::String: Path for the output FASTA file

Returns

Path to the generated FASTA file as String

Mycelia.bray_curtis_distance — Method

Compute the Bray-Curtis distance between columns of a count matrix.

Arguments

M::AbstractMatrix{<:Integer}: Count matrix where rows are features and columns are samples

Returns

Matrix{Float64}: Symmetric distance matrix with Bray-Curtis distances
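
The Bray-Curtis distance between two samples is the sum of absolute count differences divided by the sum of all counts in both samples. A minimal sketch with hypothetical function names and an assumed convention for empty samples:

```julia
# Bray-Curtis distance between two count vectors: sum(|a - b|) / sum(a + b).
function bray_curtis(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    total = sum(a) + sum(b)
    # Assumption: two empty samples are treated as identical (distance 0).
    return total == 0 ? 0.0 : sum(abs.(a .- b)) / total
end

# Symmetric distance matrix over the columns (samples) of a count matrix.
function bray_curtis_matrix(M::AbstractMatrix{<:Integer})
    n = size(M, 2)
    return [bray_curtis(view(M, :, i), view(M, :, j)) for i in 1:n, j in 1:n]
end

M = [6 10; 7 0; 4 6]
D = bray_curtis_matrix(M)
```

For these two samples the absolute differences sum to 13 and the counts sum to 33, so the off-diagonal entries are 13/33.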

Mycelia.breadth_first_sample — Method

Breadth-first sampling: sample at least one from each group, then sample remaining proportionally to group frequencies

Mycelia.breadth_first_sample_dataframe — Method

Apply breadth-first sampling to a DataFrame

Mycelia.build_biosequence_graph — Method

build_biosequence_graph(
    fasta_records::Vector{FASTX.FASTA.Record};
    graph_mode,
    min_overlap
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#267#268", Float64} where {_A, _B, _C}

Build a BioSequence graph directly from FASTA records.

This creates a variable-length BioSequence graph where vertices are the input sequences and edges represent relationships between sequences (e.g., overlaps, containment).

Arguments

fasta_records: Vector of FASTA records
graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
min_overlap: Minimum overlap length for creating edges (default: 100)

Returns

MetaGraphsNext.MetaGraph with BioSequence vertices and overlap edges

Example

records = [FASTX.FASTA.Record("seq1", "ATCGATCGATCG"), 
           FASTX.FASTA.Record("seq2", "CGATCGATCGAA")]
graph = build_biosequence_graph(records)

Mycelia.build_directed_kmer_graph — Method

build_directed_kmer_graph(; fastq, k, plot)

Constructs a directed graph representation of k-mer transitions from FASTQ sequencing data.

Arguments

fastq: Path to input FASTQ file
k: K-mer size (default: 1). Must be odd and prime. If k=1, optimal size is auto-determined
plot: Boolean to display quality distribution plot (default: false)

Returns

MetaDiGraph with properties:

assembly_k: k-mer size used
kmer_counts: frequency of each k-mer
transition_likelihoods: edge weights between k-mers
kmer_mean_quality, kmer_total_quality: quality metrics
branching_nodes, unbranching_nodes: topological classification
likely_valid_kmer_indices: k-mers above mean quality threshold
likely_sequencing_artifact_indices: potential erroneous k-mers

Note

For DNA assembly, quality scores are normalized across both strands.

Mycelia.build_genome_distance_matrix — Method

build_genome_distance_matrix(genome_files::Vector{String}; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)

Build a distance matrix between all genome pairs using existing distance metrics.

Creates a comprehensive pairwise distance matrix using established k-mer distance functions, suitable for phylogenetic analysis and clustering.

Arguments

genome_files: Vector of genome FASTA file paths
kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
metric: Distance metric (:js_divergence, :cosine, :jaccard)

Returns

Named tuple with distance matrix and genome names

Example

genomes = ["genome1.fasta", "genome2.fasta", "genome3.fasta"]
result = Mycelia.build_genome_distance_matrix(genomes, kmer_type=Kmers.DNAKmer{31})
println("Distance matrix: $(result.distance_matrix)")

Mycelia.build_kmer_graph_next — Method

build_kmer_graph_next(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}};
    graph_mode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Create a next-generation, type-stable k-mer graph using MetaGraphsNext.

This implementation uses canonical k-mers as vertices with strand-aware edges that respect biological transition constraints for both single-strand and double-strand sequences.

Arguments

kmer_type: Type of k-mer (e.g., DNAKmer{K})
observations: Vector of FASTA/FASTQ records
graph_mode: SingleStrand for directional sequences, DoubleStrand for DNA (default)

Returns

MetaGraphsNext.MetaGraph with canonical vertices and strand-aware edges

Details

Vertices: Always canonical k-mers (lexicographically smaller of kmer/reverse_complement)
Edges: Strand-aware transitions that respect biological constraints
SingleStrand mode: Only forward-strand transitions allowed
DoubleStrand mode: Both forward and reverse-complement transitions allowed

Mycelia.build_quality_biosequence_graph — Method

build_quality_biosequence_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    graph_mode,
    min_overlap,
    min_quality
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#277#278", Float64} where {_A, _B, _C}

Build a quality-aware BioSequence graph directly from FASTQ records.

This creates a variable-length quality-aware BioSequence graph where vertices are the input sequences with their quality scores and edges represent quality-weighted relationships between sequences.

Arguments

fastq_records: Vector of FASTQ records with quality scores
graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
min_overlap: Minimum overlap length for creating edges (default: 100)
min_quality: Minimum mean quality to include sequence (default: 20)

Returns

MetaGraphsNext.MetaGraph with quality-aware BioSequence vertices

Example

records = [FASTX.FASTQ.Record("seq1", "ATCGATCGATCG", "IIIIIIIIIIII")]
graph = build_quality_biosequence_graph(records)

Mycelia.build_qualmer_graph — Method

build_qualmer_graph(
    fastq_records::Vector{FASTX.FASTQ.Record};
    k,
    graph_mode,
    min_quality,
    min_coverage
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Build a quality-aware k-mer graph from FASTQ records using existing Qualmer functionality. This function leverages the existing qualmer extraction functions and adds graph construction.

Arguments

fastq_records: Vector of FASTQ records with quality scores
k: K-mer size
graph_mode: SingleStrand or DoubleStrand mode (default: DoubleStrand)
min_quality: Minimum average PHRED quality to include k-mer (default: 10)
min_coverage: Minimum coverage (observations) to include k-mer (default: 1)

Returns

MetaGraphsNext.MetaGraph with QualmerVertexData and QualmerEdgeData

Mycelia.build_stranded_kmer_graph — Method

build_stranded_kmer_graph(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}}
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Create a weighted, strand-specific k-mer (de Bruijn) graph from a k-mer type and a series of sequence observations (FASTA or FASTQ records).

Mycelia.calculate_accuracy_metrics — Method

Calculate assembly accuracy metrics for reward function. Returns a comprehensive score based on multiple quality indicators.

Mycelia.calculate_assembly_mapping_metrics — Method

Calculate assembly mapping metrics for validation reads.

Mycelia.calculate_assembly_quality_metrics — Method

calculate_assembly_quality_metrics(
    qualmer_graph;
    low_confidence_threshold
) -> NamedTuple{(:mean_coverage, :mean_quality, :mean_confidence, :low_confidence_fraction, :total_kmers), <:NTuple{5, Any}}

Calculate comprehensive assembly quality metrics for a qualmer graph.

Arguments

qualmer_graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
low_confidence_threshold::Float64=0.95: Threshold for identifying low-confidence k-mers

Returns

NamedTuple: Assembly quality metrics including coverage, quality, and confidence statistics

Details

Calculates mean values for coverage, quality scores, and joint probabilities. Identifies fraction of k-mers below confidence threshold as potential error indicators.

Mycelia.calculate_assembly_reward — Function

Calculate reward for current k-mer processing iteration. Higher rewards indicate better assembly quality progress.

Mycelia.calculate_bubble_complexity — Method

Calculate complexity score for a bubble.

Mycelia.calculate_degrees — Method

Calculate in-degrees and out-degrees for all vertices.

Mycelia.calculate_emission_probability — Method

calculate_emission_probability(state::ViterbiState, observation::String, config::ViterbiConfig) -> Float64

Calculate emission probability for a state given an observation.

Mycelia.calculate_gc_content — Method

calculate_gc_content(sequence::AbstractString) -> Float64

Calculate GC content from a string sequence.

Convenience function that accepts string input and converts to appropriate BioSequence. Automatically detects DNA/RNA based on presence of T/U.

Arguments

sequence::AbstractString: Input DNA or RNA sequence as string

Returns

Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from string
gc_percent = calculate_gc_content("ATCGATCGATCG")

Mycelia.calculate_gc_content — Method

calculate_gc_content(
    sequence::BioSequences.LongSequence
) -> Float64

Calculate GC content (percentage of G and C bases) from a BioSequence.

This function calculates the percentage of guanine (G) and cytosine (C) bases in a nucleotide sequence. Works with both DNA and RNA sequences.

Arguments

sequence::BioSequences.LongSequence: Input DNA or RNA sequence

Returns

Float64: GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content for DNA
dna_seq = BioSequences.LongDNA{4}("ATCGATCGATCG")
gc_percent = calculate_gc_content(dna_seq)

# Calculate GC content for RNA
rna_seq = BioSequences.LongRNA{4}("AUCGAUCGAUCG") 
gc_percent = calculate_gc_content(rna_seq)

Mycelia.calculate_gc_content — Method

calculate_gc_content(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Float64

Calculate GC content from FASTA/FASTQ records.

Processes multiple records and calculates overall GC content across all sequences.

Arguments

records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input records

Returns

Float64: Overall GC content as a percentage (0.0-100.0)

Examples

# Calculate GC content from FASTA records
records = collect(FASTX.FASTA.Reader(open("sequences.fasta")))
gc_percent = calculate_gc_content(records)
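
One reading of "overall GC content" is a pooled percentage: total G and C bases divided by total bases across all sequences, rather than the mean of per-sequence percentages. The sketch below assumes that reading, uses plain strings instead of FASTX records, and both helper names are hypothetical.

```julia
# Number of G/C bases in one sequence (case-insensitive).
gc_count(seq::AbstractString) = count(c -> c in ('G', 'C', 'g', 'c'), seq)

# Pooled GC percentage (0.0 to 100.0) across several sequences.
function overall_gc_percent(seqs)
    total_length = sum(length, seqs; init = 0)
    # Assumption: empty input scores 0.0 rather than raising an error.
    total_length == 0 && return 0.0
    return 100 * sum(gc_count, seqs; init = 0) / total_length
end

pooled = overall_gc_percent(["GGGG", "ATATATAT"])
```

Pooling weights each sequence by its length: here 4 of 12 bases are G or C (about 33.3%), whereas averaging the per-sequence values (100% and 0%) would give 50%.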

Mycelia.calculate_path_support — Method

Calculate support for a path (simplified version).

Mycelia.calculate_qualmer_likelihood — Method

Calculate likelihood of a k-mer given its observed quality scores and qualmer vertex data. Leverages existing qualmer graph quality calculations.

Mycelia.calculate_read_likelihood — Method

Calculate likelihood of a FASTQ read given the current graph.

Mycelia.calculate_recommendation_confidence — Method

Calculate confidence in recommendation based on variance across folds.

Mycelia.calculate_repeat_confidence — Method

Calculate confidence in repeat identification.

Mycelia.calculate_sequence_likelihood — Method

Calculate quality-aware likelihood of a sequence given the qualmer graph. Uses both k-mer presence and quality-based confidence from qualmer observations.

Mycelia.calculate_sparsity — Method

Calculate k-mer sparsity for a given k-mer size. Returns fraction of possible k-mers that are NOT observed. Higher sparsity indicates errors become singletons.
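
As a rough illustration, sparsity over the DNA alphabet can be computed as one minus the observed fraction of the 4^k possible k-mers. The sketch assumes plain uppercase strings and non-canonical k-mers; `kmer_sparsity` is a hypothetical name.

```julia
# Fraction of the 4^k possible DNA k-mers that do NOT occur in the input.
function kmer_sparsity(seqs::Vector{String}, k::Int)
    observed = Set{String}()
    for seq in seqs, i in 1:(length(seq) - k + 1)
        push!(observed, seq[i:i+k-1])
    end
    return 1 - length(observed) / 4^k
end

sparsity = kmer_sparsity(["ACGTACGT"], 2)
```

The input contains 4 of the 16 possible 2-mers, so the sparsity is 0.75. As k grows, 4^k quickly outpaces the number of k-mers in any real dataset and sparsity approaches 1.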

Mycelia.calculate_unseen_kmer_penalty — Method

Calculate penalty for unseen k-mers based on their quality scores. High quality unseen k-mers get less penalty than low quality ones.

Mycelia.canonical — Method

canonical(
    qmer::Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
) -> Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}

Mycelia.canonical — Method

canonical(
    qmer::Mycelia.Qualmer{KmerT<:Union{Kmers.DNAKmer, Kmers.RNAKmer}}
) -> Mycelia.Qualmer{KmerT} where KmerT<:Kmers.Kmer

Get the canonical representation of a DNA or RNA qualmer, considering both sequence and quality. This involves potentially reverse-complementing the k-mer and reversing the quality scores accordingly.
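
The pairing of strand flip and quality reversal can be shown with plain values. In this sketch a qualmer is represented by a DNA string and a vector of per-base scores, which is an assumption for illustration; the package's Qualmer type and the helper names below are not taken from its source.

```julia
complement_base(b::Char) =
    b == 'A' ? 'T' : b == 'C' ? 'G' : b == 'G' ? 'C' : b == 'T' ? 'A' : 'N'

reverse_complement(seq::String) = join(complement_base(b) for b in reverse(seq))

# Hypothetical stand-in for a qualmer: a k-mer string plus one quality score per base.
function canonical_qualmer(kmer::String, qualities::Vector{Int})
    rc = reverse_complement(kmer)
    # Keep the forward strand when it is already canonical.
    kmer <= rc && return (kmer = kmer, qualities = qualities)
    # Otherwise flip the strand and reverse the qualities so each score stays with its base.
    return (kmer = rc, qualities = reverse(qualities))
end

q = canonical_qualmer("TTG", [30, 20, 10])
```

The reverse complement of "TTG" is "CAA", which sorts first, so the result carries "CAA" with the scores reversed to `[10, 20, 30]`.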

Mycelia.canonicalize_kmer_counts! — Method

canonicalize_kmer_counts!(kmer_counts) -> Any

Canonicalizes the k-mer counts in the given dictionary.

This function iterates over the provided dictionary kmer_counts, which maps k-mers to their respective counts. For each k-mer that is not in its canonical form, it converts the k-mer to its canonical form and updates the count in the dictionary accordingly. If the canonical form of the k-mer already exists in the dictionary, their counts are summed. The original non-canonical k-mer is then removed from the dictionary.

Arguments

kmer_counts::Dict{BioSequences.Kmer, Int}: A dictionary where keys are k-mers and values are their counts.

Returns

The input dictionary kmer_counts with all k-mers in their canonical form, sorted by k-mers.
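
The merge step described above can be sketched on a dictionary keyed by DNA strings. This is a simplified illustration with hypothetical helper names; it uses an unordered `Dict`, so it does not reproduce the sorting mentioned in the return value.

```julia
complement_base(b::Char) =
    b == 'A' ? 'T' : b == 'C' ? 'G' : b == 'G' ? 'C' : b == 'T' ? 'A' : 'N'

reverse_complement(seq::String) = join(complement_base(b) for b in reverse(seq))

# Merge the count of each non-canonical k-mer into its canonical form, in place.
function canonicalize_counts!(counts::Dict{String,Int})
    for kmer in collect(keys(counts))        # snapshot the keys; the dict is mutated below
        canon = min(kmer, reverse_complement(kmer))
        if canon != kmer
            counts[canon] = get(counts, canon, 0) + counts[kmer]
            delete!(counts, kmer)
        end
    end
    return counts
end

counts = canonicalize_counts!(Dict("AAC" => 2, "GTT" => 3, "ACG" => 1))
```

"GTT" is the reverse complement of "AAC", so its count of 3 is folded into "AAC" (now 5) and the "GTT" entry is removed.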

Mycelia.canonicalize_kmer_counts — Method

canonicalize_kmer_counts(kmer_counts) -> Any

Normalize k-mer counts into a canonical form by creating a non-mutating copy.

Arguments

kmer_counts: Dictionary or collection of k-mer count data

Returns

A new normalized k-mer count collection

Mycelia.check_bioconda_env_is_installed — Method

check_bioconda_env_is_installed(pkg) -> Bool

Check whether a named Bioconda environment already exists.

Arguments

pkg::String: Name of the environment.

Returns

Bool indicating if the environment is present.

Mycelia.check_eulerian_conditions — Method

Check conditions for Eulerian path existence.
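
For a directed graph, the standard degree condition is that either every vertex has equal in-degree and out-degree, or exactly one vertex has one extra outgoing edge (the start) and exactly one has one extra incoming edge (the end). The sketch below checks only that degree condition on an edge list; it does not verify connectivity, and the function name is hypothetical.

```julia
# Edges are (source, target) pairs of a directed graph.
function eulerian_degrees_ok(edges::Vector{Tuple{Int,Int}})
    out_degree = Dict{Int,Int}()
    in_degree = Dict{Int,Int}()
    for (u, v) in edges
        out_degree[u] = get(out_degree, u, 0) + 1
        in_degree[v] = get(in_degree, v, 0) + 1
    end
    starts = 0   # vertices with out-degree = in-degree + 1
    ends = 0     # vertices with in-degree = out-degree + 1
    for v in union(Set(keys(out_degree)), keys(in_degree))
        diff = get(out_degree, v, 0) - get(in_degree, v, 0)
        if diff == 1
            starts += 1
        elseif diff == -1
            ends += 1
        elseif diff != 0
            return false
        end
    end
    return (starts == 0 && ends == 0) || (starts == 1 && ends == 1)
end

path_ok = eulerian_degrees_ok([(1, 2), (2, 3), (3, 4)])
star_ok = eulerian_degrees_ok([(1, 2), (1, 3), (1, 4)])
```

The simple chain passes, while the star fails because vertex 1 has three more outgoing than incoming edges.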

Mycelia.check_matrix_fits_in_memory — Method

check_matrix_fits_in_memory(bytes_needed::Integer; severity::Symbol=:warn)

Checks whether the specified number of bytes can fit in the computer's memory.

bytes_needed: The number of bytes required (output from estimate_dense_matrix_memory or estimate_sparse_matrix_memory).
severity: What to do if there is not enough available memory. Can be :warn (default) or :error.

Returns a named tuple: (will_fit_total, will_fit_available, total_memory, free_memory, bytes_needed), where:

will_fit_total: true if the matrix fits in total system memory.
will_fit_available: true if the matrix fits in currently available (free) system memory.
total_memory: Total system RAM in bytes.
free_memory: Currently available system RAM in bytes.
bytes_needed: Bytes requested for the matrix.

If will_fit_available is false, either warns or errors depending on severity.
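
A minimal sketch of the same check using Julia's `Sys.total_memory` and `Sys.free_memory`. The helper names are hypothetical, the dense-matrix estimate is simply rows × columns × element size, and the warn/error handling is omitted.

```julia
# Bytes required for a dense rows × cols matrix with the given element type.
dense_matrix_bytes(rows, cols; eltype = Float64) = rows * cols * sizeof(eltype)

function fits_in_memory(bytes_needed::Integer)
    total_memory = Sys.total_memory()   # total system RAM in bytes
    free_memory = Sys.free_memory()     # currently free RAM in bytes
    return (
        will_fit_total = bytes_needed <= total_memory,
        will_fit_available = bytes_needed <= free_memory,
        total_memory = total_memory,
        free_memory = free_memory,
        bytes_needed = bytes_needed,
    )
end

check = fits_in_memory(dense_matrix_bytes(1_000, 1_000))
```

A 1,000 × 1,000 `Float64` matrix needs 8,000,000 bytes, so `check.bytes_needed` is 8000000 and the two flags report whether that fits in total and in currently free RAM.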

Mycelia.check_memory_limits — Method

Check if memory usage is within acceptable limits.

Mycelia.choose_top_n_markers — Method

choose_top_n_markers(N)

Return a vector of the top N most visually distinct marker symbols for plotting.

Arguments

N::Int: Number of distinct markers to return (max 17 for best differentiation).

Returns

Vector{Symbol}: Vector of marker symbol names.

Example

markers = choose_top_n_markers(7)

Mycelia.chromosome_coverage_table_to_plot — Method

chromosome_coverage_table_to_plot(cdf) -> Plots.Plot

Creates a visualization of chromosome coverage data with statistical thresholds.

Arguments

cdf::DataFrame: Coverage data frame containing columns:
- index: Chromosome position indices
- depth: Coverage depth values
- chromosome: Chromosome identifier
- mean_coverage: Mean coverage value
- std_coverage: Standard deviation of coverage
- 3σ: Boolean vector indicating +3 sigma regions
- -3σ: Boolean vector indicating -3 sigma regions

Returns

A StatsPlots plot object showing:
- Raw coverage data (black line)
- Mean coverage and ±1,2,3σ thresholds (rainbow colors)
- Highlighted regions exceeding ±3σ thresholds (red vertical lines)

Mycelia.classify_reads_by_taxonomy — Method

Classify reads based on taxonomic alignments using individual alignment scoring.

This function takes taxonomy-aware alignment data and performs classification by:

Loading the taxonomy-aware alignment data
Analyzing alignment score distributions per read
Identifying dominant taxonomic assignments
Applying conservative taxonomy classification

Arguments

taxonomy_aware_file: Path to taxonomy-aware alignment file (.tsv.gz or .arrow)
min_relative_proportion::Float64=60.0: Minimum relative proportion threshold for accepting a taxonomic assignment
verbose::Bool=true: Whether to print progress information

Returns

A DataFrame with taxonomic classification results including individual alignment metrics

Mycelia.classify_repeat_type — Method

Classify the type of repeat.

Mycelia.cleanup_directory — Method

cleanup_directory(
    directory::AbstractString;
    verbose,
    force
) -> @NamedTuple{existed::Bool, files_deleted::Int64, bytes_freed::Int64, human_readable_size::String}

Clean up a directory by calculating its size and file count, then removing it.

Arguments

directory::AbstractString: Path to the directory to clean up
verbose::Bool=true: Whether to report cleanup results (default: true)
force::Bool=false: Whether to proceed without confirmation for large directories

Returns

Named tuple with fields:
- existed: Whether the directory existed before cleanup
- files_deleted: Number of files that were deleted
- bytes_freed: Total bytes freed up
- human_readable_size: Human-readable representation of bytes freed

Details

This function will:

Check if the directory exists and is non-empty
Calculate the total size and number of files recursively
Remove the directory and all its contents
Report the cleanup results unless verbose=false

For directories larger than 1GB or containing more than 10,000 files, confirmation is required unless force=true.

Examples

# Clean up a temporary directory with reporting
result = cleanup_directory("/tmp/myapp_temp")

# Silent cleanup
cleanup_directory("/tmp/cache", verbose=false)

# Force cleanup of large directory
cleanup_directory("/tmp/large_data", force=true)
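The measure-then-remove sequence from the Details section can be sketched with Base file utilities. This is an illustration under assumptions: cleanup_sketch is a hypothetical name, and the confirmation step for large directories and the human-readable size are left out.

```julia
# Hypothetical helper; not Mycelia's code. No confirmation prompt is implemented.
function cleanup_sketch(directory::AbstractString)
    isdir(directory) || return (existed = false, files_deleted = 0, bytes_freed = 0)
    files_deleted = 0
    bytes_freed = 0
    # Recursively tally regular files before deleting anything.
    for (root, _, files) in walkdir(directory)
        for file in files
            files_deleted += 1
            bytes_freed += filesize(joinpath(root, file))
        end
    end
    rm(directory; recursive = true)
    return (existed = true, files_deleted = files_deleted, bytes_freed = bytes_freed)
end

# Build a small throwaway directory and clean it up.
tmp = mktempdir()
write(joinpath(tmp, "a.txt"), "hello")
mkdir(joinpath(tmp, "sub"))
write(joinpath(tmp, "sub", "b.txt"), "world!")
result = cleanup_sketch(tmp)
```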

Mycelia.codon_optimize — Method

codon_optimize(
;
    normalized_codon_frequencies,
    protein_sequence,
    n_iterations
)

Optimizes the DNA sequence encoding for a given protein sequence using codon usage frequencies.

Arguments

normalized_codon_frequencies: Dictionary mapping amino acids to their codon frequencies
protein_sequence::BioSequences.LongAA: Target protein sequence to optimize
n_iterations::Integer: Number of optimization iterations to perform

Algorithm

Creates initial DNA sequence through reverse translation
Iteratively generates new sequences by sampling codons based on their frequencies
Keeps track of the sequence with highest codon usage likelihood

Returns

BioSequences.LongDNA{2}: Optimized DNA sequence encoding the input protein
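The sample-and-keep-best loop from the Algorithm section can be sketched on plain strings. This is an illustration only: the two-amino-acid table and its frequencies are made up, all function names are hypothetical, and BioSequences types are replaced by String to keep the example self-contained.

```julia
using Random

# Toy codon table; the frequencies are made-up illustration values.
toy_codon_frequencies = Dict(
    'M' => Dict("ATG" => 1.0),
    'K' => Dict("AAA" => 0.7, "AAG" => 0.3),
)

# Draw one codon for `aa` with probability proportional to its frequency.
function sample_codon(rng, frequencies, aa)
    r = rand(rng)
    cumulative = 0.0
    chosen = ""
    for (codon, p) in frequencies[aa]
        chosen = codon
        cumulative += p
        r <= cumulative && break
    end
    return chosen
end

# Log-likelihood of a codon sequence under the frequency table.
codon_loglikelihood(frequencies, protein, codons) =
    sum(log(frequencies[aa][codon]) for (aa, codon) in zip(protein, codons))

function codon_optimize_sketch(frequencies, protein::AbstractString;
                               n_iterations = 100, rng = Random.default_rng())
    # Initial reverse translation, then repeated resampling.
    best = [sample_codon(rng, frequencies, aa) for aa in protein]
    best_score = codon_loglikelihood(frequencies, protein, best)
    for _ in 1:n_iterations
        candidate = [sample_codon(rng, frequencies, aa) for aa in protein]
        score = codon_loglikelihood(frequencies, protein, candidate)
        if score > best_score
            best, best_score = candidate, score   # keep the most likely sequence so far
        end
    end
    return join(best)
end

dna = codon_optimize_sketch(toy_codon_frequencies, "MKK";
                            n_iterations = 200, rng = MersenneTwister(1))
```

Because each iteration resamples the whole sequence independently, more iterations raise the chance of finding the highest-likelihood encoding but never make the result worse.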

Mycelia.codons_to_amino_acids — Method

codons_to_amino_acids() -> Dict

Creates a mapping between DNA codons and their corresponding amino acids using the standard genetic code.

Returns a dictionary where:

Keys are 3-letter DNA codons (e.g., "ATG")
Values are the corresponding amino acids from BioSequences.jl

Mycelia.collapse_unbranching_paths — Method

Collapse linear paths (vertices with one incoming and one outgoing edge) into a simpler graph where sequences are concatenated.

Mycelia.compare_assembly_statistics — Method

Compare statistical properties of intelligent vs iterative assemblies.

Mycelia.compare_genome_kmer_similarity — Method

compare_genome_kmer_similarity(genome1_file::String, genome2_file::String; kmer_type=Kmers.DNAKmer{21}, metric=:js_divergence)

Compare two genomes using existing k-mer distance metrics.

Leverages existing distance metric functions to compare genomic similarity between pairs of genomes using various distance measures.

Arguments

genome1_file: Path to first genome FASTA file
genome2_file: Path to second genome FASTA file
kmer_type: K-mer type from Kmers.jl (default: Kmers.DNAKmer{21})
metric: Distance metric (:js_divergence, :cosine, :jaccard)

Returns

Named tuple with similarity/distance metrics and k-mer statistics

Example

similarity = Mycelia.compare_genome_kmer_similarity(
    "genome1.fasta", "genome2.fasta", 
    kmer_type=Kmers.DNAKmer{31}, 
    metric=:js_divergence
)
println("JS divergence: $(similarity.distance)")
println("Shared k-mers: $(similarity.shared_kmers)")

Mycelia.concatenate_files — Method

concatenate_files(; files, file)

Join fasta files without any regard to record uniqueness.

A cross-platform version of cat *.fasta > joint.fasta

See merge_fasta_files

Concatenate multiple FASTA files into a single output file by simple appending.

Arguments

files: Vector of paths to input FASTA files
file: Path where the concatenated output will be written

Returns

Path to the output concatenated file

Details

Platform-independent implementation of cat *.fasta > combined.fasta. Files are processed sequentially with a progress indicator.

Mycelia.confusion_matrix — Method

confusion_matrix(true_labels, pred_labels)

Returns the confusion matrix as a Matrix{Int}, row = true, col = predicted. Also returns the list of unique labels in sorted order and a heatmap plot.
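The row = true, column = predicted layout can be sketched as follows. This is an illustration, not the Mycelia implementation: confusion_matrix_sketch is a hypothetical name, and the heatmap plot is omitted.

```julia
# Hypothetical helper; returns the count matrix and the sorted label list.
function confusion_matrix_sketch(true_labels, pred_labels)
    length(true_labels) == length(pred_labels) ||
        throw(ArgumentError("label vectors must have equal length"))
    labels = sort(unique(vcat(true_labels, pred_labels)))
    index = Dict(label => i for (i, label) in enumerate(labels))
    cm = zeros(Int, length(labels), length(labels))
    for (t, p) in zip(true_labels, pred_labels)
        cm[index[t], index[p]] += 1   # row = true label, column = predicted label
    end
    return cm, labels
end

cm, labels = confusion_matrix_sketch(["a", "a", "b", "c"], ["a", "b", "b", "c"])
```

Correct predictions land on the diagonal, so the single off-diagonal count here records the one "a" that was predicted as "b".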

Mycelia.contbernoulli_pca_epca — Method

contbernoulli_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Continuous Bernoulli EPCA on a matrix M (features × samples).

When to use

Use for continuous data in the open interval (0, 1), such as probabilities or normalized intensities.

Keyword arguments

k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

model : the fitted ExpFamilyPCA.ContinuousBernoulliEPCA object
scores : k×n_samples matrix of sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.contig_is_circular — Method

contig_is_circular(
    graph_file::String,
    contig_name::String
) -> Any

Returns bool indicating whether the contig is a circle

graph_file = path to assembly graph.gfa file; contig_name = name of the contig

Determine if a contig represents a circular structure in the assembly graph.

A circular contig is one where the sequence forms a complete loop in the assembly graph, typically representing structures like plasmids, circular chromosomes, or other circular DNA elements.

Arguments

graph_file::String: Path to the assembly graph in GFA format
contig_name::String: Name/identifier of the contig to check

Returns

Bool: true if the contig forms a circular structure, false otherwise

Mycelia.contig_is_cleanly_assembled — Method

contig_is_cleanly_assembled(
    graph_file::String,
    contig_name::String
) -> Bool

Returns bool indicating whether the contig is cleanly assembled.

By cleanly assembled we mean that the contig does not have other contigs attached in the same connected component.

graph_file = path to assembly graph.gfa file; contig_name = name of the contig

Check if a contig exists in isolation within its connected component in an assembly graph.

Arguments

graph_file::String: Path to the assembly graph file in GFA format
contig_name::String: Name/identifier of the contig to check

Returns

Bool: true if the contig exists alone in its connected component, false otherwise

Details

A contig is considered "cleanly assembled" if it appears as a single entry in the assembly graph's connected components. This function parses the GFA file and checks the contig's isolation status using the graph structure.

Mycelia.convert_legacy_gfa_to_next — Function

convert_legacy_gfa_to_next(
    gfa_file::AbstractString
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}
convert_legacy_gfa_to_next(
    gfa_file::AbstractString,
    graph_mode::Mycelia.GraphMode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}

Convert a legacy MetaGraphs GFA to next-generation MetaGraphsNext format.

This convenience function reads a GFA file using the legacy parser and converts it to the new strand-aware format.

Arguments

gfa_file: Path to GFA file
graph_mode: GraphMode for the output graph

Returns

MetaGraphsNext.MetaGraph in strand-aware format

Mycelia.convert_sequence — Method

convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

DNA sequences are converted using BioSequences.LongDNA
RNA sequences are converted using BioSequences.LongRNA
AA sequences are converted using BioSequences.LongAA

Mycelia.copy_to_tempdir — Method

copy_to_tempdir(file_path::String) -> String

Create a copy of a file in a temporary directory while preserving the original filename.

Arguments

file_path::String: Path to the source file to be copied

Returns

String: Path to the newly created temporary file

Mycelia.copy_with_unique_identifier — Method

copy_with_unique_identifier(
    infile,
    out_directory,
    unique_identifier;
    force
) -> Any

Copy a file to a new location with a unique identifier prepended to the filename.

Arguments

infile::AbstractString: Path to the source file to copy
out_directory::AbstractString: Destination directory for the copied file
unique_identifier::AbstractString: String to prepend to the filename
force::Bool=true: If true, overwrite existing files

Returns

String: Path to the newly created file

Mycelia.correct_errors_at_k — Method

Perform error correction at the current k-mer size. Returns the number of corrections made.

Mycelia.correct_errors_next — Function

correct_errors_next(graph::MetaGraph, sequences::Vector, config::ViterbiConfig) -> Vector{FASTX.FASTA.Record}

Correct errors in sequences using Viterbi algorithm and return corrected FASTA records.

Mycelia.count_canonical_kmers — Method

count_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any

Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.

Arguments

KMER_TYPE: Type parameter specifying the k-mer size and structure
sequences: Iterator of biological sequences to analyze

Returns

Dict{KMER_TYPE,Int}: Dictionary mapping canonical k-mers to their counts
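Canonical counting can be sketched on plain strings. This is an illustration under assumptions: the function names are hypothetical, sequences are uppercase unambiguous DNA strings, and String keys stand in for the Kmers.jl types used by the package.

```julia
# Reverse complement of an unambiguous, uppercase DNA string.
function reverse_complement(seq::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(seq))
end

# Count each k-mer under the smaller of itself and its reverse complement.
function count_canonical_kmers_sketch(k::Int, sequences)
    counts = Dict{String, Int}()
    for seq in sequences
        for i in 1:(length(seq) - k + 1)
            kmer = seq[i:(i + k - 1)]
            canonical = min(kmer, reverse_complement(kmer))
            counts[canonical] = get(counts, canonical, 0) + 1
        end
    end
    return counts
end

counts = count_canonical_kmers_sketch(3, ["ACGTT"])
```

In this input, ACG and CGT are reverse complements of each other, so both are counted under ACG, which is what makes the count strand-independent.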

Mycelia.count_kmers — Method

count_kmers(
    _::Type{KMER_TYPE},
    fastx_file::AbstractString
) -> Any

Count k-mers in a FASTA/FASTQ file and return their frequencies.

Arguments

KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
fastx_file: Path to input FASTA/FASTQ file

Returns

Dict{KMER_TYPE, Int}: Dictionary mapping each k-mer to its frequency

Mycelia.count_kmers — Method

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Mycelia.count_kmers — Method

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in a DNA sequence.

Arguments

::Type{Kmers.Kmer{A,K}}: K-mer type with alphabet A and length K
sequence::BioSequences.LongSequence: Input DNA sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Type Parameters

A <: BioSequences.DNAAlphabet: DNA alphabet type
K: Length of k-mers

Mycelia.count_kmers — Method

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in an RNA sequence.

Arguments

Kmer: Type parameter specifying the k-mer length K and RNA alphabet
sequence: Input RNA sequence to analyze

Returns

Dict{Kmers.Kmer, Int}: Sorted dictionary mapping each k-mer to its frequency count

Mycelia.count_kmers — Method

count_kmers(
    _::Type{KMER_TYPE},
    sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any

Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length and encoding (e.g., DNAKmer{4} for 4-mers)
sequences: A FASTA or FASTQ reader containing the biological sequences to analyze

Returns

A dictionary mapping k-mers to their counts in the input sequences

Mycelia.count_kmers — Method

count_kmers(
    _::Type{KMER_TYPE},
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any

Count the frequency of k-mers in a single FASTA/FASTQ record.

Arguments

KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
record: A FASTA or FASTQ record containing the sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Mycelia.count_kmers — Method

count_kmers(
    _::Type{KMER_TYPE},
    fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any

Count k-mers across multiple FASTA/FASTQ files and merge the results.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{4} for 4-mers)
fastx_files: Vector of paths to FASTA/FASTQ files

Returns

Dict{KMER_TYPE, Int}: Dictionary mapping k-mers to their total counts across all files

Mycelia.count_kmers — Method

count_kmers(
    _::Type{KMER_TYPE},
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any

Count k-mers across multiple sequence records and return a sorted frequency table.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{3} for 3-mers)
records: Vector of FASTA/FASTQ records to analyze

Returns

Dict{KMER_TYPE, Int}: Sorted dictionary mapping k-mers to their frequencies

Mycelia.count_matrix_to_probability_matrix — Method

count_matrix_to_probability_matrix(counts_matrix) -> Any

Convert a matrix of counts into a probability matrix by normalizing each column to sum to 1.0.

Arguments

counts_matrix::Matrix{<:Number}: Input matrix where each column represents counts/frequencies

Returns

Matrix{Float64}: Probability matrix where each column sums to 1.0
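Column-wise normalization can be sketched with broadcasting. This is an illustration with a hypothetical function name; note that in this sketch a column whose counts sum to zero produces NaN entries, and whether the package guards against that is not stated here.

```julia
# Hypothetical helper: divide every column by its own total.
function count_matrix_to_probability_matrix_sketch(counts::AbstractMatrix{<:Number})
    column_totals = sum(counts; dims = 1)   # 1 x n_columns row of totals
    return counts ./ column_totals          # broadcasts the totals down each column
end

probabilities = count_matrix_to_probability_matrix_sketch([1 2; 3 6])
```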

Mycelia.count_predicted_genes — Method

count_predicted_genes(gff_file)

Count the number of predicted genes from a GFF file.

Parses a GFF/GTF file and counts the number of CDS (coding sequence) features, which correspond to predicted genes.

Arguments

gff_file: Path to GFF/GTF file

Returns

Integer count of predicted genes (CDS features)

See Also

run_pyrodigal: For gene prediction that generates GFF files
parse_transterm_output: For parsing other annotation tool outputs

Mycelia.count_records — Method

count_records(fastx) -> Int64

Counts the total number of records in a FASTA/FASTQ file.

Arguments

fastx: Path to a FASTA or FASTQ file (can be gzipped)

Returns

Number of records (sequences) in the file

Mycelia.countmap_columns — Method

countmap_columns(table)

Generate and display frequency counts for all columns in a DataFrame.

Arguments

table::DataFrame: Input DataFrame to analyze

Details

Iterates through each column in the DataFrame and displays:

The column name
A Dict mapping unique values to their frequencies using StatsBase.countmap

Mycelia.create_assembly_environment — Method

create_assembly_environment(training_data, validation_data; episode_length=100)

Create a new reinforcement learning environment for assembly training.

Arguments

training_data::Vector{String}: Paths to training FASTQ datasets
validation_data::Vector{String}: Paths to validation FASTQ datasets
episode_length::Int: Maximum steps per training episode (default: 100)

Returns

AssemblyEnvironment: Initialized RL environment ready for training

Example

training_files = ["genome1.fastq", "genome2.fastq", "genome3.fastq"]
validation_files = ["validation1.fastq", "validation2.fastq"]
env = create_assembly_environment(training_files, validation_files)

Mycelia.create_curriculum_schedule — Method

create_curriculum_schedule(; stages=4, datasets_per_stage=10)

Create a curriculum learning schedule that progressively increases difficulty.

Arguments

stages::Int: Number of curriculum stages (default: 4)
datasets_per_stage::Int: Datasets per stage (default: 10)

Returns

Vector{Dict}: Curriculum schedule with parameters for each stage

Example

curriculum = create_curriculum_schedule(stages=5, datasets_per_stage=15)

Mycelia.create_database — Method

create_database(; database, address, username, password)

Creates a new Neo4j database instance if it doesn't already exist.

Arguments

database::String: Name of the database to create
address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
username::String: Neo4j authentication username (defaults to "neo4j")
password::String: Neo4j authentication password

Notes

Requires system database privileges to execute
Silently returns if database already exists
Temporarily switches to system database to perform creation

Mycelia.create_dqn_policy — Method

create_dqn_policy(; state_dim=11, action_dim=3, hidden_dims=[128, 64], learning_rate=0.001, epsilon=0.1)

Create a Deep Q-Network policy for assembly decisions.

Arguments

state_dim::Int: Dimension of state representation (default: 11)
action_dim::Int: Number of discrete actions (default: 3 for continue/next/terminate)
hidden_dims::Vector{Int}: Hidden layer sizes (default: [128, 64])
learning_rate::Float64: Learning rate (default: 0.001)
epsilon::Float64: Exploration probability (default: 0.1)

Returns

DQNPolicy: Initialized policy network

Example

policy = create_dqn_policy(hidden_dims=[256, 128, 64])

Mycelia.create_hmm_from_graph — Method

create_hmm_from_graph(graph::MetaGraph, config::ViterbiConfig) -> (states, transitions, emissions)

Create Hidden Markov Model parameters from a k-mer graph structure.

Mycelia.create_kfold_partitions — Method

Create k-fold partitions with validation holdout for cross-validation.

Mycelia.create_node_constraints — Method

create_node_constraints(
    graph;
    address,
    username,
    password,
    database
)

Creates unique identifier constraints for each node type in a Neo4j database.

Arguments

graph: A MetaGraph containing nodes with TYPE properties
address: Neo4j server address
username: Neo4j username (default: "neo4j")
password: Neo4j password
database: Neo4j database name (default: "neo4j")

Details

Extracts unique node types from the graph and creates Neo4j constraints ensuring each node of a given type has a unique identifier property.

Failed constraint creation attempts are silently skipped.

Mycelia.create_tarchive — Method

create_tarchive(; directory, tarchive)

Creates a gzipped tar archive of the specified directory along with verification files.

Arguments

directory: Source directory path to archive
tarchive: Optional output archive path (defaults to directory name with .tar.gz extension)

Generated Files

{tarchive}: The compressed tar archive
{tarchive}.log: Contents listing of the archive
{tarchive}.hashdeep.dfxml: Cryptographic hashes (MD5, SHA1, SHA256) of the archive

Returns

Path to the created tar archive file

Mycelia.cross_validation_summary — Method

Generate cross-validation summary report.

Mycelia.current_unix_datetime — Method

current_unix_datetime() -> Int64

Get the current time as a Unix timestamp (seconds since epoch).

Returns

Int: Current time as an integer Unix timestamp (seconds since January 1, 1970 UTC)

Examples

unix_time = current_unix_datetime()
# => 1709071368 (example value, will differ based on current time)

Mycelia.cypher — Method

cypher(
    cmd;
    address,
    username,
    password,
    format,
    database
) -> Cmd

Constructs a command to execute Neo4j Cypher queries via cypher-shell.

Arguments

cmd: The Cypher query command to execute
address::String="neo4j://localhost:7687": Neo4j server address
username::String="neo4j": Neo4j authentication username
password::String="password": Neo4j authentication password
format::String="auto": Output format (auto, verbose, or plain)
database::String="neo4j": Target Neo4j database name

Returns

Cmd: A command object ready for execution

Mycelia.dataframe_convert_dicts_to_json — Method

dataframe_convert_dicts_to_json(df) -> Any

Mycelia.dataframe_replace_nothing_with_missing — Method

dataframe_replace_nothing_with_missing(df::DataFrames.DataFrame) -> DataFrames.DataFrame

Return the DataFrame with all nothing values replaced by missing.

Mycelia.dataframe_to_ndjson — Method

dataframe_to_ndjson(df::DataFrame; outfile::Union{String,Nothing}=nothing)

Converts a DataFrame df into a newline-delimited JSON (NDJSON) string. Each line in the returned string represents one DataFrame row in JSON format, suitable for upload to Google BigQuery.

Keyword Arguments

outfile::Union{String,Nothing}: If provided, writes the resulting NDJSON to the file path given.

Examples

using DataFrames, Dates

# Sample DataFrame
df = DataFrame(
    id = [1, 2, 3],
    name = ["Alice", "Bob", "Carol"],
    created = [DateTime(2025, 4, 8, 14, 30), DateTime(2025, 4, 8, 15, 0), missing]
)

ndjson_str = dataframe_to_ndjson(df)
println(ndjson_str)

# Optionally, write to a file
dataframe_to_ndjson(df; outfile="output.ndjson")

Mycelia.deduplicate_fasta_file — Method

deduplicate_fasta_file(in_fasta, out_fasta) -> Any

Remove duplicate sequences from a FASTA file while preserving headers.

Arguments

in_fasta: Path to input FASTA file
out_fasta: Path where deduplicated FASTA will be written

Returns

Path to the output FASTA file (same as out_fasta parameter)

Details

Sequences are considered identical if they match exactly (case-sensitive)
For duplicate sequences, keeps the first header encountered
Input sequences are sorted by identifier before deduplication
Preserves the original sequence formatting

Mycelia.detect_alphabet — Method

detect_alphabet(seq::AbstractString) -> Symbol

Determines the alphabet of a sequence. The function scans through seq only once:

If a 'T' or 't' is found (and no 'U/u'), the sequence is classified as DNA.
If a 'U' or 'u' is found (and no 'T/t'), it is classified as RNA.
If both T and U occur, an error is thrown.
If a character outside the canonical nucleotide and ambiguity codes is encountered, the sequence is assumed to be protein.
If neither T nor U are found, the sequence is assumed to be DNA.
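The single-pass rules above can be sketched as follows. This is an illustration under assumptions: detect_alphabet_sketch is a hypothetical name, the accepted nucleotide set is the IUPAC codes plus a gap character, the return symbols mirror the :DNA, :RNA, and :AA values of the typed methods, and a non-nucleotide character is assumed to take precedence over the T/U conflict check.

```julia
# Hypothetical helper; the precedence between rules is an assumption.
function detect_alphabet_sketch(seq::AbstractString)
    nucleotide_codes = Set("ACGTURYSWKMBDHVN-")   # IUPAC nucleotide codes plus gap
    has_t = false
    has_u = false
    for c in seq
        uc = uppercase(c)
        uc in nucleotide_codes || return :AA      # any other character implies protein
        has_t |= uc == 'T'
        has_u |= uc == 'U'
    end
    has_t && has_u && error("sequence contains both T and U")
    return has_u ? :RNA : :DNA                    # neither T nor U defaults to DNA
end

detected = [detect_alphabet_sketch(s) for s in ("ACGT", "acgu", "MKVL", "ACGN")]
```

Short peptides made only of letters that are also nucleotide codes (for example "ACG") cannot be told apart from DNA by this rule set.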

Mycelia.detect_alphabet — Method

detect_alphabet(sequence::BioSequences.LongAA) -> Symbol

Detect the alphabet of a LongAA sequence.

Always returns :AA.

Mycelia.detect_alphabet — Method

detect_alphabet(sequence::BioSequences.LongDNA) -> Symbol

Detect the alphabet of a LongDNA sequence.

Always returns :DNA.

Mycelia.detect_alphabet — Method

detect_alphabet(sequence::BioSequences.LongRNA) -> Symbol

Detect the alphabet of a LongRNA sequence.

Always returns :RNA.

Mycelia.detect_and_extract_sequence — Method

detect_and_extract_sequence(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Tuple{Symbol, Union{BioSequences.LongAA, BioSequences.LongDNA{4}, BioSequences.LongRNA{4}}}

Detect alphabet and extract typed sequence from FASTX record in one step.

Convenience function that combines alphabet detection with type-safe sequence extraction, ideal for workflows that need to determine sequence type once at the beginning.

Arguments

record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input sequence record

Returns

Tuple{Symbol, BioSequences.BioSequence}: (alphabet_symbol, typed_sequence)

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
alphabet, sequence = detect_and_extract_sequence(record)
# alphabet = :DNA, sequence = LongDNA{4} object

Mycelia.detect_bubbles_next — Method

detect_bubbles_next(graph::MetaGraphsNext.MetaGraph, min_bubble_length::Int=2, max_bubble_length::Int=100) -> Vector{BubbleStructure}

Detect bubble structures (alternative paths) in the assembly graph.

Mycelia.detect_sequence_extension — Method

detect_sequence_extension(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> String

Detect sequence type from input and suggest appropriate file extension.

Arguments

record: A FASTA/FASTQ record
sequence: A string or BioSequence containing sequence data

Returns

String: Suggested file extension:
- ".fna" for DNA
- ".frn" for RNA
- ".faa" for protein
- ".fa" for unrecognized sequences

Mycelia.determine_fasta_coverage_from_bam — Method

determine_fasta_coverage_from_bam(bam) -> Any

Calculate per-base genomic coverage from a BAM file using bedtools.

Arguments

bam::String: Path to input BAM file

Returns

String: Path to the generated coverage file (.coverage.txt)

Details

Uses bedtools genomecov to compute per-base coverage. Creates a coverage file with the format: <chromosome> <position> <coverage_depth>. If the coverage file already exists, returns the existing file path.

Dependencies

Requires bedtools (automatically installed in conda environment)

Mycelia.determine_max_canonical_kmers — Method

determine_max_canonical_kmers(k, ALPHABET) -> Any

Calculate the maximum number of possible canonical k-mers for a given alphabet.

Arguments

k::Integer: Length of k-mer
ALPHABET::Vector{Char}: Character set (nucleotides or amino acids)

Returns

Int: Maximum number of possible canonical k-mers

Details

For amino acids (AA_ALPHABET): returns total possible k-mers
For nucleotides: returns half of total possible k-mers (canonical form)
Requires odd k-mer length for nucleotide alphabets
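The counting rule can be sketched directly. This is an illustration with hypothetical function names and an explicit is_nucleotide flag in place of the package's alphabet constants. The odd-k requirement exists because, for odd k, no nucleotide k-mer equals its own reverse complement, so the |Σ|ᵏ k-mers split exactly into pairs.

```julia
# Total number of distinct k-mers over an alphabet: |Σ|^k.
max_possible_kmers_sketch(k::Integer, alphabet) = length(alphabet)^k

# Hypothetical helper: halve the total for nucleotide alphabets.
function max_canonical_kmers_sketch(k::Integer, alphabet; is_nucleotide::Bool)
    total = max_possible_kmers_sketch(k, alphabet)
    is_nucleotide || return total   # amino acids have no reverse complement
    isodd(k) || throw(ArgumentError("nucleotide k must be odd"))
    return total ÷ 2                # each k-mer pairs with its reverse complement
end

dna_canonical = max_canonical_kmers_sketch(3, "ACGT"; is_nucleotide = true)
aa_total = max_canonical_kmers_sketch(2, "ACDEFGHIKLMNPQRSTVWY"; is_nucleotide = false)
```

With Int arithmetic the power overflows for large k, so a BigInt alphabet size would be needed beyond k = 31 for DNA.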

Mycelia.determine_max_possible_kmers — Method

determine_max_possible_kmers(k, ALPHABET) -> Any

Calculate the total number of possible unique k-mers that can be generated from a given alphabet.

Arguments

k: Length of k-mers to consider
ALPHABET: Vector containing the allowed characters/symbols

Returns

Integer representing the maximum number of possible unique k-mers (|Σ|ᵏ)

Mycelia.determine_primary_contig — Method

determine_primary_contig(qualimap_results) -> Any

Determines the contig with the greatest number of total bases mapping to it

Identify the primary contig based on mapping coverage from Qualimap results.

Arguments

qualimap_results::DataFrame: DataFrame containing Qualimap alignment statistics with columns "Contig" and "Mapped bases"

Returns

String: Name of the contig with the highest number of mapped bases

Description

Takes Qualimap alignment results and determines which contig has the most total bases mapped to it, which often indicates the main chromosomal assembly.

Mycelia.determine_read_lengths — Method

determine_read_lengths(
    fastq_file;
    total_reads
) -> Vector{Int64}

Calculate sequence lengths for reads in a FASTQ file.

Arguments

fastq_file::String: Path to input FASTQ file
total_reads::Integer=Inf: Number of reads to process (defaults to all reads)

Returns

Vector{Int}: Array containing the length of each sequence read
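Collecting read lengths can be sketched without FASTX by reading records line by line. This is an illustration under assumptions: read_lengths_sketch is a hypothetical name, it takes an IO stream rather than a file path, and it assumes strict four-line FASTQ records with no wrapped sequence lines.

```julia
# Hypothetical helper; assumes exactly four lines per FASTQ record.
function read_lengths_sketch(io::IO; total_reads = Inf)
    lengths = Int[]
    while !eof(io) && length(lengths) < total_reads
        readline(io)                 # header line (@id)
        sequence = readline(io)      # sequence line
        readline(io)                 # separator line (+)
        readline(io)                 # quality line
        push!(lengths, length(sequence))
    end
    return lengths
end

fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
all_lengths = read_lengths_sketch(IOBuffer(fastq_text))
first_length = read_lengths_sketch(IOBuffer(fastq_text); total_reads = 1)
```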

Mycelia.dictvec_to_dataframe — Method

dictvec_to_dataframe(dictvec::Vector{<:AbstractDict}; symbol_columns::Bool = true)

Convert a vector of dictionaries (with possibly non-uniform keys and any key type) into a DataFrame. Missing keys in a row are filled with missing.

Arguments

dictvec: Vector of dictionaries.
symbol_columns: If true (default), columns are named as Symbols (when possible), else as raw keys.

Returns

DataFrames.DataFrame with columns as the union of all keys.

Mycelia.distance_matrix_to_newick — Method

distance_matrix_to_newick(
;
    distance_matrix,
    labels,
    outfile
)

Create a distance matrix from a column-major counts matrix (features as rows and entities as columns), where distance is proportional to total feature count magnitude (size) and cosine similarity (relative frequency).

Convert a distance matrix into a Newick tree format using UPGMA hierarchical clustering.

Arguments

distance_matrix: Square matrix of pairwise distances between entities
labels: Vector of labels corresponding to the entities in the distance matrix
outfile: Path where the Newick tree file will be written

Returns

Path to the generated Newick tree file

Details

Performs hierarchical clustering using the UPGMA (average linkage) method and converts the resulting dendrogram into Newick tree format. The branch lengths in the tree represent the heights from the clustering.

Mycelia.document_frequency — Method

document_frequency(
    documents
) -> Dict{T, Int64} where T<:(SubString{_A} where _A)

Calculate the document frequency of tokens across a collection of documents.

Arguments

documents: Collection of text documents where each document is a string

Returns

Dictionary mapping each unique token to the number of documents it appears in

Description

Computes how many documents contain each unique token. Each document is tokenized by splitting on whitespace. Tokens are counted only once per document, regardless of how many times they appear within that document.
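The once-per-document counting rule can be sketched as follows. This is an illustration: document_frequency_sketch is a hypothetical name, and it returns String keys rather than the SubString keys shown in the signature above.

```julia
# Hypothetical helper: count, for each token, how many documents contain it.
function document_frequency_sketch(documents)
    frequency = Dict{String, Int}()
    for document in documents
        # Set removes repeats so each token counts once per document.
        for token in Set(split(document))
            key = String(token)
            frequency[key] = get(frequency, key, 0) + 1
        end
    end
    return frequency
end

doc_freq = document_frequency_sketch(["a b a", "b c"])
```

The token "a" appears twice in the first document but still contributes a document frequency of 1.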

Mycelia.download_bandage — Function

download_bandage() -> String
download_bandage(outdir) -> Any

Downloads and installs Bandage, a bioinformatics visualization tool for genome assembly graphs.

Arguments

outdir="/usr/local/bin": Target installation directory for the Bandage executable

Returns

Path to the installed Bandage executable

Details

Downloads Bandage v0.8.1 for Ubuntu
Installs required system dependencies (libxcb-glx0, libx11-xcb-dev, libfontconfig, libgl1-mesa-glx)
Attempts installation with sudo, falls back to root if sudo fails
Skips download if Bandage is already installed at target location

Dependencies

Requires system commands: wget, unzip, apt

Mycelia.download_blast_db — Method

download_blast_db(; db, dbdir, source, wait)

Smart downloading of BLAST databases, adapting to interactive or non-interactive context.

For a list of all available databases, run: Mycelia.list_blastdbs()

Downloads and sets up BLAST databases from various sources.

Arguments

db: Name of the BLAST database to download
dbdir: Directory to store the downloaded database (default: "~/workspace/blastdb")
source: Download source - one of ["", "aws", "gcp", "ncbi"]. Empty string auto-detects fastest source
wait: Whether to wait for download completion (default: true)

Returns

String path to the downloaded database directory

Mycelia.download_genome_by_accession — Method

download_genome_by_accession(
;
    accession,
    outdir,
    compressed
)

Downloads a genomic sequence from NCBI's nucleotide database by its accession number.

Arguments

accession::String: NCBI nucleotide accession number (e.g. "NC_045512")
outdir::String: Output directory path. Defaults to current directory
compressed::Bool: If true, compresses output file with gzip. Defaults to true

Returns

String: Path to the downloaded file (.fna or .fna.gz)

Mycelia.download_genome_by_ftp — Method

download_genome_by_ftp(; ftp, outdir)

Downloads a genome file from NCBI FTP server to the specified directory.

Arguments

ftp::String: NCBI FTP path for the genome (e.g. "ftp://ftp.ncbi.nlm.nih.gov/.../")
outdir::String: Output directory path. Defaults to current working directory.

Returns

String: Path to the downloaded file

Notes

If the target file already exists, returns the existing file path without re-downloading
Downloads the genomic.fna.gz version of the genome

Mycelia.download_mmseqs_db — Method

download_mmseqs_db(; db, dbdir, force, wait)

Downloads and sets up MMseqs2 reference databases for sequence searching and analysis.

Arguments

db::String: Name of database to download (see table below)
dbdir::String: Directory to store the downloaded database (default: "~/workspace/mmseqs")
force::Bool: If true, force re-download even if database exists (default: false)
wait::Bool: If true, wait for download to complete (default: true)

Returns

Path to the downloaded database as a String

Available Databases

Database	Type	Taxonomy	Description
UniRef100	Aminoacid	Yes	UniProt Reference Clusters - 100% identity
UniRef90	Aminoacid	Yes	UniProt Reference Clusters - 90% identity
UniRef50	Aminoacid	Yes	UniProt Reference Clusters - 50% identity
UniProtKB	Aminoacid	Yes	Universal Protein Knowledge Base
NR	Aminoacid	Yes	NCBI Non-redundant proteins
NT	Nucleotide	No	NCBI Nucleotide collection
GTDB	Aminoacid	Yes	Genome Taxonomy Database
PDB	Aminoacid	No	Protein Data Bank structures
Pfam-A.full	Profile	No	Protein family alignments
SILVA	Nucleotide	Yes	Ribosomal RNA database

  Name                  Type            Taxonomy        Url                                                           
- UniRef100             Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef90              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef50              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniProtKB             Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL      Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot  Aminoacid            yes        https://uniprot.org
- NR                    Aminoacid            yes        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                    Nucleotide             -        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                  Aminoacid            yes        https://gtdb.ecogenomic.org
- PDB                   Aminoacid              -        https://www.rcsb.org
- PDB70                 Profile                -        https://github.com/soedinglab/hh-suite
- Pfam-A.full           Profile                -        https://pfam.xfam.org
- Pfam-A.seed           Profile                -        https://pfam.xfam.org
- Pfam-B                Profile                -        https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD                   Profile                -        https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG                Profile                -        http://eggnog5.embl.de
- VOGDB                 Profile                -        https://vogdb.org
- dbCAN2                Profile                -        http://bcb.unl.edu/dbCAN2
- SILVA                 Nucleotide           yes        https://www.arb-silva.de
- Resfinder             Nucleotide             -        https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari              Nucleotide           yes        https://github.com/lskatz/Kalamari

Mycelia.download_sequence_by_accession — Method

download_sequence_by_accession(
;
    accession,
    outdir,
    database,
    format,
    suffix,
    compressed
)

Download a sequence from NCBI by accession number with flexible format options.

Arguments

accession::String: NCBI accession number
outdir::String: Output directory
database::String: NCBI database ("nuccore", "protein", etc.)
format::String: Output format ("fasta", "fastacdsna", "fastacdsaa", etc.)
suffix::String: File suffix to append to accession for filename
compressed::Bool: Whether to gzip compress the output (default: true)

Returns

String: Path to the downloaded file

Mycelia.download_sra_data — Method

Downloads sequencing reads from NCBI's Sequence Read Archive (SRA).

Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.

Arguments

srr_identifier: SRA run identifier (e.g., "SRR1234567")
outdir: Output directory for downloaded files (default: current directory)

Returns

Named tuple with:

srr_id: The SRA identifier
outdir: Output directory path
files: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)
is_paired: Boolean indicating if data is paired-end

Example

# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")

# Apply appropriate QC based on data type
if result.is_paired
    # Paired-end data - use paired-end QC
    Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
    # Single-end data - use single-end QC
    Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
end

Mycelia.download_viroid_reference_data — Method

download_viroid_reference_data(
    viroid_name::String;
    outdir,
    download_genome,
    download_cds,
    download_protein,
    max_sequences
) -> @NamedTuple{genome_files::Vector{String}, cds_files::Vector{String}, protein_files::Vector{String}, output_directory::String}

Download comprehensive viroid reference data including genome, CDS, and protein sequences.

Arguments

viroid_name::String: Name or search term for the viroid (e.g., "Potato spindle tuber viroid")
outdir::String: Output directory for downloaded files (default: current directory)
download_genome::Bool: Download genomic DNA sequence (default: true)
download_cds::Bool: Download CDS transcript sequences (default: true)
download_protein::Bool: Download protein (FAA) sequences (default: true)
max_sequences::Int: Maximum number of sequences per type (default: 10)

Returns

NamedTuple: Paths to downloaded files (genome_files, cds_files, protein_files) and the output directory (output_directory)

Examples

# Download all data for Potato spindle tuber viroid
files = download_viroid_reference_data("Potato spindle tuber viroid"; outdir="viroid_data/")

# Download only genome sequences for general viroid search
files = download_viroid_reference_data("viroid"; outdir="viroid_genomes/",
                                       download_cds=false, download_protein=false)

Mycelia.draw_dendrogram_tree — Method

draw_dendrogram_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a dendrogram visualization of hierarchical clustering results stored in a MetaDiGraph.

Arguments

mg::MetaGraphs.MetaDiGraph: Graph containing hierarchical clustering results. Must have :hcl in graph properties with clustering data and vertex properties containing :x, :y coordinates.

Keywords

width::Integer=500: Width of output image in pixels
height::Integer=500: Height of output image in pixels
fontsize::Integer=12: Font size for node labels in points
margins::Float64: Margin size in pixels, defaults to min(width,height)/20
mergenodesize::Float64=1: Size of circular nodes at merge points
lineweight::Float64=1: Thickness of dendrogram lines
filename::String: Output filename, defaults to timestamp with .dendrogram.png extension

Returns

A Luxor.Drawing, per the signature. The dendrogram image is also saved to disk and a preview is displayed.

Mycelia.draw_radial_tree — Method

draw_radial_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a radial hierarchical clustering tree visualization and save it as an image file.

Arguments

mg::MetaGraphs.MetaDiGraph: A meta directed graph containing hierarchical clustering data with required graph properties :hcl containing clustering information.

Keywords

width::Int=500: Width of the output image in pixels
height::Int=500: Height of the output image in pixels
fontsize::Int=12: Font size for node labels
margins::Float64: Margin size (automatically calculated as min(width,height)/20)
mergenodesize::Float64=1: Size of the merge point nodes
lineweight::Float64=1: Thickness of the connecting lines
filename::String: Output filename (defaults to timestamp with ".radial.png" suffix)

Details

The function creates a radial visualization of hierarchical clustering results where:

Leaf nodes are arranged in a circle with labels
Internal nodes represent merge points
Connections show the hierarchical structure through arcs and lines

The visualization is saved as a PNG file and automatically previewed.

Required Graph Properties

The input graph must have:

mg.gprops[:hcl].labels: Vector of leaf node labels
mg.gprops[:hcl].order: Vector of ordered leaf nodes
mg.gprops[:hcl].merges: Matrix of merge operations
mg.vprops[v][:x]: X coordinate for each vertex
mg.vprops[v][:y]: Y coordinate for each vertex

Mycelia.drop_empty_columns! — Method

drop_empty_columns!(
    df::DataFrames.AbstractDataFrame
) -> DataFrames.AbstractDataFrame

Identify all columns that have only missing or empty values, and remove those columns from the dataframe in-place.

Returns a modified version of the original dataframe.

See also: drop_empty_columns

Mycelia.drop_empty_columns — Method

drop_empty_columns(df::DataFrames.AbstractDataFrame) -> Any

Identify all columns that have only missing or empty values, and remove those columns from the dataframe.

Returns a modified copy of the dataframe.

See also: drop_empty_columns!

Mycelia.edge_path_to_sequence — Method

edge_path_to_sequence(kmer_graph, edge_path) -> Any

Converts a path of edges in a kmer graph into a DNA sequence by concatenating overlapping kmers.

Arguments

kmer_graph: A directed graph where vertices represent kmers and edges represent overlaps
edge_path: Vector of edges representing a path through the graph

Returns

A BioSequences.LongDNASeq containing the merged sequence represented by the path

Details

The function:

Takes the first kmer from the source vertex of first edge
For each edge, handles orientation (forward/reverse complement)
Verifies overlaps between consecutive kmers
Concatenates unique bases to build final sequence
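
The overlap-and-append step can be illustrated with plain strings. This is a minimal sketch, not Mycelia's implementation: `kmer_path_to_sequence_sketch` is a hypothetical name, the path is given as an ordered list of k-mers rather than graph edges, and orientation handling (forward/reverse complement) is left out.

```julia
# Hypothetical helper: merge an ordered path of k-mers into one sequence.
# Assumes every k-mer is already in the orientation in which it is traversed.
function kmer_path_to_sequence_sketch(kmers::Vector{String})
    isempty(kmers) && return ""
    sequence = kmers[1]
    for (previous, current) in zip(kmers, kmers[2:end])
        # Consecutive k-mers must overlap by k - 1 bases.
        previous[2:end] == current[1:end-1] ||
            error("k-mers $previous and $current do not overlap")
        # Only the final base of each new k-mer is new information.
        sequence *= current[end]
    end
    return sequence
end

merged = kmer_path_to_sequence_sketch(["ACG", "CGT", "GTA"])
```

A path of m k-mers yields a sequence of length k + m - 1, because each k-mer after the first contributes a single base.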

Mycelia.edge_probability — Method

edge_probability(stranded_kmer_graph, edge) -> Any

Calculate the probability of traversing a specific edge in a stranded k-mer graph.

The probability is computed as the ratio of this edge's coverage weight to the sum of all outgoing edge weights from the source vertex.

Arguments

stranded_kmer_graph: A directed graph where edges represent k-mer connections
edge: The edge for which to calculate the probability

Returns

Float64: Probability in the range [0, 1] of traversing this edge. Returns 0.0 if the sum of all outgoing edge weights is zero.

Note

Probability is based on the :coverage property of edges, using their length as weights
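
The ratio described above can be sketched without a graph library. Everything here is illustrative: `edge_probability_sketch` is a hypothetical name, and the graph is replaced by a dictionary that maps (source, destination) pairs to an integer weight such as the number of coverage observations on that edge.

```julia
# Hypothetical stand-in for the stranded k-mer graph: a Dict from
# (source, destination) pairs to an edge weight.
function edge_probability_sketch(weights::Dict{Tuple{String,String},Int},
                                 edge::Tuple{String,String})
    source = first(edge)
    # Sum the weights of every edge that leaves the same source vertex.
    total = sum((w for ((src, _), w) in weights if src == source); init=0)
    # Mirror the documented fallback: no outgoing weight means probability 0.0.
    total == 0 && return 0.0
    return get(weights, edge, 0) / total
end

weights = Dict(("ACG", "CGT") => 3, ("ACG", "CGA") => 1, ("CGT", "GTA") => 2)
p = edge_probability_sketch(weights, ("ACG", "CGT"))
```

In this toy graph the source vertex ACG has outgoing weights 3 and 1, so the edge to CGT receives 3 / 4.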

Mycelia.edgemer_to_vertex_kmers — Method

edgemer_to_vertex_kmers(
    edgemer
) -> Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}

Convert an edgemer to two vertex kmers.

This function takes an edgemer (a sequence of n DNA nucleotides) and converts it into two vertex kmers. A kmer is a substring of length k from a DNA sequence. The first kmer is created from the first n-1 elements of the edgemer, and the second kmer is created from the last n-1 elements of the edgemer.

Arguments

edgemer::AbstractVector{T}: A vector of DNA nucleotides where T is a subtype of BioSequences.DNAAlphabet{2}.

Returns

Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}: A tuple containing two kmers.
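
As a rough illustration of the split, the same idea on a plain string looks like this. The function name is hypothetical and strings stand in for the Kmers.jl types used by the real method.

```julia
# Plain-string sketch: an edgemer of length n spans two overlapping
# vertex k-mers of length n - 1.
function edgemer_to_vertex_kmers_sketch(edgemer::AbstractString)
    n = length(edgemer)
    n >= 2 || throw(ArgumentError("edgemer must contain at least 2 characters"))
    return (edgemer[1:n-1], edgemer[2:n])
end

source_kmer, destination_kmer = edgemer_to_vertex_kmers_sketch("ACGT")
```

The two results share n - 2 bases, which is the overlap that connects them in the graph.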

Mycelia.ensure_next_graph — Method

ensure_next_graph(graph) -> Any

Automatically convert legacy graphs to next-generation format if needed.

This is a convenience function that checks the graph type and converts if necessary.

Arguments

graph: Graph in either legacy or next-generation format

Returns

MetaGraphsNext.MetaGraph: Graph in next-generation format

Mycelia.equally_spaced_samples — Method

equally_spaced_samples(vector, n) -> Any

Sample n equally spaced elements from vector.

Arguments

vector: Input vector to sample from
n: Number of samples to return (must be positive)

Returns

A vector containing n equally spaced elements from the input vector.
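
One way to pick equally spaced elements is to spread n positions evenly between the first and last index and round them. This is a sketch under assumptions, not necessarily how Mycelia does it: the name is hypothetical, and the requirement that n not exceed the vector length is an assumption added here.

```julia
# Hypothetical sketch: choose n evenly spread indices and return those elements.
function equally_spaced_samples_sketch(vector::AbstractVector, n::Integer)
    1 <= n <= length(vector) ||
        throw(ArgumentError("n must be between 1 and length(vector)"))
    # A single sample has no spacing to compute; return the first element.
    n == 1 && return vector[[firstindex(vector)]]
    # Spread n positions from the first to the last index, then round to integers.
    positions = round.(Int, range(firstindex(vector), lastindex(vector); length=n))
    return vector[positions]
end

samples = equally_spaced_samples_sketch(collect(1:10), 4)
```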

Mycelia.equivalent_fasta_sequences — Method

equivalent_fasta_sequences(fasta_1, fasta_2) -> Bool

Compare two FASTA files to determine if they contain the same set of sequences, regardless of sequence order.

Arguments

fasta_1::String: Path to first FASTA file
fasta_2::String: Path to second FASTA file

Returns

Bool: true if both files contain exactly the same sequences, false otherwise

Details

Performs a set-based comparison of DNA sequences by hashing each sequence. Sequence order differences between files do not affect the result.
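
The set-based comparison can be sketched with Base alone. The FASTA reader below is a deliberately small stand-in for FASTX.jl, and both function names are hypothetical.

```julia
# Minimal FASTA reader using only Base: returns sequences and ignores headers.
function read_fasta_sequences(path::AbstractString)
    sequences = String[]
    buffer = IOBuffer()
    started = false
    for line in eachline(path)
        if startswith(line, '>')
            # A new header closes the previous record, if there was one.
            started && push!(sequences, String(take!(buffer)))
            started = true
        elseif started
            print(buffer, strip(line))
        end
    end
    started && push!(sequences, String(take!(buffer)))
    return sequences
end

# Compare the sets of sequence hashes; record order and line wrapping do not matter.
function equivalent_fasta_sequences_sketch(fasta_1, fasta_2)
    hashes(path) = Set(hash(seq) for seq in read_fasta_sequences(path))
    return hashes(fasta_1) == hashes(fasta_2)
end

dir = mktempdir()
a = joinpath(dir, "a.fasta")
b = joinpath(dir, "b.fasta")
write(a, ">s1\nACGT\n>s2\nGGCC\n")
write(b, ">x\nGGCC\n>y\nAC\nGT\n")  # same sequences, different order and wrapping
same = equivalent_fasta_sequences_sketch(a, b)
```

Because a set is used, duplicated sequences collapse to one entry, so two files that differ only in how many times a sequence is repeated compare as equivalent in this sketch.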

Mycelia.error_rate_to_q_value — Method

error_rate_to_q_value(error_rate) -> Any

Convert a sequencing error probability to a Phred quality score (Q-value).

The calculation uses the standard Phred formula: Q = -10 * log₁₀(error_rate)

Arguments

error_rate::Float64: Probability of error (between 0 and 1)

Returns

q_value::Float64: Phred quality score
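
The formula is short enough to write out directly. The function names below are hypothetical, and the inverse conversion is included only to show that the two directions round-trip.

```julia
# Phred formula from the text: Q = -10 * log10(error_rate).
error_rate_to_q_sketch(error_rate::Real) = -10 * log10(error_rate)

# Inverse of the same formula: error_rate = 10^(-Q / 10).
q_to_error_rate_sketch(q::Real) = 10.0^(-q / 10)

q30 = error_rate_to_q_sketch(0.001)   # one error in 1,000 bases
p20 = q_to_error_rate_sketch(20)      # Q20
```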

Mycelia.errors_are_singletons — Method

Analyze k-mer coverage distribution to detect if errors are singletons. Returns true if low-coverage k-mers (likely errors) are well-separated from high-coverage ones.

Mycelia.estimate_copy_number — Method

Estimate copy number for repeat region.

Mycelia.estimate_dense_matrix_memory — Method

estimate_dense_matrix_memory(nrows::Integer, ncols::Integer)
estimate_dense_matrix_memory(T::DataType, nrows::Integer, ncols::Integer)

Estimate the memory required (in bytes) for a dense matrix.

If T is provided, estimate memory for a matrix with element type T.
If T is not provided, defaults to Float64.
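
For a dense matrix the estimate is plausibly the element count times the element size. The sketch below assumes exactly that and ignores any array header overhead; the function name is hypothetical.

```julia
# Bytes for the element storage of a dense nrows x ncols matrix of type T.
dense_matrix_bytes_sketch(T::DataType, nrows::Integer, ncols::Integer) =
    sizeof(T) * nrows * ncols

# Default element type Float64, matching the documented behavior.
dense_matrix_bytes_sketch(nrows::Integer, ncols::Integer) =
    dense_matrix_bytes_sketch(Float64, nrows, ncols)

bytes_default = dense_matrix_bytes_sketch(1_000, 1_000)
bytes_uint8 = dense_matrix_bytes_sketch(UInt8, 1_000, 1_000)
```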

Mycelia.estimate_genome_size_from_kmers — Method

estimate_genome_size_from_kmers(
    sequence::Union{AbstractString, BioSequences.LongSequence},
    k::Integer
) -> Dict

Estimate genome size from k-mer analysis using total k-mer count.

This function estimates genome size using the basic relationship: genome_size ≈ total_kmers - k + 1, where total_kmers is the sum of all k-mer counts. This is a simple estimation method; more sophisticated approaches accounting for sequencing depth, repeats, and errors may be more accurate.

Arguments

sequence::Union{BioSequences.LongSequence, AbstractString}: Input sequence or string
k::Integer: K-mer size for analysis

Returns

Dict{String, Any}: Dictionary containing:
- "unique_kmers": Number of unique k-mers observed
- "total_kmers": Total k-mer count (sum of all frequencies)
- "estimated_genome_size": Estimated genome size
- "actual_size": Length of input sequence (if provided)

Examples

# Estimate genome size from a sequence
sequence = "ATCGATCGATCGATCG"
result = estimate_genome_size_from_kmers(sequence, 5)

Mycelia.estimate_genome_size_from_kmers — Method

estimate_genome_size_from_kmers(
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1},
    k::Integer
)

Estimate genome size from FASTQ/FASTA records using k-mer analysis.

Overload for processing FASTQ or FASTA records directly.

Arguments

records::AbstractVector{T} where T <: Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input records
k::Integer: K-mer size for analysis

Returns

Dict{String, Any}: Dictionary with k-mer statistics and genome size estimate

Mycelia.estimate_memory_usage — Method

Estimate memory usage for a graph with given number of k-mers. Provides rough estimate for memory monitoring.

Mycelia.estimate_sparse_matrix_memory — Method

estimate_sparse_matrix_memory(nrows::Integer, ncols::Integer; nnz=nothing, density=nothing)
estimate_sparse_matrix_memory(T::DataType, nrows::Integer, ncols::Integer; nnz=nothing, density=nothing)

Estimate the memory required (in bytes) for a sparse matrix in CSC format.

If T is provided, estimate memory for a matrix with element type T.
If T is not provided, defaults to Float64.
You must specify either nnz (number of non-zeros) or density (proportion of non-zeros, between 0 and 1).
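
A CSC matrix stores three arrays: the non-zero values, one row index per stored value, and one column pointer per column plus one. The sketch below adds up those three arrays. It assumes `Int` indices and ignores object overhead, and the function name is hypothetical, so the numbers may differ from what Mycelia reports.

```julia
# Storage estimate for a CSC sparse matrix with element type T and Int indices.
function sparse_csc_bytes_sketch(T::DataType, nrows::Integer, ncols::Integer;
                                 nnz=nothing, density=nothing)
    if nnz === nothing
        density === nothing && throw(ArgumentError("specify either nnz or density"))
        # Derive the number of stored entries from the requested density.
        nnz = round(Int, density * nrows * ncols)
    end
    values_bytes = nnz * sizeof(T)             # nzval: stored non-zero values
    rowidx_bytes = nnz * sizeof(Int)           # rowval: row index per stored value
    colptr_bytes = (ncols + 1) * sizeof(Int)   # colptr: one offset per column, plus one
    return values_bytes + rowidx_bytes + colptr_bytes
end

by_nnz = sparse_csc_bytes_sketch(Float64, 1_000, 1_000; nnz=10_000)
by_density = sparse_csc_bytes_sketch(Float64, 1_000, 1_000; density=0.01)
```

At 1% density this is far below the 8,000,000 bytes a dense Float64 matrix of the same shape would need, which is the comparison these estimators are useful for.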

Mycelia.estimate_transition_probabilities — Method

estimate_transition_probabilities(graph::MetaGraph, sequences::Vector) -> Matrix{Float64}

Estimate transition probabilities from observed sequences in the graph.

Mycelia.evaluate_assembly_agent — Method

evaluate_assembly_agent(policy::DQNPolicy, validation_data::Vector{String}; episodes=10)

Evaluate a trained assembly agent on validation data.

Arguments

policy::DQNPolicy: Trained policy to evaluate
validation_data::Vector{String}: Validation FASTQ files
episodes::Int: Number of evaluation episodes (default: 10)

Returns

Dict{String, Float64}: Evaluation metrics including mean reward, assembly quality, etc.

Example

# metrics = evaluate_assembly_agent(trained_policy, validation_files)
# println("Mean reward: $(metrics["mean_reward"])")

Mycelia.evaluate_classification — Method

evaluate_classification(true_labels, pred_labels)

Runs confusion_matrix, precision_recall_f1, and accuracy, then pretty-prints the macro metrics and accuracy. Returns a named tuple with all results and plots.
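
The metrics involved can be computed from a confusion matrix in a few lines. This is a self-contained sketch with a hypothetical function name; it does not call Mycelia's helpers, produces no plots, and treats an undefined precision or recall (zero denominator) as 0.0, which is an assumption.

```julia
function classification_metrics_sketch(true_labels::AbstractVector,
                                       pred_labels::AbstractVector)
    length(true_labels) == length(pred_labels) ||
        throw(ArgumentError("label vectors must have equal length"))
    isempty(true_labels) && throw(ArgumentError("label vectors must not be empty"))
    labels = sort(unique(vcat(true_labels, pred_labels)))
    index = Dict(label => i for (i, label) in enumerate(labels))
    n = length(labels)
    # Rows are true labels, columns are predicted labels.
    confusion = zeros(Int, n, n)
    for (t, p) in zip(true_labels, pred_labels)
        confusion[index[t], index[p]] += 1
    end
    safe_div(a, b) = b == 0 ? 0.0 : a / b
    precisions = [safe_div(confusion[i, i], sum(confusion[:, i])) for i in 1:n]
    recalls = [safe_div(confusion[i, i], sum(confusion[i, :])) for i in 1:n]
    f1s = [safe_div(2 * p * r, p + r) for (p, r) in zip(precisions, recalls)]
    # Macro averaging: unweighted mean over classes.
    macro_mean(v) = sum(v) / length(v)
    accuracy = sum(confusion[i, i] for i in 1:n) / length(true_labels)
    return (labels = labels, confusion = confusion,
            macro_precision = macro_mean(precisions),
            macro_recall = macro_mean(recalls),
            macro_f1 = macro_mean(f1s), accuracy = accuracy)
end

metrics = classification_metrics_sketch(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```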

Mycelia.execute_continue_k_action — Method

execute_continue_k_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "continue with current k" action by performing error correction iterations.

Arguments

env::AssemblyEnvironment: Current environment
action::AssemblyAction: Action specifying correction parameters

Returns

RewardComponents: Reward breakdown for the action

Mycelia.execute_next_k_action — Method

execute_next_k_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "move to next k" action by progressing to the next prime k-mer size.

Arguments

env::AssemblyEnvironment: Current environment
action::AssemblyAction: Action specifying progression parameters

Returns

RewardComponents: Reward breakdown for the action

Mycelia.execute_terminate_action — Method

execute_terminate_action(env::AssemblyEnvironment, action::AssemblyAction)

Execute a "terminate assembly" action and assess final assembly quality.

Arguments

env::AssemblyEnvironment: Current environment
action::AssemblyAction: Termination action

Returns

RewardComponents: Final reward breakdown including assembly quality assessment

Mycelia.export_blast_db — Method

export_blast_db(; path_to_db, fasta)

Export sequences from a BLAST database to a gzipped FASTA file.

Arguments

path_to_db: Path to the BLAST database
fasta: Output path for the gzipped FASTA file (default: path_to_db * ".fna.gz")

Details

Uses conda's BLAST environment to extract sequences using blastdbcmd. The output is automatically compressed using pigz. If the output file already exists, the function will skip extraction.

Mycelia.export_blast_db_taxonomy_table — Method

export_blast_db_taxonomy_table(; path_to_db, outfile)

Exports a taxonomy mapping table from a BLAST database in seqid2taxid format.

Arguments

path_to_db::String: Path to the BLAST database
outfile::String: Output file path (defaults to input path + ".seqid2taxid.txt.gz")

Returns

String: Path to the created output file

Details

Creates a compressed tab-delimited file mapping sequence IDs to taxonomy IDs. Uses blastdbcmd without GI identifiers for better cross-referencing compatibility. If the output file already exists, returns the path without regenerating.

Dependencies

Requires BLAST+ tools installed via Bioconda.

Mycelia.extract_pacbiosample_information — Method

extract_pacbiosample_information(
    xml
) -> DataFrames.DataFrame

Extract biosample and barcode information from a PacBio XML metadata file.

Arguments

xml: Path to PacBio XML metadata file

Returns

DataFrame with two columns:

BioSampleName: Name of the biological sample
BarcodeName: Associated DNA barcode identifier

Mycelia.extract_typed_sequence — Method

extract_typed_sequence(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record},
    sequence_type::Type{<:BioSequences.BioSequence}
) -> Any

Extract sequence from FASTX record using dynamically determined type.

This function provides type-safe sequence extraction by using the appropriate BioSequence type, avoiding hardcoded sequence types and string conversions.

Arguments

record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}: Input sequence record
sequence_type::Type{<:BioSequences.BioSequence}: Target BioSequence type

Returns

BioSequences.BioSequence: Typed sequence from the record

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
seq_type = alphabet_to_biosequence_type(:DNA)
sequence = extract_typed_sequence(record, seq_type)

Mycelia.fasta_and_gff_to_genbank — Method

fasta_and_gff_to_genbank(; fasta, gff, genbank)

Convert FASTA sequence and GFF annotation files to GenBank format using EMBOSS seqret.

Arguments

fasta::String: Path to input FASTA file containing sequence data
gff::String: Path to input GFF file containing genomic features
genbank::String: Path for output GenBank file

Details

Requires EMBOSS toolkit (installed via Bioconda). The function will:

Create necessary output directories
Run seqret to combine sequence and features
Generate a GenBank format file at the specified location

Mycelia.fasta_genome_size — Method

fasta_genome_size(fasta_file) -> Any

Calculate the total size (in bases) of all sequences in a FASTA file.

Arguments

fasta_file::AbstractString: Path to the FASTA file

Returns

Int: Sum of lengths of all sequences in the FASTA file

Mycelia.fasta_list_to_dense_kmer_counts — Method

fasta_list_to_dense_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    result_file,
    force,
    cleanup_temp
)

Create a dense k-mer counts table for a set of FASTA files, with disk-backed temporary storage, custom element type, robust error handling, and optional output file caching.

Mycelia.fasta_list_to_sparse_kmer_counts — Method

fasta_list_to_sparse_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    rarefaction_data_filename,
    rarefaction_plot_basename,
    show_rarefaction_plot,
    result_file,
    out_dir,
    force,
    rarefaction_plot_kwargs...
)

Create a sparse kmer counts table (SparseMatrixCSC) from a list of FASTA files using a 3-pass approach:

Pass 1 (Parallel): Counts kmers per file and writes to temporary JLD2 files.
Pass 2 (Serial): Aggregates unique kmers, max count, nnz per file, and rarefaction data from temp files. Generates and saves a k-mer rarefaction plot.
Pass 3 (Parallel): Reads temporary counts again to construct the final sparse matrix.

Optionally, a results filename can be provided to save/load the output. If the file exists and force is false, the result is loaded and returned. If force is true or the file does not exist, results are computed and saved.
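
The three-pass structure can be illustrated entirely in memory. The sketch below is a simplification under stated assumptions: it takes sequences as strings instead of FASTA paths, keeps per-input counts in memory instead of temporary JLD2 files, runs serially, uses `Int` counts instead of the smallest unsigned type, and skips rarefaction. All function names are hypothetical.

```julia
import SparseArrays

# Count k-mers in one sequence (stand-in for counting one FASTA file).
function count_kmers_sketch(sequence::AbstractString, k::Integer)
    counts = Dict{String,Int}()
    for i in 1:(length(sequence) - k + 1)
        kmer = String(sequence[i:i+k-1])
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

function sparse_kmer_counts_sketch(sequences::Vector{String}, k::Integer)
    # Pass 1: count k-mers independently per input.
    per_input = [count_kmers_sketch(seq, k) for seq in sequences]
    # Pass 2: aggregate the sorted set of unique k-mers across all inputs.
    all_kmers = Set{String}()
    for counts in per_input
        union!(all_kmers, keys(counts))
    end
    kmers = sort!(collect(all_kmers))
    row_of = Dict(kmer => i for (i, kmer) in enumerate(kmers))
    # Pass 3: revisit the per-input counts to fill a k-mers x inputs sparse matrix.
    rows = Int[]; cols = Int[]; vals = Int[]
    for (j, counts) in enumerate(per_input), (kmer, n) in counts
        push!(rows, row_of[kmer]); push!(cols, j); push!(vals, n)
    end
    matrix = SparseArrays.sparse(rows, cols, vals, length(kmers), length(sequences))
    return (kmers = kmers, counts = matrix)
end

result = sparse_kmer_counts_sketch(["ACGTAC", "CGTT"], 3)
```

Rows follow the sorted k-mer order and columns follow the input order, so `result.counts[i, j]` is the count of `result.kmers[i]` in input j.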

Output Directory Behavior

All auxiliary output files (e.g., rarefaction data, plots) are written to a common output directory.
By default, this is:
- The value of out_dir if provided.
- Otherwise, the directory containing result_file (if provided and has a directory component).
- Otherwise, the current working directory (pwd()).
If you provide an absolute path for an output file (e.g. rarefaction_data_filename), that path is used directly.
If both out_dir and a relative filename are given, the file is written to out_dir.

Arguments

fasta_list::AbstractVector{<:AbstractString}: A list of paths to FASTA files.
k::Integer: The length of the kmer.
alphabet::Symbol: The alphabet type (:AA, :DNA, :RNA).
temp_dir_parent::AbstractString: Parent directory for creating the temporary working directory. Defaults to Base.tempdir().
count_element_type::Union{Type{<:Unsigned}, Nothing}: Optional. Specifies the unsigned integer type for the counts. If nothing (default), the smallest UInt type capable of holding the maximum observed count is used.
rarefaction_data_filename::AbstractString: Filename for the TSV output of rarefaction data. If a relative path, will be written to out_dir.
rarefaction_plot_basename::AbstractString: Basename for the output rarefaction plots. If a relative path, will be written to out_dir.
show_rarefaction_plot::Bool: Whether to display the rarefaction plot after generation. Defaults to true.
result_file::Union{Nothing, AbstractString}: Optional. If provided, path to a file to save/load the full results (kmers, counts, etc) as a JLD2 file.
out_dir::Union{Nothing, AbstractString}: Optional. Output directory for auxiliary outputs. Defaults as described above.
force::Bool: If true, recompute and overwrite the output file even if it exists. Defaults to false.
rarefaction_plot_kwargs...: Keyword arguments to pass to plot_kmer_rarefaction for plot customization.

Returns

NamedTuple{(:kmers, :counts, :rarefaction_data_path)}:
- kmers: A sorted Vector of unique kmer objects.
- counts: A SparseArrays.SparseMatrixCSC{V, Int} storing kmer counts.
- rarefaction_data_path: Path to the saved TSV file with rarefaction data.

Raises

ErrorException: If input fasta_list is empty, alphabet is invalid, or required Kmer/counting functions are not found.

Mycelia.fasta_table_to_fasta — Method

fasta_table_to_fasta(fasta_df) -> Any

Convert a DataFrame containing FASTA sequence information into a vector of FASTA records.

Arguments

fasta_df::DataFrame: DataFrame with columns "identifier", "description", and "sequence"

Returns

Vector{FASTX.FASTA.Record}: Vector of FASTA records

Mycelia.fasta_to_reference_kmer_counts — Method

fasta_to_reference_kmer_counts(; kmer_type, fasta)

Counts k-mer occurrences in a FASTA file, considering both forward and reverse complement sequences.

Arguments

kmer_type: Type specification for k-mers (e.g., DNAKmer{21})
fasta: Path to FASTA file containing reference sequences

Returns

Dict{kmer_type, Int}: Dictionary mapping each k-mer to its total count across all sequences
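
Counting on both strands can be sketched with plain strings. This is one possible reading of "both forward and reverse complement", in which each strand contributes its own k-mers to the totals; Mycelia's method may instead canonicalize k-mers, and the function names here are hypothetical.

```julia
# Reverse complement of an uppercase DNA string (A, C, G, T only).
function reverse_complement_sketch(sequence::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[base] for base in reverse(sequence))
end

# Count k-mers observed on the forward strand and on its reverse complement.
function both_strand_kmer_counts_sketch(sequence::AbstractString, k::Integer)
    counts = Dict{String,Int}()
    for strand in (String(sequence), reverse_complement_sketch(sequence))
        for i in 1:(length(strand) - k + 1)
            kmer = strand[i:i+k-1]
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end

strand_counts = both_strand_kmer_counts_sketch("AACG", 2)
```

Here the forward strand AACG contributes AA, AC, and CG, and its reverse complement CGTT contributes CG, GT, and TT, so CG is counted twice.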

Mycelia.fasta_to_table — Method

fasta_to_table(fasta) -> DataFrames.DataFrame

Convert a FASTA file/record iterator to a DataFrame.

Arguments

fasta: FASTA record iterator from FASTX.jl

Returns

DataFrame with columns:
- identifier: Sequence identifiers
- description: Full sequence descriptions
- sequence: Biological sequences as strings

Mycelia.fasta_xam_mapping_stats — Method

fasta_xam_mapping_stats(; fasta, xam)

Calculate mapping statistics by comparing sequence alignments (BAM/SAM) to a reference FASTA.

Arguments

fasta::String: Path to reference FASTA file
xam::String: Path to alignment file (BAM or SAM format)

Returns

DataFrame with columns:

contig: Reference sequence name
contig_length: Length of reference sequence
total_aligned_bases: Total number of bases aligned to reference
mean_depth: Average depth of coverage (total_aligned_bases / contig_length)

Mycelia.fastani_list — Method

fastani_list(
;
    query_list,
    reference_list,
    outfile,
    threads,
    force
) -> Union{Nothing, Base.Process}

Calculate Average Nucleotide Identity (ANI) between the genomes in a query list and a reference list using FastANI.

Arguments

query_list::String: Path to file containing list of query genome paths (one per line)
reference_list::String: Path to file containing list of reference genome paths (one per line)
outfile::String: Path to output file that will contain ANI results
threads::Int=Sys.CPU_THREADS: Number of parallel threads to use
force::Bool=false: If true, rerun analysis even if output file exists

Output

Generates a tab-delimited file with columns:

Query genome
Reference genome
ANI value (%)
Count of bidirectional fragment mappings
Total query fragments

Notes

Requires FastANI to be available via Bioconda
Automatically sets up required conda environment

Mycelia.fasterq_dump — Method

fasterq_dump(
;
    outdir,
    srr_identifier
) -> NamedTuple{(:forward_reads, :reverse_reads, :unpaired_reads), <:Tuple{Union{Missing, String}, Union{Missing, String}, Union{Missing, String}}}

Download and compress sequencing reads from the SRA database using fasterq-dump.

Arguments

outdir::String="": Output directory for the FASTQ files. Defaults to current directory.
srr_identifier::String="": SRA run accession number (e.g., "SRR12345678")

Returns

Named tuple containing paths to the generated files:

forward_reads: Path to forward reads file (*_1.fastq.gz) or missing
reverse_reads: Path to reverse reads file (*_2.fastq.gz) or missing
unpaired_reads: Path to unpaired reads file (*.fastq.gz) or missing

Outputs

Creates compressed FASTQ files in the output directory:

{srr_identifier}_1.fastq.gz: Forward reads (for paired-end data)
{srr_identifier}_2.fastq.gz: Reverse reads (for paired-end data)
{srr_identifier}.fastq.gz: Unpaired reads (for single-end data)

Dependencies

Requires:

fasterq-dump from the SRA Toolkit (installed via Conda)
gzip for compression

Notes

Skips download if output files already exist
Uses up to 4 threads or system maximum, whichever is lower
Allocates 1GB memory for processing
Skips technical reads
Handles both paired-end and single-end data automatically

Mycelia.fasterq_dump_parallel — Method

Parallel FASTQ dump for multiple SRA files.

Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.

Arguments

srr_identifiers: Vector of SRA run identifiers
outdir: Output directory for FASTQ files (default: current directory)
max_parallel: Maximum number of parallel conversions (default: 2)

Returns

Vector of named tuples with conversion results

Example

runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")

Mycelia.fastq_record — Method

fastq_record(; identifier, sequence, quality_scores)

Construct a FASTX FASTQ record from its components.

Arguments

identifier::String: The sequence identifier without the '@' prefix
sequence::String: The nucleotide sequence
quality_scores::Vector{Int}: Quality scores (0-93) as raw integers

Returns

FASTX.FASTQRecord: A parsed FASTQ record

Notes

Quality scores are automatically capped at 93 to ensure FASTQ compatibility
Quality scores are converted to ASCII by adding 33 (Phred+33 encoding)
The record is constructed in standard FASTQ format with four lines:
1. Header line (@ + identifier)
2. Sequence
3. Plus line
4. Quality scores (ASCII encoded)
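
The capping and Phred+33 encoding described in the notes can be shown on plain text. The sketch builds the four-line FASTQ text rather than a FASTX record, and the function name is hypothetical.

```julia
# Build FASTQ text from components, encoding qualities as Phred+33 ASCII.
function fastq_text_sketch(identifier::AbstractString, sequence::AbstractString,
                           quality_scores::Vector{Int})
    length(sequence) == length(quality_scores) ||
        throw(ArgumentError("sequence and quality_scores must have equal length"))
    # Cap each score at 93, then add 33 so it maps to a printable ASCII character.
    quality_string = join(Char(min(q, 93) + 33) for q in quality_scores)
    return join(["@" * identifier, sequence, "+", quality_string], "\n")
end

fastq_text = fastq_text_sketch("read1", "ACGT", [0, 40, 93, 100])
```

The cap matters because 93 + 33 = 126 is the tilde, the last printable ASCII character; the score of 100 in the example is encoded the same way as 93.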

Mycelia.fastx2normalized_table — Method

fastx2normalized_table(fastx::AbstractString) -> DataFrames.DataFrame

Read a FASTA or FASTQ file and convert its records into a normalized DataFrames.DataFrame where each row represents a sequence record and columns provide standardized metadata and sequence statistics.

Arguments

fastx::AbstractString: Path to a FASTA or FASTQ file. The file must exist and be non-empty. The file type is inferred from the filename using Mycelia.FASTA_REGEX and Mycelia.FASTQ_REGEX.

Returns

DataFrames.DataFrame: A data frame where each row contains information for a record from the input file, and columns include:
- fastx_path: Basename of the input file.
- fastx_sha256: Aggregated SHA256 hash of all record SHA256s in the file.
- record_identifier: Identifier from the record header.
- record_description: Description from the record header.
- record_sha256: SHA256 hash of the record sequence.
- record_quality: Vector of quality scores (Vector{Float64}) for FASTQ, or missing for FASTA.
- record_alphabet: Sorted, joined string of unique, uppercase characters in the record sequence.
- record_type: Alphabet type detected by Mycelia.detect_alphabet (e.g., :DNA, :RNA, etc.).
- mean_record_quality: Mean quality score (for FASTQ), or missing (for FASTA).
- median_record_quality: Median quality score (for FASTQ), or missing (for FASTA).
- record_length: Length of the sequence.
- record_sequence: The sequence string itself.

Notes

The function asserts that the file exists and is not empty.
File type is determined by regex matching on the filename.
For FASTA files, quality-related columns are set to missing.
For FASTQ files, quality scores are extracted and statistics are computed.
Record SHA256 hashes are aggregated to compute a file-level SHA256 via Mycelia.metasha256.
Requires the following namespaces: DataFrames, Statistics, Mycelia, FASTX, and Base.basename.
The function returns the columns in the order: fastx_path, fastx_sha256, followed by all other record columns.
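
Two of the per-record columns are easy to reproduce from their descriptions. The sketch below uses the SHA standard library and hypothetical function names; whether Mycelia hashes the sequence exactly as written or normalizes it first is not stated here, so the hash is taken over the string as given.

```julia
import SHA

# record_alphabet: sorted, joined string of the unique uppercase characters.
record_alphabet_sketch(sequence::AbstractString) =
    join(sort(unique(uppercase(sequence))))

# record_sha256: hex-encoded SHA256 digest of the sequence string.
record_sha256_sketch(sequence::AbstractString) =
    bytes2hex(SHA.sha256(sequence))

alphabet = record_alphabet_sketch("acgtNa")
digest = record_sha256_sketch("ACGT")
```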

Example

import DataFrames
import Mycelia
import FASTX

table = Mycelia.fastx2normalized_table("example.fasta")
DataFrames.first(table, 3)

Mycelia.fastx_stats — Method

fastx_stats(fastx) -> DataFrames.DataFrame

Calculate basic statistics for FASTQ/FASTA sequence files using seqkit.

Arguments

fastx::String: Path to input FASTQ/FASTA file

Details

Automatically installs and uses seqkit from Bioconda to compute sequence statistics including number of sequences, total bases, GC content, average length, etc.

Dependencies

Requires Conda and Bioconda channel
Installs seqkit package if not present

Returns

A DataFrames.DataFrame containing the statistics table produced by seqkit stats.

For a description of the reported columns, see https://bioinf.shenwei.me/seqkit/usage/#stats

Mycelia.fastx_to_contig_lengths — Method

fastx_to_contig_lengths(
    fastx
) -> OrderedCollections.OrderedDict

Map each sequence record (contig) in a FASTA/FASTQ file to its length.

Arguments

fastx: Path to a FASTA or FASTQ file

Returns

An OrderedCollections.OrderedDict that maps each record identifier to its sequence length.

Mycelia.fastx_to_kmer_graph — Method

fastx_to_kmer_graph(
    KMER_TYPE,
    fastx::AbstractString
) -> MetaGraphs.MetaGraph

Constructs a k-mer graph from a single FASTX file.

Arguments

KMER_TYPE: The k-mer type specification (e.g., DNAKmer{K} where K is k-mer length)
fastx::AbstractString: Path to the input FASTX file (FASTA or FASTQ)

Returns

MetaGraphs.MetaGraph: A graph whose vertices are k-mers and whose edges represent overlaps between them

Mycelia.fastx_to_kmer_graph — Method

fastx_to_kmer_graph(
    KMER_TYPE,
    fastxs::AbstractVector{<:AbstractString}
) -> MetaGraphs.MetaGraph

Construct an in-memory kmer-graph from one or more FASTX files (FASTA/FASTQ). The graph records:

all kmers
kmer counts
all observed edges between kmers
edge orientations
edge counts

Arguments

KMER_TYPE: Type for kmer representation (e.g., DNAKmer{K})
fastxs: Vector of paths to FASTX files

Returns

A MetaGraph where:

Vertices represent unique kmers with properties:
- :kmer => The kmer sequence
- :count => Number of occurrences
Edges represent observed kmer adjacencies with properties:
- :orientation => Relative orientation of connected kmers
- :count => Number of observed transitions

Mycelia.fibonacci_numbers_less_than — Method

fibonacci_numbers_less_than(
    n::Int64
) -> Union{Vector{Any}, Vector{Int64}}

Generate a sequence of Fibonacci numbers strictly less than the input value.

Arguments

n::Int: Upper bound (exclusive) for the Fibonacci sequence

Returns

Vector{Int}: Array containing Fibonacci numbers less than n
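
A minimal sketch of the idea, not Mycelia's implementation. The helper name fib_less_than is hypothetical, and the sketch assumes the sequence starts at 0, 1, 1, 2, which the docstring does not state.

```julia
# Hypothetical helper: collect Fibonacci numbers strictly below n.
# Assumption: the sequence starts 0, 1, 1, 2, ...
function fib_less_than(n::Int)
    result = Int[]
    a, b = 0, 1
    while a < n
        push!(result, a)
        a, b = b, a + b   # advance to the next pair of Fibonacci numbers
    end
    return result
end

fibs = fib_less_than(10)
```

Because the bound is exclusive, a call with n equal to a Fibonacci number leaves that number out of the result.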

Mycelia.filesize_human_readable — Method

filesize_human_readable(f) -> Any

Gets the size of a file and returns it in a human-readable format.

Arguments

f: The path to the file (any AbstractString).

Returns

A string representing the file size in a human-readable format (e.g., "3.400 MiB").

Details

This function internally uses filesize(f) to get the file size in bytes, then leverages Base.format_bytes to convert it into a human-readable format with binary units (KiB, MiB, GiB, etc.).

Examples

julia> filesize_human_readable("my_image.jpg")
"2.150 MiB"

See Also

filesize: Gets the size of a file in bytes.
Base.format_bytes: Converts a byte count into a human-readable string.
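
The two building blocks named above can be tried directly. This is a sketch of the underlying calls rather than the Mycelia function itself; note that Base.format_bytes is an unexported Base helper, so its exact output format may differ between Julia versions.

```julia
# Write a small temporary file so the example does not depend on existing data.
path, io = mktemp()
write(io, zeros(UInt8, 2048))   # 2048 bytes = 2 KiB
close(io)

n_bytes = filesize(path)               # size in bytes
label = Base.format_bytes(n_bytes)     # human-readable string with binary units

rm(path)
```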

Mycelia.finalize_assembly — Function

Finalize assembly by combining information from all k-mer sizes. Phase 5.1b: Enhanced with accuracy metrics and reward tracking.

Mycelia.finalize_cross_validation — Method

Finalize cross-validation results with comprehensive summary.

Mycelia.finalize_iterative_assembly — Method

Finalize iterative assembly by combining results from all k-mer sizes and iterations.

Mycelia.find_bubble_paths — Method

Find potential bubble paths from an entry vertex.

Mycelia.find_connected_components — Method

Find all connected components in the graph. Returns a vector of vectors of vertex indices.

Mycelia.find_connected_components — Method

Specialized method for directed graphs using weakly connected components.

Mycelia.find_connected_components — Method

Specialized method for undirected graphs.

Mycelia.find_contigs_next — Method

find_contigs_next(graph::MetaGraphsNext.MetaGraph, min_contig_length::Int=500) -> Vector{ContigPath}

Extract linear contigs from the assembly graph.

Mycelia.find_eulerian_path_from_vertex — Method

Find Eulerian path starting from a specific vertex using Hierholzer's algorithm.
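
For readers unfamiliar with Hierholzer's algorithm, here is a compact sketch on a plain adjacency dictionary. The function name hierholzer_path and the Dict-based graph are assumptions made for illustration; the Mycelia method operates on its own graph types.

```julia
# Hypothetical sketch of Hierholzer's algorithm on a directed multigraph.
# `adjacency` maps each vertex to its successors. It is copied so the caller's
# graph is not mutated. Assumes an Eulerian path starting at `start` exists.
function hierholzer_path(adjacency::Dict{String, Vector{String}}, start::String)
    remaining = Dict(v => copy(succ) for (v, succ) in adjacency)
    stack = [start]
    path = String[]
    while !isempty(stack)
        v = stack[end]
        out_edges = get(remaining, v, String[])
        if isempty(out_edges)
            # Every edge out of v is used, so v is emitted (in reverse order).
            push!(path, pop!(stack))
        else
            # Follow and consume one unused outgoing edge.
            push!(stack, pop!(out_edges))
        end
    end
    return reverse(path)
end

graph = Dict("A" => ["B"], "B" => ["C", "D"], "C" => ["B"], "D" => String[])
path = hierholzer_path(graph, "A")
```

The returned path has one more vertex than the graph has edges, since each edge is traversed exactly once.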

Mycelia.find_eulerian_paths_next — Method

find_eulerian_paths_next(graph::MetaGraphsNext.MetaGraph) -> Vector{Vector{String}}

Find Eulerian paths in the assembly graph. An Eulerian path visits every edge exactly once.

Mycelia.find_eulerian_start_vertices — Method

Find valid starting vertices for Eulerian paths.

Mycelia.find_fasta_files — Method

find_fasta_files(input_path::String) -> Vector{String}

Find all FASTA files in a directory or return single file if path is a file.

Uses the existing FASTA_REGEX constant to identify FASTA files.

Arguments

input_path: Path to directory or single FASTA file

Returns

Vector of FASTA file paths

Example

fasta_files = find_fasta_files("./genomes/")

Mycelia.find_initial_k — Method

Find the optimal starting k-mer size using sparsity detection. Only considers prime k-mer sizes for optimal performance.

Mycelia.find_limited_path — Method

Find a limited-length path from a starting vertex.

Mycelia.find_linear_path — Method

Find a linear path through the graph.

Mycelia.find_matching_prefix — Method

find_matching_prefix(
    filename1::String,
    filename2::String;
    strip_trailing_delimiters
) -> String

Find the longest common prefix between two filenames.

Arguments

filename1::String: First filename to compare
filename2::String: Second filename to compare

Keywords

strip_trailing_delimiters::Bool=true: If true, removes trailing dots, hyphens, and underscores from the result

Returns

String: The longest common prefix found between the filenames
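
The behaviour described above can be sketched in a few lines. The name common_prefix is hypothetical and this is an illustrative re-implementation, not the package's code.

```julia
# Hypothetical sketch: longest common prefix of two filenames, optionally
# trimming trailing '.', '-' and '_' characters from the result.
function common_prefix(a::String, b::String; strip_trailing_delimiters::Bool=true)
    chars = Char[]
    for (ca, cb) in zip(a, b)
        ca == cb || break        # stop at the first mismatching character
        push!(chars, ca)
    end
    prefix = String(chars)
    if strip_trailing_delimiters
        prefix = String(rstrip(prefix, ['.', '-', '_']))
    end
    return prefix
end

shared = common_prefix("sample_1.fastq", "sample_2.fastq")
raw = common_prefix("sample_1.fastq", "sample_2.fastq"; strip_trailing_delimiters=false)
```

Stripping delimiters is useful when deriving an output name from paired files, where the raw prefix would otherwise end in an underscore or dot.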

Mycelia.find_nonempty_columns — Method

find_nonempty_columns(df) -> Any

Identify all columns that contain at least one non-missing, non-empty value.

Returns a bit array with one entry per column.

See also: drop_empty_columns, drop_empty_columns!

Mycelia.find_optimal_sequence_path — Method

Find optimal sequence path through graph using maximum likelihood principles. Returns improved sequence and likelihood improvement score.

Mycelia.find_path_convergence — Method

Find where two paths converge.

Mycelia.find_primes_in_range — Method

Find all primes in a range (convenience function).

Mycelia.find_quality_weighted_path — Method

find_quality_weighted_path(
    graph,
    start_vertex;
    max_path_length
) -> Vector

Find a quality-weighted path through a qualmer graph starting from a given vertex. Uses joint probability as the primary weighting factor for path selection.

Arguments

graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
start_vertex: Starting vertex for path traversal
max_path_length::Int=1000: Maximum path length to prevent infinite loops

Returns

Vector{Int}: Path as sequence of vertex indices

Details

At each step, selects the unvisited neighbor with the highest joint probability. Terminates when no unvisited neighbors are available or max length is reached.
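
The greedy selection rule can be illustrated without a qualmer graph. In this sketch the adjacency dictionary, the joint_prob dictionary, and the function name greedy_quality_path are all stand-ins invented for the example.

```julia
# Hypothetical greedy walk. `neighbors` and `joint_prob` stand in for the
# qualmer graph's adjacency and per-vertex joint probabilities.
function greedy_quality_path(neighbors::Dict{Int, Vector{Int}},
                             joint_prob::Dict{Int, Float64},
                             start_vertex::Int; max_path_length::Int=1000)
    path = [start_vertex]
    visited = Set(path)
    while length(path) < max_path_length
        candidates = [v for v in get(neighbors, path[end], Int[]) if !(v in visited)]
        isempty(candidates) && break
        # Choose the unvisited neighbor with the highest joint probability.
        best = candidates[argmax([joint_prob[v] for v in candidates])]
        push!(path, best)
        push!(visited, best)
    end
    return path
end

neighbors = Dict(1 => [2, 3], 2 => [4], 3 => [4], 4 => Int[])
joint_prob = Dict(1 => 0.99, 2 => 0.60, 3 => 0.95, 4 => 0.90)
path = greedy_quality_path(neighbors, joint_prob, 1)
```

A greedy walk like this is fast but only locally optimal: it never revisits a choice, so a high-probability first step can lead into a low-probability region.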

Mycelia.find_repeat_candidates — Method

Find vertices that could be part of repeat regions.

Mycelia.find_resampling_stretches — Method

find_resampling_stretches(
;
    record_kmer_solidity,
    solid_branching_kmer_indices
)

Identifies sequence regions that require resampling based on kmer solidity patterns.

Arguments

record_kmer_solidity::BitVector: Boolean array where true indicates solid kmers
solid_branching_kmer_indices::Vector{Int}: Indices of solid branching kmers

Returns

Vector{UnitRange{Int64}}: Array of ranges (start:stop) indicating stretches that need resampling

Details

Finds continuous stretches of non-solid kmers and extends them to the nearest solid branching kmers on either side. These stretches represent regions that need resampling.

If a stretch doesn't have solid branching kmers on both sides, it is excluded from the result. Duplicate ranges are removed from the final output.
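
The extension rule can be sketched as follows. This is an interpretation of the description above under two assumptions: the branching indices are sorted in ascending order, and "nearest" means the closest index strictly outside the non-solid run. The name resampling_stretches is hypothetical.

```julia
# Hypothetical sketch: extend each run of non-solid k-mers to the nearest
# solid branching k-mer on both sides. Assumes `branching` is sorted ascending.
function resampling_stretches(solidity::AbstractVector{Bool}, branching::Vector{Int})
    stretches = UnitRange{Int}[]
    n = length(solidity)
    i = 1
    while i <= n
        if solidity[i]
            i += 1
            continue
        end
        j = i
        while j < n && !solidity[j + 1]
            j += 1
        end
        # i:j is now a maximal run of non-solid k-mers.
        left = findlast(<(i), branching)
        right = findfirst(>(j), branching)
        if left !== nothing && right !== nothing
            push!(stretches, branching[left]:branching[right])
        end
        i = j + 1
    end
    return unique(stretches)
end

solidity = Bool[1, 1, 0, 0, 1, 1, 0, 1, 0]
stretches = resampling_stretches(solidity, [2, 5, 8])
```

In this example the final non-solid k-mer at position 9 has no branching k-mer to its right, so it produces no stretch.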

Mycelia.find_true_ranges — Method

find_true_ranges(
    bool_vec::AbstractVector{Bool};
    min_length
) -> Vector

Finds contiguous ranges of true values in a boolean vector.

Arguments

bool_vec::AbstractVector{Bool}: Input boolean vector to analyze
min_length=1: Minimum length requirement for a range to be included

Returns

Vector of tuples (start, end) where each tuple represents the indices of a contiguous range of true values meeting the minimum length requirement.
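
A small sketch of the run-finding logic, written independently of the package. The name true_ranges is hypothetical, and the sketch assumes a 1-indexed vector.

```julia
# Hypothetical sketch: report (start, end) index pairs for each run of `true`
# that is at least `min_length` long.
function true_ranges(bool_vec::AbstractVector{Bool}; min_length::Int=1)
    ranges = Tuple{Int, Int}[]
    run_start = 0                      # 0 means "not currently inside a run"
    for (i, value) in enumerate(bool_vec)
        if value && run_start == 0
            run_start = i
        elseif !value && run_start != 0
            i - run_start >= min_length && push!(ranges, (run_start, i - 1))
            run_start = 0
        end
    end
    # Close a run that extends to the end of the vector.
    if run_start != 0 && length(bool_vec) - run_start + 1 >= min_length
        push!(ranges, (run_start, length(bool_vec)))
    end
    return ranges
end

flags = [false, true, true, false, true]
all_runs = true_ranges(flags)
long_runs = true_ranges(flags; min_length=2)
```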

Mycelia.first_of_each_group — Method

first_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Return a DataFrame containing the first row from each group in a GroupedDataFrame.

Arguments

gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

DataFrame: A new DataFrame containing first row from each group

Mycelia.frequency_matrix_to_bray_curtis_distance_matrix — Method

frequency_matrix_to_bray_curtis_distance_matrix(counts_table)

Pairwise Bray-Curtis distance between columns of counts_table.
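
For reference, the Bray-Curtis dissimilarity between two count vectors u and v is sum(|u - v|) / sum(u + v). The sketch below applies that formula column by column; bray_curtis_matrix is a hypothetical name and this is not the package's implementation.

```julia
# Hypothetical sketch: pairwise Bray-Curtis dissimilarity between columns.
# Two all-zero columns would divide by zero and yield NaN.
function bray_curtis_matrix(counts::AbstractMatrix{<:Real})
    n = size(counts, 2)
    D = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        a = view(counts, :, i)
        b = view(counts, :, j)
        d = sum(abs.(a .- b)) / sum(a .+ b)
        D[i, j] = d
        D[j, i] = d
    end
    return D
end

counts = [10 0; 0 10; 5 5]   # 3 features (rows) x 2 samples (columns)
D = bray_curtis_matrix(counts)
```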

Mycelia.frequency_matrix_to_cosine_distance_matrix — Method

frequency_matrix_to_cosine_distance_matrix(probability_matrix)

Pairwise cosine distance between columns of probability_matrix.

Mycelia.frequency_matrix_to_euclidean_distance_matrix — Method

frequency_matrix_to_euclidean_distance_matrix(counts_table)

Pairwise Euclidean distance between columns of counts_table.

Mycelia.frequency_matrix_to_jaccard_distance_matrix — Method

frequency_matrix_to_jaccard_distance_matrix(matrix)

Thresholds the input matrix at >0 to obtain a binary matrix, then computes pairwise Jaccard distance between columns. Accepts any numeric matrix.
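
The threshold-then-compare procedure can be sketched as below. The name jaccard_matrix is hypothetical, and treating two all-absent columns as distance 0 is an assumption of this sketch.

```julia
# Hypothetical sketch: binarize at > 0, then compute 1 - |A ∩ B| / |A ∪ B|
# for every pair of columns.
function jaccard_matrix(M::AbstractMatrix{<:Real})
    present = M .> 0
    n = size(present, 2)
    D = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        a = view(present, :, i)
        b = view(present, :, j)
        shared = count(a .& b)
        total = count(a .| b)
        # Assumption: two columns with no features present are identical.
        d = total == 0 ? 0.0 : 1 - shared / total
        D[i, j] = d
        D[j, i] = d
    end
    return D
end

M = [3 0; 1 2; 0 5; 4 1]
D = jaccard_matrix(M)
```

Because only presence or absence matters, the magnitudes 3, 4 and 5 in the example have no effect on the result.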

Mycelia.frequency_matrix_to_jensen_shannon_distance_matrix — Method

frequency_matrix_to_jensen_shannon_distance_matrix(probability_matrix)

Pairwise Jensen-Shannon divergence between columns of probability_matrix.

Arguments

probability_matrix: Matrix where each column is a probability distribution (sums to 1.0).

Returns

Symmetric matrix of Jensen-Shannon divergence values between columns.
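
As a worked illustration, the Jensen-Shannon divergence of two distributions P and Q is the average of their Kullback-Leibler divergences to the midpoint M = (P + Q) / 2. The sketch assumes the natural logarithm and treats zero-probability terms as contributing 0; the source does not state which log base Mycelia uses, and js_divergence is a hypothetical name.

```julia
# Hypothetical sketch of the Jensen-Shannon divergence between two
# probability vectors, using the natural logarithm (an assumption).
function js_divergence(p::AbstractVector{<:Real}, q::AbstractVector{<:Real})
    m = (p .+ q) ./ 2
    # KL divergence, with the convention 0 * log(0 / y) = 0.
    kl(a, b) = sum(x > 0 ? x * log(x / y) : 0.0 for (x, y) in zip(a, b))
    return (kl(p, m) + kl(q, m)) / 2
end

probability_matrix = [1.0 0.0 0.5; 0.0 1.0 0.5]   # each column sums to 1
d12 = js_divergence(probability_matrix[:, 1], probability_matrix[:, 2])
d11 = js_divergence(probability_matrix[:, 1], probability_matrix[:, 1])
```

With the natural logarithm the divergence is bounded above by log(2), reached for distributions with disjoint support. Its square root is the Jensen-Shannon distance, which is a true metric.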

Mycelia.gamma_pca_epca — Method

gamma_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Gamma EPCA on a matrix M (features × samples).

When to use

Use for positive continuous data, such as rates, times, or strictly positive measurements.

Keyword arguments

k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

model : the fitted ExpFamilyPCA.GammaEPCA object
scores : k×n_samples matrix of sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.gaussian_pca_epca — Method

gaussian_pca_epca(M::AbstractMatrix{<:Real}; k::Int=0)

Perform Gaussian EPCA on a matrix M (features × samples).

When to use

Use for real-valued continuous data (centered, can be negative or positive), such as normalized or standardized measurements.

Keyword arguments

k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)

Returns

NamedTuple with fields

model : the fitted ExpFamilyPCA.GaussianEPCA object
scores : k×n_samples matrix of sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.genbank_to_codon_frequencies — Method

genbank_to_codon_frequencies(
    genbank;
    allow_all
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Int64}}

Analyze codon usage frequencies from genes in a GenBank file.

Arguments

genbank: Path to GenBank format file containing genomic sequences and annotations
allow_all: If true, initializes frequencies for all possible codons with count=1 (default: true)

Returns

Nested dictionary mapping amino acids to their corresponding codon usage counts:

Outer key: AminoAcid (including stop codon)
Inner key: DNACodon
Value: Count of codon occurrences

Details

Only processes genes marked as ':misc_feature' in the GenBank file
Analyzes both forward and reverse complement sequences
Determines coding strand based on presence of stop codons and start codons
Skips ambiguous sequences that cannot be confidently oriented

Mycelia.genbank_to_fasta — Method

genbank_to_fasta(; genbank, fasta, force)

Convert a GenBank format file to FASTA format using EMBOSS seqret.

Arguments

genbank: Path to input GenBank format file
fasta: Optional output FASTA file path (defaults to input path with .fna extension)
force: If true, overwrites existing output file (defaults to false)

Returns

Path to the output FASTA file

Notes

Requires EMBOSS suite (installed automatically via Conda)
Will not regenerate output if it already exists unless force=true

Mycelia.generate_all_possible_canonical_kmers — Method

generate_all_possible_canonical_kmers(k, alphabet) -> Any

Generate all possible canonical k-mers of length k from the given alphabet.

For DNA/RNA sequences, returns unique canonical k-mers where each k-mer is represented by the lexicographically smaller of itself and its reverse complement. For amino acid sequences, returns all possible k-mers without canonicalization.

Arguments

k: Length of k-mers to generate
alphabet: Vector of BioSymbols (DNA, RNA or AminoAcid)

Returns

Vector of k-mers, canonicalized for DNA/RNA alphabets
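
The canonicalization rule can be demonstrated with plain strings. This sketch is limited to the unambiguous DNA alphabet ACGT and uses hypothetical helper names; the package itself works with BioSymbols and Kmers types.

```julia
# Hypothetical string-based sketch of canonical DNA k-mer enumeration.
function reverse_complement(kmer::String)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return String([complement[c] for c in reverse(kmer)])
end

function canonical_dna_kmers(k::Int)
    # Enumerate all 4^k k-mers, then keep the lexicographically smaller of
    # each k-mer and its reverse complement.
    kmers = vec([join(chars) for chars in Iterators.product(fill("ACGT", k)...)])
    return sort(unique([min(kmer, reverse_complement(kmer)) for kmer in kmers]))
end

canonical_2mers = canonical_dna_kmers(2)
canonical_3mers = canonical_dna_kmers(3)
```

For even k some k-mers are their own reverse complement (AT, TA, CG and GC when k = 2), so the canonical set is slightly larger than half of 4^k. For odd k no such palindromes exist and the set is exactly half.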

Mycelia.generate_all_possible_kmers — Method

generate_all_possible_kmers(k, alphabet) -> Any

Generate a sorted list of all possible k-mers for a given alphabet.

Arguments

k::Integer: Length of k-mers to generate
alphabet: Collection of symbols (DNA, RNA, or amino acids) from BioSymbols

Returns

Sorted Vector of Kmers of the appropriate type (DNA, RNA, or amino acid)

Mycelia.generate_alternative_qualmer_paths — Method

Generate alternative qualmer paths through the graph using quality-aware probabilistic sampling.

Mycelia.generate_and_save_kmer_counts — Method

generate_and_save_kmer_counts(; 
    bioalphabet, 
    fastas, 
    k, 
    output_dir=pwd(),
    filename=nothing
)

Generates and saves k-mer counts for a list of FASTA files for a single k.

Keyword Arguments

bioalphabet: Alphabet type (e.g., :DNA).
fastas: List of FASTA file paths.
k: Value of k (e.g., 9).
output_dir: (optional) Directory to write output files (default: current directory).
filename: (optional) Full file name for output (default: "{Mycelia.normalized_current_date()}.{lowercase(string(bioalphabet))}{k}mers.jld2").

Output

Saves a .jld2 file with the specified file name in output_dir if it does not already exist.

Mycelia.generate_binary_matrix — Method

Generate a binary (Bernoulli) matrix with given dimensions and probability.

Arguments

n_features::Int: Number of features (rows)
n_samples::Int: Number of samples (columns)
p::Float64: Probability of 1 in the Bernoulli distribution

Returns

Matrix{Bool}: Binary matrix with dimensions (n_features, n_samples)

Mycelia.generate_consensus_pangenome — Method

Generate consensus pangenome from cross-validation results.

Mycelia.generate_contig_sequence — Method

Generate sequence for a contig path.

Mycelia.generate_coverage_profile — Method

Generate coverage profile for a contig path.

Mycelia.generate_kmer_alternatives — Method

Generate alternative k-mers for improvement attempts using proper k-mer objects.

Mycelia.generate_paired_end_reads — Method

generate_paired_end_reads(reference_seq, coverage, read_length, insert_size; error_rate=0.01)

Generate realistic paired-end sequencing reads from a reference sequence.

Simulates paired-end Illumina sequencing with realistic insert sizes, read lengths, and optional sequencing errors for assembly benchmarking.

Arguments

reference_seq: Reference sequence (BioSequences.LongDNA{4})
coverage: Target sequencing coverage depth
read_length: Length of each read in base pairs
insert_size: Insert size between paired reads
error_rate: Sequencing error rate (default: 0.01)

Returns

Tuple of (forward_reads, reverse_reads) as vectors of BioSequences.LongDNA{4}

See Also

simulate_illumina_paired_reads: For more sophisticated read simulation using ART
introduce_sequencing_errors: For adding realistic sequencing errors

Mycelia.generate_poisson_matrix — Method

Generate a Poisson matrix with given dimensions and rate parameter.

Arguments

n_features::Int: Number of features (rows)
n_samples::Int: Number of samples (columns)
λ::Float64: Rate parameter for the Poisson distribution

Returns

Matrix{Int}: Poisson matrix with dimensions (n_features, n_samples)

Mycelia.generate_polished_sequence — Method

generate_polished_sequence(states::Vector{ViterbiState}, observations::Vector{String}, 
                          config::ViterbiConfig) -> (String, Vector{Tuple{Int, String, String}})

Generate polished sequence and track corrections made.

Mycelia.generate_prime_k_sequence — Function

Generate sequence of prime k-mer sizes starting from min_k.

Mycelia.generate_recommendation_reasoning — Method

Generate reasoning for assembly approach recommendation.

Mycelia.generate_taxa_abundances_plot — Method

generate_taxa_abundances_plot(
    joint_reads_to_taxon_lineage_table::DataFrames.DataFrame;
    taxa_level::String,
    top_n::Int = 30,
    kwargs...
)

Convenience wrapper function to generate taxa abundance visualization with default parameters and save to a file if requested.

Arguments

joint_reads_to_taxon_lineage_table: DataFrame with sample_id and taxonomic assignments
taxa_level: Taxonomic level to analyze
top_n: Number of top taxa to display individually
kwargs...: Additional parameters to pass to plot_taxa_abundances

Returns

fig: CairoMakie figure object
ax: CairoMakie axis object
taxa_colors: Dictionary mapping taxa to their assigned colors

Mycelia.generate_test_fastq_data — Method

generate_test_fastq_data(n_reads::Int, read_length::Int, filename::String)

Generate test FASTQ data for benchmarking purposes.

Creates a FASTQ file with random DNA sequences and realistic quality scores suitable for performance testing and validation.

Arguments

n_reads::Int: Number of reads to generate
read_length::Int: Length of each read in base pairs
filename::String: Output filename for the FASTQ file

Details

Generates random DNA sequences using BioSequences.randdnaseq
Assigns realistic quality scores (Phred+33 encoding, range 20-40)
Uses existing Mycelia I/O functions for consistency

See Also

Mycelia.write_fastq: For writing FASTQ records
Mycelia.fastq_record: For creating FASTQ records
Mycelia.simulate_illumina_paired_reads: For more sophisticated read simulation

Mycelia.generate_test_genome_with_genes — Function

generate_test_genome_with_genes(genome_size, gene_density=0.02)

Generate a test genome with simulated gene positions for annotation benchmarking.

Creates a random DNA sequence with estimated gene positions based on gene density, suitable for testing gene prediction algorithms.

Arguments

genome_size: Size of the genome in base pairs
gene_density: Proportion of genome that consists of genes (default: 0.02)

Returns

Tuple of (genome_sequence, gene_positions) where gene_positions is a vector of (start, end) tuples

See Also

random_fasta_record: For generating random FASTA sequences
save_genome_as_fasta: For saving genomes to FASTA format

Mycelia.generate_test_sequences — Function

generate_test_sequences(genome_size::Int, n_sequences::Int=1)

Generate test DNA sequences for k-mer analysis benchmarking.

Creates random DNA sequences suitable for k-mer counting and analysis performance testing.

Arguments

genome_size::Int: Size of each generated sequence in base pairs
n_sequences::Int: Number of sequences to generate (default: 1)

Returns

Vector of BioSequences.LongDNA{4} sequences

See Also

random_fasta_record: For generating FASTA records with random sequences
BioSequences.randdnaseq: For generating random DNA sequences

Mycelia.generate_test_sequences — Method

generate_test_sequences(
    config::Mycelia.BenchmarkConfig
) -> Vector{FASTX.FASTA.Record}

Generate synthetic DNA sequences for benchmarking.

Arguments

config: BenchmarkConfig with test parameters

Returns

Vector of FASTA records for testing

Mycelia.generate_training_datasets — Method

generate_training_datasets(; n_datasets=20, genome_sizes=[10000, 50000, 100000], 
                          error_rates=[0.001, 0.01, 0.05], coverage_levels=[20, 30, 50])

Generate diverse training datasets for RL agent training.

This function creates simulated genomic datasets with varying characteristics to provide comprehensive training scenarios for the RL agent.

Arguments

n_datasets::Int: Total number of datasets to generate (default: 20)
genome_sizes::Vector{Int}: Range of genome sizes to simulate (default: [10K, 50K, 100K])
error_rates::Vector{Float64}: Range of sequencing error rates (default: [0.1%, 1%, 5%])
coverage_levels::Vector{Int}: Range of coverage depths (default: [20x, 30x, 50x])

Returns

Vector{String}: Paths to generated training FASTQ files

Example

training_files = generate_training_datasets(n_datasets=50, genome_sizes=[50000, 100000, 200000])

Mycelia.generate_transterm_coordinates_from_fasta — Method

generate_transterm_coordinates_from_fasta(fasta) -> Any

Generate minimal coordinate files required for TransTermHP analysis from FASTA sequences.

Creates artificial gene annotations at sequence boundaries to enable TransTermHP to run without real gene annotations. For each sequence in the FASTA file, generates two minimal two-base-pair placeholder "genes" at positions 1-2 and (L-1)-L, where L is sequence length.

Arguments

fasta: Path to input FASTA file containing sequences to analyze

Returns

Path to generated coordinate file (original path with ".coords" extension)

Format

Generated coordinate file follows TransTermHP format: gene_id start stop chromosome

where chromosome matches FASTA sequence identifiers.
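
To make the layout concrete, the sketch below builds the two placeholder lines per sequence from an identifier and a length. The function name boundary_gene_lines and the gene_id naming scheme (suffixes _start and _end) are invented for illustration; the identifiers Mycelia actually writes are not documented here.

```julia
# Hypothetical sketch: build "gene_id start stop chromosome" lines for two
# placeholder genes per sequence, at positions 1-2 and (L-1)-L.
function boundary_gene_lines(seq_lengths::Vector{Pair{String, Int}})
    lines = String[]
    for (seq_id, L) in seq_lengths
        push!(lines, "$(seq_id)_start 1 2 $(seq_id)")
        push!(lines, "$(seq_id)_end $(L - 1) $(L) $(seq_id)")
    end
    return lines
end

lines = boundary_gene_lines(["contig1" => 1000, "contig2" => 250])
```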

See also: run_transterm

Mycelia.generate_transterm_coordinates_from_gff — Method

generate_transterm_coordinates_from_gff(gff_file) -> Any

Convert a GFF file to a coordinates file compatible with TransTermHP format.

Arguments

gff_file::String: Path to input GFF file

Processing

Converts 1-based to 0-based coordinates
Extracts gene IDs from the attributes field
Retains columns: gene_id, start, end, seqid

Returns

Path to the generated coordinates file (original filename with '.coords' suffix)

Output Format

Space-delimited file with columns: gene_id, start, end, seqid

Mycelia.get_base_extension — Method

get_base_extension(filename::String) -> String

Extract the base file extension from a filename, handling compressed files.

For regular files, returns the last extension. For gzipped files, returns the extension before .gz.
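
One way to get this behaviour with Base.splitext is sketched below. The name base_extension is hypothetical, and returning the extension with its leading dot is an assumption; the docstring does not say which form Mycelia returns.

```julia
# Hypothetical sketch: look through a trailing ".gz" to the real extension.
# Assumption: the extension is returned with its leading dot.
function base_extension(filename::String)
    stem, ext = splitext(filename)
    if ext == ".gz"
        _, ext = splitext(stem)   # drop ".gz" and take the extension before it
    end
    return ext
end

plain = base_extension("genome.fasta")
gzipped = base_extension("reads.fastq.gz")
```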

Mycelia.get_biosequence_alphabet — Method

get_biosequence_alphabet(s::BioSequences.BioSequence) -> Any

Return the alphabet associated with a BioSequence type.

Arguments

s::BioSequences.BioSequence: An instance of a BioSequences.BioSequence subtype.

Returns

BioSymbols.Alphabet of the sequence type.

Mycelia.get_correct_quality — Method

get_correct_quality(tech::Symbol, pos::Int, read_length::Int) -> Int

Simulates a Phred quality score (using the Sanger convention) for a correctly observed base. For Illumina, the quality score is modeled to decay linearly from ~40 at the start to ~20 at the end of the read. For other technologies, the score is sampled from a normal distribution with parameters typical for that platform.

Returns an integer quality score.
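
The linear decay described for Illumina can be written as a simple interpolation. This deterministic sketch uses a hypothetical function name and fixed endpoints of 40 and 20; the real function is described only as approximately following this trend, so it may add random variation.

```julia
# Hypothetical sketch: quality falls linearly from q_start at the first base
# to q_end at the last base of the read.
function illumina_quality(pos::Int, read_length::Int; q_start::Int=40, q_end::Int=20)
    read_length == 1 && return q_start
    fraction = (pos - 1) / (read_length - 1)   # 0.0 at first base, 1.0 at last
    return round(Int, q_start - fraction * (q_start - q_end))
end

q_first = illumina_quality(1, 151)
q_middle = illumina_quality(76, 151)
q_last = illumina_quality(151, 151)
```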

Mycelia.get_error_quality — Method

get_error_quality(tech::Symbol) -> Int

Simulates a Phred quality score (using the Sanger convention) for a base observed with an error. Error bases are assigned lower quality scores than correctly observed bases. For Illumina, scores typically range between 5 and 15; for nanopore and pacbio, slightly lower values are used; and for ultima, a modest quality score is assigned.

Returns an integer quality score.

Mycelia.get_fastq_contigs — Method

get_fastq_contigs(result::AssemblyResult) -> Vector{FASTX.FASTQ.Record}

Extract quality-aware FASTQ contigs from assembly result if available. Returns empty vector if no quality information was preserved during assembly.

Mycelia.get_genbank — Method

get_genbank(
;
    db,
    accession,
    ftp
) -> Union{Nothing, GenomicAnnotations.GenBank.Reader}

Retrieve GenBank records from NCBI or directly from an FTP site.

Arguments

db::String: NCBI database to query ("nuccore" for nucleotide or "protein" for protein sequences)
accession::String: NCBI accession number for the sequence
ftp::String: Direct FTP URL to a GenBank file (gzipped)

Returns

GenomicAnnotations.GenBank.Reader: A reader object containing the GenBank record

Details

When using NCBI queries (db and accession), the function sleeps for 0.5 s per request. This caps the rate at 2 requests per second, which keeps it within NCBI's API rate limits.

Mycelia.get_gff — Method

get_gff(; db, accession, ftp) -> Any

Retrieve GFF3 formatted genomic feature data from NCBI or direct FTP source.

Arguments

db::String: NCBI database to query ("nuccore" for DNA or "protein" for protein sequences)
accession::String: NCBI accession number
ftp::String: Direct FTP URL to GFF3 file (typically gzipped)

Returns

IO: IOBuffer containing uncompressed GFF3 data

Mycelia.get_in_neighbors — Method

Get incoming neighbors of a vertex.

Mycelia.get_kmer_index — Method

get_kmer_index(kmers, kmer) -> Any

Returns the index position of a given k-mer in a sorted list of k-mers.

Arguments

kmers: A sorted vector of k-mers to search within
kmer: The k-mer sequence to find

Returns

Integer index position where kmer is found in kmers

Throws

AssertionError: If the k-mer is not found in the list
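
Because the input list is sorted, the lookup can be done with a binary search. The sketch below is one plausible approach using Base.searchsortedfirst and a hypothetical function name; the source does not say how Mycelia performs the search.

```julia
# Hypothetical sketch: binary search in a sorted k-mer list, failing with an
# AssertionError when the k-mer is absent.
function kmer_index(kmers::AbstractVector, kmer)
    idx = searchsortedfirst(kmers, kmer)
    @assert idx <= length(kmers) && kmers[idx] == kmer "k-mer not found"
    return idx
end

sorted_kmers = ["AA", "AC", "AG", "AT"]
idx = kmer_index(sorted_kmers, "AG")
```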

Mycelia.get_local_subgraph — Method

Get local subgraph around a vertex.

Mycelia.get_out_neighbors — Method

Get outgoing neighbors of a vertex.

Mycelia.get_phred_scores — Method

get_phred_scores(
    record::FASTX.FASTQ.Record
) -> Vector{UInt8}

Get numerical PHRED quality scores from a FASTQ record.

This is a convenience wrapper around FASTX.quality_scores() that returns the quality scores as a Vector{UInt8} representing PHRED scores.

Arguments

record::FASTX.FASTQ.Record: FASTQ record to extract quality scores from

Returns

Vector{UInt8}: PHRED quality scores (0-based, where 0 = lowest quality, 40+ = highest quality)

Examples

record = FASTX.FASTQ.Record("read1", "ATCG", "IIII")
scores = get_phred_scores(record)  # Returns [40, 40, 40, 40]
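
To show where the value 40 comes from, the following sketch decodes a quality string by hand, assuming Phred+33 (Sanger) encoding and valid input characters. It uses only Base, and phred33 is a hypothetical helper rather than part of FASTX or Mycelia.

```julia
# Phred+33: score = ASCII code of the quality character minus 33 (0x21).
# Assumes every character has a code of at least 33.
phred33(quality::AbstractString) = [c - 0x21 for c in codeunits(quality)]

scores = phred33("II5!")

# A Phred score Q corresponds to an error probability of 10^(-Q / 10).
# Convert to Int first so the negation does not wrap around on UInt8.
error_probs = 10.0 .^ (-Int.(scores) ./ 10)
```

The character 'I' has ASCII code 73, so it decodes to 73 - 33 = 40, which corresponds to an error probability of about 1 in 10,000.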

Mycelia.get_qualmer_statistics — Method

get_qualmer_statistics(
    graph::MetaGraphsNext.MetaGraph
) -> Dict{String, Any}

Get comprehensive statistics about a qualmer graph.

Mycelia.get_sequence — Method

get_sequence(
;
    db,
    accession,
    ftp
) -> Union{Nothing, FASTX.FASTA.Reader}

Retrieve FASTA format sequences from NCBI databases or direct FTP URLs.

Arguments

db::String: NCBI database type ("nuccore" for DNA or "protein" for protein sequences)
accession::String: NCBI sequence accession number
ftp::String: Direct FTP URL to a FASTA file (alternative to db/accession pair)

Returns

FASTX.FASTA.Reader: Reader object containing the requested sequence(s)

Mycelia.get_viroid_species_list — Method

get_viroid_species_list() -> Vector{String}

Get a comprehensive list of well-characterized viroid species for reference data download.

Returns

Vector{String}: List of viroid species names suitable for NCBI searches

Examples

viroid_species = get_viroid_species_list()
for species in viroid_species[1:5]  # Download first 5 species
    download_viroid_reference_data(species, "viroid_references/$species/")
end

Mycelia.gfa_to_fasta — Method

gfa_to_fasta(; gfa, fasta)

Convert a GFA (Graphical Fragment Assembly) file to FASTA format.

Arguments

gfa::String: Path to input GFA file
fasta::String=gfa * ".fna": Path for output FASTA file. Defaults to input filename with ".fna" extension

Returns

String: Path to the generated FASTA file

Details

Uses gfatools (via Conda) to perform the conversion. The function will:

Ensure gfatools is available in the Conda environment
Execute the conversion using gfatools gfa2fa
Write sequences to the specified FASTA file

Mycelia.gfa_to_structure_table — Method

gfa_to_structure_table(
    gfa
) -> NamedTuple{(:contig_table, :records), <:Tuple{DataFrames.DataFrame, Any}}

Convert a GFA (Graphical Fragment Assembly) file into a structured representation.

Arguments

gfa: Path to GFA file or GFA content as string

Returns

Named tuple containing:

contig_table: DataFrame with columns:
- connected_component: Integer ID for each component
- contigs: Comma-separated list of contig IDs
- is_circular: Boolean indicating if component forms a cycle
- is_closed: Boolean indicating if single contig forms a cycle
- lengths: Comma-separated list of contig lengths
records: FASTA records from the GFA

Mycelia.githash — Method

githash(; short) -> SubString{String}

Returns the current git commit hash of the repository.

Arguments

short::Bool=false: If true, returns abbreviated 8-character hash

Returns

A string containing the git commit hash (full 40 characters by default)

Mycelia.graph_to_gfa — Method

graph_to_gfa(; graph, outfile)

Convert a Mycelia graph to GFA (Graphical Fragment Assembly) format.

Writes a graph to GFA format, including:

Header (H) line with GFA version
Segment (S) lines for each vertex with sequence and depth
Link (L) lines for edges with overlap size and orientations

Arguments

graph: MetaGraph containing sequence vertices and their relationships
outfile: Path where the GFA file should be written

Returns

Path to the written GFA file

Mycelia.has_quality_information — Method

has_quality_information(result::AssemblyResult) -> Bool

Check if the assembly result preserves quality information from the original reads.

Mycelia.hclust_to_metagraph — Method

hclust_to_metagraph(
    hcl::Clustering.Hclust
) -> MetaGraphs.MetaDiGraph{Int64, Float64}

Convert a hierarchical clustering tree into a directed metagraph representation.

Arguments

hcl::Clustering.Hclust: Hierarchical clustering result object

Returns

MetaGraphs.MetaDiGraph: Directed graph with metadata representing the clustering hierarchy

Graph Properties

The resulting graph contains the following vertex properties:

:hclust_id: String identifier for each node
:height: Height/distance at each merge point (0.0 for leaves)
:x: Horizontal position for visualization (0-1 range)
:y: Vertical position based on normalized height
:hcl: Original clustering object (stored as graph property)

Mycelia.heirarchically_cluster_distance_matrix — Method

heirarchically_cluster_distance_matrix(
    distance_matrix
) -> Clustering.Hclust

Performs hierarchical clustering on a distance matrix using Ward's linkage method.

Arguments

distance_matrix::Matrix{<:Real}: A symmetric distance/dissimilarity matrix

Returns

Clustering.Hclust: A hierarchical clustering object from Clustering.jl

Details

Uses Ward's method (minimum variance) for clustering, which:

Minimizes total within-cluster variance
Produces compact, spherical clusters
Works well for visualization in radial layouts

Mycelia.identify_optimal_number_of_clusters — Method

identify_optimal_number_of_clusters(
    distance_matrix;
    min_k,
    max_k
) -> Any

Identifies the optimal number of clusters using hierarchical clustering and maximizing the average silhouette score, displaying progress.

Uses Clustering.clustering_quality for score calculation.

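Arguments

distance_matrix: A square matrix of pairwise distances between items
min_k: Lower bound on the number of clusters (k) considered
max_k: Upper bound on the number of clusters (k) considered

Returns

A tuple containing:
- hcl: The hierarchical clustering result object
- optimal_number_of_clusters: The inferred optimal number of clusters (k)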

Mycelia.identify_potential_errors — Method

identify_potential_errors(
    graph;
    min_coverage,
    min_quality,
    min_confidence
) -> Vector{Int64}

Identify potential sequencing errors based on quality scores and coverage patterns.

Arguments

graph: Qualmer graph (MetaGraphsNext with QualmerVertexData)
min_coverage::Int=2: Minimum coverage for reliable k-mers
min_quality::Float64=20.0: Minimum mean quality score
min_confidence::Float64=0.95: Minimum joint probability threshold

Returns

Vector{Int}: Vertex indices of potential error k-mers

Details

Identifies k-mers that are likely errors based on:

Low coverage (singleton or few observations)
Low quality scores
Low joint probability (low confidence)
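
A minimal sketch of the three-threshold filter, assuming each k-mer has already been summarized as a named tuple. The field names and the choice to flag a k-mer when any single criterion fails are assumptions; the real function works on QualmerVertexData and may combine the criteria differently.

```julia
# Hypothetical filter: return indices of k-mers failing at least one threshold.
function flag_potential_errors(summaries; min_coverage=2, min_quality=20.0, min_confidence=0.95)
    return findall(summaries) do s
        s.coverage < min_coverage ||
            s.mean_quality < min_quality ||
            s.joint_probability < min_confidence
    end
end

summaries = [
    (coverage=10, mean_quality=35.0, joint_probability=0.999),  # well supported
    (coverage=1,  mean_quality=34.0, joint_probability=0.990),  # singleton
    (coverage=8,  mean_quality=12.0, joint_probability=0.970),  # low quality
    (coverage=6,  mean_quality=30.0, joint_probability=0.800),  # low confidence
]
flagged = flag_potential_errors(summaries)  # indices 2, 3, and 4
```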

Mycelia.ids_to_accessions — Method

ids_to_accessions(
    ids::Vector{String},
    database::String
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}

Convert NCBI sequence IDs to accession numbers.

Mycelia.improve_read_likelihood — Method

Improve likelihood of a single read using maximum likelihood path finding. Returns improved read and boolean indicating if improvement was made.

Mycelia.improve_read_set_likelihood — Method

Improve likelihood of entire read set using current graph and k-mer size. Returns updated reads and count of improvements made. Uses memory-efficient batch processing for large datasets.

Mycelia.include_all_files — Method

include_all_files(dir::AbstractString; pattern)

Recursively include all files matching a pattern in a directory and its subdirectories.

Arguments

dir::AbstractString: Directory path to search recursively
pattern::Regex=r"\.jl$": Regular expression pattern to match files (defaults to .jl files)

Details

Files are processed in sorted order within each directory. This is useful for loading test files, examples, or other Julia modules in a predictable order.

Examples

# Include all Julia files in a directory tree
include_all_files("test/modules")

# Include all text files
include_all_files("docs"; pattern=r"\.txt$")

Mycelia.install_hashdeep — Method

install_hashdeep() -> Union{Nothing, Base.Process}

Ensures the hashdeep utility is installed on the system.

Checks if hashdeep is available in PATH and attempts to install it via apt package manager if not found. Will try with sudo privileges first, then without sudo if that fails.

Details

Checks PATH for existing hashdeep executable
Attempts installation using apt package manager
Requires a Debian-based Linux distribution

Returns

Nothing, but prints status messages during execution

Mycelia.introduce_sequencing_errors — Method

introduce_sequencing_errors(sequence, error_rate)

Introduce realistic sequencing errors into a DNA sequence.

Simulates substitutions (70%), insertions (15%), and deletions (15%) at the specified error rate for realistic sequencing simulation.

Arguments

sequence: Input DNA sequence (BioSequences.LongDNA{4})
error_rate: Probability of error per base

Returns

Modified sequence with introduced errors (BioSequences.LongDNA{4})

See Also

observe: For more sophisticated error modeling with quality scores
mutate_string: For string-based mutation operations
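
The 70/15/15 error mix can be illustrated on a plain string. This is a sketch under stated assumptions, not the package's code: the helper name `add_errors` is hypothetical, it works on `String` rather than `BioSequences.LongDNA{4}`, and it places insertions after the affected base.

```julia
# Each base errs with probability `error_rate`; an error is a substitution (70%),
# an insertion (15%), or a deletion (15%).
function add_errors(sequence::AbstractString, error_rate::Real)
    bases = ['A', 'C', 'G', 'T']
    out = IOBuffer()
    for base in sequence
        if rand() >= error_rate
            print(out, base)                                   # base observed correctly
            continue
        end
        r = rand()
        if r < 0.70
            print(out, rand(filter(b -> b != base, bases)))    # substitution with a different base
        elseif r < 0.85
            print(out, base, rand(bases))                      # insertion after the current base
        end                                                    # otherwise: deletion (emit nothing)
    end
    return String(take!(out))
end

noisy = add_errors("ACGT"^25, 0.05)
```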

Mycelia.introduce_sequencing_errors — Method

introduce_sequencing_errors(reads::Vector, error_rate::Float64)

Introduce realistic sequencing errors into a set of reads.

Arguments

reads::Vector: Vector of FASTQ records
error_rate::Float64: Error rate (0.0 to 1.0)

Returns

Vector: Reads with introduced errors

Example

error_reads = introduce_sequencing_errors(clean_reads, 0.01)  # 1% error rate

Mycelia.is_equivalent — Method

is_equivalent(a, b) -> Any

Check if two biological sequences are equivalent, considering both direct and reverse complement matches.

Arguments

a: First biological sequence (BioSequence or compatible type)
b: Second biological sequence (BioSequence or compatible type)

Returns

Bool: true if sequences are identical or if one is the reverse complement of the other, false otherwise
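
The strand-agnostic comparison can be sketched on plain strings. The helper names below are hypothetical; the package version operates on BioSequence types.

```julia
# Complement a single DNA base; anything unexpected maps to 'N'.
complement_base(b::Char) = b == 'A' ? 'T' : b == 'C' ? 'G' : b == 'G' ? 'C' : b == 'T' ? 'A' : 'N'

# Reverse the string, then complement each base.
reverse_complement_string(s::AbstractString) = join(complement_base(b) for b in reverse(s))

# Equivalent if identical, or if one is the reverse complement of the other.
is_equivalent_string(a, b) = a == b || a == reverse_complement_string(b)

same_strand = is_equivalent_string("AACG", "AACG")
opposite_strand = is_equivalent_string("AACG", "CGTT")
unrelated = is_equivalent_string("AACG", "AACC")
```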

Mycelia.is_isolated_vertex — Method

Check if a vertex is isolated (no edges).

Mycelia.is_legacy_graph — Method

is_legacy_graph(graph) -> Bool

Check if a graph is using the legacy MetaGraphs format.

Arguments

graph: Graph to check

Returns

Bool: true if legacy format, false if next-generation format

Mycelia.is_valid_bubble — Method

Check if a bubble structure is valid.

Mycelia.is_valid_path — Method

Check if a path is valid in the graph.

Mycelia.isolate_normalized_primary_contig — Method

isolate_normalized_primary_contig(
    assembled_fasta,
    assembled_gfa,
    qualimap_report_txt,
    identifier,
    k::Int64;
    primary_contig_fasta
) -> String

Isolates and exports the primary contig from an assembly, where the primary contig is the contig with the most bases mapped to it (coverage depth × length).

When picking phage out of metagenomic assemblies, the longest contig is often bacterial, whereas the highest-coverage contigs are often primer-dimers or other PCR amplification artifacts. Ranking contigs by length × depth avoids both extremes and selects the likely phage contig.

Arguments

assembled_fasta: Path to the assembled contigs in FASTA format
assembled_gfa: Path to the assembly graph in GFA format
qualimap_report_txt: Path to Qualimap coverage report
identifier: String identifier for the output file
k: Integer representing k-mer size used in assembly
primary_contig_fasta: Optional output filepath (default: "{identifier}.primary_contig.fna")

Returns

Path to the output FASTA file containing the primary contig

Notes

For circular contigs, removes the k-mer closure scar if detected
Trims k bases from the end if they match the first k bases
Uses coverage × length to avoid both long bacterial contigs and short PCR artifacts
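
A small sketch of the two ideas in these notes: ranking contigs by length × depth and trimming a k-base closure scar. The contig records, helper names, and depth values are invented for illustration; the real function reads depth from a Qualimap report and circularity from the GFA.

```julia
# Pick the contig with the largest length × depth product.
function pick_primary_contig(contigs)
    scores = [length(c.sequence) * c.depth for c in contigs]
    return contigs[argmax(scores)]
end

# Drop the last k bases when they repeat the first k bases.
function trim_closure_scar(sequence::AbstractString, k::Integer)
    if length(sequence) > 2k && first(sequence, k) == last(sequence, k)
        return String(chop(sequence; tail=k))
    end
    return String(sequence)
end

contigs = [
    (id="bacterial", sequence="ACGGTCA"^30, depth=5.0),                          # long, low depth
    (id="artifact",  sequence="AGCT"^3, depth=150.0),                            # short, very high depth
    (id="phage",     sequence="ACGT" * "TTGGCCAATCGGAT"^4 * "ACGT", depth=40.0), # circular, k = 4 scar
]
primary = pick_primary_contig(contigs)
trimmed = trim_closure_scar(primary.sequence, 4)
```

Here the phage contig wins because 64 × 40 exceeds both 210 × 5 and 12 × 150, and trimming removes the duplicated 4 bases at its end.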

Mycelia.iterative_assembly_summary — Method

Generate summary report of iterative assembly process.

Mycelia.iterative_polishing — Function

iterative_polishing(
    fastq
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k,
    plot
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})

Performs iterative error correction on FASTQ sequences using progressively larger k-mer sizes.

Starting with the default k-mer size, this function repeatedly applies polishing steps, incrementing the k-mer size until either reaching max_k or encountering instability.

Arguments

fastq: Path to input FASTQ file or FastqRecord object
max_k: Maximum k-mer size to attempt (default: 89)
plot: Whether to generate diagnostic plots (default: false)

Returns

Vector of polishing results, where each element contains:

k: k-mer size used
fastq: resulting polished sequences

Mycelia.jaccard_distance — Method

Compute the Jaccard distance between columns of a binary matrix.

Arguments

M::AbstractMatrix{<:Integer}: Binary matrix where rows are features and columns are samples

Returns

Matrix{Float64}: Symmetric distance matrix with Jaccard distances
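
As a rough illustration of the column-wise calculation (not the package's implementation), the sketch below treats each column as the set of rows with a nonzero entry. The helper name and the convention that two all-zero columns have distance 0 are assumptions.

```julia
function column_jaccard_distances(M::AbstractMatrix{<:Integer})
    n_samples = size(M, 2)
    D = zeros(Float64, n_samples, n_samples)
    for i in 1:n_samples, j in (i + 1):n_samples
        a = M[:, i] .> 0
        b = M[:, j] .> 0
        n_union = count(a .| b)
        # Assumption: two all-zero columns are treated as identical (distance 0).
        d = n_union == 0 ? 0.0 : 1 - count(a .& b) / n_union
        D[i, j] = d
        D[j, i] = d
    end
    return D
end

# Rows are features, columns are samples.
M = [1 1 0;
     1 0 0;
     0 1 1;
     0 0 1]
D = column_jaccard_distances(M)
```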

Mycelia.jaccard_distance — Method

jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$

Arguments

set1: First set to compare
set2: Second set to compare

Returns

Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets

Mycelia.jaccard_similarity — Method

jaccard_similarity(set1, set2) -> Any

Compute the Jaccard similarity coefficient between two sets.

The Jaccard similarity is defined as the size of the intersection divided by the size of the union of two sets:

J(A,B) = |A ∩ B| / |A ∪ B|

Arguments

set1: First set for comparison
set2: Second set for comparison

Returns

Float64: A value between 0.0 and 1.0, where:
- 1.0 indicates identical sets
- 0.0 indicates completely disjoint sets
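
The similarity and distance definitions can be checked by hand on two small k-mer sets. The helper names are hypothetical stand-ins so the snippet runs without the package.

```julia
set_jaccard_similarity(a, b) = length(intersect(a, b)) / length(union(a, b))
set_jaccard_distance(a, b) = 1 - set_jaccard_similarity(a, b)

a = Set(["ACG", "CGT", "GTA"])
b = Set(["CGT", "GTA", "TAC"])

similarity = set_jaccard_similarity(a, b)  # 2 shared / 4 total = 0.5
distance = set_jaccard_distance(a, b)      # 1 - 0.5 = 0.5
```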

Mycelia.jellyfish_count — Method

jellyfish_count(
;
    fastx,
    k,
    threads,
    max_mem,
    canonical,
    outfile,
    conda_check
)

Count k-mers in a FASTA/FASTQ file using Jellyfish.

Arguments

fastx::String: Path to input FASTA/FASTQ file (can be gzipped)
k::Integer: k-mer length
threads::Integer=Sys.CPU_THREADS: Number of threads to use
max_mem::Integer=Int(Sys.free_memory()): Maximum memory in bytes (defaults to system free memory)
canonical::Bool=false: Whether to count canonical k-mers (both strands combined)
outfile::String=auto: Output filename (auto-generated based on input and parameters)
conda_check::Bool=true: Whether to verify Jellyfish conda installation

Returns

String: Path to gzipped TSV file containing k-mer counts

Mycelia.jellyfish_counts_to_kmer_frequency_histogram — Function

jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file,
    outfile
) -> Any

Convert a Jellyfish k-mer count file into a frequency histogram.

Arguments

jellyfish_counts_file::String: Path to the gzipped TSV file containing Jellyfish k-mer counts
outfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv"): Optional output file path

Returns

String: Path to the generated histogram file

Description

Processes a Jellyfish k-mer count file to create a frequency histogram where:

Column 1: Number of k-mers that share the same count
Column 2: The count they share

Uses system sorting with LC_ALL=C for optimal performance on large files.

Notes

Requires gzip, sort, uniq, and sed command line tools
Uses intermediate disk storage for sorting large files
Skips processing if output file already exists
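
The histogram layout described above can be reproduced in memory for small inputs. This sketch is only an illustration of the output shape; the documented function streams through command-line tools instead, and the helper name is hypothetical.

```julia
# Tally how many k-mers share each count, returning (number of k-mers, count) rows.
function count_histogram(kmer_counts)
    histogram = Dict{Int,Int}()
    for c in kmer_counts
        histogram[c] = get(histogram, c, 0) + 1
    end
    return [(histogram[c], c) for c in sort(collect(keys(histogram)))]
end

rows = count_histogram([1, 1, 1, 2, 2, 5])  # three k-mers seen once, two seen twice, one seen 5 times
```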

Mycelia.jitter — Method

jitter(x, n) -> Any

Add random noise to create a vector of jittered values.

Generates n values by adding random noise to the input value x. The noise is uniformly distributed between -1/3 and 1/3.

Arguments

x: Base value to add jitter to
n: Number of jittered values to generate

Returns

Vector of length n containing jittered values around x
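
A minimal sketch of the described behavior, assuming independent uniform draws; the helper name is hypothetical.

```julia
# Draw n values uniformly from [x - 1/3, x + 1/3).
jittered_values(x, n) = [x + (rand() * 2 / 3 - 1 / 3) for _ in 1:n]

values_near_two = jittered_values(2.0, 5)
```

Jitter of this kind is typically used to spread overlapping points in categorical scatter plots.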

Mycelia.join_fastqs_with_uuid — Method

join_fastqs_with_uuid(
    fastq_files::Vector{String};
    fastq_out::String,
    tsv_out::String
)

Note: does not keep track of paired-end data - assumes single end reads

Designed primarily to allow joint mapping of many long-read samples

Given a collection of fastq files, creates:

A gzipped TSV mapping original file and read_id to a new UUID per read
A gzipped joint fastq file with the new UUID as read header

Returns: Tuple of output file paths (tsv_out, fastq_out)

Mycelia.joint_base_quality_score — Method

joint_base_quality_score(
    error_probabilities::Vector{Float64}
) -> Float64

Calculate the quality score for a single base given multiple observations.

This function implements the "Converting to Error Probabilities and Combining" method:

Takes error probabilities from multiple reads covering the same base
Calculates probability of ALL reads being wrong by multiplying probabilities
Calculates final Phred score from this combined probability

To avoid numerical underflow with very small probabilities, the calculation is performed in log space.

Arguments

error_probabilities::Vector{Float64}: Vector of error probabilities from multiple reads covering the same base position

Returns

Float64: Phred quality score representing the combined confidence
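
The log-space combination amounts to summing log10 error probabilities. The sketch below follows the three steps listed above; the helper name is hypothetical and it assumes a non-empty input.

```julia
function combined_phred(error_probabilities::Vector{Float64})
    # log10 of the probability that every read is wrong (product computed as a sum of logs)
    log10_all_wrong = sum(log10, error_probabilities)
    return -10 * log10_all_wrong
end

# A Q20 call (p = 0.01) and a Q30 call (p = 0.001) combine to roughly Q50.
q = combined_phred([0.01, 0.001])
```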

Mycelia.joint_qualmer_probability — Method

joint_qualmer_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate joint probability that multiple observations of the same k-mer are all correct. For independent observations of the same k-mer sequence, this represents our confidence that the k-mer truly exists in the data.

Arguments

qualmers: Vector of Qualmer observations of the same k-mer sequence
use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

Float64: Joint probability that all observations are correct
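
One way to read this calculation, sketched with plain Phred scores rather than Qualmer objects: each observation is correct with probability ∏(1 − p_error) over its bases, and the joint probability multiplies those values across observations. The helper names and the per-base independence assumption are illustrative, not confirmed details of the implementation.

```julia
phred_to_error_probability(q) = 10.0^(-q / 10)

# Log-probability that every base in one observation is correct.
observation_log_correct(quals) = sum(log1p(-phred_to_error_probability(q)) for q in quals)

# Joint probability over all observations, accumulated in log space.
joint_correct_probability(observations) = exp(sum(observation_log_correct, observations))

# Two observations of a 2-mer, every base at Q10 (90% correct): 0.9^4.
p_joint = joint_correct_probability([[10.0, 10.0], [10.0, 10.0]])
```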

Mycelia.jsonl_to_dataframe — Method

jsonl_to_dataframe(filepath::String) -> DataFrames.DataFrame

Parse a JSONL (or gzipped JSONL) file and return a DataFrame. Internally calls parse_jsonl for validation and parsing. Ensures that all rows have the same set of keys by inserting missing for any absent field before constructing the DataFrame.

Mycelia.k_ladder — Method

k_ladder(
;
    max_k,
    seed_primes,
    ratio,
    min_fractional_gap,
    read_length,
    read_margin,
    only_odd,
    return_unique,
    min_k
) -> Vector{Int64}

Generate a √2-scaled ladder of odd primes for k-mer-based assembly / error-screening.

Starts with a user-supplied list of seed_primes (default [3, 5, 7]), then iteratively multiplies the last accepted k by ratio (default sqrt(2)), rounds up to the next odd prime, and appends it only if it differs from the previous accepted prime by at least min_fractional_gap.

Keyword arguments

max_k::Int = 10_000 : Absolute upper bound.
seed_primes::Vector{Int} : Initial primes (e.g. [3,5,7] for protein, [11,13,17] for nucleotides).
ratio::Float64 = sqrt(2) : Target geometric growth factor.
min_fractional_gap::Float64 = 0.30 : Minimum (k_new − k_prev)/k_prev to skip “sister” primes.
read_length::Union{Int,Nothing} = nothing : If set, cap at read_length − read_margin.
read_margin::Int = 20 : Safety margin for short-read data.
only_odd::Bool = true : Force odd k (recommended).
return_unique::Bool = true : De-duplicate before returning.
min_k::Int = 3 : Drop any k below this after generation.

Returns

Vector{Int} — ascending prime k values suitable for -k/--k-list.
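
The growth rule can be sketched with a trial-division primality test. This is a simplified illustration: the helper names are hypothetical, only four of the keyword arguments are modeled, the seeds are assumed to be non-empty and below `max_k`, and the handling of a too-small gap (moving on to the next odd prime) is an assumption.

```julia
is_prime_int(n::Integer) = n >= 2 && all(n % d != 0 for d in 2:isqrt(n))

# Smallest odd prime greater than or equal to n.
function next_odd_prime(n::Integer)
    candidate = max(n, 3)
    iseven(candidate) && (candidate += 1)
    while !is_prime_int(candidate)
        candidate += 2
    end
    return candidate
end

function sqrt2_prime_ladder(; max_k=100, seed_primes=[3, 5, 7], ratio=sqrt(2), min_fractional_gap=0.30)
    ladder = sort(unique(seed_primes))
    while true
        previous = last(ladder)
        # Scale the last accepted k and round up to the next odd prime.
        candidate = next_odd_prime(max(ceil(Int, previous * ratio), previous + 1))
        # Assumption: if the gap is too small, keep moving up to the next odd prime.
        while (candidate - previous) / previous < min_fractional_gap
            candidate = next_odd_prime(candidate + 2)
        end
        candidate > max_k && break
        push!(ladder, candidate)
    end
    return ladder
end

ladder = sqrt2_prime_ladder()  # [3, 5, 7, 11, 17, 29, 43, 61, 89]
```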

Mycelia.kmer_counts_dict_to_vector — Method

kmer_counts_dict_to_vector(
    kmer_to_index_map,
    kmer_counts
) -> Any

Convert a dictionary of k-mer counts to a fixed-length numeric vector based on a predefined mapping.

Arguments

kmer_to_index_map: Dictionary mapping k-mer sequences to their corresponding vector indices
kmer_counts: Dictionary containing k-mer sequences and their occurrence counts

Returns

A vector where each position corresponds to a k-mer count, with zeros for absent k-mers
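
A minimal sketch of the mapping, using string k-mers and a hypothetical helper name. It assumes k-mers missing from the index map should be ignored.

```julia
function counts_to_vector(kmer_to_index_map, kmer_counts)
    v = zeros(Int, length(kmer_to_index_map))
    for (kmer, count) in kmer_counts
        # Assumption: k-mers that are not in the index map are skipped.
        haskey(kmer_to_index_map, kmer) || continue
        v[kmer_to_index_map[kmer]] = count
    end
    return v
end

index_map = Dict("AA" => 1, "AC" => 2, "CA" => 3)
counts = Dict("AC" => 4, "CA" => 1)
count_vector = counts_to_vector(index_map, counts)  # [0, 4, 1]
```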

Mycelia.kmer_counts_to_cosine_similarity — Method

kmer_counts_to_cosine_similarity(
    kmer_counts_1,
    kmer_counts_2
) -> Any

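Compare two k-mer count dictionaries using the cosine measure. Despite the function name, the value documented below is a cosine distance (1 − cosine similarity), so 0 means identical profiles.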

Arguments

kmer_counts_1::Dict{String,Int}: First dictionary mapping k-mer sequences to their counts
kmer_counts_2::Dict{String,Int}: Second dictionary mapping k-mer sequences to their counts

Returns

Float64: Cosine distance between the two k-mer count vectors, in range [0,1] where 0 indicates identical distributions and 1 indicates maximum dissimilarity

Details

Converts k-mer count dictionaries into vectors using a unified set of keys, then computes cosine distance. Missing k-mers are treated as count 0. Result is invariant to input order and total counts (normalized internally).
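
The steps in the Details paragraph can be sketched without external packages. The helper name is hypothetical, and the sketch assumes both dictionaries contain at least one nonzero count.

```julia
function cosine_distance_from_counts(kmer_counts_1, kmer_counts_2)
    # Unified, sorted key set; missing k-mers count as 0.
    all_kmers = sort(collect(union(keys(kmer_counts_1), keys(kmer_counts_2))))
    a = [get(kmer_counts_1, kmer, 0) for kmer in all_kmers]
    b = [get(kmer_counts_2, kmer, 0) for kmer in all_kmers]
    cosine_similarity = sum(a .* b) / (sqrt(sum(abs2, a)) * sqrt(sum(abs2, b)))
    return 1 - cosine_similarity
end

x = Dict("AAA" => 2, "AAC" => 1)
y = Dict("AAA" => 20, "AAC" => 10)   # same profile at 10x depth
z = Dict("GGG" => 5)                 # no shared k-mers

d_scaled = cosine_distance_from_counts(x, y)
d_disjoint = cosine_distance_from_counts(x, z)
```

Scaling every count by the same factor leaves the distance near 0, which is the depth invariance mentioned above.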

Mycelia.kmer_counts_to_js_divergence — Method

kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

Normalized Jensen-Shannon divergence score between 0 and 1, where:
- 0 indicates identical distributions
- 1 indicates maximally different distributions

Notes

The measure is symmetric: JS(P||Q) = JS(Q||P)
Counts are automatically normalized to probability distributions
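
A sketch of the calculation with base-2 logarithms, which bounds the result to [0, 1]. The helper name and the base-2 choice are assumptions made to match the documented range.

```julia
function js_divergence_from_counts(kmer_counts_1, kmer_counts_2)
    total_1 = sum(values(kmer_counts_1))
    total_2 = sum(values(kmer_counts_2))
    js = 0.0
    for kmer in union(keys(kmer_counts_1), keys(kmer_counts_2))
        p = get(kmer_counts_1, kmer, 0) / total_1   # normalize counts to probabilities
        q = get(kmer_counts_2, kmer, 0) / total_2
        m = (p + q) / 2                             # mixture distribution
        p > 0 && (js += 0.5 * p * log2(p / m))
        q > 0 && (js += 0.5 * q * log2(q / m))
    end
    return js
end

x = Dict("AAA" => 3, "AAC" => 1)
z = Dict("GGG" => 5)

js_same = js_divergence_from_counts(x, x)
js_disjoint = js_divergence_from_counts(x, z)
```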

Mycelia.kmer_counts_to_merqury_qv — Method

kmer_counts_to_merqury_qv(
;
    raw_data_counts,
    assembly_counts
)

Calculate assembly Quality Value (QV) score using the Merqury method.

Estimates base-level accuracy by comparing k-mer distributions between raw sequencing data and assembly. Higher QV scores indicate better assembly quality.

Arguments

raw_data_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from raw sequencing data
assembly_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from assembly

Returns

Float64: Quality Value score in Phred scale (-10log₁₀(error rate))

Method

QV is calculated using:

Ktotal = number of unique kmers in assembly
Kshared = number of kmers shared between raw data and assembly
P = (Kshared/Ktotal)^(1/k) = estimated base-level accuracy
QV = -10log₁₀(1-P)
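
The formula can be evaluated directly from the two k-mer tallies. The helper name and the example numbers are invented for illustration.

```julia
function merqury_style_qv(k_shared::Integer, k_total::Integer, k::Integer)
    p = (k_shared / k_total)^(1 / k)   # estimated per-base accuracy
    return -10 * log10(1 - p)          # Phred-scaled quality value
end

# 990 of 1000 assembly 21-mers are also found in the reads.
qv = merqury_style_qv(990, 1000, 21)
```

When every assembly k-mer is supported by the reads, 1 − P is zero and the QV is infinite, so real pipelines usually report such cases separately.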

Reference

Rhie et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies" Genome Biology (2020)

Mycelia.kmer_graph_to_biosequence_graph — Method

kmer_graph_to_biosequence_graph(
    kmer_graph::MetaGraphsNext.MetaGraph;
    min_path_length
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#270#272", Float64} where {_A, _B, _C}

Convert a k-mer graph to a BioSequence graph by collapsing linear paths.

This is the primary method for creating BioSequence graphs from k-mer graphs, following the 6-graph hierarchy where BioSequence graphs are simplifications of k-mer graphs.

Arguments

kmer_graph: MetaGraphsNext k-mer graph to convert
min_path_length: Minimum path length to keep (default: 2)

Returns

MetaGraphsNext.MetaGraph with BioSequence vertices

Example

# Start with k-mer graph
kmer_graph = build_kmer_graph_next(Kmers.DNAKmer{31}, fasta_records)

# Convert to BioSequence graph
biosequence_graph = kmer_graph_to_biosequence_graph(kmer_graph)

Mycelia.kmer_path_to_biosequence — Method

Convert k-mer path back to BioSequence.

Mycelia.kmer_path_to_sequence — Method

kmer_path_to_sequence(kmer_path) -> Any

Convert a path of overlapping k-mers into a single DNA sequence.

Arguments

kmer_path: Vector of k-mers (DNA sequences) where each consecutive pair overlaps by k-1 bases

Returns

BioSequences.LongDNA{2}: Assembled DNA sequence from the k-mer path

Description

Reconstructs the original DNA sequence by joining k-mers, validating that consecutive k-mers overlap correctly. The first k-mer is used in full, then each subsequent k-mer contributes its last base.
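
The reconstruction rule is easy to sketch on strings. The helper name is hypothetical, and the package version works on k-mer types rather than `String`.

```julia
function path_to_string(kmer_path::Vector{String})
    sequence = first(kmer_path)                       # first k-mer is used in full
    for kmer in kmer_path[2:end]
        # Each k-mer must overlap the growing sequence by its first k-1 bases.
        endswith(sequence, kmer[1:end-1]) ||
            throw(ArgumentError("k-mer $(kmer) does not overlap the previous k-mer by k-1 bases"))
        sequence *= kmer[end]                         # contribute only the last base
    end
    return sequence
end

assembled = path_to_string(["ACG", "CGT", "GTA"])  # "ACGTA"
```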

Mycelia.kmer_quality_score — Function

kmer_quality_score(base_qualities::Vector{Float64}) -> Any
kmer_quality_score(
    base_qualities::Vector{Float64},
    method::Symbol
) -> Any

Calculate kmer quality score using the specified aggregation method.

Available methods:

:min: Use the minimum base quality (default)
:mean: Use the mean of all base qualities
:geometric: Use the geometric mean (appropriate for probabilities)
:harmonic: Use the harmonic mean (emphasizes lower values)

Arguments

base_qualities::Vector{Float64}: Vector of quality scores for each base
method::Symbol: Method to use for aggregation

Returns

Float64: Overall quality score for the kmer
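
The four aggregation methods can be compared side by side with a small sketch. The helper name is hypothetical and the formulas are the textbook definitions, which may differ in detail from the package's implementation.

```julia
function aggregate_quality(base_qualities::Vector{Float64}, method::Symbol=:min)
    n = length(base_qualities)
    if method == :min
        return minimum(base_qualities)
    elseif method == :mean
        return sum(base_qualities) / n
    elseif method == :geometric
        return exp(sum(log, base_qualities) / n)
    elseif method == :harmonic
        return n / sum(inv, base_qualities)
    else
        throw(ArgumentError("unknown method: $(method)"))
    end
end

qualities = [10.0, 20.0, 40.0]
by_method = Dict(m => aggregate_quality(qualities, m) for m in (:min, :mean, :geometric, :harmonic))
```

For these values the ordering is min < harmonic < geometric < mean, which shows how the harmonic mean leans toward the weakest base.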

Mycelia.kmer_space_size — Function

kmer_space_size(k::Integer) -> Any
kmer_space_size(k::Integer, alphabet_size::Integer) -> Any

Calculate the theoretical k-mer space size for a given k-mer length and alphabet size.

Arguments

k::Integer: K-mer length
alphabet_size::Integer=4: Size of the alphabet (defaults to 4 for DNA: A,C,G,T)

Returns

Integer: Total number of possible k-mers (alphabet_size^k)

Details

For DNA sequences (alphabet_size=4), this computes 4^k. Useful for:

Memory estimation for k-mer analysis
Parameter validation and selection
Understanding computational complexity

Examples

# DNA 3-mers: 4^3 = 64 possible k-mers
kmer_space_size(3)

# Protein 5-mers: 20^5 = 3,200,000 possible k-mers  
kmer_space_size(5, 20)

Mycelia.ks — Method

ks(; min, max) -> Vector{Int64}

Generates a specialized sequence of prime numbers combining:

Odd primes up to 23 (flip_point)
Primes nearest to Fibonacci numbers above 23 up to max

Arguments

min::Int=0: Lower bound for the sequence
max::Int=10_000: Upper bound for the sequence

Returns

Vector of Int containing the specialized prime sequence

Mycelia.lawrencium_sbatch — Method

lawrencium_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    qos,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM scheduler on Lawrence Berkeley Lab's Lawrencium cluster.

Arguments

job_name: Name identifier for the SLURM job
mail_user: Email address for job notifications
mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
logdir: Directory for SLURM output and error logs
partition: Lawrencium compute partition
qos: Quality of Service level
account: Project account for billing
nodes: Number of nodes to allocate
ntasks: Number of tasks to spawn
time: Wall time limit in format "days-hours:minutes:seconds"
cpus_per_task: CPU cores per task
mem_gb: Memory per node in GB
cmd: Shell command to execute

Returns

true if submission was successful

Note

Function includes 5-second delays before and after submission for queue stability.

Mycelia.legacy_to_next_graph — Function

legacy_to_next_graph(
    legacy_graph
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}
legacy_to_next_graph(
    legacy_graph,
    kmer_type
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#147#148", Float64} where {_A, _B, _C}

Convert a legacy MetaGraphs-based k-mer graph to the next-generation MetaGraphsNext format.

This function provides a migration path from the deprecated MetaGraphs.jl implementation to the new type-stable MetaGraphsNext.jl format.

Arguments

legacy_graph: MetaGraphs.MetaDiGraph from the old implementation

Returns

MetaGraphsNext.MetaGraph with equivalent structure and type-stable metadata

Mycelia.list_blastdbs — Method

list_blastdbs(; source) -> DataFrames.DataFrame

Lists available BLAST databases from the specified source.

Mycelia.list_classes — Method

list_classes() -> DataFrames.DataFrame

Returns a DataFrame of all taxonomic classes in the database.

Classes represent a major taxonomic rank between phylum and order in biological classification.

Returns

DataFrames.DataFrame: Table of class entries (one row per class)

Mycelia.list_databases — Method

list_databases(; address, username, password)

Lists all available Neo4j databases on the specified server.

Arguments

address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
username::String="neo4j": Neo4j authentication username
password::String: Neo4j authentication password

Returns

DataFrame: Contains database information with columns typically including:
- name: Database name
- address: Database address
- role: Database role (e.g., primary, secondary)
- status: Current status (e.g., online, offline)
- default: Boolean indicating if it's the default database

Mycelia.list_families — Method

list_families() -> DataFrames.DataFrame

Returns a DataFrame listing all families present in the database.

Mycelia.list_full_taxonomy — Method

list_full_taxonomy() -> DataFrames.DataFrame

Retrieves and formats the complete NCBI taxonomy hierarchy into a structured DataFrame.

Details

Automatically sets up taxonkit environment and downloads taxonomy database if needed
Starts from root taxid (1) and includes all descendant taxa
Reformats lineage information into separate columns for each taxonomic rank

Returns

DataFrame with columns:

taxid: Taxonomy identifier
lineage: Full taxonomic lineage string
taxid_lineage: Lineage with taxonomy IDs
Individual rank columns:
- superkingdom, kingdom, phylum, class, order, family, genus, species
- corresponding taxid columns (e.g., superkingdom_taxid)

Dependencies

Requires taxonkit (installed automatically via Bioconda)

Mycelia.list_genera — Method

list_genera() -> DataFrames.DataFrame

Returns a DataFrame listing all genera present in the database.

Mycelia.list_kingdoms — Method

list_kingdoms() -> DataFrames.DataFrame

Lists all taxonomic kingdoms in the database.

Kingdoms are a major taxonomic rank, directly below superkingdom (domain) in the rank order used by list_ranks.

Returns

DataFrames.DataFrame: Table of kingdom entries (one row per kingdom)

Mycelia.list_orders — Method

list_orders() -> DataFrames.DataFrame

Lists all orders in the taxonomic database.

Uses the underlying list_rank() function with rank="order".

Returns

DataFrames.DataFrame: Table of order entries with the taxid and name columns described under list_rank

Mycelia.list_phylums — Method

list_phylums() -> DataFrames.DataFrame

Returns a DataFrame listing all unique phyla in the database.

Mycelia.list_rank — Method

list_rank(rank) -> DataFrames.DataFrame

List all taxonomic entries at the specified rank level.

Arguments

rank::String: Taxonomic rank to query. Must be one of:
- "top" (top level)
- "superkingdom"/"domain"
- "kingdom"
- "phylum"
- "class"
- "order"
- "family"
- "genus"
- "species"

Returns

DataFrame with columns:

taxid: NCBI taxonomy ID
name: Scientific name at the specified rank

Mycelia.list_ranks — Method

list_ranks(; synonyms) -> Vector{String}

Return an ordered list of taxonomic ranks from highest (top) to lowest (species).

Arguments

synonyms::Bool=false: If true, includes alternative names for certain ranks (e.g. "domain" for "superkingdom")

Returns

Vector{String}: An array of taxonomic rank names in hierarchical order

Mycelia.list_species — Method

list_species() -> DataFrames.DataFrame

Returns a DataFrame listing all species present in the database.

Mycelia.list_subtaxa — Method

list_subtaxa(taxid) -> Vector{Int64}

Returns an array of Integer taxon IDs representing all sub-taxa under the specified taxonomic ID.

Arguments

taxid: NCBI taxonomy identifier for the parent taxon

Returns

Vector{Int} containing all descendant taxon IDs

Details

Requires taxonkit to be installed via Bioconda
Automatically sets up taxonkit database if not present
Uses local taxonomy database in ~/.taxonkit/

Mycelia.list_superkingdoms — Method

list_superkingdoms() -> DataFrames.DataFrame

Returns a DataFrame of all taxonomic superkingdoms (e.g., Bacteria, Archaea, Eukaryota).

Returns

DataFrames.DataFrame: Table of all superkingdoms in the taxonomy database

Mycelia.list_toplevel — Method

list_toplevel() -> DataFrames.DataFrame

Returns a DataFrame containing the top-level taxonomic nodes.

The DataFrame has two fixed rows representing the most basic taxonomic classifications:

taxid=0: "unclassified"
taxid=1: "root"

Returns

DataFrame with columns:
- taxid::Int: Taxonomic identifier
- name::String: Node name

Mycelia.load_blast_db_taxonomy_table — Method

load_blast_db_taxonomy_table(
    compressed_blast_db_taxonomy_table_file
) -> DataFrames.DataFrame

Loads a BLAST database taxonomy mapping table from a gzipped file into a DataFrame.

Arguments

compressed_blast_db_taxonomy_table_file::String: Path to a gzipped file containing BLAST taxonomy mappings

Returns

DataFrame: A DataFrame with columns :sequence_id and :taxid containing the sequence-to-taxonomy mappings

Format

Input file should be a space-delimited text file (gzipped) with two columns:

sequence identifier
taxonomy identifier (taxid)
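
The two-column layout can be parsed line by line. The sketch below reads from any IO and returns a named tuple of columns rather than a DataFrame. The helper name is hypothetical, the example identifiers are made up, and gzip decompression is left out.

```julia
function parse_taxonomy_lines(io::IO)
    sequence_ids = String[]
    taxids = Int[]
    for line in eachline(io)
        isempty(strip(line)) && continue
        fields = split(strip(line), ' '; limit=2)   # sequence identifier, then taxid
        push!(sequence_ids, fields[1])
        push!(taxids, parse(Int, strip(fields[2])))
    end
    return (sequence_id=sequence_ids, taxid=taxids)
end

# Made-up records for illustration only.
table = parse_taxonomy_lines(IOBuffer("seq1 562\nseq2 10710\n"))
```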

Mycelia.load_df_jld2 — Method

load_df_jld2(filename::String; key) -> Any

Load a DataFrame from a JLD2 file.

Arguments

filename: Path to the JLD2 file (will add .jld2 extension if not present)
key: The name of the dataset within the JLD2 file (defaults to "dataframe")

Returns

The loaded DataFrame

Examples

df = load_df_jld2("mydata")

Mycelia.load_genbank_metadata — Method

load_genbank_metadata() -> DataFrames.DataFrame

Load metadata for GenBank sequences into a DataFrame.

This is a convenience wrapper around load_ncbi_metadata("genbank") that specifically loads metadata from the GenBank database.

Returns

DataFrame: Contains metadata fields like accession numbers, taxonomy, and sequence information from GenBank.

Mycelia.load_graph — Method

load_graph(file) -> Any

Load a graph structure from a serialized file.

Arguments

file::AbstractString: Path to the file containing the serialized graph data

Returns

The deserialized graph object

Mycelia.load_graph — Method

load_graph(file::String) -> Any

Loads a graph object from a serialized file.

Arguments

file::String: Path to the file containing the serialized graph data. The file should have been created using save_graph.

Returns

The deserialized graph object stored under the "graph" key.

Mycelia.load_jellyfish_counts — Method

load_jellyfish_counts(jellyfish_counts) -> Any

Load k-mer counts from a Jellyfish output file into a DataFrame.

Arguments

jellyfish_counts::String: Path to a gzipped TSV file (*.jf.tsv.gz) containing Jellyfish k-mer counts

Returns

DataFrame: Table with columns:
- kmer: Biologically encoded k-mers as DNAKmer{k} objects
- count: Integer count of each k-mer's occurrences

Notes

Input file must be a gzipped TSV with exactly two columns (k-mer sequences and counts)
K-mer length is automatically detected from the first entry
Filename must end with '.jf.tsv.gz'

Mycelia.load_jld2 — Method

load_jld2(filename) -> Any

Load data stored in a JLD2 file format.

Arguments

filename::String: Path to the JLD2 file to load

Returns

Dict: Dictionary containing the loaded data structures

Mycelia.load_kmer_results — Method

load_kmer_results(
    filename::AbstractString
) -> Union{Nothing, NamedTuple{(:kmers, :counts, :fasta_list, :metadata), <:Tuple{Any, Any, Any, Dict{String, Any}}}}

Load kmer counting results previously saved with save_kmer_results.

Arguments

filename::AbstractString: Path to the input JLD2 file.

Returns

NamedTuple: Contains the loaded kmers, counts, fasta_list, and metadata. Returns nothing if the file cannot be loaded or essential keys are missing.

Mycelia.load_matrix_jld2 — Method

load_matrix_jld2(filename) -> Any

Loads a matrix from a JLD2 file.

Arguments

filename::String: Path to the JLD2 file containing the matrix under the key "matrix"

Returns

Matrix: The loaded matrix data

Mycelia.load_ncbi_metadata — Method

load_ncbi_metadata(db::String) -> DataFrames.DataFrame

Load and parse NCBI assembly summary metadata (GenBank/RefSeq), using a daily cache.

Checks for homedir()/workspace/.ncbi/YYYY-MM-DD.assembly_summary_{db}.txt. Uses the cache if valid (exists, readable, not empty). Otherwise, downloads from NCBI using Downloads.download(), caches the result (replacing any previous version for the same day), and then parses the cached file.

Handles NCBI's header format and uses CSV.jl for parsing.
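
The daily-cache naming and validity check described above might look roughly like this. The helper names and keyword arguments are hypothetical, and the download and parsing steps are omitted.

```julia
import Dates

# Build the dated cache path for a given database ("genbank" or "refseq").
function daily_cache_path(db::AbstractString;
                          cache_dir::AbstractString=joinpath(homedir(), "workspace", ".ncbi"),
                          date::Dates.Date=Dates.today())
    db in ("genbank", "refseq") || throw(ArgumentError("db must be \"genbank\" or \"refseq\""))
    stamp = Dates.format(date, "yyyy-mm-dd")
    return joinpath(cache_dir, "$(stamp).assembly_summary_$(db).txt")
end

# A cache file is usable only if it exists and is not empty.
cache_is_usable(path) = isfile(path) && filesize(path) > 0

example_dir = mktempdir()
example_path = daily_cache_path("refseq"; cache_dir=example_dir, date=Dates.Date(2024, 1, 5))
```

Because the date is part of the filename, a cache written yesterday is simply never matched today, which is what makes the cache "daily".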

Arguments

db::String: Database source ("genbank" or "refseq").

Returns

DataFrames.DataFrame: Parsed metadata table.

Errors

Throws ArgumentError for invalid db.
Throws error if cache directory cannot be created.
Throws error if data cannot be obtained from cache or download.
Rethrows errors from Downloads.download or CSV parsing.

Mycelia.load_ncbi_taxonomy — Method

load_ncbi_taxonomy(
;
    path_to_taxdump
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Downloads and constructs a MetaDiGraph representation of the NCBI taxonomy database.

Arguments

path_to_taxdump: Directory path where taxonomy files will be downloaded and extracted

Returns

MetaDiGraph: A directed graph where:
- Vertices represent taxa with properties:
  - :tax_id: NCBI taxonomy identifier
  - :scientific_name, :common_name, etc.: Name properties
  - :rank: Taxonomic rank
  - :division_id, :division_cde, :division_name: Division information
- Edges represent parent-child relationships in the taxonomy

Dependencies

Requires internet connection for initial download. Uses DataFrames, MetaGraphs, and ProgressMeter.

Mycelia.load_refseq_metadata — Method

load_refseq_metadata() -> DataFrames.DataFrame

Loads NCBI RefSeq metadata into a DataFrame. RefSeq is NCBI's curated collection of genomic, transcript and protein sequences.

Returns

DataFrame: Contains metadata columns including accession numbers, taxonomic information, and sequence details from RefSeq.

Mycelia.local_blast_database_info — Method

local_blast_database_info(; blastdbs_dir) -> Any

Query information about local BLAST databases and return a formatted summary.

Arguments

blastdbs_dir::String: Directory containing BLAST databases (default: "~/workspace/blastdb")

Returns

DataFrame with columns:
- BLAST database path
- BLAST database molecule type
- BLAST database title
- date of last update
- number of bases/residues
- number of sequences
- number of bytes
- BLAST database format version
- human readable size

Dependencies

Requires NCBI BLAST+ tools. Will attempt to install via apt-get if not present.

Side Effects

May install system packages (ncbi-blast+, perl-doc) using sudo/apt-get
Filters out numbered database fragments from results

Mycelia.logistic_pca_epca — Method

logistic_pca_epca(M::AbstractMatrix{Bool}; k::Int=0)

Synonym for bernoulli_pca_epca(M; k=k).

Mycelia.mash_distance_from_jaccard — Method

mash_distance_from_jaccard(jaccard_index::Float64, kmer_size::Int)

Calculates the Mash distance, an estimate of the per-base mutation rate between two sequences, from a given Jaccard index and k-mer size. Average Nucleotide Identity (ANI) can then be approximated as 1 - D.

Arguments

jaccard_index::Float64: The Jaccard similarity between the two k-mer sets. Must be between 0.0 and 1.0.
kmer_size::Int: The length of k-mers used to calculate the Jaccard index.

Returns

Float64: The estimated Mash distance D. The estimated ANI would be 1.0 - D.
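
For orientation, the published Mash formula is D = -(1/k) · ln(2j / (1 + j)), where j is the Jaccard index and k the k-mer size. The sketch below applies that formula directly. The function name `mash_distance_sketch` and the handling of the boundary cases (j = 0 and j = 1) are assumptions made for illustration and may differ from what Mycelia does.

```julia
# Mash distance from a Jaccard index: D = -(1/k) * ln(2j / (1 + j)).
function mash_distance_sketch(jaccard_index::Float64, kmer_size::Int)
    0.0 <= jaccard_index <= 1.0 ||
        throw(ArgumentError("jaccard_index must be between 0.0 and 1.0"))
    kmer_size > 0 || throw(ArgumentError("kmer_size must be positive"))
    # Assumption: with no shared k-mers the distance is capped at 1.0
    jaccard_index == 0.0 && return 1.0
    jaccard_index == 1.0 && return 0.0
    distance = -log(2 * jaccard_index / (1 + jaccard_index)) / kmer_size
    return clamp(distance, 0.0, 1.0)
end

distance = mash_distance_sketch(0.9, 21)
estimated_ani = 1.0 - distance
```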

Mycelia.maximum_weight_walk_next — Method

maximum_weight_walk_next(
    graph::MetaGraphsNext.MetaGraph,
    start_vertex::String,
    max_steps::Int64;
    weight_function
) -> Mycelia.GraphPath

Perform a maximum weight walk prioritizing highest confidence edges.

This greedy algorithm always chooses the edge with the highest weight (coverage) at each step, useful for finding high-confidence assembly paths.

Arguments

graph: MetaGraphsNext k-mer graph
start_vertex: Starting vertex label
max_steps: Maximum steps to take
weight_function: Function to extract weight from edge data (default: uses edge.weight)

Returns

GraphPath: Path following maximum weight edges
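
The greedy rule can be illustrated on a plain adjacency dictionary. This sketch does not use MetaGraphsNext or Mycelia.GraphPath; the function name `greedy_max_weight_walk_sketch`, the `Dict` layout, and the returned named tuple are assumptions chosen to keep the example self-contained.

```julia
# Each vertex maps to its outgoing edges as `target => weight` pairs.
function greedy_max_weight_walk_sketch(
    edges::Dict{String, Vector{Pair{String, Float64}}},
    start_vertex::String,
    max_steps::Int,
)
    vertices = [start_vertex]
    total_weight = 0.0
    current = start_vertex
    for _ in 1:max_steps
        candidates = get(edges, current, Pair{String, Float64}[])
        isempty(candidates) && break            # dead end: stop walking
        # Greedy choice: take the outgoing edge with the highest weight
        best_index = argmax([weight for (_, weight) in candidates])
        next_vertex, weight = candidates[best_index]
        push!(vertices, next_vertex)
        total_weight += weight
        current = next_vertex
    end
    return (vertices = vertices, weight = total_weight)
end

edges = Dict(
    "ATG" => ["TGC" => 5.0, "TGA" => 2.0],
    "TGC" => ["GCA" => 1.0, "GCT" => 3.0],
)
walk = greedy_max_weight_walk_sketch(edges, "ATG", 10)
```

Because the choice is purely local, the walk can miss a path whose total weight is higher but which starts with a lower-weight edge.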

Mycelia.merge_and_map_single_end_samples — Method

merge_and_map_single_end_samples(; 
    fasta_reference::AbstractString, 
    fastq_list::Vector{<:AbstractString}, 
    minimap_index::AbstractString, 
    mapping_type::AbstractString,
    outbase::AbstractString = "results",
    outformats::Vector{<:AbstractString} = [".tsv.gz", ".jld2"]
) -> DataFrames.DataFrame

Merge and map single-end sequencing samples, then output results in one or more formats.

Arguments

fasta_reference: Path to the reference FASTA file.
fastq_list: Vector of paths to input FASTQ files to be merged.
minimap_index: Path to the minimap2 index file (.mmi).
mapping_type: Mapping type string for minimap2 (e.g., "map-ont").
outbase: Base name (optionally including path) for output files (default: Mycelia.normalized_current_date() * ".joint-minimap-mapping-results").
outformats: Vector of output file formats to write results to. Supported: ".tsv.gz", ".jld2".

Description

This function merges provided FASTQ files and assigns unique UUIDs to reads, maps the merged FASTQ against the provided reference using minimap2, reads mapping and UUID tables, joins them into a single DataFrame, writes this table to all requested output formats with filenames constructed from the outbase and the appropriate extension, and returns the resulting joined DataFrame.

Output Files

.tsv.gz: Tab-separated, gzip-compressed table of results.
.jld2: JLD2 file containing results.

Returns

The joined results as a DataFrames.DataFrame.

Mycelia.merge_colors — Method

merge_colors(c1, c2) -> Any

Merge two colors by calculating their minimal color difference vector.

Arguments

c1::Color: First color input
c2::Color: Second color input

Returns

If colors are equal, returns the input color
Otherwise returns the color difference vector (c1-c2 or c2-c1) with minimal RGB sum

Details

Calculates two difference vectors:

mix_a = c1 - c2
mix_b = c2 - c1

Returns the difference vector with the smallest sum of RGB components.

Mycelia.merge_fasta_files — Method

merge_fasta_files(; fasta_files, fasta_file)

Join fasta files while adding origin prefixes to the identifiers.

Does not guarantee uniqueness, but warns if conflicts arise.

Mycelia.merge_overlapping_repeats — Method

Merge overlapping repeat regions.

Mycelia.merge_repeat_regions — Method

Merge multiple repeat regions into one.

Mycelia.merge_xam_with_taxonomies — Method

Merge XAM alignment data with taxonomic information and calculate alignment metrics.

This function processes XAM alignment data by:

Loading an accession-to-taxid mapping table
Left-joining alignment data with taxonomic IDs
Retrieving full taxonomic lineage information
Calculating percent identity scores
Calculating relative alignment score proportions per read
Writing results to cached files

Arguments

xam: Path to XAM file or XAM data structure
accession2taxid_file: Path to accession-to-taxid mapping file (.tsv.gz or .arrow)
output_prefix: Prefix for output files (.tsv.gz and .arrow). Defaults to "xam"
verbose::Bool=true: Whether to print progress information
force_recalculate::Bool=false: Whether to force recalculation even if cached files exist

Returns

A NamedTuple with paths to the output files: (tsvout, arrowout)

Mycelia.metasha256 — Method

metasha256(
    vector_of_sha256s::Vector{<:AbstractString}
) -> String

Compute a single SHA256 hash from multiple SHA256 hashes.

Takes a vector of hex-encoded SHA256 hashes and produces a new SHA256 hash by:

Sorting the input hashes lexicographically
Concatenating them in sorted order
Computing a new SHA256 hash over the concatenated data

Arguments

vector_of_sha256s: Vector of hex-encoded SHA256 hash strings

Returns

A hex-encoded string representing the computed meta-hash
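
The three steps can be sketched with the SHA standard library. The function name `metasha256_sketch` is hypothetical, and hashing the concatenated hex strings (rather than the decoded bytes) is an assumption about the implementation.

```julia
import SHA

function metasha256_sketch(vector_of_sha256s::Vector{<:AbstractString})
    sorted_hashes = sort(vector_of_sha256s)      # 1. lexicographic sort
    concatenated = join(sorted_hashes)           # 2. concatenate in sorted order
    return bytes2hex(SHA.sha256(concatenated))   # 3. hash the concatenation
end

hash_a = bytes2hex(SHA.sha256("file A contents"))
hash_b = bytes2hex(SHA.sha256("file B contents"))
meta_hash = metasha256_sketch([hash_a, hash_b])
```

Sorting first makes the result independent of input order, so the same set of files always yields the same meta-hash.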

Mycelia.minimap_index — Method

minimap_index(
;
    fasta,
    mapping_type,
    mem_gb,
    threads,
    as_string,
    denominator
)

Create a minimap2 index for the provided reference sequence.

Arguments

fasta::String: Path to the reference FASTA.
mapping_type::String: Preset (e.g. "map-hifi").
mem_gb::Real: Memory available in GB.
threads::Integer: Number of threads.
as_string::Bool=false: If true, return the command string instead of Cmd.
denominator::Real: Scaling factor passed to system_mem_to_minimap_index_size.

Returns

Named tuple (cmd, outfile) where outfile is the generated .mmi index path.

Mycelia.minimap_map — Method

minimap_map(
;
    fasta,
    fastq,
    mapping_type,
    as_string,
    mem_gb,
    threads,
    denominator
)

Generate minimap2 alignment commands for sequence mapping.

Creates a command to align reads in FASTQ format to a reference FASTA using minimap2, followed by SAM compression with pigz. The command only aligns and compresses; no sorting or filtering is performed. Handles resource allocation and conda environment setup.

Use as_string=true to get the command as a string suitable for submission to SLURM.

Arguments

fasta: Path to reference FASTA file
fastq: Path to query FASTQ file
mapping_type: Alignment preset ("map-hifi", "map-ont", "map-pb", "sr", or "lr:hq")
as_string: If true, returns shell command as string; if false, returns command array
mem_gb: Available memory in GB for indexing (defaults to system free memory)
threads: Number of CPU threads to use (defaults to system threads)
denominator: Divisor for calculating minimap2 index size

Returns

Named tuple containing:

cmd: Shell command (as string or array)
outfile: Path to compressed output SAM file

Mycelia.minimap_map_paired_end_with_index — Method

minimap_map_paired_end_with_index(
;
    forward,
    reverse,
    mem_gb,
    threads,
    outdir,
    as_string,
    denominator,
    fasta,
    index_file
)

Map paired-end reads to a reference sequence using minimap2.

Arguments

fasta::String: Path to reference FASTA file
forward::String: Path to forward reads FASTQ file
reverse::String: Path to reverse reads FASTQ file
mem_gb::Integer: Available system memory in GB
threads::Integer: Number of threads to use
outdir::String: Output directory (defaults to forward reads directory)
as_string::Bool=false: Return command as string instead of Cmd array
mapping_type::String="sr": Minimap2 preset ["map-hifi", "map-ont", "map-pb", "sr", "lr:hq"]
denominator::Float64: Memory scaling factor for index size

Returns

Named tuple containing:

cmd: Command(s) to execute (String or Array{Cmd})
outfile: Path to compressed output SAM file (*.sam.gz)

Notes

Requires minimap2, samtools, and pigz conda environments
Automatically compresses output using pigz
Index file must exist at $(fasta).x$(mapping_type).I$(index_size).mmi

Mycelia.minimap_map_with_index — Method

minimap_map_with_index(
;
    fasta,
    mapping_type,
    fastq,
    index_file,
    mem_gb,
    threads,
    as_string,
    denominator
)

Map reads using an existing minimap2 index file.

Arguments

fasta: Path to the reference FASTA (used only if an index must be created).
mapping_type: Minimap2 preset.
fastq: Input reads.
index_file::String="": Optional prebuilt index path. If empty, one is created.
mem_gb, threads, as_string, denominator: Parameters forwarded to minimap_index.

Returns

Named tuple (cmd, outfile) producing a BAM file from the mapping.

Mycelia.mmseqs_pairwise_search — Method

mmseqs_pairwise_search(; fasta, output)

Perform all-vs-all sequence search using MMseqs2's easy-search command.

Arguments

fasta::String: Path to input FASTA file containing sequences to compare
output::String: Output directory path (default: input filename + ".mmseqseasysearch_pairwise")

Returns

String: Path to the output directory

Details

Executes MMseqs2 with sensitive search parameters (7 sensitivity steps) and outputs results in tabular format with the following columns:

query, qheader: Query sequence ID and header
target, theader: Target sequence ID and header
pident: Percentage sequence identity
fident: Fraction of identical matches
nident: Number of identical matches
alnlen: Alignment length
mismatch: Number of mismatches
gapopen: Number of gap openings
qstart, qend, qlen: Query sequence coordinates and length
tstart, tend, tlen: Target sequence coordinates and length
evalue: Expected value
bits: Bit score

Requires MMseqs2 to be available through Bioconda.

Mycelia.mutate_sequence — Method

mutate_sequence(reference_sequence) -> Tuple{Any, Any}

Generate a single random mutation in an amino acid sequence.

Arguments

reference_sequence: Input amino acid sequence to be mutated

Returns

mutant_sequence: The sequence after applying the mutation
haplotype: A SequenceVariation.Haplotype object containing the mutation details

Details

Performs one of three possible mutation types:

Substitution: Replace one amino acid with another
Insertion: Insert 1+ random amino acids at a position
Deletion: Remove 1+ amino acids from a position

Insertion and deletion sizes follow a truncated Poisson distribution (λ=1, min=1).

Mycelia.mycelia_assemble — Method

Main Mycelia intelligent assembly algorithm. Implements iterative prime k-mer progression with error correction.

Mycelia.mycelia_cross_validation — Method

Main cross-validation function for hybrid assembly quality assessment. Compares intelligent vs iterative assembly approaches across multiple validation folds.

Mycelia.mycelia_iterative_assemble — Method

Main iterative maximum likelihood assembly function. Processes entire read sets per iteration with complete FASTQ I/O tracking. Enhanced with performance optimizations, caching, and progress tracking.

Mycelia.n_maximally_distinguishable_colors — Method

n_maximally_distinguishable_colors(n) -> Any

Generate n colors that are maximally distinguishable from each other.

Arguments

n::Integer: The number of distinct colors to generate

Returns

A vector of n RGB colors that are optimized for maximum perceptual distinction, using white (RGB(1,1,1)) and black (RGB(0,0,0)) as anchor colors.

Mycelia.name2taxid — Method

name2taxid(name) -> DataFrames.DataFrame

Convert scientific name(s) to NCBI taxonomy ID(s) using taxonkit.

Arguments

name::AbstractString: Scientific name(s) to query. Can be a single name or multiple names separated by newlines.

Returns

DataFrame with columns:
- name: Input scientific name
- taxid: NCBI taxonomy ID
- rank: Taxonomic rank (e.g., "species", "genus")

Dependencies

Requires taxonkit package (installed automatically via Bioconda)

Mycelia.names2taxids — Method

names2taxids(names::AbstractVector{<:AbstractString}) -> Any

Convert a vector of species/taxon names to their corresponding NCBI taxonomy IDs.

Arguments

names::AbstractVector{<:AbstractString}: Vector of scientific names or common names

Returns

Vector{Int}: Vector of NCBI taxonomy IDs corresponding to the input names

Progress is displayed using ProgressMeter.

Mycelia.ncbi_ftp_path_to_url — Method

ncbi_ftp_path_to_url(; ftp_path, extension)

Constructs a complete NCBI FTP URL by combining a base FTP path with a file extension.

Arguments

ftp_path::String: Base FTP directory path for the resource
extension::String: File extension to append to the resource name

Returns

String: Complete FTP URL path to the requested resource

Extensions include:

genomic.fna.gz
genomic.gff.gz
protein.faa.gz
assembly_report.txt
assembly_stats.txt
cdsfromgenomic.fna.gz
feature_count.txt.gz
feature_table.txt.gz
genomic.gbff.gz
genomic.gtf.gz
protein.gpff.gz
translated_cds.faa.gz
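
NCBI assembly directories conventionally hold files named after the directory itself, followed by an underscore and the extension. The sketch below builds a URL on that convention. The function name `ncbi_ftp_path_to_url_sketch` is hypothetical, and the exact joining logic inside Mycelia is an assumption.

```julia
function ncbi_ftp_path_to_url_sketch(; ftp_path::AbstractString, extension::AbstractString)
    # Drop any trailing slash so basename returns the assembly directory name
    clean_path = String(rstrip(ftp_path, '/'))
    resource_name = basename(clean_path)
    return clean_path * "/" * resource_name * "_" * extension
end

url = ncbi_ftp_path_to_url_sketch(
    ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2",
    extension = "genomic.fna.gz",
)
```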

Mycelia.ncbi_genome_download_accession — Method

ncbi_genome_download_accession(
;
    accession,
    outdir,
    outpath,
    include_string
)

Downloads and extracts a genome from NCBI using the datasets command line tool.

The .zip archive is downloaded to outpath and then unzipped, and the path to the output folder is returned. NCBI's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report".

Arguments

accession: NCBI accession number for the genome
outdir: Directory where files will be downloaded (defaults to current directory)
outpath: Full path for the temporary zip file (defaults to outdir/accession.zip)
include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").

Returns

Path to the extracted genome data directory

Notes

Requires the ncbi-datasets-cli conda package (automatically installed if missing)
Downloaded zip file is automatically removed after extraction
If output folder already exists, download is skipped
Data is extracted to outdir/accession/ncbi_dataset/data/accession

Mycelia.ncbi_taxon_summary — Method

ncbi_taxon_summary(taxa_id) -> DataFrames.DataFrame

Retrieve taxonomic information for a given NCBI taxonomy ID.

Arguments

taxa_id: NCBI taxonomy identifier (integer)

Returns

DataFrame: Taxonomy summary containing fields like tax_id, rank, species, etc.

Mycelia.nearest_prime — Method

nearest_prime(n::Int64) -> Int64

Find the closest prime number to the given integer n.

Returns the nearest prime number to n. If two prime numbers are equally distant from n, returns the smaller one.

Arguments

n::Int: The input integer to find the nearest prime for

Returns

Int: The closest prime number to n
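
The tie-breaking rule can be illustrated with a small outward search. This sketch uses trial division instead of a primes library, and the names `is_prime_sketch` and `nearest_prime_sketch` are hypothetical.

```julia
# Trial division is sufficient for the small values typical of k-mer sizes.
is_prime_sketch(n::Int) = n >= 2 && all(n % d != 0 for d in 2:isqrt(n))

function nearest_prime_sketch(n::Int)
    n <= 2 && return 2
    offset = 0
    while true
        # Check the smaller candidate first so that ties resolve downward
        is_prime_sketch(n - offset) && return n - offset
        is_prime_sketch(n + offset) && return n + offset
        offset += 1
    end
end

nearest_prime_sketch(12)  # 11 and 13 are equally close; the smaller one is returned
```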

Mycelia.negbin_pca_epca — Method

negbin_pca_epca(M::AbstractMatrix{<:Integer};
               k::Int=0,
               r::Int=1)

Perform Negative-Binomial EPCA on a count matrix M (features × samples).

When to use

Use for overdispersed count data (variance > mean), such as RNA-seq or metagenomic counts.

Keyword arguments

k : desired number of latent dimensions; if k<1 defaults to min(n_samples-1, n_features, 10)
r : known NB “number of successes” parameter

Returns

NamedTuple with fields

model : the fitted ExpFamilyPCA.NegativeBinomialEPCA object
scores : k×n_samples matrix of sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.nersc_sbatch — Method

nersc_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    scriptdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a batch job to NERSC's SLURM workload manager.

Arguments

job_name: Identifier for the SLURM job
mail_user: Email address for job notifications
mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
logdir: Directory for storing job output/error logs
scriptdir: Directory for storing generated SLURM scripts
qos: Quality of Service level ("regular", "premium", or "preempt")
nodes: Number of nodes to allocate
ntasks: Number of tasks to run
time: Maximum wall time in format "days-HH:MM:SS"
cpus_per_task: CPU cores per task
mem_gb: Memory per node in GB
cmd: Command(s) to execute (String or Vector{String})
constraint: Node type constraint ("cpu" or "gpu")

Returns

true if job submission succeeds
false if submission fails

QoS Options

regular: Standard priority queue
premium: High priority queue (5x throughput limit)
preempt: Reduced credit usage but jobs may be interrupted

Notes

The default is to use the shared QoS. The alternatives are regular, preempt (reduced credit usage, but jobs are not guaranteed to finish), and premium (priority runs, limited to 5x throughput).

See https://docs.nersc.gov/jobs/policy/, https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes, and https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks for details.

Mycelia.nersc_sbatch_shared — Method

nersc_sbatch_shared(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a job to NERSC's SLURM scheduler using the shared QOS (Quality of Service).

Arguments

job_name: Identifier for the job
mail_user: Email address for job notifications
mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", "REQUEUE", "STAGE_OUT")
logdir: Directory for storing job output and error logs
qos: Quality of Service level ("shared", "regular", "preempt", "premium")
nodes: Number of nodes to allocate
ntasks: Number of tasks to run
time: Maximum wall time in format "days-hours:minutes:seconds"
cpus_per_task: Number of CPUs per task
mem_gb: Memory per node in GB (default: 2GB per CPU)
cmd: Command to execute
constraint: Node type constraint ("cpu" or "gpu")

Resource Limits

Maximum memory per node: 512GB
Maximum cores per node: 128
Default memory allocation: 2GB per CPU requested

QOS Options

shared: Default QOS for shared node usage
regular: Standard priority
preempt: Reduced credit usage but preemptible
premium: 5x throughput priority (limited usage)

Returns

true if job submission succeeds

Notes

The default is to use the shared QoS. The alternatives are regular, preempt (reduced credit usage, but jobs are not guaranteed to finish), and premium (priority runs, limited to 5x throughput). The maximum request is 512 GB of memory and 128 cores per node.

See https://docs.nersc.gov/jobs/policy/, https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes, and https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks for details.

Mycelia.next_prime_k — Method

Find the next prime number greater than current_k. For k-mer progression, we prefer odd numbers and especially primes.

Mycelia.node_type_to_dataframe — Method

node_type_to_dataframe(; node_type, graph)

Convert all nodes of a specific type in a MetaGraph to a DataFrame representation.

Arguments

node_type: The type of nodes to extract from the graph
graph: A MetaGraph containing the nodes

Returns

A DataFrame where:

Each row represents a node of the specified type
Columns correspond to all unique properties found across nodes
Values are JSON-serialized strings for consistency

Notes

All values are normalized through JSON serialization
Dictionary values receive double JSON encoding
The TYPE column is converted using type_to_string

Mycelia.normalize_codon_frequencies — Method

normalize_codon_frequencies(
    codon_frequencies
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Float64}}

Normalizes codon frequencies for each amino acid such that frequencies sum to 1.0.

Arguments

codon_frequencies: Nested dictionary mapping amino acids to their codon frequency distributions

Returns

Normalized codon frequencies where values for each amino acid sum to 1.0

Mycelia.normalize_countmap — Method

normalize_countmap(countmap) -> Dict

Normalize a dictionary of counts into a probability distribution where values sum to 1.0.

Arguments

countmap::Dict: Dictionary mapping keys to count values

Returns

Dict: New dictionary with same keys but values normalized by total sum
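
A minimal sketch of the normalization, assuming a non-empty dictionary with numeric values. The name `normalize_countmap_sketch` is hypothetical.

```julia
function normalize_countmap_sketch(countmap::Dict)
    total = sum(values(countmap))
    # Divide every count by the total so the values sum to 1.0
    return Dict(key => count / total for (key, count) in countmap)
end

frequencies = normalize_countmap_sketch(Dict("A" => 2, "C" => 1, "G" => 1))
```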

Mycelia.normalize_distance_matrix — Method

normalize_distance_matrix(distance_matrix) -> Any

Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is proportional to total feature count magnitude (size) and cosine similarity (relative frequency).

Normalize a distance matrix by dividing all elements by the maximum non-NaN value.

Arguments

distance_matrix: A matrix of distance values that may contain NaN, nothing, or missing values

Returns

Normalized distance matrix with values scaled to [0, 1] range

Details

Filters out NaN, nothing, and missing values when finding the maximum
All elements are divided by the same maximum value to preserve relative distances
If all values are NaN/nothing/missing, may return NaN values
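
The scaling step might look like the sketch below. The name `normalize_distance_matrix_sketch` is hypothetical, and returning the input unchanged when no usable value exists is a choice made for this sketch, not documented behavior.

```julia
function normalize_distance_matrix_sketch(distance_matrix::AbstractMatrix)
    # A value is usable if it is not nothing, not missing, and not NaN
    is_usable(x) = !(x === nothing || ismissing(x) || (x isa AbstractFloat && isnan(x)))
    usable_values = [x for x in distance_matrix if is_usable(x)]
    isempty(usable_values) && return distance_matrix
    max_value = maximum(usable_values)
    # missing and NaN propagate through division; nothing is passed through as-is
    return map(x -> x === nothing ? nothing : x / max_value, distance_matrix)
end

normalized = normalize_distance_matrix_sketch([0.0 2.0; 4.0 NaN])
```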

Mycelia.normalize_kmer_counts — Method

normalize_kmer_counts(
    kmer_counts
) -> OrderedCollections.OrderedDict

Convert raw k‑mer counts into normalized frequencies.

Arguments

kmer_counts::Dict: Mapping of k-mers to counts.

Returns

OrderedDict with values scaled so the sum equals 1.

Mycelia.normalize_vcf — Method

normalize_vcf(; reference_fasta, vcf_file)

Normalize a VCF file using bcftools norm, with automated handling of compression and indexing.

Arguments

reference_fasta::String: Path to the reference FASTA file used for normalization
vcf_file::String: Path to input VCF file (can be gzipped or uncompressed)

Returns

String: Path to the normalized, sorted, and compressed output VCF file (*.sorted.normalized.vcf.gz)

Notes

Requires bioconda packages: htslib, tabix, bcftools
Creates intermediate files with extensions .tbi for indices
Skips processing if output file already exists
Performs left-alignment and normalization of variants

Mycelia.normalized_current_date — Method

normalized_current_date() -> String

Returns the current date as a normalized string with all non-word characters removed.

The output format is based on ISO datetime (YYYYMMDD) but strips any special characters like hyphens, colons or dots.
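
One way to produce such a string with the Dates standard library is shown below. The name `normalized_date_sketch` and its optional `date` argument are hypothetical, and the regex-based stripping is an assumption consistent with the description.

```julia
import Dates

# Render the date in ISO form (YYYY-MM-DD), then strip every non-word character.
normalized_date_sketch(date::Dates.Date = Dates.today()) =
    replace(string(date), r"\W" => "")

normalized_date_sketch(Dates.Date(2024, 3, 5))
```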

Mycelia.normalized_current_datetime — Method

normalized_current_datetime() -> String

Returns the current date and time as a normalized string with all non-word characters removed.

The output format is based on ISO datetime (YYYYMMDDThhmmss) but strips any special characters like hyphens, colons or dots.

Mycelia.observe — Method

observe(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record};
    error_rate
)

Simulate sequencing of a DNA/RNA record by introducing random errors at the specified rate.

Arguments

record: A FASTA or FASTQ record containing the sequence to be "observed"
error_rate: Probability of error at each position (default: 0.0)

Returns

A new FASTQ.Record with:

Random UUID as identifier
Original record's description
Modified sequence with introduced errors
Generated quality scores

Mycelia.observe — Method

observe(sequence::BioSequences.LongSequence{T}; error_rate=nothing, tech::Symbol=:illumina) where T

Simulates the “observation” of a biological polymer (DNA, RNA, or protein) by introducing realistic errors along with base‐quality scores. The simulation takes into account both random and systematic error components. In particular, for technologies:

illumina: (mostly substitution errors) the per‐base quality decays along the read (from ~Q40 at the start to ~Q20 at the end);
nanopore: errors are more frequent and include both substitutions and indels (with overall lower quality scores, and an extra “homopolymer” penalty);
pacbio: errors are dominated by indels (with quality scores typical of raw reads);
ultima: (UG 100/ppmSeq™) correct bases are assigned very high quality (~Q60) while errors are extremely rare and, if they occur, are given a modest quality.

An error is introduced at each position with a (possibly position‐dependent) probability. For Illumina, the error probability increases along the read; additionally, if a base is part of a homopolymer run (length ≥ 3) and the chosen technology is one that struggles with homopolymers (nanopore, pacbio, ultima), then the local error probability is multiplied by a constant factor.

Returns a tuple (new_seq, quality_scores) where:

new_seq is a BioSequences.LongSequence{T} containing the “observed” sequence (which may be longer or shorter than the input if insertions or deletions occur), and
quality_scores is a vector of integers representing the Phred quality scores (using the Sanger convention) for each base in the output sequence.
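
For reference, a Phred score Q and an error probability p are related by Q = -10 · log10(p), and the Sanger convention encodes Q as the ASCII character with code Q + 33. The helper names below are hypothetical and are not part of Mycelia.

```julia
# Q -> probability that the base call is wrong
phred_to_error_probability(q::Integer) = 10.0^(-q / 10)

# Probability -> nearest integer Phred score
error_probability_to_phred(p::Real) = round(Int, -10 * log10(p))

# Sanger (Phred+33) character used in FASTQ quality strings
sanger_quality_char(q::Integer) = Char(q + 33)

# Q40 at the start of an Illumina read vs. Q20 at the end
start_error = phred_to_error_probability(40)
end_error = phred_to_error_probability(20)
```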

Mycelia.open_fastx — Method

open_fastx(path::AbstractString) -> Any

Open and return a reader for FASTA or FASTQ format files.

Arguments

path::AbstractString: Path to input file. Can be:
- Local file path
- HTTP/FTP URL
- Gzip compressed (.gz extension)

Supported formats

FASTA (.fasta, .fna, .faa, .fa)
FASTQ (.fastq, .fq)

Returns

FASTX.FASTA.Reader for FASTA files
FASTX.FASTQ.Reader for FASTQ files

Mycelia.open_genbank — Method

open_genbank(
    genbank_file
) -> Vector{GenomicAnnotations.Record}

Opens and parses a GenBank format file containing genomic sequence and annotation data.

Arguments

genbank_file::AbstractString: Path to the GenBank (.gb or .gbk) file

Returns

Vector{GenomicAnnotations.Record}: Vector containing the parsed records (sequence and annotation data)

Mycelia.open_gff — Method

open_gff(path::String) -> Any

Opens a GFF (General Feature Format) file for reading.

Arguments

path::String: Path to GFF file. Can be:
- Local file path
- HTTP/FTP URL (FTP URLs are automatically converted to HTTP)
- Gzipped file (automatically decompressed)

Returns

IO: An IO stream ready to read the GFF content

Mycelia.optimal_subsequence_length — Method

optimal_subsequence_length(
;
    error_rate,
    threshold,
    sequence_length,
    plot_result
)

Calculate the optimal subsequence length based on error rate distribution.

Arguments

error_rate: Single error rate or array of error rates (between 0 and 1)
threshold: Desired probability that a subsequence is error-free (default: 0.95)
sequence_length: Maximum sequence length to consider for plotting
plot_result: If true, returns a plot of probability vs. length

Returns

If plot_result=false: Integer representing optimal subsequence length
If plot_result=true: Tuple of (optimal_length, plot)

Examples

# Single error rate
optimal_subsequence_length(error_rate=0.01)

# Array of error rates
optimal_subsequence_length(error_rate=[0.01, 0.02, 0.01])

# With more stringent threshold
optimal_subsequence_length(error_rate=0.01, threshold=0.99)

# Generate plot
optimal_length, p = optimal_subsequence_length(error_rate=0.01, plot_result=true)
Plots.display(p)
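
A plausible reading of the calculation, assuming independent per-base errors at a single rate e: a subsequence of length L is error-free with probability (1 - e)^L, so the largest L meeting the threshold t is floor(log(t) / log(1 - e)). The sketch below implements only that single-rate case. The function name `error_free_length_sketch` is hypothetical, and how Mycelia treats arrays of error rates is not covered.

```julia
function error_free_length_sketch(; error_rate::Real, threshold::Real = 0.95)
    0 < error_rate < 1 || throw(ArgumentError("error_rate must be in (0, 1)"))
    0 < threshold < 1 || throw(ArgumentError("threshold must be in (0, 1)"))
    # Largest L such that (1 - error_rate)^L >= threshold
    return floor(Int, log(threshold) / log(1 - error_rate))
end

best_length = error_free_length_sketch(error_rate = 0.01)
```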

Mycelia.pairwise_distance_matrix — Method

pairwise_distance_matrix(
    matrix;
    dist_func = Distances.euclidean,
    show_progress = true,
    progress_desc = "Computing distances"
)

Compute a symmetric pairwise distance matrix between columns of matrix using the supplied distance function.

Arguments

matrix: Column-major matrix (features as rows, entities as columns)
dist_func: Function of the form f(a, b) returning the distance between two vectors (default: Distances.euclidean)
show_progress: Display progress bar if true (default: true)
progress_desc: Progress bar description (default: "Computing distances")

Returns

Symmetric N×N matrix of pairwise distances between columns (entities)
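
The column-wise computation can be sketched without Distances.jl or a progress bar. The name `pairwise_distance_matrix_sketch` is hypothetical, and the default distance here is a hand-written Euclidean norm standing in for Distances.euclidean.

```julia
import LinearAlgebra

function pairwise_distance_matrix_sketch(
    matrix::AbstractMatrix;
    dist_func = (a, b) -> LinearAlgebra.norm(a - b),
)
    n_entities = size(matrix, 2)
    distances = zeros(Float64, n_entities, n_entities)
    # Only the upper triangle is computed; each value is mirrored below the diagonal
    for i in 1:(n_entities - 1), j in (i + 1):n_entities
        d = dist_func(view(matrix, :, i), view(matrix, :, j))
        distances[i, j] = d
        distances[j, i] = d
    end
    return distances
end

counts = [0.0 3.0 0.0;
          0.0 4.0 0.0]
distances = pairwise_distance_matrix_sketch(counts)
```

Computing only the upper triangle halves the number of distance evaluations, which matters when dist_func is expensive.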

Mycelia.parallel_pyrodigal — Method

parallel_pyrodigal(normalized_fastas::Vector{String})

Runs Mycelia.run_pyrodigal on a list of FASTA files in parallel using Threads.

Arguments

normalized_fastas: A vector of strings, where each string is a path to a FASTA file.

Returns

A tuple containing two elements:

successes (Vector{Tuple{String, Any}}): A vector of tuples, each containing the filename and the result returned by a successful Mycelia.run_pyrodigal call.
failures (Vector{Tuple{String, String}}): A vector of tuples, each containing the filename and the error message string for a failed Mycelia.run_pyrodigal call.

Mycelia.parse_blast_report — Method

parse_blast_report(blast_report) -> DataFrames.DataFrame

Parse a BLAST output file into a structured DataFrame.

Expects BLAST output format 7. The plain tabular format 6 lacks the header comments and cannot be auto-parsed.

Arguments

blast_report::AbstractString: Path to a BLAST output file in format 7 (tabular with comments)

Returns

DataFrame: Table containing BLAST results with columns matching the header fields. Returns empty DataFrame if no hits found.

Details

Requires BLAST output format 7 (-outfmt 7), which includes header comments
Handles missing values (encoded as "N/A") automatically
Infers column types based on BLAST field names
Supports standard BLAST tabular fields including sequence IDs, scores, alignments and taxonomic information
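
Format 7 carries its column names in a "# Fields: " comment line, which is what makes auto-parsing possible. The sketch below reads the header and rows into plain vectors rather than a DataFrame. The name `parse_outfmt7_sketch` is hypothetical, and it omits the "N/A" handling and type inference described above.

```julia
function parse_outfmt7_sketch(io::IO)
    field_prefix = "# Fields: "
    header = String[]
    rows = Vector{String}[]
    for line in eachline(io)
        if startswith(line, field_prefix)
            # Column names are comma-separated after the prefix
            header = String.(split(line[(length(field_prefix) + 1):end], ", "))
        elseif !isempty(line) && !startswith(line, "#")
            # Data rows are tab-separated
            push!(rows, String.(split(line, '\t')))
        end
    end
    return (header = header, rows = rows)
end

report = IOBuffer(
    "# BLASTN 2.14.0+\n" *
    "# Query: read1\n" *
    "# Fields: query acc.ver, subject acc.ver, % identity\n" *
    "# 1 hits found\n" *
    "read1\tNC_000913.3\t99.2\n" *
    "# BLAST processed 1 queries\n",
)
parsed_report = parse_outfmt7_sketch(report)
```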

Mycelia.parse_gfa — Method

parse_gfa(gfa) -> MetaGraphs.MetaGraph{Int64, Float64}

Parse a GFA (Graphical Fragment Assembly) file into a MetaGraph representation.

Arguments

gfa: Path to GFA format file

Returns

A MetaGraph where:

Vertices represent segments (contigs)
Edges represent links between segments
Vertex properties include :id with segment identifiers
Graph property :records contains the original FASTA records

Format Support

Handles standard GFA v1 lines:

H: Header lines (skipped)
S: Segments (stored as nodes with FASTA records)
L: Links (stored as edges)
P: Paths (stored in paths dictionary)
A: HiFiAsm specific lines (skipped)
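
In GFA v1, each line is tab-separated and begins with its record type: an S line holds a segment name and sequence, and an L line holds the two linked segment names with their orientations. The sketch below collects segments and links into plain containers. The name `parse_gfa_sketch` is hypothetical, it does not build a MetaGraph, and it ignores paths.

```julia
function parse_gfa_sketch(io::IO)
    segments = Dict{String, String}()      # segment name => sequence
    links = Tuple{String, String}[]        # (from segment, to segment)
    for line in eachline(io)
        isempty(line) && continue
        fields = split(line, '\t')
        record_type = fields[1]
        if record_type == "S"
            segments[String(fields[2])] = String(fields[3])
        elseif record_type == "L"
            # L <from> <from orientation> <to> <to orientation> <overlap>
            push!(links, (String(fields[2]), String(fields[4])))
        end
        # H, P, A and any other record types are skipped in this sketch
    end
    return (segments = segments, links = links)
end

gfa_text = "H\tVN:Z:1.0\nS\tctg1\tACGT\nS\tctg2\tGTTA\nL\tctg1\t+\tctg2\t-\t2M\n"
parsed_gfa = parse_gfa_sketch(IOBuffer(gfa_text))
```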

Mycelia.parse_jsonl — Method

parse_jsonl(filepath::String) -> Vector{Dict{String, Any}}

Validate and parse a JSON Lines file (.jsonl or .ndjson, optionally gzipped) into a vector of dictionaries, reporting progress in bytes processed.

Validations performed:

Extension must be one of: .jsonl, .ndjson, .jsonl.gz, .ndjson.gz
File must exist
File size must be non-zero

Progress meter shows bytes read from the underlying file (compressed bytes for .gz). No second full pass is needed.
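
The up-front validations can be sketched with Base functions alone. The name `validate_jsonl_path_sketch`, the error type, and the messages are assumptions, and the JSON parsing and progress reporting are left out.

```julia
function validate_jsonl_path_sketch(filepath::String)
    valid_extensions = (".jsonl", ".ndjson", ".jsonl.gz", ".ndjson.gz")
    any(ext -> endswith(filepath, ext), valid_extensions) ||
        throw(ArgumentError("unsupported extension: $(filepath)"))
    isfile(filepath) || throw(ArgumentError("file does not exist: $(filepath)"))
    filesize(filepath) > 0 || throw(ArgumentError("file is empty: $(filepath)"))
    return filepath
end
```

Checking the extension before touching the file system lets the cheapest test fail first.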

Mycelia.parse_mmseqs_easy_taxonomy_lca_tsv — Method

parse_mmseqs_easy_taxonomy_lca_tsv(
    lca_tsv
) -> DataFrames.DataFrame

Parse the taxonomic Last Common Ancestor (LCA) TSV output from MMseqs2's easy-taxonomy workflow.

Arguments

lca_tsv: Path to the TSV file containing MMseqs2 taxonomy results

Returns

DataFrame with columns:

contig_id: Sequence identifier
taxon_id: NCBI taxonomy identifier
taxon_rank: Taxonomic rank (e.g. species, genus)
taxon_name: Scientific name
fragments_retained: Number of fragments kept
fragments_taxonomically_assigned: Number of fragments with taxonomy
fragments_in_agreement_with_assignment: Fragments matching contig taxonomy
support -log(E-value): Statistical support score

Mycelia.parse_mmseqs_easy_taxonomy_tophit_report — Method

parse_mmseqs_easy_taxonomy_tophit_report(
    tophit_report
) -> DataFrames.DataFrame

Parse an MMseqs2 easy-taxonomy tophit report into a structured DataFrame.

Arguments

tophit_report::String: Path to the MMseqs2 easy-taxonomy tophit report file (tab-delimited)

Returns

DataFrame: A DataFrame with columns:
- target_id: Target sequence identifier
- number of sequences aligning to target: Count of aligned sequences
- unique coverage of target: Ratio of uniqueAlignedResidues to targetLength
- Target coverage: Ratio of alignedResidues to targetLength
- Average sequence identity: Mean sequence identity
- taxon_id: Taxonomic identifier
- taxon_rank: Taxonomic rank
- taxon_name: Species name and lineage

Mycelia.parse_mmseqs_tophit_aln — Method

parse_mmseqs_tophit_aln(tophit_aln) -> DataFrames.DataFrame

Parse MMseqs2 tophit alignment output file into a structured DataFrame.

Arguments

tophit_aln::AbstractString: Path to tab-delimited MMseqs2 alignment output file

Returns

DataFrame with columns:

query: Query sequence/profile identifier
target: Target sequence/profile identifier
percent identity: Sequence identity percentage
alignment length: Length of alignment
number of mismatches: Count of mismatched positions
number of gaps: Count of gap openings
query start: Start position in query sequence
query end: End position in query sequence
target start: Start position in target sequence
target end: End position in target sequence
evalue: E-value of alignment
bit score: Bit score of alignment

Mycelia.parse_qualimap_contig_coverage — Method

parse_qualimap_contig_coverage(
    qualimap_report_txt
) -> DataFrames.DataFrame

Parse contig coverage statistics from a Qualimap BAM QC report file.

Arguments

qualimap_report_txt::String: Path to Qualimap bamqc report text file

Returns

DataFrame: Coverage statistics with columns:
- Contig: Contig identifier
- Length: Contig length in bases
- Mapped bases: Number of bases mapped to contig
- Mean coverage: Average coverage depth
- Standard Deviation: Standard deviation of coverage
- % Mapped bases: Percentage of total mapped bases on this contig

Supported Assemblers

Handles output from both SPAdes and MEGAHIT assemblers:

SPAdes format: NODE_X_length_Y_cov_Z
MEGAHIT format: kXX_Y

Parse the contig coverage information from qualimap bamqc text report, which looks like the following:

# this is spades
>>>>>>> Coverage per contig

	NODE_1_length_107478_cov_9.051896	107478	21606903	201.0355886786133	60.39424208607496
	NODE_2_length_5444_cov_1.351945	5444	153263	28.152645113886848	5.954250612823136
	NODE_3_length_1062_cov_0.154390	1062	4294	4.043314500941619	1.6655384692688975
	NODE_4_length_776_cov_0.191489	776	3210	4.13659793814433	2.252009588980858

# below is megahit
>>>>>>> Coverage per contig

	k79_175	235	3862	16.43404255319149	8.437436249612457
	k79_89	303	3803	12.551155115511552	5.709975376279777
	k79_262	394	6671	16.931472081218274	7.579217802849293
	k79_90	379	1539	4.060686015831134	1.2929729111266581
	k79_91	211	3749	17.767772511848342	11.899185693011933
	k79_0	2042	90867	44.49902056807052	18.356525483516613

To make this more robust, consider reading the contig names from the assembled FASTA file.
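
The layout shown above can be read with a few lines of plain Julia. The following is a minimal sketch, assuming the report text is already in memory and that each data row has exactly five tab-separated fields; `coverage_rows` is a hypothetical name, and it returns named tuples rather than the DataFrame that Mycelia produces.

```julia
# Hypothetical sketch: pull the rows that follow the
# ">>>>>>> Coverage per contig" marker out of a Qualimap text report.
function coverage_rows(report::AbstractString)
    rows = NamedTuple[]
    in_section = false
    for line in split(report, '\n')
        if startswith(line, ">>>>>>> Coverage per contig")
            in_section = true
            continue
        end
        in_section || continue
        fields = split(strip(line), '\t')
        length(fields) == 5 || continue   # skip blank or unrelated lines
        push!(rows, (
            contig = String(fields[1]),
            contig_length = parse(Int, fields[2]),
            mapped_bases = parse(Int, fields[3]),
            mean_coverage = parse(Float64, fields[4]),
            std_dev = parse(Float64, fields[5]),
        ))
    end
    return rows
end

report = ">>>>>>> Coverage per contig\n\n" *
         "\tk79_175\t235\t3862\t16.43404255319149\t8.437436249612457\n" *
         "\tk79_89\t303\t3803\t12.551155115511552\t5.709975376279777\n"

rows = coverage_rows(report)
for row in rows
    println(row.contig, " => ", row.mean_coverage)
end
```

Because the rows are located by the section marker rather than by contig name, the same sketch handles both the SPAdes and MEGAHIT naming styles.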

Mycelia.parse_rtg_eval_output — Method

parse_rtg_eval_output(f) -> DataFrames.DataFrame

Parse RTG evaluation output from a gzipped tab-separated file.

Arguments

f: Path to a gzipped TSV file containing RTG evaluation output

Format

Expected file format:

Header line starting with '#' and tab-separated column names
Data rows in tab-separated format
Empty files return a DataFrame with empty columns matching header

Returns

A DataFrame where:

Column names are taken from the header line (stripped of '#')
Data is parsed as Float64 values
Empty files result in empty columns preserving header structure
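
The header handling described above can be sketched without any dependencies. The example assumes already-decompressed text, a single header line, and purely numeric data; `parse_hash_header_tsv` and the column names `score` and `precision` are illustrative assumptions, not the exact RTG output layout.

```julia
# Hypothetical sketch: read a '#'-prefixed header and Float64 columns.
function parse_hash_header_tsv(text::AbstractString)
    lines = filter(!isempty, split(text, '\n'))
    header = split(lstrip(first(lines), '#'), '\t')   # strip the leading '#'
    columns = Dict(String(name) => Float64[] for name in header)
    for line in lines[2:end]
        for (name, value) in zip(header, split(line, '\t'))
            push!(columns[String(name)], parse(Float64, value))
        end
    end
    return columns
end

columns = parse_hash_header_tsv("#score\tprecision\n10.0\t0.5\n20.0\t0.75\n")
println(columns["score"])

# A header-only input keeps the column names but holds no rows
header_only = parse_hash_header_tsv("#score\tprecision\n")
println(length(header_only["precision"]))
```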

Mycelia.parse_transterm_output — Method

parse_transterm_output(
    transterm_output
) -> DataFrames.DataFrame

Parse TransTerm terminator prediction output into a structured DataFrame.

Takes a TransTerm output file path and returns a DataFrame containing parsed terminator predictions. Each row represents one predicted terminator with the following columns:

chromosome: Identifier of the sequence being analyzed
term_id: Unique terminator identifier (e.g. "TERM 19")
start: Start position of the terminator
stop: End position of the terminator
strand: Strand orientation ("+" or "-")
location: Context type, where:
- G/g = in gene interior (≥50bp from ends)
- F/f = between two +strand genes
- R/r = between two -strand genes
- T = between ends of +strand and -strand genes
- H = between starts of +strand and -strand genes
- N = none of the above
Lowercase indicates opposite strand from region
confidence: Overall confidence score (0-100)
hairpin_score: Hairpin structure score
tail_score: Tail sequence score
notes: Additional annotations (e.g. "bidir")

Arguments

transterm_output::AbstractString: Path to TransTerm output file

Returns

DataFrame: Parsed terminator predictions with columns as described above

See TransTerm HP documentation for details on scoring and location codes.

Mycelia.parse_virsorter_score_tsv — Method

parse_virsorter_score_tsv(
    virsorter_score_tsv
) -> DataFrames.DataFrame

Parse a VirSorter score TSV file and return a DataFrame.

Arguments

virsorter_score_tsv::String: The file path to the VirSorter score TSV file.

Returns

DataFrame: A DataFrame containing the parsed data from the TSV file. If the file is empty, returns a DataFrame with the appropriate headers but no data.

Mycelia.parse_xam_to_summary_table — Method

parse_xam_to_summary_table(xam) -> DataFrames.DataFrame

Parse a SAM/BAM file into a summary DataFrame containing alignment metadata.

Arguments

xam::AbstractString: Path to input SAM (.sam), BAM (.bam), or gzipped SAM (.sam.gz) file

Returns

DataFrame with columns:

template: Read name
flag: SAM flag
reference: Reference sequence name
position: Alignment position range (start:end)
mappingquality: Mapping quality score
alignment_score: Alignment score (AS tag)
isprimary: Whether alignment is primary
alignlength: Length of the alignment
ismapped: Whether read is mapped
mismatches: Number of mismatches (NM tag)

Note: Only mapped reads are included in the output DataFrame.

Mycelia.path_to_sequence — Method

path_to_sequence(kmers, path) -> Any

Convert a path through k-mers into a single DNA sequence.

Takes a vector of k-mers and a path representing the order to traverse them, reconstructs the original sequence by joining the k-mers according to the path. The first k-mer is used in full, then only the last nucleotide from each subsequent k-mer is added.

Arguments

kmers: Vector of DNA k-mers (as LongDNA{4})
path: Vector of tuples representing the path through the k-mers

Returns

LongDNA{4}: The reconstructed DNA sequence
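
The joining rule (first k-mer in full, then one trailing base per step) can be sketched with plain strings. This is a simplification under stated assumptions: `join_kmer_path` is a hypothetical helper, k-mers are `String`s rather than `LongDNA{4}`, and the path is a vector of indices rather than the tuples Mycelia uses.

```julia
# Hypothetical simplification of the k-mer path joining rule.
function join_kmer_path(kmers::Vector{String}, path::Vector{Int})
    sequence = kmers[first(path)]              # first k-mer contributes all k bases
    for index in path[2:end]
        sequence *= last(kmers[index])         # later k-mers contribute one base each
    end
    return sequence
end

kmers = ["ACG", "CGT", "GTA"]
println(join_kmer_path(kmers, [1, 2, 3]))
```

A path of n overlapping k-mers therefore yields a sequence of length k + n - 1.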

Mycelia.pca_transform — Method

pca_transform(
  M::AbstractMatrix{<:Real};
  k::Int = 0,
  var_prop::Float64 = 1.0
)

Perform standard PCA on M (features × samples), returning enough PCs to either:

match a user‐supplied k > 0, or
explain at least var_prop of the total variance (0 < var_prop ≤ 1).

By default (k=0, var_prop=1.0), this will capture all variance, i.e. use min(n_samples-1, n_features) components.

When to use

Use for real-valued, continuous, and approximately Gaussian data. PCA is most suitable when features are linearly related and data is centered and scaled. Not ideal for count, binary, or highly skewed data.

Returns

A NamedTuple with fields

model : the fitted MultivariateStats.PCA object
scores : k×n_samples matrix of PC scores
loadings : k×n_features matrix of PC loadings
chosen_k : the number of components actually used
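
One way to picture the var_prop rule is to accumulate the per-component variances until the requested share is reached. The sketch below assumes the variances are already known; `components_for_variance` is a hypothetical helper and may differ from how Mycelia or MultivariateStats selects the component count internally.

```julia
# Hypothetical helper: smallest number of components whose cumulative
# share of the total variance reaches `var_prop`.
function components_for_variance(variances::Vector{Float64}, var_prop::Float64)
    sorted = sort(variances; rev = true)
    cumulative_share = cumsum(sorted) ./ sum(sorted)
    # small tolerance so that var_prop = 1.0 is reachable despite rounding
    return findfirst(share -> share >= var_prop - 1e-12, cumulative_share)
end

variances = [5.0, 3.0, 1.5, 0.5]
println(components_for_variance(variances, 0.8))
println(components_for_variance(variances, 1.0))
```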

Mycelia.pcoa_from_dist — Method

pcoa_from_dist(D::AbstractMatrix{<:Real}; maxoutdim::Int = 2)

Perform Principal Coordinates Analysis directly from a precomputed distance matrix D (n_samples×n_samples).

Keyword arguments

maxoutdim : target embedding dimension (default=2)

Returns

NamedTuple with fields

model : the fitted MultivariateStats.MDS model
coordinates: maxoutdim×n_samples matrix of embedded coordinates

Mycelia.phred_to_error_probability — Method

phred_to_error_probability(phred_score::UInt8) -> Float64

Convert PHRED quality score to error probability.

Mycelia.phred_to_probability — Method

phred_to_probability(phred_score::UInt8) -> Float64

Convert PHRED quality score to probability of correctness.

PHRED score Q relates to error probability P by: Q = -10 * log10(P)

Therefore, correctness probability = 1 - P = 1 - 10^(-Q/10)
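
The two conversions can be written out directly from the formula. The names `phred_error` and `phred_correct` below are hypothetical stand-ins for illustration, not the Mycelia functions themselves.

```julia
# Error probability implied by a Phred score: P = 10^(-Q/10).
# The score is converted to Float64 first because negating a UInt8 wraps around.
phred_error(q::Integer) = 10.0^(-Float64(q) / 10)

# Probability that the base call is correct: 1 - P
phred_correct(q::Integer) = 1.0 - phred_error(q)

for q in (10, 20, 30)
    println("Q", q, ": error = ", phred_error(q), ", correct = ", phred_correct(q))
end
```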

Mycelia.pixels_to_points — Method

pixels_to_points(pixels) -> Any

Convert pixel measurements to point measurements using the standard 4:3 ratio.

Points are the standard unit for typography (1 point = 1/72 inch), while pixels are used for screen measurements. This conversion uses the conventional 4:3 ratio where 3 points equal 4 pixels.

Arguments

pixels: The number of pixels to convert

Returns

The equivalent measurement in points
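
The 4:3 convention amounts to a single multiplication in each direction. A minimal sketch, using the hypothetical names `px_to_pt` and `pt_to_px` to avoid confusion with the Mycelia functions:

```julia
# 3 points correspond to 4 pixels under the 4:3 convention
px_to_pt(pixels) = pixels * 3 / 4
pt_to_px(points) = points * 4 / 3

println(px_to_pt(16))
println(pt_to_px(12))
```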

Mycelia.plot_embeddings — Method

Plot embeddings with optional true and fitted cluster labels using Makie.jl. The legend is placed outside the axes; points are colored by fitted label and shaped by true label.

Arguments

embeddings::Matrix{<:Real}: 2D embedding matrix where each column is a data point
title::String: Title of the plot
xlabel::String: Label for the x-axis
ylabel::String: Label for the y-axis
true_labels::Vector{<:Integer}: Vector of true cluster labels (optional)
fit_labels::Vector{<:Integer}: Vector of fitted cluster labels (optional)

Returns

Makie.Figure: Figure object that can be displayed or saved

Mycelia.plot_graph — Method

plot_graph(graph) -> Any

Creates a visualization of a kmer graph where nodes represent kmers and their sizes reflect counts.

Arguments

graph: A MetaGraph where vertices have :kmer and :count properties

Returns

A Plots.jl plot object showing the graph visualization

Details

Node sizes are scaled based on kmer counts
Plot dimensions scale logarithmically with number of vertices
Each node is labeled with its kmer sequence

Mycelia.plot_kmer_frequency_spectra — Method

plot_kmer_frequency_spectra(
    counts;
    log_scale,
    kwargs...
) -> Plots.Plot

Creates a scatter plot visualizing the k-mer frequency spectrum: the relationship between k-mer frequencies and how many k-mers occur at each frequency.

Returns the plot object so that further layers can be added before saving.

Arguments

counts::AbstractVector{<:Integer}: Vector of k-mer counts/frequencies
log_scale::Union{Function,Nothing} = log2: Function to apply logarithmic scaling to both axes. Set to nothing to use linear scaling.
kwargs...: Additional keyword arguments passed to StatsPlots.plot()

Returns

Plots.Plot: A scatter plot object that can be further modified or saved

Details

The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.
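
The data behind such a plot is a tally of how many distinct k-mers share each count. The sketch below computes that tally and prints the log2-transformed coordinates; `frequency_spectrum` is a hypothetical helper and no plotting package is involved.

```julia
# Hypothetical helper: tally how many distinct k-mers share each count.
function frequency_spectrum(counts::AbstractVector{<:Integer})
    spectrum = Dict{Int, Int}()
    for c in counts
        spectrum[c] = get(spectrum, c, 0) + 1
    end
    # vector of (frequency => number of k-mers) pairs, ordered by frequency
    return sort(collect(spectrum); by = first)
end

counts = [1, 1, 1, 2, 2, 5]
spectrum = frequency_spectrum(counts)
for (frequency, n_kmers) in spectrum
    println(log2(frequency), "\t", log2(n_kmers))
end
```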

Mycelia.plot_kmer_rarefaction — Method

plot_kmer_rarefaction(
    rarefaction_data_path::AbstractString;
    output_dir,
    output_basename,
    display_plot,
    fig_size,
    title,
    xlabel,
    ylabel,
    line_color,
    line_style,
    marker,
    markersize,
    axis_kwargs...
) -> Union{Nothing, Makie.Figure}

Plots a k-mer rarefaction curve from data stored in a TSV file. The TSV file should contain two columns:

Number of FASTA files processed.
Cumulative unique k-mers observed at that point.

The plot is displayed and saved in PNG, PDF, and SVG formats.

Arguments

rarefaction_data_path::AbstractString: Path to the TSV file containing rarefaction data.
output_dir::AbstractString: Directory where the output plots will be saved. Defaults to the directory of rarefaction_data_path.
output_basename::AbstractString: Basename for the output plot files (without extension). Defaults to the basename of rarefaction_data_path without its original extension.
display_plot::Bool: Whether to display the plot interactively. Defaults to true.

Keyword Arguments

fig_size::Tuple{Int, Int}: Size of the output figure, e.g., (1000, 750).
title::AbstractString: Title of the plot.
xlabel::AbstractString: Label for the x-axis.
ylabel::AbstractString: Label for the y-axis.
line_color: Color of the plotted line.
line_style: Style of the plotted line (e.g. :dash, :dot).
marker: Marker style for points (e.g. :circle, :xcross).
markersize::Number: Size of the markers.
Any other keyword arguments will be passed to Makie.Axis.

Mycelia.plot_optimal_cluster_assessment_results — Method

plot_optimal_cluster_assessment_results(
    clustering_results
) -> Any

Visualizes cluster assessment metrics and saves the resulting plots.

Arguments

clustering_results: A named tuple containing:
- ks_assessed: Vector of k values tested
- within_cluster_sum_of_squares: Vector of WCSS scores
- silhouette_scores: Vector of silhouette scores
- optimal_number_of_clusters: Integer indicating optimal k

Details

Creates two plots:

Within-cluster sum of squares (WCSS) vs number of clusters
Silhouette scores vs number of clusters

Both plots include a vertical line indicating the optimal number of clusters.

Outputs

Saves two SVG files in the project directory:

wcss.svg: WCSS plot
silhouette.svg: Silhouette scores plot

Mycelia.plot_per_base_quality — Method

plot_per_base_quality(fastq_file::String; max_position::Union{Int,Nothing}=nothing, sample_size::Union{Int,Nothing}=nothing)

Create per-base quality boxplots for FASTQ data, similar to FastQC output.

Arguments

fastq_file::String: Path to FASTQ file to analyze
max_position::Union{Int,Nothing}=nothing: Maximum read position to plot (default: auto-detect from data)
sample_size::Union{Int,Nothing}=nothing: Number of reads to sample for analysis (default: use all reads)

Returns

Plots.Plot: Boxplot showing quality distribution at each base position

Examples

# Basic per-base quality plot
p = Mycelia.plot_per_base_quality("reads.fastq")

# Plot first 100 positions only, sampling 10000 reads
p = Mycelia.plot_per_base_quality("reads.fastq", max_position=100, sample_size=10000)

Notes

Quality scores are displayed in Phred scale
Green zone: Q>=30 (high quality)
Yellow zone: Q20-29 (medium quality)
Red zone: Q<20 (low quality)
For large files, consider using sample_size to improve performance

Mycelia.plot_taxa_abundances — Method

plot_taxa_abundances(
    df::DataFrames.DataFrame, 
    taxa_level::String; 
    top_n::Int = 10,
    sample_id_col::String = "sample_id",
    filter_taxa::Union{Vector{Union{String, Missing}}, Nothing} = nothing,
    figure_width::Int = 1500,
    figure_height::Int = 1000,
    bar_width::Float64 = 0.7,
    x_rotation::Int = 45,
    sort_samples::Bool = true,
    color_seed::Union{Int, Nothing} = nothing,
    legend_fontsize::Float64 = 12.0,
    legend_itemsize::Float64 = 12.0,
    legend_padding::Float64 = 5.0,
    legend_rowgap::Float64 = 1.0,
    legend_labelwidth::Union{Nothing, Float64} = nothing,
    legend_titlesize::Float64 = 15.0,
    legend_nbanks::Int = 1
)

Create a stacked bar chart showing taxa relative abundances for each sample.

Arguments

df: DataFrame with sample_id and taxonomic assignments at different levels
taxa_level: Taxonomic level to analyze (e.g., "genus", "species")
top_n: Number of top taxa to display individually, remainder grouped as "Other"
sample_id_col: Column name containing sample identifiers
filter_taxa: Taxa to exclude from visualization (default: nothing - no filtering)
figure_width: Width of the figure in pixels
figure_height: Height of the figure in pixels
bar_width: Width of each bar (between 0 and 1)
x_rotation: Rotation angle for x-axis labels in degrees
sort_samples: Whether to sort samples alphabetically
color_seed: Seed for reproducible color generation
legend_fontsize: Font size for legend entries
legend_itemsize: Size of the colored marker/icon in the legend
legend_padding: Padding around legend elements
legend_rowgap: Space between legend rows
legend_labelwidth: Maximum width for legend labels (truncation)
legend_titlesize: Font size for legend title
legend_nbanks: Number of legend columns

Returns

fig: CairoMakie figure object
ax: CairoMakie axis object
taxa_colors: Dictionary mapping taxa to their assigned colors
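
The top_n behavior can be pictured as ranking taxa by count, keeping the largest, and pooling the rest. The following is a sketch for a single sample, assuming counts are held in a `Dict`; `top_n_with_other` is a hypothetical helper, not the code path used by the plotting function.

```julia
# Hypothetical helper: relative abundances with the tail pooled as "Other".
function top_n_with_other(counts::Dict{String, Int}, top_n::Int)
    ranked = sort(collect(counts); by = last, rev = true)
    total = sum(values(counts))
    kept = ranked[1:min(top_n, length(ranked))]
    abundances = Dict(taxon => n / total for (taxon, n) in kept)
    remainder = sum(Int[n for (_, n) in ranked[length(kept)+1:end]])
    if remainder > 0
        abundances["Other"] = remainder / total
    end
    return abundances
end

sample_counts = Dict("Escherichia" => 50, "Bacillus" => 30, "Vibrio" => 15, "Listeria" => 5)
abundances = top_n_with_other(sample_counts, 2)
println(abundances)
```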

Mycelia.points_to_pixels — Method

points_to_pixels(points) -> Any

Convert typographic points to pixels using a 4:3 ratio (1 point = 4/3 pixels).

Arguments

points: Size in typographic points (pt)

Returns

Size in pixels (px)

Mycelia.poisson_pca_epca — Method

poisson_pca_epca(M::AbstractMatrix{<:Integer}; k::Int=0)

Perform Poisson EPCA on a count matrix M (features × samples).

When to use

Use for non-negative integer count data, such as raw event or read counts.

Returns

A NamedTuple with

model : the fitted ExpFamilyPCA.PoissonEPCA object
scores : k×n_samples matrix of low‐dimensional sample scores
loadings : k×n_features matrix of feature loadings

Mycelia.polish_assembly — Method

polish_assembly(assembly::AssemblyResult, reads; iterations=3) -> AssemblyResult

Polish assembled contigs using quality-aware error correction.

Arguments

assembly: Initial assembly result to polish
reads: Original reads for polishing (FASTQ with quality scores preferred)
iterations: Number of polishing iterations (default: 3)

Returns

AssemblyResult: Polished assembly with improved accuracy

Details

Uses Phase 2 enhanced Viterbi algorithms with quality score integration for:

Error correction based on k-mer graph traversals
Consensus calling from multiple observations
Iterative improvement until convergence

Mycelia.polish_fastq — Method

polish_fastq(; fastq, k)

Polish FASTQ reads using a k-mer graph-based approach to correct potential sequencing errors.

Arguments

fastq::String: Path to input FASTQ file
k::Int=1: Initial k-mer size parameter. Final assembly k-mer size may differ.

Process

Builds a directed k-mer graph from input reads
Processes each read through the graph to find optimal paths
Writes corrected reads to a new FASTQ file
Automatically compresses output with gzip

Returns

Named tuple with:

fastq::String: Path to output gzipped FASTQ file
k::Int: Final assembly k-mer size used

Mycelia.polish_sequence_next — Function

polish_sequence_next(graph::MetaGraph, sequence::String, config::ViterbiConfig) -> ViterbiPath

Polish a single sequence using Viterbi algorithm on k-mer graph.

Mycelia.position_wise_joint_probability — Method

position_wise_joint_probability(
    qualmers::Vector{<:Mycelia.Qualmer};
    use_log_space
) -> Float64

Calculate position-wise joint probability for multiple qualmer observations. This is more sophisticated than joint_qualmer_probability as it considers the quality at each position across all observations.

Arguments

qualmers: Vector of Qualmer observations of the same k-mer sequence
use_log_space: Use log-space arithmetic for numerical stability (default: true)

Returns

Float64: Position-wise joint probability

Mycelia.precision_recall_f1 — Method

precision_recall_f1(true_labels, pred_labels)

Returns dictionaries mapping each label to its precision, recall, and F1 score. Also returns macro-averaged (unweighted mean) precision, recall, F1, and a grouped bar plot.
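
The per-label metrics follow from true-positive, false-positive, and false-negative counts. The sketch below omits the bar plot and treats undefined ratios as 0.0, which is an assumption; `per_label_metrics` is a hypothetical name.

```julia
# Hypothetical sketch of per-label precision, recall, and F1.
function per_label_metrics(true_labels, pred_labels)
    labels = sort(unique(vcat(true_labels, pred_labels)))
    precisions = Dict{eltype(labels), Float64}()
    recalls = Dict{eltype(labels), Float64}()
    f1s = Dict{eltype(labels), Float64}()
    idx = eachindex(true_labels)
    for label in labels
        tp = count(i -> true_labels[i] == label && pred_labels[i] == label, idx)
        fp = count(i -> true_labels[i] != label && pred_labels[i] == label, idx)
        fn = count(i -> true_labels[i] == label && pred_labels[i] != label, idx)
        p = tp + fp == 0 ? 0.0 : tp / (tp + fp)   # assumption: 0.0 when undefined
        r = tp + fn == 0 ? 0.0 : tp / (tp + fn)
        precisions[label] = p
        recalls[label] = r
        f1s[label] = p + r == 0 ? 0.0 : 2 * p * r / (p + r)
    end
    macro_f1 = sum(values(f1s)) / length(labels)  # unweighted mean over labels
    return (precision = precisions, recall = recalls, f1 = f1s, macro_f1 = macro_f1)
end

truth = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
metrics = per_label_metrics(truth, predicted)
println(metrics.f1)
println(metrics.macro_f1)
```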

Mycelia.prefetch — Method

prefetch(; SRR, outdir)

Downloads Sequence Read Archive (SRA) data using the prefetch tool from sra-tools.

Arguments

SRR: SRA accession number (e.g., "SRR12345678")
outdir: Directory where the downloaded data will be saved. Defaults to current directory.

Notes

Requires sra-tools, which will be installed in a Conda environment
Downloads are saved in .sra format
Internet connection required

Mycelia.prefetch_sra_runs — Method

Prefetches multiple SRA runs in parallel.

Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.

Arguments

srr_identifiers: Vector of SRA run identifiers
outdir: Output directory for prefetched files (default: current directory)
max_parallel: Maximum number of parallel downloads (default: 4)

Returns

Vector of named tuples with prefetch results for each SRA run

Example

runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")

Mycelia.probabilistic_walk_next — Method

probabilistic_walk_next(
    graph::MetaGraphsNext.MetaGraph,
    start_vertex::String,
    max_steps::Int64;
    seed
) -> Mycelia.GraphPath

Perform a probabilistic walk through the strand-aware k-mer graph.

This algorithm follows edges based on their probability weights, respecting strand orientation constraints. The walk continues until max_steps is reached or no valid transitions are available.

Arguments

graph: MetaGraphsNext k-mer graph with strand-aware edges
start_vertex: Starting k-mer (vertex label)
max_steps: Maximum number of steps to take
seed: Random seed for reproducibility (optional)

Returns

GraphPath: Complete path with probability information

Algorithm

Start at given vertex with forward strand orientation
At each step, calculate transition probabilities based on edge weights
Sample next vertex according to probabilities
Update cumulative probability and continue
Respect strand orientation constraints from edge metadata
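
The sampling steps above can be sketched on a toy adjacency list. This is a simplified illustration under stated assumptions: `weighted_walk` is hypothetical, the graph is a plain `Dict` rather than a MetaGraphsNext graph, and strand orientation constraints are left out entirely.

```julia
import Random

# Hypothetical toy graph: vertex => list of (neighbor, edge weight).
function weighted_walk(graph::Dict{String, Vector{Tuple{String, Float64}}},
                       start_vertex::String, max_steps::Int; seed::Int = 42)
    rng = Random.MersenneTwister(seed)
    path = [start_vertex]
    total_probability = 1.0
    current = start_vertex
    for _ in 1:max_steps
        edges = get(graph, current, Tuple{String, Float64}[])
        isempty(edges) && break                      # no valid transitions left
        total_weight = sum(weight for (_, weight) in edges)
        draw = rand(rng) * total_weight
        cumulative = 0.0
        chosen = last(edges)                         # fallback guards against rounding
        for edge in edges
            cumulative += edge[2]
            if draw < cumulative
                chosen = edge
                break
            end
        end
        total_probability *= chosen[2] / total_weight   # update cumulative probability
        current = chosen[1]
        push!(path, current)
    end
    return (vertices = path, total_probability = total_probability)
end

linear_graph = Dict("ATC" => [("TCG", 1.0)], "TCG" => [("CGA", 2.0)])
walk = weighted_walk(linear_graph, "ATC", 10)
println(walk.vertices)
println(walk.total_probability)
```

With a single outgoing edge per vertex the walk is deterministic, so the path probability stays at 1.0; branching vertices would multiply in the normalized weight of each sampled edge.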

Example

graph = build_kmer_graph_next(DNAKmer{15}, observations)
path = probabilistic_walk_next(graph, "ATCGATCGATCGATC", 100)
println("Assembled sequence: $(path.sequence)")
println("Path probability: $(path.total_probability)")

Mycelia.process_fastq_record — Method

process_fastq_record(
;
    record,
    kmer_graph,
    yen_k_shortest_paths_and_weights,
    yen_k
)

Process and error-correct a FASTQ sequence record using a kmer graph and path resampling.

Arguments

record: FASTQ record containing the sequence to process
kmer_graph: MetaGraph containing the kmer network and associated properties
yen_k_shortest_paths_and_weights: Cache of pre-computed k-shortest paths between nodes
yen_k: Number of alternative paths to consider during resampling (default: 3)

Description

Performs error correction by:

Trimming low-quality sequence ends
Identifying stretches requiring resampling between solid branching kmers
Selecting alternative paths through the kmer graph based on:
- Path quality scores
- Transition likelihoods
- Path length similarity to original sequence

Returns

Modified FASTQ record with error-corrected sequence and updated quality scores
Original record if no error correction was needed

Required Graph Properties

The kmer_graph must contain the following properties:

:ordered_kmers
:likely_valid_kmer_indices
:kmer_indices
:branching_nodes
:assembly_k
:transition_likelihoods
:kmer_mean_quality
:kmer_total_quality

Mycelia.q_value_to_error_rate — Method

q_value_to_error_rate(q_value) -> Any

Convert a Phred quality score (Q-value) to a probability of error.

Arguments

q_value: Phred quality score, typically ranging from 0 to 40

Returns

Error probability in range [0,1], where 0 indicates highest confidence

A Q-value of 10 corresponds to an error rate of 0.1 (10%), while a Q-value of 30 corresponds to an error rate of 0.001 (0.1%).

Mycelia.qc_filter_long_reads_fastplong — Method

qc_filter_long_reads_fastplong(
;
    in_fastq,
    report_title,
    out_fastq,
    html_report,
    json_report,
    min_length,
    max_length
)

Perform QC filtering on long-read FASTQ files using fastplong.

Arguments

in_fastq::String: Path to the input FASTQ file.
out_fastq::String: Path to the output FASTQ file.
quality_threshold::Int: Minimum average quality to retain a read (default 10).
min_length::Int: Minimum read length (default 1000).
max_length::Int=0: Maximum read length (default 0, no maximum).

Returns

String: Path to the filtered FASTQ file.

Details

This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.

Mycelia.qc_filter_long_reads_filtlong — Method

qc_filter_long_reads_filtlong(
;
    in_fastq,
    out_fastq,
    min_mean_q,
    keep_percent
)

Filter and process long reads from a FASTQ file using Filtlong.

This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.

Arguments

in_fastq::String: Path to the input FASTQ file.
out_fastq::String: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.
min_mean_q::Int: Minimum mean quality score for reads to be kept. Default is 20.
keep_percent::Int: Percentage of reads to keep after filtering. Default is 95.

Returns

out_fastq

Details

This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.

Mycelia.qc_filter_short_reads_fastp — Method

qc_filter_short_reads_fastp(
;
    forward_reads,
    reverse_reads,
    out_forward,
    out_reverse,
    report_title,
    html,
    json
)

Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.

Arguments

in_fastq::String: Path to the input FASTQ file.
out_fastq::String: Path to the output FASTQ file.
adapter_seq::String: Adapter sequence to trim.
quality_threshold::Int: Minimum phred score for trimming (default 20).
min_length::Int: Minimum read length to retain (default 50).

Returns

String: Path to the filtered and trimmed FASTQ file.

Details

This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length. It’s a simple wrapper that executes the external fastp command.

Mycelia.quality_biosequence_graph_to_fastq — Function

quality_biosequence_graph_to_fastq(
    graph::MetaGraphsNext.MetaGraph
) -> Vector{FASTX.FASTQ.Record}
quality_biosequence_graph_to_fastq(
    graph::MetaGraphsNext.MetaGraph,
    output_file::Union{Nothing, AbstractString}
) -> Vector{FASTX.FASTQ.Record}

Convert quality-aware BioSequence vertices back to FASTQ records.

This function demonstrates the key feature of FASTQ graphs - they maintain per-base quality information and can be converted back to FASTQ format.

Arguments

graph: Quality-aware BioSequence graph
output_file: Path to output FASTQ file (optional)

Returns

Vector of FASTX.FASTQ.Record objects

Example

# Convert graph back to FASTQ
fastq_records = quality_biosequence_graph_to_fastq(graph)

# Or write directly to file
quality_biosequence_graph_to_fastq(graph, "output.fastq")

Mycelia.quality_string_to_phred — Method

quality_string_to_phred(
    quality_string::AbstractString
) -> Vector{UInt8}

Convert FASTQ quality string to numerical PHRED scores.

Arguments

quality_string::AbstractString: Quality string from FASTQ record (e.g., "IIII")

Returns

Vector{UInt8}: PHRED quality scores

Examples

scores = quality_string_to_phred("IIII")  # Returns [40, 40, 40, 40]
scores = quality_string_to_phred("!#%+")  # Returns [0, 2, 4, 10]
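
The values in the examples above are consistent with the Phred+33 encoding, in which each character's ASCII code minus 33 gives the score. A minimal sketch of that decoding, using the hypothetical name `decode_phred33` and assuming ASCII input:

```julia
# Phred+33: subtract the offset 33 (0x21) from each character's ASCII code
decode_phred33(quality_string::AbstractString) =
    UInt8[UInt8(c) - 0x21 for c in quality_string]

println(Int.(decode_phred33("IIII")))
println(Int.(decode_phred33("!#%+")))
```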

Mycelia.qualmer_correctness_probability — Method

qualmer_correctness_probability(
    qmer::Mycelia.Qualmer
) -> Float64

Calculate joint probability that a single qualmer is correct. For a k-mer with quality scores [q1, q2, ..., qk], the joint probability that all positions are correct is: ∏(1 - 10^(-qi/10))
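
The product formula can be evaluated directly on a vector of quality scores. The sketch below works on raw scores rather than a `Qualmer` object, and `joint_correctness` is a hypothetical name.

```julia
# Joint probability that every position is correct: ∏(1 - 10^(-qi/10)).
# Scores are converted to Float64 before negation so UInt8 input is safe.
joint_correctness(qualities) =
    prod(1.0 - 10.0^(-Float64(q) / 10) for q in qualities)

println(joint_correctness([30, 30, 30]))
println(joint_correctness(UInt8[40, 40, 10]))
```

A single low-quality position dominates the result: in the second call the Q10 base caps the joint probability at roughly 0.9 regardless of the two Q40 bases.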

Mycelia.qualmer_graph_to_quality_biosequence_graph — Method

qualmer_graph_to_quality_biosequence_graph(
    qualmer_graph::MetaGraphsNext.MetaGraph;
    min_path_length
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#280#282", Float64} where {_A, _B, _C}

Convert a Qualmer graph to a quality-aware BioSequence graph by collapsing linear paths.

This is the primary method for creating quality-aware BioSequence graphs from Qualmer graphs, following the 6-graph hierarchy where FASTQ graphs are simplifications of Qualmer graphs with quality retention.

Arguments

qualmer_graph: MetaGraphsNext Qualmer graph to convert
min_path_length: Minimum path length to keep (default: 2)

Returns

MetaGraphsNext.MetaGraph with quality-aware BioSequence vertices

Example

# Start with qualmer graph
qualmer_graph = build_qualmer_graph(fastq_records)

# Convert to quality-aware BioSequence graph
quality_graph = qualmer_graph_to_quality_biosequence_graph(qualmer_graph)

Mycelia.qualmer_path_to_biosequence — Method

Convert qualmer path back to BioSequence and quality vector. Returns a named tuple with sequence and quality_scores.

Mycelia.qualmers_canonical — Method

qualmers_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Mycelia.qualmers_canonical — Method

qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Mycelia.qualmers_canonical — Method

qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#234#235"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Mycelia.qualmers_fw — Method

qualmers_fw(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Mycelia.qualmers_fw — Method

qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

Mycelia.qualmers_unambiguous — Method

qualmers_unambiguous(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator

Mycelia.qualmers_unambiguous — Method

qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#228#229"{_A, <:AbstractVector{var"#s36"}} where {_A, var"#s36"<:Integer})}

Mycelia.qualmers_unambiguous — Method

qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#230#231"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Mycelia.qualmers_unambiguous — Method

qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#232#233"{_A, <:AbstractVector{var"#s34"}} where {_A, var"#s34"<:Integer})}

Mycelia.qualmers_unambiguous_canonical — Method

qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#236#237"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

Mycelia.rand_ascii_greek_string — Method

rand_ascii_greek_string(len::Int) -> String

Generate a random string of printable ASCII and Greek characters of length len.

The string contains random printable ASCII characters and both uppercase and lowercase Greek letters.

Mycelia.rand_bmp_printable_string — Method

rand_bmp_printable_string(len::Int) -> String

Generate a random string of printable Basic Multilingual Plane (BMP) characters of length len.

The string contains random printable BMP characters, excluding surrogate code points.

Mycelia.rand_latin1_string — Method

rand_latin1_string(len::Int) -> String

Generate a random string of printable Latin-1 characters of length len.

The string contains random printable characters from the Latin-1 character set.

Mycelia.rand_of_each_group — Method

rand_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Select one random row from each group in a grouped DataFrame.

Arguments

gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

DataFrame: A new DataFrame containing exactly one randomly sampled row from each group

Mycelia.rand_printable_unicode_string — Method

rand_printable_unicode_string(len::Int) -> String

Generate a random string of printable Unicode characters of length len.

The string contains random printable Unicode characters, excluding surrogate code points.

Mycelia.random_fasta_record — Method

random_fasta_record(
;
    moltype,
    seed,
    L
) -> FASTX.FASTA.Record

Generates a random FASTA record with a specified molecular type and sequence length.

Arguments

moltype::Symbol=:DNA: The type of molecule to generate (:DNA, :RNA, or :AA for amino acids).
seed: The random seed used for sequence generation (default: a random integer).
L: The length of the sequence (default: a random integer up to typemax(UInt16)).

Returns

A FASTX.FASTA.Record containing:
- A randomly generated UUID identifier.
- A randomly generated sequence of the specified type.

Errors

Throws an error if moltype is not one of :DNA, :RNA, or :AA.

Mycelia.random_symmetric_distance_matrix — Method

random_symmetric_distance_matrix(n) -> Any

Generate a random symmetric distance matrix of size n×n with zeros on the diagonal.

Arguments

n: Positive integer specifying the matrix dimensions

Returns

A symmetric n×n matrix with random values in [0,1), zeros on the diagonal

Details

The matrix is symmetric, meaning M[i,j] = M[j,i]
Diagonal elements M[i,i] are set to 0.0
Off-diagonal elements are uniformly distributed random values
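
The properties listed above can be reproduced with a few lines of Base Julia. This is a minimal sketch under the stated properties, not necessarily how Mycelia builds the matrix, and `toy_symmetric_distance_matrix` is a hypothetical name:

```julia
# Sketch only: fill the upper triangle with rand() and mirror it.
function toy_symmetric_distance_matrix(n::Integer)
    M = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        d = rand()          # uniform in [0, 1)
        M[i, j] = d
        M[j, i] = d         # mirror to keep the matrix symmetric
    end
    return M                # diagonal entries stay 0.0
end

M = toy_symmetric_distance_matrix(4)
```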

Mycelia.rclone_copy — Method

rclone_copy(source, dest; config, max_attempts, sleep_timer)

Copy files between local and remote storage using rclone with automated retry logic.

Arguments

source::String: Source path or remote (e.g. "local/path" or "gdrive:folder")
dest::String: Destination path or remote (e.g. "gdrive:folder" or "local/path")

Keywords

config::String="": Optional path to rclone config file
max_attempts::Int=3: Maximum number of retry attempts
sleep_timer::Int=60: Initial sleep duration between retries in seconds (doubles after each attempt)

Details

Uses optimized rclone settings for large files:

2GB chunk size
1TB upload cutoff
Rate limited to 1 transaction per second
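
The retry behaviour described by max_attempts and sleep_timer can be pictured roughly as follows. This is a sketch of a generic doubling backoff loop, not Mycelia's actual code; `retry_with_backoff` and the `action` callback are hypothetical stand-ins for the rclone invocation:

```julia
# Sketch: retry `action` up to max_attempts times, doubling the wait each time.
# `action` is a hypothetical zero-argument function returning true on success.
function retry_with_backoff(action; max_attempts::Int=3, sleep_timer::Real=60)
    for attempt in 1:max_attempts
        try
            action() && return true
        catch err
            @warn "attempt $attempt failed" exception = err
        end
        if attempt < max_attempts
            sleep(sleep_timer)
            sleep_timer *= 2   # back off: 60 s, 120 s, 240 s, ...
        end
    end
    return false
end

# Usage with a stand-in action that succeeds on the second call.
calls = Ref(0)
ok = retry_with_backoff(; max_attempts=3, sleep_timer=0.01) do
    calls[] += 1
    calls[] >= 2
end
```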

Mycelia.rclone_copy2 — Method

rclone_copy2(
    source,
    dest;
    config,
    max_attempts,
    sleep_timer,
    includes,
    excludes,
    recursive
)

Copy files between local and remote storage using rclone with automated retry logic.

Arguments

source::String: Source path or remote (e.g. "local/path" or "gdrive:folder")
dest::String: Destination path or remote (e.g. "gdrive:folder" or "local/path")

Keywords

config::String="": Optional path to rclone config file
max_attempts::Int=3: Maximum number of retry attempts
sleep_timer::Int=60: Initial sleep duration between retries in seconds (doubles after each attempt)
includes::Vector{String}=[]: One or more include patterns (each will be passed using --include)
excludes::Vector{String}=[]: One or more exclude patterns (each will be passed using --exclude)
recursive::Bool=false: If true, adds the flag for recursive traversal

Mycelia.rclone_list_directories — Method

rclone_list_directories(path) -> Any

List all directories at the specified rclone path.

Arguments

path::String: Remote path to list directories from (e.g. "remote:/path/to/dir")

Returns

Vector{String}: Full paths to all directories found at the specified location

Mycelia.read_fastani — Method

read_fastani(path::String) -> DataFrames.DataFrame

Reads and processes FastANI output results from a tab-delimited file.

Arguments

path::String: Path to the FastANI output file

Returns

DataFrame with columns:

query: Original query filepath
query_identifier: Extracted filename without extension
reference: Original reference filepath
reference_identifier: Extracted filename without extension
%_identity: ANI percentage identity
fragments_mapped: Number of fragments mapped
total_query_fragments: Total number of query fragments

Notes

Expects tab-delimited input file from FastANI
Automatically strips .fasta, .fna, or .fa extensions from filenames
Column order is preserved as listed above
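
The identifier columns can be understood through a small sketch of the extension stripping described in the notes. It assumes the identifier is the file's base name with one trailing .fasta, .fna, or .fa removed; `strip_fasta_extension` is a hypothetical helper name:

```julia
# Sketch: derive an identifier from a genome path by dropping the directory
# and a trailing .fasta, .fna, or .fa extension.
strip_fasta_extension(path::AbstractString) =
    replace(basename(path), r"\.(fasta|fna|fa)$" => "")

identifier = strip_fasta_extension("/data/genomes/GCF_000005845.fna")
```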

Mycelia.read_gfa_next — Function

read_gfa_next(
    gfa_file::AbstractString;
    ...
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}
read_gfa_next(
    gfa_file::AbstractString,
    graph_mode::Mycelia.GraphMode;
    force_biosequence_graph
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Read a GFA file and auto-detect whether to create a k-mer graph or BioSequence graph.

This function parses GFA format files and intelligently chooses between:

Fixed-length k-mer graph (if all segments have the same length)
Variable-length BioSequence graph (if segments have different lengths)

Arguments

gfa_file: Path to input GFA file
graph_mode: GraphMode (SingleStrand or DoubleStrand, default: DoubleStrand)
force_biosequence_graph: Force creation of variable-length BioSequence graph (default: false)

Returns

MetaGraphsNext.MetaGraph with either k-mer vertices or BioSequence vertices

Auto-Detection Logic

Fixed-length detection: If all segments are the same length k, creates DNAKmer{k}/RNAKmer{k}/AAKmer{k} graph
Variable-length fallback: If segments have different lengths, creates BioSequence graph
Override: Use force_biosequence_graph=true to force variable-length graph
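
The detection rule above amounts to checking whether every segment sequence has the same length. A minimal sketch of that check, working on raw GFA lines and using the hypothetical helper name `common_segment_length`, might look like this:

```julia
# Sketch: collect segment sequences from GFA "S" lines and test whether they
# all share one length. Returns the common length k, or nothing if lengths vary.
function common_segment_length(gfa_lines)
    lengths = Int[]
    for line in gfa_lines
        fields = split(line, '\t')
        # S lines look like: S <name> <sequence> [optional tags...]
        if length(fields) >= 3 && fields[1] == "S"
            push!(lengths, length(fields[3]))
        end
    end
    isempty(lengths) && return nothing
    return all(==(first(lengths)), lengths) ? first(lengths) : nothing
end

fixed = common_segment_length(["H\tVN:Z:1.0", "S\t1\tACG", "S\t2\tCGT", "L\t1\t+\t2\t+\t2M"])
mixed = common_segment_length(["S\t1\tACG", "S\t2\tCGTTA"])
```

A non-nothing result corresponds to the fixed-length k-mer case, and nothing corresponds to the variable-length fallback.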

GFA Format Support

Supports GFA v1.0 with:

Header (H) lines (ignored)
Segment (S) lines: parsed as vertices (k-mer or BioSequence)
Link (L) lines: parsed as strand-aware edges
Path (P) lines: stored as metadata (future use)

Examples

# Auto-detect graph type
graph = read_gfa_next("assembly.gfa")

# Force variable-length BioSequence graph
graph = read_gfa_next("assembly.gfa", force_biosequence_graph=true)

# SingleStrand mode with auto-detection
graph = read_gfa_next("assembly.gfa", SingleStrand)

Mycelia.read_gfa_next — Function

read_gfa_next(
    gfa_file::AbstractString,
    kmer_type::Type
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#151#152", Float64} where {_A, _B, _C}
read_gfa_next(
    gfa_file::AbstractString,
    kmer_type::Type,
    graph_mode::Mycelia.GraphMode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, _A, _B, _C, Nothing, Mycelia.var"#151#152", Float64} where {_A, _B, _C}

Read a GFA file and convert it to a MetaGraphsNext k-mer graph with fixed-length vertices.

This function parses GFA format files and creates a strand-aware k-mer graph compatible with the next-generation implementation using fixed-length k-mer vertices.

Arguments

gfa_file: Path to input GFA file
kmer_type: Type of k-mer to use (e.g., Kmers.DNAKmer{31})
graph_mode: GraphMode (SingleStrand or DoubleStrand, default: DoubleStrand)

Returns

MetaGraphsNext.MetaGraph with k-mer vertices and strand-aware edges

GFA Format Support

Supports GFA v1.0 with:

Header (H) lines (ignored)
Segment (S) lines: parsed as fixed-length k-mer vertices
Link (L) lines: parsed as strand-aware edges
Path (P) lines: stored as metadata (future use)

Example

# Fixed-length k-mer graph
graph = read_gfa_next("assembly.gfa", Kmers.DNAKmer{31})
# Or with specific mode
graph = read_gfa_next("assembly.gfa", Kmers.DNAKmer{31}, SingleStrand)

Mycelia.read_gff — Method

read_gff(gff::AbstractString) -> DataFrames.DataFrame

Reads a GFF (General Feature Format) file and parses it into a DataFrame.

Arguments

gff::AbstractString: Path to the GFF file

Returns

DataFrame: A DataFrame containing the parsed GFF data with standard columns: seqid, source, type, start, end, score, strand, phase, and attributes

Mycelia.read_gff — Method

read_gff(gff_io) -> DataFrames.DataFrame

Read a GFF (General Feature Format) file into a DataFrame.

Arguments

gff_io: An IO stream containing GFF formatted data

Returns

DataFrame: A DataFrame with standard GFF columns:
- seqid: sequence identifier
- source: feature source
- type: feature type
- start: start position (1-based)
- end: end position
- score: numeric score
- strand: strand (+, -, or .)
- phase: phase (0, 1, 2 or .)
- attributes: semicolon-separated key-value pairs
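
To make the column layout concrete, here is a small sketch that splits one GFF line into the nine standard fields. It only illustrates the format; `parse_gff_line` is a hypothetical helper, and the real function returns a DataFrame rather than a named tuple:

```julia
# Sketch: split one tab-delimited GFF line into its nine standard columns.
function parse_gff_line(line::AbstractString)
    columns = (:seqid, :source, :type, :start, :end, :score, :strand, :phase, :attributes)
    f = split(chomp(line), '\t')
    length(f) == 9 || throw(ArgumentError("expected 9 tab-separated fields, got $(length(f))"))
    vals = (String(f[1]), String(f[2]), String(f[3]),
            parse(Int, f[4]), parse(Int, f[5]),       # start and end are integers
            String(f[6]), String(f[7]), String(f[8]), String(f[9]))
    return NamedTuple{columns}(vals)
end

row = parse_gff_line("chr1\tprodigal\tCDS\t10\t90\t.\t+\t0\tID=gene1;Name=abc")
```

Because `end` is a reserved word in Julia, the sketch reads that field with `row[:end]` rather than dot syntax.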

Mycelia.read_kraken_report — Method

read_kraken_report(kraken_report) -> DataFrames.DataFrame

Parse a Kraken taxonomic classification report into a structured DataFrame.

Arguments

kraken_report::AbstractString: Path to a tab-delimited Kraken report file

Returns

DataFrame: A DataFrame with the following columns:
- percentage_of_fragments_at_or_below_taxon: Percentage of fragments covered
- number_of_fragments_at_or_below_taxon: Count of fragments at/below taxon
- number_of_fragments_assigned_directly_to_taxon: Direct fragment assignments
- rank: Taxonomic rank
- ncbi_taxonid: NCBI taxonomy identifier
- scientific_name: Scientific name (whitespace-trimmed)

Notes

Scientific names are automatically stripped of leading/trailing whitespace
Input file must be tab-delimited

Mycelia.read_mmseqs_easy_search — Method

read_mmseqs_easy_search(
    mmseqs_file::String
) -> DataFrames.DataFrame

Read results from an MMseqs2 easy-search output file (plain or gzipped) into a DataFrame with optimized memory usage. Automatically detects whether the file is gzipped based on the '.gz' extension.

Arguments

mmseqs_file::String: Path to the tab-delimited output file from MMseqs2 easy-search. Can be a plain text file or a gzipped file (ending in .gz).

Returns

DataFrame: Contains search results with columns:
- query::String: Query sequence identifier (pooled)
- target::String: Target sequence identifier (pooled)
- seqIdentity::Float64: Sequence identity (0.0-1.0)
- alnLen::Int: Alignment length
- mismatch::Int: Number of mismatches
- gapOpen::Int: Number of gap openings
- qStart::Int: Query start position
- qEnd::Int: Query end position
- tStart::Int: Target start position
- tEnd::Int: Target end position
- evalue::Float64: Expected value
- bits::Float64: Bit score

Remarks

Ensure the CodecZlib.jl package is installed for gzipped file support.

Mycelia.read_tsvgz — Method

read_tsvgz(filename::String; buffer_in_memory::Bool=false, threaded::Bool=true, bufsize::Int=10*1024*1024) -> DataFrames.DataFrame

Read a DataFrame from a gzipped TSV file.

Arguments

filename: Path to the gzipped TSV file (must have .tsv.gz extension)
buffer_in_memory: If false, uses temporary files for large data (default: false)
bufsize: Buffer size in bytes for decompression stream (default: 10MB)

Returns

The loaded DataFrame

Mycelia.regions_overlap — Method

Check if two repeat regions overlap.

Mycelia.remove_duplicate_bubbles — Method

Remove duplicate bubbles.

Mycelia.remove_isolated_vertices! — Method

Remove all isolated vertices from the graph.

Mycelia.remove_path_from_graph! — Method

Remove a path from the graph.

Mycelia.repr_long — Method

repr_long(v) -> String

Return a string representation of the vector v with each element on a new line, mimicking valid Julia syntax. The output encloses the elements in square brackets and separates them with a comma followed by a newline.
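
A one-line sketch of the described format, assuming each element is rendered with `repr` (the hypothetical name `toy_repr_long` is used to avoid implying this is the actual implementation):

```julia
# Sketch: one element per line, comma-separated, wrapped in square brackets.
toy_repr_long(v) = "[" * join(repr.(v), ",\n") * "]"

text = toy_repr_long([1, 2, 3])
println(toy_repr_long(["a", "b"]))
```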

Mycelia.reset_environment! — Function

reset_environment!(env::AssemblyEnvironment, dataset_idx::Int=1)

Reset the RL environment to start a new training episode.

Arguments

env::AssemblyEnvironment: Environment to reset
dataset_idx::Int: Index of training dataset to use (default: 1)

Returns

AssemblyState: Initial state for new episode

Example

initial_state = reset_environment!(env, 2)  # Use second training dataset

Mycelia.resolve_repeats_next — Method

resolve_repeats_next(graph::MetaGraphsNext.MetaGraph, min_repeat_length::Int=10) -> Vector{RepeatRegion}

Identify and characterize repetitive regions in the assembly graph.

Mycelia.reverse_complement — Method

Helper function to reverse complement a DNA string.
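
For readers unfamiliar with the operation, a minimal string-based sketch follows. It assumes an alphabet of A, C, G, T, and N only, and `toy_reverse_complement` is a hypothetical name:

```julia
# Sketch: reverse the string, then swap each base for its complement.
function toy_reverse_complement(dna::AbstractString)
    complement = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C', 'N' => 'N')
    return join(complement[base] for base in reverse(uppercase(dna)))
end

rc = toy_reverse_complement("ATGC")
```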

Mycelia.reverse_translate — Method

reverse_translate(
    protein_sequence::BioSequences.LongAA
) -> BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}

Convert a protein sequence back to a possible DNA coding sequence using weighted random codon selection.

Arguments

protein_sequence::BioSequences.LongAA: The amino acid sequence to reverse translate

Returns

BioSequences.LongDNA{2}: A DNA sequence that would translate to the input protein sequence

Details

Uses codon usage frequencies to randomly select codons for each amino acid, weighted by their natural occurrence. Each selected codon is guaranteed to translate back to the original amino acid.
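
Weighted codon selection can be sketched as follows. The table below is illustrative only: it covers two amino acids and its weights are made up for the example, not taken from any real codon usage table. The names `toy_codon_weights`, `weighted_pick`, and `toy_reverse_translate` are hypothetical:

```julia
# Illustrative table only: two amino acids with made-up usage weights.
toy_codon_weights = Dict(
    'M' => ["ATG" => 1.0],
    'K' => ["AAA" => 0.7, "AAG" => 0.3],
)

# Pick one codon with probability proportional to its weight.
function weighted_pick(options)
    r = rand() * sum(last, options)
    for (codon, weight) in options
        r -= weight
        r < 0 && return codon
    end
    return first(last(options))   # guard against floating-point round-off
end

toy_reverse_translate(protein, table) =
    join(weighted_pick(table[aa]) for aa in protein)

dna = toy_reverse_translate("MKK", toy_codon_weights)
```

Every codon drawn for an amino acid comes from that amino acid's own list, which is why the result always translates back to the input protein.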

Mycelia.rolling_centered_avg — Method

rolling_centered_avg(data::AbstractArray{T, 1}; window_size)

Compute a centered moving average over a vector using a sliding window.

Arguments

data::AbstractVector{T}: Input vector to be averaged
window_size::Int: Size of the sliding window (odd number recommended)

Returns

Vector{Float64}: Vector of same length as input containing moving averages

Details

For points near the edges, the window is truncated to available data
Window is centered on each point, using floor(window_size/2) points on each side
Result type is always Float64 regardless of input type T
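
The edge handling described above can be sketched in plain Julia. This is a minimal version assuming 1-based vectors, with the hypothetical name `toy_rolling_centered_avg`:

```julia
# Sketch: centered moving average with the window truncated at both edges.
function toy_rolling_centered_avg(data::AbstractVector{<:Real}; window_size::Int)
    half = window_size ÷ 2                       # floor(window_size / 2)
    n = length(data)
    out = Vector{Float64}(undef, n)
    for i in 1:n
        lo = max(1, i - half)                    # truncate at the left edge
        hi = min(n, i + half)                    # truncate at the right edge
        out[i] = sum(@view data[lo:hi]) / (hi - lo + 1)
    end
    return out
end

smoothed = toy_rolling_centered_avg([1, 2, 3, 4, 5]; window_size=3)
# smoothed == [1.5, 2.0, 3.0, 4.0, 4.5]
```

The first and last points average only two values because one side of the window falls outside the data.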

Mycelia.run_amrfinderplus — Method

run_amrfinderplus(; fasta, output_dir, force)

Run AMRFinderPlus on FASTA input to identify antimicrobial resistance genes.

Arguments

fasta::String: Path to input FASTA file (must match Mycelia.FASTA_REGEX pattern)
output_dir::String: Output directory path (default: input filename + "_amrfinderplus")
force::Bool: Force rerun even if output files already exist (default: false)

Returns

Path to the output directory containing AMRFinderPlus results

Details

For nucleotide FASTA files, automatically runs Mycelia.run_pyrodigal to generate protein sequences
For protein FASTA files, runs AMRFinderPlus directly
Validates input file extension against Mycelia.FASTA_REGEX
Creates output directory if it doesn't exist
Skips processing if results already exist in output directory unless force=true
Uses the --plus flag for enhanced detection capabilities

Files Generated

<basename>.amrfinderplus.tsv: AMRFinderPlus results table
For nucleotide inputs: intermediate pyrodigal outputs in subdirectory

Mycelia.run_blast — Method

run_blast(
;
    out_dir,
    fasta,
    blast_db,
    blast_command,
    force,
    remote,
    wait
)

Run a BLAST (Basic Local Alignment Search Tool) command with the specified parameters.

Arguments

out_dir::String: The output directory where the BLAST results will be stored.
fasta::String: The path to the input FASTA file.
blast_db::String: The path to the BLAST database.
blast_command::String: The BLAST command to be executed (e.g., blastn, blastp).
force::Bool: If true, forces the BLAST command to run even if the output file already exists. Default is false.
remote::Bool: If true, runs the BLAST command remotely. Default is false.
wait::Bool: If true, waits for the BLAST command to complete before returning. Default is true.

Returns

outfile::String: The path to the output file containing the BLAST results.

Description

This function constructs and runs a BLAST command based on the provided parameters. It creates the necessary output directory, constructs the output file name, and determines whether the BLAST command needs to be run based on the existence and size of the output file. The function supports both local and remote execution of the BLAST command.

If force is set to true or the output file does not exist or is empty, the BLAST command is executed. The function logs the command being run and measures the time taken for execution. The output file path is returned upon completion.
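
The rerun decision described above reduces to a small predicate. A sketch, using the hypothetical helper name `needs_run`:

```julia
# Sketch: run when forced, when the output is missing, or when it is empty.
needs_run(outfile::AbstractString; force::Bool=false) =
    force || !isfile(outfile) || filesize(outfile) == 0

# Usage with a temporary file standing in for a BLAST output file.
path, io = mktemp()
close(io)
empty_needs_run = needs_run(path)            # file exists but is empty
write(path, "hit\n")
filled_needs_run = needs_run(path)           # non-empty output is reused
forced_needs_run = needs_run(path; force=true)
rm(path)
```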

Mycelia.run_blastn — Method

run_blastn(
;
    outdir,
    fasta,
    blastdb,
    threads,
    task,
    force,
    remote,
    wait
)

Run the BLASTN (Basic Local Alignment Search Tool for Nucleotides) command with specified parameters.

Arguments

outdir::String: The output directory where the BLASTN results will be saved.
fasta::String: The path to the input FASTA file containing the query sequences.
blastdb::String: The path to the BLAST database to search against.
task::String: The BLASTN task to perform. Default is "megablast".
force::Bool: If true, forces the BLASTN command to run even if the output file already exists. Default is false.
remote::Bool: If true, runs the BLASTN command remotely. Default is false.
wait::Bool: If true, waits for the BLASTN command to complete before returning. Default is true.

Returns

outfile::String: The path to the output file containing the BLASTN results.

Description

This function constructs and runs a BLASTN command based on the provided parameters. It creates an output directory if it doesn't exist, constructs the output file path, and checks if the BLASTN command needs to be run based on the existence and size of the output file. The function supports running the BLASTN command locally or remotely, with options to force re-running and to wait for completion.

Mycelia.run_blastp_search — Method

run_blastp_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    max_target_seqs
)

Perform BLASTP search between query and reference protein FASTA files.

Arguments

query_fasta::String: Path to query protein FASTA file
reference_fasta::String: Path to reference protein FASTA file
output_dir::String: Output directory (defaults to query filename + "_blastp")
threads::Int: Number of threads (defaults to system CPU count)
evalue::Float64: E-value threshold (default: 1e-3)
max_target_seqs::Int: Maximum target sequences (default: 500)

Returns

String: Path to the BLASTP results file (.tsv format)

Throws

AssertionError: If input files don't exist or are invalid
SystemError: If BLAST execution fails

Mycelia.run_busco — Method

run_busco(assembly_file::String; kwargs...)

Run BUSCO on a single assembly file. See run_busco(::Vector{String}) for details.

Mycelia.run_busco — Method

run_busco(assembly_files::Vector{String}; outdir::String="busco_results", lineage::String="auto", mode::String="genome", threads::Int=Sys.CPU_THREADS, force::Bool=false)

Run BUSCO (Benchmarking Universal Single-Copy Orthologs) to assess genome assembly completeness.

Arguments

assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
outdir::String="busco_results": Output directory for BUSCO results
lineage::String="auto": BUSCO lineage dataset to use (e.g., "bacteria_odb10", "eukaryota_odb10", "auto")
mode::String="genome": BUSCO mode ("genome", "transcriptome", "proteins")
threads::Int=Sys.CPU_THREADS: Number of threads to use
force::Bool=false: Force overwrite existing results

Returns

String: Path to the output directory containing BUSCO results

Output Files

short_summary.specific.lineage.txt: Summary statistics
full_table.tsv: Complete BUSCO results table
missing_busco_list.tsv: List of missing BUSCOs
run_lineage/: Detailed results directory

Examples

# Basic completeness assessment
assemblies = ["assembly1.fasta", "assembly2.fasta"]
busco_dir = Mycelia.run_busco(assemblies)

# Specific lineage
busco_dir = Mycelia.run_busco(assemblies, lineage="bacteria_odb10")

# Custom parameters
busco_dir = Mycelia.run_busco(assemblies,
                             outdir="my_busco_results",
                             lineage="enterobacterales_odb10",
                             threads=8)

Notes

Requires BUSCO to be installed via Bioconda
Auto lineage detection requires internet connection for first run
Available lineages: bacteria_odb10, archaea_odb10, eukaryota_odb10, fungi_odb10, etc.
Results provide Complete, Fragmented, and Missing BUSCO counts

Mycelia.run_canu — Method

run_canu(; fastq, outdir, genome_size, read_type)

Run Canu assembler for long read assembly.

Arguments

fastq::String: Path to input FASTQ file containing long reads
outdir::String: Output directory path (default: "canu_output")
genome_size::String: Estimated genome size (e.g., "5m", "1.2g")
read_type::String: Type of reads ("pacbio", "nanopore")

Returns

Named tuple containing:

outdir::String: Path to output directory
assembly::String: Path to final assembly file

Details

Automatically creates and uses a conda environment with canu
Includes error correction, trimming, and assembly stages
Skips assembly if output directory already exists
Utilizes all available CPU threads

Mycelia.run_checkm — Method

run_checkm(input_path::String; outdir::String=input_path * "_checkm", db_dir::String=joinpath(homedir(), "workspace", ".checkm"), extension::String="fasta")

Run CheckM on a directory containing FASTA files.

CheckM requires a directory of genome files as input.

Arguments

input_path: Path to directory containing FASTA files
outdir: Output directory for CheckM results (default: input_path * "_checkm")
db_dir: CheckM database directory (default: ~/workspace/.checkm)
extension: File extension for genomes (default: "fasta")
threads: Number of threads to use (default: all available CPU threads)

Example

run_checkm("./genomes/")

Mycelia.run_checkm2 — Method

run_checkm2(input_path::String; outdir::String=input_path * "_checkm2", db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Run CheckM2 on FASTA file(s) or directory containing FASTA files.

Arguments

input_path: Path to FASTA file or directory containing FASTA files
outdir: Output directory for CheckM2 results (default: input_path * "_checkm2")
threads: Number of threads to use (default: all available CPU threads)
db_dir: CheckM2 database directory (default: ~/workspace/.checkm2)

Returns

A named tuple with the following fields:

outdir: Output directory path
quality_report: Path to quality_report.tsv
log_file: Path to checkm2.log
diamond_results: Path to DIAMOND_RESULTS.tsv
protein_file: Path to the single .faa file

Mycelia.run_checkm2_list — Method

run_checkm2_list(fasta_files::Vector{String}; outdir::String=normalized_current_datetime() * "_checkm2", db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Run CheckM2 on a list of FASTA files.

CheckM2 can automatically handle mixed lists of gzipped and non-gzipped files when given a list.

Arguments

fasta_files: Vector of FASTA file paths (can be mixed gzipped and non-gzipped)
outdir: Output directory for CheckM2 results (default: normalized_current_datetime() * "_checkm2")
threads: Number of threads to use (default: all available CPU threads)
db_dir: CheckM2 database directory (default: ~/workspace/.checkm2)

Example

files = ["genome1.fasta.gz", "genome2.fasta", "genome3.fasta.gz"]
run_checkm2_list(files)

Mycelia.run_checkv — Method

run_checkv(fasta_file::String; outdir::String=fasta_file * "_checkv", db_dir::String=joinpath(homedir(), "workspace", ".checkv"))

Run CheckV on a single genome FASTA file.

CheckV assesses the quality and completeness of viral genomes.

Arguments

fasta_file: Path to single FASTA file (can be gzipped)
outdir: Output directory for CheckV results (default: fasta_file * "_checkv")
threads: Number of threads to use (default: all available CPU threads)
db_dir: CheckV database directory (default: ~/workspace/.checkv)

Returns

Named tuple with fields:
- outdir: Output directory path
- complete_genomes: Path to complete_genomes.tsv
- completeness: Path to completeness.tsv
- contamination: Path to contamination.tsv
- proviruses: Path to proviruses.fna
- quality_summary: Path to quality_summary.tsv
- viruses: Path to viruses.fna

Example

result = run_checkv("genome.fasta")
println("Quality summary: ", result.quality_summary)

Mycelia.run_clustal_omega — Method

run_clustal_omega(; fasta, outfmt)

Run Clustal Omega multiple sequence alignment on a FASTA file.

Arguments

fasta::String: Path to input FASTA file
outfmt::String="clustal": Output format for the alignment

Returns

String: Path to the output alignment file

Supported Output Formats

"fasta": FASTA format
"clustal": Clustal format
"msf": MSF format
"phylip": PHYLIP format
"selex": SELEX format
"stockholm": Stockholm format
"vienna": Vienna format

Notes

Uses Bioconda to manage the Clustal Omega installation
Caches results - will return existing output file if already generated
Handles single sequence files gracefully by returning output path without error

Mycelia.run_diamond_search — Method

run_diamond_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    block_size,
    sensitivity
)

Perform DIAMOND BLASTP search between query and reference protein FASTA files.

Arguments

query_fasta::String: Path to query protein FASTA file
reference_fasta::String: Path to reference protein FASTA file
output_dir::String: Output directory (defaults to query filename + "_diamond")
threads::Int: Number of threads (defaults to system CPU count)
evalue::Float64: E-value threshold (default: 1e-3)
block_size::Float64: Block size in GB (default: auto-calculated from system memory)
sensitivity::String: Sensitivity mode (default: "--iterate")

Returns

String: Path to the DIAMOND results file (.tsv format)

Throws

AssertionError: If input files don't exist or are invalid
SystemError: If DIAMOND execution fails

Mycelia.run_ectyper — Method

run_ectyper(fasta_file) -> Any

Run ECTyper for serotyping E. coli genome assemblies.

Arguments

fasta_file::String: Path to input FASTA file containing assembled genome(s)

Returns

String: Path to output directory containing ECTyper results

Mycelia.run_fastqc — Method

run_fastqc(; forward, reverse, outdir)

Mycelia.run_flye — Method

run_flye(; fastq, outdir, genome_size, read_type)

Run Flye assembler for long read assembly.

Arguments

fastq::String: Path to input FASTQ file containing long reads
outdir::String: Output directory path (default: "flye_output")
genome_size::String: Estimated genome size (e.g., "5m", "1.2g")
read_type::String: Type of reads ("pacbio-raw", "pacbio-corr", "pacbio-hifi", "nano-raw", "nano-corr", "nano-hq")

Returns

Named tuple containing:

outdir::String: Path to output directory
assembly::String: Path to final assembly file

Details

Automatically creates and uses a conda environment with flye
Supports various long read technologies and quality levels
Skips assembly if output directory already exists
Utilizes all available CPU threads

Mycelia.run_full_benchmark_suite — Function

run_full_benchmark_suite()
run_full_benchmark_suite(config::Mycelia.BenchmarkConfig)

Run comprehensive performance benchmark suite.

Executes all benchmark tests and provides a summary report against our targets:

Memory Usage: 50% reduction through type-stable metadata
Construction Speed: 2x faster graph building
Type Stability: Zero allocations in hot paths

Arguments

config: BenchmarkConfig for test parameters

Returns

NamedTuple with comprehensive results

Mycelia.run_genomad — Method

Run geNomad mobile genetic element identification tool.

geNomad identifies viruses and plasmids in genomic and metagenomic data using machine learning and database comparisons.

Arguments

input_fasta: Path to input FASTA file
output_directory: Output directory path
genomad_dbpath: Path to geNomad database directory
threads: Number of CPU threads to use
cleanup: Remove intermediate files after completion
splits: Number of splits for memory management
force: Force rerun even if output files already exist

Returns

NamedTuple containing paths to all generated output files and directories

Mycelia.run_hifiasm — Method

run_hifiasm(; fastq, outdir)

Run the hifiasm genome assembler on PacBio HiFi reads.

Arguments

fastq::String: Path to input FASTQ file containing HiFi reads
outdir::String: Output directory path (default: "{basename(fastq)}_hifiasm")

Returns

Named tuple containing:

outdir::String: Path to output directory
hifiasm_outprefix::String: Prefix used for hifiasm output files

Details

Automatically creates and uses a conda environment with hifiasm
Uses primary assembly mode (--primary) optimized for inbred samples
Skips assembly if output files already exist at the specified prefix
Utilizes all available CPU threads

Mycelia.run_mash_comparison — Method

run_mash_comparison(fasta1::String, fasta2::String; k::Int=21, s::Int=10000, mash_path::String="mash")

Runs a genome-by-genome comparison using the mash command-line tool.

This function first creates sketch files for each FASTA input and then calculates the distance between them, capturing and parsing the result.

Arguments

fasta1::String: Path to the first FASTA file.
fasta2::String: Path to the second FASTA file.

Keyword Arguments

k::Int=21: The k-mer size to use for sketching. Default is 21.
s::Int=10000: The sketch size (number of hashes to keep). Default is 10000.
mash_path::String="mash": The path to the mash executable if not in the system PATH.

Returns

NamedTuple: A named tuple containing the parsed results, e.g., (reference=..., query=..., distance=..., p_value=..., shared_hashes=...)
nothing: Returns nothing if the mash command fails.
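
The parsing step can be sketched as below. It assumes the standard five-column, tab-delimited line that `mash dist` prints (reference, query, distance, p-value, shared hashes); `parse_mash_dist_line` is a hypothetical helper, and the field names follow the named tuple shown above:

```julia
# Sketch: turn one tab-delimited `mash dist` line into a named tuple.
function parse_mash_dist_line(line::AbstractString)
    fields = split(strip(line), '\t')
    length(fields) == 5 || return nothing      # unexpected shape: give up
    return (
        reference = String(fields[1]),
        query = String(fields[2]),
        distance = parse(Float64, fields[3]),
        p_value = parse(Float64, fields[4]),
        shared_hashes = String(fields[5]),     # e.g. "600/1000"
    )
end

result = parse_mash_dist_line("a.fasta\tb.fasta\t0.0125\t0\t600/1000")
```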

Mycelia.run_megahit — Method

run_megahit(
;
    fastq1,
    fastq2,
    outdir,
    min_contig_len,
    k_list
)

Run MEGAHIT assembler for metagenomic short read assembly.

Arguments

fastq1::String: Path to first paired-end FASTQ file
fastq2::String: Path to second paired-end FASTQ file (optional for single-end)
outdir::String: Output directory path (default: "megahit_output")
min_contig_len::Int: Minimum contig length (default: 200)
k_list::String: k-mer sizes to use (default: "21,29,39,59,79,99,119,141")

Returns

Named tuple containing:

outdir::String: Path to output directory
contigs::String: Path to final contigs file

Details

Automatically creates and uses a conda environment with megahit
Optimized for metagenomic assemblies with varying coverage
Skips assembly if output directory already exists
Utilizes all available CPU threads

Mycelia.run_metaspades — Method

run_metaspades(; fastq1, fastq2, outdir, k_list)

Run metaSPAdes assembler for metagenomic short read assembly.

Arguments

fastq1::String: Path to first paired-end FASTQ file
fastq2::String: Path to second paired-end FASTQ file (optional for single-end)
outdir::String: Output directory path (default: "metaspades_output")
k_list::String: k-mer sizes to use (default: "21,33,55,77")

Returns

Named tuple containing:

outdir::String: Path to output directory
contigs::String: Path to contigs file
scaffolds::String: Path to scaffolds file

Details

Automatically creates and uses a conda environment with spades
Designed for metagenomic datasets with uneven coverage
Skips assembly if output directory already exists
Utilizes all available CPU threads

Mycelia.run_mlst — Method

run_mlst(fasta_file) -> String

Run Multi-Locus Sequence Typing (MLST) analysis on a genome assembly.

Arguments

fasta_file::String: Path to input FASTA file containing the genome assembly

Returns

Path to the output file containing MLST results (<input>.mlst.out)

Details

Uses the mlst tool from PubMLST to identify sequence types by comparing allelic profiles of housekeeping genes against curated MLST schemes.

Dependencies

Requires Bioconda and the mlst package
Automatically sets up conda environment if not present

Mycelia.run_mmseqs_easy_search — Method

run_mmseqs_easy_search(
;
    query_fasta,
    target_database,
    out_dir,
    outfile,
    format_output,
    threads,
    force
)

Runs the MMseqs2 easy-search command on the given query FASTA file against the target database.

Arguments

query_fasta::String: Path to the query FASTA file.
target_database::String: Path to the target database.
out_dir::String: Directory to store the output file. Defaults to the directory of the query FASTA file.
outfile::String: Name of the output file. Defaults to a combination of the query FASTA and target database filenames.
format_output::String: Format of the output. Defaults to a predefined set of fields.
threads::Int: Number of CPU threads to use. Defaults to the number of CPU threads available.
force::Bool: If true, forces the re-generation of the output file even if it already exists. Defaults to false.

Returns

outfile_path::String: Path to the generated output file.

Notes

Adds the mmseqs2 environment using Bioconda if not already present.
Removes temporary files created during the process.

Mycelia.run_mmseqs_search — Method

run_mmseqs_search(
;
    query_fasta,
    reference_fasta,
    output_dir,
    threads,
    evalue,
    sensitivity
)

Perform MMseqs2 easy-search between query and reference FASTA files.

Arguments

query_fasta::String: Path to query FASTA file
reference_fasta::String: Path to reference FASTA file
output_dir::String: Output directory (defaults to query filename + "_mmseqs")
threads::Int: Number of threads (defaults to system CPU count)
evalue::Float64: E-value threshold (default: 1e-3)
sensitivity::Float64: Sensitivity parameter (default: 4.0)

Returns

String: Path to the MMseqs2 results file (.tsv format)

Throws

AssertionError: If input files don't exist or are invalid
SystemError: If MMseqs2 execution fails

Mycelia.run_mummer — Method

run_mummer(reference::String, query::String; outdir::String="mummer_results", prefix::String="out", mincluster::Int=65, minmatch::Int=20, threads::Int=1)

Run MUMmer for genome comparison and alignment between reference and query sequences.

Arguments

reference::String: Path to reference genome FASTA file
query::String: Path to query genome FASTA file
outdir::String="mummer_results": Output directory for MUMmer results
prefix::String="out": Prefix for output files
mincluster::Int=65: Minimum cluster length for nucmer
minmatch::Int=20: Minimum match length for nucmer
threads::Int=1: Number of threads (note: MUMmer has limited multithreading)

Returns

String: Path to the output directory containing MUMmer results

Output Files

prefix.delta: Delta alignment file (main output)
prefix.coords: Human-readable coordinates file
prefix.snps: SNP/indel report (if show-snps is run)
prefix.plot.png: Dot plot visualization (if mummerplot is run)

Examples

# Basic genome comparison
ref_genome = "reference.fasta"
query_genome = "assembly.fasta"
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome)

# Custom parameters
mummer_dir = Mycelia.run_mummer(ref_genome, query_genome,
                               outdir="comparison_results",
                               prefix="comparison",
                               mincluster=100,
                               minmatch=30)

Notes

Requires MUMmer to be installed via Bioconda
nucmer is used for DNA sequence alignment
show-coords generates human-readable coordinate output
Results include alignment coordinates, percent identity, and coverage
For visualization, use mummerplot (requires gnuplot)

Mycelia.run_mummer_plot — Method

run_mummer_plot(delta_file::String; outdir::String="", prefix::String="plot", plot_type::String="png")

Generate dot plot visualization from MUMmer delta file using mummerplot.

Arguments

delta_file::String: Path to MUMmer delta file
outdir::String="": Output directory (defaults to same as delta file)
prefix::String="plot": Prefix for plot files
plot_type::String="png": Plot format ("png", "ps", "x11")

Returns

String: Path to the generated plot file

Notes

Requires gnuplot to be installed
Useful for visualizing genome alignments and rearrangements

Mycelia.run_padloc — Method

run_padloc(; fasta_file, outdir, threads)

Run the 'padloc' tool from the 'padlocbio' conda environment on a given FASTA file.

https://doi.org/10.1093/nar/gkab883

https://github.com/padlocbio/padloc

This function first ensures that the 'padloc' environment is available via Bioconda. It then attempts to update the 'padloc' database. If a 'padloc' output file (with a '_padloc.csv' suffix) does not already exist for the input FASTA file, it runs 'padloc' with the specified FASTA file as input.

Mycelia.run_parallel_progress — Method

run_parallel_progress(
    f::Function,
    items::AbstractVector
) -> Any

Run a function f in parallel over a collection of items with a progress meter.

Arguments

f::Function: The function to be applied to each item in the collection.
items::AbstractVector: A collection of items to be processed.

Description

This function creates a progress meter to track the progress of processing each item in the items collection. It uses multithreading to run the function f on each item in parallel, updating the progress meter after each item is processed.
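
The pattern described above can be sketched with the standard library alone. This is a minimal illustration, not Mycelia's implementation: the function name parallel_map_with_counter is hypothetical, and an atomic counter stands in for the progress meter.

```julia
# Hypothetical stand-in for a threaded map with progress tracking.
# An atomic counter plays the role of the progress meter.
function parallel_map_with_counter(f::Function, items::AbstractVector)
    results = Vector{Any}(undef, length(items))
    completed = Threads.Atomic{Int}(0)
    # Assumes 1-based indexing, as with an ordinary Vector.
    Threads.@threads for i in 1:length(items)
        results[i] = f(items[i])
        # Safe to call from several threads at once.
        Threads.atomic_add!(completed, 1)
    end
    return results, completed[]
end

squares, n_done = parallel_map_with_counter(x -> x^2, collect(1:10))
```

Each iteration writes to its own slot in results, so only the counter needs atomic updates.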

Mycelia.run_phageboost — Method

run_phageboost(input_fasta::AbstractString, output_dir::AbstractString; force_reinstall::Bool=false)

Run PhageBoost on the provided FASTA file, automatically handling conda environment setup.

This function will:

Check if the phageboost_env conda environment exists
Create and set up the environment if it doesn't exist
Validate that PhageBoost is properly installed
Run PhageBoost on the input FASTA file
Return the output directory path and list of generated files

Arguments

input_fasta::AbstractString: Path to the input FASTA file
output_dir::AbstractString: Directory where PhageBoost outputs will be saved
force_reinstall::Bool=false: If true, recreate the environment even if it exists

Returns

NamedTuple with fields:

output_dir::String: Path to the output directory
files::Vector{String}: List of files generated in the output directory

Mycelia.run_phispy — Method

run_phispy(input_file::String; output_dir::String="", 
       phage_genes::Int=2, color::Bool=false, prefix::String="",
       phmms::String="", threads::Int=1, metrics::Vector{String}=String[],
       expand_slope::Bool=false, window_size::Int=30, 
       min_contig_size::Int=5000, skip_search::Bool=false,
       output_choice::Int=3, training_set::String="", 
       prokka_args::NamedTuple=NamedTuple(), force::Bool=false)

Run PhiSpy to identify prophages in bacterial genomes.

PhiSpy identifies prophage regions in bacterial (and archaeal) genomes using multiple approaches including gene composition, AT/GC skew, and optional HMM searches.

Arguments

input_file::String: Path to input file (FASTA or GenBank format)
output_dir::String: Output directory (default: input_file * "_phispy")
phage_genes::Int: Minimum phage genes required per prophage region (default: 2, set to 0 for mobile elements)
color::Bool: Add color annotations for CDS based on function (default: false)
prefix::String: Prefix for output filenames (default: basename of input)
phmms::String: Path to HMM database for additional phage gene detection
threads::Int: Number of threads for HMM searches (default: 1)
metrics::Vector{String}: Metrics to use for prediction (default: all standard metrics)
expand_slope::Bool: Expand Shannon slope calculations (default: false)
window_size::Int: Window size for calculations (default: 30)
min_contig_size::Int: Minimum contig size to analyze (default: 5000)
skip_search::Bool: Skip HMM search if already done (default: false)
output_choice::Int: Bitmask for output files (default: 3 for coordinates + GenBank)
training_set::String: Path to custom training set
prokka_args::NamedTuple: Additional arguments to pass to Prokka if FASTA input is provided
force::Bool: Force rerun even if output files already exist (default: false)

Output Choice Codes (add values for multiple outputs)

1: prophage_coordinates.tsv
2: GenBank format output
4: prophage and bacterial sequences
8: prophage_information.tsv
16: prophage.tsv
32: GFF3 format (prophages only)
64: prophage.tbl
128: test data used in random forest
256: GFF3 format (full genome)
512: all output files
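
Because each code is a distinct power of two, adding codes is the same as combining them with bitwise OR. The sketch below shows how an output_choice value can be built and decoded. The helper names (output_codes, combine_codes, selected_outputs) are hypothetical and not part of Mycelia or PhiSpy.

```julia
# Output codes as listed above (512 is handled separately as "all output files").
output_codes() = [
    1 => "prophage_coordinates.tsv",
    2 => "GenBank format output",
    4 => "prophage and bacterial sequences",
    8 => "prophage_information.tsv",
    16 => "prophage.tsv",
    32 => "GFF3 format (prophages only)",
    64 => "prophage.tbl",
    128 => "test data used in random forest",
    256 => "GFF3 format (full genome)",
]

# Combine individual codes into a single output_choice value.
combine_codes(codes) = reduce(|, codes; init = 0)

# List the outputs selected by an output_choice value.
function selected_outputs(choice::Integer)
    # Assumption: 512 selects everything, per the "all output files" entry.
    (choice & 512) != 0 && return [name for (_, name) in output_codes()]
    return [name for (code, name) in output_codes() if (choice & code) != 0]
end

default_choice = combine_codes([1, 2])   # coordinates + GenBank
```

With these helpers, selected_outputs(default_choice) lists the two outputs that the default value of 3 requests.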

Returns

A NamedTuple with paths to generated output files (contents depend on output_choice):

prophage_coordinates: prophage_coordinates.tsv file path
genbank_output: Updated GenBank file with prophage annotations
prophage_sequences: Prophage and bacterial sequence files
prophage_information: prophage_information.tsv file path
prophage_simple: prophage.tsv file path
gff3_prophages: GFF3 file with prophage regions only
prophage_table: prophage.tbl file path
test_data: Random forest test data file path
gff3_genome: GFF3 file with full genome annotations
output_dir: Path to output directory
input_genbank: Path to GenBank file used (original or generated by Prokka)

Mycelia.run_prodigal — Method

run_prodigal(; fasta_file, out_dir)

Run Prodigal gene prediction software on input FASTA file to identify protein-coding genes in metagenomes or single genomes.

Arguments

fasta_file::String: Path to input FASTA file containing genomic sequences
out_dir::String=dirname(fasta_file): Directory for output files. Defaults to input file's directory

Returns

Named tuple containing paths to all output files:

fasta_file: Input FASTA file path
out_dir: Output directory path
gff: Path to GFF format gene predictions
gene_scores: Path to all potential genes and their scores
fna: Path to nucleotide sequences of predicted genes
faa: Path to protein translations of predicted genes
std_out: Path to captured stdout
std_err: Path to captured stderr

Mycelia.run_prokka — Method

run_prokka(input_fasta::String; output_dir::String="", prefix::String="", 
       cpus::Int=0, kingdom::String="Bacteria", genus::String="", 
       species::String="", strain::String="", force_overwrite::Bool=false,
       addgenes::Bool=false, compliant::Bool=false, fast::Bool=false,
       evalue::Float64=1e-06, mincontiglen::Int=1, force::Bool=false)

Run Prokka for rapid prokaryotic genome annotation.

Prokka annotates bacterial, archaeal and viral genomes quickly and produces standards-compliant output files including GFF3, GenBank, and FASTA formats.

Arguments

input_fasta::String: Path to input FASTA file containing contigs
output_dir::String: Output directory (default: input_fasta * "_prokka")
prefix::String: Output file prefix (default: basename of input file)
cpus::Int: Number of CPUs to use, 0 for all available (default: 0)
kingdom::String: Annotation mode - "Bacteria", "Archaea", "Viruses", or "Mitochondria" (default: "Bacteria")
genus::String: Genus name for annotation
species::String: Species name for annotation
strain::String: Strain name for annotation
force_overwrite::Bool: Force overwrite existing output directory (default: false)
addgenes::Bool: Add 'gene' features for each 'CDS' feature (default: false)
compliant::Bool: Force GenBank/ENA/DDBJ compliance (default: false)
fast::Bool: Fast mode - skip CDS product searching (default: false)
evalue::Float64: Similarity e-value cut-off (default: 1e-06)
mincontiglen::Int: Minimum contig size (default: 1, NCBI needs 200)
force::Bool: Force rerun even if output files already exist (default: false)

Returns

A NamedTuple with paths to all generated output files:

gff: Master annotation in GFF3 format
gbk: Standard GenBank file
fna: Nucleotide FASTA of input contigs
faa: Protein FASTA of translated CDS sequences
ffn: Nucleotide FASTA of all transcripts
sqn: ASN1 Sequin file for GenBank submission
fsa: Nucleotide FASTA for tbl2asn
tbl: Feature table file
err: NCBI discrepancy report
log: Complete run log
txt: Annotation statistics
tsv: Tab-separated feature table
output_dir: Path to output directory

Mycelia.run_pyrodigal — Method

run_pyrodigal(; fasta_file, out_dir)

Run Pyrodigal gene prediction on a FASTA file using the meta procedure optimized for metagenomic sequences.

Pyrodigal is a reimplementation of the Prodigal gene finder, which identifies protein-coding sequences in bacterial and archaeal genomes.

Arguments

fasta_file::String: Path to input FASTA file containing genomic sequences
out_dir::String: Output directory path (default: input filename + "_pyrodigal")

Returns

Named tuple containing:

fasta_file: Input FASTA file path
out_dir: Output directory path
gff: Path to GFF output file with gene predictions
faa: Path to FASTA file with predicted protein sequences
fna: Path to FASTA file with nucleotide sequences

Notes

Uses metagenomic mode (-p meta) optimized for mixed communities
Masks runs of N nucleotides (-m flag)
Minimum gene length set to 33bp
Maximum overlap between genes set to 31bp
Requires Pyrodigal to be available in a Conda environment
Skips processing if output files already exist

Mycelia.run_quast — Method

run_quast(assembly_file::String; kwargs...)

Run QUAST on a single assembly file. See run_quast(::Vector{String}) for details.

Mycelia.run_quast — Method

run_quast(assembly_files::Vector{String}; outdir::String="quast_results", reference::Union{String,Nothing}=nothing, threads::Int=Sys.CPU_THREADS, min_contig::Int=500, gene_finding::Bool=false)

Run QUAST (Quality Assessment Tool for Genome Assemblies) to evaluate assembly quality.

Arguments

assembly_files::Vector{String}: Vector of paths to assembly FASTA files to evaluate
outdir::String="quast_results": Output directory for QUAST results
reference::Union{String,Nothing}=nothing: Optional reference genome for reference-based metrics
threads::Int=Sys.CPU_THREADS: Number of threads to use
min_contig::Int=500: Minimum contig length to consider
gene_finding::Bool=false: Whether to run gene finding (requires GeneMark-ES/ET)

Returns

String: Path to the output directory containing QUAST results

Output Files

report.html: Interactive HTML report
report.txt: Text summary report
report.tsv: Tab-separated values report for programmatic access
transposed_report.tsv: Transposed TSV format
icarus.html: Icarus contig browser (if reference provided)

Examples

# Basic assembly evaluation
assemblies = ["assembly1.fasta", "assembly2.fasta"]
quast_dir = Mycelia.run_quast(assemblies)

# With reference genome
ref_genome = "reference.fasta"
quast_dir = Mycelia.run_quast(assemblies, reference=ref_genome)

# Custom parameters
quast_dir = Mycelia.run_quast(assemblies, 
                             outdir="my_quast_results",
                             min_contig=1000,
                             threads=8)

Notes

Requires QUAST to be installed via Bioconda
Without reference: provides basic metrics (N50, total length, # contigs, etc.)
With reference: adds reference-based metrics (genome fraction, misassemblies, etc.)
Gene finding requires additional dependencies and is disabled by default

Mycelia.run_samtools_flagstat — Function

run_samtools_flagstat(xam) -> Any
run_samtools_flagstat(xam, samtools_flagstat) -> Any

Generate alignment statistics for a SAM/BAM/CRAM file using samtools flagstat.

Arguments

xam::AbstractString: Path to input SAM/BAM/CRAM alignment file
samtools_flagstat::AbstractString: Output path for flagstat results (default: the input path with ".samtools-flagstat.txt" appended)

Returns

String: Path to the generated flagstat output file

Details

Runs samtools flagstat to calculate statistics on the alignment file, including:

Total reads
Secondary alignments
Supplementary alignments
Duplicates
Mapped/unmapped reads
Proper pairs
Read 1/2 counts

Requirements

Requires samtools to be available via Bioconda
Input file must be in SAM, BAM or CRAM format

Mycelia.run_transterm — Method

run_transterm(; fasta, gff)

Run TransTermHP to predict rho-independent transcription terminators in DNA sequences.

Arguments

fasta: Path to input FASTA file containing DNA sequences
gff: Optional path to GFF annotation file. If provided, improves prediction accuracy

Returns

String: Path to output file containing TransTermHP predictions

Details

Uses Conda environment 'transtermhp' for execution
Automatically generates coordinate file from FASTA or GFF input
Removes temporary coordinate file after completion
Requires Mycelia's Conda setup

Mycelia.run_trnascan — Method

run_trnascan(; fna_file, outdir)

Run tRNAscan-SE to identify and annotate transfer RNA genes in the provided sequence file.

Arguments

fna_file::String: Path to input FASTA nucleotide file
outdir::String: Output directory path (default: input file path + "_trnascan")

Returns

String: Path to the output directory containing tRNAscan-SE results

Output Files

Creates the following files in outdir:

*.trnascan.out: Main output with tRNA predictions
*.trnascan.bed: BED format coordinates
*.trnascan.fasta: FASTA sequences of predicted tRNAs
*.trnascan.struct: Secondary structure predictions
*.trnascan.stats: Summary statistics
*.trnascan.log: Program execution log

Notes

Uses the general tRNA model (-G flag) suitable for all domains of life
Automatically sets up tRNAscan-SE via Bioconda
Skips processing if output directory contains files

Mycelia.run_unicycler — Method

run_unicycler(; short_1, short_2, long_reads, outdir)

Run hybrid assembly combining short and long reads using Unicycler.

Arguments

short_1::String: Path to first short read FASTQ file
short_2::String: Path to second short read FASTQ file (optional)
long_reads::String: Path to long read FASTQ file
outdir::String: Output directory path (default: "unicycler_output")

Returns

Named tuple containing:

outdir::String: Path to output directory
assembly::String: Path to final assembly file

Details

Automatically creates and uses a conda environment with unicycler
Combines short read accuracy with long read scaffolding
Skips assembly if output directory already exists
Utilizes all available CPU threads

Mycelia.run_virsorter2 — Method

Run VirSorter2 viral sequence identification tool.

VirSorter2 identifies viral sequences in genomic and metagenomic data using machine learning models and database comparisons.

Arguments

input_fasta: Path to input FASTA file
output_directory: Output directory path
database_path: Path to VirSorter2 database directory
include_groups: Comma-separated viral groups to include
- full set = dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae
- tool's original default set = dsDNAphage,ssDNA
- Lavidaviridae = A family of small double‑stranded DNA "virophages" that parasitize the replication machinery of certain NCLDVs
- NCLDV = Nucleocytoplasmic Large DNA Viruses = An informal clade of large double‑stranded DNA viruses that replicate (at least in part) in the cytoplasm of eukaryotic cells.
min_score: Minimum score threshold for viral sequences
min_length: Minimum sequence length threshold
threads: Number of CPU threads to use
provirus_off: Disable provirus detection
max_orf_per_seq: Maximum ORFs per sequence
prep_for_dramv: Prepare output for DRAMv annotation
label: Label for output files
forceall: Force rerun all steps
force: Force rerun even if output files already exist

Returns

NamedTuple containing paths to all generated output files and directories

Mycelia.samtools_index_fasta — Method

samtools_index_fasta(; fasta)

Creates an index file (.fai) for a FASTA reference sequence using samtools.

The FASTA index allows efficient random access to the reference sequence. This is required by many bioinformatics tools that need to quickly fetch subsequences from the reference.

Arguments

fasta: Path to the input FASTA file

Side Effects

Creates a {fasta}.fai index file in the same directory as input
Installs samtools via conda if not already present

Mycelia.sanitize_inline_strings! — Method

sanitize_inline_strings!(
    df::DataFrames.DataFrame
) -> DataFrames.DataFrame

Convert all InlineString columns in a DataFrame to standard Strings. Modifies the dataframe in-place and returns it.

Mycelia.sanitize_inline_strings — Method

sanitize_inline_strings(v::AbstractVector) -> Any

Convert a column to standard Strings if it contains InlineStrings, otherwise return the original column unchanged.

Mycelia.sanity_check_matrix — Method

sanity_check_matrix(M::AbstractMatrix)

Checks matrix shape, value types, and distributional properties. Suggests the most appropriate ePCA function and distance metric.

Returns a NamedTuple with fields:

n_features, n_samples
value_type
range
is_binary
is_integer
is_nonnegative
is_strictly_positive
is_in_01
is_centered
is_overdispersed
suggested_epca
suggested_distance
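
Several of the fields above are simple predicates over the matrix entries. The following sketch computes a subset of them under stated assumptions: rows are treated as features and columns as samples, and the function name sketch_matrix_flags is hypothetical. It omits the centering, overdispersion, and suggestion logic, which the docstring does not specify.

```julia
# Hypothetical sketch of the simpler checks; not Mycelia's implementation.
function sketch_matrix_flags(M::AbstractMatrix{<:Real})
    entries = vec(M)
    lo, hi = extrema(entries)
    return (
        n_features = size(M, 1),   # assumption: rows are features
        n_samples = size(M, 2),    # assumption: columns are samples
        value_type = eltype(M),
        range = (lo, hi),
        is_binary = all(x -> x == 0 || x == 1, entries),
        is_integer = all(isinteger, entries),
        is_nonnegative = lo >= 0,
        is_strictly_positive = lo > 0,
        is_in_01 = lo >= 0 && hi <= 1,
    )
end

flags = sketch_matrix_flags([0 1; 1 0; 1 1])
```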

Mycelia.save_consensus_analysis — Method

Save consensus analysis to JSON file.

Mycelia.save_df_jld2 — Method

save_df_jld2(;df::DataFrames.DataFrame, filename::String, key::String="dataframe")

Save a DataFrame to a JLD2 file.

Arguments

df: The DataFrame to save
filename: Path to the JLD2 file (will add .jld2 extension if not present)
key: The name of the dataset within the JLD2 file (defaults to "dataframe")

Mycelia.save_genome_as_fasta — Method

save_genome_as_fasta(genome, filename)

Save a genome sequence as a FASTA file.

Convenience function for saving a single genome sequence to FASTA format for annotation benchmarking.

Arguments

genome: DNA sequence (BioSequences.LongDNA{4})
filename: Output FASTA filename

See Also

write_fasta: For more flexible FASTA writing with multiple records
random_fasta_record: For generating random FASTA records

Mycelia.save_graph — Method

save_graph(
    graph::Graphs.AbstractGraph,
    outfile::String
) -> String

Saves the given graph to a file in JLD2 format.

Arguments

graph::Graphs.AbstractGraph: The graph to be saved.
outfile::String: The name of the output file. If the file extension is not .jld2, it will be appended automatically.

Returns

String: The name of the output file with the .jld2 extension.
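
The extension handling described above can be sketched as a one-line helper. The name ensure_jld2_extension is hypothetical and only illustrates the behavior.

```julia
# Append ".jld2" only when the path does not already end with it.
ensure_jld2_extension(path::AbstractString) =
    endswith(path, ".jld2") ? String(path) : string(path, ".jld2")

with_added = ensure_jld2_extension("assembly_graph")
unchanged = ensure_jld2_extension("assembly_graph.jld2")
```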

Mycelia.save_kmer_results — Method

save_kmer_results(
;
    filename,
    kmers,
    counts,
    fasta_list,
    k,
    alphabet
)

Save the kmer counting results (kmers vector, counts sparse matrix) and the input FASTA file list to a JLD2 file for long-term storage and reproducibility.

Arguments

filename::AbstractString: Path to the output JLD2 file.
kmers::AbstractVector{<:Kmers.Kmer}: The sorted vector of unique kmer objects.
counts::AbstractMatrix: The (sparse or dense) matrix of kmer counts.
fasta_list::AbstractVector{<:AbstractString}: The list of FASTA file paths used as input.
k::Integer: The kmer size used.
alphabet::Symbol: The alphabet used (:AA, :DNA, :RNA).

Mycelia.save_matrix_jld2 — Method

save_matrix_jld2(; matrix, filename)

Saves a matrix to a JLD2 file format.

Arguments

matrix: The matrix to be saved
filename: String path where the file should be saved

Returns

The filename string that was used to save the matrix

Mycelia.save_reads_as_fastq — Function

save_reads_as_fastq(reads, filename, base_quality=30)

Save DNA reads as a FASTQ file with specified quality scores.

Converts a vector of DNA sequences to FASTQ format with uniform quality scores.

Arguments

reads: Vector of DNA sequences (BioSequences.LongDNA{4})
filename: Output FASTQ filename
base_quality: Base quality score for all positions (default: 30)

See Also

write_fastq: For more flexible FASTQ writing with records
fastq_record: For creating individual FASTQ records
generate_test_fastq_data: For generating test FASTQ data with variable quality scores
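
To illustrate what uniform quality scores mean in save_reads_as_fastq, the sketch below writes plain-string reads to a FASTQ file. It assumes Phred+33 encoding and a hypothetical read naming scheme (read_1, read_2, ...). The real function operates on BioSequences types and may name records differently.

```julia
# Hypothetical sketch: write reads with one quality character repeated per base.
function sketch_write_fastq(reads::Vector{String}, filename::String, base_quality::Int = 30)
    # Assumption: Phred+33 encoding, so Q30 becomes '?' (ASCII 63).
    quality_char = Char(base_quality + 33)
    open(filename, "w") do io
        for (i, read) in enumerate(reads)
            println(io, "@read_", i)                          # header line
            println(io, read)                                 # sequence
            println(io, "+")                                  # separator
            println(io, repeat(quality_char, length(read)))  # quality string
        end
    end
    return filename
end

fastq_path = sketch_write_fastq(["ACGT", "GG"], tempname())
```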

Mycelia.save_validation_report — Method

Save validation report to JSON file.

Mycelia.scg_sbatch — Method

scg_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM using sbatch with specified parameters.

Arguments

job_name::String: Name identifier for the SLURM job
mail_user::String: Email address for job notifications
mail_type::String: Type of mail notifications (default: "ALL")
logdir::String: Directory for error and output logs (default: "~/workspace/slurmlogs")
partition::String: SLURM partition to submit job to
account::String: Account to charge for compute resources
nodes::Int: Number of nodes to allocate (default: 1)
ntasks::Int: Number of tasks to run (default: 1)
time::String: Maximum wall time in format "days-hours:minutes:seconds" (default: "1-00:00:00")
cpus_per_task::Int: CPUs per task (default: 12)
mem_gb::Int: Memory in GB (default: 96)
cmd::String: Command to execute

Returns

Bool: Returns true if submission succeeds

Notes

Function includes 5-second delays before and after submission
Memory is automatically scaled with CPU count
Log files are named with job ID (%j) and job name (%x)
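
The arguments above map naturally onto standard sbatch options. The sketch below builds, but does not run, such a command. It is an illustration under assumptions: the function name sketch_sbatch_cmd, the log file naming pattern, and the use of --wrap are guesses, not Mycelia's implementation, and the delays and memory scaling mentioned above are left out.

```julia
# Hypothetical sketch: assemble an sbatch command from keyword arguments.
function sketch_sbatch_cmd(; job_name, mail_user, partition, account, cmd,
                           mail_type = "ALL",
                           logdir = joinpath(homedir(), "workspace", "slurmlogs"),
                           nodes = 1, ntasks = 1, time = "1-00:00:00",
                           cpus_per_task = 12, mem_gb = 96)
    args = String[
        "sbatch",
        "--job-name=$(job_name)",
        "--mail-user=$(mail_user)",
        "--mail-type=$(mail_type)",
        # %j is the job ID and %x the job name; the exact pattern is an assumption.
        "--error=$(logdir)/%j.%x.err",
        "--output=$(logdir)/%j.%x.out",
        "--partition=$(partition)",
        "--account=$(account)",
        "--nodes=$(nodes)",
        "--ntasks=$(ntasks)",
        "--time=$(time)",
        "--cpus-per-task=$(cpus_per_task)",
        "--mem=$(mem_gb)G",
        "--wrap=$(cmd)",
    ]
    return Cmd(args)
end

# Placeholder values; nothing is submitted here.
submission = sketch_sbatch_cmd(job_name = "demo", mail_user = "user@example.com",
                               partition = "batch", account = "lab",
                               cmd = "echo hello")
```

Building the command as a Cmd from a vector of strings avoids shell quoting issues when cmd contains spaces.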

Mycelia.search_viroid_sequences — Function

search_viroid_sequences(
;
    ...
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}
search_viroid_sequences(
    query::String;
    taxon,
    database,
    max_results
) -> Union{Vector{String}, Vector{T} where T<:(SubString{_A} where _A)}

Search for viroid sequences in NCBI databases using taxonomic and keyword filtering.

Arguments

query::String: Search query terms (default: "viroid")
taxon::String: Taxonomic restriction (default: "viruses[organism]")
database::String: NCBI database to search ("nuccore", "protein", etc.)
max_results::Int: Maximum number of results to return (default: 100)

Returns

Vector{String}: Vector of NCBI accession numbers matching the search criteria

Examples

# Find all viroid genome sequences
accessions = search_viroid_sequences("viroid"; taxon="viruses[organism]", database="nuccore")

# Find specific viroid species
pstv_accessions = search_viroid_sequences("Potato spindle tuber viroid"; taxon="viruses[organism]", database="nuccore")

# Find viroid proteins
protein_accessions = search_viroid_sequences("viroid"; taxon="viruses[organism]", database="protein")

Mycelia.select_action — Method

select_action(policy::DQNPolicy, state::AssemblyState)

Select an action using the DQN policy with epsilon-greedy exploration.

This is a placeholder implementation that will be replaced with actual neural network inference once the ML framework is integrated.

Arguments

policy::DQNPolicy: Trained policy network
state::AssemblyState: Current environment state

Returns

AssemblyAction: Selected action for the given state

Example

action = select_action(policy, current_state)
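
Since the docstring describes the current implementation as a placeholder, the epsilon-greedy rule itself may be worth spelling out. The sketch below is generic and does not use DQNPolicy or AssemblyState. The function name and the vector of estimated action values are hypothetical.

```julia
import Random

# Hypothetical sketch of epsilon-greedy selection over estimated action values.
function sketch_epsilon_greedy(q_values::Vector{Float64}, epsilon::Float64;
                               rng = Random.default_rng())
    if rand(rng) < epsilon
        # Explore: pick any action index uniformly at random.
        return rand(rng, 1:length(q_values))
    end
    # Exploit: pick the action with the highest estimated value.
    return argmax(q_values)
end

greedy_choice = sketch_epsilon_greedy([0.1, 0.9, 0.3], 0.0)
```

With epsilon set to 0.0 the rule is purely greedy, and with 1.0 it is purely random.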

Mycelia.seq2sha256 — Method

seq2sha256(seq::AbstractString) -> String

Compute the SHA-256 hash of a sequence string.

Arguments

seq::AbstractString: Input sequence to be hashed

Returns

String: Hexadecimal representation of the SHA-256 hash

Details

The input sequence is converted to uppercase before hashing.
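
The behavior described above can be sketched with Julia's SHA standard library. The name sketch_seq2sha256 is hypothetical; it shows the uppercase-then-hash idea rather than Mycelia's exact code.

```julia
import SHA

# Uppercase first so that "acgt" and "ACGT" hash identically.
sketch_seq2sha256(seq::AbstractString) = bytes2hex(SHA.sha256(uppercase(seq)))

lower_hash = sketch_seq2sha256("acgt")
upper_hash = sketch_seq2sha256("ACGT")
```

Normalizing case before hashing makes the hash usable as a case-insensitive sequence identifier.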

Mycelia.seq2sha256 — Method

seq2sha256(seq::BioSequences.BioSequence) -> String

Convert a biological sequence to its SHA-256 hash value.

Calculates a cryptographic hash of the sequence by first converting it to a string representation. This method dispatches to the string version of seq2sha256.

Arguments

seq::BioSequences.BioSequence: The biological sequence to hash

Returns

String: A 64-character hexadecimal string representing the SHA-256 hash

Mycelia.sequence_to_kmer_path — Method

Convert sequence to k-mer path representation (simplified version for backward compatibility).

Mycelia.sequence_to_qualmer_path — Method

Convert sequence with quality to qualmer path representation using graph vertices.

Mycelia.sequence_to_stranded_path — Method

sequence_to_stranded_path(
    stranded_kmers,
    sequence
) -> Vector{Pair{Int64, Bool}}

Convert a DNA sequence into a path through a collection of stranded k-mers.

Arguments

stranded_kmers: Collection of unique k-mers representing possible path vertices
sequence: Input DNA sequence to convert to a path

Returns

Vector of Pair{Int,Bool} where:

First element (Int) is the index of the k-mer in stranded_kmers
Second element (Bool) indicates orientation (true=forward, false=reverse)
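
The return format can be illustrated with plain strings. This sketch assumes that each k-mer is stored in exactly one orientation, that sequences contain only A, C, G, and T, and that every window or its reverse complement is present in the collection. The helper names are hypothetical, and the real function works on k-mer types rather than strings.

```julia
# Reverse complement for plain-string DNA (A/C/G/T only).
function revcomp(s::AbstractString)
    complement = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(complement[c] for c in reverse(s))
end

# Hypothetical sketch: map each window of the sequence to (k-mer index => orientation).
function sketch_stranded_path(kmers::Vector{String}, sequence::String)
    k = length(first(kmers))
    index = Dict(kmer => i for (i, kmer) in enumerate(kmers))
    path = Pair{Int,Bool}[]
    for start in 1:(length(sequence) - k + 1)
        window = sequence[start:(start + k - 1)]
        if haskey(index, window)
            push!(path, index[window] => true)            # forward orientation
        else
            push!(path, index[revcomp(window)] => false)  # reverse orientation
        end
    end
    return path
end

forward_path = sketch_stranded_path(["ACG", "CGA"], "ACGA")
reverse_path = sketch_stranded_path(["ACG", "CGA"], "TCGT")
```

The second call walks the same two k-mers on the opposite strand, so the indices appear in reverse order and every orientation flag is false.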

Mycelia.setup_checkm — Method

setup_checkm(; db_dir::String=joinpath(homedir(), "workspace", ".checkm"))

Install CheckM via Bioconda and set up its database.

CheckM is a tool for assessing the quality and completeness of bacterial genomes.

Arguments

db_dir: Directory to store CheckM database (default: ~/workspace/.checkm)

Example

setup_checkm()

Mycelia.setup_checkm2 — Method

setup_checkm2(; db_dir::String=joinpath(homedir(), "workspace", ".checkm2"))

Install CheckM2 via Bioconda and set up its database.

CheckM2 is a rapid tool for assessing the quality and completeness of bacterial genomes.

Arguments

db_dir: Directory to store CheckM2 database (default: ~/workspace/.checkm2)

Example

setup_checkm2()

Mycelia.setup_checkv — Method

setup_checkv(; db_dir::String=joinpath(homedir(), "workspace", ".checkv"))

Install CheckV via Bioconda and set up its database.

CheckV is a tool for assessing the quality and completeness of viral genomes.

Arguments

db_dir: Directory to store CheckV database (default: ~/workspace/.checkv)

Example

setup_checkv()

Mycelia.setup_padloc — Method

setup_padloc() -> Union{Nothing, Base.Process}

Ensure the padloc environment and database are installed.

Downloads the environment if missing and updates the padloc database.

Mycelia.setup_taxonkit_taxonomy — Method

setup_taxonkit_taxonomy(; force_update, max_age_days)

Downloads and extracts the NCBI taxonomy database required for taxonkit operations.

Downloads taxdump.tar.gz from NCBI FTP server and extracts it to ~/.taxonkit/. This is a prerequisite for using taxonkit-based taxonomy functions.

Arguments

force_update::Bool=false: Force download even if taxdump already exists
max_age_days::Int=30: Maximum age in days before warning about stale data

Requirements

Working internet connection
Sufficient disk space (~100MB)
taxonkit must be installed separately

Returns

Nothing

Throws

SystemError if download fails or if unable to create directory
ErrorException if tar extraction fails

Mycelia.sha256_file — Method

sha256_file(file::AbstractString) -> String

Compute the SHA-256 hash of the contents of a file.

Arguments

file::AbstractString: The path to the file for which the SHA-256 hash is to be computed.

Returns

String: The SHA-256 hash of the file contents, represented as a hexadecimal string.
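
A minimal sketch of file hashing with the SHA standard library follows. The name sketch_sha256_file is hypothetical. Passing the open stream to SHA.sha256 lets the library read the file in chunks instead of loading it fully into memory.

```julia
import SHA

# Hash a file's contents by streaming it through SHA.sha256.
function sketch_sha256_file(file::AbstractString)
    return open(file, "r") do io
        bytes2hex(SHA.sha256(io))
    end
end

# Write a small temporary file and hash it.
tmp_path, tmp_io = mktemp()
write(tmp_io, "ACGT\n")
close(tmp_io)
file_hash = sketch_sha256_file(tmp_path)
```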

Mycelia.shortest_probability_path_next — Method

shortest_probability_path_next(
    graph::MetaGraphsNext.MetaGraph,
    source::String,
    target::String
) -> Union{Nothing, Mycelia.GraphPath}

Find the shortest path in probability space between two vertices.

Uses Dijkstra's algorithm where edge distances are -log(probability), so the shortest path corresponds to the highest probability path.

Arguments

graph: MetaGraphsNext k-mer graph
source: Source vertex label
target: Target vertex label

Returns

Union{GraphPath, Nothing}: Shortest probability path, or nothing if no path exists

Algorithm

Convert edge weights to -log(probability) distances
Run Dijkstra's algorithm with strand-aware edge traversal
Reconstruct path maintaining strand information
Convert back to probability space for final result
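
The reason this works is that the sum of -log(p) over a path's edges equals -log of the product of the edge probabilities, so minimizing the sum maximizes the product. The sketch below applies the idea to a small directed graph stored in a Dict. It is an illustration only: it ignores strand information, uses a linear scan instead of a priority queue, and the function name and edge format are hypothetical.

```julia
# Hypothetical sketch: Dijkstra over -log(probability) edge distances.
function sketch_most_probable_path(edges::Dict{Tuple{String,String},Float64},
                                   source::String, target::String)
    # Step 1: convert probabilities to -log distances in an adjacency list.
    neighbors = Dict{String,Vector{Tuple{String,Float64}}}()
    for ((u, v), p) in edges
        push!(get!(neighbors, u, Tuple{String,Float64}[]), (v, -log(p)))
    end
    dist = Dict(source => 0.0)
    prev = Dict{String,String}()
    visited = Set{String}()
    # Step 2: Dijkstra, always expanding the closest unvisited vertex.
    while true
        current = nothing
        for (v, d) in dist
            if !(v in visited) && (current === nothing || d < dist[current])
                current = v
            end
        end
        current === nothing && return nothing   # target is unreachable
        current == target && break
        push!(visited, current)
        for (v, w) in get(neighbors, current, Tuple{String,Float64}[])
            candidate = dist[current] + w
            if candidate < get(dist, v, Inf)
                dist[v] = candidate
                prev[v] = current
            end
        end
    end
    # Step 3: reconstruct the path by walking predecessors back to the source.
    path = [target]
    while path[1] != source
        pushfirst!(path, prev[path[1]])
    end
    # Step 4: convert the total distance back to a probability.
    return (path = path, probability = exp(-dist[target]))
end

toy_edges = Dict(("A", "B") => 0.9, ("B", "D") => 0.9,
                 ("A", "C") => 0.5, ("C", "D") => 1.0,
                 ("A", "D") => 0.6)
best = sketch_most_probable_path(toy_edges, "A", "D")
```

In this toy graph the two-step route through B has probability 0.9 × 0.9 = 0.81, which beats the direct edge at 0.6.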

Mycelia.should_continue_k — Method

Determine if we should continue processing the current k-mer size or move to the next. Uses accuracy-prioritized reward function for decision making.

Mycelia.should_continue_k_advanced — Method

Advanced decision making with reward history tracking. This function maintains state across iterations for better decisions.

Mycelia.should_continue_k_progression — Method

Enhanced convergence detection for k-mer progression. Determines if we should move to the next k-mer size based on multiple criteria.

Mycelia.simple_edit_distance — Method

simple_edit_distance(s1::String, s2::String) -> Int

Simple edit distance calculation for k-mers.
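
The docstring does not say which edit distance is used. Assuming the standard Levenshtein distance (unit cost for substitution, insertion, and deletion), a minimal sketch looks like this; the function name levenshtein is hypothetical.

```julia
# Levenshtein distance with a two-row dynamic programming table.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # deleting the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # match or substitution
        end
        prev = curr
    end
    return prev[end]
end

d = levenshtein("ACGT", "ACGA")
```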

Mycelia.simplify_graph_next — Method

simplify_graph_next(graph::MetaGraphsNext.MetaGraph, bubbles::Vector{BubbleStructure}) -> MetaGraphsNext.MetaGraph

Simplify the graph by resolving bubbles and removing low-confidence paths.

Mycelia.simulate_illumina_paired_reads — Method

simulate_illumina_paired_reads(
;
    in_fasta,
    coverage,
    read_count,
    outbase,
    read_length,
    mflen,
    sdev,
    seqSys,
    amplicon,
    errfree,
    rndSeed
)

Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.

This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.

Arguments

in_fasta::String: Path to the input FASTA file.
coverage::Union{Nothing,Number}: Desired fold coverage (used with --fcov); if nothing and read_count is provided then fold coverage is ignored. (Default: 20)
read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with --rcount instead of fold coverage). (Default: nothing)
outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").
read_length::Int: Length of reads to simulate (default: 150).
mflen::Int: Mean fragment length for paired-end simulations (default: 500).
sdev::Int: Standard deviation of fragment lengths (default: 10).
seqSys::String: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").
paired::Bool: Whether to simulate paired-end reads (default: true).
amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).
errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).
rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).

Outputs

Generates gzipped FASTQ files in the working directory:

For paired-end: $(outbase)1.fq.gz (forward) and $(outbase)2.fq.gz (reverse).
For single-end: $(outbase)1.fq.gz.

Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.

Details

This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.
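
The option selection described above can be sketched as a small argument builder. This is not Mycelia's implementation: the helper art_arguments is hypothetical, and the long option names follow the art_illumina command-line help as I understand it, so verify them against your installed ART version.

```julia
# Hypothetical helper: assemble art_illumina arguments, preferring an
# absolute read count (--rcount) over fold coverage (--fcov).
function art_arguments(; in_fasta, outbase, coverage = 20, read_count = nothing,
                       read_length = 150, mflen = 500, sdev = 10,
                       seqSys = "HS25", paired = true, amplicon = false,
                       errfree = false, rndSeed = nothing)
    args = String["--seqSys", seqSys, "--in", in_fasta,
                  "--len", string(read_length), "--out", outbase]
    if read_count !== nothing
        append!(args, ["--rcount", string(read_count)])
    elseif coverage !== nothing
        append!(args, ["--fcov", string(coverage)])
    else
        error("Provide either coverage or read_count")
    end
    if paired
        # Fragment length settings only apply to paired-end simulation.
        append!(args, ["--paired", "--mflen", string(mflen), "--sdev", string(sdev)])
    end
    amplicon && push!(args, "--amplicon")
    errfree && push!(args, "--errfree")
    rndSeed !== nothing && append!(args, ["--rndSeed", string(rndSeed)])
    return args
end

by_count = art_arguments(in_fasta = "ref.fna", outbase = "out.", read_count = 1000)
by_coverage = art_arguments(in_fasta = "ref.fna", outbase = "out.")
```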

Dependencies

Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads

Mycelia.simulate_nanopore_reads — Method

simulate_nanopore_reads(; fasta, quantity, outfile)

Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.

Arguments

fasta::String: Path to input reference FASTA file
quantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")
outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension

Returns

String: Path to the generated output FASTQ file

See also: simulate_pacbio_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

Mycelia.simulate_nearly_perfect_long_reads — Method

simulate_nearly_perfect_long_reads()

Simulate high-quality long reads with minimal errors using Badread.

Arguments

reference::String: Path to reference FASTA file
quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")
length_mean::Int=40000: Mean read length
length_sd::Int=20000: Standard deviation of read length

Returns

Vector of simulated reads in FASTQ format

Details

Generates nearly perfect long reads by setting error rates and artifacts to minimum values. Uses ideal quality scores and disables common sequencing artifacts like chimeras and adapters.

See also: simulate_pacbio_reads, simulate_nanopore_reads, simulate_short_reads

Mycelia.simulate_pacbio_reads — Method

simulate_pacbio_reads(; fasta, quantity, outfile)

Simulate PacBio HiFi reads using the Badread error model.

Arguments

fasta::String: Path to input FASTA file containing reference sequence
quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000"). This is not a read count.
outfile::String: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix

Returns

String: Path to the generated output file

Notes

Requires Badread tool from Bioconda
Uses PacBio 2021 error and quality score models
Average read length ~15kb
Output is gzipped FASTQ format

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

Mycelia.simulate_variants — Method

simulate_variants(
    fasta_record::FASTX.FASTA.Record;
    n_variants,
    window_size,
    variant_size_disbribution,
    variant_type_likelihoods
) -> Any

Simulates genetic variants (substitutions, insertions, deletions, inversions) in a DNA sequence.

Arguments

fasta_record: Input DNA sequence in FASTA format

Keywords

n_variants=√(sequence_length): Number of variants to generate
window_size=sequence_length/n_variants: Size of windows for variant placement
variant_size_disbribution=Geometric(1/√window_size): Distribution for variant sizes
variant_type_likelihoods: Vector of pairs mapping variant types to probabilities
- :substitution => 10⁻¹
- :insertion => 10⁻²
- :deletion => 10⁻²
- :inversion => 10⁻²

Returns

DataFrame in VCF format containing simulated variants with columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE

Notes

Variants are distributed across sequence windows to ensure spread
Variant sizes are capped by window size
Equivalent variants are filtered out
FILTER column indicates variant type

Mycelia.simulate_variants — Method

simulate_variants(fasta_file::String) -> String

Simulates genetic variants from sequences in a FASTA file and generates corresponding VCF records.

Arguments

fasta_file::String: Path to input FASTA file containing sequences to analyze

Details

Processes each record in the input FASTA file
Generates simulated variants for each sequence
Creates a VCF file with the same base name as input file (.vcf extension)
Updates sequences with simulated variants in a new FASTA file (.vcf.fna extension)

Returns

Path to the modified FASTA file containing sequences with simulated variants

Mycelia.sort_contigs_by_length — Method

Sort contigs by length (descending).

Mycelia.sort_fastq — Function

sort_fastq(input_fastq) -> String
sort_fastq(input_fastq, output_fastq) -> Any

Sort a FASTQ file by read length, from longest to shortest. Each 4-line FASTQ entry is converted into a single tab-separated line, a column holding the read length is added, the lines are passed to Unix sort, the length column is removed, and the result is converted back into a FASTQ file.

http://thegenomefactory.blogspot.com/2012/11/sorting-fastq-files-by-sequence-length.html
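
For illustration, the same ordering can be produced in pure Julia instead of a Unix pipeline. This sketch assumes an uncompressed FASTQ file with strict 4-line records that fits in memory; the function name sort_fastq_by_length is hypothetical.

```julia
# Read 4-line FASTQ records, sort them by sequence length (longest first),
# and write them back out.
function sort_fastq_by_length(input_fastq::AbstractString, output_fastq::AbstractString)
    lines = readlines(input_fastq)
    length(lines) % 4 == 0 || error("FASTQ line count is not a multiple of 4")
    records = [lines[i:i + 3] for i in 1:4:length(lines)]
    sort!(records; by = record -> length(record[2]), rev = true)
    open(output_fastq, "w") do io
        for record in records
            foreach(line -> println(io, line), record)
        end
    end
    return output_fastq
end

# Usage with three toy reads of length 2, 5, and 3.
in_path = tempname() * ".fastq"
write(in_path, "@r1\nAC\n+\nII\n@r2\nACGTA\n+\nIIIII\n@r3\nACG\n+\nIII\n")
out_path = sort_fastq_by_length(in_path, tempname() * ".fastq")
sorted_ids = readlines(out_path)[1:4:end]
```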

Mycelia.split_gff_attributes_into_columns — Method

split_gff_attributes_into_columns(gff_df) -> Any

Takes a GFF (General Feature Format) DataFrame and expands the attributes column into separate columns.

Arguments

gff_df::DataFrame: A DataFrame containing GFF data with an 'attributes' column formatted as key-value pairs separated by semicolons (e.g., "ID=gene1;Name=BRCA1;Type=gene")

Returns

DataFrame: The input DataFrame with additional columns for each attribute key found in the 'attributes' column
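
The core of this transformation is parsing one attributes string into key-value pairs. A minimal sketch of that step, using a plain Dict rather than a DataFrame and a hypothetical helper name, could look like this:

```julia
# Parse a GFF attributes string such as "ID=gene1;Name=BRCA1;Type=gene"
# into a dictionary; fields without an '=' are skipped.
function parse_gff_attributes(attributes::AbstractString)
    parsed = Dict{String, String}()
    for field in split(attributes, ';'; keepempty = false)
        occursin('=', field) || continue
        key, value = split(field, '='; limit = 2)
        parsed[strip(key)] = strip(value)
    end
    return parsed
end

attrs = parse_gff_attributes("ID=gene1;Name=BRCA1;Type=gene")
# Each distinct key would become one new column in the expanded table.
new_columns = sort(collect(keys(attrs)))
```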

Mycelia.step_environment! — Method

step_environment!(env::AssemblyEnvironment, action::AssemblyAction)

Execute an action in the RL environment and return the resulting state and reward.

Arguments

env::AssemblyEnvironment: Environment to step
action::AssemblyAction: Action to execute

Returns

Tuple{AssemblyState, RewardComponents, Bool}: (next_state, reward, done)
- next_state: State after action execution
- reward: Reward components for the action
- done: Whether episode is complete

Example

action = AssemblyAction(:continue_k, Dict(), 0.95, 1000, 5)
next_state, reward, done = step_environment!(env, action)

Mycelia.subsample_reads_seqkit — Method

subsample_reads_seqkit(
;
    in_fastq,
    out_fastq,
    n_reads,
    proportion_reads
)

Subsample reads from a FASTQ file using seqkit.

Arguments

in_fastq::String: Path to input FASTQ file
out_fastq::String="": Path to output FASTQ file. If empty, auto-generated based on input filename
n_reads::Union{Missing,Int}=missing: Number of reads to sample
proportion_reads::Union{Missing,Float64}=missing: Proportion of reads to sample (0.0-1.0)

Returns

String: Path to the output FASTQ file

Mycelia.sufficient_improvements — Function

Determine if sufficient improvements were made to continue with current k. Enhanced with convergence detection and adaptive thresholds.

Mycelia.system_mem_to_minimap_index_size — Method

system_mem_to_minimap_index_size(
;
    system_mem_gb,
    denominator
) -> String

Compute the minimap2 index size string based on available system memory.

Arguments

system_mem_gb::Real: Amount of memory (GB) to allocate for indexing.
denominator::Real: Factor to scale the memory usage (default Mycelia.DEFAULT_MINIMAP_DENOMINATOR).

Returns

String: Value such as "4G" suitable for the minimap2 -I option.

Mycelia.system_overview — Method

system_overview(
;
    path
) -> @NamedTuple{system_threads::Int64, julia_threads::Int64, total_memory::String, available_memory::String, occupied_memory::String, total_storage::String, available_storage::String, occupied_storage::String}

Mycelia.tar_extract — Method

tar_extract(; tarchive, directory)

Extract contents of a gzipped tar archive file to a specified directory.

Arguments

tarchive::AbstractString: Path to the .tar.gz file to extract
directory::AbstractString=dirname(tarchive): Target directory for extraction (defaults to the archive's directory)

Returns

AbstractString: Path to the directory where contents were extracted

Mycelia.tar_gz_files — Method

tar_gz_files(output_file::String, input_files::Vector{String})

Creates a tar.gz archive from input_files. Handles many files by writing a file list and using tar -czf ... -T filelist. Appends .tar.gz if missing.

Mycelia.taxids2lca — Method

taxids2lca(ids::Vector{Int64}) -> Int64

Calculate the Lowest Common Ancestor (LCA) taxonomic ID for a set of input taxonomic IDs.

Arguments

ids::Vector{Int}: Vector of NCBI taxonomic IDs

Returns

Int: The taxonomic ID of the lowest common ancestor

Details

Uses taxonkit to compute the LCA. Automatically sets up the required taxonomy database if not already present in ~/.taxonkit/.

Dependencies

Requires taxonkit (installed via Bioconda)
Requires taxonomy database (downloaded automatically if missing)
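
Conceptually, the LCA is the deepest node shared by the root-to-taxon lineages of all inputs. The sketch below shows that idea on made-up lineages; it does not call taxonkit, and both the function name and the toy IDs are hypothetical.

```julia
# Each lineage lists taxonomy IDs from the root down to the taxon itself.
function lowest_common_ancestor(lineages::Vector{Vector{Int}})
    isempty(lineages) && error("At least one lineage is required")
    shared_depth = minimum(length, lineages)
    lca = nothing
    for depth in 1:shared_depth
        candidate = lineages[1][depth]
        # Stop at the first depth where the lineages disagree.
        all(lineage -> lineage[depth] == candidate, lineages) || break
        lca = candidate
    end
    return lca
end

# Toy lineages (IDs are invented for illustration, not real NCBI taxids).
toy_lineages = [[1, 10, 100, 1001], [1, 10, 100, 1002], [1, 10, 200, 2001]]
lca = lowest_common_ancestor(toy_lineages)
```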

Mycelia.taxids2ncbi_taxonomy_table — Method

taxids2ncbi_taxonomy_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of NCBI taxonomy IDs into a detailed taxonomy table using NCBI Datasets CLI.

Arguments

taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs to query

Returns

DataFrame: Table containing taxonomy information with columns including:
- tax_id
- species
- genus
- family
- order
- class
- phylum
- kingdom

Dependencies

Requires ncbi-datasets-cli Conda package (automatically installed if missing)

Mycelia.taxids2taxonkit_summarized_lineage_table — Method

taxids2taxonkit_summarized_lineage_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of taxonomy IDs to a summarized lineage table using taxonkit.

Arguments

taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs

Returns

DataFrame with the following columns:

taxid: Original input taxonomy ID
species_taxid, species: Species level taxonomy ID and name
genus_taxid, genus: Genus level taxonomy ID and name
family_taxid, family: Family level taxonomy ID and name
superkingdom_taxid, superkingdom: Superkingdom level taxonomy ID and name

Missing values are used when a taxonomic rank is not available.

Mycelia.taxids2taxonkit_taxid2lineage_ranks — Method

taxids2taxonkit_taxid2lineage_ranks(
    taxids::AbstractVector{Int64}
) -> Dict{Int64, Dict{String, @NamedTuple{lineage::String, taxid::Union{Missing, Int64}}}}

Convert taxonomic IDs to a structured lineage rank mapping.

Takes a vector of taxonomic IDs and returns a nested dictionary mapping each input taxid to its complete taxonomic lineage information. For each taxid, creates a dictionary where:

Keys are taxonomic ranks (e.g., "species", "genus", "family")
Values are NamedTuples containing:
- lineage::String: The taxonomic name at that rank
- taxid::Union{Int, Missing}: The corresponding taxonomic ID (if available)

Excludes "no rank" entries from the final output.

Returns: Dict{Int, Dict{String, NamedTuple{(:lineage, :taxid), Tuple{String, Union{Int, Missing}}}}}
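
To make the nested return structure concrete, the sketch below builds it from a few pre-parsed rows. The row format and the helper lineage_ranks are assumptions made for this example, and the family taxid is deliberately set to missing to show how an unavailable ID is represented.

```julia
# Build Dict{taxid => Dict{rank => (lineage, taxid)}} from pre-parsed rows,
# dropping "no rank" entries.
function lineage_ranks(rows)
    Entry = NamedTuple{(:lineage, :taxid), Tuple{String, Union{Missing, Int}}}
    result = Dict{Int, Dict{String, Entry}}()
    for (query, rank, name, rank_taxid) in rows
        rank == "no rank" && continue
        ranks = get!(result, query) do
            Dict{String, Entry}()
        end
        ranks[rank] = Entry((name, rank_taxid))
    end
    return result
end

# Hypothetical parsed rows: (query taxid, rank, name, taxid at that rank).
rows = [
    (562, "species", "Escherichia coli", 562),
    (562, "genus", "Escherichia", 561),
    (562, "no rank", "cellular organisms", 131567),
    (562, "family", "Enterobacteriaceae", missing),  # taxid omitted on purpose
]
table = lineage_ranks(rows)
```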

Mycelia.taxonomic_id_to_children — Method

taxonomic_id_to_children(
    tax_id;
    DATABASE_ID,
    USERNAME,
    PASSWORD
)

Query Neo4j database to find all descendant taxonomic IDs for a given taxonomic ID.

Arguments

tax_id: Source taxonomic ID to find children for
DATABASE_ID: Neo4j database identifier (required)
USERNAME: Neo4j database username (default: "neo4j")
PASSWORD: Neo4j database password (required)

Returns

Vector{Int}: Sorted array of unique child taxonomic IDs

Mycelia.test_cross_validation — Method

Test function for cross-validation with sample data.

Mycelia.test_intelligent_assembly — Method

Test function to verify the implementation with sample data.

Mycelia.test_iterative_assembly — Method

Test function for iterative assembly with sample data.

Mycelia.test_rl_framework — Method

test_rl_framework()

Test the reinforcement learning framework with minimal examples.

This function provides a comprehensive test of the RL infrastructure using small synthetic datasets.

Returns

Bool: Whether all tests passed

Example

success = test_rl_framework()

Mycelia.train_assembly_agent — Method

train_assembly_agent(training_data::Vector{String}, validation_data::Vector{String}; 
                    episodes=1000, episode_length=100)

Train a reinforcement learning agent for assembly optimization.

This function implements the complete training loop for the hierarchical RL system.

Arguments

training_data::Vector{String}: Paths to training FASTQ files
validation_data::Vector{String}: Paths to validation FASTQ files
episodes::Int: Number of training episodes (default: 1000)
episode_length::Int: Maximum steps per episode (default: 100)

Returns

Tuple{DQNPolicy, Vector{Float64}}: (trained_policy, training_rewards)

Example

training_files = ["train1.fastq", "train2.fastq", "train3.fastq"]
validation_files = ["val1.fastq", "val2.fastq"]
policy, rewards = train_assembly_agent(training_files, validation_files, episodes=500)

Mycelia.train_with_curriculum — Method

train_with_curriculum(curriculum_schedule::Vector{Dict}, validation_data::Vector{String})

Train the RL agent using curriculum learning.

Arguments

curriculum_schedule::Vector{Dict}: Curriculum stages
validation_data::Vector{String}: Validation datasets

Returns

Tuple{DQNPolicy, Vector{Float64}}: (trained_policy, stage_rewards)

Example

curriculum = create_curriculum_schedule()
policy, rewards = train_with_curriculum(curriculum, validation_files)

Mycelia.translate_nucleic_acid_fasta — Method

translate_nucleic_acid_fasta(
    fasta_nucleic_acid_file,
    fasta_amino_acid_file
) -> Any

Translates nucleic acid sequences from a FASTA file into amino acid sequences.

Arguments

fasta_nucleic_acid_file::String: Path to input FASTA file containing nucleic acid sequences
fasta_amino_acid_file::String: Path where the translated amino acid sequences will be written

Returns

String: Path to the output FASTA file containing translated amino acid sequences

Mycelia.transterm_output_to_gff — Method

transterm_output_to_gff(transterm_output) -> Any

Convert TransTerm terminator predictions output to GFF3 format.

Parses TransTerm output and generates a standardized GFF3 file with the following transformations:

Sets source field to "transterm"
Sets feature type to "terminator"
Converts terminator IDs to GFF attributes
Renames fields to match GFF3 spec

Arguments

transterm_output::String: Path to the TransTerm output file

Returns

String: Path to the generated GFF3 file (original filename with .gff extension)

Mycelia.trim_galore_paired — Method

trim_galore_paired(; forward_reads, reverse_reads, outdir)

Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.

Arguments

forward_reads::String: Path to forward reads FASTQ file
reverse_reads::String: Path to reverse reads FASTQ file
outdir::String: Output directory for trimmed files

Returns

Tuple{String, String}: Paths to trimmed forward and reverse read files

Dependencies

Requires trim_galore conda environment:

mamba create -n trim_galore -c bioconda trim_galore

Mycelia.try_local_path_improvements — Method

Local heuristic improvements (fallback method). Returns (improved_read, likelihood_improvement).

Mycelia.try_statistical_path_resampling — Method

Try statistical path resampling for alternative high-likelihood paths. Returns (improved_read, likelihood) or nothing if no improvement.

Mycelia.try_viterbi_path_improvement — Method

Try to improve FASTQ read using Viterbi algorithm from viterbi-next.jl. Returns (improved_read, likelihood) or nothing if no improvement.

Mycelia.type_to_string — Method

type_to_string(T::AbstractString) -> Any

Converts an AbstractString type to its string representation.

Arguments

T::AbstractString: The string type to convert

Returns

A string representation of the input type

Mycelia.type_to_string — Method

type_to_string(T) -> Any

Convert a type to its string representation, with special handling for Kmer types.

Arguments

T: The type to convert to string

Returns

String representation of the type
- For Kmer types: Returns "Kmers.DNAKmer{K}" where K is the kmer length
- For other types: Returns the standard string representation

Mycelia.umap_embed — Method

umap_embed(
    X::AbstractMatrix{<:Real};
    n_neighbors,
    min_dist,
    n_components
) -> UMAP.UMAP_{S, M, N, D, Matrix{Int64}, I} where {S<:Real, M<:AbstractMatrix{S}, N<:AbstractMatrix{S}, D<:AbstractMatrix{S}, I<:AbstractMatrix{S}}

umap_embed(scores::AbstractMatrix{<:Real};
           n_neighbors::Int=15,
           min_dist::Float64=0.1,
           n_components::Int=2)

Embed your PC/EPCA scores (k × n_samples) into n_components dimensions via UMAP.

When to use

Use for visualizing high-dimensional data in 2 or 3 dimensions, especially when the data may have nonlinear structure. UMAP is suitable for both continuous and discrete data, and is robust to non-Gaussian distributions. Input should be a matrix of features or dimensionally-reduced scores (e.g., from PCA or EPCA).

Arguments

scores : (components × samples) matrix
n_neighbors : UMAP neighborhood size
min_dist : UMAP min_dist
n_components: output dimension (2 or 3)

Returns

model : the trained UMAP.UMAP model

Mycelia.update_bioconda_env — Method

update_bioconda_env(pkg) -> Base.Process

Update a package and its dependencies in its dedicated Conda environment.

Arguments

pkg::String: Name of the package/environment to update

Mycelia.update_fasta_with_vcf — Method

update_fasta_with_vcf(; in_fasta, vcf_file, out_fasta)

Apply variants from a VCF file to a reference FASTA sequence.

Arguments

in_fasta: Path to input reference FASTA file
vcf_file: Path to input VCF file containing variants
out_fasta: Optional output path for modified FASTA. Defaults to replacing '.vcf' with '.normalized.vcf.fna'

Details

Normalizes indels in the VCF using bcftools norm
Applies variants to the reference sequence using bcftools consensus
Handles temporary files and compression with bgzip/tabix

Requirements

Requires bioconda packages: htslib, tabix, bcftools

Returns

Path to the output FASTA file containing the modified sequence

Mycelia.update_gff_with_mmseqs — Method

update_gff_with_mmseqs(
    gff_file,
    mmseqs_file
) -> DataFrames.DataFrame

Update GFF annotations with protein descriptions from MMseqs2 search results.

Arguments

gff_file::String: Path to input GFF3 format file
mmseqs_file::String: Path to MMseqs2 easy-search output file

Returns

DataFrame: Modified GFF table with updated attribute columns containing protein descriptions

Details

Takes sequence matches from MMseqs2 and adds their descriptions as 'label' and 'product' attributes in the GFF file. Only considers top hits from MMseqs2 results. Preserves existing GFF attributes while prepending new annotations.

Mycelia.upload_edge_type_over_url_from_graph — Method

upload_edge_type_over_url_from_graph(
;
    src_type,
    dst_type,
    edge_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload edges of a specific type from a MetaGraph to a Neo4j database, batching uploads in windows.

Arguments

src_type: Type of source nodes to filter
dst_type: Type of destination nodes to filter
edge_type: Type of edges to upload
graph: MetaGraph containing the nodes and edges
ADDRESS: Neo4j server URL
USERNAME: Neo4j username (default: "neo4j")
PASSWORD: Neo4j password
DATABASE: Neo4j database name (default: "neo4j")
window_size: Number of edges to upload in each batch (default: 100)

Details

Filters edges based on source, destination and edge types
Preserves all edge properties except :TYPE when uploading
Uses MERGE operations to avoid duplicate nodes/relationships
Uploads are performed in batches for better performance
Progress is shown via ProgressMeter
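
The windowed batching described above can be illustrated with Iterators.partition from Julia's standard library. The sketch only records batch sizes; the function upload_in_windows is a hypothetical stand-in and makes no Neo4j calls.

```julia
# Split items into windows of at most window_size and process each window.
function upload_in_windows(items; window_size = 100)
    batch_sizes = Int[]
    for batch in Iterators.partition(items, window_size)
        # A real implementation would send one request per batch here.
        push!(batch_sizes, length(batch))
    end
    return batch_sizes
end

sizes = upload_in_windows(1:250; window_size = 100)
```

With 250 items and a window of 100, the final batch is smaller than the others, so callers should not assume every batch is full.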

Returns

Nothing

Mycelia.upload_node_over_api — Method

upload_node_over_api(
    graph,
    v;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Upload a single node from a MetaGraph to a Neo4j database using the HTTP API.

Arguments

graph: MetaGraph containing the node to be uploaded
v: Vertex identifier in the graph
ADDRESS: Neo4j server address (e.g. "http://localhost:7474")
USERNAME: Neo4j authentication username (default: "neo4j")
PASSWORD: Neo4j authentication password
DATABASE: Target Neo4j database name (default: "neo4j")

Details

Generates and executes a Cypher MERGE command using the node's properties. The node's :TYPE and :identifier properties are used for node labeling, while other non-empty properties are added as node properties.

Mycelia.upload_node_table — Method

upload_node_table(
;
    table,
    window_size,
    address,
    password,
    username,
    database,
    neo4j_import_dir
)

Upload a DataFrame to Neo4j as nodes in batched windows.

Arguments

table::DataFrame: Input DataFrame where each row becomes a node. Must contain a "TYPE" column.
address::String: Neo4j server address (e.g. "bolt://localhost:7687")
password::String: Neo4j database password
neo4j_import_dir::String: Directory path accessible to Neo4j for importing files
window_size::Int=1000: Number of rows to process in each batch
username::String="neo4j": Neo4j database username
database::String="neo4j": Target Neo4j database name

Notes

All rows must have the same node type (specified in "TYPE" column)
Column names become node properties
Requires write permissions on neo4j_import_dir
Large tables are processed in batches of size window_size

Mycelia.upload_node_type_over_url_from_graph — Method

upload_node_type_over_url_from_graph(
;
    node_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload nodes of a specific type from a graph to a Neo4j database using MERGE operations.

Arguments

node_type: The type label for the nodes to upload
graph: Source MetaGraph containing the nodes
ADDRESS: Neo4j server address (e.g. "bolt://localhost:7687")
PASSWORD: Neo4j database password
USERNAME="neo4j": Neo4j username (defaults to "neo4j")
DATABASE="neo4j": Target Neo4j database name (defaults to "neo4j")
window_size=100: Number of nodes to upload in each batch (defaults to 100)

Details

Performs batched uploads of nodes using Neo4j MERGE operations. Node properties are automatically extracted from the graph vertex properties, excluding the 'TYPE' property.

Mycelia.upload_nodes_over_api — Method

upload_nodes_over_api(
    graph;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Uploads all nodes from the given graph to a specified API endpoint.

Arguments

graph: The graph containing the nodes to be uploaded.
ADDRESS: The API endpoint address.
USERNAME: The username for authentication (default: "neo4j").
PASSWORD: The password for authentication.
DATABASE: The database name (default: "neo4j").

Mycelia.upload_nodes_to_neo4j — Method

upload_nodes_to_neo4j(
;
    graph,
    address,
    username,
    password,
    format,
    database,
    neo4j_import_directory
)

Upload all nodes from a MetaGraph to a Neo4j database, processing each unique node type separately.

Arguments

graph: A MetaGraph containing nodes to be uploaded
address: Neo4j server address (e.g., "bolt://localhost:7687")
username: Neo4j authentication username (default: "neo4j")
password: Neo4j authentication password
format: Data format for upload (default: "auto")
database: Target Neo4j database name (default: "neo4j")
neo4j_import_directory: Path to Neo4j's import directory for bulk loading

Mycelia.validate_assemblies_against_holdout — Method

Validate assemblies against holdout validation data through read mapping.

Mycelia.validate_assembly — Method

validate_assembly(assembly::AssemblyResult; reference=nothing) -> Dict{String, Any}

Validate assembly quality using various metrics and optional reference comparison.

Arguments

assembly: Assembly result to validate
reference: Optional reference sequence for comparison

Returns

Dict{String, Any}: Comprehensive validation metrics

Details

Computes assembly quality metrics including:

N50, N90 statistics
Total assembly length and number of contigs
Coverage uniformity (if reference provided)
Structural variant detection (if reference provided)
Gap analysis and repeat characterization
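
As a reference for the N50 and N90 statistics listed above, here is a minimal sketch of the usual definition: the length of the contig at which the running total of length-sorted contigs first reaches x% of the assembly. The helper name nx is hypothetical and this is not necessarily how validate_assembly computes it.

```julia
# Nx statistic: sort contigs longest first and return the length of the
# contig at which the cumulative length reaches x percent of the total.
function nx(lengths::AbstractVector{<:Integer}, x::Real)
    isempty(lengths) && error("No contig lengths provided")
    sorted = sort(lengths; rev = true)
    threshold = sum(sorted) * x / 100
    running = 0
    for len in sorted
        running += len
        running >= threshold && return len
    end
    return sorted[end]
end

contig_lengths = [100, 80, 60, 40, 20]   # toy assembly, total length 300
n50 = nx(contig_lengths, 50)
n90 = nx(contig_lengths, 90)
```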

Mycelia.vcat_with_missing — Method

vcat_with_missing(
    dfs::DataFrames.AbstractDataFrame...
) -> Union{DataFrames.DataFrame, Vector{Any}}

Vertically concatenate DataFrames with different column structures by automatically handling missing values.

Arguments

dfs: Variable number of DataFrames to concatenate vertically

Returns

DataFrame: Combined DataFrame containing all rows and columns from input DataFrames, with missing values where columns didn't exist in original DataFrames
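
The column-union behaviour can be illustrated without DataFrames by treating each table as a vector of NamedTuple rows. This is a sketch of the idea only, with a hypothetical function name; with DataFrames.jl itself, vcat(dfs...; cols = :union) gives a comparable result.

```julia
# Combine tables (vectors of rows) into columns, filling absent values
# with missing. Column order follows first appearance.
function vcat_rows_with_missing(tables...)
    columns = Symbol[]
    for table in tables, row in table, key in keys(row)
        key in columns || push!(columns, key)
    end
    combined = Dict(c => Any[] for c in columns)
    for table in tables, row in table, c in columns
        push!(combined[c], get(row, c, missing))
    end
    return combined
end

table_a = [(id = 1, name = "x")]
table_b = [(id = 2, length = 10)]
combined = vcat_rows_with_missing(table_a, table_b)
```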

Mycelia.viroid_assembly_workflow — Method

viroid_assembly_workflow(
    viroid_name::String;
    outdir,
    k,
    simulate_coverage,
    read_length,
    error_rate,
    download_references,
    run_assembly
) -> NamedTuple{(:viroid_name, :output_directory, :reference_data, :observations, :assembly_results, :workflow_summary), <:Tuple{String, String, Any, Any, Any, String}}

Complete viroid assembly workflow that downloads reference data and performs quality-aware assembly using Rhizomorph algorithms.

This function implements the following workflow:

Downloads viroid reference data from NCBI (genome, CDS transcript, FAA protein)
Generates simulated FASTQ observations for DNA, RNA, and amino acid sequences
Uses Rhizomorph Qualmer assembly workflows with quality score propagation
Outputs FASTQ files with consensus quality scores
Performs iterative Viterbi, probabilistic walks, and heaviest path algorithms

Arguments

viroid_name::String: Name of viroid to study (e.g., "Potato spindle tuber viroid")
outdir::String: Output directory for all results (default: "viroid_assembly_workflow/")
k::Int: K-mer size for assembly (default: 21)
simulate_coverage::Int: Coverage depth for simulated reads (default: 10)
read_length::Int: Length of simulated reads (default: 150)
error_rate::Float64: Simulated sequencing error rate (default: 0.01)
download_references::Bool: Whether to download reference data (default: true)
run_assembly::Bool: Whether to run assembly analysis (default: true)

Returns

NamedTuple: Comprehensive results including reference files, simulated data, and assembly results

Examples

# Complete viroid assembly workflow for PSTV
results = viroid_assembly_workflow("Potato spindle tuber viroid"; outdir="pstv_analysis/")

# Quick analysis with existing reference data
results = viroid_assembly_workflow("Hop stunt viroid"; outdir="hsv_analysis/",
                                 download_references=false)

Workflow Details

This function demonstrates the novel quality-aware assembly capabilities:

Quality Propagation: Per-base PHRED scores maintained throughout assembly
Consensus Scoring: Multiple observations combined using weighted averages
Advanced Algorithms: Iterative Viterbi, probabilistic walks, heaviest path finding
Multi-sequence Types: Handles DNA, RNA, and amino acid sequences
FASTQ Output: Final assemblies include quality scores for downstream analysis

Mycelia.visualize_genome_coverage — Method

visualize_genome_coverage(coverage_table) -> Any

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Returns

Plots.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

Mycelia.viterbi_batch_process — Function

viterbi_batch_process(graph::MetaGraph, sequences::Vector, config::ViterbiConfig) -> Vector{ViterbiPath}

Process multiple sequences in batches for memory efficiency.

Mycelia.viterbi_decode_next — Function

viterbi_decode_next(graph::MetaGraph, observations::Vector, config::ViterbiConfig) -> ViterbiPath

Enhanced Viterbi decoding with strand awareness and memory efficiency.

Mycelia.viterbi_maximum_likelihood_traversals — Method

viterbi_maximum_likelihood_traversals(
    stranded_kmer_graph;
    error_rate,
    verbosity
) -> Vector{FASTX.FASTA.Record}

Finds maximum likelihood paths through a stranded k-mer graph using the Viterbi algorithm to correct sequencing errors.

Arguments

stranded_kmer_graph: A directed graph where vertices represent k-mers and edges represent overlaps
error_rate::Float64: Expected per-base error rate (default: 1/(k+1)). Must be < 0.5
verbosity::String: Output detail level ("debug", "reads", or "dataset")

Returns

Vector of FASTX.FASTA.Record containing error-corrected sequences

Details

Uses dynamic programming to find most likely path through k-mer graph
Accounts for matches, mismatches, insertions and deletions
State likelihoods based on k-mer coverage counts
Transition probabilities derived from error rate
Progress tracking based on verbosity level
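
The dynamic programming idea can be shown with a textbook Viterbi decoder on a toy two-state model. This is a simplified sketch, not Mycelia's k-mer graph implementation: the states, the 0.1 error rate, the transition probabilities, and the function name viterbi are all assumptions, and insertions and deletions are not modelled.

```julia
# Log-space Viterbi: observations are indices into each state's emission vector.
function viterbi(observations, states, log_start, log_trans, log_emit)
    n = length(observations)
    score = fill(-Inf, length(states), n)
    backptr = zeros(Int, length(states), n)
    for s in eachindex(states)
        score[s, 1] = log_start[s] + log_emit[s][observations[1]]
    end
    for t in 2:n, s in eachindex(states)
        best_prev, best_score = 1, -Inf
        for p in eachindex(states)
            candidate = score[p, t - 1] + log_trans[p][s]
            if candidate > best_score
                best_prev, best_score = p, candidate
            end
        end
        score[s, t] = best_score + log_emit[s][observations[t]]
        backptr[s, t] = best_prev
    end
    # Trace back from the best final state.
    path = zeros(Int, n)
    path[n] = argmax(score[:, n])
    for t in n:-1:2
        path[t - 1] = backptr[path[t], t]
    end
    return [states[i] for i in path], maximum(score[:, n])
end

error_rate = 0.1                                   # assumed per-observation error
noisy = [[log(1 - error_rate), log(error_rate)],
         [log(error_rate), log(1 - error_rate)]]   # reused for emissions and transitions
decoded, logp = viterbi([1, 1, 2, 1, 1], ["A", "B"], [log(0.5), log(0.5)], noisy, noisy)
```

Here the single discordant observation in the middle is treated as an error, because explaining it as one emission error is more likely than switching state twice.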

Notes

Error rate should be probability of error (e.g. 0.01 for 1%), not accuracy
Higher verbosity levels ("debug", "reads") provide detailed path finding information
"dataset" verbosity shows only summary statistics

Mycelia.wcss — Method

wcss(clustering_result) -> Any

Calculate the Within-Cluster Sum of Squares (WCSS) for a clustering result.

Arguments

clustering_result: A clustering result object containing:
- counts: Vector with number of points in each cluster
- assignments: Vector of cluster assignments for each point
- costs: Vector of distances/costs from each point to its cluster center

Returns

Float64: The total within-cluster sum of squared distances

Description

WCSS measures the compactness of clusters by summing the squared distances between each data point and its assigned cluster center.
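
For reference, the quantity can be computed directly from raw data. The sketch below assumes points and centers are stored as columns of a matrix and uses a hypothetical function name; the documented function instead reads precomputed costs from the clustering result.

```julia
# Sum of squared Euclidean distances from each point to its assigned center.
# points: features × observations; centers: features × clusters.
function within_cluster_sum_of_squares(points::AbstractMatrix{<:Real},
                                       centers::AbstractMatrix{<:Real},
                                       assignments::AbstractVector{<:Integer})
    total = 0.0
    for (i, cluster) in enumerate(assignments)
        total += sum(abs2, points[:, i] .- centers[:, cluster])
    end
    return total
end

points = [0.0 2.0 10.0 12.0]     # four one-dimensional points
centers = [1.0 11.0]             # two cluster centers
total = within_cluster_sum_of_squares(points, centers, [1, 1, 2, 2])
```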

Mycelia.write_biosequence_gfa — Method

write_biosequence_gfa(
    graph::MetaGraphsNext.MetaGraph,
    output_file::AbstractString
)

Write a BioSequence graph to GFA format.

Arguments

graph: MetaGraphsNext BioSequence graph
output_file: Path to output GFA file

Example

write_biosequence_gfa(graph, "assembly.gfa")

Mycelia.write_fasta — Method

write_fasta(; outfile, records, gzip)

Writes FASTA records to a file, optionally gzipped.

Arguments

outfile::AbstractString: Path to the output FASTA file. Will append ".gz" if gzip is true and ".gz" isn't already the extension.
records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.
gzip::Bool: Optionally force gzip compression of the output. By default, compression is inferred from the file name.

Returns

outfile::String: The path to the output FASTA file (including ".gz" if applicable).
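
The naming rule described above can be sketched as a small string helper. `resolve_fasta_outfile` is a hypothetical name used only for illustration; it mirrors the documented behaviour (infer compression from the name, append ".gz" when compression is forced) and is not taken from Mycelia's source.

```julia
# Hypothetical helper mirroring the documented naming rule (not Mycelia's code).
# When gzip is not given, compression is inferred from the file name.
function resolve_fasta_outfile(outfile::AbstractString; gzip::Bool = endswith(outfile, ".gz"))
    if gzip && !endswith(outfile, ".gz")
        return outfile * ".gz"     # compression forced, extension missing
    end
    return String(outfile)
end

plain = resolve_fasta_outfile("genome.fna")                # name unchanged
forced = resolve_fasta_outfile("genome.fna"; gzip = true)  # ".gz" appended
inferred = resolve_fasta_outfile("genome.fna.gz")          # already has ".gz"
```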

Mycelia.write_fastas_from_normalized_fastx_tables — Method

write_fastas_from_normalized_fastx_tables(
    table_paths::Vector{String};
    output_dir::String = pwd(),
    show_progress::Bool = true,
    overwrite::Bool = false,
    error_handler = (e, table_path)->display((e, table_path))
) -> NamedTuple

Given a vector of normalized fastx table paths, writes out gzipped FASTA files in parallel. Each table must have columns: "fastx_sha256", "record_sha256", "record_sequence". Automatically decompresses input files if they end with ".gz". Returns a summary NamedTuple with successes, failures, failed tables, and output files.

Keyword Arguments

output_dir: Directory to write .fna.gz files to.
show_progress: Show a progress bar (default: true).
overwrite: Overwrite existing files (default: false).
error_handler: Function called with (exception, table_path) on error.

Mycelia.write_fastq — Method

write_fastq(; records, filename, gzip=false)

Write FASTQ records to a file using FASTX.jl. The filename extension is validated and must be .fastq, .fq, .fastq.gz, or .fq.gz. If gzip is true or the filename ends with .gz, the output is gzipped. records must be an iterable of FASTX.FASTQ.Record.
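
The extension check and the gzip decision described above can be sketched with two one-line helpers. Both names (`is_valid_fastq_name`, `should_gzip`) are hypothetical and the code is an illustration of the stated rules, not Mycelia's implementation.

```julia
# Hypothetical helpers illustrating the documented rules (not Mycelia's code).
fastq_extensions = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

# A filename is accepted when it ends with one of the allowed extensions
is_valid_fastq_name(filename) = any(ext -> endswith(filename, ext), fastq_extensions)

# Output is gzipped when requested explicitly or when the name ends with .gz
should_gzip(filename; gzip = false) = gzip || endswith(filename, ".gz")

accepted = is_valid_fastq_name("reads.fq.gz")
rejected = is_valid_fastq_name("reads.fasta")
by_flag = should_gzip("reads.fastq"; gzip = true)
by_name = should_gzip("reads.fq.gz")
neither = should_gzip("reads.fastq")
```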

Mycelia.write_fastq_contigs — Method

write_fastq_contigs(result::AssemblyResult, output_file::String)

Write quality-aware contigs to a FASTQ file if quality information is available.

Mycelia.write_gfa_next — Method

write_gfa_next(
    graph::MetaGraphsNext.MetaGraph,
    outfile::AbstractString
) -> AbstractString

Write a MetaGraphsNext k-mer graph to GFA (Graphical Fragment Assembly) format.

This function exports strand-aware k-mer graphs to standard GFA format, handling:

Canonical k-mer vertices as segments (S lines)
Strand-aware edges as links (L lines) with proper orientations
Coverage information as depth annotations

Arguments

graph: MetaGraphsNext k-mer graph with strand-aware edges
outfile: Path where the GFA file should be written

Returns

Path to the written GFA file

GFA Format

The output follows GFA v1.0 specification:

Header (H) line with version
Segment (S) lines: vertex_id, canonical_k-mer_sequence, depth
Link (L) lines: source_id, source_orientation, target_id, target_orientation, overlap

Example

graph = build_kmer_graph_next(DNAKmer{3}, observations)
write_gfa_next(graph, "assembly.gfa")
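
To make the record layout concrete, the sketch below builds the three GFA v1 line types by hand for two canonical 3-mers that overlap by two bases. `toy_gfa_lines` is a hypothetical helper, and the `DP:f` depth tag is an assumption: the documentation above does not state which tag Mycelia uses for depth.

```julia
# Hand-built GFA v1 lines (hypothetical helper, not Mycelia's code).
# segments : tuples of (vertex id, k-mer sequence, depth)
# links    : tuples of (source id, source orientation, target id, target orientation)
function toy_gfa_lines(segments, links; overlap::Int)
    lines = ["H\tVN:Z:1.0"]                                  # header with version
    for (id, sequence, depth) in segments
        # DP:f is an assumed depth tag, chosen only for illustration
        push!(lines, join(("S", id, sequence, "DP:f:$(depth)"), '\t'))
    end
    for (src, src_orientation, dst, dst_orientation) in links
        # The overlap is written as a CIGAR string, e.g. 2M for k = 3
        push!(lines, join(("L", src, src_orientation, dst, dst_orientation, "$(overlap)M"), '\t'))
    end
    return lines
end

segments = [(1, "AAC", 3.0), (2, "ACC", 2.0)]   # AAC -> ACC share the 2-mer "AC"
links = [(1, '+', 2, '+')]
lines = toy_gfa_lines(segments, links; overlap = 2)
```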

Mycelia.write_gff — Method

write_gff(; gff, outfile)

Write GFF (General Feature Format) data to a tab-delimited file.

Arguments

gff: DataFrame/Table containing GFF formatted data
outfile: String path where the output file should be written

Returns

String: Path to the written output file
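
A minimal sketch of tab-delimited GFF output, using only Base Julia. It assumes each row is already a tuple of the nine standard GFF columns (seqid, source, type, start, end, score, strand, phase, attributes). `toy_write_gff` and the example feature are hypothetical, and the real function takes a DataFrame rather than a vector of tuples.

```julia
# Hypothetical sketch of tab-delimited GFF output (not Mycelia's code).
function toy_write_gff(rows, outfile::AbstractString)
    open(outfile, "w") do io
        for row in rows
            println(io, join(row, '\t'))   # one feature per line, nine tab-separated fields
        end
    end
    return outfile                         # return the path, as the documented function does
end

# seqid, source, type, start, end, score, strand, phase, attributes
rows = [("contig1", "example", "gene", 100, 900, ".", "+", ".", "ID=gene1")]
outfile = toy_write_gff(rows, joinpath(mktempdir(), "toy.gff"))
written = readlines(outfile)
```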

Mycelia.write_quality_biosequence_gfa — Method

write_quality_biosequence_gfa(
    graph::MetaGraphsNext.MetaGraph,
    output_file::AbstractString
)

Write a quality-aware BioSequence graph to GFA format with quality information.

Arguments

graph: Quality-aware BioSequence graph
output_file: Path to output GFA file

Example

write_quality_biosequence_gfa(graph, "assembly_with_quality.gfa")

Mycelia.write_tsvgz — Method

write_tsvgz(df::DataFrames.DataFrame, filename::String; buffer_in_memory::Bool=false, threaded::Bool=true, bufsize::Int=10*1024*1024, force::Bool=false)

Write a DataFrame to a gzipped TSV file.

Arguments

df: The DataFrame to write
filename: Path to the output file (will add .tsv.gz extension as needed)
buffer_in_memory: If false, uses temporary files for large data (default: false)
bufsize: Buffer size in bytes for compression stream (default: 10MB)
force: If true, overwrite existing non-empty files (default: false)

Returns

String: The final filename with proper extension

Mycelia.write_vcf_table — Method

write_vcf_table(; vcf_file, vcf_table, fasta_file)

Write variant data to a VCF v4.3 format file.

Arguments

vcf_file::String: Output path for the VCF file
vcf_table::DataFrame: Table containing variant data with standard VCF columns
fasta_file::String: Path to the reference genome FASTA file

Details

Automatically filters out equivalent variants where REF == ALT. Includes standard VCF headers for substitutions, insertions, deletions, and inversions. Adds GT (Genotype) and GQ (Genotype Quality) format fields.
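
The REF == ALT filter and the GT/GQ format headers can be sketched without any packages. Everything here is illustrative: the variant rows, the `SAMPLE` column name, and the `1:60` genotype values are placeholders, and the code does not reproduce the full set of headers the function writes.

```julia
# Hypothetical sketch of the REF == ALT filter and VCF header lines (not Mycelia's code).
variants = [
    (chrom = "chr1", pos = 10, ref = "A", alt = "G"),
    (chrom = "chr1", pos = 25, ref = "C", alt = "C"),    # REF == ALT, dropped below
    (chrom = "chr1", pos = 40, ref = "T", alt = "TA"),
]
kept = filter(v -> v.ref != v.alt, variants)

header = [
    "##fileformat=VCFv4.3",
    "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
    "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">",
    join(("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"), '\t'),
]

# Placeholder genotype (1) and genotype quality (60) for every kept variant
body = [join((v.chrom, v.pos, ".", v.ref, v.alt, ".", ".", ".", "GT:GQ", "1:60"), '\t') for v in kept]
```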

Mycelia.xam_to_contig_mapping_stats — Method

xam_to_contig_mapping_stats(xam) -> Any

Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.

Arguments

xam: Path to XAM file or XAM object

Returns

A DataFrame with per-contig statistics including:

n_aligned_reads: Number of aligned reads
total_aligned_bases: Sum of alignment lengths
total_alignment_score: Sum of alignment scores
Mapping quality statistics (mean, std, median)
Alignment length statistics (mean, std, median)
Alignment score statistics (mean, std, median)
Percent mismatches statistics (mean, std, median)

Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.
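
The filtering and per-contig aggregation described above can be sketched on plain named tuples. The records, the `toy_contig_stats` helper, and the `mean_mapq` and `median_alnlen` field names are hypothetical. Only `n_aligned_reads` and `total_aligned_bases` are names taken from the description, and the real function returns a DataFrame rather than a Dict.

```julia
import Statistics

# Hypothetical alignment records (not parsed from a real XAM file).
alignments = [
    (contig = "c1", ismapped = true,  isprimary = true,  mapq = 60, alnlen = 100),
    (contig = "c1", ismapped = true,  isprimary = true,  mapq = 40, alnlen = 150),
    (contig = "c1", ismapped = true,  isprimary = false, mapq = 0,  alnlen = 80),   # secondary, skipped
    (contig = "c2", ismapped = false, isprimary = true,  mapq = 0,  alnlen = 0),    # unmapped, skipped
    (contig = "c2", ismapped = true,  isprimary = true,  mapq = 30, alnlen = 120),
]

function toy_contig_stats(alignments)
    # Keep only primary alignments of mapped reads
    kept = filter(a -> a.ismapped && a.isprimary, alignments)
    stats = Dict{String, NamedTuple}()
    for contig in unique(a.contig for a in kept)
        group = filter(a -> a.contig == contig, kept)
        lengths = [a.alnlen for a in group]
        mapqs = [a.mapq for a in group]
        stats[contig] = (
            n_aligned_reads = length(group),
            total_aligned_bases = sum(lengths),
            mean_mapq = Statistics.mean(mapqs),
            median_alnlen = Statistics.median(lengths),
        )
    end
    return stats
end

stats = toy_contig_stats(alignments)
```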

Mycelia.xam_to_dataframe — Method

xam_to_dataframe(xam_path::String) -> DataFrames.DataFrame

Convert a SAM/BAM file to a DataFrame using the open_xam function.

Parameters:

xam_path: Path to the SAM/BAM file
header: Whether to include the header (default: false)

Returns:

A DataFrame containing the parsed data

Mycelia.xam_to_dataframe — Method

xam_to_dataframe(
    reader::XAM.SAM.Reader
) -> DataFrames.DataFrame

Convert SAM/BAM records from a XAM.SAM.Reader into a DataFrame.

Parameters:

reader: A XAM.SAM.Reader object for iterating over records

Returns:

A DataFrame containing all record data in a structured format