Mycelia Documentation
Welcome to the Mycelia documentation.
Installation
Install Julia (if not already installed)
I have had trouble getting the visualization libraries Plots.jl and Makie.jl (and associated packages) to load correctly on HPC due to the complexities of the default LD_LIBRARY_PATH
I imagine other research supercomputer users may have similar issues, although I don't have these issues on cloud vendors like GCP or AWS
To enable Julia to install all of it's own necessary dependencies independent of the system, I reset the LD_LIBRARY_PATH
variable prior to launching Julia !!
This can be done easily when launching Julia from the command line by
export LD_LIBRARY_PATH="" && julia
Clone the repo as a Julia package
import Pkg
# for production usage
Pkg.add(url="https://github.com/cjprybol/Mycelia.git")
# for development
Pkg.develop(url="git@github.com:cjprybol/Mycelia.git")
Function Docstrings
Mycelia._detect_sequence_extension
— Method_detect_sequence_extension(sequence_type::Symbol) -> String
Internal helper function to convert sequence type to file extension.
Arguments
- sequence_type: Symbol representing sequence type (:DNA, :RNA, or :AA)
Returns
- String: Appropriate file extensions
Mycelia.add_bioconda_env
— Methodadd_bioconda_env(pkg; force) -> Union{Nothing, Base.Process}
Create a new Conda environment with a specified Bioconda package.
Arguments
pkg::String
: Package name to install. Can include channel specification using the format "channel::package"
Keywords
force::Bool=false
: If true, recreates the environment even if it already exists
Details
The function creates a new Conda environment named after the package and installs the package into it. It uses channel priority: conda-forge > bioconda > defaults. If CONDA_RUNNER is set to 'mamba', it will ensure mamba is installed first.
Examples
# Install basic package
add_bioconda_env("blast")
# Install from specific channel
add_bioconda_env("bioconda::blast")
# Force reinstallation
add_bioconda_env("blast", force=true)
Notes
- Requires Conda.jl to be installed and configured
- Uses CONDA_RUNNER global variable to determine whether to use conda or mamba
- Cleans conda cache after installation
Mycelia.add_edgemer_to_graph!
— Methodadd_edgemer_to_graph!(
graph,
record_identifier,
index,
observed_edgemer
) -> Any
Add an observed edgemer to a graph with its associated metadata.
Arguments
graph::MetaGraph
: The graph to modifyrecord_identifier
: Identifier for the source recordindex
: Position where edgemer was observedobserved_edgemer
: The biological sequence representing the edgemer
Details
Processes the edgemer by:
- Splitting it into source and destination kmers
- Converting kmers to their canonical forms
- Creating or updating an edge with orientation metadata
- Storing observation details (record, position, orientation)
Returns
Modified graph with the new edge and metadata
Note
If the edge already exists, the observation is added to the existing metadata.
Mycelia.add_fastx_records_to_graph!
— Methodadd_fastx_records_to_graph!(graph, fastxs) -> Any
Add FASTX records from multiple files as a graph property.
Arguments
graph
: A MetaGraph that will store the FASTX recordsfastxs
: Collection of FASTA/FASTQ file paths to process
Details
Creates a dictionary mapping sequence descriptions to their corresponding FASTX records, then stores this dictionary as a graph property under the key :records
. Multiple input files are merged, with later files overwriting records with duplicate descriptions.
Returns
The modified graph with added records property.
Mycelia.add_record_edgemers_to_graph!
— Methodadd_record_edgemers_to_graph!(graph) -> Any
Processes DNA sequence records stored in the graph and adds their edgemers (k+1 length subsequences) to build the graph structure.
Arguments
graph
: A Mycelia graph object containing DNA sequence records and graph properties
Details
- Uses the k-mer size specified in
graph.gprops[:k]
to generate k+1 length edgemers - Iterates through each record in
graph.gprops[:records]
- For each record, generates all possible overlapping edgemers
- Adds each edgemer to the graph with its position and record information
Returns
- The modified graph object with added edgemer information
Mycelia.amino_acids_to_codons
— Methodamino_acids_to_codons(
) -> Dict{BioSymbols.AminoAcid, DataType}
Creates a mapping from amino acids to representative DNA codons using the standard genetic code.
Returns
- Dictionary mapping each amino acid (including stop codon
AA_Term
) to a valid DNA codon that encodes it
Mycelia.annotate_aa_fasta
— Methodannotate_aa_fasta(
;
fasta,
identifier,
basedir,
mmseqsdb,
threads
)
Annotate amino acid sequences in a FASTA file using MMseqs2 search against UniRef50 database.
Arguments
fasta
: Path to input FASTA file containing amino acid sequencesidentifier
: Name for the output directory (defaults to FASTA filename without extension)basedir
: Base directory for output (defaults to current directory)mmseqsdb
: Path to MMseqs2 formatted UniRef50 database (defaults to ~/workspace/mmseqs/UniRef50)threads
: Number of CPU threads to use (defaults to system thread count)
Returns
- Path to the output directory containing MMseqs2 search results
The function creates a new directory named by identifier
under basedir
, copies the input FASTA file, and runs MMseqs2 easy-search against the specified database. If the output directory already exists, the function skips processing and returns the directory path.
Mycelia.annotate_fasta
— Methodannotate_fasta(
;
fasta,
identifier,
basedir,
mmseqsdb,
threads
)
Perform comprehensive annotation of a FASTA file including gene prediction, protein homology search, and terminator prediction.
Arguments
fasta::String
: Path to input FASTA fileidentifier::String
: Unique identifier for output directory (default: FASTA filename without extension)basedir::String
: Base directory for output (default: current working directory)mmseqsdb::String
: Path to MMseqs2 UniRef50 database (default: "~/workspace/mmseqs/UniRef50")threads::Int
: Number of CPU threads to use (default: all available)
Processing Steps
- Creates output directory and copies input FASTA
- Runs Prodigal for gene prediction (nucleotide, amino acid, and GFF output)
- Performs MMseqs2 homology search against UniRef50
- Predicts terminators using TransTerm
- Combines annotations into a unified GFF file
- Generates GenBank format output
Returns
String
: Path to the output directory containing all generated files
Files Generated
.prodigal.fna
: Predicted genes (nucleotide).prodigal.faa
: Predicted proteins.prodigal.gff
: Prodigal GFF annotations.gff
: Combined annotations.gff.genbank
: Final GenBank format
Mycelia.assess_alignment
— Methodassess_alignment(
a,
b
) -> @NamedTuple{total_matches::Int64, total_edits::Int64}
Aligns two sequences using the Levenshtein distance and returns the total number of matches and edits.
Arguments
a::AbstractString
: The first sequence to be aligned.b::AbstractString
: The second sequence to be aligned.
Returns
NamedTuple{(:total_matches, :total_edits), Tuple{Int, Int}}
: A named tuple containing:total_matches::Int
: The total number of matching bases in the alignment.total_edits::Int
: The total number of edits (insertions, deletions, substitutions) in the alignment.
Mycelia.assess_alignment_accuracy
— Methodassess_alignment_accuracy(alignment_result) -> Any
Return proportion of matched bases in alignment to total matches + edits.
Calculate the accuracy of a sequence alignment by computing the ratio of matched bases to total alignment operations (matches + edits).
Arguments
alignment_result
: Alignment result object containingtotal_matches
andtotal_edits
fields
Returns
Float64 between 0.0 and 1.0 representing alignment accuracy, where:
- 1.0 indicates perfect alignment (all matches)
- 0.0 indicates no matches
Mycelia.assess_assembly_kmer_quality
— Methodassess_assembly_kmer_quality(; assembly, observations, ks)
Evaluate genome assembly quality by comparing k-mer distributions between assembled sequences and raw observations.
Arguments
assembly
: Input assembled sequences to evaluateobservations
: Raw sequencing data for comparisonks::Vector{Int}
: Vector of k-mer sizes to analyze (default: k=17 to 23)
Returns
DataFrame containing quality metrics for each k-mer size:
k
: K-mer length usedcosine_distance
: Cosine similarity between k-mer distributionsjs_divergence
: Jensen-Shannon divergence between distributionsqv
: MerQury-style quality value score
Mycelia.assess_dnamer_saturation
— Methodassess_dnamer_saturation(
fastx::AbstractString;
power,
outdir,
min_k,
max_k,
threshold,
kmers_to_assess
)
Analyzes k-mer saturation in a FASTA/FASTQ file to determine optimal k-mer size.
Arguments
fastx::AbstractString
: Path to input FASTA/FASTQ filepower::Int=10
: Exponent for downsampling k-mers (2^power)outdir::String=""
: Output directory for results. Uses current directory if emptymin_k::Int=3
: Minimum k-mer size to evaluatemax_k::Int=17
: Maximum k-mer size to evaluatethreshold::Float64=0.1
: Saturation threshold for k-mer assessmentkmers_to_assess::Int=10_000_000
: Maximum number of k-mers to sample
Returns
Dict{Int,Float64}
: Dictionary mapping k-mer sizes to their saturation scores
Mycelia.assess_dnamer_saturation
— Methodassess_dnamer_saturation(
fastxs::AbstractVector{<:AbstractString},
kmer_type;
kmers_to_assess,
power,
min_count
) -> Union{@NamedTuple{sampling_points::Vector{Int64}, unique_kmer_counts::Vector{Int64}}, NamedTuple{(:sampling_points, :unique_kmer_counts, :eof), <:Tuple{Vector, Vector{Int64}, Bool}}}
Assess k-mer saturation in DNA sequences from FASTX files.
Arguments
fastxs::AbstractVector{<:AbstractString}
: Vector of paths to FASTA/FASTQ fileskmer_type
: Type of k-mer to analyze (e.g., DNAKmer{21})kmers_to_assess=Inf
: Maximum number of k-mers to processpower=10
: Base for exponential sampling intervalsmin_count=1
: Minimum count threshold for considering a k-mer
Returns
Named tuple containing:
sampling_points::Vector{Int}
: K-mer counts at which samples were takenunique_kmer_counts::Vector{Int}
: Number of unique canonical k-mers at each sampling pointeof::Bool
: Whether the entire input was processed
Details
Analyzes k-mer saturation by counting unique canonical k-mers at exponentially spaced intervals (powers of power
). Useful for assessing sequence complexity and coverage. Returns early if all possible k-mers are observed.
Mycelia.assess_dnamer_saturation
— Methodassess_dnamer_saturation(
fastxs::AbstractVector{<:AbstractString};
power,
outdir,
min_k,
max_k,
threshold,
kmers_to_assess,
plot
)
Analyze k-mer saturation in DNA sequences to determine optimal k value.
Arguments
fastxs
: Vector of paths to FASTA/FASTQ files to analyzepower
: Base of logarithmic sampling points (default: 10)outdir
: Optional output directory for plots and resultsmin_k
: Minimum k-mer size to test (default: 7)max_k
: Maximum k-mer size to test (default: 17)threshold
: Saturation threshold to determine optimal k (default: 0.1)kmers_to_assess
: Maximum number of k-mers to sample (default: 10M)plot
: Whether to generate saturation curves (default: true)
Returns
Integer representing the first k value that achieves saturation below threshold. If no k value meets the threshold, returns the k with minimum saturation.
Details
- Tests only prime k values between mink and maxk
- Generates saturation curves using logarithmic sampling
- Fits curves to estimate maximum unique k-mers
- If outdir is provided, saves plots as SVG and chosen k value to text file
Mycelia.assess_optimal_kmer_alignment
— Methodassess_optimal_kmer_alignment(
kmer,
observed_kmer
) -> Tuple{@NamedTuple{total_matches::Int64, total_edits::Int64}, Union{Missing, Bool}}
Used to determine which orientation provides an optimal alignment for initiating path likelihood analyses in viterbi analysis
Compare alignment scores between a query k-mer and an observed k-mer in both forward and reverse complement orientations to determine optimal alignment.
Arguments
kmer
: Query k-mer sequence to alignobserved_kmer
: Target k-mer sequence to align against
Returns
A tuple containing:
alignment_result
: The alignment result object for the optimal orientationorientation
: Boolean indicating orientation (true
= forward,false
= reverse complement,missing
= tied scores)
Details
- Performs pairwise alignment in both orientations using
assess_alignment()
- Calculates accuracy scores using
assess_alignment_accuracy()
- For tied alignment scores, randomly selects one orientation
- Uses BioSequences.reverse_complement for reverse orientation comparison
Mycelia.bam_to_fastq
— Methodbam_to_fastq(; bam, fastq)
Convert a BAM file to FASTQ format with gzip compression.
Arguments
bam
: Path to input BAM filefastq
: Optional output path. Defaults to input path with ".fq.gz" extension
Returns
- Path to the generated FASTQ file
Details
- Uses samtools through conda environment
- Automatically skips if output file exists
- Output is gzip compressed
- Requires samtools to be available via conda
Mycelia.bandage_visualize
— Methodbandage_visualize(; gfa, img)
Generate a visualization of a genome assembly graph using Bandage.
Arguments
gfa
: Path to input GFA (Graphical Fragment Assembly) fileimg
: Optional output image path. Defaults to GFA filename with .png extension
Returns
- Path to the generated image file
Mycelia.biosequences_to_counts_table
— Methodbiosequences_to_counts_table(; biosequences, k)
Convert a collection of biological sequences into a k-mer count matrix.
Arguments
biosequences
: Vector of biological sequences (DNA, RNA, or Amino Acids)k
: Length of k-mers to count
Returns
Named tuple with:
sorted_kmers
: Vector of all unique k-mers found, lexicographically sortedkmer_counts_matrix
: Sparse matrix where rows are k-mers and columns are sequences
Details
- For DNA sequences, counts canonical k-mers (both strands)
- Uses parallel processing with Thread-safe progress tracking
- Memory efficient sparse matrix representation
- Supports DNA, RNA and Amino Acid sequences
Mycelia.biosequences_to_dense_counts_table
— Methodbiosequences_to_dense_counts_table(; biosequences, k)
Convert a collection of biological sequences into a dense k-mer count matrix.
Arguments
biosequences
: Collection of DNA, RNA, or amino acid sequences (BioSequence types)k::Integer
: Length of k-mers to count (must be ≤ 13)
Returns
Named tuple containing:
sorted_kmers
: Vector of all possible k-mers in sorted orderkmer_counts_matrix
: Dense matrix where rows are k-mers and columns are sequences
Details
- For DNA sequences, counts canonical k-mers (both strands)
- For RNA and protein sequences, counts exact k-mers
- Uses parallel processing with threads
- For k > 13, use
fasta_list_to_sparse_counts_table
instead
Mycelia.blastdb2table
— Methodblastdb2table(; blastdb, outfile, force)
Convert a BLAST database to Arrow format with sequence and taxonomy information. Uses a simple serial approach with direct Arrow streaming for minimal memory usage.
Arguments
blastdb::String
: Path to the BLAST databaseoutfile::String=""
: Output file path. If empty, generates name based on input databaseforce::Bool=false
: Whether to overwrite existing output file
Returns
String
: Path to the generated output file (.arrow)
Output Format
Arrow file containing columns (in this order):
- sequence SHA256
- sequence hash
- sequence id
- accession
- gi
- sequence title
- BLAST name
- taxid
- taxonomic super kingdom
- scientific name
- scientific names for leaf-node taxids
- common taxonomic name
- common taxonomic names for leaf-node taxids
- leaf-node taxids
- membership integer
- ordinal id
- PIG
- sequence length
- sequence
Mycelia.blastdb_to_fasta
— Methodblastdb_to_fasta(
;
blastdb,
entries,
taxids,
outfile,
force,
max_cores
)
Convert a BLAST database to FASTA format.
Arguments
blastdb::String
: Name of the BLAST database to convert (e.g. "nr", "nt")dbdir::String
: Directory containing the BLAST database filesoutfile::String
: Path for the output FASTA file
Returns
- Path to the generated FASTA file as String
Mycelia.build_directed_kmer_graph
— Methodbuild_directed_kmer_graph(; fastq, k, plot)
Constructs a directed graph representation of k-mer transitions from FASTQ sequencing data.
Arguments
fastq
: Path to input FASTQ filek
: K-mer size (default: 1). Must be odd and prime. If k=1, optimal size is auto-determinedplot
: Boolean to display quality distribution plot (default: false)
Returns
MetaDiGraph with properties:
- assembly_k: k-mer size used
- kmer_counts: frequency of each k-mer
- transition_likelihoods: edge weights between k-mers
- kmermeanquality, kmertotalquality: quality metrics
- branchingnodes, unbranchingnodes: topological classification
- likelyvalidkmer_indices: k-mers above mean quality threshold
- likelysequencingartifact_indices: potential erroneous k-mers
Note
For DNA assembly, quality scores are normalized across both strands.
Mycelia.build_stranded_kmer_graph
— Methodbuild_stranded_kmer_graph(
kmer_type,
observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}}
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer
Create a weighted, strand-specific kmer (de bruijn) graph from a set of kmers and a series of sequence observations in FASTA format.
Mycelia.canonical
— Methodcanonical(
qmer::Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
) -> Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
Mycelia.canonical
— Methodcanonical(
qmer::Mycelia.Qualmer{KmerT<:Union{Kmers.DNAKmer, Kmers.RNAKmer}}
) -> Mycelia.Qualmer{KmerT} where KmerT<:Kmers.Kmer
Get the canonical representation of a DNA qualmer, considering both sequence and quality. For DNA/RNA, this involves potentially reverse-complementing the k-mer and reversing the quality scores accordingly.
Mycelia.canonicalize_kmer_counts!
— Methodcanonicalize_kmer_counts!(kmer_counts) -> Any
Canonicalizes the k-mer counts in the given dictionary.
This function iterates over the provided dictionary kmer_counts
, which maps k-mers to their respective counts. For each k-mer that is not in its canonical form, it converts the k-mer to its canonical form and updates the count in the dictionary accordingly. If the canonical form of the k-mer already exists in the dictionary, their counts are summed. The original non-canonical k-mer is then removed from the dictionary.
Arguments
kmer_counts::Dict{BioSequences.Kmer, Int}
: A dictionary where keys are k-mers and values are their counts.
Returns
- The input dictionary
kmer_counts
with all k-mers in their canonical form, sorted by k-mers.
Mycelia.canonicalize_kmer_counts
— Methodcanonicalize_kmer_counts(kmer_counts) -> Any
Normalize k-mer counts into a canonical form by creating a non-mutating copy.
Arguments
kmer_counts
: Dictionary or collection of k-mer count data
Returns
- A new normalized k-mer count collection
Mycelia.chromosome_coverage_table_to_plot
— Methodchromosome_coverage_table_to_plot(cdf) -> Plots.Plot
Creates a visualization of chromosome coverage data with statistical thresholds.
Arguments
cdf::DataFrame
: Coverage data frame containing columns:index
: Chromosome position indicesdepth
: Coverage depth valueschromosome
: Chromosome identifiermean_coverage
: Mean coverage valuestd_coverage
: Standard deviation of coverage3σ
: Boolean vector indicating +3 sigma regions-3σ
: Boolean vector indicating -3 sigma regions
Returns
- A StatsPlots plot object showing:
- Raw coverage data (black line)
- Mean coverage and ±1,2,3σ thresholds (rainbow colors)
- Highlighted regions exceeding ±3σ thresholds (red vertical lines)
Mycelia.codon_optimize
— Methodcodon_optimize(
;
normalized_codon_frequencies,
protein_sequence,
n_iterations
)
Optimizes the DNA sequence encoding for a given protein sequence using codon usage frequencies.
Arguments
normalized_codon_frequencies
: Dictionary mapping amino acids to their codon frequenciesprotein_sequence::BioSequences.LongAA
: Target protein sequence to optimizen_iterations::Integer
: Number of optimization iterations to perform
Algorithm
- Creates initial DNA sequence through reverse translation
- Iteratively generates new sequences by sampling codons based on their frequencies
- Keeps track of the sequence with highest codon usage likelihood
Returns
BioSequences.LongDNA{2}
: Optimized DNA sequence encoding the input protein
Mycelia.codons_to_amino_acids
— Methodcodons_to_amino_acids(
) -> Union{Dict{_A, BioSequences.LongAA} where _A, Dict{_A, V} where {_A, V<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})}}
Creates a mapping between DNA codons and their corresponding amino acids using the standard genetic code.
Returns a dictionary where:
- Keys are 3-letter DNA codons (e.g., "ATG")
- Values are the corresponding amino acids from BioSequences.jl
Mycelia.concatenate_files
— Methodconcatenate_files(; files, file)
Join fasta files without any regard to record uniqueness.
A cross-platform version of cat *.fasta > joint.fasta
See mergefastafiles
Concatenate multiple FASTA files into a single output file by simple appending.
Arguments
files
: Vector of paths to input FASTA filesfile
: Path where the concatenated output will be written
Returns
- Path to the output concatenated file
Details
Platform-independent implementation of cat *.fasta > combined.fasta
. Files are processed sequentially with a progress indicator.
Mycelia.contig_is_circular
— Methodcontig_is_circular(
graph_file::String,
contig_name::String
) -> Any
Returns bool indicating whether the contig is a circle
graphfile = path to assembly graph.gfa file contigname = name of the contig
Determine if a contig represents a circular structure in the assembly graph.
A circular contig is one where the sequence forms a complete loop in the assembly graph, typically representing structures like plasmids, circular chromosomes, or other circular DNA elements.
Arguments
graph_file::String
: Path to the assembly graph in GFA formatcontig_name::String
: Name/identifier of the contig to check
Returns
Bool
:true
if the contig forms a circular structure,false
otherwise
Mycelia.contig_is_cleanly_assembled
— Methodcontig_is_cleanly_assembled(
graph_file::String,
contig_name::String
) -> Bool
Returns bool indicating whether the contig is cleanly assembled.
By cleanly assembled we mean that the contig does not have other contigs attached in the same connected component.
graphfile = path to assembly graph.gfa file contigname = name of the contig
Check if a contig exists in isolation within its connected component in an assembly graph.
Arguments
graph_file::String
: Path to the assembly graph file in GFA formatcontig_name::String
: Name/identifier of the contig to check
Returns
Bool
:true
if the contig exists alone in its connected component,false
otherwise
Details
A contig is considered "cleanly assembled" if it appears as a single entry in the assembly graph's connected components. This function parses the GFA file and checks the contig's isolation status using the graph structure.
Mycelia.convert
— Methodconvert(args)
Convert between different graph file formats.
Arguments
args
: Dictionary with required keys:"in"
: Input filepath (supported: .jld2, .gfa, .neo4j)"out"
: Output filepath (supported: .jld2, .gfa, .neo4j)
Details
Performs format conversion based on file extensions. For non-JLD2 to non-JLD2 conversions, uses JLD2 as an intermediate format.
Mycelia.convert_sequence
— Methodconvert_sequence(seq::AbstractString) -> Any
Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:
- DNA sequences are converted using
BioSequences.LongDNA
- RNA sequences are converted using
BioSequences.LongRNA
- AA sequences are converted using
BioSequences.LongAA
Mycelia.copy_to_tempdir
— Methodcopy_to_tempdir(file_path::String) -> String
Create a copy of a file in a temporary directory while preserving the original filename.
Arguments
file_path::String
: Path to the source file to be copied
Returns
String
: Path to the newly created temporary file
Mycelia.copy_with_unique_identifier
— Methodcopy_with_unique_identifier(
infile,
out_directory,
unique_identifier;
force
) -> Any
Copy a file to a new location with a unique identifier prepended to the filename.
Arguments
infile::AbstractString
: Path to the source file to copyout_directory::AbstractString
: Destination directory for the copied fileunique_identifier::AbstractString
: String to prepend to the filenameforce::Bool=true
: If true, overwrite existing files
Returns
String
: Path to the newly created file
Mycelia.count_canonical_kmers
— Methodcount_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any
Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.
Arguments
KMER_TYPE
: Type parameter specifying the k-mer size and structuresequences
: Iterator of biological sequences to analyze
Returns
Dict{KMER_TYPE,Int}
: Dictionary mapping canonical k-mers to their counts
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{KMER_TYPE},
fastx_file::AbstractString
) -> Any
Count k-mers in a FASTA/FASTQ file and return their frequencies.
Arguments
KMER_TYPE
: Type parameter specifying the k-mer type (e.g.,DNAKmer{K}
)fastx_file
: Path to input FASTA/FASTQ file
Returns
Dict{KMER_TYPE, Int}
: Dictionary mapping each k-mer to its frequency
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})
Count the frequency of amino acid k-mers in a biological sequence.
Arguments
Kmers.Kmer{A,K}
: Type parameter specifying amino acid alphabet (A) and k-mer length (K)sequence
: Input biological sequence to analyze
Returns
A sorted dictionary mapping each k-mer to its frequency count in the sequence.
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
sequence::BioSequences.LongSequence
) -> Any
Count the frequency of each k-mer in a DNA sequence.
Arguments
::Type{Kmers.Kmer{A,K}}
: K-mer type with alphabet A and length Ksequence::BioSequences.LongSequence
: Input DNA sequence to analyze
Returns
A sorted dictionary mapping each k-mer to its frequency count in the sequence.
Type Parameters
A <: BioSequences.DNAAlphabet
: DNA alphabet typeK
: Length of k-mers
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
sequence::BioSequences.LongSequence
) -> Any
Count the frequency of each k-mer in an RNA sequence.
Arguments
Kmer
: Type parameter specifying the k-mer length K and RNA alphabetsequence
: Input RNA sequence to analyze
Returns
Dict{Kmers.Kmer, Int}
: Sorted dictionary mapping each k-mer to its frequency count
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{KMER_TYPE},
sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any
Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.
Arguments
KMER_TYPE
: Type parameter specifying the k-mer length and encoding (e.g.,DNAKmer{4}
for 4-mers)sequences
: A FASTA or FASTQ reader containing the biological sequences to analyze
Returns
A dictionary mapping k-mers to their counts in the input sequences
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{KMER_TYPE},
record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any
Count the frequency of amino acid k-mers in a biological sequence.
Arguments
Kmers.Kmer{A,K}
: Type parameter specifying amino acid alphabet (A) and k-mer length (K)sequence
: Input biological sequence to analyze
Returns
A sorted dictionary mapping each k-mer to its frequency count in the sequence.
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{KMER_TYPE},
fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any
Count k-mers across multiple FASTA/FASTQ files and merge the results.
Arguments
KMER_TYPE
: Type parameter specifying the k-mer length (e.g.,DNAKmer{4}
for 4-mers)fastx_files
: Vector of paths to FASTA/FASTQ files
Returns
Dict{KMER_TYPE, Int}
: Dictionary mapping k-mers to their total counts across all files
Mycelia.count_kmers
— Methodcount_kmers(
_::Type{KMER_TYPE},
records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any
Count k-mers across multiple sequence records and return a sorted frequency table.
Arguments
KMER_TYPE
: Type parameter specifying the k-mer length (e.g.,DNAKmer{3}
for 3-mers)records
: Vector of FASTA/FASTQ records to analyze
Returns
Dict{KMER_TYPE, Int}
: Sorted dictionary mapping k-mers to their frequencies
Mycelia.count_matrix_to_probability_matrix
— Methodcount_matrix_to_probability_matrix(counts_matrix) -> Any
Convert a matrix of counts into a probability matrix by normalizing each column to sum to 1.0.
Arguments
counts_matrix::Matrix{<:Number}
: Input matrix where each column represents counts/frequencies
Returns
Matrix{Float64}
: Probability matrix where each column sums to 1.0
Mycelia.count_records
— Methodcount_records(fastx) -> Int64
Counts the total number of records in a FASTA/FASTQ file.
Arguments
fastx
: Path to a FASTA or FASTQ file (can be gzipped)
Returns
- Number of records (sequences) in the file
Mycelia.countmap_columns
— Methodcountmap_columns(table)
Generate and display frequency counts for all columns in a DataFrame.
Arguments
table::DataFrame
: Input DataFrame to analyze
Details
Iterates through each column in the DataFrame and displays:
- The column name
- A Dict mapping unique values to their frequencies using StatsBase.countmap
Mycelia.create_database
— Methodcreate_database(; database, address, username, password)
Creates a new Neo4j database instance if it doesn't already exist.
Arguments
database::String
: Name of the database to createaddress::String
: Neo4j server address (e.g. "neo4j://localhost:7687")username::String
: Neo4j authentication username (defaults to "neo4j")password::String
: Neo4j authentication password
Notes
- Requires system database privileges to execute
- Silently returns if database already exists
- Temporarily switches to system database to perform creation
Mycelia.create_node_constraints
— Methodcreate_node_constraints(
graph;
address,
username,
password,
database
)
Creates unique identifier constraints for each node type in a Neo4j database.
Arguments
graph
: A MetaGraph containing nodes with TYPE propertiesaddress
: Neo4j server addressusername
: Neo4j username (default: "neo4j")password
: Neo4j passworddatabase
: Neo4j database name (default: "neo4j")
Details
Extracts unique node types from the graph and creates Neo4j constraints ensuring each node of a given type has a unique identifier property.
Failed constraint creation attempts are silently skipped.
Mycelia.create_tarchive
— Methodcreate_tarchive(; directory, tarchive)
Creates a gzipped tar archive of the specified directory along with verification files.
Arguments
directory
: Source directory path to archivetarchive
: Optional output archive path (defaults to directory name with .tar.gz extension)
Generated Files
{tarchive}
: The compressed tar archive{tarchive}.log
: Contents listing of the archive{tarchive}.hashdeep.dfxml
: Cryptographic hashes (MD5, SHA1, SHA256) of the archive
Returns
- Path to the created tar archive file
Mycelia.current_unix_datetime
— Methodcurrent_unix_datetime() -> Int64
Get the current time as a Unix timestamp (seconds since epoch).
Returns
Int
: Current time as an integer Unix timestamp (seconds since January 1, 1970 UTC)
Examples
unix_time = current_unix_datetime()
# => 1709071368 (example value, will differ based on current time)
Mycelia.cypher
— Methodcypher(
cmd;
address,
username,
password,
format,
database
) -> Cmd
Constructs a command to execute Neo4j Cypher queries via cypher-shell.
Arguments
cmd
: The Cypher query command to executeaddress::String="neo4j://localhost:7687"
: Neo4j server addressusername::String="neo4j"
: Neo4j authentication usernamepassword::String="password"
: Neo4j authentication passwordformat::String="auto"
: Output format (auto, verbose, or plain)database::String="neo4j"
: Target Neo4j database name
Returns
Cmd
: A command object ready for execution
Mycelia.deduplicate_fasta_file
— Methoddeduplicate_fasta_file(in_fasta, out_fasta) -> Any
Remove duplicate sequences from a FASTA file while preserving headers.
Arguments
in_fasta
: Path to input FASTA fileout_fasta
: Path where deduplicated FASTA will be written
Returns
Path to the output FASTA file (same as out_fasta
parameter)
Details
- Sequences are considered identical if they match exactly (case-sensitive)
- For duplicate sequences, keeps the first header encountered
- Input sequences are sorted by identifier before deduplication
- Preserves the original sequence formatting
Mycelia.detect_alphabet
— Methoddetect_alphabet(seq::AbstractString) -> Symbol
Determines the alphabet of a sequence. The function scans through seq
only once:
- If a 'T' or 't' is found (and no 'U/u'), the sequence is classified as DNA.
- If a 'U' or 'u' is found (and no 'T/t'), it is classified as RNA.
- If both T and U occur, an error is thrown.
- If a character outside the canonical nucleotide and ambiguity codes is encountered, the sequence is assumed to be protein.
- If neither T nor U are found, the sequence is assumed to be DNA.
Mycelia.detect_sequence_extension
— Methoddetect_sequence_extension(
record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> String
Detect sequence type from input and suggest appropriate file extension.
Arguments
record
: A FASTA/FASTQ recordsequence
: A string or BioSequence containing sequence data
Returns
String
: Suggested file extension:- ".fna" for DNA
- ".frn" for RNA
- ".faa" for protein
- ".fa" for unrecognized sequences
Mycelia.determine_fasta_coverage
— Methoddetermine_fasta_coverage(bam) -> Any
Calculate per-base genomic coverage from a BAM file using bedtools.
Arguments
bam::String
: Path to input BAM file
Returns
String
: Path to the generated coverage file (.coverage.txt
)
Details
Uses bedtools genomecov to compute per-base coverage. Creates a coverage file with the format: <chromosome> <position> <coverage_depth>. If the coverage file already exists, returns the existing file path.
Dependencies
Requires bedtools (automatically installed in conda environment)
Mycelia.determine_max_canonical_kmers
— Methoddetermine_max_canonical_kmers(k, ALPHABET) -> Any
Calculate the maximum number of possible canonical k-mers for a given alphabet.
Arguments
k::Integer
: Length of k-merALPHABET::Vector{Char}
: Character set (nucleotides or amino acids)
Returns
Int
: Maximum number of possible canonical k-mers
Details
- For amino acids (AA_ALPHABET): returns total possible k-mers
- For nucleotides: returns half of total possible k-mers (canonical form)
- Requires odd k-mer length for nucleotide alphabets
Mycelia.determine_max_possible_kmers
— Methoddetermine_max_possible_kmers(k, ALPHABET) -> Any
Calculate the total number of possible unique k-mers that can be generated from a given alphabet.
Arguments
k
: Length of k-mers to considerALPHABET
: Vector containing the allowed characters/symbols
Returns
- Integer representing the maximum number of possible unique k-mers (|Σ|ᵏ)
Mycelia.determine_primary_contig
— Methoddetermine_primary_contig(qualimap_results) -> Any
Determines the contig with the greatest number of total bases mapping to it
Identify the primary contig based on mapping coverage from Qualimap results.
Arguments
qualimap_results::DataFrame
: DataFrame containing Qualimap alignment statistics with columns "Contig" and "Mapped bases"
Returns
String
: Name of the contig with the highest number of mapped bases
Description
Takes Qualimap alignment results and determines which contig has the most total bases mapped to it, which often indicates the main chromosomal assembly.
Mycelia.determine_read_lengths
— Methoddetermine_read_lengths(
fastq_file;
total_reads
) -> Vector{Int64}
Calculate sequence lengths for reads in a FASTQ file.
Arguments
fastq_file::String
: Path to input FASTQ filetotal_reads::Integer=Inf
: Number of reads to process (defaults to all reads)
Returns
Vector{Int}
: Array containing the length of each sequence read
Mycelia.distance_matrix_to_newick
— Methoddistance_matrix_to_newick(
;
distance_matrix,
labels,
outfile
)
Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)
Convert a distance matrix into a Newick tree format using UPGMA hierarchical clustering.
Arguments
distance_matrix
: Square matrix of pairwise distances between entitieslabels
: Vector of labels corresponding to the entities in the distance matrixoutfile
: Path where the Newick tree file will be written
Returns
- Path to the generated Newick tree file
Details
Performs hierarchical clustering using the UPGMA (average linkage) method and converts the resulting dendrogram into Newick tree format. The branch lengths in the tree represent the heights from the clustering.
Mycelia.document_frequency
— Methoddocument_frequency(documents) -> Dict{_A, Int64} where _A
Calculate the document frequency of tokens across a collection of documents.
Arguments
documents
: Collection of text documents where each document is a string
Returns
- Dictionary mapping each unique token to the number of documents it appears in
Description
Computes how many documents contain each unique token. Each document is tokenized by splitting on whitespace. Tokens are counted only once per document, regardless of how many times they appear within that document.
Mycelia.download_bandage
— Functiondownload_bandage() -> String
download_bandage(outdir) -> Any
Downloads and installs Bandage, a bioinformatics visualization tool for genome assembly graphs.
Arguments
outdir="/usr/local/bin"
: Target installation directory for the Bandage executable
Returns
- Path to the installed Bandage executable
Details
- Downloads Bandage v0.8.1 for Ubuntu
- Installs required system dependencies (libxcb-glx0, libx11-xcb-dev, libfontconfig, libgl1-mesa-glx)
- Attempts installation with sudo, falls back to root if sudo fails
- Skips download if Bandage is already installed at target location
Dependencies
Requires system commands: wget, unzip, apt
Mycelia.download_blast_db
— Methoddownload_blast_db(; db, dbdir, source, wait)
Smart downloading of blast dbs depending on interactive, non interactive context
For a list of all available databases, run: Mycelia.list_blastdbs()
Downloads and sets up BLAST databases from various sources.
Arguments
db
: Name of the BLAST database to downloaddbdir
: Directory to store the downloaded database (default: "~/workspace/blastdb")source
: Download source - one of ["", "aws", "gcp", "ncbi"]. Empty string auto-detects fastest sourcewait
: Whether to wait for download completion (default: true)
Returns
- String path to the downloaded database directory
Mycelia.download_genome_by_accession
— Methoddownload_genome_by_accession(
;
accession,
outdir,
compressed
)
Downloads a genomic sequence from NCBI's nucleotide database by its accession number.
Arguments
accession::String
: NCBI nucleotide accession number (e.g. "NC_045512")outdir::String
: Output directory path. Defaults to current directorycompressed::Bool
: If true, compresses output file with gzip. Defaults to true
Returns
String
: Path to the downloaded file (.fna or .fna.gz)
Mycelia.download_genome_by_ftp
— Methoddownload_genome_by_ftp(; ftp, outdir)
Downloads a genome file from NCBI FTP server to the specified directory.
Arguments
ftp::String
: NCBI FTP path for the genome (e.g. "ftp://ftp.ncbi.nlm.nih.gov/.../")outdir::String
: Output directory path. Defaults to current working directory.
Returns
String
: Path to the downloaded file
Notes
- If the target file already exists, returns the existing file path without re-downloading
- Downloads the genomic.fna.gz version of the genome
Mycelia.download_mmseqs_db
— Methoddownload_mmseqs_db(; db, dbdir, force, wait)
Downloads and sets up MMseqs2 reference databases for sequence searching and analysis.
Arguments
db::String
: Name of database to download (see table below)dbdir::String
: Directory to store the downloaded database (default: "~/workspace/mmseqs")force::Bool
: If true, force re-download even if database exists (default: false)wait::Bool
: If true, wait for download to complete (default: true)
Returns
- Path to the downloaded database as a String
Available Databases
Database | Type | Taxonomy | Description |
---|---|---|---|
UniRef100 | Aminoacid | Yes | UniProt Reference Clusters - 100% identity |
UniRef90 | Aminoacid | Yes | UniProt Reference Clusters - 90% identity |
UniRef50 | Aminoacid | Yes | UniProt Reference Clusters - 50% identity |
UniProtKB | Aminoacid | Yes | Universal Protein Knowledge Base |
NR | Aminoacid | Yes | NCBI Non-redundant proteins |
NT | Nucleotide | No | NCBI Nucleotide collection |
GTDB | Aminoacid | Yes | Genome Taxonomy Database |
PDB | Aminoacid | No | Protein Data Bank structures |
Pfam-A.full | Profile | No | Protein family alignments |
SILVA | Nucleotide | Yes | Ribosomal RNA database |
Name Type Taxonomy Url
- UniRef100 Aminoacid yes https://www.uniprot.org/help/uniref
- UniRef90 Aminoacid yes https://www.uniprot.org/help/uniref
- UniRef50 Aminoacid yes https://www.uniprot.org/help/uniref
- UniProtKB Aminoacid yes https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL Aminoacid yes https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot Aminoacid yes https://uniprot.org
- NR Aminoacid yes https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT Nucleotide - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB Aminoacid yes https://gtdb.ecogenomic.org
- PDB Aminoacid - https://www.rcsb.org
- PDB70 Profile - https://github.com/soedinglab/hh-suite
- Pfam-A.full Profile - https://pfam.xfam.org
- Pfam-A.seed Profile - https://pfam.xfam.org
- Pfam-B Profile - https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD Profile - https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG Profile - http://eggnog5.embl.de
- VOGDB Profile - https://vogdb.org
- dbCAN2 Profile - http://bcb.unl.edu/dbCAN2
- SILVA Nucleotide yes https://www.arb-silva.de
- Resfinder Nucleotide - https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari Nucleotide yes https://github.com/lskatz/Kalamari
Mycelia.draw_dendrogram_tree
— Methoddraw_dendrogram_tree(
mg::MetaGraphs.MetaDiGraph;
width,
height,
fontsize,
margins,
mergenodesize,
lineweight,
filename
) -> Luxor.Drawing
Draw a dendrogram visualization of hierarchical clustering results stored in a MetaDiGraph.
Arguments
mg::MetaGraphs.MetaDiGraph
: Graph containing hierarchical clustering results. Must have:hcl
in graph properties with clustering data and vertex properties containing:x
,:y
coordinates.
Keywords
width::Integer=500
: Width of output image in pixelsheight::Integer=500
: Height of output image in pixelsfontsize::Integer=12
: Font size for node labels in pointsmargins::Float64
: Margin size in pixels, defaults to min(width,height)/20mergenodesize::Float64=1
: Size of circular nodes at merge pointslineweight::Float64=1
: Thickness of dendrogram linesfilename::String
: Output filename, defaults to timestamp with .dendrogram.png extension
Returns
Nothing, but saves dendrogram image to disk and displays preview.
Mycelia.draw_radial_tree
— Methoddraw_radial_tree(
mg::MetaGraphs.MetaDiGraph;
width,
height,
fontsize,
margins,
mergenodesize,
lineweight,
filename
) -> Luxor.Drawing
Draw a radial hierarchical clustering tree visualization and save it as an image file.
Arguments
mg::MetaGraphs.MetaDiGraph
: A meta directed graph containing hierarchical clustering data with required graph properties:hcl
containing clustering information.
Keywords
width::Int=500
: Width of the output image in pixelsheight::Int=500
: Height of the output image in pixelsfontsize::Int=12
: Font size for node labelsmargins::Float64
: Margin size (automatically calculated as min(width,height)/20)mergenodesize::Float64=1
: Size of the merge point nodeslineweight::Float64=1
: Thickness of the connecting linesfilename::String
: Output filename (defaults to timestamp with ".radial.png" suffix)
Details
The function creates a radial visualization of hierarchical clustering results where:
- Leaf nodes are arranged in a circle with labels
- Internal nodes represent merge points
- Connections show the hierarchical structure through arcs and lines
The visualization is saved as a PNG file and automatically previewed.
Required Graph Properties
The input graph must have:
mg.gprops[:hcl].labels
: Vector of leaf node labelsmg.gprops[:hcl].order
: Vector of ordered leaf nodesmg.gprops[:hcl].merges
: Matrix of merge operationsmg.vprops[v][:x]
: X coordinate for each vertexmg.vprops[v][:y]
: Y coordinate for each vertex
Mycelia.drop_empty_columns!
— Methoddrop_empty_columns!(
df::DataFrames.AbstractDataFrame
) -> DataFrames.AbstractDataFrame
Identify all columns that have only missing or empty values, and remove those columns from the dataframe in-place.
Returns a modified version of the original dataframe.
See also: dropemptycolumns
Mycelia.drop_empty_columns
— Methoddrop_empty_columns(df::DataFrames.AbstractDataFrame) -> Any
Identify all columns that have only missing or empty values, and remove those columns from the dataframe.
Returns a modified copy of the dataframe.
See also: dropemptycolumns!
Mycelia.edge_path_to_sequence
— Methodedge_path_to_sequence(kmer_graph, edge_path) -> Any
Converts a path of edges in a kmer graph into a DNA sequence by concatenating overlapping kmers.
Arguments
kmer_graph
: A directed graph where vertices represent kmers and edges represent overlapsedge_path
: Vector of edges representing a path through the graph
Returns
A BioSequences.LongDNASeq
containing the merged sequence represented by the path
Details
The function:
- Takes the first kmer from the source vertex of first edge
- For each edge, handles orientation (forward/reverse complement)
- Verifies overlaps between consecutive kmers
- Concatenates unique bases to build final sequence
Mycelia.edge_probability
— Methodedge_probability(stranded_kmer_graph, edge) -> Any
Calculate the probability of traversing a specific edge in a stranded k-mer graph.
The probability is computed as the ratio of this edge's coverage weight to the sum of all outgoing edge weights from the source vertex.
edge_probability(stranded_kmer_graph, edge) -> Any
Arguments
stranded_kmer_graph
: A directed graph where edges represent k-mer connectionsedge
: The edge for which to calculate the probability
Returns
Float64
: Probability in range [0,1] representing likelihood of traversing this edge Returns 0.0 if sum of all outgoing edge weights is zero
Note
Probability is based on the :coverage property of edges, using their length as weights
Mycelia.edgemer_to_vertex_kmers
— Methodedgemer_to_vertex_kmers(
edgemer
) -> Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}
Convert an edgemer to two vertex kmers.
This function takes an edgemer (a sequence of DNA nucleotides) and converts it into two vertex kmers. A kmer is a substring of length k from a DNA sequence. The first kmer is created from the first n-1 elements of the edgemer, and the second kmer is created from the last n-1 elements of the edgemer.
Arguments
edgemer::AbstractVector{T}
: A vector of DNA nucleotides where T is a subtype ofBioSequences.DNAAlphabet{2}
.
Returns
Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}
: A tuple containing two kmers.
Mycelia.equally_spaced_samples
— Methodequally_spaced_samples(vector, n) -> Any
Sample n
equally spaced elements from vector
.
Arguments
vector
: Input vector to sample fromn
: Number of samples to return (must be positive)
Returns
A vector containing n
equally spaced elements from the input vector.
Mycelia.equivalent_fasta_sequences
— Methodequivalent_fasta_sequences(fasta_1, fasta_2) -> Bool
Compare two FASTA files to determine if they contain the same set of sequences, regardless of sequence order.
Arguments
fasta_1::String
: Path to first FASTA filefasta_2::String
: Path to second FASTA file
Returns
Bool
:true
if both files contain exactly the same sequences,false
otherwise
Details
Performs a set-based comparison of DNA sequences by hashing each sequence. Sequence order differences between files do not affect the result.
Mycelia.error_rate_to_q_value
— Methoderror_rate_to_q_value(error_rate) -> Any
Convert a sequencing error probability to a Phred quality score (Q-value).
The calculation uses the standard Phred formula: Q = -10 * log₁₀(error_rate)
Arguments
error_rate::Float64
: Probability of error (between 0 and 1)
Returns
q_value::Float64
: Phred quality score
Mycelia.export_blast_db
— Methodexport_blast_db(; path_to_db, fasta)
Export sequences from a BLAST database to a gzipped FASTA file.
Arguments
path_to_db
: Path to the BLAST databasefasta
: Output path for the gzipped FASTA file (default:path_to_db * ".fna.gz"
)
Details
Uses conda's BLAST environment to extract sequences using blastdbcmd
. The output is automatically compressed using pigz
. If the output file already exists, the function will skip extraction.
Mycelia.export_blast_db_taxonomy_table
— Methodexport_blast_db_taxonomy_table(; path_to_db, outfile)
Exports a taxonomy mapping table from a BLAST database in seqid2taxid format.
Arguments
path_to_db::String
: Path to the BLAST databaseoutfile::String
: Output file path (defaults to input path + ".seqid2taxid.txt.gz")
Returns
String
: Path to the created output file
Details
Creates a compressed tab-delimited file mapping sequence IDs to taxonomy IDs. Uses blastdbcmd without GI identifiers for better cross-referencing compatibility. If the output file already exists, returns the path without regenerating.
Dependencies
Requires BLAST+ tools installed via Bioconda.
Mycelia.extract_pacbiosample_information
— Methodextract_pacbiosample_information(
xml
) -> DataFrames.DataFrame
Extract biosample and barcode information from a PacBio XML metadata file.
Arguments
xml
: Path to PacBio XML metadata file
Returns
DataFrame with two columns:
BioSampleName
: Name of the biological sampleBarcodeName
: Associated DNA barcode identifier
Mycelia.fasta_and_gff_to_genbank
— Methodfasta_and_gff_to_genbank(; fasta, gff, genbank)
Convert FASTA sequence and GFF annotation files to GenBank format using EMBOSS seqret.
Arguments
fasta::String
: Path to input FASTA file containing sequence datagff::String
: Path to input GFF file containing genomic featuresgenbank::String
: Path for output GenBank file
Details
Requires EMBOSS toolkit (installed via Bioconda). The function will:
- Create necessary output directories
- Run seqret to combine sequence and features
- Generate a GenBank format file at the specified location
Mycelia.fasta_genome_size
— Methodfasta_genome_size(fasta_file) -> Any
Calculate the total size (in bases) of all sequences in a FASTA file.
Arguments
fasta_file::AbstractString
: Path to the FASTA file
Returns
Int
: Sum of lengths of all sequences in the FASTA file
Mycelia.fasta_list_to_dense_counts_table
— Methodfasta_list_to_dense_counts_table(; fasta_list, k, alphabet)
Create a dense kmer counts table (canonical for DNA, stranded for RNA & AA) for each fasta provided in a list. Scales very well for large numbers of organisms/fasta files, but not for k. Recommended for k <= 13, although 17 may still be possible
Generate a dense k-mer frequency matrix from multiple FASTA files.
Arguments
fasta_list
: Vector of paths to FASTA filesk
: Length of k-mers to count (recommended k ≤ 13)alphabet
: Symbol specifying sequence type (:DNA
,:RNA
, or:AA
)
Returns
Named tuple containing:
sorted_kmers
: Vector of sorted k-merskmer_counts_matrix
: Dense matrix where rows are k-mers and columns are sequences
Details
- For DNA: Uses canonical k-mers (strand-neutral)
- For RNA/AA: Uses stranded k-mers
- Parallelized using Julia's multi-threading
Performance
- Efficient for large numbers of sequences
- Memory usage grows exponentially with k
- For k > 13, use
fasta_list_to_sparse_counts_table
instead
Mycelia.fasta_table_to_fasta
— Methodfasta_table_to_fasta(fasta_df) -> Any
Convert a DataFrame containing FASTA sequence information into a vector of FASTA records.
Arguments
fasta_df::DataFrame
: DataFrame with columns "identifier", "description", and "sequence"
Returns
Vector{FASTX.FASTA.Record}
: Vector of FASTA records
Mycelia.fasta_to_reference_kmer_counts
— Methodfasta_to_reference_kmer_counts(; kmer_type, fasta)
Counts k-mer occurrences in a FASTA file, considering both forward and reverse complement sequences.
Arguments
kmer_type
: Type specification for k-mers (e.g.,DNAKmer{21}
)fasta
: Path to FASTA file containing reference sequences
Returns
Dict{kmer_type, Int}
: Dictionary mapping each k-mer to its total count across all sequences
Mycelia.fasta_to_table
— Methodfasta_to_table(fasta) -> DataFrames.DataFrame
Convert a FASTA file/record iterator to a DataFrame.
Arguments
fasta
: FASTA record iterator from FASTX.jl
Returns
DataFrame
with columns:identifier
: Sequence identifiersdescription
: Full sequence descriptionssequence
: Biological sequences as strings
Mycelia.fasta_xam_mapping_stats
— Methodfasta_xam_mapping_stats(; fasta, xam)
Calculate mapping statistics by comparing sequence alignments (BAM/SAM) to a reference FASTA.
Arguments
fasta::String
: Path to reference FASTA filexam::String
: Path to alignment file (BAM or SAM format)
Returns
DataFrame with columns:
contig
: Reference sequence namecontig_length
: Length of reference sequencetotal_aligned_bases
: Total number of bases aligned to referencemean_depth
: Average depth of coverage (totalalignedbases/contig_length)
Mycelia.fastani_list
— Methodfastani_list(
;
query_list,
reference_list,
outfile,
threads,
force
) -> Union{Nothing, Base.Process}
Run fastani with a query and reference list
Calculate Average Nucleotide Identity (ANI) between genome sequences using FastANI.
Arguments
query_list::String
: Path to file containing list of query genome paths (one per line)reference_list::String
: Path to file containing list of reference genome paths (one per line)outfile::String
: Path to output file that will contain ANI resultsthreads::Int=Sys.CPU_THREADS
: Number of parallel threads to useforce::Bool=false
: If true, rerun analysis even if output file exists
Output
Generates a tab-delimited file with columns:
- Query genome
- Reference genome
- ANI value (%)
- Count of bidirectional fragment mappings
- Total query fragments
Notes
- Requires FastANI to be available via Bioconda
- Automatically sets up required conda environment
Mycelia.fasterq_dump
— Methodfasterq_dump(
;
outdir,
srr_identifier
) -> NamedTuple{(:forward_reads, :reverse_reads, :unpaired_reads), <:Tuple{Union{Missing, String}, Union{Missing, String}, Union{Missing, String}}}
Download and compress sequencing reads from the SRA database using fasterq-dump.
Arguments
outdir::String=""
: Output directory for the FASTQ files. Defaults to current directory.srr_identifier::String=""
: SRA run accession number (e.g., "SRR12345678")
Returns
Named tuple containing paths to the generated files:
forward_reads
: Path to forward reads file (*_1.fastq.gz) ormissing
reverse_reads
: Path to reverse reads file (*_2.fastq.gz) ormissing
unpaired_reads
: Path to unpaired reads file (*.fastq.gz) ormissing
Outputs
Creates compressed FASTQ files in the output directory:
{srr_identifier}_1.fastq.gz
: Forward reads (for paired-end data){srr_identifier}_2.fastq.gz
: Reverse reads (for paired-end data){srr_identifier}.fastq.gz
: Unpaired reads (for single-end data)
Dependencies
Requires:
fasterq-dump
from the SRA Toolkit (installed via Conda)gzip
for compression
Notes
- Skips download if output files already exist
- Uses up to 4 threads or system maximum, whichever is lower
- Allocates 1GB memory for processing
- Skips technical reads
- Handles both paired-end and single-end data automatically
Mycelia.fastq_record
— Methodfastq_record(; identifier, sequence, quality_scores)
Construct a FASTX FASTQ record from its components.
Arguments
identifier::String
: The sequence identifier without the '@' prefixsequence::String
: The nucleotide sequencequality_scores::Vector{Int}
: Quality scores (0-93) as raw integers
Returns
FASTX.FASTQRecord
: A parsed FASTQ record
Notes
- Quality scores are automatically capped at 93 to ensure FASTQ compatibility
- Quality scores are converted to ASCII by adding 33 (Phred+33 encoding)
- The record is constructed in standard FASTQ format with four lines:
- Header line (@ + identifier)
- Sequence
- Plus line
- Quality scores (ASCII encoded)
Mycelia.fastx2normalized_table
— Methodfastx2normalized_table(fastx) -> DataFrames.DataFrame
Convert a FASTX (FASTA/FASTQ) file into a normalized tab-separated table format with standardized sequence identifiers.
Arguments
fastx_file::String
: Path to input FASTX fileoutfile::String
: Path to output compressed TSV file (defaults to input filename + ".tsv.gz")force::Bool=false
: If true, overwrites existing output file
Returns
String
: Path to the created output file
Output Format
Creates a gzipped TSV file with the following columns:
- fasta_identifier: Original FASTA filename
- sequence_sha256: SHA256 hash of the sequence
- sequence_identifier: Original sequence ID from FASTA
- sequence_description: Full sequence description from FASTA
- sequence: The actual sequence
Mycelia.fastx_stats
— Methodfastx_stats(fastx) -> DataFrames.DataFrame
Calculate basic statistics for FASTQ/FASTA sequence files using seqkit.
Arguments
fastq::String
: Path to input FASTQ/FASTA file
Details
Automatically installs and uses seqkit from Bioconda to compute sequence statistics including number of sequences, total bases, GC content, average length, etc.
Dependencies
- Requires Conda and Bioconda channel
- Installs seqkit package if not present
Returns
Returns a DataFrame of the table
https://bioinf.shenwei.me/seqkit/usage/#stats
Mycelia.fastx_to_contig_lengths
— Methodfastx_to_contig_lengths(
fastx
) -> OrderedCollections.OrderedDict
Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.
Arguments
xam
: Path to XAM file or XAM object
Returns
A DataFrame with per-contig statistics including:
n_aligned_reads
: Number of aligned readstotal_aligned_bases
: Sum of alignment lengthstotal_alignment_score
: Sum of alignment scores- Mapping quality statistics (mean, std, median)
- Alignment length statistics (mean, std, median)
- Alignment score statistics (mean, std, median)
- Percent mismatches statistics (mean, std, median)
Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.
Mycelia.fastx_to_kmer_graph
— Methodfastx_to_kmer_graph(
KMER_TYPE,
fastx::AbstractString
) -> MetaGraphs.MetaGraph
Constructs a k-mer graph from a single FASTX format string.
Arguments
KMER_TYPE
: The k-mer type specification (e.g., DNAKmer{K} where K is k-mer length)fastx::AbstractString
: Input sequence in FASTX format (FASTA or FASTQ)
Returns
KmerGraph
: A directed graph where vertices are k-mers and edges represent overlaps
Mycelia.fastx_to_kmer_graph
— Methodfastx_to_kmer_graph(
KMER_TYPE,
fastxs::AbstractVector{<:AbstractString}
) -> MetaGraphs.MetaGraph
Create an in-memory kmer-graph that records:
- all kmers
- counts
- all observed edges between kmers
- edge orientations
- edge counts
Construct a kmer-graph from one or more FASTX files (FASTA/FASTQ).
Arguments
KMER_TYPE
: Type for kmer representation (e.g.,DNAKmer{K}
)fastxs
: Vector of paths to FASTX files
Returns
A MetaGraph
where:
- Vertices represent unique kmers with properties:
:kmer
=> The kmer sequence:count
=> Number of occurrences
- Edges represent observed kmer adjacencies with properties:
:orientation
=> Relative orientation of connected kmers:count
=> Number of observed transitions
Mycelia.fibonacci_numbers_less_than
— Methodfibonacci_numbers_less_than(
n::Int64
) -> Union{Vector{Any}, Vector{Int64}}
Generate a sequence of Fibonacci numbers strictly less than the input value.
Arguments
n::Int
: Upper bound (exclusive) for the Fibonacci sequence
Returns
Vector{Int}
: Array containing Fibonacci numbers less than n
Mycelia.filesize_human_readable
— Methodfilesize_human_readable(f) -> Any
Gets the size of a file and returns it in a human-readable format.
Arguments
f
: The path to the file, either as aString
or anAbstractString
.
Returns
A string representing the file size in a human-readable format (e.g., "3.40 MB").
Details
This function internally uses filesize(f)
to get the file size in bytes, then leverages Base.format_bytes
to convert it into a human-readable format with appropriate units (KB, MB, GB, etc.).
Examples
julia> filesize_human_readable("my_image.jpg")
"2.15 MB"
See Also
- filesize: Gets the size of a file in bytes.
- Base.format_bytes: Converts a byte count into a human-readable string.
Mycelia.find_matching_prefix
— Methodfind_matching_prefix(
filename1::String,
filename2::String;
strip_trailing_delimiters
) -> String
Find the longest common prefix between two filenames.
Arguments
filename1::String
: First filename to comparefilename2::String
: Second filename to compare
Keywords
strip_trailing_delimiters::Bool=true
: If true, removes trailing dots, hyphens, and underscores from the result
Returns
String
: The longest common prefix found between the filenames
Mycelia.find_nonempty_columns
— Methodfind_nonempty_columns(df) -> Any
Identify all columns that have only missing or empty values
Returns as a bit array
See also: dropemptycolumns, dropemptycolumns!
Mycelia.find_resampling_stretches
— Methodfind_resampling_stretches(
;
record_kmer_solidity,
solid_branching_kmer_indices
)
Identifies sequence regions that require resampling based on kmer solidity patterns.
Arguments
record_kmer_solidity::BitVector
: Boolean array wheretrue
indicates solid kmerssolid_branching_kmer_indices::Vector{Int}
: Indices of solid branching kmers
Returns
Vector{UnitRange{Int64}}
: Array of ranges (start:stop) indicating stretches that need resampling
Details
Finds continuous stretches of non-solid kmers and extends them to the nearest solid branching kmers on either side. These stretches represent regions that need resampling.
If a stretch doesn't have solid branching kmers on both sides, it is excluded from the result. Duplicate ranges are removed from the final output.
Mycelia.find_true_ranges
— Methodfind_true_ranges(
bool_vec::AbstractVector{Bool};
min_length
) -> Vector
Finds contiguous ranges of true
values in a boolean vector.
Arguments
bool_vec::AbstractVector{Bool}
: Input boolean vector to analyzemin_length=1
: Minimum length requirement for a range to be included
Returns
Vector of tuples (start, end)
where each tuple represents the indices of a contiguous range of true
values meeting the minimum length requirement.
Mycelia.first_of_each_group
— Methodfirst_of_each_group(
gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any
Return a DataFrame containing the first row from each group in a GroupedDataFrame.
Arguments
gdf::GroupedDataFrame
: A grouped DataFrame created usinggroupby
Returns
DataFrame
: A new DataFrame containing first row from each group
Mycelia.fit_optimal_number_of_clusters
— Functionfit_optimal_number_of_clusters(
distance_matrix
) -> @NamedTuple{optimal_number_of_clusters::Int64, ks_assessed::Vector{Int64}, within_cluster_sum_of_squares::Vector{Float64}, silhouette_scores::Vector{Float64}}
fit_optimal_number_of_clusters(
distance_matrix,
ks_to_try
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :within_cluster_sum_of_squares, :silhouette_scores), <:Tuple{Any, Any, Vector{Float64}, Vector{Float64}}}
Determines the optimal number of clusters for k-means clustering by maximizing the silhouette score.
Algorithm
- Starts by evaluating the first 5 k values
- Continues evaluation if optimal k is at the edge of evaluated range
- Refines search by evaluating midpoints between k values around the current optimum
- Iterates until convergence (optimal k remains stable)
Arguments
distance_matrix::Matrix
: Square matrix of pairwise distances between pointsks_to_try::Vector{Int}
: Vector of k values to evaluate. Defaults to [1, 2, ...] up to matrix size
Returns
Named tuple containing:
optimal_number_of_clusters::Int
: The k value giving highest silhouette scoreks_assessed::Vector{Int}
: All k values that were evaluatedwithin_cluster_sum_of_squares::Vector{Float64}
: WCSS for each k assessedsilhouette_scores::Vector{Float64}
: Silhouette scores for each k assessed
Mycelia.fit_optimal_number_of_clusters_hclust
— Functionfit_optimal_number_of_clusters_hclust(
distance_matrix
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :silhouette_scores, :hclust_result), <:Tuple{Int64, Vector{Int64}, Vector{Float64}, Clustering.Hclust}}
fit_optimal_number_of_clusters_hclust(
distance_matrix,
ks_to_try
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :silhouette_scores, :hclust_result), <:Tuple{Any, Any, Vector{Float64}, Clustering.Hclust}}
Determine the optimal number of clusters using hierarchical clustering with iterative refinement.
Arguments
distance_matrix::Matrix
: Square matrix of pairwise distances between observationsks_to_try::Vector{Int}
: Vector of cluster counts to evaluate (default: 1, 2, and sequence fromMycelia.ks()
)
Returns
Named tuple containing:
optimal_number_of_clusters::Int
: Best number of clusters foundks_assessed::Vector{Int}
: All cluster counts that were evaluatedsilhouette_scores::Vector{Float64}
: Silhouette scores for each k assessedhclust_result
: Hierarchical clustering result object
Details
Uses Ward's linkage method and silhouette scores to evaluate cluster quality. Implements an adaptive search that focuses on promising regions between initially tested k values. For k=1, silhouette score is set to 0 as a special case.
Mycelia.frequency_matrix_to_cosine_distance_matrix
— Methodfrequency_matrix_to_cosine_distance_matrix(
probability_matrix
) -> Any
Create cosine distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to cosine similarity (relative frequency)
Compute pairwise cosine distances between entities based on their feature distributions.
Arguments
probability_matrix
: Column-major matrix where rows represent features and columns represent entities. Each column should contain frequency/probability values for one entity.
Returns
- Symmetric matrix of size (nentities × nentities) containing pairwise cosine distances. Distance values range from 0 (identical distributions) to 1 (orthogonal distributions).
Details
- Computes upper triangle and mirrors to lower triangle for efficiency
- Uses
Distances.cosine_dist
for the core computation - Time complexity is O(n²) where n is the number of entities
Mycelia.frequency_matrix_to_euclidean_distance_matrix
— Methodfrequency_matrix_to_euclidean_distance_matrix(
counts_table
) -> Any
Create a Euclidean distance matrix from a column-major counts matrix (features as rows and entities as columns), where distance is proportional to total feature count magnitude (size).
Compute pairwise Euclidean distances between entity profiles in a counts matrix.
Arguments
counts_table
: A matrix where rows represent features and columns represent entities (column-major format). Each column contains the feature counts/frequencies for one entity.
Returns
- A symmetric N×N matrix of Euclidean distances between each pair of entities, where N is the number of entities.
Details
- Parallelized computation using multi-threading
- Progress tracking via ProgressMeter
- Memory efficient: only upper triangle is computed, then mirrored
- Distance between entities increases with total feature magnitude differences
Mycelia.genbank_to_codon_frequencies
— Methodgenbank_to_codon_frequencies(
genbank;
allow_all
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Int64}}
Analyze codon usage frequencies from genes in a GenBank file.
Arguments
genbank
: Path to GenBank format file containing genomic sequences and annotationsallow_all
: If true, initializes frequencies for all possible codons with count=1 (default: true)
Returns
Nested dictionary mapping amino acids to their corresponding codon usage counts:
- Outer key: AminoAcid (including stop codon)
- Inner key: DNACodon
- Value: Count of codon occurrences
Details
- Only processes genes marked as ':misc_feature' in the GenBank file
- Analyzes both forward and reverse complement sequences
- Determines coding strand based on presence of stop codons and start codons
- Skips ambiguous sequences that cannot be confidently oriented
Mycelia.genbank_to_fasta
— Methodgenbank_to_fasta(; genbank, fasta, force)
Convert a GenBank format file to FASTA format using EMBOSS seqret.
Arguments
genbank
: Path to input GenBank format filefasta
: Optional output FASTA file path (defaults to input path with .fna extension)force
: If true, overwrites existing output file (defaults to false)
Returns
Path to the output FASTA file
Notes
- Requires EMBOSS suite (installed automatically via Conda)
- Will not regenerate output if it already exists unless force=true
Mycelia.generate_all_possible_canonical_kmers
— Methodgenerate_all_possible_canonical_kmers(k, alphabet) -> Any
Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)
Generate all possible canonical k-mers of length k
from the given alphabet
.
For DNA/RNA sequences, returns unique canonical k-mers where each k-mer is represented by the lexicographically smaller of itself and its reverse complement. For amino acid sequences, returns all possible k-mers without canonicalization.
Arguments
k
: Length of k-mers to generatealphabet
: Vector of BioSymbols (DNA, RNA or AminoAcid)
Returns
- Vector of k-mers, canonicalized for DNA/RNA alphabets
Mycelia.generate_all_possible_kmers
— Methodgenerate_all_possible_kmers(k, alphabet) -> Any
Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)
Generate a sorted list of all possible k-mers for a given alphabet.
Arguments
k::Integer
: Length of k-mers to generatealphabet
: Collection of symbols (DNA, RNA, or amino acids) from BioSymbols
Returns
- Sorted Vector of Kmers of the appropriate type (DNA, RNA, or amino acid)
Mycelia.generate_transterm_coordinates_from_fasta
— Methodgenerate_transterm_coordinates_from_fasta(fasta) -> Any
Generate minimal coordinate files required for TransTermHP analysis from FASTA sequences.
Creates artificial gene annotations at sequence boundaries to enable TransTermHP to run without real gene annotations. For each sequence in the FASTA file, generates two single-base-pair "genes" at positions 1-2 and (L-1)-L, where L is sequence length.
Arguments
fasta
: Path to input FASTA file containing sequences to analyze
Returns
- Path to generated coordinate file (original path with ".coords" extension)
Format
Generated coordinate file follows TransTermHP format: gene_id start stop chromosome
where chromosome matches FASTA sequence identifiers.
See also: run_transterm
Mycelia.generate_transterm_coordinates_from_gff
— Methodgenerate_transterm_coordinates_from_gff(gff_file) -> Any
Convert a GFF file to a coordinates file compatible with TransTermHP format.
Arguments
gff_file::String
: Path to input GFF file
Processing
- Converts 1-based to 0-based coordinates
- Extracts gene IDs from the attributes field
- Retains columns: gene_id, start, end, seqid
Returns
- Path to the generated coordinates file (original filename with '.coords' suffix)
Output Format
Space-delimited file with columns: gene_id, start, end, seqid
Mycelia.get_base_extension
— Methodget_base_extension(filename::String) -> String
Extract the base file extension from a filename, handling compressed files.
For regular files, returns the last extension. For gzipped files, returns the extension before .gz.
Mycelia.get_correct_quality
— Methodget_correct_quality(tech::Symbol, pos::Int, read_length::Int) -> Int
Simulates a Phred quality score (using the Sanger convention) for a correctly observed base. For Illumina, the quality score is modeled to decay linearly from ~40 at the start to ~20 at the end of the read. For other technologies, the score is sampled from a normal distribution with parameters typical for that platform.
Returns an integer quality score.
Mycelia.get_error_quality
— Methodget_error_quality(tech::Symbol) -> Int
Simulates a Phred quality score (using the Sanger convention) for a base observed with an error. Error bases are assigned lower quality scores than correctly observed bases. For Illumina, scores typically range between 5 and 15; for nanopore and pacbio, slightly lower values are used; and for ultima, a modest quality score is assigned.
Returns an integer quality score.
Mycelia.get_genbank
— Methodget_genbank(
;
db,
accession,
ftp
) -> Union{Nothing, GenomicAnnotations.GenBank.Reader}
Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site
Retrieve GenBank records from NCBI or directly from an FTP site.
Arguments
db::String
: NCBI database to query ("nuccore" for nucleotide or "protein" for protein sequences)accession::String
: NCBI accession number for the sequenceftp::String
: Direct FTP URL to a GenBank file (gzipped)
Returns
GenomicAnnotations.GenBank.Reader
: A reader object containing the GenBank record
Details
When using NCBI queries (db
and accession
), the function implements rate limiting (0.5s sleep) to comply with NCBI's API restrictions of max 2 requests per second.
Mycelia.get_gff
— Methodget_gff(; db, accession, ftp) -> Any
Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site
Retrieve GFF3 formatted genomic feature data from NCBI or direct FTP source.
Arguments
db::String
: NCBI database to query ("nuccore" for DNA or "protein" for protein sequences)accession::String
: NCBI accession numberftp::String
: Direct FTP URL to GFF3 file (typically gzipped)
Returns
IO
: IOBuffer containing uncompressed GFF3 data
Mycelia.get_kmer_index
— Methodget_kmer_index(kmers, kmer) -> Any
Returns the index position of a given k-mer in a sorted list of k-mers.
Arguments
kmers
: A sorted vector of k-mers to search withinkmer
: The k-mer sequence to find
Returns
Integer index position where kmer
is found in kmers
Throws
AssertionError
: If the k-mer is not found in the list
Mycelia.get_sequence
— Methodget_sequence(
;
db,
accession,
ftp
) -> Union{Nothing, FASTX.FASTA.Reader}
Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site
Retrieve FASTA format sequences from NCBI databases or direct FTP URLs.
Arguments
db::String
: NCBI database type ("nuccore" for DNA or "protein" for protein sequences)accession::String
: NCBI sequence accession numberftp::String
: Direct FTP URL to a FASTA file (alternative to db/accession pair)
Returns
FASTX.FASTA.Reader
: Reader object containing the requested sequence(s)
Mycelia.gfa_to_fasta
— Methodgfa_to_fasta(; gfa, fasta)
Convert a GFA (Graphical Fragment Assembly) file to FASTA format.
Arguments
gfa::String
: Path to input GFA filefasta::String=gfa * ".fna"
: Path for output FASTA file. Defaults to input filename with ".fna" extension
Returns
String
: Path to the generated FASTA file
Details
Uses gfatools (via Conda) to perform the conversion. The function will:
- Ensure gfatools is available in the Conda environment
- Execute the conversion using gfatools gfa2fa
- Write sequences to the specified FASTA file
Mycelia.gfa_to_structure_table
— Methodgfa_to_structure_table(
gfa
) -> NamedTuple{(:contig_table, :records), <:Tuple{DataFrames.DataFrame, Any}}
Convert a GFA (Graphical Fragment Assembly) file into a structured representation.
Arguments
gfa
: Path to GFA file or GFA content as string
Returns
Named tuple containing:
contig_table
: DataFrame with columns:connected_component
: Integer ID for each componentcontigs
: Comma-separated list of contig IDsis_circular
: Boolean indicating if component forms a cycleis_closed
: Boolean indicating if single contig forms a cyclelengths
: Comma-separated list of contig lengths
records
: FASTA records from the GFA
Mycelia.githash
— Methodgithash(; short) -> SubString{String}
Returns the current git commit hash of the repository.
Arguments
short::Bool=false
: If true, returns abbreviated 8-character hash
Returns
A string containing the git commit hash (full 40 characters by default)
Mycelia.graph_to_gfa
— Methodgraph_to_gfa(; graph, outfile)
Convert a Mycelia graph to GFA (Graphical Fragment Assembly) format.
Writes a graph to GFA format, including:
- Header (H) line with GFA version
- Segment (S) lines for each vertex with sequence and depth
- Link (L) lines for edges with overlap size and orientations
Arguments
graph
: MetaGraph containing sequence vertices and their relationshipsoutfile
: Path where the GFA file should be written
Returns
- Path to the written GFA file
Mycelia.hclust_to_metagraph
— Methodhclust_to_metagraph(
hcl::Clustering.Hclust
) -> MetaGraphs.MetaDiGraph{Int64, Float64}
Convert a hierarchical clustering tree into a directed metagraph representation.
Arguments
hcl::Clustering.Hclust
: Hierarchical clustering result object
Returns
MetaGraphs.MetaDiGraph
: Directed graph with metadata representing the clustering hierarchy
Graph Properties
The resulting graph contains the following vertex properties:
:hclust_id
: String identifier for each node:height
: Height/distance at each merge point (0.0 for leaves):x
: Horizontal position for visualization (0-1 range):y
: Vertical position based on normalized height:hcl
: Original clustering object (stored as graph property)
Mycelia.heirarchically_cluster_distance_matrix
— Methodheirarchically_cluster_distance_matrix(
distance_matrix
) -> Clustering.Hclust
Performs hierarchical clustering on a distance matrix using Ward's linkage method.
Arguments
distance_matrix::Matrix{<:Real}
: A symmetric distance/dissimilarity matrix
Returns
HierarchicalCluster
: A hierarchical clustering object from Clustering.jl
Details
Uses Ward's method (minimum variance) for clustering, which:
- Minimizes total within-cluster variance
- Produces compact, spherical clusters
- Works well for visualization in radial layouts
Mycelia.identify_optimal_number_of_clusters
— Methodidentify_optimal_number_of_clusters(
distance_matrix
) -> Tuple{Clustering.Hclust, Int64}
Determine the optimal number of clusters for a dataset using hierarchical clustering and silhouette score analysis.
Arguments
distance_matrix
: A symmetric matrix of pairwise distances between data points
Returns
A tuple containing:
hcl
: The fitted hierarchical clustering objectoptimal_number_of_clusters
: Integer indicating the optimal number of clusters
Details
The function:
- Performs hierarchical clustering on the distance matrix
- Tests cluster counts from 2 to √n (where n is dataset size)
- Evaluates each clustering using silhouette scores
- Generates a diagnostic plot showing silhouette scores vs cluster counts
- Selects the number of clusters that minimizes the silhouette score
The diagnostic plot is displayed automatically during execution.
See Also
heirarchically_cluster_distance_matrix
Clustering.silhouettes
Mycelia.install_hashdeep
— Methodinstall_hashdeep() -> Union{Nothing, Base.Process}
Ensures the hashdeep utility is installed on the system.
Checks if hashdeep is available in PATH and attempts to install it via apt package manager if not found. Will try with sudo privileges first, then without sudo if that fails.
Details
- Checks PATH for existing hashdeep executable
- Attempts installation using apt package manager
- Requires a Debian-based Linux distribution
Returns
- Nothing, but prints status messages during execution
Mycelia.is_equivalent
— Methodis_equivalent(a, b) -> Any
Check if two biological sequences are equivalent, considering both direct and reverse complement matches.
Arguments
a
: First biological sequence (BioSequence or compatible type)b
: Second biological sequence (BioSequence or compatible type)
Returns
Bool
:true
if sequences are identical or if one is the reverse complement of the other,false
otherwise
Mycelia.isolate_normalized_primary_contig
— Methodisolate_normalized_primary_contig(
assembled_fasta,
assembled_gfa,
qualimap_report_txt,
identifier,
k::Int64;
primary_contig_fasta
) -> String
Primary contig is defined as the contig with the most bases mapped to it
In the context of picking out phage from metagenomic assemblies the longest contig is often bacteria whereas the highest coverage contigs are often primer-dimers or other PCR amplification artifacts.
Taking the contig that has the most bases mapped to it as a product of length * depth is cherry picked as our phage
Isolates and exports the primary contig from an assembly based on coverage depth × length.
The primary contig is defined as the contig with the highest total mapped bases (coverage depth × length). This method helps identify potential phage contigs in metagenomic assemblies, avoiding both long bacterial contigs and short high-coverage PCR artifacts.
Arguments
assembled_fasta
: Path to the assembled contigs in FASTA formatassembled_gfa
: Path to the assembly graph in GFA formatqualimap_report_txt
: Path to Qualimap coverage reportidentifier
: String identifier for the output filek
: Integer representing k-mer size used in assemblyprimary_contig_fasta
: Optional output filepath (default: "{identifier}.primary_contig.fna")
Returns
- Path to the output FASTA file containing the primary contig
Notes
- For circular contigs, removes the k-mer closure scar if detected
- Trims k bases from the end if they match the first k bases
- Uses coverage × length to avoid both long bacterial contigs and short PCR artifacts
Mycelia.iterative_polishing
— Functioniterative_polishing(
fastq
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
fastq,
max_k
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
fastq,
max_k,
plot
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
Performs iterative error correction on FASTQ sequences using progressively larger k-mer sizes.
Starting with the default k-mer size, this function repeatedly applies polishing steps, incrementing the k-mer size until either reaching max_k or encountering instability.
Arguments
fastq
: Path to input FASTQ file or FastqRecord objectmax_k
: Maximum k-mer size to attempt (default: 89)plot
: Whether to generate diagnostic plots (default: false)
Returns
Vector of polishing results, where each element contains:
- k: k-mer size used
- fastq: resulting polished sequences
Mycelia.jaccard_distance
— Methodjaccard_distance(set1, set2) -> Any
Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.
The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$
Arguments
set1
: First set to compareset2
: Second set to compare
Returns
Float64
: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
Mycelia.jaccard_similarity
— Methodjaccard_similarity(set1, set2) -> Any
Compute the Jaccard similarity coefficient between two sets.
The Jaccard similarity is defined as the size of the intersection divided by the size of the union of two sets:
J(A,B) = |A ∩ B| / |A ∪ B|
Arguments
set1
: First set for comparisonset2
: Second set for comparison
Returns
Float64
: A value between 0.0 and 1.0, where:- 1.0 indicates identical sets
- 0.0 indicates completely disjoint sets
Mycelia.jellyfish_count
— Methodjellyfish_count(
;
fastx,
k,
threads,
max_mem,
canonical,
outfile,
conda_check
)
Count k-mers in a FASTA/FASTQ file using Jellyfish.
Arguments
fastx::String
: Path to input FASTA/FASTQ file (can be gzipped)k::Integer
: k-mer lengththreads::Integer=Sys.CPU_THREADS
: Number of threads to usemax_mem::Integer=Int(Sys.free_memory())
: Maximum memory in bytes (defaults to system free memory)canonical::Bool=false
: Whether to count canonical k-mers (both strands combined)outfile::String=auto
: Output filename (auto-generated based on input and parameters)conda_check::Bool=true
: Whether to verify Jellyfish conda installation
Returns
String
: Path to gzipped TSV file containing k-mer counts
Mycelia.jellyfish_counts_to_kmer_frequency_histogram
— Functionjellyfish_counts_to_kmer_frequency_histogram(
jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
jellyfish_counts_file,
outfile
) -> Any
Convert a Jellyfish k-mer count file into a frequency histogram.
Arguments
jellyfish_counts_file::String
: Path to the gzipped TSV file containing Jellyfish k-mer countsoutfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv")
: Optional output file path
Returns
String
: Path to the generated histogram file
Description
Processes a Jellyfish k-mer count file to create a frequency histogram where:
- Column 1: Number of k-mers that share the same count
- Column 2: The count they share
Uses system sorting with LC_ALL=C for optimal performance on large files.
Notes
- Requires gzip, sort, uniq, and sed command line tools
- Uses intermediate disk storage for sorting large files
- Skips processing if output file already exists
Mycelia.jitter
— Methodjitter(x, n) -> Any
Add random noise to create a vector of jittered values.
Generates n
values by adding random noise to the input value x
. The noise is uniformly distributed between -1/3 and 1/3.
Arguments
x
: Base value to add jitter ton
: Number of jittered values to generate
Returns
- Vector of length
n
containing jittered values aroundx
Mycelia.joint_base_quality_score
— Methodjoint_base_quality_score(
error_probabilities::Vector{Float64}
) -> Float64
Calculate the quality score for a single base given multiple observations.
This function implements the "Converting to Error Probabilities and Combining" method:
- Takes error probabilities from multiple reads covering the same base
- Calculates probability of ALL reads being wrong by multiplying probabilities
- Calculates final Phred score from this combined probability
To avoid numerical underflow with very small probabilities, the calculation is performed in log space.
Arguments
error_probabilities::Vector{Float64}
: Vector of error probabilities from multiple reads covering the same base position
Returns
Float64
: Phred quality score representing the combined confidence
Mycelia.kmer_counts_dict_to_vector
— Methodkmer_counts_dict_to_vector(
kmer_to_index_map,
kmer_counts
) -> Any
Convert a dictionary of k-mer counts to a fixed-length numeric vector based on a predefined mapping.
Arguments
kmer_to_index_map
: Dictionary mapping k-mer sequences to their corresponding vector indiceskmer_counts
: Dictionary containing k-mer sequences and their occurrence counts
Returns
- A vector where each position corresponds to a k-mer count, with zeros for absent k-mers
Mycelia.kmer_counts_to_cosine_similarity
— Methodkmer_counts_to_cosine_similarity(
kmer_counts_1,
kmer_counts_2
) -> Any
Calculate the cosine similarity between two k-mer count dictionaries.
Arguments
kmer_counts_1::Dict{String,Int}
: First dictionary mapping k-mer sequences to their countskmer_counts_2::Dict{String,Int}
: Second dictionary mapping k-mer sequences to their counts
Returns
Float64
: Cosine distance between the two k-mer count vectors, in range [0,1] where 0 indicates identical distributions and 1 indicates maximum dissimilarity
Details
Converts k-mer count dictionaries into vectors using a unified set of keys, then computes cosine distance. Missing k-mers are treated as count 0. Result is invariant to input order and total counts (normalized internally).
Mycelia.kmer_counts_to_js_divergence
— Methodkmer_counts_to_js_divergence(
kmer_counts_1,
kmer_counts_2
) -> Any
Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.
Arguments
kmer_counts_1
: Dictionary mapping k-mers to their counts in first sequencekmer_counts_2
: Dictionary mapping k-mers to their counts in second sequence
Returns
- Normalized Jensen-Shannon divergence score between 0 and 1, where:
- 0 indicates identical distributions
- 1 indicates maximally different distributions
Notes
- The measure is symmetric: JS(P||Q) = JS(Q||P)
- Counts are automatically normalized to probability distributions
Mycelia.kmer_counts_to_merqury_qv
— Methodkmer_counts_to_merqury_qv(
;
raw_data_counts,
assembly_counts
)
Calculate assembly Quality Value (QV) score using the Merqury method.
Estimates base-level accuracy by comparing k-mer distributions between raw sequencing data and assembly. Higher QV scores indicate better assembly quality.
Arguments
raw_data_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}
: K-mer counts from raw sequencing dataassembly_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}
: K-mer counts from assembly
Returns
Float64
: Quality Value score in Phred scale (-10log₁₀(error rate))
Method
QV is calculated using:
- Ktotal = number of unique kmers in assembly
- Kshared = number of kmers shared between raw data and assembly
- P = (Kshared/Ktotal)^(1/k) = estimated base-level accuracy
- QV = -10log₁₀(1-P)
Reference
Rhie et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies" Genome Biology (2020)
Mycelia.kmer_path_to_sequence
— Methodkmer_path_to_sequence(kmer_path) -> Any
Convert a path of overlapping k-mers into a single DNA sequence.
Arguments
kmer_path
: Vector of k-mers (DNA sequences) where each consecutive pair overlaps by k-1 bases
Returns
BioSequences.LongDNA{2}
: Assembled DNA sequence from the k-mer path
Description
Reconstructs the original DNA sequence by joining k-mers, validating that consecutive k-mers overlap correctly. The first k-mer is used in full, then each subsequent k-mer contributes its last base.
Mycelia.kmer_quality_score
— Functionkmer_quality_score(base_qualities::Vector{Float64}) -> Any
kmer_quality_score(
base_qualities::Vector{Float64},
method::Symbol
) -> Any
Calculate kmer quality score using the specified aggregation method.
Available methods:
:min
: Use the minimum base quality (default):mean
: Use the mean of all base qualities:geometric
: Use the geometric mean (appropriate for probabilities):harmonic
: Use the harmonic mean (emphasizes lower values)
Arguments
base_qualities::Vector{Float64}
: Vector of quality scores for each basemethod::Symbol
: Method to use for aggregation
Returns
Float64
: Overall quality score for the kmer
Mycelia.ks
— Methodks(; min, max) -> Vector{Int64}
Generates a specialized sequence of prime numbers combining:
- Odd primes up to 23 (flip_point)
- Primes nearest to Fibonacci numbers above 23 up to max
Arguments
min::Int=0
: Lower bound for the sequencemax::Int=10_000
: Upper bound for the sequence
Returns
Vector of Int containing the specialized prime sequence
Mycelia.lawrencium_sbatch
— Methodlawrencium_sbatch(
;
job_name,
mail_user,
mail_type,
logdir,
partition,
qos,
account,
nodes,
ntasks,
time,
cpus_per_task,
mem_gb,
cmd
)
Submit a job to SLURM scheduler on Lawrence Berkeley Lab's Lawrencium cluster.
Arguments
job_name
: Name identifier for the SLURM jobmail_user
: Email address for job notificationsmail_type
: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")logdir
: Directory for SLURM output and error logspartition
: Lawrencium compute partitionqos
: Quality of Service levelaccount
: Project account for billingnodes
: Number of nodes to allocatentasks
: Number of tasks to spawntime
: Wall time limit in format "days-hours:minutes:seconds"cpus_per_task
: CPU cores per taskmem_gb
: Memory per node in GBcmd
: Shell command to execute
Returns
true
if submission was successful
Note
Function includes 5-second delays before and after submission for queue stability.
Mycelia.list_blastdbs
— Methodlist_blastdbs(; source) -> DataFrames.DataFrame
Lists available BLAST databases from the specified source.
Mycelia.list_classes
— Methodlist_classes() -> DataFrames.DataFrame
Returns an array of all taxonomic classes in the database.
Classes represent a major taxonomic rank between phylum and order in biological classification.
Returns
Vector{String}
: Array of class names sorted alphabetically
Mycelia.list_databases
— Methodlist_databases(; address, username, password)
Lists all available Neo4j databases on the specified server.
Arguments
address::String
: Neo4j server address (e.g. "neo4j://localhost:7687")username::String="neo4j"
: Neo4j authentication usernamepassword::String
: Neo4j authentication password
Returns
DataFrame
: Contains database information with columns typically including:- name: Database name
- address: Database address
- role: Database role (e.g., primary, secondary)
- status: Current status (e.g., online, offline)
- default: Boolean indicating if it's the default database
Mycelia.list_families
— Methodlist_families() -> DataFrames.DataFrame
Returns a sorted vector of all family names present in the database.
Mycelia.list_full_taxonomy
— Methodlist_full_taxonomy() -> DataFrames.DataFrame
Retrieves and formats the complete NCBI taxonomy hierarchy into a structured DataFrame.
Details
- Automatically sets up taxonkit environment and downloads taxonomy database if needed
- Starts from root taxid (1) and includes all descendant taxa
- Reformats lineage information into separate columns for each taxonomic rank
Returns
DataFrame with columns:
taxid
: Taxonomy identifierlineage
: Full taxonomic lineage stringtaxid_lineage
: Lineage with taxonomy IDs- Individual rank columns:
- superkingdom, kingdom, phylum, class, order, family, genus, species
- corresponding taxid columns (e.g., superkingdom_taxid)
Dependencies
Requires taxonkit (installed automatically via Bioconda)
Mycelia.list_genera
— Methodlist_genera() -> DataFrames.DataFrame
Returns a sorted vector of all genera names present in the database.
Mycelia.list_kingdoms
— Methodlist_kingdoms() -> DataFrames.DataFrame
Lists all taxonomic kingdoms in the database.
Returns a vector of kingdom names as strings. Kingdoms represent the highest major taxonomic rank in biological classification.
Returns
Vector{String}
: Array of kingdom names
Mycelia.list_orders
— Methodlist_orders() -> DataFrames.DataFrame
Lists all orders in the taxonomic database.
Returns a vector of strings containing valid order names according to current mycological taxonomy. Uses the underlying list_rank()
function with rank="order".
Returns
Vector{String}
: Alphabetically sorted list of order names
Mycelia.list_phylums
— Methodlist_phylums() -> DataFrames.DataFrame
Returns a sorted list of all unique phyla in the database.
Mycelia.list_rank
— Methodlist_rank(rank) -> DataFrames.DataFrame
List all taxonomic entries at the specified rank level.
Arguments
rank::String
: Taxonomic rank to query. Must be one of:- "top" (top level)
- "superkingdom"/"domain"
- "kingdom"
- "phylum"
- "class"
- "order"
- "family"
- "genus"
- "species"
Returns
DataFrame with columns:
taxid
: NCBI taxonomy IDname
: Scientific name at the specified rank
Mycelia.list_ranks
— Methodlist_ranks(; synonyms) -> Vector{String}
Return an ordered list of taxonomic ranks from highest (top) to lowest (species).
Arguments
synonyms::Bool=false
: If true, includes alternative names for certain ranks (e.g. "domain" for "superkingdom")
Returns
Vector{String}
: An array of taxonomic rank names in hierarchical order
Mycelia.list_species
— Methodlist_species() -> DataFrames.DataFrame
Returns a sorted vector of all species names present in the database.
Mycelia.list_subtaxa
— Methodlist_subtaxa(taxid) -> Vector{Int64}
Returns an array of Integer taxon IDs representing all sub-taxa under the specified taxonomic ID.
Arguments
taxid
: NCBI taxonomy identifier for the parent taxon
Returns
Vector{Int} containing all descendant taxon IDs
Details
- Requires taxonkit to be installed via Bioconda
- Automatically sets up taxonkit database if not present
- Uses local taxonomy database in ~/.taxonkit/
Mycelia.list_superkingdoms
— Methodlist_superkingdoms() -> DataFrames.DataFrame
Returns an array of all taxonomic superkingdoms (e.g., Bacteria, Archaea, Eukaryota).
Returns
Vector{String}
: Array containing names of all superkingdoms in the taxonomy database
Mycelia.list_toplevel
— Methodlist_toplevel() -> DataFrames.DataFrame
Returns a DataFrame containing the top-level taxonomic nodes.
The DataFrame has two fixed rows representing the most basic taxonomic classifications:
- taxid=0: "unclassified"
- taxid=1: "root"
Returns
DataFrame Columns: - taxid::Int : Taxonomic identifier - name::String : Node name
Mycelia.load_blast_db_taxonomy_table
— Methodload_blast_db_taxonomy_table(
compressed_blast_db_taxonomy_table_file
) -> DataFrames.DataFrame
Loads a BLAST database taxonomy mapping table from a gzipped file into a DataFrame.
Arguments
compressed_blast_db_taxonomy_table_file::String
: Path to a gzipped file containing BLAST taxonomy mappings
Returns
DataFrame
: A DataFrame with columns:sequence_id
and:taxid
containing the sequence-to-taxonomy mappings
Format
Input file should be a space-delimited text file (gzipped) with two columns:
- sequence identifier
- taxonomy identifier (taxid)
Mycelia.load_genbank_metadata
— Methodload_genbank_metadata() -> DataFrames.DataFrame
Load metadata for GenBank sequences into a DataFrame.
This is a convenience wrapper around load_ncbi_metadata("genbank")
that specifically loads metadata from the GenBank database.
Returns
DataFrame
: Contains metadata fields like accession numbers, taxonomy,
and sequence information from GenBank.
Mycelia.load_graph
— Methodload_graph(file) -> Any
Load a graph structure from a serialized file.
Arguments
file::AbstractString
: Path to the file containing the serialized graph data
Returns
- The deserialized graph object
Mycelia.load_graph
— Methodload_graph(file::String) -> Any
Loads a graph object from a serialized file.
Arguments
file::String
: Path to the file containing the serialized graph data. The file should have been created usingsave_graph
.
Returns
- The deserialized graph object stored under the "graph" key.
Mycelia.load_jellyfish_counts
— Methodload_jellyfish_counts(
jellyfish_counts
) -> DataFrames.DataFrame
Load k-mer counts from a Jellyfish output file into a DataFrame.
Arguments
jellyfish_counts::String
: Path to a gzipped TSV file (*.jf.tsv.gz) containing Jellyfish k-mer counts
Returns
DataFrame
: Table with columns:kmer
: Biologically encoded k-mers asDNAKmer{k}
objectscount
: Integer count of each k-mer's occurrences
Notes
- Input file must be a gzipped TSV with exactly two columns (k-mer sequences and counts)
- K-mer length is automatically detected from the first entry
- Filename must end with '.jf.tsv.gz'
Mycelia.load_jld2
— Methodload_jld2(filename) -> Any
Load data stored in a JLD2 file format.
Arguments
filename::String
: Path to the JLD2 file to load
Returns
Dict
: Dictionary containing the loaded data structures
Mycelia.load_matrix_jld2
— Methodload_matrix_jld2(filename) -> Any
Loads a matrix from a JLD2 file.
Arguments
filename::String
: Path to the JLD2 file containing the matrix under the key "matrix"
Returns
Matrix
: The loaded matrix data
Mycelia.load_ncbi_metadata
— Methodload_ncbi_metadata(db) -> DataFrames.DataFrame
Load and parse the assembly summary metadata from NCBI's FTP server for either GenBank or RefSeq databases.
Arguments
db::String
: Database source, must be either "genbank" or "refseq"
Returns
DataFrame
: Parsed metadata table with properly typed columns including:- Integer columns: taxid, species_taxid, genome metrics, and gene counts
- Float columns: gc_percent
- Date columns: seqreldate, annotation_date
- String columns: all other fields
Details
Downloads the assembly summary file from NCBI's FTP server and processes it by:
- Parsing the tab-delimited file with commented headers
- Converting numeric strings to proper Integer/Float types
- Parsing date strings to Date objects
- Handling missing values throughout
Mycelia.load_ncbi_taxonomy
— Methodload_ncbi_taxonomy(
;
path_to_taxdump
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer
Downloads and constructs a MetaDiGraph representation of the NCBI taxonomy database.
Arguments
path_to_taxdump
: Directory path where taxonomy files will be downloaded and extracted
Returns
MetaDiGraph
: A directed graph where:- Vertices represent taxa with properties:
:tax_id
: NCBI taxonomy identifier:scientific_name
,:common_name
, etc.: Name properties:rank
: Taxonomic rank:division_id
,:division_cde
,:division_name
: Division information
- Edges represent parent-child relationships in the taxonomy
- Vertices represent taxa with properties:
Dependencies
Requires internet connection for initial download. Uses DataFrames, MetaGraphs, and ProgressMeter.
Mycelia.load_refseq_metadata
— Methodload_refseq_metadata() -> DataFrames.DataFrame
Loads NCBI RefSeq metadata into a DataFrame. RefSeq is NCBI's curated collection of genomic, transcript and protein sequences.
Returns
DataFrame
: Contains metadata columns including accession numbers, taxonomic information,
and sequence details from RefSeq.
Mycelia.local_blast_database_info
— Methodlocal_blast_database_info(; blastdbs_dir) -> Any
Query information about local BLAST databases and return a formatted summary.
Arguments
blastdbs_dir::String
: Directory containing BLAST databases (default: "~/workspace/blastdb")
Returns
DataFrame
with columns:- BLAST database path
- BLAST database molecule type
- BLAST database title
- date of last update
- number of bases/residues
- number of sequences
- number of bytes
- BLAST database format version
- human readable size
Dependencies
Requires NCBI BLAST+ tools. Will attempt to install via apt-get if not present.
Side Effects
- May install system packages (ncbi-blast+, perl-doc) using sudo/apt-get
- Filters out numbered database fragments from results
Mycelia.merge_colors
— Methodmerge_colors(c1, c2) -> Any
Merge two colors by calculating their minimal color difference vector.
Arguments
c1::Color
: First color inputc2::Color
: Second color input
Returns
- If colors are equal, returns the input color
- Otherwise returns the color difference vector (c1-c2 or c2-c1) with minimal RGB sum
Details
Calculates two difference vectors:
- mix_a = c1 - c2
- mix_b = c2 - c1
Returns the difference vector with the smallest sum of RGB components.
Mycelia.merge_fasta_files
— Methodmerge_fasta_files(; fasta_files, fasta_file)
Join fasta files while adding origin prefixes to the identifiers.
Does not guarantee uniqueness but will warn if conflicts arise
Mycelia.metasha256
— Methodmetasha256(
vector_of_sha256s::Vector{<:AbstractString}
) -> String
Compute a single SHA256 hash from multiple SHA256 hashes.
Takes a vector of hex-encoded SHA256 hashes and produces a new SHA256 hash by:
- Sorting the input hashes lexicographically
- Concatenating them in sorted order
- Computing a new SHA256 hash over the concatenated data
Arguments
vector_of_sha256s
: Vector of hex-encoded SHA256 hash strings
Returns
- A hex-encoded string representing the computed meta-hash
Mycelia.minimap_index
— Methodminimap_index(
;
fasta,
mem_gb,
mapping_type,
threads,
as_string,
denominator
)
Generate a minimap2 index command and output file path for mapping DNA sequencing reads.
Run this on the machine you intend to use to map the reads to confirm the index will fit.
Arguments
fasta
: Path to the reference FASTA file to be indexedmem_gb
: Available system memory in gigabytes for indexingmapping_type
: Sequencing technology preset. One of:- "map-hifi": PacBio HiFi reads
- "map-ont": Oxford Nanopore reads
- "map-pb": PacBio CLR reads
- "sr": Short reads
- "lr:hq": High-quality long reads
threads
: Number of threads to use for indexingas_string
: Return command as String instead of Cmd (default: false)denominator
: Divisor for calculating index size (default: 10)
Returns
Named tuple containing:
cmd
: The minimap2 indexing command (as String or Cmd)outfile
: Path to the output index file (.mmi)
Mycelia.minimap_map
— Methodminimap_map(
;
fasta,
fastq,
mapping_type,
as_string,
mem_gb,
threads,
denominator
)
Generate minimap2 alignment commands for sequence mapping.
aligning and compressing. No sorting or filtering.
Use shell_only=true to get string command to submit to SLURM
Creates a command to align reads in FASTQ format to a reference FASTA using minimap2, followed by SAM compression with pigz. Handles resource allocation and conda environment setup.
Arguments
fasta
: Path to reference FASTA filefastq
: Path to query FASTQ filemapping_type
: Alignment preset ("map-hifi", "map-ont", "map-pb", "sr", or "lr:hq")as_string
: If true, returns shell command as string; if false, returns command arraymem_gb
: Available memory in GB for indexing (defaults to system free memory)threads
: Number of CPU threads to use (defaults to system threads)denominator
: Divisor for calculating minimap2 index size
Returns
Named tuple containing:
cmd
: Shell command (as string or array)outfile
: Path to compressed output SAM file
Mycelia.minimap_map_paired_end
— Methodminimap_map_paired_end(
;
fasta,
forward,
reverse,
mem_gb,
threads,
outdir,
as_string,
mapping_type,
denominator
)
Maps paired-end reads to a reference genome using minimap2 and compresses the output.
Arguments
fasta::String
: Path to reference genome FASTA fileforward::String
: Path to forward reads FASTQ filereverse::String
: Path to reverse reads FASTQ filemem_gb::Integer
: Available system memory in GBthreads::Integer
: Number of threads to useoutdir::String
: Output directory (defaults to forward reads directory)as_string::Bool
: Return command as string instead of Cmd arraymapping_type::String
: Mapping preset, e.g. "sr" for short reads (default)denominator::Float64
: Memory scaling factor for minimap2 index
Returns
Named tuple containing:
cmd
: Command(s) to execute (String or Vector{Cmd})outfile
: Path to compressed output SAM file (*.sam.gz)
Dependencies
Requires bioconda packages: minimap2, samtools, pigz
Mycelia.minimap_map_paired_end_with_index
— Methodminimap_map_paired_end_with_index(
;
fasta,
forward,
reverse,
mem_gb,
threads,
outdir,
as_string,
mapping_type,
denominator
)
Map paired-end reads to a reference sequence using minimap2.
Arguments
fasta::String
: Path to reference FASTA fileforward::String
: Path to forward reads FASTQ filereverse::String
: Path to reverse reads FASTQ filemem_gb::Integer
: Available system memory in GBthreads::Integer
: Number of threads to useoutdir::String
: Output directory (defaults to forward reads directory)as_string::Bool=false
: Return command as string instead of Cmd arraymapping_type::String="sr"
: Minimap2 preset ["map-hifi", "map-ont", "map-pb", "sr", "lr:hq"]denominator::Float64
: Memory scaling factor for index size
Returns
Named tuple containing:
cmd
: Command(s) to execute (String or Array{Cmd})outfile
: Path to compressed output SAM file (*.sam.gz)
Notes
- Requires minimap2, samtools, and pigz conda environments
- Automatically compresses output using pigz
- Index file must exist at
$(fasta).x$(mapping_type).I$(index_size).mmi
Mycelia.minimap_map_with_index
— Methodminimap_map_with_index(
;
fasta,
mem_gb,
mapping_type,
threads,
fastq,
as_string,
denominator
)
Generate minimap2 mapping commands with a pre-built index file.
Arguments
fasta
: Path to the reference FASTA filemem_gb
: Available system memory in gigabytes for index generationmapping_type
: Mapping preset. Must be one of: "map-hifi", "map-ont", "map-pb", "sr", or "lr:hq"threads
: Number of threads to use for mapping and compressionfastq
: Path to input FASTQ fileas_string
: If true, returns command as a string; if false, returns as command arraydenominator
: Divisor for index size calculation (default: 10)
Returns
Named tuple containing:
cmd
: The minimap2 mapping command (string or array)outfile
: Path to the output compressed SAM file
Notes
- Requires pre-built index file with pattern:
${fasta}.x${mapping_type}.I${index_size}.mmi
- Automatically installs required conda environments (minimap2, samtools, pigz)
- Output is automatically compressed with pigz
Mycelia.mmseqs_pairwise_search
— Methodmmseqs_pairwise_search(; fasta, output)
Perform all-vs-all sequence search using MMseqs2's easy-search command.
Arguments
fasta::String
: Path to input FASTA file containing sequences to compareoutput::String
: Output directory path (default: input filename + ".mmseqseasysearch_pairwise")
Returns
String
: Path to the output directory
Details
Executes MMseqs2 with sensitive search parameters (7 sensitivity steps) and outputs results in tabular format with the following columns:
- query, qheader: Query sequence ID and header
- target, theader: Target sequence ID and header
- pident: Percentage sequence identity
- fident: Fraction of identical matches
- nident: Number of identical matches
- alnlen: Alignment length
- mismatch: Number of mismatches
- gapopen: Number of gap openings
- qstart, qend, qlen: Query sequence coordinates and length
- tstart, tend, tlen: Target sequence coordinates and length
- evalue: Expected value
- bits: Bit score
Requires MMseqs2 to be available through Bioconda.
Mycelia.mutate_sequence
— Methodmutate_sequence(reference_sequence) -> Tuple{Any, Any}
Generate a single random mutation in an amino acid sequence.
Arguments
reference_sequence
: Input amino acid sequence to be mutated
Returns
mutant_sequence
: The sequence after applying the mutationhaplotype
: ASequenceVariation.Haplotype
object containing the mutation details
Details
Performs one of three possible mutation types:
- Substitution: Replace one amino acid with another
- Insertion: Insert 1+ random amino acids at a position
- Deletion: Remove 1+ amino acids from a position
Insertion and deletion sizes follow a truncated Poisson distribution (λ=1, min=1).
Mycelia.n_maximally_distinguishable_colors
— Methodn_maximally_distinguishable_colors(n) -> Any
Generate n
colors that are maximally distinguishable from each other.
Arguments
n::Integer
: The number of distinct colors to generate
Returns
A vector of n
RGB colors that are optimized for maximum perceptual distinction, using white (RGB(1,1,1)) and black (RGB(0,0,0)) as anchor colors.
Mycelia.name2taxid
— Methodname2taxid(name) -> DataFrames.DataFrame
Convert scientific name(s) to NCBI taxonomy ID(s) using taxonkit.
Arguments
name::AbstractString
: Scientific name(s) to query. Can be a single name or multiple names separated by newlines.
Returns
DataFrame
with columns:name
: Input scientific nametaxid
: NCBI taxonomy IDrank
: Taxonomic rank (e.g., "species", "genus")
Dependencies
Requires taxonkit package (installed automatically via Bioconda)
Mycelia.names2taxids
— Methodnames2taxids(names::AbstractVector{<:AbstractString}) -> Any
Convert a vector of species/taxon names to their corresponding NCBI taxonomy IDs.
Arguments
names::AbstractVector{<:AbstractString}
: Vector of scientific names or common names
Returns
Vector{Int}
: Vector of NCBI taxonomy IDs corresponding to the input names
Progress is displayed using ProgressMeter.
Mycelia.ncbi_ftp_path_to_url
— Methodncbi_ftp_path_to_url(; ftp_path, extension)
Constructs a complete NCBI FTP URL by combining a base FTP path with a file extension.
Arguments
ftp_path::String
: Base FTP directory path for the resourceextension::String
: File extension to append to the resource name
Returns
String
: Complete FTP URL path to the requested resource
Extensions include:
- genomic.fna.gz
- genomic.gff.gz
- protein.faa.gz
- assembly_report.txt
- assembly_stats.txt
- cdsfromgenomic.fna.gz
- feature_count.txt.gz
- feature_table.txt.gz
- genomic.gbff.gz
- genomic.gtf.gz
- protein.gpff.gz
- translated_cds.faa.gz
Mycelia.ncbi_genome_download_accession
— Methodncbi_genome_download_accession(
;
accession,
outdir,
outpath,
include_string
)
Download an accession using NCBI datasets command line tool
the .zip download output to outpath will be unzipped
returns the outfolder
ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"
Downloads and extracts a genome from NCBI using the datasets command line tool.
Arguments
accession
: NCBI accession number for the genomeoutdir
: Directory where files will be downloaded (defaults to current directory)outpath
: Full path for the temporary zip file (defaults tooutdir/accession.zip
)include_string
: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").
Returns
- Path to the extracted genome data directory
Notes
- Requires the ncbi-datasets-cli conda package (automatically installed if missing)
- Downloaded zip file is automatically removed after extraction
- If output folder already exists, download is skipped
- Data is extracted to
outdir/accession/ncbi_dataset/data/accession
Mycelia.ncbi_taxon_summary
— Methodncbi_taxon_summary(taxa_id) -> DataFrames.DataFrame
Retrieve taxonomic information for a given NCBI taxonomy ID.
Arguments
taxa_id
: NCBI taxonomy identifier (integer)
Returns
DataFrame
: Taxonomy summary containing fields like tax_id, rank, species, etc.
Mycelia.nearest_prime
— Methodnearest_prime(n::Int64) -> Int64
Find the closest prime number to the given integer n
.
Returns the nearest prime number to n
. If two prime numbers are equally distant from n
, returns the smaller one.
Arguments
n::Int
: The input integer to find the nearest prime for
Returns
Int
: The closest prime number ton
Mycelia.nersc_sbatch
— Methodnersc_sbatch(
;
job_name,
mail_user,
mail_type,
logdir,
scriptdir,
qos,
nodes,
ntasks,
time,
cpus_per_task,
mem_gb,
cmd,
constraint
)
Submit a batch job to NERSC's SLURM workload manager.
Arguments
job_name
: Identifier for the SLURM jobmail_user
: Email address for job notificationsmail_type
: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")logdir
: Directory for storing job output/error logsscriptdir
: Directory for storing generated SLURM scriptsqos
: Quality of Service level ("regular", "premium", or "preempt")nodes
: Number of nodes to allocatentasks
: Number of tasks to runtime
: Maximum wall time in format "days-HH:MM:SS"cpus_per_task
: CPU cores per taskmem_gb
: Memory per node in GBcmd
: Command(s) to execute (String or Vector{String})constraint
: Node type constraint ("cpu" or "gpu")
Returns
true
if job submission succeedsfalse
if submission fails
QoS Options
- regular: Standard priority queue
- premium: High priority queue (5x throughput limit)
- preempt: Reduced credit usage but jobs may be interrupted
https://docs.nersc.gov/jobs/policy/ https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes
default is to use shared qos
use
- regular
- preempt (reduced credit usage but not guaranteed to finish)
- premium (priorty runs limited to 5x throughput)
https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks
Mycelia.nersc_sbatch_shared
— Methodnersc_sbatch_shared(
;
job_name,
mail_user,
mail_type,
logdir,
qos,
nodes,
ntasks,
time,
cpus_per_task,
mem_gb,
cmd,
constraint
)
Submit a job to NERSC's SLURM scheduler using the shared QOS (Quality of Service).
Arguments
job_name
: Identifier for the jobmail_user
: Email address for job notificationsmail_type
: Notification type ("ALL", "BEGIN", "END", "FAIL", "REQUEUE", "STAGE_OUT")logdir
: Directory for storing job output and error logsqos
: Quality of Service level ("shared", "regular", "preempt", "premium")nodes
: Number of nodes to allocatentasks
: Number of tasks to runtime
: Maximum wall time in format "days-hours:minutes:seconds"cpus_per_task
: Number of CPUs per taskmem_gb
: Memory per node in GB (default: 2GB per CPU)cmd
: Command to executeconstraint
: Node type constraint ("cpu" or "gpu")
Resource Limits
- Maximum memory per node: 512GB
- Maximum cores per node: 128
- Default memory allocation: 2GB per CPU requested
QOS Options
- shared: Default QOS for shared node usage
- regular: Standard priority
- preempt: Reduced credit usage but preemptible
- premium: 5x throughput priority (limited usage)
Returns
true
if job submission succeeds
https://docs.nersc.gov/jobs/policy/ https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes
default is to use shared qos
use
- regular
- preempt (reduced credit usage but not guaranteed to finish)
- premium (priority runs limited to 5x throughput)
max request is 512Gb memory and 128 cores per node
https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks
Mycelia.node_type_to_dataframe
— Methodnode_type_to_dataframe(; node_type, graph)
Convert all nodes of a specific type in a MetaGraph to a DataFrame representation.
Arguments
node_type
: The type of nodes to extract from the graphgraph
: A MetaGraph containing the nodes
Returns
A DataFrame where:
- Each row represents a node of the specified type
- Columns correspond to all unique properties found across nodes
- Values are JSON-serialized strings for consistency
Notes
- All values are normalized through JSON serialization
- Dictionary values receive double JSON encoding
- The TYPE column is converted using
type_to_string
Mycelia.normalize_codon_frequencies
— Methodnormalize_codon_frequencies(
codon_frequencies
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Float64}}
Normalizes codon frequencies for each amino acid such that frequencies sum to 1.0.
Arguments
codon_frequencies
: Nested dictionary mapping amino acids to their codon frequency distributions
Returns
- Normalized codon frequencies where values for each amino acid sum to 1.0
Mycelia.normalize_countmap
— Methodnormalize_countmap(countmap) -> Dict
Normalize a dictionary of counts into a probability distribution where values sum to 1.0.
Arguments
countmap::Dict
: Dictionary mapping keys to count values
Returns
Dict
: New dictionary with same keys but values normalized by total sum
Mycelia.normalize_distance_matrix
— Methodnormalize_distance_matrix(distance_matrix) -> Any
Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)
Normalize a distance matrix by dividing all elements by the maximum non-NaN value.
Arguments
distance_matrix
: A matrix of distance values that may containNaN
,nothing
, ormissing
values
Returns
- Normalized distance matrix with values scaled to [0, 1] range
Details
- Filters out
NaN
,nothing
, andmissing
values when finding the maximum - All elements are divided by the same maximum value to preserve relative distances
- If all values are NaN/nothing/missing, may return NaN values
Mycelia.normalize_vcf
— Methodnormalize_vcf(; reference_fasta, vcf_file)
Normalize a VCF file using bcftools norm, with automated handling of compression and indexing.
Arguments
reference_fasta::String
: Path to the reference FASTA file used for normalizationvcf_file::String
: Path to input VCF file (can be gzipped or uncompressed)
Returns
String
: Path to the normalized, sorted, and compressed output VCF file (*.sorted.normalized.vcf.gz)
Notes
- Requires bioconda packages: htslib, tabix, bcftools
- Creates intermediate files with extensions .tbi for indices
- Skips processing if output file already exists
- Performs left-alignment and normalization of variants
Mycelia.normalized_current_datetime
— Methodnormalized_current_datetime() -> String
Returns the current date and time as a normalized string with all non-word characters removed.
The output format is based on ISO datetime (YYYYMMDDThhmmss) but strips any special characters like hyphens, colons or dots.
Mycelia.observe
— Methodobserve(
record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record};
error_rate
)
Simulate sequencing of a DNA/RNA record by introducing random errors at the specified rate.
Arguments
record
: A FASTA or FASTQ record containing the sequence to be "observed"error_rate
: Probability of error at each position (default: 0.0)
Returns
A new FASTQ.Record with:
- Random UUID as identifier
- Original record's description
- Modified sequence with introduced errors
- Generated quality scores
Mycelia.observe
— Methodobserve(sequence::BioSequences.LongSequence{T}; error_rate=nothing, tech::Symbol=:illumina) where T
Simulates the “observation” of a biological polymer (DNA, RNA, or protein) by introducing realistic errors along with base‐quality scores. The simulation takes into account both random and systematic error components. In particular, for technologies:
- illumina: (mostly substitution errors) the per‐base quality decays along the read (from ~Q40 at the start to ~Q20 at the end);
- nanopore: errors are more frequent and include both substitutions and indels (with overall lower quality scores, and an extra “homopolymer” penalty);
- pacbio: errors are dominated by indels (with quality scores typical of raw reads);
- ultima: (UG 100/ppmSeq™) correct bases are assigned very high quality (~Q60) while errors are extremely rare and, if they occur, are given a modest quality.
An error is introduced at each position with a (possibly position‐dependent) probability. For Illumina, the error probability increases along the read; additionally, if a base is part of a homopolymer run (length ≥ 3) and the chosen technology is one that struggles with homopolymers (nanopore, pacbio, ultima), then the local error probability is multiplied by a constant factor.
Returns a tuple (new_seq, quality_scores)
where:
new_seq
is aBioSequences.LongSequence{T}
containing the “observed” sequence (which may be longer or shorter than the input if insertions or deletions occur), andquality_scores
is a vector of integers representing the Phred quality scores (using the Sanger convention) for each base in the output sequence.
Mycelia.open_fastx
— Methodopen_fastx(
path::AbstractString
) -> Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader{T} where T<:(TranscodingStreams.TranscodingStream{C, S} where {S<:IO, C<:TranscodingStreams.Codec})}
Open and return a reader for FASTA or FASTQ format files.
Arguments
path::AbstractString
: Path to input file. Can be:- Local file path
- HTTP/FTP URL
- Gzip compressed (.gz extension)
Supported formats
- FASTA (.fasta, .fna, .faa, .fa)
- FASTQ (.fastq, .fq)
Returns
FASTX.FASTA.Reader
for FASTA filesFASTX.FASTQ.Reader
for FASTQ files
Mycelia.open_genbank
— Methodopen_genbank(
genbank_file
) -> Vector{GenomicAnnotations.Record}
Opens and parses a GenBank format file containing genomic sequence and annotation data.
Arguments
genbank_file::AbstractString
: Path to the GenBank (.gb or .gbk) file
Returns
Vector{GenomicAnnotations.Chromosome}
: Vector containing parsed chromosome data
Mycelia.open_gff
— Methodopen_gff(path::String) -> Any
Opens a GFF (General Feature Format) file for reading.
Arguments
path::String
: Path to GFF file. Can be:- Local file path
- HTTP/FTP URL (FTP URLs are automatically converted to HTTP)
- Gzipped file (automatically decompressed)
Returns
IO
: An IO stream ready to read the GFF content
Mycelia.optimal_subsequence_length
— Methodoptimal_subsequence_length(
;
error_rate,
threshold,
sequence_length,
plot_result
)
Calculate the optimal subsequence length based on error rate distribution.
Arguments
error_rate
: Single error rate or array of error rates (between 0 and 1)threshold
: Desired probability that a subsequence is error-free (default: 0.95)sequence_length
: Maximum sequence length to consider for plottingplot_result
: If true, returns a plot of probability vs. length
Returns
- If
plot_result=false
: Integer representing optimal subsequence length - If
plot_result=true
: Tuple of (optimal_length, plot)
Examples
# Single error rate
optimal_subsequence_length(error_rate=0.01)
# Array of error rates
optimal_subsequence_length(error_rate=[0.01, 0.02, 0.01])
# With more stringent threshold
optimal_subsequence_length(error_rate=0.01, threshold=0.99)
# Generate plot
length, p = optimal_subsequence_length(error_rate=0.01, plot_result=true)
Plots.display(p)
Mycelia.parse_blast_report
— Methodparse_blast_report(blast_report) -> DataFrames.DataFrame
Expects output type 7 from BLAST, default output type 6 doesn't have the header comments and won't auto-parse
Parse a BLAST output file into a structured DataFrame.
Arguments
blast_report::AbstractString
: Path to a BLAST output file in format 7 (tabular with comments)
Returns
DataFrame
: Table containing BLAST results with columns matching the header fields. Returns empty DataFrame if no hits found.
Details
- Requires BLAST output format 7 (
-outfmt 7
), which includes header comments - Handles missing values (encoded as "N/A") automatically
- Infers column types based on BLAST field names
- Supports standard BLAST tabular fields including sequence IDs, scores, alignments and taxonomic information
Mycelia.parse_gfa
— Methodparse_gfa(gfa) -> MetaGraphs.MetaGraph{Int64, Float64}
Parse a GFA (Graphical Fragment Assembly) file into a MetaGraph representation.
Arguments
gfa
: Path to GFA format file
Returns
A MetaGraph
where:
- Vertices represent segments (contigs)
- Edges represent links between segments
- Vertex properties include
:id
with segment identifiers - Graph property
:records
contains the original FASTA records
Format Support
Handles standard GFA v1 lines:
H
: Header lines (skipped)S
: Segments (stored as nodes with FASTA records)L
: Links (stored as edges)P
: Paths (stored in paths dictionary)A
: HiFiAsm specific lines (skipped)
Mycelia.parse_jsonl
— Methodparse_jsonl(filepath::String) -> Vector{Dict{String, Any}}
Parse a JSONL (JSON Lines) file into a vector of dictionaries.
Arguments
filepath::String
: Path to the JSONL file to parse
Returns
Vector{Dict{String, Any}}
: Vector containing parsed JSON objects, one per line
Description
Reads a JSONL file line by line, parsing each line as a separate JSON object. Uses pre-allocation and progress tracking for efficient processing of large files.
Mycelia.parse_mmseqs_easy_taxonomy_lca_tsv
— Methodparse_mmseqs_easy_taxonomy_lca_tsv(
lca_tsv
) -> DataFrames.DataFrame
Parse the taxonomic Last Common Ancestor (LCA) TSV output from MMseqs2's easy-taxonomy workflow.
Arguments
lca_tsv
: Path to the TSV file containing MMseqs2 taxonomy results
Returns
DataFrame with columns:
contig_id
: Sequence identifiertaxon_id
: NCBI taxonomy identifiertaxon_rank
: Taxonomic rank (e.g. species, genus)taxon_name
: Scientific namefragments_retained
: Number of fragments keptfragments_taxonomically_assigned
: Number of fragments with taxonomyfragments_in_agreement_with_assignment
: Fragments matching contig taxonomysupport -log(E-value)
: Statistical support score
Mycelia.parse_mmseqs_easy_taxonomy_tophit_report
— Methodparse_mmseqs_easy_taxonomy_tophit_report(
tophit_report
) -> DataFrames.DataFrame
Parse an MMseqs2 easy-taxonomy tophit report into a structured DataFrame.
Arguments
tophit_report::String
: Path to the MMseqs2 easy-taxonomy tophit report file (tab-delimited)
Returns
DataFrame
: A DataFrame with columns:target_id
: Target sequence identifiernumber of sequences aligning to target
: Count of aligned sequencesunique coverage of target
: Ratio of uniqueAlignedResidues to targetLengthTarget coverage
: Ratio of alignedResidues to targetLengthAverage sequence identity
: Mean sequence identitytaxon_id
: Taxonomic identifiertaxon_rank
: Taxonomic ranktaxon_name
: Species name and lineage
Mycelia.parse_mmseqs_tophit_aln
— Methodparse_mmseqs_tophit_aln(tophit_aln) -> DataFrames.DataFrame
Parse MMseqs2 tophit alignment output file into a structured DataFrame.
Arguments
tophit_aln::AbstractString
: Path to tab-delimited MMseqs2 alignment output file
Returns
DataFrame with columns:
query
: Query sequence/profile identifiertarget
: Target sequence/profile identifierpercent identity
: Sequence identity percentagealignment length
: Length of alignmentnumber of mismatches
: Count of mismatched positionsnumber of gaps
: Count of gap openingsquery start
: Start position in query sequencequery end
: End position in query sequencetarget start
: Start position in target sequencetarget end
: End position in target sequenceevalue
: E-value of alignmentbit score
: Bit score of alignment
Mycelia.parse_qualimap_contig_coverage
— Methodparse_qualimap_contig_coverage(
qualimap_report_txt
) -> DataFrames.DataFrame
Parse contig coverage statistics from a Qualimap BAM QC report file.
Arguments
qualimap_report_txt::String
: Path to Qualimap bamqc report text file
Returns
DataFrame
: Coverage statistics with columns:Contig
: Contig identifierLength
: Contig length in basesMapped bases
: Number of bases mapped to contigMean coverage
: Average coverage depthStandard Deviation
: Standard deviation of coverage% Mapped bases
: Percentage of total mapped bases on this contig
Supported Assemblers
Handles output from both SPAdes and MEGAHIT assemblers:
- SPAdes format: NODEXlengthYcov_Z
- MEGAHIT format: kXX_Y
Parse the contig coverage information from qualimap bamqc text report, which looks like the following:
# this is spades
>>>>>>> Coverage per contig
NODE_1_length_107478_cov_9.051896 107478 21606903 201.0355886786133 60.39424208607496
NODE_2_length_5444_cov_1.351945 5444 153263 28.152645113886848 5.954250612823136
NODE_3_length_1062_cov_0.154390 1062 4294 4.043314500941619 1.6655384692688975
NODE_4_length_776_cov_0.191489 776 3210 4.13659793814433 2.252009588980858
# below is megahit
>>>>>>> Coverage per contig
k79_175 235 3862 16.43404255319149 8.437436249612457
k79_89 303 3803 12.551155115511552 5.709975376279777
k79_262 394 6671 16.931472081218274 7.579217802849293
k79_90 379 1539 4.060686015831134 1.2929729111266581
k79_91 211 3749 17.767772511848342 11.899185693011933
k79_0 2042 90867 44.49902056807052 18.356525483516613
To make this more robust, consider reading in the names of the contigs from the assembled fasta
Mycelia.parse_rtg_eval_output
— Methodparse_rtg_eval_output(f) -> DataFrames.DataFrame
Parse RTG evaluation output from a gzipped tab-separated file.
Arguments
f
: Path to a gzipped TSV file containing RTG evaluation output
Format
Expected file format:
- Header line starting with '#' and tab-separated column names
- Data rows in tab-separated format
- Empty files return a DataFrame with empty columns matching header
Returns
A DataFrame where:
- Column names are taken from the header line (stripped of '#')
- Data is parsed as Float64 values
- Empty files result in empty columns preserving header structure
Mycelia.parse_transterm_output
— Methodparse_transterm_output(
transterm_output
) -> DataFrames.DataFrame
Parse TransTerm terminator prediction output into a structured DataFrame.
Takes a TransTerm output file path and returns a DataFrame containing parsed terminator predictions. Each row represents one predicted terminator with the following columns:
chromosome
: Identifier of the sequence being analyzedterm_id
: Unique terminator identifier (e.g. "TERM 19")start
: Start position of the terminatorstop
: End position of the terminatorstrand
: Strand orientation ("+" or "-")location
: Context type, where:- G/g = in gene interior (≥50bp from ends)
- F/f = between two +strand genes
- R/r = between two -strand genes
- T = between ends of +strand and -strand genes
- H = between starts of +strand and -strand genes
- N = none of the above
confidence
: Overall confidence score (0-100)hairpin_score
: Hairpin structure scoretail_score
: Tail sequence scorenotes
: Additional annotations (e.g. "bidir")
Arguments
transterm_output::AbstractString
: Path to TransTerm output file
Returns
DataFrame
: Parsed terminator predictions with columns as described above
See TransTerm HP documentation for details on scoring and location codes.
Mycelia.parse_virsorter_score_tsv
— Methodparse_virsorter_score_tsv(
virsorter_score_tsv
) -> DataFrames.DataFrame
Parse a VirSorter score TSV file and return a DataFrame.
Arguments
virsorter_score_tsv::String
: The file path to the VirSorter score TSV file.
Returns
DataFrame
: A DataFrame containing the parsed data from the TSV file. If the file is empty, returns a DataFrame with the appropriate headers but no data.
Mycelia.parse_xam_to_mapped_records_table
— Functionparse_xam_to_mapped_records_table(
xam
) -> DataFrames.DataFrame
parse_xam_to_mapped_records_table(
xam,
primary_only
) -> DataFrames.DataFrame
Parse SAM/BAM files into a DataFrame containing mapped read alignments.
Arguments
xam::String
: Path to input SAM/BAM file (supports .sam, .bam, or .sam.gz)primary_only::Bool=false
: Flag to filter for primary alignments only (currently unused)
Returns
DataFrame with columns:
template
: Read template nameflag
: SAM flagreference
: Reference sequence nameposition
: Alignment position rangemappingquality
: Mapping quality scoretlen
: Template lengthalignlength
: Alignment lengthismapped
: Boolean indicating if read is mappedisprimary
: Boolean indicating if alignment is primaryalignment_score
: Alignment score (AS tag)mismatches
: Number of mismatches (NM tag)
Mycelia.parse_xam_to_primary_mapping_table
— Methodparse_xam_to_primary_mapping_table(
xam
) -> DataFrames.DataFrame
Parse a SAM/BAM alignment file and extract template-to-reference mapping information.
Arguments
xam::String
: Path to input alignment file (.sam, .sam.gz, or .bam format)
Returns
DataFrame
: Table with columns:template
: Read template namesreference
: Reference sequence names
Only includes primary alignments (not secondary/supplementary) that are mapped to references. Skips header lines starting with '@'.
Mycelia.parse_xam_to_summary_table
— Methodparse_xam_to_summary_table(xam) -> DataFrames.DataFrame
Parse a SAM/BAM file into a summary DataFrame containing alignment metadata.
Arguments
xam::AbstractString
: Path to input SAM (.sam), BAM (.bam), or gzipped SAM (.sam.gz) file
Returns
DataFrame with columns:
template
: Read nameflag
: SAM flagreference
: Reference sequence nameposition
: Alignment position range (start:end)mappingquality
: Mapping quality scorealignment_score
: Alignment score (AS tag)isprimary
: Whether alignment is primaryalignlength
: Length of the alignmentismapped
: Whether read is mappedmismatches
: Number of mismatches (NM tag)
Note: Only mapped reads are included in the output DataFrame.
Mycelia.parse_xam_to_taxonomic_mapping_quality
— Functionparse_xam_to_taxonomic_mapping_quality(
xam
) -> DataFrames.DataFrame
parse_xam_to_taxonomic_mapping_quality(
xam,
primary_only
) -> DataFrames.DataFrame
Parse alignment data from SAM/BAM files into a structured DataFrame containing mapping quality and taxonomic information.
Arguments
xam::String
: Path to input file (.sam, .bam, or .sam.gz)primary_only::Bool=false
: When true, include only primary alignments (currently not implemented)
Returns
DataFrames.DataFrame with columns:
template
: Read nameflag
: SAM flagreference
: Reference sequence nameposition
: Alignment position rangemappingquality
: Mapping quality scoretlen
: Template lengthalignlength
: Alignment lengthismapped
: Boolean indicating if read is mappedisprimary
: Boolean indicating if alignment is primaryalignment_score
: Alignment score (AS tag)mismatches
: Number of mismatches (NM tag)
Notes
- Skips unmapped reads
- Automatically detects and handles SAM/BAM/compressed SAM formats
Mycelia.path_to_sequence
— Methodpath_to_sequence(kmers, path) -> Any
Convert a path through k-mers into a single DNA sequence.
Takes a vector of k-mers and a path representing the order to traverse them, reconstructs the original sequence by joining the k-mers according to the path. The first k-mer is used in full, then only the last nucleotide from each subsequent k-mer is added.
Arguments
kmers
: Vector of DNA k-mers (as LongDNA{4})path
: Vector of tuples representing the path through the k-mers
Returns
LongDNA{4}
: The reconstructed DNA sequence
Mycelia.pixels_to_points
— Methodpixels_to_points(pixels) -> Any
Convert pixel measurements to point measurements using the standard 4:3 ratio.
Points are the standard unit for typography (1 point = 1/72 inch), while pixels are used for screen measurements. This conversion uses the conventional 4:3 ratio where 3 points equal 4 pixels.
Arguments
pixels
: The number of pixels to convert
Returns
- The equivalent measurement in points
Mycelia.plot_graph
— Methodplot_graph(graph) -> Any
Creates a visualization of a kmer graph where nodes represent kmers and their sizes reflect counts.
Arguments
graph
: A MetaGraph where vertices have:kmer
and:count
properties
Returns
- A Plots.jl plot object showing the graph visualization
Details
- Node sizes are scaled based on kmer counts
- Plot dimensions scale logarithmically with number of vertices
- Each node is labeled with its kmer sequence
Mycelia.plot_kmer_frequency_spectra
— Methodplot_kmer_frequency_spectra(
counts;
log_scale,
kwargs...
) -> Plots.Plot
Plots a histogram of kmer counts against # of kmers with those counts
Returns the plot object for adding additional layers and saving
Creates a scatter plot visualizing the k-mer frequency spectrum - the relationship between k-mer frequencies and how many k-mers occur at each frequency.
Arguments
counts::AbstractVector{<:Integer}
: Vector of k-mer counts/frequencieslog_scale::Union{Function,Nothing} = log2
: Function to apply logarithmic scaling to both axes. Set tonothing
to use linear scaling.kwargs...
: Additional keyword arguments passed toStatsPlots.plot()
Returns
Plots.Plot
: A scatter plot object that can be further modified or saved
Details
The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.
Mycelia.plot_optimal_cluster_assessment_results
— Methodplot_optimal_cluster_assessment_results(
clustering_results
) -> Any
Visualizes cluster assessment metrics and saves the resulting plots.
Arguments
clustering_results
: A named tuple containing:ks_assessed
: Vector of k values testedwithin_cluster_sum_of_squares
: Vector of WCSS scoressilhouette_scores
: Vector of silhouette scoresoptimal_number_of_clusters
: Integer indicating optimal k
Details
Creates two plots:
- Within-cluster sum of squares (WCSS) vs number of clusters
- Silhouette scores vs number of clusters
Both plots include a vertical line indicating the optimal number of clusters.
Outputs
Saves two SVG files in the project directory:
wcss.svg
: WCSS plotsilhouette.svg
: Silhouette scores plot
Mycelia.points_to_pixels
— Methodpoints_to_pixels(points) -> Any
Convert typographic points to pixels using a 4:3 ratio (1 point = 4/3 pixels).
Arguments
points
: Size in typographic points (pt)
Returns
- Size in pixels (px)
Mycelia.polish_fastq
— Methodpolish_fastq(; fastq, k)
Polish FASTQ reads using a k-mer graph-based approach to correct potential sequencing errors.
Arguments
fastq::String
: Path to input FASTQ filek::Int=1
: Initial k-mer size parameter. Final assembly k-mer size may differ.
Process
- Builds a directed k-mer graph from input reads
- Processes each read through the graph to find optimal paths
- Writes corrected reads to a new FASTQ file
- Automatically compresses output with gzip
Returns
Named tuple with:
fastq::String
: Path to output gzipped FASTQ filek::Int
: Final assembly k-mer size used
Mycelia.prefetch
— Methodprefetch(; SRR, outdir)
Downloads Sequence Read Archive (SRA) data using the prefetch tool from sra-tools.
Arguments
SRR
: SRA accession number (e.g., "SRR12345678")outdir
: Directory where the downloaded data will be saved. Defaults to current directory.
Notes
- Requires sra-tools which will be installed in a Conda environment
- Downloads are saved in .sra format
- Internet connection required
Mycelia.process_fastq_record
— Methodprocess_fastq_record(
;
record,
kmer_graph,
yen_k_shortest_paths_and_weights,
yen_k
)
Process and error-correct a FASTQ sequence record using a kmer graph and path resampling.
Arguments
record
: FASTQ record containing the sequence to processkmer_graph
: MetaGraph containing the kmer network and associated propertiesyen_k_shortest_paths_and_weights
: Cache of pre-computed k-shortest paths between nodesyen_k
: Number of alternative paths to consider during resampling (default: 3)
Description
Performs error correction by:
- Trimming low-quality sequence ends
- Identifying stretches requiring resampling between solid branching kmers
- Selecting alternative paths through the kmer graph based on:
- Path quality scores
- Transition likelihoods
- Path length similarity to original sequence
Returns
- Modified FASTQ record with error-corrected sequence and updated quality scores
- Original record if no error correction was needed
Required Graph Properties
The kmer_graph must contain the following properties:
- :ordered_kmers
- :likelyvalidkmer_indices
- :kmer_indices
- :branching_nodes
- :assembly_k
- :transition_likelihoods
- :kmermeanquality
- :kmertotalquality
Mycelia.q_value_to_error_rate
— Methodq_value_to_error_rate(q_value) -> Any
Convert a Phred quality score (Q-value) to a probability of error.
Arguments
q_value
: Phred quality score, typically ranging from 0 to 40
Returns
- Error probability in range [0,1], where 0 indicates highest confidence
A Q-value of 10 corresponds to an error rate of 0.1 (10%), while a Q-value of 30 corresponds to an error rate of 0.001 (0.1%).
Mycelia.qc_filter_long_reads_fastplong
— Methodqc_filter_long_reads_fastplong(
;
in_fastq,
report_title,
out_fastq,
html_report,
json_report,
min_length,
max_length
)
Perform QC filtering on long-read FASTQ files using fastplong.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output FASTQ file.quality_threshold::Int
: Minimum average quality to retain a read (default 10).min_length::Int
: Minimum read length (default 1000).max_length::Int=0
: Maximum read length (default 0, no maximum).
Returns
String
: Path to the filtered FASTQ file.
Details
This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.
Mycelia.qc_filter_long_reads_filtlong
— Methodqc_filter_long_reads_filtlong(
;
in_fastq,
out_fastq,
min_mean_q,
keep_percent
)
Filter and process long reads from a FASTQ file using Filtlong.
This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.min_mean_q::Int
: Minimum mean quality score for reads to be kept. Default is 20.keep_percent::Int
: Percentage of reads to keep after filtering. Default is 95.
Returns
out_fastq
Details
This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.
Mycelia.qc_filter_short_reads_fastp
— Methodqc_filter_short_reads_fastp(
in_fastq::String,
out_fastq::String;
adapter_seq,
quality_threshold,
min_length
)
Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output FASTQ file.adapter_seq::String
: Adapter sequence to trim.quality_threshold::Int
: Minimum phred score for trimming (default 20).min_length::Int
: Minimum read length to retain (default 50).
Returns
String
: Path to the filtered and trimmed FASTQ file.
Details
This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length
. It’s a simple wrapper that executes the external fastp command.
Mycelia.qualmers_canonical
— Methodqualmers_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Union{Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, Vector{Int8}} where _A)}, Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{BioSequences.DNAAlphabet{4}, _A, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}} where _A)), F<:(Mycelia.var"#499#500"{_A, Vector{Int8}} where _A)}}
Mycelia.qualmers_canonical
— Methodqualmers_canonical(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s235"}} where {_A, var"#s235"<:Integer})}
Mycelia.qualmers_canonical
— Methodqualmers_canonical(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#499#500"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
Mycelia.qualmers_fw
— Methodqualmers_fw(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Mycelia.qualmers_fw
— Methodqualmers_fw(
sequence::BioSequences.LongSequence{A},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
Create an iterator that yields DNA qualmers from the given sequence and quality scores.
Mycelia.qualmers_unambiguous
— Methodqualmers_unambiguous(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator
Mycelia.qualmers_unambiguous
— Methodqualmers_unambiguous(
sequence::BioSequences.LongAA,
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s235"}} where {_A, var"#s235"<:Integer})}
Mycelia.qualmers_unambiguous
— Methodqualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#495#496"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
Mycelia.qualmers_unambiguous
— Methodqualmers_unambiguous(
sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
quality::AbstractVector{<:Integer},
_::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#497#498"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
Mycelia.qualmers_unambiguous_canonical
— Methodqualmers_unambiguous_canonical(
record::FASTX.FASTQ.Record,
k::Int64
) -> Base.Generator{I, Mycelia.var"#501#502"} where I
Generate unambiguous canonical qualmers from the given FASTQ record.
Mycelia.rand_of_each_group
— Methodrand_of_each_group(
gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any
Select one random row from each group in a grouped DataFrame.
Arguments
gdf::GroupedDataFrame
: A grouped DataFrame created usinggroupby
Returns
DataFrame
: A new DataFrame containing exactly one randomly sampled row from each group
Mycelia.random_fasta_record
— Methodrandom_fasta_record(
;
moltype,
seed,
L
) -> FASTX.FASTA.Record
Generates a random FASTA record with a specified molecular type and sequence length.
Arguments
moltype::Symbol=:DNA
: The type of molecule to generate (:DNA
,:RNA
, or:AA
for amino acids).seed
: The random seed used for sequence generation (default: a random integer).L
: The length of the sequence (default: a random integer up totypemax(UInt16)
).
Returns
- A
FASTX.FASTA.Record
containing:- A randomly generated UUID identifier.
- A randomly generated sequence of the specified type.
Errors
- Throws an error if
moltype
is not one of:DNA
,:RNA
, or:AA
.
Mycelia.random_symmetric_distance_matrix
— Methodrandom_symmetric_distance_matrix(n) -> Any
Generate a random symmetric distance matrix of size n×n with zeros on the diagonal.
Arguments
n
: Positive integer specifying the matrix dimensions
Returns
- A symmetric n×n matrix with random values in [0,1), zeros on the diagonal
Details
- The matrix is symmetric, meaning M[i,j] = M[j,i]
- Diagonal elements M[i,i] are set to 0.0
- Off-diagonal elements are uniformly distributed random values
Mycelia.rclone_copy
— Methodrclone_copy(source, dest; config, max_attempts, sleep_timer)
Copy files between local and remote storage using rclone with automated retry logic.
Arguments
source::String
: Source path or remote (e.g. "local/path" or "gdrive:folder")dest::String
: Destination path or remote (e.g. "gdrive:folder" or "local/path")
Keywords
config::String=""
: Optional path to rclone config filemax_attempts::Int=3
: Maximum number of retry attemptssleep_timer::Int=60
: Initial sleep duration between retries in seconds (doubles after each attempt)
Details
Uses optimized rclone settings for large files:
- 2GB chunk size
- 1TB upload cutoff
- Rate limited to 1 transaction per second
Mycelia.rclone_list_directories
— Methodrclone_list_directories(path) -> Any
List all directories at the specified rclone path.
Arguments
path::String
: Remote path to list directories from (e.g. "remote:/path/to/dir")
Returns
Vector{String}
: Full paths to all directories found at the specified location
Mycelia.read_fastani
— Methodread_fastani(path::String) -> DataFrames.DataFrame
Imports results of fastani
Reads and processes FastANI output results from a tab-delimited file.
Arguments
path::String
: Path to the FastANI output file
Returns
DataFrame with columns:
query
: Original query filepathquery_identifier
: Extracted filename without extensionreference
: Original reference filepathreference_identifier
: Extracted filename without extension%_identity
: ANI percentage identityfragments_mapped
: Number of fragments mappedtotal_query_fragments
: Total number of query fragments
Notes
- Expects tab-delimited input file from FastANI
- Automatically strips .fasta, .fna, or .fa extensions from filenames
- Column order is preserved as listed above
Mycelia.read_gff
— Methodread_gff(gff::AbstractString) -> DataFrames.DataFrame
Reads a GFF (General Feature Format) file and parses it into a DataFrame.
Arguments
gff::AbstractString
: Path to the GFF file
Returns
DataFrame
: A DataFrame containing the parsed GFF data with standard columns: seqid, source, type, start, end, score, strand, phase, and attributes
Mycelia.read_gff
— Methodread_gff(gff_io) -> DataFrames.DataFrame
Read a GFF (General Feature Format) file into a DataFrame.
Arguments
gff_io
: An IO stream containing GFF formatted data
Returns
DataFrame
: A DataFrame with standard GFF columns:- seqid: sequence identifier
- source: feature source
- type: feature type
- start: start position (1-based)
- end: end position
- score: numeric score
- strand: strand (+, -, or .)
- phase: phase (0, 1, 2 or .)
- attributes: semicolon-separated key-value pairs
Mycelia.read_kraken_report
— Methodread_kraken_report(kraken_report) -> DataFrames.DataFrame
Parse a Kraken taxonomic classification report into a structured DataFrame.
Arguments
kraken_report::AbstractString
: Path to a tab-delimited Kraken report file
Returns
DataFrame
: A DataFrame with the following columns:percentage_of_fragments_at_or_below_taxon
: Percentage of fragments coverednumber_of_fragments_at_or_below_taxon
: Count of fragments at/below taxonnumber_of_fragments_assigned_directly_to_taxon
: Direct fragment assignmentsrank
: Taxonomic rankncbi_taxonid
: NCBI taxonomy identifierscientific_name
: Scientific name (whitespace-trimmed)
Notes
- Scientific names are automatically stripped of leading/trailing whitespace
- Input file must be tab-delimited
Mycelia.read_mmseqs_easy_search
— Methodread_mmseqs_easy_search(mmseqs_file) -> DataFrames.DataFrame
Read results from MMSeqs2 easy-search output file into a DataFrame.
Arguments
mmseqs_file::String
: Path to the tab-delimited output file from MMSeqs2 easy-search
Returns
DataFrame
: Contains search results with columns:query
: Query sequence identifiertarget
: Target sequence identifierseqIdentity
: Sequence identity (0.0-1.0)alnLen
: Alignment lengthmismatch
: Number of mismatchesgapOpen
: Number of gap openingsqStart
: Query start positionqEnd
: Query end positiontStart
: Target start positiontEnd
: Target end positionevalue
: Expected valuebits
: Bit score
Mycelia.reverse_translate
— Methodreverse_translate(
protein_sequence::BioSequences.LongAA
) -> BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}
Convert a protein sequence back to a possible DNA coding sequence using weighted random codon selection.
Arguments
protein_sequence::BioSequences.LongAA
: The amino acid sequence to reverse translate
Returns
BioSequences.LongDNA{2}
: A DNA sequence that would translate to the input protein sequence
Details
Uses codon usage frequencies to randomly select codons for each amino acid, weighted by their natural occurrence. Each selected codon is guaranteed to translate back to the original amino acid.
Mycelia.rolling_centered_avg
— Methodrolling_centered_avg(data::AbstractArray{T, 1}; window_size)
Compute a centered moving average over a vector using a sliding window.
Arguments
data::AbstractVector{T}
: Input vector to be averagedwindow_size::Int
: Size of the sliding window (odd number recommended)
Returns
Vector{Float64}
: Vector of same length as input containing moving averages
Details
- For points near the edges, the window is truncated to available data
- Window is centered on each point, using floor(window_size/2) points on each side
- Result type is always Float64 regardless of input type T
Mycelia.run_blast
— Methodrun_blast(
;
out_dir,
fasta,
blast_db,
blast_command,
force,
remote,
wait
)
Run a BLAST (Basic Local Alignment Search Tool) command with the specified parameters.
Arguments
out_dir::String
: The output directory where the BLAST results will be stored.fasta::String
: The path to the input FASTA file.blast_db::String
: The path to the BLAST database.blast_command::String
: The BLAST command to be executed (e.g.,blastn
,blastp
).force::Bool
: Iftrue
, forces the BLAST command to run even if the output file already exists. Default isfalse
.remote::Bool
: Iftrue
, runs the BLAST command remotely. Default isfalse
.wait::Bool
: Iftrue
, waits for the BLAST command to complete before returning. Default istrue
.
Returns
outfile::String
: The path to the output file containing the BLAST results.
Description
This function constructs and runs a BLAST command based on the provided parameters. It creates the necessary output directory, constructs the output file name, and determines whether the BLAST command needs to be run based on the existence and size of the output file. The function supports both local and remote execution of the BLAST command.
If force
is set to true
or the output file does not exist or is empty, the BLAST command is executed. The function logs the command being run and measures the time taken for execution. The output file path is returned upon completion.
Mycelia.run_blastn
— Methodrun_blastn(
;
out_dir,
fasta,
blast_db,
task,
force,
remote,
wait
)
Run the BLASTN (Basic Local Alignment Search Tool for Nucleotides) command with specified parameters.
Arguments
out_dir::String
: The output directory where the BLASTN results will be saved.fasta::String
: The path to the input FASTA file containing the query sequences.blast_db::String
: The path to the BLAST database to search against.task::String
: The BLASTN task to perform. Default is "megablast".force::Bool
: If true, forces the BLASTN command to run even if the output file already exists. Default is false.remote::Bool
: If true, runs the BLASTN command remotely. Default is false.wait::Bool
: If true, waits for the BLASTN command to complete before returning. Default is true.
Returns
outfile::String
: The path to the output file containing the BLASTN results.
Description
This function constructs and runs a BLASTN command based on the provided parameters. It creates an output directory if it doesn't exist, constructs the output file path, and checks if the BLASTN command needs to be run based on the existence and size of the output file. The function supports running the BLASTN command locally or remotely, with options to force re-running and to wait for completion.
Mycelia.run_clustal_omega
— Methodrun_clustal_omega(; fasta, outfmt)
Run Clustal Omega multiple sequence alignment on a FASTA file.
Arguments
fasta::String
: Path to input FASTA fileoutfmt::String="clustal"
: Output format for the alignment
Returns
String
: Path to the output alignment file
Supported Output Formats
"fasta"
: FASTA format"clustal"
: Clustal format"msf"
: MSF format"phylip"
: PHYLIP format"selex"
: SELEX format"stockholm"
: Stockholm format"vienna"
: Vienna format
Notes
- Uses Bioconda to manage the Clustal Omega installation
- Caches results - will return existing output file if already generated
- Handles single sequence files gracefully by returning output path without error
Mycelia.run_ectyper
— Methodrun_ectyper(fasta_file) -> Any
Run ECTyper for serotyping E. coli genome assemblies.
Arguments
fasta_file::String
: Path to input FASTA file containing assembled genome(s)
Returns
String
: Path to output directory containing ECTyper results
Mycelia.run_hifiasm
— Methodrun_hifiasm(; fastq, outdir)
Run the hifiasm genome assembler on PacBio HiFi reads.
Arguments
fastq::String
: Path to input FASTQ file containing HiFi readsoutdir::String
: Output directory path (default: "{basename(fastq)}_hifiasm")
Returns
Named tuple containing:
outdir::String
: Path to output directoryhifiasm_outprefix::String
: Prefix used for hifiasm output files
Details
- Automatically creates and uses a conda environment with hifiasm
- Uses primary assembly mode (–primary) optimized for inbred samples
- Skips assembly if output files already exist at the specified prefix
- Utilizes all available CPU threads
Mycelia.run_mlst
— Methodrun_mlst(fasta_file) -> String
Run Multi-Locus Sequence Typing (MLST) analysis on a genome assembly.
Arguments
fasta_file::String
: Path to input FASTA file containing the genome assembly
Returns
- Path to the output file containing MLST results (
<input>.mlst.out
)
Details
Uses the mlst
tool from PubMLST to identify sequence types by comparing allelic profiles of housekeeping genes against curated MLST schemes.
Dependencies
- Requires Bioconda and the
mlst
package - Automatically sets up conda environment if not present
Mycelia.run_mmseqs_easy_search
— Methodrun_mmseqs_easy_search(
;
query_fasta,
target_database,
out_dir,
outfile,
format_output,
threads,
force
)
Runs the MMseqs2 easy-search command on the given query FASTA file against the target database.
Arguments
query_fasta::String
: Path to the query FASTA file.target_database::String
: Path to the target database.out_dir::String
: Directory to store the output file. Defaults to the directory of the query FASTA file.outfile::String
: Name of the output file. Defaults to a combination of the query FASTA and target database filenames.format_output::String
: Format of the output. Defaults to a predefined set of fields.threads::Int
: Number of CPU threads to use. Defaults to the number of CPU threads available.force::Bool
: If true, forces the re-generation of the output file even if it already exists. Defaults to false.
Returns
outfile_path::String
: Path to the generated output file.
Notes
- Adds the
mmseqs2
environment using Bioconda if not already present. - Removes temporary files created during the process.
Mycelia.run_padloc
— Methodrun_padloc(; fasta_file, outdir, threads)
Run the 'padloc' tool from the 'padlocbio' conda environment on a given FASTA file.
https://doi.org/10.1093/nar/gkab883
https://github.com/padlocbio/padloc
This function first ensures that the 'padloc' environment is available via Bioconda. It then attempts to update the 'padloc' database. If a 'padloc' output file (with a '_padloc.csv' suffix) does not already exist for the input FASTA file, it runs 'padloc' with the specified FASTA file as input.
Mycelia.run_parallel_progress
— Methodrun_parallel_progress(
f::Function,
items::AbstractVector
) -> Any
Run a function f
in parallel over a collection of items
with a progress meter.
Arguments
f::Function
: The function to be applied to each item in the collection.items::AbstractVector
: A collection of items to be processed.
Description
This function creates a progress meter to track the progress of processing each item in the items
collection. It uses multithreading to run the function f
on each item in parallel, updating the progress meter after each item is processed.
Mycelia.run_prodigal
— Methodrun_prodigal(; fasta_file, out_dir)
Run Prodigal gene prediction software on input FASTA file to identify protein-coding genes in metagenomes or single genomes.
Arguments
fasta_file::String
: Path to input FASTA file containing genomic sequencesout_dir::String=dirname(fasta_file)
: Directory for output files. Defaults to input file's directory
Returns
Named tuple containing paths to all output files:
fasta_file
: Input FASTA file pathout_dir
: Output directory pathgff
: Path to GFF format gene predictionsgene_scores
: Path to all potential genes and their scoresfna
: Path to nucleotide sequences of predicted genesfaa
: Path to protein translations of predicted genesstd_out
: Path to captured stdoutstd_err
: Path to captured stderr
Mycelia.run_pyrodigal
— Methodrun_pyrodigal(; fasta_file, out_dir)
Run Pyrodigal gene prediction on a FASTA file using the meta procedure optimized for metagenomic sequences.
Pyrodigal is a reimplementation of the Prodigal gene finder, which identifies protein-coding sequences in bacterial and archaeal genomes.
Arguments
fasta_file::String
: Path to input FASTA file containing genomic sequencesout_dir::String
: Output directory path (default: input filename + "_pyrodigal")
Returns
Named tuple containing:
fasta_file
: Input FASTA file pathout_dir
: Output directory pathgff
: Path to GFF output file with gene predictionsfaa
: Path to FASTA file with predicted protein sequencesfna
: Path to FASTA file with nucleotide sequences
Notes
- Uses metagenomic mode (
-p meta
) optimized for mixed communities - Masks runs of N nucleotides (
-m
flag) - Minimum gene length set to 33bp
- Maximum overlap between genes set to 31bp
- Requires Pyrodigal to be available in a Conda environment
- Skips processing if output files already exist
Mycelia.run_samtools_flagstat
— Functionrun_samtools_flagstat(xam) -> Any
run_samtools_flagstat(xam, samtools_flagstat) -> Any
Generate alignment statistics for a SAM/BAM/CRAM file using samtools flagstat.
Arguments
xam::AbstractString
: Path to input SAM/BAM/CRAM alignment filesamtools_flagstat::AbstractString
: Output path for flagstat results (default: input_path.samtools-flagstat.txt)
Returns
String
: Path to the generated flagstat output file
Details
Runs samtools flagstat to calculate statistics on the alignment file, including:
- Total reads
- Secondary alignments
- Supplementary alignments
- Duplicates
- Mapped/unmapped reads
- Proper pairs
- Read 1/2 counts
Requirements
- Requires samtools to be available via Bioconda
- Input file must be in SAM, BAM or CRAM format
Mycelia.run_transterm
— Methodrun_transterm(; fasta, gff)
Run TransTermHP to predict rho-independent transcription terminators in DNA sequences.
Arguments
fasta
: Path to input FASTA file containing DNA sequencesgff
: Optional path to GFF annotation file. If provided, improves prediction accuracy
Returns
String
: Path to output file containing TransTermHP predictions
Details
- Uses Conda environment 'transtermhp' for execution
- Automatically generates coordinate file from FASTA or GFF input
- Removes temporary coordinate file after completion
- Requires Mycelia's Conda setup
Mycelia.run_trnascan
— Methodrun_trnascan(; fna_file, outdir)
Run tRNAscan-SE to identify and annotate transfer RNA genes in the provided sequence file.
Arguments
fna_file::String
: Path to input FASTA nucleotide fileoutdir::String
: Output directory path (default: inputfilepath + "_trnascan")
Returns
String
: Path to the output directory containing tRNAscan-SE results
Output Files
Creates the following files in outdir
:
*.trnascan.out
: Main output with tRNA predictions*.trnascan.bed
: BED format coordinates*.trnascan.fasta
: FASTA sequences of predicted tRNAs*.trnascan.struct
: Secondary structure predictions*.trnascan.stats
: Summary statistics*.trnascan.log
: Program execution log
Notes
- Uses the general tRNA model (-G flag) suitable for all domains of life
- Automatically sets up tRNAscan-SE via Bioconda
- Skips processing if output directory contains files
Mycelia.samtools_index_fasta
— Methodsamtools_index_fasta(; fasta)
Creates an index file (.fai) for a FASTA reference sequence using samtools.
The FASTA index allows efficient random access to the reference sequence. This is required by many bioinformatics tools that need to quickly fetch subsequences from the reference.
Arguments
fasta
: Path to the input FASTA file
Side Effects
- Creates a
{fasta}.fai
index file in the same directory as input - Installs samtools via conda if not already present
Mycelia.save_graph
— Methodsave_graph(
graph::Graphs.AbstractGraph,
outfile::String
) -> String
Saves the given graph to a file in JLD2 format.
Arguments
graph::Graphs.AbstractGraph
: The graph to be saved.outfile::String
: The name of the output file. If the file extension is not.jld2
, it will be appended automatically.
Returns
String
: The name of the output file with the.jld2
extension.
Mycelia.save_matrix_jld2
— Methodsave_matrix_jld2(; matrix, filename)
Saves a matrix to a JLD2 file format.
Arguments
matrix
: The matrix to be savedfilename
: String path where the file should be saved
Returns
- The filename string that was used to save the matrix
Mycelia.scg_sbatch
— Methodscg_sbatch(
;
job_name,
mail_user,
mail_type,
logdir,
partition,
account,
nodes,
ntasks,
time,
cpus_per_task,
mem_gb,
cmd
)
Submit a job to SLURM using sbatch with specified parameters.
Arguments
job_name::String
: Name identifier for the SLURM jobmail_user::String
: Email address for job notificationsmail_type::String
: Type of mail notifications (default: "ALL")logdir::String
: Directory for error and output logs (default: "~/workspace/slurmlogs")partition::String
: SLURM partition to submit job toaccount::String
: Account to charge for compute resourcesnodes::Int
: Number of nodes to allocate (default: 1)ntasks::Int
: Number of tasks to run (default: 1)time::String
: Maximum wall time in format "days-hours:minutes:seconds" (default: "1-00:00:00")cpus_per_task::Int
: CPUs per task (default: 1)mem_gb::Int
: Memory in GB, defaults to 32GB per CPUcmd::String
: Command to execute
Returns
Bool
: Returns true if submission succeeds
Notes
- Function includes 5-second delays before and after submission
- Memory is automatically scaled with CPU count
- Log files are named with job ID (%j) and job name (%x)
Mycelia.seq2sha256
— Methodseq2sha256(seq::AbstractString) -> String
Compute the SHA-256 hash of a sequence string.
Arguments
seq::AbstractString
: Input sequence to be hashed
Returns
String
: Hexadecimal representation of the SHA-256 hash
Details
The input sequence is converted to uppercase before hashing.
Mycelia.seq2sha256
— Methodseq2sha256(seq::BioSequences.BioSequence) -> String
Convert a biological sequence to its SHA256 hash value.
Calculates a cryptographic hash of the sequence by first converting it to a string representation. This method dispatches to the string version of seq2sha256
.
Arguments
seq::BioSequences.BioSequence
: The biological sequence to hash
Returns
String
: A 64-character hexadecimal string representing the SHA256 hash
Mycelia.sequence_to_stranded_path
— Methodsequence_to_stranded_path(
stranded_kmers,
sequence
) -> Vector{Pair{Int64, Bool}}
Convert a DNA sequence into a path through a collection of stranded k-mers.
Arguments
stranded_kmers
: Collection of unique k-mers representing possible path verticessequence
: Input DNA sequence to convert to a path
Returns
Vector of Pair{Int,Bool}
where:
- First element (Int) is the index of the k-mer in
stranded_kmers
- Second element (Bool) indicates orientation (true=forward, false=reverse)
Mycelia.setup_taxonkit_taxonomy
— Methodsetup_taxonkit_taxonomy() -> String
Downloads and extracts the NCBI taxonomy database required for taxonkit operations.
Downloads taxdump.tar.gz
from NCBI FTP server and extracts it to ~/.taxonkit/
. This is a prerequisite for using taxonkit-based taxonomy functions.
Requirements
- Working internet connection
- Sufficient disk space (~100MB)
taxonkit
must be installed separately
Returns
- Nothing
Throws
SystemError
if download fails or if unable to create directoryErrorException
if tar extraction fails
Mycelia.sha256_file
— Methodsha256_file(file::AbstractString) -> String
Compute the SHA-256 hash of the contents of a file.
Arguments
file::AbstractString
: The path to the file for which the SHA-256 hash is to be computed.
Returns
String
: The SHA-256 hash of the file contents, represented as a hexadecimal string.
Mycelia.simulate_illumina_paired_reads
— Methodsimulate_illumina_paired_reads(
;
in_fasta,
coverage,
read_count,
outbase,
read_length,
mflen,
sdev,
seqSys,
amplicon,
errfree,
rndSeed
)
Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.
This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov
) or an absolute read count (--rcount
), to enable amplicon mode, and to optionally generate a zero-error SAM file.
Arguments
in_fasta::String
: Path to the input FASTA file.coverage::Union{Nothing,Number}
: Desired fold coverage (used with--fcov
); ifnothing
andread_count
is provided then fold coverage is ignored. (Default: 20)read_count::Union{Nothing,Number}
: Total number of reads (or read pairs) to generate (used with--rcount
instead of fold coverage). (Default:nothing
)outbase::String
: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").read_length::Int
: Length of reads to simulate (default: 150).mflen::Int
: Mean fragment length for paired-end simulations (default: 500).sdev::Int
: Standard deviation of fragment lengths (default: 10).seqSys::String
: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").paired::Bool
: Whether to simulate paired-end reads (default: true).amplicon::Bool
: Enable amplicon sequencing simulation mode (default: false).errfree::Bool
: Generate an extra SAM file with zero sequencing errors (default: false).rndSeed::Union{Nothing,Int}
: Optional seed for reproducibility (default: nothing).
Outputs
Generates gzipped FASTQ files in the working directory:
- For paired-end:
$(outbase)1.fq.gz
(forward) and$(outbase)2.fq.gz
(reverse). - For single-end:
$(outbase)1.fq.gz
.
Additional SAM files may be produced if --errfree
is enabled and/or if the ART --samout
option is specified.
Details
This function calls ART with the provided options. Note that if read_count
is supplied, the function uses the --rcount
option; otherwise, it uses --fcov
with the given coverage. Amplicon mode (via --amplicon
) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.
Dependencies
Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.
See also: simulate_nanopore_reads
, simulate_nearly_perfect_long_reads
, simulate_pacbio_reads
Mycelia.simulate_nanopore_reads
— Methodsimulate_nanopore_reads(; fasta, quantity, outfile)
Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.
Arguments
fasta::String
: Path to input reference FASTA filequantity::String
: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")outfile::String
: Output path for gzipped FASTQ file. Defaults to input filename with modified extension
Returns
String
: Path to the generated output FASTQ file
See also: simulate_pacbio_reads
, simulate_nearly_perfect_long_reads
, simulate_short_reads
Mycelia.simulate_nearly_perfect_long_reads
— Methodsimulate_nearly_perfect_long_reads()
Simulate high-quality long reads with minimal errors using Badread.
Arguments
reference::String
: Path to reference FASTA filequantity::String
: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")length_mean::Int=40000
: Mean read lengthlength_sd::Int=20000
: Standard deviation of read length
Returns
Vector of simulated reads in FASTQ format
Details
Generates nearly perfect long reads by setting error rates and artifacts to minimum values. Uses ideal quality scores and disables common sequencing artifacts like chimeras and adapters.
See also: simulate_pacbio_reads
, simulate_nanopore_reads
, simulate_short_reads
Mycelia.simulate_pacbio_reads
— Methodsimulate_pacbio_reads(; fasta, quantity, outfile)
Simulate PacBio HiFi reads using the Badread error model.
Arguments
fasta::String
: Path to input FASTA file containing reference sequencequantity::String
: Coverage depth (e.g. "50x") or total bases (e.g. "1000000") - NOT TOTAL READSoutfile::String
: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix
Returns
String
: Path to the generated output file
Notes
- Requires Badread tool from Bioconda
- Uses PacBio 2021 error and quality score models
- Average read length ~15kb
- Output is gzipped FASTQ format
See also: simulate_nanopore_reads
, simulate_nearly_perfect_long_reads
, simulate_short_reads
Mycelia.simulate_variants
— Methodsimulate_variants(
fasta_record::FASTX.FASTA.Record;
n_variants,
window_size,
variant_size_disbribution,
variant_type_likelihoods
) -> Any
Simulates genetic variants (substitutions, insertions, deletions, inversions) in a DNA sequence.
Arguments
fasta_record
: Input DNA sequence in FASTA format
Keywords
n_variants=√(sequence_length)
: Number of variants to generatewindow_size=sequence_length/n_variants
: Size of windows for variant placementvariant_size_disbribution=Geometric(1/√window_size)
: Distribution for variant sizesvariant_type_likelihoods
: Vector of pairs mapping variant types to probabilities:substitution => 10⁻¹
:insertion => 10⁻²
:deletion => 10⁻²
:inversion => 10⁻²
Returns
DataFrame in VCF format containing simulated variants with columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE
Notes
- Variants are distributed across sequence windows to ensure spread
- Variant sizes are capped by window size
- Equivalent variants are filtered out
- FILTER column indicates variant type
Mycelia.simulate_variants
— Methodsimulate_variants(fasta_file::String) -> String
Simulates genetic variants from sequences in a FASTA file and generates corresponding VCF records.
Arguments
fasta_file::String
: Path to input FASTA file containing sequences to analyze
Details
- Processes each record in the input FASTA file
- Generates simulated variants for each sequence
- Creates a VCF file with the same base name as input file (.vcf extension)
- Updates sequences with simulated variants in a new FASTA file (.vcf.fna extension)
Returns
Path to the modified FASTA file containing sequences with simulated variants
Mycelia.sort_fastq
— Functionsort_fastq(input_fastq) -> String
sort_fastq(input_fastq, output_fastq) -> Any
This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file.
sorts longest to shortest!!
http://thegenomefactory.blogspot.com/2012/11/sorting-fastq-files-by-sequence-length.html
Mycelia.split_gff_attributes_into_columns
— Methodsplit_gff_attributes_into_columns(gff_df) -> Any
Takes a GFF (General Feature Format) DataFrame and expands the attributes column into separate columns.
Arguments
gff_df::DataFrame
: A DataFrame containing GFF data with an 'attributes' column formatted as key-value pairs separated by semicolons (e.g., "ID=gene1;Name=BRCA1;Type=gene")
Returns
DataFrame
: The input DataFrame with additional columns for each attribute key found in the 'attributes' column
Mycelia.subsample_reads_seqkit
— Methodsubsample_reads_seqkit(
;
in_fastq,
out_fastq,
n_reads,
proportion_reads
)
Subsample reads from a FASTQ file using seqkit.
Arguments
in_fastq::String
: Path to input FASTQ fileout_fastq::String=""
: Path to output FASTQ file. If empty, auto-generated based on input filenamen_reads::Union{Missing,Int}=missing
: Number of reads to sampleproportion_reads::Union{Missing,Float64}=missing
: Proportion of reads to sample (0.0-1.0)
Returns
String
: Path to the output FASTQ file
Mycelia.system_mem_to_minimap_index_size
— Methodsystem_mem_to_minimap_index_size(
;
system_mem_gb,
denominator
)
Calculate appropriate minimap index size based on available system memory.
Converts total system memory to a recommended minimap index size by dividing the available memory by a denominator factor. Returns the size as a string with 'G' suffix.
Arguments
system_mem_gb::Number
: Total system memory in gigabytesdenominator::Number=DEFAULT_MINIMAP_DENOMINATOR
: Divisor to scale down memory allocation
Returns
String
: Formatted memory size string (e.g. "4G")
Mycelia.system_overview
— Methodsystem_overview(
;
path
) -> @NamedTuple{system_threads::Int64, julia_threads::Int64, total_memory::String, available_memory::String, occupied_memory::String, total_storage::String, available_storage::String, occupied_storage::String}
Parse a JSONL (JSON Lines) file into a vector of dictionaries.
Arguments
filepath::String
: Path to the JSONL file to parse
Returns
Vector{Dict{String, Any}}
: Vector containing parsed JSON objects, one per line
Description
Reads a JSONL file line by line, parsing each line as a separate JSON object. Uses pre-allocation and progress tracking for efficient processing of large files.
Mycelia.tar_extract
— Methodtar_extract(; tarchive, directory)
Extract contents of a gzipped tar archive file to a specified directory.
Arguments
tarchive::AbstractString
: Path to the .tar.gz file to extractdirectory::AbstractString=dirname(tarchive)
: Target directory for extraction (defaults to the archive's directory)
Returns
AbstractString
: Path to the directory where contents were extracted
Mycelia.taxids2lca
— Methodtaxids2lca(ids::Vector{Int64}) -> Int64
Calculate the Lowest Common Ancestor (LCA) taxonomic ID for a set of input taxonomic IDs.
Arguments
ids::Vector{Int}
: Vector of NCBI taxonomic IDs
Returns
Int
: The taxonomic ID of the lowest common ancestor
Details
Uses taxonkit to compute the LCA. Automatically sets up the required taxonomy database if not already present in ~/.taxonkit/
.
Dependencies
- Requires taxonkit (installed via Bioconda)
- Requires taxonomy database (downloaded automatically if missing)
Mycelia.taxids2ncbi_taxonomy_table
— Methodtaxids2ncbi_taxonomy_table(
taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame
Convert a vector of NCBI taxonomy IDs into a detailed taxonomy table using NCBI Datasets CLI.
Arguments
taxids::AbstractVector{Int}
: Vector of NCBI taxonomy IDs to query
Returns
DataFrame
: Table containing taxonomy information with columns including:- tax_id
- species
- genus
- family
- order
- class
- phylum
- kingdom
Dependencies
Requires ncbi-datasets-cli Conda package (automatically installed if missing)
Mycelia.taxids2taxonkit_summarized_lineage_table
— Methodtaxids2taxonkit_summarized_lineage_table(
taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame
Convert a vector of taxonomy IDs to a summarized lineage table using taxonkit.
Arguments
taxids::AbstractVector{Int}
: Vector of NCBI taxonomy IDs
Returns
DataFrame with the following columns:
taxid
: Original input taxonomy IDspecies_taxid
,species
: Species level taxonomy ID and namegenus_taxid
,genus
: Genus level taxonomy ID and namefamily_taxid
,family
: Family level taxonomy ID and namesuperkingdom_taxid
,superkingdom
: Superkingdom level taxonomy ID and name
Missing values are used when a taxonomic rank is not available.
Mycelia.taxids2taxonkit_taxid2lineage_ranks
— Methodtaxids2taxonkit_taxid2lineage_ranks(
taxids::AbstractVector{Int64}
) -> Dict{Int64, Dict{String, @NamedTuple{lineage::String, taxid::Union{Missing, Int64}}}}
Convert taxonomic IDs to a structured lineage rank mapping.
Takes a vector of taxonomic IDs and returns a nested dictionary mapping each input taxid to its complete taxonomic lineage information. For each taxid, creates a dictionary where:
- Keys are taxonomic ranks (e.g., "species", "genus", "family")
- Values are NamedTuples containing:
lineage::String
: The taxonomic name at that ranktaxid::Union{Int, Missing}
: The corresponding taxonomic ID (if available)
Excludes "no rank" entries from the final output.
Returns: Dict{Int, Dict{String, NamedTuple{(:lineage, :taxid), Tuple{String, Union{Int, Missing}}}}}
Mycelia.taxonomic_id_to_children
— Methodtaxonomic_id_to_children(
tax_id;
DATABASE_ID,
USERNAME,
PASSWORD
)
Query Neo4j database to find all descendant taxonomic IDs for a given taxonomic ID.
Arguments
tax_id
: Source taxonomic ID to find children forDATABASE_ID
: Neo4j database identifier (required)USERNAME
: Neo4j database username (default: "neo4j")PASSWORD
: Neo4j database password (required)
Returns
Vector{Int}
: Sorted array of unique child taxonomic IDs
Mycelia.translate_nucleic_acid_fasta
— Methodtranslate_nucleic_acid_fasta(
fasta_nucleic_acid_file,
fasta_amino_acid_file
) -> Any
Translates nucleic acid sequences from a FASTA file into amino acid sequences.
Arguments
fasta_nucleic_acid_file::String
: Path to input FASTA file containing nucleic acid sequencesfasta_amino_acid_file::String
: Path where the translated amino acid sequences will be written
Returns
String
: Path to the output FASTA file containing translated amino acid sequences
Mycelia.transterm_output_to_gff
— Methodtransterm_output_to_gff(transterm_output) -> Any
Convert TransTerm terminator predictions output to GFF3 format.
Parses TransTerm output and generates a standardized GFF3 file with the following transformations:
- Sets source field to "transterm"
- Sets feature type to "terminator"
- Converts terminator IDs to GFF attributes
- Renames fields to match GFF3 spec
Arguments
transterm_output::String
: Path to the TransTerm output file
Returns
String
: Path to the generated GFF3 file (original filename with .gff extension)
Mycelia.trim_galore
— Methodtrim_galore(
;
outdir,
identifier
) -> Union{Nothing, Base.Process}
Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.
Arguments
outdir::String
: Output directory containing input FASTQ filesidentifier::String
: Prefix for input/output filenames
Input files
Expects paired FASTQ files in outdir
named:
{identifier}_1.fastq.gz
(forward reads){identifier}_2.fastq.gz
(reverse reads)
Output files
Creates trimmed reads in outdir/trim_galore/
:
{identifier}_1_val_1.fq.gz
(trimmed forward reads){identifier}_2_val_2.fq.gz
(trimmed reverse reads)
Dependencies
Requires trim_galore conda environment:
Mycelia.trim_galore_paired
— Methodtrim_galore_paired(; forward_reads, reverse_reads, outdir)
Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.
Arguments
forward_reads::String
: Path to forward reads FASTQ filereverse_reads::String
: Path to reverse reads FASTQ fileoutdir::String
: Output directory for trimmed files
Returns
Tuple{String, String}
: Paths to trimmed forward and reverse read files
Dependencies
Requires trim_galore conda environment:
mamba create -n trim_galore -c bioconda trim_galore
Mycelia.type_to_string
— Methodtype_to_string(T::AbstractString) -> Any
Converts an AbstractString type to its string representation.
Arguments
T::AbstractString
: The string type to convert
Returns
A string representation of the input type
Mycelia.type_to_string
— Methodtype_to_string(T) -> Any
Convert a type to its string representation, with special handling for Kmer types.
Arguments
T
: The type to convert to string
Returns
- String representation of the type
- For Kmer types: Returns "Kmers.DNAKmer{K}" where K is the kmer length
- For other types: Returns the standard string representation
Mycelia.update_bioconda_env
— Methodupdate_bioconda_env(pkg) -> Base.Process
Update a package and its dependencies in its dedicated Conda environment.
Arguments
pkg::String
: Name of the package/environment to update
Mycelia.update_fasta_with_vcf
— Methodupdate_fasta_with_vcf(; in_fasta, vcf_file, out_fasta)
Apply variants from a VCF file to a reference FASTA sequence.
Arguments
in_fasta
: Path to input reference FASTA filevcf_file
: Path to input VCF file containing variantsout_fasta
: Optional output path for modified FASTA. Defaults to replacing '.vcf' with '.normalized.vcf.fna'
Details
- Normalizes indels in the VCF using bcftools norm
- Applies variants to the reference sequence using bcftools consensus
- Handles temporary files and compression with bgzip/tabix
Requirements
Requires bioconda packages: htslib, tabix, bcftools
Returns
Path to the output FASTA file containing the modified sequence
Mycelia.update_gff_with_mmseqs
— Methodupdate_gff_with_mmseqs(gff_file, mmseqs_file)
Update GFF annotations with protein descriptions from MMseqs2 search results.
Arguments
gff_file::String
: Path to input GFF3 format filemmseqs_file::String
: Path to MMseqs2 easy-search output file
Returns
DataFrame
: Modified GFF table with updated attribute columns containing protein descriptions
Details
Takes sequence matches from MMseqs2 and adds their descriptions as 'label' and 'product' attributes in the GFF file. Only considers top hits from MMseqs2 results. Preserves existing GFF attributes while prepending new annotations.
Mycelia.upload_edge_type_over_url_from_graph
— Methodupload_edge_type_over_url_from_graph(
;
src_type,
dst_type,
edge_type,
graph,
ADDRESS,
USERNAME,
PASSWORD,
DATABASE,
window_size
)
Upload edges of a specific type from a MetaGraph to a Neo4j database, batching uploads in windows.
Arguments
src_type
: Type of source nodes to filterdst_type
: Type of destination nodes to filteredge_type
: Type of edges to uploadgraph
: MetaGraph containing the nodes and edgesADDRESS
: Neo4j server URLUSERNAME
: Neo4j username (default: "neo4j")PASSWORD
: Neo4j passwordDATABASE
: Neo4j database name (default: "neo4j")window_size
: Number of edges to upload in each batch (default: 100)
Details
- Filters edges based on source, destination and edge types
- Preserves all edge properties except :TYPE when uploading
- Uses MERGE operations to avoid duplicate nodes/relationships
- Uploads are performed in batches for better performance
- Progress is shown via ProgressMeter
Returns
Nothing
Mycelia.upload_node_over_api
— Methodupload_node_over_api(
graph,
v;
ADDRESS,
USERNAME,
PASSWORD,
DATABASE
)
Upload a single node from a MetaGraph to a Neo4j database using the HTTP API.
Arguments
graph
: MetaGraph containing the node to be uploadedv
: Vertex identifier in the graphADDRESS
: Neo4j server address (e.g. "http://localhost:7474")USERNAME
: Neo4j authentication username (default: "neo4j")PASSWORD
: Neo4j authentication passwordDATABASE
: Target Neo4j database name (default: "neo4j")
Details
Generates and executes a Cypher MERGE command using the node's properties. The node's :TYPE and :identifier properties are used for node labeling, while other non-empty properties are added as node properties.
Mycelia.upload_node_table
— Methodupload_node_table(
;
table,
window_size,
address,
password,
username,
database,
neo4j_import_dir
)
Upload a DataFrame to Neo4j as nodes in batched windows.
Arguments
table::DataFrame
: Input DataFrame where each row becomes a node. Must contain a "TYPE" column.address::String
: Neo4j server address (e.g. "bolt://localhost:7687")password::String
: Neo4j database passwordneo4j_import_dir::String
: Directory path accessible to Neo4j for importing fileswindow_size::Int=1000
: Number of rows to process in each batchusername::String="neo4j"
: Neo4j database usernamedatabase::String="neo4j"
: Target Neo4j database name
Notes
- All rows must have the same node type (specified in "TYPE" column)
- Column names become node properties
- Requires write permissions on neo4jimportdir
- Large tables are processed in batches of size window_size
Mycelia.upload_node_type_over_url_from_graph
— Methodupload_node_type_over_url_from_graph(
;
node_type,
graph,
ADDRESS,
USERNAME,
PASSWORD,
DATABASE,
window_size
)
Upload nodes of a specific type from a graph to a Neo4j database using MERGE operations.
Arguments
node_type
: The type label for the nodes to uploadgraph
: Source MetaGraph containing the nodesADDRESS
: Neo4j server address (e.g. "bolt://localhost:7687")PASSWORD
: Neo4j database passwordUSERNAME="neo4j"
: Neo4j username (defaults to "neo4j")DATABASE="neo4j"
: Target Neo4j database name (defaults to "neo4j")window_size=100
: Number of nodes to upload in each batch (defaults to 100)
Details
Performs batched uploads of nodes using Neo4j MERGE operations. Node properties are automatically extracted from the graph vertex properties, excluding the 'TYPE' property.
Mycelia.upload_nodes_over_api
— Methodupload_nodes_over_api(
graph;
ADDRESS,
USERNAME,
PASSWORD,
DATABASE
)
Uploads all nodes from the given graph to a specified API endpoint.
Arguments
graph
: The graph containing the nodes to be uploaded.ADDRESS
: The API endpoint address.USERNAME
: The username for authentication (default: "neo4j").PASSWORD
: The password for authentication.DATABASE
: The database name (default: "neo4j").
Mycelia.upload_nodes_to_neo4j
— Methodupload_nodes_to_neo4j(
;
graph,
address,
username,
password,
format,
database,
neo4j_import_directory
)
Upload all nodes from a MetaGraph to a Neo4j database, processing each unique node type separately.
Arguments
graph
: A MetaGraph containing nodes to be uploadedaddress
: Neo4j server address (e.g., "bolt://localhost:7687")username
: Neo4j authentication username (default: "neo4j")password
: Neo4j authentication passwordformat
: Data format for upload (default: "auto")database
: Target Neo4j database name (default: "neo4j")neo4j_import_directory
: Path to Neo4j's import directory for bulk loading
Mycelia.vcat_with_missing
— Methodvcat_with_missing(
dfs::DataFrames.AbstractDataFrame...
) -> Union{DataFrames.DataFrame, Vector{Any}}
Vertically concatenate DataFrames with different column structures by automatically handling missing values.
Arguments
dfs
: Variable number of DataFrames to concatenate vertically
Returns
DataFrame
: Combined DataFrame containing all rows and columns from input DataFrames, withmissing
values where columns didn't exist in original DataFrames
Mycelia.visualize_genome_coverage
— Methodvisualize_genome_coverage(coverage_table) -> Any
Creates a multi-panel visualization of genome coverage across chromosomes.
Arguments
coverage_table
: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data
Returns
Plots.Figure
: A composite figure with coverage plots for each chromosome
Details
Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.
Mycelia.viterbi_maximum_likelihood_traversals
— Methodviterbi_maximum_likelihood_traversals(
stranded_kmer_graph;
error_rate,
verbosity
) -> Vector{FASTX.FASTA.Record}
Finds maximum likelihood paths through a stranded k-mer graph using the Viterbi algorithm to correct sequencing errors.
Arguments
stranded_kmer_graph
: A directed graph where vertices represent k-mers and edges represent overlapserror_rate::Float64
: Expected per-base error rate (default: 1/(k+1)). Must be < 0.5verbosity::String
: Output detail level ("debug", "reads", or "dataset")
Returns
Vector of FASTX.FASTA.Record containing error-corrected sequences
Details
- Uses dynamic programming to find most likely path through k-mer graph
- Accounts for matches, mismatches, insertions and deletions
- State likelihoods based on k-mer coverage counts
- Transition probabilities derived from error rate
- Progress tracking based on verbosity level
Notes
- Error rate should be probability of error (e.g. 0.01 for 1%), not accuracy
- Higher verbosity levels ("debug", "reads") provide detailed path finding information
- "dataset" verbosity shows only summary statistics
Mycelia.wcss
— Methodwcss(clustering_result) -> Any
Calculate the Within-Cluster Sum of Squares (WCSS) for a clustering result.
Arguments
clustering_result
: A clustering result object containing:counts
: Vector with number of points in each clusterassignments
: Vector of cluster assignments for each pointcosts
: Vector of distances/costs from each point to its cluster center
Returns
Float64
: The total within-cluster sum of squared distances
Description
WCSS measures the compactness of clusters by summing the squared distances between each data point and its assigned cluster center.
Mycelia.write_fasta
— Methodwrite_fasta(; outfile, records, gzip)
Writes FASTA records to a file, optionally gzipped.
Arguments
outfile::AbstractString
: Path to the output FASTA file. Will append ".gz" ifgzip
is true.records::Vector{FASTX.FASTA.Record}
: A vector of FASTA records.gzip::Bool=false
: Whether to compress the output with gzip.
Returns
outfile::String
: The path to the output FASTA file (including ".gz" if applicable).
Mycelia.write_gff
— Methodwrite_gff(; gff, outfile)
Write GFF (General Feature Format) data to a tab-delimited file.
Arguments
gff
: DataFrame/Table containing GFF formatted dataoutfile
: String path where the output file should be written
Returns
String
: Path to the written output file
Mycelia.write_vcf_table
— Methodwrite_vcf_table(; vcf_file, vcf_table, fasta_file)
Write variant data to a VCF v4.3 format file.
Arguments
vcf_file::String
: Output path for the VCF filevcf_table::DataFrame
: Table containing variant data with standard VCF columnsfasta_file::String
: Path to the reference genome FASTA file
Details
Automatically filters out equivalent variants where REF == ALT. Includes standard VCF headers for substitutions, insertions, deletions, and inversions. Adds GT (Genotype) and GQ (Genotype Quality) format fields.
Mycelia.xam_to_contig_mapping_stats
— Methodxam_to_contig_mapping_stats(xam) -> Any
Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.
Arguments
xam
: Path to XAM file or XAM object
Returns
A DataFrame with per-contig statistics including:
n_aligned_reads
: Number of aligned readstotal_aligned_bases
: Sum of alignment lengthstotal_alignment_score
: Sum of alignment scores- Mapping quality statistics (mean, std, median)
- Alignment length statistics (mean, std, median)
- Alignment score statistics (mean, std, median)
- Percent mismatches statistics (mean, std, median)
Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.