Mycelia Documentation

Welcome to the Mycelia documentation.

Installation

Install Julia (if not already installed)

I have had trouble getting the visualization libraries Plots.jl and Makie.jl (and associated packages) to load correctly on HPC due to the complexities of the default LD_LIBRARY_PATH

I imagine other research supercomputer users may have similar issues, although I don't have these issues on cloud vendors like GCP or AWS

To enable Julia to install all of it's own necessary dependencies independent of the system, I reset the LD_LIBRARY_PATH variable prior to launching Julia !!

This can be done easily when launching Julia from the command line by

export LD_LIBRARY_PATH="" && julia

And can be done for Julia jupyter kernels by setting the env key => value pair in the appropriate kernel.json file

Clone the repo as a Julia package

import Pkg
# for production usage
Pkg.add(url="https://github.com/cjprybol/Mycelia.git")
# for development
Pkg.develop(url="git@github.com:cjprybol/Mycelia.git")

Function Docstrings

Mycelia._detect_sequence_extensionMethod
_detect_sequence_extension(sequence_type::Symbol) -> String

Internal helper function to convert sequence type to file extension.

Arguments

  • sequence_type: Symbol representing sequence type (:DNA, :RNA, or :AA)

Returns

  • String: Appropriate file extensions
source
Mycelia.add_bioconda_envMethod
add_bioconda_env(pkg; force) -> Union{Nothing, Base.Process}

Create a new Conda environment with a specified Bioconda package.

Arguments

  • pkg::String: Package name to install. Can include channel specification using the format "channel::package"

Keywords

  • force::Bool=false: If true, recreates the environment even if it already exists

Details

The function creates a new Conda environment named after the package and installs the package into it. It uses channel priority: conda-forge > bioconda > defaults. If CONDA_RUNNER is set to 'mamba', it will ensure mamba is installed first.

Examples

# Install basic package
add_bioconda_env("blast")

# Install from specific channel
add_bioconda_env("bioconda::blast")

# Force reinstallation
add_bioconda_env("blast", force=true)

Notes

  • Requires Conda.jl to be installed and configured
  • Uses CONDA_RUNNER global variable to determine whether to use conda or mamba
  • Cleans conda cache after installation
source
Mycelia.add_edgemer_to_graph!Method
add_edgemer_to_graph!(
    graph,
    record_identifier,
    index,
    observed_edgemer
) -> Any

Add an observed edgemer to a graph with its associated metadata.

Arguments

  • graph::MetaGraph: The graph to modify
  • record_identifier: Identifier for the source record
  • index: Position where edgemer was observed
  • observed_edgemer: The biological sequence representing the edgemer

Details

Processes the edgemer by:

  1. Splitting it into source and destination kmers
  2. Converting kmers to their canonical forms
  3. Creating or updating an edge with orientation metadata
  4. Storing observation details (record, position, orientation)

Returns

Modified graph with the new edge and metadata

Note

If the edge already exists, the observation is added to the existing metadata.

source
Mycelia.add_fastx_records_to_graph!Method
add_fastx_records_to_graph!(graph, fastxs) -> Any

Add FASTX records from multiple files as a graph property.

Arguments

  • graph: A MetaGraph that will store the FASTX records
  • fastxs: Collection of FASTA/FASTQ file paths to process

Details

Creates a dictionary mapping sequence descriptions to their corresponding FASTX records, then stores this dictionary as a graph property under the key :records. Multiple input files are merged, with later files overwriting records with duplicate descriptions.

Returns

The modified graph with added records property.

source
Mycelia.add_record_edgemers_to_graph!Method
add_record_edgemers_to_graph!(graph) -> Any

Processes DNA sequence records stored in the graph and adds their edgemers (k+1 length subsequences) to build the graph structure.

Arguments

  • graph: A Mycelia graph object containing DNA sequence records and graph properties

Details

  • Uses the k-mer size specified in graph.gprops[:k] to generate k+1 length edgemers
  • Iterates through each record in graph.gprops[:records]
  • For each record, generates all possible overlapping edgemers
  • Adds each edgemer to the graph with its position and record information

Returns

  • The modified graph object with added edgemer information
source
Mycelia.amino_acids_to_codonsMethod
amino_acids_to_codons(

) -> Dict{BioSymbols.AminoAcid, DataType}

Creates a mapping from amino acids to representative DNA codons using the standard genetic code.

Returns

  • Dictionary mapping each amino acid (including stop codon AA_Term) to a valid DNA codon that encodes it
source
Mycelia.annotate_aa_fastaMethod
annotate_aa_fasta(
;
    fasta,
    identifier,
    basedir,
    mmseqsdb,
    threads
)

Annotate amino acid sequences in a FASTA file using MMseqs2 search against UniRef50 database.

Arguments

  • fasta: Path to input FASTA file containing amino acid sequences
  • identifier: Name for the output directory (defaults to FASTA filename without extension)
  • basedir: Base directory for output (defaults to current directory)
  • mmseqsdb: Path to MMseqs2 formatted UniRef50 database (defaults to ~/workspace/mmseqs/UniRef50)
  • threads: Number of CPU threads to use (defaults to system thread count)

Returns

  • Path to the output directory containing MMseqs2 search results

The function creates a new directory named by identifier under basedir, copies the input FASTA file, and runs MMseqs2 easy-search against the specified database. If the output directory already exists, the function skips processing and returns the directory path.

source
Mycelia.annotate_fastaMethod
annotate_fasta(
;
    fasta,
    identifier,
    basedir,
    mmseqsdb,
    threads
)

Perform comprehensive annotation of a FASTA file including gene prediction, protein homology search, and terminator prediction.

Arguments

  • fasta::String: Path to input FASTA file
  • identifier::String: Unique identifier for output directory (default: FASTA filename without extension)
  • basedir::String: Base directory for output (default: current working directory)
  • mmseqsdb::String: Path to MMseqs2 UniRef50 database (default: "~/workspace/mmseqs/UniRef50")
  • threads::Int: Number of CPU threads to use (default: all available)

Processing Steps

  1. Creates output directory and copies input FASTA
  2. Runs Prodigal for gene prediction (nucleotide, amino acid, and GFF output)
  3. Performs MMseqs2 homology search against UniRef50
  4. Predicts terminators using TransTerm
  5. Combines annotations into a unified GFF file
  6. Generates GenBank format output

Returns

  • String: Path to the output directory containing all generated files

Files Generated

  • .prodigal.fna: Predicted genes (nucleotide)
  • .prodigal.faa: Predicted proteins
  • .prodigal.gff: Prodigal GFF annotations
  • .gff: Combined annotations
  • .gff.genbank: Final GenBank format
source
Mycelia.assess_alignmentMethod
assess_alignment(
    a,
    b
) -> @NamedTuple{total_matches::Int64, total_edits::Int64}

Aligns two sequences using the Levenshtein distance and returns the total number of matches and edits.

Arguments

  • a::AbstractString: The first sequence to be aligned.
  • b::AbstractString: The second sequence to be aligned.

Returns

  • NamedTuple{(:total_matches, :total_edits), Tuple{Int, Int}}: A named tuple containing:
    • total_matches::Int: The total number of matching bases in the alignment.
    • total_edits::Int: The total number of edits (insertions, deletions, substitutions) in the alignment.
source
Mycelia.assess_alignment_accuracyMethod
assess_alignment_accuracy(alignment_result) -> Any

Return proportion of matched bases in alignment to total matches + edits.

Calculate the accuracy of a sequence alignment by computing the ratio of matched bases to total alignment operations (matches + edits).

Arguments

  • alignment_result: Alignment result object containing total_matches and total_edits fields

Returns

Float64 between 0.0 and 1.0 representing alignment accuracy, where:

  • 1.0 indicates perfect alignment (all matches)
  • 0.0 indicates no matches
source
Mycelia.assess_assembly_kmer_qualityMethod
assess_assembly_kmer_quality(; assembly, observations, ks)

Evaluate genome assembly quality by comparing k-mer distributions between assembled sequences and raw observations.

Arguments

  • assembly: Input assembled sequences to evaluate
  • observations: Raw sequencing data for comparison
  • ks::Vector{Int}: Vector of k-mer sizes to analyze (default: k=17 to 23)

Returns

DataFrame containing quality metrics for each k-mer size:

  • k: K-mer length used
  • cosine_distance: Cosine similarity between k-mer distributions
  • js_divergence: Jensen-Shannon divergence between distributions
  • qv: MerQury-style quality value score
source
Mycelia.assess_dnamer_saturationMethod
assess_dnamer_saturation(
    fastx::AbstractString;
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess
)

Analyzes k-mer saturation in a FASTA/FASTQ file to determine optimal k-mer size.

Arguments

  • fastx::AbstractString: Path to input FASTA/FASTQ file
  • power::Int=10: Exponent for downsampling k-mers (2^power)
  • outdir::String="": Output directory for results. Uses current directory if empty
  • min_k::Int=3: Minimum k-mer size to evaluate
  • max_k::Int=17: Maximum k-mer size to evaluate
  • threshold::Float64=0.1: Saturation threshold for k-mer assessment
  • kmers_to_assess::Int=10_000_000: Maximum number of k-mers to sample

Returns

Dict{Int,Float64}: Dictionary mapping k-mer sizes to their saturation scores

source
Mycelia.assess_dnamer_saturationMethod
assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString},
    kmer_type;
    kmers_to_assess,
    power,
    min_count
) -> Union{@NamedTuple{sampling_points::Vector{Int64}, unique_kmer_counts::Vector{Int64}}, NamedTuple{(:sampling_points, :unique_kmer_counts, :eof), <:Tuple{Vector, Vector{Int64}, Bool}}}

Assess k-mer saturation in DNA sequences from FASTX files.

Arguments

  • fastxs::AbstractVector{<:AbstractString}: Vector of paths to FASTA/FASTQ files
  • kmer_type: Type of k-mer to analyze (e.g., DNAKmer{21})
  • kmers_to_assess=Inf: Maximum number of k-mers to process
  • power=10: Base for exponential sampling intervals
  • min_count=1: Minimum count threshold for considering a k-mer

Returns

Named tuple containing:

  • sampling_points::Vector{Int}: K-mer counts at which samples were taken
  • unique_kmer_counts::Vector{Int}: Number of unique canonical k-mers at each sampling point
  • eof::Bool: Whether the entire input was processed

Details

Analyzes k-mer saturation by counting unique canonical k-mers at exponentially spaced intervals (powers of power). Useful for assessing sequence complexity and coverage. Returns early if all possible k-mers are observed.

source
Mycelia.assess_dnamer_saturationMethod
assess_dnamer_saturation(
    fastxs::AbstractVector{<:AbstractString};
    power,
    outdir,
    min_k,
    max_k,
    threshold,
    kmers_to_assess,
    plot
)

Analyze k-mer saturation in DNA sequences to determine optimal k value.

Arguments

  • fastxs: Vector of paths to FASTA/FASTQ files to analyze
  • power: Base of logarithmic sampling points (default: 10)
  • outdir: Optional output directory for plots and results
  • min_k: Minimum k-mer size to test (default: 7)
  • max_k: Maximum k-mer size to test (default: 17)
  • threshold: Saturation threshold to determine optimal k (default: 0.1)
  • kmers_to_assess: Maximum number of k-mers to sample (default: 10M)
  • plot: Whether to generate saturation curves (default: true)

Returns

Integer representing the first k value that achieves saturation below threshold. If no k value meets the threshold, returns the k with minimum saturation.

Details

  • Tests only prime k values between mink and maxk
  • Generates saturation curves using logarithmic sampling
  • Fits curves to estimate maximum unique k-mers
  • If outdir is provided, saves plots as SVG and chosen k value to text file
source
Mycelia.assess_optimal_kmer_alignmentMethod
assess_optimal_kmer_alignment(
    kmer,
    observed_kmer
) -> Tuple{@NamedTuple{total_matches::Int64, total_edits::Int64}, Union{Missing, Bool}}

Used to determine which orientation provides an optimal alignment for initiating path likelihood analyses in viterbi analysis

Compare alignment scores between a query k-mer and an observed k-mer in both forward and reverse complement orientations to determine optimal alignment.

Arguments

  • kmer: Query k-mer sequence to align
  • observed_kmer: Target k-mer sequence to align against

Returns

A tuple containing:

  • alignment_result: The alignment result object for the optimal orientation
  • orientation: Boolean indicating orientation (true = forward, false = reverse complement, missing = tied scores)

Details

  • Performs pairwise alignment in both orientations using assess_alignment()
  • Calculates accuracy scores using assess_alignment_accuracy()
  • For tied alignment scores, randomly selects one orientation
  • Uses BioSequences.reverse_complement for reverse orientation comparison
source
Mycelia.bam_to_fastqMethod
bam_to_fastq(; bam, fastq)

Convert a BAM file to FASTQ format with gzip compression.

Arguments

  • bam: Path to input BAM file
  • fastq: Optional output path. Defaults to input path with ".fq.gz" extension

Returns

  • Path to the generated FASTQ file

Details

  • Uses samtools through conda environment
  • Automatically skips if output file exists
  • Output is gzip compressed
  • Requires samtools to be available via conda
source
Mycelia.bandage_visualizeMethod
bandage_visualize(; gfa, img)

Generate a visualization of a genome assembly graph using Bandage.

Arguments

  • gfa: Path to input GFA (Graphical Fragment Assembly) file
  • img: Optional output image path. Defaults to GFA filename with .png extension

Returns

  • Path to the generated image file
source
Mycelia.biosequences_to_counts_tableMethod
biosequences_to_counts_table(; biosequences, k)

Convert a collection of biological sequences into a k-mer count matrix.

Arguments

  • biosequences: Vector of biological sequences (DNA, RNA, or Amino Acids)
  • k: Length of k-mers to count

Returns

Named tuple with:

  • sorted_kmers: Vector of all unique k-mers found, lexicographically sorted
  • kmer_counts_matrix: Sparse matrix where rows are k-mers and columns are sequences

Details

  • For DNA sequences, counts canonical k-mers (both strands)
  • Uses parallel processing with Thread-safe progress tracking
  • Memory efficient sparse matrix representation
  • Supports DNA, RNA and Amino Acid sequences
source
Mycelia.biosequences_to_dense_counts_tableMethod
biosequences_to_dense_counts_table(; biosequences, k)

Convert a collection of biological sequences into a dense k-mer count matrix.

Arguments

  • biosequences: Collection of DNA, RNA, or amino acid sequences (BioSequence types)
  • k::Integer: Length of k-mers to count (must be ≤ 13)

Returns

Named tuple containing:

  • sorted_kmers: Vector of all possible k-mers in sorted order
  • kmer_counts_matrix: Dense matrix where rows are k-mers and columns are sequences

Details

  • For DNA sequences, counts canonical k-mers (both strands)
  • For RNA and protein sequences, counts exact k-mers
  • Uses parallel processing with threads
  • For k > 13, use fasta_list_to_sparse_counts_table instead
source
Mycelia.blastdb2tableMethod
blastdb2table(; blastdb, outfile, force)

Convert a BLAST database to Arrow format with sequence and taxonomy information. Uses a simple serial approach with direct Arrow streaming for minimal memory usage.

Arguments

  • blastdb::String: Path to the BLAST database
  • outfile::String="": Output file path. If empty, generates name based on input database
  • force::Bool=false: Whether to overwrite existing output file

Returns

  • String: Path to the generated output file (.arrow)

Output Format

Arrow file containing columns (in this order):

  • sequence SHA256
  • sequence hash
  • sequence id
  • accession
  • gi
  • sequence title
  • BLAST name
  • taxid
  • taxonomic super kingdom
  • scientific name
  • scientific names for leaf-node taxids
  • common taxonomic name
  • common taxonomic names for leaf-node taxids
  • leaf-node taxids
  • membership integer
  • ordinal id
  • PIG
  • sequence length
  • sequence
source
Mycelia.blastdb_to_fastaMethod
blastdb_to_fasta(
;
    blastdb,
    entries,
    taxids,
    outfile,
    force,
    max_cores
)

Convert a BLAST database to FASTA format.

Arguments

  • blastdb::String: Name of the BLAST database to convert (e.g. "nr", "nt")
  • dbdir::String: Directory containing the BLAST database files
  • outfile::String: Path for the output FASTA file

Returns

  • Path to the generated FASTA file as String
source
Mycelia.build_directed_kmer_graphMethod
build_directed_kmer_graph(; fastq, k, plot)

Constructs a directed graph representation of k-mer transitions from FASTQ sequencing data.

Arguments

  • fastq: Path to input FASTQ file
  • k: K-mer size (default: 1). Must be odd and prime. If k=1, optimal size is auto-determined
  • plot: Boolean to display quality distribution plot (default: false)

Returns

MetaDiGraph with properties:

  • assembly_k: k-mer size used
  • kmer_counts: frequency of each k-mer
  • transition_likelihoods: edge weights between k-mers
  • kmermeanquality, kmertotalquality: quality metrics
  • branchingnodes, unbranchingnodes: topological classification
  • likelyvalidkmer_indices: k-mers above mean quality threshold
  • likelysequencingartifact_indices: potential erroneous k-mers

Note

For DNA assembly, quality scores are normalized across both strands.

source
Mycelia.build_stranded_kmer_graphMethod
build_stranded_kmer_graph(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}}
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Create a weighted, strand-specific kmer (de bruijn) graph from a set of kmers and a series of sequence observations in FASTA format.

source
Mycelia.canonicalMethod
canonical(
    qmer::Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
) -> Mycelia.Qualmer{<:Kmers.Kmer{BioSequences.AminoAcidAlphabet}}
source
Mycelia.canonicalMethod
canonical(
    qmer::Mycelia.Qualmer{KmerT<:Union{Kmers.DNAKmer, Kmers.RNAKmer}}
) -> Mycelia.Qualmer{KmerT} where KmerT<:Kmers.Kmer

Get the canonical representation of a DNA qualmer, considering both sequence and quality. For DNA/RNA, this involves potentially reverse-complementing the k-mer and reversing the quality scores accordingly.

source
Mycelia.canonicalize_kmer_counts!Method
canonicalize_kmer_counts!(kmer_counts) -> Any

Canonicalizes the k-mer counts in the given dictionary.

This function iterates over the provided dictionary kmer_counts, which maps k-mers to their respective counts. For each k-mer that is not in its canonical form, it converts the k-mer to its canonical form and updates the count in the dictionary accordingly. If the canonical form of the k-mer already exists in the dictionary, their counts are summed. The original non-canonical k-mer is then removed from the dictionary.

Arguments

  • kmer_counts::Dict{BioSequences.Kmer, Int}: A dictionary where keys are k-mers and values are their counts.

Returns

  • The input dictionary kmer_counts with all k-mers in their canonical form, sorted by k-mers.
source
Mycelia.canonicalize_kmer_countsMethod
canonicalize_kmer_counts(kmer_counts) -> Any

Normalize k-mer counts into a canonical form by creating a non-mutating copy.

Arguments

  • kmer_counts: Dictionary or collection of k-mer count data

Returns

  • A new normalized k-mer count collection
source
Mycelia.chromosome_coverage_table_to_plotMethod
chromosome_coverage_table_to_plot(cdf) -> Plots.Plot

Creates a visualization of chromosome coverage data with statistical thresholds.

Arguments

  • cdf::DataFrame: Coverage data frame containing columns:
    • index: Chromosome position indices
    • depth: Coverage depth values
    • chromosome: Chromosome identifier
    • mean_coverage: Mean coverage value
    • std_coverage: Standard deviation of coverage
    • : Boolean vector indicating +3 sigma regions
    • -3σ: Boolean vector indicating -3 sigma regions

Returns

  • A StatsPlots plot object showing:
    • Raw coverage data (black line)
    • Mean coverage and ±1,2,3σ thresholds (rainbow colors)
    • Highlighted regions exceeding ±3σ thresholds (red vertical lines)
source
Mycelia.codon_optimizeMethod
codon_optimize(
;
    normalized_codon_frequencies,
    protein_sequence,
    n_iterations
)

Optimizes the DNA sequence encoding for a given protein sequence using codon usage frequencies.

Arguments

  • normalized_codon_frequencies: Dictionary mapping amino acids to their codon frequencies
  • protein_sequence::BioSequences.LongAA: Target protein sequence to optimize
  • n_iterations::Integer: Number of optimization iterations to perform

Algorithm

  1. Creates initial DNA sequence through reverse translation
  2. Iteratively generates new sequences by sampling codons based on their frequencies
  3. Keeps track of the sequence with highest codon usage likelihood

Returns

  • BioSequences.LongDNA{2}: Optimized DNA sequence encoding the input protein
source
Mycelia.codons_to_amino_acidsMethod
codons_to_amino_acids(

) -> Union{Dict{_A, BioSequences.LongAA} where _A, Dict{_A, V} where {_A, V<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})}}

Creates a mapping between DNA codons and their corresponding amino acids using the standard genetic code.

Returns a dictionary where:

  • Keys are 3-letter DNA codons (e.g., "ATG")
  • Values are the corresponding amino acids from BioSequences.jl
source
Mycelia.concatenate_filesMethod
concatenate_files(; files, file)

Join fasta files without any regard to record uniqueness.

A cross-platform version of cat *.fasta > joint.fasta

See mergefastafiles

Concatenate multiple FASTA files into a single output file by simple appending.

Arguments

  • files: Vector of paths to input FASTA files
  • file: Path where the concatenated output will be written

Returns

  • Path to the output concatenated file

Details

Platform-independent implementation of cat *.fasta > combined.fasta. Files are processed sequentially with a progress indicator.

source
Mycelia.contig_is_circularMethod
contig_is_circular(
    graph_file::String,
    contig_name::String
) -> Any

Returns bool indicating whether the contig is a circle

graphfile = path to assembly graph.gfa file contigname = name of the contig

Determine if a contig represents a circular structure in the assembly graph.

A circular contig is one where the sequence forms a complete loop in the assembly graph, typically representing structures like plasmids, circular chromosomes, or other circular DNA elements.

Arguments

  • graph_file::String: Path to the assembly graph in GFA format
  • contig_name::String: Name/identifier of the contig to check

Returns

  • Bool: true if the contig forms a circular structure, false otherwise
source
Mycelia.contig_is_cleanly_assembledMethod
contig_is_cleanly_assembled(
    graph_file::String,
    contig_name::String
) -> Bool

Returns bool indicating whether the contig is cleanly assembled.

By cleanly assembled we mean that the contig does not have other contigs attached in the same connected component.

graphfile = path to assembly graph.gfa file contigname = name of the contig

Check if a contig exists in isolation within its connected component in an assembly graph.

Arguments

  • graph_file::String: Path to the assembly graph file in GFA format
  • contig_name::String: Name/identifier of the contig to check

Returns

  • Bool: true if the contig exists alone in its connected component, false otherwise

Details

A contig is considered "cleanly assembled" if it appears as a single entry in the assembly graph's connected components. This function parses the GFA file and checks the contig's isolation status using the graph structure.

source
Mycelia.convertMethod
convert(args)

Convert between different graph file formats.

Arguments

  • args: Dictionary with required keys:
    • "in": Input filepath (supported: .jld2, .gfa, .neo4j)
    • "out": Output filepath (supported: .jld2, .gfa, .neo4j)

Details

Performs format conversion based on file extensions. For non-JLD2 to non-JLD2 conversions, uses JLD2 as an intermediate format.

source
Mycelia.convert_sequenceMethod
convert_sequence(seq::AbstractString) -> Any

Converts the given sequence (output from FASTX.sequence) into the appropriate BioSequence type:

  • DNA sequences are converted using BioSequences.LongDNA
  • RNA sequences are converted using BioSequences.LongRNA
  • AA sequences are converted using BioSequences.LongAA
source
Mycelia.copy_to_tempdirMethod
copy_to_tempdir(file_path::String) -> String

Create a copy of a file in a temporary directory while preserving the original filename.

Arguments

  • file_path::String: Path to the source file to be copied

Returns

  • String: Path to the newly created temporary file
source
Mycelia.copy_with_unique_identifierMethod
copy_with_unique_identifier(
    infile,
    out_directory,
    unique_identifier;
    force
) -> Any

Copy a file to a new location with a unique identifier prepended to the filename.

Arguments

  • infile::AbstractString: Path to the source file to copy
  • out_directory::AbstractString: Destination directory for the copied file
  • unique_identifier::AbstractString: String to prepend to the filename
  • force::Bool=true: If true, overwrite existing files

Returns

  • String: Path to the newly created file
source
Mycelia.count_canonical_kmersMethod
count_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any

Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer size and structure
  • sequences: Iterator of biological sequences to analyze

Returns

  • Dict{KMER_TYPE,Int}: Dictionary mapping canonical k-mers to their counts
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    fastx_file::AbstractString
) -> Any

Count k-mers in a FASTA/FASTQ file and return their frequencies.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
  • fastx_file: Path to input FASTA/FASTQ file

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping each k-mer to its frequency
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

  • Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
  • sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in a DNA sequence.

Arguments

  • ::Type{Kmers.Kmer{A,K}}: K-mer type with alphabet A and length K
  • sequence::BioSequences.LongSequence: Input DNA sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Type Parameters

  • A <: BioSequences.DNAAlphabet: DNA alphabet type
  • K: Length of k-mers
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in an RNA sequence.

Arguments

  • Kmer: Type parameter specifying the k-mer length K and RNA alphabet
  • sequence: Input RNA sequence to analyze

Returns

  • Dict{Kmers.Kmer, Int}: Sorted dictionary mapping each k-mer to its frequency count
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any

Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length and encoding (e.g., DNAKmer{4} for 4-mers)
  • sequences: A FASTA or FASTQ reader containing the biological sequences to analyze

Returns

A dictionary mapping k-mers to their counts in the input sequences

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

  • Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
  • sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any

Count k-mers across multiple FASTA/FASTQ files and merge the results.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{4} for 4-mers)
  • fastx_files: Vector of paths to FASTA/FASTQ files

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping k-mers to their total counts across all files
source
Mycelia.count_kmersMethod
count_kmers(
    _::Type{KMER_TYPE},
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any

Count k-mers across multiple sequence records and return a sorted frequency table.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{3} for 3-mers)
  • records: Vector of FASTA/FASTQ records to analyze

Returns

  • Dict{KMER_TYPE, Int}: Sorted dictionary mapping k-mers to their frequencies
source
Mycelia.count_matrix_to_probability_matrixMethod
count_matrix_to_probability_matrix(counts_matrix) -> Any

Convert a matrix of counts into a probability matrix by normalizing each column to sum to 1.0.

Arguments

  • counts_matrix::Matrix{<:Number}: Input matrix where each column represents counts/frequencies

Returns

  • Matrix{Float64}: Probability matrix where each column sums to 1.0
source
Mycelia.count_recordsMethod
count_records(fastx) -> Int64

Counts the total number of records in a FASTA/FASTQ file.

Arguments

  • fastx: Path to a FASTA or FASTQ file (can be gzipped)

Returns

  • Number of records (sequences) in the file
source
Mycelia.countmap_columnsMethod
countmap_columns(table)

Generate and display frequency counts for all columns in a DataFrame.

Arguments

  • table::DataFrame: Input DataFrame to analyze

Details

Iterates through each column in the DataFrame and displays:

  1. The column name
  2. A Dict mapping unique values to their frequencies using StatsBase.countmap
source
Mycelia.create_databaseMethod
create_database(; database, address, username, password)

Creates a new Neo4j database instance if it doesn't already exist.

Arguments

  • database::String: Name of the database to create
  • address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
  • username::String: Neo4j authentication username (defaults to "neo4j")
  • password::String: Neo4j authentication password

Notes

  • Requires system database privileges to execute
  • Silently returns if database already exists
  • Temporarily switches to system database to perform creation
source
Mycelia.create_node_constraintsMethod
create_node_constraints(
    graph;
    address,
    username,
    password,
    database
)

Creates unique identifier constraints for each node type in a Neo4j database.

Arguments

  • graph: A MetaGraph containing nodes with TYPE properties
  • address: Neo4j server address
  • username: Neo4j username (default: "neo4j")
  • password: Neo4j password
  • database: Neo4j database name (default: "neo4j")

Details

Extracts unique node types from the graph and creates Neo4j constraints ensuring each node of a given type has a unique identifier property.

Failed constraint creation attempts are silently skipped.

source
Mycelia.create_tarchiveMethod
create_tarchive(; directory, tarchive)

Creates a gzipped tar archive of the specified directory along with verification files.

Arguments

  • directory: Source directory path to archive
  • tarchive: Optional output archive path (defaults to directory name with .tar.gz extension)

Generated Files

  • {tarchive}: The compressed tar archive
  • {tarchive}.log: Contents listing of the archive
  • {tarchive}.hashdeep.dfxml: Cryptographic hashes (MD5, SHA1, SHA256) of the archive

Returns

  • Path to the created tar archive file
source
Mycelia.current_unix_datetimeMethod
current_unix_datetime() -> Int64

Get the current time as a Unix timestamp (seconds since epoch).

Returns

  • Int: Current time as an integer Unix timestamp (seconds since January 1, 1970 UTC)

Examples

unix_time = current_unix_datetime()
# => 1709071368 (example value, will differ based on current time)
source
Mycelia.cypherMethod
cypher(
    cmd;
    address,
    username,
    password,
    format,
    database
) -> Cmd

Constructs a command to execute Neo4j Cypher queries via cypher-shell.

Arguments

  • cmd: The Cypher query command to execute
  • address::String="neo4j://localhost:7687": Neo4j server address
  • username::String="neo4j": Neo4j authentication username
  • password::String="password": Neo4j authentication password
  • format::String="auto": Output format (auto, verbose, or plain)
  • database::String="neo4j": Target Neo4j database name

Returns

  • Cmd: A command object ready for execution
source
Mycelia.deduplicate_fasta_fileMethod
deduplicate_fasta_file(in_fasta, out_fasta) -> Any

Remove duplicate sequences from a FASTA file while preserving headers.

Arguments

  • in_fasta: Path to input FASTA file
  • out_fasta: Path where deduplicated FASTA will be written

Returns

Path to the output FASTA file (same as out_fasta parameter)

Details

  • Sequences are considered identical if they match exactly (case-sensitive)
  • For duplicate sequences, keeps the first header encountered
  • Input sequences are sorted by identifier before deduplication
  • Preserves the original sequence formatting
source
Mycelia.detect_alphabetMethod
detect_alphabet(seq::AbstractString) -> Symbol

Determines the alphabet of a sequence. The function scans through seq only once:

  • If a 'T' or 't' is found (and no 'U/u'), the sequence is classified as DNA.
  • If a 'U' or 'u' is found (and no 'T/t'), it is classified as RNA.
  • If both T and U occur, an error is thrown.
  • If a character outside the canonical nucleotide and ambiguity codes is encountered, the sequence is assumed to be protein.
  • If neither T nor U are found, the sequence is assumed to be DNA.
source
Mycelia.detect_sequence_extensionMethod
detect_sequence_extension(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> String

Detect sequence type from input and suggest appropriate file extension.

Arguments

  • record: A FASTA/FASTQ record
  • sequence: A string or BioSequence containing sequence data

Returns

  • String: Suggested file extension:
    • ".fna" for DNA
    • ".frn" for RNA
    • ".faa" for protein
    • ".fa" for unrecognized sequences
source
Mycelia.determine_fasta_coverageMethod
determine_fasta_coverage(bam) -> Any

Calculate per-base genomic coverage from a BAM file using bedtools.

Arguments

  • bam::String: Path to input BAM file

Returns

  • String: Path to the generated coverage file (.coverage.txt)

Details

Uses bedtools genomecov to compute per-base coverage. Creates a coverage file with the format: <chromosome> <position> <coverage_depth>. If the coverage file already exists, returns the existing file path.

Dependencies

Requires bedtools (automatically installed in conda environment)

source
Mycelia.determine_max_canonical_kmersMethod
determine_max_canonical_kmers(k, ALPHABET) -> Any

Calculate the maximum number of possible canonical k-mers for a given alphabet.

Arguments

  • k::Integer: Length of k-mer
  • ALPHABET::Vector{Char}: Character set (nucleotides or amino acids)

Returns

  • Int: Maximum number of possible canonical k-mers

Details

  • For amino acids (AA_ALPHABET): returns total possible k-mers
  • For nucleotides: returns half of total possible k-mers (canonical form)
  • Requires odd k-mer length for nucleotide alphabets
source
Mycelia.determine_max_possible_kmersMethod
determine_max_possible_kmers(k, ALPHABET) -> Any

Calculate the total number of possible unique k-mers that can be generated from a given alphabet.

Arguments

  • k: Length of k-mers to consider
  • ALPHABET: Vector containing the allowed characters/symbols

Returns

  • Integer representing the maximum number of possible unique k-mers (|Σ|ᵏ)
source
Mycelia.determine_primary_contigMethod
determine_primary_contig(qualimap_results) -> Any

Determines the contig with the greatest number of total bases mapping to it

Identify the primary contig based on mapping coverage from Qualimap results.

Arguments

  • qualimap_results::DataFrame: DataFrame containing Qualimap alignment statistics with columns "Contig" and "Mapped bases"

Returns

  • String: Name of the contig with the highest number of mapped bases

Description

Takes Qualimap alignment results and determines which contig has the most total bases mapped to it, which often indicates the main chromosomal assembly.

source
Mycelia.determine_read_lengthsMethod
determine_read_lengths(
    fastq_file;
    total_reads
) -> Vector{Int64}

Calculate sequence lengths for reads in a FASTQ file.

Arguments

  • fastq_file::String: Path to input FASTQ file
  • total_reads::Integer=Inf: Number of reads to process (defaults to all reads)

Returns

  • Vector{Int}: Array containing the length of each sequence read
source
Mycelia.distance_matrix_to_newickMethod
distance_matrix_to_newick(
;
    distance_matrix,
    labels,
    outfile
)

Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)

Convert a distance matrix into a Newick tree format using UPGMA hierarchical clustering.

Arguments

  • distance_matrix: Square matrix of pairwise distances between entities
  • labels: Vector of labels corresponding to the entities in the distance matrix
  • outfile: Path where the Newick tree file will be written

Returns

  • Path to the generated Newick tree file

Details

Performs hierarchical clustering using the UPGMA (average linkage) method and converts the resulting dendrogram into Newick tree format. The branch lengths in the tree represent the heights from the clustering.

source
Mycelia.document_frequencyMethod
document_frequency(documents) -> Dict{_A, Int64} where _A

Calculate the document frequency of tokens across a collection of documents.

Arguments

  • documents: Collection of text documents where each document is a string

Returns

  • Dictionary mapping each unique token to the number of documents it appears in

Description

Computes how many documents contain each unique token. Each document is tokenized by splitting on whitespace. Tokens are counted only once per document, regardless of how many times they appear within that document.

source
Mycelia.download_bandageFunction
download_bandage() -> String
download_bandage(outdir) -> Any

Downloads and installs Bandage, a bioinformatics visualization tool for genome assembly graphs.

Arguments

  • outdir="/usr/local/bin": Target installation directory for the Bandage executable

Returns

  • Path to the installed Bandage executable

Details

  • Downloads Bandage v0.8.1 for Ubuntu
  • Installs required system dependencies (libxcb-glx0, libx11-xcb-dev, libfontconfig, libgl1-mesa-glx)
  • Attempts installation with sudo, falls back to root if sudo fails
  • Skips download if Bandage is already installed at target location

Dependencies

Requires system commands: wget, unzip, apt

source
Mycelia.download_blast_dbMethod
download_blast_db(; db, dbdir, source, wait)

Smart downloading of blast dbs depending on interactive, non interactive context

For a list of all available databases, run: Mycelia.list_blastdbs()

Downloads and sets up BLAST databases from various sources.

Arguments

  • db: Name of the BLAST database to download
  • dbdir: Directory to store the downloaded database (default: "~/workspace/blastdb")
  • source: Download source - one of ["", "aws", "gcp", "ncbi"]. Empty string auto-detects fastest source
  • wait: Whether to wait for download completion (default: true)

Returns

  • String path to the downloaded database directory
source
Mycelia.download_genome_by_accessionMethod
download_genome_by_accession(
;
    accession,
    outdir,
    compressed
)

Downloads a genomic sequence from NCBI's nucleotide database by its accession number.

Arguments

  • accession::String: NCBI nucleotide accession number (e.g. "NC_045512")
  • outdir::String: Output directory path. Defaults to current directory
  • compressed::Bool: If true, compresses output file with gzip. Defaults to true

Returns

  • String: Path to the downloaded file (.fna or .fna.gz)
source
Mycelia.download_genome_by_ftpMethod
download_genome_by_ftp(; ftp, outdir)

Downloads a genome file from NCBI FTP server to the specified directory.

Arguments

  • ftp::String: NCBI FTP path for the genome (e.g. "ftp://ftp.ncbi.nlm.nih.gov/.../")
  • outdir::String: Output directory path. Defaults to current working directory.

Returns

  • String: Path to the downloaded file

Notes

  • If the target file already exists, returns the existing file path without re-downloading
  • Downloads the genomic.fna.gz version of the genome
source
Mycelia.download_mmseqs_dbMethod
download_mmseqs_db(; db, dbdir, force, wait)

Downloads and sets up MMseqs2 reference databases for sequence searching and analysis.

Arguments

  • db::String: Name of database to download (see table below)
  • dbdir::String: Directory to store the downloaded database (default: "~/workspace/mmseqs")
  • force::Bool: If true, force re-download even if database exists (default: false)
  • wait::Bool: If true, wait for download to complete (default: true)

Returns

  • Path to the downloaded database as a String

Available Databases

DatabaseTypeTaxonomyDescription
UniRef100AminoacidYesUniProt Reference Clusters - 100% identity
UniRef90AminoacidYesUniProt Reference Clusters - 90% identity
UniRef50AminoacidYesUniProt Reference Clusters - 50% identity
UniProtKBAminoacidYesUniversal Protein Knowledge Base
NRAminoacidYesNCBI Non-redundant proteins
NTNucleotideNoNCBI Nucleotide collection
GTDBAminoacidYesGenome Taxonomy Database
PDBAminoacidNoProtein Data Bank structures
Pfam-A.fullProfileNoProtein family alignments
SILVANucleotideYesRibosomal RNA database
  Name                  Type            Taxonomy        Url                                                           
- UniRef100             Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef90              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniRef50              Aminoacid            yes        https://www.uniprot.org/help/uniref
- UniProtKB             Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL      Aminoacid            yes        https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot  Aminoacid            yes        https://uniprot.org
- NR                    Aminoacid            yes        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                    Nucleotide             -        https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                  Aminoacid            yes        https://gtdb.ecogenomic.org
- PDB                   Aminoacid              -        https://www.rcsb.org
- PDB70                 Profile                -        https://github.com/soedinglab/hh-suite
- Pfam-A.full           Profile                -        https://pfam.xfam.org
- Pfam-A.seed           Profile                -        https://pfam.xfam.org
- Pfam-B                Profile                -        https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD                   Profile                -        https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG                Profile                -        http://eggnog5.embl.de
- VOGDB                 Profile                -        https://vogdb.org
- dbCAN2                Profile                -        http://bcb.unl.edu/dbCAN2
- SILVA                 Nucleotide           yes        https://www.arb-silva.de
- Resfinder             Nucleotide             -        https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari              Nucleotide           yes        https://github.com/lskatz/Kalamari
source
Mycelia.draw_dendrogram_treeMethod
draw_dendrogram_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a dendrogram visualization of hierarchical clustering results stored in a MetaDiGraph.

Arguments

  • mg::MetaGraphs.MetaDiGraph: Graph containing hierarchical clustering results. Must have :hcl in graph properties with clustering data and vertex properties containing :x, :y coordinates.

Keywords

  • width::Integer=500: Width of output image in pixels
  • height::Integer=500: Height of output image in pixels
  • fontsize::Integer=12: Font size for node labels in points
  • margins::Float64: Margin size in pixels, defaults to min(width,height)/20
  • mergenodesize::Float64=1: Size of circular nodes at merge points
  • lineweight::Float64=1: Thickness of dendrogram lines
  • filename::String: Output filename, defaults to timestamp with .dendrogram.png extension

Returns

Nothing, but saves dendrogram image to disk and displays preview.

source
Mycelia.draw_radial_treeMethod
draw_radial_tree(
    mg::MetaGraphs.MetaDiGraph;
    width,
    height,
    fontsize,
    margins,
    mergenodesize,
    lineweight,
    filename
) -> Luxor.Drawing

Draw a radial hierarchical clustering tree visualization and save it as an image file.

Arguments

  • mg::MetaGraphs.MetaDiGraph: A meta directed graph containing hierarchical clustering data with required graph properties :hcl containing clustering information.

Keywords

  • width::Int=500: Width of the output image in pixels
  • height::Int=500: Height of the output image in pixels
  • fontsize::Int=12: Font size for node labels
  • margins::Float64: Margin size (automatically calculated as min(width,height)/20)
  • mergenodesize::Float64=1: Size of the merge point nodes
  • lineweight::Float64=1: Thickness of the connecting lines
  • filename::String: Output filename (defaults to timestamp with ".radial.png" suffix)

Details

The function creates a radial visualization of hierarchical clustering results where:

  • Leaf nodes are arranged in a circle with labels
  • Internal nodes represent merge points
  • Connections show the hierarchical structure through arcs and lines

The visualization is saved as a PNG file and automatically previewed.

Required Graph Properties

The input graph must have:

  • mg.gprops[:hcl].labels: Vector of leaf node labels
  • mg.gprops[:hcl].order: Vector of ordered leaf nodes
  • mg.gprops[:hcl].merges: Matrix of merge operations
  • mg.vprops[v][:x]: X coordinate for each vertex
  • mg.vprops[v][:y]: Y coordinate for each vertex
source
Mycelia.drop_empty_columns!Method
drop_empty_columns!(
    df::DataFrames.AbstractDataFrame
) -> DataFrames.AbstractDataFrame

Identify all columns that have only missing or empty values, and remove those columns from the dataframe in-place.

Returns a modified version of the original dataframe.

See also: dropemptycolumns

source
Mycelia.drop_empty_columnsMethod
drop_empty_columns(df::DataFrames.AbstractDataFrame) -> Any

Identify all columns that have only missing or empty values, and remove those columns from the dataframe.

Returns a modified copy of the dataframe.

See also: dropemptycolumns!

source
Mycelia.edge_path_to_sequenceMethod
edge_path_to_sequence(kmer_graph, edge_path) -> Any

Converts a path of edges in a kmer graph into a DNA sequence by concatenating overlapping kmers.

Arguments

  • kmer_graph: A directed graph where vertices represent kmers and edges represent overlaps
  • edge_path: Vector of edges representing a path through the graph

Returns

A BioSequences.LongDNASeq containing the merged sequence represented by the path

Details

The function:

  1. Takes the first kmer from the source vertex of first edge
  2. For each edge, handles orientation (forward/reverse complement)
  3. Verifies overlaps between consecutive kmers
  4. Concatenates unique bases to build final sequence
source
Mycelia.edge_probabilityMethod
edge_probability(stranded_kmer_graph, edge) -> Any

Calculate the probability of traversing a specific edge in a stranded k-mer graph.

The probability is computed as the ratio of this edge's coverage weight to the sum of all outgoing edge weights from the source vertex.

edge_probability(stranded_kmer_graph, edge) -> Any

Arguments

  • stranded_kmer_graph: A directed graph where edges represent k-mer connections
  • edge: The edge for which to calculate the probability

Returns

  • Float64: Probability in range [0,1] representing likelihood of traversing this edge Returns 0.0 if sum of all outgoing edge weights is zero

Note

Probability is based on the :coverage property of edges, using their length as weights

source
Mycelia.edgemer_to_vertex_kmersMethod
edgemer_to_vertex_kmers(
    edgemer
) -> Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}

Convert an edgemer to two vertex kmers.

This function takes an edgemer (a sequence of DNA nucleotides) and converts it into two vertex kmers. A kmer is a substring of length k from a DNA sequence. The first kmer is created from the first n-1 elements of the edgemer, and the second kmer is created from the last n-1 elements of the edgemer.

Arguments

  • edgemer::AbstractVector{T}: A vector of DNA nucleotides where T is a subtype of BioSequences.DNAAlphabet{2}.

Returns

  • Tuple{Kmers.Kmer{BioSequences.DNAAlphabet{2}}, Kmers.Kmer{BioSequences.DNAAlphabet{2}}}: A tuple containing two kmers.
source
Mycelia.equally_spaced_samplesMethod
equally_spaced_samples(vector, n) -> Any

Sample n equally spaced elements from vector.

Arguments

  • vector: Input vector to sample from
  • n: Number of samples to return (must be positive)

Returns

A vector containing n equally spaced elements from the input vector.

source
Mycelia.equivalent_fasta_sequencesMethod
equivalent_fasta_sequences(fasta_1, fasta_2) -> Bool

Compare two FASTA files to determine if they contain the same set of sequences, regardless of sequence order.

Arguments

  • fasta_1::String: Path to first FASTA file
  • fasta_2::String: Path to second FASTA file

Returns

  • Bool: true if both files contain exactly the same sequences, false otherwise

Details

Performs a set-based comparison of DNA sequences by hashing each sequence. Sequence order differences between files do not affect the result.

source
Mycelia.error_rate_to_q_valueMethod
error_rate_to_q_value(error_rate) -> Any

Convert a sequencing error probability to a Phred quality score (Q-value).

The calculation uses the standard Phred formula: Q = -10 * log₁₀(error_rate)

Arguments

  • error_rate::Float64: Probability of error (between 0 and 1)

Returns

  • q_value::Float64: Phred quality score
source
Mycelia.export_blast_dbMethod
export_blast_db(; path_to_db, fasta)

Export sequences from a BLAST database to a gzipped FASTA file.

Arguments

  • path_to_db: Path to the BLAST database
  • fasta: Output path for the gzipped FASTA file (default: path_to_db * ".fna.gz")

Details

Uses conda's BLAST environment to extract sequences using blastdbcmd. The output is automatically compressed using pigz. If the output file already exists, the function will skip extraction.

source
Mycelia.export_blast_db_taxonomy_tableMethod
export_blast_db_taxonomy_table(; path_to_db, outfile)

Exports a taxonomy mapping table from a BLAST database in seqid2taxid format.

Arguments

  • path_to_db::String: Path to the BLAST database
  • outfile::String: Output file path (defaults to input path + ".seqid2taxid.txt.gz")

Returns

  • String: Path to the created output file

Details

Creates a compressed tab-delimited file mapping sequence IDs to taxonomy IDs. Uses blastdbcmd without GI identifiers for better cross-referencing compatibility. If the output file already exists, returns the path without regenerating.

Dependencies

Requires BLAST+ tools installed via Bioconda.

source
Mycelia.extract_pacbiosample_informationMethod
extract_pacbiosample_information(
    xml
) -> DataFrames.DataFrame

Extract biosample and barcode information from a PacBio XML metadata file.

Arguments

  • xml: Path to PacBio XML metadata file

Returns

DataFrame with two columns:

  • BioSampleName: Name of the biological sample
  • BarcodeName: Associated DNA barcode identifier
source
Mycelia.fasta_and_gff_to_genbankMethod
fasta_and_gff_to_genbank(; fasta, gff, genbank)

Convert FASTA sequence and GFF annotation files to GenBank format using EMBOSS seqret.

Arguments

  • fasta::String: Path to input FASTA file containing sequence data
  • gff::String: Path to input GFF file containing genomic features
  • genbank::String: Path for output GenBank file

Details

Requires EMBOSS toolkit (installed via Bioconda). The function will:

  1. Create necessary output directories
  2. Run seqret to combine sequence and features
  3. Generate a GenBank format file at the specified location
source
Mycelia.fasta_genome_sizeMethod
fasta_genome_size(fasta_file) -> Any

Calculate the total size (in bases) of all sequences in a FASTA file.

Arguments

  • fasta_file::AbstractString: Path to the FASTA file

Returns

  • Int: Sum of lengths of all sequences in the FASTA file
source
Mycelia.fasta_list_to_dense_counts_tableMethod
fasta_list_to_dense_counts_table(; fasta_list, k, alphabet)

Create a dense kmer counts table (canonical for DNA, stranded for RNA & AA) for each fasta provided in a list. Scales very well for large numbers of organisms/fasta files, but not for k. Recommended for k <= 13, although 17 may still be possible

Generate a dense k-mer frequency matrix from multiple FASTA files.

Arguments

  • fasta_list: Vector of paths to FASTA files
  • k: Length of k-mers to count (recommended k ≤ 13)
  • alphabet: Symbol specifying sequence type (:DNA, :RNA, or :AA)

Returns

Named tuple containing:

  • sorted_kmers: Vector of sorted k-mers
  • kmer_counts_matrix: Dense matrix where rows are k-mers and columns are sequences

Details

  • For DNA: Uses canonical k-mers (strand-neutral)
  • For RNA/AA: Uses stranded k-mers
  • Parallelized using Julia's multi-threading

Performance

  • Efficient for large numbers of sequences
  • Memory usage grows exponentially with k
  • For k > 13, use fasta_list_to_sparse_counts_table instead
source
Mycelia.fasta_table_to_fastaMethod
fasta_table_to_fasta(fasta_df) -> Any

Convert a DataFrame containing FASTA sequence information into a vector of FASTA records.

Arguments

  • fasta_df::DataFrame: DataFrame with columns "identifier", "description", and "sequence"

Returns

  • Vector{FASTX.FASTA.Record}: Vector of FASTA records
source
Mycelia.fasta_to_reference_kmer_countsMethod
fasta_to_reference_kmer_counts(; kmer_type, fasta)

Counts k-mer occurrences in a FASTA file, considering both forward and reverse complement sequences.

Arguments

  • kmer_type: Type specification for k-mers (e.g., DNAKmer{21})
  • fasta: Path to FASTA file containing reference sequences

Returns

  • Dict{kmer_type, Int}: Dictionary mapping each k-mer to its total count across all sequences
source
Mycelia.fasta_to_tableMethod
fasta_to_table(fasta) -> DataFrames.DataFrame

Convert a FASTA file/record iterator to a DataFrame.

Arguments

  • fasta: FASTA record iterator from FASTX.jl

Returns

  • DataFrame with columns:
    • identifier: Sequence identifiers
    • description: Full sequence descriptions
    • sequence: Biological sequences as strings
source
Mycelia.fasta_xam_mapping_statsMethod
fasta_xam_mapping_stats(; fasta, xam)

Calculate mapping statistics by comparing sequence alignments (BAM/SAM) to a reference FASTA.

Arguments

  • fasta::String: Path to reference FASTA file
  • xam::String: Path to alignment file (BAM or SAM format)

Returns

DataFrame with columns:

  • contig: Reference sequence name
  • contig_length: Length of reference sequence
  • total_aligned_bases: Total number of bases aligned to reference
  • mean_depth: Average depth of coverage (totalalignedbases/contig_length)
source
Mycelia.fastani_listMethod
fastani_list(
;
    query_list,
    reference_list,
    outfile,
    threads,
    force
) -> Union{Nothing, Base.Process}

Run fastani with a query and reference list

Calculate Average Nucleotide Identity (ANI) between genome sequences using FastANI.

Arguments

  • query_list::String: Path to file containing list of query genome paths (one per line)
  • reference_list::String: Path to file containing list of reference genome paths (one per line)
  • outfile::String: Path to output file that will contain ANI results
  • threads::Int=Sys.CPU_THREADS: Number of parallel threads to use
  • force::Bool=false: If true, rerun analysis even if output file exists

Output

Generates a tab-delimited file with columns:

  • Query genome
  • Reference genome
  • ANI value (%)
  • Count of bidirectional fragment mappings
  • Total query fragments

Notes

  • Requires FastANI to be available via Bioconda
  • Automatically sets up required conda environment
source
Mycelia.fasterq_dumpMethod
fasterq_dump(
;
    outdir,
    srr_identifier
) -> NamedTuple{(:forward_reads, :reverse_reads, :unpaired_reads), <:Tuple{Union{Missing, String}, Union{Missing, String}, Union{Missing, String}}}

Download and compress sequencing reads from the SRA database using fasterq-dump.

Arguments

  • outdir::String="": Output directory for the FASTQ files. Defaults to current directory.
  • srr_identifier::String="": SRA run accession number (e.g., "SRR12345678")

Returns

Named tuple containing paths to the generated files:

  • forward_reads: Path to forward reads file (*_1.fastq.gz) or missing
  • reverse_reads: Path to reverse reads file (*_2.fastq.gz) or missing
  • unpaired_reads: Path to unpaired reads file (*.fastq.gz) or missing

Outputs

Creates compressed FASTQ files in the output directory:

  • {srr_identifier}_1.fastq.gz: Forward reads (for paired-end data)
  • {srr_identifier}_2.fastq.gz: Reverse reads (for paired-end data)
  • {srr_identifier}.fastq.gz: Unpaired reads (for single-end data)

Dependencies

Requires:

  • fasterq-dump from the SRA Toolkit (installed via Conda)
  • gzip for compression

Notes

  • Skips download if output files already exist
  • Uses up to 4 threads or system maximum, whichever is lower
  • Allocates 1GB memory for processing
  • Skips technical reads
  • Handles both paired-end and single-end data automatically
source
Mycelia.fastq_recordMethod
fastq_record(; identifier, sequence, quality_scores)

Construct a FASTX FASTQ record from its components.

Arguments

  • identifier::String: The sequence identifier without the '@' prefix
  • sequence::String: The nucleotide sequence
  • quality_scores::Vector{Int}: Quality scores (0-93) as raw integers

Returns

  • FASTX.FASTQRecord: A parsed FASTQ record

Notes

  • Quality scores are automatically capped at 93 to ensure FASTQ compatibility
  • Quality scores are converted to ASCII by adding 33 (Phred+33 encoding)
  • The record is constructed in standard FASTQ format with four lines:
    1. Header line (@ + identifier)
    2. Sequence
    3. Plus line
    4. Quality scores (ASCII encoded)
source
Mycelia.fastx2normalized_tableMethod
fastx2normalized_table(fastx) -> DataFrames.DataFrame

Convert a FASTX (FASTA/FASTQ) file into a normalized tab-separated table format with standardized sequence identifiers.

Arguments

  • fastx_file::String: Path to input FASTX file
  • outfile::String: Path to output compressed TSV file (defaults to input filename + ".tsv.gz")
  • force::Bool=false: If true, overwrites existing output file

Returns

  • String: Path to the created output file

Output Format

Creates a gzipped TSV file with the following columns:

  • fasta_identifier: Original FASTA filename
  • sequence_sha256: SHA256 hash of the sequence
  • sequence_identifier: Original sequence ID from FASTA
  • sequence_description: Full sequence description from FASTA
  • sequence: The actual sequence
source
Mycelia.fastx_statsMethod
fastx_stats(fastx) -> DataFrames.DataFrame

Calculate basic statistics for FASTQ/FASTA sequence files using seqkit.

Arguments

  • fastq::String: Path to input FASTQ/FASTA file

Details

Automatically installs and uses seqkit from Bioconda to compute sequence statistics including number of sequences, total bases, GC content, average length, etc.

Dependencies

  • Requires Conda and Bioconda channel
  • Installs seqkit package if not present

Returns

Returns a DataFrame of the table

https://bioinf.shenwei.me/seqkit/usage/#stats

source
Mycelia.fastx_to_contig_lengthsMethod
fastx_to_contig_lengths(
    fastx
) -> OrderedCollections.OrderedDict

Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.

Arguments

  • xam: Path to XAM file or XAM object

Returns

A DataFrame with per-contig statistics including:

  • n_aligned_reads: Number of aligned reads
  • total_aligned_bases: Sum of alignment lengths
  • total_alignment_score: Sum of alignment scores
  • Mapping quality statistics (mean, std, median)
  • Alignment length statistics (mean, std, median)
  • Alignment score statistics (mean, std, median)
  • Percent mismatches statistics (mean, std, median)

Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.

source
Mycelia.fastx_to_kmer_graphMethod
fastx_to_kmer_graph(
    KMER_TYPE,
    fastx::AbstractString
) -> MetaGraphs.MetaGraph

Constructs a k-mer graph from a single FASTX format string.

Arguments

  • KMER_TYPE: The k-mer type specification (e.g., DNAKmer{K} where K is k-mer length)
  • fastx::AbstractString: Input sequence in FASTX format (FASTA or FASTQ)

Returns

  • KmerGraph: A directed graph where vertices are k-mers and edges represent overlaps
source
Mycelia.fastx_to_kmer_graphMethod
fastx_to_kmer_graph(
    KMER_TYPE,
    fastxs::AbstractVector{<:AbstractString}
) -> MetaGraphs.MetaGraph

Create an in-memory kmer-graph that records:

  • all kmers
  • counts
  • all observed edges between kmers
  • edge orientations
  • edge counts

Construct a kmer-graph from one or more FASTX files (FASTA/FASTQ).

Arguments

  • KMER_TYPE: Type for kmer representation (e.g., DNAKmer{K})
  • fastxs: Vector of paths to FASTX files

Returns

A MetaGraph where:

  • Vertices represent unique kmers with properties:
    • :kmer => The kmer sequence
    • :count => Number of occurrences
  • Edges represent observed kmer adjacencies with properties:
    • :orientation => Relative orientation of connected kmers
    • :count => Number of observed transitions
source
Mycelia.fibonacci_numbers_less_thanMethod
fibonacci_numbers_less_than(
    n::Int64
) -> Union{Vector{Any}, Vector{Int64}}

Generate a sequence of Fibonacci numbers strictly less than the input value.

Arguments

  • n::Int: Upper bound (exclusive) for the Fibonacci sequence

Returns

  • Vector{Int}: Array containing Fibonacci numbers less than n
source
Mycelia.filesize_human_readableMethod
filesize_human_readable(f) -> Any

Gets the size of a file and returns it in a human-readable format.

Arguments

  • f: The path to the file, either as a String or an AbstractString.

Returns

A string representing the file size in a human-readable format (e.g., "3.40 MB").

Details

This function internally uses filesize(f) to get the file size in bytes, then leverages Base.format_bytes to convert it into a human-readable format with appropriate units (KB, MB, GB, etc.).

Examples

julia> filesize_human_readable("my_image.jpg")
"2.15 MB"

See Also

  • filesize: Gets the size of a file in bytes.
  • Base.format_bytes: Converts a byte count into a human-readable string.
source
Mycelia.find_matching_prefixMethod
find_matching_prefix(
    filename1::String,
    filename2::String;
    strip_trailing_delimiters
) -> String

Find the longest common prefix between two filenames.

Arguments

  • filename1::String: First filename to compare
  • filename2::String: Second filename to compare

Keywords

  • strip_trailing_delimiters::Bool=true: If true, removes trailing dots, hyphens, and underscores from the result

Returns

  • String: The longest common prefix found between the filenames
source
Mycelia.find_nonempty_columnsMethod
find_nonempty_columns(df) -> Any

Identify all columns that have only missing or empty values

Returns as a bit array

See also: dropemptycolumns, dropemptycolumns!

source
Mycelia.find_resampling_stretchesMethod
find_resampling_stretches(
;
    record_kmer_solidity,
    solid_branching_kmer_indices
)

Identifies sequence regions that require resampling based on kmer solidity patterns.

Arguments

  • record_kmer_solidity::BitVector: Boolean array where true indicates solid kmers
  • solid_branching_kmer_indices::Vector{Int}: Indices of solid branching kmers

Returns

  • Vector{UnitRange{Int64}}: Array of ranges (start:stop) indicating stretches that need resampling

Details

Finds continuous stretches of non-solid kmers and extends them to the nearest solid branching kmers on either side. These stretches represent regions that need resampling.

If a stretch doesn't have solid branching kmers on both sides, it is excluded from the result. Duplicate ranges are removed from the final output.

source
Mycelia.find_true_rangesMethod
find_true_ranges(
    bool_vec::AbstractVector{Bool};
    min_length
) -> Vector

Finds contiguous ranges of true values in a boolean vector.

Arguments

  • bool_vec::AbstractVector{Bool}: Input boolean vector to analyze
  • min_length=1: Minimum length requirement for a range to be included

Returns

Vector of tuples (start, end) where each tuple represents the indices of a contiguous range of true values meeting the minimum length requirement.

source
Mycelia.first_of_each_groupMethod
first_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Return a DataFrame containing the first row from each group in a GroupedDataFrame.

Arguments

  • gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

  • DataFrame: A new DataFrame containing first row from each group
source
Mycelia.fit_optimal_number_of_clustersFunction
fit_optimal_number_of_clusters(
    distance_matrix
) -> @NamedTuple{optimal_number_of_clusters::Int64, ks_assessed::Vector{Int64}, within_cluster_sum_of_squares::Vector{Float64}, silhouette_scores::Vector{Float64}}
fit_optimal_number_of_clusters(
    distance_matrix,
    ks_to_try
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :within_cluster_sum_of_squares, :silhouette_scores), <:Tuple{Any, Any, Vector{Float64}, Vector{Float64}}}

Determines the optimal number of clusters for k-means clustering by maximizing the silhouette score.

Algorithm

  • Starts by evaluating the first 5 k values
  • Continues evaluation if optimal k is at the edge of evaluated range
  • Refines search by evaluating midpoints between k values around the current optimum
  • Iterates until convergence (optimal k remains stable)

Arguments

  • distance_matrix::Matrix: Square matrix of pairwise distances between points
  • ks_to_try::Vector{Int}: Vector of k values to evaluate. Defaults to [1, 2, ...] up to matrix size

Returns

Named tuple containing:

  • optimal_number_of_clusters::Int: The k value giving highest silhouette score
  • ks_assessed::Vector{Int}: All k values that were evaluated
  • within_cluster_sum_of_squares::Vector{Float64}: WCSS for each k assessed
  • silhouette_scores::Vector{Float64}: Silhouette scores for each k assessed
source
Mycelia.fit_optimal_number_of_clusters_hclustFunction
fit_optimal_number_of_clusters_hclust(
    distance_matrix
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :silhouette_scores, :hclust_result), <:Tuple{Int64, Vector{Int64}, Vector{Float64}, Clustering.Hclust}}
fit_optimal_number_of_clusters_hclust(
    distance_matrix,
    ks_to_try
) -> NamedTuple{(:optimal_number_of_clusters, :ks_assessed, :silhouette_scores, :hclust_result), <:Tuple{Any, Any, Vector{Float64}, Clustering.Hclust}}

Determine the optimal number of clusters using hierarchical clustering with iterative refinement.

Arguments

  • distance_matrix::Matrix: Square matrix of pairwise distances between observations
  • ks_to_try::Vector{Int}: Vector of cluster counts to evaluate (default: 1, 2, and sequence from Mycelia.ks())

Returns

Named tuple containing:

  • optimal_number_of_clusters::Int: Best number of clusters found
  • ks_assessed::Vector{Int}: All cluster counts that were evaluated
  • silhouette_scores::Vector{Float64}: Silhouette scores for each k assessed
  • hclust_result: Hierarchical clustering result object

Details

Uses Ward's linkage method and silhouette scores to evaluate cluster quality. Implements an adaptive search that focuses on promising regions between initially tested k values. For k=1, silhouette score is set to 0 as a special case.

source
Mycelia.frequency_matrix_to_cosine_distance_matrixMethod
frequency_matrix_to_cosine_distance_matrix(
    probability_matrix
) -> Any

Create cosine distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to cosine similarity (relative frequency)

Compute pairwise cosine distances between entities based on their feature distributions.

Arguments

  • probability_matrix: Column-major matrix where rows represent features and columns represent entities. Each column should contain frequency/probability values for one entity.

Returns

  • Symmetric matrix of size (nentities × nentities) containing pairwise cosine distances. Distance values range from 0 (identical distributions) to 1 (orthogonal distributions).

Details

  • Computes upper triangle and mirrors to lower triangle for efficiency
  • Uses Distances.cosine_dist for the core computation
  • Time complexity is O(n²) where n is the number of entities
source
Mycelia.frequency_matrix_to_euclidean_distance_matrixMethod
frequency_matrix_to_euclidean_distance_matrix(
    counts_table
) -> Any

Create a Euclidean distance matrix from a column-major counts matrix (features as rows and entities as columns), where distance is proportional to total feature count magnitude (size).

Compute pairwise Euclidean distances between entity profiles in a counts matrix.

Arguments

  • counts_table: A matrix where rows represent features and columns represent entities (column-major format). Each column contains the feature counts/frequencies for one entity.

Returns

  • A symmetric N×N matrix of Euclidean distances between each pair of entities, where N is the number of entities.

Details

  • Parallelized computation using multi-threading
  • Progress tracking via ProgressMeter
  • Memory efficient: only upper triangle is computed, then mirrored
  • Distance between entities increases with total feature magnitude differences
source
Mycelia.genbank_to_codon_frequenciesMethod
genbank_to_codon_frequencies(
    genbank;
    allow_all
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Int64}}

Analyze codon usage frequencies from genes in a GenBank file.

Arguments

  • genbank: Path to GenBank format file containing genomic sequences and annotations
  • allow_all: If true, initializes frequencies for all possible codons with count=1 (default: true)

Returns

Nested dictionary mapping amino acids to their corresponding codon usage counts:

  • Outer key: AminoAcid (including stop codon)
  • Inner key: DNACodon
  • Value: Count of codon occurrences

Details

  • Only processes genes marked as ':misc_feature' in the GenBank file
  • Analyzes both forward and reverse complement sequences
  • Determines coding strand based on presence of stop codons and start codons
  • Skips ambiguous sequences that cannot be confidently oriented
source
Mycelia.genbank_to_fastaMethod
genbank_to_fasta(; genbank, fasta, force)

Convert a GenBank format file to FASTA format using EMBOSS seqret.

Arguments

  • genbank: Path to input GenBank format file
  • fasta: Optional output FASTA file path (defaults to input path with .fna extension)
  • force: If true, overwrites existing output file (defaults to false)

Returns

Path to the output FASTA file

Notes

  • Requires EMBOSS suite (installed automatically via Conda)
  • Will not regenerate output if it already exists unless force=true
source
Mycelia.generate_all_possible_canonical_kmersMethod
generate_all_possible_canonical_kmers(k, alphabet) -> Any

Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)

Generate all possible canonical k-mers of length k from the given alphabet.

For DNA/RNA sequences, returns unique canonical k-mers where each k-mer is represented by the lexicographically smaller of itself and its reverse complement. For amino acid sequences, returns all possible k-mers without canonicalization.

Arguments

  • k: Length of k-mers to generate
  • alphabet: Vector of BioSymbols (DNA, RNA or AminoAcid)

Returns

  • Vector of k-mers, canonicalized for DNA/RNA alphabets
source
Mycelia.generate_all_possible_kmersMethod
generate_all_possible_kmers(k, alphabet) -> Any

Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)

Generate a sorted list of all possible k-mers for a given alphabet.

Arguments

  • k::Integer: Length of k-mers to generate
  • alphabet: Collection of symbols (DNA, RNA, or amino acids) from BioSymbols

Returns

  • Sorted Vector of Kmers of the appropriate type (DNA, RNA, or amino acid)
source
Mycelia.generate_transterm_coordinates_from_fastaMethod
generate_transterm_coordinates_from_fasta(fasta) -> Any

Generate minimal coordinate files required for TransTermHP analysis from FASTA sequences.

Creates artificial gene annotations at sequence boundaries to enable TransTermHP to run without real gene annotations. For each sequence in the FASTA file, generates two single-base-pair "genes" at positions 1-2 and (L-1)-L, where L is sequence length.

Arguments

  • fasta: Path to input FASTA file containing sequences to analyze

Returns

  • Path to generated coordinate file (original path with ".coords" extension)

Format

Generated coordinate file follows TransTermHP format: gene_id start stop chromosome

where chromosome matches FASTA sequence identifiers.

See also: run_transterm

source
Mycelia.generate_transterm_coordinates_from_gffMethod
generate_transterm_coordinates_from_gff(gff_file) -> Any

Convert a GFF file to a coordinates file compatible with TransTermHP format.

Arguments

  • gff_file::String: Path to input GFF file

Processing

  • Converts 1-based to 0-based coordinates
  • Extracts gene IDs from the attributes field
  • Retains columns: gene_id, start, end, seqid

Returns

  • Path to the generated coordinates file (original filename with '.coords' suffix)

Output Format

Space-delimited file with columns: gene_id, start, end, seqid

source
Mycelia.get_base_extensionMethod
get_base_extension(filename::String) -> String

Extract the base file extension from a filename, handling compressed files.

For regular files, returns the last extension. For gzipped files, returns the extension before .gz.

source
Mycelia.get_correct_qualityMethod
get_correct_quality(tech::Symbol, pos::Int, read_length::Int) -> Int

Simulates a Phred quality score (using the Sanger convention) for a correctly observed base. For Illumina, the quality score is modeled to decay linearly from ~40 at the start to ~20 at the end of the read. For other technologies, the score is sampled from a normal distribution with parameters typical for that platform.

Returns an integer quality score.

source
Mycelia.get_error_qualityMethod
get_error_quality(tech::Symbol) -> Int

Simulates a Phred quality score (using the Sanger convention) for a base observed with an error. Error bases are assigned lower quality scores than correctly observed bases. For Illumina, scores typically range between 5 and 15; for nanopore and pacbio, slightly lower values are used; and for ultima, a modest quality score is assigned.

Returns an integer quality score.

source
Mycelia.get_genbankMethod
get_genbank(
;
    db,
    accession,
    ftp
) -> Union{Nothing, GenomicAnnotations.GenBank.Reader}

Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site

Retrieve GenBank records from NCBI or directly from an FTP site.

Arguments

  • db::String: NCBI database to query ("nuccore" for nucleotide or "protein" for protein sequences)
  • accession::String: NCBI accession number for the sequence
  • ftp::String: Direct FTP URL to a GenBank file (gzipped)

Returns

  • GenomicAnnotations.GenBank.Reader: A reader object containing the GenBank record

Details

When using NCBI queries (db and accession), the function implements rate limiting (0.5s sleep) to comply with NCBI's API restrictions of max 2 requests per second.

source
Mycelia.get_gffMethod
get_gff(; db, accession, ftp) -> Any

Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site

Retrieve GFF3 formatted genomic feature data from NCBI or direct FTP source.

Arguments

  • db::String: NCBI database to query ("nuccore" for DNA or "protein" for protein sequences)
  • accession::String: NCBI accession number
  • ftp::String: Direct FTP URL to GFF3 file (typically gzipped)

Returns

  • IO: IOBuffer containing uncompressed GFF3 data
source
Mycelia.get_kmer_indexMethod
get_kmer_index(kmers, kmer) -> Any

Returns the index position of a given k-mer in a sorted list of k-mers.

Arguments

  • kmers: A sorted vector of k-mers to search within
  • kmer: The k-mer sequence to find

Returns

Integer index position where kmer is found in kmers

Throws

  • AssertionError: If the k-mer is not found in the list
source
Mycelia.get_sequenceMethod
get_sequence(
;
    db,
    accession,
    ftp
) -> Union{Nothing, FASTX.FASTA.Reader}

Get dna (db = "nuccore") or protein (db = "protein") sequences from NCBI or get fasta directly from FTP site

Retrieve FASTA format sequences from NCBI databases or direct FTP URLs.

Arguments

  • db::String: NCBI database type ("nuccore" for DNA or "protein" for protein sequences)
  • accession::String: NCBI sequence accession number
  • ftp::String: Direct FTP URL to a FASTA file (alternative to db/accession pair)

Returns

  • FASTX.FASTA.Reader: Reader object containing the requested sequence(s)
source
Mycelia.gfa_to_fastaMethod
gfa_to_fasta(; gfa, fasta)

Convert a GFA (Graphical Fragment Assembly) file to FASTA format.

Arguments

  • gfa::String: Path to input GFA file
  • fasta::String=gfa * ".fna": Path for output FASTA file. Defaults to input filename with ".fna" extension

Returns

  • String: Path to the generated FASTA file

Details

Uses gfatools (via Conda) to perform the conversion. The function will:

  1. Ensure gfatools is available in the Conda environment
  2. Execute the conversion using gfatools gfa2fa
  3. Write sequences to the specified FASTA file
source
Mycelia.gfa_to_structure_tableMethod
gfa_to_structure_table(
    gfa
) -> NamedTuple{(:contig_table, :records), <:Tuple{DataFrames.DataFrame, Any}}

Convert a GFA (Graphical Fragment Assembly) file into a structured representation.

Arguments

  • gfa: Path to GFA file or GFA content as string

Returns

Named tuple containing:

  • contig_table: DataFrame with columns:
    • connected_component: Integer ID for each component
    • contigs: Comma-separated list of contig IDs
    • is_circular: Boolean indicating if component forms a cycle
    • is_closed: Boolean indicating if single contig forms a cycle
    • lengths: Comma-separated list of contig lengths
  • records: FASTA records from the GFA
source
Mycelia.githashMethod
githash(; short) -> SubString{String}

Returns the current git commit hash of the repository.

Arguments

  • short::Bool=false: If true, returns abbreviated 8-character hash

Returns

A string containing the git commit hash (full 40 characters by default)

source
Mycelia.graph_to_gfaMethod
graph_to_gfa(; graph, outfile)

Convert a Mycelia graph to GFA (Graphical Fragment Assembly) format.

Writes a graph to GFA format, including:

  • Header (H) line with GFA version
  • Segment (S) lines for each vertex with sequence and depth
  • Link (L) lines for edges with overlap size and orientations

Arguments

  • graph: MetaGraph containing sequence vertices and their relationships
  • outfile: Path where the GFA file should be written

Returns

  • Path to the written GFA file
source
Mycelia.hclust_to_metagraphMethod
hclust_to_metagraph(
    hcl::Clustering.Hclust
) -> MetaGraphs.MetaDiGraph{Int64, Float64}

Convert a hierarchical clustering tree into a directed metagraph representation.

Arguments

  • hcl::Clustering.Hclust: Hierarchical clustering result object

Returns

  • MetaGraphs.MetaDiGraph: Directed graph with metadata representing the clustering hierarchy

Graph Properties

The resulting graph contains the following vertex properties:

  • :hclust_id: String identifier for each node
  • :height: Height/distance at each merge point (0.0 for leaves)
  • :x: Horizontal position for visualization (0-1 range)
  • :y: Vertical position based on normalized height
  • :hcl: Original clustering object (stored as graph property)
source
Mycelia.heirarchically_cluster_distance_matrixMethod
heirarchically_cluster_distance_matrix(
    distance_matrix
) -> Clustering.Hclust

Performs hierarchical clustering on a distance matrix using Ward's linkage method.

Arguments

  • distance_matrix::Matrix{<:Real}: A symmetric distance/dissimilarity matrix

Returns

  • HierarchicalCluster: A hierarchical clustering object from Clustering.jl

Details

Uses Ward's method (minimum variance) for clustering, which:

  • Minimizes total within-cluster variance
  • Produces compact, spherical clusters
  • Works well for visualization in radial layouts
source
Mycelia.identify_optimal_number_of_clustersMethod
identify_optimal_number_of_clusters(
    distance_matrix
) -> Tuple{Clustering.Hclust, Int64}

Determine the optimal number of clusters for a dataset using hierarchical clustering and silhouette score analysis.

Arguments

  • distance_matrix: A symmetric matrix of pairwise distances between data points

Returns

A tuple containing:

  • hcl: The fitted hierarchical clustering object
  • optimal_number_of_clusters: Integer indicating the optimal number of clusters

Details

The function:

  1. Performs hierarchical clustering on the distance matrix
  2. Tests cluster counts from 2 to √n (where n is dataset size)
  3. Evaluates each clustering using silhouette scores
  4. Generates a diagnostic plot showing silhouette scores vs cluster counts
  5. Selects the number of clusters that minimizes the silhouette score

The diagnostic plot is displayed automatically during execution.

See Also

  • heirarchically_cluster_distance_matrix
  • Clustering.silhouettes
source
Mycelia.install_hashdeepMethod
install_hashdeep() -> Union{Nothing, Base.Process}

Ensures the hashdeep utility is installed on the system.

Checks if hashdeep is available in PATH and attempts to install it via apt package manager if not found. Will try with sudo privileges first, then without sudo if that fails.

Details

  • Checks PATH for existing hashdeep executable
  • Attempts installation using apt package manager
  • Requires a Debian-based Linux distribution

Returns

  • Nothing, but prints status messages during execution
source
Mycelia.is_equivalentMethod
is_equivalent(a, b) -> Any

Check if two biological sequences are equivalent, considering both direct and reverse complement matches.

Arguments

  • a: First biological sequence (BioSequence or compatible type)
  • b: Second biological sequence (BioSequence or compatible type)

Returns

  • Bool: true if sequences are identical or if one is the reverse complement of the other, false otherwise
source
Mycelia.isolate_normalized_primary_contigMethod
isolate_normalized_primary_contig(
    assembled_fasta,
    assembled_gfa,
    qualimap_report_txt,
    identifier,
    k::Int64;
    primary_contig_fasta
) -> String

Primary contig is defined as the contig with the most bases mapped to it

In the context of picking out phage from metagenomic assemblies the longest contig is often bacteria whereas the highest coverage contigs are often primer-dimers or other PCR amplification artifacts.

Taking the contig that has the most bases mapped to it as a product of length * depth is cherry picked as our phage

Isolates and exports the primary contig from an assembly based on coverage depth × length.

The primary contig is defined as the contig with the highest total mapped bases (coverage depth × length). This method helps identify potential phage contigs in metagenomic assemblies, avoiding both long bacterial contigs and short high-coverage PCR artifacts.

Arguments

  • assembled_fasta: Path to the assembled contigs in FASTA format
  • assembled_gfa: Path to the assembly graph in GFA format
  • qualimap_report_txt: Path to Qualimap coverage report
  • identifier: String identifier for the output file
  • k: Integer representing k-mer size used in assembly
  • primary_contig_fasta: Optional output filepath (default: "{identifier}.primary_contig.fna")

Returns

  • Path to the output FASTA file containing the primary contig

Notes

  • For circular contigs, removes the k-mer closure scar if detected
  • Trims k bases from the end if they match the first k bases
  • Uses coverage × length to avoid both long bacterial contigs and short PCR artifacts
source
Mycelia.iterative_polishingFunction
iterative_polishing(
    fastq
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})
iterative_polishing(
    fastq,
    max_k,
    plot
) -> Vector{T} where T<:(NamedTuple{(:fastq, :k), <:Tuple{Any, Any}})

Performs iterative error correction on FASTQ sequences using progressively larger k-mer sizes.

Starting with the default k-mer size, this function repeatedly applies polishing steps, incrementing the k-mer size until either reaching max_k or encountering instability.

Arguments

  • fastq: Path to input FASTQ file or FastqRecord object
  • max_k: Maximum k-mer size to attempt (default: 89)
  • plot: Whether to generate diagnostic plots (default: false)

Returns

Vector of polishing results, where each element contains:

  • k: k-mer size used
  • fastq: resulting polished sequences
source
Mycelia.jaccard_distanceMethod
jaccard_distance(set1, set2) -> Any

Calculate the Jaccard distance between two sets, which is the complement of the Jaccard similarity.

The Jaccard distance is defined as: $J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A ∩ B|}{|A ∪ B|}$

Arguments

  • set1: First set to compare
  • set2: Second set to compare

Returns

  • Float64: A value in [0,1] where 0 indicates identical sets and 1 indicates disjoint sets
source
Mycelia.jaccard_similarityMethod
jaccard_similarity(set1, set2) -> Any

Compute the Jaccard similarity coefficient between two sets.

The Jaccard similarity is defined as the size of the intersection divided by the size of the union of two sets:

J(A,B) = |A ∩ B| / |A ∪ B|

Arguments

  • set1: First set for comparison
  • set2: Second set for comparison

Returns

  • Float64: A value between 0.0 and 1.0, where:
    • 1.0 indicates identical sets
    • 0.0 indicates completely disjoint sets
source
Mycelia.jellyfish_countMethod
jellyfish_count(
;
    fastx,
    k,
    threads,
    max_mem,
    canonical,
    outfile,
    conda_check
)

Count k-mers in a FASTA/FASTQ file using Jellyfish.

Arguments

  • fastx::String: Path to input FASTA/FASTQ file (can be gzipped)
  • k::Integer: k-mer length
  • threads::Integer=Sys.CPU_THREADS: Number of threads to use
  • max_mem::Integer=Int(Sys.free_memory()): Maximum memory in bytes (defaults to system free memory)
  • canonical::Bool=false: Whether to count canonical k-mers (both strands combined)
  • outfile::String=auto: Output filename (auto-generated based on input and parameters)
  • conda_check::Bool=true: Whether to verify Jellyfish conda installation

Returns

  • String: Path to gzipped TSV file containing k-mer counts
source
Mycelia.jellyfish_counts_to_kmer_frequency_histogramFunction
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file,
    outfile
) -> Any

Convert a Jellyfish k-mer count file into a frequency histogram.

Arguments

  • jellyfish_counts_file::String: Path to the gzipped TSV file containing Jellyfish k-mer counts
  • outfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv"): Optional output file path

Returns

  • String: Path to the generated histogram file

Description

Processes a Jellyfish k-mer count file to create a frequency histogram where:

  • Column 1: Number of k-mers that share the same count
  • Column 2: The count they share

Uses system sorting with LC_ALL=C for optimal performance on large files.

Notes

  • Requires gzip, sort, uniq, and sed command line tools
  • Uses intermediate disk storage for sorting large files
  • Skips processing if output file already exists
source
Mycelia.jitterMethod
jitter(x, n) -> Any

Add random noise to create a vector of jittered values.

Generates n values by adding random noise to the input value x. The noise is uniformly distributed between -1/3 and 1/3.

Arguments

  • x: Base value to add jitter to
  • n: Number of jittered values to generate

Returns

  • Vector of length n containing jittered values around x
source
Mycelia.joint_base_quality_scoreMethod
joint_base_quality_score(
    error_probabilities::Vector{Float64}
) -> Float64

Calculate the quality score for a single base given multiple observations.

This function implements the "Converting to Error Probabilities and Combining" method:

  1. Takes error probabilities from multiple reads covering the same base
  2. Calculates probability of ALL reads being wrong by multiplying probabilities
  3. Calculates final Phred score from this combined probability

To avoid numerical underflow with very small probabilities, the calculation is performed in log space.

Arguments

  • error_probabilities::Vector{Float64}: Vector of error probabilities from multiple reads covering the same base position

Returns

  • Float64: Phred quality score representing the combined confidence
source
Mycelia.kmer_counts_dict_to_vectorMethod
kmer_counts_dict_to_vector(
    kmer_to_index_map,
    kmer_counts
) -> Any

Convert a dictionary of k-mer counts to a fixed-length numeric vector based on a predefined mapping.

Arguments

  • kmer_to_index_map: Dictionary mapping k-mer sequences to their corresponding vector indices
  • kmer_counts: Dictionary containing k-mer sequences and their occurrence counts

Returns

  • A vector where each position corresponds to a k-mer count, with zeros for absent k-mers
source
Mycelia.kmer_counts_to_cosine_similarityMethod
kmer_counts_to_cosine_similarity(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the cosine similarity between two k-mer count dictionaries.

Arguments

  • kmer_counts_1::Dict{String,Int}: First dictionary mapping k-mer sequences to their counts
  • kmer_counts_2::Dict{String,Int}: Second dictionary mapping k-mer sequences to their counts

Returns

  • Float64: Cosine distance between the two k-mer count vectors, in range [0,1] where 0 indicates identical distributions and 1 indicates maximum dissimilarity

Details

Converts k-mer count dictionaries into vectors using a unified set of keys, then computes cosine distance. Missing k-mers are treated as count 0. Result is invariant to input order and total counts (normalized internally).

source
Mycelia.kmer_counts_to_js_divergenceMethod
kmer_counts_to_js_divergence(
    kmer_counts_1,
    kmer_counts_2
) -> Any

Calculate the Jensen-Shannon divergence between two k-mer frequency distributions.

Arguments

  • kmer_counts_1: Dictionary mapping k-mers to their counts in first sequence
  • kmer_counts_2: Dictionary mapping k-mers to their counts in second sequence

Returns

  • Normalized Jensen-Shannon divergence score between 0 and 1, where:
    • 0 indicates identical distributions
    • 1 indicates maximally different distributions

Notes

  • The measure is symmetric: JS(P||Q) = JS(Q||P)
  • Counts are automatically normalized to probability distributions
source
Mycelia.kmer_counts_to_merqury_qvMethod
kmer_counts_to_merqury_qv(
;
    raw_data_counts,
    assembly_counts
)

Calculate assembly Quality Value (QV) score using the Merqury method.

Estimates base-level accuracy by comparing k-mer distributions between raw sequencing data and assembly. Higher QV scores indicate better assembly quality.

Arguments

  • raw_data_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from raw sequencing data
  • assembly_counts::AbstractDict{Kmers.DNAKmer{k,N}, Int}: K-mer counts from assembly

Returns

  • Float64: Quality Value score in Phred scale (-10log₁₀(error rate))

Method

QV is calculated using:

  1. Ktotal = number of unique kmers in assembly
  2. Kshared = number of kmers shared between raw data and assembly
  3. P = (Kshared/Ktotal)^(1/k) = estimated base-level accuracy
  4. QV = -10log₁₀(1-P)

Reference

Rhie et al. "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies" Genome Biology (2020)

source
Mycelia.kmer_path_to_sequenceMethod
kmer_path_to_sequence(kmer_path) -> Any

Convert a path of overlapping k-mers into a single DNA sequence.

Arguments

  • kmer_path: Vector of k-mers (DNA sequences) where each consecutive pair overlaps by k-1 bases

Returns

  • BioSequences.LongDNA{2}: Assembled DNA sequence from the k-mer path

Description

Reconstructs the original DNA sequence by joining k-mers, validating that consecutive k-mers overlap correctly. The first k-mer is used in full, then each subsequent k-mer contributes its last base.

source
Mycelia.kmer_quality_scoreFunction
kmer_quality_score(base_qualities::Vector{Float64}) -> Any
kmer_quality_score(
    base_qualities::Vector{Float64},
    method::Symbol
) -> Any

Calculate kmer quality score using the specified aggregation method.

Available methods:

  • :min: Use the minimum base quality (default)
  • :mean: Use the mean of all base qualities
  • :geometric: Use the geometric mean (appropriate for probabilities)
  • :harmonic: Use the harmonic mean (emphasizes lower values)

Arguments

  • base_qualities::Vector{Float64}: Vector of quality scores for each base
  • method::Symbol: Method to use for aggregation

Returns

  • Float64: Overall quality score for the kmer
source
Mycelia.ksMethod
ks(; min, max) -> Vector{Int64}

Generates a specialized sequence of prime numbers combining:

  • Odd primes up to 23 (flip_point)
  • Primes nearest to Fibonacci numbers above 23 up to max

Arguments

  • min::Int=0: Lower bound for the sequence
  • max::Int=10_000: Upper bound for the sequence

Returns

Vector of Int containing the specialized prime sequence

source
Mycelia.lawrencium_sbatchMethod
lawrencium_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    qos,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM scheduler on Lawrence Berkeley Lab's Lawrencium cluster.

Arguments

  • job_name: Name identifier for the SLURM job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
  • logdir: Directory for SLURM output and error logs
  • partition: Lawrencium compute partition
  • qos: Quality of Service level
  • account: Project account for billing
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to spawn
  • time: Wall time limit in format "days-hours:minutes:seconds"
  • cpus_per_task: CPU cores per task
  • mem_gb: Memory per node in GB
  • cmd: Shell command to execute

Returns

  • true if submission was successful

Note

Function includes 5-second delays before and after submission for queue stability.

source
Mycelia.list_blastdbsMethod
list_blastdbs(; source) -> DataFrames.DataFrame

Lists available BLAST databases from the specified source.

source
Mycelia.list_classesMethod
list_classes() -> DataFrames.DataFrame

Returns an array of all taxonomic classes in the database.

Classes represent a major taxonomic rank between phylum and order in biological classification.

Returns

  • Vector{String}: Array of class names sorted alphabetically
source
Mycelia.list_databasesMethod
list_databases(; address, username, password)

Lists all available Neo4j databases on the specified server.

Arguments

  • address::String: Neo4j server address (e.g. "neo4j://localhost:7687")
  • username::String="neo4j": Neo4j authentication username
  • password::String: Neo4j authentication password

Returns

  • DataFrame: Contains database information with columns typically including:
    • name: Database name
    • address: Database address
    • role: Database role (e.g., primary, secondary)
    • status: Current status (e.g., online, offline)
    • default: Boolean indicating if it's the default database
source
Mycelia.list_familiesMethod
list_families() -> DataFrames.DataFrame

Returns a sorted vector of all family names present in the database.

source
Mycelia.list_full_taxonomyMethod
list_full_taxonomy() -> DataFrames.DataFrame

Retrieves and formats the complete NCBI taxonomy hierarchy into a structured DataFrame.

Details

  • Automatically sets up taxonkit environment and downloads taxonomy database if needed
  • Starts from root taxid (1) and includes all descendant taxa
  • Reformats lineage information into separate columns for each taxonomic rank

Returns

DataFrame with columns:

  • taxid: Taxonomy identifier
  • lineage: Full taxonomic lineage string
  • taxid_lineage: Lineage with taxonomy IDs
  • Individual rank columns:
    • superkingdom, kingdom, phylum, class, order, family, genus, species
    • corresponding taxid columns (e.g., superkingdom_taxid)

Dependencies

Requires taxonkit (installed automatically via Bioconda)

source
Mycelia.list_generaMethod
list_genera() -> DataFrames.DataFrame

Returns a sorted vector of all genera names present in the database.

source
Mycelia.list_kingdomsMethod
list_kingdoms() -> DataFrames.DataFrame

Lists all taxonomic kingdoms in the database.

Returns a vector of kingdom names as strings. Kingdoms represent the highest major taxonomic rank in biological classification.

Returns

  • Vector{String}: Array of kingdom names
source
Mycelia.list_ordersMethod
list_orders() -> DataFrames.DataFrame

Lists all orders in the taxonomic database.

Returns a vector of strings containing valid order names according to current mycological taxonomy. Uses the underlying list_rank() function with rank="order".

Returns

  • Vector{String}: Alphabetically sorted list of order names
source
Mycelia.list_phylumsMethod
list_phylums() -> DataFrames.DataFrame

Returns a sorted list of all unique phyla in the database.

source
Mycelia.list_rankMethod
list_rank(rank) -> DataFrames.DataFrame

List all taxonomic entries at the specified rank level.

Arguments

  • rank::String: Taxonomic rank to query. Must be one of:
    • "top" (top level)
    • "superkingdom"/"domain"
    • "kingdom"
    • "phylum"
    • "class"
    • "order"
    • "family"
    • "genus"
    • "species"

Returns

DataFrame with columns:

  • taxid: NCBI taxonomy ID
  • name: Scientific name at the specified rank
source
Mycelia.list_ranksMethod
list_ranks(; synonyms) -> Vector{String}

Return an ordered list of taxonomic ranks from highest (top) to lowest (species).

Arguments

  • synonyms::Bool=false: If true, includes alternative names for certain ranks (e.g. "domain" for "superkingdom")

Returns

  • Vector{String}: An array of taxonomic rank names in hierarchical order
source
Mycelia.list_speciesMethod
list_species() -> DataFrames.DataFrame

Returns a sorted vector of all species names present in the database.

source
Mycelia.list_subtaxaMethod
list_subtaxa(taxid) -> Vector{Int64}

Returns an array of Integer taxon IDs representing all sub-taxa under the specified taxonomic ID.

Arguments

  • taxid: NCBI taxonomy identifier for the parent taxon

Returns

Vector{Int} containing all descendant taxon IDs

Details

  • Requires taxonkit to be installed via Bioconda
  • Automatically sets up taxonkit database if not present
  • Uses local taxonomy database in ~/.taxonkit/
source
Mycelia.list_superkingdomsMethod
list_superkingdoms() -> DataFrames.DataFrame

Returns an array of all taxonomic superkingdoms (e.g., Bacteria, Archaea, Eukaryota).

Returns

  • Vector{String}: Array containing names of all superkingdoms in the taxonomy database
source
Mycelia.list_toplevelMethod
list_toplevel() -> DataFrames.DataFrame

Returns a DataFrame containing the top-level taxonomic nodes.

The DataFrame has two fixed rows representing the most basic taxonomic classifications:

  • taxid=0: "unclassified"
  • taxid=1: "root"

Returns

DataFrame Columns: - taxid::Int : Taxonomic identifier - name::String : Node name

source
Mycelia.load_blast_db_taxonomy_tableMethod
load_blast_db_taxonomy_table(
    compressed_blast_db_taxonomy_table_file
) -> DataFrames.DataFrame

Loads a BLAST database taxonomy mapping table from a gzipped file into a DataFrame.

Arguments

  • compressed_blast_db_taxonomy_table_file::String: Path to a gzipped file containing BLAST taxonomy mappings

Returns

  • DataFrame: A DataFrame with columns :sequence_id and :taxid containing the sequence-to-taxonomy mappings

Format

Input file should be a space-delimited text file (gzipped) with two columns:

  1. sequence identifier
  2. taxonomy identifier (taxid)
source
Mycelia.load_genbank_metadataMethod
load_genbank_metadata() -> DataFrames.DataFrame

Load metadata for GenBank sequences into a DataFrame.

This is a convenience wrapper around load_ncbi_metadata("genbank") that specifically loads metadata from the GenBank database.

Returns

  • DataFrame: Contains metadata fields like accession numbers, taxonomy,

and sequence information from GenBank.

source
Mycelia.load_graphMethod
load_graph(file) -> Any

Load a graph structure from a serialized file.

Arguments

  • file::AbstractString: Path to the file containing the serialized graph data

Returns

  • The deserialized graph object
source
Mycelia.load_graphMethod
load_graph(file::String) -> Any

Loads a graph object from a serialized file.

Arguments

  • file::String: Path to the file containing the serialized graph data. The file should have been created using save_graph.

Returns

  • The deserialized graph object stored under the "graph" key.
source
Mycelia.load_jellyfish_countsMethod
load_jellyfish_counts(
    jellyfish_counts
) -> DataFrames.DataFrame

Load k-mer counts from a Jellyfish output file into a DataFrame.

Arguments

  • jellyfish_counts::String: Path to a gzipped TSV file (*.jf.tsv.gz) containing Jellyfish k-mer counts

Returns

  • DataFrame: Table with columns:
    • kmer: Biologically encoded k-mers as DNAKmer{k} objects
    • count: Integer count of each k-mer's occurrences

Notes

  • Input file must be a gzipped TSV with exactly two columns (k-mer sequences and counts)
  • K-mer length is automatically detected from the first entry
  • Filename must end with '.jf.tsv.gz'
source
Mycelia.load_jld2Method
load_jld2(filename) -> Any

Load data stored in a JLD2 file format.

Arguments

  • filename::String: Path to the JLD2 file to load

Returns

  • Dict: Dictionary containing the loaded data structures
source
Mycelia.load_matrix_jld2Method
load_matrix_jld2(filename) -> Any

Loads a matrix from a JLD2 file.

Arguments

  • filename::String: Path to the JLD2 file containing the matrix under the key "matrix"

Returns

  • Matrix: The loaded matrix data
source
Mycelia.load_ncbi_metadataMethod
load_ncbi_metadata(db) -> DataFrames.DataFrame

Load and parse the assembly summary metadata from NCBI's FTP server for either GenBank or RefSeq databases.

Arguments

  • db::String: Database source, must be either "genbank" or "refseq"

Returns

  • DataFrame: Parsed metadata table with properly typed columns including:
    • Integer columns: taxid, species_taxid, genome metrics, and gene counts
    • Float columns: gc_percent
    • Date columns: seqreldate, annotation_date
    • String columns: all other fields

Details

Downloads the assembly summary file from NCBI's FTP server and processes it by:

  1. Parsing the tab-delimited file with commented headers
  2. Converting numeric strings to proper Integer/Float types
  3. Parsing date strings to Date objects
  4. Handling missing values throughout
source
Mycelia.load_ncbi_taxonomyMethod
load_ncbi_taxonomy(
;
    path_to_taxdump
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Downloads and constructs a MetaDiGraph representation of the NCBI taxonomy database.

Arguments

  • path_to_taxdump: Directory path where taxonomy files will be downloaded and extracted

Returns

  • MetaDiGraph: A directed graph where:
    • Vertices represent taxa with properties:
      • :tax_id: NCBI taxonomy identifier
      • :scientific_name, :common_name, etc.: Name properties
      • :rank: Taxonomic rank
      • :division_id, :division_cde, :division_name: Division information
    • Edges represent parent-child relationships in the taxonomy

Dependencies

Requires internet connection for initial download. Uses DataFrames, MetaGraphs, and ProgressMeter.

source
Mycelia.load_refseq_metadataMethod
load_refseq_metadata() -> DataFrames.DataFrame

Loads NCBI RefSeq metadata into a DataFrame. RefSeq is NCBI's curated collection of genomic, transcript and protein sequences.

Returns

  • DataFrame: Contains metadata columns including accession numbers, taxonomic information,

and sequence details from RefSeq.

source
Mycelia.local_blast_database_infoMethod
local_blast_database_info(; blastdbs_dir) -> Any

Query information about local BLAST databases and return a formatted summary.

Arguments

  • blastdbs_dir::String: Directory containing BLAST databases (default: "~/workspace/blastdb")

Returns

  • DataFrame with columns:
    • BLAST database path
    • BLAST database molecule type
    • BLAST database title
    • date of last update
    • number of bases/residues
    • number of sequences
    • number of bytes
    • BLAST database format version
    • human readable size

Dependencies

Requires NCBI BLAST+ tools. Will attempt to install via apt-get if not present.

Side Effects

  • May install system packages (ncbi-blast+, perl-doc) using sudo/apt-get
  • Filters out numbered database fragments from results
source
Mycelia.merge_colorsMethod
merge_colors(c1, c2) -> Any

Merge two colors by calculating their minimal color difference vector.

Arguments

  • c1::Color: First color input
  • c2::Color: Second color input

Returns

  • If colors are equal, returns the input color
  • Otherwise returns the color difference vector (c1-c2 or c2-c1) with minimal RGB sum

Details

Calculates two difference vectors:

  • mix_a = c1 - c2
  • mix_b = c2 - c1

Returns the difference vector with the smallest sum of RGB components.

source
Mycelia.merge_fasta_filesMethod
merge_fasta_files(; fasta_files, fasta_file)

Join fasta files while adding origin prefixes to the identifiers.

Does not guarantee uniqueness but will warn if conflicts arise

source
Mycelia.metasha256Method
metasha256(
    vector_of_sha256s::Vector{<:AbstractString}
) -> String

Compute a single SHA256 hash from multiple SHA256 hashes.

Takes a vector of hex-encoded SHA256 hashes and produces a new SHA256 hash by:

  1. Sorting the input hashes lexicographically
  2. Concatenating them in sorted order
  3. Computing a new SHA256 hash over the concatenated data

Arguments

  • vector_of_sha256s: Vector of hex-encoded SHA256 hash strings

Returns

  • A hex-encoded string representing the computed meta-hash
source
Mycelia.minimap_indexMethod
minimap_index(
;
    fasta,
    mem_gb,
    mapping_type,
    threads,
    as_string,
    denominator
)

Generate a minimap2 index command and output file path for mapping DNA sequencing reads.

Run this on the machine you intend to use to map the reads to confirm the index will fit.

Arguments

  • fasta: Path to the reference FASTA file to be indexed
  • mem_gb: Available system memory in gigabytes for indexing
  • mapping_type: Sequencing technology preset. One of:
    • "map-hifi": PacBio HiFi reads
    • "map-ont": Oxford Nanopore reads
    • "map-pb": PacBio CLR reads
    • "sr": Short reads
    • "lr:hq": High-quality long reads
  • threads: Number of threads to use for indexing
  • as_string: Return command as String instead of Cmd (default: false)
  • denominator: Divisor for calculating index size (default: 10)

Returns

Named tuple containing:

  • cmd: The minimap2 indexing command (as String or Cmd)
  • outfile: Path to the output index file (.mmi)
source
Mycelia.minimap_mapMethod
minimap_map(
;
    fasta,
    fastq,
    mapping_type,
    as_string,
    mem_gb,
    threads,
    denominator
)

Generate minimap2 alignment commands for sequence mapping.

aligning and compressing. No sorting or filtering.

Use shell_only=true to get string command to submit to SLURM

Creates a command to align reads in FASTQ format to a reference FASTA using minimap2, followed by SAM compression with pigz. Handles resource allocation and conda environment setup.

Arguments

  • fasta: Path to reference FASTA file
  • fastq: Path to query FASTQ file
  • mapping_type: Alignment preset ("map-hifi", "map-ont", "map-pb", "sr", or "lr:hq")
  • as_string: If true, returns shell command as string; if false, returns command array
  • mem_gb: Available memory in GB for indexing (defaults to system free memory)
  • threads: Number of CPU threads to use (defaults to system threads)
  • denominator: Divisor for calculating minimap2 index size

Returns

Named tuple containing:

  • cmd: Shell command (as string or array)
  • outfile: Path to compressed output SAM file
source
Mycelia.minimap_map_paired_endMethod
minimap_map_paired_end(
;
    fasta,
    forward,
    reverse,
    mem_gb,
    threads,
    outdir,
    as_string,
    mapping_type,
    denominator
)

Maps paired-end reads to a reference genome using minimap2 and compresses the output.

Arguments

  • fasta::String: Path to reference genome FASTA file
  • forward::String: Path to forward reads FASTQ file
  • reverse::String: Path to reverse reads FASTQ file
  • mem_gb::Integer: Available system memory in GB
  • threads::Integer: Number of threads to use
  • outdir::String: Output directory (defaults to forward reads directory)
  • as_string::Bool: Return command as string instead of Cmd array
  • mapping_type::String: Mapping preset, e.g. "sr" for short reads (default)
  • denominator::Float64: Memory scaling factor for minimap2 index

Returns

Named tuple containing:

  • cmd: Command(s) to execute (String or Vector{Cmd})
  • outfile: Path to compressed output SAM file (*.sam.gz)

Dependencies

Requires bioconda packages: minimap2, samtools, pigz

source
Mycelia.minimap_map_paired_end_with_indexMethod
minimap_map_paired_end_with_index(
;
    fasta,
    forward,
    reverse,
    mem_gb,
    threads,
    outdir,
    as_string,
    mapping_type,
    denominator
)

Map paired-end reads to a reference sequence using minimap2.

Arguments

  • fasta::String: Path to reference FASTA file
  • forward::String: Path to forward reads FASTQ file
  • reverse::String: Path to reverse reads FASTQ file
  • mem_gb::Integer: Available system memory in GB
  • threads::Integer: Number of threads to use
  • outdir::String: Output directory (defaults to forward reads directory)
  • as_string::Bool=false: Return command as string instead of Cmd array
  • mapping_type::String="sr": Minimap2 preset ["map-hifi", "map-ont", "map-pb", "sr", "lr:hq"]
  • denominator::Float64: Memory scaling factor for index size

Returns

Named tuple containing:

  • cmd: Command(s) to execute (String or Array{Cmd})
  • outfile: Path to compressed output SAM file (*.sam.gz)

Notes

  • Requires minimap2, samtools, and pigz conda environments
  • Automatically compresses output using pigz
  • Index file must exist at $(fasta).x$(mapping_type).I$(index_size).mmi
source
Mycelia.minimap_map_with_indexMethod
minimap_map_with_index(
;
    fasta,
    mem_gb,
    mapping_type,
    threads,
    fastq,
    as_string,
    denominator
)

Generate minimap2 mapping commands with a pre-built index file.

Arguments

  • fasta: Path to the reference FASTA file
  • mem_gb: Available system memory in gigabytes for index generation
  • mapping_type: Mapping preset. Must be one of: "map-hifi", "map-ont", "map-pb", "sr", or "lr:hq"
  • threads: Number of threads to use for mapping and compression
  • fastq: Path to input FASTQ file
  • as_string: If true, returns command as a string; if false, returns as command array
  • denominator: Divisor for index size calculation (default: 10)

Returns

Named tuple containing:

  • cmd: The minimap2 mapping command (string or array)
  • outfile: Path to the output compressed SAM file

Notes

  • Requires pre-built index file with pattern: ${fasta}.x${mapping_type}.I${index_size}.mmi
  • Automatically installs required conda environments (minimap2, samtools, pigz)
  • Output is automatically compressed with pigz
source
Mycelia.mmseqs_pairwise_searchMethod
mmseqs_pairwise_search(; fasta, output)

Perform all-vs-all sequence search using MMseqs2's easy-search command.

Arguments

  • fasta::String: Path to input FASTA file containing sequences to compare
  • output::String: Output directory path (default: input filename + ".mmseqseasysearch_pairwise")

Returns

  • String: Path to the output directory

Details

Executes MMseqs2 with sensitive search parameters (7 sensitivity steps) and outputs results in tabular format with the following columns:

  • query, qheader: Query sequence ID and header
  • target, theader: Target sequence ID and header
  • pident: Percentage sequence identity
  • fident: Fraction of identical matches
  • nident: Number of identical matches
  • alnlen: Alignment length
  • mismatch: Number of mismatches
  • gapopen: Number of gap openings
  • qstart, qend, qlen: Query sequence coordinates and length
  • tstart, tend, tlen: Target sequence coordinates and length
  • evalue: Expected value
  • bits: Bit score

Requires MMseqs2 to be available through Bioconda.

source
Mycelia.mutate_sequenceMethod
mutate_sequence(reference_sequence) -> Tuple{Any, Any}

Generate a single random mutation in an amino acid sequence.

Arguments

  • reference_sequence: Input amino acid sequence to be mutated

Returns

  • mutant_sequence: The sequence after applying the mutation
  • haplotype: A SequenceVariation.Haplotype object containing the mutation details

Details

Performs one of three possible mutation types:

  • Substitution: Replace one amino acid with another
  • Insertion: Insert 1+ random amino acids at a position
  • Deletion: Remove 1+ amino acids from a position

Insertion and deletion sizes follow a truncated Poisson distribution (λ=1, min=1).

source
Mycelia.n_maximally_distinguishable_colorsMethod
n_maximally_distinguishable_colors(n) -> Any

Generate n colors that are maximally distinguishable from each other.

Arguments

  • n::Integer: The number of distinct colors to generate

Returns

A vector of n RGB colors that are optimized for maximum perceptual distinction, using white (RGB(1,1,1)) and black (RGB(0,0,0)) as anchor colors.

source
Mycelia.name2taxidMethod
name2taxid(name) -> DataFrames.DataFrame

Convert scientific name(s) to NCBI taxonomy ID(s) using taxonkit.

Arguments

  • name::AbstractString: Scientific name(s) to query. Can be a single name or multiple names separated by newlines.

Returns

  • DataFrame with columns:
    • name: Input scientific name
    • taxid: NCBI taxonomy ID
    • rank: Taxonomic rank (e.g., "species", "genus")

Dependencies

Requires taxonkit package (installed automatically via Bioconda)

source
Mycelia.names2taxidsMethod
names2taxids(names::AbstractVector{<:AbstractString}) -> Any

Convert a vector of species/taxon names to their corresponding NCBI taxonomy IDs.

Arguments

  • names::AbstractVector{<:AbstractString}: Vector of scientific names or common names

Returns

  • Vector{Int}: Vector of NCBI taxonomy IDs corresponding to the input names

Progress is displayed using ProgressMeter.

source
Mycelia.ncbi_ftp_path_to_urlMethod
ncbi_ftp_path_to_url(; ftp_path, extension)

Constructs a complete NCBI FTP URL by combining a base FTP path with a file extension.

Arguments

  • ftp_path::String: Base FTP directory path for the resource
  • extension::String: File extension to append to the resource name

Returns

  • String: Complete FTP URL path to the requested resource

Extensions include:

  • genomic.fna.gz
  • genomic.gff.gz
  • protein.faa.gz
  • assembly_report.txt
  • assembly_stats.txt
  • cdsfromgenomic.fna.gz
  • feature_count.txt.gz
  • feature_table.txt.gz
  • genomic.gbff.gz
  • genomic.gtf.gz
  • protein.gpff.gz
  • translated_cds.faa.gz
source
Mycelia.ncbi_genome_download_accessionMethod
ncbi_genome_download_accession(
;
    accession,
    outdir,
    outpath,
    include_string
)

Download an accession using NCBI datasets command line tool

the .zip download output to outpath will be unzipped

returns the outfolder

ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"

Downloads and extracts a genome from NCBI using the datasets command line tool.

Arguments

  • accession: NCBI accession number for the genome
  • outdir: Directory where files will be downloaded (defaults to current directory)
  • outpath: Full path for the temporary zip file (defaults to outdir/accession.zip)
  • include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").

Returns

  • Path to the extracted genome data directory

Notes

  • Requires the ncbi-datasets-cli conda package (automatically installed if missing)
  • Downloaded zip file is automatically removed after extraction
  • If output folder already exists, download is skipped
  • Data is extracted to outdir/accession/ncbi_dataset/data/accession
source
Mycelia.ncbi_taxon_summaryMethod
ncbi_taxon_summary(taxa_id) -> DataFrames.DataFrame

Retrieve taxonomic information for a given NCBI taxonomy ID.

Arguments

  • taxa_id: NCBI taxonomy identifier (integer)

Returns

  • DataFrame: Taxonomy summary containing fields like tax_id, rank, species, etc.
source
Mycelia.nearest_primeMethod
nearest_prime(n::Int64) -> Int64

Find the closest prime number to the given integer n.

Returns the nearest prime number to n. If two prime numbers are equally distant from n, returns the smaller one.

Arguments

  • n::Int: The input integer to find the nearest prime for

Returns

  • Int: The closest prime number to n
source
Mycelia.nersc_sbatchMethod
nersc_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    scriptdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a batch job to NERSC's SLURM workload manager.

Arguments

  • job_name: Identifier for the SLURM job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", or "NONE")
  • logdir: Directory for storing job output/error logs
  • scriptdir: Directory for storing generated SLURM scripts
  • qos: Quality of Service level ("regular", "premium", or "preempt")
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to run
  • time: Maximum wall time in format "days-HH:MM:SS"
  • cpus_per_task: CPU cores per task
  • mem_gb: Memory per node in GB
  • cmd: Command(s) to execute (String or Vector{String})
  • constraint: Node type constraint ("cpu" or "gpu")

Returns

  • true if job submission succeeds
  • false if submission fails

QoS Options

  • regular: Standard priority queue
  • premium: High priority queue (5x throughput limit)
  • preempt: Reduced credit usage but jobs may be interrupted

https://docs.nersc.gov/jobs/policy/ https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes

default is to use shared qos

use

  • regular
  • preempt (reduced credit usage but not guaranteed to finish)
  • premium (priorty runs limited to 5x throughput)

https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks

source
Mycelia.nersc_sbatch_sharedMethod
nersc_sbatch_shared(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    qos,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd,
    constraint
)

Submit a job to NERSC's SLURM scheduler using the shared QOS (Quality of Service).

Arguments

  • job_name: Identifier for the job
  • mail_user: Email address for job notifications
  • mail_type: Notification type ("ALL", "BEGIN", "END", "FAIL", "REQUEUE", "STAGE_OUT")
  • logdir: Directory for storing job output and error logs
  • qos: Quality of Service level ("shared", "regular", "preempt", "premium")
  • nodes: Number of nodes to allocate
  • ntasks: Number of tasks to run
  • time: Maximum wall time in format "days-hours:minutes:seconds"
  • cpus_per_task: Number of CPUs per task
  • mem_gb: Memory per node in GB (default: 2GB per CPU)
  • cmd: Command to execute
  • constraint: Node type constraint ("cpu" or "gpu")

Resource Limits

  • Maximum memory per node: 512GB
  • Maximum cores per node: 128
  • Default memory allocation: 2GB per CPU requested

QOS Options

  • shared: Default QOS for shared node usage
  • regular: Standard priority
  • preempt: Reduced credit usage but preemptible
  • premium: 5x throughput priority (limited usage)

Returns

true if job submission succeeds

https://docs.nersc.gov/jobs/policy/ https://docs.nersc.gov/systems/perlmutter/architecture/#cpu-nodes

default is to use shared qos

use

  • regular
  • preempt (reduced credit usage but not guaranteed to finish)
  • premium (priority runs limited to 5x throughput)

max request is 512Gb memory and 128 cores per node

https://docs.nersc.gov/systems/perlmutter/running-jobs/#tips-and-tricks

source
Mycelia.node_type_to_dataframeMethod
node_type_to_dataframe(; node_type, graph)

Convert all nodes of a specific type in a MetaGraph to a DataFrame representation.

Arguments

  • node_type: The type of nodes to extract from the graph
  • graph: A MetaGraph containing the nodes

Returns

A DataFrame where:

  • Each row represents a node of the specified type
  • Columns correspond to all unique properties found across nodes
  • Values are JSON-serialized strings for consistency

Notes

  • All values are normalized through JSON serialization
  • Dictionary values receive double JSON encoding
  • The TYPE column is converted using type_to_string
source
Mycelia.normalize_codon_frequenciesMethod
normalize_codon_frequencies(
    codon_frequencies
) -> Dict{BioSymbols.AminoAcid, Dict{Kmers.Kmer{BioSequences.DNAAlphabet{2}, 3, 1}, Float64}}

Normalizes codon frequencies for each amino acid such that frequencies sum to 1.0.

Arguments

  • codon_frequencies: Nested dictionary mapping amino acids to their codon frequency distributions

Returns

  • Normalized codon frequencies where values for each amino acid sum to 1.0
source
Mycelia.normalize_countmapMethod
normalize_countmap(countmap) -> Dict

Normalize a dictionary of counts into a probability distribution where values sum to 1.0.

Arguments

  • countmap::Dict: Dictionary mapping keys to count values

Returns

  • Dict: New dictionary with same keys but values normalized by total sum
source
Mycelia.normalize_distance_matrixMethod
normalize_distance_matrix(distance_matrix) -> Any

Create distance matrix from a column-major counts matrix (features as rows and entities as columns) where distance is a proportional to total feature count magnitude (size) and cosine similarity (relative frequency)

Normalize a distance matrix by dividing all elements by the maximum non-NaN value.

Arguments

  • distance_matrix: A matrix of distance values that may contain NaN, nothing, or missing values

Returns

  • Normalized distance matrix with values scaled to [0, 1] range

Details

  • Filters out NaN, nothing, and missing values when finding the maximum
  • All elements are divided by the same maximum value to preserve relative distances
  • If all values are NaN/nothing/missing, may return NaN values
source
Mycelia.normalize_vcfMethod
normalize_vcf(; reference_fasta, vcf_file)

Normalize a VCF file using bcftools norm, with automated handling of compression and indexing.

Arguments

  • reference_fasta::String: Path to the reference FASTA file used for normalization
  • vcf_file::String: Path to input VCF file (can be gzipped or uncompressed)

Returns

  • String: Path to the normalized, sorted, and compressed output VCF file (*.sorted.normalized.vcf.gz)

Notes

  • Requires bioconda packages: htslib, tabix, bcftools
  • Creates intermediate files with extensions .tbi for indices
  • Skips processing if output file already exists
  • Performs left-alignment and normalization of variants
source
Mycelia.normalized_current_datetimeMethod
normalized_current_datetime() -> String

Returns the current date and time as a normalized string with all non-word characters removed.

The output format is based on ISO datetime (YYYYMMDDThhmmss) but strips any special characters like hyphens, colons or dots.

source
Mycelia.observeMethod
observe(
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record};
    error_rate
)

Simulate sequencing of a DNA/RNA record by introducing random errors at the specified rate.

Arguments

  • record: A FASTA or FASTQ record containing the sequence to be "observed"
  • error_rate: Probability of error at each position (default: 0.0)

Returns

A new FASTQ.Record with:

  • Random UUID as identifier
  • Original record's description
  • Modified sequence with introduced errors
  • Generated quality scores
source
Mycelia.observeMethod
observe(sequence::BioSequences.LongSequence{T}; error_rate=nothing, tech::Symbol=:illumina) where T

Simulates the “observation” of a biological polymer (DNA, RNA, or protein) by introducing realistic errors along with base‐quality scores. The simulation takes into account both random and systematic error components. In particular, for technologies:

  • illumina: (mostly substitution errors) the per‐base quality decays along the read (from ~Q40 at the start to ~Q20 at the end);
  • nanopore: errors are more frequent and include both substitutions and indels (with overall lower quality scores, and an extra “homopolymer” penalty);
  • pacbio: errors are dominated by indels (with quality scores typical of raw reads);
  • ultima: (UG 100/ppmSeq™) correct bases are assigned very high quality (~Q60) while errors are extremely rare and, if they occur, are given a modest quality.

An error is introduced at each position with a (possibly position‐dependent) probability. For Illumina, the error probability increases along the read; additionally, if a base is part of a homopolymer run (length ≥ 3) and the chosen technology is one that struggles with homopolymers (nanopore, pacbio, ultima), then the local error probability is multiplied by a constant factor.

Returns a tuple (new_seq, quality_scores) where:

  • new_seq is a BioSequences.LongSequence{T} containing the “observed” sequence (which may be longer or shorter than the input if insertions or deletions occur), and
  • quality_scores is a vector of integers representing the Phred quality scores (using the Sanger convention) for each base in the output sequence.
source
Mycelia.open_fastxMethod
open_fastx(
    path::AbstractString
) -> Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader{T} where T<:(TranscodingStreams.TranscodingStream{C, S} where {S<:IO, C<:TranscodingStreams.Codec})}

Open and return a reader for FASTA or FASTQ format files.

Arguments

  • path::AbstractString: Path to input file. Can be:
    • Local file path
    • HTTP/FTP URL
    • Gzip compressed (.gz extension)

Supported formats

  • FASTA (.fasta, .fna, .faa, .fa)
  • FASTQ (.fastq, .fq)

Returns

  • FASTX.FASTA.Reader for FASTA files
  • FASTX.FASTQ.Reader for FASTQ files
source
Mycelia.open_genbankMethod
open_genbank(
    genbank_file
) -> Vector{GenomicAnnotations.Record}

Opens and parses a GenBank format file containing genomic sequence and annotation data.

Arguments

  • genbank_file::AbstractString: Path to the GenBank (.gb or .gbk) file

Returns

  • Vector{GenomicAnnotations.Chromosome}: Vector containing parsed chromosome data
source
Mycelia.open_gffMethod
open_gff(path::String) -> Any

Opens a GFF (General Feature Format) file for reading.

Arguments

  • path::String: Path to GFF file. Can be:
    • Local file path
    • HTTP/FTP URL (FTP URLs are automatically converted to HTTP)
    • Gzipped file (automatically decompressed)

Returns

  • IO: An IO stream ready to read the GFF content
source
Mycelia.optimal_subsequence_lengthMethod
optimal_subsequence_length(
;
    error_rate,
    threshold,
    sequence_length,
    plot_result
)

Calculate the optimal subsequence length based on error rate distribution.

Arguments

  • error_rate: Single error rate or array of error rates (between 0 and 1)
  • threshold: Desired probability that a subsequence is error-free (default: 0.95)
  • sequence_length: Maximum sequence length to consider for plotting
  • plot_result: If true, returns a plot of probability vs. length

Returns

  • If plot_result=false: Integer representing optimal subsequence length
  • If plot_result=true: Tuple of (optimal_length, plot)

Examples

# Single error rate
optimal_subsequence_length(error_rate=0.01)

# Array of error rates
optimal_subsequence_length(error_rate=[0.01, 0.02, 0.01])

# With more stringent threshold
optimal_subsequence_length(error_rate=0.01, threshold=0.99)

# Generate plot
length, p = optimal_subsequence_length(error_rate=0.01, plot_result=true)
Plots.display(p)
source
Mycelia.parse_blast_reportMethod
parse_blast_report(blast_report) -> DataFrames.DataFrame

Expects output type 7 from BLAST, default output type 6 doesn't have the header comments and won't auto-parse

Parse a BLAST output file into a structured DataFrame.

Arguments

  • blast_report::AbstractString: Path to a BLAST output file in format 7 (tabular with comments)

Returns

  • DataFrame: Table containing BLAST results with columns matching the header fields. Returns empty DataFrame if no hits found.

Details

  • Requires BLAST output format 7 (-outfmt 7), which includes header comments
  • Handles missing values (encoded as "N/A") automatically
  • Infers column types based on BLAST field names
  • Supports standard BLAST tabular fields including sequence IDs, scores, alignments and taxonomic information
source
Mycelia.parse_gfaMethod
parse_gfa(gfa) -> MetaGraphs.MetaGraph{Int64, Float64}

Parse a GFA (Graphical Fragment Assembly) file into a MetaGraph representation.

Arguments

  • gfa: Path to GFA format file

Returns

A MetaGraph where:

  • Vertices represent segments (contigs)
  • Edges represent links between segments
  • Vertex properties include :id with segment identifiers
  • Graph property :records contains the original FASTA records

Format Support

Handles standard GFA v1 lines:

  • H: Header lines (skipped)
  • S: Segments (stored as nodes with FASTA records)
  • L: Links (stored as edges)
  • P: Paths (stored in paths dictionary)
  • A: HiFiAsm specific lines (skipped)
source
Mycelia.parse_jsonlMethod
parse_jsonl(filepath::String) -> Vector{Dict{String, Any}}

Parse a JSONL (JSON Lines) file into a vector of dictionaries.

Arguments

  • filepath::String: Path to the JSONL file to parse

Returns

  • Vector{Dict{String, Any}}: Vector containing parsed JSON objects, one per line

Description

Reads a JSONL file line by line, parsing each line as a separate JSON object. Uses pre-allocation and progress tracking for efficient processing of large files.

source
Mycelia.parse_mmseqs_easy_taxonomy_lca_tsvMethod
parse_mmseqs_easy_taxonomy_lca_tsv(
    lca_tsv
) -> DataFrames.DataFrame

Parse the taxonomic Last Common Ancestor (LCA) TSV output from MMseqs2's easy-taxonomy workflow.

Arguments

  • lca_tsv: Path to the TSV file containing MMseqs2 taxonomy results

Returns

DataFrame with columns:

  • contig_id: Sequence identifier
  • taxon_id: NCBI taxonomy identifier
  • taxon_rank: Taxonomic rank (e.g. species, genus)
  • taxon_name: Scientific name
  • fragments_retained: Number of fragments kept
  • fragments_taxonomically_assigned: Number of fragments with taxonomy
  • fragments_in_agreement_with_assignment: Fragments matching contig taxonomy
  • support -log(E-value): Statistical support score
source
Mycelia.parse_mmseqs_easy_taxonomy_tophit_reportMethod
parse_mmseqs_easy_taxonomy_tophit_report(
    tophit_report
) -> DataFrames.DataFrame

Parse an MMseqs2 easy-taxonomy tophit report into a structured DataFrame.

Arguments

  • tophit_report::String: Path to the MMseqs2 easy-taxonomy tophit report file (tab-delimited)

Returns

  • DataFrame: A DataFrame with columns:
    • target_id: Target sequence identifier
    • number of sequences aligning to target: Count of aligned sequences
    • unique coverage of target: Ratio of uniqueAlignedResidues to targetLength
    • Target coverage: Ratio of alignedResidues to targetLength
    • Average sequence identity: Mean sequence identity
    • taxon_id: Taxonomic identifier
    • taxon_rank: Taxonomic rank
    • taxon_name: Species name and lineage
source
Mycelia.parse_mmseqs_tophit_alnMethod
parse_mmseqs_tophit_aln(tophit_aln) -> DataFrames.DataFrame

Parse MMseqs2 tophit alignment output file into a structured DataFrame.

Arguments

  • tophit_aln::AbstractString: Path to tab-delimited MMseqs2 alignment output file

Returns

DataFrame with columns:

  • query: Query sequence/profile identifier
  • target: Target sequence/profile identifier
  • percent identity: Sequence identity percentage
  • alignment length: Length of alignment
  • number of mismatches: Count of mismatched positions
  • number of gaps: Count of gap openings
  • query start: Start position in query sequence
  • query end: End position in query sequence
  • target start: Start position in target sequence
  • target end: End position in target sequence
  • evalue: E-value of alignment
  • bit score: Bit score of alignment
source
Mycelia.parse_qualimap_contig_coverageMethod
parse_qualimap_contig_coverage(
    qualimap_report_txt
) -> DataFrames.DataFrame

Parse contig coverage statistics from a Qualimap BAM QC report file.

Arguments

  • qualimap_report_txt::String: Path to Qualimap bamqc report text file

Returns

  • DataFrame: Coverage statistics with columns:
    • Contig: Contig identifier
    • Length: Contig length in bases
    • Mapped bases: Number of bases mapped to contig
    • Mean coverage: Average coverage depth
    • Standard Deviation: Standard deviation of coverage
    • % Mapped bases: Percentage of total mapped bases on this contig

Supported Assemblers

Handles output from both SPAdes and MEGAHIT assemblers:

  • SPAdes format: NODEXlengthYcov_Z
  • MEGAHIT format: kXX_Y

Parse the contig coverage information from qualimap bamqc text report, which looks like the following:

# this is spades
>>>>>>> Coverage per contig

	NODE_1_length_107478_cov_9.051896	107478	21606903	201.0355886786133	60.39424208607496
	NODE_2_length_5444_cov_1.351945	5444	153263	28.152645113886848	5.954250612823136
	NODE_3_length_1062_cov_0.154390	1062	4294	4.043314500941619	1.6655384692688975
	NODE_4_length_776_cov_0.191489	776	3210	4.13659793814433	2.252009588980858

# below is megahit
>>>>>>> Coverage per contig

	k79_175	235	3862	16.43404255319149	8.437436249612457
	k79_89	303	3803	12.551155115511552	5.709975376279777
	k79_262	394	6671	16.931472081218274	7.579217802849293
	k79_90	379	1539	4.060686015831134	1.2929729111266581
	k79_91	211	3749	17.767772511848342	11.899185693011933
	k79_0	2042	90867	44.49902056807052	18.356525483516613

To make this more robust, consider reading in the names of the contigs from the assembled fasta

source
Mycelia.parse_rtg_eval_outputMethod
parse_rtg_eval_output(f) -> DataFrames.DataFrame

Parse RTG evaluation output from a gzipped tab-separated file.

Arguments

  • f: Path to a gzipped TSV file containing RTG evaluation output

Format

Expected file format:

  • Header line starting with '#' and tab-separated column names
  • Data rows in tab-separated format
  • Empty files return a DataFrame with empty columns matching header

Returns

A DataFrame where:

  • Column names are taken from the header line (stripped of '#')
  • Data is parsed as Float64 values
  • Empty files result in empty columns preserving header structure
source
Mycelia.parse_transterm_outputMethod
parse_transterm_output(
    transterm_output
) -> DataFrames.DataFrame

Parse TransTerm terminator prediction output into a structured DataFrame.

Takes a TransTerm output file path and returns a DataFrame containing parsed terminator predictions. Each row represents one predicted terminator with the following columns:

  • chromosome: Identifier of the sequence being analyzed
  • term_id: Unique terminator identifier (e.g. "TERM 19")
  • start: Start position of the terminator
  • stop: End position of the terminator
  • strand: Strand orientation ("+" or "-")
  • location: Context type, where:
    • G/g = in gene interior (≥50bp from ends)
    • F/f = between two +strand genes
    • R/r = between two -strand genes
    • T = between ends of +strand and -strand genes
    • H = between starts of +strand and -strand genes
    • N = none of the above
    Lowercase indicates opposite strand from region
  • confidence: Overall confidence score (0-100)
  • hairpin_score: Hairpin structure score
  • tail_score: Tail sequence score
  • notes: Additional annotations (e.g. "bidir")

Arguments

  • transterm_output::AbstractString: Path to TransTerm output file

Returns

  • DataFrame: Parsed terminator predictions with columns as described above

See TransTerm HP documentation for details on scoring and location codes.

source
Mycelia.parse_virsorter_score_tsvMethod
parse_virsorter_score_tsv(
    virsorter_score_tsv
) -> DataFrames.DataFrame

Parse a VirSorter score TSV file and return a DataFrame.

Arguments

  • virsorter_score_tsv::String: The file path to the VirSorter score TSV file.

Returns

  • DataFrame: A DataFrame containing the parsed data from the TSV file. If the file is empty, returns a DataFrame with the appropriate headers but no data.
source
Mycelia.parse_xam_to_mapped_records_tableFunction
parse_xam_to_mapped_records_table(
    xam
) -> DataFrames.DataFrame
parse_xam_to_mapped_records_table(
    xam,
    primary_only
) -> DataFrames.DataFrame

Parse SAM/BAM files into a DataFrame containing mapped read alignments.

Arguments

  • xam::String: Path to input SAM/BAM file (supports .sam, .bam, or .sam.gz)
  • primary_only::Bool=false: Flag to filter for primary alignments only (currently unused)

Returns

DataFrame with columns:

  • template: Read template name
  • flag: SAM flag
  • reference: Reference sequence name
  • position: Alignment position range
  • mappingquality: Mapping quality score
  • tlen: Template length
  • alignlength: Alignment length
  • ismapped: Boolean indicating if read is mapped
  • isprimary: Boolean indicating if alignment is primary
  • alignment_score: Alignment score (AS tag)
  • mismatches: Number of mismatches (NM tag)
source
Mycelia.parse_xam_to_primary_mapping_tableMethod
parse_xam_to_primary_mapping_table(
    xam
) -> DataFrames.DataFrame

Parse a SAM/BAM alignment file and extract template-to-reference mapping information.

Arguments

  • xam::String: Path to input alignment file (.sam, .sam.gz, or .bam format)

Returns

  • DataFrame: Table with columns:
    • template: Read template names
    • reference: Reference sequence names

Only includes primary alignments (not secondary/supplementary) that are mapped to references. Skips header lines starting with '@'.

source
Mycelia.parse_xam_to_summary_tableMethod
parse_xam_to_summary_table(xam) -> DataFrames.DataFrame

Parse a SAM/BAM file into a summary DataFrame containing alignment metadata.

Arguments

  • xam::AbstractString: Path to input SAM (.sam), BAM (.bam), or gzipped SAM (.sam.gz) file

Returns

DataFrame with columns:

  • template: Read name
  • flag: SAM flag
  • reference: Reference sequence name
  • position: Alignment position range (start:end)
  • mappingquality: Mapping quality score
  • alignment_score: Alignment score (AS tag)
  • isprimary: Whether alignment is primary
  • alignlength: Length of the alignment
  • ismapped: Whether read is mapped
  • mismatches: Number of mismatches (NM tag)

Note: Only mapped reads are included in the output DataFrame.

source
Mycelia.parse_xam_to_taxonomic_mapping_qualityFunction
parse_xam_to_taxonomic_mapping_quality(
    xam
) -> DataFrames.DataFrame
parse_xam_to_taxonomic_mapping_quality(
    xam,
    primary_only
) -> DataFrames.DataFrame

Parse alignment data from SAM/BAM files into a structured DataFrame containing mapping quality and taxonomic information.

Arguments

  • xam::String: Path to input file (.sam, .bam, or .sam.gz)
  • primary_only::Bool=false: When true, include only primary alignments (currently not implemented)

Returns

DataFrames.DataFrame with columns:

  • template: Read name
  • flag: SAM flag
  • reference: Reference sequence name
  • position: Alignment position range
  • mappingquality: Mapping quality score
  • tlen: Template length
  • alignlength: Alignment length
  • ismapped: Boolean indicating if read is mapped
  • isprimary: Boolean indicating if alignment is primary
  • alignment_score: Alignment score (AS tag)
  • mismatches: Number of mismatches (NM tag)

Notes

  • Skips unmapped reads
  • Automatically detects and handles SAM/BAM/compressed SAM formats
source
Mycelia.path_to_sequenceMethod
path_to_sequence(kmers, path) -> Any

Convert a path through k-mers into a single DNA sequence.

Takes a vector of k-mers and a path representing the order to traverse them, reconstructs the original sequence by joining the k-mers according to the path. The first k-mer is used in full, then only the last nucleotide from each subsequent k-mer is added.

Arguments

  • kmers: Vector of DNA k-mers (as LongDNA{4})
  • path: Vector of tuples representing the path through the k-mers

Returns

  • LongDNA{4}: The reconstructed DNA sequence
source
Mycelia.pixels_to_pointsMethod
pixels_to_points(pixels) -> Any

Convert pixel measurements to point measurements using the standard 4:3 ratio.

Points are the standard unit for typography (1 point = 1/72 inch), while pixels are used for screen measurements. This conversion uses the conventional 4:3 ratio where 3 points equal 4 pixels.

Arguments

  • pixels: The number of pixels to convert

Returns

  • The equivalent measurement in points
source
Mycelia.plot_graphMethod
plot_graph(graph) -> Any

Creates a visualization of a kmer graph where nodes represent kmers and their sizes reflect counts.

Arguments

  • graph: A MetaGraph where vertices have :kmer and :count properties

Returns

  • A Plots.jl plot object showing the graph visualization

Details

  • Node sizes are scaled based on kmer counts
  • Plot dimensions scale logarithmically with number of vertices
  • Each node is labeled with its kmer sequence
source
Mycelia.plot_kmer_frequency_spectraMethod
plot_kmer_frequency_spectra(
    counts;
    log_scale,
    kwargs...
) -> Plots.Plot

Plots a histogram of kmer counts against # of kmers with those counts

Returns the plot object for adding additional layers and saving

Creates a scatter plot visualizing the k-mer frequency spectrum - the relationship between k-mer frequencies and how many k-mers occur at each frequency.

Arguments

  • counts::AbstractVector{<:Integer}: Vector of k-mer counts/frequencies
  • log_scale::Union{Function,Nothing} = log2: Function to apply logarithmic scaling to both axes. Set to nothing to use linear scaling.
  • kwargs...: Additional keyword arguments passed to StatsPlots.plot()

Returns

  • Plots.Plot: A scatter plot object that can be further modified or saved

Details

The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.

source
Mycelia.plot_optimal_cluster_assessment_resultsMethod
plot_optimal_cluster_assessment_results(
    clustering_results
) -> Any

Visualizes cluster assessment metrics and saves the resulting plots.

Arguments

  • clustering_results: A named tuple containing:
    • ks_assessed: Vector of k values tested
    • within_cluster_sum_of_squares: Vector of WCSS scores
    • silhouette_scores: Vector of silhouette scores
    • optimal_number_of_clusters: Integer indicating optimal k

Details

Creates two plots:

  1. Within-cluster sum of squares (WCSS) vs number of clusters
  2. Silhouette scores vs number of clusters

Both plots include a vertical line indicating the optimal number of clusters.

Outputs

Saves two SVG files in the project directory:

  • wcss.svg: WCSS plot
  • silhouette.svg: Silhouette scores plot
source
Mycelia.points_to_pixelsMethod
points_to_pixels(points) -> Any

Convert typographic points to pixels using a 4:3 ratio (1 point = 4/3 pixels).

Arguments

  • points: Size in typographic points (pt)

Returns

  • Size in pixels (px)
source
Mycelia.polish_fastqMethod
polish_fastq(; fastq, k)

Polish FASTQ reads using a k-mer graph-based approach to correct potential sequencing errors.

Arguments

  • fastq::String: Path to input FASTQ file
  • k::Int=1: Initial k-mer size parameter. Final assembly k-mer size may differ.

Process

  1. Builds a directed k-mer graph from input reads
  2. Processes each read through the graph to find optimal paths
  3. Writes corrected reads to a new FASTQ file
  4. Automatically compresses output with gzip

Returns

Named tuple with:

  • fastq::String: Path to output gzipped FASTQ file
  • k::Int: Final assembly k-mer size used
source
Mycelia.prefetchMethod
prefetch(; SRR, outdir)

Downloads Sequence Read Archive (SRA) data using the prefetch tool from sra-tools.

Arguments

  • SRR: SRA accession number (e.g., "SRR12345678")
  • outdir: Directory where the downloaded data will be saved. Defaults to current directory.

Notes

  • Requires sra-tools which will be installed in a Conda environment
  • Downloads are saved in .sra format
  • Internet connection required
source
Mycelia.process_fastq_recordMethod
process_fastq_record(
;
    record,
    kmer_graph,
    yen_k_shortest_paths_and_weights,
    yen_k
)

Process and error-correct a FASTQ sequence record using a kmer graph and path resampling.

Arguments

  • record: FASTQ record containing the sequence to process
  • kmer_graph: MetaGraph containing the kmer network and associated properties
  • yen_k_shortest_paths_and_weights: Cache of pre-computed k-shortest paths between nodes
  • yen_k: Number of alternative paths to consider during resampling (default: 3)

Description

Performs error correction by:

  1. Trimming low-quality sequence ends
  2. Identifying stretches requiring resampling between solid branching kmers
  3. Selecting alternative paths through the kmer graph based on:
    • Path quality scores
    • Transition likelihoods
    • Path length similarity to original sequence

Returns

  • Modified FASTQ record with error-corrected sequence and updated quality scores
  • Original record if no error correction was needed

Required Graph Properties

The kmer_graph must contain the following properties:

  • :ordered_kmers
  • :likelyvalidkmer_indices
  • :kmer_indices
  • :branching_nodes
  • :assembly_k
  • :transition_likelihoods
  • :kmermeanquality
  • :kmertotalquality
source
Mycelia.q_value_to_error_rateMethod
q_value_to_error_rate(q_value) -> Any

Convert a Phred quality score (Q-value) to a probability of error.

Arguments

  • q_value: Phred quality score, typically ranging from 0 to 40

Returns

  • Error probability in range [0,1], where 0 indicates highest confidence

A Q-value of 10 corresponds to an error rate of 0.1 (10%), while a Q-value of 30 corresponds to an error rate of 0.001 (0.1%).

source
Mycelia.qc_filter_long_reads_fastplongMethod
qc_filter_long_reads_fastplong(
;
    in_fastq,
    report_title,
    out_fastq,
    html_report,
    json_report,
    min_length,
    max_length
)

Perform QC filtering on long-read FASTQ files using fastplong.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • quality_threshold::Int: Minimum average quality to retain a read (default 10).
  • min_length::Int: Minimum read length (default 1000).
  • max_length::Int=0: Maximum read length (default 0, no maximum).

Returns

  • String: Path to the filtered FASTQ file.

Details

This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.

source
Mycelia.qc_filter_long_reads_filtlongMethod
qc_filter_long_reads_filtlong(
;
    in_fastq,
    out_fastq,
    min_mean_q,
    keep_percent
)

Filter and process long reads from a FASTQ file using Filtlong.

This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.
  • min_mean_q::Int: Minimum mean quality score for reads to be kept. Default is 20.
  • keep_percent::Int: Percentage of reads to keep after filtering. Default is 95.

Returns

  • out_fastq

Details

This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.

source
Mycelia.qc_filter_short_reads_fastpMethod
qc_filter_short_reads_fastp(
    in_fastq::String,
    out_fastq::String;
    adapter_seq,
    quality_threshold,
    min_length
)

Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • adapter_seq::String: Adapter sequence to trim.
  • quality_threshold::Int: Minimum phred score for trimming (default 20).
  • min_length::Int: Minimum read length to retain (default 50).

Returns

  • String: Path to the filtered and trimmed FASTQ file.

Details

This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length. It’s a simple wrapper that executes the external fastp command.

source
Mycelia.qualmers_canonicalMethod
qualmers_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Union{Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, Vector{Int8}} where _A)}, Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{BioSequences.DNAAlphabet{4}, _A, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}} where _A)), F<:(Mycelia.var"#499#500"{_A, Vector{Int8}} where _A)}}
source
Mycelia.qualmers_canonicalMethod
qualmers_canonical(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s235"}} where {_A, var"#s235"<:Integer})}
source
Mycelia.qualmers_canonicalMethod
qualmers_canonical(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.CanonicalKmers{A, _A, S} where {A<:(BioSequences.DNAAlphabet), _A, S<:(BioSequences.LongDNA)})), F<:(Mycelia.var"#499#500"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
source
Mycelia.qualmers_fwMethod
qualmers_fw(
    sequence::BioSequences.LongSequence{A},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwKmers{A, _A, S} where {A<:BioSequences.Alphabet, _A, S<:(BioSequences.LongSequence{A} where A)})), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}

Create an iterator that yields DNA qualmers from the given sequence and quality scores.

source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongAA,
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Base.Iterators.Enumerate{I} where I<:(Kmers.FwAAMers{_A, BioSequences.LongAA} where _A)), F<:(Mycelia.var"#493#494"{_A, <:AbstractVector{var"#s235"}} where {_A, var"#s235"<:Integer})}
source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.DNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousDNAMers{_A, S} where {_A, S<:(BioSequences.LongDNA)}), F<:(Mycelia.var"#495#496"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
source
Mycelia.qualmers_unambiguousMethod
qualmers_unambiguous(
    sequence::BioSequences.LongSequence{BioSequences.RNAAlphabet{N}},
    quality::AbstractVector{<:Integer},
    _::Val{K}
) -> Base.Generator{I, F} where {I<:(Kmers.UnambiguousRNAMers{_A, S} where {_A, S<:(BioSequences.LongRNA)}), F<:(Mycelia.var"#497#498"{_A, <:AbstractVector{var"#s233"}} where {_A, var"#s233"<:Integer})}
source
Mycelia.qualmers_unambiguous_canonicalMethod
qualmers_unambiguous_canonical(
    record::FASTX.FASTQ.Record,
    k::Int64
) -> Base.Generator{I, Mycelia.var"#501#502"} where I

Generate unambiguous canonical qualmers from the given FASTQ record.

source
Mycelia.rand_of_each_groupMethod
rand_of_each_group(
    gdf::DataFrames.GroupedDataFrame{DataFrames.DataFrame}
) -> Any

Select one random row from each group in a grouped DataFrame.

Arguments

  • gdf::GroupedDataFrame: A grouped DataFrame created using groupby

Returns

  • DataFrame: A new DataFrame containing exactly one randomly sampled row from each group
source
Mycelia.random_fasta_recordMethod
random_fasta_record(
;
    moltype,
    seed,
    L
) -> FASTX.FASTA.Record

Generates a random FASTA record with a specified molecular type and sequence length.

Arguments

  • moltype::Symbol=:DNA: The type of molecule to generate (:DNA, :RNA, or :AA for amino acids).
  • seed: The random seed used for sequence generation (default: a random integer).
  • L: The length of the sequence (default: a random integer up to typemax(UInt16)).

Returns

  • A FASTX.FASTA.Record containing:
    • A randomly generated UUID identifier.
    • A randomly generated sequence of the specified type.

Errors

  • Throws an error if moltype is not one of :DNA, :RNA, or :AA.
source
Mycelia.random_symmetric_distance_matrixMethod
random_symmetric_distance_matrix(n) -> Any

Generate a random symmetric distance matrix of size n×n with zeros on the diagonal.

Arguments

  • n: Positive integer specifying the matrix dimensions

Returns

  • A symmetric n×n matrix with random values in [0,1), zeros on the diagonal

Details

  • The matrix is symmetric, meaning M[i,j] = M[j,i]
  • Diagonal elements M[i,i] are set to 0.0
  • Off-diagonal elements are uniformly distributed random values
source
Mycelia.rclone_copyMethod
rclone_copy(source, dest; config, max_attempts, sleep_timer)

Copy files between local and remote storage using rclone with automated retry logic.

Arguments

  • source::String: Source path or remote (e.g. "local/path" or "gdrive:folder")
  • dest::String: Destination path or remote (e.g. "gdrive:folder" or "local/path")

Keywords

  • config::String="": Optional path to rclone config file
  • max_attempts::Int=3: Maximum number of retry attempts
  • sleep_timer::Int=60: Initial sleep duration between retries in seconds (doubles after each attempt)

Details

Uses optimized rclone settings for large files:

  • 2GB chunk size
  • 1TB upload cutoff
  • Rate limited to 1 transaction per second
source
Mycelia.rclone_list_directoriesMethod
rclone_list_directories(path) -> Any

List all directories at the specified rclone path.

Arguments

  • path::String: Remote path to list directories from (e.g. "remote:/path/to/dir")

Returns

  • Vector{String}: Full paths to all directories found at the specified location
source
Mycelia.read_fastaniMethod
read_fastani(path::String) -> DataFrames.DataFrame

Imports results of fastani

Reads and processes FastANI output results from a tab-delimited file.

Arguments

  • path::String: Path to the FastANI output file

Returns

DataFrame with columns:

  • query: Original query filepath
  • query_identifier: Extracted filename without extension
  • reference: Original reference filepath
  • reference_identifier: Extracted filename without extension
  • %_identity: ANI percentage identity
  • fragments_mapped: Number of fragments mapped
  • total_query_fragments: Total number of query fragments

Notes

  • Expects tab-delimited input file from FastANI
  • Automatically strips .fasta, .fna, or .fa extensions from filenames
  • Column order is preserved as listed above
source
Mycelia.read_gffMethod
read_gff(gff::AbstractString) -> DataFrames.DataFrame

Reads a GFF (General Feature Format) file and parses it into a DataFrame.

Arguments

  • gff::AbstractString: Path to the GFF file

Returns

  • DataFrame: A DataFrame containing the parsed GFF data with standard columns: seqid, source, type, start, end, score, strand, phase, and attributes
source
Mycelia.read_gffMethod
read_gff(gff_io) -> DataFrames.DataFrame

Read a GFF (General Feature Format) file into a DataFrame.

Arguments

  • gff_io: An IO stream containing GFF formatted data

Returns

  • DataFrame: A DataFrame with standard GFF columns:
    • seqid: sequence identifier
    • source: feature source
    • type: feature type
    • start: start position (1-based)
    • end: end position
    • score: numeric score
    • strand: strand (+, -, or .)
    • phase: phase (0, 1, 2 or .)
    • attributes: semicolon-separated key-value pairs
source
Mycelia.read_kraken_reportMethod
read_kraken_report(kraken_report) -> DataFrames.DataFrame

Parse a Kraken taxonomic classification report into a structured DataFrame.

Arguments

  • kraken_report::AbstractString: Path to a tab-delimited Kraken report file

Returns

  • DataFrame: A DataFrame with the following columns:
    • percentage_of_fragments_at_or_below_taxon: Percentage of fragments covered
    • number_of_fragments_at_or_below_taxon: Count of fragments at/below taxon
    • number_of_fragments_assigned_directly_to_taxon: Direct fragment assignments
    • rank: Taxonomic rank
    • ncbi_taxonid: NCBI taxonomy identifier
    • scientific_name: Scientific name (whitespace-trimmed)

Notes

  • Scientific names are automatically stripped of leading/trailing whitespace
  • Input file must be tab-delimited
source
Mycelia.read_mmseqs_easy_searchMethod
read_mmseqs_easy_search(mmseqs_file) -> DataFrames.DataFrame

Read results from MMSeqs2 easy-search output file into a DataFrame.

Arguments

  • mmseqs_file::String: Path to the tab-delimited output file from MMSeqs2 easy-search

Returns

  • DataFrame: Contains search results with columns:
    • query: Query sequence identifier
    • target: Target sequence identifier
    • seqIdentity: Sequence identity (0.0-1.0)
    • alnLen: Alignment length
    • mismatch: Number of mismatches
    • gapOpen: Number of gap openings
    • qStart: Query start position
    • qEnd: Query end position
    • tStart: Target start position
    • tEnd: Target end position
    • evalue: Expected value
    • bits: Bit score
source
Mycelia.reverse_translateMethod
reverse_translate(
    protein_sequence::BioSequences.LongAA
) -> BioSequences.LongSequence{BioSequences.DNAAlphabet{2}}

Convert a protein sequence back to a possible DNA coding sequence using weighted random codon selection.

Arguments

  • protein_sequence::BioSequences.LongAA: The amino acid sequence to reverse translate

Returns

  • BioSequences.LongDNA{2}: A DNA sequence that would translate to the input protein sequence

Details

Uses codon usage frequencies to randomly select codons for each amino acid, weighted by their natural occurrence. Each selected codon is guaranteed to translate back to the original amino acid.

source
Mycelia.rolling_centered_avgMethod
rolling_centered_avg(data::AbstractArray{T, 1}; window_size)

Compute a centered moving average over a vector using a sliding window.

Arguments

  • data::AbstractVector{T}: Input vector to be averaged
  • window_size::Int: Size of the sliding window (odd number recommended)

Returns

  • Vector{Float64}: Vector of same length as input containing moving averages

Details

  • For points near the edges, the window is truncated to available data
  • Window is centered on each point, using floor(window_size/2) points on each side
  • Result type is always Float64 regardless of input type T
source
Mycelia.run_blastMethod
run_blast(
;
    out_dir,
    fasta,
    blast_db,
    blast_command,
    force,
    remote,
    wait
)

Run a BLAST (Basic Local Alignment Search Tool) command with the specified parameters.

Arguments

  • out_dir::String: The output directory where the BLAST results will be stored.
  • fasta::String: The path to the input FASTA file.
  • blast_db::String: The path to the BLAST database.
  • blast_command::String: The BLAST command to be executed (e.g., blastn, blastp).
  • force::Bool: If true, forces the BLAST command to run even if the output file already exists. Default is false.
  • remote::Bool: If true, runs the BLAST command remotely. Default is false.
  • wait::Bool: If true, waits for the BLAST command to complete before returning. Default is true.

Returns

  • outfile::String: The path to the output file containing the BLAST results.

Description

This function constructs and runs a BLAST command based on the provided parameters. It creates the necessary output directory, constructs the output file name, and determines whether the BLAST command needs to be run based on the existence and size of the output file. The function supports both local and remote execution of the BLAST command.

If force is set to true or the output file does not exist or is empty, the BLAST command is executed. The function logs the command being run and measures the time taken for execution. The output file path is returned upon completion.

source
Mycelia.run_blastnMethod
run_blastn(
;
    out_dir,
    fasta,
    blast_db,
    task,
    force,
    remote,
    wait
)

Run the BLASTN (Basic Local Alignment Search Tool for Nucleotides) command with specified parameters.

Arguments

  • out_dir::String: The output directory where the BLASTN results will be saved.
  • fasta::String: The path to the input FASTA file containing the query sequences.
  • blast_db::String: The path to the BLAST database to search against.
  • task::String: The BLASTN task to perform. Default is "megablast".
  • force::Bool: If true, forces the BLASTN command to run even if the output file already exists. Default is false.
  • remote::Bool: If true, runs the BLASTN command remotely. Default is false.
  • wait::Bool: If true, waits for the BLASTN command to complete before returning. Default is true.

Returns

  • outfile::String: The path to the output file containing the BLASTN results.

Description

This function constructs and runs a BLASTN command based on the provided parameters. It creates an output directory if it doesn't exist, constructs the output file path, and checks if the BLASTN command needs to be run based on the existence and size of the output file. The function supports running the BLASTN command locally or remotely, with options to force re-running and to wait for completion.

source
Mycelia.run_clustal_omegaMethod
run_clustal_omega(; fasta, outfmt)

Run Clustal Omega multiple sequence alignment on a FASTA file.

Arguments

  • fasta::String: Path to input FASTA file
  • outfmt::String="clustal": Output format for the alignment

Returns

  • String: Path to the output alignment file

Supported Output Formats

  • "fasta": FASTA format
  • "clustal": Clustal format
  • "msf": MSF format
  • "phylip": PHYLIP format
  • "selex": SELEX format
  • "stockholm": Stockholm format
  • "vienna": Vienna format

Notes

  • Uses Bioconda to manage the Clustal Omega installation
  • Caches results - will return existing output file if already generated
  • Handles single sequence files gracefully by returning output path without error
source
Mycelia.run_ectyperMethod
run_ectyper(fasta_file) -> Any

Run ECTyper for serotyping E. coli genome assemblies.

Arguments

  • fasta_file::String: Path to input FASTA file containing assembled genome(s)

Returns

  • String: Path to output directory containing ECTyper results
source
Mycelia.run_hifiasmMethod
run_hifiasm(; fastq, outdir)

Run the hifiasm genome assembler on PacBio HiFi reads.

Arguments

  • fastq::String: Path to input FASTQ file containing HiFi reads
  • outdir::String: Output directory path (default: "{basename(fastq)}_hifiasm")

Returns

Named tuple containing:

  • outdir::String: Path to output directory
  • hifiasm_outprefix::String: Prefix used for hifiasm output files

Details

  • Automatically creates and uses a conda environment with hifiasm
  • Uses primary assembly mode (–primary) optimized for inbred samples
  • Skips assembly if output files already exist at the specified prefix
  • Utilizes all available CPU threads
source
Mycelia.run_mlstMethod
run_mlst(fasta_file) -> String

Run Multi-Locus Sequence Typing (MLST) analysis on a genome assembly.

Arguments

  • fasta_file::String: Path to input FASTA file containing the genome assembly

Returns

  • Path to the output file containing MLST results (<input>.mlst.out)

Details

Uses the mlst tool from PubMLST to identify sequence types by comparing allelic profiles of housekeeping genes against curated MLST schemes.

Dependencies

  • Requires Bioconda and the mlst package
  • Automatically sets up conda environment if not present
source
Mycelia.run_mmseqs_easy_searchMethod
run_mmseqs_easy_search(
;
    query_fasta,
    target_database,
    out_dir,
    outfile,
    format_output,
    threads,
    force
)

Runs the MMseqs2 easy-search command on the given query FASTA file against the target database.

Arguments

  • query_fasta::String: Path to the query FASTA file.
  • target_database::String: Path to the target database.
  • out_dir::String: Directory to store the output file. Defaults to the directory of the query FASTA file.
  • outfile::String: Name of the output file. Defaults to a combination of the query FASTA and target database filenames.
  • format_output::String: Format of the output. Defaults to a predefined set of fields.
  • threads::Int: Number of CPU threads to use. Defaults to the number of CPU threads available.
  • force::Bool: If true, forces the re-generation of the output file even if it already exists. Defaults to false.

Returns

  • outfile_path::String: Path to the generated output file.

Notes

  • Adds the mmseqs2 environment using Bioconda if not already present.
  • Removes temporary files created during the process.
source
Mycelia.run_padlocMethod
run_padloc(; fasta_file, outdir, threads)

Run the 'padloc' tool from the 'padlocbio' conda environment on a given FASTA file.

https://doi.org/10.1093/nar/gkab883

https://github.com/padlocbio/padloc

This function first ensures that the 'padloc' environment is available via Bioconda. It then attempts to update the 'padloc' database. If a 'padloc' output file (with a '_padloc.csv' suffix) does not already exist for the input FASTA file, it runs 'padloc' with the specified FASTA file as input.

source
Mycelia.run_parallel_progressMethod
run_parallel_progress(
    f::Function,
    items::AbstractVector
) -> Any

Run a function f in parallel over a collection of items with a progress meter.

Arguments

  • f::Function: The function to be applied to each item in the collection.
  • items::AbstractVector: A collection of items to be processed.

Description

This function creates a progress meter to track the progress of processing each item in the items collection. It uses multithreading to run the function f on each item in parallel, updating the progress meter after each item is processed.

source
Mycelia.run_prodigalMethod
run_prodigal(; fasta_file, out_dir)

Run Prodigal gene prediction software on input FASTA file to identify protein-coding genes in metagenomes or single genomes.

Arguments

  • fasta_file::String: Path to input FASTA file containing genomic sequences
  • out_dir::String=dirname(fasta_file): Directory for output files. Defaults to input file's directory

Returns

Named tuple containing paths to all output files:

  • fasta_file: Input FASTA file path
  • out_dir: Output directory path
  • gff: Path to GFF format gene predictions
  • gene_scores: Path to all potential genes and their scores
  • fna: Path to nucleotide sequences of predicted genes
  • faa: Path to protein translations of predicted genes
  • std_out: Path to captured stdout
  • std_err: Path to captured stderr
source
Mycelia.run_pyrodigalMethod
run_pyrodigal(; fasta_file, out_dir)

Run Pyrodigal gene prediction on a FASTA file using the meta procedure optimized for metagenomic sequences.

Pyrodigal is a reimplementation of the Prodigal gene finder, which identifies protein-coding sequences in bacterial and archaeal genomes.

Arguments

  • fasta_file::String: Path to input FASTA file containing genomic sequences
  • out_dir::String: Output directory path (default: input filename + "_pyrodigal")

Returns

Named tuple containing:

  • fasta_file: Input FASTA file path
  • out_dir: Output directory path
  • gff: Path to GFF output file with gene predictions
  • faa: Path to FASTA file with predicted protein sequences
  • fna: Path to FASTA file with nucleotide sequences

Notes

  • Uses metagenomic mode (-p meta) optimized for mixed communities
  • Masks runs of N nucleotides (-m flag)
  • Minimum gene length set to 33bp
  • Maximum overlap between genes set to 31bp
  • Requires Pyrodigal to be available in a Conda environment
  • Skips processing if output files already exist
source
Mycelia.run_samtools_flagstatFunction
run_samtools_flagstat(xam) -> Any
run_samtools_flagstat(xam, samtools_flagstat) -> Any

Generate alignment statistics for a SAM/BAM/CRAM file using samtools flagstat.

Arguments

  • xam::AbstractString: Path to input SAM/BAM/CRAM alignment file
  • samtools_flagstat::AbstractString: Output path for flagstat results (default: input_path.samtools-flagstat.txt)

Returns

  • String: Path to the generated flagstat output file

Details

Runs samtools flagstat to calculate statistics on the alignment file, including:

  • Total reads
  • Secondary alignments
  • Supplementary alignments
  • Duplicates
  • Mapped/unmapped reads
  • Proper pairs
  • Read 1/2 counts

Requirements

  • Requires samtools to be available via Bioconda
  • Input file must be in SAM, BAM or CRAM format
source
Mycelia.run_transtermMethod
run_transterm(; fasta, gff)

Run TransTermHP to predict rho-independent transcription terminators in DNA sequences.

Arguments

  • fasta: Path to input FASTA file containing DNA sequences
  • gff: Optional path to GFF annotation file. If provided, improves prediction accuracy

Returns

  • String: Path to output file containing TransTermHP predictions

Details

  • Uses Conda environment 'transtermhp' for execution
  • Automatically generates coordinate file from FASTA or GFF input
  • Removes temporary coordinate file after completion
  • Requires Mycelia's Conda setup
source
Mycelia.run_trnascanMethod
run_trnascan(; fna_file, outdir)

Run tRNAscan-SE to identify and annotate transfer RNA genes in the provided sequence file.

Arguments

  • fna_file::String: Path to input FASTA nucleotide file
  • outdir::String: Output directory path (default: inputfilepath + "_trnascan")

Returns

  • String: Path to the output directory containing tRNAscan-SE results

Output Files

Creates the following files in outdir:

  • *.trnascan.out: Main output with tRNA predictions
  • *.trnascan.bed: BED format coordinates
  • *.trnascan.fasta: FASTA sequences of predicted tRNAs
  • *.trnascan.struct: Secondary structure predictions
  • *.trnascan.stats: Summary statistics
  • *.trnascan.log: Program execution log

Notes

  • Uses the general tRNA model (-G flag) suitable for all domains of life
  • Automatically sets up tRNAscan-SE via Bioconda
  • Skips processing if output directory contains files
source
Mycelia.samtools_index_fastaMethod
samtools_index_fasta(; fasta)

Creates an index file (.fai) for a FASTA reference sequence using samtools.

The FASTA index allows efficient random access to the reference sequence. This is required by many bioinformatics tools that need to quickly fetch subsequences from the reference.

Arguments

  • fasta: Path to the input FASTA file

Side Effects

  • Creates a {fasta}.fai index file in the same directory as input
  • Installs samtools via conda if not already present
source
Mycelia.save_graphMethod
save_graph(
    graph::Graphs.AbstractGraph,
    outfile::String
) -> String

Saves the given graph to a file in JLD2 format.

Arguments

  • graph::Graphs.AbstractGraph: The graph to be saved.
  • outfile::String: The name of the output file. If the file extension is not .jld2, it will be appended automatically.

Returns

  • String: The name of the output file with the .jld2 extension.
source
Mycelia.save_matrix_jld2Method
save_matrix_jld2(; matrix, filename)

Saves a matrix to a JLD2 file format.

Arguments

  • matrix: The matrix to be saved
  • filename: String path where the file should be saved

Returns

  • The filename string that was used to save the matrix
source
Mycelia.scg_sbatchMethod
scg_sbatch(
;
    job_name,
    mail_user,
    mail_type,
    logdir,
    partition,
    account,
    nodes,
    ntasks,
    time,
    cpus_per_task,
    mem_gb,
    cmd
)

Submit a job to SLURM using sbatch with specified parameters.

Arguments

  • job_name::String: Name identifier for the SLURM job
  • mail_user::String: Email address for job notifications
  • mail_type::String: Type of mail notifications (default: "ALL")
  • logdir::String: Directory for error and output logs (default: "~/workspace/slurmlogs")
  • partition::String: SLURM partition to submit job to
  • account::String: Account to charge for compute resources
  • nodes::Int: Number of nodes to allocate (default: 1)
  • ntasks::Int: Number of tasks to run (default: 1)
  • time::String: Maximum wall time in format "days-hours:minutes:seconds" (default: "1-00:00:00")
  • cpus_per_task::Int: CPUs per task (default: 1)
  • mem_gb::Int: Memory in GB, defaults to 32GB per CPU
  • cmd::String: Command to execute

Returns

  • Bool: Returns true if submission succeeds

Notes

  • Function includes 5-second delays before and after submission
  • Memory is automatically scaled with CPU count
  • Log files are named with job ID (%j) and job name (%x)
source
Mycelia.seq2sha256Method
seq2sha256(seq::AbstractString) -> String

Compute the SHA-256 hash of a sequence string.

Arguments

  • seq::AbstractString: Input sequence to be hashed

Returns

  • String: Hexadecimal representation of the SHA-256 hash

Details

The input sequence is converted to uppercase before hashing.

source
Mycelia.seq2sha256Method
seq2sha256(seq::BioSequences.BioSequence) -> String

Convert a biological sequence to its SHA256 hash value.

Calculates a cryptographic hash of the sequence by first converting it to a string representation. This method dispatches to the string version of seq2sha256.

Arguments

  • seq::BioSequences.BioSequence: The biological sequence to hash

Returns

  • String: A 64-character hexadecimal string representing the SHA256 hash
source
Mycelia.sequence_to_stranded_pathMethod
sequence_to_stranded_path(
    stranded_kmers,
    sequence
) -> Vector{Pair{Int64, Bool}}

Convert a DNA sequence into a path through a collection of stranded k-mers.

Arguments

  • stranded_kmers: Collection of unique k-mers representing possible path vertices
  • sequence: Input DNA sequence to convert to a path

Returns

Vector of Pair{Int,Bool} where:

  • First element (Int) is the index of the k-mer in stranded_kmers
  • Second element (Bool) indicates orientation (true=forward, false=reverse)
source
Mycelia.setup_taxonkit_taxonomyMethod
setup_taxonkit_taxonomy() -> String

Downloads and extracts the NCBI taxonomy database required for taxonkit operations.

Downloads taxdump.tar.gz from NCBI FTP server and extracts it to ~/.taxonkit/. This is a prerequisite for using taxonkit-based taxonomy functions.

Requirements

  • Working internet connection
  • Sufficient disk space (~100MB)
  • taxonkit must be installed separately

Returns

  • Nothing

Throws

  • SystemError if download fails or if unable to create directory
  • ErrorException if tar extraction fails
source
Mycelia.sha256_fileMethod
sha256_file(file::AbstractString) -> String

Compute the SHA-256 hash of the contents of a file.

Arguments

  • file::AbstractString: The path to the file for which the SHA-256 hash is to be computed.

Returns

  • String: The SHA-256 hash of the file contents, represented as a hexadecimal string.
source
Mycelia.simulate_illumina_paired_readsMethod
simulate_illumina_paired_reads(
;
    in_fasta,
    coverage,
    read_count,
    outbase,
    read_length,
    mflen,
    sdev,
    seqSys,
    amplicon,
    errfree,
    rndSeed
)

Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.

This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.

Arguments

  • in_fasta::String: Path to the input FASTA file.
  • coverage::Union{Nothing,Number}: Desired fold coverage (used with --fcov); if nothing and read_count is provided then fold coverage is ignored. (Default: 20)
  • read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with --rcount instead of fold coverage). (Default: nothing)
  • outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").
  • read_length::Int: Length of reads to simulate (default: 150).
  • mflen::Int: Mean fragment length for paired-end simulations (default: 500).
  • sdev::Int: Standard deviation of fragment lengths (default: 10).
  • seqSys::String: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").
  • paired::Bool: Whether to simulate paired-end reads (default: true).
  • amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).
  • errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).
  • rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).

Outputs

Generates gzipped FASTQ files in the working directory:

  • For paired-end: $(outbase)1.fq.gz (forward) and $(outbase)2.fq.gz (reverse).
  • For single-end: $(outbase)1.fq.gz.

Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.

Details

This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.

Dependencies

Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads

source
Mycelia.simulate_nanopore_readsMethod
simulate_nanopore_reads(; fasta, quantity, outfile)

Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.

Arguments

  • fasta::String: Path to input reference FASTA file
  • quantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")
  • outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension

Returns

  • String: Path to the generated output FASTQ file

See also: simulate_pacbio_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source
Mycelia.simulate_nearly_perfect_long_readsMethod
simulate_nearly_perfect_long_reads()

Simulate high-quality long reads with minimal errors using Badread.

Arguments

  • reference::String: Path to reference FASTA file
  • quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")
  • length_mean::Int=40000: Mean read length
  • length_sd::Int=20000: Standard deviation of read length

Returns

Vector of simulated reads in FASTQ format

Details

Generates nearly perfect long reads by setting error rates and artifacts to minimum values. Uses ideal quality scores and disables common sequencing artifacts like chimeras and adapters.

See also: simulate_pacbio_reads, simulate_nanopore_reads, simulate_short_reads

source
Mycelia.simulate_pacbio_readsMethod
simulate_pacbio_reads(; fasta, quantity, outfile)

Simulate PacBio HiFi reads using the Badread error model.

Arguments

  • fasta::String: Path to input FASTA file containing reference sequence
  • quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000") - NOT TOTAL READS
  • outfile::String: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix

Returns

  • String: Path to the generated output file

Notes

  • Requires Badread tool from Bioconda
  • Uses PacBio 2021 error and quality score models
  • Average read length ~15kb
  • Output is gzipped FASTQ format

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source
Mycelia.simulate_variantsMethod
simulate_variants(
    fasta_record::FASTX.FASTA.Record;
    n_variants,
    window_size,
    variant_size_disbribution,
    variant_type_likelihoods
) -> Any

Simulates genetic variants (substitutions, insertions, deletions, inversions) in a DNA sequence.

Arguments

  • fasta_record: Input DNA sequence in FASTA format

Keywords

  • n_variants=√(sequence_length): Number of variants to generate
  • window_size=sequence_length/n_variants: Size of windows for variant placement
  • variant_size_disbribution=Geometric(1/√window_size): Distribution for variant sizes
  • variant_type_likelihoods: Vector of pairs mapping variant types to probabilities
    • :substitution => 10⁻¹
    • :insertion => 10⁻²
    • :deletion => 10⁻²
    • :inversion => 10⁻²

Returns

DataFrame in VCF format containing simulated variants with columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE

Notes

  • Variants are distributed across sequence windows to ensure spread
  • Variant sizes are capped by window size
  • Equivalent variants are filtered out
  • FILTER column indicates variant type
source
Mycelia.simulate_variantsMethod
simulate_variants(fasta_file::String) -> String

Simulates genetic variants from sequences in a FASTA file and generates corresponding VCF records.

Arguments

  • fasta_file::String: Path to input FASTA file containing sequences to analyze

Details

  1. Processes each record in the input FASTA file
  2. Generates simulated variants for each sequence
  3. Creates a VCF file with the same base name as input file (.vcf extension)
  4. Updates sequences with simulated variants in a new FASTA file (.vcf.fna extension)

Returns

Path to the modified FASTA file containing sequences with simulated variants

source
Mycelia.sort_fastqFunction
sort_fastq(input_fastq) -> String
sort_fastq(input_fastq, output_fastq) -> Any

This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file.

sorts longest to shortest!!

http://thegenomefactory.blogspot.com/2012/11/sorting-fastq-files-by-sequence-length.html

source
Mycelia.split_gff_attributes_into_columnsMethod
split_gff_attributes_into_columns(gff_df) -> Any

Takes a GFF (General Feature Format) DataFrame and expands the attributes column into separate columns.

Arguments

  • gff_df::DataFrame: A DataFrame containing GFF data with an 'attributes' column formatted as key-value pairs separated by semicolons (e.g., "ID=gene1;Name=BRCA1;Type=gene")

Returns

  • DataFrame: The input DataFrame with additional columns for each attribute key found in the 'attributes' column
source
Mycelia.subsample_reads_seqkitMethod
subsample_reads_seqkit(
;
    in_fastq,
    out_fastq,
    n_reads,
    proportion_reads
)

Subsample reads from a FASTQ file using seqkit.

Arguments

  • in_fastq::String: Path to input FASTQ file
  • out_fastq::String="": Path to output FASTQ file. If empty, auto-generated based on input filename
  • n_reads::Union{Missing,Int}=missing: Number of reads to sample
  • proportion_reads::Union{Missing,Float64}=missing: Proportion of reads to sample (0.0-1.0)

Returns

  • String: Path to the output FASTQ file
source
Mycelia.system_mem_to_minimap_index_sizeMethod
system_mem_to_minimap_index_size(
;
    system_mem_gb,
    denominator
)

Calculate appropriate minimap index size based on available system memory.

Converts total system memory to a recommended minimap index size by dividing the available memory by a denominator factor. Returns the size as a string with 'G' suffix.

Arguments

  • system_mem_gb::Number: Total system memory in gigabytes
  • denominator::Number=DEFAULT_MINIMAP_DENOMINATOR: Divisor to scale down memory allocation

Returns

  • String: Formatted memory size string (e.g. "4G")
source
Mycelia.system_overviewMethod
system_overview(
;
    path
) -> @NamedTuple{system_threads::Int64, julia_threads::Int64, total_memory::String, available_memory::String, occupied_memory::String, total_storage::String, available_storage::String, occupied_storage::String}

Parse a JSONL (JSON Lines) file into a vector of dictionaries.

Arguments

  • filepath::String: Path to the JSONL file to parse

Returns

  • Vector{Dict{String, Any}}: Vector containing parsed JSON objects, one per line

Description

Reads a JSONL file line by line, parsing each line as a separate JSON object. Uses pre-allocation and progress tracking for efficient processing of large files.

source
Mycelia.tar_extractMethod
tar_extract(; tarchive, directory)

Extract contents of a gzipped tar archive file to a specified directory.

Arguments

  • tarchive::AbstractString: Path to the .tar.gz file to extract
  • directory::AbstractString=dirname(tarchive): Target directory for extraction (defaults to the archive's directory)

Returns

  • AbstractString: Path to the directory where contents were extracted
source
Mycelia.taxids2lcaMethod
taxids2lca(ids::Vector{Int64}) -> Int64

Calculate the Lowest Common Ancestor (LCA) taxonomic ID for a set of input taxonomic IDs.

Arguments

  • ids::Vector{Int}: Vector of NCBI taxonomic IDs

Returns

  • Int: The taxonomic ID of the lowest common ancestor

Details

Uses taxonkit to compute the LCA. Automatically sets up the required taxonomy database if not already present in ~/.taxonkit/.

Dependencies

  • Requires taxonkit (installed via Bioconda)
  • Requires taxonomy database (downloaded automatically if missing)
source
Mycelia.taxids2ncbi_taxonomy_tableMethod
taxids2ncbi_taxonomy_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of NCBI taxonomy IDs into a detailed taxonomy table using NCBI Datasets CLI.

Arguments

  • taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs to query

Returns

  • DataFrame: Table containing taxonomy information with columns including:
    • tax_id
    • species
    • genus
    • family
    • order
    • class
    • phylum
    • kingdom

Dependencies

Requires ncbi-datasets-cli Conda package (automatically installed if missing)

source
Mycelia.taxids2taxonkit_summarized_lineage_tableMethod
taxids2taxonkit_summarized_lineage_table(
    taxids::AbstractVector{Int64}
) -> DataFrames.DataFrame

Convert a vector of taxonomy IDs to a summarized lineage table using taxonkit.

Arguments

  • taxids::AbstractVector{Int}: Vector of NCBI taxonomy IDs

Returns

DataFrame with the following columns:

  • taxid: Original input taxonomy ID
  • species_taxid, species: Species level taxonomy ID and name
  • genus_taxid, genus: Genus level taxonomy ID and name
  • family_taxid, family: Family level taxonomy ID and name
  • superkingdom_taxid, superkingdom: Superkingdom level taxonomy ID and name

Missing values are used when a taxonomic rank is not available.

source
Mycelia.taxids2taxonkit_taxid2lineage_ranksMethod
taxids2taxonkit_taxid2lineage_ranks(
    taxids::AbstractVector{Int64}
) -> Dict{Int64, Dict{String, @NamedTuple{lineage::String, taxid::Union{Missing, Int64}}}}

Convert taxonomic IDs to a structured lineage rank mapping.

Takes a vector of taxonomic IDs and returns a nested dictionary mapping each input taxid to its complete taxonomic lineage information. For each taxid, creates a dictionary where:

  • Keys are taxonomic ranks (e.g., "species", "genus", "family")
  • Values are NamedTuples containing:
    • lineage::String: The taxonomic name at that rank
    • taxid::Union{Int, Missing}: The corresponding taxonomic ID (if available)

Excludes "no rank" entries from the final output.

Returns: Dict{Int, Dict{String, NamedTuple{(:lineage, :taxid), Tuple{String, Union{Int, Missing}}}}}

source
Mycelia.taxonomic_id_to_childrenMethod
taxonomic_id_to_children(
    tax_id;
    DATABASE_ID,
    USERNAME,
    PASSWORD
)

Query Neo4j database to find all descendant taxonomic IDs for a given taxonomic ID.

Arguments

  • tax_id: Source taxonomic ID to find children for
  • DATABASE_ID: Neo4j database identifier (required)
  • USERNAME: Neo4j database username (default: "neo4j")
  • PASSWORD: Neo4j database password (required)

Returns

Vector{Int}: Sorted array of unique child taxonomic IDs

source
Mycelia.translate_nucleic_acid_fastaMethod
translate_nucleic_acid_fasta(
    fasta_nucleic_acid_file,
    fasta_amino_acid_file
) -> Any

Translates nucleic acid sequences from a FASTA file into amino acid sequences.

Arguments

  • fasta_nucleic_acid_file::String: Path to input FASTA file containing nucleic acid sequences
  • fasta_amino_acid_file::String: Path where the translated amino acid sequences will be written

Returns

  • String: Path to the output FASTA file containing translated amino acid sequences
source
Mycelia.transterm_output_to_gffMethod
transterm_output_to_gff(transterm_output) -> Any

Convert TransTerm terminator predictions output to GFF3 format.

Parses TransTerm output and generates a standardized GFF3 file with the following transformations:

  • Sets source field to "transterm"
  • Sets feature type to "terminator"
  • Converts terminator IDs to GFF attributes
  • Renames fields to match GFF3 spec

Arguments

  • transterm_output::String: Path to the TransTerm output file

Returns

  • String: Path to the generated GFF3 file (original filename with .gff extension)
source
Mycelia.trim_galoreMethod
trim_galore(
;
    outdir,
    identifier
) -> Union{Nothing, Base.Process}

Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.

Arguments

  • outdir::String: Output directory containing input FASTQ files
  • identifier::String: Prefix for input/output filenames

Input files

Expects paired FASTQ files in outdir named:

  • {identifier}_1.fastq.gz (forward reads)
  • {identifier}_2.fastq.gz (reverse reads)

Output files

Creates trimmed reads in outdir/trim_galore/:

  • {identifier}_1_val_1.fq.gz (trimmed forward reads)
  • {identifier}_2_val_2.fq.gz (trimmed reverse reads)

Dependencies

Requires trim_galore conda environment:

source
Mycelia.trim_galore_pairedMethod
trim_galore_paired(; forward_reads, reverse_reads, outdir)

Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.

Arguments

  • forward_reads::String: Path to forward reads FASTQ file
  • reverse_reads::String: Path to reverse reads FASTQ file
  • outdir::String: Output directory for trimmed files

Returns

  • Tuple{String, String}: Paths to trimmed forward and reverse read files

Dependencies

Requires trim_galore conda environment:

  • mamba create -n trim_galore -c bioconda trim_galore
source
Mycelia.type_to_stringMethod
type_to_string(T::AbstractString) -> Any

Converts an AbstractString type to its string representation.

Arguments

  • T::AbstractString: The string type to convert

Returns

A string representation of the input type

source
Mycelia.type_to_stringMethod
type_to_string(T) -> Any

Convert a type to its string representation, with special handling for Kmer types.

Arguments

  • T: The type to convert to string

Returns

  • String representation of the type
    • For Kmer types: Returns "Kmers.DNAKmer{K}" where K is the kmer length
    • For other types: Returns the standard string representation
source
Mycelia.update_bioconda_envMethod
update_bioconda_env(pkg) -> Base.Process

Update a package and its dependencies in its dedicated Conda environment.

Arguments

  • pkg::String: Name of the package/environment to update
source
Mycelia.update_fasta_with_vcfMethod
update_fasta_with_vcf(; in_fasta, vcf_file, out_fasta)

Apply variants from a VCF file to a reference FASTA sequence.

Arguments

  • in_fasta: Path to input reference FASTA file
  • vcf_file: Path to input VCF file containing variants
  • out_fasta: Optional output path for modified FASTA. Defaults to replacing '.vcf' with '.normalized.vcf.fna'

Details

  1. Normalizes indels in the VCF using bcftools norm
  2. Applies variants to the reference sequence using bcftools consensus
  3. Handles temporary files and compression with bgzip/tabix

Requirements

Requires bioconda packages: htslib, tabix, bcftools

Returns

Path to the output FASTA file containing the modified sequence

source
Mycelia.update_gff_with_mmseqsMethod
update_gff_with_mmseqs(gff_file, mmseqs_file)

Update GFF annotations with protein descriptions from MMseqs2 search results.

Arguments

  • gff_file::String: Path to input GFF3 format file
  • mmseqs_file::String: Path to MMseqs2 easy-search output file

Returns

  • DataFrame: Modified GFF table with updated attribute columns containing protein descriptions

Details

Takes sequence matches from MMseqs2 and adds their descriptions as 'label' and 'product' attributes in the GFF file. Only considers top hits from MMseqs2 results. Preserves existing GFF attributes while prepending new annotations.

source
Mycelia.upload_edge_type_over_url_from_graphMethod
upload_edge_type_over_url_from_graph(
;
    src_type,
    dst_type,
    edge_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload edges of a specific type from a MetaGraph to a Neo4j database, batching uploads in windows.

Arguments

  • src_type: Type of source nodes to filter
  • dst_type: Type of destination nodes to filter
  • edge_type: Type of edges to upload
  • graph: MetaGraph containing the nodes and edges
  • ADDRESS: Neo4j server URL
  • USERNAME: Neo4j username (default: "neo4j")
  • PASSWORD: Neo4j password
  • DATABASE: Neo4j database name (default: "neo4j")
  • window_size: Number of edges to upload in each batch (default: 100)

Details

  • Filters edges based on source, destination and edge types
  • Preserves all edge properties except :TYPE when uploading
  • Uses MERGE operations to avoid duplicate nodes/relationships
  • Uploads are performed in batches for better performance
  • Progress is shown via ProgressMeter

Returns

Nothing

source
Mycelia.upload_node_over_apiMethod
upload_node_over_api(
    graph,
    v;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Upload a single node from a MetaGraph to a Neo4j database using the HTTP API.

Arguments

  • graph: MetaGraph containing the node to be uploaded
  • v: Vertex identifier in the graph
  • ADDRESS: Neo4j server address (e.g. "http://localhost:7474")
  • USERNAME: Neo4j authentication username (default: "neo4j")
  • PASSWORD: Neo4j authentication password
  • DATABASE: Target Neo4j database name (default: "neo4j")

Details

Generates and executes a Cypher MERGE command using the node's properties. The node's :TYPE and :identifier properties are used for node labeling, while other non-empty properties are added as node properties.

source
Mycelia.upload_node_tableMethod
upload_node_table(
;
    table,
    window_size,
    address,
    password,
    username,
    database,
    neo4j_import_dir
)

Upload a DataFrame to Neo4j as nodes in batched windows.

Arguments

  • table::DataFrame: Input DataFrame where each row becomes a node. Must contain a "TYPE" column.
  • address::String: Neo4j server address (e.g. "bolt://localhost:7687")
  • password::String: Neo4j database password
  • neo4j_import_dir::String: Directory path accessible to Neo4j for importing files
  • window_size::Int=1000: Number of rows to process in each batch
  • username::String="neo4j": Neo4j database username
  • database::String="neo4j": Target Neo4j database name

Notes

  • All rows must have the same node type (specified in "TYPE" column)
  • Column names become node properties
  • Requires write permissions on neo4jimportdir
  • Large tables are processed in batches of size window_size
source
Mycelia.upload_node_type_over_url_from_graphMethod
upload_node_type_over_url_from_graph(
;
    node_type,
    graph,
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE,
    window_size
)

Upload nodes of a specific type from a graph to a Neo4j database using MERGE operations.

Arguments

  • node_type: The type label for the nodes to upload
  • graph: Source MetaGraph containing the nodes
  • ADDRESS: Neo4j server address (e.g. "bolt://localhost:7687")
  • PASSWORD: Neo4j database password
  • USERNAME="neo4j": Neo4j username (defaults to "neo4j")
  • DATABASE="neo4j": Target Neo4j database name (defaults to "neo4j")
  • window_size=100: Number of nodes to upload in each batch (defaults to 100)

Details

Performs batched uploads of nodes using Neo4j MERGE operations. Node properties are automatically extracted from the graph vertex properties, excluding the 'TYPE' property.

source
Mycelia.upload_nodes_over_apiMethod
upload_nodes_over_api(
    graph;
    ADDRESS,
    USERNAME,
    PASSWORD,
    DATABASE
)

Uploads all nodes from the given graph to a specified API endpoint.

Arguments

  • graph: The graph containing the nodes to be uploaded.
  • ADDRESS: The API endpoint address.
  • USERNAME: The username for authentication (default: "neo4j").
  • PASSWORD: The password for authentication.
  • DATABASE: The database name (default: "neo4j").
source
Mycelia.upload_nodes_to_neo4jMethod
upload_nodes_to_neo4j(
;
    graph,
    address,
    username,
    password,
    format,
    database,
    neo4j_import_directory
)

Upload all nodes from a MetaGraph to a Neo4j database, processing each unique node type separately.

Arguments

  • graph: A MetaGraph containing nodes to be uploaded
  • address: Neo4j server address (e.g., "bolt://localhost:7687")
  • username: Neo4j authentication username (default: "neo4j")
  • password: Neo4j authentication password
  • format: Data format for upload (default: "auto")
  • database: Target Neo4j database name (default: "neo4j")
  • neo4j_import_directory: Path to Neo4j's import directory for bulk loading
source
Mycelia.vcat_with_missingMethod
vcat_with_missing(
    dfs::DataFrames.AbstractDataFrame...
) -> Union{DataFrames.DataFrame, Vector{Any}}

Vertically concatenate DataFrames with different column structures by automatically handling missing values.

Arguments

  • dfs: Variable number of DataFrames to concatenate vertically

Returns

  • DataFrame: Combined DataFrame containing all rows and columns from input DataFrames, with missing values where columns didn't exist in original DataFrames
source
Mycelia.visualize_genome_coverageMethod
visualize_genome_coverage(coverage_table) -> Any

Creates a multi-panel visualization of genome coverage across chromosomes.

Arguments

  • coverage_table: DataFrame containing columns "chromosome" and "coverage" with genomic coverage data

Returns

  • Plots.Figure: A composite figure with coverage plots for each chromosome

Details

Generates one subplot per chromosome, arranged vertically. Each subplot shows the coverage distribution across genomic positions for that chromosome.

source
Mycelia.viterbi_maximum_likelihood_traversalsMethod
viterbi_maximum_likelihood_traversals(
    stranded_kmer_graph;
    error_rate,
    verbosity
) -> Vector{FASTX.FASTA.Record}

Finds maximum likelihood paths through a stranded k-mer graph using the Viterbi algorithm to correct sequencing errors.

Arguments

  • stranded_kmer_graph: A directed graph where vertices represent k-mers and edges represent overlaps
  • error_rate::Float64: Expected per-base error rate (default: 1/(k+1)). Must be < 0.5
  • verbosity::String: Output detail level ("debug", "reads", or "dataset")

Returns

Vector of FASTX.FASTA.Record containing error-corrected sequences

Details

  • Uses dynamic programming to find most likely path through k-mer graph
  • Accounts for matches, mismatches, insertions and deletions
  • State likelihoods based on k-mer coverage counts
  • Transition probabilities derived from error rate
  • Progress tracking based on verbosity level

Notes

  • Error rate should be probability of error (e.g. 0.01 for 1%), not accuracy
  • Higher verbosity levels ("debug", "reads") provide detailed path finding information
  • "dataset" verbosity shows only summary statistics
source
Mycelia.wcssMethod
wcss(clustering_result) -> Any

Calculate the Within-Cluster Sum of Squares (WCSS) for a clustering result.

Arguments

  • clustering_result: A clustering result object containing:
    • counts: Vector with number of points in each cluster
    • assignments: Vector of cluster assignments for each point
    • costs: Vector of distances/costs from each point to its cluster center

Returns

  • Float64: The total within-cluster sum of squared distances

Description

WCSS measures the compactness of clusters by summing the squared distances between each data point and its assigned cluster center.

source
Mycelia.write_fastaMethod
write_fasta(; outfile, records, gzip)

Writes FASTA records to a file, optionally gzipped.

Arguments

  • outfile::AbstractString: Path to the output FASTA file. Will append ".gz" if gzip is true.
  • records::Vector{FASTX.FASTA.Record}: A vector of FASTA records.
  • gzip::Bool=false: Whether to compress the output with gzip.

Returns

  • outfile::String: The path to the output FASTA file (including ".gz" if applicable).
source
Mycelia.write_gffMethod
write_gff(; gff, outfile)

Write GFF (General Feature Format) data to a tab-delimited file.

Arguments

  • gff: DataFrame/Table containing GFF formatted data
  • outfile: String path where the output file should be written

Returns

  • String: Path to the written output file
source
Mycelia.write_vcf_tableMethod
write_vcf_table(; vcf_file, vcf_table, fasta_file)

Write variant data to a VCF v4.3 format file.

Arguments

  • vcf_file::String: Output path for the VCF file
  • vcf_table::DataFrame: Table containing variant data with standard VCF columns
  • fasta_file::String: Path to the reference genome FASTA file

Details

Automatically filters out equivalent variants where REF == ALT. Includes standard VCF headers for substitutions, insertions, deletions, and inversions. Adds GT (Genotype) and GQ (Genotype Quality) format fields.

source
Mycelia.xam_to_contig_mapping_statsMethod
xam_to_contig_mapping_stats(xam) -> Any

Generate detailed mapping statistics for each reference sequence/contig in a XAM (SAM/BAM/CRAM) file.

Arguments

  • xam: Path to XAM file or XAM object

Returns

A DataFrame with per-contig statistics including:

  • n_aligned_reads: Number of aligned reads
  • total_aligned_bases: Sum of alignment lengths
  • total_alignment_score: Sum of alignment scores
  • Mapping quality statistics (mean, std, median)
  • Alignment length statistics (mean, std, median)
  • Alignment score statistics (mean, std, median)
  • Percent mismatches statistics (mean, std, median)

Note: Only primary alignments (isprimary=true) and mapped reads (ismapped=true) are considered.

source