Sequence Analysis & K-mers

Functions for analyzing sequence composition, counting k-mers, and extracting genomic features from sequencing data.

Overview

Sequence analysis forms the foundation of many bioinformatics algorithms. Mycelia provides:

K-mer counting and analysis using efficient algorithms (implemented)
Genome size estimation from k-mer frequency spectra (implemented)
Graph-based k-mer analysis for assembly (implemented)
Error detection and correction using k-mer patterns (planned)
Sequence composition analysis and bias detection (planned)
Memory-efficient processing of large datasets (partially implemented)

Implementation Status: Core k-mer counting, canonical k-mer analysis, and basic genome size estimation are implemented. Advanced analysis features, error correction, and comprehensive visualization are planned.

Common Workflows

1. Basic K-mer Analysis

# Count k-mers in sequencing reads
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)

# Analyze k-mer frequency spectrum
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

2. Genome Size Estimation (Implemented)

# Estimate genome size from k-mer spectrum
genome_size = Mycelia.estimate_genome_size_from_kmers(kmer_counts, coverage_peak=30)

3. Contamination Detection (Planned)

# Detect contamination using k-mer profiles (planned)
contamination = Mycelia.detect_contamination_kmers("reads.fastq", expected_profile)

K-mer Counting

Basic K-mer Counting

Mycelia.count_kmers — Function

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in a DNA sequence.

Arguments

::Type{Kmers.Kmer{A,K}}: K-mer type with alphabet A and length K
sequence::BioSequences.LongSequence: Input DNA sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Type Parameters

A <: BioSequences.DNAAlphabet: DNA alphabet type
K: Length of k-mers

source

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in an RNA sequence.

Arguments

Kmer: Type parameter specifying the k-mer length K and RNA alphabet
sequence: Input RNA sequence to analyze

Returns

Dict{Kmers.Kmer, Int}: Sorted dictionary mapping each k-mer to its frequency count

source

count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source

count_kmers(
    _::Type{KMER_TYPE},
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source

count_kmers(
    _::Type{KMER_TYPE},
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any

Count k-mers across multiple sequence records and return a sorted frequency table.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{3} for 3-mers)
records: Vector of FASTA/FASTQ records to analyze

Returns

Dict{KMER_TYPE, Int}: Sorted dictionary mapping k-mers to their frequencies

source

count_kmers(
    _::Type{KMER_TYPE},
    sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any

Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length and encoding (e.g., DNAKmer{4} for 4-mers)
sequences: A FASTA or FASTQ reader containing the biological sequences to analyze

Returns

A dictionary mapping k-mers to their counts in the input sequences

source

count_kmers(
    _::Type{KMER_TYPE},
    fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any

Count k-mers across multiple FASTA/FASTQ files and merge the results.

Arguments

KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{4} for 4-mers)
fastx_files: Vector of paths to FASTA/FASTQ files

Returns

Dict{KMER_TYPE, Int}: Dictionary mapping k-mers to their total counts across all files

source

count_kmers(
    _::Type{KMER_TYPE},
    fastx_file::AbstractString
) -> Any

Count k-mers in a FASTA/FASTQ file and return their frequencies.

Arguments

KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
fastx_file: Path to input FASTA/FASTQ file

Returns

Dict{KMER_TYPE, Int}: Dictionary mapping each k-mer to its frequency

source

Mycelia.count_canonical_kmers — Function

count_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any

Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.

Arguments

KMER_TYPE: Type parameter specifying the k-mer size and structure
sequences: Iterator of biological sequences to analyze

Returns

Dict{KMER_TYPE,Int}: Dictionary mapping canonical k-mers to their counts

source

Mycelia.fasta_list_to_dense_kmer_counts — Function

fasta_list_to_dense_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    result_file,
    force,
    cleanup_temp,
    force_threading,
    force_temp_files,
    force_progress_bars,
    threading_file_threshold,
    threading_size_threshold,
    memory_safety_factor,
    progress_time_threshold
)

Create a dense k-mer counts table for a set of FASTA files, with disk-backed temporary storage, custom element type, robust error handling, and optional output file caching.

source

Mycelia.fasta_list_to_sparse_kmer_counts — Function

fasta_list_to_sparse_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    rarefaction_data_filename,
    rarefaction_plot_basename,
    show_rarefaction_plot,
    skip_rarefaction_plot,
    result_file,
    out_dir,
    force,
    force_threading,
    force_temp_files,
    force_progress_bars,
    skip_rarefaction_data,
    threading_file_threshold,
    threading_size_threshold,
    memory_safety_factor,
    progress_time_threshold,
    rarefaction_plot_kwargs...
)

Create a sparse kmer counts table (SparseMatrixCSC) from a list of FASTA files using a 3-pass approach. Pass 1 (Parallel): Counts kmers per file and writes to temporary JLD2 files. Pass 2 (Serial): Aggregates unique kmers, max count, nnz per file, and rarefaction data from temp files. Generates and saves a k-mer rarefaction plot. Pass 3 (Parallel): Reads temporary counts again to construct the final sparse matrix.

Optionally, a results filename can be provided to save/load the output. If the file exists and force is false, the result is loaded and returned. If force is true or the file does not exist, results are computed and saved.

Output Directory Behavior

All auxiliary output files (e.g., rarefaction data, plots) are written to a common output directory.
By default, this is:
- The value of out_dir if provided.
- Otherwise, the directory containing result_file (if provided and has a directory component).
- Otherwise, the current working directory (pwd()).
If you provide an absolute path for an output file (e.g. rarefaction_data_filename), that path is used directly.
If both out_dir and a relative filename are given, the file is written to out_dir.

Arguments

fasta_list::AbstractVector{<:AbstractString}: A list of paths to FASTA files.
k::Integer: The length of the kmer.
alphabet::Symbol: The alphabet type (:AA, :DNA, :RNA).
temp_dir_parent::AbstractString: Parent directory for creating the temporary working directory. Defaults to Base.tempdir().
count_element_type::Union{Type{<:Unsigned}, Nothing}: Optional. Specifies the unsigned integer type for the counts. If nothing (default), the smallest UInt type capable of holding the maximum observed count is used.
rarefaction_data_filename::AbstractString: Filename for the TSV output of rarefaction data. If a relative path, will be written to out_dir.
rarefaction_plot_basename::AbstractString: Basename for the output rarefaction plots. If a relative path, will be written to out_dir.
show_rarefaction_plot::Bool: Whether to display the rarefaction plot after generation. Defaults to true.
skip_rarefaction_plot::Bool: If true, completely skip rarefaction plot generation (including loading plotting libraries). Defaults to false.
result_file::Union{Nothing, AbstractString}: Optional. If provided, path to a file to save/load the full results (kmers, counts, etc) as a JLD2 file.
out_dir::Union{Nothing, AbstractString}: Optional. Output directory for auxiliary outputs. Defaults as described above.
force::Bool: If true, recompute and overwrite the output file even if it exists. Defaults to false.
rarefaction_plot_kwargs...: Keyword arguments to pass to plot_kmer_rarefaction for plot customization.

Returns

NamedTuple{(:kmers, :counts, :rarefaction_data_path)}:
- kmers: A sorted Vector of unique kmer objects.
- counts: A SparseArrays.SparseMatrixCSC{V, Int} storing kmer counts.
- rarefaction_data_path: Path to the saved TSV file with rarefaction data.

Raises

ErrorException: If input fasta_list is empty, alphabet is invalid, or required Kmer/counting functions are not found.

source

Example: K-mer Counting Comparison

# Compare different k-mer counting approaches
reads_file = "reads.fastq"

# Dense counting (stores all possible k-mers)
dense_counts = Mycelia.count_kmers(reads_file, k=15, method="dense")
println("Dense matrix size: $(size(dense_counts))")

# Sparse counting (only observed k-mers)
sparse_counts = Mycelia.count_kmers(reads_file, k=21, method="sparse")
println("Unique k-mers: $(length(sparse_counts))")

# Canonical k-mers (combines forward and reverse complement)
canonical_counts = Mycelia.count_canonical_kmers(reads_file, k=21)
println("Canonical k-mers: $(length(canonical_counts))")

Advanced K-mer Analysis

<!– buildkmergraph, findkmeroverlaps, extractkmerpaths, analyzekmerconnectivity not yet implemented as documented –>

Mycelia.build_kmer_graph_next — Function

build_kmer_graph_next(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}};
    graph_mode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Create a next-generation, type-stable k-mer graph using MetaGraphsNext.

This implementation uses canonical k-mers as vertices with strand-aware edges that respect biological transition constraints for both single-strand and double-strand sequences.

Arguments

kmer_type: Type of k-mer (e.g., DNAKmer{K})
observations: Vector of FASTA/FASTQ records
graph_mode: SingleStrand for directional sequences, DoubleStrand for DNA (default)

Returns

MetaGraphsNext.MetaGraph with canonical vertices and strand-aware edges

Details

Vertices: Always canonical k-mers (lexicographically smaller of kmer/reverse_complement)
Edges: Strand-aware transitions that respect biological constraints
SingleStrand mode: Only forward-strand transitions allowed
DoubleStrand mode: Both forward and reverse-complement transitions allowed

source

Note: Additional k-mer graph construction functions (build_stranded_kmer_graph, build_directed_kmer_graph) have been deprecated and moved to sequence-graphs.jl.deprecated.

Example: K-mer Graph Construction

# Build k-mer overlap graph
kmer_graph = Mycelia.build_kmer_graph(
    sequences,
    k=31,
    min_overlap=20,
    min_coverage=5
)

# Analyze graph properties
graph_stats = Mycelia.analyze_kmer_connectivity(kmer_graph)
println("Graph nodes: $(graph_stats.n_nodes)")
println("Graph edges: $(graph_stats.n_edges)")
println("Connected components: $(graph_stats.n_components)")

K-mer Frequency Analysis

Frequency Spectra

<!– kmerfrequencyspectrum, analyzespectrumpeaks, fitspectrummodel not yet implemented as documented –>

Mycelia.jellyfish_counts_to_kmer_frequency_histogram — Function

jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file,
    outfile
) -> Any

Convert a Jellyfish k-mer count file into a frequency histogram.

Arguments

jellyfish_counts_file::String: Path to the gzipped TSV file containing Jellyfish k-mer counts
outfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv"): Optional output file path

Returns

String: Path to the generated histogram file

Description

Processes a Jellyfish k-mer count file to create a frequency histogram where:

Column 1: Number of k-mers that share the same count
Column 2: The count they share

Uses system sorting with LC_ALL=C for optimal performance on large files.

Notes

Requires gzip, sort, uniq, and sed command line tools
Uses intermediate disk storage for sorting large files
Skips processing if output file already exists

source

Mycelia.plot_kmer_frequency_spectra — Function

plot_kmer_frequency_spectra(
    counts;
    log_scale,
    kwargs...
) -> Plots.Plot

Plots a histogram of kmer counts against # of kmers with those counts

Returns the plot object for adding additional layers and saving

Creates a scatter plot visualizing the k-mer frequency spectrum - the relationship between k-mer frequencies and how many k-mers occur at each frequency.

Arguments

counts::AbstractVector{<:Integer}: Vector of k-mer counts/frequencies
log_scale::Union{Function,Nothing} = log2: Function to apply logarithmic scaling to both axes. Set to nothing to use linear scaling.
kwargs...: Additional keyword arguments passed to StatsPlots.plot()

Returns

Plots.Plot: A scatter plot object that can be further modified or saved

Details

The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.

source

Example: K-mer Spectrum Analysis

# Generate and analyze k-mer frequency spectrum
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

# Identify characteristic peaks
peaks = Mycelia.analyze_spectrum_peaks(spectrum)
println("Coverage peak at frequency: $(peaks.main_peak)")
println("Error peak at frequency: $(peaks.error_peak)")

# Fit mathematical model to spectrum
model = Mycelia.fit_spectrum_model(spectrum, model_type="negative_binomial")
println("Model fit R²: $(model.r_squared)")

Genome Characteristics from K-mers

<!– estimategenomesizefromkmers, estimatecoveragefromkmers, detectrepetitivekmers, calculategenome_complexity not yet implemented as documented –>

Example: Genome Size Estimation

# Estimate genome size using k-mer spectrum
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

# Find coverage peak
coverage_peak = Mycelia.find_coverage_peak(spectrum)
println("Estimated coverage: $(coverage_peak)x")

# Calculate genome size
total_kmers = sum(spectrum.frequencies .* spectrum.counts)
genome_size = total_kmers ÷ coverage_peak
println("Estimated genome size: $(genome_size) bp")

# Alternative method with error correction
corrected_estimate = Mycelia.estimate_genome_size_from_kmers(
    kmer_counts,
    error_correction=true,
    ploidy=1
)
println("Corrected genome size: $(corrected_estimate.size) bp")
println("Confidence interval: $(corrected_estimate.confidence_interval)")

Error Detection and Correction (Planned)

K-mer Based Error Detection (Planned)

NOTE: K-mer based error detection and correction functions are planned but not yet implemented.

Example: Error Detection (Planned)

# Identify likely error k-mers (planned)
error_kmers = Mycelia.identify_error_kmers(
    kmer_counts,
    min_coverage=3,
    max_coverage=100
)

println("Potential error k-mers: $(length(error_kmers))")
println("Error rate estimate: $(length(error_kmers) / length(kmer_counts) * 100)%")

# Correct errors in reads
corrected_reads = Mycelia.correct_sequencing_errors(
    "reads.fastq",
    error_kmers,
    correction_method="consensus"
)

# Validate correction effectiveness
validation = Mycelia.validate_error_correction(
    "reads.fastq",
    corrected_reads,
    reference_genome="reference.fasta"
)
println("Error reduction: $(validation.error_reduction_percent)%")

Sequence Composition Analysis (Planned)

Nucleotide Composition (Planned)

NOTE: Comprehensive composition analysis functions are planned but not yet implemented.

Example: Comprehensive Composition Analysis (Planned)

# Analyze nucleotide composition (planned)
composition = Mycelia.calculate_nucleotide_frequencies("sequences.fasta")
println("GC content: $(composition.gc_content)%")
println("AT content: $(composition.at_content)%")

# Dinucleotide analysis
dinuc_freq = Mycelia.analyze_dinucleotide_frequencies("sequences.fasta")
println("CpG frequency: $(dinuc_freq.CG)")
println("Expected CpG: $(dinuc_freq.expected_CG)")
println("CpG O/E ratio: $(dinuc_freq.CG_oe_ratio)")

# Detect composition bias
bias_analysis = Mycelia.detect_sequence_bias(composition)
if bias_analysis.bias_detected
    println("Sequence bias detected:")
    println("  Type: $(bias_analysis.bias_type)")
    println("  Strength: $(bias_analysis.bias_strength)")
end

Codon Usage Analysis

<!– calculatecodonusage, analyzecodonbias, comparecodonusage, optimizecodonusage not yet implemented as documented –>

Example: Codon Usage Analysis

# Analyze codon usage in coding sequences
cds_file = "coding_sequences.fasta"
codon_usage = Mycelia.calculate_codon_usage(cds_file, genetic_code="standard")

# Calculate codon bias metrics
bias_metrics = Mycelia.analyze_codon_bias(codon_usage)
println("Effective Number of Codons (ENC): $(bias_metrics.enc)")
println("Codon Adaptation Index (CAI): $(bias_metrics.cai)")
println("Frequency of Optimal Codons (FOP): $(bias_metrics.fop)")

# Compare with reference organism
reference_usage = Mycelia.load_reference_codon_usage("escherichia_coli")
comparison = Mycelia.compare_codon_usage(codon_usage, reference_usage)
println("Similarity to E. coli: $(comparison.similarity_score)")

Memory-Efficient Processing

Large Dataset Handling

<!– streamkmercounting, processkmersinchunks, parallelkmeranalysis, memoryefficient_counting not yet implemented as documented –>

Example: Large File Processing

# Process large files with memory constraints
large_file = "large_dataset.fastq"

# Stream processing for memory efficiency
kmer_counts = Mycelia.stream_kmer_counting(
    large_file,
    k=21,
    chunk_size=100000,
    memory_limit_gb=8
)

# Parallel processing for speed
parallel_counts = Mycelia.parallel_kmer_analysis(
    large_file,
    k=21,
    n_workers=8,
    merge_strategy="sum"
)

Optimized Data Structures

<!– createsparsekmermatrix, buildcompressedkmerindex, usebloomfiltercounting, implementcountminsketch not yet implemented as documented –>

Example: Memory Optimization

# Use probabilistic data structures for very large datasets
bloom_filter = Mycelia.create_kmer_bloom_filter(
    estimated_kmers=1_000_000_000,
    false_positive_rate=0.01
)

# Streaming k-mer presence testing
for read in Mycelia.stream_fastq("huge_dataset.fastq")
    for kmer in Mycelia.extract_kmers(read, k=21)
        if Mycelia.probably_present(bloom_filter, kmer)
            # Process likely present k-mer
            Mycelia.process_kmer(kmer)
        end
    end
end

Comparative K-mer Analysis (Planned)

Multi-Sample Analysis (Planned)

NOTE: Multi-sample k-mer comparison functions are planned but not yet implemented.

Example: Multi-Sample K-mer Comparison (Planned)

# Compare k-mer profiles across multiple samples (planned)
sample_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]

# Build k-mer profiles for each sample
profiles = [Mycelia.count_kmers(file, k=21) for file in sample_files]

# Calculate pairwise distances
distance_matrix = Mycelia.build_kmer_distance_matrix(profiles, metric="jaccard")
println("Pairwise k-mer distances:")
display(distance_matrix)

# Cluster samples by k-mer similarity
clusters = Mycelia.cluster_by_kmer_similarity(profiles, method="hierarchical")
println("Sample clusters: $(clusters)")

Contamination Detection

<!– detectcontaminationkmers, identifyforeignkmers, classifykmersources, removecontaminatingkmers not yet implemented as documented –>

Example: Contamination Detection

# Detect contamination using k-mer profiles
sample_kmers = Mycelia.count_kmers("sample.fastq", k=21)
reference_kmers = Mycelia.count_kmers("reference_genome.fasta", k=21)

# Identify foreign k-mers
foreign_kmers = Mycelia.identify_foreign_kmers(
    sample_kmers,
    reference_kmers,
    min_abundance=5
)

contamination_rate = length(foreign_kmers) / length(sample_kmers)
println("Contamination rate: $(contamination_rate * 100)%")

# Classify contamination sources
contamination_sources = Mycelia.classify_kmer_sources(
    foreign_kmers,
    database_kmers=["human", "bacterial", "viral"]
)

for (source, proportion) in contamination_sources
    println("$(source): $(proportion * 100)%")
end

# Create comprehensive k-mer analysis plots
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)

# K-mer frequency spectrum
spectrum_plot = Mycelia.plot_kmer_spectrum(kmer_counts, 
                                  title="21-mer Frequency Spectrum",
                                  log_scale=true)

# Genome size estimation plot
size_plot = Mycelia.plot_genome_size_estimation(kmer_counts,
                                       show_confidence_interval=true)

# Combined dashboard
kmer_dashboard = Mycelia.create_kmer_dashboard(kmer_counts)
Mycelia.save_plot(kmer_dashboard, "kmer_analysis.png")

Performance Considerations

Algorithm Selection

Dense matrices: Use for small k (k ≤ 12) or small genomes
Sparse matrices: Use for large k (k ≥ 15) or large genomes
Streaming: Use for memory-constrained environments
Parallel processing: Use for time-critical applications

Memory Usage

k=15: ~1 GB for dense counting
k=21: ~17 GB for dense counting (use sparse)
k=31: Sparse counting only
Large genomes: Consider streaming or chunked processing

Computational Complexity

Time: O(n) for sequence length n
Space: O(4^k) for dense, O(unique k-mers) for sparse
Parallel: Near-linear speedup for independent samples

Common Issues and Solutions

Memory Limitations

# Handle memory constraints
if Mycelia.estimate_memory_usage(file, k=21) > Mycelia.available_memory()
    # Use streaming approach
    kmer_counts = Mycelia.stream_kmer_counting(file, k=21, chunk_size=50000)
else
    # Use standard approach
    kmer_counts = Mycelia.count_kmers(file, k=21)
end

Parameter Selection

# Choose optimal k-mer size
optimal_k = Mycelia.select_optimal_k(
    genome_size_estimate=5_000_000,
    error_rate=0.01,
    coverage=30
)
println("Recommended k-mer size: $(optimal_k)")

Data Structures (planned)

KmerCounts - K-mer count data structure (planned)
KmerSpectrum - Frequency spectrum representation (planned)
KmerGraph - K-mer overlap graph (planned)

File I/O

Mycelia.open_fastx - Read sequence files
Mycelia.save_kmer_results - Save k-mer counts
Mycelia.load_kmer_results - Load saved counts

Previous Steps

Quality Control - Preprocess reads before k-mer analysis

Next Steps

Genome Assembly (planned) - Use k-mer analysis for assembly
Comparative Genomics (planned) - Compare k-mer profiles