Sequence Analysis & K-mers

Functions for analyzing sequence composition, counting k-mers, and extracting genomic features from sequencing data.

Overview

Sequence analysis forms the foundation of many bioinformatics algorithms. Mycelia provides comprehensive tools for:

  • K-mer counting and analysis using efficient algorithms
  • Genome size estimation from k-mer frequency spectra
  • Error detection and correction using k-mer patterns
  • Sequence composition analysis and bias detection
  • Memory-efficient processing of large datasets

Common Workflows

1. Basic K-mer Analysis

# Count k-mers in sequencing reads
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)

# Analyze k-mer frequency spectrum
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

2. Genome Size Estimation

# Estimate genome size from k-mer spectrum
genome_size = Mycelia.estimate_genome_size_from_kmers(kmer_counts, coverage_peak=30)

3. Contamination Detection

# Detect contamination using k-mer profiles
contamination = Mycelia.detect_contamination_kmers("reads.fastq", expected_profile)

K-mer Counting

Basic K-mer Counting

Mycelia.count_kmersFunction
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.DNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in a DNA sequence.

Arguments

  • ::Type{Kmers.Kmer{A,K}}: K-mer type with alphabet A and length K
  • sequence::BioSequences.LongSequence: Input DNA sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

Type Parameters

  • A <: BioSequences.DNAAlphabet: DNA alphabet type
  • K: Length of k-mers
source
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.RNAAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> Any

Count the frequency of each k-mer in an RNA sequence.

Arguments

  • Kmer: Type parameter specifying the k-mer length K and RNA alphabet
  • sequence: Input RNA sequence to analyze

Returns

  • Dict{Kmers.Kmer, Int}: Sorted dictionary mapping each k-mer to its frequency count
source
count_kmers(
    _::Type{Kmers.Kmer{A<:BioSequences.AminoAcidAlphabet, K}},
    sequence::BioSequences.LongSequence
) -> OrderedCollections.OrderedDict{K, Int64} where K<:(Kmers.Kmer{BioSequences.AminoAcidAlphabet, _A, _B} where {_B, _A})

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

  • Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
  • sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source
count_kmers(
    _::Type{KMER_TYPE},
    record::Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}
) -> Any

Count the frequency of amino acid k-mers in a biological sequence.

Arguments

  • Kmers.Kmer{A,K}: Type parameter specifying amino acid alphabet (A) and k-mer length (K)
  • sequence: Input biological sequence to analyze

Returns

A sorted dictionary mapping each k-mer to its frequency count in the sequence.

source
count_kmers(
    _::Type{KMER_TYPE},
    records::AbstractArray{T<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}, 1}
) -> Any

Count k-mers across multiple sequence records and return a sorted frequency table.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{3} for 3-mers)
  • records: Vector of FASTA/FASTQ records to analyze

Returns

  • Dict{KMER_TYPE, Int}: Sorted dictionary mapping k-mers to their frequencies
source
count_kmers(
    _::Type{KMER_TYPE},
    sequences::Union{FASTX.FASTA.Reader, FASTX.FASTQ.Reader}
) -> Any

Counts k-mer occurrences in biological sequences from a FASTA/FASTQ reader.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length and encoding (e.g., DNAKmer{4} for 4-mers)
  • sequences: A FASTA or FASTQ reader containing the biological sequences to analyze

Returns

A dictionary mapping k-mers to their counts in the input sequences

source
count_kmers(
    _::Type{KMER_TYPE},
    fastx_files::AbstractArray{T<:AbstractString, 1}
) -> Any

Count k-mers across multiple FASTA/FASTQ files and merge the results.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer length (e.g., DNAKmer{4} for 4-mers)
  • fastx_files: Vector of paths to FASTA/FASTQ files

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping k-mers to their total counts across all files
source
count_kmers(
    _::Type{KMER_TYPE},
    fastx_file::AbstractString
) -> Any

Count k-mers in a FASTA/FASTQ file and return their frequencies.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer type (e.g., DNAKmer{K})
  • fastx_file: Path to input FASTA/FASTQ file

Returns

  • Dict{KMER_TYPE, Int}: Dictionary mapping each k-mer to its frequency
source
Mycelia.count_canonical_kmersFunction
count_canonical_kmers(_::Type{KMER_TYPE}, sequences) -> Any

Count canonical k-mers in biological sequences. A canonical k-mer is the lexicographically smaller of a DNA sequence and its reverse complement, ensuring strand-independent counting.

Arguments

  • KMER_TYPE: Type parameter specifying the k-mer size and structure
  • sequences: Iterator of biological sequences to analyze

Returns

  • Dict{KMER_TYPE,Int}: Dictionary mapping canonical k-mers to their counts
source
Mycelia.fasta_list_to_dense_kmer_countsFunction
fasta_list_to_dense_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    result_file,
    force,
    cleanup_temp
)

Create a dense k-mer counts table for a set of FASTA files, with disk-backed temporary storage, custom element type, robust error handling, and optional output file caching.

source
Mycelia.fasta_list_to_sparse_kmer_countsFunction
fasta_list_to_sparse_kmer_counts(
;
    fasta_list,
    k,
    alphabet,
    temp_dir_parent,
    count_element_type,
    rarefaction_data_filename,
    rarefaction_plot_basename,
    show_rarefaction_plot,
    result_file,
    out_dir,
    force,
    rarefaction_plot_kwargs...
)

Create a sparse kmer counts table (SparseMatrixCSC) from a list of FASTA files using a 3-pass approach. Pass 1 (Parallel): Counts kmers per file and writes to temporary JLD2 files. Pass 2 (Serial): Aggregates unique kmers, max count, nnz per file, and rarefaction data from temp files. Generates and saves a k-mer rarefaction plot. Pass 3 (Parallel): Reads temporary counts again to construct the final sparse matrix.

Optionally, a results filename can be provided to save/load the output. If the file exists and force is false, the result is loaded and returned. If force is true or the file does not exist, results are computed and saved.

Output Directory Behavior

  • All auxiliary output files (e.g., rarefaction data, plots) are written to a common output directory.
  • By default, this is:
    • The value of out_dir if provided.
    • Otherwise, the directory containing result_file (if provided and has a directory component).
    • Otherwise, the current working directory (pwd()).
  • If you provide an absolute path for an output file (e.g. rarefaction_data_filename), that path is used directly.
  • If both out_dir and a relative filename are given, the file is written to out_dir.

Arguments

  • fasta_list::AbstractVector{<:AbstractString}: A list of paths to FASTA files.
  • k::Integer: The length of the kmer.
  • alphabet::Symbol: The alphabet type (:AA, :DNA, :RNA).
  • temp_dir_parent::AbstractString: Parent directory for creating the temporary working directory. Defaults to Base.tempdir().
  • count_element_type::Union{Type{<:Unsigned}, Nothing}: Optional. Specifies the unsigned integer type for the counts. If nothing (default), the smallest UInt type capable of holding the maximum observed count is used.
  • rarefaction_data_filename::AbstractString: Filename for the TSV output of rarefaction data. If a relative path, will be written to out_dir.
  • rarefaction_plot_basename::AbstractString: Basename for the output rarefaction plots. If a relative path, will be written to out_dir.
  • show_rarefaction_plot::Bool: Whether to display the rarefaction plot after generation. Defaults to true.
  • result_file::Union{Nothing, AbstractString}: Optional. If provided, path to a file to save/load the full results (kmers, counts, etc) as a JLD2 file.
  • out_dir::Union{Nothing, AbstractString}: Optional. Output directory for auxiliary outputs. Defaults as described above.
  • force::Bool: If true, recompute and overwrite the output file even if it exists. Defaults to false.
  • rarefaction_plot_kwargs...: Keyword arguments to pass to plot_kmer_rarefaction for plot customization.

Returns

  • NamedTuple{(:kmers, :counts, :rarefaction_data_path)}:
    • kmers: A sorted Vector of unique kmer objects.
    • counts: A SparseArrays.SparseMatrixCSC{V, Int} storing kmer counts.
    • rarefaction_data_path: Path to the saved TSV file with rarefaction data.

Raises

  • ErrorException: If input fasta_list is empty, alphabet is invalid, or required Kmer/counting functions are not found.
source

Example: K-mer Counting Comparison

# Compare different k-mer counting approaches
reads_file = "reads.fastq"

# Dense counting (stores all possible k-mers)
dense_counts = Mycelia.count_kmers(reads_file, k=15, method="dense")
println("Dense matrix size: $(size(dense_counts))")

# Sparse counting (only observed k-mers)
sparse_counts = Mycelia.count_kmers(reads_file, k=21, method="sparse")
println("Unique k-mers: $(length(sparse_counts))")

# Canonical k-mers (combines forward and reverse complement)
canonical_counts = Mycelia.count_canonical_kmers(reads_file, k=21)
println("Canonical k-mers: $(length(canonical_counts))")

Advanced K-mer Analysis

<!– buildkmergraph, findkmeroverlaps, extractkmerpaths, analyzekmerconnectivity not yet implemented as documented –>

Mycelia.build_kmer_graph_nextFunction
build_kmer_graph_next(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}};
    graph_mode
) -> MetaGraphsNext.MetaGraph{Int64, Graphs.SimpleGraphs.SimpleDiGraph{Int64}, Label, VertexData, EdgeData, Nothing, WeightFunction, Float64} where {Label, VertexData, EdgeData, WeightFunction}

Create a next-generation, type-stable k-mer graph using MetaGraphsNext.

This implementation uses canonical k-mers as vertices with strand-aware edges that respect biological transition constraints for both single-strand and double-strand sequences.

Arguments

  • kmer_type: Type of k-mer (e.g., DNAKmer{K})
  • observations: Vector of FASTA/FASTQ records
  • graph_mode: SingleStrand for directional sequences, DoubleStrand for DNA (default)

Returns

  • MetaGraphsNext.MetaGraph with canonical vertices and strand-aware edges

Details

  • Vertices: Always canonical k-mers (lexicographically smaller of kmer/reverse_complement)
  • Edges: Strand-aware transitions that respect biological constraints
  • SingleStrand mode: Only forward-strand transitions allowed
  • DoubleStrand mode: Both forward and reverse-complement transitions allowed
source
Mycelia.build_stranded_kmer_graphFunction
build_stranded_kmer_graph(
    kmer_type,
    observations::AbstractVector{<:Union{FASTX.FASTA.Record, FASTX.FASTQ.Record}}
) -> MetaGraphs.MetaDiGraph{T, Float64} where T<:Integer

Create a weighted, strand-specific kmer (de bruijn) graph from a set of kmers and a series of sequence observations in FASTA format.

source
Mycelia.build_directed_kmer_graphFunction
build_directed_kmer_graph(; fastq, k, plot)

Constructs a directed graph representation of k-mer transitions from FASTQ sequencing data.

Arguments

  • fastq: Path to input FASTQ file
  • k: K-mer size (default: 1). Must be odd and prime. If k=1, optimal size is auto-determined
  • plot: Boolean to display quality distribution plot (default: false)

Returns

MetaDiGraph with properties:

  • assembly_k: k-mer size used
  • kmer_counts: frequency of each k-mer
  • transition_likelihoods: edge weights between k-mers
  • kmermeanquality, kmertotalquality: quality metrics
  • branchingnodes, unbranchingnodes: topological classification
  • likelyvalidkmer_indices: k-mers above mean quality threshold
  • likelysequencingartifact_indices: potential erroneous k-mers

Note

For DNA assembly, quality scores are normalized across both strands.

source

Example: K-mer Graph Construction

# Build k-mer overlap graph
kmer_graph = Mycelia.build_kmer_graph(
    sequences,
    k=31,
    min_overlap=20,
    min_coverage=5
)

# Analyze graph properties
graph_stats = Mycelia.analyze_kmer_connectivity(kmer_graph)
println("Graph nodes: $(graph_stats.n_nodes)")
println("Graph edges: $(graph_stats.n_edges)")
println("Connected components: $(graph_stats.n_components)")

K-mer Frequency Analysis

Frequency Spectra

<!– kmerfrequencyspectrum, analyzespectrumpeaks, fitspectrummodel not yet implemented as documented –>

Mycelia.jellyfish_counts_to_kmer_frequency_histogramFunction
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file
) -> Any
jellyfish_counts_to_kmer_frequency_histogram(
    jellyfish_counts_file,
    outfile
) -> Any

Convert a Jellyfish k-mer count file into a frequency histogram.

Arguments

  • jellyfish_counts_file::String: Path to the gzipped TSV file containing Jellyfish k-mer counts
  • outfile::String=replace(jellyfish_counts_file, r"\.tsv\.gz$" => ".count_histogram.tsv"): Optional output file path

Returns

  • String: Path to the generated histogram file

Description

Processes a Jellyfish k-mer count file to create a frequency histogram where:

  • Column 1: Number of k-mers that share the same count
  • Column 2: The count they share

Uses system sorting with LC_ALL=C for optimal performance on large files.

Notes

  • Requires gzip, sort, uniq, and sed command line tools
  • Uses intermediate disk storage for sorting large files
  • Skips processing if output file already exists
source
Mycelia.plot_kmer_frequency_spectraFunction
plot_kmer_frequency_spectra(
    counts;
    log_scale,
    kwargs...
) -> Plots.Plot

Plots a histogram of kmer counts against # of kmers with those counts

Returns the plot object for adding additional layers and saving

Creates a scatter plot visualizing the k-mer frequency spectrum - the relationship between k-mer frequencies and how many k-mers occur at each frequency.

Arguments

  • counts::AbstractVector{<:Integer}: Vector of k-mer counts/frequencies
  • log_scale::Union{Function,Nothing} = log2: Function to apply logarithmic scaling to both axes. Set to nothing to use linear scaling.
  • kwargs...: Additional keyword arguments passed to StatsPlots.plot()

Returns

  • Plots.Plot: A scatter plot object that can be further modified or saved

Details

The x-axis shows k-mer frequencies (how many times each k-mer appears), while the y-axis shows how many distinct k-mers appear at each frequency. Both axes are log-scaled by default using log2.

source

Example: K-mer Spectrum Analysis

# Generate and analyze k-mer frequency spectrum
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

# Identify characteristic peaks
peaks = Mycelia.analyze_spectrum_peaks(spectrum)
println("Coverage peak at frequency: $(peaks.main_peak)")
println("Error peak at frequency: $(peaks.error_peak)")

# Fit mathematical model to spectrum
model = Mycelia.fit_spectrum_model(spectrum, model_type="negative_binomial")
println("Model fit R²: $(model.r_squared)")

Genome Characteristics from K-mers

<!– estimategenomesizefromkmers, estimatecoveragefromkmers, detectrepetitivekmers, calculategenome_complexity not yet implemented as documented –>

Example: Genome Size Estimation

# Estimate genome size using k-mer spectrum
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)
spectrum = Mycelia.kmer_frequency_spectrum(kmer_counts)

# Find coverage peak
coverage_peak = Mycelia.find_coverage_peak(spectrum)
println("Estimated coverage: $(coverage_peak)x")

# Calculate genome size
total_kmers = sum(spectrum.frequencies .* spectrum.counts)
genome_size = total_kmers ÷ coverage_peak
println("Estimated genome size: $(genome_size) bp")

# Alternative method with error correction
corrected_estimate = Mycelia.estimate_genome_size_from_kmers(
    kmer_counts,
    error_correction=true,
    ploidy=1
)
println("Corrected genome size: $(corrected_estimate.size) bp")
println("Confidence interval: $(corrected_estimate.confidence_interval)")

Error Detection and Correction

K-mer Based Error Detection

<!– identifyerrorkmers, correctsequencingerrors, validateerrorcorrection not yet implemented as documented –>

Example: Error Detection

# Identify likely error k-mers
error_kmers = Mycelia.identify_error_kmers(
    kmer_counts,
    min_coverage=3,
    max_coverage=100
)

println("Potential error k-mers: $(length(error_kmers))")
println("Error rate estimate: $(length(error_kmers) / length(kmer_counts) * 100)%")

# Correct errors in reads
corrected_reads = Mycelia.correct_sequencing_errors(
    "reads.fastq",
    error_kmers,
    correction_method="consensus"
)

# Validate correction effectiveness
validation = Mycelia.validate_error_correction(
    "reads.fastq",
    corrected_reads,
    reference_genome="reference.fasta"
)
println("Error reduction: $(validation.error_reduction_percent)%")

Sequence Composition Analysis

Nucleotide Composition

<!– calculatenucleotidefrequencies, analyzedinucleotidefrequencies, calculatecodonusage, detectsequencebias not yet implemented as documented –>

Example: Comprehensive Composition Analysis

# Analyze nucleotide composition
composition = Mycelia.calculate_nucleotide_frequencies("sequences.fasta")
println("GC content: $(composition.gc_content)%")
println("AT content: $(composition.at_content)%")

# Dinucleotide analysis
dinuc_freq = Mycelia.analyze_dinucleotide_frequencies("sequences.fasta")
println("CpG frequency: $(dinuc_freq.CG)")
println("Expected CpG: $(dinuc_freq.expected_CG)")
println("CpG O/E ratio: $(dinuc_freq.CG_oe_ratio)")

# Detect composition bias
bias_analysis = Mycelia.detect_sequence_bias(composition)
if bias_analysis.bias_detected
    println("Sequence bias detected:")
    println("  Type: $(bias_analysis.bias_type)")
    println("  Strength: $(bias_analysis.bias_strength)")
end

Codon Usage Analysis

<!– calculatecodonusage, analyzecodonbias, comparecodonusage, optimizecodonusage not yet implemented as documented –>

Example: Codon Usage Analysis

# Analyze codon usage in coding sequences
cds_file = "coding_sequences.fasta"
codon_usage = Mycelia.calculate_codon_usage(cds_file, genetic_code="standard")

# Calculate codon bias metrics
bias_metrics = Mycelia.analyze_codon_bias(codon_usage)
println("Effective Number of Codons (ENC): $(bias_metrics.enc)")
println("Codon Adaptation Index (CAI): $(bias_metrics.cai)")
println("Frequency of Optimal Codons (FOP): $(bias_metrics.fop)")

# Compare with reference organism
reference_usage = Mycelia.load_reference_codon_usage("escherichia_coli")
comparison = Mycelia.compare_codon_usage(codon_usage, reference_usage)
println("Similarity to E. coli: $(comparison.similarity_score)")

Memory-Efficient Processing

Large Dataset Handling

<!– streamkmercounting, processkmersinchunks, parallelkmeranalysis, memoryefficient_counting not yet implemented as documented –>

Example: Large File Processing

# Process large files with memory constraints
large_file = "large_dataset.fastq"

# Stream processing for memory efficiency
kmer_counts = Mycelia.stream_kmer_counting(
    large_file,
    k=21,
    chunk_size=100000,
    memory_limit_gb=8
)

# Parallel processing for speed
parallel_counts = Mycelia.parallel_kmer_analysis(
    large_file,
    k=21,
    n_workers=8,
    merge_strategy="sum"
)

Optimized Data Structures

<!– createsparsekmermatrix, buildcompressedkmerindex, usebloomfiltercounting, implementcountminsketch not yet implemented as documented –>

Example: Memory Optimization

# Use probabilistic data structures for very large datasets
bloom_filter = Mycelia.create_kmer_bloom_filter(
    estimated_kmers=1_000_000_000,
    false_positive_rate=0.01
)

# Streaming k-mer presence testing
for read in Mycelia.stream_fastq("huge_dataset.fastq")
    for kmer in Mycelia.extract_kmers(read, k=21)
        if Mycelia.probably_present(bloom_filter, kmer)
            # Process likely present k-mer
            Mycelia.process_kmer(kmer)
        end
    end
end

Comparative K-mer Analysis

Multi-Sample Analysis

<!– comparekmerprofiles, buildkmerdistancematrix, clusterbykmersimilarity, identifysamplespecific_kmers not yet implemented as documented –>

Example: Multi-Sample K-mer Comparison

# Compare k-mer profiles across multiple samples
sample_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]

# Build k-mer profiles for each sample
profiles = [Mycelia.count_kmers(file, k=21) for file in sample_files]

# Calculate pairwise distances
distance_matrix = Mycelia.build_kmer_distance_matrix(profiles, metric="jaccard")
println("Pairwise k-mer distances:")
display(distance_matrix)

# Cluster samples by k-mer similarity
clusters = Mycelia.cluster_by_kmer_similarity(profiles, method="hierarchical")
println("Sample clusters: $(clusters)")

Contamination Detection

<!– detectcontaminationkmers, identifyforeignkmers, classifykmersources, removecontaminatingkmers not yet implemented as documented –>

Example: Contamination Detection

# Detect contamination using k-mer profiles
sample_kmers = Mycelia.count_kmers("sample.fastq", k=21)
reference_kmers = Mycelia.count_kmers("reference_genome.fasta", k=21)

# Identify foreign k-mers
foreign_kmers = Mycelia.identify_foreign_kmers(
    sample_kmers,
    reference_kmers,
    min_abundance=5
)

contamination_rate = length(foreign_kmers) / length(sample_kmers)
println("Contamination rate: $(contamination_rate * 100)%")

# Classify contamination sources
contamination_sources = Mycelia.classify_kmer_sources(
    foreign_kmers,
    database_kmers=["human", "bacterial", "viral"]
)

for (source, proportion) in contamination_sources
    println("$(source): $(proportion * 100)%")
end

Visualization and Reporting

K-mer Plots

<!– plotkmerspectrum, plotkmercomposition, plotgenomesizeestimation, createkmer_dashboard not yet implemented as individual functions –>

Example: K-mer Visualization

# Create comprehensive k-mer analysis plots
kmer_counts = Mycelia.count_kmers("reads.fastq", k=21)

# K-mer frequency spectrum
spectrum_plot = Mycelia.plot_kmer_spectrum(kmer_counts, 
                                  title="21-mer Frequency Spectrum",
                                  log_scale=true)

# Genome size estimation plot
size_plot = Mycelia.plot_genome_size_estimation(kmer_counts,
                                       show_confidence_interval=true)

# Combined dashboard
kmer_dashboard = Mycelia.create_kmer_dashboard(kmer_counts)
Mycelia.save_plot(kmer_dashboard, "kmer_analysis.png")

Performance Considerations

Algorithm Selection

  • Dense matrices: Use for small k (k ≤ 12) or small genomes
  • Sparse matrices: Use for large k (k ≥ 15) or large genomes
  • Streaming: Use for memory-constrained environments
  • Parallel processing: Use for time-critical applications

Memory Usage

  • k=15: ~1 GB for dense counting
  • k=21: ~17 GB for dense counting (use sparse)
  • k=31: Sparse counting only
  • Large genomes: Consider streaming or chunked processing

Computational Complexity

  • Time: O(n) for sequence length n
  • Space: O(4^k) for dense, O(unique k-mers) for sparse
  • Parallel: Near-linear speedup for independent samples

Common Issues and Solutions

Memory Limitations

# Handle memory constraints
if Mycelia.estimate_memory_usage(file, k=21) > Mycelia.available_memory()
    # Use streaming approach
    kmer_counts = Mycelia.stream_kmer_counting(file, k=21, chunk_size=50000)
else
    # Use standard approach
    kmer_counts = Mycelia.count_kmers(file, k=21)
end

Parameter Selection

# Choose optimal k-mer size
optimal_k = Mycelia.select_optimal_k(
    genome_size_estimate=5_000_000,
    error_rate=0.01,
    coverage=30
)
println("Recommended k-mer size: $(optimal_k)")

Data Structures

File I/O

Previous Steps

Next Steps

  • Genome Assembly (planned) - Use k-mer analysis for assembly
  • Comparative Genomics (planned) - Compare k-mer profiles

See Also