Tutorial 3: K-mer Analysis and Feature Extraction
This tutorial explores k-mer analysis, a fundamental technique in bioinformatics for sequence analysis, genome assembly, and comparative genomics.
Learning Objectives
By the end of this tutorial, you will understand:
- K-mer theory and biological significance
- How k-mer size affects analysis sensitivity and specificity
- Dense vs sparse k-mer counting strategies
- K-mer frequency spectra and their interpretation
- Applications in genome size estimation and quality assessment
- Memory and computational considerations for large-scale analysis
Setup
import Pkg
if isinteractive()
Pkg.activate("..")
end
import Test
import Mycelia
import FASTX
import Random
import Statistics
Random.seed!(42)Part 1: K-mer Theory and Biological Context
K-mers are subsequences of length k extracted from DNA sequences. They capture local sequence composition and are fundamental to many algorithms.
println("=== K-mer Analysis Tutorial ===")K-mer Mathematics
For DNA sequences, there are 4^k possible k-mers. Understanding k-mer space helps with parameter selection.
function kmer_space_size(k, alphabet_size=4)
return alphabet_size^k
end
println("K-mer space sizes:")
for k in [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
space_size = kmer_space_size(k)
println("k=$k: $(space_size) possible k-mers ($(space_size ÷ 1000)K)")
endBiological Significance
K-mers capture different biological features depending on their size:
- Small k-mers (k=3-7): Capture short motifs, sensitive to errors
- Medium k-mers (k=15-21): Balance sensitivity and specificity
- Large k-mers (k=25-51): Specific but may miss short overlaps
println("\nK-mer size selection guidelines:")
println("k=3-7: Short motifs, codon analysis")
println("k=15-21: Error correction, initial assembly")
println("k=25-31: Genome assembly, repeat detection")
println("k=35-51: Specific overlaps, large genome assembly")Part 2: K-mer Counting Strategies
Different applications require different counting approaches. Understanding trade-offs helps optimize performance.
println("\n=== K-mer Counting Strategies ===")Generate test sequences for demonstration
test_sequences = [
Mycelia.random_fasta_record(moltype=:DNA, seed=i, L=1000)
for i in 1:10
]Write sequences to temporary files
temp_files = String[]
for (i, seq) in enumerate(test_sequences)
filename = "test_seq_$i.fasta"
Mycelia.write_fasta(outfile=filename, records=[seq])
push!(temp_files, filename)
end
println("Generated $(length(temp_files)) test sequences")Dense K-mer Counting
Dense counting stores all possible k-mers, including those not observed. Memory usage: O(4^k) - grows exponentially with k
println("\n--- Dense K-mer Counting ---")TODO: Implement comprehensive dense k-mer analysis
- Count all possible k-mers
- Analyze frequency distributions
- Memory usage profiling
- Comparison across different k values
for k in [3, 5, 7, 9]
println("Computing dense k-mer counts for k=$k...")
# Memory estimation
memory_mb = (4^k * 4) / (1024^2) ## Assuming 4 bytes per count
println(" Estimated memory: $(round(memory_mb, digits=2)) MB")
if memory_mb < 100 ## Only run if memory usage is reasonable
dense_counts = Mycelia.fasta_list_to_dense_kmer_counts(
fasta_list=temp_files,
alphabet=:DNA,
k=k
)
println(" ✓ Dense counting completed")
println(" Matrix size: $(size(dense_counts.counts))")
else
println(" ⚠ Skipping due to high memory usage")
end
endSparse K-mer Counting
Sparse counting only stores observed k-mers. Memory usage: O(n) where n is number of unique k-mers
println("\n--- Sparse K-mer Counting ---")TODO: Implement comprehensive sparse k-mer analysis
- Count only observed k-mers
- Analyze sparsity patterns
- Memory efficiency comparison
- Scalability testing
for k in [11, 13, 15, 17, 19, 21]
println("Computing sparse k-mer counts for k=$k...")
sparse_counts = Mycelia.fasta_list_to_sparse_kmer_counts(
fasta_list=temp_files,
alphabet=:DNA,
k=k
)
println(" ✓ Sparse counting completed")
println(" Unique k-mers: $(length(sparse_counts))")
endPart 3: K-mer Frequency Spectra
K-mer frequency spectra reveal genome characteristics and data quality
println("\n=== K-mer Frequency Spectra ===")TODO: Implement k-mer spectrum analysis
- Generate frequency histograms
- Identify characteristic peaks
- Estimate genome size and coverage
- Detect contamination and repeats
Part 4: Applications in Genome Analysis
K-mers have many applications in genomic analysis
println("\n=== Genome Analysis Applications ===")Genome Size Estimation
Use k-mer frequency spectra to estimate genome size Formula: Genome size ≈ Total k-mers / Coverage peak
println("--- Genome Size Estimation ---")TODO: Implement genome size estimation
- Find coverage peak in frequency spectrum
- Calculate total k-mers
- Estimate genome size
- Validate against known genome size
Error Detection and Correction
Low-frequency k-mers often represent sequencing errors
println("--- Error Detection ---")TODO: Implement error detection
- Identify low-frequency k-mers
- Classify as errors vs rare variants
- Implement error correction algorithms
Contamination Detection
Foreign DNA creates distinctive k-mer patterns
println("--- Contamination Detection ---")TODO: Implement contamination detection
- Compare k-mer profiles
- Identify foreign k-mers
- Quantify contamination levels
Part 5: Performance Optimization
Large-scale k-mer analysis requires optimization
println("\n=== Performance Optimization ===")Memory Management
Strategies for handling large k-mer datasets
println("--- Memory Management ---")TODO: Implement memory optimization
- Streaming k-mer processing
- Disk-based storage
- Memory-mapped files
- Compression techniques
Parallel Processing
Accelerate k-mer counting with parallelization
println("--- Parallel Processing ---")TODO: Implement parallel k-mer counting
- Multi-threaded counting
- Distributed processing
- Load balancing strategies
Part 6: Visualization and Interpretation
Create plots to understand k-mer analysis results
println("\n=== K-mer Visualization ===")TODO: Implement k-mer visualization
- Frequency spectrum plots
- K-mer composition heatmaps
- Coverage distribution plots
- Comparative k-mer analysis
Part 7: Advanced K-mer Techniques
Explore advanced k-mer analysis methods
println("\n=== Advanced Techniques ===")Minimizers
Reduce k-mer space using minimizer techniques
println("--- Minimizers ---")TODO: Implement minimizer techniques
- Canonical minimizers
- Syncmers
- Strobemers
Graph Construction
Build graphs from k-mer overlaps
println("--- Graph Construction ---")TODO: Implement k-mer graph construction
- de Bruijn graphs
- String graphs
- Overlap graphs
Summary and Best Practices
println("\n=== K-mer Analysis Summary ===")
println("✓ Understanding k-mer theory and biological significance")
println("✓ Choosing appropriate k-mer sizes for different applications")
println("✓ Implementing dense and sparse counting strategies")
println("✓ Analyzing k-mer frequency spectra")
println("✓ Applying k-mer analysis to genome size estimation")
println("✓ Optimizing performance for large-scale analysis")
println("\nBest Practices:")
println("- Start with k=21 for general analysis")
println("- Use dense counting for small k, sparse for large k")
println("- Monitor memory usage and optimize accordingly")
println("- Validate results with known datasets")
println("- Consider biological context in interpretation")Cleanup
for file in temp_files
rm(file, force=true)
end
println("\nNext: Tutorial 4 - Genome Assembly")
nothing