Tutorial 4: Genome Assembly

This tutorial covers comprehensive genome assembly approaches, including short read, long read, and hybrid assembly methods, with emphasis on Mycelia's probabilistic assembly and benchmarking against state-of-the-art tools.

Learning Objectives

By the end of this tutorial, you will understand:

Different assembly algorithms and their applications
Short read assembly with MEGAHIT and metaSPAdes
Long read assembly with Flye, Canu, and hifiasm
Hybrid assembly approaches combining multiple data types
Mycelia's probabilistic assembly using string graphs and Viterbi error correction
Assembly quality metrics and their interpretation
Error correction and polishing techniques
Handling repetitive sequences and structural variants
Assembly validation and benchmarking approaches

Setup

import Pkg
if isinteractive()
    Pkg.activate("..")
end

import Test
import Mycelia
import FASTX
import Random
import Statistics

Random.seed!(42)

Part 1: Assembly Algorithm Overview

Understanding different assembly approaches helps choose the right method for your data type and research goals.

println("=== Genome Assembly Tutorial ===")

Assembly Paradigms

Four main approaches to genome assembly:

de Bruijn Graph - for short reads (MEGAHIT, metaSPAdes)
Overlap-Layout-Consensus (OLC) - for long reads (Canu)
String Graph - for long accurate reads (hifiasm, Flye)
Probabilistic Assembly - Mycelia's approach using string graphs with Viterbi error correction

println("Assembly Algorithm Comparison:")
println("de Bruijn Graph:")
println("  - Best for: Short reads (Illumina)")
println("  - Tools: MEGAHIT, metaSPAdes, SPAdes")
println("  - Strengths: Efficient, handles high coverage")
println("  - Weaknesses: Struggles with repeats, requires error correction")
println()
println("OLC (Overlap-Layout-Consensus):")
println("  - Best for: Long reads (PacBio, Nanopore)")
println("  - Tools: Canu, Miniasm")
println("  - Strengths: Handles repeats, intuitive approach")
println("  - Weaknesses: Computationally expensive, error-sensitive")
println()
println("String Graph:")
println("  - Best for: Long accurate reads (HiFi)")
println("  - Tools: hifiasm, Flye")
println("  - Strengths: Efficient, haplotype-aware, handles complexity")
println("  - Weaknesses: Requires high-quality reads")
println()
println("Probabilistic Assembly (Mycelia):")
println("  - Best for: Any read type with error correction")
println("  - Tools: Mycelia's string graph + Viterbi")
println("  - Strengths: Handles errors probabilistically, adaptable")
println("  - Weaknesses: Computationally intensive for large genomes")

Part 2: Data Preparation for Assembly

Proper data preparation is crucial for successful assembly

println("\n=== Data Preparation ===")

Simulating Multi-Platform Data

Create synthetic data for comprehensive assembly testing

println("--- Generating Test Data ---")

Create a synthetic genome with known structure

genome_size = 50000  ## 50 kb for demonstration
reference_genome = Mycelia.random_fasta_record(moltype=:DNA, seed=1, L=genome_size)

println("Reference genome: $(genome_size) bp")

Simulate different read types

short_read_params = Dict(
    "coverage" => 30,
    "read_length" => 150,
    "error_rate" => 0.001,
    "description" => "Illumina short reads"
)

long_read_params = Dict(
    "coverage" => 20,
    "read_length" => 10000,
    "error_rate" => 0.05,
    "description" => "Nanopore long reads"
)

hifi_params = Dict(
    "coverage" => 15,
    "read_length" => 15000,
    "error_rate" => 0.001,
    "description" => "PacBio HiFi reads"
)

TODO: Implement multi-platform read simulation

Generate short reads for MEGAHIT/metaSPAdes
Generate long reads for Flye/Canu
Generate HiFi reads for hifiasm
Create hybrid datasets for Unicycler
Generate error-prone reads for Mycelia polishing

println("Simulating read types:")
for (name, params) in [("Short reads", short_read_params),
                      ("Long reads", long_read_params),
                      ("HiFi reads", hifi_params)]
    println("  $(name): $(params["coverage"])x coverage, $(params["read_length"]) bp, $(params["error_rate"]*100)% error")
end

Write test data

reference_file = "reference_genome.fasta"
short_reads_r1 = "short_reads_R1.fastq"
short_reads_r2 = "short_reads_R2.fastq"
long_reads_file = "long_reads.fastq"
hifi_reads_file = "hifi_reads.fastq"

Mycelia.write_fasta(outfile=reference_file, records=[reference_genome])

TODO: Write simulated reads to FASTQ files

Generate paired-end short reads
Generate single-end long reads
Generate single-end HiFi reads

Read Statistics and Quality Assessment

Analyze read characteristics before assembly

println("--- Read Analysis ---")

TODO: Implement read analysis

Read length distribution
Quality score distribution
Coverage estimation
Error rate assessment

Part 3: Multi-Platform Assembly Approaches

Comprehensive coverage of short read, long read, and hybrid assembly

println("\n=== Multi-Platform Assembly Approaches ===")

Short Read Assembly

MEGAHIT and metaSPAdes for short read data

println("--- Short Read Assembly ---")

Example parameters for short read assembly

short_read_params = Dict(
    "megahit_k_list" => "21,29,39,59,79,99,119,141",
    "metaspades_k_list" => "21,33,55,77",
    "min_contig_len" => 200,
    "threads" => 4
)

println("Short read assembly parameters:")
for (param, value) in short_read_params
    println("  $param: $value")
end

TODO: Implement short read assembly examples

Run MEGAHIT for metagenomic data
Run metaSPAdes for complex datasets
Compare assembly quality metrics
Evaluate computational requirements

Long Read Assembly

Flye, Canu, and hifiasm for long read data

println("--- Long Read Assembly ---")

Example parameters for long read assembly

long_read_params = Dict(
    "genome_size" => "5m",
    "flye_read_type" => "pacbio-hifi",
    "canu_read_type" => "pacbio",
    "hifiasm_mode" => "primary",
    "threads" => 4
)

println("Long read assembly parameters:")
for (param, value) in long_read_params
    println("  $param: $value")
end

TODO: Implement long read assembly examples

Run Flye for various read types
Run Canu with error correction
Run hifiasm for HiFi data
Compare assembly contiguity
Evaluate error rates

Hybrid Assembly

Unicycler combining short and long reads

println("--- Hybrid Assembly ---")

Example parameters for hybrid assembly

hybrid_params = Dict(
    "short_read_accuracy" => 0.99,
    "long_read_accuracy" => 0.90,
    "bridging_mode" => "conservative",
    "threads" => 4
)

println("Hybrid assembly parameters:")
for (param, value) in hybrid_params
    println("  $param: $value")
end

TODO: Implement hybrid assembly examples

Run Unicycler with paired data
Compare to short-read-only assemblies
Evaluate scaffolding improvements
Assess computational trade-offs

Mycelia's Probabilistic Assembly

String graph approach with Viterbi error correction

println("--- Mycelia's Probabilistic Assembly ---")

Example parameters for Mycelia assembly

mycelia_params = Dict(
    "k_range" => "21,31,41,51,61,71,81,91",
    "error_rate" => 0.01,
    "min_coverage" => 3,
    "iterative_polishing" => true,
    "verbosity" => "reads"
)

println("Mycelia assembly parameters:")
for (param, value) in mycelia_params
    println("  $param: $value")
end

TODO: Implement Mycelia assembly examples

Build string graph from reads
Apply Viterbi error correction
Perform iterative polishing
Compare to external assemblers

println("Assembly approaches comparison completed...")

Part 4: Assembly Quality Assessment

Comprehensive evaluation of assembly quality

println("\n=== Assembly Quality Assessment ===")

Basic Assembly Statistics

Calculate fundamental assembly metrics

println("--- Basic Statistics ---")

TODO: Implement assembly statistics

Contig count and sizes
N50, N90, L50, L90
Total assembly size
Largest contig size
GC content

Example with placeholder data

assembly_stats = Dict(
    "total_length" => 49500,
    "n_contigs" => 3,
    "n50" => 25000,
    "l50" => 1,
    "largest_contig" => 30000,
    "gc_content" => 0.45
)

println("Assembly Statistics:")
for (metric, value) in assembly_stats
    println("  $metric: $value")
end

Advanced Quality Metrics

More sophisticated quality assessment

println("--- Advanced Quality Metrics ---")

TODO: Implement advanced quality assessment

BUSCO completeness scores
Merqury QV scores
LAI (LTR Assembly Index)
Contiguity vs completeness trade-offs

Comparison with Reference

Validate assembly against known reference

println("--- Reference Comparison ---")

TODO: Implement reference comparison

Alignment-based comparison
Structural variation detection
Misassembly identification
Coverage uniformity assessment

Part 5: Assembly Polishing and Error Correction

Improve assembly accuracy through polishing

println("\n=== Assembly Polishing ===")

Consensus Polishing

Use original reads to polish assembly

println("--- Consensus Polishing ---")

TODO: Implement consensus polishing

Align reads to assembly
Identify consensus variants
Apply corrections
Iterate polishing rounds

Structural Error Correction

Fix larger structural errors

println("--- Structural Correction ---")

TODO: Implement structural correction

Identify structural variants
Validate with long reads
Correct misassemblies
Handle complex rearrangements

Part 6: Handling Assembly Challenges

Address common assembly difficulties

println("\n=== Assembly Challenges ===")

Repetitive Sequences

Strategies for handling repeats

println("--- Repetitive Sequences ---")

TODO: Implement repeat handling

Identify repetitive regions
Use read-spanning strategy
Implement gap filling
Validate repeat resolutions

Heterozygosity

Handle diploid and polyploid genomes

println("--- Heterozygosity ---")

TODO: Implement heterozygosity handling

Haplotype-aware assembly
Bubble detection and resolution
Diploid assembly validation
Phasing strategies

Contamination

Detect and remove contaminating sequences

println("--- Contamination Detection ---")

TODO: Implement contamination detection

Taxonomic classification
Coverage-based detection
Compositional analysis
Filtering strategies

Part 7: Assembly Visualization and Exploration

Create plots and visualizations for assembly analysis

println("\n=== Assembly Visualization ===")

Contig Size Distribution

Visualize assembly contiguity

println("--- Contig Visualization ---")

TODO: Implement assembly visualization

Contig size histograms
Cumulative length plots
N50 plots
Coverage vs length plots

Dot Plots

Compare assemblies or validate against reference

println("--- Dot Plot Analysis ---")

TODO: Implement dot plot visualization

Self-alignment plots
Reference comparison plots
Synteny visualization
Structural variant detection

Part 8: Assembly Benchmarking

Compare different assembly approaches

println("\n=== Assembly Benchmarking ===")

Multi-Assembler Comparison

Compare multiple assembly tools

println("--- Multi-Assembler Comparison ---")

TODO: Implement multi-assembler benchmarking

Run multiple assemblers
Compare quality metrics
Identify best-performing approaches
Consensus assembly generation

Parameter Optimization

Optimize assembly parameters

println("--- Parameter Optimization ---")

TODO: Implement parameter optimization

Grid search over parameter space
Quality-based optimization
Cross-validation approaches
Automated parameter tuning

Part 9: Best Practices and Recommendations

Guidelines for successful genome assembly

println("\n=== Best Practices ===")

println("Data Requirements:")
println("- HiFi: 20-30x coverage minimum")
println("- Read N50 > 10 kb preferred")
println("- Low contamination levels")
println("- Balanced coverage distribution")
println()
println("Assembly Strategy:")
println("- Start with hifiasm for HiFi data")
println("- Use haplotype-aware mode for diploids")
println("- Validate with multiple quality metrics")
println("- Polish with original reads")
println()
println("Quality Control:")
println("- Check BUSCO completeness (>90% for eukaryotes)")
println("- Validate N50 vs genome size expectations")
println("- Examine contig count (fewer is better)")
println("- Compare with related genomes")

Summary

println("\n=== Assembly Summary ===")
println("✓ Understanding assembly algorithms and their applications")
println("✓ Short read assembly with MEGAHIT and metaSPAdes")
println("✓ Long read assembly with Flye, Canu, and hifiasm")
println("✓ Hybrid assembly approaches with Unicycler")
println("✓ Mycelia's probabilistic assembly with string graphs and Viterbi error correction")
println("✓ Comprehensive quality assessment techniques")
println("✓ Assembly polishing and error correction")
println("✓ Handling repetitive sequences and heterozygosity")
println("✓ Visualization and benchmarking approaches")

Cleanup

cleanup_files = [reference_file, short_reads_r1, short_reads_r2, long_reads_file, hifi_reads_file]
for file in cleanup_files
    if isfile(file)
        rm(file, force=true)
    end
end

# Note: No assembly output directory was created in this tutorial
assembly_output = "assembly_output"  # Define for consistency, but directory doesn't exist
if isdir(assembly_output)
    rm(assembly_output, recursive=true, force=true)
end

println("\nNext: Tutorial 5 - Assembly Validation")

nothing