Tutorial 4: Genome Assembly

This tutorial covers comprehensive genome assembly approaches, including short read, long read, and hybrid assembly methods, with emphasis on Mycelia's probabilistic assembly and benchmarking against state-of-the-art tools.

Learning Objectives

By the end of this tutorial, you will understand:

  • Different assembly algorithms and their applications
  • Short read assembly with MEGAHIT and metaSPAdes
  • Long read assembly with Flye, Canu, and hifiasm
  • Hybrid assembly approaches combining multiple data types
  • Mycelia's probabilistic assembly using string graphs and Viterbi error correction
  • Assembly quality metrics and their interpretation
  • Error correction and polishing techniques
  • Handling repetitive sequences and structural variants
  • Assembly validation and benchmarking approaches

Setup

import Pkg
if isinteractive()
    Pkg.activate("..")
end

import Test
import Mycelia
import FASTX
import Random
import Statistics

Random.seed!(42)

Part 1: Assembly Algorithm Overview

Understanding different assembly approaches helps choose the right method for your data type and research goals.

println("=== Genome Assembly Tutorial ===")

Assembly Paradigms

Four main approaches to genome assembly:

  1. de Bruijn Graph - for short reads (MEGAHIT, metaSPAdes)
  2. Overlap-Layout-Consensus (OLC) - for long reads (Canu)
  3. String Graph - for long accurate reads (hifiasm, Flye)
  4. Probabilistic Assembly - Mycelia's approach using string graphs with Viterbi error correction
println("Assembly Algorithm Comparison:")
println("de Bruijn Graph:")
println("  - Best for: Short reads (Illumina)")
println("  - Tools: MEGAHIT, metaSPAdes, SPAdes")
println("  - Strengths: Efficient, handles high coverage")
println("  - Weaknesses: Struggles with repeats, requires error correction")
println()
println("OLC (Overlap-Layout-Consensus):")
println("  - Best for: Long reads (PacBio, Nanopore)")
println("  - Tools: Canu, Miniasm")
println("  - Strengths: Handles repeats, intuitive approach")
println("  - Weaknesses: Computationally expensive, error-sensitive")
println()
println("String Graph:")
println("  - Best for: Long accurate reads (HiFi)")
println("  - Tools: hifiasm, Flye")
println("  - Strengths: Efficient, haplotype-aware, handles complexity")
println("  - Weaknesses: Requires high-quality reads")
println()
println("Probabilistic Assembly (Mycelia):")
println("  - Best for: Any read type with error correction")
println("  - Tools: Mycelia's string graph + Viterbi")
println("  - Strengths: Handles errors probabilistically, adaptable")
println("  - Weaknesses: Computationally intensive for large genomes")

Part 2: Data Preparation for Assembly

Proper data preparation is crucial for successful assembly

println("\n=== Data Preparation ===")

Simulating Multi-Platform Data

Create synthetic data for comprehensive assembly testing

println("--- Generating Test Data ---")

Create a synthetic genome with known structure

genome_size = 50000  ## 50 kb for demonstration
reference_genome = Mycelia.random_fasta_record(moltype=:DNA, seed=1, L=genome_size)

println("Reference genome: $(genome_size) bp")

Simulate different read types

short_read_params = Dict(
    "coverage" => 30,
    "read_length" => 150,
    "error_rate" => 0.001,
    "description" => "Illumina short reads"
)

long_read_params = Dict(
    "coverage" => 20,
    "read_length" => 10000,
    "error_rate" => 0.05,
    "description" => "Nanopore long reads"
)

hifi_params = Dict(
    "coverage" => 15,
    "read_length" => 15000,
    "error_rate" => 0.001,
    "description" => "PacBio HiFi reads"
)

TODO: Implement multi-platform read simulation

  • Generate short reads for MEGAHIT/metaSPAdes
  • Generate long reads for Flye/Canu
  • Generate HiFi reads for hifiasm
  • Create hybrid datasets for Unicycler
  • Generate error-prone reads for Mycelia polishing
println("Simulating read types:")
for (name, params) in [("Short reads", short_read_params),
                      ("Long reads", long_read_params),
                      ("HiFi reads", hifi_params)]
    println("  $(name): $(params["coverage"])x coverage, $(params["read_length"]) bp, $(params["error_rate"]*100)% error")
end

Write test data

reference_file = "reference_genome.fasta"
short_reads_r1 = "short_reads_R1.fastq"
short_reads_r2 = "short_reads_R2.fastq"
long_reads_file = "long_reads.fastq"
hifi_reads_file = "hifi_reads.fastq"

Mycelia.write_fasta(outfile=reference_file, records=[reference_genome])

TODO: Write simulated reads to FASTQ files

  • Generate paired-end short reads
  • Generate single-end long reads
  • Generate single-end HiFi reads

Read Statistics and Quality Assessment

Analyze read characteristics before assembly

println("--- Read Analysis ---")

TODO: Implement read analysis

  • Read length distribution
  • Quality score distribution
  • Coverage estimation
  • Error rate assessment

Part 3: Multi-Platform Assembly Approaches

Comprehensive coverage of short read, long read, and hybrid assembly

println("\n=== Multi-Platform Assembly Approaches ===")

Short Read Assembly

MEGAHIT and metaSPAdes for short read data

println("--- Short Read Assembly ---")

Example parameters for short read assembly

short_read_params = Dict(
    "megahit_k_list" => "21,29,39,59,79,99,119,141",
    "metaspades_k_list" => "21,33,55,77",
    "min_contig_len" => 200,
    "threads" => 4
)

println("Short read assembly parameters:")
for (param, value) in short_read_params
    println("  $param: $value")
end

TODO: Implement short read assembly examples

  • Run MEGAHIT for metagenomic data
  • Run metaSPAdes for complex datasets
  • Compare assembly quality metrics
  • Evaluate computational requirements

Long Read Assembly

Flye, Canu, and hifiasm for long read data

println("--- Long Read Assembly ---")

Example parameters for long read assembly

long_read_params = Dict(
    "genome_size" => "5m",
    "flye_read_type" => "pacbio-hifi",
    "canu_read_type" => "pacbio",
    "hifiasm_mode" => "primary",
    "threads" => 4
)

println("Long read assembly parameters:")
for (param, value) in long_read_params
    println("  $param: $value")
end

TODO: Implement long read assembly examples

  • Run Flye for various read types
  • Run Canu with error correction
  • Run hifiasm for HiFi data
  • Compare assembly contiguity
  • Evaluate error rates

Hybrid Assembly

Unicycler combining short and long reads

println("--- Hybrid Assembly ---")

Example parameters for hybrid assembly

hybrid_params = Dict(
    "short_read_accuracy" => 0.99,
    "long_read_accuracy" => 0.90,
    "bridging_mode" => "conservative",
    "threads" => 4
)

println("Hybrid assembly parameters:")
for (param, value) in hybrid_params
    println("  $param: $value")
end

TODO: Implement hybrid assembly examples

  • Run Unicycler with paired data
  • Compare to short-read-only assemblies
  • Evaluate scaffolding improvements
  • Assess computational trade-offs

Mycelia's Probabilistic Assembly

String graph approach with Viterbi error correction

println("--- Mycelia's Probabilistic Assembly ---")

Example parameters for Mycelia assembly

mycelia_params = Dict(
    "k_range" => "21,31,41,51,61,71,81,91",
    "error_rate" => 0.01,
    "min_coverage" => 3,
    "iterative_polishing" => true,
    "verbosity" => "reads"
)

println("Mycelia assembly parameters:")
for (param, value) in mycelia_params
    println("  $param: $value")
end

TODO: Implement Mycelia assembly examples

  • Build string graph from reads
  • Apply Viterbi error correction
  • Perform iterative polishing
  • Compare to external assemblers
println("Assembly approaches comparison completed...")

Part 4: Assembly Quality Assessment

Comprehensive evaluation of assembly quality

println("\n=== Assembly Quality Assessment ===")

Basic Assembly Statistics

Calculate fundamental assembly metrics

println("--- Basic Statistics ---")

TODO: Implement assembly statistics

  • Contig count and sizes
  • N50, N90, L50, L90
  • Total assembly size
  • Largest contig size
  • GC content

Example with placeholder data

assembly_stats = Dict(
    "total_length" => 49500,
    "n_contigs" => 3,
    "n50" => 25000,
    "l50" => 1,
    "largest_contig" => 30000,
    "gc_content" => 0.45
)

println("Assembly Statistics:")
for (metric, value) in assembly_stats
    println("  $metric: $value")
end

Advanced Quality Metrics

More sophisticated quality assessment

println("--- Advanced Quality Metrics ---")

TODO: Implement advanced quality assessment

  • BUSCO completeness scores
  • Merqury QV scores
  • LAI (LTR Assembly Index)
  • Contiguity vs completeness trade-offs

Comparison with Reference

Validate assembly against known reference

println("--- Reference Comparison ---")

TODO: Implement reference comparison

  • Alignment-based comparison
  • Structural variation detection
  • Misassembly identification
  • Coverage uniformity assessment

Part 5: Assembly Polishing and Error Correction

Improve assembly accuracy through polishing

println("\n=== Assembly Polishing ===")

Consensus Polishing

Use original reads to polish assembly

println("--- Consensus Polishing ---")

TODO: Implement consensus polishing

  • Align reads to assembly
  • Identify consensus variants
  • Apply corrections
  • Iterate polishing rounds

Structural Error Correction

Fix larger structural errors

println("--- Structural Correction ---")

TODO: Implement structural correction

  • Identify structural variants
  • Validate with long reads
  • Correct misassemblies
  • Handle complex rearrangements

Part 6: Handling Assembly Challenges

Address common assembly difficulties

println("\n=== Assembly Challenges ===")

Repetitive Sequences

Strategies for handling repeats

println("--- Repetitive Sequences ---")

TODO: Implement repeat handling

  • Identify repetitive regions
  • Use read-spanning strategy
  • Implement gap filling
  • Validate repeat resolutions

Heterozygosity

Handle diploid and polyploid genomes

println("--- Heterozygosity ---")

TODO: Implement heterozygosity handling

  • Haplotype-aware assembly
  • Bubble detection and resolution
  • Diploid assembly validation
  • Phasing strategies

Contamination

Detect and remove contaminating sequences

println("--- Contamination Detection ---")

TODO: Implement contamination detection

  • Taxonomic classification
  • Coverage-based detection
  • Compositional analysis
  • Filtering strategies

Part 7: Assembly Visualization and Exploration

Create plots and visualizations for assembly analysis

println("\n=== Assembly Visualization ===")

Contig Size Distribution

Visualize assembly contiguity

println("--- Contig Visualization ---")

TODO: Implement assembly visualization

  • Contig size histograms
  • Cumulative length plots
  • N50 plots
  • Coverage vs length plots

Dot Plots

Compare assemblies or validate against reference

println("--- Dot Plot Analysis ---")

TODO: Implement dot plot visualization

  • Self-alignment plots
  • Reference comparison plots
  • Synteny visualization
  • Structural variant detection

Part 8: Assembly Benchmarking

Compare different assembly approaches

println("\n=== Assembly Benchmarking ===")

Multi-Assembler Comparison

Compare multiple assembly tools

println("--- Multi-Assembler Comparison ---")

TODO: Implement multi-assembler benchmarking

  • Run multiple assemblers
  • Compare quality metrics
  • Identify best-performing approaches
  • Consensus assembly generation

Parameter Optimization

Optimize assembly parameters

println("--- Parameter Optimization ---")

TODO: Implement parameter optimization

  • Grid search over parameter space
  • Quality-based optimization
  • Cross-validation approaches
  • Automated parameter tuning

Part 9: Best Practices and Recommendations

Guidelines for successful genome assembly

println("\n=== Best Practices ===")

println("Data Requirements:")
println("- HiFi: 20-30x coverage minimum")
println("- Read N50 > 10 kb preferred")
println("- Low contamination levels")
println("- Balanced coverage distribution")
println()
println("Assembly Strategy:")
println("- Start with hifiasm for HiFi data")
println("- Use haplotype-aware mode for diploids")
println("- Validate with multiple quality metrics")
println("- Polish with original reads")
println()
println("Quality Control:")
println("- Check BUSCO completeness (>90% for eukaryotes)")
println("- Validate N50 vs genome size expectations")
println("- Examine contig count (fewer is better)")
println("- Compare with related genomes")

Summary

println("\n=== Assembly Summary ===")
println("✓ Understanding assembly algorithms and their applications")
println("✓ Short read assembly with MEGAHIT and metaSPAdes")
println("✓ Long read assembly with Flye, Canu, and hifiasm")
println("✓ Hybrid assembly approaches with Unicycler")
println("✓ Mycelia's probabilistic assembly with string graphs and Viterbi error correction")
println("✓ Comprehensive quality assessment techniques")
println("✓ Assembly polishing and error correction")
println("✓ Handling repetitive sequences and heterozygosity")
println("✓ Visualization and benchmarking approaches")

Cleanup

cleanup_files = [reference_file, short_reads_r1, short_reads_r2, long_reads_file, hifi_reads_file]
for file in cleanup_files
    if isfile(file)
        rm(file, force=true)
    end
end

# Note: No assembly output directory was created in this tutorial
assembly_output = "assembly_output"  # Define for consistency, but directory doesn't exist
if isdir(assembly_output)
    rm(assembly_output, recursive=true, force=true)
end

println("\nNext: Tutorial 5 - Assembly Validation")

nothing