Tutorial 4: Genome Assembly
This tutorial covers comprehensive genome assembly approaches, including short read, long read, and hybrid assembly methods, with emphasis on Mycelia's third-party assembler orchestration and benchmarking utilities.
Learning Objectives
By the end of this tutorial, you will understand:
- Different assembly algorithms and their applications
- Short read assembly with MEGAHIT and metaSPAdes
- Long read assembly with Flye, Canu, and hifiasm
- Hybrid assembly approaches combining multiple data types (Unicycler)
- Assembly quality metrics and their interpretation
- Error correction and polishing techniques
- Handling repetitive sequences and structural variants
- Assembly validation and benchmarking approaches
NOTE: Mycelia Rhizomorph assembly has been moved to tutorial 13 (tutorials/13rhizomorphassembly.jl) and is kept as a skeleton while Rhizomorph unit tests stabilize.
Setup
From the Mycelia base directory, convert this tutorial to a notebook:
julia --project=. -e 'import Literate; Literate.notebook("tutorials/04_genome_assembly.jl", "tutorials/notebooks", execute=false)'import Pkg
if isinteractive()
Pkg.activate("..")
end
import Test
import Mycelia
import FASTX
import Random
import Statistics
import BioSequences
import Kmers
Random.seed!(42)
threads = min(Mycelia.get_default_threads(), 4)
assembly_output = "assembly_output"
mkpath(assembly_output)Part 1: Assembly Algorithm Overview
Understanding different assembly approaches helps choose the right method for your data type and research goals.
println("=== Genome Assembly Tutorial ===")Assembly Paradigms
Four main approaches to genome assembly:
- de Bruijn Graph - for short reads (MEGAHIT, metaSPAdes)
- Overlap-Layout-Consensus (OLC) - for long reads (Canu)
- String Graph - for long accurate reads (hifiasm, Flye)
- Probabilistic Assembly - Mycelia Rhizomorph (see tutorial 13)
println("Assembly Algorithm Comparison:")
println("de Bruijn Graph:")
println(" - Best for: Short reads (Illumina)")
println(" - Tools: MEGAHIT, metaSPAdes, SPAdes")
println(" - Strengths: Efficient, handles high coverage")
println(" - Weaknesses: Struggles with repeats, requires error correction")
println()
println("OLC (Overlap-Layout-Consensus):")
println(" - Best for: Long reads (PacBio, Nanopore)")
println(" - Tools: Canu, Miniasm")
println(" - Strengths: Handles repeats, intuitive approach")
println(" - Weaknesses: Computationally expensive, error-sensitive")
println()
println("String Graph:")
println(" - Best for: Long accurate reads (HiFi)")
println(" - Tools: hifiasm, Flye")
println(" - Strengths: Efficient, haplotype-aware, handles complexity")
println(" - Weaknesses: Requires high-quality reads")
println()
println("Probabilistic Assembly (Mycelia Rhizomorph):")
println(" - Best for: Any read type with error correction")
println(" - Tools: Rhizomorph string graph + Viterbi")
println(" - Strengths: Handles errors probabilistically, adaptable")
println(" - Weaknesses: Computationally intensive for large genomes")
println(" - See tutorial 13 for Rhizomorph assembly details")Part 2: Data Preparation for Assembly
Proper data preparation is crucial for successful assembly
println("\n=== Data Preparation ===")Simulating Multi-Platform Data
Create synthetic data for comprehensive assembly testing
println("--- Generating Test Data ---")Create a synthetic genome with known structure
reference_size = 50_000 ## 50 kb for demonstration
reference_genome = Mycelia.random_fasta_record(moltype=:DNA, seed=1, L=reference_size)
println("Reference genome: $(reference_size) bp")Simulate different read types
short_read_params = Dict(
"coverage" => 30,
"read_length" => 150,
"error_rate" => 0.001,
"description" => "Illumina short reads"
)
long_read_params = Dict(
"coverage" => 20,
"read_length" => 10_000,
"error_rate" => 0.05,
"description" => "Nanopore long reads"
)
hifi_params = Dict(
"coverage" => 15,
"read_length" => 15_000,
"error_rate" => 0.001,
"description" => "PacBio HiFi reads"
)
println("Simulating read types:")
for (name, params) in [("Short reads", short_read_params),
("Long reads", long_read_params),
("HiFi reads", hifi_params)]
println(" $(name): $(params["coverage"])x coverage, $(params["read_length"]) bp, $(params["error_rate"] * 100)% error")
endWrite test data
reference_file = joinpath(assembly_output, "reference_genome.fasta")
Mycelia.write_fasta(outfile=reference_file, records=[reference_genome])Generate paired-end short reads (Illumina)
short_reads_base = joinpath(assembly_output, "short_reads")
short_sim = Mycelia.simulate_illumina_reads(
fasta=reference_file,
coverage=short_read_params["coverage"],
outbase=short_reads_base,
read_length=short_read_params["read_length"],
rndSeed=42,
quiet=true
)
short_reads_r1 = short_sim.forward_reads
short_reads_r2 = short_sim.reverse_readsGenerate long reads for Flye/Canu (Nanopore)
long_reads_gz = Mycelia.simulate_nanopore_reads(
fasta=reference_file,
quantity="$(long_read_params["coverage"])x",
outfile=joinpath(assembly_output, "long_reads.fastq.gz"),
quiet=true
)
long_reads_file = joinpath(assembly_output, "long_reads.fastq")
if !isfile(long_reads_file)
run(pipeline(`gzip -dc $(long_reads_gz)`, long_reads_file))
endGenerate HiFi reads for hifiasm (PacBio HiFi)
hifi_reads_gz = Mycelia.simulate_pacbio_reads(
fasta=reference_file,
quantity="$(hifi_params["coverage"])x",
outfile=joinpath(assembly_output, "hifi_reads.fastq.gz"),
quiet=true
)
hifi_reads_file = joinpath(assembly_output, "hifi_reads.fastq")
if !isfile(hifi_reads_file)
run(pipeline(`gzip -dc $(hifi_reads_gz)`, hifi_reads_file))
endGenerate error-prone reads for polishing demonstrations
error_reads_gz = Mycelia.simulate_very_bad_reads(
fasta=reference_file,
quantity="10x",
outfile=joinpath(assembly_output, "error_reads.fastq.gz"),
quiet=true
)
error_reads_file = joinpath(assembly_output, "error_reads.fastq")
if !isfile(error_reads_file)
run(pipeline(`gzip -dc $(error_reads_gz)`, error_reads_file))
endRead Statistics and Quality Assessment
Analyze read characteristics before assembly
println("--- Read Analysis ---")
short_r1_stats = Mycelia.summarize_fastq("Short reads R1", short_reads_r1, reference_size)
short_r2_stats = Mycelia.summarize_fastq("Short reads R2", short_reads_r2, reference_size)
long_read_stats = Mycelia.summarize_fastq("Long reads", long_reads_file, reference_size)
hifi_stats = Mycelia.summarize_fastq("HiFi reads", hifi_reads_file, reference_size)
error_read_stats = Mycelia.summarize_fastq("Error-prone reads", error_reads_file, reference_size)Part 3: Multi-Platform Assembly Approaches
Comprehensive coverage of short read, long read, and hybrid assembly
println("\n=== Multi-Platform Assembly Approaches ===")Short Read Assembly
MEGAHIT and metaSPAdes for short read data
println("--- Short Read Assembly ---")Example parameters for short read assembly
short_read_assembly_params = Dict(
"megahit_k_list" => "21,29,39,59,79,99,119,141",
"metaspades_k_list" => "21,33,55,77",
"min_contig_len" => 200,
"threads" => threads
)
println("Short read assembly parameters:")
for (param, value) in short_read_assembly_params
println(" $param: $value")
end
short_read_runs = []
megahit_result, megahit_runtime = Mycelia.run_assembler("MEGAHIT") do
Mycelia.run_megahit(
fastq1=short_reads_r1,
fastq2=short_reads_r2,
outdir=joinpath(assembly_output, "megahit_short"),
min_contig_len=short_read_assembly_params["min_contig_len"],
k_list=short_read_assembly_params["megahit_k_list"],
threads=threads
)
end
if megahit_result !== nothing
push!(short_read_runs, (name="MEGAHIT", assembly=megahit_result.contigs, outdir=megahit_result.outdir, runtime=megahit_runtime, reads=short_reads_r1))
end
metaspades_result, metaspades_runtime = Mycelia.run_assembler("metaSPAdes") do
Mycelia.run_metaspades(
fastq1=short_reads_r1,
fastq2=short_reads_r2,
outdir=joinpath(assembly_output, "metaspades_short"),
k_list=short_read_assembly_params["metaspades_k_list"],
threads=threads
)
end
if metaspades_result !== nothing
push!(short_read_runs, (name="metaSPAdes", assembly=metaspades_result.contigs, outdir=metaspades_result.outdir, runtime=metaspades_runtime, reads=short_reads_r1))
end
println("Short read assembly metrics:")
for run in short_read_runs
metrics = Mycelia.assembly_metrics(run.assembly)
if metrics === nothing
println(" $(run.name): assembly file missing")
continue
end
output_size = Mycelia.dir_size(run.outdir)
println(" $(run.name): contigs=$(metrics.n_contigs), N50=$(metrics.n50), L50=$(metrics.l50), total=$(metrics.total_length) bp, output_size=$(round(output_size / 1e6, digits=2)) MB")
endLong Read Assembly
Flye, Canu, and hifiasm for long read data
println("--- Long Read Assembly ---")Example parameters for long read assembly
long_read_assembly_params = Dict(
"genome_size" => "50k",
"flye_read_type" => "nano-hq",
"canu_read_type" => "nanopore",
"hifiasm_mode" => "primary",
"threads" => threads
)
println("Long read assembly parameters:")
for (param, value) in long_read_assembly_params
println(" $param: $value")
end
long_read_runs = []
flye_result, flye_runtime = Mycelia.run_assembler("Flye") do
Mycelia.run_flye(
fastq=long_reads_file,
outdir=joinpath(assembly_output, "flye_long"),
genome_size=long_read_assembly_params["genome_size"],
read_type=long_read_assembly_params["flye_read_type"],
threads=threads
)
end
if flye_result !== nothing
push!(long_read_runs, (name="Flye", assembly=flye_result.assembly, outdir=flye_result.outdir, runtime=flye_runtime, reads=long_reads_file))
end
canu_result, canu_runtime = Mycelia.run_assembler("Canu") do
Mycelia.run_canu(
fastq=long_reads_file,
outdir=joinpath(assembly_output, "canu_long"),
genome_size=long_read_assembly_params["genome_size"],
read_type=long_read_assembly_params["canu_read_type"],
stopOnLowCoverage=5,
threads=threads
)
end
if canu_result !== nothing
push!(long_read_runs, (name="Canu", assembly=canu_result.assembly, outdir=canu_result.outdir, runtime=canu_runtime, reads=long_reads_file))
end
hifiasm_result, hifiasm_runtime = Mycelia.run_assembler("hifiasm") do
Mycelia.run_hifiasm(
fastq=hifi_reads_file,
outdir=joinpath(assembly_output, "hifiasm_long"),
bloom_filter=0,
threads=threads
)
end
hifiasm_contigs = Mycelia.hifiasm_primary_contigs(hifiasm_result)
if hifiasm_contigs !== nothing
push!(long_read_runs, (name="hifiasm", assembly=hifiasm_contigs, outdir=hifiasm_result.outdir, runtime=hifiasm_runtime, reads=hifi_reads_file))
else
if hifiasm_result !== nothing
println(" hifiasm produced graph outputs; primary contigs not found at expected .p_ctg.fa path")
end
end
println("Long read assembly contiguity:")
for run in long_read_runs
metrics = Mycelia.assembly_metrics(run.assembly)
if metrics === nothing
println(" $(run.name): assembly file missing")
continue
end
qv = Mycelia.compute_merqury_qv(run.assembly, run.reads)
println(" $(run.name): contigs=$(metrics.n_contigs), N50=$(metrics.n50), largest=$(metrics.largest_contig), QV=$(qv)")
endHybrid Assembly
Unicycler combining short and long reads
println("--- Hybrid Assembly ---")Example parameters for hybrid assembly
hybrid_params = Dict(
"short_read_accuracy" => 0.99,
"long_read_accuracy" => 0.90,
"bridging_mode" => "conservative",
"threads" => threads
)
println("Hybrid assembly parameters:")
for (param, value) in hybrid_params
println(" $param: $value")
end
hybrid_runs = []
unicycler_result, unicycler_runtime = Mycelia.run_assembler("Unicycler") do
Mycelia.run_unicycler(
short_1=short_reads_r1,
short_2=short_reads_r2,
long_reads=long_reads_file,
outdir=joinpath(assembly_output, "unicycler_hybrid"),
threads=threads
)
end
if unicycler_result !== nothing
push!(hybrid_runs, (name="Unicycler", assembly=unicycler_result.assembly, outdir=unicycler_result.outdir, runtime=unicycler_runtime, reads=short_reads_r1))
end
println("Hybrid assembly comparison:")
for run in hybrid_runs
metrics = Mycelia.assembly_metrics(run.assembly)
if metrics === nothing
println(" $(run.name): assembly file missing")
continue
end
output_size = Mycelia.dir_size(run.outdir)
println(" $(run.name): contigs=$(metrics.n_contigs), N50=$(metrics.n50), largest=$(metrics.largest_contig), output_size=$(round(output_size / 1e6, digits=2)) MB")
end
if !isempty(short_read_runs) && !isempty(hybrid_runs)
short_metrics = Mycelia.assembly_metrics(short_read_runs[1].assembly)
hybrid_metrics = Mycelia.assembly_metrics(hybrid_runs[1].assembly)
if short_metrics !== nothing && hybrid_metrics !== nothing
println(" Scaffolding improvement (N50): $(short_metrics.n50) -> $(hybrid_metrics.n50)")
end
end
println("Assembly approaches comparison completed...")
println("--- Mycelia Rhizomorph Assembly ---")
println("Rhizomorph assembly workflows are covered in tutorial 13 (skeleton) and are intentionally omitted here.")Part 4: Assembly Quality Assessment
Comprehensive evaluation of assembly quality
println("\n=== Assembly Quality Assessment ===")Basic Assembly Statistics
Calculate fundamental assembly metrics
println("--- Basic Statistics ---")
primary_assembly = Mycelia.select_first_existing([
isempty(hybrid_runs) ? nothing : hybrid_runs[1].assembly,
isempty(long_read_runs) ? nothing : long_read_runs[1].assembly,
isempty(short_read_runs) ? nothing : short_read_runs[1].assembly
])
assembly_stats = Mycelia.assembly_metrics(primary_assembly)
if assembly_stats === nothing
println("No assembly file available for basic statistics.")
else
println("Assembly Statistics:")
println(" total_length: $(assembly_stats.total_length)")
println(" n_contigs: $(assembly_stats.n_contigs)")
println(" n50: $(assembly_stats.n50)")
println(" n90: $(assembly_stats.n90)")
println(" l50: $(assembly_stats.l50)")
println(" l90: $(assembly_stats.l90)")
println(" largest_contig: $(assembly_stats.largest_contig)")
println(" gc_content: $(round(assembly_stats.gc_content, digits=3))")
endAdvanced Quality Metrics
More sophisticated quality assessment
println("--- Advanced Quality Metrics ---")
busco_outdir = nothing
if primary_assembly !== nothing
try
busco_outdir = Mycelia.run_busco(primary_assembly; threads=threads, outdir=joinpath(assembly_output, "busco_results"))
println("BUSCO completed: $(busco_outdir)")
catch e
@warn "BUSCO failed" exception=e
end
end
merqury_qv = Mycelia.compute_merqury_qv(primary_assembly, short_reads_r1)
println("Merqury-style QV (k=21): $(merqury_qv)")
println("LAI (LTR Assembly Index): not computed in this tutorial (no built-in wrapper yet)")
if assembly_stats !== nothing && busco_outdir !== nothing
println("Contiguity vs completeness: compare N50 with BUSCO report in $(busco_outdir)")
endComparison with Reference
Validate assembly against known reference
println("--- Reference Comparison ---")
mummer_outdir = nothing
alignment_summary = nothing
if primary_assembly !== nothing
try
mummer_outdir = Mycelia.run_mummer(reference_file, primary_assembly; outdir=joinpath(assembly_output, "mummer_compare"), prefix="assembly")
coords_file = joinpath(mummer_outdir, "assembly.coords")
alignments = Mycelia.parse_mummer_coords(coords_file)
if !isempty(alignments)
total_aligned = sum(a.len_ref for a in alignments)
weighted_identity = sum(a.len_ref * a.identity for a in alignments) / total_aligned
coverage = total_aligned / reference_size * 100
inversions = count(a -> a.start_query > a.end_query, alignments)
alignment_summary = (;weighted_identity, coverage, inversions, alignment_blocks=length(alignments))
println("Reference alignment: identity=$(round(weighted_identity, digits=2))%, coverage=$(round(coverage, digits=2))%, inversions=$(inversions)")
end
catch e
@warn "MUMmer comparison failed" exception=e
end
endPart 5: Assembly Polishing and Error Correction
Improve assembly accuracy through polishing
println("\n=== Assembly Polishing ===")Consensus Polishing
Use original reads to polish assembly
println("--- Consensus Polishing ---")
polish_target = Mycelia.select_first_existing([
isempty(long_read_runs) ? nothing : long_read_runs[1].assembly,
primary_assembly
])
apollo_result = nothing
if polish_target !== nothing
try
apollo_result = Mycelia.run_apollo(polish_target, long_reads_file; outdir=joinpath(assembly_output, "apollo_polish"))
println("Apollo polishing output: $(apollo_result.polished_assembly)")
catch e
@warn "Apollo polishing failed" exception=e
end
end
homopolish_result = nothing
if apollo_result !== nothing
try
homopolish_result = Mycelia.run_homopolish(apollo_result.polished_assembly, long_reads_file; outdir=joinpath(assembly_output, "homopolish"), threads=threads)
println("Homopolish output: $(homopolish_result.polished_assembly)")
catch e
@warn "Homopolish failed" exception=e
end
endStructural Error Correction
Fix larger structural errors
println("--- Structural Correction ---")
if alignment_summary !== nothing
println("Structural signals: alignment_blocks=$(alignment_summary.alignment_blocks), inversions=$(alignment_summary.inversions)")
if alignment_summary.coverage < 98
println(" Potential large gaps or structural variants detected (coverage < 98%)")
end
else
println("Structural correction requires reference alignments (see MUMmer comparison).")
endPart 6: Handling Assembly Challenges
Address common assembly difficulties
println("\n=== Assembly Challenges ===")Repetitive Sequences
Strategies for handling repeats
println("--- Repetitive Sequences ---")
if primary_assembly !== nothing
repeat_k = 9
kmer_counts = Mycelia.count_canonical_kmers(Kmers.DNAKmer{repeat_k}, primary_assembly)
top_kmers = sort(collect(kmer_counts), by=x -> x[2], rev=true)
println("Top repeated k-mers (k=$(repeat_k)):")
for (kmer, count) in Iterators.take(top_kmers, min(5, length(top_kmers)))
println(" $(kmer): $(count)")
end
else
println("Repeat analysis skipped (no assembly available)")
endHeterozygosity
Handle diploid and polyploid genomes
println("--- Heterozygosity ---")
hap1_seq = FASTX.sequence(reference_genome)
mut_rng = Random.MersenneTwister(123)
hap2_seq = Mycelia.mutate_dna_substitution_fraction(hap1_seq; fraction=0.01, rng=mut_rng)
heterozygous_sites = count(pair -> pair[1] != pair[2], zip(hap1_seq, hap2_seq))
heterozygosity_rate = heterozygous_sites / length(hap1_seq)
println("Simulated heterozygosity rate: $(round(heterozygosity_rate * 100, digits=2))%")
haplotype_fasta = joinpath(assembly_output, "diploid_haplotypes.fasta")
Mycelia.write_fasta(
outfile=haplotype_fasta,
records=[
FASTX.FASTA.Record("haplotype_1", hap1_seq),
FASTX.FASTA.Record("haplotype_2", hap2_seq)
]
)
println("Haplotype-aware assembly recommendation: use hifiasm outputs for phased contigs where available.")Contamination
Detect and remove contaminating sequences
println("--- Contamination Detection ---")Contamination screening options to consider:
- Mycelia.runcheckm / Mycelia.runcheckm2 for bacterial MAGs
- Mycelia.run_checkv for viral assemblies
- NCBI FCS-GX (https://github.com/ncbi/fcs, bioconda: ncbi-fcs-gx)
- Read-based taxonomic screens (e.g. Kraken2/Bracken) when wrappers are available
contaminant_candidates = Mycelia.contig_gc_outliers(primary_assembly)
if isempty(contaminant_candidates)
println("No GC outlier contigs detected (using 2 SD threshold).")
else
println("Potential contaminant contigs (GC outliers):")
for candidate in contaminant_candidates
println(" $(candidate.id): length=$(candidate.length), GC=$(round(candidate.gc, digits=3))")
end
end
fcs_tax_id = tryparse(Int, get(ENV, "MYCELIA_FCS_TAX_ID", ""))
if primary_assembly !== nothing && fcs_tax_id !== nothing
fcs_db_dir = get(ENV, "MYCELIA_FCS_DB_DIR", joinpath(homedir(), "workspace", ".ncbi-fcs-gx", "gxdb"))
fcs_manifest_url = get(ENV, "MYCELIA_FCS_MANIFEST_URL", "")
download_fcs_db = !isempty(fcs_manifest_url)GX DB (~470 GB) download options:
- Use s5cmd for the official S3 bucket: s3://ncbi-fcs-gx/gxdb/latest/all.*
- Or provide MYCELIAFCSMANIFESTURL to syncfiles.py via Mycelia.runncbifcs
println("Running FCS-GX screening (this requires a large GX database)...")
try
fcs_result = Mycelia.run_ncbi_fcs(
primary_assembly;
tax_id=fcs_tax_id,
outdir=joinpath(assembly_output, "fcs_gx"),
gx_db=fcs_db_dir,
manifest_url=download_fcs_db ? fcs_manifest_url : nothing,
download_db=download_fcs_db
)
println("FCS-GX report: $(fcs_result.report)")
if fcs_result.report !== nothing
clean_result = Mycelia.run_ncbi_fcs_clean_genome(
primary_assembly;
action_report=fcs_result.report,
outdir=joinpath(assembly_output, "fcs_gx_clean")
)
println("FCS-GX cleaned assembly: $(clean_result.cleaned_fasta)")
println("FCS-GX contamination FASTA: $(clean_result.contamination_fasta)")
end
catch e
@warn "FCS-GX screening failed" exception=e
end
else
println("Set MYCELIA_FCS_TAX_ID to run FCS-GX (optionally MYCELIA_FCS_DB_DIR and MYCELIA_FCS_MANIFEST_URL).")
endPart 7: Assembly Visualization and Exploration
Create plots and visualizations for assembly analysis
println("\n=== Assembly Visualization ===")Contig Size Distribution
Visualize assembly contiguity
println("--- Contig Visualization ---")
if primary_assembly !== nothing
contig_lengths = open(FASTX.FASTA.Reader, primary_assembly) do reader
[length(FASTX.sequence(record)) for record in reader]
end
sorted_lengths = sort(contig_lengths, rev=true)
cumulative_lengths = cumsum(sorted_lengths)
println("Top contig lengths: $(sorted_lengths[1:min(5, length(sorted_lengths))])")
println("Cumulative lengths (first 5): $(cumulative_lengths[1:min(5, length(cumulative_lengths))])")
println("Use Plots.jl to visualize these distributions as histograms or cumulative curves.")
else
println("No assembly available for visualization.")
endDot Plots
Compare assemblies or validate against reference
println("--- Dot Plot Analysis ---")
if mummer_outdir !== nothing
delta_file = joinpath(mummer_outdir, "assembly.delta")
try
dotplot_path = Mycelia.run_mummer_plot(delta_file; outdir=mummer_outdir, prefix="assembly_plot")
println("Dot plot generated: $(dotplot_path)")
catch e
@warn "Dot plot generation failed" exception=e
end
else
println("Dot plot requires MUMmer alignment output.")
endPart 8: Assembly Benchmarking
Compare different assembly approaches
println("\n=== Assembly Benchmarking ===")Multi-Assembler Comparison
Compare multiple assembly tools
println("--- Multi-Assembler Comparison ---")
all_runs = vcat(short_read_runs, long_read_runs, hybrid_runs)
benchmark_results = []
for run in all_runs
metrics = Mycelia.assembly_metrics(run.assembly)
if metrics === nothing
continue
end
qv = Mycelia.compute_merqury_qv(run.assembly, run.reads)
push!(benchmark_results, (name=run.name, n50=metrics.n50, total_length=metrics.total_length, runtime=run.runtime, qv=qv))
end
if isempty(benchmark_results)
println("No benchmark results available.")
else
sorted_results = sort(benchmark_results, by=x -> x.n50, rev=true)
println("Assembler ranking by N50:")
for result in sorted_results
println(" $(result.name): N50=$(result.n50), total=$(result.total_length), runtime=$(result.runtime), QV=$(result.qv)")
end
endParameter Optimization
Optimize assembly parameters
println("--- Parameter Optimization ---")
megahit_k_options = ["21,33,55", "21,29,39,59"]
megahit_grid_results = []
for k_list in megahit_k_options
tag = replace(k_list, "," => "_")
outdir = joinpath(assembly_output, "megahit_k_$(tag)")
result, runtime = Mycelia.run_assembler("MEGAHIT k=$(k_list)") do
Mycelia.run_megahit(
fastq1=short_reads_r1,
fastq2=short_reads_r2,
outdir=outdir,
min_contig_len=short_read_assembly_params["min_contig_len"],
k_list=k_list,
threads=threads
)
end
if result !== nothing
metrics = Mycelia.assembly_metrics(result.contigs)
if metrics !== nothing
push!(megahit_grid_results, (k_list=k_list, n50=metrics.n50, runtime=runtime))
end
end
end
if isempty(megahit_grid_results)
println("Parameter optimization skipped (no MEGAHIT results).")
else
best = sort(megahit_grid_results, by=x -> x.n50, rev=true)[1]
println("Best MEGAHIT k-list by N50: $(best.k_list) (N50=$(best.n50))")
endPart 9: Best Practices and Recommendations
Guidelines for successful genome assembly
println("\n=== Best Practices ===")
println("Data Requirements:")
println("- HiFi: 20-30x coverage minimum")
println("- Read N50 > 10 kb preferred")
println("- Low contamination levels")
println("- Balanced coverage distribution")
println()
println("Assembly Strategy:")
println("- Start with hifiasm for HiFi data")
println("- Use haplotype-aware mode for diploids")
println("- Validate with multiple quality metrics")
println("- Polish with original reads")
println()
println("Quality Control:")
println("- Check BUSCO completeness (>90% for eukaryotes)")
println("- Validate N50 vs genome size expectations")
println("- Examine contig count (fewer is better)")
println("- Compare with related genomes")Summary
println("\n=== Assembly Summary ===")
println("✓ Understanding assembly algorithms and their applications")
println("✓ Short read assembly with MEGAHIT and metaSPAdes")
println("✓ Long read assembly with Flye, Canu, and hifiasm")
println("✓ Hybrid assembly approaches with Unicycler")
println("✓ Comprehensive quality assessment techniques")
println("✓ Assembly polishing and error correction")
println("✓ Handling repetitive sequences and heterozygosity")
println("✓ Visualization and benchmarking approaches")
println("✓ Rhizomorph assembly workflows moved to tutorial 13")Cleanup
if isdir(assembly_output)
rm(assembly_output, recursive=true, force=true)
end
println("\nNext: Tutorial 5 - Assembly Validation")
nothing