Tutorial 23: Variant Calling Workflow
This tutorial demonstrates a compact end-to-end small-variant workflow in Mycelia using:
minimap2for read alignmentGATK HaplotypeCallerfor short-read variant callingClair3for an optional deep-learning caller walkthroughRTG vcfevalfor optional truth-set benchmarking
The tutorial builds a tiny synthetic truth set so the control flow is easy to inspect and adapt for real projects.
Learning Objectives
By the end of this tutorial, you will understand how to:
- create a toy truth VCF from a reference sequence
- simulate a perturbed haplotype and align reads back to the original reference
- call variants with
Mycelia.run_gatk_haplotypecaller - normalize and benchmark a callset with
normalize_vcfandevaluate_variant_calling_accuracy - run the corresponding Clair3 workflow for Illumina-style data
- update a reference FASTA with a VCF consensus
Setup
From the Mycelia base directory, convert this tutorial to a notebook:
julia --project=. -e 'import Literate; Literate.notebook("tutorials/23_variant_calling_workflow.jl", "tutorials/notebooks", execute=false)'To execute the external-tool sections, opt in with:
MYCELIA_RUN_EXTERNAL=true julia --project=. tutorials/23_variant_calling_workflow.jlTo also run RTG vcfeval benchmarking:
MYCELIA_RUN_EXTERNAL=true MYCELIA_RUN_VCFEVAL_TUTORIAL=true julia --project=. tutorials/23_variant_calling_workflow.jlTo also run Clair3:
MYCELIA_RUN_EXTERNAL=true MYCELIA_RUN_CLAIR3_TUTORIAL=true julia --project=. tutorials/23_variant_calling_workflow.jlimport Pkg
if isinteractive()
Pkg.activate("..")
end
import DataFrames
import Mycelia
import StableRNGs
import Test
const RUN_EXTERNAL = lowercase(get(ENV, "MYCELIA_RUN_EXTERNAL", "false")) == "true"
const RUN_CLAIR3 = RUN_EXTERNAL &&
lowercase(get(ENV, "MYCELIA_RUN_CLAIR3_TUTORIAL", "false")) == "true"
const RUN_VCFEVAL = RUN_EXTERNAL &&
lowercase(get(ENV, "MYCELIA_RUN_VCFEVAL_TUTORIAL", "false")) == "true"
println("=== Variant Calling Workflow Tutorial ===")
println("External wrappers enabled: ", RUN_EXTERNAL)
println("Clair3 demo enabled: ", RUN_CLAIR3)
println("RTG vcfeval enabled: ", RUN_VCFEVAL)Part 1: Create a tiny synthetic truth set
We start with a random reference, simulate a handful of variants, and write them as a VCF. This part is pure Julia and can run without external tools.
rng = StableRNGs.StableRNG(42)
workdir = mktempdir()
println("Workspace: ", workdir)
reference_record = Mycelia.random_fasta_record(
moltype = :DNA,
seed = rand(rng, 0:typemax(Int)),
L = 20_000
)
reference_fasta = joinpath(workdir, "reference.fa")
Mycelia.write_fasta(outfile = reference_fasta, records = [reference_record])
truth_vcf = reference_fasta * ".vcf"
truth_table = Mycelia.simulate_variants(reference_record; n_variants = 12)
Mycelia.write_vcf_table(vcf_file = truth_vcf, vcf_table = truth_table, fasta_file = reference_fasta)
Test.@test isfile(reference_fasta)
Test.@test isfile(truth_vcf)
Test.@test DataFrames.nrow(truth_table) > 0
println("Reference FASTA: ", reference_fasta)
println("Truth VCF: ", truth_vcf)
println("Simulated truth variants: ", DataFrames.nrow(truth_table))Part 2: Materialize a mutated haplotype FASTA
update_fasta_with_vcf applies the truth VCF to the original reference so we can simulate reads from the altered haplotype and map them back to the unmodified reference.
mutant_fasta = joinpath(workdir, "truth_haplotype.fa")
if RUN_EXTERNAL
Mycelia.add_bioconda_env("samtools")
run(`$(Mycelia.CONDA_RUNNER) run --live-stream -n samtools samtools faidx $(reference_fasta)`)
truth_vcf_copy = joinpath(workdir, "truth_for_consensus.vcf")
cp(truth_vcf, truth_vcf_copy; force = true)
Mycelia.update_fasta_with_vcf(
in_fasta = reference_fasta,
vcf_file = truth_vcf_copy,
out_fasta = mutant_fasta
)
Test.@test isfile(mutant_fasta)
Test.@test isfile(truth_vcf)
println("Mutated haplotype FASTA: ", mutant_fasta)
else
println("Skipping consensus FASTA materialization.")
println("Set MYCELIA_RUN_EXTERNAL=true to run bcftools/minimap2/GATK steps.")
println("Add MYCELIA_RUN_VCFEVAL_TUTORIAL=true or MYCELIA_RUN_CLAIR3_TUTORIAL=true for optional sections.")
endPart 3: Simulate paired-end reads and align them with minimap2
The same BAM can feed multiple callers. This is usually the first point where real projects branch into short-read versus long-read pipelines.
bam_file = ""
if RUN_EXTERNAL
reads = Mycelia.simulate_illumina_reads(
fasta = mutant_fasta,
coverage = 40,
read_length = 150,
mflen = 350,
sdev = 25,
rndSeed = rand(rng, 0:typemax(Int)),
quiet = true
)
map_result = Mycelia.minimap_map_paired_end(
fasta = reference_fasta,
forward = reads.forward_reads,
reverse = reads.reverse_reads,
threads = 2,
outdir = joinpath(workdir, "alignment")
)
run(map_result.cmd)
bam_file = map_result.outfile
Test.@test isfile(bam_file)
println("Forward reads: ", reads.forward_reads)
println("Reverse reads: ", reads.reverse_reads)
println("Aligned BAM: ", bam_file)
else
println("Alignment section skipped because external execution is disabled.")
endPart 4: Call variants with GATK HaplotypeCaller
Mycelia handles environment setup, BAM indexing, and GATK sequence dictionary generation. We add a FASTA index up front because downstream tools such as bcftools norm also rely on it.
gatk_vcf = joinpath(workdir, "gatk", "variants.gatk.vcf")
normalized_gatk_vcf = replace(gatk_vcf, ".vcf" => ".sorted.normalized.vcf.gz")
gatk_consensus_fasta = joinpath(workdir, "gatk", "consensus.fa")
if RUN_EXTERNAL
Mycelia.run_gatk_haplotypecaller(
bam_file,
reference_fasta,
gatk_vcf;
ploidy = 1,
memory_gb = 4,
threads = 2
)
Test.@test isfile(gatk_vcf)
normalized_gatk_vcf = Mycelia.normalize_vcf(
reference_fasta = reference_fasta,
vcf_file = gatk_vcf
)
Test.@test isfile(normalized_gatk_vcf)
gatk_vcf_copy = joinpath(workdir, "gatk_for_consensus.vcf")
cp(gatk_vcf, gatk_vcf_copy; force = true)
Mycelia.update_fasta_with_vcf(
in_fasta = reference_fasta,
vcf_file = gatk_vcf_copy,
out_fasta = gatk_consensus_fasta
)
Test.@test isfile(gatk_consensus_fasta)
Test.@test isfile(gatk_vcf)
println("GATK VCF: ", gatk_vcf)
println("Normalized GATK VCF: ", normalized_gatk_vcf)
println("GATK consensus FASTA: ", gatk_consensus_fasta)
else
println("GATK execution skipped.")
endPart 5: Benchmark against the truth set with RTG vcfeval
evaluate_variant_calling_accuracy wraps the full vcfeval flow and returns parsed ROC tables plus summary metrics.
if RUN_VCFEVAL
gatk_eval = Mycelia.evaluate_variant_calling_accuracy(
truth_vcf,
normalized_gatk_vcf,
reference_fasta,
joinpath(workdir, "gatk", "vcfeval");
generate_plots = false,
threads = 2,
memory_gb = 4
)
println("GATK evaluation summary:")
println(gatk_eval.summary_stats)
else
println("vcfeval section skipped.")
endPart 6: Optional Clair3 workflow
Clair3 is often the first long-read-oriented or hybrid benchmarking target users want beside GATK. The wrapper below uses the Illumina model because this tutorial simulates paired-end short reads.
clair3_output_dir = joinpath(workdir, "clair3")
if RUN_CLAIR3
clair3_vcf = Mycelia.run_clair3(
bam_file,
reference_fasta,
clair3_output_dir;
platform = "ilmn",
threads = 2,
haploid_precise = true
)
Test.@test isfile(clair3_vcf)
normalized_clair3_vcf = Mycelia.normalize_vcf(
reference_fasta = reference_fasta,
vcf_file = clair3_vcf
)
Test.@test isfile(normalized_clair3_vcf)
println("Clair3 VCF: ", clair3_vcf)
println("Normalized Clair3 VCF: ", normalized_clair3_vcf)
if RUN_VCFEVAL
clair3_eval = Mycelia.evaluate_variant_calling_accuracy(
truth_vcf,
normalized_clair3_vcf,
reference_fasta,
joinpath(clair3_output_dir, "vcfeval");
generate_plots = false,
threads = 2,
memory_gb = 4
)
println("Clair3 evaluation summary:")
println(clair3_eval.summary_stats)
end
else
println("Clair3 execution skipped.")
println("Enable it with MYCELIA_RUN_CLAIR3_TUTORIAL=true after verifying the Clair3 model environment.")
endPart 7: Alternative callers
The same alignment can also feed the other wrappers in variant-analysis.jl.
freebayes_vcf = Mycelia.run_freebayes(
bam_file,
reference_fasta,
joinpath(workdir, "freebayes", "variants.freebayes.vcf");
ploidy = 1
)
bcftools_vcf = Mycelia.run_bcftools_call(
bam_file,
reference_fasta,
joinpath(workdir, "bcftools", "variants.bcftools.vcf");
threads = 2
)For side-by-side benchmarking, use:
comparison = Mycelia.run_variant_calling_comparison(
bam_file,
reference_fasta,
joinpath(workdir, "comparison");
callers = ["gatk", "clair3", "bcftools"],
platform = "ilmn",
baseline_vcf = truth_vcf
)Citation Guidance
If you publish results from this workflow, cite Mycelia and the wrapped tools you used, especially minimap2, GATK, Clair3, BCFtools/HTSlib, and RTG vcfeval. The project-wide citation checklist lives in docs/src/references.md.
println("Variant calling tutorial complete. Workspace: ", workdir)