Tutorial 2: Quality Control and Preprocessing
This tutorial demonstrates how to assess and improve the quality of genomic data before analysis. Quality control is essential for ensuring reliable downstream results.
Learning Objectives
By the end of this tutorial, you will understand:
- How to assess sequencing data quality using multiple read-level metrics
- Common quality issues and their biological implications
- Preprocessing techniques for improving read quality
- How to use Mycelia's read-level QC tooling
- Best practices for quality control in different sequencing data types
Setup
From the Mycelia base directory, convert this tutorial to a notebook:
julia --project=. -e 'import Literate; Literate.notebook("tutorials/02_quality_control.jl", "tutorials/notebooks", execute=false)'import Pkg
if isinteractive()
Pkg.activate("..")
end
import Mycelia
import FASTX
import Statistics
import Random
import Plots
Random.seed!(42)Part 1: Understanding Quality Metrics
Quality assessment involves multiple metrics that capture different aspects of read quality. Understanding these metrics helps identify problems and guide preprocessing decisions.
println("=== Quality Control Tutorial ===")Phred Quality Scores
Phred scores represent the probability of base-calling errors:
- Q10 = 10% error rate (1 in 10 bases wrong)
- Q20 = 1% error rate (1 in 100 bases wrong)
- Q30 = 0.1% error rate (1 in 1000 bases wrong)
- Q40 = 0.01% error rate (1 in 10,000 bases wrong)
println("Quality Score Interpretation:")
println("Q10: 10% error rate (poor quality)")
println("Q20: 1% error rate (acceptable)")
println("Q30: 0.1% error rate (good quality)")
println("Q40: 0.01% error rate (excellent quality)")Part 2: Example Read Data
Generate small, self-contained FASTQ datasets using Mycelia utilities.
println("\n=== Example Read Data ===")
output_dir = mkpath(joinpath(@__DIR__, "..", "results", "qc_tutorial"))
short_reference = Mycelia.random_fasta_record(moltype=:DNA, seed=1, L=150)
short_reads = Mycelia.create_test_reads(FASTX.sequence(short_reference), 200, 0.01)
short_fastq = joinpath(output_dir, "short_reads.fastq")
Mycelia.write_fastq(records=short_reads, filename=short_fastq)
forward_reads = Mycelia.create_test_reads(FASTX.sequence(short_reference), 200, 0.01)
reverse_reads = Mycelia.create_test_reads(FASTX.sequence(short_reference), 200, 0.01)
forward_fastq = joinpath(output_dir, "short_reads_R1.fastq")
reverse_fastq = joinpath(output_dir, "short_reads_R2.fastq")
Mycelia.write_fastq(records=forward_reads, filename=forward_fastq)
Mycelia.write_fastq(records=reverse_reads, filename=reverse_fastq)
long_reference = Mycelia.random_fasta_record(moltype=:DNA, seed=2, L=4000)
long_reads = Mycelia.create_test_reads(FASTX.sequence(long_reference), 30, 0.08)
long_fastq = joinpath(output_dir, "long_reads.fastq")
Mycelia.write_fastq(records=long_reads, filename=long_fastq)
println("Wrote example reads to $(output_dir)")
println(" Short reads: $(short_fastq)")
println(" Paired-end reads: $(forward_fastq), $(reverse_fastq)")
println(" Long reads: $(long_fastq)")Part 3: Read-Level Quality Assessment
Use Mycelia's QC utilities to summarize read quality.
println("\n=== Read-Level Quality Assessment ===")
short_stats = Mycelia.analyze_fastq_quality(short_fastq)
println("Short-read summary:")
println(" Reads: $(short_stats.n_reads)")
println(" Mean quality: $(round(short_stats.mean_quality, digits=2))")
println(" Mean length: $(round(short_stats.mean_length, digits=1))")
println(" GC content: $(round(short_stats.gc_content, digits=1))%")
println(" Q30+ reads: $(round(short_stats.quality_distribution.q30_percent, digits=1))%")
per_read_quals = Mycelia.per_read_quality_scores(short_reads)
gc_contents = Mycelia.gc_content_per_read(short_reads)
read_lengths = Mycelia.read_length_distribution(short_reads)
dup_stats = Mycelia.duplication_stats(short_reads; min_fraction=0.05)
println("Read-level distribution metrics:")
println(" Mean per-read quality: $(round(Statistics.mean(per_read_quals), digits=2))")
println(" Mean GC: $(round(Statistics.mean(gc_contents) * 100, digits=1))%")
println(" Mean read length: $(round(Statistics.mean(read_lengths), digits=1)) bp")
println(" Duplicate fraction: $(round(dup_stats.duplicate_fraction * 100, digits=1))%")
println(" Overrepresented sequences: $(length(dup_stats.overrepresented))")Part 4: Automated Read-Level QC and Filtering
Mycelia provides wrappers around standard read QC tools.
println("\n=== Automated Read-Level QC ===")
println("Mycelia will auto-install missing tools via conda when you run these wrappers.")
fastqc_dir = Mycelia.run_fastqc(fastq=short_fastq)
println("FastQC output: $(fastqc_dir)")
fastp_outputs = Mycelia.qc_filter_short_reads_fastp(
forward_reads=forward_fastq,
reverse_reads=reverse_fastq
)
println("fastp outputs:")
println(" Filtered R1: $(fastp_outputs.out_forward)")
println(" Filtered R2: $(fastp_outputs.out_reverse)")
println(" JSON report: $(fastp_outputs.json)")
println(" HTML report: $(fastp_outputs.html)")
trim_outputs = Mycelia.trim_galore_paired(
forward_reads=forward_fastq,
reverse_reads=reverse_fastq,
outdir=output_dir
)
println("trim_galore outputs:")
println(" Trimmed R1: $(trim_outputs.trimmed_forward)")
println(" Trimmed R2: $(trim_outputs.trimmed_reverse)")
fastplong_outputs = Mycelia.qc_filter_long_reads_fastplong(
in_fastq=long_fastq,
min_length=1000
)
println("fastplong outputs:")
println(" Filtered reads: $(fastplong_outputs.out_fastq)")
println(" JSON report: $(fastplong_outputs.json_report)")
println(" HTML report: $(fastplong_outputs.html_report)")Part 5: Quality Visualization
Create plots to visualize read-level metrics.
println("\n=== Quality Visualization ===")
quality_dist_plot = Plots.histogram(
per_read_quals;
bins=0:2:50,
title="Per-read Quality Distribution",
xlabel="Mean Phred Score",
ylabel="Count",
legend=false
)
gc_plot = Plots.histogram(
gc_contents .* 100;
bins=0:5:100,
title="GC Content Distribution",
xlabel="GC %",
ylabel="Count",
legend=false
)
length_plot = Plots.histogram(
read_lengths;
bins=10,
title="Read Length Distribution",
xlabel="Length (bp)",
ylabel="Count",
legend=false
)
per_base_plot = Mycelia.plot_per_base_quality(short_fastq)
if isinteractive()
Plots.display(quality_dist_plot)
Plots.display(gc_plot)
Plots.display(length_plot)
Plots.display(per_base_plot)
else
Plots.savefig(quality_dist_plot, joinpath(output_dir, "quality_distribution.png"))
Plots.savefig(gc_plot, joinpath(output_dir, "gc_distribution.png"))
Plots.savefig(length_plot, joinpath(output_dir, "length_distribution.png"))
Plots.savefig(per_base_plot, joinpath(output_dir, "per_base_quality.png"))
println("Saved QC plots to $(output_dir)")
endSummary
println("\n=== Quality Control Summary ===")
println("✓ Interpreted Phred quality scores")
println("✓ Assessed read-level quality metrics")
println("✓ Applied automated read QC with fastp/fastplong/trim_galore")
println("✓ Visualized quality distributions")
println("\nNext: Tutorial 3 - K-mer Analysis and Feature Extraction")
nothing