Tutorial 6: FASTQ Sequences to Qualmer Graphs and Back
This tutorial demonstrates the round-trip workflow from FASTQ sequences through quality-aware k-mer (qualmer) graphs and back to reconstructed sequences. Qualmers incorporate both sequence information and quality scores for more accurate assembly decisions.
Learning Objectives
By the end of this tutorial, you will understand:
- How to create qualmer graphs from FASTQ data
- How quality scores influence k-mer confidence
- How to perform quality-aware assembly using package functions
- How to reconstruct sequences while preserving quality information
- The advantages of quality-aware vs coverage-only assembly
Setup and Imports
Following CLAUDE.md standards: only import top-level packages, use full namespacing
import Mycelia
import FASTX
import Graphs
import Test
import StatisticsPart 1: Understanding Qualmers
Qualmers are k-mers with associated quality information. When the same k-mer is observed multiple times with different quality scores, we calculate a joint probability that represents our confidence in that k-mer's correctness.
Creating Sample FASTQ Data with Varying Quality
function create_sample_fastq_data()
# High-quality read
hq_seq = "ATCGATCGATCGATCGATCG"
hq_qual = "IIIIIIIIIIIIIIIIIIII" ## Phred 40 (99.99% accuracy)
# Medium-quality read with overlap
mq_seq = "GATCGATCGATCGATCGTAG"
mq_qual = "FFFFFFFFFFFFFFFFFF@@" ## Phred 37 (99.98%) with lower end
# Low-quality read with errors
lq_seq = "GATCGATCGATCGATCGTAC" ## Error at end (C instead of G)
lq_qual = "555555555555555555##" ## Phred 20 (99%) with very low end
# Create FASTQ records
records = [
FASTX.FASTQ.Record("read1", hq_seq, hq_qual),
FASTX.FASTQ.Record("read2", mq_seq, mq_qual),
FASTX.FASTQ.Record("read3", lq_seq, lq_qual)
]
return records
end
println("Creating sample FASTQ data with varying quality scores...")
fastq_records = create_sample_fastq_data()Display the records
for (i, record) in enumerate(fastq_records)
println("\nRead $i:")
println(" Sequence: ", String(FASTX.sequence(record)))
println(" Quality: ", String(FASTX.quality(record)))
println(" Phred scores: ", FASTX.quality_scores(record))
endPart 2: Building Qualmer Graphs
Qualmer graphs combine k-mer information with quality scores to make more informed assembly decisions.
Build a qualmer graph with k=7
k = 7
println("\n" * "="^60)
println("Building qualmer graph with k=$k...")
qualmer_graph = Mycelia.build_qualmer_graph(fastq_records; k=k, graph_mode=Mycelia.DoubleStrand)Examine graph properties
println("\nQualmer graph statistics:")
println(" Number of vertices (unique k-mers): ", Graphs.nv(qualmer_graph))
println(" Number of edges: ", Graphs.ne(qualmer_graph))Inspect qualmer properties
println("\n" * "="^60)
println("Examining qualmer vertices and their properties:")Get first few vertices to examine
for v in Iterators.take(Graphs.vertices(qualmer_graph), 5)
vertex_data = qualmer_graph[v]
println("\nVertex $v:")
println(" K-mer: ", vertex_data.kmer)
println(" Coverage: ", vertex_data.coverage)
println(" Mean quality: ", round(vertex_data.mean_quality, digits=2))
println(" Joint probability: ", round(vertex_data.joint_probability, digits=4))
endPart 3: Quality-Aware vs Coverage-Only Assembly
Let's compare how quality information affects assembly decisions.
Find high-confidence paths using quality information
println("\n" * "="^60)
println("Finding high-confidence paths through the qualmer graph...")Get all vertices sorted by joint probability (confidence)
vertices_by_confidence = sort(collect(Graphs.vertices(qualmer_graph)),
by=v -> qualmer_graph[v].joint_probability, rev=true)
println("\nTop 5 most confident k-mers:")
for v in vertices_by_confidence[1:min(5, length(vertices_by_confidence))]
vdata = qualmer_graph[v]
println(" ", vdata.kmer,
" - Coverage: ", vdata.coverage,
", Joint prob: ", round(vdata.joint_probability, digits=4),
", Mean quality: ", round(vdata.mean_quality, digits=1))
endCompare with coverage-only approach
vertices_by_coverage = sort(collect(Graphs.vertices(qualmer_graph)),
by=v -> qualmer_graph[v].coverage, rev=true)
println("\nTop 5 highest coverage k-mers:")
for v in vertices_by_coverage[1:min(5, length(vertices_by_coverage))]
vdata = qualmer_graph[v]
println(" ", vdata.kmer,
" - Coverage: ", vdata.coverage,
", Joint prob: ", round(vdata.joint_probability, digits=4),
", Mean quality: ", round(vdata.mean_quality, digits=1))
endPart 4: Quality-Aware Path Finding
Use the package function to find the most likely path through the graph.
Find best starting vertex (highest confidence)
start_vertex = vertices_by_confidence[1]
quality_path = Mycelia.find_quality_weighted_path(qualmer_graph, start_vertex)
println("\n" * "="^60)
println("Quality-weighted path through graph:")
println("Path length: ", length(quality_path), " vertices")Show path k-mers and qualities
println("\nPath details:")
for (i, v) in enumerate(quality_path[1:min(10, length(quality_path))])
vdata = qualmer_graph[v]
println(" Step $i: ", vdata.kmer,
" (prob: ", round(vdata.joint_probability, digits=3), ")")
endPart 5: Converting to Quality-Aware BioSequence Graph
Convert the qualmer graph to a variable-length quality-aware sequence graph.
println("\n" * "="^60)
println("Converting qualmer graph to quality-aware BioSequence graph...")Convert to FASTQ graph (variable-length with quality)
fastq_graph = Mycelia.qualmer_graph_to_quality_biosequence_graph(qualmer_graph, k)
println("\nFASTQ graph statistics:")
println(" Number of vertices: ", Graphs.nv(fastq_graph))
println(" Number of edges: ", Graphs.ne(fastq_graph))Examine simplified vertices
println("\nExamining quality-aware sequence vertices:")
for v in Iterators.take(Graphs.vertices(fastq_graph), 3)
vertex_data = fastq_graph[v]
println("\nVertex $v:")
println(" Sequence: ", vertex_data.sequence)
println(" Length: ", length(vertex_data.sequence))
println(" Quality scores: ", vertex_data.quality_scores[1:min(20, length(vertex_data.quality_scores))], "...")
println(" Mean quality: ", round(Statistics.mean(vertex_data.quality_scores), digits=1))
endPart 6: Round-Trip Reconstruction
Reconstruct FASTQ records from the quality-aware graph.
println("\n" * "="^60)
println("Reconstructing FASTQ records from the graph...")Extract paths and convert to FASTQ records
reconstructed_records = Mycelia.quality_biosequence_graph_to_fastq(fastq_graph, "reconstructed")
println("\nReconstructed ", length(reconstructed_records), " FASTQ records")Compare with original
println("\nComparison with original reads:")
for (i, (orig, recon)) in enumerate(zip(fastq_records[1:min(3, length(reconstructed_records))],
reconstructed_records[1:min(3, length(reconstructed_records))]))
orig_seq = String(FASTX.sequence(orig))
recon_seq = String(FASTX.sequence(recon))
println("\nRead $i:")
println(" Original: ", orig_seq)
println(" Reconstructed: ", recon_seq)
println(" Match: ", orig_seq == recon_seq ? "✓" : "✗")
# Compare quality scores
orig_qual = FASTX.quality_scores(orig)
recon_qual = FASTX.quality_scores(recon)
println(" Original quality range: ", minimum(orig_qual), "-", maximum(orig_qual))
println(" Reconstructed quality range: ", minimum(recon_qual), "-", maximum(recon_qual))
endPart 7: Error Correction Using Quality Information
Demonstrate how quality scores help identify and correct errors using package functions.
println("\n" * "="^60)
println("Demonstrating quality-aware error correction...")Create reads with a known error
error_reads = [
FASTX.FASTQ.Record("good1", "ATCGATCGATCG", "IIIIIIIIIIII"), ## High quality
FASTX.FASTQ.Record("good2", "TCGATCGATCGA", "IIIIIIIIIIII"), ## High quality
FASTX.FASTQ.Record("error", "TCGATCTATCGA", "IIIIII##IIII"), ## Error at low quality position
]Build qualmer graph
error_graph = Mycelia.build_qualmer_graph(error_reads; k=5, graph_mode=Mycelia.SingleStrand)
println("\nAnalyzing k-mers around error position:")The error creates k-mers: GATCT (wrong) vs GATCG (correct)
for v in Graphs.vertices(error_graph)
vdata = error_graph[v]
kmer_str = String(vdata.kmer)
if occursin("GATC", kmer_str)
println(" K-mer: ", kmer_str,
" - Coverage: ", vdata.coverage,
", Joint prob: ", round(vdata.joint_probability, digits=4))
end
endUse package function to identify potential errors
potential_errors = Mycelia.identify_potential_errors(error_graph)
println("\nPotential error k-mers identified: ", length(potential_errors))
for error_v in potential_errors
vdata = error_graph[error_v]
println(" Error k-mer: ", vdata.kmer,
" - Coverage: ", vdata.coverage,
", Quality: ", round(vdata.mean_quality, digits=1),
", Confidence: ", round(vdata.joint_probability, digits=4))
endPart 8: Advanced Quality Metrics
Calculate assembly quality metrics using the package function.
metrics = Mycelia.calculate_assembly_quality_metrics(qualmer_graph)
println("\n" * "="^60)
println("Assembly quality metrics:")
println(" Mean k-mer coverage: ", round(metrics.mean_coverage, digits=2))
println(" Mean quality score: ", round(metrics.mean_quality, digits=1))
println(" Mean k-mer confidence: ", round(metrics.mean_confidence, digits=4))
println(" Fraction of low-confidence k-mers: ", round(metrics.low_confidence_fraction, digits=3))
println(" Total unique k-mers: ", metrics.total_kmers)Part 9: Practical Example - Assembling a Small Genome Region
Let's create a more realistic example with overlapping reads from a genome region.
function create_genome_region_reads()
# Simulate a 50bp genome region
true_sequence = "ATCGATCGATCGTAGCTAGCTAGCTTGCATGCATGCATGCATGCATGCAT"
# Generate overlapping reads with varying quality
reads = []
# High quality reads
push!(reads, FASTX.FASTQ.Record("hq1", true_sequence[1:25], "I"^25))
push!(reads, FASTX.FASTQ.Record("hq2", true_sequence[15:40], "I"^26))
push!(reads, FASTX.FASTQ.Record("hq3", true_sequence[30:50], "I"^21))
# Medium quality reads with some errors
read_mq1 = true_sequence[5:30]
qual_mq1 = "FFFFFFFFFFFFFFFFFFFFFFFFFF"
push!(reads, FASTX.FASTQ.Record("mq1", read_mq1, qual_mq1))
# Low quality read with error
read_lq1 = true_sequence[20:45]
read_lq1 = read_lq1[1:10] * "T" * read_lq1[12:end] ## Error at position 11
qual_lq1 = "AAAAAAAAAA#AAAAAAAAAAAAAA" ## Low quality at error
push!(reads, FASTX.FASTQ.Record("lq1", read_lq1, qual_lq1))
return reads, true_sequence
end
genome_reads, true_seq = create_genome_region_reads()
println("\n" * "="^60)
println("Assembling genome region from overlapping reads...")
println("True sequence: ", true_seq)
println("Number of reads: ", length(genome_reads))Build qualmer graph
genome_graph = Mycelia.build_qualmer_graph(genome_reads; k=9, graph_mode=Mycelia.SingleStrand)Convert to sequence graph and extract contigs
seq_graph = Mycelia.qualmer_graph_to_quality_biosequence_graph(genome_graph, 9)
println("\nAssembly results:")
println(" Qualmer graph: ", Graphs.nv(genome_graph), " vertices, ", Graphs.ne(genome_graph), " edges")
println(" Sequence graph: ", Graphs.nv(seq_graph), " vertices, ", Graphs.ne(seq_graph), " edges")Use package function to find quality-weighted path
if Graphs.nv(genome_graph) > 0
# Start from highest confidence vertex
confidence_sorted = sort(collect(Graphs.vertices(genome_graph)),
by=v -> genome_graph[v].joint_probability, rev=true)
best_path = Mycelia.find_quality_weighted_path(genome_graph, confidence_sorted[1])
println("\nBest quality-weighted path:")
println(" Path length: ", length(best_path), " k-mers")
# Reconstruct sequence from path
if length(best_path) > 1
path_kmers = [String(genome_graph[v].kmer) for v in best_path]
# Simple reconstruction: first k-mer + last base of each subsequent k-mer
reconstructed = path_kmers[1] * join([kmer[end] for kmer in path_kmers[2:end]])
println(" Reconstructed length: ", length(reconstructed))
println(" Reconstructed: ", reconstructed)
# Check accuracy
if reconstructed == true_seq
println(" ✓ Perfect reconstruction!")
elseif occursin(reconstructed, true_seq)
println(" ✓ Assembled sequence is a substring of true sequence")
elseif occursin(true_seq, reconstructed)
println(" ✓ True sequence is a substring of assembled sequence")
else
println(" ✗ Assembly differs from true sequence")
println(" True: ", true_seq)
println(" Assembled: ", reconstructed)
end
end
endFind longest path (contig) from sequence graph
if Graphs.nv(seq_graph) > 0
# Get vertex with longest sequence
longest_v = argmax(v -> length(seq_graph[v].sequence), Graphs.vertices(seq_graph))
longest_seq = seq_graph[longest_v].sequence
longest_qual = seq_graph[longest_v].quality_scores
println("\nLongest contig from simplified graph:")
println(" Length: ", length(longest_seq))
println(" Sequence: ", longest_seq)
println(" Mean quality: ", round(Statistics.mean(longest_qual), digits=1))
# Check accuracy
if String(longest_seq) == true_seq
println(" ✓ Perfect reconstruction!")
else
# Find best alignment
true_str = true_seq
assembled_str = String(longest_seq)
if occursin(assembled_str, true_str)
println(" ✓ Assembled sequence is a substring of true sequence")
elseif occursin(true_str, assembled_str)
println(" ✓ True sequence is a substring of assembled sequence")
else
println(" ✗ Assembly differs from true sequence")
end
end
endCalculate final quality metrics for the genome assembly
final_metrics = Mycelia.calculate_assembly_quality_metrics(genome_graph)
println("\nFinal assembly quality metrics:")
println(" Mean coverage: ", round(final_metrics.mean_coverage, digits=2))
println(" Mean quality: ", round(final_metrics.mean_quality, digits=1))
println(" Mean confidence: ", round(final_metrics.mean_confidence, digits=4))
println(" Low confidence fraction: ", round(final_metrics.low_confidence_fraction, digits=3))Summary
In this tutorial, we've demonstrated:
- Qualmer Construction: Building quality-aware k-mer graphs from FASTQ data
- Joint Probability: How multiple observations with different qualities are combined
- Quality vs Coverage: The advantage of using quality scores over coverage alone
- Package Functions: Using Mycelia's built-in functions for quality-weighted analysis:
find_quality_weighted_path()for optimal path findingcalculate_assembly_quality_metrics()for comprehensive quality assessmentidentify_potential_errors()for error detection
- Round-Trip Conversion: Maintaining quality information through graph transformations
- Practical Assembly: Using quality information for more accurate genome assembly
Key advantages of qualmer graphs:
- Better discrimination between true k-mers and errors
- Quality-weighted path finding for more accurate assembly
- Preservation of quality information for downstream analysis
- Improved handling of repetitive regions with varying quality
- Built-in error detection and quality assessment capabilities
println("\n" * "="^60)
println("Tutorial 6 completed!")
println("You've learned how to use quality-aware k-mer graphs for improved assembly accuracy.")
println("All analysis was performed using Mycelia's built-in qualmer analysis functions.")