Tutorial 5: Direct FASTQ Sequence Graphs
This tutorial demonstrates the direct workflow from FASTQ sequences to quality-aware sequence graphs without the intermediate qualmer step. This approach is useful when you want variable-length contigs with quality information from the start.
Learning Objectives
By the end of this tutorial, you will understand:
- How to create quality-aware sequence graphs directly from FASTQ data
- The difference between qualmer-based and direct approaches
- How to perform assembly with preserved per-base quality scores
- When to use direct vs qualmer-mediated approaches
- Quality-aware contig assembly and validation
Setup and Imports
Following CLAUDE.md standards: only import top-level packages, use full namespacing
import Mycelia
import FASTX
import Test
import StatisticsPart 1: Direct Quality-Aware Sequence Graph Construction
Unlike Tutorial 6 which goes FASTQ → Qualmer → FASTQ graphs, this tutorial demonstrates the direct FASTQ → FASTQ graphs approach.
Creating Diverse FASTQ Data
function create_diverse_fastq_data()
# Simulated reads from a 60bp region with overlaps
true_sequence = "ATCGATCGATCGTAGCTAGCTAGCTTGCATGCATGCATGCATGCATGCATTAGCTAGC"
reads = []
# High-quality overlapping reads (perfect assembly case)
push!(reads, FASTX.FASTQ.Record("read1", true_sequence[1:25], "I"^25))
push!(reads, FASTX.FASTQ.Record("read2", true_sequence[20:45], "I"^26))
push!(reads, FASTX.FASTQ.Record("read3", true_sequence[40:60], "I"^21))
# Medium-quality reads with slight variations
push!(reads, FASTX.FASTQ.Record("read4", true_sequence[10:35], "F"^26))
push!(reads, FASTX.FASTQ.Record("read5", true_sequence[30:55], "F"^26))
# Low-quality read with potential error
error_seq = true_sequence[15:40]
error_seq = error_seq[1:10] * "T" * error_seq[12:end] ## Introduce error
push!(reads, FASTX.FASTQ.Record("read6", error_seq, "AAA###AAA" * "A"^17))
# Short high-quality reads
push!(reads, FASTX.FASTQ.Record("read7", true_sequence[5:20], "I"^16))
push!(reads, FASTX.FASTQ.Record("read8", true_sequence[45:60], "I"^16))
return reads, true_sequence
end
println("Creating diverse FASTQ dataset for direct assembly...")
fastq_reads, reference_seq = create_diverse_fastq_data()
println("Dataset overview:")
println(" Reference sequence: ", reference_seq)
println(" Number of reads: ", length(fastq_reads))
for (i, read) in enumerate(fastq_reads)
seq = String(FASTX.sequence(read))
qual_scores = FASTX.quality_scores(read)
mean_qual = round(Statistics.mean(qual_scores), digits=1)
println(" Read $i: length $(length(seq)), mean quality $mean_qual")
endPart 2: Direct Quality-Aware BioSequence Graph Construction
Build a quality-aware sequence graph directly from FASTQ reads.
println("\n" * "="^60)
println("Building quality-aware BioSequence graph directly from FASTQ...")
# Use the direct build function
fastq_graph = Mycelia.build_quality_biosequence_graph(fastq_reads)
println("\nDirect FASTQ graph statistics:")
println(" Number of vertices: ", Graphs.nv(fastq_graph))
println(" Number of edges: ", Graphs.ne(fastq_graph))
# Examine the vertices (these should be variable-length sequences)
println("\nExamining quality-aware sequence vertices:")
for v in Iterators.take(Graphs.vertices(fastq_graph), min(5, Graphs.nv(fastq_graph)))
vertex_data = fastq_graph[v]
mean_qual = round(Statistics.mean(vertex_data.quality_scores), digits=1)
println("\nVertex $v:")
println(" Sequence: ", vertex_data.sequence)
println(" Length: ", length(vertex_data.sequence))
println(" Mean quality: ", mean_qual)
println(" Quality range: ", minimum(vertex_data.quality_scores), "-", maximum(vertex_data.quality_scores))
endPart 3: Comparison with Qualmer-Mediated Approach
Let's compare the direct approach with the qualmer-mediated approach.
println("\n" * "="^60)
println("Comparing direct vs qualmer-mediated approaches...")
# Build qualmer graph first, then convert to sequence graph
k = 15 ## Use larger k for better comparison
qualmer_graph = Mycelia.build_qualmer_graph(fastq_reads; k=k, graph_mode=Mycelia.SingleStrand)
qualmer_to_fastq = Mycelia.qualmer_graph_to_quality_biosequence_graph(qualmer_graph, k)
println("\nApproach comparison:")
println(" Direct FASTQ graph:")
println(" Vertices: ", Graphs.nv(fastq_graph))
println(" Edges: ", Graphs.ne(fastq_graph))
println(" Qualmer-mediated graph (k=$k):")
println(" Qualmer vertices: ", Graphs.nv(qualmer_graph))
println(" Qualmer edges: ", Graphs.ne(qualmer_graph))
println(" Final FASTQ vertices: ", Graphs.nv(qualmer_to_fastq))
println(" Final FASTQ edges: ", Graphs.ne(qualmer_to_fastq))Part 4: Quality-Aware Assembly Analysis
Analyze assembly quality using both approaches.
println("\n" * "="^60)
println("Analyzing assembly quality for both approaches...")
# For direct approach - analyze sequence lengths and qualities
direct_sequences = [fastq_graph[v].sequence for v in Graphs.vertices(fastq_graph)]
direct_qualities = [fastq_graph[v].quality_scores for v in Graphs.vertices(fastq_graph)]
if !isempty(direct_sequences)
direct_lengths = [length(seq) for seq in direct_sequences]
direct_mean_quals = [Statistics.mean(quals) for quals in direct_qualities]
println("\nDirect approach analysis:")
println(" Sequence count: ", length(direct_sequences))
println(" Mean sequence length: ", round(Statistics.mean(direct_lengths), digits=1))
println(" Longest sequence: ", maximum(direct_lengths), " bp")
println(" Mean quality across all sequences: ", round(Statistics.mean(vcat(direct_mean_quals...)), digits=1))
end
# For qualmer approach - use the package quality metrics
if Graphs.nv(qualmer_graph) > 0
qualmer_metrics = Mycelia.calculate_assembly_quality_metrics(qualmer_graph)
println("\nQualmer approach analysis:")
println(" K-mer count: ", qualmer_metrics.total_kmers)
println(" Mean k-mer coverage: ", round(qualmer_metrics.mean_coverage, digits=1))
println(" Mean k-mer quality: ", round(qualmer_metrics.mean_quality, digits=1))
println(" Mean k-mer confidence: ", round(qualmer_metrics.mean_confidence, digits=4))
endPart 5: Contig Assembly and Reconstruction
Extract contigs from the quality-aware sequence graph.
println("\n" * "="^60)
println("Extracting contigs and performing reconstruction...")
# Find the longest contigs from direct approach
if !isempty(direct_sequences)
# Sort by length
seq_length_pairs = [(i, length(seq)) for (i, seq) in enumerate(direct_sequences)]
sort!(seq_length_pairs, by=x -> x[2], rev=true)
println("\nTop contigs from direct approach:")
for (rank, (vertex_idx, length)) in enumerate(seq_length_pairs[1:min(3, length(seq_length_pairs))])
actual_vertex = collect(Graphs.vertices(fastq_graph))[vertex_idx]
vertex_data = fastq_graph[actual_vertex]
mean_qual = round(Statistics.mean(vertex_data.quality_scores), digits=1)
println(" Contig $rank:")
println(" Length: $length bp")
println(" Mean quality: $mean_qual")
println(" Sequence: ", vertex_data.sequence)
# Check if this contig matches part of the reference
contig_seq = String(vertex_data.sequence)
if occursin(contig_seq, reference_seq)
println(" ✓ Perfect match in reference")
elseif occursin(reference_seq, contig_seq)
println(" ✓ Contains entire reference")
else
# Check for partial matches
best_match_len = 0
for i in 1:(length(reference_seq) - length(contig_seq) + 1)
if reference_seq[i:(i + length(contig_seq) - 1)] == contig_seq
best_match_len = length(contig_seq)
break
end
end
if best_match_len > 0
println(" ✓ Partial match ($best_match_len bp)")
else
println(" ✗ No direct match found")
end
end
end
endPart 6: Round-Trip Validation
Convert back to FASTQ records and validate quality preservation.
println("\n" * "="^60)
println("Performing round-trip validation...")
# Convert graph back to FASTQ records
reconstructed_fastq = Mycelia.quality_biosequence_graph_to_fastq(fastq_graph, "reconstructed")
println("\nRound-trip validation:")
println(" Original reads: ", length(fastq_reads))
println(" Reconstructed reads: ", length(reconstructed_fastq))
# Analyze quality preservation
original_qualities = []
reconstructed_qualities = []
for read in fastq_reads
append!(original_qualities, FASTX.quality_scores(read))
end
for read in reconstructed_fastq
append!(reconstructed_qualities, FASTX.quality_scores(read))
end
if !isempty(original_qualities) && !isempty(reconstructed_qualities)
println("\nQuality score analysis:")
println(" Original mean quality: ", round(Statistics.mean(original_qualities), digits=1))
println(" Reconstructed mean quality: ", round(Statistics.mean(reconstructed_qualities), digits=1))
println(" Original quality range: ", minimum(original_qualities), "-", maximum(original_qualities))
println(" Reconstructed quality range: ", minimum(reconstructed_qualities), "-", maximum(reconstructed_qualities))
end
# Compare individual reads
println("\nIndividual read comparison:")
for i in 1:min(3, length(fastq_reads), length(reconstructed_fastq))
orig = fastq_reads[i]
recon = reconstructed_fastq[i]
orig_seq = String(FASTX.sequence(orig))
recon_seq = String(FASTX.sequence(recon))
println("\nRead $i:")
println(" Original: ", orig_seq)
println(" Reconstructed: ", recon_seq)
println(" Length match: ", length(orig_seq) == length(recon_seq) ? "✓" : "✗")
println(" Sequence match: ", orig_seq == recon_seq ? "✓" : "✗")
endPart 7: Error Detection and Quality Assessment
Use quality information to identify potential assembly issues.
println("\n" * "="^60)
println("Quality-based error detection and assessment...")
# Analyze quality distribution across contigs
if !isempty(direct_qualities)
all_quals = vcat(direct_qualities...)
quality_stats = (
mean = Statistics.mean(all_quals),
median = Statistics.median(all_quals),
std = Statistics.std(all_quals),
min = minimum(all_quals),
max = maximum(all_quals)
)
println("\nQuality distribution analysis:")
println(" Mean: ", round(quality_stats.mean, digits=1))
println(" Median: ", round(quality_stats.median, digits=1))
println(" Std dev: ", round(quality_stats.std, digits=1))
println(" Range: ", quality_stats.min, "-", quality_stats.max)
# Identify low-quality regions
low_quality_threshold = 20.0
low_qual_count = count(q -> q < low_quality_threshold, all_quals)
low_qual_fraction = low_qual_count / length(all_quals)
println("\nLow-quality region analysis:")
println(" Positions below Q$low_quality_threshold: ", low_qual_count)
println(" Fraction of low-quality positions: ", round(low_qual_fraction, digits=3))
if low_qual_fraction > 0.1
println(" ⚠️ High fraction of low-quality positions detected")
else
println(" ✓ Good overall quality distribution")
end
endPart 8: Performance and Use Case Analysis
Discuss when to use direct vs qualmer-mediated approaches.
println("\n" * "="^60)
println("Performance and use case analysis...")
println("\nApproach comparison summary:")
println("\nDirect FASTQ → FASTQ graphs:")
println(" ✓ Faster construction (no intermediate k-mer step)")
println(" ✓ Variable-length sequences from start")
println(" ✓ Natural read-level quality preservation")
println(" ✓ Good for high-quality, long reads")
println(" ✗ Less granular error detection")
println(" ✗ May struggle with complex repeat regions")
println("\nQualmer-mediated approach:")
println(" ✓ Fine-grained quality analysis at k-mer level")
println(" ✓ Better error detection and correction")
println(" ✓ Handles complex genomic features better")
println(" ✓ Quality-weighted assembly decisions")
println(" ✗ More computationally intensive")
println(" ✗ Requires k-mer size optimization")
println("\nRecommended use cases:")
println(" Direct approach:")
println(" - High-quality PacBio HiFi or ONT reads")
println(" - Simple genomes without complex repeats")
println(" - Rapid prototyping and initial assembly")
println(" - When computational resources are limited")
println("\n Qualmer approach:")
println(" - Short reads (Illumina)")
println(" - Error-prone long reads")
println(" - Complex genomes with repeats")
println(" - When maximum accuracy is required")Part 9: Practical Assembly Example
Demonstrate a complete assembly workflow using the direct approach.
println("\n" * "="^60)
println("Complete assembly workflow example...")
function create_realistic_reads()
# Simulate a 100bp region with realistic read coverage
true_seq = "ATCGATCGATCGTAGCTAGCTAGCTTGCATGCATGCATGCATGCATGCATTAGCTAGCATCGATCGTAGCTAGCTAGCTTGCATGCATGCAT"
reads = []
# Simulate 10x coverage with 25bp reads
read_length = 25
step_size = 10 ## Overlap by 15bp
for start in 1:step_size:(length(true_seq) - read_length + 1)
end_pos = min(start + read_length - 1, length(true_seq))
read_seq = true_seq[start:end_pos]
# Vary quality based on position (simulate quality degradation)
base_quality = 35
qual_scores = [max(20, base_quality - abs(i - read_length÷2)) for i in 1:length(read_seq)]
qual_string = String([Char(33 + q) for q in qual_scores])
push!(reads, FASTX.FASTQ.Record("read_$(start)", read_seq, qual_string))
end
return reads, true_seq
end
realistic_reads, true_genome = create_realistic_reads()
println("\nRealistic assembly example:")
println(" True genome length: ", length(true_genome))
println(" Number of reads: ", length(realistic_reads))
println(" Expected coverage: ~10x")
# Build graph and assemble
realistic_graph = Mycelia.build_quality_biosequence_graph(realistic_reads)
println(" Graph vertices: ", Graphs.nv(realistic_graph))
println(" Graph edges: ", Graphs.ne(realistic_graph))
# Find best assembly
if Graphs.nv(realistic_graph) > 0
sequences = [realistic_graph[v].sequence for v in Graphs.vertices(realistic_graph)]
qualities = [realistic_graph[v].quality_scores for v in Graphs.vertices(realistic_graph)]
# Find longest sequence
longest_idx = argmax(length.(sequences))
best_assembly = String(sequences[longest_idx])
best_quality = Statistics.mean(qualities[longest_idx])
println("\nBest assembly result:")
println(" Assembled length: ", length(best_assembly))
println(" True length: ", length(true_genome))
println(" Mean quality: ", round(best_quality, digits=1))
# Check assembly accuracy
if best_assembly == true_genome
println(" ✓ Perfect assembly!")
elseif occursin(best_assembly, true_genome)
println(" ✓ Assembly is subset of true genome")
coverage = length(best_assembly) / length(true_genome)
println(" Coverage: ", round(coverage * 100, digits=1), "%")
elseif occursin(true_genome, best_assembly)
println(" ✓ Assembly contains entire true genome")
else
println(" ⚠️ Assembly differs from true genome")
println(" True: ", true_genome[1:min(50, length(true_genome))], "...")
println(" Assembled: ", best_assembly[1:min(50, length(best_assembly))], "...")
end
endSummary
In this tutorial, we've demonstrated:
- Direct Construction: Building quality-aware sequence graphs directly from FASTQ
- Approach Comparison: Direct vs qualmer-mediated assembly strategies
- Quality Preservation: Maintaining per-base quality through assembly
- Contig Assembly: Extracting variable-length contigs with quality scores
- Error Detection: Using quality information for assembly validation
- Use Case Analysis: When to choose each approach
- Practical Workflow: Complete assembly pipeline demonstration
Key insights:
- Direct approach is faster and simpler for high-quality data
- Qualmer approach provides better error handling for challenging data
- Quality information is preserved throughout both workflows
- Choice of approach depends on data quality and computational requirements
- Both approaches support the complete graph hierarchy for downstream analysis
println("\n" * "="^60)
println("Tutorial 5 completed!")
println("You've learned to choose between direct and qualmer-mediated quality-aware assembly.")
println("Both approaches preserve quality information for downstream analysis.")