Quality Control & Preprocessing
Functions for assessing and improving sequencing data quality before downstream analysis.
Overview
Quality control is essential for reliable bioinformatics results. Mycelia currently provides:
- External tool integration for quality filtering (fastp, filtlong, fastplong, chopper, trim_galore) - Implemented
- Basic quality assessment (
analyze_fastq_quality) - Implemented - FASTQ file operations (reading/writing) - Implemented
- Planned features: Comprehensive native quality assessment, contamination detection, and detailed reporting
Implementation Status: Quality control currently relies primarily on well-established external tools. Native Julia implementations for detailed quality analysis and filtering are planned.
Common Workflows
Currently Available (via External Tools)
Quality Filtering
# Short read QC with fastp
Mycelia.qc_filter_short_reads_fastp(
input_file="reads.fastq",
output_file="filtered.fastq"
)
# Long read QC with fastplong
Mycelia.qc_filter_long_reads_fastplong(
input_file="long_reads.fastq",
output_file="filtered_long.fastq"
)
# Long read QC with filtlong
Mycelia.qc_filter_long_reads_filtlong(
input_file="long_reads.fastq",
output_file="filtered_long.fastq",
min_length=1000,
min_mean_q=7
)
# Paired-end trimming with trim_galore
Mycelia.trim_galore_paired(
R1="reads_R1.fastq",
R2="reads_R2.fastq",
output_dir="trimmed/"
)Planned Native Implementations
1. Basic Quality Assessment (planned)
# Analyze FASTQ quality
quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
# Generate quality report
quality_report = Mycelia.generate_quality_report(quality_stats)2. Read Preprocessing (planned)
# Quality trimming and filtering
clean_reads = Mycelia.preprocess_reads(
"raw_reads.fastq",
min_quality=20,
min_length=1000,
trim_ends=true
)3. Contamination Removal (planned)
# Remove host contamination
decontaminated_reads = Mycelia.remove_contamination(
"reads.fastq",
reference_genome="host_genome.fasta",
output="clean_reads.fastq"
)Quality Assessment
FASTQ Quality Analysis
Note: The functions in this section and below are planned but not yet implemented. They represent the intended API design for native Julia quality control functionality. Currently, use the external tool wrappers shown above for quality control tasks.
Mycelia.analyze_fastq_quality — Function
analyze_fastq_quality(fastq_file::String)Analyzes quality metrics for a FASTQ file.
Calculates comprehensive quality statistics including read count, quality scores, length distribution, GC content, and quality threshold percentages.
Arguments
fastq_file: Path to FASTQ file (can be gzipped)
Returns
FastqQualityResults with the following fields:
n_reads: Total number of readsmean_quality: Average Phred quality score across all readsmean_length: Average read lengthgc_content: GC content percentagequality_distribution: QualityDistribution with Q20+, Q30+, Q40+ percentages
Example
quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
println("Total reads: $(quality_stats.n_reads)")
println("Mean quality: $(quality_stats.mean_quality)")
println("Q30+ reads: $(quality_stats.quality_distribution.q30_percent)%")<!– calculateperbasequality, calculatepersequencequality, assessqualitydegradation not yet implemented as documented –>
Example: Comprehensive Quality Analysis
# Analyze quality across multiple metrics
quality_data = Mycelia.analyze_fastq_quality("reads.fastq")
# Access quality metrics
println("Total reads: $(quality_data.n_reads)")
println("Mean quality: $(quality_data.mean_quality)")
println("Mean length: $(quality_data.mean_length)")
println("GC content: $(quality_data.gc_content)%")
# Quality score distribution
quality_dist = quality_data.quality_distribution
println("Q20+ reads: $(quality_dist.q20_percent)%")
println("Q30+ reads: $(quality_dist.q30_percent)%")Sequence Composition Analysis
<!– calculategccontent, analyzenucleotidecomposition, detectcompositionbias, calculatecomplexityscores not yet implemented as documented –>
Example: Composition Analysis
# Analyze sequence composition
composition = Mycelia.analyze_nucleotide_composition("reads.fastq")
# Check for bias
bias_detected = Mycelia.detect_composition_bias(composition)
if bias_detected.has_bias
println("Composition bias detected:")
println(" Type: $(bias_detected.bias_type)")
println(" Severity: $(bias_detected.severity)")
println(" Affected positions: $(bias_detected.positions)")
endRead Length Analysis
<!– calculatereadlengthdistribution, analyzelengthuniformity, detectlength_artifacts not yet implemented as documented –>
Example: Length Distribution Analysis
# Analyze read length characteristics
length_stats = Mycelia.calculate_read_length_distribution("reads.fastq")
println("Read length statistics:")
println(" Mean: $(length_stats.mean)")
println(" Median: $(length_stats.median)")
println(" Min: $(length_stats.minimum)")
println(" Max: $(length_stats.maximum)")
println(" Std Dev: $(length_stats.std)")
# Check for length artifacts
artifacts = Mycelia.detect_length_artifacts(length_stats)
if !isempty(artifacts)
println("Length artifacts detected: $(length(artifacts))")
endPreprocessing and Filtering
Quality-Based Filtering
Currently Available: External tool wrappers (see below)
Planned: Native Julia filtering functions
Mycelia.qc_filter_short_reads_fastp — Function
qc_filter_short_reads_fastp(
;
forward_reads,
reverse_reads,
out_forward,
out_reverse,
report_title,
html,
json,
enable_dedup
)
Perform quality control (QC) filtering and trimming on paired-end short-read FASTQ files using fastp.
Arguments
forward_reads::String: Path to the forward (R1) FASTQ file.reverse_reads::String: Path to the reverse (R2) FASTQ file.out_forward::String: Output path for filtered forward reads (auto-generated if not specified).out_reverse::String: Output path for filtered reverse reads (auto-generated if not specified).report_title::String: Title for the HTML/JSON report (auto-generated if not specified).html::String: Output path for HTML report (auto-generated if not specified).json::String: Output path for JSON report (auto-generated if not specified).enable_dedup::Union{Bool,Nothing}: Control deduplication behavior. Iftrue, forces deduplication with memory-aware settings. Iffalse, disables deduplication. Ifnothing(default), uses automatic logic based on file size and available memory.
Returns
- Named tuple containing paths to:
(out_forward, out_reverse, json, html)
Details
This function uses fastp to perform quality control, adapter trimming, and optional deduplication on paired-end reads.
Smart Deduplication Logic
The function includes intelligent memory management for deduplication:
- User Control: Set
enable_dedup=trueto force deduplication, orenable_dedup=falseto disable it. - Automatic Mode: When
enable_dedup=nothing(default), the function:- Skips deduplication for small files (< 100MB total) for efficiency
- For larger files, enables deduplication if sufficient memory is available
- Memory-Aware Settings: Automatically adjusts fastp's
--dup_calc_accuracybased on available system memory:- Default: Level 3 (4GB memory) if sufficient memory available
- Fallback: Level 2 (2GB memory) or Level 1 (1GB memory) if needed
- Disables deduplication if < 1GB memory available
This ensures the process won't be killed due to out-of-memory conditions while maintaining deduplication benefits when feasible.
Mycelia.qc_filter_long_reads_fastplong — Function
qc_filter_long_reads_fastplong(
;
in_fastq,
report_title,
out_fastq,
html_report,
json_report,
min_length,
max_length
)
Perform QC filtering on long-read FASTQ files using fastplong.
Arguments
in_fastq::String: Path to the input FASTQ file.out_fastq::String: Path to the output FASTQ file.quality_threshold::Int: Minimum average quality to retain a read (default 10).min_length::Int: Minimum read length (default 1000).max_length::Int=0: Maximum read length (default 0, no maximum).
Returns
String: Path to the filtered FASTQ file.
Details
This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.
Mycelia.qc_filter_long_reads_filtlong — Function
qc_filter_long_reads_filtlong(
;
in_fastq,
out_fastq,
min_mean_q,
keep_percent
)
Filter and process long reads from a FASTQ file using Filtlong.
This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.
Arguments
in_fastq::String: Path to the input FASTQ file.out_fastq::String: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.min_mean_q::Int: Minimum mean quality score for reads to be kept. Default is 20.keep_percent::Int: Percentage of reads to keep after filtering. Default is 95.
Returns
out_fastq
Details
This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.
Mycelia.trim_galore_paired — Function
trim_galore_paired(; forward_reads, reverse_reads, outdir)
Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.
Arguments
forward_reads::String: Path to forward reads FASTQ filereverse_reads::String: Path to reverse reads FASTQ fileoutdir::String: Output directory for trimmed files
Returns
Tuple{String, String}: Paths to trimmed forward and reverse read files
Dependencies
Requires trim_galore conda environment
Example: Quality Filtering (Planned native implementation)
# Filter reads by quality thresholds (planned)
# Currently use: qc_filter_short_reads_fastp() or qc_filter_long_reads_filtlong()
filtered_reads = Mycelia.filter_by_quality(
"reads.fastq",
min_mean_quality=25,
min_length=1000,
max_n_percent=5
)
# Save filtered reads
Mycelia.write_fastq("filtered_reads.fastq", filtered_reads)
# Report filtering statistics
println("Original reads: $(Mycelia.count_reads("reads.fastq"))")
println("Filtered reads: $(length(filtered_reads))")
println("Retention rate: $(length(filtered_reads) / Mycelia.count_reads("reads.fastq") * 100)%")Adapter and Contamination Removal (Planned)
Currently Available: Adapter removal via trim_galore_paired() and QC filter functions
Planned: Native contamination detection and removal functions
Example: Adapter Removal (Planned)
# Detect and remove adapters (planned)
adapter_results = Mycelia.detect_adapter_contamination("reads.fastq")
if adapter_results.contamination_detected
println("Adapter contamination detected:")
println(" Adapter type: $(adapter_results.adapter_type)")
println(" Contamination rate: $(adapter_results.contamination_rate)%")
# Remove adapters
clean_reads = Mycelia.remove_adapters(
"reads.fastq",
adapter_sequences=adapter_results.adapter_sequences,
min_overlap=10
)
Mycelia.write_fastq("adapter_trimmed.fastq", clean_reads)
endLength-Based Filtering
<!– filterbylength, trimtolength, removeshortreads, normalizereadlengths not yet implemented as individual functions –>
<!– Note: Length-based filtering functions are not yet implemented as individual functions. The fastp and filtlong functions above also perform length filtering as part of their quality control. –>
Example: Length Filtering
# Filter reads by length criteria
length_filtered = Mycelia.filter_by_length(
"reads.fastq",
min_length=1000,
max_length=50000
)
# Normalize length distribution (optional)
normalized_reads = Mycelia.normalize_read_lengths(
length_filtered,
target_length=15000,
tolerance=0.2
)Contamination Detection
Host Contamination
<!– detecthostcontamination, removehostsequences, classifycontaminationsources not yet implemented as documented –>
Example: Host Contamination Removal
# Screen for host contamination
contamination_results = Mycelia.detect_host_contamination(
"reads.fastq",
host_genome="human_genome.fasta",
min_identity=0.9
)
println("Host contamination: $(contamination_results.contamination_rate)%")
# Remove contaminated reads
clean_reads = Mycelia.remove_host_sequences(
"reads.fastq",
contamination_results.contaminated_reads
)Vector and Adapter Contamination
<!– screenvectorcontamination, detectprimercontamination, removesyntheticsequences not yet implemented as documented –>
Example: Vector Screening
# Screen for vector contamination
vector_results = Mycelia.screen_vector_contamination(
"reads.fastq",
vector_database="vector_db.fasta"
)
if vector_results.contamination_found
println("Vector contamination detected:")
for hit in vector_results.contaminated_sequences
println(" $(hit.read_id): $(hit.vector_name)")
end
endQuality Metrics and Reporting
Standard Quality Metrics
<!– calculatephredscores, assessbasecallaccuracy, calculateerrorrates, estimatesequencing_quality not yet implemented as documented –>
Example: Quality Metrics Calculation
# Calculate comprehensive quality metrics
quality_metrics = Mycelia.calculate_comprehensive_metrics("reads.fastq")
# Phred score analysis
phred_analysis = Mycelia.calculate_phred_scores(quality_metrics)
println("Mean Phred score: $(phred_analysis.mean_phred)")
println("Q30+ rate: $(phred_analysis.q30_rate)%")
# Error rate estimation
error_rates = Mycelia.calculate_error_rates(quality_metrics)
println("Estimated error rate: $(error_rates.overall_error_rate)")Quality Control Reports
<!– generatequalityreport, createqualitydashboard, exportqualitymetrics not yet implemented as documented –>
Example: Quality Report Generation
# Generate comprehensive quality report
quality_report = Mycelia.generate_quality_report(
"reads.fastq",
output_format="html",
include_plots=true
)
# Save report
Mycelia.save_quality_report(quality_report, "quality_report.html")
# Export metrics for further analysis
metrics_data = Mycelia.export_quality_metrics(quality_report, format="csv")Specialized Quality Control
Platform-Specific QC
<!– assesshifiquality, assessnanoporequality, assessilluminaquality not yet implemented as documented –>
Example: HiFi-Specific Quality Control
# HiFi-specific quality assessment
hifi_qc = Mycelia.assess_hifi_quality("hifi_reads.fastq")
println("HiFi Quality Assessment:")
println(" Mean accuracy: $(hifi_qc.mean_accuracy)")
println(" Mean length: $(hifi_qc.mean_length)")
println(" Length N50: $(hifi_qc.length_n50)")
println(" Quality distribution:")
println(" Q20+: $(hifi_qc.q20_percent)%")
println(" Q30+: $(hifi_qc.q30_percent)%")
println(" Q40+: $(hifi_qc.q40_percent)%")Application-Specific QC
<!– assessassemblyreadiness, assessannotationreadiness, assessvariantcalling_readiness not yet implemented as documented –>
Example: Assembly Readiness Assessment
# Check if reads are suitable for assembly
assembly_readiness = Mycelia.assess_assembly_readiness("reads.fastq")
println("Assembly Readiness:")
println(" Overall score: $(assembly_readiness.overall_score)/10")
println(" Coverage estimate: $(assembly_readiness.estimated_coverage)x")
println(" Quality sufficient: $(assembly_readiness.quality_sufficient)")
println(" Length distribution: $(assembly_readiness.length_distribution_score)")
if !assembly_readiness.ready_for_assembly
println("Issues found:")
for issue in assembly_readiness.issues
println(" - $(issue)")
end
endVisualization and Plotting
Quality Plots
<!– plotqualitydistribution, plotlengthdistribution, plotgccontentdistribution, plotbase_composition not yet implemented as individual functions Visualization functions available in Mycelia plotting module –>
Example: Quality Visualization
# Create quality visualization plots
quality_plots = Mycelia.create_quality_plots("reads.fastq")
# Individual plots
Mycelia.plot_quality_distribution(quality_plots.quality_data,
title="Per-Base Quality Scores")
Mycelia.plot_length_distribution(quality_plots.length_data,
title="Read Length Distribution")
# Combined quality dashboard
quality_dashboard = Mycelia.plot_quality_dashboard(quality_plots)
Mycelia.save_plot(quality_dashboard, "quality_dashboard.png")Performance Optimization
Memory-Efficient Processing
<!– streamqualityanalysis, processinchunks, parallelqualityassessment not yet implemented as documented –>
Example: Large File Processing
# Process large files efficiently
large_file_qc = Mycelia.stream_quality_analysis(
"large_reads.fastq",
chunk_size=10000,
parallel=true,
threads=8
)
# Memory-efficient filtering
filtered_output = Mycelia.process_in_chunks(
"large_reads.fastq",
"filtered_reads.fastq",
chunk_size=50000,
filter_function=quality_filter_function
)Common Issues and Solutions
Low Quality Data
# Identify quality issues
quality_issues = Mycelia.diagnose_quality_issues("reads.fastq")
for issue in quality_issues
println("Issue: $(issue.type)")
println(" Description: $(issue.description)")
println(" Severity: $(issue.severity)")
println(" Recommendation: $(issue.recommendation)")
endContamination Problems
# Comprehensive contamination screening
contamination_screen = Mycelia.comprehensive_contamination_screening(
"reads.fastq",
databases=["host", "vector", "adapter", "primer"]
)
# Generate contamination report
contamination_report = Mycelia.generate_contamination_report(contamination_screen)Related Functions
Data Input/Output
Mycelia.open_fastx- Read FASTQ (and FASTA) filesMycelia.write_fastq- Write filtered FASTQ files
Downstream Analysis
Mycelia.count_kmers- K-mer analysis of cleaned readsMycelia.assemble_genome- Genome assembly with quality-controlled reads
Related Workflows
Previous Steps
- Data Acquisition - Obtaining sequencing data
Next Steps
- Sequence Analysis - K-mer analysis of quality-controlled reads
- Genome Assembly (planned) - Assembly with preprocessed reads
See Also
- Tutorial 2: Quality Control
- FASTA/FASTQ Data Types (planned)
- Performance Optimization Guide