Quality Control & Preprocessing
Functions for assessing and improving sequencing data quality before downstream analysis.
Overview
Quality control is essential for reliable bioinformatics results. Mycelia integrates with external QC tools and is developing native Julia implementations for:
- External tool integration for quality filtering (fastp, filtlong, trim_galore)
- Basic FASTQ file operations (reading/writing)
- Planned features: Native quality assessment, contamination detection, and reporting
Common Workflows
Currently Available (via External Tools)
Quality Filtering
# Short read QC with fastp
Mycelia.qc_filter_short_reads_fastp(
input_file="reads.fastq",
output_file="filtered.fastq"
)
# Long read QC with fastplong
Mycelia.qc_filter_long_reads_fastplong(
input_file="long_reads.fastq",
output_file="filtered_long.fastq"
)
# Long read QC with filtlong
Mycelia.qc_filter_long_reads_filtlong(
input_file="long_reads.fastq",
output_file="filtered_long.fastq",
min_length=1000,
min_mean_q=7
)
# Paired-end trimming with trim_galore
Mycelia.trim_galore_paired(
R1="reads_R1.fastq",
R2="reads_R2.fastq",
output_dir="trimmed/"
)
Planned Native Implementations
1. Basic Quality Assessment (planned)
# Analyze FASTQ quality
quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
# Generate quality report
quality_report = Mycelia.generate_quality_report(quality_stats)
2. Read Preprocessing (planned)
# Quality trimming and filtering
clean_reads = Mycelia.preprocess_reads(
"raw_reads.fastq",
min_quality=20,
min_length=1000,
trim_ends=true
)
3. Contamination Removal (planned)
# Remove host contamination
decontaminated_reads = Mycelia.remove_contamination(
"reads.fastq",
reference_genome="host_genome.fasta",
output="clean_reads.fastq"
)
Quality Assessment
FASTQ Quality Analysis
Note: The functions in this section and below are planned but not yet implemented. They represent the intended API design for native Julia quality control functionality. Currently, use the external tool wrappers shown above for quality control tasks.
Mycelia.analyze_fastq_quality
— Functionanalyze_fastq_quality(fastq_file::String)
Analyzes quality metrics for a FASTQ file.
Calculates comprehensive quality statistics including read count, quality scores, length distribution, GC content, and quality threshold percentages.
Arguments
fastq_file
: Path to FASTQ file (can be gzipped)
Returns
FastqQualityResults with the following fields:
n_reads
: Total number of readsmean_quality
: Average Phred quality score across all readsmean_length
: Average read lengthgc_content
: GC content percentagequality_distribution
: QualityDistribution with Q20+, Q30+, Q40+ percentages
Example
quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
println("Total reads: $(quality_stats.n_reads)")
println("Mean quality: $(quality_stats.mean_quality)")
println("Q30+ reads: $(quality_stats.quality_distribution.q30_percent)%")
<!– calculateperbasequality, calculatepersequencequality, assessqualitydegradation not yet implemented as documented –>
Example: Comprehensive Quality Analysis
# Analyze quality across multiple metrics
quality_data = Mycelia.analyze_fastq_quality("reads.fastq")
# Access quality metrics
println("Total reads: $(quality_data.n_reads)")
println("Mean quality: $(quality_data.mean_quality)")
println("Mean length: $(quality_data.mean_length)")
println("GC content: $(quality_data.gc_content)%")
# Quality score distribution
quality_dist = quality_data.quality_distribution
println("Q20+ reads: $(quality_dist.q20_percent)%")
println("Q30+ reads: $(quality_dist.q30_percent)%")
Sequence Composition Analysis
<!– calculategccontent, analyzenucleotidecomposition, detectcompositionbias, calculatecomplexityscores not yet implemented as documented –>
Example: Composition Analysis
# Analyze sequence composition
composition = Mycelia.analyze_nucleotide_composition("reads.fastq")
# Check for bias
bias_detected = Mycelia.detect_composition_bias(composition)
if bias_detected.has_bias
println("Composition bias detected:")
println(" Type: $(bias_detected.bias_type)")
println(" Severity: $(bias_detected.severity)")
println(" Affected positions: $(bias_detected.positions)")
end
Read Length Analysis
<!– calculatereadlengthdistribution, analyzelengthuniformity, detectlength_artifacts not yet implemented as documented –>
Example: Length Distribution Analysis
# Analyze read length characteristics
length_stats = Mycelia.calculate_read_length_distribution("reads.fastq")
println("Read length statistics:")
println(" Mean: $(length_stats.mean)")
println(" Median: $(length_stats.median)")
println(" Min: $(length_stats.minimum)")
println(" Max: $(length_stats.maximum)")
println(" Std Dev: $(length_stats.std)")
# Check for length artifacts
artifacts = Mycelia.detect_length_artifacts(length_stats)
if !isempty(artifacts)
println("Length artifacts detected: $(length(artifacts))")
end
Preprocessing and Filtering
Quality-Based Filtering
<!– filterbyquality, trimlowqualityends, removelowqualityreads, adaptivequalityfiltering not yet implemented as individual functions –>
Mycelia.qc_filter_short_reads_fastp
— Functionqc_filter_short_reads_fastp(
;
forward_reads,
reverse_reads,
out_forward,
out_reverse,
report_title,
html,
json
)
Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output FASTQ file.adapter_seq::String
: Adapter sequence to trim.quality_threshold::Int
: Minimum phred score for trimming (default 20).min_length::Int
: Minimum read length to retain (default 50).
Returns
String
: Path to the filtered and trimmed FASTQ file.
Details
This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length
. It’s a simple wrapper that executes the external fastp command.
Mycelia.qc_filter_long_reads_fastplong
— Functionqc_filter_long_reads_fastplong(
;
in_fastq,
report_title,
out_fastq,
html_report,
json_report,
min_length,
max_length
)
Perform QC filtering on long-read FASTQ files using fastplong.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output FASTQ file.quality_threshold::Int
: Minimum average quality to retain a read (default 10).min_length::Int
: Minimum read length (default 1000).max_length::Int=0
: Maximum read length (default 0, no maximum).
Returns
String
: Path to the filtered FASTQ file.
Details
This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.
Mycelia.qc_filter_long_reads_filtlong
— Functionqc_filter_long_reads_filtlong(
;
in_fastq,
out_fastq,
min_mean_q,
keep_percent
)
Filter and process long reads from a FASTQ file using Filtlong.
This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.
Arguments
in_fastq::String
: Path to the input FASTQ file.out_fastq::String
: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.min_mean_q::Int
: Minimum mean quality score for reads to be kept. Default is 20.keep_percent::Int
: Percentage of reads to keep after filtering. Default is 95.
Returns
out_fastq
Details
This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.
Mycelia.trim_galore_paired
— Functiontrim_galore_paired(; forward_reads, reverse_reads, outdir)
Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.
Arguments
forward_reads::String
: Path to forward reads FASTQ filereverse_reads::String
: Path to reverse reads FASTQ fileoutdir::String
: Output directory for trimmed files
Returns
Tuple{String, String}
: Paths to trimmed forward and reverse read files
Dependencies
Requires trim_galore conda environment:
mamba create -n trim_galore -c bioconda trim_galore
Example: Quality Filtering
# Filter reads by quality thresholds
filtered_reads = Mycelia.filter_by_quality(
"reads.fastq",
min_mean_quality=25,
min_length=1000,
max_n_percent=5
)
# Save filtered reads
Mycelia.write_fastq("filtered_reads.fastq", filtered_reads)
# Report filtering statistics
println("Original reads: $(Mycelia.count_reads("reads.fastq"))")
println("Filtered reads: $(length(filtered_reads))")
println("Retention rate: $(length(filtered_reads) / Mycelia.count_reads("reads.fastq") * 100)%")
Adapter and Contamination Removal
<!– removeadapters, detectadaptercontamination, removehostcontamination, removevectorcontamination not yet implemented as individual functions Adapter removal available through trimgalore_paired and QC filtering functions –>
Example: Adapter Removal
# Detect and remove adapters
adapter_results = Mycelia.detect_adapter_contamination("reads.fastq")
if adapter_results.contamination_detected
println("Adapter contamination detected:")
println(" Adapter type: $(adapter_results.adapter_type)")
println(" Contamination rate: $(adapter_results.contamination_rate)%")
# Remove adapters
clean_reads = Mycelia.remove_adapters(
"reads.fastq",
adapter_sequences=adapter_results.adapter_sequences,
min_overlap=10
)
Mycelia.write_fastq("adapter_trimmed.fastq", clean_reads)
end
Length-Based Filtering
<!– filterbylength, trimtolength, removeshortreads, normalizereadlengths not yet implemented as individual functions –>
<!– Note: Length-based filtering functions are not yet implemented as individual functions. The fastp and filtlong functions above also perform length filtering as part of their quality control. –>
Example: Length Filtering
# Filter reads by length criteria
length_filtered = Mycelia.filter_by_length(
"reads.fastq",
min_length=1000,
max_length=50000
)
# Normalize length distribution (optional)
normalized_reads = Mycelia.normalize_read_lengths(
length_filtered,
target_length=15000,
tolerance=0.2
)
Contamination Detection
Host Contamination
<!– detecthostcontamination, removehostsequences, classifycontaminationsources not yet implemented as documented –>
Example: Host Contamination Removal
# Screen for host contamination
contamination_results = Mycelia.detect_host_contamination(
"reads.fastq",
host_genome="human_genome.fasta",
min_identity=0.9
)
println("Host contamination: $(contamination_results.contamination_rate)%")
# Remove contaminated reads
clean_reads = Mycelia.remove_host_sequences(
"reads.fastq",
contamination_results.contaminated_reads
)
Vector and Adapter Contamination
<!– screenvectorcontamination, detectprimercontamination, removesyntheticsequences not yet implemented as documented –>
Example: Vector Screening
# Screen for vector contamination
vector_results = Mycelia.screen_vector_contamination(
"reads.fastq",
vector_database="vector_db.fasta"
)
if vector_results.contamination_found
println("Vector contamination detected:")
for hit in vector_results.contaminated_sequences
println(" $(hit.read_id): $(hit.vector_name)")
end
end
Quality Metrics and Reporting
Standard Quality Metrics
<!– calculatephredscores, assessbasecallaccuracy, calculateerrorrates, estimatesequencing_quality not yet implemented as documented –>
Example: Quality Metrics Calculation
# Calculate comprehensive quality metrics
quality_metrics = Mycelia.calculate_comprehensive_metrics("reads.fastq")
# Phred score analysis
phred_analysis = Mycelia.calculate_phred_scores(quality_metrics)
println("Mean Phred score: $(phred_analysis.mean_phred)")
println("Q30+ rate: $(phred_analysis.q30_rate)%")
# Error rate estimation
error_rates = Mycelia.calculate_error_rates(quality_metrics)
println("Estimated error rate: $(error_rates.overall_error_rate)")
Quality Control Reports
<!– generatequalityreport, createqualitydashboard, exportqualitymetrics not yet implemented as documented –>
Example: Quality Report Generation
# Generate comprehensive quality report
quality_report = Mycelia.generate_quality_report(
"reads.fastq",
output_format="html",
include_plots=true
)
# Save report
Mycelia.save_quality_report(quality_report, "quality_report.html")
# Export metrics for further analysis
metrics_data = Mycelia.export_quality_metrics(quality_report, format="csv")
Specialized Quality Control
Platform-Specific QC
<!– assesshifiquality, assessnanoporequality, assessilluminaquality not yet implemented as documented –>
Example: HiFi-Specific Quality Control
# HiFi-specific quality assessment
hifi_qc = Mycelia.assess_hifi_quality("hifi_reads.fastq")
println("HiFi Quality Assessment:")
println(" Mean accuracy: $(hifi_qc.mean_accuracy)")
println(" Mean length: $(hifi_qc.mean_length)")
println(" Length N50: $(hifi_qc.length_n50)")
println(" Quality distribution:")
println(" Q20+: $(hifi_qc.q20_percent)%")
println(" Q30+: $(hifi_qc.q30_percent)%")
println(" Q40+: $(hifi_qc.q40_percent)%")
Application-Specific QC
<!– assessassemblyreadiness, assessannotationreadiness, assessvariantcalling_readiness not yet implemented as documented –>
Example: Assembly Readiness Assessment
# Check if reads are suitable for assembly
assembly_readiness = Mycelia.assess_assembly_readiness("reads.fastq")
println("Assembly Readiness:")
println(" Overall score: $(assembly_readiness.overall_score)/10")
println(" Coverage estimate: $(assembly_readiness.estimated_coverage)x")
println(" Quality sufficient: $(assembly_readiness.quality_sufficient)")
println(" Length distribution: $(assembly_readiness.length_distribution_score)")
if !assembly_readiness.ready_for_assembly
println("Issues found:")
for issue in assembly_readiness.issues
println(" - $(issue)")
end
end
Visualization and Plotting
Quality Plots
<!– plotqualitydistribution, plotlengthdistribution, plotgccontentdistribution, plotbase_composition not yet implemented as individual functions Visualization functions available in Mycelia plotting module –>
Example: Quality Visualization
# Create quality visualization plots
quality_plots = Mycelia.create_quality_plots("reads.fastq")
# Individual plots
Mycelia.plot_quality_distribution(quality_plots.quality_data,
title="Per-Base Quality Scores")
Mycelia.plot_length_distribution(quality_plots.length_data,
title="Read Length Distribution")
# Combined quality dashboard
quality_dashboard = Mycelia.plot_quality_dashboard(quality_plots)
Mycelia.save_plot(quality_dashboard, "quality_dashboard.png")
Performance Optimization
Memory-Efficient Processing
<!– streamqualityanalysis, processinchunks, parallelqualityassessment not yet implemented as documented –>
Example: Large File Processing
# Process large files efficiently
large_file_qc = Mycelia.stream_quality_analysis(
"large_reads.fastq",
chunk_size=10000,
parallel=true,
threads=8
)
# Memory-efficient filtering
filtered_output = Mycelia.process_in_chunks(
"large_reads.fastq",
"filtered_reads.fastq",
chunk_size=50000,
filter_function=quality_filter_function
)
Common Issues and Solutions
Low Quality Data
# Identify quality issues
quality_issues = Mycelia.diagnose_quality_issues("reads.fastq")
for issue in quality_issues
println("Issue: $(issue.type)")
println(" Description: $(issue.description)")
println(" Severity: $(issue.severity)")
println(" Recommendation: $(issue.recommendation)")
end
Contamination Problems
# Comprehensive contamination screening
contamination_screen = Mycelia.comprehensive_contamination_screening(
"reads.fastq",
databases=["host", "vector", "adapter", "primer"]
)
# Generate contamination report
contamination_report = Mycelia.generate_contamination_report(contamination_screen)
Related Functions
Data Input/Output
read_fastq
- Read FASTQ fileswrite_fastq
- Write filtered FASTQ filescompress_fastq
- Compress output files
Downstream Analysis
count_kmers
- K-mer analysis of cleaned readsassemble_genome
- Genome assembly with quality-controlled reads
Visualization
plot_quality_metrics
- Quality visualizationcreate_quality_dashboard
- Interactive quality dashboard
Related Workflows
Previous Steps
- Data Acquisition - Obtaining sequencing data
Next Steps
- Sequence Analysis - K-mer analysis of quality-controlled reads
- Genome Assembly (planned) - Assembly with preprocessed reads
See Also
- Tutorial 2: Quality Control
- FASTA/FASTQ Data Types (planned)
- Performance Optimization Guide