Quality Control & Preprocessing

Functions for assessing and improving sequencing data quality before downstream analysis.

Overview

Quality control is essential for reliable bioinformatics results. Mycelia integrates with external QC tools and is developing native Julia implementations for:

  • External tool integration for quality filtering (fastp, filtlong, trim_galore)
  • Basic FASTQ file operations (reading/writing)
  • Planned features: Native quality assessment, contamination detection, and reporting

Common Workflows

Currently Available (via External Tools)

Quality Filtering

# Short read QC with fastp
Mycelia.qc_filter_short_reads_fastp(
    input_file="reads.fastq",
    output_file="filtered.fastq"
)

# Long read QC with fastplong
Mycelia.qc_filter_long_reads_fastplong(
    input_file="long_reads.fastq",
    output_file="filtered_long.fastq"
)

# Long read QC with filtlong
Mycelia.qc_filter_long_reads_filtlong(
    input_file="long_reads.fastq",
    output_file="filtered_long.fastq",
    min_length=1000,
    min_mean_q=7
)

# Paired-end trimming with trim_galore
Mycelia.trim_galore_paired(
    R1="reads_R1.fastq",
    R2="reads_R2.fastq",
    output_dir="trimmed/"
)

Planned Native Implementations

1. Basic Quality Assessment (planned)

# Analyze FASTQ quality
quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")

# Generate quality report
quality_report = Mycelia.generate_quality_report(quality_stats)

2. Read Preprocessing (planned)

# Quality trimming and filtering
clean_reads = Mycelia.preprocess_reads(
    "raw_reads.fastq",
    min_quality=20,
    min_length=1000,
    trim_ends=true
)

3. Contamination Removal (planned)

# Remove host contamination
decontaminated_reads = Mycelia.remove_contamination(
    "reads.fastq",
    reference_genome="host_genome.fasta",
    output="clean_reads.fastq"
)

Quality Assessment

FASTQ Quality Analysis

Note: The functions in this section and below are planned but not yet implemented. They represent the intended API design for native Julia quality control functionality. Currently, use the external tool wrappers shown above for quality control tasks.

Mycelia.analyze_fastq_qualityFunction
analyze_fastq_quality(fastq_file::String)

Analyzes quality metrics for a FASTQ file.

Calculates comprehensive quality statistics including read count, quality scores, length distribution, GC content, and quality threshold percentages.

Arguments

  • fastq_file: Path to FASTQ file (can be gzipped)

Returns

FastqQualityResults with the following fields:

  • n_reads: Total number of reads
  • mean_quality: Average Phred quality score across all reads
  • mean_length: Average read length
  • gc_content: GC content percentage
  • quality_distribution: QualityDistribution with Q20+, Q30+, Q40+ percentages

Example

quality_stats = Mycelia.analyze_fastq_quality("reads.fastq")
println("Total reads: $(quality_stats.n_reads)")
println("Mean quality: $(quality_stats.mean_quality)")
println("Q30+ reads: $(quality_stats.quality_distribution.q30_percent)%")
source

<!– calculateperbasequality, calculatepersequencequality, assessqualitydegradation not yet implemented as documented –>

Example: Comprehensive Quality Analysis

# Analyze quality across multiple metrics
quality_data = Mycelia.analyze_fastq_quality("reads.fastq")

# Access quality metrics
println("Total reads: $(quality_data.n_reads)")
println("Mean quality: $(quality_data.mean_quality)")
println("Mean length: $(quality_data.mean_length)")
println("GC content: $(quality_data.gc_content)%")

# Quality score distribution
quality_dist = quality_data.quality_distribution
println("Q20+ reads: $(quality_dist.q20_percent)%")
println("Q30+ reads: $(quality_dist.q30_percent)%")

Sequence Composition Analysis

<!– calculategccontent, analyzenucleotidecomposition, detectcompositionbias, calculatecomplexityscores not yet implemented as documented –>

Example: Composition Analysis

# Analyze sequence composition
composition = Mycelia.analyze_nucleotide_composition("reads.fastq")

# Check for bias
bias_detected = Mycelia.detect_composition_bias(composition)
if bias_detected.has_bias
    println("Composition bias detected:")
    println("  Type: $(bias_detected.bias_type)")
    println("  Severity: $(bias_detected.severity)")
    println("  Affected positions: $(bias_detected.positions)")
end

Read Length Analysis

<!– calculatereadlengthdistribution, analyzelengthuniformity, detectlength_artifacts not yet implemented as documented –>

Example: Length Distribution Analysis

# Analyze read length characteristics
length_stats = Mycelia.calculate_read_length_distribution("reads.fastq")

println("Read length statistics:")
println("  Mean: $(length_stats.mean)")
println("  Median: $(length_stats.median)")
println("  Min: $(length_stats.minimum)")
println("  Max: $(length_stats.maximum)")
println("  Std Dev: $(length_stats.std)")

# Check for length artifacts
artifacts = Mycelia.detect_length_artifacts(length_stats)
if !isempty(artifacts)
    println("Length artifacts detected: $(length(artifacts))")
end

Preprocessing and Filtering

Quality-Based Filtering

<!– filterbyquality, trimlowqualityends, removelowqualityreads, adaptivequalityfiltering not yet implemented as individual functions –>

Mycelia.qc_filter_short_reads_fastpFunction
qc_filter_short_reads_fastp(
;
    forward_reads,
    reverse_reads,
    out_forward,
    out_reverse,
    report_title,
    html,
    json
)

Perform quality control (QC) filtering and trimming on short-read FASTQ files using fastp.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • adapter_seq::String: Adapter sequence to trim.
  • quality_threshold::Int: Minimum phred score for trimming (default 20).
  • min_length::Int: Minimum read length to retain (default 50).

Returns

  • String: Path to the filtered and trimmed FASTQ file.

Details

This function uses fastp to remove adapter contamination, trim low‐quality bases from the 3′ end, and discard reads shorter than min_length. It’s a simple wrapper that executes the external fastp command.

source
Mycelia.qc_filter_long_reads_fastplongFunction
qc_filter_long_reads_fastplong(
;
    in_fastq,
    report_title,
    out_fastq,
    html_report,
    json_report,
    min_length,
    max_length
)

Perform QC filtering on long-read FASTQ files using fastplong.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output FASTQ file.
  • quality_threshold::Int: Minimum average quality to retain a read (default 10).
  • min_length::Int: Minimum read length (default 1000).
  • max_length::Int=0: Maximum read length (default 0, no maximum).

Returns

  • String: Path to the filtered FASTQ file.

Details

This function uses fastplong to filter long reads based on quality and length criteria. It is optimized for Oxford Nanopore, PacBio, or similar long-read datasets.

source
Mycelia.qc_filter_long_reads_filtlongFunction
qc_filter_long_reads_filtlong(
;
    in_fastq,
    out_fastq,
    min_mean_q,
    keep_percent
)

Filter and process long reads from a FASTQ file using Filtlong.

This function filters long sequencing reads based on quality and length criteria, then compresses the output using pigz.

Arguments

  • in_fastq::String: Path to the input FASTQ file.
  • out_fastq::String: Path to the output filtered and compressed FASTQ file. Defaults to the input filename with ".filtlong.fq.gz" appended.
  • min_mean_q::Int: Minimum mean quality score for reads to be kept. Default is 20.
  • keep_percent::Int: Percentage of reads to keep after filtering. Default is 95.

Returns

  • out_fastq

Details

This function uses Filtlong to filter long reads and pigz for compression. It requires the Bioconda environment for Filtlong to be set up, which is handled internally.

source
Mycelia.trim_galore_pairedFunction
trim_galore_paired(; forward_reads, reverse_reads, outdir)

Trim paired-end FASTQ reads using Trim Galore, a wrapper around Cutadapt and FastQC.

Arguments

  • forward_reads::String: Path to forward reads FASTQ file
  • reverse_reads::String: Path to reverse reads FASTQ file
  • outdir::String: Output directory for trimmed files

Returns

  • Tuple{String, String}: Paths to trimmed forward and reverse read files

Dependencies

Requires trim_galore conda environment:

  • mamba create -n trim_galore -c bioconda trim_galore
source

Example: Quality Filtering

# Filter reads by quality thresholds
filtered_reads = Mycelia.filter_by_quality(
    "reads.fastq",
    min_mean_quality=25,
    min_length=1000,
    max_n_percent=5
)

# Save filtered reads
Mycelia.write_fastq("filtered_reads.fastq", filtered_reads)

# Report filtering statistics
println("Original reads: $(Mycelia.count_reads("reads.fastq"))")
println("Filtered reads: $(length(filtered_reads))")
println("Retention rate: $(length(filtered_reads) / Mycelia.count_reads("reads.fastq") * 100)%")

Adapter and Contamination Removal

<!– removeadapters, detectadaptercontamination, removehostcontamination, removevectorcontamination not yet implemented as individual functions Adapter removal available through trimgalore_paired and QC filtering functions –>

Example: Adapter Removal

# Detect and remove adapters
adapter_results = Mycelia.detect_adapter_contamination("reads.fastq")

if adapter_results.contamination_detected
    println("Adapter contamination detected:")
    println("  Adapter type: $(adapter_results.adapter_type)")
    println("  Contamination rate: $(adapter_results.contamination_rate)%")
    
    # Remove adapters
    clean_reads = Mycelia.remove_adapters(
        "reads.fastq",
        adapter_sequences=adapter_results.adapter_sequences,
        min_overlap=10
    )
    
    Mycelia.write_fastq("adapter_trimmed.fastq", clean_reads)
end

Length-Based Filtering

<!– filterbylength, trimtolength, removeshortreads, normalizereadlengths not yet implemented as individual functions –>

<!– Note: Length-based filtering functions are not yet implemented as individual functions. The fastp and filtlong functions above also perform length filtering as part of their quality control. –>

Example: Length Filtering

# Filter reads by length criteria
length_filtered = Mycelia.filter_by_length(
    "reads.fastq",
    min_length=1000,
    max_length=50000
)

# Normalize length distribution (optional)
normalized_reads = Mycelia.normalize_read_lengths(
    length_filtered,
    target_length=15000,
    tolerance=0.2
)

Contamination Detection

Host Contamination

<!– detecthostcontamination, removehostsequences, classifycontaminationsources not yet implemented as documented –>

Example: Host Contamination Removal

# Screen for host contamination
contamination_results = Mycelia.detect_host_contamination(
    "reads.fastq",
    host_genome="human_genome.fasta",
    min_identity=0.9
)

println("Host contamination: $(contamination_results.contamination_rate)%")

# Remove contaminated reads
clean_reads = Mycelia.remove_host_sequences(
    "reads.fastq",
    contamination_results.contaminated_reads
)

Vector and Adapter Contamination

<!– screenvectorcontamination, detectprimercontamination, removesyntheticsequences not yet implemented as documented –>

Example: Vector Screening

# Screen for vector contamination
vector_results = Mycelia.screen_vector_contamination(
    "reads.fastq",
    vector_database="vector_db.fasta"
)

if vector_results.contamination_found
    println("Vector contamination detected:")
    for hit in vector_results.contaminated_sequences
        println("  $(hit.read_id): $(hit.vector_name)")
    end
end

Quality Metrics and Reporting

Standard Quality Metrics

<!– calculatephredscores, assessbasecallaccuracy, calculateerrorrates, estimatesequencing_quality not yet implemented as documented –>

Example: Quality Metrics Calculation

# Calculate comprehensive quality metrics
quality_metrics = Mycelia.calculate_comprehensive_metrics("reads.fastq")

# Phred score analysis
phred_analysis = Mycelia.calculate_phred_scores(quality_metrics)
println("Mean Phred score: $(phred_analysis.mean_phred)")
println("Q30+ rate: $(phred_analysis.q30_rate)%")

# Error rate estimation
error_rates = Mycelia.calculate_error_rates(quality_metrics)
println("Estimated error rate: $(error_rates.overall_error_rate)")

Quality Control Reports

<!– generatequalityreport, createqualitydashboard, exportqualitymetrics not yet implemented as documented –>

Example: Quality Report Generation

# Generate comprehensive quality report
quality_report = Mycelia.generate_quality_report(
    "reads.fastq",
    output_format="html",
    include_plots=true
)

# Save report
Mycelia.save_quality_report(quality_report, "quality_report.html")

# Export metrics for further analysis
metrics_data = Mycelia.export_quality_metrics(quality_report, format="csv")

Specialized Quality Control

Platform-Specific QC

<!– assesshifiquality, assessnanoporequality, assessilluminaquality not yet implemented as documented –>

Example: HiFi-Specific Quality Control

# HiFi-specific quality assessment
hifi_qc = Mycelia.assess_hifi_quality("hifi_reads.fastq")

println("HiFi Quality Assessment:")
println("  Mean accuracy: $(hifi_qc.mean_accuracy)")
println("  Mean length: $(hifi_qc.mean_length)")
println("  Length N50: $(hifi_qc.length_n50)")
println("  Quality distribution:")
println("    Q20+: $(hifi_qc.q20_percent)%")
println("    Q30+: $(hifi_qc.q30_percent)%")
println("    Q40+: $(hifi_qc.q40_percent)%")

Application-Specific QC

<!– assessassemblyreadiness, assessannotationreadiness, assessvariantcalling_readiness not yet implemented as documented –>

Example: Assembly Readiness Assessment

# Check if reads are suitable for assembly
assembly_readiness = Mycelia.assess_assembly_readiness("reads.fastq")

println("Assembly Readiness:")
println("  Overall score: $(assembly_readiness.overall_score)/10")
println("  Coverage estimate: $(assembly_readiness.estimated_coverage)x")
println("  Quality sufficient: $(assembly_readiness.quality_sufficient)")
println("  Length distribution: $(assembly_readiness.length_distribution_score)")

if !assembly_readiness.ready_for_assembly
    println("Issues found:")
    for issue in assembly_readiness.issues
        println("  - $(issue)")
    end
end

Visualization and Plotting

Quality Plots

<!– plotqualitydistribution, plotlengthdistribution, plotgccontentdistribution, plotbase_composition not yet implemented as individual functions Visualization functions available in Mycelia plotting module –>

Example: Quality Visualization

# Create quality visualization plots
quality_plots = Mycelia.create_quality_plots("reads.fastq")

# Individual plots
Mycelia.plot_quality_distribution(quality_plots.quality_data, 
                         title="Per-Base Quality Scores")

Mycelia.plot_length_distribution(quality_plots.length_data,
                        title="Read Length Distribution")

# Combined quality dashboard
quality_dashboard = Mycelia.plot_quality_dashboard(quality_plots)
Mycelia.save_plot(quality_dashboard, "quality_dashboard.png")

Performance Optimization

Memory-Efficient Processing

<!– streamqualityanalysis, processinchunks, parallelqualityassessment not yet implemented as documented –>

Example: Large File Processing

# Process large files efficiently
large_file_qc = Mycelia.stream_quality_analysis(
    "large_reads.fastq",
    chunk_size=10000,
    parallel=true,
    threads=8
)

# Memory-efficient filtering
filtered_output = Mycelia.process_in_chunks(
    "large_reads.fastq",
    "filtered_reads.fastq",
    chunk_size=50000,
    filter_function=quality_filter_function
)

Common Issues and Solutions

Low Quality Data

# Identify quality issues
quality_issues = Mycelia.diagnose_quality_issues("reads.fastq")

for issue in quality_issues
    println("Issue: $(issue.type)")
    println("  Description: $(issue.description)")
    println("  Severity: $(issue.severity)")
    println("  Recommendation: $(issue.recommendation)")
end

Contamination Problems

# Comprehensive contamination screening
contamination_screen = Mycelia.comprehensive_contamination_screening(
    "reads.fastq",
    databases=["host", "vector", "adapter", "primer"]
)

# Generate contamination report
contamination_report = Mycelia.generate_contamination_report(contamination_screen)

Data Input/Output

Downstream Analysis

Visualization

Previous Steps

Next Steps

  • Sequence Analysis - K-mer analysis of quality-controlled reads
  • Genome Assembly (planned) - Assembly with preprocessed reads

See Also