Data Acquisition & Simulation

Functions for downloading genomic data from public databases and simulating synthetic datasets for testing and benchmarking.

Overview

Data acquisition is the first step in any bioinformatics analysis. Mycelia provides tools for:

Downloading reference genomes from NCBI databases (implemented)
Simulating realistic sequencing data for testing (implemented)
Managing data provenance and metadata (planned)
Handling multiple data formats and sources (partially implemented)

Implementation Status: Core download and simulation functions are implemented. Advanced features like metadata management, validation, and batch processing helpers are planned.

Common Workflows

1. Download Reference Genome

# Download a specific genome by accession
genome_file = Mycelia.download_genome_by_accession("NC_001422.1")

# Download complete assembly with annotations
assembly_data = Mycelia.ncbi_genome_download_accession("GCF_000819615.1")

2. Simulate Test Data

# Create synthetic genome
reference = Mycelia.simulate_random_genome(length=100000, gc_content=0.45)

# Generate HiFi reads
reads = Mycelia.simulate_hifi_reads(reference, coverage=30, error_rate=0.001)

3. Batch Download (Planned convenience wrapper)

# Download multiple genomes (planned convenience function)
# Currently use: Mycelia.download_genomes_by_ftp() with FTP paths
accessions = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
genomes = Mycelia.download_genomes_batch(accessions, output_dir="genomes/")  # Planned

Public Database Downloads

NCBI Genome Downloads

Mycelia.download_genome_by_accession — Function

download_genome_by_accession(
;
    accession,
    outdir,
    compressed
)

Downloads a genomic sequence from NCBI's nucleotide database by its accession number.

Arguments

accession::String: NCBI nucleotide accession number (e.g. "NC_045512")
outdir::String: Output directory path. Defaults to current directory
compressed::Bool: If true, compresses output file with gzip. Defaults to true

Returns

String: Path to the downloaded file (.fna or .fna.gz)

source

Mycelia.ncbi_genome_download_accession — Function

ncbi_genome_download_accession(
;
    accession,
    outdir,
    outpath,
    include_string,
    max_attempts,
    initial_retry_delay
)

Download an accession using NCBI datasets command line tool

the .zip download output to outpath will be unzipped

returns the outfolder

ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"

Downloads and extracts a genome from NCBI using the datasets command line tool.

Arguments

accession: NCBI accession number for the genome
outdir: Directory where files will be downloaded (defaults to current directory)
outpath: Full path for the temporary zip file (defaults to outdir/accession.zip)
include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").
max_attempts: Maximum number of download retry attempts (default 3)
initial_retry_delay: Initial delay in seconds between retries (default 10.0)

Returns

Path to the extracted genome data directory

Notes

Requires the ncbi-datasets-cli conda package (automatically installed if missing)
Downloaded zip file is automatically removed after extraction
If output folder already exists, download is skipped
Data is extracted to outdir/accession/ncbi_dataset/data/accession
Uses exponential backoff retry logic to handle transient NCBI API failures

TODO: Apply similar retry logic to other NCBI download functions:

- download_genome_by_accession in this file

- load_ncbi_metadata in this file

- taxonomy functions in taxonomy-and-trees.jl that use ncbi-datasets-cli

source

<!– downloadgenomesbatch not yet implemented, use downloadgenomesby_ftp –>

Example: Download phiX174 Genome

# Download the classic phiX174 bacteriophage genome
phix_file = Mycelia.download_genome_by_accession("NC_001422.1")

# Verify download
@assert isfile(phix_file)
@assert endswith(phix_file, ".fna.gz")

# Read the genome
genome_record = first(Mycelia.open_fastx(phix_file))
sequence = String(FASTX.sequence(genome_record))
println("Downloaded genome: $(length(sequence)) bp")

Example: Complete Assembly Package

# Download complete E. coli assembly with all associated files
assembly = Mycelia.ncbi_genome_download_accession("GCF_000005825.2")

# Available files
println("Genome: $(assembly.genome)")
println("Proteins: $(assembly.protein)")
println("Annotations: $(assembly.gff3)")
println("CDS: $(assembly.cds)")

SRA Data Downloads

Mycelia.download_sra_data — Function

Downloads sequencing reads from NCBI's Sequence Read Archive (SRA).

Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.

Arguments

srr_identifier: SRA run identifier (e.g., "SRR1234567")
outdir: Output directory for downloaded files (default: current directory)

Returns

Named tuple with:

srr_id: The SRA identifier
outdir: Output directory path
files: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)
is_paired: Boolean indicating if data is paired-end

Example

# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")

# Apply appropriate QC based on data type
if result.is_paired
    # Paired-end data - use paired-end QC
    Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
    # Single-end data - use single-end QC
    Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
end

source

Mycelia.prefetch_sra_runs — Function

Prefetches multiple SRA runs in parallel.

Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.

Arguments

srr_identifiers: Vector of SRA run identifiers
outdir: Output directory for prefetched files (default: current directory)
max_parallel: Maximum number of parallel downloads (default: 4)

Returns

Vector of named tuples with prefetch results for each SRA run

Example

runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")

source

Mycelia.fasterq_dump_parallel — Function

Parallel FASTQ dump for multiple SRA files.

Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.

Arguments

srr_identifiers: Vector of SRA run identifiers
outdir: Output directory for FASTQ files (default: current directory)
max_parallel: Maximum number of parallel conversions (default: 2)

Returns

Vector of named tuples with conversion results

Example

runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")

source

Example: Download Sequencing Reads

# Download reads from SRA
sra_run = "SRR1234567"
fastq_files = Mycelia.download_sra_data(sra_run, output_dir="reads/")

# Process paired-end reads
if length(fastq_files) == 2
    read1, read2 = fastq_files
    println("Downloaded paired-end reads:")
    println("  R1: $read1")
    println("  R2: $read2")
end

Data Simulation

Genome Simulation (Planned)

NOTE: Genome simulation functions are planned but not yet implemented. For now, create test genomes manually or use external tools.

Example: Create Test Genome (Planned)

# Generate random genome with realistic GC content (planned)
test_genome = Mycelia.simulate_random_genome(
    length=50000,
    gc_content=0.42,
    seed=123  # for reproducibility
)

# Add realistic features (planned)
genome_with_genes = Mycelia.simulate_genome_with_features(
    test_genome,
    n_genes=50,
    gene_length_dist=(500, 2000),
    intergenic_length_dist=(100, 1000)
)

Sequencing Read Simulation

Mycelia.simulate_illumina_reads — Function

simulate_illumina_reads(
;
    fasta,
    coverage,
    read_count,
    outbase,
    read_length,
    mflen,
    sdev,
    seqSys,
    amplicon,
    errfree,
    paired,
    rndSeed,
    quiet
)

Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.

This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.

Arguments

in_fasta::String: Path to the input FASTA file.
coverage::Union{Nothing,Number}: Desired fold coverage (used with --fcov); if nothing and read_count is provided then fold coverage is ignored. (Default: 20)
read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with --rcount instead of fold coverage). (Default: nothing)
outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").
read_length::Int: Length of reads to simulate (default: 150).
mflen::Int: Mean fragment length for paired-end simulations (default: 500).
sdev::Int: Standard deviation of fragment lengths (default: 10).
seqSys::String: Illumina sequencing system ID (default: "HS25"). Available ART profiles:
- "GA1": Genome Analyzer I (max length: 36bp or 44bp)
- "GA2": Genome Analyzer II (max length: 50bp or 75bp)
- "HS10": HiSeq 1000 (max length: 100bp)
- "HS20": HiSeq 2000 (max length: 100bp)
- "HS25": HiSeq 2500 (max length: 125bp or 150bp)
- "HSXn": HiSeqX v2.5 PCR free (max length: 150bp)
- "HSXt": HiSeqX v2.5 TruSeq (max length: 150bp)
- "MSv1": MiSeq v1 (max length: 250bp)
- "MSv3": MiSeq v3 (max length: 250bp)
- "MinS": MiniSeq TruSeq (max length: 50bp)
- "NS50": NextSeq 500 v2 (max length: 75bp)
paired::Bool: Whether to simulate paired-end reads (default: true). Some systems like MinS only support single-end.
amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).
errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).
rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).

Outputs

Generates gzipped FASTQ files in the working directory:

For paired-end: $(outbase)1.fq.gz (forward) and $(outbase)2.fq.gz (reverse).
For single-end: $(outbase)1.fq.gz.

Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.

Details

This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.

Dependencies

Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads

source

Mycelia.simulate_pacbio_reads — Function

simulate_pacbio_reads(; fasta, quantity, outfile, quiet)

Simulate PacBio HiFi reads using optimized Badread settings.

Uses PacBio 2021 error and quality score models with identity settings optimized for HiFi reads. Follows the recommended PacBio HiFi simulation parameters from the Badread documentation.

Arguments

fasta::String: Path to input FASTA file containing reference sequence
quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")
outfile::String: Output filepath for simulated reads. Defaults to input filename with PacBio HiFi suffix
quiet::Bool=false: If true, suppress badread output to stdout/stderr

Returns

String: Path to the generated output file

Notes

Uses pacbio2021 error and quality score models
Sets identity to 30,3 for HiFi-appropriate accuracy
Average read length ~15kb
Output is gzipped FASTQ format

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_badread_reads

source

Mycelia.simulate_nanopore_reads — Function

simulate_nanopore_reads(; fasta, quantity, outfile, quiet)

Simulate Oxford Nanopore R10.4.1 sequencing reads using Badread's default settings.

Badread's default settings correspond to Oxford Nanopore R10.4.1 reads of mediocre quality. Uses nanopore2023 error and quality models with default identity and length distributions.

Arguments

fasta::String: Path to input reference FASTA file
quantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")
outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension

Returns

String: Path to the generated output FASTQ file

See also: simulate_pacbio_reads, simulate_nanopore_r941_reads, simulate_badread_reads

source

<!– Note: simulatehifireads maps to simulatepacbioreads addsequencingerrors not yet implemented as standalone function –>

Example: HiFi Read Simulation

# Simulate high-quality HiFi reads
hifi_reads = Mycelia.simulate_hifi_reads(
    reference_genome,
    coverage=25,
    read_length_mean=15000,
    read_length_std=3000,
    error_rate=0.001
)

# Write to FASTQ
Mycelia.write_fastq("hifi_reads.fastq", hifi_reads)

Example: Illumina Read Simulation

# Simulate paired-end Illumina reads
illumina_reads = Mycelia.simulate_illumina_reads(
    reference_genome,
    coverage=50,
    read_length=150,
    fragment_size_mean=300,
    fragment_size_std=50,
    error_rate=0.01
)

# Write paired-end files
Mycelia.write_fastq("illumina_R1.fastq", illumina_reads.read1)
Mycelia.write_fastq("illumina_R2.fastq", illumina_reads.read2)

Data Validation (Planned)

NOTE: Data validation functions are planned but not yet implemented. Manual validation is recommended.

Download Validation (Planned)

Example: Validate Downloaded Data (Planned)

# Check file integrity (planned)
integrity_ok = Mycelia.validate_download_integrity("genome.fna.gz")

# Verify file format (planned)
format_ok = Mycelia.check_file_format("genome.fna.gz", expected_format="fasta")

# Check genome completeness (planned)
completeness = Mycelia.verify_genome_completeness("genome.fna.gz")
println("Genome completeness: $(completeness.complete_sequences)/$(completeness.total_sequences)")

Simulation Validation (Planned)

Example: Validate Simulated Data (Planned)

# Check simulation statistics (planned)
sim_stats = Mycelia.calculate_simulation_statistics(simulated_reads)
println("Simulated reads: $(sim_stats.n_reads)")
println("Mean length: $(sim_stats.mean_length)")
println("Coverage: $(sim_stats.coverage)x")

# Compare with real data characteristics (planned)
comparison = Mycelia.compare_simulated_vs_real(simulated_reads, real_reads)
println("Length distribution similarity: $(comparison.length_similarity)")
println("Quality distribution similarity: $(comparison.quality_similarity)")

Metadata Management (Planned)

NOTE: Metadata management functions are planned. For batch downloads, use Mycelia.download_genomes_by_ftp().

Data Provenance (Planned)

Example: Track Data Sources (Planned)

# Create manifest for downloaded data (planned)
manifest = Mycelia.create_data_manifest(
    files=["genome.fna.gz", "annotations.gff3"],
    sources=["NCBI:NC_001422.1", "NCBI:GCF_000819615.1"],
    download_date=now(),
    version="1.0"
)

# Save manifest (planned)
Mycelia.save_manifest(manifest, "data_manifest.json")

Batch Processing (Planned convenience wrapper)

NOTE: Batch download with error handling is planned. Currently use download_genomes_by_ftp() for batch downloads.

Example: Batch Download with Error Handling (Planned)

# Download multiple genomes with retry logic (planned)
accession_list = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]

results = Mycelia.parallel_download(
    accession_list,
    max_retries=3,
    delay_between_retries=30,
    max_concurrent=5
)

# Check results
successful = [r for r in results if r.success]
failed = [r for r in results if !r.success]

println("Downloaded: $(length(successful))/$(length(accession_list))")
if !isempty(failed)
    println("Failed downloads: $(length(failed))")
    for failure in failed
        println("  $(failure.accession): $(failure.error)")
    end
end

Performance Considerations

Memory Usage

Genome downloads: Minimal memory usage (streaming)
Large simulations: Memory scales with genome size
Batch operations: Consider parallel processing limits

Network Considerations

NCBI rate limits: Respect API rate limits
Retry logic: Implement exponential backoff
Parallel downloads: Limit concurrent connections

Storage Requirements

Compressed files: Use .gz compression by default
Temporary files: Clean up intermediate files
Disk space: Monitor available space for large datasets

Common Issues and Solutions

Download Failures

# Handle network timeouts
try
    genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=300)
catch NetworkError
    println("Download failed - retrying with longer timeout")
    genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=600)
end

Simulation Validation

# Validate simulation parameters before large runs
Mycelia.validate_simulation_parameters(
    genome_size=1000000,
    coverage=30,
    read_length=15000
)

File I/O

Mycelia.open_fastx - Read FASTA/FASTQ files
Mycelia.write_fastq - Write FASTQ files
Mycelia.write_fasta - Write FASTA files

Quality Control

Mycelia.analyze_fastq_quality - Assess downloaded read quality
Mycelia.calculate_gc_content - Calculate GC content

Next Steps

Quality Control - Assess data quality
Sequence Analysis - Analyze sequence composition