Data Acquisition & Simulation

Functions for downloading genomic data from public databases and simulating synthetic datasets for testing and benchmarking.

Overview

Data acquisition is the first step in any bioinformatics analysis. Mycelia provides tools for:

  • Downloading reference genomes from NCBI databases
  • Simulating realistic sequencing data for testing
  • Managing data provenance and metadata
  • Handling multiple data formats and sources

Common Workflows

1. Download Reference Genome

# Download a specific genome by accession
genome_file = Mycelia.download_genome_by_accession("NC_001422.1")

# Download complete assembly with annotations
assembly_data = Mycelia.ncbi_genome_download_accession("GCF_000819615.1")

2. Simulate Test Data

# Create synthetic genome
reference = Mycelia.simulate_random_genome(length=100000, gc_content=0.45)

# Generate HiFi reads
reads = Mycelia.simulate_hifi_reads(reference, coverage=30, error_rate=0.001)

3. Batch Download

# Download multiple genomes
accessions = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
genomes = Mycelia.download_genomes_batch(accessions, output_dir="genomes/")

Public Database Downloads

NCBI Genome Downloads

Mycelia.download_genome_by_accessionFunction
download_genome_by_accession(
;
    accession,
    outdir,
    compressed
)

Downloads a genomic sequence from NCBI's nucleotide database by its accession number.

Arguments

  • accession::String: NCBI nucleotide accession number (e.g. "NC_045512")
  • outdir::String: Output directory path. Defaults to current directory
  • compressed::Bool: If true, compresses output file with gzip. Defaults to true

Returns

  • String: Path to the downloaded file (.fna or .fna.gz)
source
Mycelia.ncbi_genome_download_accessionFunction
ncbi_genome_download_accession(
;
    accession,
    outdir,
    outpath,
    include_string
)

Download an accession using NCBI datasets command line tool

the .zip download output to outpath will be unzipped

returns the outfolder

ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"

Downloads and extracts a genome from NCBI using the datasets command line tool.

Arguments

  • accession: NCBI accession number for the genome
  • outdir: Directory where files will be downloaded (defaults to current directory)
  • outpath: Full path for the temporary zip file (defaults to outdir/accession.zip)
  • include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").

Returns

  • Path to the extracted genome data directory

Notes

  • Requires the ncbi-datasets-cli conda package (automatically installed if missing)
  • Downloaded zip file is automatically removed after extraction
  • If output folder already exists, download is skipped
  • Data is extracted to outdir/accession/ncbi_dataset/data/accession
source

<!– downloadgenomesbatch not yet implemented, use downloadgenomesby_ftp –>

Example: Download phiX174 Genome

# Download the classic phiX174 bacteriophage genome
phix_file = Mycelia.download_genome_by_accession("NC_001422.1")

# Verify download
@assert isfile(phix_file)
@assert endswith(phix_file, ".fna.gz")

# Read the genome
genome_record = first(Mycelia.read_fasta(phix_file))
sequence = String(FASTX.sequence(genome_record))
println("Downloaded genome: $(length(sequence)) bp")

Example: Complete Assembly Package

# Download complete E. coli assembly with all associated files
assembly = Mycelia.ncbi_genome_download_accession("GCF_000005825.2")

# Available files
println("Genome: $(assembly.genome)")
println("Proteins: $(assembly.protein)")
println("Annotations: $(assembly.gff3)")
println("CDS: $(assembly.cds)")

SRA Data Downloads

Mycelia.download_sra_dataFunction

Downloads sequencing reads from NCBI's Sequence Read Archive (SRA).

Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.

Arguments

  • srr_identifier: SRA run identifier (e.g., "SRR1234567")
  • outdir: Output directory for downloaded files (default: current directory)

Returns

Named tuple with:

  • srr_id: The SRA identifier
  • outdir: Output directory path
  • files: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)
  • is_paired: Boolean indicating if data is paired-end

Example

# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")

# Apply appropriate QC based on data type
if result.is_paired
    # Paired-end data - use paired-end QC
    Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
    # Single-end data - use single-end QC
    Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
end
source
Mycelia.prefetch_sra_runsFunction

Prefetches multiple SRA runs in parallel.

Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.

Arguments

  • srr_identifiers: Vector of SRA run identifiers
  • outdir: Output directory for prefetched files (default: current directory)
  • max_parallel: Maximum number of parallel downloads (default: 4)

Returns

Vector of named tuples with prefetch results for each SRA run

Example

runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")
source
Mycelia.fasterq_dump_parallelFunction

Parallel FASTQ dump for multiple SRA files.

Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.

Arguments

  • srr_identifiers: Vector of SRA run identifiers
  • outdir: Output directory for FASTQ files (default: current directory)
  • max_parallel: Maximum number of parallel conversions (default: 2)

Returns

Vector of named tuples with conversion results

Example

runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")
source

Example: Download Sequencing Reads

# Download reads from SRA
sra_run = "SRR1234567"
fastq_files = Mycelia.download_sra_data(sra_run, output_dir="reads/")

# Process paired-end reads
if length(fastq_files) == 2
    read1, read2 = fastq_files
    println("Downloaded paired-end reads:")
    println("  R1: $read1")
    println("  R2: $read2")
end

Data Simulation

Genome Simulation

<!– Genome simulation functions not yet implemented –>

Example: Create Test Genome

# Generate random genome with realistic GC content
test_genome = Mycelia.simulate_random_genome(
    length=50000,
    gc_content=0.42,
    seed=123  # for reproducibility
)

# Add realistic features
genome_with_genes = Mycelia.simulate_genome_with_features(
    test_genome,
    n_genes=50,
    gene_length_dist=(500, 2000),
    intergenic_length_dist=(100, 1000)
)

Sequencing Read Simulation

Mycelia.simulate_illumina_paired_readsFunction
simulate_illumina_paired_reads(
;
    in_fasta,
    coverage,
    read_count,
    outbase,
    read_length,
    mflen,
    sdev,
    seqSys,
    amplicon,
    errfree,
    rndSeed
)

Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.

This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.

Arguments

  • in_fasta::String: Path to the input FASTA file.
  • coverage::Union{Nothing,Number}: Desired fold coverage (used with --fcov); if nothing and read_count is provided then fold coverage is ignored. (Default: 20)
  • read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with --rcount instead of fold coverage). (Default: nothing)
  • outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").
  • read_length::Int: Length of reads to simulate (default: 150).
  • mflen::Int: Mean fragment length for paired-end simulations (default: 500).
  • sdev::Int: Standard deviation of fragment lengths (default: 10).
  • seqSys::String: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").
  • paired::Bool: Whether to simulate paired-end reads (default: true).
  • amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).
  • errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).
  • rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).

Outputs

Generates gzipped FASTQ files in the working directory:

  • For paired-end: $(outbase)1.fq.gz (forward) and $(outbase)2.fq.gz (reverse).
  • For single-end: $(outbase)1.fq.gz.

Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.

Details

This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.

Dependencies

Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads

source
Mycelia.simulate_pacbio_readsFunction
simulate_pacbio_reads(; fasta, quantity, outfile)

Simulate PacBio HiFi reads using the Badread error model.

Arguments

  • fasta::String: Path to input FASTA file containing reference sequence
  • quantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000") - NOT TOTAL READS
  • outfile::String: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix

Returns

  • String: Path to the generated output file

Notes

  • Requires Badread tool from Bioconda
  • Uses PacBio 2021 error and quality score models
  • Average read length ~15kb
  • Output is gzipped FASTQ format

See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source
Mycelia.simulate_nanopore_readsFunction
simulate_nanopore_reads(; fasta, quantity, outfile)

Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.

Arguments

  • fasta::String: Path to input reference FASTA file
  • quantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")
  • outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension

Returns

  • String: Path to the generated output FASTQ file

See also: simulate_pacbio_reads, simulate_nearly_perfect_long_reads, simulate_short_reads

source

<!– Note: simulatehifireads maps to simulatepacbioreads addsequencingerrors not yet implemented as standalone function –>

Example: HiFi Read Simulation

# Simulate high-quality HiFi reads
hifi_reads = Mycelia.simulate_hifi_reads(
    reference_genome,
    coverage=25,
    read_length_mean=15000,
    read_length_std=3000,
    error_rate=0.001
)

# Write to FASTQ
Mycelia.write_fastq("hifi_reads.fastq", hifi_reads)

Example: Illumina Read Simulation

# Simulate paired-end Illumina reads
illumina_reads = Mycelia.simulate_illumina_reads(
    reference_genome,
    coverage=50,
    read_length=150,
    fragment_size_mean=300,
    fragment_size_std=50,
    error_rate=0.01
)

# Write paired-end files
Mycelia.write_fastq("illumina_R1.fastq", illumina_reads.read1)
Mycelia.write_fastq("illumina_R2.fastq", illumina_reads.read2)

Data Validation

Download Validation

<!– Download validation functions not yet implemented –>

Example: Validate Downloaded Data

# Check file integrity
integrity_ok = Mycelia.validate_download_integrity("genome.fna.gz")

# Verify file format
format_ok = Mycelia.check_file_format("genome.fna.gz", expected_format="fasta")

# Check genome completeness
completeness = Mycelia.verify_genome_completeness("genome.fna.gz")
println("Genome completeness: $(completeness.complete_sequences)/$(completeness.total_sequences)")

Simulation Validation

<!– Simulation validation functions not yet implemented –>

Example: Validate Simulated Data

# Check simulation statistics
sim_stats = Mycelia.calculate_simulation_statistics(simulated_reads)
println("Simulated reads: $(sim_stats.n_reads)")
println("Mean length: $(sim_stats.mean_length)")
println("Coverage: $(sim_stats.coverage)x")

# Compare with real data characteristics
comparison = Mycelia.compare_simulated_vs_real(simulated_reads, real_reads)
println("Length distribution similarity: $(comparison.length_similarity)")
println("Quality distribution similarity: $(comparison.quality_similarity)")

Metadata Management

Data Provenance

<!– Metadata management functions not yet implemented –>

Example: Track Data Sources

# Create manifest for downloaded data
manifest = Mycelia.create_data_manifest(
    files=["genome.fna.gz", "annotations.gff3"],
    sources=["NCBI:NC_001422.1", "NCBI:GCF_000819615.1"],
    download_date=now(),
    version="1.0"
)

# Save manifest
Mycelia.save_manifest(manifest, "data_manifest.json")

Batch Processing

<!– Batch processing functions not yet implemented Available: Mycelia.downloadgenomesby_ftp for batch downloads –>

Example: Batch Download with Error Handling

# Download multiple genomes with retry logic
accession_list = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]

results = Mycelia.parallel_download(
    accession_list,
    max_retries=3,
    delay_between_retries=30,
    max_concurrent=5
)

# Check results
successful = [r for r in results if r.success]
failed = [r for r in results if !r.success]

println("Downloaded: $(length(successful))/$(length(accession_list))")
if !isempty(failed)
    println("Failed downloads: $(length(failed))")
    for failure in failed
        println("  $(failure.accession): $(failure.error)")
    end
end

Performance Considerations

Memory Usage

  • Genome downloads: Minimal memory usage (streaming)
  • Large simulations: Memory scales with genome size
  • Batch operations: Consider parallel processing limits

Network Considerations

  • NCBI rate limits: Respect API rate limits
  • Retry logic: Implement exponential backoff
  • Parallel downloads: Limit concurrent connections

Storage Requirements

  • Compressed files: Use .gz compression by default
  • Temporary files: Clean up intermediate files
  • Disk space: Monitor available space for large datasets

Common Issues and Solutions

Download Failures

# Handle network timeouts
try
    genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=300)
catch NetworkError
    println("Download failed - retrying with longer timeout")
    genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=600)
end

Simulation Validation

# Validate simulation parameters before large runs
Mycelia.validate_simulation_parameters(
    genome_size=1000000,
    coverage=30,
    read_length=15000
)

File I/O

Quality Control

Next Steps

See Also