Data Acquisition & Simulation
Functions for downloading genomic data from public databases and simulating synthetic datasets for testing and benchmarking.
Overview
Data acquisition is the first step in any bioinformatics analysis. Mycelia provides tools for:
- Downloading reference genomes from NCBI databases (implemented)
- Simulating realistic sequencing data for testing (implemented)
- Managing data provenance and metadata (planned)
- Handling multiple data formats and sources (partially implemented)
Implementation Status: Core download and simulation functions are implemented. Advanced features like metadata management, validation, and batch processing helpers are planned.
Common Workflows
1. Download Reference Genome
# Download a specific genome by accession
genome_file = Mycelia.download_genome_by_accession("NC_001422.1")
# Download complete assembly with annotations
assembly_data = Mycelia.ncbi_genome_download_accession("GCF_000819615.1")2. Simulate Test Data
# Create synthetic genome
reference = Mycelia.simulate_random_genome(length=100000, gc_content=0.45)
# Generate HiFi reads
reads = Mycelia.simulate_hifi_reads(reference, coverage=30, error_rate=0.001)3. Batch Download (Planned convenience wrapper)
# Download multiple genomes (planned convenience function)
# Currently use: Mycelia.download_genomes_by_ftp() with FTP paths
accessions = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
genomes = Mycelia.download_genomes_batch(accessions, output_dir="genomes/") # PlannedPublic Database Downloads
NCBI Genome Downloads
Mycelia.download_genome_by_accession — Functiondownload_genome_by_accession(
;
accession,
outdir,
compressed
)
Downloads a genomic sequence from NCBI's nucleotide database by its accession number.
Arguments
accession::String: NCBI nucleotide accession number (e.g. "NC_045512")outdir::String: Output directory path. Defaults to current directorycompressed::Bool: If true, compresses output file with gzip. Defaults to true
Returns
String: Path to the downloaded file (.fna or .fna.gz)
Mycelia.ncbi_genome_download_accession — Functionncbi_genome_download_accession(
;
accession,
outdir,
outpath,
include_string
)
Download an accession using NCBI datasets command line tool
the .zip download output to outpath will be unzipped
returns the outfolder
ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"
Downloads and extracts a genome from NCBI using the datasets command line tool.
Arguments
accession: NCBI accession number for the genomeoutdir: Directory where files will be downloaded (defaults to current directory)outpath: Full path for the temporary zip file (defaults tooutdir/accession.zip)include_string: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").
Returns
- Path to the extracted genome data directory
Notes
- Requires the ncbi-datasets-cli conda package (automatically installed if missing)
- Downloaded zip file is automatically removed after extraction
- If output folder already exists, download is skipped
- Data is extracted to
outdir/accession/ncbi_dataset/data/accession
<!– downloadgenomesbatch not yet implemented, use downloadgenomesby_ftp –>
Example: Download phiX174 Genome
# Download the classic phiX174 bacteriophage genome
phix_file = Mycelia.download_genome_by_accession("NC_001422.1")
# Verify download
@assert isfile(phix_file)
@assert endswith(phix_file, ".fna.gz")
# Read the genome
genome_record = first(Mycelia.open_fastx(phix_file))
sequence = String(FASTX.sequence(genome_record))
println("Downloaded genome: $(length(sequence)) bp")Example: Complete Assembly Package
# Download complete E. coli assembly with all associated files
assembly = Mycelia.ncbi_genome_download_accession("GCF_000005825.2")
# Available files
println("Genome: $(assembly.genome)")
println("Proteins: $(assembly.protein)")
println("Annotations: $(assembly.gff3)")
println("CDS: $(assembly.cds)")SRA Data Downloads
Mycelia.download_sra_data — FunctionDownloads sequencing reads from NCBI's Sequence Read Archive (SRA).
Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.
Arguments
srr_identifier: SRA run identifier (e.g., "SRR1234567")outdir: Output directory for downloaded files (default: current directory)
Returns
Named tuple with:
srr_id: The SRA identifieroutdir: Output directory pathfiles: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)is_paired: Boolean indicating if data is paired-end
Example
# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")
# Apply appropriate QC based on data type
if result.is_paired
# Paired-end data - use paired-end QC
Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
# Single-end data - use single-end QC
Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
endMycelia.prefetch_sra_runs — FunctionPrefetches multiple SRA runs in parallel.
Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.
Arguments
srr_identifiers: Vector of SRA run identifiersoutdir: Output directory for prefetched files (default: current directory)max_parallel: Maximum number of parallel downloads (default: 4)
Returns
Vector of named tuples with prefetch results for each SRA run
Example
runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")Mycelia.fasterq_dump_parallel — FunctionParallel FASTQ dump for multiple SRA files.
Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.
Arguments
srr_identifiers: Vector of SRA run identifiersoutdir: Output directory for FASTQ files (default: current directory)max_parallel: Maximum number of parallel conversions (default: 2)
Returns
Vector of named tuples with conversion results
Example
runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")Example: Download Sequencing Reads
# Download reads from SRA
sra_run = "SRR1234567"
fastq_files = Mycelia.download_sra_data(sra_run, output_dir="reads/")
# Process paired-end reads
if length(fastq_files) == 2
read1, read2 = fastq_files
println("Downloaded paired-end reads:")
println(" R1: $read1")
println(" R2: $read2")
endData Simulation
Genome Simulation (Planned)
NOTE: Genome simulation functions are planned but not yet implemented. For now, create test genomes manually or use external tools.
Example: Create Test Genome (Planned)
# Generate random genome with realistic GC content (planned)
test_genome = Mycelia.simulate_random_genome(
length=50000,
gc_content=0.42,
seed=123 # for reproducibility
)
# Add realistic features (planned)
genome_with_genes = Mycelia.simulate_genome_with_features(
test_genome,
n_genes=50,
gene_length_dist=(500, 2000),
intergenic_length_dist=(100, 1000)
)Sequencing Read Simulation
Mycelia.simulate_illumina_reads — Functionsimulate_illumina_reads(
;
fasta,
coverage,
read_count,
outbase,
read_length,
mflen,
sdev,
seqSys,
amplicon,
errfree,
paired,
rndSeed,
quiet
)
Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.
This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov) or an absolute read count (--rcount), to enable amplicon mode, and to optionally generate a zero-error SAM file.
Arguments
in_fasta::String: Path to the input FASTA file.coverage::Union{Nothing,Number}: Desired fold coverage (used with--fcov); ifnothingandread_countis provided then fold coverage is ignored. (Default: 20)read_count::Union{Nothing,Number}: Total number of reads (or read pairs) to generate (used with--rcountinstead of fold coverage). (Default:nothing)outbase::String: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").read_length::Int: Length of reads to simulate (default: 150).mflen::Int: Mean fragment length for paired-end simulations (default: 500).sdev::Int: Standard deviation of fragment lengths (default: 10).seqSys::String: Illumina sequencing system ID (default: "HS25"). Available ART profiles:"GA1": Genome Analyzer I (max length: 36bp or 44bp)"GA2": Genome Analyzer II (max length: 50bp or 75bp)"HS10": HiSeq 1000 (max length: 100bp)"HS20": HiSeq 2000 (max length: 100bp)"HS25": HiSeq 2500 (max length: 125bp or 150bp)"HSXn": HiSeqX v2.5 PCR free (max length: 150bp)"HSXt": HiSeqX v2.5 TruSeq (max length: 150bp)"MSv1": MiSeq v1 (max length: 250bp)"MSv3": MiSeq v3 (max length: 250bp)"MinS": MiniSeq TruSeq (max length: 50bp)"NS50": NextSeq 500 v2 (max length: 75bp)
paired::Bool: Whether to simulate paired-end reads (default: true). Some systems like MinS only support single-end.amplicon::Bool: Enable amplicon sequencing simulation mode (default: false).errfree::Bool: Generate an extra SAM file with zero sequencing errors (default: false).rndSeed::Union{Nothing,Int}: Optional seed for reproducibility (default: nothing).
Outputs
Generates gzipped FASTQ files in the working directory:
- For paired-end:
$(outbase)1.fq.gz(forward) and$(outbase)2.fq.gz(reverse). - For single-end:
$(outbase)1.fq.gz.
Additional SAM files may be produced if --errfree is enabled and/or if the ART --samout option is specified.
Details
This function calls ART with the provided options. Note that if read_count is supplied, the function uses the --rcount option; otherwise, it uses --fcov with the given coverage. Amplicon mode (via --amplicon) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.
Dependencies
Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.
See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_pacbio_reads
Mycelia.simulate_pacbio_reads — Functionsimulate_pacbio_reads(; fasta, quantity, outfile, quiet)
Simulate PacBio HiFi reads using optimized Badread settings.
Uses PacBio 2021 error and quality score models with identity settings optimized for HiFi reads. Follows the recommended PacBio HiFi simulation parameters from the Badread documentation.
Arguments
fasta::String: Path to input FASTA file containing reference sequencequantity::String: Coverage depth (e.g. "50x") or total bases (e.g. "1000000")outfile::String: Output filepath for simulated reads. Defaults to input filename with PacBio HiFi suffixquiet::Bool=false: If true, suppress badread output to stdout/stderr
Returns
String: Path to the generated output file
Notes
- Uses pacbio2021 error and quality score models
- Sets identity to 30,3 for HiFi-appropriate accuracy
- Average read length ~15kb
- Output is gzipped FASTQ format
See also: simulate_nanopore_reads, simulate_nearly_perfect_long_reads, simulate_badread_reads
Mycelia.simulate_nanopore_reads — Functionsimulate_nanopore_reads(; fasta, quantity, outfile, quiet)
Simulate Oxford Nanopore R10.4.1 sequencing reads using Badread's default settings.
Badread's default settings correspond to Oxford Nanopore R10.4.1 reads of mediocre quality. Uses nanopore2023 error and quality models with default identity and length distributions.
Arguments
fasta::String: Path to input reference FASTA filequantity::String: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")outfile::String: Output path for gzipped FASTQ file. Defaults to input filename with modified extension
Returns
String: Path to the generated output FASTQ file
See also: simulate_pacbio_reads, simulate_nanopore_r941_reads, simulate_badread_reads
<!– Note: simulatehifireads maps to simulatepacbioreads addsequencingerrors not yet implemented as standalone function –>
Example: HiFi Read Simulation
# Simulate high-quality HiFi reads
hifi_reads = Mycelia.simulate_hifi_reads(
reference_genome,
coverage=25,
read_length_mean=15000,
read_length_std=3000,
error_rate=0.001
)
# Write to FASTQ
Mycelia.write_fastq("hifi_reads.fastq", hifi_reads)Example: Illumina Read Simulation
# Simulate paired-end Illumina reads
illumina_reads = Mycelia.simulate_illumina_reads(
reference_genome,
coverage=50,
read_length=150,
fragment_size_mean=300,
fragment_size_std=50,
error_rate=0.01
)
# Write paired-end files
Mycelia.write_fastq("illumina_R1.fastq", illumina_reads.read1)
Mycelia.write_fastq("illumina_R2.fastq", illumina_reads.read2)Data Validation (Planned)
NOTE: Data validation functions are planned but not yet implemented. Manual validation is recommended.
Download Validation (Planned)
Example: Validate Downloaded Data (Planned)
# Check file integrity (planned)
integrity_ok = Mycelia.validate_download_integrity("genome.fna.gz")
# Verify file format (planned)
format_ok = Mycelia.check_file_format("genome.fna.gz", expected_format="fasta")
# Check genome completeness (planned)
completeness = Mycelia.verify_genome_completeness("genome.fna.gz")
println("Genome completeness: $(completeness.complete_sequences)/$(completeness.total_sequences)")Simulation Validation (Planned)
Example: Validate Simulated Data (Planned)
# Check simulation statistics (planned)
sim_stats = Mycelia.calculate_simulation_statistics(simulated_reads)
println("Simulated reads: $(sim_stats.n_reads)")
println("Mean length: $(sim_stats.mean_length)")
println("Coverage: $(sim_stats.coverage)x")
# Compare with real data characteristics (planned)
comparison = Mycelia.compare_simulated_vs_real(simulated_reads, real_reads)
println("Length distribution similarity: $(comparison.length_similarity)")
println("Quality distribution similarity: $(comparison.quality_similarity)")Metadata Management (Planned)
NOTE: Metadata management functions are planned. For batch downloads, use
Mycelia.download_genomes_by_ftp().
Data Provenance (Planned)
Example: Track Data Sources (Planned)
# Create manifest for downloaded data (planned)
manifest = Mycelia.create_data_manifest(
files=["genome.fna.gz", "annotations.gff3"],
sources=["NCBI:NC_001422.1", "NCBI:GCF_000819615.1"],
download_date=now(),
version="1.0"
)
# Save manifest (planned)
Mycelia.save_manifest(manifest, "data_manifest.json")Batch Processing (Planned convenience wrapper)
NOTE: Batch download with error handling is planned. Currently use
download_genomes_by_ftp()for batch downloads.
Example: Batch Download with Error Handling (Planned)
# Download multiple genomes with retry logic (planned)
accession_list = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
results = Mycelia.parallel_download(
accession_list,
max_retries=3,
delay_between_retries=30,
max_concurrent=5
)
# Check results
successful = [r for r in results if r.success]
failed = [r for r in results if !r.success]
println("Downloaded: $(length(successful))/$(length(accession_list))")
if !isempty(failed)
println("Failed downloads: $(length(failed))")
for failure in failed
println(" $(failure.accession): $(failure.error)")
end
endPerformance Considerations
Memory Usage
- Genome downloads: Minimal memory usage (streaming)
- Large simulations: Memory scales with genome size
- Batch operations: Consider parallel processing limits
Network Considerations
- NCBI rate limits: Respect API rate limits
- Retry logic: Implement exponential backoff
- Parallel downloads: Limit concurrent connections
Storage Requirements
- Compressed files: Use .gz compression by default
- Temporary files: Clean up intermediate files
- Disk space: Monitor available space for large datasets
Common Issues and Solutions
Download Failures
# Handle network timeouts
try
genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=300)
catch NetworkError
println("Download failed - retrying with longer timeout")
genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=600)
endSimulation Validation
# Validate simulation parameters before large runs
Mycelia.validate_simulation_parameters(
genome_size=1000000,
coverage=30,
read_length=15000
)Related Functions
File I/O
Mycelia.open_fastx- Read FASTA/FASTQ filesMycelia.write_fastq- Write FASTQ filesMycelia.write_fasta- Write FASTA files
Quality Control
Mycelia.analyze_fastq_quality- Assess downloaded read qualityMycelia.calculate_gc_content- Calculate GC content
Next Steps
- Quality Control - Assess data quality
- Sequence Analysis - Analyze sequence composition
See Also
- Tutorial 1: Data Acquisition
- FASTA/FASTQ Data Types (planned)
- Basic Workflows