Data Acquisition & Simulation
Functions for downloading genomic data from public databases and simulating synthetic datasets for testing and benchmarking.
Overview
Data acquisition is the first step in any bioinformatics analysis. Mycelia provides tools for:
- Downloading reference genomes from NCBI databases
- Simulating realistic sequencing data for testing
- Managing data provenance and metadata
- Handling multiple data formats and sources
Common Workflows
1. Download Reference Genome
# Download a specific genome by accession
genome_file = Mycelia.download_genome_by_accession("NC_001422.1")
# Download complete assembly with annotations
assembly_data = Mycelia.ncbi_genome_download_accession("GCF_000819615.1")
2. Simulate Test Data
# Create synthetic genome
reference = Mycelia.simulate_random_genome(length=100000, gc_content=0.45)
# Generate HiFi reads
reads = Mycelia.simulate_hifi_reads(reference, coverage=30, error_rate=0.001)
3. Batch Download
# Download multiple genomes
accessions = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
genomes = Mycelia.download_genomes_batch(accessions, output_dir="genomes/")
Public Database Downloads
NCBI Genome Downloads
Mycelia.download_genome_by_accession
— Functiondownload_genome_by_accession(
;
accession,
outdir,
compressed
)
Downloads a genomic sequence from NCBI's nucleotide database by its accession number.
Arguments
accession::String
: NCBI nucleotide accession number (e.g. "NC_045512")outdir::String
: Output directory path. Defaults to current directorycompressed::Bool
: If true, compresses output file with gzip. Defaults to true
Returns
String
: Path to the downloaded file (.fna or .fna.gz)
Mycelia.ncbi_genome_download_accession
— Functionncbi_genome_download_accession(
;
accession,
outdir,
outpath,
include_string
)
Download an accession using NCBI datasets command line tool
the .zip download output to outpath will be unzipped
returns the outfolder
ncbi's default include string is include_string = "gff3,rna,cds,protein,genome,seq-report"
Downloads and extracts a genome from NCBI using the datasets command line tool.
Arguments
accession
: NCBI accession number for the genomeoutdir
: Directory where files will be downloaded (defaults to current directory)outpath
: Full path for the temporary zip file (defaults tooutdir/accession.zip
)include_string
: Data types to download (defaults to all "gff3,rna,cds,protein,genome,seq-report").
Returns
- Path to the extracted genome data directory
Notes
- Requires the ncbi-datasets-cli conda package (automatically installed if missing)
- Downloaded zip file is automatically removed after extraction
- If output folder already exists, download is skipped
- Data is extracted to
outdir/accession/ncbi_dataset/data/accession
<!– downloadgenomesbatch not yet implemented, use downloadgenomesby_ftp –>
Example: Download phiX174 Genome
# Download the classic phiX174 bacteriophage genome
phix_file = Mycelia.download_genome_by_accession("NC_001422.1")
# Verify download
@assert isfile(phix_file)
@assert endswith(phix_file, ".fna.gz")
# Read the genome
genome_record = first(Mycelia.read_fasta(phix_file))
sequence = String(FASTX.sequence(genome_record))
println("Downloaded genome: $(length(sequence)) bp")
Example: Complete Assembly Package
# Download complete E. coli assembly with all associated files
assembly = Mycelia.ncbi_genome_download_accession("GCF_000005825.2")
# Available files
println("Genome: $(assembly.genome)")
println("Proteins: $(assembly.protein)")
println("Annotations: $(assembly.gff3)")
println("CDS: $(assembly.cds)")
SRA Data Downloads
Mycelia.download_sra_data
— FunctionDownloads sequencing reads from NCBI's Sequence Read Archive (SRA).
Downloads reads using fasterq-dump. The function automatically detects whether the data is single-end or paired-end and returns appropriate file paths. Users should apply quality control based on their knowledge of the data type.
Arguments
srr_identifier
: SRA run identifier (e.g., "SRR1234567")outdir
: Output directory for downloaded files (default: current directory)
Returns
Named tuple with:
srr_id
: The SRA identifieroutdir
: Output directory pathfiles
: Vector of downloaded file paths (1 file for single-end, 2 for paired-end)is_paired
: Boolean indicating if data is paired-end
Example
# Download SRA data
result = Mycelia.download_sra_data("SRR1234567", outdir="./data")
# Apply appropriate QC based on data type
if result.is_paired
# Paired-end data - use paired-end QC
Mycelia.trim_galore_paired(forward_reads=result.files[1], reverse_reads=result.files[2])
else
# Single-end data - use single-end QC
Mycelia.qc_filter_short_reads_fastp(input=result.files[1])
end
Mycelia.prefetch_sra_runs
— FunctionPrefetches multiple SRA runs in parallel.
Downloads SRA run files (.sra) to local storage without converting to FASTQ. Useful for batch downloading before processing with fasterq-dump.
Arguments
srr_identifiers
: Vector of SRA run identifiersoutdir
: Output directory for prefetched files (default: current directory)max_parallel
: Maximum number of parallel downloads (default: 4)
Returns
Vector of named tuples with prefetch results for each SRA run
Example
runs = ["SRR1234567", "SRR1234568", "SRR1234569"]
results = Mycelia.prefetch_sra_runs(runs, outdir="./sra_data")
Mycelia.fasterq_dump_parallel
— FunctionParallel FASTQ dump for multiple SRA files.
Converts multiple SRA files to FASTQ format in parallel. More efficient than sequential processing for large batches.
Arguments
srr_identifiers
: Vector of SRA run identifiersoutdir
: Output directory for FASTQ files (default: current directory)max_parallel
: Maximum number of parallel conversions (default: 2)
Returns
Vector of named tuples with conversion results
Example
runs = ["SRR1234567", "SRR1234568"]
results = Mycelia.fasterq_dump_parallel(runs, outdir="./fastq_data")
Example: Download Sequencing Reads
# Download reads from SRA
sra_run = "SRR1234567"
fastq_files = Mycelia.download_sra_data(sra_run, output_dir="reads/")
# Process paired-end reads
if length(fastq_files) == 2
read1, read2 = fastq_files
println("Downloaded paired-end reads:")
println(" R1: $read1")
println(" R2: $read2")
end
Data Simulation
Genome Simulation
<!– Genome simulation functions not yet implemented –>
Example: Create Test Genome
# Generate random genome with realistic GC content
test_genome = Mycelia.simulate_random_genome(
length=50000,
gc_content=0.42,
seed=123 # for reproducibility
)
# Add realistic features
genome_with_genes = Mycelia.simulate_genome_with_features(
test_genome,
n_genes=50,
gene_length_dist=(500, 2000),
intergenic_length_dist=(100, 1000)
)
Sequencing Read Simulation
Mycelia.simulate_illumina_paired_reads
— Functionsimulate_illumina_paired_reads(
;
in_fasta,
coverage,
read_count,
outbase,
read_length,
mflen,
sdev,
seqSys,
amplicon,
errfree,
rndSeed
)
Simulate Illumina short reads from a FASTA file using the ART Illumina simulator.
This function wraps ART (installed via Bioconda) to simulate reads from an input reference FASTA. It supports paired-end (or optionally single-end/mate-pair) simulation, with options to choose either fold coverage (--fcov
) or an absolute read count (--rcount
), to enable amplicon mode, and to optionally generate a zero-error SAM file.
Arguments
in_fasta::String
: Path to the input FASTA file.coverage::Union{Nothing,Number}
: Desired fold coverage (used with--fcov
); ifnothing
andread_count
is provided then fold coverage is ignored. (Default: 20)read_count::Union{Nothing,Number}
: Total number of reads (or read pairs) to generate (used with--rcount
instead of fold coverage). (Default:nothing
)outbase::String
: Output file prefix (default: "$(in_fasta).art.$(coverage)x.").read_length::Int
: Length of reads to simulate (default: 150).mflen::Int
: Mean fragment length for paired-end simulations (default: 500).sdev::Int
: Standard deviation of fragment lengths (default: 10).seqSys::String
: Illumina sequencing system ID (e.g. "HS25" for HiSeq 2500) (default: "HS25").paired::Bool
: Whether to simulate paired-end reads (default: true).amplicon::Bool
: Enable amplicon sequencing simulation mode (default: false).errfree::Bool
: Generate an extra SAM file with zero sequencing errors (default: false).rndSeed::Union{Nothing,Int}
: Optional seed for reproducibility (default: nothing).
Outputs
Generates gzipped FASTQ files in the working directory:
- For paired-end:
$(outbase)1.fq.gz
(forward) and$(outbase)2.fq.gz
(reverse). - For single-end:
$(outbase)1.fq.gz
.
Additional SAM files may be produced if --errfree
is enabled and/or if the ART --samout
option is specified.
Details
This function calls ART with the provided options. Note that if read_count
is supplied, the function uses the --rcount
option; otherwise, it uses --fcov
with the given coverage. Amplicon mode (via --amplicon
) restricts the simulation to the amplicon regions, which is important for targeted sequencing studies.
Dependencies
Requires ART simulator (installed via Bioconda) and the Mycelia environment helper.
See also: simulate_nanopore_reads
, simulate_nearly_perfect_long_reads
, simulate_pacbio_reads
Mycelia.simulate_pacbio_reads
— Functionsimulate_pacbio_reads(; fasta, quantity, outfile)
Simulate PacBio HiFi reads using the Badread error model.
Arguments
fasta::String
: Path to input FASTA file containing reference sequencequantity::String
: Coverage depth (e.g. "50x") or total bases (e.g. "1000000") - NOT TOTAL READSoutfile::String
: Output filepath for simulated reads. Defaults to input filename with ".badread.pacbio2021.{quantity}.fq.gz" suffix
Returns
String
: Path to the generated output file
Notes
- Requires Badread tool from Bioconda
- Uses PacBio 2021 error and quality score models
- Average read length ~15kb
- Output is gzipped FASTQ format
See also: simulate_nanopore_reads
, simulate_nearly_perfect_long_reads
, simulate_short_reads
Mycelia.simulate_nanopore_reads
— Functionsimulate_nanopore_reads(; fasta, quantity, outfile)
Simulate Oxford Nanopore sequencing reads using the Badread tool with 2023 error models.
Arguments
fasta::String
: Path to input reference FASTA filequantity::String
: Either fold coverage (e.g. "50x") or total bases to sequence (e.g. "1000000")outfile::String
: Output path for gzipped FASTQ file. Defaults to input filename with modified extension
Returns
String
: Path to the generated output FASTQ file
See also: simulate_pacbio_reads
, simulate_nearly_perfect_long_reads
, simulate_short_reads
<!– Note: simulatehifireads maps to simulatepacbioreads addsequencingerrors not yet implemented as standalone function –>
Example: HiFi Read Simulation
# Simulate high-quality HiFi reads
hifi_reads = Mycelia.simulate_hifi_reads(
reference_genome,
coverage=25,
read_length_mean=15000,
read_length_std=3000,
error_rate=0.001
)
# Write to FASTQ
Mycelia.write_fastq("hifi_reads.fastq", hifi_reads)
Example: Illumina Read Simulation
# Simulate paired-end Illumina reads
illumina_reads = Mycelia.simulate_illumina_reads(
reference_genome,
coverage=50,
read_length=150,
fragment_size_mean=300,
fragment_size_std=50,
error_rate=0.01
)
# Write paired-end files
Mycelia.write_fastq("illumina_R1.fastq", illumina_reads.read1)
Mycelia.write_fastq("illumina_R2.fastq", illumina_reads.read2)
Data Validation
Download Validation
<!– Download validation functions not yet implemented –>
Example: Validate Downloaded Data
# Check file integrity
integrity_ok = Mycelia.validate_download_integrity("genome.fna.gz")
# Verify file format
format_ok = Mycelia.check_file_format("genome.fna.gz", expected_format="fasta")
# Check genome completeness
completeness = Mycelia.verify_genome_completeness("genome.fna.gz")
println("Genome completeness: $(completeness.complete_sequences)/$(completeness.total_sequences)")
Simulation Validation
<!– Simulation validation functions not yet implemented –>
Example: Validate Simulated Data
# Check simulation statistics
sim_stats = Mycelia.calculate_simulation_statistics(simulated_reads)
println("Simulated reads: $(sim_stats.n_reads)")
println("Mean length: $(sim_stats.mean_length)")
println("Coverage: $(sim_stats.coverage)x")
# Compare with real data characteristics
comparison = Mycelia.compare_simulated_vs_real(simulated_reads, real_reads)
println("Length distribution similarity: $(comparison.length_similarity)")
println("Quality distribution similarity: $(comparison.quality_similarity)")
Metadata Management
Data Provenance
<!– Metadata management functions not yet implemented –>
Example: Track Data Sources
# Create manifest for downloaded data
manifest = Mycelia.create_data_manifest(
files=["genome.fna.gz", "annotations.gff3"],
sources=["NCBI:NC_001422.1", "NCBI:GCF_000819615.1"],
download_date=now(),
version="1.0"
)
# Save manifest
Mycelia.save_manifest(manifest, "data_manifest.json")
Batch Processing
<!– Batch processing functions not yet implemented Available: Mycelia.downloadgenomesby_ftp for batch downloads –>
Example: Batch Download with Error Handling
# Download multiple genomes with retry logic
accession_list = ["GCF_000005825.2", "GCF_000009605.1", "GCF_000027325.1"]
results = Mycelia.parallel_download(
accession_list,
max_retries=3,
delay_between_retries=30,
max_concurrent=5
)
# Check results
successful = [r for r in results if r.success]
failed = [r for r in results if !r.success]
println("Downloaded: $(length(successful))/$(length(accession_list))")
if !isempty(failed)
println("Failed downloads: $(length(failed))")
for failure in failed
println(" $(failure.accession): $(failure.error)")
end
end
Performance Considerations
Memory Usage
- Genome downloads: Minimal memory usage (streaming)
- Large simulations: Memory scales with genome size
- Batch operations: Consider parallel processing limits
Network Considerations
- NCBI rate limits: Respect API rate limits
- Retry logic: Implement exponential backoff
- Parallel downloads: Limit concurrent connections
Storage Requirements
- Compressed files: Use .gz compression by default
- Temporary files: Clean up intermediate files
- Disk space: Monitor available space for large datasets
Common Issues and Solutions
Download Failures
# Handle network timeouts
try
genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=300)
catch NetworkError
println("Download failed - retrying with longer timeout")
genome = Mycelia.download_genome_by_accession("NC_001422.1", timeout=600)
end
Simulation Validation
# Validate simulation parameters before large runs
Mycelia.validate_simulation_parameters(
genome_size=1000000,
coverage=30,
read_length=15000
)
Related Functions
File I/O
read_fasta
- Read FASTA fileswrite_fastq
- Write FASTQ filescompress_file
- Compress downloaded files
Quality Control
analyze_fastq_quality
- Assess downloaded read qualitycalculate_genome_stats
- Analyze genome characteristics
Next Steps
- Quality Control - Assess data quality
- Sequence Analysis - Analyze sequence composition
See Also
- Tutorial 1: Data Acquisition
- FASTA/FASTQ Data Types (planned)
- Basic Workflows