Bioinformatics Concepts and Tools
This guide explains the key bioinformatics concepts implemented in Mycelia and provides guidance on when to use each tool. It's designed for graduate-level researchers who need to understand both the biological context and computational approaches.
Table of Contents
- Workflow Overview
- Data Acquisition
- Quality Control
- Sequence Analysis
- Genome Assembly
- Gene Annotation
- Comparative Genomics
- Phylogenetics
- Tool Selection Guide
Workflow Overview
Bioinformatics analyses typically follow a structured workflow. Understanding this flow helps you choose the right tools and interpret results correctly.
Standard Genomic Analysis Pipeline
Raw Data → Quality Control → Feature Extraction → Assembly → Validation → Annotation → Comparative Analysis
Decision Points
At each stage, you face key decisions:
- Data Quality: Is my data sufficient for analysis?
- Method Selection: Which algorithm fits my data type and research question?
- Parameter Tuning: How do I optimize for my specific dataset?
- Result Interpretation: What do the numbers mean biologically?
Data Acquisition
Concept: Data Sources and Types
Biological Context: Genomic data comes from various sources, each with different characteristics and limitations.
Data Types
Sequencing Data
- Illumina: Short reads (150-300 bp), high accuracy, paired-end
- PacBio/HiFi: Long reads (10-25 kb), high accuracy, single-molecule
- Oxford Nanopore: Ultra-long reads (50+ kb), moderate accuracy, real-time
Reference Data
- RefSeq: Manually curated, high-quality reference sequences
- GenBank: Community-submitted sequences, variable quality
- SRA: Raw sequencing data from published studies
When to Use Each
Data Type | Use Case | Advantages | Limitations |
---|---|---|---|
Illumina | SNP calling, RNA-seq, metagenomics | High accuracy, cost-effective | Short reads, repetitive regions |
PacBio HiFi | Genome assembly, structural variants | Long + accurate, span repeats | Higher cost, lower throughput |
Nanopore | Real-time analysis, ultra-long reads | Longest reads, portable | Higher error rates |
Mycelia Functions
# Download reference genomes
Mycelia.download_genome_by_accession(accession="NC_001422.1")
# Download complete assemblies
ncbi_genome_download_accession(accession="GCF_000819615.1")
# Simulate test data
simulate_random_genome(length=10000, gc_content=0.45)
simulate_hifi_reads(genome, coverage=20, error_rate=0.001)
Quality Control
Concept: Data Quality Assessment
Biological Context: Sequencing is not perfect. Raw data contains errors, biases, and artifacts that must be identified and corrected before analysis.
Quality Metrics
Sequence Quality
- Phred Scores: Probability of base-calling error (Q20 = 1% error, Q30 = 0.1% error)
- GC Content: Should match expected organism profile
- Length Distribution: Indicates fragmentation or size selection
Coverage Metrics
- Read Depth: Number of reads covering each position
- Uniformity: Even coverage across genome
- Duplication Rate: PCR duplicates inflate coverage
Quality Issues
Common Problems
- Adapter Contamination: Sequencing adapters in reads
- Low Quality Ends: Error-prone read ends
- Overrepresented Sequences: PCR bias or contamination
- Length Bias: Preferential sequencing of certain sizes
Solutions
- Trimming: Remove low-quality bases and adapters
- Filtering: Discard low-quality reads
- Normalization: Adjust for coverage bias
- Deduplication: Remove PCR duplicates
When to Apply QC
Analysis Type | QC Priority | Key Metrics |
---|---|---|
Genome Assembly | Critical | Coverage uniformity, read length |
Variant Calling | High | Base quality, mapping quality |
RNA-seq | High | rRNA contamination, 3' bias |
Metagenomics | Medium | Host contamination, diversity |
Mycelia Functions
# Analyze read quality
analyze_fastq_quality("reads.fastq")
# Calculate sequence statistics
calculate_sequence_stats(sequences)
# Quality visualization
plot_quality_distribution(quality_scores)
Sequence Analysis
Concept: K-mer Analysis
Biological Context: K-mers are subsequences of length k. They capture local sequence composition and are fundamental to many bioinformatics algorithms.
K-mer Theory
Mathematical Foundation
- K-mer Space: 4^k possible DNA k-mers
- Frequency Spectrum: Distribution of k-mer frequencies
- Coverage Estimation: Genome size estimation from k-mer frequencies
Biological Interpretation
- Repetitive Elements: High-frequency k-mers indicate repeats
- Sequencing Errors: Low-frequency k-mers often represent errors
- Genome Complexity: Unique k-mers indicate complex regions
K-mer Applications
Genome Assembly
- Error Correction: Remove k-mers with low frequency
- Overlap Detection: Find shared k-mers between reads
- Graph Construction: Build de Bruijn graphs
Quality Assessment
- Coverage Estimation: Estimate genome size and coverage
- Contamination Detection: Identify foreign DNA
- Ploidy Estimation: Detect polyploidy from k-mer frequencies
Parameter Selection
Choosing K
- Small K (k=15-21): Sensitive to errors, good for error correction
- Medium K (k=25-31): Balanced sensitivity/specificity
- Large K (k=35-51): Specific but may miss overlaps
Memory Considerations
- Dense Matrices: Store all possible k-mers (memory intensive)
- Sparse Matrices: Store only observed k-mers (memory efficient)
- Counting Algorithms: Bloom filters, Count-Min sketch
When to Use K-mer Analysis
Application | K-mer Size | Matrix Type | Use Case |
---|---|---|---|
Error Correction | 15-21 | Dense | Remove sequencing errors |
Assembly | 25-31 | Sparse | Genome assembly |
Repeat Detection | 31-51 | Sparse | Identify repetitive elements |
Contamination | 21-31 | Sparse | Detect foreign DNA |
Mycelia Functions
# Count k-mers
count_kmers("reads.fastq", k=21)
# Dense vs sparse counting
fasta_list_to_dense_kmer_counts(files, k=21)
fasta_list_to_sparse_kmer_counts(files, k=21)
# K-mer spectrum analysis
kmer_frequency_spectrum(kmer_counts)
Genome Assembly
Concept: Reconstructing Genomes
Biological Context: Sequencing breaks genomes into fragments. Assembly reconstructs the original genome by finding overlaps between fragments.
Assembly Algorithms
Overlap-Layout-Consensus (OLC)
- Overlap: Find overlaps between all reads
- Layout: Order reads based on overlaps
- Consensus: Generate consensus sequence
- Best For: Long reads (PacBio, Nanopore)
De Bruijn Graph
- K-mer Decomposition: Break reads into k-mers
- Graph Construction: Connect overlapping k-mers
- Path Finding: Find Eulerian paths through graph
- Best For: Short reads (Illumina)
String Graph
- Overlap Graph: Nodes are reads, edges are overlaps
- Transitivity Reduction: Remove redundant edges
- Contig Construction: Find paths through graph
- Best For: Long reads with high accuracy
Assembly Challenges
Repetitive Sequences
- Problem: Identical sequences confuse assembly
- Solution: Long reads that span repeats
- Tools: Repeat-aware assemblers, scaffolding
Heterozygosity
- Problem: Diploid genomes have two alleles
- Solution: Haplotype-aware assembly
- Tools: Trio binning, Hi-C scaffolding
Polyploidy
- Problem: Multiple copies of chromosomes
- Solution: Specialized polyploid assemblers
- Tools: Ploidy-aware algorithms
Assembly Quality Metrics
Contiguity
- N50: Length where 50% of assembly is in contigs of this length or longer
- L50: Number of contigs containing 50% of assembly
- Contig Count: Total number of contigs (fewer is better)
Completeness
- Genome Coverage: Percentage of genome assembled
- Gene Completeness: Percentage of expected genes found
- BUSCO Scores: Conserved gene completeness
Accuracy
- Base Accuracy: Percentage of correct bases
- Structural Accuracy: Correct arrangement of sequences
- Validation: Comparison with reference genome
When to Use Different Assemblers
Data Type | Assembler | Algorithm | Strengths |
---|---|---|---|
HiFi | hifiasm | String graph | High accuracy, haplotype-aware |
Nanopore | Flye | Repeat graph | Long reads, repetitive genomes |
Illumina | SPAdes | de Bruijn | Short reads, microbial genomes |
Hybrid | MaSuRCA | Hybrid | Combines short and long reads |
Mycelia Functions
# Run assembly
Mycelia.assemble_genome("reads.fastq", output_dir="assembly")
# Evaluate assembly quality
evaluate_assembly(assembly_fasta)
# Assembly statistics
calculate_assembly_stats(contigs)
Gene Annotation
Concept: Identifying Functional Elements
Biological Context: Raw genome sequences are meaningless without annotation. Gene annotation identifies protein-coding genes, regulatory elements, and other functional sequences.
Annotation Types
Structural Annotation
- Gene Prediction: Identify protein-coding genes
- Exon-Intron Structure: Define gene boundaries
- Promoter Prediction: Identify regulatory sequences
- Repeat Annotation: Classify repetitive elements
Functional Annotation
- Protein Function: Predict what proteins do
- Pathway Mapping: Assign genes to metabolic pathways
- GO Terms: Gene Ontology functional classification
- Domain Annotation: Identify protein domains
Gene Prediction Methods
Ab Initio Prediction
- Approach: Use sequence signals (start codons, splice sites)
- Advantages: No external data required
- Limitations: Lower accuracy, misses non-canonical genes
- Tools: Prodigal, Augustus, GeneMark
Homology-Based Prediction
- Approach: Compare to known genes in databases
- Advantages: Higher accuracy for conserved genes
- Limitations: Misses novel genes, depends on database quality
- Tools: BLAST, DIAMOND, MMseqs2
RNA-seq Guided Prediction
- Approach: Use transcriptome data to guide prediction
- Advantages: High accuracy, identifies expressed genes
- Limitations: Requires RNA-seq data, misses lowly expressed genes
- Tools: StringTie, Cufflinks, BRAKER
Annotation Challenges
Eukaryotic Complexity
- Alternative Splicing: Multiple transcripts per gene
- Pseudogenes: Non-functional gene copies
- Non-coding RNAs: Functional RNAs that don't code for proteins
Prokaryotic Specifics
- Operons: Polycistronic transcripts
- Overlapping Genes: Genes sharing sequence
- Horizontal Gene Transfer: Genes from other species
Quality Assessment
Annotation Completeness
- Gene Density: Number of genes per kb
- Protein Completeness: Percentage of complete proteins
- Functional Coverage: Percentage of genes with functional annotation
Annotation Accuracy
- Validation: Comparison with experimental data
- Consistency: Agreement between different methods
- Benchmarking: Comparison with reference annotations
When to Use Different Approaches
Genome Type | Method | Tools | Considerations |
---|---|---|---|
Bacterial | Ab initio | Prodigal | Simple gene structure |
Viral | Homology | BLAST | Small genomes, database dependent |
Fungal | Hybrid | Augustus + BLAST | Moderate complexity |
Plant/Animal | RNA-seq guided | BRAKER | High complexity, alternative splicing |
Mycelia Functions
# Predict genes
predict_genes(genome_fasta, genetic_code="standard")
# Functional annotation
annotate_functions(proteins, database="uniprot")
# Annotation evaluation
evaluate_annotation(gff_file, reference_gff)
Comparative Genomics
Concept: Comparing Genomes
Biological Context: Comparing genomes reveals evolutionary relationships, functional elements, and species-specific adaptations.
Comparative Approaches
Pairwise Comparison
- Synteny: Conserved gene order between species
- Orthology: Corresponding genes in different species
- Whole Genome Alignment: Align entire genomes
- Structural Variation: Differences in genome structure
Pangenome Analysis
- Core Genome: Genes present in all individuals
- Accessory Genome: Genes present in some individuals
- Unique Genes: Genes present in single individuals
- Gene Gain/Loss: Evolution of gene content
Pangenome Construction
Graph-Based Approaches
- Gene Graphs: Nodes are genes, edges are relationships
- Sequence Graphs: Nodes are sequences, edges are overlaps
- Variation Graphs: Represent genetic variation as graphs
Clustering Approaches
- Sequence Similarity: Group similar sequences
- Synteny: Group genes with conserved context
- Functional Similarity: Group genes with similar functions
Evolutionary Analysis
Phylogenetic Trees
- Species Trees: Evolutionary relationships between species
- Gene Trees: Evolution of individual genes
- Reconciliation: Compare gene and species trees
Selection Analysis
- Positive Selection: Genes under selective pressure
- Purifying Selection: Genes with functional constraints
- Neutral Evolution: Genes evolving without selection
Applications
Medical Genomics
- Drug Resistance: Compare sensitive vs resistant strains
- Virulence Factors: Identify pathogenicity genes
- Vaccine Targets: Find conserved antigens
Agricultural Genomics
- Crop Improvement: Identify beneficial alleles
- Disease Resistance: Find resistance genes
- Stress Tolerance: Identify adaptation mechanisms
When to Use Comparative Genomics
Research Question | Approach | Scale | Tools |
---|---|---|---|
Species evolution | Phylogenomics | Many species | Core genome alignment |
Population variation | Pangenomics | Many individuals | Graph-based methods |
Functional elements | Synteny | Few genomes | Whole genome alignment |
Adaptation | Selection analysis | Population data | Evolutionary models |
Mycelia Functions
# Build pangenome
build_pangenome(genome_list, clustering_threshold=0.95)
# Phylogenetic analysis
construct_phylogeny(core_genes, method="ml")
# Comparative analysis
compare_genomes(genome1, genome2, method="synteny")
Phylogenetics
Concept: Evolutionary Relationships
Biological Context: Phylogenetics reconstructs evolutionary history from molecular data, revealing how species are related and when they diverged.
Tree Construction Methods
Distance-Based Methods
- UPGMA: Assumes molecular clock
- Neighbor-Joining: Doesn't assume molecular clock
- Minimum Evolution: Finds tree with shortest total branch length
Character-Based Methods
- Maximum Parsimony: Minimizes evolutionary changes
- Maximum Likelihood: Finds most likely tree given data
- Bayesian Inference: Incorporates prior knowledge
Molecular Evolution Models
Nucleotide Substitution Models
- JC69: All substitutions equally likely
- K80: Different rates for transitions/transversions
- GTR: General time-reversible model
Protein Evolution Models
- Poisson: All amino acid changes equally likely
- JTT: Jones-Taylor-Thornton model
- LG: Le-Gascuel model
Tree Evaluation
Support Values
- Bootstrap: Resampling support for branches
- Posterior Probability: Bayesian support
- SH-like aLRT: Likelihood ratio test
Tree Comparison
- Robinson-Foulds: Topological distance
- Likelihood Ratio: Statistical comparison
- Consensus Trees: Combine multiple trees
Applications
Taxonomy
- Species Identification: Place unknown species
- Classification: Revise taxonomic relationships
- Diversity Assessment: Measure evolutionary diversity
Epidemiology
- Outbreak Tracking: Trace disease transmission
- Vaccine Design: Understand pathogen evolution
- Drug Resistance: Track resistance evolution
When to Use Phylogenetics
Data Type | Method | Use Case | Considerations |
---|---|---|---|
Single gene | NJ/ML | Quick analysis | May not reflect species tree |
Multiple genes | Concatenation | Species phylogeny | Assumes same tree topology |
Genome-wide | Coalescent | Population genetics | Accounts for incomplete lineage sorting |
Time-calibrated | Molecular clock | Divergence dating | Requires calibration points |
Mycelia Functions
# Construct phylogeny
build_phylogenetic_tree(alignment, method="ml")
# Molecular clock analysis
estimate_divergence_times(tree, calibration_points)
# Tree visualization
plot_phylogenetic_tree(tree, layout="circular")
Tool Selection Guide
Choosing the Right Analysis
Data-Driven Decisions
Consider Your Data Type
- Illumina: Short read optimized tools
- PacBio/Nanopore: Long read specialized tools
- Hybrid: Tools supporting multiple data types
Consider Your Organism
- Bacteria: Simpler tools, less memory
- Eukaryotes: Complex tools, more resources
- Viruses: Specialized viral tools
Consider Your Resources
- Local Machine: Memory/CPU constraints
- HPC Cluster: Parallel processing available
- Cloud: Scalable but cost considerations
Question-Driven Decisions
Assembly Quality
- Draft Assembly: Fast tools, lower quality
- Reference Quality: Comprehensive tools, higher quality
- Comparison: Multiple assemblers, best result
Annotation Depth
- Basic Annotation: Fast gene prediction
- Comprehensive: Multiple evidence types
- Comparative: Cross-species evidence
Analysis Scope
- Single Genome: Individual analysis tools
- Population: Population genetics tools
- Comparative: Multi-genome tools
Performance Considerations
Memory Requirements
- K-mer Analysis: Exponential with k size
- Assembly: Linear with genome size
- Annotation: Depends on database size
Time Complexity
- Data Download: Network dependent
- Quality Control: Linear with data size
- Assembly: Depends on algorithm and data
- Annotation: Depends on database searches
Accuracy Trade-offs
- Speed vs Accuracy: Fast tools may sacrifice accuracy
- Sensitivity vs Specificity: Detect more vs fewer false positives
- Completeness vs Contiguity: More complete vs fewer contigs
Best Practices
Workflow Design
- Start Simple: Use basic tools first
- Iterate: Improve based on results
- Validate: Check results at each step
- Document: Record parameters and decisions
Quality Control
- Check Inputs: Validate data quality
- Monitor Progress: Track intermediate results
- Evaluate Outputs: Assess final results
- Compare Methods: Use multiple approaches
Reproducibility
- Version Control: Track software versions
- Parameter Recording: Document all settings
- Environment Management: Use containers/environments
- Data Provenance: Track data sources
This conceptual guide provides the foundation for understanding Mycelia's tools and making informed decisions about your bioinformatics analyses. Each tutorial and benchmark builds on these concepts with practical implementations.