Tutorial 7: Comparative Genomics and Pangenome Analysis
This tutorial covers comparative genomics approaches, including pangenome construction, phylogenetic analysis, and evolutionary genomics.
Learning Objectives
By the end of this tutorial, you will understand:
- Pangenome concepts and construction methods
- Core, accessory, and unique gene identification
- Phylogenetic tree construction and interpretation
- Synteny analysis and chromosomal rearrangements
- Population genomics and evolutionary analysis
- Graph-based genome representation
Setup
import Pkg
if isinteractive()
Pkg.activate("..")
end
import Test
import Mycelia
import FASTX
import Random
import Statistics
import Graphs
import Kmers
Random.seed!(42)Part 1: Pangenome Concepts
Pangenomes represent the complete set of genes within a species or closely related group of organisms.
println("=== Comparative Genomics Tutorial ===")
println("Pangenome Components:")
println("- Core genome: Genes present in all individuals")
println("- Accessory genome: Genes present in some individuals")
println("- Unique genes: Genes present in single individuals")
println("- Variable genes: Genes with presence/absence variation")
println()
println("Pangenome Types:")
println("- Open pangenome: Continues to expand with new genomes")
println("- Closed pangenome: Reaches saturation quickly")
println("- Mixed pangenome: Shows both open and closed characteristics")Part 2: Data Preparation for Pangenome Analysis
Prepare multiple genomes for comparative analysis
println("\n=== Data Preparation ===")Simulating Multiple Genomes
Create a set of related genomes for pangenome analysis
println("--- Simulating Related Genomes ---")Generate core genes present in all genomes
n_genomes = 5
core_genes = 200
accessory_genes = 100
unique_genes = 50Generate simulated genomes with different characteristics
genome_files = String[]
for i in 1:n_genomes
genome_size = core_genes * 1000 + rand(0:accessory_genes) * 1000 + rand(0:unique_genes) * 1000
genome = Mycelia.random_fasta_record(moltype=:DNA, seed=i, L=genome_size)
filename = "genome_$i.fasta"
Mycelia.write_fasta(outfile=filename, records=[genome])
push!(genome_files, filename)
end
println("Generated $(n_genomes) genomes:")
for (i, file) in enumerate(genome_files)
size = filesize(file)
println(" Genome $i: $file ($(size) bytes)")
endPart 3: Gene Clustering and Ortholog Identification
Identify orthologous genes across genomes
println("\n=== Gene Clustering ===")All-vs-All Sequence Comparison
Compare all genes against all genes
println("--- All-vs-All Comparison ---")Perform k-mer based pangenome analysis using implemented functions
println("Performing k-mer based pangenome analysis...")Use implemented pangenome analysis function
pangenome_result = Mycelia.analyze_pangenome_kmers(
genome_files,
kmer_type=Kmers.DNAKmer{21}
)
println("Pangenome Analysis Results:")
println(" Core k-mers: $(length(pangenome_result.core_kmers))")
println(" Accessory k-mers: $(length(pangenome_result.accessory_kmers))")
println(" Total pangenome size: $(pangenome_result.similarity_stats.pangenome_size)")Genome Similarity Analysis
Compare genome similarity using multiple metrics
println("--- Genome Similarity Analysis ---")Build distance matrix using different metrics
js_matrix = Mycelia.build_genome_distance_matrix(
genome_files,
kmer_type=Kmers.DNAKmer{21},
metric=:js_divergence
)
println("Distance Matrix (JS Divergence):")
for i in 1:n_genomes
for j in 1:n_genomes
print("$(round(js_matrix.distance_matrix[i,j], digits=3)) ")
end
println()
endPart 4: Pangenome Construction
Build pangenome from ortholog clusters
println("\n=== Pangenome Construction ===")Core Genome Analysis
Identify genes present in all genomes
println("--- Core Genome Analysis ---")Use results from pangenome analysis
core_genome_size = length(pangenome_result.core_kmers)
core_genome_percentage = pangenome_result.similarity_stats.core_percentage
println("Core Genome:")
println(" Size: $core_genome_size k-mers")
println(" Percentage: $(round(core_genome_percentage, digits=1))%")Core genome represents k-mers present in ALL genomes
println("Core genome represents conserved genomic content across all $(n_genomes) genomes")Accessory Genome Analysis
Analyze k-mers with variable presence
println("--- Accessory Genome Analysis ---")
accessory_genome_size = length(pangenome_result.accessory_kmers)
unique_genome_total = sum(length(kmers) for kmers in values(pangenome_result.unique_kmers_by_genome))
println("Accessory Genome:")
println(" Variable k-mers: $accessory_genome_size")
println(" Unique k-mers: $unique_genome_total")
println(" Total accessory: $(accessory_genome_size + unique_genome_total)")Analyze unique k-mers per genome
println("Unique k-mers per genome:")
for (genome, unique_kmers) in pangenome_result.unique_kmers_by_genome
println(" $genome: $(length(unique_kmers)) unique k-mers")
endPangenome Curves
Model pangenome size vs number of genomes
println("--- Pangenome Curves ---")TODO: Implement pangenome curve analysis
- Calculate pangenome size for subsets
- Fit mathematical models (power law, exponential)
- Predict pangenome size for larger collections
- Classify as open/closed pangenome
Part 5: Graph-Based Pangenome Representation
Represent pangenomes as graphs
println("\n=== Graph-Based Pangenomes ===")Sequence Graphs
Build graphs from sequence overlaps
println("--- Sequence Graphs ---")TODO: Implement sequence graph construction
- Build k-mer graphs from sequences
- Identify shared and unique paths
- Handle graph bubbles and loops
- Compress redundant sequences
Use existing pangenome graph construction
k = 3
sequences = [FASTX.sequence(Mycelia.random_fasta_record(moltype=:DNA, seed=i, L=100)) for i in 1:5]graph = Mycelia.buildstrandedkmer_graph(Kmers.DNAKmer{k}, sequences)
println("Sequence Graph Construction:")
println(" K-mer size: $k")
println(" Sequences: $(length(sequences))")
# println(" Graph nodes: \$(Graphs.nv(graph))")
# println(" Graph edges: \$(Graphs.ne(graph))")Variation Graphs
Represent genetic variation as graphs
println("--- Variation Graphs ---")TODO: Implement variation graph construction
- Build graphs from multiple alignments
- Represent SNPs and indels as bubbles
- Handle complex structural variants
- Optimize graph topology
Part 6: Phylogenetic Analysis
Construct phylogenetic trees from pangenome data
println("\n=== Phylogenetic Analysis ===")Core Genome Phylogeny
Build trees from core genes
println("--- Core Genome Phylogeny ---")TODO: Implement core genome phylogeny
- Concatenate core gene alignments
- Calculate phylogenetic trees
- Assess branch support
- Compare with individual gene trees
Accessory Genome Phylogeny
Analyze evolution of accessory genes
println("--- Accessory Genome Phylogeny ---")TODO: Implement accessory genome phylogeny
- Build trees from gene presence/absence
- Use binary character evolution models
- Identify horizontal gene transfer events
- Analyze gene gain/loss patterns
Part 7: Synteny Analysis
Analyze conserved gene order
println("\n=== Synteny Analysis ===")Synteny Detection
Identify conserved gene order
println("--- Synteny Detection ---")TODO: Implement synteny detection
- Identify colinear gene blocks
- Calculate synteny conservation
- Detect chromosomal rearrangements
- Visualize synteny relationships
Rearrangement Analysis
Analyze chromosomal rearrangements
println("--- Rearrangement Analysis ---")TODO: Implement rearrangement analysis
- Identify inversions, translocations, duplications
- Calculate rearrangement distances
- Reconstruct ancestral gene orders
- Analyze rearrangement hotspots
Part 8: Population Genomics
Analyze genetic diversity within populations
println("\n=== Population Genomics ===")Genetic Diversity
Calculate diversity metrics
println("--- Genetic Diversity ---")TODO: Implement diversity analysis
- Calculate nucleotide diversity (π)
- Estimate effective population size
- Analyze allele frequency spectra
- Identify population structure
Selection Analysis
Detect selection pressure
println("--- Selection Analysis ---")TODO: Implement selection analysis
- Calculate dN/dS ratios
- Identify positively selected genes
- Detect balancing selection
- Analyze codon usage bias
Part 9: Functional Analysis
Analyze functional aspects of pangenomes
println("\n=== Functional Analysis ===")Functional Enrichment
Identify enriched functional categories
println("--- Functional Enrichment ---")TODO: Implement functional enrichment analysis
- GO term enrichment analysis
- KEGG pathway enrichment
- Protein domain analysis
- Metabolic pathway reconstruction
Adaptive Evolution
Identify adaptively evolving genes
println("--- Adaptive Evolution ---")TODO: Implement adaptive evolution analysis
- Identify rapidly evolving genes
- Analyze gene family expansions
- Detect horizontal gene transfer
- Correlate with environmental factors
Part 10: Visualization and Interpretation
Create visualizations for comparative genomics
println("\n=== Visualization ===")Pangenome Plots
Visualize pangenome structure
println("--- Pangenome Plots ---")TODO: Implement pangenome visualization
- Pangenome accumulation curves
- Gene presence/absence heatmaps
- Phylogenetic trees with annotations
- Synteny dot plots
Interactive Visualization
Create interactive pangenome browsers
println("--- Interactive Visualization ---")TODO: Implement interactive visualization
- Web-based pangenome browser
- Interactive phylogenetic trees
- Searchable gene catalogs
- Comparative genome viewers
Part 11: Applications and Case Studies
Real-world applications of comparative genomics
println("\n=== Applications ===")Pathogen Genomics
Applications in infectious disease research
println("--- Pathogen Genomics ---")
println("Pathogen Genomics Applications:")
println("- Outbreak investigation and source tracking")
println("- Drug resistance evolution")
println("- Vaccine target identification")
println("- Virulence factor discovery")Agricultural Genomics
Applications in crop improvement
println("--- Agricultural Genomics ---")
println("Agricultural Genomics Applications:")
println("- Crop diversity assessment")
println("- Disease resistance breeding")
println("- Stress tolerance identification")
println("- Nutritional quality improvement")Environmental Genomics
Applications in environmental microbiology
println("--- Environmental Genomics ---")
println("Environmental Genomics Applications:")
println("- Microbial community analysis")
println("- Ecosystem function prediction")
println("- Biogeochemical cycling")
println("- Climate change adaptation")Part 12: Best Practices and Guidelines
Recommendations for comparative genomics analysis
println("\n=== Best Practices ===")
println("Data Quality:")
println("- Use high-quality genome assemblies")
println("- Ensure consistent annotation standards")
println("- Validate ortholog assignments")
println("- Check for contamination and artifacts")
println()
println("Analysis Strategy:")
println("- Start with closely related genomes")
println("- Use multiple clustering methods")
println("- Validate results with independent approaches")
println("- Consider biological context in interpretation")
println()
println("Computational Considerations:")
println("- Plan for large memory requirements")
println("- Use parallel processing when possible")
println("- Implement checkpointing for long analyses")
println("- Archive intermediate results")Summary
println("\n=== Comparative Genomics Summary ===")
println("✓ Understanding pangenome concepts and construction")
println("✓ Implementing gene clustering and ortholog identification")
println("✓ Building core and accessory genome catalogs")
println("✓ Constructing graph-based pangenome representations")
println("✓ Performing phylogenetic analysis")
println("✓ Analyzing synteny and chromosomal rearrangements")
println("✓ Applying population genomics approaches")
println("✓ Creating comprehensive visualizations")
println("✓ Understanding real-world applications")Cleanup
for file in genome_files
if isfile(file)
rm(file, force=true)
end
end
println("\nNext: Tutorial 8 - Tool Integration")
nothing