Tutorial 7: Comparative Genomics and Pangenome Analysis

This tutorial covers comparative genomics approaches, including pangenome construction, phylogenetic analysis, and evolutionary genomics.

Learning Objectives

By the end of this tutorial, you will understand:

Pangenome concepts and construction methods
Core, accessory, and unique gene identification
Phylogenetic tree construction and interpretation
Synteny analysis and chromosomal rearrangements
Population genomics and evolutionary analysis
Graph-based genome representation

Setup

import Pkg
if isinteractive()
    Pkg.activate("..")
end

import Test
import Mycelia
import FASTX
import Random
import Statistics
import Graphs
import Kmers

Random.seed!(42)

Part 1: Pangenome Concepts

Pangenomes represent the complete set of genes within a species or closely related group of organisms.

println("=== Comparative Genomics Tutorial ===")

println("Pangenome Components:")
println("- Core genome: Genes present in all individuals")
println("- Accessory genome: Genes present in some individuals")
println("- Unique genes: Genes present in single individuals")
println("- Variable genes: Genes with presence/absence variation")
println()
println("Pangenome Types:")
println("- Open pangenome: Continues to expand with new genomes")
println("- Closed pangenome: Reaches saturation quickly")
println("- Mixed pangenome: Shows both open and closed characteristics")

Part 2: Data Preparation for Pangenome Analysis

Prepare multiple genomes for comparative analysis

println("\n=== Data Preparation ===")

Simulating Multiple Genomes

Create a set of related genomes for pangenome analysis

println("--- Simulating Related Genomes ---")

Generate core genes present in all genomes

n_genomes = 5
core_genes = 200
accessory_genes = 100
unique_genes = 50

Generate simulated genomes with different characteristics

genome_files = String[]
for i in 1:n_genomes
    genome_size = core_genes * 1000 + rand(0:accessory_genes) * 1000 + rand(0:unique_genes) * 1000
    genome = Mycelia.random_fasta_record(moltype=:DNA, seed=i, L=genome_size)
    filename = "genome_$i.fasta"
    Mycelia.write_fasta(outfile=filename, records=[genome])
    push!(genome_files, filename)
end

println("Generated $(n_genomes) genomes:")
for (i, file) in enumerate(genome_files)
    size = filesize(file)
    println("  Genome $i: $file ($(size) bytes)")
end

Part 3: Gene Clustering and Ortholog Identification

Identify orthologous genes across genomes

println("\n=== Gene Clustering ===")

All-vs-All Sequence Comparison

Compare all genes against all genes

println("--- All-vs-All Comparison ---")

Perform k-mer based pangenome analysis using implemented functions

println("Performing k-mer based pangenome analysis...")

Use implemented pangenome analysis function

pangenome_result = Mycelia.analyze_pangenome_kmers(
    genome_files,
    kmer_type=Kmers.DNAKmer{21}
)

println("Pangenome Analysis Results:")
println("  Core k-mers: $(length(pangenome_result.core_kmers))")
println("  Accessory k-mers: $(length(pangenome_result.accessory_kmers))")
println("  Total pangenome size: $(pangenome_result.similarity_stats.pangenome_size)")

Genome Similarity Analysis

Compare genome similarity using multiple metrics

println("--- Genome Similarity Analysis ---")

Build distance matrix using different metrics

js_matrix = Mycelia.build_genome_distance_matrix(
    genome_files,
    kmer_type=Kmers.DNAKmer{21},
    metric=:js_divergence
)

println("Distance Matrix (JS Divergence):")
for i in 1:n_genomes
    for j in 1:n_genomes
        print("$(round(js_matrix.distance_matrix[i,j], digits=3)) ")
    end
    println()
end

Part 4: Pangenome Construction

Build pangenome from ortholog clusters

println("\n=== Pangenome Construction ===")

Core Genome Analysis

Identify genes present in all genomes

println("--- Core Genome Analysis ---")

Use results from pangenome analysis

core_genome_size = length(pangenome_result.core_kmers)
core_genome_percentage = pangenome_result.similarity_stats.core_percentage

println("Core Genome:")
println("  Size: $core_genome_size k-mers")
println("  Percentage: $(round(core_genome_percentage, digits=1))%")

Core genome represents k-mers present in ALL genomes

println("Core genome represents conserved genomic content across all $(n_genomes) genomes")

Accessory Genome Analysis

Analyze k-mers with variable presence

println("--- Accessory Genome Analysis ---")

accessory_genome_size = length(pangenome_result.accessory_kmers)
unique_genome_total = sum(length(kmers) for kmers in values(pangenome_result.unique_kmers_by_genome))

println("Accessory Genome:")
println("  Variable k-mers: $accessory_genome_size")
println("  Unique k-mers: $unique_genome_total")
println("  Total accessory: $(accessory_genome_size + unique_genome_total)")

Analyze unique k-mers per genome

println("Unique k-mers per genome:")
for (genome, unique_kmers) in pangenome_result.unique_kmers_by_genome
    println("  $genome: $(length(unique_kmers)) unique k-mers")
end

Pangenome Curves

Model pangenome size vs number of genomes

println("--- Pangenome Curves ---")

TODO: Implement pangenome curve analysis

Calculate pangenome size for subsets
Fit mathematical models (power law, exponential)
Predict pangenome size for larger collections
Classify as open/closed pangenome

Part 5: Graph-Based Pangenome Representation

Represent pangenomes as graphs

println("\n=== Graph-Based Pangenomes ===")

Sequence Graphs

Build graphs from sequence overlaps

println("--- Sequence Graphs ---")

TODO: Implement sequence graph construction

Build k-mer graphs from sequences
Identify shared and unique paths
Handle graph bubbles and loops
Compress redundant sequences

Use existing pangenome graph construction

k = 3
sequences = [FASTX.sequence(Mycelia.random_fasta_record(moltype=:DNA, seed=i, L=100)) for i in 1:5]

graph = Mycelia.buildstrandedkmer_graph(Kmers.DNAKmer{k}, sequences)

println("Sequence Graph Construction:")
println("  K-mer size: $k")
println("  Sequences: $(length(sequences))")
# println("  Graph nodes: \$(Graphs.nv(graph))")
# println("  Graph edges: \$(Graphs.ne(graph))")

Variation Graphs

Represent genetic variation as graphs

println("--- Variation Graphs ---")

TODO: Implement variation graph construction

Build graphs from multiple alignments
Represent SNPs and indels as bubbles
Handle complex structural variants
Optimize graph topology

Part 6: Phylogenetic Analysis

Construct phylogenetic trees from pangenome data

println("\n=== Phylogenetic Analysis ===")

Core Genome Phylogeny

Build trees from core genes

println("--- Core Genome Phylogeny ---")

TODO: Implement core genome phylogeny

Concatenate core gene alignments
Calculate phylogenetic trees
Assess branch support
Compare with individual gene trees

Accessory Genome Phylogeny

Analyze evolution of accessory genes

println("--- Accessory Genome Phylogeny ---")

TODO: Implement accessory genome phylogeny

Build trees from gene presence/absence
Use binary character evolution models
Identify horizontal gene transfer events
Analyze gene gain/loss patterns

Part 7: Synteny Analysis

Analyze conserved gene order

println("\n=== Synteny Analysis ===")

Synteny Detection

Identify conserved gene order

println("--- Synteny Detection ---")

TODO: Implement synteny detection

Identify colinear gene blocks
Calculate synteny conservation
Detect chromosomal rearrangements
Visualize synteny relationships

Rearrangement Analysis

Analyze chromosomal rearrangements

println("--- Rearrangement Analysis ---")

TODO: Implement rearrangement analysis

Identify inversions, translocations, duplications
Calculate rearrangement distances
Reconstruct ancestral gene orders
Analyze rearrangement hotspots

Part 8: Population Genomics

Analyze genetic diversity within populations

println("\n=== Population Genomics ===")

Genetic Diversity

Calculate diversity metrics

println("--- Genetic Diversity ---")

TODO: Implement diversity analysis

Calculate nucleotide diversity (π)
Estimate effective population size
Analyze allele frequency spectra
Identify population structure

Selection Analysis

Detect selection pressure

println("--- Selection Analysis ---")

TODO: Implement selection analysis

Calculate dN/dS ratios
Identify positively selected genes
Detect balancing selection
Analyze codon usage bias

Part 9: Functional Analysis

Analyze functional aspects of pangenomes

println("\n=== Functional Analysis ===")

Functional Enrichment

Identify enriched functional categories

println("--- Functional Enrichment ---")

TODO: Implement functional enrichment analysis

GO term enrichment analysis
KEGG pathway enrichment
Protein domain analysis
Metabolic pathway reconstruction

Adaptive Evolution

Identify adaptively evolving genes

println("--- Adaptive Evolution ---")

TODO: Implement adaptive evolution analysis

Identify rapidly evolving genes
Analyze gene family expansions
Detect horizontal gene transfer
Correlate with environmental factors

Part 10: Visualization and Interpretation

Create visualizations for comparative genomics

println("\n=== Visualization ===")

Pangenome Plots

Visualize pangenome structure

println("--- Pangenome Plots ---")

TODO: Implement pangenome visualization

Pangenome accumulation curves
Gene presence/absence heatmaps
Phylogenetic trees with annotations
Synteny dot plots

Interactive Visualization

Create interactive pangenome browsers

println("--- Interactive Visualization ---")

TODO: Implement interactive visualization

Web-based pangenome browser
Interactive phylogenetic trees
Searchable gene catalogs
Comparative genome viewers

Part 11: Applications and Case Studies

Real-world applications of comparative genomics

println("\n=== Applications ===")

Pathogen Genomics

Applications in infectious disease research

println("--- Pathogen Genomics ---")

println("Pathogen Genomics Applications:")
println("- Outbreak investigation and source tracking")
println("- Drug resistance evolution")
println("- Vaccine target identification")
println("- Virulence factor discovery")

Agricultural Genomics

Applications in crop improvement

println("--- Agricultural Genomics ---")

println("Agricultural Genomics Applications:")
println("- Crop diversity assessment")
println("- Disease resistance breeding")
println("- Stress tolerance identification")
println("- Nutritional quality improvement")

Environmental Genomics

Applications in environmental microbiology

println("--- Environmental Genomics ---")

println("Environmental Genomics Applications:")
println("- Microbial community analysis")
println("- Ecosystem function prediction")
println("- Biogeochemical cycling")
println("- Climate change adaptation")

Part 12: Best Practices and Guidelines

Recommendations for comparative genomics analysis

println("\n=== Best Practices ===")

println("Data Quality:")
println("- Use high-quality genome assemblies")
println("- Ensure consistent annotation standards")
println("- Validate ortholog assignments")
println("- Check for contamination and artifacts")
println()
println("Analysis Strategy:")
println("- Start with closely related genomes")
println("- Use multiple clustering methods")
println("- Validate results with independent approaches")
println("- Consider biological context in interpretation")
println()
println("Computational Considerations:")
println("- Plan for large memory requirements")
println("- Use parallel processing when possible")
println("- Implement checkpointing for long analyses")
println("- Archive intermediate results")

Summary

println("\n=== Comparative Genomics Summary ===")
println("✓ Understanding pangenome concepts and construction")
println("✓ Implementing gene clustering and ortholog identification")
println("✓ Building core and accessory genome catalogs")
println("✓ Constructing graph-based pangenome representations")
println("✓ Performing phylogenetic analysis")
println("✓ Analyzing synteny and chromosomal rearrangements")
println("✓ Applying population genomics approaches")
println("✓ Creating comprehensive visualizations")
println("✓ Understanding real-world applications")

Cleanup

for file in genome_files
    if isfile(file)
        rm(file, force=true)
    end
end

println("\nNext: Tutorial 8 - Tool Integration")

nothing