Tutorial 6: Gene Annotation

This tutorial covers comprehensive gene annotation techniques, from basic gene prediction to functional annotation and annotation quality assessment.

Learning Objectives

By the end of this tutorial, you will understand:

  • Different gene prediction approaches and their applications
  • Structural annotation techniques for identifying genes and features
  • Functional annotation methods for assigning biological roles
  • Annotation quality assessment and validation
  • Comparative annotation approaches
  • Best practices for different organism types

Setup

import Pkg
if isinteractive()
    Pkg.activate("..")
end

import Test
import Mycelia
import FASTX
import Random
import Statistics

Random.seed!(42)

Part 1: Gene Annotation Overview

Gene annotation is a multi-step process that identifies genes and assigns functional information to genomic sequences.

println("=== Gene Annotation Tutorial ===")

println("Annotation Pipeline Overview:")
println("1. Structural Annotation")
println("   - Gene prediction")
println("   - Exon-intron structure")
println("   - Regulatory elements")
println()
println("2. Functional Annotation")
println("   - Protein function prediction")
println("   - Pathway assignment")
println("   - GO term annotation")
println()
println("3. Comparative Annotation")
println("   - Ortholog identification")
println("   - Synteny analysis")
println("   - Evolutionary analysis")

Part 2: Structural Annotation

Structural annotation identifies the physical structure of genes and other functional elements.

println("\n=== Structural Annotation ===")

Preparing Test Genome

Create a test genome for annotation demonstration

println("--- Preparing Test Genome ---")

Generate test genome with gene-like features

genome_size = 50000
test_genome = Mycelia.random_fasta_record(moltype=:DNA, seed=1, L=genome_size)
genome_file = "test_genome.fasta"
Mycelia.write_fasta(outfile=genome_file, records=[test_genome])

println("Test genome prepared: $(genome_size) bp")

Ab Initio Gene Prediction

Predict genes using sequence signals alone

println("--- Ab Initio Gene Prediction ---")

TODO: Implement ab initio gene prediction

  • Use Prodigal for prokaryotic genes
  • Identify start/stop codons
  • Predict coding sequences
  • Handle overlapping genes

Simulate gene prediction results

predicted_genes = [
    Dict("start" => 1000, "end" => 2500, "strand" => "+", "confidence" => 0.95),
    Dict("start" => 3000, "end" => 4200, "strand" => "-", "confidence" => 0.88),
    Dict("start" => 5500, "end" => 7000, "strand" => "+", "confidence" => 0.92),
    Dict("start" => 8000, "end" => 9500, "strand" => "+", "confidence" => 0.85),
    Dict("start" => 10000, "end" => 11800, "strand" => "-", "confidence" => 0.90)
]

println("Ab Initio Gene Prediction Results:")
println("  Predicted genes: $(length(predicted_genes))")
println("  Mean confidence: $(round(Statistics.mean([g["confidence"] for g in predicted_genes]), digits=3))")
println("  Coding density: $(round(sum([g["end"] - g["start"] + 1 for g in predicted_genes]) / genome_size * 100, digits=1))%")

Homology-Based Gene Prediction

Use similarity to known genes for prediction

println("--- Homology-Based Gene Prediction ---")

TODO: Implement homology-based prediction

  • BLAST against protein databases
  • Identify homologous sequences
  • Transfer annotations from homologs
  • Handle partial matches

Simulate homology search results

homology_results = [
    Dict("query" => "gene_1", "subject" => "protein_X", "identity" => 85.2, "coverage" => 95.0),
    Dict("query" => "gene_2", "subject" => "protein_Y", "identity" => 78.9, "coverage" => 88.0),
    Dict("query" => "gene_3", "subject" => "protein_Z", "identity" => 92.1, "coverage" => 97.0),
    Dict("query" => "gene_4", "subject" => "protein_W", "identity" => 65.4, "coverage" => 82.0),
    Dict("query" => "gene_5", "subject" => "protein_V", "identity" => 88.7, "coverage" => 91.0)
]

println("Homology Search Results:")
for result in homology_results
    println("  $(result["query"]) -> $(result["subject"]): $(result["identity"])% identity, $(result["coverage"])% coverage")
end

RNA-seq Guided Prediction

Use transcriptome data to improve gene prediction

println("--- RNA-seq Guided Prediction ---")

TODO: Implement RNA-seq guided prediction

  • Map RNA-seq reads to genome
  • Identify transcribed regions
  • Predict exon-intron structure
  • Handle alternative splicing

Regulatory Element Prediction

Identify promoters, enhancers, and other regulatory sequences

println("--- Regulatory Element Prediction ---")

TODO: Implement regulatory element prediction

  • Promoter prediction
  • Transcription factor binding sites
  • CpG islands
  • Repetitive elements

Part 3: Functional Annotation

Functional annotation assigns biological roles to predicted genes

println("\n=== Functional Annotation ===")

Protein Function Prediction

Predict protein functions using various approaches

println("--- Protein Function Prediction ---")

TODO: Implement protein function prediction

  • Domain identification (Pfam, InterPro)
  • Enzyme classification (EC numbers)
  • Pathway assignment (KEGG, MetaCyc)
  • GO term annotation

Simulate functional annotation results

functional_annotations = [
    Dict("gene" => "gene_1", "function" => "DNA helicase", "ec" => "3.6.4.12", "confidence" => 0.92),
    Dict("gene" => "gene_2", "function" => "Transcriptional regulator", "ec" => "", "confidence" => 0.78),
    Dict("gene" => "gene_3", "function" => "Ribosomal protein L1", "ec" => "", "confidence" => 0.95),
    Dict("gene" => "gene_4", "function" => "Hypothetical protein", "ec" => "", "confidence" => 0.45),
    Dict("gene" => "gene_5", "function" => "ATP synthase subunit", "ec" => "3.6.3.14", "confidence" => 0.88)
]

println("Functional Annotation Results:")
for annot in functional_annotations
    ec_str = annot["ec"] != "" ? " (EC: $(annot["ec"]))" : ""
    println("  $(annot["gene"]): $(annot["function"])$ec_str [$(annot["confidence"])]")
end

Database Annotation

Annotate genes using specialized databases

println("--- Database Annotation ---")

TODO: Implement database annotation

  • BLAST against Swiss-Prot
  • Search COG database
  • Query KEGG pathways
  • Check specialized databases

Gene Ontology Annotation

Assign GO terms for standardized functional description

println("--- Gene Ontology Annotation ---")

TODO: Implement GO annotation

  • Assign GO terms
  • Validate GO term relationships
  • Calculate GO term confidence
  • Generate GO term summaries

Simulate GO annotation results

go_annotations = [
    Dict("gene" => "gene_1", "go_terms" => ["GO:0003678", "GO:0006310"], "aspect" => ["MF", "BP"]),
    Dict("gene" => "gene_2", "go_terms" => ["GO:0003677", "GO:0006355"], "aspect" => ["MF", "BP"]),
    Dict("gene" => "gene_3", "go_terms" => ["GO:0003735", "GO:0006412"], "aspect" => ["MF", "BP"]),
    Dict("gene" => "gene_4", "go_terms" => [], "aspect" => []),
    Dict("gene" => "gene_5", "go_terms" => ["GO:0046933", "GO:0015986"], "aspect" => ["MF", "BP"])
]

println("GO Annotation Results:")
for annot in go_annotations
    if !isempty(annot["go_terms"])
        println("  $(annot["gene"]): $(join(annot["go_terms"], ", "))")
    else
        println("  $(annot["gene"]): No GO terms assigned")
    end
end

Part 4: Annotation Quality Assessment

Evaluate the quality and completeness of annotations

println("\n=== Annotation Quality Assessment ===")

Annotation Completeness

Assess what fraction of genes have functional annotations

println("--- Annotation Completeness ---")

Calculate annotation statistics

total_genes = length(predicted_genes)
functionally_annotated = sum([annot["function"] != "Hypothetical protein" for annot in functional_annotations])
ec_annotated = sum([annot["ec"] != "" for annot in functional_annotations])
go_annotated = sum([!isempty(annot["go_terms"]) for annot in go_annotations])

annotation_stats = Dict(
    "total_genes" => total_genes,
    "functionally_annotated" => functionally_annotated,
    "ec_annotated" => ec_annotated,
    "go_annotated" => go_annotated,
    "functional_coverage" => functionally_annotated / total_genes * 100,
    "ec_coverage" => ec_annotated / total_genes * 100,
    "go_coverage" => go_annotated / total_genes * 100
)

println("Annotation Completeness:")
for (metric, value) in annotation_stats
    println("  $metric: $value")
end

Annotation Consistency

Check for consistency between different annotation methods

println("--- Annotation Consistency ---")

TODO: Implement annotation consistency checks

  • Compare ab initio vs homology predictions
  • Validate functional annotations
  • Check for conflicting annotations
  • Assess annotation confidence

Annotation Validation

Validate annotations using external evidence

println("--- Annotation Validation ---")

TODO: Implement annotation validation

  • Cross-reference with literature
  • Validate with experimental data
  • Check annotation standards compliance
  • Assess annotation quality scores

Part 5: Comparative Annotation

Use comparative genomics to improve annotation quality

println("\n=== Comparative Annotation ===")

Ortholog Identification

Identify corresponding genes in related species

println("--- Ortholog Identification ---")

TODO: Implement ortholog identification

  • Bidirectional best hits
  • Ortholog clustering
  • Phylogenetic analysis
  • Synteny-based validation

Synteny Analysis

Analyze conserved gene order for annotation validation

println("--- Synteny Analysis ---")

TODO: Implement synteny analysis

  • Identify syntenic blocks
  • Validate gene annotations
  • Detect gene duplications
  • Analyze evolutionary events

Evolutionary Analysis

Analyze gene evolution patterns

println("--- Evolutionary Analysis ---")

TODO: Implement evolutionary analysis

  • Selection pressure analysis
  • Gene family evolution
  • Horizontal gene transfer detection
  • Pseudogene identification

Part 6: Specialized Annotation Types

Handle organism-specific annotation challenges

println("\n=== Specialized Annotation ===")

Prokaryotic Annotation

Features specific to bacterial and archaeal genomes

println("--- Prokaryotic Annotation ---")

TODO: Implement prokaryotic-specific annotation

  • Operon prediction
  • Sigma factor binding sites
  • Ribosome binding sites
  • CRISPR arrays
prokaryotic_features = Dict(
    "operons" => 3,
    "sigma_sites" => 12,
    "ribosome_binding_sites" => 5,
    "crispr_arrays" => 1
)

println("Prokaryotic Features:")
for (feature, count) in prokaryotic_features
    println("  $feature: $count")
end

Eukaryotic Annotation

Features specific to eukaryotic genomes

println("--- Eukaryotic Annotation ---")

TODO: Implement eukaryotic-specific annotation

  • Intron-exon structure
  • Alternative splicing
  • Pseudogenes
  • Non-coding RNAs

Viral Annotation

Features specific to viral genomes

println("--- Viral Annotation ---")

TODO: Implement viral-specific annotation

  • Overlapping genes
  • Frameshift elements
  • Regulatory sequences
  • Host interaction factors

Part 7: Annotation Visualization

Create visualizations for annotation results

println("\n=== Annotation Visualization ===")

Genome Browser Tracks

Generate tracks for genome browsers

println("--- Genome Browser Tracks ---")

TODO: Implement genome browser track generation

  • Gene structure tracks
  • Functional annotation tracks
  • Comparative annotation tracks
  • Quality assessment tracks

Annotation Summary Plots

Create summary visualizations

println("--- Annotation Summary Plots ---")

TODO: Implement annotation visualization

  • Functional category pie charts
  • Annotation quality histograms
  • Comparative annotation plots
  • Pathway enrichment plots

Part 8: Annotation File Formats

Work with standard annotation file formats

println("\n=== Annotation File Formats ===")

GFF3 Format

Generate and manipulate GFF3 files

println("--- GFF3 Format ---")

TODO: Implement GFF3 handling

  • Generate GFF3 from predictions
  • Validate GFF3 format
  • Convert between formats
  • Merge annotation files

Generate example GFF3 content

gff3_content = """
##gff-version 3
##sequence-region test_genome 1 50000
test_genome\tprodigal\tgene\t1000\t2500\t0.95\t+\t.\tID=gene_1;Name=gene_1
test_genome\tprodigal\tCDS\t1000\t2500\t0.95\t+\t0\tID=cds_1;Parent=gene_1
test_genome\tprodigal\tgene\t3000\t4200\t0.88\t-\t.\tID=gene_2;Name=gene_2
test_genome\tprodigal\tCDS\t3000\t4200\t0.88\t-\t0\tID=cds_2;Parent=gene_2
"""

gff3_file = "annotations.gff3"
open(gff3_file, "w") do io
    write(io, gff3_content)
end

println("GFF3 file generated: $gff3_file")

GenBank Format

Generate GenBank format files

println("--- GenBank Format ---")

TODO: Implement GenBank format handling

  • Generate GenBank files
  • Include sequence and annotations
  • Validate format compliance
  • Extract annotations from GenBank

Part 9: Annotation Pipelines

Integrate annotation steps into comprehensive pipelines

println("\n=== Annotation Pipelines ===")

Automated Annotation Pipeline

Create automated annotation workflows

println("--- Automated Pipeline ---")

TODO: Implement automated annotation pipeline

  • Combine multiple prediction methods
  • Automated quality control
  • Standardized output formats
  • Batch processing capabilities

Manual Curation Interface

Tools for manual annotation improvement

println("--- Manual Curation ---")

TODO: Implement manual curation tools

  • Interactive annotation editor
  • Evidence integration interface
  • Collaborative annotation platform
  • Version control for annotations

Part 10: Best Practices and Guidelines

Recommendations for high-quality annotation

println("\n=== Annotation Best Practices ===")

println("General Principles:")
println("- Use multiple complementary approaches")
println("- Validate predictions with experimental data")
println("- Maintain annotation standards compliance")
println("- Document annotation procedures and evidence")
println()
println("Quality Thresholds:")
println("- Functional annotation: >80% of genes")
println("- Homology confidence: >70% identity, >80% coverage")
println("- GO term coverage: >60% of genes")
println("- Manual validation: High-confidence predictions")
println()
println("Organism-Specific Considerations:")
println("- Prokaryotes: Focus on operons and regulation")
println("- Eukaryotes: Handle alternative splicing carefully")
println("- Viruses: Account for overlapping genes")
println("- Metagenomes: Use taxonomic context")

Summary

println("\n=== Gene Annotation Summary ===")
println("✓ Understanding structural and functional annotation approaches")
println("✓ Implementing ab initio and homology-based gene prediction")
println("✓ Assigning functional annotations and GO terms")
println("✓ Evaluating annotation quality and completeness")
println("✓ Applying comparative annotation techniques")
println("✓ Handling organism-specific annotation challenges")
println("✓ Working with standard annotation file formats")
println("✓ Creating comprehensive annotation pipelines")

Cleanup

cleanup_files = [genome_file, gff3_file]
for file in cleanup_files
    if isfile(file)
        rm(file, force=true)
    end
end

println("\nNext: Tutorial 7 - Comparative Genomics")

nothing