Metagenomic Analysis Workflow

This document describes the metagenomic analysis workflow implemented in Mycelia, including read-level classification, assembly, taxonomy assignment, and validation strategies.

Overview

The Mycelia metagenomic workflow implements a triangulated approach to taxonomic classification, combining evidence from multiple sources:

  1. Read-level classification - Direct mapping of reads to reference databases
  2. Contig-level DNA taxonomy - BLAST-based classification of assembled contigs
  3. Contig-level protein taxonomy - MMseqs2-based classification against UniRef databases
flowchart TD
  R[Sequencing reads]

  R --> E1[Read level mapping<br/>to reference databases]

  R --> Asm[Assembly into contigs]
  Asm --> E2[Contig DNA taxonomy<br/>BLAST to NCBI and others]
  Asm --> E3[Contig protein taxonomy<br/>mmseqs2 to UniRef sets]

  R --> MapAsm[Reads mapped to contigs]
  MapAsm --> E2
  MapAsm --> E3

  E1 --> Tri[Triangulated taxonomic assignment]
  E2 --> Tri
  E3 --> Tri

  Tri --> Bench[Benchmark against other tools<br/>kmer marker gene protein based]

Full Workflow

The complete metagenomic workflow includes quality control, multiple assembly strategies, variant calling, and pangenome analysis:

flowchart TD
  R[Raw reads] --> Q[QC<br/>FastQC<br/>fastp (short)<br/>fastplong (long)<br/>filtlong / chopper (long)<br/>trim_galore (Illumina PE)]

  Q --> DBs[Reference databases<br/>NCBI Core Nucleotide<br/>IMG VR v5 and others]

  Q --> RL_map
  subgraph RL [Read level classification]
    direction TB
    RL_map[Minimap2 mapping<br/>short and long reads] --> RL_score[Weighted scoring and LCA<br/>top hit vs next hit<br/>coverage threshold]
  end
  DBs --> RL_map

  Q --> RL_val[Validation tools / profiling<br/>sourmash<br/>MetaPhlAn<br/>Metabuli<br/>Sylph<br/>Kraken / Kraken2<br/>CoverM]

  Q --> SR_asm
  Q --> LR_asm
  subgraph ASM [Metagenomic assembly]
    direction LR
    SR_asm[Short read assembly<br/>MEGAHIT<br/>metaSPAdes<br/>MetaVelvet<br/>Penguinn]
    LR_asm[Long read assembly<br/>myloasm<br/>hifiasm meta (hifiasm-meta)<br/>metaMDBG (metamdbg)<br/>metaFlye (metaflye)]
  end

  SR_asm --> Bin[Binning]
  LR_asm --> Bin
  subgraph BIN [Binning & Refinement]
    direction TB
    Bin --> B1[MetaBAT / MetaBAT2]
    Bin --> B2[MaxBin]
    Bin --> B3[VAMB / GenomeFace]
    B1 --> Cons[DAS Tool Refinement]
    B2 --> Cons
    B3 --> Cons
  end
  Cons --> MAGs[MAGs]
  MAGs --> Drep[De-replication<br/>dRep, galah]
  MAGs --> MAGtax[Bin taxonomy assignment<br/>(BLAST / mmseqs2)]
  MAGtax --> MAGqc_micro[Microbial MAG QC<br/>CheckM2]
  MAGtax --> MAGqc_viral[Viral MAG QC<br/>CheckV, geNomad]
  MAGqc_micro --> Drep
  MAGqc_viral --> Drep
  Drep --> Clust[Similarity / clustering<br/>FastANI, skani<br/>mash, dashing2<br/>BLAST]

  SR_asm --> Phasing[Strain Phasing]
  LR_asm --> Phasing
  subgraph STR [Strain Resolution]
    direction TB
    Phasing --> S1[Floria / Strainy]
    Phasing --> S2[STRONG / HyLight]
  end

  SR_asm --> DNA_tax
  LR_asm --> DNA_tax
  DBs --> DNA_tax
  DNA_tax[DNA level taxonomy<br/>BLAST vs CoreNT and others]

  SR_asm --> Prot_tax
  LR_asm --> Prot_tax
  Prot_tax[Protein level taxonomy<br/>mmseqs2 vs UniRef 50 90 100]

  Q --> MapAsm[Reads mapped to contigs<br/>minimap2]
  MapAsm --> CovQC[Coverage QC<br/>mosdepth, qualimap<br/>CoverM]
  MapAsm --> Pangraph[Pangenome and graph alignment<br/>metapangenomes]
  Pangraph --> VarPan[Variant calling graph based<br/>PGGB<br/>VG<br/>Cactus]
  Pangraph --> VarAlign[Variant calling alignment based<br/>short reads GATK FreeBayes BCFtools<br/>long reads Clair3]

  MapAsm --> DNA_tax
  MapAsm --> Prot_tax

  SR_asm --> QCcontig
  LR_asm --> QCcontig
  QCcontig[Assembly QC (contigs)<br/>QUAST / metaQUAST, BUSCO<br/>viral CheckV, geNomad<br/>microbial CheckM / CheckM2]
  CovQC --> QCcontig

  RL_score --> Tri[Triangulated taxonomic assignment]
  DNA_tax --> Tri
  Prot_tax --> Tri

  Tri --> Bench[Benchmark against other tools<br/>kmer marker gene protein based]
  Bench --> RL_val

  Tri --> C[Classifications]
  VarPan --> V[Variants]
  VarAlign --> V[Variants]
  Pangraph --> P[Pangenomes]

Workflow Variants (Short, Long, Hybrid)

The "Full Workflow" diagram above shows all major components. In practice, users typically follow one of the read-type-specific variants below, then optionally converge on common downstream steps (assembly QC, binning, post-binning QC, dereplication, strain resolution, variants/pangenomes).

Short-Read Only (Illumina)

flowchart TD
  SR[Short reads (Illumina)] --> QC[QC<br/>FastQC<br/>fastp / trim_galore]
  QC --> Prof[Profiling / validation (reads)<br/>sourmash, MetaPhlAn, Metabuli, Sylph, Kraken]
  QC --> Asm[Short-read assembly<br/>MEGAHIT, metaSPAdes, MetaVelvet]

  Asm --> Contigs[Contigs]
  QC --> Map[Map reads<br/>to contigs or references<br/>minimap2]
  Map --> Cov[Coverage QC<br/>mosdepth, qualimap, CoverM]
  Contigs --> Assess[Assembly assessment<br/>QUAST / metaQUAST, BUSCO]
  Cov --> Assess

  Map --> Var[Variant calling (short)<br/>GATK, FreeBayes, BCFtools]
  Var --> VCF[VCF]

  Assess --> Bin[Binning<br/>MetaBAT2, MaxBin, VAMB, GenomeFace,<br/>MetaCoAG, COMEBin, TaxVAMB]
  Bin --> MAGs[MAGs]
  MAGs --> Tax[Bin taxonomy<br/>BLAST / mmseqs2]
  Tax --> Micro[Microbial QC<br/>CheckM2]
  Tax --> Viral[Viral QC<br/>CheckV, geNomad]
  Micro --> Catalog[Curated MAG set]
  Viral --> Catalog
  Catalog --> Drep[De-replication/merging<br/>dRep, MAGmax]
  Drep --> Clust[Similarity/clustering<br/>skani, FastANI, mash]

Long-Read Only (ONT / PacBio)

flowchart TD
  LR[Long reads (ONT / PacBio)] --> QC[QC<br/>FastQC<br/>fastplong / filtlong / chopper]
  QC --> Prof[Profiling / validation (reads)<br/>sourmash, MetaPhlAn, Metabuli, Sylph, Kraken]
  QC --> Asm[Long-read assembly<br/>metaflye, metamdbg, hifiasm-meta, myloasm]

  Asm --> Contigs[Contigs]
  QC --> Map[Map reads<br/>to contigs or references<br/>minimap2]
  Map --> Cov[Coverage QC<br/>mosdepth, qualimap, CoverM]
  Contigs --> Assess[Assembly assessment<br/>QUAST / metaQUAST, BUSCO]
  Cov --> Assess

  Map --> Var[Variant calling (long)<br/>Clair3]
  Var --> VCF[VCF]

  Assess --> Bin[Binning<br/>MetaBAT2, MaxBin, VAMB, GenomeFace,<br/>MetaCoAG, COMEBin, TaxVAMB]
  Bin --> MAGs[MAGs]
  MAGs --> Tax[Bin taxonomy<br/>BLAST / mmseqs2]
  Tax --> Micro[Microbial QC<br/>CheckM2]
  Tax --> Viral[Viral QC<br/>CheckV, geNomad]
  Micro --> Catalog[Curated MAG set]
  Viral --> Catalog
  Catalog --> Drep[De-replication/merging<br/>dRep, MAGmax]
  Drep --> Clust[Similarity/clustering<br/>skani, FastANI, mash]

Hybrid (Short + Long)

flowchart TD
  SR[Short reads] --> QC1[QC (short)<br/>FastQC<br/>fastp / trim_galore]
  LR[Long reads] --> QC2[QC (long)<br/>FastQC<br/>fastplong / filtlong / chopper]
  QC1 --> Asm[Hybrid or combined assembly<br/>unicycler (hybrid)<br/>or separate SR/LR assembly]
  QC2 --> Asm

  Asm --> Contigs[Contigs]
  QC1 --> Map[Map reads<br/>to contigs or references<br/>minimap2]
  QC2 --> Map
  Map --> Cov[Coverage QC<br/>mosdepth, qualimap, CoverM]
  Contigs --> Assess[Assembly assessment<br/>QUAST / metaQUAST, BUSCO]
  Cov --> Assess

  Map --> VarSR[Variant calling (short)<br/>GATK, FreeBayes, BCFtools]
  Map --> VarLR[Variant calling (long)<br/>Clair3]
  VarSR --> VCF[VCF]
  VarLR --> VCF

Workflow Components

Quality Control

Initial read quality assessment and filtering using:

  • FastQC for quality metrics
  • Adapter trimming and quality filtering
  • fastp - Short-read QC, trimming, filtering, reports
  • fastplong - Long-read QC, trimming, filtering, reports (fastp-long)
  • filtlong - Long-read filtering
  • chopper - Long-read filtering
  • trim_galore - Illumina adapter trimming for paired-end reads

Read-Level Classification

Direct classification of reads using minimap2 mapping against reference databases, with weighted scoring based on:

  • Top hit vs next hit ratio
  • Coverage thresholds
  • LCA (Lowest Common Ancestor) assignment for ambiguous mappings

Taxonomic Profiling

Containment & Estimation:

  • sourmash - Fast search and containment estimation using MinHash sketches

Profiling & Classification:

  • CoverM - Read coverage calculator
  • MetaPhlAn - Marker gene-based profiling
  • Sylph - Precision profiling for metagenomes
  • Metabuli - DNA/Amino acid joint analysis
  • Kraken - Exact k-mer classification

Assembly Strategies

Short Read Assembly:

  • MEGAHIT - memory-efficient assembly (iterative de Bruijn graph)
  • metaSPAdes - metagenome-specific assembly (high-quality short-read assembly)
  • Penguinn - guided assembly with protein references
  • MetaVelvet - extension of Velvet for metagenomics

Long Read Assembly:

  • myloasm - plasmid-aware assembly
  • hifiasm-meta - HiFi metagenome assembly
  • metaMDBG / metamdbg - minimizer-space de Bruijn graph assembly
  • metaFlye / metaflye - long-read metagenome assembly

Hybrid / Other Assemblers (also available in Mycelia, depending on your inputs):

  • Unicycler (hybrid short+long)
  • SPAdes, SKESA, Velvet
  • Flye, Canu, Hifiasm

Taxonomy Assignment

DNA-level:

  • BLAST against NCBI Core Nucleotide
  • Custom reference databases

Protein-level:

  • MMseqs2 against UniRef50, UniRef90, UniRef100
  • DIAMOND as a fast protein search alternative (where appropriate)
  • Sensitive homology detection for divergent sequences

Quality Control and Validation

Assembly QC by read mapping:

  • Map reads back to contigs with minimap2 and compute coverage
  • mosdepth - Depth/coverage summaries from BAM/CRAM
  • qualimap - Alignment/QC reports from BAM
  • CoverM - Per-contig and per-genome/bin coverage/abundance tables

Assembly assessment:

  • QUAST / metaQUAST for assembly statistics (metagenome mode when appropriate)
  • BUSCO for completeness assessment

Viral contigs:

  • CheckV for completeness and contamination
  • geNomad for viral/mobile element identification (viruses/plasmids)

Microbial contigs:

  • CheckM/CheckM2 for completeness and contamination

General:

  • QUAST for assembly statistics
  • BUSCO for completeness assessment

Binning & Refinement

  • DAS Tool - Consensus binning refinement
  • MetaBAT / MetaBAT2 - Tetranucleotide frequency binning
  • GenomeFace - Deep learning-based binning
  • MaxBin - Expectation-maximization algorithm
  • MetaCoAG - Graph- and coverage-aware binning
  • COMEBin - Ensemble binning / refinement
  • TaxVAMB - Taxonomy-guided VAMB variant
  • Taxometer - Taxonomy-aware binning support
  • VAMB - Variational Autoencoder Microbial Binner

Post-binning QC (MAGs)

After binning, apply QC based on the bin’s classification/expected biology:

  • Microbial MAGs: CheckM2 (and optionally legacy CheckM)
  • Viral MAGs / viral contigs: CheckV and geNomad

Reference Databases

Mycelia integrates with high-quality public registries for taxonomic assignment:

  • Microbial: GTDB, NCBI RefSeq Prokaryota (ref_prok_rep_genomes)
  • Viral: IMG/VR v5, NCBI RefSeq Viral (ref_viruses_rep_genomes)
  • Plasmids: PLSDB, NCBI Plasmid RefSeq
  • Protein: UniProt (SwissProt/TrEMBL), UniRef100/90/50

Similarity, Clustering & De-replication

  • FastANI - Fast Whole-Genome Similarity (ANI)
  • skani - Fast ANI for metagenomic-assembled genomes (MAGs)
  • dashing2 - HyperLogLog sketching for distance estimation
  • galah - Representative genome selection (dereplication)
  • dRep - Genome de-replication pipeline
  • mash - MinHash distance estimation
  • Blast - Pairwise Blast-based ANI

Strain-Level Phasing

  • Floria - Strain-aware assembly/phasing
  • Strainy - Phasing from metagenomic reads
  • STRONG - Strain Resolution ON Graphs
  • HyLight - Hybrid assembly graph phasing

Variant Calling

Graph-based:

  • PGGB (PanGenome Graph Builder)
  • VG toolkit
  • Cactus

Alignment-based:

  • Short reads: GATK HaplotypeCaller, FreeBayes, BCFtools
  • Long reads: Clair3

Tool Inputs → Outputs (Quick Reference)

This section answers: “what does a tool consume (reads/contigs/bins/BAM) and what does it produce?” For the broader repository-wide inventory (including many non-metagenomic utilities), see docs/src/workflow-map.md.

StageTool(s)InputOutputNotes
Read QC (short)FastQC, fastp, Trim GaloreIllumina FASTQFiltered/trimmed FASTQ + HTML/JSONWrapped: run_fastqc, qc_filter_short_reads_fastp, trim_galore_paired
Read QC (long)FastQC, fastplong, Filtlong, ChopperONT/PacBio FASTQFiltered FASTQ + QC reportsWrapped: qc_filter_long_reads_fastplong, qc_filter_long_reads_filtlong, qc_filter_long_reads_chopper
Read profilingsourmash, MetaPhlAn, Metabuli, SylphFASTQ readsTaxonomic profiles/reportsWrapped: run_sourmash_*, run_metaphlan, run_metabuli_classify, run_sylph_profile
Read profiling (k-mer exact)Kraken / Kraken2FASTQ readsClassification reportsParser exists (parse_kraken_report); runner wrapper not present in src/ yet
Read/contig mappingminimap2Reads + contigs/referenceSAM/BAMWrapped: minimap_map* helpers
Coverage QCmosdepth, Qualimap, CoverMBAM/CRAMCoverage tables + reportsWrapped: run_mosdepth, run_qualimap_bamqc, run_coverm_contig, run_coverm_genome
Assembly assessmentQUAST / metaQUAST, BUSCOAssembly FASTAQC summary tables + reportsWrapped: run_quast, run_busco
Assembly QC (microbial)CheckM2Microbial assemblies/binsCompleteness/contamination metricsWrapped: run_checkm2
Assembly QC (viral)CheckV, geNomadViral contigs/binsViral QC + viral/mobile element callsWrapped: run_checkv, run_genomad
Assembly (short)MEGAHIT, metaSPAdes, MetaVelvetShort readsContigs FASTAWrapped: run_megahit, run_metaspades, run_metavelvet
Assembly (short, option)PenguinnShort readsContigs FASTAMentioned as an option; runner wrapper not present in src/ yet
Assembly (long)metaFlye, metaMDBG, hifiasm-metaLong readsContigs FASTAWrapped: run_metaflye, run_metamdbg, run_hifiasm_meta
Assembly (long, option)myloasmLong readsContigs FASTAMentioned as an option; runner wrapper not present in src/ yet
Assembly (hybrid)UnicyclerShort + long readsContigs FASTAWrapped: run_unicycler
Polishing / error correctionApollo, HomopolishAssembly FASTA + readsPolished assembly FASTAWrapped: run_apollo, run_homopolish
Contig/MAG taxonomyBLAST, MMseqs2, DIAMONDContigs/MAGs + databasesHit tables + assignmentsWrapped: run_mmseqs_search, run_blast*, run_diamond_search (see docs/src/workflow-map.md)
BinningMetaBAT2, MaxBin, VAMB, GenomeFace, MetaCoAG, COMEBin, TaxVAMB, TaxometerContigs + depth/coverage tablesMAG/bin FASTAs + bin tablesWrapped: src/binning.jl runners
Post-binningdRep, MAGmaxMAG/bin FASTAs / bin setsDereplicated/merged catalogsWrapped: run_drep_dereplicate, run_magmax_merge
Similarity/clusteringskani, FastANI, mashAssemblies/MAGsDistance/ANI tablesWrapped: skani_dist, skani_triangle, fastani_*, run_mash_comparison
Similarity/clustering (options)dashing2, galahAssemblies/MAGsDistance tables / representative setsMentioned as options; runner wrappers not present in src/ yet
Strain resolutionFloria, Strainy, STRONG, HyLightReads + contigs/graphs (tool-specific)Strain-resolved assemblies/pathsRunner wrappers exist in src/assembly.jl (tests may be opt-in/commented)
Variants (short)GATK, FreeBayes, BCFtoolsBAM + reference FASTAVCFWrapped in src/variant-analysis.jl
Variants (long)Clair3BAM + reference FASTAVCFWrapped in src/variant-analysis.jl
PangenomesPGGB, VG, CactusAssembliesPangenome graphs + alignmentsWrapped in pangenome modules (see docs/src/workflow-map.md)

Integration with Rhizomorph

The Mycelia workflow integrates with the Rhizomorph graph-based assembly system to:

  • Build k-mer graphs from sequencing reads
  • Perform probabilistic assembly
  • Validate assemblies through graph traversal
  • Compare traditional assembly methods with graph-based approaches

Tool Status

See the Tool Wrapper Status document for current implementation status of each tool wrapper. See the Tool Wrapper Status document for current implementation status of each tool wrapper.