Workflow & Tool Map

A comprehensive, à la carte map of Mycelia’s capabilities. Each row connects:

Input data type (FASTQ/FASTA/GFA/BAM/VCF, etc.)
Mycelia entry points (Julia functions)
Wrapped third-party tools
Outputs (assemblies, QC tables, summary results, taxonomic abundance tables, etc.)
Tutorials that walk through the transformation

Planned capabilities are rendered in <span style="background-color:#f0f0f0;">shaded cells</span> so we can see what still needs to be built.

NOTE: If a workflow uses third-party tools or datasets, cite them. See References for citation guidance.

High-Level Data Flow (Mermaid)

flowchart LR
    subgraph Inputs
        FASTQ["FASTQ reads<br/>(Illumina/PacBio/Nanopore)"]
        FASTA["FASTA genomes / contigs / transcripts"]
        GFA["Assembly graphs<br/>(GFA / FASTG)"]
        BAM["Alignments<br/>(BAM / SAM / CRAM)"]
        VCF["Variants<br/>(VCF)"]
        REF["Reference genomes<br/>(NCBI / custom)"]
    end

    subgraph QC
        QCReads["Quality Control<br/>(fastp / filtlong / native QC)"]
        QCTables["QC tables<br/>+ HTML reports"]
    end

    subgraph Assembly
        Assemblers["Assemblers<br/>(MEGAHIT, metaSPAdes,<br/>Flye, hifiasm, Rhizomorph)"]
        GraphOps["Graph utilities<br/>(Rhizomorph / GFA I/O)"]
    end

    subgraph PostProc["Post-processing"]
        Validation["Assembly validation<br/>(QUAST, BUSCO, CheckM/CheckM2, MUMmer)"]
        Annotation["Gene annotation<br/>(Pyrodigal, BLAST+, MMSeqs2, tRNAscan-SE, TransTerm, MLST)"]
        Taxonomy["Taxonomic profiling<br/>(taxonomy utilities, planned Kraken-style wrappers)"]
        Pangenome["Pangenome & distance<br/>(Mash, pangenome graphs)"]
        Variants["Variant analysis<br/>(GATK / vcfeval / FreeBayes / Clair3 / BCFtools)"]
    end

    subgraph Outputs
        Tables["Summary & QC tables"]
        Abund["Taxonomic & functional abundance tables"]
        Graphs["Graphs<br/>(GFA / internal graph types)"]
        Reports["HTML / text reports & plots"]
    end

    FASTQ --> QCReads
    QCReads --> QCTables
    QCReads --> Assemblers
    FASTA --> Pangenome
    FASTA --> Assemblers
    GFA --> GraphOps
    GraphOps --> Assemblers
    Assemblers --> Validation
    Assemblers --> Annotation
    Assemblers --> Pangenome
    Assemblers --> Graphs

    BAM --> Validation
    BAM --> Variants
    BAM --> Taxonomy

    VCF --> Variants
    REF --> Variants
    REF --> Pangenome

    Validation --> Tables
    Annotation --> Tables
    Taxonomy --> Abund
    Pangenome --> Tables
    Pangenome --> Graphs
    Variants --> Tables
    QCTables --> Reports
    Tables --> Reports
    Abund --> Reports

Use this diagram as the “big picture”; the tables below are the detailed à la carte menu.

1.1 FASTQ / FASTA / GFA / BAM / VCF → Outputs

Input	Mycelia entry points	Wrapped tools	Outputs	Tutorial links
FASTQ short reads (Illumina)	`analyze_fastq_quality`, `calculate_gc_content`, `assess_duplication_rates`, `qc_filter_short_reads_fastp`, `trim_galore_paired`	fastp, Trim Galore	QC HTML reports, filtered FASTQ, per-sample QC tables	Quality Control
FASTQ long reads (ONT / PacBio)	`qc_filter_long_reads_filtlong`, native FASTQ QC utilities	Filtlong	Filtered FASTQ, length/quality distributions, QC tables	Quality Control
FASTA references (genomes, contigs, transcripts)	`open_fastx`, `convert_sequence`, `count_canonical_kmers`, distance functions	—	Canonical sequences, k-mer matrices, distance matrices	K-mer Analysis
Mixed FASTA / FASTQ sets	`count_canonical_kmers`, `jaccard_distance`, `kmer_counts_to_js_divergence`	—	K-mer spectra, Jaccard / JS distance tables, plots	K-mer Analysis
PacBio / Nanopore reads (simulated)	`simulate_pacbio_reads`, `simulate_nanopore_reads`	PBSim / DeepSimulator (via wrappers)	FASTQ read sets + ground-truth tables	Data Acquisition
Illumina reads (simulated)	`simulate_illumina_reads`	ART	FASTQ read sets + ground-truth tables	Data Acquisition
Public genomes (NCBI)	`download_genome_by_accession`, `prefetch_sra_runs`, `fasterq_dump_parallel`	NCBI datasets / Entrez / SRA Toolkit	FASTA/FASTQ datasets + accession metadata tables	Data Acquisition
FASTQ (assembly)	`Mycelia.Rhizomorph.assemble_genome` (front-end) → `run_megahit`, `run_metaspades`, `run_spades`, `run_flye`, `run_hifiasm`, `run_unicycler`	MEGAHIT, metaSPAdes, SPAdes, Flye/metaFlye, hifiasm, Canu, SKESA, Unicycler	Contig/scaffold FASTA, assembly logs, GFA/FASTG graphs	Genome Assembly
FASTQ (native / experimental assembly)	`mycelia_iterative_assemble`, `improve_read_set_likelihood`, `find_optimal_sequence_path`	— (native Rhizomorph graphs)	Iteratively improved reads, qualmer/string graphs, checkpoint metadata	<span style="background-color:#f0f0f0;">Planned: `12_rhizomorph_graphs.jl`</span>
GFA / FASTG assembly graphs	`Mycelia.Rhizomorph.read_gfa_next`, `Mycelia.Rhizomorph.build_kmer_graph`, `Mycelia.Rhizomorph.build_qualmer_graph`, `Mycelia.Rhizomorph.write_gfa_next`	—	Parsed graph objects, simplified graphs, re-exported GFA/FASTG	Graph Type Tutorials, Round-Trip Graphs
BAM / SAM / CRAM alignments	`xam_to_dataframe`, `visualize_genome_coverage`, `run_qualimap_bamqc`	samtools, Qualimap	Coverage plots, per-base depth tables, BAM-QC HTML reports	Tool Integration, Assembly Validation
Assemblies (FASTA) + references	`assess_assembly_quality`, `run_quast`, `run_busco`, `run_mummer`	QUAST, BUSCO, MUMmer, CheckM/CheckM2	Assembly QC summary tables, BUSCO metrics, alignment plots	Assembly Validation
Gene prediction inputs (assembled contigs)	`run_prodigal`, `run_pyrodigal`, `run_prodigal_gv`, `run_augustus`, `run_metaeuk`, `run_blastp_search`, `run_mmseqs_search`, `run_transterm`, `run_trnascan`, `run_mlst`	Prodigal, Pyrodigal, Prodigal-gv, Augustus, MetaEuk, BLAST+, MMSeqs2, TransTerm, tRNAscan-SE, MLST	GFF3 annotations, protein FASTA, functional/categorical tables	Gene Annotation
Multiple assemblies (FASTA genomes)	`build_genome_distance_matrix`, `pairwise_mash_distance_matrix`, `construct_pangenome_pggb`, `construct_pangenome_cactus`	Mash/MinHash, PGGB, cactus	Distance matrices, clustered panels, pangenome graphs, summary tables	Comparative Genomics
Taxonomic profiling inputs (classified alignments, tables)	`classify_taxonomy_aware_xam_table`, `plot_taxa_abundances`, `visualize_xam_classifications`	Native taxonomy utilities, Krona/plotting	Taxonomic abundance tables, Sankey/stacked plots	Tool Integration
Graph-derived sequences	`build_string_graph`, `string_to_ngram_graph`, `assemble_strings`	—	String graphs, reconstructed sequences, evidence-tracking tables	Round-Trip: String Graphs, N-gram Graphs
Graph-derived sequences	`build_string_graph`, `string_to_ngram_graph`, `assemble_strings`	—	String graphs, reconstructed sequences, evidence-tracking tables	Round-Trip: String Graphs, N-gram Graphs
Qualmer graphs	`build_qualmer_graph`, `get_qualmer_statistics`, simplification helpers	—	Quality-weighted sequence graphs, path correctness scores, diagnostic tables	<span style="background-color:#f0f0f0;">Planned: `12_rhizomorph_graphs.jl`</span>
Protein sequences	`reduce_amino_alphabet`, `analyze_amino_acids`	—	Reduced alphabet sequences, amino acid composition tables	Reduced Amino Acid Alphabets
Viroid / viral genomes	`Mycelia.Rhizomorph.assemble_genome` (circular options), `Mycelia.Rhizomorph.validate_assembly`	Flye/metaFlye, QUAST	Circular contig FASTA, QC tables	Viroid Assembly Workflow
VCF + BAM + reference FASTA	`run_gatk_haplotypecaller`, `run_freebayes`, `run_clair3`, `run_bcftools_call`, `normalize_vcf`, `update_fasta_with_vcf`, `run_vcfeval`, `evaluate_variant_calling_accuracy`	GATK HaplotypeCaller, FreeBayes, Clair3, BCFtools, RTG vcfeval	Normalized VCF, updated reference, ROC/precision–recall tables	Planned: Variant Calling Tutorial
Metagenomic reads (amplicon / shotgun)	`run_metaspades`, `run_metaflye`, `run_sourmash_sketch`, `run_metaphlan`, `run_metabuli_classify`, `run_sylph_profile`, `run_metabat2`, `run_vamb`, `run_taxvamb`, `run_taxometer`, `run_metacoag`, `run_genomeface`, `run_comebin`, `run_drep_dereplicate`, `run_magmax_merge`	metaSPAdes, MetaFlye, sourmash, MetaPhlAn, Metabuli, Sylph, MetaBAT2, VAMB, TaxVAMB, Taxometer, MetaCoAG, GenomeFace, COMEBin, dRep, MAGmax	MAG bins, taxonomic abundance tables, bin-quality summaries	Binning Workflow
GFA → interactive graph curation	<span style="background-color:#f0f0f0;">`curate_assembly_graph`</span>	<span style="background-color:#f0f0f0;">Bandage integration</span>	<span style="background-color:#f0f0f0;">Curated graph, manual edits exported to GFA</span>	<span style="background-color:#f0f0f0;">Planned: GFA Curation Tutorial</span>
Phylogenetic trees	<span style="background-color:#f0f0f0;">`construct_phylogeny`, `plot_phylogeny`</span>	<span style="background-color:#f0f0f0;">RAxML, IQ-TREE</span>	<span style="background-color:#f0f0f0;">Newick trees, phylogenetic plots</span>	<span style="background-color:#f0f0f0;">Planned: Phylogenetics Tutorial</span>

2. Capability Matrix (by Task)

This mirrors patch 2’s capability section and ensures everything is covered from a “what do I want to do?” perspective.

Capability	Entry points	Inputs	Outputs	Tooling
Data acquisition	`download_genome_by_accession`, `prefetch_sra_runs`, `fasterq_dump_parallel`	Accessions / run tables	FASTA/FASTQ, metadata	NCBI E-utilities, SRA Toolkit
Read simulation	`simulate_illumina_reads`, `simulate_pacbio_reads`, `simulate_nanopore_reads`	Reference FASTA, sim params	FASTQ reads, ground-truth tables	ART, PBSim, DeepSimulator
Quality filtering	`qc_filter_short_reads_fastp`, `qc_filter_long_reads_filtlong`, `trim_galore_paired`	FASTQ	Filtered FASTQ, HTML reports	fastp, Filtlong, Trim Galore
Quality metrics	`analyze_fastq_quality`, `assess_duplication_rates`, `robust_cv`, `robust_threshold`, `filter_genome_outliers`	FASTQ, QC tables	Summary QC tables, outlier flags	Native Julia
Assembly	`Mycelia.Rhizomorph.assemble_genome`, `run_megahit`, `run_metaspades`, `run_spades`, `run_flye`, `run_hifiasm`, `mycelia_iterative_assemble`	FASTQ, config	Contigs/scaffolds, assembly graphs	External assemblers + native graphs
Assembly validation	`assess_assembly_quality`, `run_quast`, `run_busco`, `run_mummer`, CheckM/CheckM2 integration	Assemblies + references	QC reports, summary tables	QUAST, BUSCO, MUMmer, CheckM
Graph utilities	`build_string_graph`, `build_fasta_graph`, `build_qualmer_graph`, `read_gfa_next`, `write_quality_biosequence_gfa`	FASTA/FASTQ/GFA	Graph objects, GFA exports	Native Julia (Rhizomorph)
Comparative genomics	`construct_pangenome_pggb`, `construct_pangenome_cactus`, `build_genome_distance_matrix`, `analyze_pangenome_kmers`	Assemblies, sketch files	Pangenome graphs, distance matrices	PGGB, cactus, Mash-like tools
Taxonomy	`classify_taxonomy_aware_xam_table`, `plot_taxa_abundances`, `visualize_xam_classifications`	Mapping tables, taxonomic refs	Abundance tables, taxonomy plots	Native Julia, plotting libs
Alignment & search	`run_blastp_search`, `run_mmseqs_search`, `run_diamond_search`, `run_minimap2` helpers	FASTA/FASTQ, databases	Alignments (SAM/BAM), hit tables	BLAST+, MMSeqs2, DIAMOND, minimap2
Variant analysis	`run_gatk_haplotypecaller`, `run_freebayes`, `run_clair3`, `run_bcftools_call`, `run_vcfeval`, `evaluate_variant_calling_accuracy`	BAM/CRAM, VCF, references	VCFs, ROC tables, summary stats	GATK, FreeBayes, Clair3, BCFtools, RTG vcfeval
Cloud & orchestration	`rclone_copy`, `rclone_sync`, `render_sbatch`, `submit_sbatch`, `lawrencium_sbatch`, `nersc_sbatch`	Paths, SLURM params	Synced data, sbatch scripts, submitted jobs	rclone, SLURM

3. Module coverage & doc gaps

3.1 Auto-included modules (from `src/Mycelia.jl`)

These bullets are a concise, user-facing summary of the auto-included modules.

Core graph + I/O: utility-functions.jl, alphabets.jl, constants.jl, fastx.jl, rhizomorph/rhizomorph.jl (core enums/evidence, fixed-length graphs, variable-length graphs, graph algorithms) → Exposed via Rhizomorph: build_ngram_graph, build_kmer_graph, build_qualmer_graph, build_fasta_graph, build_fastq_graph, path-finding & simplification utilities.
Assembly pipelines: assembly.jl (external assemblers), iterative-assembly.jl (mycelia_iterative_assemble, improve_read_set_likelihood, find_optimal_sequence_path), viterbi-next.jl.
Analytics & QC: quality-control-and-benchmarking.jl, performance-benchmarks.jl, kmer-analysis.jl, distance-metrics.jl.
Taxonomy & annotation: taxonomy-and-trees.jl, classification.jl, reference-databases.jl, annotation.jl.
Wrappers & orchestration: bioconda.jl, rclone.jl, slurm-sbatch.jl, neo4jl.jl, xam.jl.

3.2 External tool wrappers (no dedicated tutorials yet)

These wrappers are available in src/ but do not yet have dedicated tutorials. They are listed here so the function coverage map can point to an explicit doc location until tutorials are added.

Wrapper file	Entry points (selected)	Tool	Notes
`autocycler.jl`	`install_autocycler`, `run_autocycler`	Autocycler	Conda env + Autocycler bash pipeline.
`bcalm.jl`	`install_bcalm`, `run_bcalm`	BCALM	Converts unitigs to GFA via `convertToGFA.py`.
`foldseek.jl`	`install_foldseek`, `foldseek_easy_search`, `foldseek_easy_cluster`, `foldseek_createdb`, `foldseek_databases`	Foldseek	Structure search and clustering.
`ggcat.jl`	`install_ggcat`, `ggcat_build`, `ggcat_query`	GGCAT	Compacted or colored de Bruijn graphs.
`pantools.jl`	`run_pantools`, `pantools_cmd`, `write_pantools_genome_locations_file`, `write_pantools_annotation_locations_file`	PanTools	Pangenome toolkit wrapper via Bioconda.
`prokrustean.jl`	`install_prokrustean`, `prokrustean_build_graph`, `prokrustean_kmer_count`, `prokrustean_unitig_count`, `prokrustean_braycurtis`, `prokrustean_overlap`	Prokrustean	Builds from source; provides k-mer and unitig metrics.