Workflow & Tool Map

A comprehensive, à la carte map of Mycelia’s capabilities. Each row connects:

  • Input data type (FASTQ/FASTA/GFA/BAM/VCF, etc.)
  • Mycelia entry points (Julia functions)
  • Wrapped third-party tools
  • Outputs (assemblies, QC tables, summary results, taxonomic abundance tables, etc.)
  • Tutorials that walk through the transformation

Planned capabilities are rendered in <span style="background-color:#f0f0f0;">shaded cells</span> so we can see what still needs to be built.

NOTE: If a workflow uses third-party tools or datasets, cite them. See References for citation guidance.


High-Level Data Flow (Mermaid)

flowchart LR
    subgraph Inputs
        FASTQ["FASTQ reads<br/>(Illumina/PacBio/Nanopore)"]
        FASTA["FASTA genomes / contigs / transcripts"]
        GFA["Assembly graphs<br/>(GFA / FASTG)"]
        BAM["Alignments<br/>(BAM / SAM / CRAM)"]
        VCF["Variants<br/>(VCF)"]
        REF["Reference genomes<br/>(NCBI / custom)"]
    end

    subgraph QC
        QCReads["Quality Control<br/>(fastp / filtlong / native QC)"]
        QCTables["QC tables<br/>+ HTML reports"]
    end

    subgraph Assembly
        Assemblers["Assemblers<br/>(MEGAHIT, metaSPAdes,<br/>Flye, hifiasm, Rhizomorph)"]
        GraphOps["Graph utilities<br/>(Rhizomorph / GFA I/O)"]
    end

    subgraph PostProc["Post-processing"]
        Validation["Assembly validation<br/>(QUAST, BUSCO, CheckM/CheckM2, MUMmer)"]
        Annotation["Gene annotation<br/>(Pyrodigal, BLAST+, MMseqs2, tRNAscan-SE, TransTerm, MLST)"]
        Taxonomy["Taxonomic profiling<br/>(taxonomy utilities, planned Kraken-style wrappers)"]
        Pangenome["Pangenome & distance<br/>(Mash, pangenome graphs)"]
        Variants["Variant analysis<br/>(GATK / vcfeval / FreeBayes / Clair3 / BCFtools)"]
    end

    subgraph Outputs
        Tables["Summary & QC tables"]
        Abund["Taxonomic & functional abundance tables"]
        Graphs["Graphs<br/>(GFA / internal graph types)"]
        Reports["HTML / text reports & plots"]
    end

    FASTQ --> QCReads
    QCReads --> QCTables
    QCReads --> Assemblers
    FASTA --> Pangenome
    FASTA --> Assemblers
    GFA --> GraphOps
    GraphOps --> Assemblers
    Assemblers --> Validation
    Assemblers --> Annotation
    Assemblers --> Pangenome
    Assemblers --> Graphs

    BAM --> Validation
    BAM --> Variants
    BAM --> Taxonomy

    VCF --> Variants
    REF --> Variants
    REF --> Pangenome

    Validation --> Tables
    Annotation --> Tables
    Taxonomy --> Abund
    Pangenome --> Tables
    Pangenome --> Graphs
    Variants --> Tables
    QCTables --> Reports
    Tables --> Reports
    Abund --> Reports

Use this diagram as the “big picture”; the tables below are the detailed à la carte menu.
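
One way to query the diagram is to treat it as a directed graph and ask which output nodes a given input can reach. The sketch below is plain Julia with no Mycelia calls. The edge list is transcribed from the flowchart above, and the helper names (`adjacency`, `reachable`) are illustrative assumptions, not part of Mycelia's API.

```julia
# Edges transcribed from the flowchart above (node IDs as written there).
const EDGES = [
    "FASTQ" => "QCReads", "QCReads" => "QCTables", "QCReads" => "Assemblers",
    "FASTA" => "Pangenome", "FASTA" => "Assemblers",
    "GFA" => "GraphOps", "GraphOps" => "Assemblers",
    "Assemblers" => "Validation", "Assemblers" => "Annotation",
    "Assemblers" => "Pangenome", "Assemblers" => "Graphs",
    "BAM" => "Validation", "BAM" => "Variants", "BAM" => "Taxonomy",
    "VCF" => "Variants", "REF" => "Variants", "REF" => "Pangenome",
    "Validation" => "Tables", "Annotation" => "Tables", "Taxonomy" => "Abund",
    "Pangenome" => "Tables", "Pangenome" => "Graphs", "Variants" => "Tables",
    "QCTables" => "Reports", "Tables" => "Reports", "Abund" => "Reports",
]

# Nodes of the "Outputs" subgraph.
const OUTPUTS = Set(["Tables", "Abund", "Graphs", "Reports"])

# Build an adjacency list: node => directly downstream nodes.
function adjacency(edges)
    adj = Dict{String,Vector{String}}()
    for (src, dst) in edges
        push!(get!(adj, src, String[]), dst)
    end
    return adj
end

# Breadth-first search: every node reachable from `start` (excluding `start`).
function reachable(adj, start)
    seen = Set{String}()
    queue = [start]
    while !isempty(queue)
        node = popfirst!(queue)
        for next in get(adj, node, String[])
            if !(next in seen)
                push!(seen, next)
                push!(queue, next)
            end
        end
    end
    return seen
end

adj = adjacency(EDGES)
for input in ["FASTQ", "FASTA", "GFA", "BAM", "VCF", "REF"]
    outs = sort(collect(intersect(reachable(adj, input), OUTPUTS)))
    println(input, " -> ", join(outs, ", "))
end
```

As drawn, BAM is the only input with a path to the abundance tables (`Abund`), because `Taxonomy` has no other incoming edge.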


1. End-to-End Menu (by Input Type)

1.1 FASTQ / FASTA / GFA / BAM / VCF → Outputs

| Input | Mycelia entry points | Wrapped tools | Outputs | Tutorial links |
| --- | --- | --- | --- | --- |
| FASTQ short reads (Illumina) | analyze_fastq_quality, calculate_gc_content, assess_duplication_rates, qc_filter_short_reads_fastp, trim_galore_paired | fastp, Trim Galore | QC HTML reports, filtered FASTQ, per-sample QC tables | Quality Control |
| FASTQ long reads (ONT / PacBio) | qc_filter_long_reads_filtlong, native FASTQ QC utilities | Filtlong | Filtered FASTQ, length/quality distributions, QC tables | Quality Control |
| FASTA references (genomes, contigs, transcripts) | open_fastx, convert_sequence, count_canonical_kmers, distance functions |  | Canonical sequences, k-mer matrices, distance matrices | K-mer Analysis |
| Mixed FASTA / FASTQ sets | count_canonical_kmers, jaccard_distance, kmer_counts_to_js_divergence |  | K-mer spectra, Jaccard / JS distance tables, plots | K-mer Analysis |
| PacBio / Nanopore reads (simulated) | simulate_pacbio_reads, simulate_nanopore_reads | PBSim / DeepSimulator (via wrappers) | FASTQ read sets + ground-truth tables | Data Acquisition |
| Illumina reads (simulated) | simulate_illumina_reads | ART | FASTQ read sets + ground-truth tables | Data Acquisition |
| Public genomes (NCBI) | download_genome_by_accession, prefetch_sra_runs, fasterq_dump_parallel | NCBI datasets / Entrez / SRA Toolkit | FASTA/FASTQ datasets + accession metadata tables | Data Acquisition |
| FASTQ (assembly) | Mycelia.Rhizomorph.assemble_genome (front-end) → run_megahit, run_metaspades, run_spades, run_flye, run_hifiasm, run_unicycler | MEGAHIT, metaSPAdes, SPAdes, Flye/metaFlye, hifiasm, Canu, SKESA, Unicycler | Contig/scaffold FASTA, assembly logs, GFA/FASTG graphs | Genome Assembly |
| FASTQ (native / experimental assembly) | mycelia_iterative_assemble, improve_read_set_likelihood, find_optimal_sequence_path | None (native Rhizomorph graphs) | Iteratively improved reads, qualmer/string graphs, checkpoint metadata | <span style="background-color:#f0f0f0;">Planned: 12_rhizomorph_graphs.jl</span> |
| GFA / FASTG assembly graphs | Mycelia.Rhizomorph.read_gfa_next, Mycelia.Rhizomorph.build_kmer_graph, Mycelia.Rhizomorph.build_qualmer_graph, Mycelia.Rhizomorph.write_gfa_next |  | Parsed graph objects, simplified graphs, re-exported GFA/FASTG | Graph Type Tutorials, Round-Trip Graphs |
| BAM / SAM / CRAM alignments | xam_to_dataframe, visualize_genome_coverage, run_qualimap_bamqc | samtools, Qualimap | Coverage plots, per-base depth tables, BAM-QC HTML reports | Tool Integration, Assembly Validation |
| Assemblies (FASTA) + references | assess_assembly_quality, run_quast, run_busco, run_mummer | QUAST, BUSCO, MUMmer, CheckM/CheckM2 | Assembly QC summary tables, BUSCO metrics, alignment plots | Assembly Validation |
| Gene prediction inputs (assembled contigs) | run_prodigal, run_pyrodigal, run_prodigal_gv, run_augustus, run_metaeuk, run_blastp_search, run_mmseqs_search, run_transterm, run_trnascan, run_mlst | Prodigal, Pyrodigal, Prodigal-gv, Augustus, MetaEuk, BLAST+, MMseqs2, TransTerm, tRNAscan-SE, MLST | GFF3 annotations, protein FASTA, functional/categorical tables | Gene Annotation |
| Multiple assemblies (FASTA genomes) | build_genome_distance_matrix, pairwise_mash_distance_matrix, construct_pangenome_pggb, construct_pangenome_cactus | Mash/MinHash, PGGB, cactus | Distance matrices, clustered panels, pangenome graphs, summary tables | Comparative Genomics |
| Taxonomic profiling inputs (classified alignments, tables) | classify_taxonomy_aware_xam_table, plot_taxa_abundances, visualize_xam_classifications | Native taxonomy utilities, Krona/plotting | Taxonomic abundance tables, Sankey/stacked plots | Tool Integration |
| Graph-derived sequences | build_string_graph, string_to_ngram_graph, assemble_strings |  | String graphs, reconstructed sequences, evidence-tracking tables | Round-Trip: String Graphs, N-gram Graphs |
| Qualmer graphs | build_qualmer_graph, get_qualmer_statistics, simplification helpers |  | Quality-weighted sequence graphs, path correctness scores, diagnostic tables | <span style="background-color:#f0f0f0;">Planned: 12_rhizomorph_graphs.jl</span> |
| Protein sequences | reduce_amino_alphabet, analyze_amino_acids |  | Reduced alphabet sequences, amino acid composition tables | Reduced Amino Acid Alphabets |
| Viroid / viral genomes | Mycelia.Rhizomorph.assemble_genome (circular options), Mycelia.Rhizomorph.validate_assembly | Flye/metaFlye, QUAST | Circular contig FASTA, QC tables | Viroid Assembly Workflow |
| VCF + BAM + reference FASTA | run_gatk_haplotypecaller, run_freebayes, run_clair3, run_bcftools_call, normalize_vcf, update_fasta_with_vcf, run_vcfeval, evaluate_variant_calling_accuracy | GATK HaplotypeCaller, FreeBayes, Clair3, BCFtools, RTG vcfeval | Normalized VCF, updated reference, ROC/precision–recall tables | Planned: Variant Calling Tutorial |
| Metagenomic reads (amplicon / shotgun) | run_metaspades, run_metaflye, run_sourmash_sketch, run_metaphlan, run_metabuli_classify, run_sylph_profile, run_metabat2, run_vamb, run_taxvamb, run_taxometer, run_metacoag, run_genomeface, run_comebin, run_drep_dereplicate, run_magmax_merge | metaSPAdes, MetaFlye, sourmash, MetaPhlAn, Metabuli, Sylph, MetaBAT2, VAMB, TaxVAMB, Taxometer, MetaCoAG, GenomeFace, COMEBin, dRep, MAGmax | MAG bins, taxonomic abundance tables, bin-quality summaries | Binning Workflow |
| GFA → interactive graph curation | <span style="background-color:#f0f0f0;">curate_assembly_graph</span> | <span style="background-color:#f0f0f0;">Bandage integration</span> | <span style="background-color:#f0f0f0;">Curated graph, manual edits exported to GFA</span> | <span style="background-color:#f0f0f0;">Planned: GFA Curation Tutorial</span> |
| Phylogenetic trees | <span style="background-color:#f0f0f0;">construct_phylogeny, plot_phylogeny</span> | <span style="background-color:#f0f0f0;">RAxML, IQ-TREE</span> | <span style="background-color:#f0f0f0;">Newick trees, phylogenetic plots</span> | <span style="background-color:#f0f0f0;">Planned: Phylogenetics Tutorial</span> |
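
To make the k-mer rows of the menu above more concrete, here is a minimal sketch of what canonical k-mer counting and a Jaccard distance compute. It is plain Julia written for illustration only: it does not call Mycelia, and the names `toy_revcomp`, `toy_canonical_kmer_counts`, and `toy_jaccard_distance` are hypothetical stand-ins, not the signatures of `count_canonical_kmers` or `jaccard_distance`. It assumes uppercase A/C/G/T input.

```julia
# Reverse complement of a DNA string. Assumes only A/C/G/T;
# any other character raises a KeyError in this sketch.
function toy_revcomp(seq::AbstractString)
    comp = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(comp[c] for c in reverse(seq))
end

# Canonical form: the lexicographically smaller of a k-mer and its reverse
# complement, so both strands of the same site are counted together.
toy_canonical(kmer::AbstractString) = min(kmer, toy_revcomp(kmer))

# Count canonical k-mers with a sliding window of width k.
function toy_canonical_kmer_counts(seq::AbstractString, k::Int)
    counts = Dict{String,Int}()
    for i in 1:(length(seq) - k + 1)
        kmer = toy_canonical(seq[i:i+k-1])
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

# Jaccard distance on k-mer presence/absence: 1 - |A ∩ B| / |A ∪ B|.
function toy_jaccard_distance(a::Dict, b::Dict)
    ka, kb = Set(keys(a)), Set(keys(b))
    n_union = length(union(ka, kb))
    return n_union == 0 ? 0.0 : 1 - length(intersect(ka, kb)) / n_union
end

counts_a = toy_canonical_kmer_counts("ACGTAC", 3)
counts_b = toy_canonical_kmer_counts("ACGTTT", 3)
println(counts_a)
println(toy_jaccard_distance(counts_a, counts_b))
```

In this toy input, `ACG` and `CGT` are reverse complements of each other and collapse into the single canonical k-mer `ACG`. The two sequences share one canonical k-mer out of four distinct ones, giving a distance of 0.75.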

2. Capability Matrix (by Task)

This matrix covers the same capabilities as the menu above, regrouped by task for readers who start from “what do I want to do?” rather than from an input file type.

| Capability | Entry points | Inputs | Outputs | Tooling |
| --- | --- | --- | --- | --- |
| Data acquisition | download_genome_by_accession, prefetch_sra_runs, fasterq_dump_parallel | Accessions / run tables | FASTA/FASTQ, metadata | NCBI E-utilities, SRA Toolkit |
| Read simulation | simulate_illumina_reads, simulate_pacbio_reads, simulate_nanopore_reads | Reference FASTA, sim params | FASTQ reads, ground-truth tables | ART, PBSim, DeepSimulator |
| Quality filtering | qc_filter_short_reads_fastp, qc_filter_long_reads_filtlong, trim_galore_paired | FASTQ | Filtered FASTQ, HTML reports | fastp, Filtlong, Trim Galore |
| Quality metrics | analyze_fastq_quality, assess_duplication_rates, robust_cv, robust_threshold, filter_genome_outliers | FASTQ, QC tables | Summary QC tables, outlier flags | Native Julia |
| Assembly | Mycelia.Rhizomorph.assemble_genome, run_megahit, run_metaspades, run_spades, run_flye, run_hifiasm, mycelia_iterative_assemble | FASTQ, config | Contigs/scaffolds, assembly graphs | External assemblers + native graphs |
| Assembly validation | assess_assembly_quality, run_quast, run_busco, run_mummer, CheckM/CheckM2 integration | Assemblies + references | QC reports, summary tables | QUAST, BUSCO, MUMmer, CheckM |
| Graph utilities | build_string_graph, build_fasta_graph, build_qualmer_graph, read_gfa_next, write_quality_biosequence_gfa | FASTA/FASTQ/GFA | Graph objects, GFA exports | Native Julia (Rhizomorph) |
| Comparative genomics | construct_pangenome_pggb, construct_pangenome_cactus, build_genome_distance_matrix, analyze_pangenome_kmers | Assemblies, sketch files | Pangenome graphs, distance matrices | PGGB, cactus, Mash-like tools |
| Taxonomy | classify_taxonomy_aware_xam_table, plot_taxa_abundances, visualize_xam_classifications | Mapping tables, taxonomic refs | Abundance tables, taxonomy plots | Native Julia, plotting libs |
| Alignment & search | run_blastp_search, run_mmseqs_search, run_diamond_search, run_minimap2 helpers | FASTA/FASTQ, databases | Alignments (SAM/BAM), hit tables | BLAST+, MMseqs2, DIAMOND, minimap2 |
| Variant analysis | run_gatk_haplotypecaller, run_freebayes, run_clair3, run_bcftools_call, run_vcfeval, evaluate_variant_calling_accuracy | BAM/CRAM, VCF, references | VCFs, ROC tables, summary stats | GATK, FreeBayes, Clair3, BCFtools, RTG vcfeval |
| Cloud & orchestration | rclone_copy, rclone_sync, render_sbatch, submit_sbatch, lawrencium_sbatch, nersc_sbatch | Paths, SLURM params | Synced data, sbatch scripts, submitted jobs | rclone, SLURM |
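
As a rough picture of the "SLURM params → sbatch scripts" step in the orchestration row above, the sketch below renders a batch script as text. It is an assumption-laden illustration, not Mycelia's `render_sbatch`: the function `toy_render_sbatch`, its keyword names, and its defaults are all hypothetical. The `#SBATCH` directives and the fastp flags it emits are standard options of those tools.

```julia
# Hypothetical helper (NOT Mycelia's render_sbatch): builds the text of a
# SLURM batch script from a few common parameters and one shell command.
function toy_render_sbatch(; job_name::String, cmd::String,
                           walltime::String = "01:00:00",
                           cpus::Int = 1, mem::String = "4G")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=$(job_name)",
        "#SBATCH --time=$(walltime)",
        "#SBATCH --cpus-per-task=$(cpus)",
        "#SBATCH --mem=$(mem)",
        "",
        "set -euo pipefail",   # stop on the first failing command
        cmd,
    ]
    return join(lines, "\n") * "\n"
end

# File names here are placeholders.
script = toy_render_sbatch(
    job_name = "fastp-qc",
    cmd = "fastp --in1 reads_1.fq.gz --in2 reads_2.fq.gz --out1 clean_1.fq.gz --out2 clean_2.fq.gz",
    cpus = 4,
)
print(script)
```

Keeping rendering separate from submission means a script can be inspected or saved before anything reaches the scheduler, which matches the split between `render_sbatch` and `submit_sbatch` in the table.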

3. Module coverage & doc gaps

3.1 Auto-included modules (from src/Mycelia.jl)

These bullets are a concise, user-facing summary of the auto-included modules.

  • Core graph + I/O: utility-functions.jl, alphabets.jl, constants.jl, fastx.jl, rhizomorph/rhizomorph.jl (core enums/evidence, fixed-length graphs, variable-length graphs, graph algorithms) → Exposed via Rhizomorph: build_ngram_graph, build_kmer_graph, build_qualmer_graph, build_fasta_graph, build_fastq_graph, path-finding & simplification utilities.
  • Assembly pipelines: assembly.jl (external assemblers), iterative-assembly.jl (mycelia_iterative_assemble, improve_read_set_likelihood, find_optimal_sequence_path), viterbi-next.jl.
  • Analytics & QC: quality-control-and-benchmarking.jl, performance-benchmarks.jl, kmer-analysis.jl, distance-metrics.jl.
  • Taxonomy & annotation: taxonomy-and-trees.jl, classification.jl, reference-databases.jl, annotation.jl.
  • Wrappers & orchestration: bioconda.jl, rclone.jl, slurm-sbatch.jl, neo4jl.jl, xam.jl.

3.2 External tool wrappers (no dedicated tutorials yet)

These wrappers are available in src/ but do not yet have dedicated tutorials. They are listed here so the function coverage map can point to an explicit doc location until tutorials are added.

| Wrapper file | Entry points (selected) | Tool | Notes |
| --- | --- | --- | --- |
| autocycler.jl | install_autocycler, run_autocycler | Autocycler | Conda env + Autocycler bash pipeline. |
| bcalm.jl | install_bcalm, run_bcalm | BCALM | Converts unitigs to GFA via convertToGFA.py. |
| foldseek.jl | install_foldseek, foldseek_easy_search, foldseek_easy_cluster, foldseek_createdb, foldseek_databases | Foldseek | Structure search and clustering. |
| ggcat.jl | install_ggcat, ggcat_build, ggcat_query | GGCAT | Compacted or colored de Bruijn graphs. |
| pantools.jl | run_pantools, pantools_cmd, write_pantools_genome_locations_file, write_pantools_annotation_locations_file | PanTools | Pangenome toolkit wrapper via Bioconda. |
| prokrustean.jl | install_prokrustean, prokrustean_build_graph, prokrustean_kmer_count, prokrustean_unitig_count, prokrustean_braycurtis, prokrustean_overlap | Prokrustean | Builds from source; provides k-mer and unitig metrics. |