Workflow & Tool Map

A comprehensive, à la carte map of Mycelia’s capabilities. Each row connects:

  • Input data type (FASTQ/FASTA/GFA/BAM/VCF, etc.)
  • Mycelia entry points (Julia functions)
  • Wrapped third-party tools
  • Outputs (assemblies, QC tables, summary results, taxonomic abundance tables, etc.)
  • Tutorials that walk through the transformation

Planned capabilities are rendered in <span style="background-color:#f0f0f0;">shaded cells</span> so we can see what still needs to be built.


High-Level Data Flow (Mermaid)

flowchart LR
    subgraph Inputs
        FASTQ["FASTQ reads<br/>(Illumina/PacBio/Nanopore)"]
        FASTA["FASTA genomes / contigs / transcripts"]
        GFA["Assembly graphs<br/>(GFA / FASTG)"]
        BAM["Alignments<br/>(BAM / SAM / CRAM)"]
        VCF["Variants<br/>(VCF)"]
        REF["Reference genomes<br/>(NCBI / custom)"]
    end

    subgraph QC
        QCReads["Quality Control<br/>(fastp / filtlong / native QC)"]
        QCTables["QC tables<br/>+ HTML reports"]
    end

    subgraph Assembly
        Assemblers["Assemblers<br/>(MEGAHIT, metaSPAdes,<br/>Flye, hifiasm, Rhizomorph)"]
        GraphOps["Graph utilities<br/>(Rhizomorph / GFA I/O)"]
    end

    subgraph PostProc["Post-processing"]
        Validation["Assembly validation<br/>(QUAST, BUSCO, CheckM/CheckM2, MUMmer)"]
        Annotation["Gene annotation<br/>(Pyrodigal, BLAST+, MMSeqs2, tRNAscan-SE, TransTerm, MLST)"]
        Taxonomy["Taxonomic profiling<br/>(taxonomy utilities, planned Kraken-style wrappers)"]
        Pangenome["Pangenome & distance<br/>(Mash, pangenome graphs)"]
        Variants["Variant analysis<br/>(GATK / vcfeval, planned)"]
    end

    subgraph Outputs
        Tables["Summary & QC tables"]
        Abund["Taxonomic & functional abundance tables"]
        Graphs["Graphs<br/>(GFA / internal graph types)"]
        Reports["HTML / text reports & plots"]
    end

    FASTQ --> QCReads
    QCReads --> QCTables
    QCReads --> Assemblers
    FASTA --> Pangenome
    FASTA --> Assemblers
    GFA --> GraphOps
    GraphOps --> Assemblers
    Assemblers --> Validation
    Assemblers --> Annotation
    Assemblers --> Pangenome
    Assemblers --> Graphs

    BAM --> Validation
    BAM --> Variants
    BAM --> Taxonomy

    VCF --> Variants
    REF --> Variants
    REF --> Pangenome

    Validation --> Tables
    Annotation --> Tables
    Taxonomy --> Abund
    Pangenome --> Tables
    Pangenome --> Graphs
    Variants --> Tables
    QCTables --> Reports
    Tables --> Reports
    Abund --> Reports

Use this diagram as the “big picture”; the tables below are the detailed à la carte menu.
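
One way to read the diagram programmatically is to treat it as a directed graph and ask which output nodes a given input can reach. The sketch below copies the edges from the flowchart above; the helper names (`reachable`, `reachable_outputs`) are made up for illustration and are not Mycelia functions.

```julia
# Edges copied from the flowchart (node IDs as written in the Mermaid source).
flow = Dict(
    "FASTQ" => ["QCReads"],
    "QCReads" => ["QCTables", "Assemblers"],
    "FASTA" => ["Pangenome", "Assemblers"],
    "GFA" => ["GraphOps"],
    "GraphOps" => ["Assemblers"],
    "Assemblers" => ["Validation", "Annotation", "Pangenome", "Graphs"],
    "BAM" => ["Validation", "Variants", "Taxonomy"],
    "VCF" => ["Variants"],
    "REF" => ["Variants", "Pangenome"],
    "Validation" => ["Tables"],
    "Annotation" => ["Tables"],
    "Taxonomy" => ["Abund"],
    "Pangenome" => ["Tables", "Graphs"],
    "Variants" => ["Tables"],
    "QCTables" => ["Reports"],
    "Tables" => ["Reports"],
    "Abund" => ["Reports"],
)

# Nodes in the "Outputs" subgraph of the flowchart.
outputs = ["Tables", "Abund", "Graphs", "Reports"]

# Breadth-first search: every node reachable from `start` by following edges.
function reachable(graph, start)
    seen = Set{String}()
    queue = [start]
    while !isempty(queue)
        node = popfirst!(queue)
        for next in get(graph, node, String[])
            if !(next in seen)
                push!(seen, next)
                push!(queue, next)
            end
        end
    end
    return seen
end

# Output nodes reachable from a given input, sorted alphabetically.
function reachable_outputs(graph, outputs, start)
    seen = reachable(graph, start)
    return sort([o for o in outputs if o in seen])
end

println(reachable_outputs(flow, outputs, "BAM"))    # ["Abund", "Reports", "Tables"]
println(reachable_outputs(flow, outputs, "FASTQ"))  # ["Graphs", "Reports", "Tables"]
```

In the diagram as drawn, only BAM input feeds the taxonomy step, which is why FASTQ does not reach the abundance tables here.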


1. End-to-End Menu (by Input Type)

1.1 FASTQ / FASTA / GFA / BAM / VCF → Outputs

| Input | Mycelia entry points | Wrapped tools | Outputs | Tutorial links |
| --- | --- | --- | --- | --- |
| FASTQ short reads (Illumina) | analyze_fastq_quality, calculate_gc_content, assess_duplication_rates, qc_filter_short_reads_fastp, trim_galore_paired | fastp, Trim Galore | QC HTML reports, filtered FASTQ, per-sample QC tables | Quality Control |
| FASTQ long reads (ONT / PacBio) | qc_filter_long_reads_filtlong, native FASTQ QC utilities | Filtlong | Filtered FASTQ, length/quality distributions, QC tables | Quality Control |
| FASTA references (genomes, contigs, transcripts) | open_fastx, convert_sequence, count_canonical_kmers, distance functions | | Canonical sequences, k-mer matrices, distance matrices | K-mer Analysis |
| Mixed FASTA / FASTQ sets | count_canonical_kmers, jaccard_distance, kmer_counts_to_js_divergence | | K-mer spectra, Jaccard / JS distance tables, plots | K-mer Analysis |
| PacBio / Nanopore reads (simulated) | simulate_pacbio_reads, simulate_nanopore_reads | PBSim / DeepSimulator (via wrappers) | FASTQ read sets + ground-truth tables | Data Acquisition |
| Illumina reads (simulated) | simulate_illumina_reads | ART | FASTQ read sets + ground-truth tables | Data Acquisition |
| Public genomes (NCBI) | download_genome_by_accession, prefetch_sra_runs, fasterq_dump_parallel | NCBI datasets / Entrez / SRA Toolkit | FASTA/FASTQ datasets + accession metadata tables | Data Acquisition |
| FASTQ (assembly) | assemble_genome (front-end) → run_megahit, run_metaspades, run_spades, run_flye, run_hifiasm, run_unicycler | MEGAHIT, metaSPAdes, SPAdes, Flye/metaFlye, hifiasm, Canu, SKESA, Unicycler | Contig/scaffold FASTA, assembly logs, GFA/FASTG graphs | Genome Assembly |
| FASTQ (native / experimental assembly) | mycelia_iterative_assemble, improve_read_set_likelihood, find_optimal_sequence_path | None (native Rhizomorph graphs) | Iteratively improved reads, qualmer/string graphs, checkpoint metadata | <span style="background-color:#f0f0f0;">Planned: 12_rhizomorph_graphs.jl</span> |
| GFA / FASTG assembly graphs | read_gfa_next, build_kmer_graph_next, build_qualmer_graph_next, write_quality_biosequence_gfa | | Parsed graph objects, simplified graphs, re-exported GFA/FASTG | Graph Type Tutorials, Round-Trip Graphs |
| BAM / SAM / CRAM alignments | xam_to_dataframe, visualize_genome_coverage, run_qualimap_bamqc | samtools, Qualimap | Coverage plots, per-base depth tables, BAM-QC HTML reports | Tool Integration, Assembly Validation |
| Assemblies (FASTA) + references | assess_assembly_quality, run_quast, run_busco, run_mummer | QUAST, BUSCO, MUMmer, CheckM/CheckM2 | Assembly QC summary tables, BUSCO metrics, alignment plots | Assembly Validation |
| Gene prediction inputs (assembled contigs) | run_pyrodigal, run_blastp_search, run_mmseqs_search, run_transterm, run_trnascan, run_mlst | Pyrodigal, BLAST+, MMSeqs2, TransTerm, tRNAscan-SE, MLST | GFF3 annotations, protein FASTA, functional/categorical tables | Gene Annotation |
| Multiple assemblies (FASTA genomes) | build_genome_distance_matrix, pairwise_mash_distance_matrix, construct_pangenome_pggb, construct_pangenome_cactus | Mash/MinHash, PGGB, cactus | Distance matrices, clustered panels, pangenome graphs, summary tables | Comparative Genomics |
| Taxonomic profiling inputs (classified alignments, tables) | classify_taxonomy_aware_xam_table, plot_taxa_abundances, visualize_xam_classifications | Native taxonomy utilities, Krona/plotting | Taxonomic abundance tables, Sankey/stacked plots | Tool Integration |
| Graph-derived sequences | build_string_graph, string_to_ngram_graph, assemble_strings | | String graphs, reconstructed sequences, evidence-tracking tables | Round-Trip: String Graphs, N-gram Graphs |
| Qualmer graphs | build_qualmer_graph, get_qualmer_statistics, simplification helpers | | Quality-weighted sequence graphs, path correctness scores, diagnostic tables | <span style="background-color:#f0f0f0;">Planned: 12_rhizomorph_graphs.jl</span> |
| Protein sequences | reduce_amino_alphabet, analyze_amino_acids | | Reduced alphabet sequences, amino acid composition tables | Reduced Amino Acid Alphabets |
| Viroid / viral genomes | assemble_genome (circular options), validate_assembly | Flye/metaFlye, QUAST | Circular contig FASTA, QC tables | Viroid Assembly Workflow |
| VCF + BAM + reference FASTA | <span style="background-color:#f0f0f0;">normalize_vcf, update_fasta_with_vcf, evaluate_variant_calling_accuracy</span> | <span style="background-color:#f0f0f0;">GATK HaplotypeCaller, RTG vcfeval</span> | <span style="background-color:#f0f0f0;">Normalized VCF, updated reference, ROC/precision–recall tables</span> | <span style="background-color:#f0f0f0;">Planned: Variant Calling Tutorial</span> |
| Metagenomic reads (amplicon / shotgun) | <span style="background-color:#f0f0f0;">run_metaspades, run_metaflye, profile_taxa, bin_metagenome</span> | <span style="background-color:#f0f0f0;">metaSPAdes, MetaFlye, Kraken2/Bracken, MetaBAT/MaxBin</span> | <span style="background-color:#f0f0f0;">MAG bins, taxonomic abundance tables, bin-quality summaries</span> | <span style="background-color:#f0f0f0;">Planned: Metagenomics Tutorial</span> |
| GFA → interactive graph curation | <span style="background-color:#f0f0f0;">curate_assembly_graph</span> | <span style="background-color:#f0f0f0;">Bandage integration</span> | <span style="background-color:#f0f0f0;">Curated graph, manual edits exported to GFA</span> | <span style="background-color:#f0f0f0;">Planned: GFA Curation Tutorial</span> |
| Phylogenetic trees | <span style="background-color:#f0f0f0;">construct_phylogeny, plot_phylogeny</span> | <span style="background-color:#f0f0f0;">RAxML, IQ-TREE</span> | <span style="background-color:#f0f0f0;">Newick trees, phylogenetic plots</span> | <span style="background-color:#f0f0f0;">Planned: Phylogenetics Tutorial</span> |
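
Several rows above pair count_canonical_kmers with jaccard_distance. As a rough, dependency-free illustration of what that kind of comparison computes, the sketch below counts canonical k-mers in two toy sequences and takes the Jaccard distance between the sets of k-mers observed. It is not Mycelia's implementation: the helper names (`revcomp`, `canonical`, `canonical_kmer_counts`, `jaccard_kmer_distance`) are hypothetical, and the real entry points may take different arguments and return different types.

```julia
# Illustrative only: plain-Julia helpers with hypothetical names, not Mycelia's API.

# Reverse complement of a DNA string (uppercase A/C/G/T only).
function revcomp(seq::AbstractString)
    comp = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
    return join(comp[c] for c in reverse(seq))
end

# A k-mer and its reverse complement describe the same site read from opposite
# strands; the canonical form is the lexicographically smaller of the two.
canonical(kmer::AbstractString) = min(String(kmer), revcomp(kmer))

# Count canonical k-mers in one sequence.
function canonical_kmer_counts(seq::AbstractString, k::Int)
    counts = Dict{String,Int}()
    for i in 1:(length(seq) - k + 1)
        kmer = canonical(seq[i:(i + k - 1)])
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

# Jaccard distance between the sets of observed k-mers: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_kmer_distance(a::Dict{String,Int}, b::Dict{String,Int})
    set_a, set_b = Set(keys(a)), Set(keys(b))
    n_union = length(union(set_a, set_b))
    n_union == 0 && return 0.0
    return 1 - length(intersect(set_a, set_b)) / n_union
end

counts_a = canonical_kmer_counts("ACGTACGT", 3)  # ACG => 4, GTA => 2
counts_b = canonical_kmer_counts("ACGTTTTT", 3)  # ACG => 2, AAC => 1, AAA => 3
println(jaccard_kmer_distance(counts_a, counts_b))  # 0.75 (1 shared k-mer out of 4)
```

Collapsing each k-mer to its canonical form means a sequence and its reverse complement produce identical counts, which is usually what is wanted when strand orientation is unknown.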

2. Capability Matrix (by Task)

This matrix regroups the capabilities by task, so that everything is covered from a “what do I want to do?” perspective.

| Capability | Entry points | Inputs | Outputs | Tooling |
| --- | --- | --- | --- | --- |
| Data acquisition | download_genome_by_accession, prefetch_sra_runs, fasterq_dump_parallel | Accessions / run tables | FASTA/FASTQ, metadata | NCBI E-utilities, SRA Toolkit |
| Read simulation | simulate_illumina_reads, simulate_pacbio_reads, simulate_nanopore_reads | Reference FASTA, sim params | FASTQ reads, ground-truth tables | ART, PBSim, DeepSimulator |
| Quality filtering | qc_filter_short_reads_fastp, qc_filter_long_reads_filtlong, trim_galore_paired | FASTQ | Filtered FASTQ, HTML reports | fastp, Filtlong, Trim Galore |
| Quality metrics | analyze_fastq_quality, assess_duplication_rates, robust_cv, robust_threshold, filter_genome_outliers | FASTQ, QC tables | Summary QC tables, outlier flags | Native Julia |
| Assembly | assemble_genome, run_megahit, run_metaspades, run_spades, run_flye, run_hifiasm, mycelia_iterative_assemble | FASTQ, config | Contigs/scaffolds, assembly graphs | External assemblers + native graphs |
| Assembly validation | assess_assembly_quality, run_quast, run_busco, run_mummer, CheckM/CheckM2 integration | Assemblies + references | QC reports, summary tables | QUAST, BUSCO, MUMmer, CheckM |
| Graph utilities | build_string_graph, build_fasta_graph, build_qualmer_graph, read_gfa_next, write_quality_biosequence_gfa | FASTA/FASTQ/GFA | Graph objects, GFA exports | Native Julia (Rhizomorph) |
| Comparative genomics | construct_pangenome_pggb, construct_pangenome_cactus, build_genome_distance_matrix, analyze_pangenome_kmers | Assemblies, sketch files | Pangenome graphs, distance matrices | PGGB, cactus, Mash-like tools |
| Taxonomy | classify_taxonomy_aware_xam_table, plot_taxa_abundances, visualize_xam_classifications | Mapping tables, taxonomic refs | Abundance tables, taxonomy plots | Native Julia, plotting libs |
| Alignment & search | run_blastp_search, run_mmseqs_search, run_diamond_search, run_minimap2 helpers | FASTA/FASTQ, databases | Alignments (SAM/BAM), hit tables | BLAST+, MMSeqs2, DIAMOND, minimap2 |
| Variant analysis | <span style="background-color:#f0f0f0;">run_gatk_haplotypecaller, run_vcfeval, evaluate_variant_calling_accuracy</span> | <span style="background-color:#f0f0f0;">BAM/CRAM, VCF, references</span> | <span style="background-color:#f0f0f0;">VCFs, ROC tables, summary stats</span> | <span style="background-color:#f0f0f0;">GATK, RTG vcfeval</span> |
| Cloud & orchestration | rclone_copy, rclone_sync, render_sbatch, submit_sbatch, lawrencium_sbatch, nersc_sbatch | Paths, SLURM params | Synced data, sbatch scripts, submitted jobs | rclone, SLURM |
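
The quality-metrics entries turn FASTQ records into per-read or per-sample summary tables. To make that concrete, here is a small plain-Julia sketch that computes read length, GC fraction, and mean Phred score (standard Phred+33 encoding) for 4-line FASTQ records. The helper names (`gc_fraction`, `phred_scores`, `summarize_fastq`) are hypothetical stand-ins, not Mycelia functions, and analyze_fastq_quality may report different metrics.

```julia
# Illustrative only: plain-Julia helpers with hypothetical names, not Mycelia's API.

# Fraction of G/C bases in a sequence (case-insensitive); 0.0 for an empty sequence.
function gc_fraction(seq::AbstractString)
    isempty(seq) && return 0.0
    return count(c -> c in ('G', 'C', 'g', 'c'), seq) / length(seq)
end

# Decode a Phred+33 quality string: each character's ASCII code minus 33 is its score.
phred_scores(qual::AbstractString) = [Int(c) - 33 for c in qual]

# Summarize 4-line FASTQ records (header, sequence, '+', qualities) held in a string.
function summarize_fastq(text::AbstractString)
    lines = String.(split(strip(text), '\n'))
    rows = NamedTuple[]
    for i in 1:4:(length(lines) - 3)
        id, seq, qual = lines[i][2:end], lines[i + 1], lines[i + 3]
        scores = phred_scores(qual)
        push!(rows, (id = id, len = length(seq), gc = gc_fraction(seq),
                     mean_q = sum(scores) / length(scores)))
    end
    return rows
end

# Two toy reads: 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
fastq = """
@read1
ACGT
+
IIII
@read2
GGCC
+
!!II
"""

summary = summarize_fastq(fastq)
for row in summary
    println(row)
end
```

For real data a streaming FASTQ parser is the better choice; the string-splitting here only keeps the example self-contained.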

3. Module coverage & doc gaps

3.1 Auto-included modules (from src/Mycelia.jl)

The bullets below are a short, user-facing summary; see api/function-coverage.md and the planning doc for details.

  • Core graph + I/O: utility-functions.jl, alphabets.jl, constants.jl, fastx.jl, graph-core.jl, sequence-graphs-next.jl, string-graphs.jl, qualmer-analysis.jl, qualmer-graphs.jl, fasta-graphs.jl, fastq-graphs.jl → Exposed via Rhizomorph: build_string_graph, build_fasta_graph, build_qualmer_graph, path-finding & simplification utilities.
  • Assembly pipelines: assembly.jl (external assemblers), iterative-assembly.jl (mycelia_iterative_assemble, improve_read_set_likelihood, find_optimal_sequence_path), viterbi-next.jl.
  • Analytics & QC: quality-control-and-benchmarking.jl, performance-benchmarks.jl, kmer-analysis.jl, distance-metrics.jl.
  • Taxonomy & annotation: taxonomy-and-trees.jl, classification.jl, reference-databases.jl, annotation.jl.
  • Wrappers & orchestration: bioconda.jl, rclone.jl, slurm-sbatch.jl, neo4jl.jl, xam.jl.

For full coverage counts and which functions are currently documented vs missing, see: