Architecture¶
This page describes the internal design of TorchUMM — how inference, evaluation, and post-training pipelines are structured, how backbone adapters plug in, and how the codebase is organized.
Inference Pipeline¶
Overview¶
The inference pipeline follows a strict layered design: the CLI reads a YAML config, instantiates InferencePipeline with the chosen backbone, and dispatches the request to the correct task runner.
flowchart TD
A["User\nYAML config"] --> B["CLI\numm infer --config ..."]
B --> C["load_config\n(YAML → dict)"]
C --> D["InferencePipeline\n(backbone_name, backbone_cfg)"]
D --> E["Registry lookup\nget backbone adapter"]
E --> F["adapter.load(backbone_cfg)\nload weights, tokenizer, VAE"]
F --> G["pipeline.run(InferenceRequest)"]
G --> H{task}
H -->|generation| I["run_generation\nadapter.generation(prompt, ...)"]
H -->|understanding| J["run_understanding\nadapter.understanding(prompt, images, ...)"]
H -->|editing| K["run_editing\nadapter.editing(prompt, images, ...)"]
I --> L["Result dict\n(image, saved_path, ...)"]
J --> L
K --> L
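In code, the layered flow above reduces to a thin dispatcher. The sketch below mirrors the diagram; everything other than the `InferencePipeline` name and the adapter method names is an illustrative assumption, not the exact implementation in src/umm/inference/pipeline.py:

```python
class InferencePipeline:
    """Sketch of the flow: registry lookup -> adapter.load -> task dispatch."""

    def __init__(self, backbone_name: str, backbone_cfg: dict, registry: dict):
        adapter_cls = registry[backbone_name]   # registry lookup
        self.adapter = adapter_cls()
        self.adapter.load(backbone_cfg)         # load weights, tokenizer, VAE

    def run(self, request) -> dict:
        # Route to the task-specific runner, mirroring the {task} branch above.
        if request.task == "generation":
            return self.adapter.generation(request.prompt, **request.params)
        if request.task == "understanding":
            return self.adapter.understanding(request.prompt, request.images, **request.params)
        if request.task == "editing":
            return self.adapter.editing(request.prompt, request.images, **request.params)
        raise ValueError(f"unknown task: {request.task}")
```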
Key Classes¶
| Class | File | Role |
|---|---|---|
| InferencePipeline | src/umm/inference/pipeline.py | Entry point; builds backbone from registry, dispatches tasks |
| InferenceRequest | src/umm/inference/multimodal_inputs.py | Typed dataclass: backbone, task, prompt, images, params |
| BackboneAdapter | src/umm/core/interfaces.py | Protocol that all backbone adapters must implement |
| Registry | src/umm/core/registry.py | Simple dict-based registry for backbones, evaluators, trainers |
InferenceRequest Fields¶
@dataclass
class InferenceRequest:
    backbone: str                    # e.g. "bagel", "janus_pro", "janus_flow"
    task: str                        # "generation" | "understanding" | "editing"
    prompt: str | None = None
    images: list[str] = field(...)   # local file paths
    videos: list[str] = field(...)   # reserved for future use
    params: dict = field(...)        # task-specific overrides
    metadata: dict = field(...)
    output_path: str | None = None
Validation rules:
- generation — requires prompt
- editing — requires prompt AND at least one image
- understanding — requires at least one of prompt or images
Evaluation Pipeline¶
Overview¶
Evaluation follows a two-level dispatch: the top-level CLI routes to a benchmark-specific handler, which then calls an InferencePipeline internally and runs scoring.
flowchart TD
A["User\neval YAML config"] --> B["CLI\numm eval --config ..."]
B --> C["load_config → extract\ncfg['eval']['benchmark']"]
C --> D{benchmark}
D -->|geneval| E["geneval.py\nrun_eval_command"]
D -->|wise| F["wise.py\nrun_wise_eval_command"]
D -->|ueval| G["ueval_eval.py\nrun_ueval_eval_command"]
D -->|dpg_bench| H["dpg_bench.py\nrun_eval_command"]
D -->|mme/mmmu/mmbench\nmmvet/mathvista| I["benchmark_eval.py\nrun_*_eval_command"]
D -->|uni_mmmu| J["uni_mmmu.py\nrun_eval_command"]
E --> K["subprocess\neval/generation/geneval/run_generation.py"]
K --> L["InferencePipeline\ngenerate images"]
L --> M["subprocess\neval/generation/geneval/run_scoring.py\n(Mask2Former detector)"]
F --> N["subprocess\neval/generation/wise/run_wise_eval.py"]
N --> O["InferencePipeline\ngenerate images"]
O --> P["Qwen2.5-VL-72B\nor Qwen3-32B scorer"]
M --> Q["Results JSON / score"]
P --> Q
I --> R["InferencePipeline\ngenerate answers"]
R --> Q
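The first dispatch level is essentially a benchmark-name-to-handler table. A sketch with stubs standing in for the real `run_*_eval_command` handlers (the grouping of the VLM benchmarks under one entry follows the benchmark_eval.py branch in the diagram):

```python
# First-level eval dispatch sketch: cfg['eval']['benchmark'] -> handler.
def _stub(name: str):
    def handler(cfg: dict) -> str:
        return f"ran {name}"       # real handlers run generation + scoring
    return handler

HANDLERS = {
    "geneval": _stub("geneval"),
    "wise": _stub("wise"),
    "ueval": _stub("ueval"),
    "dpg_bench": _stub("dpg_bench"),
    "uni_mmmu": _stub("uni_mmmu"),
    # mme/mmmu/mmbench/mmvet/mathvista all route to benchmark_eval.py
    **{b: _stub(b) for b in ("mme", "mmmu", "mmbench", "mmvet", "mathvista")},
}

def run_eval(cfg: dict) -> str:
    benchmark = cfg["eval"]["benchmark"]
    if benchmark not in HANDLERS:
        raise KeyError(f"unknown benchmark: {benchmark}")
    return HANDLERS[benchmark](cfg)
```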
Two-Stage vs Single-Stage Benchmarks¶
| Type | Benchmarks | Stage 1 | Stage 2 |
|---|---|---|---|
| Two-stage | GenEval, WISE, UEval, Uni-MMMU | Generate images/text | Score with detector or Qwen VLM |
| Single-stage | MME, MMMU, MMBench, MM-Vet, MathVista, DPG Bench | Generate + score in one pass | — |
For two-stage benchmarks, separate _generate and _score configs are provided. The full-pipeline config (e.g., geneval_bagel.yaml) runs both stages automatically.
Scoring Models¶
| Benchmark | Scorer |
|---|---|
| GenEval | Mask2Former object detector |
| WISE | Qwen2.5-VL-72B-Instruct (local, from /model_cache/evaluator/) |
| UEval | Qwen series models (local) |
| Uni-MMMU | Qwen3-32B (local) |
| MME / MMMU / MMBench / MM-Vet / MathVista | Rule-based or model-specific scoring |
Post-Training Pipeline¶
Overview¶
Post-training configs specify a pipeline name that selects the training dispatcher. All dispatchers follow the same pattern: build a torchrun or python subprocess and execute the training script inside the model repo.
flowchart TD
A["User\nposttrain YAML config"] --> B["CLI\numm train --config ..."]
B --> C["load_config → extract\ncfg['train']['pipeline']"]
C --> D{pipeline}
D -->|bagel| E["sft/bagel/pipeline.py\nrun_bagel_train"]
D -->|recA| F["recA/pipeline.py\nrun_reca_train"]
D -->|unicot| G["unicot/pipeline.py\nrun_unicot_train"]
D -->|irg| H["IRG/pipeline.py\nrun_irg_train"]
E --> I["_build_args(cfg['args'])\n→ CLI flags"]
F --> I
G --> I
H --> I
I --> J["_resolve_cwd(config_path, cwd)"]
J --> K["subprocess.run\ntorchrun --nnodes 1 --nproc_per_node 4\ntrain_script.py --arg1 val1 ..."]
K --> L["Training loop\n(PyTorch Distributed / FSDP)"]
L --> M["Checkpoint saved\nto results_dir/checkpoints/"]
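The `_build_args` and command-assembly steps can be sketched as a dict-to-flags conversion; the flag spelling and boolean handling below are assumptions, not the exact implementation in the pipeline modules:

```python
def build_args(args: dict) -> list[str]:
    """Flatten cfg['args'] into CLI flags: {"lr": 1e-4} -> ["--lr", "0.0001"]."""
    flags: list[str] = []
    for key, value in args.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            if value:                # store_true-style flags
                flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags

def torchrun_command(script: str, args: dict, nproc: int = 4) -> list[str]:
    """Assemble the subprocess argv shown in the diagram."""
    return ["torchrun", "--nnodes", "1",
            "--nproc_per_node", str(nproc), script] + build_args(args)
```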
Supported Training Methods¶
| Method | Pipeline key | Training approach | Multi-GPU |
|---|---|---|---|
| SFT | bagel | Full fine-tuning on Bagel base | torchrun (4 GPU) |
| IRG | irg | 2-stage interleaved reasoning generation | torchrun (4 GPU) |
| recA | recA | Reconstruction alignment | torchrun |
| UniCot | unicot | Chain-of-thought training via LoRA (rank=256) | torchrun (4 GPU) |
Post-Train Model Serving¶
After training, model weights land in /checkpoints/ (local) or umm-checkpoints volume (Modal). For evaluation, copy weights to umm-post-train-model-cache and supplement with the base model's config files:
# 1. Check the weights are there
modal volume ls umm-post-train-model-cache post_train/<variant>/
# 2. Copy config/tokenizer/VAE files from base BAGEL
modal run modal/copy_bagel_files.py --target <variant>
# 3. Run evaluation on the post-trained variant
modal run modal/run.py --model bagel \
--eval-config modal_geneval_bagel_<variant>_score --gpu H100
Task-Level Support Matrix¶
| Model | Understand | Generate | Edit | Benchmarks |
|---|---|---|---|---|
| Bagel | Yes | Yes | Yes | DPG, GenEval, WISE, UEval, Uni-MMMU, MME, MMMU, MMBench, MM-Vet, MathVista |
| OmniGen2 | Yes | Yes | Yes | DPG, GenEval, WISE, UEval, Uni-MMMU, MME, MMMU, MMBench, MM-Vet, MathVista |
| Emu3 | Yes | Yes | No | DPG, GenEval, WISE, UEval, Uni-MMMU, MME, MMMU, MMBench, MM-Vet, MathVista |
| Janus-Pro | Yes | Yes | No | DPG, GenEval, WISE, UEval, Uni-MMMU, MME, MMMU, MMBench, MM-Vet, MathVista |
| JanusFlow | Yes | Yes | No | DPG, GenEval, WISE, Uni-MMMU |
| Show-o2 | Yes | Yes | No | DPG, GenEval, WISE, UEval, Uni-MMMU, MME, MMMU, MMBench, MM-Vet, MathVista |
| BLIP3-o | No | Yes | No | DPG, GenEval, WISE, UEval |
| TokenFlow | No | Yes | No | DPG, GenEval, WISE, UEval |
Backbone Adapter Design Notes¶
When implementing a new backbone adapter, keep these lessons in mind (learned from integrating models like OmniGen2):
- Exception propagation in editing(). The evaluation pipeline uses a try/except to fall back from editing to text-to-image generation when editing is unsupported or fails. If your editing() method catches exceptions internally and returns an error dict, the caller cannot distinguish it from a successful result, and the fallback is silently skipped. Let pipeline exceptions propagate; only the final generation() method should catch and wrap errors.
- Shared model components. If your model uses separate pipeline objects for different tasks (e.g., OmniGen2 uses OmniGen2Pipeline for generation/editing and OmniGen2ChatPipeline for understanding), construct one pipeline first, then build the other from shared component references. Loading both via from_pretrained duplicates all model weights in GPU memory.
- Task-appropriate system prompts. Unified models that support both generation and understanding often have a default system prompt biased toward one capability. For example, OmniGen2's chat pipeline uses "generates high-quality images" as its system prompt, which causes the model to emit image generation tokens (<|img|>) instead of text reasoning when given complex prompts. Override the system prompt to match the task — use a text-analysis prompt for understanding, and the default prompt for generation.
Inference Implementation Strategy¶
| Model | Generation approach | Understanding approach |
|---|---|---|
| Bagel | Diffusion (MoT) with VAE, CFG text+image scale | Native VLM head |
| OmniGen2 | OmniGen2Pipeline (flow matching) | OmniGen2ChatPipeline (separate) |
| Emu3 | VQ-tokenizer autoregressive (Emu3-Gen) | Emu3-Chat (separate model) |
| Janus-Pro | Parallel generation (4 images per pass, CFG) | VLChatProcessor-based |
| JanusFlow | Rectified flow ODE (30 steps, SDXL VAE decode) | VLChatProcessor-based |
| Show-o2 | Subprocess (wraps Show-o scripts) | Subprocess (wraps Show-o scripts) |
| BLIP3-o | Subprocess (wraps BLIP3-o scripts) | — |
| TokenFlow | Subprocess (wraps TokenFlow scripts) | — |
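For the subprocess-wrapped models (Show-o2, BLIP3-o, TokenFlow), generation amounts to shelling out to the submodule's own script. A sketch of that pattern; the flag names and error handling here are placeholders, not the actual adapter code:

```python
import subprocess
import sys

def subprocess_generation(script: str, prompt: str, output_path: str) -> dict:
    """Run a submodule's generation script and wrap the result in the
    same dict shape the in-process adapters return."""
    cmd = [sys.executable, script, "--prompt", prompt, "--output", output_path]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the tail of stderr so failures are diagnosable upstream.
        raise RuntimeError(f"{script} failed: {proc.stderr[-500:]}")
    return {"saved_path": output_path}
```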
Codebase Map¶
umm_codebase/
│
├── src/umm/ # Core Python package
│ ├── cli/ # Command-line entry points
│ │ ├── main.py # Argument parser, subcommand registration
│ │ ├── infer.py # `umm infer` handler
│ │ ├── eval.py # `umm eval` dispatcher → benchmark handlers
│ │ ├── train.py # `umm train` dispatcher → pipeline handlers
│ │ ├── geneval.py # GenEval benchmark runner
│ │ ├── wise.py # WISE benchmark runner
│ │ ├── ueval_eval.py # UEval benchmark runner
│ │ ├── dpg_bench.py # DPG Bench runner
│ │ ├── uni_mmmu.py # Uni-MMMU runner
│ │ ├── mme_eval.py # MME runner
│ │ ├── mmmu_eval.py # MMMU runner
│ │ ├── mmbench_eval.py # MMBench runner
│ │ ├── mmvet_eval.py # MM-Vet runner
│ │ └── mathvista_eval.py # MathVista runner
│ │
│ ├── inference/ # Inference pipeline
│ │ ├── pipeline.py # InferencePipeline class
│ │ ├── generation.py # run_generation/editing/understanding
│ │ ├── multimodal_inputs.py # InferenceRequest dataclass, validators
│ │ └── batcher.py # batch_iter utility
│ │
│ ├── backbones/ # Model adapters (one per model)
│ │ ├── bagel/
│ │ │ ├── adapter.py # BagelBackbone
│ │ │ └── Bagel/ # git submodule (original repo)
│ │ ├── omnigen2/
│ │ │ ├── adapter.py # OmniGen2Backbone
│ │ │ └── OmniGen2/
│ │ ├── emu3/
│ │ │ ├── adapter.py # Emu3Backbone (Chat + Gen + VQ)
│ │ │ └── Emu3/
│ │ ├── janus_pro/
│ │ │ ├── adapter.py # JanusProBackbone
│ │ │ └── Janus/
│ │ ├── janus_flow/
│ │ │ ├── adapter.py # JanusFlowBackbone (rectified flow + SDXL VAE)
│ │ │ └── Janus/
│ │ ├── show_o/
│ │ │ ├── adapter.py # ShowOBackbone (subprocess)
│ │ │ └── Show-o/
│ │ ├── blip3o/
│ │ │ ├── adapter.py # Blip3oBackbone (subprocess)
│ │ │ └── BLIP3o/
│ │ └── tokenflow/
│ │ ├── adapter.py # TokenFlowBackbone (subprocess)
│ │ └── TokenFlow/
│ │
│ ├── post_training/ # Training pipelines
│ │ ├── sft/bagel/pipeline.py # run_bagel_train (pipeline: bagel)
│ │ ├── recA/pipeline.py # run_reca_train (pipeline: recA)
│ │ ├── unicot/pipeline.py # run_unicot_train (pipeline: unicot)
│ │ └── IRG/pipeline.py # run_irg_train (pipeline: irg)
│ │
│ └── core/ # Shared utilities
│ ├── registry.py # register() / get() for backbones/evaluators
│ ├── interfaces.py # BackboneAdapter protocol
│ ├── config.py # load_config (YAML/JSON → dict)
│ ├── io.py # I/O helpers
│ └── runtime.py # Runtime utilities
│
├── configs/ # All YAML configuration files
│ ├── inference/ # Inference configs per model
│ │ ├── modal_bagel_generation.yaml
│ │ ├── modal_bagel_understanding.yaml
│ │ ├── modal_bagel_editing.yaml
│ │ ├── emu3_generation.yaml
│ │ ├── omnigen2_generation.yaml
│ │ ├── show_o2_generation.yaml
│ │ ├── tokenflow_generation.yaml
│ │ └── ...
│ ├── eval/ # Eval configs per benchmark per model
│ │ ├── dpg_bench/
│ │ ├── geneval/
│ │ ├── wise/
│ │ ├── ueval/
│ │ ├── uni_mmmu/
│ │ ├── mme/
│ │ ├── mmmu/
│ │ ├── mmbench/
│ │ ├── mmvet/
│ │ └── mathvista/
│ └── posttrain/ # Training configs
│ ├── bagel_sft.yaml
│ ├── irg_stage1.yaml
│ ├── irg_stage2.yaml
│ ├── recA.yaml
│ └── unicot.yaml
│
├── eval/ # Evaluation scripts (called by CLI as subprocesses)
│ ├── generation/
│ │ ├── geneval/
│ │ ├── wise/
│ │ ├── dpg_bench/
│ │ ├── ueval/
│ │ └── uni_mmmu/
│ └── vlm/ # Understanding benchmark scripts
│
├── modal/ # Modal cloud infrastructure
│ ├── config.py # Volume names, HF model paths
│ ├── volumes.py # Volume definitions
│ ├── images.py # Docker images per model
│ ├── run.py # Modal inference + eval runner
│ ├── train.py # Modal training runner
│ └── download.py # Download weights/datasets to volumes
│
├── data/ # Local benchmark data
│ ├── mme/
│ ├── mmbench/
│ ├── mmvet/
│ ├── mathvista/
│ └── ...
│
└── model/ # Git submodules (DO NOT MODIFY)
├── Bagel/
├── OmniGen2/
├── Emu3/
├── Janus/
├── Show-o/
├── BLIP3o/
├── TokenFlow/
├── geneval/
├── WISE/
└── UEval/
Backbone Adapter Pattern¶
All backbone adapters implement the same interface, making them interchangeable from the pipeline's perspective:
class BackboneAdapter(Protocol):
    name: str

    def load(self, cfg: dict) -> None:
        """Load model weights, tokenizer, VAE etc. from cfg."""
        ...

    def generation(self, prompt: str, output_path: str, **cfg) -> dict:
        """Text-to-image generation. Returns dict with 'image' key."""
        ...

    def understanding(self, prompt: str, images: list[str], **cfg) -> dict:
        """VQA / captioning. Returns dict with 'text' key."""
        ...

    def editing(self, prompt: str, images: list[str], output_path: str, **cfg) -> dict:
        """Image editing. Returns dict with 'image' key."""
        ...
New models are registered via:
# src/umm/inference/pipeline.py
from umm.core.registry import register
register("backbone", "my_model", MyModelBackbone)
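A skeleton showing the shape a new adapter takes. Only the method signatures come from the protocol above; every body here is a placeholder, and the class name matches the registration example only for illustration:

```python
class MyModelBackbone:
    """Hypothetical adapter skeleton; model-specific logic is elided."""
    name = "my_model"

    def load(self, cfg: dict) -> None:
        # e.g. load weights/tokenizer from a path given in cfg
        self.cfg = cfg

    def generation(self, prompt: str, output_path: str, **cfg) -> dict:
        # ... run text-to-image, save result to output_path ...
        return {"image": None, "saved_path": output_path}

    def understanding(self, prompt: str, images: list[str], **cfg) -> dict:
        # ... run VQA/captioning over the input images ...
        return {"text": ""}

    def editing(self, prompt: str, images: list[str], output_path: str, **cfg) -> dict:
        # Per the design notes above: let pipeline exceptions propagate here.
        return {"image": None, "saved_path": output_path}

# then registered alongside the built-in backbones:
# register("backbone", "my_model", MyModelBackbone)
```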
Config File Naming Conventions¶
| Context | Pattern | Example |
|---|---|---|
| Local inference | <model>_<task>.yaml | emu3_generation.yaml |
| Modal inference | modal_<model>_<task>.yaml | modal_bagel_generation.yaml |
| Local eval (full) | <benchmark>_<model>.yaml | geneval_bagel.yaml |
| Local eval (generate only) | <benchmark>_<model>_generate.yaml | geneval_bagel_generate.yaml |
| Local eval (score only) | <benchmark>_<model>_score.yaml | geneval_bagel_score.yaml |
| Modal eval | modal_<benchmark>_<model>.yaml | modal_geneval_bagel.yaml |
| Post-training | <method>.yaml or <model>_<method>.yaml | bagel_sft.yaml, recA.yaml |