# TorchUMM

**Unified Multimodal Model Toolkit**

A unified toolkit for multimodal model inference, evaluation, and post-training.
TorchUMM provides a single, config-driven interface for running, evaluating, and fine-tuning state-of-the-art multimodal models. It is designed to make fair, reproducible comparisons across diverse multimodal architectures straightforward, whether you are benchmarking generation quality, measuring visual understanding, or experimenting with post-training methods.
- **Pluggable Architecture**: 13 multimodal model adapters behind a unified interface
- **10+ Benchmarks**: generation, understanding, and editing evaluation
- **Post-Training**: SFT, IRG, recA, UniCot (LoRA-based)
- **Cloud-Native**: scale to cloud GPUs via Modal
- **Config-Driven**: YAML configs, no code changes needed
## Supported Models
| Model | Parameters | Understand | Generate | Edit | Docs |
|---|---|---|---|---|---|
| Bagel | 7B | Yes | Yes | Yes | guide |
| DeepGen | 5B | No | Yes | Yes | guide |
| OmniGen2 | 7B | Yes | Yes | Yes | guide |
| Emu3 | 8B | Yes | Yes | No | guide |
| Emu3.5 | 34B | Yes | Yes | Yes | guide |
| MMaDA | 8B | Yes | Yes | No | guide |
| Janus | 1.3B | Yes | Yes | No | guide |
| Janus-Pro | 1B, 7B | Yes | Yes | No | guide |
| JanusFlow | 1.3B | Yes | Yes | No | guide |
| Show-o | 1.3B | Yes | Yes | No | guide |
| Show-o2 | 1.5B, 7B | Yes | Yes | No | guide |
| BLIP3-o | 4B | No | Yes | No | guide |
| TokenFlow | 7B | No | Yes | No | guide |
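The capability matrix above can also be expressed programmatically, which is handy when scripting benchmark runs. A minimal sketch in plain Python (independent of the TorchUMM API; the `supports` helper is illustrative, not part of the toolkit):

```python
# Capability matrix from the table above: (understand, generate, edit)
MODELS = {
    "Bagel":     (True,  True,  True),
    "DeepGen":   (False, True,  True),
    "OmniGen2":  (True,  True,  True),
    "Emu3":      (True,  True,  False),
    "Emu3.5":    (True,  True,  True),
    "MMaDA":     (True,  True,  False),
    "Janus":     (True,  True,  False),
    "Janus-Pro": (True,  True,  False),
    "JanusFlow": (True,  True,  False),
    "Show-o":    (True,  True,  False),
    "Show-o2":   (True,  True,  False),
    "BLIP3-o":   (False, True,  False),
    "TokenFlow": (False, True,  False),
}

def supports(capability: str) -> list[str]:
    """Return the models that support the given capability."""
    idx = ("understand", "generate", "edit").index(capability)
    return [name for name, caps in MODELS.items() if caps[idx]]

# Example: every model generates, but only four can edit
print(supports("edit"))  # ['Bagel', 'DeepGen', 'OmniGen2', 'Emu3.5']
```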
## Quick Start

### Install
### CLI Inference

```bash
PYTHONPATH=src python -m umm.cli.main infer \
    --config configs/inference/modal_bagel_generation.yaml
```
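The contents of the referenced config file are not shown here; as an orientation, an inference config might look roughly like the following sketch. All field names below are hypothetical and do not reflect the toolkit's actual schema — consult the shipped `configs/inference/` files for the real keys.

```yaml
# Hypothetical sketch only — key names are illustrative, not TorchUMM's schema
backbone:
  name: bagel
  model_path: /path/to/BAGEL-7B-MoT
task: generation
prompt: "A cat sitting on a rainbow"
```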
### Python API

```python
from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

# Load the Bagel backbone from a local checkpoint
pipeline = InferencePipeline(
    backbone_name="bagel",
    backbone_cfg={"model_path": "/path/to/BAGEL-7B-MoT"},
)

# Run a text-to-image generation request
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="generation",
    prompt="A cat sitting on a rainbow",
))
```