
TorchUMM
Unified Multimodal Model Toolkit

A unified toolkit for multimodal model inference, evaluation, and post-training.


TorchUMM provides a single, config-driven interface for running, evaluating, and fine-tuning state-of-the-art multimodal models. It is designed to make fair, reproducible comparisons across diverse multimodal architectures straightforward --- whether you are benchmarking generation quality, measuring visual understanding, or experimenting with post-training methods.

  • Pluggable Architecture --- 13 multimodal model adapters behind a unified interface
  • 10+ Benchmarks --- Generation, understanding, and editing evaluation
  • Post-Training --- SFT, IRG, recA, UniCot (LoRA-based)
  • Cloud-Native --- Scale to cloud GPUs via Modal
  • Config-Driven --- YAML configs, no code changes needed
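Because every run is described by a YAML file, switching models or tasks is a config edit rather than a code change. A minimal sketch of what such a config might look like, mirroring the Python API fields shown in Quick Start (the key names here are illustrative assumptions, not the exact schema --- see `configs/inference/` in the repository for real examples):

```yaml
# Hypothetical inference config sketch; actual keys may differ.
backbone: bagel
backbone_cfg:
  model_path: /path/to/BAGEL-7B-MoT
task: generation
prompt: "A cat sitting on a rainbow"
```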

Supported Models

| Model     | Parameters | Understand | Generate | Edit | Docs  |
|-----------|------------|------------|----------|------|-------|
| Bagel     | 7B         | Yes        | Yes      | Yes  | guide |
| DeepGen   | 5B         | No         | Yes      | Yes  | guide |
| OmniGen2  | 7B         | Yes        | Yes      | Yes  | guide |
| Emu3      | 8B         | Yes        | Yes      | No   | guide |
| Emu3.5    | 34B        | Yes        | Yes      | Yes  | guide |
| MMaDA     | 8B         | Yes        | Yes      | No   | guide |
| Janus     | 1.3B       | Yes        | Yes      | No   | guide |
| Janus-Pro | 1B, 7B     | Yes        | Yes      | No   | guide |
| JanusFlow | 1.3B       | Yes        | Yes      | No   | guide |
| Show-o    | 1.3B       | Yes        | Yes      | No   | guide |
| Show-o2   | 1.5B, 7B   | Yes        | Yes      | No   | guide |
| BLIP3-o   | 4B         | No         | Yes      | No   | guide |
| TokenFlow | 7B         | No         | Yes      | No   | guide |

Quick Start

Install

```shell
git clone --recursive https://github.com/AIFrontierLab/TorchUMM/
cd TorchUMM
pip install -e .
```

CLI Inference

```shell
PYTHONPATH=src python -m umm.cli.main infer \
    --config configs/inference/modal_bagel_generation.yaml
```

Python API

```python
from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(
    backbone_name="bagel",
    backbone_cfg={"model_path": "/path/to/BAGEL-7B-MoT"},
)
result = pipeline.run(InferenceRequest(
    backbone="bagel", task="generation",
    prompt="A cat sitting on a rainbow",
))
```
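The same request/run pattern extends naturally to batches of prompts. A minimal sketch under stated assumptions: `run_batch`, the `Request` dataclass, and the stub callable below are illustrative helpers, not part of TorchUMM --- they only assume that `pipeline.run` accepts a request carrying `backbone`, `task`, and `prompt` fields, as in the example above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Request:
    """Illustrative stand-in for an inference request with the fields used above."""
    backbone: str
    task: str
    prompt: str


def run_batch(pipeline_run: Callable[[Request], Any], prompts: Iterable[str],
              backbone: str = "bagel", task: str = "generation") -> List[Any]:
    """Build one request per prompt and collect the results in order."""
    return [pipeline_run(Request(backbone=backbone, task=task, prompt=p))
            for p in prompts]


# Stub standing in for InferencePipeline.run, so the sketch runs on its own.
def fake_run(req: Request) -> str:
    return f"[{req.backbone}/{req.task}] {req.prompt}"


results = run_batch(fake_run, ["a cat", "a dog"])
```

With a real pipeline, `pipeline.run` would be passed in place of `fake_run` (adapting `Request` to `InferenceRequest`).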
