Models Overview

TorchUMM integrates thirteen multimodal models through a unified backbone adapter interface. Each model is wrapped as a BackboneAdapter that exposes a common API for generation and, where the model supports them, understanding and editing.
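To make the idea of a common API with optional capabilities concrete, here is a minimal sketch. It is not TorchUMM's actual class: the method names, the `capabilities` attribute, `UnsupportedCapabilityError`, and `ToyGenerationOnlyAdapter` are all assumptions chosen for illustration, and strings stand in for images.

```python
from abc import ABC, abstractmethod


class UnsupportedCapabilityError(RuntimeError):
    """Raised when a backbone is asked for a capability it lacks (hypothetical)."""


class BackboneAdapter(ABC):
    """Illustrative stand-in for a common adapter interface, not the real class."""

    # Capabilities this backbone supports; subclasses override this set.
    capabilities: frozenset = frozenset()

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Text-to-image generation; every adapter in this sketch must provide it."""

    def understand(self, image: str, question: str) -> str:
        # Default: unsupported unless a subclass overrides this method.
        raise UnsupportedCapabilityError(
            f"{type(self).__name__} does not support understanding"
        )

    def edit(self, image: str, instruction: str) -> str:
        # Default: unsupported unless a subclass overrides this method.
        raise UnsupportedCapabilityError(
            f"{type(self).__name__} does not support editing"
        )

    def supports(self, capability: str) -> bool:
        return capability in self.capabilities


class ToyGenerationOnlyAdapter(BackboneAdapter):
    """Toy adapter that only generates, using placeholder strings as images."""

    capabilities = frozenset({"generate"})

    def generate(self, prompt: str) -> str:
        return f"<image for: {prompt}>"
```

With a design like this, a caller can check `adapter.supports("edit")` before dispatching a request instead of catching an error afterwards.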

Capability Matrix

Model	Parameters	Understand	Generate	Edit	Guide
Bagel	7B	Yes	Yes	Yes	bagel.md
DeepGen	5B	No	Yes	Yes	deepgen.md
OmniGen2	7B	Yes	Yes	Yes	omnigen2.md
Emu3	8B	Yes	Yes	No	emu3.md
Emu3.5	34B	Yes	Yes	Yes	emu3_5.md
MMaDA	8B	Yes	Yes	No	mmada.md
Janus	1.3B	Yes	Yes	No	janus.md
Janus-Pro	1B, 7B	Yes	Yes	No	janus_pro.md
JanusFlow	1.3B	Yes	Yes	No	janus_flow.md
Show-o	1.3B	Yes	Yes	No	show_o.md
Show-o2	1.5B, 7B	Yes	Yes	No	show_o2.md
BLIP3-o	4B	No	Yes	No	blip3o.md
TokenFlow	7B	No	Yes	No	tokenflow.md
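The matrix can also be queried programmatically. The sketch below transcribes the table into a plain Python dictionary; the `CAPABILITIES` name and the `models_supporting` helper are illustrative and are not part of TorchUMM.

```python
# Capability flags transcribed from the table above: (understand, generate, edit).
CAPABILITIES = {
    "Bagel": (True, True, True),
    "DeepGen": (False, True, True),
    "OmniGen2": (True, True, True),
    "Emu3": (True, True, False),
    "Emu3.5": (True, True, True),
    "MMaDA": (True, True, False),
    "Janus": (True, True, False),
    "Janus-Pro": (True, True, False),
    "JanusFlow": (True, True, False),
    "Show-o": (True, True, False),
    "Show-o2": (True, True, False),
    "BLIP3-o": (False, True, False),
    "TokenFlow": (False, True, False),
}

FIELDS = ("understand", "generate", "edit")


def models_supporting(*required):
    """Return the models (in table order) that support every named capability."""
    indices = [FIELDS.index(name) for name in required]
    return [
        model
        for model, flags in CAPABILITIES.items()
        if all(flags[i] for i in indices)
    ]


print(models_supporting("edit"))
# ['Bagel', 'DeepGen', 'OmniGen2', 'Emu3.5']
print(models_supporting("understand", "edit"))
# ['Bagel', 'OmniGen2', 'Emu3.5']
```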

Model Summaries

Bagel

A Mixture-of-Transformer-Experts (MoT) model that supports all three capabilities: understanding, generation, and editing. It uses a shared Qwen2-based language backbone with a SigLIP vision encoder and a diffusion head for image generation. Backbone key: bagel. See Bagel guide.

DeepGen

A 5B multimodal model (3B VLM + 2B DiT) supporting text-to-image generation and image editing via a diffusers-compatible pipeline. It does not support understanding: the internal VLM is used only for semantic guidance during generation, not as a standalone VQA interface. Backbone key: deepgen. See DeepGen guide.

OmniGen2

A unified multimodal model that handles image understanding, text-to-image generation, and image editing within a single architecture. Backbone key: omnigen2. See OmniGen2 guide.

Emu3

A next-token-prediction multimodal model with separate checkpoints for generation and understanding. Backbone key: emu3. See Emu3 guide.

Emu3.5

Next-generation native multimodal model from BAAI with unified world modeling. Features 4-5x faster inference via vLLM, discrete diffusion adaptation (DiDA), and RL post-training. Supports text-to-image (T2I), any-to-image (X2I), and interleaved generation. Note: integrating this model required a minor patch to the model repo's modeling_emu3.py (commenting out dead FX tracing code that is incompatible with transformers >= 4.55). Backbone key: emu3_5. See Emu3.5 guide.

MMaDA

MMaDA (Multimodal Large Diffusion Language Models) is an 8B multimodal model from Gen-Verse that unifies text generation, image generation, and image understanding through a masked diffusion framework. Unlike autoregressive models, MMaDA uses discrete masked diffusion for all modalities: both text and image tokens are generated via iterative demasking. It uses MagVITv2 as the visual tokenizer with a codebook size of 8192. Available in Base and MixCoT variants. Backbone key: mmada. See MMaDA guide.
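As a rough illustration of iterative demasking, the toy sketch below starts from a fully masked sequence and reveals the most confident proposals a few at a time. It is a generic masked-generation loop, not MMaDA's sampler: `toy_predict` is a hypothetical stand-in that proposes random tokens with random confidences, and the even reveal schedule is an assumption. Only the codebook size of 8192 comes from the text above.

```python
import random

MASK = -1  # sentinel marking a position that is still masked


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a model: propose (token, confidence) for each masked slot."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, token in enumerate(tokens)
        if token == MASK
    }


def iterative_demask(length, vocab_size, steps, seed=0):
    """Fill a fully masked sequence over `steps` rounds of confident reveals."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(tokens, vocab_size, rng)
        if not proposals:
            break
        # Reveal an even share of what is left, so the final step reveals all.
        remaining_steps = steps - step
        n_reveal = -(-len(proposals) // remaining_steps)  # ceiling division
        by_confidence = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )
        for i in by_confidence[:n_reveal]:
            tokens[i] = proposals[i][0]
    return tokens


image_tokens = iterative_demask(length=16, vocab_size=8192, steps=4)
print(image_tokens)
```

In a real model the proposals at each step are conditioned on the tokens already revealed, which is what lets later steps refine the result; the random stand-in here only demonstrates the control flow.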

Janus

The original decoupled visual encoding model from DeepSeek (1.3B), with separate vision encoders for understanding and generation. It generates images through autoregressive prediction of VQ tokens. Backbone key: janus_pro (the same key listed for Janus-Pro). See Janus guide.

Janus-Pro

Scaled-up version of Janus, available in 1B and 7B sizes, with improved training and stronger multimodal capabilities. It shares the Janus architecture but performs significantly better. Backbone key: janus_pro. See Janus-Pro guide.

JanusFlow

Rectified flow variant of the Janus architecture from DeepSeek. Uses continuous ODE-based generation with an external SDXL VAE for image decoding, instead of the autoregressive VQ token approach used by Janus/Janus-Pro. Backbone key: janus_flow. See JanusFlow guide.
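To show what continuous ODE-based generation means in practice, here is a toy Euler integrator for a rectified-flow style path. It is a sketch under strong assumptions: `toy_velocity` is a hypothetical closed-form velocity that points straight at a known target, whereas a real model predicts the velocity with a network conditioned on the prompt, and the final VAE decoding step is omitted.

```python
import random


def toy_velocity(x, t, target):
    """Ideal straight-line velocity toward `target` at time t in [0, 1)."""
    return [(x_target - x_i) / (1.0 - t) for x_i, x_target in zip(x, target)]


def euler_sample(target, num_steps, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above stays safe
        v = toy_velocity(x, t, target)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


latent = euler_sample(target=[0.5, -1.0, 2.0], num_steps=30)
print(latent)
```

Because the toy path is exactly straight, Euler integration lands on the target regardless of the step count; with a learned velocity field the number of steps trades speed against accuracy.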

Show-o

The original unified transformer from Show Lab, combining autoregressive text modeling with discrete diffusion for image generation. It uses Phi-1.5 as the LLM base and MagVITv2 as the visual tokenizer. Backbone key: show_o. See Show-o guide.

Show-o2

Next-generation unified model from Show Lab that replaces discrete diffusion with flow matching and upgrades to a Qwen2.5 LLM with the Wan2.1 3D causal VAE. Backbone key: show_o2. See Show-o2 guide.

BLIP3-o

Generation-only model from Salesforce built on the BLIP3 architecture; it does not support understanding. Backbone key: blip3o. See BLIP3-o guide.

TokenFlow

Generation-focused model from ByteFlow AI using a token-flow framework for text-to-image synthesis. Backbone key: tokenflow. See TokenFlow guide.

Adding a New Model

The backbone adapter system is designed to be extended. Any new multimodal model can be integrated by implementing a BackboneAdapter subclass and registering it; no changes to the inference pipeline or CLI are needed.
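The subclass-and-register pattern might look roughly like the sketch below. The names `register_backbone`, `build_backbone`, `_REGISTRY`, and `MyModelAdapter`, along with the `my_model` key and the single-method base class, are hypothetical and chosen for illustration; the Extending guide documents the actual mechanism.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping backbone keys to adapter classes.
_REGISTRY = {}


class BackboneAdapter(ABC):
    """Illustrative base class; the real interface may expose more methods."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Produce an image (a placeholder string in this sketch)."""


def register_backbone(key):
    """Class decorator that records an adapter class under a backbone key."""

    def decorator(cls):
        if key in _REGISTRY:
            raise ValueError(f"backbone key already registered: {key!r}")
        if not issubclass(cls, BackboneAdapter):
            raise TypeError(f"{cls.__name__} must subclass BackboneAdapter")
        _REGISTRY[key] = cls
        return cls

    return decorator


def build_backbone(key, **kwargs):
    """Look up an adapter by key and instantiate it."""
    try:
        cls = _REGISTRY[key]
    except KeyError:
        known = sorted(_REGISTRY)
        raise KeyError(f"unknown backbone key {key!r}; known keys: {known}") from None
    return cls(**kwargs)


@register_backbone("my_model")
class MyModelAdapter(BackboneAdapter):
    def __init__(self, checkpoint="path/to/checkpoint"):
        self.checkpoint = checkpoint  # placeholder path, not a real file

    def generate(self, prompt: str) -> str:
        return f"<image from {self.checkpoint} for: {prompt}>"


adapter = build_backbone("my_model")
print(adapter.generate("a red bicycle"))
# <image from path/to/checkpoint for: a red bicycle>
```

Because callers only ever look adapters up by key, code that dispatches on the key does not need to change when a new adapter is registered.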

See the Extending guide for a step-by-step walkthrough.