Models Overview¶
TorchUMM integrates thirteen multimodal models through a unified backbone adapter interface. Each model is wrapped as a BackboneAdapter that exposes a common API for generation, understanding, and (where supported) editing.
Capability Matrix¶
| Model | Parameters | Understand | Generate | Edit | Guide |
|---|---|---|---|---|---|
| Bagel | 7B | Yes | Yes | Yes | bagel.md |
| DeepGen | 5B | No | Yes | Yes | deepgen.md |
| OmniGen2 | 7B | Yes | Yes | Yes | omnigen2.md |
| Emu3 | 8B | Yes | Yes | No | emu3.md |
| Emu3.5 | 34B | Yes | Yes | Yes | emu3_5.md |
| MMaDA | 8B | Yes | Yes | No | mmada.md |
| Janus | 1.3B | Yes | Yes | No | janus.md |
| Janus-Pro | 1B, 7B | Yes | Yes | No | janus_pro.md |
| JanusFlow | 1.3B | Yes | Yes | No | janus_flow.md |
| Show-o | 1.3B | Yes | Yes | No | show_o.md |
| Show-o2 | 1.5B, 7B | Yes | Yes | No | show_o2.md |
| BLIP3-o | 4B | No | Yes | No | blip3o.md |
| TokenFlow | 7B | No | Yes | No | tokenflow.md |
Model Summaries¶
Bagel¶
Mixture-of-Transformer (MoT) model supporting all three capabilities --- understanding, generation, and editing. Uses a shared Qwen2-based language backbone with a SigLIP vision encoder and a diffusion head for image generation. Backbone key: bagel. See Bagel guide.
DeepGen¶
A 5B multimodal model (3B VLM + 2B DiT) supporting text-to-image generation and image editing via a diffusers-compatible pipeline. Does not support understanding — the internal VLM is used only for semantic guidance during generation, not as a standalone VQA interface. Backbone key: deepgen. See DeepGen guide.
OmniGen2¶
A unified multimodal model that handles image understanding, text-to-image generation, and image editing within a single architecture. Backbone key: omnigen2. See OmniGen2 guide.
Emu3¶
Predict-next-token multimodal model with separate generation and understanding model checkpoints. Backbone key: emu3. See Emu3 guide.
Emu3.5¶
Next-generation native multimodal model from BAAI with unified world modeling. Features 4-5x faster inference via vLLM, discrete diffusion adaptation (DiDA), and RL post-training. Supports T2I, X2I, and interleaved generation. Note: required a minor patch to the model repo's modeling_emu3.py (commenting out dead FX tracing code incompatible with transformers >= 4.55). Backbone key: emu3_5. See Emu3.5 guide.
MMaDA¶
Masked Diffusion Adaptation (MMaDA) is an 8B multimodal model from Gen-Verse that unifies text generation, image generation, and image understanding through a masked diffusion framework. Unlike autoregressive models, MMaDA uses discrete masked diffusion for all modalities --- both text and image tokens are generated via iterative demasking. Uses MagVITv2 as the visual tokenizer with a codebook size of 8192. Available in Base and MixCoT variants. Backbone key: mmada. See MMaDA guide.
Janus¶
Original decoupled visual encoding model from DeepSeek (1.3B), with separate vision encoders for understanding and generation. Uses VQ autoregressive token prediction for image generation. Backbone key: janus_pro. See Janus guide.
Janus-Pro¶
Scaled-up version of Janus (7B) with improved training and stronger multimodal capabilities. Shares the same architecture as Janus but with significantly better performance. Backbone key: janus_pro. See Janus-Pro guide.
JanusFlow¶
Rectified flow variant of the Janus architecture from DeepSeek. Uses continuous ODE-based generation with an external SDXL VAE for image decoding, instead of the autoregressive VQ token approach used by Janus/Janus-Pro. Backbone key: janus_flow. See JanusFlow guide.
Show-o¶
Original unified transformer from Show Lab combining autoregressive text modeling with discrete diffusion for image generation. Uses Phi-1.5 as LLM base and MagVITv2 as visual tokenizer. Backbone key: show_o. See Show-o guide.
Show-o2¶
Next-generation unified model from Show Lab, replacing discrete diffusion with flow matching and upgrading to Qwen2.5 LLM with Wan2.1 3D Causal VAE. Backbone key: show_o2. See Show-o2 guide.
BLIP3-o¶
Generation-focused model from Salesforce built on the BLIP3 architecture. Generation only --- no understanding support. Backbone key: blip3o. See BLIP3-o guide.
TokenFlow¶
Generation-focused model from ByteFlow AI using a token-flow framework for text-to-image synthesis. Backbone key: tokenflow. See TokenFlow guide.
Adding a New Model¶
The backbone adapter system is designed to be extended. Any new multimodal model can be integrated by implementing a BackboneAdapter subclass and registering it — no changes to the inference pipeline or CLI are needed.
See the Extending guide for a step-by-step walkthrough.