Show-o¶
Original unified multimodal model from Show Lab, combining autoregressive and discrete diffusion modeling for understanding and generation.
- Original repository: https://github.com/showlab/Show-o
- Paper: Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (ICLR 2025)
- Backbone key:
show_o - Capabilities: Understanding (image-based), Generation (NO Editing, NO text-only understanding)
Architecture¶
Show-o uses a single transformer that processes text tokens autoregressively with causal attention and image tokens via discrete denoising diffusion with full attention. Key components:
- LLM base: Phi-1.5
- Visual tokenizer: MagVITv2 (discrete)
- Image generation: Discrete denoising diffusion
- Max resolution: 512x512
Dependencies¶
The model environment is managed via the show_o image defined in modal/images.py (Python 3.10, PyTorch 2.2.1, CUDA 12.1, xformers). For local setup, install the dependencies listed in model/Show-o/requirements.txt.
Relationship to Show-o2¶
Show-o and Show-o2 share the same backbone adapter (ShowOBackbone) with version-based branching. The version is auto-detected from the model's config.json, or can be explicitly set via version: 1 in config.
Key differences from Show-o2:
| Aspect | Show-o (v1) | Show-o2 (v2) |
|---|---|---|
| LLM base | Phi-1.5 | Qwen2.5 |
| Visual tokenizer | MagVITv2 (discrete) | Wan2.1 3D Causal VAE |
| Generation | Discrete diffusion | Flow matching |
| Video support | No | Yes |
| Text-only understanding | No | Yes |
Inference¶
CLI¶
# Generation
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_show_o_generation.yaml
Python API¶
from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest
pipeline = InferencePipeline(backbone_name="show_o", backbone_cfg={
"model_path": "/path/to/show-o-w-clip-vit-512x512",
"show_o_root": "/path/to/model/Show-o",
"vq_model_path": "/path/to/magvitv2",
"version": 1,
"seed": 42,
})
# Generation
result = pipeline.run(InferenceRequest(
backbone="show_o", task="generation",
prompt="A cat sitting on a rainbow",
))
# Understanding (requires image input)
result = pipeline.run(InferenceRequest(
backbone="show_o", task="understanding",
prompt="Describe this image",
images=["path/to/image.jpg"],
))
Supported Benchmarks¶
| Benchmark | Config |
|---|---|
| DPG Bench | configs/eval/dpg_bench/modal_dpg_bench_show_o.yaml |
| GenEval | configs/eval/geneval/modal_geneval_show_o.yaml |
| WISE | configs/eval/wise/modal_wise_show_o.yaml |
| UEval | configs/eval/ueval/modal_ueval_show_o.yaml |
| Uni-MMMU | configs/eval/uni_mmmu/modal_uni_mmmu_show_o.yaml |
| MME | configs/eval/mme/modal_mme_show_o.yaml |
| MMMU | configs/eval/mmmu/modal_mmmu_show_o.yaml |
| MMBench | configs/eval/mmbench/modal_mmbench_show_o.yaml |
| MM-Vet | configs/eval/mmvet/modal_mmvet_show_o.yaml |
| MathVista | configs/eval/mathvista/modal_mathvista_show_o.yaml |
Key Configuration Parameters¶
model_path: Path to Show-o model weights (e.g.,showlab/show-o-w-clip-vit-512x512)vq_model_path: Path to MagVITv2 discrete tokenizer (e.g.,showlab/magvitv2)show_o_root: Path to the Show-o repository rootversion: Set to1for Show-o v1 (auto-detected if omitted)