Skip to content

Show-o

Original unified multimodal model from Show Lab, combining autoregressive and discrete diffusion modeling for understanding and generation.

Architecture

Show-o uses a single transformer that processes text tokens autoregressively with causal attention and image tokens via discrete denoising diffusion with full attention. Key components:

  • LLM base: Phi-1.5
  • Visual tokenizer: MagVITv2 (discrete)
  • Image generation: Discrete denoising diffusion
  • Max resolution: 512x512

Dependencies

The model environment is managed via the show_o image defined in modal/images.py (Python 3.10, PyTorch 2.2.1, CUDA 12.1, xformers). For local setup, install the dependencies listed in model/Show-o/requirements.txt.

Relationship to Show-o2

Show-o and Show-o2 share the same backbone adapter (ShowOBackbone) with version-based branching. The version is auto-detected from the model's config.json, or can be explicitly set via version: 1 in config.

Key differences from Show-o2:

Aspect Show-o (v1) Show-o2 (v2)
LLM base Phi-1.5 Qwen2.5
Visual tokenizer MagVITv2 (discrete) Wan2.1 3D Causal VAE
Generation Discrete diffusion Flow matching
Video support No Yes
Text-only understanding No Yes

Inference

CLI

# Generation
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_show_o_generation.yaml

Python API

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(backbone_name="show_o", backbone_cfg={
    "model_path": "/path/to/show-o-w-clip-vit-512x512",
    "show_o_root": "/path/to/model/Show-o",
    "vq_model_path": "/path/to/magvitv2",
    "version": 1,
    "seed": 42,
})

# Generation
result = pipeline.run(InferenceRequest(
    backbone="show_o", task="generation",
    prompt="A cat sitting on a rainbow",
))

# Understanding (requires image input)
result = pipeline.run(InferenceRequest(
    backbone="show_o", task="understanding",
    prompt="Describe this image",
    images=["path/to/image.jpg"],
))

Supported Benchmarks

Benchmark Config
DPG Bench configs/eval/dpg_bench/modal_dpg_bench_show_o.yaml
GenEval configs/eval/geneval/modal_geneval_show_o.yaml
WISE configs/eval/wise/modal_wise_show_o.yaml
UEval configs/eval/ueval/modal_ueval_show_o.yaml
Uni-MMMU configs/eval/uni_mmmu/modal_uni_mmmu_show_o.yaml
MME configs/eval/mme/modal_mme_show_o.yaml
MMMU configs/eval/mmmu/modal_mmmu_show_o.yaml
MMBench configs/eval/mmbench/modal_mmbench_show_o.yaml
MM-Vet configs/eval/mmvet/modal_mmvet_show_o.yaml
MathVista configs/eval/mathvista/modal_mathvista_show_o.yaml

Key Configuration Parameters

  • model_path: Path to Show-o model weights (e.g., showlab/show-o-w-clip-vit-512x512)
  • vq_model_path: Path to MagVITv2 discrete tokenizer (e.g., showlab/magvitv2)
  • show_o_root: Path to the Show-o repository root
  • version: Set to 1 for Show-o v1 (auto-detected if omitted)
# Download model weights
modal run modal/download.py --model show_o

# Run GenEval
modal run modal/run.py --model show_o --eval-config modal_geneval_show_o

# Run MME
modal run modal/run.py --model show_o --eval-config modal_mme_show_o