Emu3.5¶
Next-generation native multimodal model from BAAI with unified world modeling and 4-5x faster inference via vLLM.
- Original repository: https://github.com/baaivision/Emu3.5
- Backbone key:
emu3_5 - Capabilities: Understanding, Generation (NO Editing)
Dependencies¶
The model environment is managed via the emu3_5 image defined in modal/images.py:
- Python 3.12, PyTorch 2.8.0, vLLM 0.11.0, flash-attn 2.8.3, transformers 4.56.1
- Default GPU: A100-80GB:2 (tensor_parallel_size=2)
For local setup, install dependencies from model/Emu3.5/requirements/vllm.txt, then apply BAAI's vLLM patches:
Flash Attention (required)¶
Emu3.5 requires Flash Attention (v2.8.3, used by vLLM). The Modal image already includes it. For local setup, install a pre-compiled wheel matching your environment — see modal/README.md for the exact environment parameters and installation instructions.
Architecture Note¶
Emu3.5 uses three model variants:
- Emu3.5 — general multimodal model (T2I, X2I, interleaved generation, understanding)
- Emu3.5-Image — optimized for T2I/X2I (recommended for generation benchmarks, cfg=5.0)
- Emu3.5-VisionTokenizer — IBQ-based vision tokenizer (shared across variants)
Model weights paths: emu3_5/Emu3.5, emu3_5/Emu3.5-Image, emu3_5/Emu3.5-VisionTokenizer.
Key features: - Native vLLM backend with optimized attention kernels (PagedAttention, FlashAttention, CUDA graphs) - Custom cond/uncond batch scheduler for classifier-free guidance (4-5x speedup) - Classifier-free guidance with differential sampling - Protobuf output format for multimodal sequences
Adapter Architecture — How Emu3.5 Differs from Other Models¶
Emu3.5's adapter is architecturally different from all other models in TorchUMM:
| Aspect | Other Models | Emu3.5 |
|---|---|---|
| vLLM integration | TransformersForCausalLM wrapper (PyTorch forward pass) |
Native Emu3_5ForCausalLM via BAAI patches (vLLM optimized kernels) |
| Architecture registration | auto_map in config.json or transformers built-in | 20 vLLM source patches applied at Docker image build time |
| Attention implementation | Transformers' eager/sdpa/flash_attention_2 | vLLM's Attention with QKVParallelLinear, RowParallelLinear |
| Batch scheduling | vLLM default scheduler | Custom batch_scheduler.Scheduler for CFG cond/uncond batching |
| Logits processing | Standard sampling | Custom logits processor for image token generation |
Why Patches Instead of TransformersForCausalLM?¶
The HuggingFace checkpoint declares model_type: "Emu3", which collides with transformers' built-in Emu3 (a different multimodal architecture). The TransformersForCausalLM fallback would:
1. Fail to load due to the model_type collision (built-in Emu3 expects text_config/vision_config)
2. Even if loaded, use PyTorch native forward pass with no vLLM kernel optimizations
BAAI's patches solve both problems by registering a dedicated Emu3_5Config and Emu3_5ForCausalLM directly in vLLM's model registry, bypassing transformers' auto-discovery entirely.
Patch Details¶
The 20 patch files in model/Emu3.5/third_party/vllm/ modify vLLM 0.11.0 to add:
- Model architecture (
model_executor/models/emu3_5.py) — native vLLM implementation withQKVParallelLinear,MergedColumnParallelLinear,RMSNorm,Attention - Config registration (
transformers_utils/config.py,configs/emu3_5.py) — mapsmodel_type="Emu3"toEmu3_5Config - Model registry (
model_executor/models/registry.py) — registersEmu3_5ForCausalLM - Batch scheduling (
v1/core/sched/) — custom scheduler for concurrent conditional/unconditional generation - Logits processing (
v1/sample/logits_processor/) — image token sampling logic - Engine/worker (
v1/engine/,v1/worker/) — adapted for Emu3.5's generation pattern
These patches are applied during Docker image build in modal/images.py via model/Emu3.5/src/patch/apply.py.
Model Repo Modification (exception to model/ no-edit rule)¶
The repo's modeling_emu3.py imports is_torch_fx_available from transformers, which was removed in transformers >= 4.55 (required by vLLM 0.11.0). Since this is dead code during inference, we commented out:
- Line 55: from transformers.utils.import_utils import is_torch_fx_available
- Lines 64-70: The if is_torch_fx_available(): ... block
This is the only modification to the model/ directory, with zero impact on model behavior.
Inference¶
Python API¶
from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest
pipeline = InferencePipeline(backbone_name="emu3_5", backbone_cfg={
"model_path": "/model_cache/emu3_5/Emu3.5-Image",
"vq_path": "/model_cache/emu3_5/Emu3.5-VisionTokenizer",
"emu3_5_root": "/workspace/model/Emu3.5",
"use_vllm": True,
"tensor_parallel_size": 2,
"classifier_free_guidance": 5.0,
})
# Generation
result = pipeline.run(InferenceRequest(
backbone="emu3_5", task="generation",
prompt="A cat sitting on a rainbow",
params={"max_new_tokens": 5120},
))
Supported Benchmarks¶
| Benchmark | Config (Modal) |
|---|---|
| GenEval | configs/eval/geneval/modal_geneval_emu3_5.yaml |
| GenEval (score) | configs/eval/geneval/modal_geneval_emu3_5_score.yaml |
| WISE | configs/eval/wise/modal_wise_emu3_5.yaml |
| WISE (score) | configs/eval/wise/modal_wise_emu3_5_score.yaml |
| DPG Bench | configs/eval/dpg_bench/modal_dpg_bench_emu3_5.yaml |
| Uni-MMMU (generate) | configs/eval/uni_mmmu/uni_mmmu_emu3_5_generate.yaml |
| Uni-MMMU (score) | configs/eval/uni_mmmu/uni_mmmu_emu3_5_score.yaml |
# Example: run GenEval on Modal
modal run modal/run.py --model emu3_5 --eval-config geneval/modal_geneval_emu3_5 --gpu A100-80GB:2
# Example: run WISE on Modal
modal run modal/run.py --model emu3_5 --eval-config modal_wise_emu3_5 --gpu A100-80GB:2
Performance¶
On 2×A100-80GB: - Generation throughput: ~74 tokens/s - Time per image: ~110 seconds (5120 tokens × 2 for CFG) - Model loading: ~42 seconds - First-run compilation: ~250 seconds (cached on subsequent runs)
Key Configuration Parameters¶
| Parameter | Default | Description |
|---|---|---|
use_vllm |
true |
Use vLLM backend (required; native patches must be applied) |
tensor_parallel_size |
2 |
Number of GPUs for vLLM tensor parallelism |
gpu_memory_utilization |
0.7 |
vLLM GPU memory fraction |
classifier_free_guidance |
5.0 |
CFG scale (5.0 for Emu3.5-Image T2I, 2.0 for Emu3.5 base) |
max_new_tokens |
5120 |
Maximum tokens for T2I generation |
image_area |
1048576 |
Target image area in pixels (1024x1024) |