A Tutorial on Unified Multimodal Models
A comprehensive tutorial on the architecture design, representation learning, training dynamics, and evaluation of unified multimodal models that integrate understanding and generation within a single framework.
How to Model?
A systematic taxonomy of UMM architectures — External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling — with an analysis of the trade-offs among autoregressive, diffusion, and hybrid approaches.
How to Represent?
The "Unified Tokenizer" debate: continuous representations (e.g., CLIP) vs. discrete tokens (e.g., VQ-VAE), and hybrid encoding strategies balancing semantic understanding with generative fidelity.
How to Train?
The full training lifecycle — from constructing interleaved image-text data to unified pre-training objectives and advanced post-training alignment methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO).
Introduction & Motivation
Tracing the evolution of multimodal AI from isolated, task-specific systems to Unified Multimodal Models. We introduce the core motivations driving unification — particularly the mutual reinforcement between understanding and generation — and provide a rigorous definition of UMMs.
Modeling Architectures
A systematic taxonomy including External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling, with a deep dive into the trade-offs among autoregressive, diffusion, and emerging AR-Diffusion hybrid approaches.
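To make the paradigm contrast concrete, here is a minimal PyTorch sketch with single linear layers standing in for a real backbone; the names and update rules are illustrative only, not taken from any model covered in the tutorial. Autoregressive decoding emits one discrete token per step, while diffusion decoding iteratively refines an entire continuous latent.

```python
import torch

torch.manual_seed(0)
vocab_size, dim, steps = 1024, 64, 8

# Toy stand-ins for a shared backbone: single linear layers in place of
# a real transformer. Nothing here corresponds to a specific model.
lm_head = torch.nn.Linear(dim, vocab_size)   # hidden state -> token logits
denoiser = torch.nn.Linear(dim, dim)         # noisy latent -> predicted noise

# Autoregressive decoding: emit one discrete token per step.
hidden = torch.randn(1, dim)                 # stand-in for the context state
tokens = []
for _ in range(steps):
    next_tok = lm_head(hidden).argmax(dim=-1)
    tokens.append(next_tok.item())
    hidden = hidden + 0.1 * torch.randn_like(hidden)  # real models re-embed the token

# Diffusion decoding: refine a whole continuous latent over many steps.
latent = torch.randn(1, dim)                 # start from pure noise
for _ in range(steps):
    latent = latent - (1.0 / steps) * denoiser(latent)  # crude Euler-style update

print(tokens, latent.norm().item())
```

Hybrid AR-Diffusion designs interleave these two loops, for example using autoregressive decoding for text tokens and diffusion refinement for image latents.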
The Unified Tokenizer Challenge
Comparing continuous representations with discrete tokenization schemes. A review of encoding/decoding strategies and state-of-the-art hybrid approaches — cascade and dual-branch designs — that bridge semantic richness with generative fidelity.
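As a rough illustration of the discrete side of this debate, the sketch below implements the nearest-neighbor lookup at the heart of VQ-style tokenization. The codebook here is random rather than learned, and the comments note what each route trades away.

```python
import torch

torch.manual_seed(0)
codebook_size, dim = 512, 16
codebook = torch.randn(codebook_size, dim)   # learned jointly in a real VQ-VAE

def quantize(z: torch.Tensor):
    """Snap continuous features to their nearest codebook entries."""
    dists = torch.cdist(z, codebook)         # pairwise L2 distances
    ids = dists.argmin(dim=-1)               # discrete token ids for AR modeling
    return ids, codebook[ids]                # and the quantized vectors for decoding

z = torch.randn(4, dim)                      # e.g., patch features from an image encoder
ids, z_q = quantize(z)

# Continuous route (CLIP-style): keep z as-is; semantically rich, but there is
# no discrete vocabulary to plug into next-token prediction.
# Discrete route (VQ-VAE-style): ids feed a language-model head, at the cost
# of the quantization error measured here:
print(ids.tolist(), (z - z_q).pow(2).mean().item())
```

Cascade and dual-branch hybrids aim to keep both outputs, routing the continuous features to understanding heads and the discrete ids to generation heads.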
Training Recipes & Data
Constructing high-quality modality-interleaved datasets, designing unified pre-training objectives, and applying advanced post-training alignment methods, including preference-based approaches such as DPO and GRPO.
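For readers who want the preference-based objective in concrete form, here is a minimal sketch of the standard DPO loss; the `dpo_loss` helper and its tensor shapes are illustrative, and the log-probabilities in the example are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on summed sequence log-probabilities.

    The first two arguments come from the policy being trained, the last two
    from a frozen reference model; each is log p(response | prompt).
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of two preference pairs with made-up log-probabilities.
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
    torch.tensor([-12.5, -9.0]), torch.tensor([-13.0, -8.5]),
)
print(loss.item())
```

GRPO, a group-wise policy-gradient method, does not reduce to a single closed-form loss like this and is treated in full during the session.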
Evaluation, Applications & Future Directions
Reviewing existing benchmarks for standardized evaluation, discussing real-world applications in robotics and autonomous driving, and highlighting open challenges including scalable unified tokenizers and unified world models.
Unified Codebase & Integration
A practical walkthrough of our unified multimodal codebase, explaining how core components — tokenizers, multimodal encoders, and generative backbones — are organized and connected in practice.
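As a preview of the kind of wiring the walkthrough covers, the sketch below composes the three components into a single module. The class and method names are hypothetical and may differ from the released code.

```python
import torch
from torch import nn

class UnifiedModel(nn.Module):
    """Illustrative wiring only; names are hypothetical, not the released API."""

    def __init__(self, tokenizer: nn.Module, encoder: nn.Module, backbone: nn.Module):
        super().__init__()
        self.tokenizer = tokenizer   # image/text -> tokens or latents
        self.encoder = encoder       # fuses modalities for understanding
        self.backbone = backbone     # shared generative model (AR, diffusion, or hybrid)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        latents = self.tokenizer(inputs)   # representation stage
        fused = self.encoder(latents)      # understanding stage
        return self.backbone(fused)        # generation stage

# Toy stand-ins so the three-stage pipeline can be exercised end to end.
dim = 32
model = UnifiedModel(nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim))
print(model(torch.randn(2, dim)).shape)    # torch.Size([2, 32])
```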
Jindong Wang
Yinyi Luo
Haoyue Bai
Evolution of Multimodal Models
From isolated understanding-only or generation-only systems to unified multimodal foundation models capable of handling both tasks within a single framework.
Modeling Paradigms for UMMs
A taxonomy of architectures including External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling, with comparisons among autoregressive, diffusion, and hybrid approaches.
Unified Tokenizer & Representation Design
Continuous versus discrete representations, their advantages and limitations, and emerging hybrid encoding strategies that balance semantic understanding and generative fidelity.
Training Lifecycle & Alignment
Construction of modality-interleaved datasets, unified pre-training objectives, and post-training alignment methods such as DPO and GRPO.
Benchmarks, Applications & Open Challenges
Evaluation protocols, real-world applications in robotics and autonomous driving, and future directions such as scalable unified tokenizers and unified world models.
Slides
All presentation slides will be made publicly available on this website following the event.
Coming Soon
Bibliography
An annotated compilation of all references discussed in the tutorial, organized as a comprehensive reading list.
Coming Soon
Codebase
Open-source unified multimodal codebase with annotated pointers to models (e.g., Emu, Janus) and datasets.
Coming Soon