Supported Benchmarks¶
TorchUMM supports 11 benchmarks spanning image generation, visual understanding, and image editing.
Benchmark Reference¶
| Benchmark | Evaluates | Required Capabilities | Data Source | Data Prep |
|---|---|---|---|---|
| DPG Bench | Text-to-image detail preservation | Generation | Included in repo | Details |
| GenEval | Compositional text-to-image generation | Generation | Included in repo | Details |
| WISE | World knowledge in image generation | Generation | Included in repo | Details |
| UEval | Unified understanding + generation | Understanding + Generation | HuggingFace | Details |
| Uni-MMMU | Multimodal understanding, generation, and editing | Understanding + Generation + Editing | HuggingFace | Details |
| MME | Multimodal perception and cognition | Understanding | HuggingFace | Details |
| MMMU | Massive multimodal understanding | Understanding | HuggingFace (auto) | Details |
| MMBench | Systematic VLM evaluation | Understanding | OpenMMLab | Details |
| MM-Vet | Integrated vision-language capabilities | Understanding | GitHub | Details |
| MathVista | Mathematical reasoning with visuals | Understanding | HuggingFace | Details |
| GEdit-Bench | Image editing quality (VIEScore) | Editing | HuggingFace | Details |
Evaluation Types¶
Single-Stage Benchmarks¶
These benchmarks run generation and scoring in a single command:
- DPG Bench --- generates images and computes detail-preservation scores
- MME --- runs perception and cognition evaluation
- MMMU --- runs multimodal understanding evaluation
- MMBench --- runs systematic VLM evaluation
- MM-Vet --- runs integrated vision-language evaluation
- MathVista --- runs mathematical reasoning evaluation
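The single-stage pattern can be sketched as a single pass that predicts and scores each example in one process. This is a minimal illustration only, not TorchUMM's actual implementation; `answer` and `score` are hypothetical stand-ins for the model and the metric.

```python
# Hypothetical single-stage loop: generation and scoring happen in one
# pass, so one command/config covers the whole benchmark run.
def answer(question: str) -> str:
    # Stand-in for the model under evaluation.
    return "yes"

def score(pred: str, gold: str) -> float:
    # Stand-in for the benchmark metric (here: exact match).
    return float(pred == gold)

def run_benchmark(examples):
    # One pass: predict and score each example, then aggregate.
    scores = [score(answer(q), gold) for q, gold in examples]
    return sum(scores) / len(scores)

examples = [("Is there a cat?", "yes"), ("Is it raining?", "no")]
print(run_benchmark(examples))  # 0.5 with these stubs
```

Because everything happens in one process, a single config file is enough to drive the whole run.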
Two-Stage Benchmarks¶
These benchmarks separate generation from scoring, which allows using different models (or environments) for each stage:
- GenEval --- generate images, then score with an object detector
- WISE --- generate images, then score with Qwen VL models
- UEval --- generate text + image answers, then score with Qwen models
- Uni-MMMU --- generate outputs, then score
- GEdit-Bench --- edit images, then score with VIEScore (Qwen VL)
```bash
# Step 1: Generate
PYTHONPATH=src python -m umm.cli.main eval \
  --config configs/eval/geneval/geneval_bagel_generate.yaml

# Step 2: Score
PYTHONPATH=src python -m umm.cli.main eval \
  --config configs/eval/geneval/geneval_bagel_score.yaml
```
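The key idea behind the two-stage split is an intermediate file: stage 1 persists generations to disk, and stage 2 loads and scores them, possibly in a different environment with a different model. The sketch below illustrates that pattern with hypothetical stand-ins (`generate_image`, `score_output`); it is not TorchUMM's real generation or scoring code, which is driven by the YAML configs above.

```python
import json
import tempfile
from pathlib import Path

def generate_image(prompt: str) -> str:
    # Hypothetical stand-in for the generation model.
    return f"image_for::{prompt}"

def score_output(image: str, prompt: str) -> float:
    # Hypothetical stand-in for the scorer (e.g. a detector or VLM judge).
    return 1.0 if prompt in image else 0.0

def stage1_generate(prompts, out_path: Path) -> None:
    # Stage 1: write generations to an intermediate JSONL file so that
    # scoring can later run as a separate process or environment.
    with out_path.open("w") as f:
        for p in prompts:
            f.write(json.dumps({"prompt": p, "image": generate_image(p)}) + "\n")

def stage2_score(in_path: Path) -> float:
    # Stage 2: load the saved generations and score them independently.
    records = [json.loads(line) for line in in_path.open()]
    return sum(score_output(r["image"], r["prompt"]) for r in records) / len(records)

prompts = ["a red cube", "two dogs playing"]
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "generations.jsonl"
    stage1_generate(prompts, path)
    print(stage2_score(path))  # 1.0 with these stubs
```

Decoupling the stages this way lets the (often heavy) scoring models run on different hardware, or later in time, without re-running generation.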
See Reproducing Results for full two-stage examples.