Supported Benchmarks

TorchUMM supports 11 benchmarks spanning image generation, visual understanding, and image editing.


Benchmark Reference

| Benchmark | Evaluates | Required Capabilities | Data Source | Data Prep |
|---|---|---|---|---|
| DPG Bench | Text-to-image detail preservation | Generation | Included in repo | Details |
| GenEval | Compositional text-to-image generation | Generation | Included in repo | Details |
| WISE | World knowledge in image generation | Generation | Included in repo | Details |
| UEval | Unified understanding + generation | Understanding + Generation | HuggingFace | Details |
| Uni-MMMU | Multimodal understanding, generation, and editing | Understanding + Generation + Editing | HuggingFace | Details |
| MME | Multimodal perception and cognition | Understanding | HuggingFace | Details |
| MMMU | Massive multimodal understanding | Understanding | HuggingFace (auto) | Details |
| MMBench | VLM systematic evaluation | Understanding | OpenMMLab | Details |
| MM-Vet | Integrated vision-language capabilities | Understanding | GitHub | Details |
| MathVista | Mathematical reasoning with visuals | Understanding | HuggingFace | Details |
| GEdit-Bench | Image editing quality (VIEScore) | Editing | HuggingFace | Details |

Evaluation Types

Single-Stage Benchmarks

These benchmarks run inference and scoring in a single command:

  • DPG Bench --- generates images and computes detail-preservation scores
  • MME --- runs perception and cognition evaluation
  • MMMU --- runs multimodal understanding evaluation
  • MMBench --- runs systematic VLM evaluation
  • MM-Vet --- runs integrated vision-language evaluation
  • MathVista --- runs mathematical reasoning evaluation

For example, running DPG Bench end to end:

PYTHONPATH=src python -m umm.cli.main eval \
    --config configs/eval/dpg_bench/dpg_bench_bagel.yaml
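The config file referenced above is not reproduced on this page. As a rough illustration only, a single-stage eval config typically names the model, the benchmark, and an output location; every key below is a hypothetical placeholder, not TorchUMM's actual schema:

```yaml
# Hypothetical sketch -- key names are illustrative placeholders,
# not the real schema. See configs/eval/dpg_bench/ in the repo
# for the actual file.
model: bagel             # checkpoint to evaluate (assumed key)
benchmark: dpg_bench     # which benchmark to run (assumed key)
output_dir: outputs/dpg_bench_bagel
batch_size: 4
```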

Two-Stage Benchmarks

These benchmarks separate generation from scoring, which allows using different models (or environments) for each stage:

  • GenEval --- generate images, then score with an object detector
  • WISE --- generate images, then score with Qwen VL models
  • UEval --- generate text + image answers, then score with Qwen models
  • Uni-MMMU --- generate outputs, then score
  • GEdit-Bench --- edit images, then score with VIEScore (Qwen VL)

For example, the two GenEval stages:

# Step 1: Generate
PYTHONPATH=src python -m umm.cli.main eval \
    --config configs/eval/geneval/geneval_bagel_generate.yaml

# Step 2: Score
PYTHONPATH=src python -m umm.cli.main eval \
    --config configs/eval/geneval/geneval_bagel_score.yaml
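Because both stages share the same CLI entry point and differ only in the config file, they can be chained in a small shell loop. This is a dry-run sketch (the `echo` just prints each command; drop it to actually execute), with config paths taken from the commands above:

```shell
# Dry run: print the command for each GenEval stage in order.
# Remove `echo` to execute for real once the repo is set up.
for stage in generate score; do
    cfg="configs/eval/geneval/geneval_bagel_${stage}.yaml"
    echo PYTHONPATH=src python -m umm.cli.main eval --config "$cfg"
done
```

Running scoring only after generation succeeds (e.g. joining the two commands with `&&`) avoids scoring a partial set of generated images.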

See Reproducing Results for full two-stage examples.