Evaluation Results¶
Disclaimer
All results below are independently reproduced using TorchUMM. They do NOT represent official results from the original model authors. Differences from published numbers may arise due to variations in inference settings, hardware, random seeds, or evaluation protocols.
Generation Benchmarks¶
| Model | DPG Bench | GenEval | WISE |
|---|---|---|---|
| Bagel | 84.11 | 78.81 | 0.3989 |
| DeepGen | pending | 86.59 | 0.5470 |
| Janus-Pro | 83.73 | 78.92 | 0.3811 |
| Janus | — | 40.04 | 0.2222 |
| Janus-Flow | — | 49.99 | 0.2964 |
| Show-o2 | 68.16 | 59.87 | 0.3595 |
| Emu3 | 80.31 | — | 0.3373 |
| Emu3.5 | — | — | 0.6331 |
| OmniGen2 | 84.51 | 78.53 | 0.4029 |
| BLIP3-o | 61.47 | 81.36 | 0.4138 |
| TokenFlow | 71.29 | 52.21 | 0.3056 |
| MMaDA | — | — | 0.6560 |
| Show-o | — | — | — |
| Show-o2 1.5B | — | — | — |
WISE Evaluator: Qwen2.5-VL-72B-Instruct
All WISE scores above are evaluated with Qwen2.5-VL-72B-Instruct as the VLM judge, rather than the GPT-4o judge used by the original WISE benchmark and most published papers. This yields systematically lower absolute scores than paper-reported numbers. For example, the DeepGen paper reports WISE Overall = 0.72 (GPT-4o-scored), while our reproduction yields 0.5470 (Qwen2.5-VL-72B-scored).
Why the gap exists:
- Different scoring VLM biases. Qwen2.5-VL-72B tends to be stricter than GPT-4o, particularly on the Consistency dimension (which carries 70% weight in WiScore). Different VLMs interpret the "ABSOLUTE RUTHLESSNESS" scoring rubric differently.
- Diffusers pipeline vs. native pipeline. For DeepGen specifically, we use the diffusers-format pipeline (deepgenteam/DeepGen-1.0-diffusers) instead of DeepGen's native xtuner/mmengine pipeline. The conversion may introduce minor differences in generation quality from the scheduler implementation, post-processing, or attention computation.
Important: Since all models in our table are evaluated with the same Qwen2.5-VL-72B evaluator, the relative rankings remain valid for fair cross-model comparison. Absolute scores should not be directly compared to GPT-4o-evaluated numbers from published papers.
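For context, WiScore aggregates the judge's per-dimension ratings with a fixed weighting. The sketch below assumes the weights from the original WISE paper (0.7 Consistency, 0.2 Realism, 0.1 Aesthetic Quality, each rated 0–2 and normalized to [0, 1]); only the 70% Consistency weight is stated above, so treat the remaining weights as an assumption.

```python
def wiscore(consistency: float, realism: float, aesthetic: float) -> float:
    """Aggregate per-dimension judge ratings (each on a 0-2 scale) into a WiScore in [0, 1].

    Weights follow the original WISE paper; only the 0.7 Consistency weight is
    confirmed in the note above.
    """
    weighted = 0.7 * consistency + 0.2 * realism + 0.1 * aesthetic
    return weighted / 2.0  # normalize from the 0-2 rating scale to [0, 1]

# A stricter judge that drops Consistency from 2 to 1 on a sample lowers its
# WiScore by 0.35 -- which is why the choice of judge dominates absolute scores.
print(wiscore(2, 2, 2))  # 1.00
print(wiscore(1, 2, 2))  # 0.65
```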
Understanding Benchmarks¶
| Model | MME (Perc.) | MME (Cog.) | MMMU | MMBench | MM-Vet | MathVista | UEval |
|---|---|---|---|---|---|---|---|
| Bagel | 1691.45 | 695.36 | 0.519 | 0.8428 | 65.9 | 71.6 | 30.9 |
| Show-o2 | 1183.86 | 244.64 | 0.479 | 0.43 | 21.3 | 50.6 | 15.0 |
| Emu3 | 1176.03 | 213.21 | 0.314 | — | 30.0 | 44.9 | N/A |
| Janus-Pro | 1547.93 | 293.21 | 0.407 | 0.6993 | 33.7 | 42.8 | 20.6 |
| Emu3.5 | 832.17 | 271.43 | 0.292 | 0.183‡ | 28.0 | 30.6 | N/A |
| MMaDA | 938.96 | 241.43 | 0.289 | — | 11.4 | 24.9 | — |
| Janus | 1221.35 | 264.29 | 0.273 | 0.4691 | — | — | — |
| Janus-Flow | 1305.63 | 251.07 | 0.290 | 0.6486 | — | — | — |
| Show-o2 1.5B | 1413.30 | 291.79 | 0.371 | 0.6813 | — | — | — |
| OmniGen2 | 1562.63 | 596.79 | 0.460 | — | 42.3 | 0.0† | N/A |
| Show-o | 1188.53 | 244.64 | 0.261 | — | 23.3 | 0.0† | — |
MathVista evaluator
All MathVista scores use Qwen3-32B for answer extraction from model responses, with rule-based normalization for scoring. † OmniGen2 and Show-o produce empty responses on MathVista understanding tasks, resulting in 0.0% accuracy.
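As a rough illustration of the rule-based normalization applied after Qwen3-32B extracts a candidate answer (the exact rules in the pipeline may differ), consider:

```python
import re

def normalize_answer(extracted: str, question_type: str, precision: int = 2) -> str:
    """Canonicalize an extracted answer before exact-match scoring (illustrative sketch only)."""
    ans = extracted.strip()
    if question_type == "multi_choice":
        # Keep only the option letter, e.g. "(B) 42" -> "B"
        match = re.search(r"\b[A-H]\b", ans.upper())
        return match.group(0) if match else ans.upper()
    # Free-form: pull the first number, drop thousands separators and units, round
    match = re.search(r"-?\d+(?:\.\d+)?", ans.replace(",", ""))
    if match:
        return str(round(float(match.group(0)), precision))
    return ans.lower()

print(normalize_answer("(B) 42", "multi_choice"))   # -> "B"
print(normalize_answer("3.14159 cm", "free_form"))  # -> "3.14"
```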
UEval compatibility
Emu3 uses separate models for understanding and generation, making it incompatible with UEval's unified evaluation protocol.
Emu3.5 MMBench CircularEval ‡
Emu3.5's MMBench score (18.3%) is far below its naive accuracy (43.7%) due to severe option position bias under MMBench's CircularEval protocol. CircularEval shuffles option order across variants and requires the model to answer correctly on all variants — Emu3.5 picks the same letter regardless of content 23.5% of the time (vs. Emu3's 7.1%), indicating it selects by position rather than understanding. This is an inherent limitation of the unified model architecture, not a code bug.
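For readers unfamiliar with the protocol, the sketch below illustrates the CircularEval idea under a rotated-options assumption; the actual MMBench harness differs in details such as prompt formatting.

```python
from typing import Callable, Dict, List

def circular_eval(question: str, options: List[str], answer: str,
                  predict: Callable[[str, Dict[str, str]], str]) -> bool:
    """Credit a sample only if the model answers correctly under every option rotation."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]         # rotate the option order
        lettered = {chr(ord("A") + i): opt for i, opt in enumerate(rotated)}
        gold = chr(ord("A") + rotated.index(answer))        # the gold letter moves with the rotation
        if predict(question, lettered) != gold:
            return False                                    # a single miss fails the whole sample
    return True

# A position-biased model that always answers "A" gets ~1/n naive accuracy "for free"
# but almost never passes all rotations -- the failure mode described for Emu3.5 above.
```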
Emu3.5 MME hardware variance
Emu3.5 MME scores are hardware-sensitive due to temperature=1.0 sampling (non-greedy) in the understanding pathway. Scores above were measured on PSC Bridges-2 2×H100. On Modal 2×A100-80GB, the same config yields P:781.08 / C:324.64. Other models use greedy decoding and are not affected.
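A minimal illustration of the two decoding regimes (generic sampling parameters, not the exact TorchUMM call sites):

```python
# Greedy decoding is deterministic for a fixed model and input, so MME scores are
# stable across machines. Temperature-1.0 sampling is not bitwise-reproducible across
# GPU types even with a fixed seed: tiny kernel-level numeric differences can flip
# sampled tokens and shift benchmark scores.
greedy_decoding = {"do_sample": False, "max_new_tokens": 64}
emu35_style_sampling = {"do_sample": True, "temperature": 1.0, "top_p": 1.0, "max_new_tokens": 64}
```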
Uni-MMMU Benchmark¶
Uni-MMMU evaluates the bidirectional synergy between generation and understanding across 8 reasoning-centric tasks. See the paper for benchmark details. Full setup guide: eval/generation/uni_mmmu/README.md.
Prerequisites¶
Running Uni-MMMU requires the following resources beyond the generation model itself:
| Resource | Purpose | Path | Setup |
|---|---|---|---|
| Uni-MMMU dataset | Evaluation data (geometry, jigsaw, science, SVG) | `/datasets/uni_mmmu/Uni-MMMU-Eval/data` | `modal run modal/download.py --dataset uni_mmmu` |
| Qwen2.5-VL-72B-Instruct | Overlay/image scoring judge | `/model_cache/evaluator/Qwen2.5-VL-72B-Instruct` | `modal run modal/download.py --model evaluator` |
| Qwen3-32B | Text reasoning scoring judge | `/model_cache/evaluator/Qwen3-32B` | `modal run modal/download.py --model evaluator` |
| DreamSim | Jigsaw perceptual similarity metric | `/model_cache/dreamsim` | Auto-downloaded on first run |
| CairoSVG (optional) | SVG rasterization for code task | N/A (pip package) | Included in Modal uni_mmmu image |
Implementation Details¶
Generation Model Config (Bagel)
| Parameter | Paper (Official Default) | Our Setting |
|---|---|---|
| `num_timesteps` (generation) | 50 | 50 |
| `num_timesteps` (editing) | 50 | 50 |
| `cfg_text_scale` (generation) | 4.0 | 4.0 |
| `cfg_img_scale` (generation) | 1.0 | 1.0 |
| `cfg_img_scale` (editing) | 2.0 | 2.0 |
| `max_think_token_n` | 1000 | 1000 |
| `do_sample` (understanding) | False | False |
| think mode | False | False |
| `seed` | Not specified in paper | 42 |
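For reference, the settings above correspond to an inference config along these lines (a sketch that mirrors the table; the literal config structure in TorchUMM may differ):

```python
# Sketch of the Bagel settings used for Uni-MMMU; key names mirror the table above,
# not necessarily TorchUMM's actual config schema.
bagel_uni_mmmu_config = {
    "generation": {"num_timesteps": 50, "cfg_text_scale": 4.0, "cfg_img_scale": 1.0},
    "editing": {"num_timesteps": 50, "cfg_img_scale": 2.0},
    "understanding": {"do_sample": False},
    "think": False,                # think mode disabled
    "max_think_token_n": 1000,
    "seed": 42,                    # not specified in the paper; fixed for reproducibility
}
```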
Scoring Model Config
| Component | Paper | Our Implementation |
|---|---|---|
| Overlay judge (image_acc) | Qwen2.5-VL-72B | Qwen2.5-VL-72B-Instruct |
| Text judge (text_acc) | Qwen3-32B | Qwen3-32B |
| Overlay judge `do_sample` | Not specified | False (greedy) |
| Text judge `do_sample` | Not specified | False (greedy) |
| Text judge `enable_thinking` | Not specified | True (Qwen3 default) |
| Jigsaw image metric | DreamSim | DreamSim |
| `dreamsim_cache` | N/A | `/model_cache/dreamsim` |
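The jigsaw image score uses DreamSim as a perceptual distance. A minimal usage sketch following the public `dreamsim` package API (the cache path mirrors `dreamsim_cache` above; TorchUMM's wrapper may differ):

```python
from dreamsim import dreamsim
from PIL import Image

# Load the pretrained DreamSim model; cache_dir mirrors the dreamsim_cache path above.
model, preprocess = dreamsim(pretrained=True, device="cuda", cache_dir="/model_cache/dreamsim")

pred = preprocess(Image.open("predicted_jigsaw.png")).to("cuda")
ref = preprocess(Image.open("reference_jigsaw.png")).to("cuda")
distance = model(pred, ref)  # lower distance = more perceptually similar
print(float(distance))
```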
Results¶
| Model | Jig. I | Jig. T | Maze I | Maze T | Slid. I | Slid. T | Geo I | Geo T | Sci. R | Sci. T | Sci. I | Code T | Code S | Code P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel | 0.660 | 0.553 | 0.004 | 0.101 | 0.000 | 0.050 | 0.050 | 0.143 | 0.592 | 0.522 | 0.185 | 0.115 | 0.375 | 0.275 |
| Janus-Pro | — | — | — | — | — | — | — | — | 0.293 | 0.255 | 0.000 | 0.015 | 0.037 | 0.034 |
Note: DeepGen, BLIP3-o, and TokenFlow are excluded from Uni-MMMU as they do not support image understanding. Janus-Pro cannot perform editing tasks (Jigsaw, Maze, Sliding, Geometry), so only generation-dependent tasks (Science, Code) are evaluated.
Editing Benchmarks¶
GEdit-Bench¶
| Model | Subset | EN SC | EN PQ | EN O | CN SC | CN PQ | CN O |
|---|---|---|---|---|---|---|---|
| Bagel | Overall | 6.679 | 7.044 | 6.348 | 6.832 | 7.063 | 6.524 |
| Bagel | Intersection | 6.726 | 7.027 | 6.384 | 6.952 | 7.099 | 6.677 |
| DeepGen | Overall | 7.444 | 7.535 | 7.331 | 7.413 | 7.594 | 7.359 |
| DeepGen | Intersection | 7.520 | 7.593 | 7.423 | 7.426 | 7.587 | 7.385 |
| OmniGen2 | Overall | 6.487 | 7.184 | 6.268 | 6.250 | 7.181 | 6.030 |
| OmniGen2 | Intersection | 6.466 | 7.260 | 6.281 | 6.297 | 7.218 | 6.067 |
| Emu3.5 | Overall | 7.643 | 7.479 | 7.556 | 7.617 | 7.502 | 7.555 |
| Emu3.5 | Intersection | 7.643 | 7.479 | 7.556 | 7.617 | 7.502 | 7.555 |
GEdit-Bench scores are evaluated using Qwen2.5-VL-72B-Instruct via VIEScore (Q_ prefix denotes Qwen-evaluated). SC = Semantic Correctness (0-10), PQ = Perceptual Quality (0-10), O = Overall = sqrt(SC × PQ). "Intersection" = samples where both EN and CN instructions exist for the same source image, enabling fair cross-lingual comparison. For reference, Step1X-Edit v1.1 reports Q_SC=7.65, Q_PQ=7.41, Q_O=7.35 on GEdit-Bench-EN Overall.
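The Overall column is the geometric mean of the two VIEScore dimensions; it is presumably computed per sample and then averaged, which is why the table's Overall values sit below the square root of the aggregated SC and PQ. A small sketch:

```python
import math
from typing import List, Tuple

def viescore_overall(sc: float, pq: float) -> float:
    """Per-sample Overall = sqrt(SC * PQ), both on a 0-10 scale."""
    return math.sqrt(sc * pq)

def benchmark_overall(samples: List[Tuple[float, float]]) -> float:
    """Benchmark-level Overall: average the per-sample geometric means."""
    return sum(viescore_overall(sc, pq) for sc, pq in samples) / len(samples)

# A sample that fails semantically (SC = 0) zeroes its Overall regardless of PQ,
# pulling the averaged Overall below sqrt(mean SC * mean PQ).
print(benchmark_overall([(8.0, 8.0), (0.0, 8.0)]))  # 4.0
```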
ImgEdit-Bench¶
Singleturn (scored by Qwen2.5-VL-72B, scale 1–5)
| Model | Bg. | Style | Adj. | Ext. | Rem. | Rep. | Add | Cmp. | Act. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepGen | 3.85 | 4.16 | 4.20 | 3.40 | 4.75 | 4.26 | 4.37 | 3.00 | 3.97 | 4.07 |
UGE (Unguided Editing)
| Model | Overall |
|---|---|
| DeepGen | 4.81 |
Multiturn
| Model | Cont. Mem. | Cont. Und. | Ver. Back. | Overall |
|---|---|---|---|---|
| DeepGen | 4.18 | 4.33 | 4.60 | 4.37 |
ImgEdit-Bench evaluates image editing across three suites: Singleturn (9 edit types, 736 samples), UGE (unguided editing, 50 samples), and Multiturn (multi-round editing, 88 samples). All scores use Qwen2.5-VL-72B-Instruct as evaluator (scale 1–5).
Post-Training Models¶
| Model | DPG | GenEval | WISE | UEval | MME (P) | MME (C) | MMMU | MMBench | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| Bagel + recA | running | 83.05 | 0.4225 | 31.0 | 1689.09 | 695.36 | 0.523 | 0.8419 | 66.1 |
| Bagel + recA-ema | — | 78.87 | 0.4056 | 31.0 | — | — | — | — | — |
| Bagel + IRG | running | 72.06 | 0.3842 | 9.1 | 1647.47 | 650.36 | 0.480 | 0.7783 | 40.7 |
| Bagel + UniCot | 65.22(?) | 0.02(rerun) | rerun | — | 1690.67 | 678.21 | 0.531 | 0.8445 | 64.5 |
| Bagel + SFT | running | 78.03 | running | — | 1680.73 | 678.93 | 0.526 | 0.8204 | 61.2 |
| Janus-Pro + SFT | 83.93 | — | — | — | 1549.87 | 292.86 | 0.400 | 0.700 | 33.0 |
| OmniGen2 + SFT | 84.78 | — | — | — | — | — | 0.223 | — | 10.0 |
| BLIP3-o + SFT | running | — | — | — | — | — | — | — | — |
| TokenFlow + SFT | 22.16(?) | — | — | — | — | — | — | — | — |
More results will be added as evaluations complete.
Detailed Sub-scores¶
MME Perception Sub-category Breakdown
| Model | existence | count | position | color | posters | celebrity | scene | landmark | artwork | OCR | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel | — | — | — | — | — | — | — | — | — | — | 1691.45 |
| OmniGen2 | 190.00 | 160.00 | 163.33 | 170.00 | 172.11 | 147.94 | 157.75 | 179.50 | 142.00 | 80.00 | 1562.63 |
| Janus-Pro | — | — | — | — | — | — | — | — | — | — | 1547.93 |
| Show-o2 1.5B | 195.00 | 118.33 | 116.67 | 180.00 | 117.35 | 119.71 | 163.75 | 155.50 | 129.50 | 117.50 | 1413.30 |
| Janus-Flow | 200.00 | 145.00 | 136.67 | 175.00 | 98.64 | 98.82 | 151.25 | 107.75 | 105.00 | 87.50 | 1305.63 |
| Janus | 200.00 | 101.67 | 95.00 | 155.00 | 119.05 | 95.88 | 159.75 | 109.75 | 92.75 | 92.50 | 1221.35 |
| Show-o | 190.00 | 78.33 | 123.33 | 170.00 | 89.46 | 94.41 | 162.50 | 124.50 | 83.50 | 72.50 | 1188.53 |
| Show-o2 | — | — | — | — | — | — | — | — | — | — | 1183.86 |
| Emu3 | — | — | — | — | — | — | — | — | — | — | 1176.03 |
| MMaDA | 178.33 | 58.33 | 76.67 | 143.33 | 55.78 | 51.76 | 142.50 | 98.50 | 83.75 | 50.00 | 938.96 |
| Emu3.5 | 100.00 | 93.33 | 83.33 | 86.67 | 75.85 | 83.24 | 87.50 | 79.00 | 80.75 | 62.50 | 832.17 |
MME Cognition Sub-category Breakdown
| Model | commonsense | numerical | text_translation | code_reasoning | Total |
|---|---|---|---|---|---|
| Bagel | — | — | — | — | 695.36 |
| OmniGen2 | 139.29 | 102.50 | 200.00 | 155.00 | 596.79 |
| Janus-Pro | — | — | — | — | 293.21 |
| Show-o2 1.5B | 94.29 | 70.00 | 80.00 | 47.50 | 291.79 |
| Emu3.5 | 66.43 | 47.50 | 100.00 | 57.50 | 271.43 |
| Janus | 84.29 | 52.50 | 50.00 | 77.50 | 264.29 |
| Janus-Flow | 93.57 | 47.50 | 50.00 | 60.00 | 251.07 |
| Show-o | 92.14 | 37.50 | 57.50 | 57.50 | 244.64 |
| Show-o2 | — | — | — | — | 244.64 |
| MMaDA | 76.43 | 50.00 | 50.00 | 65.00 | 241.43 |
| Emu3 | — | — | — | — | 213.21 |
MathVista Sub-category Breakdown
By Question Type
| Model | Multi-choice (n=540) | Free-form (n=460) | Overall |
|---|---|---|---|
| Bagel | 80.19 | 61.52 | 71.6 |
| Show-o2 | 63.33 | 35.65 | 50.6 |
| Emu3 | 57.59 | 30.00 | 44.9 |
| Janus-Pro | 51.30 | 32.83 | 42.8 |
| Emu3.5 | 41.67 | 17.61 | 30.6 |
| MMaDA | 38.70 | 8.70 | 24.9 |
| OmniGen2 | 0.00 | 0.00 | 0.0† |
| Show-o | 0.00 | 0.00 | 0.0† |
By Task
| Model | Figure QA | Geometry | Math Word | Textbook QA | Visual QA | Overall |
|---|---|---|---|---|---|---|
| Bagel | 70.26 | 83.65 | 79.57 | 69.62 | 53.07 | 71.6 |
| Show-o2 | 45.35 | 64.90 | 48.39 | 58.23 | 37.43 | 50.6 |
| Emu3 | 37.17 | 57.21 | 51.61 | 46.84 | 33.52 | 44.9 |
| Janus-Pro | 31.23 | 43.75 | 56.45 | 50.00 | 38.55 | 42.8 |
| Emu3.5 | 29.74 | 36.54 | 26.34 | 34.81 | 25.70 | 30.6 |
| MMaDA | 22.68 | 35.58 | 13.44 | 26.58 | 26.26 | 24.9 |
By Skill
| Model | Algebraic | Arithmetic | Geometry | Logical | Numeric CS | Scientific | Statistical | Overall |
|---|---|---|---|---|---|---|---|---|
| Bagel | 78.65 | 64.02 | 80.75 | 16.22 | 47.22 | 67.21 | 78.41 | 71.6 |
| Show-o2 | 60.85 | 40.23 | 61.92 | 5.41 | 35.42 | 54.10 | 51.83 | 50.6 |
| Emu3 | 49.82 | 36.54 | 55.23 | 21.62 | 34.03 | 50.82 | 37.87 | 44.9 |
| Janus-Pro | 43.42 | 42.78 | 42.26 | 10.81 | 38.19 | 48.36 | 38.87 | 42.8 |
| Emu3.5 | 32.38 | 20.11 | 35.15 | 13.51 | 29.86 | 38.52 | 27.91 | 30.6 |
| MMaDA | 32.74 | 18.41 | 33.05 | 10.81 | 16.67 | 24.59 | 20.60 | 24.9 |
† OmniGen2 and Show-o produce empty responses on MathVista, so no per-task/skill breakdown is shown.
WISE Sub-category Breakdown
| Model | Culture (n=400) | Time (n=167) | Space (n=133) | Biology (n=100) | Physics (n=100) | Chemistry (n=100) | Overall |
|---|---|---|---|---|---|---|---|
| MMaDA | 0.6502 | 0.6814 | 0.7492 | 0.6620 | 0.7420 | 0.4205 | 0.6560 |
| Emu3.5 | 0.7001 | 0.5683 | 0.6944 | 0.6435 | 0.6085 | 0.4060 | 0.6331 |
| DeepGen | 0.5989 | 0.4955 | 0.6102 | 0.4765 | 0.5515 | 0.4080 | 0.5470 |
| Bagel | 0.3883 | 0.4386 | 0.4714 | 0.3620 | 0.4205 | 0.2940 | 0.3989 |
| BLIP3-o | 0.4028 | 0.4186 | 0.5259 | 0.4025 | 0.4255 | 0.3000 | 0.4138 |
| OmniGen2 | 0.4180 | 0.4042 | 0.4887 | 0.3635 | 0.3875 | 0.2810 | 0.4029 |
| Janus-Pro | 0.3616 | 0.3853 | 0.4789 | 0.3605 | 0.4745 | 0.2485 | 0.3811 |
| Emu3 | 0.3463 | 0.3482 | 0.3711 | 0.3310 | 0.3685 | 0.2130 | 0.3373 |
| Show-o2 | 0.3641 | 0.3497 | 0.4519 | 0.3455 | 0.3690 | 0.2390 | 0.3595 |
| Janus-Flow | 0.2731 | 0.3222 | 0.3947 | 0.3215 | 0.2860 | 0.1905 | 0.2954 |
| TokenFlow | 0.3253 | 0.3626 | 0.3357 | 0.2915 | 0.2605 | 0.1510 | 0.3056 |
| Janus | 0.2080 | 0.2707 | 0.3508 | 0.1705 | 0.1910 | 0.1095 | 0.2222 |
| Bagel + recA | 0.4035 | 0.4147 | 0.5432 | 0.3985 | 0.4630 | 0.3340 | 0.4225 |
| Bagel + recA-ema | 0.3976 | 0.4156 | 0.4917 | 0.3665 | 0.4335 | 0.3170 | 0.4056 |
| Bagel + IRG | 0.3674 | 0.4081 | 0.4650 | 0.3575 | 0.4495 | 0.2655 | 0.3842 |
GEdit-Bench Per-task Breakdown
English — Overall
| Model | bg_change | color | material | motion | ps_human | style | subj-add | subj-rm | subj-repl | text | tone | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel SC | 7.450 | 7.075 | 6.975 | 5.400 | 5.271 | 6.233 | 7.283 | 7.175 | 7.167 | 6.838 | 6.600 | 6.679 |
| Bagel PQ | 7.350 | 7.075 | 6.725 | 7.175 | 7.043 | 5.833 | 7.700 | 7.211 | 7.283 | 7.535 | 6.550 | 7.044 |
| Bagel O | 7.264 | 6.653 | 6.448 | 5.128 | 5.046 | 5.803 | 7.214 | 6.743 | 6.861 | 6.368 | 6.304 | 6.348 |
| DeepGen SC | 7.775 | 7.900 | 7.625 | 7.675 | 7.114 | 5.983 | 7.967 | 7.596 | 7.833 | 7.242 | 7.175 | 7.444 |
| DeepGen PQ | 7.825 | 7.325 | 7.025 | 7.875 | 7.657 | 6.867 | 7.800 | 7.667 | 7.817 | 7.626 | 7.400 | 7.535 |
| DeepGen O | 7.780 | 7.440 | 7.286 | 7.745 | 7.200 | 6.287 | 7.869 | 7.409 | 7.737 | 6.836 | 7.052 | 7.331 |
| OmniGen2 SC | 7.450 | 7.575 | 6.150 | 6.550 | 4.886 | 6.750 | 7.083 | 6.228 | 6.783 | 5.404 | 6.500 | 6.487 |
| OmniGen2 PQ | 7.400 | 6.950 | 6.975 | 7.400 | 7.114 | 6.633 | 7.600 | 7.368 | 7.383 | 7.596 | 6.600 | 7.184 |
| OmniGen2 O | 7.195 | 6.992 | 5.901 | 6.560 | 4.885 | 6.476 | 6.980 | 5.963 | 6.475 | 5.246 | 6.277 | 6.268 |
| Emu3.5 SC | 7.975 | 7.875 | 7.842 | 7.775 | 7.100 | 7.217 | 7.983 | 6.895 | 7.814 | 8.722 | 6.875 | 7.643 |
| Emu3.5 PQ | 7.700 | 7.400 | 6.868 | 7.700 | 7.400 | 7.200 | 7.683 | 7.649 | 7.492 | 7.722 | 7.450 | 7.479 |
| Emu3.5 O | 7.836 | 7.634 | 7.339 | 7.737 | 7.248 | 7.208 | 7.832 | 7.262 | 7.651 | 8.206 | 7.157 | 7.556 |
English — Intersection
| Model | bg_change | color | material | motion | ps_human | style | subj-add | subj-rm | subj-repl | text | tone | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel SC | 7.241 | 7.353 | 7.071 | 4.909 | 5.610 | 6.146 | 7.342 | 7.190 | 7.174 | 6.864 | 7.080 | 6.726 |
| Bagel PQ | 7.172 | 6.882 | 6.714 | 7.227 | 7.195 | 5.833 | 7.658 | 7.238 | 7.217 | 7.519 | 6.640 | 7.027 |
| Bagel O | 7.021 | 6.872 | 6.493 | 4.662 | 5.401 | 5.741 | 7.264 | 6.784 | 6.813 | 6.391 | 6.789 | 6.384 |
| DeepGen SC | 7.862 | 8.059 | 7.607 | 7.682 | 7.537 | 5.771 | 7.974 | 7.881 | 7.891 | 7.136 | 7.320 | 7.520 |
| DeepGen PQ | 7.828 | 7.176 | 7.143 | 8.000 | 7.805 | 6.812 | 7.711 | 7.857 | 7.870 | 7.642 | 7.680 | 7.593 |
| DeepGen O | 7.839 | 7.548 | 7.351 | 7.807 | 7.531 | 6.134 | 7.823 | 7.753 | 7.782 | 6.768 | 7.323 | 7.423 |
| OmniGen2 SC | 7.276 | 7.941 | 6.071 | 6.045 | 5.122 | 6.562 | 6.763 | 6.310 | 6.783 | 5.531 | 6.720 | 6.466 |
| OmniGen2 PQ | 7.276 | 6.765 | 7.107 | 7.773 | 7.415 | 6.750 | 7.500 | 7.476 | 7.522 | 7.593 | 6.680 | 7.260 |
| OmniGen2 O | 6.960 | 7.271 | 5.843 | 6.187 | 5.245 | 6.453 | 6.656 | 6.049 | 6.541 | 5.387 | 6.500 | 6.281 |
| Emu3.5 SC | 7.975 | 7.875 | 7.842 | 7.775 | 7.100 | 7.217 | 7.983 | 6.895 | 7.814 | 8.722 | 6.875 | 7.643 |
| Emu3.5 PQ | 7.700 | 7.400 | 6.868 | 7.700 | 7.400 | 7.200 | 7.683 | 7.649 | 7.492 | 7.722 | 7.450 | 7.479 |
| Emu3.5 O | 7.836 | 7.634 | 7.339 | 7.737 | 7.248 | 7.208 | 7.832 | 7.262 | 7.651 | 8.206 | 7.157 | 7.556 |
Chinese — Overall
| Model | bg_change | color | material | motion | ps_human | style | subj-add | subj-rm | subj-repl | text | tone | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel SC | 7.625 | 7.400 | 6.675 | 6.650 | 5.457 | 6.533 | 7.267 | 6.509 | 7.200 | 6.990 | 6.850 | 6.832 |
| Bagel PQ | 7.200 | 6.975 | 6.850 | 6.900 | 7.143 | 6.300 | 7.517 | 7.386 | 7.167 | 7.576 | 6.675 | 7.063 |
| Bagel O | 7.247 | 6.957 | 6.489 | 6.222 | 5.305 | 6.213 | 7.096 | 6.308 | 6.806 | 6.692 | 6.426 | 6.524 |
| DeepGen SC | 7.300 | 7.575 | 7.750 | 7.400 | 6.986 | 6.200 | 7.950 | 7.772 | 7.950 | 7.313 | 7.350 | 7.413 |
| DeepGen PQ | 7.700 | 7.325 | 7.325 | 7.775 | 7.586 | 7.100 | 7.850 | 7.772 | 7.633 | 7.646 | 7.825 | 7.594 |
| DeepGen O | 7.437 | 7.199 | 7.509 | 7.468 | 7.169 | 6.495 | 7.874 | 7.686 | 7.670 | 6.965 | 7.480 | 7.359 |
| OmniGen2 SC | 7.400 | 7.550 | 5.175 | 7.000 | 3.900 | 6.717 | 7.150 | 5.632 | 6.567 | 5.838 | 5.825 | 6.250 |
| OmniGen2 PQ | 7.600 | 7.050 | 6.850 | 7.475 | 6.986 | 6.733 | 7.567 | 7.474 | 7.300 | 7.434 | 6.525 | 7.181 |
| OmniGen2 O | 7.285 | 7.027 | 4.859 | 6.900 | 3.833 | 6.528 | 7.042 | 5.619 | 6.379 | 5.432 | 5.425 | 6.030 |
| Emu3.5 SC | 7.800 | 7.850 | 7.711 | 7.750 | 6.914 | 7.400 | 8.000 | 7.491 | 8.000 | 8.526 | 6.350 | 7.617 |
| Emu3.5 PQ | 7.625 | 7.400 | 7.132 | 7.625 | 7.371 | 7.333 | 7.767 | 7.667 | 7.407 | 7.722 | 7.475 | 7.502 |
| Emu3.5 O | 7.712 | 7.622 | 7.415 | 7.687 | 7.139 | 7.367 | 7.882 | 7.578 | 7.698 | 8.114 | 6.890 | 7.555 |
Chinese — Intersection
| Model | bg_change | color | material | motion | ps_human | style | subj-add | subj-rm | subj-repl | text | tone | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagel SC | 7.500 | 7.514 | 6.759 | 6.542 | 6.175 | 6.233 | 7.310 | 6.361 | 7.353 | 7.232 | 7.500 | 6.952 |
| Bagel PQ | 7.125 | 6.886 | 6.828 | 7.125 | 7.150 | 6.233 | 7.595 | 7.583 | 7.118 | 7.573 | 6.875 | 7.099 |
| Bagel O | 7.057 | 7.068 | 6.568 | 6.247 | 6.089 | 5.970 | 7.160 | 6.277 | 6.950 | 6.907 | 7.151 | 6.677 |
| DeepGen SC | 7.125 | 7.743 | 7.724 | 7.458 | 6.825 | 6.186 | 7.976 | 7.667 | 8.118 | 7.366 | 7.500 | 7.426 |
| DeepGen PQ | 7.792 | 7.314 | 7.207 | 7.708 | 7.525 | 7.000 | 7.929 | 7.806 | 7.647 | 7.659 | 7.875 | 7.587 |
| DeepGen O | 7.386 | 7.360 | 7.430 | 7.552 | 6.991 | 6.423 | 7.940 | 7.605 | 7.822 | 7.045 | 7.681 | 7.385 |
| OmniGen2 SC | 7.292 | 7.943 | 5.276 | 7.083 | 3.700 | 6.558 | 7.024 | 5.500 | 6.804 | 6.024 | 6.062 | 6.297 |
| OmniGen2 PQ | 7.667 | 7.000 | 6.862 | 7.375 | 6.950 | 6.628 | 7.762 | 7.528 | 7.275 | 7.415 | 6.938 | 7.218 |
| OmniGen2 O | 7.138 | 7.390 | 5.042 | 6.840 | 3.572 | 6.410 | 6.989 | 5.485 | 6.549 | 5.609 | 5.714 | 6.067 |
| Emu3.5 SC | 7.800 | 7.850 | 7.711 | 7.750 | 6.914 | 7.400 | 8.000 | 7.491 | 8.000 | 8.526 | 6.350 | 7.617 |
| Emu3.5 PQ | 7.625 | 7.400 | 7.132 | 7.625 | 7.371 | 7.333 | 7.767 | 7.667 | 7.407 | 7.722 | 7.475 | 7.502 |
| Emu3.5 O | 7.712 | 7.622 | 7.415 | 7.687 | 7.139 | 7.367 | 7.882 | 7.578 | 7.698 | 8.114 | 6.890 | 7.555 |
GenEval Sub-category Breakdown
| Model | Single Obj (n=320) | Two Obj (n=396) | Counting (n=320) | Colors (n=376) | Position (n=400) | Color Attr (n=400) | Overall |
|---|---|---|---|---|---|---|---|
| DeepGen | 98.75 | 98.99 | 81.25 | 92.55 | 75.00 | 73.00 | 86.59 |
| Bagel | 99.38 | 94.19 | 78.75 | 87.77 | 51.00 | 61.75 | 78.81 |
| BLIP3-o | 98.12 | 93.18 | 73.44 | 86.17 | 72.75 | 64.50 | 81.36 |
| Janus-Pro | 97.81 | 86.62 | 57.50 | 89.36 | 76.00 | 66.25 | 78.92 |
| OmniGen2 | 99.69 | 93.94 | 68.75 | 88.03 | 53.25 | 67.50 | 78.53 |
| Show-o2 | 97.81 | 71.46 | 48.75 | 78.46 | 20.00 | 42.75 | 59.87 |
| TokenFlow | 97.19 | 59.60 | 37.81 | 86.17 | 17.25 | 15.25 | 52.21 |
| Janus-Flow | 94.25 | 46.06 | 27.75 | 74.68 | 32.20 | 25.00 | 49.99 |
| Janus | 85.62 | 37.63 | 18.75 | 53.46 | 17.50 | 27.25 | 40.04 |
| Bagel + recA | 99.38 | 94.44 | 79.38 | 89.10 | 61.75 | 74.25 | 83.05 |
| Bagel + recA-ema | 99.06 | 95.71 | 78.12 | 84.84 | 53.25 | 62.25 | 78.87 |
| Bagel + SFT | 99.06 | 92.42 | 77.50 | 86.97 | 49.75 | 62.50 | 78.03 |
| Bagel + IRG | 98.44 | 87.37 | 70.31 | 78.72 | 40.50 | 57.00 | 72.06 |
MMMU Sub-category Breakdown
| Model | Art & Design | Business | Science | Health & Medicine | Humanities & Social Sci | Tech & Engineering | Overall |
|---|---|---|---|---|---|---|---|
| Bagel | — | — | — | — | — | — | 0.519 |
| Show-o2 | 0.617 | 0.427 | 0.373 | 0.473 | 0.692 | 0.395 | 0.479 |
| OmniGen2 | 0.558 | 0.387 | 0.340 | 0.533 | 0.700 | 0.352 | 0.460 |
| Janus-Pro | — | — | — | — | — | — | 0.407 |
| Show-o2 1.5B | 0.442 | 0.327 | 0.287 | 0.393 | 0.517 | 0.324 | 0.371 |
| Emu3 | — | — | — | — | — | — | 0.314 |
| Emu3.5 | 0.350 | 0.240 | 0.340 | 0.320 | 0.242 | 0.271 | 0.292 |
| Janus-Flow | 0.392 | 0.233 | 0.233 | 0.333 | 0.358 | 0.243 | 0.290 |
| MMaDA | 0.292 | 0.347 | 0.213 | 0.313 | 0.333 | 0.257 | 0.289 |
| Janus | 0.325 | 0.180 | 0.267 | 0.273 | 0.300 | 0.300 | 0.273 |
| Show-o | 0.275 | 0.253 | 0.200 | 0.260 | 0.375 | 0.238 | 0.261 |
MM-Vet Sub-score Breakdown
| Model | scorer | rec | ocr | know | gen | spat | math | Total |
|---|---|---|---|---|---|---|---|---|
| Bagel + recA | gpt-4.1 | 58.9 | 78.3 | 45.6 | 47.1 | 78.8 | 76.2 | 66.1 |
| Bagel + UniCot | gpt-4.1 | 58.7 | 75.7 | 44.5 | 46.7 | 76.9 | 66.5 | 64.5 |
| Bagel + SFT | gpt-4.1 | 54.5 | 74.4 | 41.1 | 47.2 | 71.5 | 80.4 | 61.2 |
| Bagel + IRG | gpt-4.1 | 38.1 | 42.5 | 20.5 | 24.9 | 42.7 | 35.8 | 40.7 |
| Janus-Pro + SFT | gpt-4.1 | 33.6 | 27.8 | 14.9 | 10.6 | 34.8 | 11.5 | 33.0 |
| OmniGen2 + SFT | gpt-4.1 | 13.0 | 6.8 | 23.2 | 25.5 | 1.9 | 5.4 | 10.0 |
| MMaDA | gpt-4.1 | 15.1 | 4.2 | 3.6 | 2.9 | 9.7 | 3.8 | 11.4 |
| Emu3.5 | gpt-4.1 | 26.7 | 31.4 | 26.7 | 28.6 | 27.1 | 26.9 | 28.0 |
| OmniGen2 | gpt-4.1 | 40.6 | 42.6 | 20.6 | 16.1 | 47.9 | 22.7 | 42.3 |
| Show-o | gpt-4.1 | 29.0 | 12.1 | 13.5 | 11.4 | 16.0 | 2.3 | 23.3 |
UEval Sub-category Breakdown
| Model | Space | Textbook | Diagram | Paper | Art | Life | Tech | Exercise | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Bagel | 34.9 | 39.7 | 35.4 | 21.8 | 34.9 | 30.6 | 26.4 | 23.8 | 30.9 |
| Janus-Pro | 21.5 | 26.0 | 32.4 | 14.2 | 19.9 | 20.8 | 16.8 | 13.0 | 20.6 |
| Bagel + recA-ema | 37.3 | 39.8 | 35.6 | 21.1 | 34.6 | 31.3 | 26.2 | 22.1 | 31.0 |
| Bagel + IRG | 14.3 | 0.4 | 17.8 | 7.1 | 20.8 | 7.4 | 0.3 | 4.8 | 9.1 |
| Show-o2 | 21.4 | 24.8 | 20.3 | 12.1 | 13.8 | 11.0 | 6.6 | 6.0 | 15.0 |
Disclaimers¶
- Unofficial results. All evaluation results in this repository are independently reproduced by the TorchUMM team. They do not represent official results from the original model authors.
- Active development. TorchUMM is under active development. Some results may be updated as evaluation pipelines are refined.
- Contributions welcome. If you find discrepancies or want to add support for a new model or benchmark, please open an issue or pull request.