DeepGen¶

Unified multimodal model (5B: 3B VLM + 2B DiT) supporting text-to-image generation and image editing via a diffusers-compatible pipeline.

Original repository: https://github.com/deepgenteam/DeepGen
Backbone key: deepgen
Capabilities: Generation, Editing

Dependencies¶

The model environment is managed via the deepgen image defined in modal/images.py (Python 3.10, torch 2.8.0, diffusers 0.35.2). For local setup, install the dependencies listed in model/deepgen/requirements.txt.

Flash Attention (recommended)¶

DeepGen benefits from Flash Attention for faster inference. The Modal image already includes it. For local setup, install a pre-compiled wheel matching your environment — see modal/README.md for the exact environment parameters and installation instructions.

Inference¶

CLI¶

# Generation (text-to-image)
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/deepgen_generation.yaml

# Editing (image-to-image)
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/deepgen_editing.yaml

Python API¶

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(backbone_name="deepgen", backbone_cfg={
    "model_path": "/model_cache/deepgen/DeepGen-1.0-diffusers",
    "seed": 42,
})

# Generation
result = pipeline.run(InferenceRequest(
    backbone="deepgen", task="generation",
    prompt="A cat sitting on a rainbow",
    params={"num_inference_steps": 50, "guidance_scale": 4.0},
))

# Editing
result = pipeline.run(InferenceRequest(
    backbone="deepgen", task="editing",
    prompt="Change the background to a beach",
    images=["path/to/image.jpg"],
    params={"num_inference_steps": 50, "guidance_scale": 4.0},
))

Note: DeepGen does NOT support image understanding (VQA). Although its architecture includes a 3B VLM (Qwen2.5 VL-3B), this component is used internally to provide semantic guidance for generation via Stacked Channel Bridging (SCB) — it does not expose a standalone understanding interface. As a result, DeepGen cannot run benchmarks that require understanding capabilities (e.g., Uni-MMMU, UEval, MME, MMMU).

Supported Benchmarks¶

Benchmark	Config
GenEval	`configs/eval/geneval/modal_geneval_deepgen.yaml`
DPG-Bench	`configs/eval/dpg_bench/modal_dpg_bench_deepgen.yaml`
WISE	`configs/eval/wise/modal_wise_deepgen.yaml`
GEdit-Bench	`configs/eval/gedit/modal_gedit_deepgen.yaml`

Not supported: Uni-MMMU, UEval, and all VLM understanding benchmarks — these require image understanding capabilities that DeepGen does not provide.

# Example: run GenEval on Modal
modal run modal/run.py --model deepgen --eval-config modal_geneval_deepgen

# Example: run WISE on Modal
modal run modal/run.py --model deepgen --eval-config modal_wise_deepgen

Key Configuration Parameters¶

All evaluation parameters follow the official DeepGen repo (EVAL.md).

Parameter	Default	Notes
`height` / `width`	512	All benchmarks use 512x512 per official EVAL.md
`num_inference_steps`	50
`guidance_scale`	4.0	Exception: DPG-Bench uses 7.5 (per official `dpg_bench.py`)
`seed`	42
`negative_prompt`	(see adapter)	Editing only