Bagel (BAGEL-7B-MoT)¶

Mixture-of-Transformer multimodal model supporting understanding, generation, and editing.

Original repository: https://github.com/bytedance-seed/BAGEL
Backbone key: bagel
Model weights: BAGEL-7B-MoT (HuggingFace)
Capabilities: Understanding, Generation, Editing

Dependencies¶

The model environment is managed via the bagel image defined in modal/images.py. For local setup, install the dependencies listed in model/Bagel/requirements.txt. The config expects bagel_root to point to model/Bagel.

Flash Attention (required)¶

Bagel requires Flash Attention (v2.5.8). The Modal image already includes it. For local setup, install a pre-compiled wheel matching your environment — see modal/README.md for the exact environment parameters and installation instructions.

Inference¶

CLI¶

The inference configs below are pre-configured for Modal (cloud) paths. For local execution, copy the config and adjust model_path and bagel_root to your local paths.

# Generation
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_bagel_generation.yaml

# Understanding
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_bagel_understanding.yaml

# Editing
PYTHONPATH=src python -m umm.cli.main infer --config configs/inference/modal_bagel_editing.yaml

Python API¶

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

pipeline = InferencePipeline(backbone_name="bagel", backbone_cfg={
    "model_path": "/path/to/BAGEL-7B-MoT",
    "bagel_root": "/path/to/model/Bagel",
    "max_mem_per_gpu": "80GiB",
    "seed": 42,
})

# Generation
result = pipeline.run(InferenceRequest(
    backbone="bagel", task="generation",
    prompt="A cat sitting on a rainbow",
    params={"cfg_text_scale": 7.5, "cfg_img_scale": 1.5, "num_timesteps": 50},
))

# Understanding
result = pipeline.run(InferenceRequest(
    backbone="bagel", task="understanding",
    prompt="Describe this image",
    images=["path/to/image.jpg"],
    params={"max_think_token_n": 500, "do_sample": False},
))

# Editing
result = pipeline.run(InferenceRequest(
    backbone="bagel", task="editing",
    prompt="Make the sky purple",
    images=["path/to/image.jpg"],
    params={"cfg_text_scale": 7.5, "cfg_img_scale": 1.5, "num_timesteps": 50},
))

Supported Benchmarks¶

Benchmark	Config
DPG Bench	`configs/eval/dpg_bench/dpg_bench_bagel.yaml`
GenEval	`configs/eval/geneval/geneval_bagel.yaml`
WISE	`configs/eval/wise/wise_bagel.yaml`
GEdit-Bench	`configs/eval/gedit/modal_gedit_bagel.yaml`
UEval	`configs/eval/ueval/ueval_bagel.yaml`
Uni-MMMU	`configs/eval/uni_mmmu/uni_mmmu_bagel.yaml`
MME	`configs/eval/mme/mme_bagel.yaml`
MMMU	`configs/eval/mmmu/mmmu_bagel.yaml`
MMBench	`configs/eval/mmbench/mmbench_bagel.yaml`
MM-Vet	`configs/eval/mmvet/mmvet_bagel.yaml`
MathVista	`configs/eval/mathvista/mathvista_bagel.yaml`

# Example: run GenEval (two-stage, handled automatically)
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/geneval/geneval_bagel.yaml

# Example: run MME
PYTHONPATH=src python -m umm.cli.main eval --config configs/eval/mme/mme_bagel.yaml

Key Configuration Parameters¶

Generation / Editing: cfg_text_scale, cfg_img_scale, num_timesteps
Understanding: max_think_token_n, do_sample
Post-training: SFT, recA, IRG, UniCot (see configs/posttrain/)