Getting Started

This guide walks through installation, basic inference, evaluation, and post-training with TorchUMM.


Installation

# Clone the repository (including model submodules)
git clone --recursive https://github.com/AIFrontierLab/TorchUMM/
cd umm_codebase

# Install the core package
pip install -e .

Per-model dependencies

Each backbone model has its own Python and PyTorch version requirements. Install only the dependencies for the model(s) you plan to use:

# Example: install Bagel dependencies
pip install -r model/Bagel/requirements.txt

For cloud execution via Modal, each model runs in an isolated container image with the correct environment, so local dependency conflicts cannot arise.


CLI Usage

Inference

# Text-to-image generation with Bagel
PYTHONPATH=src python -m umm.cli.main infer \
    --config configs/inference/modal_bagel_generation.yaml

# Image understanding with Janus-Pro
PYTHONPATH=src python -m umm.cli.main infer \
    --config configs/inference/janus_pro_understanding.yaml

Evaluation

# Single-stage benchmark (DPG Bench on Bagel)
PYTHONPATH=src python -m umm.cli.main eval \
    --config configs/eval/dpg_bench/dpg_bench_bagel.yaml

# Two-stage benchmark (GenEval on Bagel: generation, then scoring)
PYTHONPATH=src python -m umm.cli.main eval \
    --config configs/eval/geneval/geneval_bagel.yaml
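The difference between the two benchmark styles can be pictured with a toy sketch. Everything in it is hypothetical: the `generate` and `score` callables, the `EvalResult` type, and the stub scoring rule are illustrative assumptions, not TorchUMM code. It only shows the structure implied above, where a two-stage benchmark such as GenEval first generates an output for every prompt and then scores the saved outputs in a separate pass.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    # Hypothetical result container: one score per prompt
    scores: Dict[str, float]

    @property
    def mean(self) -> float:
        # Average score across prompts (0.0 when nothing was scored)
        return sum(self.scores.values()) / len(self.scores) if self.scores else 0.0


def run_two_stage(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> EvalResult:
    # Stage 1: generate an output (e.g. an image path) for every prompt
    outputs = {p: generate(p) for p in prompts}
    # Stage 2: score each saved output against its prompt, only after generation is done
    return EvalResult(scores={p: score(p, out) for p, out in outputs.items()})


# Toy usage with stub callables standing in for a model and a scorer
result = run_two_stage(
    ["a red cube", "two blue spheres"],
    generate=lambda p: f"outputs/{p.replace(' ', '_')}.png",
    score=lambda p, path: 1.0 if "cube" in p else 0.5,
)
print(result.mean)  # 0.75
```

Keeping the stages separate means the expensive generation pass can be run once and rescored later without regenerating images.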

Post-Training

# SFT on Bagel
PYTHONPATH=src python -m umm.cli.main train \
    --config configs/posttrain/bagel_sft.yaml
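All of the CLI commands above share one shape: set `PYTHONPATH=src`, call `python -m umm.cli.main` with a subcommand, and pass `--config`. If you want to launch them from a Python script, a minimal sketch could assemble the same invocation with the standard library. The helper name `build_umm_command` is hypothetical; only the module path, the `infer`/`eval`/`train` subcommands, and the `--config` flag come from this guide.

```python
import os
import sys
from typing import Dict, List, Tuple

# Subcommands shown in this guide
SUBCOMMANDS = {"infer", "eval", "train"}


def build_umm_command(subcommand: str, config: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) equivalent to `PYTHONPATH=src python -m umm.cli.main <sub> --config <cfg>`."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    # sys.executable pins the interpreter of the currently active environment
    argv = [sys.executable, "-m", "umm.cli.main", subcommand, "--config", config]
    env = dict(os.environ)
    env["PYTHONPATH"] = "src"  # same effect as the PYTHONPATH=src prefix in the shell commands
    return argv, env


argv, env = build_umm_command("eval", "configs/eval/dpg_bench/dpg_bench_bagel.yaml")
print(argv[1:])
```

To actually launch the command, pass the result to `subprocess.run(argv, env=env, check=True)` from the repository root, since the relative `src` path is resolved against the working directory. Using `sys.executable` rather than a bare `python` matters when each backbone lives in its own environment.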

Python API

TorchUMM exposes a programmatic interface through InferencePipeline and InferenceRequest.

from umm.inference.pipeline import InferencePipeline
from umm.inference.multimodal_inputs import InferenceRequest

# Initialize the pipeline with a backbone model
pipeline = InferencePipeline(
    backbone_name="bagel",
    backbone_cfg={
        "model_path": "/path/to/BAGEL-7B-MoT",
        "max_mem_per_gpu": "80GiB",
        "seed": 42,
    },
)

# Text-to-image generation
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="generation",
    prompt="A cat sitting on a rainbow",
    params={"num_timesteps": 50},
))

# Image understanding
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="understanding",
    prompt="Describe this image in detail.",
    images=["path/to/image.jpg"],
    params={"max_think_token_n": 500, "do_sample": False},
))

# Image editing
result = pipeline.run(InferenceRequest(
    backbone="bagel",
    task="editing",
    prompt="Make the sky purple",
    images=["path/to/image.jpg"],
    params={"num_timesteps": 25},
))

# Batch inference: request1..request3 are InferenceRequest objects built as above
results = pipeline.run_many(
    [request1, request2, request3],
    batch_size=2,
)

InferenceRequest Fields

| Field | Type | Description |
|---|---|---|
| backbone | str | Backbone model name (must match the pipeline) |
| task | str | "generation", "understanding", or "editing" |
| prompt | str | Text prompt |
| images | list[str] | Input image paths (for understanding/editing) |
| videos | list[str] | Input video paths |
| params | dict | Task-specific parameters |
| output_path | str | Path to save output |
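To make the field list concrete, here is a minimal stand-in dataclass that mirrors those fields and adds a few sanity checks. It is not TorchUMM's `InferenceRequest` class: the name `RequestSketch`, the default values, and the rules that editing needs an input image and understanding needs an image or video are assumptions drawn from the field descriptions, not from the library source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Task names listed in the InferenceRequest field table
VALID_TASKS = ("generation", "understanding", "editing")


@dataclass
class RequestSketch:
    """Hypothetical mirror of the InferenceRequest fields, for illustration only."""
    backbone: str
    task: str
    prompt: str
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    params: Dict[str, Any] = field(default_factory=dict)
    output_path: Optional[str] = None

    def validate(self, pipeline_backbone: str) -> None:
        # The backbone must match the one the pipeline was built with
        if self.backbone != pipeline_backbone:
            raise ValueError(f"backbone {self.backbone!r} != pipeline {pipeline_backbone!r}")
        if self.task not in VALID_TASKS:
            raise ValueError(f"task must be one of {VALID_TASKS}, got {self.task!r}")
        # Assumed rules: editing needs a source image; understanding needs visual input
        if self.task == "editing" and not self.images:
            raise ValueError("editing requests need at least one input image")
        if self.task == "understanding" and not (self.images or self.videos):
            raise ValueError("understanding requests need an input image or video")


req = RequestSketch(
    backbone="bagel",
    task="editing",
    prompt="Make the sky purple",
    images=["path/to/image.jpg"],
    params={"num_timesteps": 25},
)
req.validate("bagel")  # passes without raising
```

Catching a mismatched backbone or a missing image before the request reaches the model is cheaper than discovering it after weights are loaded.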

Cloud Execution

TorchUMM integrates with Modal for cloud GPU execution. This handles environment isolation, model weight caching, and GPU scaling automatically.

# Download model weights to cloud storage (one-time)
modal run modal/download.py --model bagel

# Run evaluation on cloud GPU
modal run modal/run.py --model bagel --eval-config modal_dpg_bench_bagel

See the Cloud (Modal) page for setup instructions and the full command reference.


AMD HPC Execution

For AMD ROCm clusters, use the amd_-prefixed eval configs with the local runner:

# Setup environment (one-time)
bash scripts/amd_migration/setup_all_envs.sh bagel

# Run evaluation
bash scripts/amd_migration/local_run.sh bagel --eval-config amd_ueval_bagel

Config naming: modal_*.yaml for cloud (Modal), amd_*.yaml for AMD HPC, and unprefixed *.yaml for legacy local runs. To regenerate the AMD configs, run python scripts/generate_amd_configs.py.
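The naming convention can be expressed as a small prefix lookup. This sketch only encodes the three patterns stated above; the function name `config_target` is hypothetical and is not part of TorchUMM. It accepts either a file path or a bare config name such as the ones passed to `--eval-config`.

```python
from pathlib import PurePath

# Prefix -> execution target, per the config naming convention above
PREFIX_TARGETS = {
    "modal_": "cloud (Modal)",
    "amd_": "AMD HPC",
}


def config_target(path: str) -> str:
    """Classify a config by the prefix of its file name."""
    name = PurePath(path).name  # drop any directory components
    for prefix, target in PREFIX_TARGETS.items():
        if name.startswith(prefix):
            return target
    # No recognised prefix: treat it as a legacy local config
    return "legacy local"


print(config_target("modal_dpg_bench_bagel"))  # cloud (Modal)
```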