
Cloud (Modal)

TorchUMM integrates with Modal to run inference, evaluation, and training on cloud GPUs. Each model runs in an isolated container image with the correct Python, PyTorch, and dependency versions, so there are no local environment conflicts.


Setup

# Install Modal
pip install modal

# Log in (one-time)
modal setup

Create the required secrets, either on the Modal Dashboard or with the CLI:

modal secret create huggingface-secret HF_TOKEN=hf_xxx
modal secret create wandb-secret WANDB_API_KEY=xxx          # for training
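
Modal exposes the key-value pairs of an attached secret as environment variables inside the container, so a job can check for them before doing any expensive work. The following is a minimal sketch of such a check, not code from TorchUMM: the helper name `missing_secret_keys` is hypothetical, and the assumption that training needs both keys is ours.

```python
import os
from typing import List, Mapping, Sequence

# Key names taken from the `modal secret create` commands above.
INFERENCE_KEYS = ("HF_TOKEN",)
# Assumption: training needs the Hugging Face token as well as the W&B key.
TRAINING_KEYS = ("HF_TOKEN", "WANDB_API_KEY")


def missing_secret_keys(
    required: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the required keys that are unset or empty in `env`.

    Hypothetical helper for illustration only.
    """
    return [key for key in required if not env.get(key)]
```

Calling `missing_secret_keys(TRAINING_KEYS)` at the start of a job and stopping if the result is non-empty gives a clear error message instead of a late authentication failure.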

No external APIs needed for scoring

UEval and WISE scoring both use local Qwen models as the evaluator, so no Gemini or OpenAI API keys are required. Download the evaluator weights once before running any scoring step:

modal run modal/download.py --model evaluator

Key Commands

Download Model Weights

# Download a specific model (one-time per model)
modal run modal/download.py --model bagel

# Download a dataset
modal run modal/download.py --dataset ueval

# Verify what is cached
modal run modal/download.py --ls
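
When several models and datasets are needed, the download commands above can be generated rather than typed one by one. The sketch below only builds and prints the command lines; it uses the `--model`, `--dataset`, and `--ls` flags shown above and nothing else. The function name `download_commands` is hypothetical.

```python
import shlex
from typing import Iterable, List

DOWNLOAD_SCRIPT = "modal/download.py"


def download_commands(
    models: Iterable[str] = (), datasets: Iterable[str] = ()
) -> List[List[str]]:
    """Build one `modal run` command per model and dataset, then a final listing.

    Hypothetical helper for illustration only.
    """
    commands = [["modal", "run", DOWNLOAD_SCRIPT, "--model", m] for m in models]
    commands += [["modal", "run", DOWNLOAD_SCRIPT, "--dataset", d] for d in datasets]
    # Finish by listing the cache so the result can be checked.
    commands.append(["modal", "run", DOWNLOAD_SCRIPT, "--ls"])
    return commands


# Print the commands instead of running them.
for cmd in download_commands(models=["bagel", "evaluator"], datasets=["ueval"]):
    print(shlex.join(cmd))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`.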

Run Evaluation

# Run an evaluation benchmark
modal run modal/run.py --model bagel --eval-config modal_dpg_bench_bagel

# Specify GPU type
modal run modal/run.py --model omnigen2 --gpu H100

# Multi-GPU
modal run modal/run.py --model wise --eval-config modal_score_wise_bagel --gpu A100-80GB:2
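
The `--gpu` values above have the shape `TYPE` or `TYPE:COUNT`. How `run.py` handles this string is not shown here, so the following is only a sketch of how such a value could be split into a GPU type and a count; `GpuSpec` and `parse_gpu_spec` are hypothetical names.

```python
from typing import NamedTuple


class GpuSpec(NamedTuple):
    gpu_type: str
    count: int


def parse_gpu_spec(spec: str) -> GpuSpec:
    """Split a value such as "H100" or "A100-80GB:2" into type and count.

    Hypothetical helper for illustration only. A missing count means one GPU.
    """
    gpu_type, sep, count_text = spec.partition(":")
    if not gpu_type:
        raise ValueError(f"GPU type is missing in {spec!r}")
    # int() raises ValueError if the text after ":" is not a number.
    count = int(count_text) if sep else 1
    if count < 1:
        raise ValueError(f"GPU count must be at least 1 in {spec!r}")
    return GpuSpec(gpu_type, count)
```

For example, `parse_gpu_spec("A100-80GB:2")` yields a type of `A100-80GB` and a count of 2, while `parse_gpu_spec("H100")` yields a count of 1.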

Run Inference

modal run modal/run.py --model bagel --script inferencer.py

Run Training

modal run modal/train.py --config bagel_sft

Sync Code

After modifying code locally, sync it to the cloud volume:

modal run modal/download.py --sync

Minimal sync

For small changes, consider syncing individual files rather than the entire codebase. Use modal run modal/download.py --sync only when many files have changed.


Architecture

Modal infrastructure is organized in modal/:

| File | Purpose |
| --- | --- |
| config.py | Constants: volume names, HF model/dataset IDs, paths |
| volumes.py | Persistent volume definitions (models, datasets, checkpoints, outputs, codebase) |
| images.py | Container image definitions per backbone model |
| download.py | Download model weights and datasets to volumes |
| run.py | Unified entry point for inference and evaluation |
| train.py | Post-training entry point |

Volumes

| Volume | Name | Container Mount Path |
| --- | --- | --- |
| Codebase | umm-codebase | /workspace |
| Model weights | umm-model-cache | /model_cache |
| Post-train weights | umm-post-train-model-cache | /post_train_model_cache |
| Datasets | umm-datasets-cache | /datasets |
| Checkpoints | umm-checkpoints | /checkpoints |
| Outputs | umm-outputs | /outputs |

Path convention

Eval config files use container mount paths (e.g., /outputs/ueval/bagel), not volume names or local paths.
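
To find which volume a config path lives on, the mount table from the Volumes section can be applied in reverse. The sketch below copies that table into a dictionary and maps a container path to a volume name plus a path inside the volume. It is not part of TorchUMM, and `resolve_volume` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Mount path -> volume name, copied from the Volumes table.
VOLUME_MOUNTS = {
    "/workspace": "umm-codebase",
    "/model_cache": "umm-model-cache",
    "/post_train_model_cache": "umm-post-train-model-cache",
    "/datasets": "umm-datasets-cache",
    "/checkpoints": "umm-checkpoints",
    "/outputs": "umm-outputs",
}


def resolve_volume(container_path: str) -> Tuple[str, str]:
    """Map a container path to (volume name, path relative to the volume root).

    Hypothetical helper for illustration only. Raises ValueError for paths
    that are not under any known mount, such as local or relative paths.
    """
    path = PurePosixPath(container_path)
    for mount, volume in VOLUME_MOUNTS.items():
        try:
            # relative_to compares whole path components, not string prefixes.
            relative = path.relative_to(mount)
        except ValueError:
            continue
        return volume, str(relative)
    raise ValueError(f"{container_path!r} is not under a known mount path")
```

For the example path above, `resolve_volume("/outputs/ueval/bagel")` returns `("umm-outputs", "ueval/bagel")`, which is the volume name and sub-path to use when inspecting results outside the container.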