BAGEL 7B MoT: The Open Multimodal Model Outperforming Qwen2.5-VL

May 22. Published in Multimodal Models

BAGEL is an open-source multimodal foundation model with 7 billion active parameters out of 14 billion total. It was trained on a large interleaved dataset spanning text, images, video, and web content. On standard multimodal understanding benchmarks, BAGEL edges out top-tier open vision-language models such as Qwen2.5-VL and InternVL-2.5 on most metrics. In the comparison reported below, it leads Qwen2.5-VL-7B on MME, MMBench, MM-Vet, and MathVista, but trails it on MMMU.

Its text-to-image capabilities are on par with dedicated generators like Stable Diffusion 3 (SD3). In traditional image editing tasks, BAGEL's qualitative results exceed those of leading open models. Furthermore, the model handles free-form visual manipulation, multi-view synthesis, and navigation tasks. Together, these abilities suggest a move toward "world modeling"—a scope that extends far beyond the reach of earlier editing-focused models.

Architecture and Training

BAGEL utilizes a Mixture-of-Transformers (MoT) design, which allows the model to efficiently process signals from diverse multimodal sources. Guided by a capacity-maximization principle, BAGEL employs two separate encoders: one captures granular pixel-level details, while the other extracts high-level semantic features. The architecture is built on a "next-group-of-tokens" prediction framework, where the model learns to forecast compressed targets for both linguistic and visual tokens.
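To make the Mixture-of-Transformers idea more concrete, here is a toy sketch in plain Python. It is not BAGEL's implementation. It assumes a two-expert layout with hard routing by token type and a single attention step shared across the whole mixed sequence. The names `Expert`, `shared_attention`, and `mot_layer` are hypothetical, and the "experts" are reduced to a scale-and-bias so the routing logic stays visible.

```python
import math
from dataclasses import dataclass

Vector = list[float]


@dataclass
class Expert:
    """Toy stand-in for one modality-specific set of feed-forward weights."""
    scale: float
    bias: float

    def feed_forward(self, x: Vector) -> Vector:
        return [self.scale * v + self.bias for v in x]


def shared_attention(tokens: list[Vector]) -> list[Vector]:
    """Softmax dot-product attention over the full mixed-modality sequence."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * k[i] for w, k in zip(weights, tokens)) / total
            for i in range(len(q))
        ])
    return out


def mot_layer(tokens: list[Vector], modalities: list[str],
              experts: dict[str, Expert]) -> list[Vector]:
    """Every token attends to every other token, then each token is
    processed only by the expert that matches its modality."""
    attended = shared_attention(tokens)
    return [experts[m].feed_forward(x) for x, m in zip(attended, modalities)]


# Assumed setup: one expert for text tokens, one for image tokens.
experts = {"text": Expert(1.0, 0.0), "image": Expert(2.0, 1.0)}
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modalities = ["text", "image", "image"]
mixed = mot_layer(tokens, modalities, experts)
```

Under this assumption each token touches only one of the two parameter sets, which would be consistent with roughly half of the total weights being active per token. A real MoT layer would also keep per-expert attention projections and normalization, which this sketch leaves out.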

The training process—comprising pretraining, continued training, and supervised fine-tuning—is conducted over trillions of interleaved multimodal tokens. As the token count increases, the model's performance in understanding, generation, and editing shows consistent improvement. Different capabilities emerge at distinct stages: multimodal understanding and generation appear early, followed by basic editing functions. Complex, intelligent editing skills develop in the final stages. This staggered progression indicates that advanced multimodal reasoning is built upon well-established foundational skills.

Benchmarks

Visual Understanding

Model            MME↑   MMBench↑   MMMU↑   MM-Vet↑   MathVista↑
Janus-Pro-7B     -      79.2       41.0    50.0      -
Qwen2.5-VL-7B    2347   83.5       58.6    67.1      68.2
BAGEL            2388   85.0       55.3    67.2      73.1
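For a quick read of the head-to-head numbers, the short script below computes BAGEL's margin over Qwen2.5-VL-7B on each benchmark. The scores are copied from the table above; the helper name `score_deltas` is just for illustration.

```python
# Scores copied from the Visual Understanding table above.
BENCHMARKS = ["MME", "MMBench", "MMMU", "MM-Vet", "MathVista"]
QWEN = [2347, 83.5, 58.6, 67.1, 68.2]
BAGEL = [2388, 85.0, 55.3, 67.2, 73.1]


def score_deltas(ours, theirs, names):
    """Per-benchmark difference (ours minus theirs), rounded to one decimal."""
    return {n: round(a - b, 1) for n, a, b in zip(names, ours, theirs)}


deltas = score_deltas(BAGEL, QWEN, BENCHMARKS)
for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.1f}")
```

The margins are +41 on MME, +1.5 on MMBench, +0.1 on MM-Vet, and +4.9 on MathVista, against -3.3 on MMMU. Note that MME is reported on a different scale from the other four, so the raw differences are not directly comparable across columns.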

Text-to-Image Generation

Model           GenEval↑   WISE↑
Janus-Pro-7B    0.80       0.35
SD3-Medium      0.74       -
FLUX-1-dev      0.82       0.50
BAGEL           -          0.52
BAGEL + CoT     0.88       0.70

Image Editing

Model            GEdit-Bench-EN (SC)↑   GEdit-Bench-EN (PQ)↑   GEdit-Bench-EN (O)↑   IntelligentBench↑
Step1X-Edit      7.09                   6.76                   6.70                  14.9
Gemini-2-exp.    6.73                   6.61                   6.32                  57.6
BAGEL            7.36                   6.83                   6.52                  44.0
BAGEL + CoT      -                      -                      -                     55.3

Setup and Use

Environment Preparation

Clone the repository and set up a dedicated Conda environment:

git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
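Before moving on, it can help to confirm that the new environment is the one actually in use. The check below is a minimal sketch using only the standard library. The package list is an assumption: `huggingface_hub` is needed for the download step, while `torch` is a guess at what requirements.txt installs, so adjust the names to match the file.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_matches(major, minor):
    """True if the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    # "torch" is an assumed dependency; edit this list to match requirements.txt.
    print("Python 3.10:", python_matches(3, 10))
    print("Missing packages:", missing_packages(["torch", "huggingface_hub"]))
```

An empty "Missing packages" list and a `True` on the version line suggest that the `bagel` environment is active and the install completed.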

Download Checkpoints

Download the pretrained weights directly from Hugging Face:

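from huggingface_hub import snapshot_download

# Local target directory for the weights; replace with a real path.
save_dir = "/path/to/save/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    # The next two arguments are deprecated in recent huggingface_hub releases,
    # where downloads resume automatically; drop them if your version complains.
    local_dir_use_symlinks=False,
    resume_download=True,
    # Fetch only config, weight, code, and documentation files.
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)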
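Once the download finishes, a quick sanity check on the target directory can catch an interrupted or partial transfer. The helper below is a hypothetical sketch, not part of the BAGEL repository. It only counts top-level weight files and totals their size, and it makes no assumption about which specific file names the checkpoint contains.

```python
from pathlib import Path

WEIGHT_SUFFIXES = {".safetensors", ".bin"}


def summarize_checkpoint(save_dir):
    """Count top-level weight files in save_dir and total their size in bytes."""
    root = Path(save_dir)
    weights = [
        p for p in root.iterdir()
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    ]
    return {
        "num_weight_files": len(weights),
        "total_bytes": sum(p.stat().st_size for p in weights),
        "has_json_config": any(root.glob("*.json")),
    }
```

Calling `summarize_checkpoint("/path/to/save/BAGEL-7B-MoT")` should report at least one weight file and a total size in the tens of gigabytes for a model of this scale; a count of zero usually means the path is wrong or the download did not complete.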

Running the Model

Launch the inference.ipynb notebook to begin exploring BAGEL’s multimodal capabilities.