BAGEL is an open-source multimodal foundation model with 7 billion active parameters (14 billion in total). It was trained on large-scale interleaved multimodal data spanning text, images, video, and web content. On standard multimodal understanding leaderboards, BAGEL outperforms top-tier open vision-language models such as Qwen2.5-VL and InternVL-2.5 on most benchmarks; in the Visual Understanding table below it leads Qwen2.5-VL-7B on MME, MMBench, MM-Vet, and MathVista but trails on MMMU.
Its text-to-image quality is competitive with dedicated generators such as Stable Diffusion 3 (SD3). In classical image editing tasks, BAGEL's qualitative results exceed those of leading open models. The model also handles free-form visual manipulation, multi-view synthesis, and world navigation. Together, these abilities point toward "world modeling", a scope beyond that of earlier image-editing models.
BAGEL uses a Mixture-of-Transformers (MoT) architecture so that the model has as much capacity as possible to learn from richly diverse multimodal data. Following the same capacity-maximization principle, it employs two separate encoders: one captures pixel-level detail, while the other extracts high-level semantic features. The overall framework follows a "next group of tokens" prediction paradigm, in which the model is trained to predict the next group of language or visual tokens as a compression target.
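The general idea behind a Mixture-of-Transformers layer can be sketched in a few lines. This is a minimal, dependency-free illustration and not BAGEL's implementation: it assumes each token is assigned to one expert (here the hypothetical names `"und"` and `"gen"`), that each expert owns its own projection and feed-forward weights, and that attention is computed jointly over the whole sequence. Causal masking, multiple heads, and normalization are omitted for brevity.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mot_layer(tokens, experts, params):
    """One simplified MoT layer.

    tokens  : list of d-dimensional vectors
    experts : expert name for each token (hypothetical labels)
    params  : {name: {"qkv": (Wq, Wk, Wv), "ffn": W}} with d x d matrices
    """
    # 1) Each token is projected with the Q/K/V weights of its own expert.
    q, k, v = [], [], []
    for x, name in zip(tokens, experts):
        Wq, Wk, Wv = params[name]["qkv"]
        q.append(matvec(Wq, x))
        k.append(matvec(Wk, x))
        v.append(matvec(Wv, x))

    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for i, qi in enumerate(q):
        # 2) Shared self-attention: every token attends over the full
        #    sequence, regardless of which expert produced the keys/values.
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        attended = [sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))]
        # 3) Expert-specific feed-forward (one linear map + ReLU) with a residual.
        hidden = matvec(params[experts[i]]["ffn"], attended)
        outputs.append([x + max(0.0, h) for x, h in zip(tokens[i], hidden)])
    return outputs
```

The design point the sketch tries to show is that the experts never exchange weights, only attention context: tokens routed to different experts still see each other in step 2.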
Training (pretraining, continued training, and supervised fine-tuning) is conducted over trillions of interleaved multimodal tokens. As the token count increases, the model's performance in understanding, generation, and editing improves consistently. Different capabilities emerge at distinct stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing develops last. This staggered progression suggests that advanced multimodal reasoning builds on well-established foundational skills.
Visual Understanding
| Model | MME↑ | MMBench↑ | MMMU↑ | MM-Vet↑ | MathVista↑ |
|---|---|---|---|---|---|
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
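To make the comparison concrete, the gaps between BAGEL and Qwen2.5-VL-7B can be computed directly from the two rows above. The following is a small sketch; the dictionary layout and the helper name `deltas` are illustrative choices, and the numbers are copied from the table.

```python
# Scores copied from the Visual Understanding table above.
SCORES = {
    "Qwen2.5-VL-7B": {"MME": 2347, "MMBench": 83.5, "MMMU": 58.6,
                      "MM-Vet": 67.1, "MathVista": 68.2},
    "BAGEL":         {"MME": 2388, "MMBench": 85.0, "MMMU": 55.3,
                      "MM-Vet": 67.2, "MathVista": 73.1},
}

def deltas(model, baseline, table=SCORES):
    """Per-benchmark difference (model minus baseline), rounded to one decimal."""
    return {bench: round(table[model][bench] - table[baseline][bench], 1)
            for bench in table[model]}

diff = deltas("BAGEL", "Qwen2.5-VL-7B")
wins = [bench for bench, d in diff.items() if d > 0]

for bench, d in diff.items():
    print(f"{bench:10s} {d:+}")
print("BAGEL ahead on:", ", ".join(wins))
```

Note that MME is reported on a different scale from the other benchmarks, so the raw differences are not comparable across columns.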
Text-to-Image Generation
| Model | GenEval↑ | WISE↑ |
|---|---|---|
| Janus-Pro-7B | 0.80 | 0.35 |
| SD3-Medium | 0.74 | - |
| FLUX-1-dev | 0.82 | 0.50 |
| BAGEL | - | 0.52 |
| BAGEL + CoT | 0.88 | 0.70 |
Image Editing
| Model | GEdit-Bench-EN (SC)↑ | GEdit-Bench-EN (PQ)↑ | GEdit-Bench-EN (O)↑ | IntelligentBench↑ |
|---|---|---|---|---|
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL + CoT | - | - | - | 55.3 |
Environment Preparation
Clone the repository and set up a dedicated Conda environment:
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
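After installation, it can be useful to confirm that everything listed in requirements.txt actually landed in the active environment. The helper below is a rough sketch, not part of the BAGEL repository: the function name `missing_requirements` is hypothetical, and the parsing only handles plain `name==version` style lines (URL or VCS requirements would be reported as missing).

```python
import re
from importlib import metadata
from pathlib import Path

def missing_requirements(requirements_path):
    """Return distribution names from a requirements file that are not installed."""
    missing = []
    for raw in Path(requirements_path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks, comment-only lines, and pip options like "-r x.txt"
        # Keep only the name: cut at extras, version specifiers, markers, or "@ url".
        name = re.split(r"[\s<>=!~;\[@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Run from the repository root with the `bagel` environment active, `missing_requirements("requirements.txt")` should return an empty list if the install step succeeded.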
Download Checkpoints
Download the pretrained weights directly from Hugging Face:
from huggingface_hub import snapshot_download

save_dir = "/path/to/save/BAGEL-7B-MoT"  # local directory for the weights
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,  # deprecated in recent huggingface_hub releases
    resume_download=True,  # also deprecated; downloads resume when possible
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
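Once the download finishes, a quick inventory of the target directory helps confirm that weight and config files are present. This is a minimal sketch under the assumption that the files sit under `save_dir` and that the `cache` subdirectory from the snippet above should be ignored; `summarize_checkpoint_dir` is a hypothetical helper, and no particular file names or sizes are assumed.

```python
from collections import Counter
from pathlib import Path

def summarize_checkpoint_dir(save_dir):
    """Count files per extension and total bytes under save_dir, skipping cache/."""
    counts = Counter()
    total_bytes = 0
    for path in Path(save_dir).rglob("*"):
        if not path.is_file():
            continue
        if "cache" in path.relative_to(save_dir).parts:
            continue  # ignore the Hugging Face cache subdirectory
        counts[path.suffix or "(none)"] += 1
        total_bytes += path.stat().st_size
    return counts, total_bytes
```

For example, `counts, size = summarize_checkpoint_dir("/path/to/save/BAGEL-7B-MoT")` followed by `print(dict(counts), f"{size / 1e9:.1f} GB")` gives a one-line summary; an empty `counts` would suggest the download did not complete.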
Running the Model
Launch the inference.ipynb notebook to begin exploring BAGEL’s multimodal capabilities.