BAGEL is an open-source multimodal foundation model with 7 billion active parameters (14 billion in total), trained on large-scale interleaved data spanning text, images, video, and web content. On standard multimodal understanding benchmarks it outperforms top-tier open vision-language models such as Qwen2.5-VL and InternVL-2.5 on most metrics; in the comparison below it leads Qwen2.5-VL-7B on MME, MMBench, MM-Vet, and MathVista but trails on MMMU.
Its text-to-image quality is competitive with dedicated generators such as Stable Diffusion 3 (SD3). On classical image-editing tasks, BAGEL produces qualitatively better results than leading open-source models. It also handles free-form visual manipulation, multi-view synthesis, and world navigation, abilities that amount to "world modeling" and go beyond the scope of earlier image-editing models.
BAGEL adopts a Mixture-of-Transformers (MoT) architecture to maximize its capacity to learn from richly diverse multimodal data. Following the same capacity-maximization principle, it uses two separate encoders: one captures pixel-level detail and the other extracts semantic-level features. The whole model is trained under a next-group-of-tokens prediction paradigm: it learns to predict the next group of language or visual tokens as a compression target.
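To make the Mixture-of-Transformers idea more concrete, here is a toy sketch in plain Python. It assumes, purely for illustration, that every token carries a modality tag, that self-attention is shared across the whole sequence, and that each modality has its own expert parameters. The function names, the `"und"`/`"gen"` tags, and the scalar "weights" are hypothetical stand-ins, not BAGEL's actual implementation.

```python
import math

def shared_attention(values):
    """Toy global self-attention over scalar token values.

    Every token attends to every other token regardless of modality,
    using the product of two values as the attention score.
    """
    out = []
    for q in values:
        scores = [q * k for k in values]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / total)
    return out

def mot_layer(values, modalities, expert_scale):
    """One toy MoT layer: shared attention, then a per-modality expert."""
    attended = shared_attention(values)
    # Each token is routed by its modality tag, not by a learned gate.
    return [expert_scale[m] * a for a, m in zip(attended, modalities)]

tokens = [0.5, -1.0, 2.0, 0.1]
modalities = ["und", "und", "gen", "gen"]  # hypothetical modality tags
expert_scale = {"und": 1.0, "gen": 2.0}    # stand-ins for expert weights
print(mot_layer(tokens, modalities, expert_scale))
```

The point of the sketch is the split of responsibilities: all tokens interact in one attention step, while the parameters applied afterwards depend only on which modality a token belongs to.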
BAGEL is trained through pretraining, continued training, and supervised fine-tuning on trillions of interleaved multimodal tokens. Performance on understanding, generation, and editing improves consistently as the token count grows. Capabilities emerge at different stages: multimodal understanding and generation appear early, basic editing follows, and complex, intelligent editing appears last. This staged progression suggests that advanced multimodal reasoning builds on well-formed foundational skills.
Visual Understanding
| Model | MME↑ | MMBench↑ | MMMU↑ | MM-Vet↑ | MathVista↑ |
|---|---|---|---|---|---|
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
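The gap between BAGEL and Qwen2.5-VL-7B is easier to read as per-benchmark differences. The short script below is only an illustrative calculation on the scores in the table above; the helper name `deltas` is made up for this example.

```python
# Scores copied from the Visual Understanding table above.
scores = {
    "Qwen2.5-VL-7B": {"MME": 2347, "MMBench": 83.5, "MMMU": 58.6,
                      "MM-Vet": 67.1, "MathVista": 68.2},
    "BAGEL": {"MME": 2388, "MMBench": 85.0, "MMMU": 55.3,
              "MM-Vet": 67.2, "MathVista": 73.1},
}

def deltas(a, b):
    """Return per-benchmark score differences a - b, rounded to one decimal."""
    return {name: round(a[name] - b[name], 1) for name in a}

diff = deltas(scores["BAGEL"], scores["Qwen2.5-VL-7B"])
wins = [name for name, d in diff.items() if d > 0]
print(diff)
print(f"BAGEL leads on {len(wins)} of {len(diff)} benchmarks")
```

BAGEL comes out ahead on four of the five benchmarks and behind on MMMU. Note that MME uses a different scale from the other columns, so the raw differences are not comparable across benchmarks.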
Text-to-Image Generation
| Model | GenEval↑ | WISE↑ |
|---|---|---|
| Janus-Pro-7B | 0.80 | 0.35 |
| SD3-Medium | 0.74 | - |
| FLUX-1-dev | 0.82 | 0.50 |
| BAGEL | - | 0.52 |
| BAGEL + CoT | 0.88 | 0.70 |
Image Editing
| Model | GEdit-Bench-EN (SC)↑ | GEdit-Bench-EN (PQ)↑ | GEdit-Bench-EN (O)↑ | IntelligentBench↑ |
|---|---|---|---|---|
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL + CoT | - | - | - | 55.3 |
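The tables report two benchmarks, WISE and IntelligentBench, for BAGEL both with and without chain-of-thought (CoT). As a rough illustration of how much CoT helps, the sketch below computes the relative gain from those reported numbers; `relative_gain` is a helper defined only for this example.

```python
def relative_gain(base, improved):
    """Relative improvement of `improved` over `base`, as a percentage."""
    return (improved - base) / base * 100.0

# (without CoT, with CoT) pairs copied from the tables above.
cot_pairs = {
    "WISE": (0.52, 0.70),
    "IntelligentBench": (44.0, 55.3),
}

for name, (base, cot) in cot_pairs.items():
    print(f"{name}: {base} -> {cot} ({relative_gain(base, cot):+.1f}%)")
```

On these figures, CoT corresponds to a relative gain of roughly 35% on WISE and 26% on IntelligentBench.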
Environment Preparation
Clone the repository and set up a dedicated Conda environment:
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
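Once the environment is active, a quick sanity check can confirm that the interpreter and key packages are in place. The script below is an optional sketch, not part of the BAGEL repository; the helper names are hypothetical, and the only package it checks is `huggingface_hub`, which the download step in the next section imports.

```python
import importlib.util
import sys

def python_matches(version_info, expected=(3, 10)):
    """True if the interpreter's major.minor equals the expected pair."""
    return tuple(version_info[:2]) == expected

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    print("Python 3.10:", python_matches(sys.version_info))
    # huggingface_hub is imported by the download snippet in the next section.
    print("Missing:", missing_modules(["huggingface_hub"]))
```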
Download Checkpoints
Download the pretrained weights directly from Hugging Face:
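from huggingface_hub import snapshot_download

# Local directory for the weights; replace with a real path.
save_dir = "/path/to/save/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    # Recent huggingface_hub releases deprecate the next two arguments
    # (downloads resume by default); drop them if your version rejects them.
    local_dir_use_symlinks=False,
    resume_download=True,
    # Only fetch files matching these glob patterns.
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)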
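The `allow_patterns` argument restricts the download to files whose names match at least one pattern. The sketch below previews that filtering with the standard library's `fnmatch`, on the assumption that the patterns behave as ordinary glob patterns; the file names are hypothetical and `is_allowed` is a helper written for this example, not a `huggingface_hub` function.

```python
from fnmatch import fnmatch

ALLOW_PATTERNS = ["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"]

def is_allowed(filename, patterns=ALLOW_PATTERNS):
    """True if the filename matches at least one glob pattern."""
    return any(fnmatch(filename, p) for p in patterns)

# Hypothetical file names, for illustration only.
candidates = ["config.json", "model.safetensors", "README.md", "demo.png"]
for name in candidates:
    print(name, "->", "download" if is_allowed(name) else "skip")
```

If a file you need is skipped by the filter (an image asset, for example), adding its extension to the pattern list should be enough.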
Running the Model
Launch the inference.ipynb notebook to begin exploring BAGEL’s multimodal capabilities.