Tiny Qwen: A Clean PyTorch Implementation of Qwen3 and Qwen2.5-VL

Published September 27 in Multimodal Models

Tiny Qwen is a streamlined PyTorch implementation of the Qwen3 and Qwen2.5-VL models. It supports text processing and visual understanding (including the image-referencing capabilities of Qwen2.5-VL), and covers both dense and Mixture-of-Experts (MoE) architectures. It strips away the overhead found in standard Hugging Face libraries, which makes the code more readable and easier to modify. It also provides an interactive chat interface and straightforward Python examples designed for quick integration.

Tiny Qwen also includes pretraining and instruction-tuning scripts for the Qwen3V-4B-Preview projection layer, so you can reproduce that part of the model's training pipeline from scratch.

Quick tips for the chat interface:

Type /help to view available commands.
Type /exit or press Ctrl+C to close the session.

To begin, select the Qwen3 model, then choose the Qwen3-4B-Instruct-2507 variant. Once the "Model loaded successfully" message appears, you can start the conversation. For instance, a simple "hello?" should return: "Hello! How can I assist you today?"

Tiny Qwen Setup

We recommend using uv to manage your virtual environment. Follow these steps for a clean installation:

1. Install uv and create the environment:
   pip install uv && uv venv
2. Activate the environment:
   - Linux/macOS: source .venv/bin/activate
   - Windows: .venv\Scripts\activate
3. Install dependencies:
   uv pip install -r requirements.txt
4. Launch the chat interface:
   python run.py

Note on Multimodal Inputs:
While Qwen3 is text-only, Qwen2.5-VL supports images. To reference an image in the chat, use the @ symbol followed by the file path. For example:

@data/test-img-1.jpg tell me what you see in this image?

The system will confirm: ✓ Found image: data/test-img-1.jpg, then provide a description: The image shows a sunflower field with a close-up of a sunflower...
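
How the chat interface parses the @ reference is not shown here. As a rough sketch of one way such parsing could work, the snippet below pulls `@path` tokens out of a message. The name `split_image_refs`, the regular expression, and the rule that paths contain no whitespace are all assumptions for illustration and are not taken from the Tiny Qwen code.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of Tiny Qwen.
# Matches "@something" only at the start of the message or after whitespace,
# so an address such as "a@b.com" is not treated as an image reference.
IMAGE_REF = re.compile(r"(?<!\S)@(\S+)")


def split_image_refs(message, must_exist=False):
    """Return (image_paths, remaining_text) for a chat message.

    Assumes paths contain no whitespace. With must_exist=True, references
    to files that are not on disk are left in the text untouched.
    """
    paths = []

    def _take(match):
        path = match.group(1)
        if must_exist and not Path(path).is_file():
            return match.group(0)  # keep the token in the text
        paths.append(path)
        return ""

    text = IMAGE_REF.sub(_take, message)
    # Collapse the whitespace left behind by removed references.
    return paths, " ".join(text.split())


paths, text = split_image_refs(
    "@data/test-img-1.jpg tell me what you see in this image?"
)
print(paths)  # ['data/test-img-1.jpg']
print(text)   # tell me what you see in this image?
```

Passing `must_exist=True` would mirror the "Found image" confirmation step by accepting only references that resolve to real files.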

Code Examples

Running Qwen2.5-VL

from PIL import Image
from model.model import Qwen2VL
from model.processor import Processor

model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2VL.from_pretrained(repo_id=model_name, device_map="auto")
processor = Processor(repo_id=model_name, vision_config=model.config.vision_config)

context = [
    "<|im_start|>user\n<|vision_start|>",
    Image.open("data/test-img-1.jpg"),
    "<|vision_end|>What's on this image?<|im_end|>\n<|im_start|>assistant\n",
]

inputs = processor(context, device="cuda")

generator = model.generate(
    input_ids=inputs["input_ids"],
    pixels=inputs["pixels"],
    d_image=inputs["d_image"],
    max_new_tokens=64,
    stream=True,
)

for token_id in generator:
    token_text = processor.tokenizer.decode([token_id])
    print(token_text, end="", flush=True)
print()
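
The context list above interleaves chat-template strings with image objects. A small helper can assemble that list so the special tokens are not typed by hand each time. This is a sketch only: `build_context` is a hypothetical name, and whether Processor accepts more than one image per turn is an assumption that the example above does not confirm.

```python
# Hypothetical helper, not part of Tiny Qwen: assembles the mixed list of
# strings and image objects in the single-turn format shown above.
def build_context(prompt, images=()):
    """Build a single-turn user context.

    Each image is wrapped in <|vision_start|> ... <|vision_end|>, and the
    prompt text follows the last image. Returns a list of str and image
    objects, in order.
    """
    context = []
    pending = "<|im_start|>user\n"
    for image in images:
        context.append(pending + "<|vision_start|>")
        context.append(image)
        pending = "<|vision_end|>"
    context.append(pending + f"{prompt}<|im_end|>\n<|im_start|>assistant\n")
    return context


# Text-only turn: one string, no vision tokens.
print(build_context("hello?"))
```

With one image, `build_context("What's on this image?", [Image.open("data/test-img-1.jpg")])` produces the same three-element list as the example above.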

Training Qwen3V-4B-Preview

Follow these two steps to replicate the projection layer training for Qwen3V-4B-Preview.

Step 1: Pretraining (Using the LLaVA-595K dataset)

PYTHONPATH=. python train/s2_1_qwen3v_pretrain.py \
    --devices 8 \
    --batch_size 8 \
    --epochs 1 \
    --grad_accum 2 \
    --max_seq_len 1024 \
    --lr 5e-4 \
    --weight_decay 0 \
    --num_workers 4 \
    --precision bf16-mixed \
    --proj_out projection-pretrained.safetensors \
    --cache_dir ./cache

Step 2: Instruction Tuning (Using the LLaVA-150K dataset)

PYTHONPATH=. python train/s2_2_qwen3v_instruct.py \
    --devices 8 \
    --batch_size 2 \
    --epochs 3 \
    --grad_accum 8 \
    --max_seq_len 1024 \
    --lr 5e-4 \
    --weight_decay 0 \
    --num_workers 4 \
    --precision bf16-mixed \
    --proj_out projection-instruct.safetensors \
    --cache_dir ./cache \
    --pretrained_proj projection-pretrained.safetensors \
    --freeze_llm

Training Technicalities:

Hardware: These scripts are configured for a node with 8 H100 GPUs. Adjust the --devices flag to fit your specific setup.
Optimization: Only the projection layer is trained in these scripts. The --freeze_llm flag ensures the base language model parameters remain unchanged.
Batch Size: The effective batch size is 128 for both pretraining and instruction tuning.
Duration: The pipeline consists of 1 pretraining epoch followed by 3 instruction-tuning epochs.
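
The effective batch size of 128 follows from the flags in the two commands, assuming --batch_size is the per-device batch size. The sketch below reproduces that arithmetic and gives a rough optimizer step count. The sample counts are nominal values read off the dataset names (595K and 150K), not exact dataset sizes, and the function names are illustrative.

```python
import math


def effective_batch_size(devices, batch_size, grad_accum):
    # Assumes batch_size is per device and gradients are accumulated
    # over grad_accum micro-batches before each optimizer step.
    return devices * batch_size * grad_accum


def optimizer_steps(num_samples, effective_batch, epochs=1):
    # Assumes the last partial batch of each epoch is kept.
    return math.ceil(num_samples / effective_batch) * epochs


pretrain = effective_batch_size(devices=8, batch_size=8, grad_accum=2)
instruct = effective_batch_size(devices=8, batch_size=2, grad_accum=8)
print(pretrain, instruct)  # 128 128

# Nominal dataset sizes taken from the dataset names (assumption).
print(optimizer_steps(595_000, pretrain, epochs=1))  # 4649
print(optimizer_steps(150_000, instruct, epochs=3))  # 3516
```

When running on fewer GPUs, raising --grad_accum by the same factor keeps the effective batch size at 128 under these assumptions.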

Running Qwen3

from model.model import Qwen3MoE
from model.processor import Processor

model_name = "Qwen/Qwen3-4B-Instruct-2507"
model = Qwen3MoE.from_pretrained(repo_id=model_name)
processor = Processor(repo_id=model_name)

# Qwen3 is text-only, so the prompt carries no vision tokens.
context = [
    "<|im_start|>user\nExplain how to reverse a linked list<|im_end|>\n<|im_start|>assistant\n",
]
inputs = processor(context, device="cuda")
generator = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=64,
    stream=True
)

for token_id in generator:
    token_text = processor.tokenizer.decode([token_id])
    print(token_text, end="", flush=True)
print()
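
Decoding one token ID at a time, as in the loops above, can print replacement characters when a multi-byte character is split across tokens. One common workaround is to decode the accumulated IDs and emit only the new text. The sketch below shows the idea with a stand-in byte-level decoder. `stream_text` and `fake_decode` are hypothetical names, and the tokenizer's handling of partial characters is an assumption.

```python
# Hypothetical helper, not part of Tiny Qwen.
def stream_text(token_ids, decode):
    """Yield text pieces from a stream of token IDs.

    Re-decodes the accumulated IDs at every step and yields only the new
    suffix. Output is held back while the decoded text ends in U+FFFD,
    which signals an incomplete multi-byte character.
    """
    ids = []
    emitted = 0
    for token_id in token_ids:
        ids.append(token_id)
        text = decode(ids)
        if text.endswith("\ufffd"):
            continue  # wait for the rest of the character
        yield text[emitted:]
        emitted = len(text)


# Stand-in decoder for the demonstration: treats each ID as one UTF-8 byte.
def fake_decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")


# "é" is two bytes (0xC3, 0xA9), so it arrives split across two IDs.
pieces = list(stream_text([104, 0xC3, 0xA9, 33], fake_decode))
print(pieces)  # ['h', 'é', '!']
```

With the objects from the examples above, the loop would become `for piece in stream_text(generator, processor.tokenizer.decode): print(piece, end="", flush=True)`. Re-decoding the full sequence each step costs quadratic time, which is acceptable for short generations such as max_new_tokens=64.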
