AgentCPM-GUI is an open-source, on-device GUI agent developed through a collaboration between THUNLP and ModelBest. It is built on MiniCPM-V, has 8 billion parameters, and is designed to run entirely on-device. Given a smartphone screenshot and a user task, the model outputs the next action to perform, such as a tap, swipe, key press, or text input, and repeats this step by step until the task is done. It is particularly effective at navigating the interfaces of Chinese mobile applications.
Robust GUI Grounding: Pretrained on an extensive bilingual Android dataset, the model identifies buttons, input fields, labels, and icons with high accuracy.
Optimized for Chinese Apps: As the first open-source GUI agent fine-tuned specifically for the Chinese market, it supports over 30 popular applications, including Gaode Maps, Dianping, Bilibili, and Xiaohongshu.
Advanced Planning: The integration of reinforcement fine-tuning introduces a "reasoning" step before each action is executed. This internal thought process significantly improves the completion rate for complex, multi-step tasks.
Streamlined Execution: An optimized action space and clean JSON structure keep the average output to just 9.7 tokens per step, providing the speed necessary for edge deployment.
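To make the "clean JSON structure" point concrete, here is a minimal sketch that serializes one click action with Python's standard `json` module, first with the default separators and then with the compact separators. The character counts are only a rough proxy: the 9.7-token figure depends on the model's tokenizer, which this sketch does not use.

```python
import json

# One click action, in the format used by the action table later in this article.
action = {"POINT": [480, 320]}

# Default separators add a space after every ":" and ",".
default_text = json.dumps(action)
# Compact separators drop that whitespace.
compact_text = json.dumps(action, separators=(",", ":"))

print(default_text, len(default_text))
print(compact_text, len(compact_text))
```

Every space saved is a character the model does not have to generate, which matters when each step should return quickly on a phone.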
Install Dependencies

    git clone https://github.com/OpenBMB/AgentCPM-GUI
    cd AgentCPM-GUI
    conda create -n gui_agent python=3.11
    conda activate gui_agent
    pip install -r requirements.txt

Download the Model
Download the AgentCPM-GUI weights from Hugging Face and place them in the model/AgentCPM-GUI directory.
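One possible way to script the download is sketched below, using `snapshot_download` from the `huggingface_hub` package. The repository id `openbmb/AgentCPM-GUI` and the helper name `download_weights` are assumptions made for illustration; check the model page on Hugging Face for the exact id before running it.

```python
from pathlib import Path


def download_weights(repo_id: str = "openbmb/AgentCPM-GUI",
                     local_dir: str = "model/AgentCPM-GUI") -> Path:
    """Fetch a model snapshot into local_dir and return that directory.

    The default repo_id is an assumption; verify it on Hugging Face.
    """
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (needs network access and several GB of free disk space):
# download_weights()
```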
Inference via Hugging Face

    import json

    import torch
    from PIL import Image
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # 1. Load model and tokenizer
    model_path = "model/AgentCPM-GUI"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
    model = model.to("cuda:0")

    # 2. Define the instruction and image
    # The instruction means: "Please tap the 'Member' button on the screen"
    instruction = "请点击屏幕上的'会员'按钮"
    image_path = "assets/test.jpeg"
    image = Image.open(image_path).convert("RGB")

    # 3. Resize image to fit model constraints (longer side capped at 1120 px)
    def __resize__(origin_img):
        resolution = origin_img.size
        w, h = resolution
        max_line_res = 1120
        if max_line_res is not None:
            max_line = max_line_res
            if h > max_line:
                w = int(w * max_line / h)
                h = max_line
            if w > max_line:
                h = int(h * max_line / w)
                w = max_line
        img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
        return img

    image = __resize__(image)

    # 4. Set message format
    # The text after the question means: "Current screenshot:"
    messages = [{
        "role": "user",
        "content": [
            f"<Question>{instruction}</Question>\n当前屏幕截图:",
            image
        ]
    }]

    # 5. Configure inference settings and system prompt
    with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
        ACTION_SCHEMA = json.load(f)
    # Add a "required" entry so the schema demands a "thought" field in every reply
    items = list(ACTION_SCHEMA.items())
    items.insert(3, ("required", ["thought"]))
    ACTION_SCHEMA = dict(items)
    # The prompt stays in Chinese. In English: role = an agent familiar with Android
    # touch-screen GUI operation; task = output the next action for the current
    # screenshot; rule = reply in compact JSON that follows the schema.
    SYSTEM_PROMPT = f'''# 角色:你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。
    # 任务:针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。
    # 规则:以紧凑JSON格式输出,输出操作必须遵循Schema约束
    # Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

    outputs = model.chat(
        image=None,
        msgs=messages,
        system_prompt=SYSTEM_PROMPT,
        tokenizer=tokenizer,
        temperature=0.1,
        top_p=0.3,
        n=1,
    )
    print(outputs)

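Expected output (the thought is generated in Chinese and reads roughly: "The goal is to tap the 'Member' button on the screen. The current screen shows the app's recommendation page, with a navigation bar at the top. Tapping 'Member' opens the app's membership content."):

    {"thought":"任务目标是点击屏幕上的'会员'按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击'会员'按钮可以访问应用的会员相关内容。","POINT":[729,69]}
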
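The `POINT` in that output is not a pixel position: according to the action table in this article, coordinates are normalized to a 0–1000 range. The sketch below parses a reply and scales the point to a concrete screen. The helper name `point_to_pixels` and the 1080×2400 screen size are illustrative assumptions, not part of the project.

```python
import json


def point_to_pixels(point, screen_width, screen_height):
    """Scale a normalized [x, y] pair (0-1000 per axis) to pixel coordinates."""
    x_norm, y_norm = point
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError(f"POINT outside the 0-1000 range: {point}")
    return (round(x_norm / 1000 * screen_width),
            round(y_norm / 1000 * screen_height))


# Reply with the same POINT as the expected output (thought shortened here).
reply = '{"thought":"...","POINT":[729,69]}'
action = json.loads(reply)

# Hypothetical 1080x2400 device; substitute the real screen size.
x_px, y_px = point_to_pixels(action["POINT"], 1080, 2400)
print(x_px, y_px)
```

Because the coordinates are normalized, the scaling uses the device's real resolution rather than the size of the downscaled screenshot sent to the model.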
Inference via vLLM

    vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code

Once the server is running, send requests to its OpenAI-compatible chat completions endpoint:

    import base64
    import io
    import json

    import requests
    from PIL import Image

    END_POINT = "http://localhost:8000/v1/chat/completions"

    # Load the action schema and require a "thought" field in every reply
    with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
        ACTION_SCHEMA = json.load(f)
    items = list(ACTION_SCHEMA.items())
    items.insert(3, ("required", ["thought"]))
    ACTION_SCHEMA = dict(items)
    # Same Chinese system prompt as in the Hugging Face example:
    # role, task, compact-JSON rule, then the schema itself.
    SYSTEM_PROMPT = f'''# 角色:你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。
    # 任务:针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。
    # 规则:以紧凑JSON格式输出,输出操作必须遵循Schema约束
    # Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

    def encode_image(image: Image.Image) -> str:
        """Return the image as a base64-encoded JPEG string."""
        with io.BytesIO() as in_mem_file:
            image.save(in_mem_file, format="JPEG")
            in_mem_file.seek(0)
            return base64.b64encode(in_mem_file.read()).decode("utf-8")

    def __resize__(origin_img):
        """Downscale so that the longer side is at most 1120 px."""
        resolution = origin_img.size
        w, h = resolution
        max_line_res = 1120
        if max_line_res is not None:
            max_line = max_line_res
            if h > max_line:
                w = int(w * max_line / h)
                h = max_line
            if w > max_line:
                h = int(h * max_line / w)
                w = max_line
        img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
        return img

    def predict(text_prompt: str, image: Image.Image):
        """Send one instruction plus screenshot and return the model's reply text."""
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图:"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
            ]}
        ]
        payload = {
            "model": "AgentCPM-GUI",
            "temperature": 0.1,
            "messages": messages,
            "max_tokens": 2048,
        }
        headers = {"Content-Type": "application/json"}
        response = requests.post(END_POINT, headers=headers, json=payload)
        response.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
        return response.json()["choices"][0]["message"]["content"]

    # convert("RGB") keeps the JPEG encoding step safe for images with an alpha channel
    image = __resize__(Image.open("assets/test.jpeg").convert("RGB"))
    instruction = "请点击屏幕上的'会员'按钮"  # "Please tap the 'Member' button on the screen"
    print(predict(instruction, image))

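The arithmetic inside `__resize__` is easier to follow without the image in the way. The sketch below reproduces only the size calculation, under the hypothetical name `capped_size`, and runs it on three example resolutions chosen for illustration.

```python
def capped_size(width, height, max_side=1120):
    """Return (width, height) after capping each side at max_side.

    Same arithmetic as __resize__, applied to the dimensions only.
    """
    if height > max_side:
        # Portrait case: shrink the width by the same ratio as the height.
        width = int(width * max_side / height)
        height = max_side
    if width > max_side:
        # Landscape case: shrink the height by the same ratio as the width.
        height = int(height * max_side / width)
        width = max_side
    return width, height


for size in [(1080, 2400), (2400, 1080), (800, 600)]:
    print(size, "->", capped_size(*size))
```

A 1080×2400 portrait screenshot comes out as 504×1120, the landscape equivalent as 1120×504, and an 800×600 image is left unchanged because the function only ever scales down.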
Each operational step returns a JSON object containing a primary action, optional modifiers, and a status flag.
| Action | Required Fields | Optional Fields | Purpose | Example |
|---|---|---|---|---|
| Click | POINT:[x,y] | duration, thought, STATUS | Single tap at normalized 0–1000 coordinates | {"POINT":[480,320]} |
| Long Press | POINT:[x,y], duration:1000 | thought, STATUS | Hold at specific coordinates | {"POINT":[480,320],"duration":1000} |
| Swipe | POINT:[x,y], to:"up"/"down"/"left"/"right" or to:[x,y] | duration, thought, STATUS | Directional or coordinate-based swipe | {"POINT":[500,500],"to":"up"} |
| Press Key | PRESS:"HOME"/"BACK"/"ENTER" | duration, thought, STATUS | System hardware/navigation key press | {"PRESS":"BACK"} |
| Type Text | TYPE:"<text>" | duration, thought, STATUS | Input text at the current focus point | {"TYPE":"Hello"} |
| Wait | duration | thought, STATUS | Pause execution for the given number of milliseconds | {"duration":500} |
| Task Status | STATUS:"start"/"continue"/"finish"/"satisfied" | thought | Define current phase of the task | {"STATUS":"finish"} |
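An executor has to decide which kind of action a reply contains before it can act on it. The sketch below shows one way to do that by checking which required field is present, in the order the table suggests. The function name `classify_action` and the returned labels are illustrative assumptions. Click and Long Press are merged into a single `tap` label, because the table lists `duration` as optional for Click and so the field alone cannot separate the two.

```python
def classify_action(action: dict) -> str:
    """Name the primary action in one parsed step, based on its fields."""
    if "POINT" in action and "to" in action:
        return "swipe"
    if "POINT" in action:
        # Click or long press; the caller can inspect action.get("duration").
        return "tap"
    if "PRESS" in action:
        return "press_key"
    if "TYPE" in action:
        return "type_text"
    if "duration" in action:
        # A bare duration with no other primary field is a wait.
        return "wait"
    if "STATUS" in action:
        return "status"
    raise ValueError(f"no recognizable action in {action!r}")


examples = [
    {"POINT": [480, 320]},
    {"POINT": [500, 500], "to": "up"},
    {"PRESS": "BACK"},
    {"TYPE": "Hello"},
    {"duration": 500},
    {"STATUS": "finish"},
]
for example in examples:
    print(example, "->", classify_action(example))
```

The order of the checks matters: `duration` and `STATUS` can accompany other actions as optional fields, so they are only treated as the primary action when nothing else is present.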
Fine-Tuning: The project provides source code for both Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT).
Grounding Benchmarks: AgentCPM-GUI-8B achieved an average score of 71.3 across the fun2point, text2point, and bbox2text tests. That is well above the comparable 7B–8B models in the table below (the next best average is 44.3) and above the proprietary GPT-4o (18.8).
| Model | fun2point | text2point | bbox2text | Average |
|---|---|---|---|---|
| AgentCPM-GUI-8B | 79.1 | 76.5 | 58.2 | 71.3 |
| Qwen2.5-VL-7B | 36.8 | 52.0 | 44.1 | 44.3 |
| Intern2.5-VL-8B | 17.2 | 24.2 | 45.9 | 29.1 |
| UI-TARS-7B | 56.8 | 66.7 | 1.4 | 41.6 |
| GPT-4o | 22.1 | 19.9 | 14.3 | 18.8 |
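The Average column is the plain mean of the three task scores. As a quick check, the sketch below recomputes it from the values in the table, rounded to one decimal place.

```python
# Per-task grounding scores (fun2point, text2point, bbox2text) copied from the table.
scores = {
    "AgentCPM-GUI-8B": (79.1, 76.5, 58.2),
    "Qwen2.5-VL-7B": (36.8, 52.0, 44.1),
    "Intern2.5-VL-8B": (17.2, 24.2, 45.9),
    "UI-TARS-7B": (56.8, 66.7, 1.4),
    "GPT-4o": (22.1, 19.9, 14.3),
}

# Unweighted mean over the three tasks, rounded like the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

The recomputed values match the table. They also show how much a single weak task can move the mean: UI-TARS-7B scores 1.4 on bbox2text, which pulls its average down to 41.6 despite solid results on the other two tasks.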
Agent Benchmarks: On the CAGUI dataset (focused on Chinese apps), the model reached 96.86% Type Match (TM) and 91.28% Exact Match (EM), well ahead of the other open-source models listed below.
| Model | CAGUI TM | CAGUI EM |
|---|---|---|
| AgentCPM-GUI-8B | 96.86 | 91.28 |
| Qwen2.5-VL-7B | 68.53 | 48.80 |
| UI-TARS-7B | 71.01 | 53.92 |
| OS-Atlas-7B | 81.53 | 55.89 |
CAGUI is the companion benchmark for evaluating Chinese app control. It includes comprehensive grounding and agent-specific tasks and is available for download on Hugging Face.