AgentCPM-GUI: An Open-Source On-Device Large Language Model (LLM) Agent

Published May 18 in Large Language Model Tools

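AgentCPM-GUI is an open-source, on-device large language model (LLM) agent developed jointly by THUNLP and ModelBest. Built on the 8-billion-parameter MiniCPM-V, it takes smartphone screenshots as input and autonomously carries out user-specified tasks, and it is particularly strong at operating Chinese-language apps.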

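High-quality GUI grounding: pre-training on a large-scale bilingual Android dataset improves the localization and understanding of common GUI components (buttons, input fields, labels, icons, and so on).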

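Chinese app operation: the first open-source GUI agent fine-tuned for Chinese apps, covering more than 30 popular apps such as Amap (高德地图), Dianping (大众点评), Bilibili (哔哩哔哩), and Xiaohongshu (小红书).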

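Enhanced planning and reasoning: through reinforcement fine-tuning (RFT), the model "thinks" before it outputs an action, which significantly raises the success rate on complex tasks.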

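Compact action space design: an optimized action space and a concise JSON format bring the average action length down to 9.7 tokens, improving on-device inference efficiency.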
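As a rough illustration of what a concise JSON format saves, the sketch below serializes one click action with Python's default separators and with compact ones. The action values are illustrative, and character counts are only a proxy for token counts, which depend on the model's tokenizer.

```python
import json

# A click action in the agent's output format (values are illustrative).
action = {"POINT": [480, 320], "duration": 200}

# Default json.dumps inserts a space after every ',' and ':'.
default_form = json.dumps(action)
# Compact separators drop that whitespace.
compact_form = json.dumps(action, separators=(",", ":"))

print(default_form)  # {"POINT": [480, 320], "duration": 200}
print(compact_form)  # {"POINT":[480,320],"duration":200}
print(len(default_form) - len(compact_form), "characters saved")
```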

AgentCPM-GUI User Guide

Install dependencies

git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt

Download the model

Download the AgentCPM-GUI model from Hugging Face and place it in the model/AgentCPM-GUI directory.
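One possible way to fetch the weights into that directory is `snapshot_download` from the `huggingface_hub` library. In this sketch the repository ID `openbmb/AgentCPM-GUI` is an assumption that should be checked on Hugging Face, and `download_agentcpm_gui` is a hypothetical helper name.

```python
def download_agentcpm_gui(local_dir="model/AgentCPM-GUI", repo_id="openbmb/AgentCPM-GUI"):
    """Download the model snapshot into local_dir and return the local path.

    repo_id is an assumed Hugging Face repository ID; verify it before use.
    """
    # Imported here so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_agentcpm_gui()` needs `pip install huggingface_hub` and network access, and the download is large for a model of this size.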

Hugging Face inference example

import torch 
from transformers import AutoTokenizer, AutoModelForCausalLM 
from PIL import Image 
import json 

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI"  
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0") 

# 2. Build the input
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the 'Membership' button on the screen"
image_path = "assets/test.jpeg" 
image = Image.open(image_path).convert("RGB")

# 3. Resize the image, keeping the aspect ratio, so that neither side exceeds 1120 pixels
def __resize__(origin_img):
    resolution = origin_img.size
    w,h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w,h),resample=Image.Resampling.LANCZOS)
    return img 
image = __resize__(image)

# 4. Build the message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\n当前屏幕截图：",
        image
    ]
}]

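# 5. Inference setup
# Load the action schema and mark "thought" as required so every output includes the model's reasoning.
with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
    ACTION_SCHEMA = json.load(f)
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))
ACTION_SCHEMA = dict(items)
# The system prompt below is in Chinese: it states the role, the task, the compact-JSON output rule, and the schema.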
SYSTEM_PROMPT = f'''# 角色：你是一名熟悉安卓系统触屏GUI操作的智能体，将根据用户的问题，分析当前界面的GUI元素和布局，生成相应的操作。

# 任务：针对用户问题，根据输入的当前屏幕截图，输出下一步的操作。

# 规则：以紧凑JSON格式输出，输出操作必须遵循Schema约束

# Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}''' 
outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Print the result
print(outputs)

Expected output:

{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面，顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}
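The `POINT` in this output uses normalized coordinates (0–1000, origin at the top-left), so it has to be scaled to the device resolution before it can be used as a tap position. A minimal sketch, assuming a 1080×2400 screen; `to_pixels` is a hypothetical helper and the `thought` text is omitted for brevity:

```python
import json

def to_pixels(point, width, height):
    """Map an [x, y] point in 0-1000 normalized coordinates to pixel coordinates."""
    x, y = point
    return round(x * width / 1000), round(y * height / 1000)

# Model output from the example above, with the thought text omitted.
raw = '{"thought":"(omitted)","POINT":[729,69]}'
action = json.loads(raw)

# The 1080x2400 screen size is an assumption for illustration.
pixel_point = to_pixels(action["POINT"], width=1080, height=2400)
print(pixel_point)  # (787, 166)
```

Because the coordinates are normalized, the result does not depend on how the screenshot was resized before inference.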

vLLM inference example

# Start the vLLM server
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code

import base64 
import io 
import json 
import requests 
from PIL import Image 

END_POINT = "http://localhost:8000/v1/chat/completions"  
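# Load the action schema and mark "thought" as required so every output includes the model's reasoning.
with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
    ACTION_SCHEMA = json.load(f)
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))
ACTION_SCHEMA = dict(items)
# The system prompt below is in Chinese: it states the role, the task, the compact-JSON output rule, and the schema.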
SYSTEM_PROMPT = f'''# 角色：你是一名熟悉安卓系统触屏GUI操作的智能体，将根据用户的问题，分析当前界面的GUI元素和布局，生成相应的操作。

# 任务：针对用户问题，根据输入的当前屏幕截图，输出下一步的操作。

# 规则：以紧凑JSON格式输出，输出操作必须遵循Schema约束

# Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}''' 

def encode_image(image: Image.Image) -> str:
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    resolution = origin_img.size
    w,h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w,h),resample=Image.Resampling.LANCZOS)
    return img 

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图："},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {
        "Content-Type": "application/json",
    }

    response = requests.post(END_POINT, headers=headers, json=payload)
    assistant_msg = response.json()["choices"][0]["message"]["content"]
    return assistant_msg 

image = __resize__(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the 'Membership' button on the screen"
response = predict(instruction, image)
print(response)
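Both examples cap each side of the screenshot at 1120 pixels through `__resize__`. The sketch below repeats only the dimension arithmetic, without loading an image, to show what sizes come out; `resized_dims` is a hypothetical helper and the input resolutions are illustrative.

```python
def resized_dims(w, h, max_line=1120):
    """Return the (width, height) that __resize__ would produce for a w x h image."""
    if h > max_line:
        w = int(w * max_line / h)
        h = max_line
    if w > max_line:
        h = int(h * max_line / w)
        w = max_line
    return w, h

print(resized_dims(1080, 2400))  # (504, 1120): portrait screenshot, height capped
print(resized_dims(2400, 1080))  # (1120, 504): landscape screenshot, width capped
print(resized_dims(720, 1000))   # (720, 1000): already within the limit, unchanged
```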

AgentCPM-GUI action space

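At each step the agent outputs a single JSON object containing:

• One (and only one) primitive action, chosen from the list below.

• Optional modifiers (duration, thought) and/or a task-level flag (STATUS).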

Action	Required fields	Optional fields	Purpose	Example
Click	`POINT:[x,y]`	`duration`, `thought`, `STATUS`	Single tap at a normalized screen coordinate (0–1000, origin = top-left)	`{"POINT":[480,320]}`
Long Press	`POINT:[x,y]`, `duration:1000`	`duration`, `thought`, `STATUS`	Touch and hold at the coordinate (set a longer duration, e.g. > 200 ms)	`{"POINT":[480,320],"duration":1000}`
Swipe	`POINT:[x,y]`, `to:"up"|"down"|"left"|"right"` or `to:[x,y]`
Press key	`PRESS:"HOME"|"BACK"|"ENTER"`	`duration`, `thought`, `STATUS`
Type text	`TYPE:"<text>"`	`duration`, `thought`, `STATUS`	Insert the given text at the current input focus	`{"TYPE":"Hello, world!"}`
Wait	`duration`	`thought`, `STATUS`	Idle for the specified time without performing any other action	`{"duration":500}`
Task-level status	`STATUS:"start"|"continue"|"finish"|"satisfied"
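To make the action space concrete, here is a minimal sketch of how one output object could be translated into arguments for Android's `adb shell input` command. This is not the project's own executor. The helper name `to_adb_args`, the 300-unit swipe distance, the reading of `to:"up"` as "the finger moves up", and the 200 ms long-press threshold are all assumptions made for illustration.

```python
# PRESS values mapped to Android key codes.
KEYCODES = {"HOME": "KEYCODE_HOME", "BACK": "KEYCODE_BACK", "ENTER": "KEYCODE_ENTER"}
# Assumed offsets (in 0-1000 units) for direction swipes; no distance is specified in the table.
DIRECTIONS = {"up": (0, -300), "down": (0, 300), "left": (-300, 0), "right": (300, 0)}
LONG_PRESS_MS = 200  # assumed threshold, following the "> 200 ms" hint in the table


def to_adb_args(action, width, height):
    """Translate one action into `adb shell input ...` arguments (nothing is executed)."""

    def px(point):
        # Clamp to the 0-1000 range, then scale to pixel coordinates.
        x = min(max(point[0], 0), 1000)
        y = min(max(point[1], 0), 1000)
        return str(round(x * width / 1000)), str(round(y * height / 1000))

    if "POINT" in action:
        x1, y1 = px(action["POINT"])
        duration = action.get("duration", 0)
        if "to" in action:  # Swipe toward a direction or another coordinate
            to = action["to"]
            if isinstance(to, str):
                dx, dy = DIRECTIONS[to]
                to = [action["POINT"][0] + dx, action["POINT"][1] + dy]
            x2, y2 = px(to)
            args = ["input", "swipe", x1, y1, x2, y2]
            return args + [str(duration)] if duration else args
        if duration > LONG_PRESS_MS:  # Long press: a swipe that stays in place
            return ["input", "swipe", x1, y1, x1, y1, str(duration)]
        return ["input", "tap", x1, y1]  # Click
    if "PRESS" in action:
        return ["input", "keyevent", KEYCODES[action["PRESS"]]]
    if "TYPE" in action:
        # `adb shell input text` expects spaces written as %s.
        return ["input", "text", action["TYPE"].replace(" ", "%s")]
    return None  # Wait or status-only step: nothing to send to the device


print(to_adb_args({"POINT": [480, 320]}, 1080, 2400))
# ['input', 'tap', '518', '768']
```

The returned list could then be passed to something like `subprocess.run(["adb", "shell", *args])`; a Wait step would instead sleep for `duration` milliseconds, and `STATUS` would be handled by the surrounding control loop.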

AgentCPM-GUI fine-tuning and evaluation

Fine-tuning: source code is provided for both SFT and RFT training.

Performance evaluation:

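Grounding benchmark: across the fun2point, text2point, and bbox2text tasks, AgentCPM-GUI-8B reaches an average score of 71.3, the highest among the models compared; the next-best average is 45.8 (Aguvis-7B).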

Model	fun2point	text2point	bbox2text	average
AgentCPM-GUI-8B	79.1	76.5	58.2	71.3
Qwen2.5-VL-7B	36.8	52.0	44.1	44.3
Intern2.5-VL-8B	17.2	24.2	45.9	29.1
Intern2.5-VL-26B	14.8	16.6	36.3	22.6
OS-Genesis-7B	8.3	5.8	4.0	6.0
UI-TARS-7B	56.8	66.7	1.4	41.6
OS-Atlas-7B	53.6	60.7	0.4	38.2
Aguvis-7B	60.8	76.5	0.2	45.8
GPT-4o	22.1	19.9	14.3	18.8
GPT-4o with Grounding	44.3	44.0	14.3	44.2
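The average column appears to be the arithmetic mean of the three task scores, rounded to one decimal place. The short check below reproduces it for three rows of the table; treating the average as a plain mean is an inference from the numbers, not something the source states.

```python
# (fun2point, text2point, bbox2text) scores copied from the table above.
scores = {
    "AgentCPM-GUI-8B": (79.1, 76.5, 58.2),
    "Qwen2.5-VL-7B": (36.8, 52.0, 44.1),
    "Aguvis-7B": (60.8, 76.5, 0.2),
}

# Mean of the three task scores, rounded to one decimal place.
averages = {model: round(sum(s) / len(s), 1) for model, s in scores.items()}
print(averages)
# {'AgentCPM-GUI-8B': 71.3, 'Qwen2.5-VL-7B': 44.3, 'Aguvis-7B': 45.8}
```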

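Agent benchmark: AgentCPM-GUI-8B posts the top score in every column except GUI-Odyssey, where OS-Atlas-7B is slightly higher. The gap is widest on the Chinese app (CAGUI) dataset, where its TM and EM scores reach 96.86 and 91.28.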

Model	Android Control-Low TM	Android Control-Low EM	Android Control-High TM	Android Control-High EM	GUI-Odyssey TM	GUI-Odyssey EM	AITZ TM	AITZ EM	Chinese APP (CAGUI) TM	Chinese APP (CAGUI) EM
AgentCPM-GUI-8B	94.39	90.20	77.70	69.17	90.85	74.96	85.71	76.38	96.86	91.28
Qwen2.5-VL-7B	92.11	82.12	69.65	57.36	55.33	40.90	73.16	57.58	68.53	48.80
UI-TARS-7B	93.52	88.89	68.53	60.81	78.79	57.33	71.74	55.31	71.01	53.92
OS-Genesis-7B	90.74	74.22	65.92	44.43	11.67	3.63	19.98	8.45	38.10	14.50
OS-Atlas-7B	73.03	67.25	70.36	56.53	91.83*	76.76*	74.13	58.45	81.53	55.89
Aguvis-7B	93.85	89.40	65.56	54.18	26.71	13.54	35.71	18.99	67.43	38.20
OdysseyAgent-7B	65.10	39.16	58.80	32.74	90.83	73.67	59.17	31.60	67.56	25.44
GPT-4o	-	19.49	-	20.80	-	20.39	70.00	35.30	3.67	3.67
Gemini 2.0	-	28.50	-	60.20	-	3.27	-	-	-	-
Claude	-	19.40	-	12.50	60.90	-	-	-	-	-