AgentCPM-GUI: A Local LLM Agent for Navigating Chinese Mobile Apps

May 18. Published in LLM Tooling

AgentCPM-GUI is a localized GUI agent developed through a collaboration between THUNLP and ModelBest. It utilizes an 8-billion-parameter MiniCPM-V model that runs entirely on-device. By analyzing a smartphone screenshot alongside a specific user task, the model determines the precise sequence of taps or swipes required. It is particularly effective at navigating the unique interfaces of Chinese mobile applications.

Robust GUI Grounding: Pretrained on an extensive bilingual Android dataset, the model identifies buttons, input fields, labels, and icons with high accuracy.

Optimized for Chinese Apps: As the first open-source GUI agent fine-tuned specifically for the Chinese market, it supports over 30 popular applications, including Gaode Maps, Dianping, Bilibili, and Xiaohongshu.

Advanced Planning: The integration of reinforcement fine-tuning introduces a "reasoning" step before each action is executed. This internal thought process significantly improves the completion rate for complex, multi-step tasks.

Streamlined Execution: An optimized action space and clean JSON structure keep the average output to just 9.7 tokens per step, providing the speed necessary for edge deployment.

Usage Manual

Install Dependencies

git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt

Download the Model

Download the AgentCPM-GUI weights from Hugging Face and place them in the model/AgentCPM-GUI directory.

Inference via Hugging Face

import torch 
from transformers import AutoTokenizer, AutoModelForCausalLM 
from PIL import Image 
import json 

# 1. Load model and tokenizer
model_path = "model/AgentCPM-GUI"  
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0") 

# 2. Define the instruction and image
instruction = "请点击屏幕上的'会员'按钮" 
image_path = "assets/test.jpeg" 
image = Image.open(image_path).convert("RGB")

# 3. Resize image to fit model constraints
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img 
image = __resize__(image)

# 4. Set message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\n当前屏幕截图:",
        image
    ]
}]

# 5. Configure inference settings and system prompt
# Load the action schema and mark "thought" as a required field
with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
    ACTION_SCHEMA = json.load(f)
items = list(ACTION_SCHEMA.items())
items.insert(3, ("required", ["thought"]))
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# 角色:你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# 任务:针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# 规则:以紧凑JSON格式输出,输出操作必须遵循Schema约束

# Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}''' 

# 6. Run inference
outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

print(outputs)

Expected output:

{"thought":"任务目标是点击屏幕上的'会员'按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击'会员'按钮可以访问应用的会员相关内容。","POINT":[729,69]}
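The POINT values in this output are not pixels. According to the action space table, they are normalized to a 0–1000 range, so they have to be scaled to the device resolution before a tap can be issued. The following is a minimal sketch of that conversion. The helper names `parse_action` and `to_pixels` and the 1080×2400 screen size are assumptions made for illustration and are not part of the AgentCPM-GUI repository.

```python
import json


def parse_action(raw_output: str) -> dict:
    """Parse the model's compact JSON reply into a dictionary."""
    return json.loads(raw_output)


def to_pixels(point, screen_width, screen_height, scale=1000):
    """Map a normalized [x, y] point (0..scale) to pixel coordinates.

    The result is clamped so that it always lies inside the screen.
    """
    x, y = point
    px = round(x / scale * screen_width)
    py = round(y / scale * screen_height)
    return min(px, screen_width - 1), min(py, screen_height - 1)


# Hypothetical reply with the same shape as the expected output above
raw = '{"thought":"...","POINT":[729,69]}'
action = parse_action(raw)

# 1080x2400 is an assumed device resolution, not a value from the project
print(to_pixels(action["POINT"], screen_width=1080, screen_height=2400))  # (787, 166)
```

Because the coordinates are relative, the resize step applied before inference does not need to be undone. Only the real screen size matters.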

Inference via vLLM

Start the vLLM server:

vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code

Then query its OpenAI-compatible endpoint from Python:

import base64 
import io 
import json 
import requests 
from PIL import Image 

END_POINT = "http://localhost:8000/v1/chat/completions"  
# Load the action schema and mark "thought" as a required field
with open('eval/utils/schema/schema.json', encoding="utf-8") as f:
    ACTION_SCHEMA = json.load(f)
items = list(ACTION_SCHEMA.items())
items.insert(3, ("required", ["thought"]))
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# 角色:你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# 任务:针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# 规则:以紧凑JSON格式输出,输出操作必须遵循Schema约束

# Schema {json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}''' 

def encode_image(image: Image.Image) -> str:
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img 

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图:"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {"Content-Type": "application/json"}
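    # Fail fast on network or HTTP errors instead of raising a KeyError below
    response = requests.post(END_POINT, headers=headers, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]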

image = __resize__(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的'会员'按钮" 
print(predict(instruction, image))
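The script above predicts a single step. A complete task usually needs several, so one possible way to chain them is a loop that captures a screenshot, asks the model for the next action, executes it, and stops once the model reports STATUS "finish". The sketch below is an assumption about how such a loop could look, not code from the repository. `run_task`, `capture`, `execute`, and `max_steps` are hypothetical names, and the three callables are injected so the loop can be tried without a device.

```python
import json


def run_task(instruction, capture, predict, execute, max_steps=20):
    """Repeat capture -> predict -> execute until the model reports "finish".

    capture():                 returns the current screenshot
    predict(instruction, img): returns the model's JSON reply as a string
    execute(action):           performs the parsed action on the device
    """
    history = []
    for _ in range(max_steps):
        screenshot = capture()
        action = json.loads(predict(instruction, screenshot))
        history.append(action)

        # Execute only when the reply carries more than thought/STATUS
        if any(key not in ("thought", "STATUS") for key in action):
            execute(action)

        # Assumption: STATUS "finish" marks the end of the task
        if action.get("STATUS") == "finish":
            break
    return history


# Dry run with canned replies instead of a real model and device
replies = iter(['{"POINT":[480,320]}', '{"TYPE":"Hello"}', '{"STATUS":"finish"}'])
executed = []
steps = run_task(
    "demo task",
    capture=lambda: None,
    predict=lambda text, image: next(replies),
    execute=executed.append,
)
print(len(steps), len(executed))  # 3 2
```

With a real setup, `predict` could be the function defined above, and `capture` would return a screenshot that has already been passed through `__resize__`. The `max_steps` cap keeps the loop from running forever if the model never reports completion.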

Action Space

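Each step returns a compact JSON object that contains one primary action plus optional fields such as duration, thought, and STATUS. Coordinates are normalized to a 0–1000 range. Note that the inference scripts above mark thought as required in the schema they pass to the model.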

| Action | Required Fields | Optional Fields | Purpose | Example |
| --- | --- | --- | --- | --- |
| Click | POINT:[x,y] | duration, thought, STATUS | Single tap at normalized 0–1000 coordinates | {"POINT":[480,320]} |
| Long Press | POINT:[x,y], duration:1000 | thought, STATUS | Hold at specific coordinates | {"POINT":[480,320],"duration":1000} |
| Swipe | POINT:[x,y], to:"up"/"down"/"left"/"right" or to:[x,y] | duration, thought, STATUS | Directional or coordinate-based swipe | {"POINT":[500,500],"to":"up"} |
| Press Key | PRESS:"HOME"/"BACK"/"ENTER" | duration, thought, STATUS | System hardware/navigation key press | {"PRESS":"BACK"} |
| Type Text | TYPE:"<text>" | duration, thought, STATUS | Input text at the current focus point | {"TYPE":"Hello"} |
| Wait | duration | thought, STATUS | Pause execution for the given number of milliseconds | {"duration":500} |
| Task Status | STATUS:"start"/"continue"/"finish"/"satisfied" | thought | Define the current phase of the task | {"STATUS":"finish"} |
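To act on a device, each action has to be translated into an input event. One common option on Android is `adb shell input`, and the sketch below maps the actions in the table to its arguments. This is an illustrative assumption rather than the project's own executor. `to_adb_args`, the 30% swipe distance, the 300 ms default swipe duration, and the reading of "up" as "finger moves upward" are all hypothetical choices.

```python
KEYCODES = {"HOME": "KEYCODE_HOME", "BACK": "KEYCODE_BACK", "ENTER": "KEYCODE_ENTER"}
DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def to_adb_args(action, width, height, swipe_fraction=0.3):
    """Return the arguments to pass after `adb shell`, or None if nothing is sent."""

    def clamp(value, upper):
        # Keep the coordinate inside the screen
        return max(0, min(round(value), upper - 1))

    def to_px(point):
        # Scale normalized 0-1000 coordinates to pixels
        return clamp(point[0] / 1000 * width, width), clamp(point[1] / 1000 * height, height)

    duration = action.get("duration")

    if "POINT" in action:
        x, y = to_px(action["POINT"])
        if "to" in action:
            target = action["to"]
            if isinstance(target, str):
                # Assumed semantics: the finger moves in the named direction
                dx, dy = DIRECTIONS[target]
                x2 = clamp(x + dx * swipe_fraction * width, width)
                y2 = clamp(y + dy * swipe_fraction * height, height)
            else:
                x2, y2 = to_px(target)
            args = ["input", "swipe", x, y, x2, y2, duration or 300]
        elif duration:
            # A swipe that stays in place acts as a long press
            args = ["input", "swipe", x, y, x, y, duration]
        else:
            args = ["input", "tap", x, y]
    elif "PRESS" in action:
        args = ["input", "keyevent", KEYCODES[action["PRESS"]]]
    elif "TYPE" in action:
        # Note: `input text` may not handle spaces or non-ASCII text reliably
        args = ["input", "text", action["TYPE"]]
    else:
        # Wait and status-only actions need no input event
        return None

    return [str(arg) for arg in args]


# 1080x2400 is an assumed device resolution
print(to_adb_args({"POINT": [480, 320]}, 1080, 2400))
print(to_adb_args({"PRESS": "BACK"}, 1080, 2400))
```

The returned list can be appended to `["adb", "shell"]` and passed to `subprocess.run`. A Wait action returns None here, so the caller would handle it with a sleep of `duration` milliseconds.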

Fine-Tuning and Benchmarks

Fine-Tuning: The project provides source code for both Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT).

Grounding Benchmarks: AgentCPM-GUI-8B achieved an average score of 71.3 across the fun2point, text2point, and bbox2text tests. It outperforms the other 7B–8B models in the table on all three tests and also scores well above the proprietary GPT-4o (18.8 on average).

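| Model | fun2point | text2point | bbox2text | Average |
| --- | --- | --- | --- | --- |
| AgentCPM-GUI-8B | 79.1 | 76.5 | 58.2 | 71.3 |
| Qwen2.5-VL-7B | 36.8 | 52.0 | 44.1 | 44.3 |
| Intern2.5-VL-8B | 17.2 | 24.2 | 45.9 | 29.1 |
| UI-TARS-7B | 56.8 | 66.7 | 1.4 | 41.6 |
| GPT-4o | 22.1 | 19.9 | 14.3 | 18.8 |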
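The Average column appears to be the unweighted mean of the three task scores, rounded to one decimal place. The short check below recomputes it from the figures in the table. The variable names are illustrative, and the averaging rule is inferred from the numbers rather than stated by the source.

```python
# Scores copied from the grounding table: (fun2point, text2point, bbox2text)
scores = {
    "AgentCPM-GUI-8B": (79.1, 76.5, 58.2),
    "Qwen2.5-VL-7B": (36.8, 52.0, 44.1),
    "Intern2.5-VL-8B": (17.2, 24.2, 45.9),
    "UI-TARS-7B": (56.8, 66.7, 1.4),
    "GPT-4o": (22.1, 19.9, 14.3),
}

# Unweighted mean per model, rounded to one decimal place
averages = {
    model: round(sum(values) / len(values), 1)
    for model, values in scores.items()
}

for model, average in averages.items():
    print(f"{model}: {average}")
```

The recomputed values match the table (71.3, 44.3, 29.1, 41.6, and 18.8), so each of the three tests carries equal weight in the reported average.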

Agent Benchmarks: When tested on the CAGUI dataset (focused on Chinese apps), the model reached 96.86% Type Match (TM) and 91.28% Exact Match (EM). The closest open-source model in the comparison, OS-Atlas-7B, reached 81.53% TM and 55.89% EM.

| Model | CAGUI TM | CAGUI EM |
| --- | --- | --- |
| AgentCPM-GUI-8B | 96.86 | 91.28 |
| Qwen2.5-VL-7B | 68.53 | 48.80 |
| UI-TARS-7B | 71.01 | 53.92 |
| OS-Atlas-7B | 81.53 | 55.89 |

Dataset

CAGUI is the companion benchmark for evaluating Chinese app control. It includes comprehensive grounding and agent-specific tasks and is available for download on Hugging Face.