LiveMCPBench evaluates the performance of AI agents when tasked with solving real-world problems using an extensive library of available tools. The project consists of three primary components: a reference agent named MCP Copilot, an evaluation suite called LiveMCPEval, and LiveMCPTool—a comprehensive toolbox containing over 290 pre-integrated utilities. By utilizing labeled task data, researchers can benchmark models such as GLM 4.5, GPT-5-Mini, and Kimi-K2 under identical, standardized conditions. Docker images are provided to ensure the setup process remains straightforward.
This benchmark is distinguished by its task variety, the breadth of its toolset, and the depth of its evaluation metrics.
Tasks are divided into six categories, each requiring an agent to coordinate between one and three different tools.
These scenarios are derived from actual user needs. They are designed to measure tool selection proficiency and workflow planning rather than simple fact retrieval.
The toolbox is stable, generic, and designed for easy expansion. It currently features 290 tools categorized into three groups:
Every tool adheres to the same interface. Agents interact with them through a standardized call pattern, eliminating the need for custom glue code.
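As a rough illustration of what a single call pattern buys you, the sketch below registers two toy tools behind one `call(name, arguments)` entry point. The `ToolRegistry` class and both tools are hypothetical and are not taken from the LiveMCPBench code; the name-plus-arguments shape is only meant to resemble how MCP-style tool calls are typically structured.

```python
from typing import Any, Callable, Dict

# Hypothetical convention: every tool takes keyword arguments and returns a string.
ToolFn = Callable[..., str]


class ToolRegistry:
    """Toy registry (not LiveMCPBench code) exposing one uniform call pattern."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn

    def call(self, name: str, arguments: Dict[str, Any]) -> str:
        # The agent only ever supplies a tool name and a dict of arguments.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**arguments)


def word_count(text: str) -> str:
    return str(len(text.split()))


def to_upper(text: str) -> str:
    return text.upper()


registry = ToolRegistry()
registry.register("word_count", word_count)
registry.register("to_upper", to_upper)

print(registry.call("word_count", {"text": "one two three"}))
print(registry.call("to_upper", {"text": "mcp"}))
```

Because the agent never needs to know how a tool is implemented, adding a new tool in this sketch is a single `register` call.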
The evaluation process utilizes an "LLM-as-a-Judge" model. A dedicated Judge Agent analyzes each task execution by focusing on three specific criteria:
The Judge generates a detailed report based on the agent's tool call trace, the queries generated, and the available tool list. This allows developers to see precisely where an agent struggled or failed.
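To make the Judge's inputs concrete, here is a minimal sketch that bundles a task, the agent's queries, its tool call trace, and the available tool list into one JSON payload. All field names and the `build_judge_payload` helper are assumptions made for illustration; the actual LiveMCPEval input format may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One step of a hypothetical tool call trace."""
    tool: str
    arguments: Dict[str, Any]
    output: str


def build_judge_payload(
    task: str,
    queries: List[str],
    trace: List[ToolCall],
    available_tools: List[str],
) -> str:
    # Collect everything a judge model would need to review a run.
    payload = {
        "task": task,
        "queries": queries,
        "available_tools": sorted(available_tools),
        "trace": [asdict(call) for call in trace],
    }
    return json.dumps(payload, indent=2)


example = build_judge_payload(
    task="Summarize today's weather",
    queries=["weather lookup tool"],
    trace=[ToolCall("get_weather", {"city": "Berlin"}, "12C, cloudy")],
    available_tools=["get_weather", "send_email"],
)
print(example)
```

Keeping the trace as structured data rather than free text would let a judge prompt, or a human reviewer, point at the exact step where a run went wrong.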
LiveMCPBench can be deployed using Docker or installed locally on your machine.
Local installation requires:
Pull the Docker image:
docker pull hysdhlx/livemcpbench:latest
Clone the repository:
git clone https://github.com/icip-cas/LiveMCPBench.git
cd LiveMCPBench
Launch the container with GPU support:
docker run -itd \
-v "$(pwd):/outside" \
--gpus all \
--ipc=host \
--net=host \
--name LiveMCPBench_container \
hysdhlx/livemcpbench:latest \
bash
Access the container and reset the environment:
docker exec -it LiveMCPBench_container bash
cd /LiveMCPBench/
bash scripts/env_reset.sh
This command copies your local code into the container and links the necessary labeled data folders.
Configure the environment variables:
cp .env_template .env
Edit the .env file to set:
BASE_URL, OPENAI_API_KEY, MODEL
EMBEDDING_MODEL, EMBEDDING_API_KEY, TOP_TOOLS
http_proxy, https_proxy
Verify that tools are accessible:
bash ./tools/scripts/tool_check.sh
Review the results in ./tools/test/tools.json. If any tools fail, run the script again.
Generate the resource index:
uv run -m baseline.mcp_copilot.arg_generation
The agent requires this index to locate and utilize tools efficiently.
Perform a test run to confirm the system is operational:
bash ./baseline/scripts/run_example.sh
Results, including tool call traces and task outputs, will be saved in ./baseline/output/.
Ensure all variables in .env are correctly configured.
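A quick pre-flight check can catch an empty variable before a long run. The sketch below parses simple KEY=VALUE lines and reports which names are unset. The `parse_env` and `missing_vars` helpers are hypothetical, and treating the six non-proxy variables as required (with the proxy settings optional) is an assumption rather than something the project specifies.

```python
import os
from typing import Dict, List, Mapping, Sequence

# Assumption: these are required, while http_proxy / https_proxy are optional.
REQUIRED = [
    "BASE_URL",
    "OPENAI_API_KEY",
    "MODEL",
    "EMBEDDING_MODEL",
    "EMBEDDING_API_KEY",
    "TOP_TOOLS",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env: Mapping[str, str], required: Sequence[str] = REQUIRED) -> List[str]:
    """Return required names that are absent or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as fh:
            missing = missing_vars(parse_env(fh.read()))
        print("missing:", ", ".join(missing) if missing else "none")
```

This handles only plain assignments; it does not implement the quoting or variable expansion rules of a full dotenv parser.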
Execute the complete benchmark suite:
bash ./baseline/scripts/run_baselines.sh
Note: The agent reads data files from the /root directory; adjust these paths as necessary for local execution.
Review the outputs in ./baseline/output to analyze tool selection strategies for various tasks.
In the .env file, set the MODEL variable to the LLM that will serve as the judge.
Run the evaluation script:
bash ./evaluator/scripts/run_baseline.sh
Review the evaluation results in ./evaluator/output.
Compute the success rate:
uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
The script will provide a final success percentage for the evaluated run.
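The arithmetic behind a success percentage is simple: successful tasks divided by total tasks. The sketch below shows that calculation on a list of pass/fail verdicts. It is not the logic of stat_success_rate.py, whose input format is not described here; the `success_rate` function and the sample verdicts are illustrative only.

```python
from typing import Iterable


def success_rate(verdicts: Iterable[bool]) -> float:
    """Percentage of tasks judged successful; 0.0 for an empty run."""
    results = list(verdicts)
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Four hypothetical tasks, three judged successful.
print(f"{success_rate([True, False, True, True]):.1f}%")  # prints 75.0%
```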