LiveMCPBench: Benchmark AI Agents on Real-World MCP Tool Tasks

Published September 4 in MCP Services

LiveMCPBench evaluates how well AI agents solve real-world problems when given a large library of tools. The project has three components: MCP Copilot, a reference agent; LiveMCPEval, an evaluation suite; and LiveMCPTool, a toolbox of 290 pre-integrated tools. Using labeled task data, researchers can benchmark models such as GLM 4.5, GPT-5-Mini, and Kimi-K2 under identical conditions. Docker images are provided to simplify setup.

This benchmark is distinguished by its task variety, the breadth of its toolset, and the depth of its evaluation metrics.

1. Tasks That Mirror Daily Life

Tasks fall into six categories, and each task requires the agent to use one to three different tools.

  • Office (33%) — Generating weekly reports, organizing spreadsheets, and managing document workflows.
  • Lifestyle (16%) — Managing calendars, locating local services, and setting reminders.
  • Leisure (15%) — Recommending movies and designing travel itineraries.
  • Finance (14%) — Tracking expenses and checking real-time exchange rates.
  • Travel (13%) — Searching for tickets and booking accommodations.
  • Shopping (9%) — Comparing prices and tracking orders.

These scenarios are derived from actual user needs. They are designed to measure tool selection proficiency and workflow planning rather than simple fact retrieval.

2. LiveMCPTool: Plug-and-Play Utilities

The toolbox is stable, generic, and designed for easy expansion. It currently features 290 tools categorized into three groups:

  • Discovery (124 tools) — Search engines and knowledge base queries for information retrieval.
  • Visualization (85 tools) — Tools for creating charts, graphs, and dashboards to make data more readable.
  • File Access (81 tools) — Utilities to read and write documents, spreadsheets, and images.

Every tool adheres to the same interface. Agents interact with them through a standardized call pattern, eliminating the need for custom glue code.
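
As an illustration of what a standardized call pattern looks like, the sketch below builds requests in the JSON-RPC 2.0 shape that the Model Context Protocol defines for its tools/call method. The tool names and arguments (web_search, bar_chart) are hypothetical placeholders, not entries from LiveMCPTool, and the helper build_tool_call is an assumption made for this example.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to be identifiable.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Two unrelated (hypothetical) tools are invoked with the same request shape;
# only the name and the arguments differ.
search_request = build_tool_call("web_search", {"query": "USD to EUR rate"})
chart_request = build_tool_call(
    "bar_chart", {"labels": ["Q1", "Q2"], "values": [3, 5]}
)

if __name__ == "__main__":
    print(json.dumps(search_request, indent=2))
    print(json.dumps(chart_request, indent=2))
```

Because every request shares this envelope, an agent only needs to choose a tool name and fill in its arguments; no per-tool client code is involved.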

3. LiveMCPEval: LLM-as-a-Judge

Evaluation follows an "LLM-as-a-Judge" approach. A dedicated Judge Agent analyzes each task execution against three criteria:

  • Tool Selection Accuracy — Did the agent identify and use the correct set of tools?
  • Execution Logic — Was the call sequence logical and the parameters accurate?
  • Task Completion — Does the final output successfully fulfill the stated goal?

The Judge generates a detailed report based on the agent's tool call trace, the queries generated, and the available tool list. This allows developers to see precisely where an agent struggled or failed.
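
To make the judge's inputs concrete, here is a minimal sketch of how a task, a tool call trace, and the available tool list could be assembled into one prompt for a judge model. The data classes, field names, prompt wording, and the sample tool are assumptions for illustration; the actual LiveMCPEval prompt and trace format may differ.

```python
from dataclasses import dataclass, field

# The three criteria named in this article.
CRITERIA = ("Tool Selection Accuracy", "Execution Logic", "Task Completion")


@dataclass
class ToolCall:
    tool: str
    arguments: dict
    result: str


@dataclass
class Trajectory:
    task: str
    calls: list = field(default_factory=list)
    final_answer: str = ""


def build_judge_prompt(trajectory: Trajectory, available_tools: dict) -> str:
    """Assemble the judge's inputs into a single prompt string."""
    tool_lines = [f"- {name}: {desc}" for name, desc in available_tools.items()]
    call_lines = [
        f"{i}. {c.tool}({c.arguments}) -> {c.result}"
        for i, c in enumerate(trajectory.calls, start=1)
    ]
    criteria_lines = [f"- {c}" for c in CRITERIA]
    sections = [
        "Task:\n" + trajectory.task,
        "Available tools:\n" + "\n".join(tool_lines),
        "Tool call trace:\n" + "\n".join(call_lines),
        "Final answer:\n" + trajectory.final_answer,
        "Assess each criterion, then give a verdict:\n" + "\n".join(criteria_lines),
    ]
    return "\n\n".join(sections)


# Hypothetical example data.
example = Trajectory(
    task="Convert 100 USD to EUR at today's rate.",
    calls=[ToolCall("get_exchange_rate", {"base": "USD", "quote": "EUR"}, "0.9")],
    final_answer="100 USD is about 90 EUR.",
)
prompt = build_judge_prompt(
    example, {"get_exchange_rate": "Look up a currency exchange rate."}
)

if __name__ == "__main__":
    print(prompt)
```

The resulting string would be sent to whichever model is configured as the judge; that call is omitted here because it depends on the provider.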

Deploying LiveMCPBench

LiveMCPBench can be deployed using Docker or installed locally on your machine.

Local installation requires:

  • npm — For frontend dependency management.
  • uv — For high-speed Python package handling.

Docker Deployment

  1. Pull the Docker image:

    docker pull hysdhlx/livemcpbench:latest
    
  2. Clone the repository:

    git clone https://github.com/icip-cas/LiveMCPBench.git
    cd LiveMCPBench
    
  3. Launch the container with GPU support:

    docker run -itd \
    -v "$(pwd):/outside" \
    --gpus all \
    --ipc=host \
    --net=host \
    --name LiveMCPBench_container \
    hysdhlx/livemcpbench:latest \
    bash
    
  4. Access the container and reset the environment:

    docker exec -it LiveMCPBench_container bash
    cd /LiveMCPBench/
    bash scripts/env_reset.sh
    

    The env_reset.sh script copies your local code into the container and links the necessary labeled data folders.

Local Installation

  1. Configure the environment variables:

    cp .env_template .env
    

    Edit the .env file to set:

    • Agent settings: BASE_URL, OPENAI_API_KEY, MODEL
    • Tool retrieval: EMBEDDING_MODEL, EMBEDDING_API_KEY, TOP_TOOLS
    • Proxy settings (if needed): http_proxy, https_proxy

  2. Verify that tools are accessible:

    bash ./tools/scripts/tool_check.sh
    

    Review the results in ./tools/test/tools.json. If any tools fail, run the script again.

  3. Generate the resource index:

    uv run -m baseline.mcp_copilot.arg_generation
    

    The agent requires this index to locate and utilize tools efficiently.
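
The EMBEDDING_MODEL and TOP_TOOLS settings suggest that tools are retrieved by embedding similarity. The sketch below shows that general technique, top-k selection by cosine similarity, on hand-written toy vectors. It is not MCP Copilot's actual retrieval code: the tool names, the vectors, and the function names are all assumptions, and a real index would hold vectors produced by the configured embedding model.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_tools(query_vec: list, index: dict, k: int) -> list:
    """Return the names of the k tools whose vectors best match the query."""
    ranked = sorted(
        index, key=lambda name: cosine(query_vec, index[name]), reverse=True
    )
    return ranked[:k]


# Toy index: hypothetical tool names mapped to made-up 3-dimensional vectors.
TOOL_INDEX = {
    "web_search": [0.9, 0.1, 0.0],
    "bar_chart": [0.1, 0.9, 0.1],
    "read_spreadsheet": [0.0, 0.2, 0.9],
}

# Made-up embedding of a query such as "open the expenses sheet".
query = [0.1, 0.3, 0.8]
selected = top_tools(query, TOOL_INDEX, k=2)  # k plays the role of TOP_TOOLS

if __name__ == "__main__":
    print(selected)  # ['read_spreadsheet', 'bar_chart']
```

Limiting the candidate set to the top k tools keeps the agent's prompt small even when the full toolbox contains hundreds of entries.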

Running Agents and Evaluation

Quick Smoke Test

Perform a test run to confirm the system is operational:

bash ./baseline/scripts/run_example.sh

Results, including tool call traces and task outputs, will be saved in ./baseline/output/.

Full Benchmark Run

  1. Ensure all variables in .env are correctly configured.

  2. Execute the complete benchmark suite:

    bash ./baseline/scripts/run_baselines.sh
    

    Note: The agent reads data files from the /root directory; adjust these paths as necessary for local execution.

  3. Review the outputs in ./baseline/output to analyze tool selection strategies for various tasks.
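
Since a full run depends on every .env variable being set, a small pre-flight check can catch gaps before a long benchmark starts. The sketch below is not part of LiveMCPBench; it uses a deliberately simplified parser (plain KEY=VALUE lines only) and the variable names listed in the configuration step. The sample values are placeholders.

```python
# Variable names taken from the configuration step in this article.
REQUIRED_KEYS = (
    "BASE_URL",
    "OPENAI_API_KEY",
    "MODEL",
    "EMBEDDING_MODEL",
    "EMBEDDING_API_KEY",
    "TOP_TOOLS",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values: dict) -> list:
    """Return required keys that are absent or set to an empty value."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Placeholder content standing in for a real .env file.
SAMPLE = """\
# agent settings
BASE_URL=https://example.invalid/v1
OPENAI_API_KEY=placeholder-key
MODEL=
EMBEDDING_MODEL=placeholder-embedding
TOP_TOOLS=5
"""

missing = missing_keys(parse_env(SAMPLE))

if __name__ == "__main__":
    print(missing)  # ['MODEL', 'EMBEDDING_API_KEY']
```

To check a real file, pass its contents instead of SAMPLE, for example with pathlib.Path(".env").read_text().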

Running the Evaluator

  1. In the .env file, set the MODEL variable to the LLM that will serve as the judge.
  2. Start the evaluation process:

    bash ./evaluator/scripts/run_baseline.sh
    
  3. Evaluation reports are saved to ./evaluator/output.
  4. To calculate the overall success rate, run:

    uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
    
    The script will provide a final success percentage for the evaluated run.
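
For intuition about what a success rate over judged tasks means, the sketch below computes an overall percentage and a per-category breakdown from a list of verdicts. This is not the logic of stat_success_rate.py: the record layout (task_id, category, success) and the sample verdicts are assumptions made for illustration.

```python
from collections import defaultdict


def success_rate(verdicts: list) -> float:
    """Percentage of verdicts marked successful (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    passed = sum(1 for v in verdicts if v["success"])
    return 100.0 * passed / len(verdicts)


def success_by_category(verdicts: list) -> dict:
    """Success rate computed separately for each task category."""
    groups = defaultdict(list)
    for verdict in verdicts:
        groups[verdict["category"]].append(verdict)
    return {category: success_rate(items) for category, items in groups.items()}


# Hypothetical judge verdicts, one record per task.
VERDICTS = [
    {"task_id": "t1", "category": "Office", "success": True},
    {"task_id": "t2", "category": "Office", "success": False},
    {"task_id": "t3", "category": "Finance", "success": True},
    {"task_id": "t4", "category": "Travel", "success": True},
]

overall = success_rate(VERDICTS)
per_category = success_by_category(VERDICTS)

if __name__ == "__main__":
    print(f"overall: {overall:.1f}%")  # overall: 75.0%
    for category, rate in per_category.items():
        print(f"{category}: {rate:.1f}%")
```

A per-category view can show whether failures cluster in one kind of task, which a single overall number hides.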