AgentFlow: Modular AI Agent Framework Outperforms GPT-4o

October 11 · Published in AI Agent Tools

AgentFlow is a trainable framework for AI agents designed to overcome the limitations of current tool-augmented reasoning methods. While traditional approaches often struggle with scalability and generalization, AgentFlow maintains high performance through a specialized, modular architecture.

The system is divided into four dedicated modules: Planner, Executor, Verifier, and Generator. These components coordinate across multiple turns, supported by a persistent memory system that tracks context and integrates tools directly into the reasoning loop.

AgentFlow also introduces Flow-based Group Refined Policy Optimization, or Flow-GRPO. This algorithm enables online training of the Planner module directly within the workflow. It is particularly effective for long-horizon tasks where feedback or rewards are delayed.

Out of the box, the framework connects to several essential tools, including basic text generation, Python code execution, Google Search, Wikipedia lookup, and general web search.

Using Qwen-2.5-7B-Instruct as its backbone, AgentFlow was evaluated on ten benchmarks. It reports average accuracy gains over top-performing baselines of 14.9% on search tasks, 14.0% on agent reasoning, 14.5% on mathematics, and 4.1% on scientific tasks. Notably, the 7B-parameter system outperforms GPT-4o, a proprietary model estimated at roughly 200 billion parameters.

How AgentFlow Differs

Mainstream approaches, such as Search-R1, typically train a single large language model to interleave reasoning steps with tool calls. AgentFlow moves away from this monolithic design in favor of a modular team where each component has a defined, specialized role.

| Module    | Core Function                           | Input                                        | Output                                     |
|-----------|-----------------------------------------|----------------------------------------------|--------------------------------------------|
| Planner   | Establishes sub-goals and tool strategy | Query analysis, global goal, required skills | Current sub-goal, chosen tool, tool context |
| Executor  | Executes the tool call                  | Sub-goal, chosen tool, tool metadata         | Generated command, execution result        |
| Verifier  | Validates the result                    | Generated command, execution result          | Memory analysis, verification status       |
| Generator | Formulates the final answer             | Generated command, execution result          | Final answer                               |
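
To make the data flow in the table above concrete, here is a minimal toy sketch of a four-module loop with shared memory. Every name in it (Memory, plan, execute, verify, generate, the calculator tool) is hypothetical and chosen for illustration; none of it is taken from the AgentFlow codebase, and the real modules are LLM-driven rather than hard-coded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are illustrative assumptions, not AgentFlow APIs.

@dataclass
class Memory:
    """Persistent record of every turn, shared by all modules."""
    turns: List[dict] = field(default_factory=list)

def plan(query: str, memory: Memory) -> dict:
    # Toy Planner: always chooses the calculator and passes the query through.
    return {"sub_goal": f"evaluate {query}", "tool": "calculator", "context": query}

def execute(step: dict, tools: Dict[str, Callable[[str], str]]) -> dict:
    # Executor: builds the command and runs the chosen tool.
    command = step["context"]
    return {"command": command, "result": tools[step["tool"]](command)}

def verify(execution: dict, memory: Memory) -> bool:
    # Verifier: records the turn, then decides whether the loop can stop.
    memory.turns.append(execution)
    return execution["result"] != ""

def generate(memory: Memory) -> str:
    # Generator: composes the final answer from the last recorded result.
    return memory.turns[-1]["result"]

def solve(query: str, tools: Dict[str, Callable[[str], str]], max_turns: int = 3) -> str:
    memory = Memory()
    for _ in range(max_turns):
        step = plan(query, memory)
        execution = execute(step, tools)
        if verify(execution, memory):
            break
    return generate(memory)

def calculator(expr: str) -> str:
    # Toy tool: adds integers separated by "+".
    return str(sum(int(part) for part in expr.split("+")))

answer = solve("2+3", {"calculator": calculator})
print(answer)  # prints 5
```

The point of the sketch is the separation of concerns: only the plan step decides what to do next, which is why it is the natural target for training.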

Key Features

Modular Agent System
Four specialized modules collaborate over multiple rounds of interaction. The system’s memory updates continuously, and new tools can be plugged in as required.

Multi-Tool Integration
The framework includes built-in support for base_generator, python_coder, google_search, wikipedia_search, and web_search.

Flow-GRPO Algorithm
By providing online optimization within the execution workflow, this algorithm targets long-range reasoning challenges where feedback is sparse.
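
The article does not spell out the update rule, so the following is only a sketch under stated assumptions: that Flow-GRPO, like other GRPO-style methods, standardizes final rewards within a group of rollouts for the same query, and that it handles delayed feedback by assigning each rollout's final-outcome advantage to every turn in that rollout. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize final rewards within a group of rollouts for one query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_turns(advantages: List[float], turns_per_rollout: List[int]) -> List[List[float]]:
    """Assumption: every turn of a rollout shares that rollout's advantage."""
    return [[adv] * n for adv, n in zip(advantages, turns_per_rollout)]

# Four rollouts for the same query: 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]  # number of Planner turns in each rollout

adv = group_normalized_advantages(rewards)
per_turn = broadcast_to_turns(adv, turns)
```

Under these assumptions, a single sparse reward at the end of a trajectory still yields a learning signal for each Planner decision along the way.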

Proven Results
AgentFlow allows a 7B-parameter model to surpass much larger baselines. With Flow-GRPO enabled, search-heavy tasks achieve an average accuracy of 57.3%. Agent reasoning reaches 33.1% on GAIA, while math tasks average 51.5% across AIME24, AMC23, and GameOf24. Science-based tasks average 63.5% on GPQA and MedQA.

Performance Analysis

Refined Planning
Flow-GRPO improves the model's ability to select sub-goals and determine tool strategies throughout complex reasoning turns.

Reliable Tool Usage
On the 2Wiki dataset, tool call accuracy increased from 60.0% to 77.2%. On MedQA, accuracy rose from 76.0% to 80.0%.
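
Because these figures are themselves percentages, it helps to separate the gain in percentage points from the relative improvement. A small worked calculation using only the numbers quoted above (the helper name gain is illustrative):

```python
from typing import Tuple

def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)

two_wiki = gain(60.0, 77.2)  # (17.2, 28.7)
medqa = gain(76.0, 80.0)     # (4.0, 5.3)
```

So the 2Wiki result is a 17.2-point gain, or about 28.7% relative to the starting accuracy, while MedQA improves by 4.0 points, or about 5.3% in relative terms.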

Reduced Error Rates
Extended training significantly lowered tool-call errors: GAIA dropped by 28.4%, 2Wiki by 19.4%, AIME24 by 7.8%, and Bamboogle by 8.4%.

Scaling Potential
Performance improves both with larger backbone models and with more reasoning rounds.

Installation and Usage

Environment Setup

  1. Install dependencies by running bash setup.sh, then activate the virtual environment: source .venv/bin/activate.
  2. (Optional) For concurrent benchmark runs, install GNU Parallel: sudo apt-get update && sudo apt-get install parallel.
  3. Configure your environment variables. Copy agentflow/.env.template to agentflow/.env and enter your API keys:
    • OPENAI_API_KEY (for evaluation)
    • GOOGLE_API_KEY (for Google Search)
    • DASHSCOPE_API_KEY (for Qwen-2.5-7B-Instruct)
    • TOGETHER_API_KEY (alternative for international access)

Alternatively, you can serve Qwen2.5-7B-Instruct locally via vLLM. Refer to serve_vllm_local.md for details.
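
Before running anything, it can be useful to confirm which keys are actually visible to Python. The sketch below assumes the variables from agentflow/.env have already been exported into the process environment; whether AgentFlow loads that file on its own is not stated here. The helper name missing_keys is hypothetical.

```python
import os
from typing import List, Mapping

# Key names as listed in the setup steps above.
API_KEYS = [
    "OPENAI_API_KEY",
    "GOOGLE_API_KEY",
    "DASHSCOPE_API_KEY",
    "TOGETHER_API_KEY",
]

def missing_keys(env: Mapping[str, str], keys: List[str]) -> List[str]:
    """Return the keys that are unset or contain only whitespace."""
    return [k for k in keys if not env.get(k, "").strip()]

if __name__ == "__main__":
    missing = missing_keys(os.environ, API_KEYS)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

A missing TOGETHER_API_KEY is not necessarily a problem, since it is listed as an alternative to DashScope.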

Inference

The modular system is designed to handle complex queries. Once API keys are set, run the following:

from agentflow.agentflow.solver import construct_solver

llm_engine_name = "dashscope"
solver = construct_solver(llm_engine_name=llm_engine_name)
output = solver.solve("What is the capital of France?")
print(output["direct_output"])
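
To run several queries in a row, a thin wrapper around the solver is enough. The sketch below assumes only what the snippet above shows: an object with a solve(query) method that returns a dict containing "direct_output". The solve_many helper and the EchoSolver stand-in are hypothetical and exist purely so the example runs without API keys.

```python
from typing import Any, Dict, Iterable, List

def solve_many(solver: Any, queries: Iterable[str]) -> List[Dict[str, str]]:
    """Run each query, collecting answers and errors instead of stopping early."""
    results = []
    for query in queries:
        try:
            output = solver.solve(query)
            answer, error = output.get("direct_output", ""), ""
        except Exception as exc:  # keep going if one query fails
            answer, error = "", str(exc)
        results.append({"query": query, "answer": answer, "error": error})
    return results

class EchoSolver:
    """Stand-in for demonstration; not part of AgentFlow."""
    def solve(self, query: str) -> Dict[str, str]:
        if not query:
            raise ValueError("empty query")
        return {"direct_output": query.upper()}

results = solve_many(EchoSolver(), ["paris?", ""])
```

With a real solver, pass the object returned by construct_solver in place of EchoSolver().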

Flow-GRPO Training

Optional: Test Environment
Verify your tools, LLM engines, and network configuration using test_env.md.

Data Preparation

python data/get_train_data.py   # training data
python data/aime24_data.py      # validation data

Once generated, the data/ directory will be organized as follows:

data/
├── train/
│   └── combined_train.parquet (182,190 samples)
├── val/
│   └── aime24.parquet (30 samples)
├── aime24_data.py
└── get_train_data.py
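
A quick way to confirm that both scripts produced their output is to check for the two parquet files in the tree above. This is a minimal sketch using only the standard library; the helper name missing_data_files is hypothetical, and it checks file presence only, not the sample counts.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "train/combined_train.parquet",
    "val/aime24.parquet",
]

def missing_data_files(data_root: str) -> List[str]:
    """Return the expected files that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_data_files("data")
    print("All data files present." if not missing else f"Missing: {missing}")
```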

Execution

Create a tmux session and launch the service:

tmux new-session -s agentflow
bash train/serve_with_logs.sh

Open a new window (Ctrl+B, then C) and start the training script:

bash train/train_with_logs.sh

Adjust hyperparameters, model settings, tool configurations, and RL parameters in train/config.yaml.

Benchmarking

Deploy your trained Planner model using vLLM:

bash scripts/serve_vllm.sh

Execute the benchmark suite from the test directory:

cd test
bash exp/run_all_models_all_datasets.sh

Custom Model Integration

AgentFlow allows you to assign specific LLM engines to different modules.

Planner Agent
Modify the llm_engine_name parameter within test/exp/run_all_models_all_datasets.sh.

Other Agents (Executor, Verifier, Generator)
These default to DashScope (Qwen-2.5-7B-Instruct). To override this, modify agentflow/agentflow/models/planner.py:

self.llm_engine_fixed = create_llm_engine(model_string="your-engine", is_multimodal=False, temperature=temperature)

And update the Executor instantiation in agentflow/agentflow/solver.py:

executor = Executor(
    llm_engine_name="dashscope",  # default engine; replace with your engine name
    root_cache_dir=root_cache_dir,
    verbose=verbose,
    temperature=temperature
)

Refer to llm_engine.md for a list of supported engine names and model_string formats.

By replacing a monolithic architecture with a modular team of specialized components, AgentFlow shows that a 7B-parameter model can compete with much larger systems on tool-augmented reasoning, a claim backed by its benchmark results.