AgentFlow is a trainable framework for AI agents designed to overcome the limitations of current tool-augmented reasoning methods. While traditional approaches often struggle with scalability and generalization, AgentFlow maintains high performance through a specialized, modular architecture.
The system is divided into four dedicated modules: Planner, Executor, Verifier, and Generator. These components coordinate across multiple turns, supported by a persistent memory system that tracks context and integrates tools directly into the reasoning loop.
AgentFlow also introduces Flow-based Group Refined Policy Optimization, or Flow-GRPO. This algorithm enables online training of the Planner module directly within the workflow. It is particularly effective for long-horizon tasks where feedback or rewards are delayed.
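The text above does not spell out the update rule, so the following is only a sketch of one way a group-relative method can cope with delayed rewards: several rollouts of the same query each receive a single final outcome reward, the rewards are normalized within the group, and each rollout's normalized value is copied to every Planner turn in that rollout. The function names and the exact normalization are assumptions made for illustration, not Flow-GRPO's published definition.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(outcome_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize one final reward per rollout against the group's mean and spread."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def broadcast_to_turns(advantages: List[float], turns_per_rollout: List[int]) -> List[List[float]]:
    """Give every turn of a rollout the same advantage (assumed credit assignment)."""
    return [[adv] * n_turns for adv, n_turns in zip(advantages, turns_per_rollout)]


# Four hypothetical rollouts of one query: 1.0 = final answer correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 1.0, 0.0]
turns = [3, 2, 4, 1]  # number of Planner turns in each rollout

advantages = group_normalized_advantages(rewards)
per_turn = broadcast_to_turns(advantages, turns)
print(per_turn)
```

With this scheme, a long rollout that eventually succeeds rewards every planning decision along the way, which is one reason sparse final feedback can still drive multi-turn training.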
Out of the box, the framework connects to several essential tools, including basic text generation, Python code execution, Google Search, Wikipedia lookup, and general web search.
Using Qwen-2.5-7B-Instruct as its backbone, AgentFlow was evaluated on ten benchmarks. Reported accuracy gains are 14.9% on search tasks, 14.0% on agent reasoning, 14.5% on mathematics, and 4.1% on scientific tasks. Notably, the system outperforms GPT-4o, a proprietary model with approximately 200 billion parameters, despite AgentFlow's much smaller 7B-parameter scale.
Mainstream approaches, such as Search-R1, typically train a single large language model to interleave reasoning steps with tool calls. AgentFlow moves away from this monolithic design in favor of a modular team where each component has a defined, specialized role.
| Module | Core Function | Input | Output |
|---|---|---|---|
| Planner | Establishes sub-goals and tool strategy | Query analysis, global goal, required skills | Current sub-goal, chosen tool, tool context |
| Executor | Executes the tool call | Sub-goal, chosen tool, tool metadata | Generated command, execution result |
| Verifier | Validates the result | Generated command, execution result | Memory analysis, verification status |
| Generator | Formulates the final answer | Generated command, execution result | Final answer |
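The hand-off described in the table can be sketched as a small loop. This is a toy illustration, not AgentFlow's code: the `Memory` class, the four functions, and the `calculator` tool are all hypothetical stand-ins, and in the real system each module would be backed by an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Persistent record of every turn (hypothetical structure)."""
    turns: List[dict] = field(default_factory=list)


def plan(query: str, memory: Memory) -> dict:
    # Planner: pick the next sub-goal and tool. This stand-in always picks the toy calculator.
    return {"sub_goal": f"evaluate {query}", "tool": "calculator", "context": query}


def execute(step: dict, tools: Dict[str, Callable[[str], str]]) -> dict:
    # Executor: build a concrete command from the plan and run the chosen tool.
    command = f"{step['tool']}({step['context']!r})"
    return {"command": command, "result": tools[step["tool"]](step["context"])}


def verify(execution: dict) -> bool:
    # Verifier: decide whether the result is good enough to stop iterating.
    return execution["result"] != "error"


def generate(memory: Memory) -> str:
    # Generator: form the final answer from the most recent recorded result.
    return memory.turns[-1]["result"] if memory.turns else "no answer"


def calculator(expression: str) -> str:
    # Toy tool: sums integers separated by "+", returns "error" on bad input.
    try:
        return str(sum(int(part) for part in expression.split("+")))
    except ValueError:
        return "error"


def solve(query: str, tools: Dict[str, Callable[[str], str]], max_turns: int = 3) -> str:
    memory = Memory()
    for _ in range(max_turns):
        step = plan(query, memory)
        execution = execute(step, tools)
        verified = verify(execution)
        memory.turns.append({**step, **execution, "verified": verified})  # memory grows each turn
        if verified:
            break
    return generate(memory)


print(solve("2+3", {"calculator": calculator}))  # prints 5
```

Keeping each role behind its own function is what makes it possible to train or swap one module, such as the Planner, without touching the others.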
Modular Agent System
Four specialized modules collaborate over multiple rounds of interaction. The system’s memory updates continuously, and new tools can be plugged in as required.
Multi-Tool Integration
The framework includes built-in support for base_generator, python_coder, google_search, wikipedia_search, and web_search.
Flow-GRPO Algorithm
By providing online optimization within the execution workflow, this algorithm targets long-range reasoning challenges where feedback is sparse.
Proven Results
AgentFlow allows a 7B-parameter model to surpass much larger baselines. With Flow-GRPO enabled, search-heavy tasks achieve an average accuracy of 57.3%. Agent reasoning reaches 33.1% on GAIA, while math tasks average 51.5% across AIME24, AMC23, and GameOf24. Science-based tasks average 63.5% on GPQA and MedQA.
Refined Planning
Flow-GRPO improves the model's ability to select sub-goals and determine tool strategies throughout complex reasoning turns.
Reliable Tool Usage
On the 2Wiki dataset, tool call accuracy increased from 60.0% to 77.2%. On MedQA, accuracy rose from 76.0% to 80.0%.
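These figures can be read either as a change in percentage points or as a relative improvement over the starting value, and the two differ noticeably. A short calculation using only the numbers quoted above (the helper name `gains` is illustrative):

```python
def gains(before: float, after: float) -> tuple:
    """Return (percentage-point change, relative change in %), rounded to one decimal."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


for name, before, after in [("2Wiki", 60.0, 77.2), ("MedQA", 76.0, 80.0)]:
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
# 2Wiki: +17.2 points (28.7% relative)
# MedQA: +4.0 points (5.3% relative)
```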
Reduced Error Rates
Extended training significantly lowered tool-call errors: GAIA dropped by 28.4%, 2Wiki by 19.4%, AIME24 by 7.8%, and Bamboogle by 8.4%.
Scaling Potential
Performance continues to improve as the backbone model gets larger and as the number of reasoning rounds increases.
Run bash setup.sh, then activate the virtual environment: source .venv/bin/activate.
Install parallel: sudo apt-get update && sudo apt-get install parallel.
Copy agentflow/.env.template to agentflow/.env and enter your API keys:
OPENAI_API_KEY (for evaluation)
GOOGLE_API_KEY (for Google Search)
DASHSCOPE_API_KEY (for Qwen-2.5-7B-Instruct)
TOGETHER_API_KEY (alternative for international access)
Alternatively, you can serve Qwen2.5-7B-Instruct locally via vLLM. Refer to serve_vllm_local.md for details.
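Before the first run, it can help to confirm which keys are actually visible to Python. The sketch below assumes the keys have been exported as environment variables; how AgentFlow itself reads agentflow/.env is not described here, and the helper `key_status` is a hypothetical name.

```python
import os
from typing import Dict, Mapping, Optional

# Key names taken from the setup step above.
API_KEYS = [
    "OPENAI_API_KEY",
    "GOOGLE_API_KEY",
    "DASHSCOPE_API_KEY",
    "TOGETHER_API_KEY",
]


def key_status(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Map each key name to True if it is set to a non-blank value."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name, "").strip()) for name in API_KEYS}


for name, is_set in key_status().items():
    print(f"{name}: {'set' for _ in [0]}" if False else f"{name}: {'set' if is_set else 'missing'}")
```

Not every key is needed for every setup: for example, the Together key is listed as an alternative, and a locally served model may make the DashScope key unnecessary.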
The modular system is designed to handle complex queries. Once API keys are set, run the following:
from agentflow.agentflow.solver import construct_solver
llm_engine_name = "dashscope"
solver = construct_solver(llm_engine_name=llm_engine_name)
output = solver.solve("What is the capital of France?")
print(output["direct_output"])
Optional: Test Environment
Verify your tools, LLM engines, and network configuration using test_env.md.
Data Preparation
python data/get_train_data.py # training data
python data/aime24_data.py # validation data
Once generated, the data/ directory will be organized as follows:
data/
├── train/
│ └── combined_train.parquet (182,190 samples)
├── val/
│ └── aime24.parquet (30 samples)
├── aime24_data.py
└── get_train_data.py
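A quick way to confirm that the scripts produced the expected files is to compare row counts against the sample counts shown in the tree. This is a minimal sketch that assumes pyarrow is installed; the helper names are illustrative, and the row counter is passed in as a parameter so a different reader can be substituted.

```python
from typing import Callable, Dict

# Expected sample counts, taken from the directory listing above.
EXPECTED_ROWS = {
    "data/train/combined_train.parquet": 182_190,
    "data/val/aime24.parquet": 30,
}


def parquet_row_count(path: str) -> int:
    """Read the row count from the parquet footer without loading the data."""
    import pyarrow.parquet as pq  # imported lazily; assumes pyarrow is available
    return pq.read_metadata(path).num_rows


def check_row_counts(
    expected: Dict[str, int],
    count_rows: Callable[[str], int] = parquet_row_count,
) -> Dict[str, bool]:
    """Return, for each file, whether its row count matches the expected value."""
    return {path: count_rows(path) == rows for path, rows in expected.items()}
```

From the repository root, `check_row_counts(EXPECTED_ROWS)` should return True for both files once data generation has finished.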
Execution
Create a tmux session and launch the service:
tmux new-session -s agentflow
bash train/serve_with_logs.sh
Open a new window (Ctrl+B, then C) and start the training script:
bash train/train_with_logs.sh
Adjust hyperparameters, model settings, tool configurations, and RL parameters in train/config.yaml.
Deploy your trained Planner model using vLLM:
bash scripts/serve_vllm.sh
Execute the benchmark suite from the test directory:
cd test
bash exp/run_all_models_all_datasets.sh
AgentFlow allows you to assign specific LLM engines to different modules.
Planner Agent
Modify the llm_engine_name parameter within test/exp/run_all_models_all_datasets.sh.
Other Agents (Executor, Verifier, Generator)
These default to DashScope (Qwen-2.5-7B-Instruct). To override this, modify agentflow/agentflow/models/planner.py:
self.llm_engine_fixed = create_llm_engine(model_string="your-engine", is_multimodal=False, temperature=temperature)
And update the Executor instantiation in agentflow/agentflow/solver.py:
executor = Executor(
llm_engine_name="dashscope",
root_cache_dir=root_cache_dir,
verbose=verbose,
temperature=temperature
)
Refer to llm_engine.md for a list of supported engine names and model_string formats.
By replacing a monolithic design with a team of specialized modules, AgentFlow offers a different approach to tool-augmented reasoning, and its benchmark results support that choice.