MAS-Zero: Developing Self-Evolving Multi-Agent Systems Without Human Labels

June 10, published in AI Agent Tools

MAS-Zero is a multi-agent framework capable of autonomous self-improvement. It requires no human labels and no validation sets. Instead, it relies on a meta-agent that handles design, evaluation, and selection in real time.

The framework operates through two primary stages:

  1. Meta-Iteration

    • MAS-Design: The meta-agent decomposes a task into modular components. For each component, it proposes a specialized sub-agent team and translates that design into functional code.
    • MAS-Feedback: The system executes the generated code. The intermediate outputs serve as a diagnostic signal that shows whether the design works. The meta-agent assesses whether each sub-agent can solve its assigned task and whether the interactions between sub-agents are logically sound.
  2. Self-Verification: During the iteration process, the meta-agent generates several candidate systems. Self-verification then identifies the most robust candidate based exclusively on internal signals.

This entire process occurs at inference time. There is no separate training phase; the system continuously refines itself through these meta-level feedback loops.
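The two stages above can be sketched as a small loop. This is a minimal illustration, not the MAS-Zero implementation: the function names, the `Candidate` fields, and the toy stand-ins are all hypothetical, and majority voting is used here only as one simple example of an internal selection signal.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    design: str    # stands in for the generated multi-agent code
    answer: str    # final answer obtained by executing the design
    feedback: str  # meta-agent critique of the intermediate outputs


def mas_zero_sketch(
    task: str,
    design: Callable[[str, List[Candidate]], str],
    execute: Callable[[str, str], str],
    critique: Callable[[str, str, str], str],
    verify: Callable[[str, List[Candidate]], int],
    n_iterations: int = 5,
) -> Candidate:
    """Run design/feedback iterations, then select one candidate."""
    candidates: List[Candidate] = []
    for _ in range(n_iterations):
        # MAS-Design: propose a design, conditioned on earlier attempts
        current = design(task, candidates)
        # MAS-Feedback: execute the design and critique the result
        answer = execute(current, task)
        feedback = critique(task, current, answer)
        candidates.append(Candidate(current, answer, feedback))
    # Self-Verification: choose a candidate without any external labels
    return candidates[verify(task, candidates)]


# Toy stand-ins (assumptions for illustration; a real system would call an LLM)
def toy_design(task: str, history: List[Candidate]) -> str:
    return f"design-{len(history)}"


def toy_execute(design: str, task: str) -> str:
    return "75" if design == "design-0" else "80"


def toy_critique(task: str, design: str, answer: str) -> str:
    return f"{design} produced {answer}"


def toy_verify(task: str, candidates: List[Candidate]) -> int:
    # Pick the first candidate whose answer is the most common one
    answers = [c.answer for c in candidates]
    most_common = Counter(answers).most_common(1)[0][0]
    return answers.index(most_common)


result = mas_zero_sketch("hexagon", toy_design, toy_execute, toy_critique, toy_verify)
print(result.design, result.answer)
```

Passing the four steps in as callables keeps the loop itself free of any model-specific code, so the same skeleton works whether the steps are toy functions or LLM calls.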

A Practical Reasoning Challenge

Consider a complex geometry problem: ABCDEF is a convex equilateral hexagon with opposite sides parallel. The triangle formed by extending AB, CD, and EF has sides of 200, 240, and 300. Find the hexagon's side length.

A problem of this complexity requires rigorous reasoning. MAS-Zero approaches it by instantiating agents tailored to specific sub-problems. For example, one agent may manage geometric constraints while another focuses on algebraic calculations.
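For reference, the hexagon problem has a short closed-form solution, independent of how any agent system decomposes it. Because opposite sides are parallel, each corner that the hexagon cuts from the triangle is similar to the triangle itself, and adding up the three segments along one triangle side gives s/200 + s/240 + s/300 = 1 for the hexagon side s. The snippet below checks this with exact arithmetic; the function name is illustrative.

```python
from fractions import Fraction


def hexagon_side(a: int, b: int, c: int) -> Fraction:
    """Solve s/a + s/b + s/c = 1 for s, using exact rational arithmetic."""
    return 1 / (Fraction(1, a) + Fraction(1, b) + Fraction(1, c))


side = hexagon_side(200, 240, 300)
print(side)  # 80
```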

Performance Snapshot

Across the three models and three benchmarks reported below, MAS-Zero scores higher than the CoT baseline in every case, without external supervision.

The framework was tested against several benchmarks:

  • AIME24 (mathematical reasoning)
  • GPQA (graduate-level science questions)
  • SWE (SWE-Bench, software engineering tasks)

The following table compares MAS-Zero against the standard Chain-of-Thought (CoT) method:

LLM / Method               AIME24    GPQA     SWE     Avg
CoT (GPT-4o)                 8.33   45.78    9.17   23.26
MAS-Zero (GPT-4o)           33.33   50.60   25.83   35.81
CoT (LLaMA3.3-70B)          16.67   50.60    2.92   22.09
MAS-Zero (LLaMA3.3-70B)     37.50   52.41   16.74   31.67
CoT (Qwen2.5-32B)           12.50   50.00   45.26   35.92
MAS-Zero (Qwen2.5-32B)      29.17   51.81   48.95   43.31

With GPT-4o, MAS-Zero raises AIME24 accuracy from 8.33% to 33.33%. The direction is the same for LLaMA3.3-70B (16.67% to 37.50%) and Qwen2.5-32B (12.50% to 29.17%), which indicates that these gains do not depend on a pre-defined validation set.
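To make the size of each improvement explicit, the per-benchmark gains can be computed directly from the table. The snippet below is a small sketch that copies the three benchmark columns from the table (the Avg column is left out); the variable and function names are illustrative.

```python
BENCHMARKS = ("AIME24", "GPQA", "SWE")

# Scores copied from the comparison table, in BENCHMARKS order
SCORES = {
    "GPT-4o": {"CoT": (8.33, 45.78, 9.17), "MAS-Zero": (33.33, 50.60, 25.83)},
    "LLaMA3.3-70B": {"CoT": (16.67, 50.60, 2.92), "MAS-Zero": (37.50, 52.41, 16.74)},
    "Qwen2.5-32B": {"CoT": (12.50, 50.00, 45.26), "MAS-Zero": (29.17, 51.81, 48.95)},
}


def absolute_gains(scores):
    """Return MAS-Zero minus CoT, in points, per model and benchmark."""
    return {
        model: {
            bench: round(mas - cot, 2)
            for bench, cot, mas in zip(BENCHMARKS, rows["CoT"], rows["MAS-Zero"])
        }
        for model, rows in scores.items()
    }


GAINS = absolute_gains(SCORES)
for model, per_bench in GAINS.items():
    print(model, per_bench)
```

For all three models the gain is largest on AIME24 and smallest on GPQA.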

Installation and Quick Start

Environment Setup

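# Create and activate an isolated Python 3.12 environment
conda create -n mas_zero python=3.12 && conda activate mas_zero
# Install the API clients and utilities; -e installs the local human-eval directory in editable mode
pip install anthropic openai backoff together datasets jinja2 -e human-eval
# Install the remaining dependencies listed in requirements.txt
cd ./ && pip install -r requirements.txt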

Running a Search Task

export OPENAI_API_KEY={your_key}
export TOGETHER_API_KEY={your_key}

python main_question.py \
  --dataset workflow_search/aime24 \
  --option plan \
  --meta_model gpt-4o_chatgpt \
  --node_model gpt-4o_chatgpt \
  --verifier_model gpt-4o_chatgpt \
  --blocks COT COT_SC Reflexion LLM_debate \
  --use_oracle_verifier \
  --defer_verifier \
  --n_generation 5

The --dataset parameter can be swapped for GPQA or SWE-Bench. The --meta_model and --node_model flags support various backends, including GPT and Claude.
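When the same search is run repeatedly with different datasets or models, it can help to assemble the command in one place. The helper below is a hypothetical convenience wrapper, not part of the MAS-Zero repository: it only reproduces the flags shown in the command above and does not execute anything. Dataset identifiers for GPQA and SWE-Bench are not given here, so they should be taken from the repository.

```python
import shlex
from typing import List, Sequence


def build_search_command(
    dataset: str,
    meta_model: str = "gpt-4o_chatgpt",
    node_model: str = "gpt-4o_chatgpt",
    verifier_model: str = "gpt-4o_chatgpt",
    blocks: Sequence[str] = ("COT", "COT_SC", "Reflexion", "LLM_debate"),
    n_generation: int = 5,
) -> List[str]:
    """Build the main_question.py argument list using only the flags shown above."""
    return [
        "python", "main_question.py",
        "--dataset", dataset,
        "--option", "plan",
        "--meta_model", meta_model,
        "--node_model", node_model,
        "--verifier_model", verifier_model,
        "--blocks", *blocks,
        "--use_oracle_verifier",
        "--defer_verifier",
        "--n_generation", str(n_generation),
    ]


cmd = build_search_command("workflow_search/aime24")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root once the API keys are exported.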

Verifying Results

python main_judge.py \
  --dataset aime24 \
  --judge_method self \
  --baseline workflow_search \
  --model gpt-4o_chatgpt \
  --min_sample 0 \
  --max_sample 30 \
  --max_response_per_sample 5

MAS-Zero eliminates the need for external feedback. It demonstrates that a meta-agent, supported by an iterative refinement process, can independently construct highly competent multi-agent systems.