Agent-World Training Arena (3 minute read)
Agent-World is a self-evolving training system that mines 2,000+ real-world tool environments to continuously improve AI agents through automated diagnosis and targeted task generation.
Deep dive
- Agent-World addresses two critical bottlenecks in agent training: lack of scalable realistic environments (most are LLM-synthesized and don't match real-world interaction logic), and absence of principled continuous learning mechanisms
- The system mines structured databases from three real-world sources—MCP servers, tool documentation, and industrial PRDs—yielding 2,000+ environment themes organized in a three-level hierarchical taxonomy across 20 primary categories
- A deep-research agent autonomously mines web data and performs iterative database complexification, followed by tool-design agents that generate 19K+ validated tools with cross-validation (compile success, >0.5 test accuracy)
- Task synthesis uses two strategies: graph-based (random walks on tool dependency graphs with consistency verification across 5 ReAct agent runs) and programmatic (executable Python solutions with verification scripts)
- Multi-environment RL uses GRPO optimization with structured verifiable rewards—graph tasks evaluated via LLM-as-judge, programmatic tasks through sandbox execution
- The self-evolving loop runs in three phases: dynamic evaluation on fresh held-out tasks, agentic diagnosis of failure traces and error distributions, then re-synthesis of tasks conditioned on diagnosed weaknesses
- Tested on 23 benchmarks spanning tool use (MCP-Mark, BFCL V4, τ²-Bench), advanced AI assistant tasks (SkillsBench, ARC-AGI-2, Claw-Eval), software engineering, research, and reasoning
- Agent-World-8B outperforms all open-source environment-scaling baselines and shows more consistent cross-environment generalization than methods like Simulator, TOUCAN, EnvScaler, and AWM
- Agent-World-14B achieves 55.8% on BFCL-V4, surpassing the 685B-parameter DeepSeek-V3.2 (54.1%), demonstrating that environment quality and diversity matter more than pure model scale
- Scaling analysis shows performance more than doubles (18.4% → 38.5%) when increasing training environments from 0 to 2,000, with particularly strong gains on interaction-intensive tasks
- Two rounds of self-evolution yield consistent monotonic gains across all benchmarks, with MCP-Mark showing the largest improvements (+8.6 points for Agent-World-14B) due to its requirement for stronger state tracking
- The self-evolution mechanism transfers: applying the loop to EnvScaler-8B also yields sustained gains (+5.6 on MCP-Mark over two rounds), indicating the approach benefits other baselines without requiring Agent-World initialization
- Even advanced proprietary models show clear limitations—GPT-5.2 High achieves only 53.1% on MCP-Mark, while GPT-OSS-120B scores just 4.7%, highlighting that current models struggle with long-horizon tool use in stateful environments
Decoder
- MCP (Model Context Protocol): A unified interface standard for connecting AI agents with real-world services and tools, providing structured JSON specifications for server interactions
- GRPO (Group Relative Policy Optimization): A reinforcement learning algorithm that optimizes agent policies by comparing relative performance across groups of rollouts for stable training
- ReAct agent: An agent architecture that combines reasoning and acting by generating verbal reasoning traces before taking actions
- Stateful environments: Tool ecosystems where actions modify persistent state (e.g., booking a flight updates inventory), requiring agents to track changes across multiple steps
- Tool dependency graph: A directed graph representing which tools must be called before others, used to synthesize realistic multi-step task sequences
- Self-evolution loop: An automated cycle where agents are evaluated, weaknesses are diagnosed, targeted training data is generated, and the agent is retrained iteratively
- LLM-as-judge: Using a language model to evaluate agent outputs against rubrics when ground-truth answers are complex or open-ended
- Sandbox execution: Running code in an isolated environment to verify correctness without security risks
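The ReAct pattern in the glossary above can be illustrated with a minimal loop; the policy, tool registry, and termination convention here are invented for the sketch and are not Agent-World's actual interface.

```python
# Minimal ReAct-style loop: alternate a reasoning step with a tool action
# until the (stubbed) policy emits a final answer. All names here are
# illustrative assumptions, not Agent-World's actual interface.

def react_loop(policy, tools, task, max_steps=8):
    history = [("task", task)]
    for _ in range(max_steps):
        thought, action, arg = policy(history)   # reason before acting
        history.append(("thought", thought))
        if action == "finish":                   # policy decides it is done
            return arg, history
        observation = tools[action](arg)         # execute the chosen tool
        history.append((action, observation))
    return None, history                         # ran out of steps

# Tiny stub policy and tool set to exercise the loop.
def stub_policy(history):
    last = history[-1]
    if last[0] == "task":
        return "look up the value", "lookup", "x"
    return "report the result", "finish", last[1]

answer, trace = react_loop(stub_policy, {"lookup": lambda k: {"x": 42}[k]}, "what is x?")
```

The interleaved `("thought", ...)` entries are what distinguish ReAct from a plain tool-calling loop: the reasoning trace is part of the history the policy conditions on.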
Original article
Agent-World
What is Agent-World?
A self-evolving training arena that unifies scalable environment synthesis with continuous agent training by autonomously mining real-world tool ecosystems, synthesizing verifiable tasks, and driving agents to evolve through diagnostic feedback loops.
Key Capabilities
Six core pillars powering the Agent-World ecosystem
Real-World Environment Mining
Autonomously discovers and mines structured databases from real-world sources including MCP servers, tool docs, and industrial PRDs.
2K Environments & 19K Tools
Builds over 2,000 realistic environments spanning 20 primary categories, each equipped with executable tool interfaces totaling 19K+ validated tools with rich parameters.
Graph & Programmatic Tasks
Synthesizes verifiable tasks via tool dependency graphs and executable Python solutions with controllable difficulty scaling.
Multi-Environment Agent RL
Closed-loop RL training across diverse environments with structured verifiable rewards and GRPO optimization.
Self-Evolving Arena
Automatically diagnoses agent weaknesses through dynamic evaluation, then generates targeted tasks to drive iterative improvement.
Strong Results on 23 Benchmarks
Demonstrates strong performance across agentic tool use, advanced AI assistant, software engineering, deep research, and reasoning benchmarks.
Abstract
Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for lifelong learning.
In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments.
Across 23 challenging agent benchmarks, Agent-World consistently outperforms strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends with environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
Introduction
As the capability frontier of large language models continues to expand, expectations are shifting from chat-oriented text generation toward general-purpose agent assistants. Ideally, such agents should seamlessly integrate real-world interaction with verbal reasoning, and continuously learn from experience to improve themselves. Realizing these agentic capabilities requires training LLMs in dynamic environments equipped with executable tools, forming a "Generation–Execution–Feedback" interaction loop.
With the rise of agentic reinforcement learning (Agent RL), several agent systems built on static tool environments have demonstrated strong practical value. However, open-world tool environments are inherently compositional and stateful. For instance, in a flight-booking workflow, an agent should follow a valid action order (check inventory → execute booking → update the calendar), while each action also modifies the underlying environment state. Prior work centered on stateless or single-tool settings is insufficient for realistic applications.
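The ordering constraint in that flight-booking example can be made concrete with a toy stateful environment; the class and method names below are illustrative assumptions, not the paper's implementation.

```python
# A toy stateful tool environment for the flight-booking example: each tool
# call mutates persistent state, and booking is only valid after an inventory
# check. Names and rules are illustrative, not from the paper.

class FlightEnv:
    def __init__(self, seats=1):
        self.seats = seats
        self.checked = False
        self.calendar = []

    def check_inventory(self):
        self.checked = True
        return self.seats

    def book(self, flight):
        if not self.checked:
            raise RuntimeError("must check inventory before booking")
        if self.seats == 0:
            return "sold out"
        self.seats -= 1          # booking mutates the environment state
        self.checked = False     # a fresh check is needed for the next booking
        return f"booked {flight}"

    def update_calendar(self, flight):
        self.calendar.append(flight)
        return self.calendar

env = FlightEnv(seats=1)
env.check_inventory()            # valid order: check -> book -> update
env.book("AW-101")
env.update_calendar("AW-101")
```

A stateless evaluator would accept any call order; here, skipping `check_inventory` raises an error, which is exactly the kind of dependency a trained agent has to track.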
Two key bottlenecks remain unresolved:
Scalable Realism and Complex Environment Synthesis
Existing environments are typically LLM-generated or derived from limited open-source toolchains, and often mismatch real-world interaction logic. Such synthetic environments are limited in complexity, restricting agent training on long-horizon, state-intensive tasks.
Continuous Self-Evolving Training Mechanisms
Existing work has primarily emphasized environment construction and scaling, while lacking principled mechanisms that use scalable environments to diagnose agent weaknesses and drive continual self-improvement.
We propose Agent-World, a general-purpose agent training arena that unifies scalable environment synthesis with continuous self-evolving training. Agent-World follows a two-stage design that forms a closed-loop training process.
Key Contributions
- We introduce Agent-World, a general-purpose agent training arena that unifies scalable environment synthesis with a continuous self-evolving training mechanism, forming a co-evolution loop between agent policies and environments.
- We propose Agentic Environment-Task Discovery, which mines realistic executable environments from real-world environment themes and synthesizes diverse verifiable tasks with controllable difficulty.
- We propose Continuous Self-Evolving Agent Training, which integrates multi-environment agentic RL with a self-evolving arena to automatically diagnose agent weaknesses and drive targeted learning in a closed training loop.
- Experiments across 23 challenging agent benchmarks demonstrate the superior performance of Agent-World. Further analysis reveals scaling relationships among environment diversity, evolution rounds, and agent performance.
Method
Agent-World contains two tightly coupled components that form a closed loop: scalable environments support agent training, while training-time diagnosis feeds back into the next round of environment-task construction.
Agentic Environment-Task Discovery
Environment Theme Collection
We systematically gather environment themes from three real-world sources: (1) MCP Servers (real-world server specifications from Smithery with structured JSON documents), (2) Tool Documentation (open-source datasets covering real tool-use scenarios), and (3) Industrial PRDs (product requirement documents containing domain workflows and system interfaces). Together, these form a seed topic set of over 2,000 environment themes across 20 primary categories.
Hierarchical Environment Taxonomy
We design a three-level hierarchical classification system to organize all environment themes: 20 first-tier categories (for example, Document & Design, Social Media & Community, System & Cloud Infrastructure), each subdivided into fine-grained second-tier subcategories (such as Office & Text Processing, Social Network Integration, Cloud Platform Services), and finally mapped to specific MCP server instances at the third tier. This taxonomy ensures broad domain coverage, enables systematic gap analysis during self-evolving training, and supports controlled difficulty scaling across diverse real-world domains.
Agentic Database Mining
Unlike prior work that uses LLM-synthesized databases, we argue that the web already contains abundant, high-value structured data. We design a deep-research agent that autonomously mines and processes web data into environment databases. For each topic, the agent runs iterative loops of in-depth information retrieval and data mining, followed by a database complexification process that expands and enriches the database over multiple rounds.
Tool Interface Generation and Verification
A tool-design agent produces candidate tools and unit test cases grounded in the mined databases. We perform cross-validation to retain tools that: (1) compile successfully, (2) achieve accuracy >0.5 across test cases, and (3) belong to environments with at least one tool and one test case. The resulting ecosystem contains 19K+ distinct tools with rich parameters.
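The three retention criteria can be sketched as a simple filter. The `run` entry point and the `(input, expected)` test-case format below are assumptions made for illustration, not the paper's actual tool schema.

```python
# Sketch of the cross-validation filter described above: keep a tool only if
# its code compiles, it passes more than half of its unit tests, and it has
# at least one test case. The tool/test record format is an assumption.

def validate_tool(source, tests):
    try:
        code = compile(source, "<tool>", "exec")     # (1) compile check
    except SyntaxError:
        return False
    namespace = {}
    exec(code, namespace)
    fn = namespace["run"]                            # assumed entry point
    if not tests:                                    # (3) needs >= 1 test case
        return False
    passed = sum(fn(inp) == expected for inp, expected in tests)
    return passed / len(tests) > 0.5                 # (2) accuracy > 0.5

good = validate_tool("def run(x):\n    return x * 2", [(1, 2), (3, 6), (5, 9)])
bad = validate_tool("def run(x: return x", [(1, 1)])
```

`good` passes 2 of 3 cases (accuracy ≈ 0.67 > 0.5) and is retained; `bad` fails to compile and is discarded.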
Verifiable Task Synthesis
We synthesize high-quality agentic tasks through two complementary strategies:
Graph-Based Task Synthesis: We construct weighted tool dependency graphs and perform random walks to generate tool-call sequences. From these sequences, an LLM drafts task descriptions and ground-truth answers, followed by consistency verification (ReAct agent × 5 runs).
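A minimal version of such a weighted random walk, over an invented dependency graph, might look like this; edge weights and tool names are made up for the sketch.

```python
# Sketch of graph-based synthesis: a random walk over a weighted tool
# dependency graph yields a plausible multi-step tool-call sequence, which an
# LLM would then turn into a task description. Graph contents are invented.

import random

def random_walk(graph, start, max_len=4, rng=random):
    """graph: {tool: [(next_tool, weight), ...]}, where an edge means
    the current tool must be called before next_tool."""
    path = [start]
    while len(path) < max_len and graph.get(path[-1]):
        nxt, weights = zip(*graph[path[-1]])
        path.append(rng.choices(nxt, weights=weights)[0])
    return path

deps = {
    "search_flights": [("check_inventory", 1.0)],
    "check_inventory": [("book_flight", 0.8), ("search_flights", 0.2)],
    "book_flight": [("update_calendar", 1.0)],
}
random.seed(0)
seq = random_walk(deps, "search_flights")
```

Every consecutive pair in `seq` is a valid dependency edge, so the sequence respects the required call order by construction.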
Programmatic Task Synthesis: We directly generate executable Python solutions with complex control flows (loops, branches, aggregations). Each task is paired with an executable verification script for robust evaluation beyond simple string matching.
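A toy verification script in this spirit, with invented task content, could be:

```python
# Sketch of programmatic verification: the synthesized task ships with an
# executable checker, so grading runs code instead of matching strings.
# The task content and the checker contract are illustrative assumptions.

def make_checker(expected_total):
    """Return a verification function for a toy aggregation task:
    'sum the prices of all in-stock items'."""
    def check(agent_answer):
        try:
            return abs(float(agent_answer) - expected_total) < 1e-6
        except (TypeError, ValueError):
            return False
    return check

# Reference solution with a loop and a branch, in the spirit of the complex
# control flows mentioned above.
items = [{"price": 10.0, "in_stock": True},
         {"price": 99.0, "in_stock": False},
         {"price": 2.5, "in_stock": True}]
reference = sum(i["price"] for i in items if i["in_stock"])
verify = make_checker(reference)
```

Because the checker parses and compares numerically, the formats `"12.5"` and `12.5` both pass, which plain string matching would not handle.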
Both methods support difficulty scaling by expanding tool chains, increasing non-linear reasoning requirements, and obscuring tool names to force higher-level planning.
Environment Taxonomy Mapping
The taxonomy comprises 20 L1 categories, 50 L2 labels, and 1,978 L3 servers in total.
Continuous Self-Evolving Agent Training
Multi-Environment Agent Reinforcement Learning
We implement a closed-loop interaction among three components: an LLM policy (generates actions conditioned on history), a tool interface/runtime (executes tools in sandboxed environments), and a database state (provides verifiable, updatable data backbone). Tasks within each global batch are paired with independent environments, realizing multi-environment rollouts.
Structured Verifiable Reward: Graph-based tasks are evaluated via rubric-conditioned LLM-as-judge; programmatic tasks are verified through executable validation scripts in sandboxes. We adopt GRPO (Group Relative Policy Optimization) for stable training.
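The group-relative normalization at the heart of GRPO can be sketched in a few lines; this shows only the advantage computation over one group of rollouts, not the full policy-gradient update.

```python
# The core of GRPO's group-relative signal: each rollout's reward is
# normalized against its group's mean and standard deviation, so the policy
# compares rollouts of the same task rather than relying on a learned critic.

from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one task with verifiable rewards in [0, 1]:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Successful rollouts receive positive advantages and failed ones negative, and the advantages sum to zero within the group, which is what makes the signal stable without a value network.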
Self-Evolving Agent Arena
The environment ecosystem serves as a dynamic diagnostic arena:
Phase 1: Dynamic Evaluation - Synthesize fresh verifiable tasks in held-out arena environments at each iteration, preventing overfitting to a static benchmark.
Phase 2: Agentic Diagnosis - A diagnosis agent analyzes per-task failure traces, error distributions, and environment metadata to identify weak environments and generate task-generation guidelines.
Phase 3: Agent-Environment Co-Evolution - Re-run task synthesis conditioned on diagnosed weaknesses, optionally complexify databases, and continue RL to obtain an improved policy. This creates a self-evolving loop:
π_θ^(r) → evaluate → W^(r) → diagnose + target → X_target^(r) → continue RL → π_θ^(r+1)
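Written as plain control flow, with every subsystem stubbed out, the loop might look like this; every function here is a placeholder for a component described above (evaluation, diagnosis, targeted synthesis, RL), not real Agent-World code.

```python
# The three-phase self-evolving loop as plain control flow. All callables
# are stand-ins for the subsystems described in the text.

def self_evolve(policy, rounds, evaluate, diagnose, synthesize, train):
    for _ in range(rounds):
        weaknesses = diagnose(evaluate(policy))   # Phases 1-2: evaluate, diagnose
        targeted_tasks = synthesize(weaknesses)   # Phase 3: targeted re-synthesis
        policy = train(policy, targeted_tasks)    # continue RL on new tasks
    return policy

# Stub run: "policy" is just a score, and each round closes half of the
# remaining gap to a perfect score. The numbers are arbitrary and only
# exercise the control flow.
final = self_evolve(
    policy=0.5,
    rounds=2,
    evaluate=lambda p: 1.0 - p,                   # gap to perfect score
    diagnose=lambda gap: gap,                     # weaknesses ~ gap size
    synthesize=lambda w: w,                       # task volume sized to weakness
    train=lambda p, tasks: p + tasks / 2,         # RL closes half the gap
)
```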
Experiments
We evaluate Agent-World on 23 benchmarks spanning agentic tool use, advanced AI assistant, software engineering, deep research, and general reasoning, using Qwen3-8B/14B backbones trained with GRPO.
Main Results on Agentic Tool-Use Benchmarks
We report accuracy (%) across three benchmark suites: MCP-Mark, BFCL V4, and τ²-Bench.
| Method | MCP-Mark: File. | GitHub | Notion | Play. | Post. | Avg. | BFCL V4: WebS. | Mem. | Multi-T. | NoLive | Live | Relev. | Irrelev. | Avg. | τ²-Bench: Retail | Telec. | Airline | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frontier Proprietary Models | ||||||||||||||||||
| GPT-5.2 High | 60.0 | 47.8 | 42.9 | 40.0 | 66.7 | 53.1 | 75.5 | 45.8 | 48.5 | 81.9 | 70.4 | 75.0 | 88.7 | 62.9 | 81.6 | 95.8 | 62.5 | 80.2 |
| Claude Sonnet-4.5 | 32.5 | 29.4 | 25.0 | 27.0 | 50.0 | 33.3 | 81.0 | 65.0 | 61.4 | 88.7 | 81.1 | 68.8 | 86.6 | 73.2 | 86.2 | 98.0 | 70.1 | 84.7 |
| Gemini-3 Pro | 56.7 | 45.7 | 43.8 | 40.0 | 70.2 | 50.8 | 80.0 | 61.7 | 60.8 | 90.7 | 83.1 | 68.8 | 85.6 | 72.5 | 85.3 | 98.0 | 72.7 | 85.4 |
| Seed 2.0 | 60.0 | 39.1 | 53.6 | 40.0 | 81.0 | 54.7 | 92.0 | 57.8 | 62.3 | 89.0 | 82.2 | 76.6 | 75.0 | 73.4 | 90.4 | 94.2 | — | — |
| Open-Source Foundation Models (8B–685B) | ||||||||||||||||||
| DeepSeek-V3.2-685B | 36.7 | 20.7 | 45.5 | 17.0 | 66.6 | 36.7 | 69.5 | 54.2 | 37.4 | 34.9 | 53.7 | 37.5 | 93.2 | 54.1 | — | — | — | 80.3 |
| GPT-OSS-120B | 5.8 | 4.4 | 3.6 | 3.0 | 7.1 | 4.7 | — | — | — | — | — | — | — | — | 67.8 | 49.2 | 48.0 | 55.0 |
| Qwen3-8B | 3.3 | 0.0 | 0.0 | 4.0 | 4.8 | 2.4 | 7.0 | 17.6 | 35.4 | 90.2 | 80.9 | 81.3 | 77.2 | 40.4 | 34.0 | 18.0 | 26.5 | 26.2 |
| Qwen3-14B | 3.3 | 4.4 | 0.0 | 0.0 | 9.5 | 3.4 | 4.0 | 19.8 | 36.9 | 90.0 | 82.4 | 81.3 | 79.4 | 41.0 | 55.3 | 14.9 | 27.0 | 32.4 |
| Qwen3-32B | 10.0 | 0.0 | 3.6 | 0.0 | 23.8 | 7.5 | 26.0 | 15.7 | 43.3 | 90.3 | 82.0 | 81.3 | 82.4 | 46.7 | 59.5 | 27.2 | 48.0 | 44.9 |
| Qwen3-235B-A22B | 13.3 | 0.0 | 10.7 | 0.0 | 4.8 | 5.8 | 54.0 | 23.9 | 45.4 | 37.4 | 68.9 | 87.5 | 81.7 | 47.9 | 71.9 | 58.0 | 45.6 | 58.5 |
| Open-Source Environment Scaling Methods (7B–14B) | ||||||||||||||||||
| Simulator-8B | 3.3 | 0.0 | 0.0 | 4.0 | 4.8 | 2.4 | 17.5 | 6.0 | 4.1 | 47.6 | 44.6 | 31.3 | 87.3 | 23.9 | 32.2 | 29.2 | 34.0 | 31.8 |
| TOUCAN-7B | 0.0 | 0.0 | 0.0 | 0.0 | 4.8 | 1.0 | 21.0 | 18.5 | 17.8 | 81.0 | 73.9 | 81.3 | 78.6 | 36.6 | 22.8 | 10.5 | 20.0 | 17.7 |
| EnvScaler-8B | 10.0 | 4.4 | 0.0 | 4.0 | 9.5 | 5.6 | 23.0 | 21.9 | 47.1 | 88.5 | 82.2 | 93.8 | 74.6 | 47.6 | 49.6 | 32.7 | 31.5 | 37.9 |
| AWM-8B | 3.3 | 0.0 | 0.0 | 4.0 | 4.8 | 2.4 | 9.5 | 15.7 | 34.9 | 90.2 | 80.5 | 93.8 | 73.9 | 40.0 | 41.2 | 38.5 | 23.5 | 34.4 |
| AWM-14B | 3.3 | 8.7 | 0.0 | 4.0 | 9.5 | 5.1 | 10.0 | 19.8 | 37.6 | 90.2 | 81.5 | 75.0 | 79.4 | 42.4 | 63.6 | 17.8 | 31.5 | 39.0 |
| ScaleEnv-8B | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 50.9 | 27.2 | 37.5 | 38.5 |
| Agent-World-8B | 13.3 | 4.4 | 3.6 | 4.0 | 19.1 | 8.9 | 47.0 | 21.7 | 44.5 | 83.3 | 79.6 | 93.8 | 80.2 | 51.4 | 72.8 | 50.9 | 40.0 | 61.8 |
| Agent-World-14B | 16.6 | 4.4 | 3.6 | 4.0 | 38.1 | 13.3 | 53.0 | 23.9 | 53.9 | 82.3 | 79.3 | 93.8 | 81.0 | 55.8 | 74.5 | 56.1 | 52.0 | 65.4 |
Key Findings
(1) Foundation models remain limited in complex agentic tool-use scenarios. Even advanced proprietary models show clear limitations: GPT-5.2 High achieves only 53.1% on MCP-Mark, while open-source models such as GPT-OSS-120B and Qwen3-235B-A22B score only 4.7% and 5.8% on the same benchmark. These benchmarks cover diverse stateful environments, suggesting current models still struggle with long-horizon tool use requiring multi-step planning and state tracking.
(2) Existing environment-scaling methods still suffer from uneven capability gains. Simulator-based methods such as Simulator-8B achieve comparatively strong results on τ²-Bench yet perform poorly on MCP-Mark and BFCL V4. Code-based methods like EnvScaler-8B and AWM-8B/14B provide broader gains but show clear weaknesses in specific environments, including GitHub and Notion.
(3) Agent-World achieves more consistent cross-environment generalization. Agent-World consistently outperforms prior environment-scaling baselines across all three benchmark suites. Agent-World-8B achieves 61.8% on τ²-Bench, 51.4% on BFCL V4, and 8.9% on MCP-Mark. Agent-World-14B surpasses even DeepSeek-V3.2-685B on BFCL-V4 (55.8% vs. 54.1%).
Generalization on Advanced AI Assistant Benchmarks
Scaling Analysis of Training Environments
We progressively increase the number of training environments from 0 to 2,000. Performance improves consistently across all domains as the environment scale grows. Averaged over four domains, the score rises from 18.4% to 38.5% (+20.1 points), more than doubling the initial level. The gains are particularly pronounced on interaction-intensive tasks.
Analysis of Continuous Self-Evolution
To validate Continuous Self-Evolving Agent Training, we run the same two-round self-evolving arena loop from two different starting points: Agent-World-14B and EnvScaler-8B. Results show monotonic gains on all three evaluation suites for both models:
| Model / Round | τ²-Bench | BFCL-V4 | MCP-Mark (Post.) |
|---|---|---|---|
| Agent-World-14B (base) | 45.3 | 52.4 | 29.5 |
| +1 round | 48.6 (+3.3) | 54.9 (+2.5) | 36.3 (+6.8) |
| +2 rounds | 50.5 (+1.9) | 55.8 (+0.9) | 38.1 (+1.8) |
| EnvScaler-8B (base) | 37.9 | 47.6 | 9.5 |
| +1 round | 40.2 (+2.3) | 49.1 (+1.5) | 13.9 (+4.4) |
| +2 rounds | 41.6 (+1.4) | 50.0 (+0.9) | 15.1 (+1.2) |
The largest gains across two rounds appear on MCP-Mark for both models: +8.6 for Agent-World and +5.6 for EnvScaler. This setting requires stronger state tracking and more reliable interaction with realistic MCP server environments. Importantly, EnvScaler-8B also improves, indicating that the loop not only benefits our base model but also yields sustained gains for other environment-scaling baselines without relying on Agent-World initialization.
Training Dynamics
Conclusion
We presented Agent-World, a self-evolving training arena for general-purpose agents in realistic tool environments. Agent-World unifies two tightly coupled components:
Agentic Environment-Task Discovery mines topic-aligned real-world databases and executable toolsets from large-scale themes and synthesizes verifiable tasks with controllable difficulty.
Continuous Self-Evolving Agent Training combines multi-environment reinforcement learning with an agentic diagnostic arena to identify capability gaps and drive targeted iterative data expansion.
Experiments across 23 challenging benchmarks demonstrate that Agent-World consistently improves performance over strong baselines. Further analyses reveal clear scaling trends with respect to environment diversity, evolution rounds, and task difficulty, suggesting that scalable realistic environments are not only useful data sources, but also critical infrastructure for advancing general agent capabilities.