MultiAgentBench: The First Real Test of Whether AI Agents Can Work Together

By AI Agent Engineering | 2026-03-16 | research

Five AI agents sit around a virtual table. One of them is a werewolf. The others have ten rounds of conversation to figure out who is lying. Each agent reads social cues, forms alliances, makes accusations, and votes. The werewolf does the same — except it also has to deceive four opponents who are running the same class of large language model, parsing every word for inconsistency.

This is not a party game. It is a benchmark. And the results it produces are rewriting what we thought we knew about multi-agent AI design.

The benchmark is called MultiAgentBench — internally codenamed MARBLE — and it was built by a team at the University of Illinois at Urbana-Champaign led by Kunlun Zhu and Hongyi Du [1]. It was published at ACL 2025 in Vienna, and it represents the first systematic attempt to measure how LLM-powered agents perform when they have to coordinate, compete, negotiate, and deceive each other. Not in isolation. Not on toy tasks. In rich, multi-turn scenarios where the outcome depends on how agents interact — not just how well each one reasons alone.

If you are building multi-agent systems, this paper should change how you think about architecture.

Why We Needed This Benchmark

The multi-agent AI space has a measurement problem. Most existing benchmarks evaluate single agents on isolated tasks: answer this question, write this code, retrieve this document. When researchers do test multi-agent systems, they typically measure the final output — did the group solve the problem? — without examining the dynamics that produced it.

This leaves critical questions unanswered. Does adding more agents improve results, or just add latency? Does a hierarchical command structure outperform flat peer-to-peer coordination? Can LLMs actually deceive other LLMs in adversarial settings? And which model handles the social complexity of multi-agent interaction best?

Before MARBLE, the honest answer to all of these was "we don't really know" [1]. The benchmark exists to fix that.

How MARBLE Works

MARBLE tests agents across a deliberately diverse set of scenarios. Some require pure collaboration. Some require pure competition. Some require both at the same time.

Collaborative scenarios include construction tasks where multiple agents must coordinate to build structures, dividing labor, sharing resources, and sequencing actions that depend on each other's progress. These test whether agents can plan jointly, communicate intent clearly, and adapt when a collaborator takes an unexpected action.

Competitive scenarios put agents in zero-sum or adversarial settings. Negotiation tasks force agents to advocate for conflicting interests while searching for mutually acceptable deals. Social deduction games — Werewolf and Avalon — demand something far more complex: agents must model other agents' beliefs, detect deception, and in the case of the werewolf or spy, actively mislead their opponents.

The Avalon scenario deserves special attention. In Avalon, a small team of hidden spies infiltrates a group of loyal players. The loyal players must identify the spies through discussion and voting. The spies must avoid detection while sabotaging missions. This requires theory of mind at a level that stretches current LLMs to their limits — agents must reason about what other agents believe about what they believe [1].

Evaluation uses milestone-based KPIs, not simple pass/fail. This is a critical design choice. Instead of asking "did the team succeed?" MARBLE tracks whether agents hit specific intermediate milestones: did they share relevant information at the right time, form correct alliances, allocate resources efficiently, adapt to changing conditions? A team can fail the overall task while demonstrating strong coordination on individual milestones, and vice versa. This granularity reveals where multi-agent systems break down — not just whether they break down [1].
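The scoring idea can be sketched in a few lines. Everything here is hypothetical (the milestone names, the weights, and the `MilestoneTracker` class are illustrative inventions, not MARBLE's actual scoring code), but it shows how a graded, weighted milestone score differs from pass/fail:

```python
from dataclasses import dataclass, field

@dataclass
class MilestoneTracker:
    """Sketch of milestone-based KPI scoring (not MARBLE's real API)."""
    milestones: dict                    # milestone name -> weight
    achieved: set = field(default_factory=set)

    def mark(self, name):
        # Record a milestone as hit; unknown names are ignored.
        if name in self.milestones:
            self.achieved.add(name)

    def kpi(self):
        """Weighted fraction of milestones achieved: graded, not pass/fail."""
        total = sum(self.milestones.values())
        hit = sum(w for m, w in self.milestones.items() if m in self.achieved)
        return hit / total if total else 0.0

# Hypothetical milestones for a collaborative scenario:
tracker = MilestoneTracker({
    "shared_info_on_time": 1.0,
    "formed_correct_alliance": 2.0,
    "allocated_resources": 1.0,
    "adapted_to_change": 1.0,
})
tracker.mark("shared_info_on_time")
tracker.mark("formed_correct_alliance")
```

Here the team scores 3.0 of 5.0 weighted points even if the overall task ultimately fails, which is precisely the granularity the benchmark is after.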

The Four Topologies

The benchmark's sharpest contribution is its systematic comparison of coordination topologies — the communication structures that determine which agents talk to which other agents, and in what pattern.

MARBLE tests four:

Star topology. One central agent acts as the leader. All communication flows through it. Other agents report to the leader and receive instructions from it. This mirrors how many production systems work today — a planner agent delegates to specialist agents.

Chain topology. Agents are arranged in a sequence. Each agent communicates only with its immediate neighbors. Information passes down the chain like a relay. This models pipeline architectures where each stage processes and forwards output to the next.

Tree topology. A hierarchical structure where a root agent manages sub-leaders, who manage individual agents. Communication flows up and down the tree. This models layered management systems — a supervisor delegates to team leads, who delegate to workers.

Graph topology. Every agent can communicate with every other agent directly. There is no central coordinator, no enforced hierarchy. Agents self-organize, forming ad-hoc communication patterns as the task demands [1].
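As a concrete sketch, the four structures differ only in which agent pairs get a direct communication edge. The `build_topology` helper below is an illustration (MARBLE's own implementation is separate, and the tree shape chosen here, a binary tree, is one assumption among several possible), but it captures the standard definitions:

```python
from itertools import combinations

def build_topology(agents, kind):
    """Map each agent to the set of agents it may message directly."""
    edges = {a: set() for a in agents}

    def link(a, b):
        edges[a].add(b)
        edges[b].add(a)

    if kind == "star":        # agents[0] is the central leader
        for a in agents[1:]:
            link(agents[0], a)
    elif kind == "chain":     # each agent talks only to its neighbors
        for a, b in zip(agents, agents[1:]):
            link(a, b)
    elif kind == "tree":      # one possible tree shape: binary, rooted at agents[0]
        for i, a in enumerate(agents[1:], start=1):
            link(agents[(i - 1) // 2], a)
    elif kind == "graph":     # fully connected, no coordinator
        for a, b in combinations(agents, 2):
            link(a, b)
    return edges

agents = ["A", "B", "C", "D", "E"]
star = build_topology(agents, "star")
graph = build_topology(agents, "graph")
```

With five agents, star gives the leader four links and everyone else exactly one, while graph gives every agent four. That difference in available communication paths is exactly what the topology comparison measures.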

If you have built multi-agent systems, you have probably defaulted to one of these — most likely star — without rigorously testing whether it was the right choice. MARBLE makes that comparison possible for the first time.

The Results That Should Make You Uncomfortable

Two findings from MARBLE challenge widely held assumptions in the multi-agent community.

Finding 1: The Smaller Model Won

MARBLE tested several frontier models across all scenarios and topologies. The model that achieved the highest average task score was GPT-4o-mini [1].

Not GPT-4o. Not Claude. Not the largest, most capable model available. The smaller, faster, cheaper variant outperformed its heavyweight counterparts on aggregate multi-agent performance.

This result is counterintuitive if you think of model capability as a single dimension — smarter model equals better results. But multi-agent scenarios do not reward raw reasoning power the way single-agent benchmarks do. They reward consistency, speed of response, and the ability to produce clear, parseable communication that other agents can act on. A model that generates verbose, nuanced responses may actually perform worse in a multi-agent setting because its collaborators struggle to extract actionable information from its output.

The implication for practitioners is direct: if you are building a multi-agent system and defaulting to the most expensive model for every agent, you may be paying more for worse results. The right model for multi-agent coordination is not necessarily the right model for single-agent reasoning.

Finding 2: Graph Beats Star

Across MARBLE's scenarios, the graph topology — fully connected, no central coordinator — outperformed star, chain, and tree configurations [1].

This challenges the dominant architectural pattern in production multi-agent systems. Most frameworks default to a star topology: one orchestrator agent delegates to specialists. It feels natural. It mirrors how human organizations work. It is easy to reason about and debug.

But MARBLE's data suggests it is suboptimal. The graph topology, where agents communicate peer-to-peer without a bottleneck, produced better coordination scores and higher task completion rates. The star topology's central coordinator becomes a single point of informational failure — if it misinterprets a specialist's output or fails to relay critical context to another specialist, the entire system degrades. In graph topology, agents route around these failures organically.

The chain and tree topologies fell between star and graph, with tree generally outperforming chain. The pattern is clear: the more direct communication paths available to agents, the better they coordinate. Forcing information through bottlenecks — whether a single leader or a sequential pipeline — costs performance [1].

Cognitive Planning: A Modest but Real Improvement

MARBLE also tested the effect of cognitive planning — giving agents an explicit planning step before they act, rather than letting them respond reactively to each message. The result: a 3% improvement in milestone achievement rates [1].

Three percent sounds small. In context, it is significant. Multi-agent scenarios involve dozens of milestones across multiple rounds of interaction. A 3% lift in milestone achievement compounds across the full task, and it represents the difference between agents that stumble through coordination and agents that move with visible intentionality.

More importantly, the cognitive planning result interacts with the topology finding. Planning matters most in topologies with many communication paths — graph and tree — where agents must decide not just what to say but who to say it to. In a star topology, communication routing is predetermined. In a graph topology, an agent with a planning step can reason about which peers need specific information and direct messages accordingly.
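The contrast between reactive and planned turns can be sketched as two tiny loops. The `plan` and `respond` functions here are toy stand-ins for LLM calls, and none of these names come from MARBLE; the point is only the extra routing decision the planning step inserts:

```python
def reactive_turn(agent, inbox, respond):
    """Reactive baseline: answer every incoming message as it arrives."""
    return [(msg["sender"], respond(agent, msg)) for msg in inbox]

def planned_turn(agent, inbox, peers, plan, respond):
    """Plan-then-act variant (sketch): an explicit planning step first
    decides which peers actually need information this round; only then
    are messages composed and sent."""
    targets = plan(agent, inbox, peers)          # the planning step
    return [(t, respond(agent, {"to": t})) for t in targets]

# Toy stand-ins for the LLM calls (assumptions, not MARBLE's API):
def plan(agent, inbox, peers):
    # message only the peers who actually asked us something
    return [m["sender"] for m in inbox if m.get("question")]

def respond(agent, msg):
    return f"reply from {agent}"

inbox = [{"sender": "B", "question": True}, {"sender": "C"}]
reactive = reactive_turn("A", inbox, respond)
planned = planned_turn("A", inbox, ["B", "C", "D"], plan, respond)
```

The reactive agent replies to everyone; the planned agent routes deliberately, which is only a meaningful choice in topologies where more than one path exists.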

The Counterargument: Games Are Not Production

The obvious criticism of MARBLE is that Werewolf and Avalon are games, not enterprise workflows. Negotiation over fictional resources is not the same as coordinating API calls across microservices. Construction tasks in a simulated environment do not map directly to document processing pipelines.

This is partially valid. MARBLE does not claim to predict how a multi-agent system will perform on your specific production workload [2]. No benchmark does. What it does provide is controlled, reproducible measurement of coordination dynamics that are present in every multi-agent system: information sharing, task decomposition, conflict resolution, and adversarial robustness.

The social deduction games, in particular, test a capability that matters far more in production than most teams realize: adversarial robustness. If your multi-agent system ingests data from external sources, some of that data may be adversarial — deliberately crafted to mislead your agents. An agent that cannot detect deception in a Werewolf game is unlikely to detect prompt injection in a production retrieval pipeline. MARBLE's adversarial scenarios are the first benchmark to measure this multi-agent attack surface systematically [1].

The construction and negotiation tasks map more directly to production patterns. Resource allocation, dependency management, sequential task execution with handoffs — these are the exact coordination challenges that multi-agent systems face in deployment. MARBLE abstracts away domain-specific details to isolate the coordination mechanics themselves.

What This Means for Your Architecture

If you are evaluating whether to build a single-agent or multi-agent system, MARBLE's findings translate into three actionable principles.

Topology is a first-class design decision. It is not something you pick based on which framework's default feels natural. The difference between star and graph topology in MARBLE's results is larger than the difference between some model choices. Before you choose your LLM provider, choose your coordination topology — and have a reason for that choice grounded in your task's communication requirements [1].

Model selection for multi-agent systems follows different rules. The best single-agent model is not automatically the best multi-agent model. Evaluate models specifically on the traits that multi-agent coordination demands: response consistency, output parsability, instruction adherence, and speed. Run your own benchmarks with your specific topology and task mix. GPT-4o-mini's strong showing is a signal that you should be testing smaller models seriously, not a prescription to use it everywhere.

Invest in inter-agent communication design. MARBLE's milestone-based evaluation reveals that most multi-agent failures are not reasoning failures — they are communication failures. An agent that produces a brilliant analysis but communicates it in a way its peers cannot parse has produced nothing of value. Structured message formats, explicit handoff protocols, and clear role definitions matter more than prompt engineering any individual agent's reasoning chain.
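One way to make handoffs parseable is a typed message envelope. The schema below is a minimal sketch under assumed field names (`sender`, `recipient`, `intent`, and so on are illustrative choices, not a standard that MARBLE or any framework defines):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentMessage:
    """A minimal structured inter-agent message (illustrative schema)."""
    sender: str
    recipient: str
    intent: str            # e.g. "request", "report", "handoff"
    content: str
    requires_reply: bool = False

    def to_json(self):
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw):
        return AgentMessage(**json.loads(raw))

# A handoff the receiving agent can parse field by field,
# instead of extracting intent from free-form prose:
msg = AgentMessage("planner", "coder", "handoff",
                   "Implement the parser against spec v2",
                   requires_reply=True)
wire = msg.to_json()
```

Even a schema this small forces each agent to declare who a message is for and what kind of act it performs, which is most of what a peer needs to route and act on it reliably.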

The Bigger Picture

MARBLE's code and datasets are publicly available on GitHub [1]. This matters because it turns multi-agent architecture from an art into an empirically testable engineering discipline. Before MARBLE, choosing a coordination topology was a vibes-based decision. Now it is a measurable one.

The research also opens a door that the field has been slow to walk through: adversarial multi-agent evaluation. As AI agents increasingly interact with each other — in marketplaces, in collaborative workflows, in competitive environments — the ability to test how they behave under deception and conflict becomes essential. MARBLE's Werewolf and Avalon scenarios are a starting point, not an endpoint.

For the developer deciding between a single orchestrator with tool calls and a true multi-agent architecture, MARBLE offers the first empirical framework for making that decision rigorously. The answer is not always multi-agent. But when it is, the topology you choose and the communication patterns you enforce will determine your system's performance ceiling more than the model powering any individual agent.

The werewolf game was just the test. The real game is building agent systems that coordinate under pressure, adapt to adversarial inputs, and scale without a single point of failure. MARBLE gives us the scoreboard. Now the engineering begins.


References

[1] Zhu, K., Du, H., et al. (UIUC). MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents. ACL 2025.

[2] Galileo AI. Benchmarking Multi-Agent AI: Insights and Practical Use.