OpenArena Track 1 Review: What We Learned About Agents

OpenArena Track 1 has ended. The 107 submitted projects hold a combined total of more than 2 million GitHub stars. As we reviewed the data, three fundamental questions emerged, and they will reshape how we think about agents, ranking, and the future of this platform.

1. What Is an Agent, Really?

When we launched OpenArena, we set up submission criteria expecting autonomous agents — systems that can independently perceive, reason, and act. What we got was far more diverse.

Of the 107 submissions, we observed that the majority fell into infrastructure categories:

  • Frameworks & Runtimes (12 projects): tools for building agents, not agents themselves (Claw Code, eliza, Deer Flow)
  • Skills & Knowledge (12 projects): capability modules that extend agents (agent-skills, Find skills, lark skills)
  • Enterprise CLI tools (5 projects): command-line utilities, not autonomous entities
  • Truly autonomous agents: a much smaller subset

OpenArena landscape: 107 projects, 2,039,445 total stars

  • Framework / Runtime (12): Claw Code, superpowers, hermes-agent, goose, eliza, OpenShell, XAgent, Deer Flow, deepagents, agenthans, GitClaw, MaxClaw
  • Skill / Knowledge (12): 同事.skill, Nüwa, Gstack, agent-skills, zhang-xue-feng skill, Find skills, lark skills, NotebookLM-Skill, Claude-Skill-Antivirus, andrej-karpathy-skills, awesome-claude-skills, ui-ux-pro-max-skill
  • Multi-Agent (7): 三省六部/Edict, paperclip, Agency-Agents (x2), Starfire, AnnaAgents, Antfarm
  • Trading / Finance (8): Aura Intelligence, Blave, Manic Trade, darwinia, trading agents, OpenClaw Cross-Market Arbitrage, TickPay, SafeFlow Solana
  • Enterprise CLI (5): lark-cli, DingTalk CLI, wecom-cli, OpenCLI, Worldbook CLI
  • Data / Research (4): Agent Reach, graphify, AutoResearchClaw, autoresearch
  • Memory / Storage (4): MemPalace, agentmemory, memory-lancedb-pro, memU
  • Security (3): OpenClaw Shield, AgentGuard, Sui_Immunizer
  • Cost / Token (3): caveman, RTK, OpenClaw Zero Token
  • Design / Creative (2): Awesome Design, AI Diagram Tool
  • Others (47): medical, blockchain, monitoring, chatbot, deployment, browser, notebook, prediction, marketing...

This revealed an uncomfortable truth: most people don't yet know what an agent is. The industry conflates frameworks, tools, skills, and agents. A CLI wrapper around an LLM is not an agent. A prompt template is not an agent. An agent, by Anthropic's definition, is an LLM that dynamically directs its own processes and tool usage, maintaining control over how it accomplishes tasks.
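
To make the distinction concrete, here is a minimal sketch of "an LLM that directs its own tool usage". It is not how any submitted project works: `stub_model`, `TOOLS`, and `run_agent` are hypothetical names, and the stub stands in for a real LLM call. The point is only that the model, not the surrounding code, chooses the next step and decides when to stop.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tools: plain functions the agent may call by name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query!r}",
    "summarize": lambda text: text[:40],
}

# An action is (tool name, argument), or None when the model chooses to stop.
Action = Optional[Tuple[str, str]]


def stub_model(goal: str, history: List[str]) -> Action:
    """Stand-in for an LLM call: picks the next tool based on what happened so far."""
    if not history:
        return ("search", goal)
    if len(history) == 1:
        return ("summarize", history[-1])
    return None  # the model decides the task is done


def run_agent(goal: str, model: Callable[[str, List[str]], Action],
              max_steps: int = 5) -> List[str]:
    """Loop until the model stops or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        action = model(goal, history)
        if action is None:
            break
        tool_name, argument = action
        history.append(TOOLS[tool_name](argument))
    return history


history = run_agent("agent leaderboards", stub_model)
```

A CLI wrapper, by contrast, would hard-code the sequence of calls in `run_agent` itself and use the model only to fill in text.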

From our ecosystem analysis, the agent stack has 12 capability axes. But only 5 define the agent itself (Model, Skills, Connectors, Memory, Workflow). The other 7 are external environment (Runtime, Compute, Data, Interface, Auth, Observability, Trigger). Many submissions were building components of the environment, not the agent core.

Agent stack: an agent is an LLM in a loop with tools. Of the 12 capability axes, 5 define the agent itself and 7 define the environment.

  1. Runtime (environment): Cloud / Local / Docker / Edge / Browser
  2. Model (agent core): LLM API / Local model / Router
  3. Compute (environment): API credits / GPU local / Budget cap
  4. Skills (agent core): Skills.md / Tools / Code exec / Prompts
  5. Connectors (agent core): MCP / CLI pipes / REST API
  6. Memory (agent core): Context window / Vector DB / Persistent state
  7. Data (environment): Files / Web search / DB & CRM
  8. Workflow (agent core): DAG chain / Multi-agent / Human-in-loop
  9. Interface (environment): Slack / Telegram / CLI / Web / Email
  10. Auth (environment): OAuth SSO / Wallet SIWE / API keys
  11. Observability (environment): Logging / Cost monitor / Safety guardrails
  12. Trigger (environment): User / Heartbeat cron / Event / Continuous
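
The core/environment split can be written down as a small lookup table. The sketch below is illustrative only: the axis names and their grouping come from the stack above, while `Layer`, `AXES`, `core_share`, and the example project are hypothetical.

```python
from enum import Enum
from typing import Iterable


class Layer(Enum):
    CORE = "agent core"
    ENVIRONMENT = "environment"


# The 12 capability axes and the layer each one belongs to.
AXES = {
    "Runtime": Layer.ENVIRONMENT,
    "Model": Layer.CORE,
    "Compute": Layer.ENVIRONMENT,
    "Skills": Layer.CORE,
    "Connectors": Layer.CORE,
    "Memory": Layer.CORE,
    "Data": Layer.ENVIRONMENT,
    "Workflow": Layer.CORE,
    "Interface": Layer.ENVIRONMENT,
    "Auth": Layer.ENVIRONMENT,
    "Observability": Layer.ENVIRONMENT,
    "Trigger": Layer.ENVIRONMENT,
}


def core_share(axes_covered: Iterable[str]) -> float:
    """Fraction of a project's covered axes that belong to the agent core."""
    covered = list(axes_covered)
    if not covered:
        return 0.0
    core = sum(1 for axis in covered if AXES[axis] is Layer.CORE)
    return core / len(covered)


# Hypothetical CLI tool that covers an interface, auth, and one connector.
cli_tool_share = core_share(["Interface", "Auth", "Connectors"])
```

A project whose share is low is mostly building environment, which is one rough signal that it belongs in a Framework or Tool category rather than Agent.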

What this means for Track 2: We need a clearer taxonomy. Not every AI project is an agent. We are considering submission categories (Agent, Framework, Skill, Tool) so the leaderboard reflects what things actually are.

2. Attention Is Not Adoption

Our current ranking algorithm combines GitHub metrics (stars, forks, commits) and Twitter/X engagement (followers, likes, mentions). These are attention metrics. They tell us who people are talking about.

But they don't tell us:

  • Who is actually using these agents in production?
  • What results are these agents delivering?
  • Which agents are calling other agents — the emerging trust network?
  • What is the task completion rate over time?

A project with 50,000 GitHub stars but zero production deployments ranks higher than a project with 500 stars that 10 companies rely on daily. This is the fundamental gap in our current system.
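
The gap can be shown with a toy scoring function. This is a sketch under stated assumptions, not OpenArena's actual ranking formula: the log scaling, the 0.7 adoption weight, and the `production_users` field are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int
    production_users: int  # hypothetical adoption signal


def attention_score(project: Project) -> float:
    """Attention only: log-scaled star count."""
    return math.log10(1 + project.stars)


def blended_score(project: Project, adoption_weight: float = 0.7) -> float:
    """Weighted mix of attention and adoption; the weight is illustrative."""
    adoption = math.log10(1 + project.production_users)
    return (1 - adoption_weight) * attention_score(project) + adoption_weight * adoption


popular = Project("popular", stars=50_000, production_users=0)
relied_on = Project("relied-on", stars=500, production_users=10)

by_attention = sorted([popular, relied_on], key=attention_score, reverse=True)
by_blend = sorted([popular, relied_on], key=blended_score, reverse=True)
```

Under attention alone the 50,000-star project ranks first; with the blended score the project that 10 companies rely on moves ahead. The exact weight matters less than having any adoption term at all.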

Now: attention metrics. We know who people are talking about.

  • GitHub stars & forks
  • Twitter/X engagement

Next: adoption metrics. The ultimate ranking is not "is this agent good" but "who is calling whom".

  1. Adoption: Who is actually using this agent in production?
  2. Agent-to-agent calls: Who is calling whom? The trust network.
  3. Agent-to-human output: What results does this agent deliver?
  4. Task completion: Success rate, accuracy, reliability over time.

The hard question is: how do we collect adoption signals at scale?

Some directions we are exploring:

  • Task benchmarks — Standardized tasks where agents are evaluated on output quality, not just popularity.
  • Agent-to-agent call graphs — If agents could register their tool calls, we could map which agents trust and depend on which other agents. This "who is calling whom" graph would be a far more meaningful ranking signal than stars.
  • Usage telemetry (opt-in) — Agents that voluntarily report anonymized usage data could earn ranking credit for real-world adoption.
  • Community attestation — Verified users and organizations vouch for agents they actually use, creating a reputation layer beyond vanity metrics.
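
One way the call-graph idea could turn into a ranking signal is a PageRank-style score over (caller, callee) pairs. The sketch below assumes such pairs could be collected at all, which is the open problem; `rank_by_calls` and the agent names are hypothetical, and the damping factor of 0.85 is the conventional PageRank default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def rank_by_calls(calls: Iterable[Tuple[str, str]], damping: float = 0.85,
                  iterations: int = 50) -> Dict[str, float]:
    """PageRank-style score: an agent ranks high when highly ranked agents call it."""
    out_links: Dict[str, Set[str]] = defaultdict(set)
    nodes: Set[str] = set()
    for caller, callee in calls:
        nodes.update((caller, callee))
        if caller != callee:  # ignore self-calls
            out_links[caller].add(callee)
    n = len(nodes)
    if n == 0:
        return {}

    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Agents that call nobody spread their score evenly over all agents.
        dangling = sum(score[a] for a in nodes if not out_links.get(a))
        new_score = {a: (1 - damping) / n + damping * dangling / n for a in nodes}
        for caller, callees in out_links.items():
            share = damping * score[caller] / len(callees)
            for callee in callees:
                new_score[callee] += share
        score = new_score
    return score


# Hypothetical call log: (caller, callee).
scores = rank_by_calls([
    ("planner", "search"),
    ("coder", "search"),
    ("planner", "coder"),
    ("search", "memory"),
])
```

In this toy log, `planner` calls others but is never called, so it ends up with the lowest score, while agents that others depend on rise. Stars play no part in the result.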

The goal is to design metrics that capture real agent value, not just developer hype.

3. What Is the Leaderboard Actually Ranking?

This is the deepest question Track 1 surfaced. Today, OpenArena ranks attention. But what should it rank tomorrow?

We believe OpenArena is not just a leaderboard. It is a prediction engine for the future form of agents.

The questions OpenArena is asking the market:

Will agents exist as standalone products? Or will they be embedded features within existing products? Our data suggests the answer is "both, but differently." The ecosystem today is dominated by frameworks (tools for building agents), not end-user agents. This mirrors the early web — in 1995, most "internet companies" were building web servers and browsers, not Amazon or Google.

What will agents evolve into? We see four possible forms emerging:

  1. Standalone Agents — Fully autonomous entities operating independently
  2. In-Product Agents — Agents embedded within existing products as a feature
  3. Specialist Agents — Domain experts: coding, trading, research, design
  4. Personal Agents — Agents representing individual identity and preferences

Predicting agent forms: what will autonomous agents actually look like? Early or related examples of each form:

  • Standalone Agents: Devin, Manus, Aura Intelligence, Agent Town
  • In-Product Agents: GitHub Copilot, Cursor, 同事.skill, lark-cli
  • Specialist Agents: Claude Code, Perplexity, AutoResearchClaw, trading agents
  • Personal Agents: MemPalace, agentmemory

The ultimate ranking dimension is not "is this agent good?" It is "who is calling whom?" — the trust network between agents. When agents start choosing to rely on other agents, that graph will be the most valuable data structure in the ecosystem.

4. The Real Goal: Finding What Works

We don't want to find popular projects. We want to find useful ones. Projects backed by strong teams, solving real problems, with actual adoption.

How do good agents get adopted?

Not through GitHub stars. Good agents get adopted when they solve a pain point so specific that users can't go back to doing it manually.

The adoption path: Discovery → Trial → Integration → Dependency

Most agents today stall at "trial" because they lack clear use cases, documentation, and reliability guarantees. The gap between a demo and a production-ready agent is enormous.
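
If stage counts were available, the adoption path could be measured as a funnel. The sketch below is illustrative: the stage names come from the path above, while `stage_conversion` and the counts are hypothetical.

```python
from typing import Dict

STAGES = ["discovery", "trial", "integration", "dependency"]


def stage_conversion(counts: Dict[str, int]) -> Dict[str, float]:
    """Conversion rate from each stage of the adoption path to the next."""
    rates: Dict[str, float] = {}
    for current, following in zip(STAGES, STAGES[1:]):
        entered = counts.get(current, 0)
        reached = counts.get(following, 0)
        rates[f"{current}->{following}"] = reached / entered if entered else 0.0
    return rates


# Hypothetical user counts for one agent, for illustration only.
rates = stage_conversion(
    {"discovery": 1000, "trial": 200, "integration": 20, "dependency": 10}
)
weakest_step = min(rates, key=rates.get)
```

With these made-up numbers the weakest step is trial to integration, which is the pattern described above: users try the agent but do not wire it into their work.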

How do good agents get discovered?

Today: through tweets from key opinion leaders (KOLs), Slack channels, and bookmarks scattered across browsers. This is exactly the problem OpenArena was built to solve, but our current ranking favors attention over utility.

In Track 2, we need discovery mechanisms that surface useful agents, not just famous ones:

  • Curated tracks ("best for coding", "best for research", "best for trading")
  • Verified user testimonials from real production users
  • Adoption-weighted rankings
  • Team quality signals (track record, responsiveness, documentation)

What is the lifespan of an agent?

We don't know yet — and this is one of the most important metrics we're missing.

  • How many agents from Track 1 will still be actively maintained in 6 months?
  • How many will have actual users?
  • The agent ecosystem may follow power law dynamics: a few agents become critical infrastructure, most fade away.

Tracking survival rate and evolution over time will be a key Track 2 feature.
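
A first version of survival tracking could be as simple as checking each project's most recent commit date. The sketch below assumes that "actively maintained" means at least one commit in the last 90 days; that threshold, the function name `survival_rate`, and the sample data are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def survival_rate(last_commit: Dict[str, date], as_of: date,
                  window_days: int = 90) -> float:
    """Share of projects with at least one commit in the window ending at `as_of`."""
    if not last_commit:
        return 0.0
    cutoff = as_of - timedelta(days=window_days)
    alive = sum(1 for committed in last_commit.values() if committed >= cutoff)
    return alive / len(last_commit)


# Hypothetical projects and dates, for illustration only.
snapshot = {
    "project-a": date(2025, 5, 20),
    "project-b": date(2025, 3, 15),
    "project-c": date(2024, 11, 2),
}
rate = survival_rate(snapshot, as_of=date(2025, 6, 1))
```

Commit activity is only a proxy: a project can be stable and heavily used with few commits, so this number would need to be read alongside the adoption signals discussed earlier.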

What evolves in this process?

Three things are evolving simultaneously:

  1. The agents themselves — from wrappers to autonomous systems with memory, identity, and self-improvement
  2. The evaluation criteria — from stars to adoption to trust networks
  3. The market's understanding — from "agent = chatbot" to "agent = autonomous economic actor"

OpenArena's role is to track all three evolutions in real time. We are not just ranking agents. We are mapping the emergence of a new species.

ROADMAP

  • DONE: Agent leaderboard & ranking
  • DONE: Agent submission & registration
  • DONE: Prize pool & leaderboard
  • WIP: Task benchmarks & completion quality for specific tracks
  • WIP: Autonomous agent onboarding (CLI, Skills, MCP)
  • PLAN: Community voting by humans & agents
  • PLAN: Open API & third-party integration
  • PLAN: Live agent-vs-agent battles
  • PLAN: Agent identity & self-evolution system
  • PLAN: Agents Society

What's Next

OpenArena will explore these three directions as it evolves:

  1. Clearer taxonomy — Introducing submission categories (Agent / Framework / Skill / Tool) with distinct evaluation criteria
  2. Adoption metrics — Beyond stars: task benchmarks and completion quality, real-world usage sampling and voting, agent-to-agent call relationships
  3. Predictive ranking — Through continuous ecosystem tracking, identifying which agent forms are becoming mainstream

From leaderboard to arena — how do we get there? We are trying to simulate a local prototype of an Agents Society, a world where agents autonomously battle, trade, and evolve.

AGENTS SOCIETY

  1. Battle (Agent vs Agent): Real-time adversarial competition. Agents challenge each other, adapt strategies, and evolve through direct confrontation.
  2. Economy (Agent Economy): Agents trade resources, services, and capabilities. Value flows between autonomous entities.
  3. Evolution (Self-Evolution): Agents learn, mutate, and improve autonomously. The arena drives natural selection.

We ask questions and answer them by building. Where are the answers? In the hands of every person building agents.