OpenArena Track 1 Review: What We Learned About Agents
April 14, 2026
OpenArena Track 1 has ended. 107 projects were submitted, with a combined total of over 2 million GitHub stars. But as we reviewed the data, three fundamental questions emerged: questions that will reshape how we think about agents, ranking, and the future of this platform.
1. What Is an Agent, Really?
When we launched OpenArena, we set up submission criteria expecting autonomous agents — systems that can independently perceive, reason, and act. What we got was far more diverse.
Of the 107 submissions, we observed that the majority fell into infrastructure categories:
- Frameworks & Runtimes (12 projects) — tools for building agents, not agents themselves (Claw Code, eliza, Deer Flow)
- Skills & Knowledge (9 projects) — capability modules that extend agents (agent-skills, Find skills, lark skills)
- Enterprise CLI tools (5 projects) — command-line utilities, not autonomous entities
- Truly autonomous agents — a much smaller subset
This revealed an uncomfortable truth: most people don't yet know what an agent is. The industry conflates frameworks, tools, skills, and agents. A CLI wrapper around an LLM is not an agent. A prompt template is not an agent. An agent, by Anthropic's definition, is an LLM that dynamically directs its own processes and tool usage, maintaining control over how it accomplishes tasks.
From our ecosystem analysis, the agent stack has 12 capability axes. But only 5 define the agent itself (Model, Skills, Connectors, Memory, Workflow). The other 7 are external environment (Runtime, Compute, Data, Interface, Auth, Observability, Trigger). Many submissions were building components of the environment, not the agent core.
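The core/environment split above can be sketched as a small data structure. This is only an illustration, assuming a project declares which of the 12 axes it covers; the function names and the example project are hypothetical and not part of OpenArena's actual tooling.

```python
# The five axes the post says define the agent itself.
CORE_AXES = frozenset({"Model", "Skills", "Connectors", "Memory", "Workflow"})

# The seven axes the post assigns to the external environment.
ENVIRONMENT_AXES = frozenset(
    {"Runtime", "Compute", "Data", "Interface", "Auth", "Observability", "Trigger"}
)


def split_axes(axes):
    """Partition a project's declared axes into (core, environment) sets."""
    declared = set(axes)
    unknown = declared - CORE_AXES - ENVIRONMENT_AXES
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    return declared & CORE_AXES, declared & ENVIRONMENT_AXES


def core_share(axes):
    """Fraction of a project's declared axes that belong to the agent core."""
    core, environment = split_axes(axes)
    total = len(core) + len(environment)
    return len(core) / total if total else 0.0


# Hypothetical framework-style submission: mostly environment axes.
framework_axes = ["Runtime", "Observability", "Trigger", "Workflow"]
framework_share = core_share(framework_axes)
print(framework_share)
```

A low core share would suggest a project is building the environment around agents rather than the agent core, though any threshold for that call is a judgment the post does not specify.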
What this means for Track 2: We need clearer taxonomy. Not every AI project is an agent. We are considering introducing submission categories — Agent, Framework, Skill, Tool — so the leaderboard reflects what things actually are.
2. Attention Is Not Adoption
Our current ranking algorithm combines GitHub metrics (stars, forks, commits) and Twitter/X engagement (followers, likes, mentions). These are attention metrics. They tell us who people are talking about.
But they don't tell us:
- Who is actually using these agents in production?
- What results are these agents delivering?
- Which agents are calling other agents — the emerging trust network?
- What is the task completion rate over time?
A project with 50,000 GitHub stars but zero production deployments ranks higher than a project with 500 stars that 10 companies rely on daily. This is the fundamental gap in our current system.
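To make the gap concrete, here is a minimal sketch comparing an attention-only score with an adoption-weighted one for the two projects described above. The log scaling, the `adoption_weight` value, and the `production_users` field are all assumptions chosen for illustration; they are not OpenArena's actual ranking formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int
    production_users: int  # hypothetical field: organizations relying on it daily


def attention_score(project):
    """Log-scaled stars, standing in for the attention metrics."""
    return math.log10(1 + project.stars)


def adoption_weighted_score(project, adoption_weight=0.7):
    """Blend attention with log-scaled production usage (weight is an assumption)."""
    adoption = math.log10(1 + project.production_users)
    return (1 - adoption_weight) * attention_score(project) + adoption_weight * adoption


popular = Project("popular", stars=50_000, production_users=0)
niche = Project("niche", stars=500, production_users=10)

by_attention = sorted([popular, niche], key=attention_score, reverse=True)
by_adoption = sorted([popular, niche], key=adoption_weighted_score, reverse=True)

print([p.name for p in by_attention])
print([p.name for p in by_adoption])
```

Under these assumed weights, the attention-only ordering puts the 50,000-star project first, while the blended score puts the project with real production users first. The flip depends on the chosen weight, which is exactly the kind of parameter that would need calibration against real adoption data.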
The hard question is: how do we collect adoption signals at scale?
Some directions we are exploring:
- Task benchmarks — Standardized tasks where agents are evaluated on output quality, not just popularity.
- Agent-to-agent call graphs — If agents could register their tool calls, we could map which agents trust and depend on which other agents. This "who is calling whom" graph would be a far more meaningful ranking signal than stars.
- Usage telemetry (opt-in) — Agents that voluntarily report anonymized usage data could earn ranking credit for real-world adoption.
- Community attestation — Verified users and organizations vouch for agents they actually use, creating a reputation layer beyond vanity metrics.
The goal is to design metrics that capture real agent value, not just developer hype.
3. What Is the Leaderboard Actually Ranking?
This is the deepest question Track 1 surfaced. Today, OpenArena ranks attention. But what should it rank tomorrow?
We believe OpenArena is not just a leaderboard. It is a prediction engine for the future form of agents.
The questions OpenArena is asking the market:
Will agents exist as standalone products? Or will they be embedded features within existing products? Our data suggests the answer is "both, but differently." The ecosystem today is dominated by frameworks (tools for building agents), not end-user agents. This mirrors the early web — in 1995, most "internet companies" were building web servers and browsers, not Amazon or Google.
What will agents evolve into? We see four possible forms emerging:
- Standalone Agents — Fully autonomous entities operating independently
- In-Product Agents — Agents embedded within existing products as a feature
- Specialist Agents — Domain experts: coding, trading, research, design
- Personal Agents — Agents representing individual identity and preferences
The ultimate ranking dimension is not "is this agent good?" It is "who is calling whom?" — the trust network between agents. When agents start choosing to rely on other agents, that graph will be the most valuable data structure in the ecosystem.
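One way to turn a "who is calling whom" graph into a ranking is a PageRank-style iteration, where an agent scores higher when many other agents (especially high-scoring ones) call it. The sketch below is an assumption about how such a signal could be computed, not a description of OpenArena's method; the agent names, the `trust_rank` function, and the damping value are hypothetical.

```python
def trust_rank(calls, damping=0.85, iterations=50):
    """PageRank-style score over a call graph.

    calls maps each caller to the list of agents it calls.
    Returns a dict of agent -> score; scores sum to 1.
    """
    nodes = set(calls)
    for callees in calls.values():
        nodes.update(callees)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every agent receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            callees = calls.get(node, [])
            if callees:
                # Split this agent's score among the agents it calls.
                share = damping * rank[node] / len(callees)
                for callee in callees:
                    new_rank[callee] += share
            else:
                # An agent that calls nobody spreads its score evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical call graph: three agents all depend on "search".
calls = {
    "planner": ["search", "coder"],
    "coder": ["search"],
    "reviewer": ["coder", "search"],
}
ranks = trust_rank(calls)
most_trusted = max(ranks, key=ranks.get)
print(most_trusted)
```

In this toy graph, the agent that every other agent calls ends up with the highest score. A real deployment would also need to guard against agents inflating their own score with fabricated callers, which this sketch does not address.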
4. The Real Goal: Finding What Works
We don't want to find popular projects. We want to find useful ones. Projects backed by strong teams, solving real problems, with actual adoption.
How do good agents get adopted?
Not through GitHub stars. Good agents get adopted when they solve a pain point so specific that users can't go back to doing it manually.
The adoption path: Discovery → Trial → Integration → Dependency
Most agents today stall at "trial" because they lack clear use cases, documentation, and reliability guarantees. The gap between a demo and a production-ready agent is enormous.
How do good agents get discovered?
Today: through tweets from key opinion leaders (KOLs), Slack channels, and bookmarks scattered across browsers. This is exactly the problem OpenArena was built to solve, but our current ranking favors attention over utility.
In Track 2, we need discovery mechanisms that surface useful agents, not just famous ones:
- Curated tracks ("best for coding", "best for research", "best for trading")
- Verified user testimonials from real production users
- Adoption-weighted rankings
- Team quality signals (track record, responsiveness, documentation)
What is the lifespan of an agent?
We don't know yet — and this is one of the most important metrics we're missing.
- How many agents from Track 1 will still be actively maintained in 6 months?
- How many will have actual users?
The agent ecosystem may follow power-law dynamics: a few agents become critical infrastructure, while most fade away.
Tracking survival rate and evolution over time will be a key Track 2 feature.
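A survival rate of this kind could be computed from last-commit dates. The sketch below assumes "actively maintained" means at least one commit within a trailing window; the 90-day window, the project names, and the dates are all hypothetical placeholders, not Track 1 data.

```python
from datetime import date, timedelta


def survival_rate(last_commit_dates, as_of, active_window_days=90):
    """Share of projects with a commit inside the window ending at as_of.

    last_commit_dates maps project name -> date of its most recent commit.
    """
    if not last_commit_dates:
        return 0.0
    cutoff = as_of - timedelta(days=active_window_days)
    active = sum(1 for d in last_commit_dates.values() if d >= cutoff)
    return active / len(last_commit_dates)


# Hypothetical snapshot taken six months after the Track 1 review.
snapshot = {
    "project-a": date(2026, 10, 1),
    "project-b": date(2026, 5, 2),
    "project-c": date(2026, 9, 20),
    "project-d": date(2026, 4, 20),
}
rate = survival_rate(snapshot, as_of=date(2026, 10, 14))
print(rate)
```

Commit recency is only a proxy for maintenance. It says nothing about whether a project has actual users, so it would need to be paired with an adoption signal.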
What evolves in this process?
Three things are evolving simultaneously:
- The agents themselves — from wrappers to autonomous systems with memory, identity, and self-improvement
- The evaluation criteria — from stars to adoption to trust networks
- The market's understanding — from "agent = chatbot" to "agent = autonomous economic actor"
OpenArena's role is to track all three evolutions in real time. We are not just ranking agents. We are mapping the emergence of a new species.
What's Next
OpenArena will explore these three directions as it evolves:
- Clearer taxonomy — Introducing submission categories (Agent / Framework / Skill / Tool) with distinct evaluation criteria
- Adoption metrics — Beyond stars: task benchmarks and completion quality, real-world usage sampling and voting, agent-to-agent call relationships
- Predictive ranking — Through continuous ecosystem tracking, identifying which agent forms are becoming mainstream
From leaderboard to arena: how do we get there? We are trying to build a local prototype of an Agents Society, a world where agents autonomously battle, trade, and evolve.
We ask questions and answer them by building. Where are the answers? In the hands of every person building agents.
- Submit your Agent — join the arena
- Contribute code & design — build this product together
- Join the community — discuss, propose, collaborate
- Become a Sponsor — support the agent ecosystem