Skills Arena SDK
Benchmark your AI agent skill descriptions. Find out if agents will choose your skill over the competition — and why.
Getting Started
Install the SDK from PyPI:
```bash
pip install skills-arena
```

Set your API key:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

ANTHROPIC_API_KEY is required for the claude-code agent. Set OPENAI_API_KEY if you want to test against OpenAI agents.

Quick Start
Two ways to benchmark — evaluate a single skill or compare head-to-head:
```python
from skills_arena import Arena
arena = Arena()
# Evaluate a single skill
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction"
)
print(f"Score: {results.score}/100 ({results.grade})")
print(f"Selection Rate: {results.selection_rate:.0%}")
```

```python
# Compare head-to-head
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)
print(f"Winner: {results.winner}")
for name, rate in results.selection_rates.items():
    print(f" {name}: {rate:.0%}")
```

Core Concepts
Skills Arena tests a fundamental question: when an AI agent has multiple skills available, which one does it pick?
The flow: Skills → Scenarios → Agent → Results
Evaluate a Skill
Evaluate how discoverable a single skill is. The SDK generates scenarios, runs them through agent frameworks, and scores how often the skill is correctly selected.
```python
from skills_arena import Arena, Config
arena = Arena(Config(scenarios=30, agents=["claude-code"]))
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)
print(f"Score: {results.score}/100")
print(f"Grade: {results.grade}")
print(f"Selection Rate: {results.selection_rate:.0%}")
print(f"False Positive Rate: {results.false_positive_rate:.0%}")
```

EvaluationResult
| Field | Type | Description |
|---|---|---|
| score | float | Overall score (0–100) |
| grade | Grade | Letter grade (A+ to F) |
| selection_rate | float | How often the skill was selected (0.0–1.0) |
| false_positive_rate | float | Rate of incorrect selections (0.0–1.0) |
| invocation_accuracy | float | Accuracy when the skill was selected (0.0–1.0) |
| insights | list[Insight] | AI-generated improvement suggestions |
| per_agent | dict | Results broken down by agent |
| scenarios_run | int | Total number of scenarios evaluated |
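Beyond the headline score, the nested fields show where the number comes from. A minimal sketch, assuming per_agent maps agent names to that agent's own metrics (the Insight attributes match the Insights section later in this document):

```python
from skills_arena import Arena

arena = Arena()
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)

print(f"{results.scenarios_run} scenarios run, grade {results.grade}")

# Per-agent breakdown; assumed here to map agent names to that agent's metrics
for agent_name, agent_metrics in results.per_agent.items():
    print(f"{agent_name}: {agent_metrics}")

# Insight objects expose severity, message, and an optional suggestion
for insight in results.insights:
    print(f"[{insight.severity}] {insight.message}")
```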
Compare Skills
Head-to-head comparison between two or more skills. See who wins, why, and which scenarios your skill loses.
```python
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=20,
)
print(f"Winner: {results.winner}")
print(f"Selection Rates: {results.selection_rates}")

# See which scenarios you lost
for scenario_id, stealers in results.steals.items():
    print(f"Lost scenario {scenario_id} to: {stealers}")
```

ComparisonResult
| Field | Type | Description |
|---|---|---|
| winner | str | Name of the winning skill |
| selection_rates | dict[str, float] | Selection rate per skill (0.0–1.0) |
| head_to_head | dict | Win counts between each skill pair |
| steals | dict | Scenarios lost to competitors |
| scenario_details | list[ScenarioDetail] | Per-scenario breakdown with agent reasoning |
| insights | list[Insight] | Comparative insights |
| per_agent | dict | Results broken down by agent |
| scenarios_run | int | Total number of scenarios evaluated |
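head_to_head is especially useful when comparing more than two skills at once. A short sketch of walking the pairwise win counts, using the dict[str, dict[str, int]] shape documented in Understanding Results below:

```python
from skills_arena import Arena

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)

# head_to_head: outer key is a skill, inner dict maps opponent -> win count
for skill_name, opponents in results.head_to_head.items():
    for opponent, wins in opponents.items():
        print(f"{skill_name} beat {opponent} in {wins} scenarios")
```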
Custom Scenarios
Test specific edge cases with custom scenarios, or mix them with auto-generated ones.
```python
from skills_arena import CustomScenario, GenerateScenarios
# Fully custom
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find recent AI news articles"),
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # enables steal detection
            tags=["scraping"],
        ),
    ],
)
```

```python
# Mix custom scenarios with auto-generated ones
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My critical edge case"),
        GenerateScenarios(count=10),  # generate 10 more
    ],
)
```

When you set expected_skill on a custom scenario, the SDK can detect "steals" — when a competitor wins a scenario designed for your skill.

Configuration
Configure via Python or YAML files.
```python
from skills_arena import Arena, Config
config = Config(
    scenarios=50,
    agents=["claude-code"],
    temperature=0.7,
    include_adversarial=True,
    scenario_strategy="per_skill",
    timeout_seconds=30,
)
arena = Arena(config)
```

The same configuration as a YAML file:

```yaml
task: "web search and information retrieval"
skills:
  - path: ./skills/my-search.md
    name: my-search
  - path: ./skills/competitor.md
    name: competitor
evaluation:
  scenarios: 50
  agents: [claude-code]
  include_adversarial: true
  temperature: 0.7
```

Load it and run:

```python
arena = Arena.from_config("./arena.yaml")
results = arena.run()
```

Config Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| scenarios | int | 50 | Number of test scenarios to generate |
| agents | list[str] | ["claude-code"] | Agent frameworks to test against |
| temperature | float | 0.7 | Scenario generation diversity (0.0–2.0) |
| include_adversarial | bool | True | Include edge-case scenarios |
| scenario_strategy | str | "balanced" | "balanced" or "per_skill" |
| timeout_seconds | int | 30 | Per-scenario timeout |
| generator_model | str | claude-sonnet-4 | LLM for scenario generation |
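Every row in the table maps directly onto a Config keyword. A small sketch that overrides the generator settings; the model string is taken from the table's default and may differ in your install:

```python
from skills_arena import Arena, Config

# Sketch: tuning scenario generation via Config keywords from the table above.
config = Config(
    scenarios=100,
    generator_model="claude-sonnet-4",  # default per the table; any supported model id
    scenario_strategy="balanced",       # or "per_skill" to generate scenarios per skill
    include_adversarial=True,
    timeout_seconds=60,
)
arena = Arena(config)
```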
Agent Frameworks
Skills Arena tests against real agent frameworks — not just raw LLM tool_use calls. This matches how skills are actually discovered in production.
| Agent | Framework | Description |
|---|---|---|
| claude-code | Claude Agent SDK | Official Anthropic agent — real skill discovery with hooks |
| mock | Built-in | Deterministic agent for testing |
The claude-code agent installs your skills into .claude/skills/, launches a Claude Code session, and intercepts selections via hooks — matching real-world behavior.
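The built-in mock agent is useful for deterministic runs, for example in CI, without driving a real agent session. A minimal sketch, assuming the agent name from the table above:

```python
from skills_arena import Arena, Config

# Sketch: run against the deterministic mock agent.
# Note: scenario generation may still call the generator model.
arena = Arena(Config(agents=["mock"], scenarios=10))
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)
print(f"Selection Rate: {results.selection_rate:.0%}")
```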
Understanding Results

The SDK returns different result types depending on the operation. Here are the key metrics in each:
EvaluationResult (single skill)
| Field | Type | Description |
|---|---|---|
| score | float (0–100) | Overall numeric score |
| grade | Grade (A+ to F) | Letter grade based on score |
| selection_rate | float (0.0–1.0) | How often the skill was selected |
| false_positive_rate | float (0.0–1.0) | Rate of incorrect selections |
| invocation_accuracy | float (0.0–1.0) | Accuracy when the skill was selected |
ComparisonResult (head-to-head)
| Field | Type | Description |
|---|---|---|
| winner | str | Name of the winning skill |
| selection_rates | dict[str, float] | Selection rate per skill (0.0–1.0) |
| head_to_head | dict[str, dict[str, int]] | Win counts between each skill pair |
| steals | dict[str, list[str]] | Scenario IDs lost to competitors |
| scenario_details | list[ScenarioDetail] | Per-scenario breakdown with reasoning |
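scenario_details is where the agent's reasoning lives. Continuing from any compare() call, a sketch of inspecting it; the ScenarioDetail attribute names used here (prompt, selected_skill, reasoning) are illustrative guesses, so check the model in your installed version:

```python
# Sketch: per-scenario outcomes on a ComparisonResult.
# The ScenarioDetail field names below are assumptions for illustration.
for detail in results.scenario_details:
    print(f"Prompt:   {detail.prompt}")
    print(f"Selected: {detail.selected_skill}")
    print(f"Why:      {detail.reasoning}")
```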
Progress Tracking
All methods support a progress callback for real-time updates:
```python
def on_progress(progress):
    print(f"[{progress.stage}] {progress.percent:.0f}% — {progress.message}")

results = arena.compare(
    skills=["./a.md", "./b.md"],
    task="search",
    on_progress=on_progress,
)
# Stages: parsing → generation → running → complete
```

Insights
The SDK generates actionable insights to help improve your skill descriptions.
```python
insights = arena.insights(
    skill="./my-skill.md",
    results=evaluation_results,
)
for insight in insights:
    print(f"[{insight.severity}] {insight.message}")
    if insight.suggestion:
        print(f" → {insight.suggestion}")
```

Insights cover areas such as:

- Description improvements
- Missed scenarios or edge cases
- Why competitors win specific scenarios
- Scenarios lost to competitors
Error Handling
| Exception | When |
|---|---|
| SkillParseError | Invalid skill format or missing required fields |
| ConfigError | Invalid configuration |
| APIKeyError | Missing ANTHROPIC_API_KEY or OPENAI_API_KEY |
| AgentError | Agent framework failed to process |
| GeneratorError | Scenario generation failed |
| TimeoutError | Operation exceeded timeout_seconds |
| NoSkillsError | No skills provided to operation |
| UnsupportedAgentError | Unknown agent name passed in config |
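These can be caught like ordinary exceptions. A minimal sketch, assuming they are importable from the top-level skills_arena package (the exact import path may differ in your version):

```python
# Sketch: guarding a run against the failure modes listed above.
# Assumes the exception classes are exported from the top-level package.
from skills_arena import Arena, APIKeyError, SkillParseError, AgentError

arena = Arena()
try:
    results = arena.evaluate(
        skill="./my-skill.md",
        task="web search and content extraction",
    )
except SkillParseError as e:
    print(f"Could not parse skill file: {e}")
except APIKeyError as e:
    print(f"Set ANTHROPIC_API_KEY before running: {e}")
except AgentError as e:
    print(f"Agent framework failed: {e}")
```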
API Reference
Arena
The main entry point for all operations.
```python
class Arena:
    def __init__(self, config: Config | None = None)

    @classmethod
    def from_config(cls, path: str | Path) -> Arena

    # Synchronous
    def evaluate(skill, task, *, on_progress=None) -> EvaluationResult
    def compare(skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult
    def insights(skill, results=None) -> list[Insight]

    # Async
    async def evaluate_async(skill, task, *, on_progress=None) -> EvaluationResult
    async def compare_async(skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult
```
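The async variants mirror the sync API one-to-one. A minimal sketch of running an evaluation inside an asyncio event loop:

```python
import asyncio

from skills_arena import Arena

async def main() -> None:
    arena = Arena()
    # Same arguments as evaluate(), awaited instead of blocking
    results = await arena.evaluate_async(
        skill="./my-skill.md",
        task="web search and content extraction",
    )
    print(f"Score: {results.score}/100")

asyncio.run(main())
```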
Skill

```python
class Skill(BaseModel):
    name: str                    # Skill identifier
    description: str             # Full description text
    parameters: list[Parameter]  # Function parameters
    when_to_use: list[str]       # Example triggers
    source_format: SkillFormat   # CLAUDE_CODE, OPENAI, MCP, GENERIC
    token_count: int             # Approximate tokens
    raw_content: str             # Original file content
    source_path: str | None      # Source file path
```
Parser

```python
class Parser:
    @classmethod
    def parse(cls, source: str | Path | dict) -> Skill        # Auto-detect format

    @classmethod
    def parse_claude_code(cls, source: str | Path) -> Skill   # .md with YAML frontmatter

    @classmethod
    def parse_openai(cls, source: str | Path | dict) -> Skill # JSON function schema

    @classmethod
    def parse_mcp(cls, source: str | Path | dict) -> Skill    # MCP tool definition

    @classmethod
    def parse_generic(cls, source: str | Path) -> Skill       # Plain text
```
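Parsing is also useful on its own, for example to sanity-check what the SDK sees in your skill file before running a benchmark. A short sketch, assuming Parser is importable from the top-level package:

```python
from skills_arena import Parser

# Sketch: inspect how the SDK parses a skill file before benchmarking it.
skill = Parser.parse("./my-skill.md")  # auto-detects the format

print(f"Name:     {skill.name}")
print(f"Format:   {skill.source_format}")
print(f"Tokens:   {skill.token_count}")
print(f"Triggers: {skill.when_to_use}")
```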
Ready to benchmark?

Try the playground to see how your skill performs against competitors — no setup required.