Skills Arena SDK
Benchmark your AI agent skill descriptions. Find out if agents will choose your skill over the competition — and why.
Getting Started
Install the SDK from PyPI:
```bash
pip install skills-arena
```

Set your API key:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

ANTHROPIC_API_KEY is required for the claude-code agent. Set OPENAI_API_KEY if you want to test against OpenAI agents.

Quick Start
Two ways to benchmark — evaluate a single skill or compare head-to-head:
```python
from skills_arena import Arena
arena = Arena()
# Evaluate a single skill
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction"
)
print(f"Score: {results.score}/100 ({results.grade})")
print(f"Selection Rate: {results.selection_rate:.0%}")
```

```python
# Compare head-to-head
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)
print(f"Winner: {results.winner}")
for name, rate in results.selection_rates.items():
    print(f" {name}: {rate:.0%}")
```

Core Concepts
Skills Arena tests a fundamental question: when an AI agent has multiple skills available, which one does it pick?
The flow: Skills → Scenarios → Agent → Results
Evaluate a Skill
Evaluate how discoverable a single skill is. The SDK generates scenarios, runs them through agent frameworks, and scores how often the skill is correctly selected.
```python
from skills_arena import Arena, Config
arena = Arena(Config(scenarios=30, agents=["claude-code"]))
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)
print(f"Score: {results.score}/100")
print(f"Grade: {results.grade}")
print(f"Selection Rate: {results.selection_rate:.0%}")
print(f"False Positive Rate: {results.false_positive_rate:.0%}")
```

EvaluationResult
| Field | Type | Description |
|---|---|---|
| score | float | Overall score (0–100) |
| grade | Grade | Letter grade (A+ to F) |
| selection_rate | float | How often the skill was selected (0.0–1.0) |
| false_positive_rate | float | Rate of incorrect selections (0.0–1.0) |
| invocation_accuracy | float | Accuracy when the skill was selected (0.0–1.0) |
| insights | list[Insight] | AI-generated improvement suggestions |
| per_agent | dict | Results broken down by agent |
| scenarios_run | int | Total number of scenarios evaluated |
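Beyond the headline score, the nested fields show where the number comes from. A minimal sketch, assuming per_agent maps agent names to that agent's own metrics (the Insight attributes match the Insights section later in this document):

```python
from skills_arena import Arena

arena = Arena()
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)

print(f"{results.scenarios_run} scenarios run, grade {results.grade}")

# Per-agent breakdown; assumed here to map agent names to that agent's metrics
for agent_name, agent_metrics in results.per_agent.items():
    print(f"{agent_name}: {agent_metrics}")

# Insight objects expose severity, message, and an optional suggestion
for insight in results.insights:
    print(f"[{insight.severity}] {insight.message}")
```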
Compare Skills
Head-to-head comparison between two or more skills. See who wins, why, and which scenarios your skill loses.
```python
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=20,
)
print(f"Winner: {results.winner}")
print(f"Selection Rates: {results.selection_rates}")

# See which scenarios you lost
for scenario_id, stealers in results.steals.items():
    print(f"Lost scenario {scenario_id} to: {stealers}")
```

ComparisonResult
| Field | Type | Description |
|---|---|---|
| winner | str | Name of the winning skill |
| selection_rates | dict[str, float] | Selection rate per skill (0.0–1.0) |
| head_to_head | dict | Win counts between each skill pair |
| steals | dict | Scenarios lost to competitors |
| scenario_details | list[ScenarioDetail] | Per-scenario breakdown with agent reasoning |
| insights | list[Insight] | Comparative insights |
| per_agent | dict | Results broken down by agent |
| scenarios_run | int | Total number of scenarios evaluated |
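head_to_head is especially useful when comparing more than two skills at once. A short sketch of walking the pairwise win counts, using the dict[str, dict[str, int]] shape documented in Understanding Results below:

```python
from skills_arena import Arena

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)

# head_to_head: outer key is a skill, inner dict maps opponent -> win count
for skill_name, opponents in results.head_to_head.items():
    for opponent, wins in opponents.items():
        print(f"{skill_name} beat {opponent} in {wins} scenarios")
```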
Custom Scenarios
Test specific edge cases with custom scenarios, or mix them with auto-generated ones.
```python
from skills_arena import CustomScenario, GenerateScenarios
# Fully custom
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find recent AI news articles"),
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",  # enables steal detection
            tags=["scraping"],
        ),
    ],
)
```

```python
# Mix custom scenarios with auto-generated ones
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My critical edge case"),
        GenerateScenarios(count=10),  # generate 10 more
    ],
)
```

When you set expected_skill on a custom scenario, the SDK can detect "steals" — when a competitor wins a scenario designed for your skill.

Configuration
Configure via Python or YAML files.
```python
from skills_arena import Arena, Config
config = Config(
    scenarios=50,
    agents=["claude-code"],
    temperature=0.7,
    include_adversarial=True,
    scenario_strategy="per_skill",
    timeout_seconds=30,
)
arena = Arena(config)
```

The same configuration as a YAML file:

```yaml
task: "web search and information retrieval"
skills:
  - path: ./skills/my-search.md
    name: my-search
  - path: ./skills/competitor.md
    name: competitor
evaluation:
  scenarios: 50
  agents: [claude-code]
  include_adversarial: true
  temperature: 0.7
```

Load it and run:

```python
arena = Arena.from_config("./arena.yaml")
results = arena.run()
```

Config Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| scenarios | int | 50 | Number of test scenarios to generate |
| agents | list[str] | ["claude-code"] | Agent frameworks to test against |
| temperature | float | 0.7 | Scenario generation diversity (0.0–2.0) |
| include_adversarial | bool | True | Include edge-case scenarios |
| scenario_strategy | str | "balanced" | "balanced" or "per_skill" |
| timeout_seconds | int | 30 | Per-scenario timeout |
| generator_model | str | claude-sonnet-4 | LLM for scenario generation |
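Every row in the table maps directly onto a Config keyword. A small sketch that overrides the generator settings; the model string is taken from the table's default and may differ in your install:

```python
from skills_arena import Arena, Config

# Sketch: tuning scenario generation via Config keywords from the table above.
config = Config(
    scenarios=100,
    generator_model="claude-sonnet-4",  # default per the table; any supported model id
    scenario_strategy="balanced",       # or "per_skill" to generate scenarios per skill
    include_adversarial=True,
    timeout_seconds=60,
)
arena = Arena(config)
```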
Agent Frameworks
Skills Arena tests against real agent frameworks — not just raw LLM tool_use calls. This matches how skills are actually discovered in production.
| Agent | Framework | Description |
|---|---|---|
| claude-code | Claude Agent SDK | Official Anthropic agent — real skill discovery with hooks |
| mock | Built-in | Deterministic agent for testing |
The claude-code agent installs your skills into .claude/skills/, launches a Claude Code session, and intercepts selections via hooks — matching real-world behavior.
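The built-in mock agent is useful for deterministic runs, for example in CI, without driving a real agent session. A minimal sketch, assuming the agent name from the table above:

```python
from skills_arena import Arena, Config

# Sketch: run against the deterministic mock agent.
# Note: scenario generation may still call the generator model.
arena = Arena(Config(agents=["mock"], scenarios=10))
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)
print(f"Selection Rate: {results.selection_rate:.0%}")
```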
Understanding Results

The SDK returns different result types depending on the operation. Here are the key metrics in each:
EvaluationResult (single skill)
| Field | Type | Description |
|---|---|---|
| score | float (0–100) | Overall numeric score |
| grade | Grade (A+ to F) | Letter grade based on score |
| selection_rate | float (0.0–1.0) | How often the skill was selected |
| false_positive_rate | float (0.0–1.0) | Rate of incorrect selections |
| invocation_accuracy | float (0.0–1.0) | Accuracy when the skill was selected |
ComparisonResult (head-to-head)
| Field | Type | Description |
|---|---|---|
| winner | str | Name of the winning skill |
| selection_rates | dict[str, float] | Selection rate per skill (0.0–1.0) |
| head_to_head | dict[str, dict[str, int]] | Win counts between each skill pair |
| steals | dict[str, list[str]] | Scenario IDs lost to competitors |
| scenario_details | list[ScenarioDetail] | Per-scenario breakdown with reasoning |
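scenario_details is where the agent's reasoning lives. Continuing from any compare() call, a sketch of inspecting it; the ScenarioDetail attribute names used here (prompt, selected_skill, reasoning) are illustrative guesses, so check the model in your installed version:

```python
# Sketch: per-scenario outcomes on a ComparisonResult.
# The ScenarioDetail field names below are assumptions for illustration.
for detail in results.scenario_details:
    print(f"Prompt:   {detail.prompt}")
    print(f"Selected: {detail.selected_skill}")
    print(f"Why:      {detail.reasoning}")
```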
Progress Tracking
All methods support a progress callback for real-time updates:
```python
def on_progress(progress):
    print(f"[{progress.stage}] {progress.percent:.0f}% — {progress.message}")

results = arena.compare(
    skills=["./a.md", "./b.md"],
    task="search",
    on_progress=on_progress,
)
# Stages: parsing → generation → running → complete
```

Insights
The SDK generates actionable insights to help improve your skill descriptions.
```python
insights = arena.insights(
    skill="./my-skill.md",
    results=evaluation_results,
)
for insight in insights:
    print(f"[{insight.severity}] {insight.message}")
    if insight.suggestion:
        print(f" → {insight.suggestion}")
```

Insights cover areas such as:

- Description improvements
- Missed scenarios or edge cases
- Why competitors win specific scenarios
- Scenarios lost to competitors
Error Handling
| Exception | When |
|---|---|
| SkillParseError | Invalid skill format or missing required fields |
| ConfigError | Invalid configuration |
| APIKeyError | Missing ANTHROPIC_API_KEY or OPENAI_API_KEY |
| AgentError | Agent framework failed to process |
| GeneratorError | Scenario generation failed |
| TimeoutError | Operation exceeded timeout_seconds |
| NoSkillsError | No skills provided to operation |
| UnsupportedAgentError | Unknown agent name passed in config |
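These can be caught like ordinary exceptions. A minimal sketch, assuming they are importable from the top-level skills_arena package (the exact import path may differ in your version):

```python
# Sketch: guarding a run against the failure modes listed above.
# Assumes the exception classes are exported from the top-level package.
from skills_arena import Arena, APIKeyError, SkillParseError, AgentError

arena = Arena()
try:
    results = arena.evaluate(
        skill="./my-skill.md",
        task="web search and content extraction",
    )
except SkillParseError as e:
    print(f"Could not parse skill file: {e}")
except APIKeyError as e:
    print(f"Set ANTHROPIC_API_KEY before running: {e}")
except AgentError as e:
    print(f"Agent framework failed: {e}")
```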
API Reference
Arena
The main entry point for all operations.
```python
class Arena:
    def __init__(self, config: Config | None = None)

    @classmethod
    def from_config(cls, path: str | Path) -> Arena

    # Synchronous
    def evaluate(skill, task, *, on_progress=None) -> EvaluationResult
    def compare(skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult
    def insights(skill, results=None) -> list[Insight]

    # Async
    async def evaluate_async(skill, task, *, on_progress=None) -> EvaluationResult
    async def compare_async(skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult
```
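The async variants mirror the sync API one-to-one. A minimal sketch of running an evaluation inside an asyncio event loop:

```python
import asyncio

from skills_arena import Arena

async def main() -> None:
    arena = Arena()
    # Same arguments as evaluate(), awaited instead of blocking
    results = await arena.evaluate_async(
        skill="./my-skill.md",
        task="web search and content extraction",
    )
    print(f"Score: {results.score}/100")

asyncio.run(main())
```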
Skill

```python
class Skill(BaseModel):
    name: str                    # Skill identifier
    description: str             # Full description text
    parameters: list[Parameter]  # Function parameters
    when_to_use: list[str]       # Example triggers
    source_format: SkillFormat   # CLAUDE_CODE, OPENAI, MCP, GENERIC
    token_count: int             # Approximate tokens
    raw_content: str             # Original file content
    source_path: str | None      # Source file path
```
Parser

```python
class Parser:
    @classmethod
    def parse(cls, source: str | Path | dict) -> Skill        # Auto-detect format

    @classmethod
    def parse_claude_code(cls, source: str | Path) -> Skill   # .md with YAML frontmatter

    @classmethod
    def parse_openai(cls, source: str | Path | dict) -> Skill # JSON function schema

    @classmethod
    def parse_mcp(cls, source: str | Path | dict) -> Skill    # MCP tool definition

    @classmethod
    def parse_generic(cls, source: str | Path) -> Skill       # Plain text
```
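Parsing is also useful on its own, for example to sanity-check what the SDK sees in your skill file before running a benchmark. A short sketch, assuming Parser is importable from the top-level package:

```python
from skills_arena import Parser

# Sketch: inspect how the SDK parses a skill file before benchmarking it.
skill = Parser.parse("./my-skill.md")  # auto-detects the format

print(f"Name:     {skill.name}")
print(f"Format:   {skill.source_format}")
print(f"Tokens:   {skill.token_count}")
print(f"Triggers: {skill.when_to_use}")
```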
Ready to benchmark?

Try the playground to see how your skill performs against competitors — no setup required.