Skills Arena SDK

Benchmark your AI agent skill descriptions. Find out if agents will choose your skill over the competition — and why.

Getting Started

Install the SDK from PyPI:

Install (bash)
pip install skills-arena

Set your API key:

Environment (bash)
export ANTHROPIC_API_KEY=sk-ant-...

The SDK uses the Anthropic API by default (for the claude-code agent). Set OPENAI_API_KEY if you want to test against OpenAI agents.
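
As a quick sanity check before a run, you can verify the relevant keys are set from Python; this is plain standard-library code, not part of the SDK.

Preflight check (python)
import os
import sys

# claude-code (the default agent) uses the Anthropic API
if "ANTHROPIC_API_KEY" not in os.environ:
    sys.exit("ANTHROPIC_API_KEY is not set")

# only needed when benchmarking against OpenAI agents
if "OPENAI_API_KEY" not in os.environ:
    print("note: OPENAI_API_KEY not set; OpenAI agents will be unavailable")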

Quick Start

Two ways to benchmark — evaluate a single skill or compare head-to-head:

Evaluate a single skill (python)
from skills_arena import Arena

arena = Arena()
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction"
)

print(f"Score: {results.score}/100 ({results.grade})")
print(f"Selection Rate: {results.selection_rate:.0%}")

Compare two skills (python)
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)

print(f"Winner: {results.winner}")
for name, rate in results.selection_rates.items():
    print(f"  {name}: {rate:.0%}")

Core Concepts

Skills Arena tests a fundamental question: when an AI agent has multiple skills available, which one does it pick?

Skill: A tool/function description parsed from Claude Code (.md), OpenAI (JSON), MCP, or plain text.
Scenario: Auto-generated user prompts designed to test skill selection at varying difficulty.
Agent: The AI framework making the selection; real agent frameworks, not just raw LLM calls.
Results: Selection rates, head-to-head records, steal detection, and actionable insights.

The flow: Skills → Scenarios → Agent → Results

Evaluate a Skill

Evaluate how discoverable a single skill is. The SDK generates scenarios, runs them through agent frameworks, and scores how often the skill is correctly selected.

arena.evaluate() (python)
from skills_arena import Arena, Config

arena = Arena(Config(scenarios=30, agents=["claude-code"]))

results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)

print(f"Score: {results.score}/100")
print(f"Grade: {results.grade}")
print(f"Selection Rate: {results.selection_rate:.0%}")
print(f"False Positive Rate: {results.false_positive_rate:.0%}")

EvaluationResult

Field | Type | Description
score | float | Overall score (0–100)
grade | Grade | Letter grade (A+ to F)
selection_rate | float | How often the skill was selected (0.0–1.0)
false_positive_rate | float | Rate of incorrect selections (0.0–1.0)
invocation_accuracy | float | Accuracy when the skill was selected (0.0–1.0)
insights | list[Insight] | AI-generated improvement suggestions
per_agent | dict | Results broken down by agent
scenarios_run | int | Total number of scenarios evaluated
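
The per_agent and insights fields let you drill below the headline score. A minimal sketch, assuming each per_agent entry exposes the same selection_rate field as the top-level result (verify against your installed version):

Per-agent breakdown (python)
from skills_arena import Arena

arena = Arena()
results = arena.evaluate(
    skill="./my-skill.md",
    task="web search and content extraction",
)

# per-agent breakdown (assumed to mirror the top-level rate fields)
for agent_name, agent_result in results.per_agent.items():
    print(f"{agent_name}: selected {agent_result.selection_rate:.0%} of the time")

# AI-generated improvement suggestions
for insight in results.insights:
    print(f"[{insight.severity}] {insight.message}")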

Compare Skills

Head-to-head comparison between two or more skills. See who wins, why, and which scenarios your skill loses.

arena.compare() (python)
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=20,
)

print(f"Winner: {results.winner}")
print(f"Selection Rates: {results.selection_rates}")

# See which scenarios you lost
for scenario_id, stealers in results.steals.items():
    print(f"Lost scenario {scenario_id} to: {stealers}")

Steal detection identifies scenarios where a competitor wins a task that was designed for your skill. This reveals specific weaknesses in your description.

ComparisonResult

Field | Type | Description
winner | str | Name of the winning skill
selection_rates | dict[str, float] | Selection percentage per skill
head_to_head | dict | Win counts between each skill pair
steals | dict | Scenarios lost to competitors
scenario_details | list[ScenarioDetail] | Per-scenario breakdown with agent reasoning
insights | list[Insight] | Comparative insights
per_agent | dict | Results broken down by agent
scenarios_run | int | Total number of scenarios evaluated
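
head_to_head is a nested dict of win counts, so pairwise records can be tallied directly. A sketch based on the types above; the key orientation (winner first) is an assumption:

Head-to-head record (python)
from skills_arena import Arena

arena = Arena()
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
)

# head_to_head[skill_a][skill_b] is assumed to count scenarios skill_a won against skill_b
for skill_a, opponents in results.head_to_head.items():
    for skill_b, wins in opponents.items():
        print(f"{skill_a} beat {skill_b} in {wins} scenario(s)")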

Custom Scenarios

Test specific edge cases with custom scenarios, or mix them with auto-generated ones.

Custom scenarios (python)
from skills_arena import CustomScenario, GenerateScenarios

# Fully custom
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    scenarios=[
        CustomScenario(prompt="Find recent AI news articles"),
        CustomScenario(
            prompt="Scrape pricing from stripe.com",
            expected_skill="My Skill",   # enables steal detection
            tags=["scraping"],
        ),
    ],
)

Mix custom + generated (python)
results = arena.compare(
    skills=["./my-skill.md", "./competitor.md"],
    task="web search",
    scenarios=[
        CustomScenario(prompt="My critical edge case"),
        GenerateScenarios(count=10),   # generate 10 more
    ],
)

When you set expected_skill on a custom scenario, the SDK can detect "steals" — when a competitor wins a scenario designed for your skill.

Configuration

Configure via Python or YAML files.

Python config (python)
from skills_arena import Arena, Config

config = Config(
    scenarios=50,
    agents=["claude-code"],
    temperature=0.7,
    include_adversarial=True,
    scenario_strategy="per_skill",
    timeout_seconds=30,
)

arena = Arena(config)

arena.yaml (yaml)
task: "web search and information retrieval"

skills:
  - path: ./skills/my-search.md
    name: my-search
  - path: ./skills/competitor.md
    name: competitor

evaluation:
  scenarios: 50
  agents: [claude-code]
  include_adversarial: true
  temperature: 0.7

Run from config file (python)
arena = Arena.from_config("./arena.yaml")
results = arena.run()

Config Parameters

Parameter | Type | Default | Description
scenarios | int | 50 | Number of test scenarios to generate
agents | list[str] | ["claude-code"] | Agent frameworks to test against
temperature | float | 0.7 | Scenario generation diversity (0.0–2.0)
include_adversarial | bool | True | Include edge-case scenarios
scenario_strategy | str | "balanced" | "balanced" or "per_skill"
timeout_seconds | int | 30 | Per-scenario timeout (seconds)
generator_model | str | "claude-sonnet-4" | LLM used for scenario generation
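
As an illustration of these knobs together, the sketch below trades speed for statistical confidence by raising the scenario count and diversity; the parameter names come from the table, the specific values are arbitrary.

Larger run (python)
from skills_arena import Arena, Config

config = Config(
    scenarios=100,                     # larger sample for steadier selection rates
    temperature=1.0,                   # more diverse generated scenarios
    include_adversarial=True,          # keep the edge-case scenarios
    scenario_strategy="per_skill",     # target scenarios at each individual skill
    timeout_seconds=60,
)

arena = Arena(config)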

Agent Frameworks

Skills Arena tests against real agent frameworks — not just raw LLM tool_use calls. This matches how skills are actually discovered in production.

Agent | Framework | Description
claude-code | Claude Agent SDK | Official Anthropic agent with real skill discovery via hooks
mock | Built-in | Deterministic agent for testing

claude-code is the recommended agent. It writes skills to .claude/skills/, launches a Claude Code session, and intercepts selections via hooks — matching real-world behavior.
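
The mock agent is useful for unit tests and CI pipelines where you want deterministic, inexpensive runs; a small sketch, assuming the mock agent needs no API keys:

Deterministic test run (python)
from skills_arena import Arena, Config

# deterministic built-in agent, intended for testing
arena = Arena(Config(agents=["mock"], scenarios=5))
results = arena.evaluate(skill="./my-skill.md", task="web search")

print(results.scenarios_run, results.selection_rate)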

Understanding Results

The SDK returns different result types depending on the operation. Here are the key metrics in each:

EvaluationResult (single skill)

Field | Type | Description
score | float (0–100) | Overall numeric score
grade | Grade (A+ to F) | Letter grade based on score
selection_rate | float (0.0–1.0) | How often the skill was selected
false_positive_rate | float (0.0–1.0) | Rate of incorrect selections
invocation_accuracy | float (0.0–1.0) | Accuracy when the skill was selected
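
Because these are plain numeric fields, a benchmark can gate a CI job. A sketch of one way to do that; the thresholds are illustrative, not recommendations:

Threshold gate (python)
from skills_arena import Arena

arena = Arena()
results = arena.evaluate(skill="./my-skill.md", task="web search")

# fail the build if discoverability drops below an agreed bar
if results.score < 70 or results.selection_rate < 0.6:
    raise SystemExit(
        f"skill benchmark below threshold: score={results.score:.0f}, "
        f"selection rate={results.selection_rate:.0%}"
    )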

ComparisonResult (head-to-head)

Field | Type | Description
winner | str | Name of the winning skill
selection_rates | dict[str, float] | Selection rate per skill (0.0–1.0)
head_to_head | dict[str, dict[str, int]] | Win counts between each skill pair
steals | dict[str, list[str]] | Scenario IDs lost to competitors
scenario_details | list[ScenarioDetail] | Per-scenario breakdown with reasoning

Progress Tracking

All methods support a progress callback for real-time updates:

Progress callback (python)
def on_progress(progress):
    print(f"[{progress.stage}] {progress.percent:.0f}% — {progress.message}")

results = arena.compare(
    skills=["./a.md", "./b.md"],
    task="search",
    on_progress=on_progress,
)

# Stages: parsing → generation → running → complete

Insights

The SDK generates actionable insights to help improve your skill descriptions.

Get insights (python)
insights = arena.insights(
    skill="./my-skill.md",
    results=evaluation_results,
)

for insight in insights:
    print(f"[{insight.severity}] {insight.message}")
    if insight.suggestion:
        print(f"  → {insight.suggestion}")

Insight categories:

optimization: Description improvements
weakness: Missed scenarios or edge cases
comparison: Why competitors win specific scenarios
steal: Scenarios lost to competitors

Error Handling

Exception | Raised when
SkillParseError | Invalid skill format or missing required fields
ConfigError | Invalid configuration
APIKeyError | Missing ANTHROPIC_API_KEY or OPENAI_API_KEY
AgentError | Agent framework failed during processing
GeneratorError | Scenario generation failed
TimeoutError | Operation exceeded timeout_seconds
NoSkillsError | No skills provided to the operation
UnsupportedAgentError | Unknown agent name passed in config
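
A sketch of defensive usage around these exceptions; it assumes they are importable from the top-level skills_arena package (they may live in a submodule such as skills_arena.exceptions).

Handle common errors (python)
from skills_arena import Arena
from skills_arena import SkillParseError, APIKeyError, AgentError  # assumed import location

arena = Arena()

try:
    results = arena.evaluate(skill="./my-skill.md", task="web search")
except SkillParseError as exc:
    print(f"could not parse skill: {exc}")
except APIKeyError:
    print("set ANTHROPIC_API_KEY (or OPENAI_API_KEY) and retry")
except AgentError as exc:
    print(f"agent framework failed: {exc}")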

API Reference

Arena

The main entry point for all operations.

Arena methods (python)
class Arena:
    def __init__(self, config: Config | None = None): ...

    @classmethod
    def from_config(cls, path: str | Path) -> "Arena": ...

    # Synchronous
    def evaluate(self, skill, task, *, on_progress=None) -> EvaluationResult: ...
    def compare(self, skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult: ...
    def insights(self, skill, results=None) -> list[Insight]: ...

    # Async
    async def evaluate_async(self, skill, task, *, on_progress=None) -> EvaluationResult: ...
    async def compare_async(self, skills, task=None, *, scenarios=None, on_progress=None) -> ComparisonResult: ...
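
The async variants slot into an existing event loop; a minimal sketch using compare_async as listed above:

Async comparison (python)
import asyncio

from skills_arena import Arena


async def main() -> None:
    arena = Arena()
    results = await arena.compare_async(
        skills=["./my-skill.md", "./competitor.md"],
        task="web search",
    )
    print(results.winner)


asyncio.run(main())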

Skill

Skill model (python)
class Skill(BaseModel):
    name: str                    # Skill identifier
    description: str             # Full description text
    parameters: list[Parameter]  # Function parameters
    when_to_use: list[str]       # Example triggers
    source_format: SkillFormat   # CLAUDE_CODE, OPENAI, MCP, GENERIC
    token_count: int             # Approximate tokens
    raw_content: str             # Original file content
    source_path: str | None      # Source file path

Parser

Parser methods (python)
class Parser:
    @classmethod
    def parse(cls, source: str | Path | dict) -> Skill: ...          # Auto-detect format

    @classmethod
    def parse_claude_code(cls, source: str | Path) -> Skill: ...     # .md with YAML frontmatter

    @classmethod
    def parse_openai(cls, source: str | Path | dict) -> Skill: ...   # JSON function schema

    @classmethod
    def parse_mcp(cls, source: str | Path | dict) -> Skill: ...      # MCP tool definition

    @classmethod
    def parse_generic(cls, source: str | Path) -> Skill: ...         # Plain text
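
Parsing is also useful on its own, for example to check how many tokens a description costs before benchmarking it. A sketch assuming Parser is importable from the package root alongside Arena:

Inspect a parsed skill (python)
from skills_arena import Parser

skill = Parser.parse("./my-skill.md")   # auto-detects the format

print(skill.name)
print(skill.source_format)
print(f"~{skill.token_count} tokens")
print(skill.when_to_use)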

Ready to benchmark?

Try the playground to see how your skill performs against competitors — no setup required.
