Click the Evaluate button on any trace turn to score its branches. Each evaluation makes two judge calls with swapped candidate orderings to cancel position bias, then averages the scores. LLM evaluations are subjective — treat scores as a directional signal, not a measurement.
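The swapped-ordering average can be sketched as follows. This is a hypothetical illustration, not the app's actual code: `judgeOnce` stands in for one LLM judge call and returns `{ first, second }`, the scores for the candidates in the order they were shown.

```javascript
// Hypothetical sketch of the two-call, swapped-ordering evaluation.
// `judgeOnce(first, second)` is an assumed stand-in for one judge call.
async function evaluateTurn(judgeOnce, candA, candB) {
  const run1 = await judgeOnce(candA, candB); // A shown first
  const run2 = await judgeOnce(candB, candA); // B shown first
  // Average each candidate's score across both positions. A constant
  // first-position bonus now contributes equally to both averages, so
  // the comparison between A and B is unbiased.
  return {
    scoreA: (run1.first + run2.second) / 2,
    scoreB: (run1.second + run2.first) / 2,
  };
}
```

With a judge that always favors the first slot by a fixed amount, the gap between `scoreA` and `scoreB` comes out the same as it would with an unbiased judge, which is the point of running both orderings.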
The server binds to 127.0.0.1 only. Do not expose it on a network. Your API keys are sent from this page to the local backend and forwarded to Anthropic or OpenAI; they are never logged.
Cache pricing: cache reads bill at roughly 10% of the input-token rate, cache writes at roughly 125%. These figures are approximate; verify against current pricing. Prices are hardcoded in public/app.js. Note: Opus 4.7 uses a new tokenizer that may produce up to ~35% more tokens for the same text, so effective cost can be higher than the per-MTok rate suggests.
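A cost estimate using those multipliers can be sketched as below. The per-MTok rates here are illustrative placeholders, not the values hardcoded in public/app.js:

```javascript
// Illustrative per-MTok rates in USD (assumed, not the app's real prices).
const RATE_PER_MTOK = { input: 15, output: 75 };

// Estimate cost from token counts, billing cache reads at ~10% of the
// input rate and cache writes at ~125%, per the note above.
function estimateCost(tokens) {
  const { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } = tokens;
  const billedInput = input + cacheRead * 0.10 + cacheWrite * 1.25;
  return (billedInput * RATE_PER_MTOK.input + output * RATE_PER_MTOK.output) / 1e6;
}
```

For example, one million cache-written input tokens cost 1.25x what one million fresh input tokens would, while one million cache-read tokens cost only a tenth as much.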
A Playground for the Claude Advisor Tool
Created by iBuildWith.ai
Want to run it locally? Get it on GitHub
Check out the release notes
Contact Marcelo Lewin