Basic evals
Turn count, tool success rate, session completion, context window guard.
Dashboard filters
Named views with performance and quality filter sets.
Eval score filters
Use cached eval scores as range sliders in a cachedOnly view.
Actions
Session summaries, metric exports, and tool inventories triggered from the dashboard.
Alerts
Console logging, failure warnings, and Slack webhook notifications.
Cache invalidation
Time-based, value-based, and compound cache invalidation patterns.
LLM-as-judge
Use Claude to grade task completion, faithfulness, and answer quality.

