Playbook
Evals & Guardrails
Measure quality and stop regressions.
If it’s not tested, it’s not production. Evals turn model behavior into something you can control.
Test sets
- Seed with real user queries
- Cover edge cases and high-risk intents
- Version datasets alongside code
Metrics
- Retrieval precision/recall
- Groundedness and citation coverage
- Latency and cost budgets
Regression gates
- Block deploys when quality drops
- Track deltas per prompt or model change
- Report failures with actionable context
Red-team cases
- Prompt injection attempts
- Ambiguous or conflicting sources
- Adversarial queries with missing data
Checklist
- Eval harness in CI
- Scorecards per release
- Failure taxonomy + owner