Playbook

Evals & Guardrails

Measure quality and stop regressions.

If it’s not tested, it’s not production. Evals turn model behavior into something you can control.

Test sets

  • Seed with real user queries
  • Cover edge cases and high-risk intents
  • Version datasets alongside code
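One way to keep datasets versioned alongside code is a JSONL file in the repo plus a content hash recorded with every run. A minimal sketch, assuming a hypothetical case schema with `query`, `expected`, and `tags` fields:

```python
import hashlib
import json

# Hypothetical layout: eval cases live in the repo (e.g. evals/cases.jsonl),
# so dataset edits show up in the same diff as prompt or code changes.

def parse_cases(lines):
    """Parse JSONL eval cases: one JSON object per line with 'query',
    'expected', and a 'tags' list (e.g. ['edge-case', 'high-risk'])."""
    return [json.loads(line) for line in lines if line.strip()]

def dataset_fingerprint(cases):
    """Stable short hash of the eval set, recorded with each run so a
    score is always tied to the exact dataset version that produced it."""
    blob = json.dumps(cases, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]
```

Sorting keys before hashing makes the fingerprint insensitive to dict ordering, so only real content changes produce a new version.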

Metrics

  • Retrieval precision/recall
  • Groundedness and citation coverage
  • Latency and cost budgets
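Retrieval precision and recall reduce to set arithmetic over document IDs. A sketch, assuming retrieved results come back as an ordered list of IDs and relevance labels come from the test set:

```python
def retrieval_metrics(retrieved, relevant):
    """Precision = fraction of retrieved docs that are relevant;
    recall = fraction of relevant docs that were retrieved.
    `retrieved` is a list of doc IDs, `relevant` a set of doc IDs."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }
```

Guarding the empty cases matters in practice: a query with no relevant docs or no retrieved docs should score cleanly, not crash the harness mid-run.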

Regression gates

  • Block deploys when quality drops
  • Track deltas per prompt or model change
  • Report failures with actionable context

Red-team cases

  • Prompt injection attempts
  • Ambiguous or conflicting sources
  • Adversarial queries with missing data
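Prompt-injection cases can be scored mechanically: the case passes if the response contains none of the strings the attack tried to extract. A sketch, where the forbidden strings (system-prompt fragments, secrets) are hypothetical fixtures defined per test case:

```python
def injection_case_passes(response, forbidden_strings):
    """A red-team case passes if the model's response leaks none of
    the strings the injection attempted to extract (e.g. fragments
    of the system prompt). Matching is case-insensitive."""
    lowered = response.lower()
    return not any(s.lower() in lowered for s in forbidden_strings)
```

String matching is a floor, not a ceiling: it catches verbatim leaks cheaply, while paraphrased leaks still need human or model-graded review.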

Checklist

  • Eval harness in CI
  • Scorecards per release
  • Failure taxonomy + owner
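A per-release scorecard ties the checklist together: which release, which dataset version, which scores. A minimal sketch, with the field names as assumptions rather than a fixed schema:

```python
import datetime
import json

def write_scorecard(release, scores, dataset_fingerprint):
    """Serialize one release's eval results as JSON so regressions can
    be traced back to a specific release and dataset version."""
    card = {
        "release": release,
        "dataset_fingerprint": dataset_fingerprint,
        "scores": scores,
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return json.dumps(card, indent=2)
```

Archiving one of these per release gives the regression gate its baseline and gives the failure taxonomy a concrete artifact to attach owners to.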