AI Evaluation Logic & Strategy

Introduction

Reliability is our primary feature as an enterprise AI SaaS. Simple unit tests are insufficient because LLMs are non-deterministic: an update by OpenAI or Anthropic could silently degrade our application's ability to extract grant clauses or generate accurate summaries. We must continuously evaluate actual output quality.

Regression Evaluation Framework

To maintain a defensible moat of high-quality execution, we employ automated AI Evaluations.

The “Golden Dataset”

We maintain a static “Golden Dataset” of 100+ highly complex, manually verified input-output pairs. This dataset includes:
  • Complex grant RFPs alongside human-verified summaries.
  • Difficult questions requiring precise numerical extraction.
  • Edge cases designed to intentionally provoke hallucinations.
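
The exact shape of a dataset entry isn't spelled out here, but one entry could be sketched roughly as follows; all field names and values are illustrative, not our actual schema:

```typescript
// Illustrative shape of one Golden Dataset entry (hypothetical field names).
interface GoldenDatasetEntry {
  id: string;                            // stable identifier, e.g. "rfp-extract-042"
  task: "summarize" | "extract" | "qa";  // which pipeline the entry exercises
  input: {
    document: string;                    // full grant RFP text or excerpt
    question?: string;                   // present for extraction / QA cases
  };
  expected: {
    answer: string;                      // human-verified reference output
    requiredFacts: string[];             // facts the output must contain (used for recall)
  };
  tags: string[];                        // e.g. ["numerical-extraction", "hallucination-bait"]
}

const example: GoldenDatasetEntry = {
  id: "rfp-extract-042",
  task: "extract",
  input: {
    document: "…full RFP text…",
    question: "What is the maximum award amount?",
  },
  expected: {
    answer: "$250,000 over two years.",
    requiredFacts: ["$250,000", "two years"],
  },
  tags: ["numerical-extraction"],
};
```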

Evaluation Metrics

We measure the LLM’s performance against the Golden Dataset using specific scoring algorithms:
  1. Factual Accuracy Framework: Does the generated answer contradict facts present in the provided context? We use a strong LLM-as-a-judge model to calculate a “hallucination score” from 0 to 1.
  2. Schema Conformance: binary pass/fail. Does the JSON output parse against our strict Zod schema?
  3. Completeness/Recall: Did the model find every piece of required information within the provided 50-page document?
  4. Tone & Style Constraint: Does the output adhere to the strict professional tone required for grant writing?
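
To make the first two metrics concrete, here is a minimal sketch of how they might be scored. Zod is used for schema conformance as described above; the specific schema fields, the judge prompt, and the `callJudgeModel` helper are assumptions for illustration only:

```typescript
import { z } from "zod";

// Hypothetical output schema for a grant-clause extraction response.
const ExtractionSchema = z.object({
  clause: z.string(),
  amount: z.number().nullable(),
  citations: z.array(z.string()),
});

// Metric 2: Schema Conformance — strict pass/fail on Zod parsing.
function schemaConformance(rawOutput: string): boolean {
  try {
    ExtractionSchema.parse(JSON.parse(rawOutput));
    return true;
  } catch {
    return false;
  }
}

// Metric 1: Factual Accuracy — an LLM-as-a-judge returns a hallucination score in [0, 1].
// `callJudgeModel` is a placeholder for whatever client wraps the judge model.
async function hallucinationScore(
  context: string,
  answer: string,
  callJudgeModel: (prompt: string) => Promise<string>,
): Promise<number> {
  const prompt =
    `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
    `Rate from 0 (fully grounded) to 1 (entirely hallucinated) how much of the ` +
    `answer contradicts or is unsupported by the context. Reply with a number only.`;
  const verdict = await callJudgeModel(prompt);
  const score = Number.parseFloat(verdict.trim());
  return Number.isFinite(score) ? Math.min(Math.max(score, 0), 1) : 1; // treat unparsable verdicts as worst case
}
```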

Continuous Integration Loop (CI/CD for AI)

Every time we merge a change that touches any of the following, the CI pipeline runs the Golden Dataset through the proposed AI system change:
  • A core prompt template
  • The RAG chunking algorithm
  • The model provider

The CI Pipeline Flow:

  1. Trigger: Pull Request opened modifying /src/ai/prompts.
  2. Execute: The Evaluation Suite runs asynchronously (typically takes ~10-15 minutes).
  3. Score & Diff: Generates a report indicating if the overall Factual Accuracy score increased or decreased relative to main.
  4. Gate: If the Hallucination Score increases by > 2% or Schema Conformance drops below 100%, the PR is automatically blocked.
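
A minimal sketch of what the gating step could look like as a CI script. The report shape and file paths are assumptions, and the 2% threshold is interpreted here as an absolute increase of 0.02 in the 0–1 hallucination score:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical report shape produced by the Evaluation Suite for a given git ref.
interface EvalReport {
  factualAccuracy: number;     // 0–1, higher is better
  hallucinationScore: number;  // 0–1, lower is better
  schemaConformance: number;   // fraction of cases that parsed; 1.0 === 100%
}

function loadReport(path: string): EvalReport {
  return JSON.parse(readFileSync(path, "utf8")) as EvalReport;
}

// Gate rules from the pipeline description:
//  - block if the hallucination score rises by more than 2% (read here as +0.02 absolute) vs. main
//  - block if schema conformance drops below 100%
function shouldBlock(main: EvalReport, pr: EvalReport): string[] {
  const reasons: string[] = [];
  if (pr.hallucinationScore - main.hallucinationScore > 0.02) {
    reasons.push("Hallucination score increased by more than 2%");
  }
  if (pr.schemaConformance < 1.0) {
    reasons.push("Schema conformance dropped below 100%");
  }
  return reasons;
}

const reasons = shouldBlock(
  loadReport("reports/main.json"), // baseline report from the main branch
  loadReport("reports/pr.json"),   // report for the proposed change
);

if (reasons.length > 0) {
  console.error("Evaluation gate failed:\n- " + reasons.join("\n- "));
  process.exit(1); // a non-zero exit is what actually blocks the PR in CI
}
console.log("Evaluation gate passed.");
```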

Guardrails in Production

In production, we cannot run the heavyweight evaluation suite on every user request. Instead, we:
  • Employ fast, lightweight syntactic checks.
  • Sample 5% of all production AI responses daily and run them asynchronously through the deep Evaluation Framework to monitor for “model drift”.
  • Collect explicit (thumbs-up/thumbs-down) and implicit user feedback to alert product managers to degrading user sentiment.
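
As a rough illustration of the first two bullets, a per-request guardrail might look like the sketch below. `enqueueForDeepEval` is a hypothetical hook into the asynchronous Evaluation Framework, and sampling is done per request here purely for illustration rather than as a daily batch:

```typescript
// Lightweight, synchronous guardrail run on every production response:
// a fast syntactic check plus probabilistic sampling into the deep evaluation queue.
const SAMPLE_RATE = 0.05; // 5% of responses get the full offline evaluation

function passesSyntacticChecks(rawOutput: string): boolean {
  try {
    const parsed = JSON.parse(rawOutput); // must at least be valid JSON
    return typeof parsed === "object" && parsed !== null;
  } catch {
    return false;
  }
}

function guardResponse(
  rawOutput: string,
  enqueueForDeepEval: (output: string) => void,
): boolean {
  if (Math.random() < SAMPLE_RATE) {
    enqueueForDeepEval(rawOutput); // async drift monitoring, kept off the hot path
  }
  return passesSyntacticChecks(rawOutput);
}
```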