AI Evaluation Logic & Strategy

Introduction

Reliability is our primary feature as an enterprise AI SaaS. Simple unit tests are insufficient because LLMs are non-deterministic: an update by OpenAI or Anthropic could silently degrade our application's ability to extract grant clauses or generate accurate summaries. We must continuously evaluate actual output quality.

Regression Evaluation Framework

To maintain a defensible moat of high-quality execution, we employ automated AI Evaluations.

The “Golden Dataset”

We maintain a static “Golden Dataset” of 100+ highly complex, manually verified input-output pairs. This dataset includes:
  • Complex grant RFPs alongside human-verified summaries.
  • Difficult questions requiring precise numerical extraction.
  • Edge cases designed to intentionally provoke hallucinations.
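
The exact shape of a dataset entry isn't spelled out here, but one entry could be sketched roughly as follows; all field names and values are illustrative, not our actual schema:

```typescript
// Illustrative shape of one Golden Dataset entry (hypothetical field names).
interface GoldenDatasetEntry {
  id: string;                            // stable identifier, e.g. "rfp-extract-042"
  task: "summarize" | "extract" | "qa";  // which pipeline the entry exercises
  input: {
    document: string;                    // full grant RFP text or excerpt
    question?: string;                   // present for extraction / QA cases
  };
  expected: {
    answer: string;                      // human-verified reference output
    requiredFacts: string[];             // facts the output must contain (used for recall)
  };
  tags: string[];                        // e.g. ["numerical-extraction", "hallucination-bait"]
}

const example: GoldenDatasetEntry = {
  id: "rfp-extract-042",
  task: "extract",
  input: {
    document: "…full RFP text…",
    question: "What is the maximum award amount?",
  },
  expected: {
    answer: "$250,000 over two years.",
    requiredFacts: ["$250,000", "two years"],
  },
  tags: ["numerical-extraction"],
};
```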

Evaluation Metrics

We measure the LLM’s performance against the Golden Dataset using specific scoring algorithms:
  1. Factual Accuracy Framework: Does the generated answer contradict facts present in the provided context? We use a strong LLM-as-a-judge model to calculate a “hallucination score” from 0 to 1.
  2. Schema Conformance: binary pass/fail. Does the JSON output parse against our strict Zod schema?
  3. Completeness/Recall: Did the model find every piece of required information within the provided 50-page document?
  4. Tone & Style Constraint: Does the output adhere to the strict professional tone required for grant writing?
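
To make the first two metrics concrete, here is a minimal sketch of how they might be scored. Zod is used for schema conformance as described above; the specific schema fields, the judge prompt, and the `callJudgeModel` helper are assumptions for illustration only:

```typescript
import { z } from "zod";

// Hypothetical output schema for a grant-clause extraction response.
const ExtractionSchema = z.object({
  clause: z.string(),
  amount: z.number().nullable(),
  citations: z.array(z.string()),
});

// Metric 2: Schema Conformance — strict pass/fail on Zod parsing.
function schemaConformance(rawOutput: string): boolean {
  try {
    ExtractionSchema.parse(JSON.parse(rawOutput));
    return true;
  } catch {
    return false;
  }
}

// Metric 1: Factual Accuracy — an LLM-as-a-judge returns a hallucination score in [0, 1].
// `callJudgeModel` is a placeholder for whatever client wraps the judge model.
async function hallucinationScore(
  context: string,
  answer: string,
  callJudgeModel: (prompt: string) => Promise<string>,
): Promise<number> {
  const prompt =
    `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
    `Rate from 0 (fully grounded) to 1 (entirely hallucinated) how much of the ` +
    `answer contradicts or is unsupported by the context. Reply with a number only.`;
  const verdict = await callJudgeModel(prompt);
  const score = Number.parseFloat(verdict.trim());
  return Number.isFinite(score) ? Math.min(Math.max(score, 0), 1) : 1; // treat unparsable verdicts as worst case
}
```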

Continuous Integration Loop (CI/CD for AI)

Every time we merge a change that touches any of the following, the CI pipeline runs the Golden Dataset through the proposed AI system change:
  • A core prompt template
  • The RAG chunking algorithm
  • The model provider

The CI Pipeline Flow:

  1. Trigger: Pull Request opened modifying /src/ai/prompts.
  2. Execute: The Evaluation Suite runs asynchronously (typically takes ~10-15 minutes).
  3. Score & Diff: Generates a report indicating if the overall Factual Accuracy score increased or decreased relative to main.
  4. Gate: If the Hallucination Score increases by > 2% or Schema Conformance drops below 100%, the PR is automatically blocked.
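
A minimal sketch of what the gating step could look like as a CI script. The report shape and file paths are assumptions, and the 2% threshold is interpreted here as an absolute increase of 0.02 in the 0–1 hallucination score:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical report shape produced by the Evaluation Suite for a given git ref.
interface EvalReport {
  factualAccuracy: number;     // 0–1, higher is better
  hallucinationScore: number;  // 0–1, lower is better
  schemaConformance: number;   // fraction of cases that parsed; 1.0 === 100%
}

function loadReport(path: string): EvalReport {
  return JSON.parse(readFileSync(path, "utf8")) as EvalReport;
}

// Gate rules from the pipeline description:
//  - block if the hallucination score rises by more than 2% (read here as +0.02 absolute) vs. main
//  - block if schema conformance drops below 100%
function shouldBlock(main: EvalReport, pr: EvalReport): string[] {
  const reasons: string[] = [];
  if (pr.hallucinationScore - main.hallucinationScore > 0.02) {
    reasons.push("Hallucination score increased by more than 2%");
  }
  if (pr.schemaConformance < 1.0) {
    reasons.push("Schema conformance dropped below 100%");
  }
  return reasons;
}

const reasons = shouldBlock(
  loadReport("reports/main.json"), // baseline report from the main branch
  loadReport("reports/pr.json"),   // report for the proposed change
);

if (reasons.length > 0) {
  console.error("Evaluation gate failed:\n- " + reasons.join("\n- "));
  process.exit(1); // a non-zero exit is what actually blocks the PR in CI
}
console.log("Evaluation gate passed.");
```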

Guardrails in Production

In production, we cannot run the heavyweight evaluation suite on every user request. Instead, we:
  • Employ fast, lightweight syntactic checks.
  • Sample 5% of all production AI responses daily and run them asynchronously through the deep Evaluation Framework to monitor for “model drift”.
  • Collect explicit (thumbs-up/thumbs-down) and implicit user feedback to alert product managers to degrading user sentiment.
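
As a rough illustration of the first two bullets, a per-request guardrail might look like the sketch below. `enqueueForDeepEval` is a hypothetical hook into the asynchronous Evaluation Framework, and sampling is done per request here purely for illustration rather than as a daily batch:

```typescript
// Lightweight, synchronous guardrail run on every production response:
// a fast syntactic check plus probabilistic sampling into the deep evaluation queue.
const SAMPLE_RATE = 0.05; // 5% of responses get the full offline evaluation

function passesSyntacticChecks(rawOutput: string): boolean {
  try {
    const parsed = JSON.parse(rawOutput); // must at least be valid JSON
    return typeof parsed === "object" && parsed !== null;
  } catch {
    return false;
  }
}

function guardResponse(
  rawOutput: string,
  enqueueForDeepEval: (output: string) => void,
): boolean {
  if (Math.random() < SAMPLE_RATE) {
    enqueueForDeepEval(rawOutput); // async drift monitoring, kept off the hot path
  }
  return passesSyntacticChecks(rawOutput);
}
```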