Building a Continuous Quality Gate with Deepeval and GitHub Actions (code for engineers • standards for risk teams • plain-English value for execs)
Deepeval in a nutshell: Deepeval is an open-source Python toolkit that lets you write unit-style tests for large-language-model (LLM) outputs—scoring each answer for relevancy, faithfulness, toxicity, latency, cost, or any custom metric—and returns a machine-readable JSON report that drops neatly into any CI/CD pipeline.
- Why: One hallucinated answer can trigger refunds, PR crises, or regulatory heat. Automated evaluation is cheaper than the first incident review.
- What: Use Deepeval (open-source) to write unit-style tests for prompts, RAG pipelines & agents, then fail the build if a risk threshold slips.
- How: Wire those tests into GitHub Actions and stream every score to your Connector Framework—the central place you already pull SageMaker logs, MLflow metrics, and model-card status.
- Result: Each pull request ships with an auditable scorecard that maps directly to NIST AI RMF “Measure & Manage” sub-categories and ISO/IEC 42001 clauses.
1 · The 1 a.m. Rollback No One Wants Again
“A customer promo-bot invented 50 % discounts at 1 a.m. The team spent the night hot-patching prompts and refunding customers.”
Without an always-on quality gate, any prompt tweak, model upgrade, or data refresh can land in prod while everyone sleeps.
2 · Why Continuous Evaluation Matters
Risk | Business impact | Continuous gate → benefit |
---|---|---|
Hallucination | Fines, brand damage | Block deploy if faithfulness < 0.8 |
Toxic / biased output | PR fallout, churn | Fail build when toxicity > 0.2 |
Latency spikes | Cart abandonment, SLA breaches | Alarm when p95 > 3 s |
RAG drift | Wrong support answers | Alert when context precision < 0.7 |
Pay once to wire the gate → save forever in incident hours and audit prep.
3 · Deepeval in 90 Seconds
- Tests are plain Python `LLMTestCase` objects.
- Dozens of built-in metrics: Answer Relevancy, Hallucination, RAGAS, Toxicity, and more.
- Runs locally with `deepeval test` or returns JSON for CI.
<details>
<summary>Minimal prompt test</summary>

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_capital_of_france():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",  # your application's answer goes here
        expected_output="Paris",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```

</details>
4 · Designing a Multi-Layer Test Suite
- Prompt-level – predictable Q&A, system prompts.
- RAG pipeline – validate that retrieved chunks support the answer.
- Agent dialogue – score multi-turn traces for goal completion.
Practical tip: Keep examples in CSV so product managers can add edge cases without touching Python.
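A minimal sketch of that pattern, assuming a hypothetical `test_cases.csv` with `input` and `expected_output` columns and a placeholder `my_app()` call standing in for your real pipeline:

```python
import csv

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def my_app(prompt: str) -> str:
    """Placeholder for your real LLM / RAG call."""
    return "Paris"


def load_cases(path: str = "test_cases.csv"):
    # Columns assumed: input, expected_output
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def test_csv_cases():
    for row in load_cases():
        case = LLMTestCase(
            input=row["input"],
            actual_output=my_app(row["input"]),
            expected_output=row["expected_output"],
        )
        assert_test(case, [AnswerRelevancyMetric(threshold=0.8)])
```

Product managers edit the CSV; the test file itself never changes.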
5 · Setting Meaningful Thresholds
Metric | Good starter threshold | Who signs off |
---|---|---|
Answer Relevancy | ≥ 0.8 | Product owner |
Context Precision | ≥ 0.7 | Tech writer / risk lead |
Toxicity | ≤ 0.2 | Compliance officer |
p95 Latency | < 3 s | SRE / infra |
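If you prefer to encode those thresholds directly in code, a minimal sketch using Deepeval's built-in metric classes could look like this (class names follow recent Deepeval releases and may differ in older versions; the p95 latency budget is shown as a plain constant because latency is usually measured outside the metric suite):

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ToxicityMetric,
)

# Starter thresholds from the table above.
# For "higher is better" metrics the threshold is a floor; for Toxicity it acts as a ceiling.
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    ContextualPrecisionMetric(threshold=0.7),
    ToxicityMetric(threshold=0.2),
]

# p95 latency is typically enforced from your own timing data, not a Deepeval metric.
P95_LATENCY_BUDGET_SECONDS = 3.0
```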
6 · Wiring into GitHub Actions (copy-paste)
```yaml
# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Install deps
        run: pip install -r requirements.txt deepeval
      - name: Run Deepeval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test tests/ --json report.json
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: deepeval-report
          path: report.json
```
Fail-fast logic: If any test < threshold → job fails → merge blocked.
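If your test command does not already exit non-zero when a threshold slips, a small gate script over the JSON report can enforce it. This is a sketch only: the `report.json` schema (a `results` list with per-test `name`, `score`, and `success` fields) is an assumption, not a documented Deepeval format.

```python
# ci/check_report.py: hypothetical gate script; report schema is assumed.
import json
import sys


def main(path: str = "report.json") -> None:
    with open(path) as f:
        report = json.load(f)

    failures = [r for r in report.get("results", []) if not r.get("success", False)]
    if failures:
        for r in failures:
            print(f"FAIL: {r.get('name', '<unnamed>')} score={r.get('score')}")
        sys.exit(1)  # non-zero exit fails the Actions job and blocks the merge
    print("All Deepeval tests passed.")


if __name__ == "__main__":
    main()
```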
7 · Integrating with the CognitiveView Connector API
The CognitiveView platform already ingests model telemetry from SageMaker, MLflow, Vertex AI, and more. A Deepeval connector reads `report.json` and posts the results to the Risk & Guardrail Bus on every run:
```python
# connectors/deepeval_ingest.py
import json
import os
import uuid

import requests


def push_deepeval_report(path: str):
    with open(path) as f:
        report = json.load(f)
    payload = {
        "connector_id": "deepeval-ci",
        "run_id": str(uuid.uuid4()),
        "metrics": report["results"],
        "source": "github-actions",
    }
    requests.post(
        os.getenv("CONNECTOR_API") + "/risk-events",
        json=payload,
        headers={"Authorization": f"Bearer {os.getenv('CONNECTOR_TOKEN')}"},
    )


if __name__ == "__main__":
    push_deepeval_report("report.json")
```
Practical tips
- Version every run – your audit trail thanks you later.
- Tag by `model_id` & `branch` – lets the dashboard compare `main` vs `experimental-phi-3`.
- Emit both score and pass/fail – business users prefer “red/green,” risk teams drill into decimals.
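A sketch of how those tags could be attached to the connector payload before it is posted (the `MODEL_ID` variable, the extra payload fields, and the pass/fail convention are assumptions rather than a documented CognitiveView schema; `GITHUB_REF_NAME` is set automatically by GitHub Actions):

```python
import os


def tag_payload(payload: dict, results: list) -> dict:
    """Enrich the connector payload with traceability tags (field names assumed)."""
    payload["model_id"] = os.getenv("MODEL_ID", "unknown")      # which model produced this run
    payload["branch"] = os.getenv("GITHUB_REF_NAME", "local")   # e.g. main vs experimental-phi-3
    payload["passed"] = all(r.get("success", False) for r in results)  # red/green for business users
    return payload
```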
8 · Mapping to Standards (So Auditors Nod) — in CognitiveView
All Deepeval test outputs—like faithfulness, toxicity, or contextual precision—can be captured and mapped to AI governance standards inside the CognitiveView platform. No code required.
Through CognitiveView’s no-code configuration interface, users can:
✅ Assign Deepeval metrics to specific NIST AI RMF subcategories (e.g., M3, M4)
✅ Link thresholds and failure rules to ISO/IEC 42001 clauses
✅ Configure alerts, risk ratings, and audit trails—without writing any code
Deepeval metric | Mapped to NIST AI RMF | ISO/IEC 42001:2023 clause |
---|---|---|
Answer Relevancy, Faithfulness | M3: Performance measurement | 8.2 Monitoring & measurement |
Toxicity, Bias | G2: Downstream harm governance | 6.3 Risk treatment planning |
Latency, Cost | M2: Operational constraints validation | 8.3 Evaluation & improvement |
Pass/fail scoring | RM: Risk Management | 9 Management review |
🧩 Business Benefit: This no-code mapping ensures that product managers, compliance leads, and auditors can trace every model evaluation back to an established standard—without needing Python or GitHub access.
9 · Open-Source Advantage
- Strong community support – an active community and readily available support resources.
- Transparent metrics – auditors can inspect every line.
- Extensible – your team can publish custom guardrails (e.g., PII Leakage) back to the community and build brand credibility.
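As an illustration of that extensibility, a custom PII Leakage guardrail might look like the sketch below: a naive regex check wrapped in Deepeval's `BaseMetric` interface (the exact hook names can vary between Deepeval versions, so treat this as a starting point rather than a drop-in implementation).

```python
import re

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # deliberately simplistic


class PIILeakageMetric(BaseMetric):
    """Fails when the model output contains an email address (toy example)."""

    def __init__(self, threshold: float = 0.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        leaks = EMAIL_RE.findall(test_case.actual_output or "")
        self.score = float(len(leaks))  # 0 means no PII found
        self.success = self.score <= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "PII Leakage"
```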
10 · Common Pitfalls & Quick Fixes
Symptom | Likely cause | One-liner fix |
---|---|---|
Flaky scores | Temperature > 0 in eval calls | Set temperature=0 or average 3 runs |
CI too slow | Heavy agent traces | Mark heavy suites as nightly; run a smoke subset on PRs |
“All tests fail after model upgrade” | Thresholds too strict | Re-baseline on new weights, then tighten |
11 · Scenarios
Scenario | What can go wrong | How the Deepeval gate helps |
---|---|---|
1. Customer-support RAG bot – retrieves policy snippets and answers billing questions. | After a knowledge-base refresh, the bot hallucinates an outdated “50 % lifetime discount,” triggering refund requests. | A Context Precision ≥ 0.7 test fails in GitHub Actions → merge blocked → risk team alerted before the bot goes live. |
2. Internal policy-drafting agent – generates first-draft HR policies for review. | A new model version injects gender-biased language into leave-policy templates, exposing the company to discrimination claims. | A Toxicity ≤ 0.2 and Bias Score ≤ 0.15 test fails → build is stopped → DEI officer reviews the prompt / model combo before deployment. |
12 · Key Takeaways
- For leaders: A continuous quality gate is a low-cost insurance policy aligned to globally recognized standards—easy to justify in the board deck.
- For engineers: A GitHub Action plus ~50 lines of tests puts you ahead of 90 % of teams shipping LLMs today.
- For risk teams: Deepeval scores feed straight into the CognitiveView Connector Framework and map cleanly to NIST AI RMF & ISO 42001 evidence.
Challenge: Add one metric your stakeholders care about (e.g., Personally Identifiable Information Leakage), plug it into the connector, and show your first passing badge in the next sprint demo.
Happy shipping—without the midnight rollbacks!