Shipping Safe LLMs

Use Deepeval to build a continuous quality gate for LLMs that blocks hallucinations, bias, and drift. This guide shows how to integrate it with GitHub Actions and your risk framework—aligning AI deployments with NIST and ISO standards using open-source tools.

Building a Continuous Quality Gate with Deepeval and GitHub Actions (code for engineers • standards for risk teams • plain-English value for execs)

Deepeval in a nutshell: Deepeval is an open-source Python toolkit that lets you write unit-style tests for large-language-model (LLM) outputs—scoring each answer for relevancy, faithfulness, toxicity, latency, cost, or any custom metric—and returns a machine-readable JSON report that drops neatly into any CI/CD pipeline.

  • Why: One hallucinated answer can trigger refunds, PR crises, or regulatory heat. Automated evaluation is cheaper than the first incident review.
  • What: Use Deepeval (open-source) to write unit-style tests for prompts, RAG pipelines & agents, then fail the build if a risk threshold slips.
  • How: Wire those tests into GitHub Actions and stream every score to your Connector Framework—the central place you already pull SageMaker logs, MLflow metrics, and model-card status.
  • Result: Each pull request ships with an auditable scorecard that maps directly to NIST AI RMF “Measure & Manage” sub-categories and ISO/IEC 42001 clauses.

1 · The 1 a.m. Rollback No One Wants Again

“A customer promo-bot invented 50 % discounts at 1 a.m. The team spent the night hot-patching prompts and refunding customers.”

Without an always-on quality gate, any prompt tweak, model upgrade, or data refresh can land in prod while everyone sleeps.


2 · Why Continuous Evaluation Matters

| Risk | Business impact | Continuous gate → benefit |
| --- | --- | --- |
| Hallucination | Fines, brand damage | Block deploy if faithfulness < 0.8 |
| Toxic / biased output | PR fallout, churn | Fail build when toxicity > 0.2 |
| Latency spikes | Cart abandonment, SLA breaches | Alarm when p95 > 3 s |
| RAG drift | Wrong support answers | Alert when context precision < 0.7 |

Pay once to wire the gate → save forever in incident hours and audit prep.


3 · Deepeval in 90 Seconds

  • Tests are plain Python LLMTestCase objects.
  • Dozens of built-in metrics: Answer Relevancy, Hallucination, RAGAS, Toxicity, and more.
  • Runs locally via deepeval test and emits JSON reports for CI.

<details> <summary>Minimal prompt test</summary>

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_capital_of_france():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",  # your model's answer
        expected_output="Paris",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```

</details>


4 · Designing a Multi-Layer Test Suite

  1. Prompt-level – predictable Q&A, system prompts.
  2. RAG pipeline – validate that retrieved chunks support the answer.
  3. Agent dialogue – score multi-turn traces for goal completion.
Practical tip: Keep examples in CSV so product managers can add edge cases without touching Python (see the sketch below).
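
As a rough illustration of that CSV workflow, the sketch below turns each CSV row into one pytest case. The column names and the generate_answer() helper are assumptions — swap in your own prompt or RAG pipeline call.

```python
# test_prompt_cases.py — a minimal sketch of CSV-driven prompt tests.
# The CSV columns (input, expected_output) and generate_answer() are
# illustrative assumptions, not part of Deepeval itself.
import csv

import pytest

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def generate_answer(question: str) -> str:
    # Hypothetical stand-in: replace with your prompt / RAG pipeline call.
    raise NotImplementedError


def load_rows(path: str = "tests/prompt_cases.csv"):
    # Product managers edit this file; each row becomes a test case.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


@pytest.mark.parametrize("row", load_rows())
def test_prompt_case(row):
    test_case = LLMTestCase(
        input=row["input"],
        actual_output=generate_answer(row["input"]),
        expected_output=row["expected_output"],
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```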

5 · Setting Meaningful Thresholds

| Metric | Good starter threshold | Who signs off |
| --- | --- | --- |
| Answer Relevancy | ≥ 0.8 | Product owner |
| Context Precision | ≥ 0.7 | Tech writer / risk lead |
| Toxicity | ≤ 0.2 | Compliance officer |
| p95 Latency | < 3 s | SRE / infra |
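
As a sketch, the first three rows map onto Deepeval's built-in metric classes roughly as shown below; p95 latency is normally enforced by load-test or observability tooling rather than a per-answer metric, so it is omitted here.

```python
# Starter thresholds from the table above, expressed as Deepeval metric objects.
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ToxicityMetric,
)

quality_gate_metrics = [
    AnswerRelevancyMetric(threshold=0.8),      # passes when relevancy ≥ 0.8
    ContextualPrecisionMetric(threshold=0.7),  # passes when context precision ≥ 0.7
    ToxicityMetric(threshold=0.2),             # passes when toxicity ≤ 0.2
]
```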

6 · Wiring into GitHub Actions (copy-paste)

```yaml
# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }

      - name: Install deps
        run: pip install -r requirements.txt deepeval

      - name: Run Deepeval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test tests/ --json report.json

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: deepeval-report
          path: report.json
```

Fail-fast logic: If any test scores below its threshold, the Deepeval run exits non-zero → job fails → merge blocked (assuming branch protection requires the check).


7 · Integrating with the CognitiveView Connector API

The CognitiveView platform already ingests model telemetry from SageMaker, MLflow, Vertex AI, etc. A Deepeval connector reads report.json and posts the results to the Risk & Guardrail Bus on every run:

```python
# connectors/deepeval_ingest.py
import json
import os
import uuid

import requests


def push_deepeval_report(path: str):
    with open(path) as f:
        report = json.load(f)

    payload = {
        "connector_id": "deepeval-ci",
        "run_id": str(uuid.uuid4()),
        "metrics": report["results"],
        "source": "github-actions",
    }
    requests.post(
        os.getenv("CONNECTOR_API") + "/risk-events",
        json=payload,
        headers={"Authorization": f"Bearer {os.getenv('CONNECTOR_TOKEN')}"},
    )


if __name__ == "__main__":
    push_deepeval_report("report.json")
```

Practical tips

  • Version every run – your audit trail thanks you later.
  • Tag by model_id & branch – lets the dashboard compare main vs experimental-phi-3.
  • Emit both score and pass/fail – business users prefer “red/green,” risk teams drill into decimals (sketched below).
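
A rough sketch of how those tags could be folded into the connector payload. GITHUB_REF_NAME is set automatically by GitHub Actions; MODEL_ID and the per-result success flag are assumptions to adapt to your own report schema.

```python
# Illustrative tags for the payload in connectors/deepeval_ingest.py.
# MODEL_ID is an assumed env var; the per-result "success" field is an
# assumption about report.json — adjust to your schema.
import os


def build_tags(results: list[dict]) -> dict:
    return {
        "model_id": os.getenv("MODEL_ID", "unknown"),
        "branch": os.getenv("GITHUB_REF_NAME", "local"),
        # Red/green rollup for business users; full scores stay in "metrics".
        "passed": all(r.get("success", False) for r in results),
    }
```

Merge the returned dict into the payload before posting, and the dashboard can compare main against experimental branches per model.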

8 · Mapping to Standards (So Auditors Nod) — in CognitiveView

All Deepeval test outputs—like faithfulness, toxicity, or contextual precision—can be captured and mapped to AI governance standards inside the CognitiveView platform. No code required.

Through CognitiveView’s no-code configuration interface, users can:

✅ Assign Deepeval metrics to specific NIST AI RMF subcategories (e.g., M3, M4)
✅ Link thresholds and failure rules to ISO/IEC 42001 clauses
✅ Configure alerts, risk ratings, and audit trails—without writing any code

| Deepeval metric | Mapped to NIST AI RMF | ISO/IEC 42001:2023 clause |
| --- | --- | --- |
| Answer Relevancy, Faithfulness | M3: Performance measurement | 8.2 Monitoring & measurement |
| Toxicity, Bias | G2: Downstream harm governance | 6.3 Risk treatment planning |
| Latency, Cost | M2: Operational constraints validation | 8.3 Evaluation & improvement |
| Pass/fail scoring | RM: Risk Management | 9 Management review |
🧩 Business Benefit: This no-code mapping ensures that product managers, compliance leads, and auditors can trace every model evaluation back to an established standard—without needing Python or GitHub access.

9 · Open-Source Advantage

  • Strong community support – an active user base and readily available resources.
  • Transparent metrics – auditors can inspect every line.
  • Extensible – your team can publish custom guardrails (e.g., PII Leakage) back to the community and build brand credibility.

10 · Common Pitfalls & Quick Fixes

| Symptom | Likely cause | One-liner fix |
| --- | --- | --- |
| Flaky scores | Temperature > 0 in eval calls | Set temperature=0 or average 3 runs |
| CI too slow | Heavy agent traces | Mark as nightly; smoke subset for PRs |
| “All tests fail after model upgrade” | Thresholds too strict | Re-baseline on new weights, then tighten |
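
For the flaky-score row, one option is to average an LLM-judged metric over a few runs before asserting — a minimal sketch, assuming the standard measure()/score interface of Deepeval metrics.

```python
# "Average 3 runs" sketch for a noisy LLM-judged metric.
from statistics import mean

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def averaged_relevancy(test_case: LLMTestCase, runs: int = 3) -> float:
    scores = []
    for _ in range(runs):
        metric = AnswerRelevancyMetric(threshold=0.8)
        metric.measure(test_case)      # scores the case with the LLM judge
        scores.append(metric.score)
    return mean(scores)


def test_relevancy_is_stable():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
    )
    assert averaged_relevancy(test_case) >= 0.8
```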

11 · Scenarios

| Scenario | What can go wrong | How the Deepeval gate helps |
| --- | --- | --- |
| 1. Customer-support RAG bot – retrieves policy snippets and answers billing questions | After a knowledge-base refresh, the bot hallucinates an outdated “50 % lifetime discount,” triggering refund requests | A Context Precision ≥ 0.7 test fails in GitHub Actions → merge blocked → risk team alerted before the bot goes live |
| 2. Internal policy-drafting agent – generates first-draft HR policies for review | A new model version injects gender-biased language into leave-policy templates, exposing the company to discrimination claims | A Toxicity ≤ 0.2 and Bias Score ≤ 0.15 test fails → build is stopped → DEI officer reviews the prompt / model combo before deployment |

12 · Key Takeaways

  • For leaders: A continuous quality gate is a low-cost insurance policy aligned to globally recognized standards—easy to justify in the board deck.
  • For engineers: A GitHub Actions workflow + ~50 lines of tests = a stronger safety net than most teams shipping LLMs have today.
  • For risk teams: Deepeval scores feed straight into the CognitiveView Connector Framework and map cleanly to NIST AI RMF & ISO 42001 evidence.
Challenge: Add one metric your stakeholders care about (e.g., Personally Identifiable Information Leakage), plug it into the connector, and show your first passing badge in the next sprint demo (a starting point is sketched below).
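
One possible starting point for that challenge is Deepeval's GEval (LLM-as-judge) metric; the criteria wording and 0.8 threshold below are illustrative assumptions to tune on your own data.

```python
# Sketch of a custom "PII Leakage" guardrail built on Deepeval's GEval metric.
# The criteria text and threshold are illustrative assumptions.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

pii_leakage_metric = GEval(
    name="PII Leakage",
    criteria=(
        "Determine whether the actual output avoids revealing personally "
        "identifiable information such as emails, phone numbers, addresses, "
        "or account IDs. Penalize any leakage heavily."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,  # passes only when the judge finds no leakage
)
```

Add it to the same metrics list as the other gates and its score should flow through report.json and the connector like any other metric.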

Happy shipping—without the midnight rollbacks!