Shipping Safe LLMs

Use Deepeval to build a continuous quality gate for LLMs that blocks hallucinations, bias, and drift. This guide shows how to integrate it with GitHub Actions and your risk framework—aligning AI deployments with NIST and ISO standards using open-source tools.

Building a Continuous Quality Gate with Deepeval and GitHub Actions (code for engineers • standards for risk teams • plain-English value for execs)

Deepeval in a nutshell: Deepeval is an open-source Python toolkit that lets you write unit-style tests for large-language-model (LLM) outputs—scoring each answer for relevancy, faithfulness, toxicity, latency, cost, or any custom metric—and returns a machine-readable JSON report that drops neatly into any CI/CD pipeline.

  • Why: One hallucinated answer can trigger refunds, PR crises, or regulatory heat. Automated evaluation is cheaper than the first incident review.
  • What: Use Deepeval (open-source) to write unit-style tests for prompts, RAG pipelines & agents, then fail the build if a risk threshold slips.
  • How: Wire those tests into GitHub Actions and stream every score to your Connector Framework—the central place you already pull SageMaker logs, MLflow metrics, and model-card status.
  • Result: Each pull request ships with an auditable scorecard that maps directly to NIST AI RMF “Measure & Manage” sub-categories and ISO/IEC 42001 clauses.

1 · The 1 a.m. Rollback No One Wants Again

“A customer promo-bot invented 50 % discounts at 1 a.m. The team spent the night hot-patching prompts and refunding customers.”

Without an always-on quality gate, any prompt tweak, model upgrade, or data refresh can land in prod while everyone sleeps.


2 · Why Continuous Evaluation Matters

| Risk | Business impact | Continuous gate → benefit |
| --- | --- | --- |
| Hallucination | Fines, brand damage | Block deploy if faithfulness < 0.8 |
| Toxic / biased output | PR fallout, churn | Fail build when toxicity > 0.2 |
| Latency spikes | Cart abandonment, SLA breaches | Alarm when p95 > 3 s |
| RAG drift | Wrong support answers | Alert when context precision < 0.7 |

Pay once to wire the gate → save forever in incident hours and audit prep.


3 · Deepeval in 90 Seconds

  • Tests are plain Python LLMTestCase objects.
  • Dozens of built-in metrics: Answer Relevancy, Hallucination, RAGAS, Toxicity, and more.
  • Runs locally via deepeval test and emits JSON reports for CI.

<details> <summary>Minimal prompt test</summary>

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_capital_of_france():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",  # your model's answer
        expected_output="Paris",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```

</details>


4 · Designing a Multi-Layer Test Suite

  1. Prompt-level – predictable Q&A, system prompts.
  2. RAG pipeline – validate that retrieved chunks support the answer.
  3. Agent dialogue – score multi-turn traces for goal completion.
Practical tip: Keep examples in CSV so product managers can add edge cases without touching Python (see the sketch below).
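
As a rough illustration of that CSV workflow, the sketch below turns each CSV row into one pytest case. The column names and the generate_answer() helper are assumptions — swap in your own prompt or RAG pipeline call.

```python
# test_prompt_cases.py — a minimal sketch of CSV-driven prompt tests.
# The CSV columns (input, expected_output) and generate_answer() are
# illustrative assumptions, not part of Deepeval itself.
import csv

import pytest

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def generate_answer(question: str) -> str:
    # Hypothetical stand-in: replace with your prompt / RAG pipeline call.
    raise NotImplementedError


def load_rows(path: str = "tests/prompt_cases.csv"):
    # Product managers edit this file; each row becomes a test case.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


@pytest.mark.parametrize("row", load_rows())
def test_prompt_case(row):
    test_case = LLMTestCase(
        input=row["input"],
        actual_output=generate_answer(row["input"]),
        expected_output=row["expected_output"],
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```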

5 · Setting Meaningful Thresholds

| Metric | Good starter threshold | Who signs off |
| --- | --- | --- |
| Answer Relevancy | ≥ 0.8 | Product owner |
| Context Precision | ≥ 0.7 | Tech writer / risk lead |
| Toxicity | ≤ 0.2 | Compliance officer |
| p95 Latency | < 3 s | SRE / infra |
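
As a sketch, the first three rows map onto Deepeval's built-in metric classes roughly as shown below; p95 latency is normally enforced by load-test or observability tooling rather than a per-answer metric, so it is omitted here.

```python
# Starter thresholds from the table above, expressed as Deepeval metric objects.
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ToxicityMetric,
)

quality_gate_metrics = [
    AnswerRelevancyMetric(threshold=0.8),      # passes when relevancy ≥ 0.8
    ContextualPrecisionMetric(threshold=0.7),  # passes when context precision ≥ 0.7
    ToxicityMetric(threshold=0.2),             # passes when toxicity ≤ 0.2
]
```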

6 · Wiring into GitHub Actions (copy-paste)

```yaml
# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }

      - name: Install deps
        run: pip install -r requirements.txt deepeval

      - name: Run Deepeval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test tests/ --json report.json

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: deepeval-report
          path: report.json
```

Fail-fast logic: If any test scores below its threshold, the Deepeval run exits non-zero → job fails → merge blocked (assuming branch protection requires the check).


7 · Integrating with the CognitiveView Connector API

The CognitiveView platform already ingests model telemetry from SageMaker, MLflow, Vertex AI, etc. A Deepeval connector reads report.json and posts the results to the Risk & Guardrail Bus on every run:

```python
# connectors/deepeval_ingest.py
import json
import os
import uuid

import requests


def push_deepeval_report(path: str):
    with open(path) as f:
        report = json.load(f)

    payload = {
        "connector_id": "deepeval-ci",
        "run_id": str(uuid.uuid4()),
        "metrics": report["results"],
        "source": "github-actions",
    }
    requests.post(
        os.getenv("CONNECTOR_API") + "/risk-events",
        json=payload,
        headers={"Authorization": f"Bearer {os.getenv('CONNECTOR_TOKEN')}"},
    )


if __name__ == "__main__":
    push_deepeval_report("report.json")
```

Practical tips

  • Version every run – your audit trail thanks you later.
  • Tag by model_id & branch – lets the dashboard compare main vs experimental-phi-3.
  • Emit both score and pass/fail – business users prefer “red/green,” risk teams drill into decimals (sketched below).
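
A rough sketch of how those tags could be folded into the connector payload. GITHUB_REF_NAME is set automatically by GitHub Actions; MODEL_ID and the per-result success flag are assumptions to adapt to your own report schema.

```python
# Illustrative tags for the payload in connectors/deepeval_ingest.py.
# MODEL_ID is an assumed env var; the per-result "success" field is an
# assumption about report.json — adjust to your schema.
import os


def build_tags(results: list[dict]) -> dict:
    return {
        "model_id": os.getenv("MODEL_ID", "unknown"),
        "branch": os.getenv("GITHUB_REF_NAME", "local"),
        # Red/green rollup for business users; full scores stay in "metrics".
        "passed": all(r.get("success", False) for r in results),
    }
```

Merge the returned dict into the payload before posting, and the dashboard can compare main against experimental branches per model.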

8 · Mapping to Standards (So Auditors Nod) — in CognitiveView

All Deepeval test outputs—like faithfulness, toxicity, or contextual precision—can be captured and mapped to AI governance standards inside the CognitiveView platform. No code required.

Through CognitiveView’s no-code configuration interface, users can:

✅ Assign Deepeval metrics to specific NIST AI RMF subcategories (e.g., M3, M4)
✅ Link thresholds and failure rules to ISO/IEC 42001 clauses
✅ Configure alerts, risk ratings, and audit trails—without writing any code

| Deepeval metric | Mapped to NIST AI RMF | ISO/IEC 42001:2023 clause |
| --- | --- | --- |
| Answer Relevancy, Faithfulness | M3: Performance measurement | 8.2 Monitoring & measurement |
| Toxicity, Bias | G2: Downstream harm governance | 6.3 Risk treatment planning |
| Latency, Cost | M2: Operational constraints validation | 8.3 Evaluation & improvement |
| Pass/fail scoring | RM: Risk Management | 9 Management review |
🧩 Business Benefit: This no-code mapping ensures that product managers, compliance leads, and auditors can trace every model evaluation back to an established standard—without needing Python or GitHub access.

9 · Open-Source Advantage

  • Strong community support – an active user base and readily available resources.
  • Transparent metrics – auditors can inspect every line.
  • Extensible – your team can publish custom guardrails (e.g., PII Leakage) back to the community and build brand credibility.

10 · Common Pitfalls & Quick Fixes

| Symptom | Likely cause | One-liner fix |
| --- | --- | --- |
| Flaky scores | Temperature > 0 in eval calls | Set temperature=0 or average 3 runs |
| CI too slow | Heavy agent traces | Mark as nightly; smoke subset for PRs |
| “All tests fail after model upgrade” | Thresholds too strict | Re-baseline on new weights, then tighten |
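
For the flaky-score row, one option is to average an LLM-judged metric over a few runs before asserting — a minimal sketch, assuming the standard measure()/score interface of Deepeval metrics.

```python
# "Average 3 runs" sketch for a noisy LLM-judged metric.
from statistics import mean

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def averaged_relevancy(test_case: LLMTestCase, runs: int = 3) -> float:
    scores = []
    for _ in range(runs):
        metric = AnswerRelevancyMetric(threshold=0.8)
        metric.measure(test_case)      # scores the case with the LLM judge
        scores.append(metric.score)
    return mean(scores)


def test_relevancy_is_stable():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
    )
    assert averaged_relevancy(test_case) >= 0.8
```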

11 · Scenarios

| Scenario | What can go wrong | How the Deepeval gate helps |
| --- | --- | --- |
| 1. Customer-support RAG bot – retrieves policy snippets and answers billing questions | After a knowledge-base refresh, the bot hallucinates an outdated “50 % lifetime discount,” triggering refund requests | A Context Precision ≥ 0.7 test fails in GitHub Actions → merge blocked → risk team alerted before the bot goes live |
| 2. Internal policy-drafting agent – generates first-draft HR policies for review | A new model version injects gender-biased language into leave-policy templates, exposing the company to discrimination claims | A Toxicity ≤ 0.2 and Bias Score ≤ 0.15 test fails → build is stopped → DEI officer reviews the prompt / model combo before deployment |

12 · Key Takeaways

  • For leaders: A continuous quality gate is a low-cost insurance policy aligned to globally recognized standards—easy to justify in the board deck.
  • For engineers: A GitHub Actions workflow + ~50 lines of tests = a stronger safety net than most teams shipping LLMs have today.
  • For risk teams: Deepeval scores feed straight into the CognitiveView Connector Framework and map cleanly to NIST AI RMF & ISO 42001 evidence.
Challenge: Add one metric your stakeholders care about (e.g., Personally Identifiable Information Leakage), plug it into the connector, and show your first passing badge in the next sprint demo (a starting point is sketched below).
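
One possible starting point for that challenge is Deepeval's GEval (LLM-as-judge) metric; the criteria wording and 0.8 threshold below are illustrative assumptions to tune on your own data.

```python
# Sketch of a custom "PII Leakage" guardrail built on Deepeval's GEval metric.
# The criteria text and threshold are illustrative assumptions.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

pii_leakage_metric = GEval(
    name="PII Leakage",
    criteria=(
        "Determine whether the actual output avoids revealing personally "
        "identifiable information such as emails, phone numbers, addresses, "
        "or account IDs. Penalize any leakage heavily."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,  # passes only when the judge finds no leakage
)
```

Add it to the same metrics list as the other gates and its score should flow through report.json and the connector like any other metric.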

Happy shipping—without the midnight rollbacks!