Documentation
How Sentinel AI works
A single API call inspects both sides of an LLM conversation, runs every detector in parallel, and returns a structured risk report. Here's the full flow.
Analysis pipeline
Ingest
The caller sends a POST request to /api/analyze with a prompt (user input), a response (model output), or both. Each field is optional on its own, but at least one must be non-empty.
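The ingest rule above can be sketched as a small validation helper. This is a minimal illustration, not the service's actual code: the names `AnalyzeRequest` and `normalizeRequest` are hypothetical, and treating whitespace-only strings as empty is an assumption.

```typescript
// Hypothetical request shape: both fields are optional on their own.
interface AnalyzeRequest {
  prompt?: string;
  response?: string;
}

// Normalizes missing fields to "" and rejects payloads where neither field has content.
// ASSUMPTION: whitespace-only strings count as empty; the docs only say "non-empty".
function normalizeRequest(body: AnalyzeRequest): { prompt: string; response: string } {
  const prompt = typeof body.prompt === "string" ? body.prompt : "";
  const response = typeof body.response === "string" ? body.response : "";
  if (prompt.trim() === "" && response.trim() === "") {
    throw new Error("Provide a non-empty prompt, a non-empty response, or both.");
  }
  return { prompt, response };
}
```

A caller would run this before handing the payload to the detectors, so every later step can rely on both fields being strings.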
Detect
Six detectors run against the payload simultaneously. Each detector defines one or more pattern-matching rules tuned for its signal class. Matches are extracted with a ±30-character context window.
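To make the ±30-character context window concrete, here is a rough sketch of how a single detector might extract excerpts around regex matches. The `Rule` and `Match` types, the function names, and the example rule are all hypothetical; the real detectors' rules are not shown in this document.

```typescript
const CONTEXT_CHARS = 30; // the ±30-character window described above

interface Rule { name: string; pattern: RegExp; }
interface Match { rule: string; matched: string; excerpt: string; }

// Returns the matched span plus up to 30 characters on each side, clamped to the text bounds.
function excerptAround(text: string, start: number, end: number): string {
  const from = Math.max(0, start - CONTEXT_CHARS);
  const to = Math.min(text.length, end + CONTEXT_CHARS);
  return text.slice(from, to);
}

// Runs every rule against the text and collects one Match per hit.
function findMatches(text: string, rules: Rule[]): Match[] {
  const matches: Match[] = [];
  for (const rule of rules) {
    // Rebuild the pattern with the global flag so exec() advances through all hits.
    const flags = "g" + (rule.pattern.ignoreCase ? "i" : "") + (rule.pattern.multiline ? "m" : "");
    const re = new RegExp(rule.pattern.source, flags);
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches to avoid looping forever
      matches.push({
        rule: rule.name,
        matched: m[0],
        excerpt: excerptAround(text, m.index, m.index + m[0].length),
      });
    }
  }
  return matches;
}
```

In this sketch the excerpt includes the matched text itself, so it can exceed 60 characters for long matches; whether the service trims it further is not stated here.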
Score
Each triggered flag carries a severity weight (critical=80, high=45, medium=20, low=5). The raw sum is capped at 100 with a diminishing-returns penalty for multiple flags, producing a final 0–100 score.
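The exact diminishing-returns formula is not given here, so the following is only one plausible shape for it. The severity weights come from the text above; the `DECAY` factor of 0.5 and the function name `riskScore` are assumptions made purely for illustration.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

// Weights as documented above.
const SEVERITY_WEIGHT: Record<Severity, number> = { critical: 80, high: 45, medium: 20, low: 5 };

// ASSUMPTION: each additional flag counts for half as much as the one before it.
// The real penalty curve is not specified in this document.
const DECAY = 0.5;

function riskScore(severities: Severity[]): number {
  // Largest weight first, so the most severe flag always counts in full.
  const weights = severities.map((s) => SEVERITY_WEIGHT[s]).sort((a, b) => b - a);
  let total = 0;
  let factor = 1;
  for (const w of weights) {
    total += w * factor;
    factor *= DECAY;
  }
  // Cap at 100 to stay inside the documented 0 to 100 range.
  return Math.min(100, Math.round(total));
}
```

Under these assumptions a single critical flag scores 80, while a high flag plus a medium flag scores 45 + 20 × 0.5 = 55.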
Classify
Score ≥ 70 → blocked. Score 30–69 → warning. Score < 30 → safe. The status drives the action your application should take.
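The thresholds above map directly onto a small function. The name `classify` is hypothetical; the cut-offs are the ones stated in this step.

```typescript
type Status = "safe" | "warning" | "blocked";

// Maps a 0 to 100 risk score onto the three documented statuses.
function classify(score: number): Status {
  if (score >= 70) return "blocked";
  if (score >= 30) return "warning";
  return "safe";
}
```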
Respond
The endpoint returns the score, the status, a sorted list of flags with plain-English descriptions and excerpts, and a metadata block that includes the detector count and an ISO 8601 timestamp.
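As a rough sketch of this step, the helpers below order flags and build the metadata block. The sort key is an assumption (most severe first), since the document says only that the list is sorted; `sortFlags` and `buildMeta` are hypothetical names, and the lengths are assumed to be JavaScript string lengths.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

interface Flag {
  detector: string;
  label: string;
  description: string;
  severity: Severity;
  source: "prompt" | "response" | "both";
  excerpt: string;
}

const SEVERITY_RANK: Record<Severity, number> = { critical: 3, high: 2, medium: 1, low: 0 };

// ASSUMPTION: "sorted" means most severe first. Returns a copy; the input is not mutated.
function sortFlags(flags: Flag[]): Flag[] {
  return flags.slice().sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);
}

// Builds the meta block using the field names from the API schema.
function buildMeta(prompt: string, response: string, detectorsRun: number) {
  return {
    prompt_length: prompt.length,     // assumed to be UTF-16 code units
    response_length: response.length,
    detectors_run: detectorsRun,
    analyzed_at: new Date().toISOString(), // ISO 8601 timestamp in UTC
  };
}
```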
Detector reference
Detects attempts to override system instructions or manipulate the model's behavior through adversarial prompt crafting — e.g. 'ignore all previous instructions', jailbreak phrases, DAN prompts.
Catches requests that fall into high-risk categories: weapons synthesis, hacking guidance, self-harm methods, or production of dangerous substances.
Identifies personally identifiable information exposed in model outputs — SSNs, credit card numbers, passport patterns, or plaintext passwords.
Detects prompts that attempt to extract training data, system prompts, or internal context through repetition or meta-query attacks.
Flags harmful, harassing, or hateful language in either prompt or response — including explicit threats and targeted harassment patterns.
Identifies unverified factual claims stated with false certainty, along with hedging language that masks them. Such statements warrant independent verification.
Scoring & thresholds
0 – 29 (safe)
Clean. No action required.
30 – 69 (warning)
Review flags before continuing.
70 – 100 (blocked)
Block or quarantine immediately.
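On the application side, the three bands can drive a simple gate. The sketch below is one possible mapping; the `Action` values and the function name `actionFor` are hypothetical and would be replaced by whatever your application actually does with a message.

```typescript
type Status = "safe" | "warning" | "blocked";

// Hypothetical application-side actions; rename to suit your own pipeline.
type Action = "deliver" | "hold_for_review" | "reject";

// Translates the status returned by the API into an application action.
function actionFor(status: Status): Action {
  switch (status) {
    case "safe":
      return "deliver";          // 0 to 29: no action required
    case "warning":
      return "hold_for_review";  // 30 to 69: review flags before continuing
    case "blocked":
      return "reject";           // 70 to 100: block or quarantine
  }
}
```

Keeping this mapping in one place makes it easy to tighten or relax the policy later without touching the code that calls the API.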
API schema
POST /api/analyze
Request:
{
"prompt": string, // User input sent to the model (optional if response provided)
"response": string // Model output to inspect (optional if prompt provided)
}
Response:
{
"score": number, // 0–100 unified risk score
"status": "safe" | "warning" | "blocked",
"flags": [
{
"detector": string,
"label": string,
"description": string,
"severity": "low" | "medium" | "high" | "critical",
"source": "prompt" | "response" | "both",
"excerpt": string // offending text window (~60 chars)
}
],
"meta": {
"prompt_length": number,
"response_length": number,
"detectors_run": number,
"analyzed_at": string // ISO 8601
}
}
Runtime
The /api/analyze endpoint runs on the Next.js Edge Runtime, which is globally distributed and adds under 50 ms of overhead with no cold start. The intentional ~420 ms delay present in development is removed in production builds.
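For orientation, a route handler opted into the Edge Runtime might look roughly like the skeleton below. It assumes the Next.js App Router (a hypothetical file at app/api/analyze/route.ts); the `analyze` stub, the `json` helper, and the 400 error responses are placeholders rather than the service's actual behavior.

```typescript
// Route segment config: asks Next.js to run this handler on the Edge Runtime.
export const runtime = "edge";

// Small helper that serializes a JSON body with the right content type.
function json(data: unknown, status: number): Response {
  return new Response(JSON.stringify(data), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// PLACEHOLDER: stands in for the detect, score, and classify steps.
function analyze(prompt: string, response: string) {
  return {
    score: 0,
    status: "safe",
    flags: [],
    meta: {
      prompt_length: prompt.length,
      response_length: response.length,
      detectors_run: 6,
      analyzed_at: new Date().toISOString(),
    },
  };
}

export async function POST(req: Request): Promise<Response> {
  let parsed: unknown;
  try {
    parsed = await req.json();
  } catch {
    return json({ error: "Body must be valid JSON." }, 400); // assumed error shape
  }
  const body = (parsed && typeof parsed === "object" ? parsed : {}) as {
    prompt?: unknown;
    response?: unknown;
  };
  const prompt = typeof body.prompt === "string" ? body.prompt : "";
  const response = typeof body.response === "string" ? body.response : "";
  if (prompt.trim() === "" && response.trim() === "") {
    return json({ error: "Provide a non-empty prompt or response." }, 400);
  }
  return json(analyze(prompt, response), 200);
}
```

The handler relies only on the standard Request and Response objects, which is what makes it compatible with an edge environment.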