Documentation

How Sentinel AI works

A single API call inspects both sides of an LLM conversation, runs every detector in parallel, and returns a structured risk report. Here's the full flow.

Analysis pipeline

Ingest

The caller sends a POST to /api/analyze with a prompt (user input) and/or a response (model output). Both fields are optional independently, but at least one must be non-empty.

Detect

Six detectors run against the payload simultaneously. Each detector defines one or more pattern-matching rules tuned for its signal class. Matches are extracted with a ±30-character context window.

Score

Each triggered flag carries a severity weight (critical=80, high=45, medium=20, low=5). The raw sum is capped at 100 with a diminishing-returns penalty for multiple flags, producing a final 0–100 score.

Classify

Score ≥ 70 → blocked. Score 30–69 → warning. Score < 30 → safe. The status drives the action your application should take.

Respond

The endpoint returns the score, status, sorted flag list with plain-English descriptions and excerpts, and a metadata block including detector count and ISO timestamp.

Detector reference

Prompt InjectionCRITICAL

Detects attempts to override system instructions or manipulate the model's behavior through adversarial prompt crafting — e.g. 'ignore all previous instructions', jailbreak phrases, DAN prompts.

Sensitive DomainCRITICAL

Catches requests that fall into high-risk categories: weapons synthesis, hacking guidance, self-harm methods, or production of dangerous substances.

PII LeakageHIGH

Identifies personally identifiable information exposed in model outputs — SSNs, credit card numbers, passport patterns, or plaintext passwords.

Data ExfiltrationHIGH

Detects prompts that attempt to extract training data, system prompts, or internal context through repetition or meta-query attacks.

Toxic ContentHIGH

Flags harmful, harassing, or hateful language in either prompt or response — including explicit threats and targeted harassment patterns.

Hallucination RiskMEDIUM

Identifies hedging language that masks unverified factual claims presented with false certainty — overconfident statements that warrant independent verification.

Scoring & thresholds

SAFE

0 – 29

Clean. No action required.

WARNING

30 – 69

Review flags before continuing.

BLOCKED

70 – 100

Block or quarantine immediately.

API schema

Request / Response

POST /api/analyze

Request:
{
  "prompt":   string,   // User input sent to the model (optional if response provided)
  "response": string    // Model output to inspect (optional if prompt provided)
}

Response:
{
  "score":  number,     // 0–100 unified risk score
  "status": "safe" | "warning" | "blocked",
  "flags": [
    {
      "detector":    string,
      "label":       string,
      "description": string,
      "severity":    "low" | "medium" | "high" | "critical",
      "source":      "prompt" | "response" | "both",
      "excerpt":     string   // offending text window (~60 chars)
    }
  ],
  "meta": {
    "prompt_length":   number,
    "response_length": number,
    "detectors_run":   number,
    "analyzed_at":     string  // ISO 8601
  }
}

Runtime

The /api/analyze endpoint runs on the Next.js Edge Runtime — globally distributed, zero cold start, sub-50ms overhead. The intentional ~420ms delay in development is removed in production builds.