Guardrails
Definition
Guardrails are programmatic safety controls applied at the application layer, independent of the LLM's built-in alignment. Input guardrails screen user messages before they reach the LLM—detecting prompt injection attempts, topic violations, or personally identifiable information. Output guardrails evaluate LLM responses before they reach the user—checking for policy violations, PII exposure, hallucinated facts, or competitor mentions. Guardrail frameworks like NVIDIA NeMo Guardrails, Guardrails AI, and LlamaGuard provide pre-built detectors for common categories. Custom guardrails use a secondary LLM to evaluate content against a rubric or rule set. Guardrails add latency (usually 50-200ms) and cost (additional LLM calls) but provide defense in depth against alignment failures and misuse.
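The input and output screening described above can be sketched as a simple wrapper around an LLM call. This is a minimal illustration, not any particular framework's API: the regex patterns, function names, and block messages are all placeholders, and real deployments would use framework-provided detectors with far broader coverage.

```python
import re

# Illustrative patterns only; production systems use dedicated detectors
# (e.g. from NeMo Guardrails, Guardrails AI, or LlamaGuard classifiers).
INJECTION_PATTERNS = [r"ignore (all )?previous instructions"]
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # US SSN format

def input_guardrail(message: str) -> tuple[bool, str]:
    """Screen a user message before it reaches the LLM."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, message, re.IGNORECASE):
            return False, "possible prompt injection"
    for pattern in PII_PATTERNS:
        if re.search(pattern, message):
            return False, "PII detected"
    return True, "ok"

def output_guardrail(response: str, banned_terms: list[str]) -> tuple[bool, str]:
    """Evaluate an LLM response before it reaches the user."""
    for term in banned_terms:
        if term.lower() in response.lower():
            return False, f"policy violation: mentions '{term}'"
    return True, "ok"

def guarded_chat(message: str, llm_call, banned_terms: list[str]) -> str:
    """Run input checks, call the model, then run output checks."""
    ok, reason = input_guardrail(message)
    if not ok:
        return f"Request blocked ({reason})."
    response = llm_call(message)
    ok, reason = output_guardrail(response, banned_terms)
    if not ok:
        return "I can't share that response."
    return response
```

The `llm_call` parameter stands in for any model client; swapping it for a real API call is the only change needed to wire this into an application.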
Why It Matters
LLM alignment reduces but does not eliminate safety failures. Guardrails add a deterministic, application-specific safety layer that doesn't depend on model behavior. A financial services chatbot may need guardrails to prevent the model from providing investment advice (a regulatory requirement); a healthcare chatbot may need guardrails to ensure medical disclaimers always appear. Guardrails also enable business-specific policies that no general-purpose alignment training would include—ensuring competitors are never mentioned negatively, maintaining a consistent brand voice, or restricting responses to the product's domain. For 99helpers customers, combining model alignment with application guardrails provides a defense-in-depth approach to responsible AI deployment.
How It Works
Guardrail implementation patterns: (1) rule-based input filtering—regex or keyword detection for obvious violations before LLM call; (2) LLM-based input classification—a small, fast model classifies the input: safe/unsafe/off-topic; (3) structured output validation—verify LLM output conforms to expected JSON schema or format; (4) LLM-based output evaluation—a secondary LLM checks whether the response violates specific policies; (5) retrieval-grounded fact-checking—verify claims in the response against the retrieved context. NVIDIA NeMo Guardrails uses Colang (a configuration language) to define conversation rails: specific topics the bot should always avoid, custom response templates for sensitive queries, and fallback behaviors for policy violations.
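Pattern (3), structured output validation, is the most deterministic of the five and can run with no extra LLM call. A minimal sketch, assuming the application expects a flat JSON object; the field names and error messages are hypothetical:

```python
import json

def validate_structured_output(raw: str, required_fields: dict) -> tuple[bool, str]:
    """Verify an LLM response parses as JSON and matches an expected shape
    before any downstream code consumes it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    for field, expected_type in required_fields.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"

# Example shape for pattern (2)'s classifier output: a small model is asked
# to label the input, and this validator confirms the label came back intact.
CLASSIFIER_SCHEMA = {"label": str, "confidence": float}
```

On validation failure, common strategies are to retry the LLM call with the error appended to the prompt, or to fall back to a templated response.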
[Diagram: Guardrails — Input / Output Safety Pipeline]
Real-World Example
A 99helpers customer in the insurance sector deploys an AI chatbot. Required guardrails: (1) input guardrail detects if the user question is about a specific competitor and routes to a human agent instead of the LLM; (2) output guardrail ensures every response about policy coverage includes 'This is general information only, please review your actual policy'; (3) PII guardrail detects if the user includes SSN, credit card numbers, or passport numbers in messages and instructs them to use secure channels instead. These guardrails address business-specific compliance requirements that the underlying model's alignment training cannot anticipate.
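Guardrails (2) and (3) from this example can be sketched directly. The regexes below are rough approximations for illustration; a production PII detector would cover more formats and validate card numbers with a Luhn check:

```python
import re

# Approximate patterns for illustration only.
SENSITIVE_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
COVERAGE_DISCLAIMER = (
    "This is general information only, please review your actual policy."
)

def pii_guardrail(message: str):
    """Input guardrail: return the kind of PII found, or None.
    On a hit, the app asks the user to switch to a secure channel."""
    for kind, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(message):
            return kind
    return None

def apply_coverage_disclaimer(response: str, is_coverage_topic: bool) -> str:
    """Output guardrail: coverage answers always carry the disclaimer."""
    if is_coverage_topic and COVERAGE_DISCLAIMER not in response:
        return f"{response}\n\n{COVERAGE_DISCLAIMER}"
    return response
```

Guardrail (1), competitor detection, would typically be an intent classifier rather than a regex, since competitor questions rarely use predictable wording.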
Common Mistakes
- ✕ Relying exclusively on guardrails and skipping model alignment—guardrails add overhead and can be circumvented; they work best as a complement to model-level safety, not a replacement.
- ✕ Applying the same guardrails to all query types regardless of risk level—strict guardrails on low-risk queries add unnecessary latency and cost; tiering guardrails by risk level is more efficient.
- ✕ Not monitoring guardrail trigger rates in production—a high trigger rate indicates either overly aggressive guardrails (a high false-positive rate) or genuine misuse patterns that need addressing.
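The monitoring mistake above is cheap to avoid: counting checks and triggers per guardrail is enough to surface a false-positive problem or a misuse spike. A minimal sketch; the class and method names are illustrative, and a real deployment would export these counts to its metrics system:

```python
from collections import Counter

class GuardrailMonitor:
    """Track per-guardrail trigger rates for production dashboards."""

    def __init__(self):
        self.checks = Counter()    # total evaluations per guardrail
        self.triggers = Counter()  # evaluations that blocked content

    def record(self, guardrail: str, triggered: bool) -> None:
        self.checks[guardrail] += 1
        if triggered:
            self.triggers[guardrail] += 1

    def trigger_rate(self, guardrail: str) -> float:
        """Fraction of checks that triggered; 0.0 if never evaluated."""
        checks = self.checks[guardrail]
        return self.triggers[guardrail] / checks if checks else 0.0
```

A sustained rate far above the guardrail's expected baseline is the signal to either loosen the rule or investigate the traffic behind it.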
Related Terms
Safety Training
Safety training is the process of fine-tuning LLMs to refuse harmful requests, avoid dangerous content generation, and behave safely across adversarial inputs while maintaining helpfulness for legitimate use cases.
Model Alignment
Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.
Red-Teaming
Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Constitutional AI
Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.