Guardrails
Definition
Guardrails are programmatic safety controls applied at the application layer, independent of the LLM's built-in alignment. Input guardrails screen user messages before they reach the LLM—detecting prompt injection attempts, topic violations, or personally identifiable information. Output guardrails evaluate LLM responses before they reach the user—checking for policy violations, PII exposure, hallucinated facts, or competitor mentions. Guardrail frameworks like NVIDIA NeMo Guardrails, Guardrails AI, and LlamaGuard provide pre-built detectors for common categories. Custom guardrails use a secondary LLM to evaluate content against a rubric or rule set. Guardrails add latency (usually 50-200ms) and cost (additional LLM calls) but provide defense in depth against alignment failures and misuse.
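The input and output screening described above can be sketched as a simple wrapper around an LLM call. This is a minimal illustration, not any particular framework's API: the regex patterns, function names, and block messages are all placeholders, and real deployments would use framework-provided detectors with far broader coverage.

```python
import re

# Illustrative patterns only; production systems use dedicated detectors
# (e.g. from NeMo Guardrails, Guardrails AI, or LlamaGuard classifiers).
INJECTION_PATTERNS = [r"ignore (all )?previous instructions"]
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # US SSN format

def input_guardrail(message: str) -> tuple[bool, str]:
    """Screen a user message before it reaches the LLM."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, message, re.IGNORECASE):
            return False, "possible prompt injection"
    for pattern in PII_PATTERNS:
        if re.search(pattern, message):
            return False, "PII detected"
    return True, "ok"

def output_guardrail(response: str, banned_terms: list[str]) -> tuple[bool, str]:
    """Evaluate an LLM response before it reaches the user."""
    for term in banned_terms:
        if term.lower() in response.lower():
            return False, f"policy violation: mentions '{term}'"
    return True, "ok"

def guarded_chat(message: str, llm_call, banned_terms: list[str]) -> str:
    """Run input checks, call the model, then run output checks."""
    ok, reason = input_guardrail(message)
    if not ok:
        return f"Request blocked ({reason})."
    response = llm_call(message)
    ok, reason = output_guardrail(response, banned_terms)
    if not ok:
        return "I can't share that response."
    return response
```

The `llm_call` parameter stands in for any model client; swapping it for a real API call is the only change needed to wire this into an application.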
Why It Matters
LLM alignment reduces but does not eliminate safety failures. Guardrails add a deterministic, application-specific safety layer that doesn't depend on model behavior. A financial services chatbot may need guardrails to prevent the model from providing investment advice (a regulatory requirement); a healthcare chatbot may need guardrails to ensure medical disclaimers always appear. Guardrails also enable business-specific policies that no general-purpose alignment training would include—ensuring competitors are never mentioned negatively, maintaining a consistent brand voice, or restricting responses to the product's domain. For 99helpers customers, combining model alignment with application guardrails provides a defense-in-depth approach to responsible AI deployment.
How It Works
Guardrail implementation patterns: (1) rule-based input filtering—regex or keyword detection for obvious violations before LLM call; (2) LLM-based input classification—a small, fast model classifies the input: safe/unsafe/off-topic; (3) structured output validation—verify LLM output conforms to expected JSON schema or format; (4) LLM-based output evaluation—a secondary LLM checks whether the response violates specific policies; (5) retrieval-grounded fact-checking—verify claims in the response against the retrieved context. NVIDIA NeMo Guardrails uses Colang (a configuration language) to define conversation rails: specific topics the bot should always avoid, custom response templates for sensitive queries, and fallback behaviors for policy violations.
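Pattern (3), structured output validation, is the most deterministic of the five and can run with no extra LLM call. A minimal sketch, assuming the application expects a flat JSON object; the field names and error messages are hypothetical:

```python
import json

def validate_structured_output(raw: str, required_fields: dict) -> tuple[bool, str]:
    """Verify an LLM response parses as JSON and matches an expected shape
    before any downstream code consumes it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    for field, expected_type in required_fields.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"

# Example shape for pattern (2)'s classifier output: a small model is asked
# to label the input, and this validator confirms the label came back intact.
CLASSIFIER_SCHEMA = {"label": str, "confidence": float}
```

On validation failure, common strategies are to retry the LLM call with the error appended to the prompt, or to fall back to a templated response.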
[Diagram: Guardrails — Input / Output Safety Pipeline]
Real-World Example
A 99helpers customer in the insurance sector deploys an AI chatbot. Required guardrails: (1) input guardrail detects if the user question is about a specific competitor and routes to a human agent instead of the LLM; (2) output guardrail ensures every response about policy coverage includes 'This is general information only, please review your actual policy'; (3) PII guardrail detects if the user includes SSN, credit card numbers, or passport numbers in messages and instructs them to use secure channels instead. These guardrails address business-specific compliance requirements that the underlying model's alignment training cannot anticipate.
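Guardrails (2) and (3) from this example can be sketched directly. The regexes below are rough approximations for illustration; a production PII detector would cover more formats and validate card numbers with a Luhn check:

```python
import re

# Approximate patterns for illustration only.
SENSITIVE_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
COVERAGE_DISCLAIMER = (
    "This is general information only, please review your actual policy."
)

def pii_guardrail(message: str):
    """Input guardrail: return the kind of PII found, or None.
    On a hit, the app asks the user to switch to a secure channel."""
    for kind, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(message):
            return kind
    return None

def apply_coverage_disclaimer(response: str, is_coverage_topic: bool) -> str:
    """Output guardrail: coverage answers always carry the disclaimer."""
    if is_coverage_topic and COVERAGE_DISCLAIMER not in response:
        return f"{response}\n\n{COVERAGE_DISCLAIMER}"
    return response
```

Guardrail (1), competitor detection, would typically be an intent classifier rather than a regex, since competitor questions rarely use predictable wording.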
Common Mistakes
- ✕ Relying exclusively on guardrails and skipping model alignment—guardrails add overhead and can be circumvented; they work best as a complement to model-level safety, not a replacement.
- ✕ Applying the same guardrails to all query types regardless of risk level—strict guardrails on low-risk queries add unnecessary latency and cost; tiering guardrails by risk level is more efficient.
- ✕ Not monitoring guardrail trigger rates in production—a high trigger rate indicates either overly aggressive guardrails (a high false-positive rate) or genuine misuse patterns that need addressing.
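The monitoring mistake above is cheap to avoid: counting checks and triggers per guardrail is enough to surface a false-positive problem or a misuse spike. A minimal sketch; the class and method names are illustrative, and a real deployment would export these counts to its metrics system:

```python
from collections import Counter

class GuardrailMonitor:
    """Track per-guardrail trigger rates for production dashboards."""

    def __init__(self):
        self.checks = Counter()    # total evaluations per guardrail
        self.triggers = Counter()  # evaluations that blocked content

    def record(self, guardrail: str, triggered: bool) -> None:
        self.checks[guardrail] += 1
        if triggered:
            self.triggers[guardrail] += 1

    def trigger_rate(self, guardrail: str) -> float:
        """Fraction of checks that triggered; 0.0 if never evaluated."""
        checks = self.checks[guardrail]
        return self.triggers[guardrail] / checks if checks else 0.0
```

A sustained rate far above the guardrail's expected baseline is the signal to either loosen the rule or investigate the traffic behind it.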
Related Terms
Safety Training
Safety training is the process of fine-tuning LLMs to refuse harmful requests, avoid dangerous content generation, and behave safely across adversarial inputs while maintaining helpfulness for legitimate use cases.
Model Alignment
Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.
Red-Teaming
Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Constitutional AI
Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.