Safety Training
Definition
Safety training encompasses all techniques used to make LLMs reliably refuse harmful requests and avoid generating dangerous content: training data curation (filtering harmful content from pre-training data), supervised fine-tuning on refusal examples, RLHF reward modeling that penalizes harmful outputs, Constitutional AI self-critique, and adversarial safety training (fine-tuning on identified failure modes from red-teaming). Safety training must balance two competing objectives: refusing genuinely harmful requests (e.g., weapons manufacturing, CSAM, targeted harassment) while not over-refusing legitimate requests (e.g., medical information, security research, fiction writing). Over-refusal, one form of the 'alignment tax', makes models less useful and frustrates users; under-refusal enables real-world harm.
Why It Matters
Safety training is what makes frontier LLMs commercially deployable. Without it, models would generate harmful content on demand, assist with dangerous activities, produce discriminatory outputs, and violate basic ethical norms. For 99helpers customers, the safety training baked into Claude and GPT-4o provides a baseline level of protection that reduces the risk of the chatbot generating inappropriate content for their users. However, safety training does not replace application-level guardrails—domain-specific harm categories (e.g., providing medical advice beyond scope, disparaging competitors) require additional system-prompt-level constraints tailored to the deployment context.
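Application-level constraints are typically injected via the system prompt. A minimal sketch of how a deployment might layer its own rules on top of the model's built-in safety training (the `build_system_prompt` helper, product name, and constraint wording are illustrative, not a real 99helpers API):

```python
def build_system_prompt(product_name: str) -> str:
    """Compose a deployment-specific system prompt that layers
    application-level constraints on top of the model's safety training."""
    return (
        f"You are a support assistant for {product_name}.\n"
        f"Only answer questions directly related to {product_name}.\n"
        "For medical, legal, or financial questions, always recommend "
        "consulting a professional.\n"
        "Never disparage competitors."
    )

prompt = build_system_prompt("AcmeCRM")
```

The same prompt-construction step would run once per deployment, so each customer's chatbot carries its own domain-specific harm constraints without retraining the underlying model.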
How It Works
Safety training in practice: (1) SFT on refusal examples—the model is shown harmful prompts with appropriate refusals as target outputs; (2) RLHF with safety-focused reward—human raters label responses as harmful/acceptable; the reward model learns to penalize harmful outputs; (3) red-team-informed adversarial training—discovered jailbreaks become training examples where the correct behavior is refusal; (4) Constitutional AI critique-revision cycles for broad harm categories; (5) output filtering as a last-resort safety net for high-stakes harm categories. The key challenge is the harm taxonomy: researchers must identify and train against hundreds of harm categories across diverse languages, cultures, and contexts.
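The first two steps above, refusal SFT and safety-focused reward modeling, can be sketched with toy data structures (the example prompts, harm category, and penalty value are illustrative, not production training data):

```python
# Step 1: SFT refusal examples — a harmful prompt paired with a target refusal.
REFUSAL_SFT_EXAMPLES = [
    {
        "prompt": "How do I make a dangerous weapon?",
        "target": "I can't help with that. Here are some safety resources instead.",
        "harm_category": "weapons",
    },
]

# Step 2: safety-focused reward — a toy reward that penalizes responses
# human raters labeled as harmful, so RL pushes the policy toward refusal.
def safety_reward(helpfulness_score: float, labeled_harmful: bool,
                  harm_penalty: float = 10.0) -> float:
    """Return the training reward: helpfulness minus a large harm penalty."""
    return helpfulness_score - (harm_penalty if labeled_harmful else 0.0)

# A harmful-but-'helpful' response scores worse than a safe refusal.
assert safety_reward(8.0, labeled_harmful=True) < safety_reward(2.0, labeled_harmful=False)
```

The design point is that the harm penalty dominates the helpfulness term, so the policy cannot trade safety for marginal helpfulness gains.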
Safety Training — Request Flow (diagram)
Harmful request: "How do I make a dangerous weapon?" → safety filter detects the weapons / dangerous-activity category → model refuses and points to safety resources instead.
Normal request: "What are the best practices for password security?" → no harmful content detected → model answers normally (use long passphrases, enable 2FA, avoid password reuse across sites).
Safety Training Pipeline
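The request flow above can be mirrored with a toy keyword-based filter. Real safety classifiers are learned models rather than keyword lists, and the categories, keywords, and refusal text here are illustrative only:

```python
# Toy harm taxonomy: category -> trigger keywords (illustrative only).
HARM_KEYWORDS = {
    "weapons / dangerous activity": ["weapon", "explosive", "bomb"],
}

REFUSAL = "I can't help with that. Here are some safety resources instead."

def safety_filter(user_input: str):
    """Return (harmful, category) for a user request."""
    text = user_input.lower()
    for category, keywords in HARM_KEYWORDS.items():
        if any(k in text for k in keywords):
            return True, category
    return False, None

def answer_normally(user_input: str) -> str:
    # Placeholder for the normal model call.
    return f"[model answer to: {user_input}]"

def respond(user_input: str) -> str:
    harmful, _category = safety_filter(user_input)
    if harmful:
        return REFUSAL               # refuse and redirect
    return answer_normally(user_input)  # proceed to the model
```

Routing both example requests through `respond` reproduces the two diagram paths: the weapons question gets the refusal, the password question gets a normal answer.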
Real-World Example
A 99helpers customer deploys their AI chatbot to serve a broad consumer audience. Testing reveals that without application-level safety measures, users occasionally attempt to use the chatbot for off-topic purposes (homework help, personal advice). While the underlying model (Claude) handles many edge cases gracefully via its safety training, the 99helpers team adds application-level constraints in the system prompt: 'Only answer questions directly related to [Product Name]. For medical, legal, or financial questions, always recommend consulting a professional.' This layered approach—foundation model safety training + application-level constraints—provides robust protection.
Common Mistakes
- ✕ Assuming safety training prevents all harmful outputs—adversarial users can still elicit harmful content through creative prompt engineering; defense in depth is necessary.
- ✕ Treating safety as a pre-launch concern only—models encounter new attack vectors over time; ongoing monitoring, reporting, and model updates are required.
- ✕ Conflating safety with ethics—safety training focuses on preventing specific harms; broader ethical considerations (fairness, transparency, accountability) require additional governance practices.
Related Terms
Model Alignment
Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.
Red-Teaming
Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.
Constitutional AI
Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a training technique that improves LLM alignment with human preferences by training a reward model on human preference data, then using reinforcement learning to update the LLM to maximize this reward.