Safety Training
Definition
Safety training encompasses all techniques used to make LLMs reliably refuse harmful requests and avoid generating dangerous content: training data curation (filtering harmful content from pre-training data), supervised fine-tuning on refusal examples, RLHF reward modeling that penalizes harmful outputs, Constitutional AI self-critique, and adversarial safety training (fine-tuning on identified failure modes from red-teaming). Safety training must balance two competing objectives: refusing genuinely harmful requests (e.g., weapons manufacturing, CSAM, targeted harassment) while not over-refusing legitimate requests (e.g., medical information, security research, fiction writing). Over-refusal, one form of the 'alignment tax', makes models less useful and frustrates users; under-refusal enables real-world harm.
Why It Matters
Safety training is what makes frontier LLMs commercially deployable. Without it, models would generate harmful content on demand, assist with dangerous activities, produce discriminatory outputs, and violate basic ethical norms. For 99helpers customers, the safety training baked into Claude and GPT-4o provides a baseline level of protection that reduces the risk of the chatbot generating inappropriate content for their users. However, safety training does not replace application-level guardrails—domain-specific harm categories (e.g., providing medical advice beyond scope, disparaging competitors) require additional system-prompt-level constraints tailored to the deployment context.
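Application-level constraints are typically injected via the system prompt. A minimal sketch of how a deployment might layer its own rules on top of the model's built-in safety training (the `build_system_prompt` helper, product name, and constraint wording are illustrative, not a real 99helpers API):

```python
def build_system_prompt(product_name: str) -> str:
    """Compose a deployment-specific system prompt that layers
    application-level constraints on top of the model's safety training."""
    return (
        f"You are a support assistant for {product_name}.\n"
        f"Only answer questions directly related to {product_name}.\n"
        "For medical, legal, or financial questions, always recommend "
        "consulting a professional.\n"
        "Never disparage competitors."
    )

prompt = build_system_prompt("AcmeCRM")
```

The same prompt-construction step would run once per deployment, so each customer's chatbot carries its own domain-specific harm constraints without retraining the underlying model.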
How It Works
Safety training in practice: (1) SFT on refusal examples—the model is shown harmful prompts with appropriate refusals as target outputs; (2) RLHF with safety-focused reward—human raters label responses as harmful/acceptable; the reward model learns to penalize harmful outputs; (3) red-team-informed adversarial training—discovered jailbreaks become training examples where the correct behavior is refusal; (4) Constitutional AI critique-revision cycles for broad harm categories; (5) output filtering as a last-resort safety net for high-stakes harm categories. The key challenge is the harm taxonomy: researchers must identify and train against hundreds of harm categories across diverse languages, cultures, and contexts.
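The first two steps above, refusal SFT and safety-focused reward modeling, can be sketched with toy data structures (the example prompts, harm category, and penalty value are illustrative, not production training data):

```python
# Step 1: SFT refusal examples — a harmful prompt paired with a target refusal.
REFUSAL_SFT_EXAMPLES = [
    {
        "prompt": "How do I make a dangerous weapon?",
        "target": "I can't help with that. Here are some safety resources instead.",
        "harm_category": "weapons",
    },
]

# Step 2: safety-focused reward — a toy reward that penalizes responses
# human raters labeled as harmful, so RL pushes the policy toward refusal.
def safety_reward(helpfulness_score: float, labeled_harmful: bool,
                  harm_penalty: float = 10.0) -> float:
    """Return the training reward: helpfulness minus a large harm penalty."""
    return helpfulness_score - (harm_penalty if labeled_harmful else 0.0)

# A harmful-but-'helpful' response scores worse than a safe refusal.
assert safety_reward(8.0, labeled_harmful=True) < safety_reward(2.0, labeled_harmful=False)
```

The design point is that the harm penalty dominates the helpfulness term, so the policy cannot trade safety for marginal helpfulness gains.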
Safety Training — Request Flow (diagram)
Harmful request: "How do I make a dangerous weapon?" → safety filter detects the weapons / dangerous-activity category → model refuses and points to safety resources instead.
Normal request: "What are the best practices for password security?" → no harmful content detected → model answers normally (use long passphrases, enable 2FA, avoid password reuse across sites).
Safety Training Pipeline
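The request flow above can be mirrored with a toy keyword-based filter. Real safety classifiers are learned models rather than keyword lists, and the categories, keywords, and refusal text here are illustrative only:

```python
# Toy harm taxonomy: category -> trigger keywords (illustrative only).
HARM_KEYWORDS = {
    "weapons / dangerous activity": ["weapon", "explosive", "bomb"],
}

REFUSAL = "I can't help with that. Here are some safety resources instead."

def safety_filter(user_input: str):
    """Return (harmful, category) for a user request."""
    text = user_input.lower()
    for category, keywords in HARM_KEYWORDS.items():
        if any(k in text for k in keywords):
            return True, category
    return False, None

def answer_normally(user_input: str) -> str:
    # Placeholder for the normal model call.
    return f"[model answer to: {user_input}]"

def respond(user_input: str) -> str:
    harmful, _category = safety_filter(user_input)
    if harmful:
        return REFUSAL               # refuse and redirect
    return answer_normally(user_input)  # proceed to the model
```

Routing both example requests through `respond` reproduces the two diagram paths: the weapons question gets the refusal, the password question gets a normal answer.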
Real-World Example
A 99helpers customer deploys their AI chatbot to serve a broad consumer audience. Testing reveals that without application-level safety measures, users occasionally attempt to use the chatbot for off-topic purposes (homework help, personal advice). While the underlying model (Claude) handles many edge cases gracefully via its safety training, the 99helpers team adds application-level constraints in the system prompt: 'Only answer questions directly related to [Product Name]. For medical, legal, or financial questions, always recommend consulting a professional.' This layered approach—foundation model safety training + application-level constraints—provides robust protection.
Common Mistakes
- ✕ Assuming safety training prevents all harmful outputs—adversarial users can still elicit harmful content through creative prompt engineering; defense in depth is necessary.
- ✕ Treating safety as a pre-launch concern only—models encounter new attack vectors over time; ongoing monitoring, reporting, and model updates are required.
- ✕ Conflating safety with ethics—safety training focuses on preventing specific harms; broader ethical considerations (fairness, transparency, accountability) require additional governance practices.
Related Terms
Model Alignment
Model alignment is the process of training LLMs to behave in ways that are helpful, harmless, and honest, ensuring outputs match human values and intentions rather than just optimizing for text prediction.
Red-Teaming
Red-teaming for LLMs is the practice of adversarially probing a model to discover safety failures, harmful behaviors, and alignment gaps before deployment by simulating malicious or misuse-oriented user inputs.
Constitutional AI
Constitutional AI is Anthropic's alignment technique that trains Claude to evaluate and revise its own responses against a set of principles (a 'constitution'), reducing reliance on human labelers for safety training.
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a training technique that improves LLM alignment with human preferences by training a reward model on human preference data, then using reinforcement learning to update the LLM to maximize this reward.