LLM Security
Definition
LLM security is the discipline of identifying, assessing, and mitigating security risks specific to large language model applications. The OWASP Top 10 for LLMs identifies the most critical threat categories: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Unlike traditional software security, LLM security must contend with the inherent ambiguity of natural language—there is no equivalent to SQL parameterization that definitively prevents language-based attacks.
Why It Matters
LLM security is a critical but frequently neglected discipline as organizations rush to deploy AI features. Unlike API security or input validation—which have well-established patterns—LLM security requires new mental models: the input and instruction channels are the same text stream; model behavior is probabilistic, not deterministic; and 'fixing' one attack vector often doesn't prevent novel variants. Security teams that treat LLM components as black boxes with standard input validation will miss the LLM-specific attack surface. Dedicated LLM security review is required for any customer-facing or sensitive-data-handling AI deployment.
How It Works
LLM security practice covers: (1) threat modeling—identifying what an attacker could cause the model to do and what data they could extract; (2) red teaming—systematically testing adversarial inputs before deployment; (3) defense-in-depth—layering input classifiers, output filters, and agentic permission systems rather than relying on any single control; (4) monitoring—logging all LLM interactions and alerting on anomalous patterns; (5) access control—applying principle of least privilege to agentic tool access; (6) output handling—treating LLM outputs as untrusted user content in downstream systems (sanitize before rendering HTML, validate before executing code).
LLM Security — Threat Matrix
Malicious instructions override system prompt or tool calls
System prompt or training data extracted via crafted queries
Roleplay or framing manipulates model past content policies
Fine-tuning data inferred from model outputs at scale
Malicious instructions embedded in documents / web pages AI reads
Adversarial inputs maximize token usage to exhaust quotas
Defense Layers
Real-World Example
A financial services firm conducted an LLM security review before deploying an AI research assistant with access to internal financial data. The review uncovered: (1) a prompt injection vulnerability in the document reader that could cause the model to exfiltrate document contents; (2) insufficient access controls that allowed the model to retrieve any document rather than only those relevant to the current query; (3) a hallucination risk where the model would confidently fabricate financial figures not in its retrieved context. Remediation—adding content isolation, retrieval filtering, and grounding instructions—took 3 weeks but prevented what would have been a significant security incident.
Common Mistakes
- ✕Treating LLM security as identical to traditional web application security—the attack vectors and defenses are fundamentally different
- ✕Relying on model safety training as a security boundary—safety training reduces risk but is not a security guarantee; business logic must not depend on it
- ✕Ignoring security of agentic systems—an LLM that can take actions (send email, query databases, call APIs) requires especially rigorous security review of what actions it can take and under what conditions
Related Terms
Prompt Injection
Prompt injection is a security vulnerability where malicious content in user input or retrieved data overrides an LLM's instructions, potentially causing it to bypass safety measures, leak confidential information, or perform unintended actions.
Guardrails
Guardrails are input and output validation mechanisms layered around LLM calls to detect and block unsafe, off-topic, or non-compliant content, providing application-level safety beyond the model's built-in alignment.
Adversarial Prompting
Adversarial prompting deliberately crafts inputs designed to cause LLMs to fail, bypass safety measures, or behave unexpectedly—used both maliciously to exploit AI systems and constructively to test and harden them.
Prompt Leaking
Prompt leaking is a type of attack where a user manipulates an AI model into revealing its hidden system prompt, exposing proprietary instructions, personas, business logic, and constraints intended to be confidential.
Prompt Engineering
Prompt engineering is the practice of designing and refining the text inputs given to AI language models to reliably produce accurate, useful, and well-formatted outputs for specific tasks.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →