Training Data Poisoning

Definition

Training data poisoning is a supply chain attack on ML systems where an attacker contaminates the training data to influence the resulting model's behavior. Clean-label attacks manipulate existing labeled examples with imperceptible perturbations that cause misclassification. Backdoor attacks insert a trigger pattern—a specific word, watermark, or pixel pattern—such that whenever that trigger appears in a test input, the model produces the attacker-specified output. Data poisoning in web-scraped datasets can be executed by any party who can control content that gets scraped. For LLMs trained on internet data, poisoning attacks are particularly concerning because training corpora are massive and impossible to fully audit.
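As an illustration of the backdoor variant, the sketch below stamps a small pixel-pattern trigger onto a fraction of a toy image set and relabels those examples to an attacker-chosen class. The function name, the 3x3 corner patch, and the poisoning rate are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def poison_with_backdoor(images, labels, target_label, rate=0.01, seed=0):
    """Backdoor poisoning sketch: stamp a bright 3x3 corner patch (the
    trigger) onto a random fraction of training images and flip their
    labels to the attacker's target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i, -3:, -3:] = 1.0   # the trigger pattern
        labels[i] = target_label    # attacker-specified output
    return images, labels, idx

# Toy usage: 100 blank 8x8 grayscale "images", all labeled class 0.
X = np.zeros((100, 8, 8))
y = np.zeros(100, dtype=int)
Xp, yp, idx = poison_with_backdoor(X, y, target_label=7, rate=0.05)
```

A model trained on `Xp, yp` can learn to associate the corner patch with class 7 while behaving normally on clean inputs, which is why standard accuracy metrics do not reveal the backdoor.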

Why It Matters

Training data poisoning is a critical supply chain risk for any ML system that trains on external, user-generated, or web-scraped data. A customer support chatbot trained on past conversations can be poisoned by adversarial users who craft conversations to teach it biased or harmful responses. Image classifiers trained on web-scraped data can be poisoned by anyone who can get images indexed. For foundation models trained on internet-scale data, even small poisoning percentages (0.01% of training data) can introduce backdoors given sufficient scale. Organizations must treat training data as a security-critical asset requiring validation, provenance tracking, and anomaly detection.

How It Works

Defending against poisoning relies on layered controls:

  • Data provenance: track the source and collection date of all training data.
  • Data validation: check for statistical anomalies, outliers, and unusual distributions in training data before use.
  • Filtering: use robust statistics or anomaly detection to identify and remove suspected poisoned examples.
  • Certified defenses: techniques such as data sanitization and randomized training provide probabilistic guarantees against some poisoning attacks.
  • Model behavior testing: red-team the trained model, probing specifically for backdoor triggers.
  • Training data minimization: use the minimum necessary training data, preferring high-quality curated datasets over massive uncurated ones.

STRIP and STRIP-ViTA are test-time defenses that detect trigger-bearing inputs at inference.
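The validation and filtering steps can be approximated with simple robust statistics. A minimal sketch, assuming raw feature vectors stand in for learned embeddings: flag any example that sits unusually far from its class centroid.

```python
import numpy as np

def filter_outliers(X, y, z_thresh=3.0):
    """Per-class anomaly filter (sketch): keep an example only if its
    distance to the class centroid is within z_thresh standard deviations
    of the class's mean distance. Real pipelines would typically run this
    on learned embeddings rather than raw pixels/features."""
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(y):
        mask = y == c
        centroid = X[mask].mean(axis=0)
        dist = np.linalg.norm(X[mask] - centroid, axis=1)
        mu, sigma = dist.mean(), dist.std() + 1e-12
        keep[np.where(mask)[0]] = (dist - mu) / sigma <= z_thresh
    return keep
```

This is deliberately simple: it catches gross outliers, but clean-label poisons designed to look like inliers require stronger defenses such as the certified techniques listed above.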

Training Data Poisoning Attack & Defense

Attack Vector

  • Inject malicious training examples
  • Label flipping (correct→wrong)
  • Backdoor trigger patterns
  • Gradient-based data manipulation

Defenses

  • Data provenance tracking
  • Anomaly detection on labels
  • Training data audits
  • Differential privacy training
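One concrete test-time defense, in the spirit of STRIP, blends a suspect input with random clean inputs: a backdoor trigger tends to survive the blend and dominate the prediction, so triggered inputs show abnormally low prediction entropy. A hedged sketch, assuming `model_predict` maps a batch of inputs to class-probability rows (all names and thresholds illustrative):

```python
import numpy as np

def strip_entropy(model_predict, x, clean_pool, n=16, alpha=0.5, seed=0):
    """STRIP-style check (sketch): superimpose the suspect input onto n
    randomly chosen clean inputs and average the entropy of the model's
    predictions over the blends. A LOW average entropy means the
    prediction is suspiciously stable, a hint that a trigger dominates."""
    rng = np.random.default_rng(seed)
    picks = clean_pool[rng.choice(len(clean_pool), size=n)]
    blends = alpha * x + (1 - alpha) * picks
    probs = np.clip(model_predict(blends), 1e-12, 1.0)
    ent = -(probs * np.log(probs)).sum(axis=1)   # entropy per blend
    return float(ent.mean())                     # low value -> suspicious
```

In practice a detection threshold would be calibrated on known-clean inputs, then inputs scoring below it are quarantined for review.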

Real-World Example

A computer vision company trained their object detection model on a public dataset augmented with images scraped from the web. Post-deployment security testing discovered a backdoor: images containing a specific small logo in the corner were systematically misclassified as 'background'—effectively making objects invisible to the detector when the trigger was present. Investigation traced the poisoning to a small subset of the web-scraped images that had been deliberately modified and uploaded to influence the training dataset. The model was retrained after replacing the web-scraped images with verified sources, and ongoing training data auditing was added to the ML pipeline.

Common Mistakes

  • Treating training data as a trusted input—any training data from external sources is a potential attack vector and must be validated
  • Not testing for backdoors after training—standard accuracy metrics don't reveal backdoors; explicit backdoor testing with potential trigger patterns is required
  • Over-relying on data provenance alone—provenance tells you where data came from but not whether it has been manipulated
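The second mistake above (skipping explicit backdoor testing) can be addressed with a simple trigger scan: stamp each candidate trigger onto clean inputs and measure how often the prediction flips. A sketch, assuming a `model_predict` batch API and a dictionary of candidate trigger patterns (both illustrative assumptions):

```python
import numpy as np

def trigger_scan(model_predict, X_clean, apply_trigger, triggers):
    """Backdoor red-team sketch: for each candidate trigger, report the
    fraction of clean inputs whose predicted class changes once the
    trigger is applied. A near-100% flip rate for one pattern is a
    strong backdoor signal."""
    base = model_predict(X_clean).argmax(axis=1)
    report = {}
    for name, trig in triggers.items():
        Xt = np.array([apply_trigger(x, trig) for x in X_clean])
        flipped = model_predict(Xt).argmax(axis=1) != base
        report[name] = float(flipped.mean())
    return report
```

The hard part is choosing candidate triggers; scans like this catch known or suspected patterns, which is why they complement rather than replace training-data auditing.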
