PII Detection
Definition
Personally Identifiable Information (PII) detection uses NLP and pattern matching to locate sensitive personal data in unstructured text and structured records. PII categories include direct identifiers (full name, email, phone number, SSN, passport number, date of birth) and indirect identifiers (e.g., zip code, age, and gender, which together can uniquely identify an individual). Detection approaches combine regex patterns for structured PII such as phone numbers and emails, named entity recognition (NER) models trained to recognize person names, organizations, and locations, and trained classifiers for contextually dependent PII. Tools include Microsoft Presidio, AWS Comprehend PII, Google Cloud DLP, and spaCy-based pipelines.
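The regex stage of such a pipeline can be sketched in a few lines. This is a minimal illustration, not production-grade detection: the patterns below are simplified assumptions, and real systems pair them with NER models and validation (e.g., checksum tests for credit card numbers).

```python
import re

# Illustrative patterns only; real deployments use hardened, validated regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
}

def find_structured_pii(text):
    """Return (label, start, end, matched_text) tuples for regex-detectable PII."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])
```

Calling `find_structured_pii("Reach me at jane@acme.com or 555-123-4567")` would surface the email and phone spans; names and contextual PII would pass through untouched, which is why regex alone is insufficient.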
Why It Matters
PII detection is critical infrastructure for any AI system that processes user-generated or customer data. LLMs trained on data containing PII can memorize and regurgitate sensitive information, creating privacy liability. Customer support chatbots process messages containing SSNs, credit card numbers, and medical information that must not be stored in plain text in logs. Data pipelines feeding ML training must be scanned for PII before the data is used. Regulatory requirements (GDPR, HIPAA, CCPA) mandate PII protection; PII detection is the technical mechanism that enables compliance at scale.
How It Works
A PII detection pipeline processes incoming text in stages: (1) regex patterns match high-confidence structured PII (SSNs in the XXX-XX-XXXX format, credit card numbers, email addresses, phone numbers); (2) an NER model identifies person names, locations, and organizations; (3) a contextual classifier catches quasi-identifiers and sensitive data that rules miss, such as medical conditions mentioned alongside demographic information; (4) detected PII spans are redacted (replaced with placeholders like '[NAME]'), masked (replaced with synthetic values), or flagged for human review. In real-time systems, detection runs inline before any data is logged or stored. Precision/recall tradeoffs are calibrated by domain: healthcare systems err toward high recall, since over-detection is safer than missing HIPAA-protected data.
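The staged flow above can be sketched as follows. This is a toy sketch under stated assumptions: the NER stage is stubbed with a name lookup (a real system would call a model such as spaCy or Presidio), and only one regex pattern is shown.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KNOWN_NAMES = {"John Smith"}  # stand-in for a real NER model (assumption)

def detect_spans(text):
    spans = []
    # Stage 1: high-confidence regex matches
    for m in SSN_RE.finditer(text):
        spans.append((m.start(), m.end(), "SSN"))
    # Stage 2: name detection (stubbed; a real pipeline runs an NER model here)
    for name in KNOWN_NAMES:
        idx = text.find(name)
        if idx != -1:
            spans.append((idx, idx + len(name), "NAME"))
    return sorted(spans)

def redact(text):
    # Stage 4: replace detected spans with placeholders, working right to left
    # so earlier offsets stay valid as the string shrinks/grows.
    out = text
    for start, end, label in sorted(detect_spans(text), reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out
```

For example, `redact("John Smith's SSN is 123-45-6789.")` returns `"[NAME]'s SSN is [SSN]."`.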
PII Detection & Redaction

| Example | Action |
| --- | --- |
| John Smith (name) | Redact |
| john@company.com (email) | Redact |
| +1 (555) 123-4567 (phone) | Redact |
| XXX-XX-6789 (SSN) | Block |
| 192.168.1.100 (IP address) | Hash |

Real-World Example
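The per-category actions in the table above can be sketched as a small dispatcher. This is an illustrative sketch, not a prescribed implementation: the `apply_action` function, its action names, and the salt value are assumptions for the example; in production the salt would be a managed secret.

```python
import hashlib

SALT = b"example-salt"  # placeholder; use a managed, rotated secret in production

def apply_action(value, action):
    if action == "Redact":
        return "[REDACTED]"  # remove the value, keep the record
    if action == "Block":
        # reject the record entirely rather than storing any form of it
        raise ValueError("record contains blocked PII; drop before storage")
    if action == "Hash":
        # stable pseudonym: same input -> same token, raw value never stored,
        # so records can still be joined on the hashed field
        return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]
    raise ValueError(f"unknown action: {action}")
```

Hashing (rather than redacting) IP addresses preserves the ability to count distinct users or correlate events without retaining the raw address.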
A B2B SaaS company's AI chatbot logs all conversations for quality improvement and model training. A privacy audit discovered that users frequently shared sensitive data in chat: SSNs when verifying identity, credit card numbers when troubleshooting billing, and medical information when seeking support. None of this was being detected or redacted before logging. After implementing Microsoft Presidio for real-time PII detection and redaction in the logging pipeline, all sensitive data is redacted before storage. The training dataset for their next model retrain is privacy-compliant—SSN, credit card, and medical information detection runs at 98.7% recall on their test set.
Common Mistakes
- ✕Using only regex patterns for PII detection—regex catches structured PII but misses names, medical conditions, and contextually sensitive information
- ✕Running PII detection only on logging, not on model outputs—LLMs can reproduce training data containing PII in their responses; output scanning is essential
- ✕Treating PII detection as solved once deployed—detection systems require ongoing evaluation as new PII patterns emerge and model performance degrades
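The second mistake above — scanning inputs and logs but not model outputs — is cheap to avoid by running the same detectors over responses before they leave the system. A minimal sketch, assuming simplified regex detectors (a real deployment would reuse the full detection pipeline here):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def guard_output(response):
    """Redact PII from a model response before it reaches the user or the logs."""
    response = SSN_RE.sub("[SSN]", response)
    response = EMAIL_RE.sub("[EMAIL]", response)
    return response
```

The same guard can be applied at logging time, so a single function protects both the user-facing path and the stored transcript.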
Related Terms
Data Privacy
Data privacy in AI governs how personal information is collected, stored, and used to train and operate AI systems—requiring organizations to protect individuals' rights, minimize data collection, and obtain proper consent.
Responsible AI
Responsible AI is a framework of organizational practices and principles—encompassing fairness, transparency, privacy, safety, and accountability—that guide how teams build and deploy AI systems that are trustworthy and beneficial.
AI Governance
AI governance is the set of policies, processes, and oversight structures that organizations use to ensure their AI systems are developed and deployed responsibly, compliantly, and in alignment with organizational values and regulatory requirements.
Training Data Poisoning
Training data poisoning is an attack where adversaries inject malicious or manipulated examples into an AI model's training dataset, causing the model to learn backdoors, biases, or targeted misbehaviors that persist through deployment.