Sensitive Information Types
Sensitive Information Types (SITs) are the detection engine that drives auto-labeling, DLP policies, and Insider Risk Management. Every policy in this chapter references one or more SIT groups. Getting SIT configuration right before deploying any label or policy prevents both false positives (over-blocking) and false negatives (missed detections).
How SITs Work
Each SIT evaluates content using a combination of pattern matching (regex), keyword proximity, and checksum validation. The result is a confidence level:
| Confidence Level | Meaning | Typical Use |
|---|---|---|
| High (85–100) | Pattern + corroborating keywords found | Block or auto-label |
| Medium (75–84) | Pattern found, fewer keywords | Alert or policy tip |
| Low (65–74) | Pattern only, no corroboration | Audit / Discovery only |
Policies should reference the confidence level explicitly. A DLP rule that fires at Low confidence on Credit Card will generate excessive false positives in any organization that sends invoices by email.
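As a rough illustration of referencing the confidence level explicitly, the sketch below maps a numeric score to the bands in the table above and checks it against a rule's minimum band. The function names and the band ordering are assumptions made for this example, not part of any Purview API.

```python
from typing import Optional

# Band floors taken from the confidence table in this section.
BANDS = [(85, "High"), (75, "Medium"), (65, "Low")]
BAND_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def confidence_band(score: int) -> Optional[str]:
    """Return the band name for a 0-100 score, or None when it is below 65."""
    for floor, name in BANDS:
        if score >= floor:
            return name
    return None


def rule_fires(score: int, minimum_band: str) -> bool:
    """True when the match's band is at or above the rule's minimum band."""
    band = confidence_band(score)
    return band is not None and BAND_ORDER[band] >= BAND_ORDER[minimum_band]
```

With this shape, a Credit Card match scored at 70 would satisfy a rule whose minimum band is Low but not one whose minimum band is High, which is the distinction the paragraph above asks policies to make.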
Core SIT Groups
Financial SITs
Used in DLP Finance alert policies and auto-labeling of Confidential content.
| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| Credit Card Number | Luhn checksum + keyword proximity | See false positive mitigation below |
| U.S. Bank Account Number | ABA routing + account pattern | |
| SWIFT Code | ISO 9362 pattern | |
| International Banking Account Number (IBAN) | Country-specific regex + checksum | |
| U.S. Individual Taxpayer Identification Number (ITIN) | SSN-format + ITIN keyword | |
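The Luhn checksum listed for Credit Card Number can be sketched in a few lines. This is the standard public algorithm, written here for illustration only; it is not Purview's internal implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True when the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

Because only the final digit is constrained, roughly one in ten arbitrary digit strings passes this check, which is why the checksum alone is weak evidence without corroborating keywords.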
PII SITs
Used in DLP PII alert policies and auto-labeling.
| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| U.S. Social Security Number (SSN) | Pattern + keyword | Avoid Low confidence in email |
| All Full Names | NER model | High false-positive rate in isolation; pair with other SITs |
| U.S. Driver's License Number | State-specific patterns | 50 distinct patterns |
| U.S. Individual Taxpayer Identification Number (ITIN) | Pattern + keyword | Overlaps with SSN pattern |
| U.S. Physical Addresses | NER model | Useful for HR/benefits data detection |
Credential SITs
Used in the highest-priority DLP alert policies (AD Credential Protection).
| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| General Password | Pattern + keyword proximity | Fires on "password:" followed by a value |
| Azure Active Directory Client Secret | Pattern | Rotate immediately on detection |
| Azure SAS Token | Pattern | |
| Software Development Credentials | Pattern bundle | npm tokens, API keys, connection strings |
The Credit Card False Positive Problem
Credit Card Number is the most common source of false positives. Purchase orders, invoices, and financial reports contain 16-digit numbers that match the Luhn checksum but are not card numbers. Two mitigations:
1. Keyword Exclusion Dictionary
Create a custom keyword dictionary with terms that indicate a billing context rather than a card number:
Invoice #
PO Number
Purchase Order
Account Number
Reference Number
Order ID
Quote Number
Configure the SIT to not match when these terms appear within 300 characters of the pattern. In the Purview portal: SITs → Credit Card Number → Edit → Add exclusion keywords.
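The effect of the exclusion dictionary can be approximated outside Purview for testing sample content. The sketch below is a simplification under stated assumptions: the candidate pattern is a bare 16-digit run, the 300-character window is measured on both sides of the match, and matching is case-insensitive. How Purview measures the window internally may differ.

```python
import re
from typing import List

# Terms from the exclusion dictionary above.
EXCLUSION_TERMS = [
    "Invoice #", "PO Number", "Purchase Order", "Account Number",
    "Reference Number", "Order ID", "Quote Number",
]
WINDOW = 300  # characters on each side of the candidate (assumption)
CANDIDATE = re.compile(r"\b\d{16}\b")  # simplified stand-in for the card pattern


def unexcluded_candidates(text: str) -> List[str]:
    """Return 16-digit candidates that have no exclusion term nearby."""
    lowered = text.lower()
    hits = []
    for match in CANDIDATE.finditer(text):
        start = max(0, match.start() - WINDOW)
        end = min(len(text), match.end() + WINDOW)
        window = lowered[start:end]
        if not any(term.lower() in window for term in EXCLUSION_TERMS):
            hits.append(match.group())
    return hits
```

Running a handful of real invoices and real card-bearing messages through a check like this before editing the SIT gives an early read on whether the dictionary is too broad or too narrow.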
2. Confidence Level Gating
Do not trigger block actions at Low or Medium confidence for Credit Card. Use the following tiered approach:
| Confidence | Action |
|---|---|
| High (85+) | Alert + require justification |
| Medium (75–84) | Policy tip only |
| Low (65–74) | Audit log only, no user notification |
Environment-Specific SITs
- GCC High (CMMC)
- Commercial
CUI-Specific Sensitive Information Types
CMMC Level 2 (NIST SP 800-171 Rev. 2, requirements 3.1.3 and 3.13.1) requires identifying and controlling CUI. The following SITs detect content that must be labeled and protected as CUI.
Distribution Statement Keywords
Create a custom keyword SIT named CUI — Distribution Statement with the following keyword groups:
| Keyword Group | Terms |
|---|---|
| Distribution Statement B | Distribution Statement B, Dist Stmt B, DISTRIBUTION B |
| Distribution Statement C | Distribution Statement C, Dist Stmt C, DISTRIBUTION C |
| Distribution Statement D | Distribution Statement D, Dist Stmt D, DISTRIBUTION D |
| Distribution Statement E | Distribution Statement E, Dist Stmt E, DISTRIBUTION E |
| Distribution Statement F | Distribution Statement F, Dist Stmt F, DISTRIBUTION F |
Set minimum match count to 1 and confidence to High — Distribution Statement text is unambiguous.
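To preview which keyword group a document would hit, the Distribution Statement table can be expressed as a small matcher. This is a sketch for offline testing; the case-insensitive, whole-word matching is an assumption of the example rather than a statement of how Purview evaluates keyword lists.

```python
import re
from typing import List

# Keyword groups B through F, built from the table above.
GROUPS = {
    f"Distribution Statement {letter}": [
        f"Distribution Statement {letter}",
        f"Dist Stmt {letter}",
        f"DISTRIBUTION {letter}",
    ]
    for letter in "BCDEF"
}


def matched_groups(text: str) -> List[str]:
    """Return the keyword groups with at least one term present in `text`."""
    found = []
    for group, terms in GROUPS.items():
        alternation = "|".join(re.escape(term) for term in terms)
        # The trailing \b stops "Distribution D" from matching "distribution data".
        if re.search(rf"\b(?:{alternation})\b", text, flags=re.IGNORECASE):
            found.append(group)
    return found
```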
ITAR / EAR Export Control Keywords
Create a custom keyword SIT named CUI — Export Control:
| Keyword Group | Terms |
|---|---|
| ITAR | ITAR, International Traffic in Arms, 22 CFR 120, USML, Defense Article |
| EAR | EAR, Export Administration Regulations, 15 CFR 730, ECCN, Commerce Control List |
| CUI Export Control | CUI//EXPT, CUI//SP-EXPT |
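Short acronyms such as EAR deserve a quick test before deployment, since a case-insensitive substring match would also hit words like "year". The sketch below treats the acronyms as case-sensitive whole words and the longer phrases as case-insensitive; that split is a design assumption of this example, not something the table above prescribes.

```python
import re
from typing import List

# Acronyms are matched as case-sensitive whole words (assumption).
ACRONYMS = {"ITAR": ["ITAR", "USML"], "EAR": ["EAR", "ECCN"]}
# Longer phrases are matched case-insensitively.
PHRASES = {
    "ITAR": ["International Traffic in Arms", "22 CFR 120", "Defense Article"],
    "EAR": ["Export Administration Regulations", "15 CFR 730", "Commerce Control List"],
    "CUI Export Control": ["CUI//EXPT", "CUI//SP-EXPT"],
}


def export_control_groups(text: str) -> List[str]:
    """Return the sorted keyword groups that `text` would trigger."""
    found = set()
    for group, terms in ACRONYMS.items():
        if any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms):
            found.add(group)
    lowered = text.lower()
    for group, terms in PHRASES.items():
        if any(term.lower() in lowered for term in terms):
            found.add(group)
    return sorted(found)
```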
DoD Contract Number Pattern
Create a custom regex SIT named CUI — DoD Contract Number:
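\bW[A-Z\d]{5}-?\d{2}-?[A-Z]-?[A-Z\d]{4}\b
This matches DoDAAC-based contract numbers in the format W81XWH-22-C-0001 (a six-character DoDAAC beginning with W, a two-digit fiscal year, an instrument-type letter, and a four-character serial), with or without the hyphens. Pair with keyword proximity: Contract, Award, Task Order.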
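Any contract number pattern is worth checking against the sample format before it goes into a SIT. The sketch below uses a hyphen-tolerant pattern written for this example (DoDAAC starting with W, fiscal year, instrument letter, four-character serial); the pattern and the helper name are assumptions, and the keyword check is a plain presence test rather than a true proximity window.

```python
import re
from typing import List

# Hyphen-tolerant pattern for the W81XWH-22-C-0001 format (assumption).
CONTRACT_NUMBER = re.compile(r"\bW[A-Z\d]{5}-?\d{2}-?[A-Z]-?[A-Z\d]{4}\b")
SUPPORTING_KEYWORDS = ("contract", "award", "task order")


def contract_numbers(text: str) -> List[str]:
    """Return contract-number matches, but only if a supporting keyword is present."""
    lowered = text.lower()
    if not any(keyword in lowered for keyword in SUPPORTING_KEYWORDS):
        return []
    return CONTRACT_NUMBER.findall(text)
```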
CUI Designation Indicator
Built-in SIT: U.S. Defense Contract Number — supplement with a custom SIT that matches the CUI banner line format:
CUI//(SP-[\w]+)?
This catches the formal CUI designation block required by NARA 32 CFR Part 2002.
All four CUI SITs should be bundled into a custom SIT group named CUI Bundle for use in a single auto-labeling policy rule.
Regulatory SIT Groupings
For commercial organizations, SITs map to regulatory frameworks. Group them as named SIT collections for use in DLP policies and auto-labeling.
GLBA Financial SIT Group
| SIT | Regulation |
|---|---|
| Credit Card Number | PCI-DSS |
| U.S. Bank Account Number | GLBA |
| SWIFT Code | GLBA / wire fraud |
| IBAN | GLBA / international |
| U.S. Individual Taxpayer Identification Number | GLBA |
HIPAA / HITECH SIT Group
Use the built-in U.S. Health Insurance Act SIT bundle, which includes:
- U.S. Social Security Number (SSN)
- Drug Enforcement Agency (DEA) Number
- National Drug Code
- All Full Names (paired with medical terms)
FERPA SIT Group (Higher Education)
FERPA does not have a prescribed SIT set. Build a custom keyword SIT named Student PII with terms:
Student ID
EMPL ID
Grade Report
Transcript
Enrollment Status
Financial Aid
FAFSA
Pair with All Full Names or SSN at High confidence to reduce false positives from general HR content.
PCI-DSS SIT Group
For retail or payment processors, bundle: Credit Card Number, CVV (custom regex \b\d{3,4}\b with keyword CVV, Security Code, CVC), and U.S. Bank Account Number.
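A three- or four-digit pattern matches almost any number, so the keyword requirement does all the work in the CVV SIT. The sketch below shows that behavior on sample text; the 50-character window is an assumption chosen for illustration, since the text above does not specify one.

```python
import re
from typing import List

CVV_PATTERN = re.compile(r"\b\d{3,4}\b")  # pattern from the text above
KEYWORDS = ("cvv", "security code", "cvc")
WINDOW = 50  # characters on each side of the match (assumption)


def cvv_candidates(text: str) -> List[str]:
    """Return 3-4 digit values that have a CVV keyword nearby."""
    lowered = text.lower()
    hits = []
    for match in CVV_PATTERN.finditer(text):
        nearby = lowered[max(0, match.start() - WINDOW): match.end() + WINDOW]
        if any(keyword in nearby for keyword in KEYWORDS):
            hits.append(match.group())
    return hits
```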
Trainable Classifiers
Trainable classifiers use machine learning rather than pattern matching. They are appropriate when content cannot be identified by a regex or keyword list — for example, recognizing "intellectual property" from context.
Pre-Trained Classifiers
Microsoft provides several pre-trained classifiers ready for use in policies without additional training:
| Classifier | Use Case |
|---|---|
| Source Code | Detects code files across 25+ languages |
| Resumes | Detects HR candidate data |
| Financial Statements | Balance sheets, P&L statements |
| HR — Harassment | Policy violation detection |
| Threat | Threatening language |
| Profanity | Policy enforcement |
Custom Trainable Classifiers
When pre-trained classifiers do not match your content, create a custom classifier:
- Seed phase — Upload 50–200 representative positive documents to a SharePoint library designated as the seed site. Documents should be real examples of the content category (e.g., NDAs, engineering change orders, board resolutions).
- Test phase — Purview presents 200 test documents; reviewers mark each as positive or negative. Iterate until precision exceeds 70%.
- Publish — The classifier becomes available in auto-labeling and DLP conditions.
Custom classifiers take 24–48 hours to initially train. Retrain quarterly or when precision drops below acceptable thresholds.
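The 70% precision target from the test phase is the share of documents the classifier flagged that reviewers confirmed as positive. A minimal sketch of that calculation, assuming review results are available as (predicted, confirmed) pairs; the data shape is hypothetical:

```python
from typing import List, Tuple


def precision(reviews: List[Tuple[bool, bool]]) -> float:
    """Compute precision from (classifier_flagged, reviewer_confirmed) pairs."""
    true_positives = sum(1 for flagged, confirmed in reviews if flagged and confirmed)
    false_positives = sum(1 for flagged, confirmed in reviews if flagged and not confirmed)
    flagged_total = true_positives + false_positives
    if flagged_total == 0:
        return 0.0  # nothing flagged, so precision is undefined; report 0
    return true_positives / flagged_total
```

For a 200-document test set where the classifier flags 80 documents and reviewers confirm 60 of them, this gives 60 / 80 = 0.75, which clears the 70% bar.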
Classifier Limitations
- Trainable classifiers operate on text content only. Scanned PDFs, image-only files, and CAD drawings require OCR preprocessing.
- The built-in OCR in Purview handles standard scanned documents but fails on engineering CAD title blocks, which use non-standard fonts and drawing frames.
OCR Limitation: Engineering CAD Drawings
Standard Purview OCR cannot reliably extract text from CAD title blocks because:
- Title block fonts (e.g., ISOCT, Romans) are not in OCR training sets
- Drawing frames and borders confuse layout analysis
- Multi-layer PDFs exported from AutoCAD or SolidWorks contain vector text that OCR treats as graphics
Solution: Azure AI Document Intelligence
For organizations with large engineering document repositories, a Logic App pipeline can fill this gap:
SharePoint (new PDF) → Logic App trigger
→ Azure AI Document Intelligence (custom model)
→ extracted title block fields (drawing number, revision, classification)
→ Purview REST API
→ Apply sensitivity label based on classification field value
Implementation steps:
- Train an Azure AI Document Intelligence custom extraction model on 50+ title block samples using Document Intelligence Studio.
- Map the extracted classification field to label GUIDs using a Logic App variable or Azure Key Vault secret.
- Call the Purview labeling REST API to apply the label to the file in SharePoint.
- Log the label action to a Log Analytics workspace for audit trail (satisfies NIST SP 800-171 Rev. 2 requirement 3.3.1).
This architecture is separate from Purview's native auto-labeling pipeline and requires an Azure AI Services resource in the same tenant region.
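The mapping and logging steps above can be prototyped as plain functions before they are built into a Logic App. The sketch below is illustrative only: the label GUIDs are placeholders, the field values and record keys are hypothetical, and no Purview or Log Analytics call is made.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Placeholder GUIDs; real values would come from the tenant's label configuration.
LABEL_IDS: Dict[str, str] = {
    "CUI": "00000000-0000-0000-0000-000000000001",
    "CONFIDENTIAL": "00000000-0000-0000-0000-000000000002",
}


def resolve_label_id(classification: Optional[str]) -> Optional[str]:
    """Map an extracted title block classification value to a label GUID."""
    if not classification:
        return None
    return LABEL_IDS.get(classification.strip().upper())


def audit_record(file_url: str, classification: Optional[str],
                 label_id: Optional[str]) -> Dict[str, Optional[str]]:
    """Build the audit entry that would be sent to the logging workspace."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": file_url,
        "extractedClassification": classification,
        "labelId": label_id,
        "action": "label_applied" if label_id else "no_label_mapped",
    }
```

Recording the unmapped case explicitly, rather than skipping it, makes it easier to spot title blocks whose classification text the extraction model reads inconsistently.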