Sensitive Information Types

Sensitive Information Types (SITs) are the detection engine that drives auto-labeling, DLP policies, and Insider Risk Management. Every policy in this chapter references one or more SIT groups. Getting SIT configuration right before deploying any label or policy prevents both false positives (over-blocking) and false negatives (missed detections).

How SITs Work

Each SIT evaluates content using a combination of pattern matching (regex), keyword proximity, and checksum validation. The result is a confidence level:

Confidence Level	Meaning	Typical Use
High (85–100)	Pattern + corroborating keywords found	Block or auto-label
Medium (75–84)	Pattern found, fewer keywords	Alert or policy tip
Low (65–74)	Pattern only, no corroboration	Audit / Discovery only

Policies should reference the confidence level explicitly. A DLP rule that fires at Low confidence on Credit Card will generate excessive false positives in any organization that sends invoices by email.
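The evaluation model in this section can be sketched in a few lines of Python. This is an illustrative simplification, not Purview's actual engine: the SSN-style regex, the keyword list, the 300-character window, and the rule that one nearby keyword yields Medium while two or more yield High are all assumptions chosen to mirror the confidence table. Checksum validation is omitted for brevity.

```python
import re

# Assumed pattern and keywords, for illustration only; real SIT definitions differ.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = ("social security", "ssn", "ss#")
WINDOW = 300  # characters searched on each side of a match (assumed)


def score_matches(text: str) -> list:
    """Return (matched_text, confidence) pairs for every pattern hit."""
    lowered = text.lower()
    results = []
    for m in SSN_PATTERN.finditer(text):
        nearby = lowered[max(0, m.start() - WINDOW): m.end() + WINDOW]
        hits = sum(1 for kw in KEYWORDS if kw in nearby)
        if hits >= 2:
            confidence = 85  # High: pattern plus corroborating keywords
        elif hits == 1:
            confidence = 75  # Medium: pattern plus a single keyword
        else:
            confidence = 65  # Low: pattern only
        results.append((m.group(), confidence))
    return results
```

A policy rule would then compare the returned confidence against its configured minimum before acting.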

Core SIT Groups

Financial SITs

Used in DLP Finance alert policies and auto-labeling of Confidential content.

Sensitive Info Type	Detection Method	Notes
Credit Card Number	Luhn checksum + keyword proximity	See false positive mitigation below
U.S. Bank Account Number	ABA routing + account pattern
SWIFT Code	ISO 9362 pattern
International Banking Account Number (IBAN)	Country-specific regex + checksum
U.S. Individual Taxpayer Identification Number (ITIN)	SSN-format + ITIN keyword

PII SITs

Used in DLP PII alert policies and auto-labeling.

Sensitive Info Type	Detection Method	Notes
U.S. Social Security Number (SSN)	Pattern + keyword	Avoid Low confidence in email
All Full Names	NER model	High false-positive rate in isolation; pair with other SITs
U.S. Driver's License Number	State-specific patterns	50 distinct patterns
U.S. Individual Taxpayer Identification Number (ITIN)	Pattern + keyword	Overlaps with SSN pattern
U.S. Physical Addresses	NER model	Useful for HR/benefits data detection

Credential SITs

Used in the highest-priority DLP alert policies (AD Credential Protection).

Sensitive Info Type	Detection Method	Notes
General Password	Pattern + keyword proximity	Fires on "password:" followed by a value
Azure Active Directory Client Secret	Pattern	Rotate immediately on detection
Azure SAS Token	Pattern
Software Development Credentials	Pattern bundle	npm tokens, API keys, connection strings

The Credit Card False Positive Problem

Credit Card Number is the most common source of false positives. Purchase orders, invoices, and financial reports often contain 16-digit numbers that pass the Luhn checksum by chance (one in ten arbitrary digit strings does) but are not card numbers. Two mitigations help:
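To see why the checksum alone is a weak filter, the following sketch implements the standard Luhn mod-10 check and counts how many of 1,000 consecutive 16-digit numbers pass it. The starting number is arbitrary and stands in for a run of sequential invoice or order numbers.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn mod-10 check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


# Count how many of 1,000 consecutive 16-digit numbers pass the check.
start = 1000000000000000
passing = sum(luhn_valid(str(n)) for n in range(start, start + 1000))
print(passing)  # prints 100
```

Exactly one value of the final digit satisfies the check for any fixed 15-digit prefix, so one in ten sequential numbers will look like a valid card number to a checksum-only rule.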

1. Keyword Exclusion Dictionary

Create a custom keyword dictionary with terms that indicate a billing context rather than a card number:

Invoice #
PO Number
Purchase Order
Account Number
Reference Number
Order ID
Quote Number

Configure the SIT to not match when these terms appear within 300 characters of the pattern. In the Purview portal: SITs → Credit Card Number → Edit → Add exclusion keywords.
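The exclusion logic can be approximated as follows. This is a minimal sketch, not the Purview implementation: the 16-digit regex is a simplified stand-in for the built-in card pattern, and case-insensitive substring matching is an assumption about how the dictionary terms are compared.

```python
import re

# Terms from the exclusion dictionary, lowercased for comparison.
EXCLUSION_TERMS = (
    "invoice #", "po number", "purchase order", "account number",
    "reference number", "order id", "quote number",
)
CARD_LIKE = re.compile(r"\b\d{16}\b")  # simplified stand-in for the card pattern
WINDOW = 300  # proximity window in characters


def card_candidates(text: str) -> list:
    """Return 16-digit matches with no exclusion term within WINDOW characters."""
    lowered = text.lower()
    kept = []
    for m in CARD_LIKE.finditer(text):
        nearby = lowered[max(0, m.start() - WINDOW): m.end() + WINDOW]
        if not any(term in nearby for term in EXCLUSION_TERMS):
            kept.append(m.group())
    return kept
```

Running a sample of real invoices and real card-bearing messages through a check like this before changing the SIT gives a rough estimate of how many matches the dictionary would suppress.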

2. Confidence Level Gating

Do not trigger block actions at Low or Medium confidence for Credit Card. Use the following tiered approach:

Confidence	Action
High (85+)	Alert + require justification
Medium (75–84)	Policy tip only
Low (65–74)	Audit log only, no user notification
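Expressed as code, the tiering is a simple threshold ladder. The action names below are hypothetical labels for this sketch, not Purview identifiers.

```python
def credit_card_action(confidence: int) -> str:
    """Map a match confidence score to the tiered action for Credit Card."""
    if confidence >= 85:
        return "alert_and_require_justification"  # High
    if confidence >= 75:
        return "policy_tip"                       # Medium
    if confidence >= 65:
        return "audit_only"                       # Low
    return "no_action"                            # below the lowest band
```

Note that no tier returns a block action, which is the point of the gating.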

Environment-Specific SITs

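This section covers two environments: GCC High (CMMC), which uses the CUI-specific types below, and Commercial, which uses the regulatory groupings that follow.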

CUI-Specific Sensitive Information Types

CMMC AC.L2-3.1.3 and SC.L2-3.13.1 require identifying and controlling CUI. The following SITs detect content that must be labeled and protected as CUI.

Distribution Statement Keywords

Create a custom keyword SIT named CUI — Distribution Statement with the following keyword groups:

Keyword Group	Terms
Distribution Statement B	`Distribution Statement B`, `Dist Stmt B`, `DISTRIBUTION B`
Distribution Statement C	`Distribution Statement C`, `Dist Stmt C`, `DISTRIBUTION C`
Distribution Statement D	`Distribution Statement D`, `Dist Stmt D`, `DISTRIBUTION D`
Distribution Statement E	`Distribution Statement E`, `Dist Stmt E`, `DISTRIBUTION E`
Distribution Statement F	`Distribution Statement F`, `Dist Stmt F`, `DISTRIBUTION F`

Set the minimum match count to 1 and the confidence to High, because Distribution Statement text is unambiguous.
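Before creating the SIT, the keyword groups can be tested offline against sample documents. The sketch below assumes case-insensitive, whole-phrase matching, which is an approximation of keyword-list behavior rather than a documented guarantee.

```python
import re

LETTERS = "BCDEF"
# Phrase templates taken from the keyword table.
TEMPLATES = ("Distribution Statement {}", "Dist Stmt {}", "DISTRIBUTION {}")

# One compiled pattern per keyword group, with word boundaries so that
# "Distribution B" does not match inside "distribution because".
PATTERNS = {
    f"Distribution Statement {letter}": re.compile(
        "|".join(r"\b" + re.escape(t.format(letter)) + r"\b" for t in TEMPLATES),
        re.IGNORECASE,
    )
    for letter in LETTERS
}


def matched_groups(text: str, min_count: int = 1) -> list:
    """Return the keyword groups with at least min_count hits in text."""
    return [
        group for group, pattern in PATTERNS.items()
        if len(pattern.findall(text)) >= min_count
    ]
```

The `min_count` parameter mirrors the minimum match count setting; leaving it at 1 means a single marking is enough to flag the document.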

ITAR / EAR Export Control Keywords

Create a custom keyword SIT named CUI — Export Control:

Keyword Group	Terms
ITAR	`ITAR`, `International Traffic in Arms`, `22 CFR 120`, `USML`, `Defense Article`
EAR	`EAR`, `Export Administration Regulations`, `15 CFR 730`, `ECCN`, `Commerce Control List`
CUI Export Control	`CUI//EXPT`, `CUI//SP-EXPT`

DoD Contract Number Pattern

Create a custom regex SIT named CUI — DoD Contract Number:

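\bW[A-Z\d]{5}-?\d{2}-?[A-Z]-?[A-Z\d]{4}\b

This matches DoDAAC-based contract numbers in the format W81XWH-22-C-0001 (a six-character DoDAAC beginning with W, a two-digit fiscal year, an instrument-type letter, and a four-character serial), with or without hyphens. Pair with keyword proximity: Contract, Award, Task Order.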
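A pattern for the W81XWH-22-C-0001 format is worth checking against sample strings before it goes into a SIT. The sketch below uses a hyphen-tolerant regex that reflects this sketch's own assumption about the number layout (six-character DoDAAC starting with W, two-digit year, one letter, four-character serial). Python's `re` module stands in for the Purview regex engine, which may differ in details.

```python
import re

# Assumed layout: DoDAAC (6) - fiscal year (2) - instrument type (1) - serial (4).
CONTRACT_NUMBER = re.compile(r"\bW[A-Z\d]{5}-?\d{2}-?[A-Z]-?[A-Z\d]{4}\b")


def find_contract_numbers(text: str) -> list:
    """Return every substring that looks like a W-prefixed contract number."""
    return CONTRACT_NUMBER.findall(text)


samples = [
    "Award W81XWH-22-C-0001 was issued in March.",
    "Task Order under W81XWH22C0001",
    "Tracking W81-22-C-0001",       # DoDAAC too short, should not match
    "Invoice 2022000112345678",     # plain digits, should not match
]
for s in samples:
    print(find_contract_numbers(s))
```

Keeping a small list of known-good and known-bad strings like this makes it easy to re-test the pattern whenever it is adjusted.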

CUI Designation Indicator

Built-in SIT: U.S. Defense Contract Number. Supplement it with a custom SIT that matches the CUI banner line format:

CUI//(SP-[\w]+)?

This catches the formal CUI designation block required by NARA 32 CFR Part 2002.
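The banner regex can be exercised in Python to see exactly what it captures. This is only a behavioral check of the expression as written; the sample banner strings are illustrative.

```python
import re

CUI_BANNER = re.compile(r"CUI//(SP-[\w]+)?")


def parse_banner(line: str):
    """Return (full_match, specified_category) or None if no banner is found."""
    m = CUI_BANNER.search(line)
    if m is None:
        return None
    return m.group(0), m.group(1)


print(parse_banner("CUI//SP-CTI//NOFORN"))  # ('CUI//SP-CTI', 'SP-CTI')
print(parse_banner("CUI//EXPT"))            # ('CUI//', None)
print(parse_banner("CUI"))                  # None
```

Two behaviors are worth noting: a basic category such as `CUI//EXPT` matches only on the `CUI//` prefix, and a bare `CUI` banner with no slashes does not match at all, so documents marked that way need a separate keyword.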

All four CUI SITs should be bundled into a custom SIT group named CUI Bundle for use in a single auto-labeling policy rule.

Regulatory SIT Groupings

For commercial organizations, SITs map to regulatory frameworks. Group them as named SIT collections for use in DLP policies and auto-labeling.

GLBA Financial SIT Group

SIT	Regulation
Credit Card Number	PCI-DSS
U.S. Bank Account Number	GLBA
SWIFT Code	GLBA / wire fraud
IBAN	GLBA / international
U.S. Individual Taxpayer Identification Number	GLBA

HIPAA / HITECH SIT Group

Use the built-in U.S. Health Insurance Act (HIPAA) template, whose SIT bundle includes:

U.S. Social Security Number (SSN)
Drug Enforcement Agency (DEA) Number
National Drug Code
All Full Names (paired with medical terms)

FERPA SIT Group (Higher Education)

FERPA does not have a prescribed SIT set. Build a custom keyword SIT named Student PII with terms:

Student ID
EMPL ID
Grade Report
Transcript
Enrollment Status
Financial Aid
FAFSA

Pair with All Full Names or SSN at High confidence to reduce false positives from general HR content.

PCI-DSS SIT Group

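For retail or payment processors, bundle Credit Card Number, CVV, and U.S. Bank Account Number. CVV is a custom regex, \b\d{3,4}\b, that must be paired with the keywords CVV, Security Code, or CVC, because the pattern alone matches any three- or four-digit number.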

Trainable Classifiers

Trainable classifiers use machine learning rather than pattern matching. They are appropriate when content cannot be identified by a regex or keyword list, such as recognizing "intellectual property" from context.

Pre-Trained Classifiers

Microsoft provides several pre-trained classifiers ready for use in policies without additional training:

Classifier	Use Case
Source Code	Detects code files across 25+ languages
Resumes	Detects HR candidate data
Financial Statements	Balance sheets, P&L statements
HR — Harassment	Policy violation detection
Threat	Threatening language
Profanity	Policy enforcement

Custom Trainable Classifiers

When pre-trained classifiers do not match your content, create a custom classifier:

1. Seed phase: Upload 50–200 representative positive documents to a SharePoint library designated as the seed site. Documents should be real examples of the content category (e.g., NDAs, engineering change orders, board resolutions).
2. Test phase: Purview presents 200 test documents; reviewers mark each as positive or negative. Iterate until precision exceeds 70%.
3. Publish: The classifier becomes available in auto-labeling and DLP conditions.

Custom classifiers take 24–48 hours to initially train. Retrain quarterly or when precision drops below acceptable thresholds.
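The precision threshold used in the test phase follows the standard definition: of the documents the classifier flagged as positive, the fraction that reviewers confirmed. A short sketch, with made-up review counts:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of classifier-flagged documents that reviewers confirmed."""
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # nothing flagged, so there is nothing to be precise about
    return true_positives / flagged


# Example review round: 120 documents flagged, 90 confirmed by reviewers.
score = precision(true_positives=90, false_positives=30)
print(score, score > 0.70)  # prints 0.75 True
```

Tracking this number per review round makes it clear whether another iteration is improving the classifier or has stalled.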

Classifier Limitations

Trainable classifiers operate on text content only. Scanned PDFs, image-only files, and CAD drawings require OCR preprocessing.

OCR Limitation: Engineering CAD Drawings

Standard Purview OCR cannot reliably extract text from CAD title blocks because:

Title block fonts (e.g., ISOCT, Romans) are not in OCR training sets
Drawing frames and borders confuse layout analysis
Multi-layer PDFs exported from AutoCAD or SolidWorks contain vector text that OCR treats as graphics

Solution: Azure AI Document Intelligence

For organizations with large engineering document repositories, a Logic App pipeline can fill this gap:

SharePoint (new PDF) → Logic App trigger
  → Azure AI Document Intelligence (custom model)
      → extracted title block fields (drawing number, revision, classification)
  → Purview REST API
      → Apply sensitivity label based on classification field value

Implementation steps:

1. Train an Azure AI Document Intelligence custom extraction model on 50+ title block samples using Document Intelligence Studio.
2. Map the extracted classification field to label GUIDs using a Logic App variable or Azure Key Vault secret.
3. Call the Purview labeling REST API to apply the label to the file in SharePoint.
4. Log the label action to a Log Analytics workspace for audit trail (satisfies CMMC AU.L2-3.3.1).

This architecture is separate from Purview's native auto-labeling pipeline and requires an Azure AI Services resource in the same tenant region.
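The mapping and logging steps are plain data handling and can be prototyped outside the Logic App. The sketch below is hypothetical throughout: the field names (`classification`, `drawing_number`), the classification values, and the placeholder GUIDs are assumptions, and it makes no calls to Document Intelligence or any labeling API.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical mapping; replace the placeholder GUIDs with your tenant's label IDs.
LABEL_GUIDS = {
    "CUI": "00000000-0000-0000-0000-000000000001",
    "PROPRIETARY": "00000000-0000-0000-0000-000000000002",
}


def resolve_label(fields: dict) -> Optional[str]:
    """Map the extracted title block classification to a label GUID, if known."""
    value = str(fields.get("classification", "")).strip().upper()
    return LABEL_GUIDS.get(value)


def audit_record(file_url: str, fields: dict) -> dict:
    """Build the log entry describing the labeling decision for one file."""
    label_id = resolve_label(fields)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": file_url,
        "drawing_number": fields.get("drawing_number"),
        "classification": fields.get("classification"),
        "label_id": label_id,
        "decision": "apply_label" if label_id else "no_label_mapped",
    }
```

Logging the `no_label_mapped` case as well as successful matches gives reviewers a list of title blocks whose classification text the mapping does not yet cover.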
