
Sensitive Information Types

Sensitive Information Types (SITs) are the detection engine that drives auto-labeling, DLP policies, and Insider Risk Management. Every policy in this chapter references one or more SIT groups. Getting SIT configuration right before deploying any label or policy prevents both false positives (over-blocking) and false negatives (missed detections).

How SITs Work

Each SIT evaluates content using a combination of pattern matching (regex), keyword proximity, and checksum validation. The result is a confidence level:

| Confidence Level | Meaning | Typical Use |
|---|---|---|
| High (85–100) | Pattern + corroborating keywords found | Block or auto-label |
| Medium (75–84) | Pattern found, fewer keywords | Alert or policy tip |
| Low (65–74) | Pattern only, no corroboration | Audit / Discovery only |
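
The interplay of pattern matching and keyword proximity can be sketched as a toy scorer. This is illustrative only (Purview's actual scoring engine is proprietary); the tiers mirror the table above, and the SSN pattern and keyword list are simplified assumptions:

```python
import re

def score_ssn(text: str, window: int = 300):
    """Toy confidence scorer: a regex match plus corroborating keywords
    within a proximity window. Tier thresholds mirror the table above;
    this is NOT Purview's actual algorithm."""
    pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    keywords = ("ssn", "social security", "taxpayer")
    m = pattern.search(text)
    if not m:
        return None  # no pattern, no detection at any confidence
    context = text[max(0, m.start() - window): m.end() + window].lower()
    hits = sum(1 for k in keywords if k in context)
    if hits >= 2:
        return "High"
    if hits == 1:
        return "Medium"
    return "Low"
```

A bare nine-digit pattern with no nearby keywords only ever reaches Low, which is why Low-confidence matches belong in audit-only policies.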

Policies should reference the confidence level explicitly. A DLP rule that fires at Low confidence on Credit Card will generate excessive false positives in any organization that sends invoices by email.

Core SIT Groups

Financial SITs

Used in DLP Finance alert policies and auto-labeling of Confidential content.

| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| Credit Card Number | Luhn checksum + keyword proximity | See false positive mitigation below |
| U.S. Bank Account Number | ABA routing + account pattern | |
| SWIFT Code | ISO 9362 pattern | |
| International Banking Account Number (IBAN) | Country-specific regex + checksum | |
| U.S. Individual Taxpayer Identification Number (ITIN) | SSN-format + ITIN keyword | |

PII SITs

Used in DLP PII alert policies and auto-labeling.

| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| U.S. Social Security Number (SSN) | Pattern + keyword | Avoid Low confidence in email |
| All Full Names | NER model | High false-positive rate in isolation; pair with other SITs |
| U.S. Driver's License Number | State-specific patterns | 50 distinct patterns |
| U.S. Individual Taxpayer Identification Number (ITIN) | Pattern + keyword | Overlaps with SSN pattern |
| U.S. Physical Addresses | NER model | Useful for HR/benefits data detection |

Credential SITs

Used in the highest-priority DLP alert policies (AD Credential Protection).

| Sensitive Info Type | Detection Method | Notes |
|---|---|---|
| General Password | Pattern + keyword proximity | Fires on "password:" followed by a value |
| Azure Active Directory Client Secret | Pattern | Rotate immediately on detection |
| Azure SAS Token | Pattern | |
| Software Development Credentials | Pattern bundle | npm tokens, API keys, connection strings |

The Credit Card False Positive Problem

Credit Card Number is the most common source of false positives. Purchase orders, invoices, and financial reports contain 16-digit numbers that match the Luhn checksum but are not card numbers. Two mitigations:

1. Keyword Exclusion Dictionary

Create a custom keyword dictionary with terms that indicate a billing context rather than a card number:

Invoice #
PO Number
Purchase Order
Account Number
Reference Number
Order ID
Quote Number

Configure the SIT to not match when these terms appear within 300 characters of the pattern. In the Purview portal: SITs → Credit Card Number → Edit → Add exclusion keywords.
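
Both halves of this mitigation can be prototyped in a few lines: a Luhn check shows why 16-digit invoice numbers slip through, and a proximity scan suppresses matches near billing-context terms. A sketch (the 300-character window and term list mirror the dictionary above; this is an illustration, not Purview's matcher):

```python
import re

# Billing-context terms from the exclusion dictionary above
EXCLUSION_TERMS = ("invoice #", "po number", "purchase order",
                   "account number", "reference number",
                   "order id", "quote number")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum; many invoice/PO numbers happen to pass it too."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def card_matches(text: str, window: int = 300):
    """Return Luhn-valid 16-digit runs that are NOT within `window`
    characters of a billing-context exclusion term."""
    out = []
    for m in re.finditer(r"\b\d{16}\b", text):
        if not luhn_valid(m.group()):
            continue
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if any(term in context for term in EXCLUSION_TERMS):
            continue  # billing context found → suppress the match
        out.append(m.group())
    return out
```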

2. Confidence Level Gating

Do not trigger block actions at Low or Medium confidence for Credit Card. Use the following tiered approach:

| Confidence | Action |
|---|---|
| High (85+) | Alert + require justification |
| Medium (75–84) | Policy tip only |
| Low (65–74) | Audit log only, no user notification |
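
Expressed as rule logic, the tiering is just a threshold ladder. A minimal sketch (action names are illustrative placeholders, not Purview rule identifiers):

```python
def card_action(confidence: int) -> str:
    """Map a Credit Card match confidence score to the tiered action
    above. Thresholds follow the table; names are illustrative."""
    if confidence >= 85:
        return "alert_with_justification"
    if confidence >= 75:
        return "policy_tip"
    if confidence >= 65:
        return "audit_only"
    return "no_match"
```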

Environment-Specific SITs

CUI-Specific Sensitive Information Types

CMMC Level 2 (NIST SP 800-171 Rev. 3 3.1.3, 3.13.1) requires identifying and controlling CUI. The following SITs detect content that must be labeled and protected as CUI.

Distribution Statement Keywords

Create a custom keyword SIT named CUI — Distribution Statement with the following keyword groups:

| Keyword Group | Terms |
|---|---|
| Distribution Statement B | Distribution Statement B, Dist Stmt B, DISTRIBUTION B |
| Distribution Statement C | Distribution Statement C, Dist Stmt C, DISTRIBUTION C |
| Distribution Statement D | Distribution Statement D, Dist Stmt D, DISTRIBUTION D |
| Distribution Statement E | Distribution Statement E, Dist Stmt E, DISTRIBUTION E |
| Distribution Statement F | Distribution Statement F, Dist Stmt F, DISTRIBUTION F |

Set minimum match count to 1 and confidence to High — Distribution Statement text is unambiguous.
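
Before creating the SIT, the keyword groups can be sanity-checked against sample documents with a case-insensitive scan. This sketch approximates the behavior (Purview's keyword matching additionally handles word boundaries and tokenization):

```python
# Keyword groups from the table above, lowercased for matching
DIST_STATEMENT_GROUPS = {
    "B": ("distribution statement b", "dist stmt b", "distribution b"),
    "C": ("distribution statement c", "dist stmt c", "distribution c"),
    "D": ("distribution statement d", "dist stmt d", "distribution d"),
    "E": ("distribution statement e", "dist stmt e", "distribution e"),
    "F": ("distribution statement f", "dist stmt f", "distribution f"),
}

def match_distribution_statements(text: str):
    """Return the distribution statement groups whose terms appear in
    `text` (minimum match count 1, case-insensitive)."""
    lowered = text.lower()
    return sorted(group for group, terms in DIST_STATEMENT_GROUPS.items()
                  if any(term in lowered for term in terms))
```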

ITAR / EAR Export Control Keywords

Create a custom keyword SIT named CUI — Export Control:

| Keyword Group | Terms |
|---|---|
| ITAR | ITAR, International Traffic in Arms, 22 CFR 120, USML, Defense Article |
| EAR | EAR, Export Administration Regulations, 15 CFR 730, ECCN, Commerce Control List |
| CUI Export Control | CUI//EXPT, CUI//SP-EXPT |

DoD Contract Number Pattern

Create a custom regex SIT named CUI — DoD Contract Number:

\bW[A-Z0-9]{5}-?\d{2}-?[A-Z]-?\d{4}\b

This matches DODAAC-based contract numbers in the format W81XWH-22-C-0001. Pair with keyword proximity: Contract, Award, Task Order.
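
The pattern can be exercised with Python's re module before deploying it as a SIT. The variant below is an assumption about the local numbering format: a 6-character DODAAC starting with W (letters and digits allowed, since W81XWH contains letters), 2-digit fiscal year, instrument type letter, and 4-digit serial, with optional hyphens:

```python
import re

# Hyphen-tolerant contract-number pattern (assumed format: W + 5
# alphanumeric DODAAC chars, 2-digit FY, instrument type, 4-digit serial)
CONTRACT_RE = re.compile(r"\bW[A-Z0-9]{5}-?\d{2}-?[A-Z]-?\d{4}\b")

def find_contract_numbers(text: str):
    """Return candidate DoD contract numbers found in `text`."""
    return CONTRACT_RE.findall(text)
```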

CUI Designation Indicator

Built-in SIT: U.S. Defense Contract Number — supplement with a custom SIT that matches the CUI banner line format:

\bCUI//(SP-)?[A-Z]+

This catches the formal CUI designation block required by NARA 32 CFR Part 2002.
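
A quick check of the banner pattern in Python. The variant used here requires a category code after the slashes (an assumption on my part: a bare "CUI//" with nothing following is not a valid banner line), so it catches both basic (CUI//EXPT) and specified (CUI//SP-EXPT) markings:

```python
import re

# Banner-line pattern variant: "CUI//" must be followed by a category
# code, optionally prefixed with "SP-" for specified CUI.
BANNER_RE = re.compile(r"\bCUI//(SP-)?[A-Z]+")

def has_cui_banner(text: str) -> bool:
    """True if `text` contains a CUI category banner marking."""
    return BANNER_RE.search(text) is not None
```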

All four CUI SITs should be bundled into a custom SIT group named CUI Bundle for use in a single auto-labeling policy rule.

Trainable Classifiers

Trainable classifiers use machine learning rather than pattern matching. They are appropriate when content cannot be identified by a regex or keyword list — for example, recognizing "intellectual property" from context.

Pre-Trained Classifiers

Microsoft provides several pre-trained classifiers ready for use in policies without additional training:

| Classifier | Use Case |
|---|---|
| Source Code | Detects code files across 25+ languages |
| Resumes | Detects HR candidate data |
| Financial Statements | Balance sheets, P&L statements |
| HR — Harassment | Policy violation detection |
| Threat | Threatening language |
| Profanity | Policy enforcement |

Custom Trainable Classifiers

When pre-trained classifiers do not match your content, create a custom classifier:

  1. Seed phase — Upload 50–200 representative positive documents to a SharePoint library designated as the seed site. Documents should be real examples of the content category (e.g., NDAs, engineering change orders, board resolutions).
  2. Test phase — Purview presents 200 test documents; reviewers mark each as positive or negative. Iterate until precision exceeds 70%.
  3. Publish — The classifier becomes available in auto-labeling and DLP conditions.

Custom classifiers take 24–48 hours to initially train. Retrain quarterly or when precision drops below acceptable thresholds.
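
The precision gate in the test phase is simply true positives over everything the classifier flagged. A small helper for tallying reviewer verdicts during iteration (illustrative; Purview computes this for you in the portal):

```python
def precision(verdicts):
    """`verdicts` is a list of (classifier_said_positive,
    reviewer_confirmed_positive) booleans, one per test document.
    Precision = TP / (TP + FP); returns 0.0 if nothing was flagged."""
    tp = sum(1 for pred, truth in verdicts if pred and truth)
    fp = sum(1 for pred, truth in verdicts if pred and not truth)
    return tp / (tp + fp) if (tp + fp) else 0.0
```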

Classifier Limitations

  • Trainable classifiers operate on text content only. Scanned PDFs, image-only files, and CAD drawings require OCR preprocessing.
  • The built-in OCR in Purview handles standard scanned documents but fails on engineering CAD title blocks, which use non-standard fonts and drawing frames.

OCR Limitation: Engineering CAD Drawings

Standard Purview OCR cannot reliably extract text from CAD title blocks because:

  • Title block fonts (e.g., ISOCT, Romans) are not in OCR training sets
  • Drawing frames and borders confuse layout analysis
  • Multi-layer PDFs exported from AutoCAD or SolidWorks contain vector text that OCR treats as graphics

Solution: Azure AI Document Intelligence

For organizations with large engineering document repositories, a Logic App pipeline can fill this gap:

SharePoint (new PDF) → Logic App trigger
→ Azure AI Document Intelligence (custom model)
→ extracted title block fields (drawing number, revision, classification)
→ Purview REST API
→ Apply sensitivity label based on classification field value

Implementation steps:

  1. Train an Azure AI Document Intelligence custom extraction model on 50+ title block samples using Document Intelligence Studio.
  2. Map the extracted classification field to label GUIDs using a Logic App variable or Azure Key Vault secret.
  3. Call the Purview labeling REST API to apply the label to the file in SharePoint.
  4. Log the label action to a Log Analytics workspace for audit trail (satisfies NIST SP 800-171 Rev. 3 3.3.1).

This architecture is separate from Purview's native auto-labeling pipeline and requires an Azure AI Services resource in the same tenant region.
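
Step 2 of the pipeline, mapping the extracted classification field to a sensitivity label, can be isolated as a pure function inside the Logic App or a called Azure Function. The field values and GUIDs below are hypothetical placeholders; in practice the map would be loaded from your tenant's real label GUIDs, ideally via Key Vault as noted above:

```python
# Hypothetical mapping from OCR-extracted title-block classification
# values to sensitivity label GUIDs (placeholders, not real labels).
LABEL_MAP = {
    "CUI": "00000000-0000-0000-0000-000000000001",
    "CUI//SP-EXPT": "00000000-0000-0000-0000-000000000002",
    "UNCLASSIFIED": "00000000-0000-0000-0000-000000000003",
}

def label_guid_for(classification_field: str):
    """Normalize the extracted classification value and resolve it to a
    label GUID. None means: leave unlabeled and route for human review."""
    key = classification_field.strip().upper()
    return LABEL_MAP.get(key)
```

Returning None for unrecognized values keeps the pipeline fail-safe: an unreadable title block results in a review queue entry rather than a silently wrong label.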
