Information Protection Scanner
Cloud-native auto-labeling covers Exchange, SharePoint Online, and OneDrive. Content stored on on-premises file shares, NAS devices, and local SharePoint Server requires the Microsoft Purview Information Protection Scanner — a Windows service that connects to your on-premises repositories and applies labels using the same SIT-based rules configured in the Purview portal.
Architecture
The scanner consists of two components:
| Component | Role |
|---|---|
| Scanner node | Windows Server running the AIP Unified Labeling client; connects to file shares via SMB or SharePoint Server via HTTPS |
| SQL Server database | Stores scanner job configuration, scan results, and per-file label status |
The scanner node authenticates to the Purview service using an Entra service principal (app registration). It reads files, evaluates SIT conditions, and either reports matches (discovery mode) or writes labels to the files (enforcement mode).
On-premises network:
File share (\\server\share) ──SMB──▶ Scanner node (Windows Server)
SharePoint Server ──HTTPS──▶ Scanner node
Scanner node ──HTTPS──▶ Purview compliance portal (label policies)
Scanner node ──SQL──▶ SQL Server (scan results database)
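Before installing anything, it can be useful to confirm that the scanner node can open each connection shown in the diagram above. The following is a minimal sketch, not part of the scanner tooling: `Test-TcpPort` is a hypothetical helper built on .NET's `TcpClient`, the host names are placeholders, and port 1433 assumes a default SQL Server instance (a named instance such as `SQL01\SCANNER` often listens on a different port). GCC High tenants sign in through `login.microsoftonline.us` rather than `login.microsoftonline.com`.

```powershell
# Hypothetical helper: returns $true if a TCP connection opens within the timeout.
function Test-TcpPort {
    param(
        [Parameter(Mandatory)][string]$Server,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 3000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $task = $client.ConnectAsync($Server, $Port)
        # Wait() returns $false on timeout and throws if the connection fails.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    } catch {
        return $false   # DNS failure, refused connection, or unreachable host
    } finally {
        $client.Close()
    }
}

# Placeholder endpoints: replace with the hosts in your environment.
$endpoints = @(
    @{ Name = 'File share (SMB)';          Server = 'fileserver01';              Port = 445  },
    @{ Name = 'SharePoint Server (HTTPS)'; Server = 'sharepoint01';              Port = 443  },
    @{ Name = 'SQL Server (default port)'; Server = 'SQL01';                     Port = 1433 },
    @{ Name = 'Entra sign-in (HTTPS)';     Server = 'login.microsoftonline.com'; Port = 443  }
)

foreach ($ep in $endpoints) {
    [pscustomobject]@{
        Endpoint  = $ep.Name
        Target    = '{0}:{1}' -f $ep.Server, $ep.Port
        Reachable = Test-TcpPort -Server $ep.Server -Port $ep.Port
    }
}
```

On Windows, the built-in `Test-NetConnection -ComputerName <host> -Port <port>` gives the same answer interactively; the helper above is only there so the check can be scripted with an explicit timeout.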
Sizing the Scanner Node
Scanner throughput depends on CPU, RAM, and network bandwidth to the file share. Use these guidelines:
| File Repository Size | Scanner Node Sizing | SQL Server |
|---|---|---|
| < 1 million files | 4 vCPU, 8 GB RAM | SQL Express (free, 10 GB limit) |
| 1–10 million files | 8 vCPU, 16 GB RAM | SQL Server Standard or Developer |
| > 10 million files | 16 vCPU, 32 GB RAM; consider multiple scanner nodes | SQL Server Standard with dedicated instance |
SQL Express limitations: The free SQL Express edition has a 10 GB database size limit. In large environments the database can fill up before the scan completes, halting the job. For any repository exceeding ~2 million files, use SQL Server Standard, or SQL Server Developer for test deployments (Developer is free but licensed for non-production use only).
Network bandwidth: The scanner reads file content across the network. Budget 1 Gbps network connectivity between the scanner node and the file share for production workloads. Scanning over a WAN link to remote offices will be slow — deploy a scanner node at each remote office instead.
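The sizing table can be expressed as a simple lookup, which is handy when you are inventorying many shares. The sketch below assumes a hypothetical function name, `Get-ScannerSizing`, and only encodes the three tiers from the table above; it does not measure throughput. The commented-out file count walks the whole share and can itself take a long time on large repositories.

```powershell
# Hypothetical helper that maps a file count to the sizing tiers in the table above.
function Get-ScannerSizing {
    param([Parameter(Mandatory)][long]$FileCount)

    if ($FileCount -lt 1000000) {
        $node = '4 vCPU, 8 GB RAM'
        $sql  = 'SQL Express'
    } elseif ($FileCount -le 10000000) {
        $node = '8 vCPU, 16 GB RAM'
        $sql  = 'SQL Server Standard or Developer'
    } else {
        $node = '16 vCPU, 32 GB RAM; consider multiple scanner nodes'
        $sql  = 'SQL Server Standard with dedicated instance'
    }

    [pscustomobject]@{ FileCount = $FileCount; ScannerNode = $node; SqlServer = $sql }
}

# Count files on a share first (placeholder path):
# $count = (Get-ChildItem -Path '\\fileserver01\shares\engineering' -Recurse -File -ErrorAction SilentlyContinue | Measure-Object).Count

Get-ScannerSizing -FileCount 2500000
```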
Installation
# On the scanner node: install the AIP Unified Labeling client first
# (download the installer from the Microsoft Download Center),
# then install the scanner service and point it at the SQL instance
Install-AIPScanner -SqlServerInstance "SQL01\SCANNER" -Profile "OnPremScan"
# Authenticate the scanner to Entra (run once)
Set-AIPAuthentication `
-AppId "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" `
-AppSecret "your-client-secret" `
-TenantId "yourtenant.onmicrosoft.com" `
-DelegatedUser "scanneraccount@yourtenant.onmicrosoft.com"
# Add a repository to scan
Add-AIPScannerRepository -Path "\\fileserver01\shares\engineering"
Add-AIPScannerRepository -Path "\\fileserver01\shares\finance"
The DelegatedUser account requires:
- Read access to all repositories being scanned
- An Azure Information Protection scanner license (included in M365 E3/E5 and AIP P1/P2)
- A Purview sensitivity label policy published to the account
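To confirm the read-access requirement before the first scan, you can check that each repository path is listable from the scanner node. This is a minimal sketch: `Test-RepositoryRead` is a hypothetical helper and the paths are the examples used above. Run it in a session started as the scanner account so that the result reflects that account's permissions. It only proves that the top level of the share can be listed, not that every file below it is readable.

```powershell
# Hypothetical helper: $true if the top level of the path can be listed.
function Test-RepositoryRead {
    param([Parameter(Mandatory)][string]$Path)
    try {
        # -ErrorAction Stop turns a missing path or access-denied into a catchable error.
        $null = Get-ChildItem -LiteralPath $Path -ErrorAction Stop
        return $true
    } catch {
        return $false
    }
}

$repositories = @(
    '\\fileserver01\shares\engineering',
    '\\fileserver01\shares\finance'
)

foreach ($repo in $repositories) {
    [pscustomobject]@{
        Repository = $repo
        Readable   = Test-RepositoryRead -Path $repo
    }
}
```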
Discovery Mode vs. Enforcement Mode
The scanner operates in two modes, controlled by the content scan job settings in the Purview portal or via PowerShell.
Discovery Mode (Recommend)
Scans files and generates a report of what was found — which files match which SITs, how many files would be labeled — without writing any labels to files. Use this mode first.
Set-AIPScannerConfiguration -ReportLevel Info -JustificationMessage "" -Schedule OneTime -Enforce Off
Start-AIPScan
When to use: Initial assessment of on-premises data before committing to enforcement. Run discovery for at least one full scan cycle before enabling enforcement.
Output: Report files in %localappdata%\Microsoft\MSIP\Scanner\Reports\ — CSV files listing every file scanned, the matching SITs, the confidence level, and the recommended label.
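A quick way to read a discovery run is to count files per recommended label across the CSV reports. The sketch below uses only built-in cmdlets; `Get-ReportSummary` is a hypothetical helper, and the column name `RecommendedLabel` is an assumption borrowed from the SQL examples in this article. Check the header row of your own report files and pass the actual column name if it differs.

```powershell
# Hypothetical helper: counts rows per label, ignoring rows with no label.
function Get-ReportSummary {
    param(
        [Parameter(Mandatory)][object[]]$Rows,
        [string]$LabelColumn = 'RecommendedLabel'   # assumed column name
    )
    $Rows |
        Where-Object { $_.$LabelColumn } |
        Group-Object -Property $LabelColumn |
        Sort-Object -Property Count -Descending |
        Select-Object @{ Name = 'Label'; Expression = { $_.Name } }, Count
}

# Load every CSV in the report folder, if it exists on this machine.
if ($env:LOCALAPPDATA) {
    $reportDir = Join-Path $env:LOCALAPPDATA 'Microsoft\MSIP\Scanner\Reports'
    if (Test-Path -LiteralPath $reportDir) {
        $rows = Get-ChildItem -LiteralPath $reportDir -Filter '*.csv' |
            ForEach-Object { Import-Csv -LiteralPath $_.FullName }
        if ($rows) { Get-ReportSummary -Rows $rows }
    }
}
```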
Enforcement Mode (Auto-label)
Writes sensitivity labels to files in place. The Office property msip_labels is set in the file's metadata. Files labeled in enforcement mode will be recognized by Office apps and the Purview DLP engine immediately.
Set-AIPScannerConfiguration -ReportLevel Debug -Enforce On
Start-AIPScan
Before enabling enforcement on a full repository, scope the scan to a test folder with representative file types. Verify that Office files are labeled correctly and that non-Office files (PDFs, images) are handled as expected. Some applications may reject files whose metadata has been modified by the scanner.
Files the scanner can label:
| File Type | Label Written to | Notes |
|---|---|---|
| Word, Excel, PowerPoint (.docx, .xlsx, .pptx) | Office property metadata | Full label support |
| PDF (.pdf) | PDF metadata | Requires Acrobat or PDF labeling extension |
| Text files (.txt, .csv, .xml) | File metadata (alternate data stream on NTFS) | No visual marking |
| Images, CAD, executables | Discovery only | Cannot write labels |
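When choosing a test folder, it helps to know how its contents break down across the categories in the table above. The following is a rough sketch under stated assumptions: `Get-LabelSupport` is a hypothetical helper, the image, CAD, and executable extensions are illustrative examples rather than a list taken from the scanner, and the folder path is a placeholder.

```powershell
# Extension-to-category map based on the table above.
# The DiscoveryOnly extensions are illustrative assumptions.
$labelSupport = @{
    '.docx' = 'Office'
    '.xlsx' = 'Office'
    '.pptx' = 'Office'
    '.pdf'  = 'PDF'
    '.txt'  = 'Text'
    '.csv'  = 'Text'
    '.xml'  = 'Text'
    '.jpg'  = 'DiscoveryOnly'
    '.png'  = 'DiscoveryOnly'
    '.dwg'  = 'DiscoveryOnly'
    '.exe'  = 'DiscoveryOnly'
}

function Get-LabelSupport {
    param([Parameter(Mandatory)][string]$Path)
    $ext = [System.IO.Path]::GetExtension($Path).ToLowerInvariant()
    if ($labelSupport.ContainsKey($ext)) { $labelSupport[$ext] } else { 'Unlisted' }
}

# Placeholder test folder: summarize its files by category.
$testFolder = '\\fileserver01\shares\engineering\pilot'
if (Test-Path -LiteralPath $testFolder) {
    Get-ChildItem -LiteralPath $testFolder -Recurse -File |
        Group-Object -Property { Get-LabelSupport -Path $_.Name } |
        Sort-Object -Property Count -Descending |
        Select-Object Name, Count
}
```

A folder dominated by `Unlisted` or `DiscoveryOnly` files is a poor enforcement pilot, since most of its content would never receive a label.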
Extracting Scan Results from SQL
The scanner stores all scan results in the SQL database. The Purview portal does not expose a complete per-file report for on-premises scans — you must query the SQL database directly.
-- Summary: files by recommended label
SELECT
RecommendedLabel,
COUNT(*) AS FileCount,
SUM(CAST(FileSize AS BIGINT)) / 1048576 AS TotalSizeMB -- assumes FileSize is in bytes; BIGINT avoids INT overflow on large repositories
FROM dbo.ScannerFiles
WHERE RecommendedLabel IS NOT NULL
GROUP BY RecommendedLabel
ORDER BY FileCount DESC;
-- Detail: files matching a specific SIT
SELECT
RepositoryPath,
FileName,
MatchedSIT,
ConfidenceLevel,
RecommendedLabel,
CurrentLabel,
LastModifiedDate
FROM dbo.ScannerFiles
WHERE MatchedSIT LIKE '%Social Security%'
OR MatchedSIT LIKE '%Credit Card%'
ORDER BY LastModifiedDate DESC;
-- Files labeled in enforcement mode
SELECT
RepositoryPath,
FileName,
CurrentLabel,
LabelSetDate,
LabelSetBy
FROM dbo.ScannerFiles
WHERE LabelSetDate IS NOT NULL
ORDER BY LabelSetDate DESC;
Export results to CSV for audit evidence (Invoke-Sqlcmd ships with the SqlServer PowerShell module):
Invoke-Sqlcmd -ServerInstance "SQL01\SCANNER" -Database "AIPScanner" -Query @"
SELECT RepositoryPath, FileName, MatchedSIT, ConfidenceLevel, RecommendedLabel, CurrentLabel
FROM dbo.ScannerFiles
WHERE RecommendedLabel IS NOT NULL
"@ | Export-Csv -Path ".\ScannerResults_$(Get-Date -Format 'yyyy-MM-dd').csv" -NoTypeInformation
Environment-Specific Considerations
- GCC High (CMMC)
- Commercial
On-Premises CUI Discovery for CMMC
NIST SP 800-171 Rev. 2 requirement 3.1.22 requires organizations to control CUI posted or processed on publicly accessible systems. More broadly, requirement 3.13.16 (protect the confidentiality of CUI at rest) presupposes knowing where CUI is stored, including on-premises file servers.
The scanner supports these requirements in two ways:
- Discovery scan — generates a per-file report identifying which files contain CUI indicators (Distribution Statement keywords, ITAR/EAR keywords, DoD contract numbers)
- Enforcement scan — applies the CUI — Basic or CUI — Specified label to identified files, making them subject to DLP policies even when accessed via the on-premises network
Air-Gapped Networks
If the scanner node cannot reach the Purview compliance portal (air-gapped or classified networks), configure the scanner in offline mode:
Set-AIPScannerConfiguration -OnlineConfiguration Off
Import-AIPScannerConfiguration -FileName "C:\ScannerConfig\policy.msip"
Export the scanner policy from the portal and deliver it to the air-gapped node via approved transfer media. Labels applied in offline mode are consistent with cloud policy — they will be recognized by any Office app with the same label policy deployed.
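Because the policy file crosses a network boundary on removable media, it is reasonable to verify that the file that arrives is the file that was exported. The sketch below uses the built-in `Get-FileHash` cmdlet. `Test-PolicyFileHash` is a hypothetical helper, and the expected hash is a value you would record on the connected side and carry separately from the media. The import call is commented out because it only runs on a node with the scanner installed.

```powershell
# Hypothetical helper: compares a file's SHA-256 hash with a recorded value.
function Test-PolicyFileHash {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$ExpectedSha256
    )
    $actual = (Get-FileHash -LiteralPath $Path -Algorithm SHA256).Hash
    # -eq on strings is case-insensitive, so upper- or lower-case hex both match.
    return ($actual -eq $ExpectedSha256.Trim())
}

# On the connected side, record the hash of the exported file:
#   (Get-FileHash -LiteralPath 'C:\ScannerConfig\policy.msip' -Algorithm SHA256).Hash
#
# On the air-gapped node, import only if the hash matches:
#   if (Test-PolicyFileHash -Path 'C:\ScannerConfig\policy.msip' -ExpectedSha256 $recordedHash) {
#       Import-AIPScannerConfiguration -FileName 'C:\ScannerConfig\policy.msip'
#   } else {
#       Write-Warning 'Policy file hash mismatch; do not import.'
#   }
```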
CMMC Control Mapping
| NIST Control | Scanner Capability |
|---|---|
| 3.1.22 — CUI on public-facing systems | Discovery mode identifies CUI in file shares before any internet exposure |
| 3.3.1 — Audit records | SQL scan results provide a timestamped record of all files and their label status |
| 3.13.16 — Protect CUI at rest | Full-repository discovery scan with per-file SIT match report |
| 3.4.1 — Baseline configurations | Labeled files are subject to DLP endpoint controls |
On-Premises Discovery for Commercial Organizations
For commercial organizations with on-premises file servers (common in manufacturing, healthcare, and legal), the scanner provides the asset inventory function that Content Explorer provides for cloud content.
Initial Assessment Workflow
- Run discovery mode across all file shares
- Export SQL results to identify:
  - File shares with the highest density of financial / PII SIT matches
  - File shares with no labeled content (highest risk)
  - File types that cannot be labeled (CAD, legacy formats) — flag for manual review
- Prioritize enforcement on the highest-density locations first
- Schedule quarterly re-scans to catch newly deposited sensitive content
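The prioritization step in the workflow above can be sketched as a ranking of file shares by the percentage of files with a SIT match. This is an illustrative example, not a scanner feature: `Get-ShareDensity` is a hypothetical helper, and the column names `RepositoryPath` and `MatchedSIT` follow the CSV export query shown earlier. Note that an export filtered to labeled rows only would make every share look 100% dense, so the input needs one row per scanned file. The sample rows stand in for a real `Import-Csv` of such an export.

```powershell
# Hypothetical helper: percentage of files per share with a non-empty SIT match.
function Get-ShareDensity {
    param(
        [Parameter(Mandatory)][object[]]$Rows,
        [string]$ShareColumn = 'RepositoryPath',
        [string]$MatchColumn = 'MatchedSIT'
    )
    $Rows | Group-Object -Property $ShareColumn | ForEach-Object {
        $matched = @($_.Group | Where-Object { $_.$MatchColumn }).Count
        [pscustomobject]@{
            Share        = $_.Name
            TotalFiles   = $_.Count
            MatchedFiles = $matched
            DensityPct   = [math]::Round(100 * $matched / $_.Count, 1)
        }
    } | Sort-Object -Property DensityPct -Descending
}

# Sample rows; in practice use: $rows = Import-Csv -LiteralPath <your export file>
$rows = @(
    [pscustomobject]@{ RepositoryPath = '\\fileserver01\shares\finance';     MatchedSIT = 'Credit Card Number' },
    [pscustomobject]@{ RepositoryPath = '\\fileserver01\shares\finance';     MatchedSIT = '' },
    [pscustomobject]@{ RepositoryPath = '\\fileserver01\shares\engineering'; MatchedSIT = '' }
)

Get-ShareDensity -Rows $rows
```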
HIPAA / Healthcare
For healthcare organizations, run discovery with the HIPAA SIT bundle enabled. The scan results provide the data inventory required by HIPAA Security Rule §164.308(a)(1)(ii)(A) (Risk Analysis — identifying locations of ePHI).
GLBA / Financial Services
Run annual discovery scans and export SQL results as evidence for GLBA Safeguards Rule information inventory requirements.
Scheduling and Maintenance
Configure the scanner to run on a schedule in the Purview portal (Information protection → Scanner → Scan jobs):
| Schedule | Use Case |
|---|---|
| One-time | Initial discovery or enforcement run |
| Daily | Active repositories with frequent file changes |
| Weekly | Archival or low-activity repositories |
| Continuous | High-value repositories requiring near-real-time labeling |
Continuous mode re-queues files for re-scan as soon as they are modified. Use this for repositories where new sensitive content is deposited frequently (e.g., contracts intake folders, engineering submittals).
Monitor scanner health via the Windows Event Log on the scanner node (Application log, source Azure Information Protection Scanner) and via the scanner status report in the Purview portal.
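The event log check described above can be scripted with the built-in `Get-WinEvent` cmdlet. This is a minimal sketch: the provider name is taken from the text above and should be confirmed in Event Viewer on your scanner node, and the 24-hour window and the choice of Error and Warning levels are assumptions to adjust.

```powershell
# Filter: errors and warnings from the scanner source in the last 24 hours.
$filter = @{
    LogName      = 'Application'
    ProviderName = 'Azure Information Protection Scanner'   # as named above; verify locally
    Level        = 2, 3                                      # 2 = Error, 3 = Warning
    StartTime    = (Get-Date).AddHours(-24)
}

# Get-WinEvent exists only on Windows; the guard keeps the script harmless elsewhere.
if (Get-Command -Name Get-WinEvent -ErrorAction SilentlyContinue) {
    try {
        Get-WinEvent -FilterHashtable $filter -ErrorAction Stop |
            Select-Object TimeCreated, Id, LevelDisplayName, Message |
            Format-List
    } catch {
        # Get-WinEvent reports "no events found" as an error, so this branch is often the healthy case.
        Write-Host "No matching events returned: $($_.Exception.Message)"
    }
}
```

Running this from a scheduled task and alerting on any output is one simple way to notice a stalled scan between portal checks.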