Information Protection Scanner

Cloud-native auto-labeling covers Exchange, SharePoint Online, and OneDrive. Content stored on on-premises file shares, NAS devices, and SharePoint Server requires the Microsoft Purview Information Protection Scanner, a Windows service that connects to your on-premises repositories and applies labels using the same sensitive information type (SIT) rules configured in the Purview portal.

Architecture

The scanner consists of two components:

Component | Role
Scanner node | Windows Server running the AIP Unified Labeling client; connects to file shares via SMB or SharePoint Server via HTTPS
SQL Server database | Stores scanner job configuration, scan results, and per-file label status

The scanner node authenticates to the Purview service using an Entra service principal (app registration). It reads files, evaluates SIT conditions, and either reports matches (discovery mode) or writes labels to the files (enforcement mode).

On-premises network:
File share (\\server\share) ──SMB──▶ Scanner node (Windows Server)
SharePoint Server ──HTTPS──▶ Scanner node

Scanner node ──HTTPS──▶ Purview compliance portal (label policies)
Scanner node ──SQL──▶ SQL Server (scan results database)
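
The difference between the two modes can be illustrated with a toy model. This is only a sketch: the regular expression stands in for a real SIT, and `scan_file`, `ScanResult`, and the "Confidential" label are hypothetical names, not part of the scanner or its API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a SIT: any SSN-shaped token (not a real SIT definition).
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass(frozen=True)
class ScanResult:
    path: str
    matched: bool
    recommended_label: Optional[str]
    applied_label: Optional[str]


def scan_file(path: str, content: str, enforce: bool,
              label: str = "Confidential") -> ScanResult:
    """Evaluate one file; only enforcement mode records an applied label."""
    matched = SSN_LIKE.search(content) is not None
    recommended = label if matched else None
    # Discovery mode reports the recommendation but leaves the file untouched.
    applied = recommended if enforce else None
    return ScanResult(path, matched, recommended, applied)
```

In this model a discovery run and an enforcement run over the same file produce the same `recommended_label`; only `applied_label` differs.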

Sizing the Scanner Node

Scanner throughput depends on CPU, RAM, and network bandwidth to the file share. Use these guidelines:

File Repository Size | Scanner Node Sizing | SQL Server
< 1 million files | 4 vCPU, 8 GB RAM | SQL Express (free, 10 GB limit)
1–10 million files | 8 vCPU, 16 GB RAM | SQL Server Standard or Developer
> 10 million files | 16 vCPU, 32 GB RAM; consider multiple scanner nodes | SQL Server Standard with dedicated instance

SQL Express limitations: The free SQL Express edition has a 10 GB database size limit. For large environments, SQL Express will fill up before the scan completes, halting the job. Use SQL Server Standard or SQL Server Developer (free for non-production) for any repository exceeding ~2 million files.
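
As a rough planning aid, the sizing tiers above can be encoded in a small lookup. This is a sketch of the table only, not a Microsoft sizing tool; `Sizing` and `recommend_sizing` are hypothetical names, and the boundaries simply mirror the table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    vcpu: int
    ram_gb: int
    sql_edition: str
    multiple_nodes: bool = False


def recommend_sizing(file_count: int) -> Sizing:
    """Map an estimated file count to the tier listed in the sizing table."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    if file_count < 1_000_000:
        return Sizing(4, 8, "SQL Express")
    if file_count <= 10_000_000:
        return Sizing(8, 16, "SQL Server Standard or Developer")
    return Sizing(16, 32, "SQL Server Standard (dedicated instance)",
                  multiple_nodes=True)
```

For example, `recommend_sizing(3_000_000)` returns the 8 vCPU / 16 GB tier with a non-Express SQL edition, which agrees with the guidance to avoid SQL Express beyond roughly 2 million files.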

Network bandwidth: The scanner reads file content across the network. Budget 1 Gbps network connectivity between the scanner node and the file share for production workloads. Scanning over a WAN link to remote offices will be slow — deploy a scanner node at each remote office instead.
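
To see why WAN scanning is slow, it helps to estimate how long it takes just to move the file content across the link. The sketch below is plain arithmetic, not a scanner benchmark; `transfer_hours` is a hypothetical helper, and the default `efficiency` of 0.6 is an assumed figure for protocol overhead and contention that you should replace with a measured value.

```python
def transfer_hours(total_tb: float, link_mbps: float,
                   efficiency: float = 0.6) -> float:
    """Hours needed to read total_tb (decimal terabytes) over a link.

    efficiency is the assumed usable fraction of the nominal link rate.
    """
    if total_tb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    total_bits = total_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return total_bits / usable_bits_per_second / 3600
```

At full line rate (`efficiency=1.0`), 5 TB over 1 Gbps takes about 11 hours, while the same data over a 100 Mbps WAN link takes about 111 hours. Actual scan time will be longer, since content inspection adds CPU time on top of the transfer.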

Installation

# On the scanner node — install the AIP Unified Labeling client
# Download from Microsoft Download Center
Install-AIPScanner -SqlServerInstance "SQL01\SCANNER" -Profile "OnPremScan"

# Authenticate the scanner to Entra (run once)
Set-AIPAuthentication `
    -AppId "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" `
    -AppSecret "your-client-secret" `
    -TenantId "yourtenant.onmicrosoft.com" `
    -DelegatedUser "scanneraccount@yourtenant.onmicrosoft.com"

# Add a repository to scan
Add-AIPScannerRepository -Path "\\fileserver01\shares\engineering"
Add-AIPScannerRepository -Path "\\fileserver01\shares\finance"

The DelegatedUser account requires:

  • Read access to all repositories being scanned
  • Azure Information Protection Scanner license (included in M365 E3/E5 and AIP P1/P2)
  • A Purview sensitivity label policy published to the account

Discovery Mode vs. Enforcement Mode

The scanner operates in two modes, controlled by the content scan job settings in the Purview portal or via PowerShell.

Discovery Mode (Recommended)

Scans files and generates a report of what was found — which files match which SITs, how many files would be labeled — without writing any labels to files. Use this mode first.

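# Discovery only: report matches, write no labels.
# Parameter placement varies by client version; newer unified labeling builds
# set Enforce and Schedule on the content scan job instead. Check Get-Help first.
Set-AIPScannerConfiguration -Enforce Off -ReportLevel Info -JustificationMessage "" -Schedule OneTime
Start-AIPScan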

When to use: Initial assessment of on-premises data before committing to enforcement. Run discovery for at least one full scan cycle before enabling enforcement.

Output: Report files in %localappdata%\Microsoft\MSIP\Scanner\Reports\ — CSV files listing every file scanned, the matching SITs, the confidence level, and the recommended label.
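
A quick way to review a discovery report is to count rows per recommended label. The sketch below assumes the CSV has a header row; the column name passed in (for example "Recommended Label") is an assumption, so check the header of your own report before relying on it. `count_by_column` is a hypothetical helper, not part of the scanner.

```python
import csv
from collections import Counter
from typing import TextIO


def count_by_column(report: TextIO, column: str) -> Counter:
    """Count non-empty values of one column in a scanner report CSV."""
    reader = csv.DictReader(report)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in report header")
    counts: Counter = Counter()
    for row in reader:
        value = (row[column] or "").strip()
        if value:  # skip files with no value in this column
            counts[value] += 1
    return counts
```

Typical use is `with open(path, newline="", encoding="utf-8-sig") as f: count_by_column(f, "Recommended Label")`, where both the path and the column name come from your environment.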

Enforcement Mode (Auto-label)

Writes sensitivity labels to files in place. The Office property msip_labels is set in the file's metadata. Files labeled in enforcement mode will be recognized by Office apps and the Purview DLP engine immediately.

Set-AIPScannerConfiguration -ReportLevel Debug -Enforce On
Start-AIPScan

Test on a Subset First

Before enabling enforcement on a full repository, scope the scan to a test folder with representative file types. Verify that Office files are labeled correctly and that non-Office files (PDFs, images) are handled as expected. Some applications may reject files whose metadata has been modified by the scanner.

Files the scanner can label:

File Type | Label Written To | Notes
Word, Excel, PowerPoint (.docx, .xlsx, .pptx) | Office property metadata | Full label support
PDF | PDF metadata | Requires Acrobat or PDF labeling extension
Text files (.txt, .csv, .xml) | File metadata (alternate data stream on NTFS) | No visual marking
Images, CAD, executables | Discovery only | Cannot write labels
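
When choosing a test folder, it can help to predict how each file will be treated. The sketch below encodes only the table above; `label_target` is a hypothetical helper, and treating every unlisted extension as discovery-only is an assumption rather than documented scanner behavior.

```python
from pathlib import PureWindowsPath

# Extension -> where the label is written, taken from the table above.
LABEL_TARGETS = {
    ".docx": "Office property metadata",
    ".xlsx": "Office property metadata",
    ".pptx": "Office property metadata",
    ".pdf": "PDF metadata",
    ".txt": "File metadata (NTFS alternate data stream)",
    ".csv": "File metadata (NTFS alternate data stream)",
    ".xml": "File metadata (NTFS alternate data stream)",
}


def label_target(path: str) -> str:
    """Return where a label would be written for this file, per the table."""
    extension = PureWindowsPath(path).suffix.lower()
    # Assumption: anything not listed is treated as discovery-only.
    return LABEL_TARGETS.get(extension, "Discovery only")
```

`PureWindowsPath` is used so that UNC paths such as `\\fileserver01\shares\finance\budget.XLSX` parse the same way on any operating system.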

Extracting Scan Results from SQL

The scanner stores all scan results in the SQL database. The Purview portal does not expose a complete per-file report for on-premises scans — you must query the SQL database directly.

-- Summary: files by recommended label
SELECT
    RecommendedLabel,
    COUNT(*) AS FileCount,
    -- Cast to BIGINT so the sum cannot overflow if FileSize is stored as INT
    SUM(CAST(FileSize AS BIGINT)) / 1048576 AS TotalSizeMB
FROM dbo.ScannerFiles
WHERE RecommendedLabel IS NOT NULL
GROUP BY RecommendedLabel
ORDER BY FileCount DESC;

-- Detail: files matching a specific SIT
SELECT
    RepositoryPath,
    FileName,
    MatchedSIT,
    ConfidenceLevel,
    RecommendedLabel,
    CurrentLabel,
    LastModifiedDate
FROM dbo.ScannerFiles
WHERE MatchedSIT LIKE '%Social Security%'
   OR MatchedSIT LIKE '%Credit Card%'
ORDER BY LastModifiedDate DESC;

-- Files labeled in enforcement mode
SELECT
    RepositoryPath,
    FileName,
    CurrentLabel,
    LabelSetDate,
    LabelSetBy
FROM dbo.ScannerFiles
WHERE LabelSetDate IS NOT NULL
ORDER BY LabelSetDate DESC;

Export results to CSV for audit evidence:

Invoke-Sqlcmd -ServerInstance "SQL01\SCANNER" -Database "AIPScanner" -Query @"
SELECT RepositoryPath, FileName, MatchedSIT, ConfidenceLevel, RecommendedLabel, CurrentLabel
FROM dbo.ScannerFiles
WHERE RecommendedLabel IS NOT NULL
"@ | Export-Csv -Path ".\ScannerResults_$(Get-Date -Format 'yyyy-MM-dd').csv" -NoTypeInformation

Environment-Specific Considerations

On-Premises CUI Discovery for CMMC

NIST SP 800-171 Rev. 2 control 3.1.22 requires organizations to control CUI posted or processed on publicly accessible systems. More broadly, 3.13.16 requires protecting the confidentiality of CUI at rest, which presupposes knowing where CUI is stored, including on-premises file servers.

The scanner supports these requirements by:

  1. Discovery scan — generates a per-file report identifying which files contain CUI indicators (Distribution Statement keywords, ITAR/EAR keywords, DoD contract numbers)
  2. Enforcement scan — applies the CUI — Basic or CUI — Specified label to identified files, making them subject to DLP policies even when accessed via the on-premises network

Air-Gapped Networks

If the scanner node cannot reach the Purview compliance portal (air-gapped or classified networks), configure the scanner in offline mode:

Set-AIPScannerConfiguration -OnlineConfiguration Off
Import-AIPScannerConfiguration -FileName "C:\ScannerConfig\policy.msip"

Export the scanner policy from the portal and deliver it to the air-gapped node via approved transfer media. Labels applied in offline mode are consistent with cloud policy — they will be recognized by any Office app with the same label policy deployed.

CMMC Control Mapping

NIST Control | Scanner Capability
3.1.22: CUI on public-facing systems | Discovery mode identifies CUI in file shares before any internet exposure
3.3.1: Audit records | SQL scan results provide a timestamped record of all files and their label status
3.13.16: Protect CUI at rest | Full-repository discovery scan with per-file SIT match report shows where CUI is stored
3.4.1: Baseline configurations | Labeled files are subject to DLP endpoint controls

Scheduling and Maintenance

Configure the scanner to run on a schedule in the Purview portal (Information protection > Scanner > Scan jobs):

Schedule | Use Case
One-time | Initial discovery or enforcement run
Daily | Active repositories with frequent file changes
Weekly | Archival or low-activity repositories
Continuous | High-value repositories requiring near-real-time labeling

Continuous mode re-queues files for re-scan as soon as they are modified. Use this for repositories where new sensitive content is deposited frequently (e.g., contracts intake folders, engineering submittals).
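
The re-queue idea can be pictured as a filter on modification time. This is a toy model only: `files_to_rescan` is a hypothetical helper, and the scanner's actual change tracking is internal and may use different signals.

```python
from datetime import datetime
from typing import Dict, List


def files_to_rescan(modified_times: Dict[str, datetime],
                    last_scan: datetime) -> List[str]:
    """Return paths modified after the last scan, in sorted order."""
    return sorted(
        path for path, modified in modified_times.items()
        if modified > last_scan
    )
```

Files whose modification time is at or before the last scan are left out, which is why a low-activity archive gains little from a continuous schedule.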

Monitor scanner health via the Windows Event Log on the scanner node (Application log, source Azure Information Protection Scanner) and via the scanner status report in the Purview portal.
