Skill Guide

Data classification, tagging, and PII/sensitive data detection

The systematic process of identifying, categorizing, and labeling data assets based on their sensitivity and regulatory requirements, with a specific focus on detecting Personally Identifiable Information (PII) and other regulated data types.

This skill is the operational backbone of data privacy and security programs: it enables compliance with regulations such as GDPR and CCPA and reduces organizational risk and financial liability. Without it, data governance is ineffective, security blind spots go unnoticed, and the organization is exposed to fines that can reach millions of dollars.

1 Careers

1 Categories

8.7 Avg Demand

25% Avg AI Risk

How to Learn Data classification, tagging, and PII/sensitive data detection

Master core definitions: Understand the difference between data classification (sensitivity tiers like Public, Internal, Confidential, Restricted), data tagging (applying metadata labels), and PII (direct identifiers like an SSN or email address, and indirect identifiers like date of birth or ZIP code that identify a person only in combination with other data). Study a primary regulation (GDPR or CCPA) to see how it defines sensitive data. Practice manually identifying PII in sample, non-sensitive datasets (e.g., mock customer lists).

Move to execution: Learn to use Regular Expressions (Regex) for basic pattern matching of common PII (e.g., \d{3}-\d{2}-\d{4} for SSN). Implement a small-scale tagging project using a data catalog or labeling tool. Understand false positives/negatives in detection and the importance of contextual review. Avoid the mistake of treating all data as equally sensitive, which cripples operational efficiency.
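A minimal sketch of the regex step and of one way to cut false positives: the pattern from the text is wrapped in word boundaries, and each match is then checked against values that are never issued as SSNs (area 000, 666, or 900-999; group 00; serial 0000). The function names and the sample string are illustrative assumptions, and a structural check like this complements contextual review rather than replacing it.

```python
import re

# Pattern from the text, with word boundaries so it does not match
# inside longer digit runs such as part or order numbers.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def is_plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Reject values that are never issued as an SSN."""
    if area in {"000", "666"} or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def find_ssns(text: str) -> list[str]:
    """Return matches that pass both the regex and the structural check."""
    return [
        m.group(0)
        for m in SSN_PATTERN.finditer(text)
        if is_plausible_ssn(*m.groups())
    ]


# Hypothetical mock text: one impossible value, one plausible value,
# and one longer identifier that only looks similar.
sample = "Ref 000-12-3456, SSN 123-45-6789, part no. 9912-34-56789"
print(find_ssns(sample))
```

Only the middle value survives both checks. A match that passes is still only a candidate: whether it really is an SSN depends on the surrounding context, which is why flagged rows go to manual review.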

Architect scalable solutions: Design and implement an enterprise-wide data classification policy aligned with business objectives and data governance frameworks like DAMA-DMBOK. Lead the integration of automated scanning and tagging tools (e.g., Microsoft Purview, BigID) into data pipelines. Mentor teams on building a culture of data stewardship and handle edge cases involving unstructured data (emails, images) and emerging data sources.

Practice Projects

Beginner

Project

PII Hunter in a Sample Customer Database

Scenario

You are given a CSV file ('customer_data_sample.csv') with 500 rows of mock data containing mixed columns: 'full_name', 'street_address', 'notes', 'transaction_id'. Some entries in 'notes' contain inadvertently copied SSNs or medical info.

How to Execute

1. Load the CSV into a pandas DataFrame.
2. Write Python functions that use Regex to scan all string columns for SSNs (\d{3}-\d{2}-\d{4}), email addresses, and phone numbers.
3. Manually review flagged rows to confirm PII.
4. Create new columns: 'contains_pii' (boolean) and 'pii_types' (list of found types like ['SSN', 'EMAIL']).
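The steps above can be sketched as follows. To stay dependency-free, this sketch reads the data with the standard library `csv` module instead of pandas; the same `detect_pii` function could be applied to DataFrame columns. The email and phone patterns, the function names, and the three mock rows are illustrative assumptions, not a complete detector.

```python
import csv
import io
import re

# Illustrative patterns only; real data needs tuning and manual review.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),  # US-style only
}


def detect_pii(text: str) -> list[str]:
    """Return the sorted names of all patterns found in the text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


def scan_rows(rows):
    """Yield each row with 'contains_pii' and 'pii_types' added."""
    for row in rows:
        found = set()
        for value in row.values():
            found.update(detect_pii(value or ""))
        yield {**row, "contains_pii": bool(found), "pii_types": sorted(found)}


# Mock data using the column names from the scenario.
MOCK_CSV = """full_name,street_address,notes,transaction_id
Ann Lee,1 Main St,Customer gave SSN 123-45-6789 by phone,TX-1001
Bob Ray,2 Oak Ave,Prefers email bob@example.com,TX-1002
Cy Doe,3 Elm Rd,No special requests,TX-1003
"""

flagged = list(scan_rows(csv.DictReader(io.StringIO(MOCK_CSV))))
for row in flagged:
    print(row["transaction_id"], row["contains_pii"], row["pii_types"])
```

Pattern matching only catches PII that has a recognizable shape. Columns such as 'full_name' and 'street_address' are PII by definition and are better classified from the schema than from their values.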

Intermediate

Project

Automated Data Labeling Pipeline for a Cloud Data Lake

Scenario

Your team's data lake in AWS S3 is growing uncontrollably. You need to design a semi-automated process to scan new JSON files in a specific bucket daily, classify them, and apply tags for data discovery and access control.

How to Execute

1. Set up an AWS Lambda function triggered by S3 object-created events (for example, `s3:ObjectCreated:Put`).
2. Scan file content for PII with Amazon Macie or with custom Lambda code that uses a Python library such as `presidio-analyzer`.
3. Based on the scan results, apply object tags such as `DataClassification: Confidential` and `ContainsPII: Yes` with the S3 object tagging API (`PutObjectTagging`). The AWS Resource Groups Tagging API tags resources such as buckets, not individual objects.
4. Configure alerts in Amazon CloudWatch for high-risk findings.
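A minimal sketch of the custom-Lambda variant of this pipeline, under several assumptions: a single SSN regex stands in for a real scanner such as Macie or `presidio-analyzer`, tags are written per object with boto3's `put_object_tagging`, and files without findings are tagged `Internal` / `No` (values that are not in the scenario). The helper names are hypothetical. `boto3` is imported inside the handler so the pure helpers can be exercised locally without AWS access.

```python
import re
from urllib.parse import unquote_plus

# Stand-in detector; a real pipeline would call a proper PII scanner.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def objects_in_event(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys arrive URL-encoded in S3 event notifications.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def build_tag_set(contains_pii: bool) -> list[dict]:
    """Map a scan result to S3 object tags (non-PII values are assumed)."""
    return [
        {"Key": "DataClassification",
         "Value": "Confidential" if contains_pii else "Internal"},
        {"Key": "ContainsPII", "Value": "Yes" if contains_pii else "No"},
    ]


def handler(event, context):
    """Lambda entry point: scan each new object and tag it."""
    import boto3  # lazy import: only needed when running in AWS

    s3 = boto3.client("s3")
    for bucket, key in objects_in_event(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        contains_pii = bool(SSN.search(body.decode("utf-8", errors="replace")))
        s3.put_object_tagging(
            Bucket=bucket, Key=key,
            Tagging={"TagSet": build_tag_set(contains_pii)},
        )


# Local check of the helpers with a hypothetical event payload.
sample_event = {"Records": [{"s3": {"bucket": {"name": "example-lake"},
                                    "object": {"key": "raw/new+file.json"}}}]}
print(objects_in_event(sample_event))
print(build_tag_set(True))
```

Note that `put_object_tagging` replaces the object's whole tag set, so a production version would merge with existing tags first. Reading entire objects into memory is also only reasonable for small files.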

Advanced

Case Study/Exercise

Remediating a Legacy System Data Governance Gap

Scenario

A multinational bank discovers during an audit that its 15-year-old core banking mainframe system has no data classification. Critical customer data (account numbers, social security numbers, transaction histories) is stored in flat files with no encryption or access logs, violating the integrity and confidentiality principle of GDPR Article 5(1)(f).

How to Execute

1. Form a cross-functional task force (Security, Compliance, Data Engineering, Legal).
2. Perform a Data Protection Impact Assessment (DPIA) to map all data flows and identify the highest-risk areas.
3. Prioritize remediation: first, isolate and encrypt the most sensitive flat files; second, deploy a network DLP tool such as Symantec Data Loss Prevention to monitor egress points.
4. Develop a phased, non-disruptive migration plan to move data into a modern, classification-aware platform, creating a full audit trail for regulators.

Tools & Frameworks

Software & Platforms

Microsoft Purview (formerly Azure Purview), BigID, OneTrust, IBM Watson Knowledge Catalog, Google Cloud Data Loss Prevention (DLP) API

Enterprise platforms for automated data discovery, classification, and policy enforcement across cloud and on-prem environments. Use Purview or BigID for scanning massive data estates; use cloud-native DLP APIs for programmatic scanning within pipelines.

Technical Libraries & APIs

Python 'presidio-analyzer' (Microsoft), Apache OpenNLP, Google Cloud DLP Client Library, Custom Regex Engines

For building custom, lightweight, or embedded detection logic. Presidio is a leading open-source option for PII detection in text and images. Use these for specific use cases where an enterprise platform is overkill or for initial prototyping.

Regulatory & Governance Frameworks

NIST Privacy Framework, DAMA-DMBOK (Data Management Body of Knowledge), ISO/IEC 27701, GDPR Article 30 Records of Processing Activities

Frameworks that provide the 'why' and 'what' behind classification rules. Use DAMA-DMBOK to structure your data governance program. Use GDPR's Article 30 as a template for creating the inventory that classification must support.

Interview Questions

Answer Strategy

Structure the answer around a standard data governance lifecycle: Discovery, Definition, Implementation, and Enforcement. Be specific about tools and handoffs. Sample Answer: 'First, I'd collaborate with business owners to define classification tiers and what constitutes PII based on applicable regulations. Then, I'd use a scanning tool like Purview to profile sample data from all source systems to understand the data landscape. The third step is to implement automated tagging at ingestion, integrating with the ETL/ELT pipeline, possibly using API calls to a DLP service. Finally, I'd enforce the policy by tying classification tags to access controls (RBAC) in the warehouse and establishing a quarterly review process with data stewards.'
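The enforcement step in the sample answer, tying classification tags to access controls, can be illustrated with a small sketch. The role names, clearance levels, and column tags below are hypothetical; a real warehouse would express the same rule through its own RBAC or column-masking policies rather than application code.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sensitivity tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical mapping of roles to the highest tier they may read.
ROLE_CLEARANCE = {
    "analyst": Tier.INTERNAL,
    "steward": Tier.CONFIDENTIAL,
    "dpo": Tier.RESTRICTED,
}

# Hypothetical classification tags applied at ingestion.
COLUMN_TAGS = {
    "order_total": Tier.INTERNAL,
    "email": Tier.CONFIDENTIAL,
    "ssn": Tier.RESTRICTED,
}


def visible_columns(role: str, column_tags: dict[str, Tier]) -> list[str]:
    """Return the columns whose tier is within the role's clearance."""
    # Unknown roles fall back to PUBLIC, i.e. least privilege.
    clearance = ROLE_CLEARANCE.get(role, Tier.PUBLIC)
    return [col for col, tier in column_tags.items() if tier <= clearance]


for role in ("analyst", "steward", "dpo", "contractor"):
    print(role, visible_columns(role, COLUMN_TAGS))
```

Defaulting unknown roles to the lowest tier means a missing mapping fails closed, so an unclassified role sees nothing sensitive.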

Answer Strategy

Tests crisis management, prioritization, and understanding of the principle of least privilege. Sample Answer: 'My immediate priority is risk containment. Step 1: I would immediately engage the DBA or system owner to restrict the overly broad access permissions to the original, intended team. Step 2: I would initiate a preliminary assessment to determine the exposure window and the types of PII involved. Step 3: Based on that assessment, I would follow the company's incident response protocol, which may involve notifying the Data Protection Officer (DPO) and potentially the relevant regulator if a breach threshold is met. Concurrently, I would start the formal tagging and classification process for that database to prevent recurrence.'