Skill Guide

Data cataloging, metadata management, and dark data taxonomy design

This skill is the systematic process of discovering, organizing, and governing enterprise data assets: defining their context (metadata) and classifying unmanaged 'dark data' into actionable taxonomies to reduce risk and unlock value.

This skill directly reduces data storage costs and compliance risks by identifying redundant and sensitive dark data, while simultaneously accelerating analytics and AI initiatives by making all data discoverable, trustworthy, and understandable.

Careers: 1

Categories: 1

Avg Demand: 8.5

Avg AI Risk: 20%

How to Learn Data cataloging, metadata management, and dark data taxonomy design

Focus on: 1) Core metadata types (technical, operational, business). 2) Basic catalog components (data dictionaries, lineage). 3) The concept of dark data and why it exists (unstructured logs, legacy backups).
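The three metadata types can be pictured as separate parts of one catalog entry. The following is a minimal sketch, not the schema of any particular catalog tool; every class name, field name, and sample value (including the storage path) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class TechnicalMetadata:
    """Physical facts: where the asset lives and how it is shaped."""
    location: str
    data_format: str
    columns: Dict[str, str]  # column name -> data type


@dataclass
class OperationalMetadata:
    """Facts about how the asset is produced and refreshed."""
    last_refreshed: date
    row_count: int
    upstream_sources: List[str] = field(default_factory=list)  # simple lineage


@dataclass
class BusinessMetadata:
    """Meaning and accountability, expressed in business terms."""
    definition: str
    owner: str
    glossary_terms: List[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical example entry; none of these values come from a real system.
entry = CatalogEntry(
    name="retail_sales",
    technical=TechnicalMetadata(
        location="s3://example-bucket/retail/sales.parquet",
        data_format="parquet",
        columns={"dt": "date", "amount": "decimal(10,2)"},
    ),
    operational=OperationalMetadata(
        last_refreshed=date(2024, 1, 15),
        row_count=120_000,
        upstream_sources=["pos_transactions"],
    ),
    business=BusinessMetadata(
        definition="One row per completed retail transaction.",
        owner="Retail Ops",
        glossary_terms=["Transaction Date"],
    ),
)
```

Keeping the three groups separate makes it easy to see which part of an entry is still empty. Dark data typically has some technical metadata and little or no business metadata.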

Move to: 1) Implementing a pilot catalog for a single domain (e.g., Customer) using tools like Collibra or Alation. 2) Designing a simple taxonomy for classifying dark data (e.g., 'Business Critical', 'Compliance Hold', 'Delete Candidate'). Avoid the mistake of trying to catalog everything at once; start with high-impact data.

Master: 1) Enterprise-wide data governance frameworks (DAMA-DMBOK). 2) Aligning metadata management with business glossaries and KPIs. 3) Architecting automated, ML-based discovery and classification that can scan petabytes of dark data. 4) Mentoring data stewards to build a governance culture.

Practice Projects

Beginner

Project

Build a Personal Data Dictionary for a Public Dataset

Scenario

You have downloaded a public dataset (e.g., from Kaggle) about retail sales. The column names are cryptic (e.g., 'col_1', 'dt').

How to Execute

1. Open the dataset and document each column's technical metadata (name, data type, sample values). 2. Research and assign a business definition and an owner (e.g., 'dt' = 'Transaction Date', owned by Retail Ops). 3. Identify any 'dark' columns you cannot understand and flag them for investigation. 4. Present this dictionary in a spreadsheet or wiki.
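Steps 1 to 3 can be partly automated before the manual research begins. The sketch below uses only the Python standard library and assumes the dataset is a small CSV. The function names, the type labels, and the sample rows are illustrative assumptions, and the type inference is deliberately naive (it does not recognise dates).

```python
import csv
import io


def infer_type(values):
    """Guess a simple type label from the non-empty string values of a column."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def build_dictionary(csv_text, definitions=None, sample_size=3):
    """Return one data dictionary row per column.

    `definitions` maps a column name to {"definition": ..., "owner": ...}.
    Columns without an entry are flagged as dark, i.e. needing investigation.
    """
    definitions = definitions or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        values = [row[name] or "" for row in rows]
        business = definitions.get(name)
        entries.append({
            "column": name,
            "data_type": infer_type(values),
            "sample_values": values[:sample_size],
            "definition": business["definition"] if business else "",
            "owner": business["owner"] if business else "",
            "status": "documented" if business else "needs investigation",
        })
    return entries


# Made-up sample rows standing in for the downloaded dataset.
SAMPLE = """dt,col_1,amt
2024-01-05,A17,19.99
2024-01-06,B02,5
"""

dictionary = build_dictionary(
    SAMPLE,
    definitions={"dt": {"definition": "Transaction Date", "owner": "Retail Ops"}},
)
for row in dictionary:
    print(row["column"], row["data_type"], row["status"])
```

The resulting list of rows can be written out with `csv.DictWriter` to produce the spreadsheet mentioned in step 4.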

Intermediate

Case Study/Exercise

Design a Dark Data Triage Process for a Legacy Server

Scenario

Your company is decommissioning a legacy server containing 10TB of unclassified log files, reports, and backups from the last decade. Legal requires a data retention policy.

How to Execute

1. Propose a triage taxonomy: 'Active Use', 'Regulatory Hold', 'Candidate for Archival', 'Candidate for Deletion'. 2. Design a sampling methodology to analyze file types, access dates, and ownership. 3. Create a workflow for business users to review and tag sampled data. 4. Draft a policy document with rules for each category and a timeline for action.
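The taxonomy and the sampling step can be prototyped on a file inventory before anyone touches the server. This is a rough sketch under stated assumptions: the `FileRecord` fields, the one-year and seven-year thresholds, and the sample paths are all hypothetical, and real retention periods must come from Legal rather than from code defaults.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRecord:
    path: str
    last_accessed: date
    legal_hold: bool = False  # assumed to be set from a list supplied by Legal


def stratified_sample(records, per_type, seed=0):
    """Pick up to `per_type` files from each file extension for manual review."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_ext = defaultdict(list)
    for rec in records:
        ext = PurePosixPath(rec.path).suffix.lower() or "(none)"
        by_ext[ext].append(rec)
    sample = []
    for ext in sorted(by_ext):
        group = by_ext[ext]
        sample.extend(rng.sample(group, min(per_type, len(group))))
    return sample


def triage(record, today, active_days=365, archive_days=365 * 7):
    """Assign one taxonomy category; the thresholds are illustrative only."""
    if record.legal_hold:
        return "Regulatory Hold"
    age_days = (today - record.last_accessed).days
    if age_days <= active_days:
        return "Active Use"
    if age_days <= archive_days:
        return "Candidate for Archival"
    return "Candidate for Deletion"


# Hypothetical inventory entries.
today = date(2024, 6, 1)
records = [
    FileRecord("/srv/logs/app.log", date(2024, 5, 20)),
    FileRecord("/srv/reports/q3_2019.pdf", date(2019, 11, 2)),
    FileRecord("/srv/backups/db_2012.bak", date(2013, 1, 10)),
    FileRecord("/srv/reports/audit_2015.pdf", date(2015, 3, 1), legal_hold=True),
]
labels = {rec.path: triage(rec, today) for rec in records}
review_sample = stratified_sample(records, per_type=1)
```

The legal hold check runs first on purpose, so that an old file under hold can never fall through to the deletion category. The automatic label is only a proposal for the business reviewers in step 3.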

Advanced

Case Study/Exercise

Align a Data Catalog with a Merger & Acquisition Integration

Scenario

Your company has acquired a competitor. Both companies have separate data catalogs, glossaries, and conflicting definitions for core entities like 'Customer' and 'Revenue'.

How to Execute

1. Lead a cross-functional working group to perform a 'metadata gap analysis'. 2. Develop a unified business glossary using a tool like Collibra, establishing canonical definitions and data stewards. 3. Map the lineage from both source systems to the new unified analytical models. 4. Implement data quality rules and stewardship workflows to resolve conflicts, with the goal of establishing a 'single source of truth' for M&A reporting.
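A first pass at the metadata gap analysis in step 1 can be a plain comparison of the two glossaries, assuming each can be exported as a mapping from term to definition. The function below is a simplified sketch. The glossary contents are invented, and real definitions usually need human judgment because two texts can differ in wording while meaning the same thing.

```python
def glossary_gap_analysis(glossary_a, glossary_b):
    """Compare two glossaries given as {term: definition} dictionaries.

    Definitions are compared after lowercasing and collapsing whitespace,
    so only textual differences are detected, not semantic ones.
    Term names are matched exactly (case-sensitive).
    """
    def normalize(text):
        return " ".join(text.lower().split())

    terms_a, terms_b = set(glossary_a), set(glossary_b)
    shared = terms_a & terms_b
    conflicting = sorted(
        t for t in shared
        if normalize(glossary_a[t]) != normalize(glossary_b[t])
    )
    return {
        "only_in_a": sorted(terms_a - terms_b),
        "only_in_b": sorted(terms_b - terms_a),
        "conflicting": conflicting,
        "aligned": sorted(shared - set(conflicting)),
    }


# Invented glossary extracts for the two companies.
acquirer = {
    "Customer": "Any party with at least one paid order.",
    "Revenue": "Invoiced amount net of refunds.",
    "Churn": "Customer with no order in 12 months.",
}
acquired = {
    "Customer": "Any registered account.",
    "Revenue": "invoiced amount  net of refunds.",
    "Lead": "Contact that requested a quote.",
}

report = glossary_gap_analysis(acquirer, acquired)
print(report)
```

The `conflicting` list becomes the agenda for the working group, while `only_in_a` and `only_in_b` show which terms need a steward decision on whether to adopt them in the unified glossary.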

Tools & Frameworks

Software & Platforms

Collibra Data Catalog, Alation Data Catalog, Apache Atlas (open-source), Google Cloud Dataplex (native GCP)

Use Collibra or Alation for enterprise governance with strong business glossary features. Use Apache Atlas for Hadoop/Spark ecosystem integration. Use native cloud catalogs (Dataplex, AWS Glue Data Catalog) for cloud-native data lake governance.

Mental Models & Methodologies

DAMA-DMBOK Framework, Data Mesh Principles, FAIR Data Principles

Apply DAMA-DMBOK for comprehensive governance structure. Use Data Mesh's 'Data as a Product' concept to assign domain ownership to catalog entries. Apply FAIR (Findable, Accessible, Interoperable, Reusable) to evaluate and improve the maturity of your catalog entries.
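One way to make the FAIR evaluation concrete is a simple checklist run against each catalog entry. The checks below are simplified proxies chosen for illustration, not official FAIR metrics, and the entry field names (`identifier`, `access_procedure`, and so on) are hypothetical.

```python
# Each principle is reduced to a presence check on a few entry fields.
FAIR_CHECKS = {
    "Findable": lambda e: bool(e.get("identifier")) and bool(e.get("description")),
    "Accessible": lambda e: bool(e.get("access_procedure")),
    "Interoperable": lambda e: bool(e.get("schema")) and bool(e.get("glossary_terms")),
    "Reusable": lambda e: bool(e.get("license")) and bool(e.get("lineage")),
}


def fair_report(entry):
    """Return per-principle pass/fail results and the fraction that passed."""
    results = {name: check(entry) for name, check in FAIR_CHECKS.items()}
    score = sum(results.values()) / len(results)
    return results, score


# Hypothetical catalog entry with some fields still missing.
catalog_entry = {
    "identifier": "retail.sales.v1",
    "description": "Completed retail transactions.",
    "access_procedure": "Request access via the data owner.",
    "schema": {"dt": "date", "amount": "decimal"},
    "glossary_terms": [],  # not yet linked to the business glossary
    "license": "internal-use-only",
    # no "lineage" key: upstream sources are undocumented
}

results, score = fair_report(catalog_entry)
print(results, score)
```

Tracking this score per domain over time gives a rough maturity trend, and the failed checks point at the specific metadata a data owner still needs to supply.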

Interview Questions

Answer Strategy

Structure the answer using a phased approach: Discovery, Triage, and Operationalize. Avoid proposing to catalog everything manually. Emphasize automated scanning, stakeholder alignment, and creating actionable policies.

Answer Strategy

The interviewer is testing stakeholder management, influence without authority, and the ability to create shared understanding. Use the STAR (Situation, Task, Action, Result) method. Focus on facilitation techniques, not just technology.