Skill Guide

AI-specific data preparation: chunking, tokenization, deduplication, and data curation for LLM training

The systematic process of transforming raw, unstructured text corpora into clean, de-duplicated, and tokenized sequences optimized for Large Language Model pre-training and fine-tuning.

Data quality strongly influences LLM capability, safety, and training efficiency, so careful curation reduces wasted training compute and lowers the risk of model failures or harmful outputs.

Careers: 1 | Categories: 1 | Avg Demand: 9.1 | Avg AI Risk: 15%

How to Learn AI-specific data preparation: chunking, tokenization, deduplication, and data curation for LLM training

1. Understand the difference between raw data (Common Crawl) and curated datasets (The Pile).
2. Learn basic text cleaning (HTML/Markdown removal, language identification).
3. Grasp the concept of a 'token' and how tokenizers like BPE work at a high level.
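To make the idea of BPE concrete, here is a toy sketch of its training loop: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new symbol. The function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are illustrative and not taken from any library; production tokenizers add byte-level handling, pre-tokenization rules, and far more efficient data structures.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


if __name__ == "__main__":
    merges, words = train_bpe("low low low lower lowest", num_merges=2)
    print(merges)
    print(words)
```

Each learned merge becomes a vocabulary entry, which is why frequent words end up as a single token while rare words are split into several pieces.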

1. Implement a full pipeline: ingestion → cleaning → filtering → deduplication → tokenization.
2. Work with specific tools: use `trafilatura` for web text extraction, `langdetect` for language identification, and `fastText` classifiers for quality filtering.
3. Avoid over-filtering, which removes valuable edge-case data. Track how much each stage drops so that an overly strict filter is visible.
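The pipeline stages can be sketched as chained generators, with a counter after each stage so that an overly aggressive filter shows up in the numbers. This is a minimal illustration using only the standard library: the regex tag stripping, the `min_words` threshold, and the whitespace "tokenizer" are placeholder assumptions standing in for real extraction, quality classification, and subword tokenization.

```python
import hashlib
import re


def clean(docs):
    """Crude cleaning: strip tags with a regex and collapse whitespace."""
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc["text"])
        text = re.sub(r"\s+", " ", text).strip()
        yield {**doc, "text": text}


def quality_filter(docs, min_words):
    """Placeholder quality filter: keep documents with enough words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def exact_dedup(docs):
    """Drop documents whose lowercased text was already seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def tokenize(docs):
    """Placeholder tokenizer: whitespace split instead of a trained subword model."""
    for doc in docs:
        yield {**doc, "tokens": doc["text"].split()}


def counted(stage_name, docs, stats):
    """Pass documents through unchanged while counting them per stage."""
    for doc in docs:
        stats[stage_name] = stats.get(stage_name, 0) + 1
        yield doc


def run_pipeline(raw_docs, min_words=5):
    stats = {}
    stream = counted("ingested", raw_docs, stats)
    stream = counted("cleaned", clean(stream), stats)
    stream = counted("filtered", quality_filter(stream, min_words), stats)
    stream = counted("deduplicated", exact_dedup(stream), stats)
    return list(tokenize(stream)), stats


if __name__ == "__main__":
    raw = [
        {"url": "a", "text": "<p>The quick brown fox jumps over the lazy dog</p>"},
        {"url": "b", "text": "the quick brown fox  jumps over the lazy dog"},
        {"url": "c", "text": "<div>too short</div>"},
    ]
    result, stats = run_pipeline(raw)
    print(stats)
```

Comparing the per-stage counts against a manually reviewed sample of dropped documents is one simple way to notice when a filter is discarding useful edge cases.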

1. Design custom, multi-stage deduplication (exact + fuzzy) pipelines for petabyte-scale data.
2. Develop domain-specific heuristics and classifiers for data quality scoring (e.g., for code, medical, or legal text).
3. Architect data mixing strategies and curriculum learning schedules based on data composition analysis.

Practice Projects

Beginner

Project

Build a Basic Web Text Cleaner

Scenario

You have 1GB of raw HTML files from a single domain (e.g., Wikipedia). The goal is to produce clean, readable plaintext.

How to Execute

1. Use `trafilatura` or `beautifulsoup4` to extract the main content.
2. Apply a series of regex-based cleaners to remove remaining boilerplate (footers, nav bars).
3. Use `langdetect` to filter out non-target-language documents.
4. Write the cleaned text to a single JSONL file, preserving the source URL as metadata.
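The extraction and output steps can be sketched with the standard library alone. This is a deliberately simple stand-in: it uses `html.parser` rather than `trafilatura` or `beautifulsoup4`, the `SKIP_TAGS` and `BLOCK_TAGS` sets are assumptions about where boilerplate lives, and language filtering is left out. The class and function names are illustrative.

```python
import json
import re
from html.parser import HTMLParser

# Assumption: content inside these tags is treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}
# Assumption: these tags mark a text break, so adjacent blocks do not fuse.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "tr"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace runs into single spaces.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def to_jsonl_line(url, html):
    """Serialize one cleaned document as a JSONL record with its source URL."""
    return json.dumps({"url": url, "text": html_to_text(html)}, ensure_ascii=False)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style></head><body>"
        "<nav>Home | About</nav><p>Hello <b>world</b>.</p>"
        "<footer>Copyright</footer></body></html>"
    )
    print(to_jsonl_line("https://example.org/a", page))
```

Writing one JSON object per line keeps the output streamable, so later stages can process the file without loading it fully into memory.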

Intermediate

Project

Implement a Multi-Stage Deduplication Pipeline

Scenario

Process a 100GB sample of a Common Crawl snapshot. The objective is to remove exact duplicate paragraphs and near-duplicate documents (e.g., mirrored pages, spam).

How to Execute

1. Perform exact deduplication on paragraphs using a hash map (e.g., SHA-256 of normalized text).
2. For near-duplicate detection, implement MinHash with Locality-Sensitive Hashing (LSH) on document shingles using the `datasketch` library.
3. Create a pipeline script that chains these steps and logs removal statistics.
4. Evaluate precision and recall on a small, manually annotated sample.
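The exact-deduplication step might look like the sketch below. It assumes paragraphs are separated by blank lines and that "normalized" means NFKC normalization, lowercasing, and whitespace collapsing; both are choices made for illustration, and the function names are hypothetical. At 100GB scale the `seen` set would need to be sharded or replaced with a disk-backed or probabilistic structure.

```python
import hashlib
import unicodedata


def normalize(paragraph):
    """Normalize Unicode form, case, and whitespace before hashing."""
    text = unicodedata.normalize("NFKC", paragraph)
    return " ".join(text.lower().split())


def dedup_paragraphs(documents):
    """Remove paragraphs already seen anywhere earlier in the corpus."""
    seen = set()
    stats = {"paragraphs_in": 0, "paragraphs_removed": 0}
    output = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            if not para.strip():
                continue
            stats["paragraphs_in"] += 1
            digest = hashlib.sha256(normalize(para).encode("utf-8")).digest()
            if digest in seen:
                stats["paragraphs_removed"] += 1
            else:
                seen.add(digest)
                kept.append(para.strip())
        output.append("\n\n".join(kept))
    return output, stats


if __name__ == "__main__":
    docs = [
        "Intro text.\n\nSubscribe to our newsletter!",
        "Another article.\n\nsubscribe to  our NEWSLETTER!",
    ]
    cleaned, stats = dedup_paragraphs(docs)
    print(cleaned)
    print(stats)
```

The first occurrence of each paragraph is kept, so the order in which documents are processed determines which copy survives.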

Advanced

Project

Design a Domain-Specific Data Curation & Mixing Strategy

Scenario

You are tasked with creating a pre-training dataset for a code-generation LLM. Raw sources include GitHub, StackOverflow, arXiv (CS section), and technical documentation sites.

How to Execute

1. Define per-source quality classifiers: e.g., a model to score GitHub repo quality (stars, license, activity) and a heuristic filter for StackOverflow (score, answer count).
2. Implement a tokenization-aware data sampler that mixes sources in a desired ratio (e.g., 70% code, 20% discussion, 10% docs).
3. Build a curriculum where simpler data (clean docs) is seen first, followed by more complex data (messy GitHub code).
4. Conduct small-scale training runs (about 1B parameters) to A/B test different mixing ratios and measure downstream benchmark performance.
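One way to make the sampler "tokenization-aware" is to measure the mix in tokens rather than documents. The sketch below greedily draws from whichever source is furthest below its target token share. It assumes each document is already a list of tokens and that all target ratios are positive; `mix_by_token_ratio` is a hypothetical helper, and curriculum ordering is not handled here.

```python
def mix_by_token_ratio(sources, target_ratios, token_budget):
    """Interleave documents so per-source token counts track target ratios.

    sources: dict mapping source name -> list of tokenized documents.
    target_ratios: dict mapping source name -> desired token share (> 0).
    """
    iters = {name: iter(docs) for name, docs in sources.items()}
    used = {name: 0 for name in sources}
    active = set(sources)
    mixed, total = [], 0
    while active and total < token_budget:
        # Pick the source with the smallest token usage relative to its target.
        name = min(sorted(active), key=lambda n: used[n] / target_ratios[n])
        doc = next(iters[name], None)
        if doc is None:
            # Source exhausted: stop drawing from it.
            active.discard(name)
            continue
        mixed.append((name, doc))
        used[name] += len(doc)
        total += len(doc)
    return mixed, used


if __name__ == "__main__":
    sources = {
        "code": [["tok"] * 10 for _ in range(100)],
        "discussion": [["tok"] * 10 for _ in range(100)],
        "docs": [["tok"] * 10 for _ in range(100)],
    }
    ratios = {"code": 0.7, "discussion": 0.2, "docs": 0.1}
    mixed, used = mix_by_token_ratio(sources, ratios, token_budget=1000)
    print(used)
```

If a source runs out before the budget is reached, the remaining sources absorb the shortfall, so the realized ratios should be logged and compared against the targets.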

Tools & Frameworks

Data Extraction & Cleaning

trafilatura, jusText, Apache Tika, langdetect / fasttext

Used in the initial pipeline stage. `trafilatura` is robust for web text extraction. `langdetect` and `fasttext` (for language ID) are essential for filtering by language at scale.

Deduplication

datasketch (MinHash/LSH), hashlib (SHA-256), simhash, Cosine Similarity (for fuzzy matching)

`datasketch` is a widely used Python library for scalable approximate deduplication with MinHash and LSH. Exact deduplication uses cryptographic hashes such as SHA-256. `simhash` is another common algorithm for near-duplicate detection.
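To show what MinHash with LSH does under the hood, here is a pure-Python sketch. It is not the `datasketch` API: the functions below are illustrative, and the salted `blake2b` hashes stand in for the random permutations of the original algorithm. Word shingle size, the number of hash functions, and the band count are assumed parameters that would need tuning.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Documents sharing any identical band become candidate duplicates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": " ".join(f"w{i}" for i in range(30)),
        "b": " ".join([f"w{i}" for i in range(29)] + ["changed"]),
        "c": " ".join(f"z{i}" for i in range(30)),
    }
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    print(lsh_candidate_pairs(sigs))
    print(estimate_jaccard(sigs["a"], sigs["b"]))
```

Banding trades precision for recall: more bands with fewer rows each catch lower-similarity pairs but produce more false candidates, which is why candidates are usually re-checked against the estimated or exact Jaccard similarity before removal.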

Tokenization

Hugging Face `tokenizers` library, sentencepiece, tiktoken (OpenAI)

The `tokenizers` library from Hugging Face is highly optimized for training custom tokenizers (BPE, WordPiece) and is a common default choice. `tiktoken` is OpenAI's BPE tokenizer and is mainly relevant when working with OpenAI model vocabularies.
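After tokenization, pre-training pipelines commonly concatenate documents with a separator token and cut the stream into fixed-length sequences. The sketch below illustrates that chunking step under simple assumptions: documents are already lists of integer token IDs, `eos_id` is a hypothetical end-of-document ID, and a trailing partial chunk is dropped rather than padded.

```python
def pack_sequences(tokenized_docs, seq_len, eos_id):
    """Yield fixed-length token chunks from a stream of tokenized documents."""
    buffer = []
    for ids in tokenized_docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover tokens shorter than seq_len are discarded in this sketch.


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for chunk in pack_sequences(docs, seq_len=4, eos_id=0):
        print(chunk)
```

Packing avoids wasting compute on padding, at the cost of chunks that can span document boundaries; whether attention should be masked across those boundaries is a separate design decision.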

Orchestration & Compute

Apache Spark / Dask, Ray Data, Polars

These tools cover processing at terabyte to petabyte scale. Spark and Dask provide distributed data processing. Ray Data targets data loading and preprocessing inside ML pipelines. Polars is a fast DataFrame library for in-memory operations on moderately large data.

Interview Questions

Answer Strategy

Structure your answer as a linear pipeline: Ingestion → Extraction → Language/Quality Filtering → Deduplication (exact then fuzzy) → Tokenization. Mention specific tools (e.g., 'I'd use trafilatura for extraction, fastText for language ID, MinHash/LSH for dedup'). Emphasize trade-offs: 'Balancing aggressive deduplication to reduce memorization vs. keeping enough data for robust learning is a key challenge.'

Answer Strategy

The question tests problem-solving and understanding of data leakage. Your strategy: 1. **Audit Deduplication:** Check whether the dedup pipeline was applied correctly. Run a sample of the suspected output through a reverse search over the training corpus to find its source. 2. **Analyze Data Composition:** Examine whether a single, high-frequency source (e.g., a specific news site) was overrepresented. 3. **Enhance Filtering:** Propose adding a more aggressive fuzzy deduplication stage (e.g., lowering the MinHash Jaccard threshold) or a memorization filter that flags documents with unusually low perplexity when scored by a small reference model, since heavily repeated text tends to be very easy for such a model to predict.
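The first two audit steps can be approximated with a simple n-gram overlap search: find training documents that share long word n-grams with the suspected output, then count hits per source. This is a minimal in-memory sketch; `find_overlap_sources` is a hypothetical helper, the n-gram length is an assumed parameter, and a real corpus would need an index (for example a suffix array or inverted index) instead of a linear scan.

```python
from collections import Counter


def ngrams(tokens, n):
    """Set of word n-grams; empty if the text is shorter than n words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlap_sources(model_output, corpus, n=8):
    """Count, per source, the documents sharing an n-gram with the output.

    corpus: iterable of (source_name, text) pairs.
    """
    target = ngrams(model_output.lower().split(), n)
    hits = Counter()
    for source, text in corpus:
        if target & ngrams(text.lower().split(), n):
            hits[source] += 1
    return hits


if __name__ == "__main__":
    sentence = "the committee voted to approve the new budget on tuesday"
    corpus = [
        ("news.example", f"Breaking: {sentence} after a long debate."),
        ("news.example", f"Update: {sentence} as expected."),
        ("news.example", f"Recap: {sentence} by a wide margin."),
        ("blog.example", "A completely unrelated post about gardening tips."),
    ]
    print(find_overlap_sources(f"It was reported that {sentence}", corpus, n=5))
```

A source that dominates the hit counts is a natural candidate for down-weighting or for a stricter near-duplicate threshold.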