
Skill Guide

Feature engineering from claims data, lab values, and longitudinal patient records

The systematic transformation of raw administrative claims (CPT, ICD-10), episodic lab results, and longitudinal clinical records into predictive, computationally efficient variables for machine learning models in healthcare.

This skill can contribute to lower medical loss ratios (MLR) and more accurate risk adjustment factor (RAF) scores by surfacing subtle temporal patterns in patient behavior and physiology that raw data obscures. It is often what separates a generic data scientist from a high-impact healthcare ML engineer, enabling models that predict readmissions, disease progression, and cost in clinically meaningful terms.
1 Careers
1 Categories
8.9 Avg Demand
15% Avg AI Risk

How to Learn Feature engineering from claims data, lab values, and longitudinal patient records

1. **Master Healthcare Data Ontologies:** Achieve fluency in ICD-10-CM/PCS, CPT/HCPCS, LOINC (labs), and NDC (pharmacy) codes; understand their hierarchical structures.
2. **Understand Temporality vs. Staticity:** Learn to differentiate between time-bound events (a lab result on a specific date) and static attributes (a patient's date of birth).
3. **Practice Basic Aggregation:** Compute simple counts (e.g., number of ER visits in 90 days) and durations (length of stay) from claims line items.

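The basic aggregations in item 3 can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the field names (`setting`, `service_date`, `discharge_date`), the sample rows, and the convention that a same-day stay counts as one day are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical claim line items; field names are illustrative, not a real schema.
claims = [
    {"patient_id": "P1", "setting": "ER", "service_date": date(2023, 1, 10), "discharge_date": None},
    {"patient_id": "P1", "setting": "ER", "service_date": date(2023, 3, 1), "discharge_date": None},
    {"patient_id": "P1", "setting": "INPATIENT", "service_date": date(2023, 3, 2), "discharge_date": date(2023, 3, 6)},
    {"patient_id": "P2", "setting": "ER", "service_date": date(2022, 11, 20), "discharge_date": None},
]


def er_visits_in_window(claims, patient_id, as_of, days=90):
    """Count distinct ER service dates in the `days` before `as_of` (as_of excluded)."""
    start = as_of - timedelta(days=days)
    # A set de-duplicates multiple line items billed for the same visit date.
    visit_dates = {
        c["service_date"]
        for c in claims
        if c["patient_id"] == patient_id
        and c["setting"] == "ER"
        and start <= c["service_date"] < as_of
    }
    return len(visit_dates)


def length_of_stay(claim):
    """Days between admission and discharge; same-day stays count as 1 (assumed convention)."""
    return max((claim["discharge_date"] - claim["service_date"]).days, 1)


as_of = date(2023, 3, 31)
p1_er_90d = er_visits_in_window(claims, "P1", as_of)
p2_er_90d = er_visits_in_window(claims, "P2", as_of)
p1_los = length_of_stay(claims[2])
```

Counting distinct service dates rather than raw rows matters because one ER visit usually produces several claim lines.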
1. **Build Temporal Feature Pipelines:** Move beyond counts to windowed aggregates (e.g., `avg_hba1c_last_2_quarters`, `trend_in_systolic_bp_over_6_months`). Use SQL window functions (`ROW_NUMBER`, `LAG`) or pandas `rolling` methods.
2. **Incorporate Clinical Context:** Use NLP to extract features from unstructured notes (e.g., sentiment, specific symptoms mentioned) and link them to structured codes. Avoid the common mistake of treating all diagnosis codes as equal; weight them by recency and severity (e.g., via HCC categories).
3. **Handle Missingness Clinically:** Distinguish between a missing lab value (patient wasn't tested) and a normal value not recorded. Use domain rules (e.g., if a patient has diabetes, a missing HbA1c is informative).

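A windowed aggregate such as `avg_hba1c_last_2_quarters`, combined with an explicit missingness flag, might look like the sketch below. It uses only the standard library; the tuple layout, the sample values, and the choice of 182 days to stand in for "2 quarters" are assumptions for illustration. The same logic maps onto a SQL window or a pandas time-based `rolling` call.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical lab rows: (patient_id, result_date, hba1c_percent).
labs = [
    ("P1", date(2022, 6, 1), 9.0),
    ("P1", date(2023, 1, 15), 8.2),
    ("P1", date(2023, 4, 20), 7.6),
    ("P2", date(2022, 5, 5), 6.9),
]


def hba1c_features(labs, patient_id, as_of, window_days=182):
    """Trailing-window HbA1c features computed strictly before `as_of`."""
    start = as_of - timedelta(days=window_days)
    values = [v for pid, d, v in labs if pid == patient_id and start <= d < as_of]
    return {
        # None (not 0, not a population mean) when no result exists in the window.
        "avg_hba1c_last_2_quarters": mean(values) if values else None,
        "hba1c_n_results": len(values),
        # The absence of a test is kept as its own signal instead of being imputed away.
        "hba1c_missing_flag": int(not values),
    }


as_of = date(2023, 6, 30)
p1 = hba1c_features(labs, "P1", as_of)
p2 = hba1c_features(labs, "P2", as_of)
```

Keeping `hba1c_missing_flag` separate from the average lets a model learn that "not tested" carries information, which is the point of item 3.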
1. **Architect Multi-Modal Feature Stores:** Design systems that unify claims, labs, notes, and even social determinants (SDOH) into a single, versioned feature store with lineage tracking. Use frameworks like Feast or Tecton.
2. **Engineer for Real-Time Inference:** Create features that can be computed in near-real-time (e.g., 'rolling opioid MME from last 3 prescriptions') for deployment in point-of-care systems. Align feature engineering with model retraining cadences (e.g., monthly retrain with 36-month lookback features).
3. **Lead Validation & Governance:** Establish robust pipelines for feature drift detection, bias auditing (ensuring features don't encode racial disparity), and maintaining a living feature documentation wiki for the broader data science team.
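The near-real-time example in item 2, a rolling opioid MME over the last 3 prescriptions, could be sketched as follows. The fill schema and sample rows are hypothetical, the feature is interpreted here as the mean daily MME across the three most recent fills (one of several reasonable readings), and the conversion factors are illustrative values that should be checked against the current CDC conversion table before any real use.

```python
from datetime import date

# Illustrative conversion factors (assumption); verify against the current CDC table.
MME_FACTOR = {"morphine": 1.0, "oxycodone": 1.5}

# Hypothetical pharmacy fills.
fills = [
    {"member_id": "M1", "fill_date": date(2023, 12, 1), "drug": "oxycodone", "strength_mg": 5, "quantity": 30, "days_supply": 30},
    {"member_id": "M1", "fill_date": date(2024, 1, 5), "drug": "oxycodone", "strength_mg": 10, "quantity": 60, "days_supply": 30},
    {"member_id": "M1", "fill_date": date(2024, 2, 4), "drug": "oxycodone", "strength_mg": 10, "quantity": 90, "days_supply": 30},
    {"member_id": "M1", "fill_date": date(2024, 3, 5), "drug": "morphine", "strength_mg": 15, "quantity": 60, "days_supply": 30},
]


def daily_mme(fill):
    """Daily MME for one fill: strength x quantity x factor, spread over days supply."""
    total_mme = fill["strength_mg"] * fill["quantity"] * MME_FACTOR[fill["drug"]]
    return total_mme / fill["days_supply"]


def mme_last_n_fills(fills, member_id, as_of, n=3):
    """Mean daily MME across the n most recent fills on or before `as_of`."""
    recent = sorted(
        (f for f in fills if f["member_id"] == member_id and f["fill_date"] <= as_of),
        key=lambda f: f["fill_date"],
        reverse=True,
    )[:n]
    if not recent:
        return 0.0
    return sum(daily_mme(f) for f in recent) / len(recent)


m1_mme = mme_last_n_fills(fills, "M1", date(2024, 3, 31))
```

Because the feature depends only on a member's last few fills, it can be recomputed at scoring time from a small keyed lookup rather than a full history scan.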

Practice Projects

Beginner
Project

Build a Patient Utilization Profile from Claims Data

Scenario

You are given 1 year of synthetic claims data for 1000 patients. The task is to create a feature set that describes each patient's healthcare utilization intensity.

How to Execute
1. Load the data and join claims with diagnosis and procedure code lookups.
2. For each patient, calculate: total_paid_amount, count_of_inpatient_stays, count_of_er_visits, count_of_unique_drugs_prescribed, and count_of_unique_specialists_seen.
3. Normalize costs by patient demographics (age, gender) if possible.
4. Output a clean CSV with one row per patient and these features.
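Steps 2 and 4 can be sketched with the standard library alone. The claim-line layout, the `line` helper, and the sample rows below are hypothetical; the lookup joins in step 1 and the demographic normalization in step 3 are left out to keep the example short.

```python
import csv
import io
from collections import defaultdict


def line(pid, cid, ctype, paid, drug=None, spec=None):
    """Hypothetical helper that builds one claim line for the example."""
    return {"patient_id": pid, "claim_id": cid, "claim_type": ctype,
            "paid_amount": paid, "drug_code": drug, "specialist_id": spec}


claims = [
    line("P1", "C1", "inpatient", 12000.0),
    line("P1", "C1", "inpatient", 500.0),   # second line of the same stay
    line("P1", "C2", "er", 800.0),
    line("P1", "C3", "pharmacy", 40.0, drug="D1"),
    line("P1", "C4", "pharmacy", 40.0, drug="D1"),  # refill of the same drug
    line("P1", "C5", "professional", 150.0, spec="S1"),
    line("P2", "C6", "er", 600.0),
    line("P2", "C7", "pharmacy", 25.0, drug="D2"),
    line("P2", "C8", "professional", 200.0, spec="S2"),
    line("P2", "C9", "professional", 180.0, spec="S3"),
]

FIELDS = ["patient_id", "total_paid_amount", "count_of_inpatient_stays",
          "count_of_er_visits", "count_of_unique_drugs_prescribed",
          "count_of_unique_specialists_seen"]


def build_profiles(claims):
    """Aggregate claim lines into one feature row per patient."""
    acc = defaultdict(lambda: {"paid": 0.0, "inpatient": set(), "er": set(),
                               "drugs": set(), "specialists": set()})
    for c in claims:
        p = acc[c["patient_id"]]
        p["paid"] += c["paid_amount"]
        if c["claim_type"] in ("inpatient", "er"):
            p[c["claim_type"]].add(c["claim_id"])  # distinct claims, not lines
        if c["drug_code"]:
            p["drugs"].add(c["drug_code"])
        if c["specialist_id"]:
            p["specialists"].add(c["specialist_id"])
    return [
        dict(zip(FIELDS, [pid, round(acc[pid]["paid"], 2), len(acc[pid]["inpatient"]),
                          len(acc[pid]["er"]), len(acc[pid]["drugs"]),
                          len(acc[pid]["specialists"])]))
        for pid in sorted(acc)
    ]


def to_csv(rows):
    """Serialize feature rows to CSV text (write this string to a file in practice)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv(build_profiles(claims))
```

Counting distinct `claim_id` values rather than lines avoids inflating stay and visit counts when a single encounter is billed across several lines.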
Intermediate
Project

Develop a Chronic Disease Progression Feature Set

Scenario

Using longitudinal data (claims + labs) for patients with Type 2 Diabetes, engineer features to predict the onset of chronic kidney disease (CKD) within the next 12 months.

How to Execute
1. Define the target: eGFR < 60 mL/min/1.73 m² on two occasions at least 90 days apart.
2. For each patient at each quarter, engineer features from the prior 2 years:
   a) Claims: count of 'Type 2 diabetes with kidney complications' codes (ICD-10-CM E11.2x), number of nephrology visits, use of ACEi/ARBs.
   b) Labs: last eGFR value, slope of eGFR over last 3 measurements, maximum HbA1c in last year, presence of albuminuria (urine albumin-creatinine ratio > 30 mg/g).
3. Handle lab data sparsity by using forward-filling with a clinical timeout (e.g., if no eGFR in 6 months, flag as 'missing').
4. Split data temporally (train on earlier index dates, test on later ones) to avoid leakage.
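The target definition, the eGFR slope, and the forward-fill with a clinical timeout can be sketched as three small functions. This is a literal reading of the steps above under stated assumptions: the `(date, value)` tuple layout and sample series are hypothetical, 183 days stands in for "6 months", and the target check does not test for intervening normal values, which a stricter clinical definition might require.

```python
from datetime import date

# Hypothetical eGFR history for one patient: (result_date, egfr).
series = [
    (date(2022, 1, 10), 72.0),
    (date(2022, 7, 12), 58.0),
    (date(2023, 1, 9), 55.0),
]


def ckd_onset(outcome_series, threshold=60.0, min_gap_days=90):
    """True if two results below `threshold` are at least `min_gap_days` apart.

    Pass only results from the outcome window (after the index date).
    """
    low_dates = sorted(d for d, v in outcome_series if v < threshold)
    return bool(low_dates) and (low_dates[-1] - low_dates[0]).days >= min_gap_days


def egfr_slope_last_n(feature_series, as_of, n=3):
    """Least-squares slope (eGFR units per year) over the last n results before `as_of`."""
    pts = sorted((d, v) for d, v in feature_series if d < as_of)[-n:]
    if len(pts) < 2:
        return None
    xs = [(d - pts[0][0]).days / 365.25 for d, _ in pts]
    ys = [v for _, v in pts]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # all results on the same day: slope undefined
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx


def last_egfr_with_timeout(feature_series, as_of, timeout_days=183):
    """Forward-fill the latest eGFR unless it is older than the timeout.

    Returns (value_or_None, missing_flag).
    """
    prior = [(d, v) for d, v in feature_series if d < as_of]
    if not prior:
        return None, 1
    last_date, last_value = max(prior)
    if (as_of - last_date).days > timeout_days:
        return None, 1
    return last_value, 0


slope = egfr_slope_last_n(series, date(2023, 2, 1))
fresh = last_egfr_with_timeout(series, date(2023, 2, 1))
stale = last_egfr_with_timeout(series, date(2023, 9, 1))
```

Features use only results dated before the index date and the target only results after it; mixing the two windows is the leakage that step 4 guards against.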
Advanced
Project

Design a Real-Time Risk Adjustment Feature Pipeline

Scenario

You are the lead ML engineer for a health plan. Your goal is to build a production-grade feature pipeline that computes a member's risk score nightly, incorporating the day's new claims, lab results, and pharmacy fills, to feed into a CMS-HCC RAF model and a predictive readmission model.

How to Execute
1. Architect a DAG (e.g., in Airflow) with daily scheduled tasks: ingest new claims, normalize codes, and update a member-event timeline table.
2. Engineer 'rolling window' features: e.g., `opiate_mme_rolling_90_days`, `pending_lab_results_flag`. For RAF, implement the official CMS-HCC risk model logic as a set of features (e.g., `hcc_categories_present`, `payment_hcc_count`).
3. Implement a feature validation suite that checks schema, value ranges, and unexpected null rates after each run. Use a feature store to serve both batch (for model retraining) and online (for real-time scoring) features with consistent definitions.
4. Monitor for feature drift by comparing feature distributions weekly against a baseline cohort. Comparing input distributions detects data drift; detecting concept drift additionally requires tracking model performance against observed outcomes.
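One common way to compare a feature's weekly distribution against a baseline cohort is the Population Stability Index (PSI). The sketch below is a swapped-in technique, not something the steps above prescribe: the equal-width binning on the baseline range, the `eps` floor for empty bins, and the sample data are all assumptions, and the often-quoted 0.1 and 0.25 alert thresholds are rules of thumb rather than a standard.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `baseline`.

    Bins are equal-width over the baseline range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical feature values: a baseline cohort and a shifted current week.
baseline_values = [float(v) for v in range(100)]
shifted_values = [v + 30.0 for v in baseline_values]

psi_same = psi(baseline_values, baseline_values)
psi_shifted = psi(baseline_values, shifted_values)
```

In a nightly pipeline, a check like this would run per feature after the validation suite, with the result logged alongside the null-rate and range checks.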

Tools & Frameworks

Data Processing & Querying

SQL (Advanced Window Functions)
Apache Spark (PySpark)
Pandas / Polars

SQL is non-negotiable for joining claims, labs, and membership tables on a data warehouse. Spark is essential for distributed processing of multi-year longitudinal data at scale. Pandas/Polars are used for rapid prototyping and complex time-series manipulations in memory.

Healthcare Data Standards & Libraries

OHDSI OMOP Common Data Model (CDM)
Athena Vocabulary Database
MedCAT / SciSpacy for NLP

OMOP CDM provides a standardized schema for observational health data, including EHR and claims. Athena provides the standardized vocabularies and the mappings between code systems (e.g., ICD-10-CM to SNOMED CT). NLP libraries extract clinical concepts from unstructured notes so they can be used as features.

Feature Store & ML Ops Platforms

Feast
Tecton
Amazon SageMaker Feature Store
MLflow

These tools manage the lifecycle of features: versioning, serving, and monitoring. They ensure consistency between training and inference and enable collaboration across teams. MLflow tracks experiments with different feature sets.

Domain-Specific Frameworks

CMS-HCC Risk Adjustment Model Documentation
HEDIS Measurement Specifications
Clinical Terminology Mappings (ICD-10 to HCC, CPT to RVU)

The official technical specifications from CMS and NCQA are the primary references. They define how to translate raw codes into features for specific regulatory and quality reporting models. Understanding these is critical for building valid, interpretable features.

Careers That Require Feature engineering from claims data, lab values, and longitudinal patient records

1 career found