Skill Guide

Data engineering for manufacturing (time-series databases, OPC-UA, MQTT, historian systems)

The design, construction, and maintenance of data pipelines and storage systems specifically for collecting, cleaning, and serving high-volume, high-velocity sensor and control data from industrial manufacturing equipment.

It enables real-time operational visibility, predictive maintenance, and process optimization by transforming raw machine data into reliable, queryable information. This directly reduces downtime, improves yield, and provides the foundational data layer for advanced analytics and AI in smart manufacturing initiatives.

How to Learn Data engineering for manufacturing (time-series databases, OPC-UA, MQTT, historian systems)

1. Foundational Protocols: Master OPC-UA (architecture, security, information modeling) and MQTT (publish-subscribe, QoS levels, broker configuration). Understand their core roles in data acquisition.
2. Time-Series Fundamentals: Learn the core concepts of time-series data (timestamping, tags/points, downsampling, compression) and differentiate between a historian and a general-purpose time-series database (TSDB).
3. Basic Data Modeling: Practice designing schemas for raw machine data (e.g., a tag for 'Motor_Vibration_X_Axis') and understanding basic data quality checks (e.g., handling nulls, duplicates).
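To make the time-series fundamentals concrete, here is a minimal sketch of two of the ideas above: basic quality checks (dropping nulls and duplicate timestamps) and downsampling into fixed time buckets. The function names, the `(timestamp, value)` tuple layout, and the "first value wins" rule for duplicates are assumptions chosen for illustration, not a prescribed design.

```python
from collections import defaultdict
from statistics import mean


def clean(points):
    """Drop null values and duplicate timestamps (the first value seen wins)."""
    seen = set()
    cleaned = []
    for ts, value in points:
        if value is None or ts in seen:
            continue
        seen.add(ts)
        cleaned.append((ts, value))
    return cleaned


def downsample(points, bucket_seconds):
    """Average values into fixed-width buckets, keyed by bucket start time."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Timestamps are seconds; one duplicate (ts=10) and one null (ts=20).
raw = [(0, 1.0), (10, 3.0), (10, 99.0), (20, None), (60, 5.0), (90, 7.0)]
print(downsample(clean(raw), 60))  # [(0, 2.0), (60, 6.0)]
```

In a real pipeline the TSDB or historian usually performs downsampling itself; writing it out by hand once helps when reading those systems' retention and aggregation settings.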

Move to practice by building a lab environment: use a PLC simulator (like SoftPLC) or a Raspberry Pi with sensors to generate data. Configure an MQTT broker (Mosquitto) and an OPC-UA server (Prosys Simulation Server). Pipe this data into a TSDB (InfluxDB) and, if you have access to a license or trial, a historian such as AVEVA Historian. A common mistake is ignoring data contextualization at the edge, which leads to 'point soup' in the database: large numbers of tags with no asset, location, or unit attached. Add metadata (asset hierarchy, engineering units) during ingestion.

Architect scalable, fault-tolerant pipelines for multi-plant deployments. Master the integration of OT (Operational Technology) data with IT (Information Technology) systems using tools like Apache Kafka for stream processing. Focus on data governance, long-term retention strategies (tiered storage: hot/warm/cold), and designing APIs for downstream consumers (data scientists, BI tools). Mentor teams on OT/IT convergence principles and security standards (IEC 62443).
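The tiered retention idea can be sketched as a simple age-based routing rule. The 7-day and 90-day boundaries below are hypothetical placeholders; actual windows depend on query patterns, storage cost, and compliance requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention boundaries, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


def storage_tier(sample_time, now):
    """Return the tier a sample belongs to, based on its age at `now`."""
    age = now - sample_time
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
for days in (1, 30, 365):
    print(days, storage_tier(now - timedelta(days=days), now))
```

In practice this rule typically lives in a scheduled job or in the database's own retention policy rather than in application code.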

Practice Projects

Beginner

Project

Factory Floor Data Acquisition Simulator

Scenario

Build a simulated environment that mimics a small production line with three machines (e.g., CNC, Robot Arm, Conveyor) each emitting 5 sensor readings (temperature, vibration, current, position, status).

How to Execute

1. Set up a Mosquitto MQTT broker on a local machine or VM.
2. Use Python scripts with the `paho-mqtt` library to simulate each machine, publishing JSON-formatted sensor data to unique topics (e.g., `factory/cnc1/temperature`).
3. Install InfluxDB and create a database (called a bucket in InfluxDB 2.x), then decide which attributes are stored as tags (machine, sensor) and which as fields (the measured values).
4. Write a subscriber script (or use Telegraf with an MQTT input plugin) to listen to the topics, parse the data, and write it into InfluxDB.
5. Create a basic Grafana dashboard to visualize real-time temperature trends for each machine.
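The data shapes in steps 2 to 4 can be sketched without a running broker. The example below builds a topic and JSON payload for one reading, then converts it to an InfluxDB line protocol string. The measurement name `machine_data`, the payload keys `value` and `ts`, and the helper names are assumptions for illustration; publishing with `paho-mqtt` and writing to InfluxDB are left out so the snippet stays self-contained.

```python
import json


def make_reading(machine, sensor, value, ts_ns):
    """Build the MQTT topic and JSON payload for one simulated sensor reading."""
    topic = f"factory/{machine}/{sensor}"
    payload = json.dumps({"value": value, "ts": ts_ns})
    return topic, payload


def to_line_protocol(topic, payload):
    """Convert a topic/payload pair into one InfluxDB line protocol record.

    Layout: measurement,tag_set field_set timestamp (nanoseconds).
    Assumes machine and sensor names contain no spaces or commas.
    """
    _, machine, sensor = topic.split("/")
    data = json.loads(payload)
    return (
        f"machine_data,machine={machine},sensor={sensor} "
        f"value={float(data['value'])} {data['ts']}"
    )


topic, payload = make_reading("cnc1", "temperature", 72.5, 1700000000000000000)
print(topic)
print(to_line_protocol(topic, payload))
```

Storing machine and sensor as tags keeps them indexed for filtering in Grafana, while the reading itself stays a field.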

Intermediate

Project

OPC-UA to Historian Integration Pipeline

Scenario

Integrate data from a simulated OPC-UA server (representing a legacy machine) with your existing MQTT-based pipeline, storing unified data in a historian system with proper asset context.

How to Execute

1. Install and configure the Prosys OPC UA Simulation Server, creating a node structure representing a machine (e.g., `/Objects/Machines/CNC_2`).
2. Use a Python OPC-UA client (e.g., the `asyncua` library, the maintained successor to the older `opcua` package) to subscribe to data changes from the simulation server.
3. Design a data mapping table that translates OPC-UA node IDs (e.g., `ns=2;s=Temperature`) to your standardized tag naming convention.
4. Write a service that normalizes the OPC-UA data, enriches it with asset metadata from a simple CSV or database, and publishes it to an MQTT topic.
5. Use AVEVA Historian or TimescaleDB to store the combined data, ensuring the asset hierarchy is reflected in the database schema or historian's tag configuration.
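Steps 3 and 4 can be sketched as a lookup table plus a normalization function. Everything named here (the mapping entries, the asset metadata, the topic layout, the `normalize` helper) is hypothetical; in a real service the inputs would come from an OPC-UA subscription callback and the output would be handed to an MQTT client.

```python
import json

# Hypothetical mapping table: OPC-UA node ID -> standardized tag description.
NODE_MAP = {
    "ns=2;s=Temperature": {"asset": "cnc2", "measurement": "temperature", "unit": "degC"},
    "ns=2;s=Vibration": {"asset": "cnc2", "measurement": "vibration", "unit": "mm/s"},
}

# Hypothetical asset metadata, standing in for a CSV file or database table.
ASSETS = {"cnc2": {"site": "plant1", "line": "line_a"}}


def normalize(node_id, value, ts_ms):
    """Map a raw OPC-UA data change to an enriched MQTT topic and payload.

    Returns None for unmapped node IDs so the caller can log or count them.
    """
    tag = NODE_MAP.get(node_id)
    if tag is None:
        return None
    asset = ASSETS[tag["asset"]]
    topic = f"factory/{tag['asset']}/{tag['measurement']}"
    payload = {
        "value": value,
        "ts": ts_ms,
        "unit": tag["unit"],
        "site": asset["site"],
        "line": asset["line"],
    }
    return topic, json.dumps(payload, sort_keys=True)


print(normalize("ns=2;s=Temperature", 61.2, 1700000000000))
print(normalize("ns=2;s=Unknown", 0, 1700000000000))
```

Keeping the mapping in data rather than code means new machines can be onboarded by editing the table, not the service.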

Advanced

Project

Multi-Site Manufacturing Data Lakehouse Design

Scenario

Design a scalable data architecture to ingest, process, and serve time-series data from 10 geographically dispersed factories, supporting both real-time monitoring (latency < 5s) and complex historical analytics for predictive quality models.

How to Execute

1. Architect an edge-to-cloud pipeline: edge gateways (running lightweight MQTT brokers and stream processors like Apache Flink) handle local buffering, data reduction, and contextualization.
2. Design the cloud ingestion layer using Apache Kafka (or Confluent Cloud) as a central message bus with topics partitioned by site and asset.
3. Implement a multi-tier storage strategy: a hot tier (InfluxDB Cloud or TimescaleDB) for real-time dashboards, a warm tier (Delta Lake/Parquet on cloud storage) for recent analytics, and a cold tier (compressed Parquet or a compressed TimescaleDB hypertable) for historical training data.
4. Define and enforce data contracts and schemas (using Apache Avro or Protobuf) for all sensor data streams.
5. Build and document self-service data products (curated datasets, APIs) for data scientists, ensuring data lineage from historian tag to feature in ML model is traceable.
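As a rough illustration of steps 2 and 4, the sketch below enforces a small data contract and derives a message key from site and asset. It uses a plain Python dataclass as a stand-in for an Avro or Protobuf schema, and the field names and key format are assumptions rather than part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReading:
    site: str
    asset: str
    measurement: str
    value: float
    ts_ms: int


# Hypothetical contract: required field names and accepted Python types.
CONTRACT = {
    "site": str,
    "asset": str,
    "measurement": str,
    "value": (int, float),
    "ts_ms": int,
}


def parse_reading(record):
    """Validate a raw dict against the contract and return a typed reading."""
    for name, expected in CONTRACT.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(record[name], bool) or not isinstance(record[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    return SensorReading(
        record["site"], record["asset"], record["measurement"],
        float(record["value"]), record["ts_ms"],
    )


def partition_key(reading):
    """Message key that keeps all readings for one asset in one partition."""
    return f"{reading.site}.{reading.asset}"


reading = parse_reading(
    {"site": "plant1", "asset": "cnc5", "measurement": "temperature",
     "value": 71, "ts_ms": 1700000000000}
)
print(partition_key(reading), reading.value)
```

Keying by site and asset preserves per-asset ordering, which matters for downstream windowed calculations.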

Tools & Frameworks

Data Acquisition & Messaging

OPC-UA (Open Platform Communications Unified Architecture), MQTT (Message Queuing Telemetry Transport), Apache Kafka / Confluent Platform

OPC-UA for secure, structured data exchange with PLCs, SCADA, and legacy equipment. MQTT for lightweight, pub/sub telemetry from IoT sensors and edge devices. Kafka for high-throughput, fault-tolerant streaming between pipeline stages and sites.

Time-Series Storage & Historians

InfluxDB, TimescaleDB, AVEVA Historian (formerly Wonderware), OSIsoft PI System

InfluxDB and TimescaleDB are modern, scalable TSDBs for custom applications. AVEVA Historian and OSIsoft PI (now branded AVEVA PI System) are purpose-built, high-performance historian systems dominant in process and discrete manufacturing, offering deep integration with OT ecosystems and specialized data compression.
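One simple form of the compression historians apply is a deadband filter, which stores a sample only when it differs enough from the last stored value. The sketch below is a simplified illustration of that idea, not the algorithm used by any specific product; the function name and the sample data are invented for the example.

```python
def deadband_filter(points, deadband):
    """Keep a point only if it moved more than `deadband` from the last kept value."""
    kept = []
    for ts, value in points:
        if not kept or abs(value - kept[-1][1]) > deadband:
            kept.append((ts, value))
    return kept


samples = [(0, 20.0), (1, 20.1), (2, 20.2), (3, 21.0), (4, 21.05), (5, 19.5)]
print(deadband_filter(samples, 0.5))  # [(0, 20.0), (3, 21.0), (5, 19.5)]
```

Half of the samples are dropped here without losing any change larger than the deadband, which is why slowly varying signals compress so well.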

Processing & Integration

Apache Flink, Telegraf, Apache NiFi, Node-RED

Flink for stateful stream processing and complex event processing. Telegraf as a plugin-driven agent for collecting, processing, and writing metrics. NiFi for data flow orchestration and routing between systems. Node-RED for rapid prototyping of data flows with a visual interface.

Visualization & Analysis

Grafana, Power BI / Tableau (with time-series connectors), Jupyter Notebooks

Grafana for real-time operational dashboards and alerting. BI tools (with appropriate connectors) for business reporting on aggregated manufacturing KPIs. Jupyter for ad-hoc exploratory analysis and feature engineering for ML.

Interview Questions

Answer Strategy

Use a layered architecture: Edge (protocol converters, basic filtering), Ingestion (unified message bus like Kafka with schema registry), Processing (streaming job for normalization, enrichment, and validation), and Storage (TSDB with retention policies). Emphasize handling data contextualization at the edge to reduce cloud costs and latency.

Sample Answer: 'I'd deploy edge gateways per machine type: a Modbus-to-MQTT bridge for legacy gear, OPC-UA clients for newer machines, and direct MQTT for sensors. All data gets published to a central Kafka cluster with a defined schema. A Flink job consumes from Kafka, applies validation rules such as range checks, routes records that fail to a dead-letter queue, enriches valid data with asset hierarchy from a master data service, and writes to InfluxDB. Grafana connects directly to InfluxDB for sub-second dashboard latency.'
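The validation step in the sample answer can be sketched in plain Python. The valid ranges and the record layout below are hypothetical, and lists stand in for the real output sinks (a TSDB writer and a dead-letter topic).

```python
# Hypothetical valid ranges per measurement, for illustration only.
VALID_RANGES = {
    "temperature": (-40.0, 150.0),
    "current": (0.0, 100.0),
}


def route(readings):
    """Split readings into valid records and dead-letter records with a reason."""
    valid, dead_letter = [], []
    for reading in readings:
        limits = VALID_RANGES.get(reading["measurement"])
        if limits is None:
            dead_letter.append({**reading, "error": "unknown measurement"})
        elif not limits[0] <= reading["value"] <= limits[1]:
            dead_letter.append({**reading, "error": "out of range"})
        else:
            valid.append(reading)
    return valid, dead_letter


batch = [
    {"asset": "cnc1", "measurement": "temperature", "value": 72.5},
    {"asset": "cnc1", "measurement": "temperature", "value": 900.0},
    {"asset": "cnc1", "measurement": "pressure", "value": 3.1},
]
valid, dead_letter = route(batch)
print(len(valid), [d["error"] for d in dead_letter])
```

Recording the rejection reason alongside the original record makes the dead-letter queue useful for diagnosing faulty sensors rather than just hiding bad data.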

Answer Strategy

This question tests cross-functional leadership and the ability to translate between domain experts. Use the STAR method. Focus on bridging the gap between OT's need for context (machine, location, unit) and IT's need for simplicity (flat tables, clean types).

Sample Answer: 'OT insisted on a complex hierarchical tag name (e.g., `Plant1/CNC5/Motor/Temperature`), while data scientists wanted a simple `temperature` column. I facilitated a workshop where we mapped the OT hierarchy to a set of mandatory metadata columns (asset_id, location, unit) in the data schema. This gave OT their context and data scientists a clean, flat table. We implemented it as an enrichment step in the ingestion pipeline, satisfying both groups.'
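The enrichment step described in the sample answer might look like the sketch below, which splits a hierarchical tag path into flat metadata columns. The four-level path layout and the extra `component` and `measurement` columns are assumptions based on the example tag; a real plant hierarchy may have more or fewer levels.

```python
def flatten_tag(tag_path, value, unit):
    """Turn a hierarchical OT tag path into one flat row with metadata columns.

    Assumes exactly four levels: plant/asset/component/measurement.
    """
    location, asset_id, component, measurement = tag_path.split("/")
    return {
        "location": location,
        "asset_id": asset_id,
        "component": component,
        "measurement": measurement,
        "value": value,
        "unit": unit,
    }


row = flatten_tag("Plant1/CNC5/Motor/Temperature", 71.3, "degC")
print(row)
```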