AI Payment Fraud Detection Specialist
An AI Payment Fraud Detection Specialist designs, deploys, and continuously refines machine learning systems that identify and pre…
Skill Guide
This skill is the discipline of designing, analyzing, and rewriting SQL queries and data pipelines so that they execute efficiently on distributed computing systems (such as Spark, Presto, or proprietary MPP engines) that process petabyte-scale transactional data. The focus is on minimizing I/O, network shuffle, and compute cost.
Scenario
A daily sales aggregation query on a 500GB `transactions` table is taking 3 hours to run, causing delays for the business team.
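The scenario does not say why the query is slow, but one common cause is that a one-day aggregation scans the whole table. The sketch below is a minimal, in-memory illustration of date partitioning and partition pruning, not a real engine: the `ROWS` data, the function names, and the `(txn_date, amount)` layout are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for the `transactions` table: (txn_date, amount).
ROWS = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 1), 5.5),
    (date(2024, 1, 2), 7.0),
    (date(2024, 1, 3), 3.0),
    (date(2024, 1, 3), 4.0),
]


def daily_total_full_scan(rows, day):
    """Unpartitioned layout: every row is read to answer a one-day question."""
    scanned = 0
    total = 0.0
    for txn_date, amount in rows:
        scanned += 1
        if txn_date == day:
            total += amount
    return total, scanned


def partition_by_date(rows):
    """Group rows by date, mimicking a table partitioned on txn_date."""
    partitions = defaultdict(list)
    for txn_date, amount in rows:
        partitions[txn_date].append(amount)
    return partitions


def daily_total_pruned(partitions, day):
    """Partitioned layout: only the partition for the requested day is read."""
    amounts = partitions.get(day, [])
    return sum(amounts), len(amounts)


target = date(2024, 1, 3)
full_total, full_scanned = daily_total_full_scan(ROWS, target)
pruned_total, pruned_scanned = daily_total_pruned(partition_by_date(ROWS), target)
print("full scan:", full_total, "rows read:", full_scanned)
print("pruned:   ", pruned_total, "rows read:", pruned_scanned)
```

Both paths return the same total; the difference is how many rows are touched. On a real engine the same idea usually takes the form of a partition column plus a filter on that column in the query.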
Scenario
A query joining a massive `user_events` table (10TB) with a `user_profiles` table (10GB) runs for hours on a few nodes because a handful of 'bot' users have billions of events.
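A widely used remedy for this kind of join skew is key salting: the hot key is spread over several partitions by appending a salt, and the small table is replicated once per salt value so the join still matches. The sketch below simulates that in plain Python. The partition function, `SALT_BUCKETS`, and the sample data are assumptions chosen for illustration; a real Spark or Presto job would rely on the engine's own partitioner and would typically use a random salt rather than the round-robin salt used here for determinism.

```python
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4  # assumed fan-out; tuned per workload in practice

# Hypothetical data: user 1 is a "bot" with far more events than anyone else.
events = [(1, f"e{i}") for i in range(1000)]
events += [(u, "e") for u in (2, 3, 4, 5) for _ in range(10)]
profiles = {1: "bot", 2: "alice", 3: "bob", 4: "carol", 5: "dave"}


def partition_of(user_id, salt=0):
    """Toy deterministic partitioner standing in for the engine's hash."""
    return (user_id * 31 + salt) % NUM_PARTITIONS


def plain_load(event_rows):
    """Rows per partition when joining on user_id alone."""
    return Counter(partition_of(u) for u, _ in event_rows)


def salted_load(event_rows):
    """Rows per partition when joining on (user_id, salt)."""
    load = Counter()
    for i, (u, _) in enumerate(event_rows):
        load[partition_of(u, i % SALT_BUCKETS)] += 1
    return load


def salted_join(event_rows, profile_rows):
    """Join events to profiles on (user_id, salt); profiles are replicated per salt."""
    replicated = {
        (u, s): name for u, name in profile_rows.items() for s in range(SALT_BUCKETS)
    }
    return [
        (u, payload, replicated[(u, i % SALT_BUCKETS)])
        for i, (u, payload) in enumerate(event_rows)
    ]


before = plain_load(events)
after = salted_load(events)
joined = salted_join(events, profiles)
print("max partition load without salt:", max(before.values()))
print("max partition load with salt:   ", max(after.values()))
print("joined rows:", len(joined))
```

The trade-off is that the small side grows by a factor of `SALT_BUCKETS`, which is usually acceptable when it is orders of magnitude smaller than the skewed side.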
Scenario
The company's data platform costs have grown 400% in 18 months. Leadership needs visibility into which teams/queries are driving costs and a plan to control growth.
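Visibility of this kind usually starts with attributing cost to an owner. The following is a minimal sketch that rolls a query log up by team and ranks the most expensive queries. The record fields (`team`, `query_id`, `tb_scanned`) and the flat `PRICE_PER_TB_SCANNED` rate are hypothetical; real platforms expose different billing dimensions (slots, credits, node-hours), so the cost formula would need to be adapted.

```python
from collections import defaultdict

# Assumed flat price; not taken from any real platform's price list.
PRICE_PER_TB_SCANNED = 5.0

# Hypothetical query-log records.
query_log = [
    {"team": "growth", "query_id": "q1", "tb_scanned": 12.0},
    {"team": "finance", "query_id": "q2", "tb_scanned": 1.5},
    {"team": "growth", "query_id": "q3", "tb_scanned": 30.0},
    {"team": "risk", "query_id": "q4", "tb_scanned": 6.5},
]


def cost_by_team(log, price_per_tb):
    """Total estimated cost per team, most expensive first."""
    totals = defaultdict(float)
    for record in log:
        totals[record["team"]] += record["tb_scanned"] * price_per_tb
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def top_queries(log, price_per_tb, n):
    """The n most expensive individual queries as (query_id, team, cost)."""
    costed = [
        (r["query_id"], r["team"], r["tb_scanned"] * price_per_tb) for r in log
    ]
    return sorted(costed, key=lambda item: item[2], reverse=True)[:n]


team_costs = cost_by_team(query_log, PRICE_PER_TB_SCANNED)
worst = top_queries(query_log, PRICE_PER_TB_SCANNED, 2)
for team, cost in team_costs:
    print(f"{team}: ${cost:.2f}")
print("most expensive queries:", worst)
```

A ranked view like this gives leadership a starting point for budgets or per-team quotas; it does not by itself explain why a query is expensive.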
Distributed query engines such as Spark, BigQuery, and Snowflake are the primary execution environments. Deep knowledge of one (e.g., Spark's Catalyst optimizer, BigQuery's slot-based execution, Snowflake's virtual warehouses) is essential. Choose based on your cloud ecosystem and use case (interactive vs. batch).
Columnar file formats (Parquet, ORC) reduce I/O because a query reads only the columns it references. Table formats layered on top of them (Delta, Iceberg) add ACID transactions, time travel, and efficient metadata handling, which are critical for performance on petabyte-scale data.
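The I/O saving from a columnar layout can be shown with a toy comparison. This sketch counts how many values must be read to sum one column in a row-oriented layout versus a column-oriented one. It is a conceptual model only: it does not use Parquet or ORC, and the sample rows and function names are assumptions.

```python
# Hypothetical rows with four fields each.
rows = [
    {"txn_id": 1, "user_id": 10, "amount": 9.5, "country": "US"},
    {"txn_id": 2, "user_id": 11, "amount": 20.0, "country": "DE"},
    {"txn_id": 3, "user_id": 10, "amount": 3.25, "country": "US"},
    {"txn_id": 4, "user_id": 12, "amount": 7.25, "country": "FR"},
]


def to_columns(row_data):
    """Pivot a list of row dicts into one list per column."""
    return {name: [r[name] for r in row_data] for name in row_data[0]}


def sum_amount_row_layout(row_data):
    """Row layout: each whole record is read even though one field is needed."""
    values_read = sum(len(r) for r in row_data)
    return sum(r["amount"] for r in row_data), values_read


def sum_amount_column_layout(columns):
    """Column layout: only the `amount` column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_values_read = sum_amount_row_layout(rows)
col_total, col_values_read = sum_amount_column_layout(to_columns(rows))
print("row layout:   ", row_total, "values read:", row_values_read)
print("column layout:", col_total, "values read:", col_values_read)
```

Real columnar formats add per-column compression and min/max statistics on top of this, which typically widens the gap further.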
Query plans and execution profilers are the primary debugging tools. `EXPLAIN` shows the plan the engine intends to run. The Spark UI and cloud profilers show actual execution metrics (shuffle bytes, spill, task skew). Monitoring tools track long-term performance and cost trends.
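The habit of reading a plan before and after a change can be practiced on any engine. The sketch below uses SQLite from the Python standard library purely as a stand-in: its `EXPLAIN QUERY PLAN` output looks nothing like a Spark or Presto plan, and the table, index name, and data are assumptions. What carries over is the workflow of capturing the plan, making one change, and comparing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (txn_id INTEGER, txn_date TEXT, amount REAL)"
)
# Hypothetical data: 1000 rows spread over 28 days.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(i, f"2024-01-{(i % 28) + 1:02d}", float(i)) for i in range(1000)],
)

QUERY = "SELECT SUM(amount) FROM transactions WHERE txn_date = '2024-01-03'"


def plan_details(connection, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in plan_rows]


plan_before = plan_details(conn, QUERY)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_txn_date ON transactions (txn_date)")
plan_after = plan_details(conn, QUERY)  # expected to use the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, so the comparison should look for the operator type (scan versus index search) and not for an exact string.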