AI Genomics Data Analyst
An AI Genomics Data Analyst leverages machine learning, large language models, and bioinformatics pipelines to extract clinically …
Skill Guide
NGS data processing is the computational workflow for cleaning, aligning, and analyzing high-throughput sequencing data to identify genetic variants, using standardized file formats for raw reads (FASTQ), alignments (BAM/CRAM), and variants (VCF).
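To make the raw-read format concrete, here is a minimal sketch of reading FASTQ records in plain Python. It assumes the common four-line record layout and Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are illustrative and do not come from any particular library.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (assumes Phred+33)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Each quality character encodes a Phred score as ASCII code minus the offset.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(io.StringIO(example)))
print(records[0].name, records[0].qualities)
```

In practice an established parser (for example one from Biopython or pysam) would likely be preferable; the sketch only shows what the format contains.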
Scenario
You have paired-end FASTQ files from a human whole-genome sequencing run. Your task is to generate a filtered VCF file of SNPs and indels.
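One possible shape for this workflow is sketched below. The code only builds and prints the command lines; it does not run them. It assumes `bwa-mem2`, `samtools`, and `gatk` are installed and that the reference is already indexed. The sample name, file names, and the single `QD < 2.0` hard filter are illustrative placeholders, and steps such as base quality score recalibration are omitted for brevity.

```python
import shlex


def germline_commands(sample, ref, fq1, fq2, threads=8):
    """Return pipeline steps; each step is a list of argv lists joined by pipes."""
    # bwa expects literal "\t" sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    raw_vcf = f"{sample}.raw.vcf.gz"
    filtered_vcf = f"{sample}.filtered.vcf.gz"
    return [
        # 1. Align paired reads and coordinate-sort the output.
        [["bwa-mem2", "mem", "-t", str(threads), "-R", read_group, ref, fq1, fq2],
         ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, "-"]],
        # 2. Mark duplicate reads.
        [["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
          "-M", f"{sample}.dup_metrics.txt"]],
        # 3. Index the deduplicated BAM.
        [["samtools", "index", dedup_bam]],
        # 4. Call SNPs and indels.
        [["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup_bam, "-O", raw_vcf]],
        # 5. Apply an illustrative hard filter (threshold is an assumption).
        [["gatk", "VariantFiltration", "-R", ref, "-V", raw_vcf, "-O", filtered_vcf,
          "--filter-name", "QD2", "--filter-expression", "QD < 2.0"]],
    ]


def render(step):
    """Format one step as a shell-style command line."""
    return " | ".join(shlex.join(argv) for argv in step)


steps = germline_commands("sampleA", "ref.fa",
                          "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz")
for step in steps:
    print(render(step))
```

Note that `VariantFiltration` annotates failing records in the FILTER column rather than removing them, so a downstream selection step may still be needed depending on what "filtered" should mean for the project.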
Scenario
Scale the beginner project to handle multiple exome samples reproducibly across different computing environments.
Scenario
Design a pipeline for tumor-normal paired samples in a CAP/CLIA lab environment. The pipeline must integrate variant calling, annotation, and automated clinical report generation with strict QC thresholds.
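The "strict QC thresholds" requirement could be enforced with a small gate that blocks report generation when any metric is out of range. The sketch below is one way to express that; the metric names and every threshold value are hypothetical and would need to be replaced with the laboratory's own validated limits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    metric: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None


# Hypothetical limits for illustration only; not validated clinical values.
THRESHOLDS = [
    Threshold("tumor_mean_coverage", minimum=80.0),
    Threshold("normal_mean_coverage", minimum=30.0),
    Threshold("duplication_rate", maximum=0.30),
    Threshold("contamination_fraction", maximum=0.02),
]


def qc_failures(metrics: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one message per failed check; an empty list means the sample passes."""
    failures = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            # A missing metric is treated as a failure, never as a silent pass.
            failures.append(f"{threshold.metric}: missing")
        elif threshold.minimum is not None and value < threshold.minimum:
            failures.append(f"{threshold.metric}: {value} below minimum {threshold.minimum}")
        elif threshold.maximum is not None and value > threshold.maximum:
            failures.append(f"{threshold.metric}: {value} above maximum {threshold.maximum}")
    return failures


passing = {"tumor_mean_coverage": 120.0, "normal_mean_coverage": 45.0,
           "duplication_rate": 0.12, "contamination_fraction": 0.005}
failing = {"tumor_mean_coverage": 55.0, "normal_mean_coverage": 35.0,
           "duplication_rate": 0.12}  # contamination metric not reported

print(qc_failures(passing, THRESHOLDS))
print(qc_failures(failing, THRESHOLDS))
```

Returning the full list of failures, rather than stopping at the first, keeps an auditable record of why a sample was held back.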
BWA-MEM2 (alignment), GATK (variant calling and refinement), samtools (file manipulation), and FastQC (quality control) are the workhorses of NGS processing and are widely used for both germline and somatic analysis.
Workflow managers are essential for creating reproducible, scalable, and cloud-compatible pipelines. nf-core provides curated community workflows for common NGS analyses.
Cloud platforms provide elastic compute scaling, and containers ensure reproducibility. Terra provides an integrated environment for running GATK workflows at scale.
Answer Strategy
Tests diagnostic thinking and pipeline optimization knowledge. Sample Answer: 'First, I'd run Picard MarkDuplicates to quantify duplicates and check its estimate of library complexity. For the mapping quality issue, I'd check for sample contamination or an incorrect reference genome. To prevent recurrence, I would implement tighter library prep QC, use UMIs for more accurate duplicate marking, and review alignment parameters (e.g., -M in BWA-MEM, which marks shorter split hits as secondary so that downstream tools such as Picard handle them correctly).'
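The two symptoms discussed in this answer can be measured directly from alignment records. The sketch below tallies duplicate-flagged and low-MAPQ reads from SAM text lines, using the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary). The function name, the MAPQ cutoff of 20, and the example reads are assumptions made for illustration.

```python
def summarize_alignments(sam_lines, mapq_cutoff=20):
    """Count duplicate-flagged and low-MAPQ reads among primary mapped alignments."""
    total = duplicates = low_mapq = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        # Ignore unmapped (0x4), secondary (0x100), and supplementary (0x800) records.
        if flag & (0x4 | 0x100 | 0x800):
            continue
        total += 1
        if flag & 0x400:  # duplicate bit, set by tools such as MarkDuplicates
            duplicates += 1
        if mapq < mapq_cutoff:
            low_mapq += 1
    return {
        "primary_mapped": total,
        "duplicates": duplicates,
        "low_mapq": low_mapq,
        "duplicate_rate": duplicates / total if total else 0.0,
        "low_mapq_rate": low_mapq / total if total else 0.0,
    }


example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",  # 1123 = 99 + 1024 (duplicate)
    "r3\t99\tchr1\t300\t5\t4M\t=\t400\t104\tACGT\tIIII",     # MAPQ 5, below cutoff
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                # unmapped, ignored
]
summary = summarize_alignments(example)
print(summary)
```

For real BAM or CRAM files the same tallies are available from existing tools such as `samtools flagstat`; the sketch is only meant to show which fields the diagnosis relies on.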
Answer Strategy
Tests depth of knowledge on data management and cost optimization. Sample Answer: 'CRAM files are typically 30-60% smaller than the equivalent BAM because reads are stored as differences against a reference genome, which substantially reduces storage and egress costs. I'd advocate for CRAM in archival settings and large biobanks. The trade-off is a dependency on the exact reference file for decompression and slightly higher CPU usage. For active analysis pipelines where fast random access matters and tool support varies, BAM remains simpler because it needs no external reference.'
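The cost argument in this answer comes down to simple arithmetic, sketched below. Every number in the example (archive size, CRAM-to-BAM size ratio, storage price, retention period) is a hypothetical placeholder; real ratios depend on the data and on the compression settings used.

```python
def archive_savings(bam_tb, cram_to_bam_ratio, price_per_tb_month, months):
    """Estimate storage saved by archiving as CRAM instead of BAM."""
    cram_tb = bam_tb * cram_to_bam_ratio
    saved_tb = bam_tb - cram_tb
    return {
        "cram_tb": cram_tb,
        "saved_tb": saved_tb,
        "saved_cost": saved_tb * price_per_tb_month * months,
    }


# Hypothetical scenario: 100 TB of BAM, CRAM at half the size,
# 20 currency units per TB per month, kept for 12 months.
estimate = archive_savings(bam_tb=100, cram_to_bam_ratio=0.5,
                           price_per_tb_month=20, months=12)
print(estimate)
```

Egress charges and the extra CPU time for CRAM decoding are left out of this estimate and would need to be added for a full comparison.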