AI Rare Disease Specialist
An AI Rare Disease Specialist leverages artificial intelligence to accelerate diagnosis, drug discovery, and personalized treatmen…
Skill Guide
The application of cloud infrastructure (IaaS, PaaS) to run large-scale genomic sequencing and analysis pipelines and to store and analyze their outputs, optimizing for cost, scalability, and compliance.
Scenario
Align a single human whole-genome sequencing sample (~100GB FASTQ) to a reference genome using BWA-MEM and generate a BAM file, storing results cost-effectively.
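A minimal sketch of how the alignment step in this scenario might be assembled, assuming `bwa` and `samtools` are installed on the compute instance. The function name, file names, sample ID, and thread count are hypothetical. The function only builds the shell pipeline string, so the command can be inspected before it is run on a large instance.

```python
import shlex


def build_alignment_command(reference, fastq_r1, fastq_r2, out_bam,
                            sample_id, threads=16):
    """Return a shell pipeline that aligns paired-end reads and writes a sorted BAM.

    Piping bwa mem straight into samtools sort avoids storing a large
    intermediate SAM file.
    """
    # Read group header; the literal "\t" sequences are expanded by bwa itself.
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:ILLUMINA"
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group,
             reference, fastq_r1, fastq_r2]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return " | ".join(shlex.join(part) for part in (align, sort))
```

The returned string could be passed to `subprocess.run(cmd, shell=True, check=True)`. Keeping only the compressed BAM, rather than an intermediate SAM, is one way to address the cost-effective storage part of the scenario.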
Scenario
Process 50 samples from a cohort, running alignment (BWA-MEM), marking duplicates (Picard), and variant calling (GATK HaplotypeCaller) in parallel, with checkpointing and cost tracking.
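The checkpointing idea in this scenario can be illustrated with a small sketch. It assumes a hypothetical `run_stage(sample_id, stage)` callable that launches the real tool (BWA-MEM, Picard, or GATK) and raises on failure. The marker-file layout and stage names are also assumptions for illustration. Workflow managers such as Nextflow provide their own resume mechanism, so this is not a substitute for one, and it does not cover cost tracking.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Stage order within one sample; names are illustrative.
STAGES = ("align", "mark_duplicates", "call_variants")


def run_sample(sample_id, work_dir, run_stage):
    """Run each stage in order, skipping stages that already have a marker file."""
    executed = []
    for stage in STAGES:
        marker = Path(work_dir) / sample_id / f"{stage}.done"
        if marker.exists():
            continue  # checkpoint hit: this stage finished in an earlier run
        marker.parent.mkdir(parents=True, exist_ok=True)
        run_stage(sample_id, stage)  # raises on failure, so no marker is written
        marker.touch()
        executed.append(stage)
    return sample_id, executed


def run_cohort(sample_ids, work_dir, run_stage, max_parallel=8):
    """Process samples in parallel; stages within one sample stay sequential."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_sample, s, work_dir, run_stage)
                   for s in sample_ids]
        return dict(f.result() for f in futures)
```

Re-running `run_cohort` after an interruption repeats only the stages that have no marker file, which is what makes interruptible capacity such as Spot instances practical for a cohort of this size.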
Scenario
Architect a system to ingest, process, and analyze whole-genome, transcriptomic (RNA-Seq), and clinical data from 10,000+ patients, ensuring data is queryable (SQL, API) while complying with HIPAA.
Use managed genomics platforms (e.g., AWS HealthOmics, Google Cloud Life Sciences) for turnkey pipeline execution with built-in compliance features. Use general batch services (AWS Batch, Azure Batch) with workflow managers (Nextflow, Cromwell) for maximum control and portability.
Use Terraform to provision and manage cloud resources (VPCs, buckets, IAM roles) reproducibly. Use Nextflow or Snakemake to define portable, scalable bioinformatics pipelines; containerize all tools with Docker for dependency management.
Integrate cost management tools from day one. Use Spot Instance Advisors to identify instance types with interruption rates suitable for fault-tolerant genomics jobs. Set up alerts for budget thresholds and use monitoring to right-size resources.
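The budget-threshold alerting mentioned above can be sketched as a simple check. The function name and the default thresholds (50%, 80%, 100%) are assumptions for illustration. In practice a cloud provider's budgeting service would normally evaluate these thresholds and send the notifications.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that current spend has reached or passed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in thresholds if used >= t]
```

A monitoring job could call this after each cost report and alert only on thresholds that were not already reported.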
Answer Strategy
Structure the answer in five parts:
1) Pipeline Architecture: tools (Manta, DELLY); orchestration (Nextflow on AWS Batch).
2) Data Flow & Storage: input FASTQ -> intermediate BAM -> output VCF; S3 storage classes.
3) Compute: instance types (memory-optimized for assembly); Spot vs. On-Demand.
4) Cost Drivers: egress, storage, compute hours, logging.
5) Optimization: Spot, data compression, temporary storage cleanup.
Sample: 'I'd use Nextflow to orchestrate Manta and DELLY, running on AWS Batch with Spot instances for assembly stages. The primary cost drivers are compute (I'd profile for optimal r5/i3 instances), S3 storage (I'd use Intelligent-Tiering for intermediates), and egress (I'd process data in the same region as storage). I'd estimate by calculating vCPU-hours per sample and applying Spot pricing discounts of 60-70%.'
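The estimation method in the sample answer (vCPU-hours per sample, then a Spot discount) can be written out as a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, and the vCPU-hours and on-demand price must come from your own profiling and the provider's current price list.

```python
def estimate_compute_cost(samples, vcpu_hours_per_sample,
                          on_demand_price_per_vcpu_hour, spot_discount):
    """Estimate cohort compute cost at Spot pricing.

    spot_discount is a fraction, e.g. 0.65 for a 65% discount.
    """
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    spot_price = on_demand_price_per_vcpu_hour * (1 - spot_discount)
    return samples * vcpu_hours_per_sample * spot_price


# Placeholder inputs, not measured values: bracket the estimate with the
# 60% and 70% discounts quoted in the sample answer.
high = estimate_compute_cost(50, 100, 0.05, spot_discount=0.60)
low = estimate_compute_cost(50, 100, 0.05, spot_discount=0.70)
```

Reporting the result as a low-to-high range reflects that Spot prices vary by instance type, region, and time. Storage and egress would be estimated separately.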
Answer Strategy
Tests problem-solving, cost-awareness, and practical optimization skills. Sample: 'First, I'd profile the pipeline to identify bottleneck stages. Then, I'd implement two key changes: 1) Switch eligible, fault-tolerant stages (like alignment) to use Spot instances, implementing retry logic in the workflow manager. 2) Refactor the storage strategy: use local NVMe SSDs (included in instance cost) for temporary files and move final outputs to S3 Standard, with a lifecycle policy to archive older results to S3 Glacier. This could reduce costs by 40-60% while maintaining throughput.'
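The lifecycle policy in the sample answer could look roughly like the following. It is a sketch only: the function name, rule IDs, prefixes, and day counts are hypothetical, and the second rule (expiring any temporary files that do reach S3) is an added assumption, not part of the answer above. The dict follows the shape of an S3 bucket lifecycle configuration.

```python
def build_lifecycle_rules(results_prefix, tmp_prefix,
                          archive_after_days=90, tmp_expire_days=7):
    """Return an S3 lifecycle configuration as a plain dict."""
    return {
        "Rules": [
            {
                # Move older final results to Glacier to cut storage cost.
                "ID": "archive-final-results",
                "Filter": {"Prefix": results_prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            },
            {
                # Assumption: delete leftover temporary objects after a few days.
                "ID": "expire-temporary-files",
                "Filter": {"Prefix": tmp_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": tmp_expire_days},
            },
        ]
    }
```

With boto3, a dict of this shape can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration` on an S3 client, together with the bucket name.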