Skill Guide

Genomic variant calling and annotation (WGS, WES, RNA-seq pipelines)

The computational process of identifying differences (variants) between a sample's DNA/RNA sequence and a reference genome, then assigning biological and clinical significance to those variants using curated databases and predictive algorithms.

This skill is the cornerstone of precision medicine, oncology research, and genetic diagnostics. It directly translates raw sequencing data into actionable insights for drug development, patient stratification, and understanding disease mechanisms, impacting R&D efficiency and clinical trial success.
1 Careers
1 Categories
9.2 Avg Demand
15% Avg AI Risk

How to Learn Genomic variant calling and annotation (WGS, WES, RNA-seq pipelines)

1. Master the fundamentals of next-generation sequencing (NGS) data formats (FASTQ, BAM, VCF) and the standard processing pipeline (alignment, variant calling).
2. Understand the key differences between WGS, WES, and RNA-seq experimental designs and their analytical implications.
3. Learn to use core command-line tools (e.g., BWA, GATK HaplotypeCaller) on small, well-documented test datasets.

1. Implement and optimize end-to-end pipelines using workflow managers (e.g., Nextflow, Snakemake).
2. Navigate and integrate annotation tools (VEP, ANNOVAR) and variant databases (ClinVar, gnomAD) to filter and prioritize variants.
3. Recognize and troubleshoot common artifacts (e.g., strand bias, low mapping quality) by examining IGV visualizations and quality metrics (Qualimap, Picard).

1. Architect and validate production-grade pipelines with integrated quality control, provenance tracking, and cloud scalability (e.g., using Terra, DNAnexus).
2. Develop custom filtering and prioritization strategies for specific research questions (e.g., somatic variant calling in tumor-normal pairs, structural variant detection).
3. Mentor teams on best practices, establish SOPs, and align computational outputs with clinical or research decision-making workflows.
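
Since the learning path starts with the NGS file formats, it can help to see how little is needed to read one VCF data line. The following is a minimal sketch, assuming a plain tab-separated VCF record with the eight fixed columns; the names `VcfRecord`, `parse_vcf_line`, and `is_snv` are hypothetical helpers for illustration, and a real project would more likely use an established parser.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VcfRecord:
    chrom: str
    pos: int                  # 1-based position, as in the VCF format
    ref: str
    alts: List[str]           # empty when ALT is "."
    qual: Optional[float]     # None when QUAL is "."
    filters: List[str]        # empty when FILTER is "."
    info: Dict[str, str]      # flag-style INFO keys map to ""


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, _id, ref, alt, qual, filt, info = fields[:8]

    info_map: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            info_map[key] = value

    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=[] if filt == "." else filt.split(";"),
        info=info_map,
    )


def is_snv(rec: VcfRecord) -> bool:
    """True when REF and every ALT allele are single nucleotides."""
    bases = {"A", "C", "G", "T"}
    return (
        rec.ref.upper() in bases
        and len(rec.alts) > 0
        and all(a.upper() in bases for a in rec.alts)
    )


# Illustrative record, not taken from a real dataset.
record = parse_vcf_line("chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30;AF=0.5")
print(record.chrom, record.pos, record.ref, record.alts, is_snv(record))
```

Header lines (those starting with `#`) and per-sample genotype columns are deliberately left out of this sketch.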

Practice Projects

Beginner
Project

Germline Variant Calling from a WGS Dataset

Scenario

You have access to a whole-genome sequencing (WGS) BAM file for a single human sample (e.g., from the GIAB consortium). Your goal is to produce a high-confidence set of germline SNVs and indels.

How to Execute
1. Follow the GATK Best Practices workflow: MarkDuplicates -> BaseRecalibrator -> ApplyBQSR -> HaplotypeCaller.
2. Run HaplotypeCaller in GVCF mode to generate a raw GVCF, then genotype it with GenotypeGVCFs to obtain a VCF.
3. Use GATK's VariantRecalibrator and ApplyVQSR to filter variants based on quality metrics.
4. Annotate the final VCF using VEP with ClinVar and gnomAD annotations.
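
The pre-processing and calling steps above can be sketched as a small command builder. This is an illustration under stated assumptions, not a validated pipeline: the function name `germline_commands` and all file names are hypothetical, the commands are only printed rather than executed, and the VQSR and annotation steps are omitted because their resource files depend on the reference build. ApplyBQSR and GenotypeGVCFs are included because BaseRecalibrator only writes a recalibration table and a GVCF has to be genotyped before filtering.

```python
import shlex
from typing import List


def germline_commands(sample: str, bam: str, ref: str, known_sites: str) -> List[List[str]]:
    """Build GATK4 command lines for one sample (all paths are placeholders)."""
    dedup = f"{sample}.dedup.bam"
    table = f"{sample}.recal.table"
    recal = f"{sample}.recal.bam"
    gvcf = f"{sample}.g.vcf.gz"
    vcf = f"{sample}.vcf.gz"
    return [
        # Flag duplicate reads so they are not counted as independent evidence.
        ["gatk", "MarkDuplicates", "-I", bam, "-O", dedup,
         "-M", f"{sample}.dup_metrics.txt"],
        # Model systematic base-quality errors using known variant sites.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup,
         "--known-sites", known_sites, "-O", table],
        # Write a BAM with the recalibrated qualities.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup,
         "--bqsr-recal-file", table, "-O", recal],
        # Call variants in GVCF mode.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal,
         "-O", gvcf, "-ERC", "GVCF"],
        # Convert the GVCF into a genotyped VCF.
        ["gatk", "GenotypeGVCFs", "-R", ref, "-V", gvcf, "-O", vcf],
    ]


for cmd in germline_commands("sample1", "sample1.bam", "ref.fasta", "dbsnp.vcf.gz"):
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments makes it straightforward to hand the same lists to `subprocess.run` or to translate them into workflow-manager rules later.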
Intermediate
Project

Tumor-Normal Somatic Variant Calling Pipeline

Scenario

You are analyzing a matched tumor-normal pair from a cancer patient (WES data). The objective is to identify somatic mutations (SNVs, indels) specific to the tumor, while filtering out germline variants and sequencing artifacts.

How to Execute
1. Pre-process both samples with the same GATK Best Practices pipeline.
2. Run Mutect2 in tumor-normal mode to call somatic candidates.
3. Supply a panel of normals (PoN) to filter recurrent artifacts, then run FilterMutectCalls to flag low-confidence calls.
4. Annotate with OncoKB, COSMIC, and gnomAD to prioritize likely driver mutations.
5. Visualize key candidates in IGV to confirm variant calls.
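
To make the idea of "specific to the tumor" concrete, the sketch below compares variant allele fractions (VAF) in the tumor and the matched normal. It is a simplified sanity check, not a reimplementation of the statistical model a somatic caller uses; the `Candidate` class, the function names, and every threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tumor_ref: int    # reads supporting the reference allele in the tumor
    tumor_alt: int    # reads supporting the alternate allele in the tumor
    normal_ref: int
    normal_alt: int


def vaf(ref_count: int, alt_count: int) -> float:
    """Variant allele fraction: alt reads divided by total reads."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def looks_somatic(c: Candidate,
                  min_tumor_vaf: float = 0.05,
                  max_normal_vaf: float = 0.02,
                  min_depth: int = 10) -> bool:
    """Heuristic: present in the tumor, essentially absent in the normal."""
    if c.tumor_ref + c.tumor_alt < min_depth:
        return False  # too little tumor coverage to judge
    if c.normal_ref + c.normal_alt < min_depth:
        return False  # cannot rule out a germline variant
    return (vaf(c.tumor_ref, c.tumor_alt) >= min_tumor_vaf
            and vaf(c.normal_ref, c.normal_alt) <= max_normal_vaf)


# Illustrative read counts, not real data.
candidates = [
    Candidate("tumor_only", 70, 30, 50, 0),
    Candidate("germline_het", 50, 50, 48, 52),
    Candidate("low_depth", 3, 2, 50, 0),
]
for c in candidates:
    print(c.name, looks_somatic(c))
```

A check like this is mainly useful when reviewing candidates in IGV: a call with substantial alt-allele support in the normal is more likely germline or an artifact than a somatic mutation.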
Advanced
Project

Multi-Omic Integration for a Cancer Cohort

Scenario

You are leading a project to analyze a cohort of 50 cancer patients with matched WGS (for somatic/structural variants), RNA-seq (for expression and fusion detection), and clinical data. The goal is to identify subtype-specific molecular profiles and potential therapeutic targets.

How to Execute
1. Design a scalable, reproducible pipeline (e.g., Nextflow on AWS Batch) integrating GATK (WGS), STAR-Fusion/Arriba (RNA-seq fusions), and Salmon (expression quantification).
2. Develop a custom annotation and filtering module to cross-validate DNA variants with expression outlier status and fusion events.
3. Use statistical frameworks (e.g., R/Bioconductor) to perform cohort-level analyses (mutational signatures, expression clusters).
4. Present findings as a technical report with a prioritized variant table for biological validation.
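
The cross-validation in step 2 can be sketched with a simple z-score: a DNA variant gains support when the affected gene is also an expression outlier in the same patient. This is a minimal sketch, assuming log-scaled expression values per gene and patient; the gene and patient identifiers, the function names, and the threshold of 2.0 are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def expression_zscores(values: Dict[str, float]) -> Dict[str, float]:
    """Z-score of each patient's expression relative to the cohort."""
    if len(values) < 2:
        return {patient: 0.0 for patient in values}
    mu = mean(values.values())
    sd = stdev(values.values())
    if sd == 0:
        return {patient: 0.0 for patient in values}
    return {patient: (x - mu) / sd for patient, x in values.items()}


def variants_with_expression_support(
    variant_calls: List[Tuple[str, str]],        # (patient, gene) pairs
    expression: Dict[str, Dict[str, float]],     # gene -> patient -> value
    threshold: float = 2.0,
) -> List[Tuple[str, str]]:
    """Keep calls whose gene is an expression outlier in the same patient."""
    supported = []
    for patient, gene in variant_calls:
        per_patient = expression.get(gene)
        if per_patient is None or patient not in per_patient:
            continue  # no expression data for this gene or patient
        z = expression_zscores(per_patient)[patient]
        if abs(z) >= threshold:
            supported.append((patient, gene))
    return supported


# Illustrative cohort of eight patients for one hypothetical gene.
expression = {
    "GENE_A": {"P1": 5.0, "P2": 5.1, "P3": 4.9, "P4": 5.0,
               "P5": 5.2, "P6": 4.8, "P7": 5.0, "P8": 9.0},
}
calls = [("P8", "GENE_A"), ("P2", "GENE_A"), ("P1", "GENE_B")]
print(variants_with_expression_support(calls, expression))
```

With small cohorts a z-score is a blunt instrument, since a single extreme value inflates the standard deviation; a robust alternative such as a median-based score may be preferable in practice.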

Tools & Frameworks

Core Analysis Software

GATK (Genome Analysis Toolkit), Sentieon, DeepVariant

GATK is the industry standard for germline and somatic variant calling (HaplotypeCaller, Mutect2). Sentieon is a high-performance, licensed alternative. DeepVariant uses deep learning for highly accurate SNP/indel calling.

Workflow Managers & Orchestration

Nextflow, Snakemake, WDL/Cromwell

Used to build portable, reproducible, and scalable pipelines. Nextflow (with DSL2) and Snakemake are popular in research; WDL is the language behind Terra (Broad Institute's platform).

Annotation & Databases

Ensembl VEP (Variant Effect Predictor), ANNOVAR, ClinVar, gnomAD

VEP and ANNOVAR add functional context to variants (gene impact, protein change). ClinVar provides clinical significance (pathogenic, benign). gnomAD is essential for population frequency filtering.
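
Population frequency filtering and clinical prioritization can be combined in a few lines once the annotations are attached to each variant. The sketch below assumes each variant is a dictionary with hypothetical keys `gnomad_af` and `clinvar`; real annotation output uses field names that depend on the tool and its configuration, and the 1% cutoff is an illustrative choice.

```python
from typing import Dict, List, Optional

# ClinVar labels treated as high priority in this sketch (an assumption).
PATHOGENIC_LABELS = {"pathogenic", "likely_pathogenic", "pathogenic/likely_pathogenic"}


def is_rare(af: Optional[float], max_af: float) -> bool:
    """A variant absent from the population database counts as rare."""
    return af is None or af <= max_af


def prioritize(variants: List[Dict], max_af: float = 0.01) -> List[Dict]:
    """Drop common variants, then list ClinVar pathogenic entries first."""
    rare = [v for v in variants if is_rare(v.get("gnomad_af"), max_af)]

    def sort_key(v: Dict):
        label = (v.get("clinvar") or "").lower()
        pathogenic_rank = 0 if label in PATHOGENIC_LABELS else 1
        return (pathogenic_rank, v.get("gnomad_af") or 0.0)

    return sorted(rare, key=sort_key)


# Illustrative annotated variants, not real database entries.
variants = [
    {"id": "v1", "gnomad_af": 0.25, "clinvar": "Benign"},
    {"id": "v2", "gnomad_af": 0.0001, "clinvar": "Pathogenic"},
    {"id": "v3", "gnomad_af": None, "clinvar": None},
    {"id": "v4", "gnomad_af": 0.005, "clinvar": "Uncertain_significance"},
]
print([v["id"] for v in prioritize(variants)])
```

Matching against an explicit set of labels, rather than searching for the substring "pathogenic", avoids accidentally promoting entries such as conflicting interpretations.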

Visualization & QC

Integrative Genomics Viewer (IGV), Qualimap, MultiQC

IGV is for manual inspection of alignments at variant sites. Qualimap and MultiQC aggregate and report quality metrics across samples/pipelines to flag systematic issues.

Interview Questions

Answer Strategy

Structure the answer sequentially: (1) Alignment (BWA-MEM) to produce BAM, (2) Mark Duplicates (Picard), (3) Base Quality Score Recalibration (BQSR), which uses known variant sites to adjust quality scores for systematic technical errors, (4) HaplotypeCaller in GVCF mode, (5) Joint genotyping (GenotypeGVCFs), (6) Variant Quality Score Recalibration (VQSR) for filtering. Emphasize BQSR's role in correcting for covariates like machine cycle and sequence context.
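
When explaining BQSR, it helps to be able to move between Phred quality scores and error probabilities, since Q = -10 * log10(P). The sketch below shows that conversion and a deliberately simplified "empirical quality" for one covariate bin; the pseudocount smoothing and the function names are assumptions for illustration and do not reproduce the model GATK actually fits.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Simplified empirical quality for one covariate bin.

    Mismatches are counted at sites not known to vary; one pseudo-mismatch
    in two pseudo-observations keeps the estimate finite (an assumption).
    """
    return error_to_phred((mismatches + 1) / (observations + 2))


# A bin whose bases were reported as Q30 but mismatch about 1 time in 100
# has an empirical quality near Q20, so its scores would be lowered.
print(phred_to_error(30))
print(empirical_quality(mismatches=100, observations=10_000))
```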

Answer Strategy

The interviewer is testing troubleshooting methodology and understanding of artifact sources. A strong answer outlines: 1) Verify sample identity (check fingerprinting). 2) Inspect a subset in IGV for strand bias, mapping quality, or alignment artifacts. 3) Check the Panel of Normals (PoN) for recurrence. 4) Review base quality scores and sequencing depth in the tumor BAM. 5) Consider whether tumor purity or subclonality could explain the pattern. 6) Run additional callers (e.g., Strelka, VarDict) for concordance.
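
Strand bias, mentioned in step 2, is commonly quantified with Fisher's exact test on the forward and reverse read counts for the reference and alternate alleles; GATK's FS annotation is a Phred-scaled p-value of this kind. The following is a minimal sketch of a two-sided test using only the standard library; the function names and the example counts are hypothetical.

```python
import math


def fisher_two_sided(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    n = row_ref + row_alt
    if n == 0:
        return 1.0

    def table_prob(x: int) -> float:
        # Hypergeometric probability that x reference reads are forward.
        return (math.comb(row_ref, x) * math.comb(row_alt, col_fwd - x)
                / math.comb(n, col_fwd))

    observed = table_prob(ref_fwd)
    lo = max(0, col_fwd - row_alt)
    hi = min(row_ref, col_fwd)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(table_prob(x) for x in range(lo, hi + 1)
            if table_prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def phred_scaled(p: float) -> float:
    """Phred-scale a p-value; larger values indicate stronger bias."""
    return -10 * math.log10(max(p, 1e-300))


# Illustrative counts: alt reads split evenly across strands vs. all forward.
balanced = fisher_two_sided(20, 20, 10, 10)
biased = fisher_two_sided(20, 20, 20, 0)
print(balanced, phred_scaled(balanced))
print(biased, phred_scaled(biased))
```

A candidate whose alternate reads come almost entirely from one strand, as in the second example, is a typical artifact pattern worth confirming in IGV before trusting the call.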

Careers That Require Genomic variant calling and annotation (WGS, WES, RNA-seq pipelines)

1 career found