Skill Guide

Prompt engineering and LLM output evaluation for generated performance narratives

The systematic design of prompts and evaluation criteria to guide Large Language Models in generating accurate, coherent, and contextually appropriate employee performance reviews, feedback summaries, and developmental narratives.

This skill directly impacts HR efficiency, manager effectiveness, and talent strategy by enabling consistent, data-driven, and bias-mitigated performance communication at scale. It transforms subjective manager notes into structured, actionable talent insights, improving retention and development outcomes.

1 Careers

1 Categories

8.7 Avg Demand

15% Avg AI Risk

How to Learn Prompt engineering and LLM output evaluation for generated performance narratives

1. Foundational LLM Knowledge: Understand core concepts like temperature, top-p, and token limits.
2. Prompt Anatomy: Master basic prompt structures (instruction, context, input data, output format).
3. Narrative Structure: Learn the components of a standard performance review (e.g., STAR method, achievements vs. behaviors).
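The four-part prompt anatomy named in item 2 can be sketched as a small helper. This is a minimal illustration rather than a prescribed template: the function name `build_prompt` and the section labels are hypothetical, and the input notes are the ones used in the beginner exercise on this page.

```python
def build_prompt(instruction: str, context: str, input_data: str, output_format: str) -> str:
    """Assemble a prompt from the four parts named above (hypothetical helper)."""
    sections = [
        ("Instruction", instruction),
        ("Context", context),
        ("Input data", input_data),
        ("Output format", output_format),
    ]
    # Label each part so the model can tell guidance apart from the data.
    return "\n\n".join(f"{label}:\n{text.strip()}" for label, text in sections)


prompt = build_prompt(
    instruction="Write one paragraph summarizing the employee's achievements.",
    context="Audience: the employee and their HR partner. Tone: professional.",
    input_data="Fixed critical bug, mentored two juniors, led migration project to new framework.",
    output_format="Plain text, 4 to 6 sentences, STAR structure.",
)
print(prompt)
```

Keeping the parts separate makes it easier to change one of them (for example the output format) without touching the others.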

1. Chain-of-Thought Prompting: Design prompts that force the LLM to reason step-by-step before generating a narrative.
2. Few-Shot Examples: Use curated examples of high-quality narratives to steer output style and depth.
3. Output Parsing: Develop methods to extract structured data (ratings, key themes) from free-text LLM output.

Common Mistake: Over-relying on a single generic prompt for all seniority levels or roles.
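Output parsing (item 3) can be illustrated with a short regex-based sketch. It assumes the generation prompt asked the model to finish with lines such as `Rating: 4/5` and `Themes: ...`; that convention, the function name `parse_review_output`, and the sample text are all hypothetical.

```python
import re


def parse_review_output(text: str) -> dict:
    """Pull a rating and key themes out of free-text output.

    Assumes the prompt asked the model to end with lines such as
    'Rating: 4/5' and 'Themes: mentoring, delivery' (a hypothetical convention).
    """
    rating_match = re.search(r"^Rating:\s*([1-5])\s*/\s*5\s*$", text, re.MULTILINE)
    themes_match = re.search(r"^Themes:\s*(.+)$", text, re.MULTILINE)
    themes = []
    if themes_match:
        themes = [t.strip().lower() for t in themes_match.group(1).split(",") if t.strip()]
    return {
        "rating": int(rating_match.group(1)) if rating_match else None,
        "themes": themes,
    }


# Invented sample output, for illustration only.
sample_output = """The engineer resolved a critical bug and mentored two junior colleagues.
Rating: 4/5
Themes: Mentoring, Technical Delivery
"""
parsed = parse_review_output(sample_output)
print(parsed)  # {'rating': 4, 'themes': ['mentoring', 'technical delivery']}
```

Returning `None` for a missing rating, instead of raising, lets a caller decide whether to retry the generation or send the narrative to a reviewer.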

1. Prompt Pipelines: Architect multi-step systems where one LLM call generates raw data, another structures it, and a third polishes the narrative.
2. Evaluation Frameworks: Build custom rubrics and automated scoring models (using LLM-as-a-Judge or fine-tuned classifiers) to assess narrative quality, bias, and alignment.
3. Governance & Compliance: Implement guardrails for sensitive topics, legal compliance, and consistent tone across an organization.
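The three-call pipeline in item 1 might look roughly like the sketch below. The model is passed in as a plain callable so the example runs without any API; `run_pipeline`, `fake_llm`, and the prompt wording are hypothetical, and a real system would replace `fake_llm` with an actual model client.

```python
from typing import Callable

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def run_pipeline(notes: str, llm: LLM) -> dict:
    """Three chained calls: generate raw facts, structure them, polish the narrative."""
    facts = llm(f"List the distinct achievements in these notes, one per line:\n{notes}")
    structured = llm(f"Group these achievements under Action and Result headings:\n{facts}")
    narrative = llm(f"Write a professional review paragraph from this outline:\n{structured}")
    # Keep intermediate outputs so each stage can be evaluated on its own.
    return {"facts": facts, "structured": structured, "narrative": narrative}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; echoes the first prompt line and the prompt size."""
    return f"[{prompt.splitlines()[0]}] ({len(prompt)} chars)"


result = run_pipeline("Fixed critical bug, mentored two juniors.", fake_llm)
print(result["narrative"])
```

Returning every stage's output, not only the final narrative, is a design choice that makes it possible to see which stage introduced an error.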

Practice Projects

Beginner

Case Study/Exercise

Transform Bullet Points into a Coherent Achievement Narrative

Scenario

A manager provides raw, unstructured notes for a software engineer: 'Fixed critical bug, mentored two juniors, led migration project to new framework.'

How to Execute

1. Deconstruct: Identify the core elements (Action, Result, Skill Demonstrated) from the notes.
2. Prompt Design: Create a prompt that instructs the LLM to write a paragraph using the STAR method, specifying a professional tone and inclusion of impact.
3. Generate & Compare: Run the prompt and compare the output against a manually written version for clarity and impact.
4. Iterate: Refine the prompt to correct any shortcomings, like missing quantification or weak verbs.
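Steps 2 and 4 can be sketched as follows: a STAR prompt built from the manager's notes, plus a rough check for the "missing quantification" shortcoming. The prompt wording, the `lacks_quantification` helper, and the sample draft are hypothetical, and the digit check is a crude heuristic (it misses spelled-out numbers such as "two").

```python
import re

NOTES = "Fixed critical bug, mentored two juniors, led migration project to new framework."

# One possible STAR prompt; the exact wording is an assumption, not a standard.
STAR_PROMPT = (
    "You are drafting a performance review paragraph.\n"
    "Use the STAR method (Situation, Task, Action, Result).\n"
    "Tone: professional. State the impact of each achievement.\n"
    "Do not invent facts or numbers that are not in the notes.\n\n"
    f"Manager notes: {NOTES}"
)


def lacks_quantification(narrative: str) -> bool:
    """Rough heuristic: True when the text contains no digits at all."""
    return re.search(r"\d", narrative) is None


# Invented draft, standing in for a model response.
draft = "The engineer fixed a critical bug and led the framework migration."
print(lacks_quantification(draft))  # True
```

A flagged draft suggests asking the manager for figures rather than letting the model supply them, since the notes in this scenario contain no metrics.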

Intermediate

Project

Build a Role-Specific Narrative Generator with Quality Scoring

Scenario

Generate performance narratives for three distinct roles: Sales Executive, UX Designer, and Data Analyst, ensuring each emphasizes role-relevant KPIs and competencies.

How to Execute

1. Define Schemas: Create separate prompt templates with role-specific instructions, required metrics (e.g., 'Sales: Quota attainment, pipeline growth'), and competency frameworks.
2. Implement a Scoring Module: Design a secondary prompt or a fine-tuned model to rate the generated narrative on a 1-5 scale for specificity, actionability, and role alignment.
3. Build a Feedback Loop: Use the scoring output to automatically flag weak narratives for human review and to iteratively improve the primary generation prompt.
4. Test: Run the system on anonymized historical review data and measure agreement with human-written reviews.
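A minimal sketch of steps 1 and 3, assuming the scores arrive as a dictionary from whatever scoring module is used. The `RoleSchema` class, the helper names, and the threshold of 3 are hypothetical; only the Sales metrics come from the text above, and the metrics for the other two roles are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass
class RoleSchema:
    role: str
    required_metrics: list[str]


SCHEMAS = {
    "sales": RoleSchema("Sales Executive", ["quota attainment", "pipeline growth"]),
    # The metrics below are illustrative placeholders, not from a real framework.
    "ux": RoleSchema("UX Designer", ["usability test results"]),
    "data": RoleSchema("Data Analyst", ["report turnaround time"]),
}

CRITERIA = ("specificity", "actionability", "role_alignment")


def build_role_prompt(schema: RoleSchema, notes: str) -> str:
    """Fill a role-specific template with the role's metrics and the manager's notes."""
    metrics = ", ".join(schema.required_metrics)
    return (
        f"Write a performance narrative for a {schema.role}.\n"
        f"Address these metrics where the notes support them: {metrics}.\n"
        f"Manager notes: {notes}"
    )


def needs_human_review(scores: dict[str, int], threshold: int = 3) -> bool:
    """Flag the narrative if any 1-5 criterion score falls below the threshold."""
    return any(scores[c] < threshold for c in CRITERIA)


print(build_role_prompt(SCHEMAS["sales"], "Closed three enterprise deals in Q2."))
print(needs_human_review({"specificity": 4, "actionability": 2, "role_alignment": 5}))  # True
```

Flagging on the weakest criterion, instead of on an average, keeps one strong score from hiding a narrative that is not actionable.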

Advanced

Case Study/Exercise

Design an End-to-End Performance Narrative System with Bias Auditing

Scenario

An enterprise needs to standardize annual reviews for 5,000+ employees across global offices, requiring narrative consistency, multi-language support, and compliance with anti-bias regulations.

How to Execute

1. Architect the Pipeline: Design a multi-stage system: Data Ingestion (manager input + quantitative metrics) -> Prompt Assembly (dynamic template selection) -> Narrative Generation (with constrained decoding for key terms) -> Post-Processing (formatting, translation).
2. Develop an Evaluation Harness: Create a suite of automated checks: sentiment analysis, keyword frequency for protected attributes (flagging potential bias), consistency against a style guide, and back-translation for accuracy.
3. Implement a 'Human-in-the-Loop' Protocol: Define clear escalation paths for narratives that fail automated checks or fall into sensitive categories.
4. Run a Pilot & Measure: Deploy to a single department, comparing LLM-generated narratives to control groups on time saved, manager satisfaction, and employee feedback on fairness and clarity.
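One of the automated checks in step 2, keyword frequency for protected attributes, can be sketched together with the escalation rule from step 3. The term list here is a tiny illustrative sample, and the names `protected_term_counts` and `route` are hypothetical; a production lexicon would need legal and HR review and per-language variants.

```python
import re
from collections import Counter

# Illustrative term list only; a real deployment would need a reviewed,
# locale-specific lexicon.
PROTECTED_TERMS = {"age", "young", "old", "pregnant", "maternity", "religion", "disability"}


def protected_term_counts(narrative: str) -> Counter:
    """Count whole-word occurrences of protected-attribute terms."""
    words = re.findall(r"[a-z]+", narrative.lower())
    return Counter(w for w in words if w in PROTECTED_TERMS)


def route(narrative: str) -> str:
    """Return 'human_review' when any protected-attribute term appears."""
    return "human_review" if protected_term_counts(narrative) else "auto_approve"


# Invented sentences, for illustration only.
print(route("Despite her young age, she led the release."))   # human_review
print(route("She managed the release and cut deployment time."))  # auto_approve
```

Matching on whole words means "managed" does not trigger the "age" term, which keeps the false-positive rate of such a simple check lower.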

Tools & Frameworks

LLM Platforms & APIs

OpenAI API (GPT-4, with function calling); Azure OpenAI Service; Hugging Face Inference Endpoints

Use these for core narrative generation. Function calling helps enforce structured inputs and outputs. Azure OpenAI Service adds enterprise compliance controls. Hugging Face Inference Endpoints can host open or fine-tuned models, which suits specialized models trained on proprietary review data.

Prompt Engineering & Orchestration Tools

LangChain; LlamaIndex; PromptLayer

LangChain/LlamaIndex are essential for building complex prompt chains and managing pipelines. PromptLayer or similar platforms are for versioning, tracking, and A/B testing prompts at scale.

Evaluation & Testing Frameworks

OpenAI Evals; LangSmith; Custom LLM-as-a-Judge Scripts

Use OpenAI Evals and LangSmith to systematically test prompt performance against curated datasets. Implement 'LLM-as-a-Judge' prompts to score narratives for coherence, tone, and bias automatically.
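A custom LLM-as-a-Judge script usually has two halves: a scoring prompt and a strict parser for the reply. The sketch below shows both without calling any model; the prompt wording, the JSON reply format, and the name `parse_judge_reply` are assumptions made for illustration.

```python
import json

CRITERIA = ("coherence", "tone", "bias")

# Hypothetical judge prompt; doubled braces keep the JSON example literal
# when str.format fills in the narrative.
JUDGE_PROMPT = (
    "Score the performance narrative below from 1 to 5 (5 is best) on "
    "coherence, tone, and freedom from bias.\n"
    'Reply with JSON only, for example {{"coherence": 4, "tone": 5, "bias": 5}}.\n\n'
    "Narrative:\n{narrative}"
)


def parse_judge_reply(reply: str) -> dict:
    """Validate the judge's JSON reply and return the three scores."""
    scores = json.loads(reply)  # raises ValueError on malformed JSON
    if not isinstance(scores, dict):
        raise ValueError("judge reply must be a JSON object")
    for key in CRITERIA:
        value = scores.get(key)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {key!r}: {value!r}")
    return {key: scores[key] for key in CRITERIA}


prompt = JUDGE_PROMPT.format(narrative="The engineer led the framework migration.")
print(parse_judge_reply('{"coherence": 4, "tone": 5, "bias": 3}'))
```

Rejecting malformed replies outright, instead of guessing at a score, keeps bad judge output from silently entering the evaluation data.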

Mental Models & Methodologies

STAR/CAR Method (Situation, Task, Action, Result / Challenge, Action, Result); Competency Framework Alignment; Calibrated Evaluation Rubrics

STAR/CAR provides the core narrative structure. Competency alignment ensures narratives link behaviors to company values. Calibrated rubrics standardize what a 'good' narrative looks like for quality control.

Interview Questions

Answer Strategy

Test the candidate's structured approach to prompt design, including handling of context, constraints, and output evaluation. The answer should outline a clear process: defining the schema (achievements, competencies), structuring the prompt with few-shot examples, and creating a validation rubric.

Sample Answer: "I'd start by defining the output schema based on our company's marketing competency framework. The prompt would include the manager's raw notes, instructions to use the STAR method, and constraints for a professional, actionable tone. I'd include 2-3 examples of strong narratives. To validate, I'd run it on 20 anonymized historical reviews, then score the outputs on a 5-point rubric for specificity, impact articulation, and bias neutrality, iterating the prompt until agreement with human ratings exceeds 90%."
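The agreement threshold mentioned in the sample answer can be computed in a few lines. This sketch treats agreement as the share of narratives where the automated score matches the human score, optionally within a tolerance; that definition, the function name `agreement_rate`, and the scores are assumptions for illustration, and other measures of agreement exist.

```python
def agreement_rate(llm_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Share of items where the two scores differ by at most `tolerance`."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(abs(a - b) <= tolerance for a, b in zip(llm_scores, human_scores))
    return matches / len(llm_scores)


# Invented rubric scores for five narratives.
llm = [4, 3, 5, 2, 4]
human = [4, 3, 4, 2, 4]
print(agreement_rate(llm, human))               # 0.8
print(agreement_rate(llm, human, tolerance=1))  # 1.0
```

Exact-match agreement is strict on a 5-point scale, so it helps to state which tolerance a target such as 90% refers to.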

Answer Strategy

This tests for experience, critical thinking, and commitment to quality governance. The candidate should demonstrate a methodical approach to auditing and systemic improvement.

Sample Answer: "In an early version, the LLM consistently used more agentic language for male engineers ('led,' 'drove') and more collaborative language for female engineers ('supported,' 'facilitated'). I identified this through a bias audit using keyword frequency analysis. The fix was twofold: I engineered a new prompt that explicitly instructed for neutral, impact-focused language and added a post-processing step that flagged gendered terms for mandatory human review. We then implemented a recurring audit schedule for all generated narratives."
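The keyword frequency analysis described in the sample answer could be approximated as below. The word lists contain only the four verbs quoted in the answer, the grouping keys and sentences are invented, and `language_rates` is a hypothetical helper; a real audit would use larger lexicons and a statistical test before drawing conclusions.

```python
import re
from collections import Counter

AGENTIC = {"led", "drove"}
COLLABORATIVE = {"supported", "facilitated"}


def language_rates(narratives_by_group: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Per group, the average number of agentic and collaborative verbs per narrative."""
    rates = {}
    for group, texts in narratives_by_group.items():
        counts = Counter()
        for text in texts:
            for word in re.findall(r"[a-z]+", text.lower()):
                if word in AGENTIC:
                    counts["agentic"] += 1
                elif word in COLLABORATIVE:
                    counts["collaborative"] += 1
        n = max(len(texts), 1)  # avoid division by zero for an empty group
        rates[group] = {
            "agentic": counts["agentic"] / n,
            "collaborative": counts["collaborative"] / n,
        }
    return rates


# Invented narratives, for illustration only.
sample = {
    "group_a": ["He led the migration and drove adoption.", "He led the on-call rotation."],
    "group_b": ["She supported the migration and facilitated reviews.", "She led the design review."],
}
rates = language_rates(sample)
print(rates)
```

Normalizing by the number of narratives per group makes the rates comparable when the groups differ in size.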