Interview Prep
AI Local LLM Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questionsCover latency benefits, data privacy, cost structure (CapEx vs OpEx), offline capability, and customization control.
Discuss reducing model precision (FP16 β INT4/INT8), the tradeoff between model size/memory and output quality, and mention formats like GGUF.
Cover VRAM capacity, memory bandwidth, storage speed (model loading), and the distinction between compute-bound and memory-bound inference.
Explain how key-value pairs from previous tokens are cached during autoregressive generation, and how this scales with context length and batch size.
Mention Ollama for ease-of-use, llama.cpp for CPU/edge optimization, vLLM for high-throughput serving, and briefly differentiate them.
Intermediate
10 questionsDiscuss GPU vs CPU deployment targets, quality retention at different bit levels, ecosystem support, and hardware compatibility.
Cover local embedding models (e.g., all-MiniLM, nomic-embed), local vector stores (Chroma, Qdrant, FAISS), retrieval strategy, context assembly, and local LLM generation.
Explain how vLLM's PagedAttention enables processing new requests between generation steps rather than waiting for the entire batch to finish, dramatically improving throughput.
Cover memory requirements, training speed, quality comparison, when each is appropriate, and the role of LoRA rank and target modules.
Discuss automated benchmarks (MMLU, HumanEval, MT-Bench), perplexity on held-out data, task-specific eval harnesses, and human preference evaluation.
Describe using a smaller draft model to generate candidate tokens that the larger model verifies in parallel, reducing effective latency per token.
Discuss aggressive quantization (Q4_K_M), offloading layers to CPU/RAM, tensor parallelism across GPUs, and the performance implications of each approach.
Describe how it borrows virtual memory/paging concepts from OS design to manage KV-cache memory efficiently, reducing waste and enabling larger batch sizes.
Cover format purpose (CPU-optimized inference vs. training/interoperability), quantization support, metadata, and ecosystem tooling.
Discuss implementing the /v1/chat/completions and /v1/completions endpoints, matching request/response schemas, streaming via SSE, and tool-use function calling format.
Advanced
10 questionsCover classifier/router architecture, model ensemble strategies, fallback mechanisms, monitoring quality drift, and the concept of model cascading.
Discuss all-reduce communication patterns, pipeline vs. tensor parallelism tradeoffs, PCIe bandwidth bottlenecks, and frameworks like Megatron-LM or vLLM's distributed support.
Cover prompt processing optimization, prefix caching, prefill-decode disaggregation, batch scheduling policies, and kernel-level optimizations.
Discuss curated test sets, automated scoring rubrics, retrieval-augmented evaluation, calibration metrics, regression testing against model updates, and human-in-the-loop validation.
Break down model weights memory, KV-cache per layer per token, activation memory, framework overhead, and provide a concrete calculation example.
Cover teacher-student architecture, logit-based vs. response-based distillation, synthetic data generation from the teacher, evaluation of the student model, and iterative refinement.
Cover blue-green deployments, model versioning, health checks, graceful request draining, canary rollouts, and rollback strategies.
Discuss hierarchical retrieval, document summarization chains, map-reduce patterns, context compression, and long-context models vs. retrieval strategies.
Cover air-gapped deployment, encrypted model storage, audit logging, PII detection/redaction in prompts, access control, and compliance documentation.
Walk through systematic profiling: CUDA profiling (nsys/ncu), memory bandwidth analysis, CPU-GPU transfer bottlenecks, batch size tuning, and kernel fusion opportunities.
Scenario-Based
10 questionsCover model selection (size vs. quality), vLLM deployment with tensor parallelism, load balancing, HIPAA compliance considerations, and monitoring.
Describe a structured evaluation pipeline: automated benchmarks, domain-specific test sets, latency/memory profiling on target hardware, A/B testing plan, and rollback criteria.
Discuss model size constraints for mobile/edge, quantization for ARM processors, ONNX/CoreML export, offline-first architecture with sync-on-connect, and fallback strategies.
Cover multilingual model selection, Q4_K_M quantization strategy, prompt engineering for multilingual tasks, RAG for knowledge grounding, and continuous evaluation.
Discuss temperature/sampling parameter analysis, quantization quality regression, context length edge cases, KV-cache overflow, prompt template mismatches, and systematic A/B testing.
Cover model and dependency pre-download, offline package management, model integrity verification (checksums), offline monitoring, manual update workflows, and documentation.
Discuss overfitting to test set, distribution shift between test data and real queries, catastrophic forgetting, evaluation methodology flaws, and iterative improvement with real user feedback.
Cover hardware abstraction layers, auto-detection of GPU/CPU capabilities, dynamic quantization selection, containerized deployment with resource constraints, and hardware compatibility testing.
Discuss RAG with citation tracking, structured output prompting, confidence calibration, source chunk retrieval and presentation, and output validation pipelines.
Cover model distillation to smaller variants, aggressive quantization, prompt caching, request deduplication, batch optimization, and targeted model routing (simple queries β smaller model).
AI Workflow & Tools
10 questionsCover model discovery, license review, benchmark evaluation, quantization method selection, format conversion, inference server configuration, load testing, and monitoring setup.
Discuss Ollama's Modelfile system, ease of local development and testing, model pulling and management, REST API, and where it falls short (throughput, advanced features) compared to production-grade servers.
Cover document loading, chunking strategy selection, embedding model choice, vector store configuration, retrieval tuning, prompt template design, and common issues like poor chunking and retrieval misses.
Discuss logging training loss curves, learning rate schedules, evaluation metrics per epoch, GPU utilization, and comparing runs across different LoRA configurations and hyperparameters.
Describe service definitions, GPU passthrough configuration, volume mounts for model storage, networking between services, health checks, and environment-specific configuration.
Cover setup, task configuration, running evaluations, interpreting results, and selecting benchmarks relevant to the deployment domain (general knowledge, coding, reasoning, etc.).
Walk through the conversion script, quantization level selection, vocabulary handling, metadata configuration, quality validation, and performance benchmarking post-conversion.
Discuss prompt-based tool use vs. native function calling support, JSON schema enforcement, tool response parsing, error handling, and testing frameworks for agent workflows.
Cover nvidia-smi monitoring, memory mapping analysis, KV-cache size reduction, layer offloading strategies, quantization to lower precision, and measuring the performance impact of each change.
Discuss output quality sampling, user feedback loops, latency percentile monitoring, error rate tracking, prompt injection detection, and automated regression testing against known-good outputs.
Behavioral
5 questionsLook for structured problem-solving, root cause analysis, clear communication with stakeholders, and a systematic approach to resolution rather than ad-hoc fixes.
Assess genuine curiosity - following specific researchers, reading papers, active GitHub involvement, community participation (Discord, Reddit, HuggingFace), and hands-on experimentation with new releases.
Evaluate ability to communicate technical tradeoffs in business terms, build data-driven arguments, listen to concerns, and navigate organizational dynamics constructively.
Look for emphasis on runbooks, architecture decision records, hardware compatibility notes, configuration documentation, and knowledge sharing practices.
Assess decision-making framework - involving stakeholders, defining measurable criteria, running experiments, documenting tradeoffs, and iterating based on user feedback.