AI Text-to-Speech Engineer
An AI Text-to-Speech (TTS) Engineer designs, trains, and deploys neural speech synthesis systems that convert text into natural, e…
Skill Guide
The engineering discipline of designing, training, and optimizing deep generative models (e.g., HiFi-GAN, WaveNet, WaveRNN, BigVGAN) that convert acoustic features (mel-spectrograms) into raw audio waveforms, balancing perceptual quality with computational efficiency.
Scenario
Build a foundational, high-quality vocoder for a single-speaker English dataset.
Scenario
You have a dataset of podcast audio with background noise (music, ambient sound). A standard vocoder trained on clean speech produces unnatural artifacts when used in a TTS system built on this data.
Scenario
Integrate a state-of-the-art vocoder into a latency-constrained, on-device TTS system for a voice assistant product.
PyTorch is the dominant research framework for vocoder prototyping. NeMo and ESPnet provide scalable training pipelines for TTS systems, including vocoders; NeMo is oriented toward production use, while ESPnet is primarily a research toolkit. TensorFlow is an option when deploying within Google's ecosystem.
TensorRT is the usual choice for latency-critical deployment on NVIDIA GPUs, with FP16 and INT8 precision modes. ONNX Runtime enables cross-platform deployment. Triton Inference Server manages scalable, concurrent model serving in production.
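Librosa handles mel-spectrogram generation and analysis. PESQ and POLQA provide automated, objective speech quality scores; both are intrusive metrics, so each synthesized utterance must be compared against a matching reference recording. Praat supports manual acoustic analysis of artifacts.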
Answer Strategy
The candidate must demonstrate a deep understanding of generator design beyond a black-box view. The answer should contrast the single fixed kernel of a plain transposed-convolution upsampler with the parallel residual branches of HiFi-GAN's Multi-Receptive Field Fusion (MRF) module, which use different kernel sizes and dilation rates to capture multi-scale periodic and aperiodic features. Sample Answer: 'A transposed-convolution upsampler on its own uses one fixed kernel size, so it can miss patterns at other time scales. HiFi-GAN follows each transposed convolution with an MRF module, which sums the outputs of parallel residual blocks that have different kernel sizes and dilation rates. This lets the generator capture both local (e.g., plosives) and longer-range (e.g., pitch periodicity) structure simultaneously, mitigating the metallic or buzzy artifacts common in simpler GAN vocoders.'
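To make the contrast concrete, here is a simplified PyTorch sketch of one upsampling stage followed by MRF-style fusion. The class names (`ResBlock`, `MRF`, `UpsampleStage`), channel counts, kernel sizes, and dilations are illustrative assumptions, and the residual block is reduced relative to the published HiFi-GAN code (no weight normalization, one convolution per dilation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    """Dilated 1-D convolutions with residual connections (simplified)."""

    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=(kernel_size - 1) * d // 2)  # preserves length
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))
        return x


class MRF(nn.Module):
    """Averages parallel residual blocks with different kernel sizes."""

    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResBlock(channels, k) for k in kernel_sizes])

    def forward(self, x):
        return sum(block(x) for block in self.blocks) / len(self.blocks)


class UpsampleStage(nn.Module):
    """Transposed-conv upsampling followed by multi-receptive-field fusion."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        # With an even stride, this kernel/padding choice multiplies the
        # time dimension by exactly `stride`.
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                     stride=stride, padding=stride // 2)
        self.mrf = MRF(out_ch)

    def forward(self, x):
        return self.mrf(self.up(F.leaky_relu(x, 0.1)))


stage = UpsampleStage(in_ch=16, out_ch=8, stride=8)
out = stage(torch.randn(2, 16, 10))  # (batch, channels, frames)
```

Each branch sees the same upsampled signal through a different receptive field, so the fused output combines short-range and long-range context without stacking the branches sequentially.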
Answer Strategy
This tests practical deployment knowledge. The strategy should follow a hierarchy: architectural simplification, training for efficiency, and hardware-aware optimization. Sample Answer: 'First, I'd profile the model to identify bottlenecks and measure the baseline real-time factor (RTF, synthesis time divided by audio duration). Then, I'd apply progressive steps: 1) Prune small-magnitude weights and fine-tune. 2) Convert to FP16 using TensorRT, which often provides a 2x or greater speedup on supported GPUs with negligible quality loss. 3) If RTF is still too high, I'd retrain a lighter generator variant (e.g., V2) or use kernel fusion to reduce memory bandwidth. I'd validate MOS after each step to ensure quality doesn't degrade.'
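The profiling step relies on RTF (real-time factor): synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below measures it for any callable; `real_time_factor` and `dummy_vocoder` are hypothetical names, and the dummy stands in for a real vocoder forward pass.

```python
import time


def real_time_factor(synthesize, sample_rate, n_runs=5, warmup=1):
    """Return total synthesis time / total audio duration over n_runs calls.

    `synthesize` is any zero-argument callable returning a sequence of
    audio samples. For GPU models, it should synchronize the device
    before returning so the timing covers the full computation.
    """
    for _ in range(warmup):  # exclude one-time setup costs
        synthesize()
    total_time = 0.0
    total_audio = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        audio = synthesize()
        total_time += time.perf_counter() - start
        total_audio += len(audio) / sample_rate
    return total_time / total_audio


def dummy_vocoder():
    # Placeholder: one second of silence at 22.05 kHz.
    return [0.0] * 22050


rtf = real_time_factor(dummy_vocoder, sample_rate=22050)
```

Running the same measurement before and after each optimization step, on the target hardware and with representative utterance lengths, keeps the speed comparison consistent with the MOS checks.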