AI Text-to-Speech Engineer
An AI Text-to-Speech (TTS) Engineer designs, trains, and deploys neural speech synthesis systems that convert text into natural, e…
Skill Guide
The systematic engineering of automated workflows to ingest, validate, standardize, and synthetically expand massive collections of audio data for machine learning training and production systems.
Scenario
You have downloaded a raw subset of LibriSpeech containing 100 hours of audio mixed with silence and high-frequency hiss, unsuitable for direct model training.
Scenario
A client needs to expand a 1,000-hour internal meeting dataset by 5x for ASR model robustness, simulating diverse acoustic environments (cafes, subways, reverberant rooms).
Scenario
You are building a TTS system whose model fails on 'whispered' speech, and manual collection of whispered data is too slow and expensive.
Librosa and Pydub handle loading and feature extraction; Audiomentations provides on-the-fly augmentation on the CPU (torch-audiomentations is the GPU-oriented counterpart); SoX is the standard CLI for resampling and format conversion.
Use Airflow for scheduling and DAGs; Dask/Ray for parallelizing massive audio transformations across clusters; DVC to track audio dataset versions alongside model code.
ffprobe checks that a file is readable and reports its codec, sample rate, and duration; SpeechBrain/WhisperX are used to auto-label raw audio or detect silence/misalignment before ingestion.
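As a rough illustration of the validation step, the sketch below shells out to ffprobe and checks its JSON output before a file is admitted to the dataset. The helper names (`ffprobe_command`, `check_probe`, `validate_file`) and the thresholds (16 kHz expected rate, 0.5 s minimum duration) are assumptions chosen for the example, not values from this guide.

```python
import json
import subprocess


def ffprobe_command(path):
    """Build an ffprobe call that prints stream and container metadata as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration:stream=codec_type,sample_rate,channels",
        "-of", "json", path,
    ]


def check_probe(probe_json, expected_rate=16000, min_seconds=0.5):
    """Return a list of problems found in ffprobe's JSON output (empty means OK).

    expected_rate and min_seconds are illustrative defaults (assumptions).
    """
    info = json.loads(probe_json)
    audio = [s for s in info.get("streams", []) if s.get("codec_type") == "audio"]
    if not audio:
        return ["no audio stream"]

    problems = []
    # ffprobe reports numeric fields as strings, so cast before comparing.
    rate = int(audio[0].get("sample_rate", 0))
    if rate != expected_rate:
        problems.append(f"sample rate {rate} != {expected_rate}")
    duration = float(info.get("format", {}).get("duration", 0.0))
    if duration < min_seconds:
        problems.append(f"duration {duration:.2f}s below {min_seconds}s")
    return problems


def validate_file(path):
    """Run ffprobe on one file; a file ffprobe cannot read counts as a failure."""
    result = subprocess.run(ffprobe_command(path), capture_output=True, text=True)
    if result.returncode != 0:
        return [f"ffprobe failed: {result.stderr.strip()}"]
    return check_probe(result.stdout)
```

Keeping the parsing in `check_probe` separate from the subprocess call makes the rules testable without ffprobe installed. Files that return a non-empty list can be routed to a quarantine folder rather than silently dropped.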
Answer Strategy
Focus on the two-pass architecture. First, use a lightweight classifier (like YAMNet) on lower-compute nodes to tag segments. Second, use a transcription and alignment tool (like WhisperX) on high-compute nodes to extract and align text for the segments that passed the first pass. Mention the importance of outputting JSONL manifests for data loading.
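The two-pass idea can be sketched in a few lines. In this minimal example the `tagger` and `transcriber` arguments are hypothetical callables standing in for a YAMNet-style classifier and a WhisperX-style aligner; the manifest field names (`audio_path`, `start`, `end`, `text`) are also assumptions, since the guide does not specify a schema.

```python
import json


def build_manifest(segments, tagger, transcriber, out_path, keep_label="speech"):
    """Two-pass sketch: a cheap tagger filters segments, then a costly
    transcriber labels only the survivors. Returns the number of records written.

    segments    -- iterable of dicts with 'audio_path', 'start', 'end'
    tagger      -- hypothetical callable: segment -> label string
    transcriber -- hypothetical callable: segment -> transcript string
    """
    # Pass 1 (low-compute): keep only segments tagged with keep_label.
    kept = [seg for seg in segments if tagger(seg) == keep_label]

    # Pass 2 (high-compute): transcribe survivors, one JSON object per line.
    with open(out_path, "w", encoding="utf-8") as f:
        for seg in kept:
            record = {
                "audio_path": seg["audio_path"],
                "start": seg["start"],
                "end": seg["end"],
                "text": transcriber(seg),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(kept)
```

One record per line means a data loader can stream the manifest, shard it by line ranges, or append to it without re-parsing the whole file.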
Answer Strategy
Focus on signal-to-noise ratio (SNR) calibration and domain randomization. Explain that heavy noise creates 'impossible' listening tasks that poison the model. Suggest a solution involving dynamic SNR ranges and validating the augmented data against a clean baseline to ensure the task remains solvable.
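To make the SNR calibration concrete, here is a minimal pure-Python sketch that scales a noise clip so the mixture hits a target SNR, then samples that target from a range on each call. The function names and the 5 to 25 dB default range are assumptions for illustration; a production pipeline would more likely use NumPy arrays or an augmentation library.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that 20*log10(rms(speech)/rms(scaled noise)) == snr_db."""
    # Loop the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        return list(speech)  # nothing meaningful to calibrate against
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise, snr_range=(5.0, 25.0), rng=random):
    """Draw a target SNR uniformly from snr_range (an assumed range) and mix.

    Returns the mixed signal and the SNR used, so the value can be logged
    next to the sample and checked later against a clean baseline.
    """
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db), snr_db
```

Returning the sampled SNR alongside the audio lets you bucket evaluation results by SNR and raise the lower bound if accuracy on the lowest bucket collapses, which is one way to check that the task remains solvable.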