AI

VOICE MORPHING THAT IMPROVES TTS QUALITY USING AN OPTIMAL DYNAMIC FREQUENCY WARPING-AND-WEIGHTING TRANSFORM

Abstract

Dynamic Frequency Warping (DFW) is widely used to align the spectra of different speakers. It has long been argued that frequency warping captures inter-speaker differences, but in practice DFW always involves a tricky preprocessing step to remove spectral tilt. The DFW residual has been used successfully in Voice Morphing to improve the quality and similarity of synthesized speech, but its estimation remains largely heuristic and sub-optimal. This paper presents a dynamic programming algorithm that simultaneously estimates the frequency warping and the weighting of the Optimal Dynamic Frequency Warping and Weighting (ODFWW) transform, and therefore requires neither a preprocessing step nor fine-tuning. Source- and target-speaker data are matched using the Matching-Minimization algorithm [1]. The transform is used to morph the output of a state-of-the-art Vocaine-based [2] TTS synthesizer in order to generate different voices at runtime with only 8% computational overhead. Some morphed TTS voices exhibit significantly higher quality than the original one, as morphing seems to “correct” the voice characteristics of the TTS voice.
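For readers unfamiliar with DFW, the following minimal Python sketch illustrates the conventional two-stage approach that the abstract contrasts with: a dynamic programming alignment of the frequency bins of two log-magnitude spectra, followed by a separately computed residual. It is not the ODFWW algorithm of this paper, which estimates warping and weighting jointly. The function names, the squared-difference local cost, and the three-way step pattern are assumptions chosen for illustration only.

```python
import numpy as np


def dynamic_frequency_warping(src, tgt):
    """Align frequency bins of two log-spectra with a DTW-style recursion.

    Illustrative baseline only (assumed cost and step pattern), not ODFWW.
    Returns the monotonic warping path and its accumulated cost.
    """
    n, m = len(src), len(tgt)
    # Local cost: squared difference between every pair of bins.
    cost = (src[:, None] - tgt[None, :]) ** 2
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, acc[i - 1, j])
            if j > 0:
                best = min(best, acc[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, acc[i - 1, j - 1])
            acc[i, j] = cost[i, j] + best

    # Backtrack from the last bin pair to the first along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path, acc[-1, -1]


def warp_and_residual(src, tgt, path):
    """Warp src onto the target frequency axis and return (warped, residual).

    Source bins mapped to the same target bin are averaged; the residual is
    what a separate weighting stage would have to model.
    """
    warped = np.zeros(len(tgt))
    counts = np.zeros(len(tgt))
    for i, j in path:
        warped[j] += src[i]
        counts[j] += 1
    warped /= counts  # every target bin is visited by a monotonic path
    return warped, tgt - warped
```

In this baseline the residual is obtained only after the warping path has been fixed, which is the kind of separate, heuristic step that a joint estimate of warping and weighting is meant to avoid.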