Areal and Phylogenetic Features for Multilingual Speech Synthesis


We introduce phylogenetic and areal language features to multilingual text-to-speech (TTS) synthesis. Intuitively, enriching the existing universal phonetic features with such representations, which are shared across languages, should benefit multilingual acoustic models and help address issues such as data scarcity for low-resource languages. We investigate these representations using acoustic models based on long short-term memory (LSTM) recurrent neural networks (RNNs). Subjective evaluations conducted on eight languages from diverse language families show that, in some cases, phylogenetic and areal representations lead to significant improvements in multilingual synthesis quality.