Monika Podsiadło
Authored Publications
Common text-to-speech (TTS) systems rely on training data for modelling human speech. The quality of this data can range from professional voice actors recording hand-curated sentences in high-quality studio conditions, to found voice data representing arbitrary domains. For years, the unit selection technology dominant in the field required many hours of data that was expensive and time-consuming to collect. With the advancement of statistical methods of waveform generation, there have been experiments with noisier and often much larger datasets (“big data”), testing the inherent flexibility of such systems. In this paper we examine the relationship between training data and speech synthesis quality. We hypothesise that statistical text-to-speech benefits from high acoustic quality corpora with a high level of prosodic variation, but that beyond the first few hours of training data no further quality gains are observed. We then describe how we engineered a training dataset containing an optimized distribution of features, and how these features were defined. Lastly, we present results from a series of evaluation tests. These confirm our hypothesis and show how a carefully engineered training corpus of a smaller size yields the same speech quality as much larger datasets, particularly for voices that use WaveNet.
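
The corpus-engineering step described above lends itself to a short illustration. Below is a minimal sketch of one plausible approach, greedy sentence selection against a target feature distribution, assuming Python and a KL-divergence objective; the feature representation and the objective are illustrative assumptions, not the paper's published method.

```python
# A minimal sketch of greedy corpus selection, assuming the goal is to pick
# a small recording script whose feature distribution (e.g. phone coverage,
# prosodic contexts) approaches a target distribution. The KL-divergence
# objective is an illustrative assumption, not the paper's exact method.
from collections import Counter
import math

def kl_divergence(p: Counter, q: Counter, smoothing: float = 1e-6) -> float:
    """KL(p || q) over the union of feature keys, with additive smoothing."""
    keys = set(p) | set(q)
    p_total = sum(p.values()) + smoothing * len(keys)
    q_total = sum(q.values()) + smoothing * len(keys)
    div = 0.0
    for k in keys:
        pk = (p[k] + smoothing) / p_total
        qk = (q[k] + smoothing) / q_total
        div += pk * math.log(pk / qk)
    return div

def select_corpus(candidates, target: Counter, budget: int):
    """Greedily add the sentence whose features most reduce the divergence
    from the target distribution, until the sentence budget is reached.

    candidates: list of (sentence, Counter-of-features) pairs.
    """
    selected, pool, acc = [], list(candidates), Counter()
    for _ in range(min(budget, len(pool))):
        best = min(pool, key=lambda c: kl_divergence(target, acc + c[1]))
        pool.remove(best)
        selected.append(best[0])
        acc += best[1]
    return selected
```

Greedy selection like this is only a stand-in for whatever feature-balancing procedure the authors actually used; its purpose here is to make the idea of "an optimized distribution of features" concrete.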

Individuals with vision loss use text-to-speech (TTS) for most of their interaction with devices, and rely on the quality of synthetic voices to a much larger extent than any other user group. In total, 33% of local synthesis requests for Google TTS come from TalkBack, the Android screen reader, making it our top client and making visually-impaired users the heaviest consumers of the technology. Despite this, very little attention has been devoted to optimizing TTS voices for this user group, and feedback on TTS voices from blind users has traditionally been less favourable. We present the findings from a TTS user experience study conducted by Google with visually-impaired screen reader users. The study comprised 14 focus groups and evaluated a total of 95 candidate voices with 90 participants across 3 countries. The study uncovered the distinctive usage patterns of this user group, which point to different TTS requirements and voice preferences from those of sighted users.

Modern text-to-speech (TTS) systems increasingly need to deal with multilingual input. Navigation, social and news are all domains with a large proportion of foreign words. However, when typical monolingual TTS voices are used, the synthesis quality on such input is markedly lower. This is because traditional TTS derives pronunciations from a lexicon or a grapheme-to-phoneme (G2P) model built using a pre-defined sound inventory and a phonotactic grammar for one language only. G2P models perform poorly on foreign words, while manual lexicon development is labour-intensive, expensive and requires extra storage. Furthermore, large phoneme inventories and phonotactic grammars contribute to data sparsity in unit selection systems. We present an automatic system for deriving pronunciations for foreign words that utilises the monolingual voice design and can rapidly scale to many languages. The proposed system, based on a neural network cross-lingual G2P model, does not increase the size of the voice database, does not require large data annotation efforts, is designed not to increase data sparsity in the voice, and can be sized to suit embedded applications.
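
As a rough illustration of the constraint described above, here is a minimal sketch of a character-level neural G2P model, assuming PyTorch and a plain LSTM encoder-decoder; the abstract does not specify the architecture, so the layers and sizes are assumptions. The one property carried over from the abstract is that the output layer projects onto the monolingual voice's fixed phoneme inventory, so foreign words never introduce new units.

```python
# A minimal sketch of a character-level neural G2P model, assuming PyTorch.
# The key constraint from the abstract: output phonemes are drawn from the
# monolingual voice's existing inventory, so foreign words never add new
# units. The LSTM encoder-decoder and all sizes are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class CrossLingualG2P(nn.Module):
    def __init__(self, n_graphemes: int, n_phonemes: int,
                 emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, emb_dim)
        self.tgt_emb = nn.Embedding(n_phonemes, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        # Project onto the fixed target-language phoneme inventory only.
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        # graphemes: (batch, src_len) grapheme ids of the foreign word.
        # phonemes_in: (batch, tgt_len) shifted phoneme ids (teacher forcing).
        _, state = self.encoder(self.src_emb(graphemes))
        dec_out, _ = self.decoder(self.tgt_emb(phonemes_in), state)
        return self.out(dec_out)  # (batch, tgt_len, n_phonemes) logits

# Shape check with toy vocabulary sizes.
model = CrossLingualG2P(n_graphemes=80, n_phonemes=45)
logits = model(torch.randint(0, 80, (1, 7)), torch.randint(0, 45, (1, 9)))
print(logits.shape)  # torch.Size([1, 9, 45])
```

The shape check doubles as a usage example; a real system would typically train such a model with cross-entropy against vetted pronunciations and decode with beam search.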