Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang

RJ Skerry-Ryan

Daisy Stanton

Yonghui Wu

Ron J. Weiss

Navdeep Jaitly

Zongheng Yang

Ying Xiao

Zhifeng Chen

Samy Bengio

Quoc Le

Yannis Agiomyrgiannakis

Rob Clark

Rif A. Saurous

Interspeech (2017)

Abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given (text, audio) pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
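The speed claim in the last sentence comes down to how many sequential generation steps an utterance needs. The sketch below is a rough illustration, not the authors' implementation: the 24 kHz sample rate, the 12.5 ms frame shift, and the `frames_per_step` parameter are assumptions chosen for the example and are not stated in the abstract. It counts steps only and says nothing about the cost of each step.

```python
import math


def decoder_steps(duration_s, sample_rate_hz=24_000, frame_shift_ms=12.5,
                  frames_per_step=1):
    """Count sequential generation steps for one utterance.

    Returns (sample_level_steps, frame_level_steps). All default values
    are illustrative assumptions, not figures taken from the abstract.
    """
    # A sample-level autoregressive model emits one waveform sample per step.
    sample_steps = round(duration_s * sample_rate_hz)
    # A frame-level model emits one (or several) spectrogram frames per step.
    frames = math.ceil(duration_s * 1000 / frame_shift_ms)
    frame_steps = math.ceil(frames / frames_per_step)
    return sample_steps, frame_steps


if __name__ == "__main__":
    samples, frames = decoder_steps(5.0)
    print(f"sample-level steps: {samples}, frame-level steps: {frames}")
    print(f"ratio: {samples / frames:.0f}x fewer sequential steps")
```

Under these assumed settings, a 5-second utterance needs 120,000 sample-level steps but only 400 frame-level steps, which is why frame-level generation can be much faster even when each step is more expensive.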

Research Areas

Speech Processing
