Text: Advanced text-to-speech models such as FastSpeech can synthesize speech significantly faster
than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an
autoregressive teacher model for duration prediction and knowledge distillation, which can ease the one-to-many
mapping problem in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation
pipeline is complicated, 2) the duration extracted from the teacher model is not accurate enough, and the target
mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of
which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech
and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth
target instead of the simplified output from the teacher, and 2) introducing more variation information of speech as
conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take
them as conditional inputs during training and use predicted values during inference. We further design FastSpeech
2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of
full end-to-end training and even faster inference than FastSpeech. Experimental results show that 1) FastSpeech 2
and 2s outperform FastSpeech in voice quality with a much simplified training pipeline and reduced training time; 2)
FastSpeech 2 and 2s can match the voice quality of autoregressive models while enjoying much faster inference.
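The conditioning scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the frame and hop sizes, and the definition of energy as the L2 norm of each frame's STFT magnitude are all assumptions made for illustration. The sketch shows three things: a per-frame energy computed from a waveform, phoneme-level features expanded to frame level by their durations, and a switch between ground-truth values (training) and predicted values (inference).

```python
import numpy as np


def frame_energy(waveform, frame_length=1024, hop_length=256):
    """Per-frame energy, assumed here to be the L2 norm of each frame's STFT magnitude."""
    waveform = np.asarray(waveform, dtype=float)
    if len(waveform) < frame_length:
        return np.zeros(0)  # too short to form a single full frame
    n_frames = 1 + (len(waveform) - frame_length) // hop_length
    window = np.hanning(frame_length)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = waveform[start:start + frame_length]
        magnitude = np.abs(np.fft.rfft(frame * window))
        energies[i] = np.linalg.norm(magnitude)
    return energies


def expand_by_duration(phoneme_features, durations):
    """Repeat each phoneme's feature vector once per frame it spans."""
    return np.repeat(np.asarray(phoneme_features), durations, axis=0)


def pick_conditions(ground_truth, predicted, training):
    """Use ground-truth duration/pitch/energy in training, predictions at inference."""
    return ground_truth if training else predicted
```

For example, expand_by_duration applied to two phoneme vectors with durations [2, 3] yields five frame-level vectors, and pick_conditions returns the ground-truth values only when training is True.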
Text: the soviet authorities denied oswald permission
Text: from the standpoint of the good of the industries themselves, as well as the general public
Text: secret service agents formed a cordon to keep the press and photographers from impeding their passage and
scanned the crowd for threatening movements.
Text: especially as no more time is occupied, or cost incurred, in casting, setting, or printing
Text: the uncle claimed her. the husband resisted.
Text: they bought their offices from one another, and were thus considered to have a vested interest
Text: in the center of the chapel was the condemned pew, a large dock-like erection painted
Text: again, a turnkey deposed that his chief did not enter the wards more than once a fortnight.
Text: while neglecting to maintain his unity of ideal in the case of nearly all the numerous species of snakes, he
should have added a tiny rudiment in the case of the python
Text: the department hopes to design a practical system which will fully meet the needs of the protective research
section of the secret service.