Abstract: Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By
explicitly providing prosody features to the TTS model, the style of the synthesized utterances can be
controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work,
we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposed
a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the
word-level prosody features. The proposed method outperforms its competitors in terms of audio quality and
prosody naturalness in our objective and subjective evaluations.
Audio Samples
These sentences are from the held-out evaluation set (LJ001, LJ002 and LJ003). All synthesized samples are
converted from mel-spectrograms to audio signals by MelGAN.
We present the audio samples generated by 6 different models, together with the utterances reconstructed by
MelGAN from the ground-truth mel-spectrograms and the ground-truth utterances.
The 6 models are:
- the TTS model without prosody modeling;
- the four models with non-hierarchical prosody modeling, which use word (W)- or phoneme (P)-level input features to predict neural (N)- or rule (R)-based prosody labels;
- the model with hierarchical prosody modeling, which uses the rule-based labels for word-level prosody modeling and the neural-based labels for phoneme-level prosody modeling.
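To make the hierarchical scheme concrete — word-level prosody is predicted first, and each phoneme's prosody prediction is then conditioned on the prosody of the word it belongs to — here is a minimal numpy sketch. All dimensions, weight matrices, and the phoneme-to-word alignment are hypothetical stand-ins, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
D_TEXT, D_PROS = 8, 4  # text-encoding dim, prosody-feature dim

# Stand-in linear predictors; the real model would be learned networks.
W_word = rng.normal(size=(D_TEXT, D_PROS))
W_phone = rng.normal(size=(D_TEXT + D_PROS, D_PROS))

def predict_prosody(word_enc, phone_enc, word_of_phone):
    """Hierarchical prosody prediction (sketch).

    word_enc:      (n_words, D_TEXT)   word-level text encodings
    phone_enc:     (n_phones, D_TEXT)  phoneme-level text encodings
    word_of_phone: (n_phones,)         index of the word each phoneme belongs to
    """
    # 1) Predict word-level prosody from word-level features alone.
    word_pros = word_enc @ W_word                          # (n_words, D_PROS)
    # 2) Condition each phoneme's prediction on its word's predicted prosody.
    cond = np.concatenate([phone_enc, word_pros[word_of_phone]], axis=1)
    phone_pros = cond @ W_phone                            # (n_phones, D_PROS)
    return word_pros, phone_pros

word_enc = rng.normal(size=(2, D_TEXT))    # 2 words
phone_enc = rng.normal(size=(5, D_TEXT))   # 5 phonemes
word_of_phone = np.array([0, 0, 1, 1, 1])  # phoneme-to-word alignment
w, p = predict_prosody(word_enc, phone_enc, word_of_phone)
print(w.shape, p.shape)
```

The key design point is step 2: the word-level prediction is broadcast down to its phonemes and concatenated with the phoneme-level features, so phoneme prosody stays consistent with the coarser word-level decision.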