Abstract: Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By
explicitly providing prosody features to the TTS model, the style of the synthesized utterances can be
controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work,
we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling settings and proposed
a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the
word-level prosody features. The proposed method outperforms its competitors in terms of audio quality and
prosody naturalness in our objective and subjective evaluations.
Audio Samples
These sentences are from the held-out evaluation set (LJ001, LJ002 and LJ003). All synthesized samples are
converted from mel-spectrograms to audio signals by MelGAN.
We present the audio samples generated by 6 different models, together with the utterances reconstructed by
MelGAN from the ground-truth mel-spectrograms and the ground-truth utterances.
The 6 models are:
- the TTS model without prosody modeling;
- the four models with non-hierarchical prosody modeling, which use word (W)- or phoneme (P)-level input features to predict neural (N)- or rule (R)-based prosody labels;
- the model with hierarchical prosody modeling, which uses the rule-based labels for word-level prosody modeling and the neural-based labels for phoneme-level prosody modeling.
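To make the hierarchical scheme concrete — word-level prosody is predicted first, and each phoneme's prosody prediction is then conditioned on the prosody of the word it belongs to — here is a minimal numpy sketch. All dimensions, weight matrices, and the phoneme-to-word alignment are hypothetical stand-ins, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
D_TEXT, D_PROS = 8, 4  # text-encoding dim, prosody-feature dim

# Stand-in linear predictors; the real model would be learned networks.
W_word = rng.normal(size=(D_TEXT, D_PROS))
W_phone = rng.normal(size=(D_TEXT + D_PROS, D_PROS))

def predict_prosody(word_enc, phone_enc, word_of_phone):
    """Hierarchical prosody prediction (sketch).

    word_enc:      (n_words, D_TEXT)   word-level text encodings
    phone_enc:     (n_phones, D_TEXT)  phoneme-level text encodings
    word_of_phone: (n_phones,)         index of the word each phoneme belongs to
    """
    # 1) Predict word-level prosody from word-level features alone.
    word_pros = word_enc @ W_word                          # (n_words, D_PROS)
    # 2) Condition each phoneme's prediction on its word's predicted prosody.
    cond = np.concatenate([phone_enc, word_pros[word_of_phone]], axis=1)
    phone_pros = cond @ W_phone                            # (n_phones, D_PROS)
    return word_pros, phone_pros

word_enc = rng.normal(size=(2, D_TEXT))    # 2 words
phone_enc = rng.normal(size=(5, D_TEXT))   # 5 phonemes
word_of_phone = np.array([0, 0, 1, 1, 1])  # phoneme-to-word alignment
w, p = predict_prosody(word_enc, phone_enc, word_of_phone)
print(w.shape, p.shape)
```

The key design point is step 2: the word-level prediction is broadcast down to its phonemes and concatenated with the phoneme-level features, so phoneme prosody stays consistent with the coarser word-level decision.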