Using Language Model for Implementation of Emotional Text-To-Speech
Authors:
Mingguang Cao
Jie Zhu
Keywords: emotional text-to-speech; style transfer; pre-trained language model
Abstract:
With the development of neural networks, Text-To-Speech (TTS) technology has advanced at an unprecedented pace. The speech generated by modern text-to-speech systems sounds almost as natural as human speech. However, the style control of synthetic speech is usually limited to discrete emotion types, and the emotion embedding that controls emotion transfer contains redundant transcript information. In this paper, we apply the pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) to our TTS system to achieve style control and transfer. Using BERT allows our proposed model to learn the relationship between text representations and acoustic emotion embeddings. The experimental results show that our proposed model outperforms the baseline Global Style Token (GST)-Tacotron2 model in both parallel and non-parallel style transfer.
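To illustrate the core idea described in the abstract, here is a minimal sketch of how a pre-trained BERT could map an input transcript to an acoustic style (emotion) embedding that conditions a GST-Tacotron2-style synthesizer. This is not the authors' exact architecture: the module names, embedding dimensions, and the mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# predict a style/emotion embedding from BERT text representations.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TextToStyleEmbedding(nn.Module):
    """Predicts an acoustic style (emotion) embedding from BERT text features."""

    def __init__(self, style_dim: int = 256, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Small head mapping the BERT sentence representation to the
        # dimensionality of a GST-style embedding (256 here is an assumption).
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token representations into a single sentence vector.
        mask = attention_mask.unsqueeze(-1).float()
        sent = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(sent)


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TextToStyleEmbedding()
    batch = tokenizer(["I can't believe we finally won!"], return_tensors="pt")
    style = model(batch["input_ids"], batch["attention_mask"])
    print(style.shape)  # torch.Size([1, 256]); would condition the TTS decoder
```

In such a setup, the predicted embedding would replace or guide the reference-audio-derived GST weights at synthesis time, enabling style control from text alone; how the embedding is injected into the decoder is left unspecified here.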
Pages: 146 to 151
Copyright: Copyright (c) IARIA, 2023
Publication date: April 24, 2023
Published in: ACHI 2023, The Sixteenth International Conference on Advances in Computer-Human Interactions
ISSN: 2308-4138
ISBN: 978-1-68558-078-0
Location: Venice, Italy
Dates: from April 24, 2023 to April 28, 2023