Using Language Model for Implementation of Emotional Text-To-Speech
Authors:
Mingguang Cao
Jie Zhu
Keywords: emotional text-to-speech; style transfer; pre-trained language model
Abstract:
With the development of neural networks, Text-To-Speech (TTS) technology has advanced at an unprecedented pace. The speech generated by modern text-to-speech systems sounds almost as natural as human speech. However, the style control of synthetic speech is usually limited to discrete emotion types, and the emotion embedding that controls emotion transfer contains redundant transcript information. In this paper, we apply the pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) to our TTS system to achieve style control and transfer. Using BERT allows our proposed model to learn the relationship between text representations and acoustic emotion embeddings. The experimental results show that our proposed model outperforms the baseline Global Style Token (GST)-Tacotron2 model in both parallel and non-parallel style transfer.
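To illustrate the core idea described in the abstract, here is a minimal sketch of how a pre-trained BERT could map an input transcript to an acoustic style (emotion) embedding that conditions a GST-Tacotron2-style synthesizer. This is not the authors' exact architecture: the module names, embedding dimensions, and the mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# predict a style/emotion embedding from BERT text representations.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TextToStyleEmbedding(nn.Module):
    """Predicts an acoustic style (emotion) embedding from BERT text features."""

    def __init__(self, style_dim: int = 256, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Small head mapping the BERT sentence representation to the
        # dimensionality of a GST-style embedding (256 here is an assumption).
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token representations into a single sentence vector.
        mask = attention_mask.unsqueeze(-1).float()
        sent = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(sent)


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TextToStyleEmbedding()
    batch = tokenizer(["I can't believe we finally won!"], return_tensors="pt")
    style = model(batch["input_ids"], batch["attention_mask"])
    print(style.shape)  # torch.Size([1, 256]); would condition the TTS decoder
```

In such a setup, the predicted embedding would replace or guide the reference-audio-derived GST weights at synthesis time, enabling style control from text alone; how the embedding is injected into the decoder is left unspecified here.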
Pages: 146 to 151
Copyright: Copyright (c) IARIA, 2023
Publication date: April 24, 2023
Published in: ACHI 2023, The Sixteenth International Conference on Advances in Computer-Human Interactions
ISSN: 2308-4138
ISBN: 978-1-68558-078-0
Location: Venice, Italy
Dates: from April 24, 2023 to April 28, 2023