SELF-SUPERVISED CONTEXT-AWARE STYLE REPRESENTATION FOR EXPRESSIVE SPEECH SYNTHESIS

Authors: Yihan Wu, Xi Wang, Shaofei Zhang, Lei He, Ruihua Song, Jian-Yun Nie

Abstract: Expressive speech synthesis, such as audiobook synthesis, using multi-style text-to-speech (TTS), is still challenging for style representation learning and prediction. Deriving style from reference audio or predicting style tags from text requires a huge amount of data, which is costly to acquire and difficult to define and annotate accurately. In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a conditioned embedding in a multi-style Transformer TTS. Compared with a multi-style TTS that predicts style tags and is trained on the same dataset but with human annotation, our proposed method achieves improved subjective results on both in-domain and out-domain test sets in audiobook speech. Moreover, with the implicit context-aware style representation, emotion transitions in the synthesized audio of a long paragraph sound more natural.

1. Synthesized audio of in-domain dataset

1.1 text: 别喝太多。

    Don't drink too much.

baseline: Ours:

1.2 text: 季浅言,你过分了!

   Qianyan Ji, you are out of line!

baseline: Ours:

1.3 text: 等我一下!

   Wait for me a moment!

baseline: Ours:

1.4 text: 姨妈这个人就是刀子嘴豆腐心。

   Auntie's bark is worse than her bite.

baseline: Ours:

1.5 text: 因为你喜欢,所以我经常去看看,你喜欢的东西,我从来都没有忘过。

   Because you like it, I often go to see it. I have never forgotten the things you like.

baseline: Ours:

2. Synthesized audio of out-domain dataset

2.1 text: 一说话就能把人给呛死,几年没见还是这么没脸没皮,也不看看眼下是什么情况。

   You can kill a man by talking. I haven't seen you for years but you are still so shameless. Why don't you see what's going on?

baseline: Ours:

2.2 text: 你小子是成了钢铁侠呢,还是植了皮的机器人,怎么这么硬?

   Are you Iron Man, or a robot with human skin? Why are you so hard?

baseline: Ours:

2.3 text: 嘿嘿,你猜呢!

   Hey, hey, guess!

baseline: Ours:

2.4 text: 快来看看这树怎么啦?

   Come and check it out. What's wrong with the tree?

baseline: Ours:

2.5 text: 心里难道不会内疚?

   Don't you feel guilty?

baseline: Ours:

3. Long-form audiobook phrases

3.1 text: 叶凡也是出于好奇。“你这小子。一说话就能把人给呛死,几年没见还是这么没脸没皮,也不看看眼下是什么情况。” 叶凡给气乐了,当下放弃额头的大包,转而笑骂一声,直接冲对方的胸前来上两拳。 “撕嚄!”却疼的自己满嘴咧牙。“你小子是成了钢铁侠呢,还是植了皮的机器人,怎么这么硬?”手晃个不停。“嘿嘿,你猜呢?”庞博挤眉弄眼的,当下咯咯怪笑,难得的一次装牛逼的机会,又是在死党的面前, 当然得争取满分。“你呀……”“我只告诉你一个人哦。”

   Ye Fan was simply curious. "You rascal. You can kill a man by talking. I haven't seen you for years but you are still so shameless. Why don't you see what's going on?" Ye Fan was so angry that he laughed. He ignored the big bump on his forehead, scolded his friend with a laugh, and punched him twice in the chest. "Ouch!" But it hurt so much that he bared his teeth. "Are you Iron Man, or a robot with human skin? Why are you so hard?" He kept shaking his hand. "Hey, hey, guess?" Pang Bo winked and chuckled. It was a rare chance to show off, and in front of his best friend at that, so of course he had to make the most of it. "You..." "You're the only one I'm telling."

baseline: Ours:

3.2 text: “你们说,前辈叫我们过来,是干嘛呢?”“是不是想要也给我们点宝贝?”“不是吧?你当宝贝是大白菜?还能人手一件不成?我看不像,说不定前辈是饿了呢。”“闭嘴...” 一群人胆颤心惊的围成一个半圆圈,将李牧堵在里头,并且呆呆的看着他一会皱眉,一会傻乐,一会暗笑的表情,想要说话又不敢直接问。简直不要太煎熬呀! 前辈厉害确实是好事。毕竟帮他们消灭了魔头(虽然过程有些曲折)。

   "Tell me, why did the senior call us here?" "Does he want to give us some treasures too?" "Really? Do you think treasures are cabbages? Can we each have one? I don't think so. Maybe the senior is hungry." "Shut up..." The group nervously formed a semicircle, hemming Li Mu in, and stared blankly as he frowned one moment, grinned foolishly the next, and then smirked to himself. They wanted to speak but dared not ask directly. It was pure torment! It was certainly a good thing that the senior was so powerful. After all, he had helped them get rid of the devil (albeit with some twists and turns).

baseline: Ours:

4. Style change by replacing style representations

To further verify whether the style embedding captures something independent of content, we extract style embeddings from strongly emotional text (such as happy or angry) and apply them to other, calm text for TTS. Compared with its original style, the generated audio shows an obvious style change but remains natural.
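The swap described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `toy_style_encoder`, `SynthesisRequest`, and `build_requests` are hypothetical names, and the punctuation-counting encoder is a toy placeholder for the learned style encoder, whose actual interface is not given on this page. The point is simply that the text to be spoken stays fixed while the conditioning embedding is taken from a different sentence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SynthesisRequest:
    """Inputs a style-conditioned TTS model might receive (hypothetical)."""
    text: str            # content to be spoken
    style: List[float]   # style embedding used as the conditioning vector


def toy_style_encoder(text: str) -> List[float]:
    """Toy placeholder for the learned style encoder (assumption).

    Counts exclamation and question marks (ASCII and full-width) so the
    example runs; the real model learns its representation from plain text.
    """
    return [
        float(text.count("!") + text.count("!")),
        float(text.count("?") + text.count("?")),
    ]


def build_requests(
    target_text: str,
    reference_text: str,
    encoder: Callable[[str], List[float]],
) -> Tuple[SynthesisRequest, SynthesisRequest]:
    """Return (original, swapped) requests for the same target text.

    The original request uses the style of the target text itself; the
    swapped request keeps the text but borrows the reference style.
    """
    original = SynthesisRequest(target_text, encoder(target_text))
    swapped = SynthesisRequest(target_text, encoder(reference_text))
    return original, swapped


reference = "你自己装大方,凭什么我要在府里受委屈?"   # angry reference sentence
target = "奉先说得对,我们也得吸收经验,不能事事较真。"   # calm target sentence
original, swapped = build_requests(target, reference, toy_style_encoder)
```

Both requests would then be passed to the same TTS model, so any audible difference between the two outputs comes from the style embedding alone.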

4.1 We extract the style representation from a sentence with an obviously angry emotion: "You pretend to be generous, why should I be wronged at home? (你自己装大方,凭什么我要在府里受委屈?)"

We then apply it to the texts below for TTS.

4.1.1 text: 看上面怎么说,如果揪着不放,我们干脆也别委屈自己,谋划未来的路吧。

   Let's see what the higher-ups say. If they won't let it go, we might as well stop wronging ourselves and plan our way forward.

original audio: generated audio with changed style embedding:

4.1.2 text: 奉先说得对,我们也得吸收经验,不能事事较真。

   Fengxian is right. We also have to learn from experience and not take everything so seriously.

original audio: generated audio with changed style embedding:

4.2 We extract the style representation from a sentence with an obviously happy emotion: "Hey hey, guess? (嘿嘿,你猜呢?)"

We then apply it to the texts below for TTS.

4.2.1 text: 还是让我们不去?

   Or are we being told not to go?

original audio: generated audio with changed style embedding:

4.2.2 text: 到底干了啥?

   What the hell did you do?

original audio: generated audio with changed style embedding: