Espnet: First multi-speaker Transformer

Created on 20 Sep 2019  ·  4 Comments  ·  Source: espnet/espnet

Hello, what is the speaker embedding in the first multi-speaker Transformer system? Is it an x-vector?
Could you also tell me where the speaker embedding is placed in the system? I am trying to build a multi-speaker Transformer system. Thanks!

Question

All 4 comments

Sorry for the late reply. I'm back from INTERSPEECH.
The pretrained speaker embedding is an x-vector, trained on the VoxCeleb corpus.
I add or concatenate the x-vector to each encoder hidden state as follows:

  • add: x-vector -> linear -> replicate -> + each encoder hidden state
  • concat: x-vector -> replicate -> concat with each encoder hidden state -> linear
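
The two options above can be sketched roughly as follows. This is only a minimal NumPy illustration of the data flow, not the actual ESPnet code: the dimensions, the random weights `W_add` and `W_cat`, and the function names `integrate_add` and `integrate_concat` are all hypothetical, and it assumes "each encoder hidden state" means one hidden vector per time step of the encoder sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, encoder dimension, x-vector dimension.
T, d_enc, d_spk = 5, 8, 4
hs = rng.standard_normal((T, d_enc))   # encoder hidden states, one row per time step
xvec = rng.standard_normal(d_spk)      # a single x-vector for the whole utterance

# Hypothetical projection weights; in a real model these would be learned.
W_add = rng.standard_normal((d_spk, d_enc))
W_cat = rng.standard_normal((d_enc + d_spk, d_enc))


def integrate_add(hs, xvec, W):
    """add: x-vector -> linear -> replicate -> + each encoder hidden state."""
    projected = xvec @ W                                # (d_enc,)
    replicated = np.tile(projected, (hs.shape[0], 1))   # (T, d_enc)
    return hs + replicated


def integrate_concat(hs, xvec, W):
    """concat: x-vector -> replicate -> concat with each hidden state -> linear."""
    replicated = np.tile(xvec, (hs.shape[0], 1))        # (T, d_spk)
    joined = np.concatenate([hs, replicated], axis=-1)  # (T, d_enc + d_spk)
    return joined @ W                                   # (T, d_enc)


hs_add = integrate_add(hs, xvec, W_add)
hs_cat = integrate_concat(hs, xvec, W_cat)
print(hs_add.shape, hs_cat.shape)
```

In both variants the output keeps the shape of the encoder hidden states, so whatever consumes them afterwards would not need to change its input size.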

Sincere thanks. If I understand correctly, you place the x-vector between the multi-head attention layer and the FFN layer, and with N=3, for example, you do this in every layer.
For example, the x-vector would be concatenated as follows:
for i in N (N=3)
----->Multi-head->add&norm->concat x-vector->FFN->add&norm--->
I don't understand what you mean by "each encoder hidden state". I have concatenated the speaker embedding with the encoder output in my Transformer system, but it doesn't work. Thanks again!

Sincere thanks!
