Hello, what is the speaker embedding in the first multi-speaker Transformer system? Is it an x-vector?
Can you tell me where the speaker embedding is placed in the system? I am trying to build a multi-speaker Transformer system. Thanks!
Sorry for the late reply. I'm back from INTERSPEECH.
The pretrained speaker embedding is an x-vector, which is trained on the VoxCeleb corpus.
I add or concatenate the x-vector to each hidden state of the encoder as follows:
Sincere thanks. If I understand correctly, you place the x-vector between the multi-head attention layer and the FFN layer, and with N=3, for example, you do this in every layer.
For example, concatenate the x-vector as follows:
for i in 1..N (N=3):
-----> Multi-head -> add&norm -> concat x-vector -> FFN -> add&norm --->
I don't understand what you mean by "each hidden state of the encoder". I have concatenated the speaker embedding with the encoder output in my Transformer system, but it doesn't work. Thanks again!
Here, the encoder state means the outputs of the final layer of the encoder, so the x-vector is integrated once after the whole encoder, not inside each layer.
Maybe you can understand by checking the following parts.
https://github.com/espnet/espnet/blob/a2181ad10929ae980c228f40533defa6904d9db0/espnet/nets/pytorch_backend/e2e_tts_transformer.py#L507-L513
https://github.com/espnet/espnet/blob/a2181ad10929ae980c228f40533defa6904d9db0/espnet/nets/pytorch_backend/e2e_tts_transformer.py#L773-L795
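As a rough illustration of the two integration styles mentioned above (add or concat, applied to the outputs of the final encoder layer), here is a minimal PyTorch sketch. It is not the ESPnet code linked above. The class name `SpeakerIntegrator`, the argument names, and the dimensions (384-dim encoder, 512-dim x-vector) are assumptions made for illustration, and L2-normalizing the x-vector is one possible choice rather than a requirement.

```python
import torch
import torch.nn.functional as F


class SpeakerIntegrator(torch.nn.Module):
    """Hypothetical helper: merge an utterance-level x-vector into encoder outputs."""

    def __init__(self, adim=384, spk_embed_dim=512, integration_type="add"):
        super().__init__()
        if integration_type not in ("add", "concat"):
            raise ValueError(f"unknown integration type: {integration_type}")
        self.integration_type = integration_type
        if integration_type == "add":
            # Project the x-vector to the encoder dimension before adding.
            self.projection = torch.nn.Linear(spk_embed_dim, adim)
        else:
            # Project the concatenated vector back to the encoder dimension.
            self.projection = torch.nn.Linear(adim + spk_embed_dim, adim)

    def forward(self, hs, spembs):
        # hs:     (batch, time, adim), outputs of the final encoder layer
        # spembs: (batch, spk_embed_dim), one x-vector per utterance
        spembs = F.normalize(spembs)  # L2-normalize each x-vector (assumption)
        if self.integration_type == "add":
            # Broadcast the projected x-vector over the time axis.
            return hs + self.projection(spembs).unsqueeze(1)
        # Repeat the x-vector for every time step, concatenate, then project.
        spembs = spembs.unsqueeze(1).expand(-1, hs.size(1), -1)
        return self.projection(torch.cat([hs, spembs], dim=-1))


# Dummy inputs: 2 utterances, 7 encoder frames each.
hs = torch.randn(2, 7, 384)
xvec = torch.randn(2, 512)
out_add = SpeakerIntegrator(integration_type="add")(hs, xvec)
out_cat = SpeakerIntegrator(integration_type="concat")(hs, xvec)
print(out_add.shape, out_cat.shape)
```

In both modes the result keeps the shape (batch, time, adim), so the decoder can attend to it without any other change.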
Sincere thanks!