Hi team,
Is it possible to add custom vocabulary to an already trained T2T speech recognition model so that it can recognize domain-specific words?
(Apologies if this question has already been asked, I couldn't locate relevant information)
Thanks
@bharat-patidar there are several ways. Either you let t2t generate the vocabulary from the text targets, or you create your own problem that builds the vocabulary the way you require.
A third option would be to simply delete the model checkpoint/graph and put your own vocabulary into the vocabulary file that the problem expects. The next time you run the training from scratch, the model should have the correct output dimensions for the vocabulary you put there. In this case, of course, the vocabulary has to be compatible with the t2t vocabulary type you're using.
Thanks for the response, Stefan, but I want to reuse the already trained weights, and I don't want to alter the learned weights of the previously trained vocabulary words/subwords.
So, what would be the quickest way to add new words to a previously trained model?
Thanks!
I think I misunderstood your request at first.
So you want to use a pre-trained model and specialize it for a specific domain. That's a classic use case for transfer learning. I don't know whether you can do this in t2t, so you'd have to check that for yourself. A good place to start is to look at the hparams and see whether you can freeze certain layers of the model so that you can apply transfer learning. I have never done this myself, so unfortunately I cannot provide more advice on this topic.
You cannot really add new words to a model in any other way than that, I think. You certainly do not want to replace the actual text file/vocabulary by hard because this would not make sense of course.
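To make the transfer-learning idea a bit more concrete, here is a minimal, framework-free sketch of what "add new words without touching the old weights" could look like for an embedding table. It is not t2t code: `extend_embeddings`, `sgd_step`, and the toy numbers are all hypothetical, and whether t2t exposes a way to freeze weights like this is exactly the part that would need checking.

```python
import random


def extend_embeddings(old_rows, num_new, dim, seed=0):
    """Append randomly initialised rows for new tokens (hypothetical helper).

    Returns the extended table plus a per-row flag saying whether the row
    may be updated during fine-tuning.
    """
    rng = random.Random(seed)
    new_rows = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                for _ in range(num_new)]
    table = [list(row) for row in old_rows] + new_rows
    trainable = [False] * len(old_rows) + [True] * num_new
    return table, trainable


def sgd_step(table, trainable, grads, lr=0.1):
    """Plain SGD update that skips frozen (pre-trained) rows."""
    for i, (row, grad) in enumerate(zip(table, grads)):
        if not trainable[i]:
            continue  # old vocabulary entry: leave its weights untouched
        for j in range(len(row)):
            row[j] -= lr * grad[j]


# Toy pre-trained table with two tokens of dimension 2 (assumed values).
old = [[0.1, 0.2], [0.3, 0.4]]
table, trainable = extend_embeddings(old, num_new=1, dim=2)
before_new = list(table[2])

# Pretend every row received the same gradient.
grads = [[1.0, 1.0]] * 3
sgd_step(table, trainable, grads)
```

After the step, the first two rows are unchanged and only the appended row has moved. In a real model the output (softmax) layer would need the same treatment, since its size also depends on the vocabulary.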
Thanks for the quick response @stefan-falk , I will definitely try that out.
One more question: how does t2t generate the final sequence of words when transcribing a given audio input? As far as I know, we don't use any language model in at16k. So do we have N placeholders, each holding probabilities for all the vocabulary words, or is there some other design that produces the final result (the transcription)?
Any help is appreciated!
So, you mean, how does the model output get converted to text in the end?
For inference, t2t implements an infer function in the T2TModel class.
The code is rather complex, so I do not fully understand what is happening here, but essentially either greedy inference or beam search is applied to generate a sequence of IDs from the model output, which is then decoded to text.
So, essentially, there is a while loop that runs the model until it produces the EOS ID.
Each step will produce a softmax output which gives us the probabilities for words/subwords in the vocabulary. From there we get the IDs. Transforming the IDs to text will happen after that.
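The loop described above can be sketched roughly like this. This is a toy illustration and not the t2t implementation: `toy_model`, the 4-entry vocabulary, and the `EOS_ID` value are assumptions made up for the example.

```python
import math

EOS_ID = 1  # assumed end-of-sequence ID for this toy example
VOCAB = ["<pad>", "<EOS>", "hello", "world"]  # hypothetical vocabulary


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(prefix):
    """Hypothetical stand-in for the network: logits given the IDs so far."""
    script = [2, 3, EOS_ID]
    target = script[min(len(prefix), len(script) - 1)]
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def greedy_decode(model, max_len=10):
    """Pick the most probable ID at every step until EOS is produced."""
    ids = []
    while len(ids) < max_len:
        probs = softmax(model(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == EOS_ID:
            break
        ids.append(next_id)
    return ids


ids = greedy_decode(toy_model)
# Mapping IDs back to text happens only after decoding has finished.
text = " ".join(VOCAB[i] for i in ids)
```

Here `ids` ends up as `[2, 3]` and `text` as `"hello world"`. The `max_len` guard is there so the loop also terminates if the model never emits EOS.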
Note that this is what happens for greedy inference, i.e. when beam_size is set to 1. For values larger than 1, beam search is executed, which is slower but might produce better results overall.
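To illustrate why a larger beam can help, here is a simplified beam search over a made-up model. It is only a sketch: `toy_log_probs` and its probabilities are hypothetical, and the t2t beam search is more involved than this.

```python
import math

EOS_ID = 1  # assumed end-of-sequence ID for this toy example


def toy_log_probs(prefix):
    """Hypothetical model: log-probabilities over a 4-entry vocabulary."""
    if not prefix:
        probs = [0.0, 0.0, 0.6, 0.4]
    elif prefix == [2]:
        probs = [0.0, 0.4, 0.3, 0.3]
    elif prefix == [3]:
        probs = [0.0, 0.9, 0.05, 0.05]
    else:
        probs = [0.0, 1.0, 0.0, 0.0]
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


def beam_search(log_prob_fn, beam_size, max_len=10):
    """Keep the beam_size best partial sequences at every step."""
    beams = [([], 0.0)]  # (token IDs so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            for token, lp in enumerate(log_prob_fn(ids)):
                if lp == float("-inf"):
                    continue  # impossible continuation
                candidates.append((ids + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:beam_size]:
            if ids[-1] == EOS_ID:
                finished.append((ids[:-1], score))  # drop EOS from the output
            else:
                beams.append((ids, score))
        if not beams:
            break
    if not finished:
        finished = beams  # nothing reached EOS within max_len
    return max(finished, key=lambda c: c[1])


greedy_ids, greedy_score = beam_search(toy_log_probs, beam_size=1)
beam_ids, beam_score = beam_search(toy_log_probs, beam_size=2)
```

With `beam_size=1` the search commits to token 2 because it looks best at the first step, and ends with a total probability of 0.24. With `beam_size=2` it also keeps token 3 alive and finds the sequence with probability 0.36. The price is that every step evaluates the model once per beam entry.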
> You certainly do not want to replace the actual text file/vocabulary by hard
There is research on various ways of doing transfer learning with T2T, including substituting the vocabulary in a clever way (keeping the same subwords at the same positions and substituting the rest according to frequency); see https://arxiv.org/pdf/1909.10955.pdf (or a whole PhD thesis: https://arxiv.org/pdf/2001.01622.pdf).
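As a rough illustration of the vocabulary substitution described in the parenthesis, here is one possible reading of it in code. This is not the procedure from the linked paper: `merge_vocab`, the reserved tokens, and the example subwords and counts are all hypothetical.

```python
def merge_vocab(old_vocab, new_counts, reserved=("<pad>", "<EOS>")):
    """Build a same-sized vocabulary for the new domain.

    Subwords that also occur in the new domain keep their old position (ID),
    so their trained weights stay meaningful. The remaining slots are filled
    with unseen new-domain subwords, most frequent first.
    """
    keep = set(reserved) | set(new_counts)
    shared = {tok for tok in old_vocab if tok in keep}
    newcomers = sorted(
        (tok for tok in new_counts if tok not in shared),
        key=lambda tok: (-new_counts[tok], tok),
    )
    remaining = iter(newcomers)
    merged = []
    for tok in old_vocab:
        if tok in shared:
            merged.append(tok)  # same subword, same position
        else:
            # Reuse the slot; fall back to the old token if none are left.
            merged.append(next(remaining, tok))
    return merged


# Assumed toy data: position in the list is the subword ID.
old_vocab = ["<pad>", "<EOS>", "the_", "cat_", "dog_", "sat_"]
new_counts = {"the_": 50, "stent_": 30, "sat_": 5, "aorta_": 12}
merged = merge_vocab(old_vocab, new_counts)
```

In this toy case `merged` is `["<pad>", "<EOS>", "the_", "stent_", "aorta_", "sat_"]`. The vocabulary size does not change, so the model's output dimensions stay the same, but the reused slots still carry weights learned for the old subwords and need further training.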
@martinpopel Interesting - didn't know this was tried. But that's what I meant with "_by hard_". Just replacing the vocabulary does not do anything. It's not surprising that continuing training like this will lead to something useful. :)
I got the answer, thanks for the help @stefan-falk and @martinpopel
Really appreciate it!
:)