Wav2letter: question about CTC Emission Set value and probability

Created on 4 Mar 2020 · 7 Comments · Source: flashlight/wav2letter

Hi, I am trying to get the probability p(token|feats) for each frame in a CTC model, which should be the Emission Set. But when I export the data, it is filled with float numbers that do not look like softmax or logsoftmax results. How are these values calculated, and how can I get the softmax probabilities? Thank you!

I am using 1-librispeech_clean/network.arch, and this is my train.cfg:
```
--lexicon=...
--tokens=...
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=0
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=10
--batchsize=4
--iter=100
```

xubuild

All 7 comments

I don't think wav2letter emissions are softmax'd. If you want a softmax, run a softmax function across the emission values for each frame.

lunixbochs on 4 Mar 2020

Thanks for the reply! I supposed the emissions were the network output, which are the posterior probabilities according to the wiki. If they are not, what are the emissions, and how can I get the probabilities?

xubuild on 5 Mar 2020

The emissions are the network output. They're just not softmax'd. Softmax is a simple normalization function across a list of numbers: it takes exp(n) of each number and divides by the sum of those exponentials, so the results are positive and sum to 1. (It is loosely similar to squaring all of the numbers and dividing by the sum of the squares, but with exp(n) instead of squaring n.)

In python3, with pytorch installed, here's me computing the softmax of 100, 200, 300:

>>> import torch
>>> torch.nn.functional.softmax(torch.tensor([100.0, 200.0, 300.0]), dim=0)
tensor([0.0000e+00, 3.7835e-44, 1.0000e+00])
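
The same normalization can be sketched per frame without PyTorch. This is only an illustration: the emission values are made up, and `log_softmax` and `softmax` are hypothetical helper names, not wav2letter functions.

```python
import math

def log_softmax(frame):
    """Numerically stable log-softmax over one frame of raw emissions."""
    m = max(frame)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(x - m) for x in frame))
    return [x - log_z for x in frame]

def softmax(frame):
    """Per-frame probabilities p(token | frame); they sum to 1."""
    return [math.exp(v) for v in log_softmax(frame)]

# Hypothetical emissions: 3 frames x 4 tokens (values made up for illustration).
emissions = [
    [2.0, -1.0, 0.5, 0.0],
    [-3.0, 4.0, 1.0, 0.2],
    [0.1, 0.1, 0.1, 0.1],
]

# Normalize each frame independently.
posteriors = [softmax(frame) for frame in emissions]
for row in posteriors:
    print([round(p, 4) for p in row])
```

Each printed row is one frame's distribution over tokens, so every row should sum to 1.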

lunixbochs on 5 Mar 2020

If the emission is the unnormalized network output, applying a straightforward softmax would be easy.

I am confused because in the decoder source code, the emission values of the frames are simply added together and used directly as the amScore. I would expect logsoftmax values to be used that way, but the emission values do not look like logsoftmax output.

Is there any doc that explains how the amScore is computed? And should I look into the CTC part? Thanks!

xubuild on 5 Mar 2020

Only so many ways I can say it’s just the raw output from the linear layer, exactly what you’d get before a softmax.

I have reimplemented wav2letter in pytorch with identical network output. This is not a guess. I also used the raw network output without softmax.

I assume the decoder doesn’t use softmax because it works better without softmax. It’s doing more than just taking the highest prediction per frame, even when not decoding (it uses the Viterbi path for greedy decoding).

lunixbochs on 5 Mar 2020

@xubuild

For us, emissions are the output of the network before softmax, so we interpret them as unnormalized log probs. Inside the CTC loss, per-frame normalization is done and logsoftmax is applied.

In the decoder we use the same thing, the output of the network before softmax, because normalization doesn't affect the ranking of hypotheses: the per-frame normalizer does not depend on the chosen token, so it adds the same constant to every path's score. And when you compute the total probability of a transcription, instead of computing p_1 * ... * p_T we compute log p_1 + ... + log p_T, which has much better precision (in our case unnormalized, because the normalization is a global constant, as I said before).
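
Both points can be checked on a toy example. This is a minimal sketch, not the wav2letter decoder: it uses made-up emissions, ignores the lexicon, language model and CTC blank handling, and simply scores every frame-level token path. `log_softmax` and `path_score` are hypothetical helper names.

```python
import itertools
import math

def log_softmax(frame):
    """Numerically stable log-softmax over one frame."""
    m = max(frame)
    log_z = m + math.log(sum(math.exp(x - m) for x in frame))
    return [x - log_z for x in frame]

def path_score(scores, path):
    """Sum the per-frame score of the token chosen at each frame."""
    return sum(scores[t][tok] for t, tok in enumerate(path))

# Hypothetical raw emissions: 3 frames x 3 tokens.
emissions = [
    [1.5, -0.3, 0.2],
    [0.0, 2.1, -1.0],
    [-0.5, 0.4, 3.0],
]
normalized = [log_softmax(frame) for frame in emissions]

# Enumerate every possible token path (3^3 = 27 of them).
paths = list(itertools.product(range(3), repeat=len(emissions)))
offsets = [path_score(emissions, p) - path_score(normalized, p) for p in paths]

# Every path is shifted by the same constant, so the ranking is unchanged.
best_raw = max(paths, key=lambda p: path_score(emissions, p))
best_norm = max(paths, key=lambda p: path_score(normalized, p))
print(best_raw, best_norm)
print(max(offsets) - min(offsets))  # floating-point noise only

# Multiplying many small probabilities underflows; summing logs does not.
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)
print(product, log_sum)
```

The offset between raw and normalized scores is the sum of the per-frame log normalizers, which is why it is identical for every path.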

To sum up: the model outputs are float numbers (unnormalized log p), and if you apply softmax the values will lie between 0 and 1. Depending on what you need, you can simply use the raw model output without losing precision. If you need log p rather than p, apply logsoftmax directly instead of taking the log of a softmax, since logsoftmax is more precise.
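
The precision difference between logsoftmax and log-of-softmax shows up as soon as one token is far less likely than another. A small sketch, assuming a made-up two-token frame with an exaggerated gap (`log_softmax` and `log_of_softmax` are hypothetical helper names):

```python
import math

def log_softmax(frame):
    """Direct log-softmax: stays in log space the whole time."""
    m = max(frame)
    log_z = m + math.log(sum(math.exp(x - m) for x in frame))
    return [x - log_z for x in frame]

def log_of_softmax(frame):
    """Two-step version: softmax first, then log. Loses tiny probabilities."""
    m = max(frame)
    exps = [math.exp(x - m) for x in frame]
    total = sum(exps)
    # A probability that underflowed to 0.0 has no finite log.
    return [math.log(e / total) if e > 0.0 else float("-inf") for e in exps]

frame = [0.0, 1000.0]  # hypothetical frame with a very large gap
direct = log_softmax(frame)
two_step = log_of_softmax(frame)
print(direct)
print(two_step)
```

The direct version keeps a finite log probability for the unlikely token, while the two-step version collapses it to negative infinity because the intermediate probability underflows to zero.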