Wav2letter: question about CTC Emission Set value and probability

Created on 4 Mar 2020 · 7 Comments · Source: flashlight/wav2letter

Hi, I am trying to get the probability p(token|feats) for each frame in a CTC model, which should be the Emission Set. But when I export the data, it is filled with floating-point numbers that do not look like a softmax or logsoftmax result. How are these values calculated, and how can I get the softmax probabilities? Thank you!

I am using 1-librispeech_clean/network.arch, and this is my train.cfg:
`
--lexicon=...
--tokens=...
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=0
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=10
--batchsize=4
--iter=100
`

All 7 comments

I don't think wav2letter emissions are softmax'd. If you want probabilities, run a softmax function across the emission values for each frame.

Thanks for the reply! I supposed the emissions are the network output, which according to the wiki are the posterior probabilities. If they are not, what are the emissions, and how can I get the probabilities?

The emissions are the network output. They're just not softmax'd. Softmax is a simple normalization function across a list of numbers: exponentiate each number, then divide by the sum of the exponentials. (It is similar in spirit to squaring all of the numbers and dividing by the sum of the squares, but with exp(n) instead of n squared.)

In Python 3, with PyTorch installed, here's me computing the softmax of 100, 200, 300:

>>> import torch
>>> torch.nn.functional.softmax(torch.tensor([100.0, 200.0, 300.0]), dim=0)
tensor([0.0000e+00, 3.7835e-44, 1.0000e+00])
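
For a whole utterance, the same normalization is applied frame by frame. Here is a minimal sketch in plain Python, assuming the exported emissions have been loaded into a list of lists (one row of raw scores per frame). The values, the variable name `emissions`, and the helper `log_softmax` are made up for illustration and are not part of wav2letter.

```python
import math

def log_softmax(frame):
    """Numerically stable log-softmax over one frame of raw scores."""
    m = max(frame)
    log_z = m + math.log(sum(math.exp(x - m) for x in frame))
    return [x - log_z for x in frame]

# Hypothetical emissions: 2 frames, 3 tokens each (raw, unnormalized scores).
emissions = [
    [2.0, -1.0, 0.5],
    [-3.0, 4.0, 1.0],
]

log_probs = [log_softmax(frame) for frame in emissions]
probs = [[math.exp(lp) for lp in frame] for frame in log_probs]

for t, frame in enumerate(probs):
    # Each row is now a distribution over tokens for frame t and sums to 1.
    print(t, [round(p, 4) for p in frame], round(sum(frame), 6))
```

Softmax is monotonic, so the highest-scoring token in each frame is the same before and after normalization.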

If the emission is the unnormalized network output, a straightforward softmax would be easy.

I am confused because in the decoder source code, the emission values of each frame are simply added together and used directly as the amScore. I would expect a logsoftmax value to be used like this, but the emission values do not look like one.

Is there any documentation explaining how the amScore is computed? And should I look into the CTC part? Thanks!

Only so many ways I can say it’s just the raw output from the linear layer, exactly what you’d get before a softmax.

I have reimplemented wav2letter in pytorch with identical network output. This is not a guess. I also used the raw network output without softmax.

I assume the decoder doesn’t use softmax because it works better without softmax. It’s doing more than just using the highest prediction per frame, even when not decoding (it uses the Viterbi path for greedy decoding).

@xubuild

Emissions for us are the output of the network before softmax, so we interpret them as unnormalized log-probabilities. The per-frame normalization is done inside the CTC loss, where logsoftmax is applied.

In the decoder we use the same thing, the output of the network before softmax, because normalization doesn't affect the final ranking: it adds the same constant to every candidate's score, so it is irrelevant to the optimization in the decoder. Also, to compute the total probability of a transcription, instead of computing p_1 * ... * p_T we compute log p_1 + ... + log p_T, which has much better precision (in our case the terms are unnormalized, because the normalization is a global constant, as I said before).

To sum up: you can check that the model outputs are floating-point numbers (unnormalized log p), and that if you apply softmax they fall between 0 and 1. Depending on what you need, you can simply use the raw model output without losing precision. If you need log p rather than p, apply logsoftmax directly instead of taking the log of the softmax output, because logsoftmax is more precise.
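
The precision point can be seen with a deliberately extreme, made-up frame. This is an illustrative sketch in plain Python, with hypothetical helper names and nothing taken from wav2letter: taking the log after softmax loses the small probability to underflow, while logsoftmax keeps it.

```python
import math

def softmax(frame):
    """Softmax with the usual max-subtraction for stability."""
    m = max(frame)
    exps = [math.exp(x - m) for x in frame]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(frame):
    """Log-softmax computed directly in the log domain."""
    m = max(frame)
    log_z = m + math.log(sum(math.exp(x - m) for x in frame))
    return [x - log_z for x in frame]

frame = [0.0, 1000.0]  # hypothetical raw scores with a very large gap

# Route 1: softmax, then log. The small probability underflows to 0.0,
# so its log is -inf (math.log(0.0) would raise, hence the guard).
via_probs = [math.log(p) if p > 0.0 else float("-inf") for p in softmax(frame)]

# Route 2: log-softmax directly. The small log-probability stays finite.
direct = log_softmax(frame)

print(via_probs)  # [-inf, 0.0]
print(direct)     # [-1000.0, 0.0]
```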

Feel free to reopen if you still have questions on this.
