Transformers: Unit of the prediction scores of a language model

Created on 15 Dec 2019 · 4 comments · Source: huggingface/transformers

I have been using the base transformer models for downstream tasks for a while now, but I haven't had the time to dig into how the models were actually trained. When looking at the *ForMaskedLM models, I can see that the return tuple contains prediction_scores for each token.

prediction_scores: torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

One could get the probabilities of vocabulary items by running a SoftMax over these prediction_scores, but my question is: what are these outputs themselves, what is their unit? In other words: during training, how were these outputs used? Since they are the primary output in the tuple, I suppose they were used in the loss function. At first I expected them to be perplexity, but since they are returned before any softmax (and perplexity is 2^entropy), I don't see how that can be true. Still, these scores seem to be used to get the most likely masked token in the quickstart. So if it's not probability and not perplexity, then what is its unit?
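For reference, here is a minimal sketch of the quickstart pattern the question refers to, assuming `bert-base-uncased` and the tuple-style model outputs of the library at the time (the example sentence and expected prediction are my own, for illustration):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Encode a sentence containing a [MASK] token and locate its position
input_ids = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")
mask_pos = (input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    prediction_scores = model(input_ids)[0]  # (batch, seq_len, vocab_size), pre-softmax

# argmax over the vocabulary axis gives the most likely token at the mask
predicted_id = prediction_scores[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # e.g. "paris"
```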

wontfix

All 4 comments

These are logits, i.e. unnormalized scores for each possible token at the masked position. You can convert them into (normalized) probabilities by taking their softmax. I don't think you can really assign a unit to these scores, in particular because they are not normalized: you can add any constant to all of these scores (as long as it's the same value for every token in the vocabulary) and still get the same probabilities after applying a softmax.
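A toy sketch (with made-up scores) demonstrating this shift invariance:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 5-token vocabulary at one masked position
logits = torch.tensor([2.0, -1.0, 0.5, 3.0, 0.0])

probs = F.softmax(logits, dim=-1)
shifted = F.softmax(logits + 7.3, dim=-1)  # same constant added to every score

print(torch.allclose(probs, shifted))  # True: the probabilities are unchanged
print(probs.sum().item())              # 1.0: softmax yields a distribution
```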

We could apply a softmax inside the model, but if you only want to compute the argmax, for instance (the most likely token), you can use these outputs directly, so we don't want to force additional compute on people who don't need it.
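That works because softmax is strictly increasing, so the ranking of tokens is already fixed by the raw scores; a small sketch with made-up values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5, 3.0, 0.0])  # made-up scores

# softmax preserves the ordering of the scores, so taking the argmax of
# the raw logits skips the normalization entirely.
assert logits.argmax() == F.softmax(logits, dim=-1).argmax()
```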

During training we don't use these outputs directly; we use the cross-entropy loss. The cross-entropy loss is obtained by first taking the logarithm of the softmax of these scores (the log-probabilities) and then the negative log-likelihood of the target labels under this distribution. This is computed in one step by torch.nn.CrossEntropyLoss and returned as the loss of the model, which is the first element of the tuple when you supply masked_lm_labels to an XXXForMaskedLM model.
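A small sketch of that equivalence on made-up scores (in the real models, positions that should not contribute to the loss are typically excluded via CrossEntropyLoss's ignore_index):

```python
import torch
import torch.nn.functional as F

vocab_size, num_positions = 10, 4
scores = torch.randn(num_positions, vocab_size)          # made-up prediction scores
labels = torch.randint(0, vocab_size, (num_positions,))  # made-up target token ids

# One-step form, as used inside the *ForMaskedLM models
loss = F.cross_entropy(scores, labels)

# Equivalent two-step form: log-softmax, then negative log-likelihood
log_probs = F.log_softmax(scores, dim=-1)
loss_two_step = F.nll_loss(log_probs, labels)

assert torch.allclose(loss, loss_two_step)
```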

Hey @thomwolf, thanks for taking the time to help me better understand the internals of these language models! Coincidentally, I was reading through an excellent article by Chip Huyen (@chiphuyen): https://thegradient.pub/understanding-evaluation-metrics-for-language-models/

But if I understand you correctly, perplexity is not used in practice as a training metric. Instead, CrossEntropyLoss is used, treating the MLM problem as a classification problem over C classes, where C is the size of the vocabulary, correct? The labels would then (presumably, internally) be one-hot encoded over the vocabulary, with a single 1 at the expected token?
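On the label encoding: PyTorch's CrossEntropyLoss doesn't take a one-hot vector but the index of the correct class, which amounts to the same thing. A sketch with made-up values (the perplexity line is an aside, not something the models compute during training):

```python
import torch
import torch.nn as nn

vocab_size = 10
scores = torch.randn(1, vocab_size)  # made-up scores for one masked position
target = torch.tensor([3])           # the expected token's id, not a one-hot vector

# CrossEntropyLoss takes the class index directly; the one-hot view is implicit,
# since the loss simply picks out the log-probability at that index.
loss = nn.CrossEntropyLoss()(scores, target)

# Perplexity, if wanted, is recoverable as the exponential of this loss
# (natural-log base), even though it isn't optimized directly.
print(loss.exp().item())
```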

For some reason I always thought that MLM involved perplexity or multi-label classification, i.e. where a mask could have multiple correct tokens. I'm glad to now have a better understanding, so thanks again for your time.

Yeah, this article by @chiphuyen is really great; I keep sharing it. I hope she writes more NLP articles in the future 😄

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
