Fairseq: Help with --sampling-topp hyperparameter?

Created on 14 Feb 2020 · 2 comments · Source: pytorch/fairseq

I tried experimenting with the --sampling-topp hyperparameter:
python interactive.py test/ --path checkpoints/models_anv/checkpoint_best.pt --source-lang en --target-lang hi --nbest 5 --sampling --sampling-topp 0.1
python interactive.py test/ --path checkpoints/models_anv/checkpoint_best.pt --source-lang en --target-lang hi --nbest 5 --sampling --sampling-topp 0.9

I am not able to understand the outputs. When I use p = 0.1, all 5 of my n-best outputs are identical, with score
H-0 -1.0333465788368796

When I use p = 0.9, I get different outputs, but the best score is
H-0 -1.2899561307704726
which is worse than both the p = 0.1 output and the beam-search output.

Can anyone tell me what I am missing about the fundamentals of top-p (nucleus) sampling?
And what exactly does this mean in the documentation:
"sample from the smallest set whose cumulative probability mass exceeds p for next words"

question

Most helpful comment

Suppose the model predicts the following probability distribution for the next word:

token  prob
a      0.4
b      0.2
c      0.15
d      0.10
e      0.06
f      0.01
...

When you do --sampling-topp=0.1 then you're going to sample from the top 10% of the probability mass. In this case the first candidate (a) covers 40% of the probability mass so you'll always sample a.

When you do --sampling-topp=0.9 then you're going to sample from the top 90% of the probability mass. In this case you'll sample from a-e, which covers 91% of the mass.
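The filtering step described above can be sketched in NumPy. This is a minimal illustration of the idea, not fairseq's actual implementation; the function name `top_p_filter` and the renormalization step are my own choices:

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p, then renormalize so the kept probabilities sum to 1."""
    order = np.argsort(probs)[::-1]           # token indices, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # first position where the cumulative mass reaches p; keep everything up to it
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return kept, kept_probs

# the distribution from the table above (tokens a-f as indices 0-5)
probs = np.array([0.4, 0.2, 0.15, 0.10, 0.06, 0.01])

kept, kept_probs = top_p_filter(probs, 0.1)
# p=0.1: only token a survives, so sampling always returns a

kept, kept_probs = top_p_filter(probs, 0.9)
# p=0.9: tokens a-e survive (cumulative mass 0.91); sample from them
next_token = np.random.choice(kept, p=kept_probs)
```

With p = 0.1 the kept set is just {a}, which is why all n-best hypotheses come out identical; with p = 0.9 the sampler draws from {a, ..., e}, so lower-probability continuations (and hence lower scores than beam search) are expected.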

Does that make sense?

All 2 comments


yes, thanks @myleott
