I tried experimenting with the --sampling-topp hyperparameter:
python interactive.py test/ --path checkpoints/models_anv/checkpoint_best.pt --source-lang en --target-lang hi --nbest 5 --sampling --sampling-topp 0.1
python interactive.py test/ --path checkpoints/models_anv/checkpoint_best.pt --source-lang en --target-lang hi --nbest 5 --sampling --sampling-topp 0.9
I am not able to understand the outputs. When I use p = 0.1, all 5 of my best outputs are identical, with score
H-0 -1.0333465788368796
When I use p = 0.9, I get different outputs, but the best score is
H-0 -1.2899561307704726
which is worse than both the p = 0.1 output and the beam search output.
Can anyone tell me what I am missing about the fundamentals of top-p (nucleus) sampling?
And what exactly does this mean in the documentation:
"sample from the smallest set whose cumulative probability mass exceeds p for next words"
Suppose the model predicts the following probability distribution for the next word:
token  prob
a      0.40
b      0.20
c      0.15
d      0.10
e      0.06
f      0.01
...
When you do --sampling-topp=0.1 then you're going to sample from the top 10% of the probability mass. In this case the first candidate (a) covers 40% of the probability mass so you'll always sample a.
When you do --sampling-topp=0.9 then you're going to sample from the top 90% of the probability mass. In this case you'll sample from a-e, which covers 91% of the mass.
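The selection rule above can be sketched in a few lines of Python. This is a toy illustration of the candidate-set construction, not fairseq's actual implementation (the function name `top_p_candidates` is made up for this example): sort tokens by probability and keep the smallest prefix whose cumulative mass exceeds p.

```python
def top_p_candidates(probs, p):
    """Return the tokens kept by nucleus sampling with threshold p.

    probs: dict mapping token -> probability (assumed to sum to ~1).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative > p:  # smallest set whose mass exceeds p
            break
    return kept

dist = {"a": 0.4, "b": 0.2, "c": 0.15, "d": 0.10, "e": 0.06, "f": 0.01}
print(top_p_candidates(dist, 0.1))  # ['a'] -> sampling is deterministic
print(top_p_candidates(dist, 0.9))  # ['a', 'b', 'c', 'd', 'e']
```

With p = 0.1 the candidate set is just {a} at every step, which is why all 5 hypotheses come out identical; with p = 0.9 lower-probability tokens can be sampled, so the score of a sampled hypothesis can be worse than the beam search one.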
Does that make sense?
yes, thanks @myleott