Transformers: ❓ DistilBert test perplexity based on WikiText-2: ppl is too low?

Created on 23 Apr 2020 · 2 Comments · Source: huggingface/transformers

โ“ Questions & Help

Details

I fine-tuned the pre-trained DistilBert model and got a perplexity of 6.12, which is much lower than that of the GPT-2 model (ppl: 18.34, ref: https://paperswithcode.com/sota/language-modelling-on-wikitext-2).

Does the DistilBert model really work much better than GPT-2, or is it just because the loss functions are different?

Here are the commands:

# 1) code:
git clone https://github.com/huggingface/transformers.git

# 2) Download dataset:
cd transformers/examples/
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip

# 3) Benchmark:
## distilbert:
export TRAIN_FILE=./wikitext-2-raw/wiki.train.raw
export TEST_FILE=./wikitext-2-raw/wiki.test.raw
CUDA_VISIBLE_DEVICES=6 python run_language_modeling.py \
    --output_dir=output_distilbert \
    --model_type=distilbert \
    --model_name_or_path=distilbert-base-uncased \
    --do_train \
    --per_gpu_train_batch_size 15 \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
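
For a direct comparison with the causal-LM numbers, the same script can be pointed at GPT-2 by dropping the --mlm flag. This is a hedged sketch rather than a run from the report: the flags mirror the DistilBERT command above, and the batch size is an arbitrary choice.

## gpt2 (causal LM, no --mlm):
CUDA_VISIBLE_DEVICES=6 python run_language_modeling.py \
    --output_dir=output_gpt2 \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --per_gpu_train_batch_size 4 \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE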

Result:

04/22/2020 13:25:33 - INFO - __main__ -   ***** Running evaluation  *****
04/22/2020 13:25:33 - INFO - __main__ -     Num examples = 535
04/22/2020 13:25:33 - INFO - __main__ -     Batch size = 4
Evaluating: 100%|██████████| 134/134 [00:05<00:00, 24.02it/s]
04/22/2020 13:25:38 - INFO - __main__ -   ***** Eval results  *****
04/22/2020 13:25:38 - INFO - __main__ -     perplexity = tensor(6.1200)
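
For reference, the perplexity the script prints is just the exponential of the mean cross-entropy over the masked tokens (the eval loss), so 6.12 corresponds to roughly 1.81 nats per masked prediction. A minimal sketch of that relationship; the loss value below is back-derived from the reported 6.12, not taken from the log:

import math

eval_loss = 1.8116           # hypothetical mean masked-token cross-entropy in nats; ln(6.12) ≈ 1.8116
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 6.12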


All 2 comments

@VictorSanh can correct me if I'm wrong, but in general, the perplexity for masked language models like BERT is much lower than the perplexity for causal language models.

That's because the "MLM perplexity" isn't the actual perplexity (in the sense of the probability of a sentence, computed from next-word predictions), but rather a "masked perplexity", which is computed differently: the model only has to predict the ~15% of tokens that are masked, and it sees the full bidirectional context around each of them, so the task is much easier than left-to-right prediction. The two numbers are therefore not comparable.

It isn't surprising to me that you obtain ~6 perplexity on WikiText-2 with DistilBERT.
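
To make "computed differently" concrete, here is a rough, hedged sketch of the two objectives using the transformers API (it assumes a recent transformers/torch install; the model names match the runs discussed above, the 15% masking rate is BERT's default, and the example sentence is arbitrary). The MLM loss is averaged only over the masked positions, while the causal loss is averaged over every next-token prediction:

import torch
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

text = "The quick brown fox jumps over the lazy dog."

# Masked-LM "perplexity" (what run_language_modeling.py --mlm reports):
# mask ~15% of the tokens and score only those positions.
mlm_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
enc = mlm_tok(text, return_tensors="pt")
labels = enc["input_ids"].clone()
mask = torch.rand(labels.shape) < 0.15
mask[0, 1] = True                      # make sure at least one token is masked
inputs = enc["input_ids"].clone()
inputs[mask] = mlm_tok.mask_token_id
labels[~mask] = -100                   # positions set to -100 are ignored by the loss
mlm_loss = mlm(input_ids=inputs, attention_mask=enc["attention_mask"], labels=labels).loss
print("MLM ppl:", torch.exp(mlm_loss).item())

# Causal-LM perplexity (what the GPT-2 WikiText-2 numbers measure):
# every token must be predicted from the tokens to its left.
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
enc = clm_tok(text, return_tensors="pt")
clm_loss = clm(**enc, labels=enc["input_ids"]).loss
print("CLM ppl:", torch.exp(clm_loss).item())

Note that a real --mlm run also replaces some selected tokens with random tokens or leaves them unchanged, and never masks special tokens; the sketch skips those details.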

Thank you!
