I tried to fine-tune the pre-trained models with the language modeling example. I got a perplexity of 6.12 with the DistilBERT model, which is much lower than the GPT-2 model (ppl: 18.34, ref: https://paperswithcode.com/sota/language-modelling-on-wikitext-2).
Does the DistilBERT model really work that much better than GPT-2, or is it just because the loss functions are different?
Here are the commands:
# 1) code:
git clone https://github.com/huggingface/transformers.git
# 2) Download dataset:
cd transformers/examples/
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
# 3) Benchmark:
## distilbert:
export TRAIN_FILE=./wikitext-2-raw/wiki.train.raw
export TEST_FILE=./wikitext-2-raw/wiki.test.raw
CUDA_VISIBLE_DEVICES=6 python run_language_modeling.py \
--output_dir=output_distilbert \
--model_type=distilbert \
--model_name_or_path=distilbert-base-uncased \
--do_train \
--per_gpu_train_batch_size 15 \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
Result:
04/22/2020 13:25:33 - INFO - __main__ - ***** Running evaluation *****
04/22/2020 13:25:33 - INFO - __main__ - Num examples = 535
04/22/2020 13:25:33 - INFO - __main__ - Batch size = 4
Evaluating: 100%|██████████| 134/134 [00:05<00:00, 24.02it/s]
04/22/2020 13:25:38 - INFO - __main__ - ***** Eval results *****
04/22/2020 13:25:38 - INFO - __main__ - perplexity = tensor(6.1200)
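For what it's worth, the reported value is just the exponential of the average evaluation loss, which is why it is printed as a tensor. A minimal illustration of that last step (the eval loss below is back-computed from the reported perplexity, not read from the log):
import torch
eval_loss = 1.8116                               # assumed average masked-LM loss; ln(6.12) ≈ 1.8116
perplexity = torch.exp(torch.tensor(eval_loss))  # same final step as the script's eval
print(perplexity)                                # ≈ tensor(6.1200)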
@VictorSanh can correct me if I'm wrong, but in general, the perplexity for masked language models like BERT is much lower than the perplexity for causal language models.
That's because the "MLM perplexity" isn't the actual perplexity (in the sense of the probability of a sentence, computed from next-word predictions), but rather a "masked perplexity" that is computed differently: each masked token is scored with the full bidirectional context around it, which is an easier prediction task.
It isn't surprising to me that you obtain ~6 perplexity on WikiText-2 with DistilBERT.
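To make the difference concrete, here is a minimal sketch of the two computations. This is illustrative code, not what run_language_modeling.py does internally (the script masks 15% of tokens at random rather than one at a time), the sentence is just a placeholder, and it assumes a transformers version whose model outputs expose .loss:
import math
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForMaskedLM)

sentence = "The quick brown fox jumps over the lazy dog."

# Causal perplexity (GPT-2 style): exp of the average loss of predicting
# each token from the tokens to its left only.
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
clm.eval()
ids = clm_tok(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    clm_loss = clm(ids, labels=ids).loss
print("causal perplexity:", math.exp(clm_loss.item()))

# "Masked" (pseudo-)perplexity (BERT style): mask one token at a time and
# score it with the full bidirectional context, then exponentiate the mean loss.
mlm_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
mlm.eval()
ids = mlm_tok(sentence, return_tensors="pt").input_ids
losses = []
with torch.no_grad():
    for i in range(1, ids.size(1) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = mlm_tok.mask_token_id
        labels = torch.full_like(ids, -100)      # -100 = ignored by the loss
        labels[0, i] = ids[0, i]                 # only score the masked position
        losses.append(mlm(masked, labels=labels).loss.item())
print("masked pseudo-perplexity:", math.exp(sum(losses) / len(losses)))
Because every masked prediction sees both the left and the right context, the masked number comes out much lower than the causal one, so the two are not comparable.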
Thank you!