I tried to fine-tune the pre-trained models with the language modeling example. I got a perplexity of 6.12 with the DistilBERT model, which is much lower than the GPT-2 model (ppl: 18.34, ref: https://paperswithcode.com/sota/language-modelling-on-wikitext-2).
Does the DistilBERT model really work that much better than GPT-2, or is it just because the loss functions are different?
Here are the commands:
# 1) code:
git clone https://github.com/huggingface/transformers.git
# 2) Download dataset:
cd transformers/examples/
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
# 3) Benchmark:
## distilbert:
export TRAIN_FILE=./wikitext-2-raw/wiki.train.raw
export TEST_FILE=./wikitext-2-raw/wiki.test.raw
CUDA_VISIBLE_DEVICES=6 python run_language_modeling.py \
--output_dir=output_distilbert \
--model_type=distilbert \
--model_name_or_path=distilbert-base-uncased \
--do_train \
--per_gpu_train_batch_size 15 \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
Result:
04/22/2020 13:25:33 - INFO - __main__ - ***** Running evaluation *****
04/22/2020 13:25:33 - INFO - __main__ - Num examples = 535
04/22/2020 13:25:33 - INFO - __main__ - Batch size = 4
Evaluating: 100%|██████████| 134/134 [00:05<00:00, 24.02it/s]
04/22/2020 13:25:38 - INFO - __main__ - ***** Eval results *****
04/22/2020 13:25:38 - INFO - __main__ - perplexity = tensor(6.1200)
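For what it's worth, the reported value is just the exponential of the average evaluation loss, which is why it is printed as a tensor. A minimal illustration of that last step (the eval loss below is back-computed from the reported perplexity, not read from the log):
import torch
eval_loss = 1.8116                               # assumed average masked-LM loss; ln(6.12) ≈ 1.8116
perplexity = torch.exp(torch.tensor(eval_loss))  # same final step as the script's eval
print(perplexity)                                # ≈ tensor(6.1200)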
@VictorSanh can correct me if I'm wrong, but in general, the perplexity for masked language models like BERT is much lower than the perplexity for causal language models.
That's because the "MLM perplexity" isn't the actual perplexity (in the sense of the probability of a sentence, computed from next-word predictions), but rather a "masked perplexity" that is computed differently: each masked token is scored with the full bidirectional context around it, which is an easier prediction task.
It isn't surprising to me that you obtain ~6 perplexity on WikiText-2 with DistilBERT.
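To make the difference concrete, here is a minimal sketch of the two computations. This is illustrative code, not what run_language_modeling.py does internally (the script masks 15% of tokens at random rather than one at a time), the sentence is just a placeholder, and it assumes a transformers version whose model outputs expose .loss:
import math
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForMaskedLM)

sentence = "The quick brown fox jumps over the lazy dog."

# Causal perplexity (GPT-2 style): exp of the average loss of predicting
# each token from the tokens to its left only.
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
clm.eval()
ids = clm_tok(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    clm_loss = clm(ids, labels=ids).loss
print("causal perplexity:", math.exp(clm_loss.item()))

# "Masked" (pseudo-)perplexity (BERT style): mask one token at a time and
# score it with the full bidirectional context, then exponentiate the mean loss.
mlm_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
mlm.eval()
ids = mlm_tok(sentence, return_tensors="pt").input_ids
losses = []
with torch.no_grad():
    for i in range(1, ids.size(1) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = mlm_tok.mask_token_id
        labels = torch.full_like(ids, -100)      # -100 = ignored by the loss
        labels[0, i] = ids[0, i]                 # only score the masked position
        losses.append(mlm(masked, labels=labels).loss.item())
print("masked pseudo-perplexity:", math.exp(sum(losses) / len(losses)))
Because every masked prediction sees both the left and the right context, the masked number comes out much lower than the causal one, so the two are not comparable.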
Thank you!