Fairseq: How to evaluate a RoBERTa base checkpoint on language modeling?

Created on 29 Oct 2019 · 13 comments · Source: pytorch/fairseq

I trained the RoBERTa base model on a custom dataset, and I am trying to evaluate the checkpoints generated during pre-training on the language modeling task over my test set.
If I run fairseq-eval-lm with the saved checkpoint file as the --path value and the default language_modeling task, it fails with the following stack trace:

Traceback (most recent call last):
File "/home/dm4511/anaconda3/envs/vision/bin/fairseq-eval-lm", line 11, in
load_entry_point('fairseq', 'console_scripts', 'fairseq-eval-lm')()
File "/scratch/dm4511/IS_NLP/fairseq/fairseq_cli/eval_lm.py", line 223, in cli_main
main(args)
File "/scratch/dm4511/IS_NLP/fairseq/fairseq_cli/eval_lm.py", line 62, in main
task=task,
File "/scratch/dm4511/IS_NLP/fairseq/fairseq/checkpoint_utils.py", line 167, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
File "/scratch/dm4511/IS_NLP/fairseq/fairseq/checkpoint_utils.py", line 186, in load_model_ensemble_and_task
model.load_state_dict(state['model'], strict=True)
File "/scratch/dm4511/IS_NLP/fairseq/fairseq/models/fairseq_model.py", line 69, in load_state_dict
return super().load_state_dict(state_dict, strict)
File "/home/dm4511/anaconda3/envs/vision/lib/python3.6/site-packages/torch/nn/modules/module.py", line 845, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaModel:
size mismatch for decoder.sentence_encoder.embed_tokens.weight: copying a param with shape torch.Size([50265, 768]) from checkpoint, the shape in current model is torch.Size([50264, 768]).
size mismatch for decoder.lm_head.weight: copying a param with shape torch.Size([50265, 768]) from checkpoint, the shape in current model is torch.Size([50264, 768]).
size mismatch for decoder.lm_head.bias: copying a param with shape torch.Size([50265]) from checkpoint, the shape in current model is torch.Size([50264]).
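
For debugging, one way to confirm the off-by-one is to inspect the checkpoint directly. A minimal sketch (the parameter key comes from the trace above; the checkpoint path is an example):

    import torch

    # Load the checkpoint on CPU and print the token-embedding shape. 50265 rows
    # in the checkpoint vs. a 50264-entry dictionary at eval time suggests one
    # symbol (e.g. RoBERTa's <mask>) is missing from the eval-side dictionary.
    state = torch.load('checkpoints/checkpoint_best.pt', map_location='cpu')
    print(state['model']['decoder.sentence_encoder.embed_tokens.weight'].shape)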

All 13 comments

Hi,

I am facing the same issue when evaluating a trained model.

I am running this command:
fairseq-eval-lm data-bin/CM_data \
--path checkpoints/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024

My logs:

Namespace(add_bos_token=False, bpe=None, context_window=2560, cpu=False, criterion='cross_entropy', data='data-bin/CM_data',
dataset_impl=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None,
future_target=False, gen_subset='test', lazy_load=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1,
max_sentences=None, max_target_positions=None, max_tokens=3072, memory_efficient_fp16=False, min_loss_scale=0.0001,
model_overrides='{}', momentum=0.99, no_progress_bar=False, num_shards=1, num_workers=1, optimizer='nag',
output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, past_target=False,
path='checkpoints/checkpoint_best.pt', quiet=False, raw_text=False, remove_bpe=None, required_batch_size_multiple=8,
results_path=None, sample_break_mode='complete', seed=1, self_target=False, shard_id=0, skip_invalid_size_inputs_valid_test=False,
softmax_batch=1024, task='language_modeling', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None,
tokenizer=None, tokens_per_sample=1024, user_dir=None, warmup_updates=0, weight_decay=0.0)
| dictionary: 634704 types
| loading model(s) from checkpoints/checkpoint_best.pt
Traceback (most recent call last):
File "/home/parul09chopra/miniconda3/bin/fairseq-eval-lm", line 8, in
sys.exit(cli_main())
File "/home/parul09chopra/miniconda3/lib/python3.7/site-packages/fairseq_cli/eval_lm.py", line 219, in cli_main
main(args)
File "/home/parul09chopra/miniconda3/lib/python3.7/site-packages/fairseq_cli/eval_lm.py", line 62, in main
task=task,
File "/home/parul09chopra/miniconda3/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 155, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
File "/home/parul09chopra/miniconda3/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 173, in load_model_ensemble_and
_task
model = task.build_model(args)
File "/home/parul09chopra/miniconda3/lib/python3.7/site-packages/fairseq/tasks/language_modeling.py", line 140, in build_model
raise ValueError('Unsupported language modeling target: {}'.format(target))
ValueError: Unsupported language modeling target: future

What can be done to evaluate and find perplexity on the test set using a trained RoBERTa model?

I also get errors when trying to evaluate perplexity on a RoBERTa model. Are there any updates on this issue?

When running the CLI as listed in the repo, I get:
ValueError: Unsupported language modeling target: future

When I add --task masked_lm --criterion masked_lm I get an error that my dataset isn't Monolingual. I'm not sure what that means in the context of these tests either.

If I remove --context-window (defaulting it to 0), I get this error:

line 157, in main
    if args.add_bos_token:
AttributeError: 'Namespace' object has no attribute 'add_bos_token'
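
A hedged workaround for the missing attribute (an assumption on my part, not an upstream fix) is to guard the lookup so checkpoints saved before the option existed fall back to a default:

    # In fairseq_cli/eval_lm.py (sketch): old checkpoints whose stored training
    # args lack add_bos_token default to False instead of raising AttributeError.
    add_bos = getattr(args, 'add_bos_token', False)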

Here is what I tried (a sketch of these edits follows the list):

  1. Changed the default target to "self", which fixed the error faced by PC09 (https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/language_modeling.py#L137).
  2. Added the --add-bos-token argument to the eval_lm.py script and removed "add_bos_token" from the list of filtered args (https://github.com/pytorch/fairseq/blob/master/eval_lm.py#L68).
  3. Added the line self.mask_idx = dictionary.add_symbol('<mask>') at https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/language_modeling.py#L92. This is because eval_lm.py builds a model with an embedding layer for a dictionary of size 50264, whereas the saved checkpoint has an embedding layer for a dictionary of size 50265.
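
A minimal sketch of those three edits, against the fairseq master of the time (file placement and line numbers are approximate and illustrative, not the upstream API):

    # (1) fairseq/tasks/language_modeling.py, build_model(): default the target
    #     to "self" instead of "future" so eval-lm can build a RoBERTa model.
    # (2) fairseq_cli/eval_lm.py: add --add-bos-token to the parser and stop
    #     filtering "add_bos_token" out of the args restored from the checkpoint.
    # (3) fairseq/tasks/language_modeling.py, where the dictionary is loaded:
    #     register <mask> so the eval-side dictionary grows to 50265 entries
    #     and matches the checkpoint's embedding rows.
    self.mask_idx = dictionary.add_symbol('<mask>')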

This got the script to run on my saved RoBERTa models, but I still hit an error after running on a few samples.

/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = c10::Half, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [224,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCReduceAll.cuh line=327 error=59 : device-side assert triggered

I tried debugging, and from what I could understand it looks like an out-of-bounds index access, so the issue could be with my input.

It would be great if someone else could try and let me know if the above steps got it working for them...

@DikshaMeghwal I tried running it with your modifications. Do you set add_bos_token to False? I get an error here: https://github.com/pytorch/fairseq/blob/master/eval_lm.py#L158

fairseq-eval-lm --fp16 $DATA_DIR --path $ROBERTA_PATH \
--task language_modeling \
--sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--max-sentences $MAX_SENTENCES \
--add-bos-token \
--log-format simple --log-interval 1

I got that script to run if I avoid the last batch of data; otherwise I get the same error you do, @DikshaMeghwal.

However, I am looking to evaluate on masked language modeling; the perplexity score from language_modeling isn't useful for me. My workaround has been to modify the training script, ignore all of its training aspects, and produce validation metrics (loss, ppl, etc.) on data of my choosing. This is a temporary solution for me.
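
A rough sketch of that workaround (illustrative only: it mirrors fairseq's validation pass rather than any official entry point, and assumes task, model, criterion, and batch_iterator were already built the usual fairseq way, e.g. via checkpoint_utils.load_model_ensemble_and_task and task.get_batch_iterator):

    import math
    import torch

    # Run only the validation half of the training loop on a split of your choice.
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for sample in batch_iterator:
            # fairseq criterions return (loss, sample_size, logging_output);
            # loss is assumed here to be the summed NLL over the batch, in nats.
            loss, sample_size, _ = criterion(model, sample)
            total_nll += loss.item()
            total_tokens += sample_size

    # Per-token perplexity under the summed-NLL assumption above.
    print('ppl: {:.2f}'.format(math.exp(total_nll / total_tokens)))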

@kristjanArumae Oh right! This can be a workaround for me too, thanks for the tip.

@kristjanArumae Can you explain a bit more or share modified code for the workaround you mentioned?

If you just want to compute validation perplexity you can use the validate.py script: https://github.com/pytorch/fairseq/blob/master/validate.py

Something like this should work:

python -m fairseq_cli.validate /path/to/data-bin --path /path/to/roberta.large/model.pt --task masked_lm

I don't want to validate for masked language modeling. I want to evaluate for language modeling. What is the best way to evaluate a saved RoBERTa checkpoint on a language modeling task?

You mean predict tokens left-to-right? The model isn't trained to do that; you'd need to train a language model for that purpose (or otherwise somehow adapt the model to predict left-to-right).

Using
python validate.py /path/to/data-bin --path /path/to/roberta.large/model.pt --task masked_lm
I get:
NotImplementedError: Unable to infer Criterion arguments, please implement MaskedLmLoss.build_criterion
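
The error message points at the fix; a minimal sketch of what it asks for (the (args, task) signature and constructor call are assumed from FairseqCriterion conventions of that era, not taken from the library):

    # fairseq/criterions/masked_lm.py (sketch): give MaskedLmLoss an explicit
    # factory so fairseq no longer has to introspect its constructor arguments,
    # which is what raises the NotImplementedError.
    @classmethod
    def build_criterion(cls, args, task):
        return cls(args, task)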

I've encountered AttributeError: 'Namespace' object has no attribute 'add_bos_token' when evaluating a saved RoBERTa checkpoint on a language modeling task. I'm on the latest version, 0.9.0.

When attempting to pass the flag, it throws fairseq-eval-lm: error: unrecognized arguments: --add-bos-token. add_bos_token is not specified in the latest docs.
