Fairseq: RuntimeError: Expected object of backend CPU but got backend CUDA for argument #4 'mat1'

Created on 7 Aug 2019 · 6 Comments · Source: pytorch/fairseq

When I train starting from the pretrained model that is already provided, this RuntimeError doesn't happen.

However, when I train starting from a model I saved after some training, the error appears.

Has anybody seen this error and solved it?

Here is my command for running training.

CUDA_VISIBLE_DEVICES=0 python train.py {$data_dir} \
--restore-file {$pretrained_model_path} \
--max-positions 512 \
--max-sentences 32 \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_base \
--criterion sentence_prediction --num-classes 3 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr 1e-5 \
--total-num-update 110000 --warmup-updates 6600 \
--threshold-loss-scale 1 \
--max-epoch 1 \
--find-unused-parameters --truncate-sequence \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--save-dir {$save_dir}

All 6 comments

Can you please add the full stack trace? That helps in finding the issue faster.

This is the full stacktrace.

| epoch 001:   0%|                                                             | 0/428487 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 325, in <module>
    cli_main()
  File "train.py", line 321, in cli_main
    main(args)
  File "train.py", line 80, in main
    train(args, trainer, task, epoch_itr)
  File "train.py", line 121, in train
    log_output = trainer.train_step(samples)
  File "/home/sam/fairseq/fairseq/trainer.py", line 287, in train_step
    raise e
  File "/home/sam/fairseq/fairseq/trainer.py", line 264, in train_step
    ignore_grad
  File "/home/sam/fairseq/fairseq/tasks/fairseq_task.py", line 230, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sam/fairseq/fairseq/criterions/sentence_prediction.py", line 43, in forward
    padding_mask=padding_mask,
  File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sam/fairseq/fairseq/models/roberta/model.py", line 183, in forward
    x = self.dense(x)
  File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/functional.py", line 1406, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #4 'mat1'
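
For anyone reading the trace: the failure is in torch.addmm(bias, input, weight.t()), called from self.dense(x) in the RoBERTa model code. One plausible reading is that the dense layer's parameters are on the CPU while its input is on the GPU. That is only an inference from the trace, not a confirmed diagnosis of the fairseq bug. Below is a minimal sketch of how such a mismatch can be detected and cleared in plain PyTorch. The helper tensors_off_device and the toy model are hypothetical and are not part of fairseq.

```python
import torch
import torch.nn as nn


def tensors_off_device(module, device_type):
    """List (name, device) for parameters and buffers not on `device_type`."""
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [(name, str(t.device)) for name, t in named if t.device.type != device_type]


# Hypothetical stand-in for an encoder; not the fairseq RoBERTa model.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
target = "cuda" if torch.cuda.is_available() else "cpu"
model.to(target)

# A layer registered after the move is created on the CPU by default.
model.add_module("head", nn.Linear(8, 3))
print("off-device before fix:", tensors_off_device(model, target))

# Moving the whole module again puts every parameter on the target device.
model.to(target)
print("off-device after fix:", tensors_off_device(model, target))

# Forward pass with the input on the same device as the parameters.
x = torch.randn(2, 8, device=target)
print(model(x).shape)
```

On a machine without a GPU both lists are empty, since everything already lives on the CPU. With a GPU, the first list should show the late-registered head still sitting on the CPU.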

I got the same problem.

Thanks for reporting this issue. The fix should be out soon.

Should be fixed now, can you please try again?
Also, if you want to continue training from your previous checkpoint, you might not need --reset-optimizer --reset-dataloader --reset-meters, but that depends on your use case.

Let me know if you still see any issues. Thanks

Thank you for fixing it. Now it's solved.
