When I train starting from the provided pretrained model, this RuntimeError doesn't happen.
However, when I train starting from a model that was saved after some training, the error occurs.
Has anybody seen this error and solved it?
Here is my command for running training.
CUDA_VISIBLE_DEVICES=0 python train.py {$data_dir} \
--restore-file {$pretrained_model_path} \
--max-positions 512 \
--max-sentences 32 \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_base \
--criterion sentence_prediction --num-classes 3 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr 1e-5 \
--total-num-update 110000 --warmup-updates 6600 \
--threshold-loss-scale 1 \
--max-epoch 1 \
--find-unused-parameters --truncate-sequence \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--save-dir {$save_dir}
Can you please add the full stack trace? That helps in finding the issue faster.
This is the full stacktrace.
| epoch 001: 0%| | 0/428487 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 325, in <module>
cli_main()
File "train.py", line 321, in cli_main
main(args)
File "train.py", line 80, in main
train(args, trainer, task, epoch_itr)
File "train.py", line 121, in train
log_output = trainer.train_step(samples)
File "/home/sam/fairseq/fairseq/trainer.py", line 287, in train_step
raise e
File "/home/sam/fairseq/fairseq/trainer.py", line 264, in train_step
ignore_grad
File "/home/sam/fairseq/fairseq/tasks/fairseq_task.py", line 230, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/sam/fairseq/fairseq/criterions/sentence_prediction.py", line 43, in forward
padding_mask=padding_mask,
File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/sam/fairseq/fairseq/models/roberta/model.py", line 183, in forward
x = self.dense(x)
File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/modules/linear.py", line 92, in forward
return F.linear(input, self.weight, self.bias)
File "/home/sam/anaconda3/envs/vector_provide_test/lib/python3.5/site-packages/torch/nn/functional.py", line 1406, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #4 'mat1'
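For anyone debugging something similar: the last line suggests that the `dense` layer's weight and bias are on the CPU while its input is on the GPU. One way to confirm that kind of mismatch is to group the model's parameters by device. Below is a minimal sketch, not fairseq code. It only assumes the object exposes `named_parameters()` and that each parameter has a `.device` attribute, as `torch.nn.Module` does. The `FakeParam` and `FakeModel` classes and the parameter names are hypothetical stand-ins so the example runs without torch or a GPU.

```python
from collections import defaultdict


def parameters_by_device(module):
    """Group parameter names by the device each parameter lives on.

    Works with any object exposing named_parameters() whose items have a
    .device attribute, which is the interface torch.nn.Module provides.
    """
    groups = defaultdict(list)
    for name, param in module.named_parameters():
        groups[str(param.device)].append(name)
    return dict(groups)


# Hypothetical stand-ins (not real torch or fairseq classes) so the sketch
# runs anywhere.
class FakeParam:
    def __init__(self, device):
        self.device = device


class FakeModel:
    def named_parameters(self):
        # Illustrative names only: an encoder on the GPU and a head left on the CPU.
        yield "encoder.layer0.weight", FakeParam("cuda:0")
        yield "head.dense.weight", FakeParam("cpu")
        yield "head.dense.bias", FakeParam("cpu")


report = parameters_by_device(FakeModel())
for device, names in sorted(report.items()):
    print(device, names)
```

With a real model you would pass the `torch.nn.Module` itself instead of `FakeModel()`. If more than one device shows up in the report, the parameters listed under the unexpected device are the ones to look at.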
I got the same problem.
Thanks for reporting this issue. The fix should be out soon.
Should be fixed now, can you please try again?
Also, if you want to continue training from your previous checkpoint, you might not need --reset-optimizer --reset-dataloader --reset-meters, but that depends on your use case.
Let me know if you still see any issues. Thanks
Thank you for fixing it. Now it's solved.