Transformers: ImportError: cannot import name 'DataCollatorForLanguageModeling' (File "run_language_modeling.py")

Created on 23 Apr 2020 · 7 comments · Source: huggingface/transformers

โ“ Questions & Help

Details


https://huggingface.co/blog/how-to-train
I followed everything in this Colab without changing anything, and this is the problem I encountered:

Traceback (most recent call last):
File "run_language_modeling.py", line 29, in
from transformers import (
ImportError: cannot import name 'DataCollatorForLanguageModeling'
CPU times: user 21.5 ms, sys: 18 ms, total: 39.5 ms
Wall time: 4.52 s

How can I fix this problem?

Thank you for your kindness and support.

Most helpful comment

I changed the installation command in Colab to:
!pip install git+https://github.com/huggingface/transformers

and it's running :). Now I only need to solve the out-of-memory problem.

All 7 comments

Unfortunately, I have the same error:

Traceback (most recent call last):
File "run_language_modeling.py", line 29, in
from transformers import (
ImportError: cannot import name 'DataCollatorForLanguageModeling' from 'transformers' (E:\PycharmProjects\Ancient_BERT\venv\lib\site-packages\transformers__init__.py)


I tried this (instead of !pip install transformers):

!git clone https://github.com/huggingface/transformers
cd transformers
pip install .

And I get the following error:

Traceback (most recent call last):
File "run_language_modeling.py", line 280, in
main()
File "run_language_modeling.py", line 225, in main
if training_args.do_train
File "run_language_modeling.py", line 122, in get_dataset
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
File "/usr/local/lib/python3.6/dist-packages/transformers/data/datasets/language_modeling.py", line 84, in init
assert os.path.isfile(file_path)
AssertionError

I think it is related to tokenizers 0.7.0, but I still don't know how to fix it.
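
For what it's worth, the AssertionError above is raised by assert os.path.isfile(file_path) in the dataset loader, so it usually just means the file given as --train_data_file is not reachable from the directory the script is launched from (for example after the cd transformers step). A minimal check before launching, assuming the oscar.eo.txt path from the Colab:

import os

# The assertion that failed is os.path.isfile(file_path), so confirm the
# training file is visible from the current working directory.
print(os.getcwd())
print(os.path.isfile("./oscar.eo.txt"))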

Both running in PyCharm and the example Colab script have the same problem:
https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

Traceback (most recent call last):
File "run_language_modeling.py", line 29, in
from transformers import (
ImportError: cannot import name 'DataCollatorForLanguageModeling'
CPU times: user 36.2 ms, sys: 22.9 ms, total: 59.1 ms
Wall time: 15.2 s

I changed the installation command in Colab to:
!pip install git+https://github.com/huggingface/transformers

and it's running :). Now I only need to solve the out-of-memory problem.
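
For reference, the ImportError usually just means an older pip release of transformers is installed, one that does not yet export DataCollatorForLanguageModeling; a quick sanity check after the source install, as a sketch:

# After reinstalling from source, the class should be importable.
import transformers
print(transformers.__version__)
from transformers import DataCollatorForLanguageModeling  # should no longer raise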

!pip install git+https://github.com/huggingface/transformers

You have to change the batch size.

cmd = """
python run_language_modeling.py
--train_data_file ./oscar.eo.txt
--output_dir ./EsperBERTo-small-v1
--model_type roberta
--mlm
--config_name ./EsperBERTo
--tokenizer_name ./EsperBERTo
--do_train
--line_by_line
--learning_rate 1e-4
--num_train_epochs 1
--save_total_limit 2
--save_steps 2000
--per_gpu_train_batch_size 4
--seed 42
""".replace("\n", " ")

And thank you for your help.
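
If memory is still tight after that, the same command string can be tweaked a bit further; a sketch assuming --gradient_accumulation_steps is accepted by this version of the script (it is a standard Trainer argument):

# Halve the per-GPU batch and compensate with gradient accumulation so the
# effective batch size stays at 4.
cmd = cmd.replace(
    "--per_gpu_train_batch_size 4",
    "--per_gpu_train_batch_size 2 --gradient_accumulation_steps 2",
)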

Maybe someone else has had this error (it occurs after reaching the save_steps value of 2000):

Iteration: 16% 1997/12500 [10:38<49:42, 3.52it/s]
Iteration: 16% 1998/12500 [10:39<47:03, 3.72it/s]

{"learning_rate": 8.4e-05, "loss": 7.684231301307678, "step": 2000}
Epoch: 0% 0/1 [10:39 Iteration: 16% 1999/12500 [10:39<49:56, 3.50it/s]
04/23/2020 11:55:42 - INFO - transformers.trainer - Saving model checkpoint to ./EsperBERTo-small-v1/checkpoint-2000
04/23/2020 11:55:42 - INFO - transformers.configuration_utils - Configuration saved in ./EsperBERTo-small-v1/checkpoint-2000/config.json
04/23/2020 11:55:42 - INFO - transformers.modeling_utils - Model weights saved in ./EsperBERTo-small-v1/checkpoint-2000/pytorch_model.bin
Traceback (most recent call last):
File "run_language_modeling.py", line 280, in
main()
File "run_language_modeling.py", line 254, in main
trainer.train(model_path=model_path)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 363, in train
self._rotate_checkpoints()
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 458, in _rotate_checkpoints
checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 443, in _sorted_checkpoints
regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
File "/usr/lib/python3.6/re.py", line 172, in match
return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object

Epoch: 0% 0/1 [10:40 Iteration: 16% 1999/12500 [10:40<56:04, 3.12it/s]
CPU times: user 2.78 s, sys: 928 ms, total: 3.71 s
Wall time: 11min 33s

My configuration is:
config = {
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"type_vocab_size": 1,
"vocab_size": 52000
}
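
(Aside: the --config_name ./EsperBERTo flag in the command below expects a config.json in that folder, so this dict has to be written out first; a minimal sketch, assuming the same layout as the Colab:)

import json, os

# Write the dict above to ./EsperBERTo/config.json so run_language_modeling.py
# can pick it up via --config_name.
os.makedirs("./EsperBERTo", exist_ok=True)
with open("./EsperBERTo/config.json", "w") as fp:
    json.dump(config, fp)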
cmd = """
python run_language_modeling.py
--train_data_file ./oscar.eo.txt
--output_dir ./EsperBERTo-small-v1
--model_type roberta
--mlm
--config_name ./EsperBERTo
--tokenizer_name ./EsperBERTo
--do_train
--line_by_line
--learning_rate 1e-4
--num_train_epochs 1
--save_total_limit 2
--save_steps 2000
--per_gpu_train_batch_size 4
--seed 42
""".replace("\n", " ")

Closing in favor of #3920
