Transformers: Too many bugs in Version 2.5.0

Created on 24 Feb 2020 · 15 comments · Source: huggingface/transformers

  1. It cannot be installed on macOS. By running pip install -U transformers, I got the following errors:

    Building wheels for collected packages: tokenizers
    Building wheel for tokenizers (PEP 517) ... error
    ERROR: Command errored out with exit status 1:
    command: /anaconda/bin/python /anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/5h/fr2vhgsx4jd8wz4bphzt22_8p1v0bf/T/tmpfh6km7na
    cwd: /private/var/folders/5h/fr2vhgsx4jd8wz4bphzt22_8p1v0bf/T/pip-install-fog09t3h/tokenizers
    Complete output (36 lines):
    running bdist_wheel
    running build
    running build_py
    creating build
    creating build/lib
    creating build/lib/tokenizers
    copying tokenizers/__init__.py -> build/lib/tokenizers
    creating build/lib/tokenizers/models
    copying tokenizers/models/__init__.py -> build/lib/tokenizers/models
    creating build/lib/tokenizers/decoders
    copying tokenizers/decoders/__init__.py -> build/lib/tokenizers/decoders
    creating build/lib/tokenizers/normalizers
    copying tokenizers/normalizers/__init__.py -> build/lib/tokenizers/normalizers
    creating build/lib/tokenizers/pre_tokenizers
    copying tokenizers/pre_tokenizers/__init__.py -> build/lib/tokenizers/pre_tokenizers
    creating build/lib/tokenizers/processors
    copying tokenizers/processors/__init__.py -> build/lib/tokenizers/processors
    creating build/lib/tokenizers/trainers
    copying tokenizers/trainers/__init__.py -> build/lib/tokenizers/trainers
    creating build/lib/tokenizers/implementations
    copying tokenizers/implementations/byte_level_bpe.py -> build/lib/tokenizers/implementations
    copying tokenizers/implementations/sentencepiece_bpe.py -> build/lib/tokenizers/implementations
    copying tokenizers/implementations/base_tokenizer.py -> build/lib/tokenizers/implementations
    copying tokenizers/implementations/__init__.py -> build/lib/tokenizers/implementations
    copying tokenizers/implementations/char_level_bpe.py -> build/lib/tokenizers/implementations
    copying tokenizers/implementations/bert_wordpiece.py -> build/lib/tokenizers/implementations
    copying tokenizers/__init__.pyi -> build/lib/tokenizers
    copying tokenizers/models/__init__.pyi -> build/lib/tokenizers/models
    copying tokenizers/decoders/__init__.pyi -> build/lib/tokenizers/decoders
    copying tokenizers/normalizers/__init__.pyi -> build/lib/tokenizers/normalizers
    copying tokenizers/pre_tokenizers/__init__.pyi -> build/lib/tokenizers/pre_tokenizers
    copying tokenizers/processors/__init__.pyi -> build/lib/tokenizers/processors
    copying tokenizers/trainers/__init__.pyi -> build/lib/tokenizers/trainers
    running build_ext
    running build_rust
    error: Can not find Rust compiler


    ERROR: Failed building wheel for tokenizers
    Running setup.py clean for tokenizers
    Failed to build tokenizers
    ERROR: Could not build wheels for tokenizers which use PEP 517 and cannot be installed directly
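
The last line of the build log is the real failure: the tokenizers wheel is compiled from Rust source when no prebuilt wheel matches the platform. A likely fix, assuming the missing Rust toolchain is the only blocker, is to install Rust via rustup and retry:

    curl https://sh.rustup.rs -sSf | sh
    pip install -U transformers

Alternatively, since 2.5.0 is the first release that pulls in tokenizers, pinning transformers==2.4.1 should avoid the Rust build entirely.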

  2. On Linux, it can be installed, but it fails with the following code:

    import transformers
    transformers.AutoTokenizer.from_pretrained("bert-base-cased").save_pretrained("./")
    transformers.AutoModel.from_pretrained("bert-base-cased").save_pretrained("./")
    transformers.AutoTokenizer.from_pretrained("./")
    transformers.AutoModel.from_pretrained("./")

Actually, it is the second line that generates the following errors:

Traceback (most recent call last):
File "", line 1, in
File "/anaconda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 587, in save_pretrained
return vocab_files + (special_tokens_map_file, added_tokens_file)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'tuple'

  3. The vocabulary size of xlm-roberta is wrong, so the following code fails (this bug also exists in Version 2.4.1):

    import transformers
    tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
    tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))

The error is actually caused by the wrong vocab size:

[libprotobuf FATAL /sentencepiece/src/../third_party/protobuf-lite/google/protobuf/repeated_field.h:1506] CHECK failed: (index) < (current_size_):
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: (index) < (current_size_):
zsh: abort python
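
Since the failure is a hard abort inside sentencepiece (it cannot be caught from Python), a safer way to confirm the mismatch is to compare the size the wrapper reports with the size of the underlying model. A minimal sketch, assuming the slow XLM-R tokenizer exposes its sentencepiece model as sp_model:

    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")

    # If these two numbers disagree, some ids in range(vocab_size) have no
    # backing piece, and converting them aborts inside sentencepiece.
    print("reported vocab_size: ", tokenizer.vocab_size)
    print("sentencepiece pieces:", len(tokenizer.sp_model))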

Labels: Tokenization, Installation, Should Fix

All 15 comments

Hi! Indeed, there have been a few issues as this was the first release incorporating tokenizers by default. A new version of tokenizers and transformers will be available either today or tomorrow and should fix most of these.

For future reference, when you say that some code "fails", please also provide the stack trace. This helps greatly when debugging.

Thanks, stack trace provided...
I just noticed that in Version 2.5.0, AutoTokenizer.from_pretrained() takes a new argument, use_fast, which defaults to True. This seems to be the cause of the error: when I set it to False, the loaded tokenizer can be saved correctly by save_pretrained().
I wonder why this use_fast argument was added, and why it defaults to True?

use_fast uses the tokenizers library, which is a new, extremely fast implementation of different tokenizers. I agree that for the first few releases it might have been better to expose the argument but set it to False by default, so that errors would be caught only by early adopters. As it stands, many errors are being reported that could otherwise have been avoided. In the meantime, you can explicitly set it to False.
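
For reference, a minimal sketch of that workaround (the model name is just the one from the report above):

    import transformers

    # Fall back to the pure-Python tokenizer so that save_pretrained()
    # behaves as it did before 2.5.0.
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
    tokenizer.save_pretrained("./")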

For the Tokenizers library:
1. Where is the documentation on how to install and use it? The README is too brief...
2. I understand that it is designed as a collection of various tokenizers. But to use a pre-trained model, isn't it better to use the original tokenizer, to avoid subtle differences such as special tokens? If so, the Transformers library should not use tokenizers from the Tokenizers library by default...

tokenizers sits in its own repository: https://github.com/huggingface/tokenizers, with the Python bindings under bindings/python in that repository.

I think the fast tokenizers are tested to produce exactly the same output as the existing ones.
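
If you want to check that on your own inputs, here is a quick sketch (bert-base-cased chosen arbitrarily; encode() exists on both implementations in this release):

    import transformers

    slow = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
    fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

    text = "Hello, world! This is a parity check."
    # Both implementations should produce identical input ids.
    assert slow.encode(text) == fast.encode(text), "fast and slow tokenizers disagree"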

Thanks...
It seems that tokenizers is installed together with transformers by pip install transformers?
In the future, will the tokenizer classes (e.g. BertTokenizer, AutoTokenizer, etc.) still be kept in the transformers library, or will they be deprecated?

I cannot answer that; I don't know what the roadmap looks like.

Installing 64-bit Python instead of 32-bit solved the same issue for me.

Which issue did you solve?
I think almost everyone uses 64-bit Python...

1) This issue should be opened on huggingface/tokenizers as it is an installation issue with the huggingface/tokenizers library.

2) This issue is solved in the current master (and 2.5.1) as well.

3) This is fixed in https://github.com/huggingface/transformers/pull/3198 which will be merged in a bit.

I still have this problem, can anyone tell me how to solve it?

Which problem?

Still seeing the error

[libprotobuf FATAL /sentencepiece/src/../third_party/protobuf-lite/google/protobuf/repeated_field.h:1506] CHECK failed: (index) < (current_size_): 
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (index) < (current_size_): 

How do I work around this?

Hi @catyeo18, please provide the code that produces this error, along with the versions of the software you're using. Here's the template for bug reports. Thank you.
