Flair: albert-base-v2 tokenization broken

Created on 23 Jun 2020 · 3 comments · Source: flairNLP/flair

Describe the bug
Loading a TextClassifier model crashes when the model was trained with albert-base-v2 embeddings.

To Reproduce

  • Train a text classifier using albert-base-v2. Save the model.
  • Try to load this model on another machine.
  • Loading crashes because the SentencePiece vocabulary file does not exist on that machine.

Expected behavior
The model should load successfully.

Environment (please complete the following information):

  • OS [e.g. iOS, Linux]: Ubuntu-20-LTS
  • Version [e.g. flair-0.3.2]: flair-github-master

Additional context
There is a workaround that involves monkey-patching a bit of code, like this:

import logging
import transformers

logger = logging.getLogger(__name__)

# Resolve the vocab file in this machine's local cache.
vocab_file = transformers.tokenization_albert.AlbertTokenizer.from_pretrained("albert-base-v2").vocab_file

def _setstate(self, d):  # Method to patch with
    self.__dict__ = d
    try:
        import sentencepiece as spm
    except ImportError:
        logger.warning(
            "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\n"
            "pip install sentencepiece"
        )
        raise
    # Reload the SentencePiece model from the locally resolved vocab file
    # instead of the stale path stored in the pickled state.
    self.sp_model = spm.SentencePieceProcessor()
    self.sp_model.Load(vocab_file)

# Actual patching being done here: assign the plain function so it binds
# correctly to whichever instance is being unpickled.
transformers.tokenization_albert.AlbertTokenizer.__setstate__ = _setstate

Having to do this every time is crazy. Maybe we can implement a better way of handling this issue.
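For context on why the patch works: pickle calls a class's `__setstate__` during unpickling, so overriding it lets you reload a resource from a path that is valid locally rather than the stale one stored in the pickle. A minimal pure-Python sketch of the same pattern (the `Tokenizer` class and paths here are hypothetical stand-ins, not the real transformers code):

```python
import pickle

class Tokenizer:
    """Toy stand-in for a tokenizer that keeps an absolute file path in its state."""
    def __init__(self, vocab_file):
        self.vocab_file = vocab_file  # e.g. a path into a per-machine cache

# Path assumed to exist on *this* machine; the pickled object may have been
# created on a machine with a different (now missing) cache path.
LOCAL_VOCAB = "/tmp/local-vocab.model"

def _setstate(self, d):
    # Restore the pickled attributes, then override the stale path
    # with one that is valid locally.
    self.__dict__ = d
    self.vocab_file = LOCAL_VOCAB

# Patch the class so every unpickled instance is repaired.
Tokenizer.__setstate__ = _setstate

blob = pickle.dumps(Tokenizer("/home/other-user/.cache/vocab.model"))
restored = pickle.loads(blob)
print(restored.vocab_file)  # "/tmp/local-vocab.model"
```

Assigning a plain function to the class (rather than a bound `MethodType`) is what makes the patch apply to every unpickled instance, since Python binds `self` at call time.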

bug

All 3 comments

Thanks for reporting this - @whoisjones can you take a look?

I'll take a look and comment here @alanakbik @mittalsuraj18

@mittalsuraj18 the issue lies in the Hugging Face library; a similar issue was opened last week for MarianMT. SentencePiece saves its files in the cache, so they can't be found on another machine. Please open a corresponding issue in the Hugging Face repository, following the linked one.
