Transformers: RAG Tokenizer erroring out

Created on 10 Oct 2020 · 6 comments · Source: huggingface/transformers

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-5.4.0-48-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

@ola13 @mfuntowicz

Information

Hi, I am trying to get RAG running, but I hit an error when following the instructions here: https://huggingface.co/facebook/rag-token-nq

Particularly, the error message is as follows:

TypeError                                 Traceback (most recent call last)
<ipython-input-7-35cd6a2213c0> in <module>
      1 from transformers import AutoTokenizer, AutoModelWithLMHead
      2 
----> 3 tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_rag.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
     61         print(config.generator)
     62         print("***")
---> 63         generator = AutoTokenizer.from_pretrained(generator_path, config=config.generator)
     64         return cls(question_encoder=question_encoder, generator=generator)
     65 

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1557 
   1558         return cls._from_pretrained(
-> 1559             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1560         )
   1561 

~/src/transformers/src/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1648 
   1649         # Add supplementary tokens.
-> 1650         special_tokens = tokenizer.all_special_tokens
   1651         if added_tokens_file is not None:
   1652             with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens(self)
   1026         Convert tokens of :obj:`tokenizers.AddedToken` type to string.
   1027         """
-> 1028         all_toks = [str(s) for s in self.all_special_tokens_extended]
   1029         return all_toks
   1030 

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens_extended(self)
   1046         logger.info(all_toks)
   1047         print(all_toks)
-> 1048         all_toks = list(OrderedDict.fromkeys(all_toks))
   1049         return all_toks
   1050 

TypeError: unhashable type: 'dict'

The all_toks variable looks as follows. Each entry is a plain dict, and OrderedDict.fromkeys needs hashable keys, so it fails.

[{'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<unk>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<pad>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<mask>', 'single_word': False, 'lstrip': True, 'rstrip': False, 'normalized': True}]
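
For what it's worth, the failure reproduces outside of transformers. A minimal sketch, using two of the token dicts from the output above:

from collections import OrderedDict

# Two of the serialized special-token entries shown above.
all_toks = [
    {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
    {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
]

try:
    # OrderedDict.fromkeys() hashes each element; plain dicts are
    # unhashable, hence the TypeError in the traceback above.
    OrderedDict.fromkeys(all_toks)
except TypeError as err:
    print(err)  # unhashable type: 'dict'

# Deduplication works once the entries are hashable, e.g. the raw
# token strings (AddedToken instances are hashable as well):
print(list(OrderedDict.fromkeys(tok['content'] for tok in all_toks)))  # ['<s>', '</s>']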

I will keep digging, hoping I am making an obvious mistake.

To reproduce

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")
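
A possible workaround, sketched under the assumption that this checkpoint uses the standard DPR question-encoder and BART generator tokenizers, is to build the RagTokenizer from its parts and bypass the failing AutoTokenizer path (untested):

from transformers import BartTokenizer, DPRQuestionEncoderTokenizer, RagTokenizer

# Load the two sub-tokenizers from their own checkpoints, which should
# not carry the dict-serialized special tokens that trigger the bug.
question_encoder = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)
generator = BartTokenizer.from_pretrained("facebook/bart-large")
tokenizer = RagTokenizer(question_encoder=question_encoder, generator=generator)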

Expected behavior

It should load the tokenizer!

Thank you.

Most helpful comment

Should be solved now - let me know if you still experience problems @dzorlu

All 6 comments

Just to follow up on this: it looks like the special tokens are loaded for the RAG generator here, but they are not converted to AddedToken objects here, and hence are not compatible with downstream operations.
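
For illustration, the kind of conversion I would expect at that point; a hypothetical helper, not the actual patch:

from tokenizers import AddedToken

def to_added_token(value):
    # Hypothetical helper: rebuild an AddedToken from its serialized dict
    # form so downstream deduplication sees hashable objects again.
    if isinstance(value, dict):
        kwargs = dict(value)
        return AddedToken(kwargs.pop("content"), **kwargs)
    return value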

When I run the examples from:
https://huggingface.co/transformers/model_doc/rag.html

I get exactly the same error.

Hey @dzorlu - thanks for reporting this error, I will take a look tomorrow!

Thanks @patrickvonplaten. Appreciate all the hard work :+1:

Should be solved now - let me know if you still experience problems @dzorlu
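
(Note that the environment above runs transformers from a source checkout, so pulling the latest master should pick up the fix.)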

Thank you!
