Model I am using: BERT
Language I am using the model on: English
Call bertTokenizer.tokenize("text", return_tokens_mapped_to_origin=True)
Result:
TypeError: _tokenize() got an unexpected keyword argument 'return_tokens_mapped_to_origin'
The official documentation mentions a "return_tokens_mapped_to_origin" optional parameter that when set to True should return the index of each token in the initial given text.
https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=return_tokens_mapped_to_origin#transformers.PreTrainedTokenizer.tokenize
In the source code this parameter is never used outside of the doc comment, neither in the base class nor in its implementations.
What is the idea here? That for each (sub)token its "parent" token ID is remembered? That would be so great. I can definitely use functionality like that.
This is what the doc says:
return_tokens_mapped_to_origin: (optional) Set to True to return the index of each token in the initial whitespace tokenization. (default False)
I think the idea was that, with this parameter set to True, in addition to the tokens the function returns, for each token, its position in the original whitespace tokenization, i.e. the word it belongs to.
So, for example, consider the sentence: Word-embedding is so nice
If the tokenization is ["word", "-", "em", "##bed", "##ding", "is", "so", "nice"]
the second returned value should be something like [0, 0, 0, 0, 0, 1, 2, 3], which gives the position of each token's "parent" in the whitespace tokenization ["word-embedding", "is", "so", "nice"].
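As a purely hypothetical sketch (the parameter was never implemented, so this does not run):

# hypothetical usage according to the (erroneous) docs:
# tokens, tokens_map = tokenizer.tokenize(
#     "Word-embedding is so nice", return_tokens_mapped_to_origin=True)
# tokens     -> ["word", "-", "em", "##bed", "##ding", "is", "so", "nice"]
# tokens_map -> [0, 0, 0, 0, 0, 1, 2, 3]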
It would be very useful, but as far as I can see it hasn't been implemented; I don't know why it is mentioned in the documentation.
An easy way to implement it, without needing to adapt the code to every single tokenizer, could be to whitespace-tokenize the text first, then call the subword tokenizer on each whitespace token and, for each subword token returned, append the current word position to the map.
This could be used in the library to implement the feature, and it also works as a workaround to achieve the same result.
Hi, thanks for pointing that out @alessiocancian, this documentation was an error. You're right about the expected behavior; this is what happens in squad_convert_examples_to_features.
It is not implemented in the tokenize method yet, as we don't have the bandwidth for it currently, but it will probably be in a future release, as it's very useful to map tokens back to the original normalized sentence.
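For context, the mapping logic in squad_convert_examples_to_features looks roughly like this (a simplified sketch, not the verbatim source; example.doc_tokens is the whitespace-split passage):

tok_to_orig_index = []   # subtoken index -> word index
orig_to_tok_index = []   # word index -> index of its first subtoken
all_doc_tokens = []      # the full subtoken sequence
for i, token in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    for sub_token in tokenizer.tokenize(token):
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)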
This sounds like a great addition indeed! +1
For everyone interested, here's the code for the workaround I mentioned (tokenizer is any loaded tokenizer, e.g. BertTokenizer.from_pretrained('bert-base-uncased')):

sentence = "Word-embedding is so nice"
words = sentence.split()  # whitespace tokenization
tokens = []
tokens_map = []  # tokens_map[j] = index of the word that token j came from
for i, word in enumerate(words):
    _tokens = tokenizer.tokenize(word)
    for token in _tokens:
        tokens.append(token)
        tokens_map.append(i)

print(words[tokens_map[2]])  # prints "Word-embedding"
It needs some changes to work with separators, but it could be a starting point for an easy implementation in the tokenize method, @LysandreJik.
EDIT: I found out that sentence.split() is not the best way to reconstruct words because of punctuation; you can replace it with a generic word tokenizer like nltk.word_tokenize.
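For example (assuming nltk is installed and its 'punkt' tokenizer data has been downloaded):

import nltk
# unlike str.split(), this separates trailing punctuation into its own word
words = nltk.word_tokenize("Word-embedding is so nice.")
# ['Word-embedding', 'is', 'so', 'nice', '.']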
@alessiocancian Unfortunately you will inevitably run into inconsistencies between the tokenizer that you used and the base tokenizer that transformers uses internally. I am not even sure whether there are distinct steps in the tokenization process (string -> tokens -> subword units), so I am curious to see what @LysandreJik has planned and how they are going to implement it! When I look at the source code of the squad example, it seems that punctuation is not taken care of and that splits happen only on whitespace characters (as defined in _is_whitespace, reproduced below).
I might be missing something, though.
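For reference, the whitespace check in the squad example is roughly the following:

def _is_whitespace(c):
    # treats space, tab, CR, LF and the narrow no-break space as whitespace
    if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
        return True
    return False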
@alessiocancian Unfortunately you will inevitably run into inconsistencies between the tokenizer that you used and the base tokenizer that is used in transformers internally.
@BramVanroy yes, I thought the same thing: with whitespace tokenization you can reconstruct it easily, but with a different tokenizer you can't, you need to use the same one.
A way could be to pass the tokenizer as a parameter following a common interface (a tokenize method which takes a string and returns a list of strings), as in the sketch below, but I'm not sure if it makes sense.
Whitespace tokenization is useless in most cases because you get unexpected extra punctuation attached to the words.
The easiest way is still to use the code I shared, so you have full control over the tokenization you're referring to. I'm using it and it works fine.
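A minimal sketch of what such an interface could look like (tokenize_with_map is a hypothetical helper, not part of the library):

from typing import Callable, List, Tuple

def tokenize_with_map(text: str, subword_tokenizer,
                      word_tokenize: Callable[[str], List[str]] = str.split
                      ) -> Tuple[List[str], List[int]]:
    # word_tokenize is any callable splitting a string into words;
    # subword_tokenizer is anything with a .tokenize(str) -> List[str] method
    tokens, tokens_map = [], []
    for i, word in enumerate(word_tokenize(text)):
        for token in subword_tokenizer.tokenize(word):
            tokens.append(token)
            tokens_map.append(i)
    return tokens, tokens_map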
Hey @alessiocancian. I did some testing and ran into an issue: your idea won't work for all tokenizers, since some of them seem to be context-sensitive. Here is an example with the roberta tokenizer:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('They were hugging.'))
# ['They', 'Ġwere', 'Ġhugging', '.']
print(tokenizer.tokenize('hugging'))
# ['h', 'ug', 'ging']
I am not sure whether it is expected for tokenizers to work like this. It seems odd: if "hugging" is in the vocabulary, why isn't the tokenizer using it in the second case? I also tried starting the string with a space or a special token, but to no avail. Perhaps @LysandreJik can shed some light here.
I tested with a couple of tokenizers, and to get the same tokenization for the whole sequence at once as word-for-word, it seems that you can prepend "i" (or any token with only one subtoken) to the word and then remove that subtoken again. However, for the first word, the "i" must go at the end. I tested this with 10k sentences on the albert, bert, distilbert, gpt2, openai, roberta, and xlnet tokenizers. XLNet behaves a bit strangely because it tokenizes the "i" as '▁', 'i', so the subtokens need to be removed twice. It's messy, I know, but it works...
# `tok` is the tokenizer under test; `tok_name` is its name (e.g. 'xlnet')
tokens = []
for idx, t in enumerate(sentence.split()):
    if idx > 0:
        t = f"i {t}"
        subtokens = tok.tokenize(t)
        subtokens.pop(0)
        # need to pop twice for xlnet to remove '▁', 'i'
        if tok_name == 'xlnet':
            subtokens.pop(0)
    else:
        t = f"{t} i"
        subtokens = tok.tokenize(t)
        subtokens.pop(-1)
        if tok_name == 'xlnet':
            subtokens.pop(-1)
    tokens += subtokens
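A quick sanity check for the snippet above (assuming sentence, tok, and tok_name are defined as in the snippet):

# the per-word subtokens, concatenated, should match tokenizing
# the whole sentence in one call
assert tokens == tok.tokenize(sentence)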
Hi @BramVanroy, concerning your question of why the word "hugging" was split even though it clearly is in the vocabulary: the RoBERTa tokenizer uses a byte-level BPE tokenizer like GPT-2. It distinguishes between words preceded by a space and those that are not, as you correctly guessed.
You can't simply add a space at the beginning, as it will get stripped in the tokenize method. Instead, you have to specify the add_prefix_space boolean option:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('They were hugging.'))
# ['They', 'Ġwere', 'Ġhugging', '.']
print(tokenizer.tokenize('hugging', add_prefix_space=True))
# ['Ġhugging']
Hey @LysandreJik thanks for your time. But isn't that exactly what the tokenizer does? What am I missing here?
Also, it is a bit strange to see that not all tokenizers know this argument. Wouldn't it make more sense to have this as part of PreTrainedTokenizer's _tokenize, or at least to add **kwargs to every tokenizer's _tokenize? It feels awkward when you quickly want to swap tokenizers by only changing the init, but then you get:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('They were hugging.'))
# ['they', 'were', 'hugging', '.']
print(tokenizer.tokenize('hugging', add_prefix_space=True))
# TypeError: _tokenize() got an unexpected keyword argument 'add_prefix_space'
I understand _why_ the other tokenizers don't need it, but from a usage perspective it is odd that the same tokenize() function doesn't accept the same arguments.
It also becomes awkward when you want to do something more dynamic, like:

from transformers import BertTokenizer, RobertaTokenizer

models = {
    'bert': (BertTokenizer, 'bert-base-uncased'),
    'roberta': (RobertaTokenizer, 'roberta-base')
}

# from user input or from a config
mname = 'bert'
tokenizer = models[mname][0].from_pretrained(models[mname][1])

print(tokenizer.tokenize('They were hugging.'))
# bert: ['they', 'were', 'hugging', '.']
# roberta: ['They', 'Ġwere', 'Ġhugging', '.']
print(tokenizer.tokenize('hugging', add_prefix_space=(mname == 'roberta')))
# roberta: ['Ġhugging']
# bert: TypeError: _tokenize() got an unexpected keyword argument 'add_prefix_space'
I hope it's clear what I am trying to say.
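One hypothetical way to paper over the signature mismatch from the caller's side (tolerant_tokenize is just a sketch, not part of the library):

import inspect

def tolerant_tokenize(tokenizer, text, **kwargs):
    # drop keyword arguments this tokenizer's _tokenize does not accept
    # (simplified: assumes _tokenize declares its options by name)
    accepted = inspect.signature(tokenizer._tokenize).parameters
    kwargs = {k: v for k, v in kwargs.items() if k in accepted}
    return tokenizer.tokenize(text, **kwargs)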