Short description
The SubwordTextEncoder counts reserved tokens twice when building the encoder.
Environment information
tensorflow-datasets version: 1.3.2
tensorflow: 2.0.0
Reproduction instructions
The working example to reproduce the bug is here: https://gist.github.com/psds01/2e03f8c5f53e45e7463126194dedfa40
The exact problem is as follows:
For the same token it will assign different embedding IDs (based on position?)
Expected behavior
The behaviour I am expecting is
len(tokenizer.subwords) == len(set(tokenizer.subwords))
Additional context
Maybe, after this line we can add one more line like:
candidate_subwords.sort(reverse=True)
subwords = reserved_tokens + [s for _, s in candidate_subwords]
subwords = list(set(subwords)) # take unique subwords including reserved tokens
Or maybe I don't understand how this works.
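One caveat with the list(set(subwords)) idea: a set does not keep order, so the reserved tokens would no longer be guaranteed to sit at the front of the list. Here is a minimal sketch of an order-preserving alternative in plain Python. The helper name dedupe_keep_order and the sample values are made up for illustration; this is not the library's code.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so the reserved
    # tokens stay at the front of the resulting list.
    return list(dict.fromkeys(items))


# Hypothetical values, only to show the effect.
reserved_tokens = ["<NUM>", "<COMPANY>"]
candidate_subwords = [(5, "what_"), (3, "<NUM>"), (2, "say_")]

subwords = dedupe_keep_order(reserved_tokens + [s for _, s in candidate_subwords])
print(subwords)
```

With this, len(subwords) == len(set(subwords)) holds and the reserved tokens keep their positions.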
Also, spacing alone changes the token IDs that are assigned:
resp = tokenize("say <NUM> what <COMPANY> what?")
>>>
[('s', 242),
('a', 224),
('y', 248),
(' ', 159),
('<NUM> ', 35),
('w', 246),
('h', 231),
('at ', 47),
('<COMPANY> ', 39),
('w', 246),
('h', 231),
('a', 224),
('t', 243),
('?', 190)]
resp = tokenize("say<NUM>what<COMPANY>what?")
>>>
[('s', 242),
('a', 224),
('y', 248),
('<NUM>', 1),
('w', 246),
('h', 231),
('a', 224),
('t', 243),
('<COMPANY>', 2),
('w', 246),
('h', 231),
('a', 224),
('t', 243),
('?', 190)]
Thank you for reporting. We are aware that our text tokenizer has some bugs, and we are planning to remove it entirely in the future in favour of a better alternative.
You should have a look at TensorFlow Text, which contains online C++ ops to tokenize sentences in a performant way: https://github.com/tensorflow/text
https://www.tensorflow.org/tutorials/tensorflow_text/intro
@Conchylicultor thank you for the reply.
Is there any scope to include sub-word tokenization with tensorflow_text? For my specific case, subword tokenization is working really well (even with the above bug) and I would love to keep using it.
Thanks,
Here are the available tokenizers: https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text.md
I believe that WordpieceTokenizer or BertTokenizer should allow subword encoding. Maybe SentencepieceTokenizer too, though I'm less sure.
Thanks @Conchylicultor
One more thing I have observed with this tokenizer: the IDs I get when I train it and encode in the same session/run differ from the IDs I get when I train it, save it, load it from file and then encode. After loading, the token IDs are shifted by 1.
```python
import tensorflow_datasets as tfds

# `sentences`, `target_vocab_size` and `tokenizer_filename_prefix` are defined elsewhere.
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (x for x in sentences[:100]),
    target_vocab_size=target_vocab_size,
    max_subword_length=20,
    max_corpus_chars=None,
    reserved_tokens=None,
)
print(tokenizer.encode("say what aain"))
# [380, 362, 386, 297, 384, 369, 362, 49, 362, 362, 370, 375]

# Save, reload, and encode the same string again.
tokenizer.save_to_file(tokenizer_filename_prefix)
tokenizer = tfds.features.text.SubwordTextEncoder.load_from_file(
    tokenizer_filename_prefix
)
print(tokenizer.encode("say what aain"))
# [379, 361, 385, 296, 383, 368, 361, 49, 361, 361, 369, 374]
```
What's the workaround to this for now? I have this tokenizer as a part of my pipeline that is about to go to production. Should I not use this if it has these bugs?
Just FYI, the bug in the above comment is caused by the \n subword in the vocab. save_to_file writes the vocabulary as a text file, so the token \n gets split across two lines like this:
'
'
Replacing newlines in the data with spaces seems to work, for now.
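A minimal sketch of that newline workaround, in plain Python so it can be tried without the library. The helper name strip_newlines and the sample sentences are hypothetical.

```python
def strip_newlines(corpus):
    """Yield each sentence with line breaks replaced by single spaces."""
    for sentence in corpus:
        # Handle Windows ("\r\n"), Unix ("\n") and old Mac ("\r") line breaks.
        yield sentence.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")


# Hypothetical sample data.
sentences = ["say what\nagain", "what is\r\nyour name"]
cleaned = list(strip_newlines(sentences))
print(cleaned)
```

The idea would be to pass strip_newlines(sentences) to build_from_corpus instead of the raw sentences, and presumably to apply the same cleaning to any text before calling encode, so that both see the same kind of input.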
If anyone is interested, I've encountered the same problem and was able to fix it (kind of) by deleting the reserved tokens from the corpus while building the subword dictionary.
It seems like the original intention was to naively put all the reserved_tokens at the top of the subword vocab, so build_from_corpus() doesn't seem to care whether those tokens are in the vocabulary already.
Basically, it seems more like misleading documentation than a bug. In the current documentation, reserved_tokens is described as:
list<str>, list of tokens that will always be treated as whole tokens and not split up. Note that these must contain a mix of alphanumeric and non-alphanumeric characters (e.g. "<EOS>") and not end in an underscore.
So it sounds like it will ignore the tokens that appear in reserved_tokens while building the subword vocab, yet it is more like manually inserting tokens into the subword dictionary, and build_from_corpus() does not check for them while building the dictionary. Therefore, if those reserved_tokens appear again in the corpus, it will create duplicated (partially or fully) subwords.
After deleting all the reserved_tokens manually from the corpus, it works just fine.
However, I didn't really look at how it's implemented internally, so it might not be true.
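For anyone who wants to try the same thing, here is a rough sketch of stripping the reserved tokens out of the corpus before building the vocabulary. It is plain Python; the helper name remove_reserved_tokens is hypothetical, and the sample tokens are borrowed from the example earlier in the thread.

```python
import re


def remove_reserved_tokens(corpus, reserved_tokens):
    """Yield each sentence with every reserved token deleted."""
    # Longest tokens first, so a token that is a prefix of another one
    # does not leave fragments behind. re.escape makes "<" and ">" literal.
    ordered = sorted(reserved_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(token) for token in ordered))
    for sentence in corpus:
        yield pattern.sub("", sentence)


reserved_tokens = ["<NUM>", "<COMPANY>"]
corpus = ["say <NUM> what <COMPANY> what?", "say<NUM>what"]
cleaned = list(remove_reserved_tokens(corpus, reserved_tokens))
print(cleaned)
```

The cleaned corpus is only meant for building the vocabulary; the text passed to encode later would still contain the reserved tokens. Note that plain deletion glues neighbouring words together when there is no space around the token, so substituting a space instead may be preferable depending on the data.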
This is interesting.
But in my view, the real problem lies here. For any token, if the _next_ token is a space, then the given token will have an underscore at the end of it. This creates duplicates for all the tokens that are sometimes followed by a space and sometimes not.
Here is an illustration:
import tensorflow_datasets as tfds
data = ['what is your name', 'say what']
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
data,
target_vocab_size=2**32,
)
print(tokenizer.subwords)
The above program will print:
['your_', 'what_', 'what', 'say_', 'name', 'is_']
which has two copies of what:
what_, from the first example
what, from the second example
The code behaves as it was intended to (stated above). This means there is no bug (unknown execution) in the tensorflow_datasets library.
Now that we know how the code works, it is up to the user to preprocess the text however they like to get whatever is expected from it.
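To see which entries of a vocabulary are affected by this, one option is to group the subwords that differ only by a trailing underscore. This is a small sketch in plain Python, run on the list printed above; the helper name underscore_variants is made up for illustration.

```python
from collections import defaultdict


def underscore_variants(subwords):
    """Group subwords that are equal once one trailing underscore is dropped."""
    groups = defaultdict(list)
    for subword in subwords:
        base = subword[:-1] if subword.endswith("_") else subword
        groups[base].append(subword)
    # Keep only the bases that occur in more than one form.
    return {base: forms for base, forms in groups.items() if len(forms) > 1}


subwords = ['your_', 'what_', 'what', 'say_', 'name', 'is_']
variants = underscore_variants(subwords)
print(variants)  # {'what': ['what_', 'what']}
```

On a real tokenizer the same function could be called on tokenizer.subwords to decide whether any preprocessing is worth doing.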