Datasets: Reserved tokens appear twice in the subwords list of SubwordTextEncoder

Created on 8 Jan 2020 · 8 Comments · Source: tensorflow/datasets

Short description
When building the encoder, SubwordTextEncoder adds the reserved tokens to the subwords list twice.
Environment information

  • Operating System: Ubuntu 18
  • Python version: 3.7.5
  • tensorflow-datasets version: 1.3.2
  • tensorflow : 2.0.0

Reproduction instructions

The working example to reproduce the bug is here: https://gist.github.com/psds01/2e03f8c5f53e45e7463126194dedfa40

The exact problem is as follows:
The same token is assigned different embedding IDs (depending on its position?).

Expected behavior
The behaviour I am expecting is

len(tokenizer.subwords) == len(set(tokenizer.subwords))

Additional context
Maybe, after this line we can add one more line like:

candidate_subwords.sort(reverse=True)
subwords = reserved_tokens + [s for _, s in candidate_subwords]
subwords = list(set(subwords)) # take unique subwords including reserved tokens

Or maybe I don't understand how this works.
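
One caveat with the `set()` idea above: a set does not preserve order, while subword IDs depend on list position, so the IDs could change from run to run. A minimal sketch of an order-preserving variant follows. The `reserved_tokens` and `candidate_subwords` values are made-up sample data, not taken from the library, and this is not a tested patch to `build_from_corpus`.

```python
# Made-up sample data standing in for the variables in the snippet above.
reserved_tokens = ["<NUM>", "<COMPANY>"]
candidate_subwords = [(3, "what_"), (2, "<NUM>"), (1, "say_")]

candidate_subwords.sort(reverse=True)
subwords = reserved_tokens + [s for _, s in candidate_subwords]

# dict.fromkeys keeps the first occurrence of each subword and preserves order,
# so the reserved tokens stay at the front and positions stay deterministic.
subwords = list(dict.fromkeys(subwords))
print(subwords)
```

With this change the expected condition `len(subwords) == len(set(subwords))` holds for the sample data.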

bug

All 8 comments

Also, spacing alone changes the embedding IDs assigned to a token.

resp = tokenize("say <NUM> what <COMPANY> what?")
>>>
[('s', 242),
 ('a', 224),
 ('y', 248),
 (' ', 159),
 ('<NUM> ', 35),
 ('w', 246),
 ('h', 231),
 ('at ', 47),
 ('<COMPANY> ', 39),
 ('w', 246),
 ('h', 231),
 ('a', 224),
 ('t', 243),
 ('?', 190)]
resp = tokenize("say<NUM>what<COMPANY>what?")
>>>
[('s', 242),
 ('a', 224),
 ('y', 248),
 ('<NUM>', 1),
 ('w', 246),
 ('h', 231),
 ('a', 224),
 ('t', 243),
 ('<COMPANY>', 2),
 ('w', 246),
 ('h', 231),
 ('a', 224),
 ('t', 243),
 ('?', 190)]

Thank you for reporting. We are aware that our text tokenizer has some bugs, and we are planning to remove it entirely in the future in favour of a better alternative.

You should have a look at tensorflow text, which contains online C++ ops to tokenize sentences in a performant way: https://github.com/tensorflow/text
https://www.tensorflow.org/tutorials/tensorflow_text/intro

@Conchylicultor thank you for the reply.

Is there any scope to include sub-word tokenization with tensorflow_text? For my specific case, subword tokenization is working really well (even with the above bug) and I would love to keep using it.

Thanks,

Here are the available tokenizers: https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text.md

I believe that WordpieceTokenizer or BertTokenizer should allow subword encoding. Maybe SentencepieceTokenizer too, though I'm less sure.

Thanks @Conchylicultor

One more thing I have observed with this tokenizer: the IDs it assigns depend on whether it has been reloaded from disk. If I train it and encode in the same session/run, I get one set of IDs. If I train it, save it, load it from file and then encode, the token IDs are shifted by 1.

```python
import tensorflow_datasets as tfds

# `sentences`, `target_vocab_size` and `tokenizer_filename_prefix`
# are defined elsewhere in the pipeline.

# train tokenizer
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (x for x in sentences[:100]),
    target_vocab_size=target_vocab_size,
    max_subword_length=20,
    max_corpus_chars=None,
    reserved_tokens=None,
)

# encode
print(tokenizer.encode("say what aain"))
# [380, 362, 386, 297, 384, 369, 362, 49, 362, 362, 370, 375]

# save tokenizer
tokenizer.save_to_file(tokenizer_filename_prefix)

# load tokenizer
tokenizer = tfds.features.text.SubwordTextEncoder.load_from_file(
    tokenizer_filename_prefix
)

# encode after loading from the disk
print(tokenizer.encode("say what aain"))
# [379, 361, 385, 296, 383, 368, 361, 49, 361, 361, 369, 374]
```

What's the workaround to this for now? I have this tokenizer as a part of my pipeline that is about to go to production. Should I not use this if it has these bugs?

Just FYI, the bug in the above comment is caused by the \n subword in the vocab. save_to_file writes the vocabulary as a text file, so the \n token gets split across two lines like this:

'
'

Replacing newlines in the data with spaces seems to work, for now.
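
A minimal sketch of both the failure mode and the workaround described above. The one-quoted-subword-per-line layout is an assumption made only for illustration (it is not read from the library source), and `replace_newlines` is a hypothetical helper name.

```python
def replace_newlines(sentences):
    """Yield each sentence with newline characters replaced by single spaces."""
    for sentence in sentences:
        yield sentence.replace("\r\n", " ").replace("\n", " ")


# Toy illustration: writing one quoted subword per line (assumed layout)
# turns a "\n" subword into two lines, so line count and vocab size disagree.
subwords = ["say_", "\n", "what"]
lines = "\n".join("'%s'" % s for s in subwords).split("\n")
print(len(subwords), len(lines))

# Workaround: clean the corpus before passing it to build_from_corpus.
cleaned = list(replace_newlines(["say\nwhat", "again"]))
print(cleaned)
```

The cleaned generator can be passed as the corpus argument in place of the raw sentences, so that no newline subword ends up in the vocabulary in the first place.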

If anyone is interested, I've encountered the same problem and was able to fix it (kind of) by deleting the reserved tokens from the corpus before building the subword dictionary.

It seems the original intention was to naively put all the reserved_tokens at the top of the subword vocab, so build_from_corpus() doesn't check whether those tokens are already in the vocabulary.

Basically, it looks more like misleading documentation than a bug. In the current documentation, reserved_tokens is described as:

list<str>, list of tokens that will always be treated as whole tokens and not split up. Note that these must contain a mix of alphanumeric and non-alphanumeric characters (e.g. "") and not end in an underscore.

So it sounds as if tokens that appear in reserved_tokens are ignored while building the subword vocab. In practice it is more like manually inserting the tokens into the subword dictionary, and build_from_corpus() does not check for them while building the rest of the dictionary. Therefore, if those reserved_tokens appear again in the corpus, they produce duplicated (partially or fully) subwords.

After deleting all the reserved_tokens manually from the corpus, it works just fine.

However, I didn't really look at how it's implemented internally, so it might not be true.
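
A minimal sketch of the corpus-cleaning step described in this comment, assuming the reserved tokens can simply be dropped from the training text. `remove_reserved` is a hypothetical helper, and collapsing the leftover whitespace is a choice made here, not something the library requires.

```python
def remove_reserved(sentences, reserved_tokens):
    """Yield sentences with every reserved token removed (hypothetical helper)."""
    for sentence in sentences:
        for token in reserved_tokens:
            sentence = sentence.replace(token, " ")
        # Collapse the runs of spaces left behind by the removals.
        yield " ".join(sentence.split())


reserved = ["<NUM>", "<COMPANY>"]
corpus = ["say <NUM> what <COMPANY> what?"]
cleaned_corpus = list(remove_reserved(corpus, reserved))
print(cleaned_corpus)
```

The cleaned corpus would be used only for building the vocabulary, with the same `reserved` list passed as `reserved_tokens`; the text to encode still contains the reserved tokens.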

This is interesting.

But in my view the real problem lies here. For any token, if the _next_ token is a space, the given token gets an underscore appended to it. This creates duplicates for every token that is sometimes followed by a space and sometimes not.

Here is an illustration:

import tensorflow_datasets as tfds
data = ['what is your name', 'say what']
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    data, 
    target_vocab_size=2**32,
)
print(tokenizer.subwords)

The above program will print:
['your_', 'what_', 'what', 'say_', 'name', 'is_']
which has two copies of what:

  1. one with space after it, what_, from the first example
  2. one without a space after it, what, from the second example.

The code behaves as intended (as described above). This means there is no bug, in the sense of unexpected behaviour, in the tensorflow_datasets library.

Now that we know how the code works, it is up to the user to preprocess the text however they like to get whatever is expected from it.
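
The mechanism can be reproduced without the library. The sketch below is a toy model only: `mark_trailing_space` is a hypothetical function that mimics the trailing-underscore rule described above, not the actual SubwordTextEncoder implementation.

```python
def mark_trailing_space(text):
    """Toy model: append '_' to every word that is followed by a space."""
    words = text.split(" ")
    last = len(words) - 1
    return [w + "_" if i < last else w for i, w in enumerate(words)]


data = ["what is your name", "say what"]
vocab = set()
for sentence in data:
    vocab.update(mark_trailing_space(sentence))

# "what_" comes from the first sentence, "what" from the end of the second.
toy_subwords = sorted(vocab, reverse=True)
print(toy_subwords)
```

Under this toy rule the two sentences produce both `what_` and `what`, which is the same pair of near-duplicates seen in the real output above.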
