Transformers: How to properly apply a tokenizer map function to a TensorFlow batched dataset?

Created on 18 Apr 2020  ·  9 comments  ·  Source: huggingface/transformers

Considering the following batched_dataset:

samples = [{"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
           {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
           {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
           {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"}]

dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})

batched_dataset = dataset.batch(2)

#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is one relevant document regarding query 1',
#      b'this is one relevant document regarding query 2'], dtype=object)>,
# 
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is a query 1', 
#      b'this is a query 2'], dtype=object)>
#}

and a map function to tokenize this batched_dataset:

def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc) 

I could tokenize the entire batched_dataset using a for-loop:

for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[  101,  2023,  2003,  1037, 23032,  1015,   102,     0],
#          [  101,  2023,  2003,  1037, 23032,  1016,   102,     0]],
#      dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 0],
#          [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}, 

# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102],
#          [ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102]], dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 1],
#          [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
#  ...

However, when using tf.data.Dataset.map, the following error arises:

tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'
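A minimal repro of the difference (a sketch, assuming TF 2.x with eager execution enabled): iterating a dataset yields eager tensors, while Dataset.map traces its function once with symbolic tensors that have no usable numpy() method.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([b"hello", b"world"])

# Iterating a dataset in eager mode yields EagerTensors, so .numpy() works:
eager = [t.numpy() for t in ds]
print(eager)  # [b'hello', b'world']

# Dataset.map traces its function in graph mode; the tensor it passes is
# symbolic, so calling .numpy() on it fails during tracing:
raised = False
try:
    ds.map(lambda t: t.numpy())
except Exception:  # AttributeError here on TF 2.x at the time of this thread
    raised = True
print(raised)  # True
```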

So, how does one properly apply a tokenizer map function to a batched dataset?

Note: I published a working example on Google Colab.


All 9 comments

This seems more like a TF-related question than a Transformers-related question. The issue stems from your code trying to get the value of a non-eager tensor using numpy. I believe the tf.data.Dataset.map method must trace its inputs, which results in the tensors not being eager.

Couldn't you build the tf.data.Dataset with already tokenized inputs instead?
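That approach might look like the sketch below, where toy_encode is a hypothetical stand-in for the real tokenizer.batch_encode_plus call (which would produce input_ids and attention_mask the same way): tokenize eagerly first, then build the dataset from the resulting arrays.

```python
import tensorflow as tf

def toy_encode(texts, max_len=8):
    # Stand-in for tokenizer.batch_encode_plus: per-word integer IDs,
    # padded to max_len, plus the matching attention mask.
    ids, mask = [], []
    for text in texts:
        tok = [len(w) for w in text.split()][:max_len]
        ids.append(tok + [0] * (max_len - len(tok)))
        mask.append([1] * len(tok) + [0] * (max_len - len(tok)))
    return {"input_ids": ids, "attention_mask": mask}

queries = ["this is a query 1", "this is a query 2"]
docs = ["this is one relevant document regarding query 1",
        "this is one relevant document regarding query 2"]

# Tokenize before the data enters tf.data, then slice into a dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (toy_encode(queries), toy_encode(docs))).batch(2)

query_batch, doc_batch = next(iter(dataset))
print(query_batch["input_ids"].shape)  # (2, 8)
```

The tradeoff is that all tokenization happens up front in memory rather than streaming through the pipeline, which is exactly what the next comment pushes back on.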

The ideal would be to follow the pipeline (read from file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.

When dealing with text, TensorFlow generates string tensors that are stored as byte strings:

<tf.Tensor: shape=(2,), dtype=string, numpy=array(
     [b'Thê first utf-8 string of the batçh.',
      b'Thê secônd utf-8 string of the batçh.'], dtype=object)>

However, I couldn't find an efficient way to decode this kind of tensor into a list of strings, and it gets worse when the byte strings contain non-ASCII characters.

What I really need is one of these two options:

  1. a tokenizer that can accept the aforementioned byte-string tensor as input to tokenize; or
  2. a vectorized approach to transforming a byte-string tensor into a list of strings.
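For option 2, NumPy itself offers a vectorized route (a sketch, independent of Transformers): np.char.decode decodes an entire byte-string array at once, non-ASCII content included.

```python
import numpy as np

# The kind of array tf.data hands back via .numpy() for a string tensor:
batch = np.array([b'Th\xc3\xaa first utf-8 string of the bat\xc3\xa7h.',
                  b'Th\xc3\xaa sec\xc3\xb4nd utf-8 string of the bat\xc3\xa7h.'],
                 dtype=object)

# np.char.decode operates on fixed-width byte arrays, so cast first:
decoded = np.char.decode(batch.astype('S'), 'utf-8')
strings = decoded.tolist()  # plain Python list of str
print(strings[0])  # 'Thê first utf-8 string of the batçh.'
```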

Thank you very much for all your help.

@Ceceu I am running into this exact issue as well, and am wondering if you had found a good solution?

@oja,
The best solution I could find was adapting an example from the TensorFlow tutorial Load text, which uses tf.py_function.
Let me know if I can help more.
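Adapted to this thread's setup, the wiring is roughly as follows; encode_batch is a toy stand-in for the real tokenizer.batch_encode_plus call. The key point is that tf.py_function executes its function eagerly, so inside it .numpy() works even under Dataset.map.

```python
import tensorflow as tf

def encode_batch(texts, max_len=8):
    # texts is an EagerTensor here, because tf.py_function runs this
    # function eagerly -- so .numpy() is available again.
    strings = [t.decode("utf-8") for t in texts.numpy()]
    # Toy "tokenization": word lengths as IDs, padded to max_len.
    ids = [([len(w) for w in s.split()] + [0] * max_len)[:max_len]
           for s in strings]
    return tf.constant(ids, dtype=tf.int32)

def tokenize(sample):
    input_ids = tf.py_function(encode_batch, inp=[sample["query"]],
                               Tout=tf.int32)
    input_ids.set_shape([None, 8])  # py_function drops static shape info
    return input_ids

samples = [{"query": "this is a query 1", "doc": "doc 1"},
           {"query": "this is a query 2", "doc": "doc 2"}]
dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})
tokenized = dataset.batch(2).map(tokenize)

batch = next(iter(tokenized))
print(batch.shape)  # (2, 8)
```

The usual caveat applies: tf.py_function embeds eager Python in the pipeline, so the resulting graph is not serializable and the Python code runs on the host.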

@Ceceu got it, thank you!

Tokenizers can now output NumPy arrays with return_tensors='np', so I think this should work now.

Thanks @thomwolf, I will check it out and if it works on TPU then it solves https://github.com/huggingface/transformers/issues/5066


Did you check if it works on TPU?

It does not work on TPU
