Transformers: How to properly apply a tokenizer map function to a TensorFlow batched dataset?

Created on 18 Apr 2020  ·  9 comments  ·  Source: huggingface/transformers

Considering the following batched_dataset:

samples = [{"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
           {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
           {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
           {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"}]

dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})

batched_dataset = dataset.batch(2)

#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is one relevant document regarding query 1',
#      b'this is one relevant document regarding query 2'], dtype=object)>,
# 
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is a query 1', 
#      b'this is a query 2'], dtype=object)>
#}

and a map function to tokenize this batched_dataset:

def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc) 

I could tokenize the entire batched_dataset using a for-loop:

for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[  101,  2023,  2003,  1037, 23032,  1015,   102,     0],
#          [  101,  2023,  2003,  1037, 23032,  1016,   102,     0]],
#      dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 0],
#          [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}, 

# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102],
#          [ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102]], dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 1],
#          [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
#  ...

However, when using tf.data.Dataset.map, the following error arises:

tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'
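A minimal repro of the difference (a sketch, assuming TF 2.x with eager execution enabled): iterating a dataset yields eager tensors, while Dataset.map traces its function once with symbolic tensors that have no usable numpy() method.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([b"hello", b"world"])

# Iterating a dataset in eager mode yields EagerTensors, so .numpy() works:
eager = [t.numpy() for t in ds]
print(eager)  # [b'hello', b'world']

# Dataset.map traces its function in graph mode; the tensor it passes is
# symbolic, so calling .numpy() on it fails during tracing:
raised = False
try:
    ds.map(lambda t: t.numpy())
except Exception:  # AttributeError here on TF 2.x at the time of this thread
    raised = True
print(raised)  # True
```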

So, how does one properly apply a tokenizer map function to a batched dataset?

Note: I published a working example on Google Colab.


All 9 comments

This seems more like a TF-related question than a Transformers-related question. The issue stems from your code trying to get the value of a non-eager tensor using numpy. I believe the tf.data.Dataset.map method must trace its inputs, which results in the tensors not being eager.

Couldn't you build the tf.data.Dataset with already tokenized inputs instead?
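That approach might look like the sketch below, where toy_encode is a hypothetical stand-in for the real tokenizer.batch_encode_plus call (which would produce input_ids and attention_mask the same way): tokenize eagerly first, then build the dataset from the resulting arrays.

```python
import tensorflow as tf

def toy_encode(texts, max_len=8):
    # Stand-in for tokenizer.batch_encode_plus: per-word integer IDs,
    # padded to max_len, plus the matching attention mask.
    ids, mask = [], []
    for text in texts:
        tok = [len(w) for w in text.split()][:max_len]
        ids.append(tok + [0] * (max_len - len(tok)))
        mask.append([1] * len(tok) + [0] * (max_len - len(tok)))
    return {"input_ids": ids, "attention_mask": mask}

queries = ["this is a query 1", "this is a query 2"]
docs = ["this is one relevant document regarding query 1",
        "this is one relevant document regarding query 2"]

# Tokenize before the data enters tf.data, then slice into a dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (toy_encode(queries), toy_encode(docs))).batch(2)

query_batch, doc_batch = next(iter(dataset))
print(query_batch["input_ids"].shape)  # (2, 8)
```

The tradeoff is that all tokenization happens up front in memory rather than streaming through the pipeline, which is exactly what the next comment pushes back on.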

The ideal would be to follow the pipeline (read from file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.

When dealing with text, TensorFlow generates string tensors that are stored as byte strings:

<tf.Tensor: shape=(2,), dtype=string, numpy=array(
     [b'Thê first utf-8 string of the batçh.',
      b'Thê secônd utf-8 string of the batçh.'], dtype=object)>

However, I couldn't find an efficient way to decode this kind of tensor into a list of strings, and it gets worse when the byte strings contain non-ASCII characters.

What I really need is one of these two options:

  1. a tokenizer that can accept the aforementioned byte-string tensor as input to tokenize; or
  2. a vectorized approach to transforming a byte-string tensor into a list of strings.
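For option 2, NumPy itself offers a vectorized route (a sketch, independent of Transformers): np.char.decode decodes an entire byte-string array at once, non-ASCII content included.

```python
import numpy as np

# The kind of array tf.data hands back via .numpy() for a string tensor:
batch = np.array([b'Th\xc3\xaa first utf-8 string of the bat\xc3\xa7h.',
                  b'Th\xc3\xaa sec\xc3\xb4nd utf-8 string of the bat\xc3\xa7h.'],
                 dtype=object)

# np.char.decode operates on fixed-width byte arrays, so cast first:
decoded = np.char.decode(batch.astype('S'), 'utf-8')
strings = decoded.tolist()  # plain Python list of str
print(strings[0])  # 'Thê first utf-8 string of the batçh.'
```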

Thank you very much for all your help.

@Ceceu I am running into this exact issue as well, and am wondering if you had found a good solution?

@oja,
The best solution I could find was adapting an example from the TensorFlow tutorial Load text, which uses tf.py_function.
Let me know if I can help more.
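Adapted to this thread's setup, the wiring is roughly as follows; encode_batch is a toy stand-in for the real tokenizer.batch_encode_plus call. The key point is that tf.py_function executes its function eagerly, so inside it .numpy() works even under Dataset.map.

```python
import tensorflow as tf

def encode_batch(texts, max_len=8):
    # texts is an EagerTensor here, because tf.py_function runs this
    # function eagerly -- so .numpy() is available again.
    strings = [t.decode("utf-8") for t in texts.numpy()]
    # Toy "tokenization": word lengths as IDs, padded to max_len.
    ids = [([len(w) for w in s.split()] + [0] * max_len)[:max_len]
           for s in strings]
    return tf.constant(ids, dtype=tf.int32)

def tokenize(sample):
    input_ids = tf.py_function(encode_batch, inp=[sample["query"]],
                               Tout=tf.int32)
    input_ids.set_shape([None, 8])  # py_function drops static shape info
    return input_ids

samples = [{"query": "this is a query 1", "doc": "doc 1"},
           {"query": "this is a query 2", "doc": "doc 2"}]
dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})
tokenized = dataset.batch(2).map(tokenize)

batch = next(iter(tokenized))
print(batch.shape)  # (2, 8)
```

The usual caveat applies: tf.py_function embeds eager Python in the pipeline, so the resulting graph is not serializable and the Python code runs on the host.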

@Ceceu got it, thank you!

Tokenizers can now output NumPy arrays with return_tensors='np', so I think this should work now.

Thanks @thomwolf, I will check it out and if it works on TPU then it solves https://github.com/huggingface/transformers/issues/5066


Did you check if it works on TPU?

It does not work on TPU
