Consider the following batched_dataset:
samples = ([{"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
            {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
            {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
            {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"},
            ])
dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})
batched_dataset = dataset.batch(2)
#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is one relevant document regarding query 1',
# b'this is one relevant document regarding query 2'], dtype=object)>,
#
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
# [b'this is a query 1',
# b'this is a query 2'], dtype=object)>
#}
and a map function to tokenize this batched_dataset:
def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc)
I could tokenize the entire batched_dataset using a for-loop:
for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 1037, 23032, 1015, 102, 0],
# [ 101, 2023, 2003, 1037, 23032, 1016, 102, 0]],
# dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 0],
# [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>},
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102],
# [ 101, 2023, 2003, 2028, 7882, 6254, 4953, 102]], dtype=int32)>,
# 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
# array([[1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
# ...
However, when using tf.data.Dataset.map, the following error arises:
tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'
So, how does one properly apply a tokenizer map function to a batched dataset?
Note: I published a working example on Google Colab.
This seems like more of a TF question than a Transformers question. The issue stems from your code trying to get the value of a non-eager tensor via numpy. I believe tf.data.Dataset.map must trace its input function, so the tensors it passes are not eager.
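A quick way to observe this (using a hypothetical show helper, purely for illustration): a print inside the mapped function runs once at trace time and receives a symbolic tensor, which has no .numpy() method.
def show(sample):
    # Runs once during tracing; prints something like
    # <class 'tensorflow.python.framework.ops.Tensor'> Tensor("args_0:0", shape=(None,), dtype=string)
    print(type(sample["query"]), sample["query"])
    return sample
_ = batched_dataset.map(show)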
Couldn't you build the tf.data.Dataset with already tokenized inputs instead?
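For reference, a minimal sketch of that approach (assuming a transformers tokenizer is in scope; the padding argument is illustrative): tokenize everything eagerly in plain Python first, then build the dataset from the already-numeric features.
queries = [s["query"] for s in samples]
docs = [s["doc"] for s in samples]
# Tokenize eagerly, before tf.data gets involved.
enc_q = tokenizer.batch_encode_plus(queries, padding=True, return_tensors="tf")
enc_d = tokenizer.batch_encode_plus(docs, padding=True, return_tensors="tf")
# Build the dataset from the numeric features and batch as before.
pretokenized = tf.data.Dataset.from_tensor_slices((dict(enc_q), dict(enc_d))).batch(2)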
The ideal would be to follow the pipeline (read from the file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.
When dealing with text, TensorFlow generates string tensors that are stored as byte strings:
<tf.Tensor: shape=(2,), dtype=string, numpy=array(
 [b'Thê first utf-8 string of the batçh.',
  b'Thê secônd utf-8 string of the batçh.'], dtype=object)>
However, I didn't find an efficient way to decode this kind of tensor as a list of strings. It's even worse if the byte string contains non-ASCII characters.
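For completeness, the straightforward (if not especially fast) eager decode looks like this, a sketch assuming string_tensor is an eager string tensor:
# Decode each UTF-8 byte string back into a Python str; this also
# restores non-ASCII characters correctly.
strings = [b.decode("utf-8") for b in string_tensor.numpy()]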
What I really need is one of these two options:
Thank you very much for all your help.
@Ceceu I am running into this exact issue as well, and am wondering if you had found a good solution?
@oja,
The best solution I could find was adapting an example from the TensorFlow tutorial Load text, which uses tf.py_function.
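For anyone landing here, this is roughly what the adaptation looks like (a sketch, not the tutorial's exact code; the padding argument and the fixed four-tensor output signature are assumptions for illustration):
def py_tokenize(query, doc):
    # Inside tf.py_function everything runs eagerly, so .numpy() works.
    tokenized_query = tokenizer.batch_encode_plus(
        [q.decode("utf-8") for q in query.numpy()],
        padding=True, return_tensors="tf")
    tokenized_doc = tokenizer.batch_encode_plus(
        [d.decode("utf-8") for d in doc.numpy()],
        padding=True, return_tensors="tf")
    return (tokenized_query["input_ids"], tokenized_query["attention_mask"],
            tokenized_doc["input_ids"], tokenized_doc["attention_mask"])

def tokenize_map(sample):
    # tf.py_function wraps the eager code; Tout lists the output dtypes.
    q_ids, q_mask, d_ids, d_mask = tf.py_function(
        py_tokenize, inp=[sample["query"], sample["doc"]],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    return ({"input_ids": q_ids, "attention_mask": q_mask},
            {"input_ids": d_ids, "attention_mask": d_mask})

tokenized_dataset = batched_dataset.map(tokenize_map)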
Let me know if I can help more.
@Ceceu got it, thank you!
Tokenizers can now output NumPy arrays with return_tensors='np', so I think this should work now.
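If that works, a sketch of how it could slot in (assuming dynamic padding is acceptable): tokenize once up front with NumPy output and build the dataset from the arrays.
enc = tokenizer.batch_encode_plus(
    [s["query"] for s in samples], padding=True, return_tensors="np")
# dict(enc) holds NumPy arrays, which from_tensor_slices accepts directly.
query_dataset = tf.data.Dataset.from_tensor_slices(dict(enc)).batch(2)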
Thanks @thomwolf, I will check it out and if it works on TPU then it solves https://github.com/huggingface/transformers/issues/5066
Did you check if it works on TPU?
It does not work on TPU