Transformers: cast_bool_to_primitive breaks TensorFlow graph support.

Created on 16 Jul 2020 · 19Comments · Source: huggingface/transformers

🐛 Bug

To reproduce

import transformers
bert = tf.function(transformers.TFBertForMaskedLM.from_pretrained('bert-base-uncased'))

for i in range(2):
    (_, hidden_state) = bert(tf.constant([[10,11,12]]), output_hidden_states=True)
    print(f'computed {i}')

Errors with

ValueError: not enough values to unpack (expected 2, got 1)

Expected behavior

computed 1
computed 2

Same result as if tf.function was not used.

Environment info

Example environment : https://colab.research.google.com/gist/AndreasMadsen/593df94a3319dee58bba33a26efedeb3/untitled6.ipynb

transformers version: 3.0.2
Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
PyTorch version (GPU?): 1.5.1+cu101 (False)
Tensorflow version (GPU?): 2.2.0 (False)
Using GPU in script?:
Using distributed or parallel set-up in script?:

Details

The bug happens due to cast_bool_to_primitive, that was introduced in https://github.com/huggingface/transformers/commit/6e603cb7892b49a2cbbc10ba859759f92c3fb7a6. Before that, it was possible to get the hidden_states from Bert in TensorFlow graph/function mode.

Generally speaking, casting TensorFlow tensors to primitives is not a good practice, as it only works in eager mode. It is also completely unnecessary in this case, as using if bool_tensor_scalar: works perfectly fine.

```python
def print_bool(x):
if x:
print('True')
else:
print('False')

print_bool_graph = tf.function(print_bool)

print('eager:')
print_bool(True) # Prints True
print_bool(False) # Prints False
print_bool(tf.constant(True)) # Prints True
print_bool(tf.constant(False)) # Prints False

print('')
print('graph:')
print_bool_graph(True) # Prints True
print_bool_graph(False) # Prints False
print_bool_graph(tf.constant(True)) # Prints True
print_bool_graph(tf.constant(False)) # Prints False

I can see there are some cases where defaults are used. The right way to handle that is to implement the default handling upstream in the first `call()` method. A lesser way would be to implement it as:

```python
def cast_bool_to_primitive(x, default_value=False):
  if x is None:
    return default_value
  return x

wontfix

Source

AndreasMadsen

All 19 comments

Hey @AndreasMadsen,

Thanks a lot for the detailed error description! I added this function, so I will try to take a look as soon as possible. I will be off for the next two weeks though (but will not this issue down). If by change @jplu you find some time, feel free to take a look :-)

patrickvonplaten on 17 Jul 2020

Should be fixed in the PR https://github.com/huggingface/transformers/pull/5468

jplu on 17 Jul 2020

👍1

Hi @jplu, I'm sorry, but I doubt #5468 will fix the issue. Fundamentally speaking casting to primitives is not a good practice in TensorFlow, as it invalidates the use of @tf.function and is generally unnecessary as described above. Casting to primitives is, in my experience, just never the correct solution in TensorFlow.

I do think #5468 mitigates the issue, which is maybe where the confusion is coming from. This is because, the models will now correctly default to the config object when output_hidden_states=True is not specified as an input. In those cases object property is never cast to a tensor to begin with, therefore the @tf.function graph will be statically compiled to always output the hidden_states, as intended.

However, the behavior is different when output_hidden_states=True is specified as an input, as it will be cast to a Tensor when it becomes part of the inputs argument in call(). After that, it is not possible to convert back to a primitive, as that invalidates @tf.function.

If you insist on keeping it as a primitive, the best solution might be to specify it as an aux-input, similar to training and mask in a keras.layers.Layer, as they don't get converted the same way. I'm not familiar enough with the Keras internals to know the details here, and I think it might also be incompatible with compute_output_shape etc.

_BTW, in the keras RNN layers, hidden_state is only specified in the constructor, properly because it can get a bit messy having to specify it in the inputs, but I don't see anything fundamentally wrong with specifying it in inputs._

AndreasMadsen on 20 Jul 2020

👍1

The PR fix your issue at least for the piece of code you are providing, here the result I have:

>>> import transformers
>>> import tensorflow as tf
>>> bert = tf.function(transformers.TFBertForMaskedLM.from_pretrained('bert-base-uncased'))
>>> for i in range(2):
...   (_, hidden_state) = bert(tf.constant([[10,11,12]]), output_hidden_states=True)
...   print(f'computed {i}')
... 
computed 0
computed 1
>>>

jplu on 20 Jul 2020

I believe the solution of @AndreasMadsen is correct -- output_hidden_states and also output_attentions should be passed as named arguments of the call method of TFBertEncoder.call -- that way they are not converted to Tensors and can be just constants.

foxik on 21 Jul 2020

@jplu Sorry for the misunderstanding. The test now works because output_hidden_states is now an auxiliary-input, thus is stays a primitive, thus the casting is no longer involved. However, casting to primitives is still not good practice in TensorFlow, so it doesn't take much to break it.

I read your code more thoroughly, and have the following two failure cases for you.

Case 1:

bert = tf.function(transformers.TFBertForMaskedLM.from_pretrained('bert-base-uncased'))
outputs = bert(tf.constant([[10,11,12]]), output_attentions=True)
assert len(outputs) == 2

Fails with an IndexError. Essentially because casting in a tf.function can not be done. Edit: I think it is an inconsistent casting, in some functions the output_attentions is a primitive and the "casting" works by being a no-op, it other functions the casting can't be done as it is in a tf.function.

Case 2:

bert = tf.function(transformers.TFBertForMaskedLM.from_pretrained('bert-base-uncased'))
outputs = bert(tf.constant([[10,11,12]]), output_hidden_states=tf.constant(True))
assert len(outputs) == 2

outputs only one output (two is exected), because tf.constant can not be casted in a tf.function.

However, I would like that, instead of working around these issues, you read the documentation on AutoGraph.

I really think there is a misunderstanding here, about what TensorFlow AutoGraph can do for you and why casting to primitives is really not necessary at all. I would suggest you read https://www.tensorflow.org/guide/function#conditionals and also check out the hidden docs, such as https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/g3doc/reference/control_flow.md#effects-of-the-tracing-process which explains it in more details.

What @foxik says is true, but I think depending on the auxiliary-inputs just avoids the misunderstanding. Truly, casting to primitives is just not a good idea.

AndreasMadsen on 21 Jul 2020

Ok, thanks for the hints, I will review that part! I should be able to start working on it from tomorrow or Thursday.

jplu on 21 Jul 2020

❤1

I have started to work on this today. I basically just removed the casting to primitive function and updated the code accordingly. Nevertheless it doesn't work when booleans are tensors (your Case 2) because AutoGraph do not let you return an output of different size on each branch. A simple example to simply the use case can be:

import tensorflow as tf

@tf.function
def output(output1, output2):
    first_conditional_output = ()
    second_conditional_output = ()

    for i in range(2):
        if output1:
            first_conditional_output = first_conditional_output + (i,)

        if output2:
            second_conditional_output = second_conditional_output + (i,)

    outputs = (0,)
    if output1:
        outputs = outputs + (first_conditional_output,)
    if output2:
        outputs = outputs + (second_conditional_output,)

    return outputs

This piece of code works fine as long as the parameters are primitives, but when using tf.constant(True) for at least one of the two makes it fail with:

/opt/anaconda3/envs/transformers/lib/python3.7/site-packages/tensorflow/python/ops/cond_v2.py:633 _make_indexed_slices_indices_types_match
    assert len(set(outs_per_branch)) == 1, outs_per_branch

AssertionError: [1, 0]

I currently struggling on this, and try to find a workaround but if it is not possible to fix this, I think we will just skip this functionality.

jplu on 23 Jul 2020

@jplu TensorFlow guys internally use tf.cond to handle this case, for example here: https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/python/framework/smart_cond.py#L27-L59 . Keras layers like Dropout have another special case for Variables here (the smart_module.smart_cond is the above method): https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/python/keras/utils/tf_utils.py#L42-L65

foxik on 24 Jul 2020

I already came across this, but I don't see how it could solve this issue. Each if in this piece of code is automatically translated to a tf.cond. The problem is that there is no else and then the branches true_fn and false_fn will have a different output size from each other. This is not allowed and this is the problem :(

Unless I'm missing something in what you said.

I'm still trying to figure out how to find a workaround anyway.

jplu on 24 Jul 2020

@jplu I thought that tf.cond can return arbitrary results from the true_fn and false_fn (contrary to Autograph), but you are right, it is not allowed during graph construction -- my bad.

Personally I think it is enough for the output_hidden_size to be a Python constant (and fail otherwise). It is in fact not obvious what type should the output have in case computation graph is used and the output_hidden_size is undefined Tensor value.

So I think you were right with skipping the functionality ;-)

foxik on 24 Jul 2020

@jplu Good point, regarding the variable output-signature. I think it is perfectly acceptable to assert for a primitive input, at least for now.

Alternatively, the solution would be to return an empty tensor when tf.constant([False]) is used. Such an approach could look like this:

import tensorflow as tf

@tf.function
def output(output1, output2):
    first_conditional_output = tf.TensorArray(tf.int64, size=0, dynamic_size=True, clear_after_read=True)
    second_conditional_output = tf.TensorArray(tf.int64, size=0, dynamic_size=True, clear_after_read=True)

    for i in range(2):
        if output1:
            first_conditional_output = first_conditional_output.write(i, i)
        if output2:
            second_conditional_output = second_conditional_output.write(i, i)

    outputs = (0,)
    if isinstance(output1, tf.Tensor) or output1:
        outputs = outputs + (first_conditional_output.stack(),)

    if isinstance(output2, tf.Tensor) or output2:
        outputs = outputs + (second_conditional_output.stack(),)

    return outputs

edit: PS: there is more documentation here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/g3doc/reference/index.md

AndreasMadsen on 25 Jul 2020

@AndreasMadsen Thanks!! I basically already came across the same solution (which is I think the only one :smile:), but the issue here is that we will always have three outputs, which is not really what is wanted. I have to talk to the whole team to see if they are ok with this or not, If yes, we will update the code base accordingly otherwise we will skip this.

jplu on 26 Jul 2020

I'm trying to figure out if using tf.boolean_mask could work.

jplu on 27 Jul 2020

Hi @jplu, I will leave it up to you to decide what is "wanted". But you should consider the usage pattern when unpacking the output:

with always three outputs:

@tf.function
def usage(hasOutput1, hasOutput2):
  (one, output1, output2) = output(hasOutput1, hasOutput2)

  tf.print(one)
  if hasOutput1:
    tf.print(output1)
  if hasOutput2:
    tf.print(output2)

with always variable outputs:

@tf.function
def usage(hasOutput1, hasOutput2):

  output1 = tf.zeros((0,))
  output2 = tf.zeros((0,))
  if hasOutput1 and hasOutput2:
    (one, output1, output2) = output(hasOutput1, hasOutput2)
  elif hasOutput1:
    (one, output1) = output(hasOutput1, hasOutput2)
  elif hasOutput2:
    (one, output2) = output(hasOutput1, hasOutput2)
  else:
    (one, ) = output(hasOutput1, hasOutput2)

  tf.print(one)
  if hasOutput1:
    tf.print(output1)
  if hasOutput2:
    tf.print(output2)

AndreasMadsen on 29 Jul 2020

You are totally right @AndreasMadsen!! Now everything should work like expected in the PR at the condition to use primitive booleans. If we all decide to fix the output size to 3 I will open a new PR for this.

Nevertheless, I've just spotted another problem with the usage of TensorArray, instead to have a tuple that looks like : (batch_size, num_tokens, hidden_states_size) * num_layers we get a tensor that looks like (num_layers, batch_size, num_tokens, hidden_states_size) which cause several other issues for later.

jplu on 29 Jul 2020

Nevertheless, I've just spotted another problem with the usage of TensorArray, instead to have a tuple that looks like : (batch_size, num_tokens, hidden_states_size) * num_layers we get a tensor that looks like (num_layers, batch_size, num_tokens, hidden_states_size) which cause several other issues for later.

To avoid a breaking change, you could do it as a tuple of empty tensors. For the next major version, I would suggest it become one big tensor. You can swap transpose/swap the first and last axis, to make it mostly compatible, with indexing, unpack, etc..

AndreasMadsen on 29 Jul 2020

Exactly!! This should be a good way to go if the decision of having always 3 outputs is taken.

jplu on 29 Jul 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.