Ideally I would like to use TFRobertaModel or any other model (BERT, XLNet) as a part (module) of a bigger model. For example, it would be nice to start with RoBERTa as a document encoder and then build a multi-label classifier on top of that. Possibly there are ways to hack TFRobertaForSequenceClassification in order to do multi-label classification using custom configurations, but the point is:
How can we leverage RoBERTa (or any other pre-trained model) and stack other layers on top (e.g., I may want to add a custom attention layer or build a hierarchical version of RoBERTa with a shared RoBERTa encoder)?
import tensorflow as tf
import numpy as np
from transformers import TFRobertaModel, RobertaTokenizer
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# Define input layer
inputs = Input(shape=(None,), dtype='int32')
# Define RoBERTa as a document encoder
roberta_model = TFRobertaModel.from_pretrained('roberta-base')
# Collect hidden state representations
roberta_encodings = roberta_model(inputs)[0]
# Collect CLS representations
document_encodings = tf.squeeze(roberta_encodings[:, 0:1, :], axis=1)
# Add classification layer (Linear + Sigmoid)
outputs = Dense(10, activation='sigmoid')(document_encodings)
# Build meta-model
model = Model(inputs=[inputs], outputs=[outputs])
# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy')
# Train model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
x = np.asarray(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
y = tf.convert_to_tensor(np.zeros((1,10)), dtype=tf.float32)
model.fit(x, y)
The main issue here is that we can't use an Input layer to feed Roberta... Any ideas for a workaround to make this piece of code working...?
The main issue is at line 85, in the forward pass of TFRobertaMainLayer:
It seems that passing Input placeholders messes up this comparison:
OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
When I comment out this block of code, the training process works... I can't find any way to bypass this error without commenting it out, though...
@iliaschalkidis I've also run into this issue when trying to make a plug-and-play wrapper around the numerous TF-compatible models. Like you, I was able to get the RoBERTa model working by hacking around it a bit. Not ideal, but it works.
For anyone else that's interested, the line above that raises the error occurs in TFRobertaMainLayer.call. You can get around it by wrapping the call as a TensorFlow 2.0 function whenever you want to use a model that depends on TFRobertaMainLayer (which is all of them?). Here I'm using TFRobertaForSequenceClassification:
from transformers import TFRobertaForSequenceClassification
import tensorflow as tf
# Establish a RoBERTa-based classifier.
clf = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=5)
# "Decorate" the `call` method as a TensorFlow2.0 function.
clf.roberta.call = tf.function(clf.transformer.roberta.call)
Using that, I was successfully able to fine-tune the classifier on a multi-GPU setup without much trouble. I still get a ton of warnings from ZMQ and TensorFlow, but I'm not yet sure they're officially transformers issues.
Note: I suspect you'll have to wrap the call instance method any time you initialize this model (e.g., if you save your pre-trained model and re-load for prediction/inference, you may not be able to just use TFRobertaForSequenceClassification). In that case, it may be simpler to define a minimal subclass that does it for you. This is untested code, but I suspect it'd work alright:
import tensorflow as tf
import transformers
class _TFRobertaForSequenceClassification(transformers.TFRobertaForSequenceClassification):
    def __init__(self, config, *inputs, **kwargs):
        super(_TFRobertaForSequenceClassification, self).__init__(config, *inputs, **kwargs)
        self.roberta.call = tf.function(self.roberta.call)
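If it helps, loading the subclass should then work the same way as the stock class (untested, same caveat as above; the num_labels value is just an example):
# Load the subclass exactly like the stock class; `call` gets wrapped in __init__.
clf = _TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=5)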
Hope this helps!
Unrelated tip: I also had a bit of trouble using TF v2 metrics (e.g., tf.keras.metrics.[Precision/Recall/AUC]) because the TFRobertaClassificationHead outputs logits (no softmax activation). If anybody else is wondering, you can set the classifier head's output layer to use softmax quite easily:
# Continuing from the previous setup.
clf.classifier.out_proj.activation = tf.keras.activations.softmax
This way, you can monitor Precision/Recall/AUC in the call to clf.compile:
# Compile our model.
clf.compile(
    optimizer=...,
    loss=...,
    metrics=[
        tf.keras.metrics.CategoricalCrossentropy(from_logits=False),
        tf.keras.metrics.Precision(thresholds=.50, name="precision"),
        tf.keras.metrics.Recall(thresholds=.50, name="recall"),
        tf.keras.metrics.AUC(curve="PR", name="auc-pr")
    ]
)
Furthermore, if you want to just fine-tune the classifier layer, you can easily freeze the core RoBERTa layers:
# Note you have ~125M trainable parameters. This'll take a while!
clf.summary()
# Freeze core RoBERTa model (embeddings, encoder, pooler).
clf.roberta.trainable = False
# Note you have ~600K trainable parameters. Much better!
clf.summary()
@dataframing thanks a lot, this was really helpful! I opted to go with a very similar solution...
Define a meta-model on top of TFRobertaModel:
import tensorflow as tf
import transformers
class ROBERTA(transformers.TFRobertaModel):
    def __init__(self, config, *inputs, **kwargs):
        super(ROBERTA, self).__init__(config, *inputs, **kwargs)
        self.roberta.call = tf.function(self.roberta.call)
Build a wrapper tf.keras.Model:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Define inputs (token_ids, mask_ids, seg_ids)
token_inputs = Input(shape=(None,), name='word_inputs', dtype='int32')
mask_inputs = Input(shape=(None,), name='mask_inputs', dtype='int32')
seg_inputs = Input(shape=(None,), name='seg_inputs', dtype='int32')
# Load model and collect encodings
roberta = ROBERTA.from_pretrained('roberta-base')
roberta_encodings = roberta([token_inputs, mask_inputs, seg_inputs])[0]
# Keep [CLS] token encoding
doc_encoding = tf.squeeze(roberta_encodings[:, 0:1, :], axis=1)
# Apply dropout
doc_encoding = Dropout(0.1)(doc_encoding)
# Final output (projection) layer
outputs = Dense(n_classes, activation='sigmoid', name='outputs')(doc_encoding)
# Wrap-up model
model = Model(inputs=[token_inputs, mask_inputs, seg_inputs], outputs=[outputs])
model.compile(optimizer=Adam(lr=3e-4), loss='binary_crossentropy')
Everything works like a charm, except for the annoying warnings. However, on a single RTX 2080 Ti (or any other 12GB GPU) the batch size is limited to 4-5 samples of 512 subword units (the same applies to BERT), whereas I was able to go up to 8 when I called bert-base via TensorFlow Hub and wrapped it as a Keras layer, which is really weird... Any idea why moving to the transformers library and TF2 would make such a difference?
Thanks for the report.
We can probably get rid of this test in the TF version of RoBERTa if it's a blocking element for integrating with other Keras modules.
I've never been a huge fan of this hacky solution anyway. In the future, we should probably move forward with a breaking change in the tokenizers and have control tokens included by default in the tokenizer encoding output instead of having them as an option.
cc @LysandreJik @julien-c
@thomwolf Having the tokenizers include special tokens in the call to tokenizer.encode[_plus] seems like a pretty safe default, but I think it also makes sense to have this inline inspection to make sure that the end user has properly encoded their tokens. Wrapping the call method in a tf.function as above seems to make it work fine as-is, so maybe there's a way to have the best of both worlds?
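For reference, this is roughly what I mean (a minimal sketch; the example text is arbitrary). encode already adds the <s> and </s> control tokens when add_special_tokens=True, so making that the default would mostly affect users who currently rely on the bare encoding:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# With add_special_tokens=True the ids are wrapped in <s> ... </s> (RoBERTa's control tokens).
with_specials = tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)
# Without the flag you get the bare token ids, which is what the inline check
# in TFRobertaMainLayer tries to warn about.
without_specials = tokenizer.encode("Hello, my dog is cute", add_special_tokens=False)
print(with_specials)
print(without_specials)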
@dataframing BERT and RoBERTa work like a charm with the tweaks you proposed, although with XLNet I still have issues:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from transformers import TFXLNetModel

# Define token ids as inputs
word_inputs = Input(batch_shape=(2, 2000), name='word_inputs', dtype='int32')
# Call XLNet model
xlnet = TFXLNetModel.from_pretrained('xlnet-base-cased')
xlnet_encodings = xlnet(word_inputs)
# Collect last hidden step (CLS)
doc_encoding = tf.squeeze(xlnet_encodings[:, -1:, :], axis=1)
# Apply dropout
doc_encoding = Dropout(dropout_rate)(doc_encoding)
# Final output (projection) layer
outputs = Dense(n_classes, activation='softmax', name='outputs')(doc_encoding)
# Compile model
model = Model(inputs=[word_inputs], outputs=[outputs])
model.compile(optimizer=Adam(lr=lr), loss='categorical_crossentropy')
xlnet_encodings = xlnet(word_inputs)
.../tensorflow_core/python/keras/engine/base_layer.py", line 842, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
.../tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
AttributeError: in converted code:
    relative to .../transformers/modeling_tf_xlnet.py:810 call *
        outputs = self.transformer(inputs, **kwargs)
    tensorflow_core/python/keras/engine/base_layer.py:874 __call__
        inputs, outputs, args, kwargs)
    tensorflow_core/python/keras/engine/base_layer.py:2038 _set_connectivity_metadata_
        input_tensors=inputs, output_tensors=outputs, arguments=arguments)
    tensorflow_core/python/keras/engine/base_layer.py:2068 _add_inbound_node
        arguments=arguments)
    tensorflow_core/python/keras/engine/node.py:110 __init__
        self.output_shapes = nest.map_structure(backend.int_shape, output_tensors)
    tensorflow_core/python/util/nest.py:535 map_structure
        structure[0], [func(*x) for x in entries],
    tensorflow_core/python/util/nest.py:535 <listcomp>
        structure[0], [func(*x) for x in entries],
    tensorflow_core/python/keras/backend.py:1185 int_shape
        shape = x.shape
    AttributeError: 'NoneType' object has no attribute 'shape'
Pretty much the same story happens using the TFXLNetForSequenceClassification class:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import TFXLNetForSequenceClassification

# Call TFXLNetForSequenceClassification model
model = TFXLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=n_classes)
# Amend activation functions
model.logits_proj.activation = tf.keras.activations.softmax
# Compile model
model.compile(optimizer=Adam(lr=lr), loss='categorical_crossentropy')
File ".../tensorflow_core/python/keras/engine/training.py", line 2709, in _set_inputs
    outputs = self(inputs, **kwargs)
File ".../tensorflow_core/python/keras/engine/base_layer.py", line 842, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
File ".../tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in converted code:
    transformers/modeling_tf_xlnet.py:916 call *
        output = self.sequence_summary(output)
    tensorflow_core/python/keras/engine/base_layer.py:842 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    transformers/modeling_tf_utils.py:459 call *
        output = self.first_dropout(output)
    tensorflow_core/python/autograph/impl/api.py:396 converted_call
        return py_builtins.overload_of(f)(*args)
    TypeError: 'NoneType' object is not callable
In your case, it might be because you are not extracting the hidden states from the model tuple output.
This line: xlnet_encodings = xlnet(word_inputs)
Should be like this:
outputs = xlnet(word_inputs)
xlnet_encodings = outputs[0]
I'm working on adding some tests on this integration with other Keras modules here: #1482
Hi @thomwolf,
Even with this update it keeps producing the exact same error. The actual error happens internally in TF2, when the abstract keras.Layer calls the AutoGraph API to do some adjustments. This actually parses the whole network layer by layer and converts the call() functions for some reason. It fails at the very end, when it tries to convert the final (outer) call of TFXLNetMainLayer:
outputs = self.transformer(inputs, **kwargs)
The main reason, as I see it through debugging, is the fact that you return by default, as part of the outputs, a list called new_mems. If the user does not provide such an input, this is a list of None values, which the internal Keras engine cannot handle later on, because the elements of this list lack a shape, leading to the aforementioned error AttributeError: 'NoneType' object has no attribute 'shape'.
The only way to get past this at this stage is, again, some hacking at line 653 of modeling_tf_xlnet.py, changing:
outputs = (tf.transpose(output, perm=(1, 0, 2)), new_mems)
to
outputs = tf.transpose(output, perm=(1, 0, 2))
Probably, if I pass memories as an input to TFXLNetModel, this won't happen anymore and I'll avoid the hack. Could you please remind me of the notion of memories and how I should pass this information when calling the model? Is it a single integer denoting how many steps back the Transformer-XL can use?
In two words, memories are cached hidden states that are reused to speed things up or to allow for longer inputs. The best way to understand the notion of memory is to read the Transformer-XL paper, which is here: http://arxiv.org/abs/1901.02860
We have a couple of models outputting memories, and it does seem to be a problem for Keras (GPT-2 has the same issue).
So the best (non-breaking) solution is probably to add a flag in the configuration that you can set to False to avoid outputting memories or cache.
Great, I read Transformer-XL a few months ago. Maybe if I pass memories as input, I'll avoid this error, and probably I have to do so if I want the model to act as a real Transformer-XL and not forget all previous timesteps at each segment... What's the specification for mems: a tensor of shape (batch_size,) containing integers (e.g., 200 steps back) for the memory length?
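To make the question concrete, here is the kind of usage I have in mind (an untested sketch, assuming mem_len can be overridden via from_pretrained and that the TF call accepts a mems keyword the same way the PyTorch model does):
import tensorflow as tf
from transformers import TFXLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
# mem_len: a single integer, how many past hidden states per layer to cache.
xlnet = TFXLNetModel.from_pretrained('xlnet-base-cased', mem_len=384)

segment_1 = tf.constant([tokenizer.encode("First segment of a long document.")])
segment_2 = tf.constant([tokenizer.encode("Second segment of the same document.")])

# First segment: no memory yet; the model also returns the new per-layer memories.
hidden_1, mems = xlnet(segment_1)[:2]
# Second segment: feed the cached hidden states back in, so attention can
# look up to mem_len steps into the previous segment.
hidden_2, mems = xlnet(segment_2, mems=mems)[:2]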
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.