Transformers: How do you do inference in production?

Created on 12 Mar 2020  ·  31 Comments  ·  Source: huggingface/transformers

❓ Questions & Help

Details


I was wondering how you do inference in production? I tried to convert this model to a TensorFlow model but failed.

This is what I tried:

import tensorflow as tf
from transformers import TFGPT2LMHeadModel

tf_model = TFGPT2LMHeadModel.from_pretrained("tmp/", from_pt=True)
tf.saved_model.save(tf_model, "tmp/saved")
loaded = tf.saved_model.load("tmp/saved")
print(list(loaded.signatures.keys()))

And it returns an empty list

A link to original question on Stack Overflow: https://stackoverflow.com/questions/52826134/keras-model-subclassing-examples

All 31 comments

Did you try just using this save_... function: https://github.com/huggingface/transformers/blob/2e81b9d8d76a4d41a13f74eb5e0f4a65d8143cab/src/transformers/modeling_tf_utils.py#L232 ?

->

tf_model = TFGPT2LMHeadModel.from_pretrained("tmp/", from_pt=True)
tf_model.save_pretrained("./tf_model")
tf_model = TFGPT2LMHeadModel.from_pretrained("./tf_model")

Hi, thanks for the reply. But what I want to do is save it as a .pb file in order to serve the model using TensorFlow Serving.

Can we re-open this? It's still an issue.

How do we re-open this issue?

Sure, sorry I guess I closed this too early!

Any progress on this issue?
How do we save the model for production?

Hmm, I am not really familiar with tensorflow protobuf saving -> @LysandreJik @jplu do you know more about this maybe?

Hello !

To create a saved model you have to run something like the following lines:

import tensorflow as tf
from transformers import TFXXXModel, XXXTokenizer

hf_model = TFXXXModel.from_pretrained('model/location/path')
tokenizer = XXXTokenizer.from_pretrained("tokenizer/location/path")
features = tokenizer.encode_plus("Sentence to featurize", add_special_tokens=True, return_tensors="tf")
hf_model._set_inputs(features)
tf.saved_model.save(hf_model, "saved_model/location/path")

Replace XXX with the name of the model you plan to save.
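For example, a concrete version of those lines for the GPT-2 model from the original question might look like the following (a sketch only: the tmp/ paths are the asker's placeholders, and "gpt2" is assumed as the matching tokenizer):

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Same pattern as above, specialised to GPT-2; "tmp/" is the local checkpoint directory.
hf_model = TFGPT2LMHeadModel.from_pretrained("tmp/", from_pt=True)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
features = tokenizer.encode_plus("Sentence to featurize", add_special_tokens=True, return_tensors="tf")
hf_model._set_inputs(features)
tf.saved_model.save(hf_model, "tmp/saved")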

It is also planned to add a to_saved_model() method to the trainer, to allow anybody to automatically create a saved model without having to run those lines.

Hi!

Sorry, I misunderstood. I thought all TF models were saved by the TF Trainer and that all TF Trainer saved models would have a hard time with inference in production, so I thought this post was similar to mine: https://github.com/huggingface/transformers/issues/4758

After finishing with the sample code and sample data, I checked the "output_dir/saved_model" folder; it was empty. Then I reran the code to save the model to a new directory:

model = TFAutoModelForTokenClassification.from_pretrained(
            model_args.model_name_or_path,
            from_pt=bool(".bin" in model_args.model_name_or_path),
            config=config,
            cache_dir=model_args.cache_dir,
        )

model.save('saved_model/my_model')
newmodel = tf.keras.models.load_model('saved_model/my_model')

I get the message that the model is not compiled:
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.

I am wondering how to extract the fine-tuned local model for inference. Thanks.

Look at the piece of code I posted; it is totally different :) Also, you are not using the load and save from the lib, so the error message is normal.

hf_model = TFXXXModel.from_pretrained('model/location/path')
tokenizer = XXXTokenizer.from_pretrained("tokenizer/location/path")

Are these the official models like 'bert-base-uncased'? If yes, then it is not trained.
If it is a local model, I don't know where the local model is, because the "saved_model" folder is empty.

'you are not using the load and save from the lib, the error message is normal.'
--- which lib are you referring to? I only followed the official TensorFlow manual: https://www.tensorflow.org/guide/saved_model

OK, sorry, then I didn't get what you meant. If I recall correctly, what you are looking for is to load a trained model and run inference with it, right?

Right.
I also wish to serve the model through TF serving.

OK, then first try the following piece of code and tell me if it works for you:

from transformers import BertTokenizer, TFBertForTokenClassification
import tensorflow as tf

model = TFBertForTokenClassification.from_pretrained("bert-base-uncased")
tf.saved_model.save(model, "saved_model")

loaded_model = tf.saved_model.load("saved_model")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
features = {"input_ids": tokenizer.encode("it is me", add_special_tokens=True, return_tensors="tf")}
print(loaded_model(features, training=False))

If this works, you can do the same for your trained model; just specify your output dir in the .from_pretrained() function. If you want to create a more elaborate signature than the default one, you have to follow this part of the documentation.
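For illustration, here is a minimal sketch (names and paths are placeholders, not from this thread) of one way to attach such a signature with a variable-length input_ids dimension, using the same tf.Module wrapper idea that a later comment in this thread uses:

import tensorflow as tf
from transformers import TFBertForTokenClassification

class ServingWrapper(tf.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    @tf.function
    def __call__(self, input_ids):
        # Expose the token-level logits under an explicit output name.
        return {"logits": self.model(input_ids, training=False)[0]}

wrapper = ServingWrapper(TFBertForTokenClassification.from_pretrained("bert-base-uncased"))
# [None, None] = (batch size, sequence length), so any sentence length is accepted.
signature = wrapper.__call__.get_concrete_function(
    tf.TensorSpec([None, None], tf.int32, name="input_ids"))
tf.saved_model.save(wrapper, "saved_model_with_signature", signatures=signature)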

Later, the TF Trainer will create a saved model at the same time as the usual .h5 file. It will then be more user friendly to have your own saved model and use it in production with TF Serving.

Yes, the above code works.

I still have some doubts about how TFTrainer loads the saved model. When it is set to prediction mode, even if I change the output_dir to nonsense, it can still do the prediction. I also noticed the output_dir/saved_model folder is empty. If so, how can the TF Trainer load the model? I am asking these questions to make sure I save my fine-tuned model to the right place, then load and serve it.

python3 run_tf_ner.py --data_dir ./ \
  --labels ./labels.txt \
  --model_name_or_path $BERT_MODEL \
  --output_dir $OUTPUT_DIR \
  --max_seq_length $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_device_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --seed $SEED \
  --do_predict

If I train my model this way and would like to save it, I need to set the code to prediction mode, initialize the trainer, and save the model through tf.saved_model.save(model, "saved_model"), correct?

I tested it. That way I was not able to save the model.
https://colab.research.google.com/drive/1uPCpR31U5VRMT3dArGyDK9WT6hKQa0bv?usp=sharing

So I am still wondering how to save a .pb model from a TF Trainer trained model.

If I train my model this way and would like to save it, I need to set the code to prediction mode, initialize the trainer, and save the model through tf.saved_model.save(model, "saved_model"), correct?

No, you just have to open your Python prompt and run these few lines:

import tensorflow as tf
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("<OUTPUT_DIR>")
tf.saved_model.save(model, "saved_model")

And of course replace <OUTPUT_DIR> with the proper location of your model.

The trainer is only here to train a model and not to serve a model :) That's why it is called trainer ;)

If you want a saved model, you have to create it yourself with the piece of code I gave you. I suggest you also create your own signature (as indicated in the TF documentation linked above) and then run it as detailed in this documentation section.

For now, the models saved by the TF Trainer are not compatible with TF Serving; you have to do it yourself manually, but this will change in the near future.

  1. If the trainer is just used for training, why is there a prediction done with the trainer in run_tf_ner.py, line 246:

    predictions, label_ids, metrics = trainer.predict(test_dataset.get_dataset())

If I set the mode to prediction, initialize the trainer with a nonsense output_dir, and replace test_dataset.get_dataset() with my own data, I can actually get the predictions. I guess it is initialized from the checkpoints dir.

It seems that, with the logic written in run_tf_ner.py, rather than model.predict(sentence) we need to do prediction through the trainer with trainer.predict(sentence). I am not sure if I am right, but line 246 is there, and I can get predicted results with the initialized trainer in prediction mode.

  2. If I use the code discussed in this post to save and load the model, the loaded model would not convert the sentence to features.
from transformers import TFAutoModelForTokenClassification, BertTokenizer, TFBertForTokenClassification
import tensorflow as tf

output_dir = "model"
saved_model_dir = "tf2_0606_german"

model = TFAutoModelForTokenClassification.from_pretrained(output_dir)
tf.saved_model.save(model, saved_model_dir)
loaded_model = tf.saved_model.load(saved_model_dir)

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
sentence = "1951 bis 1953 wurde der nördliche Teil als Jugendburg des Kolpingwerkes gebaut ."
features = {"input_ids": tokenizer.encode(sentence, add_special_tokens=True, return_tensors="tf")}

print(model(features, training=False))
print(loaded_model(features, training=False))

The error message can be found here:
https://colab.research.google.com/drive/1uPCpR31U5VRMT3dArGyDK9WT6hKQa0bv?usp=sharing#scrollTo=SBCchEi-qlnA

My suspicion is that "output_dir" does not save all the information it needs, and that the "checkpoint" directory is where the trainer gets initialized when it is set to prediction mode. But I am not sure how to recover the model information for production with these two directories.

06/06/2020 07:53:52 - INFO - transformers.trainer_tf -   Saving checkpoint for step 1500 at checkpoint/ckpt-3
06/06/2020 07:53:55 - INFO - transformers.trainer_tf -   Saving model in model
06/06/2020 07:53:55 - INFO - transformers.trainer_tf -   Saving model in model/saved_model

I also found one more complication. The code you showed works only for sentences containing three words or less. If "it is me" is changed to "it is me again", the code will return the same argument error message I mentioned in the last response.

from transformers import BertTokenizer, TFBertForTokenClassification
import tensorflow as tf

model = TFBertForTokenClassification.from_pretrained("bert-base-uncased")
tf.saved_model.save(model, "saved_model")

loaded_model = tf.saved_model.load("saved_model")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
features = {"input_ids": tokenizer.encode("it is me again", add_special_tokens=True, return_tensors="tf")}
print(loaded_model(features, training=False))

If the trainer is just used for training, why is there a prediction done with the trainer in run_tf_ner.py, line 246:

This part is only here to evaluate the model and output the predictions on the test set into a file, not for inference in production. These are two distinct cases.

If I set the mode to prediction, initialize the trainer with a nonsense output_dir, and replace test_dataset.get_dataset() with my own data, I can actually get the predictions. I guess it is initialized from the checkpoints dir.

Yes, this is normal, because predict is only there to evaluate your model on a dataset, and it is not initialized from the checkpoint dir but from the .h5 file in your model folder only.

If I use the code discussed in this post to save and load the model, the saved model can convert the sentence to features, but it cannot do any prediction; the loaded model would not convert the sentence to features.

This is normal because your input doesn't correspond to the signature. The big picture is that from the loaded_model(...) line you don't get features, you get the real output of the model, which is what a saved model produces: a tensor of values for each token, where each value is the probability of the corresponding label.

Hence, once you have your saved model, run the command:

tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=ner \
  --model_base_path="tf2_0606_german" >server.log 2>&1

Now, you have an API that wraps your model. Finally, in a Python script you can do:

import json
import numpy
import requests
my_features = # call here the tokenizer
data = json.dumps({"signature_name": "serving_default",
                   "instances": my_features})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/ner:predict',
                              data=data, headers=headers)
predictions = numpy.array(json.loads(json_response.text)["predictions"])

Finally, you get your predictions, and you have to code the translation from predictions to text yourself.
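For illustration, a rough sketch of that preds -> text step (not part of the original answer), assuming predictions has shape (batch, sequence length, number of labels), input_ids is the list of token ids sent to the server, and id2label is a hypothetical label map (in practice it comes from the model config's id2label):

import numpy as np

id2label = {0: "O", 1: "B-PER", 2: "I-PER"}          # hypothetical label map
tokens = tokenizer.convert_ids_to_tokens(input_ids)  # same tokenizer used to build the request

for token, label_scores in zip(tokens, predictions[0]):
    # Pick the most likely label for each token.
    print(token, id2label[int(np.argmax(label_scores))])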

I also found one more complication. The code you showed works only for sentences containing three words or less. If "it is me" is changed to "it is me again", the code will return the same argument error message I mentioned in the last response.

This is totally normal; as I told you, you have to code your own signature, as shown in the TF documentation I linked in my previous post.

For now, nothing is implemented in the transformers lib to do what you are looking for with a saved model. This means that, to do inference in production with a saved model, you have to code all the logic I explained above yourself. Integrating this is planned for the near future; it is even ongoing work, but far from finished.

Thanks so much for your elaborate response! I did not fully appreciate what a signature means... Thanks!!!

@jplu thanks for the great answer. I was wondering if it is possible to include the tokenizer inside the saved model (or something similar, in order to do the tokenization inside TF Serving)? Or do we have to use the tokenizer before making the request?

It is currently not possible to integrate the tokenizers into a saved model as a preprocessing step; you have to do that yourself before using the saved model.

@jplu Thanks for your great answer. But I have a question about this part:

import json
import numpy
import requests
my_features = # call here the tokenizer
data = json.dumps({"signature_name": "serving_default",
                   "instances": my_features})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/ner:predict',
                              data=data, headers=headers)
predictions = numpy.array(json.loads(json_response.text)["predictions"])

can you give an example of how to do the # call here the tokenizer part?

You have plenty of examples on how to use the tokenizers, such as in the examples folder or inside the source code.
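For reference, a minimal sketch (not part of the original answer) of what that # call here the tokenizer line could look like for the NER example above, assuming the multilingual BERT tokenizer used earlier; note that the JSON payload needs plain Python lists of ints, not TensorFlow tensors:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# No return_tensors here: the request body must be JSON-serializable.
input_ids = tokenizer.encode("1951 bis 1953 wurde der nördliche Teil gebaut .",
                             add_special_tokens=True)
my_features = [{"input_ids": input_ids}]  # "instances" expects a list of per-example dicts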

Hi @jplu, thank you for your answer. I had forgotten to remove return_tensors="tf" in the tokenizer, so it was failing. I have been working from your answer on this issue and this reference to do inference with a TensorFlow Serving saved model on a sentiment analysis task. Please see here for my complete attempt: link to the Colab.

This is totally normal; as I told you, you have to code your own signature, as shown in the TF documentation I linked in my previous post.
I tried to do this by making it like this:

import tensorflow as tf
from transformers import *
tf.config.optimizer.set_jit(True)

class WrappedModel(tf.Module):
    def __init__(self):
        super(WrappedModel, self).__init__()
        self.model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
    @tf.function
    def __call__(self, x):
        return self.model(x)

model = WrappedModel()

call = model.__call__.get_concrete_function(tf.TensorSpec([None, None], tf.int32, name='input_ids'))
tf.saved_model.save(model, saved_model_path, signatures=call)

It works fine when I predict one example, or a couple of examples with the same sequence length:

import json
import numpy as np
import requests
my_features = {"input_ids": tokenizer.encode("it is really great, I don't think I will use this", add_special_tokens=True)}
my_instances = [my_features, my_features]
print(my_instances)
data = json.dumps({"signature_name": "serving_default",
                   "instances": [my_features, my_features]})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8503/v1/models/sentiment_analysis2:predict',
                              data=data, headers=headers)
print(json_response)
predictions = np.array(json.loads(json_response.text)["predictions"])
for prediction in predictions:
  print(np.argmax(prediction))

But when there is more than one sequence length, it does not work. I think this is because the tensor shape of every example must be the same, so I tried padding to max_seq_length. But something weird happens: the prediction results for the same sentence differ between the padded and non-padded versions. The more padding tokens are added, the more the model thinks that the sentence has negative sentiment (the probability for label 0 increases and for label 1 decreases).

Can you please tell me what I did wrong?
Also, I am looking to integrate the preprocessing step, the inference in TensorFlow Serving, and the prediction step so it can all be done automatically instead of manually running separate code. Can you please tell me what options I have regarding this?
Thank you in advance! @jplu

Can you please tell me what I did wrong?

Nothing; the results depend on the model itself, so you should ask the person who uploaded the model.

Can you please tell me what options I have regarding this?

Currently there are no options; you cannot do this.

@jplu Thank you very much for your quick reply.

Nothing; the results depend on the model itself, so you should ask the person who uploaded the model.

So if I understand correctly, there is no mistake in my code; it is because of the model I use, right? I will try other models then, thank you.

Currently there are no options; you cannot do this.

Ok, thank you.

@jplu @kevin-yauris To be able to perform the same task with batch_encode_plus,
how should we modify the call function to achieve that?

With the existing piece of code, the input to the model for a single instance looks like the my_features dict above; with batch encoding, it might look something like:

{'input_ids': array([[ 101, 7592,  102,    0,    0,    0],
                     [ 101, 2054, 1037, 2204, 2154,  102]], dtype=int32),
 'attention_mask': array([[1, 1, 1, 0, 0, 0],
                          [1, 1, 1, 1, 1, 1]], dtype=int32)}

In that case, what should the call function look like?
call = model.__call__.get_concrete_function(tf.TensorSpec([None, None], tf.int32, name='input_ids'))

Thanks in advance
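For illustration, one possible way (a sketch, not a confirmed answer from the thread; paths are placeholders) to extend the wrapper above so its concrete function accepts both input_ids and attention_mask:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

class WrappedModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.model = TFAutoModelForSequenceClassification.from_pretrained(
            "distilbert-base-uncased-finetuned-sst-2-english")

    @tf.function
    def __call__(self, input_ids, attention_mask):
        # Pass the mask as well so that padded positions can be ignored.
        return self.model({"input_ids": input_ids, "attention_mask": attention_mask})

model = WrappedModel()
call = model.__call__.get_concrete_function(
    tf.TensorSpec([None, None], tf.int32, name="input_ids"),
    tf.TensorSpec([None, None], tf.int32, name="attention_mask"),
)
tf.saved_model.save(model, "saved_model_path", signatures=call)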

It's done, thanks.

