transformers pipeline in GCP cloud functions

Created on 1 Apr 2020  ·  7 Comments  ·  Source: huggingface/transformers

❓ Transformers pipeline in GCP

I am trying to use the transformers pipeline in GCP Cloud Functions. Every time the function is called, the model is downloaded again. How can we sort out this issue?

Help wanted

All 7 comments

You could cache the model once in your environment and then load it from there. Just point from_pretrained to the directory containing the model and configuration (or tokenizer file if loading the tokenizer) instead of the S3 link.
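A minimal sketch of that approach (assuming the model and tokenizer were previously saved to a local `./model_cache` directory; the directory name is just an example):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load from a local directory instead of the remote checkpoint, so nothing is downloaded.
model = AutoModelForQuestionAnswering.from_pretrained("./model_cache")
tokenizer = AutoTokenizer.from_pretrained("./model_cache")
```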

@LysandreJik Completely understood, and that's right, but this is a different case. I am trying to make use of the pipeline and load everything via pipeline(). How can this be achieved in GCP?

The pipeline also accepts directories as models and tokenizers. See the pipeline documentation:

  • model (str or PreTrainedModel or TFPreTrainedModel, optional, defaults to None) –

    The model that will be used by the pipeline to make predictions. This can be None, a string checkpoint identifier or an actual pre-trained model inheriting from PreTrainedModel for PyTorch and TFPreTrainedModel for TensorFlow.

    If None, the default of the pipeline will be loaded.
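For example, instead of a string checkpoint identifier you can pass an actual pre-trained model instance loaded from a local directory (a sketch; `local_path` is assumed to contain previously saved model and tokenizer files):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Per the quoted docs, model can be a checkpoint string or a PreTrainedModel instance.
model = AutoModelForQuestionAnswering.from_pretrained("local_path")
tokenizer = AutoTokenizer.from_pretrained("local_path")
nlp_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
```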

Thanks @LysandreJik. The documentation link did help to get better clarity. Will try and get back.

Tried this way:

```python
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

model = AutoModelForQuestionAnswering.from_pretrained("https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-pytorch_model.bin")

nlp_qa = pipeline('question-answering', model=model, tokenizer=tokenizer)
```

Getting `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte`

You can't put a URL like that. It has to be a local file path, like it is shown in the documentation. You can either fetch them and save them to a directory:

```bash
mkdir local_path
cd local_path

wget https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-pytorch_model.bin
wget https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt
```
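Note that from_pretrained expects the standard filenames inside a local directory, so you would typically rename the downloaded files (a sketch; the target names are the defaults transformers looks for):

```bash
mv distilbert-base-cased-distilled-squad-pytorch_model.bin pytorch_model.bin
mv distilbert-base-cased-distilled-squad-config.json config.json
mv bert-large-cased-vocab.txt vocab.txt
```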

or in Python:

```python
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer

# Use the question-answering class so the QA head weights are saved along with the base model.
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
model.save_pretrained("local_path")

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
tokenizer.save_pretrained("local_path")
```

You can then access this model/tokenizer:

```python
nlp = pipeline("question-answering", model="local_path", tokenizer="local_path")
```
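To tie this back to the original question: in a GCP Cloud Function you can bundle the saved local_path directory with the function source and build the pipeline at module level, so the model is loaded from disk once per instance rather than downloaded on every invocation. A minimal sketch of an HTTP-triggered function (the entry point name and JSON field names are assumptions):

```python
import json

from transformers import pipeline

# Built at import time (cold start), so subsequent invocations reuse the same pipeline.
nlp = pipeline("question-answering", model="local_path", tokenizer="local_path")

def answer_question(request):
    # Hypothetical entry point; expects a JSON body like {"question": ..., "context": ...}.
    payload = request.get_json(silent=True) or {}
    result = nlp(question=payload.get("question", ""), context=payload.get("context", ""))
    return json.dumps(result)
```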

Thanks @LysandreJik
