Bert: Profiling BERT speed of predictions

Created on 18 Dec 2018 · 12 Comments · Source: google-research/bert

I'd like to profile the prediction speed of BERT+classifier - ignoring the model setup, tokenization etc, to evaluate whether it will be practical for our use case.

Because it's using the Estimator API, I can't seem to find a way to do that. Has anyone done such a speed evaluation, or does anyone know how to do so with the Estimator API (or alternatively make ?

jaderabbit

Most helpful comment

Have you looked at tf.contrib.predictor.from_estimator? Something like this might work:

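import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

# MAX_SEQ_LENGTH and estimator are assumed to be defined already.
def serving_input_fn():
    # Placeholders for already featurized inputs; the leading None is the batch dimension.
    inputs = {
        'label_ids': tf.placeholder(dtype=tf.int64, shape=[None], name='label_ids'),
        'input_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_ids'),
        'input_mask': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_mask'),
        'segment_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='segment_ids'),
    }
    # Features are fed directly, so receiver tensors and features are the same dict.
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# Returns a callable that maps a dict of input arrays to a dict of outputs.
predictor = tf.contrib.predictor.from_estimator(estimator, serving_input_fn)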

Then you should be able to run prediction on featurized data without having to reload the model on each call:

p = predictor({'input_ids': input_ids,
               'input_mask': input_masks,
               'segment_ids': segment_ids,
               'label_ids': label_ids})
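
To get at the original question (prediction speed without model setup and tokenization), the predictor call itself can be timed. A minimal sketch, assuming `predict_fn` is any callable such as the `predictor` above and `features` is an already featurized batch. The names `measure_throughput`, `n_warmup`, and `n_runs` are hypothetical and introduced only for this illustration; they are not part of BERT or TensorFlow.

```python
import time


def measure_throughput(predict_fn, features, batch_size, n_warmup=2, n_runs=10):
    """Return the average samples/s of predict_fn over n_runs timed calls.

    The warm-up calls are excluded so that one-time costs on the first
    calls do not distort the steady-state measurement.
    """
    for _ in range(n_warmup):
        predict_fn(features)

    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn(features)
    elapsed = time.perf_counter() - start

    return (n_runs * batch_size) / elapsed


# Stand-in for the real predictor (hypothetical): sleeps 10 ms per call.
def fake_predictor(features):
    time.sleep(0.01)
    return {"probabilities": [[0.5, 0.5]] * len(features["input_ids"])}


batch = {"input_ids": [[0] * 8] * 4}
rate = measure_throughput(fake_predictor, batch, batch_size=4)
print("%.1f samples/s" % rate)
```

With the real predictor, `features` would be the same dict of arrays passed in the call above, and `batch_size` the number of rows in `input_ids`.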

evornov on 18 Dec 2018

👍5

All 12 comments

@jaderabbit here is a benchmark https://github.com/hanxiao/bert-as-service#zap-benchmark

bert-as-service is optimized for efficiency, low memory footprint and scalability. You may use it in production.

hanxiao on 19 Dec 2018

🎉2

@hanxiao Exactly what I was looking for :) Brilliant, thank you

jaderabbit on 19 Dec 2018

@hanxiao Can you share the benchmark metrics also in terms of tokens / characters? Your benchmarks use "sentences", but I'm unsure what that means exactly. Also some CPU numbers would be great. Thanks!

piskvorky on 22 Feb 2019

@piskvorky see https://github.com/hanxiao/bert-as-service/issues/204#issuecomment-456724497
https://github.com/hanxiao/bert-as-service/issues/225#issuecomment-459927005

hanxiao on 22 Feb 2019

👍1

@hanxiao Thanks for the quick reply. Do you have anything on the CPU?

piskvorky on 22 Feb 2019

Yes, but I believe the most convincing numbers come from running bert-serving-benchmark yourself.

hanxiao on 22 Feb 2019

@piskvorky Did you run the tests on a CPU? Do you have any results?

AngularDe on 27 Mar 2019

@AngularDe not yet, no time for that. Let me know if you get any numbers / graphs.

piskvorky on 27 Mar 2019

> @AngularDe not yet, no time for that. Let me know if you get any numbers / graphs.

Benchmark Ryzen 5 2600X
bert-serving-start -model_dir /data/training/model/uncased_L-12_H-768_A-12/ -num_worker=1 fp32

encoding 512 sentences 12.20s 44 samples/s 375 tokens/s
encoding 1024 sentences 24.55s 44 samples/s 375 tokens/s
encoding 2048 sentences 49.60s 44 samples/s 375 tokens/s
encoding 4096 sentences 99.81s 44 samples/s 375 tokens/s
encoding 8192 sentences 200.20s 44 samples/s 375 tokens/s
encoding 16384 sentences 401.24s 44 samples/s 375 tokens/s
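
As a rough cross-check of the figures above, throughput can be recomputed from the sentence counts and wall-clock times, and the ratio of the two reported rates gives an average number of tokens per sentence. A minimal sketch using only the numbers posted above; how the benchmark tool computes its own rates is not stated in this thread, so the recomputed values need not match the reported 44 samples/s exactly.

```python
# (sentences, seconds) pairs copied from the benchmark output above.
runs = [
    (512, 12.20),
    (1024, 24.55),
    (2048, 49.60),
    (4096, 99.81),
    (8192, 200.20),
    (16384, 401.24),
]


def samples_per_second(sentences, seconds):
    """Throughput as sentences divided by elapsed wall-clock time."""
    return sentences / seconds


for sentences, seconds in runs:
    rate = samples_per_second(sentences, seconds)
    print("%6d sentences: %.1f samples/s" % (sentences, rate))

# Rates as reported in the same output.
reported_samples_per_s = 44
reported_tokens_per_s = 375

# Average tokens per sentence implied by the two reported rates.
tokens_per_sentence = reported_tokens_per_s / reported_samples_per_s
print("average tokens per sentence: %.1f" % tokens_per_sentence)
```

If both reported rates cover the same interval, the last ratio implies fairly short inputs in this run, which matters when comparing against workloads with longer sequences.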

AndreasFdev on 17 May 2019

👍1

you whoo

IntelOSt on 17 May 2019

@jaderabbit @hanxiao I wonder what the inference time is for a single sentence-pair classification on a GPU. For some reason I can't seem to get it below 1 second on a V100 with the PyTorch implementation.
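
When timing a single inference on a GPU, two things can inflate or distort the number: the first call often includes one-time setup, and GPU work may be queued asynchronously, so the timer can stop before the work has finished unless the device is synchronized. A minimal sketch that separates the first call from steady-state latency. `time_single_calls` and its `sync` parameter are hypothetical names used only here; with PyTorch one could pass `torch.cuda.synchronize` as `sync`. This is an illustration of the measurement, not a diagnosis of the case above.

```python
import statistics
import time


def time_single_calls(infer_fn, example, n_runs=20, sync=None):
    """Time infer_fn(example) n_runs times.

    Returns (first_call_seconds, median_seconds_of_remaining_calls).
    If sync is given, it is called before starting and before stopping
    each timer so that queued device work is included in the measurement.
    """
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2")

    latencies = []
    for _ in range(n_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        infer_fn(example)
        if sync is not None:
            sync()
        latencies.append(time.perf_counter() - start)

    return latencies[0], statistics.median(latencies[1:])


# Stand-in model (hypothetical): slow first call, fast afterwards.
_state = {"warm": False}


def fake_infer(example):
    time.sleep(0.001 if _state["warm"] else 0.05)
    _state["warm"] = True


first, steady = time_single_calls(fake_infer, "a sentence pair", n_runs=5)
print("first call: %.3fs, steady state: %.3fs" % (first, steady))
```

Reporting the first call and the steady-state median separately makes it easier to see whether a figure above 1 second comes from setup cost or from the forward pass itself.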