I'd like to profile the prediction speed of BERT+classifier, ignoring model setup, tokenization, etc., to evaluate whether it will be practical for our use case.
Because it's using the Estimator API, I can't seem to find a way to do that. Has anyone done such a speed evaluation, or does anyone know how to do so with the Estimator API (or alternatively make ?
Have you looked at tf.contrib.predictor.from_estimator? Something like this might work:
import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

def serving_input_fn():
    # One placeholder per feature the model expects; None is the batch dimension.
    inputs = {
        'label_ids': tf.placeholder(dtype=tf.int64, shape=[None], name='label_ids'),
        'input_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_ids'),
        'input_mask': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_mask'),
        'segment_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='segment_ids'),
    }
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# Build the predictor once and reuse it for every prediction call.
predictor = tf.contrib.predictor.from_estimator(estimator, serving_input_fn)
Then you should be able to run prediction on featurized data without having to load the whole model:
p = predictor({'input_ids': input_ids,
               'input_mask': input_masks,
               'segment_ids': segment_ids,
               'label_ids': label_ids})
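To time only the prediction call, a small harness along the following lines might help. It is a minimal sketch and not something posted in the thread: `time_predictions` and `fake_predictor` are hypothetical names, and `fake_predictor` is only a stand-in so the example runs without TensorFlow. In practice the `predictor` object built above would be passed in, assuming it is callable with a feature dictionary as shown.

```python
import statistics
import time


def time_predictions(predict_fn, features, warmup=3, runs=20):
    """Time repeated calls to predict_fn(features), excluding warm-up calls."""
    # Warm-up calls absorb one-off costs such as graph setup or memory allocation.
    for _ in range(warmup):
        predict_fn(features)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(features)
        latencies.append(time.perf_counter() - start)

    batch_size = len(features["input_ids"])
    mean_s = statistics.mean(latencies)
    return {
        "mean_s": mean_s,
        "median_s": statistics.median(latencies),
        # Guard against a zero mean on very coarse clocks.
        "samples_per_s": batch_size / mean_s if mean_s > 0 else float("inf"),
    }


# Hypothetical stand-in for the real predictor so the sketch runs on its own.
def fake_predictor(features):
    return {"probabilities": [[0.5, 0.5] for _ in features["input_ids"]]}


batch = {"input_ids": [[101, 102]] * 8}  # 8 dummy examples
stats = time_predictions(fake_predictor, batch)
print(stats)
```

Reporting the median alongside the mean makes a single slow call (for example, a garbage-collection pause) easier to spot.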
@jaderabbit here is a benchmark https://github.com/hanxiao/bert-as-service#zap-benchmark
bert-as-service is optimized for efficiency, low memory footprint and scalability. You may use it in production.
@hanxiao Exactly what I was looking for :) Brilliant, thank you
@hanxiao Can you share the benchmark metrics also in terms of tokens / characters? Your benchmarks use "sentences", but I'm unsure what that means exactly. Also some CPU numbers would be great. Thanks!
@piskvorky see https://github.com/hanxiao/bert-as-service/issues/204#issuecomment-456724497
https://github.com/hanxiao/bert-as-service/issues/225#issuecomment-459927005
@hanxiao Thanks for the quick reply. Do you have anything on the CPU?
Yes, but I believe the most convincing numbers come from running bert-serving-benchmark yourself.
@piskvorky Did you run the tests on a CPU? Do you have any results?
@AngularDe not yet, no time for that. Let me know if you get any numbers / graphs.
Benchmark Ryzen 5 2600X
bert-serving-start -model_dir /data/training/model/uncased_L-12_H-768_A-12/ -num_worker=1 fp32
encoding 512 sentences 12.20s 44 samples/s 375 tokens/s
encoding 1024 sentences 24.55s 44 samples/s 375 tokens/s
encoding 2048 sentences 49.60s 44 samples/s 375 tokens/s
encoding 4096 sentences 99.81s 44 samples/s 375 tokens/s
encoding 8192 sentences 200.20s 44 samples/s 375 tokens/s
encoding 16384 sentences 401.24s 44 samples/s 375 tokens/s
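As a cross-check, throughput can be recomputed from the sentence counts and wall-clock times in the log above. This is only a sketch of the arithmetic: the quotients come out at roughly 41 to 42 sentences/s, a little under the reported 44 samples/s, and the log alone does not say how the tool arrives at its own figure.

```python
# (sentences, seconds) pairs copied from the benchmark log above.
runs = [
    (512, 12.20),
    (1024, 24.55),
    (2048, 49.60),
    (4096, 99.81),
    (8192, 200.20),
    (16384, 401.24),
]

# Throughput as total sentences divided by total elapsed time.
throughput = [n / s for n, s in runs]

for (n, s), rate in zip(runs, throughput):
    print(f"{n:>6} sentences  {s:>7.2f}s  {rate:.1f} sentences/s")
```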
you whoo
@jaderabbit @hanxiao I wonder what the inference time of a single sentence-pair classification is when using a GPU. For some reason I can't seem to get it below 1 second on a V100 with the PyTorch implementation.