Serving: Tensorflow model server crashing with error: Check failed: c->in_use(); tf_serving_entrypoint.sh: line 3: 8 Aborted

Created on 8 Dec 2018 · 11 Comments · Source: tensorflow/serving

I am running an object detection model using the tensorflow/serving:latest-gpu Docker image and nvidia-docker on an Amazon Deep Learning AMI (EC2 P3 instance). The model server starts up fine. I then run a gRPC client that loops through several images and sends them to the server to fetch predictions. The predictions come back quickly and match what I expect, and the server runs at ~95% GPU utilization (memory use stays below the limit).

However, the model server often crashes after serving predictions continuously for a while. The error it prints right before crashing is:

F external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) 
/usr/bin/tf_serving_entrypoint.sh: line 3: 8 Aborted tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"

I have tried sending larger payloads from the client to the server and observed resource exhaustion errors, which makes sense since the GPU runs out of memory. But I cannot work out what exactly is causing the issue above.
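One way to keep each request below the size that triggers resource exhaustion is to send the images in fixed-size batches. The following is only a minimal sketch of the batching step: the `batched` helper, the file names, and the batch size of 4 are hypothetical, and the actual gRPC call is left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], max_batch: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most max_batch items."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(items), max_batch):
        yield list(items[start:start + max_batch])


# Hypothetical file names; a real client would turn each batch into one request.
image_paths = [f"img_{i}.jpg" for i in range(10)]
batches = list(batched(image_paths, max_batch=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

A safe batch size depends on the model and the GPU, so it would have to be found by experiment.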

Can someone please help?

Thanks in advance.

awaiting response

dipsatch

Most helpful comment

I'm having the same issue. The server reaches ~95% utilization and crashes after a few iterations of training. I'm using tf version 1.12.0.
The error I'm getting is: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) Aborted

YananJian on 16 Jan 2019

👍15

All 11 comments

That appears to come from the memory allocator when it tries to free memory that it thinks has already been freed (or that is still in use for some reason). This could be a bug in the code (say, a memory leak) or simply a side effect of running out of memory for other reasons.

There's not enough information here to debug anything further, though. This is deep in TensorFlow core logic, so if you can reproduce the issue, you might want to file an issue on the TensorFlow project.

gautamvasudevan on 11 Dec 2018
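To illustrate the kind of invariant described in the comment above, here is a toy sketch of an allocator that refuses to free a chunk not marked as in use. It is not TensorFlow's BFC allocator: the `ToyAllocator` and `Chunk` classes are hypothetical, the real check also involves `bin_num`, and the real failure aborts the process rather than raising an exception.

```python
class Chunk:
    """A block of memory tracked by the toy allocator."""

    def __init__(self, size):
        self.size = size
        self.in_use = False


class ToyAllocator:
    """Hands out integer handles and tracks whether each chunk is in use."""

    def __init__(self):
        self.chunks = {}
        self.next_handle = 0

    def allocate(self, size):
        chunk = Chunk(size)
        chunk.in_use = True
        handle = self.next_handle
        self.chunks[handle] = chunk
        self.next_handle += 1
        return handle

    def free(self, handle):
        chunk = self.chunks[handle]
        # Loosely mirrors the spirit of "Check failed: c->in_use()":
        # freeing a chunk that is not marked in use is treated as fatal.
        if not chunk.in_use:
            raise RuntimeError("Check failed: chunk.in_use")
        chunk.in_use = False


allocator = ToyAllocator()
handle = allocator.allocate(1024)
allocator.free(handle)          # first free succeeds
try:
    allocator.free(handle)      # second free trips the check
    double_free_detected = False
except RuntimeError:
    double_free_detected = True
print(double_free_detected)
```

In this toy model the check fires on a double free. In the real server, corrupted allocator state from another cause could presumably trip the same check.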

@dipsatch Your problem is definitely related to memory (whether in your own code or in TensorFlow itself). I had the same issue and saw that memory usage was at capacity right before the TF server crashed.

echan00 on 15 Dec 2018
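To check whether memory is at capacity before a crash, GPU memory can be polled while the client runs. The sketch below assumes `nvidia-smi` is on the PATH and uses its CSV query output. The helper names and the sample line are made up for illustration.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Return a list of (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_memory(result.stdout)


# Made-up sample output for a single GPU, used here instead of a live query.
sample = "15109, 16160\n"
for used, total in parse_memory(sample):
    print(f"{used} / {total} MiB ({100 * used / total:.1f}% used)")
```

Calling `read_gpu_memory()` in a loop alongside the client would show whether usage climbs toward the limit before the abort.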

Closing this issue as no response was received from the user. Feel free to post updates, if any, and we will reopen the issue.

Harshini-Gadige on 15 Jan 2019

This exception might be related to issue #22581: https://github.com/tensorflow/tensorflow/issues/22581

schen119 on 5 Feb 2019

I had the same issue and was able to solve it by pulling the most recent tf-nightly-gpu image (with v1.13.0). See the comments here.

wronk on 19 Feb 2019

I got the same issue with tf version 1.12.0. Does anyone know about this?

DenceChen on 19 Mar 2019

@YananJian Have you found a solution? I ran into the same problem and need your help. Thanks.

zhouyuangan on 22 Mar 2019

@zhouyuangan tf==1.9.0 will be ok!

DenceChen on 22 Mar 2019

@YananJian Have you found a solution? I ran into the same problem and need your help. Thanks.

Yes, tf==1.10.0 works.

YananJian on 26 Mar 2019

👍3

I got the same issue while using tensorflow/serving:latest-gpu. After testing with three streams, I found that the problem is solved in tensorflow/serving:1.13-gpu.
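Since the behaviour differs between image versions, pinning the tag instead of using latest-gpu makes the setup reproducible. The sketch below only builds the `docker run` command. The `serving_command` helper, the model name, and the host path are hypothetical, and `--runtime=nvidia` assumes an nvidia-docker2 style setup.

```python
import shlex


def serving_command(model_name, model_dir, tag="1.13-gpu"):
    """Build a docker run command for a pinned tensorflow/serving image."""
    return [
        "docker", "run", "--runtime=nvidia",
        "-p", "8500:8500",   # gRPC port used by the entrypoint script
        "-p", "8501:8501",   # REST port used by the entrypoint script
        "--mount", f"type=bind,source={model_dir},target=/models/{model_name}",
        "-e", f"MODEL_NAME={model_name}",
        "-t", f"tensorflow/serving:{tag}",
    ]


# Hypothetical model name and host path, for illustration only.
cmd = serving_command("detector", "/home/ubuntu/models/detector")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host that has Docker and the NVIDIA runtime installed.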