Serving: How to configure GPU memory allocation by Tensorflow model server

Created on 16 Nov 2016 · 11 Comments · Source: tensorflow/serving

When a Tensorflow model server starts, it allocates all available GPU memory, the same way Tensorflow does by default. When running Tensorflow directly, it is possible to configure how much memory is allocated.

Is it possible to do the same when starting a Tensorflow model server like this?

Is there anything like this?
https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html#allowing-gpu-memory-growth

import tensorflow as tf

# Let this process claim at most 40% of each visible GPU's memory.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)

bjelkenhed

👍2

Most helpful comment

Does anybody know how to use the per_process_gpu_memory_fraction flag when using nvidia docker to start tensorflow serving?

Solved: just put the flag after the tensorflow serving image name, so that it is passed on to the model server:
-t tensorflow/serving:latest-gpu --per_process_gpu_memory_fraction=0.5
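
As a rough illustration of where the flag goes, here is a minimal Python sketch that assembles such a `docker run` command. Only the image name and the `--per_process_gpu_memory_fraction` flag come from the comment above. The function name, model name, host path, port mapping, `MODEL_NAME` variable, and `--runtime=nvidia` option are assumptions for illustration and may need adjusting for your setup.

```python
import shlex


def docker_serving_command(fraction, model_name="my_model",
                           host_model_dir="/path/to/my_model"):
    """Build a `docker run` argv for the GPU serving image.

    Everything placed after the image name is treated as an argument
    for the container rather than for docker, which is why the memory
    flag must come last. model_name and host_model_dir are hypothetical
    placeholders.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return [
        "docker", "run", "--runtime=nvidia",              # assumed nvidia-docker 2 setup
        "-p", "8500:8500",                                 # assumed port mapping
        "-v", f"{host_model_dir}:/models/{model_name}",    # assumed mount point
        "-e", f"MODEL_NAME={model_name}",                  # assumed image convention
        "-t", "tensorflow/serving:latest-gpu",
        f"--per_process_gpu_memory_fraction={fraction}",   # flag from the comment above
    ]


# Print the command as a single shell-quoted line for inspection.
print(shlex.join(docker_serving_command(0.5)))
```

Building the command as a list keeps the ordering explicit: docker options first, then the image, then the server flag.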

stevewyl on 25 Feb 2019

👍4

All 11 comments

model_servers/main.cc has some code that creates a SessionBundleSourceAdapterConfig. If you dig inside that config as follows: SessionBundleSourceAdapterConfig->SessionBundleConfig->ConfigProto you get to the tensorflow Session config proto, which has the option you are interested in.

Currently the model-server binary doesn't expose this option via flags, so you'll have to tweak the binary and recompile. It would be great if you could contribute a PR that would add a flag that points to a SessionBundleConfig textual proto file to be loaded by the model-server. It would replace the current enable_batching flag which sets a tiny part of the SessionBundleConfig (and doesn't let you set the batching tuning parameters, which are important to get right).

chrisolston on 28 Nov 2016

I made a fork of serving that adds a per_process_gpu_memory_fraction flag. You can now limit memory usage with that flag: https://github.com/movchan74/serving

movchan74 on 13 Apr 2017

@movchan74 Could you please make a pull request against the tensorflow serving main repo so that your work can be merged into the main trunk? That would be helpful to users like me.

Michael-Jing on 29 Jul 2017

Has there been any progress on this issue? I'm dealing with the same problem and don't know how to solve it.

RKTP on 29 Nov 2017

👍3

For anyone still looking for the solution, see #694

malnakli on 28 Mar 2018

Use the flag like this:
tensorflow_model_server --per_process_gpu_memory_fraction=0.400000 ...

Flags:
--port=8500 int32 port to listen on
--enable_batching=false bool enable batching
--batching_parameters_file="" string If non-empty, read an ascii BatchingParameters protobuf from the supplied file name and use the contained values instead of the defaults.
--model_config_file="" string If non-empty, read an ascii ModelServerConfig protobuf from the supplied file name, and serve the models in that file. This config file can be used to specify multiple models to serve and other advanced parameters including non-default version policy. (If used, --model_name, --model_base_path are ignored.)
--model_name="default" string name of model (ignored if --model_config_file flag is set)
--model_base_path="" string path to export (ignored if --model_config_file flag is set, otherwise required)
--file_system_poll_wait_seconds=1 int32 interval in seconds between each poll of the file system for new model version
--flush_filesystem_caches=true bool If true (the default), filesystem caches will be flushed after the initial load of all servables, and after each subsequent individual servable reload (if the number of load threads is 1). This reduces memory consumption of the model server, at the potential cost of cache misses if model files are accessed after servables are loaded.
--tensorflow_session_parallelism=0 int64 Number of threads to use for running a Tensorflow session. Auto-configured by default. Note that this option is ignored if --platform_config_file is non-empty.
--platform_config_file="" string If non-empty, read an ascii PlatformConfigMap protobuf from the supplied file name, and use that platform config instead of the Tensorflow platform. (If used, --enable_batching is ignored.)
--per_process_gpu_memory_fraction=0.000000 float Fraction that each process occupies of the GPU memory space. The value is between 0.0 and 1.0 (with 0.0 as the default). If 1.0, the server will allocate all the memory when the server starts. If 0.0, Tensorflow will automatically select a value.
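
For scripted launches, a small wrapper can check the documented range before starting the server. The following is only a sketch in Python: the flag names are taken from the help text above, while the function name and the example model name and path are hypothetical.

```python
def model_server_args(fraction, model_name, model_base_path, port=8500):
    """Return an argv list for tensorflow_model_server.

    Uses only flags shown in the help text above. A fraction of 0.0
    leaves the choice to Tensorflow, as that help text describes.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("per_process_gpu_memory_fraction must be in [0.0, 1.0]")
    return [
        "tensorflow_model_server",
        f"--port={port}",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}",
        f"--per_process_gpu_memory_fraction={fraction:.6f}",
    ]


# Hypothetical model name and path, for illustration only.
args = model_server_args(0.4, "my_model", "/models/my_model")
print(" ".join(args))
# To actually start the server: subprocess.run(args, check=True)
```

Rejecting out-of-range values up front turns a typo such as 40 instead of 0.4 into an immediate error in the launch script.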