Models: How do I use multiple GPUs on one server to train slim models?

Created on 14 Mar 2018 · 13 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: research/slim
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No

I have 4 GPUs on one server, and I want to use all of them to train models/research/slim. My understanding is that
I need to create 1 replica with 4 clones, so I use the following command.

python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_v3 --num_clones=4

But after I run this command, TensorFlow allocates memory on all 4 GPUs, yet only 1 GPU shows significant usage (90%); the other 3 are near 0.

I can't find any documentation on how to use this functionality. Can someone help me?

Labels: research, support

All 13 comments

When I train an object detection model, I have the same problem.
Thanks.

@xhzcyc do you have a solution by now?

I am also a bit confused about the required flags for the case of multi-GPU training. As described here, I interpret a common strategy to be: place a model replica on each GPU and use the CPU as the parameter server. However, I am unsure what combination of flags produces this behavior.

Looking at train_image_classifier.py, the following flags are available:

  • num_clones: Number of model clones to deploy.
  • clone_on_cpu: Use CPUs to deploy clones.
  • worker_replicas: Number of worker replicas.
  • num_ps_tasks: The number of parameter servers. If the value is 0, then the parameters are handled locally by the worker.
    ...
  • task: Task id of the replica running the training.

Most of these flags are used to specify DeploymentConfig parameters in model_deploy.py, which include:

  • num_clones: Number of model clones to deploy in each replica.
  • clone_on_cpu: True if clones should be placed on CPU.
  • replica_id: Index of the replica for which the model is deployed. Usually 0 for the chief replica.
  • num_replicas: Number of replicas to use.
  • num_ps_tasks: Number of tasks for the ps job. 0 to not use replicas.

I am a bit confused on a few points.

  1. What is the difference between a clone and a replica? How do they operate together?
  2. However clones and replicas are to be interpreted, does the value include a "chief" instance or simply count additional duplicated models?
  3. The task/replica_id flags seem to indicate that the model is deployed for only a single replica, which seems at odds with the concept of duplicated models.

It seems most posts about running multi-GPU training say to just set the num_clones flag, like so:

python train_image_classifier.py --num_clones=2

but looking at the propagation of the default flags, I fail to see how this properly sets up multi-GPU training.

Some discussion mentions:

python train_image_classifier.py --num_clones=2 --num_ps_tasks=1

But I still don't understand how work is divided between the hardware at that point. Could a contributor or knowledgeable user provide some direction for using these flags to train on multiple GPUs?
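
For reference, here is my current (possibly wrong) reading of how these flags flow into model_deploy.py. The snippet below is a minimal sketch rather than the actual source, and the device strings in the comments are my assumption for the single-machine case (num_ps_tasks=0, clone_on_cpu=False):

    # Sketch: how train_image_classifier.py appears to wire the flags into
    # model_deploy.DeploymentConfig for single-machine, multi-GPU training.
    from deployment import model_deploy

    deploy_config = model_deploy.DeploymentConfig(
        num_clones=4,        # --num_clones: one clone per local GPU
        clone_on_cpu=False,  # --clone_on_cpu
        replica_id=0,        # --task: 0 for the chief/only worker
        num_replicas=1,      # --worker_replicas: 1 for single-machine training
        num_ps_tasks=0)      # --num_ps_tasks: 0 keeps the parameters local

    for i in range(4):
        # With num_ps_tasks=0 I would expect '/device:GPU:0' ... '/device:GPU:3'
        print(deploy_config.clone_device(i))
    # The shared variables should then land on the local CPU, e.g. '/device:CPU:0'
    print(deploy_config.variables_device())

If that reading is right, each clone computes gradients for its share of the batch on its own GPU, and the gradients are averaged and applied to the shared variables kept on the CPU.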

Agree with dcyoung, I'm also confused by the parameters.

Same problem here: on 4 GPUs, only one is at 90% utilization and the rest are at 0%. I also find the usage of slim completely and utterly confusing and unnecessary, and the same goes for the use of flags. What do flags give me that I cannot get with pure Python argparse? All of this results in a big pile of code in which we spend days trying to debug and understand what the models are doing and how they work in the first place.

I can use 4 local GPUs by setting num_clones now, but I still can't train with multiple servers. I just watched TensorFlow Summit 2018, and it appears that the recommended way of doing distributed training for now is Estimator. Is this ps/worker schema no longer supported?
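
In case anyone wants to try the Estimator route, the rough shape (as I understand the TF 1.8-era API, where MirroredStrategy lives in tf.contrib.distribute) is sketched below; my_model_fn and my_input_fn are hypothetical placeholders you would have to supply, not slim code:

    import tensorflow as tf

    # Sketch: single-machine multi-GPU training via Estimator + MirroredStrategy.
    # my_model_fn / my_input_fn are placeholders for your own model and input pipeline.
    strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=4)
    run_config = tf.estimator.RunConfig(train_distribute=strategy)

    estimator = tf.estimator.Estimator(
        model_fn=my_model_fn,        # standard Estimator model_fn returning an EstimatorSpec
        model_dir='/tmp/flowers_model',
        config=run_config)

    estimator.train(input_fn=my_input_fn, steps=100000)

This is not wired into train_image_classifier.py, though, so it doesn't directly answer the slim question.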

@scotthuang1989 I am using num_clones for attention_ocr, but GPU consumption still shows on only 1 GPU (I have 3 GPUs, i.e. num_clones=3).
Can you elaborate on your command or any other configuration you tweaked?

ayush, first try --num_clones=2 --num_ps_tasks=1, then try --num_clones=3 --num_ps_tasks=2, as well as --num_clones=3 --num_ps_tasks=1.

@rohitsaluja22 I tried the above combinations; however, none of them seem to work. Only 1 GPU shows full volatile GPU utilization.

I have two cards on my server, an Nvidia 1080 Ti and a 1060, and run the following configuration.

CUDA_VISIBLE_DEVICES=0,1 python train_image_classifier.py --num_clones=2 --num_ps_tasks=1

The program exited with the following error.

saving checkpoints to ../model_res_lap_finetune/
/home/n200/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
WARNING:tensorflow:From /home/n200/library/tensorflow_model/models/slim/train_image_classifier.py:397: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Fine-tuning from /home/n200/980223/imdb_wiki_ignore_image/model_res/model.ckpt-1000000
2019-02-20 16:37:55.499050: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-20 16:37:55.599066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-20 16:37:55.599425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.70GiB
2019-02-20 16:37:55.672687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-20 16:37:55.672979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:02:00.0
totalMemory: 5.94GiB freeMemory: 5.86GiB
2019-02-20 16:37:55.672999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-02-20 16:37:55.673020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-02-20 16:37:55.673025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N
2019-02-20 16:37:55.673036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y
2019-02-20 16:37:55.673044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-02-20 16:37:55.673058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1060 6GB, pci bus id: 0000:02:00.0, compute capability: 6.1)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
         [[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]

Caused by op u'save_1/RestoreV2_622', defined at:
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 573, in <module>
    tf.app.run()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 569, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 654, in train
    saver = saver or tf_saver.Saver()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1218, in __init__
    self.build()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1227, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
    restore_sequentially, reshape)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
         [[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]

Traceback (most recent call last):
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 573, in <module>
    tf.app.run()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 569, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 742, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
         [[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]

Caused by op u'save_1/RestoreV2_622', defined at:
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 573, in <module>
    tf.app.run()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 569, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 654, in train
    saver = saver or tf_saver.Saver()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1218, in __init__
    self.build()
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1227, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
    restore_sequentially, reshape)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
         [[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]

Does anyone know how to use multiple GPUs on one server to train slim models? Any other recommended way of using multiple GPUs to train the model is also appreciated. Thank you!
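
One possible reading of the error above: --num_ps_tasks=1 tells the script to place variables on /job:ps/task:0, but that job only exists in a real distributed cluster started with tf.train.Server; in a single local process the only devices are under /job:localhost, which is exactly what the message lists. Per the flag description quoted earlier ("If the value is 0, then the parameters are handled locally by the worker"), for single-machine multi-GPU training it might be enough to leave --num_ps_tasks at its default of 0, e.g.

CUDA_VISIBLE_DEVICES=0,1 python train_image_classifier.py --num_clones=2

with the rest of your flags unchanged. I have not verified this on your exact setup, so treat it as a sketch.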

Same problem; hoping for attention and a resolution.

Same problem; hoping for attention and a resolution.

Closing for now.
