I have 4 GPUs on one server, and I want to use all of them to train models/research/slim. My understanding is that I need to create 1 replica with 4 clones, so I use the following command:
python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=flowers --dataset_split_name=train --dataset_dir=${DATASET_DIR} --model_name=inception_v3 --num_clones=4
But after I run this command, TensorFlow allocates memory on all 4 GPUs, yet only 1 GPU shows significant usage (90%); the other 3 are near 0.
I can't find any documentation on how to use this functionality. Can someone help me?
I have the same problem when training an object detection model.
Thanks
@xhzcyc do you have a solution by now?
I am also a bit confused about the required flags for multi-GPU training. As described here, I interpret the common strategy to be: place a model replica on each GPU and use the CPU as the parameter server. However, I am unsure which combination of flags produces this behavior.
Looking at train_image_classifier.py, the following flags are available:
- num_clones: Number of model clones to deploy.
- clone_on_cpu: Use CPUs to deploy clones.
- worker_replicas: Number of worker replicas.
- num_ps_tasks: The number of parameter servers. If the value is 0, then the parameters are handled locally by the worker.
- task: Task id of the replica running the training.

Most of these flags are used to specify DeploymentConfig parameters in model_deploy.py, which include:
- num_clones: Number of model clones to deploy in each replica.
- clone_on_cpu: True if clones should be placed on CPU.
- replica_id: Index of the replica for which the model is deployed. Usually 0 for the chief replica.
- num_replicas: Number of replicas to use.
- num_ps_tasks: Number of tasks for the ps job. 0 to not use replicas.

I am a bit confused on a few points.
task/replica_id seems to indicate that a model is only deployed on a single replica? This seems to be at odds with the concept of duplicated models.

It seems most posts about running multi-GPU training say to just set the num_clones flag like so:
python train_image_classifier.py --num_clones=2
but looking at the propagation of default flags, I fail to see how this properly sets up multi-GPU training.
Some discussion mentions:
python train_image_classifier.py --num_clones=2 --num_ps_tasks=1
But I still don't understand how work is delineated between hardware at that point. Could a contributor or knowledgeable user provide some direction for using these flags to train on multiple GPUs?
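For concreteness, the intended placement can be sketched in plain Python. The helpers below are hypothetical, written only to mirror the DeploymentConfig semantics quoted above (clones on GPUs, variables on a ps task or the local CPU); they are not the actual model_deploy code:

```python
def clone_device(clone_index, num_clones, clone_on_cpu=False, num_ps_tasks=0):
    """Device string for one model clone, following the described semantics:
    clones go to GPU:0..GPU:(num_clones-1), inside a 'worker' job when
    parameter servers are in use."""
    if not 0 <= clone_index < num_clones:
        raise ValueError("clone_index out of range")
    job = "/job:worker" if num_ps_tasks > 0 else ""
    device = "CPU:0" if clone_on_cpu else "GPU:%d" % clone_index
    return "%s/device:%s" % (job, device)

def variables_device(num_ps_tasks, ps_task=0):
    """Device string for variables: a ps task if any are configured,
    otherwise the local CPU."""
    if num_ps_tasks > 0:
        return "/job:ps/task:%d/device:CPU:0" % ps_task
    return "/device:CPU:0"

# --num_clones=2 --num_ps_tasks=1: compute on two GPUs, variables on a ps task.
print([clone_device(i, 2, num_ps_tasks=1) for i in range(2)])
print(variables_device(1))

# --num_clones=4 with the default --num_ps_tasks=0: all four GPUs, local variables.
print([clone_device(i, 4) for i in range(4)])
print(variables_device(0))
```

Under this reading, num_clones alone should be enough for single-machine data parallelism, while num_ps_tasks only changes where variables live.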
Agree with @dcyoung; I'm also confused by the parameters.
Same problem here: on 4 GPUs, only one is at 90% utilization and the rest are at 0%. And I find the usage of slim completely and utterly confusing and unnecessary. The same goes for the use of flags: what can these flags do that I cannot do with plain Python argparse? All of this results in a big pile of code in which we spend days trying to debug and understand what the models are doing and how they work in the first place.
I can use 4 local GPUs by setting num_clones now, but I still can't train across multiple servers. I just watched the TensorFlow Dev Summit 2018, and it appears that the recommended way of doing distributed training for now is Estimator. Is this ps/worker scheme no longer supported?
@scotthuang1989 I am using num_clones for attention_ocr, but GPU consumption still only shows on 1 GPU (I have 3 GPUs, i.e. num_clones=3).
Can you elaborate on your command or any other configuration you tweaked?
ayush, first try --num_clones=2 --num_ps_tasks=1, then try --num_clones=3 --num_ps_tasks=2 as well as --num_clones=3 --num_ps_tasks=1.
@rohitsaluja22 I tried the combinations above, but none of them seem to work. Only 1 GPU shows full volatile GPU utilization.
I have two cards on my server, an Nvidia 1080 Ti and a 1060, and ran the following configuration:
CUDA_VISIBLE_DEVICES=0,1 python train_image_classifier.py --num_clones=2 --num_ps_tasks=1
The program exited with the following error:
saving checkpoints to ../model_res_lap_finetune/
/home/n200/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
WARNING:tensorflow:From /home/n200/library/tensorflow_model/models/slim/train_image_classifier.py:397: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Fine-tuning from /home/n200/980223/imdb_wiki_ignore_image/model_res/model.ckpt-1000000
2019-02-20 16:37:55.499050: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-20 16:37:55.599066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-20 16:37:55.599425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 10.70GiB
2019-02-20 16:37:55.672687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-20 16:37:55.672979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:02:00.0
totalMemory: 5.94GiB freeMemory: 5.86GiB
2019-02-20 16:37:55.672999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-02-20 16:37:55.673020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-02-20 16:37:55.673025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y N
2019-02-20 16:37:55.673036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: N Y
2019-02-20 16:37:55.673044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-02-20 16:37:55.673058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1060 6GB, pci bus id: 0000:02:00.0, compute capability: 6.1)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
[[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]
Caused by op u'save_1/RestoreV2_622', defined at:
File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 573, in <module>
tf.app.run()
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 569, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 654, in train
saver = saver or tf_saver.Saver()
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1218, in __init__
self.build()
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1227, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
build_save=build_save, build_restore=build_restore)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
restore_sequentially, reshape)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
[spec.tensor.dtype])[0])
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
[[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]
Traceback (most recent call last):
File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 573, in <module>
tf.app.run()
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/n200/library/tensorflow_model/models/slim/train_image_classifier.py", line 569, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 742, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
start_standard_services=start_standard_services)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/n200/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'save_1/RestoreV2_622': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ]. Make sure the device specification refers to a valid device.
[[Node: save_1/RestoreV2_622 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save_1/Const, save_1/RestoreV2_622/tensor_names, save_1/RestoreV2_622/shape_and_slices)]]
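The InvalidArgumentError above looks like a device-placement mismatch: --num_ps_tasks=1 pins variable ops to /job:ps/task:0, but a single-process run only exposes devices under /job:localhost, so likely no flag combination with num_ps_tasks > 0 can work without an actual ps server running. A plain-Python sketch (hypothetical helper, independent of TensorFlow) of the check the runtime is effectively performing:

```python
def job_of(device_str):
    """Extract the job name from a TF-style device string,
    e.g. '/job:ps/task:0/device:CPU:0' -> 'ps'."""
    for part in device_str.strip("/").split("/"):
        if part.startswith("job:"):
            return part[len("job:"):]
    return None

# Devices listed in the error message for a single-process session.
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:1",
]
requested = "/job:ps/task:0/device:CPU:0"

# The requested 'ps' job is not among the jobs the local session provides,
# which is exactly the condition the InvalidArgumentError reports.
print(job_of(requested) in {job_of(d) for d in available})  # -> False
```

If that reading is right, dropping --num_ps_tasks (leaving it at its default of 0) should avoid assigning ops to a nonexistent ps job on a single machine.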
Does anyone know how to use multiple GPUs on one server to train a slim model? Any other recommended way of using multiple GPUs to train the model would also be appreciated. Thank you!
Same problem; hoping for attention and a resolution.
Closing for now.