Hello,
I get an error when trying to run train_lenet_on_mnist.sh with --worker_replicas=2 and --num_ps_tasks=1:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device to node 'save/RestoreV2_7': Could not satisfy explicit device specification '/job:worker/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0, /job:localhost/replica:0/task:0/gpu:1, /job:localhost/replica:0/task:0/gpu:2, /job:localhost/replica:0/task:0/gpu:3, /job:localhost/replica:0/task:0/gpu:4, /job:localhost/replica:0/task:0/gpu:5, /job:localhost/replica:0/task:0/gpu:6, /job:localhost/replica:0/task:0/gpu:7
[[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/device:CPU:0"](save/Const, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]
Tested on an AWS p2.8x instance with tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl, and also with TensorFlow compiled from source.
Thanks
Could you try passing --master=localhost as well? Additionally, if you just don't specify worker_replicas and num_ps_tasks, does it work?
--master=localhost generates the following error: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "localhost" config: } Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Anyway, if I don't specify worker_replicas and num_ps_tasks, or if I use worker_replicas=1 and num_ps_tasks=0, it works fine, as expected.
Thanks
I don't know why this issue has been closed. I still get an error when I try to use train_lenet_on_mnist.sh with a parameter server. I didn't get any satisfactory answer.
Regards
Oh sorry, I misread your last response. Re-opening this for the right owners to take a look.
@mrry, do you have any insight on this or know who might?
From looking at that script, it doesn't look like anything in that library ever creates a tf.train.Server, so I can't see how it would ever work for distributed training. There doesn't seem to be any public interface e.g. for setting the network addresses of the worker and PS tasks, which would be a prerequisite.
Perhaps this code was copied from another codebase that does have distributed support, but the distributed parts were left out?
I was debugging a similar error message. Is it possible that the DeploymentConfig in model_deploy.py is hardcoding the worker_job_name to worker when it means localhost?
See:
This causes the device to be named /job:worker/device:CPU:0.
I didn't get too much farther before getting distracted but perhaps this is of use to those looking into the problem ...
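In case it helps, here is a rough sketch of what I was looking at (the import path, argument names, and method names are from my reading of deployment/model_deploy.py in the slim repo, so treat them as approximate rather than verified):

~~~~
# Sketch only: with num_ps_tasks > 0, DeploymentConfig prefixes device
# strings with '/job:worker' and '/job:ps'. Those job names only resolve
# if a tf.train.Server that registers them is running in the process,
# which the training script never starts -- hence the "no devices matching
# that specification" error above.
from deployment import model_deploy  # lives in the slim repo

config = model_deploy.DeploymentConfig(
    num_clones=1,
    num_replicas=2,   # what --worker_replicas maps to
    num_ps_tasks=1,   # what --num_ps_tasks maps to
)

# Expected to print something like '/job:worker/device:GPU:0'.
print(config.clone_device(0))
~~~~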
I think the worker_job_name is a TensorFlow job name (as in the string "worker" in "/job:worker"), and not a network address.
These scripts need a lot of modifications (including extra configuration options and tf.train.Server setup) to work in a distributed setting. Once that is in place, when starting a worker task you'd set the --master flag to grpc://hostname:port, where hostname:port is the address of the tf.train.Server in that worker.
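For illustration, the missing piece looks roughly like this (a minimal sketch; the hostnames, ports, and flag plumbing are placeholders, not something the current scripts provide):

~~~~
import tensorflow as tf

# Every task (each ps and each worker) needs to know the full cluster layout
# and start its own in-process server.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# job_name / task_index would come from command-line flags on each machine.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A ps task would just block:
#   server.join()
# A worker task would pass server.target (a grpc://host:port string) as the
# session master, e.g. via slim.learning.train(..., master=server.target,
# is_chief=(task_index == 0)).
~~~~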
Is this problem solved?
As far as I know, it's not solved :(
I got the same problem when trying to run the flowers retraining on Google ML Engine. What @mrry reported is also happening to me.
@sguada Can you comment?
@beeva-enriqueotero Hello, do you know of any solution to run TF-slim training on multiple machines? I have the same need but failed to find a way. I opened a question about this on Stack Overflow: Run TF-slim model training on multiple machines.
Hi @kzhang28. I suffered a lot when trying to run pure TF-slim models distributed, but since TF 1.4 I've been running TF-slim models distributed on Google ML Engine using the Estimator API. Basically, my model_fn calls TF-slim models and it works perfectly.
I hope it helps you.
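Roughly, the pattern looks like this (a simplified, training-only sketch; the layer sizes, feature names, and model_dir are made up for illustration, not from my actual code):

~~~~
import tensorflow as tf

slim = tf.contrib.slim

def model_fn(features, labels, mode):
    # Build the network with TF-slim layers (a real model_fn would call one
    # of the slim nets instead of this toy stack).
    net = slim.flatten(features["image"])
    net = slim.fully_connected(net, 128)
    logits = slim.fully_connected(net, 10, activation_fn=None)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# The Estimator takes care of sessions, checkpoints and, when a cluster is
# configured, the distributed coordination the raw slim scripts are missing.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/slim_estimator")
~~~~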
@rodrigofp-cit Thanks! I have never used the Estimator API or Google ML Engine before. I was wondering if the way you suggested would work if I try to run a TF-slim model on multiple bare-metal machines instead of on Google ML Engine.
@kzhang28 It probably will, since the Estimator API does all the networking under the hood. The only requirement is to have all machines within the same network.
@rodrigofp-cit Thank you! Would you mind sharing your distributed training scripts with me? I was wondering how to specify the cluster setup and job name (worker/ps) for each server. Should I launch the distributed training job in the following way, and do I need to use tf.train.MonitoredSession? I appreciate your help.
~~~~
# On ps0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=0

# On ps1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=1

# On worker0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=0

# On worker1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=1
~~~~
@kzhang28 Are you running on Google ML Engine? If so, you don't need to specify any of this networking configuration.
@rodrigofp-cit I see; however, I am going to run the training on a physical cluster instead of Google ML Engine (6 machines running Ubuntu 16 with TF installed). If there is no way to specify the hosts for the ps/worker jobs, I would not be able to run it in a distributed, multi-machine fashion.
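For what it's worth, outside ML Engine the Estimator-based approach reads the cluster layout from the TF_CONFIG environment variable rather than from command-line flags; a minimal sketch with placeholder hostnames:

~~~~
import json
import os

# Set on each machine before launching the same training script (TF >= 1.4);
# only the "task" entry changes from machine to machine.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief0.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# tf.estimator.train_and_evaluate then starts the right tf.train.Server for
# this task and wires up the distributed session automatically:
#   tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
~~~~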
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!