Hello,
I get an error when trying to run train_lenet_on_mnist.sh with --worker_replicas=2 and --num_ps_tasks=1:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Cannot assign a device to node 'save/RestoreV2_7': Could not satisfy explicit device specification '/job:worker/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0, /job:localhost/replica:0/task:0/gpu:1, /job:localhost/replica:0/task:0/gpu:2, /job:localhost/replica:0/task:0/gpu:3, /job:localhost/replica:0/task:0/gpu:4, /job:localhost/replica:0/task:0/gpu:5, /job:localhost/replica:0/task:0/gpu:6, /job:localhost/replica:0/task:0/gpu:7
[[Node: save/RestoreV2_7 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/device:CPU:0"](save/Const, save/RestoreV2_7/tensor_names, save/RestoreV2_7/shape_and_slices)]]
Tested on an AWS p2.8x instance with tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl, and also with TensorFlow compiled from source.
Thanks
Could you try passing --master=localhost as well? Additionally, if you just don't specify worker_replicas and num_ps_tasks, does it work?
--master=localhost generates the following error: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "localhost" config: } Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Anyway, if I don't specify worker_replicas and num_ps_tasks, or if I use worker_replicas=1 and num_ps_tasks=0, it works fine, as expected.
Thanks
I don't know why this issue has been closed. I still get an error when I try to use train_lenet_on_mnist.sh with a parameter server. I didn't get any satisfactory answer.
Regards
Oh sorry, I misread your last response. Re-opening this for the right owners to take a look.
@mrry, do you have any insight on this or know who might?
From looking at that script, it doesn't look like anything in that library ever creates a tf.train.Server, so I can't see how it would ever work for distributed training. There doesn't seem to be any public interface e.g. for setting the network addresses of the worker and PS tasks, which would be a prerequisite.
Perhaps this code was copied from another codebase that does have distributed support, but the distributed parts were left out?
I was debugging a similar error message. Is it possible that the DeploymentConfig in model_deploy.py is hardcoding the worker_job_name to worker when it means localhost?
See:
This causes the device to be named /job:worker/device:CPU:0.
I didn't get too much farther before getting distracted but perhaps this is of use to those looking into the problem ...
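In case it helps, here is a rough sketch of what I was looking at (the import path, argument names, and method names are from my reading of deployment/model_deploy.py in the slim repo, so treat them as approximate rather than verified):

~~~~
# Sketch only: with num_ps_tasks > 0, DeploymentConfig prefixes device
# strings with '/job:worker' and '/job:ps'. Those job names only resolve
# if a tf.train.Server that registers them is running in the process,
# which the training script never starts -- hence the "no devices matching
# that specification" error above.
from deployment import model_deploy  # lives in the slim repo

config = model_deploy.DeploymentConfig(
    num_clones=1,
    num_replicas=2,   # what --worker_replicas maps to
    num_ps_tasks=1,   # what --num_ps_tasks maps to
)

# Expected to print something like '/job:worker/device:GPU:0'.
print(config.clone_device(0))
~~~~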
I think the worker_job_name is a TensorFlow job name (as in the string "worker" in "/job:worker"), and not a network address.
These scripts need a lot of modifications (including extra configuration options and tf.train.Server setup) to work in a distributed setting. Once that is in place, when starting a worker task you'd set the --master flag to grpc://hostname:port, where hostname:port is the address of the tf.train.Server in that worker.
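For illustration, the missing piece looks roughly like this (a minimal sketch; the hostnames, ports, and flag plumbing are placeholders, not something the current scripts provide):

~~~~
import tensorflow as tf

# Every task (each ps and each worker) needs to know the full cluster layout
# and start its own in-process server.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# job_name / task_index would come from command-line flags on each machine.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A ps task would just block:
#   server.join()
# A worker task would pass server.target (a grpc://host:port string) as the
# session master, e.g. via slim.learning.train(..., master=server.target,
# is_chief=(task_index == 0)).
~~~~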
Is this problem solved?
As far as I know, it's not solved :(
I got the same problem when trying to run the flowers retraining on Google ML Engine. What @mrry reported is also happening to me.
@sguada Can you comment?
@beeva-enriqueotero Hello, do you know of any solution to run TF-slim training on multiple machines? I have the same need but failed to find a way. I opened a question about this on Stack Overflow: Run TF-slim model training on multiple machines.
Hi @kzhang28. I suffered a lot when trying to run pure TF-slim models distributed, but since TF 1.4 I've been running TF-slim models distributed on Google ML Engine using the Estimator API. Basically, my model_fn calls TF-slim models and it works perfectly.
I hope it helps you.
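Roughly, the pattern looks like this (a simplified, training-only sketch; the layer sizes, feature names, and model_dir are made up for illustration, not from my actual code):

~~~~
import tensorflow as tf

slim = tf.contrib.slim

def model_fn(features, labels, mode):
    # Build the network with TF-slim layers (a real model_fn would call one
    # of the slim nets instead of this toy stack).
    net = slim.flatten(features["image"])
    net = slim.fully_connected(net, 128)
    logits = slim.fully_connected(net, 10, activation_fn=None)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# The Estimator takes care of sessions, checkpoints and, when a cluster is
# configured, the distributed coordination the raw slim scripts are missing.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/slim_estimator")
~~~~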
@rodrigofp-cit Thanks! I have never used the Estimator API or Google ML Engine before. I was wondering if the way you suggested would work if I try to run a TF-slim model on multiple bare-metal machines instead of on Google ML Engine.
@kzhang28 It probably will, since the Estimator API does all the networking under the hood. The only requirement is to have all machines within the same network.
@rodrigofp-cit Thank you! Would you mind sharing your distributed training scripts with me? I was wondering how to specify the cluster setup and job name (worker/ps) for each server. Should I launch the distributed training job in the following way, and do I need to use tf.train.MonitoredSession? I appreciate your help.
~~~~
# On ps0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=0

# On ps1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=1

# On worker0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=0

# On worker1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=1
~~~~
@kzhang28 Are you running on Google ML Engine? If so, you don't need to specify any of this networking configuration.
@rodrigofp-cit I see; however, I am going to run the training on a physical cluster instead of Google ML Engine (6 machines running Ubuntu 16 with TF installed). If there is no way to specify the hosts for the ps/worker jobs, I would not be able to run it in a distributed, multi-machine fashion.
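For what it's worth, outside ML Engine the Estimator-based approach reads the cluster layout from the TF_CONFIG environment variable rather than from command-line flags; a minimal sketch with placeholder hostnames:

~~~~
import json
import os

# Set on each machine before launching the same training script (TF >= 1.4);
# only the "task" entry changes from machine to machine.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief0.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# tf.estimator.train_and_evaluate then starts the right tf.train.Server for
# this task and wires up the distributed session automatically:
#   tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
~~~~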
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!