Ray: [sgd] A local worker will always be created for PytorchTrainer?

Created on 6 May 2020 · 19 comments · Source: ray-project/ray

Hi,

I'm new to RaySGD and am trying to use PyTorchTrainer for distributed training.
I glanced at the code and see that in all cases there will be a worker on the driver node, with the other workers placed on other Ray nodes. Am I understanding correctly?
So what if I don't have enough resources on the driver?
Also, when creating PyTorchRunner, no resources are specified. Does that mean only 1 CPU will be used for each runner during training? But from what I observe, more than 1 CPU is actually used.
Hope to get some clarification and help on this! Thanks so much!
(I'm using Ray 0.8.4 and torch 1.5.0 btw)

question sgd

Most helpful comment

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Yep!

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

Yep, feel free to ask any questions as you do!

All 19 comments

@hkvision thanks for opening up this issue! What do you mean by not having enough resources on the driver? Like the driver doesn't have a GPU?

@hkvision thanks for opening up this issue! What do you mean by not having enough resources on the driver? Like the driver doesn't have a GPU?

Hi @richardliaw, thanks for your reply. Yes. I mean that in YARN settings, for example, the driver is only for scheduling, and the driver hardware is probably not as good as the workers'. So for distributed training, if there is always a process on the driver, won't it become the bottleneck for syncing parameters?

Makes sense.

My suggestion for you is to "lift" the TorchTrainer into an actor:

RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(data_creator=...)

remote_trainer.train.remote()
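For concreteness, here is a minimal end-to-end sketch of that pattern. The creator functions and the toy regression data are placeholders, and the keyword names (num_replicas, use_gpu, loss_creator, etc.) follow the Ray 0.8-era RaySGD API, so they may differ in other releases:

import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd.torch import TorchTrainer

# Placeholder creator functions for a toy regression problem.
def model_creator(config):
    return nn.Linear(1, 1)

def data_creator(config):
    x = torch.randn(64, 1)
    y = 2 * x
    dataset = TensorDataset(x, y)
    return DataLoader(dataset, batch_size=16), DataLoader(dataset, batch_size=16)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=0.01)

ray.init(address="auto")  # connect to the running cluster

# Wrapping TorchTrainer with ray.remote schedules the trainer (and therefore
# its local worker) on whichever node satisfies the resource request,
# rather than on the driver.
RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    use_gpu=True,
)

# Actor method calls return object IDs; block on the result with ray.get.
stats = ray.get(remote_trainer.train.remote())
print(stats)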

Another option is simply to use Tune, which lets you keep all of the fault tolerance guarantees and also provides utilities such as auto-checkpointing and TensorBoard logging:

tune.run(TorchTrainer.as_trainable(data_creator=...))
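And a rough sketch of the Tune route, reusing the placeholder creator functions from the sketch above; it assumes the Ray 0.8-era behavior where as_trainable accepts the same constructor keywords as TorchTrainer:

from ray import tune
import torch.nn as nn
from ray.util.sgd.torch import TorchTrainer

TorchTrainable = TorchTrainer.as_trainable(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
)

# Tune takes care of checkpointing, TensorBoard logging, and restarting
# failed trials.
analysis = tune.run(
    TorchTrainable,
    stop={"training_iteration": 10},
    checkpoint_freq=2,
)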

Makes sense.

My suggestion for you is to "lift" the TorchTrainer into an actor:

RemoteTrainer = ray.remote(num_gpus=1)(TorchTrainer)
remote_trainer = RemoteTrainer.remote(data_creator=...)

remote_trainer.train.remote()

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Another option is simply to use Tune, which lets you keep all of the fault tolerance guarantees and also provides utilities such as auto-checkpointing and TensorBoard logging:

tune.run(TorchTrainer.as_trainable(data_creator=...))

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

By doing this, TorchTrainer becomes an actor and would be placed on one node; and thus the local worker would be placed on that node as well. Am I understanding correctly?

Yep!

Okay. I will try this. I'm not familiar with tune and will study this first :) Thanks so much!

Yep, feel free to ask any questions as you do!

https://github.com/ray-project/ray/blob/master/python/ray/util/sgd/torch/torch_trainer.py#L336
@richardliaw I want to ask why, when creating the Runner actor, it forces the CPU reservation to 1. So if I don't have a GPU, does each worker only get one CPU core? Would this restriction hinder performance?

Also, it seems that if I run two workers locally, the workers actually use more than 1 CPU core, but with multiple nodes, each worker only uses one core... (weird, I will do more testing on this...)
Hope to get some help!

The restriction doesn't hinder performance. It's only for bookkeeping.

Also, it seems that if I run two workers locally, the workers actually use more than 1 CPU core, but with multiple nodes, each worker only uses one core...

Can you tell me more about this?

Also, are you CPU only? Why not try to use GPU?

Also, are you CPU only? Why not try to use GPU?

Yeah, because our team works on running on CPU servers... so we use CPU only for now.
In the CPU-only case, the code forces each worker's reservation to 1 CPU; won't that hinder performance, since each worker would only run on one core?
Thanks!

Any response? @richardliaw Thanks :)

Ah yeah; we don't do CPU isolation in Ray, so 1 CPU doesn't limit the performance of a single worker.

Ah yeah; we don't do CPU isolation in Ray, so 1 CPU doesn't limit the performance of a single worker.

Oh really? Even if the worker only has 1 CPU resource, if I set something like OMP_NUM_THREADS=10, can it still actually use 10 cores?

But one problem is that in the current code, every worker reserves 1 CPU. So it is very likely that all the workers are placed on the same node with 10 CPUs. Then even if each of them uses more than 1 CPU, they would still be competing for limited resources. Any idea on this?

Or can future versions make num_cpus configurable?

@hkvision sorry for the slow reply.

Oh really? Even if the worker only has 1 CPU resource, if I set something like OMP_NUM_THREADS=10, can it still actually use 10 cores?

yes
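For reference, a minimal sketch of one way to do that per worker, assuming the initialization_hook argument exposed by the Ray 0.8-era TorchTrainer (a function run on every worker before training starts); the thread count of 10 and the creator functions (the placeholders from the earlier sketch) are just for illustration:

import os
import torch
import torch.nn as nn
from ray.util.sgd.torch import TorchTrainer

def init_hook():
    # Executed on each remote worker before training, so every worker's
    # PyTorch can use more threads than the single CPU Ray bookkeeps for it.
    os.environ["OMP_NUM_THREADS"] = "10"
    torch.set_num_threads(10)

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    initialization_hook=init_hook,
)
trainer.train()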

But one problem is that in the current code, every worker reserves 1 CPU. So it is very likely that all the workers are placed on the same node with 10 CPUs. Then even if each of them uses more than 1 CPU, they would still be competing for limited resources. Any idea on this? Or can future versions make num_cpus configurable?

Yeah, perhaps we should allow the number of CPUs per worker to be customizable. Would something like

Trainer(num_cpus_per_worker=4)

work for you?

@richardliaw No problem. Thanks so much for your reply. Yeah, it would be great to make the CPU number customizable.
May I ask what your previous assumption was when specifying num_cpus? Since specifying it doesn't actually control the CPU usage of actors, what is num_cpus used for?

Actually @ConeyLiu just pushed a PR that addresses this issue and I'll try to merge it.

The assumption was that the workload would primarily be GPU-intensive, so CPU reservations wouldn't matter so much.

Okay! Many thanks for your help!

Fixed in https://github.com/ray-project/ray/pull/8963! Please let me know if that works for you.
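For later readers, usage with a customizable per-worker CPU reservation would look roughly like the following, reusing the imports and placeholder creator functions from the earlier sketches; num_cpus_per_worker is simply the keyword proposed above, so check the release you are running for the exact name the merged change uses:

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_replicas=2,
    num_cpus_per_worker=4,  # reserve 4 CPUs per worker so the workers are less likely to be packed onto one node
)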
