Ray: New trials do not run after old trials terminate.

Created on 24 Nov 2018 · 11 comments · Source: ray-project/ray

System information

For fast reproduction, you can change training_iteration from 500 to 20.

The issue occurs in both single-node and distributed mode. It seems that the resources requested by terminated trials are not released.

question

All 11 comments

I think one problem could be that you are creating nested actors, and stopping the trial doesn't automatically kill the nested actors.

One workaround for this right now is simply to implement _stop for your Trainable class and call actor.__ray_terminate__.remote() for all actors you've nested.
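Roughly what that looks like, as a sketch only (the Worker actor here is just a placeholder, and the _setup/_train hook signatures may differ slightly between Tune versions):

    import ray
    from ray.tune import Trainable

    @ray.remote
    class Worker(object):
        # placeholder for whatever nested actor the trainable spawns
        def step(self):
            return 0

    class MyTrainable(Trainable):
        def _setup(self, config):
            # keep handles to every nested actor so they can be killed later
            self.workers = [Worker.remote() for _ in range(config.get("num_workers", 2))]

        def _train(self):
            ray.get([w.step.remote() for w in self.workers])
            return {"done": False}

        def _stop(self):
            # stopping the trial does not kill nested actors automatically,
            # so terminate them explicitly to release their resources
            for w in self.workers:
                w.__ray_terminate__.remote()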

Let me know if you have any other questions/if new issues arise.

I think this is because the refcounts for actors aren't released if the parent calls sys.exit(). FYI @stephanie-wang

Works now. I have run into resource-allocation problems before, so I'm wondering if it's possible to monitor the backend's resource allocation.

If you're using the autoscaler, resource stats are printed periodically to the monitor logs.

There's also ray.global_state.cluster_resources()

And ray.global_state.available_resources(), which was helpful in finding these nested actors.
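A quick way to check for leaked allocations from a separate shell (a sketch; the redis address is a placeholder for your head node):

    import ray

    # connect to the running cluster ("redis_address" in this Ray version),
    # or just ray.init() on a single node
    ray.init(redis_address="<head-node-ip>:6379")

    # total resources the cluster knows about vs. what is currently free;
    # if the gap keeps growing after trials terminate, something (e.g. a
    # nested actor) is still holding its allocation
    print(ray.global_state.cluster_resources())
    print(ray.global_state.available_resources())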


I ran the above file test_ray.txt on a single node with 56 CPUs, with the following change:

    @classmethod
    def default_resource_request(cls, config):
        return Resources(
            cpu=3,
            gpu=0,
            extra_cpu=20,
            extra_gpu=0)

and Tune logs Resources requested: 46/56 CPUs, 0/2 GPUs. But the backend uses two extra CPUs per trial:

In [28]: ray.global_state.cluster_resources()
Out[28]: {'GPU': 2.0, 'CPU': 56.0}
In [29]: ray.global_state.available_resources()
Out[29]: {'GPU': 2.0, 'CPU': 6.0}

@ericl @richardliaw Do you know which commands cause the extra use of CPUs?

Probably the seedholder and the Parameter server. cpu=3 means that the trainable itself gets 3 CPUs.
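So the request above could be adjusted along these lines (illustrative only; it drops into the same Trainable as the snippet above, and the counts assume the nested actors each need one CPU, which your script may or may not match):

    @classmethod
    def default_resource_request(cls, config):
        return Resources(
            cpu=3,          # CPUs for the trainable itself
            gpu=0,
            extra_cpu=22,   # e.g. 20 for the workers plus 1 each for the seed
                            # holder and parameter server, so Tune's accounting
                            # matches what the backend actually allocates
            extra_gpu=0)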

Oh I see. I should account for the nested actors' CPUs in extra_cpu.

yep! BTW, any suggestions on how we can make this clearer would be great...


Closing this because this should be resolved.

I just had this same issue - new trials could not start because old trials weren't releasing the CPUs and GPUs. I think in my case the issue was that my Trainable created multiple sub-actors and passed them to a separate (daemon) thread, which presumably kept a reference to the sub-actors after the main Trainable was cleaned up. The recommendation to implement _stop and explicitly clean up these actors seems to have fixed the issue.
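For anyone hitting the same thing, this is roughly what the cleanup looks like (a sketch; the attribute names are specific to my setup, not a Tune API):

    import threading
    from ray.tune import Trainable

    class MyTrainable(Trainable):
        # _setup (not shown) creates self._sub_actors, a threading.Event in
        # self._shutdown_event, and a daemon thread in self._worker_thread
        # that holds references to those sub-actors.

        def _stop(self):
            # ask the background thread to exit and wait for it, so it drops
            # its references to the sub-actors
            self._shutdown_event.set()
            self._worker_thread.join()
            # then terminate the sub-actors explicitly so their CPUs/GPUs
            # are returned to the scheduler
            for actor in self._sub_actors:
                actor.__ray_terminate__.remote()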
