Ray: New trials do not run after old trials terminate.

Created on 24 Nov 2018 · 11 comments · Source: ray-project/ray

System information

For fast reproduction, you can change training_iteration from 500 to 20.

The issue occurs in both single-node and distributed mode. It seems that the resources requested by terminated trials are not released.

question

All 11 comments

I think one problem could be that you are creating nested actors, and stopping the trial doesn't automatically kill the nested actors.

One workaround for this right now is simply to implement _stop for your Trainable class and call actor.__ray_terminate__.remote() for all actors you've nested.
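Roughly what that looks like, as a sketch only (the Worker actor here is just a placeholder, and the _setup/_train hook signatures may differ slightly between Tune versions):

    import ray
    from ray.tune import Trainable

    @ray.remote
    class Worker(object):
        # placeholder for whatever nested actor the trainable spawns
        def step(self):
            return 0

    class MyTrainable(Trainable):
        def _setup(self, config):
            # keep handles to every nested actor so they can be killed later
            self.workers = [Worker.remote() for _ in range(config.get("num_workers", 2))]

        def _train(self):
            ray.get([w.step.remote() for w in self.workers])
            return {"done": False}

        def _stop(self):
            # stopping the trial does not kill nested actors automatically,
            # so terminate them explicitly to release their resources
            for w in self.workers:
                w.__ray_terminate__.remote()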

Let me know if you have any other questions/if new issues arise.

I think this is because the refcounts for actors aren't released if the parent calls sys.exit(). FYI @stephanie-wang

Works now. I have run into resource-allocation problems before, so I'm wondering if it's possible to monitor the backend's resource allocation.

If you're using the autoscaler, resource stats are printed periodically to the monitor logs.

There's also ray.global_state.cluster_resources()

And ray.global_state.available_resources(), which was helpful in finding these nested actors.
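A quick way to check for leaked allocations from a separate shell (a sketch; the redis address is a placeholder for your head node):

    import ray

    # connect to the running cluster ("redis_address" in this Ray version),
    # or just ray.init() on a single node
    ray.init(redis_address="<head-node-ip>:6379")

    # total resources the cluster knows about vs. what is currently free;
    # if the gap keeps growing after trials terminate, something (e.g. a
    # nested actor) is still holding its allocation
    print(ray.global_state.cluster_resources())
    print(ray.global_state.available_resources())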


I ran the above file test_ray.txt on a single node with 56 CPUs, with the following change:

    @classmethod
    def default_resource_request(cls, config):
        return Resources(
            cpu=3,
            gpu=0,
            extra_cpu=20,
            extra_gpu=0)

and Tune logs Resources requested: 46/56 CPUs, 0/2 GPUs. But the backend uses two extra CPUs per trial:

In [28]: ray.global_state.cluster_resources()
Out[28]: {'GPU': 2.0, 'CPU': 56.0}
In [29]: ray.global_state.available_resources()
Out[29]: {'GPU': 2.0, 'CPU': 6.0}

@ericl @richardliaw Do you know which commands cause the extra use of CPUs?

Probably the seedholder and the Parameter server. cpu=3 means that the trainable itself gets 3 CPUs.
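So the request above could be adjusted along these lines (illustrative only; it drops into the same Trainable as the snippet above, and the counts assume the nested actors each need one CPU, which your script may or may not match):

    @classmethod
    def default_resource_request(cls, config):
        return Resources(
            cpu=3,          # CPUs for the trainable itself
            gpu=0,
            extra_cpu=22,   # e.g. 20 for the workers plus 1 each for the seed
                            # holder and parameter server, so Tune's accounting
                            # matches what the backend actually allocates
            extra_gpu=0)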

Oh I see. I should account for the nested actors' CPUs in extra_cpu.

yep! BTW, any suggestions on how we can make this clearer would be great...


Closing this because this should be resolved.

I just had this same issue - new trials could not start because old trials weren't releasing the CPUs and GPUs. I think in my case the issue was that my Trainable created multiple sub-actors and passed them to a separate (daemon) thread, which presumably kept a reference to the sub-actors after the main Trainable was cleaned up. The recommendation to implement _stop and explicitly clean up these actors seems to have fixed the issue.
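For anyone hitting the same thing, this is roughly what the cleanup looks like (a sketch; the attribute names are specific to my setup, not a Tune API):

    import threading
    from ray.tune import Trainable

    class MyTrainable(Trainable):
        # _setup (not shown) creates self._sub_actors, a threading.Event in
        # self._shutdown_event, and a daemon thread in self._worker_thread
        # that holds references to those sub-actors.

        def _stop(self):
            # ask the background thread to exit and wait for it, so it drops
            # its references to the sub-actors
            self._shutdown_event.set()
            self._worker_thread.join()
            # then terminate the sub-actors explicitly so their CPUs/GPUs
            # are returned to the scheduler
            for actor in self._sub_actors:
                actor.__ray_terminate__.remote()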
