Ray: [Core] Recycle driver job IDs.

Created on 15 Apr 2020 · 12 Comments · Source: ray-project/ray

Describe your feature request

For a long-lived Ray cluster with many (not necessarily concurrent) drivers, job IDs are provisioned simply by asking GCS for the next value of an incrementing counter, so a returned job ID eventually exceeds the maximum JobID:

  File "python/ray/worker.py", line 1155, in connect
    int(worker.redis_client.incr("JobCounter")))
  File "python/ray/includes/unique_ids.pxi", line 292, in ray._raylet.JobID.from_int
AssertionError: Maximum JobID integer is 65535.

It'd be nice if these auto-generated job IDs could be recycled, so instead of being limited to 65535 jobs/drivers for the lifetime of the Ray cluster, you'd be limited to 65535 _concurrent_ jobs/drivers, which users will probably be much more hard-pressed to hit.
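To make the difference concrete, here is a minimal sketch contrasting the two allocation strategies. The class names `MonotonicJobIdAllocator` and `RecyclingJobIdAllocator` are hypothetical and not part of Ray; the only detail taken from the traceback above is the 65535 ceiling. The sketch also ignores the harder question of whether anything still references a finished job's ID when it is handed out again.

```python
import heapq

MAX_JOB_ID = 65535  # ceiling reported by the AssertionError above


class MonotonicJobIdAllocator:
    """Hypothetical stand-in for an incrementing counter: IDs are never reused."""

    def __init__(self):
        self._counter = 0

    def allocate(self):
        self._counter += 1
        if self._counter > MAX_JOB_ID:
            raise RuntimeError(f"Maximum JobID integer is {MAX_JOB_ID}.")
        return self._counter


class RecyclingJobIdAllocator:
    """Hypothetical allocator that reuses IDs released by finished drivers."""

    def __init__(self):
        self._next_fresh = 1
        self._free = []  # min-heap of released IDs

    def allocate(self):
        # Prefer a released ID; fall back to a fresh one.
        if self._free:
            return heapq.heappop(self._free)
        if self._next_fresh > MAX_JOB_ID:
            raise RuntimeError("Too many concurrent jobs.")
        job_id = self._next_fresh
        self._next_fresh += 1
        return job_id

    def release(self, job_id):
        # Called when a driver exits; makes its ID available again.
        heapq.heappush(self._free, job_id)
```

With the recycling variant, the limit applies only to IDs that are held at the same time, so a cluster that starts and stops drivers one after another never runs out.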

P2 enhancement

clarkzinzow

👍1

Most helpful comment

Looked into this a bit more, spec for reference:
https://github.com/ray-project/ray/blob/master/src/ray/design_docs/id_specification.md

The idea is to remove the existing flags bytes, freeing up the full two bytes currently allocated to them so they can be added to the job ID. Unfortunately, it’s not straightforward to remove the object_type bit because of how object IDs are calculated: for a PUT_OBJECT, the ID is based on a monotonically increasing counter in each process, and for a RETURN_OBJECT, it’s based only on the index of the object within the return arguments. We rely on being able to calculate RETURN_OBJECT IDs deterministically, so it’s hard to remove that flag.

If we don’t remove that flag, we only have 1 byte that we can allocate to the job id, which might be ok to address this issue, but having an odd number of bytes is difficult because the job id is determined from an integer. Unfortunately there doesn't seem to be a standard uint_24 type we can use, so we'll have to deal with little/big endian ourselves.
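As a rough illustration of what handling the byte order ourselves means, here is a sketch of converting between an integer and a 3-byte job ID. It is written in Python, where the byte order is always passed explicitly; the function names, the 3-byte size, and the choice of big-endian are all assumptions for illustration, not Ray's implementation.

```python
JOB_ID_SIZE = 3  # hypothetical 3-byte job ID (the current limit of 65535 fits in 2 bytes)


def job_id_to_bytes(value: int) -> bytes:
    """Encode an integer as a fixed-width, big-endian job ID (assumed order)."""
    # int.to_bytes raises OverflowError if the value does not fit in 3 bytes
    # or is negative.
    return value.to_bytes(JOB_ID_SIZE, byteorder="big")


def job_id_from_bytes(data: bytes) -> int:
    """Decode a fixed-width, big-endian job ID back into an integer."""
    if len(data) != JOB_ID_SIZE:
        raise ValueError(f"expected {JOB_ID_SIZE} bytes, got {len(data)}")
    return int.from_bytes(data, byteorder="big")
```

Because the order is fixed in the encoding itself, the same bytes decode to the same integer regardless of the host's native endianness.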

cc @clarkzinzow who is going to look into this more. Hopefully you have a better idea than me :)

edoakes on 17 Aug 2020

👍2

All 12 comments

@ffbin is working on this in https://github.com/ray-project/ray/pull/8015, I believe.

simon-mo on 15 Apr 2020

@simon-mo I believe #8015 is only for garbage collecting data when a job finishes. It doesn't recycle job ids.

Actually, instead of reusing job IDs for different jobs, I'd prefer to just extend the job ID to more bytes.
This is the current layout of id bytes. Because of a limitation in plasma, an Object ID can be no more than 20 bytes, so we ended up giving only 2 bytes to the job ID.

@pcmoritz @suquark do you know if we can extend plasma's object id to more bytes?

raulchen on 16 Apr 2020

👍1

@raulchen Bumping the number of bytes for the job ID would be a big help! We'd be hard-pressed to exceed 2**32 jobs during the lifetime of a cluster; if the cluster were running for a year, that'd equate to ~8200 jobs per minute. At that scale, we'd most likely rearchitect to reuse driver connections more aggressively.
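The jobs-per-minute figure can be checked with a few lines of arithmetic. The helper name `jobs_per_minute_for_a_year` is made up for this sketch, and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def jobs_per_minute_for_a_year(num_bytes: int) -> float:
    """Sustained job rate needed to exhaust a job ID of num_bytes in one year."""
    capacity = 2 ** (8 * num_bytes)
    return capacity / MINUTES_PER_YEAR


for num_bytes in (2, 3, 4):
    rate = jobs_per_minute_for_a_year(num_bytes)
    print(f"{num_bytes} bytes: {2 ** (8 * num_bytes)} IDs, {rate:.2f} jobs/minute")
```

For 4 bytes this comes to roughly 8,170 jobs per minute, in line with the estimate above; for 3 bytes it is about 32 per minute, and for the current 2 bytes about one job every 8 minutes.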

clarkzinzow on 16 Apr 2020

@raulchen we are no longer using the transport type flag so we can repurpose those bits for the job ID. Thoughts?

edoakes on 27 May 2020

@edoakes do you only plan to repurpose the 3 bits used by the transport flag, or all of the flag bits?

raulchen on 28 May 2020

@raulchen the current spec for flags is:

flags bytes format

  1b     1b        3b                          11b
+-------------------------------------------------------------------------+
| (1) | (2) |     (3)      |                (4)unused                     |
+-------------------------------------------------------------------------+

The (1) created_by_task part is one bit to indicate whether this ObjectID is generated (put or returned) from a task.
The (2) object_type part is one bit to indicate the type of this object, whether a PUT_OBJECT or a RETURN_OBJECT.
- PUT_OBJECT indicates this object is generated through ray.put during the task's execution.
- RETURN_OBJECT indicates this object is the return value of a task.
The (3) transport_type part is 3 bits to indicate the type of the transport which is used to transfer this object. So it can support 8 types.
There are 11 bits unused in flags bytes.
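For readers who prefer code to bit diagrams, the layout above can be sketched as a pair of pack/unpack helpers. The spec excerpt does not say which end of the 16-bit word field (1) occupies, so this sketch assumes it is the most significant bit; the names `ObjectIdFlags`, `pack_flags`, and `unpack_flags` are hypothetical and not taken from Ray.

```python
from typing import NamedTuple


class ObjectIdFlags(NamedTuple):
    created_by_task: bool  # field (1), 1 bit
    object_type: int       # field (2), 1 bit
    transport_type: int    # field (3), 3 bits (8 possible values)


def pack_flags(created_by_task: bool, object_type: int, transport_type: int) -> int:
    """Pack the three fields into a 16-bit integer, field (1) in the top bit (assumed)."""
    if object_type not in (0, 1):
        raise ValueError("object_type must fit in 1 bit")
    if not 0 <= transport_type <= 7:
        raise ValueError("transport_type must fit in 3 bits")
    # The low 11 bits are the unused field (4) and stay zero.
    return (int(created_by_task) << 15) | (object_type << 14) | (transport_type << 11)


def unpack_flags(flags: int) -> ObjectIdFlags:
    """Split a 16-bit flags value back into its three fields."""
    if not 0 <= flags <= 0xFFFF:
        raise ValueError("flags must fit in 2 bytes")
    return ObjectIdFlags(
        created_by_task=bool((flags >> 15) & 0x1),
        object_type=(flags >> 14) & 0x1,
        transport_type=(flags >> 11) & 0x7,
    )
```

Only 5 of the 16 bits carry information, which is why dropping the flags would hand two full bytes back to the job ID.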

We could leave the existing flags in place and allocate one additional byte to the job ID. Unfortunately, this puts the job ID at 3 bytes, which makes the implementation a bit uglier: because there's no 3-byte integer type, we'd have to handle both little- and big-endian byte orders ourselves when converting from an int.

The other option is to scrap these flags completely and allocate all of their bits to the job ID, which I would tentatively be in favor of. None of the flags here are actually used except for PUT_OBJECT in one place (task cancellation) to validate input, which isn't necessary.

edoakes on 29 May 2020

I'd be in favor of scrapping the flags too.

ericl on 29 May 2020

Yeah, let's scrap the flags, it will also make the code more understandable :)

pcmoritz on 30 May 2020

@raulchen thoughts here?

edoakes on 2 Jun 2020

@edoakes I asked a few people. Currently, we don't see any potential use for these flags either, so I'm tentatively in favor of scrapping them as well.

raulchen on 4 Jun 2020

Thanks @raulchen, let's move ahead with it then.

edoakes on 4 Jun 2020
