For a long-lived Ray cluster with many (not necessarily concurrent) drivers, the simple job ID provisioning scheme, in which each driver asks GCS for an incremented job ID, eventually returns a job ID that exceeds the maximum JobID:
File "python/ray/worker.py", line 1155, in connect
int(worker.redis_client.incr("JobCounter")))
File "python/ray/includes/unique_ids.pxi", line 292, in ray._raylet.JobID.from_int
AssertionError: Maximum JobID integer is 65535.
It'd be nice if these auto-generated job IDs could be recycled, so instead of being limited to 65535 jobs/drivers for the lifetime of the Ray cluster, you'd be limited to 65535 _concurrent_ jobs/drivers, which users will probably be much more hard-pressed to hit.
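To make the limit concrete, here is a minimal sketch of how a 2-byte job ID produces the assertion in the traceback. `job_id_from_int` is a hypothetical stand-in for `JobID.from_int`, not Ray's actual code, and the big-endian byte order is an assumption made only for illustration.

```python
JOB_ID_SIZE_BYTES = 2  # per this thread, the job ID currently gets 2 bytes
MAX_JOB_ID = 2 ** (8 * JOB_ID_SIZE_BYTES) - 1  # 65535


def job_id_from_int(value: int) -> bytes:
    """Hypothetical stand-in for JobID.from_int (not Ray's implementation)."""
    # Any counter value past 65535 cannot be represented in 2 bytes.
    assert 0 <= value <= MAX_JOB_ID, f"Maximum JobID integer is {MAX_JOB_ID}."
    # Byte order is an assumption for this sketch.
    return value.to_bytes(JOB_ID_SIZE_BYTES, "big")
```

Because the counter only ever increments, the 65536th driver to connect would trip the assertion even if no other driver is still running.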
@ffbin is working on this here, I believe: https://github.com/ray-project/ray/pull/8015
@simon-mo I believe #8015 is only for garbage collecting data when a job finishes. It doesn't recycle job ids.
Actually, instead of reusing job ids for different jobs, I'd prefer to just extend the job id to more bytes.
This is the current layout of id bytes. Due to plasma's limitation, Object ID had to be no more than 20 bytes. So we ended up only giving 2 bytes to job id.
@pcmoritz @suquark do you know if we can extend plasma's object id to more bytes?
@raulchen That bytes bump for the job ID would be a big help! We'd be hard-pressed to exceed 2**32 jobs during the lifetime of a cluster; if the cluster was running for a year, that'd equate to ~8200 jobs per minute. At that scale, we'd most likely rearchitect to reuse driver connections more aggressively.
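The per-minute figure above can be checked with a few lines of arithmetic. This is only a sketch; the helper name `jobs_per_minute_to_exhaust` is made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def jobs_per_minute_to_exhaust(id_bytes: int, minutes: int = MINUTES_PER_YEAR) -> float:
    """Average job rate needed to use up every ID of the given width in `minutes`."""
    return 2 ** (8 * id_bytes) / minutes


# 4-byte job ID: roughly 8,200 jobs per minute, sustained for a full year.
four_byte_rate = jobs_per_minute_to_exhaust(4)

# 2-byte job ID (the current size): roughly one job every 8 minutes.
two_byte_rate = jobs_per_minute_to_exhaust(2)
```

The same calculation shows why the current 2-byte ID is easy to exhaust: a cluster that starts a job about every 8 minutes runs out within a year.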
@raulchen we are no longer using the transport type flag so we can repurpose those bits for the job ID. Thoughts?
@edoakes do you only plan to re-purpose the 2 bits used by the transport flag, or the whole flag bits?
@raulchen the current spec for flags is:
flags bytes format
         1b                  1b                3b               11b
+-------------------------------------------------------------------------+
| (1)created_by_task | (2)object_type | (3)transport_type | (4)unused     |
+-------------------------------------------------------------------------+
The (1) created_by_task part is one bit to indicate whether this ObjectID is generated (put or returned) from a task.
The (2) object_type part is one bit to indicate the type of this object, whether a PUT_OBJECT or a RETURN_OBJECT.
PUT_OBJECT indicates this object is generated through ray.put during the task's execution.
RETURN_OBJECT indicates this object is the return value of a task.
The (3) transport_type part is 3 bits to indicate the type of the transport which is used to transfer this object. So it can support 8 types.
There are 11 bits unused in flags bytes.
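As a rough illustration of the layout described above, the 16 flag bits could be packed and unpacked as follows. This is a sketch, not Ray's code: the most-significant-bit-first ordering, the `Flags` type, and the 0/1 encoding of PUT_OBJECT and RETURN_OBJECT are all assumptions.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    created_by_task: bool  # (1) 1 bit
    object_type: int       # (2) 1 bit; assumption: 0 = PUT_OBJECT, 1 = RETURN_OBJECT
    transport_type: int    # (3) 3 bits, so values 0..7


def pack_flags(flags: Flags) -> int:
    """Pack the three fields into the top 5 bits of a 16-bit value."""
    if flags.object_type not in (0, 1):
        raise ValueError("object_type must fit in 1 bit")
    if not 0 <= flags.transport_type < 8:
        raise ValueError("transport_type must fit in 3 bits")
    # The low 11 bits are the unused part (4) and stay zero.
    return (
        (int(flags.created_by_task) << 15)
        | (flags.object_type << 14)
        | (flags.transport_type << 11)
    )


def unpack_flags(value: int) -> Flags:
    """Inverse of pack_flags; ignores the 11 unused low bits."""
    return Flags(
        created_by_task=bool((value >> 15) & 1),
        object_type=(value >> 14) & 1,
        transport_type=(value >> 11) & 0b111,
    )
```

The field widths sum to 1 + 1 + 3 + 11 = 16 bits, which is the two bytes being considered for reallocation to the job ID.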
We could leave the existing flags there and allocate one additional byte for the job ID. Unfortunately this puts us at 3 bytes for the job ID which makes the implementation a bit more ugly (because there's no 3 byte integer type we'll have to deal with both little/big endian in order to convert from an int).
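For reference, the 3-byte conversion mentioned above might look like the sketch below. The function names are hypothetical and big-endian order is an arbitrary choice; the point is that with no native 24-bit integer type, the byte order has to be fixed explicitly with shifts and masks.

```python
def encode_uint24_be(value: int) -> bytes:
    """Encode an integer as 3 big-endian bytes (hypothetical helper)."""
    if not 0 <= value < 1 << 24:
        raise ValueError("value does not fit in 3 bytes")
    # Emit the most significant byte first, regardless of host endianness.
    return bytes([(value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF])


def decode_uint24_be(data: bytes) -> int:
    """Decode 3 big-endian bytes back into an integer (hypothetical helper)."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    return (data[0] << 16) | (data[1] << 8) | data[2]
```

In Python, `int.to_bytes(3, "big")` and `int.from_bytes` do the same job; the explicit shifts are shown because they mirror what a C++ or Cython implementation would have to write by hand.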
The other option is to completely scrap these flags and allocate it all to the job ID, which I would tentatively be in favor of. None of the flags here are actually used except for PUT_OBJECT in one place (task cancellation) to validate input, which isn't necessary.
I'd be in favor of scrapping the flags too.
Yeah, let's scrap the flags, it will also make the code more understandable :)
@raulchen thoughts here?
@edoakes I asked a few people. Currently, we don't see any potential use for these flags either. So I'm tentatively in favor of scrapping them as well.
Thanks @raulchen, let's move ahead with it then.
Looked into this a bit more, spec for reference:
https://github.com/ray-project/ray/blob/master/src/ray/design_docs/id_specification.md
The idea is to remove the existing flags bytes, which would free up the full two bytes currently allocated to them so they can be added to the job id. Unfortunately it's not straightforward to remove the object_type bit because of the way object ids are calculated: if it's a PUT_OBJECT, the id is based on a monotonically increasing counter in each process, and if it's a RETURN_OBJECT, it's based on just the index of the object within the return arguments. We rely on being able to calculate the RETURN_OBJECT ids deterministically, so it's hard to remove that flag.
If we don't remove that flag, we only have 1 byte that we can allocate to the job id, which might be ok to address this issue, but having an odd number of bytes is difficult because the job id is determined from an integer. Unfortunately there doesn't seem to be a standard uint_24 type we can use, so we'll have to deal with little/big endian ourselves.
cc @clarkzinzow who is going to look into this more. Hopefully you have a better idea than me :)