OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu in Docker
Linux f3559e14813a 4.9.87-linuxkit-aufs #1 SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Ray installed from (source or binary): source
I wrote a Python script for a performance test, which transfers 100K integers from one actor to another, with each actor running in a different Docker container. With legacy Ray it takes about 1 minute, while with raylet enabled it takes about 3 minutes.
On Docker container 2 (the head node), run:
ray start --head --redis-port=6379 --resources='{"Resource2": 4}' --use-raylet
On Docker container 1, run:
ray start --redis-address 172.17.0.3:6379 --resources='{"Resource1": 2}' --use-raylet
Then on container 1, run:
python perf.py
[Perf.py source code]
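The script itself isn't reproduced in this thread. Purely as a hypothetical sketch of this kind of benchmark (the actor and method names, resource amounts, and the assumption that an actor handle can be passed into another actor are mine, not the original perf.py):

```python
import time

import ray

# Hypothetical sketch, not the original perf.py: actor B ("Receiver") runs
# where Resource2 is defined, actor A ("Sender") where Resource1 is, and A
# submits 100K tiny tasks (one integer each) to B.
ray.init(redis_address="172.17.0.3:6379")


@ray.remote(resources={"Resource2": 1})
class Receiver(object):
    def receive(self, x):
        # Empty work: just return the integer that was passed in.
        return x


@ray.remote(resources={"Resource1": 1})
class Sender(object):
    def __init__(self, receiver):
        self.receiver = receiver

    def send_all(self, n):
        # Submit n tasks to the receiver; actor tasks execute in order,
        # so waiting on the last one waits for all of them.
        results = [self.receiver.receive.remote(i) for i in range(n)]
        return ray.get(results[-1])


receiver = Receiver.remote()
sender = Sender.remote(receiver)

start = time.time()
ray.get(sender.send_all.remote(100000))
print("elapsed: %.1fs" % (time.time() - start))
```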
@robertnishihara @pcmoritz @istoica is this expected?
Currently xray performance is a bit worse than legacy Ray performance. We still have a lot of performance optimizations to do.
How big is the slowdown in this example?
EDIT: Oh I see, 1 minute to 3 minutes. That's bigger than expected.
In our testing, the raylet path was about 3x slower than the non-raylet path.
We call an empty function that accepts an int as its input parameter 100k times. The non-raylet path takes about 1 minute, but the raylet path takes about 3 minutes.
hmm, I'll look into this as well
The bits running in Docker are not the latest Ray; they were synced a while ago. I just tried the latest Ray, and I could reproduce the problem on my MacBook using the same script, without Docker. The perf regression looks even more significant: the same script doesn't complete within 15 minutes this time.
Did some tracing, and I think I found the root cause: removeTaskFromQueue() takes too much time. I'm working on a fix and will post updates.
Here's the hottest callstack.
[image: callstack] https://user-images.githubusercontent.com/37800433/42920687-96ebbaf8-8b4a-11e8-9c7c-fa81b283d409.png
Thanks, Zhijun.
These are the number of calls, right?
Can we also get the time spent in these calls? That would be more useful for profiling.
Thanks!
Ion
@istoica
Sure Ion. Here's the callstack with time information.

@zhijunfu Thanks. Adding @atumanov as well.
@zhijunfu I would hold off on any fix that touches the queues until after https://github.com/ray-project/ray/pull/2420 is merged.
The inefficiency of removeTasksFromQueue has been a known issue since the very beginning. Once direct transfers between queues are implemented, this issue will be resolved.
@atumanov sure, thanks for the context.
Just to update this thread:
I made a simple fix by using a hash map + list to represent a local queue, and running the same script now takes around 40s (instead of 15+ minutes), which is comparable to the numbers on legacy Ray. I'll hold off on the fix until Alexey gets his changes in.
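Purely to illustrate the data-structure idea (the actual fix lives in the C++ node manager queues, not in Python), a minimal sketch of a queue that supports O(1) append, pop, and removal by task ID:

```python
from collections import OrderedDict


class TaskQueue(object):
    """Illustrative only: a FIFO queue keyed by task ID (hash map backed by a
    linked list), so removing an arbitrary task is O(1) instead of an O(n)
    scan of a plain list."""

    def __init__(self):
        self._tasks = OrderedDict()  # task_id -> task, in insertion order

    def append(self, task_id, task):
        self._tasks[task_id] = task

    def remove(self, task_id):
        # O(1) removal by key.
        return self._tasks.pop(task_id, None)

    def pop_front(self):
        # O(1) removal of the oldest task; returns (task_id, task).
        return self._tasks.popitem(last=False)
```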
@zhijunfu that's excellent! Thanks for the fix. There's no harm in opening a PR now, I think, but yes, you're right that there are a few other current PRs that touch similar parts of the code.
@robertnishihara sure, I just created a PR for this one.
https://github.com/ray-project/ray/pull/2438
BTW, just to clarify the problem a little bit. There are two issues here:
@robertnishihara @istoica @stephanie-wang
I have some updates on issue 2 above.
I made some changes to reduce unnecessary allocation/free of lineage entries, which brought the time down from 180s to 130s. I can send a PR for the change if there are no existing ongoing related PRs. There is still a big gap compared to the previous number (40s), and some additional effort will be needed to get it on par.
Here are the raylet hot callstacks captured in the two Docker containers; actor A runs in container 1, and actor B runs in container 2. The raw numbers were obtained with the perf command and converted to a graph with another tool. You'll need to enlarge it a few times to see it clearly. :)


@zhijunfu thanks, I'd be happy to review a PR for improving lineage allocation and deallocation!
For the serialization/deserialization overheads, I think we can improve it by rethinking some of the lineage stash algorithm. The workload you've been testing with causes actor A's node manager to send the entire stashed lineage of every task submitted to actor B to actor B's node manager. Even though the lineage gets periodically evicted, it's pretty clear that this will be worse than doing nothing at all, like in legacy Ray.
There is a lot of duplicate lineage that's getting sent with the current lineage stash algorithm. I think we can improve it by keeping track of which node managers have received which lineage so far. Then, when node manager A forwards a task for actor B to node manager B, it can send the message without any uncommitted lineage, since it knows that it already sent node manager B the same lineage with a previous message.
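A rough sketch of that bookkeeping idea (hypothetical class and method names, not the actual lineage stash code): each node manager remembers which lineage entries each peer has already received and only attaches the unseen delta when forwarding a task.

```python
class ForwardedLineageTracker(object):
    """Hypothetical bookkeeping, not the actual lineage stash: remember which
    lineage entries each peer node manager has already received so that only
    the unseen entries are attached to a forwarded task."""

    def __init__(self):
        self._sent = {}  # peer node manager ID -> set of task IDs already sent

    def lineage_to_forward(self, peer_id, uncommitted_lineage):
        already_sent = self._sent.setdefault(peer_id, set())
        # Only forward the entries this peer has not seen yet.
        delta = {task_id: entry
                 for task_id, entry in uncommitted_lineage.items()
                 if task_id not in already_sent}
        already_sent.update(delta)
        return delta
```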
Also, from running the script you sent me, I noticed that in xray, actor A only finishes pushing messages to actor B shortly before actor B finishes reading the messages. This is in contrast to legacy ray, where actor A finishes pushing soon (6s) after starting. I believe this is due to a known issue that we've had with asio where, for a single event handler on a socket, we're only handling one message at a time. This means that if there are multiple messages in the socket, the next message will get delayed until its event handler is processed. We should fix this by processing as many messages as possible from any type of socket, possibly with some upper bound on how much time we can spend in a single event handler. I think this will probably help with the slowdown that you're seeing as well.
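For illustration, the handler change described above could look roughly like this (sketched in Python, with a plain deque standing in for the socket's buffered messages; the real code is the C++/asio node manager):

```python
import time
from collections import deque


def drain_messages(buffered, handle_message, max_handler_seconds=0.010):
    """Illustrative handler body: process every message that is already
    buffered instead of exactly one, but stop after a time budget so a single
    handler invocation cannot starve the handlers for other sockets."""
    deadline = time.time() + max_handler_seconds
    while buffered and time.time() < deadline:
        handle_message(buffered.popleft())


# Toy usage: 1000 queued messages are handled in one invocation, not 1000.
queue = deque(range(1000))
drain_messages(queue, handle_message=lambda m: None)
```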
@stephanie-wang Thanks a lot for the explanations, appreciate that!
As a first step, here's the PR for the improvement I mentioned earlier, basically avoiding unnecessary allocation/free of lineage entries. Please let me know if you have any comments. :)
@zhijunfu Some results for perf.py, in which actor A calls actor B 100K times, and each call takes a single integer as an argument. The results are from my MacBook:
So things look reasonably promising.
@istoica Thanks, this is great! Yes, I saw that the PR for reducing lineage overhead is already under review. :-)
I've been looking into this issue for a bit and here is what I found so far:
On master on my laptop, the script runs in about 35s.
The sending actor finishes much earlier than the receiving actor, so the benchmark is bottlenecked on the receiving node for most of the job.
Profiling the receiving node shows that it spends most of its time in malloc for redis commands. The main redis commands are for adding to the TaskLeaseTable and adding to the TaskTable. After taking out writes to the former (which we write to at least as many times as to the TaskTable), there is no discernible difference in runtime.
The number of bytes available in the node manager-to-node manager socket blew up as more tasks were sent. This is because the receiving node manager is not reading out tasks fast enough, due to a known asio issue where a handler only processes 1 message at a time from a socket, the same issue that I referenced in my previous comment. A symptom of this was that some of the ForwardTask messages (which are written synchronously from one node manager to another) would take >100ms to complete. I modified the node manager to read out more ForwardTask messages per handler, but this did not affect the overall runtime. I do think we should incorporate this change into master, though. Another thing we could do to reduce the ForwardTask latency is to write the message asynchronously, but we would have to buffer the messages ourselves (I haven't implemented this fully, but I think it's worth doing).
From the above points, I decided that the runtime is mostly going towards the loop between the receiving node manager and actor, which is basically: the actor calls GetTask, the node manager assigns the actor a task, the actor executes it, and repeat. There is very little that I can see on the node manager side to optimize at first glance, but there may be places on the Python side that we can speed up.
Since throughput seemed to be bottlenecked by the Python actor, I upped the number of actors to see what the throughput would look like. With 4 pairs of sending and receiving actors, the job finishes in about 81s.
The overall throughput of the job is significantly lower than the stable throughput for the receiving actor. For 4 pairs of actors, the overall throughput is 400k/81 = ~5k tasks/s. However, after the first couple of rounds, each of the actors sends 10k tasks every 5s, which means a stable throughput of about 40k/5 = ~8k tasks/s. That means there's an unexplained delay of about 30s at the beginning of the job, when the sending actors are still submitting tasks. I haven't looked much into what's happening during this period yet, other than the note about not reading from the socket fast enough.
After investigating the Python side a bit more, I tried commenting out ray.put entirely so that each task on the receiving actor does not put any of its return values in the object store. Obviously this won't work as a general optimization, but it illustrates what the overhead of ray.put is. For a single sending and receiving actor, this increased the stable throughput on the receiving actor from 10k/3s to 10k/1.5s.
So in summary, the main performance issues that I've found so far are:
asio handlers need to process more than one event at a time, within some reasonable time limit.
ray.put adds significant overhead when there are many tiny tasks.
Thanks, Stephanie. Great investigation work!
When you use 4 pairs of actors, how many integers are you sending between each pair, still 100K?
Otherwise, can we check whether the throughput of ray.put() is less than 10K/sec? This seems quite low. Can we double-check this number with a simple experiment? Just put 100K or 1M integers.
Thanks,
Ion
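A simple way to run the check suggested above (a hedged sketch; ray.init() here just starts a local, single-node Ray, and timings will vary by machine):

```python
import time

import ray

ray.init()  # local, single-node Ray just for the measurement

n = 100000
start = time.time()
for i in range(n):
    ray.put(i)
elapsed = time.time() - start
print("ray.put of %d integers: %.1fs (%.0f puts/s)" % (n, elapsed, n / elapsed))
```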
Yes, each pair sends 100k.
Good point. On my laptop, without warming up the object store and without hugepages, ray.put of 100k integers takes 14.6s. This matches the overhead I was seeing when running the full actor benchmark.
@pcmoritz pointed out that Arrow serialization is very good for large numpy arrays, but won't be as efficient for objects like a single integer. Perhaps we should consider using something like the default pickle for small objects.
What would be the overhead using pickle in this case?
Ion
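One way to get a feel for that gap (a rough sketch, not a definitive benchmark; it compares plain pickling of a tiny object against the full ray.put path, which also includes the object store IPCs discussed below):

```python
import pickle
import time

import ray

ray.init()
n = 100000

start = time.time()
for i in range(n):
    pickle.dumps(i)  # plain pickle of a tiny object, no object store involved
pickle_time = time.time() - start

start = time.time()
for i in range(n):
    ray.put(i)  # full path: serialize plus create/seal in the object store
put_time = time.time() - start

print("pickle: %.2fs  ray.put: %.2fs" % (pickle_time, put_time))
```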
@stephanie-wang the issue might be the Arrow serialization, but it could also be the many IPCs that we do to the plasma store (at least 3: create/seal/release).
@robertnishihara good point. @stephanie-wang, can we test this by creating an empty object?
Another way to figure out the overhead is to serialize some data structure whose serialization is linear in its size, maybe an array. Then serialize this data structure at a few different sizes and fit a + b*size. We need to find "a", which is the fixed overhead.
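A sketch of that experiment (assuming numpy arrays as the linearly sized payload and timing ray.put; the least-squares intercept is the fixed per-put overhead "a"):

```python
import time

import numpy as np
import ray

ray.init()

sizes = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6, 10 ** 7]  # array lengths (float64)
times = []
for n in sizes:
    data = np.zeros(n)
    start = time.time()
    for _ in range(100):
        ray.put(data)
    times.append((time.time() - start) / 100)  # average seconds per put

# Fit time = a + b * nbytes; the intercept "a" is the fixed per-put overhead.
nbytes = np.array(sizes, dtype=float) * 8
b, a = np.polyfit(nbytes, np.array(times), 1)
print("fixed overhead a = %.1f us, marginal cost b = %.3f ns/byte"
      % (a * 1e6, b * 1e9))
```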
Auto-closing stale issue.