Orleans: Orleans Client Tuning

Created on 4 Apr 2016 · 19 Comments · Source: dotnet/orleans

Hi,

We have been benchmarking an Orleans client and cluster deployed on an Azure Virtual Network. A lot of code changes have been made to optimize overall performance. Right now, it seems the Orleans client's calls to the silo are taking most of the time. We are using Azure Table storage for clustering, so every client talks to each active silo over a single TCP connection. We suspect there is some queuing or a slow network between client and silo causing those delays, but we need to dig deeper to find the exact cause.

We were wondering if anyone has tuned the Orleans Client to improve performance and can share the experience.

Thanks,
Chunsheng

question

Most helpful comment

To stress what Sergey wrote: use server GC and disable concurrent GC for both silos and clients. Try 1000 actors per silo.

All 19 comments

@simonycs For a comparison, have you tried deploying the client within the same hosted service to see if there's any difference in latency? Just to rule out the possibility that connecting through a VNet is slowing things down.

We were wondering if anyone has tuned the Orleans Client to improve performance and can share the experience.

We have load tests that run nightly, and the Orleans client is part of those, so there has been optimization work done on the client (though there is always room for more). As far as tuning goes, some message settings exist that can affect performance (see IMessagingConfiguration), but those tuning knobs rarely need to be used, so I can't suggest tinkering with them at this point.
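For reference, a minimal sketch of what adjusting one of those knobs looks like on the client (Orleans 1.x style; the config file name is a placeholder, and ResponseTimeout is just one of the IMessagingConfiguration settings — check your version for the full set):

```csharp
using System;
using Orleans;
using Orleans.Runtime.Configuration;

class Program
{
    static void Main()
    {
        // Load the client config and tweak a messaging knob before connecting.
        // "ClientConfiguration.xml" is a placeholder file name.
        var config = ClientConfiguration.LoadFromFile("ClientConfiguration.xml");
        config.ResponseTimeout = TimeSpan.FromSeconds(30); // per-request timeout
        GrainClient.Initialize(config);
    }
}
```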

"Right now, it seems the Orleans Client call to the Silo is taking most of the time."

Can you provide more details on this please:

  • As Sergey alluded to, the performance of networking over a VNet is relevant. Do you have any measurements of this?
  • "Orleans Client call to the Silo is taking most of the time" - How much time?
  • When the Orleans client makes a call to the silo, I assume it's making the call to one of your grains. Do you know how much time is being spent in the grain call itself?
  • Do you have rough numbers for how many grain calls per second your service is expecting?

Just to share my experience: running client and silo locally but in different processes, I get around 1800 calls/s.

Running the client and silo on the same host is strongly discouraged for performance reasons, and definitely not suitable for performance measurements. In addition, you need multiple clients to stress a silo's performance; one client is simply not enough. The rule of thumb is 1 client per 2 silo cores: 4 clients for an 8-core silo.

Thanks Gabi. I'll do tests in a distributed scenario soon. The local test was just to have an idea of the lowest latency (no network).

You are welcome. The round-trip latency under light/no load is half a millisecond to one millisecond. The first call to a grain takes a bit longer, since it involves activating the grain, but from the second call onward it's 1/2 to 1 millisecond.
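A rough probe of that first-call vs. steady-state difference might look like this (inside an async method after GrainClient.Initialize; IPingGrain is a hypothetical one-method grain, and numbers will vary with load):

```csharp
// IPingGrain is assumed to declare a single method: Task Ping().
var grain = GrainClient.GrainFactory.GetGrain<IPingGrain>(0);

var sw = System.Diagnostics.Stopwatch.StartNew();
await grain.Ping(); // first call: includes grain activation
Console.WriteLine($"first call:   {sw.Elapsed.TotalMilliseconds:F2} ms");

sw.Restart();
const int N = 1000;
for (int i = 0; i < N; i++)
    await grain.Ping(); // sequential calls measure the steady-state round trip
Console.WriteLine($"steady state: {sw.Elapsed.TotalMilliseconds / N:F3} ms/call");
```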

OK, now with 12 clients connected to the same silo I get 7600 calls/s, much nicer. And I'm still calling a single actor, so maybe that's the bottleneck now. I'll change my benchmark to spread the calls across more actors and gather the statistics from a "parent" actor. Just to be clear: my benchmark is not related to Chunsheng's; my interest is to understand the networking layer and compare it to other networking stacks. Thanks a lot!

Obviously, the throughput is directly related to the number of actors. Every actor is single-threaded, so you are not utilizing all cores. Watch the server CPU until it is fully maxed out.
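A sketch of spreading the same load across many grain identities (inside an async method; ICounterGrain is the hypothetical benchmark grain from this thread — distinct keys map to distinct activations, which the scheduler can run on different cores):

```csharp
// Fan the calls out over many grain identities so the silo can execute them
// in parallel; a single grain would serialize them all onto one activation.
const int grainCount = 1000;
const int totalCalls = 100000;

var tasks = new List<Task>(totalCalls);
for (int i = 0; i < totalCalls; i++)
{
    var g = GrainClient.GrainFactory.GetGrain<ICounterGrain>(i % grainCount);
    tasks.Add(g.Increment());
}
await Task.WhenAll(tasks);
```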

By networking I assume you mean messaging: the RPC layer. The lower-level networking layer that deals with sockets is a small part of the overall overhead and latency. The end-to-end round-trip RPC, with serialization and actor call invocation, is the main contributor to latency.

Thank you, Gabi! The actor I am using for this benchmark just increments a counter in the method call and registers a 5-second timer to gather the number of calls per second. I will change the benchmark to aggregate several of these actors now. In any case, single-actor performance is something I want to evaluate too and compare to other "reliable" RPC systems like Service Fabric's Reliable Actors and TAO's FT-CORBA. Thank you for your explanations; I appreciate them a lot.
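For concreteness, a grain along the lines just described might look like this (Orleans 1.x API; ICounterGrain and the logging format are assumptions):

```csharp
using System;
using System.Threading.Tasks;
using Orleans;

public interface ICounterGrain : IGrainWithIntegerKey
{
    Task Increment();
}

// Hypothetical benchmark grain: counts calls and logs a calls/s rate every
// 5 seconds. Grain turns are single-threaded, so no locking is needed.
public class CounterGrain : Grain, ICounterGrain
{
    private long count;

    public override Task OnActivateAsync()
    {
        RegisterTimer(Report, null, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));
        return base.OnActivateAsync();
    }

    public Task Increment()
    {
        count++;
        return TaskDone.Done; // Orleans 1.x completed-task helper
    }

    private Task Report(object _)
    {
        Console.WriteLine($"{count / 5.0:F0} calls/s");
        count = 0;
        return TaskDone.Done;
    }
}
```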

Thank you Cesar. I will be interested to see the comparison numbers, especially with SF Actors.

I would love to make this kind of benchmark for Orleans: http://www.scylladb.com/technology/cassandra-vs-scylla-benchmark-2/

Yes, that would be nice. Especially since Orleans shares a lot of fundamental runtime design with Scylla: one thread per core, demultiplexed OS resources everywhere, careful memory usage, ...

Just a preview: using one actor per client, with 18 clients I reached 10,000 calls/s. Sometimes it drops to 8000 calls/s and climbs back to 10k calls/s. The single silo is using ~86% CPU, and its network ~47 Mbps (a 1 Gbps LAN, so bandwidth usage is below 5%). I could not compare to SF yet because I must set up a real cluster. I would like to set up a 5-silo cluster to compare with a 5-node SF cluster.

I suspect one actor per client is too few. Also, make sure you configure ServerGC for all processes, silos and clients.

@cmello
10k/sec still seems a bit low.

For a point of reference, from one of last night's load tests we got around 26k/sec per silo.
The test involved 16 clients, 16 silos, and randomly placed reentrant grains using random grain IDs (so there's no telling how many grains were active).
With this setup, the cluster executed ~430k grain calls per second, or around 26k/s per silo.
This test differs from yours in the number of grains and the reentrancy, but it should help provide a performance target for your testing.
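The reentrancy mentioned above matters: a grain marked [Reentrant] lets the scheduler interleave new requests while earlier ones are awaiting, instead of queuing them strictly one at a time. A minimal sketch (IEchoGrain is hypothetical):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

public interface IEchoGrain : IGrainWithGuidKey
{
    Task<string> Echo(string message);
}

// [Reentrant] allows requests to interleave on a single activation, which is
// part of why the load test above can sustain a higher per-silo call rate.
[Reentrant]
public class EchoGrain : Grain, IEchoGrain
{
    public Task<string> Echo(string message) => Task.FromResult(message);
}
```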

In any case, we're very interested in how Orleans performance compares with other platforms, so thank you for putting time into this and sharing your findings.

To stress what Sergey wrote: use server GC and disable concurrent GC for both silos and clients. Try 1000 actors per silo.
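For .NET Framework processes (the relevant runtime in this era), those two GC switches correspond to the standard runtime settings in each process's app.config — the same snippet applies to silo and client hosts alike:

```xml
<configuration>
  <runtime>
    <gcServer enabled="true" />      <!-- use the server GC -->
    <gcConcurrent enabled="false" /> <!-- disable concurrent/background GC -->
  </runtime>
</configuration>
```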

@simonycs,

Just to drop my 2 cents here... I have Orleans running on a (very) complex networking setup on Azure, using many of the Azure SDN components, and I have no problem at all with increased latency on our clusters. At the very beginning of our cluster configuration we had a deployment running 2 roles on the same VNet; for obvious reasons that could not stay that way for long, and I thought that segmenting our network and using the other Azure SDN components would introduce big latency into our pipeline... I was never so wrong. If you correctly configure the VNet, subnets, Network Security Groups, Application Gateway and load balancers (if any), and even VPNs, you will not suffer any impact as long as your VNet(s) remain in the same DC/Azure zone. If you are building cross-zone clusters based on VPNs etc., you will end up on the public network, which will indeed introduce latency regardless of whether you are using the InfiniBand features of Azure. For cross-zone/geo-distributed clusters, I would recommend you wait a bit until they are fully supported inside Orleans; although you can easily do it with a regular VPN, I wouldn't recommend that yet.

@cmello,

Nice to see that you are making progress on your benchmark. Please share the results of SkyNet when you can! :P Also, I remember someone from the core team saying that a single grain activation was supposed to handle 10k requests per second on ordinary hardware...

@jason-bragg May I ask how many CPU cores each silo had, for the 26k/sec per silo number?

@luomai, 8 cores.
