Description :
I am executing the locust tests under distributed mode in a Kubernetes cluster. At the moment I am testing only 1 master - 1 slave configuration. At end of the tests, it is found that locust master fails to acknowledge the last stats from slave. On completion, the Locust master prints the following statement "Time limit reached. Stopping Locust." and sends a quit signal to the slave/s with a wait time of 0.5 seconds to get reports from slaves. This is described as part of below code
main.py
` def spawn_run_time_limit_greenlet():
logger.info("Run time limit set to %s seconds" % options.run_time)
def timelimit_stop():
logger.info("Time limit reached. Stopping Locust.")
runners.locust_runner.quit()
gevent.spawn_later(options.run_time, timelimit_stop)`
runners.py
def quit(self):
for client in self.clients.all:
self.server.send_to_client(Message("quit", None, client.id))
gevent.sleep(0.5) # wait for final stats report from all slaves
self.greenlet.kill(block=True)
However in most of the cases it is found that by the time slaves sends the report to master, its shutdown is already invoked stopping it from receiving any stats. Here is one example of the locust master-slave logs
As per master logs (as shown below), the signal was sent around 2020-06-13 21:04:13,968 and then it waited for exactly 0,5 seconds and the shutdown process began at 2020-06-13 21:04:14,469
[2020-06-13 21:04:13,968] locust-master-1-n9qxq/INFO/locust.main: Time limit reached. Stopping Locust.
[2020-06-13 21:04:14,469] locust-master-1-n9qxq/INFO/locust.main: Shutting down (exit code 1), bye.
[2020-06-13 21:04:14,470] locust-master-1-n9qxq/INFO/locust.main: Cleaning up runner...
[2020-06-13 21:04:14,971] locust-master-1-n9qxq/INFO/locust.main: Running teardowns...
As per the slave logs (as shown below), it received the message from master only at [2020-06-13 21:04:15,727] and by this time the master was already in shutdown process.
[2020-06-13 21:04:15,727] locust-slave-1-jqpq8/INFO/locust.runners: Got quit message from master, shutting down...
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Shutting down (exit code 0), bye.
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Cleaning up runner...
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Running teardowns...
As you can see, the master didn't rely on "acknowledgment signal from slave" rather it was depending on a wait time of 0.5 seconds which should not be the said criteria. This has resulted in loss of data wherein the master doesn't have the entire information of stats from the slave. So the aggregate results at master is less than the total requests tested by slave.
The master should wait for the slave to send back the acknowledgement signal and the part of keeping time as the constant should be removed.
Hi! Can you do this consistently? What about if you run it on a single machine?
I've had no issues getting the last samples since that sleep was introduced. The size of the sleep is designed to account for any latency between master & slave, but nothing more, so if it takes a full second or so before the slaves get the message to shut down then it will not work. Are you doing anything particular in your locustfile that might block the slaves from receiving the message for that long?
@cyberw Hi, Thank you for the response. Correct me if I am wrong but by single machine do you mean running both master and slave on a single host? In that case, there is no issue at all the request-response between slave and master is quite fast.
However this issue seems to be coming up (like 7 out of 10 cases) where the master and slave are spinned as separate pods in the "same" Kubernetes namespace. So definitely time should not be a criteria to account for latency rather master should have a different mechanism to poll the response from slave, only then it should go to shutdown.
I haven't added anything else apart from 2 API calls in locust file. It looks something like this
@seq_task(2)
def sample_task_one(self):
<Call to an endpoint and capture results>
@seq_task(1)
def sample_task_two(self):
<Call to an endpoint and capture results>
class UserBehavior(TaskSequence):
tasks = {sample_task_one: 1, sample_task_two: 1}
class WebsiteUser(HttpLocust):
task_set = UserBehavior
wait_time = between(5.0, 10.0)
Correct me if I am wrong but by single machine do you mean running both master and slave on a single host? In that case, there is no issue at all the request-response between slave and master is quite fast.
Yes, that was what I meant.
I haven't added anything else apart from 2 API calls in locust file. It looks something like this
Looks like your test is very "normal" (assuming your endpoint calls are just HttpUser requests and "capture result" doesnt actually do anything blocking)
When I added this sleep (quite recently actually, until then we were always dropping the last few requests on the workers :-/ ), I first tried to do it a more "safe" way, but ended up deadlocking somehow.
If you do have >0.5s latency between master and slave (introduced by k8s, underlying hardware, something else) then I'm afraid losing the last results is expected and unlikely to be fixed any time soon - unless you fix it yourself and make a PR :)
I guess for your purposes you could just fork / monkey patch it to be 5 seconds instead.
@cyberw - is there any common way to inject config changes in Locust? Like prepare an ini/json config file that would override defaults in any way? Instead of doing monkey patching this possibly would help in this particular problem.
All of the "normal" configuration can be set using command line, environment variables or config file https://docs.locust.io/en/latest/configuration.html
Settings that havent been made configurable can, of course, not be configured using config file, which is why you're stuck with forking or monkey patching.
For this particular problem I think we should probably take another look at fixing the code (waiting for the last result set until it arrives instead of waiting for a set time) instead of making the wait configurable. But I dont have time, so someone else will have to do it.