In the coming days/weeks we will hopefully be migrating kibana-ci to use dynamic workers, so we should be able to scale the number of active jobs far beyond the limits of our current Jenkins worker pool. To take full advantage of this we will be splitting up the current "selenium" and "x-pack" jobs with the goal of completing things much faster. To do that we need to define the groups that tests will execute in, aiming for groups that take roughly the same time to run. @LeeDr do you think you could help define these groups? I remember you had some timing data on the test runs. If you can help pick which tests run in each group, I would like to split things up into their own configs so we can try out the groups in the current jobs, track execution times, and make sure the groups work for us before defining the separate Jenkins jobs.
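For illustration, here's a minimal sketch of how we could turn that timing data into evenly sized groups by greedily assigning the slowest suites first (the suite names, timings, and helper names are placeholders, not anything that exists today):

```ts
// Hypothetical sketch: balance test suites into N CI groups by recorded runtime.
interface Suite {
  name: string;
  minutes: number; // rough duration from previous runs
}

function groupSuites(suites: Suite[], groupCount: number): Suite[][] {
  const groups = Array.from({ length: groupCount }, () => ({
    total: 0,
    suites: [] as Suite[],
  }));

  // Longest-processing-time-first: always drop the next-slowest suite into
  // whichever group currently has the smallest total runtime.
  for (const suite of [...suites].sort((a, b) => b.minutes - a.minutes)) {
    const shortest = groups.reduce((min, g) => (g.total < min.total ? g : min));
    shortest.suites.push(suite);
    shortest.total += suite.minutes;
  }

  return groups.map((g) => g.suites);
}

// e.g. groupSuites([{ name: 'dashboard', minutes: 24 }, /* ... */], 4)
```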
Once we have things split up we'll need to work with infra to define the actual jobs, which will probably just require a bunch of copy-paste on the jjb configurations.
I don't think we need to worry about splitting the intake job up right now, at least not until we get the other jobs down under 20 minutes apiece.
We also need to do this in a way that doesn't require backporting to every version of Kibana that might get updated from time to time. For PR builds on very old branches I plan for Jenkins to still look for the scripts we have today, so most of the jobs will quit right after cloning the repo, and three of them will execute the intake, selenium, and (sometimes) x-pack scripts that exist in those branches.
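Roughly what I have in mind for each job's entry point, just as a sketch (the CI_GROUP variable and script path are placeholders, not the final names):

```ts
// Hypothetical per-job entry point: old branches won't contain the new
// per-group scripts, so those jobs exit cleanly right after cloning.
import { existsSync } from 'fs';
import { execSync } from 'child_process';

const group = process.env.CI_GROUP || 'group1'; // placeholder group name
const groupScript = `test/scripts/jenkins_${group}.sh`; // placeholder path

if (!existsSync(groupScript)) {
  console.log(`branch has no ${groupScript}, nothing for this job to do`);
  process.exit(0);
}

execSync(`bash ${groupScript}`, { stdio: 'inherit' });
```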
Yes, I'd be happy to take a stab at it. I think one logical break would be to split the dashboard tests out from the rest of the base selenium tests, since they take about 24 minutes by themselves. I think it would also be nice for the sharing team to have one distinct job for their tests so they can focus on its stability.
I'll look into it more, take a look at the x-pack tests as well, and come up with a plan to try.
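As a concrete (hypothetical) example of what one of those split-out configs could look like, a dashboard-only functional test config might just extend the base config and narrow the test files; the file paths here are illustrative:

```ts
// test/functional/config.dashboard.js (hypothetical): reuse the base functional
// config but only run the dashboard suite so it can be its own CI job.
export default async function ({ readConfigFile }) {
  const baseConfig = await readConfigFile(require.resolve('./config.js'));

  return {
    ...baseConfig.getAll(),
    testFiles: [require.resolve('./apps/dashboard')],
  };
}
```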
@Crazybus
jin.mu [21:59]
@spalger so the next step is splitting the jobs and coordinating with @crazybus to decide how many dynamic workers we need, right?
The worker limit per cluster is set in the gobld config here. This limit mainly exists to avoid the situation we ran into when using a Jenkins plugin for this, which spawned 5000 workers before I noticed. The other reason it is a good idea is that we don't want to overload the Jenkins master. The tricky part is that not all jobs are equal: 100 jobs that aren't outputting many logs are no problem, but 100 jobs constantly streaming debug logs could potentially overwhelm the master.
So this is another great example of "it depends", and the only way to find the real limit is to try it out. When you guys are ready and have a rough idea of how many concurrent workers you would like, just let me know the number and we can try it out. We are also able to give the Jenkins master more horsepower to handle the load (this is the official recommended way to "scale" Jenkins 🤦♂️). Changing the limit up and down takes seconds, so we can easily roll back if it turns out 1000 workers is a bad idea :P.
P.S. I'm super excited to see you guys really using this to its full potential. I haven't written up the docs for the workflow just yet, but it is already possible to run a Jenkins master locally (in Docker) with dynamic workers in the cloud. This is going to make testing job changes and execution a lot easier (set up in a minute, and no need to run workers locally!).