When we look at the runtime performance of screenshot capture, we can see that the predominant part of the timeline is consumed by waiting for Kibana to load. During this time, the browser requests and parses very large bundles of uncached scripts and images.
This issue proposes that the runtime can be sped up by keeping a single Chromium process running for all reports on a Kibana instance. Loading the bundles from cache should greatly reduce the time it takes for the Reporting browser to open the capture URL.
The start method of the Reporting Plugin would have a step that launches a puppeteer browser driver instance. When a job for a report comes in, the screenshot pipeline would open a Kibana page as a new page of the Chromium application. When the screenshot is done, the page closes but the application keeps running.
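Roughly, that lifecycle could look like the sketch below, assuming puppeteer; the `startBrowser` and `captureScreenshot` names are hypothetical, not the actual Reporting code:

```ts
import puppeteer, { Browser } from 'puppeteer';

let browser: Browser;

// Launched once from the Reporting plugin's start method.
export async function startBrowser(): Promise<void> {
  browser = await puppeteer.launch({ headless: true });
}

// Called per report job: open a new page, capture, close only the page.
export async function captureScreenshot(url: string): Promise<Buffer> {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0' });
    return (await page.screenshot({ fullPage: true })) as Buffer;
  } finally {
    // The page closes, but the Chromium application keeps running.
    await page.close();
  }
}
```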
More details about how this change would be implemented depend on how the following concerns are resolved:
Some cached browser data is potentially sensitive: cookie files should not be cached.
An initial idea to solve this was to use incognito windows to run the report job. However, with incognito windows we wouldn't cache the static files, so there wouldn't be a great performance benefit. Perhaps a more realistic proposal is to use a non-incognito browser to make an initial request to the Kibana app, have the bundles cached in Chromium's data directory, and then use incognito windows for the Reporting jobs. Presumably the cached bundles in the data directory would be used even though the page is running in an incognito window.
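A minimal sketch of that flow, assuming a puppeteer version that supports `createIncognitoBrowserContext`; `runReport` is a hypothetical helper:

```ts
import puppeteer from 'puppeteer';

async function runReport(kibanaUrl: string, captureUrl: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });

  // Warm the cache with a regular (non-incognito) page so the static
  // bundles land in Chromium's data directory.
  const warmup = await browser.newPage();
  await warmup.goto(kibanaUrl, { waitUntil: 'networkidle0' });
  await warmup.close();

  // Run the report job in an incognito context so cookies and session
  // state stay isolated. Whether the incognito page actually reads the
  // cached bundles is the open question in this proposal.
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  await page.goto(captureUrl, { waitUntil: 'networkidle0' });
  await page.screenshot({ path: 'report.png' });
  await context.close();
}
```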
The downside of the "more realistic proposal" is:
Caching Kibana scripts across screenshot reports means we would not be able to keep the current preventive measure of deleting the Chromium data directory after each job. If the jobs themselves run in incognito windows, though, we don't need to worry about the data directory storing sensitive information.
If the data directory doesn't get deleted after each job, we should monitor the size in bytes of the directory contents to make sure it isn't growing without bound. If it's normal for the data directory to grow over time, we should put controls in place to periodically delete it. We don't know right now whether deleting the data directory would require us to restart the Chromium process.
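If we go that way, a simple Node sketch for measuring the directory size could look like this (`directorySizeBytes` is a hypothetical helper, not existing Reporting code):

```ts
import { promises as fs } from 'fs';
import path from 'path';

// Recursively sum the size in bytes of everything under the Chromium
// data directory, so we can alert or trigger cleanup when it grows.
async function directorySizeBytes(dir: string): Promise<number> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  let total = 0;
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      total += await directorySizeBytes(fullPath);
    } else if (entry.isFile()) {
      total += (await fs.stat(fullPath)).size;
    }
  }
  return total;
}
```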
We should discuss this on the issue to reach agreement on how to move forward.
ping @elastic/kibana-canvas @elastic/kibana-security
When a user logs out of Kibana, the sid cookie is deleted and the session storage is cleared. If we're comfortable with that fully logging the user out of Kibana, I can't think of anything else we should be paranoid about.
Perhaps we should be clearing local storage as well, as that's where Console history is currently stored?
We can also delete all cookies using the Chrome DevTools API https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-clearBrowserCookies instead of just deleting the sid cookie.
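With puppeteer, that could be a small CDP call, something like the following (`clearAllCookies` is a hypothetical helper):

```ts
import { Page } from 'puppeteer';

// Clear cookies for all domains via the DevTools protocol, rather than
// deleting only the sid cookie.
async function clearAllCookies(page: Page): Promise<void> {
  const client = await page.target().createCDPSession();
  await client.send('Network.clearBrowserCookies');
  await client.detach();
}
```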
> When a user logs out of Kibana, the sid cookie is deleted and the session storage is cleared. If we're comfortable with that fully logging the user out of Kibana, I can't think of anything else we should be paranoid about.
> Perhaps we should be clearing local storage as well, as that's where Console history is currently stored?
Clearing both local and session storage is probably a good idea (in addition to the sid cookie).
Will jobs run concurrently or sequentially on a single Kibana instance?
@legrego right now, the jobs run sequentially and I expect we would want to keep that behavior the same to try to make one small change at a time.
To be really clear, most of the time seems to be spent waiting for the Loading page to finish.
It sounds to me like using a Chrome DevTools API would be the best way to clear any session data, because it would clear for all domains. It's too bad there isn't something exactly like that, but for localStorage.
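One per-origin workaround would be to clear storage from the page itself; `clearStorageForCurrentOrigin` is a hypothetical helper, sketched under the assumption that we're fine running it on each origin we visit:

```ts
import { Page } from 'puppeteer';

// Unlike Network.clearBrowserCookies, this only covers the origin the
// page is currently on, so it would need to run once per Kibana origin.
async function clearStorageForCurrentOrigin(page: Page): Promise<void> {
  await page.evaluate(() => {
    window.localStorage.clear();
    window.sessionStorage.clear();
  });
}
```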
Let's pick this up again. Today in the Reporting sync, the idea of enabling Dev Tools came up as a possible way to capture process statistics from Chromium, such as heap size. That idea matches this one, since it's also about enabling Dev Tools. One concern came up: enabling Dev Tools all the time has a performance overhead, and some features such as caching might be disabled.
@joelgriffith did I capture the conversation correctly?
I started looking at the Reporting code and at how to structure some of these changes. If we're concerned about re-using the browser object for all jobs (or about the Dev Tools performance overhead of doing so), and the goal is to speed up reporting through page reuse, would it make sense for a first pass to just re-use the browser and page objects within a single job?
So when a job executes, the browser is launched, a blank page is opened, and then each URL that needs to be screenshotted reuses that same page.
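A rough sketch of that shape, assuming puppeteer (`captureJob` is a hypothetical helper):

```ts
import { Browser } from 'puppeteer';

// Reuse one page for every URL in a job so later loads can hit whatever
// the first load cached; the page (not the browser) closes at the end.
async function captureJob(browser: Browser, urls: string[]): Promise<Buffer[]> {
  const page = await browser.newPage();
  const screenshots: Buffer[] = [];
  try {
    for (const url of urls) {
      await page.goto(url, { waitUntil: 'networkidle0' });
      screenshots.push((await page.screenshot()) as Buffer);
    }
  } finally {
    await page.close();
  }
  return screenshots;
}
```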
If I'm understanding correctly, the downside here is that the browser won't get to cache Kibana between jobs. However, a multi-URL job could still benefit from the speed-up of reusing a page (and caching between Kibana URLs). Could be a good baby step, with the next step being re-use across jobs.
Thoughts on this? Still new to reporting code + puppeteer so it's entirely possible (and likely) that I'm missing something or thinking about this incorrectly :)
So the problem I found in re-using Chromium was that Kibana doesn't really cache assets well enough (at least in dev; I didn't try Docker or other builds of the app). Even when re-using the same browser and page, a hard reload still took ~14 seconds, and the job itself takes about 23-26 seconds to complete.
Canvas would probably benefit the most, since (I'm guessing here) there won't necessarily be a hard reload in between workpad renders. If so, the perf benefits would be good, but not as a general across-the-board performance improvement for reporting in Kibana.
Right, yeah, if we use the same page to go to the different workpad pages we can avoid those long reload periods. If we try to reuse the same page for the whole job, should I make this feature off by default and maybe have a flag passed to enable it when the job is enqueued?
> should I make this feature off by default and maybe have a flag passed to enable it when the job is enqueued
I don't think that will be necessary, and I could see that it would make testing harder, as we'd have more configurations to test with. I also think there is enough benefit to starting out just trying to improve the performance for multi-page reports, and I'm all for making baby steps.
I'd be happy to pair with you on finding where the changes in Reporting code should go. I know that the way we "kill" the process after the job is done is hard to understand. I keep wanting to find the part of the code where we call page.close() or browser.close(), but we don't do that anywhere.