There seems to be a CDN issue around releasing new packages. The Ingest manager is able to identify that a newer package exists but when it goes to retrieve the package the registry returns a 404.
I'm only able to reproduce this using the internal shared SIEM development server: https://kibana.siem.estc.dev/
Steps to reproduce:
/setup request which initiates an endpoint package upgrade
My proposal is to only block the ingest manager UI on a /setup failure that is from a normal package install and not an upgrade. In the upgrade scenario I think we could quietly log the error and move on. This is what the Security app does as well.
If this is the first time that the endpoint package is being installed then it must be a fresh install and the ingest manager UI should block until the endpoint package can correctly be installed.
Long term we should figure out why the ingest manager is able to see that a new package is available but unable to download the package.
Pinging @elastic/ingest-management (Team:Ingest Management)
I think a simple solution would be to add the oldVersion and newVersion info to the bulk error response and then here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/ingest_manager/server/services/epm/packages/install.ts#L85 we can also check if this is an upgrade. If it is an upgrade then log the error but don't throw an error. If it is an install then throw the error.
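A rough sketch of that check, to make the proposal concrete. The `BulkInstallError` shape and the `handleSetupErrors` helper are hypothetical names for illustration, not the actual types in `install.ts`; the only assumption is that the bulk error response carries `oldVersion` and `newVersion` as suggested above.

```typescript
// Hypothetical shape of one entry in the bulk error response.
// Assumes oldVersion is null when the package was never installed before.
interface BulkInstallError {
  name: string;
  error: Error;
  oldVersion: string | null;
  newVersion: string;
}

// An entry is an upgrade if a different version was already installed.
function isUpgrade(e: BulkInstallError): boolean {
  return e.oldVersion !== null && e.oldVersion !== e.newVersion;
}

// Log failed upgrades and keep going; throw for failed fresh installs.
function handleSetupErrors(
  errors: BulkInstallError[],
  log: (message: string) => void
): void {
  let firstInstallError: Error | undefined;
  for (const e of errors) {
    if (isUpgrade(e)) {
      log(
        `Upgrade of ${e.name} from ${e.oldVersion} to ${e.newVersion} failed: ${e.error.message}`
      );
    } else if (firstInstallError === undefined) {
      firstInstallError = e.error;
    }
  }
  // Only a failed fresh install should surface as a /setup failure.
  if (firstInstallError !== undefined) {
    throw firstInstallError;
  }
}
```

Collecting the first install error and throwing after the loop means every failed upgrade still gets logged, even when a fresh install in the same batch also failed.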
@neptunian would be nice to have your opinion on this?
@ruflin or @mtojek Any idea about the CDN?
Hmm... we need to review the deployment model. Traffic may be randomly distributed across instances with and without the newly updated packages. Caching time is around 10 minutes, which may prolong this period.
@jonathan-buttner I will look at this tomorrow/Wednesday
@jfsiii I'm working on a PR for the Kibana side. I'll push up the draft. I'm running into an unhandled promise reject and could use your help if you have time.
@jonathan-buttner sure thing. I can look at whatever you have and/or ping you in the AM
Here's what I have so far: https://github.com/elastic/kibana/pull/79791
@jonathan-buttner Why differentiate between install and upgrade as to whether to not block? They could both be completely broken in either case so I'm not sure why the scenario would be treated differently.
@ph I talked to @jonathan-buttner a bit about this earlier. What do you think about him and me working on this to a) unblock the UI and b) deal with the unhandled rejection he discovered while refactoring?
(b) isn't happening in main now, but the refactored code path seems to have uncovered one
Why differentiate between install and upgrade as to whether to not block? They could both be completely broken in either case so I'm not sure why the scenario is treated differently.
cc @ruflin & @mostlyjason for thoughts on how to deal with failed upgrades/install during /setup
@ph I talked to @jonathan-buttner a bit about this earlier. What do you think about him and me working on this to a) unblock the UI and b) deal with the unhandled rejection he discovered while refactoring?
(b) isn't happening in main now, but the refactored code path seems to have uncovered one
sounds good to me
@jonathan-buttner Why differentiate between install and upgrade as to whether to not block? They could both be completely broken in either case so I'm not sure why the scenario would be treated differently.
My thinking is that if a default package has already been successfully installed and an upgrade triggered by /setup fails, we could roll back to the previous working package version. That way, if we release a bad package or the registry is in some weird state, we can avoid blocking the ingest UI when the user already has a working version of the default packages installed.
That makes sense. I wasn't sure if we were already rolling back upgrades for failed upgrades in setup. So I agree with the condition you posted in your PR:
If an upgrade occurs and fails, and the rollback also fails in the handleInstallPackageFailure function, then it will block the UI with the original upgrade failure message. My thinking here is that if the rollback fails, the package will be in a broken, unusable state, so we display the blocking message.
One thing that concerns me is the user being stuck in a state where they cannot upgrade packages. We should probably communicate that to them in some way. The registry 404 issue is harmless to not communicate since it will resolve eventually, but all the unknown ones that aren't temporary are probably going to persist on each upgrade attempt.
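To make the agreed condition concrete, here is a minimal sketch of the decision flow. It is not the code in the PR or the real `handleInstallPackageFailure`; `upgradeOrRollBack`, `InstallFn`, and `SetupOutcome` are hypothetical names, and the installer is injected so the sketch stays self-contained.

```typescript
// Hypothetical installer signature; rejects when the install fails.
type InstallFn = (name: string, version: string) => Promise<void>;

// Hypothetical result: whether /setup should block the UI, and with what message.
interface SetupOutcome {
  blockUi: boolean;
  message?: string;
}

async function upgradeOrRollBack(
  name: string,
  previousVersion: string | undefined, // undefined means a fresh install
  newVersion: string,
  install: InstallFn,
  log: (message: string) => void
): Promise<SetupOutcome> {
  try {
    await install(name, newVersion);
    return { blockUi: false };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    if (previousVersion === undefined) {
      // Fresh install failed: there is no working version, so block.
      return { blockUi: true, message };
    }
    // Upgrade failed: log it so the failure is not silent, then roll back.
    log(`Upgrade of ${name} to ${newVersion} failed: ${message}`);
    try {
      await install(name, previousVersion);
      return { blockUi: false };
    } catch {
      // Rollback failed too: block with the original upgrade failure message.
      return { blockUi: true, message };
    }
  }
}
```

The log call is the only place a persistent upgrade failure becomes visible in this sketch, which is why some user-facing signal may still be needed on top of it.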
Other issues that I found while working on a WIP for this one:
https://github.com/elastic/kibana/issues/80022
https://github.com/elastic/kibana/issues/80031
https://github.com/elastic/kibana/issues/80032
Would be nice to reduce the blast radius of a failure like this. If one package fails to upgrade, it'd be a bummer to block the entire Fleet experience since other packages may be operating normally. There are a few strategies we could use:
@mukeshelastic I wonder if we need an effort around Fleet / Integrations observability similar to agent observability? If there is a failed package install, where do users see that? It seems like it should be in the logs so they can see the error message, when the issue started, and when it resolved.
@jfsiii It slipped through the cracks. +1 for you to look into this with @jonathan-buttner
also @nnamdifrankie FYI