Cilium master during v1.10 dev cycle (but likely affects v1.9.0).
I started Cilium under microk8s and hit this warning:
level=warning msg="Failed to send policy update as monitor notification" error="monitor agent is not set up" policyAddRequest=7ae059f5-2dfc-11eb-936b-54e1ad14ebc0 policyRevision=2 subsys=daemon
This looks like a race condition between policy processing and monitor server startup. It is likely a red herring, but users who observe it may think it is meaningful when it is not.
I suspect this was introduced by commit d543d418bde3ad55e8759e9f28941f830bfc94bb (/cc @gandro), as previously we didn't check the error in Daemon.policyDelete(...).
To be clear, I don't believe this is a real issue; my main concern is that it pollutes the logs with a warning when everything otherwise appears to be fine. Fixing it will likely involve digging into the daemon startup logic and deciding whether to conditionally skip sending events like this, queue them somehow, or force policy handling to wait on monitor startup (probably not ideal).
> I suspect this was introduced by commit d543d41 (/cc @gandro), as previously we didn't check the error in Daemon.policyDelete(...).
Yes, looking at the changes I made around that time, that seems like a correct assessment: my commit made this into a warning (whereas before, the error was silently ignored). I think we could probably extract the fmt.Errorf("monitor agent is not set up") into an ErrMonitorNotReady variable or something and conditionally ignore it.
To me, the overall question is whether this is something that we should only handle in the policy update case, or whether we should generally either drop or queue any agent notifications if the agent is not yet set up.
Extracting the error seems like an easy mitigation, :+1:
Regarding the latter question, queueing seems like it would be a bit more robust and provide better introspection during agent restart. Agent restart tends to be one of the more sensitive times for ensuring visibility since things can go wrong (e.g. during upgrade), but it's also a time that's easy to neglect given how ad-hoc a lot of the daemon initialization logic is today.
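As a rough sketch of the queueing idea (not a design proposal): notifications are buffered up to a fixed limit until the monitor is ready, then flushed in order. bufferedNotifier, Notify and SetReady are hypothetical names, and dropping the newest events once the buffer is full is just one possible policy.

```go
package main

import (
	"fmt"
	"sync"
)

// bufferedNotifier queues notifications until a sender is registered.
// Hypothetical type for illustration only, not part of Cilium.
type bufferedNotifier struct {
	mu      sync.Mutex
	send    func(string) // nil until the monitor is ready
	pending []string     // events received before the monitor was ready
	limit   int          // maximum number of queued events
	dropped int          // events discarded because the queue was full
}

func newBufferedNotifier(limit int) *bufferedNotifier {
	return &bufferedNotifier{limit: limit}
}

// Notify delivers msg immediately if the monitor is ready, otherwise
// queues it, or counts it as dropped once the queue is full.
func (b *bufferedNotifier) Notify(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.send != nil {
		b.send(msg)
		return
	}
	if len(b.pending) >= b.limit {
		b.dropped++
		return
	}
	b.pending = append(b.pending, msg)
}

// SetReady registers the sender and flushes queued events in order.
// The sender is invoked while holding the lock to preserve ordering,
// so it must not call back into the notifier.
func (b *bufferedNotifier) SetReady(send func(string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, msg := range b.pending {
		send(msg)
	}
	b.pending = nil
	b.send = send
}

func main() {
	n := newBufferedNotifier(2)
	n.Notify("policy revision 1") // queued
	n.Notify("policy revision 2") // queued
	n.Notify("policy revision 3") // dropped: queue is full

	n.SetReady(func(msg string) { fmt.Println("monitor event:", msg) })
	n.Notify("policy revision 4") // delivered directly

	fmt.Println("dropped during startup:", n.dropped)
}
```

Bounding the queue keeps memory predictable if the monitor never comes up, and the dropped counter gives at least some visibility into what was lost during startup.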
We don't have to go and design this right now, but some brief thoughts for anyone who may decide to pick that up as an enhancement: