Today, the certificate on ci.adoptopenjdk.net expired. There are Nagios/Icinga plug-ins that can warn beforehand.
Blocked on #1229.
At the risk of attempting something that I haven't tried before, I'd quite like to pick this up, if nobody has any objections.
@Willsparker Go for it! Are you going to try and install Nagios or...?
I'm using https://github.com/matteocorti/check_ssl_cert at work. That is run locally on the Icinga2 controller.
Well, we do have it in our playbooks, so it feels right to use Nagios / for me to learn how to use it :-)
Ref: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1229#issuecomment-701264941
I've had a look on the nagios machine and I've already found a configuration for ci.adoptopenjdk.net, in which you can use ssh to check things. Here's one that checks the password expiry:
define service{
use generic-service
host_name ci.adoptopenjdk.net
check_period once-a-day-at-8
service_description Passwd Expiry
check_command check_by_ssh!/usr/local/nagios/libexec/check_passwd -w 14 -c 3
}
Presumably, we can do this for checking the SSL cert expiration - possibly using certbot?
(Found a tutorial of LetsEncrypt / nginx / certbot here) The certificate expiration can be checked using sudo certbot renew --dry-run, so the Nagios service could be something like this:
define service{
use generic-service
host_name ci.adoptopenjdk.net
check_period once-a-day-at-8
service_description SSL_Cert Expiry
check_command check_by_ssh!certbot renew --dry-run
}
Or indeed, it could use cron - something @aahlenst already has an Ansible role for ( https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1569#issuecomment-700592112 ). I believe this would mean we wouldn't have to bother with Nagios at all - but that runs counter to #1229.
@Willsparker Your proposed route asks certbot about the state of the certificate. Unfortunately, this is not sufficient. You have to go through the webserver and do a real HTTPS call. https://github.com/matteocorti/check_ssl_cert is the relevant plug-in, which you can invoke on the controller (and not on the server to be monitored). If you only ask certbot about the state of the certificate, you'll miss two potential failures: a misconfigured HTTP server and a revoked certificate. Last year (or so) Let's Encrypt revoked a large batch of certificates. You had to force certbot to renew those.
Okay, I've managed to use check_ssl_cert as suggested, on a VM.
$ ./check_ssl_cert -H ci.adoptopenjdk.net
SSL_CERT OK - x509 certificate 'ci.adoptopenjdk.net' from 'Let's Encrypt Authority X3' valid until Dec 27 07:21:27 2020 GMT (expires in 86 days)|days_chain_elem1=86;20;15;; days_chain_elem2=167;20;15;;
I managed to get it set up using Nagios (on a couple of VMs):

I was able to do this with the following service in `../nagios/etc/servers/ubuntu_host.cfg`:
define service{
use generic-service
host_name ubuntu_host
check_interval 2
service_description SSL_Cert Expiry
check_command check_ssl_cert!ci.adoptopenjdk.net
register 1
}
I realise that checking the SSL cert could be done on the Nagios monitoring server itself; it doesn't need to run on a monitored host. I would assume we'll want it to, though, so we can use this to trigger certbot on the actual ci.adoptopenjdk.net machine when the cert expires or gets close to expiration.
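If the goal is to have Nagios trigger a renewal, one option is a Nagios event handler attached to the service. The following is only a sketch under assumptions: the command name `renew_ssl_cert` and the script `renew_ssl_cert.sh` are hypothetical and would still have to be written (for example, to run certbot over ssh on the target machine), and `$USER1$` is assumed to point at the plug-in directory, as in the stock resource.cfg.

```cfg
# Sketch only: renew_ssl_cert and renew_ssl_cert.sh are hypothetical names.
define service{
    use generic-service
    host_name ubuntu_host
    service_description SSL_Cert Expiry
    check_command check_ssl_cert!ci.adoptopenjdk.net
    event_handler renew_ssl_cert
}

# The state macros let the (hypothetical) script decide whether to act.
define command{
    command_name renew_ssl_cert
    command_line $USER1$/eventhandlers/renew_ssl_cert.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
```

Event handlers fire on state changes, including soft-state retries, so the script would need to inspect its arguments and only act on, say, a hard WARNING or CRITICAL.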
Okay, I've had a look at the nagios server we have:

This definitely needs to be fixed up...
Beyond this, the ci.adoptopenjdk.net server is part of the list of machines that are monitored:

As a practical exercise, I'm going to add the check_ssl_cert service, and see if I can get that monitoring correctly. If all is well, this issue can be closed, and we can use this knowledge for when #1229 is being worked on.

Check_SSL_Cert was run via Nagios on the ci.adoptopenjdk.net machine :tada:
It seems the Nagios restart also fixed a bunch of other things!

FYI, to get this to work, I also had to add the check_ssl_cert command to /usr/local/nagios/etc/commands.cfg, and symlink the check_ssl_cert command into /usr/local/nagios/libexec/ on both machines.
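For reference, a command definition along these lines would match the `check_ssl_cert!ci.adoptopenjdk.net` call in the service. This is a sketch rather than the exact entry that was added: it assumes `$USER1$` resolves to /usr/local/nagios/libexec (the usual resource.cfg setting) and passes the hostname to check as `$ARG1$`.

```cfg
# Sketch; assumes $USER1$ is /usr/local/nagios/libexec (set in resource.cfg).
define command{
    command_name check_ssl_cert
    command_line $USER1$/check_ssl_cert -H $ARG1$
}
```

The plug-in's -w and -c options (days before expiry) could be appended to command_line to move the warning and critical thresholds away from the 20 and 15 days seen in the output above.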
Does anyone object to this issue being closed? The theory works; it just needs to be usefully implemented (along with #1569) as part of #1229.
Looks good from my POV. I see that the cert expires over the Christmas break. We should get that automatic renewal sorted by then 😏
@Willsparker Can we get Nagios to ping our Slack on a nagios channel? It would be worth getting this to green...
Discussion about this on Slack: https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1601980237072200
The gist was that we already have a channel that Nagios pings, called #infrastructure-bot. However, it is saturated with a lot of output that isn't massively helpful. Hopefully this can be cleared up with #1229 - possibly by cutting down the number of checks Nagios does on each node, or by disabling notifications on services that don't matter as much (explained below).
As far as I can tell from reading the documentation, Nagios pings Slack on any service that has a state change (i.e. 'OK' --> 'Warning'). You can disable notifications on certain services via:
define service{
...
notifications_enabled [0/1]
}
or you can change the contacts / contact_groups that are notified, using those fields in the service definition. In the contact definition, you can define which command to run to send the notification, so we could theoretically have this SSL-Cert check ping a different Slack channel, if we wanted it to.
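As a rough illustration of that last point, a dedicated contact could be attached to just the SSL check. Everything named here is hypothetical: the contact `slack_ssl_alerts`, the notification commands `notify-host-by-slack-ssl` and `notify-service-by-slack-ssl` (which would have to be defined separately to post to the chosen channel), and the `24x7` time period, which is assumed to exist as in the sample Nagios configuration.

```cfg
# Sketch only: contact and notification command names are hypothetical.
define contact{
    contact_name slack_ssl_alerts
    alias Slack channel for SSL cert alerts
    host_notification_period 24x7
    service_notification_period 24x7
    host_notification_options d,u,r
    service_notification_options w,u,c,r
    host_notification_commands notify-host-by-slack-ssl
    service_notification_commands notify-service-by-slack-ssl
}

define service{
    use generic-service
    host_name ci.adoptopenjdk.net
    service_description SSL_Cert Expiry
    check_command check_ssl_cert!ci.adoptopenjdk.net
    contacts slack_ssl_alerts
    notifications_enabled 1
}
```

Any contact_groups inherited from generic-service would likely still be notified as well, so those may need overriding if the check should only reach the new channel.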
I'm going to close this Issue, and start looking at #1569 - Both the solutions found in this issue and the next are going to be 'fully implemented' when I look at #1229 , so we can decide then where we want certain services to notify (if at all).