Openjdk-infrastructure: Monitor SSL certificate expiration

Created on 26 Sep 2020  ·  12 Comments  ·  Source: AdoptOpenJDK/openjdk-infrastructure

Today, the certificate on ci.adoptopenjdk.net expired. There are Nagios/Icinga plug-ins that can warn beforehand.

Blocked on #1229.

bug

All 12 comments

At the risk of attempting something that I haven't tried before, I'd quite like to pick this up, if nobody has any objections

@Willsparker Go for it! Are you going to try and install Nagios or...?

I'm using https://github.com/matteocorti/check_ssl_cert at work. That is run locally on the Icinga2 controller.

Well, we do have it in our playbooks, so it feels right to use Nagios / for me to learn how to use it :-)

Ref: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1229#issuecomment-701264941

I've had a look on the nagios machine and I've already found a configuration for ci.adoptopenjdk.net, in which you can use ssh to check things. Here's one that checks the password expiry:

define service{
        use                             generic-service
        host_name                       ci.adoptopenjdk.net
        check_period            once-a-day-at-8
        service_description             Passwd Expiry
        check_command                   check_by_ssh!/usr/local/nagios/libexec/check_passwd -w 14 -c 3
        }

Presumably, we can do this for checking the SSL cert expiration - possibly using certbot ?
(Found a tutorial of LetsEncrypt / nginx / certbot here) The certificate expiration can be checked using sudo certbot renew --dry-run, so the Nagios service could be something like this.

define service{
        use                             generic-service
        host_name                       ci.adoptopenjdk.net
        check_period            once-a-day-at-8
        service_description             SSL_Cert Expiry
        check_command                   check_by_ssh!certbot renew --dry-run
        }

Alternatively, it could use cron, something @aahlenst already has an Ansible role for ( https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1569#issuecomment-700592112 ). I believe this would mean we wouldn't have to bother with Nagios at all, but that runs counter to #1229 .

@Willsparker Your proposed route asks certbot about the state of the certificate. Unfortunately, that is not sufficient. You have to go through the web server and make a real HTTPS call. https://github.com/matteocorti/check_ssl_cert is the relevant plug-in, and you can invoke it on the controller (not on the server being monitored). If you only ask certbot about the state of the certificate, you'll miss two potential failures: a misconfigured HTTP server and a revoked certificate. Last year (or so) Let's Encrypt revoked a large batch of certificates, and you had to force certbot to renew those.

Okay, I've managed to use check_ssl_cert as suggested, on a VM.

$ ./check_ssl_cert -H ci.adoptopenjdk.net
SSL_CERT OK - x509 certificate 'ci.adoptopenjdk.net' from 'Let's Encrypt Authority X3' valid until Dec 27 07:21:27 2020 GMT (expires in 86 days)|days_chain_elem1=86;20;15;; days_chain_elem2=167;20;15;;

I managed to get it setup using Nagios (on a couple of VMs):
image

I was able to do this with the following service in `../nagios/etc/servers/ubuntu_host.cfg`:

define service{
        use                             generic-service
        host_name                       ubuntu_host
        check_interval                  2
        service_description             SSL_Cert Expiry
        check_command                   check_ssl_cert!ci.adoptopenjdk.net
        register                        1
}

I realise that the SSL cert check could be run on the Nagios monitoring server itself; it doesn't have to run on a monitored host. I would assume we will want it to run there anyway, so that we can use it to trigger certbot on the actual ci.adoptopenjdk.net machine when the cert expires or gets close to expiration.

Okay, I've had a look at the nagios server we have:
image
This definitely needs to be fixed up...

Beyond this, the ci.adoptopenjdk.net server is part of the list of machines that are monitored:
image

As a practical exercise, I'm going to add the check_ssl_cert service, and see if I can get that monitoring correctly. If all is well, this issue can be closed, and we can use this knowledge for when #1229 is being worked on.

image

Check_SSL_Cert was run via Nagios on the ci.adoptopenjdk.net machine :tada:

It seems the Nagios restart also fixed a bunch of other things!

image

FYI, to get this to work, I also had to add the check_ssl_cert command to /usr/local/nagios/etc/commands.cfg, and symlinked the command check_ssl_cert to /usr/local/nagios/libexec/ on both machines.
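
The exact command definition was not posted in this thread. A minimal sketch of what the entry in commands.cfg might look like, assuming the symlinked plug-in is reachable through the standard $USER1$ plug-in directory macro (usually /usr/local/nagios/libexec on a source install) and that the hostname to check is passed as the first argument:

```
# Sketch only: the real entry on the Nagios server may differ.
define command{
        command_name    check_ssl_cert
        ; $USER1$ is the plug-in directory, $ARG1$ is the text after "!" in check_command
        command_line    $USER1$/check_ssl_cert -H $ARG1$
        }
```

With a definition like this, a service line such as check_command check_ssl_cert!ci.adoptopenjdk.net would expand to the same invocation that was run by hand earlier (check_ssl_cert -H ci.adoptopenjdk.net).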

Does anyone object to this issue being closed? The theory works, it just needs to be usefully implemented (along with #1569 ) with #1229

Looks good from my POV. I see that the cert expires over the Christmas break. We should get that automatic renewal sorted until then 😏

@Willsparker Can we get Nagios to ping our Slack on a nagios channel? It would be worth getting this to green...

Discussion about this on Slack: https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1601980237072200
The gist was that we already have a channel that Nagios pings, called #infrastructure-bot , but it is saturated with output that isn't very helpful. Hopefully this can be cleared up with #1229 , possibly by cutting down the number of checks Nagios does on each node, or by disabling notifications on services that don't matter as much (explained below).

As far as I can tell from reading the documentation, Nagios pings Slack on any service that has a state change (i.e. 'OK' --> 'Warning'). You can disable notifications on certain services via:

define service{
...
notifications_enabled [0/1]
}

or you can change the contacts / contact_groups that are notified, using those fields in the service definition. In the contact definition, you can define which command is run to send the notification, so we could theoretically have this SSL-Cert check ping a different Slack channel, if we wanted it to.
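
To illustrate the contact-based routing, here is a rough sketch rather than anything taken from the actual server. The contact name, the contact group name and the notification command notify-service-by-slack-ssl are all hypothetical, and the sketch assumes the generic-contact template from the Nagios sample configuration is available. The notification command itself (for example a script that posts to a Slack webhook) would still have to be defined separately.

```
# Hypothetical contact whose service notifications go to a dedicated Slack channel.
define contact{
        contact_name                    slack-ssl-certs              ; hypothetical name
        use                             generic-contact              ; assumes the sample template exists
        alias                           Slack channel for certificate alerts
        service_notification_commands   notify-service-by-slack-ssl  ; hypothetical, not defined here
        }

define contactgroup{
        contactgroup_name               ssl-cert-alerts              ; hypothetical name
        alias                           SSL certificate alerts
        members                         slack-ssl-certs
        }

# The service then notifies only that group instead of the default contacts.
define service{
        use                             generic-service
        host_name                       ci.adoptopenjdk.net
        service_description             SSL_Cert Expiry
        check_command                   check_ssl_cert!ci.adoptopenjdk.net
        contact_groups                  ssl-cert-alerts
        }
```

Routing through a separate contact group keeps the certificate alerts out of the noisy channel without turning notifications off for the other services.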

I'm going to close this Issue, and start looking at #1569 - Both the solutions found in this issue and the next are going to be 'fully implemented' when I look at #1229 , so we can decide then where we want certain services to notify (if at all).
