openjdk-infrastructure 🚀 - Bring Nagios Monitoring to the fore

(ref https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1228 )

Willsparker on 15 Apr 2020

This has been open for a while now and nobody has spoken up. Do we all agree to shut down the nagios server?

gdams on 11 May 2020

@sxa your call as the main infra rep - do you want something of this nature? It seems odd that we don't have anything, unless Jenkins' reporting is deemed good enough.

karianna on 11 May 2020

I don't personally feel there's any reason to shut it down - ideally we'd be making use of it, but I'm not at present because too many other things keep coming up. That doesn't mean it's not a useful thing to have in place.

I'm not aware that anyone has a reason in favour of it being shut down completely.

sxa on 12 May 2020

OK, so I'll relabel this as 'give it a spruce up'

karianna on 12 May 2020

Nagios should at least be updated to ensure we remain secure there. The latest version of Core is now 4.4.6.

tellison on 28 Sep 2020

Paging @Willsparker as he wanted to look at this from the other ticket on monitoring SSL certs

karianna on 28 Sep 2020

If bringing Nagios back to life requires a lot of work, it might make sense to check beforehand where that thing is located and what else needs to be done so that's ready for the future security- and performance-wise.

aahlenst on 28 Sep 2020

I can look at updating the playbooks to have the latest version of Nagios, that (hopefully) shouldn't be a problem :-)

Willsparker on 30 Sep 2020

@sxa @karianna If I understand correctly, there's no Nagios server at the moment. If that's true, can we re-evaluate and write down how we ended up with Nagios and which edition we're going to use? As I already said on Slack, if we have Nagios Core only, Icinga might be the better choice.

aahlenst on 30 Sep 2020

@Willsparker / all - we actually have a Nagios Master in place already at 78.47.239.96

karianna on 30 Sep 2020

Oh cool :-) I'm currently looking at installing it on a VM and playing around with it to figure it out

edit: It has the superuser on it so I can login too :+1:

Willsparker on 30 Sep 2020

I'm going to start looking at this, but I thought I'd ask what everyone would want to actually be monitored via Nagios?
Currently there are a lot of default checks for each host that I don't think are entirely necessary. Take the 'PING' service: a connection issue would already be found by the other services running on the host, so a separate PING service is unnecessary. The vast majority of these checks also notify #infrastructure-bot, which results in an awful lot of output that ends up becoming white noise. Certain services could notify the relevant Slack channels instead or, if the service isn't really important, notifications could be disabled entirely.

So - what services should run for each type of machine (i.e. build, test, infra, perf), where should the services notify if something goes wrong (if at all), and are there any special exceptions (e.g. the ci.adoptopenjdk.net machine will have a service to monitor its SSL certificate: #1568)?

Willsparker on 7 Oct 2020

Nagios and its forks indeed report if no result comes back some way or another (state "unknown").

As we're dealing with build/test servers, we should check for the problems that concern us. Maxing out RAM/CPU probably does not, filling up the disk does.

Shoot from the hip:

Disk usage
SSH reachable
Security patches installed
Jenkins service connected (probably better solved by querying Jenkins)
Network time sync (I have a plugin that works well on Linux)

For the Jenkins server, TRSS and other servers that provide services: monitoring CPU, RAM, SSL certificate might be a good idea, too. The SSL certificate check would also be good for the website and the API.

aahlenst on 8 Oct 2020

Looking at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1602, it might make sense to monitor RAM.

aahlenst on 8 Oct 2020

Maybe all of those as a once-a-day-check? Possibly in the morning so it doesn't run at the same time as nightlies.
Also what Network time sync plugin would that be? Currently we're not monitoring Windows, so that's something I should probably look into as well :-)

Willsparker on 9 Oct 2020

Once a day sounds good. Mornings (EMEA time) probably sane. Network time sync probably tries to keep all of the machines sync'd timewise (you can see the drift when you look at all of the nodes in Jenkins actually).

Windows would be great 👍

karianna on 12 Oct 2020

Okay, cool - I'll get started on that then :-) I'll keep a backup of all the old config files in a directory somewhere, just in case.

Willsparker on 12 Oct 2020

(note to self)
Useful Documentation / resources I found (I'll update this as I go) :
"Setting up a Nagios Server on Ubuntu1604" : https://www.howtoforge.com/tutorial/how-to-install-nagios-on-ubuntu-16-04/
"Template-Based Object Configuration" : http://nagios.manubulon.com/traduction/docs25en/xodtemplate.html
"Event Handlers" : https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html

Willsparker on 13 Oct 2020

I've been working on a script to generate a config file for each host automatically (I'll put it here once it's done). I was also looking at adding a service to query whether a given node is connected to Jenkins, but there doesn't appear to be an immediately obvious way. I'd be able to use check_by_ssh to run a script that looks for a java process running Jenkins, but that doesn't necessarily mean it's running as expected.

Willsparker on 23 Oct 2020

Only Jenkins knows what is connected and what is not. Therefore, I'd query https://ci.adoptopenjdk.net/computer/api/json?pretty=true. This would also allow us to define sets of nodes and be alerted if there are, for example, fewer than X machines with a specific label.

aahlenst on 23 Oct 2020

Is there any way to query the API to return the info for a single node? I can't find any documentation that shows how to use the API.

Willsparker on 26 Oct 2020

Append /api to any URL you open in Jenkins and you get the API. https://ci.adoptopenjdk.net/computer/build-azure-win2012r2-x64-1/api/json?pretty=true gives you info about build-azure-win2012r2-x64-1.
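
As a rough sketch of the URL pattern described above, the helper below builds the JSON API address for one agent. The function name `agent_api_url` and the `JENKINS_URL` variable are illustrative names, not something from this thread. The optional `tree` filter is a standard query parameter of the Jenkins remote API that trims the response to the named fields; nobody in this discussion used it, so treat it as an optional extra.

```bash
#!/bin/bash
# Hypothetical helper: build the Jenkins JSON API URL for a single agent.
# JENKINS_URL defaults to the instance discussed in this thread.
JENKINS_URL="${JENKINS_URL:-https://ci.adoptopenjdk.net}"

agent_api_url() {
  local node="$1" fields="${2:-}"
  local url="${JENKINS_URL}/computer/${node}/api/json"
  # Optionally ask Jenkins to return only the listed fields.
  if [ -n "$fields" ]; then
    url="${url}?tree=${fields}"
  fi
  printf '%s\n' "$url"
}

# Example (performs a network request, so it is left commented out):
# curl -fsS "$(agent_api_url build-azure-win2012r2-x64-1 offline,temporarilyOffline)"
```

Keeping the URL construction in one place means a check script only needs the node name as its argument.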

aahlenst on 26 Oct 2020

Ah! Excellent, thanks very much :-)

Willsparker on 26 Oct 2020

Okay, I wrote a script which I've been able to get working in Nagios

** For the purposes of testing, I called the node build-scaleway-ubuntu1604-x64-1; it's actually just a VM running on my machine, but the check_jenkins command I made uses what's defined as the hostname to query the Jenkins API

#!/bin/bash
# Nagios check: report whether a Jenkins agent is connected.

if [ -z "$1" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

# Download the agent's JSON description to a temporary file
wget -q "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true" -O "jenkins_query_$1"
if [[ $? != 0 ]]; then
  echo "UNKNOWN - Failed to get agent information"
  rm -f "jenkins_query_$1"
  exit 3
fi

# With pretty=true each field sits on its own line, e.g.:  "offline" : false,
is_agent_offline=$(awk '/"offline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
is_agent_temp_offline=$(awk '/"temporarilyOffline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
rm -f "jenkins_query_$1"

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi

Pretty simple. The syntax is ./check_agent <Name of Node>, and it runs on the Nagios server itself, as it just queries the Jenkins API. That also means we only have to put it on one machine instead of ~100.

Willsparker on 26 Oct 2020

Looks great, apart from one thing: awk isn't the best choice for querying JSON. https://stedolan.github.io/jq/ is much more reliable and digestible. curl is also more friendly for saving the response in a variable. Saves you the temporary file and the problems associated with it.

Pulled from a script on my disk:

CURL_RESPONSE=$(curl -s -H "Accept: application/json" -H "Authorization: Bearer $TOKEN" "https://example.com")

aahlenst on 26 Oct 2020

Updated to use JQ and curl :+1:

#!/bin/bash
# Nagios check: report whether a Jenkins agent is connected (curl + jq version).

if [ -z "$1" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

# -f makes curl return a non-zero status on HTTP errors such as 404
CURL_RESPONSE=$(curl -sf "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true")
if [[ $? != 0 ]]; then
  echo "UNKNOWN - Failed to get agent information"
  exit 3
fi

is_agent_offline=$(echo "$CURL_RESPONSE" | jq .offline)
is_agent_temp_offline=$(echo "$CURL_RESPONSE" | jq .temporarilyOffline)

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi
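
The decision table at the end of the script can be exercised without contacting Jenkins if it is pulled into a function. The sketch below is one way to do that; the function name `agent_state` is made up for illustration, and the return values follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) that the script above already uses.

```bash
#!/bin/bash
# Hypothetical helper: map the two Jenkins flags to a Nagios status line.
# Returns the conventional plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
agent_state() {
  local offline="$1" temp_offline="$2"
  if [[ $offline == "false" ]]; then
    echo "OK - Jenkins Agent is connected"; return 0
  elif [[ $offline == "true" && $temp_offline == "true" ]]; then
    echo "WARNING - Jenkins Agent temporarily disconnected"; return 1
  elif [[ $offline == "true" && $temp_offline == "false" ]]; then
    echo "CRITICAL - Jenkins agent is fully disconnected"; return 2
  fi
  # Anything else (for example jq printing "null") is treated as unknown.
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"; return 3
}
```

A check script could then end with `agent_state "$is_agent_offline" "$is_agent_temp_offline"; exit $?`, which keeps the network code and the status logic separate.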

Willsparker on 27 Oct 2020

Looks great.

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

aahlenst on 27 Oct 2020

On the Nagios server, I've made a backup of the objects and servers directories (just in case) at /usr/local/nagios/cfg_backup_281020. Next I'm going to start generating the .cfg files for all the servers. I've tested this and managed to get it working, but if there's anything I'm missing, let me know :-)

#!/bin/bash

# Stop early if no variable file was given
if [[ ! -f $1 ]]; then
  echo "Input a variable file"
  exit 1
fi

source $1
export FILENAME="$HOSTNAME.cfg"


case $(echo "$DISTRO" | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]') in
  "ubuntu" | "debian") 
    PKGMNGR="apt";;
  "rhel" | "centos")
    PKGMNGR="yum";;
esac

echo "DEBUG:
  FILENAME: $FILENAME
  HOSTNAME: $HOSTNAME
  ALIAS   : $ALIAS
  ADDRESS : $IP_ADDRESS
  DISTRO  : $DISTRO
  PKGMNGR : $PKGMNGR
  SPECIAL : $EXTRA
"  

echo " # Checks SSH to determine if the host is available
define host {
        use                             linux-server
        host_name                       $HOSTNAME
        alias                           $ALIAS
        address                         $IP_ADDRESS
        check_command                   check_ssh!-4 -t 60
        max_check_attempts              5
        check_period                    24x7
        notification_interval           30
        notification_period             24x7
}" >> $FILENAME

echo "define service {
        use                             generic-service        
        host_name                       $HOSTNAME
        service_description             Disk Usage
        check_command                   check_remote_disk!10%!5%!/
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service        
        host_name                       $HOSTNAME
        service_description             Updates-Required - $PKGMNGR
        check_command                   check_remote_${PKGMNGR}
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check Free Memory
        check_command                   check_remote_mem!10!5
        check_interval                  30
}" >> $FILENAME

# This only runs on CentOS/RHEL 7+, as CentOS 6 doesn't use systemd
if [[ $(echo "$DISTRO" | tr -d '[:alpha:]') != 6 ]]; then
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Network Time Sync
        check_command                   check_remote_timesync
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check if Jenkins Agent Connected
        check_command                   check_agent!$HOSTNAME
        check_period                    once-a-day-at-8
}" >> $FILENAME

# Only for the servers that provide services: CPU load and SSL certificate checks
if [[ $EXTRA == 1 ]]; then
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check CPU Load
        check_command                   check_remote_load
        check_interval                  10
}" >> $FILENAME

  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check_SSL_Cert
        check_command                   check_ssl_cert!$HOSTNAME         
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi

An example of the variable file is as follows:

export HOSTNAME="build-test-test-x64-1"
export ALIAS="Build Host"
export IP_ADDRESS="127.0.0.1"
export DISTRO=CentOS7
export EXTRA=1
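
One thing worth noting about the generator: its case statement has no fallback branch, so a DISTRO value other than Ubuntu, Debian, RHEL or CentOS leaves PKGMNGR empty and the generated command becomes `check_remote_`. The sketch below shows one possible way to guard against that. The function name `pkg_manager_for` is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: derive the package manager from a DISTRO value such as
# "CentOS7" or "Ubuntu1604". Unknown distributions return a non-zero status
# instead of silently leaving the variable empty.
pkg_manager_for() {
  local family
  family=$(printf '%s' "$1" | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]')
  case "$family" in
    ubuntu|debian) echo "apt" ;;
    rhel|centos)   echo "yum" ;;
    *)             echo "unsupported distro: $1" >&2; return 1 ;;
  esac
}
```

In the generator this could be used as `PKGMNGR=$(pkg_manager_for "$DISTRO") || exit 1`, so that no config file is written for a distribution the checks do not cover.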

The once-a-day-at-8 time period is defined as :

define timeperiod{
        timeperiod_name once-a-day-at-8
        alias           Between 8am and 9am GMT every day
        sunday          9:00-10:00
        monday          9:00-10:00
        tuesday         9:00-10:00
        wednesday       9:00-10:00
        thursday        9:00-10:00
        friday          9:00-10:00
        saturday        9:00-10:00
}

According to a note left by Brad Blondin, the Nagios server is on CEST, so to get 8-9am GMT, the time period on the server has to be 9-10am (I think).
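
The offset depends on whether the server is currently on CET (UTC+1) or CEST (UTC+2), and both names appear in this thread. The sketch below checks the conversion for a winter and a summer date. It assumes GNU `date` (for the `-d` option) and that Nagios evaluates time periods in the server's local time; the function name is illustrative only.

```bash
#!/bin/bash
# Convert an "HH:MM" UTC time on a given date to Central European local time.
# The POSIX TZ string below encodes CET/CEST with the EU daylight-saving rules,
# so no tzdata lookup is needed. Requires GNU date for the -d option.
utc_to_central_europe() {
  local day="$1" utc_time="$2"
  TZ='CET-1CEST,M3.5.0,M10.5.0/3' date -d "${day} ${utc_time} UTC" '+%H:%M %Z'
}

utc_to_central_europe 2020-11-04 08:00   # winter: CET, one hour ahead of UTC
utc_to_central_europe 2020-07-01 08:00   # summer: CEST, two hours ahead of UTC
```

Under those assumptions a fixed 9:00-10:00 window matches 8-9am GMT only while the server is on CET; during summer time the same window would start an hour earlier in GMT terms.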

The extra commands that need to be defined in /usr/local/nagios/etc/objects/commands.cfg are as follows:

##############
#
# COMMANDS ADDED (By Willsparker)
#
##############

define command{
        command_name    check_remote_disk
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$'
}

define command{
        command_name    check_remote_yum
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_yum -t 60'
}

define command{
        command_name    check_remote_apt
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_apt -t 60'
}

define command{
        command_name    check_remote_load
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes: https://github.com/justintime/nagios-plugins/tree/master/check_mem
define command{
        command_name    check_remote_mem
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_mem -f -C -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes
define command{
        command_name    check_remote_timesync
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_timesync'
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_agent
        command_line    $USER1$/check_agent $ARG1$
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_ssl_cert
        command_line    $USER1$/check_ssl_cert -H $ARG1$
}
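
For readers new to Nagios object definitions: in a service's check_command, the parts separated by `!` become `$ARG1$`, `$ARG2$` and so on in the matching command_line. The snippet below only illustrates that splitting; it is not part of Nagios, and the function name is made up.

```bash
#!/bin/bash
# Illustration only: show how a check_command such as
# "check_remote_disk!10%!5%!/" is split into a command name and $ARGn$ macros.
explain_check_command() {
  local IFS='!'
  local -a parts
  # With IFS set to "!", read splits the string at every exclamation mark.
  read -r -a parts <<< "$1"
  echo "command_name=${parts[0]}"
  local i
  for ((i = 1; i < ${#parts[@]}; i++)); do
    echo "ARG${i}=${parts[i]}"
  done
}

explain_check_command 'check_remote_disk!10%!5%!/'
```

Read this way, the generator's `check_remote_load` service passes no arguments while the command definition above expects `$ARG1$` and `$ARG2$`, which suggests the load thresholds still need to be supplied somewhere.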

I think that's all the prep work I need to do before re-doing the Nagios setup, except the notifications -
1) Should we keep it as it is, with nagios pinging the #infrastructure-bot channel?
2) Should I alter the notification period / interval?
3) Do all tasks need notifications enabled?

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

@aahlenst I'll have a look at adding that today :-)

Willsparker on 28 Oct 2020

@Willsparker Thanks for the great work. Question: Why no Ansible playbook for the config?

aahlenst on 28 Oct 2020

Honestly, I wasn't aware that the playbook could be used for the config :sweat_smile: I'll look to see if I can use the roles, and merge the script I wrote above, into the Nagios_Ansible_Config_tool.sh script mentioned in https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/Nagios_Master_Config/tasks/main.yml
It'll save me a lot of manual work :-)

Willsparker on 28 Oct 2020

@aahlenst
Script to check the percentage of machines online in the label

#!/bin/bash
# Nagios check: percentage of Jenkins agents online for a given label.

if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 <Label> <Warning_Level> <Critical_Level>"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

# Get list of machines in label (-r prints the names without quotes)
mapfile -t machine_array < <(curl -s "https://ci.adoptopenjdk.net/label/$1/api/json" | jq -r '.nodes[] | .nodeName')

# For each machine, query if they're connected
response_array=()
for node in "${machine_array[@]}"
do
  response_array+=("$(curl -s "https://ci.adoptopenjdk.net/computer/${node}/api/json" | jq .offline)")
done

# Anything other than "false" (including a failed query) counts as offline
online=0
offline=0
for response in "${response_array[@]}"
do
  if [[ ${response} == "false" ]]; then
    online=$((online+1))
  else
    offline=$((offline+1))
  fi
done

# Avoid dividing by zero when the label matches no machines
if (( online + offline == 0 )); then
  echo "UNKNOWN - No machines found in '$1' label"
  exit 3
fi

export percentage_online=$(echo "scale=2; ($online/($offline+$online)) * 100" | bc -l)
if (( $(echo "$percentage_online < $3" | bc -l) )); then
  echo "CRITICAL - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 2
elif (( $(echo "$percentage_online < $2" | bc -l) )); then
  echo "WARNING - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 1
else
  echo "OK - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 0
fi

Any requested changes?

Willsparker on 28 Oct 2020

More questions than change requests:

Why export percentage_online?
Can this handle label groups, for example build&&linux&&s390x?

aahlenst on 29 Oct 2020

I think ... I've got it working using the tools that Brad Blondin made when he initially set up the Nagios stuff!

Here's a list of things I had to do to make it work (which may or may not be related to me running this on a machine that isn't in the inventory):

Alter the Nagios_Ansible_Config_Tool.sh to remove all the references to sys_pingtest (this is due to the new template not having a ping service)
Alter the template.cfg to have the services we want to monitor
In the playbook:
- Change the way we get the Nagios public key back to the old method (i.e. before https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1560 ). I couldn't get the new one working due to Ansible complaining that an invalid key was specified on this task
- Manually set the provider variable
- Manually set inventory_hostname, as it was being picked up as 'localhost' when it should have been the IP address of the machine
- Change the ssh command needed to make the Nagios_Ansible_Config_Tool.sh script run. This seems to be a weird thing with Ansible whereby, if you run the ansible-playbook command with the -b option, when it gets to a task that is delegated to localhost, it sudos the localhost user too. It shouldn't be an issue if you run the playbook as the root user of a given machine (as the -b option won't be used)
- Add a task to put the new network_time_sync script on the machine. I used the copy module for this, but when I PR this stuff, it'll be in the repo (I think - @aahlenst am I able to commit that script to this repo, legally?)
- Add the check_agent command to the Nagios server (I'll commit this in Supporting Scripts, but it won't be needed in the playbook)
  - Installed JQ on the Nagios server

For putting in the label checker script, I'll manually put that in, and have it run on the Nagios server :-)

Willsparker on 29 Oct 2020

For the sake of double-checking, I'm going to try this on a C7 machine that @sxa provided me, but I'll connect this one to Jenkins (and not put any labels on it), to see if this fixes a lot of my previous issues. Once I've got that working, I can do this for all the machines :-)

Willsparker on 29 Oct 2020

@aahlenst Removed the export - Not sure why I put that in. And apparently it does! (completely intentional!) You just need to make sure you put the labels in speech marks, otherwise bash doesn't like the & signs

will@will-XPS-13-9360:~/Documents/nagios_testing$ ./check_label.sh "build&&linux&&s390x" 50 30
OK - 100.00% machines online in 'build&&linux&&s390x' label
2 online machines; 0 offline machines
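
The thresholds used in the run above (50 for warning, 30 for critical) can be tried out without querying Jenkins by isolating the arithmetic. The following is a minimal sketch using integer maths instead of `bc`; the function name `label_state` is hypothetical, and the integer percentage is rounded down.

```bash
#!/bin/bash
# Hypothetical helper: classify a label by the share of agents that are online.
# Usage: label_state <online> <offline> <warning_pct> <critical_pct>
label_state() {
  local online="$1" offline="$2" warn="$3" crit="$4"
  local total=$((online + offline))
  # An empty label would otherwise cause a division by zero.
  if (( total == 0 )); then
    echo "UNKNOWN - no machines found"; return 3
  fi
  local pct=$(( online * 100 / total ))   # integer percentage, rounded down
  if (( pct < crit )); then
    echo "CRITICAL - ${pct}% machines online"; return 2
  elif (( pct < warn )); then
    echo "WARNING - ${pct}% machines online"; return 1
  fi
  echo "OK - ${pct}% machines online"; return 0
}
```

For example, `label_state 1 2 50 30` reports a warning because one of three machines (33%) is below the warning level of 50 but not below the critical level of 30.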

Willsparker on 29 Oct 2020

With the help of @gdams, we were able to get the test-aws-ubuntu1804-armv8-1 machine added to Nagios.
Final list of stuff to do:

[x] Backup the old version of the Nagios_Ansible_Config_Tool on the Nagios server
[x] Figure out how we want to do the notifications of each service so the #infrastructure-bot slack channel isn't constantly spammed.
[x] Create PR to: alter the Nagios_Ansible_Config_Tool, the templates and the relevant sections of the playbook; re-enable the Nagios roles; Put the plugins I wrote into additional_plugins.
[x] Remove all the old configs for the machines. (Already backed them up)
[x] Update Nagios Core to latest version
[ ] Run the Nagios_* roles on the machines.
[x] Manually add the check_label script to Nagios Server
[x] Change the Nagios server config to check the most common machine labels
[x] Update the Nagios server config in general

Additional:

[X] Set up a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins (This _should_ keep Nagios actually fairly useful, and not full of old machines we don't have)
[ ] Set up a method of checking that all machines in the inventory are monitored by Nagios (with a way of identifying exceptions, such as ad-hoc experimental machines, Windows machines (for now), or any machines we can't run the Nagios_* playbook roles on)

Willsparker on 30 Oct 2020

Setup a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins.

I'm not particularly keen on using Jenkins as the source of truth for our inventory. I'd rather let Ansible handle that (we have to remove the servers from the inventory anyway). Instead, I'd add a check that alerts us if a machine pops up in Jenkins that isn't known to Nagios (with the possibility to ignore dynamic agents).

aahlenst on 3 Nov 2020

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there. And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins (e.g. test-will-debian-riscv-1) that don't necessarily need monitoring.

Willsparker on 3 Nov 2020

Nagios Core has been updated to 4.4.6 from 4.3.4, following this guide

Willsparker on 3 Nov 2020

@aahlenst What labels should we be checking for?

Willsparker on 3 Nov 2020

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there.

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.

And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins (e.g. test-will-debian-riscv-1) that don't necessarily need monitoring.

This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

aahlenst on 3 Nov 2020

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.
This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

Both good points. When you say "Any unknown new machine being added should trigger a red alert", are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?

Willsparker on 3 Nov 2020

are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?

If a new machine appears in Jenkins that is not known to Nagios, it should trigger an alert. For that to work, we need a list of known machines. One possibility to achieve that could be that we combine our inventory with a list of experimental machines that is maintained alongside the Ansible inventory.

aahlenst on 3 Nov 2020

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios. Otherwise, I would have concerns about maintaining two separate lists: before I looked at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/619 , even the inventory wasn't very well maintained.
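
A minimal sketch of that comparison follows, under two assumptions that the thread does not confirm: each monitored machine has a file named `<agent name>.cfg` in the servers directory, and experimental machines carry the `experimental-` prefix proposed earlier. The function name is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print Jenkins agents that have no matching
# <name>.cfg file in the Nagios servers directory.
# Agent names are read from stdin, one per line; names starting with the
# (assumed) "experimental-" prefix are ignored.
unmonitored_agents() {
  local cfg_dir="$1"
  local name
  while IFS= read -r name; do
    [[ -z $name || $name == experimental-* ]] && continue
    [[ -f "${cfg_dir}/${name}.cfg" ]] || echo "$name"
  done
}

# Example (performs a network request, so it is left commented out):
# curl -fsS https://ci.adoptopenjdk.net/computer/api/json \
#   | jq -r '.computer[].displayName' \
#   | unmonitored_agents /usr/local/nagios/etc/servers
```

Any name printed would be a candidate for an alert; the Jenkins controller's own built-in node would also show up in that listing and would need to be excluded.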

Willsparker on 4 Nov 2020

Changed the timeperiod once-a-day-at-8 to 9:00-10:00, as (due to the nagios server being in CET) that refers to 8-9 in GMT, which should hopefully be after all the nightlies have finished.

Willsparker on 5 Nov 2020

I've been able to add all the machines I've got access to, and have manually changed some of the machines I don't have access to.
The following are the machines that are yet to be added to/updated in Nagios:

build-linaro-centos76-armv8-2 (timeout)
build-packet-ubuntu1804-armv8-1 (Connection closed)

docker-aws-ubuntu1604-x64-1 x
docker-aws-ubuntu1604-x64-2 x
docker-godaddy-ubuntu1604-x64-1 x
docker-scaleway-ubuntu1604-armv7-1 (Connection Closed)

test-ibm-aix71-ppc64-1 x
test-ibm-aix71-ppc64-2 x
test-ibmcloud-ubuntu1604-x64-1 x
test-macstadium-macos11-arm64-1 x
test-macstadium-macos11-arm64-2 x

Willsparker on 10 Nov 2020

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

aahlenst on 10 Nov 2020

@Willsparker adding to your requirements, I would like to see alerts added that warn (via Slack) when a given node's free disk space is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free!
This is so we get a heads-up that a node is near to being taken offline by Jenkins and we can act before it fails a nightly build.

andrew-m-leonard on 19 Nov 2020

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

That should absolutely be accurate as far as "live" machines are concerned, and I've been acting to resolve any discrepancies as soon as they show up, so hopefully we don't need a ticket with a long list of todos on that one ;-) Just PRs put in to fix them. So yes, that list should be definitive.

sxa on 19 Nov 2020

@Willsparker adding to your requirements, I would like to see alerts added that warn (via Slack) when a given node's free disk space is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free!
This is so we get a heads-up that a node is near to being taken offline by Jenkins and we can act before it fails a nightly build.

@andrew-m-leonard If I'm understanding correctly, we should already have that, but it works based on a percentage of the machine's disk (see here). This goes to the #infrastructure-bot channel on Slack. Currently it warns at 20% free disk space and goes critical at 10%; the Slack channel is updated at both points. I don't believe I've seen a machine with less than 100GB total disk space, so we should be warned before reaching the ~3GB margin you suggested.
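
The arithmetic behind that reasoning can be spelled out. The sketch below uses the figures from this exchange (20% warning level, 100GB disks, 13GB target); the function names are illustrative, and the calculation works in whole gigabytes.

```bash
#!/bin/bash
# Free space (in whole GB) at which a percentage-based threshold fires.
# Usage: free_gb_at_threshold <total_gb> <percent>
free_gb_at_threshold() {
  echo $(( $1 * $2 / 100 ))
}

# Smallest disk (in GB) for which a percentage threshold still fires
# before free space drops below a fixed limit, rounded up.
# Usage: min_disk_for_limit <limit_gb> <percent>
min_disk_for_limit() {
  echo $(( ($1 * 100 + $2 - 1) / $2 ))
}

free_gb_at_threshold 100 20   # warning level on a 100 GB disk
min_disk_for_limit 13 20      # disk size needed for a 13 GB heads-up at 20%
```

On a 100GB disk the 20% warning fires with 20GB still free, comfortably above 13GB. A disk smaller than 65GB would reach the 13GB mark before the percentage warning fires, so the percentage approach only covers the request as long as no such machine is added.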

Willsparker on 23 Nov 2020

Considering the vast majority of machines have been added to Nagios and this issue is very non-specific now, I'm going to close it in favour of some more specific ones

Willsparker on 27 Nov 2020
