Openjdk-infrastructure: Bring Nagios Monitoring to the fore

Created on 25 Mar 2020 · 52 Comments · Source: AdoptOpenJDK/openjdk-infrastructure

This was a question as to whether to keep it. We think we should, but we need to bring it into our monitoring regime.

question


All 52 comments

(ref https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1228 )

This has been open for a while now and nobody has spoken up. Do we all agree to shut down the nagios server?

@sxa your call as the main infra rep - do you want something of this nature? Seems odd that we don't have anything unless Jenkins' reporting is deemed good enough

I don't personally feel there's any reason to shut it down - ideally we'd be making use of it, but I'm not at present because too many other things keep coming up. That doesn't mean it's not a useful thing to have in place.

I'm not aware that anyone has a reason in favour of it being shut down completely.

OK, so I'll relabel this as 'give it a spruce up'

Nagios should at least be updated to ensure we remain secure there. The latest version of Core is now 4.4.6.

Paging @Willsparker as he wanted to look at this from the other ticket on monitoring SSL certs

If bringing Nagios back to life requires a lot of work, it might make sense to check beforehand where the server is hosted and what else needs to be done so that it's ready for the future, security- and performance-wise.

I can look at updating the playbooks to have the latest version of Nagios, that (hopefully) shouldn't be a problem :-)

@sxa @karianna If I understand correctly, there's no Nagios server at the moment. If that's true, can we re-evaluate and write down how we ended up with Nagios and which edition we're going to use? As I already said on Slack, if we have Nagios Core only, Icinga might be the better choice.

@Willsparker / all - we actually have a Nagios Master in place already at 78.47.239.96

Oh cool :-) I'm currently looking at installing it on a VM and playing around with it to figure it out

edit: It has the superuser on it so I can login too :+1:

I'm going to start looking at this, but I thought I'd ask what everyone would want to actually be monitored via Nagios?
Currently there are a lot of default checks for each host that I don't think are entirely necessary (e.g. the 'PING' service: a connection issue would be caught by the other services running on the host anyway, so a separate PING service is unnecessary). The vast majority of these checks also notify #infrastructure-bot, which results in an awful lot of output that ends up becoming white noise. Certain services could instead notify the relevant Slack channels, or, if the service isn't really important, notifications could be disabled entirely.

So - what services should run for each type of machine (i.e. build, test, infra, perf), where should the services notify if something goes wrong (if at all), and are there any special exceptions (e.g. the ci.adoptopenjdk.net machine will have a service to monitor its SSL certificate: #1568)?

Nagios and its forks indeed report if no result comes back some way or another (state "unknown").

As we're dealing with build/test servers, we should check for the problems that concern us. Maxing out RAM/CPU probably does not, filling up the disk does.

Shoot from the hip:

  • Disk usage
  • SSH reachable
  • Security patches installed
  • Jenkins service connected (probably better solved by querying Jenkins)
  • Network time sync (I have a plugin that works well on Linux)

For the Jenkins server, TRSS and other servers that provide services: monitoring CPU, RAM, SSL certificate might be a good idea, too. The SSL certificate check would also be good for the website and the API.

Looking at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1602, it might make sense to monitor RAM.

Maybe all of those as a once-a-day-check? Possibly in the morning so it doesn't run at the same time as nightlies.
Also what Network time sync plugin would that be? Currently we're not monitoring Windows, so that's something I should probably look into as well :-)

Once a day sounds good. Mornings (EMEA time) probably sane. Network time sync probably tries to keep all of the machines sync'd timewise (you can see the drift when you look at all of the nodes in Jenkins actually).

Windows would be great 👍

Okay, cool - I'll get started on that then :-) I'll keep a backup of all the old config files in a directory somewhere, just in case.

(note to self)
Useful Documentation / resources I found (I'll update this as I go) :
"Setting up a Nagios Server on Ubuntu1604" : https://www.howtoforge.com/tutorial/how-to-install-nagios-on-ubuntu-16-04/
"Template-Based Object Configuration" : http://nagios.manubulon.com/traduction/docs25en/xodtemplate.html
"Event Handlers" : https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html

I've been looking through and making a script to generate a config file for each host automatically (I'll put it here once it's done), but I was looking at adding the service to query if a given node was connected to Jenkins, and there doesn't appear to be an immediately obvious way. I'd be able to use the check_by_ssh to run a script that looks for a java process running Jenkins, but that doesn't necessarily mean it's running as expected

Only Jenkins knows what is connected and what not. Therefore, I'd query https://ci.adoptopenjdk.net/computer/api/json?pretty=true. This would also allow us to define sets of nodes and be alerted if there are, for example, less than X machines with a specific label.

Is there any way to query the API to return the info for a single node? I can't find any documentation that shows how to use the API.

Append /api to any URL you open in Jenkins and you get the API. https://ci.adoptopenjdk.net/computer/build-azure-win2012r2-x64-1/api/json?pretty=true gives you info about build-azure-win2012r2-x64-1.

Ah! Excellent, thanks very much :-)

Okay, I wrote a script which I've been able to get working in Nagios

image

** For the purposes of testing, I called the node build-scaleway-ubuntu1604-x64-1. It's actually just a VM running on my machine, but the check_jenkins command I made uses what's defined as the hostname to query the Jenkins API

#!/bin/bash

if [ -z "$1" ]; then
  echo "UNKNOWN- Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

# Quote the URL so the shell does not treat '?' as a glob character
wget -q "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true" -O "jenkins_query_$1"
if [[ $? != 0 ]]; then
  echo "UNKNOWN- Failed to get agent information"
  rm -f "jenkins_query_$1"
  exit 3
fi

is_agent_offline=$(awk '/"offline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
is_agent_temp_offline=$(awk '/"temporarilyOffline"/{gsub("[,]","",$3); print$3}' < "jenkins_query_$1")
rm -f "jenkins_query_$1"

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi

Pretty simple, Syntax is ./check_agent <Name of Node>, and it runs on the Nagios Server itself, as it just queries the Jenkins API- also means we only have to put it on one machine instead of ~100ish.
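The decision at the end of the script follows the usual Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a rough sketch, that mapping can be pulled into a function so it can be exercised without contacting Jenkins. The name `agent_status` is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: maps the two Jenkins flags to a Nagios status line.
# Prints the message and returns the plugin exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
agent_status() {
  local offline="$1" temp_offline="$2"
  if [[ "$offline" == "false" ]]; then
    echo "OK - Jenkins agent is connected"; return 0
  elif [[ "$offline" == "true" && "$temp_offline" == "true" ]]; then
    echo "WARNING - Jenkins agent temporarily disconnected"; return 1
  elif [[ "$offline" == "true" && "$temp_offline" == "false" ]]; then
    echo "CRITICAL - Jenkins agent is fully disconnected"; return 2
  fi
  echo "UNKNOWN - Could not determine agent state"; return 3
}

rc=0
agent_status "true" "true" || rc=$?
echo "exit code: $rc"
```

Anything other than the literal strings "true" and "false" (for example an empty value after a failed download) falls through to UNKNOWN.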

Looks great, apart from one thing: awk isn't the best choice for querying JSON. https://stedolan.github.io/jq/ is much more reliable and digestible. curl is also more friendly for saving the response in a variable. Saves you the temporary file and the problems associated with it.

Pulled from a script on my disk:

CURL_RESPONSE=$(curl -s -H "Accept: application/json" -H "Authorization: Bearer $TOKEN" "https://example.com")

Updated to use JQ and curl :+1:

#!/bin/bash

if [ -z "$1" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 < agent_name >"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

# -f makes curl return non-zero on HTTP errors such as a 404 for an unknown agent
CURL_RESPONSE=$(curl -sf "https://ci.adoptopenjdk.net/computer/$1/api/json?pretty=true")
if [[ $? != 0 ]]; then
  echo "UNKNOWN - Failed to get agent information"
  exit 3
fi

is_agent_offline=$(echo "$CURL_RESPONSE" | jq .offline)
is_agent_temp_offline=$(echo "$CURL_RESPONSE" | jq .temporarilyOffline)

if [[ $is_agent_offline == "false" ]]; then
  echo "OK - Jenkins Agent is connected"
  exit 0
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "true" ]]; then
  echo "WARNING - Jenkins Agent temporarily disconnected"
  exit 1
elif [[ $is_agent_offline == "true" ]] && [[ $is_agent_temp_offline == "false" ]]; then
  echo "CRITICAL - Jenkins agent is fully disconnected"
  exit 2
else
  echo "UNKNOWN - Couldn't find 'offline' entry in JSON"
  exit 3
fi

Looks great.

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

On the Nagios server, I've made a backup of the objects and servers directories (just in case) at /usr/local/nagios/cfg_backup_281020. Next, I'm going to generate the .cfg files for all the servers. I've tested the following script and managed to get it working, but if there's anything I'm missing, let me know :-)

#!/bin/bash

# Abort early if no readable variable file was given
if [[ ! -f "$1" ]]; then
  echo "Input a variable file"
  exit 1
fi

source "$1"
export FILENAME="$HOSTNAME.cfg"


case $(echo "$DISTRO" | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]') in
  "ubuntu" | "debian")
    PKGMNGR="apt";;
  "rhel" | "centos")
    PKGMNGR="yum";;
esac

echo "DEBUG:
  FILENAME: $FILENAME
  HOSTNAME: $HOSTNAME
  ALIAS   : $ALIAS
  ADDRESS : $IP_ADDRESS
  DISTRO  : $DISTRO
  PKGMNGR : $PKGMNGR
  SPECIAL : $EXTRA
"  

echo " # Checks SSH to determine if the host is available
define host {
        use                             linux-server
        host_name                       $HOSTNAME
        alias                           $ALIAS
        address                         $IP_ADDRESS
        check_command                   check_ssh!-4 -t 60
        max_check_attempts              5
        check_period                    24x7
        notification_interval           30
        notification_period             24x7
}" >> $FILENAME

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Disk Usage
        check_command                   check_remote_disk!10%!5%!/
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service        
        host_name                       $HOSTNAME
        service_description             Updates-Required - $PKGMNGR
        check_command                   check_remote_${PKGMNGR}
        check_period                    once-a-day-at-8
}" >> $FILENAME

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check Free Memory
        check_command                   check_remote_mem!10!5
        check_interval                  30
}" >> $FILENAME

# This only runs on CentOS/RHEL 7+, as CentOS 6 doesn't use systemd
if [[ $(echo "$DISTRO" | tr -d '[:alpha:]') != 6 ]]; then
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Network Time Sync
        check_command                   check_remote_timesync
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi

echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check if Jenkins Agent Connected
        check_command                   check_agent!$HOSTNAME
        check_period                    once-a-day-at-8
}" >> $FILENAME

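# Only for the servers that provide services: adds the CPU load and SSL certificate checks.
# Note: check_remote_load is defined with warning/critical thresholds as arguments, but none are passed here.
if [[ $EXTRA == 1 ]]; then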
  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check CPU Load
        check_command                   check_remote_load
        check_interval                  10
}" >> $FILENAME

  echo "define service {
        use                             generic-service
        host_name                       $HOSTNAME
        service_description             Check_SSL_Cert
        check_command                   check_ssl_cert!$HOSTNAME         
        check_period                    once-a-day-at-8
}" >> $FILENAME
fi 

An example of the variable file is as follows:

export HOSTNAME="build-test-test-x64-1"
export ALIAS="Build Host"
export IP_ADDRESS="127.0.0.1"
export DISTRO=CentOS7
export EXTRA=1
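One thing to watch with the DISTRO value: the `case` statement in the generator has no default branch, so a distro it does not recognise leaves PKGMNGR empty and produces a `check_remote_` command that is not defined. A minimal sketch of a stricter mapping follows; the function name `pkg_manager_for` and the "SLES12" value are only illustrative and do not come from the script.

```bash
#!/bin/bash
# Hypothetical helper: derive the package manager from a DISTRO value such as
# "CentOS7" or "Ubuntu1604". Returns non-zero for anything it does not know.
pkg_manager_for() {
  local name="${1//[0-9]/}"                                  # strip the version digits
  name=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')   # normalise the case
  case "$name" in
    ubuntu|debian) echo "apt" ;;
    rhel|centos)   echo "yum" ;;
    *)             echo "unsupported distro: $1" >&2; return 1 ;;
  esac
}

pkg_manager_for "CentOS7"
pkg_manager_for "Ubuntu1604"
pkg_manager_for "SLES12" || echo "no package manager mapping, skipping the updates check"
```

With a non-zero return for unknown distros, the generator could skip the "Updates-Required" service instead of writing a broken definition.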

The once-a-day-at-8 time period is defined as :

define timeperiod{
        timeperiod_name once-a-day-at-8
        alias           Between 8am and 9am GMT every day
        sunday          9:00-10:00
        monday          9:00-10:00
        tuesday         9:00-10:00
        wednesday       9:00-10:00
        thursday        9:00-10:00
        friday          9:00-10:00
        saturday        9:00-10:00
}

According to a note left by Brad Blondin, the Nagios server is on CEST time, so to get 8-9am in GMT, the server will be 9-10am ( I think ).
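The offset depends on the time of year, so it may be worth checking rather than guessing. The sketch below converts 08:00 UTC to Central European local time for a summer and a winter date. It assumes GNU `date` (for `-d`) and uses a POSIX TZ string for CET/CEST; whether the Nagios server really follows that zone is taken from the note above, not verified.

```bash
#!/bin/bash
# Converts 08:00 UTC on a given date to Central European local time.
# The POSIX TZ string describes CET/CEST with the EU switch-over rules,
# so no tzdata lookup is needed. Requires GNU date for the -d option.
utc8_in_central_europe() {
  TZ="CET-1CEST,M3.5.0,M10.5.0/3" date -d "$1 08:00 UTC" +%H:%M
}

utc8_in_central_europe "2020-07-01"   # summer time (CEST, UTC+2)
utc8_in_central_europe "2020-11-15"   # winter time (CET, UTC+1)
```

If the timeperiod is evaluated in the server's local time, a window meant to cover 8-9am GMT all year would be 10:00-11:00 during summer time and 9:00-10:00 during winter time.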

The extra commands that need to be defined in /usr/local/nagios/etc/objects/commands.cfg are as follows:

##############
#
# COMMANDS ADDED (By Willsparker)
#
##############

define command{
        command_name    check_remote_disk
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$'
}

define command{
        command_name    check_remote_yum
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_yum -t 60'
}

define command{
        command_name    check_remote_apt
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_apt -t 60'
}

define command{
        command_name    check_remote_load
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes: https://github.com/justintime/nagios-plugins/tree/master/check_mem
define command{
        command_name    check_remote_mem
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_mem -f -C -w $ARG1$ -c $ARG2$'
}

# Note: This plugin needs to be manually installed on remote nodes
define command{
        command_name    check_remote_timesync
        command_line    $USER1$/check_by_ssh -t 360 -H $HOSTADDRESS$ -C '/usr/lib/nagios/plugins/check_timesync'
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_agent
        command_line    $USER1$/check_agent $ARG1$
}

# Note: This plugin needs to be manually installed on the Nagios server
define command{
        command_name    check_ssl_cert
        command_line    $USER1$/check_ssl_cert -H $ARG1$
}

I think that's all the prep work I need to do before re-doing the Nagios setup, except the notifications -
1) Should we keep it as it is, with nagios pinging the #infrastructure-bot channel?
2) Should I alter the notification period / interval?
3) Do all tasks need notifications enabled?

Would be great to have a script that checks for specific labels or label combinations and alerts if we lose a certain percentage of machines.

@aahlenst I'll have a look at adding that today :-)

@Willsparker Thanks for the great work. Question: Why no Ansible playbook for the config?

Honestly, I wasn't aware that the playbook could be used for the config :sweat_smile: I'll look to see if I can use the roles, and merge the script I wrote above, into the Nagios_Ansible_Config_tool.sh script mentioned in https://github.com/AdoptOpenJDK/openjdk-infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/Nagios_Master_Config/tasks/main.yml
It'll save me a lot of manual work :-)

@aahlenst
Script to check the percentage of machines online in the label

#!/bin/bash

if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
  echo "UNKNOWN - Invalid arguments"
  echo "Usage: $0 <Label> <Warning_Level> <Critical_Level>"
  exit 3
fi

if ! command -v jq &> /dev/null; then
  echo "UNKNOWN - JQ isn't installed"
  exit 3
fi

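# Get list of machines in label (jq -r prints raw strings, so no sed is needed to strip quotes)
mapfile -t machine_array < <(curl -s "https://ci.adoptopenjdk.net/label/$1/api/json" | jq -r '.nodes[] | .nodeName')

# For each machine, query if they're connected
response_array=()
for node in "${machine_array[@]}"
do
  response_array+=("$(curl -s "https://ci.adoptopenjdk.net/computer/${node}/api/json" | jq .offline)")
done

online=0
offline=0
for response in "${response_array[@]}"
do
  if [[ ${response} == "false" ]]; then
    online=$((online+1))
  else
    offline=$((offline+1))
  fi
done

# Avoid a division by zero if the label matched no machines
if (( online + offline == 0 )); then
  echo "UNKNOWN - No machines found in '$1' label"
  exit 3
fi

# Multiply before dividing so scale=2 does not truncate the ratio (e.g. 1/3 -> .33 -> 33.00)
export percentage_online=$(echo "scale=2; ($online * 100) / ($offline + $online)" | bc -l)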
if (( $(echo "$percentage_online < $3" | bc -l) )); then
  echo "CRITICAL - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 2 
elif (( $(echo "$percentage_online < $2" | bc -l) )); then 
  echo "WARNING - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 1
else
  echo "OK - $percentage_online% machines online in '$1' label"
  echo "$online online machines; $offline offline machines"
  exit 0
fi

Any requested changes?

More questions than change requests:

  • Why export percentage_online?
  • Can this handle label groups, for example build&&linux&&s390x?

I think ... I've got it working using the tools that Brad Blondin made when he initially set up the Nagios stuff!
image

Here's a list of things I had to do to make it work (which may or may not be related to me running this on a machine that isn't in the inventory):

  • Alter the Nagios_Ansible_Config_Tool.sh to remove all the references to sys_pingtest (this is due to the new template not having a ping service)
  • Alter the template.cfg to have the services we want to monitor
  • In the playbook:

    • Change the way we get the Nagios public key back to the old method (i.e. before https://github.com/AdoptOpenJDK/openjdk-infrastructure/pull/1560 ). I couldn't get the new one working due to ansible complaining that invalid key specified, on this task

    • Manually set provider variable

    • Manually set inventory_hostname, as it was picking it up as 'localhost', when it should have been the IP Address for the machine...

    • Change the ssh command needed to make the Nagios_Ansible_Config_Tool.sh script run. This seems to be a weird thing with Ansible whereby, if you run the ansible-playbook command with the -b option, when it gets to a task that is delegated to localhost, it sudos the localhost user too. Shouldn't be an issue if you run the playbook on the root user of a given machine (as the -b option won't be used)

    • Add a task to put the new network_time_sync script on the machine. I used the copy module for this, but when I PR this stuff, It'll be in the repo (I think- @aahlenst Am I able to commit that script to this repo, legally?)

    • Add the check_agent command to the Nagios Server (I'll commit this in Supporting Scripts, but it won't be needed in the playbook)

      • Installed JQ on the Nagios Server

For putting in the label checker script, I'll manually put that in, and have it run on the Nagios server :-)

To double-check, I'm going to try this on a C7 machine that @sxa provided me, but this time I'll connect it to Jenkins (without putting any labels on it) to see if that fixes a lot of my previous issues. Once I've got that working, I can do this for all the machines :-)

@aahlenst Removed the export - Not sure why I put that in. And apparently it does! (completely intentional!) You just need to make sure you put the labels in speech marks, otherwise bash doesn't like the & signs

will@will-XPS-13-9360:~/Documents/nagios_testing$ ./check_label.sh "build&&linux&&s390x" 50 30
OK - 100.00% machines online in 'build&&linux&&s390x' label
2 online machines; 0 offline machines

With the help of @gdams , We were able to get the test-aws-ubuntu1804-armv8-1 machine added to Nagios.
Final list of stuff to do:

  • [x] Backup the old version of the Nagios_Ansible_Config_Tool on the Nagios server
  • [x] Figure out how we want to do the notifications of each service so the #infrastructure-bot slack channel isn't constantly spammed.
  • [x] Create PR to: alter the Nagios_Ansible_Config_Tool, the templates and the relevant sections of the playbook; re-enable the Nagios roles; Put the plugins I wrote into additional_plugins.
  • [x] Remove all the old configs for the machines. (Already backed them up)
  • [x] Update Nagios Core to latest version
  • [ ] Run the Nagios_* roles on the machines.
  • [x] Manually add the check_label script to Nagios Server
  • [x] Change the Nagios server config to check the most common machine labels
  • [x] Update the Nagios server config in general

Additional:

  • [X] Setup a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins (This _should_ keep Nagios actually fairly useful, and not full of old machines we don't have)
  • [ ] Setup a method of checking that all machines in the inventory are monitored by Nagios (with a way of identifying exceptions, such as ad-hoc experimental machines, Windows machines (for now), or any machines we can't run the Nagios_* playbook roles on)

Setup a cron job on the Nagios Server that queries Jenkins, and removes the machine entry from /usr/local/nagios/etc/servers/, if it's no longer in Jenkins.

I'm not particularly keen on using Jenkins as the source of truth for our inventory. I'd rather let Ansible handle that (we have to remove the servers from the inventory anyway). Instead, I'd add a check that alerts us if a machine pops up in Jenkins that isn't known to Nagios (with the possibility to ignore dynamic agents).

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there. And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins ( i.e. test-will-debian-riscv-1/ ) that don't necessarily need monitoring.

Nagios Core has been updated to 4.4.6 from 4.3.4, following this guide

@aahlenst What labels should we be checking for?

The issue is, if we're removing a machine from the inventory (i.e. through a PR), Ansible isn't being used, and the machines will just stay there.

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.

And my only concern with that check is that sometimes ad-hoc machines are added to Jenkins ( i.e. test-will-debian-riscv-1/ ) that don't necessarily need monitoring.

This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

Yeah, but Nagios will start nagging us and we need more discipline in this area. Let's assume the worst and someone manages to temporarily remove machines from Jenkins. If Nagios automatically drops the machine, it might take us weeks to realize that something has happened. That gives me shivers.
This is just bad practice (I'm aware that I'm guilty here, too). We need to separate "regular" from experimental machines, for example by prefixing them with experimental-. And we need to diligently monitor the presence of machines, not just the accumulated number. Any unknown new machine being added should trigger a red alert.

Both good points. When you say "Any unknown new machine being added should trigger a red alert", are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)?

are you referring to any new machines being added to the inventory that aren't being monitored (except, for example, machines prefixed with experimental-)

If a new machine appears in Jenkins that is not known to Nagios, it should trigger an alert. For that to work, we need a list of known machines. One possibility to achieve that could be that we combine our inventory with a list of experimental machines that is maintained alongside the Ansible inventory.

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios. Otherwise, I would have concerns with maintaining 2 separate lists- before I looked at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/619 , even the inventory wasn't very well maintained.
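A rough sketch of that comparison: take the agent names Jenkins reports, drop the ones marked as experimental, and print any that have no matching config file. It assumes the configs are named `<hostname>.cfg` (as in the generator script earlier in this thread) and that experimental machines carry an `experimental-` prefix, which so far is only a proposal. The function name `unknown_agents` and the demo data are hypothetical.

```bash
#!/bin/bash
# Hypothetical check: prints agents that Jenkins lists but Nagios has no
# <name>.cfg for. Names starting with "experimental-" are ignored.
#   $1: file with one Jenkins agent name per line
#   $2: directory holding the Nagios host configs
unknown_agents() {
  local jenkins_list="$1" cfg_dir="$2"
  comm -23 \
    <(grep -v '^experimental-' "$jenkins_list" | sort -u) \
    <(find "$cfg_dir" -maxdepth 1 -name '*.cfg' -exec basename {} .cfg \; | sort -u)
}

# Demo with throwaway data instead of the real server directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/servers"
touch "$demo_dir/servers/build-a-1.cfg" "$demo_dir/servers/test-b-1.cfg"
printf '%s\n' build-a-1 test-b-1 test-c-9 experimental-riscv-1 > "$demo_dir/jenkins.txt"

unknown_agents "$demo_dir/jenkins.txt" "$demo_dir/servers"
```

On the Nagios server the second argument would be /usr/local/nagios/etc/servers/, and any output at all would be the trigger for an alert.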

Changed the timeperiod once-a-day-at-8 to 9:00-10:00, as (due to the nagios server being in CET) that refers to 8-9 in GMT, which should hopefully be after all the nightlies have finished.

I've been able to add all the machines I've got access to, as well as manually changed some of the machines I don't have access to.
The following are the machines that are yet to be added to/updated in Nagios:

build-linaro-centos76-armv8-2 (timeout)
build-packet-ubuntu1804-armv8-1 (Connection closed)

docker-aws-ubuntu1604-x64-1 x
docker-aws-ubuntu1604-x64-2 x
docker-godaddy-ubuntu1604-x64-1 x
docker-scaleway-ubuntu1604-armv7-1 (Connection Closed)

test-ibm-aix71-ppc64-1 x
test-ibm-aix71-ppc64-2 x
test-ibmcloud-ubuntu1604-x64-1 x
test-macstadium-macos11-arm64-1 x
test-macstadium-macos11-arm64-2 x

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

@Willsparker adding to your requirements, I would like to see "Alerts" added to warn (via Slack) that a given node's free disk space is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free!
This is so we get a heads-up that a node is close to being taken offline by Jenkins and we can act before it fails a nightly build....

We could use the config files in /usr/local/nagios/etc/servers/ to check for machines that are known to Nagios.

Ideally, the Ansible inventory serves as single source of truth for those config files. If that isn't possible at the moment, we have to live with it, but should have a ticket that states what's left to do.

That should absolutely be accurate as far as "live" machines are concerned, and I've been acting to resolve any discrepancies as soon as they show up, so hopefully we don't need a ticket with a long list of todos on that one ;-) Just PRs put in to fix them. So yes, that list should be definitive.

@Willsparker adding to your requirements, I would like to see "Alerts" added to warn (via Slack) that a given node's free disk space is within a certain margin (say ~3GB) of our 10GB Jenkins offline limit, i.e. warn if <13GB free!
This is so we get a heads-up that a node is close to being taken offline by Jenkins and we can act before it fails a nightly build....

@andrew-m-leonard If I'm understanding correctly, we should already have that, but it works based on percentage of machine's disk ( see here ). This sends it to the #infrastructure-bot channel on Slack. Currently it's warning at 20% diskspace and critical at 10%, at both times the slack channel will be updated. I don't believe I've seen a machine with less than 100GB total diskspace, so we should be warned before the ~3GB limit you suggested.
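Whether a percentage threshold fires before the 13GB mark depends on the disk size, so here is the arithmetic as a small sketch. The 100GB figure comes from the comment above; the 50GB disk is a made-up counter-example, and both function names are hypothetical.

```bash
#!/bin/bash
# Free space (in whole GB) left at the moment a percentage threshold trips.
#   $1: total disk size in GB, $2: threshold in percent
free_gb_at_threshold() {
  echo $(( $1 * $2 / 100 ))
}

# Succeeds if the percentage threshold fires at or above $3 GB of free space.
#   $1: total disk size in GB, $2: threshold in percent, $3: absolute limit in GB
warns_before() {
  (( $(free_gb_at_threshold "$1" "$2") >= $3 ))
}

free_gb_at_threshold 100 20   # 100 GB disk, 20% warning
free_gb_at_threshold 100 10   # 100 GB disk, 10% critical
free_gb_at_threshold 50 20    # hypothetical 50 GB disk, 20% warning
```

On a 100GB disk the 20% warning trips at 20GB free, comfortably ahead of 13GB. On a 50GB disk it would trip at 10GB, which is already at the Jenkins offline limit, so the percentage approach only covers the request for disks of roughly 65GB and larger.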

Considering the vast majority of machines have been added to Nagios and this issue is very non-specific now, I'm going to close it in favour of some more specific ones
