When it comes to service monitoring, Nagios is a pretty good tool,
though it can only monitor services from the outside. To monitor
metrics that are not exposed to the network, like disk usage, you need
an agent running on the monitored server. nagios-statd, written as a
Python daemon, is such an agent.

From a management and security perspective it is good to keep the
number of processes and open network ports as small as possible.
Those who not only monitor service availability with Nagios, but also
resource usage with Ganglia, already have the Ganglia agent (gmond)
running on each server. Adding nagios-statd would mean a second agent
that needs to be configured and kept up to date.

But what if you want to be alerted when one of the hard disks gets
full, load goes through the roof or the server begins using swap space?
You would still need nagios-statd, even though Ganglia already knows
about all of this.

That's why I came up with the idea of taking the data from Ganglia and
monitoring it with Nagios, and wrote a small PHP shell script that
accomplishes this task. There is certainly still some work to do, but I
think it's a good start:

    #!/usr/bin/php
    <?php
    ### Get command line arguments
    $host               = $argv[1];
    $metric             = $argv[2];
    $metric_unit        = $argv[3];
    $cluster_arg        = isset($argv[4]) ? $argv[4] : null; # optional
    $threshold_warn_arg = isset($argv[5]) ? $argv[5] : null; # optional
    $threshold_crit_arg = isset($argv[6]) ? $argv[6] : null; # optional

    ### Fill variables, falling back to defaults
    if (!is_numeric($threshold_warn_arg))
        $threshold_warn = 2;
    else
        $threshold_warn = $threshold_warn_arg;

    if (!is_numeric($threshold_crit_arg))
        $threshold_crit = 1;
    else
        $threshold_crit = $threshold_crit_arg;

    if (!$cluster_arg)
        $cluster = "sk";
    else
        $cluster = $cluster_arg;

    ### Get data from gmond (it sends its XML as soon as you connect)
    $buffer = "";
    $fp = fsockopen("localhost", 8649, $errno, $errstr, 30);
    if (!$fp) {
        echo "GANGLIA Unknown - $errstr ($errno)\n";
        exit(3);
    } else {
        while (!feof($fp)) {
            $buffer .= fgets($fp, 128);
        }
        fclose($fp);
    }

    ### Get metric out of XML
    $xmlobj = simplexml_load_string($buffer);
    if ($xmlobj === false) {
        echo "GANGLIA Unknown - could not parse XML from gmond\n";
        exit(3);
    }
    $metric_nodes = $xmlobj->xpath("/GANGLIA_XML/CLUSTER[@NAME='$cluster']/HOST[@NAME='$host']/METRIC[@NAME='$metric']");
    if (empty($metric_nodes)) {
        echo "GANGLIA Unknown - metric $metric not found for host $host\n";
        exit(3);
    }

    ### Convert data (more types tbd)
    $metric_attr = $metric_nodes[0]->attributes();
    if ($metric_attr["TYPE"] == "double")
        $metric_value = doubleval((string) $metric_attr["VAL"]);
    else
        $metric_value = (string) $metric_attr["VAL"];

    ### Build output strings
    # Perfdata format is label=value[unit];warn;crit (thresholds carry no unit)
    $perfcounter = $metric . "=" . $metric_value . $metric_unit . ";" . $threshold_warn . ";" . $threshold_crit;
    $text = $metric . " is " . $metric_value . $metric_unit;

    ### Output (lower values are worse, as with disk_free)
    if (!is_numeric($metric_value)) {
        echo "GANGLIA Unknown\n";
        exit(3);
    }

    if ($metric_value > $threshold_warn && $metric_value > $threshold_crit) {
        print("GANGLIA OK - " . $text . " |" . $perfcounter . "\n");
        exit(0);
    }

    if ($metric_value > $threshold_crit) {
        print("GANGLIA Warning - " . $text . " |" . $perfcounter . "\n");
        exit(1);
    }

    # Value is at or below the critical threshold
    print("GANGLIA Critical - " . $text . " |" . $perfcounter . "\n");
    exit(2);
    ?>

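To see what the XPath lookup operates on without a running gmond, here is a minimal sketch that runs the same query against a hand-written XML sample. The sample is an assumption for illustration only: it is heavily trimmed, real gmond output carries more attributes and many more metrics, and the value is made up. The helper name ganglia_metric_value is hypothetical and not part of the script.

```php
<?php
// Hand-written, trimmed sample of gmond XML (assumption for illustration;
// real output has more attributes per element and many more metrics).
$sample = <<<XML
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="sk">
    <HOST NAME="xxx1">
      <METRIC NAME="disk_free" VAL="123.456" TYPE="double" UNITS="GB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
XML;

// Hypothetical helper: returns the metric's VAL attribute as a string,
// or null if the XML cannot be parsed or the metric is not present.
function ganglia_metric_value($xml, $cluster, $host, $metric)
{
    $xmlobj = simplexml_load_string($xml);
    if ($xmlobj === false) {
        return null;
    }
    $nodes = $xmlobj->xpath(
        "/GANGLIA_XML/CLUSTER[@NAME='$cluster']/HOST[@NAME='$host']/METRIC[@NAME='$metric']"
    );
    if (empty($nodes)) {
        return null;
    }
    return (string) $nodes[0]['VAL'];
}

echo "disk_free: " . ganglia_metric_value($sample, 'sk', 'xxx1', 'disk_free') . "\n";
// prints "disk_free: 123.456"
```

Returning null for a missing metric makes it easy to map that case to the Nagios "Unknown" state (exit code 3) instead of failing on an empty result.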
Save the script as check_ganglia in your Nagios plugin directory (on Debian this defaults to /usr/lib/nagios/plugins) and make it executable (chmod +x check_ganglia). You can then define a new command in your Nagios config as follows (this example checks disk_free):

    define command{
        command_name    check_ganglia_disk_free
        command_line    /usr/lib/nagios/plugins/check_ganglia $ARG1$ disk_free GB
    }

After that, add a new service to your host config:

    define service{
        use                     generic-service
        host_name               xxx1
        service_description     DISK
        check_command           check_ganglia_disk_free!xxx1
    }

Restart Nagios and you should be able to monitor disk usage from then on.
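
The script treats low values as bad, which fits disk_free but not metrics like load, where high values are the problem. One possible way to cover both cases is sketched below. It is not part of the script above: the helper name nagios_status and the rule of inferring the direction from the order of the two thresholds are assumptions of this sketch.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns a Nagios exit code: 0 = OK, 1 = Warning, 2 = Critical.
// The direction is inferred from the thresholds (assumption of this sketch):
//   warn >  crit: low values are bad  (e.g. disk_free with warn 2, crit 1)
//   warn <= crit: high values are bad (e.g. load_one with warn 4, crit 8)
function nagios_status($value, $warn, $crit)
{
    if ($warn > $crit) {
        if ($value <= $crit) return 2;
        if ($value <= $warn) return 1;
        return 0;
    }
    if ($value >= $crit) return 2;
    if ($value >= $warn) return 1;
    return 0;
}

$labels = array(0 => 'OK', 1 => 'Warning', 2 => 'Critical');
echo $labels[nagios_status(1.5, 2, 1)] . "\n"; // disk_free of 1.5 GB: prints "Warning"
echo $labels[nagios_status(9.0, 4, 8)] . "\n"; // load_one of 9.0: prints "Critical"
```

With something like this in place, the same plugin could serve a load or swap check simply by passing thresholds in ascending order.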