Posts Tagged “Nagios”

Pre-req reading:

Nagios customization: Alerting via SMS, or anything you like!

Making the bird tweet using python

or
Update twitter in a single line

This entry covers how to send Nagios alerts to Twitter. The examples that follow use curl, but you can use the Python example (link above) in its place.

Firstly edit /usr/local/nagios/etc/objects/commands.cfg

And add the two following commands.

UPDATE 24/03/2011 Twitter no longer supports basic auth, use my oAuth updater here

define command {
        command_name    notify-by-twitter
        command_line    /usr/bin/curl --basic --user "twitteruser:twitterpassword" --data-ascii "status=[Nagios] $NOTIFICATIONTYPE$ $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$" http://twitter.com/statuses/update.json
}

define command {
        command_name    host-notify-by-twitter
        command_line    /usr/bin/curl --basic --user "twitteruser:twitterpassword" --data-ascii "status=[Nagios] $HOSTSTATE$ alert for $HOSTNAME$" http://twitter.com/statuses/update.json
}
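
To preview what the status text looks like once Nagios has filled in the macros, here is a minimal Python sketch. Nagios performs the real substitution itself before running the command; the `expand_macros` helper and the example values are hypothetical and only illustrate the `$NAME$` placeholder convention.

```python
import re

# Status template copied from the notify-by-twitter command_line above.
STATUS_TEMPLATE = "[Nagios] $NOTIFICATIONTYPE$ $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$"


def expand_macros(template, macros):
    """Replace each $NAME$ token with its value.

    In this sketch an unknown macro becomes an empty string; that is an
    assumption of the example, not a statement about Nagios itself.
    """
    return re.sub(r"\$([A-Z0-9_]+)\$", lambda m: macros.get(m.group(1), ""), template)


# Hypothetical values, for illustration only.
example = {
    "NOTIFICATIONTYPE": "PROBLEM",
    "HOSTALIAS": "web01",
    "SERVICEDESC": "HTTP",
    "SERVICESTATE": "CRITICAL",
}

status = expand_macros(STATUS_TEMPLATE, example)
print(status)  # [Nagios] PROBLEM web01/HTTP is CRITICAL
```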

Now define a contact for this twitter service

/usr/local/nagios/etc/objects/contacts.cfg

define contact{
        contact_name                    twitter
        service_notification_commands   notify-by-twitter
        host_notification_commands      host-notify-by-twitter
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    a
        host_notification_options       a
}

Choose your own notification options. For my feed I only choose alerts, and I have this send updates to a ‘private feed’ which I then follow.

Add this contact into your existing contact groups, i.e.

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin,sms_alert,twitter
        }

Then verify the configuration (nagios -v against your nagios.cfg) to ensure you have no syntax errors, and restart Nagios.

Trigger an alert by manually switching a monitored service off or entering a manual result to test.


I meant to note this down yesterday, but everything is going ten to the dozen at the moment.

Basically, I have now authored a Nagios addon for monitoring master-master replication between two servers. It carries out 4 stages of checks:

  1. Validates all required data is passed by servers
  2. Slave IO is running on both servers
  3. Seconds_Behind_Master check, args can be passed to vary warn and critical thresholds
  4. (slave) Master_Log_File == (master) File

The 5th check was a comparison of the binlog positions themselves, comparing (slave) Read_Master_Log_Pos and (master) Position.

Herein lies the problem, which took a while to track down: no matter what I tried, the slave was ALWAYS behind the master position … but why?

The reason is the same one that led me to design the High Availability solution in the first place: a very high traffic level, in the region of 20,800 transactions per second.

Why was this the problem? The two queries run to gather the data are done sequentially per server. Using the Python time library I was able to find that there is a 0.02s interval between gathering the datasets (20 milliseconds), and in that time 416 transactions had taken place.

i.e.

time: binlog pos

Slave A

0.000: 100

Master B

0.020: 516
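
The arithmetic behind that gap can be written out as a small Python sketch. The `expected_drift` name is hypothetical, and it assumes, as the example above does, that each transaction moves the position by one unit, which is a simplification of real binlog byte offsets.

```python
def expected_drift(transactions_per_second, gap_seconds):
    """Transactions that commit between two samples taken gap_seconds apart."""
    return transactions_per_second * gap_seconds


# Figures from the post: 20,800 transactions per second, 0.02s between the two queries.
drift = expected_drift(20800, 0.02)
print(round(drift))  # 416

# Position sampled on Slave A at t=0.000, then on Master B 0.02s later.
slave_pos = 100
master_pos = slave_pos + round(drift)
print(master_pos)  # 516
```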

This unfortunately has now led to some 32 lines of code being commented out, as I can see no way to reliably use the binlog positions for monitoring replication in this situation. If any delay occurs at any point during the dataset collection (network latency, a delay in query processing due to a traffic peak on one server, etc.), the collected samples will always be different.

The only way I can ever see this working is if you can validate that the datasets came from the exact same point in time, down to the nanosecond. This, however, is again not possible: on the network where the servers currently reside there is a 0.13 millisecond ping response time, which works out to 130,000 nanoseconds (0.00013 * 10^9).

If anyone has any theories on how to overcome this please let me know.

NOTE: Because this addon was programmed during working hours, the Nagios addons are not for public release at this time. This may change in the future should my employers allow their release.


So I find myself needing to tweak my Nagios installation a little bit; in this case I found the need for “out of hours” SMS alerts.

Nagios doesn’t cater for this natively. It does, however, allow you to create your own custom commands, which lets you specify a script to be executed.

Now I am going to assume you are already quite familiar with Nagios, so here is the command definition from my installation.


# 'alert-by-sms' command definition
define command{
command_name alert-by-sms
command_line /etc/nagios/alert-by-sms.php "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **"
}

As you can see, all this command definition really does is execute a PHP script. Bear in mind that using

“/path/to/php /path/to/script”

as the command_line does not seem to work, so just add “#!/path/to/php -q” to the top of the PHP script (before the opening <?php tag) and chmod +x the file.

The PHP script used here takes $argv[1] and passes it into a function specific to the SMS API I use; the phone number and API definitions are hard coded into the script.
You don’t really need me to upload my script, and if you do then you shouldn’t be attempting this …

Basically Nagios will execute the script, as defined at command_line, the script can do anything you choose.

Now to implement the command so it is actually used. I am pretty sure this entry in “timeperiods.cfg” is the default, but just in case, here it is.

# 'nonworkhours' timeperiod definition
define timeperiod{
timeperiod_name nonworkhours
alias Non-Work Hours
sunday 00:00-24:00
monday 00:00-09:00,17:00-24:00
tuesday 00:00-09:00,17:00-24:00
wednesday 00:00-09:00,17:00-24:00
thursday 00:00-09:00,17:00-24:00
friday 00:00-09:00,17:00-24:00
saturday 00:00-24:00
}
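
As a quick way to sanity-check which moments this timeperiod covers, here is a minimal Python sketch of the same schedule. Nagios evaluates the timeperiod itself; the `is_non_work_hours` function is hypothetical, and treating exactly 09:00 as the start of work hours is an assumption of this sketch.

```python
from datetime import datetime, time

WORK_START = time(9, 0)   # weekday ranges above end at 09:00 ...
WORK_END = time(17, 0)    # ... and resume at 17:00


def is_non_work_hours(moment):
    """Return True if moment falls inside the 'nonworkhours' schedule above."""
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    if moment.weekday() >= 5:
        return True  # saturday and sunday are 00:00-24:00
    # Monday to Friday: everything outside 09:00-17:00.
    return not (WORK_START <= moment.time() < WORK_END)


# 24 March 2011 was a Thursday, 26 March 2011 a Saturday.
print(is_non_work_hours(datetime(2011, 3, 24, 12, 0)))  # False
print(is_non_work_hours(datetime(2011, 3, 24, 17, 0)))  # True
print(is_non_work_hours(datetime(2011, 3, 26, 12, 0)))  # True
```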

This is what I use for the “out of hours” definition. Now to implement the SMS alerting: for this I have simply created a new contact definition in “contacts.cfg”. Granted, this means there are now two contact definitions for myself.

define contact{
contact_name out_of_hours
alias Out Of Hours Mobile

service_notification_period nonworkhours
host_notification_period nonworkhours
service_notification_options c,u,r,f
host_notification_options d,u,r
service_notification_commands alert-by-sms
host_notification_commands alert-by-sms
email HIDDEN EMAIL

}

This can be further customized depending on your setup. In this case the contact is me and I want to receive alerts for all servers & services, so I just add the contact “out_of_hours” into the admins contact group.

define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagios-admin,out_of_hours
}

So there you have it: you now have the groundwork to make Nagios fire you alerts any way you like. You could go as far as having it call you via an attached modem if you _really_ want, but the day you want your servers talking to you via phone call is the day you need to switch to decaf and head out to the pub once in a while.

Now just run “nagios -v /path/to/nagios.cfg” to do a quick sanity check and make sure there are no errors (if you have any, go back and fix them and run nagios -v again!). If all is OK, run /etc/init.d/nagios restart (or the equivalent for your distribution).

As always if you run into problems drop me a comment :-)
