Server Monitoring - Few Winners
5/10/2009
As a programmer, I like to know how my applications are performing. I like pretty graphs of response times and I really want to know if they blow up. In our department we've been running a very old installation of Big Brother (BB) from before Quest. I kid you not. It's old but it works with relatively little fuss, and the sheer lack of a compelling enough competitor has kept it humming away all these years.
Still, BB is very simplistic and we'd like to step up to something from this century. We've literally been watching for years for a suitable Open Source replacement to emerge but nothing seems to fit. Of course, we're familiar with Hobbit but it's in the same vein as BB. We were never really that happy with BB but it was already in place. Inertia is a powerful force.
And just yesterday I read that Nagios has been forked. Maybe I'm not the only one unhappy with the available choices. I'm about this close to writing my own. Might be fun.
Below is more or less my personal list of gripes, minus the names of the guilty. I really have no interest in gunning down well-meaning projects. Of course, some score better than others but none seem to do it all. You'll notice I'm mostly concerned with the server itself, since for the most part the agents work great.
Want to add to the list? What do you use to monitor your apps?
Slow. Most of the time, if I'm checking the site I just want to see a graph or check that a specific service is working. It shouldn't take forever. That includes navigation -- it should be easy to find historical data.
Use existing agents. Not every monitor needs its own agents; there are plenty out there. $new-fangled-monitor would ideally work with agents I can apt-get (with Nagios being high on that priority list).
All configuration for alerts, plugins and tests should be stored on the server and centrally managed.
Should be able to make mass changes to alerts and agent configurations, something most lack.
If it uses a database, it should be able to use the major Open Source databases and at least Oracle (if I'm forced).
It should automatically alert on obvious things. If I've set up a ping or HTTP test on a server I probably also want to know if it stops responding. Just allow for a way to override the default.
Should only alert once. I don't really want to take the time to designate which alerts are critical and which are not. That can add up to a lot of configuration time, and I have plenty of stuff to watch. I'll get the email and decide if it's worth checking out right now or not. Not to mention, since one outage can cause cascading outages, I don't want to also cause an email outage.
On that note, it should have easily adjustable change windows for planned maintenance.
Configuration should be dead simple. I have better things to do than spend all day fooling with the monitor server. I don't mind editing text files so much, but they need to be well documented like Apache's. The problem with text files is that you often don't know the possible values. XML is for programs and not for end users to hand edit.
Should integrate with the network. Here our network is unfortunately run by mostly Windows servers. At least that means I shouldn't have to set up users, manage passwords, etc. Single sign-on with Kerberos or NTLM is a must.
On that note, don't require logins for status pages. Or at least be able to allow access for any authenticated domain user. It's not a state secret. They're already on the network if they reached the monitor. If they cared, they could ping the server themselves. Automated monitoring is supposed to make it easier.
RSS feeds or portlets and possibly some embeddable AJAX widget would be a great way to integrate with the applications and various other web servers. I'd love to have a page in my own web apps where users could check the status of various systems and progress on fixing them.
Give me a way to configure a page or dashboard just for stakeholders. I want to email them a URL and let them see for themselves that the application is working.
It should look nice, too. I'm not sure why, but most of the monitoring solutions are ugly. Again, I want to give this to the business and let them get a warm fuzzy that everything is working. It should be simple, professional and quickly communicate where problems lie. They're not going to build their own dashboard with flashing lights and server pictures. They just want to know what broke.
Should have a developer's API. Everything and everybody knows HTTP. We have great proxies and load balancers. Firewalls know all about HTTP. It doesn't make sense to write a new protocol. Should be usable from shell scripts.
Should always page if an agent stops sending updates. That seems kinda basic, but I shouldn't have to configure an alert for each and every one. Of course, still allow for a way to override the default.
A nice mobile page is a must. I might not be in the office or I might be upgrading my workstation again.
Should work through, over and under firewalls. Unbelievably, this was an issue with one I tried.
Speaking of dumb problems, one I tried would show a blank page if my cookie expired. I'd have to manually remove it to log in again. Not awesome. The basics are important.
Statically typed language. Edit: eek, that's what I meant. I know that's a bit controversial but this is only my personal preference. Simply put, I'm probably going to install a monitor and forget about it until it breaks. I've been bitten by upgrading the PHP/Python/Perl package often enough that I'd prefer something less prone to incompatible changes.
It should not require an agent or convoluted configuration to set up a simple HTTP test from the monitor server itself. Oddly, one I tried required something like 30 clicks to set up a simple ping, not to mention a lot of head-scratching.
Server should run on Linux, obviously.
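To make a few of the alerting items above concrete (alert only once, adjustable change windows, and paging by default when an agent goes quiet), here is a rough Python sketch of the policy I have in mind. Nothing here comes from an existing monitor: AlertPolicy, MaintenanceWindow, the method names and the ten-minute default are all made up for illustration, and a real server would also need persistence and an actual notifier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class MaintenanceWindow:
    """A planned change window for one host (hypothetical type)."""
    host: str
    start: datetime
    end: datetime

    def covers(self, host, now):
        return host == self.host and self.start <= now < self.end


@dataclass
class AlertPolicy:
    """Decides whether a check result should send a notification."""
    # Assumed default: any agent silent for 10 minutes is a problem.
    stale_after: timedelta = timedelta(minutes=10)
    # Per-host overrides of the default, e.g. for batch boxes.
    stale_overrides: dict = field(default_factory=dict)
    windows: list = field(default_factory=list)
    # (host, check) pairs that have already alerted and not yet recovered.
    open_alerts: set = field(default_factory=set)

    def in_maintenance(self, host, now):
        return any(w.covers(host, now) for w in self.windows)

    def should_notify(self, host, check, ok, now):
        key = (host, check)
        if ok:
            # Recovery re-arms the alert for the next outage.
            self.open_alerts.discard(key)
            return False
        # Stay quiet during a change window or if we already alerted.
        if self.in_maintenance(host, now) or key in self.open_alerts:
            return False
        self.open_alerts.add(key)
        return True

    def check_heartbeat(self, host, last_seen, now):
        # Every agent gets this check without any per-host configuration.
        limit = self.stale_overrides.get(host, self.stale_after)
        ok = (now - last_seen) <= limit
        return self.should_notify(host, "heartbeat", ok, now)


if __name__ == "__main__":
    now = datetime(2009, 5, 10, 12, 0)
    policy = AlertPolicy()
    # First failure notifies, the repeat is suppressed.
    print(policy.should_notify("web1", "http", ok=False, now=now))
    print(policy.should_notify("web1", "http", ok=False, now=now))
```

The point of the design is that the defaults do the right thing with zero configuration, and the overrides are just data (a dict entry or a window record) that could be changed in bulk from the server.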
Would you use something that matched that description?