Rule #1: Run To The Console Now!

Thursday 11 October 2007

There’s a simple rule of thumb that I use to evaluate any new alert I want to put into my environment: each one must describe a situation that means “run to the console now” to fix the problem. Unfortunately, many shops design monitoring as if it absolves them from having to check their applications manually. It does not. Alerts should tell you only about events that require your immediate attention. Everything else should either not alert at all or go to a queue that you check occasionally.

This standard is simple to remember and easy to apply. But it’s not a rule that you can apply only occasionally. It will force you to be ruthless about eliminating every alert that falls short.
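As a minimal sketch of the rule, here is one way the split between “alert now” and “check occasionally” could look in Python. The `Event` class, the `needs_immediate_action` flag, and the `triage` function are hypothetical names made up for illustration; they don’t come from any particular monitoring tool, and deciding how that flag gets set is the real work.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str
    message: str
    # Hypothetical flag: would someone have to run to the console now?
    needs_immediate_action: bool


def triage(events):
    """Split events into pages (alert now) and a queue to review occasionally."""
    pages, review_queue = [], []
    for event in events:
        if event.needs_immediate_action:
            pages.append(event)
        else:
            review_queue.append(event)
    return pages, review_queue


# Example events, invented for illustration.
events = [
    Event("db01", "primary database is down", True),
    Event("web03", "disk is 70% full", False),
]
pages, review_queue = triage(events)
```

Only the first event would page anyone here; the second waits in the queue until somebody looks at it.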

The reason for this is simple. Have you ever been in a building that kept giving you false fire alarms? Did you even get out of your chair by the fifth time it happened? Alerts work the same way. If there was really a fire every time the fire alarm went off, you’d pay attention to every alarm. If the alarm was only sometimes right, you’d be inclined to ignore all of them, including the real ones.

There’s another aspect to this issue that’s not necessarily obvious before you set out to design your alerts. If you got a fire alarm for another building in another city, and you weren’t responsible for doing anything about it, how useful would the alert be? Every single alert must go only to the people who are responsible for fixing the problem. This is another rule that gets broken repeatedly by many groups. They send alerts to everyone on a team when only a few people are responsible for the systems. This is a very steep, slippery slope to alerts that get ignored by everyone. Something in our psychology makes us desensitize quickly, even if only a few of the alerts we get are unnecessary.
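The routing rule above can be sketched the same way. This assumes you keep some kind of ownership table that maps each system to the people responsible for fixing it; the `OWNERS` table, the system names, and the `recipients` function are all hypothetical.

```python
# Hypothetical ownership table: system name -> people responsible for fixing it.
OWNERS = {
    "billing-db": ["alice"],
    "web-frontend": ["bob", "carol"],
}


def recipients(system, owners=OWNERS):
    """Return only the people responsible for the system, never the whole team."""
    # A system with no listed owner returns an empty list, which is a sign
    # that the ownership table needs fixing before the alert is turned on.
    return list(owners.get(system, []))
```

With this table, `recipients("billing-db")` returns only `["alice"]`, so bob and carol never see an alert they can’t act on.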

Finally, you should always start with fewer alarms and add more later. If you were in that building with false fire alarms for a month, and then they fixed the problem, you’d still be suspicious the next time you heard an alarm. Most people blame the tool when, in fact, it might have been misconfigured, or configured to alert too often.

Fortunately, there’s a fairly straightforward method of determining just the key alerts for an application, no matter what it is. It’s called critical path analysis, and I’ll go over it once I’ve finished explaining the three rules of good alerting.

Posted by randall / Filed under: Techniques
