Rule #2: Actionable Alerts
Tied very closely to the rule that all alerts must mean “run to the console now” is the concept that all alerts must be actionable. Informational alerts become noise, even when the event behind them is interesting. Every actionable alert has an action associated with it that can fix the system or get your users working again.
When I first implemented monitoring of my systems in the ’90s, I set an alarm to go off if CPU utilization stayed above 90% for more than half an hour straight. The alert worked perfectly: it went off when the CPU was heavily used. We ran to the console and looked at the server stats. The server had been running at high utilization for a long time. In fact, it had been high almost all day. But what were we supposed to do about it? The alert flashed on throughout the day, and all we could do was stare at it. The system was just busy; there was nothing wrong.
Certainly, high utilization can affect the users, except of course when it doesn’t. If you’re running a report at 3 AM that needs to be done by 8 AM, the fact that it takes until 4:30 AM isn’t a problem, even if utilization has maxed out. Generally, I recommend using a user experience monitor for alerts such as these. For example, for web-based apps, you can use a URL monitor to check response times and set a time limit. Even then, make sure that the response time threshold you set makes sense, and the alert should probably fire only after the threshold has been exceeded more than once. Then, if you’ve done a good job collecting the data you need to analyze problems such as these, you can look at your stats to find the cause. (And, of course, talking about statistics is a topic for another day. In fact, many other days. It’s a big topic, and of equal importance to alerting. It’s also the presentation that I’ll be giving in Orlando at NetConnect next week.)
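As a rough illustration of the URL-monitor idea, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than a feature of any particular monitoring product: the names `measure_response_seconds` and `ResponseTimeAlert`, the 2-second threshold, and the choice of three consecutive slow checks are all hypothetical. It times a request using only the standard library and raises the alert only after the threshold has been missed several checks in a row.

```python
import http.client
import time
import urllib.request
from collections import deque


def measure_response_seconds(url, timeout=10.0):
    """Fetch the URL once and return elapsed seconds, or None on a network/HTTP error."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
    return time.monotonic() - start


class ResponseTimeAlert:
    """Fires only after `required_breaches` consecutive slow or failed checks."""

    def __init__(self, threshold_seconds, required_breaches=3):
        self.threshold_seconds = threshold_seconds
        # Keep only the most recent results; older ones fall off automatically.
        self.recent = deque(maxlen=required_breaches)

    def record(self, elapsed_seconds):
        """Record one check and return True if the alert should fire now."""
        breached = elapsed_seconds is None or elapsed_seconds > self.threshold_seconds
        self.recent.append(breached)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical usage: one slow response is ignored, three in a row are not.
alert = ResponseTimeAlert(threshold_seconds=2.0, required_breaches=3)
results = [alert.record(t) for t in (3.0, 2.5, None)]
```

In a real deployment a scheduler would feed `measure_response_seconds(url)` into `record()` every minute or so. The threshold and the number of required breaches are the knobs to tune against what your users actually notice.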
The good thing about creating actionable alerts is that they force you to come up with a list of the critical errors that can occur in your applications, and then determine what needs to be done to fix them. This links directly to another key part of monitoring: procedures for dealing with these alerts. ITIL provides good terminology and a framework for dealing with these items, but even ITIL won’t help you if you don’t have a set of procedures that relate directly to your organization, operations, applications, users, and the systems themselves.
Here’s the short version on procedures, which are another topic for future articles at effectivemonitoring.com: for each alert that you plan on handling, can you imagine the physical actions that you will take to deal with the issue? If not, you don’t have an actionable alert. If you can imagine what you’d do when you got that alarm, write it down! You’ve got an excellent start on a procedure.
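One way to apply that test mechanically is to keep each alert next to its written procedure and flag any alert that has none. The sketch below is only an illustration: the `Alert` class, the `split_by_actionability` helper, and the sample alerts and steps are all hypothetical and not taken from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    condition: str
    # The written steps a person will perform when this alert fires.
    procedure: list = field(default_factory=list)

    def is_actionable(self):
        """An alert is actionable only if at least one non-blank step is written down."""
        return any(step.strip() for step in self.procedure)


def split_by_actionability(alerts):
    """Return (actionable, informational) lists, preserving input order."""
    actionable = [a for a in alerts if a.is_actionable()]
    informational = [a for a in alerts if not a.is_actionable()]
    return actionable, informational


# Hypothetical examples: the second alert has no procedure, so it is just noise.
alerts = [
    Alert(
        name="login page slow",
        condition="response time over 2 s on three consecutive checks",
        procedure=["Check the web server logs", "Restart the application pool"],
    ),
    Alert(name="cpu busy", condition="CPU over 90% for 30 minutes"),
]
actionable, informational = split_by_actionability(alerts)
```

Anything that lands in the `informational` list is a candidate for a statistics graph rather than an alarm.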
It should be clear at this point that Rule #2, “All alerts must be actionable,” is another powerful filter for your alerting, and should further reduce the number of monitors that you implement. The next rule is the only one of the three that increases them.
