
Rule #3: Only Alert The Ones Responsible

Posted on Wednesday 7 November 2007

Can you imagine a fire alarm going off in your building for a fire in a building five hundred miles away?

No? Why?

I have a guess: because it’s not relevant to you. But based on the three rules here, it meets #1: It’s certainly an emergency. It also meets #2: It’s actionable. But of course, that alert doesn’t have anything to do with you, so for you, that fire alarm is just noise.

Now, let’s stretch everyone’s imagination and say that an IT department is not much different from a fire department. Would you want your fire department to get all fire alarms for buildings in a 500-mile radius? Of course not. So why would you treat your IT department that way? Shops often do exactly this by sending every event to every member of an administration team.

What I don’t understand is why some IT shops do a good job designing alerts, but then just hope that someone will do something about them when they appear. There’s a simple way to solve this. Rule #3 states: “Alerts must only go to people who are responsible for acting on them.” This is almost a part of the actionable nature of good alarms. If you are going to define what action to take when a certain alert arrives–say, an application process being down–you should also specify who is supposed to take that action.

I’ll never forget the first time that I implemented Systems Monitoring in 1997/1998. It was an instance of hardware monitoring for our servers. In this case, it was one of the very first versions of Compaq Insight Manager (CIM). As soon as CIM was installed, it set off an alarm that a server drive was in a degraded state. This meant that the drive had not yet failed, but it would shortly and should be replaced immediately. Of course, the implementation team was excited because the tool was working out of the box, and they could prove its value right away. They flagged down an admin and showed them the alert.

The subsequent conversation went something like this:

Team: “Hey, you have a degraded hard drive!”

Admin: “Wow. You’re right.”

Team: “Um. Aren’t you going to do something about it?”

Admin: “No. That’s not my job.”

Team: “Oookay.”

They looked for other admins, and had no luck finding anyone to take action. Naturally the drive failed the next day. The follow-up conversation went something like this:

Admin: “Hey, the drive failed! The system is supposed to tell us.”

Team: “Um, it did. We told you. You didn’t do anything about it.”

Admin: “This Systems Monitoring stuff isn’t working.”

This kind of attitude is why I like to say that the technology is no more than 40% of the Systems Monitoring solution. The technology could work fine, but if you don’t act properly on what it tells you, the technology can’t help you.

Here’s what I’d suggest to successfully implement rule #3:

  • When you determine the action for each alert (rule #2), you should also determine who is to take that action, and in what time frame. Include off-hours coverage if you are a 24/7 shop.
  • Even if there are groups of alerts on the same server, determine who the appropriate recipient is for each one, and only send it to those recipients. We have many servers that have alerts that go to DBAs, and other alerts that go to the web team, for example. They don’t get each other’s alerts because they aren’t responsible for them.
  • Unless you have an operations team, never depend on your mobile admins to “watch the console” for problems. Instead, page them or send them email. If you do a good job with the three rules, every alert will be an actionable emergency that they will be responsible for, and they will never get any alerts that are just noise.
  • When you must page a group of people for a problem rather than just individuals, make sure that there is always one person on call who is responsible for taking action on the alerts that go to the team. Otherwise, everyone assumes that someone else will handle the problems. We always have a primary and a backup.
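The suggestions above can be sketched as a small routing table. This is only an illustration under assumptions: the server name, alert types, team names, people, and the `page_target` function are all hypothetical and do not come from any particular monitoring product. The point it shows is that each alert on a server has its own owner, with one primary and one backup, and that an alert with no owner is treated as a configuration error rather than being sent to everyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Who acts on an alert, and who covers when the primary is unavailable."""
    team: str
    primary: str
    backup: str


# Hypothetical routing table keyed by (server, alert type).
# Two alerts on the same server go to two different teams.
ROUTES = {
    ("db01", "tablespace_full"): Route("dba", "alice", "bob"),
    ("db01", "http_5xx"): Route("web", "carol", "dave"),
}


def page_target(server: str, alert_type: str, primary_available: bool = True) -> str:
    """Return the one person to page for this alert.

    An alert without a defined owner raises LookupError instead of
    falling back to paging the whole team.
    """
    route = ROUTES.get((server, alert_type))
    if route is None:
        raise LookupError(f"no owner defined for {alert_type} on {server}")
    return route.primary if primary_available else route.backup
```

With this table, `page_target("db01", "tablespace_full")` pages the DBA primary, while the web team never sees that alert.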

You need to be as ruthless about enforcing this rule as you are about the first two. If anyone receives alerts that are just informational for them, they will delay looking at their alerts because they might not be directly responsible. You are depending on them to do the filtering. Again, think of the fire department. Do they get “informational” fire alarms for fires that are hundreds of miles away? Only if the fires are really serious, and in those cases, they are contacted directly because then they have an actionable emergency that they are responsible for. You should set up your own system the same way.

The next major topic is designing custom monitoring. In particular, I’m going to cover a technique called Critical Path Monitoring that will let you monitor any application. This will be a series of articles, because, as you can imagine, this topic isn’t simple. But it has worked for years in our departments, helping us make thousands of alerts, and monitor over a hundred custom applications.

Rule #2: Actionable Alerts

Posted on Thursday 11 October 2007

Tied very closely to the rule that all alerts must mean “run to the console now” is the concept that all alerts must be actionable. Informational alerts become noise, even when the event is interesting. Every actionable alert has an action associated with it that can fix the system, or get your users working again.

When I first implemented monitoring of my systems in the 1990s, I set an alarm to go off if CPU utilization was more than 90% for more than a half-hour straight. The alert worked perfectly. It indeed went off when the CPU was heavily used. We ran to the console, and looked at the server stats. The server had been running at high utilization for a long time. In fact, it had been high almost all day. But what were we supposed to do about this issue? The alert flashed on throughout the day, and all we could do was stare at it. The system was just busy; there was nothing wrong.

Certainly, high utilization can affect the users–except of course when it doesn’t. If you’re running a report at 3 AM that needs to be done at 8 AM, the fact that it takes until 4:30 AM isn’t a problem, even if utilization has maxed out. Generally, I recommend using a user experience monitor for alerts such as these. For example, for web-based apps, you can use a URL monitor to check response times, and set a time limit. Even then, you should make sure that the response time threshold you set makes sense, and the threshold should probably be exceeded more than once before you alert. Then, if you’ve done a good job collecting the data you need to analyze for problems such as these, you can look at your stats to find the cause. (And, of course, talking about statistics is a topic for another day. In fact, many other days. It’s a big topic, and of equal importance to alerting. It’s also the presentation that I’ll be giving in Orlando at NetConnect next week.)
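As a rough illustration of the URL monitor idea, here is a minimal sketch using only the Python standard library. The function names, the default of three consecutive breaches, and the choice to count a failed request as a full timeout are my assumptions for the example, not settings from any specific monitoring tool.

```python
import http.client
import time
import urllib.request


def measure_response_seconds(url: str, timeout: float = 10.0) -> float:
    """Time one full GET of `url`.

    Assumption for this sketch: a network error or timeout is recorded
    as `timeout` seconds, so it counts as a slow response.
    """
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        return timeout
    return time.perf_counter() - start


def should_alert(samples: list[float], limit_seconds: float, consecutive: int = 3) -> bool:
    """Alert only when the most recent `consecutive` samples all exceed the limit.

    A single slow response does not alert.
    """
    if len(samples) < consecutive:
        return False
    return all(sample > limit_seconds for sample in samples[-consecutive:])
```

A scheduler would call `measure_response_seconds` at a fixed interval, append each result to a list, and page only when `should_alert` returns true. Both the limit and the number of consecutive breaches should come from what your users actually notice, not from a default.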

The good point about creating actionable alerts is that they force you to come up with a list of the critical errors that can occur on your applications, and then determine what needs to be done to fix them. This links directly to another key part of monitoring related to procedures for dealing with these alerts. ITIL gives a good terminology and framework for dealing with these items, but even ITIL won’t help you if you don’t have a set of procedures that directly relate to your organization, operations, applications, users, and the systems themselves.

Here’s the short version about procedures, which are another topic of future articles at effectivemonitoring.com: for each alert that you plan on handling, can you imagine the physical actions that you will take in order to deal with the issue? If not, you don’t have an actionable alert. If you can imagine what you’d do when you got that alarm, write it down! You’ve got an excellent start on a procedure.
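One way to apply that test mechanically is to keep the written steps next to the alert definitions and refuse any alert that has none. The sketch below assumes a hypothetical `PROCEDURES` table and made-up alert names and steps; it is not a prescribed format.

```python
# Hypothetical runbook: alert name -> the physical steps someone will take.
PROCEDURES = {
    "app_process_down": [
        "Log in to the application server",
        "Restart the application service",
        "Confirm the login page loads",
    ],
    "cpu_over_90_percent": [],  # nobody could say what to do about this one
}


def actionable(alert_name: str) -> bool:
    """An alert is actionable only if at least one written step exists for it."""
    return bool(PROCEDURES.get(alert_name))


def alerts_to_drop(alert_names: list[str]) -> list[str]:
    """Return the alerts that have no procedure and so should not be alerts."""
    return [name for name in alert_names if not actionable(name)]
```

Anything returned by `alerts_to_drop` is a candidate for removal, or for a queue that gets reviewed occasionally, until someone can write down what to do about it.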

It should be clear at this point that rule #2: “All alerts must be actionable” is another powerful filter for your alerting, and should further reduce the number of monitors that you implement. The next rule is the only one of the three that increases them.

Rule #1: Run To The Console Now!

Posted on Thursday 11 October 2007

There’s a simple rule of thumb that I’ve used in order to evaluate any new alerts that I want to put into my environment. Each one must be a situation which means “run to the console now” to fix the problem. Unfortunately, many shops design monitoring as if it absolves them from having to check their applications manually. It does not. The alerts should only tell you about events that require your immediate attention. Everything else should either not alert at all, or go to a queue that you check occasionally.
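The split between “alert” and “queue” can be sketched in a few lines. Everything here is hypothetical: the `Event` fields, the `needs_immediate_action` flag (which a person has to decide when designing each monitor), and the in-memory lists standing in for a pager and a review queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    source: str
    message: str
    needs_immediate_action: bool  # decided by a person when the monitor is designed


paged = []              # stand-in for events that actually page someone
review_queue = deque()  # stand-in for the queue that is checked occasionally


def dispatch(event: Event) -> str:
    """Page only for 'run to the console now' events; queue everything else."""
    if event.needs_immediate_action:
        paged.append(event)
        return "page"
    review_queue.append(event)
    return "queue"
```

The useful property is that there are only two outcomes. An event that is merely interesting never reaches the pager.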

This standard is simple to remember, and is easy to apply. But it’s not a rule that can be applied only occasionally. It will force you to be ruthless about eliminating all alerts that fall short.

The reason for this is simple. Have you ever been in a building that kept giving you false fire alarms? Did you even get out of your chair by the 5th time it happened? Alerts work the same way. If every time the fire alarm went off there was really a fire, you’d pay attention to all of the alerts. If it was spotty, you’d be inclined to ignore all of them, including the real ones.

There’s another aspect to this issue that’s not necessarily obvious before you set out on designing your alerts. If you got a fire alarm for another building in another city, and you weren’t responsible for doing anything about it, how useful would the alert be? Every single alert must go only to the people who are responsible for fixing the problem. This is another rule that gets broken repeatedly by many groups. They send alerts to everyone on a team, when only a few people are responsible for the systems. This is the very steep slippery slope to alerts that get ignored by everyone. There’s something about our psychology that makes us desensitize quickly, even if we get only a few unnecessary alerts.

Finally, you should always start with fewer alarms, and then add more later. If you were in that building with false fire alarms for a month, and then they fixed the problem, you’d still be suspicious the next time you heard an alarm. Most people blame the tool, but, in fact, it might have been misconfigured, or configured to alert too often.

Fortunately, there’s a fairly straightforward method of determining just the key alerts for an application, no matter what it is. It’s called critical path analysis, and I’ll be going over this once I get done explaining the three rules of good alerting.

The Three Rules of Meaningful Alerts

Posted on Monday 8 October 2007

There’s just one simple reason why most monitoring implementations fail: they send out too many alerts.

Most implementers start out from the premise that the biggest risk is missing a critical problem. It’s not. Too many alerts are a bigger problem, by far. Once you have too many, your administrators will start ignoring all of the alerts.

When you design monitoring, follow these three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must cover every part of an application.

In order to follow all of these rules, most shops have to turn off a great number of monitors, as most of them fall short. We’re going to talk about each of these points in detail in coming articles, because they each require some explanation. Stay tuned for more on each of these.
