
Rule #3: Only Alert The Ones Responsible

Posted on Wednesday 7 November 2007

Can you imagine a fire alarm going off in your building for a fire in a building five hundred miles away?

No? Why?

I have a guess: because it’s not relevant to you. But based on the three rules here, it meets #1: It’s certainly an emergency. It also meets #2: It’s actionable. But of course, that alert doesn’t have anything to do with you, so for you, that fire alarm is just noise.

Now, let’s stretch everyone’s imagination and say that an IT department is not much different from a fire department. Would you want your fire department to get all fire alarms for buildings in a 500 mile radius? Of course not. So why would you treat your IT department that way? Yet shops often do exactly that by sending every event to every member of an administration team.

What I don’t understand is why some IT shops do a good job designing alerts, but then just hope that someone will do something about them when they appear. There’s a simple way to solve this. Rule #3 states: “Alerts must only go to the people who are responsible for acting on them.” This is almost a part of the actionable nature of good alarms. If you are going to define what action should be taken for a certain alert (say, an application process being down), you should also specify who is supposed to take that action.

I’ll never forget the first time that I implemented Systems Monitoring, in 1997/1998. It was an instance of hardware monitoring for our servers, using one of the very first versions of Compaq Insight Manager (CIM). As soon as CIM was installed, it set off an alarm that a server drive was in a degraded state. This meant that the drive had not yet failed, but it would fail shortly and should be replaced immediately. Of course, the implementation team was excited because the tool was working out of the box, and they could prove its value right away. They flagged down an admin and showed them the alert.

The subsequent conversation went something like this:

Team: “Hey, you have a degraded hard drive!”

Admin: “Wow. You’re right.”

Team: “Um. Aren’t you going to do something about it?”

Admin: “No. That’s not my job.”

Team: “Oookay.”

They looked for other admins, and had no luck finding anyone to take action. Naturally the drive failed the next day. The follow-up conversation went something like this:

Admin: “Hey, the drive failed! The system is supposed to tell us.”

Team: “Um, it did. We told you. You didn’t do anything about it.”

Admin: “This Systems Monitoring stuff isn’t working.”

This kind of attitude is why I like to say that the technology is no more than 40% of the Systems Monitoring solution. The technology could work fine, but if you don’t act properly on what it tells you, it can’t help you.

Here’s what I’d suggest to successfully implement rule #3:

  • When you determine the action for each alert (rule #2), you should also determine who is to take that action, and in what time frame. Include off-hours coverage if you are a 24/7 shop.
  • Even if there are groups of alerts on the same server, determine who the appropriate recipient is for each one, and only send it to those recipients. We have many servers with some alerts that go to the DBAs and other alerts that go to the web team, for example. Neither team gets the other’s alerts, because they aren’t responsible for them.
  • Unless you have an operations team, never depend on your mobile admins to “watch the console” for problems. Instead, page them or send them email. If you do a good job with the three rules, every alert will be an actionable emergency that they will be responsible for, and they will never get any alerts that are just noise.
  • When you must page a group of people for a problem rather than just individuals, make sure that there is always one person on call that is responsible for taking action for the alerts that go to the team. Otherwise, everyone assumes that someone else will handle the problems. We always have a primary, and a backup.
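
The suggestions above can be sketched as a small routing table. This is only an illustration, not the configuration of any particular monitoring product: the server name, alert types, team names, and people are all hypothetical, and the `escalate` flag stands in for whatever mechanism your tool uses to move from the primary to the backup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    team: str      # group that owns the alert
    primary: str   # on-call person paged first
    backup: str    # paged only if the primary does not respond


# Hypothetical routing table: (server, alert type) -> owner.
# Two alerts on the same server go to two different teams.
ROUTES = {
    ("db01", "sql_service_down"): Route("dba", "alice", "bob"),
    ("db01", "web_process_down"): Route("web", "carol", "dave"),
}


def page_targets(server, alert_type, escalate=False):
    """Return the single person to page for this alert."""
    route = ROUTES.get((server, alert_type))
    if route is None:
        # An alert with no owner breaks rule #3, so fail loudly
        # instead of broadcasting it to everyone.
        raise LookupError(f"no owner defined for {alert_type} on {server}")
    return [route.backup] if escalate else [route.primary]
```

Calling `page_targets("db01", "sql_service_down")` pages only the DBA primary; the web team never sees it. Refusing to route an alert that has no owner is a design choice in this sketch, made so that gaps in ownership show up when the alert is defined rather than when the drive fails.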

You need to be as ruthless about enforcing this rule as you are about the first two. If anyone receives alerts that are just informational for them, they will delay looking at their alerts because they might not be directly responsible. You are then depending on them to do the filtering. Again, think of the fire department. Do they get “informational” fire alarms for fires that are hundreds of miles away? Only if they’re really serious, and in those cases, they are contacted directly because then, they have an actionable emergency that they are responsible for. You should set up your own system the same way.

The next major topic is designing custom monitoring. In particular, I’m going to cover a technique called Critical Path Monitoring that will let you monitor any application. This will be a series of articles, because, as you can imagine, this topic isn’t simple. But it has worked for years in our departments, helping us make thousands of alerts, and monitor over a hundred custom applications.

The Best of Breed Debate

Posted on Monday 20 August 2007

If you’ve gotten through determining what you’d like to cover at a general level on all of your systems, it’s time to pick out the tools that you’re going to use if you haven’t done so already. The techniques that I’ll be covering at EffectiveMonitoring.com will work no matter what vendor you use, but the choice of tool will possibly make your job easier.

Most of the documentation on how to monitor is vendor specific, and written in such a way that the only solution to the “problems” that it brings up is the vendor’s own products. There’s rarely a good debate about what tools should be like in general, and definitely not about assembling the right mix of tools to get your job done. I don’t know about you, but I personally find that a lot of the articles written about systems monitoring read more like vendor press releases than good comparisons between tools.

The most heated debate, one that has been argued throughout the entire decade that I’ve been involved in systems monitoring, is about getting “Best of Breed” tools versus “Jack of All Trades” tools that try to manage the entire infrastructure.

The Best of Breed camp says that it’s necessary to drill deeply down into each application in order to do a good job monitoring it. This sometimes leads to tools that work for just a few platforms, which may cause you to have to purchase and maintain many solutions to cover your entire enterprise. It also can put a burden on your operations team (or whoever watches the consoles), because they may have to contend with multiple places to get alerts.

The Jack of All Trades camp says that you must have tools that span everything in your enterprise. The simplified version of their argument is that you should have alerts that correlate across all of your platforms. Unix, Windows, Network, SQL, Linux, applications, and everything else should have alerts. They say that alerts should go to just one place so that the tools can do correlation between alerts. Root cause analysis is then easier, because the problems are all on one console, and operations are simpler for the same reason. Unfortunately, these solutions tend to do a few things fairly well, and then provide mediocre coverage for the rest. Two sayings come to mind for these tools: “Jack of all trades, master of none” and “A mile wide and an inch deep.”

And because this is IT, there’s a third camp that reared up. Some believe in consolidation tools that will roll up alerts from any solution into their console. Once these alerts are in a single tool, it can perform correlation or other analysis.

Now, because correlation is at the crux of this debate, I need to cover it briefly. Correlation is the concept that you should separate the “symptom” alerts from the “cause” alerts. For example, if your entire database server is down, then the fact that your application server is logging that it can’t contact the database is a symptom, not a cause. There’s only one alert that matters here, and a good correlation tool will filter out the rest. But correlation as it relates to this debate has a very strong tendency to favor Best of Breed. The reason is simple: correlation assumes that you’ve first done a complete job of putting alerts on all of your critical points of failure. Otherwise, you have nothing to correlate. The Jack of All Trades tools can miss “deep” alerts. My other observation about correlation is that it rarely works in practice. There is too much manual work involved, and these tools can generate too many false alerts due to incorrect correlations. I’ll cover correlation in detail in a future article, because it’s quite a large topic.
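
To make the symptom-versus-cause idea concrete, here is a minimal sketch of dependency-based suppression. It is not how any particular correlation product works: the component names and the `DEPENDS_ON` map are hypothetical and maintained by hand, which is exactly the kind of manual work described above. If the map is incomplete or wrong, the output will be too.

```python
# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "web_frontend": ["app_server"],
    "app_server": ["database"],
    "database": [],
}


def root_causes(alerting):
    """Return the alerting components that have no alerting dependency.

    Anything that depends, directly or indirectly, on another alerting
    component is treated as a symptom and dropped.
    """
    alerting = set(alerting)

    def has_alerting_upstream(component, seen):
        for dep in DEPENDS_ON.get(component, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alerting or has_alerting_upstream(dep, seen):
                return True
        return False

    return sorted(c for c in alerting if not has_alerting_upstream(c, {c}))
```

With both the application server and the database alerting, `root_causes(["app_server", "database"])` keeps only the database alert. Note that the sketch can only suppress symptoms for failures that actually have an alert defined, which is the point made above about needing complete coverage first.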

After a decade of using various tools, I would suggest that IT shops use tools that are best of breed within the systems monitoring space that can cover as much of your environment as possible, and then use specific tools to solve the rest of the issues. I haven’t seen any tools that do a good job of mixing network alerting (routers, switches, and cable plant monitoring) with the systems alerting, especially if your company is large enough to have a networking team. Their needs tend to be so different they need their own console and control over their own tools. And, besides, they tend to ignore systems monitoring alerts. That’s only fair, because systems monitoring folks often have to ignore network alarms because sometimes traffic can be routed through other infrastructure, and the alerts aren’t meaningful.

I do believe that having fewer consoles is a goal that you should always strive towards, and this is why your application alerting and your operating system alerting need to go in the same tool as much as possible. There’s a simple rule of thumb for this decision if you need to evaluate possible solutions: You must be able to do deep monitoring on each aspect of your system. Leave none out. Your set of monitoring solutions must cover your database servers, your web servers, your custom applications, and all other critical aspects of your operating systems. If you can’t find an overall solution that covers all of these, you need to bring in a best-of-breed solution that will be able to handle the alerts on all of the parts that you haven’t covered yet.

I prefer category solutions in the systems space that allow me to write custom scripts. I often find that I want to alert on areas that the out-of-box monitoring doesn’t cover, and I need the freedom to add in a new alert type. But I want to emphasize again that the upcoming techniques and articles are vendor-neutral, and that whatever you choose, you will be able to use these techniques. As long as you make sure that your set of solutions covers all of the areas that we talked about in the Defining “Down” article, you will find the next articles usable almost immediately.

The next article in this series covers a surefire way to make certain that every single one of your alerts is meaningful.

Defining “Down”

Posted on Tuesday 14 August 2007

One of the points that’s often missed when it comes to monitoring is what we mean when we say that a system, server, or application is “down.” Indeed, it’s not as clear a concept as it may seem. This is where we need to start if we’re going to design a comprehensive set of alerts for a system.

What we do know for sure is if your network, server hardware, operating system, or application components are having trouble, your users will call and simply say: “The system is down.” The goal of monitoring is to know about these faults before they happen, or at worst when they happen. Note that even if you find out at the same time as your users, you will be cutting out all of the sleuthing that you’d be doing if you just got that user call. Not only that, you’re saving the time in between when they first try to contact you, and when the actual message arrives. If you have a large organization, sometimes tickets can spend hours in various help desk queues.

I’ve seen many shops complain about trouble with their monitoring, but fail to take the simple first step of identifying the actual components of their systems that can cause a failure. This process is part of a methodology I call identifying the critical path, which I am going to cover in detail in future articles. For now, I just want to focus on the overall areas of failure for systems in general, and talk about different monitoring solutions for each. You must have monitoring that covers all of these in your set of monitoring tools.

Network: Network monitoring is entirely different from systems monitoring. The best network monitoring tools will check all of the network links, as well as the status of your network hardware. These articles won’t cover network monitoring in depth, so they will assume that you have this covered. The good news about monitoring networks is that the set of faults that can occur is more regular, so monitors can be implemented without as much alerting design. This is often true no matter what kind of hardware or cable plant you have in your data center.

Hardware: Whether your system has a bad hard drive, correctable memory errors, a failed motherboard, or a dead fan, it can crash as a result. You need to know the status of your hardware at all times. Fortunately, most of the hardware monitoring systems are predictive. That is, they will often tell you which component needs to be replaced before it completely fails. Hardware monitoring is also out of scope for this series of articles, although the alarms for these faults can certainly be sent to the systems monitoring console. The best hardware monitoring solutions usually come directly from your hardware vendors because they have the tightest integration with the actual hardware.

Operating System: If you run out of disk space or memory, or have any other operating system failure, your users are still going to call you and say that your system has failed. This alerting should be part of the same software as your application monitoring solution. This is a complex topic that will be handled in a series of future articles, but it has the advantage of using the same set of monitors no matter what applications are running on the system.
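
As a rough illustration of the kind of operating system monitor meant here, the sketch below checks free disk space using `shutil.disk_usage` from the Python standard library. The 10% threshold and the message wording are assumptions made for the example, not recommendations; a real monitor would also need an owner and a defined action, per the three rules.

```python
import shutil


def low_disk_message(path, total, free, min_free_fraction=0.10):
    """Return an alert message if free space is below the threshold, else None.

    The 10% default is an arbitrary example value.
    """
    if total <= 0:
        return None  # nothing meaningful to report for an empty filesystem
    free_fraction = free / total
    if free_fraction < min_free_fraction:
        return f"disk space low on {path}: {free_fraction:.0%} free"
    return None


def check_path(path="/", min_free_fraction=0.10):
    """Measure the filesystem holding `path` and evaluate it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return low_disk_message(path, usage.total, usage.free, min_free_fraction)
```

Because the same check applies to any server regardless of what runs on it, one definition like this can be reused across every system, which is the advantage described above. Keeping the threshold logic separate from the measurement makes it easy to test without touching a real disk.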

Application Monitoring: Applications in this case have a very broad definition. The term includes infrastructure software such as databases and web servers, but also application processes, services, or daemons. Monitoring for applications is irregular and difficult, because there is no catch-all monitoring that works for every application. In fact, most applications are unmanaged because administrators just don’t have a method for understanding their applications from a monitoring perspective, rather than just an administrative perspective. We’ll be spending considerable time on this kind of monitoring because, although many vendors claim that they can do it automatically, it requires administrators to design the alerts. I will cover how to do this in detail, and provide a simple-to-follow guide on how to break apart an application into the alerts that matter.
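
The four areas above can be treated as a checklist when reviewing your set of tools. The sketch below is one hypothetical way to do that: the inventory format and the system names are invented for the example, and in practice the data would come from whatever records you keep about which tool watches what.

```python
# The four areas of failure described in this article.
REQUIRED_AREAS = ("network", "hardware", "operating_system", "application")


def coverage_gaps(inventory):
    """Map each system to the required areas it has no monitoring for.

    `inventory` maps a system name to the set of areas already monitored.
    Systems with full coverage are left out of the result.
    """
    gaps = {}
    for system, covered in inventory.items():
        missing = [area for area in REQUIRED_AREAS if area not in covered]
        if missing:
            gaps[system] = missing
    return gaps


# Hypothetical inventory: web01 has no hardware or application monitoring.
example = {
    "db01": {"network", "hardware", "operating_system", "application"},
    "web01": {"network", "operating_system"},
}
```

Running `coverage_gaps(example)` reports only `web01`, with `hardware` and `application` listed as missing. Any system that appears in the result is one where users may report “the system is down” before your monitoring does.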

Now that we’ve covered the overall areas that can cause a system to be down, the next article will talk about the longest-standing argument about the tools that can catch these failures: The Best of Breed debate.
