Rule #3: Only Alert The Ones Responsible

Wednesday 7 November 2007

Can you imagine a fire alarm going off in your building for a fire in a building five hundred miles away?

No? Why?

I have a guess: because it’s not relevant to you. But based on the three rules here, it meets #1: It’s certainly an emergency. It also meets #2: It’s actionable. But of course, that alert doesn’t have anything to do with you, so for you, that fire alarm is just noise.

Now, let’s stretch everyone’s imagination and say that an IT department is not much different from a fire department. Would you want your fire department to get all fire alarms for buildings in a 500-mile radius? Of course not. So why would you treat your IT department that way? Yet shops often do exactly that, by sending every event to every member of an administration team.

What I don’t understand is why some IT shops do a good job designing alerts, but then just hope that someone will do something about them when they appear. There’s a simple way to solve this. Rule #3 states: “Alerts must only go to people who are responsible to act on it.” This is almost a part of the actionable nature of good alarms. If you are going to define what action you should take if you get a certain alert–say, an application process being down–you should also specify who is supposed to take the action.

I’ll never forget the first time that I implemented Systems Monitoring in 1997/1998. It was an instance of hardware monitoring for our servers. In this case, it was one of the very first versions of Compaq Insight Manager (CIM). As soon as we installed CIM, it set off an alarm that a server drive was in a degraded state. This meant that the drive had not yet failed, but would shortly and should be replaced immediately. Of course, the implementation team was excited because the tool was working out of the box, and they could prove the value of the tool right away. They flagged down an admin and showed them the alert.

The subsequent conversation went something like this:

Team: “Hey, you have a degraded hard drive!”

Admin: “Wow. You’re right.”

Team: “Um. Aren’t you going to do something about it?”

Admin: “No. That’s not my job.”

Team: “Oookay.”

They looked for other admins, and had no luck finding anyone to take action. Naturally the drive failed the next day. The follow-up conversation went something like this:

Admin: “Hey, the drive failed! The system is supposed to tell us.”

Team: “Um, it did. We told you. You didn’t do anything about it.”

Admin: “This Systems Monitoring stuff isn’t working.”

This kind of attitude is why I like to say that the technology is no more than 40% of the Systems Monitoring solution. The technology could work fine, but if you don’t act on what it tells you, it can’t help you.

Here’s what I’d suggest to successfully implement rule #3:

  • When you determine the action for each alert (rule #2), you should also determine who is to take that action, and in what time frame. Include off-hours coverage if you are a 24/7 shop.
  • Even if there are groups of alerts on the same server, determine who the appropriate recipient is for each one, and only send it to those recipients. We have many servers that have alerts that go to DBAs, and other alerts that go to the web team, for example. They don’t get each other’s alerts because they aren’t responsible for them.
  • Unless you have an operations team, never depend on your mobile admins to “watch the console” for problems. Instead, page them or send them email. If you do a good job with the three rules, every alert will be an actionable emergency that they will be responsible for, and they will never get any alerts that are just noise.
  • When you must page a group of people for a problem rather than just individuals, make sure that there is always one person on call who is responsible for taking action on the alerts that go to the team. Otherwise, everyone assumes that someone else will handle the problems. We always have a primary and a backup.
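
To make the suggestions above concrete, here is a minimal sketch in Python of routing each alert only to the people who own it. Everything in it is an assumption for illustration: the server name, the alert types, the team names, the people, and the `OWNERS` and `ROTA` tables are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCall:
    """Primary and backup responders for one team (hypothetical structure)."""
    primary: str
    backup: str


# Hypothetical ownership table: (server, alert type) -> responsible team.
# Two alerts on the same server can belong to different teams.
OWNERS = {
    ("db01", "sql_service_down"): "dba",
    ("db01", "web_site_down"): "web",
}

# Hypothetical on-call rota: team -> who is on call right now.
ROTA = {
    "dba": OnCall(primary="alice", backup="bob"),
    "web": OnCall(primary="carol", backup="dave"),
}


def recipients(server: str, alert_type: str, escalate: bool = False) -> list[str]:
    """Return only the people responsible for acting on this alert.

    An alert with no defined owner raises an error instead of falling
    back to "send it to everyone", which would break rule #3.
    """
    team = OWNERS.get((server, alert_type))
    if team is None:
        raise LookupError(f"no owner defined for {alert_type} on {server}")
    on_call = ROTA[team]
    # Page the primary first; add the backup only when escalating.
    if escalate:
        return [on_call.primary, on_call.backup]
    return [on_call.primary]
```

The design choice worth noticing is the error for an unowned alert: it forces the "who acts on this?" decision to be made when the alert is defined, not when it fires.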

You need to be ruthless about enforcing this rule as much as the first two. If anyone receives alerts that are just informational for them, they will delay looking at their alerts because they might not be directly responsible. You are depending on them to do filtering. Again, think of the fire department. Do they get “informational” fire alarms for fires that are hundreds of miles away? Only if they’re really serious, and in those cases, they are contacted directly because then, they have an actionable emergency that they are responsible for. You should set up your own system the same way.

The next major topic is designing custom monitoring. In particular, I’m going to cover a technique called Critical Path Monitoring that will let you monitor any application. This will be a series of articles, because, as you can imagine, this topic isn’t simple. But it has worked for years in our departments, helping us make thousands of alerts, and monitor over a hundred custom applications.

Posted by randall / Filed under:Articles and Techniques

NetConnect Lessons: No More CYA Alerts

Monday 5 November 2007

It’s taken me some time to integrate what I heard at NetConnect this year. I don’t mean that I learned a ton of new things. It’s that a lot of environments are in more trouble than I thought. One of the things that it’s caused me to do is change the third rule. I’ve been stunned at both the large number of deployments of tools out there in the world, as well as the number of them that just aren’t used.

I never thought that another name for “monitoring” would be CYA. That is to say: “Cover Your Donkey” but, of course, using one of the synonyms for donkey.

It seems that many environments would rather have alerts that no one uses than have their system miss something and get blamed for a production issue. This leads to a proliferation of alerts that, very quickly, no one watches because they break the rules that I’ve been talking about here.

One of the rules that I felt was just a component of the first one needs to be a separate rule. So I’m replacing rule #3. The old rule #3 is going to become one of the rules of creating custom monitoring, which I’m going to explain soon when I talk about critical path monitoring. I’m going to talk about the new rule #3 in the next article. In this article, I want to talk about the CYA problem.

But, just to get it out of the way, here are the new three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must only go to people who are responsible to act on it.

Like I said, #3 is next article. I just want to expand on the point that I mentioned in #1. For now, let’s talk about the CYA issue.

If you let administrators depend completely on the alerts from your systems tools, you will always be surprised by some events that happen on the servers. No matter how good a job you do developing your alerts, there are always oddball issues that can come up that you will miss. I inform every administrator of this issue, and tell them that they should still check their servers on an occasional basis, just as they should have been doing before monitoring was put in place.

Because an alert is proof of a problem, though, some management and administrative groups have used it to assign blame rather than to fix things. I ran into more than one group at NetConnect that had been told that they may not turn off any alerts because they might “miss something,” even if they are deluged with false alerts. Unfortunately, because many of these shops are not strict about reducing the alerts down to the ones that follow the rules, a lot of alerts are meaningless and the administrators ignore all of them, good and bad. This makes adding new alerts when they are necessary next to useless, because even the current set of alerts is already ignored.

It does take time and effort to reduce the number of alerts down to the ones that matter, but if you want to get any value out of Systems Monitoring tools at all, it’s a necessity.

In fact, the conversation that inspired this article was with an administrator at a company that was monitoring 2400 servers in its environment. I asked him how many alerts they get a day. He told me “thousands.” I asked what they did with them. He answered: “Nothing.” And so I asked why, and he said: “CYA.”

Now, why is this so easy to believe?

Posted by randall / Filed under:Uncategorized

Welcome NetConnect Folks!

Friday 19 October 2007

I’d like to welcome the NetConnect Members to the EffectiveMonitoring blog. The presentation Find it Fast: Using AppManager Stats to Identify Problems Quickly is now available on the Presentations page.

As mentioned during the presentation, this is an entire blog dedicated to monitoring techniques. The techniques apply regardless of the tool that you use, the underlying operating systems that you have, and the organization that you have in your environment. Although this latest presentation is, of course, about AppManager and Windows operating systems because of the conference, the general topics apply, and I’ll be covering them in more detail in later entries.

If you want to contact me for any reason, please just use the contact page. I’m happy to talk to anyone about the presentation, or your monitoring issues. I’ve helped people solve quite a few issues.

For now, you’re catching the blog in the middle of a discussion about how to create and apply alerts based on my 10 years of experience with monitoring tools.

As for a recap of the conference, it was very much worth my time. I had a great time and plenty of opportunity to talk to other monitoring folks and to people from NetIQ. I spent a lot of time trying to get certain enhancement requests in front of PMs and support. Being able to explain these things in person helped a lot, as many of them are seemingly complex issues that can be conveyed more simply in a 5-minute conversation than in an email that reads like a book.

For example, there’s an issue with AppManager regarding handling events across reboots and restarts of the agent. A feature called event collapsing will make sure that only one event is triggered upon an error, and, depending on how it’s configured, a continuing failure will get just one event. These events can also trigger actions, such as a page or an email. So, let’s take a situation where it’s monitoring a website, and it’s set to raise an event on the console and send an email via SMTP. If the website is down, you’ll get one event and one email, no matter how long it’s down. You’ll only get another email and event if the website comes up and then goes down again. A very useful feature, for certain.

But, if you reboot your monitoring agent server (the one that is monitoring the website, NOT the monitoring infrastructure backend, which is not tied to these events), you’ll get another event and email when it comes back up. You’ll also get one if you restart the monitoring agent, or if you change any of the monitoring properties. The people who receive that second alert will then think that the website came back up and has gone down again. To avoid this problem, the event state should persist across reboots, restarts, and changes to the policy.
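
For readers who haven’t used the product, the behavior I’m asking for can be sketched in a few lines of Python. To be clear, this is not how AppManager implements event collapsing; the `CollapsingNotifier` class, the JSON state file, and the target name are all hypothetical, and the sketch only illustrates the idea of keeping the "already notified" state on disk so that a restart doesn’t look like a new outage.

```python
import json
from pathlib import Path


class CollapsingNotifier:
    """Send one notification per outage, and remember outages on disk."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        # Reload the last known state so that a restart of the agent
        # does not make an ongoing outage look like a new one.
        if state_file.exists():
            self.down = set(json.loads(state_file.read_text()))
        else:
            self.down = set()

    def report(self, target: str, is_up: bool) -> bool:
        """Record a check result; return True if a notification is due."""
        if is_up:
            notify = False
            self.down.discard(target)
        else:
            # Notify only on the transition from up to down.
            notify = target not in self.down
            self.down.add(target)
        # Persist after every check so the state survives a reboot.
        self.state_file.write_text(json.dumps(sorted(self.down)))
        return notify
```

With this in place, a second failed check for the same target returns False, and so does the first failed check made by a fresh `CollapsingNotifier` that is pointed at the same state file after a restart. A new notification is only due once the target has been reported up and then fails again.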

I know that your eyes are probably glazing over reading this explanation, and it still may not make sense, especially if you’ve never used the product. But it’s a problem that has a real effect on large shops such as ours. I’m glad that I was able to bring this and other issues to their attention, because I hope that it can make a feature list. As you can imagine, I’m a very detailed person when it comes to these features, because these seemingly small feature issues can cause major problems in environments as large as the ones that I face.

I will, perhaps, put up my wish list for AppManager on the chicagoiq.net website (which I run). If you’re an AppManager user and have the same pain that we do, feel free to join in with my organization there.

In the next post, which I might be able to write as I continue to wait for my plane back, I’ll talk about some of the future topics that I’d like to cover based on talking to other companies. There are a lot of problems that we’ve gotten past in our organization, and I think that there’s a lot that I can share that will save you quite a lot of time in your monitoring.

If my plane is delayed even more, who knows, I’ll possibly write even more articles while they’re still fresh. Meanwhile, I’m heading to my gate now.

Posted by randall / Filed under:News

Rule #2: Actionable Alerts

Thursday 11 October 2007

Tied very closely to the rule that all alerts must mean “run to the console now” is the concept that all alerts must be actionable. Informational alerts become noise, even if it’s an event that’s interesting. Actionable alerts all have an action associated with them that can fix the system, or get your users working again.

When I first implemented monitoring of my systems in the 90s, I set an alarm to go off if CPU utilization was more than 90% for more than a half-hour straight. The alert worked perfectly. It indeed went off when the CPU was heavily used. We ran to the console, and looked at the server stats. The server had been running at high utilization for a long time. In fact, it had been high almost all day. But what were we supposed to do about this issue? The alert flashed on throughout the day, and all we could do was stare at it. The system was just busy; there was nothing wrong.

Certainly, high utilization can affect the users–except of course when it doesn’t. If you’re running a report at 3AM that needs to be done at 8AM, the fact that it takes until 4:30 AM isn’t a problem, even if utilization has maxed out. Generally, I recommend using a user experience monitor for alerts such as these. For example, for web-based apps, you can use a URL monitor to check response times, and set a time limit. Even then, you should make sure that the response time threshold you set makes sense, and the threshold should probably be exceeded more than once before you alert. Then, if you’ve done a good job collecting the data you need to analyze for problems such as these, you can look at your stats to find the cause. (And, of course, talking about statistics is a topic for another day. In fact, many other days. It’s a big topic, and of equal importance to alerting. It’s also the presentation that I’ll be giving in Orlando at NetConnect next week.)
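
As a rough illustration of the URL monitor idea, here is a minimal sketch in Python using only the standard library. The function names, the 10-second timeout, and the default of three consecutive slow responses are assumptions that I picked for the example, not recommended values; the right threshold depends on what your users actually notice.

```python
import time
import urllib.request


def measure_seconds(url: str, timeout: float = 10.0) -> float:
    """Time one full fetch of the URL.

    Network and HTTP errors are counted as the full timeout, so a site
    that is down looks like a very slow response.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:
        return timeout
    return time.monotonic() - start


def should_alert(samples: list[float], limit: float, breaches_needed: int = 3) -> bool:
    """Alert only if the most recent samples all exceed the time limit.

    A single slow response is ignored; the limit has to be exceeded
    `breaches_needed` times in a row.
    """
    recent = samples[-breaches_needed:]
    return len(recent) == breaches_needed and all(s > limit for s in recent)
```

In use, a scheduler would append the result of `measure_seconds` to a list on each polling interval and call `should_alert` on that list. One slow page load in the middle of a run of fast ones never alerts.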

The good point about creating actionable alerts is that they force you to come up with a list of the critical errors that can occur on your applications, and then determine what needs to be done to fix them. This links directly to another key part of monitoring related to procedures for dealing with these alerts. ITIL gives a good terminology and framework for dealing with these items, but even ITIL won’t help you if you don’t have a set of procedures that directly relate to your organization, operations, applications, users, and the systems themselves.

Here’s the short version about procedures, which is another topic of future articles at effectivemonitoring.com: for each alert that you plan on handling, can you imagine the physical actions that you will take in order to deal with the issue? If not, you don’t have an actionable alert. If you can imagine what you’d do when you got that alarm, write it down! You’ve got an excellent start on a procedure.
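
One way to keep yourself honest about this is to store the written procedure next to the alert definition and flag any alert that doesn’t have one. The sketch below, in Python, assumes a hypothetical `AlertDefinition` record; the alert names and procedure text are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class AlertDefinition:
    """One alert plus the written steps a responder will take (hypothetical)."""
    name: str
    procedure: str = ""


def unactionable(alerts: list[AlertDefinition]) -> list[str]:
    """Return the names of alerts that have no written procedure."""
    return [a.name for a in alerts if not a.procedure.strip()]


catalog = [
    AlertDefinition("web_site_down", "Restart the web service; if it fails, call the web on-call."),
    AlertDefinition("cpu_over_90_percent"),  # nobody could say what to do about it
]
```

Running `unactionable(catalog)` on this catalog returns only the CPU alert, which is a candidate for removal under rule #2.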

It should be clear at this point that rule #2: “All alerts must be actionable” is another powerful filter for your alerting, and should further reduce the number of monitors that you implement. The next rule is the only one of the three that increases them.

Posted by randall / Filed under:Techniques

Rule #1: Run To The Console Now!

Thursday 11 October 2007

There’s a simple rule of thumb that I’ve used in order to evaluate any new alerts that I want to put into my environment. Each one must be a situation which means “run to the console now” to fix the problem. Unfortunately, many shops design monitoring as if it absolves them from having to check their applications manually. It does not. The alerts should only tell you about events that require your immediate attention; everything else should either not alert at all or go to a queue that you check occasionally.

This standard is simple to remember, and is easy to apply. But it’s not a rule that can be only applied occasionally. It will force you to be ruthless about eliminating all alerts that fall short.

The reason for this is simple. Have you ever been in a building that kept giving you false fire alarms? Did you even get out of your chair by the 5th time it happened? Alerts work the same way. If every time the fire alarm went off there was really a fire, you’d pay attention to all of the alerts. If it was spotty, you’d be inclined to ignore all of them, including the real ones.

There’s another aspect to this issue that’s not necessarily obvious before you set out on designing your alerts. If you got a fire alarm for another building in another city, and you weren’t responsible for doing anything about it, how useful is the alert? Every single alert must go only to the people who are responsible for fixing the problem. This is another rule that gets broken repeatedly by many groups. They send alerts to everyone on a team, when only a few people are responsible for the systems. This is the very steep slippery slope to alerts that get ignored by everyone. There’s something about our psychology that makes us desensitize quickly, even if we get only a few unnecessary alerts.

Finally, you should always start with fewer alarms, and then add them later. If you were in that building with false fire alarms for a month, and then they fixed the problem, you’d still be suspicious the next time you heard an alarm. Most people blame the tool, but, in fact, it might have been misconfigured, or configured to alert too often.

Fortunately, there’s a fairly straightforward method of determining just the key alerts for an application, no matter what it is. It’s called critical path analysis, and I’ll be going over this once I get done explaining the three rules of good alerting.

Posted by randall / Filed under:Techniques

The Three Rules of Meaningful Alerts

Monday 8 October 2007

There’s just one simple reason why most monitoring implementations fail: they send out too many alerts.

Most implementers start out from the premise that the biggest problem to avoid is to miss a critical problem. It’s not. Too many alerts are a bigger problem, by far. Once you have too many, your administrators will start ignoring all of the alerts.

When you design monitoring, follow these three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must cover every part of an application.

In order to follow all of these rules, most shops have to turn off a great number of monitors, as most of them fall short. We’re going to talk about each of these points in detail in coming articles, because they each require some explanation. Stay tuned for more on each of these.

Posted by randall / Filed under:Techniques

The Break

Monday 8 October 2007

I apologize for the recent break in posts. I was just finishing writing a book that is coming out from St. Martin’s Press next year. I felt guilty if I did any writing that wasn’t dedicated to finishing by my deadline. The manuscript is with the editor now, so I’ll be writing articles for this site, possibly multiple articles a day, to catch up.

In case you’re curious, the book is about how to run your own independent band. There’s a surprising amount of technology involved with music now. Since music is my avocation, my skills as both a technologist and a musician came in handy.

Also, I’ve never believed that you can have just one passion in life. Mine are information technology, music, and writing.

The next book I do will be on the topics in this blog, which I will share here in article-sized pieces.

Posted by randall / Filed under:News

Welcome ITSMF

Friday 24 August 2007

I’d like to welcome the ITSMF folks to EffectiveMonitoring.com!

I’ve had requests for the slides from the presentation that I gave on Thursday. I’m going to post it, and all others that I do, on the presentations page on this website.

Posted by randall / Filed under:News

The Best of Breed Debate

Monday 20 August 2007

If you’ve gotten through determining what you’d like to cover at a general level on all of your systems, it’s time to pick out the tools that you’re going to use if you haven’t done so already. The techniques that I’ll be covering at EffectiveMonitoring.com will work no matter what vendor you use, but the choice of tool will possibly make your job easier.

Most of the documentation on how to monitor is vendor specific, and written in such a way that the only solution to the “problems” that they bring up is their own products. There’s rarely a good debate about what tools should be like in general, and definitely not about making the right mix of tools that will get your job done. I don’t know about you, but I personally find that a lot of the articles written about systems monitoring read more like press releases from vendors than good discussions about comparisons between tools.

The most heated debate, one that has been argued throughout the entire decade that I’ve been involved in systems monitoring, is about getting “Best of Breed” tools versus “Jack of All Trades” tools that try to manage the entire infrastructure.

The Best of Breed camp says that it’s necessary to drill deeply down into each application in order to do a good job monitoring it. This sometimes leads to tools that work for just a few platforms, which may cause you to have to purchase and maintain many solutions to cover your entire enterprise. It also can put a burden on your operations team (or whoever watches the consoles), because they may have to contend with multiple places to get alerts.

The Jack of All Trades camp says that you must have tools that span everything in your enterprise. The simplified version of their argument is that you should have alerts that correlate across all of your platforms. Unix, Windows, Network, SQL, Linux, applications, and everything else should have alerts. They say that alerts should go to just one place so that the tools can do correlation between alerts. Root cause analysis is easier at this point, because problems are all on one console. Also, it’s simpler operationally because of a single console. Unfortunately, these solutions tend to do a few things fairly well, and then provide mediocre coverage for the rest. Two sayings come to mind for these tools: “Jack of all trades, master of none.” And “A mile wide and an inch deep.”

And because this is IT, there’s a third camp that reared up. Some believe in consolidation tools that will roll up alerts from any solution into their console. Once these alerts are in a single tool, it can perform correlation or other analysis.

Now, because correlation is a crux issue for this debate, I need to cover it briefly now. Correlation is the concept that you should filter the “symptom” alerts from the “cause” alerts. For example, if your entire database server is down, then the fact that your application server is writing a logfile that it can’t contact the database is a symptom, not a cause. There’s only one alert that matters here, and your good correlation tools will filter for this. But correlation as it relates to this debate has a very strong tendency to favor Best of Breed. The reason is simple: correlation assumes that you’ve done a complete job of putting alerts on all of your critical points of failure first. Otherwise, you have nothing to correlate. The Jack of All Trades tools can miss “deep” alerts. My other observation about correlation is that it rarely works in practice. There is too much manual work involved, and these tools can generate too many false alerts due to incorrect correlations. I’ll cover correlation in detail in a future article, because it’s quite a large topic.
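
To show what filtering symptoms from causes means in its simplest form, here is a minimal sketch in Python. It is a toy, not how any commercial correlation engine works, and the `DEPENDS_ON` map and component names are hypothetical. It treats a failing component as a symptom whenever something it depends on, directly or indirectly, is also failing.

```python
# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "app_server": ["database"],
    "web_site": ["app_server"],
}


def upstream(component: str) -> set[str]:
    """Return all direct and indirect dependencies of a component."""
    found: set[str] = set()
    stack = list(DEPENDS_ON.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in found:
            found.add(dep)
            stack.extend(DEPENDS_ON.get(dep, []))
    return found


def root_causes(failing: set[str]) -> set[str]:
    """Keep only the failing components with no failing dependency."""
    return {c for c in failing if not (upstream(c) & failing)}
```

If the database, the application server, and the web site are all alerting, only the database survives the filter. If the database has no alert on it at all, the filter has nothing to work with and the application server is reported as the cause, which is exactly the coverage problem described above. The sketch also shows where the manual work comes from: someone has to build and maintain the dependency map.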

After a decade of using various tools, I would suggest that IT shops use tools that are best of breed within the systems monitoring space that can cover as much of your environment as possible, and then use specific tools to solve the rest of the issues. I haven’t seen any tools that do a good job of mixing network alerting (routers, switches, and cable plant monitoring) with the systems alerting, especially if your company is large enough to have a networking team. Their needs tend to be so different they need their own console and control over their own tools. And, besides, they tend to ignore systems monitoring alerts. That’s only fair, because systems monitoring folks often have to ignore network alarms because sometimes traffic can be routed through other infrastructure, and the alerts aren’t meaningful.

I do believe that having fewer consoles is a goal that you should always strive towards, and this is why your application alerting and your operating system alerting need to go in the same tool as much as possible. There’s a simple rule of thumb for this decision if you need to evaluate possible solutions: You must be able to do deep monitoring on each aspect of your system. Leave none out. Your set of monitoring solutions must cover database servers, your web servers, your custom applications, and all other critical aspects of your operating systems. If you can’t find an overall solution that covers all of these, you need to bring in a best-of-breed solution that will be able to handle the alerts on all of the parts that you haven’t covered yet.

I prefer category solutions in the systems space that allow me to write custom scripts. I often find that I want to alert on areas that the out-of-box monitoring doesn’t cover, and I need the freedom to add in a new alert type. But I want to emphasize again that the upcoming techniques and articles are vendor-neutral, and that whatever you choose, you will be able to use these solutions. As long as you make sure that your set of solutions covers all of the areas that we talked about in the Defining “Down” article, you will find the next articles usable almost immediately.

The next article in this series covers a surefire way to make certain that every single one of your alerts is meaningful.

Posted by randall / Filed under:Articles and Editorials

Defining “Down”

Tuesday 14 August 2007

One of the points that’s often missed when it comes to monitoring is what we mean when we say that a system, server, or application is “down.” Indeed, it’s not as clear a concept as it may seem. This is where we need to start if we’re going to design a comprehensive set of alerts for a system.

What we do know for sure is if your network, server hardware, operating system, or application components are having trouble, your users will call and simply say: “The system is down.” The goal of monitoring is to know about these faults before they happen, or at worst when they happen. Note that even if you find out at the same time as your users, you will be cutting out all of the sleuthing that you’d be doing if you just got that user call. Not only that, you’re saving the time in between when they first try to contact you, and when the actual message arrives. If you have a large organization, sometimes tickets can spend hours in various help desk queues.

I’ve seen many shops complain about trouble with their monitoring, but fail to take the simple first step of identifying the actual components of their systems that can cause a failure. This process is part of a methodology I call identifying the critical path, which I am going to cover in detail in future articles. For now, I just want to focus on the overall areas of failure for systems in general, and talk about different monitoring solutions for each. You must have monitoring that covers all of these in your set of monitoring tools.

Network: Network monitoring is entirely different from systems monitoring. The best network monitoring tools will check all of the network links, as well as the status of your network hardware. These articles won’t cover network monitoring in depth, so I will assume that you have this covered. The good news about monitoring networks is that they have a more regular set of faults that can occur, and so monitors can be implemented without as much alerting design required. This is often true no matter what kind of hardware or cable plant you have in your data center.

Hardware: Whether your systems have a bad hard drive, correctable memory errors, a failed motherboard, or a dead fan, your system can crash as a result. You need to know the status of your hardware at all times. Fortunately, most of the hardware monitoring systems are predictive. That is, they will often tell you what needs to be replaced before it completely fails. Hardware monitoring is also out of scope for this series of articles, although the alarms for these faults can certainly be sent to the systems monitoring console. The best hardware monitoring solutions usually come directly from your hardware vendors because they have the tightest integration with the actual hardware.

Operating System: If you run out of disk space, memory, or have any other operating system failures, your users are still going to call you and say that your system has failed. This alerting should be part of the same software as your application monitoring solution. This is a complex topic that will be handled in a series of future articles, but has the advantage of using the same set of monitors no matter what applications are running on the system.
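
As a small example of the kind of operating system check that works the same way no matter what application is running, here is a minimal sketch in Python of a disk space monitor built on the standard library’s `shutil.disk_usage`. The function names and the 10% default threshold are assumptions for illustration, not recommendations.

```python
import shutil


def disk_free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_on_disk(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True when free space has dropped below the threshold."""
    return disk_free_fraction(path) < min_free_fraction
```

The same two functions can be pointed at any volume on any server, which is the advantage of operating system monitors over application monitors.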

Application Monitoring: Applications in this case have a very broad definition. It includes infrastructure software such as databases and web servers, but also includes application processes, services, or daemons. Monitoring for applications is irregular and difficult, because there is no catch-all monitoring that works for every application. In fact, most applications are unmanaged because administrators just don’t have a method for understanding their applications from a monitoring perspective, rather than just an administrative perspective. We’ll be spending considerable time on this monitoring because, although many vendors claim that they can do this automatically, it requires administrators to design these alerts. I will cover how to do this in detail, and provide a simple-to-follow guide on how to break apart an application into the alerts that matter.

Now that we’ve covered the overall areas that can cause a system to be down, the next article will talk about the longest-standing argument about the tools that can catch these failures: The Best of Breed debate.

Posted by randall / Filed under:Articles