archive 2007 October

Welcome NetConnect Folks!

Posted on Friday 19 October 2007

I’d like to welcome the NetConnect Members to the EffectiveMonitoring blog. The presentation Find it Fast: Using AppManager Stats to Identify Problems Quickly is now available on the Presentations page.

As mentioned during the presentation, this is an entire blog dedicated to monitoring techniques. These techniques apply regardless of the tool that you use, the underlying operating systems that you have, and the organization that you have in your environment. This latest presentation is, of course, about AppManager and Windows operating systems because of the conference, but the general topics apply everywhere, and I’ll be covering them in more detail in later entries.

If you want to contact me for any reason, please just use the contact page. I’m happy to talk to anyone about the presentation, or your monitoring issues. I’ve helped people solve quite a few issues.

For now, you’re catching the blog in the middle of a discussion about how to create and apply alerts based on my 10 years of experience with monitoring tools.

As for a recap of the conference, it was very much worth my time. I had a great time and had plenty of opportunities to talk to other monitoring folks and to people from NetIQ. I spent a lot of time trying to get certain enhancement requests in front of PMs and support. Being able to explain these things in person helped a lot, as many of them are seemingly complex issues that can be conveyed more simply in a five-minute conversation than in an email that reads like a book.

For example, there’s an issue with AppManager in how it handles events across reboots and restarts of the agent. A feature called event collapsing makes sure that only one event is triggered upon an error, so that, depending on how it’s configured, a continuing failure generates just one event. These events can also trigger actions, such as a page or an email. So, let’s take a situation where it’s monitoring a website, and it’s set to raise an event on the console and send an email via SMTP. If the website is down, you’ll get one event and one email, no matter how long it’s down. You’ll only get another email and event if the website comes up and then goes down again. A very useful feature, for certain.

But if you reboot your monitoring agent server (the one that is monitoring the website, NOT the monitoring infrastructure backend, which is not tied to these events), you’ll get another event and email when it comes back up. You’ll also get one if you restart the monitoring agent, or if you change any of the monitoring properties. That sends another alert to people who will then think that the website came back up and has gone down again. The event state should persist across reboots, restarts, and changes to the policy to avoid this problem.
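To make the idea concrete, here is a minimal sketch in Python of event collapsing that keeps its state on disk. It is not how AppManager works internally; the class name `CollapsingAlerter`, the JSON state file, and the `report` method are all assumptions made up for illustration.

```python
import json
from pathlib import Path


class CollapsingAlerter:
    """Hypothetical sketch: one event per failure episode, remembered on disk."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        # Reload the open events so a restart does not re-raise them.
        if self.state_path.exists():
            self.open_events = set(json.loads(self.state_path.read_text()))
        else:
            self.open_events = set()

    def _save(self):
        # Persist after every change so the state survives reboots and restarts.
        self.state_path.write_text(json.dumps(sorted(self.open_events)))

    def report(self, check_name, is_failing):
        """Return True only when a new event (and email) should be raised."""
        if is_failing and check_name not in self.open_events:
            self.open_events.add(check_name)
            self._save()
            return True
        if not is_failing and check_name in self.open_events:
            # The check recovered, so the next failure starts a new episode.
            self.open_events.discard(check_name)
            self._save()
        return False
```

Creating a second `CollapsingAlerter` on the same file stands in for an agent restart: a website that was already down stays collapsed into the original event, and a new event is raised only after it recovers and fails again.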

I know that your eyes are probably glazing over reading this explanation, and it still may not make sense, especially if you’ve never used the product. But it’s a problem that has a real effect on large shops such as ours. I’m glad that I was able to bring this and other issues to their attention, because I hope that it can make a feature list. As you can imagine, I’m a very detailed person when it comes to these features, because these seemingly small feature issues can cause major problems in environments as large as the ones that I face.

I will, perhaps, post my wish list for AppManager on the chicagoiq.net website (which I run). If you’re an AppManager user and have the same pain that we do, feel free to join in there and add your voice to my organization’s.

In the next post, which I might be able to write as I continue to wait for my plane back, I’ll talk about some of the future topics that I’d like to cover based on talking to other companies. There are a lot of problems that we’ve gotten past in our organization, and I think that there’s a lot that I can share that will save you quite a lot of time in your monitoring.

If my plane is delayed even more, who knows, I’ll possibly write even more articles while they’re still fresh. Meanwhile, I’m heading to my gate now.




Rule #2: Actionable Alerts

Posted on Thursday 11 October 2007

Tied very closely to the rule that all alerts must mean “run to the console now” is the concept that all alerts must be actionable. Informational alerts become noise, even when the event itself is interesting. Actionable alerts all have an action associated with them that can fix the system, or get your users working again.

When I first implemented monitoring of my systems in the ’90s, I set an alarm to go off if CPU utilization was more than 90% for more than a half-hour straight. The alert worked perfectly. It indeed went off when the CPU was heavily used. We ran to the console, and looked at the server stats. The server had been running at high utilization for a long time. In fact, it had been high almost all day. But what were we supposed to do about this issue? The alert flashed on throughout the day, and all we could do was stare at it. The system was just busy; nothing was wrong.

Certainly, high utilization can affect the users, except of course when it doesn’t. If you’re running a report at 3 AM that needs to be done at 8 AM, the fact that it takes until 4:30 AM isn’t a problem, even if utilization has maxed out. Generally, I recommend using a user experience monitor for alerts such as these. For example, for web-based apps, you can use a URL monitor to check response times, and set a time limit. Even then, you should make sure that the response time threshold you set makes sense, and the alert should probably fire only after the limit has been exceeded more than once. Then, if you’ve done a good job collecting the data you need to analyze problems such as these, you can look at your stats to find the cause. (And, of course, talking about statistics is a topic for another day. In fact, many other days. It’s a big topic, and of equal importance to alerting. It’s also the presentation that I’ll be giving in Orlando at NetConnect next week.)
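As a rough sketch of that kind of check, the Python below times a single request with the standard library and only reports a problem after several slow responses in a row. The names `measure_response_seconds` and `SlowResponseDetector`, and the idea of counting consecutive breaches, are my own assumptions for illustration and not the behavior of any particular URL monitor.

```python
import time
import urllib.request


def measure_response_seconds(url, timeout=10.0):
    """Fetch the URL once and return how long the request took, in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


class SlowResponseDetector:
    """Hypothetical helper: flag slowness only after repeated breaches."""

    def __init__(self, limit_seconds, required_breaches):
        self.limit_seconds = limit_seconds
        self.required_breaches = required_breaches
        self.consecutive = 0

    def observe(self, response_seconds):
        """Record one measurement; return True once the limit has been
        exceeded required_breaches times in a row."""
        if response_seconds > self.limit_seconds:
            self.consecutive += 1
        else:
            # One acceptable response resets the count.
            self.consecutive = 0
        return self.consecutive >= self.required_breaches
```

In use, you would call `measure_response_seconds` on a schedule and feed each result to `observe`. The sketch does not handle a request that fails outright; a real check would treat that as its own condition.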

The good point about creating actionable alerts is that they force you to come up with a list of the critical errors that can occur on your applications, and then determine what needs to be done to fix them. This links directly to another key part of monitoring related to procedures for dealing with these alerts. ITIL gives a good terminology and framework for dealing with these items, but even ITIL won’t help you if you don’t have a set of procedures that directly relate to your organization, operations, applications, users, and the systems themselves.

Here’s the short version about procedures which is another topic of future articles at effectivemonitoring.com: for each alert that you plan on handling, can you imagine the physical actions that you will do in order to deal with the issue? If not, you don’t have an actionable alert. If you can imagine what you’d do when you got that alarm, write it down! You’ve got an excellent start on a procedure.

It should be clear at this point that rule #2: “All alerts must be actionable” is another powerful filter for your alerting, and should further reduce the number of monitors that you implement. The next rule is the only one of the three that increases them.




Rule #1: Run To The Console Now!

Posted on Thursday 11 October 2007

There’s a simple rule of thumb that I’ve used in order to evaluate any new alerts that I want to put into my environment. Each one must be a situation which means “run to the console now” to fix the problem. Unfortunately, many shops design monitoring as if it absolves them from having to check their applications manually. It does not. The alerts should only tell you about events that require your immediate attention. Everything else should either not alert at all or go to a queue that you check occasionally.

This standard is simple to remember, and is easy to apply. But it’s not a rule that can be applied only occasionally. It will force you to be ruthless about eliminating all alerts that fall short.

The reason for this is simple. Have you ever been in a building that kept giving you false fire alarms? Did you even get out of your chair by the 5th time it happened? Alerts work the same way. If every time the fire alarm went off there was really a fire, you’d pay attention to all of the alerts. If it was spotty, you’d be inclined to ignore all of them, including the real ones.

There’s another aspect to this issue that’s not necessarily obvious before you set out on designing your alerts. If you got a fire alarm for another building in another city, and you weren’t responsible for doing anything about it, how useful is the alert? Every single alert must go only to the people who are responsible for fixing the problem. This is another rule that gets broken repeatedly by many groups. They send alerts to everyone on a team, when only a few people are responsible for the systems. This is the very steep slippery slope to alerts that get ignored by everyone. There’s something about our psychology that makes us desensitize quickly, even if we get only a few unnecessary alerts.
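Here is one way the routing described so far could be sketched in Python: urgent events go only to the owners of the affected system, and everything else lands in a queue to be reviewed later. The `Router` class, its fields, and the choice to queue urgent events that have no owner are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical sketch: notify owners only, queue the rest."""

    owners: dict  # system name -> list of people responsible for it
    review_queue: list = field(default_factory=list)

    def route(self, system, message, needs_immediate_action):
        """Return the people to notify right now (possibly nobody)."""
        recipients = list(self.owners.get(system, []))
        if not needs_immediate_action or not recipients:
            # Not urgent, or nobody owns the system: park it for review
            # rather than broadcasting it to the whole team.
            self.review_queue.append((system, message))
            return []
        return recipients
```

Queuing an unowned urgent event is a deliberate simplification; in practice that case points to a gap in ownership that should be fixed before the alert is turned on.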

Finally, you should always start with fewer alarms, and then add them later. If you were in that building with false fire alarms for a month, and then they fixed the problem, you’d still be suspicious the next time you heard an alarm. Most people blame the tool, but, in fact, it might have been misconfigured, or configured to alert too often.

Fortunately, there’s a fairly straightforward method of determining just the key alerts for an application, no matter what it is. It’s called critical path analysis, and I’ll be going over this once I get done explaining the three rules of good alerting.




The Three Rules of Meaningful Alerts

Posted on Monday 8 October 2007

There’s just one simple reason why most monitoring implementations fail: they send out too many alerts.

Most implementers start out from the premise that the biggest problem to avoid is to miss a critical problem. It’s not. Too many alerts are a bigger problem, by far. Once you have too many, your administrators will start ignoring all of the alerts.

When you design monitoring, follow these three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must cover every part of an application.

In order to follow all of these rules, most shops have to turn off a great number of monitors, as most of them fall short. We’re going to talk about each of these points in detail in coming articles, because they each require some explanation. Stay tuned for more on each of these.
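As a quick illustration of how the three rules work as a filter, here is a small Python sketch that reviews a list of proposed alerts. The `ProposedAlert` fields and the `review` function are hypothetical; they simply restate the rules as a checklist.

```python
from dataclasses import dataclass


@dataclass
class ProposedAlert:
    name: str
    component: str   # the part of the application this alert watches
    urgent: bool     # rule 1: does it mean "run to the console now"?
    procedure: str   # rule 2: the written steps to take; empty if none


def review(alerts, components):
    """Apply rules 1 and 2 as filters, then check rule 3 coverage.

    Returns the alerts worth keeping and the components left uncovered.
    """
    kept = [a for a in alerts if a.urgent and a.procedure.strip()]
    covered = {a.component for a in kept}
    uncovered = [c for c in components if c not in covered]
    return kept, uncovered
```

The first two rules shrink the list, and the third shows where an application still has no meaningful alert at all.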




The Break

Posted on Monday 8 October 2007

I apologize for the recent break in posts. I was just finishing writing a book that is coming out from St. Martin’s Press next year. I felt guilty if I did any writing that wasn’t dedicated to finishing by my deadline. The manuscript is with the editor now, so I’ll be writing articles for this site, possibly multiple articles a day, to catch up.

In case you’re curious, the book is about how to run your own independent band. There’s a surprising amount of technology involved with music now. Since music is my avocation, my skills as both a technologist and a musician came in handy.

Also, I’ve never believed that you can have just one passion in life. Mine are information technology, music, and writing.

The next book I do will be on the topics in this blog, which I will share here in article-sized pieces.



