NetConnect Lessons: No More CYA Alerts

Monday 5 November 2007

It’s taken me some time to integrate what I heard at NetConnect this year. I don’t mean that I learned a ton of new things; it’s that a lot of environments are in more trouble than I thought. One of the things it’s caused me to do is change the third rule. I’ve been stunned both at the large number of tool deployments out there in the world and at how many of them just aren’t used.

I never thought that another name for “monitoring” would be CYA. That is to say: “Cover Your Donkey” but, of course, using one of the synonyms for donkey.

It seems that many environments would rather have alerts that no one uses than have their system miss something and get blamed for a production issue. This leads to a proliferation of alerts that, very quickly, no one watches because they break the rules that I’ve been talking about here.

One of the rules that I felt was just a component of the first one needs to be a separate rule, so I’m replacing rule #3. The old rule #3 is going to become one of the rules of creating custom monitoring, which I’m going to explain soon when I talk about critical path monitoring. I’m going to talk about the new rule #3 in the next article. In this article, I want to talk about the CYA problem.

But, just to get it out of the way, here are the new three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must go only to the people who are responsible for acting on them.

Like I said, #3 is for the next article. For now, I just want to expand on the point behind #1 and talk about the CYA issue.

If you let administrators depend completely on the alerts from your systems tools, you will always be surprised by some events that happen on the servers. No matter how good a job you do developing your alerts, there are always oddball issues that you will miss. I inform every administrator of this issue, and tell them that they should still check their servers on an occasional basis, just as they should have been doing before the monitoring was put in place.

Because an alert is proof of a problem, though, some management and administrative groups have used alerts to assign blame rather than to fix things. I ran into more than one group at NetConnect that had been told they may not turn off any alerts because they might “miss something,” even if they are deluged with false alerts. Unfortunately, because many of these shops are not strict about reducing the alerts down to the ones that follow the rules, a lot of the alerts are meaningless, and the administrators ignore all of them, good and bad. This makes adding new alerts when they are necessary next to useless, because even the current set of alerts is already ignored.

It does take time and effort to reduce the number of alerts down to the ones that matter, but if you want to get any value out of Systems Monitoring tools at all, it’s a necessity.
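To make that pruning concrete, here is a minimal sketch of what checking an alert list against the three rules might look like. Everything in it is made up for illustration: the `AlertDefinition` fields, the team names, and the sample alerts are assumptions, not the data model of any real monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDefinition:
    """Hypothetical description of one configured alert (not from any real tool)."""
    name: str
    needs_immediate_response: bool                        # rule 1: "Run to the console now!"
    action: str = ""                                      # rule 2: what the responder should do
    recipients: set[str] = field(default_factory=set)     # who receives the alert
    responsible: set[str] = field(default_factory=set)    # who is expected to act on it


def violated_rules(alert: AlertDefinition) -> list[int]:
    """Return the numbers of the rules (1-3) that this alert breaks."""
    violations = []
    if not alert.needs_immediate_response:
        violations.append(1)
    if not alert.action.strip():
        violations.append(2)
    # Rule 3: someone must receive it, and every recipient must be responsible.
    if not alert.recipients or not alert.recipients <= alert.responsible:
        violations.append(3)
    return violations


def audit(alerts: list[AlertDefinition]) -> tuple[list[str], dict[str, list[int]]]:
    """Split alerts into ones to keep and ones to review, with the rules they break."""
    keep: list[str] = []
    review: dict[str, list[int]] = {}
    for alert in alerts:
        violations = violated_rules(alert)
        if violations:
            review[alert.name] = violations
        else:
            keep.append(alert.name)
    return keep, review


# Made-up sample alerts, purely for illustration.
ALERTS = [
    AlertDefinition("database down", True, "Fail over to the standby",
                    {"dba-oncall"}, {"dba-oncall"}),
    AlertDefinition("disk 60% full", False, "",
                    {"all-admins"}, {"storage-team"}),
    AlertDefinition("web tier down", True, "Restart the web pool",
                    {"web-oncall", "all-admins"}, {"web-oncall"}),
]

keep, review = audit(ALERTS)
print(keep)    # ['database down']
print(review)  # {'disk 60% full': [1, 2, 3], 'web tier down': [3]}
```

The point of the sketch is only that each rule is a yes-or-no question you can ask of every alert. Anything that lands in `review` is a candidate to fix, reroute, or turn off.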

In fact, the conversation that inspired this article was with an administrator at a company that was monitoring 2400 servers in its environment. I asked him how many alerts they get a day. He told me “thousands.” I asked what they did with them. He answered: “Nothing.” And so I asked why, and he said: “CYA.”

Now, why is this so easy to believe?

Posted by randall / Filed under:Uncategorized
