
The Best of Breed Debate

Posted on Monday 20 August 2007

If you’ve gotten through determining what you’d like to cover at a general level on all of your systems, it’s time to pick out the tools that you’re going to use if you haven’t done so already. The techniques that I’ll be covering at EffectiveMonitoring.com will work no matter what vendor you use, but the choice of tool will possibly make your job easier.

Most of the documentation on how to monitor is vendor specific, and written in such a way that the only solution to the “problems” that they bring up is their own product. There’s rarely a good debate about what tools should be like in general, and definitely not about making the right mix of tools that will get your job done. I don’t know about you, but I personally find that a lot of the articles written about systems monitoring read more like vendor press releases than good comparisons between tools.

The most heated debate throughout the entire decade that I’ve been involved in systems monitoring is whether to get “Best of Breed” tools or “Jack of All Trades” tools that try to manage the entire infrastructure.

The Best of Breed camp says that it’s necessary to drill deeply down into each application in order to do a good job monitoring it. This sometimes leads to tools that work for just a few platforms, which may cause you to have to purchase and maintain many solutions to cover your entire enterprise. It also can put a burden on your operations team (or whoever watches the consoles), because they may have to contend with multiple places to get alerts.

The Jack of All Trades camp says that you must have tools that span everything in your enterprise. The simplified version of their argument is that you should have alerts that correlate across all of your platforms. Unix, Windows, Network, SQL, Linux, applications, and everything else should have alerts. They say that alerts should go to just one place so that the tools can do correlation between alerts. Root cause analysis is easier at this point, because problems are all on one console. A single console is also simpler operationally. Unfortunately, these solutions tend to do a few things fairly well, and then provide mediocre coverage for the rest. Two sayings come to mind for these tools: “Jack of all trades, master of none.” And “A mile wide and an inch deep.”

And because this is IT, there’s a third camp that reared up. Some believe in consolidation tools that will roll up alerts from any solution into their console. Once these alerts are in a single tool, it can perform correlation or other analysis.

Now, because correlation is a crux issue for this debate, I need to cover it briefly now. Correlation is the concept that you should filter the “symptom” alerts from the “cause” alerts. For example, if your entire database server is down, then the fact that your application server is writing a logfile that it can’t contact the database is a symptom, not a cause. There’s only one alert that matters here, and your good correlation tools will filter for this. But correlation as it relates to this debate has a very strong tendency to favor Best of Breed. The reason is simple: correlation assumes that you’ve done a complete job of putting alerts on all of your critical points of failure first. Otherwise, you have nothing to correlate. The Jack of All Trades tools can miss “deep” alerts. My other observation about correlation is that it rarely works in practice. There is too much manual work involved, and these tools can generate too many false alerts due to incorrect correlations. I’ll cover correlation in detail in a future article, because it’s quite a large topic.
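To make the symptom-versus-cause idea concrete, here is a minimal sketch in Python. It is not how any particular product does correlation. The dependency map, the component names, and the `Alert` and `root_causes` names are all made up for illustration, and it assumes you already know which components depend on which.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    component: str
    message: str


# Hypothetical dependency map: each component lists what it relies on.
DEPENDS_ON = {
    "web-server": ["app-server"],
    "app-server": ["db-server"],
    "db-server": [],
}


def upstream(component, depends_on):
    """Return everything `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def root_causes(alerts, depends_on):
    """Keep only alerts whose upstream dependencies are not alerting too."""
    alerting = {a.component for a in alerts}
    return [a for a in alerts
            if not (upstream(a.component, depends_on) & alerting)]


alerts = [
    Alert("db-server", "host is down"),
    Alert("app-server", "cannot contact database"),
    Alert("web-server", "HTTP 500 from application"),
]

causes = root_causes(alerts, DEPENDS_ON)
for alert in causes:
    print(alert.component, "-", alert.message)
```

With all three alerts present, only the db-server alert survives the filter. Take the db-server alert out of the list and the same filter names the app-server alert as the cause, which is exactly the coverage problem: correlation can only reason about alerts that were actually raised.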

After a decade of using various tools, I would suggest that IT shops use tools that are best of breed within the systems monitoring space that can cover as much of your environment as possible, and then use specific tools to solve the rest of the issues. I haven’t seen any tools that do a good job of mixing network alerting (routers, switches, and cable plant monitoring) with the systems alerting, especially if your company is large enough to have a networking team. Their needs tend to be so different they need their own console and control over their own tools. And, besides, they tend to ignore systems monitoring alerts. That’s only fair, because systems monitoring folks often have to ignore network alarms because sometimes traffic can be routed through other infrastructure, and the alerts aren’t meaningful.

I do believe that having fewer consoles is a goal that you should always strive towards, and this is why your application alerting and your operating system alerting need to go in the same tool as much as possible. There’s a simple rule of thumb for this decision if you need to evaluate possible solutions: You must be able to do deep monitoring on each aspect of your system. Leave none out. Your set of monitoring solutions must cover your database servers, your web servers, your custom applications, and all other critical aspects of your operating systems. If you can’t find an overall solution that covers all of these, you need to bring in a best-of-breed solution that will be able to handle the alerts on all of the parts that you haven’t covered yet.
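The rule of thumb can be written down as a simple gap check. The sketch below is only an illustration: the tool names and the coverage they claim are invented, and deciding whether a tool really monitors an area deeply is still a judgment you have to make yourself before you fill in the table.

```python
# Areas the rule of thumb says must be covered deeply.
REQUIRED_AREAS = {
    "database servers",
    "web servers",
    "custom applications",
    "operating systems",
}

# Hypothetical tools and the areas you judge each one to cover deeply.
TOOL_COVERAGE = {
    "general-suite": {"operating systems", "web servers"},
    "db-specialist": {"database servers"},
}


def coverage_gaps(required, tool_coverage):
    """Return the required areas that no tool in the set covers."""
    covered = set().union(*tool_coverage.values())
    return required - covered


gaps = coverage_gaps(REQUIRED_AREAS, TOOL_COVERAGE)
print(sorted(gaps))
```

Anything left in `gaps` is an area where you either need a best-of-breed tool or your own custom alerting.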

I prefer category solutions in the systems space that allow me to write custom scripts. I often find that I want to alert on areas that the out-of-box monitoring doesn’t cover, and I need the freedom to add in a new alert type. But I want to emphasize again that the upcoming techniques and articles are vendor-neutral, and that whatever you choose, you will be able to use them. As long as you make sure that your set of solutions covers all of the areas that we talked about in the Defining “Down” article, you will find the next articles usable almost immediately.

The next article in this series covers a surefire way to make certain that every single one of your alerts is meaningful.
