Defining “Down”
One of the points that’s often missed in monitoring is what we actually mean when we say a system, server, or application is “down.” It’s not as clear a concept as it may seem, and it’s where we need to start if we’re going to design a comprehensive set of alerts for a system.
What we do know for sure is that if your network, server hardware, operating system, or application components are having trouble, your users will call and simply say: “The system is down.” The goal of monitoring is to know about these faults before they happen, or at worst when they happen. Even if you find out at the same time as your users, the alert points you at the failing component, so you skip the sleuthing you’d be doing if all you had was that user call. You also save the time between when users first try to contact you and when their message actually arrives. In a large organization, tickets can sometimes spend hours in various help desk queues.
I’ve seen many shops complain about trouble with their monitoring, but fail to take the simple first step of identifying the actual components of their systems that can cause a failure. This process is part of a methodology I call identifying the critical path, which I am going to cover in detail in a future article. For now, I just want to focus on the overall areas of failure for systems in general, and talk about the monitoring that each one needs. Your set of monitoring tools must cover all of these areas.
Network: Network monitoring is entirely different from systems monitoring. The best network monitoring tools check all of the network links as well as the status of your network hardware. These articles won’t cover network monitoring in depth, so I will assume that you have this covered. The good news is that networks have a more regular set of faults, so monitors can be implemented without as much alerting design. This is often true no matter what kind of hardware or cable plant you have in your data center.
Hardware: Whether it’s a bad hard drive, correctable memory errors, a failed motherboard, or a dead fan, a hardware fault can crash your system. You need to know the status of your hardware at all times. Fortunately, most hardware monitoring systems are predictive. That is, they will often tell you which components need to be replaced before those components fail completely. Hardware monitoring is also out of scope for this series of articles, although the alarms for these faults can certainly be sent to the systems monitoring console. The best hardware monitoring solutions usually come directly from your hardware vendors because they have the tightest integration with the actual hardware.
Operating System: If you run out of disk space or memory, or hit any other operating system failure, your users are still going to call you and say that your system has failed. This alerting should be handled by the same software as your application monitoring. This is a complex topic that will be handled in a series of future articles, but it has the advantage that the same set of monitors applies no matter what applications are running on the system.
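To make the disk space case concrete, here is a minimal sketch of an operating system monitor in Python. The function names and the 80 and 90 percent thresholds are my own assumptions for illustration, not recommendations from any particular tool, and a real monitor would hand the result to your alerting console rather than just return it.

```python
import shutil


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a usage percentage to a severity (thresholds are illustrative)."""
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def disk_alert(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (severity, used_pct) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = 100.0 * usage.used / usage.total
    return classify(used_pct, warn_pct, crit_pct), used_pct


if __name__ == "__main__":
    severity, used_pct = disk_alert("/")
    print(f"disk usage on /: {used_pct:.1f}% ({severity})")
```

Because nothing in this check depends on what the server is running, the same monitor can be deployed unchanged to every system.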
Application Monitoring: “Application” in this case has a very broad definition. It includes infrastructure software such as databases and web servers, as well as application processes, services, and daemons. Application monitoring is irregular and difficult, because there is no catch-all monitor that works for every application. In fact, most applications are unmanaged because administrators lack a method for understanding their applications from a monitoring perspective, rather than just an administrative one. We’ll be spending considerable time on this area because, although many vendors claim that their tools can do this automatically, in practice administrators have to design these alerts themselves. I will cover how to do this in detail, and provide a simple-to-follow guide on how to break apart an application into the alerts that matter.
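As a rough preview of what breaking an application into alerts can look like, the sketch below treats an application as a set of named checks and reports which ones are failing. Everything here is a hypothetical illustration: the helper names, the example host and port, and the choice of a TCP connection test are my assumptions, and a listening port only proves that something accepted the connection, not that the application is working correctly.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failing_checks(checks):
    """Run each named check and return the names of those that failed.

    `checks` maps a name to a zero-argument callable that returns True when
    the component is healthy. A check that raises is counted as failed.
    """
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical components of one application; adjust to your own system.
    checks = {
        "web server port": lambda: port_is_open("127.0.0.1", 80),
        "database port": lambda: port_is_open("127.0.0.1", 5432),
    }
    print("failing:", failing_checks(checks))
```

The point of naming each check is that the alert can say which component failed, instead of just repeating the user’s “the system is down.”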
Now that we’ve covered the overall areas that can cause a system to be down, the next article will take up the longest-standing argument over the tools that can catch these failures: the Best of Breed debate.
