Imagine that your entire job is to watch a single server.
Sounds easy, doesn’t it? You would just need to stay connected to it, watch disk space and other basic statistics, and connect to the applications running on it occasionally.
Now imagine that your job is to watch five servers.
It still sounds possible, right? You would need to switch between them, though, checking each one to see if it is having problems.
Now imagine that it’s your job to watch 200 servers. You’re in the middle of a data center. The fans are roaring in your ears, and the hard drives are whirring. Which servers are having problems at this point? Out of curiosity, I’ve asked many application groups how they know when their systems are down. Inevitably, they give me the same answer:
“Our users call us when there’s a problem.”
How can this be an acceptable solution? In this case, an IT emergency might lie in the network, the server hardware, the operating system, the application itself, or any number of other places. A lot of applications actually run across many servers, so when users call, it’s impossible to tell which component is having the problem without some detective work.
My data center is more than 10 times the size of the example above, with over 2200 servers and at least 15,000 applications running. But when someone asks me “Are there any problems?” I can glance at a console and give them an answer. In fact, most of the time, my consoles have NO alerts flashing on them. That does not mean that there are no problems, but it does mean that there are no unhandled problems. Every alert has an open ticket and people on the job, working to get it fixed.
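To make “no unhandled problems” a little more concrete, here is a minimal Python sketch of the idea. It isn’t tied to any monitoring tool: the `Alert` class, its fields, and the sample data are all made up for illustration. It assumes an alert counts as handled once it has both a ticket and someone assigned to it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    # Hypothetical alert record; real tools will use different fields.
    server: str
    message: str
    ticket_id: Optional[str] = None  # assumed link to a ticketing system
    assignee: Optional[str] = None   # person working the ticket


def unhandled(alerts: List[Alert]) -> List[Alert]:
    """Return alerts that lack a ticket or lack someone working on them."""
    return [a for a in alerts if a.ticket_id is None or a.assignee is None]


# Made-up sample data: one alert is being worked on, one is not.
alerts = [
    Alert("db01", "disk 95% full", ticket_id="T-1001", assignee="alice"),
    Alert("web07", "HTTP check failing"),
]

# Only the alert nobody has picked up would flash on the console.
for alert in unhandled(alerts):
    print(f"{alert.server}: {alert.message}")
```

With a filter like this, an empty console means that every known problem already has an owner, which is a different claim from saying that nothing is wrong.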
You can get this same level of comfort while monitoring your own environment. And the answer isn’t in the latest tool. In fact, it doesn’t matter what monitoring tool you use. The answer is in how you implement monitoring, and in the procedures that you build around the alerts that you get.
Although many companies blame the monitoring tools, many systems monitoring implementations fail simply because the alerts haven’t been implemented properly. This blog will share how to implement them effectively. I will cover troubleshooting, alerting, and determining what the critical points of a server are. I will also cover how to collect troubleshooting and performance statistics, so you’ll be able to answer the dreaded question: “What happened?” With the right data, you can often tell them. In fact, in my own environment, we are often able to tell people what caused a server failure that happened at any point in the last month.
This blog covers effective monitoring for IT systems, based on over a decade of experience implementing systems monitoring in enterprise environments. I’m sharing this information because I needed to bring it together for some presentations that I’ve been asked to give for conferences and ITIL organizations, and there is more material than I can cover in the short amount of time that I have for the presentations. I wanted to share my full experience and methodology, which isn’t vendor-specific. I will release new articles regularly, in between my job as a systems monitoring implementer for a Fortune 100 company and the presentations that I will be giving.
The next topic will lay out the scope of the problem, and identify everything that can break, so that we can understand the real problem that we’re trying to solve.
