I’d like to welcome the NetConnect Members to the EffectiveMonitoring blog. The presentation Find it Fast: Using AppManager Stats to Identify Problems Quickly is now available on the Presentations page.
As mentioned during the presentation, this is an entire blog dedicated to monitoring techniques. These techniques apply regardless of the tool that you use, the underlying operating systems that you have, and the way your organization is set up. This latest presentation is, of course, about AppManager and Windows operating systems because of the conference, but the general topics apply more broadly, and I’ll be covering them in more detail in later entries.
If you want to contact me for any reason, please just use the contact page. I’m happy to talk to anyone about the presentation, or your monitoring issues. I’ve helped people solve quite a few issues.
For now, you’re catching the blog in the middle of a discussion about how to create and apply alerts based on my 10 years of experience with monitoring tools.
As for a recap of the conference, it was very much worth my time. I had a great time and got to talk to a lot of other monitoring folks and also to people from NetIQ. I spent a good part of it trying to get certain enhancement requests in front of PMs and support. Being able to explain these things in person helped a lot, as many of them are seemingly complex issues that can be conveyed more simply in a 5-minute conversation than in an email that reads like a book.
For example, there’s an issue with how AppManager handles events across reboots and restarts of the agent. A feature called event collapsing makes sure that, depending on how it’s configured, a continuing failure raises just one event rather than a new one on every check. These events can also trigger actions, such as a page or an email. So, let’s take a situation where the agent is monitoring a website, and it’s set to send an event to the console and an email via SMTP. If the website is down, you’ll get one event and one email, no matter how long it’s down. You’ll only get another email and event if the website comes up and then goes down again. A very useful feature, for certain.
But if you reboot your monitoring agent server (the one that is monitoring the website, NOT the monitoring infrastructure backend, which is not tied to these events), you’ll get another event and email when it comes back up. You’ll also get one if you restart the monitoring agent, or if you change any of the monitoring properties. The people who receive that second alert will then think that the website came back up and has gone down again. To avoid this problem, the collapsed event state should persist across reboots, restarts, and changes to the policy.
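To make the idea concrete, here’s a minimal Python sketch of event collapsing that survives a restart. This is not how AppManager works internally, and every name in it (`EventCollapser`, `report`, the JSON state file) is something I made up for illustration. It just shows the behavior I’m asking for: the agent writes down which checks are already failing, so a restarted agent doesn’t raise a second event for the same outage.

```python
import json
import os
import tempfile


class EventCollapser:
    """Hypothetical sketch: one event per continuous failure, state kept on disk."""

    def __init__(self, state_path):
        self.state_path = state_path
        # Restore the checks that were already failing before a restart.
        self.failing = self._load()

    def _load(self):
        try:
            with open(self.state_path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def _save(self):
        with open(self.state_path, "w") as f:
            json.dump(sorted(self.failing), f)

    def report(self, check, ok):
        """Return True if a new event (and its email action) should fire."""
        if ok:
            # Recovery clears the state, so the next failure is a new event.
            if check in self.failing:
                self.failing.discard(check)
                self._save()
            return False
        if check in self.failing:
            return False  # still down: collapsed into the existing event
        self.failing.add(check)
        self._save()
        return True


def demo():
    path = os.path.join(tempfile.mkdtemp(), "collapse_state.json")
    agent = EventCollapser(path)
    results = [
        agent.report("website", ok=False),  # site goes down: event fires
        agent.report("website", ok=False),  # still down: collapsed
    ]
    agent = EventCollapser(path)  # simulate an agent restart or reboot
    results.append(agent.report("website", ok=False))  # still down: no new event
    results.append(agent.report("website", ok=True))   # site recovers
    results.append(agent.report("website", ok=False))  # down again: new event
    return results


if __name__ == "__main__":
    print(demo())
```

The only part that matters here is that the failing set is loaded in the constructor and saved on every change. Without that, the restarted agent starts with an empty set and treats the ongoing outage as brand new, which is exactly the duplicate alert described above.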
I know that your eyes are probably glazing over reading this explanation, and it still may not make sense, especially if you’ve never used the product. But it’s a problem that has a real effect on large shops such as ours. I’m glad that I was able to bring this and other issues to their attention, and I hope that a fix can make a feature list. As you can imagine, I’m very detail-oriented when it comes to these features, because seemingly small feature issues can cause major problems in environments as large as the ones that I face.
I will, perhaps, put up my wish list for AppManager on the chicagoiq.net website (which I run). If you’re an AppManager user and you have the same pain that we do, feel free to join in there and add your voice to my organization’s.
In the next post, which I might be able to write as I continue to wait for my plane back, I’ll talk about some of the future topics that I’d like to cover based on talking to other companies. There are a lot of problems that we’ve gotten past in our organization, and I think that there’s a lot that I can share that will save you quite a lot of time in your monitoring.
If my plane is delayed even more, who knows, I’ll possibly write even more articles while they’re still fresh. Meanwhile, I’m heading to my gate now.
