For those of us who endure the regular on-call rotation, one thing is more frustrating than anything else: the so-called “False Alert.” It arrives at the most inconvenient of hours, it means pulling the laptop out of the bag, connecting, and going through the whole rigmarole of engaging, and then you find that the server isn’t really down, or that the router’s memory usage has dropped back below the threshold, or some other thing that wasn’t worth the disruption.
This is almost immediately followed by marking the ticket resolved and leaving a note that it was a FALSE ALERT before closing the laptop and returning to your day. Or night. Or whatever.
I know – I’ve been there. I’ve been that person ignoring alarms because I knew better than the monitoring system. I was wrong, and I have seen the light.
That change of heart undoubtedly has something to do with being the dedicated administrator for a monitoring tool in a large organization: I’ve seen the effects of those dismissive responses firsthand, and my thinking has evolved. The duty of a monitoring tool is to tell you that there is a problem, and there are three accepted paths away from an alarm it generates:
1: Fix the fault condition
This one is obvious. Something broke, it generates an alert, you do the needful and fix it. Alarm clears, you close the ticket.
2: Fix the monitor
Sometimes our monitors are overzealous. Maybe you have an application that manages its own memory, so it grabs 90% and does its thing. If the monitoring tool is set to a 90% threshold, you’re going to have a bad time – just adjust the threshold! Or reconfigure the application to use only 85%.
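To make that concrete, here is a minimal sketch in Python of the per-host override idea: the default threshold stays at 90%, but hosts whose applications are designed to hold that much memory get their own value. This isn’t tied to any particular monitoring product, and the host names, percentages, and check function are all hypothetical.

```python
# Minimal sketch of a per-host memory threshold override, not tied to
# any particular monitoring product. Host names and percentages are
# hypothetical placeholders.

DEFAULT_MEMORY_THRESHOLD = 90.0  # percent

# Hosts whose applications deliberately hold most of the RAM get a
# higher threshold so routine behavior doesn't page anyone.
THRESHOLD_OVERRIDES = {
    "db-cache-01": 95.0,  # in-memory cache: 90%+ usage is normal here
    "db-cache-02": 95.0,
}

def memory_alert_needed(host: str, used_percent: float) -> bool:
    """Return True only if usage exceeds the threshold for this host."""
    threshold = THRESHOLD_OVERRIDES.get(host, DEFAULT_MEMORY_THRESHOLD)
    return used_percent > threshold

# 92% on the cache box is business as usual; 92% anywhere else alerts.
print(memory_alert_needed("db-cache-01", 92.0))  # False
print(memory_alert_needed("web-01", 92.0))       # True
```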
Overly sensitive monitors desensitize the technicians and administrators who have to respond, which also means that when something is truly broken, they may simply ignore the alert.
3: Pass the buck
This is the one I see a lot of. Some of it falls under the point above, but some tools don’t support that kind of tuning, or are too complicated to configure for every scenario. It happens when something in the path between the monitor and the monitored device breaks. For example, a switch dies, the monitoring platform can’t ping the servers behind it, and it generates down alerts for every one of those servers. The server admins get frustrated because it’s clearly a network problem, so they close their tickets. The same thing happens when website monitors trigger because the single sign-on (SSO) tool is broken.
THIS IS NOT A FALSE ALERT. It is not fake news. It’s just not your problem. You do need to make sure that it gets escalated to the right people. And, if possible, push your monitoring people to add dependencies so you don’t get disturbed again.
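The “add dependencies” request usually means parent/child suppression: if the switch in front of a group of servers is down, the server alerts are suppressed and only the switch alert pages anyone. Here is a rough Python sketch of that idea; the topology, host names, and reachability data are invented for illustration, and real tools (Nagios, Zabbix, SolarWinds, and so on) each have their own way of expressing this.

```python
# Rough sketch of parent/child alert suppression. The topology, host
# names, and reachability data are hypothetical; a real platform would
# replace REACHABLE with live ping/SNMP checks.

PARENTS = {
    "app-server-01": "core-switch-01",
    "app-server-02": "core-switch-01",
    "core-switch-01": None,  # top of this branch of the topology
}

# Simulated state so the sketch runs standalone: the switch is down,
# so everything behind it looks down as well.
REACHABLE = {
    "core-switch-01": False,
    "app-server-01": False,
    "app-server-02": False,
}

def is_reachable(host: str) -> bool:
    """Stand-in for a real reachability check (ICMP, SNMP, agent poll)."""
    return REACHABLE.get(host, True)

def should_alert(host: str) -> bool:
    """Alert on a down host only when its parent is still reachable.

    If the parent is also down, the root-cause alert belongs to the
    parent, and the child's alert is suppressed instead of paging the
    server team about a network problem.
    """
    if is_reachable(host):
        return False
    parent = PARENTS.get(host)
    if parent is not None and not is_reachable(parent):
        return False  # the upstream failure owns this outage
    return True

for host in ("core-switch-01", "app-server-01", "app-server-02"):
    print(host, "->", "alert" if should_alert(host) else "suppressed")
# Only core-switch-01 generates an alert; the server tickets never open.
```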
False alerts are a different animal, and their causes should be rare: you accidentally blacklisted the monitoring platform so it couldn’t ping your devices, or the platform itself failed somehow.
In any case, an alert should always be a call to action – just not always the action that it immediately indicates.