I’m always looking for ways to improve both internal and external services and tools because I believe that to stagnate is bad. Just because what we have works doesn’t mean we can’t make it better. If we can make it better in a cost-effective way, why aren’t we doing that?
It was in that vein that, when I saw an ad for PagerDuty’s services (with a free T-shirt), I looked a little more closely at what they actually did and then signed up for a trial.
Disclaimer: I have no affiliation with PagerDuty beyond my use as a trial user. Any links provided to PagerDuty are free of any referral codes, and I neither pay anything nor gain anything when someone clicks through to their site and signs up for their services. PagerDuty offers a free T-shirt on the first alert, but since the signup form neither asked for my address nor gave me anywhere to enter it, I can only assume the offer is for paid users. That certainly wasn’t made clear on signup, but it does make sense that you need to give them money before they give things to you.
The Basics
PagerDuty is a cloud-based service that provides alert escalations to duty technicians or administrators based on predefined rules and schedules.
PagerDuty can, based on user preferences, escalate alerts via email, SMS, phone call or by push notifications to their Android and iPhone applications.
PagerDuty accepts alerts from a wide range of tools including (but certainly not limited to) Nagios, Icinga, Zabbix, Pingdom, UptimeRobot and NodePing, harnessing both email queues and API-based tools.
When an alert is received it is assigned to an escalation path based on where it came from. The escalation path then notifies individuals or schedules and can be escalated if desired when the assigned on-call fails to acknowledge the alert. If the acknowledged alert goes unresolved for a period of time it is possible to have the alert fall back into a triggered state whereby it starts the escalation process again.
Who Is PagerDuty For?
PagerDuty is for anyone who needs to escalate alerts. PagerDuty is excellent for organizations where there are one or more monitoring systems that need to be consolidated into a single escalation system (e.g. Pingdom for system availability and Nagios for specific sub-checks). PagerDuty also excels at scheduling on-call staff on a daily, weekly or other basis. Override tools are also provided so that if Frank is out for a few days, Keith can be scheduled without breaking the rotation.
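To make the rotation-plus-override idea concrete, here is a minimal Python sketch. It is my own toy model, not PagerDuty’s implementation or API: the function name, the weekly shift length, the dates and the third engineer ("Maria") are all assumptions made for illustration.

```python
from datetime import date


def on_call(day, rotation, start, shift_days=7, overrides=None):
    """Return who is on call on `day` (toy model, not PagerDuty's logic).

    rotation   -- ordered list of names that take turns
    start      -- date on which the first person's first shift begins
    shift_days -- length of each shift in days
    overrides  -- {(first_day, last_day): name}, inclusive date ranges
    """
    # An override wins for its dates but does not shift the rotation.
    for (first, last), name in (overrides or {}).items():
        if first <= day <= last:
            return name
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


rotation = ["Frank", "Keith", "Maria"]  # "Maria" is a made-up third engineer
start = date(2024, 1, 1)
# Frank is out on 3-4 January, so Keith covers those two days.
overrides = {(date(2024, 1, 3), date(2024, 1, 4)): "Keith"}

print(on_call(date(2024, 1, 2), rotation, start, overrides=overrides))  # Frank
print(on_call(date(2024, 1, 3), rotation, start, overrides=overrides))  # Keith
print(on_call(date(2024, 1, 5), rotation, start, overrides=overrides))  # Frank
print(on_call(date(2024, 1, 8), rotation, start, overrides=overrides))  # Keith
```

Because the override is looked up before the rotation, Frank is back on call on 5 January and Keith’s regular week still starts on 8 January, which is the "without breaking the rotation" behaviour described above.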
In my personal environment, PagerDuty would escalate a set of monitoring systems in a uniform manner and notify me of issues. In my work environment it would be an excellent tool to handle alerts for our level 3 support teams where our Systems, Network and possibly Management teams would each have an escalation path to which some alerts would go directly and others would be initiated after investigation by our Level 1/2 technical staff.
Review
Enough about what PagerDuty is and what it does; let’s get into the meat of it. I signed up for PagerDuty and the first thing I did was poke around. The interface is pretty intuitive, and I never really got lost. There were no glaring bugs in the system, and everything tied together nicely. A few things do have dependencies (you can’t delete an Escalation Path if it is tied to a Service, for example), but the warnings and errors were more than sufficient to tell me what the problem was and what I needed to do to fix it.
The first thing I did was add my Nagios system, because that is my primary source of alerts. Many of them require tuning to reduce false alerts, but that’s another story. Nagios is an interesting one. PagerDuty can receive alerts by email, but for systems like Nagios and Zabbix, which integrate through their notification system, it works a little more directly: alerts are posted to a queue, and a cron job that runs every minute flushes the queue. If you prefer to trigger via email, you can do so. With Nagios there is no reason you can’t have both – a check that the queue is being cleared that alerts via email, or a check that mail is being processed that alerts via the queue tool. Once I got the Nagios tools working, I moved on to some of my other external monitoring tools that were only supported by email. Both were on the supported list and integrated perfectly.
Alerts in PagerDuty have three states: Triggered, Acknowledged, and Resolved. A new alert will land in the Triggered state and will trigger the Escalation Path associated with the service it came in from. This can be as simple or as convoluted as you want it to be. It might just notify you individually and stop, or it might start by notifying the Level 1 Schedule, wait 15 minutes, then notify the Level 1 Schedule along with the Level 2 Schedule, then wait another 30 minutes before notifying the Management Schedule. As soon as someone clicks the “Acknowledge” button for an alert, the Escalation Path is stopped. There is a per-service option to time out an acknowledged alert: if the alert remains “Acknowledged” for that time (default: 30 minutes), it falls back into the Triggered state and the Escalation Path starts again from the beginning.
For alerts from systems like Nagios that can also send the “OK” state to contacts, PagerDuty will automatically resolve the open alerts (triggered or acknowledged) and any escalations will stop for those also.
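The life cycle described above can be sketched as a small state machine. This is a toy model under my own assumptions, not PagerDuty’s code: the class and function names are invented, time is counted in whole minutes, and the escalation steps are the Level 1 / Level 2 / Management example from the text.

```python
from dataclasses import dataclass

TRIGGERED, ACKNOWLEDGED, RESOLVED = "triggered", "acknowledged", "resolved"

# (minutes after triggering, schedules notified at that step)
ESCALATION_PATH = [
    (0, ["Level 1"]),
    (15, ["Level 1", "Level 2"]),
    (45, ["Management"]),  # 15 minutes, then another 30
]


@dataclass
class Alert:
    state: str = TRIGGERED
    since: int = 0  # minute at which the alert entered its current state

    def acknowledge(self, now):
        if self.state == TRIGGERED:
            self.state, self.since = ACKNOWLEDGED, now

    def resolve(self, now):
        # An "OK" from the monitoring tool closes triggered and
        # acknowledged alerts alike.
        if self.state != RESOLVED:
            self.state, self.since = RESOLVED, now

    def tick(self, now, ack_timeout=30):
        # An acknowledgement that sits too long falls back to triggered,
        # which restarts the escalation path from the beginning.
        if self.state == ACKNOWLEDGED and now - self.since >= ack_timeout:
            self.state, self.since = TRIGGERED, now


def steps_fired(alert, now, path=ESCALATION_PATH):
    """Escalation steps that have fired since the alert was (re)triggered."""
    if alert.state != TRIGGERED:
        return []  # acknowledged or resolved: the escalation path is stopped
    elapsed = now - alert.since
    return [schedules for delay, schedules in path if delay <= elapsed]


alert = Alert()
print(steps_fired(alert, 20))  # [['Level 1'], ['Level 1', 'Level 2']]
alert.acknowledge(20)
print(steps_fired(alert, 30))  # []
alert.tick(50)                 # 30 minutes acknowledged, still unresolved
print(alert.state)             # triggered
print(steps_fired(alert, 50))  # [['Level 1']]
alert.resolve(55)
print(alert.state)             # resolved
```

The point the sketch makes is that acknowledging only pauses an alert: without a resolve, the timeout puts it back at the first escalation step.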
Now, those who expect to have their alerting tools send everything and then filter at the PagerDuty level will be disappointed. There are zero options in PagerDuty to decide whether one alert should be escalated differently from another – you need to have predefined all of this. For example, if you have Nagios alerts that should escalate at a high-priority level and others that should escalate at a much lower priority, you need to set those up to escalate to two different contacts so that they can have two different services and two different escalation paths in PagerDuty. That means that it doesn’t really simplify the problem very much; it just moves it.
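In other words, the routing decision has to live on the monitoring side. A rough Python sketch of what that decision amounts to follows; the service names and the list of high-priority checks are hypothetical placeholders of mine, and in practice this would be expressed as two Nagios contacts rather than as code.

```python
# Hypothetical mapping from a priority label to a PagerDuty service.
# The service names are placeholders, not real identifiers.
SERVICE_FOR_PRIORITY = {
    "high": "nagios-high-priority",
    "low": "nagios-low-priority",
}

# Hypothetical rule: which checks count as high priority.
HIGH_PRIORITY_CHECKS = {"HTTP", "MySQL"}


def route(check_name):
    """Pick the service (and therefore the escalation path) for a check."""
    priority = "high" if check_name in HIGH_PRIORITY_CHECKS else "low"
    return SERVICE_FOR_PRIORITY[priority]


print(route("HTTP"))        # nagios-high-priority
print(route("Disk Usage"))  # nagios-low-priority
```

Each service then carries its own escalation path, so the priority rules end up maintained in the monitoring configuration rather than in PagerDuty.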
The next problem is one that will irk managers who want to define how escalations work at the individual level, and that is that each individual user can define how they are contacted. While there may be some API-based tools to update all users (I doubt it, but I haven’t looked), it is completely possible that your engineers have configured their profile to not be notified for 30 minutes by any method. Personally, I would like to see a way for a manager to enforce an individual escalation procedure (SMS and Email at 0 minutes, and a phone call at 15 minutes, for example). One of the handy tools is that you can configure a notification when going on or off call, and this can come by email or by SMS.
This problem is closely related to the user permissions levels. I’d be interested to see a table that shows the various tasks that can be performed, and which level is required to perform them. It seemed that most of the critical tasks performed by PagerDuty could be performed by a “User” where it would probably be beneficial to have these restricted to Admin. It may also be that PagerDuty have separated these where Team Leads who manage schedules are “Users”, their subordinates who just handle alerts are Limited Users and anyone dealing with anything Billing related is an Admin, but I was personally surprised by what a “User” could do compared with an “Admin.”
The final major issue I noted is that adding users is not the simplest task I’ve seen. An Admin or User must “invite” the user via email, and it’s possible to add users who haven’t accepted their invitations to the schedules. There didn’t seem to be an easy way to determine if a user has accepted their invitation and configured themselves.
PagerDuty Staff and Support
I haven’t contacted their support team, though I noted that there doesn’t seem to be an easy way to contact them from any of the logged-in pages, and after hours their live support was unavailable from the main site.
That said, every trial user is (understandably) a sales opportunity and their sales team are very aggressive in wanting to show you the ins and outs of their tool. After a week the emails started, and I got three messages within three days from their staff wanting to further demonstrate the product. By that point I was well acquainted with it and failed to respond (sorry guys), but for my use cases there is nothing more I can do with the product than I have already.
The Numbers
PagerDuty is expensive; there is no question there. At $19.95/user/month (paid yearly, $24.95/user/month if paid monthly) it gets very expensive, very quickly, even for a small team. I estimate that if I pushed this to my employer, cost would be the first question and it would be turned down very quickly as a result. Even if we only entered our Network, Systems and core managers (who weren’t already on either team) into the tool, we would be looking at 9, maybe 10 users at a minimum. At 10 users that is $199.50/mo just for a monitoring service, assuming we paid yearly – a bill of nearly $2,400 a year. At that cost, we’re better off paying one of our technicians to build the same system with PHP, and in just a couple of months we’d be profiting.
Keep in mind, the $19.95 number is also for fairly basic service. It provides only 25 international alerts per month, so if you have a lot of overseas staff (outside the USA) then you’d best not need to escalate to them very often (it is possible to purchase more at $0.35 each). More worrying is that the $19.95 level doesn’t grant any SLA and only Email support. If you want more international alerts (100/mo then $0.35 each), phone support, or the OPTION (i.e. costs extra) for 24/7 phone support or an SLA, you’ll be paying $39.95/user/month (paid yearly, $49.95/user/month if paid monthly). As an added bonus, the higher level also allows for Single Sign On.
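For anyone who wants to run the numbers for their own team size, here is the arithmetic as a short Python sketch. The per-user prices are the ones quoted in this review; the tier labels "basic" and "higher" are my own shorthand, and the calculation ignores any extras such as additional international alerts.

```python
from decimal import Decimal

# Per-user monthly prices quoted in this review.
# The tier labels "basic" and "higher" are my own shorthand.
PRICES = {
    ("basic", "yearly"): Decimal("19.95"),
    ("basic", "monthly"): Decimal("24.95"),
    ("higher", "yearly"): Decimal("39.95"),
    ("higher", "monthly"): Decimal("49.95"),
}


def monthly_cost(users, tier="basic", billing="yearly"):
    """Cost per month for a team of `users` people."""
    return PRICES[(tier, billing)] * users


def yearly_cost(users, tier="basic", billing="yearly"):
    """Cost over twelve months for a team of `users` people."""
    return monthly_cost(users, tier, billing) * 12


print(monthly_cost(10))                      # 199.50
print(yearly_cost(10))                       # 2394.00
print(yearly_cost(9))                        # 2154.60
print(yearly_cost(10, billing="monthly"))    # 2994.00
print(yearly_cost(10, tier="higher"))        # 4794.00
```

Decimal is used instead of float so the totals come out as exact cents.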
The Verdict
Unfortunately, because of the cost alone, I can’t recommend PagerDuty, and I think I’ll be terminating at the end of the trial. If we put cost aside, it’s a great tool with great potential that I would be more than happy to push at work or even keep for myself, but I just can’t get past that cost factor – it’s not quite that valuable.