Also, Rule 39 — there’s no such thing as a coincidence (there is, but shush!)
A few weeks ago we had a power outage in one of our larger colocation facilities hosting customer racks. Because of legacy hardware and cost, a number of the servers providing core services don’t have redundant power supplies, so a couple of them went down with everything else. Ever since, we’d been having serious performance issues with one of those nodes.
Some quick background: at every facility we provide service in, we have at least a couple of servers acting as hypervisors. Back in the day we were using OpenVZ to run separate containers, then we moved to KVM-based VMs, and now we’re in the process of moving over to VMware. In each iteration, the principle is the same: provide services within the datacenter that either shouldn’t, or aren’t recommended to, traverse the public internet.
For example, we host a speed test server so that our clients can test their connectivity within the datacenter, to the next closest facility, or across the country. We have a syslog proxy server which takes plaintext input and forwards it to a central logging server over an encrypted link. We have a couple of DNS servers which we offer to our clients as local resolvers. And we have a monitoring system that reaches out to local devices via SNMP (often unencrypted) and reports back to the cloud-based monitoring tool.
Ever since the power outage, we noticed that the monitoring server was a lot more sluggish than usual, and reporting itself as down multiple times per day. I did some digging, noticed a failure on one of the drives, had it replaced — that didn’t go well. Ended up taking the entire hypervisor down and rebuilding it and its VMs (these hosts don’t have any kind of centralized storage). No big, I thought. This won’t take long, I thought.
The process was going a lot slower than usual, and I didn’t put nearly enough thought into why that might be. I left the hypervisor rebuilding overnight and came back in with a fresh head. I was just settling in when I noticed a flood of alerts from hosts that weren’t able to reach Puppet. Odd, I thought, the box is fine! Memory is good, CPU is good, I/O wait time (which is notoriously poor on this hardware) was fine. So I set about watching the Puppet runs themselves, and they were taking forever. Our Puppet master typically compiles a catalog in 3-4 seconds, tops. It was taking 20-30 seconds per host, and that’s not sustainable with nearly 200 servers checking in.
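In hindsight, a thirty-second resolver check from the Puppet master would have pointed straight at the problem. Something like this rough sketch (the hostnames below are placeholders, not our real ones) makes a dead first nameserver obvious, because every lookup burns the full resolver timeout before falling through to the next entry in resolv.conf:

```python
#!/usr/bin/env python3
"""Rough check of system resolver latency from the Puppet master.
Hostnames are made up for illustration."""
import socket
import time

# Hypothetical names the Puppet master ends up resolving during agent runs
HOSTS = ["puppet.example.internal", "syslog1.example.internal", "web042.example.internal"]

for host in HOSTS:
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None)
        status = "ok"
    except socket.gaierror as exc:
        status = f"failed ({exc})"
    elapsed = time.monotonic() - start
    # With a dead first nameserver in resolv.conf, each lookup eats the
    # full timeout (5 seconds by default on glibc) before the second
    # nameserver even gets asked.
    flag = "  <-- suspicious" if elapsed > 1.0 else ""
    print(f"{host}: {elapsed:.2f}s {status}{flag}")
```

Each catalog compile triggers a pile of lookups, so a few seconds of timeout per lookup adds up to 20-30 second runs very quickly.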
What’s embarrassing for me is that it wasn’t until a customer ticket came in reporting slow DNS lookups that I realized exactly what was happening: the hypervisor I was rebuilding, the one whose “dns1” configs I had consciously deleted, was the culprit.
A quick modification to the Puppet master’s resolv.conf, plus a temporary update to the Puppet-managed resolv.conf we push out, reversing the order of those internal nameservers, and everything started to clear up. Puppet runs dropped back down to single digits, and customers reported everything was fine.
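For the curious, the change itself was nothing clever. Roughly this, with placeholder addresses standing in for our real resolvers:

```
# Before: dns1 (down, mid-rebuild) listed first, so every lookup
# waited out the full timeout before falling back to dns2.
nameserver 10.0.0.53
nameserver 10.0.1.53

# After: dns2 promoted to the top until dns1 comes back.
nameserver 10.0.1.53
nameserver 10.0.0.53
```

There’s also an argument for shipping something like “options rotate timeout:1” alongside the nameserver lines so a single dead resolver hurts a lot less, but that’s a tuning exercise for another day.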
Of course it was DNS. It’s always DNS.