When you have an environment as big as ours, there absolutely need to be processes in place to manage what comes in, what already exists, and what goes out. That is why I haven’t been posting much recently: I’ve been stuck trying to fix this problem, because those processes are either not written down, don’t exist, or no-one follows them.
Consider this: When I took on this role in July of 2014, we were getting our feet wet with Puppet, we had dabbled at some point in Spacewalk, and gone no further. We had somewhere in the vicinity of 200 servers, be they physical devices or virtual containers of one kind or another, most of them running CentOS 5, and no central processes or tools to manage them. I don’t know how the team managed Shellshock or Heartbleed; I assume they patched the systems they could think of that were most likely to get hit or would hurt the most, and ignored the rest.
My highest priority coming in was to fix the Puppet implementation, re-deploy Spacewalk, set CentOS 6.x as the standard, and get moving on pushing systems into that environment. So far we’ve made good progress: over 150 systems are now in that environment, and while I don’t have a good count of what is left, we’re well over 50%. Still, there are systems I don’t know about, or don’t know enough about.
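Getting even a rough count of what’s managed versus what isn’t doesn’t take much. The sketch below pulls the list of registered systems from Spacewalk’s XML-RPC API; the server URL and credentials are placeholders, and comparing the result against whatever full inventory you trust is the obvious next step.

```python
#!/usr/bin/env python3
"""Rough count of systems registered with Spacewalk, via its XML-RPC API.

The server URL, username, and password below are placeholders; substitute
your own. auth.login, system.listSystems, and auth.logout are standard
Spacewalk API calls.
"""
from xmlrpc.client import ServerProxy

SPACEWALK_URL = "https://spacewalk.example.com/rpc/api"  # placeholder
USER = "admin"                                           # placeholder
PASSWORD = "secret"                                      # placeholder

client = ServerProxy(SPACEWALK_URL)
session = client.auth.login(USER, PASSWORD)

try:
    systems = client.system.listSystems(session)
    print(f"{len(systems)} systems registered with Spacewalk")
    for system in systems:
        print(f"  {system['name']} (last check-in: {system['last_checkin']})")
finally:
    client.auth.logout(session)
```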
Our cloud solution is one of them. I worked on this project as a Junior, after we’d pushed it out and hit problems. I was astounded: here we were, trying to put together a product to sell to our customers at an extremely high premium, and we were throwing a few hours a week at building it in between panicked days of supporting needy customers. It was no wonder to me that when we rolled it out it was broken, it wasn’t properly documented or monitored, and no-one knew how it all worked. Part of me wonders if we intended for it to fail.
And so we come back to oversight and documentation. Just as my team is in the midst of conceptual design for our next virtualization platform, the thing fails. By now it has a few shared webhosting servers running on it, and that’s about it, but our support team is still getting slammed and we need to fix it. Here’s what we found:
- The control node for the environment had filled its disk. Apparently some time back in June or July, I don’t know for sure — we weren’t monitoring it.
- The backup server, which stores backups of VMs generated via the control node, had a corrupted disk and had gone into read-only mode. Possibly as far back as February — we weren’t monitoring it.
- Two hypervisors failed simultaneously; one of them came back up, but the VM it hosted was still broken. We only learned any of this when customers called in and reported issues, and when the VMs themselves generated alerts by being unreachable. We weren’t properly monitoring the hypervisors; even the simple checks sketched below would have caught all three of these failures.
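None of these failures needed anything sophisticated to catch. As a rough illustration of what “monitoring it” would have meant here, this sketch covers the three gaps: a disk-usage threshold for the control node, a read-only filesystem check for the backup server, and a reachability check for the hypervisors. Hostnames and thresholds are placeholders, and in reality these checks belong in whatever monitoring platform you already run, not a standalone script.

```python
#!/usr/bin/env python3
"""Sketch of the three checks we were missing: disk usage, read-only
filesystems, and host reachability. Hostnames and thresholds are
placeholders; wire the real checks into your monitoring system."""
import shutil
import subprocess

DISK_ALERT_PERCENT = 90                                  # placeholder threshold
HYPERVISORS = ["hv01.example.com", "hv02.example.com"]   # placeholder hosts


def check_disk(path="/"):
    """Alert when a filesystem crosses the usage threshold."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    if percent >= DISK_ALERT_PERCENT:
        print(f"ALERT: {path} is {percent:.0f}% full")


def check_readonly():
    """Alert on any local filesystem that is mounted read-only."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if device.startswith("/dev/") and "ro" in options.split(","):
                print(f"ALERT: {mountpoint} ({device}) is mounted read-only")


def check_reachable(host):
    """Alert when a hypervisor stops answering ping."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        print(f"ALERT: {host} is unreachable")


if __name__ == "__main__":
    check_disk("/")
    check_readonly()
    for hv in HYPERVISORS:
        check_reachable(hv)
```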
All of these issues should have been handled long before the service became available for sale. Some of them were documented as needing to be fixed, but no-one seemed too worried about making that happen.
My predecessor once said “if it isn’t documented, it isn’t finished” — I agree. But expanding on that, if it isn’t monitored, it isn’t fully documented. If it isn’t documented, it isn’t finished, and if it isn’t finished, it isn’t ready for full-price-paying customers in production.