There are several principles to consider when designing and building a system for high availability, the so-called 100% uptime.
Firstly, what is the budget? Cost constraints will invariably determine how high that availability can be. For this exercise, we’re going to focus on a virtual environment in which the VMs will have 100% uptime.
Constraining factors:
Power: Reliable power is critical. If grid power goes out, what happens? You need to know that a UPS system is in place, and that it is well monitored and tested. You also need to know that at least one generator is in place, ideally more, to pick up the load during prolonged power grid outages.
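If you want to put rough numbers on that, here is a back-of-the-envelope check that UPS runtime comfortably covers the generator’s start-and-transfer window. All figures are made up for illustration; substitute your own.

```python
# Rough sanity check: does UPS runtime cover the generator start window?
# All numbers below are illustrative assumptions -- use your own.

UPS_CAPACITY_WH = 20_000    # usable battery capacity, watt-hours (assumed)
CRITICAL_LOAD_W = 8_000     # total load on the UPS, watts (assumed)
GENERATOR_START_S = 90      # worst-case generator start + transfer time
SAFETY_FACTOR = 3           # demand several times the bare minimum

runtime_s = UPS_CAPACITY_WH / CRITICAL_LOAD_W * 3600
required_s = GENERATOR_START_S * SAFETY_FACTOR

print(f"UPS runtime: {runtime_s:.0f}s, required: {required_s}s")
assert runtime_s >= required_s, "UPS cannot bridge a generator start!"
```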
Network: Your VMs may be 100% available internally, but if you can’t reach them from the outside, there is no point. Having a network provider with 100% uptime is critical; otherwise you accept that your provider’s maximum, say 99.999%, is also your maximum. That, or you select multiple providers and blend the bandwidth yourself, as the math below shows. That gets expensive, though, more so when you consider the equipment you’ll need and the expertise to configure and maintain it.
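To see what blending buys you, here is the back-of-the-envelope arithmetic, under the (optimistic) assumption that your providers fail independently rather than sharing conduits or upstream carriers:

```python
# Combined availability of independent uplinks: you are down only when
# every provider is down at the same time.
def blended_availability(*availabilities: float) -> float:
    downtime = 1.0
    for a in availabilities:
        downtime *= (1.0 - a)
    return 1.0 - downtime

# One five-nines provider vs. two independent three-nines providers:
print(blended_availability(0.99999))         # 0.99999
print(blended_availability(0.999, 0.999))    # 0.999999 -- six nines
```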
Level of redundancy: How much of your stack are you able to lose before you can’t provide full service? For example, do you have N+1 hypervisors, N+2, or 2N? In each case, “N” is the minimum number of units you need to provide service. Knowing that you have more than that available (and how much more) is critical.
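As a trivial sketch of what those models mean in unit counts (the N of 4 is just an example):

```python
# How many units each redundancy model requires, given N -- the minimum
# number of units needed to carry the full load.
def required_units(n: int, model: str) -> int:
    if model == "N+1":
        return n + 1
    if model == "N+2":
        return n + 2
    if model == "2N":
        return 2 * n
    raise ValueError(f"unknown model: {model}")

# e.g. a cluster that needs 4 hypervisors to carry the workload:
for model in ("N+1", "N+2", "2N"):
    print(model, "->", required_units(4, model), "hypervisors")
```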
Single Points of Failure
To be 100% reliable, you must not have any single point of failure. If you draw your infrastructure out on a piece of paper and throw darts at it, no dart should be able to strike something that takes down the entire stack. Two darts, hopefully not. Three darts, maybe. Four well-placed darts, quite possibly.
In fact, this is an excellent exercise for finding weak points. Draw out your infrastructure and make copies. Cross out one item on each copy, and determine whether the stack would continue to function. The hardest one of these is often power or the core network. If a member of your storage switch stack failed, would everything continue to function? If a power circuit failed, whether from a breaker trip or a PDU problem, would you stay up and online?
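You can even automate the dart game. Here is a minimal sketch that models a made-up stack as a graph and reports any single component whose loss alone cuts the VMs off from the outside; the topology and names are purely illustrative.

```python
from collections import deque

# A toy model of the stack as a directed graph. The service is "up" if
# 'internet' can still reach 'vm' with one component removed.
TOPOLOGY = {
    "internet": ["router-a", "router-b"],
    "router-a": ["switch-a"],
    "router-b": ["switch-b"],
    "switch-a": ["hyp-1", "hyp-2"],
    "switch-b": ["hyp-1", "hyp-2"],
    "hyp-1": ["san"],
    "hyp-2": ["san"],
    "san": ["vm"],   # a single storage array -- the deliberate weak point
}

def reachable(graph, src, dst, removed):
    """Breadth-first search from src to dst, skipping the removed node."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != removed:
                seen.add(nxt)
                queue.append(nxt)
    return False

# "Throw one dart" at every component in turn:
for component in TOPOLOGY:
    if component in ("internet", "vm"):
        continue
    if not reachable(TOPOLOGY, "internet", "vm", component):
        print(f"single point of failure: {component}")
```

Run against this toy topology, it flags the lone storage array, exactly the kind of weak point the paper exercise is meant to surface.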
At least two of everything
The minimum number of anything in a high-availability deployment is two. Two power strips, two storage controllers, two switches (per switch stack), two provider uplinks, and so on. They may be active/active or active/standby, but there are two, and both are hot, ready to pick up the full load on no notice.
For example, if your hypervisors only have one power supply each, you should have at least two hypervisors anyway, with each one fed from a different power circuit. Your network connectivity should run one port to each member of your switch stack, configured for some type of link aggregation or failover in case a switch member, or a switchport on your system, fails.
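Here is a quick sketch of checking one failure domain, power, with illustrative hostnames and an example N of 2: losing any single circuit should still leave at least N hypervisors running.

```python
# A minimal sketch: verify that losing any one power circuit still leaves
# at least N hypervisors. Names and numbers are illustrative assumptions.
N = 2  # minimum hypervisors needed to carry the full load

# Which circuit feeds each single-PSU hypervisor:
power_feeds = {
    "hyp-1": "circuit-a",
    "hyp-2": "circuit-b",
    "hyp-3": "circuit-a",
    "hyp-4": "circuit-b",
}

for failed in sorted(set(power_feeds.values())):
    survivors = [h for h, c in power_feeds.items() if c != failed]
    status = "OK" if len(survivors) >= N else "DEGRADED"
    print(f"lose {failed}: {len(survivors)} hypervisors remain -> {status}")
```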
Recalculating as needed
Often we deploy these systems with a set of design goals and a designed upper capacity. Time passes, the capacity actually in use creeps above that design ceiling, and we should have processes in place to notice when it does. When that happens, you’ll need to recalculate and re-evaluate. Maybe you now need another hypervisor or two to retain the advertised redundancy. Maybe you need more storage devices, or more switches.
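That recalculation can itself be a small script you run on a schedule. Here is a minimal sketch of the N+1 check for one resource (RAM); all figures are illustrative.

```python
# A minimal sketch of the recalculation: does current usage still fit
# if one hypervisor fails? All numbers are illustrative assumptions.
HOSTS = 5                 # hypervisors in the cluster
HOST_CAPACITY_GB = 512    # usable RAM per hypervisor
USED_GB = 2100            # RAM currently allocated to VMs

surviving_capacity = (HOSTS - 1) * HOST_CAPACITY_GB
if USED_GB > surviving_capacity:
    deficit = USED_GB - surviving_capacity
    extra_hosts = -(-deficit // HOST_CAPACITY_GB)  # ceiling division
    print(f"N+1 violated: add at least {extra_hosts} hypervisor(s)")
else:
    print("N+1 redundancy still holds")
```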
Make someone else do it
Of course, the “easy” way out is to pay someone else to design and build your system for you. Plenty of service providers will be quite happy to do so, from Dell to VCE to many smaller MSPs. You still need to understand the principles above, though, and be able to question their designs based on them.