Introduction to High Availability and Load Balancing
Table of Contents
1 What is high availability?
1.1 reduce (if not entirely eliminate) service downtime
- find ways to fix things faster when they break
- avoid single points of failure
1.2 build systems that degrade gracefully rather than fail completely
- add redundant capacity
- distribute systems across multiple hosts/networks
2 How systems fail
2.1 Hardware failure
- CPU and memory hardware failure
- disk failure
- power failure
- network hardware failure
2.2 Software failure
- bugs
- misconfiguration
- upgrade changes
2.3 Human error
- sometimes your most frequent problem
- can often be mitigated with good habits and automated testing
3 A brief history of reliability techniques
3.1 "The really good basket"
From the era of very expensive hardware.
- "If all your eggs are in one basket, make it a really good basket"
- Rapid service for hardware faults
- Full-system backups for storage failure
- Careful change management
- Recovery from an outage can still take a long time
3.2 Failover
- Maintain a redundant, identically-configured standby server
- Switch it on when primary server fails
- Requires careful synchronization of software, configuration, and data
- Brief outages for failover and failback
- Doubles hardware costs
- Increased system management effort
3.3 RAID (for storage only)
- Replicate data across additional disks
- RAID-1 (mirrored disks), RAID-5 (one disk's worth of parity, distributed across the array), RAID-6 (two independent parity blocks)
- Protects against a single disk failure (or two concurrent failures with RAID-6)
- Does not protect against software corruption, accidental overwriting, catastrophes that wipe out entire systems, ...
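As a rough illustration of how parity protects data (a toy sketch, not how real RAID firmware works): XOR parity over a stripe lets the array rebuild whichever single block is lost. The disk count and block values below are invented.

```python
from functools import reduce

# Hypothetical contents of one stripe across three data disks.
data_disks = [0b10110010, 0b01101100, 0b11000101]

# The parity block is the XOR of all data blocks in the stripe.
parity = reduce(lambda a, b: a ^ b, data_disks)

# Suppose disk 1 fails: XOR-ing the parity with the surviving
# blocks reconstructs the lost block.
surviving = [d for i, d in enumerate(data_disks) if i != 1]
assert reduce(lambda a, b: a ^ b, surviving, parity) == data_disks[1]
```

RAID-6 computes a second, independent parity block, which is why it can survive two concurrent disk failures.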
3.4 Load balancing
- Distribute requests among multiple server systems
- Requires very careful synchronization of configuration
- Allows for scaling (add more servers to handle more load)
- Server failures decrease capacity but do not substantially disrupt service
- "N+1" capacity: have at least enough servers to serve load, plus one more so system can tolerate single-server failure
4 Load balancing in more detail
4.1 Concepts and terminology
- (server) pool
- a set of servers that are identically configured to provide a service
- virtual IP
- the externally-visible IP address that clients contact to access a service (often also associated with a specific TCP port)
- session state
- Individual client requests have to be routed to a particular pool member, and the load balancer has to keep track of which pool members are handling which client requests.
- persistence
- some applications (web applications in particular) spread a session across multiple transactions, so an application identifier (such as a web cookie) is used to keep the client associated with the particular pool member that holds its application state
- scheduling
- how the load balancer chooses a pool member for a new client session. Examples (a minimal sketch follows this list):
- round-robin
- cycle through the pool members
- least-connection
- choose a pool member handling the fewest sessions
- least-load
- choose a pool member with the least resource usage
- health checks
- the load balancer frequently checks the status of pool members and stops routing traffic to those that appear unresponsive
- checks range from simply opening a TCP connection to a service port ("shallow" monitoring) to issuing a more complicated request that verifies the application is responsive ("deep" monitoring)
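A minimal sketch (not any real load balancer's implementation) of two of the scheduling policies above plus a "shallow" TCP health check; the Pool class, its fields, and the example addresses are invented for illustration:

```python
import itertools
import socket

class Pool:
    """A set of identically-configured servers reachable through one virtual IP."""

    def __init__(self, members):
        self.members = list(members)              # e.g. [("10.0.0.11", 80), ("10.0.0.12", 80)]
        self.active_sessions = {m: 0 for m in self.members}
        self._rr = itertools.cycle(self.members)

    def round_robin(self):
        """Cycle through pool members in order."""
        return next(self._rr)

    def least_connection(self):
        """Pick the member handling the fewest sessions (a real balancer
        would update active_sessions as sessions open and close)."""
        return min(self.members, key=lambda m: self.active_sessions[m])

    def shallow_health_check(self, member, timeout=1.0):
        """'Shallow' monitoring: just try to open a TCP connection."""
        host, port = member
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
```

A "deep" check would instead issue a real application request (for example, fetching a known URL) and verify the response.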
4.2 Load balancing failure modes
- flail-over
- the active role flails back and forth between the members of a redundant pair, usually because of a critical operating-system problem or a misconfiguration; sometimes addressed with STONITH (an acronym for the grotesque metaphor "shoot the other node in the head")
- split-brain
- both members of a redundant pair act as the active node simultaneously, usually because a network partition prevents them from checking each other's status
5 General high-availability principles
5.1 never have only one of anything
- one server
- one network connection
- one power supply
- one copy of valuable data
- one site where everything is
- one person who knows how something works
5.2 be more careful about changes
5.2.1 separate test environment
- a (probably reduced-scale) instance of your infrastructure with (nearly) identical configuration
- try out changes in noncritical environment
- promote tested changes to production (as with git branching + puppet environments)
- adds resource overhead
5.2.2 change control/review
- other people spot mistakes you don't
- a different approach might be better
- adds delay
5.2.3 automated testing
- perform functional tests on changes before deployment (see the sketch after this list)
- test-driven development – create tests before writing code
- block deployment of nonworking code
- greater development overhead
- better deployment confidence
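A minimal sketch of the kind of functional check that could gate a deployment; the URL and endpoint are placeholders, not anything specific to a particular environment:

```python
import urllib.request

def service_is_healthy(url="http://staging.example.com/healthz", timeout=5):
    """Return True if the (placeholder) staging endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# In a CI pipeline, a failing check blocks the deployment:
if not service_is_healthy():
    raise SystemExit("functional test failed; refusing to deploy")
```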
5.2.4 continuous/incremental deployment
- requires high redundancy
- requires automated (or at least integrated) testing
- requires configuration management
- upgrade one piece at a time (see the sketch below)
- "do 1, do 100, do 1%, do one per second"
- excellent talk on modern deployment/upgrade techniques:
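To make "upgrade one piece at a time" concrete, here is a minimal sketch of an incremental rollout loop; the upgrade and health-check callables, batch sizes, and pacing are all placeholders:

```python
import time

def rolling_upgrade(hosts, upgrade, healthy, batch_sizes=(1, 10), delay=1.0):
    """Upgrade pool members a few at a time, halting at the first failure.

    `upgrade` and `healthy` are placeholder callables: one applies the
    change to a host, the other verifies the host still serves correctly.
    """
    remaining = list(hosts)
    batches = list(batch_sizes) + [len(remaining)]   # finish with "everything else"
    for size in batches:
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            upgrade(host)
            if not healthy(host):
                raise RuntimeError(f"{host} unhealthy after upgrade; halting rollout")
            time.sleep(delay)                        # pace the rollout
        if not remaining:
            break
```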