Introduction to High Availability and Load Balancing
Table of Contents
1 What is high availability?
1.1 reduce (if not entirely eliminate) service downtime
- find ways to fix things faster when they break
- avoid single points of failure
1.2 build systems that degrade gracefully rather than fail completely
- add redundant capacity
- distribute systems across multiple hosts/networks
2 How systems fail
2.1 Hardware failure
- CPU and memory hardware failure
- disk failure
- power failure
- network hardware failure
2.2 Software failure
- bugs
- misconfiguration
- upgrade changes
2.3 Human error
- sometimes your most frequent problem
- can often be mitigated with good habits and automated testing
3 A brief history of reliability techniques
3.1 "The really good basket"
From the era of very expensive hardware.
- "If all your eggs are in one basket, make it a really good basket"
- Rapid service for hardware faults
- Full-system backups for storage failure
- Careful change management
- Recovery from an outage can still take a long time
3.2 Failover
- Maintain a redundant, identically-configured standby server
- Switch it on when primary server fails
- Requires careful synchronization of software, configuration, and data
- Brief outages for failover and failback
- Doubles hardware costs
- Increased system management effort
3.3 RAID (for storage only)
- Replicate data across additional disks
- RAID-1 (mirrored disks), RAID-5 (one disk's worth of parity, distributed across the array), RAID-6 (two independent parity blocks)
- Protects against a single disk failure (or two concurrent failures with RAID-6)
- Does not protect against software corruption, accidental overwriting, catastrophes that wipe out entire systems, ...
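As a rough illustration of how parity protects data (a toy sketch, not how real RAID firmware works): XOR parity over a stripe lets the array rebuild whichever single block is lost. The disk count and block values below are invented.

```python
from functools import reduce

# Hypothetical contents of one stripe across three data disks.
data_disks = [0b10110010, 0b01101100, 0b11000101]

# The parity block is the XOR of all data blocks in the stripe.
parity = reduce(lambda a, b: a ^ b, data_disks)

# Suppose disk 1 fails: XOR-ing the parity with the surviving
# blocks reconstructs the lost block.
surviving = [d for i, d in enumerate(data_disks) if i != 1]
assert reduce(lambda a, b: a ^ b, surviving, parity) == data_disks[1]
```

RAID-6 computes a second, independent parity block, which is why it can survive two concurrent disk failures.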
3.4 Load balancing
- Distribute requests among multiple server systems
- Requires very careful synchronization of configuration
- Allows for scaling (add more servers to handle more load)
- Server failures decrease capacity but do not substantially disrupt service
- "N+1" capacity: have at least enough servers to serve load, plus one more so system can tolerate single-server failure
4 Load balancing in more detail
4.1 Concepts and terminology
- (server) pool
- a set of servers that are identically configured to provide a service
- virtual IP
- the externally-visible IP address that clients contact to access a service (often also associated with a specific TCP port)
- session state
- Individual client requests have to be routed to a particular pool member, and the load balancer has to keep track of which pool members are handling which client requests.
- persistence
- some applications (web applications in particular) spread a session across multiple transactions, so an application identifier (such as a web cookie) is used to keep the client associated with the particular pool member that holds its application state
- scheduling
- how the load balancer chooses a pool member for a new client session. Examples (a minimal sketch follows this list):
- round-robin
- cycle through the pool members
- least-connection
- choose a pool member handling the fewest sessions
- least-load
- choose a pool member with the least resource usage
- health checks
- the load balancer frequently checks the status of pool members and stops routing traffic to those that appear unresponsive
- checks range from simply opening a TCP connection to a service port ("shallow" monitoring) to issuing a more complicated request that verifies the application is responsive ("deep" monitoring)
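A minimal sketch (not any real load balancer's implementation) of two of the scheduling policies above plus a "shallow" TCP health check; the Pool class, its fields, and the example addresses are invented for illustration:

```python
import itertools
import socket

class Pool:
    """A set of identically-configured servers reachable through one virtual IP."""

    def __init__(self, members):
        self.members = list(members)              # e.g. [("10.0.0.11", 80), ("10.0.0.12", 80)]
        self.active_sessions = {m: 0 for m in self.members}
        self._rr = itertools.cycle(self.members)

    def round_robin(self):
        """Cycle through pool members in order."""
        return next(self._rr)

    def least_connection(self):
        """Pick the member handling the fewest sessions (a real balancer
        would update active_sessions as sessions open and close)."""
        return min(self.members, key=lambda m: self.active_sessions[m])

    def shallow_health_check(self, member, timeout=1.0):
        """'Shallow' monitoring: just try to open a TCP connection."""
        host, port = member
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
```

A "deep" check would instead issue a real application request (for example, fetching a known URL) and verify the response.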
4.2 Load balancing failure modes
- flail-over
- the active role flails back and forth between the members of a redundant pair, usually because of a critical operating-system problem or a misconfiguration; sometimes addressed with STONITH (an acronym for the grotesque metaphor "shoot the other node in the head")
- split-brain
- both members of a redundant pair act as the active node simultaneously, usually because a network partition prevents them from checking each other's status
5 General high-availability principles
5.1 never have only one of anything
- one server
- one network connection
- one power supply
- one copy of valuable data
- one site where everything is
- one person who knows how something works
5.2 be more careful about changes
5.2.1 separate test environment
- a (probably reduced-scale) instance of your infrastructure with (nearly) identical configuration
- try out changes in noncritical environment
- promote tested changes to production (as with git branching + puppet environments)
- adds resource overhead
5.2.2 change control/review
- other people spot mistakes you don't
- a different approach might be better
- adds delay
5.2.3 automated testing
- perform functional tests on changes before deployment (see the sketch after this list)
- test-driven development – create tests before writing code
- block deployment of nonworking code
- greater development overhead
- better deployment confidence
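A minimal sketch of the kind of functional check that could gate a deployment; the URL and endpoint are placeholders, not anything specific to a particular environment:

```python
import urllib.request

def service_is_healthy(url="http://staging.example.com/healthz", timeout=5):
    """Return True if the (placeholder) staging endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# In a CI pipeline, a failing check blocks the deployment:
if not service_is_healthy():
    raise SystemExit("functional test failed; refusing to deploy")
```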
5.2.4 continuous/incremental deployment
- requires high redundancy
- requires automated (or at least integrated) testing
- requires configuration management
- upgrade one piece at a time (see the sketch below)
- "do 1, do 100, do 1%, do one per second"
- excellent talk on modern deployment/upgrade techniques:
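To make "upgrade one piece at a time" concrete, here is a minimal sketch of an incremental rollout loop; the upgrade and health-check callables, batch sizes, and pacing are all placeholders:

```python
import time

def rolling_upgrade(hosts, upgrade, healthy, batch_sizes=(1, 10), delay=1.0):
    """Upgrade pool members a few at a time, halting at the first failure.

    `upgrade` and `healthy` are placeholder callables: one applies the
    change to a host, the other verifies the host still serves correctly.
    """
    remaining = list(hosts)
    batches = list(batch_sizes) + [len(remaining)]   # finish with "everything else"
    for size in batches:
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            upgrade(host)
            if not healthy(host):
                raise RuntimeError(f"{host} unhealthy after upgrade; halting rollout")
            time.sleep(delay)                        # pace the rollout
        if not remaining:
            break
```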