Introduction to High Availability and Load Balancing

1 What is high availability?

1.1 reduce (if not entirely eliminate) service downtime

  • find ways to fix things faster when they break
  • avoid single points of failure

1.2 systems that degrade rather than fail completely

  • add redundant capacity
  • distribute systems across multiple hosts/networks

2 How systems fail

2.1 Hardware failure

  • CPU and memory hardware failure
  • disk failure
  • power failure
  • network hardware failure

2.2 Software failure

  • bugs
  • misconfiguration
  • upgrade changes

2.3 Human error

  • sometimes your most frequent problem
  • can sometimes be mitigated with good habits and automated testing

3 A brief history of reliability techniques

3.1 "The really good basket"

From the era of very expensive hardware.

  • "If all your eggs are in one basket, make it a really good basket"
  • Rapid service for hardware faults
  • Full-system backups for storage failure
  • Careful change management
  • Potentially long outage recovery

3.2 Failover

  • Maintain a redundant, identically-configured standby server
  • Switch it on when primary server fails
  • Requires careful synchronization of software, configuration, and data
  • Brief outages for failover and failback
  • Doubles hardware costs
  • Increased system management effort

3.3 RAID (for storage only)

  • Replicate data across additional disks
  • RAID-1 (mirrored disks), RAID-5 (adds a parity disk), RAID-6 (multiple parity disks); see the parity sketch after this list
  • Protects against single disk failures (or multiple in RAID-6)
  • Does not protect against software corruption, accidental overwriting, catastrophes that wipe out entire systems, ...
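
A minimal Python sketch of the parity idea behind RAID-5 (the disk contents are made-up bytes for illustration): the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the surviving blocks plus parity.

  def xor_blocks(blocks):
      # XOR equal-length byte strings together.
      result = bytearray(len(blocks[0]))
      for block in blocks:
          for i, b in enumerate(block):
              result[i] ^= b
      return bytes(result)

  # Data blocks striped across three disks, parity stored on a fourth.
  disk1 = b"\x01\x02\x03\x04"
  disk2 = b"\x10\x20\x30\x40"
  disk3 = b"\x0a\x0b\x0c\x0d"
  parity = xor_blocks([disk1, disk2, disk3])

  # If disk2 fails, rebuild its contents from the survivors plus parity.
  assert xor_blocks([disk1, disk3, parity]) == disk2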

3.4 Load balancing

  • Distribute requests among multiple server systems
  • Requires very careful synchronization of configuration
  • Allows for scaling (add more servers to handle more load)
  • Server failures decrease capacity but do not substantially disrupt service
  • "N+1" capacity: have at least enough servers to serve load, plus one more so system can tolerate single-server failure

4 Load balancing in more detail

4.1 Concepts and terminology

(server) pool
  a set of servers that are identically configured to provide a service
virtual IP
  the externally-visible IP address that clients contact to access a service (often also associated with a specific TCP port)
session state
  the load balancer's record of which pool member is handling which client request, since each individual request must be routed to a particular pool member
persistence
  some applications (web applications in particular) require multiple transactions per session, so an application identifier (such as a web cookie) is used to keep a client associated with the particular pool member that maintains its application state
scheduling
  how the load balancer chooses a pool member for a new client session (see the sketch after this list). Examples:
  round-robin
    cycle through the pool members in order
  least-connection
    choose the pool member handling the fewest sessions
  least-load
    choose the pool member with the least resource usage
health checks
  the load balancer frequently checks the status of pool members and stops routing traffic to those that appear unresponsive. Checks can be as simple as opening a TCP connection to a service port ("shallow" monitoring) or as involved as issuing a full application request to verify that the application is responsive ("deep" monitoring).
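
The following Python sketch illustrates the ideas above: round-robin and least-connection scheduling, cookie-style persistence, and a shallow TCP health check. It is nothing like a complete load balancer, and the pool hostnames are hypothetical.

  import itertools
  import socket

  POOL = ["app1.example.com", "app2.example.com", "app3.example.com"]

  # round-robin: cycle through the pool members in order
  _next_member = itertools.cycle(POOL)
  def round_robin():
      return next(_next_member)

  # least-connection: pick the member handling the fewest sessions
  active_sessions = {member: 0 for member in POOL}
  def least_connection():
      return min(active_sessions, key=active_sessions.get)

  # persistence: pin a session identifier (e.g. a web cookie) to the
  # pool member chosen for the client's first request
  pinned = {}
  def pick_member(session_id):
      if session_id not in pinned:
          pinned[session_id] = least_connection()
          active_sessions[pinned[session_id]] += 1
      return pinned[session_id]

  # shallow health check: just try opening a TCP connection to the port
  def is_healthy(member, port=80, timeout=2.0):
      try:
          with socket.create_connection((member, port), timeout=timeout):
              return True
      except OSError:
          return False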

4.2 Load balancing failure modes

flail-over
  an active load balancer "flails" back and forth between the members of a redundant pair, usually because of a critical operating system problem or misconfiguration. It is sometimes addressed with STONITH (an acronym for the grotesque metaphor "shoot the other node in the head").
split-brain
  both members of a redundant pair operate simultaneously, usually because of a network partition that prevents them from checking each other's status (see the sketch after this list)
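
A common safeguard against split-brain is a quorum tiebreak: a node stays active only if it can reach either its peer or a third "witness" host, so a node isolated by a network partition steps down instead of fighting its partner. A minimal Python sketch, with assumed hostnames and port:

  import socket

  PEER = ("peer.example.com", 7777)        # the other member of the pair
  WITNESS = ("witness.example.com", 7777)  # a third vantage point

  def reachable(addr, timeout=2.0):
      try:
          with socket.create_connection(addr, timeout=timeout):
              return True
      except OSError:
          return False

  def should_stay_active():
      # If the peer is visible, operate normally. If the peer is gone
      # but the witness is visible, the failure is probably the peer's,
      # so take over. If neither is visible, assume we are the
      # partitioned node and step down rather than risk split-brain.
      return reachable(PEER) or reachable(WITNESS)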

5 General high-availability principles

5.1 never have only one of anything

  • one server
  • one network connection
  • one power supply
  • one copy of valuable data
  • one site where everything is
  • one person who knows how something works

5.2 be more careful about changes

5.2.1 separate test environment

  • a (probably reduced-scale) instance of your infrastructure with (nearly) identical configuration
  • try out changes in noncritical environment
  • promote tested changes to production (as with git branching + puppet environments)
  • adds resource overhead

5.2.2 change control/review

  • other people spot mistakes you don't
  • a different approach might be better
  • adds delay

5.2.3 automated testing

  • perform functional tests on changes before deployment
  • test-driven development – create tests before writing code
  • block deployment of nonworking code (see the example after this list)
  • greater development overhead
  • better deployment confidence
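
As a minimal sketch of such a deployment gate (the service address is an assumption for illustration), a pipeline could run a functional test like this and block promotion unless it exits successfully:

  import socket
  import unittest

  class ServiceSmokeTest(unittest.TestCase):
      def test_service_accepts_connections(self):
          # Shallow functional check: the freshly deployed service must
          # at least accept TCP connections before the change is promoted.
          with socket.create_connection(("test-web.example.com", 80), timeout=5):
              pass

  if __name__ == "__main__":
      unittest.main()  # a nonzero exit status blocks the deployment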

5.2.4 continuous/incremental deployment

Author: Steve VanDevender

Created: 2015-07-27 Mon 14:08
