Monitoring, alerting, and statistics

1 Monitoring

1.1 Prehistory

  • Phone rings. "The mainframe is down!"
  • Operations staff attended computers at all hours to notice and fix problems
  • Separate machines to run automated monitoring might not have been available

1.2 Simple early monitoring

  • scripts ping hosts, connect to services, etc.
  • scripts running on hosts report status to central system
  • a growing collection of such scripts could eventually become complicated, redundant, resource-intensive

1.3 Monitoring methods

  • Shallow
    • ping host, connect to network port
    • detects when a host or service is totally down, but not when it's unusable
  • Deep: exercise more complex aspects of software
    • Fetch a web page, then check for content
    • Make a query that tests authentication, database access, etc.
    • Notices more subtle application problems, but may not make it easy to diagnose the problem
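
The difference between shallow and deep checks can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names, timeouts, and the idea of matching a fixed string in the page are assumptions, not part of any particular monitoring tool.

```python
import http.client
import socket
import urllib.request


def shallow_check(host, port, timeout=3.0):
    """Shallow check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name lookup failed.
        return False


def deep_check(url, expected_text, timeout=5.0):
    """Deep check: True if the page loads and contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Network errors, HTTP error statuses, malformed URLs, bad replies.
        return False
    return expected_text in body
```

A service can pass shallow_check while failing deep_check, for example when the web server accepts connections but the application behind it returns an error page. Note that deep_check only reports that something is wrong, not which component caused it.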

2 Notifications and alerts

2.1 Quantity and frequency

  • You don't want to be flooded with messages when lots of things are down
  • Too-frequent notifications aren't helpful
  • Being notified repeatedly and indefinitely isn't helpful
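
One way to address these points is to throttle notifications per problem. The sketch below is a hypothetical helper, not taken from any real tool: it enforces a minimum interval between messages about the same problem and a cap on how many are sent in total. The class name, parameters, and the use of plain numbers for time are all assumptions.

```python
class NotificationThrottle:
    """Limit how often and how many times each problem is reported."""

    def __init__(self, min_interval, max_repeats):
        self.min_interval = min_interval  # minimum time between messages
        self.max_repeats = max_repeats    # total messages allowed per problem
        self._state = {}                  # problem id -> (last sent, count)

    def should_notify(self, problem_id, now):
        """Return True (and record the send) if a message may go out now."""
        last_sent, count = self._state.get(problem_id, (None, 0))
        if count >= self.max_repeats:
            return False  # already notified enough times
        if last_sent is not None and now - last_sent < self.min_interval:
            return False  # too soon after the previous message
        self._state[problem_id] = (now, count + 1)
        return True

    def clear(self, problem_id):
        """Forget a problem once it recovers, so a new failure notifies again."""
        self._state.pop(problem_id, None)
```

With min_interval=300 and max_repeats=2, a host that stays down produces at most two messages, at least 300 time units apart, instead of one per check.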

2.2 Relevance

  • Failure of a central resource might cause lots of other things to (appear to) fail
    • Network switches or routers
    • File servers
    • User or authentication databases
  • Ideally you want precise notification about specific failure causes
    • spend less time diagnosing and more time fixing
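
A minimal sketch of how a dependency map can narrow notifications to likely causes. The item names and the depends_on structure are made up for illustration, and the function only looks at direct dependencies.

```python
def root_causes(down, depends_on):
    """Return failed items none of whose direct dependencies have failed.

    down: iterable of item names currently failing their checks.
    depends_on: dict mapping an item to the items it directly depends on.
    """
    down = set(down)
    return sorted(
        item
        for item in down
        if not any(dep in down for dep in depends_on.get(item, ()))
    )


# Hypothetical example: everything hangs off one switch.
depends_on = {
    "web": ["fileserver", "authdb"],
    "fileserver": ["switch"],
    "authdb": ["switch"],
}
```

If the switch, file server, authentication database, and web service all appear down at once, root_causes reports only the switch, which is where diagnosis should start.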

2.3 Methods

  • Email (lots of stuff likes to use this, even when it probably shouldn't)
  • Pager, text messaging
  • Escalation
    • start with email to general sysadmin contact address
    • email to higher-level contacts (manager, on-call person)
    • generate text message to on-call person

3 Monitoring software

3.1 SNMP: "Simple" Network Management Protocol

  • provides general network-accessible framework for host status data
  • uses UDP for communication
  • an agent (server) runs on each host and can be queried by other hosts
  • It has some problems
    • Poor security: unprotected, or community strings (passwords) sent in cleartext in SNMPv1 and v2c
    • effectively usable only on a local network
    • complicated numerical MIB tree for structuring data

3.2 Example: Nagios

  • Provides a general monitoring and alerting framework with a custom configuration language
  • Server can initiate "active" checks over the network for remote hosts and services
  • Hosts can run "passive" checks that report to Nagios server
  • Dependency representation
    • generate notifications only about a central resource that is down rather than all dependent services
  • Notification thresholds are configurable
  • Notifications can be limited (for example, by a minimum interval between repeats)
  • Escalations can generate higher-priority notifications (alternate email, pager, phone message)
  • Problems can be acknowledged to stop notifications

4 Statistics

If you're already polling and collecting status information from your hosts, why not store and analyze it?

4.1 Usage history

  • Look for patterns in downtime, high usage periods, etc.
  • Determine whether service-level agreements are met
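
As a small illustration of checking a service-level agreement against stored history, the sketch below computes availability from a list of recorded outages. The function names and the representation of outages as (start, end) pairs in consistent time units are assumptions; outages are assumed not to overlap.

```python
def availability_percent(outages, period_start, period_end):
    """Percentage of the period during which the service was up.

    outages: list of (start, end) pairs, assumed non-overlapping,
    in the same time units as period_start and period_end.
    """
    period = period_end - period_start
    if period <= 0:
        raise ValueError("period_end must be after period_start")
    downtime = 0
    for start, end in outages:
        # Count only the part of each outage inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            downtime += clipped_end - clipped_start
    return 100.0 * (period - downtime) / period


def meets_sla(outages, period_start, period_end, target_percent):
    """True if measured availability reaches the agreed target."""
    return availability_percent(outages, period_start, period_end) >= target_percent
```

For a 30-day period measured in seconds (2,592,000 seconds), a single one-hour outage gives roughly 99.86% availability, which would meet a 99.5% target but miss a 99.9% one.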

4.2 Trend analysis

  • How much of your resources are you using?
  • Do you have enough?
  • When will you need more?
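
The "when will you need more?" question can be approximated by fitting a straight line to past usage and extrapolating. This sketch uses an ordinary least-squares fit; assuming linear growth is a simplification, and the function names and sample format are hypothetical.

```python
def linear_fit(samples):
    """Least-squares line through (time, usage) samples.

    Returns (slope, intercept).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t
    return slope, mean_u - slope * mean_t


def time_when_full(samples, capacity):
    """Projected time at which usage reaches capacity, or None if not growing."""
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None  # usage is flat or shrinking; no projected exhaustion
    return (capacity - intercept) / slope
```

With disk usage samples of 100, 110, 120, and 130 GB at weeks 0 through 3 and a 200 GB disk, the projection reaches capacity at week 10.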

5 Statistics-gathering (and monitoring) software

Often there is a lot of overlap between status monitoring and statistics collection.

5.1 Example: Cacti

  • Mainly for statistics gathering
  • Uses SNMP for data collection
  • Stores historical database of statistics, provides graphing over custom time scales

5.2 Example: Zabbix

  • Combined statistics and monitoring tool
  • Zabbix agent runs on servers and reports to central collection server
  • "Triggers" provide configurable status notification and alerting

Author: Steve VanDevender

Created: 2017-08-02 Wed 14:43

Emacs 24.5.1 (Org mode 8.2.10)
