Monitoring, alerting, and statistics
1 Monitoring
1.1 Prehistory
- Phone rings. "The mainframe is down!"
- Operations staff attended computers at all hours to notice and fix problems
- Separate machines for automated monitoring might not have been available
1.2 Simple early monitoring
- scripts ping hosts, connect to services, etc.
- scripts running on hosts report status to central system
- collections of such scripts could eventually become complicated, redundant, and resource-intensive
1.3 Monitoring methods
- Shallow: check only basic reachability
  - ping host, connect to network port
  - detects when a host or service is totally down, but not when it is up yet unusable
- Deep: exercise more complex aspects of software
  - Fetch a web page, then check for content
  - Make a query that tests authentication, database access, etc.
  - Notices more subtle application problems, but may not make it easy to diagnose the problem
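
The two kinds of check can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular monitoring tool does it. The function names `shallow_check` and `deep_check` and the default timeouts are assumptions made for the example.

```python
import http.client
import socket
import urllib.request


def shallow_check(host, port, timeout=3.0):
    """Shallow: succeed if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def deep_check(url, expected_text, timeout=5.0):
    """Deep: fetch a page and succeed only if it contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection failures, HTTP error statuses, and bad URLs.
        return False
    return expected_text in body
```

A web server that accepts connections but serves an error page would pass `shallow_check(host, 80)` and fail `deep_check(url, "some expected text")`, which is exactly the gap between the two methods.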
2 Notifications and alerts
2.1 Quantity and frequency
- You don't want to be flooded with messages when lots of things are down
- Too-frequent notifications aren't helpful
- Being notified repeatedly and indefinitely isn't helpful
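
One way to address both concerns is to throttle notifications per problem. The sketch below is only an illustration: the class name `NotificationThrottle`, the five-minute interval, and the cap of three messages are assumptions, not values from any real tool.

```python
import time


class NotificationThrottle:
    """Limit how often, and how many times, one problem is announced."""

    def __init__(self, min_interval=300.0, max_notifications=3,
                 clock=time.monotonic):
        self.min_interval = min_interval            # seconds between messages
        self.max_notifications = max_notifications  # cap per ongoing problem
        self.clock = clock
        self._state = {}  # problem key -> (time last sent, number sent)

    def should_notify(self, key):
        """Return True if a message about this problem may be sent now."""
        now = self.clock()
        last_sent, count = self._state.get(key, (None, 0))
        if count >= self.max_notifications:
            return False  # stop repeating indefinitely
        if last_sent is not None and now - last_sent < self.min_interval:
            return False  # too soon after the previous message
        self._state[key] = (now, count + 1)
        return True

    def clear(self, key):
        """Forget a problem once it recovers, so a new failure alerts again."""
        self._state.pop(key, None)
```

The monitoring loop would call `should_notify("web1/http")` each time a check fails and `clear("web1/http")` when the check succeeds again.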
2.2 Relevance
- Failure of a central resource might cause lots of other things to (appear to) fail
  - Network switches or routers
  - File servers
  - User or authentication databases
- Ideally you want precise notification about specific failure causes
  - spend less time diagnosing and more time fixing
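
If the monitoring system knows what depends on what, it can report only the failures that are not explained by another failure. A minimal sketch, assuming a hand-written dependency map; the host names and the function name `root_causes` are hypothetical.

```python
def root_causes(failed, depends_on):
    """Return the failed items that have no failed direct dependency."""
    failed = set(failed)
    return sorted(
        item for item in failed
        if not any(dep in failed for dep in depends_on.get(item, ()))
    )


# Hypothetical site: two servers behind one switch, a wiki that needs both.
depends_on = {
    "web1": ["switch1"],
    "db1": ["switch1"],
    "wiki": ["web1", "db1", "ldap"],
}

print(root_causes(["switch1", "web1", "db1", "wiki"], depends_on))
# ['switch1']
```

Four things look down, but only the switch is worth a notification.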
2.3 Methods
- Email (lots of stuff likes to use this, even when it probably shouldn't)
- Pager, text messaging
- Escalation: widen and strengthen notification while a problem stays unresolved
  - start with email to general sysadmin contact address
  - then email to higher-level contacts (manager, on-call person)
  - finally generate text message to on-call person
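
An escalation policy like this is essentially a table of delays and contacts. The sketch below is illustrative only: the delays (0, 15, and 30 minutes) and the addresses are made-up placeholders.

```python
# (minutes the problem has gone unacknowledged, method, target)
ESCALATION_STEPS = [
    (0, "email", "sysadmin@example.org"),
    (15, "email", "oncall@example.org"),
    (30, "sms", "on-call phone"),
]


def contacts_due(minutes_unacknowledged, steps=ESCALATION_STEPS):
    """Return every (method, target) whose delay has already elapsed."""
    return [
        (method, target)
        for after, method, target in steps
        if minutes_unacknowledged >= after
    ]


print(contacts_due(20))
# [('email', 'sysadmin@example.org'), ('email', 'oncall@example.org')]
```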
3 Monitoring software
3.1 SNMP: "Simple" Network Management Protocol
- provides general network-accessible framework for host status data
- uses UDP for communication
- server runs on host and can be queried by other hosts
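- It has some problems
  - Poor security in SNMPv1 and v2c: unprotected, or guarded only by "community string" passwords sent in cleartext (SNMPv3 adds authentication and encryption)
  - effectively usable only on a trusted local network
  - complicated numerical MIB tree for structuring data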
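
To give a feel for the numerical MIB tree: every value is addressed by a dotted object identifier (OID). The sketch below maps a few standard MIB-2 OIDs to their names; it does no network I/O, and the helper `oid_to_name` is a hypothetical illustration rather than part of any SNMP library.

```python
# A handful of standard MIB-2 objects, keyed by numeric OID.
KNOWN_OIDS = {
    "1.3.6.1.2.1.1.1": "sysDescr",
    "1.3.6.1.2.1.1.3": "sysUpTime",
    "1.3.6.1.2.1.1.5": "sysName",
    "1.3.6.1.2.1.2.2.1.10": "ifInOctets",
}


def oid_to_name(oid):
    """Replace the longest known numeric prefix of an OID with its name."""
    parts = oid.strip(".").split(".")
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in KNOWN_OIDS:
            suffix = ".".join(parts[end:])
            return KNOWN_OIDS[prefix] + ("." + suffix if suffix else "")
    return oid  # nothing known about this branch of the tree


print(oid_to_name("1.3.6.1.2.1.1.3.0"))        # sysUpTime.0
print(oid_to_name("1.3.6.1.2.1.2.2.1.10.2"))   # ifInOctets.2
```

The trailing number selects an instance, for example the second network interface in `ifInOctets.2`.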
3.2 Example: Nagios
- Provides a general monitoring and alerting framework with a custom configuration language
- Server can initiate "active" checks over the network for remote hosts and services
- Hosts can run "passive" checks that report to Nagios server
- Dependency representation
  - generate notifications only about a central resource that is down rather than all dependent services
- Notification thresholds are configurable
- Notifications can be limited (for example by repeat interval or time period)
- Escalations can generate higher-priority notifications (alternate email, pager, phone message)
- Problems can be acknowledged to stop notifications
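
Nagios checks are small external programs ("plugins") that print one line of status and signal the result through their exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The following is a rough sketch of the logic of a disk-usage check in that style. The 80% and 90% thresholds and the function names are assumptions for illustration.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit code, label) using two thresholds."""
    if percent_used >= crit:
        return CRITICAL, "CRITICAL"
    if percent_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit code, status line) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    code, label = classify(percent, warn, crit)
    return code, f"DISK {label} - {percent:.1f}% used on {path}"
```

A complete plugin would print the status line and then call `sys.exit` with the code, so that Nagios can decide whether the configured notification threshold has been crossed.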
4 Statistics
If you're already polling and collecting status information from your hosts, why not store and analyze it?
4.1 Usage history
- Look for patterns in downtime, high usage periods, etc.
- Determine whether service-level agreements are met
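
With stored check results, checking a service-level agreement is simple arithmetic. A minimal sketch, assuming one boolean per check and a hypothetical 99.9% availability target:

```python
def availability(check_results):
    """Fraction of checks that succeeded (check_results: list of booleans)."""
    if not check_results:
        raise ValueError("no check results")
    return sum(check_results) / len(check_results)


def meets_sla(check_results, target=0.999):
    """True if measured availability reaches the target fraction."""
    return availability(check_results) >= target


# One check per minute for a day (1440 checks), two of which failed.
day = [True] * 1438 + [False] * 2
print(meets_sla(day))  # False: 1438/1440 is about 99.86%
```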
4.2 Trend analysis
- How much of your resources are you using?
- Do you have enough?
- When will you need more?
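
The last question can be estimated by fitting a straight line to stored usage data and extrapolating. This is a deliberately naive sketch using an ordinary least-squares fit. Real usage is rarely this linear, and the sample numbers are invented.

```python
def linear_fit(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_gb, capacity_gb):
    """Days after the last sample until the fitted line reaches capacity."""
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking; no projected fill date
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Invented samples: disk grows 2 GB per day on a 200 GB volume.
print(days_until_full([0, 1, 2, 3, 4], [100, 102, 104, 106, 108], 200))
# 46.0
```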
5 Statistics-gathering (and monitoring) software
Often there is a lot of overlap between status monitoring and statistics collection
5.1 Example: Cacti
- Mainly for statistics gathering
- Uses SNMP for data collection
- Stores historical database of statistics, provides graphing over custom time scales
5.2 Example: Zabbix
- Combined statistics and monitoring tool
- Zabbix agent runs on servers and reports to central collection server
- "Triggers" provide configurable status notification and alerting
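
The idea behind a trigger can be sketched generically: evaluate a condition over recent collected values and react only when the state changes. The code below is not Zabbix's trigger expression syntax. The class `AverageTrigger`, its window size, and the sample values are assumptions for illustration.

```python
from collections import deque


class AverageTrigger:
    """Fire when the average of the last few samples crosses a threshold."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.in_problem = False

    def add_sample(self, value):
        """Record a sample; return 'PROBLEM' or 'OK' on a change, else None."""
        self.samples.append(value)
        average = sum(self.samples) / len(self.samples)
        now_problem = average > self.threshold
        if now_problem == self.in_problem:
            return None  # no state change, so nothing to announce
        self.in_problem = now_problem
        return "PROBLEM" if now_problem else "OK"


trigger = AverageTrigger(threshold=5.0, window=3)
print([trigger.add_sample(v) for v in [1, 9, 9, 9, 1, 1]])
# [None, None, 'PROBLEM', None, None, 'OK']
```

Averaging over a window keeps a single spike from raising an alert, and reporting only state changes avoids repeated messages while the problem persists.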