Monitoring, alerting, and statistics

1. Monitoring
2. Notifications and alerts
3. Monitoring software
- 3.1. SNMP: "Simple" Network Management Protocol
- 3.2. Example: Nagios
4. Statistics
- 4.1. Usage history
- 4.2. Trend analysis
5. Statistics-gathering (and monitoring) software
- 5.1. Example: Cacti
- 5.2. Example: Zabbix

1 Monitoring

1.1 Prehistory

Phone rings. "The mainframe is down!"
Operations staff attended computers at all hours to notice and fix problems
Separate resources for automated monitoring were often unavailable

1.2 Simple early monitoring

scripts ping hosts, connect to services, etc.
scripts running on hosts report status to central system
these ad hoc scripts could eventually become complicated, redundant, and resource-intensive
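
A minimal sketch of the second approach, assuming each host's script reports a timestamp to the central system, which then flags hosts that have gone quiet. The names `stale_hosts` and `max_age` are illustrative and not taken from any particular tool.

```python
def stale_hosts(last_seen, now, max_age=120.0):
    """Return hosts whose most recent report is older than max_age seconds.

    last_seen maps host name -> timestamp of its latest status report.
    """
    return sorted(host for host, ts in last_seen.items() if now - ts > max_age)
```

A host that never reported at all would not appear in `last_seen`, so a real system would also need a list of expected hosts.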

1.3 Monitoring methods

Shallow
- ping host, connect to network port
- detects when a host or service is totally down, but not when it's up yet unusable
Deep: exercise more complex aspects of software
- Fetch a web page, then check for content
- Make a query that tests authentication, database access, etc.
- Notices more subtle application problems, but may not make it easy to diagnose the problem
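
The difference between the two methods can be sketched with the Python standard library. This is an illustration under assumptions, not a production check: the function names, timeouts, and the injectable `fetch` parameter are all hypothetical.

```python
import socket
import urllib.request


def shallow_check(host, port, timeout=2.0):
    """Shallow: succeed if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_url(url, timeout=5.0):
    """Fetch a page body as text (HTTP errors and timeouts raise OSError)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def deep_check(url, expected_text, fetch=fetch_url):
    """Deep: fetch the page and confirm that the expected content is present."""
    try:
        body = fetch(url)
    except (OSError, ValueError):
        return False
    return expected_text in body
```

A web server that accepts connections but serves an error page passes `shallow_check` and fails `deep_check`, which is exactly the case the shallow method misses.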

2 Notifications and alerts

2.1 Quantity and frequency

You don't want to be flooded with messages when lots of things are down
Too-frequent notifications aren't helpful
Being notified repeatedly and indefinitely isn't helpful
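
One way to address both concerns is to throttle notifications per problem. A rough sketch, assuming a minimum interval between repeats and a cap on the number of repeats; the class name and default values are made up for illustration.

```python
class NotificationThrottle:
    """Limit how often, and how many times, each problem is reported."""

    def __init__(self, min_interval=300.0, max_repeats=3):
        self.min_interval = min_interval  # seconds between repeats
        self.max_repeats = max_repeats    # total notifications per problem
        self._last_sent = {}
        self._count = {}

    def should_notify(self, key, now):
        """Return True (and record the send) if a notification is allowed."""
        if self._count.get(key, 0) >= self.max_repeats:
            return False  # already notified often enough
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon after the previous notification
        self._last_sent[key] = now
        self._count[key] = self._count.get(key, 0) + 1
        return True

    def clear(self, key):
        """Forget a problem once it has recovered."""
        self._last_sent.pop(key, None)
        self._count.pop(key, None)
```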

2.2 Relevance

Failure of a central resource might cause lots of other things to (appear to) fail
- Network switches or routers
- File servers
- User or authentication databases
Ideally you want precise notification about specific failure causes
- spend less time diagnosing and more time fixing
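
A small sketch of the idea: given what each item depends on, report only the failures that are not explained by a failed dependency. The function name and the example dependency map are hypothetical.

```python
def root_causes(down, depends_on):
    """Return the down items that have no down dependency of their own.

    down: iterable of items currently failing.
    depends_on: maps each item to the items it directly depends on.
    """
    down = set(down)
    return sorted(
        item for item in down
        if not any(parent in down for parent in depends_on.get(item, ()))
    )


# Example: two services sit behind one switch.
DEPENDS_ON = {"web": ["switch"], "db": ["switch"], "switch": []}
```

With the switch down, `root_causes({"web", "db", "switch"}, DEPENDS_ON)` reports only the switch; with just the web service down, it reports the web service.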

2.3 Methods

Email (lots of stuff likes to use this, even when it probably shouldn't)
Pager, text messaging
Escalation
- start with email to general sysadmin contact address
- email to higher-level contacts (manager, on-call person)
- generate text message to on-call person
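
The escalation steps above could be expressed as a table keyed on how long a problem has gone unacknowledged. The 15- and 30-minute thresholds and the contact labels are assumptions chosen only for illustration.

```python
# (minutes unacknowledged, who gets notified) -- thresholds are hypothetical
ESCALATION_STEPS = [
    (0, "email:sysadmins"),
    (15, "email:manager-and-on-call"),
    (30, "sms:on-call"),
]


def contacts_for(minutes_unacknowledged, steps=ESCALATION_STEPS):
    """Return every contact whose escalation threshold has been reached."""
    return [target for threshold, target in steps
            if minutes_unacknowledged >= threshold]
```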

3 Monitoring software

3.1 SNMP: "Simple" Network Management Protocol

provides general network-accessible framework for host status data
uses UDP for communication
server runs on host and can be queried by other hosts
It has some problems
- Poor security: in SNMPv1 and v2c, access is unprotected or guarded only by community strings sent in cleartext
- effectively usable only on a local network
- complicated numerical MIB tree for structuring data
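
To illustrate the numerical tree: every SNMP value is addressed by an object identifier (OID), a dotted path of integers. The three OIDs below are standard MIB-2 system entries; the helper names are illustrative only, and this sketch does not talk to an SNMP agent.

```python
# Root of the standard MIB-2 subtree.
MIB2 = (1, 3, 6, 1, 2, 1)

# A few well-known scalar objects under the MIB-2 "system" group.
NAMED_OIDS = {
    "sysDescr.0": "1.3.6.1.2.1.1.1.0",
    "sysUpTime.0": "1.3.6.1.2.1.1.3.0",
    "sysName.0": "1.3.6.1.2.1.1.5.0",
}


def parse_oid(text):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def in_subtree(oid, root):
    """Return True if oid lies at or below root in the OID tree."""
    return oid[:len(root)] == root
```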

3.2 Example: Nagios

Provides a general monitoring and alerting framework with a custom configuration language
Server can initiate "active" checks over the network for remote hosts and services
Hosts can run "passive" checks that report to Nagios server
Dependency representation
- generate notifications only about a central resource that is down, rather than about every dependent service
Notification thresholds are configurable
Notifications can be limited, for example by setting a minimum interval between repeats
Escalations can generate higher-priority notifications (alternate email, pager, phone message)
Problems can be acknowledged to stop notifications

4 Statistics

If you're already polling and collecting status information from your hosts, why not store and analyze it?

4.1 Usage history

Look for patterns in downtime, high usage periods, etc.
Determine whether service-level agreements are met
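
As a simple illustration, availability can be estimated from stored check results and compared with an agreed target. This sketch assumes evenly spaced up/down samples; the function names and the default target are hypothetical.

```python
def availability(samples):
    """Fraction of checks that found the service up (samples are booleans)."""
    if not samples:
        raise ValueError("no samples to evaluate")
    return sum(1 for up in samples if up) / len(samples)


def meets_sla(samples, target=0.999):
    """Return True if measured availability reaches the SLA target."""
    return availability(samples) >= target
```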

4.2 Trend analysis

How much of your resources are you using?
Do you have enough?
When will you need more?
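
The last question can be approached by fitting a straight line to past usage and extrapolating. A minimal sketch using ordinary least squares, assuming roughly linear growth; `days_until_full` and its parameters are illustrative names.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used, capacity):
    """Estimate days after the last sample until usage reaches capacity.

    Returns None if usage is flat or shrinking.
    """
    slope, intercept = linear_fit(days, used)
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - days[-1])
```

For example, usage of 10, 20, 30, 40 units on days 0 to 3 with a capacity of 100 gives an estimate of 6 more days.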

5 Statistics-gathering (and monitoring) software

Often there is a lot of overlap between status monitoring and statistics collection

5.1 Example: Cacti

Mainly for statistics gathering
Uses SNMP for data collection
Stores historical database of statistics, provides graphing over custom time scales
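
Graphing over different time scales usually means averaging fine-grained samples into coarser buckets. A rough sketch of that consolidation step; this is not Cacti's actual storage code, and `consolidate` and its parameters are hypothetical.

```python
def consolidate(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Timestamps are integer seconds; each bucket is labeled by its start time.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % bucket_seconds)
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]
```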

5.2 Example: Zabbix

Combined statistics and monitoring tool
Zabbix agent runs on servers and reports to central collection server
"Triggers" provide configurable status notification and alerting

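The idea behind a trigger can be shown with a toy condition over recent values. This is a sketch only and does not use Zabbix's trigger expression syntax; the `Trigger` class, its threshold, and its window size are assumptions.

```python
from collections import deque


class Trigger:
    """Fire when the last `window` values all exceed the threshold."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest values

    def update(self, value):
        """Record a new value and return True if the trigger is firing."""
        self.recent.append(value)
        return (len(self.recent) == self.recent.maxlen
                and all(v > self.threshold for v in self.recent))
```

Requiring several consecutive high values avoids alerting on a single brief spike.
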