Assignment for Week 6: Monitoring, Measurement, and Notification

Description

Even when using configuration management and high-availability techniques like load balancing, there are situations when your systems fail and manual effort is required to maintain or restore service (although ideally you also figure out how to improve your design or automate the recovery process). It's also important to ensure that system capacity is able to meet request load. That means that in a comprehensive system design, you should:

- Monitor systems and software for indications of failure
- Notify system administrators when important events occur
- Measure system performance to determine whether you are providing adequate capacity

In this assignment you'll explore the monitoring and notification options for EC2 and the metrics gathered by Amazon's "CloudWatch" service. You need only "Basic monitoring" for this assignment; please do not enable "Detailed monitoring" (an extra-cost feature).

Relevant documentation: Monitoring Amazon EC2

What you need to do

Select one or more of your instances and look at the "Monitoring" tab. You should see a number of graphs of instance statistics. Look over those graphs to get an idea of each instance's behavior in these categories:
- CPU utilization
- Disk Reads (bytes)
- Disk Writes (bytes)
- Network In
- Network Out
- Status Check Failed (Instance)
- Status Check Failed (System)
Create alarms based on these statistics to notify you of exceptional usage in all of these areas for each of your instances. Note that you need to consider the specific units and values for metrics when crafting the alarm threshold. Please create a single "topic" for your team based on your team name to use for all notifications, and list email addresses of all your team members as recipients. Recipients will receive a confirmation email from AWS that needs to be acknowledged before they will receive notifications.
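
The console is all you need for this step, but if you would rather script it, here is a minimal sketch of the same idea using the boto3 library. It is not a required or official method. The helper names, the alarm naming scheme, and every threshold value are illustrative assumptions; pick thresholds that make sense for each metric's units (percent for CPU, bytes for disk and network, a count for status checks). The 300-second period matches the five-minute granularity of basic monitoring.

```python
# Sketch only: helper names, alarm names, and thresholds are illustrative assumptions.

# The seven EC2 metrics listed above, as named in the "AWS/EC2" namespace.
# Each entry: (MetricName, unit of the metric, statistic, example threshold).
EXAMPLE_METRICS = [
    ("CPUUtilization", "Percent", "Average", 50.0),
    ("DiskReadBytes", "Bytes", "Sum", 10_000_000.0),
    ("DiskWriteBytes", "Bytes", "Sum", 10_000_000.0),
    ("NetworkIn", "Bytes", "Sum", 50_000_000.0),
    ("NetworkOut", "Bytes", "Sum", 50_000_000.0),
    ("StatusCheckFailed_Instance", "Count", "Maximum", 1.0),
    ("StatusCheckFailed_System", "Count", "Maximum", 1.0),
]


def build_alarm_params(instance_id, topic_arn, metric, unit, statistic, threshold):
    """Return keyword arguments for CloudWatch put_metric_alarm (no AWS call made)."""
    return {
        "AlarmName": f"{instance_id}-{metric}",
        # The unit is recorded in the description as a reminder of what
        # the threshold means; it is not sent as a separate request field.
        "AlarmDescription": f"{metric} >= {threshold} {unit}",
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": statistic,
        "Period": 300,  # seconds; basic monitoring reports every 5 minutes
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # notify the team topic when the alarm fires
    }


def build_all_alarms(instance_id, topic_arn):
    """Build one alarm definition per metric for a single instance."""
    return [build_alarm_params(instance_id, topic_arn, *m) for m in EXAMPLE_METRICS]


def create_topic_and_alarms(team_name, emails, instance_ids):
    """Create the team topic, subscribe each address, and create all alarms.

    Requires boto3 and configured AWS credentials; imported here so the
    functions above can be run and inspected without either.
    """
    import boto3

    sns = boto3.client("sns")
    cloudwatch = boto3.client("cloudwatch")
    topic_arn = sns.create_topic(Name=team_name)["TopicArn"]
    for address in emails:
        # Each recipient still has to confirm the subscription by email.
        sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=address)
    for instance_id in instance_ids:
        for params in build_all_alarms(instance_id, topic_arn):
            cloudwatch.put_metric_alarm(**params)
    return topic_arn
```

Calling build_all_alarms with an instance ID and a topic ARN lets you review the seven alarm definitions before anything is sent to AWS; create_topic_and_alarms is the only function that actually changes your account.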
Try to trigger at least three of these alarms. For example, for CPU utilization, you could run a CPU-intensive program (even just an infinite loop) for enough time to trigger a notification. It may help to set the alarm thresholds artificially low, at least temporarily, to make it easier to trigger alarms without creating excessive resource usage.
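
As one possible way to generate CPU load, the sketch below runs a busy loop for a fixed amount of time and then stops, which is safer than a truly infinite loop you have to remember to kill. The function name burn_cpu and the duration you pass to it are assumptions for illustration; any CPU-bound program would do.

```python
import time


def burn_cpu(seconds):
    """Keep one CPU core busy for roughly `seconds`, then return.

    Returns the number of loop iterations completed, which is only
    useful as a sanity check that the loop actually ran.
    """
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work; the goal is simply to stay busy
    return iterations
```

On an instance, something like burn_cpu(600) should keep one core busy across at least one full five-minute reporting period. On a multi-core instance, a single loop may not push the average CPU utilization past your threshold, so you may need to run one copy per core or lower the threshold.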

What to turn in

Create a subdirectory named "week6" in your team git repository and put the files containing your hand-in materials in it.

Provide samples of notification emails that you get when triggering alarms.

Material for all of the above should be checked into your team git repository by class time on Monday, August 5. For an individual team member to receive credit for the assignment, they must have made at least one commit.

Class presentation/discussion

On Monday, August 5 we will take some time in class to have each group speak briefly about their experience with this assignment.

Evaluation

I will check that all of your instances have all of the required alarms enabled (7 for each instance). I will also check that you were able to receive at least three notifications by triggering alarms.

Steve VanDevender