Assignment for Week 6: Monitoring, Measurement, and Notification
Description
Even when using configuration management and high-availability
techniques like load balancing, there are situations when your
systems fail and manual effort is required to maintain or restore
service (although ideally you also figure out how to improve your
design or automate the recovery process). It's also important to
ensure that system capacity is able to meet request load. That
means that in a comprehensive system design, you should:
- Monitor systems and software for indications of failure
- Notify system administrators when important events occur
- Measure system performance to determine whether you are
providing adequate capacity
In this assignment you'll explore EC2 monitoring and notification
options and the metrics gathered by Amazon's "CloudWatch" system.
Note that you need only "Basic monitoring" for this assignment;
please do not enable "Detailed monitoring" (an extra-cost feature).
Relevant documentation:
Monitoring Amazon EC2
What you need to do
- Select one or more of your instances and look at the
"Monitoring" tab. You should see a number of graphs of instance
statistics. Look over those graphs to get an idea of each
instance's behavior in these categories:
- CPU utilization
- Disk Reads (bytes)
- Disk Writes (bytes)
- Network In
- Network Out
- Status Check Failed (Instance)
- Status Check Failed (System)
- Create alarms based on these statistics to notify you of
exceptional usage in all of these areas for each of your instances.
Note that you need to consider the specific units and values for
metrics when crafting the alarm threshold. Please create a single
"topic" for your team based on your team name to use for all
notifications, and list email addresses of all your team members as
recipients. Each recipient will receive a confirmation email from
AWS and must confirm the subscription before receiving
notifications.
- Try to trigger at least three of these alarms. For example, for
CPU utilization, you could run a CPU-intensive program (even just an
infinite loop) for enough time to trigger a notification. It may
help to set the alarm thresholds artificially low, at least
temporarily, to make it easier to trigger alarms without creating
excessive resource usage.
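The alarm and trigger steps above can be sketched in Python as
follows. This is only an illustration under stated assumptions, not
a required approach: the instance ID, topic ARN, and every threshold
are made-up placeholders, and the helper names alarm_specs and
burn_cpu are hypothetical. The metric names are the CloudWatch
AWS/EC2 names behind the seven console graphs, and the dictionary
keys are intended to mirror the parameters of CloudWatch's
PutMetricAlarm API (for example, boto3's put_metric_alarm); verify
that mapping against the documentation before relying on it. The
code itself makes no AWS calls.

```python
import time

# (metric name, unit, statistic, placeholder threshold).
# Thresholds are illustrative guesses only; choose values that make
# sense for the unit of each metric and for your own instances.
METRICS = [
    ("CPUUtilization", "Percent", "Average", 80.0),
    ("DiskReadBytes", "Bytes", "Sum", 100_000_000),
    ("DiskWriteBytes", "Bytes", "Sum", 100_000_000),
    ("NetworkIn", "Bytes", "Sum", 100_000_000),
    ("NetworkOut", "Bytes", "Sum", 100_000_000),
    ("StatusCheckFailed_Instance", "Count", "Maximum", 1),
    ("StatusCheckFailed_System", "Count", "Maximum", 1),
]


def alarm_specs(instance_id, topic_arn, period=300):
    """Return one alarm definition (a plain dict) per metric."""
    return [
        {
            "AlarmName": f"{instance_id}-{metric}",
            "AlarmDescription": f"{metric} {statistic} >= {threshold} {unit}",
            "Namespace": "AWS/EC2",
            "MetricName": metric,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": statistic,
            "Period": period,  # basic monitoring reports every 5 minutes
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [topic_arn],  # the team's single topic
        }
        for metric, unit, statistic, threshold in METRICS
    ]


def burn_cpu(seconds):
    """Spin in a busy loop for about `seconds`; return iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Calling alarm_specs once per instance yields the seven definitions
to enter in the console or pass to a script. Because burn_cpu stops
on its own, it is a safer way to load the CPU than an unbounded
loop; the run should probably span at least one full monitoring
period, with the CPU threshold temporarily lowered, for the alarm to
fire.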
What to turn in
Create a subdirectory in your team git repository with the name
"week6" and create files containing your handin materials under
that.
- Provide samples of notification emails that you get when
triggering alarms.
Material for all of the above should be checked into your team
git repository by class time on Monday, August 5. For an individual
team member to receive credit for the assignment, they must have
made at least one commit.
Class presentation/discussion
On Monday, August 5 we will take some time in class to have each
group speak briefly about their experience with this assignment.
Evaluation
I will check that all of your instances have all the required
alarms enabled (7 for each instance). I will also check that you
were able to receive at least three notifications by triggering
alarms.
Steve VanDevender