
## Measures of center and spread

• Graphs are very good tools for analyzing distributions, but we would also like summary measures that capture the important elements of a distribution.
• We would like a measure for the center of a distribution
• We would like a measure for the spread of a distribution
• Sample of 20 SEI (socioeconomic index) scores for working adults from the 2000 Census: (explain SEI scores)
SEI scores convert occupational categories into a composite measure of prestige, based on income and education.
 27 18 84 44 68 53 51 46 15 44 10 44 22 15 72 73 18 93 68 67

Ordered:
 10 15 15 18 18 22 27 44 44 44 46 51 53 67 68 68 72 73 84 93

Stemplot:

1 | 05588
2 | 27
3 |
4 | 4446
5 | 13
6 | 788
7 | 23
8 | 4
9 | 3

• Review - first let's review some algebra
• x=female babies at birth (98), y=male babies at birth (102)
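• Ratio - one number over another

$$\frac{x}{y} = \frac{98}{102} \approx 0.96$$

• Proportion - number out of total

$$\frac{x}{x+y} = \frac{98}{200} = 0.49$$

• Percent (per 100) - simply multiply the proportion by 100: $0.49 \times 100 = 49\%$
• "Rate" - accounting for exposure in some way. Let's say 2 of the girl babies die in their first year, and 3 of the boy babies die.

$$\frac{2}{98} \approx 0.020 \qquad \frac{3}{102} \approx 0.029$$
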
• A true rate always involves a measure of time in the denominator.
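The four quantities in this review can be checked with a few lines of Python. This is only a sketch using the counts from the example above; the variable names are made up for illustration, and the rate treats each baby as contributing one full year of exposure, which is a simplifying assumption.

```python
# Counts from the review example.
x = 98   # female babies at birth
y = 102  # male babies at birth

ratio = x / y               # one number over another
proportion = x / (x + y)    # number out of total
percent = 100 * proportion  # per 100

# Deaths in the first year, from the example.
female_deaths = 2
male_deaths = 3

# Assumption: every baby is exposed for one full year, so the
# denominator is births * 1 year (a crude measure of exposure time).
years_of_exposure = 1
female_rate = female_deaths / (x * years_of_exposure)
male_rate = male_deaths / (y * years_of_exposure)

print(round(ratio, 3), round(proportion, 2), round(percent, 1))
print(round(female_rate, 4), round(male_rate, 4))
```

Note that the ratio compares the two groups to each other, while the proportion compares one group to the total.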
• Now let's consider some shorthand:
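• summation sign: $\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n$
• subscripting: $x_i$ denotes the $i$th observation, for $i = 1, \ldots, n$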
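
For readers who find code easier to read than notation, here is a rough Python analogue of subscripting and summation. The three values are hypothetical and exist only to illustrate the shorthand.

```python
# Hypothetical small sample, used only to illustrate the notation.
x = [3, 5, 8]
n = len(x)

# Subscripting: x_i is the i-th observation. Python lists start at 0,
# so x_1 in the notation is x[0] in code.
x_1 = x[0]

# Summation: add up x_i for i = 1, 2, ..., n.
total = 0
for i in range(1, n + 1):
    total += x[i - 1]

# The built-in sum() does the same thing in one step.
assert total == sum(x)
```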

• Measures of Center (average - don't use this term)
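• mean - the balancing point of a distribution (draw picture)

$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n}$$

or

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example: $\sum x_i = 932$, $\bar{x} = 932/20 = 46.6$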
• median - the midpoint of a distribution, so that half of the observations are to the left and half to the right. (draw a picture)
1. Sort the observations in ascending order.
2. If $n$ is odd, then the median is the middle value in this ordered list (i.e. the 3rd of 5).
3. If $n$ is even, then find the two middle values and average them.
Example: $n = 20$, the 10th observation is 44, the 11th observation is 46, so the median is $(44 + 46)/2 = 45$

• mode - the most common observation, i.e. "the peak". We have already talked about this in terms of whether the distribution is "unimodal." We will not use the mode very often, but you should know it. (Example=44)
• Have students go through some sample distributions and try to locate these things.
1. symmetric, normal distribution
2. left-skewed
3. right-skewed
4. bimodal, symmetric
• What do you notice? The mean is pulled more by extreme values than the median. In general, the mean is more sensitive to outliers and skewness than the median. (show income distribution)
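
The three measures of center can be computed for the SEI sample with Python's standard `statistics` module. The sketch below also replaces the largest score with a made-up extreme value (930, purely hypothetical) to show how the mean moves while the median does not.

```python
import statistics

# The 20 SEI scores from the sample above.
sei = [27, 18, 84, 44, 68, 53, 51, 46, 15, 44,
       10, 44, 22, 15, 72, 73, 18, 93, 68, 67]

mean = statistics.mean(sei)      # 932 / 20 = 46.6
median = statistics.median(sei)  # average of 44 and 46 = 45.0
mode = statistics.mode(sei)      # 44 appears three times

# Hypothetical outlier: swap the maximum (93) for 930.
with_outlier = [930 if v == 93 else v for v in sei]
mean_out = statistics.mean(with_outlier)      # 1769 / 20 = 88.45
median_out = statistics.median(with_outlier)  # still 45.0

print(mean, median, mode)
print(mean_out, median_out)
```

A single extreme value nearly doubles the mean here, while the median is unchanged because only the order of the middle observations matters.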

• Measures of Spread - choice of measure depends on choice of measure of center
• Using Median
• We could measure the range - the distance between the smallest and largest point, but this would not be very useful because of outliers.
• Better to make use of the quantiles (or percentiles) of the distribution. The median is basically the 50th percentile of the distribution.
1. Sort the observations so they are in ascending order.
2. For the $p$th percentile, find the position in the ordered list below which $p$ percent of the observations fall.

3. For empirical distributions, you will often have to find the closest value rather than the exact value.
4. This value is the $p$th percentile of the distribution.
5. We are often interested in the quartiles of the distribution: the 25th, 50th, and 75th percentiles.
Example: 25th percentile = 20 (midway between the 5th and 6th observations, 18 and 22), 75th percentile = 68 (the 15th and 16th observations are both 68)
6. Taking the difference between the 75th and 25th percentiles gives us the interquartile range (IQR), a measure of the spread of the distribution that is less affected by outliers than the range.
Example: IQR $= 68 - 20 = 48$
• Using the quartiles together with the minimum and maximum, we can now get a five-number summary of the distribution:
| Minimum | 25th | Median | 75th | Maximum |
|---|---|---|---|---|
| 10 | 20 | 45 | 68 | 93 |
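
The five-number summary can be reproduced in Python. Software packages use several slightly different conventions for quartiles; the sketch below assumes the "median of each half" convention, chosen here because it reproduces the values 20 and 68 shown in the notes. The function name `five_number_summary` is made up for this example.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, leave the middle observation out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

sei = [27, 18, 84, 44, 68, 53, 51, 46, 15, 44,
       10, 44, 22, 15, 72, 73, 18, 93, 68, 67]

minimum, q1, median, q3, maximum = five_number_summary(sei)
iqr = q3 - q1
print(minimum, q1, median, q3, maximum, iqr)
```

Other conventions (for example, those that interpolate between neighboring observations) can give slightly different quartiles for the same data, so it is worth checking which rule a given package uses.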
• We can use this five-number summary to produce another important kind of graph called the boxplot. (go through steps with the example)
1. On the y-axis, draw a box which extends from the 25th to the 75th percentile.
2. Put a line through the box to indicate the median.
3. Put whiskers extending from the box to the minimum and maximum, unless these values lie farther than a rule-of-thumb distance (1.5 x IQR) from the box. In that case, end the whisker at the most extreme observation within that distance.
4. Plot points beyond the rule-of-thumb distance individually.
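
The whisker and outlier rule can be sketched as follows. This assumes the common convention of measuring the 1.5 x IQR distance from the edges of the box (the quartiles), and it takes the quartiles 20 and 68 from the five-number summary as given. The function name and the extra value 300 are hypothetical.

```python
def whisker_ends_and_outliers(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers).

    Assumption: fences sit k * IQR beyond the box edges q1 and q3.
    """
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values
                      if v < lower_fence or v > upper_fence)
    # Whiskers stop at the most extreme observations inside the fences.
    return min(inside), max(inside), outliers

sei = [27, 18, 84, 44, 68, 53, 51, 46, 15, 44,
       10, 44, 22, 15, 72, 73, 18, 93, 68, 67]

# Fences are 20 - 72 = -52 and 68 + 72 = 140, so nothing is flagged.
low, high, outliers = whisker_ends_and_outliers(sei, q1=20, q3=68)
print(low, high, outliers)

# Adding a hypothetical score of 300 puts one point beyond the fence.
low2, high2, outliers2 = whisker_ends_and_outliers(sei + [300], q1=20, q3=68)
print(low2, high2, outliers2)
```

For the SEI sample the whiskers simply run to the minimum and maximum, because no observation lies beyond the fences.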
• Using Mean
• When spread is measured relative to the mean, we use the variance ($s^2$) and its square root, the standard deviation ($s$).

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{s^2}$$

• Go through this one step at a time
1. How large the values of $x_i - \bar{x}$ are on average gives us some measure of how spread out the observations are around the mean. But we can't simply average these values, because they always sum to zero (show this).
2. So first we square these values, so that none of them are negative.
3. Then we sum them up.
4. We want some measure of "average" squared distance from the mean, so we divide by the number of observations. However, we have to subtract one from this number first: because we used the mean to calculate the deviations, we have one less degree of freedom.
5. This gives the variance, but its units are now squared because of the squared term earlier, so we take the square root to get a measure which is in the same units as the mean.
6. Go through example
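| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 27 | -19.6 | 384.16 |
| 18 | -28.6 | 817.96 |
| 84 | 37.4 | 1398.76 |
| 44 | -2.6 | 6.76 |
| 68 | 21.4 | 457.96 |
| 53 | 6.4 | 40.96 |
| 51 | 4.4 | 19.36 |
| 46 | -0.6 | 0.36 |
| 15 | -31.6 | 998.56 |
| 44 | -2.6 | 6.76 |
| 10 | -36.6 | 1339.56 |
| 44 | -2.6 | 6.76 |
| 22 | -24.6 | 605.16 |
| 15 | -31.6 | 998.56 |
| 72 | 25.4 | 645.16 |
| 73 | 26.4 | 696.96 |
| 18 | -28.6 | 817.96 |
| 93 | 46.4 | 2152.96 |
| 68 | 21.4 | 457.96 |
| 67 | 20.4 | 416.16 |

The squared deviations sum to 12268.8, so the variance is $s^2 = 12268.8/19 \approx 645.73$ and the standard deviation is $s \approx 25.41$.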
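
The same steps can be traced in Python. This is a sketch that follows the numbered steps one at a time rather than calling a library routine directly; the variable names are illustrative, and `statistics.variance` is used only as a cross-check at the end.

```python
import math
import statistics

sei = [27, 18, 84, 44, 68, 53, 51, 46, 15, 44,
       10, 44, 22, 15, 72, 73, 18, 93, 68, 67]
n = len(sei)
mean = sum(sei) / n  # 46.6

# Step 1: deviations from the mean sum to zero (up to rounding error).
deviations = [x - mean for x in sei]
sum_of_deviations = sum(deviations)

# Steps 2 and 3: square the deviations, then add them up.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)  # about 12268.8

# Step 4: divide by n - 1 rather than n.
variance = sum_of_squares / (n - 1)  # about 645.73

# Step 5: take the square root to return to the original units.
sd = math.sqrt(variance)  # about 25.41

# Cross-check against the standard library's sample variance.
assert abs(variance - statistics.variance(sei)) < 1e-9
print(round(variance, 2), round(sd, 2))
```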

• Like the mean, the standard deviation $s$ is sensitive to outliers and skewness.

Aaron 2005-12-20