Next: Contingency tables and Chi-squared
Up: Data Collection and Statistical
Previous: Errors in hypothesis testing:
Contents
 Problem with previous examples
 I played a trick on you: we don't usually know the population standard deviation $\sigma$.
 Therefore, we can't compute the $z$-scores we used in our hypothesis test.
 Normally, we replace $\sigma$ with our sample estimate $s$.
 This is problematic because $s$ is itself an estimate, and thus there will be greater uncertainty in our test.
 In fact, when we replace $\sigma$ with $s$, our test statistic no longer has the standard normal distribution (draw a picture to remind them), but instead follows what is known as the t-distribution.
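The heavier tails of the t-distribution can be checked numerically. This sketch (assuming SciPy is available) compares the probability of seeing a test statistic above 2.0 under the t-distribution versus the standard normal:

```python
# Compare tail probabilities: the t-distribution has heavier tails than the
# standard normal, so extreme test statistics are more likely when sigma
# must be estimated from the sample.
from scipy import stats

for df in (5, 20, 100):
    p_t = stats.t.sf(2.0, df)   # P(T > 2.0) for t with df degrees of freedom
    p_z = stats.norm.sf(2.0)    # P(Z > 2.0) for the standard normal
    print(f"df={df:3d}: P(T>2)={p_t:.4f} vs P(Z>2)={p_z:.4f}")
```

As the degrees of freedom grow, the t tail probability shrinks toward the normal one, which is why the distinction matters most for small samples.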
 The t-distribution
 Using $s$, we can substitute $t^*$ for $z^*$ in our equation for confidence intervals.
Before: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
Now: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
With 20 degrees of freedom, we would use $t^* = 2.086$ for a 95% confidence interval.
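The interval above can be computed directly. A minimal sketch, assuming SciPy for the critical value; the sample data are made up purely for illustration:

```python
# 95% t confidence interval for a mean: xbar +/- t* * s / sqrt(n).
# The data below are hypothetical, chosen so that n = 21 gives 20 df.
import math
from scipy import stats

data = [4.1, 5.2, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1, 6.2, 4.9,
        5.3, 5.0, 4.7, 5.8, 5.6, 4.6, 5.4, 5.7, 4.5, 6.1, 5.2]
n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))  # sample sd
t_star = stats.t.ppf(0.975, df=n - 1)  # ~2.086 with 20 degrees of freedom
ci = (xbar - t_star * s / math.sqrt(n), xbar + t_star * s / math.sqrt(n))
print(round(t_star, 3), ci)
```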
 We can use the t-distribution for our one-sample hypothesis tests, where $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
 For small $n$, this will still not be accurate if the underlying population distribution is non-normal.
 However, it turns out that the t-distribution is fairly robust to violations of the normality assumption, meaning it gives a good approximation even for small sample sizes. Use the following rule of thumb:
 $n < 15$: if data look close to normal with no outliers, use the t-test.
 $15 \le n < 40$: use the t-test except in cases of big outliers or extreme skewness.
 $n \ge 40$: use the t-test and you should be approximately accurate.
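The one-sample test can be run in one line with SciPy. A sketch with hypothetical data, testing $H_0: \mu = 5.0$:

```python
# One-sample t-test: scipy.stats.ttest_1samp returns the t statistic and a
# two-sided p-value. The sample and null value 5.0 are hypothetical.
from scipy import stats

sample = [5.4, 4.9, 6.1, 5.7, 5.2, 5.9, 6.3, 5.5, 5.0, 5.8]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.4f}")
```

The returned statistic is exactly $(\bar{x} - \mu_0)/(s/\sqrt{n})$ with $n - 1$ degrees of freedom.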
 The two-sample t-test
 So far we have compared one sample to a hypothesized population mean. More often, we are interested in comparing the means of two groups for which we only have samples.
 We call this the two-sample t-test.
$\mu_1 =$ mean for the first group
$\mu_2 =$ mean for the second group
 We are interested in $\mu_1 - \mu_2$, i.e. the difference between the groups.

$H_0: \mu_1 - \mu_2 = 0$, means are the same
$H_a: \mu_1 - \mu_2 \ne 0$, two-sided alternative
$H_a: \mu_1 - \mu_2 > 0$, a one-sided alternative
 Our observed difference is $\bar{x}_1 - \bar{x}_2$.
 Our test statistic is: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
 We are estimating two standard deviations here, so it turns out that our test statistic is only approximately distributed as a $t$. It is fairly robust, particularly if the groups are equal in size.
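Given summary statistics for the two groups, the statistic is straightforward to compute. A sketch with hypothetical means, standard deviations, and sample sizes:

```python
# Two-sample t statistic from summary statistics (unpooled standard error).
# All numbers below are hypothetical.
import math

xbar1, s1, n1 = 31.2, 14.5, 300   # group 1: mean, sd, size
xbar2, s2, n2 = 29.6, 13.0, 400   # group 2: mean, sd, size

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
t = (xbar1 - xbar2) / se
print(round(t, 3))
```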
 What should the degrees of freedom be? Two options:
 Take the smaller of $n_1 - 1$ or $n_2 - 1$. This gives you a conservative estimate of the degrees of freedom.
 Use a complicated formula (the Welch-Satterthwaite approximation), which has little intuition: $\mathrm{df} = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
 This formula gives very precise results for the degrees of freedom. The result is always at least as large as the smaller sample size minus one, and never larger than $n_1 + n_2 - 2$.
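The complicated formula is easy to delegate to a function. A sketch of the Welch-Satterthwaite approximation, with hypothetical inputs:

```python
# Welch-Satterthwaite approximation for the two-sample degrees of freedom.
# The sample sds and sizes used at the bottom are hypothetical.
def welch_df(s1, n1, s2, n2):
    """Approximate df for the unpooled two-sample t-test."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = welch_df(14.5, 30, 13.0, 40)
print(round(df, 1))  # lies between min(n1, n2) - 1 and n1 + n2 - 2
```

With equal variances and equal sample sizes the formula reduces to exactly $n_1 + n_2 - 2$, its upper bound.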
 If you assume that $\sigma_1 = \sigma_2$, then this equation simplifies somewhat.
 Calculate a pooled estimate of the variance, which is just a weighted average: $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
 Then the variance of the difference between means simplifies to $s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$.
 So the t-statistic is: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
 The degrees of freedom are $n_1 + n_2 - 2$.
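The pooled calculation can be carried out step by step from raw data. A sketch with two small hypothetical samples:

```python
# Pooled (equal-variance) two-sample t-test from raw data.
# Both samples are hypothetical.
import math

g1 = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1]
g2 = [10.9, 11.5, 12.2, 10.4, 11.8, 11.1]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    """Sample variance with the n - 1 divisor."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(g1), len(g2)
# Pooled variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
t = (mean(g1) - mean(g2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 3), df)
```

This matches what `scipy.stats.ttest_ind` computes with `equal_var=True`.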
 When we work with proportions, the pooled variance rule always holds. When we work with other continuous variables, however, it is an assumption.
 Example: let's take the observed age difference between survivors and non-survivors on the Titanic. Earlier in the semester, we calculated that survivors of the Titanic were 1.63 years older on average than non-survivors. Taking this particular sinking as one of many possible sinkings, is this difference due to random chance, or is there a real age advantage?
 Calculate the numerator of the test statistic: $\bar{x}_1 - \bar{x}_2 = 1.63$
 Calculate the denominator of the test statistic: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$
 Calculate $t$.
 Let's use the conservative estimate of the degrees of freedom and a two-sided test. What is the critical value $t^*$?
 Now let's pool our estimate of the standard deviation: $s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
 So our test statistic becomes: $t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
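A sketch of this style of calculation follows. The 1.63-year observed difference comes from the notes; the standard deviations and group sizes below are placeholders, not the real Titanic values:

```python
# Titanic-style two-sample calculation with the conservative df.
# diff = 1.63 is from the notes; s1, n1, s2, n2 are hypothetical placeholders.
import math
from scipy import stats

diff = 1.63                      # observed mean age difference (from the notes)
s1, n1 = 14.0, 450               # survivors: hypothetical sd and size
s2, n2 = 13.0, 800               # non-survivors: hypothetical sd and size

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # denominator of the test statistic
t = diff / se
df = min(n1, n2) - 1             # conservative degrees of freedom
p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
```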
Aaron
2005-12-20