Next: Missing data/Weighting
Up: Review of OLS Regression
Previous: Administrative/Review
Contents
- When two models are "nested", we can formally compare them using the F-test.
- We are often interested in a comparison between models, or in including the right control variables to get the best estimate of the net effect of a particular independent variable. Model selection plays a large role in social science research: where is the right model between the null model and the saturated model?
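As a minimal sketch of the nested-model F-test, the statistic can be computed from the residual sums of squares of the two fitted models. The function name `nested_f_statistic`, its argument names, and the numbers in the example are all hypothetical and are not taken from the Ehrlich data.

```python
def nested_f_statistic(rss_restricted, rss_full, n_obs, k_restricted, k_full):
    """F statistic for a restricted model nested inside a full model.

    k_restricted and k_full count independent variables, excluding the intercept.
    """
    q = k_full - k_restricted      # number of coefficients set to zero
    df_resid = n_obs - k_full - 1  # residual degrees of freedom of the full model
    if q <= 0 or df_resid <= 0:
        raise ValueError("need k_full > k_restricted and n_obs > k_full + 1")
    # Drop in fit per restriction, relative to the full model's residual variance
    return ((rss_restricted - rss_full) / q) / (rss_full / df_resid)


# Made-up residual sums of squares, for illustration only
f_stat = nested_f_statistic(rss_restricted=120.0, rss_full=100.0,
                            n_obs=50, k_restricted=7, k_full=9)
print(f_stat)
```

The result is compared with an F distribution on `q` and `df_resid` degrees of freedom; a large value suggests that the extra variables in the full model improve the fit by more than chance alone would.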
- Example: state-level crime data from Ehrlich (1973). The question of interest is whether rational incentives like the probability of being caught and the length of imprisonment affect crime rates. In order to assess the effects of these variables we want a model that can control for other important predictors of crime rates. Our candidate variables are:
- percent of males aged 14-24
- south indicator variable
- mean years of schooling
- police expenditures in 1960
- police expenditures in 1959
- labor force participation rate
- M/F sex ratio
- population
- nonwhites per 1000 population
- unemployment rate of urban males 14-24
- unemployment rate of urban males 35-39
- GDP
- income inequality
- probability of imprisonment
- average time served in state prisons
The full model with all variables and Ehrlich's chosen model are shown in the table. It is clear that the effects are much smaller in the full model than in the model Ehrlich chose, which makes us wonder if his effects are measured using the right model.
- We generally assess the goodness of fit of the OLS model by $R^2$, but we have to balance goodness of fit with parsimony.
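- We can always get a better fit (or at least not a worse fit) by adding more variables to the model, so $R^2$ is not helpful in and of itself. One technique for balancing fit against parsimony is to use the adjusted $R^2$:
$$\text{adj. } R^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{SSE}{SST}$$
Here $n$ is the number of observations, $k$ is the number of independent variables, $SSE$ is the residual sum of squares, and $SST$ is the total sum of squares. The expression before the sum of squares ratio will always be greater than one when k is greater than zero, so it inflates the unexplained share of the variance and "penalizes" $R^2$ for including more variables in the model. We could try choosing the model which maximizes our adjusted $R^2$.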
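To see the penalty at work, here is a small sketch that converts an ordinary R-squared into its adjusted version using the standard degrees-of-freedom correction. The function name `adjusted_r2` and the two example fits are invented for illustration.

```python
def adjusted_r2(r2, n_obs, k):
    """Adjusted R-squared for a model with k independent variables plus an intercept."""
    if n_obs - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    # (1 - r2) is the unexplained share SSE/SST; the factor (n-1)/(n-k-1) inflates it
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - k - 1)


# Hypothetical fits on the same 41 observations
small = adjusted_r2(r2=0.75, n_obs=41, k=5)    # fewer variables, slightly lower R^2
large = adjusted_r2(r2=0.76, n_obs=41, k=15)   # more variables, slightly higher R^2
print(small, large)
```

In this made-up comparison the larger model has the higher raw R-squared but the lower adjusted value, because ten extra variables buy only one additional point of explained variance.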
- Another technique is to proceed in a stepwise fashion. There are three basic approaches:
- Forward Selection. Begin with the variable with the highest single $R^2$ value. Next search for the variable whose addition leads to the greatest increment in $R^2$, and continue this process until all possible variables are added or some stopping criterion is reached (typically that the next added variable is not statistically significant at some level).
- Backward Selection. Begin with the full model. Remove one variable based on some criterion (typically the smallest absolute t-value). Continue this process until all remaining variables are above some pre-established threshold.
- Stepwise Selection. Begin like forward selection, but at the end of each step, remove any variables that have fallen below a certain threshold (typically on the t-value) before moving forward another step.
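The forward variant of these procedures can be sketched in a few lines. This is only an illustration under stated assumptions: it uses NumPy's least-squares solver, the helper names `rss` and `forward_select` are hypothetical, the data are simulated, and the stopping rule is an "F-to-enter" threshold of 4.0 (for a single added variable the F statistic equals the squared t-value, so this is roughly |t| > 2). Statistical packages may use different defaults.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X (X already holds the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection; returns chosen column indices in order of entry."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    current_rss = rss(intercept, y)  # start from the null (intercept-only) model
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # Try each remaining variable; smallest RSS means largest gain in R^2
        trials = {j: rss(np.hstack([intercept, X[:, selected + [j]]]), y)
                  for j in remaining}
        best = min(trials, key=trials.get)
        df_resid = n - len(selected) - 2  # n minus predictors (incl. candidate) minus intercept
        if df_resid <= 0:
            break
        f_stat = (current_rss - trials[best]) / (trials[best] / df_resid)
        if f_stat < f_to_enter:  # assumed stopping rule: candidate not significant
            break
        selected.append(best)
        current_rss = trials[best]
    return selected


# Simulated data: only columns 0 and 2 truly affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
chosen = forward_select(X, y)
print(chosen)
```

Backward selection would start from all columns and drop the weakest one at each step, and stepwise selection would add a removal check after each entry. Noise columns can still clear the threshold by chance.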
- These methods can sometimes give different results.
- Elimination is based purely on statistical significance.
- What about things like interactions or polynomial terms?
- Bayesian Information Criterion (BIC)
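The notes do not give a formula for BIC, so the following sketch assumes one common form for OLS with normally distributed errors, n ln(RSS/n) + p ln(n), dropping constants that are the same for every model fit to the same data. The function name `bic_ols` and the example values are hypothetical.

```python
import math


def bic_ols(rss, n_obs, n_params):
    """BIC for an OLS model, up to an additive constant shared by all models.

    n_params counts every estimated coefficient, including the intercept.
    """
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Made-up fits on the same 50 observations; lower BIC is preferred
bic_full = bic_ols(rss=100.0, n_obs=50, n_params=10)
bic_small = bic_ols(rss=120.0, n_obs=50, n_params=8)
print(bic_full, bic_small)
```

Unlike the F-test, this comparison does not require the models to be nested, but both must be fit to the same observations and the same dependent variable.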
- The best approach depends on what you are doing with the model.
- It's best to have a theoretical reason for fitting the model as you are, rather than just throwing in variables randomly and seeing what happens (monkeys and computers can do that; you are supposed to be the brains of the operation).
- Generally we are interested in a particular variable or a small set of variables, and the others are included as controls. Even if these controls are not statistically distinguishable from zero, it may still be good to include them in the equation. Show the full model and let your readers make their own decisions.
- In some social science research, the key issue is a comparison of models which present competing explanations of the process. In these cases, model selection is very important, and you should carefully consider your options. Generally it is best here to present several techniques.
Aaron
2005-12-21