Next: Generalized Linear Models Up: Advanced Techniques Previous: Factor Analysis and Structural   Contents

• We have not talked much about one of the major assumptions of the OLS regression technique. Let's say we have the following simple model:

$y_i = \alpha + \beta x_i + \epsilon_i$

One major assumption of OLS regression is that the $x_i$ values are uncorrelated with the error terms $\epsilon_i$. Note that this is impossible to confirm directly, because we only have estimates of the error terms, and if such correlation exists then these estimates will themselves be incorrect.
• What brings about this problem? In general, it is brought about by omitted variable bias: there is another variable which is correlated with both $x$ and $y$, so that after fitting the model above there is still a relationship between this other variable and the residuals.
• The omitted variable bias is the major difficulty of observational data. It is a major problem because we are generally interested in whether the model above represents a causal relationship between $x$ and $y$. A frequent interpretation of the model above is that if we could manipulate $x$ by raising it one unit, $y$ would increase by $\beta$ units. This is a causal argument.
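To see the bias concretely, here is a minimal simulation sketch (plain Python; the data-generating process, coefficients, and variable names are all invented for illustration). We generate a confounder $u$ that drives both $x$ and $y$, then fit the bivariate OLS slope of $y$ on $x$:

```python
import random

random.seed(0)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

n = 10_000
u = [random.gauss(0, 1) for _ in range(n)]            # omitted confounder
x = [ui + random.gauss(0, 1) for ui in u]             # x is correlated with u
y = [2.0 * xi + 3.0 * ui + random.gauss(0, 1)         # true effect of x is 2
     for xi, ui in zip(x, u)]

# Bivariate OLS slope of y on x: cov(x, y) / var(x).
# Because u is omitted and correlated with x, the slope picks up part of
# u's effect and lands well above the true value of 2.
beta_hat = cov(x, y) / cov(x, x)
```

The omitted confounder inflates the slope by roughly its coefficient times $\mathrm{Cov}(x,u)/\mathrm{Var}(x)$, so here the estimate settles near 3.5 rather than the true 2.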
• Omitted variable bias is the most common illustration of what economists refer to as endogeneity. Endogenous variables are variables determined by other variables in the system, while exogenous variables are variables which can be considered external shocks to the system (draw a picture).
• The other major source of endogeneity is reverse causality.
• To truly be able to make a causal claim, we need a truly exogenous variable - that is, a variable which is not related to any of the other variables in the system, unobserved and observed. The problem with observational data is that there are an infinite number of unobserved variables which could render our observed relationship endogenous. This is the problem of unobserved heterogeneity in our sample.
• As an example, let's look at a simple question. Do private schools improve students' test performance? Let's say we had a sample of public and private school students' math test scores. We could look at the difference in the average score between groups. But it would be dangerous to assume that such a difference reflected the "treatment" of private schools, because it seems likely that more apt students are more likely to self-select into private schools.
• One standard solution is to control for all the available observed measures that might lead to such self-selection. The problem is that we are unlikely to effectively control for all of this selectivity, because some variables associated with the selection process are probably unobserved. Even if all the important variables were observed, we would only completely control for them if we correctly specified the functional form of their relationship to test scores.
• This problem has led to much wailing and gnashing of teeth among economists. Although aware of the problem, sociologists have been traditionally less concerned with the issue. I would argue this is due to a different conception of how arguments are presented and empirically tested in sociology. I would say the traditional model (for all empirical methods, not just statistical) follows this basic format:
1. Make an argument about how and why things ARE AS THEY ARE.
2. Show that the available empirical data are consistent with your argument.
3. Demonstrate that the available empirical data are inconsistent with counter-arguments for how and why things ARE AS THEY ARE.
• The key issue here is the last one. The focus is on a debate between real, concrete stories, not on the generalized possibility that some unspecified counter-story could plausibly exist.
• Although I actually tend to prefer this kind of conception of what we do, the problem of endogeneity is a real one and it behooves us to take a look at some of the ways in which people (partially) address it. The bottom line is that no method can perfectly recover causality from observational data, but in certain cases we can effectively reduce the range of plausible counter-stories. Let's focus on two common methods:

1. Fixed Effects Models
• Fixed effects models come primarily out of longitudinal data designs in which we have repeated observations on an individual over time. However, they can be applied more broadly than this.
• To continue with our example, let's say we had repeated observations on a set of high school students over their entire four-year high school period (to make it simple, let's say none drop out or repeat a grade). We also have recorded test scores for each individual over this time period. Some students during this period have also migrated between public and private schools.
• Let's define some variables:
$y_{it}$ is the test score for individual $i$ in time period $t$.
$P_{it}$ is the private school indicator variable for individual $i$ in time period $t$.
$X_{it}$ is a set of observed time-varying variables for individual $i$ in time period $t$.
$U_{it}$ is a set of unobserved time-varying variables for individual $i$ in time period $t$.
$Z_i$ is a set of time-constant observed variables for individual $i$.
$W_i$ is a set of time-constant unobserved variables for individual $i$.
• We could start with the same model we had above, but this time we will also control for any observed variables which may be confounding our relationship:

$y_{it} = \alpha + \beta P_{it} + \gamma X_{it} + \delta Z_i + \epsilon_{it}$

This model estimates the average difference between private and public students' math scores with $\beta$, controlling for all observed time-varying and time-constant covariates.

This model is problematic because it doesn't take account of the fact that we have repeated observations on the same students, which will likely lead to correlated error terms within students. Putting this issue aside, however, the model is still problematic in that it doesn't address the unobserved variables that may lead to self-selection into school type and better or worse test performance.

• We can take advantage of our longitudinal design to eliminate some of this unobserved heterogeneity (and to correct the issue with error terms). First, let's define a set of dummy terms, $D_{ij}$, which will be one if the observation comes from individual $j$ and zero otherwise. Add these dummy terms to the model and we have:

$y_{it} = \beta P_{it} + \gamma X_{it} + \sum_{j} \alpha_j D_{ij} + \epsilon_{it}$

Or, more concisely:

$y_{it} = \alpha_i + \beta P_{it} + \gamma X_{it} + \epsilon_{it}$

These dummy variables allow us to fit a term $\alpha_i$ for every individual. Because we have multiple observations per individual, doing this will not saturate the model. Essentially, we are trying to explain variation within individuals. The $\alpha_i$ terms are our "fixed effects."
• Note that we are no longer explicitly including the observed $Z_i$ terms in the model. This is because our $\alpha_i$ terms explain all time-constant variation across individuals, so they supersede our $Z_i$. In technical terms:

$\alpha_i = \delta Z_i + \lambda W_i$

So the fixed effects can account for both observed and unobserved time-constant variables. Thus we can be certain that our new estimate of $\beta$ is not the result of lurking variables that are constant across time.
• Another way of looking at this is that we are using the migrations of certain students as information about the effect of public and private schools. This is an improvement over the cross-sectional approach because we can rule out unobserved heterogeneity that is time-constant. However, we cannot rule out time-varying unobserved heterogeneity. In particular, we might think other events in students' lives may be associated both with movements to and from private schools and with test scores. A family disruption, for example, might reduce the resources to pay for private school and reduce test scores through stress and distraction.
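The within-individual logic can be sketched in a small simulation (plain Python; the data-generating process and all names are invented for illustration). Demeaning each student's school indicator and score within individual is numerically equivalent to fitting the individual dummies, and it removes the time-constant confounder:

```python
import math
import random
from collections import defaultdict

random.seed(1)

n_ind, n_t = 500, 4
rows = []  # (individual, private-school indicator, test score)
for i in range(n_ind):
    a_i = random.gauss(0, 1)          # time-constant unobserved aptitude
    for _ in range(n_t):
        # more apt students are more likely to attend private school
        p = 1 if random.random() < 1 / (1 + math.exp(-a_i)) else 0
        y = 1.0 * p + a_i + random.gauss(0, 1)   # true private-school effect = 1
        rows.append((i, p, y))

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Pooled OLS: a_i sits in the error term and is correlated with p, so the
# estimate is biased upward.
beta_pooled = slope([p for _, p, _ in rows], [y for _, _, y in rows])

# Fixed effects: demean p and y within each individual (equivalent to
# including the individual dummies), which sweeps out a_i entirely.
by_i = defaultdict(list)
for i, p, y in rows:
    by_i[i].append((p, y))
p_dm, y_dm = [], []
for obs in by_i.values():
    mp = sum(p for p, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    for p, y in obs:
        p_dm.append(p - mp)
        y_dm.append(y - my)
beta_fe = slope(p_dm, y_dm)   # close to the true value of 1
```

Only students who switch school type contribute variation after demeaning, which mirrors the point that the migrations are what identify the effect.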
• This approach can be applied to other data of a hierarchical structure. Longitudinal data are hierarchical data where time observations are nested within students. The fixed effects approach can be used on all data of this type to rule out any unobserved heterogeneity at the higher level. For example, if we had information on female siblings, we might use a fixed effects approach to rule out any family effects on the observed relationship between teen pregnancy and educational attainment.
2. Instrumental variables
• In some cases we may not be able to rule out that $x$ is partially endogenous, but we may have another variable $z$ which we can be fairly certain has an effect on $x$ but not on $y$ (draw a picture).
• The best situation is when we know that $z$ has been completely randomized. Let's say for example that an experiment was done which randomly selected some families to receive private school vouchers. It is likely that these vouchers will induce some public school students to move to private school, thus there will be a relationship between $z$ and $x$. However, we can be fairly certain (because we know assignment was random) that voucher assignment itself is not directly related to test scores.
• This situation is rare. The more common situation is a natural experiment in which for some historical reason we observe a shock to a system which can reasonably be treated as random. The most famous example here is the use of birthdates in the question of whether military service affected subsequent labor market experience. Because the draft was assigned on the basis of birthdates, it is highly correlated with military service, but unlikely to be correlated with labor market experience.
• If this relationship holds, then we can treat $z$ as an instrument for inducement into the "treatment" of $x$.
• Mathematically, the basic reasoning is as follows. We have the basic relationship:

$y = \alpha + \beta x + \epsilon$

But since $x$ might be endogenous, we cannot trust our OLS estimate of $\beta$. We can get an instrumental variable estimate of $\beta$ as:

$\hat{\beta}_{IV} = \frac{\mathrm{Cov}(z, y)}{\mathrm{Cov}(z, x)}$

If we are correct in our assumptions about the instrument $z$, then $\mathrm{Cov}(z, \epsilon) = 0$, and therefore:

$\hat{\beta}_{IV} = \frac{\beta\,\mathrm{Cov}(z, x) + \mathrm{Cov}(z, \epsilon)}{\mathrm{Cov}(z, x)} = \beta$

So the IV estimator will be a consistent estimator of $\beta$. In essence we have used the exogenous shock of the instrument to "clean out" any endogenous relationship between $x$ and $\epsilon$.
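A quick simulation illustrates the covariance argument (plain Python; the randomized instrument, coefficients, and names are invented for illustration):

```python
import random

random.seed(2)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

n = 20_000
z = [random.choice([0, 1]) for _ in range(n)]   # randomized instrument (a voucher)
u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
x = [zi + ui + random.gauss(0, 1)               # x is endogenous via u
     for zi, ui in zip(z, u)]
y = [2.0 * xi + 2.0 * ui + random.gauss(0, 1)   # true effect of x is 2
     for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # biased: picks up cov(x, u)
beta_iv = cov(z, y) / cov(z, x)    # consistent: z is uncorrelated with u
```

Because assignment of the instrument is random, it is uncorrelated with the error, so the covariance ratio recovers the true coefficient while the plain OLS slope absorbs the confounding.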
• The most common method for doing the actual estimation, two-stage least squares (2SLS), will help clarify this issue.
1. As a first step, predict the value of $x$ from $z$. If you plan on including other terms in your final model for $y$, say $w$, it is typical to include them at this stage as well:

$\hat{x} = \hat{\pi}_0 + \hat{\pi}_1 z + \hat{\pi}_2 w$

2. Now use the predicted value $\hat{x}$ rather than the real value of $x$ in an OLS regression predicting $y$:

$y = \alpha + \beta \hat{x} + \gamma w + \epsilon$

Because you are using the predicted value of $x$, you are essentially leaving behind the residuals from the first equation. Since $z$ is an exogenous shock to $x$, those residuals are the part of $x$ which is potentially endogenous with $\epsilon$. You have stripped them away.
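The two stages can be sketched in plain Python (the data-generating process is invented for illustration; there are no extra covariates $w$ here, so this is the just-identified case):

```python
import random

random.seed(3)

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

n = 20_000
z = [random.gauss(0, 1) for _ in range(n)]      # instrument
u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2.0 * xi + 2.0 * ui + random.gauss(0, 1)   # true effect of x is 2
     for xi, ui in zip(x, u)]

# Stage 1: regress x on z and form the fitted values x_hat.
pi1 = cov(z, x) / cov(z, z)
pi0 = mean(x) - pi1 * mean(z)
x_hat = [pi0 + pi1 * zi for zi in z]

# Stage 2: regress y on x_hat. The first-stage residuals (the part of x
# potentially correlated with the error) have been stripped away.
beta_2sls = cov(x_hat, y) / cov(x_hat, x_hat)
```

With one instrument and no covariates, the 2SLS estimate equals the covariance-ratio IV estimate exactly. One caveat: standard errors computed naively from the second stage are not valid for 2SLS, since they treat $\hat{x}$ as data; proper software corrects them using the original $x$.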
• In cases where you have strong reasons to believe in the exogeneity of $z$, the instrumental variables technique is quite clever. In practice, however, such instruments are often hard to find. This has led to degraded uses of the method relying on what are called weak instruments, which are only moderately correlated with $x$ and for which the assumption of no relationship between $z$ and $\epsilon$ may be dubious. In these cases, precise estimation is only possible when samples are large. Even then, IV generally doesn't solve the problem but rather re-focuses debate from the possible endogeneity of $x$ to the validity of $z$ as an instrument. Since this latter argument is likely to be more esoteric, the value of the IV approach becomes questionable.
• My personal recommendation is that IV only be used in cases where the endogeneity of $x$ is clear and unresolvable. In these cases, IV should be used as a complement to, rather than a substitute for, OLS.
• This field of what is called "causal modeling" or "causal inference" is very large and active, so I have only been able to touch on its surface. I have not even touched on a third major method, propensity score matching, nor have I discussed probably the major innovation of the last few years - the issue of counterfactual causality and treatment heterogeneity, which would be critical of all of these methods. If you are interested, I have provided some further readings on Courseworks.

Aaron 2005-12-21