Chapter 5 Notes on OLS

Why the constant term?

The SLR model is a population model. When it comes to estimating β1 (and β0) using a random sample of data, we must restrict how u and x are related to each other.

  • Specifically, the restriction is on how u and x relate to each other in the population, not merely in any one sample.

Taking the simple linear regression (SLR) model as given, y=β0+β1x+u, we will make a simplifying assumption: E(u)=0. That is, the average value of u is zero in the population.

Note then, that it is the presence of β0 in y=β0+β1x+u that allows us to assume E(u)=0 without loss of generality.

  • If the average of u were different from zero, say α0, we would just adjust the intercept, which leaves the slope unchanged: y=(β0+α0)+β1x+(u−α0)

  • The new error is u−α0, which has mean zero by construction.
  • The new intercept is β0+α0.

Importantly, note that what we care about knowing (i.e., β1) is invariant to α0.

What matters more, though…

… what matters for the inference we can make is that the mean of the error term does not change across different values of x in the population: E(u|x)=E(u) for all x

If E(u|x)=E(u) for all x, then u is mean independent of x.

For example

Suppose you are interested in the model Income=β0+β1(schooling)+u, with schooling measured in years.

“Ability” would be in u, then, yes? Mean independence would require that E(ability|x=8)=E(ability|x=12)=E(ability|x=16)

That is, the average ability should be the same across levels of education.

  • But, surely people choose education levels partly based on ability, yes? (We’ll return to this.)

Combining E(u|x)=E(u) (the substantive assumption) with E(u)=0 (a normalization) gives E(u|x)=0 for all x.

  • We refer to this as the zero conditional mean assumption.
  • Because the conditional expectation is a linear operator, E(u|x)=0 implies E(y|x)=β0+β1x, which means the conditional expectation function (or population regression function) is a linear function of x.
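Spelling that step out, a one-line derivation using only the model equation and the linearity of conditional expectation:

```latex
E(y \mid x) = E(\beta_0 + \beta_1 x + u \mid x)
            = \beta_0 + \beta_1 x + E(u \mid x)
            = \beta_0 + \beta_1 x .
```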

How can we estimate the population parameters, β0 and β1?

  • Let {(xi,yi):i=1,2,…,n} be a random sample of size n (the number of observations) from the population.
  • Plug any ith observation into the population equation: yi=β0+β1xi+ui, where the i subscript indicates a particular observation.

  • Note: We observe yi and xi, but we do not observe ui (though we know it is there).

To obtain estimating equations for β0 and β1, use the two population restrictions: E(u)=0 and Cov(x,u)=0

We talked about the first condition. The second condition, that the covariance is zero, means that x and u are uncorrelated. Both conditions are implied by E(u|x)=0.
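Here is a sketch of the standard method-of-moments route from these restrictions to the estimators: replace the two population moments with their sample analogs, evaluated at the candidate estimates, and solve the resulting two equations.

```latex
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat\beta_0 - \hat\beta_1 x_i\bigr) &= 0
  \quad \text{(sample analog of } E(u)=0 \text{)} \\
\frac{1}{n}\sum_{i=1}^{n} x_i\bigl(y_i - \hat\beta_0 - \hat\beta_1 x_i\bigr) &= 0
  \quad \text{(sample analog of } \operatorname{Cov}(x,u)=0 \text{)}
\end{aligned}
```

Solving these two equations in two unknowns yields the familiar OLS formulas:

```latex
\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1\,\bar{x}.
```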

Fitted values: For any candidate values ˆβ0 and ˆβ1, define the fitted value for each i as ˆyi=ˆβ0+ˆβ1xi. There are n of these fitted values… the values we predict for yi given xi.

Residuals: The “mistake” we make in predicting yi is the residual, ˆui=yi−ˆyi=yi−(ˆβ0+ˆβ1xi)

We want to track both overshooting and undershooting y. So… we measure the size of the “mistakes” across all i in the sample by first squaring them, and then summing across i: ∑i ˆui² = ∑i (yi−ˆβ0−ˆβ1xi)², where each sum runs over i=1,…,n.

  • The OLS estimates are those ˆβ0 and ˆβ1 that minimize the sum of squared residuals (SSR).
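As a quick numerical check on both the moment formulas and the minimization claim, here is a minimal sketch in Python (simulated data; all parameter values are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(5, 2, size=n)
u = rng.normal(0, 1, size=n)      # satisfies E(u) = 0 and Cov(x, u) = 0
y = 2.0 + 0.5 * x + u             # true beta0 = 2.0, true beta1 = 0.5

# Closed-form OLS: beta1_hat = sample Cov(x, y) / sample Var(x)
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

# The sum of squared residuals (SSR) at these estimates
ssr = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
print(beta0_hat, beta1_hat, ssr)

# Cross-check against numpy's least-squares line fit, which minimizes the SSR
print(np.polyfit(x, y, deg=1))    # returns [slope, intercept]
```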

Note… ˆβ1 is an estimate of the parameter β1, obtained for a specific sample.

  • Different samples will generate different estimates (of the true β1).

Unbiasedness: If we have an unbiased estimator, we could take as many random samples from a population as we wanted to, and the average (mean) of all the estimates from each of those samples would be equal to the true β.
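A minimal simulation of that thought experiment, assuming a population where the zero conditional mean assumption holds by construction (again, all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 0.5           # true population parameters
n, reps = 50, 10_000              # sample size and number of random samples

estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(5, 2, size=n)
    u = rng.normal(0, 1, size=n)  # E(u | x) = 0 holds by construction
    y = beta0 + beta1 * x + u
    estimates[r] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(estimates.mean())  # ≈ 0.5: the average of the estimates recovers beta1
print(estimates.std())   # any single sample's estimate still varies around it
```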

Zero Conditional Mean Assumption: In the population, the error term has zero mean given any value of the explanatory variable: E(u|x)=E(u)=0

  • This is the key assumption implying that OLS is unbiased, with the “zero” not being important once we assume E(u|x) does not change with x.

The problem with OLS?

The problem: We can compute OLS estimates whether or not the zero conditional mean assumption (i.e., E(u|x)=E(u)=0) holds.

The bigger problem: We think the zero conditional mean assumption is a little hard to swallow.

Consider Anscombe’s Quartet, for example… four distinct data-generating processes, each yielding the same coefficient estimates in OLS (in this case, ˆy=3+0.5x). Yet the causal effect of x has certainly not been identified in all four scenarios.

In each dataset: n=11, mean(x)=9, mean(y)=7.5, var(x)=11, var(y)∈[4.122,4.127], corr(x,y)=0.816
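A quick way to verify the claim, assuming seaborn’s bundled copy of the Anscombe data (the dataset name and columns below are seaborn’s, not the notes’):

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")           # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    print(name, round(intercept, 2), round(slope, 2))  # ≈ 3.0 and 0.5 each
```

Identical fitted lines, four very different scatterplots: the estimates alone cannot tell you which data-generating process you are in.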


Omitted-variable bias

Consider another example: 1(workingi)=β0+β1numkidsi+ui, where 1(workingi) indicates whether individual i works.

If family size is random, then Cov(numkidsi,ui)=0, in which case ˆβ1 is the causal effect of numkids on working.

Q: How do we interpret ˆβ1 if numkids is non-random? That is, what if Cov(numkidsi,ui)≠0?

  • A: ˆβ1 is biased, where the sign of the bias is determined by the signs of Cov(xi,ui) and Cov(ui,yi).

The rule:

If Cov(ui,yi) & Cov(xi,ui) have the same sign, the bias is positive.

If Cov(ui,yi) & Cov(xi,ui) have opposite signs, the bias is negative.
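A sketch of the rule in action, assuming a simulated omitted variable w that ends up in the error term (all coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
w = rng.normal(size=n)            # omitted variable: it lives in u

def slope(x, y):
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

x = w + rng.normal(size=n)        # x is correlated with w

# Case 1: Cov(x, u) > 0 and Cov(u, y) > 0 -> same sign, upward bias
y1 = 1.0 + 0.5 * x + w + rng.normal(size=n)   # true beta1 = 0.5
print(slope(x, y1))               # ≈ 1.0, biased upward

# Case 2: Cov(x, u) < 0 and Cov(u, y) > 0 -> opposite signs, downward bias
y2 = 1.0 + 0.5 * x - w + rng.normal(size=n)   # true beta1 = 0.5
print(slope(x, y2))               # ≈ 0.0, biased downward
```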


An example

Q: What if 1) those with more kids also tend to be married (Cov(numkidsi,marriedi)>0) and, 2) married people tend to be more likely to work (Cov(marriedi,workingi)>0)?

  • A: ˆβ1 is biased upward. (The effect of married is loading onto numkids.)

Q: What if 1) black families tend to have more kids (Cov(numkidsi,blacki)>0) and 2) black employment tends to be lower (Cov(blacki,workingi)<0)?

  • A: ˆβ1 is biased downward. (The effect of black is loading onto numkids.)

So, if family size varies non-randomly… can we get to a place where we “believe” our ˆβ1? That is… be comfortable assuming that family size is conditionally random? What would you want to condition on?

  • Anything that correlates with numkids and with working.

1(workingi)=β0+β1numkidsi+γ1marriedi+γ2blacki+ui

Here, if we want to estimate the average causal effect of family size on labor supply, the identifying assumption is

  • Cov(numkidsi,ui)=0, or,

  • Cov(numkidsi,ui|{marriedi,blacki})=0
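A sketch of what conditioning buys you, simulating the family-size example with one confounder (married) for brevity; the data-generating process, the coefficients, and the continuous stand-in for 1(working) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
married = rng.binomial(1, 0.5, size=n)
numkids = rng.poisson(1 + married)        # Cov(numkids, married) > 0
working = (0.2 - 0.05 * numkids + 0.3 * married
           + rng.normal(0, 0.3, size=n))  # true beta1 = -0.05

def ols(regressors, y):
    X = np.column_stack([np.ones(len(y))] + regressors)
    return np.linalg.lstsq(X, y, rcond=None)[0]  # [intercept, slopes...]

print(ols([numkids], working)[1])          # short model: biased upward
print(ols([numkids, married], working)[1]) # with the control: ≈ -0.05
```

The short regression loads part of the married effect onto numkids; adding the control is the “conditionally random” assumption at work.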

OVB in one picture

An abstraction for sure, but sometimes helpful. Consider these shapes as representing variation, and their overlap, therefore, as co-variation. Having run the model Y=β0+β1X1+u, the source of OVB is the three-way overlap of Y, X1, and X2: the overlap of Y and X1 gets attributed entirely to X1, as though it were causal.

In these figures (Venn diagrams)

  • Each circle illustrates a variable.
  • Overlap represents the correlation between two variables.
  • Dotted borders denote omitted variables.