Chapter 5 Notes on OLS

Why the constant term?

The SLR model is a population model. When it comes to estimating \(\beta_1\) (and \(\beta_0\)) using a random sample of data, we must restrict how \(u\) and \(x\) are related to each other.

  • What we must do is restrict the way \(u\) and \(x\) relate to each other in the population.

Taking the simple linear regression (SLR) model as given, \[y = \beta_0 + \beta_1x + u\], we will make a simplifying assumption: \[E(u)=0\] That is, the average value of \(u\) is zero in the population.

Note then, that it is the presence of \(\beta_0\) in \(y = \beta_0 + \beta_1x + u\) that allows us to assume \(E(u) = 0\) without loss of generality.

  • If the average of \(u\) were different from zero, say \(\alpha_0\), we would just adjust the intercept, leaving the slope unchanged: \[y = (\beta_0 + \alpha_0) + \beta_1x + (u-\alpha_0)\]

  • The new error is: \(u - \alpha_0\)
  • The new intercept is: \(\beta_0 + \alpha_0\)

Importantly, note that what we care about knowing (i.e., \(\beta_1\)) is invariant to \(\alpha_0\).
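
A quick simulation makes this concrete. Below is a minimal sketch in Python (NumPy only; the population values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\alpha_0 = 3\) are hypothetical): when the error has mean \(\alpha_0 \ne 0\), the fitted intercept absorbs \(\alpha_0\), while the fitted slope still recovers \(\beta_1\).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    beta0, beta1, alpha0 = 2.0, 0.5, 3.0   # hypothetical population values
    x = rng.normal(10, 2, size=n)
    u = rng.normal(alpha0, 1, size=n)      # error with non-zero mean alpha0
    y = beta0 + beta1 * x + u

    b1, b0 = np.polyfit(x, y, 1)           # OLS slope and intercept

    print(b1)   # approximately beta1 = 0.5 (slope unchanged)
    print(b0)   # approximately beta0 + alpha0 = 5.0 (intercept absorbs alpha0)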

What matters more, though…

… what matters for the inference we can make is that the mean of the error term does not change across different values of \(x\) in the population: \[E(u | x)=E(u) \hspace{2mm}\forall \hspace{2mm} x\]

If \(E(u | x)=E(u)\) for all \(x\), then \(u\) is mean independent of \(x\).

For example

Suppose you are interested in the model \[Income=\beta_0 + \beta_1(schooling) + u\]

The error \(u\) would then include “ability,” yes? Mean independence would require that \[E(ability | schooling = 8) = E(ability | schooling = 12) = E(ability | schooling = 16)\]

That is, the average ability should be the same across levels of education.

  • But, surely people choose education levels partly based on ability, yes? (We’ll return to this.)

Combining \(E(u | x) = E(u)\) (the substantive assumption) with \(E(u) = 0\) (a normalization) gives \[E(u | x)=0 \hspace{2mm}\forall \hspace{2mm} x.\]

  • We refer to this as the zero conditional mean assumption.
  • Because the conditional expected value is a linear operator, \(E(u | x) = 0\) implies \[E(y | x)= \beta_0 +\beta_1 x,\] which implies that the conditional expectation function (or population regression function) is a linear function of \(x\).
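
Spelling that step out: \[E(y | x) = E(\beta_0 + \beta_1x + u | x) = \beta_0 + \beta_1x + E(u | x) = \beta_0 + \beta_1x.\]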

How can we estimate the population parameters, \(\beta_0\) and \(\beta_1\)?

  • Let \(\{(x_i, y_i) : i = 1, 2, \ldots, n\}\) be a random sample of size \(n\) (the number of observations) from the population.
  • Plug any \(i^{th}\) observation into the population equation: \[y_i = \beta_0 + \beta_1x_i + u_i,\] where the \(i\) subscript indicates a particular observation.

  • Note: We observe \(y_i\) and \(x_i\), but we do not observe \(u_i\) (though we know it is there).

To obtain estimating equations for \(\beta_0\) and \(\beta_1\), use the two population restrictions: \[E(u) = 0\] \[Cov(x,u) = 0\]

We talked about the first condition. The second condition, that the covariance is zero, means that \(x\) and \(u\) are uncorrelated. Both conditions are implied by \(E(u | x) = 0\).
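
Replacing these two population restrictions with their sample analogues gives two equations in the two unknowns \({\hat \beta}_0\) and \({\hat \beta}_1\) (the method-of-moments route to OLS): \[\frac{1}{n}\sum_{i=1}^{n}(y_i - {\hat \beta}_0 - {\hat \beta}_1x_i) = 0, \hspace{4mm} \frac{1}{n}\sum_{i=1}^{n}x_i(y_i - {\hat \beta}_0 - {\hat \beta}_1x_i) = 0\] Solving them yields \[{\hat \beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \hspace{4mm} {\hat \beta}_0 = \bar{y} - {\hat \beta}_1\bar{x},\] provided the \(x_i\) are not all identical. These are exactly the estimates that minimize the sum of squared residuals defined below.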

Fitted values: For any candidates \({\hat \beta}_0\) and \({\hat \beta}_1\), define the fitted value for each \(i\) as \[{\hat y}_i = {\hat \beta}_0 + {\hat \beta}_1x_i\] There are \(n\) of these fitted values… the values we predict for \(y_i\) given \(x_i\).

Residuals: The “mistake” we make in predicting \(y_i\) is the residual, \[{\hat u}_i = y_i - {\hat y}_i = y_i - ({\hat \beta}_0 + {\hat \beta}_1x_i)\]

We want to track both overshooting and undershooting \(y\). So… we measure the size of the “mistakes” across all \(i\) in the sample by first squaring them, and then summing across \(i\):\[\sum_{i=1}^{n}{\hat u}_i^2 = \sum_{i=1}^{n}(y_i-{\hat \beta}_0 - {\hat \beta}_1x_i)^2\]

  • The OLS estimates are those \({\hat \beta}_0\) and \({\hat \beta}_1\) that minimize the sum of squared residuals (SSR).
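
As a sanity check, here is a minimal sketch in Python (NumPy and SciPy; the simulated data and population values \(\beta_0 = 1\), \(\beta_1 = 2\) are hypothetical) showing that the closed-form expressions above coincide with a generic numerical minimizer of the SSR.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n = 500
    beta0, beta1 = 1.0, 2.0                      # hypothetical population values
    x = rng.normal(0, 1, size=n)
    y = beta0 + beta1 * x + rng.normal(0, 1, size=n)

    # Closed-form OLS estimates (sample covariance over sample variance)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()

    # Direct numerical minimization of the sum of squared residuals
    def ssr(b):
        """Sum of squared residuals for a candidate (intercept, slope) pair."""
        return ((y - b[0] - b[1] * x) ** 2).sum()

    res = minimize(ssr, x0=[0.0, 0.0])

    print(b0, b1)    # closed form
    print(res.x)     # numerical minimizer: the same values, up to tolerance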

Note… \({\hat \beta}_1\) is an estimate of the parameter \(\beta_1\), obtained for a specific sample.

  • Different samples will generate different estimates (of true \(\beta_1\)).

Unbiasedness: With an unbiased estimator, we could take as many random samples from the population as we wanted, and the average (mean) of the estimates across those samples would equal the true \(\beta\).
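
That thought experiment is easy to mimic. A sketch in Python (hypothetical values \(\beta_0 = 1\), \(\beta_1 = 2\), with \(E(u | x) = 0\) holding by construction): draw many random samples, estimate \({\hat \beta}_1\) in each, and average the estimates.

    import numpy as np

    rng = np.random.default_rng(2)
    beta0, beta1 = 1.0, 2.0          # hypothetical population values
    n, reps = 50, 10_000             # small samples, many replications

    estimates = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0, 1, size=n)
        u = rng.normal(0, 1, size=n)             # E(u | x) = 0 by construction
        y = beta0 + beta1 * x + u
        estimates[r] = np.polyfit(x, y, 1)[0]    # store the estimated slope

    print(estimates.mean())   # close to beta1 = 2: unbiasedness
    print(estimates.std())    # individual estimates still vary across samples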

Zero Conditional Mean Assumption: In the population, the error term has zero mean given any value of the explanatory variable: \[E(u | x) = E(u) = 0\]

  • This is the key assumption implying that OLS is unbiased, with the “zero” not being important once we assume \(E(u | x)\) does not change with \(x\).

The problem with OLS?

The problem: We can compute OLS estimates whether or not the zero conditional mean assumption (i.e., \(E(u | x) = E(u) = 0\)) holds.

The bigger problem: We think the zero conditional mean assumption is a little hard to swallow.

Consider Anscombe’s Quartet, for example… four distinct data-generating processes, each yielding the same OLS coefficient estimates (in this case, \(\hat{y}=3+0.5x\)). Clearly, the causal effect of \(x\) has not been identified in all four scenarios.

Common to all four datasets: \(n=11\), \(\bar{x}=9\), \(\bar{y}=7.5\), \(Var(x)=11\), \(Var(y)\in[4.122, 4.127]\), \(Corr(x,y)=0.816\)
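
To see the point numerically, here is a small sketch using the Anscombe data bundled with seaborn (assuming seaborn is installed and can download its example datasets): each of the four panels yields essentially the same intercept and slope.

    import numpy as np
    import seaborn as sns

    # Load the quartet: columns are "dataset" (I-IV), "x", and "y"
    df = sns.load_dataset("anscombe")

    for name, group in df.groupby("dataset"):
        slope, intercept = np.polyfit(group["x"], group["y"], 1)
        print(name, round(intercept, 2), round(slope, 2))   # ~3.0 and ~0.5 in every panel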


Omitted-variable bias

Consider another example: \[ \mathbb{1}(working_i) = \beta_0 + \beta_1 numkids_i + u_i\]

If family size is random, then \(Cov(numkids_i, u_i)=0\), in which case \({\hat \beta}_1\) is an unbiased estimate of the causal effect of \(numkids\) on \(working\).

Q: How do we interpret \({\hat \beta}_1\) if \(numkids\) is non-random? That is, what if \(Cov(numkids_i, u_i)\ne 0\)?

  • A: \({\hat \beta}_1\) is biased, where the sign of the bias is determined by \(Cov(x_i, u_i)\) and \(Cov(u_i, y_i)\).

The rule:

If \(Cov(u_i, y_i)\) and \(Cov(x_i, u_i)\) have the same sign, the bias is positive.

If \(Cov(u_i, y_i)\) and \(Cov(x_i, u_i)\) have opposite signs, the bias is negative.
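
Where the rule comes from: suppose the true population model includes an omitted variable \(z\), so that \(z\) sits inside \(u\) when we run the short regression of \(y\) on \(x\) alone. The standard derivation (assuming \(E(e | x, z)=0\)) gives \[y = \beta_0 + \beta_1x + \gamma z + e \hspace{4mm} \Rightarrow \hspace{4mm} \mathrm{plim}\, {\hat \beta}_1 = \beta_1 + \gamma \cdot \frac{Cov(x, z)}{Var(x)},\] so the bias is positive when \(\gamma\) (how \(z\) moves \(y\)) and \(Cov(x, z)\) share a sign, and negative when they do not.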


An example

Q: What if 1) those with more kids also tend to be married (\(Cov(numkids_i, married_i)>0\)) and, 2) married people tend to be more likely to work (\(Cov(married_i, working_i)>0\))?

  • A: \({\hat \beta}_1\) is biased upward. (The effect of \(married\) is loading onto \(numkids\).)

Q: What if 1) black families tend to have more kids (\(Cov(numkids_i, black_i)>0\)) and 2) black employment tends to be lower (\(Cov(black_i, working_i)<0\))?

  • A: \({\hat \beta}_1\) is biased downward. (The effect of \(black\) is loading onto \(numkids\).)

So, if family size varies non-randomly… can we get to a place where we “believe” our \({\hat \beta}_1\)? That is… be comfortable assuming that family size is conditionally random? What would you want to condition on?

  • Anything that correlates with \(numkids\) and with \(working\).

\[\mathbb{1}(working_i) = \beta_0 + \beta_1 numkids_i + \gamma_1 married_i + \gamma_2 black_i + u_i\]

Here, if we want to estimate the average causal effect of family size on labor supply, the identifying assumption is

  • \(Cov(numkids_i, u_i)= 0\), or,

  • \(Cov(numkids_i, u_i | \{married_i, black_i\}) = 0\)
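
A simulation sketch of the first scenario above (all parameter values hypothetical): marriage raises both family size and the probability of working, so omitting \(married\) biases the \(numkids\) coefficient upward; adding the control recovers the true (negative) effect.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000

    # Hypothetical DGP: marriage raises both family size and labor supply
    married = rng.binomial(1, 0.5, size=n).astype(float)
    numkids = rng.poisson(1.0 + 1.0 * married).astype(float)
    p_work = np.clip(0.55 - 0.05 * numkids + 0.20 * married, 0, 1)  # true effect of one kid: -0.05
    working = rng.binomial(1, p_work).astype(float)

    # Short regression: working on numkids only (married is left in the error term)
    b_short = np.polyfit(numkids, working, 1)[0]

    # Long regression: add married as a control (OLS via least squares)
    X = np.column_stack([np.ones(n), numkids, married])
    b_long = np.linalg.lstsq(X, working, rcond=None)[0]

    print(b_short)     # biased upward relative to -0.05
    print(b_long[1])   # close to -0.05 once married is controlled for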

OVB in one picture

An abstraction for sure, but sometimes helpful. Consider each shape as representing variation in a variable, and overlap therefore as co-variation. Having run the model \(Y=\beta_0 + \beta_1 X_1 + u\), the source of OVB is the three-way overlap of \(Y\), \(X_1\), and \(X_2\): the overlap of \(Y\) and \(X_1\) is attributed entirely to \(X_1\), as though causal, even though part of it is shared with the omitted \(X_2\).

In these figures (Venn diagrams)

  • Each circle illustrates a variable.
  • Overlap represents the correlation (shared variation) between two variables.
  • Dotted borders denote omitted variables.