
## Logistic Regression

• When we have a dichotomous variable as the dependent variable, OLS regression won't work. The Linear Probability Model can be fit, but:
• The relationship is non-linear because the probabilities are bound between 0 and 1.
• The error terms are heteroskedastic because the dependent variable is produced by a binomial process where the variance depends upon the underlying value.
• As we have learned, we can correct these problems with a generalized linear model.
• We know that the error distribution is given by a binomial distribution. So, we only need to choose a link function. We know the identity link won't work because we have the non-linearity problem.
• There are several possible link functions, but the best one (or at least the easiest to interpret) is the logit function.
• The logit is the log of the odds:

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

• This function spreads the probabilities over the entire real number line.
• So, our logistic regression model looks like:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

• How do we interpret the $\beta$'s? Well, first let's relate this equation back to odds rather than the log-odds by exponentiating both sides:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k} = e^{\beta_0}\left(e^{\beta_1}\right)^{x_1}\cdots\left(e^{\beta_k}\right)^{x_k}$$

• How does a one-unit change in $x_i$ affect the predicted odds?
• It increases the odds by a multiplicative factor of $e^{\beta_i}$.
• By exponentiating the $\beta$'s we get odds ratios - how much the odds increase multiplicatively with a one-unit change in the independent variable. For categorical variables, these can be interpreted directly as odds ratios between groups. For continuous variables they are the odds ratios between individuals who are identical on the other variables but differ by one unit on the variable of interest. (show an example of each)
• Therefore, the $\beta$'s themselves are log-odds ratios. Negative values indicate a negative relationship between the probability of "success" and the independent variable; positive values indicate a positive relationship.
• When you exponentiate them, the dividing line between a positive and negative relationship is 1 not 0.
• Let's take our Titanic example. We have the equation:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 G$$

where $G$ is an indicator variable for women vs. men. The results of this model are given by:

| Variable | Coefficient (t-statistic) |
| --- | --- |
| Constant | -1.44 (-16.46) |
| Gender: Women | 2.42 (17.82) |
| Gender: Men (ref.) | - |

What is the intercept giving us here? The log-odds of survival for the reference group (men). To convert this into odds, take the exponential:

$$e^{-1.44} \approx 0.24$$

What is the slope giving us? The difference in the log-odds of survival between women and men. To convert this into an odds ratio, take the exponential:

$$e^{2.42} \approx 11.25$$

If we want to know the log-odds of survival for women, then we have to add the relevant parameters:

$$-1.44 + 2.42 = 0.98 = \text{log-odds of a woman's survival}$$

To get this as an odds, exponentiate:

$$e^{0.98} \approx 2.66$$

These numbers should look familiar. They are precisely what we got before by hand.
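These by-hand conversions are easy to script. A quick sketch, using only the coefficients from the table above (the rounding is mine):

```python
import math

# Coefficients from the gender-only Titanic model above
intercept = -1.44    # log-odds of survival for men (the reference group)
beta_female = 2.42   # log-odds ratio of survival, women vs. men

odds_men = math.exp(intercept)                  # odds of survival for men
odds_ratio = math.exp(beta_female)              # odds ratio, women vs. men
odds_women = math.exp(intercept + beta_female)  # odds of survival for women

print(round(odds_men, 2))     # 0.24
print(round(odds_ratio, 2))   # 11.25
print(round(odds_women, 2))   # 2.66
```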

The coefficients returned from a logistic regression model are log-odds ratios. They tell us how the log-odds of a "success" change with a one-unit change in the independent variable. Increasing the log-odds of a success means increasing the probability and, vice versa, decreasing the log-odds of a success means decreasing the probability. Therefore, the sign of the log-odds ratio indicates the direction of its relationship: + means a positive relationship between $x$ and the likelihood of a success, and - means a negative relationship. In order to get an intuitive sense of how much things are changing, we need to take the exponential of the log-odds ratio, which gives us the odds ratio itself. Let's return to our example from yesterday looking at survival on the Titanic by gender:

$$\ln\left(\frac{p}{1-p}\right) = -1.44 + 2.42\,G$$

The positive coefficient indicates that women were more likely to survive the Titanic than men. But our coefficients are related to the log-odds of survival. Let's exponentiate both sides to see how they relate to the odds of survival:

$$\frac{p}{1-p} = e^{-1.44}\left(e^{2.42}\right)^{G}$$

The odds for any individual are a multiplicative function of a "baseline" odds and the "odds ratios" of their characteristics. The predicted odds for a man ($G = 0$) are:

$$e^{-1.44} \approx 0.24$$

The odds for a woman ($G = 1$) are:

$$e^{-1.44} \times e^{2.42} \approx 0.24 \times 11.25 \approx 2.66$$

For the log-odds ratios, a negative value indicates a negative relationship. But all odds ratios are positive values, so whether the relationship is positive or negative depends on which side of 1 they fall on: 1 indicates no relationship, less than 1 indicates a negative relationship, and greater than 1 indicates a positive relationship.
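This rule about which side of 1 an odds ratio falls on can be captured in a tiny helper (an illustrative sketch, not part of the original notes):

```python
import math

def direction(odds_ratio):
    """Classify the relationship implied by an odds ratio; 1 means no relationship."""
    if odds_ratio > 1:
        return "positive"
    if odds_ratio < 1:
        return "negative"
    return "none"

print(direction(math.exp(-0.5)))  # negative: a negative log-odds ratio maps below 1
print(direction(math.exp(2.42)))  # positive: the Titanic gender effect
```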

The interpretation is similar with continuous variables. Let's take the case of predicting survival by fare paid:

$$\ln\left(\frac{p}{1-p}\right) = -0.88 + 0.012\,F$$

where $F$ is the fare paid in pounds.

Once again, let's exponentiate this to get the results in terms of the odds of survival:

$$\frac{p}{1-p} = e^{-0.88}\left(e^{0.012}\right)^{F} = 0.414 \times 1.012^{F}$$

Once again we have a multiplicative relationship. Let's take the three cases where a passenger paid zero, one, or two pounds for his/her ticket:

$$F = 0: \quad 0.414 \qquad F = 1: \quad 0.414 \times 1.012 = 0.419 \qquad F = 2: \quad 0.414 \times 1.012^2 = 0.424$$

For the first person, the odds of survival are simply given by the exponential of the intercept term, which in this case leads to an odds of 0.414. For the second person, the odds of survival increase by a factor of 1.012 because this person paid a pound more than the first. For the third person, the odds of survival increase by a further factor of 1.012 because this person paid a pound more than the second. The exponential of the coefficient thus gives the expected odds ratio between two individuals who differ by one unit on the given independent variable.
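The multiplicative pattern across the three fares can be verified directly (a sketch using the baseline odds and per-pound odds ratio quoted above):

```python
# The fare-only model in multiplicative form: baseline odds 0.414,
# per-pound odds ratio 1.012 (values from the text)
baseline_odds = 0.414
or_per_pound = 1.012

# Predicted odds of survival at fares of zero, one, and two pounds
odds = {fare: round(baseline_odds * or_per_pound ** fare, 3) for fare in (0, 1, 2)}
print(odds)  # {0: 0.414, 1: 0.419, 2: 0.424}
```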

We can think of interactions in a similar way: they tell us how much the odds ratio related to one variable differs between groups. Let's now interact gender and fare in our Titanic example:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 G + \beta_2 F + \beta_3 (G \times F)$$

where $G$ is a female indicator and $F$ is the fare paid. Exponentiate again:

$$\frac{p}{1-p} = e^{\beta_0}\left(e^{\beta_1}\right)^{G}\left(e^{\beta_2}\right)^{F}\left(e^{\beta_3}\right)^{G \times F}$$

For men ($G = 0$):

$$\frac{p}{1-p} = e^{\beta_0}\left(e^{\beta_2}\right)^{F}$$

For women ($G = 1$):

$$\frac{p}{1-p} = e^{\beta_0}\, e^{\beta_1}\left(e^{\beta_2 + \beta_3}\right)^{F}$$

The exponential of the gender effect (6.81) gives us the baseline odds ratio between genders, while the exponential of the interaction term tells us how much lower/higher in a multiplicative sense the odds ratio between survival and fare is for women than for men. In this case, gender differences in survival increased with fare.
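The gender-specific fare odds ratios can be sketched like this. Note the fare and interaction coefficients are placeholders (the notes do not report them); only the gender main effect is recovered from its reported exponential, 6.81:

```python
import math

# Gender main effect, backed out from its reported exponential (6.81)
b_gender = math.log(6.81)

def fare_odds_ratio(b_fare, b_interaction, female):
    """Per-pound odds ratio of survival: the fare slope plus, for women,
    the interaction shift, exponentiated. Coefficient values are hypothetical."""
    return math.exp(b_fare + (b_interaction if female else 0.0))

# With a positive interaction, the fare odds ratio is larger for women
print(fare_odds_ratio(0.01, 0.005, female=True) > fare_odds_ratio(0.01, 0.005, female=False))
```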

Multivariate models: each exponentiated coefficient is the change in the odds for a one-unit change in that variable, holding all the other variables constant.

• Let's look at a full example from the Titanic data.
| Variable | Coefficient | SE | t-statistic | p-value |
| --- | --- | --- | --- | --- |
| Intercept | -0.0514 | 0.2139 | -0.2404 | 0.8101 |
| Passenger class: First Class (ref.) | - | - | - | - |
| Passenger class: Second Class | -1.6830 | 0.3286 | -5.1224 | 0.0000 |
| Passenger class: Third Class | -1.6235 | 0.2906 | -5.5871 | 0.0000 |
| Gender: Male (ref.) | - | - | - | - |
| Gender: Female | 3.9976 | 0.5031 | 7.9452 | 0.0000 |
| Gender*Class: Female*2nd Class | 0.0611 | 0.6379 | 0.0958 | 0.9237 |
| Gender*Class: Female*3rd Class | -2.5360 | 0.5495 | -4.6149 | 0.0000 |
| Age (mean-centered) | -0.0454 | 0.0073 | -6.2560 | 0.0000 |
| Fare (mean-centered) | 0.0004 | 0.0021 | 0.2003 | 0.8412 |
| # Siblings/Spouses | -0.3349 | 0.1010 | -3.3163 | 0.0009 |
• What is the intercept telling us? The reference group is a first-class man at the mean age who paid the mean fare and had no siblings or spouse on board. This man had a log-odds of surviving of -0.0514. This translates into an odds of:

$$e^{-0.0514} \approx 0.95 = \text{odds of survival}$$

This translates into a probability of surviving of:

$$\frac{0.95}{1 + 0.95} \approx 0.487 = \text{probability of survival}$$
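The log-odds → odds → probability chain used here is worth wrapping in a helper (a sketch; the intercept is taken from the table above):

```python
import math

def logodds_to_prob(log_odds):
    """Probability from log-odds: odds = exp(log-odds), p = odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)

intercept = -0.0514  # log-odds of survival for the reference group
print(round(math.exp(intercept), 2))         # odds, about 0.95
print(round(logodds_to_prob(intercept), 3))  # probability, about 0.487
```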

• Let's plug in the following values: a second-class 35-year-old woman travelling with her husband, who paid 15 pounds for her fare. What are the predicted odds of a person with these values surviving? How about the probability?

This leads to a probability of 0.887.

• Let's change age to 25. How does this change the odds? How does it change the probabilities?

This leads to a probability of 0.900. Note that the difference in the odds is much greater than the difference in the probability. The difference in the odds is a factor of $e^{(-0.0454)(25-35)} = e^{0.454} \approx 1.57$. This is because a one-unit change in age leads to a multiplicative change in the odds of $e^{-0.0454} \approx 0.956$.
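The odds factor for this ten-year change can be checked in a couple of lines (using the age coefficient from the table):

```python
import math

beta_age = -0.0454  # per-year change in the log-odds of survival (from the table)

# Multiplicative change in the odds when age drops from 35 to 25
factor = math.exp(beta_age * (25 - 35))
print(round(factor, 2))              # 1.57

# Per-year multiplicative change in the odds
print(round(math.exp(beta_age), 3))  # 0.956
```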

• Be careful with the interaction term. Because it is in there, the effects of gender and passenger class alone only represent the effects for the reference group. So the passenger class variables give you the log-odds difference in survival between 1st and 2nd (-1.68) and 1st and 3rd (-1.62) male passengers.

To get the effects for female passengers, we need to add on the female interaction terms, so the log-odds difference in survival between female 1st and 2nd class passengers is (-1.68+0.061) and between female 1st and 3rd class passengers is (-1.62-2.536).

You would similarly need to calculate the difference between men and women within each class: 1st (3.998), 2nd (3.998+0.061), 3rd (3.998-2.536). Here we see that the difference in survival between men and women dropped off considerably in the 3rd class.
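These within-class gender gaps are just sums of the main effect and the relevant interaction term, so they are easy to tabulate (coefficients from the table above):

```python
# Gender gap in the log-odds of survival (women vs. men) within each class:
# the female main effect plus the relevant class interaction
beta_female = 3.9976
female_x_2nd = 0.0611
female_x_3rd = -2.5360

gaps = {
    "1st": beta_female,
    "2nd": beta_female + female_x_2nd,
    "3rd": beta_female + female_x_3rd,
}
print({cls: round(g, 3) for cls, g in gaps.items()})
# {'1st': 3.998, '2nd': 4.059, '3rd': 1.462}
```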

• Statistical Inference
• Remember our friend, statistical inference? Our estimates in the logistic regression model are based on a sample and yet we want to measure the values for the population. Therefore, we need some measure of how secure we are in these numbers, just like for OLS regression.
• The GLM method for estimating the coefficients handily also returns estimated standard errors for their sampling distribution.
• Like OLS regression, the sampling distribution of these coefficients is a t-distribution, so:

$$t = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}$$

For all intents and purposes, once we have these t-statistics, our tests of statistical significance are identical for logistic regression and OLS regression:
• If the sample is large enough, then the t-statistic is roughly normal ($|t| > 1.96$ means $p < 0.05$).
• Like OLS regression, our interest is in whether the parameter is distinguishable from zero.
• In this case, zero in the log-odds means one in the odds.
• We have to think about one-sided and two-sided tests in the same way.
• We can construct confidence intervals around our estimates.

• Let's try a couple of examples.
• Is the effect of fare paid "statistically distinguishable" from zero at the 5% level?

$$t = \frac{0.0004}{0.0021} = 0.20, \qquad p = 0.84$$

There is an 84% chance that we would observe a coefficient at least this large in magnitude in a sample of this size just by random chance, so we cannot distinguish the effect of fare from zero. Note that we found fare to be important earlier, before we controlled for passenger class. The basic point is that most of the important information on survival contained in the fare variable is better picked up by knowing each passenger's class.
• Construct a 95% confidence interval for the odds ratio of the age variable:

$$-0.0454 \pm 1.96 \times 0.0073 = (-0.0597,\ -0.0311) \;\Rightarrow\; \left(e^{-0.0597},\ e^{-0.0311}\right) = (0.942,\ 0.969)$$

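Both inference exercises can be scripted from the table's coefficient and standard error alone (a sketch; the table's own t-statistic, -6.2560, differs slightly because it uses unrounded inputs):

```python
import math

# t-statistic and 95% CI for the age coefficient, then the CI on the
# odds-ratio scale (coefficient and SE from the table above)
beta_age, se_age = -0.0454, 0.0073

t = beta_age / se_age
lo, hi = beta_age - 1.96 * se_age, beta_age + 1.96 * se_age
or_lo, or_hi = math.exp(lo), math.exp(hi)

print(round(t, 2))                       # -6.22
print(round(lo, 4), round(hi, 4))        # -0.0597 -0.0311
print(round(or_lo, 3), round(or_hi, 3))  # 0.942 0.969
```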
Aaron 2005-12-21