Logistic regression

- When we have a dichotomous variable as the dependent variable, OLS regression won't work. The **Linear Probability Model** can be fit, but:
  - The relationship is non-linear because the probabilities are bounded between 0 and 1.
  - The error terms are heteroskedastic because the dependent variable is produced by a binomial process where the variance depends upon the underlying probability.
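The boundedness problem can be shown with a quick numeric sketch (the data here are made up for illustration, not the Titanic): fitting an ordinary least-squares line to a 0/1 outcome can yield "probabilities" outside [0, 1].

```python
import numpy as np

# Toy 0/1 outcome that rises with x (illustrative data only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least squares fit: y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)  # polyfit returns highest degree first

# Predictions at the edges of (and beyond) the observed range
print(b0 + b1 * 0.0)   # below 0: an impossible "probability"
print(b0 + b1 * 10.0)  # above 1: also impossible
```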

- As we have learned, we can correct these problems with a **generalized linear model**.
  - We know that the error distribution is given by a binomial distribution, so we only need to choose a link function. We know the identity link won't work because we have the non-linearity problem.
- There are several possible link functions, but the best one (or at least the easiest to interpret) is the **logit** function.
  - The logit is the log of the odds: $\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$
  - This function spreads the probabilities over the entire real number line.
- So, our **logistic regression** model looks like:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

  - How do we interpret the $\beta$'s? Well, first let's relate this equation back to odds rather than log-odds by exponentiating both sides:

$$\frac{p}{1-p} = e^{\beta_0}e^{\beta_1 x_1}\cdots e^{\beta_k x_k}$$

  - How does a one-unit change in $x_j$ affect the predicted odds?
    - It increases the odds by a multiplicative factor of $e^{\beta_j}$.
  - By exponentiating the $\beta$'s we get **odds ratios** - how much the odds increase multiplicatively with a one-unit change in the independent variable. For categorical variables, these can be interpreted directly as odds ratios between groups. For continuous variables, they are the odds ratios between individuals who are identical on the other variables but differ by one unit on the variable of interest. (show an example of each)
  - Therefore, the $\beta$'s themselves are **log-odds ratios**. Negative values indicate a negative relationship between the probability of "success" and the independent variable; positive values indicate a positive relationship.
    - When you exponentiate them, the dividing line between a positive and negative relationship is 1, not 0.
- Let's take our Titanic example. We have the equation:

| Variable | Coefficient (t-statistic) |
|---|---|
| Constant | -1.44 (-16.46) |
| **Gender** | |
| Women | 2.42 (17.82) |
| Men | (ref.) |

  - What is the intercept giving us here? The log-odds of survival for the reference group (men). To convert this into odds, take the exponential: $e^{-1.44} \approx 0.24$

The log-odds of a woman's survival is $-1.44 + 2.42 = 0.98$. To get this as an odds, exponentiate: $e^{0.98} \approx 2.66$.
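The arithmetic above can be checked directly (coefficients from the table: constant -1.44, women 2.42):

```python
import math

b_const = -1.44   # log-odds of survival for men (reference group)
b_women = 2.42    # log-odds ratio for women vs. men

odds_men = math.exp(b_const)              # men's odds of survival
odds_ratio = math.exp(b_women)            # women's odds relative to men's
odds_women = math.exp(b_const + b_women)  # women's odds of survival

# Multiplying the reference odds by the odds ratio gives the same answer
print(odds_men, odds_ratio, odds_women)
```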

The coefficients returned from a logistic regression model are **log-odds ratios**. They tell us how the log-odds of a "success" change with a one-unit change in the independent variable.
Increasing the log-odds of a success means increasing the probability, and decreasing the log-odds of a success means decreasing the probability. Therefore, the sign of the log-odds ratio indicates the direction of the relationship: + means a positive relationship between the independent variable and the likelihood of a success, and - means a negative relationship.
In order to get an intuitive sense of how much things are changing, we need to get the exponential of the log-odds ratio, which gives us the **odds ratio** itself.
Let's return to our example from yesterday looking at survival of the Titanic by gender:
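A useful fact for reproducing the gender table: with a single binary predictor, the fitted logistic regression coefficients are exactly the observed log-odds and log-odds ratio, so they can be recovered by hand from a 2×2 cross-tabulation. The counts below are illustrative (chosen so the implied coefficients land near the table's values), not the actual Titanic tallies.

```python
import math

# Illustrative (survived, died) counts by gender -- picked so that
# men's odds come out near 0.24 and the odds ratio near 11
men_survived, men_died = 95, 400
women_survived, women_died = 230, 86

# With one binary predictor, the logistic MLE equals the sample log-odds
b_const = math.log(men_survived / men_died)                 # intercept: men's log-odds
b_women = math.log(women_survived / women_died) - b_const   # log-odds ratio for women

print(b_const, b_women)                       # roughly -1.44 and 2.42
print(math.exp(b_const), math.exp(b_women))   # odds ~0.24, odds ratio ~11.3
```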

The interpretation is similar with continuous variables. Let's take the case of predicting survival by fare paid.

For the first person, the odds of survival are simply given by the exponential of the intercept term, which in this case leads to odds of 0.414. For the second person, the odds of survival increase by a factor of 1.012 because this person paid a pound more than the first. For the third person, the odds increase by a further factor of 1.012 because this person paid a pound more than the second. The exponential of the coefficient thus gives the expected odds ratio between two individuals who differ by one unit on the given independent variable.
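The compounding in the fare example can be written out directly (the base odds of 0.414 and per-pound factor of 1.012 come from the text; the fare values below are illustrative):

```python
# Odds of survival as a function of fare paid, per the fitted model:
# odds(fare) = exp(b0) * exp(b1)**fare = 0.414 * 1.012**fare
base_odds = 0.414   # exp(intercept): odds at fare = 0
per_pound = 1.012   # exp(coefficient): odds ratio for one more pound

for fare in (0, 1, 2, 50):
    odds = base_odds * per_pound ** fare
    print(fare, round(odds, 3))

# Any two passengers who differ by exactly one pound have odds ratio 1.012
```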

We can think of interactions in a similar way: they tell us how much the odds ratio related to one variable differs between groups. Let's now interact gender and fare in our Titanic example.

Multivariate models: each odds ratio gives the change in the odds for a one-unit change in that variable, holding all the other variables constant.

- Let's look at a full example from the Titanic data.

| Variable | Coefficient | SE | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | -0.0514 | 0.2139 | -0.2404 | 0.8101 |
| **Passenger class** | | | | |
| First Class | (ref.) | | | |
| Second Class | -1.6830 | 0.3286 | -5.1224 | 0.0000 |
| Third Class | -1.6235 | 0.2906 | -5.5871 | 0.0000 |
| **Gender** | | | | |
| Male | (ref.) | | | |
| Female | 3.9976 | 0.5031 | 7.9452 | 0.0000 |
| **Gender × Class** | | | | |
| Female × 2nd Class | 0.0611 | 0.6379 | 0.0958 | 0.9237 |
| Female × 3rd Class | -2.5360 | 0.5495 | -4.6149 | 0.0000 |
| Age (mean-centered) | -0.0454 | 0.0073 | -6.2560 | 0.0000 |
| Fare (mean-centered) | 0.0004 | 0.0021 | 0.2003 | 0.8412 |
| # Siblings/Spouses | -0.3349 | 0.1010 | -3.3163 | 0.0009 |

- What is the intercept telling us? The reference group is a first-class man at the mean age who paid the mean fare and had no siblings or spouse on board. This man had a log-odds of surviving of -0.0514. This translates into odds of: $e^{-0.0514} \approx 0.95$. This translates into a probability of surviving of: $\frac{0.95}{1 + 0.95} \approx 0.49$.
- Let's plug in the following values: a second-class 35-year-old woman with a husband aboard who paid 15 pounds for her fare. What are the predicted odds of a person with these values surviving? How about the probability?
- Let's change age to 25. How does this change the odds? How does it change the probability?
- Be careful with the interaction term. Because it is in the model, the main effects of gender and passenger class only represent the effects for the reference group. So the passenger class coefficients give you the log-odds difference in survival between 1st and 2nd class (-1.68) and between 1st and 3rd class (-1.62) for **male** passengers. To get the effects for female passengers, we need to add on the female interaction terms, so the log-odds difference in survival between female 1st and 2nd class passengers is (-1.68 + 0.061) and between female 1st and 3rd class passengers is (-1.62 - 2.536).
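A sketch of the prediction exercise above, using the coefficients from the table. The notes don't state the means used to center age and fare, so the means below are assumed values for illustration only.

```python
import math

# Coefficients from the full Titanic model above
b = {
    "intercept": -0.0514,
    "second_class": -1.6830,
    "female": 3.9976,
    "female_x_2nd": 0.0611,
    "age_centered": -0.0454,
    "fare_centered": 0.0004,
    "sibsp": -0.3349,
}

# ASSUMED centering means (not given in the notes) -- illustrative only
MEAN_AGE, MEAN_FARE = 30.0, 33.0

# Second-class 35-year-old woman, husband aboard (1 sibling/spouse), fare 15
log_odds = (b["intercept"] + b["second_class"] + b["female"] + b["female_x_2nd"]
            + b["age_centered"] * (35 - MEAN_AGE)
            + b["fare_centered"] * (15 - MEAN_FARE)
            + b["sibsp"] * 1)

odds = math.exp(log_odds)
prob = odds / (1 + odds)
print(round(odds, 2), round(prob, 2))
```

Under these assumed means the predicted odds come out near 5.8 (probability around 0.85); with the course's actual centering means the numbers will differ somewhat.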

You would similarly need to calculate the difference between men and women within each class: 1st (3.998), 2nd (3.998 + 0.061), 3rd (3.998 - 2.536). Here we see that the survival advantage of women over men dropped off considerably in 3rd class.
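The within-class gender gaps above, converted to odds ratios (coefficients taken from the table):

```python
import math

female = 3.9976
female_x_2nd = 0.0611
female_x_3rd = -2.5360

# Log-odds difference (women minus men) within each passenger class
gap = {
    "1st": female,
    "2nd": female + female_x_2nd,
    "3rd": female + female_x_3rd,
}

for cls, log_or in gap.items():
    # exponentiate to get the within-class odds ratio for women vs. men
    print(cls, round(log_or, 3), round(math.exp(log_or), 1))
```

The odds ratio for women vs. men is above 50 in 1st and 2nd class but drops to roughly 4 in 3rd class.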

- Statistical Inference
- Remember our friend, statistical inference? Our estimates in the logistic regression model are based on a sample, and yet we want to estimate the values for the population. Therefore, we need some measure of how secure we are in these numbers, just as in OLS regression.
- The GLM method for estimating the coefficients handily also returns estimated standard errors for their sampling distribution.
- As in OLS regression, the sampling distribution of these coefficients is a t-distribution, so:
  - If the sample is large enough, the t-statistic is roughly normal (so $|t| > 1.96$ means $p < 0.05$ at the two-sided 5% level).
- As in OLS regression, our interest is in whether the parameter is distinguishable from zero.
  - In this case, zero in the log-odds means one in the odds.
- We have to think about one-sided and two-sided tests in the same way.
- We can construct confidence intervals around our estimates.

- Let's try a couple of examples.
- Is the effect of fare paid "statistically distinguishable" from zero at the 5% level?
- Construct a 95% confidence interval for the odds ratio of the age variable
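Both exercises can be worked through numerically (coefficients and standard errors from the table above; the 1.96 cutoff assumes the large-sample normal approximation):

```python
import math

# Fare: coefficient 0.0004, SE 0.0021
t_fare = 0.0004 / 0.0021
significant = abs(t_fare) > 1.96
print(round(t_fare, 3), significant)   # tiny t-statistic: not distinguishable from 0

# Age: 95% CI on the log-odds scale, then exponentiate the endpoints
b_age, se_age = -0.0454, 0.0073
lo = b_age - 1.96 * se_age
hi = b_age + 1.96 * se_age
ci_or = (math.exp(lo), math.exp(hi))
print(tuple(round(v, 3) for v in ci_or))  # CI for the odds ratio lies entirely below 1
```

Because the age interval excludes 1 on the odds-ratio scale (equivalently, excludes 0 on the log-odds scale), the age effect is statistically distinguishable from zero; the fare effect is not.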
