--- title: "Regression and Other Stories: Student" author: "Andrew Gelman, Jennifer Hill, Aki Vehtari" date: "`r format(Sys.Date())`" output: html_document: theme: readable toc: true toc_depth: 2 toc_float: true code_download: true --- Models for regression coefficients. See Chapter 12 in Regression and Other Stories. ------------- ```{r setup, include=FALSE} knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, comment=NA) # switch this to TRUE to save figures in separate files savefigs <- FALSE ``` #### Load packages ```{r } library("rprojroot") root<-has_file(".ROS-Examples-root")$make_fix_file() library("rstanarm") library("rstantools") library("loo") library("ggplot2") library("bayesplot") theme_set(bayesplot::theme_default(base_family = "sans")) ``` Set random seed for reproducibility ```{r } SEED <- 2132 ``` ```{r eval=FALSE, include=FALSE} # grayscale figures for the book if (savefigs) color_scheme_set(scheme = "gray") ``` Here we consider regression with more than handful of predictors. We demonstrate the usefulness of standardization of predictors and models for regression coefficients. We demonstrate with data from Portuguese students and their final period math grade. We predict the third period math grade given student's school, student's sex, student's age, student's home address type, family size, parent's cohabitation status, mother's education, father's education, home to school travel time, weekly study tie, number of past class failures, extra educational support, extra paid classes within the course subject, extra-curricular activities, attended nursery school, wants to take higher education, Internet access at home, with a romantic relationship, quality of family relationships, free time after school, going out with friends, workday alcohol consumption, weekend alcohol consumption, current health status, and number of school absences. #### Load data ```{r } # Use the merged data with students having both math and Portuguese language grades data <- read.csv(root("Student/data","student-merged.csv")) head(data) grades <- c("G1mat","G2mat","G3mat","G1por","G2por","G3por") predictors <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","traveltime","studytime","failures","schoolsup","famsup","paid","activities", "nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc","Walc","health","absences") p <- length(predictors) ``` #### Data for predicting the final math grade Pick columns to make models for third period mathematics grade G3mat and select only students with non-zero math grade (use G3por to make a model for third grade Portuguese language grade). ```{r } data_G3mat <- subset(data, subset=G3mat>0, select=c("G3mat",predictors)) n <- nrow(data_G3mat) ``` ## Default weak prior on original coefficients #### Fit a regression model with default weak priors Dot (.) in the formula means all other columns execpt what is already on the left of ~. ```{r } fit0 <- stan_glm(G3mat ~ ., data = data_G3mat, seed = SEED, refresh=0) ``` #### Plot posterior marginals of coefficients ```{r } p0 <- mcmc_areas(as.matrix(fit0), pars=vars(-'(Intercept)',-sigma), prob_outer=0.95, area_method = "scaled height") + xlim(c(-3.2,2.4)) p0 <- p0 + scale_y_discrete(limits = rev(levels(p0$data$parameter))) p0 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit0_mcmc_areas.pdf"), p0, height=5, width=5, colormodel="gray") ``` The above figure shows that without standardization of predictors, it looks like there is a different amount of uncertainty on the relevance of the predictors. For example, it looks like `absences` has really small relevance and high certainty. Standardize all predictors for easier comparison of relevances as discussed in Section 12.1. ```{r } datastd_G3mat <- data_G3mat datastd_G3mat[,predictors] <-scale(data_G3mat[,predictors]) ``` ## Default weak prior on coefficients #### Fit a regression model with default weak priors ```{r } fit1 <- stan_glm(G3mat ~ ., data = datastd_G3mat, seed = SEED, refresh=0) ``` #### Plot posterior marginals of coefficients ```{r } p1 <- mcmc_areas(as.matrix(fit1), pars=vars(-'(Intercept)',-sigma), prob_outer=0.95, area_method = "scaled height") + xlim(c(-1.2,0.8)) p1 <- p1 + scale_y_discrete(limits = rev(levels(p1$data$parameter))) p1 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit1_mcmc_areas.pdf"), p1, height=5, width=5, colormodel="gray") ``` The above figure shows that after all predictors have been standardized to have equal standard deviation, the uncertainties on the relevances are similar. For example, it is now easier to see that `absences` has relatively high relevance compared to other predictors in the model. #### Compare Bayesian $R^2$ and LOO $R^2$ ```{r } round(median(bayes_R2(fit1)), 2) round(median(loo_R2(fit1)), 2) ``` #### Compute LOO log score ```{r } (loo1 <- loo(fit1)) ``` Medians of Bayesian $R^2$ and LOO $R^2$ are quite different, and p_loo is approximately 26, which indicates that the model is fitting to all predictors. If the predictors have been standardized to have standard deviation 1 and we give the regression coefficients independent normal priors with mean 0 and standard deviation 2.5, this implies that the prior standard deviation of the modeled predictive means is $2.5 \sqrt{26} = 12.7$. The default prior for $\sigma$ is an exponential distribution, scaled to have mean equal to data standard deviation which in this case is approximately 3.3 which is much less than 12.7. We can simulate from these prior distributions and examine what is the corresponding prior distribution for explained variance $R^2$. #### Bayesian $R^2$ distribution ```{r } ggplot() + geom_histogram(aes(x=bayes_R2(fit1)), breaks=seq(0,1,length.out=100)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Bayesian R^2", y="") ``` #### Prior predictive checking by looking at the prior on Bayesian $R^2$
```{r } ppR2<-numeric() for (i in 1:4000) { sigma2 <- rexp(1,rate=0.3)^2; muvar <- var(as.matrix(datastd_G3mat[,2:27]) %*% rnorm(26)*2.5) ppR2[i] <- muvar/(muvar+sigma2) } ggplot()+geom_histogram(aes(x=ppR2), breaks=seq(0,1,length.out=50)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Prior predictive Bayesian R^2",y="") pp1 <- mcmc_hist(data.frame(Prior=ppR2,Posterior=bayes_R2(fit1)), breaks=seq(0,1,length.out=100), facet_args = list(nrow = 2)) + facet_text(size = 13) + scale_x_continuous(limits = c(0,1), expand = c(0, 0), labels = c("0","0.25","0.5","0.75","1")) + theme(axis.line.y = element_blank()) + xlab("Bayesian R^2") pp1 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit1_R2.pdf"), pp1, height=3, width=3, colormodel="gray") ``` The above figure shows that with the default prior on regression coefficients and $\sigma$, the implied prior distribution for $R^2$ is strongly favoring larger values and thus is favoring overfitted models. The priors often considered as weakly informative for regression coefficients turn out to be in multiple predictor case highly informative for the explained variance. ## Weakly informative prior scaled with the number of covariates #### Prior predictive checking by looking at the prior on Bayesian $R^2$ ```{r } ppR2<-numeric() for (i in 1:4000) { sigma2 <- 0.7*rexp(1, rate=1/sd(datastd_G3mat$G3mat))^2 muvar <- var(as.matrix(datastd_G3mat[,2:27]) %*% rnorm(26, sd=sd(datastd_G3mat$G3mat)/sqrt(26)*sqrt(0.3))) ppR2[i] <- muvar/(muvar+sigma2) } ggplot()+geom_histogram(aes(x=ppR2), breaks=seq(0,1,length.out=50)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Prior predictive Bayesian R^2",y="") ``` #### Fit a regression model with a weakly informative prior scaled with the number of covariates ```{r } fit2 <- stan_glm(G3mat ~ ., data = datastd_G3mat, seed = SEED, prior=normal(scale=sd(datastd_G3mat$G3mat)/sqrt(26)*sqrt(0.3), autoscale=FALSE), refresh=0) ``` When we compare Bayesian $R^2$ and LOO $R^2$, we see the difference is much smaller and LOO $R^2$ has improved slightly. #### Compare Bayesian $R^2$ and LOO $R^2$ ```{r } round(median(loo_R2(fit2)), 2) round(median(bayes_R2(fit2)), 2) ``` #### Bayesian $R^2$ distribution ```{r } ggplot()+geom_histogram(aes(x=bayes_R2(fit2)), breaks=seq(0,1,length.out=100)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Bayesian R^2",y="") pp2 <- mcmc_hist(data.frame(Prior=ppR2,Posterior=bayes_R2(fit2)), breaks=seq(0,1,length.out=100), facet_args = list(nrow = 2)) + facet_text(size = 13) + scale_x_continuous(limits = c(0,1), expand = c(0, 0), labels = c("0","0.25","0.5","0.75","1")) + theme(axis.line.y = element_blank()) + xlab("Bayesian R^2") pp2 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit2_R2.pdf"), pp2, height=3, width=3, colormodel="gray") ``` Comparison of the LOO log score reveals that the new model has better leave-one-out prediction. #### Compute LOO log score ```{r } (loo2 <- loo(fit2)) ``` #### Compare models ```{r } loo_compare(loo1,loo2) ``` #### Plot posterior marginals of coefficients ```{r } p2 <- mcmc_areas(as.matrix(fit2), pars=vars(-'(Intercept)',-sigma), prob_outer=0.95, area_method = "scaled height") + xlim(c(-1.2,0.8)) p2 <- p2 + scale_y_discrete(limits = rev(levels(p2$data$parameter))) p2 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit2_mcmc_areas.pdf"), p2, height=5, width=5, colormodel="gray") ``` The above figure shows the posterior distributions of coefficients, which are slightly more concentrated than for the previous model. ## Weakly informative prior assuming only some covariates are relevant We next use regularized horseshoe pruior, assuming that the expected number of relevant predictors is near $p_0=6$ and the prior scale for the relevant predictors is chosen as in the previous model but using $p_0$ for scaling. We can then simulate from this prior and examine the corresponding prior for $R^2$ ```{r } p0 <- 6 slab_scale <- sd(datastd_G3mat$G3mat)/sqrt(p0)*sqrt(0.3) # ppR2<-numeric() for (i in 1:4000) { sigma2 <- 0.7*rexp(1,rate=1/sd(datastd_G3mat$G3mat))^2; global_scale <- p0 / (p - p0) * sqrt(sigma2) / sqrt(n) z <- rnorm(p) lambda <- rcauchy(p) tau <- rcauchy(1, scale = global_scale) caux <- 1/rgamma(1, shape=0.5, rate=0.5) c <- slab_scale * sqrt(caux) lambda_tilde <- sqrt(c^2 * lambda^2 / (c^2 + tau^2*lambda^2)) beta <- rnorm(p) * lambda_tilde * tau muvar <- var(as.matrix(datastd_G3mat[,2:27]) %*% beta) ppR2[i] <- muvar/(muvar+sigma2) } ggplot()+geom_histogram(aes(x=ppR2), breaks=seq(0,1,length.out=50)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Prior predictive Bayesian R^2",y="") ``` The above figure shows that the regularized horseshoe prior with sensible parameters implies a more cautious prior on explained variance $R^2$ than is implicitly assumed by the default wide prior. The horseshoe prior favors simpler models, but is quite flat around most $R^2$ values. #### Fit a regression model with regularized horseshoe prior
```{r } p0 <- 6 slab_scale <- sd(datastd_G3mat$G3mat)/sqrt(p0)*sqrt(0.3) # global scale without sigma, as the scaling by sigma happens in stan_glm global_scale <- p0 / (p - p0) / sqrt(n) fit3 <- stan_glm(G3mat ~ ., data = datastd_G3mat, seed = SEED, prior=hs(global_scale=global_scale, slab_scale=slab_scale), refresh=0) ``` #### Compare Bayesian $R^2$ and LOO $R^2$ ```{r } round(median(loo_R2(fit3)), 2) round(median(bayes_R2(fit3)), 2) ``` When we compare models using LOO log score, the new model is better than the default prior model, but there is no difference compared the model with normal prior scaled with the predictors. It is common that the data do not have strong information about how many predictors are relevant and then different types of priors can produce similar predictive accuracies. #### Compute LOO log score ```{r } (loo3 <- loo(fit3)) ``` #### Compare models ```{r } loo_compare(loo1,loo3) loo_compare(loo2,loo3) ``` #### Bayesian $R^2$ distribution ```{r } ggplot()+geom_histogram(aes(x=bayes_R2(fit3)), breaks=seq(0,1,length.out=100)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Bayesian R^2",y="") pp3 <- mcmc_hist(data.frame(Prior=ppR2,Posterior=bayes_R2(fit3)), breaks=seq(0,1,length.out=100), facet_args = list(nrow = 2)) + facet_text(size = 13) + scale_x_continuous(limits = c(0,1), expand = c(0, 0), labels = c("0","0.25","0.5","0.75","1")) + theme(axis.line.y = element_blank()) + xlab("Bayesian R^2") pp3 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit3_R2.pdf"), pp3, height=3, width=3, colormodel="gray") ``` #### Plot posterior marginals of coefficients ```{r } p3 <- mcmc_areas(as.matrix(fit3), pars=vars(-'(Intercept)',-sigma), prob_outer=0.95, area_method = "scaled height") + xlim(c(-1.2,0.8)) p3 <- p3 + scale_y_discrete(limits = rev(levels(p3$data$parameter))) p3 ``` ```{r eval=FALSE, include=FALSE} if (savefigs) ggsave(root("Student/figs","student_fit3_mcmc_areas.pdf"), p3, height=5, width=5, colormodel="gray") ``` The above figure shows that the regularized horseshoe prior has the benefit of shrinking the posterior for many regression coefficients more tightly towards 0, making it easier to see the most relevant predictors. Failures, school support, going out, and the number of absences appear to be the most relevant predictors. ## Subset of covariates Fit a regression model with subset of covariates and default weak prior on coefficients ```{r } fit4 <- stan_glm(G3mat ~ failures + schoolsup + goout + absences, data = datastd_G3mat, seed = SEED, refresh=0) ``` When we compare Bayesian $R^2$ and LOO $R^2$, we see the difference is small and there is less overfit than when using all predictors with wide prior. LOO $R^2$ is just slightly smaller than for models with all predictors and better priors. The prediction performance can not improved much by adding more predictors. Note that by observing more students it might be possible to learn regression coefficients for other predictors with sufficient small uncertainty so that predictions for new students could be improved. #### Compare Bayesian $R^2$ and LOO $R^2$ ```{r } round(median(loo_R2(fit4)), 2) round(median(bayes_R2(fit4)), 2) ``` #### Bayesian $R^2$ distribution ```{r } ggplot()+geom_histogram(aes(x=bayes_R2(fit4)), breaks=seq(0,1,length.out=100)) + xlim(c(0,1)) + scale_y_continuous(breaks=NULL) + labs(x="Bayesian R^2",y="") ``` #### Compute LOO log score ```{r } (loo4 <- loo(fit4)) ``` #### Compare models ```{r } loo_compare(loo3,loo4) loo_compare(loo2,loo4) loo_compare(loo1,loo4) ``` #### Plot posterior marginals of coefficients ```{r } p4 <- mcmc_areas(as.matrix(fit4), pars=vars(-'(Intercept)',-sigma), prob_outer=0.99, area_method = "scaled height") + xlim(c(-1.3,0.1)) p4 <- p4 + scale_y_discrete(limits = rev(levels(p4$data$parameter))) p4 ```