---
title: "Regression and Other Stories: Scalability"
author: "Andrew Gelman, Jennifer Hill, Aki Vehtari"
date: "`r format(Sys.Date())`"
output:
  html_document:
    theme: readable
    toc: true
    toc_depth: 2
    toc_float: true
    code_download: true
---
Demonstrate how the computation time scales with bigger data. See
Chapter 22 in Regression and Other Stories.

-------------


```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, comment=NA)
```

#### Load packages

```{r }
library(arm)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(tictoc)
```

## Linear regression n=100K, p=100

#### Create fake data with n=100\,000 and p=100

```{r }
SEED <- 1656
set.seed(SEED)
n <- 1e5
k <- 100
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 10
y <- a + b*x + sigma*rnorm(n)
fake <- data.frame(x, xn, y)
```

#### Fit using lm

```{r }
tic()
fit1 <- lm(y ~ ., data=fake)
toc()
#display(fit1)
```

#### Fit using stan_glm and MCMC

stan_glm is fast for linear regression with n>k and small or
moderate k (using OLS trick)

```{r }
tic()
```
```{r results='hide'}
fit2 <- stan_glm(y ~ ., data=fake, mean_PPD=FALSE,
                 refresh=500, seed=SEED, cores=1)
```
```{r }
toc()
#print(fit2, digits=2)
```

#### Fit using stan_glm and optimization

Using optimization with normal approximation and Pareto smoothed
importance resampling provides coefficient standard errors, but
also diagnostic whether normal approximation at the mode is
appropriate.

```{r }
tic()
fit3 <- stan_glm(y ~ ., data=fake, algorithm='optimizing', init=0) 
toc()
#print(fit3, digits=2)
```

## Logistic regression n=10K, p=100

#### Create fake data with 10\,000 observations and p=100

```{r }
SEED <- 1655
set.seed(SEED)
n <- 1e4
k <- 100
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 1
y <- as.numeric(a + b*x + sigma*rnorm(n) > 0)
fake <- data.frame(x, xn, y)
```

#### Fit using glm

```{r }
tic()
fit1 <- glm(y ~ ., data=fake, family=binomial())
toc()
#display(fit1)
```

#### Fit using stan_glm and MCMC

```{r }
tic()
fit2 <- stan_glm(y ~ ., data=fake, family=binomial(), mean_PPD=FALSE,
                 init_r=0.1, seed=SEED)
toc()
#print(fit2, digits=2)
```

#### Fit using stan_glm and optimization

Using optimization with normal approximation and Pareto smoothed
importance resampling provides coefficient standard errors, but
also diagnostic whether normal approximation at the mode is
appropriate.

```{r }
tic()
fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                algorithm='optimizing', init=0, draws = 16000, keep_every = 4) 
toc()
#print(fit3, digits=2)
```

## Logistic regression n=100K, p=100

#### Create fake data with 100\,000 observations and p=100

```{r }
SEED <- 1655
set.seed(SEED)
n <- 1e5
k <- 100
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 1
y <- as.numeric(a + b*x + sigma*rnorm(n) > 0)
fake <- data.frame(x, xn, y)
```

#### Fit using glm

```{r }
tic()
fit1 <- glm(y ~ ., data=fake, family=binomial())
toc()
#display(fit1)
```

#### Fit using stan_glm and optimization

Using optimization with normal approximation and Pareto smoothed
importance resampling provides coefficient standard errors, but
also diagnostic whether normal approximation at the mode is
appropriate.

```{r }
tic()
fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                 algorithm='optimizing', init=0) 
toc()
summary(fit3, digits=2)
```