Weighted and segmented hurdle model

Pollen identified in lake sediments is used to determine the presence of a species in the general region. However, pollen may not reliably identify small populaitons (false absences) or it may be present when the tree species is absent from the region (due to long-distance transport). Pollen in surface sediments over a broad region may be used to asses these issues.

The following data set tsme.csv is from Herring et al. 2018. The dataset has the following columns:

Load and plot the data. Use a square-root transform on the pollen-percent axis.

tsme<-read.csv("tsme.csv")
plot(tsme$distance,sqrt(tsme$perc_tsme))

With count data (such as number of pollen grains in a sample) are often zero-inflated: there are an excess of zeros due to low probability of detection at low abundances. It is therefore useful to model the zero values separately from the positive values. This is the purpose of a hurdle model. See this page for more explanation.

The code below applies the hurdle command from the countreg package to fit both parts of the hurdle model. The "zero" model is a binomial model with a logit link (essentially a logistic regression). The model for positive values (zero-truncated model) is a poisson regression model. Because it requires count data, the counts of mountain hemlock pollen are standardized to a pollen sum of 400 grains. However, because there is unequal search effort for each sample (roughly 200 to 1000 grains identified per sample), it is possible to weight the samples. The weights used is the log(pollen sum) scaled to 0,1. The purpose of weighting is to give points where there was more search effort (more pollen identified) a greater importance in fitting the overall relationship.

install.packages("countreg", repos="http://R-Forge.R-project.org")
library(countreg)
# pollen count for a 400-grain pollen sum.
tsme_400 <- ceiling(tsme$perc_tsme*4) 
distance <- tsme$distance
psum <- tsme$psum
hur<-hurdle(tsme_400~distance,weights=log(psum)/    max(log(psum)),dist="poisson",zero.dist="binomial",link="logit",x=TRUE)
summary(hur)

The summary command shows the parameters of the two parts of the model. The "zero" part is modeling the presence/absence of pollen using a logit-link function. Read the introduction to the hurdle command.

plot(distance,sqrt(tsme_400))
lines(-30:170,sqrt(predict(hur,newdata=data.frame(distance=-30:170))))

Note the poor fit of the smooth poisson regression model. Another examination of the model fit is the "rootogram" obtained by running rootogram(hur). This shows the fit of the hurdle model (red line) expressed as the number of observations across the dependent variable (tsme_400). Descending from this line are bars that represent the observed number of observations. It shows the number of zeros is correctly predicted, but postive counts of tsme are generally underpredicted. Using the negative binomial distribution dist="negbin" improves the fit according to the AIC (AIC=8140 for the poisson model and 1774 for the negative binomial. But visually it not very different from dist="pois".

A segmented regression would fit the data much better. However, it is not possible to use segmented regression on a hurdle model (the object hur above). But it is possible to segment the tructated regression part of a hurdle model.

Therefore, I will fit the two parts of a hurdle model separately rather than using the hurdle function from the countreg package. So, therefore we should fit a glm model that is a poisson (or negative binomial) regression that is zero-truncated, then segment that model.

Here is code for maximum likelihood estimatation for fitting the "zero" part of the model that is more transparent than the functions embedded in the hurdle function.

# Function: hurdle.p1.zero.loglik
#
# Computes the negative log likelihood (Y ne 0), unweighted or weighted options of zero part of 
# hurdle model.
#
# ARGUMENTS:
#
# y0: A nx1 vector of binary values 0 or 1 [e.g., if y is count, y0 = (y>0)].
#   : e.g., y0 = (y > 0) # y0 = 1 if (y>0), else 0
# Z: A nxp matrix of covariates data. (Note: first column of 1 if INCLUDING intercept term.)
# w: A nx1 is weight variable for Pr(y=1).
# wt: FALSE default, fit hurdle model; else fit weighted hurdle model.
# beta: Vector of initial value to compute likelihood.
#
# VALUE:
#
# -loglik: Minus log likelihood for a given parameter beta and data (Z, y0) and optional weight w.
#
# Author: DVN
# Senturk D, Dalrymple DS, Mu Y, Nguyen DV (2014) Weighted hurdle regression method for joint 
# modeling of cardiovascular events likelihood and rate in the U.S. dialysis population. 
#
# 2.8.14
###################################################################################################
hurdle.p1.zero.loglik <- function(beta, Z, y0, w, wt = FALSE) {

    etaz = Z %*% beta # eta for zero model part

    if (wt == FALSE){
        p = 1/(1+ exp(-etaz)) # zeromodel: logit(p), ZIP
        #print('Unweighted zero part of hurdle model.')
    } else {
        p = w*( 1/(1+ exp(-etaz)) ) # zeromodel: logit(p), WZIP Â 
        #print('Weighted zero part of hurdle model.')
        #print(p)
    }

    # likelihood for zero part
    loglik = sum( y0*log(p) + (1-y0)*log(1-p) ) 
    #print(-loglik)
    #print(summary(p))

    -loglik # return minus loglik
    }
###################################################################################################
# Function: summary.optim
#
# Make summary table of model fit from optim(): Estimate, Std.Error, z.value, p.value.
#
# ARGUMENTS:
#
# optim.fit: Fitted model object from optim().
#
# VALUE: 
#
# ds.out: Table with 4 columns: cbind(Estimate, Std.Error, z.value, p.value).
#
# Author: DVN
# Senturk D, Dalrymple DS, Mu Y, Nguyen DV (2014) Weighted hurdle regression method for joint 
# modeling of cardiovascular events likelihood and rate in the U.S. dialysis population. 
#
# 2.8.14
###################################################################################################
summary.optim <- function ( optim.fit ){

    Estimate = optim.fit$par
    cov.mat = solve(optim.fit$hessian)
    Std.Error = sqrt( diag(cov.mat) )
    z.value = Estimate/Std.Error
    p.value = 2*( 1-pnorm(abs(z.value)) )

    ds.out = cbind(Estimate, Std.Error, z.value, p.value)
    colnames(ds.out) <- c('Estimate', 'Std Error', 'z value', 'p value')

    ds.out
}
###################################################################################################

Binomial/zero part. Get inital values using glm

beta0.z <- matrix( coef(glm( I(tsme$count_tsme>0) ~ tsme$distance, family="binomial")), ncol=1 )
fit.zero = optim( par=beta0.z, fn=hurdle.p1.zero.loglik, method='BFGS', hessian=TRUE,y0=(tsme$count_tsme>0), Z=cbind(1,as.matrix(tsme$distance,nrow=366,ncol=1)), w=log(tsme$psum)/max(log(tsme$psum)), wt=TRUE)

summary.optim(fit.zero)
#        Estimate   Std Error   z value      p value
#[1,]  2.43331811 0.308324851  7.892060 2.886580e-15
#[2,] -0.03980641 0.004868007 -8.177147 2.220446e-16

Visualize the fit of the logit-link model.

plot(tsme$distance,tsme$count_tsme>0)
lines(-30:170, plogis(2.43331811 - 0.03980641*(-30:170)))

Truncated Poisson part. Select only data where pollen is present. (225 sites)

tsme.pos<-tsme[tsme$count_tsme>0,]
# pollen count for a 400-grain pollen sum.
tsme_400 <- ceiling(tsme.pos$perc_tsme*4)  

dd<-data.frame(dist=tsme.pos$distance,tsme400=tsme_400)
fit.pois <- glm(tsme400~dist,data=dd,family=poisson(link=log))
fit.pois.seg <- segmented(fit.pois,seg.Z=~dist)
fit.pois.seg$psi
      Initial      Est.    St.Err
psi1.dist    -3.6 -9.350767 0.4488357

AIC(fit.pois.seg)
[1] 9768.729

Fit two breakpoints:

fit.pois.seg <- segmented(fit.pois,seg.Z=~dist,psi=c(-10,40))
fit.pois.seg$psi

      Initial      Est.    St.Err
psi1.dist     -10 -5.738623 0.1978762
psi2.dist      40 14.556636 1.3251065

Here is the fit of the two-breakpoint model:

summary(fit.pois.seg)

And output from the summary command:

Regression Model with Segmented Relationship(s)

Call: 
segmented.glm(obj = fit.pois, seg.Z = ~dist, psi = c(-10, 40))

Estimated Break-Point(s):
             Est. St.Err
psi1.dist -5.739  0.198
psi2.dist 14.557  1.325

Meaningful coefficients of the linear terms:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.197328   0.035012 119.882  < 2e-16 ***
dist         0.026460   0.003342   7.918 2.41e-15 ***
U1.dist     -0.165211   0.006759 -24.444       NA    
U2.dist      0.136568   0.006432  21.232       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)

Null     deviance: 10938.2  on 224  degrees of freedom
Residual deviance:  8222.7  on 219  degrees of freedom
AIC: 9252.6

Convergence attained in 29 iterations with relative change 0.001867473 

Using three breakpoints did indeed improve fit as determined by AIC (=9158), showing an even sharper drop at distance= -4 to -3. However, four breakpoints did not improve fit (AIC=9224).

plot(dd$dist,sqrt(dd$tsme400),ylim=c(0,18),xlab="distance from mountain hemlock zone",ylab="pollen count (square root transform)")
#note: need to use exp on the predicted values because the model was fit to log-transformed values.  Then need to square-root values for plotting purposes (easier to view variability)
lines(-30:150,sqrt(exp(predict(fit.pois.seg,newdata=data.frame(dist=-30:150)))))

Now calculate the predicted mean counts for the combined hurdle model. This is achieved by scaling the predicted values of a non-truncated poisson regression. The scaling is the ratio of the probability of a non-zero value from the "zero" binomial model divided by the probability of a non-zero value from the non-truncated poisson (segmented) model.

First, recompute the segmented regression using all data (including zeros). The breakpoints are very similar to the locations of breakpoints in the zero-truncated version.

dd<-data.frame(dist=tsme$distance,tsme400=ceiling(tsme$perc_tsme*4))
fit.pois.all <- glm(tsme400~dist,data=dd,family=poisson(link=log))
fit.pois.all.seg <- segmented(fit.pois.all,seg.Z=~dist,psi=c(-10,40))
fit.pois.all.seg$psi

num_c = probability of non-zero from the "zero" model...predited over a range of distance values (-23 to 188). mean_c = predicted values from the segmented poisson model. denom_c = probability of non-zero value from the segmented poisson model.

num_c <- plogis(2.43331811-0.03980641*(-23:188))
mean_c <- exp(predict(fit.pois.all.seg,newdata=data.frame(dist=-23:188)))
denom_c <- 1-ppois(0,mean_c)
hur.pred <- (num_c/denom_c)*mean_c

par(mar=c(5, 5, 4, 6) + 0.1)
plot(dd$dist,sqrt(dd$tsme400),ylab="Mountain hemlock pollen count (square-root)",xlab="Distance from mountain hemlock distribution")
lines(-23:188,sqrt(hur.pred),lwd=2,col=6)
# overlay binomial "zero" model
par(new=T)
plot(-23:188, num_c, axes=F, ylim=c(0,1), xlab="", ylab="",  type="l",lty=2, main="",lwd=2)
axis(4,ylim=c(0.1))
mtext(4,text="Probability of a non-zero value", line=3.5)