Pollen identified in lake sediments is used to infer the presence of a species in the surrounding region. However, pollen may fail to detect small populations reliably (false absences), or it may be present even when the tree species is absent from the region (because of long-distance transport). Pollen in surface sediments collected over a broad region may be used to assess these issues.
The following data set, tsme.csv, is from Herring et al. 2018. The columns used in the code below are distance (distance from the mountain hemlock zone), perc_tsme (percent mountain hemlock pollen), count_tsme (count of mountain hemlock pollen grains), and psum (the total pollen sum for each sample).
Load and plot the data. Use a square-root transform on the pollen-percent axis.
tsme<-read.csv("tsme.csv")
plot(tsme$distance,sqrt(tsme$perc_tsme))
Count data (such as the number of pollen grains in a sample) are often zero-inflated: there is an excess of zeros because the probability of detection is low at low abundances. It is therefore useful to model the zero values separately from the positive values. This is the purpose of a hurdle model. See this page for more explanation.
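As a quick check of how strongly zero-inflated these data are (a minimal sketch, using the perc_tsme column plotted above):
# proportion of surface samples with no mountain hemlock pollen at all
mean(tsme$perc_tsme == 0)
# tabulate absence versus presence of mountain hemlock pollen
table(tsme$perc_tsme > 0)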
The code below applies the hurdle command from the countreg package to fit both parts of the hurdle model. The "zero" model is a binomial model with a logit link (essentially a logistic regression). The model for the positive values (the zero-truncated model) is a Poisson regression. Because the function requires count data, the counts of mountain hemlock pollen are standardized to a pollen sum of 400 grains. However, because search effort differs among samples (roughly 200 to 1000 grains identified per sample), it is possible to weight the samples. The weights used are log(pollen sum) scaled to the interval 0 to 1. The purpose of weighting is to give points with more search effort (more pollen identified) greater influence on the fitted relationship.
install.packages("countreg", repos="http://R-Forge.R-project.org")
library(countreg)
# pollen count for a 400-grain pollen sum.
tsme_400 <- ceiling(tsme$perc_tsme*4)
distance <- tsme$distance
psum <- tsme$psum
hur <- hurdle(tsme_400 ~ distance, weights = log(psum)/max(log(psum)), dist = "poisson", zero.dist = "binomial", link = "logit", x = TRUE)
summary(hur)
The summary command shows the parameters of the two parts of the model. The "zero" part models the presence/absence of pollen using a logit link function. Read the introduction to the hurdle command.
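For example, the help page can be opened from within R once countreg is loaded:
?hurdle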
plot(distance,sqrt(tsme_400))
lines(-30:170,sqrt(predict(hur,newdata=data.frame(distance=-30:170))))
Note the poor fit of the smooth Poisson regression model. Another way to examine the model fit is the "rootogram", obtained by running rootogram(hur). This shows the fit of the hurdle model (red line) expressed as the number of observations across the dependent variable (tsme_400). Descending from this line are bars that represent the observed numbers of observations. It shows that the number of zeros is correctly predicted, but positive counts of tsme are generally underpredicted. Using the negative binomial distribution (dist="negbin") improves the fit according to the AIC (AIC = 8140 for the Poisson model and 1774 for the negative binomial), but visually the fit is not very different from dist="poisson".
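The negative-binomial refit itself is not shown above; a minimal sketch (assuming the same predictors and weights as before, with the object name hur.nb chosen here for illustration) is:
hur.nb <- hurdle(tsme_400 ~ distance, weights = log(psum)/max(log(psum)), dist = "negbin", zero.dist = "binomial", link = "logit", x = TRUE)
AIC(hur, hur.nb)   # compare the Poisson and negative-binomial hurdle fits
rootogram(hur.nb)  # rootogram of the refitted model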
A segmented regression would fit the data much better. However, it is not possible to apply segmented regression to a hurdle model (the object hur above). It is, however, possible to segment the truncated-regression part of a hurdle model. Therefore, I will fit the two parts of the hurdle model separately rather than using the hurdle function from the countreg package: fit a glm that is a zero-truncated Poisson (or negative binomial) regression, then segment that model.
Here is code for maximum-likelihood estimation of the "zero" part of the model that is more transparent than the functions embedded in the hurdle function.
# Function: hurdle.p1.zero.loglik
#
# Computes the negative log likelihood of the zero part (y = 0 vs. y > 0) of a hurdle model,
# with unweighted or weighted options.
#
# ARGUMENTS:
#
# y0: A nx1 vector of binary values (0 or 1); e.g., if y is a count, y0 = (y > 0),
#     i.e., y0 = 1 if y > 0, else 0.
# Z: A nxp matrix of covariate data. (Note: include a first column of 1s for an intercept term.)
# w: A nx1 vector of weights for Pr(y = 1).
# wt: FALSE (default) fits the unweighted hurdle model; TRUE fits the weighted hurdle model.
# beta: Vector of parameter values (e.g., initial values) at which to compute the likelihood.
#
# VALUE:
#
# -loglik: Minus log likelihood for a given parameter beta and data (Z, y0) and optional weight w.
#
# Author: DVN
# Senturk D, Dalrymple DS, Mu Y, Nguyen DV (2014) Weighted hurdle regression method for joint
# modeling of cardiovascular events likelihood and rate in the U.S. dialysis population.
#
# 2.8.14
###################################################################################################
hurdle.p1.zero.loglik <- function(beta, Z, y0, w, wt = FALSE) {
  etaz = Z %*% beta  # eta for zero model part
  if (wt == FALSE) {
    p = 1/(1 + exp(-etaz))        # zero model: logit(p), ZIP
    # print('Unweighted zero part of hurdle model.')
  } else {
    p = w*( 1/(1 + exp(-etaz)) )  # zero model: logit(p), WZIP
    # print('Weighted zero part of hurdle model.')
    # print(p)
  }
  # likelihood for zero part
  loglik = sum( y0*log(p) + (1 - y0)*log(1 - p) )
  # print(-loglik)
  # print(summary(p))
  -loglik  # return minus loglik
}
###################################################################################################
# Function: summary.optim
#
# Make summary table of model fit from optim(): Estimate, Std.Error, z.value, p.value.
#
# ARGUMENTS:
#
# optim.fit: Fitted model object from optim().
#
# VALUE:
#
# ds.out: Table with 4 columns: cbind(Estimate, Std.Error, z.value, p.value).
#
# Author: DVN
# Senturk D, Dalrymple DS, Mu Y, Nguyen DV (2014) Weighted hurdle regression method for joint
# modeling of cardiovascular events likelihood and rate in the U.S. dialysis population.
#
# 2.8.14
###################################################################################################
summary.optim <- function(optim.fit) {
  Estimate = optim.fit$par
  cov.mat = solve(optim.fit$hessian)
  Std.Error = sqrt( diag(cov.mat) )
  z.value = Estimate/Std.Error
  p.value = 2*( 1 - pnorm(abs(z.value)) )
  ds.out = cbind(Estimate, Std.Error, z.value, p.value)
  colnames(ds.out) <- c('Estimate', 'Std Error', 'z value', 'p value')
  ds.out
}
###################################################################################################
Binomial/zero part. Get initial values using glm.
beta0.z <- matrix( coef(glm( I(tsme$count_tsme>0) ~ tsme$distance, family="binomial")), ncol=1 )
fit.zero = optim(par = beta0.z, fn = hurdle.p1.zero.loglik, method = 'BFGS', hessian = TRUE,
                 y0 = (tsme$count_tsme > 0), Z = cbind(1, tsme$distance),
                 w = log(tsme$psum)/max(log(tsme$psum)), wt = TRUE)
summary.optim(fit.zero)
# Estimate Std Error z value p value
#[1,] 2.43331811 0.308324851 7.892060 2.886580e-15
#[2,] -0.03980641 0.004868007 -8.177147 2.220446e-16
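As a sanity check (a sketch; the object name fit.zero.unwt is illustrative), setting wt = FALSE should approximately recover the unweighted glm coefficients used as starting values:
# w is only used when wt = TRUE, so it can be omitted here
fit.zero.unwt = optim(par = beta0.z, fn = hurdle.p1.zero.loglik, method = 'BFGS', hessian = TRUE,
                      y0 = (tsme$count_tsme > 0), Z = cbind(1, tsme$distance), wt = FALSE)
summary.optim(fit.zero.unwt)  # should be close to the coefficients of the binomial glm above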
Visualize the fit of the logit-link model.
plot(tsme$distance,tsme$count_tsme>0)
lines(-30:170, plogis(2.43331811 - 0.03980641*(-30:170)))
Truncated Poisson part. Select only the data where pollen is present (225 sites).
tsme.pos<-tsme[tsme$count_tsme>0,]
# pollen count for a 400-grain pollen sum.
tsme_400 <- ceiling(tsme.pos$perc_tsme*4)
dd<-data.frame(dist=tsme.pos$distance,tsme400=tsme_400)
library(segmented)
fit.pois <- glm(tsme400 ~ dist, data = dd, family = poisson(link = log))
fit.pois.seg <- segmented(fit.pois, seg.Z = ~dist)
fit.pois.seg$psi
Initial Est. St.Err
psi1.dist -3.6 -9.350767 0.4488357
AIC(fit.pois.seg)
[1] 9768.729
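For comparison (a minimal sketch), the AIC of the unsegmented Poisson glm provides a baseline for judging the improvement from segmenting:
AIC(fit.pois)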
Fit two breakpoints:
fit.pois.seg <- segmented(fit.pois,seg.Z=~dist,psi=c(-10,40))
fit.pois.seg$psi
Initial Est. St.Err
psi1.dist -10 -5.738623 0.1978762
psi2.dist 40 14.556636 1.3251065
Here is the fit of the two-breakpoint model:
summary(fit.pois.seg)
And output from the summary command:
Regression Model with Segmented Relationship(s)
Call:
segmented.glm(obj = fit.pois, seg.Z = ~dist, psi = c(-10, 40))
Estimated Break-Point(s):
Est. St.Err
psi1.dist -5.739 0.198
psi2.dist 14.557 1.325
Meaningful coefficients of the linear terms:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.197328 0.035012 119.882 < 2e-16 ***
dist 0.026460 0.003342 7.918 2.41e-15 ***
U1.dist -0.165211 0.006759 -24.444 NA
U2.dist 0.136568 0.006432 21.232 NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 10938.2 on 224 degrees of freedom
Residual deviance: 8222.7 on 219 degrees of freedom
AIC: 9252.6
Convergence attained in 29 iterations with relative change 0.001867473
Using three breakpoints did indeed improve the fit as determined by AIC (9158), showing an even sharper drop at distance = -4 to -3. However, four breakpoints did not improve the fit (AIC = 9224).
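The three-breakpoint fit is not shown; a minimal sketch is below (the starting values in psi, and the object name fit.pois.seg3, are assumptions chosen for illustration):
fit.pois.seg3 <- segmented(fit.pois, seg.Z = ~dist, psi = c(-10, 0, 40))
fit.pois.seg3$psi
AIC(fit.pois.seg3)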
plot(dd$dist,sqrt(dd$tsme400),ylim=c(0,18),xlab="distance from mountain hemlock zone",ylab="pollen count (square root transform)")
# note: exp() is needed because predict() returns values on the log-link scale; the square root is then taken for plotting (easier to view the variability)
lines(-30:150,sqrt(exp(predict(fit.pois.seg,newdata=data.frame(dist=-30:150)))))
Now calculate the predicted mean counts for the combined hurdle model. This is achieved by scaling the predicted values of a non-truncated Poisson regression. The scaling factor is the probability of a non-zero value from the "zero" binomial model divided by the probability of a non-zero value under the non-truncated (segmented) Poisson model.
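In symbols (a restatement of the scaling coded below), the predicted mean count at distance x is p(x) * mu(x) / (1 - exp(-mu(x))), where p(x) is the predicted probability of a non-zero count from the binomial part, mu(x) is the predicted mean from the non-truncated segmented Poisson model, and 1 - exp(-mu(x)) is the Poisson probability of a non-zero count.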
First, recompute the segmented regression using all data (including zeros). The breakpoints are very similar to the locations of breakpoints in the zero-truncated version.
dd<-data.frame(dist=tsme$distance,tsme400=ceiling(tsme$perc_tsme*4))
fit.pois.all <- glm(tsme400~dist,data=dd,family=poisson(link=log))
fit.pois.all.seg <- segmented(fit.pois.all,seg.Z=~dist,psi=c(-10,40))
fit.pois.all.seg$psi
num_c = the probability of a non-zero value from the "zero" model, predicted over a range of distance values (-23 to 188). mean_c = the predicted values from the segmented Poisson model. denom_c = the probability of a non-zero value under the segmented Poisson model.
num_c <- plogis(2.43331811-0.03980641*(-23:188))
mean_c <- exp(predict(fit.pois.all.seg,newdata=data.frame(dist=-23:188)))
denom_c <- 1-ppois(0,mean_c)
hur.pred <- (num_c/denom_c)*mean_c
par(mar=c(5, 5, 4, 6) + 0.1)
plot(dd$dist,sqrt(dd$tsme400),ylab="Mountain hemlock pollen count (square-root)",xlab="Distance from mountain hemlock distribution")
lines(-23:188,sqrt(hur.pred),lwd=2,col=6)
# overlay the binomial "zero" model on a second (right-hand) axis
par(new = TRUE)
plot(-23:188, num_c, axes = FALSE, ylim = c(0, 1), xlab = "", ylab = "", type = "l", lty = 2, main = "", lwd = 2)
axis(4)
mtext(text = "Probability of a non-zero value", side = 4, line = 3.5)
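Optionally, a legend can distinguish the two curves (a sketch; the placement is arbitrary):
legend("topright", legend = c("Hurdle-model mean (square-root scale)", "Probability of a non-zero value"),
       lty = c(1, 2), col = c(6, 1), lwd = 2, bty = "n")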