Fitting Generalized Linear Models for Multivariate Abundance Data

manyglm is used to fit generalized linear models to high-dimensional data, such as multivariate abundance data in ecology. This is the base model-fitting function - see plot.manyglm for assumption checking, and anova.manyglm or summary.manyglm for significance testing.

Usage

manyglm(formula, family="negative.binomial", composition=FALSE, data=NULL, subset=NULL,
    na.action=options("na.action"), K=1, theta.method = "PHI", model = FALSE,
    x = TRUE, y = TRUE, qr = TRUE, cor.type= "I", shrink.param=NULL,
    tol=sqrt(.Machine$double.eps), maxiter=25, maxiter2=10,
    show.coef=FALSE, show.fitted=FALSE, show.residuals=FALSE,
    show.warning=FALSE, offset, ... )

Arguments

formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under Details.
family: a description of the distribution function to be used in the in the model. The default is negative binomial regression (using a log link, with unknown overdispersion parameter), the following family functions are also accepted: binomial(), binomial(link="cloglog"), poisson(), Gamma(link="log"), which can also be specified using character strings as 'binomial', 'cloglog' and 'poisson', 'gamma' respectively. In future we hope to include other family functions as described in family.
data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which glm is called.
composition: FALSE (default) will model abundance, TRUE will model relative abundance, by adding a row effect to the model, and partition effects of environmental variables into main effects (alpha, total abundance/richness) and interactions with response (beta, relative abundance/turnover). See details.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.
K: number of trials in binomial regression. By default, K=1 for presence-absence data using logistic regression.
theta.method: the method used for the estimation of the overdisperson parameter theta, such that the mean-variance relationship is V=m+m^2/theta for the negative binomial family. Here offers three options
"PHI" = Maximum likelihood estimation with respect to phi (default)
"ML" = Maximum likelihood estimation with respect to theta, as in Lawless(1987), the default for the gamma family.
"Chi2" = Moment estimation using chi-square dampening on the log scale, as in Hilbe(2008). "MM" = Moment estimation for gamma family.
model, x, y, qr: logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the model matrix, the response, the QR decomposition of the model matrix) are returned.
cor.type: the structure imposed on the estimated correlation matrix under the fitted model. Can be "I"(default), "shrink", or "R". See Details. This parameter is merely stored in manyglm, and will be used as the default value for cor.type in subsequent functions for inference.
shrink.param: shrinkage parameter to be used if cor.type="shrink". If a numerical value is not supplied, it will be estimated from the data by cross validation-penalised normal likelihood as in Warton (2008). The parameter value is stored as an attribute of the manyglm object, and will be used in subsequent functions for inference.
tol: the tolerance used in estimation.
maxiter: maximum allowed iterations in the weighted least square estimation of beta. The default value is 25.
maxiter2: maximum allowed iterations in the internal ML estimations of negative binomial regression. The default value is 10.
show.coef, show.fitted, show.residuals, show.warning: logical. Whether to show model coefficients, fitted values, standardized pearson residuals, or operation warnings.
offset: this can be used to specify an _a priori_ known component to be included in the linear predictor during fitting. This should be 'NULL' or a numeric vector of length equal to NROW (i.e. number of observations) or a matrix of NROW times p (i.e. number of species).
...: further arguments passed to or from other methods.

Details

manyglm is used to calculate the parameter estimates of generalised linear models fitted to each of many variables simultaneously as in Warton et. al. (2012) and Wang et.al.(2012). Models for manyglm are specified symbolically. For details on how to specify a formula see the details section of lm and formula.

Generalised linear models are designed for non-normal data for which a distribution can be specified that offers a reasonable model for data, as specified using the argument family. The manyglm function currently handles count and binary data, and accepts either a character argument or a family argument for common choices of family. For binary (presence/absence) data, family=binomial() can be used for logistic regression (logit link, "logistic regression"), or the complementary log-log link can be used family=binomial("cloglog"), arguably a better choice for presence-absence data. Poisson regression family=poisson() can be used for counts that are not "overdispersed" (that is, if the variance is not larger than the mean), although for multivariate abundance data it has been shown that the negative binomial distribution (family="negative.binomial") is usually a better choice (Warton 2005). In both cases, a log-link is used. If another link function or family is desired, this can be specified using the manyany function, which accepts regular family arguments.

composition=TRUE allows relative abundance to be modelled rather than absolute abundance, which is useful for partitioning effects of environmental variables on alpha vs beta diversity, and is needed if there is variation in sampling intensity due to variables that haven't been measured. Data are manipulated into "long format", with column factor called cols and row variable called rows, then a model is fitted using main effects for predictors as in the provided formula, plus rows, cols and the interaction of predictors with cols. Inclusion of rows in the model accounts for variation in sampling intensity across rows, main effects for environmental variables capture their effects on total abundance/richness (alpha diversity), and their interaction with cols captures changes in relative abundance/turnover (beta diversity). Unfortunately, data are not efficiently stored in long format, so models are slower to fit using composition=TRUE.

In negative binomial regression, the overdispersion parameter (theta) is estimated separately for each variable from the data, as controlled by theta.method for negative binomial distributions. We iterate between updates of theta and generalised linear model updates for regression parameters, as many as maxiter2 times.

cor.type is the structure imposed on the estimated correlation matrix under the fitted model. Possible values are:
"I"(default) = independence is assumed (correlation matrix is the identity)
"shrink" = sample correlation matrix is shrunk towards I to improve its stability.
"R" = unstructured correlation matrix is used. (Only available when N>p.)

If cor.type=="shrink", a numerical value will be assigned to shrink.param either through the argument or by internal estimation. The working horse function for the internal estimation is ridgeParamEst, which is based on cross-validation (Warton 2008). The validation groups are chosen by random assignment, so some slight variation in the estimated values may be observed in repeat analyses. See ridgeParamEst for more details. The shrinkage parameter can be any value between 0 and 1 (0="I" and 1="R", values closer towards 0 indicate more shrinkage towards "I").

Value

manyglm returns an object inheriting from "manyglm",

"manylm" and "mglm".

The function summary (i.e. summary.manyglm) can be used to obtain or print a summary of the results and the function

anova (i.e. anova.manyglm) to produce an analysis of variance table, although note that these functions use resampling so they can take a while to fit.

The generic accessor functions coefficients,

fitted.values and residuals can be used to extract various useful features of the value returned by manyglm.

An object of class "manyglm" is a list containing at least the following components:

coefficients: a named matrix of coefficients.
var.coefficients: the estimated variances of each coefficient.
fitted.values: the matrix of fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.
linear.predictor: the linear fit on the scale of the linear predictor.
residuals: the matrix of working residuals, that is the Pearson residuals standardized by the leverage adjustment h obtained from the diagonal elements of the hat matrix H.
PIT.residuals: probability integral transform (PIT) residuals - for a model that fits well these should be approximately standard uniform values evenly scattered between 0 and 1. Their calculation involves some randomisation, so different fits will return slightly different values for PIT residuals.
sqrt.1_Hii: the matrix of scale terms used to standardize the Pearson reidusals.
var.estimator: the estimated variance of each observation, computed using the corresponding family function.
sqrt.weight: the matrix of square root of working weights, estimated for the corresponding family function.
theta: the estimated nuisance parameters accounting for overdispersion
two.loglike: two times the log likelihood.
deviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.
iter: the number of iterations of IWLS used.
data: a data frame storing the input data.
stderr.coefficients: the estimated standard error of each coefficient.
phi: the inverse of theta
tol: the tolerance used in estimations.
maxiter,maxiter2,family,theta.method,cor.type,formula: arguments supplied in the manyglm call.
aic: a vector returning Akaike's An Information Criterion for each response variable - minus twice the maximized log-likelihood plus twice the number of coefficients.
AICsum: the sum of the AIC's over all variables.
shrink.param: the shrink parameter to be used in subsequent inference.
call: the matched call.
terms: the terms object used.
rank: the numeric rank of the fitted linear model.
xlevels: (where relevant) a record of the levels of the factors used in fitting.
df.residual: the residual degrees of freedom.
x: if the argument x is TRUE, this is the design matrix used.
y: if the argument y is TRUE, this is the response variables used.
model: if the argument model is TRUE, this is the model.frame.
qr: if the argument qr is TRUE, this is the QR decomposition of the design matrix.
show.coef,show.fitted,show.residuals: arguments supplied in the manyglm call concerning what it presented in output.
offset: the offset data used (where applicable).

References

Lawless, J. F. (1987) Negative binomial and mixed Poisson regression, Canadian Journal of Statistics 15, 209-225.

Hilbe, J. M. (2008) Negative Binomial Regression, Cambridge University Press, Cambridge.

Warton D.I. (2005) Many zeros does not mean zero inflation: comparing the goodness of-fit of parametric models to multivariate abundance data, Environmetrics 16(3), 275-289.

Warton D.I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103, 340-349.

Warton D.I. (2011). Regularized sandwich estimators for analysis of high dimensional data using generalized estimating equations. Biometrics, 67(1), 116-123.

Warton D. I., Wright S., and Wang, Y. (2012). Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3(1), 89-101.

Wang Y., Neuman U., Wright S. and Warton D. I. (2012). mvabund: an R package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution. online 21 Feb 2012.

Author

Yi Wang, Ulrike Naumann and David Warton <David.Warton@unsw.edu.au>.

Examples

data(spider)
spiddat <- mvabund(spider$abund)
X <- as.matrix(spider$x)

#To fit a log-linear model assuming counts are poisson:
glm.spid <- manyglm(spiddat~X, family="poisson")
glm.spid
#> 
#> Call:  manyglm(formula = spiddat ~ X, family = "poisson") 
#> [1] "poisson"
#> 
#> Degrees of Freedom: 27 Total (i.e. Null); 21 Residual
#> 
#>                     Alopacce  Alopcune  Alopfabr  Arctlute  Arctperi  Auloalbi
#> 2*log-likelihood:   -100.596  -158.960   -79.169   -48.981   -26.339   -83.547
#> Residual Deviance:    37.034    96.894    39.471    29.146     5.781    37.204
#> AIC:                 114.596   172.960    93.169    62.981    40.339    97.547
#>                     Pardlugu  Pardmont  Pardnigr  Pardpull  Trocterr  Zoraspin
#> 2*log-likelihood:    -84.014  -276.075  -208.508  -164.826  -305.495  -136.706
#> Residual Deviance:    31.778   190.879   146.200    99.834   183.862    73.667
#> AIC:                  98.014   290.075   222.508   178.826   319.495   150.706

summary(glm.spid, resamp="residual")
#> 
#> Test statistics:
#>                wald value Pr(>wald)    
#> (Intercept)        17.941     0.001 ***
#> Xsoil.dry          20.142     0.001 ***
#> Xbare.sand          8.057     0.072 .  
#> Xfallen.leaves     11.414     0.008 ** 
#> Xmoss              13.831     0.002 ** 
#> Xherb.layer        17.671     0.001 ***
#> Xreflection        12.519     0.004 ** 
#> --- 
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#> 
#> Test statistic:  51.38, p-value: 0.001 
#> Arguments:
#>  Test statistics calculated assuming response assumed to be uncorrelated 
#>  P-value calculated using 999 resampling iterations via residual resampling (to account for correlation in testing).
#> 

#To fit a binomial regression model to presence/absence data:
pres.abs <- spiddat
pres.abs[pres.abs>0] = 1
Xdf <- data.frame(spider$x) #turn into a data frame to refer to variables in formula
glm.spid.bin <- manyglm(pres.abs~soil.dry+bare.sand+moss, data=Xdf, family="binomial")
glm.spid.bin
#> 
#> Call:  manyglm(formula = pres.abs ~ soil.dry + bare.sand + moss, family = "binomial",      data = Xdf) 
#> [1] "binomial(link=logit)"
#> 
#> Degrees of Freedom: 27 Total (i.e. Null); 24 Residual
#> 
#>                     Alopacce  Alopcune  Alopfabr  Arctlute  Arctperi  Auloalbi
#> 2*log-likelihood:   -20.102   -21.201   -13.191   -23.457    -0.004   -34.568 
#> Residual Deviance:   20.102    21.201    13.191    23.457     0.004    34.568 
#> AIC:                 28.102    29.201    21.191    31.457     8.004    42.568 
#>                     Pardlugu  Pardmont  Pardnigr  Pardpull  Trocterr  Zoraspin
#> 2*log-likelihood:   -13.839   -22.618   -31.817   -24.109    -8.669   -16.368 
#> Residual Deviance:   13.839    22.618    31.817    24.109     8.669    16.368 
#> AIC:                 21.839    30.618    39.817    32.109    16.669    24.368 
drop1(glm.spid.bin) #AICs for one-term deletions, suggests dropping bare.sand
#> Single term deletions
#> 
#> Model:
#> pres.abs ~ soil.dry + bare.sand + moss
#>           Df    AIC
#> <none>       325.94
#> soil.dry  12 333.27
#> bare.sand 12 318.90
#> moss      12 333.26

glm2.spid.bin <- manyglm(pres.abs~soil.dry+moss, data=Xdf, family="binomial")
drop1(glm2.spid.bin) #backward elimination suggests settling on this model.
#> Single term deletions
#> 
#> Model:
#> pres.abs ~ soil.dry + moss
#>          Df    AIC
#> <none>      318.90
#> soil.dry 12 367.47
#> moss     12 324.42