Title: | Fits Poisson-Sum-of-Gammas GLMs, Tweedie GLMs, and Delta Log-Normal Models |
---|---|
Description: | Fits models to catch and effort data. Single-species models are 1) delta log-normal, 2) Tweedie, or 3) Poisson-gamma (G)LMs. |
Authors: | Scott D. Foster [aut, cre] |
Maintainer: | Scott D. Foster <[email protected]> |
License: | GPL |
Version: | 0.29.2 |
Built: | 2025-03-13 06:01:53 UTC |
Source: | https://github.com/cran/fishMod |
fishMod is a package that contains some very specific functions for a) fitting extensions of generalised linear models (GLMs) that allow for non-negative observations with exact zeros, and b) targetted fisheries catch and effort data using a (quite complex) mixture model. The package is intended to be used in reasonable anger on real data sets but it does have limitations. Primarily, these arise from not nesting the models within the hierarchy of standard R functions. This means that there are currently no generic methods, such as ‘plot’ and ‘summary’ for the objects returned by the functions. Instead, the analyst must fish around (excuse the pun) for the bits and peices required for the task at hand, for example residuals and fitted values for diagnostic plots.
The important functions in the fishMod namespace can be arranged into two groups: 1) dPoisGam
, pPoisGam
, rPoisGam
, dPoisGamDerivs
, dTweedie
, pTweedie
, rTweedie
; and 2) deltaLN
, pgm
, tglm
, tglm.fit
, simReg
. These correspond to functions for 1) utility functions for Tweedie densities and Poisson-gamma representations of Tweedies; and 2) extensions of GLMs for non-negative data.
Note that there is an extended version of this package (available from the author but not from CRAN) that also contains functions for multi-species catch and effort data that attempt to account for targetting.
Scott D. Foster
Fits a compound model that assumes a Delta Log-Normal distribution. The mean of the log-normal process and the mean of the binary process are allowed to change with covariates.
deltaLN( ln.form, binary.form, data, residuals=TRUE)
deltaLN( ln.form, binary.form, data, residuals=TRUE)
ln.form |
an object of class "formula" (or one that can be coerced to that class). This is a symbolic representation of the model for the log-normal variable. Note that offset terms (if any) should be included in this part of the model. |
binary.form |
an object of class "formula" (or one that can be coerced to that class). This is a symbolic representation of the model for the binary variable and should not contain an outcome (e.g. ~1+var1+var2). |
data |
a data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. |
residuals |
boolean indicating if the quantile residuals should be calculated. Default is TRUE indicating residuals are to be calculated. |
The observed random variables y_i are assumed to arise from a process that has a non-zero probability that y_i is greater than zero; further, the distribution of y_i conditional on y_i>0 follows a log-normal distribution. This modelling framework models the mean of the conditional distribution and the probability of obtaining a non-zero.
The means of each component of the model are specified in ln.form and binary.form for the log-normal and the zero/non-zero model components respectively. The binary.form formula should not contain an outcome. The binary part of the model is done using a logistic link funciton.
If residuals are requested then two types are returned: Pearson residuals and randomised quantile residuals, described in general by Dunn and Smyth (1996).
pgm returns an object of class "DeltaLNmod" , a list with the following elements
|
|
coef |
the estiamted coefficients from the fitting process. A list with an element for the binary and log-normal parts of the model as well as an element for the standard deviation of the log-normal. |
logl |
the maximum log likelihood (found at the estimates). |
AIC |
an Information Criteria. |
BIC |
Bayesian Information Criteria. |
fitted |
fitted values of the delta log-normal variable. |
fitted.var |
variance of the fitted delta log-normal variable. |
residuals |
a 2-column matrix whose first column contains the randomised quantile residuals and whose second column contains the Pearson residuals. |
n |
the number of observations used to fit the model. |
ncovars |
the number of parameters in the combined model. |
nzero |
the number of non-zero elements. |
lnMod |
the lm object obtained from fitting the log-normal (non-zero) part of the model. |
binMod |
the glm object obtained from fitting the zero / non-zero part of the model. |
Scott D. Foster
Aitchison J. (1955) On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50 901-908.
Dunn P. K and Smyth G. K (1996) Randomized Quantile Residuals. Journal of Computational and Graphical Statistics 5: 236-244.
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
Density, derivatives, distribution function, and random generation for the Poisson-Gamma distribution. The distribution is parameterised by the Poisson mean, the Gamma mean and the Gamma dispersion parameter. Derivatives are with respect to these three parameters.
dPoisGam( y, lambda, mu.Z, alpha, LOG=TRUE) dPoisGamDerivs( y, lambda, mu.Z, alpha, do.checks=TRUE) pPoisGam( q, lambda, mu.Z, alpha) rPoisGam( n, lambda, mu.Z, alpha)
dPoisGam( y, lambda, mu.Z, alpha, LOG=TRUE) dPoisGamDerivs( y, lambda, mu.Z, alpha, do.checks=TRUE) pPoisGam( q, lambda, mu.Z, alpha) rPoisGam( n, lambda, mu.Z, alpha)
y , q
|
vector of quantiles |
n |
number of random draws |
lambda |
scalar or vector (length matches y or equal to n) of Poisson means |
mu.Z |
scalar or vector (length matches y or equal to n) of Gamma means |
alpha |
scalar or vector (length matches y or equal to n) of Gamma dispersions |
LOG |
indication of return scale. If TRUE (default) then the density is returned on the log scale. |
do.checks |
boolean indicating if checks on arguments should be performed. If TRUE (default) then checks will be performed. |
The observed random variables y_i are assumed to arise from the process: y_i=sum(z_{i1}+z_{i2}+...+z_{in_i}) where n_i is a Poisson variable with mean lambda and z_{ij} is a Gamma variable with mean mu.Z and variance mu.Z^2 / alpha.
The density calculation is based on the series summation method described in Dunn and Smyth (2005). However, the parameterisation used here is different and follows that described in Section 2 of Smyth (1996). The details of density calculation are subsequently different from Dunn and Smyth (2005).
The derivatives are calculated in a similar manner to the density. The derivatives are for the log-density.
dPoisGam returns a numeric vector containing the (log-)densities. |
|
dPoisGamDeriv returns a matrix with three columns , one for each of the distributional parameters.
|
|
pPoisGam returns a numeric vector containing the values of the distribution function. |
|
rPoisGam returns a numeric vector containing the random variables. |
Scott D. Foster
Smyth, G. K. (1996) Regression analysis of quantity data with exact zeros. Proceedings of the Second Australia–Japan Workshop on Stochastic Models in Engineering, Technology and Management. Technology Management Centre, University of Queensland, pp. 572-580.
Dunn P. K. and Smyth G. K. (2005) Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing 15: 267-280.
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
my.seq <- seq( from=0, to=20, length=200) par( mfrow=c( 1,2)) plot( my.seq, dPoisGam( y=my.seq, lambda=8, mu.Z=1.2, alpha=0.6, LOG=FALSE), type='l', xlab="Variable", ylab="Density") plot( my.seq, pPoisGam( q=my.seq, lambda=8, mu.Z=1.2, alpha=0.6), type='l', xlab="Variable", ylab="Distribution")
my.seq <- seq( from=0, to=20, length=200) par( mfrow=c( 1,2)) plot( my.seq, dPoisGam( y=my.seq, lambda=8, mu.Z=1.2, alpha=0.6, LOG=FALSE), type='l', xlab="Variable", ylab="Density") plot( my.seq, pPoisGam( q=my.seq, lambda=8, mu.Z=1.2, alpha=0.6), type='l', xlab="Variable", ylab="Distribution")
Density, derivatives, distribution function, and random generation for the Tweedie distribution. The distribution is parameterised by the mean, dispersion parameter and the power parameter so that the distribution's variance is given by disperion * mean^power.
dTweedie( y, mu, phi, p, LOG=TRUE) pTweedie( q, mu, phi, p) rTweedie( n, mu, phi, p)
dTweedie( y, mu, phi, p, LOG=TRUE) pTweedie( q, mu, phi, p) rTweedie( n, mu, phi, p)
y , q
|
vector of quantiles |
n |
number of random draws |
mu |
scalar or vector (length matches y or equal to n) of means |
phi |
scalar or vector (length matches y or equal to n) of dispersion parameters |
p |
scalar or vector (length matches y or equal to n) of power parameters |
LOG |
indication of return scale. If TRUE (default) then the density is returned on the log scale. |
The density calculation is based on the series summation method described in Dunn and Smyth (2005). These functions are really just wrappers to the equivalent functions dPoisGam, pPoisGam and rPoisGam. The functions are equivalent up to parameterisation of the distribution.
The distribution function is calculated by adaptive quadrature (a call to the integrate function). However, the point discontinuity at zero is handled explicitly by assuming the convention that the distribution function evaluated at zero is equal to the density at zero.
dTweedie returns a numeric vector containing the (log-)densities. |
|
pTweedie returns a numeric vector containing the values of the distribution function. |
|
rTweedie returns a numeric vector containing the random variables. |
Scott D. Foster
Dunn P. K. and Smyth G. K. (2005) Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing 15: 267-280.
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
my.seq <- seq( from=0, to=20, length=200) par( mfrow=c( 1,2)) plot( my.seq, dTweedie( y=my.seq, mu=5, phi=2.1, p=1.6, LOG=FALSE), type='l', xlab="Variable", ylab="Density") plot( my.seq, pTweedie( q=my.seq, mu=5, phi=2.1, p=1.6), type='l', xlab="Variable", ylab="Distribution")
my.seq <- seq( from=0, to=20, length=200) par( mfrow=c( 1,2)) plot( my.seq, dTweedie( y=my.seq, mu=5, phi=2.1, p=1.6, LOG=FALSE), type='l', xlab="Variable", ylab="Density") plot( my.seq, pTweedie( q=my.seq, mu=5, phi=2.1, p=1.6), type='l', xlab="Variable", ylab="Distribution")
Fits a model that assumes a Poisson-Gamma distribution. The mean of the unobserved Poisson variable and the mean of the unobserved Gamma random variables are allowed to change with covariates.
pgm( p.form, g.form, data, wts=NULL, alpha=NULL, inits=NULL, vcov=TRUE, residuals=TRUE, trace=1)
pgm( p.form, g.form, data, wts=NULL, alpha=NULL, inits=NULL, vcov=TRUE, residuals=TRUE, trace=1)
p.form |
an object of class "formula" (or one that can be coerced to that class). This is a symbolic representation of the model for the unobserved Poisson variable. Note that offset terms (if any) should be included in this part of the model. |
g.form |
an object of class "formula" (or one that can be coerced to that class). This is a symbolic representation of the model for the unobserved Gamma variables. Note that any outcome and offset terms included in this part of the model are ignored. |
wts |
relative contribution to the log-likelihood contributions. Must be the same length as the number of observations. |
data |
a data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. |
alpha |
a positive scalar whose presence indicates that a profile likelihood should be used for estimating all parameters except alpha (set to supllied value). |
inits |
numeric vector containing the starting values for the coefficient estimates. The ordering of the vector is:coefficients for p.form, coefficients for g.form, and log( alpha), where alpha is the dispersion parameter for the unobserved gamma variables. Defaults to zero for each coefficient. |
vcov |
boolean indicating if the variance-covariance matrix of the coefficient estimates should be calculated or not. Default is TRUE indicating that it will be calculated. |
residuals |
boolean indicating if the quantile residuals should be calculated. Default is TRUE indicating residuals are to be calculated. |
trace |
non-negative integer value indicating how often during estimation the updated coefficients should be calculated. Default is 1, indicating printing at every iteration. A value of 0 indicates that no printing will occur. |
The observed random variables y_i are assumed to arise from the process: y_i=sum(z_{i1}+z_{i2}+...+z_{in_i}) where n_i is a Poisson variable with mean lambda and z_{ij} is a Gamma variable with mean mu.Z and variance mu.Z^2 / alpha.
The means of each of the unobserved random variables are modelled using a log-link and the formula given in p.form and g.form.
The distribution is equivalent, up to parameterisation and model interpretation, to the Tweedie GLM model described in Smyth (1996) and Dunn and Smyth (2005). Their models contain model the overall mean only and not the separate processes. Also the Poisson-Gamma distribution is a subset of the more general Tweedie distribution.
The residuals are quantile residuals, described in general by Dunn and Smyth (1996). Calculation of the quantile residuals ignores any specified weights.
pgm returns an object of class "pgm" , a list with the following elements
|
|
coef |
the estiamted coefficients from the fitting process. |
logl |
the maximum log likelihood (found at the estimates). |
score |
the score of the log-likelihood (at the estimates). |
vcov |
the variance-covariance matrix of the estimates. If vcov is FALSE then this will be NULL. |
conv |
the convergence message from the call to nlminb. |
message |
the convergence message from nlminb. |
niter |
the number of iterations taken to obtain convergence. |
evals |
the number of times the log-likelihood and its derivative were evaluated. |
call |
the call used to invoke the function. |
fitted |
matrix with three columns containing:the total modelled mean, the Poisson mean, and the gamma mean. |
residuals |
the vector of quantile residuals. |
Scott D. Foster
Smyth, G. K. (1996) Regression analysis of quantity data with exact zeros. Proceedings of the Second Australia–Japan Workshop on Stochastic Models in Engineering, Technology and Management. Technology Management Centre, University of Queensland, pp. 572-580.
Dunn P. K and Smyth G. K (1996) Randomized Quantile Residuals. Journal of Computational and Graphical Statistics 5: 236-244.
Dunn P. K. and Smyth G. K. (2005) Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing 15: 267-280.
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
my.coef <- c(0.6, 1.2, 0, -0.3, 0, -0.5, 0.85) sim.dat <- simReg( n=250, lambda.tau=my.coef[1:3], mu.Z.tau=my.coef[4:6], alpha=my.coef[7]) fm <- pgm( p.form=y~1+x1+x2, g.form=~1+x1+x2, data=sim.dat) tmp <- matrix( c( my.coef, fm$coef, sqrt( diag( fm$vcov))), ncol=3) tmp[nrow( tmp),1] <- log( tmp[nrow( tmp),1]) #putting values on same scale colnames( tmp) <- c("actual","estiated","SE") rownames( tmp) <- names( fm$coef) print( tmp)
my.coef <- c(0.6, 1.2, 0, -0.3, 0, -0.5, 0.85) sim.dat <- simReg( n=250, lambda.tau=my.coef[1:3], mu.Z.tau=my.coef[4:6], alpha=my.coef[7]) fm <- pgm( p.form=y~1+x1+x2, g.form=~1+x1+x2, data=sim.dat) tmp <- matrix( c( my.coef, fm$coef, sqrt( diag( fm$vcov))), ncol=3) tmp[nrow( tmp),1] <- log( tmp[nrow( tmp),1]) #putting values on same scale colnames( tmp) <- c("actual","estiated","SE") rownames( tmp) <- names( fm$coef) print( tmp)
simReg
is used to simulate data from a combined model for the Poisson and Gamma components of a Poisson-Gamma distribution. This formuation allows the distribution to vary with covariates.
simReg( n, lambda.tau, mu.Z.tau, alpha, offset1=NULL, X=NULL)
simReg( n, lambda.tau, mu.Z.tau, alpha, offset1=NULL, X=NULL)
n |
number of observations |
lambda.tau |
vector of coefficients for Poisson part of model. |
mu.Z.tau |
vector of coefficients for Gamma component of model. Its length must be equal to the length of lambda.tau. |
alpha |
scalar for Gamma dispersion. |
offset1 |
vector of offset values (for Poisson part of the process). If NULL (default) a vector of zeros is created and used. |
X |
a design matrix with appropriate elements for simulation. If NULL (default) then one will be created. |
The observed random variables y_i are assumed to arise from the process: y_i=sum(z_{i1}+z_{i2}+...+z_{in_i}) where n_i is a Poisson variable with mean lambda and z_{ij} is a Gamma variable with mean mu.Z and variance mu.Z^2 / alpha.
The Poisson mean is given by lambda=exp( X %*% lambda.tau) where X is a suitable design matrix whose first column is full of 1s and whose remain columns are random draws from a standard normal.
The Gamma mean is given by mu.Z=exp( X %*% mu.Z.tau) where X is identical to before.
A data frame containing the random draws , the offset (if not NULL) , and the covariates. The data frame has an attribute called "coef" that lists the values used for the simulation.
|
Scott D. Foster
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
sim.dat <- simReg( n=250, lambda.tau=c(0.6, 1.2, 0), mu.Z.tau=c(-0.3, 0, -0.5), alpha=0.85, X=NULL) print( head( sim.dat))
sim.dat <- simReg( n=250, lambda.tau=c(0.6, 1.2, 0), mu.Z.tau=c(-0.3, 0, -0.5), alpha=0.85, X=NULL) print( head( sim.dat))
Fits a log-link Tweedie GLM model using maximum likelihood for all parameters, if wanted. The dispersion and power parameters (phi and p) can be set if required.
tglm( mean.form, data, wts=NULL, phi=NULL, p=NULL, inits=NULL, vcov=TRUE, residuals=TRUE, trace=1, iter.max=150) tglm.fit( x, y, wts=NULL, offset=rep( 0, length( y)), inits=rnorm( ncol( x)), phi=NULL, p=NULL, vcov=TRUE, residuals=TRUE, trace=1, iter.max=150)
tglm( mean.form, data, wts=NULL, phi=NULL, p=NULL, inits=NULL, vcov=TRUE, residuals=TRUE, trace=1, iter.max=150) tglm.fit( x, y, wts=NULL, offset=rep( 0, length( y)), inits=rnorm( ncol( x)), phi=NULL, p=NULL, vcov=TRUE, residuals=TRUE, trace=1, iter.max=150)
mean.form |
an object of class "formula" (or one that can be coerced to that class). This is a symbolic representation of the model for the mean of the observations. Offset terms can be included in this formula. |
data |
a data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. |
wts |
relative contribution to the log-likelihood contributions. Must be the same length as the number of observations. |
phi |
a positive scalar whose presence indicates that a profile likelihood should be used for estimating all parameters except the dispersion phi (set to the supllied value). |
p |
a value in the interval [1,2] whose presence indicates that a profile likelihood should be used for estimating all parameters except the power parameter p (set to the supplied value). |
inits |
(tglm) numeric vector containing the starting values for the coefficient estimates. The ordering of the vector is:coefficients for mean.form, phi and p. Defaults to zero for each the mean coefficients, 1 for phi, and 1.6 for p. (tglm.fit) numeric vector that must be specified (no default). It has length equal to the number of covariates plus 2. |
vcov |
boolean indicating if the variance-covariance matrix of the coefficient estimates should be calculated or not. Default is TRUE indicating that it will be calculated. |
residuals |
boolean indicating if the quantile residuals should be calculated. Default is TRUE indicating residuals are to be calculated. |
trace |
non-negative integer value indicating how often during estimation the updated coefficients should be calculated. Default is 1, indicating printing at every iteration. A value of 0 indicates that no printing will occur. |
iter.max |
the maximum number of iterations allowed. |
x , y , offset
|
x is a design matrix, y is the vector of outcomes and offset are the offset values. None of these values are checked. The design matrix has dimension pXn where p is the number of covariates and n is the number of observations. |
The means of each of the unobserved random variables are modelled using a log-link and the formula given in mean.form. This function implements the model described in Smyth (1996) and Dunn and Smyth (2005) but does so by maximising the complete likelihood (ala Dunn and Smyth, 2005). The model specification is restricted to the log link.
The residuals are quantile residuals, described in general by Dunn and Smyth (1996).
tglm.fit is the workhorse function: it is not normally called directly but can be more efficient where the response vector and design matrix have already been calculated. Completely analgous to the function glm.fit.
tglm returns an object of class "tglm", a list with the following elements
coef |
the estiamted coefficients from the fitting process. |
logl |
the maximum log likelihood (found at the estimates). |
score |
the score of the log-likelihood (at the estimates). |
vcov |
the variance-covariance matrix of the estimates. If vcov is FALSE then this will be NULL. |
conv |
the convergence message from the call to nlminb. |
message |
the convergence message from nlminb. |
niter |
the number of iterations taken to obtain convergence. |
evals |
the number of times the log-likelihood and its derivative were evaluated. |
call |
the call used to invoke the function. |
fitted |
matrix with one column containing the modelled mean. |
residuals |
the vector of quantile residuals. |
Scott D. Foster
Smyth, G. K. (1996) Regression analysis of quantity data with exact zeros. Proceedings of the Second Australia–Japan Workshop on Stochastic Models in Engineering, Technology and Management. Technology Management Centre, University of Queensland, pp. 572-580.
Dunn P. K and Smyth G. K (1996) Randomized Quantile Residuals. Journal of Computational and Graphical Statistics 5: 236-244.
Dunn P. K. and Smyth G. K. (2005) Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing 15: 267-280.
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552
print( "Data generated using Poisson-sum-of-gammas model and fitted using a Tweedie GLM.") print( "A great fit is not expected -- especially for the first covariate") my.coef <- c(0.6, 1.2, 0, -0.3, 0, -0.5, 0.85) sim.dat <- simReg( n=250, lambda.tau=my.coef[1:3], mu.Z.tau=my.coef[4:6], alpha=my.coef[7]) fm <- tglm( mean.form=y~1+x1+x2, data=sim.dat) tmp <- matrix( c( my.coef[1:3] + my.coef[4:6], NA, (2+0.85)/(1+0.85), fm$coef), ncol=2) colnames( tmp) <- c("actual","estimated") rownames( tmp) <- c( names( fm$coef)[1:3], "phi", "p") print( tmp)
print( "Data generated using Poisson-sum-of-gammas model and fitted using a Tweedie GLM.") print( "A great fit is not expected -- especially for the first covariate") my.coef <- c(0.6, 1.2, 0, -0.3, 0, -0.5, 0.85) sim.dat <- simReg( n=250, lambda.tau=my.coef[1:3], mu.Z.tau=my.coef[4:6], alpha=my.coef[7]) fm <- tglm( mean.form=y~1+x1+x2, data=sim.dat) tmp <- matrix( c( my.coef[1:3] + my.coef[4:6], NA, (2+0.85)/(1+0.85), fm$coef), ncol=2) colnames( tmp) <- c("actual","estimated") rownames( tmp) <- c( names( fm$coef)[1:3], "phi", "p") print( tmp)
TigerFlathead is a a data set arising from a south-east Australia fish survey, see Bax and Williams (1999).
TigerFlathead
TigerFlathead
The Longitude of the trawl.
The Latitude of the trawl.
Bathymetry (depth) of the trawl, taken from longitude/latitude of trawls.
The distance along the 100m depth contour. The origin is arbitrary, but is northwards of all trawls.
The swept area of the shot, giving a measure of effort.
The number of tiger flathead caught.
The total biomass of tiger flathead caught.
Bax N, Williams A (eds) (1999) Habitat and fisheries production in the South East Fishery. Final report to FRDC Project 94/040, CSIRO Marine Research, Hobart, Tasmania, Australia
Foster, S.D. and Bravington, M.V. (2013) A Poisson-Gamma Model for Analysis of Ecological Non-Negative Continuous Data. Journal of Environmental and Ecological Statistics 20: 533-552