| Title: | GAMLSS Infrastructure for Flexible Distributional Regression |
|---|---|
| Description: | Next generation infrastructure for generalized additive models for location, scale, and shape (GAMLSS) and distributional regression more generally. The package provides a fresh reimplementation of the classic 'gamlss' package while being more modular and facilitating the creation of advanced terms and models. |
| Authors: | Nikolaus Umlauf [aut, cre] (ORCID: <https://orcid.org/0000-0003-2160-9803>), Mikis Stasinopoulos [aut] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Achim Zeileis [aut] (ORCID: <https://orcid.org/0000-0003-0918-3766>), Reto Stauffer [aut] (ORCID: <https://orcid.org/0000-0002-3798-5507>), Gillian Heller [aut] (ORCID: <https://orcid.org/0000-0003-1270-1499>), Fernanda de Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Andreas Mayr [aut] (ORCID: <https://orcid.org/0000-0001-7106-9732>) |
| Maintainer: | Nikolaus Umlauf <[email protected]> |
| License: | GPL-2 | GPL-3 |
| Version: | 0.1-0 |
| Built: | 2026-06-02 10:35:34 UTC |
| Source: | https://github.com/gamlss-dev/gamlss2 |
Next generation infrastructure for generalized additive models for location, scale, and shape (GAMLSS) and distributional regression more generally. The package provides a fresh reimplementation of the classic 'gamlss' package while being more modular and facilitating the creation of advanced terms and models.
The primary purpose of this package is to facilitate the creation of advanced infrastructures designed to enhance the Generalized Additive Models for Location Scale and Shape (GAMLSS, Rigby and Stasinopoulos 2005) modeling framework. Notably, the gamlss2 package represents a significant overhaul of its predecessor, gamlss, with a key emphasis on improving estimation speed and incorporating more adaptable infrastructures. These enhancements enable the seamless integration of various algorithms into GAMLSS, including gradient boosting, Bayesian estimation, regression trees, and forests, fostering a more versatile and powerful modeling environment.
Moreover, the package expands its compatibility by supporting all model terms from the base R mgcv package. Additionally, the gamlss2 package introduces the capability to accommodate more than four parameter families. Essentially, this means that users can now specify any type of model using these new infrastructures, making the package highly flexible and accommodating to a wide range of modeling requirements.
Index of help topics:
bamlss2 Bayesian GAMLSS Wrapper
BS Bayesian Sampler for GAMLSS Models
calibration Calibration Plots for Binary Responses
cv_gamlss2 Cross Validation for 'gamlss2' Models
discretize Discretize Continuous Distribution Family for
Count Regression Models
elm Extreme Learning Machine Model Terms
fake_formula Extended Processing of "Fake" Formulas
find_family Find and Fit GAMLSS Families
gamlss2 Generalized Additive Models for Location Scale
and Shape
gamlss2_control Control Parameters
gamlss2_start Model Initialization and Parameter Constraints
gamlss2-package GAMLSS Modeling with Advanced Flexible
Infrastructures
gamlss2.family Family Objects in 'gamlss2'
GDF Create a GDF Distribution
gnet Model Terms with 'glmnet'
HarzTraffic Traffic Counts at Sonnenberg in the Harz Region
Kumaraswamy Kumaraswamy Distribution
la Lasso Model Terms
make.link2 Create a Link for Families
mcmc Run MCMC Sampling for a Fitted 'gamlss2' Model
MN Multinomial Logit Family
new_formula Extracting a New Formula After Selection
Algorithms
OL Ordered Logistic Family for Ordinal Regression
pb P-Splines for GAMLSS
plot.gamlss2 Plotting GAMLSS
predict.gamlss2 Extracting Fitted or Predicted Parameters or
Terms from gamlss2 Models
prodist.gamlss2 Extracting Fitted or Predicted Probability
Distributions from gamlss2 Models
quantile.gamlss2 Quantiles for GAMLSS
re Random Effects
response_name Auxiliary Functions for Formulas and Model
Objects
RS Rigby and Stasinopoulos (RS) & Cole and Green
(CG) Algorithm
Rsq GAIC and Generalised (Pseudo) R-squared for
GAMLSS Models
select_gamlss2 Smooth Model Term Selection with Additional
Penalties
softplus Softplus Link Object
special_terms Special Model Terms for GAMLSS
SpirometryUS Spirometry Measurements from NHANES 2007-2012
stepwise Stepwise Model Term Selection Using GAIC
Nikolaus Umlauf, Mikis Stasinopoulos, Robert Rigby, Achim Zeileis, Reto Stauffer, Gillian Heller, Fernanda de Bastiani, Andreas Mayr.
Maintainer: Nikolaus Umlauf [email protected].
Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x
Rigby RA, Stasinopoulos DM, Heller GZ, De Bastiani F (2019). Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547
Stasinopoulos DM, Rigby RA (2007). “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software, 23(7), 1–46. doi:10.18637/jss.v023.i07
Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017). Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
The function bamlss2() is a convenience wrapper around
gamlss2 to fit Bayesian GAMLSS models via MCMC sampling. It
sets the optimizer to an MCMC routine and allows n.iter,
burnin, and thin to be specified directly.
bamlss2(formula, n.iter = 1200, burnin = 200, thin = 1, maxit = 2, ...)bamlss2(formula, n.iter = 1200, burnin = 200, thin = 1, maxit = 2, ...)
formula |
A GAM-type |
n.iter |
Integer, the total number of MCMC iterations. |
burnin |
Integer, the burn-in period. |
thin |
Integer, the thinning parameter for saved samples. |
maxit |
Integer, the number of backfitting iterations used to compute
starting values before running MCMC. If |
... |
Further arguments passed to |
Function bamlss2() calls gamlss2 and sets the optimizer
to a Bayesian sampler. By default, a small number of backfitting iterations
is performed first to obtain reasonable starting values for the MCMC sampler.
The wrapper is designed to behave in line with the gamlss()-style
interfaces, i.e., it supports weights and offset arguments
through gamlss2.
An object of class "bamlss2" inheriting from "gamlss2"
containing the fitted model and posterior samples.
Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325
## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) | 1 | 1 ## estimate Bayesian model m <- bamlss2(f, data = abdom, family = BCPE) ## posterior summary summary(m) ## plot estimated effects plot(m) ## plot samples plot(m, which = "samples") ## predict parameters using samples pm <- predict(m, FUN = mean) print(head(pm)) psd <- predict(m, FUN = sd) print(head(psd)) ## End(Not run)## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) | 1 | 1 ## estimate Bayesian model m <- bamlss2(f, data = abdom, family = BCPE) ## posterior summary summary(m) ## plot estimated effects plot(m) ## plot samples plot(m, which = "samples") ## predict parameters using samples pm <- predict(m, FUN = mean) print(head(pm)) psd <- predict(m, FUN = sd) print(head(psd)) ## End(Not run)
The function BS() implements an MCMC sampler for generalized additive
models for location, scale, and shape (GAMLSS) as fitted by
gamlss2. The sampler uses Metropolis-Hastings updates for
regression coefficients of linear and smooth terms.
BS(x, y, specials, family, offsets, weights, start, xterms, sterms, control)BS(x, y, specials, family, offsets, weights, start, xterms, sterms, control)
x |
The full model matrix to be used for fitting. |
y |
The response vector or matrix. |
specials |
A named list of special model terms, for example including
design and penalty matrices for fitting smooth terms using
|
family |
A family object, see |
offsets |
If supplied, a list or data frame of model offsets. See
|
weights |
If supplied, a numeric vector of prior weights used in the fitting process. |
start |
Starting values for the parameters of the response distribution.
See the examples for |
xterms |
A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object. |
sterms |
A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object. |
control |
Further control arguments as specified in the call to
|
Function BS() is typically called via mcmc or via the
wrapper function bamlss2. It implements a blocked
Metropolis-Hastings sampler using working responses and working weights.
Smooth terms are updated using penalized weighted least squares proposals and
their smoothing variances are sampled using univariate slice sampling.
The function uses the following control arguments:
n.iter: integer, the total number of MCMC iterations;
burnin: integer, the burn-in period;
thinning: integer, the thinning parameter for saved samples;
trace: logical, should information be printed while the sampler is running;
flush: logical, whether to use flush.console
to display current output in the console;
nullmodel: logical, should an intercept-only null model be
evaluated to compute a null deviance and deviance reduction?;
binning: logical, should binning be used for smooth terms if available?
A fitted model object containing posterior samples and summary information. Typically it includes:
samples: posterior draws for coefficients and smoothing variances;
coefficients: posterior means of coefficients;
fitted.values: posterior mean fitted values for each distributional parameter;
edf: estimated degrees of freedom for smooth terms;
logLik: log-likelihood at the posterior mean predictor.
Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325
gamlss2, RS, CG,
mcmc, bamlss2
## see ?mcmc## see ?mcmc
Compute and plot calibration curves for models with 0/1 responses.
The function can handle one or several fitted gamlss2 models
and compares their calibration in a single plot.
calibration(..., newdata = NULL, y = NULL, parameter = NULL, breaks = seq(0, 1, by = 0.1), minn = 20, main = "Calibration plot", xlab = "Predicted probability", ylab = "Observed proportion", plot = TRUE, add_loess = TRUE, smooth_n = 200, col = NULL, lty = NULL, legend = TRUE, pos = "topleft", xlim = NULL, ylim = NULL)calibration(..., newdata = NULL, y = NULL, parameter = NULL, breaks = seq(0, 1, by = 0.1), minn = 20, main = "Calibration plot", xlab = "Predicted probability", ylab = "Observed proportion", plot = TRUE, add_loess = TRUE, smooth_n = 200, col = NULL, lty = NULL, legend = TRUE, pos = "topleft", xlim = NULL, ylim = NULL)
... |
One or several fitted |
newdata |
Optional data frame for out-of-sample calibration. |
y |
Optional numeric or factor vector with the binary response. |
parameter |
Character, which parameter should be used for prediction. |
breaks |
Numeric vector of break points used to construct bins for the predicted probabilities. |
minn |
Integer, the minimum number of observations required in each bin. |
main |
Character, main title for the plot. |
xlab |
Character, label for the x-axis. |
ylab |
Character, label for the y-axis. |
plot |
Logical, should a calibration plot be produced? |
add_loess |
Logical, should a loess smooth be added for each model? |
smooth_n |
Integer, number of evaluation points for the loess curve. |
col |
Colors used for the models. |
lty |
Line types used for the loess curves. |
legend |
Logical, should a legend be added when more than one model is supplied? |
pos |
Character, legend position passed to
|
xlim |
The x limits of the plot. |
ylim |
The y limits of the plot. |
For each fitted model, predicted probabilities are obtained via
predict using type = "parameter" and the selected
parameter. If the corresponding gamlss2.family object
provides a probabilities() method, this is used to transform the
parameter vector into probabilities. For multi-column outputs, the first
column is used.
Predicted probabilities are grouped into bins defined by breaks.
Within each bin, the mean predicted probability, the observed proportion
of 1s, and the number of observations are computed. Bins with fewer than
minn observations are dropped.
Invisibly returns a data frame with the columns interval,
probs, y, n, and model. For a single model,
the model column is dropped.
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine, 17, 230. doi:10.1186/s12916-019-1466-7
## Not run: set.seed(123) n <- 400 x1 <- runif(n, -3, 3) x2 <- runif(n, -3, 3) eta <- sin(x1) + 0.5 * x2 p <- plogis(eta) y <- rbinom(n, size = 1, prob = p) d <- data.frame(y = y, x1 = x1, x2 = x2) m1 <- gamlss2(y ~ x1, family = BI, data = d) m2 <- gamlss2(y ~ s(x1), family = BI, data = d) m3 <- gamlss2(y ~ s(x1) + x2, family = BI, data = d) calibration(m1, m2, m3) head(calibration(m1, m2, m3, plot = FALSE)) ## End(Not run)## Not run: set.seed(123) n <- 400 x1 <- runif(n, -3, 3) x2 <- runif(n, -3, 3) eta <- sin(x1) + 0.5 * x2 p <- plogis(eta) y <- rbinom(n, size = 1, prob = p) d <- data.frame(y = y, x1 = x1, x2 = x2) m1 <- gamlss2(y ~ x1, family = BI, data = d) m2 <- gamlss2(y ~ s(x1), family = BI, data = d) m3 <- gamlss2(y ~ s(x1) + x2, family = BI, data = d) calibration(m1, m2, m3) head(calibration(m1, m2, m3, plot = FALSE)) ## End(Not run)
cv_gamlss2() implements K-fold cross validation for models fitted with
gamlss2. Different scoring rules can be supplied via the
metric argument. Convenience metric functions
(log_pdf_metric(), rqres_metric(), mse_metric()) are provided.
## k-fold cross-validation cv_gamlss2(..., data, folds = 5, metric = log_pdf_metric, parallel = FALSE, simplify = TRUE) ## log-pdf for each observation log_pdf_metric(model, data) ## randomized quantile residuals rqres_metric(model, data) ## mean squared error mse_metric(model, data)## k-fold cross-validation cv_gamlss2(..., data, folds = 5, metric = log_pdf_metric, parallel = FALSE, simplify = TRUE) ## log-pdf for each observation log_pdf_metric(model, data) ## randomized quantile residuals rqres_metric(model, data) ## mean squared error mse_metric(model, data)
... |
model specification passed to |
data |
a |
folds |
either an integer specifying the number of folds, or a list, matrix, or data frame of index sets for test folds. Defaults to 5. |
metric |
a function of the form |
parallel |
logical. If |
simplify |
logical. If |
model |
a fitted |
cv_gamlss2() splits the data into training and test folds.
For each fold the model is fitted on the training data, and the chosen
metric is evaluated on the held-out test data. By default,
the scoring rule is the log predictive density (log_pdf_metric),
but other metrics can be used, such as randomized quantile residuals
(rqres_metric) or mean squared error of the conditional mean
(mse_metric).
The function returns either a list of fold-wise results or, if
simplify = TRUE, a named vector or data frame aligned with the original
observations.
If simplify = TRUE and the metric returns scalars,
a named numeric vector of fold scores is returned. Otherwise a
data frame with fold membership and scores per observation is returned.
The convenience metrics return a numeric vector of scores or residuals.
## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## cross-validation using the NO distribution ## only model the mean with s(x) cv1 <- cv_gamlss2(y ~ s(x), data = abdom, family = NO) ## now, also model the standard deviation with s(x) cv2 <- cv_gamlss2(y ~ s(x) | s(x), data = abdom, family = BCT) ## evaluate log-likelihood sum(cv1$score) sum(cv2$score) ## End(Not run)## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## cross-validation using the NO distribution ## only model the mean with s(x) cv1 <- cv_gamlss2(y ~ s(x), data = abdom, family = NO) ## now, also model the standard deviation with s(x) cv2 <- cv_gamlss2(y ~ s(x) | s(x), data = abdom, family = BCT) ## evaluate log-likelihood sum(cv1$score) sum(cv2$score) ## End(Not run)
This function takes any continuous distribution family object and discretizes it,
enabling it to be used for the estimation of count regression models. The discretized family
can then be used in gamlss2 models that deal with count data.
discretize(family = NO)discretize(family = NO)
family |
A continuous distribution family object. The family will be discretized for modeling count data, where the distribution is adapted for count outcomes. |
The function discretizes a continuous distribution family by converting its cumulative distribution function (CDF) into a probability mass function (PMF). This is done by computing the difference between the CDF evaluated at adjacent points. The resulting discretized distribution can be used in count regression models to estimate the relationship between count data and explanatory variables.
Returns an object of class "gamlss2.family", which is a discretized version of
the input continuous family object, suitable for use in gamlss2
models for count data.
## simulate count data using the Poisson distribution set.seed(111) y <- rpois(1000, lambda = 10) ## create a discretized family using the BCT distribution (with log link for mu) fam <- discretize(family = BCT(mu.link = "log")) ## fit a count regression model using the discretized family fit_family(y, family = fam) ## including covariate ## Not run: set.seed(111) x <- sort(runif(1000, -3, 3)) y <- rpois(1000, lambda = exp(sin(x))) m <- gamlss2(y ~ s(x) | s(x) | s(x) | s(x), family = fam) ## plot effects plot(m) ## predict 95 p <- quantile(m, probs = 0.95) plot(x, y, pch = 19, col = adjustcolor(1, 0.3)) lines(p ~ x, lty = 1, col = 4, lwd = 3) ## End(Not run)## simulate count data using the Poisson distribution set.seed(111) y <- rpois(1000, lambda = 10) ## create a discretized family using the BCT distribution (with log link for mu) fam <- discretize(family = BCT(mu.link = "log")) ## fit a count regression model using the discretized family fit_family(y, family = fam) ## including covariate ## Not run: set.seed(111) x <- sort(runif(1000, -3, 3)) y <- rpois(1000, lambda = exp(sin(x))) m <- gamlss2(y ~ s(x) | s(x) | s(x) | s(x), family = fam) ## plot effects plot(m) ## predict 95 p <- quantile(m, probs = 0.95) plot(x, y, pch = 19, col = adjustcolor(1, 0.3)) lines(p ~ x, lty = 1, col = 4, lwd = 3) ## End(Not run)
Constructor function for Extreme Learning Machine (ELM) model terms for GAMLSS.
## Model term constructor function. elm(x, k = 50, a = "tanh", ...)## Model term constructor function. elm(x, k = 50, a = "tanh", ...)
x |
A numeric vector or matrix, a factor, or a formula.
If |
k |
Integer, number of hidden units (random features) used to build the ELM design matrix. |
a |
Character, activation function for the hidden units. Supported options are
|
... |
Further control arguments can be passed:
|
The ELM term constructs a randomized single-hidden-layer representation of the covariate(s) in
x. Internally, a design matrix Z is built (including an intercept column), random
weights are sampled, and the linear predictors Z %*% W are transformed through the
activation function a to obtain the hidden-layer design matrix X. Columns of
X are centered before estimation.
Internal scaling. For numerical stability and comparability across terms,
Z can be scaled internally (scale = TRUE). Intercept columns are left unchanged.
If x is a factor, a QR-based group scaling is applied; otherwise, a column-wise
center-and-scale normalization is used. The scaling transformation is stored and is reused
automatically during prediction.
Activation functions. The activation is applied element-wise to Z %*% W.
For numerical stability, bounded activations are evaluated on clipped inputs
(currently ).
Weight sampling. By default, hidden-unit weights are generated using a
low-discrepancy Halton design in a whitened input space. This tends to produce
more stable and less redundant features than purely random draws in many practical
applications. With sampler = "pca", directions are additionally aligned
with principal component directions of the scaled design. The historical
stochastic behavior can be recovered with sampler = "random".
Reproducibility. The default sampler = "halton" is deterministic
given the input data. Use set.seed only if you explicitly
request sampler = "random".
An object representing a specials model term, used internally by
gamlss2 during model fitting and prediction.
Huang GB, Zhu QY, Siew CK (2006). Extreme Learning Machine: Theory and Applications. Neurocomputing, 70(1–3), 489–501. doi:10.1016/j.neucom.2005.12.126
Huang GB, Zhu QY, Siew CK (2004). Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of IJCNN 2004, 2, 985–990. doi:10.1109/IJCNN.2004.1380068
## Not run: ## load data data("SpirometryUS") ## subset for female d <- subset(SpirometryUS, gender == "female") ## the default sampler is deterministic; set ## a seed only for sampler = "random" set.seed(1328) ## formula for all 4 parameters f <- fev1 ~ elm(age,k=100,a="tanh") | . | . | . ## estimate model m <- gamlss2(f, data = d, family = BCT) ## estimated effects plot(m) ## predict quantiles qu <- c(0.025, seq(0.1, 0.9, by = 0.1), 0.975) fit <- quantile(m, probs = qu) ## plot plot(fev1 ~ age, data = d) i <- order(d$age) matplot(d$age[i], fit[i, ], type = "l", lty = 1, lwd = 2, col = c(2, rep(4, ncol(fit) - 1)), add = TRUE) ## main effects and interactions f <- fev1 ~ s(age) + s(height) + s(weight) + elm(~age+height+weight,k=200) | . | . | . m <- gamlss2(f, data = d, family = BCT) ## summary to inspect effect of interactions ## of the elm()s summary(m) ## plot main effects plot(m) ## quantile residuals plot(m, which = "resid") ## prediction is handled automatically via ## the special term interface n <- 50 nd <- with(d, expand.grid( "age" = seq(min(age), max(age), length = n), "weight" = seq(min(weight), max(weight), length = n) )) nd$height <- mean(d$height) ## compute lower 2.5 nd$fit <- quantile(m, newdata = nd, probs = 0.025) ## visualize n <- length(unique(nd$age)) age <- sort(unique(nd$age)) weight <- sort(unique(nd$weight)) z <- matrix(nd$fit, n, n) image(age, weight, z, col = hcl.colors(100, "YlOrRd"), xlab = "age", ylab = "weight") contour(age, weight, z, add = TRUE) ## End(Not run)## Not run: ## load data data("SpirometryUS") ## subset for female d <- subset(SpirometryUS, gender == "female") ## the default sampler is deterministic; set ## a seed only for sampler = "random" set.seed(1328) ## formula for all 4 parameters f <- fev1 ~ elm(age,k=100,a="tanh") | . | . | . ## estimate model m <- gamlss2(f, data = d, family = BCT) ## estimated effects plot(m) ## predict quantiles qu <- c(0.025, seq(0.1, 0.9, by = 0.1), 0.975) fit <- quantile(m, probs = qu) ## plot plot(fev1 ~ age, data = d) i <- order(d$age) matplot(d$age[i], fit[i, ], type = "l", lty = 1, lwd = 2, col = c(2, rep(4, ncol(fit) - 1)), add = TRUE) ## main effects and interactions f <- fev1 ~ s(age) + s(height) + s(weight) + elm(~age+height+weight,k=200) | . | . | . m <- gamlss2(f, data = d, family = BCT) ## summary to inspect effect of interactions ## of the elm()s summary(m) ## plot main effects plot(m) ## quantile residuals plot(m, which = "resid") ## prediction is handled automatically via ## the special term interface n <- 50 nd <- with(d, expand.grid( "age" = seq(min(age), max(age), length = n), "weight" = seq(min(weight), max(weight), length = n) )) nd$height <- mean(d$height) ## compute lower 2.5 nd$fit <- quantile(m, newdata = nd, probs = 0.025) ## visualize n <- length(unique(nd$age)) age <- sort(unique(nd$age)) weight <- sort(unique(nd$weight)) z <- matrix(nd$fit, n, n) image(age, weight, z, col = hcl.colors(100, "YlOrRd"), xlab = "age", ylab = "weight") contour(age, weight, z, add = TRUE) ## End(Not run)
Create a simplified formula from a formula, a
Formula, or a list of
formulas. The result retains the variables and
transformations needed to build a model.frame while
separating out special model terms. This is mainly used internally when
gamlss2 prepares formulas for fitting.
fake_formula(formula, specials = NULL, nospecials = FALSE, onlyspecials = FALSE)fake_formula(formula, specials = NULL, nospecials = FALSE, onlyspecials = FALSE)
formula |
|
specials |
Character, vector of names of special functions in the formula,
see |
nospecials |
Logical, should variables of special model terms be part of the "fake formula"? |
onlyspecials |
Logical, should only the special model terms be returned? |
The function is primarily an internal helper for formula preprocessing in gamlss2, but it can also be useful for inspecting how complex formulas are decomposed before fitting.
Special model terms such as s(), te(), ti(), re(),
or package-specific specials are identified and handled separately from
ordinary variables. Depending on the control arguments:
if nospecials = FALSE, variables used inside registered special
terms are reintroduced into the returned formula so they remain available
in the model frame;
if nospecials = TRUE, the returned formula excludes the special
terms and their variables;
if onlyspecials = TRUE, the function returns the names of the
detected special terms instead of a formula object.
For multi-part or multi-parameter formulas, the output preserves the overall formula structure.
Depending on the input, the function returns a
formula or Formula. If
onlyspecials = TRUE, a character vector or list of special model term
names is returned.
## basic formula f <- y ~ x1 + x2 + log(x3) ff <- fake_formula(f) print(ff) ## including special model terms f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4) ff <- fake_formula(f) print(ff) ## multiple parts on the right-hand side f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + sqrt(x5) ff <- fake_formula(f) print(ff) ## collapse all formula parts print(formula(ff, collapse = TRUE)) print(formula(ff, collapse = TRUE, update = TRUE)) ## list of formulas f <- list( y ~ x1 + s(x2) + x3 + te(log(x3), x4), ~ x2 + sqrt(x5), ~ z2 + x1 + exp(x3) ) ff <- fake_formula(f) print(ff) ## extract separate parts on the right-hand side formula(ff, rhs = 1) formula(ff, rhs = 2) formula(ff, rhs = 3) ## formula with multiple responses and multiple parts f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5) ff <- fake_formula(f) print(ff) ## list of formulas with multiple responses f <- list( y1 ~ x1 + s(x2) + x3 + te(log(x3), x4), y2 ~ x2 + sqrt(x5), y3 ~ z2 + x1 + exp(x3) + s(x10) ) ff <- fake_formula(f) ## extract only without special terms ff <- fake_formula(f, nospecials = TRUE) print(ff) ## extract only special terms ff <- fake_formula(f, onlyspecials = TRUE) print(ff)## basic formula f <- y ~ x1 + x2 + log(x3) ff <- fake_formula(f) print(ff) ## including special model terms f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4) ff <- fake_formula(f) print(ff) ## multiple parts on the right-hand side f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + sqrt(x5) ff <- fake_formula(f) print(ff) ## collapse all formula parts print(formula(ff, collapse = TRUE)) print(formula(ff, collapse = TRUE, update = TRUE)) ## list of formulas f <- list( y ~ x1 + s(x2) + x3 + te(log(x3), x4), ~ x2 + sqrt(x5), ~ z2 + x1 + exp(x3) ) ff <- fake_formula(f) print(ff) ## extract separate parts on the right-hand side formula(ff, rhs = 1) formula(ff, rhs = 2) formula(ff, rhs = 3) ## formula with multiple responses and multiple parts f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5) ff <- fake_formula(f) print(ff) ## list of formulas with multiple responses f <- list( y1 ~ x1 + s(x2) + x3 + te(log(x3), x4), y2 ~ x2 + sqrt(x5), y3 ~ z2 + x1 + exp(x3) + s(x10) ) ff <- fake_formula(f) ## extract only without special terms ff <- fake_formula(f, nospecials = TRUE) print(ff) ## extract only special terms ff <- fake_formula(f, onlyspecials = TRUE) print(ff)
These functions provide useful infrastructures for finding suitable GAMLSS families for a response variable.
## list of available families from gamlss.dist package available_families(type = c("continuous", "discrete"), families = NULL, ...) ## find suitable response distribution find_family(y, families = NULL, k = 2, verbose = TRUE, ...) ## fit distribution parameters fit_family(y, family = NO, plot = TRUE, ...) ## find gamlss2 model find_gamlss2(formula, families = NULL, k = 2, select = FALSE, verbose = TRUE, ...)## list of available families from gamlss.dist package available_families(type = c("continuous", "discrete"), families = NULL, ...) ## find suitable response distribution find_family(y, families = NULL, k = 2, verbose = TRUE, ...) ## fit distribution parameters fit_family(y, family = NO, plot = TRUE, ...) ## find gamlss2 model find_gamlss2(formula, families = NULL, k = 2, select = FALSE, verbose = TRUE, ...)
type |
Character, is the response continuous or discrete? |
families |
Character, the names of the family objects of the gamlss.dist package that should be returned. |
y |
The response vector or matrix. |
k |
Numeric, the penalty factor that should be used for the |
verbose |
Logical, should runtime information be printed? |
family |
A family object that should be used for estimation,
see also |
plot |
Logical, should a plot of the fitted density be provided? |
formula |
A model formula, see |
select |
Logical, if set to |
... |
For function |
The function find_family() employs gamlss2 to estimate intercept-only
models for each specified family object in the families argument. Note that model
estimation occurs within a try block with warnings suppressed.
Additionally, the function calculates the GAIC for each family whenever
feasible and returns the values sorted in ascending order of the selected fit criterion,
so smaller values indicate a better fit.
Function fit_family() fits a single intercept-only model using the
specified family and creates a plot of the fitted density.
Function find_family() returns a vector of GAIC values for the
different fitted families. Function fit_family() returns the fitted intercept-only
model.
## Not run: ## load data data("rent", package = "gamlss.data") ## find a suitable response to the response ic <- find_family(rent$R) print(ic) ## fit parameters using the BCCG family fit_family(rent$R, family = BCCG) ## count data data("polio", package = "gamlss.data") ## search best count model ic <- find_family(polio, k = 0, families = available_families(type = "discrete")) print(ic) ## fit parameters using the ZAPIG family fit_family(polio, family = ZAPIG) ## search complete model ## m <- find_gamlss2(R ~ s(Fl) + s(A) | . | . | ., data = rent, ## select = TRUE, mu.link = "log") ## End(Not run)## Not run: ## load data data("rent", package = "gamlss.data") ## find a suitable response to the response ic <- find_family(rent$R) print(ic) ## fit parameters using the BCCG family fit_family(rent$R, family = BCCG) ## count data data("polio", package = "gamlss.data") ## search best count model ic <- find_family(polio, k = 0, families = available_families(type = "discrete")) print(ic) ## fit parameters using the ZAPIG family fit_family(polio, family = ZAPIG) ## search complete model ## m <- find_gamlss2(R ~ s(Fl) + s(A) | . | . | ., data = rent, ## select = TRUE, mu.link = "log") ## End(Not run)
The function gamlss2() fits generalized additive models for location,
scale, and shape (GAMLSS). It estimates one additive predictor for each
parameter of the chosen response distribution, such as mu,
sigma, nu, or tau. The number and meaning of these
parameters are determined by the selected family object, see
gamlss2.family.
In addition to ordinary linear terms, gamlss2() supports smooth terms
from mgcv as well special user-defined model-terms, see specials.
gamlss2(formula, ...) ## S3 method for class 'formula' gamlss2(formula, data, family = NO, subset, na.action, weights, offset, start = NULL, knots = NULL, control = gamlss2_control(...), ...) ## S3 method for class 'list' gamlss2(formula, ...)gamlss2(formula, ...) ## S3 method for class 'formula' gamlss2(formula, data, family = NO, subset, na.action, weights, offset, start = NULL, knots = NULL, control = gamlss2_control(...), ...) ## S3 method for class 'list' gamlss2(formula, ...)
formula |
A GAM-type |
data |
A data frame, list, or environment containing the variables in the model.
If not found in |
family |
A |
subset |
An optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
|
weights |
An optional vector of prior |
offset |
Optional offset terms to be included in the linear predictors during fitting. If a single numeric vector is supplied, it is assigned to the first parameter of the distribution. If offsets are needed for several parameters, provide a data frame or list in the same order as the parameter names in the family object. |
start |
Optional starting values for the estimation algorithm, see
|
knots |
An optional list of user-specified knots, see |
control |
A list of control arguments, see |
... |
Arguments passed to |
The function gamlss2() is the main entry point for fitting distributional
regression models in gamlss2.
The response distribution is specified through a family object, either
from gamlss.dist or created as a gamlss2.family object.
By default, estimation is carried out with the RS algorithm.
The optimizer can be changed through gamlss2_control. A family
object may also provide its own optimizer.
The returned object depends on the chosen optimizer. In the standard case,
this is an object of class "gamlss2", for which summary, plotting,
prediction, and extractor methods are available.
The object returned by the selected optimizer. By default, this is an object
of class "gamlss2" produced by RS. Standard methods and
extractor functions are available for this class.
Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x
Rigby RA, Stasinopoulos DM, Heller GZ, De Bastiani F (2019). Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547
Stasinopoulos DM, Rigby RA (2007). “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software, 23(7), 1–46. doi:10.18637/jss.v023.i07
Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017). Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
RS, gamlss2_control, gamlss2.family,
gamlss2_start
## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify a distributional model: ## location and scale vary with x, ## shape parameters remain constant f <- y ~ s(x) | s(x) | 1 | 1 ## estimate model m1 <- gamlss2(f, data = abdom, family = BCT) ## model summary summary(m1) ## plot parameter-specific effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict selected distribution parameters par <- predict(m1, model = c("mu", "sigma")) ## fitted centile curves pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## visualize plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## count response example using HarzTraffic data data("HarzTraffic", package = "gamlss2") ## seasonal count model for motorcycles m2 <- gamlss2( bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"), data = HarzTraffic, family = NBI ) ## summary summary(m2) ## effect plots plot(m2) ## residual diagnostics plot(m2, which = "resid") ## quantiles over the day of the year nd <- data.frame(yday = 1:365) pq <- quantile(m2, newdata = nd, probs = c(0.05, 0.5, 0.95)) ## visualize fitted quantiles plot(bikes ~ yday, data = HarzTraffic, pch = 19, col = gray(0.1, alpha = 0.3)) matlines(nd$yday, pq, lty = c(2, 1, 2), lwd = 2, col = 4)## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify a distributional model: ## location and scale vary with x, ## shape parameters remain constant f <- y ~ s(x) | s(x) | 1 | 1 ## estimate model m1 <- gamlss2(f, data = abdom, family = BCT) ## model summary summary(m1) ## plot parameter-specific effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict selected distribution parameters par <- predict(m1, model = c("mu", "sigma")) ## fitted centile curves pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## visualize plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## count response example using HarzTraffic data data("HarzTraffic", package = "gamlss2") ## seasonal count model for motorcycles m2 <- gamlss2( bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"), data = HarzTraffic, family = NBI ) ## summary summary(m2) ## effect plots plot(m2) ## residual diagnostics plot(m2, which = "resid") ## quantiles over the day of the year nd <- data.frame(yday = 1:365) pq <- quantile(m2, newdata = nd, probs = c(0.05, 0.5, 0.95)) ## visualize fitted quantiles plot(bikes ~ yday, data = HarzTraffic, pch = 19, col = gray(0.1, alpha = 0.3)) matlines(nd$yday, pq, lty = c(2, 1, 2), lwd = 2, col = 4)
Control parameters for fitting GAMLSS models with gamlss2.
gamlss2_control(optimizer = RS, trace = TRUE, flush = TRUE, light = FALSE, expand = TRUE, model = TRUE, x = TRUE, y = TRUE, fixed = FALSE, ...)gamlss2_control(optimizer = RS, trace = TRUE, flush = TRUE, light = FALSE, expand = TRUE, model = TRUE, x = TRUE, y = TRUE, fixed = FALSE, ...)
optimizer |
Function, the optimizer to be used for fitting. |
trace |
Logical, should information be printed while the algorithm is running? |
flush |
Logical, whether to use |
light |
Logical, if set to |
expand |
Logical, if fewer formulas are supplied than there are parameters of the distribution, should intercept-only formulas be added automatically? |
model |
Logical, should the model frame be included as a component of the returned object. |
x |
Logical, indicating whether the model matrix should be included as a component of the returned object. |
y |
Logical, should the response be included as a component of the returned object. |
fixed |
Named logical vector indicating which parameters should be fixed during estimation.
See |
... |
Further control parameters to be included in the return value, for example
parameters used by the optimizer function |
The set of control parameters can be extended. For example, if a different optimizer is used, newly specified control parameters are passed on to that optimizer automatically.
A list with the arguments specified.
## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) ## estimate model with different step length ## control in the RS algorithm m1 <- gamlss2(f, data = abdom, family = BCT, step = 1) m2 <- gamlss2(f, data = abdom, family = BCT, step = 0.1) ## End(Not run)## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) ## estimate model with different step length ## control in the RS algorithm m1 <- gamlss2(f, data = abdom, family = BCT, step = 1) m2 <- gamlss2(f, data = abdom, family = BCT, step = 0.1) ## End(Not run)
Guidance on supplying starting values through argument start in
gamlss2 and on fixing distribution parameters via
fixed in gamlss2_control.
The start argument can be used to initialize estimation in several ways.
Supported inputs include:
a named numeric vector or list giving one starting value per
distribution parameter, such as mu, sigma, nu, or
tau;
a data frame or matrix with one column per distribution parameter and one row per observation, providing starting values for the full additive predictors;
a fitted "gamlss2" object, in which case fitted()
values are used as starting values for all distribution parameters;
a coef.gamlss2 object, allowing coefficient-wise
initialization of linear and smooth terms.
Parameters can be fixed during estimation using the fixed control
argument. This is useful, for example, when shape parameters should remain
constant while the remaining parameters are estimated.
## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## distributional model f <- y ~ s(x) | s(x) | 1 | 1 ## scalar starting values for the distribution parameters m1 <- gamlss2(f, data = abdom, family = BCT, start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10)) ## fix shape parameters during estimation m2 <- gamlss2(f, data = abdom, family = BCT, start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10), fixed = c(nu = TRUE, tau = TRUE)) ## coefficient estimates of the fitted model coef(m2) ## use fitted additive predictors as starting values m3 <- gamlss2(f, data = abdom, family = BCT, start = fitted(m2)) ## same using the fitted model object directly m4 <- gamlss2(f, data = abdom, family = BCT, start = m2) ## coefficient-wise initialization m5 <- gamlss2(f, data = abdom, family = BCT, start = coef(m2)) ## End(Not run)## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## distributional model f <- y ~ s(x) | s(x) | 1 | 1 ## scalar starting values for the distribution parameters m1 <- gamlss2(f, data = abdom, family = BCT, start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10)) ## fix shape parameters during estimation m2 <- gamlss2(f, data = abdom, family = BCT, start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10), fixed = c(nu = TRUE, tau = TRUE)) ## coefficient estimates of the fitted model coef(m2) ## use fitted additive predictors as starting values m3 <- gamlss2(f, data = abdom, family = BCT, start = fitted(m2)) ## same using the fitted model object directly m4 <- gamlss2(f, data = abdom, family = BCT, start = m2) ## coefficient-wise initialization m5 <- gamlss2(f, data = abdom, family = BCT, start = coef(m2)) ## End(Not run)
A "gamlss2.family" object describes a response distribution for use with
gamlss2. It tells the fitting algorithm which distributional
parameters exist, which link functions they use, and which distributional
functions are available, such as densities, quantiles, random-number generators,
or score functions.
## S3 method for class 'gamlss2' family(object, ...)## S3 method for class 'gamlss2' family(object, ...)
object |
A fitted |
... |
Not used currently. |
A gamlss2 family object is a list of class "gamlss2.family".
It defines the response distribution and the functions needed to fit and use
that distribution in a model.
The minimum required elements are:
family: the name of the distribution (character string).
names: a character vector of parameter names (e.g., c("mu", "sigma")).
links: a named character vector specifying the link function for each parameter
(e.g., c(mu = "identity", sigma = "log")), or a list of link functions, see
softplus.
pdf(y, par, log = FALSE, ...): a function to evaluate the (log-)density.
The pdf() function must accept the response y, a named list
par of evaluated parameter values (e.g., par$mu, par$sigma),
a logical log, and optional additional arguments.
Further elements can be added depending on how fully the family should support fitting, prediction, diagnostics, and simulation. Common optional elements are:
score: a named list of functions (one per parameter), each computing the
first derivative of the log-likelihood with respect to the linear predictor:
score[[param]](y, par, ...).
hess: a named list of functions computing second derivatives (the negative Hessian).
For parameters mu and sigma, this includes:
hess[["mu"]](), hess[["sigma"]](), and optionally
cross derivatives like hess[["mu.sigma"]]().
loglik(y, par, ...): a function computing the total log-likelihood.
cdf(y, par, ...): cumulative distribution function.
quantile(p, par, ...): quantile function.
random(n, par, ...): random number generator.
mean(par, ...): mean function.
variance(par, ...): variance function.
skewness(par, ...), kurtosis(par, ...): optional higher-order moment functions.
initialize: a named list of initialization functions, one for each parameter (e.g.,
initialize$mu(y, ...)), used to generate starting values.
valid.response(x): a function that checks whether the response is valid (e.g.,
numeric, non-factor).
optimizer(): an optional function defining a custom optimization
method for use with gamlss2, see also RS.
If analytical score or Hessian functions are not provided, they are approximated
numerically. If quantile residuals are required, a cdf() function must
be available.
For family.gamlss2(), the family object stored in a fitted
"gamlss2" model is returned. A "gamlss2.family" object itself is
a list describing the response distribution, its parameters, their link
functions, and the distribution-specific methods used during fitting,
prediction, diagnostics, and simulation.
## Not run: ## new family object for the normal distribution Normal <- function(...) { fam <- list( "family" = "Normal", "names" = c("mu", "sigma"), "links" = c("mu" = "identity", "sigma" = "log"), "score" = list( "mu" = function(y, par, ...) { (y - par$mu) / (par$sigma^2) }, "sigma" = function(y, par, ...) { -1 + (y - par$mu)^2 / (par$sigma^2) } ), "hess" = list( "mu" = function(y, par, ...) { 1 / (par$sigma^2) }, "sigma" = function(y, par, ...) { rep(2, length(y)) }, "mu.sigma" = function(y, par, ...) { rep(0, length(y)) } ), "loglik" = function(y, par, ...) { sum(dnorm(y, par$mu, par$sigma, log = TRUE)) }, "mu" = function(par, ...) { par$mu }, "pdf" = function(y, par, log = FALSE) { dnorm(y, mean = par$mu, sd = par$sigma, log = log) }, "cdf" = function(y, par, ...) { pnorm(y, mean = par$mu, sd = par$sigma, ...) }, "random" = function(n, par) { rnorm(n, mean = par$mu, sd = par$sigma) }, "quantile" = function(p, par) { qnorm(p, mean = par$mu, sd = par$sigma) }, "initialize" = list( "mu" = function(y, ...) { (y + mean(y)) / 2 }, "sigma" = function(y, ...) { rep(sd(y), length(y)) } ), "mean" = function(par) par$mu, "variance" = function(par) par$sigma^2, "valid.response" = function(x) { if(is.factor(x) | is.character(x)) stop("the response should be numeric!") return(TRUE) } ) class(fam) <- "gamlss2.family" return(fam) } ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) ## estimate model m1 <- gamlss2(f, data = abdom, family = Normal) ## plot estimated effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict parameters par <- predict(m1) ## predict quantiles pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## visualize plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## another example that defines only the density ## function; in this case all derivatives are ## approximated numerically. For residual diagnostics, ## the $cdf() and $quantile() functions are needed as well Gamma <- function(...) { fam <- list( "names" = c("mu", "sigma"), "links" = c("mu" = "log", "sigma" = "log"), "pdf" = function(y, par, log = FALSE, ...) { shape <- par$sigma scale <- par$mu/par$sigma dgamma(y, shape = shape, scale = scale, log = log) }, "cdf" = function(y, par, lower.tail = TRUE, log.p = FALSE) { shape <- par$sigma scale <- par$mu/par$sigma pgamma(y, shape = shape, scale = scale, lower.tail = lower.tail, log.p = log.p) }, "quantile" = function(p, par, lower.tail = TRUE, log.p = FALSE) { shape <- par$sigma scale <- par$mu/par$sigma qgamma(p, shape = shape, scale = scale, lower.tail = lower.tail, log.p = log.p) } ) class(fam) <- "gamlss2.family" return(fam) } ## example using the Munich rent data data("rent", package = "gamlss.data") ## model formula f <- R ~ ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") | ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") ## estimate model m2 <- gamlss2(f, data = rent, family = Gamma) ## visualize estimated effects plot(m2, which = "effects") ## diagnostics, needs the $cdf() and $quantile() function! plot(m2, which = "resid") ## End(Not run)## Not run: ## new family object for the normal distribution Normal <- function(...) { fam <- list( "family" = "Normal", "names" = c("mu", "sigma"), "links" = c("mu" = "identity", "sigma" = "log"), "score" = list( "mu" = function(y, par, ...) { (y - par$mu) / (par$sigma^2) }, "sigma" = function(y, par, ...) { -1 + (y - par$mu)^2 / (par$sigma^2) } ), "hess" = list( "mu" = function(y, par, ...) { 1 / (par$sigma^2) }, "sigma" = function(y, par, ...) { rep(2, length(y)) }, "mu.sigma" = function(y, par, ...) { rep(0, length(y)) } ), "loglik" = function(y, par, ...) { sum(dnorm(y, par$mu, par$sigma, log = TRUE)) }, "mu" = function(par, ...) { par$mu }, "pdf" = function(y, par, log = FALSE) { dnorm(y, mean = par$mu, sd = par$sigma, log = log) }, "cdf" = function(y, par, ...) { pnorm(y, mean = par$mu, sd = par$sigma, ...) }, "random" = function(n, par) { rnorm(n, mean = par$mu, sd = par$sigma) }, "quantile" = function(p, par) { qnorm(p, mean = par$mu, sd = par$sigma) }, "initialize" = list( "mu" = function(y, ...) { (y + mean(y)) / 2 }, "sigma" = function(y, ...) { rep(sd(y), length(y)) } ), "mean" = function(par) par$mu, "variance" = function(par) par$sigma^2, "valid.response" = function(x) { if(is.factor(x) | is.character(x)) stop("the response should be numeric!") return(TRUE) } ) class(fam) <- "gamlss2.family" return(fam) } ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) ## estimate model m1 <- gamlss2(f, data = abdom, family = Normal) ## plot estimated effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict parameters par <- predict(m1) ## predict quantiles pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## visualize plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## another example that defines only the density ## function; in this case all derivatives are ## approximated numerically. For residual diagnostics, ## the $cdf() and $quantile() functions are needed as well Gamma <- function(...) { fam <- list( "names" = c("mu", "sigma"), "links" = c("mu" = "log", "sigma" = "log"), "pdf" = function(y, par, log = FALSE, ...) { shape <- par$sigma scale <- par$mu/par$sigma dgamma(y, shape = shape, scale = scale, log = log) }, "cdf" = function(y, par, lower.tail = TRUE, log.p = FALSE) { shape <- par$sigma scale <- par$mu/par$sigma pgamma(y, shape = shape, scale = scale, lower.tail = lower.tail, log.p = log.p) }, "quantile" = function(p, par, lower.tail = TRUE, log.p = FALSE) { shape <- par$sigma scale <- par$mu/par$sigma qgamma(p, shape = shape, scale = scale, lower.tail = lower.tail, log.p = log.p) } ) class(fam) <- "gamlss2.family" return(fam) } ## example using the Munich rent data data("rent", package = "gamlss.data") ## model formula f <- R ~ ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") | ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") ## estimate model m2 <- gamlss2(f, data = rent, family = Gamma) ## visualize estimated effects plot(m2, which = "effects") ## diagnostics, needs the $cdf() and $quantile() function! plot(m2, which = "resid") ## End(Not run)
The GDF (gamlss2 Distribution Family) is a unified class with corresponding methods that represent all distributional families supported by the gamlss2 package. It enables seamless integration with the distributions3 workflow and provides a consistent interface for model fitting and distributional computations.
GDF(family, parameters)GDF(family, parameters)
family |
character. Name of a |
parameters |
numeric, matrix, list or data frame, see the examples. |
The S3 class GDF is a slightly more general implementation of the S3 class
GAMLSS tailored for gamlss2. For details please see the documentation of GAMLSS
A GDF object, inheriting from distribution.
Zeileis A, Lang MN, Hayes A (2022). “distributions3: From Basic Probability to Probabilistic Regression.” Presented at useR! 2022 - The R User Conference. Slides, video, vignette, code at https://www.zeileis.org/news/user2022/.
## package and random seed library("distributions3") set.seed(6020) ## one normal distribution X <- GDF("NO", c(mu = 1, sigma = 2)) X ## two normal distributions X <- GDF("NO", cbind(c(1, 1.5), c(0.6, 1.2))) X ## three Weibull distributions X <- GDF("WEI", list(mu = c(1, 1, 2), sigma = c(1, 2, 2))) X ## see ?gamlss.dist::GAMLSS for the remainder of this example ## example using gamlss2 data("abdom", package = "gamlss.data") ## estimate model m <- gamlss2(y ~ s(x) | . | . | ., data = abdom, family = GA) ## extract, also works with newdata d <- data.frame( "mean" = mean(m), "median" = median(m), "q95" = quantile(m, probs = 0.95), "variance" = variance(m), "pdf" = pdf(m), "cdf" = cdf(m) ) print(head(d))## package and random seed library("distributions3") set.seed(6020) ## one normal distribution X <- GDF("NO", c(mu = 1, sigma = 2)) X ## two normal distributions X <- GDF("NO", cbind(c(1, 1.5), c(0.6, 1.2))) X ## three Weibull distributions X <- GDF("WEI", list(mu = c(1, 1, 2), sigma = c(1, 2, 2))) X ## see ?gamlss.dist::GAMLSS for the remainder of this example ## example using gamlss2 data("abdom", package = "gamlss.data") ## estimate model m <- gamlss2(y ~ s(x) | . | . | ., data = abdom, family = GA) ## extract, also works with newdata d <- data.frame( "mean" = mean(m), "median" = median(m), "q95" = quantile(m, probs = 0.95), "variance" = variance(m), "pdf" = pdf(m), "cdf" = cdf(m) ) print(head(d))
Constructor for estimating penalized model terms using the glmnet package.
gnet(formula, ...)gnet(formula, ...)
formula |
A formula specifying the covariates that should be estimated using the glmnet implementation. |
... |
Control arguments passed to |
gnet() constructs a model.matrix representation
of the supplied formula and delegates estimation to glmnet
inside the backfitting algorithm. The optimal penalty parameter is selected
using an information criterion, with criterion = "bic" used by default.
A special model-term object used internally by gamlss2.
## Not run: data("rent", package = "gamlss.data") ## transform to categorical rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10), include.lowest = TRUE) rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10), include.lowest = TRUE) ## model formula, for each parameter we use ## the same gnet() model term f <- R ~ gnet(~Flc + Ac + loc) | . | . | . ## estimate model m <- gamlss2(f, data = rent, family = BCT) ## summary with edf summary(m) ## plot coefficient paths sp <- specials(m, model = "mu", term = "gnet(", element = "model") print(class(sp)) plot(sp) ## End(Not run)## Not run: data("rent", package = "gamlss.data") ## transform to categorical rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10), include.lowest = TRUE) rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10), include.lowest = TRUE) ## model formula, for each parameter we use ## the same gnet() model term f <- R ~ gnet(~Flc + Ac + loc) | . | . | . ## estimate model m <- gamlss2(f, data = rent, family = BCT) ## summary with edf summary(m) ## plot coefficient paths sp <- specials(m, model = "mu", term = "gnet(", element = "model") print(class(sp)) plot(sp) ## End(Not run)
This dataset contains daily traffic counts near Sonnenberg in the Harz region of Germany. It covers nearly three years, from 2021-01-01 to 2023-11-30, and combines traffic counts with weather covariates. The data are useful for illustrating seasonal count regression models, for example with cyclic smoothers over the day of the year.
data("HarzTraffic", package = "gamlss2")data("HarzTraffic", package = "gamlss2")
A data frame containing 1057 observations on 16 variables.
Date, the date of the record.
Integer, the day of the year extracted from date. This is
convenient for modeling within-year seasonality.
Integer, the number of motorcycles on that day.
Integer, the number of cars on that day.
Integer, the number of trucks on that day.
Integer, the number of other vehicles on that day.
Numeric, minimum temperature in .
Numeric, maximum temperature in .
Numeric, mean temperature in .
Numeric, mean relative humidity in percent.
Numeric, average dewpoint temperature in .
Numeric, average cloud cover in percent.
Numeric, amount of precipitation in mm (snow and rain).
Numeric, sunshine duration in minutes.
Numeric, mean wind speed in m/s.
Numeric, maximum wind speed in m/s.
The traffic variables record daily counts by vehicle type. The weather
variables summarize daily conditions measured at a nearby station. In package
examples, bikes is often modeled as a seasonal response because
motorcycle traffic shows a pronounced annual cycle and substantial variation
across weather conditions.
Weather Data:
Deutscher Wetterdienst (DWD), Climate Data Center (CDC).
CC BY 4.0
Wernigerode (5490; Sachsen-Anhalt)
10.7686/51.8454/233 (lon, lat, alt, EPSG 4326)
Traffic Data:
Bundesanstalt für Strassenwesen (BASt)
CC BY 4.0
## seasonal variation of motorcycle counts at Sonnenberg/Harz data("HarzTraffic", package = "gamlss2") plot(bikes ~ yday, data = HarzTraffic) ## count distribution barplot(table(HarzTraffic$bikes)) ## negative binomial seasonal model using cyclic splines m <- gamlss2(bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"), data = HarzTraffic, family = NBI) ## visualize effects plot(m) ## residual diagnostics plot(m, which = "resid") ## fitted parameters for each day of the year nd <- data.frame(yday = 1:365) ## corresponding quantiles p <- quantile(m, newdata = nd, probs = c(0.05, 0.5, 0.95)) ## visualization plot(bikes ~ yday, data = HarzTraffic, pch = 19, col = gray(0.1, alpha = 0.3)) matlines(nd$yday, p, lty = c(2, 1, 2), lwd = 2, col = 4)## seasonal variation of motorcycle counts at Sonnenberg/Harz data("HarzTraffic", package = "gamlss2") plot(bikes ~ yday, data = HarzTraffic) ## count distribution barplot(table(HarzTraffic$bikes)) ## negative binomial seasonal model using cyclic splines m <- gamlss2(bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"), data = HarzTraffic, family = NBI) ## visualize effects plot(m) ## residual diagnostics plot(m, which = "resid") ## fitted parameters for each day of the year nd <- data.frame(yday = 1:365) ## corresponding quantiles p <- quantile(m, newdata = nd, probs = c(0.05, 0.5, 0.95)) ## visualization plot(bikes ~ yday, data = HarzTraffic, pch = 19, col = gray(0.1, alpha = 0.3)) matlines(nd$yday, p, lty = c(2, 1, 2), lwd = 2, col = 4)
This function implements the two-parameter Kumaraswamy family for responses in (0, 1).
## the Kumaraswamy family Kumaraswamy(a.link = shiftlog, b.link = shiftlog, ...) KS(a.link = shiftlog, b.link = shiftlog, ...) ## the exp(x) + shift link specification shiftlog(shift = 1)## the Kumaraswamy family Kumaraswamy(a.link = shiftlog, b.link = shiftlog, ...) KS(a.link = shiftlog, b.link = shiftlog, ...) ## the exp(x) + shift link specification shiftlog(shift = 1)
a.link |
Character or function, the link function to be used for parameter |
b.link |
Character or function, the link function to be used for parameter |
shift |
Numeric, the shift parameter to be used for the link. |
... |
Not used. |
The Kumaraswamy distribution is a continuous distribution defined on the interval (0, 1). The probability density function is
is the response, and are non-negative parameters.
The shiftlog link function is given by:
This is the default, since the mode of the distribution is only defined for , .
The family returns an object of class "gamlss2.family".
Function shiftlog() returns a link specification
object of class "link-glm".
Kumaraswamy P (1980). “A Generalized Probability Density Function for Double-Bounded Random Processes.” Journal of Hydrology, 46(1), 79–88. doi:10.1016/0022-1694(80)90036-0
## create family object with ## different link specifications fam <- Kumaraswamy(a.link = shiftlog, b.link = "log") ## simulate data set.seed(123) n <- 1000 d <- data.frame("x" = runif(n, -pi, pi)) ## true parameters par <- data.frame( "a" = exp(1.2 + sin(d$x)) + 1, "b" = 1 ) ## sample response d$y <- fam$r(1, par) ## estimate model using the Kumaraswamy family m <- gamlss2(y ~ s(x), data = d, family = fam) ## plot estimated effect plot(m) ## plot residual diagnostics plot(m, which = "resid") ## predict quantiles p <- quantile(m, probs = c(0.05, 0.1, 0.5)) ## plot plot(d, pch = 19, col = adjustcolor(1, 0.3)) i <- order(d$x) matlines(d$x[i], p[i, ], lty = 1, col = 4, lwd = 2) axis(4, at = p[i, ][1, ], labels = names(p))## create family object with ## different link specifications fam <- Kumaraswamy(a.link = shiftlog, b.link = "log") ## simulate data set.seed(123) n <- 1000 d <- data.frame("x" = runif(n, -pi, pi)) ## true parameters par <- data.frame( "a" = exp(1.2 + sin(d$x)) + 1, "b" = 1 ) ## sample response d$y <- fam$r(1, par) ## estimate model using the Kumaraswamy family m <- gamlss2(y ~ s(x), data = d, family = fam) ## plot estimated effect plot(m) ## plot residual diagnostics plot(m, which = "resid") ## predict quantiles p <- quantile(m, probs = c(0.05, 0.1, 0.5)) ## plot plot(d, pch = 19, col = adjustcolor(1, 0.3)) i <- order(d$x) matlines(d$x[i], p[i, ], lty = 1, col = 4, lwd = 2) axis(4, at = p[i, ][1, ], labels = names(p))
Constructor function and plotting for Lasso penalized model terms for GAMLSS.
## model term constructor function la(x, type = 1, const = 1e-05, ...) ## plotting function plot_lasso(x, terms = NULL, which = c("criterion", "coefficients"), zoom = c(3, 4), spar = TRUE, ...)## model term constructor function la(x, type = 1, const = 1e-05, ...) ## plotting function plot_lasso(x, terms = NULL, which = c("criterion", "coefficients"), zoom = c(3, 4), spar = TRUE, ...)
x |
For function |
type |
Integer or character, the type of the Lasso penalty.
|
const |
Numeric, the constant that is used for approximating the absolute function. |
terms |
Character or integer, the model term that should be plotted. The
default |
which |
Character, should the information criterion or the coefficient paths be plotted? See the examples. |
zoom |
Numeric vector of length 2, the zooming factors for plotting information criteria curves and coefficient paths. The first element sets the distance from the optimum shrinkage parameter lambda to the left side, and the second element to the right side, respectively. |
spar |
Logical, should plotting parameters be automatically set in
|
... |
For function |
To implement the Lasso penalty, an approximation of the absolute value function is used, following the approach by Oelker and Tutz (2015). This enables the use of standard Newton-Raphson-type algorithms for estimation. Each Lasso model term has its own shrinkage parameter, allowing a mix of different penalty types within the model. The framework builds on the methodology of Groll et al. (2019), where coefficients are updated through iteratively reweighted least squares (IWLS). This is feasible due to the absolute function approximation, which results in a quadratic penalty matrix similar to that used in penalized splines. By default, the shrinkage parameters are selected using the Bayesian Information Criterion (BIC).
The choice of type determines the structure that is penalized:
"normal" penalizes individual coefficients and is suitable for
numeric covariates or general design matrices.
"group" penalizes one complete block of coefficients jointly and
is mainly useful for factor effects or other grouped design matrices.
"ordinal" uses an adjacent-differences fusion penalty and is intended
for ordered factor effects or matrices whose columns represent ordered levels.
"nominal" uses a pairwise-differences fusion penalty and is intended
for unordered factor effects or matrices whose columns represent unordered levels.
For numerical stability and comparable shrinkage across terms, la() can
scale the underlying design matrix internally. The exact scaling depends on the
selected penalty type. For "ordinal" and "nominal" terms, an
additional ridge component is included by default.
The la() function is used internally within gamlss2
and provides the necessary details for estimating Lasso-type model terms.
Essentially, it serves as a special model term, as outlined
in specials.
Currently, the plot_lasso() function does not return any output.
Andreas Groll, Julien Hambuckers, Thomas Kneib, and Nikolaus Umlauf (2019). Lasso-type penalization in the framework of generalized additive models for location, scale and shape. Computational Statistics & Data Analysis. doi:10.1016/j.csda.2019.06.005
Oelker Margreth-Ruth and Tutz Gerhard (2015). A uniform framework for combination of penalties in generalized structured models. Adv Data Anal Classif. doi:10.1007/s11634-015-0205-y
## Not run: ## load data data("rent", package = "gamlss.data") ## transform numeric to factor variables rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10), include.lowest = TRUE) rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10), include.lowest = TRUE) ## set up the model formula for a BCT model f <- R ~ la(Flc,type=3) + la(Ac,type=3) + la(loc,type=4) | . | . ## estimation m1 <- gamlss2(f, data = rent, family = BCT) ## summary, shows the estimated degrees of freedom ## for each model term summary(m1) ## plot estimated coefficients plot(m1) ## plot information criteria curves ## for each model term plot_lasso(m1) ## plot parameter paths plot_lasso(m1, which = "coefficients") ## plot a single model term plot_lasso(m1, which = "coefficients", term = 5) ## same with plot_lasso(m1, which = "coefficients", term = "sigma.la(Ac") ## zoom out plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7)) ## set names plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7), names = c("A", "B", "C")) ## set title plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7), main = "Fused Lasso") ## simulated example using the nominal fused lasso ## and a matrix as argument for la() set.seed(1328) ## number of observations and covariates n <- 500 k <- 50 ## model matrix X <- matrix(rnorm(n * k), n, k) colnames(X) <- paste0("x", 1:k) ## true coefficients beta <- list( "mu" = rbinom(k, 1, 0.1), "sigma" = rbinom(k, 1, 0.1) * 0.3 ) ## parameters mu <- X %*% beta$mu sigma <- exp(-1 + X %*% beta$sigma) ## response y <- rnorm(n, mean = mu, sd = sigma) ## model formula with nominal fused lasso f <- y ~ la(X,type=4) | la(X,type=4) ## estimate model incl. extra ridge penalty ## for all la() model terms m2 <- gamlss2(f, add_ridge = TRUE) ## plot information criteria curves plot_lasso(m2) ## coefficient paths plot_lasso(m2, which = "coefficients") ## zoom out plot_lasso(m2, which = "coefficients", zoom = c(8, 9)) ## extract coefficients cb <- coef(m2, full = TRUE) ## compare (without intercept) cb_mu <- cb[grep("mu.", names(cb))][-1] cb_sigma <- cb[grep("sigma.", names(cb))][-1] ## true positive rate tp <- mean(c(cb_mu[beta$mu > 0] > 0, cb_sigma[beta$sigma > 0] > 0)) ## false positive rate, needs threshold thres <- 0.01 fp <- mean(c(abs(cb_mu[beta$mu == 0]) > thres, abs(cb_sigma[beta$sigma == 0]) > thres)) ## End(Not run)## Not run: ## load data data("rent", package = "gamlss.data") ## transform numeric to factor variables rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10), include.lowest = TRUE) rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10), include.lowest = TRUE) ## set up the model formula for a BCT model f <- R ~ la(Flc,type=3) + la(Ac,type=3) + la(loc,type=4) | . | . ## estimation m1 <- gamlss2(f, data = rent, family = BCT) ## summary, shows the estimated degrees of freedom ## for each model term summary(m1) ## plot estimated coefficients plot(m1) ## plot information criteria curves ## for each model term plot_lasso(m1) ## plot parameter paths plot_lasso(m1, which = "coefficients") ## plot a single model term plot_lasso(m1, which = "coefficients", term = 5) ## same with plot_lasso(m1, which = "coefficients", term = "sigma.la(Ac") ## zoom out plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7)) ## set names plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7), names = c("A", "B", "C")) ## set title plot_lasso(m1, which = "coefficients", term = 5, zoom = c(8, 7), main = "Fused Lasso") ## simulated example using the nominal fused lasso ## and a matrix as argument for la() set.seed(1328) ## number of observations and covariates n <- 500 k <- 50 ## model matrix X <- matrix(rnorm(n * k), n, k) colnames(X) <- paste0("x", 1:k) ## true coefficients beta <- list( "mu" = rbinom(k, 1, 0.1), "sigma" = rbinom(k, 1, 0.1) * 0.3 ) ## parameters mu <- X %*% beta$mu sigma <- exp(-1 + X %*% beta$sigma) ## response y <- rnorm(n, mean = mu, sd = sigma) ## model formula with nominal fused lasso f <- y ~ la(X,type=4) | la(X,type=4) ## estimate model incl. extra ridge penalty ## for all la() model terms m2 <- gamlss2(f, add_ridge = TRUE) ## plot information criteria curves plot_lasso(m2) ## coefficient paths plot_lasso(m2, which = "coefficients") ## zoom out plot_lasso(m2, which = "coefficients", zoom = c(8, 9)) ## extract coefficients cb <- coef(m2, full = TRUE) ## compare (without intercept) cb_mu <- cb[grep("mu.", names(cb))][-1] cb_sigma <- cb[grep("sigma.", names(cb))][-1] ## true positive rate tp <- mean(c(cb_mu[beta$mu > 0] > 0, cb_sigma[beta$sigma > 0] > 0)) ## false positive rate, needs threshold thres <- 0.01 fp <- mean(c(abs(cb_mu[beta$mu == 0]) > thres, abs(cb_sigma[beta$sigma == 0]) > thres)) ## End(Not run)
This function is used with the family functions in gamlss2().
Given the name of a link, it returns a link function, an inverse
link function, the derivative , and a function for domain
checking. Note that make.link2() is slightly more flexible and also allows
functions as arguments.
make.link2(link)make.link2(link)
link |
A character string, see function |
A list of class "link_gamlss2" containing the following components:
linkfun |
Link function |
linkinv |
Inverse link function |
mu.eta |
Derivative |
valideta |
Function |
name |
A character string representing the name of the link function. |
make.link, gamlss2, gamlss2.family.
## character specification utils::str(make.link2("logit")) ## functions utils::str(make.link2(softplus))## character specification utils::str(make.link2("logit")) ## functions utils::str(make.link2(softplus))
gamlss2 ModelThe function mcmc() runs MCMC sampling for a fitted
gamlss2 model. It reuses the fitted model object and merges the
MCMC output back into the original object so that standard methods such as
summary and plot work as expected.
mcmc(object, n.iter = 1200, burnin = 200, thin = 1)mcmc(object, n.iter = 1200, burnin = 200, thin = 1)
object |
An object of class |
n.iter |
Integer, the total number of MCMC iterations. |
burnin |
Integer, the burn-in period. |
thin |
Integer, the thinning parameter for saved samples. |
The function sets n.iter, burnin, and thin in the model
control list and obtains starting values from the fitted object, including
smoothing parameters. The MCMC sampler BS is then called and
the resulting samples and posterior summaries are merged into the original
model object.
This approach ensures that the result retains the terms,
call, and related components required by extractor and plotting
methods. If the fitted model object does not contain the design information
required for MCMC sampling, for example because a reduced control setting was
used, sampling cannot be performed.
A model object inheriting from "gamlss2" containing posterior samples
and summary information. The object also contains:
samples: posterior draws for coefficients and smoothing variances;
results: model results derived from posterior means;
elapsed: elapsed sampling time.
Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325
## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) | 1 | 1 ## estimate model using backfitting m1 <- gamlss2(f, data = abdom, family = BCT) ## run MCMC sampling set.seed(1328) m2 <- mcmc(m1) ## generate more samples m2 <- mcmc(m2) ## posterior summary summary(m2) ## plot estimated effects plot(m2) ## plot samples plot(m2, which = "samples") plot(m2, which = "samples", ask = TRUE) plot(m2, which = "samples", ask = TRUE, model = "mu", lag.max = 120, term = "s(x)") ## samples can be extracted for further inspection print(head(m2$samples)) ## compare estimated mean effects plot(c(m1, m2)) ## predict parameters using samples pm <- predict(m2, FUN = mean) print(head(pm)) psd <- predict(m2, FUN = sd) print(head(psd)) ## End(Not run)## Not run: ## load the abdominal circumference data data("abdom", package = "gamlss.data") ## specify the model formula f <- y ~ s(x) | s(x) | 1 | 1 ## estimate model using backfitting m1 <- gamlss2(f, data = abdom, family = BCT) ## run MCMC sampling set.seed(1328) m2 <- mcmc(m1) ## generate more samples m2 <- mcmc(m2) ## posterior summary summary(m2) ## plot estimated effects plot(m2) ## plot samples plot(m2, which = "samples") plot(m2, which = "samples", ask = TRUE) plot(m2, which = "samples", ask = TRUE, model = "mu", lag.max = 120, term = "s(x)") ## samples can be extracted for further inspection print(head(m2$samples)) ## compare estimated mean effects plot(c(m1, m2)) ## predict parameters using samples pm <- predict(m2, FUN = mean) print(head(pm)) psd <- predict(m2, FUN = sd) print(head(psd)) ## End(Not run)
The MN() family implements a multinomial logit model for
categorical responses with unordered categories.
MN(k)MN(k)
k |
An integer specifying the number of response categories. |
The response must be a factor with exactly k levels.
The first level is treated as the reference category.
For categories , each predictor models the
log-odds relative to the reference category.
The family provides analytical first and second derivatives of the
log-likelihood with respect to the predictors and a
probabilities() method for fitted category probabilities.
A "gamlss2.family" object to be used with gamlss2.
## Not run: data("marital.nz", package = "VGAM") ## multinomial model m <- gamlss2(mstatus ~ s(age) | . | ., data = marital.nz, family = MN(4)) ## plot effects plot(m) ## predict probabilities par <- predict(m) probs <- family(m)$probabilities(par) names(probs) <- levels(marital.nz$mstatus) ## plot i <- order(marital.nz$age) matplot(marital.nz$age[i], probs[i, ], type = "l", lty = 1, lwd = 2, xlab = "Age", ylab = "Probabilities") legend("center", names(probs), lty = 1, lwd = 2, col = 1:4) ## End(Not run)## Not run: data("marital.nz", package = "VGAM") ## multinomial model m <- gamlss2(mstatus ~ s(age) | . | ., data = marital.nz, family = MN(4)) ## plot effects plot(m) ## predict probabilities par <- predict(m) probs <- family(m)$probabilities(par) names(probs) <- levels(marital.nz$mstatus) ## plot i <- order(marital.nz$age) matplot(marital.nz$age[i], probs[i, ], type = "l", lty = 1, lwd = 2, xlab = "Age", ylab = "Probabilities") legend("center", names(probs), lty = 1, lwd = 2, col = 1:4) ## End(Not run)
The generic function extracts the selected model terms from a fitted selection procedure and returns the corresponding reduced model formula.
new_formula(object, ...)new_formula(object, ...)
object |
A fitted model object produced by a selection routine such as
|
... |
Not used yet. |
The function is intended for fitted models that store the outcome of an
automatic term-selection step in object$selection$formula. It extracts
the selected right-hand side terms and restores the original response on the
left-hand side.
For multi-parameter distributional models, the returned object is a
Formula with one right-hand side part for each
distribution parameter. This makes it convenient to inspect the final model
or to refit it without repeating the selection step.
A Formula containing the selected model terms.
step_gamlss2, select_gamlss2, gamlss2
## no runnable example; see \code{\link{step_gamlss2}} and ## \code{\link{select_gamlss2}} for end-to-end usage## no runnable example; see \code{\link{step_gamlss2}} and ## \code{\link{select_gamlss2}} for end-to-end usage
Defines the ordered logistic (cumulative logit) family for modeling ordinal response variables
within the gamlss2 framework. This implementation supports flexible modeling of the
location and threshold (cutpoint) parameters, including effects of covariates.
OL(k) ologit(k)OL(k) ologit(k)
k |
An integer specifying the number of response categories. Must be |
This family implements a cumulative logit model for ordinal responses with k
ordered categories. The linear predictor models a latent location parameter, and the
cutpoints between response categories are parameterized via a monotonic transformation:
The first cutpoint is modeled directly (theta1).
The remaining cutpoints are expressed as theta1 + exp(delta_j) for j = 2, ..., k - 1,
ensuring that the thresholds remain ordered.
The OL() family supports modeling the location and threshold differences (delta_j)
as functions of covariates using additive predictors in gamlss2 via the "|" formula interface.
The family returns an object of class "gamlss2.family", which includes methods for evaluating
the log-likelihood, simulating from the model, and computing predicted probabilities.
ologit() is retained as a compatibility alias for OL().
A "gamlss2.family" object to be used with gamlss2.
## example using the housing data from the MASS package: library("MASS") ## fit standard cumulative logit model using polr() m1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing) summary(m1) ## convert response to integer for use with gamlss2 housing$Satint <- as.integer(housing$Sat) ## fit equivalent model using gamlss2 m2 <- gamlss2(Satint ~ Infl + Type + Cont, data = housing, weights = Freq, family = OL(k = 3)) summary(m2) ## compare coefficients coef(m1) coef(m2) ## predict class probabilities pm1 <- predict(m1, type = "p") pm2 <- predict(m2) pm2 <- family(m2)$probabilities(pm2) print(head(pm1)) print(head(pm2))## example using the housing data from the MASS package: library("MASS") ## fit standard cumulative logit model using polr() m1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing) summary(m1) ## convert response to integer for use with gamlss2 housing$Satint <- as.integer(housing$Sat) ## fit equivalent model using gamlss2 m2 <- gamlss2(Satint ~ Infl + Type + Cont, data = housing, weights = Freq, family = OL(k = 3)) summary(m2) ## compare coefficients coef(m1) coef(m2) ## predict class probabilities pm1 <- predict(m1, type = "p") pm2 <- predict(m2) pm2 <- family(m2)$probabilities(pm2) print(head(pm1)) print(head(pm2))
Estimation of P-splines using an efficient local maximum likelihood approach to automatically
select the smoothing parameter. According to the inventors of P-splines,
pb stands for "penalized beta" splines or "Paul and Brian".
pb(x, k = 20, ...)pb(x, k = 20, ...)
x |
The variable that should be used for estimation. |
k |
The dimension of the B-spline basis to represent the smooth term. |
... |
Further arguments passed to function |
Function pb() is an internal wrapper function that calls s
to set up a smooth specification object that can be used for model fitting with
gamlss2. Using pb(), an efficient local maximum likelihood
approach is used to estimate the smoothing parameter. See the reference for details.
The function returns a smooth specification object of class "ps.smooth.spec", see
also smooth.construct.ps.smooth.spec.
Eilers PHC, Marx BD (1996). “Flexible Smoothing with B-Splines and Penalties.” Statistical Science, 11(2), 89–121. doi:10.1214/ss/1038425655
Rigby RA, Stasinopoulos DM (2014). “Automatic Smoothing Parameter Selection in GAMLSS with an Application to Centile Estimation.” Statistical Methods in Medical Research, 23(4), 318–332. doi:10.1177/0962280212473302
gamlss2, smooth.construct.ps.smooth.spec
## Not run: ## load head circumference data data("dbhh", package = "gamlss.data") ## specify the model formula f <- head ~ pb(age) | pb(age) | pb(age) | pb(age) ## estimate model m <- gamlss2(f, data = dbhh, family = BCT) ## visualize estimated effects plot(m, which = "effects") ## plot diagnostics plot(m, which = "resid") ## predict quantiles pq <- quantile(m, probs = c(0.05, 0.5, 0.95)) ## plot plot(head ~ age, data = dbhh, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(dbhh$age, pq, lty = 1, col = 4) ## End(Not run)## Not run: ## load head circumference data data("dbhh", package = "gamlss.data") ## specify the model formula f <- head ~ pb(age) | pb(age) | pb(age) | pb(age) ## estimate model m <- gamlss2(f, data = dbhh, family = BCT) ## visualize estimated effects plot(m, which = "effects") ## plot diagnostics plot(m, which = "resid") ## predict quantiles pq <- quantile(m, probs = c(0.05, 0.5, 0.95)) ## plot plot(head ~ age, data = dbhh, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(dbhh$age, pq, lty = 1, col = 4) ## End(Not run)
Plotting methods for objects of class "gamlss2" and
"gamlss2.list". These methods support effect plots for model
terms and residual diagnostics. Effect plots for model terms with more
than two covariates are not supported; in such cases, use
predict and plot the results manually.
## S3 method for class 'gamlss2' plot(x, parameter = NULL, which = "effects", terms = NULL, scale = TRUE, spar = TRUE, ...) ## S3 method for class 'gamlss2.list' plot(x, parameter = NULL, which = "effects", terms = NULL, spar = TRUE, legend = TRUE, ...)## S3 method for class 'gamlss2' plot(x, parameter = NULL, which = "effects", terms = NULL, scale = TRUE, spar = TRUE, ...) ## S3 method for class 'gamlss2.list' plot(x, parameter = NULL, which = "effects", terms = NULL, spar = TRUE, legend = TRUE, ...)
x |
An object of class |
parameter |
Character or integer specifying for which distribution parameter the plots should be created.
The arguments |
which |
Character or integer selecting the type of plot.
|
terms |
Character or integer. Which model term(s) should be plotted? |
scale |
Logical. If |
spar |
Logical. Should graphical parameters be set automatically? |
legend |
Logical. Should a legend be added for plots with multiple models? |
... |
Additional graphical arguments such as
|
For objects of class "gamlss2", which = "effects" is the default
and plots the partial effects stored in results(object). If no effect
information is available, residual diagnostics are shown instead.
For objects of class "gamlss2.list", plots compare effect estimates
across several fitted models. Residual diagnostics are only available for
single-model objects.
## Not run: ## load data data("film90", package = "gamlss.data") ## model formula f <- lborev1 ~ s(lboopen) + s(lnosc) | . | . | . ## estimate model m1 <- gamlss2(f, data = film90, family = BCCG) ## plot effects (default) plot(m1) ## plot specific effect plot(m1, parameter = "sigma") plot(m1, model = "sigma") plot(m1, model = "nu", terms = 1) plot(m1, model = "nu", terms = 2) plot(m1, model = "nu", terms = "lnosc") plot(m1, terms = "lnosc") ## plot all residual diagnostics plot(m1, which = "resid") ## single diagnostic plots plot(m1, which = "hist-resid") plot(m1, which = "qq-resid") plot(m1, which = "wp-resid") plot(m1, which = "scatter-resid") ## if available, plot stored stepwise-selection path ## plot(m1, which = "selection") ## estimate another model m2 <- gamlss2(f, data = film90, family = BCPE) ## compare estimated effects plot(c(m1, m2)) plot(c(m1, m2), terms = "lboopen", col = c(1, 4), lwd = 3, lty = 1, pos = c("topleft", "topright", "bottomleft", "bottomright")) plot(c(m1, m2), model = "sigma") plot(c(m1, m2), model = "sigma", terms = 2) plot(c(m1, m2), model = c("mu", "nu")) ## End(Not run)## Not run: ## load data data("film90", package = "gamlss.data") ## model formula f <- lborev1 ~ s(lboopen) + s(lnosc) | . | . | . ## estimate model m1 <- gamlss2(f, data = film90, family = BCCG) ## plot effects (default) plot(m1) ## plot specific effect plot(m1, parameter = "sigma") plot(m1, model = "sigma") plot(m1, model = "nu", terms = 1) plot(m1, model = "nu", terms = 2) plot(m1, model = "nu", terms = "lnosc") plot(m1, terms = "lnosc") ## plot all residual diagnostics plot(m1, which = "resid") ## single diagnostic plots plot(m1, which = "hist-resid") plot(m1, which = "qq-resid") plot(m1, which = "wp-resid") plot(m1, which = "scatter-resid") ## if available, plot stored stepwise-selection path ## plot(m1, which = "selection") ## estimate another model m2 <- gamlss2(f, data = film90, family = BCPE) ## compare estimated effects plot(c(m1, m2)) plot(c(m1, m2), terms = "lboopen", col = c(1, 4), lwd = 3, lty = 1, pos = c("topleft", "topright", "bottomleft", "bottomright")) plot(c(m1, m2), model = "sigma") plot(c(m1, m2), model = "sigma", terms = 2) plot(c(m1, m2), model = c("mu", "nu")) ## End(Not run)
Methods for gamlss2 model objects that extract fitted values or predictions for distribution parameters, additive predictors, individual terms, or the mean of the response distribution.
## S3 method for class 'gamlss2' predict(object, model = NULL, newdata = NULL, type = c("parameter", "link", "response", "terms"), terms = NULL, se.fit = FALSE, drop = TRUE, ...)## S3 method for class 'gamlss2' predict(object, model = NULL, newdata = NULL, type = c("parameter", "link", "response", "terms"), terms = NULL, se.fit = FALSE, drop = TRUE, ...)
object |
A model object of class |
model |
Character. Which distribution parameter(s) should be predicted?
This can be one or more of |
newdata |
Data frame. Optionally, a new data frame in which to look for variables with which to predict. If omitted, the original observations are used. |
type |
Character. Which type of prediction should be computed?
Use |
terms |
Character. Which of the terms in the additive predictor(s) should be included? By default all terms are included. |
se.fit |
Logical. Should standard errors for the predictions be included?
Standard errors are computed by simulating from the approximate multivariate normal
distribution of the maximum likelihood estimates. The number of simulations is controlled
by the argument |
drop |
Logical. Should the predictions be simplified to a vector
if possible ( |
... |
Further control arguments. Supported aliases are |
Prediction for a gamlss2 object proceeds in stages. First, the
required model terms are evaluated for the original data or for newdata.
These terms are then added to form the full additive predictor for each
distribution parameter. Applying the inverse link function yields the predicted
distribution parameters. If requested, the mean of the corresponding response
distribution is computed in a final step.
If se.fit = TRUE, or if a summary function is supplied via FUN,
predictions are based on simulation draws from the approximate multivariate
normal distribution of the fitted coefficients, unless the object already
contains stored draws in samples. For objects of class "bamlss2",
stored posterior draws are used directly.
For type = "terms", the terms argument is matched against linear
and special-term labels. By default, partial matching is used; set
nogrep = TRUE in ... to require exact matching.
See also prodist.gamlss2 for setting up a full distributions3 object
from which moments, probabilities, quantiles, or random numbers can be obtained.
If drop = FALSE, a data frame is returned. If drop = TRUE
(the default), the result may be simplified to a numeric vector when possible.
For type = "terms", a data frame is returned with one column per selected
term. For se.fit = TRUE, the returned object contains fitted values and
corresponding simulation-based standard errors.
predict, prodist.gamlss2, quantile.gamlss2
## fit heteroscedastic normal GAMLSS model ## stopping distance (ft) explained by speed (mph) data("cars", package = "datasets") m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO) ## new data for predictions nd <- data.frame(speed = c(10, 20, 30)) ## default: model parameter(s) for all model parts predict(m, newdata = nd) ## additive predictors on the link scale predict(m, newdata = nd, type = "link") ## mean of the response distribution predict(m, newdata = nd, type = "response") ## model parameter(s) predict(m, newdata = nd, model = "sigma") predict(m, newdata = nd, model = "sigma", drop = FALSE) ## individual terms in additive predictor(s) predict(m, newdata = nd, type = "terms", model = "sigma") predict(m, newdata = nd, type = "terms", model = "sigma", terms = "s(speed)") ## predict quantiles quantile(m, newdata = nd, probs = c(0.1, 0.5, 0.9)) ## standard errors predict(m, newdata = nd, se.fit = TRUE, R = 200)## fit heteroscedastic normal GAMLSS model ## stopping distance (ft) explained by speed (mph) data("cars", package = "datasets") m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO) ## new data for predictions nd <- data.frame(speed = c(10, 20, 30)) ## default: model parameter(s) for all model parts predict(m, newdata = nd) ## additive predictors on the link scale predict(m, newdata = nd, type = "link") ## mean of the response distribution predict(m, newdata = nd, type = "response") ## model parameter(s) predict(m, newdata = nd, model = "sigma") predict(m, newdata = nd, model = "sigma", drop = FALSE) ## individual terms in additive predictor(s) predict(m, newdata = nd, type = "terms", model = "sigma") predict(m, newdata = nd, type = "terms", model = "sigma", terms = "s(speed)") ## predict quantiles quantile(m, newdata = nd, probs = c(0.1, 0.5, 0.9)) ## standard errors predict(m, newdata = nd, se.fit = TRUE, R = 200)
Methods for gamlss2 model objects for extracting fitted (in-sample) or predicted (out-of-sample) probability distributions as distributions3 objects.
## S3 method for class 'gamlss2' prodist(object, ...)## S3 method for class 'gamlss2' prodist(object, ...)
object |
A model object of class |
... |
Arguments passed on to |
To facilitate making probabilistic forecasts based on gamlss2
model objects, the prodist method extracts fitted
or predicted probability distribution objects. Internally, the
predict.gamlss2 method is used first to obtain the distribution
parameters (mu, sigma, tau, nu, or a subset thereof).
Subsequently, the corresponding distribution object is set up as a
GDF object, a unified gamlss2 distribution interface for
the workflow provided by the distributions3 package (see Zeileis et al. 2022).
Note that these probability distributions only reflect the random variation in the dependent variable based on the model employed (and its associated distributional assumption for the dependent variable). This does not capture the uncertainty in the parameter estimates.
An object of class GDF inheriting from distribution.
Zeileis A, Lang MN, Hayes A (2022). “distributions3: From Basic Probability to Probabilistic Regression.” Presented at useR! 2022 - The R User Conference. Slides, video, vignette, code at https://www.zeileis.org/news/user2022/.
## packages, code, and data library("distributions3") data("cars", package = "datasets") ## fit heteroscedastic normal GAMLSS model ## stopping distance (ft) explained by speed (mph) m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO) ## obtain predicted distributions for three levels of speed d <- prodist(m, newdata = data.frame(speed = c(10, 20, 30))) print(d) ## obtain quantiles (works the same for any distribution object 'd' !) quantile(d, 0.5) quantile(d, c(0.05, 0.5, 0.95), elementwise = FALSE) quantile(d, c(0.05, 0.5, 0.95), elementwise = TRUE) ## visualization plot(dist ~ speed, data = cars) nd <- data.frame(speed = 0:240/4) nd$dist <- prodist(m, newdata = nd) nd$fit <- quantile(nd$dist, c(0.05, 0.5, 0.95)) matplot(nd$speed, nd$fit, type = "l", lty = 1, col = "slategray", add = TRUE) ## moments mean(d) variance(d) ## simulate random numbers random(d, 5) ## density and distribution pdf(d, 50 * -2:2) cdf(d, 50 * -2:2) ## Poisson example data("FIFA2018", package = "distributions3") m2 <- gamlss2(goals ~ s(difference), data = FIFA2018, family = PO) d2 <- prodist(m2, newdata = data.frame(difference = 0)) print(d2) quantile(d2, c(0.05, 0.5, 0.95)) ## note that log_pdf() can replicate logLik() value sum(log_pdf(prodist(m2), FIFA2018$goals)) logLik(m2)## packages, code, and data library("distributions3") data("cars", package = "datasets") ## fit heteroscedastic normal GAMLSS model ## stopping distance (ft) explained by speed (mph) m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO) ## obtain predicted distributions for three levels of speed d <- prodist(m, newdata = data.frame(speed = c(10, 20, 30))) print(d) ## obtain quantiles (works the same for any distribution object 'd' !) quantile(d, 0.5) quantile(d, c(0.05, 0.5, 0.95), elementwise = FALSE) quantile(d, c(0.05, 0.5, 0.95), elementwise = TRUE) ## visualization plot(dist ~ speed, data = cars) nd <- data.frame(speed = 0:240/4) nd$dist <- prodist(m, newdata = nd) nd$fit <- quantile(nd$dist, c(0.05, 0.5, 0.95)) matplot(nd$speed, nd$fit, type = "l", lty = 1, col = "slategray", add = TRUE) ## moments mean(d) variance(d) ## simulate random numbers random(d, 5) ## density and distribution pdf(d, 50 * -2:2) cdf(d, 50 * -2:2) ## Poisson example data("FIFA2018", package = "distributions3") m2 <- gamlss2(goals ~ s(difference), data = FIFA2018, family = PO) d2 <- prodist(m2, newdata = data.frame(difference = 0)) print(d2) quantile(d2, c(0.05, 0.5, 0.95)) ## note that log_pdf() can replicate logLik() value sum(log_pdf(prodist(m2), FIFA2018$goals)) logLik(m2)
The function computes estimated quantiles and optionally produces a plot.
## S3 method for class 'gamlss2' quantile(x, probs = c(0.025, 0.25, 0.50, 0.75, 0.975), variable = NULL, newdata = NULL, plot = FALSE, data = TRUE, n = 100L, ...)## S3 method for class 'gamlss2' quantile(x, probs = c(0.025, 0.25, 0.50, 0.75, 0.975), variable = NULL, newdata = NULL, plot = FALSE, data = TRUE, n = 100L, ...)
x |
An object of class |
probs |
Numeric vector of probabilities with values in [0,1]. |
variable |
Logical, integer, or character. Should quantiles be plotted
against a covariate? If |
newdata |
Data frame that should be used for computing the quantiles. |
plot |
Logical, should a plot be shown? |
data |
Logical, should the raw data be added to the plot? |
n |
Integer, number of observations that should be used to compute
an equidistant grid for the selected |
... |
Arguments such as
|
The function applies the predict method to determine the parameters of the response distribution.
It then computes the quantiles as specified in the argument probs.
The estimated quantiles. For multiple probabilities, a data frame is returned.
For a single probability and variable = NULL, the result is simplified
to a numeric vector.
## Not run: ## load data data("film90", package = "gamlss.data") ## model formula f <- lborev1 ~ s(lboopen) | . | . | . ## estimate model m <- gamlss2(f, data = film90, family = BCPE) ## compute quantiles using "newdata" nd <- film90[1:10, ] print(quantile(m, newdata = nd)) ## plot sorted quantiles quantile(m, plot = TRUE) ## quantile plot using covariate data quantile(m, plot = TRUE, variable = TRUE) ## plot without raw data quantile(m, plot = TRUE, variable = TRUE, data = FALSE) ## End(Not run)## Not run: ## load data data("film90", package = "gamlss.data") ## model formula f <- lborev1 ~ s(lboopen) | . | . | . ## estimate model m <- gamlss2(f, data = film90, family = BCPE) ## compute quantiles using "newdata" nd <- film90[1:10, ] print(quantile(m, newdata = nd)) ## plot sorted quantiles quantile(m, plot = TRUE) ## quantile plot using covariate data quantile(m, plot = TRUE, variable = TRUE) ## plot without raw data quantile(m, plot = TRUE, variable = TRUE, data = FALSE) ## End(Not run)
Random effects can be included in gamlss2 models in two ways.
The first uses s() with bs = "re" for simple random effects,
i.e., when a single factor is entered into the model as a smoother.
This approach relies on the s() function from the mgcv package.
For example, if area is a factor with several levels,
then s(area, bs = "re") shrinks the level-specific effects of
area towards their overall mean.
The second, more general approach uses the model term constructor re(),
which provides an interface to the specialised random-effects functionality
in the nlme package. This allows fitting more complex random-effects structures.
This documentation focuses on the re() function, but we also provide examples
using s(..., bs = "re").
re(random, correlation = NULL, ...)re(random, correlation = NULL, ...)
random |
A formula specifying the random effect part of the model,
as in the |
correlation |
An optional correlation object, see |
... |
For the |
Both functions set up model terms that can be estimated using a
backfitting algorithm, e.g., the default RS algorithm.
Function s with bs = "re" returns a smooth
specification object of class "re.smooth.spec", see
also smooth.construct.re.smooth.spec.
The re() function returns a special model term specification object,
see specials for details.
gamlss2, smooth.construct.re.smooth.spec,
s, lme
## Not run: ## orthodontic measurement data data("Orthodont", package = "nlme") ## model using lme() m <- lme(distance ~ I(age-11), data = Orthodont, random =~ I(age-11) | Subject, method = "ML") ## using re(), function I() is not supported, ## please transform all variables in advance Orthodont$age11 <- Orthodont$age - 11 ## estimation using the re() constructor, ## setup formula first f <- distance ~ age11 + re(~age11|Subject) ## estimate model m1 <- gamlss2(f, data = Orthodont) ## compare fitted values plot(fitted(m1, model = "mu"), fitted(m)) abline(0, 1, col = 4) ## extract summary for re() model term st <- specials(m1, model = "mu", term = "age11", elements = "model") summary(st) ## random intercepts and slopes with s() using AIC a <- gamlss2(distance ~ s(age,k=3) + s(Subject, bs = "re") + s(Subject, age11, bs = "re"), data = Orthodont) ## compare fitted values plot(fitted(m1, model = "mu"), fitted(m)) points(fitted(a, model = "mu"), fitted(m), col = 2) abline(0, 1, col = 4) ## more complicated correlation structures data("Ovary", package = "nlme") ## ARMA model m <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), data = Ovary, random = pdDiag(~sin(2*pi*Time)), correlation = corARMA(q = 2)) ## now with gamlss2(), transform in advance Ovary$sin1 <- sin(2 * pi * Ovary$Time) Ovary$cos1 <- cos(2 * pi * Ovary$Time) ## model formula f <- follicles ~ sin1 + cos1 + re(~ 1 | Mare) + re(~ sin1 - 1 | Mare, correlation = corARMA(q = 2)) ## estimate model m2 <- gamlss2(f, data = Ovary) ## smooth random effects f <- follicles ~ ti(Time) + ti(Mare, bs = "re") + ti(Mare, Time, bs = c("re", "cr"), k = c(11, 5)) g <- gamlss2(f, data = Ovary) ## compare fitted values par(mfrow = n2mfrow(nlevels(Ovary$Mare)), mar = c(4, 4, 1, 1)) for(j in levels(Ovary$Mare)) { ds <- subset(Ovary, Mare == j) plot(follicles ~ Time, data = ds) f <- fitted(m2, model = "mu")[Ovary$Mare == j] lines(f ~ ds$Time, col = 4, lwd = 2) f <- fitted(g, model = "mu")[Ovary$Mare == j] lines(f ~ ds$Time, col = 3, lwd = 2) f <- fitted(m)[Ovary$Mare == j] lines(f ~ ds$Time, col = 2, lwd = 2) } ## simulated data set.seed(1328) n <- 10000 k <- 500 ## generate random effects f <- as.factor(sample(1:k, size = n, replace = TRUE)) ref <- rnorm(k, sd = 0.6) ## random effects only for sigma y <- rBCT(n, mu = 10, sigma = exp(-1 + ref[f]), tau = 10) ## estimate model m3 <- gamlss2(y ~ 1 | re(~ 1 | f), family = BCT) ## extract fitted random effects cb <- coef(m3, full = TRUE) cb <- cb[grepl("re", names(cb), fixed = TRUE)] ## compare plot(cb, ref, main = round(mean((cb - ref)^2), 2)) abline(0, 1, lwd = 2, col = 4) ## End(Not run)## Not run: ## orthodontic measurement data data("Orthodont", package = "nlme") ## model using lme() m <- lme(distance ~ I(age-11), data = Orthodont, random =~ I(age-11) | Subject, method = "ML") ## using re(), function I() is not supported, ## please transform all variables in advance Orthodont$age11 <- Orthodont$age - 11 ## estimation using the re() constructor, ## setup formula first f <- distance ~ age11 + re(~age11|Subject) ## estimate model m1 <- gamlss2(f, data = Orthodont) ## compare fitted values plot(fitted(m1, model = "mu"), fitted(m)) abline(0, 1, col = 4) ## extract summary for re() model term st <- specials(m1, model = "mu", term = "age11", elements = "model") summary(st) ## random intercepts and slopes with s() using AIC a <- gamlss2(distance ~ s(age,k=3) + s(Subject, bs = "re") + s(Subject, age11, bs = "re"), data = Orthodont) ## compare fitted values plot(fitted(m1, model = "mu"), fitted(m)) points(fitted(a, model = "mu"), fitted(m), col = 2) abline(0, 1, col = 4) ## more complicated correlation structures data("Ovary", package = "nlme") ## ARMA model m <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), data = Ovary, random = pdDiag(~sin(2*pi*Time)), correlation = corARMA(q = 2)) ## now with gamlss2(), transform in advance Ovary$sin1 <- sin(2 * pi * Ovary$Time) Ovary$cos1 <- cos(2 * pi * Ovary$Time) ## model formula f <- follicles ~ sin1 + cos1 + re(~ 1 | Mare) + re(~ sin1 - 1 | Mare, correlation = corARMA(q = 2)) ## estimate model m2 <- gamlss2(f, data = Ovary) ## smooth random effects f <- follicles ~ ti(Time) + ti(Mare, bs = "re") + ti(Mare, Time, bs = c("re", "cr"), k = c(11, 5)) g <- gamlss2(f, data = Ovary) ## compare fitted values par(mfrow = n2mfrow(nlevels(Ovary$Mare)), mar = c(4, 4, 1, 1)) for(j in levels(Ovary$Mare)) { ds <- subset(Ovary, Mare == j) plot(follicles ~ Time, data = ds) f <- fitted(m2, model = "mu")[Ovary$Mare == j] lines(f ~ ds$Time, col = 4, lwd = 2) f <- fitted(g, model = "mu")[Ovary$Mare == j] lines(f ~ ds$Time, col = 3, lwd = 2) f <- fitted(m)[Ovary$Mare == j] lines(f ~ ds$Time, col = 2, lwd = 2) } ## simulated data set.seed(1328) n <- 10000 k <- 500 ## generate random effects f <- as.factor(sample(1:k, size = n, replace = TRUE)) ref <- rnorm(k, sd = 0.6) ## random effects only for sigma y <- rBCT(n, mu = 10, sigma = exp(-1 + ref[f]), tau = 10) ## estimate model m3 <- gamlss2(y ~ 1 | re(~ 1 | f), family = BCT) ## extract fitted random effects cb <- coef(m3, full = TRUE) cb <- cb[grepl("re", names(cb), fixed = TRUE)] ## compare plot(cb, ref, main = round(mean((cb - ref)^2), 2)) abline(0, 1, lwd = 2, col = 4) ## End(Not run)
Various auxiliary functions to facilitate the work with formulas and fitted model objects.
response_name(formula)response_name(formula)
formula |
Function response_name extracts the response name as a character vector.
## basic formula f <- y ~ x1 + x2 + log(x3) response_name(f) ## formula with multiple responses f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5) response_name(f) ## list of formulas f <- list( y1 ~ x1 + s(x2) + x3 + te(log(x3), x4), y2 ~ x2 + sqrt(x5), y3 ~ z2 + x1 + exp(x3) + s(x10) ) response_name(f)## basic formula f <- y ~ x1 + x2 + log(x3) response_name(f) ## formula with multiple responses f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5) response_name(f) ## list of formulas f <- list( y1 ~ x1 + s(x2) + x3 + te(log(x3), x4), y2 ~ x2 + sqrt(x5), y3 ~ z2 + x1 + exp(x3) + s(x10) ) response_name(f)
The function RS() implements the algorithm of Rigby and Stasinopoulos,
the function CG() implements the algorithm of Cole and Green for estimating
a GAMLSS with gamlss2.
## rigby and stasinopoulos algorithm RS(x, y, specials, family, offsets, weights, start, xterms, sterms, control) ## cole and green algorithm CG(x, y, specials, family, offsets, weights, start, xterms, sterms, control)## rigby and stasinopoulos algorithm RS(x, y, specials, family, offsets, weights, start, xterms, sterms, control) ## cole and green algorithm CG(x, y, specials, family, offsets, weights, start, xterms, sterms, control)
x |
The full model matrix to be used for fitting. |
y |
The response vector or matrix. |
specials |
A named list of special model terms, e.g., including design and
penalty matrices for fitting smooth terms using |
family |
A family object, see |
offsets |
If supplied, a list or data frame of possible model offset. |
weights |
If supplied, a numeric vector of weights. |
start |
Starting values, either for the parameters of the response distribution or,
if specified as a named list in which each element of length one is named with
|
xterms |
A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object. |
sterms |
A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object. |
control |
Further control arguments as specified within the call of |
Functions RS() and CG() are called within gamlss2. Both functions
implement a backfitting algorithm for estimating GAMLSS. For algorithm details see
Rigby and Stasinopoulos (2005).
The functions use the following control arguments:
eps: Numeric vector of length 2, the stopping criterion.
Default is eps = c(1e-05, 1e-05) for the outer and the inner backfitting loop.
maxit: Integer vector of length 2, the maximum number of iterations of
the outer and inner backfitting loop. Default is maxit = c(100, 10).
step: Numeric, the step length control parameter. Default is step = 1.
Note that if step is set smaller than 1, it might be appropriate to lower
the stopping criterion eps, too.
CG: Integer, the number of iterations when to start the CG correction.
Default is CG = Inf.
trace: Logical, should information be printed while the algorithm is running?
flush: Logical, use flush.console for displaying the
current output in the console.
ridge: Logical, should automatic ridge penalization be applied
only to linear effects, without penalizing the intercept? For each parameter of
the distribution the optimum ridge penalty is estimated using an information criterion.
Possible options are criterion = c("aic", "aicc", "bic", "gaic", "gcv"). The
default is criterion = "gaic" and argument K = 2, which can
be set in gamlss2_control.
To facilitate the development of new algorithms for gamlss2, users can exchange
them using the optimizer argument in gamlss2_control. Users
developing new model fitting functions are advised to use these functions as templates
and pass them to gamlss2_control. Alternatively, users can replace the
optimizer function by adding a named list element, "optimizer", to the family object.
For instructions on setting up new families in gamlss2, see gamlss2.family.
Functions RS() and CG() return a named list of class "gamlss2"
containing the following objects:
fitted.values |
A data frame of the fitted values of the modeled parameters of the selected distribution. |
fitted.specials |
A named list, one element for each parameter of the distribution, containing the fitted model object information of special model terms. |
fitted.linear |
A named list, one element for each parameter of the distribution, containing the information on fitted linear effects. |
coefficients |
A named list, one element for each parameter of the distribution, containing the estimated parameters of the linear effects. |
elapsed |
The elapsed runtime of the algorithm. |
iterations |
How many iterations the algorithm performed. |
logLik |
The final value of the log-likelihood of the model. |
control |
All control arguments used as supplied from function |
Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x
gamlss2, gamlss2_control, gamlss2.family
## Not run: ## count response example using HarzTraffic data data("HarzTraffic", package = "gamlss2") ## specify the model formula f <- bikes ~ s(yday, bs = "cc") + s(temp) + s(rain) | . ## estimate model using RS (default) m1 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = RS) ## now with CG m2 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = CG) ## first 2 RS iterations and afterwards switch to CG m3 <- gamlss2(f, data = HarzTraffic, family = NBI, CG = 2) ## End(Not run)## Not run: ## count response example using HarzTraffic data data("HarzTraffic", package = "gamlss2") ## specify the model formula f <- bikes ~ s(yday, bs = "cc") + s(temp) + s(rain) | . ## estimate model using RS (default) m1 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = RS) ## now with CG m2 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = CG) ## first 2 RS iterations and afterwards switch to CG m3 <- gamlss2(f, data = HarzTraffic, family = NBI, CG = 2) ## End(Not run)
Functions to compute the GAIC and the generalized R-squared of Nagelkerke (1991) for GAMLSS models.
## information criteria GAIC(object, ..., k = 2, corrected = FALSE) ## r-squared Rsq(object, ..., type = c("Cox Snell", "Cragg Uhler", "both", "simple"), newdata = NULL)## information criteria GAIC(object, ..., k = 2, corrected = FALSE) ## r-squared Rsq(object, ..., type = c("Cox Snell", "Cragg Uhler", "both", "simple"), newdata = NULL)
object |
A fitted model object. |
... |
Optionally more fitted model objects. |
k |
Numeric, the penalty to be used. The default |
corrected |
Logical, whether the corrected AIC should be used?
Note that it applies only when |
type |
Which definition of R-squared should be used?
Possible values are |
newdata |
For |
The Rsq() function uses the following definition of R-squared:
where is the null model (only a constant is fitted to all parameters) and
is the fitted model under consideration. This definition is
sometimes referred to as the Cox and Snell R-squared. The Nagelkerke or Cragg and Uhler
definition divides the above by
Numeric vector or data frame, depending on the number of fitted model objects.
Nagelkerke NJD (1991). “A Note on a General Definition of the Coefficient of Determination.” Biometrika, 78(3), 691–692. doi:10.1093/biomet/78.3.691
## load the aids data set data("aids", package = "gamlss.data") ## estimate negative binomial count models m1 <- gamlss2(y ~ x + qrt, data = aids, family = NBI) m2 <- gamlss2(y ~ s(x) + s(qrt, bs = "re"), data = aids, family = NBI) ## compare models Rsq(m1) Rsq(m1, type = "both") Rsq(m1, m2) GAIC(m1, m2) AIC(m1, m2) BIC(m1, m2) ## plot estimated effects plot(m2)## load the aids data set data("aids", package = "gamlss.data") ## estimate negative binomial count models m1 <- gamlss2(y ~ x + qrt, data = aids, family = NBI) m2 <- gamlss2(y ~ s(x) + s(qrt, bs = "re"), data = aids, family = NBI) ## compare models Rsq(m1) Rsq(m1, type = "both") Rsq(m1, m2) GAIC(m1, m2) AIC(m1, m2) BIC(m1, m2) ## plot estimated effects plot(m2)
The function select_gamlss2() applies an additional shrinkage penalty to
all mgcv model terms. This can shrink some terms to zero and thereby
remove them from the model. In addition to this penalty,
model terms are selected based on two criteria: the estimated degrees of freedom
of the term and the percentage of the predictor range covered by the model term.
These two thresholds aim to mimic a natural selection process, similar to what
one might do by inspecting summaries and effect plots, so that only
relevant terms are retained in the final model.
## wrapper function select_gamlss2(formula, ..., criterion = "BIC", thres = c(0.9, 0.2)) ## modified RS optimizer function sRS(x, y, specials, family, offsets, weights, start, xterms, sterms, control)## wrapper function select_gamlss2(formula, ..., criterion = "BIC", thres = c(0.9, 0.2)) ## modified RS optimizer function sRS(x, y, specials, family, offsets, weights, start, xterms, sterms, control)
formula |
A GAM-type |
... |
Arguments passed to |
criterion |
The information criterion used to estimate the
shrinkage parameters. This can also be a vector of length 2, where the first
element specifies the criterion to be used during the selection step, and the
second element specifies the criterion to be used during the refitting step.
Possible options are |
thres |
A vector of thresholds used for model term selection. The first element controls the minimum allowed estimated degrees of freedom for a model term to enter the model. The second element specifies the minimum percentage of the total estimated predictor range required for a term to be included in the model. |
x |
The full model matrix to be used for fitting. |
y |
The response vector or matrix. |
specials |
A named list of special model terms, e.g., including design and
penalty matrices for fitting smooth terms using |
family |
A family object, see |
offsets |
If supplied, a list or data frame of model offsets. |
weights |
If supplied, a numeric vector of weights. |
start |
Starting values, either for the parameters of the response distribution or,
if specified as a named list in which each element of length one is named with
|
xterms |
A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object. |
sterms |
A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object. |
control |
Further control arguments as specified within the call
of |
The function select_gamlss2() selects model terms by identifying those
with estimated degrees of freedom greater than a pre-specified threshold
(e.g., thres = 0.9). In addition, model term selection can also be based
on the percentage of the total predictor range covered by the model term to
ensure that the term represents a substantial portion of the total effect.
After the selection step, the model is refitted using only the selected terms,
excluding the additional penalty applied during the initial fitting process.
The additional penalty for automatic term selection is described in
gam.selection. This approach is experimental, so model
selection should still be assessed carefully.
An object of class "gamlss2".
## Not run: ## simulate data set.seed(123) ## number of observations n <- 1000 ## covariates k <- 100 x <- matrix(runif(n * k, -3, 3), ncol = k) colnames(x) <- paste0("x", 1:k) d <- as.data.frame(x) ## true effects d$f1 <- sin(d$x1) d$f2 <- exp(d$x2)/15 - 1 d$f3 <- -d$x3 / 3 d$f4 <- d$x4^2/5 - 1 d$f5 <- cos(d$x5) ## true parameters mu <- with(d, f1 + f3 + f4) sigma <- with(d, exp(1.5 + f2 + f4 + f5)) ## simulate response d$y <- rnorm(n, mean = mu, sd = sigma) ## model formula f <- paste("y ~", paste0("s(x", 1:k, ")", collapse = "+"), "| .") f <- as.formula(f) ## estimate model (takes some time) m <- select_gamlss2(f, data = d, family = NO) ## plot selected estimated effects plot(m) ## final model new_formula(m) ## example taken from ?gam.selection library("mgcv") set.seed(3) n <- 200 ## simulate data dat <- gamSim(1, n = n, scale = .15, dist = "poisson") ## spurious dat$x4 <- runif(n, 0, 1) dat$x5 <- runif(n, 0, 1) ## formula f <- y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5) ## estimate model m1 <- gam(f, data = dat, family = poisson, select = TRUE, method = "REML") summary(m1) plot(m1, pages = 1) ## same with gamlss2 m2 <- select_gamlss2(f, data = dat, family = PO) ## plot selected effects plot(m2) ## final model new_formula(m2) ## End(Not run)## Not run: ## simulate data set.seed(123) ## number of observations n <- 1000 ## covariates k <- 100 x <- matrix(runif(n * k, -3, 3), ncol = k) colnames(x) <- paste0("x", 1:k) d <- as.data.frame(x) ## true effects d$f1 <- sin(d$x1) d$f2 <- exp(d$x2)/15 - 1 d$f3 <- -d$x3 / 3 d$f4 <- d$x4^2/5 - 1 d$f5 <- cos(d$x5) ## true parameters mu <- with(d, f1 + f3 + f4) sigma <- with(d, exp(1.5 + f2 + f4 + f5)) ## simulate response d$y <- rnorm(n, mean = mu, sd = sigma) ## model formula f <- paste("y ~", paste0("s(x", 1:k, ")", collapse = "+"), "| .") f <- as.formula(f) ## estimate model (takes some time) m <- select_gamlss2(f, data = d, family = NO) ## plot selected estimated effects plot(m) ## final model new_formula(m) ## example taken from ?gam.selection library("mgcv") set.seed(3) n <- 200 ## simulate data dat <- gamSim(1, n = n, scale = .15, dist = "poisson") ## spurious dat$x4 <- runif(n, 0, 1) dat$x5 <- runif(n, 0, 1) ## formula f <- y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5) ## estimate model m1 <- gam(f, data = dat, family = poisson, select = TRUE, method = "REML") summary(m1) plot(m1, pages = 1) ## same with gamlss2 m2 <- select_gamlss2(f, data = dat, family = PO) ## plot selected effects plot(m2) ## final model new_formula(m2) ## End(Not run)
Link object (with link function, inverse link function, etc.) that assures positivity of parameters based on the softplus function.
softplus(a = 1)softplus(a = 1)
a |
Extra parameter of the generalized softplus function |
The softplus link function with parameter is given by:
This is an approximation of the linear spline
where the discrepancy between the two functions decreases with
increasing .
Wiemann et al. (2023) propose to employ the softplus function as the inverse link function where positivity of a parameter needs to be assured, e.g., in count data regressions. This is in particular of interest as an alternative to the exponential inverse link function because the exponential implies multiplicative effects of the regressors while the softplus function does not.
An object of class "link-glm".
Wiemann PFV, Kneib T, Hambuckers J (2023). “Using the Softplus Function to Construct Alternative Link Functions in Generalized Linear Models and Beyond.” Statistical Papers, forthcoming. doi:10.1007/s00362-023-01509-x
## visualization of the softplus function from Wiemann et al. (2023, Figure 1) x <- -200:200/50 plot(x, softplus(1)$linkinv(x), ylab = expression(softplus[a](x)), type = "l", col = 2, lwd = 2) grid() lines(x, softplus(5)$linkinv(x), col = 3, lwd = 2) lines(x, softplus(10)$linkinv(x), col = 4, lwd = 2) lines(x, pmax(0, x), lty = 3, lwd = 2) legend("topleft", c("a = 1", "a = 5", "a = 10", "linear spline"), col = c(2, 3, 4, 1), lty = c(1, 1, 1, 3), lwd = 2, bty = "n") ## poisson regression example with different links data("FIFA2018", package = "distributions3") m_exp <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = "log")) m_splus <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = softplus(1))) AIC(m_exp, m_splus) ## comparison of fitted effects nd <- data.frame(difference = -15:15/10) nd$mu_exp <- predict(m_exp, newdata = nd, type = "response") nd$mu_splus <- predict(m_splus, newdata = nd, type = "response") plot(mu_exp ~ difference, data = nd, ylab = expression(mu), type = "l", col = 4, lwd = 2, ylim = c(0, 2.5)) lines(mu_splus ~ difference, data = nd, col = 2, lwd = 2) legend("topleft", c("exp", "softplus"), col = c(4, 2), lwd = 2, lty = 1, bty = "n")## visualization of the softplus function from Wiemann et al. (2023, Figure 1) x <- -200:200/50 plot(x, softplus(1)$linkinv(x), ylab = expression(softplus[a](x)), type = "l", col = 2, lwd = 2) grid() lines(x, softplus(5)$linkinv(x), col = 3, lwd = 2) lines(x, softplus(10)$linkinv(x), col = 4, lwd = 2) lines(x, pmax(0, x), lty = 3, lwd = 2) legend("topleft", c("a = 1", "a = 5", "a = 10", "linear spline"), col = c(2, 3, 4, 1), lty = c(1, 1, 1, 3), lwd = 2, bty = "n") ## poisson regression example with different links data("FIFA2018", package = "distributions3") m_exp <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = "log")) m_splus <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = softplus(1))) AIC(m_exp, m_splus) ## comparison of fitted effects nd <- data.frame(difference = -15:15/10) nd$mu_exp <- predict(m_exp, newdata = nd, type = "response") nd$mu_splus <- predict(m_splus, newdata = nd, type = "response") plot(mu_exp ~ difference, data = nd, ylab = expression(mu), type = "l", col = 4, lwd = 2, ylim = c(0, 2.5)) lines(mu_splus ~ difference, data = nd, col = 2, lwd = 2) legend("topleft", c("exp", "softplus"), col = c(4, 2), lwd = 2, lty = 1, bty = "n")
The gamlss2 package provides an interface for special model terms whose
fitting is not handled by the standard linear or smooth-term machinery.
Examples include neural networks, trees, or forests. A special term is expected
to provide its own fitting and prediction methods so that it can be updated
within the RS or CG backfitting algorithms.
## generic fitting method special_fit(x, ...) ## generic predict method special_predict(x, ...) ## extractor function for fitted special terms specials(object, model = NULL, terms = NULL, elements = NULL, ...)## generic fitting method special_fit(x, ...) ## generic predict method special_predict(x, ...) ## extractor function for fitted special terms specials(object, model = NULL, terms = NULL, elements = NULL, ...)
x |
A model term object as supplied in the formula in the |
object |
A fitted |
model |
Character or integer, specifies the model for which fitted special terms should be extracted. |
terms |
Character or integer, specifies the special model terms that should be extracted. |
elements |
Character, specifies which elements of a fitted special term should be
extracted. If |
... |
Arguments needed for the |
To implement a new special term, define:
a constructor function that is used inside the model formula,
a corresponding special_fit() method,
and a corresponding special_predict() method.
The constructor function should collect the information needed for fitting,
such as the model formula, the variables, and any control arguments. If the
constructor has a new special name, this name must be registered via the
specials argument of fake_formula. In the example below,
this is not necessary because "n" is already registered.
The following paragraphs describe the expected arguments and return values of the fitting and prediction methods.
A method for special_fit() has the following arguments:
x: The special model term object, containing all the data for fitting.
z: The current working response/residual from the backfitting step.
w: The current working weights from the backfitting step.
y: The response vector/matrix, e.g., used to evaluate the log-likelihood.
eta: The current named list of predictors.
j: Character, the parameter name for which the model term needs to be updated.
family: The family object of the model, see gamlss2.family.
control: A named list of control arguments, see gamlss2_control.
Only the first three arguments are required to set up a special model term;
all remaining arguments are optional. The function must at least return a
named list containing the component "fitted.values" in order to work
with RS and CG.
In practice, useful return components are:
"fitted.values": Required. Numeric vector of fitted contributions of the special term.
Additional fitted objects: Optional. For example, the fitted model object itself, degrees of freedom, or quantities needed for prediction.
"transfer": Optional. Information that should be passed from one backfitting
iteration to the next.
A method for special_predict() has the following arguments:
x: Depending on the return value of function special_fit(), the
fitted model term object, see the examples.
data: The data for which predictions should be computed.
se.fit: Logical, should standard errors of the predictions be computed.
The function special_predict() should return fitted contributions for
the supplied data. The minimal valid return value is a numeric vector.
If se.fit = TRUE, the method may instead return a data frame with a
column named "fit" and optional columns such as "lower" and
"upper" when interval information is available.
gamlss2, RS, gamlss2_control, gamlss2.family
## example special term for neural networks, ## the constructor function is used in the formula ## when calling gamlss2() n <- function(formula, ...) { stopifnot(requireNamespace("nnet")) ## list for setting up the special model term st <- list() ## list of control arguments ctr <- list(...) if(is.null(ctr$size)) ctr$size <- 50 if(is.null(ctr$maxit)) ctr$maxit <- 1000 if(is.null(ctr$decay)) ctr$decay <- 0.1 if(is.null(ctr$trace)) ctr$trace <- FALSE if(is.null(ctr$MaxNWts)) ctr$MaxNWts <- 10000 if(is.null(ctr$scale)) ctr$scale <- TRUE ## put all information together st$control <- ctr st$formula <- formula st$term <- all.vars(formula) st$label <- paste0("n(", paste0(gsub(" ", "", as.character(formula)), collapse = ""), ")") st$data <- model.frame(formula) ## scale per default! if(ctr$scale) { sx <- list() for(j in colnames(st$data)) { if(!is.factor(st$data[[j]])) { sx[[j]] <- range(st$data[[j]]) st$data[[j]] <- (st$data[[j]] - sx[[j]][1]) / diff(sx[[j]]) } } st$scalex <- sx } ## assign the "special" class and the new class "n" class(st) <- c("special", "n") return(st) } ## set up the special "n" model term fitting function special_fit.n <- function(x, z, w, control, ...) { ## model formula needs to be updated .fnns <- update(x$formula, response_z ~ .) ## assign current working response x$data$response_z <- z x$data$weights_w <- w ## possible weights from last iteration Wts <- list(...)$transfer$Wts ## estimate model nnc <- parse(text = paste0('nnet::nnet(formula = .fnns, data = x$data, weights = weights_w,', 'size = x$control$size, maxit = x$control$maxit, decay = x$control$decay,', 'trace = x$control$trace, MaxNWts = x$control$MaxNWts, linout = TRUE', if(!is.null(Wts)) ', Wts = Wts)' else ')')) rval <- list("model" = eval(nnc)) ## get the fitted.values rval$fitted.values <- predict(rval$model) ## transferring the weights for the next backfitting iteration ## note, "transfer" can be used to transfer anything from one ## iteration to the next rval$transfer <- list("Wts" = rval$model$wts) ## center fitted values rval$shift <- mean(rval$fitted.values) rval$fitted.values <- rval$fitted.values - rval$shift ## degrees of freedom rval$edf <- length(coef(rval$model)) ## possible scaling rval$scalex <- x$scalex ## assign class for predict method class(rval) <- "n.fitted" return(rval) } ## finally, the predict method special_predict.n.fitted <- function(x, data, se.fit = FALSE, ...) { if(!is.null(x$scalex)) { for(j in names(x$scalex)) { data[[j]] <- (data[[j]] - x$scalex[[j]][1]) / diff(x$scalex[[j]]) } } p <- predict(x$model, newdata = data, type = "raw") p <- p - x$shift if(se.fit) p <- data.frame("fit" = p) return(p) } ## Not run: ## example with data data("film90", package = "gamlss.data") ## specify the model formula f <- y ~ n(~x,decay=0.01) | . | . | . ## estimate model, ## set the seed for reproducibility ## note, data should be scaled! set.seed(123) m1 <- gamlss2(f, data = abdom, family = BCT) ## visualize estimated effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict quantiles pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## plot plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## another example using the Munich rent data data("rent", package = "gamlss.data") ## model formula f <- R ~ n(~Fl+A,size=10,decay=0.7) | . ## estimate model set.seed(456) m2 <- gamlss2(f, data = rent, family = GA) ## plot estimated effects plot(m2, which = "effects", persp = FALSE) ## diagnostics plot(m2, which = "resid") ## predict using new data n <- 50 nd <- with(rent, expand.grid( "Fl" = seq(min(Fl), max(Fl), length = n), "A" = seq(min(A), max(A), length = n) )) ## compute median rent R estimate nd$fit <- quantile(m2, newdata = nd, probs = 0.5) ## visualize library("lattice") p1 <- wireframe(fit ~ Fl + A, data = nd, screen = list(z = 50, x = -70, y = -10), aspect = c(1, 0.9), drape = TRUE, main = "n(~Fl+A)", xlab = "Floor", ylab = "YoC", zlab = "Rent") p2 <- levelplot(fit ~ Fl + A, data = nd, contour = TRUE, main = "n(~Fl+A)", xlab = "Floor", ylab = "YoC") print(p1, split = c(1, 1, 2, 1), more = TRUE) print(p2, split = c(2, 1, 2, 1), more = FALSE) ## extract fitted special terms, ## fitted NN for parameter mu specials(m2, model = "mu", elements = "model") ## same for sigma specials(m2, model = "sigma", elements = "model") ## return element names of fitted special term list specials(m2, model = "sigma", elements = "names") ## End(Not run)## example special term for neural networks, ## the constructor function is used in the formula ## when calling gamlss2() n <- function(formula, ...) { stopifnot(requireNamespace("nnet")) ## list for setting up the special model term st <- list() ## list of control arguments ctr <- list(...) if(is.null(ctr$size)) ctr$size <- 50 if(is.null(ctr$maxit)) ctr$maxit <- 1000 if(is.null(ctr$decay)) ctr$decay <- 0.1 if(is.null(ctr$trace)) ctr$trace <- FALSE if(is.null(ctr$MaxNWts)) ctr$MaxNWts <- 10000 if(is.null(ctr$scale)) ctr$scale <- TRUE ## put all information together st$control <- ctr st$formula <- formula st$term <- all.vars(formula) st$label <- paste0("n(", paste0(gsub(" ", "", as.character(formula)), collapse = ""), ")") st$data <- model.frame(formula) ## scale per default! if(ctr$scale) { sx <- list() for(j in colnames(st$data)) { if(!is.factor(st$data[[j]])) { sx[[j]] <- range(st$data[[j]]) st$data[[j]] <- (st$data[[j]] - sx[[j]][1]) / diff(sx[[j]]) } } st$scalex <- sx } ## assign the "special" class and the new class "n" class(st) <- c("special", "n") return(st) } ## set up the special "n" model term fitting function special_fit.n <- function(x, z, w, control, ...) { ## model formula needs to be updated .fnns <- update(x$formula, response_z ~ .) ## assign current working response x$data$response_z <- z x$data$weights_w <- w ## possible weights from last iteration Wts <- list(...)$transfer$Wts ## estimate model nnc <- parse(text = paste0('nnet::nnet(formula = .fnns, data = x$data, weights = weights_w,', 'size = x$control$size, maxit = x$control$maxit, decay = x$control$decay,', 'trace = x$control$trace, MaxNWts = x$control$MaxNWts, linout = TRUE', if(!is.null(Wts)) ', Wts = Wts)' else ')')) rval <- list("model" = eval(nnc)) ## get the fitted.values rval$fitted.values <- predict(rval$model) ## transferring the weights for the next backfitting iteration ## note, "transfer" can be used to transfer anything from one ## iteration to the next rval$transfer <- list("Wts" = rval$model$wts) ## center fitted values rval$shift <- mean(rval$fitted.values) rval$fitted.values <- rval$fitted.values - rval$shift ## degrees of freedom rval$edf <- length(coef(rval$model)) ## possible scaling rval$scalex <- x$scalex ## assign class for predict method class(rval) <- "n.fitted" return(rval) } ## finally, the predict method special_predict.n.fitted <- function(x, data, se.fit = FALSE, ...) { if(!is.null(x$scalex)) { for(j in names(x$scalex)) { data[[j]] <- (data[[j]] - x$scalex[[j]][1]) / diff(x$scalex[[j]]) } } p <- predict(x$model, newdata = data, type = "raw") p <- p - x$shift if(se.fit) p <- data.frame("fit" = p) return(p) } ## Not run: ## example with data data("film90", package = "gamlss.data") ## specify the model formula f <- y ~ n(~x,decay=0.01) | . | . | . ## estimate model, ## set the seed for reproducibility ## note, data should be scaled! set.seed(123) m1 <- gamlss2(f, data = abdom, family = BCT) ## visualize estimated effects plot(m1, which = "effects") ## plot diagnostics plot(m1, which = "resid") ## predict quantiles pq <- quantile(m1, probs = c(0.05, 0.5, 0.95)) ## plot plot(y ~ x, data = abdom, pch = 19, col = rgb(0.1, 0.1, 0.1, alpha = 0.3)) matlines(abdom$x, pq, lwd = 2, lty = 1, col = 4) ## another example using the Munich rent data data("rent", package = "gamlss.data") ## model formula f <- R ~ n(~Fl+A,size=10,decay=0.7) | . ## estimate model set.seed(456) m2 <- gamlss2(f, data = rent, family = GA) ## plot estimated effects plot(m2, which = "effects", persp = FALSE) ## diagnostics plot(m2, which = "resid") ## predict using new data n <- 50 nd <- with(rent, expand.grid( "Fl" = seq(min(Fl), max(Fl), length = n), "A" = seq(min(A), max(A), length = n) )) ## compute median rent R estimate nd$fit <- quantile(m2, newdata = nd, probs = 0.5) ## visualize library("lattice") p1 <- wireframe(fit ~ Fl + A, data = nd, screen = list(z = 50, x = -70, y = -10), aspect = c(1, 0.9), drape = TRUE, main = "n(~Fl+A)", xlab = "Floor", ylab = "YoC", zlab = "Rent") p2 <- levelplot(fit ~ Fl + A, data = nd, contour = TRUE, main = "n(~Fl+A)", xlab = "Floor", ylab = "YoC") print(p1, split = c(1, 1, 2, 1), more = TRUE) print(p2, split = c(2, 1, 2, 1), more = FALSE) ## extract fitted special terms, ## fitted NN for parameter mu specials(m2, model = "mu", elements = "model") ## same for sigma specials(m2, model = "sigma", elements = "model") ## return element names of fitted special term list specials(m2, model = "sigma", elements = "names") ## End(Not run)
Various spirometry measurements from the National Health and Nutrition Examination Survey (NHANES) 2007–2012 along with covariates providing demographics and basic body measurements.
data("SpirometryUS", package = "gamlss2")data("SpirometryUS", package = "gamlss2")
A data frame containing 16596 observations on 13 variables.
Numeric. Forced vital capacity (FVC) in liters, i.e., the volume of air that can forcibly be blown out after full inspiration.
Numeric. Forced expiratory volume in 1 second (FEV1) in liters, i.e., the volume of air that can forcibly be blown out in the first second, after full inspiration.
Numeric. Ratio of FEV1 to FVC.
Numeric, peak expiratory flow (PEF) in liters per second, i.e., the maximal flow (or speed) achieved during the maximally forced expiration initiated at full inspiration.
Numeric. Forced expiratory flow (FEF) in liters per second, i.e., the flow (or speed) of air coming out of the lung during the middle portion (25% to 75%) of a forced expiration.
Numeric. Extrapolated volume.
Numeric. Forced expiratory time (FET) in seconds, i.e., the length of the expiration.
Factor. Binary gender information with levels
female and male.
Numeric. Age in years (rounded to quarters).
Numeric. Body weight in kilograms.
Numeric. Body height in centimeters.
Numeric. Body mass index in kilograms per meter-squared, rounded to 2 decimal places.
Factor. Self-reported race and ethnicity information
with levels white, black, mexican American,
other hispanic, and other (including multi-racial).
In order to establish lung function reference equations, Zavorsky (2025) studies the dependence of three spirometry measurements (FVC, FEV1, and the FEV1/FVC ratio) on age, adjusted for height and weight and separately for females and males. He intends to show that a simple normally-distributed model with (piecewise) linear mean equation and (piecewise) constant variance suffices for obtaining an adequate distributional fit from which the 5% quantile can be obtained as the so-called lower limit of normal (LLN). Actually, his comparison with GAMLSS – using flexible predictors for both mean and variance along with a Box-Cox-transformed normal distribution – shows that GAMLSS leads to a similar fit for the mean but a much better fit for the LLN.
Zavorsky's (2025) analyses are based on a data set that he derived from the National Health and Nutrition Examination Survey (NHANES) in the United States 2007–2012. From the entire available data from https://wwwn.cdc.gov/nchs/nhanes/ he included those observations which met or exceeded the technical acceptability of the measurements for forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC). The data are described in a short communication published in the Data in Brief journal and the accompanying spreadsheet in CSV format (comma-separated values) is available from Mendeley Data.
The data comprises observations from NHANES' “Examination Data”, in particular in “Spirometry – Pre and Post-Bronchodilator” and “Body Measures”, plus accompanying “Demographics Data”. See the variable descriptions above for more details. Basic information about spirometry can be found for example in the Wikipedia at https://en.wikipedia.org/wiki/Spirometry.
Zavorsky GS (2024). “Refined NHANES 2007–2012 Spirometry Dataset for the Comparison of Segmented (Piecewise) Linear Models to That of GAMLSS”, Mendeley Data, V1. doi:10.17632/dwjykg3xww.1
Zavorsky GS (2024). “A Refined Spirometry Dataset for Comparing Segmented (Piecewise) Linear Models to that of GAMLSS”. Data in Brief, 57, 111062. doi:10.1016/j.dib.2024.111062
Zavorsky GS (2025). “Debunking the GAMLSS Myth: Simplicity Reigns in Pulmonary Function Diagnostics”. Respiratory Medicine, 236, 107836. doi:10.1016/j.rmed.2024.107836
data("SpirometryUS", package = "gamlss2") summary(SpirometryUS)data("SpirometryUS", package = "gamlss2") summary(SpirometryUS)
The optimizer function stepwise() performs stepwise model term selection
using a Generalized Akaike Information Criterion (GAIC). Estimation is based on the
Rigby and Stasinopoulos (RS) and Cole and Green (CG) algorithms, as implemented in
function RS.
## wrapper function for stepwise GAMLSS estimation step_gamlss2(formula, ..., K = 2, strategy = c("both.linear", "both"), keeporder = FALSE, cores = 1L) ## stepwise optimizer function stepwise(x, y, specials, family, offsets, weights, start, xterms, sterms, control)## wrapper function for stepwise GAMLSS estimation step_gamlss2(formula, ..., K = 2, strategy = c("both.linear", "both"), keeporder = FALSE, cores = 1L) ## stepwise optimizer function stepwise(x, y, specials, family, offsets, weights, start, xterms, sterms, control)
formula |
A model formula for |
... |
Arguments passed to |
K |
Numeric, the penalty for the |
strategy |
Character, the strategy to be used by the stepwise algorithm.
Possible options are |
keeporder |
Logical. For the different stepwise strategies,
should the updates be performed sequentially according to the order of the parameters
of the response distribution as specified in the family (see |
cores |
Integer. If |
x |
The full model matrix to be used for fitting. |
y |
The response vector or matrix. |
specials |
A named list of special model terms, e.g., including design and
penalty matrices for fitting smooth terms using |
family |
A family object, see |
offsets |
If supplied, a list or data frame of model offsets. |
weights |
If supplied, a numeric vector of weights. |
start |
Starting values, either for the parameters of the response distribution or,
if specified as a named list in which each element of length one is named with
|
xterms |
A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object. |
sterms |
A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object. |
control |
Further control arguments as specified within the call
of |
The wrapper function step_gamlss2() calls gamlss2
using the stepwise() optimizer function.
The stepwise algorithm can apply the following strategies:
Each predictor must include an intercept.
In a forward selection step, model terms with the highest improvement on the GAIC are selected.
In a replacement step, each model term is tested to see if an exchange with a term not yet selected improves the GAIC.
In a backward step, model terms are removed if this further improves the GAIC.
In a bidirectional step, model terms can be either added or removed.
In addition, the forward, backward and replace selection step can be combined.
The selected strategies are iterated until no further improvement is achieved.
The different strategies can be selected using argument strategy; see
the examples. Possible values are "both", "forward",
"backward", "replace", and "all".
Here, strategy = "all" combines the forward, backward, and replace selection steps.
In addition, each of the steps 2-4 can be applied to linear model terms only,
prior to performing the steps for all model terms. This can be done by additionally
setting strategy to "both.linear", "forward.linear",
"backward.linear", "replace.linear", or "all.linear".
The default is strategy = c("both.linear", "both") and keeporder = FALSE.
Note that each of the steps 2–4 can be performed while maintaining the order of the
parameters of the response distribution, i.e., if the keeporder = TRUE
argument is set, then the parameters will be updated in the order specified in the
gamlss2.family. Using backward elimination, the model terms are
deselected in reverse order.
The optimizer function stepwise() returns the final model as a named list
of class "gamlss2". See the return value of function RS.
The wrapper function step_gamlss2() also returns the final model.
new_formula, gamlss2, gamlss2_control, RS
## Not run: ## load the Munich rent data data("rent", package = "gamlss.data") ## because of possible linear interactions, ## scale the covariates first rent$Fl <- scale(rent$Fl) rent$A <- scale(rent$A) ## the formula defines the searching scope f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) ## estimate a Gamma model using the stepwise algorithm m1 <- step_gamlss2(f, data = rent, family = GA, K = 2) ## same with ## m <- gamlss2(f, data = rent, family = GA, optimizer = stepwise, K = 2) ## show the new formula of selected model terms new_formula(m1) ## final model summary summary(m1) ## effect plots plot(m1) ## diagnostic plots plot(m1, which = "resid") ## plot GAIC plot(m1, which = "selection") ## use forward linear, replace and backward strategy m2 <- step_gamlss2(f, data = rent, family = GA, K = 2, strategy = c("forward.linear", "replace", "backward")) ## more complex model ## note, the third parameter ## nu does not include any model terms f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) | 1 | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) ## model using the BCT family m3 <- step_gamlss2(f, data = rent, family = BCT, K = 2, strategy = c("forward.linear", "both"), keeporder = TRUE) ## plot GAIC plot(m3, which = "selection") ## End(Not run)## Not run: ## load the Munich rent data data("rent", package = "gamlss.data") ## because of possible linear interactions, ## scale the covariates first rent$Fl <- scale(rent$Fl) rent$A <- scale(rent$A) ## the formula defines the searching scope f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) ## estimate a Gamma model using the stepwise algorithm m1 <- step_gamlss2(f, data = rent, family = GA, K = 2) ## same with ## m <- gamlss2(f, data = rent, family = GA, optimizer = stepwise, K = 2) ## show the new formula of selected model terms new_formula(m1) ## final model summary summary(m1) ## effect plots plot(m1) ## diagnostic plots plot(m1, which = "resid") ## plot GAIC plot(m1, which = "selection") ## use forward linear, replace and backward strategy m2 <- step_gamlss2(f, data = rent, family = GA, K = 2, strategy = c("forward.linear", "replace", "backward")) ## more complex model ## note, the third parameter ## nu does not include any model terms f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) | 1 | Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) ## model using the BCT family m3 <- step_gamlss2(f, data = rent, family = BCT, K = 2, strategy = c("forward.linear", "both"), keeporder = TRUE) ## plot GAIC plot(m3, which = "selection") ## End(Not run)