Package 'gamlss2'

Title: GAMLSS Infrastructure for Flexible Distributional Regression
Description: Next generation infrastructure for generalized additive models for location, scale, and shape (GAMLSS) and distributional regression more generally. The package provides a fresh reimplementation of the classic 'gamlss' package while being more modular and facilitating the creation of advanced terms and models.
Authors: Nikolaus Umlauf [aut, cre] (ORCID: <https://orcid.org/0000-0003-2160-9803>), Mikis Stasinopoulos [aut] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Achim Zeileis [aut] (ORCID: <https://orcid.org/0000-0003-0918-3766>), Reto Stauffer [aut] (ORCID: <https://orcid.org/0000-0002-3798-5507>), Gillian Heller [aut] (ORCID: <https://orcid.org/0000-0003-1270-1499>), Fernanda de Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Andreas Mayr [aut] (ORCID: <https://orcid.org/0000-0001-7106-9732>)
Maintainer: Nikolaus Umlauf <[email protected]>
License: GPL-2 | GPL-3
Version: 0.1-0
Built: 2026-06-02 10:35:34 UTC
Source: https://github.com/gamlss-dev/gamlss2

Help Index


GAMLSS Modeling with Advanced Flexible Infrastructures

Description

Next generation infrastructure for generalized additive models for location, scale, and shape (GAMLSS) and distributional regression more generally. The package provides a fresh reimplementation of the classic 'gamlss' package while being more modular and facilitating the creation of advanced terms and models.

Details

The primary purpose of this package is to facilitate the creation of advanced infrastructures designed to enhance the Generalized Additive Models for Location Scale and Shape (GAMLSS, Rigby and Stasinopoulos 2005) modeling framework. Notably, the gamlss2 package represents a significant overhaul of its predecessor, gamlss, with a key emphasis on improving estimation speed and incorporating more adaptable infrastructures. These enhancements enable the seamless integration of various algorithms into GAMLSS, including gradient boosting, Bayesian estimation, regression trees, and forests, fostering a more versatile and powerful modeling environment.

Moreover, the package expands its compatibility by supporting all model terms from the base R mgcv package. Additionally, the gamlss2 package introduces the capability to accommodate more than four parameter families. Essentially, this means that users can now specify any type of model using these new infrastructures, making the package highly flexible and accommodating to a wide range of modeling requirements.

Index of help topics:

bamlss2                 Bayesian GAMLSS Wrapper
BS                      Bayesian Sampler for GAMLSS Models
calibration             Calibration Plots for Binary Responses
cv_gamlss2              Cross Validation for 'gamlss2' Models
discretize              Discretize Continuous Distribution Family for
                        Count Regression Models
elm                     Extreme Learning Machine Model Terms
fake_formula            Extended Processing of "Fake" Formulas
find_family             Find and Fit GAMLSS Families
gamlss2                 Generalized Additive Models for Location Scale
                        and Shape
gamlss2_control         Control Parameters
gamlss2_start           Model Initialization and Parameter Constraints
gamlss2-package         GAMLSS Modeling with Advanced Flexible
                        Infrastructures
gamlss2.family          Family Objects in 'gamlss2'
GDF                     Create a GDF Distribution
gnet                    Model Terms with 'glmnet'
HarzTraffic             Traffic Counts at Sonnenberg in the Harz Region
Kumaraswamy             Kumaraswamy Distribution
la                      Lasso Model Terms
make.link2              Create a Link for Families
mcmc                    Run MCMC Sampling for a Fitted 'gamlss2' Model
MN                      Multinomial Logit Family
new_formula             Extracting a New Formula After Selection
                        Algorithms
OL                      Ordered Logistic Family for Ordinal Regression
pb                      P-Splines for GAMLSS
plot.gamlss2            Plotting GAMLSS
predict.gamlss2         Extracting Fitted or Predicted Parameters or
                        Terms from gamlss2 Models
prodist.gamlss2         Extracting Fitted or Predicted Probability
                        Distributions from gamlss2 Models
quantile.gamlss2        Quantiles for GAMLSS
re                      Random Effects
response_name           Auxiliary Functions for Formulas and Model
                        Objects
RS                      Rigby and Stasinopoulos (RS) & Cole and Green
                        (CG) Algorithm
Rsq                     GAIC and Generalised (Pseudo) R-squared for
                        GAMLSS Models
select_gamlss2          Smooth Model Term Selection with Additional
                        Penalties
softplus                Softplus Link Object
special_terms           Special Model Terms for GAMLSS
SpirometryUS            Spirometry Measurements from NHANES 2007-2012
stepwise                Stepwise Model Term Selection Using GAIC

Author(s)

Nikolaus Umlauf, Mikis Stasinopoulos, Robert Rigby, Achim Zeileis, Reto Stauffer, Gillian Heller, Fernanda de Bastiani, Andreas Mayr.

Maintainer: Nikolaus Umlauf [email protected].

References

Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x

Rigby RA, Stasinopoulos DM, Heller GZ, De Bastiani F (2019). Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547

Stasinopoulos DM, Rigby RA (2007). “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software, 23(7), 1–46. doi:10.18637/jss.v023.i07

Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017). Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

See Also

gamlss2, fake_formula


Bayesian GAMLSS Wrapper

Description

The function bamlss2() is a convenience wrapper around gamlss2 to fit Bayesian GAMLSS models via MCMC sampling. It sets the optimizer to an MCMC routine and allows n.iter, burnin, and thin to be specified directly.

Usage

bamlss2(formula, n.iter = 1200, burnin = 200, thin = 1, maxit = 2, ...)

Arguments

formula

A GAM-type formula or Formula. The model can include smooth terms as provided by the mgcv package.

n.iter

Integer, the total number of MCMC iterations.

burnin

Integer, the burn-in period.

thin

Integer, the thinning parameter for saved samples.

maxit

Integer, the number of backfitting iterations used to compute starting values before running MCMC. If maxit = 0, sampling starts from default or user-supplied starting values.

...

Further arguments passed to gamlss2 and/or gamlss2_control.

Details

Function bamlss2() calls gamlss2 and sets the optimizer to a Bayesian sampler. By default, a small number of backfitting iterations is performed first to obtain reasonable starting values for the MCMC sampler.

The wrapper is designed to behave in line with the gamlss()-style interfaces, i.e., it supports weights and offset arguments through gamlss2.

Value

An object of class "bamlss2" inheriting from "gamlss2" containing the fitted model and posterior samples.

References

Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325

See Also

gamlss2, mcmc, BS

Examples

## Not run: ## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## specify the model formula
f <- y ~ s(x) | s(x) | 1 | 1

## estimate Bayesian model
m <- bamlss2(f, data = abdom, family = BCPE)

## posterior summary
summary(m)

## plot estimated effects
plot(m)

## plot samples
plot(m, which = "samples")

## predict parameters using samples
pm <- predict(m, FUN = mean)
print(head(pm))
psd <- predict(m, FUN = sd)
print(head(psd))

## End(Not run)

Bayesian Sampler for GAMLSS Models

Description

The function BS() implements an MCMC sampler for generalized additive models for location, scale, and shape (GAMLSS) as fitted by gamlss2. The sampler uses Metropolis-Hastings updates for regression coefficients of linear and smooth terms.

Usage

BS(x, y, specials, family, offsets,
  weights, start, xterms, sterms, control)

Arguments

x

The full model matrix to be used for fitting.

y

The response vector or matrix.

specials

A named list of special model terms, for example including design and penalty matrices for fitting smooth terms using smooth.construct.

family

A family object, see gamlss2.family.

offsets

If supplied, a list or data frame of model offsets. See gamlss2 for details on the offset interface.

weights

If supplied, a numeric vector of prior weights used in the fitting process.

start

Starting values for the parameters of the response distribution. See the examples for gamlss2.

xterms

A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object.

sterms

A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object.

control

Further control arguments as specified in the call to gamlss2.

Details

Function BS() is typically called via mcmc or via the wrapper function bamlss2. It implements a blocked Metropolis-Hastings sampler using working responses and working weights. Smooth terms are updated using penalized weighted least squares proposals and their smoothing variances are sampled using univariate slice sampling.

The function uses the following control arguments:

  • n.iter: integer, the total number of MCMC iterations;

  • burnin: integer, the burn-in period;

  • thinning: integer, the thinning parameter for saved samples;

  • trace: logical, should information be printed while the sampler is running;

  • flush: logical, whether to use flush.console to display current output in the console;

  • nullmodel: logical, should an intercept-only null model be evaluated to compute a null deviance and deviance reduction?;

  • binning: logical, should binning be used for smooth terms if available?

Value

A fitted model object containing posterior samples and summary information. Typically it includes:

  • samples: posterior draws for coefficients and smoothing variances;

  • coefficients: posterior means of coefficients;

  • fitted.values: posterior mean fitted values for each distributional parameter;

  • edf: estimated degrees of freedom for smooth terms;

  • logLik: log-likelihood at the posterior mean predictor.

References

Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325

See Also

gamlss2, RS, CG, mcmc, bamlss2

Examples

## see ?mcmc

Calibration Plots for Binary Responses

Description

Compute and plot calibration curves for models with 0/1 responses. The function can handle one or several fitted gamlss2 models and compares their calibration in a single plot.

Usage

calibration(..., newdata = NULL,
  y = NULL, parameter = NULL, breaks = seq(0, 1, by = 0.1),
  minn = 20, main = "Calibration plot",
  xlab = "Predicted probability",
  ylab = "Observed proportion", plot = TRUE,
  add_loess = TRUE, smooth_n = 200,
  col = NULL, lty = NULL, legend = TRUE, pos = "topleft",
  xlim = NULL, ylim = NULL)

Arguments

...

One or several fitted gamlss2 model objects to be assessed.

newdata

Optional data frame for out-of-sample calibration.

y

Optional numeric or factor vector with the binary response.

parameter

Character, which parameter should be used for prediction.

breaks

Numeric vector of break points used to construct bins for the predicted probabilities.

minn

Integer, the minimum number of observations required in each bin.

main

Character, main title for the plot.

xlab

Character, label for the x-axis.

ylab

Character, label for the y-axis.

plot

Logical, should a calibration plot be produced?

add_loess

Logical, should a loess smooth be added for each model?

smooth_n

Integer, number of evaluation points for the loess curve.

col

Colors used for the models.

lty

Line types used for the loess curves.

legend

Logical, should a legend be added when more than one model is supplied?

pos

Character, legend position passed to legend.

xlim

The x limits of the plot.

ylim

The y limits of the plot.

Details

For each fitted model, predicted probabilities are obtained via predict using type = "parameter" and the selected parameter. If the corresponding gamlss2.family object provides a probabilities() method, this is used to transform the parameter vector into probabilities. For multi-column outputs, the first column is used.

Predicted probabilities are grouped into bins defined by breaks. Within each bin, the mean predicted probability, the observed proportion of 1s, and the number of observations are computed. Bins with fewer than minn observations are dropped.

Value

Invisibly returns a data frame with the columns interval, probs, y, n, and model. For a single model, the model column is dropped.

References

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine, 17, 230. doi:10.1186/s12916-019-1466-7

See Also

gamlss2, predict

Examples

## Not run: set.seed(123)
n <- 400
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
eta <- sin(x1) + 0.5 * x2
p <- plogis(eta)
y <- rbinom(n, size = 1, prob = p)
d <- data.frame(y = y, x1 = x1, x2 = x2)

m1 <- gamlss2(y ~ x1, family = BI, data = d)
m2 <- gamlss2(y ~ s(x1), family = BI, data = d)
m3 <- gamlss2(y ~ s(x1) + x2, family = BI, data = d)

calibration(m1, m2, m3)
head(calibration(m1, m2, m3, plot = FALSE))

## End(Not run)

Cross Validation for gamlss2 Models

Description

cv_gamlss2() implements K-fold cross validation for models fitted with gamlss2. Different scoring rules can be supplied via the metric argument. Convenience metric functions (log_pdf_metric(), rqres_metric(), mse_metric()) are provided.

Usage

## k-fold cross-validation
cv_gamlss2(..., data, folds = 5,
  metric = log_pdf_metric, parallel = FALSE, simplify = TRUE)

## log-pdf for each observation
log_pdf_metric(model, data)

## randomized quantile residuals
rqres_metric(model, data)

## mean squared error
mse_metric(model, data)

Arguments

...

model specification passed to gamlss2 such as formula, family, etc.

data

a data.frame containing the variables in the model. For functions supplied to argument metric, a data.frame for evaluating predictions or residuals.

folds

either an integer specifying the number of folds, or a list, matrix, or data frame of index sets for test folds. Defaults to 5.

metric

a function of the form metric(model, data) returning a score for the given fitted model and test data. Defaults to log_pdf_metric.

parallel

logical. If TRUE, computation is carried out in parallel using future.apply.

simplify

logical. If TRUE, results are returned in a simplified vector or data frame depending on the metric output.

model

a fitted gamlss2 model.

Details

cv_gamlss2() splits the data into training and test folds. For each fold the model is fitted on the training data, and the chosen metric is evaluated on the held-out test data. By default, the scoring rule is the log predictive density (log_pdf_metric), but other metrics can be used, such as randomized quantile residuals (rqres_metric) or mean squared error of the conditional mean (mse_metric).

The function returns either a list of fold-wise results or, if simplify = TRUE, a named vector or data frame aligned with the original observations.

Value

If simplify = TRUE and the metric returns scalars, a named numeric vector of fold scores is returned. Otherwise a data frame with fold membership and scores per observation is returned. The convenience metrics return a numeric vector of scores or residuals.

See Also

gamlss2, log_pdf

Examples

## Not run: ## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## cross-validation using the NO distribution
## only model the mean with s(x)
cv1 <- cv_gamlss2(y ~ s(x), data = abdom, family = NO)

## now, also model the standard deviation with s(x)
cv2 <- cv_gamlss2(y ~ s(x) | s(x), data = abdom, family = BCT)

## evaluate log-likelihood
sum(cv1$score)
sum(cv2$score)

## End(Not run)

Discretize Continuous Distribution Family for Count Regression Models

Description

This function takes any continuous distribution family object and discretizes it, enabling it to be used for the estimation of count regression models. The discretized family can then be used in gamlss2 models that deal with count data.

Usage

discretize(family = NO)

Arguments

family

A continuous distribution family object. The family will be discretized for modeling count data, where the distribution is adapted for count outcomes.

Details

The function discretizes a continuous distribution family by converting its cumulative distribution function (CDF) into a probability mass function (PMF). This is done by computing the difference between the CDF evaluated at adjacent points. The resulting discretized distribution can be used in count regression models to estimate the relationship between count data and explanatory variables.

Value

Returns an object of class "gamlss2.family", which is a discretized version of the input continuous family object, suitable for use in gamlss2 models for count data.

See Also

gamlss2, gamlss2.family

Examples

## simulate count data using the Poisson distribution
set.seed(111)
y <- rpois(1000, lambda = 10)

## create a discretized family using the BCT distribution (with log link for mu)
fam <- discretize(family = BCT(mu.link = "log"))

## fit a count regression model using the discretized family
fit_family(y, family = fam)

## including covariate
## Not run: set.seed(111)
x <- sort(runif(1000, -3, 3))
y <- rpois(1000, lambda = exp(sin(x)))

m <- gamlss2(y ~ s(x) | s(x) | s(x) | s(x), family = fam)

## plot effects
plot(m)

## predict 95
p <- quantile(m, probs = 0.95)
plot(x, y, pch = 19, col = adjustcolor(1, 0.3))
lines(p ~ x, lty = 1, col = 4, lwd = 3)

## End(Not run)

Extreme Learning Machine Model Terms

Description

Constructor function for Extreme Learning Machine (ELM) model terms for GAMLSS.

Usage

## Model term constructor function.
elm(x, k = 50, a = "tanh", ...)

Arguments

x

A numeric vector or matrix, a factor, or a formula. If x is a formula, a design matrix is created using model.matrix. See the examples.

k

Integer, number of hidden units (random features) used to build the ELM design matrix.

a

Character, activation function for the hidden units. Supported options are "logistic", "tanh" (default), "relu", "leaky_relu", "elu", "softplus", "atan", "softsign", "gaussian", "laplace", "sine", and "identity".

...

Further control arguments can be passed: criterion = "bic" (default) for shrinkage parameter selection and scale = TRUE (default) for internal scaling of the design matrix. The hidden-layer initialization can be controlled via sampler = "halton" (default), "pca", or "random", and via further tuning arguments such as whiten, target_sd, slope_sd_1d, q_range, and orthogonalize. Further arguments are passed to model.matrix if x is specified using a formula.

Details

The ELM term constructs a randomized single-hidden-layer representation of the covariate(s) in x. Internally, a design matrix Z is built (including an intercept column), random weights are sampled, and the linear predictors Z %*% W are transformed through the activation function a to obtain the hidden-layer design matrix X. Columns of X are centered before estimation.

Internal scaling. For numerical stability and comparability across terms, Z can be scaled internally (scale = TRUE). Intercept columns are left unchanged. If x is a factor, a QR-based group scaling is applied; otherwise, a column-wise center-and-scale normalization is used. The scaling transformation is stored and is reused automatically during prediction.

Activation functions. The activation is applied element-wise to Z %*% W. For numerical stability, bounded activations are evaluated on clipped inputs (currently [35,35][-35, 35]).

Weight sampling. By default, hidden-unit weights are generated using a low-discrepancy Halton design in a whitened input space. This tends to produce more stable and less redundant features than purely random draws in many practical applications. With sampler = "pca", directions are additionally aligned with principal component directions of the scaled design. The historical stochastic behavior can be recovered with sampler = "random".

Reproducibility. The default sampler = "halton" is deterministic given the input data. Use set.seed only if you explicitly request sampler = "random".

Value

An object representing a specials model term, used internally by gamlss2 during model fitting and prediction.

References

Huang GB, Zhu QY, Siew CK (2006). Extreme Learning Machine: Theory and Applications. Neurocomputing, 70(1–3), 489–501. doi:10.1016/j.neucom.2005.12.126

Huang GB, Zhu QY, Siew CK (2004). Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In: Proceedings of IJCNN 2004, 2, 985–990. doi:10.1109/IJCNN.2004.1380068

See Also

gamlss2, specials.

Examples

## Not run: ## load data
data("SpirometryUS")

## subset for female
d <- subset(SpirometryUS, gender == "female")

## the default sampler is deterministic; set
## a seed only for sampler = "random"
set.seed(1328)

## formula for all 4 parameters
f <- fev1 ~ elm(age,k=100,a="tanh") | . | . | .

## estimate model
m <- gamlss2(f, data = d, family = BCT)

## estimated effects
plot(m)

## predict quantiles
qu <- c(0.025, seq(0.1, 0.9, by = 0.1), 0.975)
fit <- quantile(m, probs = qu)

## plot
plot(fev1 ~ age, data = d)
i <- order(d$age)
matplot(d$age[i], fit[i, ],
  type = "l", lty = 1, lwd = 2,
  col = c(2, rep(4, ncol(fit) - 1)),
  add = TRUE)

## main effects and interactions
f <- fev1 ~ s(age) + s(height) + s(weight) +
  elm(~age+height+weight,k=200) | . | . | .

m <- gamlss2(f, data = d, family = BCT)

## summary to inspect effect of interactions
## of the elm()s
summary(m)

## plot main effects
plot(m)

## quantile residuals
plot(m, which = "resid")

## prediction is handled automatically via
## the special term interface
n <- 50
nd <- with(d, expand.grid(
  "age" = seq(min(age), max(age), length = n),
  "weight" = seq(min(weight), max(weight), length = n)
))
nd$height <- mean(d$height)

## compute lower 2.5
nd$fit <- quantile(m, newdata = nd, probs = 0.025)

## visualize
n <- length(unique(nd$age))
age <- sort(unique(nd$age))
weight <- sort(unique(nd$weight))

z <- matrix(nd$fit, n, n)

image(age, weight, z,
      col = hcl.colors(100, "YlOrRd"),
      xlab = "age", ylab = "weight")
contour(age, weight, z, add = TRUE)

## End(Not run)

Extended Processing of "Fake" Formulas

Description

Create a simplified formula from a formula, a Formula, or a list of formulas. The result retains the variables and transformations needed to build a model.frame while separating out special model terms. This is mainly used internally when gamlss2 prepares formulas for fitting.

Usage

fake_formula(formula, specials = NULL,
  nospecials = FALSE, onlyspecials = FALSE)

Arguments

formula

A formula, Formula, or a list of formulas.

specials

Character, vector of names of special functions in the formula, see terms.formula.

nospecials

Logical, should variables of special model terms be part of the "fake formula"?

onlyspecials

Logical, should only the special model terms be returned?

Details

The function is primarily an internal helper for formula preprocessing in gamlss2, but it can also be useful for inspecting how complex formulas are decomposed before fitting.

Special model terms such as s(), te(), ti(), re(), or package-specific specials are identified and handled separately from ordinary variables. Depending on the control arguments:

  • if nospecials = FALSE, variables used inside registered special terms are reintroduced into the returned formula so they remain available in the model frame;

  • if nospecials = TRUE, the returned formula excludes the special terms and their variables;

  • if onlyspecials = TRUE, the function returns the names of the detected special terms instead of a formula object.

For multi-part or multi-parameter formulas, the output preserves the overall formula structure.

Value

Depending on the input, the function returns a formula or Formula. If onlyspecials = TRUE, a character vector or list of special model term names is returned.

See Also

gamlss2

Examples

## basic formula
f <- y ~ x1 + x2 + log(x3)
ff <- fake_formula(f)
print(ff)

## including special model terms
f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4)
ff <- fake_formula(f)
print(ff)

## multiple parts on the right-hand side
f <- y ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + sqrt(x5)
ff <- fake_formula(f)
print(ff)

## collapse all formula parts
print(formula(ff, collapse = TRUE))
print(formula(ff, collapse = TRUE, update = TRUE))

## list of formulas
f <- list(
  y ~ x1 + s(x2) + x3 + te(log(x3), x4),
    ~ x2 + sqrt(x5),
    ~ z2 + x1 + exp(x3)
)
ff <- fake_formula(f)
print(ff)

## extract separate parts on the right-hand side
formula(ff, rhs = 1)
formula(ff, rhs = 2)
formula(ff, rhs = 3)

## formula with multiple responses and multiple parts
f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5)
ff <- fake_formula(f)
print(ff)

## list of formulas with multiple responses
f <- list(
  y1 ~ x1 + s(x2) + x3 + te(log(x3), x4),
  y2 ~ x2 + sqrt(x5),
  y3 ~ z2 + x1 + exp(x3) + s(x10)
)
ff <- fake_formula(f)

## extract only without special terms
ff <- fake_formula(f, nospecials = TRUE)
print(ff)

## extract only special terms
ff <- fake_formula(f, onlyspecials = TRUE)
print(ff)

Find and Fit GAMLSS Families

Description

These functions provide useful infrastructures for finding suitable GAMLSS families for a response variable.

Usage

## list of available families from gamlss.dist package
available_families(type = c("continuous", "discrete"), families = NULL, ...)

## find suitable response distribution
find_family(y, families = NULL, k = 2, verbose = TRUE, ...)

## fit distribution parameters
fit_family(y, family = NO, plot = TRUE, ...)

## find gamlss2 model
find_gamlss2(formula, families = NULL, k = 2,
  select = FALSE, verbose = TRUE, ...)

Arguments

type

Character, is the response continuous or discrete?

families

Character, the names of the family objects of the gamlss.dist package that should be returned.

y

The response vector or matrix.

k

Numeric, the penalty factor that should be used for the GAIC.

verbose

Logical, should runtime information be printed?

family

A family object that should be used for estimation, see also gamlss2.family.

plot

Logical, should a plot of the fitted density be provided?

formula

A model formula, see gamlss2.

select

Logical, if set to select = TRUE, model term selection is enforced using model fitting function select_gamlss2.

...

For function available_families(), arguments passed to the family objects, e.g., for setting link functions mu.link = "log". Further arguments to be passed to gamlss2 when using find_family(), or arguments legend = TRUE/FALSE, pos = "topright" (see also function legend), main, xlab and ylab when argument plot = TRUE using function fit_family(). For function find_gamlss2(), arguments are passed to available_families() and gamlss2.

Details

The function find_family() employs gamlss2 to estimate intercept-only models for each specified family object in the families argument. Note that model estimation occurs within a try block with warnings suppressed. Additionally, the function calculates the GAIC for each family whenever feasible and returns the values sorted in ascending order of the selected fit criterion, so smaller values indicate a better fit.

Function fit_family() fits a single intercept-only model using the specified family and creates a plot of the fitted density.

Value

Function find_family() returns a vector of GAIC values for the different fitted families. Function fit_family() returns the fitted intercept-only model.

See Also

gamlss2.

Examples

## Not run: ## load data
data("rent", package = "gamlss.data")

## find a suitable response to the response
ic <- find_family(rent$R)
print(ic)

## fit parameters using the BCCG family
fit_family(rent$R, family = BCCG)

## count data
data("polio", package = "gamlss.data")

## search best count model
ic <- find_family(polio, k = 0,
  families = available_families(type = "discrete"))
print(ic)

## fit parameters using the ZAPIG family
fit_family(polio, family = ZAPIG)

## search complete model
## m <- find_gamlss2(R ~ s(Fl) + s(A) | . | . | ., data = rent,
##   select = TRUE, mu.link = "log")

## End(Not run)

Generalized Additive Models for Location Scale and Shape

Description

The function gamlss2() fits generalized additive models for location, scale, and shape (GAMLSS). It estimates one additive predictor for each parameter of the chosen response distribution, such as mu, sigma, nu, or tau. The number and meaning of these parameters are determined by the selected family object, see gamlss2.family.

In addition to ordinary linear terms, gamlss2() supports smooth terms from mgcv as well special user-defined model-terms, see specials.

Usage

gamlss2(formula, ...)

## S3 method for class 'formula'
gamlss2(formula, data, family = NO,
  subset, na.action, weights, offset, start = NULL,
  knots = NULL, control = gamlss2_control(...), ...)

## S3 method for class 'list'
gamlss2(formula, ...)

Arguments

formula

A GAM-type formula or Formula describing the additive predictors for the distribution parameters. All smooth terms from mgcv are supported, see also formula.gam. For gamlss.list(), formula is a list of formulas.

data

A data frame, list, or environment containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which gamlss2() is called.

family

A gamlss.family or gamlss2.family object defining the response distribution and the link functions of its parameters.

subset

An optional vector specifying a subset of observations to be used in the fitting process.

na.action

NA processing for setting up the model.frame.

weights

An optional vector of prior weights to be used in the fitting process. Should be NULL or a numeric vector.

offset

Optional offset terms to be included in the linear predictors during fitting. If a single numeric vector is supplied, it is assigned to the first parameter of the distribution. If offsets are needed for several parameters, provide a data frame or list in the same order as the parameter names in the family object.

start

Optional starting values for the estimation algorithm, see gamlss2_start.

knots

An optional list of user-specified knots, see smoothCon.

control

A list of control arguments, see gamlss2_control.

...

Arguments passed to gamlss2_control.

Details

The function gamlss2() is the main entry point for fitting distributional regression models in gamlss2.

  • The response distribution is specified through a family object, either from gamlss.dist or created as a gamlss2.family object.

  • By default, estimation is carried out with the RS algorithm. The optimizer can be changed through gamlss2_control. A family object may also provide its own optimizer.

  • The returned object depends on the chosen optimizer. In the standard case, this is an object of class "gamlss2", for which summary, plotting, prediction, and extractor methods are available.

Value

The object returned by the selected optimizer. By default, this is an object of class "gamlss2" produced by RS. Standard methods and extractor functions are available for this class.

References

Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x

Rigby RA, Stasinopoulos DM, Heller GZ, De Bastiani F (2019). Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547

Stasinopoulos DM, Rigby RA (2007). “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software, 23(7), 1–46. doi:10.18637/jss.v023.i07

Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017). Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

See Also

RS, gamlss2_control, gamlss2.family, gamlss2_start

Examples

## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## specify a distributional model:
## location and scale vary with x,
## shape parameters remain constant
f <- y ~ s(x) | s(x) | 1 | 1

## estimate model
m1 <- gamlss2(f, data = abdom, family = BCT)

## model summary
summary(m1)

## plot parameter-specific effects
plot(m1, which = "effects")

## plot diagnostics
plot(m1, which = "resid")

## predict selected distribution parameters
par <- predict(m1, model = c("mu", "sigma"))

## fitted centile curves
pq <- quantile(m1, probs = c(0.05, 0.5, 0.95))

## visualize
plot(y ~ x, data = abdom, pch = 19,
  col = rgb(0.1, 0.1, 0.1, alpha = 0.3))
matlines(abdom$x, pq, lwd = 2,
  lty = 1, col = 4)

## count response example using HarzTraffic data
data("HarzTraffic", package = "gamlss2")

## seasonal count model for motorcycles
m2 <- gamlss2(
  bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"),
  data = HarzTraffic,
  family = NBI
)

## summary
summary(m2)

## effect plots
plot(m2)

## residual diagnostics
plot(m2, which = "resid")

## quantiles over the day of the year
nd <- data.frame(yday = 1:365)
pq <- quantile(m2, newdata = nd, probs = c(0.05, 0.5, 0.95))

## visualize fitted quantiles
plot(bikes ~ yday, data = HarzTraffic, pch = 19,
  col = gray(0.1, alpha = 0.3))
matlines(nd$yday, pq, lty = c(2, 1, 2),
  lwd = 2, col = 4)

Control Parameters

Description

Control parameters for fitting GAMLSS models with gamlss2.

Usage

gamlss2_control(optimizer = RS, trace = TRUE,
  flush = TRUE, light = FALSE, expand = TRUE,
  model = TRUE, x = TRUE, y = TRUE,
  fixed = FALSE, ...)

Arguments

optimizer

Function, the optimizer to be used for fitting.

trace

Logical, should information be printed while the algorithm is running?

flush

Logical, whether to use flush.console to display current output in the console.

light

Logical, if set to light = TRUE, no model frame, response, model matrix and other design matrices will be part of the return value.

expand

Logical, if fewer formulas are supplied than there are parameters of the distribution, should intercept-only formulas be added automatically?

model

Logical, should the model frame be included as a component of the returned object.

x

Logical, indicating whether the model matrix should be included as a component of the returned object.

y

Logical, should the response be included as a component of the returned object.

fixed

Named logical vector indicating which parameters should be fixed during estimation. See gamlss2_start for examples.

...

Further control parameters to be included in the return value, for example parameters used by the optimizer function RS.

Details

The set of control parameters can be extended. For example, if a different optimizer is used, newly specified control parameters are passed on to that optimizer automatically.

Value

A list with the arguments specified.

See Also

RS, gamlss2, gamlss2_start

Examples

## Not run: ## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## specify the model formula
f <- y ~ s(x) | s(x)

## estimate model with different step length
## control in the RS algorithm
m1 <- gamlss2(f, data = abdom, family = BCT, step = 1)
m2 <- gamlss2(f, data = abdom, family = BCT, step = 0.1)

## End(Not run)

Model Initialization and Parameter Constraints

Description

Guidance on supplying starting values through argument start in gamlss2 and on fixing distribution parameters via fixed in gamlss2_control.

Details

The start argument can be used to initialize estimation in several ways. Supported inputs include:

  • a named numeric vector or list giving one starting value per distribution parameter, such as mu, sigma, nu, or tau;

  • a data frame or matrix with one column per distribution parameter and one row per observation, providing starting values for the full additive predictors;

  • a fitted "gamlss2" object, in which case fitted() values are used as starting values for all distribution parameters;

  • a coef.gamlss2 object, allowing coefficient-wise initialization of linear and smooth terms.

Parameters can be fixed during estimation using the fixed control argument. This is useful, for example, when shape parameters should remain constant while the remaining parameters are estimated.

See Also

gamlss2, gamlss2_control, RS

Examples

## Not run: ## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## distributional model
f <- y ~ s(x) | s(x) | 1 | 1

## scalar starting values for the distribution parameters
m1 <- gamlss2(f, data = abdom, family = BCT,
  start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10))

## fix shape parameters during estimation
m2 <- gamlss2(f, data = abdom, family = BCT,
  start = c(mu = 200, sigma = 0.1, nu = 0, tau = 10),
  fixed = c(nu = TRUE, tau = TRUE))

## coefficient estimates of the fitted model
coef(m2)

## use fitted additive predictors as starting values
m3 <- gamlss2(f, data = abdom, family = BCT,
  start = fitted(m2))

## same using the fitted model object directly
m4 <- gamlss2(f, data = abdom, family = BCT,
  start = m2)

## coefficient-wise initialization
m5 <- gamlss2(f, data = abdom, family = BCT,
  start = coef(m2))

## End(Not run)

Family Objects in gamlss2

Description

A "gamlss2.family" object describes a response distribution for use with gamlss2. It tells the fitting algorithm which distributional parameters exist, which link functions they use, and which distributional functions are available, such as densities, quantiles, random-number generators, or score functions.

Usage

## S3 method for class 'gamlss2'
family(object, ...)

Arguments

object

A fitted "gamlss2" model object.

...

Not used currently.

Details

A gamlss2 family object is a list of class "gamlss2.family". It defines the response distribution and the functions needed to fit and use that distribution in a model.

The minimum required elements are:

  • family: the name of the distribution (character string).

  • names: a character vector of parameter names (e.g., c("mu", "sigma")).

  • links: a named character vector specifying the link function for each parameter (e.g., c(mu = "identity", sigma = "log")), or a list of link functions, see softplus.

  • pdf(y, par, log = FALSE, ...): a function to evaluate the (log-)density.

The pdf() function must accept the response y, a named list par of evaluated parameter values (e.g., par$mu, par$sigma), a logical log, and optional additional arguments.

Further elements can be added depending on how fully the family should support fitting, prediction, diagnostics, and simulation. Common optional elements are:

  • score: a named list of functions (one per parameter), each computing the first derivative of the log-likelihood with respect to the linear predictor: score[[param]](y, par, ...).

  • hess: a named list of functions computing second derivatives (the negative Hessian). For parameters mu and sigma, this includes: hess[["mu"]](), hess[["sigma"]](), and optionally cross derivatives like hess[["mu.sigma"]]().

  • loglik(y, par, ...): a function computing the total log-likelihood.

  • cdf(y, par, ...): cumulative distribution function.

  • quantile(p, par, ...): quantile function.

  • random(n, par, ...): random number generator.

  • mean(par, ...): mean function.

  • variance(par, ...): variance function.

  • skewness(par, ...), kurtosis(par, ...): optional higher-order moment functions.

  • initialize: a named list of initialization functions, one for each parameter (e.g., initialize$mu(y, ...)), used to generate starting values.

  • valid.response(x): a function that checks whether the response is valid (e.g., numeric, non-factor).

  • optimizer(): an optional function defining a custom optimization method for use with gamlss2, see also RS.

If analytical score or Hessian functions are not provided, they are approximated numerically. If quantile residuals are required, a cdf() function must be available.

Value

For family.gamlss2(), the family object stored in a fitted "gamlss2" model is returned. A "gamlss2.family" object itself is a list describing the response distribution, its parameters, their link functions, and the distribution-specific methods used during fitting, prediction, diagnostics, and simulation.

See Also

gamlss2

Examples

## Not run: ## new family object for the normal distribution
Normal <- function(...) {
  fam <- list(
    "family" = "Normal",
    "names" = c("mu", "sigma"),
    "links" = c("mu" = "identity", "sigma" = "log"),
    "score" = list(
      "mu" = function(y, par, ...) {
        (y - par$mu) / (par$sigma^2)
      },
      "sigma" = function(y, par, ...) {
        -1 + (y - par$mu)^2 / (par$sigma^2)
      }
    ),
    "hess" = list(
      "mu" = function(y, par, ...) {
        1 / (par$sigma^2)
      },
      "sigma" = function(y, par, ...) {
        rep(2, length(y))
      },
      "mu.sigma" = function(y, par, ...) {
        rep(0, length(y))
      }
    ),
    "loglik" = function(y, par, ...) {
      sum(dnorm(y, par$mu, par$sigma, log = TRUE))
    },
    "mu" = function(par, ...) {
      par$mu
    },
    "pdf" = function(y, par, log = FALSE) {
      dnorm(y, mean = par$mu, sd = par$sigma, log = log)
    },
    "cdf" = function(y, par, ...) {
      pnorm(y, mean = par$mu, sd = par$sigma, ...)
    },
    "random" = function(n, par) {
      rnorm(n, mean = par$mu, sd = par$sigma)
    },
    "quantile" = function(p, par) {
      qnorm(p, mean = par$mu, sd = par$sigma)
    },
    "initialize" = list(
      "mu"    = function(y, ...) { (y + mean(y)) / 2 },
      "sigma" = function(y, ...) { rep(sd(y), length(y)) }
    ),
    "mean"      = function(par) par$mu,
    "variance"  = function(par) par$sigma^2,
    "valid.response" = function(x) {
      if(is.factor(x) | is.character(x))
        stop("the response should be numeric!")
      return(TRUE)
    }
  )

  class(fam) <- "gamlss2.family"

  return(fam)
}

## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## specify the model formula
f <- y ~ s(x) | s(x)

## estimate model
m1 <- gamlss2(f, data = abdom, family = Normal)

## plot estimated effects
plot(m1, which = "effects")

## plot diagnostics
plot(m1, which = "resid")

## predict parameters
par <- predict(m1)

## predict quantiles
pq <- quantile(m1, probs = c(0.05, 0.5, 0.95))

## visualize
plot(y ~ x, data = abdom, pch = 19,
  col = rgb(0.1, 0.1, 0.1, alpha = 0.3))
matlines(abdom$x, pq, lwd = 2,
  lty = 1, col = 4)

## another example that defines only the density
## function; in this case all derivatives are
## approximated numerically. For residual diagnostics,
## the $cdf() and $quantile() functions are needed as well
Gamma <- function(...) {
  fam <- list(
    "names" = c("mu", "sigma"),
    "links" = c("mu" = "log", "sigma" = "log"),
    "pdf" = function(y, par, log = FALSE, ...) {
      shape <- par$sigma
      scale <- par$mu/par$sigma
      dgamma(y, shape = shape, scale = scale, log = log)
    },
    "cdf" = function(y, par, lower.tail = TRUE, log.p = FALSE) {
      shape <- par$sigma
      scale <- par$mu/par$sigma
      pgamma(y, shape = shape, scale = scale,
        lower.tail = lower.tail, log.p = log.p)
    },
    "quantile" = function(p, par, lower.tail = TRUE, log.p = FALSE) {
      shape <- par$sigma
      scale <- par$mu/par$sigma
       qgamma(p, shape = shape, scale = scale,
         lower.tail = lower.tail, log.p = log.p)
    }
  )

  class(fam) <- "gamlss2.family"

  return(fam)
}

## example using the Munich rent data
data("rent", package = "gamlss.data")

## model formula
f <- R ~ ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") |
  ti(Fl) + ti(A) + ti(Fl, A, bs = "ps")

## estimate model
m2 <- gamlss2(f, data = rent, family = Gamma)

## visualize estimated effects
plot(m2, which = "effects")

## diagnostics, needs the $cdf() and $quantile() function!
plot(m2, which = "resid")

## End(Not run)

Create a GDF Distribution

Description

The GDF (gamlss2 Distribution Family) is a unified class with corresponding methods that represent all distributional families supported by the gamlss2 package. It enables seamless integration with the distributions3 workflow and provides a consistent interface for model fitting and distributional computations.

Usage

GDF(family, parameters)

Arguments

family

character. Name of a gamlss2.family or a family provided by the gamlss.dist package, e.g, NO or BI for the normal or binomial distribution, respectively.

parameters

numeric, matrix, list or data frame, see the examples.

Details

The S3 class GDF is a slightly more general implementation of the S3 class GAMLSS tailored for gamlss2. For details please see the documentation of GAMLSS

Value

A GDF object, inheriting from distribution.

References

Zeileis A, Lang MN, Hayes A (2022). “distributions3: From Basic Probability to Probabilistic Regression.” Presented at useR! 2022 - The R User Conference. Slides, video, vignette, code at https://www.zeileis.org/news/user2022/.

See Also

gamlss2.family

Examples

## package and random seed
library("distributions3")
set.seed(6020)

## one normal distribution
X <- GDF("NO", c(mu = 1, sigma = 2))
X

## two normal distributions
X <- GDF("NO", cbind(c(1, 1.5), c(0.6, 1.2)))
X

## three Weibull distributions
X <- GDF("WEI", list(mu = c(1, 1, 2), sigma = c(1, 2, 2)))
X

## see ?gamlss.dist::GAMLSS for the remainder of this example
## example using gamlss2
data("abdom", package = "gamlss.data")

## estimate model
m <- gamlss2(y ~ s(x) | . | . | ., data = abdom, family = GA)

## extract, also works with newdata
d <- data.frame(
  "mean" = mean(m),
  "median" = median(m),
  "q95" = quantile(m, probs = 0.95),
  "variance" = variance(m),
  "pdf" = pdf(m),
  "cdf" = cdf(m)
)
print(head(d))

Model Terms with glmnet

Description

Constructor for estimating penalized model terms using the glmnet package.

Usage

gnet(formula, ...)

Arguments

formula

A formula specifying the covariates that should be estimated using the glmnet implementation.

...

Control arguments passed to glmnet and the information-criterion-based shrinkage selection.

Details

gnet() constructs a model.matrix representation of the supplied formula and delegates estimation to glmnet inside the backfitting algorithm. The optimal penalty parameter is selected using an information criterion, with criterion = "bic" used by default.

Value

A special model-term object used internally by gamlss2.

See Also

la, gamlss2, specials.

Examples

## Not run: data("rent", package = "gamlss.data")

## transform to categorical
rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10),
  include.lowest = TRUE)
rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10),
  include.lowest = TRUE)

## model formula, for each parameter we use
## the same gnet() model term
f <- R ~ gnet(~Flc + Ac + loc) | . | . | .

## estimate model
m <- gamlss2(f, data = rent, family = BCT)

## summary with edf
summary(m)

## plot coefficient paths
sp <- specials(m, model = "mu", term = "gnet(", element = "model")
print(class(sp))
plot(sp)

## End(Not run)

Traffic Counts at Sonnenberg in the Harz Region

Description

This dataset contains daily traffic counts near Sonnenberg in the Harz region of Germany. It covers nearly three years, from 2021-01-01 to 2023-11-30, and combines traffic counts with weather covariates. The data are useful for illustrating seasonal count regression models, for example with cyclic smoothers over the day of the year.

Usage

data("HarzTraffic", package = "gamlss2")

Format

A data frame containing 1057 observations on 16 variables.

date

Date, the date of the record.

yday

Integer, the day of the year extracted from date. This is convenient for modeling within-year seasonality.

bikes

Integer, the number of motorcycles on that day.

cars

Integer, the number of cars on that day.

trucks

Integer, the number of trucks on that day.

others

Integer, the number of other vehicles on that day.

tempmin

Numeric, minimum temperature in C^{\circ}C.

tempmax

Numeric, maximum temperature in C^{\circ}C.

temp

Numeric, mean temperature in C^{\circ}C.

humidity

Numeric, mean relative humidity in percent.

tempdew

Numeric, average dewpoint temperature in C^{\circ}C.

cloudiness

Numeric, average cloud cover in percent.

rain

Numeric, amount of precipitation in mm (snow and rain).

sunshine

Numeric, sunshine duration in minutes.

wind

Numeric, mean wind speed in m/s.

windmax

Numeric, maximum wind speed in m/s.

Details

The traffic variables record daily counts by vehicle type. The weather variables summarize daily conditions measured at a nearby station. In package examples, bikes is often modeled as a seasonal response because motorcycle traffic shows a pronounced annual cycle and substantial variation across weather conditions.

Source

Weather Data:

Data Source:

Deutscher Wetterdienst (DWD), Climate Data Center (CDC).

Licence:

CC BY 4.0

URL:

https://opendata.dwd.de/climate_environment/CDC/

Station:

Wernigerode (5490; Sachsen-Anhalt)

Position:

10.7686/51.8454/233 (lon, lat, alt, EPSG 4326)

Traffic Data:

Data Source:

Bundesanstalt für Strassenwesen (BASt)

Licence:

CC BY 4.0

URL:

https://www.bast.de

Examples

## seasonal variation of motorcycle counts at Sonnenberg/Harz
data("HarzTraffic", package = "gamlss2")
plot(bikes ~ yday, data = HarzTraffic)

## count distribution
barplot(table(HarzTraffic$bikes))

## negative binomial seasonal model using cyclic splines
m <- gamlss2(bikes ~ s(yday, bs = "cc") | s(yday, bs = "cc"),
  data = HarzTraffic, family = NBI)

## visualize effects
plot(m)

## residual diagnostics
plot(m, which = "resid")

## fitted parameters for each day of the year
nd <- data.frame(yday = 1:365)

## corresponding quantiles
p <- quantile(m, newdata = nd, probs = c(0.05, 0.5, 0.95))

## visualization
plot(bikes ~ yday, data = HarzTraffic, pch = 19, col = gray(0.1, alpha = 0.3))
matlines(nd$yday, p, lty = c(2, 1, 2), lwd = 2, col = 4)

Kumaraswamy Distribution

Description

This function implements the two-parameter Kumaraswamy family for responses in (0, 1).

Usage

## the Kumaraswamy family
Kumaraswamy(a.link = shiftlog, b.link = shiftlog, ...)
KS(a.link = shiftlog, b.link = shiftlog, ...)

## the exp(x) + shift link specification
shiftlog(shift = 1)

Arguments

a.link

Character or function, the link function to be used for parameter a.

b.link

Character or function, the link function to be used for parameter b.

shift

Numeric, the shift parameter to be used for the link.

...

Not used.

Details

The Kumaraswamy distribution is a continuous distribution defined on the interval (0, 1). The probability density function is

f(y;a,b)=abya1(1ya)b1\displaystyle f(y; a, b) = aby^{a - 1}(1 - y^a)^{b - 1}

y(0,1)y \in (0, 1) is the response, aa and bb are non-negative parameters.

The shiftlog link function is given by:

exp(x)+1\displaystyle \exp(x) + 1

This is the default, since the mode of the distribution is only defined for a1a \geq 1, b1b \geq 1.

Value

The family returns an object of class "gamlss2.family".

Function shiftlog() returns a link specification object of class "link-glm".

References

Kumaraswamy P (1980). “A Generalized Probability Density Function for Double-Bounded Random Processes.” Journal of Hydrology, 46(1), 79–88. doi:10.1016/0022-1694(80)90036-0

See Also

gamlss2.family, gamlss2

Examples

## create family object with
## different link specifications
fam <- Kumaraswamy(a.link = shiftlog, b.link = "log")

## simulate data
set.seed(123)
n <- 1000
d <- data.frame("x" = runif(n, -pi, pi))

## true parameters
par <- data.frame(
  "a" = exp(1.2 + sin(d$x)) + 1,
  "b" = 1
)

## sample response
d$y <- fam$r(1, par)

## estimate model using the Kumaraswamy family
m <- gamlss2(y ~ s(x), data = d, family = fam)

## plot estimated effect
plot(m)

## plot residual diagnostics
plot(m, which = "resid")

## predict quantiles
p <- quantile(m, probs = c(0.05, 0.1, 0.5))

## plot
plot(d, pch = 19, col = adjustcolor(1, 0.3))
i <- order(d$x)
matlines(d$x[i], p[i, ], lty = 1, col = 4, lwd = 2)
axis(4, at = p[i, ][1, ], labels = names(p))

Lasso Model Terms

Description

Constructor function and plotting for Lasso penalized model terms for GAMLSS.

Usage

## model term constructor function
la(x, type = 1, const = 1e-05, ...)

## plotting function
plot_lasso(x, terms = NULL,
  which = c("criterion", "coefficients"),
  zoom = c(3, 4), spar = TRUE, ...)

Arguments

x

For function la(), a numeric vector or matrix, or a formula. See the examples. For function plot_lasso(), an object returned from gamlss2.

type

Integer or character, the type of the Lasso penalty. type = 1 or type = "normal" uses the normal penalty, type = 2 or type = "group" the group penalty, type = 3 or type = "ordinal" the ordinal fusion penalty and type = 4 or type = "nominal" the nominal fusion penalty.

const

Numeric, the constant that is used for approximating the absolute function.

terms

Character or integer, the model term that should be plotted. The default terms = NULL is plotting all model terms.

which

Character, should the information criterion or the coefficient paths be plotted? See the examples.

zoom

Numeric vector of length 2, the zooming factors for plotting information criteria curves and coefficient paths. The first element sets the distance from the optimum shrinkage parameter lambda to the left side, and the second element to the right side, respectively.

spar

Logical, should plotting parameters be automatically set in par?

...

For function la() further control arguments can be passed: The criterion = "bic" for shrinkage parameter selection, arguments for creating the model.matrix if the model term is specified using a formula, and scale = TRUE/FALSE to control the internal scaling of the design matrix. For function plot_lasso() arguments like lwd, col, main, etc., that control plotting parameters can be supplied. An additional ridge penalty (elastic net) can be added to each la() term be setting add_ridge = TRUE in the gamlss2 call.

Details

To implement the Lasso penalty, an approximation of the absolute value function is used, following the approach by Oelker and Tutz (2015). This enables the use of standard Newton-Raphson-type algorithms for estimation. Each Lasso model term has its own shrinkage parameter, allowing a mix of different penalty types within the model. The framework builds on the methodology of Groll et al. (2019), where coefficients are updated through iteratively reweighted least squares (IWLS). This is feasible due to the absolute function approximation, which results in a quadratic penalty matrix similar to that used in penalized splines. By default, the shrinkage parameters are selected using the Bayesian Information Criterion (BIC).

The choice of type determines the structure that is penalized:

  • "normal" penalizes individual coefficients and is suitable for numeric covariates or general design matrices.

  • "group" penalizes one complete block of coefficients jointly and is mainly useful for factor effects or other grouped design matrices.

  • "ordinal" uses an adjacent-differences fusion penalty and is intended for ordered factor effects or matrices whose columns represent ordered levels.

  • "nominal" uses a pairwise-differences fusion penalty and is intended for unordered factor effects or matrices whose columns represent unordered levels.

For numerical stability and comparable shrinkage across terms, la() can scale the underlying design matrix internally. The exact scaling depends on the selected penalty type. For "ordinal" and "nominal" terms, an additional ridge component is included by default.

Value

The la() function is used internally within gamlss2 and provides the necessary details for estimating Lasso-type model terms. Essentially, it serves as a special model term, as outlined in specials.

Currently, the plot_lasso() function does not return any output.

References

Andreas Groll, Julien Hambuckers, Thomas Kneib, and Nikolaus Umlauf (2019). Lasso-type penalization in the framework of generalized additive models for location, scale and shape. Computational Statistics & Data Analysis. doi:10.1016/j.csda.2019.06.005

Oelker Margreth-Ruth and Tutz Gerhard (2015). A uniform framework for combination of penalties in generalized structured models. Adv Data Anal Classif. doi:10.1007/s11634-015-0205-y

See Also

gamlss2, specials.

Examples

## Not run: ## load data
data("rent", package = "gamlss.data")

## transform numeric to factor variables
rent$Flc <- cut(rent$Fl, breaks = seq(20, 160, by = 10),
  include.lowest = TRUE)
rent$Ac <- cut(rent$A, breaks = seq(1890, 1990, by = 10),
  include.lowest = TRUE)

## set up the model formula for a BCT model
f <- R ~ la(Flc,type=3) + la(Ac,type=3) + la(loc,type=4) | . | .

## estimation
m1 <- gamlss2(f, data = rent, family = BCT)

## summary, shows the estimated degrees of freedom
## for each model term
summary(m1)

## plot estimated coefficients
plot(m1)

## plot information criteria curves
## for each model term
plot_lasso(m1)

## plot parameter paths
plot_lasso(m1, which = "coefficients")

## plot a single model term
plot_lasso(m1, which = "coefficients", term = 5)

## same with
plot_lasso(m1, which = "coefficients", term = "sigma.la(Ac")

## zoom out
plot_lasso(m1, which = "coefficients", term = 5,
  zoom = c(8, 7))

## set names
plot_lasso(m1, which = "coefficients", term = 5,
  zoom = c(8, 7), names = c("A", "B", "C"))

## set title
plot_lasso(m1, which = "coefficients", term = 5,
  zoom = c(8, 7), main = "Fused Lasso")

## simulated example using the nominal fused lasso
## and a matrix as argument for la()
set.seed(1328)

## number of observations and covariates
n <- 500
k <- 50

## model matrix
X <- matrix(rnorm(n * k), n, k)
colnames(X) <- paste0("x", 1:k)

## true coefficients
beta <- list(
  "mu" = rbinom(k, 1, 0.1),
  "sigma" = rbinom(k, 1, 0.1) * 0.3
)

## parameters
mu <- X %*% beta$mu
sigma <- exp(-1 + X %*% beta$sigma)

## response
y <- rnorm(n, mean = mu, sd = sigma)

## model formula with nominal fused lasso
f <- y ~ la(X,type=4) | la(X,type=4)

## estimate model incl. extra ridge penalty
## for all la() model terms
m2 <- gamlss2(f, add_ridge = TRUE)

## plot information criteria curves
plot_lasso(m2)

## coefficient paths
plot_lasso(m2, which = "coefficients")

## zoom out
plot_lasso(m2, which = "coefficients",
  zoom = c(8, 9))

## extract coefficients
cb <- coef(m2, full = TRUE)

## compare (without intercept)
cb_mu <- cb[grep("mu.", names(cb))][-1]
cb_sigma <- cb[grep("sigma.", names(cb))][-1]

## true positive rate
tp <- mean(c(cb_mu[beta$mu > 0] > 0,
  cb_sigma[beta$sigma > 0] > 0))

## false positive rate, needs threshold
thres <- 0.01
fp <- mean(c(abs(cb_mu[beta$mu == 0]) > thres,
  abs(cb_sigma[beta$sigma == 0]) > thres))

## End(Not run)

Create a Link for Families

Description

This function is used with the family functions in gamlss2(). Given the name of a link, it returns a link function, an inverse link function, the derivative dμ/dηd\mu/d\eta, and a function for domain checking. Note that make.link2() is slightly more flexible and also allows functions as arguments.

Usage

make.link2(link)

Arguments

link

A character string, see function make.link, or function.

Value

A list of class "link_gamlss2" containing the following components:

linkfun

Link function function(mu).

linkinv

Inverse link function function(eta).

mu.eta

Derivative function(eta): dμ/dηd\mu/d\eta.

valideta

Function function(eta) that returns TRUE if eta is in the domain of linkinv.

name

A character string representing the name of the link function.

See Also

make.link, gamlss2, gamlss2.family.

Examples

## character specification
utils::str(make.link2("logit"))

## functions
utils::str(make.link2(softplus))

Run MCMC Sampling for a Fitted gamlss2 Model

Description

The function mcmc() runs MCMC sampling for a fitted gamlss2 model. It reuses the fitted model object and merges the MCMC output back into the original object so that standard methods such as summary and plot work as expected.

Usage

mcmc(object, n.iter = 1200, burnin = 200, thin = 1)

Arguments

object

An object of class "gamlss2" or "bamlss2" as returned by gamlss2.

n.iter

Integer, the total number of MCMC iterations.

burnin

Integer, the burn-in period.

thin

Integer, the thinning parameter for saved samples.

Details

The function sets n.iter, burnin, and thin in the model control list and obtains starting values from the fitted object, including smoothing parameters. The MCMC sampler BS is then called and the resulting samples and posterior summaries are merged into the original model object.

This approach ensures that the result retains the terms, call, and related components required by extractor and plotting methods. If the fitted model object does not contain the design information required for MCMC sampling, for example because a reduced control setting was used, sampling cannot be performed.

Value

A model object inheriting from "gamlss2" containing posterior samples and summary information. The object also contains:

  • samples: posterior draws for coefficients and smoothing variances;

  • results: model results derived from posterior means;

  • elapsed: elapsed sampling time.

References

Umlauf N, Klein N, Zeileis A (2018). BAMLSS: Bayesian Additive Models for Location, Scale and Shape (and Beyond). Journal of Computational and Graphical Statistics, 27(3), 612–627. doi:10.1080/10618600.2017.1407325

See Also

gamlss2, bamlss2, BS

Examples

## Not run: ## load the abdominal circumference data
data("abdom", package = "gamlss.data")

## specify the model formula
f <- y ~ s(x) | s(x) | 1 | 1

## estimate model using backfitting
m1 <- gamlss2(f, data = abdom, family = BCT)

## run MCMC sampling
set.seed(1328)
m2 <- mcmc(m1)

## generate more samples
m2 <- mcmc(m2)

## posterior summary
summary(m2)

## plot estimated effects
plot(m2)

## plot samples
plot(m2, which = "samples")
plot(m2, which = "samples", ask = TRUE)
plot(m2, which = "samples", ask = TRUE,
  model = "mu", lag.max = 120, term = "s(x)")

## samples can be extracted for further inspection
print(head(m2$samples))

## compare estimated mean effects
plot(c(m1, m2))

## predict parameters using samples
pm <- predict(m2, FUN = mean)
print(head(pm))
psd <- predict(m2, FUN = sd)
print(head(psd))

## End(Not run)

Multinomial Logit Family

Description

The MN() family implements a multinomial logit model for categorical responses with k2k \ge 2 unordered categories.

Usage

MN(k)

Arguments

k

An integer specifying the number of response categories.

Details

The response must be a factor with exactly k levels. The first level is treated as the reference category.

For categories j=2,,kj = 2, \ldots, k, each predictor models the log-odds relative to the reference category.

The family provides analytical first and second derivatives of the log-likelihood with respect to the predictors and a probabilities() method for fitted category probabilities.

Value

A "gamlss2.family" object to be used with gamlss2.

See Also

gamlss2, gamlss2.family

Examples

## Not run: data("marital.nz", package = "VGAM")

## multinomial model
m <- gamlss2(mstatus ~ s(age) | . | ., data = marital.nz, family = MN(4))

## plot effects
plot(m)

## predict probabilities
par <- predict(m)
probs <- family(m)$probabilities(par)
names(probs) <- levels(marital.nz$mstatus)

## plot
i <- order(marital.nz$age)
matplot(marital.nz$age[i], probs[i, ], type = "l", lty = 1, lwd = 2,
  xlab = "Age", ylab = "Probabilities")
legend("center", names(probs), lty = 1, lwd = 2, col = 1:4)

## End(Not run)

Extracting a New Formula After Selection Algorithms

Description

The generic function extracts the selected model terms from a fitted selection procedure and returns the corresponding reduced model formula.

Usage

new_formula(object, ...)

Arguments

object

A fitted model object produced by a selection routine such as step_gamlss2 or select_gamlss2.

...

Not used yet.

Details

The function is intended for fitted models that store the outcome of an automatic term-selection step in object$selection$formula. It extracts the selected right-hand side terms and restores the original response on the left-hand side.

For multi-parameter distributional models, the returned object is a Formula with one right-hand side part for each distribution parameter. This makes it convenient to inspect the final model or to refit it without repeating the selection step.

Value

A Formula containing the selected model terms.

See Also

step_gamlss2, select_gamlss2, gamlss2

Examples

## no runnable example; see \code{\link{step_gamlss2}} and
## \code{\link{select_gamlss2}} for end-to-end usage

Ordered Logistic Family for Ordinal Regression

Description

Defines the ordered logistic (cumulative logit) family for modeling ordinal response variables within the gamlss2 framework. This implementation supports flexible modeling of the location and threshold (cutpoint) parameters, including effects of covariates.

Usage

OL(k)

ologit(k)

Arguments

k

An integer specifying the number of response categories. Must be k >= 2.

Details

This family implements a cumulative logit model for ordinal responses with k ordered categories. The linear predictor models a latent location parameter, and the cutpoints between response categories are parameterized via a monotonic transformation:

  • The first cutpoint is modeled directly (theta1).

  • The remaining cutpoints are expressed as theta1 + exp(delta_j) for j = 2, ..., k - 1, ensuring that the thresholds remain ordered.

The OL() family supports modeling the location and threshold differences (delta_j) as functions of covariates using additive predictors in gamlss2 via the "|" formula interface.

The family returns an object of class "gamlss2.family", which includes methods for evaluating the log-likelihood, simulating from the model, and computing predicted probabilities.

ologit() is retained as a compatibility alias for OL().

Value

A "gamlss2.family" object to be used with gamlss2.

See Also

gamlss2, gamlss2.family, polr

Examples

## example using the housing data from the MASS package:
library("MASS")

## fit standard cumulative logit model using polr()
m1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(m1)

## convert response to integer for use with gamlss2
housing$Satint <- as.integer(housing$Sat)

## fit equivalent model using gamlss2
m2 <- gamlss2(Satint ~ Infl + Type + Cont,
  data = housing, weights = Freq,
  family = OL(k = 3))
summary(m2)

## compare coefficients
coef(m1)
coef(m2)

## predict class probabilities
pm1 <- predict(m1, type = "p")
pm2 <- predict(m2)
pm2 <- family(m2)$probabilities(pm2)

print(head(pm1))
print(head(pm2))

P-Splines for GAMLSS

Description

Estimation of P-splines using an efficient local maximum likelihood approach to automatically select the smoothing parameter. According to the inventors of P-splines, pb stands for "penalized beta" splines or "Paul and Brian".

Usage

pb(x, k = 20, ...)

Arguments

x

The variable that should be used for estimation.

k

The dimension of the B-spline basis to represent the smooth term.

...

Further arguments passed to function s.

Details

Function pb() is an internal wrapper function that calls s to set up a smooth specification object that can be used for model fitting with gamlss2. Using pb(), an efficient local maximum likelihood approach is used to estimate the smoothing parameter. See the reference for details.

Value

The function returns a smooth specification object of class "ps.smooth.spec", see also smooth.construct.ps.smooth.spec.

References

Eilers PHC, Marx BD (1996). “Flexible Smoothing with B-Splines and Penalties.” Statistical Science, 11(2), 89–121. doi:10.1214/ss/1038425655

Rigby RA, Stasinopoulos DM (2014). “Automatic Smoothing Parameter Selection in GAMLSS with an Application to Centile Estimation.” Statistical Methods in Medical Research, 23(4), 318–332. doi:10.1177/0962280212473302

See Also

gamlss2, smooth.construct.ps.smooth.spec

Examples

## Not run: ## load head circumference data
data("dbhh", package = "gamlss.data")

## specify the model formula
f <- head ~ pb(age) | pb(age) | pb(age) | pb(age)

## estimate model
m <- gamlss2(f, data = dbhh, family = BCT)

## visualize estimated effects
plot(m, which = "effects")

## plot diagnostics
plot(m, which = "resid")

## predict quantiles
pq <- quantile(m, probs = c(0.05, 0.5, 0.95))

## plot
plot(head ~ age, data = dbhh, pch = 19,
  col = rgb(0.1, 0.1, 0.1, alpha = 0.3))
matlines(dbhh$age, pq,
  lty = 1, col = 4)

## End(Not run)

Plotting GAMLSS

Description

Plotting methods for objects of class "gamlss2" and "gamlss2.list". These methods support effect plots for model terms and residual diagnostics. Effect plots for model terms with more than two covariates are not supported; in such cases, use predict and plot the results manually.

Usage

## S3 method for class 'gamlss2'
plot(x, parameter = NULL,
  which = "effects", terms = NULL,
  scale = TRUE, spar = TRUE, ...)

## S3 method for class 'gamlss2.list'
plot(x, parameter = NULL, which = "effects",
  terms = NULL, spar = TRUE, legend = TRUE, ...)

Arguments

x

An object of class "gamlss2" or "gamlss2.list". Objects of class "gamlss2.list" can be created by combining "gamlss2" objects with c(). See the examples.

parameter

Character or integer specifying for which distribution parameter the plots should be created. The arguments model and what, if supplied through ..., are treated as aliases for parameter to support alternative calling styles.

which

Character or integer selecting the type of plot. "effects" produces effect plots of model terms, "resid" shows the available residual diagnostics, "hist-resid" produces a histogram of residuals, "qq-resid" a quantile-quantile plot of residuals, "wp-resid" a worm plot of residuals, and "scatter-resid" a scatter plot of residuals against fitted values for the distribution mean (or median, if available in the family object). In addition, "selection" plots the GAIC path stored for a fitted stepwise or selection object, and "samples" visualizes posterior or simulation draws for objects that contain samples.

terms

Character or integer. Which model term(s) should be plotted?

scale

Logical. If TRUE, effect plots all have the same scale on the y-axis. If FALSE, each effect plot has its own y-axis scale.

spar

Logical. Should graphical parameters be set automatically?

legend

Logical. Should a legend be added for plots with multiple models?

...

Additional graphical arguments such as lwd, lty, col, legend = TRUE (for multiple model plots), among others, depending on the type of plot. Arguments model and what are treated as aliases for parameter.

Details

For objects of class "gamlss2", which = "effects" is the default and plots the partial effects stored in results(object). If no effect information is available, residual diagnostics are shown instead.

For objects of class "gamlss2.list", plots compare effect estimates across several fitted models. Residual diagnostics are only available for single-model objects.

See Also

gamlss2.

Examples

## Not run: ## load data
data("film90", package = "gamlss.data")

## model formula
f <- lborev1 ~ s(lboopen) + s(lnosc) | . | . | .

## estimate model
m1 <- gamlss2(f, data = film90, family = BCCG)

## plot effects (default)
plot(m1)

## plot specific effect
plot(m1, parameter = "sigma")
plot(m1, model = "sigma")
plot(m1, model = "nu", terms = 1)
plot(m1, model = "nu", terms = 2)
plot(m1, model = "nu", terms = "lnosc")
plot(m1, terms = "lnosc")

## plot all residual diagnostics
plot(m1, which = "resid")

## single diagnostic plots
plot(m1, which = "hist-resid")
plot(m1, which = "qq-resid")
plot(m1, which = "wp-resid")
plot(m1, which = "scatter-resid")

## if available, plot stored stepwise-selection path
## plot(m1, which = "selection")

## estimate another model
m2 <- gamlss2(f, data = film90, family = BCPE)

## compare estimated effects
plot(c(m1, m2))
plot(c(m1, m2), terms = "lboopen",
  col = c(1, 4), lwd = 3, lty = 1,
  pos = c("topleft", "topright", "bottomleft", "bottomright"))
plot(c(m1, m2), model = "sigma")
plot(c(m1, m2), model = "sigma", terms = 2)
plot(c(m1, m2), model = c("mu", "nu"))

## End(Not run)

Extracting Fitted or Predicted Parameters or Terms from gamlss2 Models

Description

Methods for gamlss2 model objects that extract fitted values or predictions for distribution parameters, additive predictors, individual terms, or the mean of the response distribution.

Usage

## S3 method for class 'gamlss2'
predict(object, model = NULL, newdata = NULL,
  type = c("parameter", "link", "response", "terms"), terms = NULL,
  se.fit = FALSE, drop = TRUE, ...)

Arguments

object

A model object of class gamlss2.

model

Character. Which distribution parameter(s) should be predicted? This can be one or more of "mu", "sigma", etc. By default, predictions are returned for all model parts.

newdata

Data frame. Optionally, a new data frame in which to look for variables with which to predict. If omitted, the original observations are used.

type

Character. Which type of prediction should be computed? Use "link" for additive predictors on the link scale, "parameter" for distribution parameters on their natural scale, "terms" for individual additive terms, and "response" for the mean of the fitted response distribution.

terms

Character. Which of the terms in the additive predictor(s) should be included? By default all terms are included.

se.fit

Logical. Should standard errors for the predictions be included? Standard errors are computed by simulating from the approximate multivariate normal distribution of the maximum likelihood estimates. The number of simulations is controlled by the argument R, which defaults to R = 200, and can be passed via ....

drop

Logical. Should the predictions be simplified to a vector if possible (TRUE) or always returned as a data frame (FALSE)?

...

Further control arguments. Supported aliases are what and parameter for model. Additional arguments are used for simulation-based summaries, including FUN, R, seed, burnin, and nogrep.

Details

Prediction for a gamlss2 object proceeds in stages. First, the required model terms are evaluated for the original data or for newdata. These terms are then added to form the full additive predictor for each distribution parameter. Applying the inverse link function yields the predicted distribution parameters. If requested, the mean of the corresponding response distribution is computed in a final step.

If se.fit = TRUE, or if a summary function is supplied via FUN, predictions are based on simulation draws from the approximate multivariate normal distribution of the fitted coefficients, unless the object already contains stored draws in samples. For objects of class "bamlss2", stored posterior draws are used directly.

For type = "terms", the terms argument is matched against linear and special-term labels. By default, partial matching is used; set nogrep = TRUE in ... to require exact matching.

See also prodist.gamlss2 for setting up a full distributions3 object from which moments, probabilities, quantiles, or random numbers can be obtained.

Value

If drop = FALSE, a data frame is returned. If drop = TRUE (the default), the result may be simplified to a numeric vector when possible. For type = "terms", a data frame is returned with one column per selected term. For se.fit = TRUE, the returned object contains fitted values and corresponding simulation-based standard errors.

See Also

predict, prodist.gamlss2, quantile.gamlss2

Examples

## fit heteroscedastic normal GAMLSS model
## stopping distance (ft) explained by speed (mph)
data("cars", package = "datasets")
m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO)

## new data for predictions
nd <- data.frame(speed = c(10, 20, 30))

## default: model parameter(s) for all model parts
predict(m, newdata = nd)

## additive predictors on the link scale
predict(m, newdata = nd, type = "link")

## mean of the response distribution
predict(m, newdata = nd, type = "response")

## model parameter(s)
predict(m, newdata = nd, model = "sigma")
predict(m, newdata = nd, model = "sigma", drop = FALSE)

## individual terms in additive predictor(s)
predict(m, newdata = nd, type = "terms", model = "sigma")
predict(m, newdata = nd, type = "terms", model = "sigma", terms = "s(speed)")

## predict quantiles
quantile(m, newdata = nd, probs = c(0.1, 0.5, 0.9))

## standard errors
predict(m, newdata = nd, se.fit = TRUE, R = 200)

Extracting Fitted or Predicted Probability Distributions from gamlss2 Models

Description

Methods for gamlss2 model objects for extracting fitted (in-sample) or predicted (out-of-sample) probability distributions as distributions3 objects.

Usage

## S3 method for class 'gamlss2'
prodist(object, ...)

Arguments

object

A model object of class gamlss2.

...

Arguments passed on to predict.gamlss2, e.g., newdata.

Details

To facilitate making probabilistic forecasts based on gamlss2 model objects, the prodist method extracts fitted or predicted probability distribution objects. Internally, the predict.gamlss2 method is used first to obtain the distribution parameters (mu, sigma, tau, nu, or a subset thereof). Subsequently, the corresponding distribution object is set up as a GDF object, a unified gamlss2 distribution interface for the workflow provided by the distributions3 package (see Zeileis et al. 2022).

Note that these probability distributions only reflect the random variation in the dependent variable based on the model employed (and its associated distributional assumption for the dependent variable). This does not capture the uncertainty in the parameter estimates.

Value

An object of class GDF inheriting from distribution.

References

Zeileis A, Lang MN, Hayes A (2022). “distributions3: From Basic Probability to Probabilistic Regression.” Presented at useR! 2022 - The R User Conference. Slides, video, vignette, code at https://www.zeileis.org/news/user2022/.

See Also

GDF, predict.gamlss2

Examples

## packages, code, and data
library("distributions3")
data("cars", package = "datasets")

## fit heteroscedastic normal GAMLSS model
## stopping distance (ft) explained by speed (mph)
m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO)

## obtain predicted distributions for three levels of speed
d <- prodist(m, newdata = data.frame(speed = c(10, 20, 30)))
print(d)

## obtain quantiles (works the same for any distribution object 'd' !)
quantile(d, 0.5)
quantile(d, c(0.05, 0.5, 0.95), elementwise = FALSE)
quantile(d, c(0.05, 0.5, 0.95), elementwise = TRUE)

## visualization
plot(dist ~ speed, data = cars)
nd <- data.frame(speed = 0:240/4)
nd$dist <- prodist(m, newdata = nd)
nd$fit <- quantile(nd$dist, c(0.05, 0.5, 0.95))
matplot(nd$speed, nd$fit, type = "l", lty = 1, col = "slategray", add = TRUE)

## moments
mean(d)
variance(d)

## simulate random numbers
random(d, 5)

## density and distribution
pdf(d, 50 * -2:2)
cdf(d, 50 * -2:2)

## Poisson example
data("FIFA2018", package = "distributions3")
m2 <- gamlss2(goals ~ s(difference), data = FIFA2018, family = PO)
d2 <- prodist(m2, newdata = data.frame(difference = 0))
print(d2)
quantile(d2, c(0.05, 0.5, 0.95))

## note that log_pdf() can replicate logLik() value
sum(log_pdf(prodist(m2), FIFA2018$goals))
logLik(m2)

Quantiles for GAMLSS

Description

The function computes estimated quantiles and optionally produces a plot.

Usage

## S3 method for class 'gamlss2'
quantile(x, probs = c(0.025, 0.25, 0.50, 0.75, 0.975),
  variable = NULL, newdata = NULL,
  plot = FALSE, data = TRUE,
  n = 100L, ...)

Arguments

x

An object of class "gamlss2".

probs

Numeric vector of probabilities with values in [0,1].

variable

Logical, integer, or character. Should quantiles be plotted against a covariate? If TRUE, the single model covariate is used. Alternatively, a covariate can be selected by position or name. This option is only available for single-covariate models.

newdata

Data frame that should be used for computing the quantiles.

plot

Logical, should a plot be shown?

data

Logical, should the raw data be added to the plot?

n

Integer, number of observations that should be used to compute an equidistant grid for the selected variable.

...

Arguments such as col, legend = TRUE/FALSE. See the examples.

Details

The function applies the predict method to determine the parameters of the response distribution. It then computes the quantiles as specified in the argument probs.

Value

The estimated quantiles. For multiple probabilities, a data frame is returned. For a single probability and variable = NULL, the result is simplified to a numeric vector.

See Also

gamlss2.

Examples

## Not run: ## load data
data("film90", package = "gamlss.data")

## model formula
f <- lborev1 ~ s(lboopen) | . | . | .

## estimate model
m <- gamlss2(f, data = film90, family = BCPE)

## compute quantiles using "newdata"
nd <- film90[1:10, ]
print(quantile(m, newdata = nd))

## plot sorted quantiles
quantile(m, plot = TRUE)

## quantile plot using covariate data
quantile(m, plot = TRUE, variable = TRUE)

## plot without raw data
quantile(m, plot = TRUE, variable = TRUE, data = FALSE)

## End(Not run)

Random Effects

Description

Random effects can be included in gamlss2 models in two ways. The first uses s() with bs = "re" for simple random effects, i.e., when a single factor is entered into the model as a smoother. This approach relies on the s() function from the mgcv package. For example, if area is a factor with several levels, then s(area, bs = "re") shrinks the level-specific effects of area towards their overall mean.

The second, more general approach uses the model term constructor re(), which provides an interface to the specialised random-effects functionality in the nlme package. This allows fitting more complex random-effects structures.

This documentation focuses on the re() function, but we also provide examples using s(..., bs = "re").

Usage

re(random, correlation = NULL, ...)

Arguments

random

A formula specifying the random effect part of the model, as in the lme function.

correlation

An optional correlation object, see lme.

...

For the re() function, the dots argument is used to specify additional control arguments for the lme function, such as the method and correlation arguments.

Details

Both functions set up model terms that can be estimated using a backfitting algorithm, e.g., the default RS algorithm.

Value

Function s with bs = "re" returns a smooth specification object of class "re.smooth.spec", see also smooth.construct.re.smooth.spec.

The re() function returns a special model term specification object, see specials for details.

See Also

gamlss2, smooth.construct.re.smooth.spec, s, lme

Examples

## Not run: ## orthodontic measurement data
data("Orthodont", package = "nlme")

## model using lme()
m <- lme(distance ~ I(age-11), data = Orthodont,
  random =~ I(age-11) | Subject, method = "ML")

## using re(), function I() is not supported,
## please transform all variables in advance
Orthodont$age11  <- Orthodont$age - 11

## estimation using the re() constructor,
## setup formula first
f <- distance ~ age11 + re(~age11|Subject)

## estimate model
m1 <- gamlss2(f, data = Orthodont)

## compare fitted values
plot(fitted(m1, model = "mu"), fitted(m))
abline(0, 1, col = 4)

## extract summary for re() model term
st <- specials(m1, model = "mu", term = "age11", elements = "model")
summary(st)

## random intercepts and slopes with s() using AIC
a <- gamlss2(distance ~ s(age,k=3) + s(Subject, bs = "re") + s(Subject, age11, bs = "re"),
  data = Orthodont)

## compare fitted values
plot(fitted(m1, model = "mu"), fitted(m))
points(fitted(a, model = "mu"), fitted(m), col = 2)
abline(0, 1, col = 4)

## more complicated correlation structures
data("Ovary", package = "nlme")

## ARMA model
m <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), data = Ovary, 
  random = pdDiag(~sin(2*pi*Time)), correlation = corARMA(q = 2))

## now with gamlss2(), transform in advance
Ovary$sin1 <- sin(2 * pi * Ovary$Time)
Ovary$cos1 <- cos(2 * pi * Ovary$Time)

## model formula
f <- follicles ~ sin1 + cos1 +
  re(~ 1 | Mare) +
  re(~ sin1 - 1 | Mare, correlation = corARMA(q = 2))

## estimate model
m2 <- gamlss2(f, data = Ovary)

## smooth random effects
f <- follicles ~ ti(Time) + ti(Mare, bs = "re") + 
  ti(Mare, Time, bs = c("re", "cr"), k = c(11, 5))

g <- gamlss2(f, data = Ovary)

## compare fitted values
par(mfrow = n2mfrow(nlevels(Ovary$Mare)), mar = c(4, 4, 1, 1))

for(j in levels(Ovary$Mare)) {
  ds <- subset(Ovary, Mare == j)

  plot(follicles ~ Time, data = ds)

  f <- fitted(m2, model = "mu")[Ovary$Mare == j]
  lines(f ~ ds$Time, col = 4, lwd = 2)

  f <- fitted(g, model = "mu")[Ovary$Mare == j]
  lines(f ~ ds$Time, col = 3, lwd = 2)

  f <- fitted(m)[Ovary$Mare == j]
  lines(f ~ ds$Time, col = 2, lwd = 2)
}

## simulated data
set.seed(1328)

n <- 10000
k <- 500

## generate random effects
f <- as.factor(sample(1:k, size = n, replace = TRUE))
ref <- rnorm(k, sd = 0.6)

## random effects only for sigma
y <- rBCT(n, mu = 10, sigma = exp(-1 + ref[f]), tau = 10)

## estimate model
m3 <- gamlss2(y ~ 1 | re(~ 1 | f), family = BCT)

## extract fitted random effects
cb <- coef(m3, full = TRUE)
cb <- cb[grepl("re", names(cb), fixed = TRUE)]

## compare
plot(cb, ref, main = round(mean((cb - ref)^2), 2))
abline(0, 1, lwd = 2, col = 4)

## End(Not run)

Auxiliary Functions for Formulas and Model Objects

Description

Various auxiliary functions to facilitate the work with formulas and fitted model objects.

Usage

response_name(formula)

Arguments

formula

A formula, Formula, or a fitted model object.

Value

Function response_name extracts the response name as a character vector.

See Also

gamlss2

Examples

## basic formula
f <- y ~ x1 + x2 + log(x3)
response_name(f)

## formula with multiple responses
f <- y1 | y2 | y3 ~ x1 + s(x2) + x3 + te(log(x3), x4) | x2 + ti(x5)
response_name(f)

## list of formulas
f <- list(
  y1 ~ x1 + s(x2) + x3 + te(log(x3), x4),
  y2  ~ x2 + sqrt(x5),
  y3  ~ z2 + x1 + exp(x3) + s(x10)
)
response_name(f)

Rigby and Stasinopoulos (RS) & Cole and Green (CG) Algorithm

Description

The function RS() implements the algorithm of Rigby and Stasinopoulos, the function CG() implements the algorithm of Cole and Green for estimating a GAMLSS with gamlss2.

Usage

## rigby and stasinopoulos algorithm
RS(x, y, specials, family, offsets,
  weights, start, xterms, sterms, control)

## cole and green algorithm
CG(x, y, specials, family, offsets,
  weights, start, xterms, sterms, control)

Arguments

x

The full model matrix to be used for fitting.

y

The response vector or matrix.

specials

A named list of special model terms, e.g., including design and penalty matrices for fitting smooth terms using smooth.construct.

family

A family object, see gamlss2.family.

offsets

If supplied, a list or data frame of possible model offset.

weights

If supplied, a numeric vector of weights.

start

Starting values, either for the parameters of the response distribution or, if specified as a named list in which each element of length one is named with "(Intercept)", the respective intercepts are initialized. If starting values are specified as a named list, data frame or matrix, where each element/column is a vector with the same length as the number of observations in the data, the respective predictors are initialized with these. See the examples for gamlss2.

xterms

A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object.

sterms

A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object.

control

Further control arguments as specified within the call of gamlss2. See the details.

Details

Functions RS() and CG() are called within gamlss2. Both functions implement a backfitting algorithm for estimating GAMLSS. For algorithm details see Rigby and Stasinopoulos (2005).

The functions use the following control arguments:

  • eps: Numeric vector of length 2, the stopping criterion. Default is eps = c(1e-05, 1e-05) for the outer and the inner backfitting loop.

  • maxit: Integer vector of length 2, the maximum number of iterations of the outer and inner backfitting loop. Default is maxit = c(100, 10).

  • step: Numeric, the step length control parameter. Default is step = 1. Note that if step is set smaller than 1, it might be appropriate to lower the stopping criterion eps, too.

  • CG: Integer, the number of iterations when to start the CG correction. Default is CG = Inf.

  • trace: Logical, should information be printed while the algorithm is running?

  • flush: Logical, use flush.console for displaying the current output in the console.

  • ridge: Logical, should automatic ridge penalization be applied only to linear effects, without penalizing the intercept? For each parameter of the distribution the optimum ridge penalty is estimated using an information criterion. Possible options are criterion = c("aic", "aicc", "bic", "gaic", "gcv"). The default is criterion = "gaic" and argument K = 2, which can be set in gamlss2_control.

To facilitate the development of new algorithms for gamlss2, users can exchange them using the optimizer argument in gamlss2_control. Users developing new model fitting functions are advised to use these functions as templates and pass them to gamlss2_control. Alternatively, users can replace the optimizer function by adding a named list element, "optimizer", to the family object. For instructions on setting up new families in gamlss2, see gamlss2.family.

Value

Functions RS() and CG() return a named list of class "gamlss2" containing the following objects:

fitted.values

A data frame of the fitted values of the modeled parameters of the selected distribution.

fitted.specials

A named list, one element for each parameter of the distribution, containing the fitted model object information of special model terms.

fitted.linear

A named list, one element for each parameter of the distribution, containing the information on fitted linear effects.

coefficients

A named list, one element for each parameter of the distribution, containing the estimated parameters of the linear effects.

elapsed

The elapsed runtime of the algorithm.

iterations

How many iterations the algorithm performed.

logLik

The final value of the log-likelihood of the model.

control

All control arguments used as supplied from function gamlss2_control.

References

Rigby RA, Stasinopoulos DM (2005). “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507–554. doi:10.1111/j.1467-9876.2005.00510.x

See Also

gamlss2, gamlss2_control, gamlss2.family

Examples

## Not run: ## count response example using HarzTraffic data
data("HarzTraffic", package = "gamlss2")

## specify the model formula
f <- bikes ~ s(yday, bs = "cc") + s(temp) + s(rain) | .

## estimate model using RS (default)
m1 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = RS)

## now with CG
m2 <- gamlss2(f, data = HarzTraffic, family = NBI, optimizer = CG)

## first 2 RS iterations and afterwards switch to CG
m3 <- gamlss2(f, data = HarzTraffic, family = NBI, CG = 2)

## End(Not run)

GAIC and Generalised (Pseudo) R-squared for GAMLSS Models

Description

Functions to compute the GAIC and the generalized R-squared of Nagelkerke (1991) for GAMLSS models.

Usage

## information criteria
GAIC(object, ...,
  k = 2, corrected = FALSE)

## r-squared
Rsq(object, ...,
  type = c("Cox Snell", "Cragg Uhler", "both", "simple"),
  newdata = NULL)

Arguments

object

A fitted model object.

...

Optionally more fitted model objects.

k

Numeric, the penalty to be used. The default k = 2 corresponds to the classical AIC.

corrected

Logical, whether the corrected AIC should be used? Note that it applies only when k = 2.

type

Which definition of R-squared should be used? Possible values are "Cox Snell", "Cragg Uhler", "both", and "simple". The option "simple" computes an R-squared based on the median; in this case newdata may also be supplied.

newdata

For type = "simple", the R-squared can also be evaluated on newdata.

Details

The Rsq() function uses the following definition of R-squared:

R2=1(L(0)L(θ^))2/nR^2=1- \left(\frac{L(0)}{L(\hat{\theta})}\right)^{2/n}

where L(0)L(0) is the null model (only a constant is fitted to all parameters) and L(θ^)L(\hat{\theta}) is the fitted model under consideration. This definition is sometimes referred to as the Cox and Snell R-squared. The Nagelkerke or Cragg and Uhler definition divides the above by

1L(0)2/n1 - L(0)^{2/n}

Value

Numeric vector or data frame, depending on the number of fitted model objects.

References

Nagelkerke NJD (1991). “A Note on a General Definition of the Coefficient of Determination.” Biometrika, 78(3), 691–692. doi:10.1093/biomet/78.3.691

See Also

gamlss2

Examples

## load the aids data set
data("aids", package = "gamlss.data")

## estimate negative binomial count models
m1 <- gamlss2(y ~ x + qrt, data = aids, family = NBI)
m2 <- gamlss2(y ~ s(x) + s(qrt, bs = "re"), data = aids, family = NBI)

## compare models
Rsq(m1)
Rsq(m1, type = "both")
Rsq(m1, m2)
GAIC(m1, m2)
AIC(m1, m2)
BIC(m1, m2)

## plot estimated effects
plot(m2)

Smooth Model Term Selection with Additional Penalties

Description

The function select_gamlss2() applies an additional shrinkage penalty to all mgcv model terms. This can shrink some terms to zero and thereby remove them from the model. In addition to this penalty, model terms are selected based on two criteria: the estimated degrees of freedom of the term and the percentage of the predictor range covered by the model term. These two thresholds aim to mimic a natural selection process, similar to what one might do by inspecting summaries and effect plots, so that only relevant terms are retained in the final model.

Usage

## wrapper function
select_gamlss2(formula, ..., criterion = "BIC", thres = c(0.9, 0.2))

## modified RS optimizer function
sRS(x, y, specials, family, offsets,
  weights, start, xterms, sterms, control)

Arguments

formula

A GAM-type formula or Formula. All smooth terms of the mgcv package are supported, see also formula.gam. For gamlss.list() formula is a list of formulas.

...

Arguments passed to gamlss2 and to argument control in function sRS(). For criterion = "GAIC", the penalty can be supplied via K.

criterion

The information criterion used to estimate the shrinkage parameters. This can also be a vector of length 2, where the first element specifies the criterion to be used during the selection step, and the second element specifies the criterion to be used during the refitting step. Possible options are "BIC", "GCV", "AIC", "AICc" and "GAIC" with user-defined penalty K; the default is K = 2.

thres

A vector of thresholds used for model term selection. The first element controls the minimum allowed estimated degrees of freedom for a model term to enter the model. The second element specifies the minimum percentage of the total estimated predictor range required for a term to be included in the model.

x

The full model matrix to be used for fitting.

y

The response vector or matrix.

specials

A named list of special model terms, e.g., including design and penalty matrices for fitting smooth terms using smooth.construct.

family

A family object, see gamlss2.family.

offsets

If supplied, a list or data frame of model offsets.

weights

If supplied, a numeric vector of weights.

start

Starting values, either for the parameters of the response distribution or, if specified as a named list in which each element of length one is named with "(Intercept)", the respective intercepts are initialized. If starting values are specified as a named list, data frame or matrix, where each element/column is a vector with the same length as the number of observations in the data, the respective predictors are initialized with these. See the examples for gamlss2.

xterms

A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object.

sterms

A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object.

control

Further control arguments as specified within the call of gamlss2.

Details

The function select_gamlss2() selects model terms by identifying those with estimated degrees of freedom greater than a pre-specified threshold (e.g., thres = 0.9). In addition, model term selection can also be based on the percentage of the total predictor range covered by the model term to ensure that the term represents a substantial portion of the total effect. After the selection step, the model is refitted using only the selected terms, excluding the additional penalty applied during the initial fitting process. The additional penalty for automatic term selection is described in gam.selection. This approach is experimental, so model selection should still be assessed carefully.

Value

An object of class "gamlss2".

See Also

gamlss2, new_formula.

Examples

## Not run: ## simulate data
set.seed(123)

## number of observations
n <- 1000

## covariates
k <- 100
x <- matrix(runif(n * k, -3, 3), ncol = k)
colnames(x) <- paste0("x", 1:k)
d <- as.data.frame(x)

## true effects
d$f1 <- sin(d$x1)
d$f2 <- exp(d$x2)/15 - 1
d$f3 <- -d$x3 / 3
d$f4 <- d$x4^2/5 - 1 
d$f5 <- cos(d$x5)

## true parameters
mu <- with(d, f1 + f3 + f4)
sigma <- with(d, exp(1.5 + f2 + f4 + f5))

## simulate response
d$y <- rnorm(n, mean = mu, sd = sigma)

## model formula
f <- paste("y ~", paste0("s(x", 1:k, ")", collapse = "+"), "| .")
f <- as.formula(f)

## estimate model (takes some time)
m <- select_gamlss2(f, data = d, family = NO)

## plot selected estimated effects
plot(m)

## final model
new_formula(m)

## example taken from ?gam.selection
library("mgcv")
set.seed(3)
n <- 200

## simulate data
dat <- gamSim(1, n = n, scale = .15, dist = "poisson")

## spurious
dat$x4 <- runif(n, 0, 1)
dat$x5 <- runif(n, 0, 1)

## formula
f <- y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5)

## estimate model
m1 <- gam(f, data = dat, family = poisson,
  select = TRUE, method = "REML")
summary(m1)
plot(m1, pages = 1)

## same with gamlss2
m2 <- select_gamlss2(f, data = dat, family = PO)

## plot selected effects
plot(m2)

## final model
new_formula(m2)

## End(Not run)

Softplus Link Object

Description

Link object (with link function, inverse link function, etc.) that assures positivity of parameters based on the softplus function.

Usage

softplus(a = 1)

Arguments

a

Extra parameter of the generalized softplus function

Details

The softplus link function with parameter aa is given by:

log(1+exp(ax))a\displaystyle \frac{\log(1 + \exp(a \cdot x))}{a}

This is an approximation of the linear spline max{0,x}\max\{0, x\} where the discrepancy between the two functions decreases with increasing aa.

Wiemann et al. (2023) propose to employ the softplus function as the inverse link function where positivity of a parameter needs to be assured, e.g., in count data regressions. This is in particular of interest as an alternative to the exponential inverse link function because the exponential implies multiplicative effects of the regressors while the softplus function does not.

Value

An object of class "link-glm".

References

Wiemann PFV, Kneib T, Hambuckers J (2023). “Using the Softplus Function to Construct Alternative Link Functions in Generalized Linear Models and Beyond.” Statistical Papers, forthcoming. doi:10.1007/s00362-023-01509-x

See Also

make.link, gamlss2

Examples

## visualization of the softplus function from Wiemann et al. (2023, Figure 1)
x <- -200:200/50
plot(x, softplus(1)$linkinv(x), ylab = expression(softplus[a](x)),
  type = "l", col = 2, lwd = 2)
grid()
lines(x, softplus(5)$linkinv(x), col = 3, lwd = 2)
lines(x, softplus(10)$linkinv(x), col = 4, lwd = 2)
lines(x, pmax(0, x), lty = 3, lwd = 2)
legend("topleft", c("a = 1", "a = 5", "a = 10", "linear spline"),
  col = c(2, 3, 4, 1), lty = c(1, 1, 1, 3), lwd = 2, bty = "n")

## poisson regression example with different links
data("FIFA2018", package = "distributions3")
m_exp <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = "log"))
m_splus <- glm(goals ~ difference, data = FIFA2018, family = poisson(link = softplus(1)))
AIC(m_exp, m_splus)

## comparison of fitted effects
nd <- data.frame(difference = -15:15/10)
nd$mu_exp <- predict(m_exp, newdata = nd, type = "response")
nd$mu_splus <- predict(m_splus, newdata = nd, type = "response")
plot(mu_exp ~ difference, data = nd, ylab = expression(mu),
  type = "l", col = 4, lwd = 2, ylim = c(0, 2.5))
lines(mu_splus ~ difference, data = nd, col = 2, lwd = 2)
legend("topleft", c("exp", "softplus"), col = c(4, 2), lwd = 2, lty = 1, bty = "n")

Special Model Terms for GAMLSS

Description

The gamlss2 package provides an interface for special model terms whose fitting is not handled by the standard linear or smooth-term machinery. Examples include neural networks, trees, or forests. A special term is expected to provide its own fitting and prediction methods so that it can be updated within the RS or CG backfitting algorithms.

Usage

## generic fitting method
special_fit(x, ...)

## generic predict method
special_predict(x, ...)

## extractor function for fitted special terms
specials(object, model = NULL, terms = NULL, elements = NULL, ...)

Arguments

x

A model term object as supplied in the formula in the gamlss2 call.

object

A fitted gamlss2 object.

model

Character or integer, specifies the model for which fitted special terms should be extracted.

terms

Character or integer, specifies the special model terms that should be extracted.

elements

Character, specifies which elements of a fitted special term should be extracted. If elements = "names", the corresponding element names are extracted.

...

Arguments needed for the special_fit() function to facilitate the fitting of the model term, see the details. Similarly, for the special_predict() function, the ... argument encompasses the objects for computing predictions for the model term.

Details

To implement a new special term, define:

  1. a constructor function that is used inside the model formula,

  2. a corresponding special_fit() method,

  3. and a corresponding special_predict() method.

The constructor function should collect the information needed for fitting, such as the model formula, the variables, and any control arguments. If the constructor has a new special name, this name must be registered via the specials argument of fake_formula. In the example below, this is not necessary because "n" is already registered.

The following paragraphs describe the expected arguments and return values of the fitting and prediction methods.

A method for special_fit() has the following arguments:

  • x: The special model term object, containing all the data for fitting.

  • z: The current working response/residual from the backfitting step.

  • w: The current working weights from the backfitting step.

  • y: The response vector/matrix, e.g., used to evaluate the log-likelihood.

  • eta: The current named list of predictors.

  • j: Character, the parameter name for which the model term needs to be updated.

  • family: The family object of the model, see gamlss2.family.

  • control: A named list of control arguments, see gamlss2_control.

Only the first three arguments are required to set up a special model term; all remaining arguments are optional. The function must at least return a named list containing the component "fitted.values" in order to work with RS and CG. In practice, useful return components are:

  • "fitted.values": Required. Numeric vector of fitted contributions of the special term.

  • Additional fitted objects: Optional. For example, the fitted model object itself, degrees of freedom, or quantities needed for prediction.

  • "transfer": Optional. Information that should be passed from one backfitting iteration to the next.

A method for special_predict() has the following arguments:

  • x: Depending on the return value of function special_fit(), the fitted model term object, see the examples.

  • data: The data for which predictions should be computed.

  • se.fit: Logical, should standard errors of the predictions be computed.

The function special_predict() should return fitted contributions for the supplied data. The minimal valid return value is a numeric vector. If se.fit = TRUE, the method may instead return a data frame with a column named "fit" and optional columns such as "lower" and "upper" when interval information is available.

See Also

gamlss2, RS, gamlss2_control, gamlss2.family

Examples

## example special term for neural networks,
## the constructor function is used in the formula
## when calling gamlss2()
n <- function(formula, ...)
{
  stopifnot(requireNamespace("nnet"))

  ## list for setting up the special model term
  st <- list()

  ## list of control arguments
  ctr <- list(...)
  if(is.null(ctr$size))
    ctr$size <- 50
  if(is.null(ctr$maxit))
    ctr$maxit <- 1000
  if(is.null(ctr$decay))
    ctr$decay <- 0.1
  if(is.null(ctr$trace))
    ctr$trace <- FALSE
  if(is.null(ctr$MaxNWts))
    ctr$MaxNWts <- 10000
  if(is.null(ctr$scale))
    ctr$scale <- TRUE

  ## put all information together
  st$control <- ctr
  st$formula <- formula
  st$term <- all.vars(formula)
  st$label <- paste0("n(", paste0(gsub(" ", "", as.character(formula)), collapse = ""), ")")
  st$data <- model.frame(formula)

  ## scale per default!
  if(ctr$scale) {
    sx <- list()
    for(j in colnames(st$data)) {
      if(!is.factor(st$data[[j]])) {
        sx[[j]] <- range(st$data[[j]])
        st$data[[j]] <- (st$data[[j]] - sx[[j]][1]) / diff(sx[[j]])
      }
    }
    st$scalex <- sx
  }

  ## assign the "special" class and the new class "n"
  class(st) <- c("special", "n")

  return(st)
}

## set up the special "n" model term fitting function
special_fit.n <- function(x, z, w, control, ...)
{
  ## model formula needs to be updated
  .fnns <- update(x$formula, response_z ~ .)

  ## assign current working response
  x$data$response_z <- z
  x$data$weights_w <- w

  ## possible weights from last iteration
  Wts <- list(...)$transfer$Wts

  ## estimate model
  nnc <- parse(text = paste0('nnet::nnet(formula = .fnns, data = x$data, weights = weights_w,',
      'size = x$control$size, maxit = x$control$maxit, decay = x$control$decay,',
      'trace = x$control$trace, MaxNWts = x$control$MaxNWts, linout = TRUE',
      if(!is.null(Wts)) ', Wts = Wts)' else ')'))

  rval <- list("model" = eval(nnc))

  ## get the fitted.values
  rval$fitted.values <- predict(rval$model)

  ## transferring the weights for the next backfitting iteration
  ## note, "transfer" can be used to transfer anything from one
  ## iteration to the next
  rval$transfer <- list("Wts" = rval$model$wts)

  ## center fitted values
  rval$shift <- mean(rval$fitted.values)
  rval$fitted.values <- rval$fitted.values - rval$shift

  ## degrees of freedom
  rval$edf <- length(coef(rval$model))

  ## possible scaling
  rval$scalex <- x$scalex

  ## assign class for predict method
  class(rval) <- "n.fitted"

  return(rval)
}

## finally, the predict method
special_predict.n.fitted <- function(x, data, se.fit = FALSE, ...)
{
  if(!is.null(x$scalex)) {
    for(j in names(x$scalex)) {
      data[[j]] <- (data[[j]] - x$scalex[[j]][1]) / diff(x$scalex[[j]])
    }
  }
  p <- predict(x$model, newdata = data, type = "raw")
  p <- p - x$shift
  if(se.fit)
    p <- data.frame("fit" = p)
  return(p)
}

## Not run: ## example with data
data("film90", package = "gamlss.data")

## specify the model formula
f <- y ~ n(~x,decay=0.01) | . | . | .

## estimate model,
## set the seed for reproducibility
## note, data should be scaled!
set.seed(123)
m1 <- gamlss2(f, data = abdom, family = BCT)

## visualize estimated effects
plot(m1, which = "effects")

## plot diagnostics
plot(m1, which = "resid")

## predict quantiles
pq <- quantile(m1, probs = c(0.05, 0.5, 0.95))

## plot
plot(y ~ x, data = abdom, pch = 19,
  col = rgb(0.1, 0.1, 0.1, alpha = 0.3))
matlines(abdom$x, pq, lwd = 2,
  lty = 1, col = 4)

## another example using the Munich rent data
data("rent", package = "gamlss.data")

## model formula
f <- R ~ n(~Fl+A,size=10,decay=0.7) | .

## estimate model
set.seed(456)
m2 <- gamlss2(f, data = rent, family = GA)

## plot estimated effects
plot(m2, which = "effects", persp = FALSE)

## diagnostics
plot(m2, which = "resid")

## predict using new data
n <- 50
nd <- with(rent, expand.grid(
  "Fl" = seq(min(Fl), max(Fl), length = n),
  "A" = seq(min(A), max(A), length = n)
))

## compute median rent R estimate
nd$fit <- quantile(m2, newdata = nd, probs = 0.5)

## visualize
library("lattice")

p1 <- wireframe(fit ~ Fl + A, data = nd,
  screen = list(z = 50, x = -70, y = -10),
  aspect = c(1, 0.9), drape = TRUE,
  main = "n(~Fl+A)",
  xlab = "Floor", ylab = "YoC",
  zlab = "Rent")

p2 <- levelplot(fit ~ Fl + A, data = nd,
  contour = TRUE,
  main = "n(~Fl+A)", xlab = "Floor", ylab = "YoC")

print(p1, split = c(1, 1, 2, 1), more = TRUE)
print(p2, split = c(2, 1, 2, 1), more = FALSE)

## extract fitted special terms,
## fitted NN for parameter mu
specials(m2, model = "mu", elements = "model")

## same for sigma
specials(m2, model = "sigma", elements = "model")

## return element names of fitted special term list
specials(m2, model = "sigma", elements = "names")

## End(Not run)

Spirometry Measurements from NHANES 2007–2012

Description

Various spirometry measurements from the National Health and Nutrition Examination Survey (NHANES) 2007–2012 along with covariates providing demographics and basic body measurements.

Usage

data("SpirometryUS", package = "gamlss2")

Format

A data frame containing 16596 observations on 13 variables.

fvc

Numeric. Forced vital capacity (FVC) in liters, i.e., the volume of air that can forcibly be blown out after full inspiration.

fev1

Numeric. Forced expiratory volume in 1 second (FEV1) in liters, i.e., the volume of air that can forcibly be blown out in the first second, after full inspiration.

ratio

Numeric. Ratio of FEV1 to FVC.

pef

Numeric, peak expiratory flow (PEF) in liters per second, i.e., the maximal flow (or speed) achieved during the maximally forced expiration initiated at full inspiration.

fef

Numeric. Forced expiratory flow (FEF) in liters per second, i.e., the flow (or speed) of air coming out of the lung during the middle portion (25% to 75%) of a forced expiration.

volume

Numeric. Extrapolated volume.

fet

Numeric. Forced expiratory time (FET) in seconds, i.e., the length of the expiration.

gender

Factor. Binary gender information with levels female and male.

age

Numeric. Age in years (rounded to quarters).

weight

Numeric. Body weight in kilograms.

height

Numeric. Body height in centimeters.

bmi

Numeric. Body mass index in kilograms per meter-squared, rounded to 2 decimal places.

ethnicity

Factor. Self-reported race and ethnicity information with levels white, black, mexican American, other hispanic, and other (including multi-racial).

Details

In order to establish lung function reference equations, Zavorsky (2025) studies the dependence of three spirometry measurements (FVC, FEV1, and the FEV1/FVC ratio) on age, adjusted for height and weight and separately for females and males. He intends to show that a simple normally-distributed model with (piecewise) linear mean equation and (piecewise) constant variance suffices for obtaining an adequate distributional fit from which the 5% quantile can be obtained as the so-called lower limit of normal (LLN). Actually, his comparison with GAMLSS – using flexible predictors for both mean and variance along with a Box-Cox-transformed normal distribution – shows that GAMLSS leads to a similar fit for the mean but a much better fit for the LLN.

Zavorsky's (2025) analyses are based on a data set that he derived from the National Health and Nutrition Examination Survey (NHANES) in the United States 2007–2012. From the entire available data from https://wwwn.cdc.gov/nchs/nhanes/ he included those observations which met or exceeded the technical acceptability of the measurements for forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC). The data are described in a short communication published in the Data in Brief journal and the accompanying spreadsheet in CSV format (comma-separated values) is available from Mendeley Data.

The data comprises observations from NHANES' “Examination Data”, in particular in “Spirometry – Pre and Post-Bronchodilator” and “Body Measures”, plus accompanying “Demographics Data”. See the variable descriptions above for more details. Basic information about spirometry can be found for example in the Wikipedia at https://en.wikipedia.org/wiki/Spirometry.

Source

Zavorsky GS (2024). “Refined NHANES 2007–2012 Spirometry Dataset for the Comparison of Segmented (Piecewise) Linear Models to That of GAMLSS”, Mendeley Data, V1. doi:10.17632/dwjykg3xww.1

References

Zavorsky GS (2024). “A Refined Spirometry Dataset for Comparing Segmented (Piecewise) Linear Models to that of GAMLSS”. Data in Brief, 57, 111062. doi:10.1016/j.dib.2024.111062

Zavorsky GS (2025). “Debunking the GAMLSS Myth: Simplicity Reigns in Pulmonary Function Diagnostics”. Respiratory Medicine, 236, 107836. doi:10.1016/j.rmed.2024.107836

Examples

data("SpirometryUS", package = "gamlss2")
summary(SpirometryUS)

Stepwise Model Term Selection Using GAIC

Description

The optimizer function stepwise() performs stepwise model term selection using a Generalized Akaike Information Criterion (GAIC). Estimation is based on the Rigby and Stasinopoulos (RS) and Cole and Green (CG) algorithms, as implemented in function RS.

Usage

## wrapper function for stepwise GAMLSS estimation
step_gamlss2(formula, ..., K = 2,
  strategy = c("both.linear", "both"), keeporder = FALSE,
  cores = 1L)

## stepwise optimizer function
stepwise(x, y, specials, family, offsets,
  weights, start, xterms, sterms, control)

Arguments

formula

A model formula for gamlss2.

...

Arguments passed to gamlss2.

K

Numeric, the penalty for the GAIC.

strategy

Character, the strategy to be used by the stepwise algorithm. Possible options are "forward.linear", "forward", "backward", "backward.linear", "replace", "replace.linear", "both", "both.linear", "all" and "all.linear". See the details.

keeporder

Logical. For the different stepwise strategies, should the updates be performed sequentially according to the order of the parameters of the response distribution as specified in the family (see gamlss2.family), or should the selection search be performed across all parameters?

cores

Integer. If cores > 1L, mclapply is used to speed up computations within the selection steps.

x

The full model matrix to be used for fitting.

y

The response vector or matrix.

specials

A named list of special model terms, e.g., including design and penalty matrices for fitting smooth terms using smooth.construct.

family

A family object, see gamlss2.family.

offsets

If supplied, a list or data frame of model offsets.

weights

If supplied, a numeric vector of weights.

start

Starting values, either for the parameters of the response distribution or, if specified as a named list in which each element of length one is named with "(Intercept)", the respective intercepts are initialized. If starting values are specified as a named list, data frame or matrix, where each element/column is a vector with the same length as the number of observations in the data, the respective predictors are initialized with these. See the examples for gamlss2.

xterms

A named list specifying the linear model terms. Each named list element represents one parameter as specified in the family object.

sterms

A named list specifying the special model terms. Each named list element represents one parameter as specified in the family object.

control

Further control arguments as specified within the call of gamlss2.

Details

The wrapper function step_gamlss2() calls gamlss2 using the stepwise() optimizer function.

The stepwise algorithm can apply the following strategies:

  1. Each predictor must include an intercept.

  2. In a forward selection step, model terms with the highest improvement on the GAIC are selected.

  3. In a replacement step, each model term is tested to see if an exchange with a term not yet selected improves the GAIC.

  4. In a backward step, model terms are removed if this further improves the GAIC.

  5. In a bidirectional step, model terms can be either added or removed.

  6. In addition, the forward, backward and replace selection step can be combined.

The selected strategies are iterated until no further improvement is achieved.

The different strategies can be selected using argument strategy; see the examples. Possible values are "both", "forward", "backward", "replace", and "all". Here, strategy = "all" combines the forward, backward, and replace selection steps.

In addition, each of the steps 2-4 can be applied to linear model terms only, prior to performing the steps for all model terms. This can be done by additionally setting strategy to "both.linear", "forward.linear", "backward.linear", "replace.linear", or "all.linear".

The default is strategy = c("both.linear", "both") and keeporder = FALSE.

Note that each of the steps 2–4 can be performed while maintaining the order of the parameters of the response distribution, i.e., if the keeporder = TRUE argument is set, then the parameters will be updated in the order specified in the gamlss2.family. Using backward elimination, the model terms are deselected in reverse order.

Value

The optimizer function stepwise() returns the final model as a named list of class "gamlss2". See the return value of function RS. The wrapper function step_gamlss2() also returns the final model.

See Also

new_formula, gamlss2, gamlss2_control, RS

Examples

## Not run: ## load the Munich rent data
data("rent", package = "gamlss.data")

## because of possible linear interactions,
## scale the covariates first
rent$Fl <- scale(rent$Fl)
rent$A <- scale(rent$A)

## the formula defines the searching scope
f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) |
  Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A)

## estimate a Gamma model using the stepwise algorithm
m1 <- step_gamlss2(f, data = rent, family = GA, K = 2)

## same with
## m <- gamlss2(f, data = rent, family = GA, optimizer = stepwise, K = 2)
## show the new formula of selected model terms
new_formula(m1)

## final model summary
summary(m1)

## effect plots
plot(m1)

## diagnostic plots
plot(m1, which = "resid")

## plot GAIC
plot(m1, which = "selection")

## use forward linear, replace and backward strategy
m2 <- step_gamlss2(f, data = rent, family = GA, K = 2,
  strategy = c("forward.linear", "replace", "backward"))

## more complex model
## note, the third parameter
## nu does not include any model terms
f <- R ~ Fl + A + Fl:A + loc + s(Fl) + s(A) + te(Fl, A) |
  Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A) |
  1 |
  Fl + A + loc + Fl:A + s(Fl) + s(A) + te(Fl, A)

## model using the BCT family
m3 <- step_gamlss2(f, data = rent, family = BCT,
  K = 2, strategy = c("forward.linear", "both"),
  keeporder = TRUE)

## plot GAIC
plot(m3, which = "selection")

## End(Not run)