The package is designed to follow the workflow of well-established model fitting functions like lm()
or glm()
, i.e., the step of estimating full distributional regression models is actually not very difficult.
To illustrate the workflow using gamlss2
, we analyze the HarzTraffic
data, where we model the number of motorcycles (response bikes
) at Sonnenberg in the Harz region of Germany. The data can be loaded with
data("HarzTraffic")
head(HarzTraffic)
date yday bikes cars trucks others tempmin tempmax temp humidity
1 2021-01-01 0 2 3135 25 1 0.2 2.6 1.32 85
2 2021-01-02 1 7 6593 32 10 1.6 4.0 2.63 72
3 2021-01-03 2 0 3367 30 2 -0.4 1.2 0.19 94
4 2021-01-04 3 0 2186 75 0 -0.6 -0.1 -0.37 97
5 2021-01-05 4 3 2071 68 0 -0.6 0.5 -0.21 98
6 2021-01-06 5 1 2022 97 0 -0.2 0.5 0.17 93
tempdew cloudiness rain sunshine wind windmax
1 -0.93 99 1.2 50 1.38 3.7
2 -1.99 100 0.0 13 1.35 2.4
3 -0.65 98 13.5 0 1.74 2.8
4 -0.78 99 3.8 0 1.39 2.1
5 -0.46 98 5.3 0 1.42 2.1
6 -0.84 97 4.5 0 3.02 4.6
The data consists of seasonal time information (variable yday
) along with a number of environmental variables (e.g. mean daily temp
erature). As a first model, we estimate a linear regression model with normal errors (which is the default)
Note
Note that the use of the HarzTraffic
data here is for demonstration purpose only and should not be taken as a proper final analysis.
b1 <- gamlss2(bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
GAMLSS-RS iteration 1: Global Deviance = 14325.7146 eps = 0.046095
GAMLSS-RS iteration 2: Global Deviance = 14325.7146 eps = 0.000000
Call:
gamlss2(formula = bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
---
Family: NO
Link function: mu = identity, sigma = log
*--------
Parameter: mu
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.33601 19.67683 -0.779 0.4359
temp 17.68501 1.01429 17.436 < 2e-16 ***
rain -3.76399 1.74873 -2.152 0.0316 *
sunshine 0.36452 0.02995 12.172 < 2e-16 ***
wind -25.26905 4.38768 -5.759 1.11e-08 ***
*--------
Parameter: sigma
---
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.35765 0.02175 246.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 1057 df = 6 res.df = 1051
Deviance = 14325.7146 Null Dev. Red. = 4.61%
AIC = 14337.7146 elapsed = 0.01sec
Note that the summary output is very similar to lm()
and glm()
with the main difference being that summary outputs are provided for all parameters of the distribution. In this case, the model is estimated using the NO
family of the gamlss.dist package, a two-parameter distribution with parameters mu
and sigma
.
Residual Diagnostics
Since we estimated a simple linear model with Gaussian errors up to now, we are assuming that the distribution of the response variable, the number of motorcycles (bikes
), follows a normal distribution with constant variance. However, this assumption may not always hold true, especially when the response variable is count data, which often exhibits overdispersion or non-constant variance. The data also exhibits a strong seasonal effect that is likely not fully explained by the environmental variables alone. This effect may include nonlinear patterns that require further modeling for proper capture.
To assess whether the linear normal distribution with constant variance is appropriate, we can start by examining diagnostic plots.
These plots help us visually inspect the residuals for any deviations from the assumptions of normality and constant variance.
Estimating Nonlinear Effects
The gamlss2
package uses the mgcv infrastructures for estimating nonlinear smooth effects. Now, let’s inspect the seaonal aspect of the data, there
par(mar = c(4, 4, 1, 1))
plot(bikes ~ yday, data = HarzTraffic)
Clearly, the number of bikes increases during the summer season. Therefore, we add the seasonal component to the model using the s()
smooth constructor of the mgcv package. Moreover, since the variation of the number of bikes increases during the summer season, we now estimate a full GAMLSS and model also the variance parameter of the normal distribution by covariates
## set up the model formula for
## the mu and sigma parameter
## the vertical | separates the two formulae
f <- bikes ~ temp + rain + sunshine + wind + s(yday, bs = "cc") |
temp + rain + sunshine + wind + s(yday, bs = "cc")
## estimate model
b2 <- gamlss2(f, data = HarzTraffic)
GAMLSS-RS iteration 1: Global Deviance = 13299.0381 eps = 0.114458
GAMLSS-RS iteration 2: Global Deviance = 12599.4086 eps = 0.052607
GAMLSS-RS iteration 3: Global Deviance = 12016.9723 eps = 0.046227
GAMLSS-RS iteration 4: Global Deviance = 11573.6582 eps = 0.036890
GAMLSS-RS iteration 5: Global Deviance = 11473.4169 eps = 0.008661
GAMLSS-RS iteration 6: Global Deviance = 11469.6537 eps = 0.000328
GAMLSS-RS iteration 7: Global Deviance = 11469.5639 eps = 0.000007
Plot estimated seasonal effect.
The effect for both, mu
and sigma
show a clear seasonal peak during summer times.
Inspect again model residuals
plot(b2, which = "resid")
The quantile residuals show a much better model fit, but still show that the model might not be the most appropriate for predicting the number of bikes
.
Count Models
Now, instead of a normal distribution, we use the negative binomial distribution for count data
b3 <- gamlss2(f, data = HarzTraffic, family = NBI)
GAMLSS-RS iteration 1: Global Deviance = 9919.3001 eps = 0.168829
GAMLSS-RS iteration 2: Global Deviance = 9831.2958 eps = 0.008872
GAMLSS-RS iteration 3: Global Deviance = 9822.4038 eps = 0.000904
GAMLSS-RS iteration 4: Global Deviance = 9820.9376 eps = 0.000149
GAMLSS-RS iteration 5: Global Deviance = 9819.6601 eps = 0.000130
GAMLSS-RS iteration 6: Global Deviance = 9819.4848 eps = 0.000017
GAMLSS-RS iteration 7: Global Deviance = 9819.3533 eps = 0.000013
GAMLSS-RS iteration 8: Global Deviance = 9819.3298 eps = 0.000002
Plot again the estimated smooth seasonal effects.
Inspecting the model residuals again shows a major improvement.
plot(b3, which = "resid")
Rigby, R. A., and D. M. Stasinopoulos. 2005.
“Generalized Additive Models for Location, Scale and Shape.” Journal of the Royal Statistical Society C 54 (3): 507–54.
https://doi.org/10.1111/j.1467-9876.2005.00510.x.