First Steps

The package is designed to follow the workflow of well-established model fitting functions like lm() or glm(), i.e., the step of estimating full distributional regression models is actually not very difficult.

To illustrate the workflow using gamlss2, we analyze the HarzTraffic data, where we model the number of motorcycles (response bikes) at Sonnenberg in the Harz region of Germany. The data can be loaded with

data("HarzTraffic")
head(HarzTraffic)
        date yday bikes cars trucks others tempmin tempmax  temp humidity
1 2021-01-01    0     2 3135     25      1     0.2     2.6  1.32       85
2 2021-01-02    1     7 6593     32     10     1.6     4.0  2.63       72
3 2021-01-03    2     0 3367     30      2    -0.4     1.2  0.19       94
4 2021-01-04    3     0 2186     75      0    -0.6    -0.1 -0.37       97
5 2021-01-05    4     3 2071     68      0    -0.6     0.5 -0.21       98
6 2021-01-06    5     1 2022     97      0    -0.2     0.5  0.17       93
  tempdew cloudiness rain sunshine wind windmax
1   -0.93         99  1.2       50 1.38     3.7
2   -1.99        100  0.0       13 1.35     2.4
3   -0.65         98 13.5        0 1.74     2.8
4   -0.78         99  3.8        0 1.39     2.1
5   -0.46         98  5.3        0 1.42     2.1
6   -0.84         97  4.5        0 3.02     4.6

The data consists of seasonal time information (variable yday) along with a number of environmental variables (e.g. mean daily temperature). As a first model, we estimate a linear regression model with normal errors (which is the default)

Note

Note that the use of the HarzTraffic data here is for demonstration purpose only and should not be taken as a proper final analysis.

b1 <- gamlss2(bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
GAMLSS-RS iteration  1: Global Deviance = 14325.7146 eps = 0.046095     
GAMLSS-RS iteration  2: Global Deviance = 14325.7146 eps = 0.000000     
summary(b1)
Call:
gamlss2(formula = bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
---
Family: NO 
Link function: mu = identity, sigma = log
*--------
Parameter: mu 
---
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.33601   19.67683  -0.779   0.4359    
temp         17.68501    1.01429  17.436  < 2e-16 ***
rain         -3.76399    1.74873  -2.152   0.0316 *  
sunshine      0.36452    0.02995  12.172  < 2e-16 ***
wind        -25.26905    4.38768  -5.759 1.11e-08 ***
*--------
Parameter: sigma 
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.35765    0.02175   246.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 1057 df =  6 res.df =  1051
Deviance = 14325.7146 Null Dev. Red. = 4.61%
AIC = 14337.7146 elapsed =  0.01sec

Note that the summary output is very similar to lm() and glm() with the main difference being that summary outputs are provided for all parameters of the distribution. In this case, the model is estimated using the NO family of the gamlss.dist package, a two-parameter distribution with parameters mu and sigma.

1 Residual Diagnostics

Since we estimated a simple linear model with Gaussian errors up to now, we are assuming that the distribution of the response variable, the number of motorcycles (bikes), follows a normal distribution with constant variance. However, this assumption may not always hold true, especially when the response variable is count data, which often exhibits overdispersion or non-constant variance. The data also exhibits a strong seasonal effect that is likely not fully explained by the environmental variables alone. This effect may include nonlinear patterns that require further modeling for proper capture.

To assess whether the linear normal distribution with constant variance is appropriate, we can start by examining diagnostic plots.

plot(b1)

These plots help us visually inspect the residuals for any deviations from the assumptions of normality and constant variance.

2 Estimating Nonlinear Effects

The gamlss2 package uses the mgcv infrastructures for estimating nonlinear smooth effects. Now, let’s inspect the seaonal aspect of the data, there

par(mar = c(4, 4, 1, 1))
plot(bikes ~ yday, data = HarzTraffic)

Clearly, the number of bikes increases during the summer season. Therefore, we add the seasonal component to the model using the s() smooth constructor of the mgcv package. Moreover, since the variation of the number of bikes increases during the summer season, we now estimate a full GAMLSS and model also the variance parameter of the normal distribution by covariates

## set up the model formula for
## the mu and sigma parameter
## the vertical | separates the two formulae
f <- bikes ~ temp + rain + sunshine + wind + s(yday, bs = "cc") |
  temp + rain + sunshine + wind + s(yday, bs = "cc")

## estimate model
b2 <- gamlss2(f, data = HarzTraffic)
GAMLSS-RS iteration  1: Global Deviance = 13299.0381 eps = 0.114458     
GAMLSS-RS iteration  2: Global Deviance = 12599.4086 eps = 0.052607     
GAMLSS-RS iteration  3: Global Deviance = 12016.9723 eps = 0.046227     
GAMLSS-RS iteration  4: Global Deviance = 11573.6582 eps = 0.036890     
GAMLSS-RS iteration  5: Global Deviance = 11473.4169 eps = 0.008661     
GAMLSS-RS iteration  6: Global Deviance = 11469.6537 eps = 0.000328     
GAMLSS-RS iteration  7: Global Deviance = 11469.5639 eps = 0.000007     

Plot estimated seasonal effect.

plot(b2)

The effect for both, mu and sigma show a clear seasonal peak during summer times.

Inspect again model residuals

plot(b2, which = "resid")

The quantile residuals show a much better model fit, but still show that the model might not be the most appropriate for predicting the number of bikes.

3 Count Models

Now, instead of a normal distribution, we use the negative binomial distribution for count data

b3 <- gamlss2(f, data = HarzTraffic, family = NBI)
GAMLSS-RS iteration  1: Global Deviance = 9919.3001 eps = 0.168829     
GAMLSS-RS iteration  2: Global Deviance = 9831.2958 eps = 0.008872     
GAMLSS-RS iteration  3: Global Deviance = 9822.4038 eps = 0.000904     
GAMLSS-RS iteration  4: Global Deviance = 9820.9376 eps = 0.000149     
GAMLSS-RS iteration  5: Global Deviance = 9819.6601 eps = 0.000130     
GAMLSS-RS iteration  6: Global Deviance = 9819.4848 eps = 0.000017     
GAMLSS-RS iteration  7: Global Deviance = 9819.3533 eps = 0.000013     
GAMLSS-RS iteration  8: Global Deviance = 9819.3298 eps = 0.000002     

Plot again the estimated smooth seasonal effects.

plot(b3)

Inspecting the model residuals again shows a major improvement.

plot(b3, which = "resid")

Rigby, R. A., and D. M. Stasinopoulos. 2005. “Generalized Additive Models for Location, Scale and Shape.” Journal of the Royal Statistical Society C 54 (3): 507–54. https://doi.org/10.1111/j.1467-9876.2005.00510.x.