Package 'gamlss.prepdata'

Title: Prepering Data for Fitting a Generalized Additive Model for Location Scale and Shape
Description: Functions for prepering data to fit a Generalized Additive Models for Location Scale and Shape from the 'gamlss' or `gamlss2` package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using for graphical methods 'ggplot2'.
Authors: Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb]
Maintainer: Mikis Stasinopoulos <[email protected]>
License: GPL-2 | GPL-3
Version: 0.1.19
Built: 2026-06-02 10:51:08 UTC
Source: https://github.com/gamlss-dev/gamlss.prepdata

Help Index


Prepering Data for Fitting a Generalized Additive Model for Location Scale and Shape

Description

Functions for prepering data to fit a Generalized Additive Models for Location Scale and Shape from the 'gamlss' or 'gamlss2' package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using for graphical methods 'ggplot2'.

Details

The DESCRIPTION file:

Package: gamlss.prepdata
Type: Package
Title: Prepering Data for Fitting a Generalized Additive Model for Location Scale and Shape
Version: 0.1.19
Date: 2025-10-08
Authors@R: c(person("Mikis", "Stasinopoulos", role = c("aut", "cre", "cph"), email = "[email protected]", comment = c(ORCID = "0000-0003-2407-5704")), person("Robert", "Rigby", role = "aut", email = "[email protected]", comment = c(ORCID = "0000-0003-3853-1707")), person("Fernanda", "De Bastiani", role = "aut", email = "[email protected]", comment = c(ORCID = "0000-0001-8532-639X")), person("Julian", "Merder", role = "ctb") )
Description: Functions for prepering data to fit a Generalized Additive Models for Location Scale and Shape from the 'gamlss' or `gamlss2` package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using for graphical methods 'ggplot2'.
License: GPL-2 | GPL-3
URL: https://www.gamlss.com/
BugReports: https://github.com/gamlss-dev/gamlss.prepdata/issues
Depends: R (>= 4.1.0), gamlss.dist, gamlss (>= 4.3.3), gamlss.foreach
Imports: methods, ggridges, ellipse, gamlss.inf, foreach, mgcv, ggplot2, yaImpute, gamlss.ggplots, acepack
Suggests: glmnet, reshape2, igraph, networkD3, grid, gridExtra
LazyLoad: yes
Config/pak/sysreqs: cmake make libicu-dev libuv1-dev libssl-dev
Repository: https://gamlss-dev.r-universe.dev
Date/Publication: 2026-03-10 09:11:22 UTC
RemoteUrl: https://github.com/gamlss-dev/gamlss.prepdata
RemoteRef: HEAD
RemoteSha: 50496dc3866efe155493e4b2fcf49b4fe504a139
Author: Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb]
Maintainer: Mikis Stasinopoulos <[email protected]>

Index of help topics:

cor_perm_test           Pemutatione and Boostrap Tests for Paitwise
                        EReletionships
data_cor                Plotting pairwise linear and partial
                        correlations.
data_dim                Function to get information from data.
data_factor             Changing the Reference Level of Factors
data_inter              Identifying Pair-Wise Interactions in the Data
                        Frames
data_leverage           Getting the Leverage of All explanatory
                        variables
data_mcor               Functions to fit Maximal Correlation to Data
data_outliers           Outlier identification
data_part               A function to partition a data frame
data_rm                 Functions operating on variables in the data
data_scale              Scalling Continuous Variables in Data
data_str                Function applied to data
data_void               Finding the Percentage of Empty Spaces
data_xyplot             Plotting the response against the explanatory
                        variables
gamlss.prepdata-package
                        Prepering Data for Fitting a Generalized
                        Additive Model for Location Scale and Shape

The following convention has been used to name the functions:

y_NAME: plots concerning fitted values from a single fitted model

data_NAME: plots concerning residuals from a single fitted model

where NAME refer to different characteristics.

Author(s)

Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb]

Maintainer: Mikis Stasinopoulos <[email protected]>

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC, doi:10.1201/9780429298547. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, doi:10.18637/jss.v023.i07.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M. D., Rigby, R. A., and De Bastiani F., (2018) GAMLSS: a distributional regression approach, Statistical Modelling, Vol. 18, pp, 248-273, SAGE Publications Sage India: New Delhi, India.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

gamlss, gamlss.family

Examples

data(rent)
rent |> data_str()

Pemutatione and Boostrap Tests for Paitwise EReletionships

Description

There are several functions here;

cor_perm_test peerfoems a premutation test

Usage

cor_perm_test(x, y, data = NULL, B = 1000, seed = 123, 
  tail = c("one", "two"), fun = cor, ...)
  
cor_boot(x, y, data = NULL, B = 1000, seed = 123, 
  tail = c("one", "two"), fun = cor, ...)

Arguments

x

a coninuous variable

y

a coninuous variable

data

the data whete to find x and y

B

Th enumber of simulations or boostraps

seed

settinf the seed number

tail

one or two tail test

fun

the function to calculate the assosiation measeure between two contibuous variables

...

extar argument pass to the function, fun

Details

Those two function help to test pairwise associations. The function cor_perm_test() uses the Fisher's permutation test while cor_boot() using bootstraping.

Value

The function cor_perm_test() retunrs a R object called permutationTest with methods print and plot.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

data_cor(rent99)
pp <-  cor_perm_test(rent99$rent, rent99$rent)
pp
plot(pp)

Plotting pairwise linear and partial correlations.

Description

The function data_cor takes a data frame and plots the pairwise Pearson's correlation coefficients of all continuous variables in the data.

The function data_pcor takes a data frame and plots the pairwise partial Pearson's correlation coefficients of all continuous variables in the data.

The function data_association takes a data frame and plots the pairwise association coefficients of all variables in the data. For contituous against continuous variables it shows the absolute value of the Pearson's correlation coeficient, for categorical agaist categorical it shows Cramer's ϕ\phi, for continuous agaist categorigal it fit an analyis of variance model and reports the square root of the R2R^2.

The functions high_val and low_val take the square matrix generated by the above thee functions and shows which paire-wise have a value larger or smaller respectively, than the value specified by the argument by val.

Usage

data_cor(data,  digits = 3, plot = TRUE, diag.off = TRUE,
        lower.tri.off = FALSE, method = c("square", "circle"),
        type = c("pearson", "kendall", "spearman"),
        outline.color = "gray", colors = c("blue", "white", "red"),
        legend.title = "Corr", title, ggtheme = theme_minimal(),
        tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE,
        lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, 
        percentage,  print.info=TRUE)

data_pcor(data, digits = 3, plot = TRUE, diag.off = TRUE,
        lower.tri.off = FALSE, method = c("square", "circle"),
        outline.color = "gray", colors = c("blue", "white", "red"),
        legend.title = "Corr", title, ggtheme = theme_minimal(),
        tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE,
        lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, 
        percentage, print.info=TRUE)

data_association(data, digits = 3, plot = TRUE, diag.off = TRUE, 
        lower.tri.off = FALSE, method = c("square", "circle"), 
        outline.color = "gray", colors = c("blue", "white", "red"), 
        legend.title = "Assoc", title, ggtheme = ggplot2::theme_minimal(), 
        tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, 
        lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, 
        percentage, print.info=TRUE )

high_val(table, val = 0.90, digits = 3, plot = FALSE, igraph = TRUE)

low_val(table, val = 0.05, digits = 3, plot = FALSE, igraph = TRUE)

Arguments

data

a data frame

table

a correlation table obtained by data_cor(,plot=FALSE) or data_pcor(,plot=FALSE)

digits

the digits for printing the correlation coefficients

plot

whether to plot or not

diag.off

whether to show the diagonal ellements

lower.tri.off

whether to show the lower part of the matrix

method

plotting in "square" or "cicle"

type

type of correlation c("pearson", "kendall", "spearman")

outline.color

the outline colour

colors

the range of colours

legend.title

title for the legend

title

the main title

ggtheme

the theme for the plot, see package ggthemes for more themes

tl.cex

the text size for the marginal labels

tl.col

the colour of the he marginal labels

tl.srt

the angle of the text in the bottom labels of the table

lab

whether to show the correlation coefficients in the table

lab_col

the colour of the lettering of the correlation coefficients

lab_size

the size of the lettering of the correlation coefficients, increase (or decrease) if the defaul 3 is not appropriate

circle.size

the size of the circles, increase (or decrease) if the defaul 20 is not appropriate

percentage

this is for big data sets. if more tha a milion ony 10% is plotted, if from 100.00 to a milion, 20%, if 50.000 to 100.000, 50% otherwise 100% of the data.

seed

Setting a seed value for selection of the percantage of data (for big data sets)

print.info

whether to print infomation when cutting the data usinf data_cut()

val

the theshold value so if tha actul value is greater than val it will be reported in high_val()

igraph

if in high_val(), the option plot=TRUE is set, then there are two options for plotting; i) igraph::graph_from_data_frame() with igraph=TRUE or ii) networkD3::simpleNetwork() with igraph=FALSE

Value

creates a correlation matrix plot.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

mcor

Examples

data_cor(rent99)
Pearson.cor <-  data_cor(rent99, plot=FALSE)

data_pcor(rent99)
partial.cor  <-  data_pcor(rent99, plot=FALSE) 


high_val(partial.cor, val=0.5)
high_val(Pearson.cor, val=0.5)

Function to get information from data.

Description

This is a set of function are designed to help the user to deal with new data sets.

data_dim(): the class, the dimension and the % NA's in the data

data_na_vars(): which variables have NA's and how many

data_na_obs(): which observations have NA's

data_omit(): omit the NA's from the data.

data_names(): The names of the variables in the data.

data_shorter_names(): abbriviate the names up to specified digits.

data_rename() renames some of of the variables.

Usage

data_dim(data)

data_na_vars(data)

data_na_obs(data)

data_omit(data)

data_names(data)

data_shorter_names(data, max = 5, newnames)

data_rename(data, oldnames, newnames)

Arguments

data

a data frame

max

the maximum number of characters allowed, with default 5. Make sure that you are using enought characters otherwise you could end up with variables with the some name

newnames

New names if not abbreviation is required, as characters

oldnames

the old names as characters

Details

The function data_dim() gives the the class, the dimension and the % NA's in the data.

The function data_na_vars() gives the number of missing observation for each variable in the data.

The function data_omit(): omits the NA's from the data.

The function data_names() gives the names of the variables.

The function data_shorter_names() takes the current names and abbreviates to max characters.

The function data_rename() renames variable from the data.

Value

The function data_dim() after printing gives the originasl data set.

The function data_na_vars() prints the number of missing observation for each variable in the data and passes the original data set.

The function data_omit(): omits the NA's from the data and passes the new data set.

The function data_names() prints the names of the variables in the data andpasses the original data

The function data_shorter_names() takes the current names and abbreviates to max characters and return the data with shorter names.

Author(s)

Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

data_dim(rent)
data_na_vars(rent)
data_na_obs(rent)
data_omit(rent)
data_names(rent)
data_shorter_names(rent)
pp=data_rename(rent, c("R", "Fl"), c("rent", "floor"))
data_names(pp)

Changing the Reference Level of Factors

Description

Function to change the reference levels of factors. y_factor() takes only one factor data_factor() takes a data.frame.

Usage

y_factor(x, how = c("lower", "higher"))

data_factor(data, how = c("lower", "higher"))

Arguments

data

a data frame

x

a variable

how

which reference level, default the one woth fewer obsrvervesions

Value

A factor or a data.frame depending whether y_factor() or data_factor() is used.

Note

The idea here is that in model selection maybe we like the first level to be the weakest level so when we select levels the stonger level has a change if different. This is more likelily to be usfull in gamlss2() where the stepwise select levels rather than factors.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_str

Examples

levels(rent$B)
da<-data_factor(rent)
levels(rent$B)

Identifying Pair-Wise Interactions in the Data Frames

Description

The function data_inter() is trying to identify pair-wise interations given the response variable using linear regression methodology. At the moment it works only with continuous reponse variables.

Usage

data_inter(data, response, weights, digits = 3, plot = TRUE, 
        lower.tri.off = TRUE, 
        method = c("circle", "square"), fit.method = c("linear", "nonlinear"), 
        outline.color = "gray", colors = c("blue", "white", "red"), 
        legend.title = "Inter", title, ggtheme = theme_minimal(), 
        tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, 
        lab_col = "black", lab_size = 3, circle.size = 20,  
        percentage, seed = 123, print.info = TRUE)

Arguments

data

a data frame

response

the response variable

weights

prior weights

digits

the number of digits in the plot

plot

whether to plot the results

lower.tri.off

whether to show the lower part of the matrix

method

plotting in "square" or "cicle"

fit.method

whether in "linear" or "nonlinear"

outline.color

the outline colour

colors

the range of colours

legend.title

title for the legend

title

the main tittle

ggtheme

the theme for the plot, see package ggthemes for more themes

tl.cex

the text size for the marginal labels

tl.col

the colour of the he marginal labels

tl.srt

the angle of the text in the bottom labels of the table

lab

whether to show the correlation coefficients in the table

lab_col

the colour of the lettering of the correlation coefficients

lab_size

the size of the lettering of the correlation coefficients, increase (or decrease) if the defaul 3 is not appropriate

circle.size

the size of the circles, increase (or decrease) if the defaul 20 is not appropriate

print.info

whether to print infomation when cutting the data usinf data_cut()

percentage

the percentage of data to show if the observation number is too big

seed

Setting a seed value for selection of the percantage of data (for big data sets)

Details

The function data_inter() uses the funcion z_scores() to standarized the continuous response variable and then uses linear model fits to establish whether the first order interactions between the the x's are singificant or not. It reports the significant level based on Chi-square tests. Note that for large data sets it uses the function data_cut() to cut randomnly the size of the data in order to use ggplo2 graphs to plot it.

Typically for linear model first ortder interaction it fits the models y~x1+x2 and y~x1*x2, respectively, and calculated significant level based on the difference in deviances. Under the HoH_o hypothesis the difference in deviances follow be a Chi-square distribution with degrees of freedom based on the difference of the degrees of freedom of the two fitted models.

Value

It produce a plot plot=TRUE or a square upper triangular table.

Note

The function data_inter() works only for continuous responses.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

data_inter(rent[,-4,5], response=R)

Getting the Leverage of All explanatory variables

Description

The function data_leverage() is designed to identify observations in the explanatory variable space with high leverage values and, therefore, potential outliers.

Usage

data_leverage(data, response, weights, quantile.value = 0.99, annotate = TRUE, 
            line.col = "steelblue4", point.col = "steelblue4", 
            annot.col = "darkred", plot = TRUE, title, percentage, 
            seed = 123, print.info=TRUE, ...)

Arguments

data

The data.frame

response

The response variable (to be excuded)

weights

Prior weights if needed

quantile.value

The quantile values for the identification of high leverage. The default is quantile.value=0.99 so only the 1 percent of the high leverage are identifyied as outliers. This should be increase for large data

annotate

whether to annotete the the outliers

line.col

The color of the line

point.col

The colout of the plotting points]

annot.col

The colour of the annotate outliers

plot

whether to show the leverage plot

title

what title to put

percentage

what percentage of data to use in the plot

seed

The seed use to calculete the percentage

print.info

whether to print infomation when cutting the data usinf data_cut()

...

other arguments

Details

The function data_leverage() uses the linear model methodology to identify unusual observations as a group within the explnatory variables. It fit a linear model to all explanatory variables in the data, calculate the leverge points and plots them. It identifies one percent of the data as outliers. The line in the plot is calulated as 2(r/n)2(r/n) where rr is the number of explanatory variables and nn the number of obsrervations. 2(r/n)2(r/n) is given in the literature as the rule of thumb is that an observation is considered to have high leverage. In practice the value 2(r/n)2(r/n) is too low for indetification of outliers. Here we use quan.val=.99 which identify 1% of the obsrvation with high leverage.

Value

It plots the leverages plot=TRUE or identify outliers plot=FALSE.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_zscores,

Examples

data_leverage(rent99[,-c(2,9)],  response=rent)
rent99[, -c(2,9)] |> data_leverage( response=rent, plot=FALSE)

Functions to fit Maximal Correlation to Data

Description

The function data_mcor() is using either the function mcor() or the function cor_M() to find the non-linear maximal correlation between the continuous variabes of the data. Note that the function cor_M() is usinf the function ace() of the package acepack which is is faster for large data.

The function mcor() fits a single maximal correlation. It uses the function 'ACE().

The function ACE() it takes two variables and produce among other thing the maximal coefficient.

The function ACE.iter() it takes two variables and shows the fittings to acheive the maximal coefficient.

Usage

data_mcor(data, fun = cor_M, digits = 3, plot = TRUE, diag.off = TRUE, 
       lower.tri.off = FALSE, method = c("square", "circle"), 
       fit.method = c("P-splines", "loess"), 
       outline.color = "gray", colors = c("blue", "white", "red"), 
       legend.title = "Corr", title, 
       ggtheme = ggplot2::theme_minimal(), tl.cex = 12,
       tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", 
       lab_size = 3, circle.size = 20, ...)
       
mcor(x, y, data = NULL, fit.method = c("loess", "P-splines"), 
               nseg = 10, max.df = 6, ...)       

ACE(x, y, weights, data = NULL, con_crit = 0.01, 
        fit.method = c("loess", "P-splines"), 
        nseg = 10, max.df = 6, ...)   
        
ACE.iter(x, y, weights, data = NULL, con_crit = 0.001, 
         fit.method = c("loess", "P-splines"), 
         nseg = 10, max.df = 6, ...)

Arguments

data

a data frame

fun

the function mcor or cor_M

digits

how many digit to show in the correlation table

plot

whether to plot the results

diag.off

whether to show the diagonal entries

lower.tri.off

whether to show the lower triangular prt of the matrix

method

plotting in "square" or "cicle

fit.method

whether to use "loess" or "P-splines"

outline.color

the outline colour

colors

the range of colours

legend.title

title for the legend

title

the main title

ggtheme

the theme use by ggplot2 with default theme_minimal()

tl.cex

the text size for the marginal labels

tl.col

the colour of the he marginal labels

tl.srt

the angle of the text in the bottom labels of the table

lab

whether to show the correlation coefficients in the table

lab_col

the colour of the lettering of the correlation coefficients

lab_size

the size of the lettering of the correlation coefficients, increase (or decrease) if the defaul 3 is not appropriate

circle.size

the size of the circles, increase (or decrease) if the defaul 20 is not appropriate

x

a continuous variable

y

a continuous variable

weights

prior weights

con_crit

convergence critirion for "P-splines" fit

nseg

The number of breakpoints n the definition of "P-splines"

max.df

the maximal degree of freedom alowen if "P-splines" is fitted

...

for more arguments

Details

The function dsata_mcor() is using the function mcor() to find the non-linear maximum correlation between the continuous varaibes of the data.

Value

The function data_mcor() produve a table

The function mcor() produve the maxomal correlations

The function ACE() produve a object classn "ACE", with methods plot() ands print()

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

data_mcor(rent99)
max.cor <-  data_mcor(rent99, plot=FALSE)

high_val(max.cor, val=0.5)

Outlier identification

Description

Those functions idententify outliers in variables in the data.

The function data_outliers() takes a data.frame and applies the y_outliers() function on all its continuous variables.

The function y_outliers() takes one continuous variable and identify which observations could be classified as outliers. It does this for both z-scores (fitting a distribution and then taking the residuals) or quantile measures for example K times MAD) way from the median.

Te function y_outliers_both() takes one continuous variable and identify outliers using both the z-scores and the quantile method. It prints or the union or the intersect of the two resuls.

The function y_outliers_by() takes one continuous variable and a factor and indentify outliers within each level of the factor.

The function y_outliers_loop() takes one continuous variable and identify outliers by going through a loop first by weighing out obsrvarions and then refitting the model.

The function y_outliers_z() takes one continuous variable and possible a factor and identify outliers or by a single fit or by repeated fits were outliers observation from previous fits are removed (equivalent to y_outliers_loop())

Usage

data_outliers(data, value, min.distinct = 50, family = SHASHo, 
         type = c("zscores","quantile") )

y_outliers(x, value, family = SHASHo,  type = c("zscores","quantile"),
          transform = TRUE)

y_outliers_both(x, value, family = SHASHo, 
        method =c("intersect", "union"))

y_outliers_by(x, by, family = SHASHo, type = c("zscores", "quantile"))        
       
y_outliers_z(x, by, value, family = SHASHo, transform = TRUE, loop = FALSE)  

y_outliers_loop(x, value, family = SHASHo, type = c("zscores", "quantile"),
             transform = TRUE)

Arguments

data

a data frame

x

a continues variable

by

a factor partioning the data

loop

whether to lop for identify more ouliers

transform

whether to transfomed using a power transformation

value

max value from which the absolute value of the z-scores should be greater to identify outliers

min.distinct

if a variable has less distinct values than min.distinct is excluded

family

the distribution family used for standardization

type

for y_outliers() and data_outliers() the type of outlier detection "zscores" or "quantile"

method

for y_outliers() and data_outliers() the type of outlier detection "zscores" or "quantile"

for y_outliers_both() whether the "intersect" (default) or the "union" of the two sets will be used

Details

the continuous variables are power transforemed and then standartised

Value

return a list

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_names

Examples

da <- rent99[,-2]
data_outliers(da)

A function to partition a data frame

Description

The function data_part() takes a data.frame and creates a new identical data.frame with an extra factor called partition which can be used to allocate data in different data sets.

i) if the option of the function is partition=2L the factor has two levels train, and test.

ii) if the option of the function is partition=3L the factror has three levels train, for training data, val for validation data and test for test data.

iii) if the option of the function is partition > 4L say K then the levels of the factor are "1", "2"..."K". The factor then can be used to identify K-fold cross validation sets (up to K=20).

The function data_part_list() in parform a similar task like the function data_part() but instead of adding a factor to the data creates a list with ellements data.frames. Note that this function do allow K-fold cross-validation creation (up to 20 K-folds), see also the function data_Kfold().

The function data_boot_index() takes a data.frame and produces two list of length K, the in-bag list, IB, and the out-of-bag list, OOB.

The function data_boot_weights() takes a data.frame and produces a matrix of dimensions n x K which column of wich can be used as prior weight in a regression situation.

The function data_Kfold() takes a data.frame and produces a matrix of indeces which then can be used to fit diffetent sections of the data for cross validation.

The function data_cut() takes a data.frame and selects randomly specified % of the data. It is usually applied to graphical function =to reduce time for plotting; For data.frames with more than 50.000 observations is automatically select part of the data.

Usage

data_part(data, partition = 2L, probs, setseed = 123, ...)

data_part_list(data, partition = 2L, probs, setseed = 123, ...)

data_boot_index(data, B = 10, setseed = 123)

data_boot_weights()(data, B = 10, setseed = 123)

data_Kfold_index()(data, K = 10, setseed = 123)

data_Kfold_weights()(data, K = 10, setseed = 123)

data_cut(data, percentage, seed = 123, print.info = TRUE)

Arguments

data

a data.frame

partition

2, 3 or a number less than 20

K

the number of partitions, (maximum 20 for CV)

B

the number of bootstrap samples

probs

probabilities for the random selection

setseed

setting the sead so the proccess can be repeated

percentage

The percentage of data to keep. If set, i.e. percentage=0.5 only 50% are kept otherwise for large data set (more that 50.000) only percentage of data are kept.

seed

the set.seed() argument

print.info

whether to print infomation when cutting the data usinf data_cut()

...

extra arguments

Value

The functions data_part(), data_part_list(), data_boot_index(), data_boot_weights(), data_Kfold() produce a data.frame, lists or matrices to be later used for data partition during the fitting process. The function data_cut() randomly select part of the data.

Author(s)

Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_str

Examples

da <- data_part(rent)
head(da)
mosaicplot(table(da$partition))

da.train <- subset(da, partition=="train")
da.test <- subset(da, partition=="test")
dim(da.train)
dim(da.test)


allda <-  data_part_list(rent) 
dim(allda[[1]]) # training data
dim(allda[[2]]) # test data

Functions operating on variables in the data

Description

There are several function operating on a data.frame and export a data.frame. The functions are

1) data_rm(): this function removes the variables specified by vars from the data.frame. Note that vars can take either character names or numbers.

2) data_rm1val(): This function looks for varables with a unique distinct value (most likely factors left from a previous subset() operation) and remove them form the data.

3) data_exclude_class(): This function looks for variable (columns) of a specified 'R' class and remove them from the data. The default class is "factor".

4) data_only_continuous()": This function pick up only the continuous variable in the data.frame.

5) data_select()": This function select only the variables in the vars list and save the data.

Usage

data_rm(data, vars)

data_rm1val(data)

data_exclude_class(data, class.out = "factor")

data_only_continuous(data)

data_select(data, vars)

data_rmNAvars(data)

Arguments

data

a data frame

vars

selected variables (columns from the data frame)

class.out

a specific variable class to be excluded form the data frame

Details

All the above functions can be used for piping i.e. da |> data_rm1val().

Value

returm a data.frame

Author(s)

Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

library(gamlss)
da <- rent |> data_rm( vars=c("Sp", "Sm"))
head(da)

da<- rent |> data_exclude_class() 
head(da)

da<- data_only_continuous(rent)
head(da)

da <- rent |> data_select( vars=c("R", "Fl", "A"))
head(da)

Scalling Continuous Variables in Data

Description

The function data_scale() takes a data.frame and creates a new data set with all continous variable standarised. The standardization can be

i) mean 0 variance 1 which is equivalent to have the options scale.to="z-scores" and family="NO". That is the z-scores (residuals) after fitting a normal distribution to a continuous variable

ii) A more general z-score using say scale.to="z-scores" and family="SHASH" in which case correction to the skewness and kurtosis is done to the specified variable. Finaly,

iii) the range is resticted from zero to one, i.e. scale.to="0to1".

The function data_vars2data() creates a new dataframe where the continuous variables are remain the same or become polynomials and the factors becomes dummy variables.

The function data_formulae() takes a data.frame and creates four formulae

1) The first contains the response and all main effects of the variables in the data i.e. R~Fl+A+K+loc

2) The second contains the response and all first order interactions i.e R~(Fl+A+K+loc)^2

3) The third contains all main effects of the variables in the data with no response i.e. ~Fl+A+K+loc

2) The fourth contains all first order interactions with no response i.e ~(Fl+A+K+loc)^2

The function data_form2X takes a data.frame and a forrmula and creates a design matrix.

Usage

data_scale(data, response, position.response = NULL, 
          scale.to = c("z-scores", "0to1"), family = "NO", 
          scale.response = FALSE)
          
data_vars2data(data, response, exclude = NULL, 
          type = c("main.effect", "first.order"), 
          weights = NULL, nonlinear = FALSE, basis = "poly", 
          arg = 2)   

data_formulae(data, response)

data_form2X(data, formula, response, 
          scale.to = c("no", "z-scores", "0to1"), 
          family = NO)

Arguments

data

A data frame

response

The name of the response variable

position.response

or the position of the response variable in the data.

scale.to

how to scape by normalization, scale.to="z-scores" or range 0 to 1 scale.to="0to1".

family

The family used in the standarization, defaul is family="NO" but any other continuous family from Inf-Inf to +Inf+Inf will do i.e. family="SHASH".

scale.response

whether to scale also the response. The default value is scale.response=FALSE because in GAMLSS we are hoping to get the right family for the response.

exclude

which variable to exclude

type

whether the main effects only ot dummies for first order interactions also

weights

Prior weights when the matrix is created

nonlinear

Im not sure what it does?

basis

what basis should be used for non linearities, deault is poly()

arg

the argument for the basis i.e 3

formula

the formula to create the design matrix

Value

A data frame is return with all continous variables standarised.

Author(s)

Mikis Stasinopolos

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_zscores

Examples

rent[, -c(4,5)] |> data_scale(, response=R)|> head()


rent[, -c(4,5)] |> data_vars2data( response=R, nonlinear = TRUE, 
                  basis = "poly", arg=3)  -> D
D |> head()
D |> str()

ff <- data_formulae(D, response=R)
ff

head(data_form2X(D, formula=ff[[1]], response=R))

Function applied to data

Description

his is a set of function are designed to help the user to deal with the structure of new data sets.

Usage

data_str(data, min.values = 100, min.levels = 10)

y_distinct(var)

data_distinct(data, get.distinct = FALSE, print=TRUE)

data_cha2fac(data, show.str = FALSE)

data_few2fac(data, max.levels = 5, show.str = FALSE)

data_int2num(data, min.values = 50, show.str = FALSE)

data_fac2num(data, vars)

Arguments

data

a data frame

min.values

the minimal value distinct values before warning

max.levels

the maximum value for distinct values in the variable

min.levels

the minimal value distinct levels befor warning

var

a vector

vars

a character vector with names from the data

get.distinct

TRUE if you need to save the values FALSE if not not

show.str

whether to show the structure

print

TRUE or FALSE

Details

The function data_str() gives the structure of the data set.

The function data_distinct() gives the distinct values of the vectors in the data set

The function y_distinct() gives the distinct values of single vector

The function data_cha2fac() tranforms all character vectors to factors

The function data_few2fac() transform all vectors with fewer values than min.levels into factors

The function data_int2num() transform all integer vectors with more values than min.values into numeric

The function data_fac2num() transform sellected variables factors into numeric vectors

Author(s)

Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_dim

Examples

data_str(rent)
data_distinct(rent)
data_cha2fac(rent)
data_few2fac(rent)
data_int2num(rent)

Finding the Percentage of Empty Spaces

Description

The function void() is looking for the % of empty spaces in the direction of two variables x and y.

The function data_void() is looking pair-wise for empty spaces in all the continuous variables in the data set.

Usage

data_void(data, digits = 3, plot = TRUE, diag.off = TRUE, 
         lower.tri.off = FALSE, 
         method = c("square", "circle"), 
         outline.color = "gray", colors = c("blue", "white", "red"), 
         legend.title = "Void", title, ggtheme = ggplot2::theme_minimal(), 
         tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, 
         lab_col = "black", lab_size = 3, circle.size = 20, seed = 123,
         percentage, print.info = TRUE)

void(x, y, plot = TRUE, print = TRUE, table.length)

Arguments

data

A data frame

digits

the digits for printing the correlation coefficients

plot

whether to plot or not

diag.off

whether to show the diagonal ellements

lower.tri.off

whether to show the lower part of the matrix

method

plotting in "square" or "cicle"

outline.color

the outline colour

colors

the range of colours

legend.title

title for the legend

title

the main tittle

ggtheme

the theme for the plot, see package ggthemes for more themes

tl.cex

the text size for the marginal labels

tl.col

the colour of the he marginal labels

tl.srt

the angle of the text in the bottom labels of the table

lab

whether to show the correlation coefficients in the table

lab_col

the colour of the lettering of the correlation coefficients

lab_size

the size of the lettering of the correlation coefficients, increase (or decrease) if the defaul 3 is not appropriate

circle.size

the size of the circles, increase (or decrease) if the defaul 20 is not appropriate

seed

the set.seed() value

percentage

the percentage of data to show if the observation number is too big

print.info

whether to print infomation when cutting the data usinf data_cut()

x

the first variable in void()

y

the second variable in void()

print

whether to print the results

table.length

the table length (if siging is calculated automatically)

Details

The functions void() and data_void() work with discretising the data in the x and y direction and then calculate the % of zeros. By discretising the data we mean cut both variable x variables abd y, at an equal spaced grid of k points and create a (k x k) dimenstional matrix containing the number of data points in the grid. The problem thought, with any attempt to calculated the % of empty spaces is that by increasing k) in the x and y directions would resulst more zeros cells and therefore more % empty spaces. To avoid this we need a way to stop the discretazation at a stage before the data become too sparce. The waythis is done in tjhe current function is the following;

i) If the n points (x,y) are randomly allocated we would expect the number of counts in the cells of the matrix of a discretised two dimestional data set to be Poisson distributed with a probability for zeros equal to exp(μ)exp(-\mu) where μ\mu is the mean of the Poisson distribution. That is, under the null hypothesis that the n points are spead randomly we expect some of the cell to be zero with probability exp(μ)exp(-\mu). Given n the number of obsrvations, we can use this information to find out at which disretation point k we should stop.

ii) To identify at which stage k we should stop for given number of observations say n, we have genarated randomly from a uniform distribution n values for x and y. We use those values to calculate at which point k this will give a probability of zero close to 0.05. We calculate those probabilities using exp(xbar)exp(-xbar) where xbar is the mean of the cells. By doing this we found that that the following is holding; logk=0.516+0.498logn\log k= -0.516+ 0.498 \log n. This equation provide us with an easy way to calculate k given n.

Value

It produce a value between zero and 1.

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_cor

Examples

void(rent$A, rent$Fl)
data_void(rent)

Plotting the response against the explanatory variables

Description

The function data_xyplot() plots the response against all other variables in a given data set.

The function data_plot() plots all variables individually.

The function data_bucket() plots the bucket plot for all continuous variables.

The function data_zscores() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for all continuous variables.

The function y_zscores() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for a single variable.

The function data_response() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for the response variable.

Usage

data_xyplot(data, response, point.size = 0.5, nrow = NULL,
           print.info = TRUE,
           ncol = NULL, percentage, seed = 123,  max.levels = 10,
           plots.per.page = 9, one.by.one = FALSE, title, text.x.angle = 0, ...)

data_plot(data, value = 3, hist.col = "black", hist.fill = "white",
           dens.fill = "#FF6666", nrow = NULL, ncol = NULL,
           percentage, seed = 123, print.info = TRUE,
           plot.hist = TRUE, plots.per.page = 9,
           one.by.one = FALSE,
           title, ...)
           
data_bucket(data, value = 3, max.levels = 20,
           nrow = NULL, ncol = NULL, plots.per.page = 9,
           one.by.one = FALSE, title, percentage, seed = 123, 
           print.info = TRUE,
           ...)

y_zscores(x, weights, family = SHASHo, value  = 3, plot = TRUE, 
            hist = FALSE,  transform = FALSE, ...)

data_zscores(data, plot = TRUE, hist=FALSE, value = 3, family = SHASHo,
          max.levels = 10, hist.col = "black", hist.fill = "white",
          dens.fill = "#FF6666", nrow = NULL, ncol = NULL,
          plots.per.page = 9, one.by.one = FALSE, title, print.info = TRUE,
          percentage, seed = 123,...)

data_response(data, response, plot = TRUE, 
             percentage, seed = 123, print.info = TRUE)

Arguments

data

a data frame

x

a single variable

weights

prior weights

transform

ehether to use a power transformation ot not

family

a gamlss distribution family (continuous)

response

the respose variable should be in the data

point.size

the size of points in scatter plots

nrow

the number of rows in the plot

ncol

the number of columns in the plot

plots.per.page

maximu plots per page

one.by.one

whether plotted individually

value

value to identify outliers if y_dots is used i.e. for upper tail an outliers is if it is greater than Q_3+value*IQ

hist.col

the colour of lines of the histogram, if plot.hist=TRUE

hist.fill

the colour of the histogram, if plot.hist=TRUE

dens.fill

the color of the density plot, if plot.hist=TRUE

plot.hist

whether to use y_dots() or y_hist() for the continuous variables

plot

whether to plot

hist

whether histiogram or dot plot

max.levels

excludes from plotting bucket plots for variables with less than max.levels, distinct values

title

title of the plot

percentage

if set, i.e. 0.50, plots a portotion of data otherwise for big data sets greater than 50.000 observartions it plots a porpotion

seed

the set.seed() argument

print.info

whether to print infomation when cutting the data usinf data_cut()

text.x.angle

how the text in the x-axis is printed (helping if say factors have a lot of levels). It can be a signle number of a vector. In both cases it will expand as a vector with length the number of explanatory variables. Therefore for full control a vector of the same length as the number of x-variables should be given.)

...

other arguments

Details

The function data_xyplot() it takes a data frame and plot all the explanarory variables against the response.

The function data_plot() it takes a data frame and plot all variables against the response. The continuous are plotted using y_dots() or y_hist() while the factors and integer as bar plots.

Value

Plots of the data

Author(s)

Mikis Stasinopoulos

References

Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.

(see also https://www.gamlss.com/).

See Also

data_names

Examples

da <- rent99[,-2]
data_xyplot(da, rent)
data_plot(da)
y_zscores(da$rent)
data_response(da, response=rent)