| Title: | Preparing Data for Fitting a Generalized Additive Model for Location Scale and Shape |
|---|---|
| Description: | Functions for preparing data to fit Generalized Additive Models for Location Scale and Shape from the 'gamlss' or 'gamlss2' package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using 'ggplot2' for graphical methods. |
| Authors: | Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb] |
| Maintainer: | Mikis Stasinopoulos <[email protected]> |
| License: | GPL-2 \| GPL-3 |
| Version: | 0.1.19 |
| Built: | 2026-06-02 10:51:08 UTC |
| Source: | https://github.com/gamlss-dev/gamlss.prepdata |
Functions for preparing data to fit Generalized Additive Models for Location Scale and Shape from the 'gamlss' or 'gamlss2' package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using 'ggplot2' for graphical methods.
The DESCRIPTION file:
| Package: | gamlss.prepdata |
| Type: | Package |
| Title: | Preparing Data for Fitting a Generalized Additive Model for Location Scale and Shape |
| Version: | 0.1.19 |
| Date: | 2025-10-08 |
| Authors@R: | c(person("Mikis", "Stasinopoulos", role = c("aut", "cre", "cph"), email = "[email protected]", comment = c(ORCID = "0000-0003-2407-5704")), person("Robert", "Rigby", role = "aut", email = "[email protected]", comment = c(ORCID = "0000-0003-3853-1707")), person("Fernanda", "De Bastiani", role = "aut", email = "[email protected]", comment = c(ORCID = "0000-0001-8532-639X")), person("Julian", "Merder", role = "ctb") ) |
| Description: | Functions for preparing data to fit Generalized Additive Models for Location Scale and Shape from the 'gamlss' or 'gamlss2' package, Stasinopoulos and Rigby (2007) <doi:10.18637/jss.v023.i07>, using 'ggplot2' for graphical methods. |
| License: | GPL-2 \| GPL-3 |
| URL: | https://www.gamlss.com/ |
| BugReports: | https://github.com/gamlss-dev/gamlss.prepdata/issues |
| Depends: | R (>= 4.1.0), gamlss.dist, gamlss (>= 4.3.3), gamlss.foreach |
| Imports: | methods, ggridges, ellipse, gamlss.inf, foreach, mgcv, ggplot2, yaImpute, gamlss.ggplots, acepack |
| Suggests: | glmnet, reshape2, igraph, networkD3, grid, gridExtra |
| LazyLoad: | yes |
| Config/pak/sysreqs: | cmake make libicu-dev libuv1-dev libssl-dev |
| Repository: | https://gamlss-dev.r-universe.dev |
| Date/Publication: | 2026-03-10 09:11:22 UTC |
| RemoteUrl: | https://github.com/gamlss-dev/gamlss.prepdata |
| RemoteRef: | HEAD |
| RemoteSha: | 50496dc3866efe155493e4b2fcf49b4fe504a139 |
| Author: | Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb] |
| Maintainer: | Mikis Stasinopoulos <[email protected]> |
Index of help topics:
cor_perm_test Permutation and Bootstrap Tests for Pairwise
Relationships
data_cor Plotting pairwise linear and partial
correlations.
data_dim Function to get information from data.
data_factor Changing the Reference Level of Factors
data_inter Identifying Pair-Wise Interactions in the Data
Frames
data_leverage Getting the Leverage of All explanatory
variables
data_mcor Functions to fit Maximal Correlation to Data
data_outliers Outlier identification
data_part A function to partition a data frame
data_rm Functions operating on variables in the data
data_scale Scaling Continuous Variables in Data
data_str Function applied to data
data_void Finding the Percentage of Empty Spaces
data_xyplot Plotting the response against the explanatory
variables
gamlss.prepdata-package
Preparing Data for Fitting a Generalized
Additive Model for Location Scale and Shape
The following convention has been used to name the functions:
y_NAME: functions applied to a single variable
data_NAME: functions applied to all relevant variables of a data frame
where NAME refers to different characteristics.
Mikis Stasinopoulos [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2407-5704>), Robert Rigby [aut] (ORCID: <https://orcid.org/0000-0003-3853-1707>), Fernanda De Bastiani [aut] (ORCID: <https://orcid.org/0000-0001-8532-639X>), Julian Merder [ctb]
Maintainer: Mikis Stasinopoulos <[email protected]>
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC, doi:10.1201/9780429298547. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, doi:10.18637/jss.v023.i07.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M. D., Rigby, R. A., and De Bastiani F., (2018) GAMLSS: a distributional regression approach, Statistical Modelling, Vol. 18, pp, 248-273, SAGE Publications Sage India: New Delhi, India.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data(rent)
rent |> data_str()
There are several functions here:
cor_perm_test() performs a permutation test.
cor_perm_test(x, y, data = NULL, B = 1000, seed = 123, tail = c("one", "two"), fun = cor, ...)
cor_boot(x, y, data = NULL, B = 1000, seed = 123, tail = c("one", "two"), fun = cor, ...)
| x | a continuous variable |
|---|---|
| y | a continuous variable |
| data | the data frame in which x and y are to be found |
| B | the number of simulations or bootstraps |
| seed | the seed number, set for reproducibility |
| tail | the tail of the test, "one" or "two" |
| fun | the function used to calculate the association measure between two continuous variables |
| ... | extra arguments passed to the function |
These two functions help to test pairwise associations. The function cor_perm_test()
uses Fisher's permutation test, while cor_boot() uses bootstrapping.
The function cor_perm_test() returns an R object called permutationTest, with print and plot methods.
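The idea behind a permutation test of association can be sketched in base R. The helper below, `perm_cor_test()`, is a hypothetical name and not the package's implementation; it is two-sided only, whereas `cor_perm_test()` also offers a `tail` argument.

```r
# Hypothetical helper (not the package's cor_perm_test()): a bare-bones
# permutation test for the association between two continuous variables.
perm_cor_test <- function(x, y, B = 1000, seed = 123, fun = cor) {
  set.seed(seed)
  observed <- fun(x, y)
  # Shuffling y breaks any link with x, giving draws under the null hypothesis
  permuted <- replicate(B, fun(x, sample(y)))
  # Two-sided p-value with the usual +1 correction
  p_value <- (sum(abs(permuted) >= abs(observed)) + 1) / (B + 1)
  list(observed = observed, permuted = permuted, p.value = p_value)
}

# Simulated data with a genuine linear association
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
res <- perm_cor_test(x, y, B = 500)
res$observed
res$p.value
```

A bootstrap version would resample the (x, y) pairs with replacement instead of shuffling y, which approximates the sampling distribution of the statistic rather than its null distribution.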
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_cor
data_cor(rent99)
pp <- cor_perm_test(rent99$rent, rent99$rent)
pp
plot(pp)
The function data_cor takes a data frame and plots the pairwise Pearson's correlation coefficients of all continuous variables in the data.
The function data_pcor takes a data frame and plots the pairwise partial Pearson's correlation coefficients of all continuous variables in the data.
The function data_association takes a data frame and plots the pairwise association coefficients of all variables in the data. For continuous against continuous variables it shows the absolute value of Pearson's correlation coefficient; for categorical against categorical it shows Cramér's V; for continuous against categorical it fits an analysis of variance model and reports the square root of the R-squared.
The functions high_val and low_val take the square matrix generated by the above three functions and show which pairs have a value larger or smaller, respectively, than the value specified by the argument val.
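As a rough illustration of what thresholding a correlation table involves, the base R sketch below lists the variable pairs above a cut-off. The function name `pairs_above()` is hypothetical, the comparison on absolute values is an assumption made here, and the built-in `mtcars` data is used only for convenience.

```r
# Hypothetical stand-in for high_val(): list the variable pairs whose
# absolute correlation exceeds a threshold.
pairs_above <- function(table, val = 0.9, digits = 3) {
  # Only the upper triangle is scanned, so each pair is reported once
  idx <- which(upper.tri(table) & abs(table) > val, arr.ind = TRUE)
  data.frame(
    var1  = rownames(table)[idx[, "row"]],
    var2  = colnames(table)[idx[, "col"]],
    value = round(table[idx], digits)
  )
}

cc <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
pairs_above(cc, val = 0.85)
```

A `low_val()` analogue would simply reverse the comparison to `abs(table) < val`.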
data_cor(data, digits = 3, plot = TRUE, diag.off = TRUE, lower.tri.off = FALSE, method = c("square", "circle"), type = c("pearson", "kendall", "spearman"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Corr", title, ggtheme = theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, percentage, print.info = TRUE)
data_pcor(data, digits = 3, plot = TRUE, diag.off = TRUE, lower.tri.off = FALSE, method = c("square", "circle"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Corr", title, ggtheme = theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, percentage, print.info = TRUE)
data_association(data, digits = 3, plot = TRUE, diag.off = TRUE, lower.tri.off = FALSE, method = c("square", "circle"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Assoc", title, ggtheme = ggplot2::theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, percentage, print.info = TRUE)
high_val(table, val = 0.90, digits = 3, plot = FALSE, igraph = TRUE)
low_val(table, val = 0.05, digits = 3, plot = FALSE, igraph = TRUE)
| data | a data frame |
|---|---|
| table | a correlation table obtained by data_cor(), data_pcor() or data_association() |
| digits | the number of digits for printing the correlation coefficients |
| plot | whether to plot or not |
| diag.off | whether to show the diagonal elements |
| lower.tri.off | whether to show the lower part of the matrix |
| method | plotting in "square" or "circle" |
| type | the type of correlation: "pearson", "kendall" or "spearman" |
| outline.color | the outline colour |
| colors | the range of colours |
| legend.title | title for the legend |
| title | the main title |
| ggtheme | the theme for the plot, see package ggthemes for more themes |
| tl.cex | the text size for the marginal labels |
| tl.col | the colour of the marginal labels |
| tl.srt | the angle of the text in the bottom labels of the table |
| lab | whether to show the correlation coefficients in the table |
| lab_col | the colour of the lettering of the correlation coefficients |
| lab_size | the size of the lettering of the correlation coefficients; increase (or decrease) it if the default 3 is not appropriate |
| circle.size | the size of the circles; increase (or decrease) it if the default 20 is not appropriate |
| percentage | this is for big data sets: with more than a million observations only 10% is plotted; from 100,000 to a million, 20%; from 50,000 to 100,000, 50%; otherwise 100% of the data |
| seed | the seed value for the selection of the percentage of data (for big data sets) |
| print.info | whether to print information when cutting the data using data_cut() |
| val | the threshold value: pairs whose value is larger (high_val()) or smaller (low_val()) than val are shown |
| igraph | if in |
creates a correlation matrix plot.
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
mcor
data_cor(rent99)
Pearson.cor <- data_cor(rent99, plot=FALSE)
data_pcor(rent99)
partial.cor <- data_pcor(rent99, plot=FALSE)
high_val(partial.cor, val=0.5)
high_val(Pearson.cor, val=0.5)
This set of functions is designed to help the user deal with new data sets.
data_dim(): the class, the dimensions and the percentage of NA's in the data.
data_na_vars(): which variables have NA's and how many.
data_na_obs(): which observations have NA's.
data_omit(): omits the NA's from the data.
data_names(): the names of the variables in the data.
data_shorter_names(): abbreviates the names up to a specified number of characters.
data_rename(): renames some of the variables.
data_dim(data)
data_na_vars(data)
data_na_obs(data)
data_omit(data)
data_names(data)
data_shorter_names(data, max = 5, newnames)
data_rename(data, oldnames, newnames)
| data | a data frame |
|---|---|
| max | the maximum number of characters allowed, with default 5. Make sure that you use enough characters, otherwise several variables could end up with the same name |
| newnames | the new names, as characters, if abbreviation is not required |
| oldnames | the old names, as characters |
The function data_dim() gives the class, the dimensions and the percentage of NA's in the data.
The function data_na_vars() gives the number of missing observations for each variable in the data.
The function data_omit() omits the NA's from the data.
The function data_names() gives the names of the variables.
The function data_shorter_names() takes the current names and abbreviates them to max characters.
The function data_rename() renames variables in the data.
The function data_dim(), after printing, returns the original data set.
The function data_na_vars() prints the number of missing observations for each variable in the data and passes on the original data set.
The function data_omit() omits the NA's from the data and passes on the new data set.
The function data_names() prints the names of the variables in the data and passes on the original data.
The function data_shorter_names() takes the current names, abbreviates them to max characters and returns the data with the shorter names.
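For orientation, the same kind of information can be obtained with base R alone. The sketch below uses a small made-up data frame (its names and values are illustrative) and base functions only; it does not call the package functions and its output format will differ from theirs.

```r
# Made-up data frame with some missing values
df <- data.frame(
  temperature = c(21.5, NA, 19.0),
  location    = c("north", "south", NA),
  identifier  = 1:3
)

dim(df)                                       # dimensions of the data
pct_na       <- 100 * mean(is.na(df))         # overall percentage of NA's
na_per_var   <- colSums(is.na(df))            # number of NA's in each variable
rows_with_na <- which(!complete.cases(df))    # observations containing NA's
clean        <- na.omit(df)                   # data with incomplete rows removed
short_names  <- abbreviate(names(df), minlength = 5)  # shortened variable names
```

Because the package functions pass the data on after printing, they can be chained with the pipe, which these base calls do not support directly.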
Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_dim(rent)
data_na_vars(rent)
data_na_obs(rent)
data_omit(rent)
data_names(rent)
data_shorter_names(rent)
pp <- data_rename(rent, c("R", "Fl"), c("rent", "floor"))
data_names(pp)
Functions to change the reference levels of factors. y_factor() takes a single factor; data_factor() takes a data.frame.
y_factor(x, how = c("lower", "higher"))
data_factor(data, how = c("lower", "higher"))
| data | a data frame |
|---|---|
| x | a variable |
| how | which level to use as the reference; the default is the one with the fewest observations |
A factor or a data.frame depending whether y_factor() or data_factor() is used.
The idea here is that, in model selection, we may like the first (reference) level to be the weakest level, so that when levels are selected the stronger levels have a chance to enter if they are different. This is more likely to be useful in gamlss2(), where the stepwise procedure selects levels rather than factors.
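Changing the reference level according to level frequency can be sketched with base R's `relevel()`. The helper name `relevel_by_count()` is hypothetical, and mapping "lower" to the least frequent level is an assumption based on the description of `how` above.

```r
# Hypothetical helper mirroring the idea of y_factor(x, how = "lower"):
# make the least (or most) frequent level the reference level.
relevel_by_count <- function(x, how = c("lower", "higher")) {
  how <- match.arg(how)
  stopifnot(is.factor(x))
  counts <- table(x)
  pick <- if (how == "lower") which.min(counts) else which.max(counts)
  relevel(x, ref = names(counts)[pick])
}

f <- factor(c("a", "a", "a", "b", "c", "c"))
levels(f)                               # "a" "b" "c"
levels(relevel_by_count(f))             # "b" "a" "c"
levels(relevel_by_count(f, "higher"))   # "a" "b" "c"
```

Applying such a helper to every factor column of a data frame, for example with `lapply()`, would give the data-frame-level behaviour.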
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
levels(rent$B)
da <- data_factor(rent)
levels(da$B)
The function data_inter() tries to identify pairwise interactions, given the response variable, using linear regression methodology. At the moment it works only with continuous response variables.
data_inter(data, response, weights, digits = 3, plot = TRUE, lower.tri.off = TRUE, method = c("circle", "square"), fit.method = c("linear", "nonlinear"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Inter", title, ggtheme = theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, percentage, seed = 123, print.info = TRUE)
| data | a data frame |
|---|---|
| response | the response variable |
| weights | prior weights |
| digits | the number of digits in the plot |
| plot | whether to plot the results |
| lower.tri.off | whether to show the lower part of the matrix |
| method | plotting in "circle" or "square" |
| fit.method | the fitting method, "linear" or "nonlinear" |
| outline.color | the outline colour |
| colors | the range of colours |
| legend.title | title for the legend |
| title | the main title |
| ggtheme | the theme for the plot, see package ggthemes for more themes |
| tl.cex | the text size for the marginal labels |
| tl.col | the colour of the marginal labels |
| tl.srt | the angle of the text in the bottom labels of the table |
| lab | whether to show the correlation coefficients in the table |
| lab_col | the colour of the lettering of the correlation coefficients |
| lab_size | the size of the lettering of the correlation coefficients; increase (or decrease) it if the default 3 is not appropriate |
| circle.size | the size of the circles; increase (or decrease) it if the default 20 is not appropriate |
| print.info | whether to print information when cutting the data using data_cut() |
| percentage | the percentage of data to show if the number of observations is too big |
| seed | the seed value for the selection of the percentage of data (for big data sets) |
The function data_inter() uses the function z_scores() to standardize the continuous response variable and then uses linear model fits to establish whether the first-order interactions between the x's are significant or not. It reports the significance level based on Chi-square tests. Note that for large data sets it uses the function data_cut() to randomly cut the size of the data so that ggplot2 graphs can be used to plot it.
Typically, for a first-order interaction in a linear model, it fits the models y~x1+x2 and y~x1*x2 and calculates the significance level based on the difference in deviances. Under the null hypothesis of no interaction, the difference in deviances follows a Chi-square distribution with degrees of freedom equal to the difference in the degrees of freedom of the two fitted models.
It produces a plot if plot=TRUE, otherwise a square upper triangular table.
The function data_inter() works only for continuous responses.
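The comparison of y~x1+x2 against y~x1*x2 can be sketched with plain `lm()` fits. This is an illustration under stated assumptions, not the internals of `data_inter()`: the built-in `mtcars` data is used, the response is standardized with `scale()` rather than `z_scores()`, and the deviance difference is scaled by the residual variance of the larger model, as `anova(..., test = "Chisq")` does for linear models.

```r
# Illustrative data: built-in mtcars, with mpg as a continuous response
y <- as.numeric(scale(mtcars$mpg))            # standardized response

m0 <- lm(y ~ wt + hp, data = mtcars)          # main effects only
m1 <- lm(y ~ wt * hp, data = mtcars)          # main effects plus interaction

dev_diff  <- deviance(m0) - deviance(m1)        # drop in residual sum of squares
df_diff   <- df.residual(m0) - df.residual(m1)  # extra parameters in m1
scale_est <- deviance(m1) / df.residual(m1)     # residual variance of larger model

# Chi-square p-value for the interaction term
p_value <- pchisq(dev_diff / scale_est, df = df_diff, lower.tail = FALSE)
p_value
```

Repeating this for every pair of explanatory variables fills the upper triangular table of significance levels.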
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_inter(rent[, -c(4, 5)], response = R)
The function data_leverage() is designed to identify observations in the explanatory variable space with high leverage values and, therefore, potential outliers.
data_leverage(data, response, weights, quantile.value = 0.99, annotate = TRUE, line.col = "steelblue4", point.col = "steelblue4", annot.col = "darkred", plot = TRUE, title, percentage, seed = 123, print.info = TRUE, ...)
| data | the data frame |
|---|---|
| response | the response variable (to be excluded) |
| weights | prior weights if needed |
| quantile.value | the quantile value for the identification of high leverage. The default is 0.99 |
| annotate | whether to annotate the outliers |
| line.col | the colour of the line |
| point.col | the colour of the plotting points |
| annot.col | the colour of the annotated outliers |
| plot | whether to show the leverage plot |
| title | the title of the plot |
| percentage | what percentage of data to use in the plot |
| seed | the seed used to calculate the percentage |
| print.info | whether to print information when cutting the data using data_cut() |
| ... | other arguments |
The function data_leverage() uses linear model methodology to identify unusual observations, as a group, within the explanatory variables. It fits a linear model to all explanatory variables in the data, calculates the leverage points and plots them. It identifies one percent of the data as outliers.
The line in the plot is calculated as $2 \times (r/n)$, where $r$ is the number of explanatory variables and $n$ the number of observations. The value $2 \times (r/n)$ is given in the literature as the rule-of-thumb threshold above which an observation is considered to have high leverage. In practice this value is too low for the identification of outliers. Here we use quantile.value = 0.99, which identifies 1% of the observations as having high leverage.
It plots the leverages if plot=TRUE, or identifies the outliers if plot=FALSE.
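Leverages depend only on the explanatory variables, so they can be computed directly from the design matrix. The sketch below uses base R and the built-in `mtcars` data; it assumes the rule-of-thumb line is 2r/n, with r explanatory variables and n observations, and that high leverage is flagged above the 0.99 quantile. The object names are illustrative and this is not the package's code.

```r
# Illustrative explanatory variables from the built-in mtcars data
X <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
n <- nrow(X)   # number of observations
r <- ncol(X)   # number of explanatory variables

# Leverages are the diagonal of the hat matrix; with X = QR (intercept
# included) they equal the row sums of squares of Q
Q <- qr.Q(qr(cbind(1, X)))
h <- rowSums(Q^2)

rule_of_thumb <- 2 * r / n                      # threshold from the literature
cutoff  <- unname(quantile(h, probs = 0.99))    # stricter, quantile-based cut-off
flagged <- which(h > cutoff)                    # observations with high leverage

sum(h > rule_of_thumb)   # how many observations the rule of thumb would flag
flagged
```

Comparing the two counts on a real data set shows how much more conservative the quantile-based cut-off is than the rule of thumb.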
Mikis Stasinopoulos
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_leverage(rent99[, -c(2, 9)], response = rent)
rent99[, -c(2, 9)] |> data_leverage(response = rent, plot = FALSE)
The function data_mcor() uses either the function mcor() or the function cor_M() to find the non-linear maximal correlation between the continuous variables of the data. Note that the function cor_M() uses the function ace() of the package acepack, which is faster for large data.
The function mcor() fits a single maximal correlation. It uses the function ACE().
The function ACE() takes two variables and produces, among other things, the maximal correlation coefficient.
The function ACE.iter() takes two variables and shows the fits used to achieve the maximal correlation coefficient.
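The alternating conditional expectations idea behind a maximal correlation can be sketched in base R with `loess()` smoothers. The function `ace_sketch()` is a hypothetical, simplified illustration; the stopping rule and the use of default `loess()` settings are assumptions made here, and it is not the algorithm as implemented in `ACE()` or in acepack.

```r
# Hypothetical helper (not the package's ACE()): alternate between smoothing
# theta(y) on x and phi(x) on y until the fit stops improving.
ace_sketch <- function(x, y, con_crit = 0.01, max_iter = 50) {
  theta <- as.numeric(scale(y))       # start from the standardized response
  old_err <- Inf
  for (i in seq_len(max_iter)) {
    phi   <- fitted(loess(theta ~ x))                   # estimate of E[theta(y) | x]
    theta <- as.numeric(scale(fitted(loess(phi ~ y))))  # E[phi(x) | y], rescaled
    err <- mean((theta - phi)^2)
    if (abs(old_err - err) < con_crit) break            # negligible change: stop
    old_err <- err
  }
  cor(theta, phi)                     # correlation of the transformed variables
}

# A relationship that Pearson's correlation misses almost entirely
set.seed(1)
x <- runif(200, -1, 1)
y <- x^2 + rnorm(200, sd = 0.05)
cor(x, y)
ace_sketch(x, y)
```

The example contrasts the ordinary correlation, which is close to zero for a symmetric quadratic relationship, with the much larger correlation between the smoothed transformations.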
data_mcor(data, fun = cor_M, digits = 3, plot = TRUE, diag.off = TRUE, lower.tri.off = FALSE, method = c("square", "circle"), fit.method = c("P-splines", "loess"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Corr", title, ggtheme = ggplot2::theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, ...)
mcor(x, y, data = NULL, fit.method = c("loess", "P-splines"), nseg = 10, max.df = 6, ...)
ACE(x, y, weights, data = NULL, con_crit = 0.01, fit.method = c("loess", "P-splines"), nseg = 10, max.df = 6, ...)
ACE.iter(x, y, weights, data = NULL, con_crit = 0.001, fit.method = c("loess", "P-splines"), nseg = 10, max.df = 6, ...)
| data | a data frame |
|---|---|
| fun | the function used to compute the maximal correlation (default cor_M) |
| digits | how many digits to show in the correlation table |
| plot | whether to plot the results |
| diag.off | whether to show the diagonal entries |
| lower.tri.off | whether to show the lower triangular part of the matrix |
| method | plotting in "square" or "circle" |
| fit.method | whether to use "loess" or "P-splines" |
| outline.color | the outline colour |
| colors | the range of colours |
| legend.title | title for the legend |
| title | the main title |
| ggtheme | the theme used for the plot |
| tl.cex | the text size for the marginal labels |
| tl.col | the colour of the marginal labels |
| tl.srt | the angle of the text in the bottom labels of the table |
| lab | whether to show the correlation coefficients in the table |
| lab_col | the colour of the lettering of the correlation coefficients |
| lab_size | the size of the lettering of the correlation coefficients; increase (or decrease) it if the default 3 is not appropriate |
| circle.size | the size of the circles; increase (or decrease) it if the default 20 is not appropriate |
| x | a continuous variable |
| y | a continuous variable |
| weights | prior weights |
| con_crit | the convergence criterion (default 0.01 in ACE() and 0.001 in ACE.iter()) |
| nseg | the number of breakpoints in the definition of |
| max.df | the maximal degrees of freedom allowed if |
| ... | for more arguments |
The function data_mcor() uses the function mcor() to find the non-linear maximal correlation between the continuous variables of the data.
The function data_mcor() produces a table.
The function mcor() produces the maximal correlations.
The function ACE() produces an object of class "ACE", with plot() and print() methods.
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_cor
data_mcor(rent99)
max.cor <- data_mcor(rent99, plot=FALSE)
high_val(max.cor, val=0.5)
These functions identify outliers in the variables of the data.
The function data_outliers() takes a data.frame and applies the y_outliers() function to all its continuous variables.
The function y_outliers() takes one continuous variable and identifies which observations could be classified as outliers. It does this using either z-scores (fitting a distribution and then taking the residuals) or quantile measures (for example, K times the MAD away from the median).
The function y_outliers_both() takes one continuous variable and identifies outliers using both the z-scores and the quantile method. It prints either the union or the intersection of the two results.
The function y_outliers_by() takes one continuous variable and a factor and identifies outliers within each level of the factor.
The function y_outliers_loop() takes one continuous variable and identifies outliers by going through a loop, first weighting out observations and then refitting the model.
The function y_outliers_z() takes one continuous variable and possibly a factor and identifies outliers either by a single fit or by repeated fits in which outlying observations from previous fits are removed (equivalent to y_outliers_loop()).
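The quantile-type rule, flagging values that lie several MADs away from the median, can be sketched in base R. The helper `mad_outliers()` is a hypothetical name and the cut-off of 4 is an arbitrary choice made for this illustration; the package's z-score route instead fits a distribution (SHASHo by default) and works with its residuals, which this sketch does not attempt.

```r
# Hypothetical helper: flag observations lying more than `value` robust
# standard deviations (scaled MADs) away from the median.
mad_outliers <- function(x, value = 4) {
  z <- (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
  which(abs(z) > value)
}

x <- c(10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0)
mad_outliers(x)   # 7

# Within-group version, in the spirit of y_outliers_by()
g <- factor(c("a", "a", "a", "a", "b", "b", "b"))
tapply(x, g, mad_outliers)
```

Positions returned by the grouped call are relative to each group, so they need mapping back to the original row numbers before use.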
data_outliers(data, value, min.distinct = 50, family = SHASHo, type = c("zscores", "quantile"))
y_outliers(x, value, family = SHASHo, type = c("zscores", "quantile"), transform = TRUE)
y_outliers_both(x, value, family = SHASHo, method = c("intersect", "union"))
y_outliers_by(x, by, family = SHASHo, type = c("zscores", "quantile"))
y_outliers_z(x, by, value, family = SHASHo, transform = TRUE, loop = FALSE)
y_outliers_loop(x, value, family = SHASHo, type = c("zscores", "quantile"), transform = TRUE)
| data | a data frame |
|---|---|
| x | a continuous variable |
| by | a factor partitioning the data |
| loop | whether to loop in order to identify more outliers |
| transform | whether to transform the variable using a power transformation |
| value | the cut-off that the absolute value of the z-scores must exceed for an observation to be identified as an outlier |
| min.distinct | a threshold on the number of distinct values of a variable (the condition is having fewer distinct values than this) |
| family | the distribution family used for standardization |
| type | the measure used to identify outliers, "zscores" or "quantile" |
| method | for y_outliers_both(), whether the "intersect" (default) or the "union" of the two sets will be used |
The continuous variables are power transformed and then standardised.
Returns a list.
Mikis Stasinopoulos
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
da <- rent99[,-2]
data_outliers(da)
The function data_part() takes a data.frame and creates a new identical data.frame with an extra factor called partition, which can be used to allocate the data to different data sets.
i) if the option of the function is partition=2L, the factor has two levels, train and test.
ii) if the option of the function is partition=3L, the factor has three levels: train for training data, val for validation data and test for test data.
iii) if the option of the function is partition > 4L, say K, then the levels of the factor are "1", "2", ..., "K". The factor can then be used to identify K-fold cross-validation sets (up to K=20).
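The idea of a partition factor can be sketched in base R as follows. This is an illustration under stated assumptions, not the code of data_part(): the data frame `dat`, the allocation probabilities 0.6 and 0.4, and the choice K = 5 are all hypothetical.

```r
# Sketch only: build a partition factor by random allocation.
set.seed(123)
n <- 100
dat <- data.frame(y = rnorm(n), x = runif(n))   # hypothetical data

# Two levels, "train" and "test", with assumed probabilities 0.6 and 0.4
probs <- c(train = 0.6, test = 0.4)
dat$partition <- factor(
  sample(names(probs), size = n, replace = TRUE, prob = probs),
  levels = names(probs)
)
dat_train <- dat[dat$partition == "train", ]
dat_test  <- dat[dat$partition == "test", ]

# K levels "1", ..., "K" of (near) equal size for K-fold cross-validation
K <- 5
fold <- factor(sample(rep(seq_len(K), length.out = n)))
```

Shuffling a balanced vector of labels, as done for `fold`, keeps the fold sizes equal, whereas sampling with probabilities gives group sizes that vary from run to run.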
The function data_part_list() performs a task similar to that of data_part() but, instead of adding a factor to the data, creates a list whose elements are data.frames. Note that this function does allow the creation of K-fold cross-validation sets (up to 20 folds); see also the function data_Kfold().
The function data_boot_index() takes a data.frame and produces two lists of length B, the in-bag list, IB, and the out-of-bag list, OOB.
The function data_boot_weights() takes a data.frame and produces a matrix of dimensions n x B, each column of which can be used as prior weights in a regression.
The function data_Kfold() takes a data.frame and produces a matrix of indices which can then be used to fit different sections of the data for cross-validation.
The function data_cut() takes a data.frame and randomly selects a specified percentage of the data. It is usually applied within graphical functions to reduce plotting time; for data.frames with more than 50,000 observations it automatically selects part of the data.
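The bootstrap and sub-sampling ideas above can be sketched in base R. This is only an illustration of the concepts (in-bag and out-of-bag indices, a weight matrix, and a random subset); the object names and the values of `n`, `B` and `percentage` are assumptions, and the package functions may differ in detail.

```r
# Sketch only: bootstrap indices, bootstrap weights and a random subset.
set.seed(123)
n <- 50   # number of observations (assumed)
B <- 10   # number of bootstrap samples

# In-bag (IB) indices: n draws with replacement; out-of-bag (OOB): the rest
IB  <- lapply(seq_len(B), function(b) sample.int(n, n, replace = TRUE))
OOB <- lapply(IB, function(ib) setdiff(seq_len(n), ib))

# n x B matrix of weights: how often each observation appears in-bag
W <- sapply(IB, function(ib) tabulate(ib, nbins = n))

# Random subset keeping an assumed 20% of the rows
percentage <- 0.20
keep <- sort(sample.int(n, size = round(percentage * n)))
```

Each column of `W` sums to n, so using it as prior weights is equivalent to refitting on the corresponding bootstrap sample.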
data_part(data, partition = 2L, probs, setseed = 123, ...)
data_part_list(data, partition = 2L, probs, setseed = 123, ...)
data_boot_index(data, B = 10, setseed = 123)
data_boot_weights(data, B = 10, setseed = 123)
data_Kfold_index(data, K = 10, setseed = 123)
data_Kfold_weights(data, K = 10, setseed = 123)
data_cut(data, percentage, seed = 123, print.info = TRUE)
| data | a data frame |
|---|---|
| partition | 2, 3 or a number less than 20 |
| K | the number of partitions (maximum 20 for CV) |
| B | the number of bootstrap samples |
| probs | probabilities for the random selection |
| setseed | sets the seed so that the process can be repeated |
| percentage | the percentage of data to keep, if set |
| seed | the seed used for the random selection |
| print.info | whether to print information when the data are cut |
| ... | extra arguments |
The functions data_part(), data_part_list(), data_boot_index(), data_boot_weights() and data_Kfold() produce a data.frame, lists or matrices to be used later for data partitioning during the fitting process. The function data_cut() randomly selects part of the data.
Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
da <- data_part(rent)
head(da)
mosaicplot(table(da$partition))
da.train <- subset(da, partition=="train")
da.test <- subset(da, partition=="test")
dim(da.train)
dim(da.test)
allda <- data_part_list(rent)
dim(allda[[1]]) # training data
dim(allda[[2]]) # test data
There are several functions that operate on a data.frame and return a data.frame. The functions are:
1) data_rm(): This function removes the variables specified by vars from the data.frame. Note that vars can take either character names or numbers.
2) data_rm1val(): This function looks for variables with a single distinct value (most likely factors left over from a previous subset() operation) and removes them from the data.
3) data_exclude_class(): This function looks for variables (columns) of a specified 'R' class and removes them from the data. The default class is "factor".
4) data_only_continuous(): This function picks up only the continuous variables in the data.frame.
5) data_select(): This function selects only the variables in the vars list and returns the resulting data.
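The operations listed above correspond to simple column filters, which can be sketched in base R. This is a hypothetical illustration of the ideas, not the package code; the toy data frame `dat` and all object names are assumptions.

```r
# Sketch only: column filters analogous to the operations listed above.
dat <- data.frame(
  a = c(1.5, 2.5, 3.5),
  b = factor(c("u", "u", "u")),   # a single distinct value
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Drop variables with a single distinct value
one_val <- vapply(dat, function(v) length(unique(v)) == 1L, logical(1))
dat_no1val <- dat[, !one_val, drop = FALSE]

# Drop variables of class "factor"
dat_nofactor <- dat[, !vapply(dat, is.factor, logical(1)), drop = FALSE]

# Keep only numeric variables
dat_numeric <- dat[, vapply(dat, is.numeric, logical(1)), drop = FALSE]

# Keep only selected variables
dat_selected <- dat[, c("a", "c"), drop = FALSE]
```

Using `drop = FALSE` guarantees that the result stays a data.frame even when a single column remains.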
data_rm(data, vars)
data_rm1val(data)
data_exclude_class(data, class.out = "factor")
data_only_continuous(data)
data_select(data, vars)
data_rmNAvars(data)
| data | a data frame |
|---|---|
| vars | selected variables (columns from the data frame) |
| class.out | a specific variable class to be excluded from the data frame |
All the above functions can be used for piping, e.g. da |> data_rm1val().
Returns a data.frame.
Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_cor
library(gamlss)
da <- rent |> data_rm(vars=c("Sp", "Sm"))
head(da)
da <- rent |> data_exclude_class()
head(da)
da <- data_only_continuous(rent)
head(da)
da <- rent |> data_select(vars=c("R", "Fl", "A"))
head(da)
The function data_scale() takes a data.frame and creates a new data set with all continuous variables standardised. The standardization can be:
i) mean 0 and variance 1, which is equivalent to using the options scale.to="z-scores" and family="NO". That is, the z-scores (residuals) after fitting a normal distribution to a continuous variable.
ii) A more general z-score using, say, scale.to="z-scores" and family="SHASH", in which case a correction for skewness and kurtosis is applied to the specified variable.
iii) Finally, the range is restricted from zero to one, i.e. scale.to="0to1".
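Options i) and iii) can be written down directly in base R. This is a minimal sketch on an assumed toy vector `x`; it uses the sample standard deviation from sd(), so it may differ slightly from z-scores obtained from a maximum likelihood fit of the normal distribution, and it does not cover the SHASH case in ii).

```r
# Sketch only: the two simplest standardisations on a toy vector.
x <- c(2, 4, 6, 8, 10)

# i) z-scores: mean 0 and (sample) variance 1
z <- (x - mean(x)) / sd(x)

# iii) rescale so that the range runs from zero to one
x01 <- (x - min(x)) / (max(x) - min(x))
```

Both transformations are linear, so they change the location and scale of the variable but not the shape of its distribution.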
The function data_vars2data() creates a new data frame in which the continuous variables either remain the same or become polynomials, and the factors become dummy variables.
The function data_formulae() takes a data.frame and creates four formulae:
1) The first contains the response and all main effects of the variables in the data, i.e. R~Fl+A+K+loc
2) The second contains the response and all first order interactions, i.e. R~(Fl+A+K+loc)^2
3) The third contains all main effects of the variables in the data with no response, i.e. ~Fl+A+K+loc
4) The fourth contains all first order interactions with no response, i.e. ~(Fl+A+K+loc)^2
The function data_form2X() takes a data.frame and a formula and creates a design matrix.
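The four formulae and the design matrix described above can be sketched with base R tools. This is an illustration of the idea only, not the code of data_formulae() or data_form2X(); the small data frame `dat` and the object names are assumptions.

```r
# Sketch only: build the four formulae from the column names, then a design matrix.
dat <- data.frame(
  R   = c(5, 7, 6, 9),
  Fl  = c(40, 55, 48, 70),
  A   = c(1950, 1980, 1965, 1990),
  loc = factor(c("a", "b", "a", "b"))
)
response <- "R"
rhs <- paste(setdiff(names(dat), response), collapse = " + ")

f_main    <- as.formula(paste(response, "~", rhs))           # R ~ Fl + A + loc
f_int     <- as.formula(paste0(response, " ~ (", rhs, ")^2")) # with first order interactions
f_main_nr <- as.formula(paste("~", rhs))                      # no response
f_int_nr  <- as.formula(paste0("~ (", rhs, ")^2"))            # no response, interactions

# Design matrix for the main effects; the factor loc becomes a dummy column
X <- model.matrix(f_main_nr, data = dat)
```

With the default treatment contrasts, the first level of `loc` is absorbed into the intercept, so only one dummy column appears for a two-level factor.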
data_scale(data, response, position.response = NULL, scale.to = c("z-scores", "0to1"), family = "NO", scale.response = FALSE)
data_vars2data(data, response, exclude = NULL, type = c("main.effect", "first.order"), weights = NULL, nonlinear = FALSE, basis = "poly", arg = 2)
data_formulae(data, response)
data_form2X(data, formula, response, scale.to = c("no", "z-scores", "0to1"), family = NO)
| data | a data frame |
|---|---|
| response | the name of the response variable |
| position.response | alternatively, the position of the response variable in the data |
| scale.to | how to scale the variables, e.g. "z-scores" or "0to1" |
| family | the family used in the standardization; the default is "NO" |
| scale.response | whether to also scale the response; the default value is FALSE |
| exclude | which variables to exclude |
| type | whether to create main effects only or also dummies for first order interactions |
| weights | prior weights when the matrix is created |
| nonlinear | whether nonlinear terms are created for the continuous variables (see basis and arg) |
| basis | which basis should be used for nonlinearities; the default is "poly" |
| arg | the argument for the basis, e.g. 3 |
| formula | the formula used to create the design matrix |
A data frame is returned with all continuous variables standardised.
Mikis Stasinopoulos
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
rent[, -c(4,5)] |> data_scale(response=R) |> head()
rent[, -c(4,5)] |> data_vars2data(response=R, nonlinear = TRUE, basis = "poly", arg=3) -> D
D |> head()
D |> str()
ff <- data_formulae(D, response=R)
ff
head(data_form2X(D, formula=ff[[1]], response=R))
This set of functions is designed to help the user deal with the structure of new data sets.
data_str(data, min.values = 100, min.levels = 10)
y_distinct(var)
data_distinct(data, get.distinct = FALSE, print = TRUE)
data_cha2fac(data, show.str = FALSE)
data_few2fac(data, max.levels = 5, show.str = FALSE)
data_int2num(data, min.values = 50, show.str = FALSE)
data_fac2num(data, vars)
| data | a data frame |
|---|---|
| min.values | the minimal number of distinct values before a warning is given |
| max.levels | the maximum number of distinct values in the variable |
| min.levels | the minimal number of distinct levels before a warning is given |
| var | a vector |
| vars | a character vector with names from the data |
| get.distinct | TRUE if you need to save the values, FALSE if not |
| show.str | whether to show the structure |
| print | TRUE or FALSE |
The function data_str() gives the structure of the data set.
The function data_distinct() gives the distinct values of the vectors in the data set.
The function y_distinct() gives the distinct values of a single vector.
The function data_cha2fac() transforms all character vectors into factors.
The function data_few2fac() transforms all vectors with fewer distinct values than max.levels into factors.
The function data_int2num() transforms all integer vectors with more distinct values than min.values into numeric vectors.
The function data_fac2num() transforms the selected factors into numeric vectors.
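The conversions described above can be sketched in base R. This is a hypothetical illustration of the logic rather than the package code; the toy data frame `dat`, the threshold `max.levels <- 5` and the object names are assumptions, and for brevity the integer conversion here ignores the min.values threshold.

```r
# Sketch only: count distinct values and convert column types accordingly.
dat <- data.frame(
  id  = 1:6,
  grp = c("a", "b", "a", "b", "a", "b"),
  few = c(1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Number of distinct values per column
n_distinct <- vapply(dat, function(v) length(unique(v)), integer(1))

# Character vectors become factors
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], factor)

# Non-factor vectors with few distinct values become factors
max.levels <- 5
is_few <- n_distinct < max.levels & !vapply(dat, is.factor, logical(1))
dat[is_few] <- lapply(dat[is_few], factor)

# Remaining integer vectors become numeric (double)
is_int <- vapply(dat, is.integer, logical(1))
dat[is_int] <- lapply(dat[is_int], as.numeric)
```

The order of the steps matters: converting few-valued vectors to factors first prevents short integer codes from being turned into continuous variables.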
Mikis Stasinopoulos, Bob Rigby and Fernanda De Bastiani
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
data_str(rent)
data_distinct(rent)
data_cha2fac(rent)
data_few2fac(rent)
data_int2num(rent)
The function void() looks for the percentage of empty space in the direction of two variables x and y.
The function data_void() looks pair-wise for empty space in all the continuous variables in the data set.
data_void(data, digits = 3, plot = TRUE, diag.off = TRUE, lower.tri.off = FALSE, method = c("square", "circle"), outline.color = "gray", colors = c("blue", "white", "red"), legend.title = "Void", title, ggtheme = ggplot2::theme_minimal(), tl.cex = 12, tl.col = "black", tl.srt = 45, lab = TRUE, lab_col = "black", lab_size = 3, circle.size = 20, seed = 123, percentage, print.info = TRUE)
void(x, y, plot = TRUE, print = TRUE, table.length)
| data | a data frame |
|---|---|
| digits | the digits for printing the correlation coefficients |
| plot | whether to plot or not |
| diag.off | whether to show the diagonal elements |
| lower.tri.off | whether to show the lower part of the matrix |
| method | plotting in "square" or "circle" |
| outline.color | the outline colour |
| colors | the range of colours |
| legend.title | title for the legend |
| title | the main title |
| ggtheme | the theme for the plot, see package ggthemes for more themes |
| tl.cex | the text size for the marginal labels |
| tl.col | the colour of the marginal labels |
| tl.srt | the angle of the text in the bottom labels of the table |
| lab | whether to show the correlation coefficients in the table |
| lab_col | the colour of the lettering of the correlation coefficients |
| lab_size | the size of the lettering of the correlation coefficients; increase (or decrease) it if the default 3 is not appropriate |
| circle.size | the size of the circles; increase (or decrease) it if the default 20 is not appropriate |
| seed | the seed used for the random selection |
| percentage | the percentage of data to show if the number of observations is too big |
| print.info | whether to print information when the data are cut |
| x | the first variable in void() |
| y | the second variable in void() |
| print | whether to print the results |
| table.length | the table length (if missing, it is calculated automatically) |
The functions void() and data_void() work by discretising the data in the x and y directions and then calculating the percentage of empty cells.
By discretising the data we mean cutting both variables, x and y, at an equally spaced grid of k points and creating a (k x k) matrix containing the number of data points in each cell of the grid. The problem, though, with any attempt to calculate the percentage of empty space is that increasing k in the x and y directions results in more zero cells and therefore a higher percentage of empty space. To avoid this we need a way to stop the discretisation before the data become too sparse. The way this is done in the current function is the following:
i) If the n points (x, y) are randomly allocated, we would expect the counts in the cells of the matrix of a discretised two-dimensional data set to be Poisson distributed, with a probability of zero equal to $e^{-\mu}$, where $\mu$ is the mean of the Poisson distribution. That is, under the null hypothesis that the n points are spread randomly, we expect some of the cells to be zero with probability $e^{-\mu}$. Given n, the number of observations, we can use this information to find out at which discretisation point k we should stop.
ii) To identify at which stage k we should stop for a given number of observations n, we generated n values for x and y randomly from a uniform distribution. We used those values to find the point k that gives a probability of zero close to 0.05. We calculated those probabilities using $e^{-\bar{x}}$, where $\bar{x}$ is the mean of the cell counts. By doing this we found a relationship between k and n.
This equation provides us with an easy way to calculate k given n.
It produces a value between zero and one.
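The discretisation argument above can be sketched in base R. This is an illustration under a stated assumption, not the equation used by the package: with n points on a k x k grid the mean cell count is n / k^2, so setting the Poisson zero probability exp(-n / k^2) equal to 0.05 gives k = sqrt(n / -log(0.05)). The simulated uniform data and all object names are hypothetical.

```r
# Sketch only: percentage of empty cells on a k x k grid.
set.seed(123)
n <- 500
x <- runif(n)   # simulated data, spread randomly
y <- runif(n)

# Assumed rule: choose k so that exp(-n / k^2) is about 0.05
k <- floor(sqrt(n / -log(0.05)))

# Cut both variables into k equally spaced intervals and count points per cell
counts <- table(cut(x, breaks = k), cut(y, breaks = k))

# Proportion of empty cells, a value between zero and one
void_share <- mean(counts == 0)
```

For randomly spread points the proportion of empty cells should stay small, so a clearly larger value for real data would point to empty regions in the (x, y) plane.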
Mikis Stasinopoulos
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/9780429298547.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC. doi:10.1201/b21973
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
void(rent$A, rent$Fl)
data_void(rent)
The function data_xyplot() plots the response against all other variables in a given data set.
The function data_plot() plots all variables individually.
The function data_bucket() plots the bucket plot for all continuous variables.
The function data_zscores() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for all continuous variables.
The function y_zscores() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for a single variable.
The function data_response() calculates and plots the z-scores (obtained after fitting the SHASHo distribution) for the response variable.
data_xyplot(data, response, point.size = 0.5, nrow = NULL, print.info = TRUE, ncol = NULL, percentage, seed = 123, max.levels = 10, plots.per.page = 9, one.by.one = FALSE, title, text.x.angle = 0, ...)
data_plot(data, value = 3, hist.col = "black", hist.fill = "white", dens.fill = "#FF6666", nrow = NULL, ncol = NULL, percentage, seed = 123, print.info = TRUE, plot.hist = TRUE, plots.per.page = 9, one.by.one = FALSE, title, ...)
data_bucket(data, value = 3, max.levels = 20, nrow = NULL, ncol = NULL, plots.per.page = 9, one.by.one = FALSE, title, percentage, seed = 123, print.info = TRUE, ...)
y_zscores(x, weights, family = SHASHo, value = 3, plot = TRUE, hist = FALSE, transform = FALSE, ...)
data_zscores(data, plot = TRUE, hist = FALSE, value = 3, family = SHASHo, max.levels = 10, hist.col = "black", hist.fill = "white", dens.fill = "#FF6666", nrow = NULL, ncol = NULL, plots.per.page = 9, one.by.one = FALSE, title, print.info = TRUE, percentage, seed = 123, ...)
data_response(data, response, plot = TRUE, percentage, seed = 123, print.info = TRUE)
| data | a data frame |
|---|---|
| x | a single variable |
| weights | prior weights |
| transform | whether to use a power transformation or not |
| family | a gamlss distribution family (continuous) |
| response | the response variable, which should be in the data |
| point.size | the size of points in scatter plots |
| nrow | the number of rows in the plot |
| ncol | the number of columns in the plot |
| plots.per.page | the maximum number of plots per page |
| one.by.one | whether the plots are plotted individually |
| value | the value used to identify outliers |
| hist.col | the colour of the lines of the histogram |
| hist.fill | the fill colour of the histogram |
| dens.fill | the colour of the density plot |
| plot.hist | whether to use histograms for the continuous variables |
| plot | whether to plot |
| hist | whether to plot a histogram or a dot plot |
| max.levels | variables with fewer than this number of distinct values are excluded from the bucket plots |
| title | the title of the plot |
| percentage | if set, e.g. 0.50, a proportion of the data is plotted; otherwise, for big data sets with more than 50,000 observations, a proportion is plotted automatically |
| seed | the seed used for the random selection |
| print.info | whether to print information when the data are cut |
| text.x.angle | how the text on the x-axis is printed (helpful if, say, factors have a lot of levels). It can be a single number or a vector. In both cases it is expanded to a vector whose length is the number of explanatory variables. Therefore, for full control, a vector of the same length as the number of x-variables should be given. |
| ... | other arguments |
The function data_xyplot() takes a data frame and plots all the explanatory variables against the response.
The function data_plot() takes a data frame and plots all variables individually. The continuous variables are plotted using y_dots() or y_hist(), while factors and integers are plotted as bar plots.
Plots of the data
Mikis Stasinopoulos
Rigby, R. A. and Stasinopoulos D. M. (2005). Generalized additive models for location, scale and shape,(with discussion), Appl. Statist., 54, part 3, pp 507-554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.
Stasinopoulos D. M. Rigby R.A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.
Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A. and Heller, G.Z., (2024). Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications (Vol. 56). Cambridge University Press.
(see also https://www.gamlss.com/).
da <- rent99[,-2]
data_xyplot(da, rent)
data_plot(da)
y_zscores(da$rent)
data_response(da, response=rent)