Title: | Robust and Flexible Model-Based Clustering for Data Sets with Missing Values at Random |
---|---|
Description: | Implementations of various robust and flexible model-based clustering methods for data sets with missing values at random. Two main models are: Multivariate Contaminated Normal Mixture (MCNM, Tong and Tortora, 2022, <doi:10.1007/s11634-021-00476-1>) and Multivariate Generalized Hyperbolic Mixture (MGHM, Wei et al., 2019, <doi:10.1016/j.csda.2018.08.016>). Mixtures via some special or limiting cases of the multivariate generalized hyperbolic distribution are also included: Normal-Inverse Gaussian, Symmetric Normal-Inverse Gaussian, Skew-Cauchy, Cauchy, Skew-t, Student's t, Normal, Symmetric Generalized Hyperbolic, Hyperbolic Univariate Marginals, Hyperbolic, and Symmetric Hyperbolic. |
Authors: | Hung Tong [aut, cre], Cristina Tortora [aut, ths, dgs] |
Maintainer: | Hung Tong <[email protected]> |
License: | GPL (>= 2) |
Version: | 3.0.4 |
Built: | 2025-03-07 03:19:30 UTC |
Source: | https://github.com/cran/MixtureMissing |
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, and -3 that it is probably pretty safe.
auto
A data frame with 205 rows and 26 variables. The first 15 variables are continuous, while the last 11 variables are categorical. There are 45 rows with missing values.
continuous from 65 to 256.
continuous from 86.6 to 120.9.
continuous from 141.1 to 208.1.
continuous from 60.3 to 72.3.
continuous from 47.8 to 59.8.
continuous from 1488 to 4066.
continuous from 61 to 326.
continuous from 2.54 to 3.94.
continuous from 2.07 to 4.17.
continuous from 7 to 23.
continuous from 48 to 288.
continuous from 4150 to 6600.
continuous from 13 to 49.
continuous from 16 to 54.
continuous from 5118 to 45400.
-3, -2, -1, 0, 1, 2, 3.
alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
diesel, gas.
std, turbo.
four, two.
hardtop, wagon, sedan, hatchback, convertible.
4wd, fwd, rwd.
front, rear.
dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
eight, five, four, six, three, twelve, two.
1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
Kibler, D., Aha, D.W., & Albert, M. (1989). Instance-based prediction of real-valued attributes. Computational Intelligence, Vol 5, 51–57. https://archive.ics.uci.edu/ml/datasets/automobile
The data set contains the ratio of retained earnings (RE) to total assets and the ratio of earnings before interest and taxes (EBIT) to total assets for 66 American firms. Half of the selected firms had filed for bankruptcy.
bankruptcy
A data frame with 66 rows and 3 variables:
Status of the firm: 0 for bankruptcy and 1 for financially sound.
Ratio of retained earnings.
Ratio of earnings before interest and taxes.
Altman E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4): 589-609 https://www.jstor.org/stable/2978933
Evaluate the performance of a classification model by comparing its predicted labels to the true labels. Various metrics are returned to give insight into how well the model classifies the observations. This function is provided to help evaluate outlier detection by MCNM and MtM when the true outliers are known in advance.
evaluation_metrics(true_labels, pred_labels)
true_labels |
A 0-1 or logical vector denoting the true labels. The meaning of 0 and 1 (or TRUE and FALSE) is up to the user. |
pred_labels |
A 0-1 or logical vector denoting the predicted labels. The meaning of 0 and 1 (or TRUE and FALSE) is up to the user. |
A list with the following slots:
matr |
The confusion matrix built upon true labels and predicted labels. |
TN |
True negative. |
FP |
False positive (type I error). |
FN |
False negative (type II error). |
TP |
True positive. |
TPR |
True positive rate (sensitivity). |
FPR |
False positive rate. |
TNR |
True negative rate (specificity). |
FNR |
False negative rate. |
precision |
Precision or positive predictive value (PPV). |
accuracy |
Accuracy. |
error_rate |
Error rate. |
FDR |
False discovery rate. |
#++++ Inputs are 0-1 vectors ++++#
evaluation_metrics(
  true_labels = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  pred_labels = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)

#++++ Inputs are logical vectors ++++#
evaluation_metrics(
  true_labels = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  pred_labels = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)
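To make the definitions of the returned rates concrete, the following base R sketch derives them from the four confusion-matrix counts. It is not the package's implementation: the helper name binary_metrics is hypothetical, and it assumes that 1 (or TRUE) marks the positive class.

```r
# Hypothetical helper, not part of MixtureMissing: computes the usual
# confusion-matrix rates, assuming 1 / TRUE is the positive class.
binary_metrics <- function(true_labels, pred_labels) {
  true_labels <- as.integer(true_labels)
  pred_labels <- as.integer(pred_labels)
  n  <- length(true_labels)

  TP <- sum(true_labels == 1 & pred_labels == 1)
  TN <- sum(true_labels == 0 & pred_labels == 0)
  FP <- sum(true_labels == 0 & pred_labels == 1)
  FN <- sum(true_labels == 1 & pred_labels == 0)

  list(
    TP = TP, TN = TN, FP = FP, FN = FN,
    TPR        = TP / (TP + FN),   # sensitivity
    FPR        = FP / (FP + TN),
    TNR        = TN / (TN + FP),   # specificity
    FNR        = FN / (FN + TP),
    precision  = TP / (TP + FP),
    accuracy   = (TP + TN) / n,
    error_rate = (FP + FN) / n,
    FDR        = FP / (FP + TP)
  )
}

m <- binary_metrics(
  true_labels = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  pred_labels = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)
m$TPR
m$precision
```

Because logical vectors are coerced with as.integer(), the same helper accepts TRUE/FALSE inputs as well.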
Extract values from MixtureMissing objects or from outputs of select_mixture.
extract(
  object,
  what = c("model", "parameters", "cluster", "posterior", "outlier", "missing",
           "imputed", "complete", "information"),
  criterion = c("AIC", "BIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  m_code = NULL
)
object |
A MixtureMissing object or an output of select_mixture. |
what |
The specific value to be extracted. See the return section for possible values. |
criterion |
If |
m_code |
Only used in the case when |
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
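Several of the criteria listed above are penalized functions of the maximized log-likelihood. The base R sketch below computes five of them from a log-likelihood value, a parameter count, and a sample size, using their common textbook definitions in the smaller-is-better form. This is an assumption made for illustration: the package may use a different sign convention, and the helper name info_criteria is hypothetical.

```r
# Hypothetical helper, not part of MixtureMissing. Textbook definitions,
# written so that smaller values indicate a better model.
info_criteria <- function(loglik, npar, n) {
  c(
    AIC  = -2 * loglik + 2 * npar,
    BIC  = -2 * loglik + npar * log(n),
    AIC3 = -2 * loglik + 3 * npar,
    CAIC = -2 * loglik + npar * (log(n) + 1),
    AICc = -2 * loglik + 2 * npar + 2 * npar * (npar + 1) / (n - npar - 1)
  )
}

# Illustrative numbers only
ic <- info_criteria(loglik = -250, npar = 10, n = 150)
ic
```

The criteria differ only in how heavily they penalize the number of parameters, which is why they can rank the same set of fitted models differently.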
One of the following, depending on what:
If what = "model" - A data frame showing the component distribution and its abbreviation, number of clusters, and whether the data set is complete or incomplete.
If what = "parameters" - A list containing the relevant parameters.
If what = "cluster" - A numeric vector of length n (the number of observations) indicating cluster memberships determined by the model.
If what = "posterior" - An n by G matrix, where G is the number of clusters and each row indicates the expected probabilities that the corresponding observation belongs to each cluster.
If what = "outlier" - A logical vector of length n indicating observations that are outliers. Only available if model is CN or t; NULL otherwise with a warning.
If what = "missing" - A data frame showing how many observations (cases) have missing values and the number of missing values per variable.
If what = "imputed" - The original data set if it is complete; otherwise, the data set with missing values imputed by appropriate expectations.
If what = "complete" - An n by d logical matrix, where d is the number of variables, indicating which cells have no missing values.
If what = "information" - A data frame showing the number of clusters, final observed log-likelihood value, number of parameters, and desired information criteria.
#++++ With no missing values ++++#
X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "model")
extract(mod, what = "parameters")
extract(mod, what = "cluster")

#++++ With missing values ++++#
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "outlier")
extract(mod, what = "missing")
extract(mod, what = "imputed")
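The quantities described for what = "missing" and what = "complete" can be reproduced directly on a raw data matrix. The following base R sketch, which does not use the package at all, shows one way to obtain them; the toy matrix and variable names are made up for illustration.

```r
# Toy data with missing cells (illustrative only)
X <- matrix(c( 1, NA,  3,
               4,  5,  6,
              NA, NA,  9,
              10, 11, 12),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# Logical matrix: TRUE where a cell is observed
complete_cells <- !is.na(X)

# Number of observations (rows) with at least one missing value
n_incomplete_rows <- sum(!complete.cases(X))

# Number of missing values per variable (column)
missing_per_variable <- colSums(is.na(X))

n_incomplete_rows
missing_per_variable
```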
Generate all possible missing patterns in a multivariate data set. The function can be used to complement the function ampute() from package mice, in which a matrix of patterns is needed to allow for general missing-data patterns with the missing-data mechanism missing at random (MAR). Using this function, each observation can have more than one missing value.
generate_patterns(d)
d |
The number of variables or columns of the data set. |
An observation cannot have all of its values missing, and a complete observation does not count as a missing-data pattern. Note that a large value of d may result in a memory allocation error.
A matrix where 0 indicates that a variable should have missing values and 1 indicates that a variable should remain complete. This matrix has d columns and 2^d - 2 rows.
generate_patterns(4)

#++++ To use with the function ampute() from package mice ++++#
library(mice)
patterns_matr <- generate_patterns(4)
data_missing <- ampute(iris[1:4], prop = 0.5, patterns = patterns_matr)$amp
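For intuition about what the pattern matrix contains, here is a base R sketch that enumerates every 0-1 combination of d variables and drops the two excluded rows (all missing and all observed). It is an illustration under those stated rules, not the package's code, and the helper name all_patterns is hypothetical.

```r
# Hypothetical helper, not part of MixtureMissing: enumerate all 0/1 patterns
# of length d, excluding the all-0 (everything missing) and all-1 (complete) rows.
all_patterns <- function(d) {
  grid <- as.matrix(expand.grid(rep(list(0:1), d)))
  dimnames(grid) <- NULL
  n_observed <- rowSums(grid)
  grid[n_observed > 0 & n_observed < d, , drop = FALSE]
}

p <- all_patterns(4)
dim(p)
```

The row count doubles with every additional variable, which is why large values of d can exhaust memory.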
A convenience function that randomly introduces missing values into an at-least-bivariate data set. The user can specify either the proportion or the exact number of observations that contain some missing values. Note that the function does not guarantee that the underlying missing-data mechanism is missing at random (MAR).
hide_values(X, prop_cases = 0.1, n_cases = NULL)
X |
An |
prop_cases |
(optional) Proportion of observations that contain some missing values. |
n_cases |
(optional) Number of observations that contain some missing values. |
If subject to missingness, an observation can have at least 1 and at most ncol(X) - 1 missing values. Depending on the data set, the number of rows with missing values in the resulting matrix is not guaranteed to match the specified proportion.
The original matrix or data frame, with n rows and d columns, now containing missing values.
set.seed(1234)
hide_values(iris[1:4])
hide_values(iris[1:4], prop_cases = 0.5)
hide_values(iris[1:4], n_cases = 80)
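The constraint that an affected observation has between 1 and ncol(X) - 1 missing values can be illustrated with a short base R sketch. This is a simplified stand-in written for this explanation, not the package's implementation; the helper name hide_some_values and its sampling scheme are assumptions.

```r
# Hypothetical helper, not part of MixtureMissing: blank out between 1 and
# d - 1 cells in each of n_cases randomly chosen rows.
hide_some_values <- function(X, n_cases) {
  X <- as.matrix(X)
  d <- ncol(X)
  stopifnot(d >= 2, n_cases <= nrow(X))

  rows <- sample(nrow(X), n_cases)
  for (i in rows) {
    n_missing <- sample.int(d - 1, 1)        # at least 1, at most d - 1
    X[i, sample.int(d, n_missing)] <- NA     # which columns to hide
  }
  X
}

set.seed(1)
X_miss <- hide_some_values(iris[1:4], n_cases = 20)
sum(!complete.cases(X_miss))
```

Since the cells are chosen independently of the data values, this simple scheme is closer to missing completely at random than to a general MAR mechanism.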
Initialize cluster memberships and component parameters to start the EM algorithm using a heuristic clustering method or user-defined labels.
initialize_clusters(
  X,
  G,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL
)
X |
An |
G |
The number of clusters, which must be at least 1. If |
init_method |
(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector |
clusters |
A numeric vector of length |
Available heuristic methods include k-medoids clustering, k-means clustering, and hierarchical clustering. Alternatively, the user can enter pre-specified cluster memberships, making other initialization methods possible. If the given data set contains missing values, only observations with complete records will be used to initialize clusters. However, in this case, except when G = 1, the resulting cluster memberships will be set to NULL since they represent those complete records rather than the original data set as a whole.
A list with the following slots:
pi |
Component mixing proportions. |
mu |
A |
Sigma |
A |
clusters |
A numeric vector with values from 1 to |
Everitt, B., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. John Wiley & Sons.
Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100-108. doi: 10.2307/2346830.
#++++ Initialization using a heuristic method ++++#
set.seed(1234)
init <- initialize_clusters(iris[1:4], G = 3)
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'kmeans')
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'hierarchical')

#++++ Initialization using user-defined labels ++++#
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'manual',
                            clusters = as.numeric(iris$Species))

#++++ Initial parameters and pairwise scatterplot showing the mapping ++++#
init$pi
init$mu
init$Sigma
init$clusters
pairs(iris[1:4], col = init$clusters, pch = 16)
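As a rough picture of how hard cluster labels turn into starting values, the base R sketch below runs k-means and computes per-cluster proportions, mean vectors, and covariance matrices. It is only an illustration: the list layout chosen for mu and Sigma is an assumption and may not match the shapes that initialize_clusters actually returns.

```r
set.seed(1234)
X <- as.matrix(iris[1:4])
G <- 3

# Hard labels from k-means (stats package, shipped with base R)
labels <- kmeans(X, centers = G, nstart = 10)$cluster

# Starting values derived from the labels; storing mu and Sigma as lists
# of length G is an illustrative choice, not the package's format.
init_sketch <- list(
  pi       = tabulate(labels, nbins = G) / nrow(X),
  mu       = lapply(1:G, function(g) colMeans(X[labels == g, , drop = FALSE])),
  Sigma    = lapply(1:G, function(g) cov(X[labels == g, , drop = FALSE])),
  clusters = labels
)

init_sketch$pi
```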
Carries out model-based clustering using a multivariate contaminated normal mixture (MCNM). The function automatically determines whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).
MCNM(
  X,
  G,
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL,
  eta_min = 1.001,
  progress = TRUE
)
X |
An |
G |
An integer vector specifying the numbers of clusters, which must be at least 1. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm: 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector |
clusters |
(optional) A numeric vector of length |
eta_min |
(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
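The epsilon argument refers to an Aitken-based stopping rule. One common formulation, sketched below in base R, estimates the asymptotic log-likelihood from the last three values and stops once that estimate is within epsilon of the current value. Variants of this rule exist, so treat this as an illustration rather than the exact test used by the package; the helper name aitken_converged is hypothetical.

```r
# Hypothetical helper, not part of MixtureMissing: Aitken acceleration check
# on a vector of log-likelihood values, one per EM iteration.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)                 # need three values to accelerate

  d1 <- loglik[k - 1] - loglik[k - 2]      # previous increase
  d2 <- loglik[k] - loglik[k - 1]          # latest increase
  if (d1 == 0) return(d2 == 0)             # flat sequence: nothing left to gain

  a   <- d2 / d1                           # Aitken acceleration factor
  gap <- d2 / (1 - a)                      # estimated distance to the limit
  is.finite(gap) && gap >= 0 && gap < epsilon
}

# A log-likelihood sequence that converges geometrically to -100
ll <- -100 - 10 * 0.5^(0:20)
aitken_converged(ll[1:3])   # early iterations: still far from the limit
aitken_converged(ll)        # later iterations: within epsilon
```

Compared with stopping on the raw increase between two iterations, this rule is less likely to stop early when the log-likelihood is rising slowly but steadily.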
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
An object of class MixtureMissing with:
model |
The model used to fit the data set. |
pi |
Mixing proportions. |
mu |
Component location vectors. |
Sigma |
Component dispersion matrices. |
alpha |
Component proportions of good observations. |
eta |
Component degrees of contamination. |
z_tilde |
An |
v_tilde |
An |
clusters |
A numeric vector of length |
outliers |
A logical vector of length |
data |
The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations. |
complete |
An |
npar |
The breakdown of the number of parameters to estimate. |
max_iter |
Maximum number of iterations allowed in the EM algorithm. |
iter_stop |
The actual number of iterations needed when fitting the data set. |
final_loglik |
The final value of log-likelihood. |
loglik |
All the values of log-likelihood. |
AIC |
Akaike information criterion. |
BIC |
Bayesian information criterion. |
KIC |
Kullback information criterion. |
KICc |
Corrected Kullback information criterion. |
AIC3 |
Modified AIC. |
CAIC |
Bozdogan's consistent AIC. |
AICc |
Small-sample version of AIC. |
ent |
Entropy. |
ICL |
Integrated Completed Likelihood criterion. |
AWE |
Approximate weight of evidence. |
CLC |
Classification likelihood criterion. |
init_method |
The initialization method used in model fitting. |
Punzo, A. and McNicholas, P.D., 2016. Parsimonious mixtures of multivariate contaminated normal distributions. Biometrical Journal, 58(6), pp.1506-1537.
Tong, H. and Tortora, C., 2022. Model-based clustering and outlier detection with missing data. Advances in Data Analysis and Classification.
data('auto')

#++++ With no missing values ++++#
X <- auto[, c('engine_size', 'city_mpg', 'highway_mpg')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)

#++++ With missing values ++++#
X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
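To show how alpha (proportion of good observations) and eta (degree of contamination) can drive outlier detection, the base R sketch below evaluates a single contaminated normal component, assuming the usual form in which bad points share the mean but have covariance inflated by eta. The density form, the 0.5 threshold, and the helper names are assumptions made for illustration, not code taken from the package.

```r
# Multivariate normal density in base R (mahalanobis() returns the squared distance)
dmvnorm_base <- function(x, mu, Sigma) {
  d <- length(mu)
  delta <- mahalanobis(x, center = mu, cov = Sigma)
  exp(-0.5 * delta) / sqrt((2 * pi)^d * det(Sigma))
}

# Hypothetical helper: probability that x is a "good" point within one
# contaminated normal component with parameters mu, Sigma, alpha, eta.
prob_good <- function(x, mu, Sigma, alpha, eta) {
  good <- alpha * dmvnorm_base(x, mu, Sigma)
  bad  <- (1 - alpha) * dmvnorm_base(x, mu, eta * Sigma)
  good / (good + bad)
}

mu    <- c(0, 0)
Sigma <- diag(2)

v_centre <- prob_good(c(0, 0), mu, Sigma, alpha = 0.9, eta = 10)
v_far    <- prob_good(c(6, 6), mu, Sigma, alpha = 0.9, eta = 10)

# Assumed rule: flag a point as an outlier when it is more likely bad than good
c(centre = v_centre < 0.5, far = v_far < 0.5)
```

With eta close to 1 the two densities nearly coincide and the probabilities carry little information, which gives some intuition for why a lower bound such as eta_min is imposed.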
Replace the missing values of a data set with the mean of the observed values of the same variable.
mean_impute(X)
X |
An |
A complete data matrix with missing values imputed accordingly.
Schafer, J. L. and Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2):147–177.
Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 3rd edition.
X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
  NA,  2,  2,
   3, NA,  5,
   4,  3,  2,
  NA, NA,  3,
   7,  2, NA,
  NA,  4,  2
))

mean_impute(X)
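Mean imputation is simple enough to write out in a few lines of base R. The sketch below assumes the mean is taken column by column over the observed entries; the helper name column_mean_impute is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper, not part of MixtureMissing: replace each NA with the
# mean of the observed values in the same column.
column_mean_impute <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    missing <- is.na(X[, j])
    X[missing, j] <- mean(X[, j], na.rm = TRUE)
  }
  X
}

X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
  NA,  2,  2,
   3, NA,  5,
   4,  3,  2,
  NA, NA,  3,
   7,  2, NA,
  NA,  4,  2
))

X_imp <- column_mean_impute(X)
X_imp
```

Mean imputation keeps column means unchanged but shrinks variances and ignores relationships between variables, so it is best viewed as a baseline.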
Carries out model-based clustering using a multivariate generalized hyperbolic mixture (MGHM). The function automatically determines whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).
MGHM(
  X,
  G,
  model = c("GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H", "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
                    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)
X |
An |
G |
An integer vector specifying the numbers of clusters, which must be at least 1. |
model |
A string indicating the mixture model to be fitted; "GH" for generalized hyperbolic by default. See the details section for a list of available distributions. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm: 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector |
clusters |
(optional) A vector of length |
outlier_cutoff |
(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture. |
deriv_ctrl |
(optional) A list containing arguments to control the numerical procedures for calculating the first and second derivatives. Some values are suggested by default. Refer to functions |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
Besides the generalized hyperbolic distribution, the function can fit mixtures via its special and limiting cases. Available distributions include
GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
An object of class MixtureMissing with:
model |
The model used to fit the data set. |
pi |
Mixing proportions. |
mu |
Component location vectors. |
Sigma |
Component dispersion matrices. |
beta |
Component skewness vectors. Only available if |
lambda |
Component index parameters. Only available if |
omega |
Component concentration parameters. Only available if |
df |
Component degrees of freedom. Only available if |
z_tilde |
An |
clusters |
A numeric vector of length |
outliers |
A logical vector of length |
data |
The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations. |
complete |
An |
npar |
The breakdown of the number of parameters to estimate. |
max_iter |
Maximum number of iterations allowed in the EM algorithm. |
iter_stop |
The actual number of iterations needed when fitting the data set. |
final_loglik |
The final value of log-likelihood. |
loglik |
All the values of log-likelihood. |
AIC |
Akaike information criterion. |
BIC |
Bayesian information criterion. |
KIC |
Kullback information criterion. |
KICc |
Corrected Kullback information criterion. |
AIC3 |
Modified AIC. |
CAIC |
Bozdogan's consistent AIC. |
AICc |
Small-sample version of AIC. |
ent |
Entropy. |
ICL |
Integrated Completed Likelihood criterion. |
AWE |
Approximate weight of evidence. |
CLC |
Classification likelihood criterion. |
init_method |
The initialization method used in model fitting. |
Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198.
Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130:18–41.
data('bankruptcy')

#++++ With no missing values ++++#
X <- bankruptcy[, 2:3]
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)

#++++ With missing values ++++#
set.seed(1234)
X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)
plot(mod)
Provide four model-based clustering plots for a MixtureMissing object. The options include (1) pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles; (2) pairwise scatter plots highlighting in red observations whose values are missing but are replaced by expectations obtained in the EM algorithm; (3) a parallel plot of up to the first 10 variables of a multivariate data set; and (4) plots of estimated density in the form of contours. One or more options can be specified. In the latter case, interactive mode will be triggered for the user to choose.
## S3 method for class 'MixtureMissing'
plot(
  x,
  what = c("classification", "missing", "parallel", "density"),
  nlevels = 15,
  drawlabels = TRUE,
  addpoints = TRUE,
  cex.point = 1,
  cex.axis = 1,
  cex.labels = 2,
  lwd = 1,
  col_line = "gray",
  ...
)
x |
A MixtureMissing object. |
what |
A string or a character vector specifying the desired plots. See the details section for a list of available plots. |
nlevels |
Number of contour levels desired; 15 by default. |
drawlabels |
Contour levels are labelled if |
addpoints |
Colored points showing cluster memberships are added if |
cex.point |
A numerical value giving the amount by which data points should be magnified relative to the default. |
cex.axis |
The magnification to be used for axis annotation. |
cex.labels |
A numerical value to control the character size of variable labels. |
lwd |
The contour line width, a positive number, defaulting to 1. |
col_line |
The color of contour; "gray" by default. |
... |
Arguments to be passed to methods, such as graphical parameters. |
The plots that can be retrieved include
If what = "classification" - Pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles.
If what = "missing" - Pairwise scatter plots highlighting in red observations whose values are missing but are replaced by expectations obtained in the EM algorithm.
If what = "parallel" - Parallel plot of up to the first 10 variables of a multivariate data set.
If what = "density" - Plots of estimated density in the form of contours.
No return value, called to visualize the fitted model's results.
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MCNM(X, G = 2, max_iter = 10)
plot(mod, what = 'classification')
Print a MixtureMissing object.
## S3 method for class 'MixtureMissing'
print(x, ...)
x |
A MixtureMissing object. |
... |
Further arguments passed to or from other methods. |
The description includes information on the complete or incomplete data, number of clusters, and component distribution.
No return value, called to print the fitted model's description.
#++++ With no missing values ++++#
X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)

#++++ With missing values ++++#
set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)
Fit mixtures via various distributions and decide the best model based on a given information criterion. The distributions include multivariate contaminated normal, multivariate generalized hyperbolic, special and limiting cases of multivariate generalized hyperbolic.
select_mixture(
  X,
  G,
  model = c("CN", "GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H", "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "manual"),
  clusters = NULL,
  eta_min = 1.001,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
                    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)
X |
An |
G |
The number of clusters, which must be at least 1. If |
model |
A vector of character strings indicating the mixture model(s) to be fitted. See the details section for a list of available distributions. However, all distributions will be considered by default. |
criterion |
A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria. |
max_iter |
(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default. |
epsilon |
(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm: 0.01 by default. |
init_method |
(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", and "manual". When "manual" is chosen, a vector |
clusters |
(optional) A vector of length |
eta_min |
(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. This is only relevant for CN mixture. |
outlier_cutoff |
(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture. |
deriv_ctrl |
(optional) A list containing arguments to control the numerical procedures for calculating the first and second derivatives. Some values are suggested by default. Refer to functions |
progress |
(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default. |
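The epsilon argument above refers to an Aitken-based stopping criterion. As a rough sketch, and not the package's internal code, one common form of this criterion estimates the asymptotic log-likelihood from three consecutive log-likelihood values and stops once that estimate is within epsilon of the current value. The function name aitken_converged and the exact form of the comparison are assumptions made for illustration.

```r
# Sketch of an Aitken-based convergence check (hypothetical helper,
# not part of MixtureMissing). `loglik` is the vector of log-likelihood
# values recorded so far, one per EM iteration.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)            # need three values to form the ratio

  l_prev <- loglik[k - 2]
  l_curr <- loglik[k - 1]
  l_next <- loglik[k]

  denom <- l_curr - l_prev
  if (denom == 0) return(l_next == l_curr)  # log-likelihood has stopped moving

  a <- (l_next - l_curr) / denom      # Aitken acceleration factor
  if (a == 1) return(FALSE)           # asymptotic estimate undefined

  # Asymptotic estimate of the log-likelihood
  l_inf <- l_curr + (l_next - l_curr) / (1 - a)

  gap <- l_inf - l_curr
  is.finite(gap) && gap >= 0 && gap < epsilon
}

# Toy sequence that converges geometrically to -100
ll <- -100 - 10 * 0.5^(1:11)
aitken_converged(ll[1:5])   # early iterations
aitken_converged(ll)        # later iterations
```

A smaller epsilon demands that the estimated limit be closer to the current value, so the EM algorithm typically runs for more iterations before stopping, subject to max_iter.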
The function can fit mixtures via the contaminated normal distribution, generalized hyperbolic distribution, and special and limiting cases of the generalized hyperbolic distribution. Available distributions include
CN - Contaminated Normal
GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic
Available information criteria include
AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
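Several of the criteria listed above are simple penalized functions of the maximized log-likelihood. The sketch below computes a few of them in base R using their textbook "smaller is better" definitions. The sign and scaling convention used inside the package is not stated here, so treat the convention, the function names info_criteria and n_par_normal_mixture, and the input values as assumptions for illustration only.

```r
# Number of free parameters in an unconstrained G-component normal
# mixture in d dimensions: mixing proportions + means + covariances.
# (Hypothetical helper, not part of MixtureMissing.)
n_par_normal_mixture <- function(G, d) {
  (G - 1) + G * d + G * d * (d + 1) / 2
}

# Textbook definitions, smaller is better (assumed convention).
# loglik: maximized log-likelihood, k: free parameters, n: sample size.
# AICc requires n > k + 1.
info_criteria <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k
  c(
    AIC  = aic,
    BIC  = -2 * loglik + k * log(n),
    AIC3 = -2 * loglik + 3 * k,
    CAIC = -2 * loglik + k * (log(n) + 1),
    AICc = aic + 2 * k * (k + 1) / (n - k - 1)
  )
}

# Hypothetical values, chosen only to show the call
k <- n_par_normal_mixture(G = 2, d = 2)
info_criteria(loglik = -250, k = k, n = 100)
```

All five criteria share the same -2 * loglik term and differ only in how strongly they penalize the number of parameters, which is why they can rank candidate models differently.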
A list with
best_mod |
An object of class |
all_mod |
A list of objects of class |
criterion |
A numeric vector containing the values of the chosen information criterion for all models under consideration. The vector is ordered from the best model to the worst. |
Each object of class MixtureMissing
has slots that depend on the fitted model. See
the returned values of MCNM and MGHM.
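As a brief illustration of the returned list, the sketch below fits the same three candidate models used in this page's example and then inspects the best_mod, all_mod, and criterion elements described above. It assumes the MixtureMissing package is installed, and the variable name res is chosen here for illustration.

```r
# Runs only if the package is available
if (requireNamespace("MixtureMissing", quietly = TRUE)) {
  library(MixtureMissing)
  data("bankruptcy")

  X <- bankruptcy[, 2:3]
  res <- select_mixture(X, G = 2, model = c("CN", "GH", "St"),
                        criterion = "BIC", max_iter = 10, progress = FALSE)

  # Criterion values for every candidate, ordered from best to worst
  print(res$criterion)

  # Number of fitted candidate models
  print(length(res$all_mod))

  # Summary of the selected model
  summary(res$best_mod)
}
```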
Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions.
Canadian Journal of Statistics, 43(2):176–198.
Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic
distributions and mixtures of skew-t distributions for model-based clustering
with incomplete data. Computational Statistics & Data Analysis, 130:18–41.
data('bankruptcy')

#++++ With no missing values ++++#

X <- bankruptcy[, 2:3]
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)

#++++ With missing values ++++#

set.seed(1234)

X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)
Summarizes main information regarding a MixtureMissing
object.
## S3 method for class 'MixtureMissing'
summary(object, ...)
object |
A |
... |
Arguments to be passed to methods, such as graphical parameters. |
Information includes the model used to fit the data set, initialization method, clustering table, total outliers, outliers per cluster, mixing proportions, component means and variances, final log-likelihood value, and information criteria.
No return value; called to summarize the fitted model's results.
#++++ With no missing values ++++#

X <- auto[, c('horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)

#++++ With missing values ++++#

X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
The data set contains the 2019 cost of living indices of 50 states in five different categories: grocery, housing, transportation, utilities, and miscellaneous (Washington DC is not included). The indices are calculated by first determining the average cost of living in the United States to be used as a baseline set at 100. States are then measured against this baseline. For example, a state with a cost of living index of 200 is twice as expensive as the national average.
UScost
A data frame with 50 rows and 7 variables. There are no missing values.
State abbreviation.
State name.
Grocery index.
Housing index.
Utilities index.
Transportation index.
Miscellaneous index.
https://worldpopulationreview.com