Package 'MixtureMissing' reference manual

Title:	Robust and Flexible Model-Based Clustering for Data Sets with Missing Values at Random
Description:	Implementations of various robust and flexible model-based clustering methods for data sets with missing values at random. Two main models are: Multivariate Contaminated Normal Mixture (MCNM, Tong and Tortora, 2022, <doi:10.1007/s11634-021-00476-1>) and Multivariate Generalized Hyperbolic Mixture (MGHM, Wei et al., 2019, <doi:10.1016/j.csda.2018.08.016>). Mixtures via some special or limiting cases of the multivariate generalized hyperbolic distribution are also included: Normal-Inverse Gaussian, Symmetric Normal-Inverse Gaussian, Skew-Cauchy, Cauchy, Skew-t, Student's t, Normal, Symmetric Generalized Hyperbolic, Hyperbolic Univariate Marginals, Hyperbolic, and Symmetric Hyperbolic.
Authors:	Hung Tong [aut, cre], Cristina Tortora [aut, ths, dgs]
Maintainer:	Hung Tong <[email protected]>
License:	GPL (>= 2)
Version:	3.0.4
Built:	2025-03-07 03:19:30 UTC
Source:	https://github.com/cran/MixtureMissing

Automobile Data Set

Description

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, and -3 that it is probably pretty safe.

Usage

auto

Format

A data frame with 205 rows and 26 variables. The first 15 variables are continuous, while the last 11 variables are categorical. There are 45 rows with missing values.

normalized_losses: continuous from 65 to 256.
wheel_base: continuous from 86.6 to 120.9.
length: continuous from 141.1 to 208.1.
width: continuous from 60.3 to 72.3.
height: continuous from 47.8 to 59.8.
curb_weight: continuous from 1488 to 4066.
engine_size: continuous from 61 to 326.
bore: continuous from 2.54 to 3.94.
stroke: continuous from 2.07 to 4.17.
compression_ratio: continuous from 7 to 23.
horsepower: continuous from 48 to 288.
peak_rpm: continuous from 4150 to 6600.
city_mpg: continuous from 13 to 49.
highway_mpg: continuous from 16 to 54.
price: continuous from 5118 to 45400.
symboling: -3, -2, -1, 0, 1, 2, 3.
make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
fuel_type: diesel, gas.
aspiration: std, turbo.
num_doors: four, two.
body_style: hardtop, wagon, sedan, hatchback, convertible.
drive_wheels: 4wd, fwd, rwd.
engine_location: front, rear.
engine_type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
num_cylinders: eight, five, four, six, three, twelve, two.
fuel_system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.

Source

Kibler, D., Aha, D.W., & Albert, M. (1989). Instance-based prediction of real-valued attributes. Computational Intelligence, Vol 5, 51–57. https://archive.ics.uci.edu/ml/datasets/automobile

Bankruptcy Data Set

Description

The data set contains the ratio of retained earnings (RE) to total assets and the ratio of earnings before interest and taxes (EBIT) to total assets for 66 American firms. Half of the selected firms had filed for bankruptcy.

Usage

bankruptcy

Format

A data frame with 66 rows and 3 variables:

Y: Status of the firm: 0 for bankruptcy and 1 for financially sound.
RE: Ratio of retained earnings to total assets.
EBIT: Ratio of earnings before interest and taxes to total assets.

Source

Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4): 589-609. https://www.jstor.org/stable/2978933

Binary Classification Evaluation

Description

Evaluate the performance of a classification model by comparing its predicted labels to the true labels. Various metrics are returned to give insight into how well the model classifies the observations. This function is provided to help evaluate outlier detection by MCNM and MtM when the true outliers are known in advance.

Usage

evaluation_metrics(true_labels, pred_labels)

Arguments

`true_labels`	A 0-1 or logical vector denoting the true labels. The meaning of 0 and 1 (or TRUE and FALSE) is up to the user.
`pred_labels`	A 0-1 or logical vector denoting the predicted labels. The meaning of 0 and 1 (or TRUE and FALSE) is up to the user.

Value

A list with the following slots:

`matr`	The confusion matrix built upon true labels and predicted labels.
`TN`	True negative.
`FP`	False positive (type I error).
`FN`	False negative (type II error).
`TP`	True positive.
`TPR`	True positive rate (sensitivity).
`FPR`	False positive rate.
`TNR`	True negative rate (specificity).
`FNR`	False negative rate.
`precision`	Precision or positive predictive value (PPV).
`accuracy`	Accuracy.
`error_rate`	Error rate.
`FDR`	False discovery rate.
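
As a rough illustration of how the slots above relate to one another, the following base R sketch derives the counts and rates from two label vectors. It is not the package's implementation: `confusion_rates` is a hypothetical helper, and it assumes that 1 (or TRUE) marks the positive class.

```r
# Hypothetical helper, not the package's evaluation_metrics().
# Assumption: 1 (or TRUE) is treated as the positive class.
confusion_rates <- function(true_labels, pred_labels) {
  truth <- as.logical(true_labels)
  pred  <- as.logical(pred_labels)

  # Cells of the 2 x 2 confusion matrix
  TP <- sum(truth & pred)
  TN <- sum(!truth & !pred)
  FP <- sum(!truth & pred)
  FN <- sum(truth & !pred)

  list(
    TP = TP, TN = TN, FP = FP, FN = FN,
    TPR = TP / (TP + FN),               # sensitivity
    FPR = FP / (FP + TN),
    TNR = TN / (TN + FP),               # specificity
    FNR = FN / (FN + TP),
    precision  = TP / (TP + FP),
    accuracy   = (TP + TN) / length(truth),
    error_rate = (FP + FN) / length(truth),
    FDR = FP / (FP + TP)
  )
}

res <- confusion_rates(
  true_labels = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  pred_labels = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)
res$TPR
res$precision
```

Under this convention, precision and FDR sum to 1, as do TPR and FNR, which gives a quick sanity check on any output.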

Examples


#++++ Inputs are 0-1 vectors ++++#

evaluation_metrics(
  true_labels = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  pred_labels = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)

#++++ Inputs are logical vectors ++++#

evaluation_metrics(
  true_labels = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  pred_labels = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

Extractor function for MixtureMissing

Description

Extract values from MixtureMissing objects or from outputs of select_mixture.

Usage

extract(
  object,
  what = c("model", "parameters", "cluster", "posterior", "outlier", "missing",
    "imputed", "complete", "information"),
  criterion = c("AIC", "BIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  m_code = NULL
)

Arguments

`object`	A `MixtureMissing` object or an output of select_mixture.
`what`	The specific value to be extracted. See the return section for possible values.
`criterion`	If `what = "information"`, `criterion` is a vector of desired information criteria. All criteria will be extracted by default. Duplicate values in the vector will not be shown again. See the details section for a list of available information criteria.
`m_code`	Only used in the case when `object` is an output of select_mixture. If `m_code = NULL`, extracting will be based on the best model. If `m_code` is one of 'CN', 'GH', 'NIG', 'SNIG', 'SC', 'C', 'St', 't', 'N', 'SGH', 'HUM', 'H', and 'SH', the function will look for this specific model and extract accordingly.

Details

Available information criteria include

AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
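
For orientation, several of these criteria are simple functions of the log-likelihood, the number of free parameters, and the sample size. The base R sketch below uses the textbook "smaller is better" forms; `info_criteria` is a hypothetical helper, and the package may use a different sign or scaling convention. KIC, KICc, ICL, AWE, and CLC are left out of this sketch.

```r
# Hypothetical helper using textbook definitions (smaller is better).
# Assumption: the package's sign or scaling convention may differ.
info_criteria <- function(loglik, npar, n) {
  aic <- -2 * loglik + 2 * npar
  c(
    AIC  = aic,
    BIC  = -2 * loglik + npar * log(n),
    AIC3 = -2 * loglik + 3 * npar,
    CAIC = -2 * loglik + npar * (log(n) + 1),
    AICc = aic + 2 * npar * (npar + 1) / (n - npar - 1)
  )
}

# Illustrative values only, not output from a fitted model
ic <- info_criteria(loglik = -250, npar = 10, n = 150)
ic
```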

Value

One of the following, depending on `what`:

If what = "model" - A data frame showing the component distribution and its abbreviation, number of clusters, and whether the data set is complete or incomplete.
If what = "parameters" - A list containing the relevant parameters.
If what = "cluster" - A numeric vector of length $n$ indicating cluster memberships determined by the model.
If what = "posterior" - An $n$ by $G$ matrix where each row indicates the expected probabilities that the corresponding observation belongs to each cluster.
If what = "outlier" - A logical vector of length $n$ indicating observations that are outliers. Only available if model is CN or t; NULL otherwise with a warning.
If what = "missing" - A data frame showing how many observations (cases) have missing values and the number of missing values per variable.
If what = "imputed" - The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations.
If what = "complete" - An $n$ by $d$ logical matrix indicating which cells have no missing values.
If what = "information" - A data frame showing the number of clusters, final observed log-likelihood value, number of parameters, and desired information criteria.

Examples


#++++ With no missing values ++++#

X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "model")
extract(mod, what = "parameters")
extract(mod, what = "cluster")

#++++ With missing values ++++#

set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
extract(mod, what = "outlier")
extract(mod, what = "missing")
extract(mod, what = "imputed")

Missing-Data Pattern Generation

Description

Generate all possible missing patterns in a multivariate data set. The function can be used to complement the function ampute() from package mice in which a matrix of patterns is needed to allow for general missing-data patterns with missing-data mechanism missing at random (MAR). Using this function, each observation can have more than one missing value.

Usage

generate_patterns(d)

Arguments

`d`	The number of variables or columns of the data set. `d` must be an integer greater than 1.

Details

An observation cannot have all of its values missing, and a complete observation does not qualify as a missing-data pattern. Note that a large value of d may result in a memory allocation error.

Value

A matrix where 0 indicates that a variable should have missing values and 1 indicates that a variable should remain complete. This matrix has d columns and $2^d - 2$ rows.
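
The row count $2^d - 2$ follows from listing all $2^d$ binary vectors of length d and dropping the all-0 row (everything missing) and the all-1 row (nothing missing). A minimal base R sketch of that enumeration is shown here; `all_patterns` is a hypothetical helper, and the row order it produces may differ from that of generate_patterns().

```r
# Hypothetical helper, not the package's generate_patterns().
all_patterns <- function(d) {
  stopifnot(d > 1)
  # Every 0/1 combination across d variables: 2^d rows
  grid <- as.matrix(expand.grid(rep(list(0:1), d)))
  # Drop the all-missing row (sum 0) and the complete row (sum d)
  keep <- rowSums(grid) > 0 & rowSums(grid) < d
  unname(grid[keep, , drop = FALSE])
}

p <- all_patterns(4)
dim(p)
```

Because the number of rows doubles with each additional variable, the memory warning in the details section applies to any enumeration of this kind.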

Examples

generate_patterns(4)

#++++ To use with the function ampute() from package mice ++++#
library(mice)

patterns_matr <- generate_patterns(4)
data_missing <- ampute(iris[1:4], prop = 0.5, patterns = patterns_matr)$amp

Missing Values Generation

Description

A convenience function that randomly introduces missing values into an at-least-bivariate data set. The user can specify either the proportion of observations that contain some missing values or the exact number of observations that contain some missing values. Note that the function does not guarantee that the underlying missing-data mechanism is missing at random (MAR).

Usage

hide_values(X, prop_cases = 0.1, n_cases = NULL)

Arguments

`X`	An $n$ by $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of columns or variables. `X` must have at least 2 rows and 2 columns.
`prop_cases`	(optional) Proportion of observations that contain some missing values. `prop_cases` must be a number in $(0, 1)$. `prop_cases = 0.1` by default, but will be ignored if `n_cases` is specified.
`n_cases`	(optional) Number of observations that contain some missing values. `n_cases` must be an integer ranging from 1 to `nrow(X) - 1`.

Details

If subject to missingness, an observation can have at least 1 and at most ncol(X) - 1 missing values. Depending on the data set, the number of rows with missing values in the resulting matrix is not guaranteed to match the specified proportion.
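
The constraint of between 1 and ncol(X) - 1 missing values per affected row can be sketched in base R as follows. This is only an illustration of the idea, not the algorithm used by hide_values(); `hide_some` is a hypothetical helper that picks rows and columns uniformly at random.

```r
# Hypothetical helper, not the package's hide_values().
hide_some <- function(X, n_cases) {
  X <- as.matrix(X)
  d <- ncol(X)
  stopifnot(d >= 2, n_cases >= 1, n_cases <= nrow(X) - 1)

  rows <- sample.int(nrow(X), n_cases)    # rows that will lose values
  for (i in rows) {
    n_miss <- sample.int(d - 1, 1)        # between 1 and d - 1 cells
    X[i, sample.int(d, n_miss)] <- NA     # never the whole row
  }
  X
}

set.seed(1)
X_miss <- hide_some(iris[1:4], n_cases = 20)
sum(!complete.cases(X_miss))
```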

Value

The original $n$ by $d$ matrix or data frame with missing values.

Examples

set.seed(1234)

hide_values(iris[1:4])
hide_values(iris[1:4], prop_cases = 0.5)
hide_values(iris[1:4], n_cases = 80)

Cluster Initialization using a Heuristic Method

Description

Initialize cluster memberships and component parameters to start the EM algorithm using a heuristic clustering method or user-defined labels.

Usage

initialize_clusters(
  X,
  G,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL
)

Arguments

`X`	An $n$ x $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of columns or variables. Alternately, `X` can be a vector of $n$ observations.
`G`	The number of clusters, which must be at least 1. If `G = 1`, then user-defined `clusters` is ignored.
`init_method`	(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector `clusters` of length $n$ must be specified. When `G = 1` and "kmedoids" clustering is used, the medoid will be returned, not the sample mean.
`clusters`	A numeric vector of length $n$ that specifies the initial cluster memberships of the user when `init_method` is set to "manual". This argument is NULL by default, so that it is ignored whenever other given initialization methods are chosen.

Details

Available heuristic methods include k-medoids clustering, k-means clustering, and hierarchical clustering. Alternately, the user can also enter pre-specified cluster memberships, making other initialization methods possible. If the given data set contains missing values, only observations with complete records will be used to initialize clusters. However, in this case, except when G = 1, the resulting cluster memberships will be set to NULL since they represent those complete records rather than the original data set as a whole.
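
To make the complete-case idea concrete, here is a minimal base R sketch that runs k-means on the complete records only and turns the resulting labels into mixing proportions, mean vectors, and covariance matrices. It is not the package's implementation: `init_from_kmeans` is a hypothetical helper, and storing the covariance matrices as a d x d x G array is an assumption.

```r
# Hypothetical helper, not the package's initialize_clusters().
init_from_kmeans <- function(X, G) {
  X <- as.matrix(X)
  X_cc <- X[complete.cases(X), , drop = FALSE]   # complete records only
  labels <- kmeans(X_cc, centers = G, nstart = 10)$cluster

  d <- ncol(X_cc)
  pi_hat <- as.numeric(table(factor(labels, levels = 1:G))) / nrow(X_cc)
  mu <- matrix(NA_real_, nrow = G, ncol = d)
  Sigma <- array(NA_real_, dim = c(d, d, G))     # assumed layout

  for (g in 1:G) {
    Xg <- X_cc[labels == g, , drop = FALSE]
    mu[g, ] <- colMeans(Xg)
    Sigma[, , g] <- cov(Xg)
  }
  list(pi = pi_hat, mu = mu, Sigma = Sigma)
}

set.seed(1234)
init <- init_from_kmeans(iris[1:4], G = 3)
init$pi
init$mu
```

The labels are deliberately not returned here, mirroring the note above that memberships computed on complete records do not describe the full data set.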

Value

A list with the following slots:

`pi`	Component mixing proportions.
`mu`	A $G$ by $d$ matrix where each row is the component mean vector.
`Sigma`	A $G$-dimensional array where each $d$ by $d$ matrix is the component covariance matrix.
`clusters`	A numeric vector with values from 1 to $G$ indicating initial cluster memberships if `X` is a complete data set; NULL otherwise.

References

Everitt, B., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. John Wiley & Sons.

Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100-108. doi: 10.2307/2346830.

Examples


#++++ Initialization using a heuristic method ++++#

set.seed(1234)

init <- initialize_clusters(iris[1:4], G = 3)
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'kmeans')
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'hierarchical')

#++++ Initialization using user-defined labels ++++#

init <- initialize_clusters(iris[1:4], G = 3, init_method = 'manual',
                            clusters = as.numeric(iris$Species))

#++++ Initial parameters and pairwise scatterplot showing the mapping ++++#

init$pi
init$mu
init$Sigma
init$clusters

pairs(iris[1:4], col = init$clusters, pch = 16)

Multivariate Contaminated Normal Mixture (MCNM)

Description

Carries out model-based clustering using a multivariate contaminated normal mixture (MCNM). The function determines by itself whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).

Usage

MCNM(
  X,
  G,
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL,
  eta_min = 1.001,
  progress = TRUE
)

Arguments

`X`	An $n$ x $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of variables.
`G`	An integer vector specifying the numbers of clusters, which must be at least 1.
`criterion`	A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria.
`max_iter`	(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default.
`epsilon`	(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default.
`init_method`	(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector `clusters` of length $n$ must be specified. If the data set is incomplete, missing values will be first filled based on the mean imputation method.
`clusters`	(optional) A numeric vector of length $n$ that specifies the initial cluster memberships of the user when `init_method` is set to "manual". This argument is NULL by default, so that it is ignored whenever other given initialization methods are chosen.
`eta_min`	(optional) A numeric value close to 1 to the right specifying the minimum value of eta; 1.001 by default.
`progress`	(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default.
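
The Aitken-based stopping criterion governed by `epsilon` can be sketched as follows. This shows one common form of the rule, which estimates the limiting log-likelihood from the last three values and stops once that estimate is within `epsilon` of the previous value. `aitken_converged` is a hypothetical helper, and the exact variant used inside the package is an assumption.

```r
# Hypothetical helper showing one common form of the Aitken criterion.
# Assumption: the package's internal rule may differ in detail.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)              # need three consecutive values
  l_prev <- loglik[k - 2]
  l_curr <- loglik[k - 1]
  l_next <- loglik[k]

  if (l_next == l_curr) return(TRUE)    # log-likelihood stopped changing
  a <- (l_next - l_curr) / (l_curr - l_prev)   # Aitken acceleration
  if (!is.finite(a) || a >= 1) return(FALSE)   # estimate not usable

  # Asymptotic estimate of the log-likelihood
  l_inf <- l_curr + (l_next - l_curr) / (1 - a)
  gap <- l_inf - l_curr
  gap >= 0 && gap < epsilon
}

# Toy log-likelihood sequence converging geometrically to -100
ll <- -100 - 10 * 0.5^(0:12)
aitken_converged(ll[1:6])
aitken_converged(ll)
```

A smaller `epsilon` demands that the sequence be closer to its estimated limit, so it usually costs more iterations, subject to the cap set by `max_iter`.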

Details

Available information criteria include

AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion

Value

An object of class MixtureMissing with:

`model`	The model used to fit the data set.
`pi`	Mixing proportions.
`mu`	Component location vectors.
`Sigma`	Component dispersion matrices.
`alpha`	Component proportions of good observations.
`eta`	Component degrees of contamination.
`z_tilde`	An $n$ by $G$ matrix where each row indicates the expected probabilities that the corresponding observation belongs to each cluster.
`v_tilde`	An $n$ by $G$ matrix where each row indicates the expected probabilities that the corresponding observation is good with respect to each cluster.
`clusters`	A numeric vector of length $n$ indicating cluster memberships determined by the model.
`outliers`	A logical vector of length $n$ indicating observations that are outliers.
`data`	The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations.
`complete`	An $n$ by $d$ logical matrix indicating which cells have no missing values.
`npar`	The breakdown of the number of parameters to estimate.
`max_iter`	Maximum number of iterations allowed in the EM algorithm.
`iter_stop`	The actual number of iterations needed when fitting the data set.
`final_loglik`	The final value of log-likelihood.
`loglik`	All the values of log-likelihood.
`AIC`	Akaike information criterion.
`BIC`	Bayesian information criterion.
`KIC`	Kullback information criterion.
`KICc`	Corrected Kullback information criterion.
`AIC3`	Modified AIC.
`CAIC`	Bozdogan's consistent AIC.
`AICc`	Small-sample version of AIC.
`ent`	Entropy.
`ICL`	Integrated Completed Likelihood criterion.
`AWE`	Approximate weight of evidence.
`CLC`	Classification likelihood criterion.
`init_method`	The initialization method used in model fitting.
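
The roles of `alpha`, `eta`, and `v_tilde` can be illustrated for a single component. In the contaminated normal distribution of Punzo and McNicholas (2016), the density is a two-part mixture of a normal with covariance Sigma (weight alpha, the good observations) and a normal with the inflated covariance eta * Sigma (weight 1 - alpha). The base R sketch below computes the probability that a point is good and flags it when that probability falls below 0.5. The helper names and the 0.5 threshold are assumptions made for illustration, not code taken from the package.

```r
# Hypothetical helpers illustrating a single contaminated normal component.
dmvn <- function(x, mu, Sigma) {
  x <- matrix(x, ncol = length(mu))
  md2 <- mahalanobis(x, center = mu, cov = Sigma)  # squared distances
  exp(-0.5 * md2) / sqrt((2 * pi)^length(mu) * det(Sigma))
}

good_probability <- function(x, mu, Sigma, alpha, eta) {
  good <- alpha * dmvn(x, mu, Sigma)               # typical observations
  bad  <- (1 - alpha) * dmvn(x, mu, eta * Sigma)   # inflated covariance
  good / (good + bad)
}

mu <- c(0, 0)
Sigma <- diag(2)
x <- rbind(c(0, 0), c(6, 6))   # one central point, one far from the mean

v <- good_probability(x, mu, Sigma, alpha = 0.9, eta = 10)
is_outlier <- v < 0.5          # assumed threshold
v
is_outlier
```

Since eta only inflates the covariance, it has to exceed 1 for the second part to describe outliers, which is consistent with `eta_min` being slightly greater than 1.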

References

Punzo, A. and McNicholas, P.D., 2016. Parsimonious mixtures of multivariate contaminated normal distributions. Biometrical Journal, 58(6), pp.1506-1537.

Tong, H. and Tortora, C., 2022. Model-based clustering and outlier detection with missing data. Advances in Data Analysis and Classification.

Examples


data('auto')

#++++ With no missing values ++++#

X <- auto[, c('engine_size', 'city_mpg', 'highway_mpg')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)

#++++ With missing values ++++#

X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)

Mean Imputation

Description

Replace the missing values of a data set with the mean of the other observed values.

Usage

mean_impute(X)

Arguments

`X`	An $n$ x $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of columns or variables. Alternately, `X` can be a vector of $n$ observations.

Value

A complete data matrix with missing values imputed accordingly.
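
A minimal base R sketch of mean imputation is given below. It assumes that "other observed values" means the observed values in the same column; `col_mean_impute` is a hypothetical helper and not the package's code.

```r
# Hypothetical helper, not the package's mean_impute().
# Assumption: each NA is replaced by the mean of its own column.
col_mean_impute <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[!miss, j])   # mean of the observed entries
  }
  X
}

X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
  NA,  2,  2,
   3, NA,  5,
   4,  3,  2,
  NA, NA,  3,
   7,  2, NA,
  NA,  4,  2
))

X_imp <- col_mean_impute(X)
X_imp
```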

References

Schafer, J. L. and Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2):147–177.

Little, R. J. A. and Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 3rd edition.

Examples


X <- matrix(nrow = 6, ncol = 3, byrow = TRUE, c(
  NA,  2,  2,
   3, NA,  5,
   4,  3,  2,
  NA, NA,  3,
   7,  2, NA,
  NA,  4,  2
))


mean_impute(X)

Multivariate Generalized Hyperbolic Mixture (MGHM)

Description

Carries out model-based clustering using a multivariate generalized hyperbolic mixture (MGHM). The function determines by itself whether the data set is complete or incomplete and fits the appropriate model accordingly. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).

Usage

MGHM(
  X,
  G,
  model = c("GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H", "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)

Arguments

`X`	An $n$ x $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of variables.
`G`	An integer vector specifying the numbers of clusters, which must be at least 1.
`model`	A string indicating the mixture model to be fitted; "GH" for generalized hyperbolic by default. See the details section for a list of available distributions.
`criterion`	A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria.
`max_iter`	(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default.
`epsilon`	(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default.
`init_method`	(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector `clusters` of length $n$ must be specified. If the data set is incomplete, missing values will be first filled based on the mean imputation method.
`clusters`	(optional) A vector of length $n$ that specifies the initial cluster memberships of the user when `init_method` is set to "manual". Both numeric and character vectors are acceptable. This argument is NULL by default, so that it is ignored whenever other given initialization methods are chosen.
`outlier_cutoff`	(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture.
`deriv_ctrl`	(optional) A list containing arguments to control the numerical procedures for calculating the first and second derivatives. Some values are suggested by default. Refer to functions `grad` and `hessian` under the package `numDeriv` for more information.
`progress`	(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default.

Details

Besides the generalized hyperbolic distribution, the function can fit mixtures via its special and limiting cases. Available distributions include

GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic

Available information criteria include

AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion

Value

An object of class MixtureMissing with:

`model`	The model used to fit the data set.
`pi`	Mixing proportions.
`mu`	Component location vectors.
`Sigma`	Component dispersion matrices.
`beta`	Component skewness vectors. Only available if `model` is GH, NIG, SNIG, SC, SGH, HUM, H, or SH; NULL otherwise.
`lambda`	Component index parameters. Only available if `model` is GH, NIG, SNIG, SGH, HUM, H, or SH; NULL otherwise.
`omega`	Component concentration parameters. Only available if `model` is GH, NIG, SNIG, SGH, HUM, H, or SH; NULL otherwise.
`df`	Component degrees of freedom. Only available if `model` is St or t; NULL otherwise.
`z_tilde`	An $n$ by $G$ matrix where each row indicates the expected probabilities that the corresponding observation belongs to each cluster.
`clusters`	A numeric vector of length $n$ indicating cluster memberships determined by the model.
`outliers`	A logical vector of length $n$ indicating observations that are outliers. Only available if `model` is t; NULL otherwise.
`data`	The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations.
`complete`	An $n$ by $d$ logical matrix indicating which cells have no missing values.
`npar`	The breakdown of the number of parameters to estimate.
`max_iter`	Maximum number of iterations allowed in the EM algorithm.
`iter_stop`	The actual number of iterations needed when fitting the data set.
`final_loglik`	The final value of log-likelihood.
`loglik`	All the values of log-likelihood.
`AIC`	Akaike information criterion.
`BIC`	Bayesian information criterion.
`KIC`	Kullback information criterion.
`KICc`	Corrected Kullback information criterion.
`AIC3`	Modified AIC.
`CAIC`	Bozdogan's consistent AIC.
`AICc`	Small-sample version of AIC.
`ent`	Entropy.
`ICL`	Integrated Completed Likelihood criterion.
`AWE`	Approximate weight of evidence.
`CLC`	Classification likelihood criterion.
`init_method`	The initialization method used in model fitting.

References

Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198.

Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130:18–41.

Examples


data('bankruptcy')

#++++ With no missing values ++++#

X <- bankruptcy[, 2:3]
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)

#++++ With missing values ++++#

set.seed(1234)

X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)

MixtureMissing Plotting

Description

Provide four model-based clustering plots for a MixtureMissing object. The options include (1) pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles; (2) pairwise scatter plots highlighting in red observations whose values are missing but are replaced by expectations obtained in the EM algorithm; (3) parallel plot of up to the first 10 variables of a multivariate data set; and (4) plots of estimated density in the form of contours. A single or multiple options can be specified. In the latter case, interactive mode will be triggered for the user to choose.

Usage

## S3 method for class 'MixtureMissing'
plot(
  x,
  what = c("classification", "missing", "parallel", "density"),
  nlevels = 15,
  drawlabels = TRUE,
  addpoints = TRUE,
  cex.point = 1,
  cex.axis = 1,
  cex.labels = 2,
  lwd = 1,
  col_line = "gray",
  ...
)

Arguments

`x`	A `MixtureMissing` object or an output of `select_mixture`. In the latter case, only the best model is considered.
`what`	A string or a character vector specifying the desired plots. See the details section for a list of available plots.
`nlevels`	Number of contour levels desired; 15 by default.
`drawlabels`	Contour levels are labelled if `TRUE`.
`addpoints`	Colored points showing cluster memberships are added if `TRUE`.
`cex.point`	A numerical value giving the amount by which data points should be magnified relative to the default.
`cex.axis`	The magnification to be used for axis annotation.
`cex.labels`	A numerical value to control the character size of variable labels.
`lwd`	The contour line width, a positive number, defaulting to 1.
`col_line`	The color of contour; "gray" by default.
`...`	Arguments to be passed to methods, such as graphical parameters.

Details

The plots that can be retrieved include

If what = "classification" - Pairwise scatter plots showing cluster memberships and highlighting outliers denoted by triangles.
If what = "missing" - Pairwise scatter plots highlighting in red observations whose values are missing but are replaced by expectations obtained in the EM algorithm.
If what = "parallel" - Parallel plot of up to the first 10 variables of a multivariate data set.
If what = "density" - Plots of estimated density in the form of contours.
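
Because specifying several options triggers interactive mode, a script may prefer to request one plot per call. The following is a minimal sketch under that assumption; it reuses only functions and arguments documented in this manual, runs only if the package is installed, and the loop variable `w` is an illustrative name.

```r
# Sketch: draw two of the documented plot types one at a time,
# so that each call passes a single value of `what`.
if (requireNamespace("MixtureMissing", quietly = TRUE)) {
  library(MixtureMissing)

  set.seed(123)
  X <- hide_values(iris[, 1:4], n_cases = 20)   # introduce missing values
  mod <- MCNM(X, G = 2, max_iter = 10)

  for (w in c("classification", "missing")) {
    plot(mod, what = w)                          # one plot type per call
  }
}
```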

Value

No return value, called to visualize the fitted model's results.

Examples


set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MCNM(X, G = 2, max_iter = 10)
plot(mod, what = 'classification')

Print for MixtureMissing

Description

Prints a MixtureMissing object.

Usage

## S3 method for class 'MixtureMissing'
print(x, ...)

Arguments

`x`	A `MixtureMissing` object or an output of `select_mixture`. In the latter case, only the best model is considered.
`...`	Further arguments passed to or from other methods.

Details

The description includes information on the complete or incomplete data, number of clusters, and component distribution.

Value

No return value, called to print the fitted model's description.

Examples


#++++ With no missing values ++++#

X <- iris[, 1:4]
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)

#++++ With missing values ++++#

set.seed(123)
X <- hide_values(iris[, 1:4], n_cases = 20)
mod <- MGHM(X, G = 2, model = 'GH', max_iter = 10)
print(mod)

Mixture Model Selection

Description

Fit mixtures via various distributions and select the best model based on a given information criterion. The distributions include the multivariate contaminated normal, the multivariate generalized hyperbolic, and special and limiting cases of the multivariate generalized hyperbolic.

Usage

select_mixture(
  X,
  G,
  model = c("CN", "GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H",
    "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "manual"),
  clusters = NULL,
  eta_min = 1.001,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)

Arguments

`X`	An $n$ x $d$ matrix or data frame where $n$ is the number of observations and $d$ is the number of variables.
`G`	The number of clusters, which must be at least 1. If `G = 1`, then both `init_method` and `clusters` are ignored.
`model`	A vector of character strings indicating the mixture model(s) to be fitted. See the details section for a list of available distributions. By default, all distributions are considered.
`criterion`	A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria.
`max_iter`	(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default.
`epsilon`	(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default.
`init_method`	(optional) A string specifying the method to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", and "manual". When "manual" is chosen, a vector `clusters` of length $n$ must be specified. If the data set is incomplete, missing values are first filled in by mean imputation.
`clusters`	(optional) A vector of length $n$ giving the initial cluster memberships supplied by the user when `init_method` is set to "manual". Both numeric and character vectors are acceptable. This argument is NULL by default and is ignored whenever another initialization method is chosen.
`eta_min`	(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. This is only relevant for the CN mixture.
`outlier_cutoff`	(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for t mixture.
`deriv_ctrl`	(optional) A list containing arguments to control the numerical procedures for calculating the first and second derivatives. Some values are suggested by default. Refer to functions `grad` and `hessian` under the package `numDeriv` for more information.
`progress`	(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default.
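
To illustrate what an Aitken-based stopping rule controlled by `epsilon` can look like, here is a minimal base-R sketch of one common variant. It is not taken from the package source: the function name `aitken_converged`, the log-likelihood trace, and the exact form of the rule (comparing the Aitken estimate of the limiting log-likelihood with the previous value) are all assumptions made for illustration.

```r
# Sketch of an Aitken-accelerated convergence check (one common variant).
# `loglik` is the vector of log-likelihood values observed so far.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)                   # three consecutive values are needed

  d_prev <- loglik[k - 1] - loglik[k - 2]    # previous increase
  d_curr <- loglik[k] - loglik[k - 1]        # latest increase
  if (d_prev == 0) return(d_curr == 0)       # flat trace: nothing left to gain

  a <- d_curr / d_prev                       # Aitken acceleration factor
  l_inf <- loglik[k - 1] + d_curr / (1 - a)  # estimated limiting log-likelihood
  gap <- l_inf - loglik[k - 1]

  is.finite(gap) && gap >= 0 && gap < epsilon
}

# Hypothetical trace that increases geometrically towards -200.
trace <- -200 - 50 * 0.4^(0:15)

aitken_converged(trace[1:4])   # early iterations: estimated limit still far away
aitken_converged(trace)        # late iterations: estimated gap is below epsilon
```

A smaller `epsilon` makes such a rule stricter, so the EM algorithm tends to run for more iterations before stopping (subject to `max_iter`).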

Details

The function can fit mixtures via the contaminated normal distribution, generalized hyperbolic distribution, and special and limiting cases of the generalized hyperbolic distribution. Available distributions include

CN - Contaminated Normal
GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic

Available information criteria include

AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion
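
For orientation, several of these criteria can be computed from the maximized log-likelihood, the number of free parameters, and the sample size. The sketch below uses the textbook "smaller is better" forms; the package's own sign convention and scaling are not stated here and may differ, and the function name `info_criteria` and the input values are hypothetical.

```r
# Textbook forms of a few information criteria (smaller is better).
# loglik: maximized log-likelihood; k: number of free parameters; n: sample size.
info_criteria <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k
  c(
    AIC  = aic,
    BIC  = -2 * loglik + k * log(n),
    AIC3 = -2 * loglik + 3 * k,
    CAIC = -2 * loglik + k * (log(n) + 1),
    AICc = aic + 2 * k * (k + 1) / (n - k - 1)
  )
}

# Made-up inputs, not the result of any fitted model.
ic <- info_criteria(loglik = -120.5, k = 11, n = 66)
print(ic)
```

These criteria differ only in how heavily they penalize the number of parameters, which is why they can rank the same set of fitted mixtures differently.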

Value

A list with

`best_mod`	An object of class `MixtureMissing` corresponding to the best model.
`all_mod`	A list of objects of class `MixtureMissing` corresponding to all models of consideration. The list is in the order of `model`.
`criterion`	A numeric vector containing the chosen information criterion values of all models of consideration. The vector is in the order of best-to-worst models.

Each object of class MixtureMissing has slots depending on the fitted model. See the returned value of MCNM and MGHM.
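
The returned list might be inspected as in the following minimal sketch, which assumes the package is installed and reuses the call from the examples of this function; the object name `sel` is illustrative.

```r
# Sketch: inspect the three documented components of the returned list.
if (requireNamespace("MixtureMissing", quietly = TRUE)) {
  library(MixtureMissing)
  data("bankruptcy")
  X <- bankruptcy[, 2:3]

  sel <- select_mixture(X, G = 2, model = c("CN", "GH", "St"),
                        criterion = "BIC", max_iter = 10, progress = FALSE)

  print(sel$criterion)         # criterion values, ordered from best to worst
  print(length(sel$all_mod))   # number of fitted objects kept in `all_mod`
  summary(sel$best_mod)        # details of the selected model
}
```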

Examples


data('bankruptcy')

#++++ With no missing values ++++#

X <- bankruptcy[, 2:3]
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)

#++++ With missing values ++++#

set.seed(1234)

X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)

Summary for MixtureMissing

Description

Summarizes the main information regarding a MixtureMissing object.

Usage

## S3 method for class 'MixtureMissing'
summary(object, ...)

Arguments

`object`	A `MixtureMissing` object or an output of `select_mixture`. In the latter case, only the best model is considered.
`...`	Arguments to be passed to methods, such as graphical parameters.

Details

Information includes the model used to fit the data set, the initialization method, the clustering table, total outliers, outliers per cluster, mixing proportions, component means and variances, the final log-likelihood value, and information criteria.

Value

No return value, called to summarize the fitted model's results.

Examples


#++++ With no missing values ++++#

X <- auto[, c('horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)

#++++ With missing values ++++#

X <- auto[, c('normalized_losses', 'horsepower', 'highway_mpg', 'price')]
mod <- MCNM(X, G = 2, init_method = 'kmedoids', max_iter = 10)
summary(mod)

US Cost of Living Indices in 2019 Data Set

Description

The data set contains the 2019 cost of living indices of 50 states in five different categories: grocery, housing, transportation, utilities, and miscellaneous (Washington DC is not included). The indices are calculated by first determining the average cost of living in the United States to be used as a baseline set at 100. States are then measured against this baseline. For example, a state with a cost of living index of 200 is twice as expensive as the national average.
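
The baseline arithmetic described above can be sketched as follows. The costs and the names `StateA` and `StateB` are made up for illustration and are not taken from `UScost`.

```r
# Index relative to a national average set at 100 (hypothetical numbers).
national_avg <- 1500
state_cost <- c(StateA = 3000, StateB = 1350)

index <- 100 * state_cost / national_avg
print(index)   # StateA = 200 (twice the average), StateB = 90 (10% below it)
```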

Usage

UScost

Format

A data frame with 50 rows and 7 variables. There are no missing values.

Abbr: State abbreviation.
State: State name.
Grocery: Grocery index.
Housing: Housing index.
Utilities: Utilities index.
Transportation: Transportation index.
Misc: Miscellaneous index.

Source

https://worldpopulationreview.com
