# The BetaML.Clustering Module

`BetaML.Clustering` — Module

Clustering module (WIP)

Provides clustering methods and missing-value imputation / collaborative filtering / recommendation systems using clustering methods as backend.

The module provides the following functions. Use `?[function]` to access their full signature and detailed documentation:

- `initRepresentatives(X,K;initStrategy,Z₀)`: initialisation strategies for k-means and k-medoids
- `kmeans(X,K;dist,initStrategy,Z₀)`: classical k-means algorithm
- `kmedoids(X,K;dist,initStrategy,Z₀)`: k-medoids algorithm
- `gmm(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)`: EM algorithm over a Gaussian Mixture Model
- `predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)`: impute missing values ("matrix completion") using gmm as backbone. Note that this can be used for collaborative filtering / recommendation systems, often with better results than traditional algorithms such as k-nearest neighbors (KNN)

{Spherical|Diagonal|Full}Gaussian mixtures for `gmm` / `predictMissing` are already provided. User-defined mixtures can be used by defining a struct as a subtype of `Mixture` and implementing for that mixture the following functions:

- `initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)`
- `lpdf(m,x,mask)` (for the e-step)
- `updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance)` (the m-step)
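As an illustration of the required interface, here is a hypothetical user-defined mixture. Everything in this sketch is invented for illustration: the `UniformBox` type, its fields, and the stand-in `Mixture` abstract type (re-declared here only so the snippet is self-contained; in real use you would subtype BetaML's own `Mixture`).

```julia
abstract type Mixture end  # stand-in for BetaML.Clustering's own abstract type

# Hypothetical mixture: a uniform density over an axis-aligned box
mutable struct UniformBox <: Mixture
    lb::Union{Vector{Float64},Nothing}  # lower bound per dimension
    ub::Union{Vector{Float64},Nothing}  # upper bound per dimension
end
UniformBox() = UniformBox(nothing, nothing)

# e-step contribution: log PDF of observation `x` restricted to the
# dimensions flagged in the boolean `mask` (the non-missing dimensions)
function lpdf(m::UniformBox, x, mask)
    lbs, ubs = m.lb[mask], m.ub[mask]
    all(lbs .<= x[mask] .<= ubs) ? -sum(log.(ubs .- lbs)) : -Inf
end

lpdf(UniformBox([0.0, 0.0], [2.0, 4.0]), [1.0, 1.0], [true, true])  # -(log(2)+log(4))
```

A complete mixture would also implement `initMixtures!` and `updateParameters!` with the signatures listed above.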

## Module Index

- `BetaML.Clustering.gmm`
- `BetaML.Clustering.initMixtures!`
- `BetaML.Clustering.initRepresentatives`
- `BetaML.Clustering.kmeans`
- `BetaML.Clustering.kmedoids`
- `BetaML.Clustering.lpdf` (three methods, one per Gaussian mixture type)
- `BetaML.Clustering.predictMissing`

## Detailed API

`BetaML.Clustering.gmm` — Method

`gmm(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)`

Compute the Expectation-Maximisation algorithm to identify K clusters of X data, i.e. employ a Generative Mixture Model as the underlying probabilistic model.

X can contain missing values in some or all of its dimensions. In such cases the learning is done only with the available data. Implemented in the log-domain for better numerical accuracy with many dimensions.

**Parameters:**

- `X`: a (n x d) matrix of data to clusterise
- `K`: number of clusters wanted
- `p₀`: initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: an array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: a verbosity parameter regulating the information messages frequency [def: `STD`]
- `minVariance`: minimum variance for the mixtures [default: 0.05]
- `minCovariance`: minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different from minVariance (see notes).
- `initStrategy`: mixture initialisation algorithm [def: `kmeans`]
- `maxIter`: maximum number of iterations [def: `-1`, i.e. ∞]
- `rng`: random number generator (see `FIXEDSEED`) [default: `Random.GLOBAL_RNG`]

**Returns:**

A named tuple of:

- `pₙₖ`: matrix of size (N x K) of the probabilities of each point i to belong to cluster j
- `pₖ`: probabilities of the categorical distribution (K x 1)
- `mixtures`: vector (K x 1) of the estimated underlying distributions
- `ϵ`: vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
- `lL`: the log-likelihood (without considering the last mixture optimisation)
- `BIC`: the Bayesian Information Criterion (lower is better)
- `AIC`: the Akaike Information Criterion (lower is better)

**Notes:**

- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`.
- Reasonable choices for minVariance/minCovariance depend on the mixture. For example 0.25 seems a reasonable value for the SphericalGaussian, 0.05 seems better for the DiagonalGaussian, and FullGaussian seems to prefer either very low values of variance/covariance (e.g. `(0.05,0.05)`) or very big but similar ones (e.g. `(100,100)`).
- For `initStrategy`, look at the documentation of `initMixtures!` for the mixture you want. The provided Gaussian mixtures support `grid`, `kmeans` or `given`. `grid` is faster (especially if X contains missing values), but `kmeans` often provides better results.

**Example:**

`julia> clusters = gmm([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,verbosity=HIGH)`
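The returned named tuple can be used to derive hard cluster assignments from the soft membership probabilities `pₙₖ`. A minimal sketch (assumes the BetaML package is installed):

```julia
using BetaML.Clustering  # assumes the BetaML package is installed

X = [1 10.5; 1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4]
out = gmm(X, 3)

# Turn the (N x K) soft membership probabilities into hard assignments
assignments = [argmax(out.pₙₖ[n, :]) for n in 1:size(X, 1)]
```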

`BetaML.Clustering.initMixtures!` — Method

`initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0, initStrategy="grid", rng=Random.GLOBAL_RNG)`

The parameter `initStrategy` can be `grid`, `kmeans` or `given`:

- `grid`: uniformly cover the space observed by the data
- `kmeans`: use the kmeans algorithm. If the data contains missing values, a first run of `predictMissing` is done under init=`grid` to impute the missing values just to allow the kmeans algorithm. Then the EM algorithm is used with the output of kmeans as init values.
- `given`: leave the provided set of initial mixtures
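A minimal sketch of calling `initMixtures!` directly, using the `DiagonalGaussian` mixtures mentioned above (assumes the BetaML package is installed):

```julia
using BetaML.Clustering  # assumes the BetaML package is installed

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38]
mixtures = [DiagonalGaussian() for i in 1:2]

# Initialise the two mixtures over the space observed by X
initMixtures!(mixtures, X, initStrategy="grid")
```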

`BetaML.Clustering.initRepresentatives` — Method

`initRepresentatives(X,K;initStrategy,Z₀)`

Initialise the representatives for a K-means or K-medoids algorithm

**Parameters:**

- `X`: a (N x D) matrix of data to clusterise
- `K`: number of clusters wanted
- `initStrategy`: how to select the initial representative vectors:
    - `random`: randomly in the X space
    - `grid`: using a grid approach [default]
    - `shuffle`: selecting randomly within the available points
    - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
- `rng`: random number generator (see `FIXEDSEED`) [default: `Random.GLOBAL_RNG`]

**Returns:**

- A (K x D) matrix of initial representatives

**Example:**

`julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])`

`BetaML.Clustering.kmeans` — Method

`kmeans(X,K;dist,initStrategy,Z₀)`

Compute the K-means algorithm to identify K clusters of X using Euclidean distance

**Parameters:**

- `X`: a (N x D) matrix of data to clusterise
- `K`: number of clusters wanted
- `dist`: function to employ as distance (see notes). Defaults to Euclidean distance.
- `initStrategy`: how to select the initial representative vectors:
    - `random`: randomly in the X space
    - `grid`: using a grid approach [default]
    - `shuffle`: selecting randomly within the available points
    - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
- `rng`: random number generator (see `FIXEDSEED`) [default: `Random.GLOBAL_RNG`]

**Returns:**

- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

**Notes:**

- Some returned clusters could be empty
- The `dist` parameter can be:
    - any user-defined function accepting two vectors and returning a scalar
    - an anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
    - one of the predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`

**Example:**

`julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)`
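Following the notes above, a custom distance can be passed as an anonymous function. A minimal sketch using the squared-Euclidean distance (assumes the BetaML package is installed):

```julia
using BetaML.Clustering  # assumes the BetaML package is installed
using LinearAlgebra      # for norm

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]

# Squared Euclidean distance passed as an anonymous function
(clIdx, Z) = kmeans(X, 3, dist = (x, y) -> norm(x - y)^2)
```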

`BetaML.Clustering.kmedoids` — Method

`kmedoids(X,K;dist,initStrategy,Z₀)`

Compute the K-medoids algorithm to identify K clusters of X using the distance definition `dist`

**Parameters:**

- `X`: a (N x D) matrix of data to clusterise
- `K`: number of clusters wanted
- `dist`: function to employ as distance (see notes). Defaults to Euclidean distance.
- `initStrategy`: how to select the initial representative vectors:
    - `random`: randomly in the X space
    - `grid`: using a grid approach
    - `shuffle`: selecting randomly within the available points [default]
    - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
- `rng`: random number generator (see `FIXEDSEED`) [default: `Random.GLOBAL_RNG`]

**Returns:**

- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

**Notes:**

- Some returned clusters could be empty
- The `dist` parameter can be:
    - any user-defined function accepting two vectors and returning a scalar
    - an anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
    - one of the predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`

**Example:**

`julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")`

`BetaML.Clustering.lpdf` — Method

`lpdf(m::DiagonalGaussian,x,mask)` - Log PDF of the mixture given the observation `x`

`BetaML.Clustering.lpdf` — Method

`lpdf(m::FullGaussian,x,mask)` - Log PDF of the mixture given the observation `x`

`BetaML.Clustering.lpdf` — Method

`lpdf(m::SphericalGaussian,x,mask)` - Log PDF of the mixture given the observation `x`

`BetaML.Clustering.predictMissing` — Function

`predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)`

Fill missing entries in a sparse matrix (i.e. perform "matrix completion") assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing an Expectation-Maximisation algorithm.

While the name of the function is `predictMissing`, the function can also be used for recommendation systems / collaborative filtering and GMM-based regressions. The advantage over traditional algorithms such as k-nearest neighbors (KNN) is that GMM can "detect" the hidden structure of the observed data, where some observations can be similar to a certain pool of other observations for a certain characteristic, but similar to another pool of observations for other characteristics.

Implemented in the log-domain for better numerical accuracy with many dimensions.

**Parameters:**

- `X`: a (N x D) sparse matrix of data to fill according to a GMM model
- `K`: number of mixtures (latent classes) to consider [def: 3]
- `p₀`: initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: an array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: a verbosity parameter regulating the information messages frequency [def: `STD`]
- `minVariance`: minimum variance for the mixtures [default: 0.05]
- `minCovariance`: minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different from minVariance (see notes).
- `initStrategy`: mixture initialisation algorithm [def: `grid`]
- `maxIter`: maximum number of iterations [def: `-1`, i.e. ∞]
- `rng`: random number generator (see `FIXEDSEED`) [default: `Random.GLOBAL_RNG`]

**Returns:**

A named tuple of:

- `X̂`: the filled matrix of size (N x D)
- `nFill`: the number of items filled
- `lL`: the log-likelihood (without considering the last mixture optimisation)
- `BIC`: the Bayesian Information Criterion (lower is better)
- `AIC`: the Akaike Information Criterion (lower is better)

**Notes:**

- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`.
- For `initStrategy`, look at the documentation of `initMixtures!` for the mixture you want. The provided Gaussian mixtures support `grid`, `kmeans` or `given`. `grid` is faster, but `kmeans` often provides better results.
- The algorithm requires specifying a number of "latent classes" (mixtures) to divide the dataset into. If there isn't any prior domain-specific knowledge on this point, one can test several `K` and verify which one minimises the `BIC` or `AIC` criteria.

**Example:**

`julia> cFOut = predictMissing([1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4],3)`
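Per the note on choosing the number of latent classes, a BIC-based selection could be sketched as follows (assumes the BetaML package is installed and that `BIC` is returned as a scalar, per the Returns section above):

```julia
using BetaML.Clustering  # assumes the BetaML package is installed

X = [1 10.5; 1.5 missing; 1.8 8; 1.7 15; 3.2 40;
     missing missing; 3.3 38; missing -2.3; 5.2 -2.4]

# Try several K and keep the one with the lowest BIC
candidates = 2:4
bics  = [predictMissing(X, k).BIC for k in candidates]
bestK = candidates[argmin(bics)]

X̂ = predictMissing(X, bestK).X̂  # the completed matrix
```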

`MLJModelInterface.predict` — Method

`predict(m::KMeans, fitResults, X)` - Given a trained clustering model and some observations, predict the class of the observations

`MLJModelInterface.transform` — Method

`transform(m::MissingImputator, fitResults, X)` - Given a trained imputator model, fill the missing data of some new observations

`MLJModelInterface.transform` — Method

`transform(m::KMeans, fitResults, X)` - Given a trained clustering model and some observations, return the distances to each centroid