The BetaML.Clustering Module

BetaML.ClusteringModule
Clustering module (WIP)

Provide clustering methods and missing values imputation / collaborative filtering / reccomendation systems using clustering methods as backend.

The module provides the following functions. Use ?[function] to access their full signature and detailed documentation:

{Spherical|Diagonal|Full}Gaussian mixtures for gmm / predictMissing are already provided. User defined mixtures can be used defining a struct as subtype of Mixture and implementing for that mixture the following functions:

• initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)
• lpdf(m,x,mask) (for the e-step)
• updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance) (the m-step)
source

Detailed API

BetaML.Clustering.gmmMethod

gmm(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)

Compute Expectation-Maximisation algorithm to identify K clusters of X data, i.e. employ a Generative Mixture Model as the underlying probabilistic model.

X can contain missing values in some or all of its dimensions. In such case the learning is done only with the available data. Implemented in the log-domain for better numerical accuracy with many dimensions.

Parameters:

• X : A (n x d) data to clusterise
• K : Number of cluster wanted
• p₀ : Initial probabilities of the categorical distribution (K x 1) [default: nothing]
• mixtures: An array (of length K) of the mixture to employ (see notes) [def: [DiagonalGaussian() for i in 1:K]]
• tol: Tolerance to stop the algorithm [default: 10^(-6)]
• verbosity: A verbosity parameter regulating the information messages frequency [def: STD]
• minVariance: Minimum variance for the mixtures [default: 0.05]
• minCovariance: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different than minVariance (see notes).
• initStrategy: Mixture initialisation algorithm [def: kmeans]
• maxIter: Maximum number of iterations [def: -1, i.e. ∞]
• rng: Random Number Generator (see FIXEDSEED) [deafult: Random.GLOBAL_RNG]

Returns:

• A named touple of:
• pₙₖ: Matrix of size (N x K) of the probabilities of each point i to belong to cluster j
• pₖ: Probabilities of the categorical distribution (K x 1)
• mixtures: Vector (K x 1) of the estimated underlying distributions
• ϵ: Vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
• lL: The log-likelihood (without considering the last mixture optimisation)
• BIC: The Bayesian Information Criterion (lower is better)
• AIC: The Akaike Information Criterion (lower is better)

Notes:

• The mixtures currently implemented are SphericalGaussian(μ,σ²),DiagonalGaussian(μ,σ²) and FullGaussian(μ,σ²)
• Reasonable choices for the minVariance/Covariance depends on the mixture. For example 0.25 seems a reasonable value for the SphericalGaussian, 0.05 seems better for the DiagonalGaussian, and FullGaussian seems to prefer either very low values of variance/covariance (e.g. (0.05,0.05) ) or very big but similar ones (e.g. (100,100) ).
• For initStrategy, look at the documentation of initMixtures! for the mixture you want. The provided gaussian mixtures support grid, kmeans or given. grid is faster (expecially if X contains missing values), but kmeans often provides better results.

Resources:

Example:

julia> clusters = gmm([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,verbosity=HIGH)
source
BetaML.Clustering.initMixtures!Method
initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0, initStrategy="grid",rng=Random.GLOBAL_RNG)

The parameter initStrategy can be grid, kmeans or given:

• grid: Uniformly cover the space observed by the data
• kmeans: Use the kmeans algorithm. If the data contains missing values, a first run of predictMissing is done under init=grid to impute the missing values just to allow the kmeans algorithm. Then the em algorithm is used with the output of kmean as init values.
• given: Leave the provided set of initial mixtures
source
BetaML.Clustering.initRepresentativesMethod

initRepresentatives(X,K;initStrategy,Z₀)

Initialisate the representatives for a K-Mean or K-Medoids algorithm

Parameters:

• X: a (N x D) data to clusterise
• K: Number of cluster wonted
• initStrategy: Whether to select the initial representative vectors:
• random: randomly in the X space
• grid: using a grid approach [default]
• shuffle: selecting randomly within the available points
• given: using a provided set of initial representatives provided in the Z₀ parameter
• Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
• rng: Random Number Generator (see FIXEDSEED) [deafult: Random.GLOBAL_RNG]

Returns:

• A (K x D) matrix of initial representatives

Example:

julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])
source
BetaML.Clustering.kmeansMethod

kmeans(X,K;dist,initStrategy,Z₀)

Compute K-Mean algorithm to identify K clusters of X using Euclidean distance

Parameters:

• X: a (N x D) data to clusterise
• K: Number of cluster wonted
• dist: Function to employ as distance (see notes). Default to Euclidean distance.
• initStrategy: Whether to select the initial representative vectors:
• random: randomly in the X space
• grid: using a grid approach [default]
• shuffle: selecting randomly within the available points
• given: using a provided set of initial representatives provided in the Z₀ parameter
• Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
• rng: Random Number Generator (see FIXEDSEED) [deafult: Random.GLOBAL_RNG]

Returns:

• A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

Notes:

• Some returned clusters could be empty
• The dist parameter can be:
• Any user defined function accepting two vectors and returning a scalar
• An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
• One of the above predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance

Example:

julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
source
BetaML.Clustering.kmedoidsMethod

kmedoids(X,K;dist,initStrategy,Z₀)

Compute K-Medoids algorithm to identify K clusters of X using distance definition dist

Parameters:

• X: a (n x d) data to clusterise
• K: Number of cluster wonted
• dist: Function to employ as distance (see notes). Default to Euclidean distance.
• initStrategy: Whether to select the initial representative vectors:
• random: randomly in the X space
• grid: using a grid approach
• shuffle: selecting randomly within the available points [default]
• given: using a provided set of initial representatives provided in the Z₀ parameter
• Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
• rng: Random Number Generator (see FIXEDSEED) [deafult: Random.GLOBAL_RNG]

Returns:

• A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

Notes:

• Some returned clusters could be empty
• The dist parameter can be:
• Any user defined function accepting two vectors and returning a scalar
• An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
• One of the above predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance

Example:

julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")
source
BetaML.Clustering.predictMissingFunction

predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)

Fill missing entries in a sparse matrix (i.e. perform a "matrix completion") assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing an Expectation-Maximisation algorithm.

While the name of the function is predictMissing, the function can be also used for system reccomendation / collaborative filtering and GMM-based regressions. The advantage over traditional algorithms as k-nearest neighbors (KNN) is that GMM can "detect" the hidden structure of the observed data, where some observation can be similar to a certain pool of other observvations for a certain characteristic, but similar to an other pool of observations for other characteristics.

Implemented in the log-domain for better numerical accuracy with many dimensions.

Parameters:

• X : A (N x D) sparse matrix of data to fill according to a GMM model
• K : Number of mixtures (latent classes) to consider [def: 3]
• p₀ : Initial probabilities of the categorical distribution (K x 1) [default: nothing]
• mixtures: An array (of length K) of the mixture to employ (see notes) [def: [DiagonalGaussian() for i in 1:K]]
• tol: Tolerance to stop the algorithm [default: 10^(-6)]
• verbosity: A verbosity parameter regulating the information messages frequency [def: STD]
• minVariance: Minimum variance for the mixtures [default: 0.05]
• minCovariance: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different than minVariance (see notes).
• initStrategy: Mixture initialisation algorithm [def: grid]
• maxIter: Maximum number of iterations [def: -1, i.e. ∞]
• rng: Random Number Generator (see FIXEDSEED) [deafult: Random.GLOBAL_RNG]

Returns:

• A named touple of:

• ̂X̂ : The Filled Matrix of size (N x D)
• nFill: The number of items filled
• lL : The log-likelihood (without considering the last mixture optimisation)
• BIC : The Bayesian Information Criterion (lower is better)
• AIC : The Akaike Information Criterion (lower is better)

Notes:

• The mixtures currently implemented are SphericalGaussian(μ,σ²),DiagonalGaussian(μ,σ²) and FullGaussian(μ,σ²)
• For initStrategy, look at the documentation of initMixtures! for the mixture you want. The provided gaussian mixtures support grid, kmeans or given. grid is faster, but kmeans often provides better results.
• The algorithm requires to specify a number of "latent classes" (mlixtures) to divide the dataset into. If there isn't any prior domain specific knowledge on this point one can test sevaral k and verify which one minimise the BIC or AIC criteria.

Example:

julia>  cFOut = predictMissing([1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4],3)
source