The BetaML.Clustering Module
BetaML.Clustering — Module

Clustering module (WIP)

Provides clustering methods and missing-value imputation / collaborative filtering / recommendation systems using clustering methods as backend.
The module provides the following functions. Use ?[function] to access their full signature and detailed documentation:
- `initRepresentatives(X,K;initStrategy,Z₀)`: Initialisation strategies for Kmeans and Kmedoids
- `kmeans(X,K;dist,initStrategy,Z₀)`: Classical KMeans algorithm
- `kmedoids(X,K;dist,initStrategy,Z₀)`: Kmedoids algorithm
- `em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)`: EM algorithm over GMM
- `predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)`: Fill missing values / collaborative filtering using EM as backbone
{Spherical|Diagonal|Full}Gaussian mixtures for em / predictMissing are already provided. User-defined mixtures can be used by defining a struct as a subtype of Mixture and implementing the following functions for that mixture:

- `initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)`
- `lpdf(m,x,mask)` (for the e-step)
- `updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance)` (for the m-step)
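To illustrate the shape of this three-function interface, here is a minimal sketch. `AbstractMixture` is a local stand-in for BetaML's `Mixture` abstract type, and `UniformBox` is a purely hypothetical mixture invented for this example; only the function names and argument order mirror the interface described above.

```julia
# Illustrative sketch only: `AbstractMixture` stands in for BetaML's `Mixture`
# abstract type; `UniformBox` is a hypothetical mixture used to show the
# interface shape, not a BetaML type.
abstract type AbstractMixture end

mutable struct UniformBox <: AbstractMixture
    lo::Union{Vector{Float64},Nothing}
    hi::Union{Vector{Float64},Nothing}
end
UniformBox() = UniformBox(nothing, nothing)

# Initialise the mixture parameters from the data (here: the data bounding box)
function initMixtures!(mixtures::Vector{UniformBox}, X)
    d = size(X, 2)
    for m in mixtures
        m.lo = [minimum(X[:, j]) for j in 1:d]
        m.hi = [maximum(X[:, j]) for j in 1:d]
    end
    return mixtures
end

# Log-pdf of one observation; `mask` flags the non-missing dimensions (e-step)
function lpdf(m::UniformBox, x, mask)
    lo, hi = m.lo[mask], m.hi[mask]
    all(lo .<= x[mask] .<= hi) || return -Inf
    return -sum(log.(hi .- lo))
end

# updateParameters!(mixtures, X, pₙₖ; ...) (m-step) would re-estimate lo/hi
# from the responsibility-weighted data; omitted for brevity in this toy.
```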
Module Index
- BetaML.Clustering.em
- BetaML.Clustering.initMixtures!
- BetaML.Clustering.initRepresentatives
- BetaML.Clustering.kmeans
- BetaML.Clustering.kmedoids
- BetaML.Clustering.lpdf (3 methods)
- BetaML.Clustering.predictMissing
Detailed API
BetaML.Clustering.em — Method

em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)
Run the Expectation-Maximisation algorithm to identify K clusters of the data X, i.e. employ a Generative Mixture Model as the underlying probabilistic model.

X can contain missing values in some or all of its dimensions. In that case, learning is done using only the available data. The algorithm is implemented in the log-domain for better numerical accuracy with many dimensions.
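The log-domain trick mentioned above can be illustrated with the classic log-sum-exp pattern (this `logsumexp` helper is an illustration, not a BetaML export): summing probabilities whose logarithms have large magnitude underflows in the linear domain, while subtracting the maximum first keeps the computation stable.

```julia
# Stable log(Σ exp(aᵢ)): subtract the maximum before exponentiating so that
# at least one term is exp(0) = 1 and nothing underflows to zero.
function logsumexp(a::AbstractVector{<:Real})
    m = maximum(a)
    isfinite(m) || return m          # degenerate all -Inf (or +Inf) case
    return m + log(sum(exp.(a .- m)))
end

# The naive version underflows: exp(-1000) == 0.0 in Float64, so it returns -Inf
naive(a) = log(sum(exp.(a)))
```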
Parameters:
- X: A (n x d) data matrix to clusterise
- K: Number of clusters wanted
- p₀: Initial probabilities of the categorical distribution (K x 1) [default: nothing]
- mixtures: An array (of length K) of the mixtures to employ (see notes) [def: [DiagonalGaussian() for i in 1:K]]
- tol: Tolerance to stop the algorithm [default: 10^(-6)]
- verbosity: A verbosity parameter regulating the frequency of information messages [def: STD]
- minVariance: Minimum variance for the mixtures [default: 0.05]
- minCovariance: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set differently from minVariance (see notes).
- initStrategy: Mixture initialisation algorithm [def: grid]
Returns:
- A named tuple of:
  - pⱼₓ: Matrix of size (N x K) of the probabilities of each point i to belong to cluster j
  - pⱼ: Probabilities of the categorical distribution (K x 1)
  - μ: Means (K x d) of the Gaussians
  - σ²: Variance of the Gaussians (K x 1). We assume here that each Gaussian has the same variance across all dimensions.
  - ϵ: Vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
  - lL: The log-likelihood (without considering the last mixture optimisation)
  - BIC: The Bayesian Information Criterion (lower is better)
  - AIC: The Akaike Information Criterion (lower is better)
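The two information criteria returned above have standard definitions, sketched below; the number of free parameters k depends on the mixture type (a spherical Gaussian has fewer parameters than a full-covariance one), so these helpers are an illustration of the formulas, not BetaML's internal code.

```julia
# Standard definitions: BIC = k·ln(n) − 2·lL and AIC = 2k − 2·lL,
# where lL is the maximised log-likelihood, k the number of free parameters
# and n the number of data points. Lower is better for both.
bic(lL, k, n) = k * log(n) - 2 * lL
aic(lL, k)    = 2k - 2 * lL
```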
Notes:
- The mixtures currently implemented are SphericalGaussian(μ,σ²), DiagonalGaussian(μ,σ²) and FullGaussian(μ,σ²).
- Reasonable choices for minVariance/minCovariance depend on the mixture. For example, 0.25 seems a reasonable value for SphericalGaussian, 0.05 seems better for DiagonalGaussian, and FullGaussian seems to prefer either very low values of variance/covariance (e.g. (0.05,0.05)) or very big but similar ones (e.g. (100,100)).
- For initStrategy, look at the documentation of initMixtures! for the mixture you want. The provided Gaussian mixtures support either grid or kmeans. grid is faster (especially if X contains missing values), but kmeans often provides better results.
Example:
julia> clusters = em([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,verbosity=HIGH)

BetaML.Clustering.initMixtures! — Method

initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0, initStrategy="grid")
The parameter initStrategy can be either grid or kmeans. If the data contains missing values, a first run of predictMissing is done under init=grid to impute the missing values, just to allow the kmeans algorithm. The em algorithm is then run with the output of kmeans as initial values.
BetaML.Clustering.initRepresentatives — Method

initRepresentatives(X,K;initStrategy,Z₀)
Initialise the representatives for a K-Means or K-Medoids algorithm.
Parameters:
- X: a (N x D) data matrix to clusterise
- K: Number of clusters wanted
- initStrategy: How to select the initial representative vectors:
  - random: randomly in the X space
  - grid: using a grid approach [default]
  - shuffle: selecting randomly within the available points
  - given: using a provided set of initial representatives in the Z₀ parameter
- Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
Returns:
- A (K x D) matrix of initial representatives
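One way to picture the grid strategy is to spread the K initial representatives evenly between the per-dimension minima and maxima of the data. The sketch below (`gridInit` is a hypothetical name) illustrates that idea only; BetaML's actual grid placement may differ.

```julia
# A minimal sketch of a "grid"-style initialisation: place the K initial
# representatives at evenly spaced positions between the per-dimension
# minimum and maximum of X. Illustration, not BetaML's implementation.
function gridInit(X::AbstractMatrix{<:Real}, K::Int)
    d  = size(X, 2)
    lo = [minimum(X[:, j]) for j in 1:d]
    hi = [maximum(X[:, j]) for j in 1:d]
    Z  = zeros(K, d)
    for k in 1:K, j in 1:d
        # centre of the k-th of K equal slices of [lo, hi]
        Z[k, j] = lo[j] + (k - 0.5) / K * (hi[j] - lo[j])
    end
    return Z                       # (K x D) matrix of initial representatives
end
```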
Example:
julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])

BetaML.Clustering.kmeans — Method

kmeans(X,K;dist,initStrategy,Z₀)
Compute the K-Means algorithm to identify K clusters of X using Euclidean distance.
Parameters:
- X: a (N x D) data matrix to clusterise
- K: Number of clusters wanted
- dist: Function to employ as distance (see notes). Defaults to Euclidean distance.
- initStrategy: How to select the initial representative vectors:
  - random: randomly in the X space
  - grid: using a grid approach [default]
  - shuffle: selecting randomly within the available points
  - given: using a provided set of initial representatives in the Z₀ parameter
- Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
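The return shape described above (cluster ids plus representatives matrix) can be sketched with a minimal, self-contained K-Means loop. `kmeansSketch` is a hypothetical illustration of the algorithm taking given initial representatives, not BetaML's implementation.

```julia
using LinearAlgebra

# Minimal K-Means sketch: alternate assignment (nearest representative) and
# update (cluster mean) until assignments stop changing. Returns the same
# shape as described above: (vector of cluster ids, K x D representatives).
function kmeansSketch(X::AbstractMatrix{<:Real}, Z₀::AbstractMatrix{<:Real}; maxIter=100)
    (n, _) = size(X); K = size(Z₀, 1)
    Z    = copy(float.(Z₀))
    cIdx = zeros(Int, n)
    for _ in 1:maxIter
        # Assignment step: each point goes to its nearest representative
        newIdx = [argmin([norm(X[i, :] .- Z[k, :]) for k in 1:K]) for i in 1:n]
        newIdx == cIdx && break
        cIdx = newIdx
        # Update step: each representative becomes the mean of its points
        for k in 1:K
            members = findall(==(k), cIdx)
            isempty(members) || (Z[k, :] = vec(sum(X[members, :], dims=1)) ./ length(members))
        end
    end
    return (cIdx, Z)
end
```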
Notes:
- Some returned clusters could be empty
- The dist parameter can be:
  - Any user-defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
  - One of the predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance
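The kinds of functions dist accepts can be sketched as plain two-argument Julia functions. These mirror the usual definitions of the predefined distances named above (the local names here are illustrative; BetaML's exact definitions may differ in detail).

```julia
using LinearAlgebra

# Sketches of valid `dist` arguments: any function of two vectors
# returning a scalar works.
l1(x, y)  = sum(abs.(x .- y))                     # Manhattan distance
l2(x, y)  = norm(x .- y)                          # Euclidean distance
l2²(x, y) = sum((x .- y) .^ 2)                    # squared Euclidean
cosDist(x, y) = 1 - dot(x, y) / (norm(x) * norm(y))  # cosine distance
```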
Example:
julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)

BetaML.Clustering.kmedoids — Method

kmedoids(X,K;dist,initStrategy,Z₀)
Compute the K-Medoids algorithm to identify K clusters of X using the distance definition dist.
Parameters:
- X: a (N x D) data matrix to clusterise
- K: Number of clusters wanted
- dist: Function to employ as distance (see notes). Defaults to Euclidean distance.
- initStrategy: How to select the initial representative vectors:
  - random: randomly in the X space
  - grid: using a grid approach
  - shuffle: selecting randomly within the available points [default]
  - given: using a provided set of initial representatives in the Z₀ parameter
- Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
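The step that distinguishes K-Medoids from K-Means is that each representative must be an actual data point: the member of the cluster minimising the total distance to all other members. The `medoid` helper below is a hypothetical sketch of that update, not BetaML's code.

```julia
# Medoid update sketch: given the (m x D) matrix Xc of one cluster's points
# and a distance function, return the member point with minimum total
# distance to all other members.
function medoid(Xc::AbstractMatrix{<:Real}, dist)
    m = size(Xc, 1)
    costs = [sum(dist(Xc[i, :], Xc[j, :]) for j in 1:m) for i in 1:m]
    return Xc[argmin(costs), :]
end
```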
Notes:
- Some returned clusters could be empty
- The dist parameter can be:
  - Any user-defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
  - One of the predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance
Example:
julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")

BetaML.Clustering.lpdf — Method

lpdf(m::DiagonalGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::FullGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::SphericalGaussian,x,mask) - Log PDF of the mixture given the observation x
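What a diagonal-Gaussian log-pdf with a missingness mask computes can be sketched as follows: only the observed dimensions (where mask is true) contribute terms to the log-density. `lpdfDiag` is an illustrative stand-alone function, not BetaML's `lpdf` method.

```julia
# Diagonal-Gaussian log-pdf over the observed dimensions only: the mask
# excludes missing dimensions, which is what lets EM learn from partial data.
function lpdfDiag(μ::Vector{Float64}, σ²::Vector{Float64}, x, mask)
    s = 0.0
    for j in eachindex(μ)
        mask[j] || continue        # skip missing dimensions entirely
        s += -0.5 * log(2π * σ²[j]) - (x[j] - μ[j])^2 / (2σ²[j])
    end
    return s
end
```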
BetaML.Clustering.predictMissing — Method

predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)
Fill missing entries in a sparse matrix assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing an Expectation-Maximisation algorithm.
While the name of the function is predictMissing, it can also be used for recommendation systems / collaborative filtering and GMM-based regression.
Implemented in the log-domain for better numerical accuracy with many dimensions.
Parameters:
- X: A (N x D) sparse matrix of data to fill according to a GMM model
- K: Number of mixtures desired
- p₀: Initial probabilities of the categorical distribution (K x 1) [default: nothing]
- mixtures: An array (of length K) of the mixtures to employ (see notes) [def: [DiagonalGaussian() for i in 1:K]]
- tol: Tolerance to stop the algorithm [default: 10^(-6)]
- verbosity: A verbosity parameter regulating the frequency of information messages [def: STD]
- minVariance: Minimum variance for the mixtures [default: 0.05]
- minCovariance: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set differently from minVariance (see notes).
- initStrategy: Mixture initialisation algorithm [def: grid]
Returns:
A named tuple of:

- X̂: The filled matrix of size (N x D)
- nFill: The number of items filled
- lL: The log-likelihood (without considering the last mixture optimisation)
- BIC: The Bayesian Information Criterion (lower is better)
- AIC: The Akaike Information Criterion (lower is better)
Notes:
- The mixtures currently implemented are SphericalGaussian(μ,σ²), DiagonalGaussian(μ,σ²) and FullGaussian(μ,σ²).
- For initStrategy, look at the documentation of initMixtures! for the mixture you want. The provided Gaussian mixtures support either grid or kmeans. grid is faster, but kmeans often provides better results.
Example:
julia> cFOut = predictMissing([1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4],3)