The BetaML.Clustering Module

BetaML.Clustering — Module

Clustering module (WIP)

Provides clustering methods and missing-values imputation / collaborative filtering / recommendation systems using clustering methods as backend.

The module provides the following functions. Use `?[function]` to access their full signature and detailed documentation:
- `initRepresentatives(X,K;initStrategy,Z₀)`: Initialisation strategies for Kmeans and Kmedoids
- `kmeans(X,K;dist,initStrategy,Z₀)`: Classical KMeans algorithm
- `kmedoids(X,K;dist,initStrategy,Z₀)`: Kmedoids algorithm
- `em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)`: EM algorithm over GMM
- `predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)`: Fill missing values / collaborative filtering using EM as backbone

{Spherical|Diagonal|Full}Gaussian mixtures for `em` / `predictMissing` are already provided. User-defined mixtures can be used by defining a struct as subtype of `Mixture` and implementing for that mixture the following functions:

- `initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)`
- `lpdf(m,x,mask)` (for the e-step)
- `updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance)` (for the m-step)
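A minimal sketch of that three-function interface, assuming nothing about BetaML internals: the `UniformMixture` type and its fields are hypothetical, and a local stand-in `Mixture` abstract type is declared so the sketch runs standalone (in real code you would subtype BetaML.Clustering's own `Mixture` and extend its `initMixtures!`, `lpdf` and `updateParameters!`):

```julia
# Hypothetical user-defined mixture implementing the three-function interface.
# The `Mixture` abstract type below is a local stand-in so the sketch is
# self-contained; in practice subtype BetaML.Clustering's own `Mixture`.
abstract type Mixture end

# A (hypothetical) uniform mixture over the box [lb, ub] in each dimension
mutable struct UniformMixture <: Mixture
    lb::Vector{Float64}   # lower bound per dimension
    ub::Vector{Float64}   # upper bound per dimension
end

# Initialise all K mixtures from the data ranges (here: all identical boxes)
function initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0,
                       initStrategy="grid") where {T<:UniformMixture}
    (lb, ub) = (minimum(X, dims=1)[:], maximum(X, dims=1)[:])
    for m in mixtures
        (m.lb, m.ub) = (copy(lb), copy(ub))
    end
end

# Log-PDF of observation x; this sketch assumes `mask` flags the non-missing
# dimensions, over which the density is evaluated (e-step)
function lpdf(m::UniformMixture, x, mask)
    inbox = all(m.lb[mask] .<= x[mask] .<= m.ub[mask])
    return inbox ? -sum(log.(m.ub[mask] .- m.lb[mask])) : -Inf
end

# M-step: refit each mixture's parameters; a real implementation would weight
# the points by the responsibilities pₙₖ, here the refit is kept trivial
function updateParameters!(mixtures::Array{T,1}, X, pₙₖ; minVariance=0.25,
                           minCovariance=0.0) where {T<:UniformMixture}
    initMixtures!(mixtures, X)
end
```

The e-step evaluates `lpdf` per point and per mixture; the m-step calls `updateParameters!` once per iteration.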
Module Index
BetaML.Clustering.em
BetaML.Clustering.initMixtures!
BetaML.Clustering.initRepresentatives
BetaML.Clustering.kmeans
BetaML.Clustering.kmedoids
BetaML.Clustering.lpdf
BetaML.Clustering.lpdf
BetaML.Clustering.lpdf
BetaML.Clustering.predictMissing
Detailed API
BetaML.Clustering.em — Method

em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)
Compute Expectation-Maximisation algorithm to identify K clusters of X data, i.e. employ a Generative Mixture Model as the underlying probabilistic model.
X can contain missing values in some or all of its dimensions. In that case, learning is done only with the available data. Implemented in the log-domain for better numerical accuracy with many dimensions.
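The log-domain remark refers to the standard log-sum-exp stabilisation: responsibilities are normalised by subtracting the maximum log-density before exponentiating, so that the hugely negative log-densities arising with many dimensions do not underflow. A self-contained sketch in plain Julia (illustrative, not BetaML internals):

```julia
# Log-sum-exp stabilisation: turn per-cluster log-densities into probabilities
# without overflow/underflow, as done by EM implementations in the log-domain.
function softmax_from_logs(logp::Vector{Float64})
    m = maximum(logp)          # subtract the max for numerical stability
    w = exp.(logp .- m)
    return w ./ sum(w)
end

# With many dimensions log-densities are hugely negative; naive exp.() of
# these values would underflow to 0/0, while the stabilised form is fine
logp = [-1000.0, -1001.0, -1002.0]
p = softmax_from_logs(logp)    # well-defined probabilities summing to 1
```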
Parameters:
- `X`: A (n x d) data matrix to clusterise
- `K`: Number of clusters wanted
- `p₀`: Initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: An array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: Tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: A verbosity parameter regulating the frequency of information messages [def: `STD`]
- `minVariance`: Minimum variance for the mixtures [default: 0.05]
- `minCovariance`: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set differently from `minVariance` (see notes).
- `initStrategy`: Mixture initialisation algorithm [def: `grid`]
Returns:
- A named tuple of:
  - `pⱼₓ`: Matrix of size (N x K) of the probabilities of each point i to belong to cluster j
  - `pⱼ`: Probabilities of the categorical distribution (K x 1)
  - `μ`: Means (K x d) of the Gaussians
  - `σ²`: Variance of the Gaussians (K x 1). We assume here that each Gaussian has the same variance across all dimensions
  - `ϵ`: Vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
  - `lL`: The log-likelihood (without considering the last mixture optimisation)
  - `BIC`: The Bayesian Information Criterion (lower is better)
  - `AIC`: The Akaike Information Criterion (lower is better)
Notes:
- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`
- Reasonable choices for minVariance/minCovariance depend on the mixture. For example 0.25 seems a reasonable value for the SphericalGaussian, 0.05 seems better for the DiagonalGaussian, and FullGaussian seems to prefer either very low values of variance/covariance (e.g. `(0.05,0.05)`) or very big but similar ones (e.g. `(100,100)`).
- For `initStrategy`, look at the documentation of `initMixtures!` for the mixture you want. The provided Gaussian mixtures support either `grid` or `kmeans`. `grid` is faster (especially if X contains missing values), but `kmeans` often provides better results.
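As a reminder of the two criteria returned above (both "lower is better"), using the standard definitions with log-likelihood `lL`, `k` free parameters and `n` observations (these one-line helpers are illustrative, not BetaML functions):

```julia
# Standard Bayesian and Akaike Information Criteria from a log-likelihood
bic(lL, k, n) = k * log(n) - 2lL   # parameter penalty grows with n
aic(lL, k)    = 2k - 2lL           # constant penalty of 2 per parameter

(bic(-120.0, 9, 100), aic(-120.0, 9))   # e.g. lL = -120, 9 parameters, 100 points
```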
Example:
julia> clusters = em([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,verbosity=HIGH)
BetaML.Clustering.initMixtures! — Method

initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0, initStrategy="grid")

The parameter `initStrategy` can be either `grid` or `kmeans`. If the data contains missing values, a first run of `predictMissing` is done under `init=grid` to impute the missing values, just to allow the kmeans algorithm. The em algorithm is then used with the output of kmeans as initial values.
BetaML.Clustering.initRepresentatives — Method

initRepresentatives(X,K;initStrategy,Z₀)

Initialise the representatives for a K-Means or K-Medoids algorithm
Parameters:
- `X`: a (N x D) data matrix to clusterise
- `K`: Number of clusters wanted
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach [default]
  - `shuffle`: selecting randomly within the available points
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A (K x D) matrix of initial representatives
Example:
julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])
BetaML.Clustering.kmeans — Method

kmeans(X,K;dist,initStrategy,Z₀)

Compute the K-Means algorithm to identify K clusters of X (by default using the Euclidean distance)
Parameters:
- `X`: a (N x D) data matrix to clusterise
- `K`: Number of clusters wanted
- `dist`: Function to employ as distance (see notes). Defaults to the Euclidean distance.
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach [default]
  - `shuffle`: selecting randomly within the available points
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
Notes:
- Some returned clusters could be empty
- The `dist` parameter can be:
  - Any user-defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
  - One of the following predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`
Example:
julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
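For instance, the anonymous-function form of `dist` mentioned in the notes is plain Julia and can be checked on its own (only `LinearAlgebra` is needed):

```julia
using LinearAlgebra: norm

# Squared Euclidean distance, as in the notes above; any function taking two
# vectors and returning a scalar can be passed to kmeans via the `dist` keyword
dist = (x, y) -> norm(x - y)^2

dist([0.0, 0.0], [3.0, 4.0])   # the norm is 5.0, so the squared distance is 25.0
```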
BetaML.Clustering.kmedoids — Method

kmedoids(X,K;dist,initStrategy,Z₀)

Compute the K-Medoids algorithm to identify K clusters of X using the distance definition `dist`
Parameters:
- `X`: a (N x D) data matrix to clusterise
- `K`: Number of clusters wanted
- `dist`: Function to employ as distance (see notes). Defaults to the Euclidean distance.
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach
  - `shuffle`: selecting randomly within the available points [default]
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
Notes:
- Some returned clusters could be empty
- The `dist` parameter can be:
  - Any user-defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
  - One of the following predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`
Example:
julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")
BetaML.Clustering.lpdf — Method

lpdf(m::DiagonalGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::FullGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::SphericalGaussian,x,mask) - Log PDF of the mixture given the observation x
BetaML.Clustering.predictMissing — Method

predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)
Fill missing entries in a sparse matrix assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing an Expectation-Maximisation algorithm.
While the name of the function is `predictMissing`, it can also be used for recommender systems / collaborative filtering and GMM-based regressions.
Implemented in the log-domain for better numerical accuracy with many dimensions.
Parameters:
- `X`: A (N x D) sparse matrix of data to fill according to a GMM model
- `K`: Number of mixtures desired
- `p₀`: Initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: An array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: Tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: A verbosity parameter regulating the frequency of information messages [def: `STD`]
- `minVariance`: Minimum variance for the mixtures [default: 0.05]
- `minCovariance`: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set differently from `minVariance` (see notes).
- `initStrategy`: Mixture initialisation algorithm [def: `grid`]
Returns:
A named tuple of:
- `X̂`: The filled matrix of size (N x D)
- `nFill`: The number of items filled
- `lL`: The log-likelihood (without considering the last mixture optimisation)
- `BIC`: The Bayesian Information Criterion (lower is better)
- `AIC`: The Akaike Information Criterion (lower is better)
Notes:
- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`
- For `initStrategy`, look at the documentation of `initMixtures!` for the mixture you want. The provided Gaussian mixtures support either `grid` or `kmeans`. `grid` is faster, but `kmeans` often provides better results.
Example:
julia> cFOut = predictMissing([1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4],3)