The BetaML.Clustering Module

BetaML.Clustering - Module
Clustering module (WIP)

(Hard) Clustering algorithms

Provides hard clustering methods using K-Means and K-Medoids. See also the GMM module for GMM-based soft clustering and for missing-value imputation, collaborative filtering and recommendation systems that use clustering methods as a backend.

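The module provides the following functions. Use ?[function] to access their full signature and detailed documentation:

  • initRepresentatives(X,K;initStrategy,Z₀): initialise the representatives for K-Means and K-Medoids
  • kmeans(X,K;dist,initStrategy,Z₀): K-Means clustering
  • kmedoids(X,K;dist,initStrategy,Z₀): K-Medoids clustering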


Module Index

Detailed API

BetaML.Clustering.initRepresentatives - Method

initRepresentatives(X,K;initStrategy,Z₀)

Initialise the representatives for a K-Means or K-Medoids algorithm

Parameters:

  • X: a (N x D) matrix of data to cluster
  • K: Number of clusters wanted
  • initStrategy: How to select the initial representative vectors:
      • random: randomly in the X space
      • grid: using a grid approach [default]
      • shuffle: selecting randomly within the available points
      • given: using the set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
  • rng: Random Number Generator (see FIXEDSEED) [default: Random.GLOBAL_RNG]

Returns:

  • A (K x D) matrix of initial representatives

Example:

julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])
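
To make the strategies above more concrete, here is a minimal plain-Julia sketch of what the random, shuffle and given options could look like. The function `sketch_init` and its keyword names are hypothetical and are not BetaML's implementation; the grid strategy is left out because its exact construction is not described here.

```julia
using Random

# Hypothetical helper (NOT BetaML's code): pick K initial representatives from X.
function sketch_init(X::AbstractMatrix, K::Integer; strategy::String="shuffle",
                     Z₀=nothing, rng::AbstractRNG=Random.default_rng())
    N, D = size(X)
    if strategy == "shuffle"
        # K distinct rows of X, chosen at random
        idx = randperm(rng, N)[1:K]
        return Float64.(X[idx, :])
    elseif strategy == "random"
        # Uniform draws inside the per-dimension bounding box of X
        lo = minimum(X, dims=1)
        hi = maximum(X, dims=1)
        return lo .+ rand(rng, K, D) .* (hi .- lo)
    elseif strategy == "given"
        Z₀ === nothing && error("Z₀ must be provided with the \"given\" strategy")
        size(Z₀) == (K, D) || error("Z₀ must be a (K x D) matrix")
        return Float64.(Z₀)
    else
        error("unknown strategy: $strategy")
    end
end

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38]
Z_shuffle = sketch_init(X, 2; strategy="shuffle", rng=MersenneTwister(123))
Z_random  = sketch_init(X, 2; strategy="random",  rng=MersenneTwister(123))
Z_given   = sketch_init(X, 2; strategy="given", Z₀=[1.7 15; 3.6 40])
```

Passing an explicit `rng` makes the random and shuffle draws reproducible, which is presumably the purpose of the rng parameter documented above.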
BetaML.Clustering.kmeans - Method

kmeans(X,K;dist,initStrategy,Z₀)

Run the K-Means algorithm to identify K clusters of X using the distance definition dist (Euclidean by default)

Parameters:

  • X: a (N x D) matrix of data to cluster
  • K: Number of clusters wanted
  • dist: Function to employ as distance (see notes). Defaults to the Euclidean distance.
  • initStrategy: How to select the initial representative vectors:
      • random: randomly in the X space
      • grid: using a grid approach [default]
      • shuffle: selecting randomly within the available points
      • given: using the set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
  • rng: Random Number Generator (see FIXEDSEED) [default: Random.GLOBAL_RNG]

Returns:

  • A tuple of two items: a vector of size N with the id of the cluster assigned to each point, and the (K x D) matrix of representatives

Notes:

  • Some returned clusters could be empty
  • The dist parameter can be:
      • Any user-defined function accepting two vectors and returning a scalar
      • An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
      • One of the following predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance
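
As an illustration of what a valid dist argument looks like, the sketch below defines four two-vector functions returning a scalar. The `sketch_*` names are hypothetical and the bodies follow the conventional textbook definitions; they are not copied from BetaML, and in particular the cosine convention used by the library's cosine_distance may differ.

```julia
using LinearAlgebra

# Hypothetical names; conventional definitions, NOT taken from BetaML's source.
sketch_l1(x, y)     = sum(abs.(x .- y))                       # Manhattan distance
sketch_l2(x, y)     = norm(x .- y)                            # Euclidean distance
sketch_l2sq(x, y)   = norm(x .- y)^2                          # squared Euclidean distance
sketch_cosine(x, y) = 1 - dot(x, y) / (norm(x) * norm(y))     # one common cosine convention

a = [0.0, 3.0]
b = [4.0, 0.0]
d1 = sketch_l1(a, b)
d2 = sketch_l2(a, b)
```

Any function with this shape (two vectors in, one scalar out) satisfies the requirement stated in the notes.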

Example:

julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
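
For readers who want to see the mechanics behind the call above, here is a minimal sketch of the classical assignment/update iterations with a pluggable distance. The function `sketch_kmeans`, its `maxiter` keyword and the choice of starting representatives are assumptions made for illustration; BetaML's actual implementation and stopping rule may differ.

```julia
using LinearAlgebra

# Hypothetical sketch of Lloyd-style iterations (NOT BetaML's implementation).
function sketch_kmeans(X::AbstractMatrix, Z₀::AbstractMatrix;
                       dist=(x, y) -> norm(x - y), maxiter::Integer=100)
    N = size(X, 1)
    K = size(Z₀, 1)
    Z = Float64.(Z₀)
    clIdx = zeros(Int, N)
    for _ in 1:maxiter
        # Assignment step: each point goes to its closest representative
        newIdx = [argmin([dist(X[i, :], Z[k, :]) for k in 1:K]) for i in 1:N]
        newIdx == clIdx && break          # assignments stable: stop iterating
        clIdx = newIdx
        # Update step: each representative becomes the mean of its points
        for k in 1:K
            members = findall(==(k), clIdx)
            isempty(members) && continue  # empty cluster: keep the old representative
            Z[k, :] = vec(sum(X[members, :], dims=1)) ./ length(members)
        end
    end
    return clIdx, Z
end

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]
clIdx, Z = sketch_kmeans(X, [1.7 15; 3.6 40; 5.1 -2.3])
```

The `isempty` branch shows one way a cluster can end up empty, which is the situation the note about empty clusters refers to.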
BetaML.Clustering.kmedoids - Method

kmedoids(X,K;dist,initStrategy,Z₀)

Run the K-Medoids algorithm to identify K clusters of X using the distance definition dist

Parameters:

  • X: a (N x D) matrix of data to cluster
  • K: Number of clusters wanted
  • dist: Function to employ as distance (see notes). Defaults to the Euclidean distance.
  • initStrategy: How to select the initial representative vectors:
      • random: randomly in the X space
      • grid: using a grid approach
      • shuffle: selecting randomly within the available points [default]
      • given: using the set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]
  • rng: Random Number Generator (see FIXEDSEED) [default: Random.GLOBAL_RNG]

Returns:

  • A tuple of two items: a vector of size N with the id of the cluster assigned to each point, and the (K x D) matrix of representatives

Notes:

  • Some returned clusters could be empty
  • The dist parameter can be:
      • Any user-defined function accepting two vectors and returning a scalar
      • An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
      • One of the following predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance

Example:

julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")
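
The difference from K-Means is that each representative must be one of the observed points. A minimal sketch of that idea follows, assuming the medoid of a cluster is the member with the smallest total distance to the other members. The function `sketch_kmedoids`, the `maxiter` keyword and the starting rows are hypothetical; this is not BetaML's implementation.

```julia
# Hypothetical sketch of a K-Medoids loop (NOT BetaML's implementation).
function sketch_kmedoids(X::AbstractMatrix, Z₀::AbstractMatrix;
                         dist=(x, y) -> sum(abs.(x - y)), maxiter::Integer=100)
    N, K = size(X, 1), size(Z₀, 1)
    Z = Float64.(Z₀)
    clIdx = zeros(Int, N)
    for _ in 1:maxiter
        # Assignment step: each point goes to its closest medoid
        newIdx = [argmin([dist(X[i, :], Z[k, :]) for k in 1:K]) for i in 1:N]
        newIdx == clIdx && break
        clIdx = newIdx
        for k in 1:K
            members = findall(==(k), clIdx)
            isempty(members) && continue  # empty cluster: keep the old medoid
            # Update step: the member with the smallest total distance to the others
            costs = [sum(dist(X[i, :], X[j, :]) for j in members) for i in members]
            Z[k, :] = X[members[argmin(costs)], :]
        end
    end
    return clIdx, Z
end

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]
# Start from three observed rows; the default distance in this sketch is L1.
clIdx, Z = sketch_kmedoids(X, X[[4, 5, 8], :])
```

Because only pairwise distances between observed points are needed in the update step, this formulation works with any dist function, including ones for which a mean is not meaningful.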
MLJModelInterface.predict - Method

predict(m::KMeans, fitResults, X) - Given a fitted clustering model and some observations, predict the class of each observation

MLJModelInterface.transform - Method

transform(m::KMeans, fitResults, X) - Given a fitted clustering model and some observations, return the distances to each centroid
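
The relation between the two methods can be sketched in plain Julia: the transform output is an (N x K) matrix of distances, and the predicted class is the index of the smallest entry in each row. The helpers `sketch_transform` and `sketch_predict`, and the matrices `Z` and `Xnew`, are hypothetical; they stand in for the fitted representatives and new observations and do not reproduce the MLJ interface or its return types.

```julia
using LinearAlgebra

# Hypothetical helpers (NOT the MLJ interface itself).
# Z: (K x D) matrix of fitted representatives; X: (N x D) matrix of observations.
sketch_transform(Z, X; dist=(x, y) -> norm(x - y)) =
    [dist(X[i, :], Z[k, :]) for i in 1:size(X, 1), k in 1:size(Z, 1)]

# Predicted class = column index of the closest representative in each row
sketch_predict(Z, X; dist=(x, y) -> norm(x - y)) =
    [argmin(row) for row in eachrow(sketch_transform(Z, X; dist=dist))]

Z = [1.5 11.0; 3.4 36.7; 5.15 -2.35]
Xnew = [1.6 12.0; 5.0 -2.0]
D = sketch_transform(Z, Xnew)
labels = sketch_predict(Z, Xnew)
```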
