The BetaML.Clustering Module

BetaML.Clustering — Module

Clustering module (WIP)

(Hard) Clustering algorithms

Provides hard clustering methods based on K-means and K-medoids. See also the GMM module for GMM-based soft clustering (where each record is assigned a probability distribution over the classes instead of a single class), as well as for missing-values imputation, collaborative filtering and recommendation systems that use clustering methods as a backend.

The module provides the following models: KMeansClusterer and KMedoidsClusterer. Use ?[model] to access their documentation:

Some metrics of the clustered output are available (e.g. silhouette).
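As a sketch of how such a metric can be computed, the snippet below scores a K-means clustering with the silhouette metric. It assumes that BetaML exports `silhouette` (taking a pairwise distance matrix and the class assignments) and `pairwise` (building that matrix from the data rows); check your installed version's docstrings if these names differ.

```julia
using BetaML

# Small illustrative dataset: two well-separated groups of 2-D points
X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8; 0.8 9.8]

mod     = KMeansClusterer(n_classes=2)
classes = fit!(mod, X)

# `pairwise` and `silhouette` are assumed BetaML exports (see lead-in):
# one silhouette score per record, each in [-1, 1], higher is better
sils = silhouette(pairwise(X), classes)
```

Averaging `sils` gives a single quality score for the clustering, which can be used e.g. to compare different choices of `n_classes`.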


Module Index

Detailed API

BetaML.Clustering.KMeansC_hp — Type
mutable struct KMeansC_hp <: BetaMLHyperParametersSet

Hyperparameters for the KMeansClusterer model

Parameters:

  • n_classes::Int64: Number of classes to discriminate the data [def: 3]

  • dist::Function: Function to employ as the distance. Defaults to the Euclidean distance. It can be one of the predefined distances (l1_distance, l2_distance, l2squared_distance, cosine_distance), any user-defined function accepting two vectors and returning a scalar, or an anonymous function with the same characteristics. Note that the KMeansClusterer algorithm is not guaranteed to converge with distances other than the Euclidean one.

  • initialisation_strategy::String: The computation method of the vector of the initial representatives. One of the following:

    • "random": randomly in the X space [default]
    • "grid": using a grid approach
    • "shuffle": selecting randomly within the available points
    • "given": using a provided set of initial representatives provided in the initial_representatives parameter
  • initial_representatives::Union{Nothing, Matrix{Float64}}: Provided (K x D) matrix of initial representatives (useful only with initialisation_strategy="given") [default: nothing]
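To illustrate the hyperparameters above, the hypothetical snippet below builds a K-means model with a predefined non-default distance and explicit initial representatives via initialisation_strategy="given" (the data and representatives are invented for the example):

```julia
using BetaML

# Two obvious clusters in 2-D (illustrative data)
X    = [1.0 10.0; 0.9 9.9; 10.0 1.0; 10.2 0.9]
# K x D matrix of initial representatives, one row per class
reps = [1.0 10.0; 10.0 1.0]

mod = KMeansClusterer(n_classes               = 2,
                      dist                    = l1_distance,  # predefined Manhattan distance
                      initialisation_strategy = "given",
                      initial_representatives = reps)
classes = fit!(mod, X)
```

With well-separated starting representatives like these, the first two records end up in one class and the last two in the other.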
BetaML.Clustering.KMeansClusterer — Type
mutable struct KMeansClusterer <: BetaMLUnsupervisedModel

The classical "K-Means" clustering algorithm (unsupervised).

Learns to partition the data, assigning each record to one of the n_classes classes according to a distance metric (default: Euclidean).

For the parameters see ?KMeansC_hp and ?BML_options.

Notes:

  • data must be numerical
  • online fitting (re-fitting with new data) is supported by using the "old" representatives as the initial ones

Example :

julia> using BetaML

julia> X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8; 0.8 9.8]
5×2 Matrix{Float64}:
  1.1  10.1
  0.9   9.8
 10.0   1.1
 12.1   0.8
  0.8   9.8

julia> mod = KMeansClusterer(n_classes=2)
KMeansClusterer - A K-Means Model (unfitted)

julia> classes = fit!(mod,X)
5-element Vector{Int64}:
 1
 1
 2
 2
 1

julia> newclasses = fit!(mod,[11 0.9])
1-element Vector{Int64}:
 2

julia> info(mod)
Dict{String, Any} with 3 entries:
  "fitted_records"       => 6
  "av_distance_last_fit" => 0.0
  "xndims"               => 2

julia> parameters(mod)
BetaML.Clustering.KMeansMedoids_lp (a BetaMLLearnableParametersSet struct)
- representatives: [1.13366 9.7209; 11.0 0.9]
BetaML.Clustering.KMedoidsC_hp — Type
mutable struct KMedoidsC_hp <: BetaMLHyperParametersSet

Hyperparameters for the KMedoidsClusterer model

Parameters:

  • n_classes::Int64: Number of classes to discriminate the data [def: 3]

  • dist::Function: Function to employ as the distance. Defaults to the Euclidean distance. It can be one of the predefined distances (l1_distance, l2_distance, l2squared_distance, cosine_distance), any user-defined function accepting two vectors and returning a scalar, or an anonymous function with the same characteristics. Unlike K-means, the K-medoids algorithm works with any arbitrary distance measure.

  • initialisation_strategy::String: The computation method of the vector of the initial representatives. One of the following:

    • "random": randomly in the X space
    • "grid": using a grid approach
    • "shuffle": selecting randomly within the available points [default]
    • "given": using a provided set of initial representatives provided in the initial_representatives parameter
  • initial_representatives::Union{Nothing, Matrix{Float64}}: Provided (K x D) matrix of initial representatives (useful only with initialisation_strategy="given") [default: nothing]
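The snippet below sketches the default "shuffle" initialisation, whose key consequence is that the fitted representatives are actual training points (the medoids). It assumes, as the example further below suggests, that the representatives are reachable as the `representatives` field of the object returned by `parameters`:

```julia
using BetaML

# Illustrative 2-D data with two clear groups
X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8]

# "shuffle" (the default) picks the initial representatives among the points
mod     = KMedoidsClusterer(n_classes=2, initialisation_strategy="shuffle")
classes = fit!(mod, X)

# Assumed accessor: each row of `reps` should coincide with a training row
reps = parameters(mod).representatives
```

This "medoid" property is what makes K-medoids usable with arbitrary distance measures and non-vectorial notions of similarity.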
BetaML.Clustering.KMedoidsClusterer — Type
mutable struct KMedoidsClusterer <: BetaMLUnsupervisedModel

The classical "K-Medoids" clustering algorithm (unsupervised).

Similar to K-means, it learns to partition the data, assigning each record to one of the n_classes classes according to a distance metric; however, the "representatives" (the centroids) are guaranteed to be one of the training points. The algorithm works with any arbitrary distance measure (default: Euclidean).

For the parameters see ?KMedoidsC_hp and ?BML_options.

Notes:

  • data must be numerical
  • online fitting (re-fitting with new data) is supported by using the "old" representatives as the initial ones
  • with an initialisation_strategy other than "shuffle" (the default initialisation for K-medoids) the representatives may not be one of the training points if the algorithm doesn't perform enough iterations. This can happen, for example, when the number of classes is close to the number of records to cluster.

Example:

julia> using BetaML

julia> X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8; 0.8 9.8]
5×2 Matrix{Float64}:
  1.1  10.1
  0.9   9.8
 10.0   1.1
 12.1   0.8
  0.8   9.8

julia> mod = KMedoidsClusterer(n_classes=2)
KMedoidsClusterer - A K-Medoids Model (unfitted)

julia> classes = fit!(mod,X)
5-element Vector{Int64}:
 1
 1
 2
 2
 1

julia> newclasses = fit!(mod,[11 0.9])
1-element Vector{Int64}:
 2

julia> info(mod)
Dict{String, Any} with 3 entries:
  "fitted_records"       => 6
  "av_distance_last_fit" => 0.0
  "xndims"               => 2

julia> parameters(mod)
BetaML.Clustering.KMeansMedoids_lp (a BetaMLLearnableParametersSet struct)
- representatives: [0.9 9.8; 11.0 0.9]