The BetaML.Clustering Module
BetaML.Clustering — Module

Clustering module (WIP): (hard) clustering algorithms.
Provides hard clustering methods using K-means and K-medoids. Please see also the GMM
module for GMM-based soft clustering (i.e. where each record is assigned a probability distribution over the various classes instead of a single class), missing-values imputation, and collaborative filtering / recommendation systems using clustering methods as a backend.
The module provides the following models. Use ?[model] to access their documentation:

- KMeansClusterer: the classical K-means algorithm
- KMedoidsClusterer: the K-medoids algorithm with configurable distance metric

Some metrics of the clustered output are available (e.g. silhouette).
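For intuition on what the silhouette metric measures, here is a hypothetical from-scratch computation (an illustrative sketch, not the BetaML `silhouette` implementation): for each record, the mean distance to its own cluster (a) is compared with the mean distance to the nearest other cluster (b), giving a score (b-a)/max(a,b) that approaches 1 for well-separated clusters.

```julia
# Hypothetical from-scratch silhouette computation (illustrative only,
# not the BetaML `silhouette` implementation).
function silhouette_scores(X::Matrix{Float64}, classes::Vector{Int})
    n = size(X, 1)
    d(i, j) = sqrt(sum((X[i, :] .- X[j, :]) .^ 2))   # Euclidean distance between rows
    s = zeros(n)
    for i in 1:n
        # a: mean distance to the other members of the same cluster
        own = [j for j in 1:n if classes[j] == classes[i] && j != i]
        a   = isempty(own) ? 0.0 : sum(d(i, j) for j in own) / length(own)
        # b: smallest mean distance to the members of any other cluster
        others = setdiff(unique(classes), [classes[i]])
        b      = minimum(sum(d(i, j) for j in 1:n if classes[j] == c) /
                         count(==(c), classes) for c in others)
        s[i] = (b - a) / max(a, b)
    end
    return s
end
```

For two tight, well-separated clusters this returns scores close to 1 for every record.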
Module Index
BetaML.Clustering.KMeansC_hp
BetaML.Clustering.KMeansClusterer
BetaML.Clustering.KMedoidsC_hp
BetaML.Clustering.KMedoidsClusterer
Detailed API
BetaML.Clustering.KMeansC_hp
— Type

mutable struct KMeansC_hp <: BetaMLHyperParametersSet

Hyperparameters for the KMeansClusterer model.

Parameters:

- n_classes::Int64: Number of classes to discriminate the data [def: 3]
- dist::Function: Function to employ as distance. Defaults to the Euclidean distance. Can be one of the predefined distances (l1_distance, l2_distance, l2squared_distance, cosine_distance), any user-defined function accepting two vectors and returning a scalar, or an anonymous function with the same characteristics. Note that the KMeansClusterer algorithm is not guaranteed to converge with distances other than the Euclidean one.
- initialisation_strategy::String: The computation method for the vector of initial representatives. One of the following:
  - "random": randomly in the X space [default]
  - "grid": using a grid approach
  - "shuffle": selecting randomly within the available points
  - "given": using a set of initial representatives provided in the initial_representatives parameter
- initial_representatives::Union{Nothing, Matrix{Float64}}: Provided (K x D) matrix of initial representatives (useful only with initialisation_strategy="given") [default: nothing]
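For intuition, the predefined distances could be sketched as follows (plausible textbook definitions; the actual BetaML implementations may differ in details such as type handling):

```julia
# Plausible definitions of the predefined distance functions (illustrative
# sketch, not the actual BetaML source).
l1_distance(x, y)        = sum(abs.(x .- y))             # Manhattan distance
l2_distance(x, y)        = sqrt(sum((x .- y) .^ 2))      # Euclidean distance
l2squared_distance(x, y) = sum((x .- y) .^ 2)            # squared Euclidean
cosine_distance(x, y)    = 1 - sum(x .* y) /
                           (sqrt(sum(x .^ 2)) * sqrt(sum(y .^ 2)))

# Any function with the same two-vectors-to-scalar signature also qualifies,
# e.g. an anonymous Chebyshev distance:
chebyshev = (x, y) -> maximum(abs.(x .- y))
```

Any of these, or a custom function with the same signature, can be passed as the dist hyperparameter.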
BetaML.Clustering.KMeansClusterer
— Type

mutable struct KMeansClusterer <: BetaMLUnsupervisedModel
The classical "K-Means" clustering algorithm (unsupervised).
Learn to partition the data and assign each record to one of the n_classes
classes according to a distance metric (default Euclidean).
For the parameters see ?KMeansC_hp and ?BML_options.
Notes:
- data must be numerical
- online fitting (re-fitting with new data) is supported by using the "old" representatives as init ones
Example:
julia> using BetaML
julia> X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8; 0.8 9.8]
5×2 Matrix{Float64}:
1.1 10.1
0.9 9.8
10.0 1.1
12.1 0.8
0.8 9.8
julia> mod = KMeansClusterer(n_classes=2)
KMeansClusterer - A K-Means Model (unfitted)
julia> classes = fit!(mod,X)
5-element Vector{Int64}:
1
1
2
2
1
julia> newclasses = fit!(mod,[11 0.9])
1-element Vector{Int64}:
2
julia> info(mod)
Dict{String, Any} with 3 entries:
"fitted_records" => 6
"av_distance_last_fit" => 0.0
"xndims" => 2
julia> parameters(mod)
BetaML.Clustering.KMeansMedoids_lp (a BetaMLLearnableParametersSet struct)
- representatives: [1.13366 9.7209; 11.0 0.9]
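The fit above alternates two steps until convergence. A minimal sketch of a single iteration in plain Julia (an illustrative assumption, not the BetaML internals):

```julia
# One K-means iteration (illustrative sketch): assign each record to its
# nearest representative, then move each representative to the mean of its
# assigned records.
function kmeans_step(X::Matrix{Float64}, reps::Matrix{Float64})
    n, K = size(X, 1), size(reps, 1)
    # Assignment step: index of the nearest representative for each record
    classes = [argmin([sum((X[i, :] .- reps[k, :]) .^ 2) for k in 1:K])
               for i in 1:n]
    # Update step: each representative becomes the mean of its records
    # (a cluster left empty would yield NaN; real implementations handle this)
    newreps = vcat([sum(X[classes .== k, :], dims=1) ./ count(==(k), classes)
                    for k in 1:K]...)
    return classes, newreps
end
```

Iterating kmeans_step until the assignments stop changing reproduces the classical algorithm.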
BetaML.Clustering.KMedoidsC_hp
— Type

mutable struct KMedoidsC_hp <: BetaMLHyperParametersSet

Hyperparameters for the KMedoidsClusterer model.
Parameters:
- n_classes::Int64: Number of classes to discriminate the data [def: 3]
- dist::Function: Function to employ as distance. Defaults to the Euclidean distance. Can be one of the predefined distances (l1_distance, l2_distance, l2squared_distance, cosine_distance), any user-defined function accepting two vectors and returning a scalar, or an anonymous function with the same characteristics. Unlike K-means, the K-medoids algorithm works with any arbitrary distance measure.
- initialisation_strategy::String: The computation method for the vector of initial representatives. One of the following:
  - "random": randomly in the X space
  - "grid": using a grid approach
  - "shuffle": selecting randomly within the available points [default]
  - "given": using a set of initial representatives provided in the initial_representatives parameter
- initial_representatives::Union{Nothing, Matrix{Float64}}: Provided (K x D) matrix of initial representatives (useful only with initialisation_strategy="given") [default: nothing]
BetaML.Clustering.KMedoidsClusterer
— Type

mutable struct KMedoidsClusterer <: BetaMLUnsupervisedModel

The classical "K-Medoids" clustering algorithm (unsupervised).
Similar to K-means, it learns to partition the data and assign each record to one of the n_classes classes according to a distance metric, but the "representatives" (the medoids) are guaranteed to be actual training points. The algorithm works with any arbitrary distance measure (default Euclidean).
For the parameters see ?KMedoidsC_hp and ?BML_options.
Notes:
- data must be numerical
- online fitting (re-fitting with new data) is supported by using the "old" representatives as init ones
- with an initialisation_strategy other than "shuffle" (the default initialisation for K-medoids), the representatives may not be one of the training points when the algorithm doesn't perform enough iterations. This can happen, for example, when the number of classes is close to the number of records to cluster.
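The key difference from K-means is the update step: instead of taking a mean, each representative is chosen as the cluster member minimising the total distance to the other members, which is why it is always one of the training records. A sketch of that update (illustrative, not the BetaML internals):

```julia
# Medoid of a set of points (illustrative sketch): the member point with the
# smallest total Euclidean distance to all other members, so the
# representative is always one of the training records.
function medoid(points::Matrix{Float64})
    n = size(points, 1)
    cost(i) = sum(sqrt(sum((points[i, :] .- points[j, :]) .^ 2)) for j in 1:n)
    return points[argmin([cost(i) for i in 1:n]), :]
end
```

Because the result is selected from the input rows, it stays meaningful even for distance measures where a mean is undefined or unrepresentative.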
Example:
julia> using BetaML
julia> X = [1.1 10.1; 0.9 9.8; 10.0 1.1; 12.1 0.8; 0.8 9.8]
5×2 Matrix{Float64}:
1.1 10.1
0.9 9.8
10.0 1.1
12.1 0.8
0.8 9.8
julia> mod = KMedoidsClusterer(n_classes=2)
KMedoidsClusterer - A K-Medoids Model (unfitted)
julia> classes = fit!(mod,X)
5-element Vector{Int64}:
1
1
2
2
1
julia> newclasses = fit!(mod,[11 0.9])
1-element Vector{Int64}:
2
julia> info(mod)
Dict{String, Any} with 3 entries:
"fitted_records" => 6
"av_distance_last_fit" => 0.0
"xndims" => 2
julia> parameters(mod)
BetaML.Clustering.KMeansMedoids_lp (a BetaMLLearnableParametersSet struct)
- representatives: [0.9 9.8; 11.0 0.9]