# The BetaML.Utils Module

BetaML.UtilsModule
Utils module

Provide shared utility functions for various machine learning algorithms.

For the complete list of functions provided see below. The main ones are:

Helper functions for logging

• Most BetaML functions accept a verbosity parameter that expects one of the elements of the Verbosity enum (NONE, LOW, STD, HIGH or FULL)
• Writing complex code and need to find where something is executed? Use the macro @codeLocation

Stochasticity management

• Utils provides [FIXEDSEED], [FIXEDRNG] and generateParallelRngs. All stochastic functions accept an rng parameter. See the "Getting started" section in the tutorial for details.

Data processing

Samplers

Transformers

Measures

Imputers

• Imputers of missing values
source

## Detailed API

BetaML.Utils.FIXEDRNGConstant
FIXEDRNG

Fixed random number generator (RNG) to allow reproducible results

Use it with:

• myAlgorithm(;rng=FIXEDRNG) # always produces the same sequence of results on each run of the script ("pulling" from the same rng object on different calls)
• myAlgorithm(;rng=copy(FIXEDRNG)) # always produces the same result (new rng object on each function call)
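
The difference between the two calling styles can be illustrated with a minimal sketch in plain Julia. Both `fixed_rng` (a stand-in for FIXEDRNG) and `my_algorithm` are hypothetical names introduced only for this illustration; the actual type of FIXEDRNG is not assumed here.

```julia
using Random

# Hypothetical stand-in for FIXEDRNG: any seeded AbstractRNG behaves this way.
fixed_rng = MersenneTwister(123)

# Hypothetical stochastic algorithm accepting an `rng` keyword.
my_algorithm(; rng = Random.default_rng()) = rand(rng, 1:1000, 3)

a1 = my_algorithm(rng = fixed_rng)        # advances the state of fixed_rng
a2 = my_algorithm(rng = fixed_rng)        # continues the same stream of numbers
b1 = my_algorithm(rng = copy(fixed_rng))  # draws from a copy, fixed_rng is untouched
b2 = my_algorithm(rng = copy(fixed_rng))  # same starting state, hence same draws as b1
```

Passing the shared object makes the whole script reproducible from run to run, while passing a copy makes each individual call reproducible.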
source
BetaML.Utils.FIXEDSEEDConstant
FIXEDSEED

Fixed seed to allow reproducible results. This is the seed used to obtain the same results under unit tests.

Use it with:

• myAlgorithm(;rng=FIXEDRNG) # always produces the same sequence of results on each run of the script ("pulling" from the same rng object on different calls)
• myAlgorithm(;rng=copy(FIXEDRNG)) # always produces the same result (new rng object on each call)
source
BetaML.Utils.ConfusionMatrixType
ConfusionMatrix

Scores and measures resulting from a comparison between true and predicted categorical variables

Use the function ConfusionMatrix(ŷ,y;classes,labels,rng) to build it and report(cm::ConfusionMatrix;what) to visualise it, or use the individual parts of interest, e.g. display(cm.scores).

Fields:

• labels: Array of categorical labels
• accuracy: Overall accuracy rate
• misclassification: Overall misclassification rate
• actualCount: Array of counts per label in the actual data
• predictedCount: Array of counts per label in the predicted data
• scores: Matrix actual (rows) vs predicted (columns)
• normalisedScores: Normalised scores
• tp: True positive (by class)
• tn: True negative (by class)
• fp: False positive (by class), aka "type I error" or "false alarm"
• fn: False negative (by class), aka "type II error" or "miss"
• precision: True class i over predicted class i (by class)
• recall: Predicted class i over true class i (by class), aka "True Positive Rate (TPR)", "Sensitivity" or "Probability of detection"
• specificity: Predicted not class i over true not class i (by class), aka "True Negative Rate (TNR)"
• f1Score: Harmonic mean of precision and recall
• meanPrecision: Mean by class, respectively unweighted and weighted by actualCount
• meanRecall: Mean by class, respectively unweighted and weighted by actualCount
• meanSpecificity: Mean by class, respectively unweighted and weighted by actualCount
• meanF1Score: Mean by class, respectively unweighted and weighted by actualCount
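
As an illustration of how the per-class fields relate to the scores matrix, here is a minimal sketch in plain Julia. It is not BetaML's implementation; all variable names are chosen for this example only.

```julia
y = ["a", "b", "a", "c", "b", "a"]   # actual labels
ŷ = ["a", "b", "b", "c", "b", "c"]   # predicted labels

classes = unique(y)
K       = length(classes)
pos     = Dict(c => i for (i, c) in enumerate(classes))

# scores[i, j]: number of records of actual class i predicted as class j
scores = zeros(Int, K, K)
for (actual, predicted) in zip(y, ŷ)
    scores[pos[actual], pos[predicted]] += 1
end

actual_count    = vec(sum(scores, dims = 2))   # row sums
predicted_count = vec(sum(scores, dims = 1))   # column sums
tp = [scores[i, i] for i in 1:K]
fp = predicted_count .- tp
fn = actual_count .- tp
tn = sum(scores) .- tp .- fp .- fn

prec        = tp ./ predicted_count   # true class i over predicted class i
rec         = tp ./ actual_count      # predicted class i over true class i
specificity = tn ./ (tn .+ fp)
accuracy    = sum(tp) / sum(scores)
```

The names `prec` and `rec` are used instead of `precision` and `recall` only to avoid clashing with `Base.precision`.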
source
BetaML.Utils.ConfusionMatrixMethod
ConfusionMatrix(ŷ,y;classes,labels,rng)

Build a "confusion matrix" between predicted (columns) vs actual (rows) categorical values

Parameters:

• ŷ: Vector of predicted categorical data
• y: Vector of actual categorical data
• classes: The full set of possible classes (useful to give a specific order or if not all classes are represented in y) [def: unique(y) ]
• labels: String representation of the classes [def: string.(classes)]
• rng: Random number generator. Used only if ŷ is given in terms of a PMF and there are multi-modal values, as these are assigned randomly [def: Random.GLOBAL_RNG]

Return:

• a ConfusionMatrix object
source
Base.errorMethod

error(ŷ,y) - Categorical error with probabilistic predictions of a dataset given in terms of a dictionary of probabilities (Dict{T,Float64} vs T).

source
Base.printMethod
print(cm,what)

Print a ConfusionMatrix object

The what parameter is a string vector that can include "all", "scores", "normalisedScores" or "report" [def: ["all"]]

source
BetaML.Api.partitionMethod
partition(data,parts;shuffle,dims,rng)

Partition (by rows) one or more matrices according to the shares in parts.

Parameters

• data: A matrix/vector or a vector of matrices/vectors
• parts: A vector of the required shares (must sum to 1)
• shuffle: Whether to randomly shuffle the matrices (preserving the relative order between matrices)
• dims: The dimension along which to partition [def: 1]
• copy: Whether to copy the actual data or only create a reference [def: true]
• rng: Random Number Generator (see FIXEDSEED) [def: Random.GLOBAL_RNG]

Notes:

• The sum of parts must be equal to 1
• The number of elements in the specified dimension must be the same for all the arrays in data

Example:

julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])
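
The idea behind partitioning several arrays with a shared row permutation can be sketched in plain Julia as follows. This is an illustrative sketch, not BetaML's code: `partition_rows` and its `shuffle_rows` keyword are hypothetical names, and the rounding of the shares to row counts is an assumption.

```julia
using Random

function partition_rows(arrays, parts; shuffle_rows = true, rng = Random.default_rng())
    n = size(arrays[1], 1)
    all(size(a, 1) == n for a in arrays) || error("all arrays must have the same number of rows")
    sum(parts) ≈ 1 || error("parts must sum to 1")
    # One permutation shared by all arrays keeps the records aligned
    ids   = shuffle_rows ? randperm(rng, n) : collect(1:n)
    stops = round.(Int, cumsum(parts) .* n)
    stops[end] = n                      # make sure every row is assigned
    starts = [1; stops[1:end-1] .+ 1]
    # Select rows of a vector or of an array with any number of dimensions
    rows(a, r) = a[r, ntuple(_ -> Colon(), ndims(a) - 1)...]
    return [[rows(a, ids[s:e]) for (s, e) in zip(starts, stops)] for a in arrays]
end

x = [1:10 11:20]
y = collect(31:40)
((xtrain, xtest), (ytrain, ytest)) = partition_rows([x, y], [0.7, 0.3], rng = MersenneTwister(1))
```

Because the same permutation is applied to `x` and `y`, each row of `xtrain` still corresponds to the same record as the matching element of `ytrain`.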

source
BetaML.Utils.accuracyMethod

accuracy(ŷ,y;tol)

Categorical accuracy with probabilistic predictions of a dataset given in terms of a dictionary of probabilities (Dict{T,Float64} vs T).

Parameters:

• ŷ: An array where each item is the estimated probability mass function in terms of a Dictionary(Item1 => Prob1, Item2 => Prob2, ...)
• y: The N array with the correct category for each point $n$.
• tol: The tolerance of the prediction, i.e. whether to consider "correct" only a prediction where the value with the highest probability is the true value (tol = 1), or to consider instead the set of the tol highest-probability values [def: 1].
source
BetaML.Utils.accuracyMethod
accuracy(ŷ,y;tol)

Categorical accuracy with probabilistic prediction of a single datapoint given in terms of a dictionary of probabilities (Dict{T,Float64} vs T).

Parameters:

• ŷ: The returned probability mass function in terms of a Dictionary(Item1 => Prob1, Item2 => Prob2, ...)
• tol: The tolerance of the prediction, i.e. whether to consider "correct" only a prediction where the value with the highest probability is the true value (tol = 1), or to consider instead the set of the tol highest-probability values [def: 1].
source
BetaML.Utils.accuracyMethod

accuracy(ŷ,y;tol,ignoreLabels)

Categorical accuracy with probabilistic predictions of a dataset (PMF vs Int).

Parameters:

• ŷ: An (N,K) matrix of the probabilities that each record $\hat y_n$, with $n \in 1,...,N$, is of category $k$, with $k \in 1,...,K$.
• y: The N array with the correct category for each point $n$.
• tol: The tolerance of the prediction, i.e. whether to consider "correct" only a prediction where the value with the highest probability is the true value (tol = 1), or to consider instead the set of the tol highest-probability values [def: 1].
• ignoreLabels: Whether to ignore the specific label order in y. Useful for unsupervised learning algorithms where the specific label order doesn't make sense [def: false]
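
The role of tol can be illustrated with a minimal sketch in plain Julia for the "PMF vs Int" case. The function names `topk_correct` and `topk_accuracy` are hypothetical, and the ignoreLabels option is not covered.

```julia
# A single record is "correct" if the true class is among the `tol` most probable ones
function topk_correct(probs::AbstractVector, truth::Integer; tol = 1)
    best = sortperm(probs, rev = true)[1:min(tol, length(probs))]
    return truth in best
end

# Share of correct records over an (N,K) matrix of probabilities
function topk_accuracy(ŷ::AbstractMatrix, y::AbstractVector; tol = 1)
    hits = count(topk_correct(ŷ[n, :], y[n]; tol = tol) for n in eachindex(y))
    return hits / length(y)
end

ŷ = [0.7 0.2 0.1;
     0.2 0.5 0.3;
     0.1 0.3 0.6]
y = [1, 3, 2]

acc1 = topk_accuracy(ŷ, y)            # only the most probable class counts
acc2 = topk_accuracy(ŷ, y, tol = 2)   # the two most probable classes count
```

In this example only the first record is predicted correctly with tol = 1, while with tol = 2 the true class of every record falls within the two most probable ones.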
source
BetaML.Utils.accuracyMethod
accuracy(ŷ,y;tol)

Categorical accuracy with probabilistic prediction of a single datapoint (PMF vs Int).

Use the parameter tol [def: 1] to determine the tolerance of the prediction, i.e. whether to consider "correct" only a prediction where the value with the highest probability is the true value (tol = 1), or to consider instead the set of the tol highest-probability values.

source
BetaML.Utils.autoJacobianMethod

autoJacobian(f,x;nY)

Evaluate the Jacobian using AD in the form of a (nY,nX) matrix of first derivatives

Parameters:

• f: The function whose Jacobian is to be computed
• x: The input to the function where the jacobian has to be computed
• nY: The number of outputs of the function f [def: length(f(x))]

Return values:

• An Array{Float64,2} of the locally evaluated Jacobian

Notes:

• The nY parameter is optional. If provided it avoids having to compute f(x)
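
To make the (nY,nX) layout concrete, the following sketch builds the same kind of matrix with central finite differences. Note that this swaps in numerical differentiation in place of the automatic differentiation used by autoJacobian; `fd_jacobian` and its step size `h` are hypothetical names introduced for illustration.

```julia
# Finite-difference Jacobian: row i, column j holds ∂f_i/∂x_j evaluated at x
function fd_jacobian(f, x; nY = length(f(x)), h = 1e-6)
    nX = length(x)
    J  = zeros(nY, nX)
    for j in 1:nX
        step    = zeros(nX)
        step[j] = h
        J[:, j] = (f(x .+ step) .- f(x .- step)) ./ (2h)
    end
    return J
end

f(x) = [x[1]^2 + x[2], 3 * x[2], x[1] * x[2]]   # 2 inputs, 3 outputs
J = fd_jacobian(f, [2.0, 3.0])                  # 3×2 matrix, close to [4 1; 0 3; 3 2]
```

As in autoJacobian, passing `nY` explicitly avoids one extra evaluation of `f(x)` just to find the number of outputs.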
source
BetaML.Utils.batchMethod

batch(n,bSize;sequential=false,rng)

Return a vector of vectors, each holding bSize indices from 1 to n. The indices are assigned randomly unless the optional parameter sequential is used.

Example:

julia> Utils.batch(6,2,sequential=true)
3-element Array{Array{Int64,1},1}:
[1, 2]
[3, 4]
[5, 6]
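
A minimal sketch of the batching logic in plain Julia is given below. `make_batches` is a hypothetical name, and dropping an incomplete trailing batch when n is not a multiple of the batch size is an assumption of this sketch, not a documented behaviour of batch.

```julia
using Random

function make_batches(n, bsize; sequential = false, rng = Random.default_rng())
    ids      = sequential ? collect(1:n) : randperm(rng, n)
    nbatches = n ÷ bsize   # assumption: an incomplete last batch is dropped
    return [ids[(b-1)*bsize+1:b*bsize] for b in 1:nbatches]
end

seq_batches  = make_batches(6, 2, sequential = true)        # [[1, 2], [3, 4], [5, 6]]
rand_batches = make_batches(6, 2, rng = MersenneTwister(1)) # same shape, random order
```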

source
BetaML.Utils.classCountsMethod

classCounts(x;classes=nothing)

Return an (unsorted) vector with the counts of each unique item (element or rows) in a dataset.

If order is important or not all classes are present in the data, a preset vector of classes can be given in the parameter classes

source
BetaML.Utils.crossValidationFunction
crossValidation(f,data,sampler;dims,verbosity,returnStatistics)

Perform crossValidation according to sampler rule by calling the function f and collecting its output

Parameters

• f: The user-defined function that consumes the specific train and validation data and returns something (often the associated validation error). See later
• data: A single n-dimensional array or a vector of them (e.g. X,Y), depending on the tasks required by f.
• sampler: An instance of an AbstractDataSampler, defining the "rules" for sampling at each iteration. [def: KFold(nSplits=5,nRepeats=1,shuffle=true,rng=Random.GLOBAL_RNG) ]
• dims: The dimension over which to perform the crossValidation, i.e. the dimension containing the observations [def: 1]
• verbosity: The verbosity to print information during each iteration (this can also be printed in the f function) [def: STD]
• returnStatistics: Whether crossValidation should return the statistics of the output of f (mean and standard deviation) or the whole outputs [def: true].

Notes

crossValidation works by calling the function f, defined by the user, passing to it the tuple trainData, valData and rng and collecting the result of the function f. The specific way in which trainData and valData are selected at each iteration depends on the specific sampler, with a single 5-fold rule being the default.

This approach is very flexible because the specific model to employ or the metric to use is left within the user-provided function. The only thing that crossValidation does is provide the model defined in the function f with the opportune data (and the random number generator).

Input of the user-provided function

trainData and valData are both themselves tuples. In supervised models, the data passed to crossValidation should be a tuple of (X,Y), and trainData and valData will be equivalent to (xtrain, ytrain) and (xval, yval). In unsupervised models data is a single array, but the training and validation data still need to be accessed as trainData[1] and valData[1].

Output of the user-provided function

The user-defined function can return anything. However, if returnStatistics is left at its default value of true, the user-defined function must return a single scalar (e.g. some error measure) so that the mean and the standard deviation can be returned.

Note that crossValidation can conveniently be employed using the do syntax, as Julia automatically rewrites crossValidation(data,...) do trainData,valData,rng ...user defined body... end as crossValidation((trainData,valData,rng) -> ...user defined body..., data,...)

Example

julia> X = [11:19 21:29 31:39 41:49 51:59 61:69];
julia> Y = [1:9;];
julia> sampler = KFold(nSplits=3);
julia> (μ,σ) = crossValidation([X,Y],sampler) do trainData,valData,rng
(xtrain,ytrain) = trainData; (xval,yval) = valData
trainedModel    = buildForest(xtrain,ytrain,30)
predictions     = predict(trainedModel,xval)
ϵ               = meanRelError(predictions,yval,normRec=false)
return ϵ
end
(0.3202242202242202, 0.04307662219315022)
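
The mechanics described above (split the records into folds, call the user function on each train/validation pair, summarise the outputs) can be sketched in plain Julia without BetaML. `kfold_cv` is a hypothetical name, the interleaved fold assignment is an assumption, and the "model" in the example is a trivial mean predictor used only to keep the sketch self-contained.

```julia
using Random, Statistics

function kfold_cv(f, X, Y; nsplits = 5, rng = Random.default_rng())
    n     = size(X, 1)
    ids   = randperm(rng, n)
    folds = [ids[i:nsplits:n] for i in 1:nsplits]   # disjoint validation sets
    out   = Float64[]
    for val in folds
        train = setdiff(ids, val)
        # The user function receives two tuples and the rng, as in crossValidation
        push!(out, f((X[train, :], Y[train]), (X[val, :], Y[val]), rng))
    end
    return mean(out), std(out)
end

X = [11:19 21:29 31:39]
Y = [1:9;]
(μ, σ) = kfold_cv(X, Y, nsplits = 3, rng = MersenneTwister(1)) do trainData, valData, rng
    (xtrain, ytrain) = trainData; (xval, yval) = valData
    ŷ = fill(mean(ytrain), length(yval))   # trivial "model": predict the training mean
    return mean(abs.(ŷ .- yval))           # validation error for this fold
end
```

The do block becomes the first positional argument `f`, which is why the function argument comes first in the signature.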
source
BetaML.Utils.generateParallelRngsMethod
generateParallelRngs(rng::AbstractRNG, n::Integer;reSeed=false)

For multi-threaded models, return n independent random number generators (one per thread) to be used in threaded computations.

Note that each RNG is a copy of the original RNG. This means that code that uses these RNGs will not change the original RNG state.

Use it with rngs = generateParallelRngs(rng,Threads.nthreads()) to have a separate rng per thread. By default the function doesn't re-seed the RNG, as you may want to have a loop-index-based re-seeding strategy rather than a threadid-based one (to guarantee the same result independently of the number of threads). If you prefer, you can instead re-seed the RNG here (using the parameter reSeed=true), such that each thread has a different seed. Be aware, however, that the stream of numbers generated will then depend on the number of threads at run time.
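
The copy semantics described above can be illustrated with a minimal sketch in plain Julia. `parallel_rngs` is a hypothetical name and the way the new seeds are drawn is an assumption of this sketch, not necessarily what generateParallelRngs does.

```julia
using Random

function parallel_rngs(rng::AbstractRNG, n::Integer; reseed = false)
    rngs = [copy(rng) for _ in 1:n]        # copies: using them leaves `rng` untouched
    if reseed
        seeds = rand(rng, UInt32, n)       # note: drawing the seeds advances `rng`
        for (r, s) in zip(rngs, seeds)
            Random.seed!(r, s)
        end
    end
    return rngs
end

base     = MersenneTwister(42)
rngs     = parallel_rngs(base, 3)
draws    = [rand(r) for r in rngs]         # identical: all copies share the same state
reseeded = parallel_rngs(MersenneTwister(42), 3, reseed = true)
```

Without re-seeding every copy produces the same stream, which is why a re-seeding strategy (here or inside the loop) is needed before using the generators for independent draws.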

source
BetaML.Utils.getPermutationsMethod
getPermutations(v::AbstractArray{T,1};keepStructure=false)

Return a vector of either (a) all possible permutations (uncollected) or (b) just those based on the unique values of the vector

Useful to measure accuracy where you don't care about the actual name of the labels, like in unsupervised classifications (e.g. clustering)

source
BetaML.Utils.getScaleFactorsMethod
getScaleFactors(x;skip)

Return the scale factors (for each dimension) in order to scale a matrix X (n,d) such that each dimension has mean 0 and variance 1.

Parameters

• x: the (n × d) dimension matrix to scale on each dimension d
• skip: an array with the indices of the dimensions for which to skip the scaling [def: []]

Return

• A tuple whose first element is the shift and the second the multiplicative term to make the scale.

source
BetaML.Utils.giniMethod

gini(x)

Calculate the Gini Impurity for a list of items (or rows).

See: https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain
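
For reference, the standard definition of the Gini impurity, one minus the sum of the squared class shares, can be sketched in plain Julia as follows. `gini_impurity` is a hypothetical name used for illustration and handles a vector of items only.

```julia
function gini_impurity(x)
    n      = length(x)
    counts = Dict{eltype(x),Int}()
    for item in x
        counts[item] = get(counts, item, 0) + 1
    end
    # 1 - Σ pᵢ², with pᵢ the share of class i
    return 1 - sum((c / n)^2 for c in values(counts))
end

g_mixed = gini_impurity(["a", "a", "b", "b"])   # two equally frequent classes
g_pure  = gini_impurity(["a", "a", "a"])        # a single class
```

A pure set has impurity 0, while two equally frequent classes give 0.5.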

source
BetaML.Utils.integerDecoderMethod
integerDecoder(x,factors::AbstractVector{T};unique)

Decode an array of integers to an array of T corresponding to the elements of factors

Parameters:

• x: The vector to decode
• factors: The vector of elements to use for the encoding
• unique: Whether factors is already made of unique elements [def: true]

Return:

• A vector of length(x) elements corresponding to the (unique) factors elements at the position x

Example:

julia> integerDecoder([1, 2, 2, 3, 2, 1],["aa","cc","bb"]) # out: ["aa","cc","cc","bb","cc","aa"]
source
BetaML.Utils.integerEncoderMethod
integerEncoder(x;factors=unique(x))

Encode an array of T to an array of integers using their position in the factors vector (defaults to the unique vector of the input array)

Parameters:

• x: The vector to encode
• factors: The vector of factors whose position is the result of the encoding [def: unique(x)]

Return:

• A vector of [1,length(x)] integers corresponding to the position of each element in the factors vector

Note:

• Note that while this function creates an ordered (and sortable) set, it is up to the user to make sure that this "property" is not relied upon in their code if the unencoded data is in fact unordered.

Example:

julia> integerEncoder(["a","e","b","e"],factors=["a","b","c","d","e"]) # out: [1,5,2,5]
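
The relation between encoding and decoding can be sketched in plain Julia with two one-liners. `encode_int` and `decode_int` are hypothetical names used for illustration, not BetaML functions.

```julia
# Position of each element of x in the factors vector
encode_int(x; factors = unique(x)) = [findfirst(==(v), factors) for v in x]

# Element of factors found at each encoded position
decode_int(codes, factors) = [factors[c] for c in codes]

codes   = encode_int(["a", "e", "b", "e"], factors = ["a", "b", "c", "d", "e"])
decoded = decode_int([1, 2, 2, 3, 2, 1], ["aa", "cc", "bb"])
restored = decode_int(codes, ["a", "b", "c", "d", "e"])   # round trip
```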
source
BetaML.Utils.meanDictsMethod

meanDicts(dicts)

Compute the mean of the values of an array of dictionaries.

Given dicts, an array of dictionaries, meanDicts first computes the union of the keys and then averages the values. If the original values are probabilities (non-negative items summing to 1), the result is also a probability distribution.
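
A minimal sketch of this averaging in plain Julia is shown below. `mean_dicts` is a hypothetical name, and treating a key that is missing from a dictionary as having value 0 is an assumption of this sketch (it is what keeps the result a probability distribution).

```julia
function mean_dicts(dicts)
    ks = Set(k for d in dicts for k in keys(d))   # union of all the keys
    # Assumption: a key missing from a dictionary contributes 0 to the average
    return Dict(k => sum(get(d, k, 0.0) for d in dicts) / length(dicts) for k in ks)
end

avg = mean_dicts([Dict("a" => 0.2, "b" => 0.8), Dict("a" => 0.4, "c" => 0.6)])
```

Here "a" averages to 0.3, "b" to 0.4 and "c" to 0.3, so the averaged values still sum to 1.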

source
BetaML.Utils.meanRelErrorMethod

meanRelError(ŷ,y;normDim=true,normRec=true,p=1)

Compute the mean relative error (l-1 based by default) between ŷ and y.

There are many ways to compute a mean relative error. In particular, if normRec (normDim) is set to true, the records (dimensions) are normalised, in the sense that it doesn't matter if a record (dimension) is bigger or smaller than the others: the relative error is first computed for each record (dimension) and then it is averaged. With both normDim and normRec set to false the function returns the relative mean error; with both set to true (default) it returns the mean relative error (i.e. with p=1 the "mean absolute percentage error (MAPE)"). The parameter p [def: 1] controls the p-norm used to define the error.

The mean relative error emphasises the relativeness of the error, i.e. all observations and dimensions weigh the same, whether large or small. Conversely, in the relative mean error the same relative error on larger observations (or dimensions) weighs more.

For example, given y = [1,44,3] and ŷ = [2,45,2], the mean relative error meanRelError(ŷ,y) is 0.452, while the relative mean error meanRelError(ŷ,y, normRec=false) is "only" 0.0625.
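
The two figures in this example can be reproduced with a short sketch in plain Julia for the vector case with p=1. The two function names are hypothetical and cover only this special case.

```julia
# Mean of the per-record relative errors
mean_relative_error(ŷ, y) = sum(abs.(ŷ .- y) ./ abs.(y)) / length(y)

# Total absolute error relative to the total magnitude of y
relative_mean_error(ŷ, y) = sum(abs.(ŷ .- y)) / sum(abs.(y))

y = [1, 44, 3]
ŷ = [2, 45, 2]
mre = mean_relative_error(ŷ, y)   # (1/1 + 1/44 + 1/3) / 3 ≈ 0.452
rme = relative_mean_error(ŷ, y)   # 3 / 48 = 0.0625
```

The absolute error is 1 for every record, but it weighs heavily on the small observations in the first measure and barely at all in the second.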

source
BetaML.Utils.modeMethod

mode(elements,rng)

Given a vector of dictionaries whose values are numerical (e.g. probabilities), a vector of vectors or a matrix, it returns the mode of each element (dictionary, vector or row) in terms of the key or the position.

Use it to return a unique value from a multiclass classifier returning probabilities.

Note:

• If multiple classes have the highest mode, one is returned at random (use the parameter rng to fix the stochasticity)
source
BetaML.Utils.modeMethod

mode(v::AbstractVector{T};rng)

Return the position with the highest value in an array, interpreted as mode (using rand in case of multimodal values)
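
The random tie-breaking can be sketched in plain Julia as follows. `mode_position` and `mode_key` are hypothetical names; they illustrate the idea for a single vector and a single dictionary respectively.

```julia
using Random

# Position of the highest value, picking at random among ties
function mode_position(v::AbstractVector; rng = Random.default_rng())
    candidates = findall(==(maximum(v)), v)
    return rand(rng, candidates)
end

# Key associated with the highest value, picking at random among ties
function mode_key(d::AbstractDict; rng = Random.default_rng())
    m = maximum(values(d))
    candidates = [k for (k, p) in d if p == m]
    return rand(rng, candidates)
end

p1 = mode_position([0.1, 0.7, 0.2])                              # unique mode
p2 = mode_position([0.4, 0.4, 0.2], rng = MersenneTwister(1))    # tie between 1 and 2
k1 = mode_key(Dict("a" => 0.2, "b" => 0.8))
```

Passing a seeded `rng` makes the choice among tied classes reproducible.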

source
BetaML.Utils.mseMethod
mse(ŷ,y)

Compute the mean squared error (MSE) (aka mean squared deviation - MSD) between two vectors ŷ and y. Note that while the deviation is averaged by the length of y, it is not scaled to give it a relative meaning.

source
BetaML.Utils.oneHotEncoderMethod
oneHotEncoder(x;d,factors,count)

Encode arrays (or arrays of arrays) of categorical data as matrices of one column per factor.

The case of arrays of arrays is for when at each record you have more than one categorical output. You can then decide to encode just the presence of the factors or their counting

Parameters:

• x: The data to convert (array or array of arrays)
• d: The number of dimensions in the output matrix [def: maximum(x) for integers and length(factors) otherwise]
• factors: The factors from which to encode [def: 1:d for integer x or unique(x) otherwise]
• count: Whether to count multiple instances on the same dimension/record (true) or indicate just presence. [def: false]

Examples

julia> oneHotEncoder(["a","c","c"],factors=["a","b","c","d"])
3×4 Matrix{Int64}:
1  0  0  0
0  0  1  0
0  0  1  0
julia> oneHotEncoder([2,4,4])
3×4 Matrix{Int64}:
0  1  0  0
0  0  0  1
0  0  0  1
julia> oneHotEncoder([[2,2,1],[2,4,4]],count=true)
2×4 Matrix{Int64}:
1  2  0  0
0  1  0  2
source
BetaML.Utils.pcaMethod

pca(X;K,error)

Perform Principal Component Analysis returning the matrix reprojected among the dimensions of maximum variance.

Parameters:

• X : The (N,D) data to reproject
• K : The number of dimensions to maintain (with K<=D) [def: nothing]
• error: The maximum approximation error that we are willing to accept [def: 0.05]

Return:

• A named tuple with:
• X: The reprojected (NxK) matrix with the column dimensions organized in descending order of the proportion of explained variance
• K: The number of dimensions retained
• error: The actual proportion of variance not explained in the reprojected dimensions
• P: The (D,K) matrix of the eigenvectors associated to the K-largest eigenvalues used to reproject the data matrix
• explVarByDim: An array of dimensions D with the share of the cumulative variance explained by dimensions (the last element being always 1.0)

Notes:

• If K is provided, the parameter error has no effect.
• If one doesn't know a priori the error that she/he is willing to accept, nor the desired number of dimensions, he/she can run this pca function with out = pca(X,K=size(X,2)) (i.e. with K=D), analyse the proportions of explained cumulative variance by dimension in out.explVarByDim, choose the number of dimensions K according to his/her needs and finally pick from the reprojected matrix only the number of dimensions needed, i.e. out.X[:,1:K].

Example:

julia> X = [1 10 100; 1.1 15 120; 0.95 23 90; 0.99 17 120; 1.05 8 90; 1.1 12 95]
6×3 Matrix{Float64}:
1.0   10.0  100.0
1.1   15.0  120.0
0.95  23.0   90.0
0.99  17.0  120.0
1.05   8.0   90.0
1.1   12.0   95.0
julia> X = pca(X,error=0.05).X
6×2 Matrix{Float64}:
100.449    3.1783
120.743    6.80764
91.3551  16.8275
120.878    8.80372
90.3363   1.86179
95.5965   5.51254
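
The steps behind such a reprojection can be sketched with the standard library alone. This is an illustrative sketch, not BetaML's implementation: `pca_sketch` is a hypothetical name, the data is projected without centring (an assumption suggested by the magnitudes in the output above), and the sign of each component may differ from the output shown.

```julia
using LinearAlgebra, Statistics

function pca_sketch(X; K = nothing, error = 0.05)
    E     = eigen(Symmetric(cov(X)))          # eigen-decomposition of the (D,D) covariance
    order = sortperm(E.values, rev = true)    # largest variance first
    λ     = E.values[order]
    V     = E.vectors[:, order]
    explained = cumsum(λ) ./ sum(λ)           # cumulative share of explained variance
    # Smallest number of dimensions whose unexplained variance is within `error`
    k = K === nothing ? findfirst(>=(1 - error), explained) : K
    P = V[:, 1:k]
    return (X = X * P, K = k, error = 1 - explained[k], P = P, explVarByDim = explained)
end

X   = [1 10 100; 1.1 15 120; 0.95 23 90; 0.99 17 120; 1.05 8 90; 1.1 12 95]
out = pca_sketch(X, error = 0.05)
```

With K given explicitly the `error` argument is ignored, mirroring the note above.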
source
BetaML.Utils.polynomialKernelMethod

Polynomial kernel parametrised with c=0 and d=2 (i.e. a quadratic kernel). For other cᵢ and dᵢ use K = (x,y) -> polynomialKernel(x,y,c=cᵢ,d=dᵢ) as kernel function in the supporting algorithms

source
BetaML.Utils.pool1dFunction
pool1d(x,poolSize=2;f=mean)

Apply function f to a rolling window of poolSize contiguous (in 1d) neurons.

Applicable to VectorFunctionLayer, e.g. layer2 = VectorFunctionLayer(nₗ,f=(x->pool1d(x,4,f=mean))). Attention: to apply this function as an activation function in a neural network you will need Julia version >= 1.6, otherwise you may experience a segmentation fault (see this bug report)

source
BetaML.Utils.radialKernelMethod

Radial Kernel (aka RBF kernel) parametrised with γ=1/2. For other gammas γᵢ use K = (x,y) -> radialKernel(x,y,γ=γᵢ) as kernel function in the supporting algorithms
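
For reference, the textbook forms of the two kernels can be sketched in plain Julia as below. The function names are hypothetical, and it is an assumption of this sketch that BetaML's kernels follow exactly these parametrisations.

```julia
using LinearAlgebra

# Polynomial kernel: (x⋅y + c)^d
poly_kernel(x, y; c = 0, d = 2) = (dot(x, y) + c)^d

# Radial (RBF) kernel: exp(-γ‖x - y‖²)
rbf_kernel(x, y; γ = 1 / 2) = exp(-γ * norm(x .- y)^2)

# Fixing other parameters with an anonymous function, as suggested above
cubic  = (x, y) -> poly_kernel(x, y, c = 1, d = 3)
narrow = (x, y) -> rbf_kernel(x, y, γ = 2.0)

k_quad  = poly_kernel([1, 2], [3, 4])   # (1*3 + 2*4)^2 = 121
k_cubic = cubic([1, 2], [3, 4])         # (11 + 1)^3 = 1728
k_same  = rbf_kernel([1.0, 2.0], [1.0, 2.0])
```

Wrapping the kernel in an anonymous function keeps the two-argument (x,y) form that the supporting algorithms expect.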

source
BetaML.Utils.scaleFunction
scale(x,scaleFactors;rev)

Perform a linear scaling of x using scaling factors scaleFactors.

Parameters

• x: The (n × d) dimension matrix to scale on each dimension d
• scaleFactors: A tuple of the constant and multiplicative scaling factor respectively [def: the scaling factors needed to scale x to mean 0 and variance 1]

• rev: Whether to invert the scaling [def: false]

Return

• The scaled matrix

Notes:

• Also available scale!(x,scaleFactors) for in-place scaling.
• Retrieve the scale factors with the getScaleFactors() function
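
How a (shift, multiplicative term) tuple can drive both the scaling and its inversion is sketched below in plain Julia. `scale_factors` and `apply_scale` are hypothetical names; applying the shift before the multiplication and using the population standard deviation are assumptions of this sketch.

```julia
using Statistics

function scale_factors(x; skip = Int[])
    shift = -mean(x, dims = 1)                        # constant term, one per column
    mult  = 1 ./ std(x, dims = 1, corrected = false)  # multiplicative term
    for d in skip                                     # leave the skipped columns untouched
        shift[d] = 0
        mult[d]  = 1
    end
    return (shift, mult)
end

# Forward: (x + shift) * mult. Reverse: x / mult - shift
apply_scale(x, (shift, mult); rev = false) = rev ? x ./ mult .- shift : (x .+ shift) .* mult

x  = [1.0 10.0; 2.0 20.0; 3.0 60.0]
sf = scale_factors(x)
xs = apply_scale(x, sf)                  # each column now has mean 0 and variance 1
xr = apply_scale(xs, sf, rev = true)     # back to the original values
```

Keeping the factors around is what allows new data to be scaled consistently with the training data and predictions to be brought back to the original scale.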
source
BetaML.Utils.squaredCostMethod

squaredCost(ŷ,y)

Compute the squared costs between a vector of prediction and one of observations as (1/2)*norm(y - ŷ)^2.

Aside from the 1/2 term, it corresponds to the squared l-2 norm distance, and when it is averaged over multiple datapoints it corresponds to the Mean Squared Error (MSE). It is mostly used for regression problems.

source
Random.shuffleMethod
shuffle(data;dims,rng)

Shuffle a vector of n-dimensional arrays across dimension dims keeping the same order between the arrays

Parameters

• data: The vector of arrays to shuffle
• dims: The dimension over which to apply the shuffle [def: 1]
• rng: An AbstractRNG to apply for the shuffle

Notes

• All the arrays must have the same size for the dimension to shuffle

Example

julia> a = [1 2 30; 10 20 30]; b = [100 200 300];
julia> (aShuffled, bShuffled) = shuffle([a,b],dims=2)
2-element Vector{Matrix{Int64}}:
[1 30 2; 10 30 20]
[100 300 200]
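
The key point, a single permutation applied to every array along the chosen dimension, can be sketched in plain Julia as follows. `shuffle_together` is a hypothetical name used for illustration.

```julia
using Random

function shuffle_together(arrays; dims = 1, rng = Random.default_rng())
    n = size(arrays[1], dims)
    all(size(a, dims) == n for a in arrays) || error("arrays differ in size along dims")
    perm = randperm(rng, n)   # one permutation shared by all the arrays
    # Index with `perm` on dimension `dims` and with `:` on all the others
    return [a[ntuple(k -> k == dims ? perm : Colon(), ndims(a))...] for a in arrays]
end

a = [1 2 3; 10 20 30]
b = [100 200 300]
(a_shuffled, b_shuffled) = shuffle_together([a, b], dims = 2, rng = MersenneTwister(1))
```

Because the permutation is shared, column j of `a_shuffled` and column j of `b_shuffled` still refer to the same original column.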

source