The BetaML.Imputation Module

BetaML.Imputation - Module
Imputation module

Provides various imputation methods for missing data. Note that the interpretation of "missing" can be very wide. For example, recommendation systems / collaborative filtering (e.g. suggesting a film to watch) can be represented as a missing-data imputation problem, often with better results than traditional algorithms such as k-nearest neighbors (KNN).

Provided imputers:

  • SimpleImputer: Impute data using the feature (column) mean, optionally normalised by l-norms of the records (rows) (fastest)
  • GaussianMixtureImputer: Impute data using a Generative (Gaussian) Mixture Model (good trade off)
  • RandomForestImputer: Impute missing data using Random Forests, with optional replicable multiple imputations (most accurate).
  • GeneralImputer: Impute missing data using a vector (one per column) of arbitrary learning models (classifiers/regressors) that implement m = Model([options]), fit!(m,X,Y) and predict(m,X) (not necessarily from BetaML).

Imputations for all these models can be obtained by running mod = ImputatorModel([options]), fit!(mod,X). The data with the missing values imputed can then be obtained with predict(mod). Use info(m::Imputer) to retrieve further information concerning the imputation. Trained models can also be used to impute missing values in new data with predict(mod,xNew). Note that if multiple imputations are run (for the imputers that support them), predict() will return a vector of predictions rather than a single one.

Example

julia> using Statistics, BetaML

julia> X            = [2 missing 10; 2000 4000 1000; 2000 4000 10000; 3 5 12 ; 4 8 20; 1 2 5]
6×3 Matrix{Union{Missing, Int64}}:
    2      missing     10
 2000  4000          1000
 2000  4000         10000
    3     5            12
    4     8            20
    1     2             5

julia> mod          = RandomForestImputer(multiple_imputations=10,  rng=copy(FIXEDRNG));

julia> fit!(mod,X);

julia> vals         = predict(mod)
10-element Vector{Matrix{Union{Missing, Int64}}}:
 [2 3 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 136 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
 [2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]

julia> nR,nC        = size(vals[1])
(6, 3)

julia> medianValues = [median([v[r,c] for v in vals]) for r in 1:nR, c in 1:nC]
6×3 Matrix{Float64}:
    2.0     4.0     10.0
 2000.0  4000.0   1000.0
 2000.0  4000.0  10000.0
    3.0     5.0     12.0
    4.0     8.0     20.0
    1.0     2.0      5.0

julia> infos        = info(mod);

julia> infos["n_imputed_values"]
1
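
The example above imputes the training matrix itself. As a complementary sketch, the snippet below applies an already fitted imputer to new data with predict(mod,xNew), as described in the module overview. The matrices X_train and X_new and the variable names are made-up illustration data, and SimpleImputer is used only because it is the quickest to fit. No output is shown, since the exact values may depend on the library version.

```julia
using BetaML

# Hypothetical training data: one missing value in the second column.
X_train = [2.0 missing 10.0; 20.0 40.0 100.0; 3.0 5.0 12.0]

# Fit the imputer; as in the examples of this page, fit! returns the imputed training matrix.
imputer = SimpleImputer()
X_train_full = fit!(imputer, X_train)

# Number of values imputed while fitting.
n_imputed = info(imputer)["n_imputed_values"]

# Hypothetical new records with missing values, imputed with the already fitted model.
X_new = [missing 30.0 50.0; 4.0 missing 20.0]
X_new_full = predict(imputer, X_new)
```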

Module Index

Detailed API

BetaML.Imputation.GaussianMixtureImputer - Type
mutable struct GaussianMixtureImputer <: Imputer

Missing data imputer that uses a Generative (Gaussian) Mixture Model.

For the parameters (n_classes,mixtures,..) see GaussianMixture_hp.

Limitations:

  • data must be numerical
  • the resulting matrix is a Matrix{Float64}
  • currently the available mixtures do not support random initialisation for missing-data imputation, and the rest of the algorithm (Expectation-Maximisation) is deterministic, so there is no random component involved (i.e. no multiple imputations)

Example:

julia> using BetaML

julia> X = [1 2.5; missing 20.5; 0.8 18; 12 22.8; 0.4 missing; 1.6 3.7];

julia> mod = GaussianMixtureImputer(mixtures=[SphericalGaussian() for i in 1:2])
GaussianMixtureImputer - A Gaussian Mixture Model based imputer (unfitted)

julia> X_full = fit!(mod,X)
Iter. 1:        Var. of the post  2.373498171519511       Log-likelihood -29.111866299189792
6×2 Matrix{Float64}:
  1.0       2.5
  6.14905  20.5
  0.8      18.0
 12.0      22.8
  0.4       4.61314
  1.6       3.7

julia> info(mod)
Dict{String, Any} with 7 entries:
  "xndims"           => 2
  "error"            => [2.3735, 0.17527, 0.0283747, 0.0053147, 0.000981885]
  "AIC"              => 57.798
  "fitted_records"   => 6
  "lL"               => -21.899
  "n_imputed_values" => 2
  "BIC"              => 56.3403

julia> parameters(mod)
BetaML.Imputation.GaussianMixtureImputer_lp (a BetaMLLearnableParametersSet struct)
- mixtures: AbstractMixture[SphericalGaussian{Float64}([1.0179819950570768, 3.0999990977255845], 0.2865287884295908), SphericalGaussian{Float64}([6.149053737674149, 20.43331198167713], 15.18664378248651)]
- initial_probmixtures: [0.48544987084082347, 0.5145501291591764]
- probRecords: [0.9999996039918224 3.9600817749531375e-7; 2.3866922376272767e-229 1.0; … ; 0.9127030246369684 0.08729697536303167; 0.9999965964161501 3.403583849794472e-6]
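
The imputed values in this example can be checked by hand from the printed parameters. The sketch below is an inference from the output above, not taken from the library source: it assumes that a missing entry is filled with the mixture means of that dimension, weighted by the record's posterior probabilities (probRecords). The numbers are copied from the parameters(mod) output; the variable names are illustrative.

```julia
# Means of the two fitted mixtures, copied from the parameters(mod) output.
means = [[1.0179819950570768, 3.0999990977255845],    # mixture 1
         [6.149053737674149, 20.43331198167713]]      # mixture 2

# Posterior probabilities of record 5 (the row [0.4 missing]), from probRecords.
post_row5 = [0.9127030246369684, 0.08729697536303167]

# Record 5 misses its second dimension: weight each mixture mean in that
# dimension by the posterior probability of the mixture.
missing_dim = 2
imputed = sum(post_row5[k] * means[k][missing_dim] for k in eachindex(means))

println(round(imputed, digits=5))   # agrees with the 4.61314 shown in X_full
```

The same reasoning gives 6.14905 for record 2, whose posterior puts essentially all the weight on the second mixture.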

BetaML.Imputation.GeneralI_hp - Type
mutable struct GeneralI_hp <: BetaMLHyperParametersSet

Hyperparameters for GeneralImputer

Parameters:

  • cols_to_impute: Columns in the matrix for which to create an imputation model, i.e. to impute. It can be a vector of column IDs (positions), or the keywords "auto" (default) or "all". With "auto" the model automatically detects the columns with missing data and imputes only them. Specify the columns manually, or use "all", if you want to create an imputation model for columns that have no missing values in the training data, so that the trained model can later be applied to new data that may have missing values in those columns.

  • estimator: An estimator model (regressor or classifier), optionally with its options (hyper-parameters), to be used to impute the various columns of the matrix. It can also be a cols_to_impute-length vector of different estimators, to use a different estimator for each column (dimension) to impute, for example when some columns are categorical (and hence require a classifier) and others are numerical (hence requiring a regressor). [default: nothing, i.e. use BetaML random forests, handling classification and regression jobs automatically].

  • missing_supported: Whether the estimator(s) used to predict the missing data themselves support missing data in the training features (X). If not, when the model for a certain dimension is fitted, dimensions with missing data in the same rows as those where imputation is needed are dropped, and then only the non-missing rows in the remaining dimensions are considered. It can be a vector of boolean values to specify this property for each individual estimator, or a single boolean value to apply to all the estimators [default: false]

  • fit_function: The function used by the estimator(s) to fit the model. It should take as first argument the model itself, as second argument a matrix representing the features, and as third argument a vector representing the labels. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.fit!]

  • predict_function: The function used by the estimator(s) to predict the labels. It should take as first argument the model itself and as second argument a matrix representing the features. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.predict]

  • recursive_passages: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute in multiple columns. The order of the first passage is given by the decreasing number of missing values per column; the other passages are in random order [default: 1].

  • multiple_imputations: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
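
The column handling described by cols_to_impute and recursive_passages can be pictured in plain Julia, without calling BetaML. The sketch below detects the columns that contain missing values (the "auto" behaviour) and sorts them by decreasing number of missing values, which is the stated order of the first passage. The matrix and all variable names are illustrative assumptions, and the library's internal implementation may differ.

```julia
# Hypothetical data: column 1 has 1 missing value, column 2 has 3, column 3 has 2.
X = [1 missing missing; missing missing 3; 4 missing 5; 6 7 missing]

# Count the missing values in each column.
n_missing = vec(sum(ismissing.(X), dims=1))

# "auto": keep only the columns that actually contain missing values.
cols = findall(>(0), n_missing)

# First passage: process the columns with the most missing values first.
first_passage = cols[sortperm(n_missing[cols], rev=true)]

println(n_missing)       # [1, 3, 2]
println(first_passage)   # [2, 3, 1]
```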

BetaML.Imputation.GeneralImputer - Type
mutable struct GeneralImputer <: Imputer

Impute missing values using arbitrary learning models.

Impute missing values using any arbitrary learning model (classifier or regressor, not necessarily from BetaML) that implements an interface m = Model([options]), fit!(m,X,Y) and predict(m,X). For non-BetaML supervised models the actual training and prediction functions must be specified in the fit_function and predict_function parameters respectively. If needed (for example when some columns with missing data are categorical and some numerical) different models can be specified for each column. Multiple imputations and multiple "passages" through the various columns for a single imputation are supported.

See GeneralI_hp for all the hyper-parameters.

Examples:

  • Using BetaML models:
julia> using BetaML
julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
 1.4        2.5       "a"
  missing  20.5       "b"
 0.6       18         missing
 0.7       22.8       "b"
 0.4         missing  "b"
 1.6        3.7       "a"

julia> mod = GeneralImputer(recursive_passages=2,multiple_imputations=2)
GeneralImputer - A imputer based on an arbitrary regressor/classifier(unfitted)

julia> mX_full = fit!(mod,X);
** Processing imputation 1
** Processing imputation 2

julia> mX_full[1]
6×3 Matrix{Any}:
 1.4        2.5     "a"
 0.546722  20.5     "b"
 0.6       18       "b"
 0.7       22.8     "b"
 0.4       19.8061  "b"
 1.6        3.7     "a"

julia> mX_full[2]
6×3 Matrix{Any}:
 1.4        2.5     "a"
 0.554167  20.5     "b"
 0.6       18       "b"
 0.7       22.8     "b"
 0.4       20.7551  "b"
 1.6        3.7     "a"

julia> info(mod)
Dict{String, Any} with 1 entry:
  "n_imputed_values" => 3
 
  • Using third party packages (in this example DecisionTree):
julia> using BetaML
julia> import DecisionTree
julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
 1.4        2.5       "a"
  missing  20.5       "b"
 0.6       18         missing
 0.7       22.8       "b"
 0.4         missing  "b"
 1.6        3.7       "a"
julia> mod = GeneralImputer(estimator=[DecisionTree.DecisionTreeRegressor(),DecisionTree.DecisionTreeRegressor(),DecisionTree.DecisionTreeClassifier()], fit_function = DecisionTree.fit!, predict_function=DecisionTree.predict, recursive_passages=2)
GeneralImputer - A imputer based on an arbitrary regressor/classifier(unfitted)
julia> X_full = fit!(mod,X)
** Processing imputation 1
6×3 Matrix{Any}:
 1.4    2.5  "a"
 0.94  20.5  "b"
 0.6   18    "b"
 0.7   22.8  "b"
 0.4   13.5  "b"
 1.6    3.7  "a"

BetaML.Imputation.RandomForestI_hp - Type
mutable struct RandomForestI_hp <: BetaMLHyperParametersSet

Hyperparameters for RandomForestImputer

Parameters:

  • rfhpar::Any: Parameters of the underlying random forest algorithm (n_trees, max_depth, min_gain, min_records, max_features, splitting_criterion, β, initialisation_strategy, oob and rng); see RandomForestE_hp for their description.

  • forced_categorical_cols::Vector{Int64}: Specify the positions of the integer columns to treat as categorical instead of cardinal. [Default: empty vector (all numerical cols are treated as cardinal by default and the others as categorical)]

  • recursive_passages::Int64: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute in multiple columns. The order of the first passage is given by the decreasing number of missing values per column; the other passages are in random order [default: 1].

  • multiple_imputations::Int64: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).

  • cols_to_impute::Union{String, Vector{Int64}}: Columns in the matrix for which to create an imputation model, i.e. to impute. It can be a vector of column IDs (positions), or the keywords "auto" (default) or "all". With "auto" the model automatically detects the columns with missing data and imputes only them. Specify the columns manually, or use "all", if you want to create an imputation model for columns that have no missing values in the training data, so that the trained model can later be applied to new data that may have missing values in those columns.

Example:

julia> mod = RandomForestImputer(n_trees=20,max_depth=10,recursive_passages=3)
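
As a further sketch, forced_categorical_cols can be used when an integer column holds category codes rather than quantities. The data below are made up for illustration, and the variable names are assumptions; the snippet only uses the constructor keywords listed above together with fit! and FIXEDRNG as they appear elsewhere on this page. No output is shown, because the imputed values depend on the random forest.

```julia
using BetaML

# Hypothetical integer data: the third column holds category codes (1 or 2),
# not quantities, so it should not be treated as cardinal.
X = [12 10 1; 14 12 1; 51 50 2; 53 missing 2; 13 11 missing; 52 52 2]

# Treat column 3 as categorical; a fixed RNG keeps the run replicable.
imputer = RandomForestImputer(n_trees=20, forced_categorical_cols=[3], rng=copy(FIXEDRNG))

# fit! returns the matrix with the missing values imputed.
X_full = fit!(imputer, X)
```

With the third column forced to categorical, its imputed entry is expected to be one of the codes already present in that column rather than an intermediate number.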

BetaML.Imputation.RandomForestImputer - Type
mutable struct RandomForestImputer <: Imputer

Impute missing data using Random Forests, with optional replicable multiple imputations.

See RandomForestI_hp, RandomForestE_hp and BML_options for the parameters.

Notes:

  • Given a certain RNG and its status (e.g. RandomForestImputer(...,rng=StableRNG(FIXEDSEED))), the algorithm is completely deterministic, i.e. replicable.
  • The algorithm accepts virtually any kind of data, sortable or not

Example:

julia> using BetaML

julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
 1.4        2.5       "a"
  missing  20.5       "b"
 0.6       18         missing
 0.7       22.8       "b"
 0.4         missing  "b"
 1.6        3.7       "a"

julia> mod = RandomForestImputer(n_trees=20,max_depth=10,recursive_passages=2)
RandomForestImputer - A Random-Forests based imputer (unfitted)

julia> X_full = fit!(mod,X)
** Processing imputation 1
6×3 Matrix{Any}:
 1.4        2.5     "a"
 0.504167  20.5     "b"
 0.6       18       "b"
 0.7       22.8     "b"
 0.4       20.0837  "b"
 1.6        3.7     "a"

BetaML.Imputation.SimpleI_hp - Type
mutable struct SimpleI_hp <: BetaMLHyperParametersSet

Hyperparameters for the SimpleImputer model

Parameters:

  • statistic::Function: The descriptive statistic of the column (feature) to use as imputed value [def: mean]

  • norm::Union{Nothing, Int64}: Normalise the feature mean by the l-norm of the records, where norm is the order of the norm [default: nothing]. Use it (e.g. norm=1 to use the l-1 norm) if the records are highly heterogeneous (e.g. quantity exports of different countries).

BetaML.Imputation.SimpleImputer - Type
mutable struct SimpleImputer <: Imputer

Simple imputer using the missing data's feature (column) statistic (def: mean), optionally normalised by l-norms of the records (rows)

Parameters:

  • statistic: The descriptive statistic of the column (feature) to use as imputed value [def: mean]
  • norm: Normalise the feature mean by the l-norm of the records, where norm is the order of the norm [default: nothing]. Use it (e.g. norm=1 to use the l-1 norm) if the records are highly heterogeneous (e.g. quantity exports of different countries).

Limitations:

  • data must be numerical

Example:

julia> using BetaML

julia> X = [2.0 missing 10; 20 40 100]
2×3 Matrix{Union{Missing, Float64}}:
  2.0    missing   10.0
 20.0  40.0       100.0

julia> mod = SimpleImputer(norm=1)
SimpleImputer - A simple feature-stat based imputer (unfitted)

julia> X_full = fit!(mod,X)
2×3 Matrix{Float64}:
  2.0   4.04494   10.0
 20.0  40.0      100.0

julia> info(mod)
Dict{String, Any} with 1 entry:
  "n_imputed_values" => 1

julia> parameters(mod)
BetaML.Imputation.SimpleImputer_lp (a BetaMLLearnableParametersSet struct)
- cStats: [11.0, 40.0, 55.0]
- norms: [6.0, 53.333333333333336]
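
The effect of norm=1 in this example can be reproduced with plain Julia. The sketch below is inferred from the printed cStats, norms and imputed value, not from the library source: it assumes that each record's norm is the l-1 norm of its non-missing entries divided by their count, and that a missing cell is filled with the column statistic scaled by the record's share of the summed norms. All variable names are illustrative.

```julia
using Statistics

X = [2.0 missing 10.0; 20.0 40.0 100.0]

# Column means over the non-missing entries (the printed cStats).
col_stats = [mean(skipmissing(col)) for col in eachcol(X)]

# l-1 norm of each record over its non-missing entries, divided by their count
# (the printed norms).
row_norms = [sum(abs, skipmissing(row)) / count(!ismissing, row) for row in eachrow(X)]

# Fill each missing cell with the column mean times the record's share of the summed norms.
X_full = [ismissing(X[r, c]) ? col_stats[c] * row_norms[r] / sum(row_norms) : X[r, c]
          for r in axes(X, 1), c in axes(X, 2)]

println(col_stats)                      # [11.0, 40.0, 55.0]
println(round(X_full[1, 2], digits=5))  # 4.04494, as in the example above
```

This is why the imputed value (about 4.04) is much smaller than the plain column mean (40.0): the first record is small in all its observed entries, so its share of the total norm is small.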