The BetaML.Imputation Module
BetaML.Imputation — Module
Imputation module
Provides various imputation methods for missing data. Note that the interpretation of "missing" can be very wide. For example, recommendation systems / collaborative filtering (e.g. suggesting which film to watch) can be represented as a missing-data imputation problem, often with better results than traditional algorithms such as k-nearest neighbors (KNN).
Provided imputers:
- SimpleImputer: Impute data using the feature (column) mean, optionally normalised by l-norms of the records (rows) (fastest)
- GaussianMixtureImputer: Impute data using a Generative (Gaussian) Mixture Model (good trade off)
- RandomForestImputer: Impute missing data using Random Forests, with optional replicable multiple imputations (most accurate).
- GeneralImputer: Impute missing data using a vector (one per column) of arbitrary learning models (classifiers/regressors) that implement m = Model([options]), fit!(m,X,Y) and predict(m,X) (not necessarily from BetaML).
Imputations for all these models can be obtained by running mod = ImputatorModel([options]), fit!(mod,X). The data with the missing values imputed can then be obtained with predict(mod). Use info(m::Imputer) to retrieve further information concerning the imputation. Trained models can also be used to impute missing values in new data with predict(mod,xNew). Note that if multiple imputations are run (for the imputers that support them), predict() will return a vector of predictions rather than a single one.
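As a minimal sketch of the workflow just described, the snippet below fits an imputer on one matrix and then reuses it on new data. It assumes that SimpleImputer can be constructed with no arguments (all defaults) and that the new data must have the same number of columns as the training data; the variable names are illustrative only.

```julia
using BetaML

# Training data with one missing value
X = [2.0 missing 10; 20 40 100]

mod = SimpleImputer()             # assumption: default options (column mean, no normalisation)
fit!(mod, X)                      # learn the column statistics from X
X_full = predict(mod)             # training data with the missing values imputed

# New data with the same number of columns as X (assumed requirement)
X_new = [3.0 missing 12.0]
X_new_full = predict(mod, X_new)  # impute the new data with the trained model

info(mod)["n_imputed_values"]     # number of values imputed during fitting
```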
Example
julia> using Statistics, BetaML
julia> X = [2 missing 10; 2000 4000 1000; 2000 4000 10000; 3 5 12 ; 4 8 20; 1 2 5]
6×3 Matrix{Union{Missing, Int64}}:
2 missing 10
2000 4000 1000
2000 4000 10000
3 5 12
4 8 20
1 2 5
julia> mod = RandomForestImputer(multiple_imputations=10, rng=copy(FIXEDRNG));
julia> fit!(mod,X);
julia> vals = predict(mod)
10-element Vector{Matrix{Union{Missing, Int64}}}:
[2 3 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 136 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 4 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
[2 137 10; 2000 4000 1000; … ; 4 8 20; 1 2 5]
julia> nR,nC = size(vals[1])
(6, 3)
julia> medianValues = [median([v[r,c] for v in vals]) for r in 1:nR, c in 1:nC]
6×3 Matrix{Float64}:
2.0 4.0 10.0
2000.0 4000.0 1000.0
2000.0 4000.0 10000.0
3.0 5.0 12.0
4.0 8.0 20.0
1.0 2.0 5.0
julia> infos = info(mod);
julia> infos["n_imputed_values"]
1
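In the example above the ten imputations disagree on the single missing entry (row 1, column 2): some runs return a value close to 4, others a value close to 137. The sketch below uses only the values printed above and the Statistics standard library to compare the median with the mean as a pooling rule. It illustrates why a robust statistic fits this example; it is not a recommendation taken from the library.

```julia
using Statistics

# Imputed values for row 1, column 2, copied from the ten matrices printed above
imputed = [3, 4, 4, 136, 137, 4, 4, 4, 137, 137]

pooled_median = median(imputed)  # 4.0: most runs agree on a value near 4
pooled_mean   = mean(imputed)    # 57.0: pulled towards the runs that returned ~137
spread        = extrema(imputed) # (3, 137): range across the imputations
```

The mean lands on a value that none of the individual imputations actually proposed, whereas the median stays with the majority of the runs.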
Module Index
BetaML.Imputation.GaussianMixtureImputer
BetaML.Imputation.GeneralI_hp
BetaML.Imputation.GeneralImputer
BetaML.Imputation.RandomForestI_hp
BetaML.Imputation.RandomForestImputer
BetaML.Imputation.SimpleI_hp
BetaML.Imputation.SimpleImputer
Detailed API
BetaML.Imputation.GaussianMixtureImputer — Type
mutable struct GaussianMixtureImputer <: Imputer
Missing data imputer that uses a Generative (Gaussian) Mixture Model.
For the parameters (n_classes, mixtures, ...) see GaussianMixture_hp.
Limitations:
- data must be numerical
- the resulting matrix is a Matrix{Float64}
- the mixtures currently available do not support random initialisation for missing-data imputation, and the rest of the algorithm (Expectation-Maximisation) is deterministic, so there is no random component involved (i.e. no multiple imputations)
Example:
julia> using BetaML
julia> X = [1 2.5; missing 20.5; 0.8 18; 12 22.8; 0.4 missing; 1.6 3.7];
julia> mod = GaussianMixtureImputer(mixtures=[SphericalGaussian() for i in 1:2])
GaussianMixtureImputer - A Gaussian Mixture Model based imputer (unfitted)
julia> X_full = fit!(mod,X)
Iter. 1: Var. of the post 2.373498171519511 Log-likelihood -29.111866299189792
6×2 Matrix{Float64}:
1.0 2.5
6.14905 20.5
0.8 18.0
12.0 22.8
0.4 4.61314
1.6 3.7
julia> info(mod)
Dict{String, Any} with 7 entries:
"xndims" => 2
"error" => [2.3735, 0.17527, 0.0283747, 0.0053147, 0.000981885]
"AIC" => 57.798
"fitted_records" => 6
"lL" => -21.899
"n_imputed_values" => 2
"BIC" => 56.3403
julia> parameters(mod)
BetaML.Imputation.GaussianMixtureImputer_lp (a BetaMLLearnableParametersSet struct)
- mixtures: AbstractMixture[SphericalGaussian{Float64}([1.0179819950570768, 3.0999990977255845], 0.2865287884295908), SphericalGaussian{Float64}([6.149053737674149, 20.43331198167713], 15.18664378248651)]
- initial_probmixtures: [0.48544987084082347, 0.5145501291591764]
- probRecords: [0.9999996039918224 3.9600817749531375e-7; 2.3866922376272767e-229 1.0; … ; 0.9127030246369684 0.08729697536303167; 0.9999965964161501 3.403583849794472e-6]
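The printed parameters are consistent with reading each imputed entry as the average of the mixture means, weighted by the record's posterior probabilities in probRecords. This is an interpretation of the output above, not a statement about the library internals. The sketch below checks it for the missing entry in row 5, column 2, using only numbers copied from the output.

```julia
# Mixture means copied from parameters(mod) above (one vector per mixture)
means = [[1.0179819950570768, 3.0999990977255845],
         [6.149053737674149, 20.43331198167713]]

# Posterior probabilities of record 5, copied from probRecords above
post_row5 = [0.9127030246369684, 0.08729697536303167]

# Posterior-weighted average of the second coordinate of the mixture means
imputed_5_2 = sum(post_row5[k] * means[k][2] for k in 1:2)
```

The result agrees with the 4.61314 shown in X_full to the printed precision. The same reading also matches row 2, whose posterior is essentially 1 for the second mixture and whose imputed value 6.14905 is that mixture's first coordinate.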
BetaML.Imputation.GeneralI_hp — Type
mutable struct GeneralI_hp <: BetaMLHyperParametersSet
Hyperparameters for GeneralImputer
Parameters:
- cols_to_impute: Columns in the matrix for which to create an imputation model, i.e. to impute. It can be a vector of column IDs (positions), or the keywords "auto" (default) or "all". With "auto" the model automatically detects the columns with missing data and imputes only them. Specify the columns manually, or use "all", if you want an imputation model to be created during training even for columns whose training data has no missing values, so that the trained model can then be applied to further data with possibly missing values.
- estimator: An estimator model (regressor or classifier), optionally with its options (hyper-parameters), to be used to impute the various columns of the matrix. It can also be a cols_to_impute-length vector of different estimators, to use a different estimator for each column (dimension) to impute, for example when some columns are categorical (and hence require a classifier) and others are numerical (hence requiring a regressor). [default: nothing, i.e. use BetaML random forests, handling classification and regression jobs automatically].
- missing_supported: Whether the estimator(s) used to predict the missing data themselves support missing data in the training features (X). If not, when the model for a certain dimension is fitted, the dimensions with missing data in the same rows as those where imputation is needed are dropped, and then only the non-missing rows in the remaining dimensions are considered. It can be a vector of boolean values to specify this property for each individual estimator, or a single boolean value to apply to all the estimators [default: false]
- fit_function: The function used by the estimator(s) to fit the model. It should take as first argument the model itself, as second argument a matrix representing the features, and as third argument a vector representing the labels. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.fit!]
- predict_function: The function used by the estimator(s) to predict the labels. It should take as first argument the model itself and as second argument a matrix representing the features. This parameter is mandatory for non-BetaML estimators and can be a single value or a vector (one per estimator) when estimators from different packages are used. [default: BetaML.predict]
- recursive_passages: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute in multiple columns. The order of the first passage is given by the decreasing number of missing values per column, the other passages are random [default: 1].
- multiple_imputations: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
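The row and column selection described for missing_supported=false can be sketched in plain Julia. The helper below, training_subset, is hypothetical and reflects one reading of that description; it is not the library's own code. For a target column d it drops the other columns that have missing values in the rows to impute, then keeps only the rows that are complete in the target and in the remaining columns.

```julia
# Hypothetical helper: which feature columns and training rows survive
# when the estimator cannot handle missing values in the features
function training_subset(X, d)
    nR, nC = size(X)
    # Rows where the target column d needs imputation
    target_rows = findall(ismissing, X[:, d])
    # Keep the other columns that have no missing value in those rows
    feat_cols = [c for c in 1:nC if c != d && !any(ismissing, X[target_rows, c])]
    # Keep the rows with a known target and complete remaining features
    train_rows = [r for r in 1:nR if !ismissing(X[r, d]) && !any(ismissing, X[r, feat_cols])]
    return feat_cols, train_rows
end

X = [1.0     2.0     3.0;
     missing missing 6.0;
     7.0     8.0     missing;
     10.0    11.0    12.0]

feat_cols, train_rows = training_subset(X, 1)
```

Here column 2 is dropped because it is missing in row 2, the row to impute, and row 3 is dropped because the remaining feature (column 3) is missing there, leaving feat_cols == [3] and train_rows == [1, 4].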
BetaML.Imputation.GeneralImputer — Type
mutable struct GeneralImputer <: Imputer
Impute missing values using arbitrary learning models.
Impute missing values using any arbitrary learning model (classifier or regressor, not necessarily from BetaML) that implements the interface m = Model([options]), fit!(m,X,Y) and predict(m,X). For non-BetaML supervised models the actual training and predict functions must be specified in the fit_function and predict_function parameters respectively. If needed (for example when some columns with missing data are categorical and some numerical) different models can be specified for each column. Multiple imputations and multiple "passages" through the various columns for a single imputation are supported.
See GeneralI_hp for all the hyper-parameters.
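The interface contract above (a constructor, a fit function taking model, features and labels, and a predict function taking model and features) can be illustrated without any package. In the sketch below MeanRegressor, mean_fit!, mean_predict and impute_column are all hypothetical names introduced for illustration; the driver is a toy stand-in that shows how such functions are called, not the GeneralImputer implementation.

```julia
using Statistics

# Hypothetical estimator following the m = Model([options]) convention
mutable struct MeanRegressor
    mu::Float64
end
MeanRegressor() = MeanRegressor(NaN)

# Hypothetical fit and predict functions with the argument order described above
mean_fit!(m::MeanRegressor, X, y) = (m.mu = mean(y); m)
mean_predict(m::MeanRegressor, X) = fill(m.mu, size(X, 1))

# Toy driver: impute column d of X, using the other columns as features
function impute_column(X, d, model, fit_function, predict_function)
    feat    = setdiff(1:size(X, 2), d)
    known   = findall(!ismissing, X[:, d])
    unknown = findall(ismissing, X[:, d])
    fit_function(model, X[known, feat], X[known, d])               # (model, features, labels)
    Xfull = copy(X)
    Xfull[unknown, d] = predict_function(model, X[unknown, feat])  # (model, features)
    return Xfull
end

X = [1.0 2.0; missing 4.0; 3.0 6.0]
X_full = impute_column(X, 1, MeanRegressor(), mean_fit!, mean_predict)
```

Any pair of functions with these argument orders could be swapped in, which is the role the fit_function and predict_function parameters play for non-BetaML estimators.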
Examples:
- Using BetaML models:
julia> using BetaML
julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
1.4 2.5 "a"
missing 20.5 "b"
0.6 18 missing
0.7 22.8 "b"
0.4 missing "b"
1.6 3.7 "a"
julia> mod = GeneralImputer(recursive_passages=2,multiple_imputations=2)
GeneralImputer - A imputer based on an arbitrary regressor/classifier(unfitted)
julia> mX_full = fit!(mod,X);
** Processing imputation 1
** Processing imputation 2
julia> mX_full[1]
6×3 Matrix{Any}:
1.4 2.5 "a"
0.546722 20.5 "b"
0.6 18 "b"
0.7 22.8 "b"
0.4 19.8061 "b"
1.6 3.7 "a"
julia> mX_full[2]
6×3 Matrix{Any}:
1.4 2.5 "a"
0.554167 20.5 "b"
0.6 18 "b"
0.7 22.8 "b"
0.4 20.7551 "b"
1.6 3.7 "a"
julia> info(mod)
Dict{String, Any} with 1 entry:
"n_imputed_values" => 3
- Using third party packages (in this example DecisionTree):
julia> using BetaML
julia> import DecisionTree
julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
1.4 2.5 "a"
missing 20.5 "b"
0.6 18 missing
0.7 22.8 "b"
0.4 missing "b"
1.6 3.7 "a"
julia> mod = GeneralImputer(estimator=[DecisionTree.DecisionTreeRegressor(),DecisionTree.DecisionTreeRegressor(),DecisionTree.DecisionTreeClassifier()], fit_function = DecisionTree.fit!, predict_function=DecisionTree.predict, recursive_passages=2)
GeneralImputer - A imputer based on an arbitrary regressor/classifier(unfitted)
julia> X_full = fit!(mod,X)
** Processing imputation 1
6×3 Matrix{Any}:
1.4 2.5 "a"
0.94 20.5 "b"
0.6 18 "b"
0.7 22.8 "b"
0.4 13.5 "b"
1.6 3.7 "a"
BetaML.Imputation.RandomForestI_hp — Type
mutable struct RandomForestI_hp <: BetaMLHyperParametersSet
Hyperparameters for RandomForestImputer
Parameters:
- rfhpar::Any: For the underlying random forest algorithm parameters (n_trees, max_depth, min_gain, min_records, max_features, splitting_criterion, β, initialisation_strategy, oob and rng) see RandomForestE_hp
- forced_categorical_cols::Vector{Int64}: Specify the positions of the integer columns to treat as categorical instead of cardinal. [Default: empty vector (all numerical cols are treated as cardinal by default and the others as categorical)]
- recursive_passages::Int64: Define the number of times to go through the various columns to impute their data. Useful when there are data to impute in multiple columns. The order of the first passage is given by the decreasing number of missing values per column, the other passages are random [default: 1].
- multiple_imputations::Int64: Determine the number of independent imputations of the whole dataset to make. Note that while independent, the imputations share the same random number generator (RNG).
- cols_to_impute::Union{String, Vector{Int64}}: Columns in the matrix for which to create an imputation model, i.e. to impute. It can be a vector of column IDs (positions), or the keywords "auto" (default) or "all". With "auto" the model automatically detects the columns with missing data and imputes only them. Specify the columns manually, or use "all", if you want an imputation model to be created during training even for columns whose training data has no missing values, so that the trained model can then be applied to further data with possibly missing values.
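The column ordering described for recursive_passages can be sketched with the standard library alone. The snippet below is an illustration of that rule under the stated description (first passage by decreasing number of missing values, later passages in random order); the variable names and the choice of MersenneTwister with seed 123 are assumptions made for the example.

```julia
using Random

X = [1.4     2.5  "a";
     missing 20.5 "b";
     0.6     18   missing;
     missing 22.8 "b"]

# Number of missing values per column
nmissing = [count(ismissing, X[:, c]) for c in 1:size(X, 2)]

# Columns that actually need imputation (the "auto" behaviour)
cols = findall(>(0), nmissing)

# First passage: columns with more missing values come first
first_pass = sort(cols; by = c -> nmissing[c], rev = true)

# Later passages: random order, drawn from an explicit RNG for replicability
rng = MersenneTwister(123)
later_pass = shuffle(rng, cols)
```

With this matrix column 1 has two missing values and column 3 has one, so the first passage visits column 1 and then column 3, while column 2 is never visited.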
Example:
julia> mod = RandomForestImputer(n_trees=20,max_depth=10,recursive_passages=3)
BetaML.Imputation.RandomForestImputer — Type
mutable struct RandomForestImputer <: Imputer
Impute missing data using Random Forests, with optional replicable multiple imputations.
See RandomForestI_hp, RandomForestE_hp and BML_options for the parameters.
Notes:
- Given a certain RNG and its status (e.g. RandomForestImputer(...,rng=StableRNG(FIXEDSEED))), the algorithm is completely deterministic, i.e. replicable.
- The algorithm accepts virtually any kind of data, sortable or not
Example:
julia> using BetaML
julia> X = [1.4 2.5 "a"; missing 20.5 "b"; 0.6 18 missing; 0.7 22.8 "b"; 0.4 missing "b"; 1.6 3.7 "a"]
6×3 Matrix{Any}:
1.4 2.5 "a"
missing 20.5 "b"
0.6 18 missing
0.7 22.8 "b"
0.4 missing "b"
1.6 3.7 "a"
julia> mod = RandomForestImputer(n_trees=20,max_depth=10,recursive_passages=2)
RandomForestImputer - A Random-Forests based imputer (unfitted)
julia> X_full = fit!(mod,X)
** Processing imputation 1
6×3 Matrix{Any}:
1.4 2.5 "a"
0.504167 20.5 "b"
0.6 18 "b"
0.7 22.8 "b"
0.4 20.0837 "b"
1.6 3.7 "a"
BetaML.Imputation.SimpleI_hp — Type
mutable struct SimpleI_hp <: BetaMLHyperParametersSet
Hyperparameters for the SimpleImputer model
Parameters:
- statistic::Function: The descriptive statistic of the column (feature) to use as imputed value [default: mean]
- norm::Union{Nothing, Int64}: Normalise the feature mean by the l-norm of the records, with norm giving the order of the norm [default: nothing]. Use it (e.g. norm=1 to use the l-1 norm) if the records are highly heterogeneous (e.g. quantity exports of different countries).
BetaML.Imputation.SimpleImputer — Type
mutable struct SimpleImputer <: Imputer
Simple imputer using the missing data's feature (column) statistic (default: mean), optionally normalised by l-norms of the records (rows)
Parameters:
- statistic: The descriptive statistic of the column (feature) to use as imputed value [default: mean]
- norm: Normalise the feature mean by the l-norm of the records, with norm giving the order of the norm [default: nothing]. Use it (e.g. norm=1 to use the l-1 norm) if the records are highly heterogeneous (e.g. quantity exports of different countries).
Limitations:
- data must be numerical
Example:
julia> using BetaML
julia> X = [2.0 missing 10; 20 40 100]
2×3 Matrix{Union{Missing, Float64}}:
2.0 missing 10.0
20.0 40.0 100.0
julia> mod = SimpleImputer(norm=1)
SimpleImputer - A simple feature-stat based imputer (unfitted)
julia> X_full = fit!(mod,X)
2×3 Matrix{Float64}:
2.0 4.04494 10.0
20.0 40.0 100.0
julia> info(mod)
Dict{String, Any} with 1 entry:
"n_imputed_values" => 1
julia> parameters(mod)
BetaML.Imputation.SimpleImputer_lp (a BetaMLLearnableParametersSet struct)
- cStats: [11.0, 40.0, 55.0]
- norms: [6.0, 53.333333333333336]
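The numbers printed in this example (cStats, norms and the imputed 4.04494) are consistent with the computation sketched below. This is a reconstruction from the output, under the assumption that each record's l-1 norm is taken over its non-missing entries and divided by their count; it should not be read as the library's actual code.

```julia
using Statistics, LinearAlgebra

X = [2.0 missing 10; 20 40 100]
nR, nC = size(X)

# Column statistic (mean) over the non-missing entries of each column
cstats = [mean(skipmissing(X[:, c])) for c in 1:nC]

# l-1 norm of each record over its non-missing entries, divided by their number
norms = [norm(collect(skipmissing(X[r, :])), 1) / count(!ismissing, X[r, :]) for r in 1:nR]

# Scale the column statistic by the record's share of the total norm
X_full = [ismissing(X[r, c]) ? cstats[c] * norms[r] / sum(norms) : X[r, c]
          for r in 1:nR, c in 1:nC]
```

The first record is roughly ten times smaller than the second, so the column mean of 40 is scaled down to about 4.04 instead of being inserted as is, which is the behaviour the norm option is meant to provide for heterogeneous records.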