# A classification task when labels are known - determining the country of origin of cars given their characteristics

In this exercise we have some car technical characteristics (mpg, horsepower, weight, model year...) together with the country of origin, and we would like to create a model such that the country of origin can be accurately predicted given the technical characteristics. As the information to predict is multi-class, this is a [classification](https://en.wikipedia.org/wiki/Statistical_classification) task. It is a challenging exercise due to the simultaneous presence of three factors: (1) missing data; (2) unbalanced data - 254 out of 406 cars are US made; (3) a small dataset.

Data origin:

Field description:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance) - not used here

!!! warning
    As the above example is automatically executed by GitHub on every code update, it uses parameters (epoch numbers, parameter space of hyperparameter validation, number of trees, ...) that minimise the computation. In a real case you will want to use better but more computationally intensive ones. For the same reason the benchmark code is commented out and the pre-run output reported rather than actually being executed.

We load a bunch of packages that we'll use during this tutorial:

```julia
using Random, HTTP, CSV, DataFrames, BenchmarkTools, BetaML
import DecisionTree, Flux
import Pipe: @pipe
```

To load the data from the internet our workflow is (1) Retrieve the data –> (2) Clean it –> (3) Load it –> (4) Output it as a DataFrame.

For step (1) we use HTTP.get(), for step (2) we use replace!, for steps (3) and (4) we use the CSV package, and we use the "pipe" |> operator to chain these operations:

```julia
urlDataOriginal = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
data = @pipe HTTP.get(urlDataOriginal).body                                           |>
       replace!(_, UInt8('\t') => UInt8(' '))                                         |>
       CSV.File(_, delim=' ', missingstring="NA", ignorerepeated=true, header=false)  |>
       DataFrame;
```

This results in a table where the rows are the observations (the various cars) and the columns are the fields. All BetaML models expect this layout.

As the dataset is ordered, we randomly shuffle the data. Note that we pass to shuffle a copy(FIXEDRNG) as the random number generator in order to obtain reproducible output (FIXEDRNG is nothing else than an instance of StableRNG(123) defined in the BetaML.Utils sub-module, but you can of course choose your own "fixed" RNG). See the Dealing with stochasticity section in the Getting started tutorial for details.

```julia
data = data[shuffle(copy(FIXEDRNG), axes(data, 1)), :]
describe(data)
```

9 rows × 7 columns

|   | variable | mean    | min    | median | max    | nmissing | eltype                  |
|---|----------|---------|--------|--------|--------|----------|-------------------------|
| 1 | Column1  | 23.5146 | 9.0    | 23.0   | 46.6   | 8        | Union{Missing, Float64} |
| 2 | Column2  | 5.47537 | 3.0    | 4.0    | 8.0    | 0        | Float64                 |
| 3 | Column3  | 194.78  | 68.0   | 151.0  | 455.0  | 0        | Float64                 |
| 4 | Column4  | 105.082 | 46.0   | 95.0   | 230.0  | 6        | Union{Missing, Float64} |
| 5 | Column5  | 2979.41 | 1613.0 | 2822.5 | 5140.0 | 0        | Float64                 |
| 6 | Column6  | 15.5197 | 8.0    | 15.5   | 24.8   | 0        | Float64                 |
| 7 | Column7  | 75.9212 | 70.0   | 76.0   | 82.0   | 0        | Float64                 |
| 8 | Column8  | 1.56897 | 1.0    | 1.0    | 3.0    | 0        | Float64                 |

Columns 1 to 7 contain characteristics of the car, while column 8 encodes the country of origin ("1" -> US, "2" -> EU, "3" -> Japan). That's the variable we want to be able to predict.

Column 9 contains the car name, but we are not going to use this information in this tutorial. Note also that some fields have missing data.

Our first step is hence to divide the dataset into the features (the x) and the labels (the y) we want to predict. The x is then a standard Julia Matrix of 406 rows by 7 columns and the y is a vector of the 406 observations:

```julia
x     = Matrix{Union{Missing,Float64}}(data[:,1:7]);
y     = Vector{Int64}(data[:,8]);
```

Some algorithms that we will use today don't work with missing data, so we need to impute them. We use the predictMissing function provided by the BetaML.Clustering sub-module. Internally the function uses a Gaussian Mixture Model to assign to the missing value of a given record an average of the values of the non-missing records, weighted by how close they are to our specific record. Note that the same function (predictMissing) can be used for Collaborative Filtering / recommendation systems. Using GMM has the advantage over traditional algorithms such as k-nearest neighbors (KNN) that GMM can "detect" the hidden structure of the observed data, where some observations can be similar to a certain pool of other observations for a certain characteristic, but similar to another pool of observations for other characteristics.
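For intuition about what imputation means, the naive alternative is to simply replace each missing value with its column mean, ignoring any similarity between records; a minimal sketch (mean_impute is an illustrative name, not a BetaML function):

```julia
using Statistics

# Naive baseline imputation: fill each missing value with its column mean.
# Unlike the GMM approach, this uses no information from the other columns.
function mean_impute(x)
    out = copy(x)
    for j in axes(out, 2)
        m = mean(skipmissing(out[:, j]))      # column mean over non-missing values
        out[ismissing.(out[:, j]), j] .= m    # fill the holes
    end
    return out
end
```

Applied to our x, this would fill the 14 missing values (8 in mpg, 6 in horsepower) with the respective column means.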

```julia
xFull = predictMissing(x,rng=copy(FIXEDRNG)).X̂;
```
```
Iter. 1:	Var. of the post  20.10554158364678 	  Log-likelihood -12110.417616527317
```

Further, some models don't work with categorical data as such, so we need to represent our y as a matrix with a separate column for each possible categorical value (the so called "one-hot" representation). For example, within a three classes field, the individual value 2 (or "Europe" for what it matters) would be represented as the vector [0 1 0], while 3 (or "Japan") would become the vector [0 0 1]. To encode as one-hot we use the function oneHotEncoder in BetaML.Utils

```julia
y_oh  = oneHotEncoder(y);
```
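To make the one-hot representation concrete, here is a minimal hand-rolled encoder (for illustration only; in the tutorial we use BetaML's oneHotEncoder):

```julia
# One column per class; put a 1 in the column corresponding to each label.
function onehot(y::AbstractVector{<:Integer}, nclasses=maximum(y))
    out = zeros(Int, length(y), nclasses)
    for (i, label) in enumerate(y)
        out[i, label] = 1
    end
    return out
end

onehot([2, 3, 1])
# 3×3 Matrix{Int64}:
#  0  1  0
#  0  0  1
#  1  0  0
```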

In supervised machine learning it is good practice to partition the available data into training, validation, and test subsets, where the first one is used to train the ML algorithm, the second one to tune any "hyper-parameters" of the algorithm, and the test subset is finally used to evaluate the quality of the algorithm. Here, for brevity, we use only the train and the test subsets, implicitly assuming we already know the best hyper-parameters. Please refer to the regression tutorial for examples of how to use the validation subset to tune the hyper-parameters, or even better the clustering tutorial for an example of using the crossValidation function.

We then use the partition function in BetaML.Utils, where we can specify the different data to partition (each matrix or vector to partition must have the same number of observations) and the share of observations that we want in each subset. Here we keep 80% of observations for training (xtrain, xtrainFull and ytrain) and we use 20% of them for testing (xtest, xtestFull and ytest):

```julia
((xtrain,xtest),(xtrainFull,xtestFull),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,xFull,y,y_oh],[0.8,1-0.8],rng=copy(FIXEDRNG));
```
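A share-based partition essentially shuffles the row indices once and splits them; a minimal single-matrix sketch (simple_partition is an illustrative name, and BetaML's partition additionally keeps several matrices aligned on the same shuffled indices):

```julia
using Random

# Shuffle row indices once, then split according to the training share.
function simple_partition(x, trainshare; rng=Random.GLOBAL_RNG)
    n   = size(x, 1)
    idx = shuffle(rng, 1:n)                 # random permutation of the rows
    ntr = round(Int, trainshare * n)
    return x[idx[1:ntr], :], x[idx[ntr+1:end], :]
end

xtr, xte = simple_partition(rand(10, 3), 0.8)   # sizes (8,3) and (2,3)
```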

## Random Forests

We are now ready to use our first model, Random Forests (in the BetaML.Trees sub-module). Random Forests build a "forest" of decision tree models and then average their predictions in order to make an overall prediction out of a feature vector.
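For classification this "averaging" amounts to a (possibly weighted) majority vote: each tree predicts a class and the most frequent prediction wins. A minimal unweighted sketch of the idea:

```julia
# Each tree casts a vote for a class; the class with the most votes wins.
function majority_vote(votes)
    counts = Dict{eltype(votes),Int}()
    for v in votes
        counts[v] = get(counts, v, 0) + 1
    end
    best, bestcount = first(votes), 0
    for (k, c) in counts
        if c > bestcount
            best, bestcount = k, c
        end
    end
    return best
end

majority_vote([1, 3, 1, 2, 1])   # → 1
```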

To "build" the forest model (i.e. to "train" it) we need to give the model the training feature matrix and the associated "true" training labels, and we need to specify the number of trees to employ (this is an example of hyper-parameters). Here we use 30 individual decision trees.

As the labels are encoded using integers, we also need to specify the parameter forceClassification=true, otherwise the model would perform a regression instead.

```julia
myForest       = buildForest(xtrain,ytrain,30, rng=copy(FIXEDRNG),forceClassification=true);
```

To obtain the predicted values, we can simply use the function BetaML.Trees.predict with our myForest model and either the training or the testing data.

```julia
ŷtrain,ŷtest   = predict.(Ref(myForest), [xtrain,xtest],rng=copy(FIXEDRNG));
```

Finally we can measure the accuracy of our predictions with the accuracy function. Note that we first need to "parse" the ŷs back to integers, as forcing the classification job automatically transformed them to strings:

```julia
trainAccuracy,testAccuracy  = accuracy.([parse.(Int64,mode(ŷtrain,rng=copy(FIXEDRNG))),parse.(Int64,mode(ŷtest,rng=copy(FIXEDRNG)))],[ytrain,ytest])
```
```
2-element Vector{Float64}:
 0.9969230769230769
 0.8024691358024691
```

The predictions are quite good: for the training set the algorithm predicted almost all cars' origins correctly, while for the testing set (i.e. those records that have not been used to train the algorithm) the correct prediction level is still quite high, at 80%.

While accuracy can sometimes suffice, we may often want to better understand which categories our model has trouble predicting correctly. We can investigate the output of a multi-class classifier in more depth with a ConfusionMatrix, where the true values (y) are given in rows and the predicted ones (ŷ) in columns, together with some per-class metrics like the precision (true class i over predicted in class i), the recall (predicted class i over the true class i) and others.
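Both metrics can be read directly off the raw matrix; a small sketch independent of BetaML's ConfusionMatrix (cm here holds the test-set scores reported below):

```julia
# Precision and recall for class i, from a confusion matrix with true
# classes in rows and predicted classes in columns.
function precision_recall(cm::AbstractMatrix, i)
    precision = cm[i, i] / sum(cm[:, i])   # of those predicted i, how many are truly i
    recall    = cm[i, i] / sum(cm[i, :])   # of the true i, how many we predicted as i
    return precision, recall
end

cm = [43 3 3; 2 13 1; 5 2 9]
precision_recall(cm, 1)   # ≈ (0.860, 0.878) for the "US" class
```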

We first build the ConfusionMatrix object between ŷ and y and then we print it (we do it here for the test subset):

```julia
cm = ConfusionMatrix(parse.(Int64,mode(ŷtest,rng=copy(FIXEDRNG))),ytest,classes=[1,2,3],labels=["US","EU","Japan"])
print(cm,"all")
```

```
-----------------------------------------------------------------
*** CONFUSION MATRIX ***

Scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"    "US"    "EU"   "Japan"
 "US"      43       3      3
 "EU"       2      13      1
 "Japan"    5       2      9

Normalised scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"   "US"      "EU"       "Japan"
 "US"      0.877551  0.0612245  0.0612245
 "EU"      0.125     0.8125     0.0625
 "Japan"   0.3125    0.125      0.5625

*** CONFUSION REPORT ***

- Accuracy:               0.8024691358024691
- Misclassification rate: 0.19753086419753085
- Number of classes:      3

  N Class   precision   recall  specificity  f1Score  actualCount  predictedCount
                TPR       TNR                 support

  1 US          0.860    0.878        0.781    0.869           49              50
  2 EU          0.722    0.812        0.923    0.765           16              18
  3 Japan       0.692    0.562        0.938    0.621           16              13

- Simple   avg.    0.758    0.751        0.881    0.751
- Weighted avg.    0.800    0.802        0.840    0.799
-----------------------------------------------------------------
```

From the report we can see that Japanese cars have more trouble being correctly classified, and in particular many Japanese cars are classified as US ones. This is likely a result of the class imbalance of the dataset, and could be addressed by balancing the dataset with various sampling techniques before training the model.
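As an illustration of one such sampling technique, random oversampling resamples the minority classes with replacement until each class matches the majority count; a hypothetical sketch (oversample is an illustrative name, not a BetaML function):

```julia
using Random

# Resample each class (with replacement) up to the size of the largest class.
function oversample(x, y; rng=Random.GLOBAL_RNG)
    nmax = maximum(count(==(c), y) for c in unique(y))
    idx  = Int[]
    for c in unique(y)
        cidx = findall(==(c), y)
        append!(idx, rand(rng, cidx, nmax))   # sample with replacement
    end
    return x[idx, :], y[idx]
end
```

Note that oversampling must be applied to the training subset only, otherwise duplicated records would leak between training and testing.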

When we benchmark the resources used (time and memory) we find that Random Forests remain pretty fast, especially when we compare them with neural networks (see later):

```julia
@btime buildForest(xtrain,ytrain,30, rng=copy(FIXEDRNG),forceClassification=true);
```
```
134.096 ms (781027 allocations: 196.30 MiB)
```

### Comparison with DecisionTree.jl

DecisionTree.jl random forests are similar in usage: we first "build" (train) the forest and we then make predictions out of the trained model. The main difference is that the model requires data with no missing values, so we are going to use the xtrainFull and xtestFull feature matrices we created earlier:

```julia
# We train the model...
model = DecisionTree.build_forest(ytrain, xtrainFull,-1,30,rng=123)
# ..and we generate predictions and measure their error
(ŷtrain,ŷtest) = DecisionTree.apply_forest.([model],[xtrainFull,xtestFull]);
(trainAccuracy,testAccuracy) = accuracy.([ŷtrain,ŷtest],[ytrain,ytest])
```
```
2-element Vector{Float64}:
 0.9969230769230769
 0.7530864197530864
```

While the accuracy on the training set is exactly the same as for BetaML random forests, DecisionTree.jl random forests are slightly less accurate in the testing sample. Where however DecisionTree.jl excels is in efficiency: it is extremely fast and memory-thrifty, even if to this benchmark we should add the resources needed to impute the missing values.

Also, one of the reasons DecisionTree.jl is so efficient is that internally it sorts the data to avoid repeated comparisons, but this means it works only with features that are sortable, while BetaML random forests accept virtually any kind of input without the need to adapt it.

```julia
@btime DecisionTree.build_forest(ytrain, xtrainFull,-1,30,rng=123);
```
```
1.431 ms (10875 allocations: 1.52 MiB)
```

### Neural network

Neural networks (NN) can be very powerful, but have two "inconveniences" compared with random forests. First, they are a bit "picky": we need to do some work to provide the data in a specific format. Note that this is not feature engineering (one of the advantages of neural networks is that for the most part it is not needed), but we still need to "clean" the data. One issue is that NN don't accept missing data, so we need to provide them with a feature matrix "clean" of missing values. Secondly, they work only with numerical data, so we need to use the one-hot encoding we saw earlier. Further, they work best if the features are scaled such that each feature has mean zero and standard deviation one. We can achieve this with the function scale or, as in this case, getScaleFactors.
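For reference, scaling to mean zero and standard deviation one is just the familiar z-score transform; a one-line sketch of what such scaling computes (zscore here is an illustrative name, the tutorial uses BetaML's getScaleFactors/scale):

```julia
using Statistics

# Column-wise z-score: subtract the column mean, divide by the column std.
zscore(x) = (x .- mean(x, dims=1)) ./ std(x, dims=1)

zscore([1.0 10.0; 2.0 20.0; 3.0 30.0])   # each column becomes [-1.0, 0.0, 1.0]
```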

```julia
xScaleFactors   = getScaleFactors(xtrainFull)
D               = size(xtrainFull,2)
classes         = unique(y)
nCl             = length(classes)
```
```
3
```

The second "inconvenience" of NN is that, while not requiring feature engineering, they still need a bit of practice in the way the network is built. It's not as simple as train(model,x,y): we need to specify the layers we want, chain the layers together and then choose an overall loss function. Only once we have done these steps do we have the model ready for training. Here we define 2 DenseLayer where, for each of them, we specify the number of neurons in input (for the first layer equal to the dimensions of the data) and in output (for a classification task, the last layer's output size being equal to the number of classes), and an activation function for each layer (default: the identity function).

```julia
ls   = 50
l1   = DenseLayer(D,ls,f=relu,rng=copy(FIXEDRNG))
l2   = DenseLayer(ls,nCl,f=relu,rng=copy(FIXEDRNG))
```
```
DenseLayer([-0.2146463925907584 -0.3077087587320811 … -0.28256208289474877 0.21510681158042189; -0.08916953797649538 -0.041727530915651345 … -0.30444064706465346 -0.22349634154766507; 0.11376391271810127 -0.011244515923068743 … 0.12916068649773038 -0.2518581440082599], [0.2918467648814228, -0.004167534280141383, 0.29060333096888613], BetaML.Utils.relu, nothing)
```

For a classification task the last layer is a VectorFunctionLayer, which has no learnable parameters and whose activation function is applied to the ensemble of neurons rather than to each neuron individually. In particular, for classification we pass the BetaML.Utils.softmax function, whose output has the same size as the input (and as the number of classes to predict). We can however use the VectorFunctionLayer with any function, including the pool1d function to create a "pooling" layer (using maximum, mean or whatever other subfunction we pass to pool1d).
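Softmax is a function of the whole output vector rather than of a single neuron, which is exactly why it needs a VectorFunctionLayer; a minimal (numerically stabilised) sketch of the function itself:

```julia
# Exponentiate (shifted by the maximum for numerical stability) and
# normalise, so that the outputs are positive and sum to one.
mysoftmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

p = mysoftmax([1.0, 2.0, 3.0])   # probabilities summing to 1; the largest input gets the largest share
```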

```julia
l3   = VectorFunctionLayer(nCl,f=softmax) ## Add a (parameterless) layer whose activation function (softmax in this case) is defined on all its nodes at once
```
```
VectorFunctionLayer{0}(fill(NaN), 3, 3, BetaML.Utils.softmax, nothing, nothing)
```

Finally we chain the layers and assign a loss function with buildNetwork:

```julia
mynn = buildNetwork([l1,l2,l3],squaredCost,name="Multinomial logistic regression Model Cars") ## Build the NN and use the squared cost (aka MSE) as error function (crossEntropy could also be used)
```
```
NN(AbstractLayer[DenseLayer([-0.20697795942360017 0.12136901733806621 … -0.026906267639792758 -0.08117976159071413; -0.08598387697252577 -0.094763858010351 … 0.12473600153619574 -0.3038923887271602; … ; -0.027245366184114994 -0.13726439265473928 … -0.0812568745415187 0.03639080823771723; 0.21807051339784328 -0.04462192007289573 … -0.09201125674551605 -0.25212961016223523], [-0.32359467997037583, 0.15252213461474667, 0.05681391283261772, -0.0718950523804685, 0.014112894961209654, 0.250715710684788, -0.08912986019414143, -0.22300278678968377, 0.011113501374895696, -0.2307615737177409  …  -0.06868980831928373, 0.22894772322501816, 0.19247052748600918, 0.3243004525232283, -0.005682087946757619, 0.3173647151609262, -0.0419358245557761, 0.1792922925250479, 0.11465680902201164, -0.049714381947920605], BetaML.Utils.relu, nothing), DenseLayer([-0.2146463925907584 -0.3077087587320811 … -0.28256208289474877 0.21510681158042189; -0.08916953797649538 -0.041727530915651345 … -0.30444064706465346 -0.22349634154766507; 0.11376391271810127 -0.011244515923068743 … 0.12916068649773038 -0.2518581440082599], [0.2918467648814228, -0.004167534280141383, 0.29060333096888613], BetaML.Utils.relu, nothing), VectorFunctionLayer{0}(fill(NaN), 3, 3, BetaML.Utils.softmax, nothing, nothing)], BetaML.Utils.squaredCost, nothing, false, "Multinomial logistic regression Model Cars")
```

Now we can train our network using the function train!. It has many options, have a look at the documentation for all the possible arguments. Note that we train the network based on the scaled feature matrix.

```julia
res  = train!(mynn,scale(xtrainFull,xScaleFactors),ytrain_oh,epochs=500,batchSize=8,optAlg=ADAM(),rng=copy(FIXEDRNG)) ## Use optAlg=SGD() to use Stochastic Gradient Descent instead
```
```
(epochs = 500, ϵ_epochs = [0.3947933466712271, 0.28073284643031027, 0.22540959714717973, 0.2045020231239191, 0.19136110202105647, 0.1829174183357756, 0.17760762488726559, 0.17366196897056083, 0.1706339472451232, 0.16751743542979855  …  0.06798955708038988, 0.06785310875861965, 0.0680514707512953, 0.06843456668489702, 0.06792390250775648, 0.06790615579332532, 0.06768486549836293, 0.06777986689954776, 0.06775912020670322, 0.06776877566449138], θ_epochs = Any[])
```

Once trained, we can predict the labels. As the training was based on the scaled feature matrix, so must be the predictions:

```julia
(ŷtrain,ŷtest)  = predict.(Ref(mynn),[scale(xtrainFull,xScaleFactors),scale(xtestFull,xScaleFactors)])
trainAccuracy   = accuracy(ŷtrain,ytrain,rng=copy(FIXEDRNG))
```
```
0.8953846153846153
```
```julia
testAccuracy    = accuracy(ŷtest,ytest,rng=copy(FIXEDRNG))

cm = ConfusionMatrix(ŷtest,ytest,classes=[1,2,3],labels=["US","EU","Japan"],rng=copy(FIXEDRNG))
```
```
ConfusionMatrix{Int64}([1, 2, 3], ["US", "EU", "Japan"], 0.7654320987654321, 0.23456790123456794, [49, 16, 16], [53, 12, 16], [44 0 5; 3 10 3; 6 2 8], [0.8979591836734694 0.0 0.10204081632653061; 0.1875 0.625 0.1875; 0.375 0.125 0.5], [44, 10, 8], [23, 63, 57], [9, 2, 8], [5, 6, 8], [0.8301886792452831, 0.8333333333333334, 0.5], [0.8979591836734694, 0.625, 0.5], [0.71875, 0.9692307692307692, 0.8769230769230769], [0.8627450980392157, 0.7142857142857143, 0.5], (0.7211740041928721, 0.7655873903253358), (0.6743197278911565, 0.7654320987654321), (0.8549679487179488, 0.7994717473884142), (0.6923436041083101, 0.7617664349690494))
```

```julia
print(cm)
```
```
4×4 Matrix{Any}:
 "Labels"   "US"   "EU"   "Japan"
 "US"      44      0      5
 "EU"       3     10      3
 "Japan"    6      2      8
4×4 Matrix{Any}:
 "Labels"   "US"      "EU"    "Japan"
 "US"       0.897959  0.0     0.102041
 "EU"       0.1875    0.625   0.1875
 "Japan"    0.375     0.125   0.5
```

We see here some of the limits of neural networks in this example. While NN can be extremely performant in many domains, they also require a lot of data and computational power, especially considering the many possible hyper-parameters and hence the large hyper-parameter tuning space. In this example we fall short of the performance of random forests, even with a significant number of neurons.

```julia
@btime train!(mynn,scale(xtrainFull),ytrain_oh,epochs=300,batchSize=8,rng=copy(FIXEDRNG),verbosity=NONE);
```
```
11.841 s (62860672 allocations: 4.21 GiB)
```

### Comparisons with Flux

In Flux the input must be in the form (fields, observations), so we transpose our original matrices:

```julia
xtrainT, ytrain_ohT = transpose.([scale(xtrainFull,xScaleFactors), ytrain_oh])
xtestT, ytest_ohT   = transpose.([scale(xtestFull,xScaleFactors), ytest_oh])
```
```
2-element Vector{LinearAlgebra.Transpose{Float64, Matrix{Float64}}}:
 [-0.9354691121867126 0.8211030586728023 … 1.6357452248685191 -0.5790631644760863; 0.3321589709665676 -0.8476406526851753 … -0.2577408408593039 0.3321589709665676; … ; 0.08499289148756756 2.234618551678942 … 1.5539037592850058 -0.022488391522001404; 0.582575265003537 1.1235380110782482 … 1.1235380110782482 -1.3107943462579525]
 [0.0 0.0 … 0.0 1.0; 1.0 1.0 … 1.0 0.0; 0.0 0.0 … 0.0 0.0]
```

We define the Flux neural network model in a similar way to BetaML and load it with the data; we then train it, predict and measure the accuracies on the training and the test sets:

```julia
Random.seed!(123)

l1         = Flux.Dense(D,ls,Flux.relu)
#l2         = Flux.Dense(ls,ls,Flux.relu)
l3         = Flux.Dense(ls,nCl,Flux.relu)
Flux_nn    = Flux.Chain(l1,l3)
loss(x, y) = Flux.logitcrossentropy(Flux_nn(x), y)
ps         = Flux.params(Flux_nn)
nndata     = Flux.Data.DataLoader((xtrainT, ytrain_ohT), batchsize=8, shuffle=true) # the batched training data consumed by Flux.train!
for i in 1:500  Flux.train!(loss, ps, nndata, Flux.ADAM()) end
ŷtrain     = Flux.onecold(Flux_nn(xtrainT),1:3)
ŷtest      = Flux.onecold(Flux_nn(xtestT),1:3)
trainAccuracy = accuracy(ŷtrain,ytrain)
```
```
0.9446153846153846
```
```julia
testAccuracy  = accuracy(ŷtest,ytest)
```
```
0.7407407407407407
```

While the training accuracy is a little higher than BetaML's, the test accuracy remains comparable.

The training time, however, is again lower than BetaML's, even if here only by a factor of about 2:

```julia
@btime for i in 1:500 Flux.train!(loss, ps, nndata, Flux.ADAM()) end;
```
```
5.665 s (8943640 allocations: 1.07 GiB)
```

## Summary

This is the summary of the results we had trying to predict the country of origin of the cars, based on their technical characteristics:

| Model                | Train acc | Test acc | Training time (ms)* | Training mem (MB)* |
|----------------------|-----------|----------|---------------------|--------------------|
| RF                   | 0.9969    | 0.8025   | 134                 | 196                |
| RF (DecisionTree.jl) | 0.9969    | 0.7531   | 1.43                | 1.5                |
| NN                   | 0.895     | 0.765    | 11841               | 4311               |
| NN (Flux.jl)         | 0.938     | 0.741    | 5665                | 1096               |

\* on an Intel Core i5-8350U laptop

We warn that this table just provides a rough idea of the various algorithms' performances. Indeed there is a large amount of stochasticity, both in the sampling of the data used for training/testing and in the initial settings of the algorithm parameters. For a statistically significant comparison we would have to repeat the analysis with multiple samplings (e.g. by cross-validation, see the clustering tutorial for an example) and initial random parameters.

Nevertheless the table above shows that, when we compare BetaML with the algorithm-specific leading packages, we find similar results in terms of accuracy, but the leading packages are often better optimised and run more efficiently (sometimes at the cost of being less versatile). Also, for this dataset, Random Forests seem to remain marginally more accurate than Neural Networks, although of course this depends on the hyper-parameters and, with a single run of the models, we don't know if this difference is significant.