A regression task: the prediction of bike sharing demand

The task is to estimate the influence of several variables (such as the weather, the season and the day of the week) on the demand for shared bicycles, so that the authority in charge of the service can organise it in the best way.

Data origin: the data used in this tutorial is a daily-aggregated version of the well-known UCI Bike Sharing Dataset (https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).

Note that even if we are estimating a time series, we do not use here a recurrent neural network, as we assume the temporal dependence to be negligible (i.e. $Y_t = f(X_t)$ alone).

Library and data loading

We first load all the packages we are going to use

using LinearAlgebra, Random, Statistics, DataFrames, CSV, Plots, Pipe, BenchmarkTools, BetaML
import Distributions: Uniform
import DecisionTree, Flux ## For comparisons

Here we load the data from a CSV file provided by the BetaML package

baseDir = joinpath(dirname(pathof(BetaML)),"..","docs","src","tutorials","Regression - bike sharing")
data    = CSV.File(joinpath(baseDir,"data","bike_sharing_day.csv"),delim=',') |> DataFrame
describe(data)

16 rows × 7 columns

      variable    mean       min         median    max         nmissing  eltype
   1  instant     366.0      1           366.0     731         0         Int64
   2  dteday                 2011-01-01            2012-12-31  0         Date
   3  season      2.49658    1           3.0       4           0         Int64
   4  yr          0.500684   0           1.0       1           0         Int64
   5  mnth        6.51984    1           7.0       12          0         Int64
   6  holiday     0.0287278  0           0.0       1           0         Int64
   7  weekday     2.99726    0           3.0       6           0         Int64
   8  workingday  0.683995   0           1.0       1           0         Int64
   9  weathersit  1.39535    1           1.0       3           0         Int64
  10  temp        0.495385   0.0591304   0.498333  0.861667    0         Float64
  11  atemp       0.474354   0.0790696   0.486733  0.840896    0         Float64
  12  hum         0.627894   0.0         0.626667  0.9725      0         Float64
  13  windspeed   0.190486   0.0223917   0.180975  0.507463    0         Float64
  14  casual      848.176    2           713.0     3410        0         Int64
  15  registered  3656.17    20          3662.0    6946        0         Int64
  16  cnt         4504.35    22          4548.0    8714        0         Int64

The variable we want to learn to predict is cnt, the total demand for bikes on a given day. Even though it is indeed an integer, we treat it as a continuous variable, so each single prediction will be a scalar $Y \in \mathbb{R}$.

plot(data.cnt, title="Daily bike sharing rents (2Y)", label=nothing)

Decision Trees

We start our regression task with Decision Trees.

Training a decision tree consists in choosing the set of questions (in a hierarchical way, so as to indeed form a "decision tree") that "best" split the dataset given for training, in the sense that each split generates sub-samples (always 2 sub-samples in the BetaML implementation) that are, for the characteristic we want to predict, as homogeneous as possible. Decision trees are one of the few ML algorithms with an intuitive interpretation, and they can be used for both regression and classification tasks.
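To make this concrete, here is a minimal, purely illustrative sketch (not BetaML's actual implementation) of how such a "best" split can be searched in a regression setting: every candidate (feature, threshold) pair is scored by the weighted variance of the two sub-samples it generates, and the lowest (i.e. most homogeneous) score wins.

using Statistics

# Illustrative only: scan all (feature, threshold) candidates and keep the
# split whose two sub-samples have the lowest weighted variance of y
function bestsplit(x::Matrix{Float64}, y::Vector{Float64})
    bestfeat, bestthr, bestscore = 0, 0.0, Inf
    for d in 1:size(x,2), thr in unique(x[:,d])
        mask  = x[:,d] .<= thr
        left  = y[mask]
        right = y[.!mask]
        (isempty(left) || isempty(right)) && continue
        # Weighted variance of the two sub-samples: the lower, the more homogeneous
        score = (length(left)  * var(left,  corrected=false) +
                 length(right) * var(right, corrected=false)) / length(y)
        if score < bestscore
            (bestfeat, bestthr, bestscore) = (d, thr, score)
        end
    end
    return (feature=bestfeat, threshold=bestthr, score=bestscore)
end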

Data preparation

The first step is to prepare the data for the analysis. This already depends on the model we want to employ, as some models "accept" almost everything as input, no matter whether the data is numerical or categorical, or whether it has missing values or not, while other models are much more demanding and require more work to "clean up" our dataset.

Here we start with Decision Tree and Random Forest models, which definitely belong to the first group, so the only thing we have to do is to select the input variables (the "feature matrix", which we indicate with "X") and the variable representing our output (the information we want to learn to predict, which we call "y"):

x    = Matrix{Float64}(data[:,[:instant,:season,:yr,:mnth,:holiday,:weekday,:workingday,:weathersit,:temp,:atemp,:hum,:windspeed]])
y    = data[:,16];

We can now split the dataset between the data we will use for training the algorithm (xtrain/ytrain), those for selecting the hyper-parameters (xval/yval) and finally those for testing the quality of the algorithm with the optimal hyper-parameters (xtest/ytest). We use the partition function specifying the shares we want to use for these three different subsets, here 75%, 12.5% and 12.5% respectively. As our data indeed represents a time series, we want our model to be able to predict future demand for bike sharing from past, observed rented bikes, so we do not shuffle the dataset as would otherwise be the default.

((xtrain,xval,xtest),(ytrain,yval,ytest)) = partition([x,y],[0.75,0.125,1-0.75-0.125],shuffle=false)
(ntrain, nval, ntest) = size.([ytrain,yval,ytest],1)
3-element Vector{Int64}:
 548
  92
  91

We can now "tune" our model so-called hyper-parameters, i.e. choose the best exogenous parameters of our algorithm, where "best" refers to some minimisation of a "loss" function between the true and the predicted values. We compute this loss function on a specific subset of data, that we call the "validation" subset (xval and yval).

BetaML doesn't have a dedicated function for hyper-parameter optimisation, but it is easy to write some custom Julia code, at least for a simple grid-based "search". Indeed, one of the main reasons a dedicated function exists in other Machine Learning libraries is that loops in other languages are slow, but this is not a problem in Julia, so we can retain the flexibility to write the kind of hyper-parameter tuning that best fits our needs.

Below is an example of a possible such function. Note there are more "elegant" ways to code it, but this one does the job. In particular, for simplicity, this hyper-parameter tuning function just runs multiple repetitions of each combination. In the real world it is better to use cross-validation in the hyper-parameter tuning, especially when the number of observations is small. The Clustering tutorial shows an example of how to use crossValidation.
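For reference, the skeleton of a k-fold cross-validation loop is easy to write in plain Julia. This is only a sketch, independent from BetaML's own crossValidation utility; trainfn and errfn are hypothetical user-supplied functions that respectively train a model and compute its error on the held-out fold.

# Illustrative sketch of k-fold cross-validation: train k times, each time
# holding out a different fold for validation, and average the k errors
function kfoldcv(trainfn, errfn, x, y; k=5)
    n     = size(x,1)
    folds = collect(Iterators.partition(1:n, ceil(Int, n/k)))
    errs  = Float64[]
    for validx in folds
        tridx = setdiff(1:n, validx)          # all the records not in the validation fold
        model = trainfn(x[tridx,:], y[tridx]) # train on the other k-1 folds
        push!(errs, errfn(model, x[validx,:], y[validx]))
    end
    return sum(errs)/length(errs)             # average validation error across the k folds
end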

We will see the various functions used inside tuneHyperParameters() in a moment. For now let's just observe that tuneHyperParameters loops over all the possible hyper-parameter combinations and selects the one where the error between xval and yval is minimised. For the meaning of the various hyper-parameters, consult the documentation of the buildTree and buildForest functions. The function uses multiple threads, so it calls generateParallelRngs() (in the BetaML.Utils submodule) to generate thread-safe random number generators and locks the comparison step.

function tuneHyperParameters(model,xtrain,ytrain,xval,yval;maxDepthRange=15:15,maxFeaturesRange=size(xtrain,2):size(xtrain,2),nTreesRange=20:20,βRange=0:0,minRecordsRange=2:2,repetitions=5,rng=Random.GLOBAL_RNG)
    # We start with an infinitely high error
    bestRme         = +Inf
    bestMaxDepth    = 1
    bestMaxFeatures = 1
    bestMinRecords  = 2
    bestNTrees      = 1
    bestβ           = 0
    compLock        = ReentrantLock()

    # Generate one random number generator per thread
    masterSeed = rand(rng,100:9999999999999) ## Some RNGs have problems with very small seeds. Also, the master seed has to be computed _before_ generateParallelRngs
    rngs = generateParallelRngs(rng,Threads.nthreads())

    # We loop over all possible hyperparameter combinations...
    parLengths = (length(maxDepthRange),length(maxFeaturesRange),length(minRecordsRange),length(nTreesRange),length(βRange))
    Threads.@threads for ij in CartesianIndices(parLengths) ## This to avoid many nested for loops
           (maxDepth,maxFeatures,minRecords,nTrees,β)   = (maxDepthRange[Tuple(ij)[1]], maxFeaturesRange[Tuple(ij)[2]], minRecordsRange[Tuple(ij)[3]], nTreesRange[Tuple(ij)[4]], βRange[Tuple(ij)[5]]) ## The specific hyperparameters of this nested loop
           tsrng = rngs[Threads.threadid()] ## The random number generator is specific to each thread
           joinedIndx = LinearIndices(parLengths)[ij]
           # And here we make the seeding depend on the id of the loop, not the thread: hence we get the same results independently of the number of threads
           Random.seed!(tsrng,masterSeed+joinedIndx*10)
           totAttemptError = 0.0
           # We run several repetitions with the same hyperparameter combination to account for stochasticity...
           for r in 1:repetitions
              if model == "DecisionTree"
                 # Here we train the Decision Tree model
                 myTrainedModel = buildTree(xtrain,ytrain, maxDepth=maxDepth,maxFeatures=maxFeatures,minRecords=minRecords,rng=tsrng)
              else
                 # Here we train the Random Forest model
                 myTrainedModel = buildForest(xtrain,ytrain,nTrees,maxDepth=maxDepth,maxFeatures=maxFeatures,minRecords=minRecords,β=β,rng=tsrng)
              end
              # Here we make predictions with this trained model and compute its error
              ŷval   = predict(myTrainedModel, xval,rng=tsrng)
              rmeVal = meanRelError(ŷval,yval,normRec=false)
              totAttemptError += rmeVal
           end
           avgAttemptedDepthError = totAttemptError / repetitions
           begin
               lock(compLock) ## This step can't be run in parallel...
               try
                   # Select this specific combination of hyperparameters if the error is the lowest
                   if avgAttemptedDepthError < bestRme
                     bestRme         = avgAttemptedDepthError
                     bestMaxDepth    = maxDepth
                     bestMaxFeatures = maxFeatures
                     bestNTrees      = nTrees
                     bestβ           = β
                     bestMinRecords  = minRecords
                   end
               finally
                   unlock(compLock)
               end
           end
    end
    return (bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords,bestNTrees,bestβ)
end
tuneHyperParameters (generic function with 1 method)

We can now run the hyper-parameter optimisation function with some "reasonable" ranges. To obtain replicable results we call tuneHyperParameters with rng=copy(FIXEDRNG), where FIXEDRNG is a fixed-seeded random number generator guaranteed to maintain the same stream of random numbers even between different Julia versions. That's also what we use for our unit tests (see the Getting started tutorial for more details).

(bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords) = tuneHyperParameters("DecisionTree",xtrain,ytrain,xval,yval,
           maxDepthRange=3:7,maxFeaturesRange=10:12,minRecordsRange=2:6,repetitions=10,rng=copy(FIXEDRNG))
(0.09376348124191129, 5, 12, 5, 20, 0)

Now that we have found the "optimal" hyperparameters we can build ("train") our model using them:

myTree = buildTree(xtrain,ytrain, maxDepth=bestMaxDepth, maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,rng=copy(FIXEDRNG));

Let's benchmark the time and memory usage of the training step of a decision tree:

@btime  buildTree(xtrain,ytrain, maxDepth=bestMaxDepth, maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,rng=copy(FIXEDRNG));
  23.786 ms (55753 allocations: 58.57 MiB)

Individual decision trees are blazing fast, among the fastest algorithms we could use.

The above buildTree function produces a DecisionTree object that can be used to make predictions given some new features, i.e. given some X matrix of (number of observations × dimensions), predict the corresponding Y vector of scalars in $\mathbb{R}$.

(ŷtrain,ŷval,ŷtest) = predict.([myTree], [xtrain,xval,xtest])
3-element Vector{Vector{Float64}}:
 [1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889  …  6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276]
 [6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276  …  6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276]
 [5678.555555555556, 5678.555555555556, 6850.893617021276, 6850.893617021276, 6850.893617021276, 5225.7692307692305, 5225.7692307692305, 6850.893617021276, 6850.893617021276, 6850.893617021276  …  5225.7692307692305, 5225.7692307692305, 2120.5, 5225.7692307692305, 2120.5, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305]

Note that the above code uses the "dot syntax" to "broadcast" predict() over an array of feature matrices. It is exactly equivalent to:

ŷtrain = predict(myTree, xtrain);
ŷval   = predict(myTree, xval);
ŷtest  = predict(myTree, xtest);

We now compute the relative mean error for the training, the validation and the test set. The meanRelError is a very flexible error function. Without additional parameters, it computes, as the name says, the mean relative error, also known as the "mean absolute percentage error" (MAPE), between an estimated and a true vector. However it can also compute the relative mean error (as we do here), or use a p-norm higher than 1. The mean relative error emphasises the relativeness of the error, i.e. all observations and dimensions weigh the same, whether large or small. Conversely, in the relative mean error the same relative error on larger observations (or dimensions) weighs more. In this exercise we use the latter, as our data clearly has some outlier days with very few rents, and we care more about avoiding our customers finding empty bike racks than about having unrented bikes on the rack. Targeting a low mean relative error would push all our predictions down to try to accommodate the low-rent days (to avoid a large relative error on them), and that's not what we want.
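In formulas, for the one-dimensional case with the default p=1 norm (these can be checked against the numerical example below), the two measures are:

With normRec=true (mean relative error, or MAPE): $\frac{1}{N} \sum_{i=1}^N \frac{|\hat{y}_i - y_i|}{|y_i|}$

With normRec=false (relative mean error, as used here): $\frac{\sum_{i=1}^N |\hat{y}_i - y_i|}{\sum_{i=1}^N |y_i|}$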

For example, let's consider the following vectors:

y     = [30,28,27,3,32,38];
ŷpref = [32,30,28,10,31,40];
ŷbad  = [29,25,24,5,28,35];

Here ŷpref is a hypothetical output of a model that minimises the relative mean error, while ŷbad minimises the mean relative error.

meanRelError.([ŷbad, ŷpref],[y,y],normRec=true) ## Mean relative error
2-element Vector{Float64}:
 0.18703355611250347
 0.43205786456882955
meanRelError.([ŷbad, ŷpref],[y,y],normRec=false) ## Relative mean error
2-element Vector{Float64}:
 0.10126582278481013
 0.0949367088607595
plot([y ŷbad ŷpref], colour=[:black :red :green], label=["obs" "bad est" "good est"])

We can then compute the relative mean error for the decision tree

(rmeTrain, rmeVal, rmeTest) = meanRelError.([ŷtrain,ŷval,ŷtest],[ytrain,yval,ytest],normRec=false)
3-element Vector{Float64}:
 0.12659039939672598
 0.0937634812419113
 0.22233930540215602

We can plot the true labels vs the estimated ones for the three subsets...

scatter(ytrain,ŷtrain,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in training period (DT)")
scatter(yval,ŷval,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in validation period (DT)")
scatter(ytest,ŷtest,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in testing period (DT)")

Or we can visualise the true vs estimated bikes shared on a temporal basis. First on the full period (2 years)...

ŷtrainfull = vcat(ŷtrain,fill(missing,nval+ntest))
ŷvalfull   = vcat(fill(missing,ntrain), ŷval, fill(missing,ntest))
ŷtestfull  = vcat(fill(missing,ntrain+nval), ŷtest)
plot(data[:,:dteday],[data[:,:cnt] ŷtrainfull ŷvalfull ŷtestfull], label=["obs" "train" "val" "test"], legend=:topleft, ylabel="daily rides", title="Daily bike sharing demand observed/estimated across the\n whole 2-years period (DT)")

...and then focusing on the testing period

stc = 620
endc = size(x,1)
plot(data[stc:endc,:dteday],[data[stc:endc,:cnt] ŷvalfull[stc:endc] ŷtestfull[stc:endc]], label=["obs" "val" "test"], legend=:bottomleft, ylabel="Daily rides", title="Focus on the testing period (DT)")

The predictions aren't so bad in this case. However, decision trees are highly unstable, and the output could have depended on just the specific initial random seed.

Random Forests

Rather than trying to solve this problem using a single Decision Tree model, let's now try to use a Random Forest model. Random forests average the results of many different decision trees and provide a more "stable" result. Being made of many decision trees, random forests are however more computationally expensive to train, but luckily they tend to self-tune (or self-regularise). In particular, the parameters maxDepth and maxFeatures shouldn't need tuning.

We still tune the model for other parameters however, and in particular the β parameter, a prerogative of BetaML Random Forests that allows assigning more weight to the best performing trees in the forest. It may be particularly important if there are many outliers in the data from which we don't want to "learn".
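Conceptually (this is a sketch, not BetaML's internals), a weighted random forest prediction for a single record is just the weighted average of the predictions of its individual trees, where with β > 0 the better performing trees receive larger weights:

# Illustrative sketch only: combine per-tree predictions with per-tree weights
forestpredict(treepreds::Vector{Float64}, weights::Vector{Float64}) =
    sum(treepreds .* weights) / sum(weights)

forestpredict([5000.0, 5200.0, 4700.0], [1.0, 1.0, 1.0]) # plain average: ≈ 4966.7
forestpredict([5000.0, 5200.0, 4700.0], [2.0, 1.0, 0.5]) # weighted toward the first tree: ≈ 5014.3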

minRecordsRange=[2,4,5]; nTreesRange=60:10:80; βRange=100:100:300
(bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords,bestNTrees,bestβ) = tuneHyperParameters("RandomForest",xtrain,ytrain,xval,yval,
        maxDepthRange=size(xtrain,1):size(xtrain,1),maxFeaturesRange=Int(round(sqrt(size(xtrain,2)))):Int(round(sqrt(size(xtrain,2)))),
        minRecordsRange=minRecordsRange,nTreesRange=nTreesRange,βRange=βRange,repetitions=5,rng=copy(FIXEDRNG))
(0.11247916355718526, 548, 3, 5, 60, 200)

As for decision trees, once the hyper-parameters of the model are tuned we can train the model again using the optimal parameters.

myForest = buildForest(xtrain,ytrain, bestNTrees, maxDepth=bestMaxDepth,maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,β=bestβ,oob=true,rng=copy(FIXEDRNG));

Let's now benchmark the training of the BetaML Random Forest model

@btime buildForest(xtrain,ytrain, bestNTrees, maxDepth=bestMaxDepth,maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,β=bestβ,oob=true,rng=copy(FIXEDRNG));
  768.243 ms (2451922 allocations: 971.33 MiB)

Random forests are evidently slower than individual decision trees but are still relatively fast. We should also consider that they are efficiently parallelised by default, so their speed increases with the number of available cores (in building this documentation page, GitHub CI servers allow for a single core, so all the benchmarks you see in this tutorial are run with a single core available).

Random forests support the so-called "out-of-bag" error, an estimation of the error that we would have when the model is applied to a testing sample. However in this case the reported oob error is much smaller than the testing error we will actually find. This is due to the fact that the division between training/validation and testing in this exercise is not random, but has a temporal basis. It seems that in this example the data in validation/testing follow a different pattern/variance than those in training (in probabilistic terms, the daily observations are not i.i.d.).
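As a reminder of how the out-of-bag error is built (again a conceptual sketch, not BetaML's internals): each record is predicted using only the trees whose bootstrap sample did not include it, so the resulting error behaves like an internal test-set error. Here treepreds[t][i] and inbag[t][i] are hypothetical per-tree predictions and bootstrap-membership indicators.

# Illustrative sketch only: the oob prediction of record i averages the
# predictions of the trees that never saw record i during training
function oobpredictions(treepreds::Vector{Vector{Float64}}, inbag::Vector{Vector{Bool}})
    n = length(treepreds[1])
    return map(1:n) do i
        oobtrees = [t for t in eachindex(treepreds) if !inbag[t][i]]
        isempty(oobtrees) ? missing : sum(treepreds[t][i] for t in oobtrees) / length(oobtrees)
    end
end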

(ŷtrain,ŷval,ŷtest)         = predict.([myForest], [xtrain,xval,xtest])
oobError, trueTestMeanRelativeError  = myForest.oobError, meanRelError(ŷtest,ytest,normRec=true)
(rmeTrain, rmeVal, rmeTest) = meanRelError.([ŷtrain,ŷval,ŷtest],[ytrain,yval,ytest],normRec=false)
3-element Vector{Float64}:
 0.06512088063750154
 0.09389025098838209
 0.22229683782214169

In this case we found an error very similar to the one obtained with a single decision tree. Let's plot the observed data vs the estimated one using the random forest, first per subset and then along the temporal axis:

scatter(ytrain,ŷtrain,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in training period (RF)")
scatter(yval,ŷval,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in validation period (RF)")
scatter(ytest,ŷtest,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in testing period (RF)")

Full period plot (2 years):

ŷtrainfull = vcat(ŷtrain,fill(missing,nval+ntest))
ŷvalfull   = vcat(fill(missing,ntrain), ŷval, fill(missing,ntest))
ŷtestfull  = vcat(fill(missing,ntrain+nval), ŷtest)
plot(data[:,:dteday],[data[:,:cnt] ŷtrainfull ŷvalfull ŷtestfull], label=["obs" "train" "val" "test"], legend=:topleft, ylabel="daily rides", title="Daily bike sharing demand observed/estimated across the\n whole 2-years period (RF)")

Focus on the testing period:

stc = 620
endc = size(x,1)
plot(data[stc:endc,:dteday],[data[stc:endc,:cnt] ŷvalfull[stc:endc] ŷtestfull[stc:endc]], label=["obs" "val" "test"], legend=:bottomleft, ylabel="Daily rides", title="Focus on the testing period (RF)")