# A regression task: the prediction of bike sharing demand

The task is to estimate the influence of several variables (such as the weather, the season or the day of the week) on the demand for shared bicycles, so that the authority in charge of the service can organise it in the best way.

Data origin:

- original full dataset (by hour, not used here): https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
- simplified dataset (by day, with some simple scaling): https://www.hds.utc.fr/~tdenoeux/dokuwiki/en/aec
- description: https://www.hds.utc.fr/~tdenoeux/dokuwiki/_media/en/exam_2019_ace_.pdf
- data: https://www.hds.utc.fr/~tdenoeux/dokuwiki/_media/en/bike_sharing_day.csv.zip

Note that even if we are estimating a time series, we are not using a recurrent neural network here, as we assume the temporal dependence to be negligible (i.e. $Y_t = f(X_t)$ alone).

## Library and data loading

We first load all the packages we are going to use

```
using LinearAlgebra, Random, Statistics, DataFrames, CSV, Plots, Pipe, BenchmarkTools, BetaML
import Distributions: Uniform
import DecisionTree, Flux ## For comparisons
```

Here we load the data from a CSV file provided by the BetaML package.

```
baseDir = joinpath(dirname(pathof(BetaML)),"..","docs","src","tutorials","Regression - bike sharing")
data = CSV.File(joinpath(baseDir,"data","bike_sharing_day.csv"),delim=',') |> DataFrame
describe(data)
```

16 rows × 7 columns

| | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | instant | 366.0 | 1 | 366.0 | 731 | 0 | Int64 |
| 2 | dteday | | 2011-01-01 | | 2012-12-31 | 0 | Date |
| 3 | season | 2.49658 | 1 | 3.0 | 4 | 0 | Int64 |
| 4 | yr | 0.500684 | 0 | 1.0 | 1 | 0 | Int64 |
| 5 | mnth | 6.51984 | 1 | 7.0 | 12 | 0 | Int64 |
| 6 | holiday | 0.0287278 | 0 | 0.0 | 1 | 0 | Int64 |
| 7 | weekday | 2.99726 | 0 | 3.0 | 6 | 0 | Int64 |
| 8 | workingday | 0.683995 | 0 | 1.0 | 1 | 0 | Int64 |
| 9 | weathersit | 1.39535 | 1 | 1.0 | 3 | 0 | Int64 |
| 10 | temp | 0.495385 | 0.0591304 | 0.498333 | 0.861667 | 0 | Float64 |
| 11 | atemp | 0.474354 | 0.0790696 | 0.486733 | 0.840896 | 0 | Float64 |
| 12 | hum | 0.627894 | 0.0 | 0.626667 | 0.9725 | 0 | Float64 |
| 13 | windspeed | 0.190486 | 0.0223917 | 0.180975 | 0.507463 | 0 | Float64 |
| 14 | casual | 848.176 | 2 | 713.0 | 3410 | 0 | Int64 |
| 15 | registered | 3656.17 | 20 | 3662.0 | 6946 | 0 | Int64 |
| 16 | cnt | 4504.35 | 22 | 4548.0 | 8714 | 0 | Int64 |

The variable we want to learn to predict is `cnt`, the total demand of bikes for a given day. Even if it is indeed an integer, we treat it as a continuous variable, so each single prediction will be a scalar $Y \in \mathbb{R}$.

`plot(data.cnt, title="Daily bike sharing rents (2Y)", label=nothing)`

## Decision Trees

We start our regression task with Decision Trees.

Decision tree training consists in choosing the set of questions (in a hierarchical way, so as to indeed form a "decision tree") that "best" splits the dataset given for training, in the sense that each split generates sub-samples (always 2 sub-samples in the BetaML implementation) that are, for the characteristic we want to predict, as homogeneous as possible. Decision trees are one of the few ML algorithms with an intuitive interpretation, and they can be used for both regression and classification tasks.
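To make the idea of "the most homogeneous split" concrete for a regression target, here is a minimal toy sketch in plain Julia (an illustration only, not the BetaML implementation) that selects the (feature, threshold) pair minimising the weighted within-group variance of the two sub-samples:

```
using Statistics

# Toy sketch of a single regression-tree split: try every feature and every
# observed value as a threshold, and keep the pair that makes the two
# sub-samples as homogeneous as possible (lowest weighted variance).
function best_split(X, y)
    bestcost, bestfeat, bestthr = Inf, 0, 0.0
    for f in 1:size(X,2), thr in unique(X[:,f])
        left  = y[X[:,f] .<= thr]
        right = y[X[:,f] .>  thr]
        (isempty(left) || isempty(right)) && continue
        # Weighted within-group variance: lower means "more homogeneous"
        cost = length(left)  * var(left,  corrected=false) +
               length(right) * var(right, corrected=false)
        if cost < bestcost
            bestcost, bestfeat, bestthr = cost, f, thr
        end
    end
    return (bestfeat, bestthr)
end

X = [1.0 10; 2 20; 3 30; 4 40]
y = [5.0, 6, 50, 52]     # the target jumps between the 2nd and 3rd record
best_split(X, y)         # returns (1, 2.0): split the first feature at 2.0
```

A real implementation repeats this recursively on each sub-sample until a stopping criterion (such as `maxDepth` or `minRecords`) is met.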

### Data preparation

The first step is to prepare the data for the analysis. This already depends on the model we want to employ, as some models "accept" almost anything as input, no matter whether the data is numerical or categorical, or whether it has missing values or not, while other models are instead much more demanding and require more work to "clean up" our dataset.

Here we start using Decision Tree and Random Forest models that definitely belong to the first group, so the only thing we have to do is to select the input variables (the "feature matrix", which we will indicate with "X") and the variable representing our output (the information we want to learn to predict, which we call "y"):

```
x = Matrix{Float64}(data[:,[:instant,:season,:yr,:mnth,:holiday,:weekday,:workingday,:weathersit,:temp,:atemp,:hum,:windspeed]])
y = data[:,16];
```

We can now split the dataset between the data we will use for training the algorithm (`xtrain`/`ytrain`), those for selecting the hyperparameters (`xval`/`yval`) and finally those for testing the quality of the algorithm with the optimal hyperparameters (`xtest`/`ytest`). We use the `partition` function specifying the share we want to use for these three different subsets, here 75%, 12.5% and 12.5% respectively. As our data represents indeed a time series, we want our model to be able to predict *future* demand of bike sharing from *past* observed rented bikes, so we do not shuffle the dataset, as would otherwise be the default.

```
((xtrain,xval,xtest),(ytrain,yval,ytest)) = partition([x,y],[0.75,0.125,1-0.75-0.125],shuffle=false)
(ntrain, nval, ntest) = size.([ytrain,yval,ytest],1)
```

3-element Vector{Int64}: 548 92 91

We can now "tune" our model's so-called hyper-parameters, i.e. choose the best exogenous parameters of our algorithm, where "best" refers to the minimisation of a "loss" function between the true and the predicted values. We compute this loss function on a specific subset of data that we call the "validation" subset (`xval` and `yval`).

BetaML doesn't have a dedicated function for hyper-parameter optimisation, but it is easy to write some custom Julia code, at least for a simple grid-based "search". Indeed, one of the main reasons that a dedicated function exists in other Machine Learning libraries is that loops in other languages are slow, but this is not a problem in Julia, so we can retain the flexibility to write the kind of hyper-parameter tuning that best fits our needs.

Below is an example of a possible such function. Note that there are more "elegant" ways to code it, but this one does the job. In particular, for simplicity, this hyper-parameter tuning function just runs multiple repetitions. In the real world it is better to use cross-validation in the hyper-parameter tuning, especially when the number of observations is small. The Clustering tutorial shows an example of how to use `crossValidation`.

We will see the various functions inside `tuneHyperParameters()` in a moment. For now let's just observe that `tuneHyperParameters` loops over all the possible hyper-parameters and selects the ones where the error between `xval` and `yval` is minimised. For the meaning of the various hyper-parameters, consult the documentation of the `buildTree` and `buildForest` functions. The function uses multiple threads, so we call `generateParallelRngs()` (in the `BetaML.Utils` submodule) to generate thread-safe random number generators and lock the comparison step.

```
function tuneHyperParameters(model,xtrain,ytrain,xval,yval;maxDepthRange=15:15,maxFeaturesRange=size(xtrain,2):size(xtrain,2),nTreesRange=20:20,βRange=0:0,minRecordsRange=2:2,repetitions=5,rng=Random.GLOBAL_RNG)
    # We start with an infinitely high error
    bestRme         = +Inf
    bestMaxDepth    = 1
    bestMaxFeatures = 1
    bestMinRecords  = 2
    bestNTrees      = 1
    bestβ           = 0
    compLock        = ReentrantLock()
    # Generate one random number generator per thread
    masterSeed = rand(rng,100:9999999999999) ## Some RNGs have problems with very small seeds. Also, the master seed has to be computed _before_ generateParallelRngs
    rngs       = generateParallelRngs(rng,Threads.nthreads())
    # We loop over all possible hyperparameter combinations...
    parLengths = (length(maxDepthRange),length(maxFeaturesRange),length(minRecordsRange),length(nTreesRange),length(βRange))
    Threads.@threads for ij in CartesianIndices(parLengths) ## This avoids many nested for loops
        (maxDepth,maxFeatures,minRecords,nTrees,β) = (maxDepthRange[Tuple(ij)[1]], maxFeaturesRange[Tuple(ij)[2]], minRecordsRange[Tuple(ij)[3]], nTreesRange[Tuple(ij)[4]], βRange[Tuple(ij)[5]]) ## The specific hyperparameters of this nested loop
        tsrng      = rngs[Threads.threadid()] ## The random number generator is specific to each thread..
        joinedIndx = LinearIndices(parLengths)[ij]
        # Here we make the seeding depend on the id of the loop, not the thread: hence we get the same results independently of the number of threads
        Random.seed!(tsrng,masterSeed+joinedIndx*10)
        totAttemptError = 0.0
        # We run several repetitions with the same hyperparameter combination to account for stochasticity...
        for r in 1:repetitions
            if model == "DecisionTree"
                # Here we train the Decision Tree model
                myTrainedModel = buildTree(xtrain,ytrain, maxDepth=maxDepth,maxFeatures=maxFeatures,minRecords=minRecords,rng=tsrng)
            else
                # Here we train the Random Forest model
                myTrainedModel = buildForest(xtrain,ytrain,nTrees,maxDepth=maxDepth,maxFeatures=maxFeatures,minRecords=minRecords,β=β,rng=tsrng)
            end
            # Here we make predictions with this trained model and we compute its error
            ŷval   = predict(myTrainedModel, xval,rng=tsrng)
            rmeVal = meanRelError(ŷval,yval,normRec=false)
            totAttemptError += rmeVal
        end
        avgAttemptedDepthError = totAttemptError / repetitions
        begin
            lock(compLock) ## This step can't be run in parallel...
            try
                # Select this specific combination of hyperparameters if the error is the lowest
                if avgAttemptedDepthError < bestRme
                    bestRme         = avgAttemptedDepthError
                    bestMaxDepth    = maxDepth
                    bestMaxFeatures = maxFeatures
                    bestNTrees      = nTrees
                    bestβ           = β
                    bestMinRecords  = minRecords
                end
            finally
                unlock(compLock)
            end
        end
    end
    return (bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords,bestNTrees,bestβ)
end
```

tuneHyperParameters (generic function with 1 method)

We can now run the hyperparameter optimisation function with some "reasonable" ranges. To obtain replicable results we call `tuneHyperParameters` with `rng=copy(FIXEDRNG)`, where `FIXEDRNG` is a fixed-seeded random number generator guaranteed to maintain the same stream of random numbers even between different Julia versions. That's also what we use for our unit tests (see the `Getting started` section for more details).

```
(bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords) = tuneHyperParameters("DecisionTree",xtrain,ytrain,xval,yval,
maxDepthRange=3:7,maxFeaturesRange=10:12,minRecordsRange=2:6,repetitions=10,rng=copy(FIXEDRNG))
```

(0.09376348124191129, 5, 12, 5, 20, 0)

Now that we have found the "optimal" hyperparameters we can build ("train") our model using them:

`myTree = buildTree(xtrain,ytrain, maxDepth=bestMaxDepth, maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,rng=copy(FIXEDRNG));`

Let's benchmark the time and memory usage of the training step of a decision tree:

`@btime buildTree(xtrain,ytrain, maxDepth=bestMaxDepth, maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,rng=copy(FIXEDRNG));`

23.786 ms (55753 allocations: 58.57 MiB)

Individual decision trees are blazing fast, among the fastest algorithms we could use.

The above `buildTree` function produces a DecisionTree object that can be used to make predictions given some new features: given an X matrix of (number of observations × dimensions), it predicts the corresponding Y vector of scalars in $\mathbb{R}$.

`(ŷtrain,ŷval,ŷtest) = predict.([myTree], [xtrain,xval,xtest])`

3-element Vector{Vector{Float64}}: [1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889, 1259.388888888889 … 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276] [6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276 … 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276, 6850.893617021276] [5678.555555555556, 5678.555555555556, 6850.893617021276, 6850.893617021276, 6850.893617021276, 5225.7692307692305, 5225.7692307692305, 6850.893617021276, 6850.893617021276, 6850.893617021276 … 5225.7692307692305, 5225.7692307692305, 2120.5, 5225.7692307692305, 2120.5, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305, 5225.7692307692305]

Note that the above code uses the "dot syntax" to "broadcast" `predict()` over an array of feature matrices. It is exactly equivalent to:

```
ŷtrain = predict(myTree, xtrain);
ŷval = predict(myTree, xval);
ŷtest = predict(myTree, xtest);
```

We now compute the relative mean error for the training, the validation and the test set. The `meanRelError` is a very flexible error function. Without additional parameters, it computes, as the name says, the *mean relative error*, also known as the "mean absolute percentage error" (MAPE), between an estimated and a true vector. However it can also compute the *relative mean error* (as we do here), or use a p-norm higher than 1. The *mean relative error* emphasises the relativeness of the error, i.e. all observations and dimensions weigh the same, whether large or small. Conversely, in the *relative mean error* the same relative error on larger observations (or dimensions) weighs more. In this exercise we use the latter, as our data clearly has some outlier days with very small rents, and we care more about avoiding our customers finding empty bike racks than about having unrented bikes on the racks. Targeting a low mean relative error would push all our predictions down to try to accommodate the low-rent days (to avoid a large relative error on them), and that's not what we want.
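For reference, with a p-norm of 1 the two measures discussed above reduce to the following simple formulas (a sketch re-implementation for illustration, not the actual `meanRelError` code):

```
using Statistics

# normRec=true  -> mean relative error (MAPE):
# the average of the record-level relative errors
mape(ŷ, y) = mean(abs.(ŷ .- y) ./ abs.(y))

# normRec=false -> relative mean error:
# the total absolute error relative to the total observed value
rme(ŷ, y) = sum(abs.(ŷ .- y)) / sum(abs.(y))
```

A single large relative error on a small record dominates `mape`, while `rme` weighs each unit of error the same regardless of the size of the record it occurs on.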

For example let's consider the following example:

```
y = [30,28,27,3,32,38];
ŷpref = [32,30,28,10,31,40];
ŷbad = [29,25,24,5,28,35];
```

Here `ŷpref` is a hypothetical output of a model that minimises the relative mean error, while `ŷbad` minimises the mean relative error.

`meanRelError.([ŷbad, ŷpref],[y,y],normRec=true) ## Mean relative error`

2-element Vector{Float64}: 0.18703355611250347 0.43205786456882955

`meanRelError.([ŷbad, ŷpref],[y,y],normRec=false) ## Relative mean error`

2-element Vector{Float64}: 0.10126582278481013 0.0949367088607595

`plot([y ŷbad ŷpref], colour=[:black :red :green], label=["obs" "bad est" "good est"])`

We can then compute the relative mean error for the decision tree

`(rmeTrain, rmeVal, rmeTest) = meanRelError.([ŷtrain,ŷval,ŷtest],[ytrain,yval,ytest],normRec=false)`

3-element Vector{Float64}: 0.12659039939672598 0.0937634812419113 0.22233930540215602

We can plot the true labels vs the estimated one for the three subsets...

`scatter(ytrain,ŷtrain,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in training period (DT)")`

`scatter(yval,ŷval,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in validation period (DT)")`

`scatter(ytest,ŷtest,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in testing period (DT)")`

Or we can visualise the true vs estimated bike shared on a temporal base. First on the full period (2 years) ...

```
ŷtrainfull = vcat(ŷtrain,fill(missing,nval+ntest))
ŷvalfull = vcat(fill(missing,ntrain), ŷval, fill(missing,ntest))
ŷtestfull = vcat(fill(missing,ntrain+nval), ŷtest)
plot(data[:,:dteday],[data[:,:cnt] ŷtrainfull ŷvalfull ŷtestfull], label=["obs" "train" "val" "test"], legend=:topleft, ylabel="daily rides", title="Daily bike sharing demand observed/estimated across the\n whole 2-years period (DT)")
```

..and then focusing on the testing period

```
stc = 620
endc = size(x,1)
plot(data[stc:endc,:dteday],[data[stc:endc,:cnt] ŷvalfull[stc:endc] ŷtestfull[stc:endc]], label=["obs" "val" "test"], legend=:bottomleft, ylabel="Daily rides", title="Focus on the testing period (DT)")
```

The predictions aren't so bad in this case. However, decision trees are highly unstable, and the output could have depended on just the specific initial random seed.

## Random Forests

Rather than trying to solve this problem using a single Decision Tree model, let's now try to use a *Random Forest* model. Random forests average the results of many different decision trees and provide a more "stable" result. Being made of many decision trees, random forests are however more computationally expensive to train, but luckily they tend to self-tune (or self-regularise). In particular, the parameters `maxDepth` and `maxFeatures` shouldn't need tuning.

We still tune however the model for other parameters, and in particular the β parameter, a prerogative of BetaML Random Forests that allows assigning more weight to the best-performing trees in the forest. It may be particularly important if there are many outliers in the data we don't want to "learn" from.
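Conceptually, β acts as an inverse temperature on the ensemble. The following sketch shows one possible performance-based weighting scheme (an illustrative assumption, not necessarily the exact formula BetaML uses; consult the `buildForest` documentation for that):

```
# Illustrative sketch: weigh each tree by exp(-β * err) and normalise, so
# that β = 0 gives a plain average of the trees, while larger β concentrates
# the forest on its best-performing trees.
treeErrors = [0.10, 0.12, 0.30]        # hypothetical per-tree errors
treePreds  = [4800.0, 5200.0, 9000.0]  # hypothetical predictions for one record
β = 100
w = exp.(-β .* treeErrors)
w ./= sum(w)                           # weights now sum to 1
forecast = sum(w .* treePreds)         # weighted forest prediction
```

With β = 100 the third tree, whose error is much larger, contributes almost nothing to the forecast, which stays close to the predictions of the two best trees.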

```
minRecordsRange=[2,4,5]; nTreesRange=60:10:80; βRange=100:100:300
(bestRme,bestMaxDepth,bestMaxFeatures,bestMinRecords,bestNTrees,bestβ) = tuneHyperParameters("RandomForest",xtrain,ytrain,xval,yval,
maxDepthRange=size(xtrain,1):size(xtrain,1),maxFeaturesRange=Int(round(sqrt(size(xtrain,2)))):Int(round(sqrt(size(xtrain,2)))),
minRecordsRange=minRecordsRange,nTreesRange=nTreesRange,βRange=βRange,repetitions=5,rng=copy(FIXEDRNG))
```

(0.11247916355718526, 548, 3, 5, 60, 200)

As for decision trees, once the hyper-parameters of the model are tuned we train the model again using the optimal parameters.

`myForest = buildForest(xtrain,ytrain, bestNTrees, maxDepth=bestMaxDepth,maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,β=bestβ,oob=true,rng=copy(FIXEDRNG));`

Let's now benchmark the training of the BetaML Random Forest model

`@btime buildForest(xtrain,ytrain, bestNTrees, maxDepth=bestMaxDepth,maxFeatures=bestMaxFeatures,minRecords=bestMinRecords,β=bestβ,oob=true,rng=copy(FIXEDRNG));`

768.243 ms (2451922 allocations: 971.33 MiB)

Random forests are evidently slower than individual decision trees but are still relatively fast. We should also consider that they are by default efficiently parallelised, so their speed increases with the number of available cores (in building this documentation page, GitHub CI servers allow for a single core, so all the benchmarks you see in this tutorial are run with a single core available).

Random forests support the so-called "out-of-bag" error, an estimate of the error that we would obtain when the model is applied to a testing sample. However, in this case the reported oob error is much smaller than the testing error we will actually find. This is due to the fact that the division between training/validation and testing in this exercise is not random but has a temporal basis. It seems that in this example the data in validation/testing follows a different pattern/variance than the data in training (in probabilistic terms, the daily observations are not i.i.d.).
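The "out-of-bag" mechanics can be illustrated with a tiny bootstrap sketch (an illustration only, independent of the BetaML internals): each tree is trained on a sample drawn with replacement, leaving on average a share of about $1 - (1 - 1/n)^n \approx 1/e \approx 37\%$ of the records out, and these "out of bag" records act as an internal test set for that tree.

```
using Random

rng  = MersenneTwister(123)
n    = 10_000
boot = rand(rng, 1:n, n)                  # bootstrap sample, with replacement
oobShare = length(setdiff(1:n, boot)) / n # share of records never drawn
# oobShare ≈ 0.37, close to the theoretical 1/e
```

Averaging, for each record, the predictions of only those trees that did not see it during training yields the oob error estimate.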

`oobError, trueTestMeanRelativeError = myForest.oobError,meanRelError(ŷtest,ytest,normRec=true)`

```
(ŷtrain,ŷval,ŷtest) = predict.([myForest], [xtrain,xval,xtest])
(rmeTrain, rmeVal, rmeTest) = meanRelError.([ŷtrain,ŷval,ŷtest],[ytrain,yval,ytest],normRec=false)
```

3-element Vector{Float64}: 0.06512088063750154 0.09389025098838209 0.22229683782214169

In this case we found an error very similar to the one obtained employing a single decision tree. Let's plot the observed vs the estimated data using the random forest, first as scatter plots and then along the temporal axis:

`scatter(ytrain,ŷtrain,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in training period (RF)")`

`scatter(yval,ŷval,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in validation period (RF)")`

`scatter(ytest,ŷtest,xlabel="daily rides",ylabel="est. daily rides",label=nothing,title="Est vs. obs in testing period (RF)")`

Full period plot (2 years):

```
ŷtrainfull = vcat(ŷtrain,fill(missing,nval+ntest))
ŷvalfull = vcat(fill(missing,ntrain), ŷval, fill(missing,ntest))
ŷtestfull = vcat(fill(missing,ntrain+nval), ŷtest)
plot(data[:,:dteday],[data[:,:cnt] ŷtrainfull ŷvalfull ŷtestfull], label=["obs" "train" "val" "test"], legend=:topleft, ylabel="daily rides", title="Daily bike sharing demand observed/estimated across the\n whole 2-years period (RF)")
```

Focus on the testing period:

```
stc = 620
endc = size(x,1)
plot(data[stc:endc,:dteday],[data[stc:endc,:cnt] ŷvalfull[stc:endc] ŷtestfull[stc:endc]], label=["obs" "val" "test"], legend=:bottomleft, ylabel="Daily rides", title="Focus on the testing period (RF)")
```