EXERCISE 4.1: House value prediction with Neural Networks (regression)
In this problem we are given a dataset containing the average house values in different Boston suburbs, together with the characteristics of each suburb (proportion of owner-occupied units built prior to 1940, index of accessibility to radial highways, etc.). Our task is to build a neural network model and train it to predict the average house value in each suburb.
The detailed attributes of the dataset are:
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
Further information concerning this dataset can be found in this file.
Our prediction concerns the median value (column 14 of the dataset).
Skills employed:
- download and import data from the internet
- design and train a neural network for regression tasks using BetaML
- use the additional BetaML functions partition, OneHotEncoder, Scaler and relative_mean_error
Instructions
If you have already cloned or downloaded the whole course repository, the folder with the exercise is in [REPOSITORY_ROOT]/lessonsMaterial/04_NN/bostonHousing. Otherwise, download a zip of just that folder here.
In the folder you will find the file BostonHousingValue.jl, the Julia script that you will have to complete (implement the missing parts and run the file, following the instructions it contains). In that folder you will also find the Manifest.toml file. The proposed solution below has been tested with the environment defined by that file. If you are stuck and you don't want to look up the solution below, you can also ask for help in the forum at the bottom of this page. Good luck!
Resolution
Click "ONE POSSIBLE SOLUTION" to get access to (one possible) solution for each part of the code that you are asked to implement.
1) Setting up the environment...
Start by setting the working directory to the directory of this file and activate it. If you have the provided Manifest.toml file in the directory, just run Pkg.instantiate(); otherwise manually add the packages Pipe, HTTP, CSV, DataFrames, Plots and BetaML.
ONE POSSIBLE SOLUTION
cd(@__DIR__)
using Pkg
Pkg.activate(".")
# If using a Julia version different than 1.10, please uncomment and run the following line (the reproducibility guarantee will however be lost)
# Pkg.resolve()
Pkg.instantiate()
using Random
Random.seed!(123)
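If you don't have the provided Manifest.toml, a minimal sketch of the manual route (same package list as above) is:
# Manual alternative when no Manifest.toml is available:
Pkg.add(["Pipe", "HTTP", "CSV", "DataFrames", "Plots", "BetaML"])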
2) Load the packages
Load the packages Pipe, HTTP, CSV, DataFrames, Plots and BetaML.
ONE POSSIBLE SOLUTION
using Pipe, HTTP, CSV, DataFrames, Plots, BetaML
3) Load the data
Load the input data from the internet (or from a local file) into a DataFrame or a Matrix. You will need the CSV options header=false and ignorerepeated=true.
dataURL="https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
ONE POSSIBLE SOLUTION
data = @pipe HTTP.get(dataURL).body |> CSV.File(_, delim=' ', header=false, ignorerepeated=true) |> DataFrame
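If you prefer to work offline, the same CSV options apply to a local copy of the data ("housing.data" below is a hypothetical local path):
# Local-file alternative; "housing.data" is a hypothetical path to a downloaded copy:
data = CSV.File("housing.data", delim=' ', header=false, ignorerepeated=true) |> DataFrame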
4) Implement one-hot encoding of categorical variables
The 4th column is a dummy variable indicating whether the suburb bounds the Charles River. Use the BetaML model OneHotEncoder to encode this dummy into two separate vectors, one for each possible value.
ONE POSSIBLE SOLUTION
riverDummy = fit!(OneHotEncoder(),data[:,4])
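A quick sanity check (assuming, as in the attribute list above, that the dummy takes only the values 0 and 1):
unique(data[:,4])   # expected: the two categories, 0 and 1
size(riverDummy)    # expected: (506, 2), one column per category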
5) Put together the feature matrix
Now create the X matrix of features concatenating horizzontaly the 1st to 3rd column of data
, the 5th to 13th columns and the two columns you created with the one hot encoding. Make sure you have a 506×14 matrix.
ONE POSSIBLE SOLUTION
X = hcat(Matrix(data[:,[1:3;5:13]]),riverDummy)
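You can verify the expected shape directly:
@assert size(X) == (506,14)   # 506 records, 14 features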
6) Build the label vector
Similarly, define Y to be the 14th column of data.
ONE POSSIBLE SOLUTION
Y = data[:,14]
7) Partition the data
Partition the data in (xtrain, xtest) and (ytrain, ytest), keeping 80% of the data for training and reserving 20% for testing. Keep the default option to shuffle the data, as the input data isn't already shuffled.
ONE POSSIBLE SOLUTION
((xtrain,xtest),(ytrain,ytest)) = partition([X,Y],[0.8,0.2])
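If you want the split itself to be reproducible regardless of earlier uses of the global RNG, partition accepts an rng keyword and BetaML exports a fixed-seed generator, FIXEDRNG (treat both as assumptions about your installed BetaML version):
# Assumption: partition supports the rng keyword and BetaML exports FIXEDRNG
((xtrain,xtest),(ytrain,ytest)) = partition([X,Y],[0.8,0.2], rng=copy(FIXEDRNG))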
8) Define the neural network architecture
Define a NeuralNetworkEstimator model with the following characteristics:
- 3 dense layers, taking the 14 input features to 20 nodes, then 20 to 20, then 20 to 1, all with activation function relu
- cost function squared_cost
- training options: 400 epochs and 6 records to be used on each batch
ONE POSSIBLE SOLUTION
l1 = DenseLayer(14,20,f=relu)   # input layer: 14 features -> 20 nodes
l2 = DenseLayer(20,20,f=relu)   # hidden layer: 20 -> 20 nodes
l3 = DenseLayer(20,1,f=relu)    # output layer: 20 -> 1 node (the predicted value)
mynn = NeuralNetworkEstimator(layers=[l1,l2,l3], loss=squared_cost, batch_size=6, epochs=400)
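The relu output layer follows the exercise text, but note that it can never produce negative predictions (harmless here, since house values are positive). A common variant for regression, if you want to experiment, is a linear output layer:
# Variant (not the exercise's specification): linear output for unrestricted real-valued predictions
l3_linear = DenseLayer(20,1,f=identity)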
9) Train the model
Train the model using ytrain and a scaled version of xtrain (where all columns have zero mean and unit standard deviation).
ONE POSSIBLE SOLUTION
fit!(mynn,fit!(Scaler(),xtrain),ytrain)
10) Predict the labels
Predict the training labels ŷtrain and the test labels ŷtest. Recall you did the training on the scaled features!
ONE POSSIBLE SOLUTION
scl = Scaler()
xtrainScaled = fit!(scl,xtrain)               # same standardisation as used in the training step
ŷtrain = predict(mynn, xtrainScaled)
ŷtest  = predict(mynn, predict(scl,xtest))    # scale the test features with the training statistics
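Note that the test features must be scaled with the statistics learned on the training set: fitting a separate Scaler on xtest would transform the two sets inconsistently.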
11) Evaluate the model
Compute the train and test relative mean errors using the function relative_mean_error.
ONE POSSIBLE SOLUTION
trainRME = relative_mean_error(ytrain,ŷtrain)
testRME = relative_mean_error(ytest,ŷtest)
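To see the two errors at a glance:
println("Relative mean error - train: $trainRME, test: $testRME")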
12) Plot the errors and the estimated values vs the true ones
Run the following commands to plot the average loss per epoch and the true vs estimated test values:
plot(info(mynn)["loss_per_epoch"])
scatter(ytest,ŷtest,xlabel="true values", ylabel="estimated values", legend=nothing)
13) (Optional) Use unscaled data
Run the same workflow without scaling the data. How does this affect the quality of your predictions?
ONE POSSIBLE SOLUTION
Random.seed!(123)
((xtrain,xtest),(ytrain,ytest)) = partition([X,Y],[0.8,0.2])
l1 = DenseLayer(14,20,f=relu)
l2 = DenseLayer(20,20,f=relu)
l3 = DenseLayer(20,1,f=relu)
mynn= NeuralNetworkEstimator(layers=[l1,l2,l3],loss=squared_cost,batch_size=6,epochs=400)
fit!(mynn,xtrain,ytrain)   # note: this time the features are not scaled
ŷtrain = predict(mynn, xtrain)
ŷtest = predict(mynn, xtest)
trainRME = relative_mean_error(ytrain,ŷtrain)
testRME = relative_mean_error(ytest,ŷtest)
plot(info(mynn)["loss_per_epoch"])
scatter(ytest,ŷtest,xlabel="true values", ylabel="estimated values", legend=nothing)
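You should typically observe a worse fit: without standardisation, features on large numeric scales (e.g. TAX, in the hundreds) dominate the gradient updates, so with the same number of epochs the network tends to converge more slowly and to poorer predictions.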