Introduction
I decided to explore the data relating to Sustainable Development Goal 1: ‘Eradicate Poverty’, focusing on the ‘Housing cost burden rate: total and by type of housing occupancy’, in order to establish a possible relationship between the difficulties families face in meeting their housing costs and other relevant factors, such as their income, type of housing and social support for housing.
Data Collection
- Dependent variable: ‘Overburden rate of housing expenditure’: total across all types of housing occupancy only.
- Independent variable 1: ‘Average equivalent income: by household type (Euro)’: total across all household types only.
- Independent variable 2: ‘Population by type of occupancy of usual residence accommodation (%)’: Owner and tenant indicators.
- Independent variable 3: ‘Social protection expenditure: total and by function, at constant prices (base=2010)’: Housing indicator only.
Machine Learning Techniques
The following three machine learning techniques were employed to make predictions:
- Decision Tree
- K-Nearest Neighbors (KNN)
- Neural Networks
Data Preprocessing
Data preprocessing steps included handling missing values, normalizing numerical features, and encoding categorical variables.
The dataset was split into training and testing sets in a 70/30 ratio (70% for training, 30% for testing).
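The normalisation and 70/30 split described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report: the helper names `min_max_normalise` and `train_test_split`, the fixed seed and the sample values are all assumptions made for the example.

```python
import random

def min_max_normalise(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical income values, for illustration only (not PORDATA figures).
incomes = [9000.0, 12000.0, 18000.0, 24000.0, 30000.0]
scaled = min_max_normalise(incomes)

# Ten placeholder rows standing in for country/year observations.
rows = list(range(10))
train, test = train_test_split(rows)
```

Fixing the seed makes the random split repeatable, which helps when the same partition has to be reused to compare several models.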
Machine Learning Models
1. Decision Tree
To create a learning model using the decision tree method I needed discrete data, but the data collected is all continuous. For this reason, I discretised the data by grouping the values into classes. To build the model, I used the ‘randomForest’ function configured to create a single decision tree. The model is trained on the training set, and the predictions are then made on the test set using the trained model.
Results:
- Accuracy: High (72.73%).
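The discretisation step mentioned in this section can be illustrated with a small sketch. The report does not state how the class boundaries were chosen, so the equal-width bins, the four classes (labelled 0 to 3) and the sample values below are assumptions, and the code is plain Python, not the R code used for the model.

```python
def discretise(values, n_classes=4):
    """Assign each value to one of n_classes equal-width bins, labelled 0..n_classes-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    if width == 0:  # all values identical: a single class
        return [0 for _ in values]
    labels = []
    for v in values:
        # The maximum value would land in bin n_classes, so clamp it into the last bin.
        labels.append(min(int((v - lo) / width), n_classes - 1))
    return labels

# Hypothetical overburden rates in percent, for illustration only.
rates = [2.0, 5.0, 8.0, 11.0, 14.0, 18.0]
classes = discretise(rates)
```

Equal-width bins are only one option; bins based on quantiles would give classes of more similar size, which can matter when some classes end up with very few observations.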

2. K-Nearest Neighbors (KNN)
The K nearest neighbour method learning model can be fed with any type of data (continuous or discrete), but it is always necessary to use the classified dependent variable. So I used ‘pure’ data for all the independent variables and the dependent variable used the same classification as was used for the decision tree model.
I used the ‘knn3’ function to train the model with the training data and then present the predictions based on the test data. I finish with the ‘confusionMatrix’ function to evaluate the method used.
Results:
- Accuracy: Low (18.18%).
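To make the idea behind this method concrete, the sketch below classifies a point by majority vote among its k nearest training points. It is a plain Python illustration under assumed data, not the R function used in the report; the function name `knn_predict`, the value k=3 and the sample points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its class label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already rescaled features with two well-separated classes.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
train_y = [0, 0, 0, 1, 1, 1]

prediction_low = knn_predict(train_X, train_y, (0.2, 0.2))
prediction_high = knn_predict(train_X, train_y, (0.9, 0.9))
```

Because the vote depends entirely on distances, the prediction is only as good as the neighbourhood structure of the training data.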

3. Neural Networks
The learning model of the neural network method requires its output values to be between 0 and 1. For this reason, I took the classification of the dependent variable that had been created for the decision tree and rescaled it to fit this range.
I used the ‘nnet’ function to train the model with the training data and then present the predictions based on the test data. I finish with the ‘confusionMatrix’ function to evaluate the method used.
Results:
- Accuracy: Medium (47.06%).
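The rescaling of the class labels to the range 0 to 1 can be sketched as below. The exact mapping used in the report is not stated; dividing each label by the number of classes is an assumption, chosen only because it turns class 1 into 0.25 when there are four classes. The helper names are hypothetical and the code is plain Python, not the R code used for the model.

```python
def scale_labels(labels, n_classes=4):
    """Map integer class labels 0..n_classes-1 into [0, 1) by dividing by n_classes (assumed mapping)."""
    return [label / n_classes for label in labels]

def unscale_prediction(output, n_classes=4):
    """Map a network output in [0, 1] back to the nearest valid class label."""
    label = round(output * n_classes)
    # Clamp so that outputs near 1 still map to the highest existing class.
    return max(0, min(label, n_classes - 1))

scaled_labels = scale_labels([0, 1, 2, 3])
recovered_mid = unscale_prediction(0.27)
recovered_top = unscale_prediction(0.95)
```

Mapping the outputs back to class labels in this way is what allows the predictions to be compared with the true classes in a confusion matrix.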

Discussion
Differences Between Models
- Decision Tree: Obtained the highest accuracy (72.73%), although its true positive rate (sensitivity) is only 50% for class 1 and 0% for classes 2 and 3. This model therefore stands out for its accuracy, even though its sensitivity varies widely between classes.
- KNN: Performed unsatisfactorily, with an accuracy rate of just 18.18%.
- Neural Networks: Had an average performance, with an accuracy of 47.06%. The training converged, meaning that the optimisation reached a solution it considered satisfactory.
Conclusion
I chose to use the ‘confusionMatrix’ function for all the models so that I could compare their effectiveness and performance more easily. I have prepared a performance comparison table:
| Metric | Decision Tree | KNN | Neural Network |
|---|---|---|---|
| Accuracy | 0.7273 | 0.1818 | 0.4706 |
| Sensitivity | 0.8649 (class 0); 0.5000 (class 1); 0.0000 (classes 2 and 3) | 0.6250 (class 1); 0.0000 (classes 0, 2 and 3) | 0.5000 (class 0.25) |
| Specificity | 0.4444 (class 0); 0.8718 (class 1); 1.0000 (classes 2 and 3) | 0.0769 (class 1); 0.8333 (class 2); 1.0000 (classes 0 and 3) | 0.0000 (class 0.25) |
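For reference, the sketch below shows how accuracy and per-class sensitivity and specificity can be derived from actual and predicted labels, treating each class in turn as "positive" and all the others as "negative". It is a plain Python illustration with invented labels, not the output of ‘confusionMatrix’ and not the report's data; the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(actual, predicted, classes):
    """Return overall accuracy plus one-vs-rest (sensitivity, specificity) per class."""
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    metrics = {}
    for c in classes:
        tp = sum(a == c and p == c for a, p in pairs)  # true positives
        fn = sum(a == c and p != c for a, p in pairs)  # false negatives
        fp = sum(a != c and p == c for a, p in pairs)  # false positives
        tn = sum(a != c and p != c for a, p in pairs)  # true negatives
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        metrics[c] = (sensitivity, specificity)
    return accuracy, metrics

# Invented labels: the model almost always predicts the most frequent class.
actual = [0, 0, 0, 0, 1, 1, 2, 3]
predicted = [0, 0, 0, 1, 1, 0, 0, 0]
accuracy, metrics = per_class_metrics(actual, predicted, classes=[0, 1, 2, 3])
```

In this invented example the rare classes 2 and 3 are never predicted, so their sensitivity is 0 while their specificity is 1, the same pattern that appears in the decision tree column of the table.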
When I analysed the results of the three learning models, I concluded that the decision tree model was the one that best predicted the dependent variable, recording an accuracy rate of 72.73%, although its true positive rate (sensitivity) was only 50% for class 1 and 0% for classes 2 and 3. This model therefore stands out for its high accuracy, even though its sensitivity varies greatly between classes.
The neural network had an average performance, with an accuracy of 47.06%. The training converged, meaning that the optimisation reached a solution it considered satisfactory.
Finally, the KNN learning model performed poorly, with an accuracy rate of only 18.18%.
I therefore conclude that the independent variables chosen to predict the dependent variable are a good choice when used with the decision tree or neural network learning model.
Knowing that both the decision tree and the neural network generated were based on just one tree/network, I think that expanding to a random forest or a more complex network would help to improve the results found.
I believe that the unsatisfactory performance of the KNN learning model is related to the heterogeneity of the data being processed, since I am evaluating data on the economic reality of several European countries, which are, by nature, very different from each other. When the data is divided into the training and test sets, it is distributed randomly, and the KNN model, which relies on the nearest neighbours to make its predictions, ends up performing poorly. This method would therefore be more suitable for a subset of the data focusing only on countries with more similar economic realities (such as Portugal and Spain), or for other types of data that follow a more ‘stable’ trend.
References
- PORDATA: database of European statistics.