Report on Predicting an SDG Indicator Using Machine Learning Techniques
Introduction
The Sustainable Development Goals (SDGs) provide a comprehensive framework for addressing global challenges. Linking policy indicators to SDG indicators can help identify what is needed to achieve these goals. This report explores the relationship between one SDG indicator and three related policy indicators using three machine learning techniques: Decision Trees, K-Nearest Neighbors (KNN), and Neural Networks. The focus is on predicting the percentage of the population reporting occurrences of crime, violence, and vandalism in their area.
Selected Indicator and Independent Variables
Indicator:
- Percentage of the population reporting occurrences of crime, violence, and vandalism in their area.
Independent Variables:
- Housing overcrowding rate: Percentage of people living in houses without sufficient rooms for all family members.
- Average number of rooms per person: Total and by household type.
- Participation of adults in learning in the last four weeks: Percentage of people aged 25-64 receiving formal or informal education/training.
Data Collection and Preparation
Data were collected from PORDATA for the years 2005 to 2020. Each dataset was converted to a CSV file. The raw files had years as rows and countries as columns (wide format), so they were reshaped into a long format, with one row per year and country, before analysis.
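The wide-to-long reshaping described above can be sketched as follows. The report mentions R's read.csv; this sketch uses Python's standard library purely as an illustration. The column name "year", the value name "crime_pct", and all numbers in the sample are made-up assumptions, not PORDATA values.

```python
import csv
import io

# Hypothetical wide-format extract: one row per year, one column per country.
# The numbers are invented for illustration only.
WIDE_CSV = """year,Portugal,Spain,France
2005,12.1,15.3,14.0
2006,11.8,,13.6
"""

def wide_to_long(csv_text, value_name):
    """Return one record per (year, country) pair from a wide-format CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        year = int(row["year"])
        for country, raw in row.items():
            if country == "year":
                continue
            # Keep missing cells as None so they can be dropped during cleaning.
            value = float(raw) if raw not in ("", None) else None
            records.append({"year": year, "country": country, value_name: value})
    return records

long_rows = wide_to_long(WIDE_CSV, "crime_pct")
for record in long_rows:
    print(record)
```

Each output record carries a year, a country, and a single value, which is the shape needed to join several indicators on the same year and country.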
Steps taken for data preparation:
- Data Import and Transformation:
  - Read CSV files using read.csv.
  - Transformed data into a long format with years and countries as columns.
- Data Merging:
  - Created a concatenated column combining year and country as an identifier.
  - Merged datasets using the concatenated column.
- Data Cleaning:
  - Removed rows with invalid values.
  - Ensured correct data types.
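The merging and cleaning steps above could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the code used for the report: the helper names (add_key, merge_on_key, drop_invalid), the key format "2005_Portugal", the field names, and the sample values are all hypothetical.

```python
import math

def add_key(records):
    """Attach a concatenated year_country identifier to each record."""
    return [{**r, "key": f"{r['year']}_{r['country']}"} for r in records]

def merge_on_key(left, right):
    """Inner join: keep only records whose key appears in both datasets."""
    right_by_key = {r["key"]: r for r in right}
    merged = []
    for record in left:
        match = right_by_key.get(record["key"])
        if match is not None:
            merged.append({**match, **record})
    return merged

def is_valid(value):
    """A value is valid if it is a number and not NaN."""
    return isinstance(value, (int, float)) and not math.isnan(value)

def drop_invalid(records, value_fields):
    """Remove rows where any of the given fields is missing or not numeric."""
    return [r for r in records if all(is_valid(r.get(f)) for f in value_fields)]

# Invented sample rows in long format (assumption, not PORDATA data).
crime = [
    {"year": 2005, "country": "Portugal", "crime_pct": 12.1},
    {"year": 2005, "country": "Spain", "crime_pct": None},
    {"year": 2006, "country": "Portugal", "crime_pct": 11.8},
]
overcrowding = [
    {"year": 2005, "country": "Portugal", "overcrowding_pct": 16.5},
    {"year": 2005, "country": "Spain", "overcrowding_pct": 6.0},
    {"year": 2007, "country": "Portugal", "overcrowding_pct": 15.0},
]

merged = merge_on_key(add_key(crime), add_key(overcrowding))
clean = drop_invalid(merged, ["crime_pct", "overcrowding_pct"])
print(clean)
```

An inner join is assumed here, so a year and country pair missing from either dataset is dropped at the merge step, before cleaning removes rows with missing values.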
Methodology
Three machine learning techniques were applied to predict the selected indicator.
1. Decision Trees:
- Binarized the indicator using the median value to create a balanced class distribution.
- Achieved a model accuracy of 75.79%.
2. K-Nearest Neighbors (KNN):
- Normalized data to a [0, 1] range.
- Achieved a model accuracy of 75.79%.
3. Neural Networks:
- Normalized data to a [-1, 1] range.
- Achieved a model accuracy of 54.74%.
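The preprocessing steps named above (median binarization and min-max normalization to [0, 1] or [-1, 1]) and a basic KNN classifier can be sketched in Python as follows. This is a minimal illustration, not the report's implementation: the function names, the choice of k = 3, the Euclidean distance, and the sample values are assumptions.

```python
import math
from statistics import median

def binarize_at_median(values):
    """Label each value 1 if it lies above the median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]

def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale a column so its minimum maps to low and its maximum to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [low for _ in column]  # constant column: avoid division by zero
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean distance)."""
    by_distance = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))
    votes = [label for _, label in by_distance[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0

def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Invented values for illustration only (assumption, not PORDATA data).
overcrowding = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
crime = [9.0, 10.0, 11.0, 15.0, 16.0, 17.0]

labels = binarize_at_median(crime)                # median split of the indicator
scaled01 = min_max_scale(overcrowding)            # [0, 1] range, as used for KNN
scaled11 = min_max_scale(overcrowding, -1.0, 1.0) # [-1, 1] range, as used for the network

train_x = [(x,) for x in scaled01]
predictions = [knn_predict(train_x, labels, q) for q in [(0.1,), (0.9,)]]
print(labels, predictions, accuracy(predictions, [0, 1]))
```

In practice the minimum and maximum used for scaling are usually taken from the training rows only and then reused on the test rows, so that no information from the test set leaks into the model.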
Results and Discussion
The results highlight the varying performance of the three machine learning models.
Decision Trees and KNN:
- Both models achieved the same accuracy (75.79%). With a median-split target, a trivial classifier would score about 50%, so this result suggests that the independent variables carry real predictive signal for the indicator.
- Decision Trees also provided a measure of the importance of each variable, indicating which factors contribute most to predicting the perception of crime.
Neural Networks:
- Achieved a much lower accuracy (54.74%), only slightly above the roughly 50% chance level for a median-split target, which may indicate issues with data normalization or model parameters.
- Further optimization and tuning might be necessary to improve performance.
Factors Influencing the Indicator
The analysis of variable importance from the Decision Trees model revealed key factors:
- Housing Overcrowding Rate:
  - Higher overcrowding rates are associated with increased reporting of crime, violence, and vandalism.
- Participation in Adult Learning:
  - Higher participation in learning correlates with lower crime reporting, suggesting a potential link between education and crime perception.
- Number of Rooms per Person:
  - More rooms per person are associated with lower crime reporting.
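The report does not state how variable importance was computed. One common approach in decision trees is to score each variable by how much a split on it reduces Gini impurity. The Python sketch below illustrates that idea for a single split per variable. The function names and sample values are hypothetical, and this is not necessarily the measure used in the study.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_gain(feature, labels):
    """Largest impurity reduction achievable with one threshold on feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(feature))[:-1]:
        left = [y for x, y in zip(feature, labels) if x <= threshold]
        right = [y for x, y in zip(feature, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best

# Invented values for illustration only (assumption, not PORDATA data).
labels = [0, 0, 0, 1, 1, 1]                    # binarized crime indicator
overcrowding = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
adult_learning = [10.0, 3.0, 7.0, 5.0, 9.0, 4.0]

gain_overcrowding = best_split_gain(overcrowding, labels)
gain_learning = best_split_gain(adult_learning, labels)
print(gain_overcrowding, gain_learning)
```

A score of this kind measures how useful a variable is for separating the two classes, not the direction of the association, so statements such as "higher overcrowding, more crime reporting" have to be read from the split thresholds or from the data themselves.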
Conclusion
The results suggest that machine learning techniques can predict this SDG indicator from relevant country-level data with moderate accuracy. Decision Trees and KNN both reached 75.79% accuracy, while the Neural Network (54.74%) requires further refinement. Understanding the factors associated with SDG indicators can help guide policy decisions toward sustainable development.
Future Work
Further research could explore:
- Incorporating additional variables for a more comprehensive analysis.
- Applying advanced machine learning techniques and fine-tuning model parameters.
- Extending the study to other SDG indicators for a broader understanding of policy impacts.
References:
- Data Source: PORDATA