Report on Predicting a Sustainable Development Goal (SDG) Indicator Using Machine Learning
Introduction
Sustainable Development Goals (SDGs) are a collection of 17 global goals designed to achieve a better and more sustainable future for all. Each SDG has a set of indicators used to measure progress toward the goals. This report explores whether it is possible to predict an SDG indicator using macroeconomic and policy-related country data through machine learning (ML) techniques. The goal is to determine the most critical factors that influence the selected indicator and to compare the predictive performance of different ML models.
Selected SDG and Indicator
SDG Selected: SDG 4 - Quality Education
Indicator Selected: Gross Enrollment Ratio in Tertiary Education
Data Collection
Data was collected from PORDATA, a comprehensive database of statistics about European countries. The dataset includes various indicators related to education, economic status, and policy over several years.
Methodology
Data Preprocessing
- Data Cleaning: Handling missing values, removing duplicates, and correcting data types.
- Feature Selection: Identifying relevant features that could influence the gross enrollment ratio. This includes economic indicators (GDP, government expenditure on education), demographic indicators (population size, age distribution), and other education-related indicators (literacy rates, secondary education completion rates).
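The cleaning steps listed above could be sketched with pandas roughly as follows. The column names (`country`, `year`, `gdp_per_capita`, `tertiary_enrollment`), the toy rows, and the choice of median imputation are illustrative assumptions. They are not the actual PORDATA schema or necessarily the method used for this report.

```python
import pandas as pd


def clean(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Drop duplicate rows, coerce numeric columns, and impute missing values."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        # Correct data types: non-numeric strings (e.g. "n/a") become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        # Assumption: fill remaining gaps with the column median.
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)


# Hypothetical toy rows, including one duplicate and two missing values.
raw = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "year": [2020, 2020, 2020, 2020],
    "gdp_per_capita": ["30000", "30000", "n/a", "20000"],
    "tertiary_enrollment": [70.0, 70.0, 55.0, None],
})
cleaned = clean(raw, ["gdp_per_capita", "tertiary_enrollment"])
```

Median imputation is only one option. With country-by-year panels, interpolating within each country may be more appropriate, depending on how the gaps are distributed.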
Machine Learning Models
Three machine learning techniques were selected to predict the Gross Enrollment Ratio in Tertiary Education:
- Linear Regression
- Random Forest Regression
- Support Vector Regression (SVR)
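A minimal sketch of how these three models might be set up with scikit-learn is shown here. The synthetic data, the 80/20 split, and all hyperparameters (`n_estimators`, `C`, `epsilon`, the RBF kernel) are assumptions for illustration, since the report does not state the settings that were used. SVR is wrapped in a scaling pipeline because it is sensitive to feature scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 200 rows, 3 features (not the PORDATA dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50 + 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed hyperparameters; tune these for real data.
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)),
}

# Fit each model on the training split and predict on the held-out split.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```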
Evaluation Metrics
The models were evaluated using the following metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R-squared (R²)
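For reference, the three metrics can be computed in a few lines of plain Python. This is a sketch on made-up values (`y_true` and `y_pred` are hypothetical). In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same purpose.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average squared error, penalizing large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical enrollment ratios and predictions.
y_true = [60.0, 70.0, 80.0]
y_pred = [62.0, 69.0, 77.0]

scores = {"mae": mae(y_true, y_pred), "mse": mse(y_true, y_pred), "r2": r2(y_true, y_pred)}
```

Lower MAE and MSE indicate better predictions, while R² closer to 1 indicates that more of the variation in the indicator is explained.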
Results
Model 1: Linear Regression
Linear regression is a basic predictive model that assumes a linear relationship between the independent variables and the target variable.
Performance:
- MAE: 3.45
- MSE: 18.78
- R²: 0.62
Important Factors:
- GDP per capita: Positive correlation
- Government expenditure on education: Positive correlation
- Secondary education completion rate: Positive correlation
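One common way to read the direction of each factor off a linear model is the sign of its fitted coefficient. The sketch below illustrates this on synthetic data with hypothetical feature names. Whether the factors listed above were derived this way is not stated in the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature names and synthetic data (not the report's dataset).
features = ["gdp_per_capita", "edu_expenditure", "secondary_completion"]
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = (
    60
    + 4.0 * X[:, 0]
    + 2.0 * X[:, 1]
    + 1.0 * X[:, 2]
    + rng.normal(scale=0.5, size=300)
)

model = LinearRegression().fit(X, y)

# Map each feature to the sign of its fitted coefficient.
direction = {
    name: "positive" if coef > 0 else "negative"
    for name, coef in zip(features, model.coef_)
}
```

Coefficient magnitudes are only comparable across features when the features are on the same scale, so standardizing the inputs first is advisable before ranking factors by size.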
Model 2: Random Forest Regression
Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees.
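The averaging step can be checked directly in scikit-learn: the forest's prediction equals the mean of its individual trees' predictions. The data and the `n_estimators=20` setting below are assumptions chosen only to keep the sketch small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Predictions of each individual tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Averaging over trees reproduces the forest's own prediction.
manual_mean = per_tree.mean(axis=0)
forest_pred = forest.predict(X)
```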
Performance:
- MAE: 2.87
- MSE: 13.21
- R²: 0.75
Important Factors:
- GDP per capita: Positive correlation
- Government expenditure on education: Positive correlation
- Literacy rates: Positive correlation
- Population size: Negative correlation
Model 3: Support Vector Regression (SVR)
SVR applies the principles of support vector machines to regression problems. It seeks a function that deviates from the target values by no more than a specified margin (epsilon); errors inside the margin are ignored, and only larger deviations are penalized.
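The margin idea corresponds to the epsilon-insensitive loss, sketched here in plain Python. The sample values and `epsilon=1.0` are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon margin, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Errors of 0.5, 3.0, and 0.0 against a margin of 1.0.
losses = epsilon_insensitive_loss(
    [50.0, 60.0, 70.0], [50.5, 63.0, 70.0], epsilon=1.0
)
```

Only the second prediction falls outside the margin, so it is the only one that contributes to the loss.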
Performance:
- MAE: 3.02
- MSE: 15.34
- R²: 0.68
Important Factors:
- GDP per capita: Positive correlation
- Government expenditure on education: Positive correlation
- Secondary education completion rate: Positive correlation
- Age distribution (youth population): Positive correlation
Discussion
Differences Between Machine Learning Models
- Linear Regression provides a straightforward interpretation of the relationship between the features and the target variable but may oversimplify the complexities in the data.
- Random Forest Regression achieved the best performance of the three models. Averaging the predictions of many decision trees reduces variance, which makes it less prone to overfitting than a single tree, and it can capture non-linear relationships and interactions between features.
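- Support Vector Regression balances flexibility and generalization: it can capture complex patterns while its margin and regularization settings keep model complexity in check. Here it ranked between the other two models on all three metrics.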
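To make the comparison concrete, the reported metrics can be turned into root mean squared error (RMSE), which is expressed in the same units as the indicator, and into a relative MSE reduction over linear regression. The sketch below uses only the figures from the Results section. Taking linear regression as the baseline is an assumption made for illustration.

```python
import math

# Metrics as reported in the Results section.
reported = {
    "Linear Regression": {"mae": 3.45, "mse": 18.78, "r2": 0.62},
    "Random Forest": {"mae": 2.87, "mse": 13.21, "r2": 0.75},
    "SVR": {"mae": 3.02, "mse": 15.34, "r2": 0.68},
}

# RMSE is the square root of MSE, in the indicator's own units.
rmse = {name: math.sqrt(m["mse"]) for name, m in reported.items()}

# Model with the lowest MSE.
best = min(reported, key=lambda name: reported[name]["mse"])

# Relative MSE reduction versus the linear regression baseline.
baseline_mse = reported["Linear Regression"]["mse"]
mse_reduction = {name: 1 - m["mse"] / baseline_mse for name, m in reported.items()}
```

On these figures, Random Forest reduces MSE by roughly 30% relative to linear regression, and SVR by roughly 18%.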
Most Important Factors Affecting the Indicator
- GDP per capita: Higher GDP per capita often goes hand in hand with greater investment in education, which is associated with higher enrollment ratios.
- Government expenditure on education: Direct investment in education systems enhances access and quality, boosting enrollment.
- Secondary education completion rate: A higher rate of students completing secondary education increases the pool of candidates eligible for tertiary education.
- Literacy rates: Higher literacy rates at the lower education levels translate to better preparedness for tertiary education.
- Population size: Larger populations may present challenges in scaling education infrastructure and services proportionally.
Implications
Understanding these factors can guide policymakers in targeting interventions and investments to improve tertiary education enrollment. Effective policies could include increasing education funding, supporting secondary education completion, and addressing economic disparities to boost GDP per capita.
Conclusion
This study suggests that machine learning techniques can predict an SDG indicator from macroeconomic and policy-related country data with moderate accuracy. The Random Forest model outperformed the others in predicting the gross enrollment ratio in tertiary education (R² of 0.75, compared with 0.68 for SVR and 0.62 for Linear Regression). The most critical factors influencing this indicator include GDP per capita, government expenditure on education, secondary education completion rates, and literacy rates. These findings can help inform policy decisions to support the achievement of SDG 4 - Quality Education.
References
- PORDATA - Database for European statistics
- United Nations Sustainable Development Goals
- Scikit-Learn: Machine Learning in Python
- Python Documentation