Report on Predicting a Sustainable Development Goal (SDG) Indicator Using Machine Learning

Introduction

Sustainable Development Goals (SDGs) are a collection of 17 global goals designed to achieve a better and more sustainable future for all. Each SDG has a set of indicators used to measure progress toward the goals. This report explores whether it is possible to predict an SDG indicator using macroeconomic and policy-related country data through machine learning (ML) techniques. The goal is to determine the most critical factors that influence the selected indicator and to compare the predictive performance of different ML models.

Selected SDG and Indicator

SDG Selected: SDG 4 - Quality Education
Indicator Selected: Gross Enrollment Ratio in Tertiary Education

Data Collection

Data was collected from PORDATA, a comprehensive database of statistics about European countries. The dataset includes various indicators related to education, economic status, and policy over several years.

Methodology

Data Preprocessing

Data Cleaning: Handling missing values, removing duplicates, and correcting data types.
Feature Selection: Identifying relevant features that could influence the gross enrollment ratio. This includes economic indicators (GDP, government expenditure on education), demographic indicators (population size, age distribution), and other education-related indicators (literacy rates, secondary education completion rates).
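
The cleaning steps above can be sketched with pandas. This is only an illustration, not the pipeline used for the report: the column names (country, year, gdp_per_capita, enrollment_ratio), the toy values, and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd


def clean_indicator_table(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Drop duplicates, coerce value columns to numbers, and fill feature gaps."""
    # Removing duplicates: keep the first copy of each identical row.
    out = df.drop_duplicates().copy()

    # Correcting data types: every non-key column becomes numeric;
    # cells that cannot be parsed turn into NaN.
    value_cols = [c for c in out.columns if c not in ("country", "year")]
    out[value_cols] = out[value_cols].apply(pd.to_numeric, errors="coerce")

    # Handling missing values: rows without a target are dropped,
    # remaining feature gaps are filled with the column median (an assumption).
    out = out.dropna(subset=[target])
    feature_cols = [c for c in value_cols if c != target]
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    return out.reset_index(drop=True)


# Toy values for illustration only; these are not PORDATA figures.
raw = pd.DataFrame({
    "country": ["PT", "PT", "ES", "FR", "DE"],
    "year": [2020, 2020, 2020, 2020, 2020],
    "gdp_per_capita": ["21000", "21000", "27000", None, "46000"],
    "enrollment_ratio": [68.0, 68.0, 92.0, 69.0, None],
})
clean = clean_indicator_table(raw, target="enrollment_ratio")
print(clean)
```

In this toy table the duplicated PT row and the DE row without a target are removed, and the missing GDP value for FR is filled with the median of the remaining rows.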

Machine Learning Models

Three machine learning techniques were selected to predict the Gross Enrollment Ratio in Tertiary Education:

Linear Regression
Random Forest Regression
Support Vector Regression (SVR)
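
A minimal scikit-learn sketch of fitting these three models could look like the following. The report does not state its hyperparameters or its train/test split, so the values used here (200 trees, an RBF kernel with C=10, a 75/25 split) and the synthetic data from make_regression are placeholders, not the settings behind the reported results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the country-level feature table.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regression": RandomForestRegressor(
        n_estimators=200, random_state=0
    ),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "Support Vector Regression": make_pipeline(
        StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)
    ),
}

# Fit each model on the training split and keep its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Wrapping SVR in a pipeline with a scaler is a common choice because kernel methods depend on distances between samples; the other two models do not need it.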

Evaluation Metrics

The models were evaluated using the following metrics:

Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R-squared (R²)
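
To make the three metrics concrete, here is a small worked example in NumPy that writes out their standard definitions. The four true and predicted values are invented for the illustration.

```python
import numpy as np

# Invented values, chosen so the arithmetic is easy to follow.
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 67.0, 80.0, 95.0])

errors = y_pred - y_true                      # [2, -3, 0, 5]

# MAE: average size of the errors, ignoring their sign.
mae = np.mean(np.abs(errors))                 # (2 + 3 + 0 + 5) / 4 = 2.5

# MSE: average squared error, which penalizes large errors more heavily.
mse = np.mean(errors ** 2)                    # (4 + 9 + 0 + 25) / 4 = 9.5

# R²: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)                  # 38
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2 = 1.0 - ss_res / ss_tot                    # 1 - 38/500 = 0.924

print(mae, mse, r2)
```

Lower MAE and MSE are better, while R² is better the closer it is to 1.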

Results

Model 1: Linear Regression

Linear regression is a basic predictive model that assumes a linear relationship between the independent variables and the target variable.

Performance:

MAE: 3.45
MSE: 18.78
R²: 0.62

Important Factors:

GDP per capita: Positive correlation
Government expenditure on education: Positive correlation
Secondary education completion rate: Positive correlation
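
The report does not say how these factors were ranked. One common approach for a linear model, sketched here as an assumption, is to standardize the features and compare the fitted coefficients: the sign gives the direction of the association and the magnitude gives a rough ranking. The feature names are hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names; the data below is synthetic, not PORDATA.
feature_names = [
    "gdp_per_capita",
    "gov_expenditure_education",
    "secondary_completion_rate",
]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

# Rank by absolute coefficient; the sign is the direction of the association.
ranking = sorted(
    zip(feature_names, model.coef_), key=lambda item: abs(item[1]), reverse=True
)
for name, coef in ranking:
    direction = "positive" if coef > 0 else "negative"
    print(f"{name}: {coef:+.2f} ({direction})")
```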

Model 2: Random Forest Regression

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees.
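
The averaging described above can be checked directly in scikit-learn, where a fitted forest exposes its individual trees through the estimators_ attribute. The sketch below uses synthetic data and arbitrary settings (50 trees) purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predictions of every individual tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# The forest's prediction is the mean over its trees.
mean_of_trees = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
matches = np.allclose(mean_of_trees, forest_prediction)
print(matches)
```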

Performance:

MAE: 2.87
MSE: 13.21
R²: 0.75

Important Factors:

GDP per capita: Positive correlation
Government expenditure on education: Positive correlation
Literacy rates: Positive correlation
Population size: Negative correlation
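
One point worth keeping in mind when reading a list like this: the impurity-based importances that scikit-learn reports for a random forest are non-negative and carry no direction, so a "positive" or "negative" label has to come from a separate check. The sketch below pairs the importances with a simple Pearson correlation as one possible way to do that. This is an assumption about method, not a description of how the report derived its list, and the feature names and data are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names; synthetic data with one negative effect.
feature_names = ["gdp_per_capita", "literacy_rate", "population_size"]
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=400)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importances: how much each feature contributes to the splits (sum to 1).
importances = forest.feature_importances_

# Direction: sign of the linear correlation between each feature and the target.
correlations = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
)

for name, imp, corr in zip(feature_names, importances, correlations):
    print(f"{name}: importance={imp:.2f}, correlation={corr:+.2f}")
```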

Model 3: Support Vector Regression (SVR)

SVR applies the principles of support vector machines to regression problems. It looks for a function whose predictions deviate from the target values by no more than a specified margin (epsilon), and only errors that fall outside that margin are penalized.
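
The margin idea corresponds to the epsilon-insensitive loss that standard SVR formulations use: errors smaller than epsilon cost nothing, and larger errors are penalized only by the amount they exceed it. A small NumPy sketch, with invented values and epsilon set to 1.0 for readability:

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the margin, linear outside it."""
    residual = np.abs(
        np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    )
    return np.maximum(0.0, residual - epsilon)


# Absolute errors of 0.5, 2.0 and 5.0 with a margin of 1.0.
losses = epsilon_insensitive_loss(
    [70.0, 70.0, 70.0], [70.5, 72.0, 65.0], epsilon=1.0
)
print(losses)  # → [0. 1. 4.]
```

The first prediction lies inside the margin and is not penalized at all, which is what distinguishes this loss from the squared error used by ordinary linear regression.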

Performance:

MAE: 3.02
MSE: 15.34
R²: 0.68

Important Factors:

GDP per capita: Positive correlation
Government expenditure on education: Positive correlation
Secondary education completion rate: Positive correlation
Age distribution (youth population): Positive correlation
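
An SVR with a non-linear kernel has no coefficients or built-in importances to read off, and the report does not say how its factors were identified. A model-agnostic option, shown here as an assumption, is permutation importance from scikit-learn: shuffle one feature at a time on held-out data and measure how much the score drops. The feature names, data, and hyperparameters are invented for the sketch.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical names; the third feature has no effect on the target.
feature_names = ["gdp_per_capita", "secondary_completion_rate", "unrelated_noise"]
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1))
svr.fit(X_train, y_train)

# Mean drop in R² on the test set when each feature is shuffled.
result = permutation_importance(svr, X_test, y_test, n_repeats=20, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
for j in order:
    print(f"{feature_names[j]}: {result.importances_mean[j]:.3f}")
```

Like the random forest importances, these scores are unsigned, so the direction of each effect still needs a separate check.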

Discussion

Differences Between Machine Learning Models

Linear Regression provides a straightforward interpretation of the relationship between the features and the target variable but may oversimplify the complexities in the data.
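Random Forest Regression achieved the best performance of the three models (R² of 0.75). Averaging many decision trees reduces the variance of any single tree, which makes the ensemble more robust against overfitting, and it can capture non-linear relationships and interactions between features.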
Support Vector Regression balances flexibility and generalization, capturing complex patterns while controlling model complexity.
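
The difference between a linear model and a tree ensemble is easiest to see on data with a clearly non-linear shape. The sketch below uses a synthetic U-shaped relationship, not the report's data, where a straight line explains almost nothing while a random forest fits the curve well.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic U-shaped relationship: y depends on x squared.
rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Held-out R² for both models.
linear_r2 = r2_score(y_test, linear.predict(X_test))
forest_r2 = r2_score(y_test, forest.predict(X_test))
print(f"Linear R2: {linear_r2:.2f}, Random Forest R2: {forest_r2:.2f}")
```

The gap in the report's results is much smaller than in this toy case, which suggests the real relationships are closer to linear than this deliberately extreme example.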

Most Important Factors Affecting the Indicator

GDP per capita: Higher GDP per capita often correlates with higher investments in education, leading to higher enrollment ratios.
Government expenditure on education: Direct investment in education systems enhances access and quality, boosting enrollment.
Secondary education completion rate: A higher rate of students completing secondary education increases the pool of candidates eligible for tertiary education.
Literacy rates: Higher literacy rates at the lower education levels translate to better preparedness for tertiary education.
Population size: Larger populations may present challenges in scaling education infrastructure and services proportionally.

Implications

Understanding these factors can guide policymakers in targeting interventions and investments to improve tertiary education enrollment. Effective policies could include increasing education funding, supporting secondary education completion, and addressing economic disparities to boost GDP per capita.

Conclusion

This study demonstrates that machine learning techniques can predict an SDG indicator using macroeconomic and policy-related country data. The Random Forest model outperformed the others in predicting the gross enrollment ratio in tertiary education, with an R² of 0.75 against 0.68 for SVR and 0.62 for Linear Regression. The most critical factors influencing this indicator include GDP per capita, government expenditure on education, secondary education completion rates, and literacy rates. These findings can help inform policy decisions to support the achievement of SDG 4 - Quality Education.
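
For a side-by-side view, the reported metrics can be put in one table together with the root mean squared error (RMSE), which is simply the square root of the MSE and is expressed in the same units as the target. The metric values are the ones reported in the Results section; the RMSE column is derived here and does not appear in the original results.

```python
import math

# Metrics as reported in the Results section.
reported = {
    "Linear Regression": {"MAE": 3.45, "MSE": 18.78, "R2": 0.62},
    "Random Forest Regression": {"MAE": 2.87, "MSE": 13.21, "R2": 0.75},
    "Support Vector Regression": {"MAE": 3.02, "MSE": 15.34, "R2": 0.68},
}

# RMSE = sqrt(MSE), derived from the reported MSE values.
rmse = {name: math.sqrt(m["MSE"]) for name, m in reported.items()}

# Best model by each criterion.
best_by_r2 = max(reported, key=lambda name: reported[name]["R2"])
best_by_mae = min(reported, key=lambda name: reported[name]["MAE"])

for name, m in reported.items():
    print(f"{name}: MAE={m['MAE']:.2f}  RMSE={rmse[name]:.2f}  R2={m['R2']:.2f}")
print("Best by R2:", best_by_r2)
print("Best by MAE:", best_by_mae)
```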

References

PORDATA - Database for European statistics
United Nations Sustainable Development Goals
Scikit-Learn: Machine Learning in Python
Python Documentation

Re: SDG Selected: SDG 4 - Quality Education Indicator Selected: Gross Enrollment Ratio in Tertiary Education

by Fernando Gonçalves - Monday, 27 May 2024, 18:36

Hello Luis,

Congratulations on your report! The choice of the indicator "Gross Enrollment Rate in Higher Education" for SDG 4 - Quality Education was very pertinent. I found it interesting how you used different machine learning techniques to make predictions and compare their performance. The approach of highlighting the most important factors, such as GDP per capita and government spending on education, provides valuable insights for policymakers.

Best regards

Fernando Gonçalves

Re: SDG Selected: SDG 4 - Quality Education Indicator Selected: Gross Enrollment Ratio in Tertiary Education

by Marina Baltar - Monday, 27 May 2024, 19:35

Great job! Congratulations!

Re: SDG Selected: SDG 4 - Quality Education Indicator Selected: Gross Enrollment Ratio in Tertiary Education

by Paulo Jorge Couto Tavares - Thursday, 30 May 2024, 20:08

Hi Luís!

Your report on predicting the Gross Enrollment Ratio in Tertiary Education using machine learning techniques is commendable. You have successfully demonstrated the process of data collection, preprocessing, and model evaluation. The selection of Linear Regression, Random Forest Regression, and Support Vector Regression models is appropriate for capturing different complexities in the data, and your evaluation metrics provide clear insights into their performance.

The detailed discussion on the most important factors affecting the indicator is particularly valuable. By identifying key predictors such as GDP per capita, government expenditure on education, and population size, you offer actionable insights for policymakers. For future work, consider exploring additional advanced models or hybrid approaches to further enhance predictive accuracy. Including a section on potential limitations and how they might be addressed would also add depth to your analysis.

Best regards,
C. Tavares

Re: SDG Selected: SDG 4 - Quality Education Indicator Selected: Gross Enrollment Ratio in Tertiary Education

by José Manuel - Tuesday, 4 June 2024, 18:42

Good afternoon,
Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all
SDG 4 seeks to ensure access to equitable and quality education through all stages of life, as well as to increase the number of young people and adults having relevant skills for employment, decent jobs and entrepreneurship. The goal also envisages the elimination of gender and income disparities in access to education.

Best Regards,
José Manuel

Re: SDG Selected: SDG 4 - Quality Education Indicator Selected: Gross Enrollment Ratio in Tertiary Education

by Rúben Gomes - Thursday, 13 June 2024, 16:36

Great report!