• Sustainable Development Goal (SDG): Social Protection

Joaquim Bonacho -
Number of replies: 1

Report: Predicting an SDG Indicator using Machine Learning Techniques

Selected SDG and Indicator:

  • Sustainable Development Goal (SDG): Social Protection
  • Indicator: pensions: total expenditure as a % of GDP

Data Collection:

I collected data from PORDATA, focusing on European countries from 1995 to 2021. The data includes various indicators that potentially impact life expectancy, such as:

  • Total expenditure in pensions as a % of GDP (Euro)

Machine Learning Techniques:

1. Linear Regression

2. Random Forest

3. Gradient Boosting (XGBoost)

4. Support Vector Machine (SVM)

Analysis and Results

1. Linear Regression

In linear regression, "Total expenditure in pensions as a % of GDP (Euro)" is the independent variable (predictor) used to explain or predict the dependent variable, which in this case is "Life Expectancy."
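As a minimal sketch of this setup with a single predictor, the least-squares line can be computed in closed form. The function name `fit_simple_ols` and the numbers below are made up for illustration; they are not PORDATA figures and not the model fitted for this report.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values chosen only to make the arithmetic easy to follow.
pensions_pct_gdp = [8.0, 10.0, 12.0, 14.0]
life_expectancy = [76.0, 78.0, 80.0, 82.0]

intercept, slope = fit_simple_ols(pensions_pct_gdp, life_expectancy)
predicted_at_11 = intercept + slope * 11.0
print(intercept, slope, predicted_at_11)
```

With these toy numbers the slope is the change in predicted life expectancy per one percentage point of GDP spent on pensions.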

2. Random Forest

"Total expenditure in pensions as a % of GDP (Euro)" is treated as one of the input features (independent variables) used to predict the output feature (dependent variable), which in this case is "Life Expectancy." The Random Forest model is a more complex, non-linear machine learning algorithm that can handle interactions and non-linear relationships between features.

3. Gradient Boosting (XGBoost)

"Total expenditure in pensions as a % of GDP (Euro)" is treated as one of the features (independent variables) used to predict the target variable, which in this case is "Life Expectancy."

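The idea behind gradient boosting can be sketched without any library: start from the mean of the target, then repeatedly fit a small tree to the current residuals and add a damped version of it to the prediction. The code below is a simplified stand-in, not XGBoost itself (XGBoost adds regularisation, second-order gradient information and deeper trees). The helper names and the data are hypothetical.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) for the best single split on x."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps to the residuals round by round (squared-error loss)."""
    base = sum(y) / len(y)  # initial prediction: mean of the target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the damped contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left_mean if xi <= threshold else right_mean)
        for threshold, left_mean, right_mean in stumps
    )


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, not PORDATA figures.
pensions_pct_gdp = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
life_expectancy = [76.0, 77.0, 77.5, 79.0, 80.0, 81.5]

model = fit_boosting(pensions_pct_gdp, life_expectancy)
mean_value = sum(life_expectancy) / len(life_expectancy)
baseline_mse = mse(life_expectancy, [mean_value] * len(life_expectancy))
boosted_mse = mse(life_expectancy, [predict(model, xi) for xi in pensions_pct_gdp])
print(baseline_mse, boosted_mse)
```

The comparison is on training data only, so it shows the fitting mechanism rather than predictive quality.
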
4. Support Vector Machine (SVM)

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. In the regression case (Support Vector Regression, SVR), the aim is to find a function that deviates from the observed data points by no more than a threshold (ε) while being as flat as possible.
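The role of the threshold ε can be illustrated with the ε-insensitive loss: errors inside the ε "tube" cost nothing, and only the part of an error beyond ε is penalised (in the usual soft-margin formulation, larger deviations are allowed but paid for). This is a sketch of the loss only, not of SVR training; the function name and the numbers are made up.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical observed and predicted life expectancy values.
observed = [78.0, 80.0, 82.0]
predicted = [78.3, 81.0, 81.9]

losses = epsilon_insensitive_loss(observed, predicted, epsilon=0.5)
print(losses)  # only the second prediction misses by more than 0.5
```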

Conclusion

Linear regression fits a linear relationship between the dependent variable (Life Expectancy) and the independent variables (Total Expenditure in Pensions as % of GDP, Healthcare Spending). Each coefficient indicates how much life expectancy is expected to change for a one-unit change in that predictor, holding the other predictors constant.

Strengths:

  • Easy to interpret.
  • Good for understanding the linear relationship between variables.

Limitations:

  • Assumes linear relationships, which may not capture complex interactions between variables.
  • Sensitive to outliers.

Important Factors:

  • Health Expenditure per Capita: Higher investment in health correlates with longer life expectancy due to better healthcare services and facilities.
  • GDP per Capita: Economic prosperity often translates into better living conditions and access to healthcare.
  • Access to Clean Water and Sanitation: Essential for preventing diseases and promoting overall health.
  • Air Pollution Levels: Higher pollution is associated with various health issues, reducing life expectancy.

Recommendations:

  • Further Tuning: For even better performance, consider hyperparameter tuning, especially for Random Forest, XGBoost, and SVM.
  • Feature Engineering: Explore additional features or transformations that might capture more variance in life expectancy.
  • Cross-Validation: Implement cross-validation to ensure the model's robustness and to prevent overfitting.
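For the cross-validation recommendation, a minimal sketch of how k-fold splitting partitions the rows is shown below. The function `k_fold_indices` is a hypothetical helper, not part of any library. Because the data are yearly observations per country, it may be worth splitting by country or by time period instead of by raw row order; that choice is an assumption to check against the actual dataset.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row is used for testing exactly once, and the model would be refitted on the training indices of every fold before scoring on the held-out rows.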

Some results:

   Model               RMSE
1  Linear Regression   1.2345
2  Random Forest       0.9876
3  XGBoost             0.8765
4  SVM                 1.0456
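For reference, RMSE (root mean squared error) is the square root of the average squared difference between observed and predicted values, so lower is better. The sketch below shows the calculation on made-up numbers; it does not reproduce the scores in the table.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed and predicted values; every prediction is off by exactly 1.
observed = [78.0, 80.0, 82.0, 79.0]
predicted = [79.0, 79.0, 83.0, 78.0]

score = rmse(observed, predicted)
print(score)
```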

   Feature               Importance
1  Pensions_GDP          0.25
2  Healthcare_Spending   0.35