Artificial Intelligence: Report on predicting SDG 16- Peace, Justice and strong institutions, Indicator Corruption Perception Index.

Report on Predicting a Sustainable Development Goal (SDG) Indicator Using Machine Learning

Introduction

Sustainable Development Goals (SDGs) are a collection of 17 global goals designed to achieve a better and more sustainable future for all. Each SDG has a set of indicators used to measure progress toward the goals. This report explores whether it is possible to predict an SDG indicator using macroeconomic and policy-related country data through machine learning (ML) techniques. The goal is to determine the most critical factors that influence the selected indicator and to compare the predictive performance of different ML models.

Selected SDG and Indicator

SDG Selected: SDG 16 - Peace, Justice, and Strong Institutions
Indicator Selected: Corruption Perception Index (CPI)

Data Collection

Data was collected from PORDATA, a database of statistics about European countries. The dataset includes CPI scores and various indicators related to economic status, governance, and policy over several years for Portugal, Spain, and the EU27.

Methodology

Data Processing

Data Cleaning: Removing any missing or inconsistent values to ensure that the data is accurate and reliable for analysis.
Feature Selection: Selecting the most relevant features or indicators from the data to improve the performance of the machine learning models.
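The cleaning step can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report, and the function name clean_rows, the row layout, and the two damaged rows are assumptions made up for the example. It drops any row that has a missing score or a score outside the CPI's 0 to 100 scale.

```python
def clean_rows(rows, low=0.0, high=100.0):
    """Keep only rows whose scores are all present and within [low, high]."""
    cleaned = []
    for row in rows:
        scores = [value for key, value in row.items() if key != "year"]
        if any(value is None for value in scores):
            continue  # missing value
        if any(not (low <= value <= high) for value in scores):
            continue  # inconsistent value (outside the CPI scale)
        cleaned.append(row)
    return cleaned


rows = [
    {"year": 2013, "eu27": 63.0, "spain": 59.0, "portugal": 62.0},
    {"year": 2014, "eu27": 64.0, "spain": 60.0, "portugal": 63.0},
    # Hypothetical damaged rows, altered on purpose to show the filter:
    {"year": 2015, "eu27": None, "spain": 58.0, "portugal": 64.0},
    {"year": 2016, "eu27": 64.0, "spain": 580.0, "portugal": 62.0},
]

cleaned = clean_rows(rows)
print([row["year"] for row in cleaned])
```

Dropping whole rows is the simplest option; with only eleven yearly observations, imputing a missing value instead may be worth considering.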

The collected data are shown in the following table:

Corruption Perception Index

Years	EU27 (2020)	Spain	Portugal
2013	63.0	59.0	62.0
2014	64.0	60.0	63.0
2015	65.0	58.0	64.0
2016	64.0	58.0	62.0
2017	64.0	57.0	63.0
2018	64.0	58.0	64.0
2019	64.0	62.0	62.0
2020	64.0	62.0	61.0
2021	64.0	61.0	62.0
2022	64.0	60.0	62.0
2023	64.0	60.0	61.0

Data Sources: Transparency International - Corruption Perception Index
Source: PORDATA
Last updated: 2024-03-13

Machine Learning Models

Linear Regression;
Random Forest Regression;
Support Vector Regression (SVR);

These models were implemented in R using the “randomForest”, “e1071”, and “Metrics” packages.

Evaluation Metrics

Mean Absolute Error (MAE);
Mean Squared Error (MSE);
R-squared (R²);
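For reference, the three metrics can be written out directly. The sketch below is plain Python, not the R "Metrics" calls used in the report. The actual values are the Portugal CPI scores for 2013 to 2017 from the table, while the predicted values are invented purely to have something to score.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [62.0, 63.0, 64.0, 62.0, 63.0]      # Portugal CPI, 2013-2017
predicted = [62.5, 63.0, 63.0, 62.5, 63.0]   # made-up predictions

print(mae(actual, predicted), mse(actual, predicted), r_squared(actual, predicted))
```

Lower MAE and MSE are better, and an R² closer to 1 means the model explains more of the variance in the target.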

After loading the collected data into the R console and fitting the linear regression model, I obtained the following values:

MAE: 0.236819;

MSE: 0.1037094;

R²: 0.4295985.

For Random Forest Regression:

MAE: 0.1581688;

MSE: 0.07334626;

R²: 0.5965956.

For Support Vector Regression (SVR):

MAE: 0.1733797;

MSE: 0.1141697;

R²: 0.3720665.

Reasons for choosing these techniques:

Linear Regression: Linear regression attempts to model the relationship between a dependent variable (in this case, the CPI) and one or more independent variables (e.g., Year, Spain CPI, Portugal CPI) by fitting a linear equation to the observed data.

Used because:

Simplicity: It's straightforward and easy to interpret.
Baseline Model: Acts as a good baseline for comparing with more complex models.
Interpretability: Coefficients provide insight into the relationship between each predictor and the target variable.

Efficiency and Accuracy:

Efficiency: Computationally inexpensive and fast to train and predict.
Accuracy: May not capture complex relationships well, leading to lower performance in terms of metrics like R² compared to non-linear models.
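To make the idea of fitting a linear equation concrete, here is a minimal ordinary least squares sketch in plain Python. It is not the R model from the report: the choice of Year as the only predictor and Spain's CPI as the target is an assumption made for illustration, and fit_simple_ols is a hypothetical helper name.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


years = list(range(2013, 2024))
spain_cpi = [59.0, 60.0, 58.0, 58.0, 57.0, 58.0, 62.0, 62.0, 61.0, 60.0, 60.0]

intercept, slope = fit_simple_ols(years, spain_cpi)
fitted = [intercept + slope * year for year in years]
print(f"CPI change per year (slope): {slope:.4f}")
```

The slope is the interpretable part: it is the average change in the score per year implied by the fitted line.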

Random Forest Regression: Random Forest regression uses an ensemble of decision trees to improve predictive performance. It builds multiple trees and merges them to get a more accurate and stable prediction.

Used because:

Non-Linearity: Can capture complex non-linear relationships between predictors and the target.
Robustness: Less prone to overfitting compared to single decision trees due to averaging multiple trees.
Feature Importance: Provides a measure of the importance of each feature.

Efficiency and Accuracy:

Efficiency: More computationally intensive than linear regression, but parallelizable and efficient with large datasets.
Accuracy: Typically more accurate than linear regression, especially for complex datasets.
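The "build many trees and average them" idea can be sketched with a deliberately simplified stand-in: bagged one-split regression trees (stumps) on a single feature. This is not the randomForest package and not a full random forest, which grows deeper trees and also samples features at each split. The helper names, the Year-to-Spain-CPI setup, the number of trees, and the seed are all assumptions for illustration.

```python
import random


def fit_stump(x, y):
    """Fit a depth-1 regression tree: choose the split with the lowest squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: predict the mean
        mean_y = sum(y) / len(y)
        return (float("inf"), mean_y, mean_y)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_bagged_stumps(x, y, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, xi):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, xi) for s in stumps) / len(stumps)


years = list(range(2013, 2024))
spain_cpi = [59.0, 60.0, 58.0, 58.0, 57.0, 58.0, 62.0, 62.0, 61.0, 60.0, 60.0]

stumps = fit_bagged_stumps(years, spain_cpi)
print(round(predict_bagged(stumps, 2020), 2))
```

Averaging over bootstrap samples is what makes the ensemble more stable than any single tree.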

Support Vector Regression (SVR): SVR uses the principles of Support Vector Machines for regression tasks. It finds a function that deviates from the actual values by a value no greater than a specified margin.

Used because:

Flexibility: Effective in high-dimensional spaces and when the number of dimensions is greater than the number of samples.
Complex Relationships: Good for capturing complex, non-linear relationships.
Regularization: Controls model complexity and prevents overfitting.

Efficiency and Accuracy:

Efficiency: Computationally intensive, especially with large datasets and complex kernels.
Accuracy: Can be very accurate for certain types of data, but requires careful tuning of hyperparameters.
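The "specified margin" in SVR corresponds to the epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are penalized only by the amount they exceed it. The sketch below shows just that loss in plain Python, not a full SVR fit and not the e1071 implementation. The epsilon value and the predictions are made up for illustration.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.5):
    """Per-point loss: max(0, |actual - predicted| - epsilon)."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


actual = [62.0, 63.0, 64.0]      # Portugal CPI, 2013-2015
predicted = [62.3, 63.0, 62.5]   # made-up predictions

losses = epsilon_insensitive_loss(actual, predicted, epsilon=0.5)
print(losses)
```

Only the third point, which misses by 1.5, falls outside the margin and contributes to the loss.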

Comparison of Models

Efficiency:

Linear Regression: Most efficient, fastest to train and predict.
Random Forest: Moderately efficient, more computationally expensive but handles large datasets well.
SVR: Least efficient, especially with large datasets due to computational complexity.

Accuracy:

Linear Regression: Least accurate for non-linear relationships, but interpretable.
Random Forest: More accurate for complex data, robust, and provides feature importance.
SVR: Can be very accurate but requires tuning, best for high-dimensional and complex data.

Summarizing the comparison:

Linear Regression is simple and interpretable but may not capture complex relationships.
Random Forest Regression balances accuracy and interpretability, capturing non-linear relationships effectively and providing robust predictions.
Support Vector Regression is powerful for complex, high-dimensional data but requires careful tuning and is computationally intensive.

By comparing these models, we can choose the one that best balances accuracy and efficiency for our specific use case. Typically, Random Forest may provide a good balance for many datasets, while SVR could be the best choice for very complex relationships if computational resources allow.

Conclusion

Based on the research involving the prediction of the Corruption Perception Index (CPI) using various machine learning techniques, I can draw several conclusions and provide recommendations on how to predict the index and improve it.

Predicting the Corruption Perception Index:

Linear Regression: Provides a basic understanding of the linear relationships between CPI and predictor variables like GDP per capita, government effectiveness, and other socio-economic factors. However, it may not capture complex, non-linear interactions well.
Random Forest Regression: Offers a robust method for predicting CPI, capturing non-linear relationships and interactions between multiple features. This model provided the best performance among the tested models, indicated by the lowest Mean Absolute Error (MAE), Mean Squared Error (MSE), and highest R-squared (R²).
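Support Vector Regression (SVR): Effective for high-dimensional and non-linear data but requires careful tuning of hyperparameters. In this context its MAE (0.173) was lower than that of linear regression (0.237), but its MSE (0.114) was higher and its R² (0.372) lower than the linear regression values (0.104 and 0.430), and it did not match the random forest on any of the three metrics.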

Recommendations for Predicting the Index

Data Collection and Feature Selection:

Continuously collect relevant data from reliable sources such as PORDATA, Transparency International, and other governmental and non-governmental organizations.
Include a diverse set of features that may influence CPI, such as economic indicators (GDP per capita, unemployment rate), governance indicators (government effectiveness, political stability, rule of law), and socio-political indicators (education levels, public health).

Model Selection:

Primary Model: Use Random Forest Regression as the primary model for predicting CPI due to its ability to handle non-linear relationships and its robustness to overfitting.
Supplementary Models: Utilize linear regression for quick, interpretable insights and SVR for complex data scenarios where non-linearity is pronounced and computational resources are available.

Model Evaluation and Tuning:

Regularly evaluate the model’s performance using cross-validation techniques to ensure robustness and generalizability.
Fine-tune hyperparameters, especially for complex models like SVR, to optimize performance.
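With only eleven yearly observations per country, leave-one-out cross-validation is one reasonable way to apply this recommendation: each year is held out once, the model is fitted on the remaining years, and the held-out year is predicted. The sketch below is plain Python and uses a one-predictor linear fit (Year to Spain CPI) as a stand-in model; that setup and the helper names are assumptions, not the report's R workflow.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_errors(x, y):
    """Absolute error on each point when it is held out of training."""
    errors = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        errors.append(abs(y[i] - (intercept + slope * x[i])))
    return errors


years = list(range(2013, 2024))
spain_cpi = [59.0, 60.0, 58.0, 58.0, 57.0, 58.0, 62.0, 62.0, 61.0, 60.0, 60.0]

loo_errors = leave_one_out_errors(years, spain_cpi)
loo_mae = sum(loo_errors) / len(loo_errors)
print(f"Leave-one-out MAE: {loo_mae:.3f}")
```

Because every prediction is made for a year the model has not seen, this figure is a fairer estimate of generalization than an error computed on the training data.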

References

PORDATA - Database for European statistics
United Nations - Sustainable Development Goals (SDGs)

Summary of the research:

The research indicates that machine learning techniques, particularly Random Forest Regression, can predict the Corruption Perception Index with moderate accuracy (an R² of about 0.60 for the best model). By leveraging diverse data and robust models, policymakers can gain insights into the factors affecting CPI and take targeted actions to improve governance and reduce corruption. Continuous improvement in data collection, model evaluation, and policy implementation will be crucial in achieving sustainable progress in reducing corruption and enhancing institutional integrity.

Re: Report on predicting SDG 16- Peace, Justice and strong institutions, Indicator Corruption Perception Index.

Luís Loureiro - Thursday, 13 June 2024, 15:28

Well done, nice work

Re: Report on predicting SDG 16- Peace, Justice and strong institutions, Indicator Corruption Perception Index.

Rúben Gomes - Thursday, 13 June 2024, 17:15

Great report!