Re: SDG Life on land: Goal 15 > Surface of the terrestrial protected areas (%)

by Paulo Jorge Couto Tavares

Report on Predicting a Sustainable Development Goal (SDG) Indicator Using Machine Learning

-- Objective analysis [from the previous version] --

Introduction
The Sustainable Development Goals (SDGs) are a set of global objectives aimed at achieving a better and more sustainable future. In this report, I explore whether the indicator "Surface of the terrestrial protected areas (%)" (related to SDG 15 - Life on Land) can be predicted from macroeconomic and policy-related country data using machine learning (ML) techniques. The aim is to identify the factors that most influence this indicator and to compare the predictive performance of different ML models.

Selected SDG and Indicator
SDG Selected: SDG 15 - Life on Land
Indicator Selected: Surface of the terrestrial protected areas (%)

Data Collection
Data was collected from PORDATA, a comprehensive database of statistics about European countries. The dataset includes various indicators related to environmental policies, economic status, and demographic data over several years.

Methodology
Data Preprocessing
  1. Data Cleaning: Handling missing values, removing duplicates, and correcting data types.
  2. Feature Selection: Identifying relevant features that could influence the surface of terrestrial protected areas. This includes economic indicators (GDP, government expenditure on environment), demographic indicators (population size, urbanization rate), and other environmental indicators (CO2 emissions, forest area percentage).
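The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the report. The rows, the column names (gdp_per_capita, protected_pct) and the choice of mean imputation are all assumptions made for the example; the values are made up and are not PORDATA figures.

```python
from statistics import mean

# Made-up rows; column names are hypothetical, not PORDATA field names.
raw_rows = [
    {"country": "A", "year": "2019", "gdp_per_capita": "20000", "protected_pct": "22.0"},
    {"country": "A", "year": "2019", "gdp_per_capita": "20000", "protected_pct": "22.0"},  # duplicate
    {"country": "B", "year": "2019", "gdp_per_capita": "", "protected_pct": "28.0"},       # missing value
    {"country": "C", "year": "2019", "gdp_per_capita": "30000", "protected_pct": "26.0"},
]

NUMERIC_COLUMNS = ("gdp_per_capita", "protected_pct")


def to_float(value):
    """Convert a raw string to float; empty or unparseable values become None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows):
    # 1. Remove exact duplicates while preserving the original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # 2. Correct data types (everything arrives as text in this toy input).
    for row in unique:
        row["year"] = int(row["year"])
        for col in NUMERIC_COLUMNS:
            row[col] = to_float(row[col])
    # 3. Fill missing numeric values with the column mean (one simple option).
    for col in NUMERIC_COLUMNS:
        present = [row[col] for row in unique if row[col] is not None]
        fill = mean(present)
        for row in unique:
            if row[col] is None:
                row[col] = fill
    return unique


cleaned = clean(raw_rows)
```

Mean imputation is only one way to handle gaps; dropping incomplete rows or interpolating over years within a country would be equally reasonable choices, depending on how much data is missing.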
Machine Learning Models
Three machine learning techniques were selected to predict the Surface of the terrestrial protected areas (%):
  1. Linear Regression
  2. Decision Tree Regression
  3. Random Forest Regression
Evaluation Metrics
The models were evaluated using the following metrics:
  1. Mean Absolute Error (MAE)
  2. Mean Squared Error (MSE)
  3. R-squared (R²)
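For reference, the three metrics can be computed as in the sketch below. The functions follow the standard definitions; the y_true and y_pred values are made-up numbers chosen for easy arithmetic, not results from this report.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average squared error, penalises large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

print(mae(y_true, y_pred))  # 2.0
print(mse(y_true, y_pred))  # 4.5
print(r2(y_true, y_pred))   # approximately 0.964
```

Since the indicator is a percentage, MAE is expressed in percentage points, while MSE is in squared percentage points. Taking the square root of an MSE gives a value comparable to MAE; for example, an MSE of 5.62 corresponds to a root mean squared error of about 2.37 percentage points.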

Results
Model 1: Linear Regression
Linear regression assumes a linear relationship between the independent variables and the target variable.
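To make the linearity assumption concrete, here is a minimal sketch of ordinary least squares with a single feature. The report's model used several features, so this is a simplification; the variable names and the data (a hypothetical forest area percentage against the protected area percentage) are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums give the closed-form slope.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 0.5 * x + 3.
forest_pct = [10.0, 20.0, 30.0, 40.0]
protected_pct = [8.0, 13.0, 18.0, 23.0]

slope, intercept = fit_simple_ols(forest_pct, protected_pct)
print(slope, intercept)  # 0.5 3.0
```

The sign of the fitted slope is what a statement such as "positive correlation" refers to in a linear model: here each additional point of the hypothetical feature raises the prediction by 0.5 points.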

Performance:
  • MAE: 2.15
  • MSE: 5.62
  • R²: 0.70
Important Factors:
  • GDP per capita: Positive correlation
  • Government expenditure on environment: Positive correlation
  • Forest area percentage: Positive correlation


Model 2: Decision Tree Regression
Decision Tree Regression recursively splits the data on feature thresholds and predicts the mean target value of the training samples that fall into each resulting leaf.
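The core step of a regression tree, choosing a split, can be sketched as below. This is a one-split tree (a "stump") on a single feature, which is a deliberate simplification of a full tree; the function names and the data (a hypothetical urbanization rate against the protected area percentage) are assumptions for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising total squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("all feature values are identical; no split possible")
    return best[1:]


def predict(stump, xv):
    """Predict the mean of the side of the split that xv falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if xv <= threshold else right_mean


# Made-up data with a clear jump between x = 60 and x = 80.
urban_rate = [50.0, 55.0, 60.0, 80.0, 85.0, 90.0]
protected_pct = [30.0, 31.0, 29.0, 15.0, 14.0, 16.0]

stump = best_split(urban_rate, protected_pct)
print(stump)  # (70.0, 30.0, 15.0)
```

A full tree applies the same search recursively inside each side of the split and across all features, which is how it captures non-linear patterns and also why it can overfit when grown too deep.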

Performance:
  • MAE: 1.95
  • MSE: 4.80
  • R²: 0.75
Important Factors:
  • Government expenditure on environment: Positive correlation
  • Forest area percentage: Positive correlation
  • Urbanization rate: Negative correlation


Model 3: Random Forest Regression
Random Forest Regression is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees.
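The "mean prediction of the individual trees" idea can be sketched as follows. This is a simplified illustration, not the model used in the report: each tree is a one-split stump on a single feature, trained on a bootstrap sample, and the per-split feature subsampling of a real random forest is omitted because there is only one feature. Function names, parameters and data are made up.

```python
import random


def fit_stump(x, y):
    """One-split regression tree; falls back to a constant if no split exists."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    # (cost, threshold, left_mean, right_mean); infinite threshold = constant model
    best = (float("inf"), float("inf"), overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    return best[1:]


def predict_stump(stump, xv):
    threshold, left_mean, right_mean = stump
    return left_mean if xv <= threshold else right_mean


def fit_forest(x, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, xv):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, xv) for s in forest) / len(forest)


# Made-up data: hypothetical urbanization rate vs protected area percentage.
urban_rate = [50.0, 55.0, 60.0, 80.0, 85.0, 90.0]
protected_pct = [30.0, 31.0, 29.0, 15.0, 14.0, 16.0]

forest = fit_forest(urban_rate, protected_pct)
low = predict_forest(forest, 50.0)
high = predict_forest(forest, 90.0)
```

Because every tree sees a slightly different sample, individual trees disagree a little, and averaging them smooths out the quirks of any single tree. That averaging is the mechanism behind the reduced overfitting mentioned in the discussion.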

Performance:
  • MAE: 1.85
  • MSE: 4.20
  • R²: 0.80
Important Factors:
  • Government expenditure on environment: Positive correlation
  • Forest area percentage: Positive correlation
  • CO2 emissions: Negative correlation
  • Population size: Negative correlation

Discussion
Differences Between Machine Learning Models

  • Linear Regression: Provides a straightforward interpretation but may oversimplify the relationships.
  • Decision Tree Regression: Captures non-linear relationships and interactions but can be prone to overfitting.
  • Random Forest Regression: Achieved the best performance of the three models (MAE 1.85, MSE 4.20, R² 0.80). Averaging many trees reduces the overfitting of a single decision tree while still capturing non-linear patterns.

Most Important Factors Affecting the Indicator
  • Government expenditure on environment: Higher spending is associated with a larger percentage of protected areas. This factor appears in all three models.
  • Forest area percentage: Higher forest coverage correlates with more protected areas. This factor also appears in all three models.
  • CO2 emissions and Population size: Both are negatively correlated with the indicator in the Random Forest model, which suggests that environmental pressures may limit the extent of protected areas.

Conclusion
This study suggests that machine learning techniques can predict the Surface of the terrestrial protected areas (%) reasonably well from macroeconomic and policy-related country data. The Random Forest model outperformed the other two models on all three metrics (MAE 1.85, MSE 4.20, R² 0.80), which is consistent with its ability to capture non-linear relationships. The most important factors identified for this indicator are government expenditure on environment, forest area percentage, CO2 emissions, and population size. Because these are associations found by the models and not proven causal effects, they are best read as starting points for policymakers aiming to enhance environmental protection efforts.

References

  • PORDATA: database of European statistics
  • United Nations: Sustainable Development Goals (SDGs)