Air Quality Data Processing with R¶

Air Quality Open Data Platform and World Air Quality Index project team provides a new dedicated dataset, updated 3 times a day, and covering about 380 major cities in the world, from January 2015 until now. The data for each major city is based on the average (median) of several air quality monitoring stations. The data set provides min, max, median and standard deviation for each of the air pollutant species (PM10, PM2.5, NO2, Ozone, SO2, CO) as well as meteorological data (Temperature, Humidity, Wind).

R code below downloads several .csv files, combines thems into a full dataset, selects median from provided statistics and narrows a set of countries to European Union members. From available information only meteorological data on Temperature and Humidity is selected and THI (Temperature Humidity Index) is calculated. Finally, daily THI time series for a selected city is summarized and visualized.

THI measures the level of discomfort the average person (or animal) is thought to experience as a result of the combined effects of the temperature and humidity of the air. The heat stress index THI is a simple combination of temperature and humidity (in decimal form) and can be calculated by the following formula:

$THI = 0.8 \times Temperature + Humidity \times (Temperature - 14.4) + 46.4$

Habeeb, A. A., Gad, A. E., & Atta, M. A. (2018). Temperature-humidity indices as indicators to heat stress of climatic conditions with relation to production and reproduction of farm animals. International Journal of Biotechnology and Recent Advances, 1(1), 35-50. https://doi.org/10.18689/ijbr-1000107

1. Prepare R environment¶

Assistant: These are descriptions of the required R libraries.

  • dplyr is an R library that provides a set of functions for data manipulation and transformation. It is part of the larger "tidyverse" ecosystem of R libraries and is widely used for tasks such as filtering, sorting, selecting, summarizing, and mutating data. dplyr's functions are designed to work with data frames or data tables, making data manipulation tasks more intuitive and efficient.
  • tidyr is an R library that is part of the tidyverse ecosystem, and it is primarily used for data tidying and reshaping. Data tidying is the process of organizing and structuring data to make it more suitable for analysis and visualization. tidyr provides functions to convert data between "wide" and "long" formats, separate and unite columns, and handle missing or messy data.
  • ggplot2 is a popular R library for creating high-quality and customizable data visualizations and graphs. It is based on the Grammar of Graphics, which is a framework for describing and building complex plots using simple components. ggplot2 allows users to create a wide range of graphical visualizations, from simple scatter plots to complex, multi-layered plots with facets, color mapping, and more.
  • httr is an R library for working with HTTP requests. It provides a set of functions that allow you to make HTTP requests, interact with web APIs, and handle responses. The library is designed to simplify the process of sending and receiving data over the web.
  • data.table is an R library that provides an efficient and fast way to manipulate large data sets. It is particularly well-suited for tasks involving data aggregation, subsetting, joining, and transformation. The library extends the basic data.frame object in R.
  • dtplyr is an R library that extends the capabilities of the dplyr library to work with large data sets efficiently using the data.table framework. It combines the syntax and functionality of dplyr with the speed and memory efficiency of data.table. This allows users to perform data manipulation operations on large data sets in a manner similar to using dplyr, making it easier to work with big data in R.
  • xts (eXtensible Time Series) is an R library that provides a specialized data structure for working with time series data. It builds upon the fundamental data structure in R, data frames, and enhances it to handle time-ordered data more efficiently.
  • eurostat library in R is a tool for accessing and retrieving statistical data from Eurostat, the statistical office of the European Union.
  • pals is an R library with a comprehensive collection of color palettes, colormaps, and tools to evaluate them.
  • pbapply library in R can be used for parallel processing to apply functions in parallel to elements of a list or vector. This can significantly speed up the execution of functions, especially when dealing with computationally intensive tasks on multiple processor cores or clusters for parallel computation.
  • doParallel library in R provides a parallel backend for the popular "foreach" library. It allows R users to easily parallelize operations, such as loops and repetitive tasks, across multiple processor cores or even clusters of computers. This can significantly speed up data processing and computation-intensive tasks.
In [1]:
# Install / load required R libraries
if (!"pacman" %in% rownames(installed.packages())) install.packages("pacman")
pacman::p_load(dplyr, tidyr, ggplot2) # required subset of tidyverse packages
pacman::p_load(httr, data.table, dtplyr, xts, eurostat, pals, pbapply, doParallel)
euCodes <- eurostat::harmonize_country_code(eurostat::eu_countries$code)
pbapply::pboptions(type = "txt", style = 3)
cpuCores <- parallel::detectCores()
cl <- parallel::makeCluster(cpuCores)
doParallel::registerDoParallel(cl)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘zoo’



xts installed

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘proxy’, ‘Rcpp’, ‘e1071’, ‘plyr’, ‘bibtex’, ‘assertthat’, ‘here’, ‘classInt’, ‘countrycode’, ‘ISOweek’, ‘RefManageR’, ‘regions’



eurostat installed

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘dichromat’, ‘mapproj’, ‘maps’



pals installed

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


pbapply installed

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘foreach’, ‘iterators’



doParallel installed

2. Download data files¶

In [2]:
# Function to download remote file to the disk
urlDownload <- function(urlLink, showProgress = FALSE) {
  urlHead <- curlGetHeaders(urlLink)
  fileSize <- as.numeric(gsub("^.*content-length: ([0-9]+).*$", "\\1", tolower(urlHead[5])))
  fileName <- gsub("^.*filename=(.*)\r\n$", "\\1", urlHead[grepl('filename', urlHead)])
  if (!file.exists(fileName) || file.info(fileName)$size != fileSize) {
    if (showProgress) {
      message(sprintf('Downloading %s', fileName))
      httr::GET(urlLink, httr::write_disk(fileName, overwrite = TRUE), httr::progress())
      cat('\n')
    } else {
      httr::GET(urlLink, httr::write_disk(fileName, overwrite = TRUE))
    }
  }
  return(fileName)
}

Assistant: This is an R function named urlDownload that downloads a remote file to the disk. The function takes two arguments: urlLink and showProgress.

urlLink is the URL of the remote file to be downloaded.

showProgress is a logical value that determines whether to show the progress of the download. If showProgress is TRUE, the function will display a message indicating that it is downloading the file, and it will also display the progress of the download using the httr::progress() function. If showProgress is FALSE, the function will not display any progress messages or indicators.

The function first uses the curlGetHeaders() function to retrieve the headers of the remote file. It then extracts the file size and the file name from the headers using regular expressions.

Next, the function checks if the file already exists on the disk and if its size matches the file size in the headers. If the file does not exist or its size does not match, the function downloads the file using the httr::GET() function. If showProgress is TRUE, the function will also display the progress of the download using the httr::progress() function.

Finally, the function returns the file name of the downloaded file.

In [3]:
# Download the newest data
urlLocation <- 'https://aqicn.org/data-platform/covid19/report/39374-7694ec07/'
csvFile <- urlDownload(urlLocation, showProgress = TRUE)
csvFile
Downloading waqi-covid19-airqualitydata-2024.csv

  |======================================================================| 100%

'waqi-covid19-airqualitydata-2024.csv'

Assistant: This R language code is downloading a CSV file from a URL using the urlDownload() function.

Here's a breakdown of the code:

  1. urlLocation <- 'https://aqicn.org/data-platform/covid19/report/39374-7694ec07/' - This line sets the URL of the CSV file to be downloaded.
  2. csvFile <- urlDownload(urlLocation, showProgress = TRUE) - This line downloads the CSV file from the specified URL and stores it in a variable called csvFile. The showProgress argument is set to TRUE, which means that the progress of the download will be displayed in the console.

Overall, this code is used to download a CSV file from a URL and store it in a variable for further processing or analysis.

In [4]:
# Create lists of year and quarter names
yNames <- as.character(2019:2023)
qNames <- paste0("Q", 1:4)

# Create a data frame with the url locations and year/quarter combinations
# Combine url location and year/quarter combinations into a single column
DF <- data.frame(urlLocation, crossing(yNames, qNames)) %>%
  unite(urlLocations, c(urlLocation, "yNames", "qNames"), sep="")
print(DF)
                                                           urlLocations
1  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2019Q1
2  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2019Q2
3  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2019Q3
4  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2019Q4
5  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2020Q1
6  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2020Q2
7  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2020Q3
8  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2020Q4
9  https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2021Q1
10 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2021Q2
11 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2021Q3
12 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2021Q4
13 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2022Q1
14 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2022Q2
15 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2022Q3
16 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2022Q4
17 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2023Q1
18 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2023Q2
19 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2023Q3
20 https://aqicn.org/data-platform/covid19/report/39374-7694ec07/2023Q4

Assistant: This R code creates a data frame with the url locations and year/quarter combinations.

Here's a breakdown of the code:

  1. yNames <- as.character(2019:2021): This line creates a list of year names, starting from 2019 to 2021.
  2. qNames <- paste0("Q", 1:4): This line creates a list of quarter names, starting from "Q1" to "Q4".
  3. DF <- data.frame(urlLocation, crossing(yNames, qNames)): This line creates a data frame with the url locations and year/quarter combinations. The crossing() function is used to create all possible combinations of year and quarter names.
  4. unite(urlLocations, c(urlLocation, "yNames", "qNames"), sep=""): This line combines the url location and year/quarter combinations into a single column. The unite() function is used to combine the columns, and the sep="" argument is used to specify that the columns should be separated by an empty string.

In summary, this code creates a data frame with the url locations and year/quarter combinations, and combines the url location and year/quarter combinations into a single column.

In [5]:
# Download legacy data (sequentially)
csvFiles <- lapply(DF$urlLocations, function(x) urlDownload(x, showProgress = TRUE))
Downloading waqi-covid19-airqualitydata-2019Q1.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2019Q2.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2019Q3.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2019Q4.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2020Q1.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2020Q2.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2020Q3.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2020Q4.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2021Q1.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2021Q2.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2021Q3.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2021Q4.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2022Q1.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2022Q2.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2022Q3.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2022Q4.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2023Q1.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2023Q2.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2023Q3.csv

  |======================================================================| 100%

Downloading waqi-covid19-airqualitydata-2023Q4.csv

  |======================================================================| 100%

Assistant: This R language code is using the lapply() function to download a series of CSV files from a list of URLs. The DF$urlLocations variable is a list of URLs, and the function(x) argument is a function that takes each URL in the list as an argument and returns the downloaded CSV file. The urlDownload() function is used to download the CSV file from the URL. The showProgress = TRUE argument is used to display a progress bar while the download is in progress. The lapply() function returns a list of the downloaded CSV files.

In [6]:
# Delete downloaded legacy data files
unlink(csvFiles)

Assistant: This R language code is deleting downloaded legacy data files.

The unlink() function is used to delete files in R. The function takes a character vector of file paths as an argument. In this case, the argument is csvFiles, which is a vector of file paths to the downloaded legacy data files.

The code is using the unlink() function to delete the files in the csvFiles vector. The unlink() function will remove the files from the file system, and the files will no longer be accessible.

In [7]:
# Download legacy data (in parallel)
csvFiles <- pbapply::pblapply(DF$urlLocations, urlDownload, cl = cl)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%

Assistant: This R language code is using the pbapply package to download a set of CSV files in parallel using the urlDownload function.

Here's a breakdown of the code:

  • DF$urlLocations: This is a vector of URLs that point to the CSV files to be downloaded.
  • urlDownload: This is a function that downloads a CSV file from a given URL.
  • cl: This is a parallel backend that is used to download the CSV files in parallel.
  • pbapply::pblapply: This is a function from the pbapply package that applies the urlDownload function to each element in the DF$urlLocations vector in parallel.

The code is using the cl parallel backend to download the CSV files in parallel, which can significantly speed up the download process. The pbapply package is used to apply the urlDownload function to each element in the DF$urlLocations vector in parallel.

3. Load and prepare data¶

In [8]:
# Define the columns to load
statCols <- c('count', 'min', 'max', 'median', 'variance')
mainCols <- 'median'
dropCols <- setdiff(statCols, mainCols)

# Read the newest data file and skip the first 4 lines
newTable <- data.table::fread(csvFile, skip = 4, drop = dropCols, showProgress = TRUE)

# Leave EU data, rename main column to Value
newTable <- newTable %>%
  rename(Value=all_of(mainCols)) %>%
  filter(Country %in% euCodes) %>%
  glimpse()
Rows: 36,351
Columns: 5
$ Date    <IDate> 2024-01-02, 2024-01-08, 2024-01-11, 2024-01-17, 2024-01-20, …
$ Country <chr> "HU", "HU", "HU", "HU", "HU", "HU", "HU", "HU", "HU", "HU", "H…
$ City    <chr> "Debrecen", "Debrecen", "Debrecen", "Debrecen", "Debrecen", "D…
$ Specie  <chr> "dew", "dew", "dew", "dew", "dew", "dew", "dew", "dew", "dew",…
$ Value   <dbl> 2.5, -9.0, -6.0, -3.0, -5.0, -4.0, 4.0, 5.0, -2.5, -5.0, -5.0,…

Assistant: This R language code is reading a CSV file and creating a data table from it. Overall, this code is reading a CSV file, dropping unnecessary columns, and renaming the main column to "Value", and then filtering the data to only include rows where the "Country" column is in the "euCodes" vector, and displaying a summary of the data using the glimpse() function.

In [89]:
# Read legacy data files (sequentially)
fileNamesQ <- list.files('.', '^.*Q\\d.csv$')
oldTable <- do.call(rbind, lapply(fileNamesQ, function(x) data.table::fread(x, skip = 4, drop = dropCols, colClasses = list(IDate=1))))
oldTable <- oldTable %>%
  rename(Value=all_of(mainCols)) %>%
  filter(Country %in% euCodes) %>%
  glimpse()
Rows: 2,891,334
Columns: 5
$ Date    <IDate> 2019-01-03, 2019-02-05, 2019-03-04, 2019-03-26, 2019-03-27, …
$ Country <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "C…
$ City    <chr> "Prague", "Prague", "Prague", "Prague", "Prague", "Prague", "P…
$ Specie  <chr> "temperature", "temperature", "temperature", "temperature", "t…
$ Value   <dbl> -2.2, -2.0, 9.7, 5.8, 6.6, 5.8, 0.0, 6.3, 12.6, -4.4, 2.1, -1.…

Assistant: This R language code is reading legacy data files (sequentially) and performing various operations on them.

Here's a breakdown of the code:

  1. list.files('.', '^.*Q\d.csv$'): This line of code is using the list.files() function to list all files in the current directory that match the pattern ^.*Q\d.csv$. The '.' argument specifies the current directory, and the '^.*Q\d.csv$' pattern matches any file that starts with "Q" and ends with ".csv". The resulting list of file names is stored in the fileNamesQ variable.
  2. do.call(rbind, lapply(fileNamesQ, function(x) data.table::fread(x, skip = 4, drop = dropCols))): This line of code is using the do.call() function to apply the rbind() function to the list of data frames returned by the lapply() function. The lapply() function is used to read each file in the fileNamesQ list using the data.table::fread() function, which reads the data from the file into a data frame. The skip = 4 argument specifies that the first 4 rows of each file should be skipped, and the drop = dropCols argument specifies that any columns that are not needed should be dropped. The resulting data frames are then combined into a single data frame using the rbind() function.
  3. oldTable <- oldTable %>% rename(Value=all_of(mainCols)) %>% filter(Country %in% euCodes) %>% glimpse(): This line of code is using the rename() function to rename the "Value" column to the name specified in the mainCols variable. The filter() function is then used to filter the data to only include rows where the "Country" column is in the euCodes list. Finally, the glimpse() function is used to display a summary of the data.

Overall, this code is reading legacy data files (sequentially) and performing various operations on them, including reading the data into a data frame, renaming columns, filtering rows, and displaying a summary of the data.

In [90]:
# Read legacy data files (in parallel)
parallel::clusterExport(cl, c("dropCols"))
parallel::clusterEvalQ(cl, pacman::p_load(data.table))
fileNamesQ <- list.files('.', '^.*Q\\d.csv$')
oldTable <- do.call(rbind, pbapply::pblapply(fileNamesQ, function(x) fread(x, skip = 4, drop = dropCols, colClasses = list(IDate=1)), cl = cl))
oldTable <- oldTable %>%
  rename(Value=all_of(mainCols)) %>%
  filter(Country %in% euCodes) %>%
  glimpse()
  1. data.table: TRUE
  2. data.table: TRUE
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Rows: 2,891,334
Columns: 5
$ Date    <IDate> 2019-01-03, 2019-02-05, 2019-03-04, 2019-03-26, 2019-03-27, …
$ Country <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "CZ", "C…
$ City    <chr> "Prague", "Prague", "Prague", "Prague", "Prague", "Prague", "P…
$ Specie  <chr> "temperature", "temperature", "temperature", "temperature", "t…
$ Value   <dbl> -2.2, -2.0, 9.7, 5.8, 6.6, 5.8, 0.0, 6.3, 12.6, -4.4, 2.1, -1.…

Assistant: This R language code is using the parallel package to read legacy data files in parallel. The code first exports the dropCols function to the cluster using parallel::clusterExport. Then, it loads the data.table package using pacman::p_load in the cluster.

Next, the code uses list.files to get a list of file names that match a specific pattern, and stores the result in fileNamesQ. The pbapply function is then used to apply the fread function to each file name in parallel, using the cl argument to specify the cluster. The fread function is used to read the data from each file, and the skip argument is used to skip the first 4 rows of each file. The drop argument is used to drop any columns that are not specified in the dropCols function.

The resulting data is then combined using do.call(rbind, ...), and the rename function is used to rename the Value column to all_of(mainCols). The filter function is then used to filter the data to only include rows where the Country column is in the euCodes vector. Finally, the glimpse function is used to display a summary of the data.

In [93]:
# Append old (2018-2023) and new (2024) data tables, sort, remove duplicates
dataTableEU <- oldTable %>% bind_rows(newTable) %>% arrange(Country, City, Date) %>%
  unique() %>%
  glimpse()
Rows: 2,771,315
Columns: 5
$ Date    <IDate> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, …
$ Country <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "A…
$ City    <chr> "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz"…
$ Specie  <chr> "pm10", "pressure", "wind-speed", "co", "temperature", "humidi…
$ Value   <dbl> 11.0, 1027.1, 1.5, 0.1, 3.7, 70.2, 1.1, 14.2, 4.3, 30.0, 1026.…

Assistant: This R code is combining two data tables, oldTable and newTable, and sorting the resulting data table by Country, City, and Date. The unique() function is then used to remove any duplicate rows from the data table. Finally, the glimpse() function is used to display a summary of the data table.

Here's a breakdown of the code:

  1. dataTableEU <- oldTable %>% bind_rows(newTable): This line combines the oldTable and newTable data tables into a single data table called dataTableEU. The bind_rows() function is used to concatenate the rows of the two tables.
  2. arrange(Country, City, Date): This line sorts the data table by Country, City, and Date. The arrange() function is used to sort the data table by the specified columns.
  3. unique(): This line removes any duplicate rows from the data table. The unique() function is used to identify and remove any rows that have the same values in the specified columns.
  4. glimpse(): This line displays a summary of the data table. The glimpse() function is used to display a brief overview of the data table, including the number of rows, columns, and the data types of the columns.

4. Check and select vars¶

In [94]:
# Weather conditions and pollutants (PM10, PM2.5, NO2, Ozone, SO2, CO)

# Calculate the proportion of each Species in the data table
allVars <- sort(100*prop.table(table(dataTableEU$Specie)), decreasing = TRUE)

# Drop the variables that are not needed
dropWeat <- c('pressure', 'wind-speed', 'wind-gust', 'wind speed', 'wind gust', 'dew','precipitation')
dropPoll <- c('wd', 'aqi', 'uvi', 'pm1', 'neph', 'mepaqi')
keepVars <- setdiff(names(allVars), union(dropWeat, dropPoll))

# Create a new data table with the info on kept variables
setNames(data.table(t(as.numeric(allVars[keepVars]))), names(allVars[keepVars]))
A data.table: 1 × 8
temperaturehumiditypm10no2o3pm25so2co
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
9.8937519.8853079.6350659.417848.3399047.2348336.7040744.524098

Assistant: This R language code is creating a new data table with the proportion of each species in the data table, and then dropping certain variables that are not needed.

Here's a breakdown of the code:

  1. allVars <- sort(100*prop.table(table(dataTableEU$Specie)), decreasing = TRUE): This line calculates the proportion of each species in the data table and sorts the results in descending order. The prop.table function calculates the proportion of each species in the data table, and the table function creates a frequency table of the species. The sort function sorts the results in descending order.
  2. dropWeat <- c('pressure', 'wind-speed', 'wind-gust', 'wind speed', 'wind gust', 'dew','precipitation'): This line creates a vector of variables that are not needed in the new data table.
  3. dropPoll <- c('wd', 'aqi', 'uvi', 'pm1', 'neph', 'mepaqi'): This line creates another vector of variables that are not needed in the new data table.
  4. keepVars <- setdiff(names(allVars), union(dropWeat, dropPoll)): This line creates a vector of the variables that are needed in the new data table. The setdiff function returns the elements of the first vector that are not in the second vector. The union function combines the two vectors of variables that are not needed.
  5. setNames(data.table(t(as.numeric(allVars[keepVars]))), names(allVars[keepVars])): This line creates a new data table with the info on the kept variables. The data.table function creates a data table from the numeric values of the allVars vector, and the setNames function sets the names of the columns in the data table to the names of the variables in the keepVars vector. The t function transposes the data table, and the as.numeric function converts the data table to a numeric matrix.
In [95]:
# Descriptive statistics for daily values of selected variables
selectedVars <- c('temperature', 'humidity')
dataTableEU <- dataTableEU %>% filter(Specie %in% selectedVars) %>% glimpse()
cat('\n')
tapply(dataTableEU$Value, factor(dataTableEU$Specie), summary)
Rows: 548,140
Columns: 5
$ Date    <IDate> 2018-12-31, 2018-12-31, 2019-01-01, 2019-01-01, 2019-01-02, …
$ Country <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "A…
$ City    <chr> "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz"…
$ Specie  <chr> "temperature", "humidity", "temperature", "humidity", "tempera…
$ Value   <dbl> 3.7, 70.2, 3.0, 75.7, -0.2, 79.5, -1.2, 69.0, -1.4, 66.2, 0.8,…

$humidity
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   60.00   73.60   71.27   84.60  100.00 

$temperature
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -67.70    6.80   12.60   12.63   18.50   81.10 

Assistant: This R language code is calculating descriptive statistics for the daily values of two variables, temperature and humidity, in a dataset called dataTableEU.

Here's a breakdown of the code:

  1. selectedVars <- c('temperature', 'humidity'): This line creates a vector of the names of the variables to be selected.
  2. dataTableEU <- dataTableEU %>% filter(Specie %in% selectedVars) %>% glimpse(): This line filters the dataTableEU dataset to only include rows where the Specie column matches one of the values in the selectedVars vector. The glimpse() function is used to display a summary of the filtered dataset.
  3. cat(' '): This line prints a blank line to the console.
  4. tapply(dataTableEU$Value, factor(dataTableEU$Specie), summary): This line calculates the summary statistics for the Value column of the filtered dataTableEU dataset, grouped by the Specie column. The tapply() function is used to apply a function to each group of the dataset, in this case the summary() function. The factor() function is used to convert the Specie column to a factor, which allows the data to be grouped by the levels of the factor.
In [96]:
# How many cities exist for each country?
countrySize <- dataTableEU %>% select(Country, City) %>% unique() %>% group_by(Country) %>% tally(sort = TRUE) %>% rename(Cities = n)
cbind(t(head(countrySize, 13)), matrix("..",2,1), t(tail(countrySize, 6)))
A matrix: 2 × 20 of type chr
CountryFRESDEROPLITNLHUBEATCZFISE..PTSKDKEEIELT
Cities27231717141211 7 6 5 5 5 4..2 2 1 1 1 1

Assistant: This R language code is creating a data frame called countrySize that contains the number of cities for each country in the European Union.

Here's a breakdown of the code:

  1. dataTableEU %>% select(Country, City) %>% unique(): This line selects the Country and City columns from the dataTableEU data frame and removes any duplicate rows.
  2. group_by(Country): This line groups the data by Country.
  3. tally(sort = TRUE): This line counts the number of cities for each country and sorts the results in ascending order.
  4. rename(Cities = n): This line renames the column n to Cities.
  5. cbind(t(head(countrySize, 13)), matrix("..",2,1), t(tail(countrySize, 6))): This line combines the first 13 rows of the countrySize data frame with a 2x1 matrix of dots, and then combines the last 6 rows of the countrySize data frame with the result. The t() function is used to transpose the data frames, and the cbind() function is used to combine the data frames.

The resulting data frame has the following columns:

  • Country: The name of the country.
  • Cities: The number of cities in the country.
  • ..: A 2x1 matrix of dots.
  • Country: The name of the country.
  • Cities: The number of cities in the country.

The code is using the dplyr package to perform the data manipulation, and the matrix() function to create the 2x1 matrix of dots.

In [97]:
# Which EU countries are we missing?
eurostat::eu_countries %>% filter(code %in% setdiff(euCodes, unique(dataTableEU$Country))) %>% select(name) %>% glimpse()
Rows: 4
Columns: 1
$ name <chr> "Latvia", "Luxembourg", "Malta", "Slovenia"

Assistant: This R code is using the eurostat package to retrieve a list of all EU countries, and then filtering that list to only include countries that are not present in the dataTableEU data frame. In summary, this code is retrieving a list of all EU countries that are not present in the dataTableEU data frame, and then extracting the name column from the resulting data frame.

The code is using the pipe operator (%>%) to chain together multiple operations, which makes the code more readable and easier to understand. The pipe operator allows the output of one operation to be passed as input to the next operation, which makes it easier to write and read complex R code.

In [98]:
# 2021-10-03 Barcelona fix
dataTableEU %>% glimpse()
dataTableEU <- dataTableEU %>% group_by(Date, Country, City, Specie) %>% summarize(Value=mean(Value), .groups="drop") %>% glimpse()
Rows: 548,140
Columns: 5
$ Date    <IDate> 2018-12-31, 2018-12-31, 2019-01-01, 2019-01-01, 2019-01-02, …
$ Country <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "A…
$ City    <chr> "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz", "Graz"…
$ Specie  <chr> "temperature", "humidity", "temperature", "humidity", "tempera…
$ Value   <dbl> 3.7, 70.2, 3.0, 75.7, -0.2, 79.5, -1.2, 69.0, -1.4, 66.2, 0.8,…
Rows: 548,137
Columns: 5
$ Date    <IDate> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, …
$ Country <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "B…
$ City    <chr> "Graz", "Graz", "Innsbruck", "Innsbruck", "Linz", "Linz", "Sal…
$ Specie  <chr> "humidity", "temperature", "humidity", "temperature", "humidit…
$ Value   <dbl> 70.2, 3.7, 93.0, 0.0, 92.0, 3.3, 96.5, 1.0, 70.0, 4.5, 92.6, 9…

Assistant: This R language code is a data manipulation script that performs several operations on a data table called "dataTableEU".

Here's a breakdown of the code:

  1. dataTableEU %>% glimpse(): This line of code displays a summary of the data table "dataTableEU" using the glimpse() function.
  2. dataTableEU <- dataTableEU %>% group_by(Date, Country, City, Specie) %>% summarize(Value=mean(Value), .groups="drop"): This line of code groups the data table by the columns "Date", "Country", "City", and "Specie", and then calculates the mean of the "Value" column for each group. The .groups="drop" argument tells R to drop the grouping columns from the resulting data table.
  3. dataTableEU %>% glimpse(): This line of code displays a summary of the modified data table "dataTableEU" using the glimpse() function.

In summary, this code groups the data table by the specified columns, calculates the mean of the "Value" column for each group, and drops the grouping columns from the resulting data table.

5. Pivot and calculate THI¶

In [99]:
# Create pivot table, calculate THI for each row, drop rows with missing THI values
selectedVars <- c('temperature', 'humidity')
dataTableTHI <- dataTableEU %>%
  pivot_wider(names_from=Specie, values_from=Value) %>%
  group_by(Date, Country) %>%
  mutate(across(all_of(selectedVars), ~ ifelse(is.na(.), median(.,na.rm=TRUE), .))) %>%
  mutate(THI = 0.8*temperature + (humidity/100)*(temperature-14.4) + 46.4) %>%
  filter(!is.na(THI)) %>%
  data.table %>% glimpse()
Rows: 274,160
Columns: 6
$ Date        <IDate> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-…
$ Country     <chr> "AT", "AT", "AT", "AT", "AT", "BE", "BE", "BE", "BE", "BE"…
$ City        <chr> "Graz", "Innsbruck", "Linz", "Salzburg", "Vienna", "Antwer…
$ humidity    <dbl> 70.2, 93.0, 92.0, 96.5, 70.0, 92.6, 93.0, 93.0, 92.5, 100.…
$ temperature <dbl> 3.7, 0.0, 3.3, 1.0, 4.5, 9.4, 9.4, 7.7, 8.8, 8.0, 8.8, 3.0…
$ THI         <dbl> 41.8486, 33.0080, 38.8280, 34.2690, 43.0700, 49.2900, 49.2…

Assistant: This R language code creates a pivot table from a data table called "dataTableEU" and calculates the THI (Temperature Humidity Index) for each row.

Here's a breakdown of the code:

  1. selectedVars <- c('temperature', 'humidity'): This line defines a vector of variables to be used in the pivot table.
  2. dataTableTHI <- dataTableEU %>% pivot_wider(names_from=Specie, values_from=Value) %>% group_by(Date, Country) %>% mutate(across(all_of(selectedVars), ~ ifelse(is.na(.), median(.,na.rm=TRUE), .))): This line creates a pivot table from the data table "dataTableEU" using the "pivot_wider" function. The "names_from" argument specifies the variable to be used as the column names, and the "values_from" argument specifies the variable to be used as the values. The "group_by" function groups the data by the "Date" and "Country" variables. The "mutate" function applies a function to the selected variables (in this case, "temperature" and "humidity") to replace missing values with the median value.
  3. mutate(THI = 0.8*temperature + (humidity/100)*(temperature-14.4) + 46.4): This line calculates the THI for each row using the formula: THI = 0.8temperature + (humidity/100)(temperature-14.4) + 46.4.
  4. filter(!is.na(THI)): This line filters the data to only include rows where the THI value is not missing.
  5. data.table %>% glimpse(): This line displays a summary of the data table using the "glimpse" function.

In summary, this code creates a pivot table from a data table, calculates the THI for each row, and filters out rows with missing THI values.

6. THI statistics and plots¶

In [100]:
# An simple function to turn an xts time series into a ggplot calendar heatmap
# https://margintale.blogspot.com/2018/01/ggplot2-time-series-heatmaps-revisited.html
xts_heatmap <- function(x) {
  data.frame(Date=as.Date(index(x)), x[,1]) %>%
    setNames(c("Date","Value")) %>%
    dplyr::mutate(
      Year=lubridate::year(Date),
      Month=lubridate::month(Date),
      # I use factors here to get plot ordering in the right order without worrying about locale
      MonthTag=factor(Month,levels=as.character(1:12),
                      labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE),
      # week start on Monday in my world
      Wday=lubridate::wday(Date,week_start=1),
      # the rev reverse here is just for the plotting order
      WdayTag=factor(Wday,levels=rev(1:7),labels=rev(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")),ordered=TRUE),
      Week=as.numeric(format(Date,"%W"))
    ) %>%
    # group by year and month and then calculate the week of the month we are currently in
    dplyr::group_by(Year,Month) %>%
    dplyr::mutate(Wmonth=1+Week-min(Week)) %>%
    dplyr::ungroup() %>%
    ggplot2::ggplot(aes(x=Wmonth, y=WdayTag, fill=Value)) +
    ggplot2::geom_tile(colour="white") +
    ggplot2::facet_grid(Year~MonthTag) +
    ggplot2::scale_fill_gradientn(colours=coolwarm(100), guide="colourbar") +
    ggplot2::labs(x="Week of Month", y=NULL)
}

Assistant: This is an R function that takes an xts time series as input and returns a ggplot calendar heatmap.

In [101]:
# Descriptive statistics for yearly THI
myCity <- 'Kaunas'
myTable <- dataTableTHI %>% filter(City==myCity, Date>='2019-01-01') %>% select(Date, THI)
do.call(rbind, tapply(myTable$THI, year(myTable$Date), summary))
A matrix: 6 × 6 of type dbl
Min.1st Qu.MedianMean3rd Qu.Max.
201913.64038.996049.5354049.2427159.7226275.0560
202028.02040.656050.1847550.3885659.6731271.4790
2021 2.69635.827547.1740046.6903158.8012574.7400
202221.63239.680050.3080050.2374660.3120073.6255
202317.00034.880044.3475044.7290053.7720071.9085
2024 3.47621.216525.9100025.6047033.1520037.2840

Assistant: This R language code is calculating descriptive statistics for the THI (Temperature Humidity Index) data for a specific city (Kaunas) in a data table.

Here's a breakdown of the code:

  1. myCity <- 'Kaunas': This line sets the variable myCity to the string "Kaunas".
  2. myTable <- dataTableTHI %>% filter(City==myCity, Date>='2019-01-01') %>% select(Date, THI): This line filters the data table dataTableTHI to include only rows where the City column is equal to "Kaunas" and the Date column is greater than or equal to "2019-01-01". It then selects the Date and THI columns from the filtered data.
  3. do.call(rbind, tapply(myTable$THI, year(myTable$Date), summary)): This line calculates the descriptive statistics (mean, median, and standard deviation) for the THI column in the filtered data, grouped by the year of the Date column. The tapply() function applies the summary() function to each group of the data, and the do.call() function combines the results into a single data frame. The rbind() function is used to stack the results vertically.

Overall, this code is calculating descriptive statistics for the THI data for the city of Kaunas since year 2019.

In [102]:
# Plot THI time series calendar heatmap
myTitle <- paste("THI @", myCity)
xts_dataset <- xts(myTable[,'THI'], order.by = as.Date(myTable$Date))
xts_heatmap(xts_dataset) + labs(title=myTitle) + coord_fixed(1) + theme_minimal(base_size=7)

Assistant: This R code is creating a time series calendar heatmap for a dataset called "myTable" that contains a column called "THI" (Total Health Index) and a column called "Date".

Here's a breakdown of the code:

  1. myTitle <- paste("THI @", myCity): This line creates a string variable called "myTitle" that contains the text "THI @" followed by the value of the "myCity" variable.
  2. xts_dataset <- xts(myTable[,'THI'], order.by = as.Date(myTable$Date)): This line creates an xts object called "xts_dataset" from the "THI" column of the "myTable" dataset. The "order.by" argument specifies the date column of the dataset, which is used to order the data in the xts object.
  3. xts_heatmap(xts_dataset): This line creates a heatmap of the xts object using the xts_heatmap() function.
  4. labs(title=myTitle): This line adds a title to the heatmap using the labs() function. The title is set to the value of the "myTitle" variable.
  5. coord_fixed(1): This line sets the coordinate system of the heatmap to fixed, which means that the x-axis and y-axis are not scaled to the data.
  6. theme_minimal(base_size=7): This line sets the theme of the heatmap to minimal, which means that the background and axis labels are not displayed. The base_size argument sets the font size of the axis labels to 7.
In [ ]:
# Plot THI time series dynamics
ggplot2::autoplot(xts_dataset, main=myTitle, ylab="THI", xlab="Time")

Assistant: This R language code is using the ggplot2 package to create a time series plot of the THI (Temperature Humidity Index) data in the xts_dataset.

Here's a breakdown of the code:

  • ggplot2::autoplot(xts_dataset, main=myTitle, ylab="THI", xlab="Time"): This line of code is using the autoplot() function from the ggplot2 package to create a time series plot of the THI data in the xts_dataset. The main argument specifies the title of the plot, and the ylab and xlab arguments specify the labels for the y-axis and x-axis, respectively.
  • xts_dataset: This is the name of the dataset that contains the THI data.
  • myTitle: This is the title of the plot.
  • THI: This is the label for the y-axis.
  • Time: This is the label for the x-axis.

Overall, this code is creating a time series plot of the THI data in the xts_dataset with the specified title, y-axis label, and x-axis label.

7. Export results to .html¶

After running all notebook blocks, save .ipynb and use nbconvert to get .html

  1. Save notebook with results: File / Save (Ctrl+S)
  2. Download notebook: File / Download / Download .ipynb
  3. Upload .ipynb back to Google Colab: File / Upload notebook
  4. Run the following command-line: jupyter nbconvert --to html
In [ ]:
#system("jupyter nbconvert --to html ASSIST_Air_Data_Rscript.ipynb")

Assistant: This R language code is using the system() function to execute a command in the operating system.

The command being executed is:

jupyter nbconvert --to html ASSIST_Air_Data_Rscript.ipynb

This command is telling Jupyter Notebook to convert the file ASSIST_Air_Data_Rscript.ipynb to HTML format using the nbconvert tool. The --to html option tells nbconvert to output the converted file as an HTML file.

The system() function is a built-in function in R that allows you to execute operating system commands. It takes a string as an argument, which is the command to be executed, and returns the output of the command as a character string. In this case, the output of the command is the HTML file that is generated by nbconvert.