Air Quality Data Processing with Python¶

Air Quality Open Data Platform and World Air Quality Index project team provides a new dedicated dataset, updated 3 times a day, and covering about 380 major cities in the world, from January 2015 until now. The data for each major city is based on the average (median) of several air quality monitoring stations. The data set provides min, max, median and standard deviation for each of the air pollutant species (PM10, PM2.5, NO2, Ozone, SO2, CO) as well as meteorological data (Temperature, Humidity, Wind).

R code below downloads several .csv files, combines thems into a full dataset, selects median from provided statistics and narrows a set of countries to European Union members. From available information only meteorological data on Temperature and Humidity is selected and THI (Temperature Humidity Index) is calculated. Finally, daily THI time series for a selected city is summarized and visualized.

THI measures the level of discomfort the average person (or animal) is thought to experience as a result of the combined effects of the temperature and humidity of the air. The heat stress index THI is a simple combination of temperature and humidity (in decimal form) and can be calculated by the following formula:

$THI = 0.8 \times Temperature + Humidity \times (Temperature - 14.4) + 46.4$

Habeeb, A. A., Gad, A. E., & Atta, M. A. (2018). Temperature-humidity indices as indicators to heat stress of climatic conditions with relation to production and reproduction of farm animals. International Journal of Biotechnology and Recent Advances, 1(1), 35-50. https://doi.org/10.18689/ijbr-1000107

1. Prepare Python environment¶

In [1]:
# Install custom Python packages
!pip install pyeucountrycodes --quiet
!pip install calplot --quiet
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.3/132.3 kB 2.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for calplot (setup.py) ... done
In [2]:
# Load required Python packages
import os, re, requests, logging, calplot
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
from dask.multiprocessing import get
from dask.diagnostics import ProgressBar
ProgressBar().register()
from tqdm import tqdm
from itertools import product
from eu_country_codes import COUNTRY_CODES

Assistant: These are descriptions of the required Python packages.

  • os module in Python provides a way to interact with the operating system, allowing you to perform various operations related to file and directory management, process execution, and environment variables.
  • re module in Python stands for regular expressions. It provides support for working with regular expressions, which are powerful tools for pattern matching and text manipulation.
  • requests package in Python is a popular library for making HTTP requests. It simplifies the process of sending HTTP requests and receiving HTTP responses from web services and websites. The "requests" package is widely used for tasks such as web scraping, interacting with web APIs, and accessing web content.
  • logging package in Python is a built-in module that provides a flexible and powerful framework for emitting log messages from Python programs.
  • calplot creates heatmaps from Pandas time series data.
  • dask.dataframe package is part of the Dask library, which is designed for parallel and distributed computing in Python. Dask provides tools for handling larger-than-memory computations, enabling users to work with larger datasets and perform data manipulations efficiently.
  • multiprocessing package in Python is a built-in module that provides a convenient way to create and manage parallel processes. It's primarily used for parallel and concurrent programming, allowing you to take advantage of multiple CPU cores and perform tasks in parallel.
  • dask.multiprocessing package is part of the Dask library, which is designed for parallel and distributed computing in Python. It allows Dask to utilize multiple CPU cores on a single machine for parallel execution of tasks.
  • ProgressBar class from the dask.diagnostics module in Dask is a useful tool for monitoring the progress of Dask computations.
  • tqdm package in Python provides a fast and extensible progress bar for loops and other iterable tasks. It's particularly useful for tracking the progress of lengthy computations or operations and providing real-time feedback to the user. The package's name, "tqdm" stands for "taqaddum" in Arabic, which means "progress".
  • itertools.product is used to compute the Cartesian product of input iterables. In other words, it generates all possible combinations of elements from multiple input sequences.
  • eu_country_codes module is a Python package that provides a dictionary of COUNTRY_CODES, which contains ISO 3166-1 alpha-2 country codes as keys and the corresponding country names as values for European Union (EU) member states.

2. Download data files¶

In [3]:
# Function to download remote file to the disk
def urlDownload(urlLink, showProgress = False):
  with requests.get(urlLink, stream=True) as r:
    fileSize = int(r.headers.get('Content-Length'))
    fileName = r.headers.get('Content-Disposition').split("filename=")[1]
    if not os.path.exists(fileName) or os.path.getsize(fileName) != fileSize:
      block_size = 1024
      if showProgress:
        print(f"Downloading {fileName}")
        progress_bar = tqdm(total=fileSize, unit='iB', unit_scale=True)
      with open(fileName, 'wb') as file:
        for data in r.iter_content(block_size):
          if showProgress:
            progress_bar.update(len(data))
          file.write(data)
      if showProgress:
        progress_bar.close()
    return fileName

Assistant: This code defines a function called urlDownload that takes two arguments: urlLink and showProgress. The function uses the requests library to download a file from a remote URL and save it to the disk.

Here's a breakdown of the code:

  1. with requests.get(urlLink, stream=True) as r:: This line uses the requests library to make a GET request to the URL specified in urlLink. The stream=True parameter tells requests to stream the response, which allows us to read the data in chunks instead of loading the entire file into memory. The as r part assigns the response object to the variable r.
  2. fileSize = int(r.headers.get('Content-Length')): This line retrieves the file size from the response headers. The Content-Length header contains the size of the file in bytes.
  3. fileName = r.headers.get('Content-Disposition').split("filename=")[1]: This line retrieves the file name from the response headers. The Content-Disposition header contains information about the file, including its name. The split function is used to split the header value into a list of substrings, and the [1] part selects the second substring, which is the file name.
  4. if not os.path.exists(fileName) or os.path.getsize(fileName) != fileSize:: This line checks if the file already exists on the disk and if its size matches the file size retrieved from the response headers. If the file does not exist or its size does not match, the code inside the if block is executed.
  5. block_size = 1024: This line sets the block size for reading the file in chunks. The block size is the number of bytes to read at a time.
  6. if showProgress:: This line checks if the showProgress argument is set to True. If it is, the code inside the if block is executed.
  7. print(f"Downloading {fileName}"): This line prints a message to the console indicating that the file is being downloaded.
  8. progress_bar = tqdm(total=fileSize, unit='iB', unit_scale=True): This line creates a progress bar object using the tqdm library. The total parameter specifies the total number of bytes to download, the unit parameter specifies the unit of measurement (in this case, bytes), and the unit_scale parameter specifies whether to display the progress bar in units of bytes or in a more human-readable format.
  9. for data in r.iter_content(block_size):: This line reads the file in chunks of the specified block size using the iter_content method of the requests response object. The for loop iterates over the chunks of data and writes them to the file.
  10. if showProgress:: This line checks if the showProgress argument is set to True. If it is, the code inside the if block is executed.
  11. progress_bar.update(len(data)): This line updates the progress bar with the number of bytes read so far.
  12. file.write(data): This line writes the data.
In [5]:
# Download the newest data
urlLocation = 'https://aqicn.org/data-platform/covid19/report/39374-7694ec07/'
csvFile = urlDownload(urlLocation, showProgress=True)
csvFile
Out[5]:
'waqi-covid19-airqualitydata-2024.csv'

Assistant: This Python code is downloading a CSV file from a URL using the urllib library.

Here's a breakdown of the code:

  1. urlLocation = 'https://aqicn.org/data-platform/covid19/report/39374-7694ec07/' - This line sets the URL of the CSV file to be downloaded.
  2. csvFile = urlDownload(urlLocation, showProgress=True) - This line downloads the CSV file from the URL specified in urlLocation. The showProgress parameter is set to True, which means that the progress of the download will be displayed.

In summary, this code downloads a CSV file from a URL and assigns it to a variable named csvFile.

In [6]:
# Create lists of year and quarter names
yNames = [str(i) for i in range(2019, 2024)]
qNames = ["Q" + str(i) for i in range(1, 5)]

# Create a data frame with the url locations and year/quarter combinations
DF = pd.DataFrame(list(product(yNames, qNames)),columns=['yNames', 'qNames'])
DF.insert(loc=0, column='urlLocation', value=urlLocation)

# Combine url location and year/quarter combinations into a single column
DF = pd.DataFrame({'urlLocations': DF.agg(''.join, axis=1)})
DF
Out[6]:
urlLocations
0 https://aqicn.org/data-platform/covid19/report...
1 https://aqicn.org/data-platform/covid19/report...
2 https://aqicn.org/data-platform/covid19/report...
3 https://aqicn.org/data-platform/covid19/report...
4 https://aqicn.org/data-platform/covid19/report...
5 https://aqicn.org/data-platform/covid19/report...
6 https://aqicn.org/data-platform/covid19/report...
7 https://aqicn.org/data-platform/covid19/report...
8 https://aqicn.org/data-platform/covid19/report...
9 https://aqicn.org/data-platform/covid19/report...
10 https://aqicn.org/data-platform/covid19/report...
11 https://aqicn.org/data-platform/covid19/report...
12 https://aqicn.org/data-platform/covid19/report...
13 https://aqicn.org/data-platform/covid19/report...
14 https://aqicn.org/data-platform/covid19/report...
15 https://aqicn.org/data-platform/covid19/report...
16 https://aqicn.org/data-platform/covid19/report...
17 https://aqicn.org/data-platform/covid19/report...
18 https://aqicn.org/data-platform/covid19/report...
19 https://aqicn.org/data-platform/covid19/report...

Assistant: This code is creating a data frame with two columns: 'yNames' and 'qNames'. The 'yNames' column contains the years 2019, 2020, 2021, 2022 and 2023, and the 'qNames' column contains the quarters Q1, Q2, Q3, and Q4.

The code then creates a new data frame 'DF' with the url locations and the year/quarter combinations. The 'urlLocation' column is inserted into the data frame at the beginning, and the 'yNames' and 'qNames' columns are combined into a single column called 'urlLocations'.

The code is using the 'product' function from the 'itertools' module to create the combinations of year and quarter for the same url location prexif. The 'agg' function is then used to combine the 'yNames' and 'qNames' columns into a single column called 'urlLocations'.

In [7]:
# Download legacy data (sequentially)
csvFiles = DF.apply(lambda x : urlDownload(x[0], showProgress = True), axis=1)
Downloading waqi-covid19-airqualitydata-2019Q1.csv
100%|██████████| 26.5M/26.5M [00:03<00:00, 8.56MiB/s]
Downloading waqi-covid19-airqualitydata-2019Q2.csv
100%|██████████| 28.2M/28.2M [00:03<00:00, 8.72MiB/s]
Downloading waqi-covid19-airqualitydata-2019Q3.csv
100%|██████████| 31.6M/31.6M [00:03<00:00, 9.34MiB/s]
Downloading waqi-covid19-airqualitydata-2019Q4.csv
100%|██████████| 31.3M/31.3M [00:03<00:00, 9.57MiB/s]
Downloading waqi-covid19-airqualitydata-2020Q1.csv
100%|██████████| 29.1M/29.1M [00:03<00:00, 8.59MiB/s]
Downloading waqi-covid19-airqualitydata-2020Q2.csv
100%|██████████| 32.6M/32.6M [00:03<00:00, 9.68MiB/s]
Downloading waqi-covid19-airqualitydata-2020Q3.csv
100%|██████████| 32.4M/32.4M [00:03<00:00, 9.10MiB/s]
Downloading waqi-covid19-airqualitydata-2020Q4.csv
100%|██████████| 31.7M/31.7M [00:03<00:00, 9.30MiB/s]
Downloading waqi-covid19-airqualitydata-2021Q1.csv
100%|██████████| 32.0M/32.0M [00:03<00:00, 9.42MiB/s]
Downloading waqi-covid19-airqualitydata-2021Q2.csv
100%|██████████| 31.8M/31.8M [00:03<00:00, 9.39MiB/s]
Downloading waqi-covid19-airqualitydata-2021Q3.csv
100%|██████████| 30.8M/30.8M [00:03<00:00, 8.64MiB/s]
Downloading waqi-covid19-airqualitydata-2021Q4.csv
100%|██████████| 29.6M/29.6M [00:03<00:00, 8.70MiB/s]
Downloading waqi-covid19-airqualitydata-2022Q1.csv
100%|██████████| 30.8M/30.8M [00:03<00:00, 10.1MiB/s]
Downloading waqi-covid19-airqualitydata-2022Q2.csv
100%|██████████| 31.1M/31.1M [00:03<00:00, 8.81MiB/s]
Downloading waqi-covid19-airqualitydata-2022Q3.csv
100%|██████████| 30.9M/30.9M [00:03<00:00, 8.69MiB/s]
Downloading waqi-covid19-airqualitydata-2022Q4.csv
100%|██████████| 347/347 [00:00<00:00, 339kiB/s]
Downloading waqi-covid19-airqualitydata-2023Q1.csv
100%|██████████| 29.9M/29.9M [00:03<00:00, 9.61MiB/s]
Downloading waqi-covid19-airqualitydata-2023Q2.csv
100%|██████████| 30.6M/30.6M [00:03<00:00, 9.45MiB/s]
Downloading waqi-covid19-airqualitydata-2023Q3.csv
100%|██████████| 13.9M/13.9M [00:02<00:00, 6.16MiB/s]
Downloading waqi-covid19-airqualitydata-2023Q4.csv
100%|██████████| 11.8M/11.8M [00:01<00:00, 6.06MiB/s]

Assistant: This code is using the pandas library in Python to download a series of CSV files from the internet.

Here's a breakdown of the code:

  1. DF: This is a pandas DataFrame object that contains the URLs of the CSV files to be downloaded.
  2. apply: This is a method of the pandas DataFrame object that applies a function to each row of the DataFrame. In this case, the function is lambda x: urlDownload(x[0], showProgress=True).
  3. lambda x: urlDownload(x[0], showProgress=True): This is an anonymous function that takes a single argument x and returns the result of calling the urlDownload function with the URL in the first column of the DataFrame (x[0]) and the showProgress argument set to True.
  4. urlDownload: This is a function that downloads a CSV file from the internet and returns the contents of the file as a string.
  5. axis=1: This specifies that the apply method should be applied to the rows of the DataFrame, rather than the columns.

Overall, this code is downloading a series of CSV files from the internet and storing the contents of each file in a list. The showProgress argument is set to True to display a progress bar while the files are being downloaded.

In [8]:
# Delete downloaded legacy data files
for delFile in csvFiles:
  os.unlink(delFile)

Assistant: This Python code is deleting downloaded legacy data files.

Here's a breakdown of the code:

  1. csvFiles is a list of file names that are downloaded.
  2. for delFile in csvFiles: is a loop that iterates over each file name in the csvFiles list.
  3. os.unlink(delFile) is a function that deletes the file specified by delFile.

So, the code is deleting each file in the csvFiles list by calling the os.unlink() function on each file name.

In [9]:
# Download legacy data (in parallel)
DDF = dd.from_pandas(DF, npartitions=mp.cpu_count())
csvFiles = DDF.apply(lambda x : urlDownload(x[0]), axis=1, meta=pd.Series(dtype="str")).compute(scheduler='threads')
[########################################] | 100% Completed | 41.36 s

Assistant: This code is using the dask library to download a large number of files in parallel.

Here's a breakdown of what each line does:

  1. DDF = dd.from_pandas(DF, npartitions=mp.cpu_count()): This line creates a dask.dataframe.DataFrame (DDF) from a pandas dataframe (DF). The npartitions argument specifies the number of partitions to use when distributing the data across multiple CPUs. In this case, it's set to the number of CPUs available on the system.
  2. csvFiles = DDF.apply(lambda x : urlDownload(x[0]), axis=1, meta=pd.Series(dtype="str")): This line applies a function to each row of the DDF. The function takes a single argument x, which is a tuple containing the URL and the file name. The lambda function downloads the file and returns the file name. The axis=1 argument specifies that the function should be applied to each row of the DDF. The meta argument specifies the metadata for the resulting DDF, which in this case is a pandas Series with a string dtype.
  3. csvFiles.compute(scheduler='threads'): This line computes the DDF and returns a pandas Series containing the file names. The scheduler argument specifies the scheduler to use when distributing the computation across multiple CPUs. In this case, it's set to 'threads', which means that the computation will be distributed across multiple CPU threads.

Overall, this code is using dask to download a large number of files in parallel, and then returning a pandas Series containing the file names

3. Load and prepare data¶

In [10]:
# Define the columns to load
meta_cols = ['Date', 'Country', 'City', 'Specie']
main_column = 'median' # 'count', 'min', 'max', 'median', 'variance'
selected_cols = meta_cols + [main_column]

# Read the newest data file and skip the first 4 lines
DF = pd.read_csv(csvFile, skiprows=4, usecols=selected_cols)

# Leave EU data, rename main column to Value
selectEU = DF['Country'].isin(COUNTRY_CODES)
newTable = DF[selectEU].rename(columns={main_column: 'Value'})
print(newTable)
              Date Country      City         Specie  Value
19346   2024-01-02      HU  Debrecen            dew    2.5
19347   2024-01-08      HU  Debrecen            dew   -9.0
19348   2024-01-11      HU  Debrecen            dew   -6.0
19349   2024-01-17      HU  Debrecen            dew   -3.0
19350   2024-01-20      HU  Debrecen            dew   -5.0
...            ...     ...       ...            ...    ...
127266  2024-01-19      AT      Graz  precipitation   99.4
127267  2024-01-01      AT      Graz  precipitation    1.2
127268  2024-01-06      AT      Graz  precipitation    0.5
127269  2024-01-07      AT      Graz  precipitation    0.3
127270  2024-01-08      AT      Graz  precipitation   99.1

[36351 rows x 5 columns]

Assistant: This Python code is using the pandas library to read a CSV file and perform some data manipulation.

Here's a breakdown of what the code is doing:

  1. The first line defines a list of columns to load from the CSV file. The list includes the columns "Date", "Country", "City", and "Specie", as well as the main column to load, which is specified as "median".
  2. The second line defines a list of columns to load from the CSV file. The list includes all the columns defined in the first line, as well as the main column to load.
  3. The third line reads the CSV file using the pd.read_csv() function, skipping the first 4 lines of the file and only loading the columns specified in the second line.
  4. The fourth line creates a new DataFrame called "DF" by selecting only the rows where the "Country" column is in the list of EU countries (defined as "COUNTRY_CODES").
  5. The fifth line renames the main column to "Value".
  6. The final line prints the new DataFrame to the console.

In summary, this code reads a CSV file, selects only the rows where the "Country" column is in the list of EU countries, and renames the main column to "Value".

In [11]:
# Read legacy data files (sequentially)
fileNamesQ = [f for f in os.listdir('.') if re.match(r'^.*Q\d.csv$', f)]
DF = pd.concat((pd.read_csv(f, skiprows=4, usecols=selected_cols) for f in fileNamesQ), ignore_index=True)
selectEU = DF['Country'].isin(COUNTRY_CODES)
oldTable = DF[selectEU].rename(columns={main_column: 'Value'})
print(oldTable)
                Date Country      City    Specie   Value
27384     2020-05-30      AT  Salzburg      pm10    10.0
27385     2020-06-12      AT  Salzburg      pm10    10.0
27386     2020-04-14      AT  Salzburg      pm10    10.0
27387     2020-04-17      AT  Salzburg      pm10    19.0
27388     2020-04-25      AT  Salzburg      pm10    25.0
...              ...     ...       ...       ...     ...
10102970  2020-03-10      LT    Kaunas  pressure  1010.0
10102971  2020-01-04      LT    Kaunas  pressure  1003.0
10102972  2020-03-30      LT    Kaunas  pressure  1019.0
10102973  2020-03-31      LT    Kaunas  pressure  1021.5
10102974  2020-03-29      LT    Kaunas  pressure  1013.0

[2891334 rows x 5 columns]

Assistant: This is a Python code that reads legacy data files (sequentially) and concatenates them into a single DataFrame.

Here's a breakdown of the code:

  1. fileNamesQ = [f for f in os.listdir('.') if re.match(r'^.*Q\d.csv$', f)]: This line of code uses the os module to list all files in the current directory that match the pattern Q\d.csv. The re.match() function is used to match the file names against a regular expression that requires the file name to start with any character (.*) followed by the letter Q followed by any digit (\d). The resulting list of file names is stored in the variable fileNamesQ.
  2. DF = pd.concat((pd.read_csv(f, skiprows=4, usecols=selected_cols) for f in fileNamesQ), ignore_index=True): This line of code uses the pandas library to read each file in the fileNamesQ list using the pd.read_csv() function. The skiprows parameter is set to 4 to skip the first 4 rows of each file, which are likely to be header rows. The usecols parameter is set to selected_cols to specify the columns to read from each file. The resulting DataFrames are concatenated into a single DataFrame using the pd.concat() function. The ignore_index=True parameter is used to reset the index of the resulting DataFrame.
  3. selectEU = DF['Country'].isin(COUNTRY_CODES): This line of code uses the isin() method of the Series object to select only the rows where the value in the Country column is in the COUNTRY_CODES list. The resulting Series is stored in the variable selectEU.
  4. oldTable = DF[selectEU].rename(columns={main_column: 'Value'}): This line of code uses the rename() method of the DataFrame to rename the main_column column to Value. The resulting DataFrame is stored in the variable oldTable.
  5. print(oldTable): This line of code simply prints the oldTable DataFrame to the console.
In [12]:
# Read legacy data files (in parallel)
fileNamesQ = [f for f in os.listdir('.') if re.match(r'^.*Q\d.csv$', f)]
DF = dd.compute(dd.read_csv(fileNamesQ, skiprows=4, usecols=selected_cols))[0]
selectEU = DF['Country'].isin(COUNTRY_CODES)
oldTable = DF[selectEU].rename(columns={main_column: 'Value'})
print(oldTable)
[########################################] | 100% Completed | 9.01 s
              Date Country      City    Specie   Value
27384   2020-05-30      AT  Salzburg      pm10    10.0
27385   2020-06-12      AT  Salzburg      pm10    10.0
27386   2020-04-14      AT  Salzburg      pm10    10.0
27387   2020-04-17      AT  Salzburg      pm10    19.0
27388   2020-04-25      AT  Salzburg      pm10    25.0
...            ...     ...       ...       ...     ...
539568  2020-03-10      LT    Kaunas  pressure  1010.0
539569  2020-01-04      LT    Kaunas  pressure  1003.0
539570  2020-03-30      LT    Kaunas  pressure  1019.0
539571  2020-03-31      LT    Kaunas  pressure  1021.5
539572  2020-03-29      LT    Kaunas  pressure  1013.0

[2891334 rows x 5 columns]

Assistant: This Python code is using the Dask library to read and process large datasets in parallel.

Here's a breakdown of the code:

  1. fileNamesQ = [f for f in os.listdir('.') if re.match(r'^.*Q\d.csv$', f)]: This line uses the os module to list all files in the current directory that match the pattern Q\d.csv. The re.match function is used to match the file names against a regular expression that requires the file name to start with Q followed by a digit. The resulting list of file names is assigned to fileNamesQ.
  2. DF = dd.compute(dd.read_csv(fileNamesQ, skiprows=4, usecols=selected_cols))[0]: This line reads the CSV files in fileNamesQ using the dd.read_csv function, which is a Dask version of the Pandas read_csv function. The skiprows parameter is set to 4 to skip the first 4 rows of each file, and the usecols parameter is set to selected_cols to only read the columns specified in the selected_cols list. The compute function is used to execute the Dask computation in parallel. The resulting DataFrame is assigned to DF.
  3. selectEU = DF['Country'].isin(COUNTRY_CODES): This line selects all rows in DF where the value in the Country column is in the COUNTRY_CODES list. The isin function is used to check if the value in the Country column is in the COUNTRY_CODES list. The resulting boolean Series is assigned to selectEU.
  4. oldTable = DF[selectEU].rename(columns={main_column: 'Value'}): This line selects all rows in DF where the value in the selectEU column is True and renames the main_column to Value. The resulting DataFrame is assigned to oldTable.
  5. print(oldTable): This line prints the oldTable DataFrame to the console.
In [13]:
# Append old (2018-2023) and new (2024) data tables, sort, remove duplicates
DF = pd.concat([oldTable, newTable])
dataTableEU = DF.sort_values(by=['Country', 'City', 'Date']).drop_duplicates()
print(dataTableEU)
             Date Country    City       Specie   Value
10266  2018-12-31      AT    Graz         pm10    11.0
10392  2018-12-31      AT    Graz     pressure  1027.1
10505  2018-12-31      AT    Graz   wind-speed     1.5
10547  2018-12-31      AT    Graz           co     0.1
10673  2018-12-31      AT    Graz  temperature     3.7
...           ...     ...     ...          ...     ...
77191  2024-01-23      SK  Košice     humidity    94.0
77196  2024-01-23      SK  Košice    wind-gust     2.7
77209  2024-01-23      SK  Košice          dew    -4.0
77245  2024-01-23      SK  Košice  temperature    -3.3
77266  2024-01-23      SK  Košice          so2     4.6

[2771315 rows x 5 columns]

Assistant: This Python code is using the pandas library to manipulate and analyze data.

Here's a breakdown of what each line does:

  1. DF = pd.concat([oldTable, newTable]): This line combines the data from two tables, oldTable and newTable, into a single DataFrame called DF. The concat function is used to concatenate the two tables along the rows.
  2. dataTableEU = DF.sort_values(by=['Country', 'City', 'Date']).drop_duplicates(): This line sorts the data in the DF DataFrame by the columns Country, City, and Date, and then removes any duplicate rows.
  3. print(dataTableEU): This line simply prints the sorted and deduplicated data to the console.

In summary, this code combines data from two tables, sorts it by country, city, and date, and then removes any duplicate rows.

4. Check and select vars¶

In [14]:
# Weather conditions and pollutants (PM10, PM2.5, NO2, Ozone, SO2, CO)

# Calculate the proportion of each Species in the data table
all_vars = 100 * pd.value_counts(dataTableEU.Specie) / len(dataTableEU)

# Drop the variables that are not needed
drop_weat = ['pressure', 'wind-speed', 'wind-gust', 'wind speed', 'wind gust', 'dew', 'precipitation']
drop_poll = ['wd', 'aqi', 'uvi', 'pm1', 'neph', 'mepaqi']
keep_vars = set(all_vars.index) - set(drop_weat + drop_poll)

# Create a new data table with the info on kept variables
new_data_table = pd.DataFrame([all_vars[list(keep_vars)].sort_values(ascending=False)])
new_data_table.style.hide(axis="index")
Out[14]:
temperature humidity pm10 no2 o3 pm25 so2 co
9.893751 9.885307 9.635065 9.417840 8.339904 7.234833 6.704074 4.524098

Assistant: This code is using the pandas library in Python to perform some data manipulation and analysis.

Here's a breakdown of what each line does:

  1. all_vars = 100 * pd.value_counts(dataTableEU.Specie) / len(dataTableEU): This line calculates the proportion of each species in the data table by first counting the number of occurrences of each species using pd.value_counts(), then dividing the count by the total number of rows in the data table (len(dataTableEU)) to get a proportion. The result is stored in a variable called all_vars.
  2. drop_weat = ['pressure', 'wind-speed', 'wind-gust', 'wind speed', 'wind gust', 'dew', 'precipitation']: This line creates a list called drop_weat that contains the names of the variables that should be dropped from the data table.
  3. drop_poll = ['wd', 'aqi', 'uvi', 'pm1', 'neph', 'mepaqi']: This line creates a list called drop_poll that contains the names of the variables that should be dropped from the data table.
  4. keep_vars = set(all_vars.index) - set(drop_weat + drop_poll): This line creates a set called keep_vars that contains the names of the variables that should be kept in the data table. This is done by first getting the index of the all_vars DataFrame using all_vars.index, then subtracting the names of the variables to be dropped from the set using the - operator.
  5. new_data_table = pd.DataFrame([all_vars[list(keep_vars)].sort_values(ascending=False)]): This line creates a new data table called new_data_table that contains only the variables that should be kept, sorted in descending order by their proportion in the data table. The sort_values() method is used to sort the data table by the values in the all_vars DataFrame, and the ascending=False argument is used to sort in descending order.
  6. new_data_table.style.hide(axis="index"): This line hides the index of the new_data_table DataFrame, which is the default behavior of the pd.DataFrame() constructor. The hide() method is used to hide the index, and the axis="index" argument is used to specify that the index should be hidden.
In [15]:
# Descriptive statistics for daily values of selected variables
selectedVars = ['temperature', 'humidity']
selectedIdx = dataTableEU['Specie'].isin(selectedVars)
dataTableEU = dataTableEU[selectedIdx]
dataTableEU.groupby('Specie')['Value'].describe()
Out[15]:
count mean std min 25% 50% 75% max
Specie
humidity 273953.0 71.272162 16.891712 0.0 60.0 73.6 84.6 100.0
temperature 274187.0 12.625512 8.099295 -67.7 6.8 12.6 18.5 81.1

Assistant: This Python code is using the pandas library to obtain descriptive statistics of a dataset.

Here's a breakdown of the code:

  1. selectedVars = ['temperature', 'humidity']: This line defines a list of variables that the code will be working with. In this case, the list contains two variables: "temperature" and "humidity".
  2. selectedIdx = dataTableEU['Specie'].isin(selectedVars): This line uses the isin method to check if the values in the "Specie" column of the dataTableEU dataframe are in the selectedVars list. If a value is in the list, it will be included in the resulting dataframe.
  3. dataTableEU = dataTableEU[selectedIdx]: This line selects the rows of the dataTableEU dataframe that have values in the "Specie" column that are also in the selectedVars list.
  4. dataTableEU.groupby('Specie')['Value'].describe(): This line groups the data by the "Specie" column and calculates the descriptive statistics for the "Value" column. The describe method calculates the mean, standard deviation, minimum, maximum, and count for each group.

In summary, this code is selecting a subset of the data that only includes rows where the "Specie" column has values that are also in the selectedVars list, and then calculating the descriptive statistics for the "Value" column for each group.

In [16]:
# 2021-10-03 Barcelona fix
print(dataTableEU)
dataTableEU = dataTableEU.groupby(['Date', 'Country', 'City', 'Specie'])[['Value']].mean().reset_index()
print(dataTableEU)
             Date Country    City       Specie  Value
10673  2018-12-31      AT    Graz  temperature    3.7
10751  2018-12-31      AT    Graz     humidity   70.2
10637  2019-01-01      AT    Graz  temperature    3.0
10771  2019-01-01      AT    Graz     humidity   75.7
10710  2019-01-02      AT    Graz  temperature   -0.2
...           ...     ...     ...          ...    ...
77244  2024-01-21      SK  Košice  temperature   -2.5
77176  2024-01-22      SK  Košice     humidity   94.0
77249  2024-01-22      SK  Košice  temperature   -3.8
77191  2024-01-23      SK  Košice     humidity   94.0
77245  2024-01-23      SK  Košice  temperature   -3.3

[548140 rows x 5 columns]
              Date Country        City       Specie  Value
0       2018-12-31      AT        Graz     humidity   70.2
1       2018-12-31      AT        Graz  temperature    3.7
2       2018-12-31      AT   Innsbruck     humidity   93.0
3       2018-12-31      AT   Innsbruck  temperature    0.0
4       2018-12-31      AT        Linz     humidity   92.0
...            ...     ...         ...          ...    ...
548132  2024-01-23      SE     Uppsala  temperature    4.0
548133  2024-01-23      SK  Bratislava     humidity   91.0
548134  2024-01-23      SK  Bratislava  temperature   -1.8
548135  2024-01-23      SK      Košice     humidity   94.0
548136  2024-01-23      SK      Košice  temperature   -3.3

[548137 rows x 5 columns]

Assistant: This Python code is using the pandas library to perform some data manipulation and analysis.

Here's a breakdown of what each line does:

  1. print(dataTableEU): This line simply prints the contents of the dataTableEU variable to the console.
  2. dataTableEU = dataTableEU.groupby(['Date', 'Country', 'City', 'Specie'])[['Value']].mean().reset_index(): This line groups the data in the dataTableEU variable by the columns 'Date', 'Country', 'City', and 'Specie', and then calculates the mean of the 'Value' column for each group. The resulting data is then reset to its original index.

In summary, this code is grouping the data in dataTableEU by the specified columns and calculating the mean of the 'Value' column for each group. The resulting data is then reset to its original index.

5. Pivot and calculate THI¶

In [17]:
# Create pivot table, calculate THI for each row, drop rows with missing THI values
dataTableTHI = dataTableEU.pivot_table(index=['Date', 'Country', 'City'], columns='Specie', values='Value').reset_index()
dataTableTHI["THI"] = 0.8 * dataTableTHI.temperature + (dataTableTHI.humidity/100)*(dataTableTHI.temperature-14.4) + 46.4
dataTableTHI = dataTableTHI.dropna(subset=["THI"])
print(dataTableTHI)
Specie        Date Country        City  humidity  temperature      THI
0       2018-12-31      AT        Graz      70.2          3.7  41.8486
1       2018-12-31      AT   Innsbruck      93.0          0.0  33.0080
2       2018-12-31      AT        Linz      92.0          3.3  38.8280
3       2018-12-31      AT    Salzburg      96.5          1.0  34.2690
4       2018-12-31      AT      Vienna      70.0          4.5  43.0700
...            ...     ...         ...       ...          ...      ...
274188  2024-01-23      SE       Malmö      89.0          4.4  41.0200
274189  2024-01-23      SE   Stockholm      75.0          5.0  43.3500
274190  2024-01-23      SE     Uppsala      80.0          4.0  41.2800
274191  2024-01-23      SK  Bratislava      91.0         -1.8  30.2180
274192  2024-01-23      SK      Košice      94.0         -3.3  27.1220

[273944 rows x 6 columns]

Assistant: This Python code creates a pivot table from a data table called dataTableEU and calculates the THI (Temperature Humidity Index) for each row. The THI is calculated using the following formula:

THI = 0.8 * temperature + (humidity/100) * (temperature - 14.4) + 46.4

The pivot table is then reset to its original index and the rows with missing THI values are dropped using the dropna() method. Finally, the resulting pivot table is printed to the console.

Here's a breakdown of the code:

  1. dataTableTHI = dataTableEU.pivot_table(index=['Date', 'Country', 'City'], columns='Specie', values='Value').reset_index(): This line creates a pivot table from the dataTableEU data table, with the Date, Country, and City columns as the index, and the Specie column as the columns. The values parameter is set to 'Value' to specify that the values in the Value column should be used as the values in the pivot table. Finally, the reset_index() method is called to reset the index of the pivot table to its original values.
  2. dataTableTHI["THI"] = 0.8 * dataTableTHI.temperature + (dataTableTHI.humidity/100)*(dataTableTHI.temperature-14.4) + 46.4: This line calculates the THI for each row in the pivot table.
  3. dataTableTHI = dataTableTHI.dropna(subset=["THI"]): This line drops the rows from the pivot table that have missing values in the THI column. The dropna() method is called with the subset parameter set to ["THI"] to specify that only the rows with missing values in the THI column should be dropped.
  4. print(dataTableTHI): This line prints the resulting pivot table to the console.

6. THI statistics and plots¶

In [18]:
# Descriptive statistics for yearly THI
myCity = 'Kaunas'
myTable = dataTableTHI[(dataTableTHI['City']==myCity) & (dataTableTHI['Date']>='2019-01-01')][["Date", "THI"]]
myTable.groupby(pd.to_datetime(myTable['Date']).dt.year).describe()
Out[18]:
Specie THI
count mean std min 25% 50% 75% max
Date
2019 362.0 49.242705 12.738765 13.640 38.9960 49.53540 59.722625 75.0560
2020 348.0 50.388563 10.801356 28.020 40.6560 50.18475 59.673125 71.4790
2021 362.0 46.690309 14.882599 2.696 35.8275 47.17400 58.801250 74.7400
2022 281.0 50.237457 12.161641 21.632 39.6800 50.30800 60.312000 73.6255
2023 281.0 44.728998 11.624320 17.000 34.8800 44.34750 53.772000 71.9085
2024 23.0 25.604696 8.608295 3.476 21.2165 25.91000 33.152000 37.2840

Assistant: This code is using the pandas library in Python to obtain descriptive statistics for a dataset containing THI (Temperature Humidity Index) values for a specific city (Kaunas) over a period of time.

Here's a breakdown of the code:

  1. myCity = 'Kaunas': This line sets the variable myCity to the string value 'Kaunas'.
  2. myTable = dataTableTHI[(dataTableTHI['City']==myCity) & (dataTableTHI['Date']>='2019-01-01')][["Date", "THI"]]: This line selects a subset of the dataTableTHI dataframe that contains only the rows where the City column is equal to myCity and the Date column is starting with year 2019. The [["Date", "THI"]] syntax is used to select only the Date and THI columns from the resulting dataframe.
  3. myTable.groupby(pd.to_datetime(myTable['Date']).dt.year).describe(): This line groups the myTable dataframe by the year component of the Date column (using the dt.year attribute of the pd.to_datetime function) and then applies the describe method to each group. The describe method calculates various summary statistics for each group, such as the mean, standard deviation, minimum, maximum, and count.

In summary, this code is using the pandas library to perform descriptive statistics on a subset of the dataTableTHI dataframe that contains only the rows for a specific city and a specific time period, and then grouping the results by year to calculate summary statistics for each year.

In [19]:
# Plot THI time series calendar heatmap
pdTimeSeries = pd.Series(myTable['THI'].values, index=pd.DatetimeIndex(myTable['Date']))
logging.getLogger('matplotlib.font_manager').disabled = True
cp = calplot.calplot(pdTimeSeries, dropzero=True, cmap='coolwarm', suptitle='THI @ ' + myCity)

Assistant: This code is using the matplotlib library to create a calendar heatmap of the THI (Temperature Humidity Index) time series data from a pandas dataframe called myTable.

Here's a breakdown of the code:

  1. pdTimeSeries = pd.Series(myTable['THI'].values, index=pd.DatetimeIndex(myTable['Date'])): This line creates a pandas Series object called pdTimeSeries from the THI column of the myTable dataframe. The values attribute of the THI column is used to create the values of the Series, and the index attribute is used to create the index of the Series, which is a DatetimeIndex object created from the Date column of the dataframe.
  2. logging.getLogger('matplotlib.font_manager').disabled = True: This line disables the logging of font warnings in the matplotlib.font_manager module. This is necessary because the calplot function uses the matplotlib.font_manager module to create the calendar heatmap, and it can generate a lot of warnings if the font used in the heatmap is not available on the system.
  3. calplot.calplot(pdTimeSeries, dropzero=True, cmap='coolwarm', suptitle='THI @ ' + myCity): This line calls the calplot function from the calplot module, passing in the pdTimeSeries object as the first argument. The dropzero argument is set to True to drop the zero values from the heatmap, and the cmap argument is set to 'coolwarm' to use the coolwarm colormap. The suptitle argument is set to 'THI @ ' + myCity to set the title of the heatmap to 'THI @ ' + myCity.

Overall, this code is creating a calendar heatmap of the THI time series data from the myTable dataframe, with the coolwarm colormap and a title that includes the city name.

In [20]:
# Plot THI time series dynamics
ts = pdTimeSeries.plot(title='THI @ ' + myCity, ylabel="THI", xlabel="Time", lw=1, grid=True)

Assistant: This code is using the plot() method of the pdTimeSeries object to create a time series plot of the THI (Temperature Humidity Index) values. The title parameter is set to "THI @ " + myCity, which means that the title of the plot will be "THI @ [myCity]". The lw parameter is set to 1, which means that the line width of the plot will be 1. The grid parameter is set to True, which means that a grid will be displayed on the plot.

7. Export results to .html¶

After running all notebook blocks, save .ipynb and use nbconvert to get .html

  1. Save notebook with results: File / Save (Ctrl+S)
  2. Download notebook: File / Download / Download .ipynb
  3. Upload .ipynb back to Google Colab: File / Upload notebook
  4. Run the following command-line: jupyter nbconvert --to html
In [ ]:
#!jupyter nbconvert --to html ASSIST_Air_Data_Python.ipynb

Assistant: This block contains a command-line to convert a Jupyter notebook file named ASSIST_Air_Data_Python.ipynb to an HTML file.