Air Quality in Madrid (2001-2018)

Different pollution levels in Madrid from 2001 to 2018

License: Other (specified in description)

Tags: earth and nature, environment, pollution

Context

In recent years, high levels of pollution during certain dry periods in Madrid have forced the authorities to take measures against the use of cars in the city center, and have been used as a reason to propose drastic modifications to the city’s urbanism. Thanks to Madrid’s City Council Open Data website, the air quality data has been uploaded and is publicly available. There are several files available, including daily and hourly historical data of the levels registered from 2001 to 2018 and the list of stations being used to analyze pollution and other particles in the city.

However, when exploring this data from a data analysis and time series point of view, we found the format confusing and uncommon, and some design decisions in the dataset far from optimal. The hourly data was split into monthly files whose formats changed slightly over the years, and whose layout was equally uncommon: each row holds one measure on one day and contains 24 columns (one per hour of the day), each including a control character. This control character is V if the measurement is valid, and mostly (but not exclusively) N if not.
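To make the original layout concrete, here is a minimal sketch of how such rows could be unfolded with pandas. The column names (H01, V01 and so on), the station code S1 and the measure labels are hypothetical placeholders chosen for illustration, not the names used in the City Council files, and only three hours are filled in to keep the example short.

```python
import pandas as pd

# Hypothetical wide layout: one row per (station, measure, day), with a value
# column Hxx and a control-character column Vxx for each hour of the day.
HOURS = [1, 2, 3]

wide = pd.DataFrame({
    "station": ["S1", "S1"],
    "measure": ["NO_2", "O_3"],
    "date": ["2018-01-01", "2018-01-01"],
    "H01": [41.0, 12.0], "V01": ["V", "V"],
    "H02": [38.0, 0.0], "V02": ["V", "N"],
    "H03": [35.0, 15.0], "V03": ["V", "V"],
})


def wide_to_long(df, hours):
    """Turn one-column-per-hour rows into one row per timestamp and measure."""
    keys = ["station", "measure", "date"]
    pieces = []
    for h in hours:
        piece = df[keys].copy()
        piece["value"] = df[f"H{h:02d}"]
        piece["flag"] = df[f"V{h:02d}"]
        # Assumption: hour h is stamped as h:00 of the same day.
        piece["timestamp"] = pd.to_datetime(df["date"]) + pd.to_timedelta(h, unit="h")
        pieces.append(piece)
    long = pd.concat(pieces, ignore_index=True)
    # Keep only readings explicitly marked valid; any other flag is dropped.
    long = long[long["flag"] == "V"]
    long = long.drop(columns=["date", "flag"])
    return long.sort_values(["measure", "timestamp"]).reset_index(drop=True)


long = wide_to_long(wide, HOURS)
print(long)
```

Filtering on the flag being exactly V, rather than on it not being N, is a deliberate choice here, since the invalid marker is not always N.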

These handicaps when exploring the historical data can defeat the purpose of Open Data: to be publicly audited, and to be freely explored and used for experimentation. For that reason, at Decide we are releasing our own version of the data, designed for ease of use with common standards and performant formats. This allows us to ship a faster, smaller, more convenient and more intuitively structured dataset.

Content

All the data is extracted from the original files and processed to result in a more convenient format for typical Kaggle purposes.
While the original data includes hours as different columns and measurements as different rows, this version is structured the other way round: each row is timestamped and the columns are the different measures taken at that point in time at a certain station. This allows faster preparation for time series analysis and prediction tasks.
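As an illustration of this layout, the sketch below pivots a small table of long-format readings into one timestamp-indexed DataFrame per station. The station codes, measure labels and values are made up for the example, and station_frame is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Made-up long-format readings: one row per station, timestamp and measure.
readings = pd.DataFrame({
    "station": ["S1"] * 4 + ["S2"] * 2,
    "timestamp": pd.to_datetime([
        "2018-01-01 01:00", "2018-01-01 01:00",
        "2018-01-01 02:00", "2018-01-01 02:00",
        "2018-01-01 01:00", "2018-01-01 02:00",
    ]),
    "measure": ["NO_2", "O_3", "NO_2", "O_3", "NO_2", "NO_2"],
    "value": [41.0, 12.0, 38.0, 14.0, 55.0, 50.0],
})


def station_frame(df, station):
    """Rows are timestamps, columns are the measures taken at one station."""
    subset = df[df["station"] == station]
    table = subset.pivot(index="timestamp", columns="measure", values="value")
    table.columns.name = None
    return table.sort_index()


s1 = station_frame(readings, "S1")
s2 = station_frame(readings, "S2")  # this station only reports one measure

# With a timestamp index, typical time series steps become one-liners,
# for example averaging the hourly readings per day.
daily = s1.resample("D").mean()
print(s1)
print(daily)
```

Each station ends up with only the columns it actually measures, which mirrors the fact that stations carry different equipment.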

This dataset defines stations as the highest hierarchical level: each station’s history can be extracted individually from the file for further study. Each station’s DataFrame contains all the particle measurements that the station registered in the period 2001/01 – 2018/04 (if it was active the whole time). Not every station has the same equipment, so each station can measure only a certain subset of particles. The possible measurements and their explanations follow the original explanation document.

The master DataFrame is also included in the file; it contains information about the active stations. Note that only active stations are included, since the Open Data files do not provide information about stations that have ceased activity.

Using this hierarchical structure, we can store the data in an HDF5 file, which is also compressed and gives great performance when accessing contiguous data (which is the case in this time-indexed design). These modifications make it possible to encapsulate the same information that the original page provides in monthly files adding up to 250MiB in a single, structured file of just 74MiB. Since some people may not be familiar with the HDF5 format yet, we provide some snippets to make it easier for you to start exploring the data in Python. You can find a short introduction to the HDF5 format in this kernel.
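For readers new to HDF5, the following is a minimal sketch of how such a file could be explored with pandas. The helper names are hypothetical, and the file path and table keys are left as parameters because the exact names used in the file are not stated here; list the keys first to see what is available. Reading HDF5 through pandas needs the optional PyTables package.

```python
import pandas as pd


def list_tables(path):
    """Return the keys of every table stored in the HDF5 file."""
    # pandas relies on the optional PyTables package for HDF5 access.
    with pd.HDFStore(path, mode="r") as store:
        return store.keys()


def load_table(path, key):
    """Load one table (for example a single station's history) by key."""
    with pd.HDFStore(path, mode="r") as store:
        return store[key]


def load_period(path, key, start, end):
    """Load a table and keep only the rows between two timestamps."""
    frame = load_table(path, key)
    # Assumes the table is indexed by timestamp, as in the time-indexed
    # design described above.
    return frame.sort_index().loc[start:end]
```

A typical session would call list_tables on the downloaded file, pick the key of the master table or of one station, and pass it to load_table or load_period.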

However, if for some reason using HDF5 is still inconvenient for you, this dataset also provides a zip archive containing the same information in plain-text CSV files, plus a stations.csv file equivalent to the master DataFrame. These CSV files still benefit from the data reorganization, but without the performance advantages of HDF5 they are much heavier (174MiB compressed, 500MiB uncompressed).
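The CSV files can be read straight from the zip archive without unpacking it first. The sketch below builds a tiny in-memory archive so that it runs without the real download; apart from stations.csv, the member name, the column names and the values are assumptions made for the example.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(archive, member, **kwargs):
    """Read a single CSV member of a zip archive into a DataFrame."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as handle:
            return pd.read_csv(handle, **kwargs)


# Stand-in archive with made-up contents, replacing the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("stations.csv", "id,name\nS1,Demo station\n")
    zf.writestr(
        "demo_2018.csv",
        "date,NO_2,station\n"
        "2018-01-01 01:00:00,41.0,S1\n"
        "2018-01-01 02:00:00,38.0,S1\n",
    )

stations = read_csv_from_zip(buffer, "stations.csv")
# Parsing the date column and using it as the index gives a time-indexed
# frame comparable to the one stored in the HDF5 file.
hourly = read_csv_from_zip(buffer, "demo_2018.csv", parse_dates=["date"], index_col="date")
print(stations)
print(hourly)
```

With the real archive, pass its path instead of the in-memory buffer and use zipfile.ZipFile(path).namelist() to discover the member names.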

Source and Licensing

All the data present in this dataset comes from Madrid’s City Council Open Data website, whose maintainers are the ones to be acknowledged for the data collection. This dataset aims to provide a more convenient format for data scientists, as well as some enhanced context in a single place.

The data therefore inherits the Madrid Open Data Terms of Use, which allow free commercial and non-commercial use and accept no liability for the data. For more details about the licensing, please refer back to the aforementioned document detailing the terms of use (in Spanish).

Inspiration

This dataset was created out of frustration with how inconvenient and irregular the historical data provided on the Open Data website was. It contains, in a practical format, 18 years (2001-2018) of hourly data in a single file, which makes it a great playground for time series analysis and other prediction tasks. How do the levels of different gases correlate? Are there any changes in trends? Can they be mapped to the recent decisions made by the city council, or do they relate to rainy dates? What is the best model to predict pollution levels? How do the levels interpolate between the locations of the stations? Are some gases more common at different elevations? We are looking forward to seeing what you can come up with!
