Methodology of the HazeGazer Platform

The HazeGazer platform is a pilot to test the crowdsourcing of data around the issue of air pollution in Ulaanbaatar. The intent is to make this crowdsourcing part of the intervention into this issue. To this end, the platform was created as a public project to engage the citizenry collectively.

There are several pieces of the HazeGazer platform:

  • Crowdsourced data collection of respiratory illnesses via our chatbot on the Agaarneg Facebook page.
  • Air quality interpolation to estimate air quality in between existing governmental air quality monitors.
  • Reporting to analyze the correlations between these crowdsourced data and air quality.

Each of these pieces has its own limitations. It is our hope that acknowledging these limitations will help others better understand this project and improve upon similar projects in the future.

Limitations
Before we describe the methodology of each of the elements of the platform, we believe it is important to acknowledge the limitations inherent to each.
 

Crowdsourced Data Collection

HazeGazer collects data in an automated way via its chatbot on Facebook Messenger. In an in-person survey, the interviewer knows that the respondent is a real person. With an online survey, this is not necessarily the case: it is relatively common for an individual in Mongolia to have multiple Facebook accounts.

We are also relying on citizens to accurately self-report, and we have no way of independently verifying an individual survey. This is of course the same for any survey. In addition, as we are asking the citizenry to self-report, we should not expect our samples to necessarily be representative of the population.
To mitigate both of these issues we allow respondents to fill out a survey only once a week. This limits over-reporting by any one group and makes it more difficult to flood the system with inaccurate or false reports.

Nevertheless, these limitations mean that the data collected via these surveys should not be considered authoritative or of the same quality as large-scale, randomly sampled in-person surveys.

Air Quality Interpolation

On the homepage map there is a “heatmap” that shows estimated air quality around Ulaanbaatar. These estimates are derived from a total of 13 governmental air quality monitors around the city (12 via the Mongolian government and 1 at the US Embassy). These stations give citizens valuable information for making informed decisions about their daily lives. However, air quality can vary greatly around the city, and citizens may not know which station best reflects the air quality near their home. In addition, hourly air quality measurements do not necessarily give an accurate picture of air quality over time.

To give users a more general understanding of air quality near their home, the platform calculates a 7-day average of air quality each day, then creates an estimate for every khoroo in Ulaanbaatar (excluding Nalaikh and Bagakhangai, which do not have available data).

Given the small number of air quality sensors in Ulaanbaatar, and the fact that air pollution can vary so much geographically, these estimates may become less accurate as the distance from the nearest air quality station increases. During testing our model proved to be quite accurate, but it should be understood that these estimates are for informational purposes only.

These estimates are designed to communicate air pollution data in a way that is easier for those unfamiliar with air pollution measurements. As such, instead of AQI numbers, which can be abstract, we show the estimated AQI category according to the Mongolian particulate matter standards, which have easily understandable names. This means that the outputs of our model, which are AQI values, are converted to AQI categories. As a result there is a loss of reporting accuracy for the sake of communication.

Note: See below for a technical explanation of this process.

Correlation Reporting

As this platform is a pilot, the reporting of our data is somewhat experimental. In general, we hope to find correlations between air pollution and crowdsourced reports of respiratory illness. However, our ability to do this will depend on several factors.

First, if we are unable to consistently get enough reports for a small geographic area each week (for example a microdistrict/khoroolol) then we will not be able to generalize a trend. This may be a large limitation, especially considering that certain areas of Ulaanbaatar are disparately impacted by air pollution.

Second, survey respondents will have to initially go to the platform or the Agaarneg Facebook page in order to subscribe to the chatbot. This means that we have a somewhat self-selecting sample of citizens who are concerned about air pollution. It is possible that this group of people may have a different distribution of respiratory illness.

Third, it can be assumed that as air pollution worsens through the winter, more residents of Ulaanbaatar may find the HazeGazer platform. This could mean that there may be inconsistent reporting during different time periods.

To mitigate these issues, reporting of the data for this platform will take these factors into consideration. If sample sizes are too small to generalize an estimate, we will not attempt to do so. Also, a “baseline” phone survey of 300 households was completed in November 2019. This was a random sample in selected khoroos, which will help us gauge possible sampling issues with the crowdsourced data.

Crowdsourced Data Collection Methodology

Crowdsourced data collection is a key piece of this pilot project. A Facebook Messenger chatbot is used to survey citizens in Ulaanbaatar about respiratory illness. The methodology for this data collection was somewhat modeled on the idea of a longitudinal study. This is how the chatbot works:

  • A user visits the platform and is directed to the chatbot via a link. Alternatively a user can visit the chatbot directly via Facebook.
  • When a user visits the chatbot for the first time several demographic questions are asked. On subsequent visits these questions are not asked again to make survey responses easier. The demographic data collected includes:
    • Home location (district and khoroo)
    • Age
    • Gender
    • Family size
    • Number of children in the household
  • The user is then presented with a survey tree in which their answers determine which question they move to. The first question is, “Have you or another member of your family had a respiratory illness in the past week?”.
    • If the answer is no, then the survey ends.
    • If yes, then further questions are asked to get details about who was sick, for how long, if they missed school or work, and for how many days.
  • On completion of the survey the user is given an estimated 7-day average air quality category for their home khoroo.
  • The user then has an option to fill in a free response question to describe how air pollution impacts them and/or their family.
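
The branching in the survey steps above can be sketched as a small function. This is a minimal illustration, not the chatbot's actual code: the `ask` callback, the answer keys, and the wording of the follow-up questions are all hypothetical, and every follow-up is asked after a "yes" for simplicity.

```python
from typing import Callable, Dict

# Hypothetical follow-up questions; keys and wording are illustrative only.
FOLLOW_UPS = {
    "who": "Who was sick?",
    "days_sick": "For how many days were they sick?",
    "missed": "Did they miss school or work?",
    "days_missed": "For how many days?",
}


def run_survey(ask: Callable[[str], str]) -> Dict[str, str]:
    """Walk the survey tree, calling `ask` to obtain each answer."""
    first = ask(
        "Have you or another member of your family had a "
        "respiratory illness in the past week?"
    )
    answers = {"anyone_sick": first.strip().lower()}
    # Any answer other than "yes" ends the survey immediately.
    if answers["anyone_sick"] != "yes":
        return answers
    # Otherwise collect the details of the illness.
    for key, question in FOLLOW_UPS.items():
        answers[key] = ask(question)
    return answers
```

Passing the question-asking step in as a callback keeps the tree logic separate from the messaging channel, so the same function can be driven by scripted answers when testing.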
 
After completing the survey, the user is then “subscribed” to the chatbot. Each week they will be sent a message to prompt them to complete another survey. A user will only be able to submit one survey each week.
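
One way to enforce the weekly limit is sketched below. It assumes that "each week" means a rolling 7-day window since the user's last submission; the platform may instead use calendar weeks. The function name and the constant are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the weekly limit is a rolling 7-day window.
MIN_INTERVAL = timedelta(days=7)


def can_submit(last_submitted: Optional[datetime], now: datetime) -> bool:
    """Return True if the user may submit a new survey at `now`."""
    if last_submitted is None:  # first-time respondent
        return True
    return now - last_submitted >= MIN_INTERVAL
```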
 
Each survey is connected to the user's demographic data to enable a longitudinal study for that user. Over time we hope there will be enough participants on the platform to be able to generalize trends.
 
AQI Interpolation Methodology

Note: This section is a technical explanation of the AQI interpolation model used on the HazeGazer platform.

Air quality in Ulaanbaatar can vary significantly even over a short distance. This means that if you live between two air quality monitors, they may not be giving you an accurate measure of air quality in your area. In addition, hourly air quality, which is the type readily available, does not necessarily give you a sense of air quality over a longer time scale.

Interpolating air quality is similar in many ways to interpolating things like atmospheric CO2 concentrations, ground water contamination, or other spatial statistics problems. In Ulaanbaatar there are 13 government air quality stations, 12 from the Mongolian government (via Agaar.mn) and 1 at the US Embassy. Some of these stations measure both PM2.5 and PM10, and others only one type. These stations are spread across the city; however, there are more stations for PM10 monitoring than for PM2.5.

PM2.5 Stations:

  • Tolgoit
  • MNB
  • Wrestling Palace
  • Baruun 4 Zam
  • Nisekh
  • Amgalan
  • Bayankhoshuu
  • 1 Khoroolol
  • US Embassy

PM10 Stations:

  • Mongol gazar
  • 100 ail
  • Wrestling Palace
  • Baruun 4 Zam
  • Misheel Expo
  • Nisekh
  • Amgalan
  • MNB
  • Tolgoit
  • Urgakh naran
  • 1 Khoroolol
  • Bayankhoshuu

General Methodology

Each day, the previous 7 days of hourly air quality are collected from the platform’s database. The measurements include the stations listed above. Averages are then calculated per station for both PM2.5 and PM10, using the Mongolian AQI standard.

  • If a station is missing more than 50% of the hourly measurements for that week it is excluded from the model for that day.
  • The US Embassy reports its data as both physical PM2.5 measurements and US AQI. To convert this to Mongolian AQI, a 3-hour moving average of the physical values is taken, and this average is then passed through the piecewise linear function published by the Mongolian Ministry of Environment. This follows the process the Ministry uses to convert the physical values for the Agaar.mn stations.
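
The two rules above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the platform's code: missing hours are assumed to be stored as NaN, the moving average is assumed to run over a gap-free series, and the breakpoint tables are placeholders that stand in for the Ministry's published function.

```python
import numpy as np

# PLACEHOLDER breakpoints for illustration only. The real values come from
# the piecewise linear function published by the Ministry of Environment.
PM25_BREAKPOINTS = [0.0, 35.0, 50.0, 100.0, 250.0]  # physical PM2.5 values
AQI_BREAKPOINTS = [0.0, 50.0, 100.0, 250.0, 400.0]  # corresponding AQI


def weekly_station_average(hourly_aqi):
    """Mean of one station's hourly AQI for the week, or None if excluded.

    Assumes `hourly_aqi` has one entry per hour, with NaN for missing hours.
    """
    values = np.asarray(hourly_aqi, dtype=float)
    # Exclude the station when more than 50% of the hours are missing.
    if np.isnan(values).mean() > 0.5:
        return None
    return float(np.nanmean(values))


def embassy_pm25_to_aqi(pm25_hourly):
    """3-hour moving average followed by a piecewise linear conversion."""
    pm25 = np.asarray(pm25_hourly, dtype=float)
    # "valid" keeps only windows that contain three full hours.
    smoothed = np.convolve(pm25, np.ones(3) / 3.0, mode="valid")
    # np.interp interpolates linearly between consecutive breakpoints.
    return np.interp(smoothed, PM25_BREAKPOINTS, AQI_BREAKPOINTS)
```

Note that `np.interp` clamps inputs outside the breakpoint range to the end values, so a real table would need to cover the full range of observed concentrations.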

The average values are then used to create two separate models, one for PM2.5 and one for PM10. Predictions are made using a grid of over 17,000 coordinates around Ulaanbaatar. Each point is separated by 0.002 degrees of latitude and longitude (7.2 seconds).
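
A prediction grid with this spacing could be built as in the sketch below. The bounding box is a hypothetical assumption, since the platform's actual extent is not stated here, so the point count it produces will not match the figure above.

```python
import numpy as np

STEP = 0.002  # degrees of latitude and longitude (7.2 arcseconds)

# HYPOTHETICAL bounding box; the platform's real extent is not given.
LON_MIN, LON_MAX = 106.70, 107.10
LAT_MIN, LAT_MAX = 47.85, 47.97


def make_grid(lon_min, lon_max, lat_min, lat_max, step=STEP):
    """Return an (n_points, 2) array of (lon, lat) pairs on a flat grid."""
    # Count the points first to avoid floating-point drift at the far edge.
    n_lon = int(round((lon_max - lon_min) / step)) + 1
    n_lat = int(round((lat_max - lat_min) / step)) + 1
    lons = lon_min + step * np.arange(n_lon)
    lats = lat_min + step * np.arange(n_lat)
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    return np.column_stack([lon_grid.ravel(), lat_grid.ravel()])


grid = make_grid(LON_MIN, LON_MAX, LAT_MIN, LAT_MAX)
```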

The grid created is a flat grid. Due to the relatively small size of Ulaanbaatar, a spheroid conversion is not necessary.

The output grid of predicted points is then used to generate a contour plot. This plot uses the estimated AQI and converts it to the corresponding Mongolian AQI category color.
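
The conversion from predicted AQI to a category can be sketched as a simple binning step. The thresholds and labels below are placeholders for illustration only; the real categories and their colors are defined by the Mongolian particulate matter standard.

```python
import numpy as np

# PLACEHOLDER upper bounds and labels, not the official standard.
CATEGORY_UPPER_BOUNDS = [50, 100, 250, 400]
CATEGORY_LABELS = ["category 1", "category 2", "category 3",
                   "category 4", "category 5"]


def aqi_to_category_index(aqi_values):
    """Map AQI values to 0-based category indices.

    With right=True, a value equal to an upper bound stays in the
    lower category (for example, 50 maps to index 0).
    """
    return np.digitize(aqi_values, CATEGORY_UPPER_BOUNDS, right=True)
```

Under these placeholder thresholds, predicted values of 51 and 100 land in the same category, which illustrates the loss of precision that comes with reporting categories rather than numbers.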

 

Model Used

To create these estimates, a Gaussian Process Regression algorithm with a Matern kernel (prior function) was chosen. In the spatial statistics domain this method is known as Kriging. The Gaussian Process uses a specified function (the kernel) to create a prior distribution. This prior is combined with the observed data to create a posterior distribution, and the output of the model is a predictive distribution. The popular Python package Scikit-learn was used for both model creation and testing.
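
A minimal sketch of such a model in Scikit-learn is shown below. The station coordinates and AQI values are synthetic, and the kernel settings (the `nu` value, the initial length scale, and the added constant and noise terms) are assumptions; the platform's actual configuration is not specified here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# SYNTHETIC stand-ins for station coordinates (lon, lat) and 7-day mean AQI.
stations = np.array([
    [106.80, 47.92], [106.85, 47.90], [106.90, 47.93],
    [106.95, 47.91], [107.00, 47.92], [106.88, 47.95],
])
weekly_aqi = np.array([180.0, 140.0, 120.0, 110.0, 95.0, 160.0])

# ASSUMED kernel settings; only the use of a Matern kernel is documented.
kernel = (ConstantKernel(1.0) * Matern(length_scale=0.05, nu=1.5)
          + WhiteKernel(1.0))
model = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                 random_state=0)
model.fit(stations, weekly_aqi)

# Predictive mean and standard deviation at two unmonitored locations.
query = np.array([[106.87, 47.92], [107.05, 47.90]])
mean, std = model.predict(query, return_std=True)
```

The standard deviation returned alongside the mean is one reason this family of models suits sparse sensor networks: it typically grows as a query point moves away from the observed stations.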

Gaussian Process Regression has several benefits, chief among them that it can learn and perform well on small datasets. While technically machine learning, this method is also in the domain of spatial statistics. Also, while Gaussian Process Regression shares some similarities with linear interpolation methods, it is technically a non-linear interpolation technique.

Model Testing

To test the model, a leave-one-out cross-validation technique was used. This technique leaves one station out of the training dataset during model training and then predicts the value at that station. The process is repeated for each station. The errors are then averaged to estimate the predictive power of the model on unseen data.
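
The procedure can be sketched with Scikit-learn's `LeaveOneOut` splitter. The data are synthetic and the kernel settings are assumptions, so the resulting error is illustrative only and says nothing about the platform's real accuracy.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.model_selection import LeaveOneOut

# SYNTHETIC station coordinates (lon, lat) and 7-day mean AQI values.
coords = np.array([
    [106.80, 47.92], [106.85, 47.90], [106.90, 47.93],
    [106.95, 47.91], [107.00, 47.92], [106.88, 47.95],
])
values = np.array([180.0, 140.0, 120.0, 110.0, 95.0, 160.0])


def loo_absolute_errors(coords, values):
    """Absolute prediction error at each station when it is held out."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(coords):
        # ASSUMED kernel; a fresh model is fitted for every held-out station.
        kernel = Matern(length_scale=0.05, nu=1.5) + WhiteKernel(1.0)
        model = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                         random_state=0)
        model.fit(coords[train_idx], values[train_idx])
        prediction = model.predict(coords[test_idx])
        errors.append(abs(prediction[0] - values[test_idx][0]))
    return np.array(errors)


errors = loo_absolute_errors(coords, values)
mean_absolute_error = errors.mean()
```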

To test the methodology above, data from November 2019 to November 2020 was used. Rolling weeks were tested using the leave-one-out method described above. This resulted in a mean absolute AQI error of 47 for PM10 and 33 for PM2.5.

Next, a confidence interval was calculated using a two-tailed t-test. For PM10 it was ±16 and for PM2.5 it was ±9. This gives reasonably high confidence that predicted AQI will not vary significantly from actual levels. However, as with all interpolations, large errors can exist, which is why the model outputs are being communicated as estimates.
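
One plausible reading of this step is a two-sided t interval around the mean of the cross-validation errors, sketched below with SciPy. The 95% level and the choice to apply it to the raw errors are assumptions; the exact procedure used for the figures above is not described.

```python
import numpy as np
from scipy import stats


def t_interval_half_width(errors, confidence=0.95):
    """Half-width of a two-sided t interval around the mean of `errors`."""
    errors = np.asarray(errors, dtype=float)
    n = errors.size
    # Standard error of the mean, using the sample standard deviation.
    standard_error = errors.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value with n - 1 degrees of freedom.
    t_critical = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    return float(t_critical * standard_error)
```

The interval would then be reported as the mean error plus or minus this half-width.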

Model Limitations

For locations at a significant distance to the nearest air quality station, especially to the northwest, northeast, and south, the model accuracy can’t easily be known.

Ulaanbaatar has a high variance in elevation. It is possible that high elevations to the north and south may have significantly less air pollution due to the inversion effect, where the colder polluted air is held down by a blanket of warm air.

Air quality stations are not uniformly distributed around Ulaanbaatar. As such, those areas with more sensors will have a more accurate estimate.

Why did you use a 7-day average?

A 7-day average was used because it shows much less variability between stations than hourly data does. There can be a difference of hundreds of AQI points between stations during certain hours of the day. Averaged out over a week there is much less variability, which makes interpolation easier.

In addition, hourly AQI hasn’t been reliably connected to specific health outcomes. Longer-term exposure to air pollution is better documented and is thus a more sensible basis for comparison with health outcomes.

What about low-cost sensors?

There are many low-cost air quality sensors around Ulaanbaatar. However, in testing the model, these sensors tended to vary significantly from the existing government sensors, reducing their usefulness for modeling. It is unknown whether sensor placement, sensor calibration, or other factors contribute to this discrepancy.

Correlation Reporting Methodology
 
As mentioned above, the reporting methodology for this platform will evolve over time. The overarching goal of the reporting for this platform is to understand the economic impact of air pollution and how it may disparately impact certain classes (such as gender or age). Here are some possible reports that will be made using the data collected by the platform.
 
  • Location specific correlation between air pollution and respiratory illness. Care will need to be taken not to conflate the number of reports with a population prevalence of respiratory illness. A metric such as an illness ratio combined with a trend could be used to ensure these values aren’t misinterpreted.
  • Estimate average economic costs of respiratory illness due to lost wages. Location specific wages can be inferred from recent census data when it becomes available to get a more accurate estimate of lost wages.
  • Estimate of loss of education. Those areas with the worst air quality are also, generally speaking, areas of higher risk of education gaps due to income. Loss of education because of illness may impact lower socioeconomic areas more severely.
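
The illness ratio mentioned in the first item could be computed as in the sketch below, which also withholds an estimate when there are too few reports to generalize. The input format and the minimum report count are hypothetical; no threshold has been specified for the platform.

```python
from collections import defaultdict

# HYPOTHETICAL minimum number of reports before a ratio is published.
MIN_REPORTS = 30


def illness_ratios(reports, min_reports=MIN_REPORTS):
    """Share of surveys reporting illness, per (khoroo, week).

    `reports` is an iterable of (khoroo, week, was_sick) tuples.
    Groups with fewer than `min_reports` surveys map to None.
    """
    totals = defaultdict(int)
    sick = defaultdict(int)
    for khoroo, week, was_sick in reports:
        totals[(khoroo, week)] += 1
        sick[(khoroo, week)] += int(was_sick)
    return {
        key: (sick[key] / n if n >= min_reports else None)
        for key, n in totals.items()
    }
```

Because the ratio is relative to the number of surveys received rather than to the population, it describes respondents only and should be read as a trend indicator, not as a prevalence estimate.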
 
This reporting methodology section will be updated as data comes into the platform and can be analyzed.