With the increasing awareness of health impacts of particulate matter, there is a growing need to comprehend the spatial and temporal variations of the global abundance of ground-level airborne particulate matter (PM2.5). Here we use a suite of remote sensing and meteorological data products together with ground-based observations of PM2.5 from 8,329 measurement sites in 55 countries taken between 1997 and 2014 to train a machine learning algorithm to estimate the daily distributions of PM2.5 from 1997 to the present. We demonstrate that the new PM2.5 data product can reliably represent global observations of PM2.5 for epidemiological studies. An analysis of Baltimore schizophrenia emergency room admissions is presented in terms of the levels of ambient pollution. PM2.5 appears to have an impact on some aspects of mental health.
Introduction
In this study, we use machine learning to bring together multiple global datasets from remote sensing, meteorology, and population density, together with hourly in situ PM2.5 observations from 55 countries over the last two decades. This allowed the creation of a new global PM2.5 product at 10 km resolution from August 1997 to the present.1 This new dataset is specifically designed to support health impact studies. We show some examples of this global PM2.5 dataset, and finish by examining the mental health emergency room admissions in Baltimore, MD.
In March 2014, the World Health Organization (WHO) released a report stating that seven million premature deaths annually are linked to air pollution (http://www.who.int/mediacentre/news/releases/2014/air-pollution/en/). Airborne particulate matter is a significant component of this pollution. The wide range of health impacts (Table 1) of particulate matter (PM) with a diameter of 2.5 micrometers or less (PM2.5) depends in part on the PM2.5 abundance at ground level in the atmospheric boundary layer (Fig. 1), where the particles can be inhaled. These health outcomes range from general mortality to pulmonary effects, asthma and chronic obstructive pulmonary disease (COPD), lung cancer, cardiovascular effects, reproductive effects, and even neurotoxic effects.
Table 1.
Particulate matter and health outcomes for PM10, PM2.5, and ultrafine particulates (UFPs) (modified from Ref. 2).

Figure 1.
(Left) A schematic of the atmospheric boundary layer, which is the layer of the atmosphere closest to the earth's surface. (Right) Schematic representation of how the height of the boundary layer changes through the day.

Our radically new approach1 uses a suite of more than 40 NASA remote sensing and meteorological data products (Table 2) together with hourly ground-based observations of air quality from 8,329 measurement sites in 55 countries taken between 1997 and the present to train a proprietary machine learning algorithm to estimate the daily air quality from 1997 to the present. To the best of our knowledge, this is the most systematic and comprehensive study of global PM2.5 estimation conducted to date. In addition, our PM2.5 product is the only one we know of that provides an uncertainty estimate. The PM2.5 data product is produced by a real-time processing system with a latency of 1 day and is then available for integration in a variety of health decision support tools. Our goal is learning from the past to inform the future.
Table 2.
Datasets used in this study.

Health impacts
Numerous studies have shown that among air pollutants PM2.5 has the strongest link with human health effects.3–6 Increased morbidity and mortality have been associated with exposure to PM2.5,7 thereby suggesting that improved life expectancy is possible by reducing the exposure level.8 Not only in the US but also in European studies a significant number of premature deaths, including cardiopulmonary and lung-cancer deaths, were attributed to long-term exposure to PM2.5.9–11
For more than half a century, researchers have been studying the impact of PM on health. Initially the attempt was to learn about the possible adverse effects, and then the focus shifted to investigating the exposure–response relationships. Now with further advancement in technology and more awareness of health concerns, studies on composition-specific effects have emerged.12 With the implementation of computational fluid dynamics (CFD) models and digital imaging of organs, researchers have started to study the pathophysiology associated with PM to better understand the translocation of particulates in the human body after their deposition as well as the fate of these particulates in impacting health.
Most short-term exposure impact studies on PM2.5, whether for morbidity or mortality, focus on cardiovascular/cardiopulmonary13 or respiratory14 conditions. Our dataset, with daily temporal scale, is suitable for such studies. We are already studying daily asthma-related hospital admissions associated with PM2.5 using our estimated data.
On the other hand, diseases such as lung cancer require study of the long-term exposure to PM2.5. Data generated from this study are expected to contribute to health impact assessment (HIA) in different parts of the world concerning long-term exposure to PM2.5. Currently, long-term PM values are not available in many localities, and in many instances PM2.5 values are estimated from PM10 for long-term HIA.10 Studies also suggest that even low-level PM2.5 exposure can contribute to serious health impacts.15–18 We have already created daily global estimates of PM2.5 with an associated uncertainty from 1997 to the present,1 providing an appropriate dataset for extended cohort studies for the areas with both high and low levels of ambient PM2.5. In addition, long-range transport of dust can provide potential vectors for bacteria.19,20 With global coverage of this study, tracking PM2.5 transport is now easier for public health surveillance.
In recent years, researchers are finding it worthwhile to investigate a link of PM2.5 exposure with adverse birth outcomes,21,22 epigenetic alteration,23–36 infant mortality,37–42 atherosclerosis,43–45 stroke,46–50 rheumatic autoimmune diseases,51,52 central nervous system disorders,53–57 and diabetes.58–60 Since many of these health conditions are interlinked, comprehensive studies are required to better understand the impact of PM2.5. With increasing availability of electronic health records, reliable PM2.5 data with seamless temporal and geographic coverage can contribute to revealing many unknowns of PM2.5 impacts on health.
The type and degree of adverse effect greatly depends on the composition of the particulate matter. Particle composition is a function of both its primary source and any secondary chemical reactions and transformations that may occur during its atmospheric transport. Our current study does not provide information on the composition of PM2.5. However, this study can be extended to examine the potential of source apportionment considering land use/land cover conditions and transportation mechanism. Recent studies show specific adverse impacts of exposure to ultrafine particles (UFPs).
PM2.5 distribution
Various networks of ground-based sensors routinely measure the abundance of PM2.5. However, the spatial coverage has many large gaps, and in some countries no observations are made at all. Globally, more observations of PM10 are available than PM2.5. This paper focuses on PM2.5, which in the literature has been related to a wider variety of health conditions than PM10 or UFPs (Table 1).
Several studies have sought to overcome this limitation of spatial coverage by using remote sensing and satellite-derived aerosol optical depth (AOD) coupled with regression and/or numerical models to estimate the ground-level abundance of PM2.5.61–80
Studies have shown that the relationship between PM2.5 and AOD is not always suitable for simple regression models. Rather, it is determined by a multivariate function of a large number of parameters, including humidity, temperature, boundary layer height, surface pressure, population density, topography, wind speed, surface type, surface reflectivity, season, land use, normalized variance of rainfall events, size spectrum and phase of cloud particles, cloud cover, cloud optical depth, cloud top pressure, and the proximity to particulate sources.71,72,75,78,80–99 The picture is further complicated by the biases present in the satellite AOD products,100–104 the difference in spatial scales of the in situ point PM2.5 observations and the remote sensing data (several kilometers per pixel), and, finally, the sharp PM2.5 gradients that can exist in and around cities.
The large number of datasets we use in our fully nonlinear, nonparametric machine learning estimate of PM2.5 are shown in Table 2. Future studies are recommended to derive further size fractions beyond just PM2.5, particularly the UFPs in the submicrometer size range.
This study
Our approach and its validation are described elsewhere.1 The approach makes five incremental contributions.
First, we believe that we have used the most comprehensive training dataset to date for a study that empirically relates hourly in situ PM2.5 observations to remote sensing, meteorological, and other contextual environmental data. This is important because the local context of the various PM2.5 observations varies widely, and to have a robust estimation of the global PM2.5 distribution we must be able to have representative observations over a wide range of conditions. Hourly PM2.5 observations were acquired from 1997 to the present from across the world. In this study, we used hourly PM2.5 data from 8,329 measurement sites in 55 countries.
Second, we believe that we have used the widest range of contextual variables to date (over 30, identified from the literature reviewed in the previous section) in our analysis of the measured multivariate, nonlinear, nonparametric relationship between ground-based observations of PM2.5 and remote sensing observations, meteorological observations, and associated contextual information.
Third, we have used the most suitable multivariate, nonlinear, nonparametric machine learning approach currently available (briefly described in the next section), which has not been used previously for investigating the empirical relationship between hourly in situ PM2.5 observations and remote sensing, meteorological, and other contextual environmental data.
Fourth, we not only estimate the PM2.5 abundance but also provide an uncertainty estimate.
Fifth, we cover the longest time period, estimating the PM2.5 abundance on a daily basis from September 1, 1997, up to the present.
Many studies have shown that the relationship between PM2.5 and AOD is a multivariate function of a large number of parameters.71,75,78,80,98 Further, many of these relationships are nonlinear, some are of unknown functional form, and many have non-Gaussian distributions. Therefore, any successful description of the relationship between PM2.5 and AOD needs to be multivariate, nonparametric (we do not know the functional form from theory), and able to deal with nonlinear behavior and non-Gaussianly distributed variables. This would suggest that a machine learning algorithm should be used.
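To make the requirement concrete, the following is a minimal sketch of a multivariate, nonparametric estimator that also returns a crude uncertainty. It uses k-nearest-neighbor regression purely as a stand-in chosen for brevity; it is not the algorithm used to produce the data product, which is described in Ref. 1. The function name `knn_estimate`, the two predictors, and the training values are all hypothetical.

```python
import math
import statistics


def knn_estimate(train_x, train_y, query, k=3):
    """Estimate a target value and a rough spread from the k nearest samples.

    train_x: list of predictor tuples (e.g. scaled AOD, scaled humidity),
             assumed already scaled to comparable ranges.
    train_y: list of observed PM2.5 values, one per training sample.
    Returns (mean, sample standard deviation) of the k nearest targets.
    """
    if k < 2 or k > len(train_x):
        raise ValueError("k must be between 2 and the number of samples")
    # Rank training samples by Euclidean distance to the query point.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = [y for _, y in ranked[:k]]
    # No functional form is assumed: the estimate is a local average and
    # the scatter of the neighbors serves as a rough uncertainty.
    return statistics.fmean(nearest), statistics.stdev(nearest)


# Hypothetical, made-up samples: (scaled AOD, scaled humidity) -> PM2.5
train_x = [(0.1, 0.2), (0.2, 0.2), (0.8, 0.7), (0.9, 0.8), (0.85, 0.75)]
train_y = [5.0, 7.0, 30.0, 34.0, 32.0]

estimate, spread = knn_estimate(train_x, train_y, query=(0.86, 0.76), k=3)
```

A local-averaging estimator of this kind handles nonlinear and non-Gaussian behavior without a prescribed functional form, which is the property the argument above calls for; a production system would need far more predictors, data, and validation.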
A useful validation of the new PM2.5 data product is to survey the key features of the global PM2.5 distribution and see if they capture what we expect to find and what has been reported in the literature.
Examples
As an example, Figure 2 shows the monthly average of our machine learning PM2.5 product (µg m–3) for August 2001. The average of the observations at a given site is overlaid as color filled circles when observations were available for at least a third of the days. Notice the good agreement between the PM2.5 product and the observations. Also, as would be expected, in summer, the eastern US has much higher PM2.5 abundance than the western US. Central Valley and LA are clearly visible in California. Inset panel (a) is of Alaska and highlights common fire areas associated with elevated PM2.5. Insets (b) and (c) show the good agreement between our product and observations. Inset (d) shows the elevated PM2.5 with the heavily agricultural Central Valley in California, the highly populated Los Angeles metro area, the Sonoran desert (one of the most active dust source regions in the US), the Four Corners power plants (some of the largest coal-fired generating stations in the US), and the Great Salt Lake Desert. Note the fine scaled features visible in this product.
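The rule used for the overlaid site averages (a monthly mean is shown only when observations exist for at least a third of the days) can be sketched as follows. The function name `site_monthly_mean` and the use of `None` for missing days are assumptions made for illustration, not details taken from our processing system.

```python
import statistics


def site_monthly_mean(daily_values, days_in_month):
    """Monthly mean of one site's daily values, or None if coverage is too low.

    daily_values: daily PM2.5 values for one site and month, with None
                  marking days without an observation (an assumption).
    """
    valid = [v for v in daily_values if v is not None]
    # Require observations on at least one third of the days in the month.
    if 3 * len(valid) < days_in_month:
        return None
    return statistics.fmean(valid)
```

For a 31-day month this requires at least 11 days with observations; sites with fewer are left off the map rather than shown with an unrepresentative average.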
Figure 2.
The monthly average of our machine learning PM2.5 product (µg m–3) for August 2001. The average of the observations at a given site is overlaid as color filled circles when observations were available for at least a third of the days. Notice the good agreement between the PM2.5 product and the observations. Also, as would be expected, in summer, the eastern US has much higher PM2.5 abundance than the western US. Central Valley and LA are clearly visible in California. Inset panel (A) is of Alaska and highlights common fire areas associated with elevated PM2.5. Insets (B) and (C) show the good agreement between our product and observations. Inset (D) shows the elevated PM2.5 with the heavily agricultural Central Valley in California, the highly populated Los Angeles metro area, the Sonoran desert (one of the most active dust source regions in the US), the Four Corners power plants (some of the largest coal-fired generating stations in the US), and the Great Salt Lake Desert. Note the fine scaled features visible in this product, which are in marked contrast to the AirNow product.

Figure 3 shows a very different kind of example, this time for Indonesia during October 2005 and October 2006. In Equatorial Asia, the El Niño phase of the El Niño Southern Oscillation (ENSO) is linked to extended periods of drought lasting a few months to a year, particularly in areas undergoing land-use conversion to more fire-susceptible regimes, such as the peatlands of Sumatra and Borneo.105 Fire emissions in these areas have been observed to be as much as 30 times higher during El Niño compared to La Niña years.106 Comparison of our PM2.5 product for October 2005 (panel a) and October 2006 (panel b) shows monthly average enhancements in surface PM2.5 concentrations during the 2006 El Niño event of more than 30 µg m–3. The large regions of burning are clearly visible in October 2006.
Figure 3.
An example of our machine learning PM2.5 product (µg m–3) for Indonesia during October 2005 and October 2006. In Equatorial Asia, the El Niño phase of the El Niño Southern Oscillation (ENSO) is linked to extended periods of drought lasting a few months to a year, particularly in areas undergoing land-use conversion to more fire-susceptible regimes, such as the peatlands of Sumatra and Borneo.105 Fire emissions in these areas have been observed to be as much as 30 times higher during El Niño compared to La Niña years.106 Comparison of our PM2.5 product for October 2005 (panel A) and October 2006 (panel B) shows monthly average enhancements in surface PM2.5 concentrations during the 2006 El Niño event of more than 30 µg m–3.

Finally, Figure 4 shows that our machine learning approach (background colors) reproduces the US PM2.5 seasonal cycle well when compared to climatology observations (overlaid color filled circles).
Figure 4.
Monthly average PM2.5 climatology (in µg m–3) for 1997–2014 estimated using machine learning. The overlaid color filled circles are a climatology of available observations.

Limitations
A unique strength of this study is the daily global coverage from 1997 to the present. However, as a consequence of having a wide array of point sources, the PM2.5 abundance can contain high spatial variability on small scales. The spatial resolution of our study is 10 km × 10 km (approximately 0.1° × 0.1°) determined by the spatial resolution of the MODIS collection 5 aerosol products. Spatial variability on scales smaller than 10 km is present but unresolved in our data product. In addition, there are data gaps due to both cloud coverage and the difficulty that the standard MODIS retrieval algorithm has with retrievals over bright surfaces.
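The consequence of the grid resolution can be illustrated with a small sketch that maps a location to its 0.1° × 0.1° cell. The function name `grid_cell`, the grid origin (row 0 at 90°S, column 0 at 180°W), and the two example coordinates are assumptions for illustration only; they are not taken from the data product's actual file layout.

```python
import math


def grid_cell(lat, lon, cell_deg=0.1):
    """Map latitude/longitude (degrees) to (row, col) on a regular global grid.

    Row 0 starts at 90 S and column 0 at 180 W; this origin is an assumption.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    n_rows = round(180.0 / cell_deg)
    n_cols = round(360.0 / cell_deg)
    # Clamp so that lat = 90 and lon = 180 fall in the last cell.
    row = min(math.floor((lat + 90.0) / cell_deg), n_rows - 1)
    col = min(math.floor((lon + 180.0) / cell_deg), n_cols - 1)
    return row, col


# Two hypothetical points several kilometers apart share one cell, so any
# PM2.5 difference between them is not resolved by a 0.1-degree product.
cell_a = grid_cell(39.29, -76.61)
cell_b = grid_cell(39.21, -76.69)
```

Both example points receive the same estimate, which is the sense in which sub-grid variability is present but unresolved.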
In MODIS collection 5.1, Deep Blue Terra data are not available after 2007. When collection 6 is released, this should be remedied, and there will be greater Deep Blue data coverage and higher spatial resolution. Collection 6 will include various refinements to Deep Blue, including extended coverage to vegetated and bright land surfaces, improved cloud screening and surface reflectance, and aerosol microphysical models. Many of these improvements were developed during the recent application of Deep Blue to SeaWiFS data.
MODIS collection 6 is about to be released and will help address several of these issues, in part through its 3-km resolution. In addition, any satellite instrument has a finite life, and both satellites carrying MODIS (Terra and Aqua) are aging. We hope that data continuity will be provided by the recently launched Visible Infrared Imaging Radiometer Suite (VIIRS) on the Suomi National Polar-Orbiting Partnership weather satellite. When data quality from VIIRS becomes acceptable, those data can also be used.
Table 3.
Correlation between the ICD-9-CM diagnosis codes (column) and the environmental variables (row) temperature (T), carbon monoxide (CO), nitrogen dioxide (NO2), and PM2.5. For each correlation there are three numbers: first, the correlation coefficient; second, the P-value; and third the number of data points. The numbers in bold are entries with a correlation coefficient of >0.5.

Although, to our knowledge, we have used more training data than any other study of PM2.5 estimation, there are still parts of the world for which we are collecting training data. This lack of uniformity in training data may cause some inconsistency in data product quality. However, as we acquire more ground PM2.5 data from the regions that currently have gaps, the quality of our dataset will improve for those regions as well.
Emergency Room Admissions for Mental Health Issues in Baltimore
Epidemiological studies have consistently shown an association between air pollution and respiratory and cardiovascular conditions (Table 1). In this analysis, we extend this to cover psychologically and mentally related health issues. By combining ambient air quality data and nonconfidential ambulatory care emergency department admissions in Maryland for 2002, we examined the hypothesis that the number of admissions to Baltimore City emergency rooms with psychologically and mentally related issues increases when the level of air pollution increases. The study yielded some interesting results, showing a correlation between certain air pollutants (ie, particulate matter) and specific types of schizophrenia (ICD-9-CM Diagnosis 295.9). Table 1 summarizes the key health impacts associated with airborne particulates. However, there is little published work on the relationship between air pollution and mental health. It is time to consider the impact of atmospheric pollution on mental health.
It is disturbing to see various psychiatric and psychological conditions on the rise [ie, depression, anxiety, post-traumatic stress disorder (PTSD), and suicide]. Moreover, the incidence of psychological disorders and mental illnesses is much higher in highly populated cities than in other parts of the country.107 Stress of a busy city life plays a role,108 and can serve as a trigger for undesirable genetic predisposition. Psychotic disorders include various types of schizophrenias. As with other mental illnesses, their concentration is high in the bigger cities. Symptoms of schizophrenia may include delusions, hallucinations, disorganized speech, grossly disorganized or catatonic behavior, negative symptoms [ie, affective flattening, alogia (inability to speak), or avolition (inability to make decisions)]. Schizophrenia is divided into subtypes such as paranoid, disorganized, catatonic, undifferentiated, and residual.109 It is now known that unless a person has a genetic predisposition for schizophrenia, he or she cannot develop the illness.110 Not all people who are genetically predisposed develop an acute form of schizophrenia. Although socioeconomic factors have been found to play an important role in the development of schizophrenia,111 the precise origins remain unknown.
Figure 5.
Correlation between the number of cases of unspecified schizophrenia (ICD-9-CM Diagnosis 295.9) admissions at Baltimore City emergency room and PM2.5 in 2002.

Interestingly, many symptoms present in mental disorders can be induced chemically, through administration of intravenous injection or inhaling. For many disorders, DSM (the Diagnostic and Statistical Manual) includes a section for “chemically induced” conditions, thereby attributing the etiology to chemical exposures. It is important to consider the different possible factors that contribute to the development of schizophrenia. For example, what role might heavy air pollution play in this process, considering that both pollution and mental illness are concentrated in large cities?
Mental health and air pollution
How are people at risk of schizophrenia affected by air pollution? Family history of schizophrenia is the strongest and the best established risk factor for the disease at the individual level.112 Pedersen has shown that the risk for schizophrenia increased with increasing levels of all air pollution variables and traffic density but only significantly for benzene, CO, and traffic.113 It has also been shown that the higher the levels of traffic, CO, and benzene, the greater was the risk of schizophrenia, while the levels of NOx and NO2 had no impact. However, only the level of traffic at birth had a significant effect. It was found that children born in an urban area had a greater risk of schizophrenia than those born in rural areas.
While previous studies have hinted at a link between air pollution and physiological conditions, we were interested to see whether there is a correlation between air pollution and psychological states. Krabbendam and van Os107 in more than 10 studies have consistently shown that around one-third of all schizophrenia incidence may be related to unknown but likely unconfounded environmental factors operating in the urban environment that have an impact on developing children and adolescents to increase, relatively specifically, the later expression of psychosis-like at-risk mental states and overt psychotic disorders.
We used the nonconfidential ambulatory care file for emergency department admission data for 2002 in Maryland. The emergency room admissions data for the entire state of Maryland during 2002 consisted of 1,684,008 admissions. Of these, a total of 348,883 were for Baltimore City. To provide more reliable sample sizes, the data for mental disorders was considered in monthly increments. For 2002, the total number of cases classified under the category of mental disorders in Baltimore City was 13,163.
The number of admissions in each month for each of the disorders was then correlated with air quality observations in Baltimore City. The EPA data were considered for every hour of every day of 2002. The monthly average of the daily maxima was used. In this study, we considered temperature (T), carbon monoxide (CO), nitrogen dioxide (NO2), and particulate matter with a diameter of less than 2.5 micrometers (PM2.5).
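The aggregation described above (hourly values reduced to daily maxima, then averaged by month) can be sketched as follows. The function name `monthly_mean_of_daily_max`, the input layout of (timestamp, value) pairs, and the sample readings are hypothetical; the actual EPA file format is not reproduced here.

```python
from collections import defaultdict
from datetime import datetime
import statistics


def monthly_mean_of_daily_max(hourly):
    """Reduce (datetime, value) pairs to {(year, month): mean of daily maxima}."""
    daily_max = {}
    for stamp, value in hourly:
        day = stamp.date()
        # Keep the largest hourly value seen for each calendar day.
        if day not in daily_max or value > daily_max[day]:
            daily_max[day] = value
    by_month = defaultdict(list)
    for day, peak in daily_max.items():
        by_month[(day.year, day.month)].append(peak)
    return {month: statistics.fmean(peaks) for month, peaks in by_month.items()}


# Hypothetical, made-up hourly readings for two days in January 2002.
readings = [
    (datetime(2002, 1, 1, 0), 8.0),
    (datetime(2002, 1, 1, 12), 14.0),
    (datetime(2002, 1, 2, 6), 20.0),
    (datetime(2002, 1, 2, 18), 10.0),
]
monthly = monthly_mean_of_daily_max(readings)
```

Applied to a full year of hourly data, this yields 12 monthly values per variable, which are the quantities correlated with the monthly admission counts.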
Analysis
Correlation is obviously not causation. However, we found that the number of people diagnosed with unspecified schizophrenia (ICD-9-CM Diagnosis 295.9) was significantly related to particulate matter (r = 0.61, p = 0.03) and temperature (r = 0.56, p = 0.05). The number of people diagnosed with unspecified schizophrenia had no correlation with CO or NO2. Paranoid schizophrenia (ICD-9-CM Diagnosis 295.3) and schizo-affective type schizophrenia (ICD-9-CM Diagnosis 295.7) had no significant relation to the air pollutants we had data for.
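For readers who wish to check how a correlation coefficient and its P-value relate for a small sample, a minimal sketch follows. It assumes 12 monthly data points and uses the standard result that, under the null hypothesis of no correlation between bivariate normal variables, the density of r is proportional to (1 − r²)^((n − 4)/2). The function names and the numerical integration scheme are our own illustrative choices, not the software used for Table 3.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pearson_p_value(r, n, steps=2000):
    """Two-sided P-value for r from n points under the null of no correlation.

    Integrates the null density of r, proportional to
    (1 - r^2)^((n - 4) / 2), with Simpson's rule (steps must be even).
    """
    if n < 4:
        raise ValueError("need at least 4 data points")
    power = (n - 4) / 2

    def density(t):
        # max() guards against tiny negative values from rounding near |t| = 1.
        return max(0.0, 1.0 - t * t) ** power

    def simpson(a, b):
        h = (b - a) / steps
        total = density(a) + density(b)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * density(a + i * h)
        return total * h / 3

    return min(1.0, 2 * simpson(abs(r), 1.0) / simpson(-1.0, 1.0))


# Assumed sample size: one value per month of 2002.
p_pm25 = pearson_p_value(0.61, 12)
```

With r = 0.61 and n = 12 this gives a two-sided P-value of roughly 0.035, in line with the value quoted above for unspecified schizophrenia and PM2.5.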
Other types of schizophrenia had insufficient rates of admission to ascertain any effect. Other mental disorders were correlated with particulate matter. Drug psychosis was related to PM2.5 (r = 0.8, p = 0.001) and T (r = 0.87, p = 0.0002). Nondependent abuse of drugs was related to PM2.5 (r = 0.75, p = 0.004) and T (r = 0.67, p = 0.01). Alcoholic psychosis was correlated to PM2.5 (r = 0.7, p = 0.009) and T (r = 0.6, p = 0.04). Neurotic disorders (r = 0.66, p = 0.01) and depressive disorder not elsewhere classified (r = 0.67, p = 0.01) were related to temperature. There is an extraneous factor with regard to CO that acted as a confounder for this study. Carbon monoxide is strongly negatively correlated with temperature. Thus, the highest levels of CO are observed in winter months. CO, a chemical poisonous to humans, is unlikely to have any causal relationship with decreased emergency room admissions of patients with mental disorders.
This analysis has yielded some interesting results, especially with regard to the correlation between PM2.5 and several mental disorders. However, it is important to keep in mind the possibility of other factors. For example, when we talk about a correlation between air pollution and drug-related disorders, it is possible that significant results are due to a third unknown factor. Major sporting events in Baltimore City, for instance, might increase the pollution level due to heavy traffic and might also increase the number of emergency room admissions with alcohol or drug problems as a consequence of the event. As we think about the diagnosis of unspecified schizophrenia by the emergency room, another question arises: Are there coding issues here? Is there a bias toward giving a coding of unspecified schizophrenia, perhaps due to a lack of psychological expertise in the emergency room department?
Nonetheless, this study clearly demonstrates a correlation between PM2.5 and a number of psychological conditions. This air pollutant is not well known for associations with mental health, nor has it been studied extensively with regard to mental health. In view of our findings, further research might be conducted in the area of mental health and air pollution.
Conclusions
A new approach that uses ground-based observations of particulate matter together with a suite of remote sensing and meteorological data products to train a machine learning algorithm to estimate the daily distributions of PM2.5 has been demonstrated. This new PM2.5 daily global data product reproduces global observations and spans an unprecedented 16 years from 1997 to the present. The correlation coefficient for each of the five training datasets is 0.96 or greater, and the correlation coefficient for each of the independent validation datasets is 0.52 or greater. This implies that the PM2.5 abundance inferred using machine learning agrees well with the ground truth from in situ observations. The quality varies slightly with the satellite, with the best fits obtained from Terra data, followed by Aqua and SeaWiFS. In all cases, the shape of the PM2.5 data product reproduces the observations between the 25th and 75th quantiles. The machine learning PM2.5 data product is useful for human health studies because it resolves both spatial and temporal variability. PM2.5 appears to have a role not just in health outcomes such as cardiovascular and respiratory conditions, but also in some aspects of mental health. An example of this is the statistically significant association of emergency room admissions we found during 2002 for unspecified schizophrenia and airborne particulate matter in Baltimore, MD.
Author Contributions
Conceived and designed the experiments: DJL, TL, BS. Analyzed the data: DJL, TL. Wrote the first draft of the manuscript: DJL, TL. Contributed to the writing of the manuscript: DJL, TL, BS. Agree with manuscript results and conclusions: DJL, TL, BS. Jointly developed the structure and arguments for the paper: DJL, TL. Made critical revisions and approved final version: DJL, TL, BS. All authors reviewed and approved of the final manuscript.
Acknowledgments
The contents of this paper are the sole responsibility of the authors and do not necessarily represent the official views of the funding agencies. The environment agencies of Albania, Australia, Austria, Azores Islands, Belarus, Belgium, Brazil, Canada, Canary Islands, Chile, China, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hong Kong, Hungary, Iceland, India, Ireland, Israel, Italy, Japan, Latvia, Lithuania, Madeira Islands, Malaysia, Mexico, Mongolia, New Zealand, Netherlands, Norway, Peru, Poland, Portugal, Russia, Singapore, Slovakia, South Africa, South Korea, Spain, Sweden, Taiwan, Thailand, United Kingdom, United States, and Vietnam are thanked for the use of their PM2.5 observations.