Methodology

How AERSI Works

A complete reference for the AERSI formula — covering pollution load, persistence, variability, and data confidence. Designed to be scientifically honest about what the data can and cannot tell us.

Master Formula

AERSI is a weighted geometric mean of three components (pollution burden, persistence, and variability), each raised to a calibrated exponent; the exponents sum to 1.0. Because the components multiply, a station that performs badly across multiple dimensions scores more severely than one that is bad in only one.

AERSI = PL^0.50 × EPF^0.25 × VSF^0.25
PL — Pollution Load
WHO-normalized, soft-saturated weighted concentration across all present pollutants. Exponent 0.50 — largest single influence.
EPF — Exposure Persistence
How often AQI exceeded 100 over 30 days, dampened proportionally when data is sparse. Exponent 0.25.
VSF — Variability Severity
Median day-to-day AQI swing. Robust to single-day spikes. Captures unpredictability that std dev cannot. Exponent 0.25.
CF — Data Confidence
Data completeness score — pollutant coverage, day coverage, sensor quality. Shown as a confidence label alongside the score, not a score multiplier.
Baseline: A station at exactly WHO limits for all pollutants, zero exceedances, and zero volatility scores AERSI = 1.0. Every point above represents compounding exposure beyond the safety threshold.
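
As a minimal sketch of the master formula, the three components can be combined as below. The function name `aersi` and the example component values are illustrative assumptions, not part of the published pipeline.

```python
def aersi(pl: float, epf: float, vsf: float) -> float:
    """Weighted geometric mean with exponents 0.50, 0.25 and 0.25."""
    if min(pl, epf, vsf) < 0:
        raise ValueError("components must be non-negative")
    return pl ** 0.50 * epf ** 0.25 * vsf ** 0.25


# Baseline: WHO limits, zero exceedances, zero volatility.
baseline = aersi(1.0, 1.0, 1.0)

# Hypothetical station that is bad on pollution load only.
only_load = aersi(4.0, 1.0, 1.0)

# Hypothetical station with the same load plus high persistence and volatility.
compound = aersi(4.0, 1.8, 1.76)

print(baseline, only_load, round(compound, 3))
```

The second and third stations share the same pollution load, yet the third scores higher because the persistence and variability factors multiply into the result.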

Pollution Load (PL_robust)

How polluted is the air, across all pollutants combined?

Each pollutant present is normalized against its WHO 2021 guideline limit, soft-saturated to prevent single extreme readings from dominating, then combined using weights renormalized over only the pollutants actually available.

Step 1 — WHO normalization

N_p = C_p / L_p

N_p = 1.0 means exactly at the WHO guideline limit. N_p = 4.0 means four times that limit.

Step 2 — Soft-saturation transform

f(N_p) = N_p ^ 0.6

The 0.6 exponent approximates the concave concentration-response relationship documented for PM2.5 in global epidemiological cohort studies, where marginal health risk decreases at higher concentrations (Pope & Dockery, 2006; GBD Integrated Exposure-Response framework). This value is informed by published sublinear dose-response shapes but is treated as heuristic in this version. Sensitivity analyses across exponents 0.4 to 0.8 confirm that station rankings and severity category assignments are robust to this choice.
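
To make the effect of the transform concrete, the short sketch below applies N_p ^ 0.6 to a few normalized values. The helper name `soft_saturate` and the sample values are assumptions chosen for illustration.

```python
def soft_saturate(n: float, exponent: float = 0.6) -> float:
    """Concave transform applied to a WHO-normalized concentration."""
    if n < 0:
        raise ValueError("normalized concentration must be non-negative")
    return n ** exponent


# Doubling the normalized concentration less than doubles the transformed value.
for n in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"N_p = {n:>4}: f(N_p) = {soft_saturate(n):.3f}")
```

The transform leaves N_p = 1.0 unchanged, compresses values above the limit, and pulls values below the limit slightly towards 1.0.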

Step 3 — Weight renormalization

If some pollutants are missing, weights are renormalized across only the pollutants present:

w_adj_p = w_p / Σ(w_q for all present q)

Step 4 — Pollution Load

PL_robust = Σ w_adj_p × N_p^0.6

Pollutant weights & WHO 2021 limits

Pollutant   WHO 2021 limit   Weight
PM2.5       15 µg/m³         0.40
PM10        45 µg/m³         0.20
NO2         25 µg/m³         0.15
Ozone       60 µg/m³         0.15
SO2         40 µg/m³         0.10
Pollutant weights are derived from two sources. PM2.5 and Ozone weights are derived proportionally from India-specific attributable DALYs reported in the Global Burden of Disease Study 2019 (India State-Level Disease Burden Initiative, Lancet Planetary Health, 2021), where ambient PM2.5 accounts for 31.1 million DALYs and ambient ozone accounts for 3.06 million DALYs. PM10, NO2, and SO2 weights are estimated using global comparative-risk exposure-response functions and coarse PM respiratory morbidity literature, then renormalized so the full five-pollutant weight vector sums to 1.0. PM2.5 carries the largest weight (0.40), consistent with its dominant share of India's ambient pollution-attributable disease burden.
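
Steps 1 to 4 can be sketched together as follows. The limits and weights are the ones listed above; the function name `pollution_load`, the dictionary layout, and the example concentrations are assumptions made for illustration, with `None` standing in for a missing pollutant.

```python
from typing import Mapping, Optional

WHO_LIMITS = {"PM2.5": 15.0, "PM10": 45.0, "NO2": 25.0, "Ozone": 60.0, "SO2": 40.0}
WEIGHTS = {"PM2.5": 0.40, "PM10": 0.20, "NO2": 0.15, "Ozone": 0.15, "SO2": 0.10}


def pollution_load(
    concentrations: Mapping[str, Optional[float]], exponent: float = 0.6
) -> float:
    """PL_robust: renormalized weighted sum of soft-saturated N_p values."""
    # Keep only pollutants that are known and actually reported.
    present = {
        p: c for p, c in concentrations.items()
        if p in WHO_LIMITS and c is not None
    }
    if not present:
        raise ValueError("at least one known pollutant is required")
    if any(c < 0 for c in present.values()):
        raise ValueError("concentrations must be non-negative")

    # Step 3: renormalize weights over the pollutants present.
    total_weight = sum(WEIGHTS[p] for p in present)

    # Steps 1, 2 and 4: normalize, soft-saturate, and combine.
    return sum(
        (WEIGHTS[p] / total_weight) * (c / WHO_LIMITS[p]) ** exponent
        for p, c in present.items()
    )


# A station exactly at every WHO limit.
at_limits = pollution_load(WHO_LIMITS)

# Hypothetical station reporting only PM2.5 and PM10 (in µg/m³).
pm_only = pollution_load({"PM2.5": 60.0, "PM10": 90.0, "NO2": None})

print(round(at_limits, 3), round(pm_only, 3))
```

In the second example the PM2.5 and PM10 weights are rescaled to 2/3 and 1/3, so the missing pollutants neither raise nor lower the load.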

Exposure Persistence Factor (EPF_adj)

How often has this station's air been unsafe?

EPF counts how many days in the rolling 30-day window had AQI above 100, the CPCB boundary between the Satisfactory and Moderate categories. A single dampening term scales the persistence by the square root of data coverage, so that sparse stations do not produce overconfident scores.

data_weight = (D_obs / 30)^0.5
EPF_adj = 1 + (D_exceed / 30) × data_weight

Using square root rather than linear dampening means a station with 15 days of data gets weight 0.707 — not the 0.25 that a quadratic term would produce. Proportional, not punishing.

The primary EPF threshold is AQI > 100, the boundary between the Satisfactory and Moderate categories under the CPCB framework and the same threshold used in national public health advisories. Sensitivity analyses using the WHO PM2.5 guideline (15 µg/m³) and interim target (35 µg/m³) as concentration-based thresholds produce consistent station rankings.

Days observed   data_weight   EPF if 80% exceeded   EPF if 30% exceeded
7 days          0.483         1.386                 1.145
15 days         0.707         1.566                 1.212
22 days         0.856         1.685                 1.257
30 days         1.000         1.800                 1.300
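
The dampening can be reproduced with a few lines of code. This is a sketch under one stated assumption: the parameter `exceed_fraction` stands for the exceedance term that multiplies data_weight in the formula, and the 80% and 30% scenarios are passed in directly as that value. The function names are hypothetical.

```python
import math


def data_weight(d_obs: int, window: int = 30) -> float:
    """Square-root dampening for the number of observed days."""
    if not 0 <= d_obs <= window:
        raise ValueError("d_obs must lie between 0 and the window length")
    return math.sqrt(d_obs / window)


def epf_adj(exceed_fraction: float, d_obs: int, window: int = 30) -> float:
    """EPF_adj = 1 + exceedance fraction x data_weight (assumed reading)."""
    if not 0.0 <= exceed_fraction <= 1.0:
        raise ValueError("exceed_fraction must lie between 0 and 1")
    return 1.0 + exceed_fraction * data_weight(d_obs, window)


# Tabulate the dampening for several coverage levels.
for d_obs in (7, 15, 22, 30):
    print(
        d_obs,
        round(data_weight(d_obs), 3),
        round(epf_adj(0.8, d_obs), 3),
        round(epf_adj(0.3, d_obs), 3),
    )
```

With full coverage the weight is 1.0 and EPF_adj reaches its ceiling of 2.0 only when every day exceeds the threshold.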

Variability Severity Factor (VSF_robust)

How unpredictably does air quality swing day to day?

Volatile air is dangerous in a specific way — acute spikes cause cardiovascular events, and people cannot adapt to swings they cannot predict. VSF uses the median absolute day-to-day AQI change instead of standard deviation, which makes it robust to single sensor spikes or one unusual event.

Median is used instead of mean to ensure robustness against single-day sensor anomalies or recording errors — a common occurrence in real-world monitoring networks. Epidemiological time-series and case-crossover studies demonstrate associations between short-term concentration fluctuations and acute cardiovascular and respiratory events, providing the health basis for including a volatility dimension.

S = median(|AQI_t − AQI_{t−1}|)
VSF_robust = 1 + tanh(S / 45)

The tanh function keeps VSF bounded between 1.0 and 2.0 regardless of how extreme the swings become. The constant 45 means a median daily swing of 45 AQI points gives tanh(1) ≈ 0.76, placing VSF at 1.76.

Median daily swing (S)   tanh(S/45)   VSF    Interpretation
0                        0.00         1.00   Perfectly stable
15                       0.32         1.32   Mild day-to-day change
30                       0.58         1.58   Moderate swings
45                       0.76         1.76   Large, significant swings
80                       0.94         1.94   Extreme daily volatility
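
A minimal sketch of the VSF calculation follows. It assumes the input is a list of AQI values for consecutive days with no gaps; the function names and the sample series are illustrative only.

```python
import math
from statistics import median
from typing import Sequence


def median_daily_swing(daily_aqi: Sequence[float]) -> float:
    """S: median of the absolute day-to-day AQI changes."""
    if len(daily_aqi) < 2:
        raise ValueError("at least two daily values are required")
    swings = [abs(b - a) for a, b in zip(daily_aqi, daily_aqi[1:])]
    return median(swings)


def vsf_robust(daily_aqi: Sequence[float], scale: float = 45.0) -> float:
    """VSF_robust = 1 + tanh(S / 45)."""
    return 1.0 + math.tanh(median_daily_swing(daily_aqi) / scale)


# Hypothetical series with one spike on day 4.
spiky = [120, 135, 118, 300, 125, 140]
swing = median_daily_swing(spiky)
vsf = vsf_robust(spiky)

print(swing, round(vsf, 3))
```

The spike contributes two large day-to-day changes, but the median of the five changes stays at the typical swing, which is the robustness property described above.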

Data Confidence (CF)

How much can we trust this station's score?

A station with incomplete pollutant reporting or short history should not be presented with the same confidence as a fully observed one. CF_data is computed for every station and used to generate a confidence label shown alongside the score — it does not multiply into the AERSI score itself.

CF_pollutant = k / 5     (pollutants present out of 5)
CF_day      = D_obs / 30   (days with data)
CF_quality  = 1.0         (sensor metadata, pending)

CF_data = 0.5 × CF_pollutant + 0.3 × CF_day + 0.2 × CF_quality

CF_data is stored alongside every station score and mapped to one of four confidence labels. This makes data quality visible to the reader without suppressing the score for stations in regions with limited sensor infrastructure.

CF_quality is set to 1.0 until sensor metadata becomes available in the CPCB data feed. When sensor uptime and anomaly flags are accessible, this will be computed from continuity scores.

Confidence labels

Label               Threshold        Meaning
High Confidence     CF_data ≥ 0.85   All or nearly all pollutants, near-full history
Medium Confidence   CF_data ≥ 0.65   Some pollutants missing or partial history
Low Confidence      CF_data ≥ 0.40   Significant gaps in data coverage
Provisional         CF_data < 0.40   Very sparse; treat score as indicative only
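
The confidence score and its label can be sketched as below, using the weights and thresholds given above. The function names and argument layout are assumptions for illustration; `cf_quality` defaults to 1.0 to match the current placeholder.

```python
def cf_data(
    pollutants_present: int,
    d_obs: int,
    cf_quality: float = 1.0,
    total_pollutants: int = 5,
    window: int = 30,
) -> float:
    """CF_data = 0.5 x CF_pollutant + 0.3 x CF_day + 0.2 x CF_quality."""
    cf_pollutant = pollutants_present / total_pollutants
    cf_day = d_obs / window
    return 0.5 * cf_pollutant + 0.3 * cf_day + 0.2 * cf_quality


def confidence_label(cf: float) -> str:
    """Map CF_data to one of the four confidence labels."""
    if cf >= 0.85:
        return "High Confidence"
    if cf >= 0.65:
        return "Medium Confidence"
    if cf >= 0.40:
        return "Low Confidence"
    return "Provisional"


# A station with two pollutants and 15 days of history.
sparse = cf_data(pollutants_present=2, d_obs=15)

# A fully observed station.
full = cf_data(pollutants_present=5, d_obs=30)

print(round(sparse, 2), confidence_label(sparse))
print(round(full, 2), confidence_label(full))
```

Because CF_quality is fixed at 1.0 for now, every station starts with 0.2 of the score before pollutant and day coverage are counted.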

Scoring & Bands

The reference baseline is built into the formula. A station at exactly WHO limits on all five pollutants, with zero exceedances and zero volatility, scores AERSI = 1.0.

1.0^0.50 × 1.0^0.25 × 1.0^0.25 = 1.0

AERSI       Category   Meaning
< 0.6       Very Low   Cleaner than WHO guidelines across all dimensions
0.6 – 1.0   Low        Near the safety threshold, mostly acceptable
1.0 – 1.5   Moderate   Regular exceedance; concerning for sensitive groups
1.5 – 2.0   High       Persistent exposure risk for general population
> 2.0       Extreme    Severe, persistent, volatile; among the worst globally

Extreme reflects genuinely severe air quality by WHO standards. Indian cities regularly exceed WHO PM2.5 limits by 3–6× year-round. The Extreme band captures stations where pollution load, persistence, and volatility all compound simultaneously.
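
A band lookup might be sketched as follows. The table does not say which band owns the shared edges 1.0 and 1.5, so this example assumes the upper edge of each band is inclusive, in line with 2.0 falling in High because Extreme is defined as strictly above 2.0. The function name is hypothetical.

```python
def aersi_category(score: float) -> str:
    """Map an AERSI score to its band (upper edges assumed inclusive)."""
    if score < 0:
        raise ValueError("AERSI scores are non-negative")
    if score < 0.6:
        return "Very Low"
    if score <= 1.0:
        return "Low"
    if score <= 1.5:
        return "Moderate"
    if score <= 2.0:
        return "High"
    return "Extreme"


for score in (0.4, 1.0, 1.2, 1.8, 2.7):
    print(score, aersi_category(score))
```

Under this assumption the baseline score of 1.0 lands in Low; assigning the shared edges to the upper band instead would place it in Moderate.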

Design Principles

Data honesty over false precision

Incomplete data is flagged rather than silently ignored. A station with two pollutants and 15 days of history receives a lower confidence label than a fully observed one — making data quality visible to the reader without suppressing the score itself.

Robust to real-world data

Indian monitoring data regularly has sensor downtime, missing pollutants, and station outages. Every design decision — weight renormalization, sqrt dampening, median absolute change — was chosen to degrade gracefully under these conditions rather than catastrophically.

Each life matters equally

Population density does not modify AERSI. A remote industrial station with extreme scores is flagged identically to a dense urban one. AERSI measures exposure severity per person — not total public health burden.

Bounded and interpretable

EPF is bounded 1.0–2.0. VSF is bounded 1.0–2.0. AERSI has a natural baseline of 1.0. No arbitrary normalization step distorts the output — the meaning is built into the mathematics.

Improves over time

As data accumulates, EPF reaches full confidence and VSF stabilizes on a longer sequence of day-to-day changes. The pipeline runs daily and scores become more trustworthy automatically.

Remaining limitations

(1) Pollutant weights for NO2, SO2, and PM10 are estimated from global comparative-risk literature rather than India-specific GBD attributable burden figures, which are not yet separately published for these pollutants at the national level. (2) The 0.6 saturation exponent is heuristic and not fitted to local health outcome data. (3) EPF and VSF are operationalised using AQI rather than raw concentration series, introducing a partial dependency on AQI's structural choices. (4) CF_quality is set to 1.0 pending sensor metadata availability. These limitations are explicitly acknowledged and will be addressed in subsequent versions through empirical calibration against health outcome data.

Relationship to AQI

AERSI is designed as a complement to standard AQI, not a replacement. AQI answers: how bad is the air today? AERSI answers: how severe, persistent, and volatile has the exposure been over the past 30 days? These are complementary analytical questions.

AERSI uses AQI as input for EPF and VSF because AQI is the standardised national multi-pollutant summary — this makes AERSI directly comparable with existing public health alert frameworks while adding the persistence and volatility dimensions that AQI structurally cannot capture in a single daily reading.

AQI and AERSI answer different questions. AQI tells you how bad the air is today. AERSI tells you how severe, persistent, and volatile the exposure has been over the past 30 days. Both are useful — they are complementary tools, not competing ones.

Data Sources

Air quality data is sourced from the Central Pollution Control Board (CPCB) via the Government of India's open data platform, data.gov.in.

Resource ID: 3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69
Update frequency: Daily snapshots at 10:30 AM IST
Coverage: ~530 active monitoring stations across India
Window: Rolling 30-day dataset

WHO guideline limits are from the WHO Global Air Quality Guidelines (2021). Pollutant weights for PM2.5 and Ozone are derived from India-specific attributable DALYs in the Global Burden of Disease Study 2019 (India State-Level Disease Burden Initiative, Lancet Planetary Health, 2021). Weights for PM10, NO2, and SO2 are estimated from global comparative-risk exposure-response literature and renormalized. The soft-saturation exponent of 0.6 is consistent with sublinear PM2.5 exposure-response relationships documented in peer-reviewed health impact assessments (Pope & Dockery, 2006).