The symptom
Your dashboard shows drunk-driver fatal crashes dropping from ~10,000/year to zero overnight in 2021. You assume MADD finally won. Then you check the news, and 2021 was actually one of the worst years on record for impaired driving fatalities. Something is wrong with your pipeline.
The query that broke:
SELECT year, SUM(DRUNK_DR) AS drunk_drivers
FROM accident
GROUP BY year;
Results:
| Year | SUM(DRUNK_DR) | Reality (derived from VEHICLE.DR_DRINK) |
|---|---|---|
| 2017 | 9,581 | 9,581 |
| 2018 | 8,993 | 8,993 |
| 2019 | 8,904 | 8,904 |
| 2020 | 9,873 | 9,873 |
| 2021 | 0 | 10,785 |
| 2022 | 0 | 10,654 |
| 2023 | 0 | 9,975 |
The 2021+ counts above are real numbers we cross-checked against NHTSA's own VEHICLE.csv source files (verified row-for-row in our ingestion pipeline). They're the highest impaired-driver years on record, not zero.
The root cause
Starting with the 2021 data year, NHTSA removed the DRUNK_DR column from ACCIDENT.csv entirely. We verified this by checking the column headers across years:
| Year | DRUNK_DR in ACCIDENT.csv? |
|---|---|
| 2017-2020 | Present (sum is real) |
| 2021-2023 | Removed entirely |
This is not documented as a breaking change anywhere obvious in NHTSA's release notes. The 2023 FARS Analytical User's Manual simply doesn't list DRUNK_DR on the ACCIDENT table. If you're not actively diffing column headers between years (and almost nobody is), your pipeline will read it as null/zero and silently produce wrong results.
The same fate befell HIT_RUN, SP_LIMIT, and SUR_COND on the ACCIDENT table - all removed in the 2020/2021 transition. See our general FARS data guide for the full list.
The fix: derive from VEHICLE.DR_DRINK
The data still exists - it just lives on the VEHICLE table now, as a per-vehicle flag. The "drunk-driver crashes" count is the number of crashes where at least one vehicle has DR_DRINK = 1:
SELECT
a.year,
COUNT(DISTINCT a.st_case) AS drunk_driver_crashes
FROM accident a
JOIN vehicle v ON v.st_case = a.st_case AND v.year = a.year
WHERE v.dr_drink = 1
GROUP BY a.year;
Or, if you want the per-crash count of drunk drivers (not just whether any were involved):
SELECT
a.year, a.st_case,
SUM(CASE WHEN v.dr_drink = 1 THEN 1 ELSE 0 END) AS drunk_drivers
FROM accident a
JOIN vehicle v ON v.st_case = a.st_case AND v.year = a.year
GROUP BY a.year, a.st_case;
This works for every year from 2017 through 2023. We verified the derived counts match the published DRUNK_DR exactly for 2017-2020 (when both fields are present), confirming the derivation is the right reconstruction of NHTSA's logic.
If you need cross-year data with drunk-driver counts
Build your pipeline to compute drunk_drivers at ingest time from VEHICLE.DR_DRINK for every year, including the years where DRUNK_DR is still present. Don't conditionally use one source for some years and another source for others - that's how you get hard-to-debug discrepancies.
If you'd rather not deal with this at all: FARS API normalizes drunk_drivers as a crash-level field across all 7 years (2017-2023). Query /v1/crashes?drunk_driver=true&year_from=2021 and you'll get the actual ~30,000 drunk-driver fatal crashes from 2021-2023, not zero.
Related issues you'll hit on the same pipeline
- Combining FARS years 2017-2023 in a single SQL query - the broader pattern of schema drift across years and how to handle it
- Why your FARS counts don't match NHTSA's Crash Viewer - covers label fragmentation, sentinel value filtering, and other common count discrepancies
- Working with NHTSA FARS Data: The Hard Parts - the full guide to schema quirks across all FARS tables
Verification
The numbers in this guide come from cross-checking our production database against NHTSA's source CSV files for every year 2017-2023. Specifically: for each year, we sum DR_DRINK = 1 rows in VEHICLE.csv grouped by ST_CASE, and verify the count matches what's in our crashes.drunk_drivers column. All 7 years match exactly. Source: NHTSA FARS download portal.