Why Your FARS Query Returns Different Counts Than NHTSA's Crash Viewer

You ran a query against your FARS database and got 9,845 drunk-driver crashes for 2020. NHTSA's Crash Viewer says 11,654 for the same year. Both numbers feel authoritative. One of you is wrong - or you're both right but counting different things.

The five reasons counts diverge

In our cross-checking work building FARS API, we found five distinct reasons why a self-built FARS pipeline produces different totals than NHTSA's published numbers. Most of them aren't bugs - they're definitional differences nobody documents.

1. Coordinate sentinel filtering

NHTSA encodes missing or unknown GPS coordinates as sentinel values: 77.7777, 88.8888, or 99.9999 in the LATITUDE column, and the corresponding 777.7777, 888.8888, or 999.9999 in LONGITUD. These are not real coordinates - they're code values that mean "data not available."

If you treat these as real values and run a radius query, you'll get phantom hits. NHTSA's Crash Viewer filters them server-side; if your pipeline doesn't, your geographic queries will return crashes that "happened" in the Arctic Ocean.

Conversely, if you DO filter sentinel values, your total crash count will be slightly lower than NHTSA's overall total (which counts every record regardless of whether the location is geocoded). The mismatch is small (~0.4% in our 2017-2023 dataset, ~635 crashes per year) but real.
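A minimal sketch of the sentinel filter, as a standalone function you'd run on ingest. The column semantics follow the FARS ACCIDENT file; the exact sentinel sets below are assumptions based on the values discussed above, so check them against the edition of the data you hold:

```python
# Sentinel values that mean "coordinate not available", not a real location.
# These sets are assumptions - verify against your FARS release notes.
LAT_SENTINELS = {77.7777, 88.8888, 99.9999}
LON_SENTINELS = {777.7777, 888.8888, 999.9999}

def clean_coords(lat, lon):
    """Return (lat, lon), or (None, None) when either value is a sentinel.

    Nulling the pair on ingest means radius queries can simply skip NULLs
    instead of returning phantom hits in the Arctic Ocean.
    """
    if lat is None or lon is None:
        return None, None
    if round(lat, 4) in LAT_SENTINELS or round(lon, 4) in LON_SENTINELS:
        return None, None
    return lat, lon
```

Nulling the pair (rather than dropping the row) keeps your overall crash count aligned with NHTSA's totals while still keeping sentinels out of geographic queries.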

2. Drunk-driver counts in 2021+

The DRUNK_DR column was removed from the ACCIDENT table in 2021. If your pipeline reads it directly, you'll get zero drunk-driver crashes for 2021, 2022, and 2023. NHTSA's Crash Viewer derives the value from VEHICLE.DR_DRINK and shows the real count. If your numbers and theirs disagree by ~10,000/year for these years, that's almost certainly the cause.

We have a dedicated guide with the exact magnitude (9,873 in 2020 → 0 in 2021 in our naive query, vs 10,785 derived from VEHICLE.DR_DRINK in the same year).

3. ARF vs Final File for the most recent year

FARS publishes two versions of each year's data: the ARF (Annual Report File), the initial release, and the Final File, a later revision that incorporates NHTSA's corrections.

The ARF and Final File for the same year can differ by 100-300 crashes. If you downloaded the 2023 dataset in early 2025, you have the ARF. NHTSA's Crash Viewer might be showing a more recent revision. The 2017-2022 numbers are stable (Final Files); 2023 is still ARF as of this writing.

This is also why you'll see discrepancies between FARS and the FARS Encyclopedia for the same year - the Encyclopedia sometimes pulls from a different snapshot.

4. Label fragmentation when grouping by *NAME columns

NHTSA uses different human-readable wordings for the same coded value across years. For example, the MAN_COLLNAME for code 0 (no collision with another motor vehicle in transport) appears as three distinct strings across the multi-year files.

If you GROUP BY MAN_COLLNAME across multi-year data, you'll get three buckets that should be one. The total count is correct, but the "most common collision type" claim is wrong because the largest single category got split. NHTSA's Crash Viewer presumably groups by the code internally and only displays one canonical label.

Workaround: group by the underlying numeric code (MAN_COLL), not the label. The codes are stable across years; the labels aren't.
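The fragmentation and the workaround are easy to demonstrate. In this sketch the variant label strings are invented placeholders standing in for NHTSA's year-to-year rewordings; the column names (MAN_COLL, MAN_COLLNAME) are real:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accident (YEAR INTEGER, MAN_COLL INTEGER, MAN_COLLNAME TEXT)")
con.executemany("INSERT INTO accident VALUES (?, ?, ?)", [
    # Three invented wordings of the same code-0 label across years:
    (2018, 0, "Not Collision with Motor Vehicle in Transport"),
    (2020, 0, "Not Collision With a Motor Vehicle in Transport"),
    (2022, 0, "Not Collision With Motor Vehicle In-Transport"),
    (2020, 1, "Front-to-Rear"),
])

# Grouping by the label fragments code 0 into three buckets.
by_label = con.execute(
    "SELECT MAN_COLLNAME, COUNT(*) FROM accident GROUP BY MAN_COLLNAME"
).fetchall()
print(len(by_label))  # 4 buckets for what should be 2 categories

# Grouping by the stable numeric code keeps one bucket per category;
# SQLite's bare-column rule picks one label from each group for display.
by_code = con.execute(
    "SELECT MAN_COLL, MAN_COLLNAME, COUNT(*) FROM accident GROUP BY MAN_COLL"
).fetchall()
print(len(by_code))  # 2 buckets: codes 0 and 1
```

Grouping by code and attaching a single display label afterwards is what NHTSA's Crash Viewer presumably does internally.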

5. Restraint use codes in the user manual are wrong

This one is the most insidious. NHTSA's FARS Analytical User's Manual (DOT HS 813 556, 2023 edition) lists code-to-label mappings for every variable, and we built our initial lookup tables from those mappings. In the edition we worked from, the REST_USE labels for codes 1, 2, 3, 4 were off by one position:

| Code | Manual says | NHTSA's REST_USENAME column actually says |
|------|-------------|-------------------------------------------|
| 1 | None Used | Shoulder Belt Only Used |
| 2 | Shoulder Belt Only | Lap Belt Only Used |
| 3 | Lap Belt Only | Shoulder and Lap Belt Used |
| 4 | Shoulder and Lap Belt | Child Restraint Type Unknown |

If your pipeline is decoding these from a hardcoded table built from the manual, your "no restraint" counts are wrong by a factor of ~3 - you're labeling shoulder-belt and lap-belt deaths as "no restraint" and vice versa. NHTSA's Crash Viewer reads the *NAME columns directly from the same source CSVs and gets the right answer.

Fix: never trust hardcoded code-to-label tables. Always read the *NAME columns directly from the source CSV. NHTSA decodes them server-side from their authoritative variable definitions, which are correct.
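The fix is mechanical: build the lookup from the paired *NAME column in the same file, never from a transcription of the manual. A sketch with Python's csv module; the two-row sample is invented, and real FARS PERSON files carry many more columns:

```python
import csv
import io

# Invented two-row sample standing in for a FARS PERSON file. The label
# values follow the "actually says" column in the table above.
sample = io.StringIO(
    "REST_USE,REST_USENAME\n"
    "1,Shoulder Belt Only Used\n"
    "2,Lap Belt Only Used\n"
)

labels = {}
for row in csv.DictReader(sample):
    code = int(row["REST_USE"])
    # Trust the paired *NAME column as the authoritative label for this code.
    labels.setdefault(code, row["REST_USENAME"])

print(labels[1])  # Shoulder Belt Only Used
```

Because the labels travel with the data, a revised edition of the files automatically carries revised labels, with no hardcoded table to fall out of date.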

How to reconcile

If your numbers don't match NHTSA's, work through this checklist in order:

  1. Are you filtering sentinel coordinate values? You should be. (Lose ~0.4% of crashes vs unfiltered.)
  2. For 2021-2023, are you deriving drunk_drivers from VEHICLE.DR_DRINK, not reading DRUNK_DR? If not, this is your discrepancy.
  3. Are you using ARF (initial release) or Final File for the most recent year? Check NHTSA's release notes for the year you're querying.
  4. Are you grouping by MAN_COLLNAME (label) instead of MAN_COLL (code)? Switch to the code if you need stable groupings across years.
  5. Are you decoding categorical fields from a hardcoded code-to-label table? Switch to reading *NAME columns directly from the source.

If you've handled all five and still disagree, the next likely cause is filtering at the application layer (e.g., excluding crashes where FATALS = 0, which shouldn't exist in FARS but occasionally does in the raw files due to NHTSA edge cases).
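Rather than silently filtering such rows, it is worth flagging them so you know exactly how many records your application layer is excluding. A minimal sketch, where `rows` stands in for your parsed ACCIDENT table and the sample values are invented:

```python
# Every FARS accident record should have FATALS >= 1 by definition,
# but raw files occasionally contain edge cases. Flag them, don't drop
# them silently - each one is a crash your totals and NHTSA's may
# disagree about.
rows = [
    {"ST_CASE": 10001, "FATALS": 1},
    {"ST_CASE": 10002, "FATALS": 0},  # shouldn't exist, but does occasionally
]

suspect = [r["ST_CASE"] for r in rows if r["FATALS"] == 0]
print(suspect)  # case numbers to reconcile before comparing totals
```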

How FARS API handles each of these

| Issue | FARS API behavior |
|-------|-------------------|
| Sentinel coordinates | Filtered to NULL on ingest. Radius queries automatically exclude them. |
| DRUNK_DR removal | Always derived from VEHICLE.DR_DRINK for every year. Cross-checked against published 2017-2020 values for verification. |
| ARF vs Final File | 2017-2022 Final Files, 2023 ARF. Documented in the report citation footer. |
| Label drift | Variant wordings collapsed to canonical labels at query time and in stats. |
| Code-to-label tables | We read NHTSA's *NAME columns directly. Cross-checked all 7 years × 10 categorical fields against source: 70 of 70 distributions match exactly. |
