The five reasons counts diverge
In our cross-checking work building FARS API, we found five distinct reasons why a self-built FARS pipeline produces different totals than NHTSA's published numbers. Most of them aren't bugs - they're definitional differences nobody documents.
1. Coordinate sentinel filtering
NHTSA encodes missing or unknown GPS coordinates as 77.7777, 88.8888, or 99.9999 in the LATITUDE and LONGITUD columns. These are not real coordinates - they're code values that mean "data not available."
If you treat these as real values and run a radius query, you'll get phantom hits. NHTSA's Crash Viewer filters them server-side; if your pipeline doesn't, your geographic queries will return crashes that "happened" in the Arctic Ocean.
Conversely, if you DO filter sentinel values, your total crash count will be slightly lower than NHTSA's overall total (which counts every record regardless of whether the location is geocoded). The mismatch is small (~0.4% in our 2017-2023 dataset, ~635 crashes per year) but real.
2. Drunk-driver counts in 2021+
The DRUNK_DR column was removed from the ACCIDENT table in 2021. If your pipeline reads it directly, you'll get zero drunk-driver crashes for 2021, 2022, and 2023. NHTSA's Crash Viewer derives the value from VEHICLE.DR_DRINK and shows the real count. If your numbers and theirs disagree by ~10,000/year for these years, that's almost certainly the cause.
We have a dedicated guide with the exact magnitude (9,873 in 2020 → 0 in 2021 in our naive query, vs 10,785 derived from VEHICLE.DR_DRINK in the same year).
3. ARF vs Final File for the most recent year
FARS publishes two versions of each year's data:
- Annual Report File (ARF) - released about 12 months after the year ends. Initial estimates. Subject to revision.
- Final File - released ~24 months after the year ends. Reconciled with state DOTs and corrected.
The ARF and Final File for the same year can differ by 100-300 crashes. If you downloaded the 2023 dataset in early 2025, you have the ARF. NHTSA's Crash Viewer might be showing a more recent revision. The 2017-2022 numbers are stable (Final Files); 2023 is still ARF as of this writing.
This is also why you'll see discrepancies between FARS and the FARS Encyclopedia for the same year - the Encyclopedia sometimes pulls from a different snapshot.
4. Label fragmentation when grouping by *NAME columns
NHTSA uses different human-readable wordings for the same coded value across years. For example, the MAN_COLLNAME for code 0 (no collision with another motor vehicle in transport) appears as three distinct strings:
"Not a Collision with Motor Vehicle in Transport""Not a Collision with Motor Vehicle In-Transport"(note the hyphen)"The First Harmful Event was Not a Collision with a Motor Vehicle in Transport"
If you GROUP BY MAN_COLLNAME across multi-year data, you'll get three buckets that should be one. The total count is correct, but the "most common collision type" claim is wrong because the largest single category got split. NHTSA's Crash Viewer presumably groups by the code internally and only displays one canonical label.
Workaround: group by the underlying numeric code (MAN_COLL), not the label. The codes are stable across years; the labels aren't.
5. Restraint use codes in the user manual are wrong
This one is the most insidious. NHTSA's FARS Analytical User's Manual (DOT HS 813 556, 2023 edition) lists code-to-label mappings for every variable. We built initial code lookup tables from those tables. The REST_USE codes 1, 2, 3, 4 in the manual we worked from were transposed:
| Code | Manual says | NHTSA's REST_USENAME column actually says |
|---|---|---|
| 1 | None Used | Shoulder Belt Only Used |
| 2 | Shoulder Belt Only | Lap Belt Only Used |
| 3 | Lap Belt Only | Shoulder and Lap Belt Used |
| 4 | Shoulder and Lap Belt | Child Restraint Type Unknown |
If your pipeline is decoding these from a hardcoded table built from the manual, your "no restraint" counts are wrong by a factor of ~3 - you're labeling shoulder-belt and lap-belt deaths as "no restraint" and vice versa. NHTSA's Crash Viewer reads the *NAME columns directly from the same source CSVs and gets the right answer.
Fix: never trust hardcoded code-to-label tables. Always read the *NAME columns directly from the source CSV. NHTSA decodes them server-side from their authoritative variable definitions, which are correct.
How to reconcile
If your numbers don't match NHTSA's, work through this checklist in order:
- Are you filtering sentinel coordinate values? You should be. (Lose ~0.4% of crashes vs unfiltered.)
- For 2021-2023, are you deriving
drunk_driversfrom VEHICLE.DR_DRINK, not reading DRUNK_DR? If not, this is your discrepancy. - Are you using ARF (initial release) or Final File for the most recent year? Check NHTSA's release notes for the year you're querying.
- Are you grouping by
MAN_COLLNAME(label) instead ofMAN_COLL(code)? Switch to the code if you need stable groupings across years. - Are you decoding categorical fields from a hardcoded code-to-label table? Switch to reading
*NAMEcolumns directly from the source.
If you've handled all five and still disagree, the next likely cause is filtering at the application layer (e.g., excluding crashes where FATALS = 0, which shouldn't exist in FARS but occasionally does in the raw files due to NHTSA edge cases).
How FARS API handles each of these
| Issue | FARS API behavior |
|---|---|
| Sentinel coordinates | Filtered to NULL on ingest. Radius queries automatically exclude. |
| DRUNK_DR removal | Always derived from VEHICLE.DR_DRINK for every year. Cross-checked against published 2017-2020 values for verification. |
| ARF vs Final File | 2017-2022 Final Files, 2023 ARF. Documented in the report citation footer. |
| Label drift | Variant wordings collapsed to canonical labels at query time and in stats. |
| Code-to-label tables | We read NHTSA's *NAME columns directly. Cross-checked all 7 years × 10 categorical fields against source: 70 of 70 distributions match exactly. |
Related
- FARS DRUNK_DR returns 0 in 2021+ - the most common count divergence
- How to combine FARS years 2017-2023 in a single SQL query - the broader pattern of schema drift
- The full FARS data guide - BAC encoding, body type labels, GPS quirks, and more
- FARS API data coverage - per-year row counts and field-level coverage rates