NHTSA's FARS Analytical User's Manual is 400+ pages. It documents the intended schema. What it doesn't document well is how the schema actually behaves when you try to load seven years of data into a database and query across them. This guide covers the gaps.
The ACCIDENT table lost columns you probably expect
If you've worked with older FARS data (pre-2021), you might expect to find DRUNK_DR, SP_LIMIT, SUR_COND, and HIT_RUN on the ACCIDENT table. They're gone.
Starting with the 2021 data year, NHTSA removed several crash-level summary fields from the ACCIDENT CSV. The data still exists - it just moved to the VEHICLE table or needs to be computed:
| Field | Pre-2021 | 2021+ | What to do |
|---|---|---|---|
| DRUNK_DR | On ACCIDENT table | Removed | Count vehicles where DR_DRINK=1 on the VEHICLE table |
| HIT_RUN | On ACCIDENT table | Removed | Check HIT_RUN on the VEHICLE table - if any vehicle is flagged, the crash is hit-and-run |
| SP_LIMIT | On ACCIDENT table | Removed | Use VSPD_LIM on the VEHICLE table (per-vehicle, not per-crash) |
| SUR_COND | On ACCIDENT table | Removed entirely | No direct replacement. Road surface condition is no longer recorded at the crash level in these years. |
This isn't documented as a breaking change anywhere obvious. The User's Manual for 2023 simply doesn't list these fields. If you're writing an ingest pipeline that spans pre-2020 and post-2020 data, you need conditional logic - and you need to decide whether to derive crash-level values from vehicle-level data or leave them null for the newer years.
FARS API computes drunk_drivers and hit_and_run at the crash level by aggregating from vehicle records. Speed limit is returned per-vehicle only, because that's where the data actually lives.
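The vehicle-to-crash rollup can be sketched in a few lines of Python. The function name and sample rows are hypothetical, and comparing against the string "1" assumes values read straight from the CSV as text:

```python
# Minimal sketch: derive crash-level drunk_drivers and hit_and_run
# from VEHICLE-level rows, keyed by ST_CASE. Column names match the
# FARS VEHICLE CSV; the input rows are illustrative.
from collections import defaultdict

def rollup_vehicle_flags(vehicle_rows):
    """Aggregate per-vehicle DR_DRINK and HIT_RUN up to the crash level."""
    crashes = defaultdict(lambda: {"drunk_drivers": 0, "hit_and_run": False})
    for row in vehicle_rows:
        crash = crashes[row["ST_CASE"]]
        if row.get("DR_DRINK") == "1":    # driver had been drinking
            crash["drunk_drivers"] += 1
        if row.get("HIT_RUN") == "1":     # any flagged vehicle => hit-and-run crash
            crash["hit_and_run"] = True
    return dict(crashes)

vehicles = [
    {"ST_CASE": "10001", "DR_DRINK": "1", "HIT_RUN": "0"},
    {"ST_CASE": "10001", "DR_DRINK": "0", "HIT_RUN": "1"},
    {"ST_CASE": "10002", "DR_DRINK": "0", "HIT_RUN": "0"},
]
print(rollup_vehicle_flags(vehicles))
```

The same shape works as a GROUP BY in SQL; the point is that the aggregation happens at the VEHICLE level, then joins back to ACCIDENT on ST_CASE.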
BAC is stored as an integer, not a decimal
The ALC_RES field on the PERSON table looks like it should be a Blood Alcohol Concentration - and it is - but it's encoded as an integer where 148 means 0.148 BAC. The legal limit in most US states is 0.08 BAC, which is stored as 80.
If you load this column as-is into your database, you'll get BAC values of 148, 217, 53 - which look like nonsense unless you know to divide by 1000.
There's a second trap: values 995–999 are special codes, not BAC readings:
| ALC_RES value | Meaning |
|---|---|
| 0 | BAC = 0.000 (tested, no alcohol) |
| 1–940 | BAC = value / 1000 (e.g., 148 = 0.148) |
| 995 | Test Not Given |
| 996 | Test Refused |
| 997 | Test Performed, Results Unknown |
| 998 | Not Reported |
| 999 | Unknown |
In our dataset, 79,334 persons tested at 0.000 BAC. 8,334 tested between 0.001 and 0.080 (under legal limit). 32,526 tested above 0.080. And 444 tested above 0.400 - which sounds impossible, but these are fatal crash records. Those people are almost always dead.
FARS API returns BAC as a proper decimal (0.148, not 148) and strips the special codes to null.
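A decoding helper along these lines captures both rules - the implied decimal and the special codes. The function name is illustrative:

```python
# Sketch of the ALC_RES decoding described above: 0-940 is BAC * 1000,
# 995-999 are special codes that should become null, not numbers.
def decode_alc_res(raw):
    """Return BAC as a decimal, or None for special/out-of-range codes."""
    if raw is None:
        return None
    value = int(raw)
    if 0 <= value <= 940:
        return value / 1000      # 148 -> 0.148
    return None                  # 995-999 and anything unexpected

print(decode_alc_res(148))   # 0.148
print(decode_alc_res(80))    # 0.08 - the typical legal limit
print(decode_alc_res(996))   # None (Test Refused)
```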
Vehicle model names require VPIC, not the MODEL column
The VEHICLE table has a column called MODEL. You'd reasonably assume it contains model names. It doesn't - it contains NHTSA's internal numeric model codes (e.g., 12481 for a Ford F-250, 20421 for a Chevrolet Tahoe).
The human-readable model name is in VPICMODELNAME - a column that comes from NHTSA's Vehicle Product Information Catalog (vPIC) decoder. Similarly, MAKENAME has the make name, and VPICBODYCLASSNAME has a detailed body type description.
Not every vehicle has a VPIC decode. About 3.4% of vehicle records have only numeric codes for the model - typically older vehicles, custom builds, or cases where the vehicle couldn't be identified precisely. Those records fall through to the raw MODEL code.
This matters if you're building queries like "find all crashes involving Toyota Camry." Searching the MODEL column returns nothing useful. You need VPICMODELNAME.
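A fallback along these lines keeps the ~3.4% of undecoded records queryable instead of blank - display_model and its inputs are hypothetical:

```python
# Sketch: prefer the decoded VPICMODELNAME, fall back to the raw numeric
# MODEL code (labeled as such) when the vPIC decode is missing.
def display_model(vehicle_row):
    name = (vehicle_row.get("VPICMODELNAME") or "").strip()
    if name:
        return name
    # no vPIC decode - surface the raw code rather than an empty string
    return f"model code {vehicle_row.get('MODEL', '?')}"

print(display_model({"VPICMODELNAME": "F-250", "MODEL": "12481"}))  # F-250
print(display_model({"VPICMODELNAME": "", "MODEL": "12481"}))       # model code 12481
```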
Body type labels are more specific than you'd expect
The BODY_TYP column uses numeric codes (14 = SUV, 30 = Pickup, etc.), and NHTSA's BODY_TYPNAME column provides the official labels. But these labels are verbose:
- 14 → "Compact Utility (Utility Vehicle Categories 'Small' and 'Midsize')"
- 78 → "Truck-tractor (Cab only, or with any number of trailing unit; any weight)"
- 20 → "Minivan (Chrysler Town and Country, Caravan, Grand Caravan, Voyager, Honda-Odyssey, ...)"
If you use the raw NHTSA labels, your API responses will have 90-character body type strings. If you write your own labels from the coding manual, you'll have shorter strings - but you'll need to maintain them across years as NHTSA occasionally changes code definitions.
FARS API uses NHTSA's own labels from the CSV (BODY_TYPNAME) for accuracy, even though they're verbose. When querying by body type, partial matching handles the verbosity - searching "Pickup" matches "Light Pickup" and "Light Pickup Truck."
Column names change between years (quietly)
We diffed every CSV header across 2017–2023. Most columns are stable, but there are changes that will break a naive ingest pipeline:
| Table | Years | Change |
|---|---|---|
| ACCIDENT | 2022 → 2023 | BOM encoding changed (UTF-8 BOM on STATE column in 2022, clean in 2023) |
| VEHICLE | 2020 → 2021 | UNDERIDE renamed to UNDEROVERRIDE |
| VEHICLE | 2022 → 2023 | ACC_CONFIG and ACC_CONFIGNAME added |
| PERSON | 2021 → 2022 | ALC_DET and DRUG_DET removed; DEVMOTOR and DEVTYPE added |
The BOM issue is particularly insidious. In 2022, the first column reads as `\ufeffSTATE` - an invisible byte-order mark prefixed to `STATE` - instead of `STATE`. Standard CSV parsers handle this silently in some languages (Python's utf-8-sig codec) and fail silently in others - your code runs, but the first column doesn't match by name.
The column renames aren't announced in a changelog. You discover them when your pipeline throws a KeyError on a column that existed last year.
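Both traps can be handled in one place. A minimal sketch, assuming an alias map you maintain by hand as renames turn up (the helper names and sample bytes are illustrative):

```python
# Open with utf-8-sig so a stray BOM never pollutes the first header,
# and resolve columns through an alias map so renames like
# UNDERIDE -> UNDEROVERRIDE don't raise KeyError mid-pipeline.
import csv
import io

ALIASES = {"UNDEROVERRIDE": ["UNDEROVERRIDE", "UNDERIDE"]}

def get_field(row, canonical):
    """Return the value for a column under any of its historical names."""
    for name in ALIASES.get(canonical, [canonical]):
        if name in row:
            return row[name]
    return None

raw = b"\xef\xbb\xbfSTATE,UNDERIDE\r\n1,0\r\n"  # 2022-style file with a BOM
rows = list(csv.DictReader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig")))
print(rows[0]["STATE"])                     # BOM stripped, name matches
print(get_field(rows[0], "UNDEROVERRIDE"))  # resolved via the alias map
```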
GPS coordinates: good but not perfect
99.6% of crashes in 2017–2023 have GPS coordinates. That's excellent. But there are caveats:
- 635 crashes have no coordinates at all - typically crashes in rural areas or on tribal land where a precise location couldn't be determined.
- Coordinates are reported to varying precision. Some have 8 decimal places (sub-meter), others have 4 (roughly 11 meters). For intersection queries with a 500-foot radius, 4-decimal precision is fine. For "exact location" analysis, be aware of this variance.
- Older data (pre-2015, not in our current dataset) has 15–20% missing coordinates. If you extend back to 1975, geocoding quality degrades significantly.
- NHTSA uses 77.7777/88.8888/99.9999 as sentinel values for missing or unknown coordinates. These need to be filtered, not treated as real locations. (77.7777° latitude would put a crash in the Arctic Ocean.)
FARS API filters sentinel values and serves null for coordinates that aren't real. Radius queries automatically exclude crashes with missing coordinates.
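A sketch of the sentinel filter, assuming the sentinel pattern appears in either the latitude or longitude column (the function name is ours):

```python
# Treat NHTSA's sentinel coordinates as missing rather than as real
# locations. Unparseable values are also mapped to None.
SENTINELS = {77.7777, 88.8888, 99.9999}

def clean_coords(lat, lon):
    """Return (lat, lon) or None when either value is a sentinel/missing."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return None
    if round(abs(lat), 4) in SENTINELS or round(abs(lon), 4) in SENTINELS:
        return None
    return (lat, lon)

print(clean_coords("34.0522", "-118.2437"))  # a real location survives
print(clean_coords("77.7777", "-88.8888"))   # None - sentinel values
```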
FATALS is the only field that's truly reliable across all years
Every crash in FARS has at least one fatality - that's the inclusion criterion. The FATALS field on the ACCIDENT table is always present, always >= 1, and consistently defined. It's the one field you can count on never being null, never being a special code, and never changing meaning between years.
Our dataset: minimum 1, maximum 13, average 1.09. That average tells you most fatal crashes have exactly one fatality - which matters for actuarial models that might otherwise assume higher counts.
The same category has different label wordings in different years
NHTSA's MAN_COLLNAME column (manner of collision) uses different wordings for the SAME category across years. For the single-vehicle / not-a-collision-with-another-motor-vehicle category, we observed at least three distinct labels in the source CSVs:
- "Not a Collision with Motor Vehicle in Transport" (older years)
- "Not a Collision with Motor Vehicle In-Transport" (note the hyphen)
- "The First Harmful Event was Not a Collision with a Motor Vehicle in Transport" (newer years - 77 chars)
If you GROUP BY manner_of_collision across multi-year data, you'll get three buckets that should be one. This silently corrupts any "most common collision type" analysis. NHTSA documents the code (0 = "Not a Collision with Motor Vehicle in Transport") as stable, but the human-readable label drifts.
The fix: normalize at query time. Match on the code (MAN_COLL = 0), or canonicalize labels containing "Not a Collision with" + "Motor Vehicle" into one bucket. The same trap exists for some weather and light condition labels - watch for slash vs comma drift ("Fog/Smog/Smoke" vs "Fog, Smog, Smoke").
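A query-time canonicalizer might look like this - the canonical strings and function name are our choices, not NHTSA's:

```python
# Collapse known label drift into one bucket before grouping.
import re

def canonical_label(label):
    """Map variant wordings of the same FARS category to one string."""
    s = label.strip().lower()
    # all three observed MAN_COLLNAME wordings contain this phrase
    if "not a collision with" in s and "motor vehicle" in s:
        return "Not a Collision with Motor Vehicle in Transport"
    # collapse slash/comma drift, e.g. "Fog, Smog, Smoke" vs "Fog/Smog/Smoke"
    return re.sub(r"\s*[/,]\s*", "/", label.strip())

print(canonical_label("Not a Collision with Motor Vehicle In-Transport"))
print(canonical_label("Fog, Smog, Smoke"))  # Fog/Smog/Smoke
```

Matching on the numeric code (MAN_COLL = 0) is still the more robust option when the code column is available.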
The hardcoded code-to-label tables in the user manual are wrong
NHTSA's FARS Analytical User's Manual (DOT HS 813 556, 2023 edition) lists code-to-label mappings for every variable. We built our initial code lookup tables from those tables. They contained errors.
Specifically, the REST_USE (restraint use) codes 1, 2, 3, and 4 in the manual we worked from were transposed:
| Code | What we had (wrong) | What NHTSA's REST_USENAME column actually says |
|---|---|---|
| 1 | None Used | Shoulder Belt Only Used |
| 2 | Shoulder Belt Only | Lap Belt Only Used |
| 3 | Lap Belt Only | Shoulder and Lap Belt Used |
| 4 | Shoulder and Lap Belt | Child Restraint Type Unknown |
This is the difference between "the deceased was not wearing a seat belt" and "the deceased was wearing a lap and shoulder belt" - which matters in court. The fix: never trust hardcoded code-to-label tables. Always read the *NAME columns directly from the source CSV, which NHTSA decodes server-side from their authoritative variable definitions.
FARS API now reads REST_USENAME, PER_TYPNAME, SEXNAME, INJ_SEVNAME, BODY_TYPNAME, and the other *NAME columns directly from NHTSA's source CSVs. We do not maintain a hardcoded lookup for any field that NHTSA already decodes.
Pre-2017 data has no decoded label columns at all
The *NAME decoded label columns we now rely on were introduced with the 2017 dataset. Pre-2017 (we sampled 2014, 2015, and 2016), the CSVs contain only numeric codes for state, county, city, make, model, body type, and every other categorical field. The decoded labels you see in NHTSA's web Crash Viewer are computed at display time from internal lookup tables they don't ship with the CSV download.
If you want pre-2017 data with human-readable labels, you have to build your own code-to-label tables. NHTSA publishes the FIPS state and county codes separately, but city codes and the make/model code lists are scattered across different documents and years.
This is why we currently cover 2017-2023 in our dataset and not 2014-2023 - the marginal value of three more years of older data isn't worth the engineering work to build complete decoder tables for every categorical field.
What this means for your pipeline
If you're building your own FARS data pipeline:
- Read CSVs with `utf-8-sig` encoding to handle BOM
- Look up columns case-insensitively and by multiple possible names
- Read `*NAME` decoded label columns directly from source - never trust hardcoded code-to-label tables for fields NHTSA already decodes
- Normalize variant label wordings across years before grouping or counting
- Allocate at least `varchar(120)` for label columns - some NHTSA labels exceed 75 characters
- Derive crash-level `drunk_drivers` and `hit_and_run` from vehicle data (the ACCIDENT-level columns were removed in 2021)
- Divide `ALC_RES` by 1000 and filter values >= 995
- Use `VPICMODELNAME` for vehicle models, not `MODEL`
- Filter lat/lon sentinel values (77.7777, 88.8888, 99.9999)
- If you need pre-2017 data, expect to build your own city / county / make / model decoder tables
- Expect columns to appear, disappear, or rename between years
Or use FARS API and skip all of this. That's why we built it.
Deep dives on specific traps
If you hit one of these issues and want the detailed explanation with exact magnitudes and SQL fixes:
- FARS DRUNK_DR Returns 0 in 2021+ - the most common silent failure. NHTSA removed the column from ACCIDENT.csv in 2021 with no documented warning. Includes the SQL pattern to derive the same value from VEHICLE.DR_DRINK with verified counts for every year.
- How to Combine FARS Years 2017-2023 in a Single SQL Query - the broader pattern of schema drift. Covers column moves, label drift across years, encoding traps (UTF-8 BOM in 2022), and the schema length surprises (some labels exceed 75 chars).
- Why Your FARS Counts Don't Match NHTSA's Crash Viewer - five distinct reasons your self-built pipeline produces different totals than NHTSA's published numbers, with a reconciliation checklist.