The naive approach (and why it's broken)
The natural way to load 7 years of FARS into Postgres:
CREATE TABLE crashes AS SELECT * FROM read_csv('FARS2017NationalCSV/accident.csv', AUTO_DETECT);
INSERT INTO crashes SELECT * FROM read_csv('FARS2018NationalCSV/accident.csv', AUTO_DETECT);
-- ... repeat for every year
This will fail or silently produce wrong data, depending on which database engine you use. Reasons:
- The 2017 ACCIDENT.csv has different columns than the 2020+ ACCIDENT.csv (notably, the 2017 file has
DRUNK_DRandHIT_RUNdirectly on the ACCIDENT table, but those were removed in 2021). - The 2017 file has no decoded label columns (no
STATENAME,WEATHERNAME, etc) - those were added in 2017+ but the 2017 file we observed already has them, so this varies depending on which year you're starting from. Pre-2017 files have only numeric codes. - Multiple years use different label wordings for the same coded value.
MAN_COLLNAME = 0appears as three different strings across years. - The 2022 file has a UTF-8 BOM on the STATE column. Pandas and naive CSV readers will silently rename the column to
"\uFEFFSTATE", which then doesn't match by name in your downstream queries.
The columns that move between tables
These crash-level summary fields exist on the ACCIDENT table in older years and were moved (or removed entirely) in 2020/2021:
| Field | 2017-2020 | 2021-2023 | Where to find it now |
|---|---|---|---|
DRUNK_DR |
On ACCIDENT | Removed | Derive from VEHICLE.DR_DRINK |
HIT_RUN |
On ACCIDENT | Removed | Check VEHICLE.HIT_RUN - if any vehicle is flagged, the crash is hit-and-run |
SP_LIMIT |
On ACCIDENT | Removed | Use VEHICLE.VSPD_LIM - per-vehicle, not per-crash |
SUR_COND |
On ACCIDENT | Removed | Not directly available - road surface condition is no longer recorded at crash level |
If you've been computing impaired-driver crash counts by reading DRUNK_DR directly, you've been getting zero for 2021-2023. We have a dedicated guide on that specific bug with the exact magnitude (~30,000 missing crashes over three years).
The label drift trap
Even when the column exists in every year, the human-readable label NHTSA uses can change. The MAN_COLLNAME column for code 0 (no collision with another motor vehicle) appears as at least three distinct strings across years:
"Not a Collision with Motor Vehicle in Transport""Not a Collision with Motor Vehicle In-Transport"(note the hyphen)"The First Harmful Event was Not a Collision with a Motor Vehicle in Transport"(77 chars)
If you GROUP BY MAN_COLLNAME across years, you'll get three buckets that should be one. This silently corrupts any "most common collision type" analysis. The number column (MAN_COLL = 0) is stable across years - so when in doubt, group by the code, not the label.
Schema length surprises
NHTSA's longest MAN_COLLNAME string is 77 characters. The longest BODY_TYPNAME is 99. The longest VPICMODELNAME is 76. If you create your table with VARCHAR(60) columns (a reasonable guess for category labels), you'll hit truncation errors when you load years where the longer wordings appear. Use VARCHAR(120) at minimum, or just TEXT.
The pattern that works
A multi-year FARS pipeline that handles all of the above:
-- One target table with all the year-stable columns plus
-- derived ones we'll compute at ingest time
CREATE TABLE crashes (
st_case INT NOT NULL,
year SMALLINT NOT NULL,
state SMALLINT,
state_name VARCHAR(30),
fatalities SMALLINT,
drunk_drivers SMALLINT, -- DERIVED from VEHICLE.DR_DRINK
hit_and_run BOOLEAN, -- DERIVED from VEHICLE.HIT_RUN
manner_of_collision VARCHAR(120), -- READ from MAN_COLLNAME (>60 chars!)
manner_of_collision_code SMALLINT,-- READ from MAN_COLL (stable across years)
weather VARCHAR(120),
light_condition VARCHAR(120),
road_function_class VARCHAR(120),
-- ...
PRIMARY KEY (st_case, year)
);
-- For each year, ingest with a year-aware function that:
-- 1. Reads CSV with utf-8-sig to handle BOM
-- 2. Looks up columns case-insensitively (some years use lowercase filenames too)
-- 3. Reads *NAME columns directly (never trust hardcoded code-to-label tables)
-- 4. Falls back to VEHICLE-derived values for DRUNK_DR / HIT_RUN in 2021+
-- 5. Uses VARCHAR(120) for label columns
-- Then query across all years using the CODE columns when grouping:
SELECT manner_of_collision_code,
MIN(manner_of_collision) AS canonical_label,
COUNT(*) AS crashes
FROM crashes
WHERE year BETWEEN 2017 AND 2023
GROUP BY manner_of_collision_code
ORDER BY crashes DESC;
The encoding gotchas
- BOM on the 2022 ACCIDENT.csv. First column reads as
"\uFEFFSTATE"instead of"STATE"in pandas. Useencoding='utf-8-sig'to strip it. - Mixed case filenames. Some years ship as
ACCIDENT.csv, others asaccident.CSV, others asaccident.csv. Use a case-insensitive file lookup. - BAC encoding. The
ALC_REScolumn on the PERSON table is an integer where 148 means BAC 0.148. Values 995-999 are special codes (not tested, refused, etc). Divide by 1000 and filter out the 995+ range. - Latitude sentinels. NHTSA uses 77.7777 / 88.8888 / 99.9999 as sentinel values for missing or unknown coordinates. Filter them out before any geographic query.
Or skip all this
FARS API does all of the above for you and serves a clean unified schema across all 7 years. Query /v1/crashes?year_from=2017&year_to=2023 and you get every fatal crash with consistent labels, derived fields populated correctly, and no schema drift.
Related
- FARS DRUNK_DR returns 0 in 2021+ - the most common version of this bug
- Why your FARS counts don't match NHTSA's Crash Viewer - the related but distinct problem of cross-source reconciliation
- The full FARS data guide - covers BAC encoding, body type labels, GPS quirks, and the rest of the traps