How to Combine FARS Years 2017-2023 in a Single SQL Query

NHTSA changes the FARS schema between years. Columns disappear, rename, and split into separate tables - silently. If you load CSVs into one big table and query across years, you'll get wrong answers without an error. Here's what to watch for and how to fix it.

The naive approach (and why it's broken)

The natural way to load 7 years of FARS into Postgres:

CREATE TABLE crashes AS SELECT * FROM read_csv('FARS2017NationalCSV/accident.csv', AUTO_DETECT);
INSERT INTO crashes SELECT * FROM read_csv('FARS2018NationalCSV/accident.csv', AUTO_DETECT);
-- ... repeat for every year

This will fail or silently produce wrong data, depending on which database engine you use. Reasons:

  1. The 2017 ACCIDENT.csv has different columns than the 2020+ ACCIDENT.csv (notably, the 2017 file has DRUNK_DR and HIT_RUN directly on the ACCIDENT table, but those were removed in 2021).
  2. The 2017 file has no decoded label columns (no STATENAME, WEATHERNAME, etc) - those were added in 2017+ but the 2017 file we observed already has them, so this varies depending on which year you're starting from. Pre-2017 files have only numeric codes.
  3. Multiple years use different label wordings for the same coded value. MAN_COLLNAME = 0 appears as three different strings across years.
  4. The 2022 file has a UTF-8 BOM on the STATE column. Pandas and naive CSV readers will silently rename the column to "\uFEFFSTATE", which then doesn't match by name in your downstream queries.

The columns that move between tables

These crash-level summary fields exist on the ACCIDENT table in older years and were moved (or removed entirely) in 2020/2021:

Field2017-20202021-2023Where to find it now
DRUNK_DR On ACCIDENT Removed Derive from VEHICLE.DR_DRINK
HIT_RUN On ACCIDENT Removed Check VEHICLE.HIT_RUN - if any vehicle is flagged, the crash is hit-and-run
SP_LIMIT On ACCIDENT Removed Use VEHICLE.VSPD_LIM - per-vehicle, not per-crash
SUR_COND On ACCIDENT Removed Not directly available - road surface condition is no longer recorded at crash level

If you've been computing impaired-driver crash counts by reading DRUNK_DR directly, you've been getting zero for 2021-2023. We have a dedicated guide on that specific bug with the exact magnitude (~30,000 missing crashes over three years).

The label drift trap

Even when the column exists in every year, the human-readable label NHTSA uses can change. The MAN_COLLNAME column for code 0 (no collision with another motor vehicle) appears as at least three distinct strings across years:

If you GROUP BY MAN_COLLNAME across years, you'll get three buckets that should be one. This silently corrupts any "most common collision type" analysis. The number column (MAN_COLL = 0) is stable across years - so when in doubt, group by the code, not the label.

Schema length surprises

NHTSA's longest MAN_COLLNAME string is 77 characters. The longest BODY_TYPNAME is 99. The longest VPICMODELNAME is 76. If you create your table with VARCHAR(60) columns (a reasonable guess for category labels), you'll hit truncation errors when you load years where the longer wordings appear. Use VARCHAR(120) at minimum, or just TEXT.

The pattern that works

A multi-year FARS pipeline that handles all of the above:

-- One target table with all the year-stable columns plus
-- derived ones we'll compute at ingest time
CREATE TABLE crashes (
    st_case INT NOT NULL,
    year SMALLINT NOT NULL,
    state SMALLINT,
    state_name VARCHAR(30),
    fatalities SMALLINT,
    drunk_drivers SMALLINT,           -- DERIVED from VEHICLE.DR_DRINK
    hit_and_run BOOLEAN,              -- DERIVED from VEHICLE.HIT_RUN
    manner_of_collision VARCHAR(120), -- READ from MAN_COLLNAME (>60 chars!)
    manner_of_collision_code SMALLINT,-- READ from MAN_COLL (stable across years)
    weather VARCHAR(120),
    light_condition VARCHAR(120),
    road_function_class VARCHAR(120),
    -- ...
    PRIMARY KEY (st_case, year)
);

-- For each year, ingest with a year-aware function that:
-- 1. Reads CSV with utf-8-sig to handle BOM
-- 2. Looks up columns case-insensitively (some years use lowercase filenames too)
-- 3. Reads *NAME columns directly (never trust hardcoded code-to-label tables)
-- 4. Falls back to VEHICLE-derived values for DRUNK_DR / HIT_RUN in 2021+
-- 5. Uses VARCHAR(120) for label columns

-- Then query across all years using the CODE columns when grouping:
SELECT manner_of_collision_code,
       MIN(manner_of_collision) AS canonical_label,
       COUNT(*) AS crashes
FROM crashes
WHERE year BETWEEN 2017 AND 2023
GROUP BY manner_of_collision_code
ORDER BY crashes DESC;

The encoding gotchas

Or skip all this

FARS API does all of the above for you and serves a clean unified schema across all 7 years. Query /v1/crashes?year_from=2017&year_to=2023 and you get every fatal crash with consistent labels, derived fields populated correctly, and no schema drift.

Related

Try the API Data Coverage