The following analysis was made using OSHA’s enforcement data, accessible from the Department of Labor’s website. The three main types of files are accident, inspection and violation data.

Both the inspection and violation data have a column in common to merge them by: “activity_nr”. Since the accident file does not have this value, we will use another variable as a bridge.


The accident data contains information such as the date and description of every reported accident. It is a single file that will be combined with two additional files that contain the abstract, or a more detailed summary of the accident event, and the “rel_insp_nr” which will be used to merge data further on.


osha_accident <- fread("Data/OSHA/osha_accident.csv") 
osha_accident_abstract <- fread("Data/OSHA/osha_accident_abstract.csv") 
osha_accident_injury <- fread("Data/OSHA/osha_accident_injury.csv") 

Once we loaded the data, we filtered the main accident file. We will need 5 of the 16 columns:

osha_accident <- select(osha_accident,

And we can filter the data to keep every accident that resulted in a fatality since 2002:

osha_accident <- filter(osha_accident, fatality == "X")  
osha_accident$year <- year(osha_accident$event_date) 
osha_accident <- filter(osha_accident, year >= "2002") 

After fixing the format in the accident_abstract file, we joined this and the rel_insp_nr to the main accident file by using the “accident_nr” column:

osha_accident_abstract <- group_by(osha_accident_abstract, summary_nr) %>%
  mutate(text = paste(abstract_text, collapse = "")) %>%
  select(summary_nr, text) %>%
  distinct(summary_nr, .keep_all = TRUE) 

osha_accident_injury <- select(osha_accident_injury,
                               summary_nr ,
                               rel_insp_nr) %>%
  distinct(summary_nr, .keep_all = TRUE) 

# now merge them
osha_accident <- left_join(osha_accident, osha_accident_abstract, by = c("summary_nr" = "summary_nr")) %>%
  left_join(osha_accident_injury, by = c("summary_nr" = "summary_nr")) 


The inspection data contains information about the location of the incident and a corresponding NAICS code. It has been split into smaller files due to the large size. We made a function that filters the inspection data since 2002 and only keeps the columns we need. Then we applied this function to each individual 1,000,000 row file before merging the chunks into one:

# load inspection files
osha_inspection0 <- fread("Data/OSHA/osha_inspection_20210625/osha_inspection0.csv")
osha_inspection1 <- fread("Data/OSHA/osha_inspection_20210625/osha_inspection1.csv")
osha_inspection2 <- fread("Data/OSHA/osha_inspection_20210625/osha_inspection2.csv")
osha_inspection3 <- fread("Data/OSHA/osha_inspection_20210625/osha_inspection3.csv")
osha_inspection4 <- fread("Data/OSHA/osha_inspection_20210625/osha_inspection4.csv")

# make function that will filter columns and rows before merging chunks
# 2002 and newer 

inspection_function <- function(inspection, filter_year) {
  inspection <- select(inspection, activity_nr,
  inspection$year <- year(inspection$open_date)
  inspection <- filter(inspection, year >= (!!filter_year)) %>%
    select(-c(open_date, year))

# apply the function to each file

osha_inspection0 <- inspection_function(osha_inspection0, 2002)
osha_inspection1 <- inspection_function(osha_inspection1, 2002)
osha_inspection2 <- inspection_function(osha_inspection2, 2002)
osha_inspection3 <- inspection_function(osha_inspection3, 2002)
osha_inspection4 <- inspection_function(osha_inspection4, 2002)

# merge individual files

osha_inspection <- rbind(osha_inspection0, osha_inspection1) %>% 
  rbind(osha_inspection2) %>%
  rbind(osha_inspection3) %>%


The violation data, which contains any penalty amounts, is also split into several smaller files. We made a similar function to be applied in the same way as the inspection data.

# similar function that keeps relevant columns 
# and filters data since 2002

violation_function <- function(violation, filter_year) {
  violation <- select(violation, activity_nr,
  violation$year <- year(violation$issuance_date)
  violation <- filter(violation, year >= (!!filter_year)) %>%
    select(-c(issuance_date, year))

# then the function is applied to each file

osha_violation0 <- violation_function(osha_violation0, 2002)
osha_violation1 <- violation_function(osha_violation1, 2002)
osha_violation2 <- violation_function(osha_violation2, 2002)
osha_violation3 <- violation_function(osha_violation3, 2002)
osha_violation4 <- violation_function(osha_violation4, 2002)
osha_violation5 <- violation_function(osha_violation5, 2002)
osha_violation6 <- violation_function(osha_violation6, 2002)
osha_violation7 <- violation_function(osha_violation7, 2002)
osha_violation8 <- violation_function(osha_violation8, 2002)
osha_violation9 <- violation_function(osha_violation9, 2002)
osha_violation10 <- violation_function(osha_violation10, 2002)
osha_violation11 <- violation_function(osha_violation11, 2002)
osha_violation12 <- violation_function(osha_violation12, 2002)

# and then merged into a single table

osha_violation <- rbind(osha_violation0, osha_violation1) %>% 
  rbind(osha_violation2) %>%
  rbind(osha_violation3) %>%
  rbind(osha_violation4) %>%
  rbind(osha_violation5) %>%
  rbind(osha_violation6) %>%
  rbind(osha_violation7) %>%
  rbind(osha_violation8) %>%
  rbind(osha_violation9) %>%
  rbind(osha_violation10) %>%
  rbind(osha_violation11) %>%
  rbind(osha_violation12) #4,339,551 obs of 5 var

The total current penalty for an incident can be lower than the initial penalty given due to settlements. Sometimes this is reflected by a different amount in the “current_penalty” column. However, in some instances where the penalty is reduced to “0”, it is reflected by a mark on the “delete_flag” column instead.

To update the numeric value of the current penalty we converted all deleted flag marks into a “0” value and any non-deleted flag values into a “1” value. Then we multiplied this column by the “current_penalty” column.

This results in a “0” penalty for rows with a “delete_flag” mark while keeping the rest intact.

osha_violation <- osha_violation %>% mutate(delete_flag = replace(delete_flag, delete_flag == "X", "0")) %>%
  mutate(delete_flag = replace(delete_flag, delete_flag == "", "1"))

osha_violation$current_penalty <- as.double(osha_violation$delete_flag) * osha_violation$current_penalty

osha_violation <- select(osha_violation, -delete_flag)


After cleaning and filtering the three main data sets, we merged them into a master data set. Violation and inspection data and joined by “activity_nr” and then that was merged to the accident data by the equivalent code, “rel_insp_nr”.

master <- right_join(osha_violation, osha_inspection, by = c("activity_nr" = "activity_nr")) %>%
  right_join(osha_accident, by = c("activity_nr" = "rel_insp_nr"))

This master file contains a summary of every fatality reported since 2002, the location of the establishment, corresponding NAICS codes and any penalty amounts given.

## remove all previous chunks
rm(osha_accident_abstract, osha_accident_injury, osha_inspection0,
   osha_inspection1, osha_inspection2, osha_inspection3, osha_inspection4,
   osha_violation0, osha_violation1, osha_violation2, osha_violation3,
   osha_violation4, osha_violation5, osha_violation6, osha_violation7,
   osha_violation8, osha_violation9, osha_violation10, osha_violation11,
   osha_violation12, osha_inspection, osha_violation, osha_accident)

Before we filtered the master data set into agriculture workers, we replaced empty penalty values for “0”s to calculate some stats properly.

master <- master %>% mutate(current_penalty = replace_na(current_penalty, 0)) %>%
  mutate(initial_penalty = replace_na(initial_penalty, 0))

There are 3 layers of filters we used to determine all the heat-related deaths from farm workers since 2002.

First, we extracted any incidents with a NAICS 111 (Crop Production) or NAICS 1151 (Support Activities for Crop Production). Then we extracted incidents that contained any of the following values from the “event_keyword” column: Heat exhaustion, Heat stress, heat-related illness, or heat stroke.

Those 2 layers catch the heat-related deaths we need, but some events with a different cause of death that also contain those keywords were included. To remove these, we added a third layer that “reads” the event abstract and removes any accidents that say “natural causes”, “not heat-related” or “head trauma”.

ag111 <- master %>% filter(str_detect(naics_code, "^111")) 
ag1151 <- master %>% filter(str_detect(naics_code, "^1151")) 

ag <- rbind(ag111, ag1151)

ag <- ag %>% 
  filter(str_detect(event_keyword, "HEAT EXHAUSTION|HEAT STRESS|HEAT-RELATED ILLNESS|HEAT STROKE")) %>% 
  filter(str_detect(text, "natural causes", negate = T)) %>%
  filter(str_detect(text, "not heat-related", negate = T)) %>%
  filter(str_detect(text, "head trauma", negate = T))

ag_id <- distinct(ag, summary_nr, .keep_all = TRUE)

ag_pivot <- group_by(ag, activity_nr) %>% 
  summarise(total_current_penalty = sum(current_penalty),
            total_initial_penalty = sum(initial_penalty)) %>%
  left_join(ag_id, by = c("activity_nr" = "activity_nr")) %>%
  select(-current_penalty) %>%


Since 2002, we found that 65 agriculture workers died from heat-related illnesses.

Some of the recent deaths have open cases, which might result in a reduction. But from the 65 deaths we found, 18 resulted in no fine and another 29 had a fine under 10,000 USD. The median is 4000 USD.

We also looked at the heat-related deaths across all industries.

industry <- master %>% filter(str_detect(event_keyword, "HEAT EXHAUSTION|HEAT STRESS|HEAT-RELATED ILLNESS|HEAT STROKE")) %>% 
  filter(str_detect(text, "natural causes", negate = T)) %>%
  filter(str_detect(text, "not heat-related", negate = T)) %>%
  filter(str_detect(text, "head trauma", negate = T))

industry_id <- distinct(industry, summary_nr, .keep_all = TRUE)

industry_pivot <- group_by(industry, activity_nr) %>% 
  summarise(total_current_penalty = sum(current_penalty),
            total_initial_penalty = sum(initial_penalty)) %>%
  left_join(industry_id, by = c("activity_nr" = "activity_nr")) %>%
  select(-current_penalty) %>%

Using similar filter parameters but including all NAICS codes, we found a total of 396 deaths since 2002.

That means that agriculture deaths represent 16.4 of all heat-related deaths from workers in the US.

round(nrow(ag_pivot)/nrow(industry_pivot)*100, 1)
## [1] 16.4

At last, to make a searchable table we scraped the 6-digit NAICS list to match the codes with the definitions in agriculture.

naics <- read_csv("Data/NAICS.csv")

naics <- naics %>% filter(Length == 6) %>% select(-Length)

data_wrapper <- left_join(ag_pivot, naics, by = c("naics_code" = "Codes")) %>% 
  select(Year = year, 
         Establishment = estab_name,
         State = site_state,
         `Total Current Penalty` = total_current_penalty,
         Industry = Titles,
         `Description of Event` = text) %>% arrange(desc(Year))