
Regression Machine Learning

A policy analysis of COVID-19 spread in United States prisons 

A study conducted with highly interpretable regression models, which allowed us to extract insights and report on the possible causes or influential factors in the spread.

The Problem

In January 2020, the United States reported its first case of COVID-19 in the state of Washington. Little was known about this novel coronavirus, but as time progressed, we have seen cases climb, deaths grow at an alarming rate, and life as we know it change drastically, including the way we interact with our fellow citizens.

In this project, we took a deep dive into one segment of the population that is often overlooked in the discussion of COVID-19: prisoners in the United States. Regardless of one’s beliefs about prisoners (whether justice was served fairly or not), we look at the impact the virus has had on the prison system as a whole.

Furthermore, we wanted to gain meaningful insights about prisoners and staff members affected by COVID-19, and to suggest recommendations to the Federal Bureau of Prisons for decreasing the rate of COVID-19 cases.

Our analysis involved several models, which are discussed later in this summary. The best models produced scores ranging between 75% and 80%.

Exploratory Data Analysis

To better understand the phenomenon, we started by pulling data from The Marshall Project and Johns Hopkins University. We wanted to understand how many inmates contracted COVID-19 and whether there were correlations with cases outside of prison.
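One way to check for such a correlation is the Pearson coefficient between prison and community case counts over the same months. The sketch below is only an illustration under our own assumptions: the function name `pearson_r` and the monthly counts are hypothetical and are not taken from the project data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly counts for one state (illustrative, not project data).
prison_cases = [10, 40, 90, 160, 150]
community_cases = [1000, 3000, 7000, 12000, 11000]

r = pearson_r(prison_cases, community_cases)
print(f"Pearson r between prison and community cases: {r:.3f}")
```

A value near 1 would suggest that prison cases rise and fall with cases outside, although correlation alone says nothing about the direction of transmission.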

We then checked the number of new cases over time, dividing the states into 4 regions, as per the Federal Bureau of Justice classification.

We also took it a step further by comparing the features of prisons with the number of cases. To do so, we looked into the Federal Bureau of Justice datasets.

Facilities Characteristics:

  • Design capacity
  • Percent of design capacity occupied
  • Rated capacity
  • Percent of rated capacity occupied
  • Total Prisoners 2018 & 2019
  • Total Male Inmates 2018 & 2019
  • Total Number of Facilities
  • Facilities operating one or more work programs
  • Inmates participating in one or more work programs
    • Prison industries
    • Support services
    • Farming
    • Public works
  • Phone Rate

To investigate correlations with the prison features, we ran a Lasso regression and interpreted its coefficients. Interpretability plays a very important role in our analysis, which is why we chose to rely on simpler models that make the data more intelligible.
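To show why Lasso coefficients are easy to read, here is a minimal pure-Python sketch of Lasso fitted by coordinate descent on standardized features. It is not the project's code (a library implementation such as scikit-learn's `Lasso` would be the usual choice); the function names, the `alpha` value, and the eight-state dataset are all hypothetical and chosen only for illustration.

```python
def standardize(column):
    """Rescale one feature column to zero mean and unit variance.

    Assumes the column is not constant (otherwise std would be 0).
    """
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(features, target, alpha, n_passes=100):
    """Coordinate-descent Lasso on standardized features.

    Minimizes (1 / 2n) * sum(residual^2) + alpha * sum(|coefficient|)
    and returns a dict mapping feature name to coefficient.
    """
    n = len(target)
    names = list(features)
    columns = {name: standardize(features[name]) for name in names}
    target_mean = sum(target) / n
    residual = [t - target_mean for t in target]  # all coefficients start at 0
    coefs = {name: 0.0 for name in names}
    for _ in range(n_passes):
        for name in names:
            col = columns[name]
            # Correlation of this feature with the residual that excludes it.
            rho = sum(c * (r + coefs[name] * c) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            updated = soft_threshold(rho, alpha) / scale
            delta = updated - coefs[name]
            residual = [r - delta * c for r, c in zip(residual, col)]
            coefs[name] = updated
    return coefs


# Hypothetical facility data for eight states (illustrative, not project data).
facility_features = {
    "farming_participants": [30, 10, 30, 10, 30, 10, 30, 10],
    "public_works_participants": [8, 8, 2, 2, 8, 8, 2, 2],
    "phone_rate": [3, 3, 3, 3, 1, 1, 1, 1],
}
case_rate = [12.05, 8.05, 12.05, 8.05, 11.95, 7.95, 11.95, 7.95]

coefficients = fit_lasso(facility_features, case_rate, alpha=0.1)
for name, coef in coefficients.items():
    print(f"{name:<28}{coef:+.3f}")
```

Because the L1 penalty sets weak coefficients to exactly zero, the features that keep a non-zero coefficient are the ones worth discussing, and on standardized inputs each coefficient reads as the change in the target per one standard deviation of that feature.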

As illustrated in the slide above, we discovered that where the number of inmates participating in Farming programs is higher, the number of cases increased by 2 points.

This finding is in line with a national trend showing that farming industry workers were more affected by the 2020 pandemic.

Models

For the COVID-19 model, the following features were passed to the model:

  • Cases per month
  • Cases per state
  • Data related to staff, prisoners, and civilians, measuring cases, tests, and deaths

For the COVID-19 model with Feature Engineering, the following features were passed to the model:

  • Every feature listed above, as well as:
    • Design Capacity
    • Prison Work Programs

Testing

We customized our train-test split to follow chronological order by date. This change allowed our model to predict the number of cases for the following month.
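A chronological split of this kind can be sketched as follows. This is an illustration under our own assumptions rather than the project's code: the `chronological_split` helper, the cutoff date, and the records are hypothetical.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Train on records dated before cutoff, test on records on or after it."""
    ordered = sorted(records, key=lambda rec: rec["date"])
    train = [rec for rec in ordered if rec["date"] < cutoff]
    test = [rec for rec in ordered if rec["date"] >= cutoff]
    return train, test


# Hypothetical monthly records (illustrative values, not project data).
records = [
    {"date": date(2020, 8, 1), "state": "WA", "cases": 75},
    {"date": date(2020, 6, 1), "state": "WA", "cases": 40},
    {"date": date(2020, 9, 1), "state": "WA", "cases": 90},
    {"date": date(2020, 7, 1), "state": "WA", "cases": 55},
]

# Hold out the final month so the model is scored on data from the future.
train, test = chronological_split(records, cutoff=date(2020, 9, 1))
print(len(train), "training rows,", len(test), "test rows")
```

Unlike a random shuffle, this keeps every test row later than every training row, so the test score reflects how well the model forecasts the following month.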

Modeling Prisoner COVID-19 Cases Based on COVID-19 Data

Model                      Train Score   Test Score
Ridge                      0.713         0.786
Lasso                      0.834         0.758
Linear Regression          0.841         0.647
Random Forest              0.932         0.656
Bagged Decision Tree       0.927         0.783
KNN                        0.555         0.561

Modeling Staff COVID-19 Cases Based on COVID-19 Data

Model                      Train Score   Test Score
Random Forest GridSearch   0.747         0.792
Ridge                      0.835         0.786
Lasso                      0.95          0.715
Random Forest              0.945         0.869
Bagged Decision Tree       0.938         0.820
KNN                        0.629         0.662

Modeling Prisoner COVID-19 Cases with Facility Features

Model                      Train Score   Test Score
Random Forest              0.718         0.759
Lasso                      0.761         0.698
Ridge                      0.635         0.751
Bagged Decision Tree       0.821         0.68
KNN                        0.559         0.517

Modeling Staff COVID-19 Cases with Facility Features

Model                      Train Score   Test Score
Random Forest              0.779         0.80
Lasso                      0.95          0.704
Ridge                      0.947         0.772
Bagged Decision Tree       0.772         0.747
KNN                        0.633         0.635

These results show why we had to test a variety of models: we wanted our final model to generalize well to data it hasn’t seen, rather than overfit the training data.
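One simple way to read the tables with overfitting in mind is to compare each model's train score with its test score. The sketch below does this for the scores reported above for prisoner cases based on COVID-19 data; the helper name `generalization_gaps` is our own, and a large positive gap is only a rough warning sign, not a formal test.

```python
# (train score, test score) as reported for prisoner cases on COVID-19 data.
prisoner_covid_scores = {
    "Ridge": (0.713, 0.786),
    "Lasso": (0.834, 0.758),
    "Linear Regression": (0.841, 0.647),
    "Random Forest": (0.932, 0.656),
    "Bagged Decision Tree": (0.927, 0.783),
    "KNN": (0.555, 0.561),
}


def generalization_gaps(scores):
    """Train score minus test score per model; large positive gaps hint at overfitting."""
    return {model: round(train - test, 3) for model, (train, test) in scores.items()}


gaps = generalization_gaps(prisoner_covid_scores)

# List models from the largest gap (most likely overfit) to the smallest.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for model, gap in ranked:
    print(f"{model:<22}{gap:+.3f}")

# The model with the highest test score in this table.
best_on_test = max(prisoner_covid_scores, key=lambda m: prisoner_covid_scores[m][1])
print("Highest test score:", best_on_test)
```

Read this way, a model with a high train score but a much lower test score is a weaker candidate than one whose two scores sit close together.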
