The COVID-19 pandemic has been a transformative historical event of the 21st century. It has led to a dramatic loss of human life around the world, and many aspects of life and human behavior have been forced to change under the restrictions and limitations COVID-19 has brought forth. The food service industry, retail, educational institutions, healthcare, and many other sectors have been challenged and threatened at a scale we have never seen before. Periodic lockdowns and enforced quarantines have resulted in loss of income, economic assets, and employment for millions. When livelihoods are endangered to the extent they have been in this pandemic, feelings of desperation and hopelessness become widespread. For this reason, it is worth asking whether the pandemic and all of its challenges have affected criminal activity. Accordingly, our null hypothesis is that the pandemic has not impacted criminal activity at all.
In this project, we will attempt to uncover a correlation between the pandemic and criminal activity in the city of College Park, Maryland. We will do so by analyzing crime data, provided by the University of Maryland, College Park Police Department (UMPD), across two time periods: the first spanning the entire year of 2019 through March 2020 (pre-pandemic), and the second spanning April 2020 onwards (pandemic).
The following sections of this project will serve as a guide on the collection of our data (Data Collection/Scraping), the cleaning of the data (Data Processing), the visualization of the data (Data Visualization), and the in-depth analysis of the data (Analysis, Hypothesis Testing, and Machine Learning).
To collect the necessary data for this project, we will scrape the University of Maryland, College Park Police Department website: http://www.umpd.umd.edu/stats/incident_logs.cfm.
Because the website organizes the data by year and month, we will use Python's requests module to retrieve the HTML that corresponds to the page for each month in a year. Afterwards, we will use the BeautifulSoup library to find the HTML table that contains the data for each month. We will then concatenate each month's data to make a dataframe for each year. Because we are comparing crime between pre-pandemic and pandemic periods, we will only use the data from 2019 and 2020. Yearly data is then stored in a dictionary, with the year ('2019' or '2020') acting as the key and the corresponding dataframe for that year serving as the value.
import requests
import pandas as pd, numpy as np, sklearn as sk, matplotlib.pyplot as plt
from sklearn import linear_model
from bs4 import BeautifulSoup
from scipy.stats import ttest_rel
import sys
# pandas.read_html needs an HTML parser backend such as lxml or html5lib
!{sys.executable} -m pip install lxml html5lib
years = ['2019','2020']
# Dictionary that will hold the raw data for each year
data_raw = {}
# Loop over each year
for year in years:
    # This will hold the yearly data set
    df = pd.DataFrame()
    # Loop over each month
    for m in range(1, 13):
        page = requests.get('http://www.umpd.umd.edu/stats/incident_logs.cfm?year=' + year + "&month=" + str(m))
        content = BeautifulSoup(page.content, 'html.parser')
        table = content.find('table')
        # read_html returns a list of dataframes; concatenate them into one
        t = pd.read_html(str(table))
        d = pd.concat(t)
        df = pd.concat([df, d])
    # Add the yearly data to the main dictionary
    data_raw[year] = df
data_raw['2019'].head()
As can be seen above, data scraping does not give ideally formatted or "clean" data, due to how the website and its HTML store the data. The main problem is that each case is listed twice in the dataframe, with the location of the case repeated five times in the second entry, and there is an extra column named "Unnamed: 5". Additionally, the way we have divided our data by year does not quite fit our question. We have one dataframe of crime data for 2019 and one for 2020, but since we want to compare data from before the pandemic with data from during the pandemic, it makes more sense to reorganize these dataframes so that they reflect the actual periods of time that constitute life before the pandemic and life during the pandemic. For the purposes of this project, we will consider 2019 through March 2020 as pre-pandemic and April 2020 onwards as the pandemic. In this section we will tidy up the data, reformat it for our convenience, and deal with all NaN or missing values.
data_cleaned = {}
for year in years:
    rows = []
    i = 0
    # Each case occupies two consecutive rows in the raw table: the first holds most fields,
    # the second holds the location (under the same date/time column header)
    while i < len(data_raw[year]):
        rows.append({'CaseNumber': data_raw[year].iloc[i]['UMPD CASENUMBER'],
                     'OccuredDateTime': data_raw[year].iloc[i]['OCCURRED DATE TIMELOCATION'],
                     'ReportDateTime': data_raw[year].iloc[i]['REPORT DATE TIME'],
                     'Type': data_raw[year].iloc[i]['TYPE'],
                     'Disposition': data_raw[year].iloc[i]['DISPOSITION'],
                     'Location': data_raw[year].iloc[i + 1]['OCCURRED DATE TIMELOCATION']})
        i += 2
    df_new = pd.DataFrame(rows, columns=['CaseNumber', 'OccuredDateTime', 'ReportDateTime',
                                         'Type', 'Disposition', 'Location'])
    data_cleaned[year] = df_new
data_cleaned['2019'].head()
In summary, instead of modifying each dataframe, we opted to create a new dataframe for each year, as this approach was simpler to code. Essentially, we have removed every other row and replaced the unnamed column with the location of the incident, as well as renaming each column header.
After this data cleaning process, two more issues need solving. The first is that we now need to split this clean data into the appropriate time periods, pre-pandemic and pandemic, for our analysis. The second is that we need to deal with any NaN values. The below code will accomplish this.
As described before, pre-pandemic will constitute 2019 through March 2020 and the pandemic will constitute April 2020 onwards. Therefore, we need to split the 2020 dataframe in the data_cleaned dictionary at the first case that occurred in April. The pre-pandemic time period will be spread over two dataframes, pre_pandemic_2019 and pre_pandemic_2020. All cases that occurred in 2020 but before April will be part of the pre_pandemic_2020 dataframe, and all cases that occurred in April or later will be part of the pandemic_2020 dataframe. The pre_pandemic_2019 dataframe will simply contain all cases that occurred in 2019.
For NaN values in columns like Location, we will replace all instances of NaN with "Unknown".
Note: Due to certain efficiency constraints of the geolocation library we would like to use for the visualization section of this project, we will sample 10% of the cases for each time period dataframe.
data_to_analyze = {} # store the final processed data in this dictionary
for index, row in data_cleaned['2020'].iterrows():
    if row['OccuredDateTime'].startswith('04'): # finds the first case that occurred in April 2020
        # dataframe that has the 2019 cases
        data_to_analyze['pre_pandemic_2019'] = pd.DataFrame(data_cleaned['2019']).sample(frac=0.1)
        # dataframe that has the 2020 cases before April 2020
        data_to_analyze['pre_pandemic_2020'] = pd.DataFrame(data_cleaned['2020'].iloc[:index, :]).sample(frac=0.1)
        # dataframe that has the cases from April 2020 onwards (including this first April case)
        data_to_analyze['pandemic_2020'] = pd.DataFrame(data_cleaned['2020'].iloc[index:, :]).sample(frac=0.1)
        break
# Replace missing locations with "Unknown"
data_to_analyze['pre_pandemic_2019']['Location'].fillna("Unknown", inplace=True)
data_to_analyze['pre_pandemic_2020']['Location'].fillna("Unknown", inplace=True)
data_to_analyze['pandemic_2020']['Location'].fillna("Unknown", inplace=True)
data_to_analyze['pre_pandemic_2019']
data_to_analyze['pre_pandemic_2020']
data_to_analyze['pandemic_2020']
In this section, we will be visualizing the College Park crime data by generating interactive maps provided by the Folium library.
For more information on Folium and its documentation: https://python-visualization.github.io/folium/quickstart.html
Before we can actually generate these interactive maps, we must add latitude and longitude coordinates for each case in our pre-pandemic and pandemic dataframes. We will first need to add latitude and longitude columns to each of our dataframes. To obtain these coordinates, we will need to formulate a complete College Park address for the location of each case and then determine the coordinates of this address by using geopy (https://pypi.org/project/geopy/).
To make sure we provide a complete and appropriate address that geopy can identify and locate, we must reformat certain types of case locations.
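To make these reformatting rules concrete before the full geocoding loop, here is a minimal, self-contained sketch of the two transformations (the helper name normalize_location and the example location strings are ours, purely for illustration; the loop below applies the same string manipulation inline):
# A minimal illustration of the two location formats we reformat below:
#   "NNNN block of <Street>"           -> "NNNN <Street>"
#   "<Building> at <Street Address>"   -> "<Street Address>"
def normalize_location(address):
    if "block of" in address:
        lst = address.split()
        address = lst[0] + " " + " ".join(lst[3:])     # keep block number + street name
    if " at " in address:
        lst = address.split()
        address = " ".join(lst[lst.index('at') + 1:])  # keep only the street address
    return address + ", College Park, MD 20740"

# Hypothetical examples of the two patterns
print(normalize_location("7300 block of Baltimore Ave"))              # 7300 Baltimore Ave, College Park, MD 20740
print(normalize_location("Stamp Student Union at 3972 Campus Dr"))    # 3972 Campus Dr, College Park, MD 20740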
!pip install geopy
import geopy
from geopy.geocoders import Nominatim
#Set default np.nan values for Latitude and Longitude
data_to_analyze['pre_pandemic_2019']["Latitude"] = np.nan
data_to_analyze['pre_pandemic_2019']["Longitude"] = np.nan
data_to_analyze['pre_pandemic_2020']["Latitude"] = np.nan
data_to_analyze['pre_pandemic_2020']["Longitude"] = np.nan
data_to_analyze['pandemic_2020']["Latitude"] = np.nan
data_to_analyze['pandemic_2020']["Longitude"] = np.nan
found_coord = {} # Caches coordinates for locations we have already geocoded
geolocator = Nominatim(user_agent="murat") # to use Nominatim you must provide a user_agent
for df in data_to_analyze.values(): # For each time period dataframe
    for idx in df.index:
        address = df.at[idx, "Location"]
        raw_location = address # key used for the coordinate cache
        if address == "Unknown": # if the address/location is unknown, give arbitrary College Park coordinates
            df.at[idx, "Latitude"] = 38.980666
            df.at[idx, "Longitude"] = -76.9369189
            continue
        if 'Golf Course Rd' in address: # special case: use fixed coordinates for Golf Course Rd
            df.at[idx, "Latitude"] = 38.991081
            df.at[idx, "Longitude"] = -76.953156
            continue
        if raw_location in found_coord: # check if we have already found coordinates for this location
            df.at[idx, "Latitude"] = found_coord[raw_location][0]
            df.at[idx, "Longitude"] = found_coord[raw_location][1]
            continue
        if "block of" in address: # block location case, e.g. "NNNN block of <Street>"
            lst = address.split()
            # reformat the address to include just the block number and street name
            address = lst[0]
            for i in range(3, len(lst)):
                address += " " + lst[i]
        if " at " in address: # building location case, e.g. "<Building> at <Street Address>"
            lst = address.split()
            index_after_of = lst.index('at')
            # reformat the address to just the street address of the building
            address = ""
            for i in range(index_after_of + 1, len(lst)):
                address += " " + lst[i]
        address = address + ", College Park, MD 20740" # ensure the address resolves within College Park
        location = geolocator.geocode(address)
        if location is not None: # only edit cases that return valid coordinates
            df.at[idx, "Latitude"] = location.latitude
            df.at[idx, "Longitude"] = location.longitude
            found_coord[raw_location] = [location.latitude, location.longitude]
data_to_analyze['pandemic_2020']
data_to_analyze['pre_pandemic_2020']
data_to_analyze['pre_pandemic_2019']
After finding the latitude and longitude coordinates for all of the cases, we can use Folium to map and visualize the data in each dataframe. Let's first generate a map of the general College Park area.
!pip install folium
import folium
import matplotlib.pyplot as plt
map_osm = folium.Map(location=[38.9897, -76.9378], zoom_start=14)
map_osm
To make our maps more descriptive, we will use marker popups, each of which represents an individual case. To generate these popups for each of the maps, we create an HTML snippet for each case, as seen in the following function.
#Takes in a single row of a dataframe and builds an HTML snippet from its contents
def popup_as_html(row):
    html = '<center><p>' + \
           str(row['CaseNumber']) + \
           '<br/>' + \
           'Type: ' + str(row['Type']) + \
           '<br/>' + \
           'Disposition: ' + str(row['Disposition']) + \
           '</p></center>'
    return html
Now we can generate our map visualizations. Alongside the map for each time period, we include code to generate a list of the ten most common types of cases filed in that period.
map1 = folium.Map(location=[38.9897, -76.9378], zoom_start=14)
for idx, row in data_to_analyze['pre_pandemic_2019'].iterrows(): # each case is labeled with a blue marker
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=popup_as_html(row),
        icon=folium.Icon(color='blue', icon='warning', prefix='fa')
    ).add_to(map1)
map1
Code to generate a list of the top ten types of cases across the 2019 pre-pandemic period.
types = sorted(data_to_analyze['pre_pandemic_2019']["Type"].unique()) # Get all unique case types for this time period
types_with_counts = [] # Each case type's count will be stored in this list
temp_df = data_to_analyze['pre_pandemic_2019'].copy()
i = 0
for ele in types: # Compute the frequency of each case type by grouping temp_df by Type
    # groupby('Type') sorts its groups alphabetically, matching the sorted types list, so position i corresponds to ele
    types_with_counts.append((ele, temp_df.groupby('Type').count()['Disposition'].values[i]))
    i += 1
# types_with_counts is now a list of (TYPE_OF_CRIME, FREQUENCY) tuples
types_with_counts = sorted(types_with_counts, key=lambda x: x[1], reverse=True)
types_with_counts[0:10]
map2 = folium.Map(location=[38.9897, -76.9378], zoom_start=14)
for idx, row in data_to_analyze['pre_pandemic_2020'].iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=popup_as_html(row),
        icon=folium.Icon(color='red', icon='warning', prefix='fa')
    ).add_to(map2)
map2
Code to generate a list of the top ten types of cases across the 2020 pre-pandemic period.
types = sorted(data_to_analyze['pre_pandemic_2020']["Type"].unique()) # Get all unique case types for this time period
types_with_counts = [] # Each case type's count will be stored in this list
temp_df = data_to_analyze['pre_pandemic_2020'].copy()
i = 0
for ele in types: # Compute the frequency of each case type by grouping temp_df by Type
    # groupby('Type') sorts its groups alphabetically, matching the sorted types list, so position i corresponds to ele
    types_with_counts.append((ele, temp_df.groupby('Type').count()['Disposition'].values[i]))
    i += 1
# types_with_counts is now a list of (TYPE_OF_CRIME, FREQUENCY) tuples
types_with_counts = sorted(types_with_counts, key=lambda x: x[1], reverse=True)
types_with_counts[0:10]
The following is a map of all the cases filed for our sample of the pandemic 2020 time period.
map3 = folium.Map(location=[38.9897, -76.9378], zoom_start=14)
for idx, row in data_to_analyze['pandemic_2020'].iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=popup_as_html(row),
        icon=folium.Icon(color='purple', icon='warning', prefix='fa')
    ).add_to(map3)
map3
Code to generate a list of the top ten types of cases across the 2020 pandemic period.
types = sorted(data_to_analyze['pandemic_2020']["Type"].unique()) # Get all unique case types for this time period
types_with_counts = [] # Each case type's count will be stored in this list
temp_df = data_to_analyze['pandemic_2020'].copy()
i = 0
for ele in types: # Compute the frequency of each case type by grouping temp_df by Type
    # groupby('Type') sorts its groups alphabetically, matching the sorted types list, so position i corresponds to ele
    types_with_counts.append((ele, temp_df.groupby('Type').count()['Disposition'].values[i]))
    i += 1
# types_with_counts is now a list of (TYPE_OF_CRIME, FREQUENCY) tuples
types_with_counts = sorted(types_with_counts, key=lambda x: x[1], reverse=True)
types_with_counts[0:10]
From the maps for each time period, despite only using a sample of each dataframe, it is clear that the combined pre-pandemic time period saw more crime than the pandemic time period. The markers on the 2019 and early-2020 pre-pandemic maps outnumber the markers on the 2020 pandemic map, and they are also spread out over a wider area of the University of Maryland campus and off-campus areas. From the pre-pandemic maps, CDS (controlled dangerous substance) violations and DWI/DUI (driving under the influence) violations seemed to concentrate in the south-campus (Campus Drive and Route 1) and Old Town areas. This is understandable given that these areas are where most bars and fraternity houses are located. In comparison, the concentration of CDS and DWI/DUI violations in these same areas is lower for the 2020 pandemic period. This can be attributed to the COVID-19 protective measures taken by both the university and Prince George's County, where College Park is located.
Interestingly, theft, DWI/DUI violations, and injured/sick person cases persist as frequently reported cases across all three time periods. However, as time progresses, theft appears to become relatively more common, especially in the 2020 pandemic period.
It is important to note, however, that since we sampled only 10% of the data from each dataframe, these visualizations and top-ten lists could certainly reflect sampling bias or sampling error.
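As a rough check on how much the 10% sampling skews these lists, one could recompute the case-type counts on the full cleaned data; below is a small sketch that re-splits the full 2020 data at the same April cutoff used earlier (the variable names are ours):
# Sanity check: case-type counts on the full (unsampled) cleaned data
full_2019 = data_cleaned['2019']
full_2020 = data_cleaned['2020']
split_mask = full_2020['OccuredDateTime'].astype(str).str.startswith('04')
split_idx = split_mask.idxmax()  # index of the first case occurring in April 2020

print(full_2019['Type'].value_counts().head(10))                     # 2019 (pre-pandemic)
print(full_2020.iloc[:split_idx]['Type'].value_counts().head(10))    # Jan-Mar 2020 (pre-pandemic)
print(full_2020.iloc[split_idx:]['Type'].value_counts().head(10))    # Apr 2020 onwards (pandemic)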
In this section, we'll look at monthly crime counts in College Park, Maryland for 2019 and 2020. We'll tally the number of violations reported each month and then compare them to see whether there is any difference in violations between the pre-pandemic and pandemic periods.
# Copy created so we do not alter the initial dataframe
data19 = data_cleaned['2019'].copy()
# Convert to datetime format so pandas can group by month
data19["ReportDateTime"] = pd.to_datetime(data19["ReportDateTime"])
# Group the cases by calendar month and count the violations reported in each month of 2019.
# pd.Grouper(freq='M') buckets the rows by the month of the report date, and size() counts
# the rows in each bucket, giving one row per month with its violation count.
count19 = data19.groupby(pd.Grouper(key='ReportDateTime', freq='M')).size().to_frame('Num_Of_Violations')
count19
# Copy created so we do not alter the initial dataframe
data20 = data_cleaned['2020'].copy()
# Convert to datetime format so pandas can group by month
data20["ReportDateTime"] = pd.to_datetime(data20["ReportDateTime"])
# Group the cases by calendar month and count the violations reported in each month of 2020
count20 = data20.groupby(pd.Grouper(key='ReportDateTime', freq='M')).size().to_frame('Num_Of_Violations')
count20
Each row of these tables now corresponds to one calendar month, in chronological order, together with its number of violations. To make plotting and regression easier, we add an explicit Month column (January = 1, February = 2, and so on), which we can read directly off the datetime index.
# Add a Month column (1-12), read directly from the month of each row's datetime index
count19['Month'] = count19.index.month
count19
# Add a Month column (1-12), read directly from the month of each row's datetime index
count20['Month'] = count20.index.month
count20
Initially comparing the data for 2019 and 2020, we see that violations in January and February are about the same between the two years, with slight deviations. However, starting in March, we see a decrease that persists through pretty much the rest of the year.
Note that during this time, we know that the number of students, staff, and faculty on campus decreased after students were sent home on March 12, 2020 and stay-at-home (virtual) learning began. This reduced the number of people commuting to and residing within College Park (at the very least, the number of students living in dorms dropped). Furthermore, since virtual learning continued through the summer and fall semesters, fewer students and staff remained on campus for the rest of the year. With a significant portion of the campus population gone, this could contribute to the major decrease in violations in College Park.
Let's take a quick, closer look at this:
According to this website: https://datausa.io/profile/geo/college-park-md/, College Park has a population of 32,183.
According to this website: https://www.usnews.com/best-colleges/university-of-maryland-2103/student-life, 39% of students live on campus.
39% of the 40,743 total enrolled students is around 15,890 students on campus. Assuming that all of them went home because of the COVID-19 pandemic, that is roughly 16,000 students who left the College Park area (treating on-campus students as separate from the city's permanent resident count).
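As a quick back-of-the-envelope check of that arithmetic (the enrollment figure, on-campus percentage, and city population come from the two sources above; the variable names are ours):
# Back-of-the-envelope estimate of how many on-campus students left College Park
total_enrolled = 40743   # total UMD enrollment (US News page above)
pct_on_campus = 0.39     # share of students living on campus (US News page above)
cp_population = 32183    # College Park population (Data USA page above)

students_on_campus = total_enrolled * pct_on_campus
print(round(students_on_campus))                      # ~15,890 students
print(round(students_on_campus / cp_population, 2))   # ~0.49, i.e., about half the city's population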
Now, let's do a linear regression for 2019 and 2020 and plot the number of violations each month in each year:
# Reshapes the arrays to run a linear regression on them
x19 = np.array(count19['Month']).reshape(-1,1)
y19 = np.array(count19['Num_Of_Violations']).reshape(-1,1)
# We use SkLearn to run a linear regression based off of the data in 2019 that we just gathered
reg19 = sk.linear_model.LinearRegression().fit(x19, y19)
# Plots the linear regression and the number of violations per month in 2019, as well as adding relevant graph info
plt.plot(x19, reg19.predict(x19))
plt.title('Number of Violations per Month in 2019')
plt.xlabel('Month')
plt.ylabel('Number of Violations')
for i, row in count19.iterrows():
    plt.scatter(row['Month'], row['Num_Of_Violations'])
plt.show()
# Reshapes the arrays to run a linear regression on them
x20 = np.array(count20['Month']).reshape(-1,1)
y20 = np.array(count20['Num_Of_Violations']).reshape(-1,1)
# We use SkLearn to run a linear regression based off of the data in 2020 that we just gathered
reg20 = sk.linear_model.LinearRegression().fit(x20, y20)
# Plots the linear regression and the number of violations per month in 2020, as well as adding relevant graph info
plt.plot(x20, reg20.predict(x20))
plt.title('Number of Violations per Month in 2020')
plt.xlabel('Month')
plt.ylabel('Number of Violations')
for i, row in count20.iterrows():
    plt.scatter(row['Month'], row['Num_Of_Violations'])
plt.show()
After looking at the plots of violations per month for 2019 and 2020, we see that the number of violations per month increases over 2019 (as seen in the 2019 regression line), while the number of violations per month decreases over 2020 (as seen in the 2020 regression line).
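To put rough numbers on those trends, we can read the fitted slopes directly off the reg19 and reg20 models from the cells above (a small sketch; coef_ is the month-over-month change in violations implied by each regression line):
# Slope (change in violations per month) and intercept of each fitted regression line
print("2019 slope:", reg19.coef_[0][0], "intercept:", reg19.intercept_[0])
print("2020 slope:", reg20.coef_[0][0], "intercept:", reg20.intercept_[0])
# A positive 2019 slope and a negative 2020 slope would match the visual trend described above.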
Continuing, we'll run paired t-tests based on the predicted and observed monthly counts for 2019 and 2020 and see whether we reject or fail to reject the null hypothesis.
# We grab predicted values based off of our x values (months) from the linear regression
# to see how accurate our linear regression is for 2019 and 2020 values, respectively.
predict19 = reg19.predict(x19)
predict20 = reg20.predict(x20)
# Paired t-tests: (1) predicted vs. observed values within each year,
# (2) predicted 2019 vs. predicted 2020, and (3) observed 2019 vs. observed 2020
t1 = ttest_rel(predict19, y19)
t2 = ttest_rel(predict20, y20)
t3 = ttest_rel(predict19, predict20)
t4 = ttest_rel(y19, y20)
print(t1)
print(t2)
print(t3)
print(t4)
As we can see from the t-test results, the paired t-test between the 2019 predicted and observed values gives a t value extremely close to 0, and the same occurs for 2020; both p-values are essentially 1, meaning the fitted lines are centered on the observed monthly counts (this is expected, since least-squares residuals average out to roughly zero).
The third t-test compares the predicted monthly counts for 2019 and 2020. As we can see, the t value is high at 14.19 and the p-value is very close to 0. Since the p-value is extremely low, we reject the null hypothesis: there is enough evidence to link the COVID-19 pandemic with a lower crime rate. The fourth t-test, on the observed monthly counts themselves, confirms this, with a p-value of 0.00025 for a t value of 5.298.
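As a complementary effect-size check (a small sketch using the y19 and y20 arrays defined above), we can compare the average number of violations reported per month in each year; the t-tests indicate the difference is unlikely to be due to chance, while the means show how large it is:
# Average number of violations reported per month in each year
print("2019 mean monthly violations:", y19.mean())
print("2020 mean monthly violations:", y20.mean())
print("Difference (2019 - 2020):", y19.mean() - y20.mean())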