Introduction¶

Imports¶

Below are a few important packages that may be used to analyze, manipulate, and visualize the data.

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
plt.rcParams['figure.figsize'] = [10, 5]

Dataset¶

This dataset contains data on all police-recorded motor vehicle collisions in New York City. Each row represents a reported vehicle collision with information about where the incident occurred, the circumstances surrounding the incident, and how many people were injured or killed.

This dataset is provided by the New York Police Department (NYPD) through the official NYC OpenData website and has been updated frequently since 2014. Because the original dataset contains over 2 million entries, a sample of 100,000 crashes from 2021 and 2022 has been imported from the official website.

Below is code importing the CSV from the NYC OpenData web API.

In [2]:
#Reading in the data from the NYC Open Data API
#Note: the $limit parameter is set to 100,000 to avoid performance and memory issues
crashes = pd.read_csv('https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=100000')

display(crashes.head(5))
crash_date crash_time borough zip_code latitude longitude location on_street_name off_street_name cross_street_name number_of_persons_injured number_of_persons_killed number_of_pedestrians_injured number_of_pedestrians_killed number_of_cyclist_injured number_of_cyclist_killed number_of_motorist_injured number_of_motorist_killed contributing_factor_vehicle_1 contributing_factor_vehicle_2 contributing_factor_vehicle_3 contributing_factor_vehicle_4 contributing_factor_vehicle_5 collision_id vehicle_type_code1 vehicle_type_code2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5
0 2021-09-11T00:00:00.000 2:39 NaN NaN NaN NaN NaN WHITESTONE EXPRESSWAY 20 AVENUE NaN 2 0 0 0 0 0 2 0 Aggressive Driving/Road Rage Unspecified NaN NaN NaN 4455765 Sedan Sedan NaN NaN NaN
1 2022-03-26T00:00:00.000 11:45 NaN NaN NaN NaN NaN QUEENSBORO BRIDGE UPPER NaN NaN 1 0 0 0 0 0 1 0 Pavement Slippery NaN NaN NaN NaN 4513547 Sedan NaN NaN NaN NaN
2 2022-06-29T00:00:00.000 6:55 NaN NaN NaN NaN NaN THROGS NECK BRIDGE NaN NaN 0 0 0 0 0 0 0 0 Following Too Closely Unspecified NaN NaN NaN 4541903 Sedan Pick-up Truck NaN NaN NaN
3 2021-09-11T00:00:00.000 9:35 BROOKLYN 11208.0 40.667202 -73.866500 \n, \n(40.667202, -73.8665) NaN NaN 1211 LORING AVENUE 0 0 0 0 0 0 0 0 Unspecified NaN NaN NaN NaN 4456314 Sedan NaN NaN NaN NaN
4 2021-12-14T00:00:00.000 8:13 BROOKLYN 11233.0 40.683304 -73.917274 \n, \n(40.683304, -73.917274) SARATOGA AVENUE DECATUR STREET NaN 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN 4486609 NaN NaN NaN NaN NaN

Preprocessing¶

There are a few important groups of data that are found in the dataset:

  • Time Data: The precise time and date of the event are provided for every crash in the 'crash_date' and 'crash_time' columns
  • Location Data: The location of the crash is recorded in 'borough', 'zip_code', 'on_street_name', etc.
  • Fatality and Injury Data: Data on the number of injuries and fatalities is in 'number_of_(civilian type)_(injured/killed)'
  • Factors and Vehicles: Information on presumed reasons for the crash and the makes of the cars involved is found in the 'contributing_factors' and 'vehicle_type_codes' columns

Below, all 29 of the column names are printed.

In [3]:
#Printing the column names
display(pd.DataFrame(crashes.columns).transpose()) 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 crash_date crash_time borough zip_code latitude longitude location on_street_name off_street_name cross_street_name number_of_persons_injured number_of_persons_killed number_of_pedestrians_injured number_of_pedestrians_killed number_of_cyclist_injured number_of_cyclist_killed number_of_motorist_injured number_of_motorist_killed contributing_factor_vehicle_1 contributing_factor_vehicle_2 contributing_factor_vehicle_3 contributing_factor_vehicle_4 contributing_factor_vehicle_5 collision_id vehicle_type_code1 vehicle_type_code2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5

Column Deletion¶

Below, data on the exact latitude and longitude is removed from the dataframe. Because we already have the borough and street name where the crash occurred, the exact GPS location will not be helpful in analyzing this data. The 'cross_street_name' and 'off_street_name' columns were deleted as well because they are very sparse and have an ambiguous meaning as detailed by the NYPD site, and 'zip_code' is dropped since 'borough' and 'on_street_name' already locate the crash. Finally, the 'collision_id' column is removed because it does not contain any important data on a crash: it is a key generated to uniquely identify each crash in the NYPD database, and will not be useful to this analysis.

In [4]:
#Deleting GPS location and collision_id columns
crashes.drop(['latitude', 'longitude', 'location', 'collision_id', 'zip_code', 'cross_street_name', 'off_street_name'], axis=1, inplace=True)

This dataset has 5 columns each for contributing factors and vehicle types. However, very few crashes actually record data in all of these columns; many entries are listed as NaN or Unspecified. Below is a bar plot showing what percentage of crashes contain data in each contributing factor and vehicle type column.

In [5]:
cfs = crashes[['contributing_factor_vehicle_1', 'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3', 'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5']]
vtc = crashes[['vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5']]

#Counting the number of crashes that have a specified contributing factor and vehicle code for each #
cfsUNDEFINED = (cfs == 'Unspecified')
cfsNANs = cfs.isna()
cfsNullCounts = (cfsUNDEFINED | cfsNANs).sum()
cfsDataCounts = cfs.shape[0] - cfsNullCounts
cfsDataPct = cfsDataCounts / cfs.shape[0] * 100

vtcNANs = vtc.isna()
vtcNullCounts = vtcNANs.sum()
vtcDataCounts = vtc.shape[0] - vtcNullCounts
vtcDataPct = vtcDataCounts / vtc.shape[0] * 100

DataPct = pd.concat([cfsDataPct, vtcDataPct], axis=1)

#Plotting the percentage of crashes with a specified contributing factor for each #
DataPct.index = ['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4', 'Factor 5', 'Vehicle 1', 'Vehicle 2', 'Vehicle 3', 'Vehicle 4', 'Vehicle 5']
DataPct.plot(kind='bar')
plt.title('Percentage of Crashes with a Specified Contributing Factor/Vehicle Type')
plt.xlabel('Factor/Vehicle #')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.legend(['Contributing Factor', 'Vehicle Type'], loc='upper left')
plt.gca().set_yticks(np.arange(0, 101, 10))
plt.gca().set_yticklabels(['{:.0f}%'.format(x) for x in plt.gca().get_yticks()])
for i, v in enumerate(DataPct[0][0:5]):
    plt.text(i-0.35, v+2, '{:.1f}%'.format(v))
for i, v in enumerate(DataPct[1][5:10]):
    plt.text(i+4.9, v+1.5, '{:.1f}%'.format(v))

plt.show()

This graph shows that most rows contain data for at least two vehicles, and many record up to two contributing factors; however, very few use all five of either. All columns that contain data in less than 0.5% of entries (fewer than 500 of the 100,000 rows) will be deleted. This includes 'contributing_factor_vehicle_4' and 'contributing_factor_vehicle_5'.

In [6]:
crashes.drop(['contributing_factor_vehicle_4', 'contributing_factor_vehicle_5'], axis=1, inplace=True)
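The same threshold could also be applied programmatically instead of naming columns by hand. Below is a minimal sketch on a toy DataFrame; the `drop_sparse` helper and the 5% threshold are illustrative, not part of this notebook's pipeline:

```python
import pandas as pd
import numpy as np

#Toy frame: column 'c' has real data in only 1% of rows, so it falls under a 5% threshold
df = pd.DataFrame({
    'a': ['x'] * 100,
    'b': ['x'] * 50 + ['Unspecified'] * 50,
    'c': ['x'] + [np.nan] * 99,
})

def drop_sparse(df, threshold=0.05):
    #Fraction of entries per column that are neither NaN nor 'Unspecified'
    filled = df.notna() & (df != 'Unspecified')
    filled = filled.mean()
    return df.drop(columns=filled[filled < threshold].index)

df = drop_sparse(df)
print(list(df.columns))  # column 'c' is dropped
```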

Column Addition and Null Conversion¶

For all qualitative columns, any NaN or 'Unspecified' entries will be changed to 'None'. This applies to all contributing factors, vehicle types, boroughs, and street names.

In [7]:
crashes = crashes.fillna('None')
crashes = crashes.replace('Unspecified', 'None')

As it is now, the 'crash_date' column is hard to parse. Instead, it will be split into 'year', 'month', and 'day' columns for easier access. A 'timeofday' column will also be derived from the crash time as a categorical simplification: a crash can take place during the Night (12AM to 6AM), Morning (6AM to 12PM), Afternoon (12PM to 6PM), or Evening (6PM to 12AM).

The dataset will also be sorted by date for easier viewing, and the 'crash_time' column will be reformatted to an integer minute of the day (0 to 1439, i.e. 12:00 AM to 11:59 PM). Finally, a toTime function will be defined to convert the time in minutes back to a readable string when necessary.

In [8]:
def toTime(x):
    #Converts a time in minutes (0-1439) to an HH:MM AM/PM string
    hour = int(x // 60) % 12
    hour = 12 if hour == 0 else hour
    return '{}:{:02d} {}'.format(hour, int(x % 60), 'AM' if x < 720 else 'PM')

crashes['crash_date'] = pd.to_datetime(crashes['crash_date'])
crashes['year'] = crashes['crash_date'].dt.year
crashes['day'] = crashes['crash_date'].dt.day_name()
crashes['month'] = crashes['crash_date'].dt.month
crashes['month'] = crashes['month'].replace({1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'})
crashes.sort_values(by='crash_date', inplace=True)
crashes.reset_index(drop=True, inplace=True)
crashes['crash_time'] = pd.to_datetime(crashes['crash_time'], format='%H:%M')
crashes['crash_time'] = crashes['crash_time'].dt.hour * 60 + crashes['crash_time'].dt.minute
crashes['timeofday'] = pd.cut(crashes['crash_time'], bins=[-1, 360, 720, 1080, 1440], labels=['Night', 'Morning', 'Afternoon', 'Evening'])

display(crashes.head(3))
crash_date crash_time borough on_street_name number_of_persons_injured number_of_persons_killed number_of_pedestrians_injured number_of_pedestrians_killed number_of_cyclist_injured number_of_cyclist_killed number_of_motorist_injured number_of_motorist_killed contributing_factor_vehicle_1 contributing_factor_vehicle_2 contributing_factor_vehicle_3 vehicle_type_code1 vehicle_type_code2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5 year day month timeofday
0 2012-07-27 1253 BROOKLYN RALPH AVENUE 0 0 0 0 0 0 0 0 Failure to Yield Right-of-Way None None Station Wagon/Sport Utility Vehicle E-Scooter None None None 2012 Friday July Evening
1 2012-08-01 622 BROOKLYN PITKIN AVENUE 1 0 0 0 1 0 0 0 None None None Station Wagon/Sport Utility Vehicle Bike None None None 2012 Wednesday August Morning
2 2012-09-25 756 QUEENS WEIRFIELD STREET 0 0 0 0 0 0 0 0 Prescription Medication None None Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle None None None 2012 Tuesday September Afternoon

A 'vehicles_involved' column will be added to track how many vehicles were involved in the crash, computed as the number of 'vehicle_type_code' columns that are not 'None'. Finally, the columns will be renamed and reordered for easier viewing and use. Below, the new set of columns is printed.

In [9]:
crashes['vehicles_involved'] = (crashes[['vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5']] != 'None').sum(axis=1)

crashes.rename(
    columns={
        'crash_date': 'date',
        'crash_time': 'time',
        'on_street_name': 'street',
        'number_of_persons_injured': 'injured',
        'number_of_persons_killed': 'killed',
        'number_of_pedestrians_injured': 'pedestrians_injured',
        'number_of_pedestrians_killed': 'pedestrians_killed',
        'number_of_cyclist_injured': 'cyclists_injured',
        'number_of_cyclist_killed': 'cyclists_killed',
        'number_of_motorist_injured': 'motorists_injured',
        'number_of_motorist_killed': 'motorists_killed',
        'contributing_factor_vehicle_1': 'factor1',
        'contributing_factor_vehicle_2': 'factor2',
        'contributing_factor_vehicle_3': 'factor3',
        'vehicle_type_code1': 'vehicle1',
        'vehicle_type_code2': 'vehicle2',
        'vehicle_type_code_3': 'vehicle3',
        'vehicle_type_code_4': 'vehicle4',
        'vehicle_type_code_5': 'vehicle5'

    },
    inplace=True
)
cols = ['date', 'year', 'month','day', 'time', 'timeofday', 'borough', 'street', 'injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed', 'factor1', 'factor2', 'factor3', 'vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5', 'vehicles_involved']
crashes = crashes[cols]
display(pd.DataFrame(crashes.columns).transpose())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0 date year month day time timeofday borough street injured killed pedestrians_injured pedestrians_killed cyclists_injured cyclists_killed motorists_injured motorists_killed factor1 factor2 factor3 vehicle1 vehicle2 vehicle3 vehicle4 vehicle5 vehicles_involved

Outlier Analysis¶

The box and whisker plot below shows how many outliers are in each injury and fatality category. The number printed on the left of each box is the mean of the column, and the number on the right is the count of outliers in the column. It can be seen that most accidents don't lead to deaths or injuries, so in most columns the outlier rule flags any injury or fatality at all. This is especially prevalent in the 'motorists_injured' column, where every nonzero value is treated as an outlier, resulting in a huge number of outliers.
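The outlier counts here use the standard IQR fence: a value is flagged when it exceeds Q3 + 1.5 * (Q3 - Q1). A minimal sketch of the rule on a toy series that is mostly zeros, mirroring the injury columns:

```python
import pandas as pd

#Toy data resembling an injury column: mostly zeros with a couple of nonzero values
s = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 4])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

#When Q1 == Q3 == 0, the fence is 0, so every nonzero value is an outlier
outliers = (s > upper_fence).sum()
print(upper_fence, outliers)
```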

In [10]:
#Outlier Analysis
#Creating a boxplot for the number of injured persons
plt.figure(figsize=(10, 5))
quantitative = crashes[['injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed']]
numOutliers = (quantitative > quantitative.quantile(0.75) + 1.5*(quantitative.quantile(0.75) - quantitative.quantile(0.25))).sum()
sns.boxplot(quantitative, orient='h', palette='Set2')
for i, v in enumerate(quantitative.mean()):
    plt.text(0.1, i, '{:.2f}'.format(v), va='center')
for i, v in enumerate(numOutliers):
    plt.text(19.8, i, v, ha='right', va='center')
plt.title('Boxplot of the Number of Injured or Killed Persons')
plt.xlabel('Number of Injured Persons')
plt.xticks(np.arange(0, 21, 5))
plt.show()

Below, the same analysis is applied to the 'vehicles_involved' column. This column only contains values in the range [0, 5], and the outlier rule flags any value above 3. Strangely, there are a few entries with 0 vehicles involved; these could be traffic events that didn't involve vehicles, or even a failure to record the data on the NYPD's part. For our use of this data, 0 will also be treated as an outlier value for this column.

In [11]:
#vehicles involved outliers
plt.figure(figsize=(10, 1))
vehicles = crashes['vehicles_involved']
vehiclesOutliers = (vehicles > vehicles.quantile(0.75) + 1.5*(vehicles.quantile(0.75) - vehicles.quantile(0.25))).sum()
sns.boxplot(vehicles, orient='h', palette='Set2')
plt.text(0.1, -0.15, 'μ = {:.2f}'.format(vehicles.mean()), va='center')
plt.text(5.8, 0, vehiclesOutliers, ha='right', va='center')
plt.title('Boxplot of the Number of Vehicles Involved')
plt.xlabel('Number of Vehicles Involved')
plt.xticks(np.arange(0, 7, 1))
plt.show()

#Creating a bar graph for num vehicles
vehiclesCounts = vehicles.value_counts()
vehiclesCounts = vehiclesCounts.sort_index()
vehiclesCounts.plot(kind='bar', edgecolor='black')
plt.title('Number of Vehicles Involved in Crashes')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
for i, v in enumerate(vehiclesCounts):
    plt.text(i-0.16, v+400, v)
plt.show()
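Treating 0 as an outlier value can be expressed as a simple boolean filter. A minimal sketch on a toy frame; the column name mirrors the notebook's 'vehicles_involved', but the data is made up:

```python
import pandas as pd

#Toy frame with a couple of zero-vehicle rows
df = pd.DataFrame({'vehicles_involved': [2, 0, 1, 3, 0, 2]})

#Flag rows where no vehicle was recorded, then drop them
zero_rows = df['vehicles_involved'] == 0
print(zero_rows.sum())  # number of zero-vehicle rows
filtered = df[~zero_rows].reset_index(drop=True)
print(len(filtered))    # rows remaining after the filter
```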

Now that the columns are organized and unnecessary data has been deleted, the dataset is ready to be analyzed in detail!

Summary Data Analysis¶

As stated before, the data in this set is divided cleanly into time data, location data, fatality/injury data, factors, and vehicles. Each of these categories is further divided into columns. This portion of the analysis will examine patterns within these categories and how they relate to each other.

In [12]:
display(crashes.head(5))
date year month day time timeofday borough street injured killed pedestrians_injured pedestrians_killed cyclists_injured cyclists_killed motorists_injured motorists_killed factor1 factor2 factor3 vehicle1 vehicle2 vehicle3 vehicle4 vehicle5 vehicles_involved
0 2012-07-27 2012 July Friday 1253 Evening BROOKLYN RALPH AVENUE 0 0 0 0 0 0 0 0 Failure to Yield Right-of-Way None None Station Wagon/Sport Utility Vehicle E-Scooter None None None 2
1 2012-08-01 2012 August Wednesday 622 Morning BROOKLYN PITKIN AVENUE 1 0 0 0 1 0 0 0 None None None Station Wagon/Sport Utility Vehicle Bike None None None 2
2 2012-09-25 2012 September Tuesday 756 Afternoon QUEENS WEIRFIELD STREET 0 0 0 0 0 0 0 0 Prescription Medication None None Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle None None None 2
3 2012-10-22 2012 October Monday 1038 Afternoon None BELT PARKWAY 0 0 0 0 0 0 0 0 Unsafe Speed Other Vehicular Other Vehicular Sedan Sedan Sedan None None 3
4 2016-04-16 2016 April Saturday 860 Afternoon BROOKLYN WEST 17 STREET 0 0 0 0 0 0 0 0 Driver Inattention/Distraction None None Sedan Station Wagon/Sport Utility Vehicle None None None 2

Time Data¶

This portion of the dataset contains all information about when an accident occurred. It includes the 'year', 'month', 'day', 'date', and 'time' columns. Across these columns a trend starts to form: more crashes seem to happen at times when more cars are on the road.

Year¶

The 'year' column contains the year in which the accident occurred. Below is a graph that shows the distribution of crashes by year. It can be seen that the vast majority of the crashes are from 2021 and 2022: over 99%, in fact. This is largely because the 100,000-entry sample was drawn from crashes in 2021 and 2022.

In [13]:
#Creating a new dataframe that contains the total number of crashes for each year
crash_years = pd.DataFrame(crashes['year'].value_counts().sort_index()).reset_index()
crash_years.columns = ['Year', 'Crashes']
crash_pct = crashes['year'].value_counts(normalize=True).sort_index()
crash_years.plot(kind='bar', x='Year', y='Crashes')
plt.title('Total Number of Crashes by Year')
plt.xlabel('Year')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
for i, v in enumerate(crash_years['Crashes']):
    plt.text(i-0.22, v+500, v)
for i, v in enumerate(crash_pct):
    plt.text(i-0.24, 5000, '{:.1f}%'.format(v*100))
plt.show()

Month¶

The 'month' column contains the month when each crash took place (January, April, etc.). Below is a bar graph of the total number of crashes that took place in each month. It can be seen that the majority of crashes take place during warmer months, while the crash rate dips during the winter. This could indicate that people drive more often during the summer and early autumn than during the winter, and therefore get into more crashes during that time.

In [14]:
crash_months = pd.DataFrame(crashes['month'].value_counts())
crash_months.columns = ['Crashes']
crash_months = crash_months.reindex(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
crash_months.plot(kind='bar', color='orange', edgecolor='black')
plt.title('Total Number of Crashes by Month')
plt.xlabel('Month')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(crash_months['Crashes'].mean(), color='red', linestyle='dashdot', label='Average')
plt.text(11.6, crash_months['Crashes'].mean(), 'Average: {:.0f}'.format(crash_months['Crashes'].mean()))
plt.legend(loc = 'lower right')
for i, v in enumerate(crash_months['Crashes']):
    plt.text(i-0.25, v+100, v)

plt.show()

Day¶

The 'day' column contains the day of the week when the crash took place. Below is a similar bar graph showing the number of crashes that occur on each day. The distribution is generally uniform, with a noticeable peak on Friday. This makes sense, as Friday is generally when people are out driving the most, either celebrating the end of the work or school week or leaving the office for the weekend. The minimum is on Sunday, which also makes sense: most people don't have work and aren't going out on Sundays, so there should be fewer cars on the road.

In [15]:
crash_days = pd.DataFrame(crashes['day'].value_counts())
crash_days.columns = ['Crashes']
crash_days = crash_days.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
crash_days.plot(kind='bar', color='green', edgecolor='black')
plt.title('Total Number of Crashes by Day')
plt.xlabel('Day')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.axhline(crash_days['Crashes'].mean(), color='lightgreen', linestyle='dashdot', label='Average')
plt.text(6.6, crash_days['Crashes'].mean(), 'Average: {:.0f}'.format(crash_days['Crashes'].mean()))
plt.legend( loc='lower right')
for i, v in enumerate(crash_days['Crashes']):
    plt.text(i-0.19, v+150, v)

plt.show()

Time¶

The 'time' column gives the approximate minute of the day when an accident occurred (in ET). Below is a histogram of the number of crashes that happened in each 30-minute interval of the day. The red line is a 5th-degree least-squares trend curve that estimates how the rate of crashes evolves continuously throughout the day: it helps us visualize what a continuous analysis would look like.

It can be seen that the most crashes happen between 5 and 6 PM, while the fewest happen between 3 and 4 AM. This fits the idea, established by the other time-based columns, that more crashes happen when more cars are on the road. Strangely, a huge number of crashes are recorded at exactly 12:00 AM. It seems plausible that 12:00 AM is used as a default time when no time is entered into the NYPD system, which would explain the strange peak.

In [16]:
#line graph crashes by time
plt.figure(figsize=(10, 5))
crashes['time'].plot(kind='hist', bins=48, edgecolor='black')

#least squares 5th-degree polynomial fit
x = np.arange(0, 1440, 30)
y = crashes['time'].value_counts(bins = 48).sort_index()
coeffs = np.polyfit(x, y, 5)
poly = np.poly1d(coeffs)
plt.plot(x, poly(x), color='red', linestyle='solid', label='Least Squares Fit', linewidth=2)

plt.title('Total Number of Crashes by Time')
plt.xlabel('Time')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=20)
plt.gca().set_xticks(np.arange(0, 1441, 120))
plt.gca().set_xticklabels([toTime(x) for x in plt.gca().get_xticks()])
plt.legend(loc='upper right')
plt.show()
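One way to probe the 'default time' idea is to compare the count at exactly minute 0 against nearby minutes. The sketch below uses synthetic data with an artificial midnight spike; a real check would use crashes['time'] instead:

```python
import pandas as pd
import numpy as np

#Synthetic minute-of-day data: uniform noise plus an artificial spike at minute 0
rng = np.random.default_rng(0)
times = pd.Series(np.concatenate([rng.integers(0, 1440, 5000),
                                  np.zeros(500, dtype=int)]))

counts = times.value_counts()
at_midnight = counts.get(0, 0)
#Average count over the minutes adjacent to midnight (just before and after)
neighbors = counts.reindex([1, 2, 3, 1437, 1438, 1439], fill_value=0).mean()

#A large ratio suggests 12:00 AM is over-represented relative to nearby minutes
print(at_midnight, round(at_midnight / neighbors, 1))
```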

Holidays¶

To further test the hypothesis that crashes happen more often when more cars are on the road, it makes sense to check which dates have the most incidents. If the hypothesis is correct, then busy dates such as major holidays should see more crashes than average. This seems to be the case, as many of the busiest holidays have more crashes than average. Notably, more than double the average number of crashes happened on Halloween 2021. For reference, the day with the most crashes is plotted as the rightmost bar.

In [17]:
datecounts = crashes['date'].value_counts()
average = datecounts.mean()
maxdate = datecounts.idxmax()
holidaycounts = pd.DataFrame(datecounts[[ '2021-12-25', '2021-12-24', '2021-10-31', '2021-12-31', '2021-07-04', str(maxdate)]])
holidaycounts.columns = ['Crashes']
holidaycounts.index = ['Christmas', 'Christmas Eve', 'Halloween', 'New Year\'s Eve', '4th of July', 'Most Crashes']
holidaycounts.plot(kind='bar', color='purple', edgecolor='black')
plt.title('Total Number of Crashes on Holidays')
plt.xlabel('Date')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(average, color='red', linestyle='dashdot', label='Average')
plt.text(5.6, average, 'Average: {:.0f}'.format(average))
plt.legend(loc='upper left')
for i, v in enumerate(holidaycounts['Crashes']):
    plt.text(i-0.1, v+4, v)
plt.show()

Location Data¶

This section of the dataset contains information about where in the city an incident occurred. It contains the 'borough' and 'street' columns.

Borough¶

The 'borough' column details the borough in which the incident took place. About a third of the incidents do not have a recorded borough (indicated by the None entry). A majority of the crashes took place in Brooklyn and Queens, while less than 3% took place in Staten Island. The ordering of the crash counts seems to strongly reflect the populations of the boroughs.

In [18]:
boroughs = pd.DataFrame(crashes['borough'].value_counts())
boroughs.columns = ['Crashes']
boroughs.plot(kind='bar', color='pink', edgecolor='black')
boroughpct = crashes['borough'].value_counts(normalize=True)
plt.title('Total Number of Crashes by Borough')
plt.xlabel('Borough')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(boroughs['Crashes'].mean(), color='red', linestyle='dashdot', label='Average')
plt.text(4.6, boroughs['Crashes'].mean()+500, 'Average: {:.0f}'.format(boroughs['Crashes'].mean()))
plt.legend(loc='upper right')
for i, v in enumerate(boroughs['Crashes']):
    plt.text(i-0.15, v+100, v)

for i, v in enumerate(boroughpct):
    plt.text(i-0.15, 600, '{:.1f}%'.format(v*100))
plt.show()

Street¶

The 'street' column contains the name of the street where each incident took place. The entire dataset contains 4,359 unique street names, while 27% of entries (27,081 entries) do not have a specified street name.

Below is a bar graph describing the number of streets that appear a specified number of times. Many streets only appear a few times in the entire dataset, but a good number of them appear over 10 times. Note that the 'More than' bars are cumulative: each one is a subset of every 'More than' bar to its left, while only the 'Less than 10' bar is disjoint from the rest.

In [19]:
streetcounts = crashes['street'].value_counts()

numStreets = streetcounts.shape[0]


print('Number of crashes with no specified street: {}'.format(streetcounts['None']))
print('Total number of unique streets: {}'.format(numStreets))

lowlim = [10, 50, 100, 200, 500, 750, 1000]

streetseries = pd.Series([streetcounts[streetcounts > x].shape[0]-1 for x in lowlim], index=['More than 10', 'More than 50', 'More than 100', 'More than 200', 'More than 500', 'More than 750', 'More than 1000'])
streetseries['Less than 10'] = streetcounts[streetcounts <= 10].shape[0]
streetseries = streetseries[['Less than 10','More than 10', 'More than 50', 'More than 100', 'More than 200', 'More than 500', 'More than 750', 'More than 1000']]
streetseries.plot(kind='bar', color='cyan', edgecolor='black')
streetpct = streetseries / numStreets * 100
plt.title('Number of Streets that have a Specified Number of Crashes')
plt.xlabel('Number of Crashes')
plt.ylabel('Number of Streets')
plt.xticks(rotation=45)
for i, v in enumerate(streetseries):
    plt.text(i-0.2, v+20, v)
for i, v in enumerate(streetpct):
    plt.text(i-0.22, 500, '{:.1f}%'.format(v))
plt.show()
Number of crashes with no specified street: 27081
Total number of unique streets: 4359

Fatality/Injury Data¶

This category covers the deaths and injuries that occur in crashes. It includes all columns with '...killed' and '...injured' in the name. It's important to note that all of these columns are quantitative.

Earlier in the report, a box and whisker plot was made to analyze how many outliers exist in each injury and fatality column. This plot is shown again below. As before, the mean for each column is given on the left of its boxplot, and the number of outliers is given on the right. It shows that any given collision usually results in one injury or none, and almost never results in death. It also seems that there are fewer instances of cyclists and pedestrians being involved in accidents than motorists.

In [20]:
#Outlier Analysis
#Creating a boxplot for the number of injured persons
plt.figure(figsize=(10, 5))
quantitative = crashes[['injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed']]
numOutliers = (quantitative > quantitative.quantile(0.75) + 1.5*(quantitative.quantile(0.75) - quantitative.quantile(0.25))).sum()
sns.boxplot(quantitative, orient='h', palette='Set2')
for i, v in enumerate(quantitative.mean()):
    plt.text(0.1, i, '{:.2f}'.format(v), va='center')
for i, v in enumerate(numOutliers):
    plt.text(19.8, i, v, ha='right', va='center')
plt.title('Boxplot of the Number of Injured or Killed Persons')
plt.xlabel('Number of Injured Persons')
plt.xticks(np.arange(0, 21, 5))
plt.show()

Total Injuries¶

The 'injured' column contains information on how many people were injured in total during an incident. It is generally just the sum of the pedestrian, cyclist, and motorist 'injured' columns. The bar graph below shows the number of crashes that resulted in a specified number of injuries. It can be seen that more than half of the incidents resulted in no injuries, and a significant number resulted in only one injury. Only a few resulted in more than 3. This could mean that most crashes in New York are smaller collisions.
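The claim that 'injured' is just the sum of the subcategory columns can be spot-checked directly. Below is a minimal sketch of such a check on a toy sample; the column names follow the renaming used throughout this report, and the data here are hypothetical stand-ins for the real frame.

```python
import pandas as pd

# Toy rows mimicking this report's renamed columns (hypothetical data).
crashes = pd.DataFrame({
    'injured': [0, 1, 3],
    'pedestrians_injured': [0, 1, 0],
    'cyclists_injured': [0, 0, 1],
    'motorists_injured': [0, 0, 2],
})

# Sum the subcategory columns and count rows where the total disagrees.
subtotal = (crashes['pedestrians_injured']
            + crashes['cyclists_injured']
            + crashes['motorists_injured'])
mismatches = int((crashes['injured'] != subtotal).sum())
print('Rows where injured != subtotal:', mismatches)  # 0 for this toy sample
```

On the real dataset a handful of mismatches would indicate entry errors rather than a flaw in the check.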

In [21]:
# Count the occurrences of each unique value in the 'injured' column
injured_counts = crashes['injured'].value_counts().sort_index()

# Define the bins for the bar graph
bins = [-1, 0, 1, 2, 4, 6, 8, 10, 20]

# Group the counts into the defined bins
grouped_counts = injured_counts.groupby(pd.cut(injured_counts.index, bins)).sum()

grouped_counts.index = ['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20']

# Plot the bar graph
grouped_counts.plot(kind='bar', color='lightblue', edgecolor='black')

injured_pct = grouped_counts / grouped_counts.sum() * 100

for i, v in enumerate(grouped_counts):
    plt.text(i-0.2, v+500, v)
for i, v in enumerate(injured_pct):
    plt.text(i-0.22, 10000, '{:.1f}%'.format(v))

# Set the labels and title
plt.title('Number of Crashes by Number of Injuries')
plt.xlabel('Number of Injuries')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
# Show the plot
plt.show()

Total Fatalities¶

The 'killed' column is similar to the 'injured' column in that it contains the total number of persons killed during a given incident. The bar graph below shows a similar trend: most incidents resulted in no fatalities, while only a few resulted in more. This is a bit lower than I expected, which is a nice surprise. It also suggests that a majority of incidents in the NYPD system are not serious crashes, but small collisions that are perhaps reported for insurance reasons.

In [22]:
# Calculate the counts of killed
killed_counts = crashes['killed'].value_counts().sort_index()

# Calculate the percentage of killed
killed_pct = (killed_counts / len(crashes)) * 100

# Create a DataFrame for killed counts and percentages
killed_data = pd.DataFrame({'Counts': killed_counts, 'Percentage': killed_pct})

# Plot the counts of killed
plt.figure(figsize=(10, 5))
sns.barplot(x=killed_data.index, y=killed_data['Counts'], color='red', edgecolor='black')
plt.title('Counts of Killed in Crashes')
plt.xlabel('Number of Killed')
plt.ylabel('Counts')

# Enumerate the percentages as text on the graph
for i, count in enumerate(killed_data['Counts']):
    plt.text(i, count, killed_data["Counts"][i], ha='center', va='bottom')
for i, count in enumerate(killed_data['Counts']):
    plt.text(i, 20000, f'{killed_data["Percentage"][i]:.2f}%', ha='center', va='bottom')

plt.show()

Pedestrian Statistics¶

The 'pedestrians_injured' and 'pedestrians_killed' columns are subsets of the 'injured' and 'killed' columns that contain data only on pedestrians injured or killed in a given incident. The distribution in the graphs below is similar to the previous graphs. Note that no accident resulted in more than 2 pedestrian deaths or 6 pedestrian injuries. This could imply that in most New York crashes the cars stay on the road, as a pedestrian being injured or killed would imply that a car ran onto the sidewalk or that the pedestrian was in a crosswalk.

In [23]:
# Create a bar graph for pedestrians injured and pedestrians killed
killed_counts = crashes['pedestrians_killed'].value_counts().sort_index()
injured_counts = crashes['pedestrians_injured'].value_counts().sort_index()
killed_counts = killed_counts.reindex(injured_counts.index, fill_value=0)

plt.figure(figsize=(10, 5))
killed_counts.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Pedestrians Killed')
injured_counts.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Pedestrians Injured')
plt.title('Number of Crashes by Pedestrians Injured and Killed')
plt.xlabel('Number of Pedestrians')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
for i, v in enumerate(killed_counts):
    plt.text(i+0.05, v+800, v, color = 'red')

for i, v in enumerate(injured_counts):
    plt.text(i-.38, v+880, v, color = 'blue')

plt.show()

Motorist and Cyclist Statistics¶

The 'motorist' and 'cyclist' injury and death columns are also subsets of the 'injured' and 'killed' columns. Below are two bar graphs that display the number of cyclists and motorists injured.

There are many crashes with a high 'motorists_injured' value. Some even reach over 15. For one incident to have so many motorists involved at once, it must not be a "basic" crash. Could NY's infrastructure have failed, or perhaps a dangerous vehicle was at play? The hypothesis that strange circumstances surrounding a crash correlate with high injury and death counts will be explored in the Discussion section.

In [24]:
motor_killed = crashes['motorists_killed'].value_counts().sort_index()
motor_injured = crashes['motorists_injured'].value_counts().sort_index()
cycle_killed = crashes['cyclists_killed'].value_counts().sort_index()
cycle_injured = crashes['cyclists_injured'].value_counts().sort_index()

motor_killed = motor_killed.reindex(motor_injured.index, fill_value=0)
cycle_killed = cycle_killed.reindex(cycle_injured.index, fill_value=0)

fig, ax = plt.subplots(1, 2, figsize=(20, 5))

motor_killed.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Motorists Killed', ax=ax[0])
motor_injured.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Motorists Injured', ax=ax[0])
ax[0].set_title('Number of Crashes by Motorists Injured and Killed')
ax[0].set_xlabel('Number of Motorists')
ax[0].set_ylabel('Number of Crashes')
ax[0].legend(loc='upper right')
for i, v in enumerate(motor_killed):
    ax[0].text(i+0.05, v+800, v, color = 'red', rotation=45)
for i, v in enumerate(motor_injured):
    ax[0].text(i-.38, v+880, v, color = 'blue', rotation=45)

cycle_killed.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Cyclists Killed', ax=ax[1])
cycle_injured.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Cyclists Injured', ax=ax[1])
ax[1].set_title('Number of Crashes by Cyclists Injured and Killed')
ax[1].set_xlabel('Number of Cyclists')
ax[1].set_ylabel('Number of Crashes')
ax[1].legend(loc='upper right')
for i, v in enumerate(cycle_killed):
    ax[1].text(i+0.05, v+800, v, color = 'red')
for i, v in enumerate(cycle_injured):
    ax[1].text(i-.38, v+880, v, color = 'blue')

plt.show()

Factors and Vehicles¶

This section is concerned with the circumstances surrounding a crash. It details the factors that contributed to the crash, as inferred by the reporting officer, and the models of the vehicles involved.

Contributing Factors¶

The 'factor' columns have data on the factors that may have contributed to an incident. Below is a bar graph showing the most common factors in a crash and a dataframe showing every contributing factor sorted by how often it appears in the dataset. The 'factor1' column is used in the analysis because it is present in the most entries: the 'factor2' and 'factor3' columns follow a distribution similar to 'factor1', just with many more None entries. There are a total of 55 different contributing factors across the entire dataset. The most common factor by far is Driver Inattention, making up over 23% of all incidents.

In [25]:
commonFactors = crashes['factor1'].value_counts().head(10)
commonFactors.plot(kind='bar', color='purple', edgecolor='black')
plt.title('Top 10 Most Common Contributing Factors')
plt.xlabel('Contributing Factor')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45, ha='right')
plt.axhline(crashes['factor1'].value_counts().mean(), color='red', linestyle='dashdot', label='Average')
plt.text(9.6, crashes['factor1'].value_counts().mean()-200, 'Average: {:.0f}'.format(crashes['factor1'].value_counts().mean()))
plt.legend(loc='upper right')
for i, v in enumerate(commonFactors):
    plt.text(i-0.23, v+300, v)

plt.show()
display( pd.DataFrame(crashes['factor1'].value_counts()).transpose())
factor1 None Driver Inattention/Distraction Failure to Yield Right-of-Way Following Too Closely Passing or Lane Usage Improper Passing Too Closely Unsafe Speed Backing Unsafely Traffic Control Disregarded Other Vehicular Turning Improperly Unsafe Lane Changing Driver Inexperience Alcohol Involvement Reaction to Uninvolved Vehicle Pedestrian/Bicyclist/Other Pedestrian Error/Confusion View Obstructed/Limited Pavement Slippery Aggressive Driving/Road Rage Fell Asleep Brakes Defective Oversized Vehicle Steering Failure Passenger Distraction Outside Car Distraction Obstruction/Debris Lost Consciousness Tire Failure/Inadequate Illnes Pavement Defective Glare Fatigued/Drowsy Failure to Keep Right Driverless/Runaway Vehicle Drugs (illegal) Animals Action Accelerator Defective Cell Phone (hand-Held) Traffic Control Device Improper/Non-Working Physical Disability Tinted Windows Lane Marking Improper/Inadequate Prescription Medication Using On Board Navigation Device Vehicle Vandalism Other Electronic Device Other Lighting Defects Headlights Defective Tow Hitch Defective Eating or Drinking Cell Phone (hands-free) Texting Shoulders Defective/Improper Listening/Using Headphones Windshield Inadequate
count 25226 23864 6905 6617 4564 3831 3666 3116 2894 2709 2290 2078 1991 1580 1363 962 850 820 753 441 412 411 268 246 209 205 195 188 173 142 136 130 110 103 90 88 81 48 44 42 26 25 17 14 12 10 10 9 8 8 6 5 5 2 2

Vehicle Models¶

The 'vehicles' column contains information on the models of the vehicles involved in the crash. Below is a bar graph of the top 10 most common vehicles and a series containing every recorded vehicle type. Most crashes involve a Sedan or an SUV, which makes sense considering those are very common cars. There are also a good number of crashes that involve Bikes and Motorcycles: it could be worthwhile to view subsets of these groups alongside the 'cyclists_injured' and 'motorists_injured' columns.

It's important to note that these values are almost certainly handwritten into the dataset by the reporting NYPD officer: the many misspelled entries like Ambulace, Ambulane, and Fire Engin make that clear. Therefore, the counts here are not completely reliable. Some of the entries are also very strange. For example, there are three entries in the dataset involving a Tank, and one even involves a Freight Train.
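One way to partially mitigate the misspellings is an explicit correction map applied before counting. Below is a minimal sketch; the mapping is a hypothetical example seeded with a few misspellings visible in the output above, and a real cleanup pass would need a much larger, curated map.

```python
import pandas as pd

# Hypothetical correction map for a few misspellings seen in the data;
# keys are raw entries, values are the intended vehicle type.
corrections = {
    'Ambulace': 'Ambulance',
    'Ambulane': 'Ambulance',
    'Ambulence': 'Ambulance',
    'Fire engin': 'Fire engine',
}

# Toy series standing in for a 'vehicle' column (hypothetical data).
vehicles = pd.Series(['Ambulace', 'Sedan', 'Ambulence'])
cleaned = vehicles.replace(corrections)
print(cleaned.tolist())  # ['Ambulance', 'Sedan', 'Ambulance']
```

After such a pass, `value_counts()` would consolidate the corrected spellings into a single count per vehicle type.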

In [26]:
#Summing vehicle counts across all five vehicle columns, keyed to the vehicle1 index
vehiclecounts = crashes['vehicle1'].value_counts()
for col in ['vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']:
    vehiclecounts += crashes[col].value_counts().reindex(vehiclecounts.index, fill_value=0)
vehiclecounts.index = vehiclecounts.index.str.capitalize()
vehiclecounts = vehiclecounts.groupby(level=0).sum()
vehiclecounts = vehiclecounts.sort_values(ascending=False).drop('None')

crashes.replace('Station Wagon/Sport Utility Vehicle', 'SUV', inplace=True)

commonVehicles = vehiclecounts.head(10)
commonVehicles['Other'] = vehiclecounts[10:].sum()
commonVehicles.rename(index={'Station wagon/sport utility vehicle': 'SUV'}, inplace=True)

commonVehicles.plot(kind='bar', color='orange', edgecolor='black')
plt.title('Top 10 Most Common Vehicle Types')
plt.xlabel('Vehicle Type')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45, ha='right')
plt.legend(loc='upper right')
for i, v in enumerate(commonVehicles):
    plt.text(i-0.23, v+300, v)
plt.show()

display(pd.DataFrame(vehiclecounts).transpose())
vehicle1 Sedan Station wagon/sport utility vehicle Bike Pick-up truck Box truck Taxi Bus E-bike Motorcycle Tractor truck diesel E-scooter Van Ambulance Moped Dump Pk Flat bed Garbage or refuse Convertible Carry all Motorscooter Tow truck / wrecker Motorbike Tractor truck gasoline Chassis cab 4 dr sedan Tanker Fire truck 3-door Trailer Limo Refrigerated van Concrete mixer Armored truck School bus Flat rack Scooter Multi-wheeled vehicle Firetruck Beverage truck Unknown Tow truck Unk Open body Lift boom Truck Stake or rack Pedicab Snow plow Minibike Forklift Bulk agriculture Commercial Ambu Fdny ambul Fdny truck Dump truck Pick up Com Lunch wagon Garbage tr Mta bus Minicycle Fdny 2 dr sedan Van camper Pallet Usps truck Motor scoo Amb Utility Sprinter v Hopper Fdny fire Delivery t Electric s Rv Passenger Pas Power shov Delivery Pickup Util Self insur Delv Pick up tr Usps Nys ambula Fdny engin Street swe Fire Pickup with mounted camper Glass rack Tank Pc Golf cart Suburban Road sweep Mack Nyc sanita Refg Fire engin Fork lift Enclosed body - nonremovable enclosure Ford van Ford Dirt bike Escooter s F550 Nypd van Motor home Ec3 E scooter Motorized Motorized home Self Ups Semi trail Usps mail Skateboard Utility ve Box Boom lift Pick-up tr Van/truck Pickup tru Vms Ambulane Sanitation Work van Econoline Mopad Gas scoote Food cart Mini van Mailtruck Ems Mail truck Mack truck Tl Tr Flatbed Street cle Van wh School bu Subn Verzion va Tk Shuttle bu Suv Sanmen cou Ram White van Work truck Rgr Sw/van Revel scoo Yamaha Street Scooter ga Utility va Utility tr Skywatch Truck comm Tf Uhal Semi Unk box tr Smyellscho Unmarked v Tlr Sedona Us mail tr Us postal Tlc Uspcs Usps posta Usps small Usps vehic Tcn ''lime mope Pumper Citywide Commerical Con ed tru Constructi Crane boom D2 Dent and s Department Dodge Dodge ram Dumpster t E bike uni Ems bus Emt ambula Enclosed body - removable enclosure Engine sp0 Esu rep Excavator Cmix City owned Pump City 12 passage 50cc scoot 994 Ambulace 
Ambulence Ambulette App Asphalt ro Backhoe Camper van Cargo van Carriage Carrier Cat forkli Cater Cement tru Cherv Fdny firet Fdny ladde Fire appar Flatbed pi Motorscoot Moving tru Mta Mtr h Nonmotords Nypd tow t Pas (4dr s Pass Pick wh Pick-up Pkup Police rep Post offic Postal ser Postal tru Pro master Psd Mopd . Mini bus Grumman ll Freight Freight tr Frt Garbage Gas mo ped Gas powere Golf car Horse carr M2 Horse trai Hwh Kick scoot Ladder co Lift Livestock rack Locomotive �mbu
count 82786 60662 4944 3876 3840 3718 3054 2733 1717 1433 1380 1203 1050 686 629 427 361 354 342 261 239 233 218 195 135 97 94 86 59 56 56 53 53 47 46 43 42 40 37 36 36 34 34 32 32 29 26 25 22 21 17 15 15 15 14 14 12 11 11 10 10 9 9 8 8 8 8 8 8 8 8 7 7 7 6 6 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Number of Vehicles Involved¶

The custom-made 'vehicles_involved' column gives the number of vehicles involved in a crash. Specifically, it counts how many 'vehicle' entries are not None. It can be seen that most incidents involve only 1 or 2 vehicles, but a significant number involve more than that.

Strangely, there are some incidents that involve 0 vehicles. These entries could be errors in the NYPD system, entries that involved road conditions, or perhaps even incidents that were reported after the vehicles had left the scene. They will be investigated in the discussion portion of this report.
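The construction of 'vehicles_involved' is not shown in this section; below is a minimal sketch of how such a column could be derived. The five 'vehicle' column names match this report's renaming, the toy data are hypothetical, and the literal string 'None' is treated as missing to match the description above.

```python
import pandas as pd

# Toy frame with this report's assumed vehicle columns (hypothetical data).
crashes = pd.DataFrame({
    'vehicle1': ['Sedan', 'SUV', None],
    'vehicle2': ['Taxi', None, None],
    'vehicle3': [None, None, None],
    'vehicle4': [None, None, None],
    'vehicle5': [None, None, None],
})

cols = ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']
# Count entries per row that are neither NaN nor the string 'None'.
crashes['vehicles_involved'] = (
    crashes[cols].notna() & crashes[cols].ne('None')
).sum(axis=1)
print(crashes['vehicles_involved'].tolist())  # [2, 1, 0]
```

The third toy row reproduces the zero-vehicle case discussed above: every vehicle entry is missing, so the count is 0.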

In [27]:
numvehicles = crashes['vehicles_involved'].value_counts()
numvehicles.sort_index(inplace=True)
numvehicles.plot(kind='bar', color='lightgreen', edgecolor='black')
plt.title('Number of Crashes by Number of Vehicles Involved')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.axhline(numvehicles.mean(), color='red', linestyle='dashdot', label='Average')
plt.text(4.6, numvehicles.mean(), 'Average: {:.0f}'.format(numvehicles.mean()))
plt.legend(loc='upper right')
for i, v in enumerate(numvehicles):
    plt.text(i-0.15, v+500, v)
plt.show()

Interesting Correlations¶

The following chart breaks down, by borough, how many crashes have a specified number of injuries. The percentages tell us what percent of crashes in that borough had that many injuries.

While the distributions are generally the same across each borough, it's important to note the slight differences between them. For example, Staten Island has the highest ratio of non-injury crashes, while there are more crashes in Brooklyn with one or two injuries than in other boroughs. Notice as well that crashes with more than 10 injuries happened only in the Bronx and Staten Island. Finally, for some reason, crashes with no specified borough tend to have more injuries than those with boroughs: the percentage of 2-injury crashes in None is 6.55%, which is significantly higher than in any borough.

In [28]:
boroughinjuries = crashes.groupby('borough')['injured'].value_counts(bins= [-1, 0, 1, 2, 4, 6,8, 10, 20], sort=False).fillna(0)
plt.subplots(2, 3, figsize=(20, 10))


for i, b in enumerate(crashes['borough'].unique()):
    totalforborough = crashes[crashes['borough'] == b].shape[0]
    plt.subplot(2, 3, i+1)
    boroughinjuries[b].plot(kind='bar', color='lightblue', edgecolor='black', label='Injured')
    plt.title('Number of Crashes by Number of Injuries in {}'.format(b))
    plt.xlabel('Number of Injuries')
    plt.ylabel('Number of Crashes')
    plt.xticks(rotation=0)
    plt.gca().set_xticklabels(['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20'])
    plt.legend(loc='upper right')
    # plotting total crashes per borough
    plt.text(4.8, totalforborough*0.3, 'Total Crashes: {}'.format(totalforborough))
    for j, v in enumerate(boroughinjuries[b]):
        plt.text(j-0.25, v+80 - (60 if b == "STATEN ISLAND" else 0), '{:.2f}%'.format(v/totalforborough*100))

plt.show()

The next chart compares how many injuries happen at different times of day. It divides the dataset into Night (12 AM - 6 AM), Morning (6 AM - 12 PM), Afternoon (12 PM - 6 PM), and Evening (6 PM - 12 AM) and plots them individually, using the same scheme as the borough chart above.

Notice that many more crashes with zero injuries happen in the morning than during the evening or at night. It seems like the evening is the most dangerous time of day to get into an accident, while the morning is the least likely to result in injury. However, there are significantly fewer total crashes at night. This is probably because there are fewer cars on the road in the middle of the night than during rush hour in the morning or the evening.
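The derivation of the 'timeofday' column is not shown here; one way to build it from the crash hour is with `pd.cut` over the four six-hour windows listed above. This is a sketch on hypothetical toy hours, and the column name matches this report's convention.

```python
import pandas as pd

# Toy crash hours (hypothetical); the real values would be parsed
# from the 'crash_time' column.
hours = pd.Series([2, 9, 14, 21])

# Bin into Night (12AM-6AM), Morning (6AM-12PM),
# Afternoon (12PM-6PM), and Evening (6PM-12AM).
timeofday = pd.cut(
    hours,
    bins=[-1, 5, 11, 17, 23],
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
)
print(timeofday.tolist())  # ['Night', 'Morning', 'Afternoon', 'Evening']
```

Because `pd.cut` bins are right-inclusive, the -1 lower edge lets hour 0 fall into Night.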

In [29]:
timeofdayinjuries = crashes.groupby('timeofday')['injured'].value_counts(bins= [-1, 0, 1, 2, 4, 6,8, 10, 20], sort=False).fillna(0)
plt.subplots(2, 2, figsize=(20, 10))

for i, t in enumerate(crashes['timeofday'].unique()):
    totalfortime = crashes[crashes['timeofday'] == t].shape[0]
    plt.subplot(2, 2, i+1)
    timeofdayinjuries[t].plot(kind='bar', color='lightgreen', edgecolor='black', label='Injured')
    plt.title('Number of Crashes by Number of Injuries in the {}'.format(t))
    plt.xlabel('Number of Injuries')
    plt.ylabel('Number of Crashes')
    plt.xticks(rotation=0)
    plt.gca().set_xticklabels(['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20'])
    plt.legend(loc='upper right')
    plt.text(5.8, totalfortime*0.3, 'Total Crashes: {}'.format(totalfortime), ha = 'left')
    for j, v in enumerate(timeofdayinjuries[t]):
        plt.text(j-0.25, v+80, '{:.2f}%'.format(v/totalfortime*100))
plt.show()

Discussion¶

Question 1:¶

Do the contributing factors, vehicles, and number of vehicles involved in an incident allow us to predict the number of injuries and deaths in that entry?

Essentially, this question asks whether, if a crash involves a dangerous vehicle, has an uncommon set of contributing factors, or involves more vehicles than average, the injury and death counts will reflect that. This question aims to predict a quantitative value.

Hypothesis: I expect to see that crashes with strange vehicles, unusual contributing factors, and high vehicle numbers correlate with higher injury and death counts. For one, if a crash involves more vehicles, then there should be a higher chance of drivers or pedestrians getting injured. Specifically, I expect that incidents involving trucks, motorcycles, and other "dangerous" vehicles will result in more motorist injuries.

Graphical analysis¶

The following code creates small, 1,000-entry samples of crashes involving "Dangerous" vehicles and "Safe" vehicles, then plots the average number of injuries and fatalities in each sample. The "Dangerous" vehicle samples include trucks, e-bikes, and motorcycles: these are generally thought to be more dangerous to ride than most vehicles. The "Safe" vehicle samples include sedans, SUVs, and taxis.

This graph shows us that, on average, crashes involving motorcycles and e-bikes have more injuries and deaths than those involving normal vehicles like sedans and SUVs. The ratio of deaths to injuries in the motorcycle sample is significantly higher than in the other samples as well.

In [30]:
#getting samples of data involving "Dangerous" and "Safe" vehicles
#a helper mask avoids repeating str.contains across all five vehicle columns
#na=False treats any missing vehicle entries as non-matches
def involves(keyword):
    mask = pd.Series(False, index=crashes.index)
    for c in ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']:
        mask |= crashes[c].str.contains(keyword, na=False)
    return mask

#"Dangerous" vehicles: trucks, e-bikes, and motorcycles
truckssample = crashes[involves('Truck')].sample(1000)
ebikesample = crashes[involves('E-Bike')].sample(1000)
motorcyclesample = crashes[involves('Motorcycle')].sample(1000)

#"Safe" vehicles: sedans, SUVs, and taxis
sedansample = crashes[involves('Sedan')].sample(1000)
suvsample = crashes[involves('SUV')].sample(1000)
taxisample = crashes[involves('Taxi')].sample(1000)


#plotting average injury and killed counts for each vehicle type, dividing based on safety
plt.figure(figsize=(10, 5))
plt.bar('Truck', truckssample['injured'].mean(), color='blue', label='Dangerous-Injured')
plt.bar('Truck', truckssample['killed'].mean(), color='red', label='Dangerous-Killed')
plt.bar('E-Bike', ebikesample['injured'].mean(), color='blue')
plt.bar('E-Bike', ebikesample['killed'].mean(), color='red')
plt.bar('Motorcycle', motorcyclesample['injured'].mean(), color='blue')
plt.bar('Motorcycle', motorcyclesample['killed'].mean(), color='red')
plt.bar('Sedan', sedansample['injured'].mean(), color='green', label = 'Safe-Injured')
plt.bar('Sedan', sedansample['killed'].mean(), color='purple', label = 'Safe-Killed')
plt.bar('SUV', suvsample['injured'].mean(), color='green')
plt.bar('SUV', suvsample['killed'].mean(), color='purple')
plt.bar('Taxi', taxisample['injured'].mean(), color='green')
plt.bar('Taxi', taxisample['killed'].mean(), color='purple')
plt.title('Average Number of Injuries and Deaths by Vehicle Type')
plt.xlabel('Vehicle Type')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()
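A side note on the plotting approach above: the repeated `plt.bar` calls draw the injured and killed bars at the same x position, relying on the much smaller killed bar remaining visible on top of the larger injured bar. A more explicit alternative is a true stacked bar using the `bottom=` parameter. A minimal sketch, with made-up averages purely for illustration (not computed from the dataset):

```python
import matplotlib.pyplot as plt

# Hypothetical average counts, for illustration only (not taken from the data)
labels = ['Truck', 'E-Bike', 'Motorcycle']
injured = [0.35, 0.90, 0.80]
killed = [0.004, 0.003, 0.010]

plt.figure(figsize=(10, 5))
# Draw the injured bars first, then stack the killed bars on top via bottom=
plt.bar(labels, injured, color='blue', label='Injured')
plt.bar(labels, killed, bottom=injured, color='red', label='Killed')
plt.legend(loc='upper right')
plt.show()
```

With stacking, the killed segment is always visible regardless of its size relative to the injured segment.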

This next graph looks at the average injury and death rates in samples of crashes with different contributing factors. Crashes involving falling asleep at the wheel or alcohol are marked as "Dangerous", while those involving driver inattention or backing incorrectly are marked as "Safe" by comparison. This graph follows the same trend as the last: the more dangerous factors correspond to higher injury counts and, in the case of alcohol involvement, higher death counts.

In [31]:
#helper: sample crashes where any of the three factor columns matches a keyword
#na=False treats missing factor entries as non-matches instead of propagating NaN
def sample_by_factor(keyword, n=300):
    mask = pd.Series(False, index=crashes.index)
    for col in ['factor1', 'factor2', 'factor3']:
        mask |= crashes[col].str.contains(keyword, na=False)
    return crashes[mask].sample(n)

#"Dangerous" factors: asleep at the wheel and alcohol involvement
sleepsample = sample_by_factor('Asleep')
alcoholsample = sample_by_factor('Alcohol')

#"Safe" factors by comparison: driver inattention and backing incorrectly
inattentionsample = sample_by_factor('Inattention')
backingsample = sample_by_factor('Backing')

#plotting average injury and killed counts for each factor, dividing based on safety
plt.figure(figsize=(10, 5))
plt.bar('Sleeping', sleepsample['injured'].mean(), color='blue', label='Dangerous-Injured')
plt.bar('Sleeping', sleepsample['killed'].mean(), color='red', label='Dangerous-Killed')
plt.bar('Alcohol', alcoholsample['injured'].mean(), color='blue')
plt.bar('Alcohol', alcoholsample['killed'].mean(), color='red')
plt.bar('Inattention', inattentionsample['injured'].mean(), color='green', label='Safe-Injured')
plt.bar('Inattention', inattentionsample['killed'].mean(), color='purple', label='Safe-Killed')
plt.bar('Backing', backingsample['injured'].mean(), color='green')
plt.bar('Backing', backingsample['killed'].mean(), color='purple')
plt.title('Average Number of Injuries and Deaths by Contributing Factor')
plt.xlabel('Contributing Factor')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()

This graph does the same thing as the last, but divides the samples by how many vehicles were involved. Strangely, some entries record zero vehicles, even though this shouldn't be possible. Even stranger, the zero-vehicle sample has a higher average injury count than the one- and two-vehicle samples. Apart from that, the average injury count appears to scale roughly linearly with the number of vehicles: more vehicles involved corresponds to more injuries on average.
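The 'vehicles_involved' column is not part of the raw API export, so it was presumably derived earlier in the notebook. One plausible construction (an assumption, not necessarily the one used here) is to count the non-missing vehicle-type columns per row, which would also explain the zero-vehicle rows: a crash report with all five vehicle fields left blank would count as zero.

```python
import pandas as pd

def add_vehicles_involved(df):
    # Count how many of the five vehicle-type columns are filled in on each row.
    # This is a hypothetical reconstruction of how 'vehicles_involved' may be derived.
    cols = ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']
    df['vehicles_involved'] = df[cols].notna().sum(axis=1)
    return df

# Tiny illustrative frame: one two-vehicle crash, one all-blank report, one single vehicle
sample = pd.DataFrame({
    'vehicle1': ['Sedan', None, 'Taxi'],
    'vehicle2': ['SUV', None, None],
    'vehicle3': [None, None, None],
    'vehicle4': [None, None, None],
    'vehicle5': [None, None, None],
})
print(add_vehicles_involved(sample)['vehicles_involved'])  # counts 2, 0, 1
```

Under this construction, a "zero-vehicle" crash is likely a data-entry artifact rather than a physically impossible event.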

In [32]:
#getting samples of different vehicles involved values
zerovehicles = crashes[crashes['vehicles_involved'] == 0].sample(800)
onevehicle = crashes[crashes['vehicles_involved'] == 1].sample(800)
twovehicles = crashes[crashes['vehicles_involved'] == 2].sample(800)
threevehicles = crashes[crashes['vehicles_involved'] == 3].sample(800)
fourvehicles = crashes[crashes['vehicles_involved'] == 4].sample(800)
fivevehicles = crashes[crashes['vehicles_involved'] == 5].sample(800)

#plotting average injury and killed counts for each number of vehicles involved
plt.figure(figsize=(10, 5))
plt.bar('0', zerovehicles['injured'].mean(), color='blue', label='Injured')
plt.bar('0', zerovehicles['killed'].mean(), color='red', label='Killed')
plt.bar('1', onevehicle['injured'].mean(), color='blue')
plt.bar('1', onevehicle['killed'].mean(), color='red')
plt.bar('2', twovehicles['injured'].mean(), color='blue')
plt.bar('2', twovehicles['killed'].mean(), color='red')
plt.bar('3', threevehicles['injured'].mean(), color='blue')
plt.bar('3', threevehicles['killed'].mean(), color='red')
plt.bar('4', fourvehicles['injured'].mean(), color='blue')
plt.bar('4', fourvehicles['killed'].mean(), color='red')
plt.bar('5', fivevehicles['injured'].mean(), color='blue')
plt.bar('5', fivevehicles['killed'].mean(), color='red')
plt.title('Average Number of Injuries and Deaths by Number of Vehicles Involved')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()

kNN Model Test¶

To fully explore the possibility of a correlation between the circumstances surrounding a crash and its injury count, I trained a kNN model to predict the injury count using only the contributing factors and vehicle types of the crash. The model's accuracy is just under 70%, which isn't perfect, but is high enough to suggest a meaningful relationship between these columns (though it should be weighed against a majority-class baseline, since most crashes involve zero injuries).

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Prepare the data
features = crashes[['factor1', 'factor2', 'factor3', 'vehicle1', 'vehicle2', 'vehicle3', 'vehicle4','vehicle5', 'vehicles_involved']]
target = crashes['injured']  # the raw injury count for each crash

# Step 2: Encode categorical variables
encoder = OneHotEncoder()
features_encoded = encoder.fit_transform(features)

# Step 3: Split the data
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, random_state=6602)

# Step 4: Train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Step 5: Evaluate the model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
Accuracy: 0.68625
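An accuracy near 69% is hard to interpret on its own: if most crashes involve zero injuries, always predicting the majority class may already score close to that. The comparison can be sketched with scikit-learn's DummyClassifier. The class distribution below is synthetic and purely illustrative, not taken from the dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic target mimicking a skewed injury-count distribution (illustrative only)
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.25, 0.05])
X = np.zeros((1000, 1))  # features are ignored by the 'most_frequent' strategy

# Baseline that always predicts the most common class in the training data
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print('Majority-class baseline accuracy:', accuracy_score(y, baseline.predict(X)))
```

If the real baseline on this dataset were close to the kNN's 69%, the model would be adding little beyond "predict zero injuries"; the gap between the two is the meaningful quantity.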

Conclusion:¶

Overall, this data supports the hypothesis that certain contributing factors and vehicle types allow a reasonable inference about the number of injuries in a given crash. Specifically, the evidence suggests that incidents involving dangerous factors, unsafe vehicle types, and more vehicles tend to result in more injuries.

Question 2:¶

Does the number of injuries, deaths, and how many vehicles were involved in an incident allow us to predict the time that a crash occurred?

"Predicting the time a crash occurred" here means predicting the time of day of the crash, i.e. the entry in the categorical 'timeofday' column. The earlier time analysis section found a correlation between more cars being on the road and more crashes occurring. This question takes that observation further and asks whether higher numbers of injuries and deaths occur at different times.
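The 'timeofday' column is not in the raw API export, so it was presumably created earlier in the notebook from 'crash_time'. One way such a column might be built is shown below; the bin edges (Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23) and the `add_timeofday` helper name are assumptions for illustration, not necessarily the ones used in this analysis.

```python
import pandas as pd

def add_timeofday(df):
    # Parse the hour out of crash_time strings like "14:35" (assumed format)
    hours = pd.to_datetime(df['crash_time'], format='%H:%M').dt.hour
    # Assumed bins: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
    df['timeofday'] = pd.cut(hours, bins=[-1, 5, 11, 17, 23],
                             labels=['Night', 'Morning', 'Afternoon', 'Evening'])
    return df

sample = pd.DataFrame({'crash_time': ['2:15', '8:40', '13:05', '21:30']})
print(add_timeofday(sample))  # Night, Morning, Afternoon, Evening
```

`pd.cut` with explicit edges keeps the mapping deterministic, and the resulting column is categorical, which is what a classifier over four time-of-day groups needs.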

Hypothesis: I expect crashes with more vehicles involved and higher injury rates to correlate with times when more cars are on the road. Specifically, I think the morning and afternoon rush hours will have higher rates of injuries and deaths. Additionally, I expect incidents at night or in the early morning to involve only one or two cars and to have much lower injury and death rates than other times.

Graphical Analysis¶

Below is a matrix of pie charts representing the time-of-day distribution of crashes with specified numbers of injuries. In general, incidents happen more often in the afternoon than at any other time of day. However, the way the distribution shifts as the injury count rises reveals interesting trends. For example, low-injury crashes happen more often in the morning than high-injury crashes do. Also, the most dangerous crashes happen disproportionately in the evening, a period that accounts for a much smaller share of the zero-injury crashes.

In [34]:
#Making pie charts based on injury data and time of day
samples = {
    '0 Injuries': crashes[crashes['injured'] == 0].sample(800),
    '1 Injury': crashes[crashes['injured'] == 1].sample(800),
    '2-3 Injuries': crashes[crashes['injured'].isin([2, 3])].sample(800),
    '4 or More Injuries': crashes[crashes['injured'] >= 4].sample(800),
}

plt.figure(figsize=(10, 10))
for i, (label, sample) in enumerate(samples.items(), start=1):
    plt.subplot(2, 2, i)
    sample['timeofday'].value_counts().sort_index().plot(
        kind='pie', autopct='%1.1f%%',
        colors=['lightblue', 'lightgreen', 'lightcoral', 'lightyellow'])
    plt.title(f'Time of Day for Crashes with {label}')
    plt.ylabel('')
    if i == 3:
        plt.legend(loc='lower left')
plt.show()

kNN Model Test¶

This kNN model is a final test of whether the correlation is strong enough to reliably predict the time of day from the injury and death counts alone. Sadly, the model performed poorly, with an accuracy of about 24%. That is no better than randomly guessing among the four groups, which would yield 25% on average. So although the graphs above show some relationship between injury count and time of day, the link is not strong enough to predict one from the other.

In [35]:
features = crashes[['injured', 'motorists_injured', 'killed','motorists_killed', 'vehicles_involved']]
target = crashes['timeofday']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
Accuracy: 0.2356

Conclusion:¶

Although high-injury crashes clearly happen more often in the evening than in the morning, the connection between injury count and time of day does not appear strong enough to support reliable prediction.

Conclusion¶

This dataset revealed many interesting trends and correlations that seemed strange at first but made sense after further investigation. The proportions of car crashes occurring at different times of day and on different days were measured and hypothesized about. It was also found that certain circumstances surrounding a crash correlate strongly with how many injuries occur in it. Overall, this project has opened my eyes to the countless nuances of the car crashes that happen across the boroughs of New York City.

Dataset : https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data