Introduction¶
Imports¶
Below are a few important packages that may be used to analyze, manipulate, and visualize the data.
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
plt.rcParams['figure.figsize'] = [10, 5]
Dataset¶
This dataset contains data on all police-recorded motor vehicle collisions in New York City. Each row represents a reported vehicle collision with information about where the incident occurred, the circumstances surrounding the incident, and how many people were injured or killed.
This dataset is provided by the New York Police Department (NYPD) through the official NYC OpenData website and has been updated frequently since 2014. Because the original dataset contains over 2 million entries, a sample of 100,000 crashes, mostly from 2021 and 2022, has been imported from the official website.
Below is code importing the CSV from the NYC OpenData web API.
#Reading in the data from the NYC Open Data API
#Note: the $limit parameter is set to 100,000 to avoid performance and memory issues
crashes = pd.read_csv('https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=100000')
display(crashes.head(5))
crash_date | crash_time | borough | zip_code | latitude | longitude | location | on_street_name | off_street_name | cross_street_name | number_of_persons_injured | number_of_persons_killed | number_of_pedestrians_injured | number_of_pedestrians_killed | number_of_cyclist_injured | number_of_cyclist_killed | number_of_motorist_injured | number_of_motorist_killed | contributing_factor_vehicle_1 | contributing_factor_vehicle_2 | contributing_factor_vehicle_3 | contributing_factor_vehicle_4 | contributing_factor_vehicle_5 | collision_id | vehicle_type_code1 | vehicle_type_code2 | vehicle_type_code_3 | vehicle_type_code_4 | vehicle_type_code_5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-09-11T00:00:00.000 | 2:39 | NaN | NaN | NaN | NaN | NaN | WHITESTONE EXPRESSWAY | 20 AVENUE | NaN | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | Aggressive Driving/Road Rage | Unspecified | NaN | NaN | NaN | 4455765 | Sedan | Sedan | NaN | NaN | NaN |
1 | 2022-03-26T00:00:00.000 | 11:45 | NaN | NaN | NaN | NaN | NaN | QUEENSBORO BRIDGE UPPER | NaN | NaN | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Pavement Slippery | NaN | NaN | NaN | NaN | 4513547 | Sedan | NaN | NaN | NaN | NaN |
2 | 2022-06-29T00:00:00.000 | 6:55 | NaN | NaN | NaN | NaN | NaN | THROGS NECK BRIDGE | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Following Too Closely | Unspecified | NaN | NaN | NaN | 4541903 | Sedan | Pick-up Truck | NaN | NaN | NaN |
3 | 2021-09-11T00:00:00.000 | 9:35 | BROOKLYN | 11208.0 | 40.667202 | -73.866500 | \n, \n(40.667202, -73.8665) | NaN | NaN | 1211 LORING AVENUE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Unspecified | NaN | NaN | NaN | NaN | 4456314 | Sedan | NaN | NaN | NaN | NaN |
4 | 2021-12-14T00:00:00.000 | 8:13 | BROOKLYN | 11233.0 | 40.683304 | -73.917274 | \n, \n(40.683304, -73.917274) | SARATOGA AVENUE | DECATUR STREET | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | 4486609 | NaN | NaN | NaN | NaN | NaN |
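If more rows than the single-request limit were ever needed, the Socrata SODA API behind NYC OpenData also accepts an `$offset` parameter for paging. A minimal sketch of building paged query URLs (the `paged_url` helper name is ours, not part of any API; each URL could then be read with `pd.read_csv` and the results concatenated):

```python
#Build paged SODA query URLs: page 0 -> rows [0, limit), page 1 -> rows [limit, 2*limit), ...
def paged_url(base, limit, page):
    return '{}?$limit={}&$offset={}'.format(base, limit, page * limit)

base = 'https://data.cityofnewyork.us/resource/h9gi-nx95.csv'
urls = [paged_url(base, 100000, p) for p in range(3)]
```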
Preprocessing¶
There are a few important groups of data that are found in the dataset:
- Time Data: The precise time and date of the event are provided for every crash in the 'crash_date' and 'crash_time' columns
- Location Data: The location of the crash is recorded in 'borough', 'zip_code', 'on_street_name', etc
- Fatality and Injury Data: Data on the number of injuries and fatalities is in 'number_of_(civilian type)_(injured/killed)'
- Factors and Vehicles: Information on presumed reasons for the crash and the makes of the cars involved is found in the 'contributing_factors' and 'vehicle_type_codes' columns
Below, all 29 of the column names are printed.
#Printing the column names
display(pd.DataFrame(crashes.columns).transpose())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | crash_date | crash_time | borough | zip_code | latitude | longitude | location | on_street_name | off_street_name | cross_street_name | number_of_persons_injured | number_of_persons_killed | number_of_pedestrians_injured | number_of_pedestrians_killed | number_of_cyclist_injured | number_of_cyclist_killed | number_of_motorist_injured | number_of_motorist_killed | contributing_factor_vehicle_1 | contributing_factor_vehicle_2 | contributing_factor_vehicle_3 | contributing_factor_vehicle_4 | contributing_factor_vehicle_5 | collision_id | vehicle_type_code1 | vehicle_type_code2 | vehicle_type_code_3 | vehicle_type_code_4 | vehicle_type_code_5 |
Column Deletion¶
Below, data on the exact latitude and longitude is removed from the dataframe. Because the borough and street name of each crash are already recorded, the exact GPS location (along with the 'zip_code' column) will not be helpful in analyzing this data. The 'cross_street_name' and 'off_street_name' columns are deleted as well because they are very sparse and have an ambiguous meaning, as detailed on the NYPD site. Finally, the 'collision_id' column is removed because it does not contain any important data about a crash: it is a key generated to uniquely identify each crash in the NYPD database, and will not be useful to this analysis.
#Deleting GPS location, zip code, sparse street name, and collision_id columns
crashes.drop(['latitude', 'longitude', 'location', 'collision_id', 'zip_code', 'cross_street_name', 'off_street_name'], axis=1, inplace=True)
This dataset has 5 columns each for contributing factors and vehicle codes. However, very few crashes actually record data in all of these columns; many are listed as NaN or Unspecified. Below is a bar plot showing what percentage of crashes contain data in each contributing factor and vehicle code column.
cfs = crashes[['contributing_factor_vehicle_1', 'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3', 'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5']]
vtc = crashes[['vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5']]
#Counting the number of crashes that have a specified contributing factor and vehicle code for each #
cfsUNDEFINED = (cfs == 'Unspecified')
cfsNANs = cfs.isna()
cfsNullCounts = (cfsUNDEFINED | cfsNANs).sum()
cfsDataCounts = cfs.shape[0] - cfsNullCounts
cfsDataPct = cfsDataCounts / cfs.shape[0] * 100
vtcNANs = vtc.isna()
vtcNullCounts = vtcNANs.sum()
vtcDataCounts = vtc.shape[0] - vtcNullCounts
vtcDataPct = vtcDataCounts / vtc.shape[0] * 100
DataPct = pd.concat([cfsDataPct, vtcDataPct], axis=1)
#Plotting the percentage of crashes with a specified contributing factor for each #
DataPct.index = ['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4', 'Factor 5', 'Vehicle 1', 'Vehicle 2', 'Vehicle 3', 'Vehicle 4', 'Vehicle 5']
DataPct.plot(kind='bar')
plt.title('Percentage of Crashes with a Specified Contributing Factor/Vehicle Type')
plt.xlabel('Factor/Vehicle #')
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.legend(['Contributing Factor', 'Vehicle Type'], loc='upper left')
plt.gca().set_yticks(np.arange(0, 101, 10))
plt.gca().set_yticklabels(['{:.0f}%'.format(x) for x in plt.gca().get_yticks()])
for i, v in enumerate(DataPct[0][0:5]):
    plt.text(i-0.35, v+2, '{:.1f}%'.format(v))
for i, v in enumerate(DataPct[1][5:10]):
    plt.text(i+4.9, v+1.5, '{:.1f}%'.format(v))
plt.show()
This graph shows that most rows contain data for at least 2 vehicles, and some record up to two contributing factors. However, very few use all 5 factors. All columns that contain data in less than 0.5% of entries (fewer than 500 entries) will be deleted. This includes 'contributing_factor_vehicle_4' and 'contributing_factor_vehicle_5'.
crashes.drop(['contributing_factor_vehicle_4', 'contributing_factor_vehicle_5'], axis=1, inplace=True)
Column Addition and Null Conversion¶
For all qualitative columns, any NaN or 'Unspecified' entry will be changed to 'None'. This applies to all contributing factors, vehicle codes, boroughs, and street names.
crashes = crashes.fillna('None')
crashes = crashes.replace('Unspecified', 'None')
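The two steps above can also be done in a single `replace` call by mapping both NaN and 'Unspecified' to 'None'. A small sketch on toy data (the `df` example is ours, not from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'factor': ['Unspecified', np.nan, 'Unsafe Speed']})
#One pass: both missing values and 'Unspecified' become the string 'None'
df = df.replace({np.nan: 'None', 'Unspecified': 'None'})
```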
As it is now, the 'crash_date' column is hard to parse. Instead, it will be split into 'year', 'month', 'day', and 'timeofday' columns for easier access. The 'timeofday' column will contain a categorical simplification of the time: a crash can take place during the Night (12AM to 6AM), Morning (6AM to 12PM), Afternoon (12PM to 6PM), or Evening (6PM to 12AM).
The dataset will also be sorted by date for easier viewing. The 'crash_time' column will be reformatted as an integer minute of the day (0-1439, corresponding to 12:00 AM-11:59 PM). Finally, a toTime function is defined to convert a time in minutes back to a readable string when necessary.
def toTime(x):
    #converts time from minutes to HH:MM AM/PM format
    return (str(12) if np.floor(x/60)%12 == 0 else str(int(x / 60)%12)) + ':' + str(int(x % 60)).zfill(2) + ' ' + ('AM' if x < 720 else 'PM')
crashes['crash_date'] = pd.to_datetime(crashes['crash_date'])
crashes['year'] = crashes['crash_date'].dt.year
crashes['day'] = crashes['crash_date'].dt.day_name()
crashes['month'] = crashes['crash_date'].dt.month
crashes['month'] = crashes['month'].replace({1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'})
crashes.sort_values(by='crash_date', inplace=True)
crashes.reset_index(drop=True, inplace=True)
crashes['crash_time'] = pd.to_datetime(crashes['crash_time'], format='%H:%M')
crashes['crash_time'] = crashes['crash_time'].dt.hour * 60 + crashes['crash_time'].dt.minute
crashes['timeofday'] = pd.cut(crashes['crash_time'], bins=[-1, 360, 720, 1080, 1440], labels=['Night', 'Morning', 'Afternoon', 'Evening'])
display(crashes.head(3))
crash_date | crash_time | borough | on_street_name | number_of_persons_injured | number_of_persons_killed | number_of_pedestrians_injured | number_of_pedestrians_killed | number_of_cyclist_injured | number_of_cyclist_killed | number_of_motorist_injured | number_of_motorist_killed | contributing_factor_vehicle_1 | contributing_factor_vehicle_2 | contributing_factor_vehicle_3 | vehicle_type_code1 | vehicle_type_code2 | vehicle_type_code_3 | vehicle_type_code_4 | vehicle_type_code_5 | year | day | month | timeofday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-07-27 | 1253 | BROOKLYN | RALPH AVENUE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Failure to Yield Right-of-Way | None | None | Station Wagon/Sport Utility Vehicle | E-Scooter | None | None | None | 2012 | Friday | July | Evening |
1 | 2012-08-01 | 622 | BROOKLYN | PITKIN AVENUE | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | None | None | None | Station Wagon/Sport Utility Vehicle | Bike | None | None | None | 2012 | Wednesday | August | Morning |
2 | 2012-09-25 | 756 | QUEENS | WEIRFIELD STREET | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Prescription Medication | None | None | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | None | None | None | 2012 | Tuesday | September | Afternoon |
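As a quick sanity check on the time transformations above, toTime and the 'timeofday' binning can be exercised on a few known values (toTime is redefined here so the sketch is self-contained):

```python
import numpy as np
import pandas as pd

def toTime(x):
    #converts time from minutes to HH:MM AM/PM format (same definition as above)
    return (str(12) if np.floor(x/60)%12 == 0 else str(int(x / 60)%12)) + ':' + str(int(x % 60)).zfill(2) + ' ' + ('AM' if x < 720 else 'PM')

#Edge cases: midnight, noon, and the last minute of the day
labels = [toTime(0), toTime(720), toTime(1439)]

#The same bins and labels used for the 'timeofday' column, on one sample minute per period
sample = pd.Series([90, 600, 900, 1200])
tod = pd.cut(sample, bins=[-1, 360, 720, 1080, 1440], labels=['Night', 'Morning', 'Afternoon', 'Evening'])
```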
A 'vehicles_involved' column will be added to track how many vehicles were involved in each crash, computed as the number of 'vehicle_type_code' columns that are not 'None'. Finally, the columns will be renamed and reordered for easier viewing and use. Below, the new set of columns is printed.
#Counting vehicle columns that are not 'None' (exact match, so vehicle types containing the substring 'None' are not miscounted)
crashes['vehicles_involved'] = crashes[['vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5']].apply(lambda x: 5 - (x == 'None').sum(), axis=1)
crashes.rename(
columns={
'crash_date': 'date',
'crash_time': 'time',
'on_street_name': 'street',
'number_of_persons_injured': 'injured',
'number_of_persons_killed': 'killed',
'number_of_pedestrians_injured': 'pedestrians_injured',
'number_of_pedestrians_killed': 'pedestrians_killed',
'number_of_cyclist_injured': 'cyclists_injured',
'number_of_cyclist_killed': 'cyclists_killed',
'number_of_motorist_injured': 'motorists_injured',
'number_of_motorist_killed': 'motorists_killed',
'contributing_factor_vehicle_1': 'factor1',
'contributing_factor_vehicle_2': 'factor2',
'contributing_factor_vehicle_3': 'factor3',
'vehicle_type_code1': 'vehicle1',
'vehicle_type_code2': 'vehicle2',
'vehicle_type_code_3': 'vehicle3',
'vehicle_type_code_4': 'vehicle4',
'vehicle_type_code_5': 'vehicle5'
},
inplace=True
)
cols = ['date', 'year', 'month','day', 'time', 'timeofday', 'borough', 'street', 'injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed', 'factor1', 'factor2', 'factor3', 'vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5', 'vehicles_involved']
crashes = crashes[cols]
display(pd.DataFrame(crashes.columns).transpose())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | date | year | month | day | time | timeofday | borough | street | injured | killed | pedestrians_injured | pedestrians_killed | cyclists_injured | cyclists_killed | motorists_injured | motorists_killed | factor1 | factor2 | factor3 | vehicle1 | vehicle2 | vehicle3 | vehicle4 | vehicle5 | vehicles_involved |
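The per-row `apply` used to build 'vehicles_involved' can be replaced by a vectorized comparison, which is typically much faster on 100,000 rows. A sketch on toy data (column names shortened for brevity):

```python
import pandas as pd

vt = pd.DataFrame({
    'v1': ['Sedan', 'Sedan', 'None'],
    'v2': ['Bike', 'None', 'None'],
    'v3': ['None', 'None', 'None'],
})
#Count, per row, how many vehicle columns are not the placeholder 'None'
involved = (vt != 'None').sum(axis=1)
```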
Outlier Analysis¶
The box and whisker plot below shows how many outliers are in each injury and fatality category. The datapoint on the left of each box is the mean of the column, and the datapoint on the right is the number of outliers in the column. It can be seen that most accidents lead to no deaths or injuries, so the boxplots treat any injury or fatality at all as an outlier. This is especially prevalent in the 'motorists_injured' column, where every nonzero value counts as an outlier, resulting in a huge number of outliers.
#Outlier Analysis
#Creating a boxplot for the number of injured persons
plt.figure(figsize=(10, 5))
quantitative = crashes[['injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed']]
numOutliers = (quantitative > quantitative.quantile(0.75) + 1.5*(quantitative.quantile(0.75) - quantitative.quantile(0.25))).sum()
sns.boxplot(quantitative, orient='h', palette='Set2')
for i, v in enumerate(quantitative.mean()):
    plt.text(0.1, i, '{:.2f}'.format(v), va='center')
for i, v in enumerate(numOutliers):
    plt.text(19.8, i, v, ha='right', va='center')
plt.title('Boxplot of the Number of Injured or Killed Persons')
plt.xlabel('Number of Injured Persons')
plt.xticks(np.arange(0, 21, 5))
plt.show()
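The outlier count above uses the standard 1.5×IQR rule: a value is an upper outlier if it exceeds Q3 + 1.5·(Q3 − Q1). On a toy series where most values are 0 (our example, mirroring the injury columns), Q1 and Q3 are both 0, so any positive value lands above the fence:

```python
import pandas as pd

#Mostly-zero toy series, like the injury/fatality columns
s = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 5])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
#Upper fence of the 1.5*IQR rule; here q1 = q3 = 0, so the fence is 0
outliers = (s > q3 + 1.5 * (q3 - q1)).sum()
```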
Below, the same analysis is applied to the 'vehicles_involved' column. This column only contains values in the range [0, 5], and any value above 3 is treated as an outlier. Strangely, a few entries have 0 vehicles involved; these could be traffic events that didn't involve vehicles, or a failure to record the data on the NYPD's part. For our use of this data, 0 will be treated as an outlier value for this column.
#vehicles involved outliers
plt.figure(figsize=(10, 1))
vehicles = crashes['vehicles_involved']
vehiclesOutliers = (vehicles > vehicles.quantile(0.75) + 1.5*(vehicles.quantile(0.75) - vehicles.quantile(0.25))).sum()
sns.boxplot(vehicles, orient='h', palette='Set2')
plt.text(0.1, -0.15, 'μ = {:.2f}'.format(vehicles.mean()), va='center')
plt.text(5.8, 0, vehiclesOutliers, ha='right', va='center')
plt.title('Boxplot of the Number of Vehicles Involved')
plt.xlabel('Number of Vehicles Involved')
plt.xticks(np.arange(0, 7, 1))
plt.show()
#Creating a bar graph for num vehicles
vehiclesCounts = vehicles.value_counts()
vehiclesCounts = vehiclesCounts.sort_index()
vehiclesCounts.plot(kind='bar', edgecolor='black')
plt.title('Number of Vehicles Involved in Crashes')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
for i, v in enumerate(vehiclesCounts):
    plt.text(i-0.16, v+400, v)
plt.show()
Now that the columns are organized and unnecessary data has been deleted, the dataset is ready to be analyzed in detail!
Summary Data Analysis¶
As stated before, the data in this set divides cleanly into time data, location data, fatality/injury data, factors, and vehicles. Each of these categories is further divided into columns. This portion of the analysis examines the patterns seen in these categories and how they relate to each other.
display(crashes.head(5))
date | year | month | day | time | timeofday | borough | street | injured | killed | pedestrians_injured | pedestrians_killed | cyclists_injured | cyclists_killed | motorists_injured | motorists_killed | factor1 | factor2 | factor3 | vehicle1 | vehicle2 | vehicle3 | vehicle4 | vehicle5 | vehicles_involved | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-07-27 | 2012 | July | Friday | 1253 | Evening | BROOKLYN | RALPH AVENUE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Failure to Yield Right-of-Way | None | None | Station Wagon/Sport Utility Vehicle | E-Scooter | None | None | None | 2 |
1 | 2012-08-01 | 2012 | August | Wednesday | 622 | Morning | BROOKLYN | PITKIN AVENUE | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | None | None | None | Station Wagon/Sport Utility Vehicle | Bike | None | None | None | 2 |
2 | 2012-09-25 | 2012 | September | Tuesday | 756 | Afternoon | QUEENS | WEIRFIELD STREET | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Prescription Medication | None | None | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | None | None | None | 2 |
3 | 2012-10-22 | 2012 | October | Monday | 1038 | Afternoon | None | BELT PARKWAY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Unsafe Speed | Other Vehicular | Other Vehicular | Sedan | Sedan | Sedan | None | None | 3 |
4 | 2016-04-16 | 2016 | April | Saturday | 860 | Afternoon | BROOKLYN | WEST 17 STREET | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Driver Inattention/Distraction | None | None | Sedan | Station Wagon/Sport Utility Vehicle | None | None | None | 2 |
Time Data¶
This portion of the dataset contains all information about when an accident occured. It includes the 'year', 'month', 'day', 'date', and 'time' columns. Over these columns a trend starts to form: more crashes seem to happen at times when more cars are on the road.
Year¶
The 'year' column contains the year in which the accident occurred. Below is a graph that shows the distribution of crashes based on year. It can be seen that a majority of the crashes are from 2021 and 2022: over 99%, in fact. This is partially because of the limited sample size of 100,000 entries that was taken from the original dataset.
#Creating a new dataframe that contains the total number of crashes for each year
crash_years = pd.DataFrame(crashes['year'].value_counts().sort_index()).reset_index()
crash_years.columns = ['Year', 'Crashes']
crash_pct = crashes['year'].value_counts(normalize=True).sort_index()
crash_years.plot(kind='bar', x='Year', y='Crashes')
plt.title('Total Number of Crashes by Year')
plt.xlabel('Year')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
for i, v in enumerate(crash_years['Crashes']):
    plt.text(i-0.22, v+500, v)
for i, v in enumerate(crash_pct):
    plt.text(i-0.24, 5000, '{:.1f}%'.format(v*100))
plt.show()
Month¶
The 'month' column contains the month when each crash took place (January, April, etc.). Below is a bar graph of the total number of crashes that took place in each month. It can be seen that the majority of crashes take place during warmer months, and the crash rate dips during the winter. This could indicate that people drive more often during the summer and early autumn than during the winter, and therefore get into more crashes during that time.
crash_months = pd.DataFrame(crashes['month'].value_counts())
crash_months.columns = ['Crashes']
crash_months = crash_months.reindex(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
crash_months.plot(kind='bar', color='orange', edgecolor='black')
plt.title('Total Number of Crashes by Month')
plt.xlabel('Month')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(crash_months['Crashes'].mean(), color='red', linestyle='dashdot', label='Average')
plt.text(11.6, crash_months['Crashes'].mean(), 'Average: {:.0f}'.format(crash_months['Crashes'].mean()))
plt.legend(loc = 'lower right')
for i, v in enumerate(crash_months['Crashes']):
    plt.text(i-0.25, v+100, v)
plt.show()
Day¶
The 'day' column contains the day of the week when the crash took place. Below is a similar bar graph showing the proportion of crashes that occur on each day. The distribution is generally uniform, with a significant peak on Friday. This makes sense, as Friday is generally when people will be out driving the most, either celebrating the end of the work/school week or leaving the office for the weekend. The minimum is on Sunday, which also makes sense: most people don't have work and aren't going out on Sundays, so there should be fewer cars on the road.
crash_days = pd.DataFrame(crashes['day'].value_counts())
crash_days.columns = ['Crashes']
crash_days = crash_days.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
crash_days.plot(kind='bar', color='green', edgecolor='black')
plt.title('Total Number of Crashes by Day')
plt.xlabel('Day')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.axhline(crash_days['Crashes'].mean(), color='lightgreen', linestyle='dashdot', label='Average')
plt.text(6.6, crash_days['Crashes'].mean(), 'Average: {:.0f}'.format(crash_days['Crashes'].mean()))
plt.legend( loc='lower right')
for i, v in enumerate(crash_days['Crashes']):
    plt.text(i-0.19, v+150, v)
plt.show()
Time¶
The 'time' column gives the approximate minute that an accident occurred, in ET. Below is a histogram of the number of crashes that happened in each 30-minute interval of the day. The red line is a 5th-degree least-squares trend curve that estimates how the rate of crashes evolves continuously throughout the day: it helps us visualize what a continuous analysis would look like.
It can be seen that the most crashes happen between 5 and 6 PM, while the fewest happen between 3 and 4 AM. This fits the idea established by the other time-based columns that more crashes happen when more cars are on the road. Strangely, a huge number of crashes happen at exactly 12:00 AM. It seems plausible that 12:00 AM is used as a default time when no time is entered into the NYPD system, which would explain the strange peak.
#line graph crashes by time
plt.figure(figsize=(10, 5))
crashes['time'].plot(kind='hist', bins=48, edgecolor='black')
#least squares 5th-degree polynomial fit
x = np.arange(0, 1440, 30)
y = crashes['time'].value_counts(bins = 48).sort_index()
coeffs = np.polyfit(x, y, 5)
poly = np.poly1d(coeffs)
plt.plot(x, poly(x), color='red', linestyle='solid', label='Least Squares Fit', linewidth=2)
plt.title('Total Number of Crashes by Time')
plt.xlabel('Time')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=20)
plt.gca().set_xticks(np.arange(0, 1441, 120))
plt.gca().set_xticklabels([toTime(x) for x in plt.gca().get_xticks()])
plt.legend(loc='upper right')
plt.show()
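np.polyfit returns least-squares coefficients ordered from highest to lowest degree, and np.poly1d turns them into a callable polynomial. On noise-free data from a known polynomial, the fit recovers the coefficients almost exactly; a minimal check (the quadratic here is our example, not the crash data):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2*x**2 - 3*x + 1            #known quadratic: coefficients [2, -3, 1]
coeffs = np.polyfit(x, y, 2)     #highest-degree coefficient first
poly = np.poly1d(coeffs)         #callable polynomial for plotting a trend curve
```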
Holidays¶
To test the hypothesis that crashes happen more often when more cars are on the road, it makes sense to check which dates have the most incidents. If the hypothesis is correct, the dates with the most incidents should be holidays. This seems to be the case, as many of the busiest holidays have more crashes than average. Notably, more than double the average number of crashes happened on Halloween 2021. For reference, the day with the most crashes is plotted as the rightmost bar.
datecounts = crashes['date'].value_counts()
average = datecounts.mean()
maxdate = datecounts.idxmax()
holidaycounts = pd.DataFrame(datecounts[[ '2021-12-25', '2021-12-24', '2021-10-31', '2021-12-31', '2021-07-04', str(maxdate)]])
holidaycounts.columns = ['Crashes']
holidaycounts.index = ['Christmas', 'Christmas Eve', 'Halloween', 'New Year\'s Eve', '4th of July', 'Most Crashes']
holidaycounts.plot(kind='bar', color='purple', edgecolor='black')
plt.title('Total Number of Crashes on Holidays')
plt.xlabel('Date')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(average, color='red', linestyle='dashdot', label='Average')
plt.text(5.6, average, 'Average: {:.0f}'.format(average))
plt.legend(loc='upper left')
for i, v in enumerate(holidaycounts['Crashes']):
    plt.text(i-0.1, v+4, v)
plt.show()
Location Data¶
This section of the dataset contains information about where in the city an incident occurs. It contains the 'borough' and 'street' columns.
Borough¶
The 'borough' column details the borough in which the incident took place. About a third of the incidents do not have a recorded borough (indicated by the None entry). A majority of the crashes took place in Brooklyn and Queens, while less than 3% took place in Staten Island. The ordering of the crash counts strongly reflects the populations of the boroughs.
boroughs = pd.DataFrame(crashes['borough'].value_counts())
boroughs.columns = ['Crashes']
boroughs.plot(kind='bar', color='pink', edgecolor='black')
boroughpct = crashes['borough'].value_counts(normalize=True)
plt.title('Total Number of Crashes by Borough')
plt.xlabel('Borough')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45)
plt.axhline(boroughs['Crashes'].mean(), color='red', linestyle='dashdot', label='Average')
plt.text(4.6, boroughs['Crashes'].mean()+500, 'Average: {:.0f}'.format(boroughs['Crashes'].mean()))
plt.legend(loc='upper right')
for i, v in enumerate(boroughs['Crashes']):
plt.text(i-0.15, v+100, v)
for i, v in enumerate(boroughpct):
plt.text(i-0.15, 600, '{:.1f}%'.format(v*100))
plt.show()
Street¶
The 'street' column contains the street name of where each incident took place. The dataset contains 4,359 unique street names, while 27% of entries (27,081) do not have a specified street name.
Below is a bar graph showing how many streets appear a given number of times in the dataset. Many streets appear only a handful of times, but a good number appear more than 10 times. Note that each 'More than N' bar counts a subset of every 'More than N' bar to its left; only the 'Less than 10' bar is disjoint from the rest.
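The 'street' column is likewise a derived field; one plausible construction, assuming it prefers the on-street name, falls back to the cross street, and fills remaining gaps with the string 'None' (raw column names from the API schema):

```python
import pandas as pd

# Hypothetical derivation of the 'street' column: take on_street_name where
# present, fall back to cross_street_name, and label anything still missing
# with the literal string 'None'.
raw = pd.DataFrame({
    'on_street_name':    ['BROADWAY', None, None],
    'cross_street_name': [None, 'ATLANTIC AVENUE', None],
})
raw['street'] = raw['on_street_name'].fillna(raw['cross_street_name']).fillna('None')
print(raw['street'].tolist())  # → ['BROADWAY', 'ATLANTIC AVENUE', 'None']
```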
streetcounts = crashes['street'].value_counts()
numStreets = streetcounts.shape[0]
print('Number of crashes with no specified street: {}'.format(streetcounts['None']))
print('Total number of unique streets: {}'.format(numStreets))
lowlim = [10, 50, 100, 200, 500, 750, 1000]
streetseries = pd.Series([streetcounts[streetcounts > x].shape[0]-1 for x in lowlim], index=['More than 10', 'More than 50', 'More than 100', 'More than 200', 'More than 500', 'More than 750', 'More than 1000'])
streetseries['Less than 10'] = streetcounts[streetcounts <= 10].shape[0]
streetseries = streetseries[['Less than 10','More than 10', 'More than 50', 'More than 100', 'More than 200', 'More than 500', 'More than 750', 'More than 1000']]
streetseries.plot(kind='bar', color='cyan', edgecolor='black')
streetpct = streetseries / numStreets * 100
plt.title('Number of Streets that have a Specified Number of Crashes')
plt.xlabel('Number of Crashes')
plt.ylabel('Number of Streets')
plt.xticks(rotation=45)
for i, v in enumerate(streetseries):
plt.text(i-0.2, v+20, v)
for i, v in enumerate(streetpct):
plt.text(i-0.22, 500, '{:.1f}%'.format(v))
plt.show()
Number of crashes with no specified street: 27081
Total number of unique streets: 4359
Fatality/Injury Data¶
This category covers the columns recording deaths and injuries from the crashes: every column with 'injured' or 'killed' in its name. It's important to note that all of these columns are quantitative.
Earlier in the report, a box-and-whisker plot was made to analyze how many outliers exist in each injury and fatality column. That plot is shown again below. As before, the mean for each column is given on the left of its boxplot, and the number of outliers on the right. It shows that a given collision usually results in one injury or none, and almost never results in a death. There also appear to be fewer instances of cyclists and pedestrians being involved in accidents than motorists.
#Outlier Analysis
#Creating a boxplot for the number of injured persons
plt.figure(figsize=(10, 5))
quantitative = crashes[['injured', 'killed', 'pedestrians_injured', 'pedestrians_killed', 'cyclists_injured', 'cyclists_killed', 'motorists_injured', 'motorists_killed']]
numOutliers = (quantitative > quantitative.quantile(0.75) + 1.5*(quantitative.quantile(0.75) - quantitative.quantile(0.25))).sum()
sns.boxplot(quantitative, orient='h', palette='Set2')
for i, v in enumerate(quantitative.mean()):
plt.text(0.1, i, '{:.2f}'.format(v), va='center')
for i, v in enumerate(numOutliers):
plt.text(19.8, i, v, ha='right', va='center')
plt.title('Boxplot of the Number of Injured or Killed Persons')
plt.xlabel('Number of Injured Persons')
plt.xticks(np.arange(0, 21, 5))
plt.show()
Total Injuries¶
The 'injured' column records how many people were injured in total in each incident; it is generally the sum of the other 'injured' columns. The bar graph below shows the number of crashes that resulted in a given number of injuries. More than half of the incidents resulted in no injuries, a significant number resulted in only one, and only a few resulted in more than 3. This could mean that most crashes in New York are smaller collisions.
# Count the occurrences of each unique value in the 'injured' column
injured_counts = crashes['injured'].value_counts().sort_index()
# Define the bins for the bar graph
bins = [-1, 0, 1, 2, 4, 6, 8, 10, 20]
# Group the counts into the defined bins
grouped_counts = injured_counts.groupby(pd.cut(injured_counts.index, bins), observed=False).sum()
grouped_counts.index = ['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20']
# Plot the bar graph
grouped_counts.plot(kind='bar', color='lightblue', edgecolor='black')
injured_pct = grouped_counts / grouped_counts.sum() * 100
for i, v in enumerate(grouped_counts):
plt.text(i-0.2, v+500, v)
for i, v in enumerate(injured_pct):
plt.text(i-0.22, 10000, '{:.1f}%'.format(v))
# Set the labels and title
plt.title('Number of Crashes by Number of Injuries')
plt.xlabel('Number of Injuries')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
# Show the plot
plt.show()
Total Fatalities¶
The 'killed' column is similar to the 'injured' column in that it records the total number of persons killed in a given incident. The bar graph below shows a similar trend: most incidents resulted in no fatalities, while only a few resulted in one or more. This is lower than I expected, which is a nice surprise. It also suggests that a majority of incidents in the NYPD system are not serious crashes, but small collisions that are perhaps reported for insurance reasons.
# Calculate the counts of killed
killed_counts = crashes['killed'].value_counts().sort_index()
# Calculate the percentage of killed
killed_pct = (killed_counts / len(crashes)) * 100
# Create a DataFrame for killed counts and percentages
killed_data = pd.DataFrame({'Counts': killed_counts, 'Percentage': killed_pct})
# Plot the counts of killed
plt.figure(figsize=(10, 5))
sns.barplot(x=killed_data.index, y=killed_data['Counts'], color='red', edgecolor='black')
plt.title('Counts of Killed in Crashes')
plt.xlabel('Number of Killed')
plt.ylabel('Counts')
# Enumerate the percentages as text on the graph
for i, count in enumerate(killed_data['Counts']):
plt.text(i, count, killed_data["Counts"][i], ha='center', va='bottom')
for i, count in enumerate(killed_data['Counts']):
plt.text(i, 20000, f'{killed_data["Percentage"][i]:.2f}%', ha='center', va='bottom')
plt.show()
Pedestrian Statistics¶
The 'pedestrians_injured' and 'pedestrians_killed' columns are subsets of the 'injured' and 'killed' columns that count only the pedestrians injured or killed in a given incident. The distribution in the graphs below is similar to the previous graphs. Note that no accident resulted in more than 2 pedestrian deaths or 6 pedestrian injuries. This could imply that in most New York crashes the vehicles stay on the road, since a pedestrian being injured or killed would generally mean a car mounted the sidewalk or struck someone in a crosswalk.
# Create a bar graph for pedestrians injured and pedestrians killed
killed_counts = crashes['pedestrians_killed'].value_counts().sort_index()
injured_counts = crashes['pedestrians_injured'].value_counts().sort_index()
killed_counts = killed_counts.reindex(injured_counts.index, fill_value=0)
plt.figure(figsize=(10, 5))
killed_counts.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Pedestrians Killed')
injured_counts.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Pedestrians Injured')
plt.title('Number of Crashes by Pedestrians Injured and Killed')
plt.xlabel('Number of Pedestrians')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
for i, v in enumerate(killed_counts):
plt.text(i+0.05, v+800, v, color = 'red')
for i, v in enumerate(injured_counts):
plt.text(i-.38, v+880, v, color = 'blue')
plt.show()
Motorist and Cyclist Statistics¶
The 'motorist' and 'cyclist' injury and death columns are also subsets of the 'injured' and 'killed' columns. Below are two bar graphs that display the number of cyclists and motorists injured.
There are many crashes with a high 'motorists_injured' value; some even reach over 15. For one incident to involve so many motorists at once, it must not be a "basic" crash. Could NY's infrastructure have failed, or perhaps a dangerous vehicle was at play? The hypothesis that strange circumstances surrounding a crash correlate with high injury and death counts will be explored in the Discussion section.
motor_killed = crashes['motorists_killed'].value_counts().sort_index()
motor_injured = crashes['motorists_injured'].value_counts().sort_index()
cycle_killed = crashes['cyclists_killed'].value_counts().sort_index()
cycle_injured = crashes['cyclists_injured'].value_counts().sort_index()
motor_killed = motor_killed.reindex(motor_injured.index, fill_value=0)
cycle_killed = cycle_killed.reindex(cycle_injured.index, fill_value=0)
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
motor_killed.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Motorists Killed', ax=ax[0])
motor_injured.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Motorists Injured', ax=ax[0])
ax[0].set_title('Number of Crashes by Motorists Injured and Killed')
ax[0].set_xlabel('Number of Motorists')
ax[0].set_ylabel('Number of Crashes')
ax[0].legend(loc='upper right')
for i, v in enumerate(motor_killed):
ax[0].text(i+0.05, v+800, v, color = 'red', rotation=45)
for i, v in enumerate(motor_injured):
ax[0].text(i-.38, v+880, v, color = 'blue', rotation=45)
cycle_killed.plot(kind='bar', color='red', edgecolor='black', position=0, width=0.4, label='Cyclists Killed', ax=ax[1])
cycle_injured.plot(kind='bar', color='blue', edgecolor='black', position=1, width=0.4, label='Cyclists Injured', ax=ax[1])
ax[1].set_title('Number of Crashes by Cyclists Injured and Killed')
ax[1].set_xlabel('Number of Cyclists')
ax[1].set_ylabel('Number of Crashes')
ax[1].legend(loc='upper right')
for i, v in enumerate(cycle_killed):
ax[1].text(i+0.05, v+800, v, color = 'red')
for i, v in enumerate(cycle_injured):
ax[1].text(i-.38, v+880, v, color = 'blue')
plt.show()
Factors and Vehicles¶
This section concerns the circumstances surrounding a crash. It covers the factors contributing to why a crash occurred, as inferred by the reporting officer, and the models of the vehicles involved in the crash.
Contributing Factors¶
The 'factor' columns record the factors that may have contributed to an incident. Below is a bar graph showing the most common factors in a crash and a dataframe listing every contributing factor sorted by how often it appears. The 'factor1' column is used in the analysis because it is present in the most entries: the 'factor2' and 'factor3' columns follow a similar distribution to 'factor1', just with many more None entries. There are 55 different contributing factors across the entire dataset. The most common factor by far is Driver Inattention, making up over 23% of all incidents.
commonFactors = crashes['factor1'].value_counts().head(10)
commonFactors.plot(kind='bar', color='purple', edgecolor='black')
plt.title('Top 10 Most Common Contributing Factors')
plt.xlabel('Contributing Factor')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45, ha='right')
plt.axhline(crashes['factor1'].value_counts().mean(), color='red', linestyle='dashdot', label='Average')
plt.text(9.6, crashes['factor1'].value_counts().mean()-200, 'Average: {:.0f}'.format(crashes['factor1'].value_counts().mean()))
plt.legend(loc='upper right')
for i, v in enumerate(commonFactors):
plt.text(i-0.23, v+300, v)
plt.show()
display( pd.DataFrame(crashes['factor1'].value_counts()).transpose())
factor1 | None | Driver Inattention/Distraction | Failure to Yield Right-of-Way | Following Too Closely | Passing or Lane Usage Improper | Passing Too Closely | Unsafe Speed | Backing Unsafely | Traffic Control Disregarded | Other Vehicular | Turning Improperly | Unsafe Lane Changing | Driver Inexperience | Alcohol Involvement | Reaction to Uninvolved Vehicle | Pedestrian/Bicyclist/Other Pedestrian Error/Confusion | View Obstructed/Limited | Pavement Slippery | Aggressive Driving/Road Rage | Fell Asleep | Brakes Defective | Oversized Vehicle | Steering Failure | Passenger Distraction | Outside Car Distraction | Obstruction/Debris | Lost Consciousness | Tire Failure/Inadequate | Illnes | Pavement Defective | Glare | Fatigued/Drowsy | Failure to Keep Right | Driverless/Runaway Vehicle | Drugs (illegal) | Animals Action | Accelerator Defective | Cell Phone (hand-Held) | Traffic Control Device Improper/Non-Working | Physical Disability | Tinted Windows | Lane Marking Improper/Inadequate | Prescription Medication | Using On Board Navigation Device | Vehicle Vandalism | Other Electronic Device | Other Lighting Defects | Headlights Defective | Tow Hitch Defective | Eating or Drinking | Cell Phone (hands-free) | Texting | Shoulders Defective/Improper | Listening/Using Headphones | Windshield Inadequate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 25226 | 23864 | 6905 | 6617 | 4564 | 3831 | 3666 | 3116 | 2894 | 2709 | 2290 | 2078 | 1991 | 1580 | 1363 | 962 | 850 | 820 | 753 | 441 | 412 | 411 | 268 | 246 | 209 | 205 | 195 | 188 | 173 | 142 | 136 | 130 | 110 | 103 | 90 | 88 | 81 | 48 | 44 | 42 | 26 | 25 | 17 | 14 | 12 | 10 | 10 | 9 | 8 | 8 | 6 | 5 | 5 | 2 | 2 |
Vehicle Models¶
The 'vehicle' columns contain information on the models of the vehicles involved in each crash. Below is a bar graph of the top 10 most common vehicle types and a series containing every recorded vehicle. Most crashes involve a Sedan or an SUV, which makes sense considering those are very common cars. A good number of crashes also involve Bikes and Motorcycles: it could be worthwhile to view subsets of these groups alongside the 'cyclists_injured' and 'motorists_injured' columns.
It's important to note that these values are almost certainly typed into the dataset by the reporting NYPD officer: the many misspelled entries like Ambulace, Ambulane, and Fire Engin make that clear. The counts here are therefore not completely reliable. Some of the entries are also very strange: for example, three entries in the dataset involve a Tank, and one even involves a Freight Train.
#Summing vehicle counts across all five vehicle columns, aligned to vehicle1's index
vehiclecounts = crashes['vehicle1'].value_counts()
for col in ['vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']:
    vehiclecounts = vehiclecounts + crashes[col].value_counts().reindex(vehiclecounts.index, fill_value=0)
vehiclecounts.index = vehiclecounts.index.str.capitalize()
vehiclecounts = vehiclecounts.groupby(level=0).sum()
vehiclecounts= vehiclecounts.sort_values(ascending=False).drop('None')
crashes.replace('Station Wagon/Sport Utility Vehicle', 'SUV', inplace=True)
commonVehicles = vehiclecounts.head(10)
commonVehicles['Other'] = vehiclecounts[10:].sum()
commonVehicles.rename(index={'Station wagon/sport utility vehicle': 'SUV'}, inplace=True)
commonVehicles.plot(kind='bar', color='orange', edgecolor='black')
plt.title('Top 10 Most Common Vehicle Types')
plt.xlabel('Vehicle Type')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=45, ha='right')
plt.legend(loc='upper right')
for i, v in enumerate(commonVehicles):
plt.text(i-0.23, v+300, v)
plt.show()
display(pd.DataFrame(vehiclecounts).transpose())
vehicle1 | Sedan | Station wagon/sport utility vehicle | Bike | Pick-up truck | Box truck | Taxi | Bus | E-bike | Motorcycle | Tractor truck diesel | E-scooter | Van | Ambulance | Moped | Dump | Pk | Flat bed | Garbage or refuse | Convertible | Carry all | Motorscooter | Tow truck / wrecker | Motorbike | Tractor truck gasoline | Chassis cab | 4 dr sedan | Tanker | Fire truck | 3-door | Trailer | Limo | Refrigerated van | Concrete mixer | Armored truck | School bus | Flat rack | Scooter | Multi-wheeled vehicle | Firetruck | Beverage truck | Unknown | Tow truck | Unk | Open body | Lift boom | Truck | Stake or rack | Pedicab | Snow plow | Minibike | Forklift | Bulk agriculture | Commercial | Ambu | Fdny ambul | Fdny truck | Dump truck | Pick up | Com | Lunch wagon | Garbage tr | Mta bus | Minicycle | Fdny | 2 dr sedan | Van camper | Pallet | Usps truck | Motor scoo | Amb | Utility | Sprinter v | Hopper | Fdny fire | Delivery t | Electric s | Rv | Passenger | Pas | Power shov | Delivery | Pickup | Util | Self insur | Delv | Pick up tr | Usps | Nys ambula | Fdny engin | Street swe | Fire | Pickup with mounted camper | Glass rack | Tank | Pc | Golf cart | Suburban | Road sweep | Mack | Nyc sanita | Refg | Fire engin | Fork lift | Enclosed body - nonremovable enclosure | Ford van | Ford | Dirt bike | Escooter s | F550 | Nypd van | Motor home | Ec3 | E scooter | Motorized | Motorized home | Self | Ups | Semi trail | Usps mail | Skateboard | Utility ve | Box | Boom lift | Pick-up tr | Van/truck | Pickup tru | Vms | Ambulane | Sanitation | Work van | Econoline | Mopad | Gas scoote | Food cart | Mini van | Mailtruck | Ems | Mail truck | Mack truck | Tl | Tr | Flatbed | Street cle | Van wh | School bu | Subn | Verzion va | Tk | Shuttle bu | Suv | Sanmen cou | Ram | White van | Work truck | Rgr | Sw/van | Revel scoo | Yamaha | Street | Scooter ga | Utility va | Utility tr | Skywatch | Truck comm | Tf | Uhal | Semi | Unk box tr | Smyellscho | Unmarked v | Tlr | Sedona | Us mail 
tr | Us postal | Tlc | Uspcs | Usps posta | Usps small | Usps vehic | Tcn | ''lime mope | Pumper | Citywide | Commerical | Con ed tru | Constructi | Crane boom | D2 | Dent and s | Department | Dodge | Dodge ram | Dumpster t | E bike uni | Ems bus | Emt ambula | Enclosed body - removable enclosure | Engine sp0 | Esu rep | Excavator | Cmix | City owned | Pump | City | 12 passage | 50cc scoot | 994 | Ambulace | Ambulence | Ambulette | App | Asphalt ro | Backhoe | Camper van | Cargo van | Carriage | Carrier | Cat forkli | Cater | Cement tru | Cherv | Fdny firet | Fdny ladde | Fire appar | Flatbed pi | Motorscoot | Moving tru | Mta | Mtr h | Nonmotords | Nypd tow t | Pas (4dr s | Pass | Pick wh | Pick-up | Pkup | Police rep | Post offic | Postal ser | Postal tru | Pro master | Psd | Mopd | . | Mini bus | Grumman ll | Freight | Freight tr | Frt | Garbage | Gas mo ped | Gas powere | Golf car | Horse carr | M2 | Horse trai | Hwh | Kick scoot | Ladder co | Lift | Livestock rack | Locomotive | �mbu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 82786 | 60662 | 4944 | 3876 | 3840 | 3718 | 3054 | 2733 | 1717 | 1433 | 1380 | 1203 | 1050 | 686 | 629 | 427 | 361 | 354 | 342 | 261 | 239 | 233 | 218 | 195 | 135 | 97 | 94 | 86 | 59 | 56 | 56 | 53 | 53 | 47 | 46 | 43 | 42 | 40 | 37 | 36 | 36 | 34 | 34 | 32 | 32 | 29 | 26 | 25 | 22 | 21 | 17 | 15 | 15 | 15 | 14 | 14 | 12 | 11 | 11 | 10 | 10 | 9 | 9 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 7 | 6 | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Number of Vehicles Involved¶
The custom-made 'vehicles_involved' column records how many vehicles were involved in a crash: specifically, it counts how many 'vehicle' entries are not None. Most incidents involve only 1 or 2 vehicles, but a significant number involve more than that.
Strangely, some incidents involve 0 vehicles. These entries could be errors in the NYPD system, entries that involved road conditions, or perhaps incidents that were reported after the vehicles had left the scene. They will be investigated in the Discussion portion of this report.
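Since 'vehicles_involved' is a custom column, its construction is worth sketching. A minimal version, assuming it simply counts the non-missing vehicle columns per row:

```python
import pandas as pd

# Hypothetical construction of the custom column: count how many of the
# vehicle columns are filled in for each crash.
vehicle_cols = ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']
demo = pd.DataFrame([
    {'vehicle1': 'Sedan', 'vehicle2': 'SUV'},  # 2-vehicle crash
    {'vehicle1': 'Bike'},                      # 1-vehicle crash
    {},                                        # 0-vehicle entry
], columns=vehicle_cols)
demo['vehicles_involved'] = demo[vehicle_cols].notna().sum(axis=1)
print(demo['vehicles_involved'].tolist())  # → [2, 1, 0]
```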
numvehicles = crashes['vehicles_involved'].value_counts()
numvehicles.sort_index(inplace=True)
numvehicles.plot(kind='bar', color='lightgreen', edgecolor='black')
plt.title('Number of Crashes by Number of Vehicles Involved')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.axhline(numvehicles.mean(), color='red', linestyle='dashdot', label='Average')
plt.text(4.6, numvehicles.mean(), 'Average: {:.0f}'.format(numvehicles.mean()))
plt.legend(loc='upper right')
for i, v in enumerate(numvehicles):
plt.text(i-0.15, v+500, v)
plt.show()
Interesting Correlations¶
The following chart divides how many crashes have a specified number of injuries by borough. The percentages tell us what percent of crashes in that borough had that many injuries.
While the distributions are broadly similar across the boroughs, the slight differences between them are worth noting. For example, Staten Island has the highest proportion of non-injury crashes, while Brooklyn has more crashes with one or two injuries than the other boroughs. Notice also that crashes with more than 10 injuries happened in the Bronx and Staten Island, but not in any of the other boroughs. Curiously, crashes with no specified borough tend to have more injuries than those with one: the percentage of 2-injury crashes in the None group is 6.55%, significantly higher than in any borough.
boroughinjuries = crashes.groupby('borough')['injured'].value_counts(bins= [-1, 0, 1, 2, 4, 6,8, 10, 20], sort=False).fillna(0)
plt.subplots(2, 3, figsize=(20, 10))
for i, b in enumerate(crashes['borough'].unique()):
totalforborough = crashes[crashes['borough'] == b].shape[0]
plt.subplot(2, 3, i+1)
boroughinjuries[b].plot(kind='bar', color='lightblue', edgecolor='black', label='Injured')
plt.title('Number of Crashes by Number of Injuries in {}'.format(b))
plt.xlabel('Number of Injuries')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.gca().set_xticklabels(['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20'])
plt.legend(loc='upper right')
# plotting total crashes per borough
plt.text(4.8, totalforborough*0.3, 'Total Crashes: {}'.format(totalforborough))
for j, v in enumerate(boroughinjuries[b]):
plt.text(j-0.25, v+80 - (60 if b == "STATEN ISLAND" else 0), '{:.2f}%'.format(v/totalforborough*100))
plt.show()
The next chart compares how many injuries happen at different times of day. It divides the dataset into Night (12 AM - 6 AM), Morning (6 AM - 12 PM), Afternoon (12 PM - 6 PM), and Evening (6 PM - 12 AM) and plots each period separately, using the same scheme as the borough chart above.
Notice that many more zero-injury crashes happen in the morning than during the evening or at night. The evening appears to be the time of day when an accident is most likely to result in injury, while the morning is the least. However, there are significantly fewer total crashes at night, probably because there are fewer cars on the road in the middle of the night than during the morning or evening rush hours.
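The 'timeofday' column is another derived field; a minimal sketch of the binning described above, assuming an hour value extracted from the crash time:

```python
import pandas as pd

# Hypothetical derivation: bin the crash hour into the four periods described
# above. pd.cut uses half-open (lo, hi] intervals, so the -1 edge catches hour 0.
hours = pd.Series([3, 9, 15, 21])
timeofday = pd.cut(hours, bins=[-1, 5, 11, 17, 23],
                   labels=['Night', 'Morning', 'Afternoon', 'Evening'])
print(timeofday.tolist())  # → ['Night', 'Morning', 'Afternoon', 'Evening']
```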
timeofdayinjuries = crashes.groupby('timeofday', observed=False)['injured'].value_counts(bins=[-1, 0, 1, 2, 4, 6, 8, 10, 20], sort=False).fillna(0)
plt.subplots(2, 2, figsize=(20, 10))
for i, t in enumerate(crashes['timeofday'].unique()):
totalfortime = crashes[crashes['timeofday'] == t].shape[0]
plt.subplot(2, 2, i+1)
timeofdayinjuries[t].plot(kind='bar', color='lightgreen', edgecolor='black', label='Injured')
plt.title('Number of Crashes by Number of Injuries in the {}'.format(t))
plt.xlabel('Number of Injuries')
plt.ylabel('Number of Crashes')
plt.xticks(rotation=0)
plt.gca().set_xticklabels(['0', '1', '2', '3-4', '5-6', '7-8', '9-10', '11-20'])
plt.legend(loc='upper right')
plt.text(5.8, totalfortime*0.3, 'Total Crashes: {}'.format(totalfortime), ha = 'left')
for j, v in enumerate(timeofdayinjuries[t]):
plt.text(j-0.25, v+80, '{:.2f}%'.format(v/totalfortime*100))
plt.show()
Discussion¶
Question 1:¶
Do the contributing factors, vehicle types, and number of vehicles involved in an incident allow us to predict the number of injuries and deaths in that entry?
Essentially, this question asks: if a crash involves a dangerous vehicle, has an uncommon set of contributing factors, or involves more vehicles than average, will the injury and death counts reflect that? This question aims to predict a quantitative value.
Hypothesis: I expect that crashes with strange vehicles, unusual contributing factors, and high vehicle counts correlate with higher injury and death counts. For one, if a crash involves more vehicles, there should be a higher chance of drivers or pedestrians getting injured. Specifically, I expect that incidents involving trucks, motorcycles, and other "dangerous" vehicles will result in more motorist injuries.
Graphical Analysis¶
The following code creates small, 1,000-entry samples of crashes involving "Dangerous" and "Safe" vehicles, then plots the average number of injuries and fatalities in each sample. The "Dangerous" samples include trucks, e-bikes, and motorcycles, which are generally considered more dangerous to ride than most vehicles; the "Safe" samples include sedans, SUVs, and taxis.
This graph shows that, on average, motorcycle and e-bike crashes involve more injuries and deaths than crashes involving typical vehicles like sedans and SUVs. The ratio of deaths to injuries in the motorcycle column is also significantly higher than in the other samples.
#helper to sample crashes where any of the five vehicle columns matches a keyword
#na=False treats missing vehicle entries as non-matches instead of propagating NaN
def sample_by_vehicle(keyword, n=1000):
    mask = False
    for col in ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']:
        mask = mask | crashes[col].str.contains(keyword, na=False)
    return crashes[mask].sample(n)

#getting samples of data involving trucks, e-bikes, and motorcycles ("Dangerous" vehicles)
truckssample = sample_by_vehicle('Truck')
ebikesample = sample_by_vehicle('E-Bike')
motorcyclesample = sample_by_vehicle('Motorcycle')
#getting samples of data involving sedans, SUVs, and taxis ("Safe" vehicles)
sedansample = sample_by_vehicle('Sedan')
suvsample = sample_by_vehicle('SUV')
taxisample = sample_by_vehicle('Taxi')
#plotting average injury and killed counts for each vehicle type, dividing based on safety
dangerous = {'Truck': truckssample, 'E-Bike': ebikesample, 'Motorcycle': motorcyclesample}
safe = {'Sedan': sedansample, 'SUV': suvsample, 'Taxi': taxisample}
plt.figure(figsize=(10, 5))
for i, (name, sample) in enumerate(dangerous.items()):
    plt.bar(name, sample['injured'].mean(), color='blue', label='Dangerous-Injured' if i == 0 else None)
    plt.bar(name, sample['killed'].mean(), color='red', label='Dangerous-Killed' if i == 0 else None)
for i, (name, sample) in enumerate(safe.items()):
    plt.bar(name, sample['injured'].mean(), color='green', label='Safe-Injured' if i == 0 else None)
    plt.bar(name, sample['killed'].mean(), color='purple', label='Safe-Killed' if i == 0 else None)
plt.title('Average Number of Injuries and Deaths by Vehicle Type')
plt.xlabel('Vehicle Type')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()
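The overlaid per-category bars above can also be drawn from a tidy summary table, letting `DataFrame.plot(kind='bar')` handle the grouping in one call. A sketch of that approach, using hypothetical two- and three-row stand-ins for the real 1,000-row samples:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# hypothetical stand-ins for the real samples
samples = {
    'Truck': pd.DataFrame({'injured': [0, 1, 2], 'killed': [0, 0, 1]}),
    'Sedan': pd.DataFrame({'injured': [0, 0, 1], 'killed': [0, 0, 0]}),
}

# one row per vehicle type, one column per outcome
summary = pd.DataFrame({name: {'injured': s['injured'].mean(),
                               'killed': s['killed'].mean()}
                        for name, s in samples.items()}).T

# grouped bars in a single call instead of one plt.bar() call per bar
summary.plot(kind='bar', color=['blue', 'red'], edgecolor='black')
plt.ylabel('Number of Persons')
plt.show()
```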
This next graph looks at the average injury and death rates for samples of different contributing factors. Crashes involving falling asleep at the wheel or drunk driving are marked as "dangerous", while those involving driver inattention or backing incorrectly are marked as "safe" by comparison. This graph follows the same trend as the last: the more dangerous factors correspond with a higher injury count and, in the case of drunk driving, a higher death count.
#helper to sample crashes where any of the first three factor columns matches a keyword
#na=False treats missing factor entries as non-matches instead of propagating NaN
def sample_by_factor(keyword, n=300):
    mask = False
    for col in ['factor1', 'factor2', 'factor3']:
        mask = mask | crashes[col].str.contains(keyword, na=False)
    return crashes[mask].sample(n)

#plotting average injury and killed counts for some contributing factors, dividing based on safety
dangerous = {'Sleeping': sample_by_factor('Asleep'), 'Alcohol': sample_by_factor('Alcohol')}
safe = {'Inattention': sample_by_factor('Inattention'), 'Backing': sample_by_factor('Backing')}
plt.figure(figsize=(10, 5))
for i, (name, sample) in enumerate(dangerous.items()):
    plt.bar(name, sample['injured'].mean(), color='blue', label='Dangerous-Injured' if i == 0 else None)
    plt.bar(name, sample['killed'].mean(), color='red', label='Dangerous-Killed' if i == 0 else None)
for i, (name, sample) in enumerate(safe.items()):
    plt.bar(name, sample['injured'].mean(), color='green', label='Safe-Injured' if i == 0 else None)
    plt.bar(name, sample['killed'].mean(), color='purple', label='Safe-Killed' if i == 0 else None)
plt.title('Average Number of Injuries and Deaths by Contributing Factor')
plt.xlabel('Contributing Factor')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()
This graph does the same thing as the last, but divides the samples by how many vehicles were involved. Strangely, some samples include zero vehicles, even though this shouldn't be possible. Even stranger, the zero-vehicle sample has a higher proportion of injuries than the one- and two-vehicle samples. Apart from that, the average injury count appears roughly proportional to the number of vehicles: more vehicles involved corresponds to a higher average injury count.
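One plausible explanation for the zero-vehicle rows: if `vehicles_involved` was derived by counting non-missing vehicle-type columns, a record with every vehicle column left blank comes out as zero. A sketch of that derivation on a hypothetical three-row frame (the real derivation is an assumption, not confirmed by the dataset docs):

```python
import pandas as pd

# hypothetical mini-frame mirroring the five vehicle columns
df = pd.DataFrame({
    'vehicle1': ['Sedan', None, 'Truck'],
    'vehicle2': ['SUV', None, None],
    'vehicle3': [None, None, None],
    'vehicle4': [None, None, None],
    'vehicle5': [None, None, None],
})
cols = ['vehicle1', 'vehicle2', 'vehicle3', 'vehicle4', 'vehicle5']
# count non-missing vehicle columns per row; a row with every column blank
# comes out as 0 "vehicles involved", which would explain the odd sample
df['vehicles_involved'] = df[cols].notna().sum(axis=1)
print(df['vehicles_involved'].tolist())  # [2, 0, 1]
```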
#getting samples of different vehicles involved values
zerovehicles = crashes[crashes['vehicles_involved'] == 0].sample(800)
onevehicle = crashes[crashes['vehicles_involved'] == 1].sample(800)
twovehicles = crashes[crashes['vehicles_involved'] == 2].sample(800)
threevehicles = crashes[crashes['vehicles_involved'] == 3].sample(800)
fourvehicles = crashes[crashes['vehicles_involved'] == 4].sample(800)
fivevehicles = crashes[crashes['vehicles_involved'] == 5].sample(800)
#plotting average injury and killed counts for each number of vehicles involved
vehiclesamples = {'0': zerovehicles, '1': onevehicle, '2': twovehicles, '3': threevehicles, '4': fourvehicles, '5': fivevehicles}
plt.figure(figsize=(10, 5))
for i, (n, sample) in enumerate(vehiclesamples.items()):
    plt.bar(n, sample['injured'].mean(), color='blue', label='Injured' if i == 0 else None)
    plt.bar(n, sample['killed'].mean(), color='red', label='Killed' if i == 0 else None)
plt.title('Average Number of Injuries and Deaths by Number of Vehicles Involved')
plt.xlabel('Number of Vehicles')
plt.ylabel('Number of Persons')
plt.xticks(rotation=0)
plt.legend(loc='upper right')
plt.show()
kNN Model Test¶
To fully explore the possibility of a correlation between the circumstances surrounding a crash and its injury count, I trained a kNN model to predict the injury count based only on the factors and vehicles of each crash. The accuracy of the kNN is just under 70%, which isn't perfect, but is high enough to signal a meaningful correlation between these columns.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Step 1: Prepare the data
features = crashes[['factor1', 'factor2', 'factor3', 'vehicle1', 'vehicle2', 'vehicle3', 'vehicle4','vehicle5', 'vehicles_involved']]
target = crashes['injured'] # Modify target to be a single column
# Step 2: Encode categorical variables
encoder = OneHotEncoder()
features_encoded = encoder.fit_transform(features)
# Step 3: Split the data
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, random_state=6602)
# Step 4: Train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Step 5: Evaluate the model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
Accuracy: 0.68625
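Raw accuracy is easier to interpret next to a naive baseline. A sketch of a most-frequent-class baseline using scikit-learn's `DummyClassifier` on toy labels (the real comparison would reuse the `X_train`/`y_train` split above; the class proportions here are invented for illustration):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy labels standing in for the real injury counts
y_train = [0] * 70 + [1] * 20 + [2] * 10
X_train = [[0]] * len(y_train)  # features are ignored by the dummy strategy
y_test = [0] * 7 + [1] * 2 + [2] * 1
X_test = [[0]] * len(y_test)

# always predicts the most common class seen in training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
baseline = accuracy_score(y_test, dummy.predict(X_test))
print('Baseline accuracy:', baseline)  # 0.7 on this toy split
```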
Conclusion:¶
Overall, this data supports the hypothesis that certain contributing factors and vehicle types allow a solid inference of the number of injuries that occurred in a given crash. Specifically, the evidence suggests that incidents involving dangerous factors, unsafe vehicles, and more drivers tend to result in more injuries.
Question 2:¶
Do the number of injuries, the number of deaths, and how many vehicles were involved in an incident allow us to predict the time that a crash occurred?
By "predict the time a crash occurred", I mean predicting which time of day the crash occurred in, based on the categorical 'timeofday' column. The time analysis section showed a correlation between more cars being on the road and more crashes occurring. This question takes that observation further and asks whether higher numbers of injuries and deaths occur at different times.
Hypothesis: I expect that crashes with more vehicles involved and higher injury counts will correlate with times when more cars are likely to be on the road. Specifically, I think the morning and afternoon rush hours will have higher rates of injuries and deaths. Additionally, I expect that incidents at night or in the early morning will involve only one or two cars and have much lower injury and death rates than other times.
Graphical Analysis¶
Below is a matrix of pie charts showing the time-of-day distribution of crashes with specified numbers of injuries. In general, incidents happen more often in the afternoon than at any other time of day. However, the way the proportions shift as the injury count rises reveals interesting trends. For example, low-injury crashes tend to happen more often in the morning than high-injury crashes. Also, the most dangerous crashes happen in the evening, even though comparatively few zero-injury crashes do.
#Making pie charts based on injury data and time of day
injury0sample = crashes[crashes['injured'] == 0].sample(800)
injury1sample = crashes[crashes['injured'] == 1].sample(800)
injury23sample = crashes[crashes['injured'].isin([2, 3])].sample(800)
injury4upsample = crashes[crashes['injured'] >= 4].sample(800)
injurysamples = {'0 Injuries': injury0sample, '1 Injury': injury1sample, '2-3 Injuries': injury23sample, '4 or More Injuries': injury4upsample}
plt.figure(figsize=(10, 10))
for i, (label, sample) in enumerate(injurysamples.items()):
    plt.subplot(2, 2, i+1)
    sample['timeofday'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', colors=['lightblue', 'lightgreen', 'lightcoral', 'lightyellow'])
    plt.title('Time of Day for Crashes with {}'.format(label))
    plt.ylabel('')
    if i == 2:
        plt.legend(loc='lower left')
plt.show()
kNN Model Test¶
This kNN model is a final test of whether there is enough of a correlation to reliably predict the time of day based only on the injury and death counts. Sadly, this kNN performed poorly, with an accuracy of about 24%. That is no better than randomly guessing among the four groups, which would yield roughly 25%. This seems to imply that, although the graphs above show some correlation between injury count and time of day, the link between the two is not strong enough to support reliable prediction.
features = crashes[['injured', 'motorists_injured', 'killed','motorists_killed', 'vehicles_involved']]
target = crashes['timeofday']
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
Accuracy: 0.2356
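One caveat worth noting: kNN is distance-based, and the features here (counts on very different scales) are fed in unscaled, so the largest-range column can dominate the distance metric. A hedged sketch of standardizing inside a `Pipeline` on invented toy rows (the real version would reuse the `X_train`/`y_train` split above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# toy rows standing in for [injured, vehicles_involved]; labels are times of day
X = [[0, 2], [1, 2], [0, 1], [3, 3], [0, 2], [2, 3], [0, 1], [1, 2]]
y = ['Morning', 'Afternoon', 'Night', 'Evening', 'Morning', 'Evening', 'Night', 'Afternoon']

# scaling first keeps large-range features from dominating the distance metric
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[0, 2]]))  # ['Morning'] — two exact matches outvote the third neighbor
```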
Conclusion:¶
Although there is clearly some correlation, with high-injury crashes happening more often in the evening than in the morning, the connection between injury count and time of day does not seem strong enough to support reliable prediction.
Conclusion¶
This dataset revealed many interesting trends and correlations that seemed strange at first but made sense after further investigation. The proportions of car crashes occurring at different times of day, and on different days, were computed and hypothesized about. It was also found that certain circumstances surrounding a crash strongly correlate with how many injuries occur in it. Overall, this experience has opened my eyes to the countless nuances of the car crashes that have happened across the boroughs of New York City.
Dataset : https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data