Introduction:
This dataset was obtained from: https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes. I came across this dataset while exploring the Data.gov website, one of the websites recommended to us for finding datasets.
The dataset contains details on motor vehicle crash events within the city of New York where someone was injured or killed, or where there was at least $1000 worth of damage.
import numpy as np
import pandas as pd
import seaborn as sns
# low_memory=False reads the file in one pass so pandas infers a single dtype per column
crashes = pd.read_csv('https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD', low_memory=False)
Preprocessing:
crashes.columns allows us to view the columns for the dataset.
crashes.columns
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5', 'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'], dtype='object')
I decided to remove some columns, as CRASH DATE and BOROUGH alone give enough information about time and place for the scope of this project. Only CONTRIBUTING FACTOR VEHICLE 1 and VEHICLE TYPE CODE 1 were included for the same reason. I also decided to simplify CRASH DATE to just the year of the crash.
data = crashes[['CRASH DATE', 'BOROUGH', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'VEHICLE TYPE CODE 1']].copy()  # copy so later assignments modify an independent DataFrame
# Keep only the year of each crash
data['CRASH DATE'] = pd.to_datetime(data['CRASH DATE']).dt.year
data.columns
Index(['CRASH DATE', 'BOROUGH', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1', 'VEHICLE TYPE CODE 1'], dtype='object')
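Assigning into a column subset of another DataFrame is a common source of pandas' SettingWithCopyWarning. The following is a minimal sketch of the copy-first pattern together with the year extraction, run on a made-up toy table rather than the real data. The name `toy`, the sample dates, and the MM/DD/YYYY format string are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real table; all values below are made up.
toy = pd.DataFrame({
    "CRASH DATE": ["09/11/2021", "03/26/2022", "12/14/2021"],
    "CRASH TIME": ["2:39", "11:45", "8:13"],
    "BOROUGH": ["BROOKLYN", None, "QUEENS"],
})

# .copy() makes the subset an independent DataFrame, so the assignment
# below does not write into a slice of `toy`.
subset = toy[["CRASH DATE", "BOROUGH"]].copy()

# An explicit format avoids ambiguous day/month parsing
# (assumption: dates are stored as MM/DD/YYYY).
subset["CRASH DATE"] = pd.to_datetime(subset["CRASH DATE"], format="%m/%d/%Y").dt.year

print(subset["CRASH DATE"].tolist())  # [2021, 2022, 2021]
```

Keeping only the year discards month and day, so seasonal patterns can no longer be studied from `subset`; that trade-off is acceptable only if yearly totals are all that is needed.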
Doing a count reveals that many rows are missing a BOROUGH value. We use dropna() to remove these rows. Note that dropna() drops a row if any of its columns is missing, not only BOROUGH. Since we're working with a fairly large dataset, we should have plenty of data left over.
data.count()
data = data.dropna()  # reassign instead of modifying in place
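To see how much each column contributes to the dropped rows, one option is to count missing values per column and compare a full `dropna()` with one restricted to BOROUGH. This is a sketch on a made-up toy table; `toy`, `all_complete`, and `has_borough` are hypothetical names, not part of the analysis above.

```python
import pandas as pd

# Toy table with gaps in two different columns (values are made up).
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", None, "QUEENS", None],
    "NUMBER OF PERSONS INJURED": [0, 1, 0, 2],
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", None, "Sedan"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() with no arguments removes a row if ANY column is missing ...
all_complete = toy.dropna()
# ... while subset= only looks at the listed columns.
has_borough = toy.dropna(subset=["BOROUGH"])

print(len(all_complete), len(has_borough))  # 1 2
```

Which variant is appropriate depends on whether rows with a known borough but an unknown vehicle type should still count toward the borough totals.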
Outlier Detection:
Taking a look at the quartiles reveals that both the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) are 0 for all columns. This means that the interquartile range is also 0, so under an IQR-based rule any nonzero number of casualties is considered an outlier. However, I decided not to remove any rows, as that would effectively remove all information about casualty rates.
casualties = data[data.columns[2:10]]
casualties.quantile([0.25, 0.75])
| | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
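The reasoning about the interquartile range can be checked on a small example. The sketch below applies the common 1.5 × IQR fence rule (an assumption; the text above does not name a specific rule) to a made-up zero-heavy series and shows that every nonzero value ends up flagged.

```python
import pandas as pd

# Made-up zero-inflated counts: 90 zeros and 10 nonzero values.
injured = pd.Series([0] * 90 + [1] * 7 + [2] * 2 + [5])

# Quartiles and interquartile range.
q1, q3 = injured.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = injured[(injured < lower) | (injured > upper)]

# With Q1 = Q3 = 0 both fences collapse to 0, so all 10 nonzero values are flagged.
print(iqr, len(outliers), int((injured > 0).sum()))  # 0.0 10 10
```

For count data dominated by zeros, the fences carry no information beyond "zero versus nonzero", which is why dropping the flagged rows would discard every casualty.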
Statistical Summaries:
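If we do a statistical summary of the quantitative columns, we can see that it was a good idea not to remove any rows, as there is valuable information about the rate of injury/death: the means give the average number of casualties per collision, for example about 0.29 persons injured and about 0.0013 persons killed.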
casualties.describe()
| | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED |
|---|---|---|---|---|---|---|---|---|
| count | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 | 1.416756e+06 |
| mean | 2.902448e-01 | 1.285331e-03 | 6.099568e-02 | 7.016028e-04 | 3.034609e-02 | 1.199924e-04 | 1.951656e-01 | 4.425603e-04 |
| std | 6.686598e-01 | 3.774715e-02 | 2.534821e-01 | 2.711072e-02 | 1.733918e-01 | 1.101770e-02 | 6.205620e-01 | 2.277269e-02 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| max | 4.300000e+01 | 8.000000e+00 | 2.700000e+01 | 6.000000e+00 | 4.000000e+00 | 2.000000e+00 | 4.300000e+01 | 5.000000e+00 |
I decided to use violin plots rather than box plots to visualize the data, as they give more information about the data's density. As expected, every nonzero number of casualties is treated as an outlier, which results in rather interesting plots.
injuries = casualties[casualties.columns[0:8:2]]
sns.catplot(
data=injuries, orient='h', kind='violin', aspect=2
)
[Figure: horizontal violin plots of the four injury-count columns]
deaths = casualties[casualties.columns[1:9:2]]
sns.catplot(
data=deaths, orient='h', kind='violin', aspect=2
)
[Figure: horizontal violin plots of the four death-count columns]
As for the categorical columns, I decided to visualize them by looking at the number of collisions per categorical value.
years = data[["CRASH DATE"]]
sns.displot(data=years, x='CRASH DATE', kind='hist', aspect=1.8)
[Figure: histogram of the number of collisions per year (CRASH DATE)]
borough = data[["BOROUGH"]]
sns.displot(data=borough, x='BOROUGH', kind='hist', aspect=1.8)
[Figure: histogram of the number of collisions per BOROUGH]
Unfortunately, I was unable to find a way to make suitable plots for VEHICLE TYPE CODE 1 and CONTRIBUTING FACTOR VEHICLE 1, as they have many more distinct values. Some of the values also overlap or are inconsistently labeled, such as "Sedan" and "4 dr sedan", or "Taxi" and "TAXI", which makes per-value counts harder to interpret. I decided to opt for just a numerical summary instead.
vehicles = data[["VEHICLE TYPE CODE 1"]].value_counts().head(10).reset_index()
vehicles
#sns.displot(data=vehicles, x='VEHICLE TYPE CODE 1', y='count', aspect=1.8)
| | VEHICLE TYPE CODE 1 | count |
|---|---|---|
| 0 | Sedan | 371823 |
| 1 | PASSENGER VEHICLE | 309834 |
| 2 | Station Wagon/Sport Utility Vehicle | 289199 |
| 3 | SPORT UTILITY / STATION WAGON | 133934 |
| 4 | Taxi | 34000 |
| 5 | TAXI | 28038 |
| 6 | 4 dr sedan | 26935 |
| 7 | Pick-up Truck | 21854 |
| 8 | VAN | 20503 |
| 9 | OTHER | 18072 |
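Several of the labels in this table differ only in capitalization or wording. One possible cleanup, sketched here on a handful of labels taken from the table, is to normalize case and whitespace and then merge near-duplicates with an explicit mapping. The `merge_map` dictionary is hypothetical: deciding which labels really describe the same vehicle type is a judgment call that the data itself does not settle.

```python
import pandas as pd

# A few raw labels as they appear in the table; one entry per label,
# so the counts below are NOT the real collision counts.
raw = pd.Series([
    "Sedan", "4 dr sedan", "Taxi", "TAXI", "PASSENGER VEHICLE",
    "Station Wagon/Sport Utility Vehicle", "SPORT UTILITY / STATION WAGON",
])

# Hypothetical mapping of near-duplicate labels (keys are already upper-cased).
merge_map = {
    "4 DR SEDAN": "SEDAN",
    "SPORT UTILITY / STATION WAGON": "STATION WAGON/SPORT UTILITY VEHICLE",
}

# Trim whitespace, unify case, then apply the exact-match mapping.
normalized = raw.str.strip().str.upper().replace(merge_map)

print(normalized.value_counts())
```

After this step the seven raw labels collapse into four categories, which would make a bar chart of the most common vehicle types more readable than one built on the raw values.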
factors = data[["CONTRIBUTING FACTOR VEHICLE 1"]].value_counts().head(10)
factors
CONTRIBUTING FACTOR VEHICLE 1
Unspecified                       536161
Driver Inattention/Distraction    272516
Failure to Yield Right-of-Way      89551
Backing Unsafely                   59847
Following Too Closely              47684
Other Vehicular                    45969
Passing Too Closely                36986
Passing or Lane Usage Improper     35860
Turning Improperly                 34129
Fatigued/Drowsy                    25662
Name: count, dtype: int64
For correlations, I decided to first look at the relationship between the number of persons injured and the number of persons killed.
casualties[["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]].corr()
| | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED |
|---|---|---|
| NUMBER OF PERSONS INJURED | 1.000000 | 0.015617 |
| NUMBER OF PERSONS KILLED | 0.015617 | 1.000000 |
One would expect the 2 variables to be positively correlated, since more injuries would plausibly go along with more deaths. However, there didn't seem to be any meaningful correlation. I suspected this had something to do with the large number of rows with 0 injuries and 0 deaths skewing the results, so I decided to remove those rows to see what happens.
one_casualty = casualties.loc[(casualties['NUMBER OF PERSONS INJURED'] > 0) | (casualties['NUMBER OF PERSONS KILLED'] > 0)]
one_casualty[["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]].corr()
| | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED |
|---|---|---|
| NUMBER OF PERSONS INJURED | 1.000000 | -0.063997 |
| NUMBER OF PERSONS KILLED | -0.063997 | 1.000000 |
Unexpectedly, the correlation went from positive to negative, though it was still too small to conclude that a correlation actually exists. Graphically, a slight negative correlation seems plausible.
sns.relplot(data=one_casualty, x="NUMBER OF PERSONS INJURED", y="NUMBER OF PERSONS KILLED", kind='line')
[Figure: line plot of NUMBER OF PERSONS KILLED against NUMBER OF PERSONS INJURED]
Going from this, I decided to look into relationships between the other columns within casualties, specifically whether there was a correlation between the injury counts of 2 types of people, such as pedestrians and motorists.
casualties[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF MOTORIST INJURED"]].corr()
| | NUMBER OF PEDESTRIANS INJURED | NUMBER OF MOTORIST INJURED |
|---|---|---|
| NUMBER OF PEDESTRIANS INJURED | 1.000000 | -0.063352 |
| NUMBER OF MOTORIST INJURED | -0.063352 | 1.000000 |
Once again, I suspected that rows with zeros were skewing the results.
one_casualty2 = casualties.loc[(casualties['NUMBER OF PEDESTRIANS INJURED'] > 0) | (casualties['NUMBER OF MOTORIST INJURED'] > 0)]
one_casualty2[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF MOTORIST INJURED"]].corr()
| | NUMBER OF PEDESTRIANS INJURED | NUMBER OF MOTORIST INJURED |
|---|---|---|
| NUMBER OF PEDESTRIANS INJURED | 1.00000 | -0.60673 |
| NUMBER OF MOTORIST INJURED | -0.60673 | 1.00000 |
We end up getting a moderate negative correlation. This makes sense, as most collisions involve just 1 other party (pedestrians or motorists in this case), which stops the vehicle, rather than multiple parties at once.
sns.relplot(data=one_casualty2, x="NUMBER OF PEDESTRIANS INJURED", y="NUMBER OF MOTORIST INJURED", kind="line")
[Figure: line plot of NUMBER OF MOTORIST INJURED against NUMBER OF PEDESTRIANS INJURED]
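The claim that collisions usually injure only one type of party can be checked more directly with a contingency table than with a correlation coefficient. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in, not the real data) and cross-tabulates whether any pedestrian was injured against whether any motorist was injured.

```python
import pandas as pd

# Made-up injury counts for eight collisions.
toy = pd.DataFrame({
    "NUMBER OF PEDESTRIANS INJURED": [1, 0, 0, 1, 0, 2, 0, 1],
    "NUMBER OF MOTORIST INJURED":    [0, 1, 2, 0, 3, 0, 1, 1],
})

# Rows: was any pedestrian injured? Columns: was any motorist injured?
table = pd.crosstab(
    toy["NUMBER OF PEDESTRIANS INJURED"] > 0,
    toy["NUMBER OF MOTORIST INJURED"] > 0,
)
print(table)

# Number of collisions in which both groups were injured.
both = table.loc[True, True]
print(both)  # 1
```

If the (True, True) cell is small relative to the two mixed cells in the real data, that would support the one-party explanation without relying on the sign of a correlation.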
To confirm these results, I wanted to look into the relationship between pedestrians and cyclists as well.
casualties[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF CYCLIST INJURED"]].corr()
| | NUMBER OF PEDESTRIANS INJURED | NUMBER OF CYCLIST INJURED |
|---|---|---|
| NUMBER OF PEDESTRIANS INJURED | 1.000000 | -0.037746 |
| NUMBER OF CYCLIST INJURED | -0.037746 | 1.000000 |
one_casualty2 = casualties.loc[(casualties['NUMBER OF PEDESTRIANS INJURED'] > 0) | (casualties['NUMBER OF CYCLIST INJURED'] > 0)]
one_casualty2[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF CYCLIST INJURED"]].corr()
| | NUMBER OF PEDESTRIANS INJURED | NUMBER OF CYCLIST INJURED |
|---|---|---|
| NUMBER OF PEDESTRIANS INJURED | 1.000000 | -0.902039 |
| NUMBER OF CYCLIST INJURED | -0.902039 | 1.000000 |
In this instance, we were able to observe a strong negative correlation between the 2 variables.
sns.relplot(data=one_casualty2, x="NUMBER OF PEDESTRIANS INJURED", y="NUMBER OF CYCLIST INJURED", kind="line")
[Figure: line plot of NUMBER OF CYCLIST INJURED against NUMBER OF PEDESTRIANS INJURED]
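Overall, it's important to keep in mind that correlation is not causation. More pedestrians injured does not mean fewer cyclists or motorists injured. More likely, just one party is involved in any one accident. The filtering step also plays a part: once rows where both counts are zero are removed, every remaining row has at least one nonzero count, so a zero in one column guarantees a nonzero value in the other, which by itself pushes the correlation in the negative direction.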
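How much of a negative correlation can come from the zero-row filtering alone can be illustrated with simulated data. The sketch below draws two independent sparse count columns (the Poisson rates 0.06 and 0.03 are made-up assumptions, not fitted to the crash data), then applies the same "at least one nonzero" filter used above and compares the correlation before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# Two INDEPENDENT sparse count columns; rates are illustrative assumptions.
sim = pd.DataFrame({
    "a": rng.poisson(0.06, n),
    "b": rng.poisson(0.03, n),
})

# Correlation on all rows: close to zero, as expected for independent columns.
corr_all = sim["a"].corr(sim["b"])

# Same filter as in the analysis: keep rows where at least one count is nonzero.
kept = sim[(sim["a"] > 0) | (sim["b"] > 0)]
corr_kept = kept["a"].corr(kept["b"])

print(round(corr_all, 3), round(corr_kept, 3))
```

In this simulation the filtered correlation comes out strongly negative even though the columns were generated independently, so a large negative value after filtering is not, on its own, evidence of a relationship between the two injury counts.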
Discussion:
Question 1: Can the number of pedestrians, cyclists, and motorists injured or killed from a collision predict the type of the motor vehicle?
Question 2: Can vehicle type, contributing cause, and borough predict the number of persons injured/killed from a collision?