Introduction
The first dataset, death, includes 2012 and 2013 death counts in the state of California by ZIP code and gender. It was sourced from the data.gov data catalog, and it is available to the public at https://data.world/ellianas/untitledproject-3-19-2024/workspace/project-summary?agentid=health&datasetid=death-by-zip-code-by-gender
The second dataset, income, includes 2012 and 2013 incomes for several states by ZIP code. It was sourced from the IRS Statistics of Income, and it is available to the public at https://data.world/jonloyens/irs-income-by-zip-code/workspace/project-summary?agentid=jonloyens&datasetid=irs-income-by-zip-code
The last dataset, city, includes the city name, county name, and population size for all California ZIP codes. It was sourced from The World Population Review, and it is available to the public at https://worldpopulationreview.com/zips/california
Below, we import all three datasets.
import pandas as pd
death = pd.read_csv("Death_by_ZIP_Code_by_Gender__2012_-_2013.csv")
income = pd.read_csv("IRSIncomeByZipCode.csv")
city = pd.read_csv("ca-zip-codes-data.csv")
Reprocessing
We merge death, income, and city by ZIPCODE into a single dataframe.
merge = pd.merge(death, income, on='ZIPCODE')
city.rename(columns={'zip': 'ZIPCODE'}, inplace=True)
df = pd.merge(merge, city, on='ZIPCODE')
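Because `pd.merge` performs an inner join by default, only ZIP codes present in both frames survive each merge. The sketch below uses small invented frames (the ZIP codes and values are illustrative, not taken from the datasets) to show how an outer join with `indicator=True` could be used to audit which keys an inner join would drop.

```python
import pandas as pd

# Toy frames standing in for `death` and `income`; all values are invented.
death_toy = pd.DataFrame({"ZIPCODE": [90001, 90002], "COUNT": [102, 109]})
income_toy = pd.DataFrame({"ZIPCODE": [90001, 10001], "Avg AGI": [25.2, 80.1]})

# Default inner join: only ZIP codes found in both frames are kept.
inner = pd.merge(death_toy, income_toy, on="ZIPCODE")

# Outer join with an indicator column to list the unmatched keys.
audit = pd.merge(death_toy, income_toy, on="ZIPCODE", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(inner)
print(unmatched[["ZIPCODE", "_merge"]])
```

In this toy case, only ZIP code 90001 appears in the inner join; the other two keys are each present in just one frame.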
We rename the columns in the dataframe so that they all follow the same format, which makes their names simpler to use in later steps. Then, we drop all missing values, keep only the 2013 rows for females, and drop the columns that are not needed. We add the column Percent death count, so we can see the number of deaths relative to the population. Finally, we reorder the columns of the dataframe to make it easier to read.
df.rename(columns={'YEAR': 'Year', 'ZIPCODE': 'Zipcode', 'GENDER': 'Gender', 'COUNT': 'Death count', 'STATE': 'State', 'population': 'Population', 'city': 'City', 'county': 'County','Number of returns':'Number of tax returns'}, inplace=True)
df.dropna(inplace=True)
df = df[(df['Year'] == 2013) & (df['Gender'] == 'Female')]
df.drop(['Zipcode','Year','Location 1','State','Adjusted gross income (AGI)','Number of returns with total income','Total income amount','Number of returns with taxable income','Taxable income amount','Avg taxable income','City','Gender','Number of tax returns'],axis=1,inplace=True)
df['Percent death count'] = df['Death count'] / df['Population'] * 100
df = df[['County','Population','Death count','Percent death count','Avg AGI','Avg total income']]
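As a quick sanity check on the formula for Percent death count, the sketch below recomputes the value for a single row. The population (57652) and death count (102) are copied from the first row of the `df.head()` output shown later in this section; the frame name `toy` is only for illustration.

```python
import pandas as pd

# One row copied from the head() output of the final dataframe.
toy = pd.DataFrame({"Population": [57652], "Death count": [102.0]})

# Deaths expressed as a percentage of the ZIP code's population.
toy["Percent death count"] = toy["Death count"] / toy["Population"] * 100

print(toy.round(6))
```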
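Here, we multiply Avg AGI and Avg total income by 1000. The source values appear to be recorded in thousands of dollars, so this rescaling expresses both columns in dollars; it changes the stored values rather than only the display notation.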
df['Avg AGI'] = df['Avg AGI'] * 1000
df['Avg total income'] = df['Avg total income'] * 1000
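If the aim is only to control how floats are printed (for example, to avoid scientific notation), pandas offers the `display.float_format` option, which leaves the data itself untouched. A minimal sketch with an invented value:

```python
import pandas as pd

# Invented value, used only to illustrate display formatting.
toy = pd.DataFrame({"Avg AGI": [7.445815e4]})

# Temporarily render floats with thousands separators and two decimals;
# the underlying data is left unchanged.
with pd.option_context("display.float_format", "{:,.2f}".format):
    rendered = toy.to_string()

print(rendered)
```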
This is the final dataframe after reprocessing. It contains death, population, and income data for females. This data only contains information from 2013 for the state of California.
df.head()
| | County | Population | Death count | Percent death count | Avg AGI | Avg total income |
---|---|---|---|---|---|---|
2827 | Los Angeles | 57652 | 102.0 | 0.176924 | 25152.93276 | 25358.70291 |
2829 | Los Angeles | 53108 | 109.0 | 0.205242 | 24410.49578 | 24637.55274 |
2831 | Los Angeles | 75024 | 134.0 | 0.178610 | 23404.62185 | 23638.19710 |
2833 | Los Angeles | 58833 | 118.0 | 0.200568 | 59128.94737 | 60286.22076 |
2835 | Los Angeles | 37754 | 79.0 | 0.209249 | 45821.12767 | 46486.71419 |
We want to perform an outlier analysis to identify any unusual observations. Within each county, a death count is flagged as an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
import seaborn as sns
def is_outlier(x):
    # First and third quartiles of the group
    Q25, Q75 = x.quantile([.25,.75])
    IQR = Q75 - Q25
    # True where a value falls outside the 1.5 * IQR fences
    return (x < Q25 - 1.5*IQR) | (x > Q75 + 1.5*IQR)
by_event = df.groupby("County")
outliers = by_event["Death count"].transform(is_outlier)
(df.loc[outliers,"County"].value_counts()).to_frame()
County | count |
---|---|
Los Angeles | 4 |
Shasta | 4 |
Humboldt | 4 |
Mendocino | 3 |
Tulare | 3 |
Siskiyou | 2 |
El Dorado | 2 |
Amador | 2 |
Merced | 2 |
Orange | 2 |
Lake | 2 |
Yuba | 2 |
Plumas | 1 |
Nevada | 1 |
Tehama | 1 |
Sacramento | 1 |
Del Norte | 1 |
Tuolumne | 1 |
Stanislaus | 1 |
Mariposa | 1 |
Santa Cruz | 1 |
San Benito | 1 |
Marin | 1 |
Contra Costa | 1 |
Napa | 1 |
Alameda | 1 |
Inyo | 1 |
Kings | 1 |
Lassen | 1 |
by_county_outliers = df.loc[outliers,"County"].value_counts().sum()
print(f"Number of death count outliers by county = {by_county_outliers}")
total_num_deathcount = df['Death count'].shape[0]
print(f"Total number of death count observations = {total_num_deathcount}")
percent_outliers = (by_county_outliers / total_num_deathcount) * 100
print(f"Percent of death count outliers when grouped by county = {percent_outliers:.2f}%")
Number of death count outliers by county = 49
Total number of death count observations = 1412
Percent of death count outliers when grouped by county = 3.47%
Based on the outlier analysis, we can see how many Death count outliers there are in each county in California. We can also conclude that 3.47% of the death count observations in California are outliers when grouped by county.
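To make the 1.5 × IQR rule concrete, here is a minimal sketch on invented data with two hypothetical groups, "A" and "B". It mirrors the grouped `transform` approach used above, but the numbers are made up for illustration.

```python
import pandas as pd

# Invented counts for two hypothetical groups "A" and "B".
toy = pd.DataFrame({
    "County": ["A"] * 5 + ["B"] * 5,
    "Death count": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

def is_outlier(x):
    """Flag values more than 1.5 * IQR outside the quartiles of x."""
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)

# Quartiles and fences are computed separately within each group.
mask = toy.groupby("County")["Death count"].transform(is_outlier).astype(bool)

n_outliers = int(mask.sum())
percent = n_outliers / len(toy) * 100
print(toy[mask])
print(f"{n_outliers} of {len(toy)} observations flagged ({percent:.2f}%)")
```

In group A the quartiles are 2 and 4, so the IQR is 2 and the upper fence is 4 + 1.5 × 2 = 7; the value 100 is the only observation flagged. Because the fences are computed per group, a value that is ordinary in one county can be an outlier in another.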
Summary Data Analysis
The summary statistics for all numerical columns of the dataframe are as follows:
df[['County','Population','Death count','Percent death count','Avg AGI','Avg total income']].describe()
| | Population | Death count | Percent death count | Avg AGI | Avg total income |
---|---|---|---|---|---|
count | 1412.000000 | 1412.000000 | 1412.000000 | 1.412000e+03 | 1.412000e+03 |
mean | 27627.461048 | 84.987252 | 0.368747 | 7.445815e+04 | 7.574361e+04 |
std | 22090.010399 | 69.191318 | 0.320635 | 7.132591e+04 | 7.237741e+04 |
min | 33.000000 | 1.000000 | 0.023149 | 2.040964e+04 | 2.060633e+04 |
25% | 7298.750000 | 23.000000 | 0.236887 | 4.276960e+04 | 4.350000e+04 |
50% | 26031.500000 | 73.500000 | 0.321837 | 5.697303e+04 | 5.801139e+04 |
75% | 41666.250000 | 130.000000 | 0.425047 | 7.997999e+04 | 8.138399e+04 |
max | 106042.000000 | 420.000000 | 9.090909 | 1.149294e+06 | 1.160494e+06 |
The number of observations per county is as follows:
df['County'].value_counts().to_frame()
County | count |
---|---|
Los Angeles | 272 |
San Diego | 93 |
Orange | 84 |
Riverside | 67 |
San Bernardino | 66 |
Santa Clara | 54 |
Sacramento | 51 |
Alameda | 47 |
Fresno | 44 |
Kern | 39 |
Contra Costa | 36 |
San Joaquin | 28 |
Sonoma | 27 |
San Francisco | 26 |
San Mateo | 25 |
Ventura | 23 |
Stanislaus | 22 |
Monterey | 21 |
Humboldt | 21 |
Tulare | 20 |
San Luis Obispo | 19 |
Shasta | 18 |
Merced | 17 |
Placer | 17 |
Santa Barbara | 17 |
El Dorado | 16 |
Mendocino | 15 |
Marin | 15 |
Butte | 15 |
Siskiyou | 15 |
Santa Cruz | 12 |
Calaveras | 12 |
Solano | 11 |
Yolo | 10 |
Lake | 10 |
Imperial | 10 |
Madera | 10 |
Yuba | 9 |
Tuolumne | 8 |
Nevada | 7 |
Lassen | 7 |
Napa | 7 |
Amador | 7 |
Sutter | 7 |
Plumas | 7 |
Trinity | 6 |
Kings | 6 |
Tehama | 5 |
Inyo | 4 |
Colusa | 4 |
Mariposa | 4 |
San Benito | 4 |
Del Norte | 4 |
Glenn | 3 |
Mono | 3 |
Modoc | 3 |
Sierra | 1 |
Alpine | 1 |
We want to visualize the relationships between the numerical columns. We do so with a pairplot, which shows all pairwise scatter plots at once, with points colored by county.
import matplotlib.pyplot as plt
sns.pairplot(data=df, hue='County')
To further explore the dataset, we make box plots of Death count, Population, and Avg total income against County.
import matplotlib.pyplot as plt
sns.catplot(data=df, x="Death count", y="County", kind="box", height=15)
sns.catplot(data=df, x="Population", y="County", kind="box", height=15)
sns.catplot(data=df, x="Avg total income", y="County", kind="box", height=15)
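The quartiles that each box summarizes can also be tabulated directly with a grouped `quantile` call. The sketch below uses invented data for two hypothetical counties; the exact box edges drawn by a plotting library may differ slightly depending on how it interpolates quantiles.

```python
import pandas as pd

# Invented data for two hypothetical counties.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B", "B"],
    "Death count": [10.0, 20.0, 30.0, 5.0, 15.0, 45.0],
})

# Quartiles per county: the box edges (25%, 75%) and the median line (50%).
box_stats = (
    toy.groupby("County")["Death count"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats["IQR"] = box_stats[0.75] - box_stats[0.25]
print(box_stats)
```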
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = df[['Population','Death count','Percent death count','Avg AGI','Avg total income']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
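`DataFrame.corr()` computes Pearson correlation coefficients by default, so each heatmap cell is one such coefficient. As a sketch of what a single cell represents, the example below reproduces one coefficient by hand; the column names mirror the dataframe, but the numbers are invented.

```python
import pandas as pd

# Invented values; only the column names match the real dataframe.
toy = pd.DataFrame({
    "Population": [1000.0, 2000.0, 3000.0, 4000.0],
    "Death count": [3.0, 5.0, 10.0, 12.0],
})

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics.
x, y = toy["Population"], toy["Death count"]
r_manual = x.cov(y) / (x.std() * y.std())

# The same coefficient as reported by DataFrame.corr().
r_pandas = toy.corr().loc["Population", "Death count"]
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```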
sns.relplot(data=df, x="Death count", y="Population")