Introduction

The first dataset, death, includes 2012 and 2013 death counts in the state of California by ZIP code and gender. It was sourced from the data.gov data catalog, and it is available to the public at https://data.world/ellianas/untitledproject-3-19-2024/workspace/project-summary?agentid=health&datasetid=death-by-zip-code-by-gender

The second dataset, income, includes 2012 and 2013 incomes for several states by ZIP code. It was sourced from the IRS Statistics of Income, and it is available to the public at https://data.world/jonloyens/irs-income-by-zip-code/workspace/project-summary?agentid=jonloyens&datasetid=irs-income-by-zip-code

The last dataset, city, includes the city name, county name, and population size for all California ZIP codes. It was sourced from The World Population Review, and it is available to the public at https://worldpopulationreview.com/zips/california

Below, we import all three datasets.

In [1]:
import pandas as pd

death = pd.read_csv("Death_by_ZIP_Code_by_Gender__2012_-_2013.csv")
income = pd.read_csv("IRSIncomeByZipCode.csv")
city = pd.read_csv("ca-zip-codes-data.csv")

Preprocessing

We merge death, income, and city by ZIPCODE into a single dataframe.

In [2]:
merge = pd.merge(death, income, on='ZIPCODE')
city.rename(columns={'zip': 'ZIPCODE'}, inplace=True)
df = pd.merge(merge, city, on='ZIPCODE')
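Because pd.merge performs an inner join by default, ZIP codes missing from any of the three tables are silently dropped. The following is a minimal sketch, using made-up toy tables (death_toy, income_toy, and city_toy are hypothetical names, not the real files), of how the same two-step merge behaves and how the indicator option can be used to count rows that find no match.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are made up for illustration.
death_toy = pd.DataFrame({"ZIPCODE": [90001, 90002, 90003], "COUNT": [10, 12, 7]})
income_toy = pd.DataFrame({"ZIPCODE": [90001, 90002, 10001], "Avg AGI": [25.1, 24.4, 80.0]})
city_toy = pd.DataFrame({"zip": [90001, 90002], "county": ["Los Angeles", "Los Angeles"]})

# Same two-step inner merge as above, written as a chain without inplace renaming.
merged_toy = (
    death_toy
    .merge(income_toy, on="ZIPCODE", how="inner")
    .merge(city_toy.rename(columns={"zip": "ZIPCODE"}), on="ZIPCODE", how="inner")
)

# A left merge with indicator=True shows which death rows have no income match.
check = death_toy.merge(income_toy, on="ZIPCODE", how="left", indicator=True)
n_unmatched = int((check["_merge"] == "left_only").sum())

print(merged_toy)
print(f"Death rows without an income match: {n_unmatched}")
```

Running a check like this before the real merge would show how many ZIP codes are lost at each step.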

We rename the columns of the dataframe so that they share a consistent format, which makes them easier to reference in later steps. Then we drop rows with missing values, keep only the 2013 records for females, and drop the columns that are not needed. We add the column Percent death count, which expresses the death count as a percentage of the population. Finally, we reorder the columns to make the dataframe easier to read.

In [3]:
df.rename(columns={'YEAR': 'Year', 'ZIPCODE': 'Zipcode', 'GENDER': 'Gender', 'COUNT': 'Death count', 'STATE': 'State', 'population': 'Population', 'city': 'City', 'county': 'County','Number of returns':'Number of tax returns'}, inplace=True)

df.dropna(inplace=True)
df = df[(df['Year'] == 2013) & (df['Gender'] == 'Female')]
df.drop(['Zipcode','Year','Location 1','State','Adjusted gross income (AGI)','Number of returns with total income','Total income amount','Number of returns with taxable income','Taxable income amount','Avg taxable income','City','Gender','Number of tax returns'],axis=1,inplace=True)

df['Percent death count'] = df['Death count'] / df['Population'] * 100

df = df[['County','Population','Death count','Percent death count','Avg AGI','Avg total income']]

Here, we multiply Avg AGI and Avg total income by 1000. This rescales the values, which appear to be recorded in thousands of dollars, to dollars; it does not change the notation in which pandas displays them.

In [4]:
df['Avg AGI'] = df['Avg AGI'] * 1000
df['Avg total income'] = df['Avg total income'] * 1000
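As a quick check of the two derived quantities, the sketch below recomputes Percent death count and the rescaled Avg AGI for two rows of the final dataframe (populations 57652 and 53108). The name toy is hypothetical, and the pre-scaling Avg AGI values are an assumption obtained by dividing the final values by 1000.

```python
import pandas as pd

# Two rows taken from the final dataframe; the unscaled Avg AGI values are
# inferred (assumption) by dividing the final values by 1000.
toy = pd.DataFrame({
    "Population": [57652, 53108],
    "Death count": [102.0, 109.0],
    "Avg AGI": [25.15293276, 24.41049578],
})

# Deaths as a percentage of the population.
toy["Percent death count"] = toy["Death count"] / toy["Population"] * 100

# Rescale by a factor of 1000, as in the cell above.
toy["Avg AGI"] = toy["Avg AGI"] * 1000

print(toy)
```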

This is the final dataframe after preprocessing. It contains death, population, and income data for females in the state of California in 2013 only.

In [5]:
df.head()
Out[5]:
      County       Population  Death count  Percent death count      Avg AGI  Avg total income
2827  Los Angeles       57652        102.0             0.176924  25152.93276       25358.70291
2829  Los Angeles       53108        109.0             0.205242  24410.49578       24637.55274
2831  Los Angeles       75024        134.0             0.178610  23404.62185       23638.19710
2833  Los Angeles       58833        118.0             0.200568  59128.94737       60286.22076
2835  Los Angeles       37754         79.0             0.209249  45821.12767       46486.71419

We want to perform an outlier analysis to identify unusual observations. Within each county, an observation counts as an outlier if its Death count lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

In [6]:
import seaborn as sns

def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    IQR = Q75 - Q25
    return (x < Q25 - 1.5*IQR) |  (x > Q75 + 1.5*IQR)

by_event = df.groupby("County")
outliers = by_event["Death count"].transform(is_outlier)
(df.loc[outliers,"County"].value_counts()).to_frame()
Out[6]:
count
County
Los Angeles 4
Shasta 4
Humboldt 4
Mendocino 3
Tulare 3
Siskiyou 2
El Dorado 2
Amador 2
Merced 2
Orange 2
Lake 2
Yuba 2
Plumas 1
Nevada 1
Tehama 1
Sacramento 1
Del Norte 1
Tuolumne 1
Stanislaus 1
Mariposa 1
Santa Cruz 1
San Benito 1
Marin 1
Contra Costa 1
Napa 1
Alameda 1
Inyo 1
Kings 1
Lassen 1
In [7]:
by_county_outliers = df.loc[outliers,"County"].value_counts().sum()
print(f"Number of death count outliers by county = {by_county_outliers}")
total_num_deathcount = df['Death count'].shape[0]
print(f"Total number of death count observations = {total_num_deathcount}")
percent_outliers = (by_county_outliers / total_num_deathcount) * 100
print(f"Percent of death count outliers when grouped by county = {percent_outliers:.2f}%")
Number of death count outliers by county = 49
Total number of death count observations = 1412
Percent of death count outliers when grouped by county = 3.47%

Based on the outlier analysis, we can see how many Death count outliers there are in each California county. We can also conclude that 3.47% of the death count observations are outliers when grouped by county.
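To make the 1.5 × IQR rule concrete, here is a small worked sketch on made-up data. The counties "A" and "B" and their values are hypothetical; the function mirrors the is_outlier definition used above.

```python
import pandas as pd

def is_outlier(x):
    # Flag values more than 1.5 * IQR outside the quartiles of the group.
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)

# Made-up data: county A has one extreme value, county B has none.
toy = pd.DataFrame({
    "County": ["A"] * 5 + ["B"] * 5,
    "Death count": [10, 12, 11, 13, 100, 50, 52, 51, 49, 53],
})

# transform applies the rule separately within each county.
flags = toy.groupby("County")["Death count"].transform(is_outlier)

print(toy[flags])
print(f"Share of outliers: {flags.mean() * 100:.2f}%")
```

In county A the quartiles are 11 and 13, so the fences are 8 and 16 and only the value 100 is flagged. Because the rule is applied per county, a value flagged here is unusual relative to its own county, not necessarily relative to the whole state.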

Summary Data Analysis

The summary statistics for all numerical columns of the dataframe are as follows:

In [8]:
df[['County','Population','Death count','Percent death count','Avg AGI','Avg total income']].describe()
Out[8]:
          Population  Death count  Percent death count       Avg AGI  Avg total income
count    1412.000000  1412.000000          1412.000000  1.412000e+03      1.412000e+03
mean    27627.461048    84.987252             0.368747  7.445815e+04      7.574361e+04
std     22090.010399    69.191318             0.320635  7.132591e+04      7.237741e+04
min        33.000000     1.000000             0.023149  2.040964e+04      2.060633e+04
25%      7298.750000    23.000000             0.236887  4.276960e+04      4.350000e+04
50%     26031.500000    73.500000             0.321837  5.697303e+04      5.801139e+04
75%     41666.250000   130.000000             0.425047  7.997999e+04      8.138399e+04
max    106042.000000   420.000000             9.090909  1.149294e+06      1.160494e+06
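The income columns in this summary are printed in scientific notation. If standard notation is preferred, one option is the pandas display setting display.float_format, which changes only how numbers are printed and leaves the stored values untouched. The sketch below is an illustration on a hypothetical two-row frame (summary_toy) that reuses the mean and max of Avg AGI from the table.

```python
import pandas as pd

# Hypothetical frame reusing two Avg AGI statistics from the summary table.
summary_toy = pd.DataFrame(
    {"Avg AGI": [7.445815e+04, 1.149294e+06]},
    index=["mean", "max"],
)

# Temporarily print floats with thousands separators and two decimals.
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = summary_toy.to_string()

print(formatted)
```

Using option_context keeps the change local to the with block; pd.set_option with the same arguments would apply it for the rest of the session.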

The number of observations per county is as follows:

In [9]:
df['County'].value_counts().to_frame()
Out[9]:
count
County
Los Angeles 272
San Diego 93
Orange 84
Riverside 67
San Bernardino 66
Santa Clara 54
Sacramento 51
Alameda 47
Fresno 44
Kern 39
Contra Costa 36
San Joaquin 28
Sonoma 27
San Francisco 26
San Mateo 25
Ventura 23
Stanislaus 22
Monterey 21
Humboldt 21
Tulare 20
San Luis Obispo 19
Shasta 18
Merced 17
Placer 17
Santa Barbara 17
El Dorado 16
Mendocino 15
Marin 15
Butte 15
Siskiyou 15
Santa Cruz 12
Calaveras 12
Solano 11
Yolo 10
Lake 10
Imperial 10
Madera 10
Yuba 9
Tuolumne 8
Nevada 7
Lassen 7
Napa 7
Amador 7
Sutter 7
Plumas 7
Trinity 6
Kings 6
Tehama 5
Inyo 4
Colusa 4
Mariposa 4
San Benito 4
Del Norte 4
Glenn 3
Mono 3
Modoc 3
Sierra 1
Alpine 1

We want to visualize the relationships between every pair of numeric columns. We do so with a pairplot, which shows all of the pairwise scatter plots at once.

In [10]:
import matplotlib.pyplot as plt
sns.pairplot(data=df, hue='County')
Out[10]:
<seaborn.axisgrid.PairGrid at 0x12017bc90>
[Figure: pairplot of the numeric columns of the dataframe, colored by County]

To further explore the dataset, we make box plots of Death count, Population, and Avg total income for each county.

In [11]:
import matplotlib.pyplot as plt

sns.catplot(data=df, x="Death count", y="County", kind="box", height=15)
sns.catplot(data=df, x="Population", y="County", kind="box", height=15)
sns.catplot(data=df, x="Avg total income", y="County", kind="box", height=15)
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x12243cf50>
[Figure: box plot of Death count by County]
[Figure: box plot of Population by County]
[Figure: box plot of Avg total income by County]

Next, we compute the correlation matrix of the numeric columns and display it as a heatmap.

In [12]:
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = df[['Population','Death count','Percent death count','Avg AGI','Avg total income']].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
[Figure: heatmap titled "Correlation Matrix" for Population, Death count, Percent death count, Avg AGI, and Avg total income]
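Each cell of the correlation matrix is a Pearson correlation coefficient, which is the default method of DataFrame.corr. As an illustration, the sketch below computes one such coefficient both with pandas and directly from the definition, using the five Los Angeles rows shown earlier. The name toy is hypothetical, and five rows are far too few to say anything about the full dataset.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head() earlier in the notebook.
toy = pd.DataFrame({
    "Population": [57652, 53108, 75024, 58833, 37754],
    "Death count": [102.0, 109.0, 134.0, 118.0, 79.0],
})

# Pearson correlation as computed by pandas.
r_pandas = toy["Population"].corr(toy["Death count"])

# The same quantity from the definition: covariance over the product of spreads.
x = toy["Population"] - toy["Population"].mean()
y = toy["Death count"] - toy["Death count"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(f"pandas: {r_pandas:.4f}, manual: {r_manual:.4f}")
```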

Finally, we draw a scatter plot of Population against Death count.

In [13]:
sns.relplot(data=df, x="Death count", y="Population")
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x105e7e090>
[Figure: scatter plot with Death count on the x-axis and Population on the y-axis]

Discussion