import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv("Death_rates_for_suicide__by_sex__race__Hispanic_origin__and_age__United_States.csv")
# Select quantitative columns
quantitative_columns = data.select_dtypes(include=['int64', 'float64']).columns
What the code snippet below does:
This code computes summary statistics for the columns listed in quantitative_columns of the DataFrame data. For each numeric column, describe() reports the count of non-missing values, the mean, the standard deviation, the minimum, the 25th percentile (Q1), the median (50th percentile or Q2), the 75th percentile (Q3), and the maximum.
Together these statistics give a quick overview of how each numeric variable is distributed. The resulting table is then printed to the console.
# Compute summary statistics
summary_statistics = data[quantitative_columns].describe()
print("Summary Statistics:")
print(summary_statistics)
Summary Statistics:
       UNIT_NUM  STUB_NAME_NUM  STUB_LABEL_NUM         YEAR     YEAR_NUM  \
count    1176.0         1176.0     1176.000000  1176.000000  1176.000000
mean        2.0            5.0        5.169719  1996.214286    21.500000
std         0.0            0.0        0.051632    14.948365    12.126075
min         2.0            5.0        5.112000  1950.000000     1.000000
25%         2.0            5.0        5.123750  1987.000000    11.000000
50%         2.0            5.0        5.142500  1997.500000    21.500000
75%         2.0            5.0        5.223250  2008.000000    32.000000
max         2.0            5.0        5.244000  2018.000000    42.000000

           AGE_NUM     ESTIMATE
count  1176.000000  1012.000000
mean      3.307143    14.309585
std       1.067134    11.471759
min       2.000000     1.200000
25%       2.000000     5.400000
50%       3.000000    10.850000
75%       4.000000    20.725000
max       5.200000    61.900000
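The summary above shows that UNIT_NUM and STUB_NAME_NUM have a standard deviation of 0, so they carry no information for a distribution analysis. The following is a minimal sketch, on a small hypothetical frame rather than the real dataset, of one way to keep only the numeric columns that actually vary before calling describe(). The toy values and the name varying_columns are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; it only mimics the column names above.
toy = pd.DataFrame({
    "UNIT_NUM": [2, 2, 2, 2],            # constant column, std would be 0
    "YEAR": [1950, 1960, 1970, 1980],
    "ESTIMATE": [13.2, 12.5, None, 15.2],  # one missing value
})

# Select every numeric dtype, then keep columns with more than one distinct value.
numeric = toy.select_dtypes(include="number")
varying_columns = numeric.columns[numeric.nunique() > 1]

summary = toy[varying_columns].describe()
print(summary)
```

Note that the count row reports non-missing values only, which is why ESTIMATE shows a lower count than the other columns in the real output.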
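What this snippet of code does:
This code calculates a z-score for every value in each quantitative column: the value minus the column mean, divided by the column standard deviation. Standardizing in this way allows values from columns with different scales to be compared.
A column's z-scores are all NaN when its standard deviation is zero, that is, when every value in the column is the same (here UNIT_NUM and STUB_NAME_NUM). A missing value in the data, as in ESTIMATE, also produces a NaN z-score. A negative z-score indicates that a value is below the column mean.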
quantitative_columns = data.select_dtypes(include=['int64', 'float64']).columns
# Compute Z-scores for each quantitative column
z_scores = pd.DataFrame()
for column in quantitative_columns:
    z_scores[column + '_zscore'] = (data[column] - data[column].mean()) / data[column].std()
# Print the Z-scores
print("Z-scores:")
print(z_scores)
Z-scores:
      UNIT_NUM_zscore  STUB_NAME_NUM_zscore  STUB_LABEL_NUM_zscore  \
0                 NaN                   NaN              -1.117882
1                 NaN                   NaN              -1.117882
2                 NaN                   NaN              -1.117882
3                 NaN                   NaN              -1.117882
4                 NaN                   NaN              -1.117882
...               ...                   ...                    ...
1171              NaN                   NaN               1.438662
1172              NaN                   NaN               1.438662
1173              NaN                   NaN               1.438662
1174              NaN                   NaN               1.438662
1175              NaN                   NaN               1.438662

      YEAR_zscore  YEAR_NUM_zscore  AGE_NUM_zscore  ESTIMATE_zscore
0       -3.091595        -1.690572       -1.224910        -0.672049
1       -2.422625        -1.608105       -1.224910        -0.497708
2       -1.753656        -1.525638       -1.224910        -0.035704
3       -1.084686        -1.443171       -1.224910         0.618076
4       -1.017789        -1.360704       -1.224910         0.591924
...           ...              ...             ...              ...
1171     1.189810         1.360704        0.649269        -0.881258
1172     1.256707         1.443171        0.649269        -0.794088
1173     1.323604         1.525638        0.649269        -0.855107
1174     1.390501         1.608105        0.649269        -0.881258
1175     1.457398         1.690572        0.649269        -0.828956

[1176 rows x 7 columns]
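The NaN columns in the output above come from dividing by a standard deviation of zero. As a rough illustration, the sketch below reproduces both sources of NaN (a constant column and a missing value) on a small hypothetical frame, and computes the z-scores for all columns in one vectorized expression instead of a loop. The toy values are assumptions; the formula is the same one used in the loop above.

```python
import pandas as pd

# Toy frame with hypothetical values, not taken from the real dataset.
toy = pd.DataFrame({
    "UNIT_NUM": [2.0, 2.0, 2.0, 2.0],        # constant: std is 0, so 0/0 gives NaN
    "ESTIMATE": [10.0, 20.0, None, 30.0],    # missing value stays NaN
})

# Column-wise z-scores: pandas aligns the mean and std Series on column names.
# DataFrame.std() uses the sample standard deviation (ddof=1) by default.
z = ((toy - toy.mean()) / toy.std()).add_suffix("_zscore")
print(z)
```

Here the non-missing ESTIMATE values have mean 20 and sample standard deviation 10, so their z-scores are -1, 0 and 1, while every UNIT_NUM z-score is NaN.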
# Visualize distributions
for column in quantitative_columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(data[column], kde=True)
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.show()
What this code snippet does:
This code uses the z-scores calculated above for each value in the quantitative columns of the dataset. It flags a row as an outlier when any of its z-scores is greater than 3, so only unusually high values are caught; values more than 3 standard deviations below the mean are not flagged. Finally, it prints the rows containing outliers.
# Identify outliers using z-score
outliers = data[(z_scores > 3).any(axis=1)]
print("Outliers:")
print(outliers)
Outliers:
                   INDICATOR                                            UNIT  \
126  Death rates for suicide  Deaths per 100,000 resident population, crude
168  Death rates for suicide  Deaths per 100,000 resident population, crude
169  Death rates for suicide  Deaths per 100,000 resident population, crude
174  Death rates for suicide  Deaths per 100,000 resident population, crude
175  Death rates for suicide  Deaths per 100,000 resident population, crude
176  Death rates for suicide  Deaths per 100,000 resident population, crude
177  Death rates for suicide  Deaths per 100,000 resident population, crude
178  Death rates for suicide  Deaths per 100,000 resident population, crude
179  Death rates for suicide  Deaths per 100,000 resident population, crude
180  Death rates for suicide  Deaths per 100,000 resident population, crude
181  Death rates for suicide  Deaths per 100,000 resident population, crude
182  Death rates for suicide  Deaths per 100,000 resident population, crude
183  Death rates for suicide  Deaths per 100,000 resident population, crude
184  Death rates for suicide  Deaths per 100,000 resident population, crude
185  Death rates for suicide  Deaths per 100,000 resident population, crude
434  Death rates for suicide  Deaths per 100,000 resident population, crude

     UNIT_NUM          STUB_NAME  STUB_NAME_NUM  \
126         2  Sex, age and race              5
168         2  Sex, age and race              5
169         2  Sex, age and race              5
174         2  Sex, age and race              5
175         2  Sex, age and race              5
176         2  Sex, age and race              5
177         2  Sex, age and race              5
178         2  Sex, age and race              5
179         2  Sex, age and race              5
180         2  Sex, age and race              5
181         2  Sex, age and race              5
182         2  Sex, age and race              5
183         2  Sex, age and race              5
184         2  Sex, age and race              5
185         2  Sex, age and race              5
434         2  Sex, age and race              5

                                            STUB_LABEL  STUB_LABEL_NUM  YEAR  \
126                           Male: White: 65-74 years          5.1151  1950
168                           Male: White: 75-84 years          5.1152  1950
169                           Male: White: 75-84 years          5.1152  1960
174                           Male: White: 75-84 years          5.1152  1983
175                           Male: White: 75-84 years          5.1152  1984
176                           Male: White: 75-84 years          5.1152  1985
177                           Male: White: 75-84 years          5.1152  1986
178                           Male: White: 75-84 years          5.1152  1987
179                           Male: White: 75-84 years          5.1152  1988
180                           Male: White: 75-84 years          5.1152  1989
181                           Male: White: 75-84 years          5.1152  1990
182                           Male: White: 75-84 years          5.1152  1991
183                           Male: White: 75-84 years          5.1152  1992
184                           Male: White: 75-84 years          5.1152  1993
185                           Male: White: 75-84 years          5.1152  1994
434  Male: American Indian or Alaska Native: 15-24 ...          5.1320  1990

     YEAR_NUM          AGE  AGE_NUM  ESTIMATE
126         1  65-74 years      5.1      53.2
168         1  75-84 years      5.2      61.9
169         2  75-84 years      5.2      55.7
174         7  75-84 years      5.2      52.5
175         8  75-84 years      5.2      51.9
176         9  75-84 years      5.2      57.0
177        10  75-84 years      5.2      58.8
178        11  75-84 years      5.2      60.9
179        12  75-84 years      5.2      61.4
180        13  75-84 years      5.2      55.3
181        14  75-84 years      5.2      60.2
182        15  75-84 years      5.2      56.1
183        16  75-84 years      5.2      53.1
184        17  75-84 years      5.2      52.1
185        18  75-84 years      5.2      50.1
434        14  15-24 years      2.0      49.1
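The filter (z_scores > 3) only looks at the upper tail and tests every quantitative column, including identifier-like ones such as YEAR. A possible variation, sketched here on hypothetical data and not taken from the original analysis, is to restrict the check to the measured quantity (ESTIMATE) and compare the absolute z-score against the threshold so that both tails are covered. The toy values and the names threshold and outlier_rows are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 19 ordinary values plus one extreme value.
toy = pd.DataFrame({"ESTIMATE": [float(v) for v in range(5, 24)] + [60.0]})

threshold = 3  # same cut-off as the filter above

# Z-score of the single column of interest.
z = (toy["ESTIMATE"] - toy["ESTIMATE"].mean()) / toy["ESTIMATE"].std()

# abs() makes the test two-sided; NaN z-scores compare as False and are skipped.
outlier_rows = toy[z.abs() > threshold]
print(outlier_rows)
```

Applying the absolute value across all columns instead would also flag rows on YEAR alone (the first row in the z-score output has YEAR_zscore of about -3.09), which is one reason to limit the test to ESTIMATE.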