In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
# Load the dataset
data = pd.read_csv("Death_rates_for_suicide__by_sex__race__Hispanic_origin__and_age__United_States.csv")
In [3]:
# Select quantitative columns
quantitative_columns = data.select_dtypes(include=['int64', 'float64']).columns

What the code below does:

This code computes summary statistics for the numeric columns listed in quantitative_columns of the DataFrame data. For each column, describe() reports the count of non-missing values, the mean, the standard deviation, the minimum, the 25th percentile (Q1), the median (50th percentile, or Q2), the 75th percentile (Q3), and the maximum.

The describe() function gives a quick overview of how each numeric variable is distributed. The resulting summary is then printed.

In [4]:
# Compute summary statistics
summary_statistics = data[quantitative_columns].describe()
print("Summary Statistics:")
print(summary_statistics)
Summary Statistics:
       UNIT_NUM  STUB_NAME_NUM  STUB_LABEL_NUM         YEAR     YEAR_NUM  \
count    1176.0         1176.0     1176.000000  1176.000000  1176.000000   
mean        2.0            5.0        5.169719  1996.214286    21.500000   
std         0.0            0.0        0.051632    14.948365    12.126075   
min         2.0            5.0        5.112000  1950.000000     1.000000   
25%         2.0            5.0        5.123750  1987.000000    11.000000   
50%         2.0            5.0        5.142500  1997.500000    21.500000   
75%         2.0            5.0        5.223250  2008.000000    32.000000   
max         2.0            5.0        5.244000  2018.000000    42.000000   

           AGE_NUM     ESTIMATE  
count  1176.000000  1012.000000  
mean      3.307143    14.309585  
std       1.067134    11.471759  
min       2.000000     1.200000  
25%       2.000000     5.400000  
50%       3.000000    10.850000  
75%       4.000000    20.725000  
max       5.200000    61.900000  
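
In the summary above, ESTIMATE has a count of 1012 while every other column has 1176. describe() counts only non-missing values, so a lower count points to missing entries in that column. A minimal sketch of how this could be checked, using a small toy DataFrame with made-up values (the frame toy and its numbers are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; ESTIMATE has two missing entries
toy = pd.DataFrame({
    "YEAR": [1950, 1960, 1970, 1980],
    "ESTIMATE": [6.3, np.nan, 7.1, np.nan],
})

# describe() reports the number of non-missing values per column
summary = toy.describe()

# isna().sum() reports the number of missing values per column
missing_per_column = toy.isna().sum()

print(summary.loc["count"])
print(missing_per_column)
```

On the real dataset, the same check would be data[quantitative_columns].isna().sum().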

What this code does:

This code calculates a z-score for every value in each quantitative column: the value minus the column mean, divided by the column's standard deviation. Standardizing this way lets values from columns with different scales be compared directly.

A column whose values are all identical has a standard deviation of 0, so the calculation becomes 0/0 and every z-score in that column is NaN; this is the case for UNIT_NUM and STUB_NAME_NUM below. A negative z-score means the value is below the column mean.

In [5]:
quantitative_columns = data.select_dtypes(include=['int64', 'float64']).columns

# Compute Z-scores for each quantitative column
z_scores = pd.DataFrame()
for column in quantitative_columns:
    z_scores[column + '_zscore'] = (data[column] - data[column].mean()) / data[column].std()

# Print the Z-scores
print("Z-scores:")
print(z_scores)
Z-scores:
      UNIT_NUM_zscore  STUB_NAME_NUM_zscore  STUB_LABEL_NUM_zscore  \
0                 NaN                   NaN              -1.117882   
1                 NaN                   NaN              -1.117882   
2                 NaN                   NaN              -1.117882   
3                 NaN                   NaN              -1.117882   
4                 NaN                   NaN              -1.117882   
...               ...                   ...                    ...   
1171              NaN                   NaN               1.438662   
1172              NaN                   NaN               1.438662   
1173              NaN                   NaN               1.438662   
1174              NaN                   NaN               1.438662   
1175              NaN                   NaN               1.438662   

      YEAR_zscore  YEAR_NUM_zscore  AGE_NUM_zscore  ESTIMATE_zscore  
0       -3.091595        -1.690572       -1.224910        -0.672049  
1       -2.422625        -1.608105       -1.224910        -0.497708  
2       -1.753656        -1.525638       -1.224910        -0.035704  
3       -1.084686        -1.443171       -1.224910         0.618076  
4       -1.017789        -1.360704       -1.224910         0.591924  
...           ...              ...             ...              ...  
1171     1.189810         1.360704        0.649269        -0.881258  
1172     1.256707         1.443171        0.649269        -0.794088  
1173     1.323604         1.525638        0.649269        -0.855107  
1174     1.390501         1.608105        0.649269        -0.881258  
1175     1.457398         1.690572        0.649269        -0.828956  

[1176 rows x 7 columns]
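
The NaN columns in the output above come from constant columns. A minimal sketch of the same calculation on a toy DataFrame (hypothetical values, not taken from the dataset), written in vectorized form instead of a loop, and showing one way to list the constant columns up front:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "UNIT_NUM": [2, 2, 2, 2],          # constant column: standard deviation is 0
    "ESTIMATE": [1.0, 2.0, 3.0, 4.0],
})

# Same formula as the loop, applied to the whole frame at once.
# pandas std() uses the sample standard deviation (ddof=1) by default.
toy_z = ((toy - toy.mean()) / toy.std()).add_suffix("_zscore")

# Columns with zero standard deviation produce 0/0 = NaN z-scores
constant_columns = toy.columns[toy.std() == 0].tolist()

print(toy_z)
print("Constant columns:", constant_columns)
```

Dropping the constant columns before standardizing is one option; whether that is appropriate depends on the analysis.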
In [6]:
# Visualize distributions
for column in quantitative_columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(data[column], kde=True)
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.show()
[Figures: seven histograms with KDE overlays, one per quantitative column (UNIT_NUM, STUB_NAME_NUM, STUB_LABEL_NUM, YEAR, YEAR_NUM, AGE_NUM, ESTIMATE)]

What this code does:

This code uses the z-scores calculated above for each value in the quantitative columns of the dataset. It flags as an outlier any row in which at least one z-score is greater than 3, that is, more than three standard deviations above the column mean, and then prints those rows. Because the comparison is one-sided, values more than three standard deviations below the mean are not flagged.

In [7]:
# Identify outliers using z-score
outliers = data[(z_scores > 3).any(axis=1)]

print("Outliers:")
print(outliers)
Outliers:
                   INDICATOR                                           UNIT  \
126  Death rates for suicide  Deaths per 100,000 resident population, crude   
168  Death rates for suicide  Deaths per 100,000 resident population, crude   
169  Death rates for suicide  Deaths per 100,000 resident population, crude   
174  Death rates for suicide  Deaths per 100,000 resident population, crude   
175  Death rates for suicide  Deaths per 100,000 resident population, crude   
176  Death rates for suicide  Deaths per 100,000 resident population, crude   
177  Death rates for suicide  Deaths per 100,000 resident population, crude   
178  Death rates for suicide  Deaths per 100,000 resident population, crude   
179  Death rates for suicide  Deaths per 100,000 resident population, crude   
180  Death rates for suicide  Deaths per 100,000 resident population, crude   
181  Death rates for suicide  Deaths per 100,000 resident population, crude   
182  Death rates for suicide  Deaths per 100,000 resident population, crude   
183  Death rates for suicide  Deaths per 100,000 resident population, crude   
184  Death rates for suicide  Deaths per 100,000 resident population, crude   
185  Death rates for suicide  Deaths per 100,000 resident population, crude   
434  Death rates for suicide  Deaths per 100,000 resident population, crude   

     UNIT_NUM          STUB_NAME  STUB_NAME_NUM  \
126         2  Sex, age and race              5   
168         2  Sex, age and race              5   
169         2  Sex, age and race              5   
174         2  Sex, age and race              5   
175         2  Sex, age and race              5   
176         2  Sex, age and race              5   
177         2  Sex, age and race              5   
178         2  Sex, age and race              5   
179         2  Sex, age and race              5   
180         2  Sex, age and race              5   
181         2  Sex, age and race              5   
182         2  Sex, age and race              5   
183         2  Sex, age and race              5   
184         2  Sex, age and race              5   
185         2  Sex, age and race              5   
434         2  Sex, age and race              5   

                                            STUB_LABEL  STUB_LABEL_NUM  YEAR  \
126                           Male: White: 65-74 years          5.1151  1950   
168                           Male: White: 75-84 years          5.1152  1950   
169                           Male: White: 75-84 years          5.1152  1960   
174                           Male: White: 75-84 years          5.1152  1983   
175                           Male: White: 75-84 years          5.1152  1984   
176                           Male: White: 75-84 years          5.1152  1985   
177                           Male: White: 75-84 years          5.1152  1986   
178                           Male: White: 75-84 years          5.1152  1987   
179                           Male: White: 75-84 years          5.1152  1988   
180                           Male: White: 75-84 years          5.1152  1989   
181                           Male: White: 75-84 years          5.1152  1990   
182                           Male: White: 75-84 years          5.1152  1991   
183                           Male: White: 75-84 years          5.1152  1992   
184                           Male: White: 75-84 years          5.1152  1993   
185                           Male: White: 75-84 years          5.1152  1994   
434  Male: American Indian or Alaska Native: 15-24 ...          5.1320  1990   

     YEAR_NUM          AGE  AGE_NUM  ESTIMATE  
126         1  65-74 years      5.1      53.2  
168         1  75-84 years      5.2      61.9  
169         2  75-84 years      5.2      55.7  
174         7  75-84 years      5.2      52.5  
175         8  75-84 years      5.2      51.9  
176         9  75-84 years      5.2      57.0  
177        10  75-84 years      5.2      58.8  
178        11  75-84 years      5.2      60.9  
179        12  75-84 years      5.2      61.4  
180        13  75-84 years      5.2      55.3  
181        14  75-84 years      5.2      60.2  
182        15  75-84 years      5.2      56.1  
183        16  75-84 years      5.2      53.1  
184        17  75-84 years      5.2      52.1  
185        18  75-84 years      5.2      50.1  
434        14  15-24 years      2.0      49.1
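
The filter above only catches values far above the mean. For example, row 0 has a YEAR_zscore of -3.09 in the earlier output and is not flagged. A minimal sketch of a two-sided variant that compares the absolute z-score against the threshold, shown on a toy column with hypothetical values (not rows from the dataset):

```python
import pandas as pd

# Toy column with hypothetical values: nineteen typical values and one very low value
toy = pd.DataFrame({"ESTIMATE": [10.0] * 19 + [-80.0]})

# z-score of each value, using the sample standard deviation as pandas does by default
z = (toy["ESTIMATE"] - toy["ESTIMATE"].mean()) / toy["ESTIMATE"].std()

# One-sided rule, as in the cell above: misses the low value
one_sided = toy[z > 3]

# Two-sided rule: flags values far from the mean in either direction
two_sided = toy[z.abs() > 3]

print("One-sided outliers:", len(one_sided))
print("Two-sided outliers:", len(two_sided))
```

Applied to the notebook's data, the equivalent would be data[(z_scores.abs() > 3).any(axis=1)]. Restricting the check to ESTIMATE_zscore may also be worth considering, since columns such as YEAR and AGE_NUM are labels rather than measurements.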