In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
import seaborn as sns

1. Loading the dataset¶

This dataset comes from the marketing campaigns of a Portuguese banking institution. Each entry represents a client and contains information about their demographics and their responses to marketing calls about term deposits. The dataset is sourced from the UCI Machine Learning Repository.

Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

In [2]:
bank_data = pd.read_csv('bank-full.csv', delimiter=';')
bank_data.head()
Out[2]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
In [3]:
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB

2. Preprocessing¶

Remove the following columns: day, month, duration, pdays, previous, and poutcome.
These columns are either irrelevant to this analysis or consist almost entirely of 'unknown' entries, so they contribute nothing useful to the dataset.

In [4]:
unnecessary_columns = ['day', 'month', 'duration', 'pdays', 'previous', 'poutcome']
bank_data.drop(unnecessary_columns, axis=1, inplace=True)
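The claim that some columns are mostly 'unknown' can be checked before dropping them. The following is a minimal sketch, not part of the original notebook: the helper name `unknown_share` and the small `toy` frame are made up for illustration, and on the real data the same call would be applied to `bank_data` right after loading.

```python
import pandas as pd

def unknown_share(df: pd.DataFrame, marker: str = "unknown") -> pd.Series:
    """Fraction of rows equal to `marker` in each column, largest first."""
    # eq() gives a boolean frame; the mean of booleans is the share of True values
    return df.eq(marker).mean().sort_values(ascending=False)

# Made-up rows that only mimic the layout of the bank data (assumption)
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "services"],
    "poutcome": ["unknown", "unknown", "unknown", "success"],
    "age": [58, 44, 33, 47],
})

shares = unknown_share(toy)
print(shares)
```

Columns with a share close to 1 carry very little information, which is one way to justify removing them.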

Check for missing values, marked by 'unknown', and drop all rows containing them. This leaves 30,907 of the original 45,211 rows.

In [5]:
bank_data.replace('unknown', np.nan, inplace=True)
bank_data.dropna(inplace=True)
bank_data.head()
Out[5]:
age job marital education default balance housing loan contact campaign y
12657 27 management single secondary no 35 no no cellular 1 no
12658 54 blue-collar married primary no 466 no no cellular 1 no
12659 43 blue-collar married secondary no 105 no yes cellular 2 no
12660 31 technician single secondary no 19 no no telephone 2 no
12661 27 technician single secondary no 126 yes yes cellular 4 no
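As an alternative to replacing 'unknown' after loading, `pd.read_csv` can treat it as missing at read time through its `na_values` parameter. The sketch below is illustrative only: it reads a made-up three-row CSV from a string (`csv_text` is an assumption, not the real file) and reports how many rows `dropna` removes.

```python
import io
import pandas as pd

# Made-up sample in the same semicolon-separated layout as bank-full.csv (assumption)
csv_text = (
    "age;job;education;y\n"
    "58;management;tertiary;no\n"
    "33;unknown;unknown;no\n"
    "41;technician;secondary;yes\n"
)

# na_values turns every 'unknown' cell into NaN while the file is parsed
df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=["unknown"])

rows_before = len(df)
clean = df.dropna()
print(f"Dropped {rows_before - len(clean)} of {rows_before} rows")
```

If this approach is used on the full file, the mostly-unknown columns should still be dropped before calling `dropna`, otherwise nearly every row would be removed.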

Perform an outlier analysis on all the quantitative columns.

In [6]:
import matplotlib.pyplot as plt
quantitative_columns = ['age','balance','campaign']
for column in quantitative_columns:
    plt.figure(figsize=(10,6))
    sns.boxplot(x=bank_data[column])

#calculate and print the quantiles of each numerical column
for column in quantitative_columns:
    print(f'Quantiles for {column}:')
    print(bank_data[column].quantile([0.25,0.5,0.75]).to_string())
    #lower and upper bounds
    q1,q3 = bank_data[column].quantile(0.25),bank_data[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5*iqr
    upper_bound = q3 + 1.5*iqr
    print(f'Lower bound: {lower_bound}')
    print(f'Upper bound: {upper_bound}','\n')
Quantiles for age:
0.25    32.0
0.50    39.0
0.75    48.0
Lower bound: 8.0
Upper bound: 72.0 

Quantiles for balance:
0.25      80.0
0.50     473.0
0.75    1502.5
Lower bound: -2053.75
Upper bound: 3636.25 

Quantiles for campaign:
0.25    1.0
0.50    2.0
0.75    3.0
Lower bound: -2.0
Upper bound: 6.0 

[Figures: box plots of age, balance, and campaign]

Interpretation of Outliers
Age : Outliers represent individuals who are significantly older than the bank's typical clients. With an upper bound of 72, the boxplot indicates that outliers begin at ages just over 70 years old.

Balance : This represents the average yearly balance a client has in their account. The large number of outliers indicates clients with very high balances (above the upper bound of 3636.25) or strongly negative balances (below the lower bound of -2053.75).

Campaign : This variable represents the number of contacts performed during a marketing campaign for each client. The high number of outliers indicates clients who received an exceptionally large number of contacts (more than 6), well above the typical 1 to 3 contacts per client given by the interquartile range.
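To put a number on "how many outliers", the 1.5 × IQR rule used above can be wrapped in a small helper. This is a sketch under the same rule as the cell above; the function name `iqr_outliers` and the toy series are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up contact counts with one extreme value (assumption, not the real data)
toy_campaign = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])

flagged = iqr_outliers(toy_campaign)
print(f"{len(flagged)} outlier(s): {flagged.tolist()}")
```

On the real data, `iqr_outliers(bank_data[column])` inside the existing loop would give the outlier count for each quantitative column.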

3. Summary Data Analysis¶

Statistical summary of quantitative columns¶

In [7]:
bank_data.describe()
Out[7]:
age balance campaign
count 30907.000000 30907.000000 30907.000000
mean 40.918918 1425.760701 2.751318
std 10.922583 3190.967030 2.954412
min 18.000000 -8019.000000 1.000000
25% 32.000000 80.000000 1.000000
50% 39.000000 473.000000 2.000000
75% 48.000000 1502.500000 3.000000
max 95.000000 102127.000000 50.000000

Statistical summary of categorical columns¶

'job' --> type of job ('admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed')

'marital' --> marital status ('divorced','married','single'; note: 'divorced' means divorced or widowed)

'education' --> education level ('primary','secondary','tertiary')

'default' --> has credit in default

'housing' --> has a housing loan with the bank

'loan' --> has a personal loan with the bank

'contact' --> contact communication type ('cellular','telephone')

'y' --> has the client subscribed to a term deposit?

In [8]:
bank_data.describe(include='object')
Out[8]:
job marital education default housing loan contact y
count 30907 30907 30907 30907 30907 30907 30907 30907
unique 11 3 3 2 2 2 2 2
top management married secondary no no no cellular no
freq 7329 18379 16004 30397 15564 25787 28213 26394
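The `freq` row above gives counts only for the most common value. For the target `y`, 26394 of 30907 rows are 'no', so the classes are clearly imbalanced. The sketch below shows one way to get such shares directly with `value_counts(normalize=True)`; the toy series is made up, and the two integers are copied from the table above.

```python
import pandas as pd

# Share of 'no' answers computed from the summary table above
share_no = 26394 / 30907
print(f"Share of 'no' in y: {share_no:.3f}")

# The same idea on a made-up series (assumption, not the real column)
toy_y = pd.Series(["no", "no", "no", "yes"], name="y")
toy_shares = toy_y.value_counts(normalize=True)
print(toy_shares)
```

On the real data the equivalent call would be `bank_data['y'].value_counts(normalize=True)`.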
In [9]:
categorical_columns = bank_data.select_dtypes(include='object')
for column in categorical_columns:
    plt.figure(figsize=(8,4))
    bank_data[column].value_counts().plot(kind='bar')
    plt.title(f'Bar Plot of {column}')
    plt.ylabel('Count')
    plt.xticks(rotation=25)
[Figures: bar plots of job, marital, education, default, housing, loan, contact, and y]

Exploring correlations between quantitative columns¶

  1. A correlation between age and balance. This can help show whether there is any relationship between a client's age and their account balance.

    The plot shows a small wave-like pattern, which suggests that clients' balances vary across age groups. Most of the data falls between the ages of 25 and 60, which indicates that most clients who hold accounts and contribute to their balances are in that age range.
In [10]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='age', y='balance', data=bank_data)
plt.title('Scatter Plot of Age vs Balance')
Out[10]:
Text(0.5, 1.0, 'Scatter Plot of Age vs Balance')
[Figure: scatter plot of age vs balance]
  2. A correlation between age and campaign. This can help visualize any relationship between a client's age and the number of times they have been contacted, and so indicate which age groups the campaign contacts most.

    Similar to the age and balance plot, the data is concentrated between the ages of 25 and 60, and most densely between 30 and 40. This concentration of data points suggests that clients aged 25-60 receive the most contacts, likely because they are the bank's target demographic.
In [11]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='age', y='campaign', data=bank_data)
plt.title('Scatter Plot of Age vs Campaign')
Out[11]:
Text(0.5, 1.0, 'Scatter Plot of Age vs Campaign')
[Figure: scatter plot of age vs campaign]
  3. A correlation between balance and campaign. This can show whether there is a relationship between a client's account balance and the number of campaign contacts they receive, which helps to see how clients with certain balances are targeted.

    The plot shows data points clustered tightly near zero balance, with a long tail to the right. This suggests that a consistent majority of campaign contacts were made with clients who have lower balances (less than 2000). As balances increase, fewer contacts are made. This indicates that the banking institution concentrated its campaign efforts on encouraging clients with lower balances to make term deposits.
In [12]:
plt.figure(figsize=(10,6))
plot = sns.scatterplot(x='balance', y='campaign', data=bank_data)

plt.title('Scatter Plot of Balance vs Campaign')
Out[12]:
Text(0.5, 1.0, 'Scatter Plot of Balance vs Campaign')
[Figure: scatter plot of balance vs campaign]
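The scatter plots give a visual impression only, so a correlation matrix is one way to quantify the three relationships. The sketch below uses Spearman rank correlation, chosen here as an assumption because balance and campaign are heavily skewed; the original notebook does not compute any coefficient, and the `toy` frame is made up.

```python
import pandas as pd

# Made-up values in the same three columns as the real data (assumption)
toy = pd.DataFrame({
    "age": [25, 32, 39, 48, 60],
    "balance": [100, 400, 300, 1500, 900],
    "campaign": [1, 2, 1, 3, 2],
})

# Rank-based correlation is less sensitive to extreme balances than Pearson
corr = toy[["age", "balance", "campaign"]].corr(method="spearman")
print(corr.round(2))
```

On the real data, `bank_data[quantitative_columns].corr(method='spearman')` would produce the same kind of table, and `sns.heatmap(corr, annot=True)` could be used to display it.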

4. Discussion¶

  1. (Categorical) Can demographic factors such as education level, marital status, and housing loan status predict whether a client subscribes to a term deposit?

  2. (Quantitative) Can the client's job type and education level predict the average balance maintained in their account?
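A first look at both questions does not need a model. The sketch below is only a starting point under stated assumptions: the `toy` frame is made up, a row-normalized cross-tabulation stands in for question 1, and a grouped mean stands in for question 2. Neither is a predictive analysis.

```python
import pandas as pd

# Made-up rows using the same column names as the cleaned data (assumption)
toy = pd.DataFrame({
    "job": ["management", "management", "technician",
            "technician", "services", "services"],
    "education": ["tertiary", "tertiary", "secondary",
                  "secondary", "secondary", "primary"],
    "balance": [2000, 1000, 500, 300, 100, 200],
    "y": ["yes", "no", "no", "no", "yes", "no"],
})

# Question 1: subscription rate within each education level
rate_by_education = pd.crosstab(toy["education"], toy["y"], normalize="index")
print(rate_by_education)

# Question 2: mean balance for each job and education combination
mean_balance = toy.groupby(["job", "education"])["balance"].mean()
print(mean_balance)
```

The same two calls on `bank_data` would show whether subscription rates and mean balances differ enough between groups to make a classifier or regression model worth fitting.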