In [1]:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
import seaborn as sns

1. Loading the dataset

This dataset comes from a Portuguese banking institution's marketing campaigns. Each entry represents a client and contains their demographics and their response to marketing calls about term deposits. The dataset is sourced from the UCI Machine Learning Repository.

Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

In [2]:

bank_data = pd.read_csv('bank-full.csv', delimiter=';')
bank_data.head()

Out[2]:

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	y
0	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
2	33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no
3	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no

In [3]:

bank_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB

2. Preprocessing

Remove the following columns: day, month, duration, pdays, previous, and poutcome.
These columns are either irrelevant to this analysis or consist almost entirely of 'unknown' entries, so they contribute nothing to the dataset.

In [4]:

unnecessary_columns = ['day', 'month', 'duration', 'pdays', 'previous', 'poutcome']
bank_data.drop(unnecessary_columns, axis=1, inplace=True)
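The claim that some columns are almost entirely 'unknown' can be checked before dropping them. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not taken from bank-full.csv); the same expression could be applied to the raw data before the drop.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw bank data (values are made up).
toy = pd.DataFrame({
    'job': ['management', 'unknown', 'technician', 'services'],
    'contact': ['unknown', 'cellular', 'unknown', 'unknown'],
    'poutcome': ['unknown', 'unknown', 'unknown', 'failure'],
})

# Fraction of rows in each column equal to the placeholder string 'unknown'.
# Comparing the frame to a string gives booleans; their mean is the share.
unknown_share = (toy == 'unknown').mean().sort_values(ascending=False)
print(unknown_share)
```

Columns with a share close to 1 carry little information and are reasonable candidates for removal.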

Check for missing values, marked by 'unknown', and drop all rows containing them.

In [5]:

bank_data.replace('unknown', np.nan, inplace=True)
bank_data.dropna(inplace=True)
bank_data.head()

Out[5]:

	age	job	marital	education	default	balance	housing	loan	contact	campaign	y
12657	27	management	single	secondary	no	35	no	no	cellular	1	no
12658	54	blue-collar	married	primary	no	466	no	no	cellular	1	no
12659	43	blue-collar	married	secondary	no	105	no	yes	cellular	2	no
12660	31	technician	single	secondary	no	19	no	no	telephone	2	no
12661	27	technician	single	secondary	no	126	yes	yes	cellular	4	no
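The surviving rows keep their original labels, which is why the index above starts at 12657 rather than 0. A minimal sketch of how one might count the missing values per column before dropping, and then renumber the index, is shown here on a hypothetical toy frame (the names `toy`, `cleaned`, and `missing_per_column` are illustrative, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'job': ['unknown', 'management', 'technician'],
    'contact': ['unknown', 'unknown', 'cellular'],
    'balance': [10, 20, 30],
})

# Turn the 'unknown' placeholder into NaN so pandas treats it as missing.
cleaned = toy.replace('unknown', np.nan)

# Number of missing entries per column, useful to see what drives the row loss.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# Drop incomplete rows, then renumber the index from 0.
cleaned = cleaned.dropna().reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional; it only matters if later steps rely on positional labels.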

Perform an outlier analysis on all the quantitative columns.

In [6]:

import matplotlib.pyplot as plt
quantitative_columns = ['age','balance','campaign']
for column in quantitative_columns:
    plt.figure(figsize=(10,6))
    sns.boxplot(x=bank_data[column])

#calculate and print the quantiles of each numerical column
for column in quantitative_columns:
    print(f'Quantiles for {column}:')
    print(bank_data[column].quantile([0.25,0.5,0.75]).to_string())
    #lower and upper bounds
    q1,q3 = bank_data[column].quantile(0.25),bank_data[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5*iqr
    upper_bound = q3 + 1.5*iqr
    print(f'Lower bound: {lower_bound}')
    print(f'Upper bound: {upper_bound}','\n')

Quantiles for age:
0.25    32.0
0.50    39.0
0.75    48.0
Lower bound: 8.0
Upper bound: 72.0 

Quantiles for balance:
0.25      80.0
0.50     473.0
0.75    1502.5
Lower bound: -2053.75
Upper bound: 3636.25 

Quantiles for campaign:
0.25    1.0
0.50    2.0
0.75    3.0
Lower bound: -2.0
Upper bound: 6.0


Interpretation of Outliers
Age : Outliers are clients who are considerably older than most of the bank's clients. With an upper bound of 72, the boxplot flags ages just over 70 as outliers.

Balance : This represents the average yearly balance a client has in their account. The large number of outliers corresponds to clients with very high balances (above the upper bound of 3636.25) or strongly negative balances (below the lower bound of -2053.75).

Campaign : This variable represents the number of contacts performed during a marketing campaign for each client. The many outliers are clients contacted more than 6 times, well above the typical 1 to 3 contacts (the interquartile range).
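To put a number on "many outliers", the 1.5*IQR rule used above can be wrapped in a small helper that returns a boolean mask. This is a sketch under the same rule as the cell above; the function name `iqr_outlier_mask` and the toy series are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical contact counts; only the last value is extreme.
toy_campaign = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])
mask = iqr_outlier_mask(toy_campaign)

# Count and share of flagged values.
print(mask.sum(), mask.mean())
```

Applied to the notebook's data, something like `iqr_outlier_mask(bank_data['campaign']).sum()` would give the number of flagged clients per column.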

3. Summary Data Analysis

Statistical summary of quantitative columns

In [7]:

bank_data.describe()

Out[7]:

	age	balance	campaign
count	30907.000000	30907.000000	30907.000000
mean	40.918918	1425.760701	2.751318
std	10.922583	3190.967030	2.954412
min	18.000000	-8019.000000	1.000000
25%	32.000000	80.000000	1.000000
50%	39.000000	473.000000	2.000000
75%	48.000000	1502.500000	3.000000
max	95.000000	102127.000000	50.000000

Statistical summary of categorical columns

'job' --> type of job ('admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed')

'marital' --> marital status ('divorced','married','single'; note: 'divorced' means divorced or widowed)

'education' --> ('primary','tertiary','secondary')

'default' --> has credit in default

'housing' --> has a housing loan with the bank

'loan' --> has a personal loan with the bank

'contact' --> contact communication type ('cellular','telephone')

'y' --> has the client subscribed to a term deposit?

In [8]:

bank_data.describe(include='object')

Out[8]:

	job	marital	education	default	housing	loan	contact	y
count	30907	30907	30907	30907	30907	30907	30907	30907
unique	11	3	3	2	2	2	2	2
top	management	married	secondary	no	no	no	cellular	no
freq	7329	18379	16004	30397	15564	25787	28213	26394

In [9]:

categorical_columns = bank_data.select_dtypes(include='object')
for column in categorical_columns:
    plt.figure(figsize=(8,4))
    bank_data[column].value_counts().plot(kind='bar')
    plt.title(f'Bar Plot of {column}')
    plt.ylabel('Count')
    plt.xticks(rotation=25)

Exploring correlations between quantitative columns

The first plot looks at age against balance. This can help show whether there is any relationship between a client's age and their account balance.

The plot shows a slight wave-like pattern, which suggests that client balances vary across age groups. Most of the points fall between the ages of 25 and 60, which indicates that most account holders are in that age range.

In [10]:

plt.figure(figsize=(10,6))
sns.scatterplot(x='age', y='balance', data=bank_data)
plt.title('Scatter Plot of Age vs Balance')

Out[10]:

Text(0.5, 1.0, 'Scatter Plot of Age vs Balance')

The second plot looks at age against campaign. This can help visualize any relationship between a client's age and the number of times they have been contacted, and so indicate which age groups the campaign contacts most often.

As in the age and balance plot, the points are spread mainly between the ages of 25 and 60, with the densest region between 30 and 40. This concentration suggests that clients aged 25-60 receive the most contacts, possibly because they are the bank's target demographic.

In [11]:

plt.figure(figsize=(10,6))
sns.scatterplot(x='age', y='campaign', data=bank_data)
plt.title('Scatter Plot of Age vs Campaign')

Out[11]:

Text(0.5, 1.0, 'Scatter Plot of Age vs Campaign')

The third plot looks at balance against campaign. This can show whether there is a relationship between a client's account balance and the number of campaign contacts they receive, which is a first step toward seeing how clients with certain balances are targeted.

The points are clustered near zero balance, with a long tail to the right. This suggests that the majority of campaign contacts were made with clients who have lower balances (less than 2000). As balances increase, fewer contacts are made. This may indicate that the bank concentrated its term deposit campaign on clients with lower balances, although it also reflects that most clients have low balances (the median is 473).

In [12]:

plt.figure(figsize=(10,6))
plot = sns.scatterplot(x='balance', y='campaign', data=bank_data)

plt.title('Scatter Plot of Balance vs Campaign')

Out[12]:

Text(0.5, 1.0, 'Scatter Plot of Balance vs Campaign')
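Scatter plots show the shape of a relationship but not its strength. As a complement, the pairwise Pearson correlation coefficients can be computed with `DataFrame.corr()`. The sketch below uses a made-up frame (`toy` is hypothetical); on the notebook's data the same call would be applied to `bank_data[['age', 'balance', 'campaign']]`.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the call.
toy = pd.DataFrame({
    'age': [25, 35, 45, 55, 65],
    'balance': [100, 300, 200, 500, 400],
    'campaign': [1, 2, 1, 3, 2],
})

# Pearson correlation (the pandas default) between every pair of columns.
corr_matrix = toy[['age', 'balance', 'campaign']].corr()
print(corr_matrix.round(2))
```

Coefficients near 0 indicate little linear relationship, while values near 1 or -1 indicate a strong one. Pearson correlation is sensitive to outliers, which matters for heavily skewed columns such as balance.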

4. Discussion

(Categorical) Can demographic factors such as education level, marital status, and housing loan status predict whether a client subscribes to a term deposit?
(Quantitative) Can the client's job type and education level predict the average balance maintained in their account?
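A first, purely descriptive look at both questions could use grouped summaries before any predictive model is fitted. The sketch below is an assumption about how one might start, not the notebook's method; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical clients; values are made up for illustration.
toy = pd.DataFrame({
    'education': ['primary', 'primary', 'secondary', 'secondary', 'tertiary', 'tertiary'],
    'job': ['services', 'services', 'technician', 'technician', 'management', 'management'],
    'balance': [100, 300, 200, 400, 1000, 2000],
    'y': ['no', 'no', 'no', 'yes', 'yes', 'yes'],
})

# Categorical question: share of 'yes'/'no' within each education level.
subscription_rate = pd.crosstab(toy['education'], toy['y'], normalize='index')
print(subscription_rate)

# Quantitative question: mean balance for each job and education combination.
mean_balance = toy.groupby(['job', 'education'])['balance'].mean()
print(mean_balance)
```

Large differences between groups in either table would suggest the corresponding factors are worth including in a model.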