import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Introduction
The following datasets are collections of chemical characteristics of red and white vinho verde wine samples along with the overall quality of the wine sample. The dataset can be found at https://archive.ics.uci.edu/dataset/186/wine+quality
Import the files winequality-white.csv and winequality-red.csv as dataframes named white and red, respectively.
white = pd.read_csv('winequality-white.csv', sep = ';')
red = pd.read_csv('winequality-red.csv', sep = ';')
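The `sep = ';'` argument is needed because the files are read as semicolon-delimited rather than comma-delimited. A minimal sketch of the effect, using a hypothetical two-row excerpt held in memory (the values are made up for illustration and are not taken from the dataset):

```python
import io

import pandas as pd

# Hypothetical excerpt in a semicolon-delimited layout (illustrative values only).
raw = '"fixed acidity";"pH";"quality"\n7.0;3.0;6\n6.3;3.3;6\n'

# With the correct separator, each field becomes its own column.
parsed = pd.read_csv(io.StringIO(raw), sep=";")

# With the default comma separator, every row collapses into a single column.
unparsed = pd.read_csv(io.StringIO(raw))

print(parsed.shape)    # three columns
print(unparsed.shape)  # one column
```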
Preprocessing
Create a function called remove_outliers that removes outliers from a dataframe using the 1.5*IQR method.
Use the remove_outliers function to remove any outliers from white and red.
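def remove_outliers(df):
    # Per-column first and third quartiles and interquartile range
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # Drop every row with at least one value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis = 1)]
    return df_out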
white = remove_outliers(white.select_dtypes(include = [np.number]))
red = remove_outliers(red.select_dtypes(include = [np.number]))
Since I want to analyze the characteristics of the combined datasets, I remove outliers before combining them. Removing outliers after merging could disproportionately affect one of the two datasets and skew any subsequent analysis.
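The concern about filtering after merging can be illustrated with a small sketch. The two groups below are hypothetical toy values, not wine measurements, and `iqr_fences` is a helper name introduced only for this example. When one group is much larger than the other, fences computed on the combined data can flag the entire smaller group, even though none of its values are outliers within that group.

```python
import pandas as pd


def iqr_fences(s: pd.Series) -> tuple[float, float]:
    """Return the (lower, upper) 1.5*IQR fences for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical toy data: a large low-valued group and a small high-valued group.
group_a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
group_b = pd.Series([20.0, 22.0])
combined = pd.concat([group_a, group_b], ignore_index=True)

# Fences computed within the small group alone keep both of its values.
lo_b, hi_b = iqr_fences(group_b)
flagged_within_b = group_b[(group_b < lo_b) | (group_b > hi_b)]

# Fences computed on the combined data flag every value from the small group.
lo_c, hi_c = iqr_fences(combined)
flagged_combined = combined[(combined < lo_c) | (combined > hi_c)]

print(len(flagged_within_b), flagged_combined.tolist())
```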
Add a new column called wine_type to the white and red dataframes. Combine the two dataframes into one dataframe called wine.
white['wine_type'] = 'white'
red['wine_type'] = 'red'
wine = pd.concat([white, red])
Remove any rows with missing values from wine.
wine = wine.dropna()
Summary Data Analysis
Statistical Summary
Find the summary statistics for each quantitative variable in wine.
wine.describe()
 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 | 5037.000000 |
mean | 7.128956 | 0.324697 | 0.305950 | 5.408785 | 0.050942 | 30.120409 | 115.174509 | 0.994540 | 3.218817 | 0.516831 | 10.499715 | 5.815168 |
std | 1.113448 | 0.150817 | 0.119913 | 4.686900 | 0.018869 | 16.174677 | 55.619411 | 0.002865 | 0.148287 | 0.120583 | 1.147178 | 0.754510 |
min | 4.800000 | 0.080000 | 0.000000 | 0.600000 | 0.015000 | 1.000000 | 6.000000 | 0.987110 | 2.820000 | 0.220000 | 8.400000 | 4.000000 |
25% | 6.400000 | 0.220000 | 0.250000 | 1.800000 | 0.038000 | 17.000000 | 78.000000 | 0.992150 | 3.120000 | 0.430000 | 9.500000 | 5.000000 |
50% | 6.900000 | 0.280000 | 0.300000 | 2.800000 | 0.047000 | 29.000000 | 117.000000 | 0.994800 | 3.210000 | 0.500000 | 10.400000 | 6.000000 |
75% | 7.600000 | 0.380000 | 0.380000 | 8.100000 | 0.059000 | 41.000000 | 155.000000 | 0.996800 | 3.320000 | 0.590000 | 11.300000 | 6.000000 |
max | 12.300000 | 1.005000 | 0.730000 | 22.000000 | 0.119000 | 80.000000 | 255.000000 | 1.001960 | 3.680000 | 0.980000 | 14.200000 | 7.000000 |
The summary statistics above show that most of the numerical variables have low spread around the mean, with total sulfur dioxide, free sulfur dioxide, and residual sugar having the largest standard deviations. This is a possible area of study later to see if these attributes can be used to determine wine type.
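Because the variables are measured on very different scales, raw spread is hard to compare across columns. One option is the coefficient of variation (standard deviation divided by mean). The sketch below is only an illustration: `relative_spread` is a hypothetical helper, the toy dataframe stands in for wine, and the ratio is only meaningful for columns whose mean is well away from zero.

```python
import pandas as pd


def relative_spread(df: pd.DataFrame) -> pd.Series:
    """Coefficient of variation (std / mean) per numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return (numeric.std() / numeric.mean()).sort_values(ascending=False)


# Hypothetical toy data: one column on a large scale, one on a small scale.
toy = pd.DataFrame({
    "large_scale": [100.0, 110.0, 90.0, 105.0, 95.0],
    "small_scale": [0.1, 0.5, 0.2, 0.9, 0.3],
})

raw_std = toy.std()
spread = relative_spread(toy)

# The large-scale column has the bigger raw standard deviation,
# but the small-scale column varies more relative to its mean.
print(raw_std)
print(spread)
```

In the notebook, the same function could be called as relative_spread(wine).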
sns.catplot(data = wine, x = 'sulphates', kind = 'box', col = 'wine_type')
Sulphates values are generally lower in white wine than in red wine.
sns.catplot(data = wine[['free sulfur dioxide', 'total sulfur dioxide']], kind = 'box')
The plot above shows free sulfur dioxide values sitting well below total sulfur dioxide values, which suggests that free sulfur dioxide is most likely a portion of total sulfur dioxide.
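The idea that free sulfur dioxide is a portion of the total can be checked row by row rather than inferred from the box plot. A minimal sketch, assuming the column names used in this notebook and a hypothetical toy dataframe in place of wine:

```python
import pandas as pd

# Hypothetical toy values; in the notebook this would be the wine dataframe.
toy = pd.DataFrame({
    "free sulfur dioxide": [10.0, 30.0, 45.0],
    "total sulfur dioxide": [60.0, 120.0, 150.0],
})

# If free SO2 is a component of total SO2, it can never exceed the total.
free_never_exceeds_total = (
    toy["free sulfur dioxide"] <= toy["total sulfur dioxide"]
).all()

# Share of the total that is free, per sample.
free_share = toy["free sulfur dioxide"] / toy["total sulfur dioxide"]

print(free_never_exceeds_total)
print(free_share.describe())
```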
sns.catplot(data = wine, x = 'alcohol', kind = 'box', col = 'quality')
The alcohol level can be analyzed by quality to provide an initial understanding of their relationship. The data cluster at higher alcohol levels as quality increases, with no distinguishable additional spread.
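The visual impression from the box plots can be backed by a per-quality summary of center and spread. A minimal sketch, assuming a hypothetical toy dataframe in place of wine; the `iqr` helper is introduced only for this example:

```python
import pandas as pd


def iqr(s: pd.Series) -> float:
    """Interquartile range of a numeric Series."""
    return s.quantile(0.75) - s.quantile(0.25)


# Hypothetical toy values; in the notebook this would be the wine dataframe.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
    "alcohol": [9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.0],
})

# Median shows where each quality group is centered; IQR shows its spread.
summary = toy.groupby("quality")["alcohol"].agg(median="median", iqr=iqr)
print(summary)
```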
sns.catplot(data = wine[['citric acid', 'volatile acidity']], kind = 'box')
The citric acid and volatile acidity levels show similar grouping and spread of data. This suggests a possible correlation to explore later.
sns.catplot(data = wine, x = 'chlorides', kind = 'box')
The plot of chlorides shows points beyond the upper whisker for the combined dataset, even though outliers were removed from each wine type separately. This is possibly due to a difference in chemical makeup between red and white wines. The relationship can be examined further to determine whether the chloride level can be used to predict the type of wine.
sns.catplot(data = wine, x = 'pH', kind = 'box')
The pH of the wine samples sits around 3.2 and has little variability across red and white wine.
sns.catplot(data = wine, x = 'residual sugar', kind = 'box', col = 'wine_type')
The residual sugar plot, separated by wine type, shows that red wine has lower residual sugar levels and very little spread. White wine residual sugar levels show comparatively high variability between samples. This is a possible factor to use later when predicting wine type from a sample's chemical composition.
sns.catplot(data = wine, x = 'fixed acidity', kind = 'box')
The plot of fixed acidity shows much higher levels than volatile acidity (plotted above). This is consistent with the general understanding of wine's chemical makeup: wine is made from fruits, which carry high levels of acids.
sns.catplot(data = wine, x = 'density', kind = 'box')
The density plot shows that the majority of wine samples have a density below 1, the density of water. This is expected because ethanol, the other main component of wine, is less dense than water.
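The direction of this effect can be checked with a rough back-of-the-envelope calculation. The sketch below assumes the alcohol column is percent by volume and uses approximate reference densities at 20 °C. It deliberately ignores sugar, other dissolved solids, and the volume contraction that occurs when ethanol and water mix, so it underestimates real wine densities and is only meant to show the trend. `ideal_mix_density` is a hypothetical helper name.

```python
# Approximate densities at 20 degrees Celsius, in g/mL.
WATER_DENSITY = 0.998
ETHANOL_DENSITY = 0.789


def ideal_mix_density(abv_percent: float) -> float:
    """Volume-weighted density of an ethanol-water mixture.

    Simplifying assumption: ideal mixing with no sugar or other solutes.
    """
    ethanol_fraction = abv_percent / 100.0
    return (
        ethanol_fraction * ETHANOL_DENSITY
        + (1.0 - ethanol_fraction) * WATER_DENSITY
    )


# Minimum and maximum alcohol values from the summary table above.
low_alcohol = ideal_mix_density(8.4)
high_alcohol = ideal_mix_density(14.2)

print(round(low_alcohol, 4), round(high_alcohol, 4))
```

Under these assumptions, both estimates fall below the density of water, and the higher-alcohol mixture is the less dense of the two.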
Correlations
plt.figure(figsize = (9, 7))
sns.heatmap(wine.drop(columns = 'wine_type').corr(), annot = True, cmap = 'vlag')
Correlation #1: There is a strong negative correlation (-0.74) between the amount of alcohol in the wine and the density of the wine. This is expected because alcohol is less dense than water (the other main component of wine), so the more alcohol the wine contains, the lower its density.
Correlation #2: There is a moderate negative correlation (-0.47) between the amount of citric acid and the level of volatile acidity. This is possibly because citric acid is a non-volatile acid.
Correlation #3: There is a strong positive correlation (0.74) between the amount of free sulfur dioxide and the amount of total sulfur dioxide. This is reasonable as the free sulfur dioxide is most likely a portion of the total sulfur dioxide.
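The strongest pairs can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. A minimal sketch, assuming a hypothetical helper named `strongest_pairs` and toy data in place of wine:

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    # Indices of the strict upper triangle, so each pair appears exactly once
    # and the diagonal of self-correlations is excluded.
    rows, columns = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, columns],
        index=pd.MultiIndex.from_arrays([cols[rows], cols[columns]]),
    )
    # Rank by magnitude while keeping the sign of each correlation.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Hypothetical toy data: z is the exact negative of x, y is loosely related.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
    "z": [-1.0, -2.0, -3.0, -4.0, -5.0],
})

top = strongest_pairs(toy, n=2)
print(top)
```

In the notebook, this could be called as strongest_pairs(wine.drop(columns = 'wine_type')).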
Discussion
Question #1: Can the type of wine (red vs. white) be determined based on the alcohol, residual sugar, citric acid, and chloride amounts in a sample?
Question #2: Can the quality of the wine be predicted from the pH along with the fixed acidity, volatile acidity, and sulphate amounts?