In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Introduction

The following datasets contain chemical characteristics of red and white vinho verde wine samples, along with an overall quality score for each sample. The datasets can be found at https://archive.ics.uci.edu/dataset/186/wine+quality

Import the files winequality-white.csv and winequality-red.csv as dataframes named white and red, respectively.

In [2]:
white = pd.read_csv('winequality-white.csv', sep = ';')
red = pd.read_csv('winequality-red.csv', sep = ';')

Preprocessing

Create a function called remove_outliers that removes outliers from a dataframe using the 1.5*IQR method.

Use the remove_outliers function to remove any outliers from white and red.

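In [3]:
def remove_outliers(df):
    # Per-column quartiles and interquartile range
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # Drop every row with at least one value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis = 1)]
    return df_out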

white = remove_outliers(white.select_dtypes(include = [np.number]))
red = remove_outliers(red.select_dtypes(include = [np.number]))
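To make the 1.5*IQR rule concrete, here is a minimal sketch on a toy dataframe. The values and the names `toy`, `lower`, `upper`, and `toy_clean` are hypothetical and are not taken from the wine files; the logic mirrors the fences used in remove_outliers.

```python
import pandas as pd

# Toy data (hypothetical values, not from the wine files)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                    "b": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Per-column quartiles and fences
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is dropped if ANY of its columns falls outside that column's fences
is_outlier_row = ((toy < lower) | (toy > upper)).any(axis=1)
toy_clean = toy[~is_outlier_row]

print(pd.DataFrame({"lower": lower, "upper": upper}))
print(toy_clean)
```

In this toy case only the row containing 100.0 in column `a` is removed, even though its value in column `b` is unremarkable. The same whole-row behaviour applies to the wine data.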

Since I want to analyze the characteristics of the combined datasets, I will remove outliers prior to combining them. Removing outliers after merging the datasets could disproportionately affect one of the two datasets, skewing any subsequent analysis.
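The effect of the ordering can be sketched with two hypothetical groups of very different size and location. The values, the helper `iqr_mask`, and the group labels below are assumptions made for illustration only. Within each group nothing is flagged, but once the groups are pooled the fences are dominated by the larger group.

```python
import pandas as pd

def iqr_mask(s):
    """Return True where a value lies outside the 1.5*IQR fences of s."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Hypothetical data: a large group centred low, a small group centred high
big = pd.DataFrame({"x": [round(1.0 + 0.1 * i, 1) for i in range(20)], "kind": "big"})
small = pd.DataFrame({"x": [9.0, 9.5], "kind": "small"})

# Outliers flagged when each group is screened on its own
flag_big = int(iqr_mask(big["x"]).sum())
flag_small = int(iqr_mask(small["x"]).sum())

# Outliers flagged when the groups are pooled first
pooled = pd.concat([big, small], ignore_index=True)
pooled_flags = pooled[iqr_mask(pooled["x"])]["kind"].value_counts()

print(flag_big, flag_small)
print(pooled_flags)
```

Here the pooled screen removes only rows from the small group, which is the kind of one-sided loss the paragraph above is guarding against.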

Add a new column called wine_type to the white and red dataframes. Combine the two dataframes into one dataframe called wine.

In [4]:
white['wine_type'] = 'white'
red['wine_type'] = 'red'

wine = pd.concat([white, red])
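One detail worth knowing about pd.concat is that it keeps each input's row labels by default, so labels can repeat in the combined frame. The sketch below uses two tiny hypothetical frames (`a` and `b`) to show the difference that `ignore_index=True` makes. Whether to renumber is a design choice, and nothing later in this notebook depends on it.

```python
import pandas as pd

# Two tiny hypothetical frames, each with the default 0..n-1 index
a = pd.DataFrame({"v": [1, 2]})
b = pd.DataFrame({"v": [3, 4]})

kept = pd.concat([a, b])                      # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # rows are renumbered from 0

print(list(kept.index))
print(list(fresh.index))
```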

Remove any rows with missing values from wine.

In [5]:
wine = wine.dropna()

Summary Data Analysis

Statistical Summary

Find the summary statistics for each quantitative variable in wine.

In [6]:
wine.describe()
Out[6]:
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality
count    5037.000000       5037.000000  5037.000000     5037.000000  5037.000000          5037.000000           5037.000000  5037.000000  5037.000000  5037.000000  5037.000000  5037.000000
mean        7.128956          0.324697     0.305950        5.408785     0.050942            30.120409            115.174509     0.994540     3.218817     0.516831    10.499715     5.815168
std         1.113448          0.150817     0.119913        4.686900     0.018869            16.174677             55.619411     0.002865     0.148287     0.120583     1.147178     0.754510
min         4.800000          0.080000     0.000000        0.600000     0.015000             1.000000              6.000000     0.987110     2.820000     0.220000     8.400000     4.000000
25%         6.400000          0.220000     0.250000        1.800000     0.038000            17.000000             78.000000     0.992150     3.120000     0.430000     9.500000     5.000000
50%         6.900000          0.280000     0.300000        2.800000     0.047000            29.000000            117.000000     0.994800     3.210000     0.500000    10.400000     6.000000
75%         7.600000          0.380000     0.380000        8.100000     0.059000            41.000000            155.000000     0.996800     3.320000     0.590000    11.300000     6.000000
max        12.300000          1.005000     0.730000       22.000000     0.119000            80.000000            255.000000     1.001960     3.680000     0.980000    14.200000     7.000000

The summary statistics above show that the majority of the numerical variables have low spread around the mean, with total sulfur dioxide, free sulfur dioxide, and residual sugar having the largest standard deviations. This is a possible area of study later to see if these attributes can be used to determine wine type.
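Because the variables are measured on different scales, raw standard deviations are hard to compare directly. One option is the coefficient of variation (std divided by mean). The sketch below copies the mean and std of a few columns from the describe() output above; the names `stats` and `ranked` are introduced here for illustration only.

```python
import pandas as pd

# Means and standard deviations copied from the describe() output above
stats = pd.DataFrame(
    {
        "mean": [7.128956, 5.408785, 30.120409, 115.174509, 10.499715],
        "std": [1.113448, 4.686900, 16.174677, 55.619411, 1.147178],
    },
    index=[
        "fixed acidity",
        "residual sugar",
        "free sulfur dioxide",
        "total sulfur dioxide",
        "alcohol",
    ],
)

# Coefficient of variation: spread relative to the size of the mean
stats["cv"] = stats["std"] / stats["mean"]
ranked = stats.sort_values("cv", ascending=False)
print(ranked)
```

With the full dataframe available, the same ratio could be computed for every column as `wine.describe().loc['std'] / wine.describe().loc['mean']`.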

In [7]:
sns.catplot(data = wine, x = 'sulphates', kind = 'box', col = 'wine_type')
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x1356f3590>
[Figure: box plots of sulphates, one panel per wine type]

Sulphate levels are generally lower in white wine than in red wine.

In [8]:
sns.catplot(data = wine[['free sulfur dioxide', 'total sulfur dioxide']], kind = 'box')
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x135867710>
[Figure: box plots of free sulfur dioxide and total sulfur dioxide]

The plot above shows free sulfur dioxide values sitting well below total sulfur dioxide values, which suggests that free sulfur dioxide is a portion of total sulfur dioxide.
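A box plot compares the two distributions but not individual samples. A row-wise check is one way to test the "portion of" reading more directly. The sketch below uses a small hypothetical frame named `so2`; with the real data, the same two expressions could be applied to the corresponding columns of wine.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names
so2 = pd.DataFrame({
    "free sulfur dioxide": [17.0, 29.0, 41.0],
    "total sulfur dioxide": [78.0, 117.0, 155.0],
})

# If free SO2 is a component of total SO2, it should never exceed it in any row
all_within = bool((so2["free sulfur dioxide"] <= so2["total sulfur dioxide"]).all())

# Share of the total that is free, per sample
share_free = so2["free sulfur dioxide"] / so2["total sulfur dioxide"]

print(all_within)
print(share_free.round(3))
```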

In [9]:
sns.catplot(data = wine, x = 'alcohol', kind = 'box', col = 'quality')
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x1357a7990>
[Figure: box plots of alcohol, one panel per quality score]

Plotting alcohol level by quality gives an initial view of their relationship. Higher-quality wines tend to cluster at higher alcohol levels, with no noticeable increase in spread.
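The visual impression from the faceted box plots can be backed by a grouped numeric summary. This is a minimal sketch on hypothetical values (the frame `toy` is not drawn from the dataset and does not represent a result). With the real data, the same `groupby` call could be run on wine.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "alcohol": [9.4, 9.8, 10.2, 10.8, 11.4, 12.0],
})

# Centre (median), spread (std), and group size for each quality score
by_quality = toy.groupby("quality")["alcohol"].agg(["median", "std", "count"])
print(by_quality)
```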

In [10]:
sns.catplot(data = wine[['citric acid', 'volatile acidity']], kind = 'box')
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x135b11050>
[Figure: box plots of citric acid and volatile acidity]

The citric acid and volatile acidity levels show similar grouping and spread of data. This suggests a possible correlation to be explored later.

In [11]:
sns.catplot(data = wine, x = 'chlorides', kind = 'box')
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x135a8e4d0>
[Figure: box plot of chlorides]

The plot of chlorides shows points beyond the upper whisker for the combined dataset, even though outliers were removed from each dataset separately. This is possibly due to a difference in chemical makeup between red and white wines. The relationship can be further examined to determine if the chloride level can be used to predict the type of wine.

In [12]:
sns.catplot(data = wine, x = 'pH', kind = 'box')
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x135c0f990>
[Figure: box plot of pH]

The pH of the wine samples sits around 3.2 and has little variability across red and white wine.

In [13]:
sns.catplot(data = wine, x = 'residual sugar', kind = 'box', col = 'wine_type')
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x135c5afd0>
[Figure: box plots of residual sugar, one panel per wine type]

The residual sugar plot, separated by wine type, shows that red wine has lower residual sugar levels and very little spread. White wine residual sugar levels show comparatively high variability between samples. This is a possible factor to be used later in the prediction of wine type from a sample's chemical composition.

In [14]:
sns.catplot(data = wine, x = 'fixed acidity', kind = 'box')
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x135d27990>
[Figure: box plot of fixed acidity]

The plot of fixed acidity shows much higher levels than volatile acidity (plotted above). This is consistent with the general understanding of wine chemistry: wine is made from fruit, which carries high levels of acids.

In [15]:
sns.catplot(data = wine, x = 'density', kind = 'box')
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x135d50850>
[Figure: box plot of density]

The density plot shows that the majority of wine samples had a density below 1, the density of water. This is expected, as ethanol, the other main component of wine, is less dense than water.
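The direction of this effect can be checked with a back-of-the-envelope estimate. The sketch below assumes that the alcohol column is a percentage by volume, that volumes mix ideally, and that wine is only water and ethanol. These are simplifying assumptions for illustration, not a model taken from the dataset. The densities are the usual reference values near 20 °C, and `ideal_mix_density` is a hypothetical helper.

```python
# Rough ideal-mixing estimate of the density of a water-ethanol mixture.
RHO_ETHANOL = 0.789  # g/cm^3, near 20 C
RHO_WATER = 0.998    # g/cm^3, near 20 C

def ideal_mix_density(alcohol_pct_by_volume):
    """Volume-weighted average density; ignores sugars, acids, and non-ideal mixing."""
    phi = alcohol_pct_by_volume / 100.0
    return phi * RHO_ETHANOL + (1.0 - phi) * RHO_WATER

# 10.5 is close to the mean alcohol value in the summary table
estimate = ideal_mix_density(10.5)
print(round(estimate, 4))
```

The estimate comes out below the mean density reported in the summary table (0.9945). That gap is plausible because dissolved sugars and acids, which this sketch ignores, would raise the density of real wine.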

Correlations

In [16]:
plt.figure(figsize = (9, 7))
sns.heatmap(wine.drop(columns = 'wine_type').corr(), annot = True, cmap = 'vlag')
Out[16]:
<Axes: >
[Figure: annotated correlation heatmap of the numeric columns in wine]

Correlation #1: There is a strong negative correlation (-0.74) between the amount of alcohol in the wine and the density of the wine. This can be expected, as alcohol is less dense than water (the other main component of wine), so the more alcohol the wine contains, the lower its density.

Correlation #2: There is a moderate negative correlation (-0.47) between the amount of citric acid and the level of volatile acidity. This is possibly because citric acid is not a volatile acid.

Correlation #3: There is a strong positive correlation (0.74) between the amount of free sulfur dioxide and the amount of total sulfur dioxide. This is reasonable as the free sulfur dioxide is most likely a portion of the total sulfur dioxide.
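Reading the strongest pairs off an annotated heatmap can be error-prone, so a sorted list of pairs is a useful companion. The sketch below runs on a small synthetic frame (`toy`, random values generated here, unrelated to the wine data). With the real data, `corr` would instead be `wine.drop(columns = 'wine_type').corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "neg" is built to be strongly negatively related to "x"
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
toy = pd.DataFrame({
    "x": x,
    "neg": -x + rng.normal(scale=0.1, size=n),
    "noise": rng.normal(size=n),
})

corr = toy.corr()

# Keep each pair once: the upper triangle, excluding the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```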

Discussion

Question #1: Can the type of wine (red vs. white) be determined based on the alcohol, residual sugar, citric acid, and chloride amounts in a sample?

Question #2: Can the quality of the wine be predicted from the pH along with the fixed acidity, volatile acidity, and sulphate amounts?