In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as sk

Introduction

The following data was pulled from data.gov at https://catalog.data.gov/dataset/water-quality-data-41c5e/resource/c013c8da-49d3-4898-93a5-f6c0f0e95a0d. It describes the Back Bay Water Quality

In [2]:
watq = pd.read_csv('waterqual.csv')
watq
Out[2]:
Site_Id Unit_Id Read_Date Salinity (ppt) Dissolved Oxygen (mg/L) pH (standard units) Secchi Depth (m) Water Depth (m) Water Temp (?C) Air Temp-Celsius Air Temp (?F) Time (24:00) Field_Tech DateVerified WhoVerified AirTemp (C) Year
0 Bay NaN 1/3/1994 1.3 11.7 7.3 0.40 0.40 5.9 8.0 46.40 11:00 NaN NaN NaN 8.000000 1994
1 Bay NaN 1/31/1994 1.5 12.0 7.4 0.20 0.35 3.0 2.6 36.68 11:30 NaN NaN NaN 2.600000 1994
2 Bay NaN 2/7/1994 1.0 10.5 7.2 0.25 0.60 5.9 7.6 45.68 9:45 NaN NaN NaN 7.600000 1994
3 Bay NaN 2/23/1994 1.0 10.1 7.4 0.35 0.50 10.0 2.7 36.86 NaN NaN NaN NaN 2.700000 1994
4 Bay NaN 2/28/1994 1.0 12.6 7.2 0.20 0.40 1.6 0.0 32.00 10:30 NaN NaN NaN 0.000000 1994
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2366 Bay NaN 10/11/2018 1.9 5.0 7.0 4.00 1.20 25.0 NaN 78.00 09:30 Sue Poe 11/13/2019 Christine Folks 25.555556 2018
2367 Bay NaN 10/24/2018 0.0 9.0 7.0 0.30 0.60 18.0 NaN 58.00 09:30 Sue Poe 11/13/2019 Christine Folks 14.444444 2018
2368 Bay NaN 10/28/2018 0.9 2.9 7.0 0.40 0.90 13.0 NaN 49.00 09:20 Sue Poe 11/13/2019 Christine Folks 9.444444 2018
2369 Bay NaN 11/7/2018 1.7 NaN 7.0 0.45 0.90 20.0 NaN 65.00 09:45 Sue Poe 11/13/2019 Christine Folks 18.333333 2018
2370 Bay NaN 12/11/2018 0.1 NaN 7.0 0.10 0.10 10.0 NaN 42.00 09:40 Sue Poe 11/13/2019 Christine Folks 5.555556 2018

2371 rows × 17 columns

Preprocessing

Next we drop some columns so we can look at site id, salinity, dissolved oxygen, ph, secchi depth, water depth, water temp, and year. Then, eliminate rows with NaN values

In [3]:
watq = watq.drop(columns=['Unit_Id','Read_Date','Field_Tech', 'DateVerified', 'WhoVerified','AirTemp (C)', 'Air Temp-Celsius', 'Air Temp (?F)', 'Time (24:00)'], axis=1)
watq = watq.dropna()
watq
Out[3]:
Site_Id Salinity (ppt) Dissolved Oxygen (mg/L) pH (standard units) Secchi Depth (m) Water Depth (m) Water Temp (?C) Year
0 Bay 1.3 11.7 7.3 0.40 0.40 5.9 1994
1 Bay 1.5 12.0 7.4 0.20 0.35 3.0 1994
2 Bay 1.0 10.5 7.2 0.25 0.60 5.9 1994
3 Bay 1.0 10.1 7.4 0.35 0.50 10.0 1994
4 Bay 1.0 12.6 7.2 0.20 0.40 1.6 1994
... ... ... ... ... ... ... ... ...
2361 D 0.0 6.0 6.5 0.70 1.20 26.0 2018
2364 D 0.0 6.9 6.5 0.90 1.30 20.0 2018
2366 Bay 1.9 5.0 7.0 4.00 1.20 25.0 2018
2367 Bay 0.0 9.0 7.0 0.30 0.60 18.0 2018
2368 Bay 0.9 2.9 7.0 0.40 0.90 13.0 2018

1320 rows × 8 columns

Below is a boxplot grouped by Site_Id according to salinity

In [4]:
sns.boxplot(x='Site_Id', y='Salinity (ppt)', data=watq)
Out[4]:
<Axes: xlabel='Site_Id', ylabel='Salinity (ppt)'>
No description has been provided for this image

Finding the number of outliers according to Site Id

In [5]:
def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    I = Q75 - Q25
    return ((x < (Q25 - 1.5*I)) |  (x > (Q75 + 1.5*I)))
    
outliers = watq.groupby("Site_Id")['Salinity (ppt)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
A      40
B      18
D      18
C       6
Bay     3
Name: count, dtype: int64

Dissolved oxygen analysis

In [6]:
sns.boxplot(x='Site_Id', y='Dissolved Oxygen (mg/L)', data=watq)
Out[6]:
<Axes: xlabel='Site_Id', ylabel='Dissolved Oxygen (mg/L)'>
No description has been provided for this image
In [7]:
outliers = watq.groupby("Site_Id")['Dissolved Oxygen (mg/L)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
C      4
Bay    2
Name: count, dtype: int64

pH analysis

In [8]:
sns.boxplot(x='Site_Id', y='pH (standard units)', data=watq)
Out[8]:
<Axes: xlabel='Site_Id', ylabel='pH (standard units)'>
No description has been provided for this image
In [9]:
outliers = watq.groupby("Site_Id")['pH (standard units)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
D      77
A      25
Bay     3
B       2
C       1
Name: count, dtype: int64

Secchi depth analysis

In [10]:
sns.boxplot(x='Site_Id', y='Secchi Depth (m)', data=watq)
Out[10]:
<Axes: xlabel='Site_Id', ylabel='Secchi Depth (m)'>
No description has been provided for this image
In [11]:
outliers = watq.groupby("Site_Id")['Secchi Depth (m)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
Bay    22
A       5
D       4
B       3
C       1
Name: count, dtype: int64

Water depth analysis

In [12]:
sns.boxplot(x='Site_Id', y='Water Depth (m)', data=watq)
Out[12]:
<Axes: xlabel='Site_Id', ylabel='Water Depth (m)'>
No description has been provided for this image
In [13]:
outliers = watq.groupby("Site_Id")['Water Depth (m)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
Bay    16
D       8
B       4
C       3
A       2
Name: count, dtype: int64

Water temp analysis

In [14]:
sns.boxplot(x='Site_Id', y='Water Temp (?C)', data=watq)
Out[14]:
<Axes: xlabel='Site_Id', ylabel='Water Temp (?C)'>
No description has been provided for this image
In [15]:
outliers = watq.groupby("Site_Id")['Water Temp (?C)'].transform(is_outlier)
q = watq.loc[outliers, 'Site_Id'].value_counts()

print(q)
Site_Id
Bay    2
A      1
D      1
Name: count, dtype: int64

Further outlier analysis

In [16]:
watq.describe()
Out[16]:
Salinity (ppt) Dissolved Oxygen (mg/L) pH (standard units) Secchi Depth (m) Water Depth (m) Water Temp (?C) Year
count 1320.000000 1320.000000 1320.000000 1320.000000 1320.000000 1320.000000 1320.000000
mean 0.923023 6.731606 7.211553 0.497761 0.749390 18.026902 2005.139394
std 1.373680 2.520411 0.807654 0.477977 0.576336 8.302714 8.590170
min 0.000000 0.000000 0.300000 0.000000 0.010000 0.000000 1899.000000
25% 0.000000 4.800000 6.500000 0.300000 0.442500 11.000000 1999.000000
50% 0.000000 6.600000 7.000000 0.400000 0.680000 18.900000 2006.000000
75% 1.500000 8.612500 7.500000 0.600000 0.900000 25.000000 2012.000000
max 9.000000 15.100000 9.900000 7.400000 7.500000 74.000000 2019.000000

Summary Data analysis

Salinity

In [17]:
sns.displot(data=watq, x = 'Salinity (ppt)', hue='Site_Id', kind='kde')
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x129fd2550>
No description has been provided for this image

Dissolved oxygen

In [18]:
sns.displot(data=watq, x = 'Dissolved Oxygen (mg/L)', hue='Site_Id', kind='kde')
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x12a087c90>
No description has been provided for this image

pH

In [19]:
sns.displot(data=watq, x = 'pH (standard units)', hue='Site_Id', kind='kde')
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x129d65790>
No description has been provided for this image

Secchi depth

In [20]:
sns.displot(data=watq, x = 'Secchi Depth (m)', hue='Site_Id', kind='kde')
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x12a016a10>
No description has been provided for this image

Water depth

In [21]:
sns.displot(data=watq, x = 'Water Depth (m)', hue='Site_Id', kind='kde')
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x129f5c6d0>
No description has been provided for this image

Water temp

In [22]:
sns.displot(data=watq, x = 'Water Temp (?C)', hue='Site_Id', kind='kde')
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x12a2f7510>
No description has been provided for this image

Now we make a dataframe of Pearson correlation coefficients between all of the columns

In [23]:
cols = watq.columns[1:]
pcorr = watq[cols].corr(method='pearson')
pcorr
Out[23]:
Salinity (ppt) Dissolved Oxygen (mg/L) pH (standard units) Secchi Depth (m) Water Depth (m) Water Temp (?C) Year
Salinity (ppt) 1.000000 0.330929 0.305956 -0.151324 -0.066474 0.004317 -0.626121
Dissolved Oxygen (mg/L) 0.330929 1.000000 0.068198 -0.061994 -0.017595 -0.526239 -0.380029
pH (standard units) 0.305956 0.068198 1.000000 -0.114660 -0.136254 0.181488 -0.306841
Secchi Depth (m) -0.151324 -0.061994 -0.114660 1.000000 0.850086 0.024603 0.146823
Water Depth (m) -0.066474 -0.017595 -0.136254 0.850086 1.000000 0.068967 0.041250
Water Temp (?C) 0.004317 -0.526239 0.181488 0.024603 0.068967 1.000000 0.036493
Year -0.626121 -0.380029 -0.306841 0.146823 0.041250 0.036493 1.000000

Next, finding correlation between dissolved oxygen and water depth

In [24]:
corr = watq.groupby('Site_Id')[['Dissolved Oxygen (mg/L)', 'Water Depth (m)']].corr()
print(corr)
                                 Dissolved Oxygen (mg/L)  Water Depth (m)
Site_Id                                                                  
A       Dissolved Oxygen (mg/L)                 1.000000        -0.046860
        Water Depth (m)                        -0.046860         1.000000
B       Dissolved Oxygen (mg/L)                 1.000000         0.049660
        Water Depth (m)                         0.049660         1.000000
Bay     Dissolved Oxygen (mg/L)                 1.000000        -0.072035
        Water Depth (m)                        -0.072035         1.000000
C       Dissolved Oxygen (mg/L)                 1.000000        -0.109684
        Water Depth (m)                        -0.109684         1.000000
D       Dissolved Oxygen (mg/L)                 1.000000         0.100906
        Water Depth (m)                         0.100906         1.000000

Next, salinity and pH

In [25]:
corr = watq.groupby('Site_Id')[['Salinity (ppt)', 'pH (standard units)']].corr()
print(corr)
                             Salinity (ppt)  pH (standard units)
Site_Id                                                         
A       Salinity (ppt)             1.000000             0.269008
        pH (standard units)        0.269008             1.000000
B       Salinity (ppt)             1.000000             0.352222
        pH (standard units)        0.352222             1.000000
Bay     Salinity (ppt)             1.000000             0.245911
        pH (standard units)        0.245911             1.000000
C       Salinity (ppt)             1.000000             0.397169
        pH (standard units)        0.397169             1.000000
D       Salinity (ppt)             1.000000            -0.129337
        pH (standard units)       -0.129337             1.000000

Secchi depth and water temp

In [26]:
corr = watq.groupby('Site_Id')[['Secchi Depth (m)', 'Water Temp (?C)']].corr()
print(corr)
                          Secchi Depth (m)  Water Temp (?C)
Site_Id                                                    
A       Secchi Depth (m)          1.000000        -0.063283
        Water Temp (?C)          -0.063283         1.000000
B       Secchi Depth (m)          1.000000         0.031976
        Water Temp (?C)           0.031976         1.000000
Bay     Secchi Depth (m)          1.000000         0.119465
        Water Temp (?C)           0.119465         1.000000
C       Secchi Depth (m)          1.000000         0.064327
        Water Temp (?C)           0.064327         1.000000
D       Secchi Depth (m)          1.000000        -0.410634
        Water Temp (?C)          -0.410634         1.000000

Discussion

Do pH, salinity, water depth, and water temperature determine which fish live in a body of water?

Can water temperature, pH, salinity, and dissolved oxygen predict clutch survival rates?