Introduction

This dataset contains the Elo ratings for every NBA game of the 2022-2023 season. Elo measures the relative strength of each team going into a game. RAPTOR is a plus-minus statistic that measures the number of points a player contributes to his team's offense and defense per 100 possessions, relative to a league-average player. A team's RAPTOR score is essentially an aggregate of the RAPTOR ratings of all of the players on the team. The dataset comes from https://data.fivethirtyeight.com/, where it was compiled using information from Basketball-Reference.com.
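To make the link between an Elo rating and a win probability concrete, here is a minimal sketch of the standard Elo expected-score formula. Two things in it are assumptions that the notebook does not state: that team1 is the home team, and that the home team receives a 100-point bonus (the hypothetical `home_advantage` parameter). Under those assumptions the formula closely reproduces the `elo_prob1` value of the first game in the preview below, but it should be read as an illustration rather than as the documented method behind the dataset.

```python
def elo_win_prob(elo_home, elo_away, home_advantage=100.0):
    """Standard Elo expected score for the home team.

    home_advantage is an assumed bonus, in Elo points, added to the
    home team's rating before the two ratings are compared.
    """
    diff = elo_home + home_advantage - elo_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


# First game in the preview: BOS (team1) against PHI (team2).
p1 = elo_win_prob(1657.639749, 1582.247327)
print(round(p1, 3), round(1 - p1, 3))
```

The second probability is simply one minus the first, which is why `elo_prob1` and `elo_prob2` always sum to 1.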

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
#code to import the data from a local file
elo = pd.read_csv("nba_elo_latest (1).csv")
elo.head()
Out[ ]:
date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1 elo_prob2 ... carm-elo2_post raptor1_pre raptor2_pre raptor_prob1 raptor_prob2 score1 score2 quality importance total_rating
0 2022-10-18 2023 0 NaN BOS PHI 1657.639749 1582.247327 0.732950 0.267050 ... NaN 1693.243079 1641.876729 0.670612 0.329388 126 117 96 13 55
1 2022-10-18 2023 0 NaN GSW LAL 1660.620307 1442.352444 0.862011 0.137989 ... NaN 1615.718147 1472.173711 0.776502 0.223498 123 109 67 20 44
2 2022-10-19 2023 0 NaN IND WAS 1399.201934 1440.077372 0.584275 0.415725 ... NaN 1462.352663 1472.018225 0.599510 0.400490 107 114 37 28 33
3 2022-10-19 2023 0 NaN DET ORL 1393.525172 1366.089249 0.675590 0.324410 ... NaN 1308.969909 1349.865183 0.563270 0.436730 113 109 3 1 2
4 2022-10-19 2023 0 NaN ATL HOU 1535.408152 1351.164973 0.837022 0.162978 ... NaN 1618.256817 1283.328356 0.917651 0.082349 117 107 24 1 13

5 rows × 27 columns

Preprocessing

I removed the neutral column because it didn't seem meaningful, especially since only 2 games were played at neutral sites. I also removed the carm-elo columns because many of the values for carm-elo_prob1 were missing, and I didn't want to use a metric without access to all of its information. To improve readability, I replaced the NaN values in the playoff column with 'No' to mark games that were not playoff games. After cleaning, no missing values remained. The outlier analysis, which uses the 1.5 × IQR rule, flagged relatively few games in each column: at most 32 of the 1,320 games (in raptor1_pre), and none in quality, importance, or total_rating.

In [ ]:
#Removing columns that are not meaningful or otherwise of interest.
elo.drop(columns=["neutral", "carm-elo1_post", "carm-elo2_post", "carm-elo1_pre","carm-elo2_pre","carm-elo_prob1","carm-elo_prob2"], inplace=True)
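#Replacing the NaN values in the "playoff" column with "No"
#(assigning the result back avoids the chained-assignment FutureWarning raised by inplace=True)
elo["playoff"] = elo["playoff"].fillna("No")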
#Detecting the number of remaining missing values
missing = elo.isnull().sum()
print(missing)
#Outlier analysis on all quantitative columns
def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    I = Q75 - Q25
    return (x < Q25 - 1.5*I) |  (x > Q75 + 1.5*I)

columns = ['elo1_pre', 'elo2_pre', 'elo_prob1', 
           'elo_prob2', 'elo1_post', 'elo2_post', 
           'raptor1_pre', 'raptor2_pre', 'raptor_prob1', 
           'raptor_prob2', 'score1', 'score2', 
           'quality', 'importance', 'total_rating']

outliers = elo[columns].apply(is_outlier)
for column in columns:
    column_outliers = elo.loc[outliers[column], column]
    if not column_outliers.empty:
        print(f"Outliers in {column}:")
        print(column_outliers)
date            0
season          0
playoff         0
team1           0
team2           0
elo1_pre        0
elo2_pre        0
elo_prob1       0
elo_prob2       0
elo1_post       0
elo2_post       0
raptor1_pre     0
raptor2_pre     0
raptor_prob1    0
raptor_prob2    0
score1          0
score2          0
quality         0
importance      0
total_rating    0
dtype: int64
Outliers in elo1_pre:
925     1273.897216
936     1268.439667
966     1282.097978
999     1272.433788
1009    1283.730645
1023    1278.344960
1031    1283.114640
1037    1288.076684
1122    1284.988155
1141    1288.679847
1156    1278.277520
1170    1283.499466
1175    1271.268580
1184    1274.145674
1189    1264.103229
Name: elo1_pre, dtype: float64
Outliers in elo2_pre:
871     1284.832073
889     1280.529922
898     1281.004146
907     1276.710512
915     1278.571482
928     1273.318217
957     1263.861085
1117    1284.216629
1123    1282.666952
1137    1280.151782
1140    1282.069303
1156    1281.234452
1157    1279.095230
1166    1276.012506
1169    1276.969243
1202    1257.300726
1218    1274.045202
Name: elo2_pre, dtype: float64
Outliers in elo_prob1:
29      0.227572
925     0.193022
936     0.210668
999     0.208926
1023    0.193097
1122    0.181341
Name: elo_prob1, dtype: float64
Outliers in elo_prob2:
29      0.772428
925     0.806978
936     0.789332
999     0.791074
1023    0.806903
1122    0.818659
Name: elo_prob2, dtype: float64
Outliers in elo1_post:
925     1268.439667
936     1263.861085
980     1287.126463
1009    1278.344960
1017    1283.114640
1122    1282.069303
1141    1279.095230
1156    1283.499466
1170    1274.145674
1175    1264.103229
1189    1257.300726
Name: elo1_post, dtype: float64
Outliers in elo2_post:
871     1280.529922
877     1281.004146
889     1276.710512
898     1278.571482
907     1273.318217
915     1273.897216
957     1282.097978
966     1272.433788
991     1283.730645
1099    1284.216629
1117    1282.666952
1123    1280.151782
1137    1278.277520
1140    1281.234452
1156    1276.012506
1157    1276.969243
1166    1271.268580
1202    1274.045202
1218    1271.086891
Name: elo2_post, dtype: float64
Outliers in raptor1_pre:
528     1208.275227
925     1209.889285
977     1140.477612
993     1224.574907
1017    1232.515467
1019    1166.190125
1031    1224.878250
1039    1168.322826
1041    1104.314130
1065    1185.354357
1102    1144.167095
1119    1132.252814
1122    1232.547678
1127    1061.312436
1141    1223.596996
1143    1046.663372
1148    1172.159365
1158    1107.921186
1162    1074.626883
1175    1167.341205
1177    1024.313010
1178    1163.901638
1189    1062.388972
1199    1175.906628
1201    1088.041119
1212     958.273219
1213    1060.073139
1215    1066.924234
1225    1229.920233
1226    1112.922253
1227    1230.840066
1228     955.234235
Name: raptor1_pre, dtype: float64
Outliers in raptor2_pre:
387     1137.854781
458     1192.691021
835     1126.294753
855     1200.173053
889     1182.649776
891     1182.131393
907     1211.762478
1026    1104.983912
1077    1137.428595
1079    1155.126758
1089    1139.639119
1096    1193.365915
1098    1154.092456
1141    1146.391794
1164    1100.554118
1183     962.299324
1187    1132.968397
1191    1162.038793
1198     945.804075
1200    1147.078894
1202    1149.753022
1214     999.541687
1216    1164.064601
1218    1165.345744
1227    1073.199905
Name: raptor2_pre, dtype: float64
Outliers in raptor_prob1:
1041    0.071186
1127    0.085744
1143    0.087815
1158    0.112300
1162    0.073751
1177    0.057100
1178    0.083423
1189    0.100047
1212    0.059204
1213    0.066333
1228    0.027498
Name: raptor_prob1, dtype: float64
Outliers in raptor_prob2:
1041    0.928814
1127    0.914256
1143    0.912185
1158    0.887700
1162    0.926249
1177    0.942900
1178    0.916577
1189    0.899953
1212    0.940796
1213    0.933667
1228    0.972502
Name: raptor_prob2, dtype: float64
Outliers in score1:
209     153
402      82
448     150
561     150
797     153
900     175
1046     82
1099    151
1143     80
Name: score1, dtype: int64
Outliers in score2:
706     150
900     176
971     147
1135    149
1213    151
1228    157
1243     80
1256     79
Name: score2, dtype: int64
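The per-column listings above can be condensed into one row per column, which makes it easier to judge how rare the flagged games are. The sketch below applies the same 1.5 × IQR rule to a small made-up table; the `toy` frame, its values, and the `summary` name are illustrative assumptions and not part of the dataset.

```python
import pandas as pd


def is_outlier(x):
    """Flag values more than 1.5 * IQR outside the quartiles."""
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)


# Toy stand-in for the real table (values are made up for illustration).
toy = pd.DataFrame({
    "score1": [108, 110, 112, 115, 116, 118, 120, 124, 175],
    "score2": [105, 107, 109, 111, 113, 115, 117, 119, 121],
})

# One boolean column per input column, then a count and a share per column.
flags = toy.apply(is_outlier)
summary = pd.DataFrame({"n_outliers": flags.sum(), "share": flags.mean()})
print(summary)
```

On the real data, replacing `toy` with `elo[columns]` should give the same counts as the listings above, in a single table.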

Summary data analysis

The means of the Elo ratings (about 1511 for both teams) and of the RAPTOR ratings (about 1504 and 1499) lie close together, and the standard deviations are small relative to those means (about 89 for Elo and 117 for RAPTOR), which suggests that both metrics are balanced and consistent. There is a strong positive correlation between a team's Elo and its own Elo win probability, so as a team's Elo increases, its win probability also increases. There is a strong negative correlation between a team's Elo and its opponent's win probability, so when a team's Elo increases, its opponent's win probability decreases. There is a perfect negative correlation between a team's win probability and its opponent's, because the two probabilities always sum to 1: as one team's odds increase, the other's decrease linearly.
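The perfect negative correlation between the two win probabilities follows directly from the fact that they sum to 1. As a quick check, the sketch below computes the Pearson correlation for the five `elo_prob1` and `elo_prob2` values shown in the preview at the top of the notebook; using `np.corrcoef` here is a choice made for illustration, not the notebook's own method.

```python
import numpy as np

# elo_prob1 and elo_prob2 for the first five games in the preview.
p1 = np.array([0.732950, 0.862011, 0.584275, 0.675590, 0.837022])
p2 = np.array([0.267050, 0.137989, 0.415725, 0.324410, 0.162978])

# Each pair sums to 1, so p2 is an exact linear function of p1.
print(np.allclose(p1 + p2, 1.0))

# Pearson correlation between the two columns.
r = np.corrcoef(p1, p2)[0, 1]
print(r)
```

Because `p2 = 1 - p1`, the correlation is -1 up to floating-point rounding regardless of which games are included.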

In [ ]:
#Statistical summary of every column using numerical components
for col in columns:
    print(elo[col].describe())
#Statistical summary of every column using graphical components    
big_cols = ['elo1_pre', 'elo2_pre', 'elo1_post', 'elo2_post', 'raptor1_pre', 'raptor2_pre']
prob_cols = ['raptor_prob1', 'raptor_prob2', 'elo_prob1', 'elo_prob2']    
small_cols = ['score1', 'score2', 'quality', 'importance', 'total_rating']
sns.catplot(data=elo[big_cols], kind="box", height=3, aspect=2)
sns.catplot(data=elo[prob_cols], kind="box", height=3, aspect=2)
sns.catplot(data=elo[small_cols], kind="box", height=3, aspect=2)
count    1320.000000
mean     1511.867655
std        88.962661
min      1264.103229
25%      1461.568345
50%      1524.876508
75%      1576.453207
max      1705.343075
Name: elo1_pre, dtype: float64
count    1320.000000
mean     1511.311315
std        89.571409
min      1257.300726
25%      1460.866530
50%      1523.136739
75%      1577.825540
max      1719.448667
Name: elo2_pre, dtype: float64
count    1320.000000
mean        0.627059
std         0.151149
min         0.181341
25%         0.533438
50%         0.640681
75%         0.736876
max         0.933707
Name: elo_prob1, dtype: float64
count    1320.000000
mean        0.372941
std         0.151149
min         0.066293
25%         0.263124
50%         0.359319
75%         0.466562
max         0.818659
Name: elo_prob2, dtype: float64
count    1320.000000
mean     1510.664651
std        89.560581
min      1257.300726
25%      1460.990452
50%      1521.917992
75%      1576.753366
max      1705.343075
Name: elo1_post, dtype: float64
count    1320.000000
mean     1512.514319
std        89.306155
min      1271.086891
25%      1460.476484
50%      1525.368845
75%      1577.663170
max      1719.448667
Name: elo2_post, dtype: float64
count    1320.000000
mean     1503.864076
std       116.590601
min       955.234235
25%      1445.292160
50%      1523.778627
75%      1585.715871
max      1733.775148
Name: raptor1_pre, dtype: float64
count    1320.000000
mean     1499.075101
std       116.514412
min       945.804075
25%      1432.346974
50%      1522.906299
75%      1579.358239
max      1728.915073
Name: raptor2_pre, dtype: float64
count    1320.000000
mean        0.603891
std         0.187301
min         0.027498
25%         0.490977
50%         0.617850
75%         0.742625
max         0.982744
Name: raptor_prob1, dtype: float64
count    1320.000000
mean        0.396109
std         0.187301
min         0.017256
25%         0.257375
50%         0.382150
75%         0.509023
max         0.972502
Name: raptor_prob2, dtype: float64
count    1320.000000
mean      115.630303
std        11.991075
min        80.000000
25%       108.000000
50%       116.000000
75%       124.000000
max       175.000000
Name: score1, dtype: float64
count    1320.000000
mean      113.030303
std        12.001920
min        79.000000
25%       105.000000
50%       113.000000
75%       121.000000
max       176.000000
Name: score2, dtype: float64
count    1320.000000
mean       50.511364
std        27.217232
min         0.000000
25%        29.000000
50%        52.000000
75%        73.000000
max        99.000000
Name: quality, dtype: float64
count    1320.000000
mean       32.458333
std        29.408726
min         0.000000
25%         9.000000
50%        24.000000
75%        49.000000
max       100.000000
Name: importance, dtype: float64
count    1320.000000
mean       41.744697
std        24.238657
min         0.000000
25%        21.000000
50%        43.500000
75%        57.000000
max       100.000000
Name: total_rating, dtype: float64
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x153dcbb90>
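[Figure: box plots of elo1_pre, elo2_pre, elo1_post, elo2_post, raptor1_pre, and raptor2_pre]
[Figure: box plots of raptor_prob1, raptor_prob2, elo_prob1, and elo_prob2]
[Figure: box plots of score1, score2, quality, importance, and total_rating]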
In [ ]:
#Correlations between quantitative columns (the heatmap would not fit above)
pairs = elo[['elo1_pre', 'elo2_pre', 'score1', 'score2', 'elo_prob1', 'elo_prob2']]
C = pairs.corr()
sns.heatmap(C, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
Out[ ]:
<Axes: >
[Figure: correlation heatmap of elo1_pre, elo2_pre, score1, score2, elo_prob1, and elo_prob2]

Discussion

Two questions about the dataset to explore in future projects:

  • Categorical outcome: Can Elo, Elo probability, RAPTOR, and RAPTOR probability predict whether a team makes the playoffs?

  • Quantitative outcome: What percentage of important games (games with an importance of 75 or greater) can Elo, Elo probability, RAPTOR, and RAPTOR probability correctly predict?
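As a starting point for the second question, here is a minimal sketch of how the share of correctly predicted important games could be computed from the Elo probability alone. The rows are made up and only reuse the notebook's column names, and treating `elo_prob1 > 0.5` as a prediction that team1 wins is an assumption about how "correctly predict" would be defined.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
games = pd.DataFrame({
    "elo_prob1":  [0.73, 0.41, 0.58, 0.66, 0.30],
    "score1":     [126, 101, 107, 113, 99],
    "score2":     [117, 110, 114, 109, 104],
    "importance": [80, 90, 75, 20, 95],
})

# Keep only games with an importance of 75 or greater.
important = games[games["importance"] >= 75]

# Assumed decision rule: predict a team1 win when its probability exceeds 0.5.
predicted_team1_win = important["elo_prob1"] > 0.5
actual_team1_win = important["score1"] > important["score2"]

# Share of important games where the prediction matches the result.
accuracy = (predicted_team1_win == actual_team1_win).mean()
print(f"{accuracy:.0%} of important games predicted correctly")
```

Swapping `elo_prob1` for `raptor_prob1` would give the matching figure for RAPTOR, so the two metrics could be compared on the same set of games.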