Introduction¶
This is a dataset of the elo ratings for every NBA game from the 2022-2023 season. Elo measures the relative strength of the two teams in each game. RAPTOR is a plus-minus statistic that measures the number of points a player contributes to his team's offense and defense per 100 possessions, relative to a league-average player. A team's RAPTOR score is essentially an aggregate of the RAPTOR ratings of all of the players on the team. The dataset came from https://data.fivethirtyeight.com/, where it was compiled using information from Basketball-Reference.com.
import numpy as np
import pandas as pd
import seaborn as sns
#code to import the data from a local file
elo = pd.read_csv("nba_elo_latest (1).csv")
elo.head()
 | date | season | neutral | playoff | team1 | team2 | elo1_pre | elo2_pre | elo_prob1 | elo_prob2 | ... | carm-elo2_post | raptor1_pre | raptor2_pre | raptor_prob1 | raptor_prob2 | score1 | score2 | quality | importance | total_rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-10-18 | 2023 | 0 | NaN | BOS | PHI | 1657.639749 | 1582.247327 | 0.732950 | 0.267050 | ... | NaN | 1693.243079 | 1641.876729 | 0.670612 | 0.329388 | 126 | 117 | 96 | 13 | 55 |
1 | 2022-10-18 | 2023 | 0 | NaN | GSW | LAL | 1660.620307 | 1442.352444 | 0.862011 | 0.137989 | ... | NaN | 1615.718147 | 1472.173711 | 0.776502 | 0.223498 | 123 | 109 | 67 | 20 | 44 |
2 | 2022-10-19 | 2023 | 0 | NaN | IND | WAS | 1399.201934 | 1440.077372 | 0.584275 | 0.415725 | ... | NaN | 1462.352663 | 1472.018225 | 0.599510 | 0.400490 | 107 | 114 | 37 | 28 | 33 |
3 | 2022-10-19 | 2023 | 0 | NaN | DET | ORL | 1393.525172 | 1366.089249 | 0.675590 | 0.324410 | ... | NaN | 1308.969909 | 1349.865183 | 0.563270 | 0.436730 | 113 | 109 | 3 | 1 | 2 |
4 | 2022-10-19 | 2023 | 0 | NaN | ATL | HOU | 1535.408152 | 1351.164973 | 0.837022 | 0.162978 | ... | NaN | 1618.256817 | 1283.328356 | 0.917651 | 0.082349 | 117 | 107 | 24 | 1 | 13 |
5 rows × 27 columns
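The elo_prob1 values in the preview above can be related to the pre-game ratings with the standard Elo expected-score formula. The sketch below is only an illustration: it assumes that team1 is the home team and that a 100-point home-court bonus is added to its rating, neither of which is stated in this notebook, and the function name `elo_win_prob` is hypothetical. Under those assumptions the formula reproduces the five previewed probabilities to about three decimal places.

```python
def elo_win_prob(elo_home, elo_away, home_advantage=100.0):
    """Standard Elo expected score for the home team.

    home_advantage=100.0 is an assumption, not something the dataset states.
    """
    diff = elo_home + home_advantage - elo_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


# (team1, team2, elo1_pre, elo2_pre, elo_prob1) copied from the preview rows
preview_rows = [
    ("BOS", "PHI", 1657.639749, 1582.247327, 0.732950),
    ("GSW", "LAL", 1660.620307, 1442.352444, 0.862011),
    ("IND", "WAS", 1399.201934, 1440.077372, 0.584275),
    ("DET", "ORL", 1393.525172, 1366.089249, 0.675590),
    ("ATL", "HOU", 1535.408152, 1351.164973, 0.837022),
]

computed = []
for team1, team2, elo1, elo2, reported in preview_rows:
    prob = elo_win_prob(elo1, elo2)
    computed.append(prob)
    print(f"{team1} vs {team2}: computed {prob:.4f}, reported {reported:.4f}")
```

Setting `home_advantage=0.0` gives the plain rating-difference version of the formula, which would apply to a game at a neutral site.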
Preprocessing¶
I removed the neutral column because it didn't seem meaningful, especially since only 2 games were played at neutral sites. I also removed the carm-elo columns because many of the values for carm-elo_prob1 were missing, and I didn't want to use a metric without access to all of its information. To improve readability, I replaced all of the NaN values in the playoff column with 'No' to indicate that the game was not a playoff game. After cleaning the data, no missing values remained. The outlier analysis, which flags values more than 1.5 times the interquartile range beyond the quartiles, found relatively few outliers in each column, and none at all in quality, importance, or total_rating.
#Removing columns that are not meaningful or otherwise of interest.
elo.drop(columns=["neutral", "carm-elo1_post", "carm-elo2_post", "carm-elo1_pre","carm-elo2_pre","carm-elo_prob1","carm-elo_prob2"], inplace=True)
#Replacing the NaN values in the "playoff" column with "No"
#(assigning the result back avoids the chained-assignment FutureWarning)
elo["playoff"] = elo["playoff"].fillna("No")
#Detecting the number of remaining missing values
missing = elo.isnull().sum()
print(missing)
#Outlier analysis on all quantitative columns
def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    I = Q75 - Q25
    return (x < Q25 - 1.5*I) | (x > Q75 + 1.5*I)
columns = ['elo1_pre', 'elo2_pre', 'elo_prob1',
           'elo_prob2', 'elo1_post', 'elo2_post',
           'raptor1_pre', 'raptor2_pre', 'raptor_prob1',
           'raptor_prob2', 'score1', 'score2',
           'quality', 'importance', 'total_rating']
outliers = elo[columns].apply(is_outlier)
for column in columns:
    column_outliers = elo.loc[outliers[column], column]
    if not column_outliers.empty:
        print(f"Outliers in {column}:")
        print(column_outliers)
date            0
season          0
playoff         0
team1           0
team2           0
elo1_pre        0
elo2_pre        0
elo_prob1       0
elo_prob2       0
elo1_post       0
elo2_post       0
raptor1_pre     0
raptor2_pre     0
raptor_prob1    0
raptor_prob2    0
score1          0
score2          0
quality         0
importance      0
total_rating    0
dtype: int64
Outliers in elo1_pre:
925     1273.897216
936     1268.439667
966     1282.097978
999     1272.433788
1009    1283.730645
1023    1278.344960
1031    1283.114640
1037    1288.076684
1122    1284.988155
1141    1288.679847
1156    1278.277520
1170    1283.499466
1175    1271.268580
1184    1274.145674
1189    1264.103229
Name: elo1_pre, dtype: float64
Outliers in elo2_pre:
871     1284.832073
889     1280.529922
898     1281.004146
907     1276.710512
915     1278.571482
928     1273.318217
957     1263.861085
1117    1284.216629
1123    1282.666952
1137    1280.151782
1140    1282.069303
1156    1281.234452
1157    1279.095230
1166    1276.012506
1169    1276.969243
1202    1257.300726
1218    1274.045202
Name: elo2_pre, dtype: float64
Outliers in elo_prob1:
29      0.227572
925     0.193022
936     0.210668
999     0.208926
1023    0.193097
1122    0.181341
Name: elo_prob1, dtype: float64
Outliers in elo_prob2:
29      0.772428
925     0.806978
936     0.789332
999     0.791074
1023    0.806903
1122    0.818659
Name: elo_prob2, dtype: float64
Outliers in elo1_post:
925     1268.439667
936     1263.861085
980     1287.126463
1009    1278.344960
1017    1283.114640
1122    1282.069303
1141    1279.095230
1156    1283.499466
1170    1274.145674
1175    1264.103229
1189    1257.300726
Name: elo1_post, dtype: float64
Outliers in elo2_post:
871     1280.529922
877     1281.004146
889     1276.710512
898     1278.571482
907     1273.318217
915     1273.897216
957     1282.097978
966     1272.433788
991     1283.730645
1099    1284.216629
1117    1282.666952
1123    1280.151782
1137    1278.277520
1140    1281.234452
1156    1276.012506
1157    1276.969243
1166    1271.268580
1202    1274.045202
1218    1271.086891
Name: elo2_post, dtype: float64
Outliers in raptor1_pre:
528     1208.275227
925     1209.889285
977     1140.477612
993     1224.574907
1017    1232.515467
1019    1166.190125
1031    1224.878250
1039    1168.322826
1041    1104.314130
1065    1185.354357
1102    1144.167095
1119    1132.252814
1122    1232.547678
1127    1061.312436
1141    1223.596996
1143    1046.663372
1148    1172.159365
1158    1107.921186
1162    1074.626883
1175    1167.341205
1177    1024.313010
1178    1163.901638
1189    1062.388972
1199    1175.906628
1201    1088.041119
1212     958.273219
1213    1060.073139
1215    1066.924234
1225    1229.920233
1226    1112.922253
1227    1230.840066
1228     955.234235
Name: raptor1_pre, dtype: float64
Outliers in raptor2_pre:
387     1137.854781
458     1192.691021
835     1126.294753
855     1200.173053
889     1182.649776
891     1182.131393
907     1211.762478
1026    1104.983912
1077    1137.428595
1079    1155.126758
1089    1139.639119
1096    1193.365915
1098    1154.092456
1141    1146.391794
1164    1100.554118
1183     962.299324
1187    1132.968397
1191    1162.038793
1198     945.804075
1200    1147.078894
1202    1149.753022
1214     999.541687
1216    1164.064601
1218    1165.345744
1227    1073.199905
Name: raptor2_pre, dtype: float64
Outliers in raptor_prob1:
1041    0.071186
1127    0.085744
1143    0.087815
1158    0.112300
1162    0.073751
1177    0.057100
1178    0.083423
1189    0.100047
1212    0.059204
1213    0.066333
1228    0.027498
Name: raptor_prob1, dtype: float64
Outliers in raptor_prob2:
1041    0.928814
1127    0.914256
1143    0.912185
1158    0.887700
1162    0.926249
1177    0.942900
1178    0.916577
1189    0.899953
1212    0.940796
1213    0.933667
1228    0.972502
Name: raptor_prob2, dtype: float64
Outliers in score1:
209     153
402      82
448     150
561     150
797     153
900     175
1046     82
1099    151
1143     80
Name: score1, dtype: int64
Outliers in score2:
706     150
900     176
971     147
1135    149
1213    151
1228    157
1243     80
1256     79
Name: score2, dtype: int64
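To make the outlier rule concrete, here is a small worked example of the 1.5 × IQR fences for elo1_pre. It is a sketch rather than part of the original analysis: the quartiles (1461.568345 and 1576.453207) are taken from the describe() output in the summary section of this notebook, and the helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q25, q75, k=1.5):
    """Return the (lower, upper) cutoffs used by the 1.5 * IQR outlier rule."""
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr


# Quartiles of elo1_pre as reported by describe() in this notebook
lower, upper = iqr_fences(1461.568345, 1576.453207)
print(f"lower fence: {lower:.6f}")
print(f"upper fence: {upper:.6f}")

# The largest flagged elo1_pre value and the column maximum, for comparison
largest_flagged = 1288.679847
column_max = 1705.343075
print(largest_flagged < lower)  # flagged values sit below the lower fence
print(column_max < upper)       # no value exceeds the upper fence
```

This is consistent with the printed output: every flagged elo1_pre value is an unusually low rating, and no unusually high rating is flagged.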
Summary data analysis¶
The means of the elo ratings (about 1511 to 1512 for both teams) and the RAPTOR ratings (about 1504 and 1499) are close to each other, and the standard deviations (about 89 points for elo and 117 points for RAPTOR) are small relative to those means, which suggests that the two metrics are on a similar scale and fairly consistent. There is a strong positive correlation between each team's elo and that team's elo win probability, so as a team's elo increases, its win probability also increases. There is a strong negative correlation between a team's elo and its opponent's win probability, so as a team's elo increases, its opponent's win probability decreases. There is a perfect negative correlation between a team's win probability and its opponent's win probability because the two probabilities sum to 1, meaning that as one team's odds increase, its opponent's odds decrease linearly.
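The perfect negative correlation follows directly from the two probabilities summing to 1. The short sketch below illustrates this with synthetic probabilities (randomly generated, not taken from the dataset): whenever one variable equals 1 minus the other, the Pearson correlation is -1 regardless of the values themselves.

```python
import numpy as np

# Synthetic win probabilities for team1; these are NOT values from the dataset
rng = np.random.default_rng(0)
prob1 = rng.uniform(0.05, 0.95, size=200)

# The opponent's probability is the complement, as with elo_prob1 and elo_prob2
prob2 = 1.0 - prob1

# Pearson correlation between the two complementary probability columns
r = np.corrcoef(prob1, prob2)[0, 1]
print(f"correlation: {r:.6f}")
```

Because this relationship holds by construction, the -1 entry in the heatmap for elo_prob1 and elo_prob2 carries no extra information about the games.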
#Statistical summary of every column using numerical components
for col in columns:
print(elo[col].describe())
#Statistical summary of every column using graphical components
big_cols = ['elo1_pre', 'elo2_pre', 'elo1_post', 'elo2_post', 'raptor1_pre', 'raptor2_pre']
prob_cols = ['raptor_prob1', 'raptor_prob2', 'elo_prob1', 'elo_prob2']
small_cols = ['score1', 'score2', 'quality', 'importance', 'total_rating']
sns.catplot(data=elo[big_cols], kind="box", height=3, aspect=2)
sns.catplot(data=elo[prob_cols], kind="box", height=3, aspect=2)
sns.catplot(data=elo[small_cols], kind="box", height=3, aspect=2)
column | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
elo1_pre | 1320.000000 | 1511.867655 | 88.962661 | 1264.103229 | 1461.568345 | 1524.876508 | 1576.453207 | 1705.343075 |
elo2_pre | 1320.000000 | 1511.311315 | 89.571409 | 1257.300726 | 1460.866530 | 1523.136739 | 1577.825540 | 1719.448667 |
elo_prob1 | 1320.000000 | 0.627059 | 0.151149 | 0.181341 | 0.533438 | 0.640681 | 0.736876 | 0.933707 |
elo_prob2 | 1320.000000 | 0.372941 | 0.151149 | 0.066293 | 0.263124 | 0.359319 | 0.466562 | 0.818659 |
elo1_post | 1320.000000 | 1510.664651 | 89.560581 | 1257.300726 | 1460.990452 | 1521.917992 | 1576.753366 | 1705.343075 |
elo2_post | 1320.000000 | 1512.514319 | 89.306155 | 1271.086891 | 1460.476484 | 1525.368845 | 1577.663170 | 1719.448667 |
raptor1_pre | 1320.000000 | 1503.864076 | 116.590601 | 955.234235 | 1445.292160 | 1523.778627 | 1585.715871 | 1733.775148 |
raptor2_pre | 1320.000000 | 1499.075101 | 116.514412 | 945.804075 | 1432.346974 | 1522.906299 | 1579.358239 | 1728.915073 |
raptor_prob1 | 1320.000000 | 0.603891 | 0.187301 | 0.027498 | 0.490977 | 0.617850 | 0.742625 | 0.982744 |
raptor_prob2 | 1320.000000 | 0.396109 | 0.187301 | 0.017256 | 0.257375 | 0.382150 | 0.509023 | 0.972502 |
score1 | 1320.000000 | 115.630303 | 11.991075 | 80.000000 | 108.000000 | 116.000000 | 124.000000 | 175.000000 |
score2 | 1320.000000 | 113.030303 | 12.001920 | 79.000000 | 105.000000 | 113.000000 | 121.000000 | 176.000000 |
quality | 1320.000000 | 50.511364 | 27.217232 | 0.000000 | 29.000000 | 52.000000 | 73.000000 | 99.000000 |
importance | 1320.000000 | 32.458333 | 29.408726 | 0.000000 | 9.000000 | 24.000000 | 49.000000 | 100.000000 |
total_rating | 1320.000000 | 41.744697 | 24.238657 | 0.000000 | 21.000000 | 43.500000 | 57.000000 | 100.000000 |
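[Figures: box plots of the rating columns (big_cols), the probability columns (prob_cols), and the score, quality, importance, and total_rating columns (small_cols)]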
#Correlations between quantitative columns (the heatmap would not fit above)
pairs = elo[['elo1_pre', 'elo2_pre', 'score1', 'score2', 'elo_prob1', 'elo_prob2']]
C = pairs.corr()
sns.heatmap(C, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
[Figure: correlation heatmap of elo1_pre, elo2_pre, score1, score2, elo_prob1, and elo_prob2]
Discussion¶
Two questions about the dataset to explore in future projects:
Categorical outcome: Can elo, elo probability, RAPTOR, and RAPTOR probability predict whether a team makes the playoffs?
Quantitative outcome: What percentage of important games (games with an importance of 75 or greater) can elo, elo probability, raptor, and raptor probability correctly predict?
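As a starting point for the quantitative question, one possible way to score the predictions is to call a prediction correct when the team with a win probability above 0.5 also scored more points. The sketch below is an illustration under that assumption, not a finished analysis: the function name `favorite_accuracy` is hypothetical, and the small table contains only the five games from the preview at the top of this notebook.

```python
import pandas as pd

# The five preview games from the top of the notebook
preview = pd.DataFrame({
    "elo_prob1": [0.732950, 0.862011, 0.584275, 0.675590, 0.837022],
    "raptor_prob1": [0.670612, 0.776502, 0.599510, 0.563270, 0.917651],
    "score1": [126, 123, 107, 113, 117],
    "score2": [117, 109, 114, 109, 107],
    "importance": [13, 20, 28, 1, 1],
})


def favorite_accuracy(games, prob_col="elo_prob1", min_importance=75):
    """Fraction of games at or above min_importance where the favorite won.

    Returns NaN when no game meets the importance threshold.
    """
    subset = games[games["importance"] >= min_importance]
    if subset.empty:
        return float("nan")
    predicted_team1 = subset[prob_col] > 0.5
    team1_won = subset["score1"] > subset["score2"]
    return float((predicted_team1 == team1_won).mean())


# None of the preview games has importance >= 75, so use threshold 0 here
acc_elo = favorite_accuracy(preview, "elo_prob1", min_importance=0)
acc_raptor = favorite_accuracy(preview, "raptor_prob1", min_importance=0)
acc_important = favorite_accuracy(preview)  # NaN: no important games in preview
print(acc_elo, acc_raptor, acc_important)
```

On the full dataset the same function could be called on the `elo` DataFrame with the default threshold of 75, which matches the definition of an important game used in the question.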