MATH 219 | Project 1 March 22, 2024¶

0: [IGNORABLE BOILERPLATE]¶

In [ ]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

# CONDA LIBRARY IMPORTS
import seaborn as sns
import pandas as pds
import numpy as npy

1: [INTRODUCTION]¶

The data set chosen for this project was drawn from the official game statistics of professional baseball players over the 2023 MLB season. It was sourced entirely from the Statcast metrics platform at baseballsavant.mlb.com/statcast_search, though a .csv file with the specific type of data must first be generated from a filter search before it can be downloaded. The files used are under 20 MB total and should therefore already be included.¶
One issue to note: because of the complexity of the filter forms and the way the numbers are aggregated over the season, the data as it appears in this set was only obtainable per team, requiring 30 separate .csv files to cover my request. I have not renamed the files or otherwise modified the data, which makes the somewhat odd import-merge code below necessary. After the merging is complete, the last few lines of code confirm the data meets the minimum requirements.¶
In [ ]:
# start with one file, then merge the rest in loops
MLB = pds.read_csv("158_data.csv")

# csv files numbered 108 through 121 merged here
for index in range(108, 122):
    data = pds.read_csv("{0}_data.csv".format(index))
    MLB = pds.concat([MLB, data], ignore_index=True)

# csv files numbered 133 through 147 merged here
for index in range(133, 148):
    data = pds.read_csv("{0}_data.csv".format(index))
    MLB = pds.concat([MLB, data], ignore_index=True)

# sanity check size & format to show merge was successful
print("\nSize of data matrix is {0}\n\n".format(MLB.shape))
#TODO perhaps randomize the rows shown here before the print? TODO

MLB.head()
Size of data matrix is (23703, 92)


Out[ ]:
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events description ... fld_score post_away_score post_home_score post_bat_score post_fld_score if_fielding_alignment of_fielding_alignment spin_axis delta_home_win_exp delta_run_exp
0 FF 2023-10-01 92.4 -1.10 6.07 Crow-Armstrong, Pete 691718 605288 walk ball ... 3 0 3 0 3 Standard Standard 222.0 -0.012 0.110
1 FF 2023-10-01 92.9 -1.06 6.11 Canario, Alexander 672744 605288 strikeout called_strike ... 3 0 3 0 3 Standard Standard 214.0 0.021 -0.152
2 CU 2023-10-01 83.5 -1.41 5.35 Tauchman, Mike 643565 676083 strikeout called_strike ... 4 0 4 0 4 Standard Standard 39.0 0.008 -0.152
3 SI 2023-10-01 92.2 -1.47 5.78 Tauchman, Mike 643565 605288 walk ball ... 0 0 0 0 0 Standard Standard 227.0 -0.025 0.151
4 FC 2023-09-30 86.9 1.50 6.02 Gomes, Yan 543228 641778 walk ball ... 6 6 6 6 6 Standard Standard 160.0 -0.015 0.052

5 rows × 92 columns
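Rather than hard-coding the file numbers, the per-team files could also be discovered with a filename glob; a minimal sketch, assuming every `*_data.csv` file sits in one folder (`merge_team_csvs` is a hypothetical helper, not part of the project code):

```python
from pathlib import Path

import pandas as pds

def merge_team_csvs(folder="."):
    """Concatenate every *_data.csv found in `folder` into one frame."""
    paths = sorted(Path(folder).glob("*_data.csv"))
    frames = [pds.read_csv(p) for p in paths]
    return pds.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once also avoids the quadratic copying that repeated `pds.concat` calls inside a loop incur.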

2: [Preprocessing]¶

Although the import stage is messy because of it, the filter generation means most pre-processing won't actually be necessary. Even so, I have decided to cherry-pick from the 92 attributes only those relevant to the topic I wish to explore. Each row in this data matrix represents the statistics from an individual pitch that resulted in either a walked batter or a called strike (i.e., the umpire judged the pitch a strike without the batter swinging). While most of the columns should be self-explanatory in future projects after being renamed, I will include a brief description of the ones that made the cut below. The reasoning behind choosing these specific features will be discussed in the 'summary' block.¶

The remaining code after the column filtering is "by-the-book" pre-processing and should be self-explanatory.¶

In [ ]:
MLB = MLB[
    [ "pitch_name", "pitch_type",
    "release_speed", "effective_speed",
    "release_pos_x", "release_pos_y", "release_pos_z",
    "ax", "ay", "vx0", "vy0", "vz0",
    "pfx_x", "pfx_z", "sz_top", "sz_bot",
    "spin_axis", "stand", "p_throws",
    "release_extension", "release_spin_rate",
    "plate_x", "plate_z",
    "events", "description" ]
]

# dropna() returns a new frame, so the result must be assigned back
MLB = MLB.dropna()
SOME IMPORTED COLUMN DESCRIPTIONS¶
release_pos (x, y, z):: 'y' is the ball's distance from home plate at release, and 'x'/'z' are the horizontal and vertical offsets, respectively.¶
pfx (x/z):: total horizontal and vertical movement of the ball, respectively, after its release, relative to the catcher's mitt.¶
plate (x/z):: horizontal and vertical position, respectively, of the pitch as it crosses home plate, relative to the catcher's mitt. This is independent of the strike zone, which is different for every batter.¶
release speed:: magnitude of pitch velocity at its release point, in the absolute direction of its release vector (100%).¶
effective speed:: an adjustment to release speed based on release extension and release vector; the ball's speed towards the strike zone.¶
sz (top/bottom):: the top and bottom heights of the hitter's strike zone (auto-set when the ball is halfway to the plate).¶
spin axis:: the axis the pitched ball rotates on; 180 degrees = toward the strike zone, 0 or 360 = toward the pitcher's mound.¶
a (x/y):: acceleration components of the pitch, measured at a fixed point shortly after it leaves the pitcher's hand.¶
vx0/vy0/vz0:: the velocity components of the ball, measured at that same fixed point shortly after release (not at the catcher).¶
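Since plate_x/plate_z are zone-independent while sz_top/sz_bot are per-batter, a rule-of-thumb in-zone flag can be derived by combining them. A minimal sketch, assuming the plate coordinates are in feet and ignoring the radius of the ball (`in_zone` is a hypothetical helper, not part of the project code):

```python
HALF_PLATE_FT = (17 / 2) / 12  # home plate is 17 inches wide; plate_x is in feet

def in_zone(plate_x, plate_z, sz_bot, sz_top):
    """True when the pitch crosses the plate inside the batter's strike zone."""
    return abs(plate_x) <= HALF_PLATE_FT and sz_bot <= plate_z <= sz_top
```

Applied row-wise to the frame, this would give a zone label that depends only on measured trajectory, not on the umpire's call.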
OUTLIER COUNTING FOR QUANTITATIVE COLUMNS:¶
In [ ]:
data = MLB.select_dtypes(include="number")

# 1.5*IQR fences for each quantitative column
Q25 = data.quantile(0.25)
Q75 = data.quantile(0.75)
IQR = Q75 - Q25
MIN = Q25 - 1.5*IQR
MAX = Q75 + 1.5*IQR

outliers = data[(data < MIN) | (data > MAX)].count()

print(outliers)
release_speed         242
effective_speed       304
release_pos_x           0
release_pos_y         255
release_pos_z         532
ax                      0
ay                     63
vx0                    88
vy0                   247
vz0                   105
pfx_x                   0
pfx_z                 297
sz_top                204
sz_bot                216
spin_axis             146
release_extension     348
release_spin_rate    1321
plate_x                16
plate_z                64
dtype: int64
These counts are mostly an insignificant fraction of the roughly 24K samples in the data set, with the exception of release_spin_rate, whose heavy tail may simply reflect genuine variation between pitch types. A related caveat applies to spin_axis: 180 degrees is the expected median, and values are hard-capped between 0 and 360, which describe essentially the same direction, so a symmetric outlier fence near the extremes is misleading for that column.¶
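The wrap-around point about degrees can be made concrete for spin_axis: because 0° and 360° are the same direction, ordinary means (and fences built from them) mislead near the extremes, and circular statistics are the standard fix. A minimal numpy sketch (`circular_mean_deg` is an illustrative helper, not used elsewhere in this project):

```python
import numpy as npy

def circular_mean_deg(degrees):
    """Mean of angles in degrees, treating 0 and 360 as the same direction."""
    rad = npy.deg2rad(npy.asarray(degrees, dtype=float))
    mean = npy.arctan2(npy.sin(rad).mean(), npy.cos(rad).mean())
    return npy.rad2deg(mean) % 360
```

A naive mean of [350, 10] is 180, the exact opposite direction; the circular mean lands at (or numerically next to) 0.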

3: [SUMMARY DATA ANALYSIS]¶

I have included some correlations between attributes I logically expected to have a critical relationship, such as spin rate and spin axis, or release speed and effective speed. I also portray several I expected nothing of, but found potentially interesting (such as vertical vs. horizontal position).¶
In [ ]:
print("By the names alone, the strong correlation here is almost trivially obvious, but it still serves as a sanity check on the usefulness of this data set")
print(data[["release_speed", "effective_speed"]].corr())
print()
print("A bit more interestingly, horizontal release position correlates noticeably more with horizontal movement (pfx_x) than the vertical pair does")
print(data[["release_pos_x", "pfx_x"]].corr())
print()
print(data[["release_pos_z", "pfx_z"]].corr())
print()
print("another seemingly trivial correlation: the top and bottom of the strike zone rise and fall together,")
print("since both heights scale with the individual batter's stature (see the sz_top/sz_bot description above)")
print(MLB[["sz_top", "sz_bot"]].corr())
By the names alone, the strong correlation here is almost trivially obvious, but it still serves as a sanity check on the usefulness of this data set
                 release_speed  effective_speed
release_speed         1.000000         0.846694
effective_speed       0.846694         1.000000

A bit more interestingly, horizontal release position correlates noticeably more with horizontal movement (pfx_x) than the vertical pair does
               release_pos_x     pfx_x
release_pos_x       1.000000  0.425303
pfx_x               0.425303  1.000000

               release_pos_z     pfx_z
release_pos_z       1.000000  0.130046
pfx_z               0.130046  1.000000

another seemingly trivial correlation: the top and bottom of the strike zone rise and fall together,
since both heights scale with the individual batter's stature (see the sz_top/sz_bot description above)
          sz_top    sz_bot
sz_top  1.000000  0.807802
sz_bot  0.807802  1.000000
While I would have loved to include a handsome graph for this whole data set, it was perhaps too ambitious to feed 30+ features of a roughly 24K-sample dataset into a heavyweight plotting package written in an interpreted language and expect a single plot in under ten minutes. I have therefore limited the graphical portion of the summary to 4 quantitative and 1 categorical column, as suggested in the project description.¶
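One workaround for the plotting cost, besides dropping columns, is to plot from a fixed random subsample; a minimal sketch, assuming a pandas frame shaped like MLB (`plot_sample` is a hypothetical helper):

```python
import pandas as pds

def plot_sample(df, n=2000, seed=0):
    """Reproducible random subsample, capped at the frame's own length."""
    return df.sample(n=min(n, len(df)), random_state=seed)
```

A few thousand rows are usually enough to keep boxplot quartiles and correlation trends recognizable while rendering in seconds rather than minutes.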
In [ ]:
data.describe()

# 'data' was reduced to numeric columns only, so it no longer contains
# pitch_type; facet on the full MLB frame instead
sns.catplot(data=MLB, col="pitch_type", x="release_speed", y="release_spin_rate", kind="box")

4: [DISCUSSION]¶

As mentioned before, I have cherry-picked the attributes in this data set to fit a particular scenario. The filter of at least partially full counts, and of pitches that resulted only in either a walked batter or a called (non-swinging) strike, leaves only critically important plays where the pitcher had the potential to either walk or retire the batter with one more throw. In all of these situations the batter decided the ball wasn't worth swinging at, deferring judgement to the behind-the-plate official and producing an umpire-called outcome. The first question, the categorical one, is fairly controversial in the sport, and the one that inspired the topic for this project:

{1} Is it desirable or even fair for the umpire position to be performed, or at least assisted (triggering call reviews), by an AI-trained system?¶

I will of course need to reduce this to a categorical classification question: if trained on data capturable in real time around a critical scenario like those discussed above, can a model accurately classify the categorical result the umpire would have declared (ball vs. called strike)? According to FanGraphs, umpire accuracy against the true strike zone has increased from 81.3% when pitch tracking began in 2008 to a recent peak of 92.4%. Being trained on more recent umpire calls, I would need a minimum of about 91% accuracy for a model to be considered effective enough to be potentially useful. The second question, the quantitative one, is one of reconstructing the pitch's position in the strike zone:
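To make question {1} concrete, the classification could be prototyped as a logistic regression over the pitch features; the sketch below uses a plain-numpy fit on synthetic stand-in data (the two features, the labeling rule, and the resulting accuracy are illustrative only, not results from this data set):

```python
import numpy as npy

def fit_logistic(X, y, lr=0.1, steps=500):
    """Logistic regression (ball=0 vs called strike=1) via gradient descent."""
    w, b = npy.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + npy.exp(-(X @ w + b)))   # predicted P(called strike)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * npy.mean(p - y)
    return w, b

def predict(X, w, b):
    return (X @ w + b) > 0

# synthetic stand-ins for two pitch features, labeled by a toy linear rule
rng = npy.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X @ npy.array([1.5, -1.0]) > 0).astype(float)
w, b = fit_logistic(X, y)
accuracy = npy.mean(predict(X, w, b) == (y > 0.5))
```

On the real data, the measured accuracy would be compared against the ~91% bar above; this toy fit only demonstrates the shape of the pipeline.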

{2} Given only the real-time information about a pitch before it is caught, such as its release position, initial movement, and spin, is it possible not just to categorize its call, but to predict its caught position inside or outside of the strike zone?¶

For example, would knowing that a curveball with a notably high release point and fast forward spin is approaching the plate enable a regression model to predict how low in the strike zone the pitch will be caught when it arrives? Without looking at previous umpire calls, this could quantify whether AI models really could call the strike zone more accurately than human umpires, since they would not be trained on the umpires' potential mistakes.
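Question {2} reduces to a regression; a minimal sketch of an ordinary least-squares fit with `numpy.linalg.lstsq`, on synthetic stand-ins for the release features (the coefficients and column roles are illustrative, not measured from this data set):

```python
import numpy as npy

rng = npy.random.default_rng(0)

# hypothetical stand-ins for release_pos_z, pfx_z, release_speed (1000 pitches)
X = rng.normal(size=(1000, 3))
true_coef = npy.array([0.6, 1.1, -0.2])
plate_z = X @ true_coef + 2.5 + rng.normal(scale=0.05, size=1000)

# least-squares fit with an intercept column appended
A = npy.column_stack([X, npy.ones(len(X))])
coef, *_ = npy.linalg.lstsq(A, plate_z, rcond=None)
predicted = A @ coef
```

Comparing `predicted` against sz_top/sz_bot would then say whether the pitch projects inside or outside the zone, without ever consulting an umpire's call.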