My Quest To Conquer the World of Baseball DFS - Part 2

Tate Campbell - March 26, 2017

As a practicing data scientist and an avid fan of daily fantasy sports, I've decided it's high time for me to develop my own machine learning algorithm to predict fantasy point production. Given that baseball season is rapidly approaching, I'll be developing my methods to try to produce optimal lineups for MLB contests on DraftKings.

Welcome back!

In my last post I told you how I started my MLB DFS conquest for the 2017 season by scraping and cleaning lots of data and by messing around with some random forests. At that time I felt that I had a decent model for predicting fantasy points for batters, but I hadn't put any work into making predictions for pitchers. Since then I've continued to build out my model for batters and I've also put together an initial dataset and model for pitchers. Lastly, I started spec'ing out a workflow and some skeleton code for the script that I'll run every time I download a new DraftKings salary CSV to fetch all the data necessary to make predictions.

Today I'll tell you a little bit about what I've learned on the subject of modeling pitcher fantasy point production.

A Pitcher is Worth a Thousand Words

As I mentioned last time, the way I initially thought of predicting a pitcher's fantasy points involved decomposing the pitcher's points based on the performance of every batter in the opposing lineup. This seemed like a logical approach to me since, on average, pitchers are going to perform differently against David Ortiz hitting cleanup than they would against Jon Lester in the 9-hole.

As it turns out, segmenting a pitcher's fantasy points by the contribution of each individual batter is tougher than it sounds. The only way I could think of doing it was by parsing each batter's stat line using a lot of regular expressions. That's not the most fun code to write, but it's doable. That's all well and good, but of course the starting pitcher usually doesn't pitch the entire game! You can't just attribute all of a batter's outs and/or hits to the starting pitcher.

So now you have to worry about pitch counts and when relievers typically enter the game, which is highly pitcher/team dependent and is fairly difficult to predict. In addition to the issues related to number of innings pitched, I also ran into some problems with my specific dataset where in certain games I don't have data for all nine batters on one team. This is a total nightmare when trying to build a model based on nine BvP matchups, and I really didn't want to throw out every game where I was missing a batter or two.

So while I still think this is a great approach in theory, I ended up abandoning it in practice. What I ended up doing was simply averaging batter splits and building a model based on an aggregated version of the dataset I used for batters. The idea is that I still want to take advantage of the composition of the opposing lineup, e.g. if a lefty pitcher is going up against a bunch of righty bats with substantial platoon splits, we should expect some runs to be given up.
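To make the aggregation idea concrete, here is a minimal sketch of averaging batter splits over whichever batters are actually present in the data. The record layout, the `aggregate_lineup` helper, and all of the numbers are made up for illustration; only the `avg_batter_splits_` naming is borrowed from the feature tables later in this post.

```python
from statistics import mean

# Hypothetical split rates for the batters we have data on in one lineup.
# Only three of the nine batters are present, and one stat is missing.
lineup = [
    {"batter": "A", "strikeout_rate": 0.18, "home_run_rate": 0.040},
    {"batter": "B", "strikeout_rate": 0.25, "home_run_rate": 0.030},
    {"batter": "C", "strikeout_rate": 0.22, "home_run_rate": None},
]


def aggregate_lineup(batters, stats):
    """Average each stat over the batters that have a value for it."""
    features = {}
    for stat in stats:
        values = [b[stat] for b in batters if b.get(stat) is not None]
        # If no batter has the stat at all, leave the feature empty.
        features[f"avg_batter_splits_{stat}"] = mean(values) if values else None
    return features


features = aggregate_lineup(lineup, ["strikeout_rate", "home_run_rate"])
print(features)
```

Because the average is taken over the batters that are present, a game with a missing batter or two still produces a usable row instead of being thrown out.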

Pitcher Model Performance

Here's a scatter plot of the prediction results for one iteration of my cross validation. You can hover over data points to find out the pitcher, game date, and predicted vs. actual performance for that point. Try zooming in on the upper right-hand corner to see some truly massive outliers, including a complete game shutout with 14 strikeouts, only three hits, and zero walks, which resulted in a 55 DK point game from Clayton Kershaw against the San Diego Padres.
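For the curious, that point total can be roughly reproduced by hand. The sketch below assumes the DraftKings classic MLB pitcher scoring weights as I understand them, so check them against the official scoring rules before relying on them. It also assumes zero hit batters, which the stat line above doesn't mention. The `pitcher_dk_points` helper and the stat names are hypothetical.

```python
# Assumed DraftKings pitcher scoring weights (verify against the official rules).
PITCHER_WEIGHTS = {
    "innings_pitched": 2.25,
    "strikeouts": 2.0,
    "wins": 4.0,
    "earned_runs": -2.0,
    "hits_allowed": -0.6,
    "walks": -0.6,
    "hit_batters": -0.6,
    "complete_games": 2.5,
    "complete_game_shutouts": 2.5,
    "no_hitters": 5.0,
}


def pitcher_dk_points(stat_line):
    """Sum weight * count over every stat present in the stat line."""
    return sum(PITCHER_WEIGHTS[stat] * count for stat, count in stat_line.items())


# Complete game shutout: 9 innings, 14 strikeouts, 3 hits, no walks.
shutout = {
    "innings_pitched": 9,
    "strikeouts": 14,
    "wins": 1,
    "earned_runs": 0,
    "hits_allowed": 3,
    "walks": 0,
    "hit_batters": 0,  # assumption, not stated in the stat line
    "complete_games": 1,
    "complete_game_shutouts": 1,
}

points = pitcher_dk_points(shutout)
print(round(points, 2))  # 55.45
```

Under those assumed weights, the strikeouts alone account for 28 of the roughly 55 points.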

This model had a mean absolute error of 8.8 DK points. However, this was done using an sklearn random forest model out of the box without tuning any parameters, so I'm hopeful that I can bring that error down a bit. While the initial accuracy wasn't quite as good as I'd hoped, the good news is that the residuals seem to be randomly distributed around zero. For you non-stat nerds, that means the differences between the predicted values and the actual values show no systematic pattern. This is something you want to see when performing regression analysis because it suggests that your model is unbiased, i.e. it doesn't tend to underpredict or overpredict.
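If the terms are unfamiliar, here is a small sketch of the two quantities discussed above: the mean absolute error and the average residual (a quick bias check). The five predicted and actual values are invented for illustration and have nothing to do with my real results.

```python
from statistics import mean

# Made-up actual and predicted DK point totals for five pitcher starts.
actual = [12.0, 25.5, 3.0, 18.0, 30.0]
predicted = [15.0, 20.0, 8.0, 17.5, 22.0]

# Residual = actual minus predicted, one per start.
residuals = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: the typical size of a miss, ignoring direction.
mae = mean([abs(r) for r in residuals])

# Mean residual: near zero suggests no systematic over- or underprediction.
bias = mean(residuals)

print(f"MAE: {mae:.2f}, mean residual: {bias:.2f}")
```

In this toy example the mean residual is positive, meaning the made-up predictions run a little low on average. A model can have a large MAE and still have a mean residual near zero, which is the situation described above.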

Feature Importances

Let's take a look at some of the most important and least important features in the pitcher model.

Top 10 Important Features

Feature	Importance (%)
pitcher_splits_earned_run_average	8.5
pitcher_earned_run_average	5.9
avg_batter_splits_strikeout_rate	4.3
pitcher_splits_walks_plus_hits_per_inning_pitched	4.2
pitcher_dk_salary	4.1
pitcher_splits_strikeout_rate	3.9
avg_batter_splits_RBI_rate	3.9
avg_batter_splits_batting_average	3.6
avg_batter_dk_salary	3.6
avg_batter_splits_home_run_rate	3.4

Here we see pitcher splits playing a much bigger role than they did in the modeling of batters. The pitcher's ERA splits and overall ERA are far and away the most predictive variables. It is also not at all surprising to see strikeout rates as highly important factors, since pitchers receive two points for every K.
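For anyone wondering where numbers like these come from, here is a minimal sketch of pulling importances out of an sklearn random forest as percentages. The data is synthetic and the three feature names are just borrowed from the tables in this post, so the output says nothing about real pitchers. sklearn's `feature_importances_` are impurity-based and sum to 1, which is why multiplying by 100 gives percentages.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = [
    "pitcher_splits_earned_run_average",
    "avg_batter_splits_strikeout_rate",
    "pitcher_home",
]

# Synthetic data: the target depends on the first two columns only.
X = rng.normal(size=(300, 3))
y = -3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Out-of-the-box random forest, no hyperparameter tuning.
model = RandomForestRegressor(random_state=0)
model.fit(X, y)

# Pair each feature with its importance in percent, largest first.
importances = sorted(
    zip(feature_names, 100 * model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, pct in importances:
    print(f"{name}\t{pct:.1f}")
```

One caveat worth keeping in mind when reading tables like this: impurity-based importances get shared among correlated features, so a low number doesn't always mean a stat is irrelevant.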

We also see that DraftKings salaries aren't as predictive of fantasy points as they are for batters, which I suspect is related to differences in how much salaries vary by matchup for pitchers versus hitters. For example, if Nelson Cruz (a well documented lefty crusher) is going against a lefty, his salary will likely increase by 10% or more, since the probability of him hitting one or more home runs increases dramatically. Conversely, the price of Madison Bumgarner (a lefty) would likely not change very much in the case that he's going against a Mariners lineup with Cruz in it.

Bottom 10 Least Important Features

Feature	Importance (%)
pitcher_batting_order	0.3
pitcher_home	0.4
pitchers_walks_per_9_innings_pitched	1.1
pitcher_batting_average_on_balls_in_play	1.2
over_under	1.3
park_factors_walks	1.3
pitcher_strand_rate	1.4
pitcher_strikeout_to_walk_ratio	1.4
park_factors_hits	1.4
pitcher_home_runs_allowed_per_9_innings_pitched	1.4

I thought the most surprising result here was that park factors and the over/under didn't end up being very important. Anyone who's played baseball DFS knows how everyone hates playing pitchers at Coors Field in Denver because home runs are so common there due to the high altitude. To me these feature weightings suggest that people may be putting too much stock into park factors.

It is also worth noting that a few pitching statistics that are commonly analyzed for DFS purposes, such as BABIP, BB/9, and strand rate (left-on-base percentage), contributed very little to the model's predictions of a pitcher's fantasy points. Check out the MLB scoring rules on DraftKings for a complete breakdown of how points are gained/lost for pitchers and hitters.

On Deck

With both the batter and pitcher models at a point where I'd feel all right using them to make predictions, the next task has to be writing the script that will bridge the gap between a DraftKings salary CSV and my current data set. I've been dreading writing the script because it requires mixing and matching data from many different sources, which won't be fun. However, as soon as I finish that script I can move on to lineup optimization, which I'm really looking forward to.

In looking ahead to the optimization, I found some code from swanson that looks really promising. He wrote lineup optimization algorithms for NFL, NBA, and CFB teams using the Google ortools Python library, which I used in grad school, so I'm pretty comfortable with it. I should be able to adapt his code to optimize MLB lineups fairly easily, which will be the last link in the chain.
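To give a flavor of what the optimization problem looks like, here is a toy sketch: pick the lineup with the most projected points that fills every roster slot and stays under the salary cap. This is not swanson's code and it doesn't use ortools. It is a plain brute-force search standing in for an integer-programming solver, which only works because the pool is tiny. The player pool, the cap, and the roster shape are all made up and much smaller than a real DraftKings contest.

```python
from itertools import combinations

# Hypothetical pool: (name, position, salary, projected DK points).
pool = [
    ("P1", "P", 10000, 22.0),
    ("P2", "P", 7000, 15.0),
    ("H1", "OF", 5000, 9.5),
    ("H2", "OF", 4000, 8.0),
    ("H3", "OF", 3000, 5.5),
    ("H4", "OF", 3500, 7.0),
]
SALARY_CAP = 17000          # toy cap, not the real DraftKings cap
ROSTER = {"P": 1, "OF": 2}  # toy roster, not a real DraftKings roster


def best_lineup(pool, roster, cap):
    """Try every combination and keep the best one that is valid."""
    size = sum(roster.values())
    best, best_points = None, float("-inf")
    for lineup in combinations(pool, size):
        # Count how many players were picked at each position.
        counts = {}
        for _, position, _, _ in lineup:
            counts[position] = counts.get(position, 0) + 1
        if counts != roster:
            continue  # wrong positional makeup
        if sum(player[2] for player in lineup) > cap:
            continue  # over the salary cap
        points = sum(player[3] for player in lineup)
        if points > best_points:
            best, best_points = lineup, points
    return best, best_points


lineup, points = best_lineup(pool, ROSTER, SALARY_CAP)
print([player[0] for player in lineup], points)
```

With a full slate of players the number of combinations explodes, which is exactly why a proper solver is the right tool for the real thing.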

Check back next week for another update. If you find this project interesting follow me on twitter @lolskee to get the most current information on my MLB DFS system.

UPDATE 6/23/17: The third and final installment of My MLB DFS Conquest ended up being a walkthrough Python script and can be found here. After getting my projection system up and running, I used it to generate lineups which I entered in DraftKings contests for about two weeks, after which I stopped playing as the time commitment became too great. I lost a little bit of money overall, mostly due to poor bankroll management (results can be seen here), but it was truly amazing to see some of the patterns the model was able to pick up on. After the whole experience I am convinced that my approach was a good one, and maybe I'll pick this work back up at some point in the future. If you're interested in trying something like this on your own, check out some of the datasets I created and my github baseball repo for more resources. Thanks for reading. - Tate
