My Quest To Conquer the World of Baseball DFS - Part 1

Tate Campbell - March 19, 2017

As a practicing data scientist and an avid fan of daily fantasy sports, I've decided it's high time for me to develop my own machine learning algorithm to predict fantasy point production. Given that baseball season is rapidly approaching, I'll be developing my methods to produce lineups for MLB contests on DraftKings.

What I've done so far

Scraped lots of data...

That's been the majority of the work so far. There aren't many good datasets out there for DFS purposes, so I set out to create my own. Having played a fair amount of MLB DFS in the past, I had a pretty good idea of what kinds of metrics I wanted to include in my dataset: things like BvP (batter vs. pitcher) stats, park factors, batting/pitching splits, and Vegas lines. A few sources I ended up scraping include:

rotoguru - for historical DraftKings data
ESPN - for splits and park factors
FanGraphs - for advanced batting stats
Sportsbook Review - for moneylines and over/unders

I scraped these and other sources with Python, using BeautifulSoup to parse the HTML tables, after which I created a separate CSV file for each dataset. I then began the laborious task of cleaning and joining the disjoint files into one master dataset that will ultimately serve as training data for a machine learning model.
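To give a flavor of the table-parsing step, here is a minimal sketch of turning an HTML table into CSV rows with BeautifulSoup. The HTML snippet, column names, and helper function names are made up for illustration; they are not the actual pages or code I scraped with, and real pages need more cleanup than this.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a scraped stats page.
SAMPLE_HTML = """
<table>
  <tr><th>player</th><th>team</th><th>ops</th></tr>
  <tr><td>Player A</td><td>NYY</td><td>0.812</td></tr>
  <tr><td>Player B</td><td>WSH</td><td>0.905</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first HTML table as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are treated alike.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (in practice you would write to a file)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing one CSV per source keeps the scraping and the joining separate, so a broken scraper for one site doesn't block work on the rest.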

Initial Modeling Efforts

After the data was cleanly assembled into one dataset, I ran a quick and dirty regression model to get a sense of how predictable a batter's DK points are given the various covariates that I'd compiled. I trained a RandomForestRegressor from the sklearn.ensemble module using 300 estimators, a maximum depth of 10, and a maximum number of features per split equal to the square root of the total feature count (the training data had about 50 features at this point). The model was evaluated with 3-fold cross-validation using an 80/20 train/test split on roughly 23k data points.
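For anyone who wants to try a similar setup, here is a rough sketch of that configuration in scikit-learn. The data is synthetic (random features and a made-up target), and reading the evaluation as three random 80/20 splits via ShuffleSplit is an assumption made for this example, so don't expect it to reproduce my numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data: 500 rows x 50 features.
# (The real dataset had roughly 23k rows; this is only for illustration.)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = 7.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=5.0, size=500)

# Hyperparameters as described in the text.
model = RandomForestRegressor(
    n_estimators=300,
    max_depth=10,
    max_features="sqrt",
    random_state=0,
)

# Assumption: three random 80/20 train/test splits.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# scikit-learn reports MAE as a negative score, so flip the sign.
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_per_split = -scores

print("MAE per split:", mae_per_split)
print("Average MAE:", mae_per_split.mean())
```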

I have to say I was pleasantly surprised by the initial performance of the model. The model achieved an average mean absolute error of 5.4, which means that on average the model is off by 5.4 DK points. This may not sound so great, but anyone who's played baseball DFS knows how variable it can be.

On any given day a hitter as talented as Bryce Harper can easily go 0-for-4 and score 0 DK points. These situations are common and wouldn't even be considered outliers. There are also situations where total no-name guys like Gary Sanchez get called up from the Yankees' minor league system and hit two home runs, putting up 30 or 40 DK points, so the variance is really quite high.
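To make the error metric concrete, here is a tiny worked example of mean absolute error. The point totals and projections are invented, and they deliberately include a 0-point night and a 31-point night to show how a couple of boom-or-bust games drive the average up.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical DK point totals and projections for five batter-games.
actual = [0.0, 3.0, 31.0, 7.0, 12.0]
predicted = [7.5, 6.0, 9.0, 8.0, 7.0]

mae = mean_absolute_error(actual, predicted)
print(mae)
```

The absolute errors here are 7.5, 3, 22, 1, and 5, which average out to 7.7 even though three of the five projections were within 5 points.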

Feature Importances

What pleased me even more than the modest error of the model was seeing the relative importances of the features. These weightings make a lot of sense to me, and they reaffirmed my choices in data collection.

Top 10 Important Features

Feature	Importance (%)
pitcher_dk_salary	5.5
batter_dk_salary	5.1
pitcher_splits_earned_run_average	4.1
batter_win_probability	3.7
pitcher_win_probability	3.4
batter_splits_on_base_plus_slugging	3.4
pitcher_splits_strikeout_rate	2.8
batter_splits_slugging_percentage	2.8
pitcher_splits_walk_rate	2.5
batter_batting_order	2.4

These top features make a lot of sense to me. DraftKings creates salaries based on their projections for each player, so it certainly makes sense that salaries are the most predictive variables for a batter's fantasy points. It's also no surprise that the Vegas lines, which give fairly accurate representations of which teams are expected to win and/or score more runs, show up as two highly important factors (the over/under was the 11th most important feature with an importance of 2.4).
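A ranking like the one above can be pulled from a fitted random forest through its feature_importances_ attribute. The sketch below uses synthetic data and borrows a few feature names from the table purely as labels; the numbers it prints have nothing to do with my real results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature names borrowed from the table as labels only; the data is synthetic.
feature_names = [
    "batter_dk_salary",
    "pitcher_dk_salary",
    "batter_batting_order",
    "batter_home",
]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
# Made-up target that depends only on the first two columns.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=400)

model = RandomForestRegressor(
    n_estimators=100, max_depth=10, max_features="sqrt", random_state=0
)
model.fit(X, y)

# Impurity-based importances sum to 1; scale them to percentages.
importance_pct = 100.0 * model.feature_importances_

ranked = sorted(
    zip(feature_names, importance_pct),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, pct in ranked:
    print(f"{name}\t{pct:.1f}")
```

Keep in mind that these impurity-based importances are relative shares, so with around 50 features even the top entries only account for a few percent each.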

Bottom 10 Features

Feature	Importance (%)
batter_splits_triple_rate	1.4
batter_splits_batting_average	1.3
batter_splits_caught_stealing_rate	1.3
batter_walk_to_strikeout_ratio	1.3
batter_bunt_rate	1.2
batter_intentional_walk_rate	1.2
batter_walk_rate	1.1
batter_strikeout_rate	0.9
batter_contact_rate	0.9
batter_home	0.5

Nothing too extraordinary here. It was a little surprising that the batter's batting average splits didn't end up being very important, but other than that these results seem logical. It doesn't matter much how often batters strike out or get caught stealing, since outs don't affect a batter's fantasy points. However, we would expect those features to be fairly important when predicting fantasy points for pitchers.

What's next

My next order of business will be to develop a regression model to predict pitchers' fantasy points. This endeavor will undoubtedly be more complex, as each pitcher is going up against 9 batters. My initial thought is that a pitcher's fantasy points could be broken down by individual batter, so that a prediction for an entire game would be the sum of all 9 BvP predictions plus some win EV component.
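As a rough sketch of that idea, the function below adds up nine per-batter predictions and tacks on an expected-value term for the win. The function name, the sample numbers, and the default of 4 points for a win (my understanding of DraftKings' classic scoring, so double-check it) are all assumptions for illustration; none of this is a finished model.

```python
def project_pitcher_points(bvp_predictions, win_probability, win_points=4.0):
    """Sum per-batter predictions and add the expected value of a win.

    bvp_predictions: predicted pitcher points against each of the 9 batters.
    win_probability: chance the pitcher is credited with the win (0 to 1).
    win_points: points awarded for a win (assumed value, not verified here).
    """
    if len(bvp_predictions) != 9:
        raise ValueError("expected one prediction per batter in a 9-man lineup")
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must be between 0 and 1")
    win_ev = win_probability * win_points
    return sum(bvp_predictions) + win_ev


# Hypothetical per-batter predictions and win probability.
bvp = [1.8, 1.5, 2.1, 1.2, 1.6, 1.9, 1.4, 2.0, 2.5]
projection = project_pitcher_points(bvp, win_probability=0.55)
print(round(projection, 2))
```

Here the nine BvP predictions sum to 16.0 and the win component contributes 0.55 x 4 = 2.2, for a projection of 18.2.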

After both models are complete, I plan to develop a script that takes a DraftKings salary CSV file as input and fetches all the relevant covariates so projections can be made for any given MLB slate. Lastly, I'll need to implement some sort of optimization that will allow me to generate optimal lineups based on the projections.
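I haven't settled on an optimization approach yet, but the core problem is to maximize projected points subject to a salary cap and positional requirements. Here is a toy brute-force sketch on a made-up mini-slate. The players, salaries, projections, roster slots, and cap are all hypothetical and far smaller than a real DraftKings contest, where something like integer programming would be a more realistic choice than enumerating every combination.

```python
from itertools import combinations, product

# Hypothetical mini-slate: (name, position, salary, projected points).
PLAYERS = [
    ("P1", "P", 9000, 20.0),
    ("P2", "P", 7000, 15.0),
    ("P3", "P", 5000, 11.0),
    ("H1", "1B", 5000, 9.0),
    ("H2", "1B", 3500, 7.0),
    ("H3", "OF", 4500, 8.5),
    ("H4", "OF", 4000, 8.0),
    ("H5", "OF", 3000, 6.0),
]

# Simplified roster and cap, for illustration only.
ROSTER_SLOTS = {"P": 2, "1B": 1, "OF": 2}
SALARY_CAP = 25000


def best_lineup(players, roster_slots, salary_cap):
    """Try every valid position combination; keep the best one under the cap."""
    by_position = {
        pos: [p for p in players if p[1] == pos] for pos in roster_slots
    }
    groups = [
        combinations(by_position[pos], count)
        for pos, count in roster_slots.items()
    ]
    best, best_points = None, float("-inf")
    for choice in product(*groups):
        lineup = [player for group in choice for player in group]
        salary = sum(p[2] for p in lineup)
        points = sum(p[3] for p in lineup)
        if salary <= salary_cap and points > best_points:
            best, best_points = lineup, points
    return best, best_points


lineup, points = best_lineup(PLAYERS, ROSTER_SLOTS, SALARY_CAP)
print([p[0] for p in lineup], points)
```

In this toy slate the two most expensive pitchers can't be paired under the cap, so the best lineup takes the ace plus the cheapest pitcher and fills in with value bats.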

I'll continue to detail my progress here when time permits. Hopefully I'll get this beast up and running before the season starts so I can document how well the system performs on a nightly basis. Please feel free to follow me on Twitter @lolskee.
