A regression model predicts a numeric value from several other variables. Say you want to know how many innings a pitcher can throw before he gets injured (which happens far more often than you might guess, but that’s beside the point). You would first gather data on previously injured pitchers. As in a classification model, you have a response variable, except this time it isn’t limited to 1 or 0; it can be any real number. In this example the response variable is Innings Pitched.
Next, you want to find useful predictors of your response variable. Generally speaking, you want variables that correlate strongly with the response, but you need to make sure they are not just the response in disguise. This matters a lot. Say you wanted to predict how many times a batter strikes out: if you used his strikeout rate and his total at bats, you would get the correct answer every time, but that wouldn’t tell you anything useful. Obviously, multiplying a batter’s strikeout rate by his at bats gives his total strikeouts. Instead, we would want variables like how many HRs he hit or the percentage of sliders thrown to him, because those could potentially tell us something interesting that we didn’t already know.
To show a really simple example, I’ll start with a question: which teams are paying too much for their players?
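Before getting to that question, here is a minimal R sketch of the strikeout point above. The data are simulated, and the names `at_bats`, `k_rate`, and `leaky_fit` are made up for illustration; nothing here comes from the real dataset. It only shows that when the response is built directly from the predictors, the fit is nearly perfect and teaches us nothing.

```r
# Toy data: every number below is simulated for illustration only.
set.seed(1)
n <- 50
at_bats <- sample(300:600, n, replace = TRUE)
k_rate  <- runif(n, 0.10, 0.35)          # strikeouts per at bat

# The response is computed straight from the two "predictors".
strikeouts <- round(k_rate * at_bats)

# Regressing on rate * at bats just hands the answer back to us,
# so the R-squared ends up extremely close to 1.
leaky_fit <- lm(strikeouts ~ I(k_rate * at_bats))
summary(leaky_fit)$r.squared
```

A near-perfect R-squared like this is usually a warning sign that a predictor is the response in disguise rather than evidence of a great model.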
First we need to determine what is important for a team. I can’t speak for all teams, but if I were in charge of one, my goal would be to win a World Series, and to do that you need to make the playoffs. That generally requires a win percentage around or above .600, or about 97 wins in a 162-game season. I will use this, along with a very simple regression model, to answer the question: which teams are not getting enough value out of their players?
##
## Call:
## lm(formula = Salary ~ Team, data = data, model = TRUE)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -12864958  -4248147  -2170503   3329665  22051500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3707663    2396122   1.547  0.12328
## TeamATL      3672884    3492918   1.052  0.29423
## TeamBAL       402411    3622596   0.111  0.91166
## TeamBOS      5886194    3622596   1.625  0.10569
## TeamCHC      4340511    3622596   1.198  0.23220
## TeamCIN      1909939    3622596   0.527  0.59859
## TeamCLE      1099850    3492918   0.315  0.75317
## TeamCOL      5531633    3388628   1.632  0.10409
## TeamCWS        30559    3788602   0.008  0.99357
## TeamDET      4240837    4319671   0.982  0.32735
## TeamHOU      3063070    3388628   0.904  0.36707
## TeamKC       1051511    3788602   0.278  0.78163
## TeamLAA      9718795    3492918   2.782  0.00589 **
## TeamLAD      1103587    3492918   0.316  0.75235
## TeamMIA      -803496    3788602  -0.212  0.83225
## TeamMIL      3610764    3388628   1.066  0.28785
## TeamMIN      1694231    3230930   0.524  0.60057
## TeamNYM       785919    3388628   0.232  0.81682
## TeamNYY      1304621    3492918   0.374  0.70915
## TeamOAK       275087    3302826   0.083  0.93370
## TeamPHI      2959728    3492918   0.847  0.39776
## TeamPIT      -449311    3388628  -0.133  0.89464
## TeamSD       2282532    3622596   0.630  0.52933
## TeamSEA      4578496    3388628   1.351  0.17811
## TeamSF       9229837    4319671   2.137  0.03378 *
## TeamSTL      4614067    3302826   1.397  0.16389
## TeamTB      -1584603    3302826  -0.480  0.63189
## TeamTEX      1945120    3302826   0.589  0.55654
## TeamTOR      -668830    3388628  -0.197  0.84373
## TeamWSH      3201779    3492918   0.917  0.36038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7188000 on 210 degrees of freedom
## Multiple R-squared: 0.1267, Adjusted R-squared: 0.00609
## F-statistic: 1.051 on 29 and 210 DF, p-value: 0.4024
For the regression model, I want to know how the machine would predict a person’s salary if it only knew what team he was on. Keep in mind that this only uses positions 3-10 (everyone except pitchers and catchers), because you would use other data to predict catcher and pitcher salaries, and I am mainly using this data to predict hitters’ salaries. I thought this example was interesting because the model is awful as a predictor (its adjusted R-squared is only about 0.006) but still useful for teaching something I consider very valuable: no model is correct, but some are useful. The point of this model is simply to get each team’s salaries while excluding pitchers and catchers, because their main objective isn’t to get hits and score runs but to prevent them. This is why I used runs scored rather than win percentage in the plot above.
The x-axis shows the coefficients from the linear model fit with the formula Salary ~ Team, where the response variable is the salary of each team’s players. Each coefficient is a team’s average salary relative to the baseline team that is absorbed into the intercept (Arizona does not appear in the table, so it is presumably the baseline). I know I could have just pulled team salaries off the internet and used those, but I wanted to show a different way to use a regression model. As you can see, teams like the Tigers, Giants, and Angels are not getting enough production for what they are spending. On the opposite side, the Yankees, Astros, Dodgers, A’s and Twins, all teams that lean heavily on analytics in their decision making, are getting an above-average number of runs for below-average spending. I would also put the Rays in that category because they had a very impressive win percentage relative to their payroll, but most of their surplus value comes from pitching and potentially catching rather than from their hitters.
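To see what those team coefficients mean, here is a small sketch on toy data. The three team labels and the salaries are invented, not taken from the dataset above. With a single categorical predictor, `lm` reports the baseline team’s mean salary as the intercept and every other team as a difference from that baseline.

```r
# Toy data: team labels and salaries are made up for illustration.
toy <- data.frame(
  Team   = factor(c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC")),
  Salary = c(2, 4, 5, 7, 1, 3) * 1e6
)

fit <- lm(Salary ~ Team, data = toy)
coef(fit)

# Compare with plain group means: the intercept matches the mean of the
# first factor level ("AAA"), and each remaining coefficient matches that
# team's mean minus the "AAA" mean.
team_means <- sapply(split(toy$Salary, toy$Team), mean)
team_means - team_means[["AAA"]]
```

So a negative coefficient in a table like the one above only says that a team’s hitters are paid less on average than the baseline team’s hitters, not that the team has a negative payroll.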
We will now look at creating a model that predicts the value of a hitter using more than On-Base Percentage, which is supposedly the only variable the “Moneyball” A’s used in their model. No offense, but their model didn’t actually help them that much. Their pitching staff was filled with All-Stars and future Hall of Famers: the rotation featured Tim Hudson, Mark Mulder, and Barry Zito, and their closer, Billy Koch, had the best year of his career. The team had a combined ERA of 3.68, which is incredibly good and would no doubt be a top-5 team ERA in any given year.
##
## Call:
## lm(formula = Salary ~ Age + b_ab_scoring + HR + Krate + HardHitP +
## in_zone_swing + SolidContactP + WhiffP + batted_ball, data = data)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -14208705  -3332354   -302251   2229660  25501117
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -41670793    4408201  -9.453   <2e-16 ***
## Age             1329874      97019  13.707   <2e-16 ***
## b_ab_scoring     -38492      24529  -1.569   0.1180
## HR               105996      50636   2.093   0.0374 *
## Krate             22220     136974   0.162   0.8713
## HardHitP          81728      64749   1.262   0.2081
## in_zone_swing     -5499       6014  -0.914   0.3614
## SolidContactP     46807     202120   0.232   0.8171
## WhiffP            32631     121424   0.269   0.7884
## batted_ball       29485      12203   2.416   0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4989000 on 230 degrees of freedom
## Multiple R-squared: 0.5393, Adjusted R-squared: 0.5213
## F-statistic: 29.92 on 9 and 230 DF, p-value: < 2.2e-16
The above grid of plots breaks down every current MLB player by position and shows how each player’s predicted salary compares to his actual salary. You could certainly pick a very good, cheap team with this model. Ideally you’d pick the players furthest above the line, like Hunter Pence in right field or Brett Gardner in left.
There is definitely more that should be done when choosing the components of a team. To start, better variables could easily be chosen to predict salary. Age has a very significant effect on predicted salary (about $1.33 million per additional year in the model above), and that should be kept in mind when choosing a team based on a linear regression model, because being older does not necessarily make you a better baseball player. You could try other models, such as a logistic regression model, that might handle age better, but I do believe building a model is where every team should start when choosing players, because now we can look deeper into why teams like the Angels and Tigers are paying so much without getting a corresponding number of runs. It’s because they have overpaid players like Albert Pujols, Mike Trout and Miguel Cabrera. Yes, they might all be extremely good players, but by this measure they are costing their teams more than they are helping them.
The next step in picking a team would be to build a model that predicts how many runs your team will allow and how many it will score; if those numbers won’t get you to the playoffs on average, then some changes need to be made. I strongly believe any team could produce a playoff team on any budget. Teams, just like any business, need to use analytics to see where their problems are. Tons of people would tell you that Mike Trout is the best player in MLB, but only an analyst would tell you that he costs way too much compared to what he actually produces. A big-name player like that will attract more fans and jersey sales, but is he worth the cost of the big contract? That’s something only the Angels’ finance department could know, but it’s definitely a question they should be looking at.
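The predicted-versus-actual comparison described above could be sketched roughly like this. It is only an illustration on invented players with two predictors, not the model fit earlier, and the column names `Predicted` and `Surplus` are hypothetical. The idea is to fit salary on a few stats and then rank players by how much the model thinks they should cost minus what they actually cost.

```r
# Toy data: players, ages, home runs, and salaries are all invented.
players <- data.frame(
  Player = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Age    = c(24, 26, 28, 30, 32, 34, 27, 31),
  HR     = c(30, 12, 25, 18, 35, 10, 22, 28),
  Salary = c(1.0, 1.5, 8.0, 9.0, 20.0, 12.0, 3.0, 25.0) * 1e6
)

fit <- lm(Salary ~ Age + HR, data = players)

# Predicted salary for each player under the fitted model.
players$Predicted <- fitted(fit)

# Positive surplus: the model expects this kind of player to cost more
# than he actually does, so he looks like a bargain.
players$Surplus <- players$Predicted - players$Salary

# Largest surplus first.
players[order(-players$Surplus), c("Player", "Salary", "Predicted", "Surplus")]
```

Because the model includes an intercept, the surpluses always sum to zero, so this ranking only says who is cheap or expensive relative to the other players in the data, not in any absolute sense.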