A regression model predicts a numeric value from several useful variables. Say you want to know how many innings a pitcher can throw before he gets injured, which happens far more often than you might guess, but that’s not the point. You would first gather data on previously injured pitchers. As in a classification model, you have a response variable, except this time it isn’t 1 or 0; it can be any real number. For this example, the response variable is Innings Pitched.

What you want to do next is find useful predictors of your response variable. Generally speaking, you want variables with a strong correlation to the response variable, but you need to make sure they are not simply built from the response variable itself. That kind of independence is very important. Say you wanted to predict how many times a batter strikes out: if you used his strikeout rate and his total at-bats, you would get the correct answer every time, but that doesn’t tell us anything useful. Obviously, multiplying a batter’s strikeout rate by his at-bats gives his total strikeouts. We would rather use variables like how many home runs he hit or the percentage of sliders thrown to him, because those could tell us something interesting that we didn’t already know.

To show a really simple example, I’ll start with a question: which teams are paying too much for their players?
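The strikeout example can be made concrete with a small sketch. This is Python rather than the R used for the models in this post, and every number in it is made up for illustration; `fit_line` is a hypothetical helper that does ordinary least squares with a single predictor.

```python
# Hypothetical season lines: (at_bats, strikeouts, home_runs). All numbers are made up.
batters = [
    (550, 110, 31),
    (480, 120, 12),
    (610, 92, 22),
    (300, 90, 8),
    (520, 130, 27),
]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

strikeouts = [so for _, so, _ in batters]

# Leaky "model": strikeout rate times at-bats is the response by definition.
leaky_errors = [abs((so / ab) * ab - so) for ab, so, _ in batters]

# Non-leaky model: predict strikeouts from home runs alone.
home_runs = [hr for _, _, hr in batters]
intercept, slope = fit_line(home_runs, strikeouts)
hr_errors = [abs(intercept + slope * hr - so)
             for hr, so in zip(home_runs, strikeouts)]

print("largest leaky error:", max(leaky_errors))
print("largest home-run model error:", max(hr_errors))
```

The leaky version is essentially perfect because it only restates the definition of a strikeout total. The home-run version misses on every player, but its slope is at least a statement about how two different things relate.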

First we need to determine what is important for a team. I can’t speak for all teams, but if I were in charge of one, my goal would be to win a World Series, and to do that you need to make the playoffs. That generally requires a win percentage around or above .600, which is about 97 wins over a 162-game season. I will be using this and a very simple regression model to answer which teams are not getting enough value out of their players.

## 
## Call:
## lm(formula = Salary ~ Team, data = data, model = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -12864958  -4248147  -2170503   3329665  22051500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3707663    2396122   1.547  0.12328   
## TeamATL      3672884    3492918   1.052  0.29423   
## TeamBAL       402411    3622596   0.111  0.91166   
## TeamBOS      5886194    3622596   1.625  0.10569   
## TeamCHC      4340511    3622596   1.198  0.23220   
## TeamCIN      1909939    3622596   0.527  0.59859   
## TeamCLE      1099850    3492918   0.315  0.75317   
## TeamCOL      5531633    3388628   1.632  0.10409   
## TeamCWS        30559    3788602   0.008  0.99357   
## TeamDET      4240837    4319671   0.982  0.32735   
## TeamHOU      3063070    3388628   0.904  0.36707   
## TeamKC       1051511    3788602   0.278  0.78163   
## TeamLAA      9718795    3492918   2.782  0.00589 **
## TeamLAD      1103587    3492918   0.316  0.75235   
## TeamMIA      -803496    3788602  -0.212  0.83225   
## TeamMIL      3610764    3388628   1.066  0.28785   
## TeamMIN      1694231    3230930   0.524  0.60057   
## TeamNYM       785919    3388628   0.232  0.81682   
## TeamNYY      1304621    3492918   0.374  0.70915   
## TeamOAK       275087    3302826   0.083  0.93370   
## TeamPHI      2959728    3492918   0.847  0.39776   
## TeamPIT      -449311    3388628  -0.133  0.89464   
## TeamSD       2282532    3622596   0.630  0.52933   
## TeamSEA      4578496    3388628   1.351  0.17811   
## TeamSF       9229837    4319671   2.137  0.03378 * 
## TeamSTL      4614067    3302826   1.397  0.16389   
## TeamTB      -1584603    3302826  -0.480  0.63189   
## TeamTEX      1945120    3302826   0.589  0.55654   
## TeamTOR      -668830    3388628  -0.197  0.84373   
## TeamWSH      3201779    3492918   0.917  0.36038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7188000 on 210 degrees of freedom
## Multiple R-squared:  0.1267, Adjusted R-squared:  0.00609 
## F-statistic: 1.051 on 29 and 210 DF,  p-value: 0.4024

For this regression model, I want to know how the machine would predict a player’s salary if it only knew what team he was on. Keep in mind that this only uses positions 3-10, because you would use other data to predict catcher and pitcher salaries, and I am mainly using this data to predict hitters’ salaries. I thought this example was interesting because the model is awful (team alone explains almost none of the variation in salary, as the adjusted R-squared of 0.006 shows) but still useful for teaching something I consider very valuable: all models are wrong, but some are useful. The point of this model is simply to get each team’s salaries while excluding pitchers and catchers, because their main objective isn’t to get hits and score runs but to prevent them. This is why I used runs scored rather than win percentage in the plot above.

The x-axis shows the coefficients from the linear model fit with the formula Salary ~ Team, where the response variable is the salary of each team’s players. Each coefficient is a team’s average salary relative to the baseline team, the one left out of the coefficient table. I know I could have easily just pulled team salaries off the internet, but I wanted to show a different way to use a regression model. As you can see, teams like the Tigers, Giants, and Angels are not getting enough production for what they are spending. On the opposite side, the Yankees, Astros, Dodgers, A’s, and Twins, all teams that lean heavily on analytics in their decision making, are getting an above-average number of runs for below-average spending. I would also put the Rays in that category, because they had a very impressive win percentage relative to their payroll, but most of their surplus value comes from pitching and potentially catching rather than from their hitters.
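To see why the Salary ~ Team coefficients can stand in for team payroll levels, it helps to know that a regression on a single categorical variable just reproduces group averages: the intercept is the baseline team’s mean salary, and every other coefficient is that team’s mean minus the baseline’s. The sketch below shows this in Python (not the R used above) with made-up teams and salaries; `team_coefficients` is a hypothetical helper, and picking the alphabetically first team as the baseline is an assumption meant to mirror R’s usual default.

```python
from collections import defaultdict

# Made-up (team, salary) rows; the team codes are placeholders, not real clubs.
rows = [
    ("AAA", 2_000_000), ("AAA", 4_000_000),
    ("BBB", 9_000_000), ("BBB", 11_000_000),
    ("CCC", 1_000_000), ("CCC", 3_000_000),
]

def team_coefficients(rows):
    """Return (baseline, intercept, coefficients) for a one-factor model."""
    salaries_by_team = defaultdict(list)
    for team, salary in rows:
        salaries_by_team[team].append(salary)
    means = {team: sum(vals) / len(vals) for team, vals in salaries_by_team.items()}

    # Assumption: the alphabetically first team is the reference level.
    baseline = min(means)
    intercept = means[baseline]
    # Every other team's coefficient is its mean minus the baseline mean.
    coefficients = {team: mean - intercept
                    for team, mean in means.items() if team != baseline}
    return baseline, intercept, coefficients

baseline, intercept, coefficients = team_coefficients(rows)
print(baseline, intercept, coefficients)
```

Read this way, the Angels’ coefficient of 9,718,795 in the output above says their average hitter salary is about 9.7 million higher than the baseline team’s 3,707,663, or roughly 13.4 million.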

We will now look at building a model that predicts the value of a hitter using more than On-Base Percentage, which is supposedly the only variable the “Moneyball” A’s used in their model. No offense, but their model didn’t actually help them that much. Their pitching staff was filled with All-Stars: it included starters Tim Hudson, Mark Mulder, and Barry Zito, plus closer Billy Koch, who had the best year of his career. The team had a combined ERA of 3.68, which is incredibly good and would put them among the top 5 team ERAs in most years.

## 
## Call:
## lm(formula = Salary ~ Age + b_ab_scoring + HR + Krate + HardHitP + 
##     in_zone_swing + SolidContactP + WhiffP + batted_ball, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14208705  -3332354   -302251   2229660  25501117 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -41670793    4408201  -9.453   <2e-16 ***
## Age             1329874      97019  13.707   <2e-16 ***
## b_ab_scoring     -38492      24529  -1.569   0.1180    
## HR               105996      50636   2.093   0.0374 *  
## Krate             22220     136974   0.162   0.8713    
## HardHitP          81728      64749   1.262   0.2081    
## in_zone_swing     -5499       6014  -0.914   0.3614    
## SolidContactP     46807     202120   0.232   0.8171    
## WhiffP            32631     121424   0.269   0.7884    
## batted_ball       29485      12203   2.416   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4989000 on 230 degrees of freedom
## Multiple R-squared:  0.5393, Adjusted R-squared:  0.5213 
## F-statistic: 29.92 on 9 and 230 DF,  p-value: < 2.2e-16

The above grid of plots breaks down every current MLB player by position and shows how his predicted salary compares to his actual salary. You could certainly pick a very good, cheap team with this model. Ideally you’d pick the players furthest above the line, like Hunter Pence in right field or Brett Gardner in left.

There is definitely more that should be done to choose the components of a team. To start, better variables could easily be chosen to predict salary. Age has a very significant effect on the predicted salary (in the model above, each additional year of age adds roughly 1.33 million to the prediction), and that should be kept in mind when choosing a team from a linear regression model, because being older does not necessarily make you a better baseball player. You could try other kinds of models, like a logistic regression model, that might handle age better, but I do believe building a model is where every team should start when choosing players, because now we can look deeper into why teams like the Angels and Tigers are paying so much without getting a corresponding number of runs. It’s because they have overpaid players like Albert Pujols, Mike Trout, and Miguel Cabrera. Yes, they might all be extremely good players, but they are costing their teams more than they are helping them.

The next step in picking a team would be to create a model that predicts how many runs your team will allow and how many it will score; if those numbers won’t get you to the playoffs on average, then some changes need to be made. I strongly believe any team could produce a playoff team on any budget. Teams, just like any business, need to use analytics to see where their problems are. Tons of people would tell you that Mike Trout is the best player in MLB, but only an analyst would tell you that he costs way too much compared to what he is actually doing. A big-name player like that will attract more fans and jersey sales, but is that worth the cost of the big contract? That’s something only the Angels’ finance department could know, but it’s definitely a question they should be looking at.
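Going back to the idea of picking the players furthest above the line, here is a minimal sketch of how that ranking could be done once a model has produced a predicted salary for each player. It is Python rather than R, the players and dollar figures are invented, and it assumes that “above the line” means the predicted salary is higher than the actual one; `rank_by_surplus` is a hypothetical helper, not something from the analysis above.

```python
# Hypothetical players: (name, actual_salary, predicted_salary). Values are made up.
players = [
    ("Player A", 2_000_000, 9_500_000),
    ("Player B", 24_000_000, 15_000_000),
    ("Player C", 6_000_000, 7_000_000),
    ("Player D", 12_000_000, 4_000_000),
]

def rank_by_surplus(players):
    """Sort players by predicted minus actual salary, largest surplus first."""
    scored = [(name, predicted - actual) for name, actual, predicted in players]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranking = rank_by_surplus(players)
for name, surplus in ranking:
    # A positive surplus means the model values the player above his pay.
    print(f"{name}: {surplus:+,}")
```

Players at the top of this list are the bargains, and players at the bottom are the ones the model considers overpaid. Because age drives so much of the prediction in the model above, a ranking like this should be read with that in mind.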