Yesterday, I posted an article describing how I modeled a way to tell whether, and by how much, pitchers may be able to pitch in such a way as to allow fewer or more runs than their components suggest, including the more subtle components like balks, SB/CS, WP, catcher PB, GIDP, and ROE.

For various reasons, I suggest taking these numbers with a grain of salt. For one thing, I need to tweak my RA9 simulator to take into consideration a few more of these subtle components. For another, there may be some things that stick with a pitcher from year to year that have nothing to do with his “RA9 skill” but which serve to increase or decrease run scoring, given the same set of components. Two of these are a pitcher’s outfielder arms and the vagaries of his home park, both of which have an effect on base runner advances on hits and outs. Using a pitcher’s actual sac flies against will mitigate this, but the sim is also using league averages for base runner advances on hits, which, as I said, can vary from pitcher to pitcher, and tend to persist from year to year (if a pitcher stays on the same team) based on his outfielders and his home park. Like DIPS, it would be better to do these correlations only on pitchers who switch teams, but I fear that the sample would be too small to get any meaningful results.

Anyway, I have a database now of the last 10 years’ differences between a pitcher’s RA9 and his sim RA9 (the runs per 27 outs generated by my sim), for all pitchers who threw to at least 100 batters in a season.

First here are some interesting categorical observations:

Jared Cross, of Steamer projections, suggested to me that perhaps some pitchers, like lefties, might hold base runners on first base better than others, and therefore depress scoring a little as compared to the sim, which uses league-average base running advancement numbers. Well, lefties actually did a hair worse in my database. Their RA9 was .02 runs greater than their sim RA, while righties were .01 runs better. That does not necessarily mean that RHP have some kind of RA skill that LHP do not have. It is more likely a bias in the sim that I am not correcting for.

How about the number of pitches in a pitcher’s repertoire? I hypothesized that pitchers with more pitches would be better able to tailor their approach to the situation. For example, with a base open, you want your pitcher to be able to throw lots of good off-speed pitches in order to induce a strikeout or weak contact, whereas you don’t mind if he walks the batter.

I was wrong. Pitchers with 3 or more pitches that they throw at least 10% of the time are .01 runs worse in RA9. Pitchers with only 2 or fewer pitches are .02 runs better. I have no idea why that is.

How about pitchers who are just flat out good in their components such that their sim RA is low, like under 4.00 runs? Their RA9 is .04 worse. Again, there might be some bias in the sim which is causing that. Or perhaps if you just go out there and “air it out” and try to get as many outs and strikeouts as possible, regardless of the situation, you are not pitching optimally.

Conversely, pitchers with a sim RA of 4.5 or greater shave .03 points off their RA9. If you are over 5 in your sim RA, your actual RA9 is .07 points better and if you are below 3.5, your RA9 is .07 runs higher. So, there probably is something about having extreme components that even the sim is not picking up. I’m not sure what that could be. Or, perhaps if you are simply not that good of a pitcher, you have to find ways to minimize run scoring above and beyond the hits and walks you allow overall.

For the NL pitchers, their RA9 is .05 runs better than their sim RA, and for the AL, they are .05 runs worse. So the sim is not doing a good job with respect to the leagues, likely because of pitchers batting. I’m not sure why, but I need to fix that. For now, I’ll adjust a pitcher’s sim RA according to his league.

You might think that younger pitchers would be “throwers” and older ones would be “pitchers” and thus their RA skill would reflect that. This time you would be right – to some extent.

Pitchers less than 26 years old were .01 runs worse in RA9. Pitchers older than 30 were .03 better. But that might just reflect the fact that pitchers older than 30 are just not very good – remember, we have a bias in terms of quality of the sim RA and the difference between that and regular RA9.

Actually, even when I control for the quality of the pitcher, the older pitchers had more RA skill than the younger ones by around .02 to .04 runs. As you can see, none of these effects, even if they are other than noise, is very large.

Finally, here are the lists of the 10 best and worst pitchers with respect to “RA skill,” with no commentary. I adjusted for the “quality of the sim RA” bias, as well as the league bias. Again, take these with a large grain of salt, considering the discussion above.

Best, 2004-2013:

Shawn Chacon -.18
Steve Trachsel -.18
Francisco Rodriguez -.18
Jose Mijares -.17
Scott Linebrink -.16
Roy Oswalt -.16
Dennys Reyes -.15
Dave Riske -.15
Ian Snell -.15

5 others tied for 10th.

Worst:

Derek Lowe .27
Luke Hochevar .20
Randy Johnson .19
Jeremy Bonderman .18
Blaine Boyer .18
Rich Hill .18
Jason Johnson .18

5 others tied for 8th place.

(None of these pitchers stand out to me one way or another. The “good” ones are not any you would expect, I don’t think.)

We showed in The Book that there is a small but palpable “pitching from the stretch” talent. That of course would affect a pitcher’s RA as compared to some kind of base runner and “timing” neutral measure like FIP or component ERA, or really any of the ERA estimators.

As well, a pitcher’s ability to tailor his approach to the situation, runners, outs, score, batter, etc., would also implicate some kind of “RA talent,” again, as compared to a “timing” neutral RA estimator.

A few months ago I looked to see if RE24 results for pitchers showed any kind of talent for pitching to the situation, by comparing that to the results of a straight linear weights analysis or even a BaseRuns measure. I found no year-to-year correlations for the difference between RE24 and regular linear weights. In other words, I was trying to see if some pitchers were able to change their approach to benefit them in certain bases/outs situations more than other pitchers. I was surprised that there was no discernible correlation, i.e., that it didn’t seem to be much of a skill if at all. You would think that some pitchers would either be smarter than others or have a certain skill set that would enable them, for example, to get more K with a runner on 3rd and less than 2 outs, more walks and fewer hits with a base open, or fewer home runs with runners on base or with 2 outs and no one on base. Obviously all pitchers, on the average, vary their approach a lot with respect to these things, but I found nothing much when doing these correlations. Essentially an “r” of zero.

To some extent the pitching from the stretch talent should show up in comparing RE24 to regular lwts, but it didn’t, so again, I was a little surprised at the results.

Anyway, I decided to try one more thing.

I used my “pitching sim” to compute a component ERA for each pitcher. I tried to include everything that would create or not create runs while he was pitching, like WP/PB, SB/CS, GIDP, and ROE, in addition to singles, doubles, triples, HR, BB, and so on. I considered an IBB as a 1/2 BB in the sim, since I didn’t program IBB into it.

So now, for each year, I recorded the difference between a pitcher’s RA9 and his simulated component RA9, and then ran year-to-year correlations. This was again to see if I could find a “RA talent” wherever it may lie – clutch pitching, stretch talent, approach talent, etc.

I got a small year-to-year correlation which, as always, varied with the underlying sample size – TBF in each of the paired years. When I limited it to pitchers with at least 500 TBF in each year, I got an “r” of .142 with an average PA of 791 in each year. That comes out to a 50% regression at around 5000 PA, or 5 years for a full-time starter, similar to BABIP for pitchers. In other words, the “stabilization” point was around 5,000 TBF.
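For the technically inclined, here is a small Python sketch of how that stabilization point falls out of the observed correlation. It assumes the usual relation r = PA / (PA + n0), where n0 is the sample size at which you would regress an observed difference 50% toward the mean; that relation is an assumption made for illustration, not something derived in this post.

```python
# Back out the "stabilization point" from an observed year-to-year correlation,
# assuming the standard relation r = PA / (PA + n0).

def stabilization_point(r, avg_pa):
    """Return the PA (here, TBF) at which you would regress an observed difference 50%."""
    return avg_pa * (1 - r) / r

r = 0.142      # observed year-to-year correlation (pitchers with 500+ TBF in both years)
avg_pa = 791   # average TBF in each of the paired years

print(round(stabilization_point(r, avg_pa)))  # ~4,800, i.e., roughly 5,000 TBF
```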

If that .142 is accurate (at 2 sigma the confidence interval is .072 to .211), I think that is pretty interesting. For example, notable “ERA whiz” Tom Glavine from 2001 to 2006, was an average of .246 in RA9 better than his sim RA9 (simulated component RA). If we regress that difference 50%, we get .133 runs per game, which is pretty sizable I think. That is over 1/3 of a win per season. Notable “ERA hack” Ricky Nolasco from 2008 to 2010 (I only looked at 2001-2010) was an average of .357 worse in his ERA. Regress that 62.5%, and we get .134 runs worse per season, also 1/3 of a win.

So, for example, if you want to know how to reconcile fWAR (FG) and bWAR (B-R) for pitchers, take the difference and regress according to the number of TBF, using the formula 5000/(5000+TBF) for the amount of regression.
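Here is a minimal sketch of that regression in Python. The TBF totals below are assumed purely for illustration (they are chosen to produce roughly the 50% and 62.5% regressions discussed above), not anyone's actual career numbers.

```python
# Shrink an observed (RA9 - sim RA9) difference toward zero by 5000 / (5000 + TBF).

def regressed_difference(observed_diff, tbf):
    regression = 5000 / (5000 + tbf)         # fraction of the difference to regress away
    return observed_diff * (1 - regression)  # what remains as estimated "RA skill"

# Illustrative (assumed) TBF totals:
print(round(regressed_difference(-0.246, 5000), 3))  # 50% regression   -> -0.123
print(round(regressed_difference(0.357, 3000), 3))   # 62.5% regression ->  0.134
```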

Here are a couple more interesting ones, off the top of my head. I thought that Livan Hernandez seemed like a crafty pitcher, despite having inferior stuff late in his career. Sure enough, he out-pitched his components by around .164 runs per game over 9 seasons. After regressing, that’s .105 rpg.

The other name that popped into my head was Wakefield. I always wondered if a knuckler was able to pitch to the situation as well as other pitchers could. It does not seem like they can, with only one pitch with comparatively little control. His RA9 was .143 worse than his components suggest, despite his FIP being .3 runs per 9 worse than his ERA! After regressing, he is around .095 worse than his simulated component RA.

Of course, after looking at Wake, we have to check Dickey as well. He didn’t start throwing a knuckle ball until 2005, and then only half the time. His average difference between RA9 and simulated RA9 is .03 on the good side, but our sample size for him is small with a total of only 1600 TBF, implying a regression of 76%.

If you want the numbers on any of your favorite or not-so-favorite pitchers, let me know in the comments section.

If anyone is out there (hello? helloooo?), as promised, here are the AL team expected winning percentages and their actual winning percentages, conglomerated over the last 5 years. In case you were waiting with bated breath, as I have been.

Combined results for all five years (AL 2009-2013), in order of the “best” teams to the “worst:”

Team   My WP   Vegas WP   Actual WP   Diff    My Starters   Actual Starters   My Batting   Actual Batting
NYA    .546    .566       .585        .039    98            99                .30          .45
TEX    .538    .546       .558        .020    102           95                .14          .24
OAK    .498    .490       .517        .019    104           101               -.08         .07
LAA    .508    .526       .522        .014    103           106               .07          .17
TBA    .556    .544       .562        .006    100           102               .24          .17
BAL    .460    .452       .463        .003    110           115               -.03         -.27
DET    .548    .547       .550        .002    97            91                .21          .31
BOS    .546    .596       .546        .000    99            98                .26          .36
CHW    .489    .450       .488        -.001   99            97                -.16         -.29
TOR    .479    .482       .478        -.001   106           107               -.05         .12
MIN    .468    .469       .464        -.004   108           109               -.07         -.07
SEA    .462    .464       .446        -.016   106           106               -.26         -.36
KCR    .474    .460       .444        -.030   108           106               -.22         -.28
CLE    .492    .469       .462        -.030   108           109               .13          .01
HOU    .420    .420       .386        -.034   106           109               -.46         -.61

I find this chart quite interesting. As with the NL, it looks to me like the top over-performing teams are managed by stable, high-profile, peer- and player-respected guys – Torre, Washington, Maddon, Scioscia, Leyland, Showalter.

Also, as with the NL teams, much of the differences between my model and the actual results are due to over-regression on my part, especially on offense. Keep in mind that I do include defense and base running in my model, so there may be some similar biases there.

Even after accounting for too much regression, some of the teams completely surprised me with respect to my model. Look at Oakland’s batting. I had them projected as a -.08 run per game team and somehow they managed to produce .07 rpg. That’s a huge miss over many players and many years. There has to be something going on there. Perhaps they know a lot more about their young hitters than we (I) do. That extra offense alone accounts for 16 points in WP, almost all of their 19 point over-performance. Even the A’s pitching outdid my projections.

Say what you will about the Yankees, but even though my undershooting their offense cost my model 16 points in WP, they still over-performed by a whopping 39 points, or 6.3 wins per season! I’m sure Rivera had a little to do with that even though my model includes him as closer. Then there’s the Yankee Mystique!

Again, even accounting for my too-aggressive regression, I completely missed the mark with the TOR, CLE, and BAL offense. Amazingly, while the Orioles pitched 5 points in FIP- worse than I projected and .24 runs per game worse on offense, they somehow managed to equal my projection.

Other notable anomalies are the Rangers’ and Tigers’ pitching. Those two starting staffs outdid me by seven and six points in FIP-, respectively, which is around 1/4 run in ERA – 18 points in WP. Texas did indeed win games at a 20 point clip better than I expected, but the Tigers, despite out-pitching my projections by 18 points in WP, AND outhitting me by another 11 points in WP, somehow managed to only win .3 games per season more than I expected. Must be that Leyland (anti-) magic!

Ok, enough of the bad Posnanski and Woody Allen rants and back to some interesting baseball analysis – sort of. I’m not exactly sure what to make of this, but I think you might find it interesting, especially if you are a fan of a particular team, which I’m pretty sure most of you are.

I went back five years and compared every team’s performance in each and every game to what would be expected based on their lineup that day, their starting pitcher, an estimate of their reliever and pinch hitter usage for that game, as well as the same for their opponent. Basically, I created a win/loss model for every game over the last five years. I didn’t simulate the game as I have done in the past. Instead, I used a theoretical model to estimate mean runs scored for each team, given a real-time projection for all of the relevant players, as well as the run-scoring environment, based on the year, league, and ambient conditions, like the weather and park (among other things).

When I say “real-time” projections, they are actually not up-to-the game projections. They are running projections for the year, updated once per month. So, for the first month of every season, I am using pre-season projections, then for the second month, I am using pre-season projections updated to include the first month’s performance, etc.

For a “sanity check” I am also keeping track of a consensus expectation for each game, as reflected by the Las Vegas line, the closing line at Pinnacle Sports Book, one of the largest and most respected online sports books in the internet betosphere.

The results I will present are the combined numbers for all five years, 2009 to 2013. Basically, you will see something like, “The Royals had an expected 5-year winning% of .487 and this is how they actually performed – .457.” I will present two expected WP actually – one from my models and one from the Vegas line. They should be very similar. What is interesting of course is the amount that the actual WP varies from the expected WP for each team. You can make of those variations what you want. They could be due to random chance, bad expectations for whatever reasons, or poor execution by the teams for whatever reasons.

Keep in mind that the composite expectations for the entire 5-year period are based on the expectation of each and every game. And because those expectations are updated every month by my model and presumably every day by the Vegas model, they reflect the changing expected talent of the team as the season progresses. By that, I mean this: Rather than using a pre-season projection for every player and then applying that to the personnel used or presumed used (in the case of the relievers and pinch hitters) in every game that season, after the first 30 games, for example, those projections are updated and thus reflect, to some extent, actual performance that season. For example, last year, pre-season, Roy Halladay might have been expected to have a 3.20 ERA or something like that. After he pitched horribly for a few weeks or months, and it was well-known that he was injured, his expected performance presumably changed in my model as well as in the Vegas model. Again, the Vegas model likely changes every day, whereas my model can only change after each month, or 5 times per season.

Here are the combined results for all five years (NL 2009-2013):

Team   My Model   Vegas   Actual   My Exp. SP (RA9-)   Actual SP (FIP-)   My Exp. Batting (marginal rpg)   Actual Batting (marginal rpg)
ARI    .496       .495    .486     103                 103                0                                -.08
ATL    .530       .545    .564     100                 97                 .25                              .21
CHC    .488       .478    .446     103                 102                -.09                             -.17
CIN    .522       .517    .536     104                 108                .01                              .12
COL    .494       .500    .486     102                 96                 -.04                             -.09
MIA    .493       .472    .453     102                 102                .01                              -.05
LAD    .524       .526    .542     96                  99                 .02                              -.03
MLW    .519       .509    .504     105                 108                .13                              .30
NYM    .474       .470    .464     106                 108                -.02                             .01
PHI    .516       .546    .554     96                  98                 -.01                             .07
PIT    .461       .454    .450     109                 111                -.19                             -.28
SDP    .469       .463    .483     110                 115                -.12                             -.26
STL    .532       .554    .558     100                 98                 .23                              .40
SFG    .506       .518    .515     98                  102                -.19                             -.30
WAS    .497       .484    .486     103                 103                .01                              .07

If you are an American league fan, you’ll have to wait until Part II. This is a lot of work, guys!

By the way, if you think that the Vegas line is remarkably good, and much better than mine, it is at least partly an illusion. They get to “cheat,” and to some extent they do. I can do the same thing, but I don’t. I am not looking at the expected WP and result of each game and then doing some kind of RMS error to test the accuracy of my model and the Vegas “model” on a game-by-game basis. I am comparing the composite results of each model to the composite W/L results of each team, for the entire 5 years. That probably makes little sense, so here is an example which should explain what I mean by the oddsmakers being able to “cheat,” thus making their composite odds close to the actual odds for the entire 5-year period.

Let’s say that before the season starts Vegas thinks that the Nationals are a .430 team. And let’s say that after 3 months, they were a .550 team. Now, Vegas by all rights should have them as something like a .470 team for the rest of the season – numbers for illustration purposes only – and my model should too, assuming that I started off with .430 as well. And let’s say that the updated expected WP of .470 were perfect and that they went .470 for the second half. Vegas and I would have a composite expected WP of .450 for the season, .430 for the first half and .470 for the second half. The Nationals record would be .510 for the season, and both of our models would look pretty bad.

However, Vegas, to some extent uses a team’s W/L record to-date to set the lines, since that’s what the public does and since Vegas assumes that a team’s W/L record, even over a relatively short period of time, is somewhat indicative of their true talent, which it is of course. After the Nats go .550 for the first half, Vegas can set the second-half odds as .500 rather than .470, even if they think that .470 is truly the best estimate of their performance going forward.

Once they do that, their composite expected WP for the season will be (.430 + .500) / 2, or .465, rather than my .450. And even if the .470 were correct, and the Nationals go .470 for the second half, whose composite model is going to look better at the end of the season? Theirs will of course.

If Vegas wanted to look even better for the season, they can set the second half lines to .550, on the average. Even if that is completely wrong, and the team goes .470 over the second half, Vegas will look even better at the end of the season! They will be .490 for the season, I will be .450, and the Nats will have a final W/L percentage of .510! Vegas will look nearly perfect and I will look bad, even though we had the same “wrong” expectation for the first half of the season, and I was right on the money for the second half and they were completely and deliberately wrong. Quite the paradox, huh? So take those Vegas lines with a grain of salt as you compare them to my model and to the final composite records of the teams. I’m not saying that my model is necessarily better than the Vegas model, only that in order to fairly compare them, you would have to take them one game at a time, or always look at each team’s prospective results compared to the Vegas line or my model.
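To make the paradox concrete, here is the hypothetical Nationals example worked out in a few lines of Python (all of the numbers are the made-up ones from the text):

```python
# Hypothetical Nationals example: composite expected WP versus actual WP.

first_half_actual, second_half_actual = 0.550, 0.470

my_halves    = [0.430, 0.470]  # preseason estimate, then an honest forward-looking update
vegas_halves = [0.430, 0.550]  # same preseason estimate, second half anchored to the .550 record

season_actual   = (first_half_actual + second_half_actual) / 2
my_composite    = sum(my_halves) / 2
vegas_composite = sum(vegas_halves) / 2

print(round(season_actual, 3), round(my_composite, 3), round(vegas_composite, 3))  # 0.51 0.45 0.49
# Vegas's composite lands much closer to the season record, even though its
# second-half number was the worse estimate of true talent going forward.
```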

Here is the same table as above, ordered by the difference between my expected w/l percentage and each team’s actual w/l percentage. The fifth column is that difference. Call those differences whatever you want – luck, team “efficiency,” good or bad managing, player development, team chemistry, etc. I hope you find these numbers as interesting as I do!

Combined results for all five years (NL 2009-2013), in order of the “best” teams to the “worst:”

Team   My Model   Vegas   Actual   Difference   My Exp. SP (RA9-)   Actual SP (FIP-)   My Exp. Batting (marginal rpg)   Actual Batting (marginal rpg)
PHI    .516       .546    .554     .038         96                  98                 -.01                             .07
ATL    .530       .545    .564     .034         100                 97                 .25                              .21
STL    .532       .554    .558     .026         100                 98                 .23                              .40
LAD    .524       .526    .542     .018         96                  99                 .02                              -.03
SDP    .469       .463    .483     .014         110                 115                -.12                             -.26
CIN    .522       .517    .536     .014         104                 108                .01                              .12
SFG    .506       .518    .515     .009         98                  102                -.19                             -.30
COL    .494       .500    .486     -.008        102                 96                 -.04                             -.09
NYM    .474       .470    .464     -.010        106                 108                -.02                             .01
PIT    .461       .454    .450     -.010        109                 111                -.19                             -.28
ARI    .496       .495    .486     -.010        103                 103                0                                -.08
WAS    .497       .484    .486     -.011        103                 103                .01                              .07
MLW    .519       .509    .504     -.015        105                 108                .13                              .30
MIA    .493       .472    .453     -.040        102                 102                .01                              -.05
CHC    .488       .478    .446     -.042        103                 102                -.09                             -.17

As you can see from either chart, it appears as if my model over-regresses both batting and starting pitching, especially the former.

Also, a quick and random observation from the above chart – it may mean absolutely nothing. It seems as though those top teams, most of them at least, have had notable, long-term, “players’ managers,” like Manuel, LaRussa, Mattingly, Torre, Black, Bochy, and Baker, while you might not be able to even recall or name most of the managers of the teams at the bottom. It will be interesting to see if the American League teams evince a similar pattern.

Note: After you read the Woody Allen example, please read the note below it, which describes how I screwed up the analysis!

One of the most important concepts in science, and sometimes in life, involves something called Bayesian Probability or Bayes Theorem. Since you are reading a sabermetric blog, you are likely at least somewhat familiar with it. Simply put, it has to do with conditional probability. You have probably read or heard about Bayes with respect to the following AIDS testing hypothetical.

Let’s say that you are not in a high risk group for contracting HIV, the virus that causes AIDS, or, alternatively, you are randomly selected from the adult U.S. population at large. And let’s say that in that population, one in 500 persons is HIV positive. You take an initial ELISA test, and it turns out positive for HIV. What are the chances that you actually carry the disease?

The first thing you need to know is the false positive rate for that particular test. It is also around one in 500. We’ll ignore the fact that there are better, more accurate tests available or that your blood specimen would be given another test if it had a positive ELISA. You might be tempted to think that your chances of carrying the virus are 99.8%, or one minus .002, where .002 is the one in 500 false positive rate.

And you would be wrong. Enter Bayes. Since you only had a 1 in 500 chance of being HIV+ going in, there is a prior probability which must be added “to the equation.”

To understand how this works, and to avoid any semi-complex Bayesian formulas, we can frame the analysis like this:

In a population of 500,000 persons, there would be 1,000 carriers, since we specified that the HIV rate was one in 500. All of them would test positive, assuming a zero false-negative rate. Among the 499,000 non-carriers, there would be 998 false positives (a one in 500 chance).

So in our population of 500,000 persons, there are 1,998 positives and only 1,000 of these truly carry the virus. The other 998 positives are false. If you are selected from this population, and have a positive ELISA test, you naturally have a 1,000 in 1,998, or around a 50% chance of having the disease. That is a far cry from 99.8%, and should be somewhat comforting to anyone who fails an initial screening. That is basically how Bayes works, although it can get far more complex than that. It also applies to many, many other important things in life, including the guilt or innocence of a defendant in a criminal or civil prosecution, which I will address next.
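The same counting argument, as a short Python sketch:

```python
# Posterior probability of carrying HIV given a positive ELISA, using the
# prevalence and false-positive rate from the example above
# (the false-negative rate is assumed to be zero, as in the text).

population     = 500_000
prevalence     = 1 / 500
false_pos_rate = 1 / 500

true_positives  = population * prevalence                          # 1,000 carriers, all test positive
false_positives = population * (1 - prevalence) * false_pos_rate   # 998 false alarms

posterior = true_positives / (true_positives + false_positives)
print(round(posterior, 3))  # ~0.5, a far cry from 0.998
```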

Another famous, but less well-known, illustration of Bayes with respect to the criminal justice system, involves a young English woman named Sally Clark who was convicted of killing two of her children in 1999. In 1996, her first-born son died, presumably of Sudden Infant Death Syndrome (SIDS). In 1998, she gave birth to another boy, and he too died at home shortly after birth. She and her husband were soon arrested and charged with the murder of the two boys. The charges against her husband were eventually dropped.

Sally was convicted of both murders and sentenced to life in prison. By the way, she and her husband were affluent attorneys in England. At her trial, the following statistical “evidence” was presented by a pediatrician for the prosecution:

He testified that there was about a 1 in 8,500 chance that a baby in that situation would die of SIDS and therefore the chances that both of her children would perish from natural causes related to that syndrome was 1/8500 times 1/8500, or 1 in 73 million. Sally Clark was convicted largely on the “strength” of the statistical “evidence” that the chance of those babies both dying from SIDS, which was the defense’s assertion, was almost zero.

First of all, the 1 in 73 million might not be accurate. It is possible, in fact likely, according to the medical research, that those two probabilities are not independent. If you want to know the chances of two events occurring, multiplying the chances of one event by the other is only proper when the probabilities of the two events are independent – Stats 101. In this case, it was estimated by an expert witness for the defense in an appeal that if one infant in a family dies of SIDS, the chances that another one also dies similarly are 5 to 10 times higher than the initial probability.

So that reduces our probability to between one in 15 million and one in 7 million. In addition, the same expert witness, a Professor of Mathematics who studied the historical SIDS data, argued that the 1 in 8,500 was really closer to 1 in 1,300 due to the gender of the Clark babies and other genetic and environmental characteristics. If that number is accurate, that brings us down to 1 in 227,000 for the chances of her two boys both dying of SIDS. While a far cry from 1 in 73 million, that is still some pretty damning evidence, right?

Wrong! That 1 in 227,000 chance of both boys dying of SIDS, or the inverse, a 99.99956% chance of the deaths being due to something other than SIDS, like murder, is like our erroneous 99.8% chance of having HIV when our initial AIDS test is positive. In order to calculate the true odds of Mrs. Clark being guilty of murder based solely on the statistical evidence, we need to know, as with the AIDS test, what the chances are, going in, before we know about the deaths, that a woman like Sally Clark would be a double murderer of her own children. That is exactly the same thing as us needing to know the chances that we are an HIV carrier before we are tested, based upon the population we belong to. Remember, that was 1 in 500, which transformed our odds of having HIV from 99.8% to only 50%.

In this case, it is obviously difficult to estimate that a priori probability, the chances that a woman in Sally Clark’s shoes would murder her only two children back to back. The same mathematician estimated that the chances of Sally Clark being a double murderer, knowing nothing about what actually happened, was much rarer than the chances of both of her infants dying of natural causes. In fact, he claimed that it was 4 to 10 times rarer, which means that out of all young, affluent mothers with two new-born sons, maybe 1 in a million or 1 in 2 million would kill both of their children. That does not seem like an unreasonable estimate to me, although I have no way of knowing that off the top of my head.

So, as with the AIDS test, if there were a population of one million similar women with two newly born boys, around 4 of them (1 in 227,000) would suffer the tragedy of back-to-back deaths by SIDS, and only ½ to 1 would commit double infanticide. So the odds, based solely on these statistics, of Sally Clark being guilty as charged were around 10 to 20%, obviously not nearly enough to convict, and just a tad less than the 72,999,999 to 1 that the prosecution implied at her trial.
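Here is that arithmetic as a brief Python sketch, using the 1-in-227,000 figure and the expert's range of 1-in-1-million to 1-in-2-million for double infanticide:

```python
# Per one million comparable mothers of two infant boys:
mothers       = 1_000_000
p_double_sids = 1 / 227_000

double_sids_cases = mothers * p_double_sids   # ~4.4 pairs of natural deaths

# The expert's range: 1 in 2 million to 1 in 1 million would be double murderers.
for double_murderers in (0.5, 1.0):
    p_guilty = double_murderers / (double_murderers + double_sids_cases)
    print(round(p_guilty, 2))   # ~0.10 and ~0.19 -- the "10 to 20%" above
```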

Anyway, after spending more than 3 years in prison, she won her case on appeal and was released. The successful appeal was based not only on the newly presented Bayesian evidence, but on the fact that the prosecution withheld evidence that her second baby had had an infection that may have contributed to his death from natural causes. Unfortunately, Sally Clark, unable to deal with the effects of her children’s deaths, the ensuing trial and incarceration, and public humiliation, died of self-inflicted alcohol poisoning 4 years later.

Which brings us to our final example of how Bayes can greatly affect an accused person’s chances of guilt or innocence, and perhaps more importantly, how it can cloud the judgment of the average person who is not statistically savvy, such as the judge and jurors, and the public, in the Clark case.

Unless you avoid the internet and the print tabloids like the plague, which is unlikely since you’re reading this blog, you no doubt know that Woody Allen was accused around 20 years ago of molesting his adopted 7-year old daughter, Dylan Farrow. The case was investigated back then, and no charges were ever filed. Recently, Dylan brought up the issue again in a NY Times article, and Allen issued a rebuttal and denial in his own NY Times op-ed. Dylan’s mother Mia, Woody Allen’s ex-partner, is firmly on the side of Dylan, and various family members are allied with one or the other. Dylan is unwavering in her memories and claims of abuse, and Mia is equally adamant about her belief that their daughter was indeed molested by Woody.

I am not going to get into any of the so-called evidence one way or another or comment on whether I think Woody is guilty or not. Clearly I am not in a position to do the latter. However, I do want to bring up how Bayes comes into play in this situation, much like with the AIDS and SIDS cases described above, and how, in fact, it comes into play in many “he-said, she-said” claims of sexual and physical abuse, whether the alleged victim is a child or an adult. If you have been following along so far, you probably know where I am going with this.

In cases like this, whether there is corroborating evidence or not, it is often alleged by the prosecution or the plaintiff in civil cases, that there is either no reason for the alleged victim to lie about what happened, or that given the emotional and graphic allegations or testimony of the victim, especially if it is a child, common sense tells us that the chances of the victim lying or being mistaken is extremely low. And that may well be the case. However, as you now know or already knew, according to Bayes, that is often not nearly enough to convict a defendant, even in a civil case where the burden on the plaintiff is based on a “preponderance of the evidence.”

Let’s use the Woody Allen case as an example. Again, we are going to ignore any incriminating or exculpatory evidence other than the allegations of Dylan Farrow, the alleged victim, and perhaps the corroborating testimony of her mother. Clearly, Dylan appears to believe that she was molested by Woody when she was seven, and clearly she seems to have been traumatically affected by her recollection of the experience. Please understand that I am not suggesting one way or another whether Dylan or anyone else is telling the truth or not. I have no idea.

Her mother, Mia, although she did not witness the alleged molestation, claims that, shortly after the incident, Dylan told her what happened and that she wholeheartedly believes her. Many people are predicating Allen’s likely guilt on the fact that Dylan seems to clearly remember what happened and that she is a credible person and has no reason to lie, especially at this point in her life and at this point in the timeline of the events. The statute of limitations precludes any criminal charges against Allen, and likely any civil action as well. I would assume however, that hypothetically, if this case were tried in court, the emotional testimony of Dylan would be quite damaging to Woody, as it often is in a sexual abuse case in which the alleged victim testifies.

Now let’s do the same Bayesian analysis that we did in the above two situations, the AIDS testing and the murder case, and see if we can come up with any estimate as to the likely guilt or innocence of Woody Allen and perhaps other people accused of sexual abuse where the case hinges to a large extent on the credibility of the alleged victim and his or her testimony. We’ll have to make some very rough assumptions, and again, we are assuming no other evidence, for or against.

First, we’ll assume that the chances of the victim and perhaps other people who were told of the alleged events by the victim, such as Dylan’s mother, Mia Farrow, lying or being delusional are very slim. So we are actually on the hypothetical prosecution or plaintiff’s side. ‘How is it possible that this victim and/or her mother would be lying about something as serious and traumatic as this?’

Now, even common sense tells us that it is possible, but not likely. I have no idea what the statistics or the assumptions in the field are, but surely there are many cases of fabrication by victims, false repressed memories by victims who are treated by so-called clinicians who specialize in repressed memories of physical or sexual abuse, memories that are “implanted” in children by unscrupulous parents, etc. There are many documented cases of all of the above and more. Again, I am not saying that this case fits into one of these profiles and that Dylan is lying or mistaken, although clearly that is possible.

Let’s put the number at 1 in 100 in a case similar to this. I’m not sure that any reasonable person could quarrel too much with that. I could easily make the case that it is higher than that. The population that we are talking about is this: First, we have a 7-year-old child. The chances that the recollections of a young child might be faulty, including the chance that those recollections were planted or at least influenced by an adult, have to be greater than they are for an adult. The fact that Woody and Mia were already having severe relationship problems and in a bitter custody dispute also increases the odds that Dylan might have been “coached” or influenced in some manner by her mother. But I’ll leave the odds at 100-1 against. So, Allen is 99% guilty, right? You already know that the answer to that is, “No, not even close.”

So now we have to bring in Thomas Bayes as our expert witness. What are the chances that a random affluent and famous father like Woody Allen, again, not assuming anything else about the case or about Woody’s character or past or future behavior, would molest his 7-year old daughter? Again, I have no idea what that number is, but we’ll also say that it’s 100-1 against. I think it is lower than that, but I could be wrong.

So now, in order to compute the chances that Allen, or anyone else in a similar situation, where the alleged victim is a very credible witness – like we believe that there is a 99% chance they are telling the truth – is guilty, we can simply take the ratio of the prior probability of guilt, assuming no accusations at all, to the chances of the victim lying or otherwise being mistaken. That gives us the odds that the accused is guilty. In this case, it is .01 divided by .01 or 1, which means that it is “even money” that Woody Allen is guilty as charged, again, not nearly enough to convict in a criminal court. Unfortunately, many, perhaps most, people, including jurors in an actual trial, would assume that if there were a 99% chance that the alleged victim was telling the truth, well, the accused is most likely guilty!

Edit: As James in the comments section, Tango on the Book blog, and probably others, have noted, I screwed up the Woody Allen analysis. The only way that Bayes would come into play as I describe would be if we assumed that 1 out of 100 random daughters in a similar situation would make a false accusation against a father like Woody. That seems like a rather implausible assumption, but maybe not – I don’t really know. In any case, if that were true, then while my Bayesian analysis would be correct and it would make Allen have around a 50% chance of being guilty, the chances that Dylan was not telling the truth would not be 1% as I indicated. It would be a little less than 50%.

So, really, the chances that she is telling the truth is equal to the chances of Allen being guilty, as you might expect. In this case, unlike in the other two examples I gave, the intuitive answer is correct, and Bayes is not really implicated. The only way that Bayes would be implicated in the manner I described would be if a prosecutor or plaintiff’s lawyer pointed out that 99% of all daughters do not make false accusations against a father like Woody, therefore there is a 99% chance that she is telling the truth. That would be wrong, but that was not the point I was making. So, mea culpa, I screwed up, and I thank those people who pointed that out to me, and I apologize to the readers. 

I should add this:

The rate of false accusations is probably not significantly related to the rate of true accusations or the actual rate of abuse in any particular population. In other words, if the overall false accusation rate is 5-10% of all accusations, which is what the research suggests, that percentage will not be the same in a population where the actual incidence of abuse is 20% as in one where it is 5%. The ratio of false to true accusations is probably not constant. What is likely somewhat constant is the percentage of false accusations as compared to the number of potential accusations, although there are surely factors which would make false accusations more or less likely, such as the relationship between the mother and father.

What that means is that the extrinsic (outside of the accusation itself) chance that an accused person is guilty is related to the chances of a false accusation. If in one population the incidence of abuse is 20%, there is probably a much lower chance that a person who makes an accusation is lying, as compared to a population where the incidence of abuse is, say, 5%.

So, if an accused person is otherwise not likely to be guilty but for an accusation, a prosecutor would be misleading the jury if he reported that overall only 5% of all accusations were false therefore the chance that this accusation is false, is also 5%.

If that is hard to understand, imagine a population of persons where the chance of abuse is zero. There will still be some false accusations in that population, and since there will be no real ones, the chances that someone is telling the truth if they accuse someone is zero. The percentage of false accusations is 100%. If the percentage of abuse in a population is very high, then the ratio of false to true accusations will be much lower than the overall 5-10% number.
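A toy Python illustration of that last point; the 1% false-accusation rate (per potential accuser) and the 100% reporting rate are assumptions chosen purely to make the arithmetic easy, not estimates from the research:

```python
# Hold the false-accusation rate (per potential accuser) fixed and vary the true
# incidence of abuse; the share of accusations that are false changes dramatically.

population        = 100_000
false_accuse_rate = 0.01   # assumed share of non-abused who nonetheless accuse
report_rate       = 1.0    # assume every real case produces an accusation

for abuse_rate in (0.00, 0.05, 0.20):
    true_accusations  = population * abuse_rate * report_rate
    false_accusations = population * (1 - abuse_rate) * false_accuse_rate
    share_false = false_accusations / (true_accusations + false_accusations)
    print(abuse_rate, round(share_false, 2))

# 0.0  -> 1.0  (every accusation is false)
# 0.05 -> 0.16
# 0.2  -> 0.04
```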

* And why I am getting tired of writers and analysts picking and choosing one or more of a bushel of statistics to make their (often weak) point.

Let’s first get something out of the way:

Let’s say that you know of this very good baseball player. He is well-respected and beloved on and off the field; he played for only one, dynastic, team; he has several World Series rings, double-digit All-Star appearances, dozens of awards, including 5 Gold Gloves, 5 Silver Sluggers, and a host of other commendations and accolades. Oh, and he dates super models and doesn’t use PEDs (we think).

Does it matter whether he is a 40, 50, 60, 80, or 120 win (WAR) player in terms of his HOF qualifications? I submit that the answer is an easy, “No, it doesn’t.” He is a slam dunk HOF’er whether he is indeed a very good, great, or all-time, inner-circle great player. If you want to debate his goodness or greatness, fine. But it would be disingenuous to debate that in terms of his HOF qualifications. There are no serious groups of persons, including “stat-nerds,” whose consensus is that this player does not belong in the HOF.

Speaking of strawmen, before I lambaste Mr. Posnanski, which is the crux of this post, let me start by giving him some major props for pointing out that this article, by the “esteemed” and “venerable” writer Allen Barra, is tripe. That is Pos’ word – not mine. Indeed, the article is garbage, and Barra, at least when writing about anything remotely related to sabermetrics, is a hack. Unfortunately, Posnanski’s article is not much further behind in tripeness.

Pos’ thesis, I suppose, can be summarized by this, at the beginning of the article:

[Jeter] was a fantastic baseball player. But you know what? Alan Trammell was just about as good.

Here are Alan Trammell’s and Derek Jeter’s neutralized offensive numbers.

Trammell: .289/.357/.420
Jeter: .307/.375/.439

Jeter was a better hitter. But it was closer than you might think.

He points out several times in the article that, “Trammell was almost as good as Jeter, offensively.”

Let’s examine that proposition.

First though, let me comment on the awful argument, “Closer than you think.” Pos should be ashamed of himself for using that in an assertion or argument. It is a terrible way to couch an argument. First of all, how does he know, “What I think?” And who is he referring to when he says, “You?” The problem with that “argument,” if you want to even call it that, is that it is entirely predicated on what the purveyor decides “You are thinking.” Let’s say a player has a career OPS of .850. I can say, “I will prove that he is better than you think, assuming of course that you think that he is worse than .850, and it is up to me to determine what you think.” Or I can say the opposite. “This player is worse than you think, assuming of course, that you think that he is better than an .850 player. And I am telling you that you are thinking that (or at least implying that)!”

Sometimes it is obvious what, “You think.” Often times it is not. And that’s even assuming that we know who, “You” is. In this case, is it obvious what, “You think of Jeter’s offense compared to Trammell?” I certainly don’t think so, and I know a thing or two about baseball. I am pretty sure that most knowledgeable baseball people think that both players were pretty good hitters overall and very good hitters for a SS. So, really, what is the point of, “It was closer than you think.” That is a throwaway comment and serves no purpose other than to make a strawman argument.

But that is only the beginning of what’s wrong with this premise and this article in general. He goes on to state or imply two things. One, that their “neutralized” career OPS’s are closer than their raw ones. I guess that is what he means by “closer than you think,” although he should have simply said, “Their neutralized offensive stats are closer than their non-neutralized ones,” rather than assuming what, “I think.”

Anyway, it is true that in non-neutralized OPS, they were 60 points apart, whereas once “neutralized,” at least according to the article, the gap is only 37 points, but:

Yeah, it is closer once “neutralized” (I don’t know where he gets his neutralized numbers from or how they were computed), but 37 points is a lot, man! I don’t think too many people would say that a 37 point difference, especially over 20-year careers, is “close.”

More importantly, a big part of that “neutralization” is due to the different offensive environments. Trammell played in a lower run scoring environment than did Jeter, presumably, at least partially, because of rampant PED use in the 90’s and aughts. Well, if that’s true, and Jeter did not use PED’s, then why should we adjust his offensive accomplishments downward just because many other players, the ones who were putting up artificially inflated and gaudy numbers, were using? Not to mention the fact that he had to face juiced-up pitchers and Trammell did not! In other words, you could easily make the argument, and probably should, that if (you were pretty sure that) a player was not using during the steroid era, that his offensive stats should not be neutralized to account for the inflated offense during that era, assuming that that inflation was due to rampant PED use of course.

Finally, with regard to this, somewhat outlandish, proposition that Jeter and Trammell were similar in offensive value (of course, it depends on your definition of “similar” and “close” which is why using words like that creates “weaselly” arguments), let’s look at the (supposedly) context-neutral offensive runs or wins above replacement (or above average – it doesn’t matter what the baseline is when comparing players’ offensive value) from Fangraphs.

Jeter: 369 runs batting, 43 runs base running
Trammell: 124 runs batting, 23 runs base running

Whether you want to include base running in “offense” doesn’t matter. Look at the career batting runs. 369 runs to 124. Seriously, what was Posnanski drinking (aha, that’s it – Russian vodka! – he is in Sochi in case you didn’t know) when he wrote an entire article mostly about how similar Trammell and Jeter were, offensively, throughout their careers? And remember, these are linear weights batting runs, which are presented as “runs above or below average” compared to a league-average player. In other words, they are neutralized with respect to the run-scoring environment of the league. Again, with respect to PED use during Jeter’s era, we can make an argument that the gap between them is even larger than that.

So, Posnanski tries to make the argument that, “They are not so far apart offensively as some people might think (yeah, the people who look at their stats on Fangraphs!),” by presenting some “neutralized” OPS stats. (And again, he is claiming that a 37-point difference is “close,” which is eminently debatable.)

Before he even finishes, I can make the exact opposite claim – that they are worlds apart offensively – by presenting their career (similar length careers, by the way, although Jeter did play in 300 more games), league- and park-adjusted batting runs. They are 245 runs, or 24 wins, apart!

That, my friends, is why I am sick and tired of credible writers and even some analysts making their point by cherry picking one (or more than one) of scores of legitimate and semi-legitimate sabermetric and not-so-sabermetric statistics.

But, that’s not all! I did say that Posnanski’s article was hacktastic, and I didn’t just mean his sketchy use of one (not-so-great) statistic (“neutralized” OPS) to make an even sketchier point.

This:

By Baseball Reference’s defensive WAR Trammell was 22 wins better than a replacement shortstop. Jeter was nine runs worse.

By Fangraphs, Trammell was 76 runs better than a replacement shortstop. Jeter was 139 runs worse.

Is an abomination. First of all, when talking about defense, you should not use the term “replacement” (and you really shouldn’t use it for offense either). Replacement refers to the total package, not to one component of player value. Replacement shortstops could be average or above-average defenders and awful hitters, decent hitters and terrible defenders, or anything in between. In fact, for various reasons, most replacement players are average or so defenders and poor hitters.

And then he conflates wins and runs (don’t use both in the same paragraph – that is sure to confuse some readers), although I know that he knows the difference. In fact, I think he means “nine wins” worse in the first sentence, and not “nine runs worse.” But, that mistake is on him for trying to use both wins and runs when talking about the same thing (Jeter and Trammell’s defense), for no good reason.

Pos then says:

You can buy those numbers or you can partially agree with them or you can throw them out entirely, but there’s no doubt in my mind that Trammell was a better defensive shortstop.

Yeah, yada, yada, yada. Yeah we know. No credible baseball person doesn’t think that Trammell was much the better defender. Unfortunately we are not very certain of how much better he was in terms of career runs/wins. Again, not that it matters in terms of Jeter’s qualifications for, or his eventually being voted into, the HOF. He will obviously be a first-ballot, near-unanimous selection, and rightfully so.

Yes, it is true that Trammell has not gotten his fair due from the HOF voters, for whatever reasons. But, comparing him to Jeter doesn’t help make his case, in my opinion. Jeter is not going into the HOF because he has X number of career WAR. He is going in because he was clearly a very good or great player, and because of the other dozen or more things he has going for him that the voters (and the fans) include, consciously or not, in terms of their consideration. Even if it could be proven that Jeter and Trammell had the exact same context-neutral statistical value over the course of their careers, Jeter could still be reasonably considered a slam dunk HOF’er and Trammell not worthy of induction (I am not saying that he isn’t worthy). It is still the Hall of Fame (which means many different things to many different people) and not the Hall of WAR or the Hall of Your Context-Neutral Statistical Value.

For the record, I love Posnanski’s work in general, but no one is perfect.

In The Book: Playing the Percentages in Baseball, we found that when a batter pinch hits against right-handed relief pitchers (so there are no familiarity or platoon issues), his wOBA is 34 points (10%) worse than when he starts and bats against relievers, after adjusting for the quality of the pitchers in each pool (PH or starter). We called this the pinch hitting penalty.

We postulated that the reason for this was that a player coming off the bench in the middle or towards the end of a game is not as physically or mentally prepared to hit as a starter who has been hitting and playing the field for two or three hours. In addition, some of these pinch hitters are not starting because they are tired or slightly injured.

We also found no evidence that there is a “pinch hitting skill.” In other words, there is no such thing as a “good pinch hitter.” If a hitter has had exceptionally good (or bad) pinch hitting stats, it is likely that that was due to chance alone, and thus it has no predictive value. The best predictor of a batter’s pinch-hitting performance is his regular projection with the appropriate penalty added.

We found a similar situation with designated hitters. However, their penalty was around half that of a pinch hitter, or 17 points (5%) of wOBA. Similar to the pinch hitter, the most likely explanation for this is that the DH is not as physically (and perhaps mentally) prepared for each PA as a player who is constantly engaged in the game. As well, the DH may be slightly injured or tired, especially if he is normally a position player. It makes sense that the DH penalty would be less than the PH penalty, as the DH is more involved in a game than a PH. Pinch hitting is often considered “the hardest job in baseball.” The numbers suggest that that is true. Interestingly, we found a small “DH skill” such that different players seem to have more or less of a true DH penalty.

Andy Dolphin (one of the authors of The Book) revisited the PH penalty issue in this Baseball Prospectus article from 2006. In it, he found a PH penalty of 21 points in wOBA, or 6%, significantly less than what was presented in The Book (34 points).

Tom Thress, on his web site, reports a PH penalty of .009 in “player won-loss records” (offensive performance translated into a “w/l record”), which he says is similar to that found in The Book (34 points). However, he finds an even larger DH penalty of .011 wins, which is more than twice that which we presented in The Book. I assume that .011 is slightly larger than 34 points in wOBA.

So, everyone seems to be in agreement that there is a significant PH and DH penalty, however, there is some disagreement as to the magnitude of each (with empirical data, we can never be sure anyway). I am going to revisit this issue by looking at data from 1998 to 2012. The method I am going to use is the “delta method,” which is common when doing this kind of “either/or” research with many player seasons in which the number of opportunities (in this case, PA) in each “bucket” can vary greatly for each player (for example, a player may have 300 PA in the “either” bucket and only 3 PA in the “or” bucket) and from player to player.

The “delta method” looks something like this: Let’s say that we have 4 players (or player seasons) in our sample, and each player has a certain wOBA and number of PA in bucket A and in bucket B, say, DH and non-DH – the number of PA are in parentheses.

           wOBA as DH    wOBA as Non-DH
Player 1   .320 (150)    .330 (350)
Player 2   .350 (300)    .355 (20)
Player 3   .310 (350)    .325 (50)
Player 4   .335 (100)    .350 (150)

In order to compute the DH penalty (difference between when DH’ing and playing the field) using the “delta method,” we compute the difference for each player separately and take a weighted average of the differences, using the lesser of the two PA (or the harmonic mean) as the weight for each player. In the above example, we have:

((.330 – .320) * 150 + (.355 – .350) * 20 + (.325 – .310) * 50 + (.350 – .335) * 100) / (150 + 20 + 50 + 100)

If you didn’t follow that, that’s fine. You’ll just have to trust me that this is a good way to figure the “average difference” when you have a bunch of different player seasons, each with a different number of opportunities (e.g. PA) in each bucket.
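For anyone who prefers code to prose, here is the same toy calculation in Python:

```python
# "Delta method": weighted average of each player's (non-DH minus DH) wOBA
# difference, weighted by the lesser of his two PA totals.

players = [
    # (wOBA as DH, PA as DH, wOBA as non-DH, PA as non-DH)
    (0.320, 150, 0.330, 350),
    (0.350, 300, 0.355, 20),
    (0.310, 350, 0.325, 50),
    (0.335, 100, 0.350, 150),
]

numerator   = sum((non_dh - dh) * min(pa_dh, pa_non) for dh, pa_dh, non_dh, pa_non in players)
denominator = sum(min(pa_dh, pa_non) for _, pa_dh, _, pa_non in players)

print(round(numerator / denominator, 4))  # ~0.012, i.e., a 12-point DH penalty in this toy sample
```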

In addition to figuring the PH and DH penalties (in various scenarios, as you will see), I am also going to look at some other interesting “penalty situations” like playing in a day game after a night game, or both games of a double header.

In my calculations, I adjust for the quality of the pitchers faced, the percentage of home and road PA, and the platoon advantage between the batter and pitcher. If I don’t do that, it is possible for one bucket to be inherently more hitter-friendly than the other bucket, either by chance alone or due to some selection bias, or both.

First let’s look at the DH penalty. Remember that in The Book, we found a roughly 17 point penalty, and Tom Thress found a penalty that was greater than that of a PH, presumably more than 34 points in wOBA.

Again, my data was from 1998 to 2012, and I excluded all inter-league games. I split the DH samples into two groups: One group had more DH PA than non-DH PA in each season (they were primarily DH’s), and vice versa in the other group (primarily position players).

The DH penalty was the same in both groups – 14 points in wOBA.

The total sample sizes were 10,222 PA for the primarily DH group and 32,797 for the mostly non-DH group. If we combine the two groups, we get a total of 43,019 PA. That number represents the total of the “lesser of the PA” for each player season. One standard deviation in wOBA for that many PA is around 2.5 wOBA points. For the difference between two groups of 43,000 each, it is 3.5 points (the square root of the sum of the variances). So we can say with 95% confidence that the true DH penalty is between 7 and 21 points with the most likely value being 14. This is very close to the 17 point value we presented in The Book.
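The confidence-interval arithmetic, sketched in Python (the 2.5-point standard deviation per group is the figure quoted above):

```python
import math

sd_per_group = 2.5   # SD of wOBA, in points, for a ~43,000 PA sample (as stated above)
penalty      = 14    # observed DH penalty, in wOBA points

# SD of the difference between two independent samples = square root of the sum of the variances.
sd_diff = math.sqrt(sd_per_group ** 2 + sd_per_group ** 2)   # ~3.5 points

low, high = penalty - 2 * sd_diff, penalty + 2 * sd_diff
print(round(sd_diff, 1), round(low), round(high))  # 3.5 7 21
```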

I expected that the penalty would be greater for position players who occasionally DH’d rather than DH’s who occasionally played in the field. That turned out not to be the case, but given the relatively small sample sizes, the true values could very well be different.

Now let’s move on to pinch hitter penalties. I split those into two groups as well: one versus starting pitchers and the other versus relievers. We would expect the former to show a greater penalty since a “double whammy” would be in effect – first, the “first time through the order” penalty, and second, the “sitting on the bench” penalty. In the reliever group, we would only have the “coming in cold” penalty. I excluded all ninth innings or later.

Versus starting pitchers only, the PH penalty was 19.5 points in 8,523 PA. One SD is 7.9 points, so the 95% confidence interval is a 4 to 35 point penalty.

Versus relievers only, the PH penalty was 12.8 points in 17,634 PA. One SD is 5.5 points – the 95% confidence interval is a 2 to 24 point penalty.

As expected, the penalty versus relievers, where batters typically only face the pitcher for the first and only time in the game, whether they are in the starting lineup or are pinch hitting, is less than that versus the starting pitcher, by around 7 points. Again, keep in mind that the sample sizes are small enough such that the true difference between the starter PH penalty and reliever PH penalty could be the same or could even be reversed. Of course, our prior when applying a Bayesian scheme is that there is a strong likelihood that the true penalty is larger against starting pitchers for the reason explained above. So it is likely that the true difference is similar to the one observed (a 7-point greater penalty versus starters).

Notice that my numbers indicate penalties of a similar magnitude for pinch hitters and designated hitters. The PH penalty is a little higher than the DH penalty when pinch hitters face a starter, and a little lower than the DH penalty when they face a reliever. I expected the PH penalty to be greater than the DH penalty, as we found in The Book. Again, these numbers are based on relatively small sample sizes, so the true PH and DH penalties could be quite different.

Role Penalty (wOBA)
DH 14 points
PH vs. Starters 20 points
PH vs. Relievers 13 points

Now let’s look at some other potential “penalty” situations, such as the second game of a double-header and a day game following a night game.

In a day game following a night game, batters hit 6.2 wOBA points worse than in day games after day games or day games after not playing at all the previous day. The sample size was 95,789 PA. The 95% certainty interval is 1.5 to 11 points.

What about when a player plays both ends of a double-header (no PH or designated hitters)? Obviously many regulars sit out one or the other game – certainly the catchers.

Batters in the second game of a twin bill lose 8.1 points of wOBA compared to all other games. Unfortunately, the sample is only 9,055 PA, so the 2 SD interval is -7.5 to 23.5. If 8.1 wOBA points (or more) is indeed reflective of the true double-header penalty, it would be wise for teams to sit some of their regulars in one of the two games – which they do of course. It would also behoove teams to make sure that their two starters in a twin bill pitch with the same hand in order to discourage fortuitous platooning by the opposing team.

Finally, I looked at games in which a player and his team (in order to exclude times when the player sat because he wasn’t 100% healthy) did not play the previous day, versus games in which the player had played at least 8 days in a row. I am looking for a “consecutive-game fatigue” penalty and those are the two extremes. I excluded all games in April and all pinch-hitting appearances.

The “penalty” for playing at least 8 days in a row is 4.0 wOBA points in 92,287 PA. One SD is 2.4 so that is not a statistically significant difference. However, with a Bayesian prior such that we expect there to be a “consecutive-game fatigue” penalty, I think we can be fairly confident with the empirical results (although obviously there is not much certainty as to the magnitude).

To see whether the consecutive day result is a “penalty” or the day off result is a bonus, I compared them to all other games.

When a player and his team have had a day off the previous day, the player hits .1 points better than otherwise in 115,471 PA (-4.5 to +4.5). Without running the “consecutive days off” scenario, we can infer that there is an observed penalty when playing at least 8 days in a row, of around 4 points, compared to all other games (the same as compared to after an off-day).

So having a day off is not really a “bonus,” but playing too many days in row creates a penalty. It probably behooves all players to take an occasional day off. Players like Cal Ripken, Steve Garvey, and Miguel Tejada (and others) may have had substantially better careers had they been rested more, at least rate-wise.

I also looked at players who played fewer days in a row (5, 6, and 7) and found penalties of less than 4 points, suggesting that the more days in a row a player plays, the more his offense is penalized. It would be interesting to see if a day off after several days in a row restores a player to his normal offensive levels.

There are many other situations where batters and pitchers may suffer penalties (or bonuses), such as game(s) after coming back from the DL, getaway (where the home team leaves for another venue) games, Sunday night games, etc.

Unfortunately, I don’t have the time to run all of these potentially interesting scenarios – and I have to leave something for aspiring saberists to do!

Addendum: Tango Tiger suggested I split the DH into “versus relievers and starters.” I did not expect there to be a difference in penalties since, unlike a PH, a DH faces the starter the same number of times as when he isn’t DH’ing. However, I found a penalty difference of 8 points – the DH penalty versus starters was 16.3 and versus relievers, it was 8.3. Maybe the DH becomes “warmer” towards the end of the game, or maybe the difference is a random, statistical blip. I don’t know. We are often faced with these conundrums (what to conclude) when dealing with limited empirical data (relatively small sample sizes). Even if we are statistically confident that an effect exists (or doesn’t), we are usually quite uncertain as to the magnitude of that effect.

I also looked at getaway (where the home team goes on the road after this game) night games. It has long been postulated that the home team does not perform as well in these games. Indeed, the home team batter penalty in these games was 1.6 wOBA points – again, not a statistically significant difference, but consistent with the Bayesian prior. Interestingly, the road team batters performed .6 points better, suggesting that home team pitchers in getaway games might have a small penalty as well.

I just downloaded my Kindle version of the brand spanking new Hardball Times Annual, 2014 from Amazon.com. It is also available from Createspace.com (best place to order).

Although I was disappointed with last year’s Annual, I have been very much looking forward to reading this year’s, as I have enjoyed it tremendously in the past, and have even contributed an article or two, I think. To be fair, I am only interested in the hard-core analytical articles, which comprise a small part of the anthology. The book is split into 5 parts, according to the TOC: one, the “2013 Season,” which consists of reviews/views of each of the six divisions plus one chapter about the post-season; two, general Commentary; three, History; four, Analysis; and finally, a glossary of statistical terms and short bios on the various illustrious authors (including Bill James and Rob Neyer).

As I said, the only chapters which interest me are the ones in the Analysis section, and those are the ones that I am going to review, starting with Jeff Zimmerman’s, “Shifty Business, or the War Against Hitters.” It is mostly about the shifts employed by infielders against presumably extreme pull (and mostly slow) hitters. The chapter is pretty good with lots of interesting data mostly provided by Inside Edge, a company much like BIS and STATS, which provides various data to teams, web sites, and researchers (for a fee). It also raised several questions in my mind, some of which I wish Jeff had answered or at least brought up himself. There were also some things that he wrote which were confusing – at least in my 50+ year-old mind.

He starts out, after a brief intro, with a chart (BTW, if you have the Kindle version, unless you make the font size tiny, some of the charts get cut off) that shows the number, BABIP, and XBH% of plays where a ball was put into play with a shift (and various kinds of shifts), no shift, no doubles defense (OF deep and corners guarding lines), infield in, and corners in (expecting a bunt). This is the first time I have seen any data with a no-doubles defense, infield in, and with the corners up anticipating a bunt. The numbers are interesting. With a no-doubles defense, the BABIP is quite high and the XBH% seems low, but unfortunately Jeff does not give us a baseline for XBH% other than the values for the other situations, shift, no shift, etc., although I guess that pretty much includes all situations. I have not done any calculations, but the BABIP for a no-doubles defense is so high and the reduction in doubles and triples is so small, that it does not look like a great strategy off the top of my head. Obviously it depends on when it is being employed.

The infield-in data is also interesting. As expected, the BABIP is really elevated. Unfortunately, I don’t know if Jeff includes ROE and fielder’s choices (with no outs) in that metric. What is the standard? With the infield in, there are lots of ROE and lots of throws home where no out is recorded (a fielder’s choice). I would like to know if these are included in the BABIP.

For the corners playing up expecting a bunt, the numbers include all BIP, mostly bunts I assume. It would have been nice had he given us the BABIP when the ball is not bunted (and when it is). An important consideration for whether to bunt or not is how much not bunting increases the batter’s results when he swings away.

I would also have liked to see wOBA or some metric like that for all situations – not just BABIP and XBH%. It is possible, in fact likely, that walk and K rates vary in different situations. For example, perhaps walk rates increase when batters are facing a shift because they are not as eager to put the ball in play or the pitchers are trying to “pitch into the shift” and are consequently more wild. Or perhaps batters hit more HR because they are trying to elevate the ball as opposed to hitting a ground ball or line drive. It would also be nice to look at GDP rates with the shift. Some people, including Bill James, have suggested that the DP is harder to turn with the fielders out of position. Without looking at all these things, it is hard to say that the shift “works” or doesn’t work just by looking at BABIP (and even harder to say to what extent it works).

Jeff goes on to list the players against whom the shift is most often employed. He gives us the shift and no shift BABIP and XBH%. Collectively, their BABIP fell 37 points with the shift and it looks like their XBH% fell a lot too (although for some reason, Jeff does not give us that collective number, I don’t think). He writes:

…their BABIP [for these 20 players] collectively fell 37 points…when hitting with the shift on. In other words, the shift worked.

I am not crazy about that conclusion – “the shift worked.” First of all, as I said, we need to know a lot more than BABIP to conclude that “the shift worked.” And even if it did “work” we really want to know by how much in terms of wOBA or run expectancy. Nowhere is there an attempt by Jeff to do that. 37 points seems like a lot, but overall it could be only a small advantage. I’m not saying that it is small – only that without more data and analysis we don’t know.

Also, when and why are these “no-shifts” occurring? Jeff is comparing shift BIP data to no-shift BIP data and he is assuming that everything else is the same. That is probably a poor assumption. Why are these no-shifts occurring? Probably first and foremost because there are runners on base. With runners on base, everything is different. It might also be with a completely different pool of pitchers and fielders. Maybe teams are mostly shifting when they have good fielders? I have no idea. I am just throwing out reasons why it may not be an apples-to-apples comparison when comparing “shift” results to “no-shift” results.

It is also likely that the pool of batters is different with a shift and no shift even though he only looked at the batters who had the most shifts against them. In fact, a better method would have been a “delta” method, whereby he would use a weighted average of the differences between shift and no-shift for each individual player.

He then lists the speed score and GB and line drive pull percentages for the top ten most shifted players. The average Bill James speed score was 3.2 (I assume that is slow, but again, I don’t see where he tells us the average MLB score), GB pull % was 80% and LD pull % was 62%. The average MLB GB and LD pull %, Jeff tells us, is 72% and 50%, respectively. Interestingly several players on that list were at or below the MLB averages in GB pull %. I have no idea why they are so heavily shifted on.

Jeff talks a little bit about some individual players. For example, he mentions Chris Davis:

“Over the first four months of the season, he hit into an average of 29 shifts per month, and was able to maintain a .304 BA and a .359 BABIP. Over the last two months of the season, teams shifted more often against him…41 times per month. Consequently, his BA was .250 and his BABIP was .293.

The shift was killing him. Without a shift employed, Davis hit for a .425 BABIP…over the course of the 2013 season. When the shift was set, his BABIP dropped to .302…”

This reminds me a little of the story that Daniel Kahneman, 2002 Nobel Prize Laureate in Economics, tells about teaching military flight instructors that praise works better than punishment. One of the instructors said:

“On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time.”

Of course the reason for that was “regression towards the mean.” No matter what you say to someone who has done poorer than expected, they will tend to do better next time, and vice versa for someone who has just done better than expected.

If Chris Davis hits .304 the first four months of the season with a BABIP of .359, and his career numbers are around .260 and .330, then no matter what you do against him (wear your underwear backwards, for example), his next two months are likely going to show a reduction in both of these numbers! That does not necessarily imply a cause and effect relationship.
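To make the regression point concrete, here is a toy Python simulation. The .330 true-talent BABIP comes from the career figure mentioned above; the ~350 balls in play per half-season and everything else are round numbers I am assuming purely for illustration:

```python
import random

# A toy illustration of regression to the mean: a hitter whose true BABIP talent
# is .330, observed over two consecutive ~350-BIP stretches. Nothing about the
# defense changes between the two stretches.
random.seed(1)
TRUE_BABIP, BIP = 0.330, 350

def observed_babip():
    return sum(random.random() < TRUE_BABIP for _ in range(BIP)) / BIP

trials = [(observed_babip(), observed_babip()) for _ in range(10_000)]
hot_starts = [(first, second) for first, second in trials if first >= 0.359]
avg_followup = sum(second for _, second in hot_starts) / len(hot_starts)
print(round(avg_followup, 3))  # ~0.330 - the follow-up falls back toward true talent on its own
```

The hitter “declines” in the second stretch whether or not anyone shifts on him, which is exactly the trap in crediting the shift for the drop.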

He makes the same mistake with several other players that he discusses. In fact, I have always had the feeling that at least part of the “observed” success for the shift was simply regression towards the mean. Imagine this scenario – I’m not saying that this is exactly what happens or happened, but to some extent I think it may be true. You are a month into the season and for X number of players, say they are all pull hitters, they are just killing you with hits to the pull side. Their collective BA and BABIP is .380 and .415. You decide enough is enough and you decide to shift against them. What do you think is going to happen and what do you think everyone is going to conclude about the effectiveness of the shift, especially when they compare the “shift” to “no-shift” numbers?

Again, I think that the shift gives the defense a substantial advantage. I am just not 100% sure about that and I am definitely not sure about how much of an advantage it is and whether it is correctly employed against every player.

Jeff also shows us the number of times that each team employs the shift. Obviously not every team faces the same pool of batters, but the differences are startling. For example, the Orioles shifted 470 times and the Nationals 41! The question that pops into my mind is, “If the shift is so obviously advantageous (37 points of BABIP) why aren’t all teams using it extensively?” It is not like it is a secret anymore.

Finally, Jeff discusses bunting to beat the shift. That is obviously an interesting topic. Jeff shows that not many batters opt to do that but when they do, they reach base 58% of the time. Unfortunately, out of around 6,000 shifts where the ball was put into play, players only bunted 48 times! That is an amazingly low number. Jeff (likely correctly) hypothesizes that players should be bunting more often (a lot more often?). That is probably true, but I don’t think we can say how often and by whom. Maybe most of the players who did not bunt are terrible bunters and all they would be doing is bunting back to the pitcher or fouling the ball off or missing. And BTW, telling us that a bunt results in reaching base 58% of the time is not quite the whole story. We also need to know how many bunt attempts resulted in a strike. Imagine a player who attempted to bunt 10 times, fouled it off or missed it 9 times, and reached base once. That is probably not a good result even though it looks like he bunted with a 1.000 average!

It is also curious to me that 7 players bunted into a shift almost 4 times each, and reached base 16 times (a .615 BA). They are obviously decent or good bunters. Why are they not bunting every time until the shift is gone against them? They are smart enough to occasionally bunt into a shift, but not smart enough to always do it? Something doesn’t seem right.

Anyway, despite my many criticisms, it was an interesting chapter and well-done by Jeff. I am looking forward to reading the rest of the articles in the Analysis section and if I have time, I will review one or more of them.

This is a follow up to my article on baseballprospectus.com about starting pitcher times through the order penalties (TTOP).

Several readers wondered whether pitchers who throw lots of fastballs (or one type of pitch) have a particularly large penalty as opposed to pitchers who throw more of a variety of pitches. The speculation was that it would be harder or take longer for a batter to acclimate himself to a pitcher who has lots of different pitches in his arsenal. As well, since most starters tend to throw more fastballs the first time through the order, those pitchers who follow that up with more off-speed pitches for the remainder of the game would have an advantage over those pitchers who continue to throw mostly fastballs.

First I split all the starters up into 3 groups: One, over 75% fastballs, two, under 50% fastballs, and three, all the rest. The data is from 2002-2012. I downloaded pitcher pitch type data from fangraphs.com. The results will amaze you.

FB %         N (Pitcher Seasons)   Overall   First Time   Second Time   Third Time   Fourth Time   Second Minus First   Third Minus Second   Fourth Minus Third
> 75%        159                   .357      .341         .363          .376         .348          .027                 .020                 -.013
< 50%        359                   .352      .346         .349          .360         .361          .003                 .015                 .010
All others   2632                  .359      .346         .361          .370         .371          .015                 .015                 .013

Pitchers who throw mostly fastballs lose 35 points in wOBA against by the third time facing the lineup. Those with a much lower fastball frequency only lose 24 points. Interestingly, the former group reverts back to better than normal levels the fourth time (I don’t know why that is, but I’ll return to that issue later), but the latter group continues to suffer a penalty, as do all the others. Keep in mind that the fourth time numbers are small samples for the first two groups, and that fourth time TBF are only around 15% of first time TBF (i.e., starters don’t often make it past the third time through the order).

The takeaway here is that a starter’s pitch repertoire is extremely important in terms of how long he should be left in the game and whether he should start or relieve (we already knew the latter, right?). If we look at columns three and four, we can get an idea as to the difference between a pitcher as a starter and as a reliever, at least as far as times through the order is concerned (there are other considerations, such as velocity – e.g., when a pitcher is a short reliever, he can usually throw harder). The mostly fastball group is 16 points (around .5 runs per 9 innings) more effective the first time through the order than overall, while the low frequency fastball group only has a 6 point (.20 RA9) advantage. Keep in mind that some of that first time through the order advantage for all groups is due to the “first inning” effect (see my original article on BP).
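As an aside, here is a quick Python sketch of the rough conversion from wOBA points to runs per 9 innings that I am using in this paragraph. The wOBA scale (~1.2) and the ~38 PA per nine innings are my own round-number assumptions for illustration:

```python
# A rough sketch of converting a wOBA difference into runs per 9 innings. The
# wOBA scale (~1.2) and PA per nine innings (~38) are assumed round numbers.
WOBA_SCALE = 1.2
PA_PER_9 = 38

def woba_points_to_ra9(points):
    """Convert a wOBA difference (in points) into runs per 9 innings."""
    return points / 1000 / WOBA_SCALE * PA_PER_9

print(round(woba_points_to_ra9(16), 2))  # ~0.51 runs per 9, for the mostly-fastball group
print(round(woba_points_to_ra9(6), 2))   # ~0.19 runs per 9, for the low-fastball group
```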

Next I split the pitchers into four groups based on the number of pitches they threw at least 10% of the time. The categories of pitches (from the FG database) were fast balls, sliders, cutters, curve balls, change ups, splitters, and knuckle balls.
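Here is a minimal sketch of the bucketing rule, in Python. The dictionary keys are just illustrative labels for pitch-type usage rates (which FanGraphs publishes as percentages), not the actual column names:

```python
# A minimal sketch of the repertoire-size bucketing described above.
def repertoire_size(pitch_pct, threshold=0.10):
    """Number of pitch types thrown at least `threshold` of the time."""
    return sum(1 for pct in pitch_pct.values() if pct >= threshold)

# e.g., a fastball-heavy starter with a show-me slider and changeup
example = {"fastball": 0.78, "slider": 0.12, "changeup": 0.07, "curveball": 0.03}
print(repertoire_size(example))                  # 2 pitches at the 10% cutoff
print(repertoire_size(example, threshold=0.15))  # 1 pitch at the 15% cutoff used later
```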

# Pitches in Repertoire (> 10%)   N (Pitcher Seasons)   Overall   First Time   Second Time   Third Time   Fourth Time   Second Minus First   Third Minus Second   Fourth Minus Third
1                                 41                    .359      .344         .370          .375         .303          .027                 .009                 -.061
2                                 1000                  .358      .343         .359          .371         .366          .016                 .018                 .007
3                                 1712                  .361      .349         .362          .371         .372          .013                 .015                 .014
4                                 378                   .351      .340         .351          .360         .368          .011                 .013                 .019

This is even more interesting. It appears that the fewer pitches you have in your repertoire, the more quickly batters become familiar with you, as we might expect. One-pitch pitchers lose 36 points by the third time through the order, while four-pitch pitchers lose only 24 points. The fourth time through the order is exactly the opposite. One-pitch pitchers gain 61 points the fourth time around (small sample size warning – 639 PA). Again, I have no idea why. Maybe fastball pitchers are able to ramp it up in the later innings, or maybe they start throwing more off-speed pitches. A pitch f/x analysis would shed some more light on this issue. Against the four-pitch pitchers, batters gain 19 points the fourth time around compared to the third. If we weight and combine the third and fourth times in order to increase our sample sizes, we get this:

# Pitches in Repertoire (> 10%)   N (Pitcher Seasons)   Overall   First Time   Second Time   Third and Fourth Times   Second Minus First   Third+ Minus Second
1                                 41                    .359      .344         .370          .364                     .027                 -.001
2                                 1000                  .358      .343         .359          .370                     .016                 .017
3                                 1712                  .361      .349         .362          .371                     .013                 .015
4                                 378                   .351      .340         .351          .361                     .011                 .015

Again, we see the largest, by far, second time penalty for the one-pitch pitchers (27 points), and a gradually decreasing penalty for two, three, and four-pitch pitchers (16, 13, and 11). Interestingly, they all have around the same penalty the third time and later, other than the one-pitch pitchers, who essentially retain their quality or even get a bit better, although this is driven by their large fourth time advantage, as you saw in the previous table.

It is not clear that you should take your one-pitch starters out early and leave in those who have multiple pitches in their weaponry. In fact, it may be the opposite. While the one-pitch pitchers would do well if they only face the order one time (and so would the two-pitch starters actually), once you allow them to stay in the game for the second go around, you might as well keep them in there as long as they are not fatigued, at least as compared to the multiple-pitch starters. Starters with more than one pitch appear to get 10-15 points worse each time through the order even though they don’t have the large penalty between the first and second time, as the one-pitch pitchers do. Remember, for the last two tables, a pitch is considered part of a starter’s repertoire if he throws it at least 10% of the time.

I’ll now split the pitchers into four groups again based on how many pitches they throw, but this time, the cutoff for a “pitch” will be 15% rather than 10%. The number of pitchers who throw four pitches at least 15% of the time each is too few for their numbers to be meaningful, so I’ll throw them in with the three-pitch pitchers. I’ll also combine the third and fourth times through the order again.

# Pitches in Repertoire (> 15%)   N (Pitcher Seasons)   Overall   First Time   Second Time   Third and Fourth Times   Second Minus First   Third+ Minus Second
1                                 447                   .358      .342         .362          .364                     .027                 -.001
2                                 1954                  .359      .346         .361          .370                     .016                 .017
3+                                742                   .355      .347         .352          .371                     .013                 .015

The three and four-pitch starters are better overall by three or four points of wOBA (.11 RA9). The first time through the order, however, the one-pitch starters are better by 5 points or so (.15 RA9). The second time around, the one-pitch pitchers fare the worst, but by the third and fourth times through the order, they are once again the best (by 6 or 7 points, or .22 RA9). It is difficult to say what the optimal use of these starters would look like. At the very least, these numbers should give a manager/team more information in terms of estimating a starter’s penalty at various points in the game, based on his pitch repertoire.

I’ll try one more thing: Two groups. The first group are pitchers who throw at least 80% of one type of pitch, excluding knuckleballers. These are truly one-pitch pitchers. The second group throw three (or more) pitches at least 20% of the time each. These are truly three-pitch pitchers. Let’s see the contrast.

# Pitches in Repertoire   N (Pitcher Seasons)   Overall   First Time   Second Time   Third and Fourth Times   Second Minus First   Third+ Minus Second
1 (> 80%)                 47                    .360      .343         .367          .370                     .025                 .009
3+ (> 20%)                104                   .353      .350         .357          .357                     .008                 .009

It certainly looks like the 42 one-pitch pitchers (47 is the number of pitcher seasons) would be much better off as relievers, facing each batter only one time. They are not very good overall, and after only one go around, they are 25 points (.85 RA9) worse than the first time facing the lineup! The three-pitch pitchers suffer only a small (8 point) penalty after the first time through the order. Both groups actually suffer the same penalty from the second to the third (and more) time through the order (9 points).

So who are these 42 pitchers who are ill-suited to being a starter? Perhaps they are swingmen or emergency starters. I looked at all pitchers who started at least one game – not just regular starters. Here is the complete list from 2002 to 2012. The numbers after the names are the batters faced (TBF) as a starter and as a reliever.

Mike Timlin 20, 352
Kevin Brown 206, 68
Ben Diggins 114, 0
Jarrod Washburn 847, 0
Mike Crudale 9, 199
Grant Balfour 17, 94
Shane Loux 69, 69
Jimmy Anderson 180, 3
Kirk Rueter 620, 0
Jaret Wright 768, 0
Logan Kensing 55, 11
Tanyon Sturtze 57, 277
Chris Young 156, 0
Nate Bump 33, 286
Bartolo Colon 2683, 49
Carlos Silva 876, 10
Aaron Cook 3337, 0
Cal Eldred 12, 141
Rick Bauer 21, 281
Mike Smith 18, 0
Shawn Estes 27, 0
Troy Percival 4, 146
Andrew Miller 306, 0
Luke Hochevar 12, 41
Luke Hudson 13, 0
Dana Eveland 15, 13
Denny Bautista 9, 38
Dennis Sarfate 81, 274
Roberto Hernandez 548, 0
Mike Pelfrey 812, 0
Daniel Cabrera 881, 0
Frankie de la Cruz 15, 37
Mark Mulder 3, 9
Ty Taubenheim 27, 0
Brad Kilby 7, 58
Darren Oliver 17, 264
Justin Masterson 1794, 4
Luis Mendoza 60, 0
Ross Detwiler 627, 51
Cesar Ramos 11, 109
Josh Stinson 17, 21

Many of these pitchers barely had a cup of coffee in the majors. Others were emergency starters, swingmen, or they changed roles at some point in their careers. Others were simply mediocre or poor starting pitchers, like Kirk Rueter, Jarrod Washburn, Mike Pelfrey, Carlos Silva, and Daniel Cabrera, while others were good or even excellent starters, like Kevin Brown, Mark Mulder, and Bartolo Colon.

I think the lesson is clear. Unless a team has a compelling reason to make a one-pitch pitcher a starter (perhaps he is an extreme sinkerballer, like Brown, Cook, or Masterson), he should probably only relieve. If a team is going to use a swingman for an occasional start or a reliever for an emergency start, they would do well to use a two or three-pitch pitcher or limit him to one time through the order.

If we remove the swingmen and emergency starters as well as those pitchers who faced fewer than 50 batters in a season, we get this:

# Pitches in Repertoire   N (Pitcher Seasons)   Overall   First Time   Second Time   Third and Fourth Times   Second Minus First   Third+ Minus Second
1 (> 80%)                 28                    .353      .336         .364          .365                     .028                 .004
3+ (> 20%)                104                   .353      .350         .357          .357                     .008                 .009

Even if we only look at regular starters with one primary pitch other than a knuckleball, we still see a huge penalty after the first time facing the order. In fact, the second time penalty (compared to the first) is worse than when we include the swingmen and emergency starters. Although these pitchers overall are as good as multiple-pitch starters, they still would have been much better off as short relievers.

Here is that updated list of starters once we remove the ones who rarely start. These guys as a whole should probably have been short relievers.

Cook
Miller
Colon
Diggins
Silva
Young
Cabrera
Wright
Washburn
Anderson
Masterson
Brown
Rueter
Kensing
Mendoza
Pelfrey
Hernandez
Detwiler

You might think that the one-pitch starters in the above list who are good or at least had one or two good seasons might not necessarily be good candidates for short relief. You would be wrong. These pitchers had huge second to first penalties and pitched much better the first time through the order than overall. Here is the same chart as before, but only including above-average starters for that season.

# Pitches in Repertoire   N (Pitcher Seasons)   Overall   First Time   Second Time   Third and Fourth Times   Second Minus First   Third+ Minus Second
1 (> 80%)                 11                    .328      .307         .332          .332                     .039                 -.013
3+ (> 20%)                35                    .321      .318         .323          .323                     .004                 .003

Here are those pitchers who pitched very well overall, but were lights out the first time facing the lineup. Remember that these pitchers were above average in the season or seasons that they went into this bucket – they were not necessarily good pitchers throughout their careers or even in any other season.

Kevin Brown
Jarrod Washburn
Jaret Wright
Chris Young
Bartolo Colon
Carlos Silva
Justin Masterson
Ross Detwiler

Interestingly, the very good multiple-pitch pitchers had very small penalties each time through the order. These are probably the only kind of starters we want to go deep into games! Here is that list of starters.

Sonnanstine
B. Myers
Pavano
Sabathia
Billingsley
Carpenter
Hamels
Haren
F. Garcia
Iwakuma
Shields
J. Contreras
Beckett
Duchscherer
Gabbard
K. Rogers
Buehrle
M. Clement
Halladay
R. Hernandez
T. Hunter

Finally, in case you are  interested, here are the numbers for all of the one-pitch knuckleballers that I have been omitting in some of the tables thus far:

Knuckle Ballers Only
N    First Time   Second Time   Third+ Time   Second Minus First   Third+ Minus Second
20   .321         .354          .345          .034                 -.006

Where are all the knuckle ball relievers? We don’t have tremendous sample sizes here (3,024 second-time TBF), so we have to take the numbers with a grain of salt, but it looks like knuckleballers are brilliant the first time through the order; once a batter has seen a knuckleballer one time, however, he does pretty well against him thereafter (although we do see a 6 point rebound the third time and later through the order).

I think that more research, especially using the pitch f/x data, is needed. However, I think that teams can use the information above to make more informed decisions about what roles pitchers should occupy and when to take out a starter during a game.

Last night I lambasted the Cardinals’ sophomore manager, Mike Matheny, for some errors in bullpen management that I estimated cost his team over 2% in win expectancy (WE). Well, after tonight’s game, all I have to say is, as BTO so eloquently said, “You ain’t seen nothin’ yet!”

Tonight (or last night, or whatever), John Farrell, the equally clueless manager of the Red Sox (God, I hope I don’t ever have to meet these people I call idiots and morons!), basically told Matheny, “I’ll see your stupid bullpen management and raise you one moronic non-pinch hit appearance!”

I’m talking of course about the top of the 7th inning in Game 5. The Red Sox had runners on second and third, one out, and Jon Lester, the Sox’ starter, was due to hit (some day, I’ll be telling my grandkids, “Yes, Johnny, pitchers once were also hitters.”). Lester was pitching well (assuming you define “well” as how many hits/runs he allowed so far – not that I am suggesting that he wasn’t pitching “well”) and had only thrown 69 pitches, I think. I don’t think it ever crossed Farrell’s mind to pinch hit for him in that spot. The Sox were also winning 2-1 at the time, so, you know, they didn’t need any more runs in order to guarantee a win <sarcasm>.

Anyway, I’m not going to engage in a lot of hyperbole and rhetoric (yeah, I probably will). It doesn’t take a genius to figure out that not pinch hitting for Lester in that particular spot (runners on 2nd and 3rd, and one out) is going to cost a decent fraction of a run. It doesn’t even take a genius, I don’t think, to figure out that that means that it also costs the Red Sox some chance of ultimately winning the game. I’ll explain it like I would to a 6-year-old child. With a pinch hitter, especially Napoli, you are much more likely to score, and if you do, you are likely to score more runs. And if on the average you score more runs that inning with a pinch hitter, you are more likely to win the game, since you only have a 1 run lead and the other team still gets to come to bat 3 more times. Surely, Farrell can figure that part out.

How many runs and how much win expectancy does that cost, on the average? That is pretty easy to figure out. I’ll get to that in a second (spoiler alert: it’s a lot). So that’s the downside. What is the upside? It is two-fold, sort of. One, you get to continue to pitch Lester for another inning or two. I assume that Farrell does not know exactly how much longer he plans on using Lester, but he probably has some idea. Two, you get to rest your bullpen in the 7th and possibly the 8th.

Both of those upsides are questionable in my opinion, but, as you’ll see, I will actually give Farrell and any other naysayer (to my way of thinking) the benefit of the doubt. The reason I think it is questionable is this: Lester, despite pitching well so far, and only throwing 69 pitches, is facing the order for the 3rd time in the 7th inning, which means that he is likely .4 runs per 9 innings worse than he is overall, and the Red Sox, like most World Series teams, have several very good options in the pen who are actually at least as good as Lester when facing the order for the third time, not to mention the fact that Farrell can mix and match his relievers in those two innings in order to get the platoon advantage. So, in my opinion, the first upside for leaving in Lester is not an upside at all. But, when I do my final analysis, I will sort of assume that it is, as you will see.

The second upside is the idea of saving the bullpen, or more specifically, saving the back end of the bullpen, the short relievers. In my opinion, again, that is a sketchy argument. We are talking about the World Series, where you carry 11 or 12 pitchers in order to play 7 games in 9 days and then take 5 months off. In fact, tomorrow (today?) is an off day followed by 2 more games and then they all go home. Plus, it’s not like either bullpen has been overworked in the post-season so far. But, I will be happy to concede that “saving your pen” is indeed an upside for leaving Lester in the game. How much is it worth? No one knows, but I don’t think anyone would disagree with this: A manager would not choose to “save” his bullpen for 1-2 innings when there is an off day followed by 2 more games, followed by 100 off days, when the cost of that savings is a significant chunk of win expectancy in the game he is playing at the present time. I mean, if you don’t agree with that, just stop reading and don’t ever come back to this site.

The final question, then, is how much in run or win expectancy did that non-pinch hit cost? Remember in my last post how I talked about “categories” of mistakes that a manager can make? I said that a Category I mistake, a big one, cost a team 1-2% in win expectancy. That may not seem like a lot for one game, but it is. We all criticize managers for “costing” their team the game when we think  they made a mistake and their team loses. If you’ve never done that, then you can stop reading too. The fact of the matter is that there is almost nothing a manager can do, short of losing his mind and pinch hitting the bat boy in a high leverage situation, that is worth more than 1 or 2% in win expectancy. Other than this.

The run expectancy with runners on second and third and one out in a low run environment is around 1.40. That means that, with a roughly average hitter at the plate, the batting team will score, on the average, 1.40 runs during that inning, from that point on. We’ll assume that it is about the same if Napoli pinch hits. He is a very good pinch hitter, but there is a pinch hitting penalty and he is facing a right handed pitcher. To be honest, it doesn’t really matter. It could be 1.2 runs or 1.5 runs. It won’t make much of a difference.

What is the run expectancy with Lester at the plate? I don’t know much about his hitting, but I assume that since he has never been in the NL, and therefore hardly ever hits, it is not good. We can easily say that it is below that of an average pitcher, but that doesn’t really matter either. With an average pitcher batting in that same situation, and the top of the order coming up, the average RE is around 1.10 runs. So the difference is .3 runs. Again, it doesn’t matter much if it is .25 or .4 runs. And there really isn’t much wiggle room. We know that it is a run scoring situation and we know that a pinch hitter like Napoli (or almost anyone for that matter) is going to be a much better hitter than Lester. So .3 runs sounds more than reasonable. Basically we are saying that, on the average, with a pinch hitter like Napoli at the plate in that situation, runners on 2nd and 3rd with 1 out, the Red Sox will score .3 more runs than with Lester at the plate. I don’t know that anyone would quarrel with that – even someone like a Tim McCarver or Joe Morgan.

In order to figure out how much in win expectancy that is going to cost, again, on the average, first we need to multiply that number by the leverage index in that situation. The LI is 1.64.  1.64 times .3 runs divided by 10 is .049 or 4.9%. That is the difference in WE between batting Lester or a pinch hitter. It means that with the pinch hitter, the Red Sox can expect, on the average, to win the game around 5% more often than if Lester hits, everything else being equal. I don’t know whether you can appreciate the enormity of that number. I have been working with these kinds of numbers for over 20 years. If you can’t appreciate it, you will just have to take my word for it that that is a ginormous number when it comes to WE in one game. As I said, I usually consider an egregious error to be worth 1-2%. This is worth almost 5%. That is ridiculous. It’s like someone offering you a brand new Chevy or Mercedes for the same price. And you take the Chevy, if you are John Farrell.
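Here is the same back-of-the-envelope calculation as a tiny Python sketch, using the divide-by-ten runs-to-wins rule of thumb from the paragraph above (the function name is just for illustration):

```python
# A quick sketch of the leverage-index shortcut above: run difference times LI,
# divided by ~10 runs per win, gives an approximate win-expectancy difference.
RUNS_PER_WIN = 10

def we_cost(run_diff, leverage_index):
    """Approximate win-expectancy swing from a run-expectancy difference."""
    return run_diff * leverage_index / RUNS_PER_WIN

print(f"{we_cost(0.3, 1.64):.3f}")  # 0.049, i.e. roughly a 5% swing in win expectancy
```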

Just to see if we are in the right ballpark with our calculations, I am going to run this scenario through my baseball simulator, which is pretty darn accurate (even though it does not have an algorithm for heart or grit) in these kinds of relatively easy situations to analyze.

Sound of computers whirring….

With Lester hitting, the Red Sox win the game 76.6% of the time. And therein lies the problem! Farrell knows that no matter what he does, he is probably going to win the game, and if he takes out Lester, not only is he going to bruise his feelings (boo hoo), but if the relief corps blows the game, he is going to be lambasted and probably feel like crap. If he takes Lester out, he knows he’s also going to probably win the game, and what’s a few percent here and there. But if he lets Lester continue, as all of Red Sox nation assumes and hopes he will, and then they blow the game, no one is going to blame Farrell. You know why? Because at the first sign of trouble, he is going to pull Lester, and no one is going to criticize a manager for leaving in a pitcher who is pitching a 3-hitter through 6 innings on only 69 pitches and then yanking him as soon as he gives up a baserunner or two. So letting Lester hit for himself is the safe decision. Not a good one, but a safe one.

After that rant, you probably want to know how often the Sox win if they pinch hit for Lester. 79.5% of the time. So that’s only a 2.9% difference. Still higher than my formerly highest Category of manager mistakes, 1-2%.

Let’s be conservative and call it a 3% mistake. I wonder what would happen if you told John Farrell that by not pinch hitting for Jon Lester, his team’s chances of winning drop from 79.5% to 76.6%. Even if he believed that, do you think it would sway his decision? I don’t think so, because he feels with all his heart and soul that having Lester, who is “dealing,” pitch another inning or two, and saving his bullpen, is well worth the difference between 77% and 80%. After all, either way, they probably win.

So how much does Lester pitching another inning or two (we’ll call it 1.5 innings, since at the time it could have been anywhere from 0 to 2, I think – I am pretty sure that Koji was pitching the 9th no matter what) gain over another pitcher? Well, I already said that the answer is nothing. Any of their good relievers are at least as good as Lester the 3rd time through the order. But I also said that I will concede that Lester is going to be just amazing, on the average, if Farrell leaves him in the game. How good does he have to be in order to make up the .3 runs or 3% in WE that are lost by allowing Lester to hit?

A league average reliever allows around 4 runs a game. It doesn’t matter what that exact number is – we are only using it for comparison purposes. A good short reliever actually allows more like 3 or 3.5 runs a game. Starting pitchers, in general, are a little worse than the average pitcher (because of that nasty times through the order penalty). A very good pitcher like Lester allows around 3.5 runs a game (a pitcher like Wainwright around 3 runs a game). So let’s assume that a very average reliever came into the game to pitch the 7th and half the 8th rather than Lester. They would allow 4 runs a game. That is very pedestrian for a reliever. Almost any short reliever can do that with his eyes closed. In order to make up the .3 runs we lost by letting Lester hit, Lester needs to allow fewer runs than 4 runs a game. How much less? Well, .3 runs in 1.5 innings is .2 runs per inning. .2 runs per inning times 9 innings is 1.8 runs. So Lester would have to pitch like a pitcher who allows 2.2 runs per 9 innings. No starting pitcher like that exists. Even the best starter in baseball, Clayton Kershaw, is a 2.5 run per 9 pitcher at best.

Let’s go another route. Remember that I said Lester was probably around a 3.5 run pitcher (Steamer, a very good projection system, has him projected with a 3.60 FIP, which is around a 3.5 pitcher in my projection system), but that the third time through the order he is probably a 3.80 or 3.90 pitcher. Forget about that. Let’s decree that Lester is indeed going to pitch the 7th and 8th innings, or however long he continues, like an ace reliever. Let’s call him a 3.00 pitcher, not the 3.80 or 3.90 pitcher that I think he really is, going into the 7th inning.

In case you are wondering, there is no evidence that good or even great pitching through 6 or 7 innings predicts good pitching for future innings. Quite the contrary. Even starters who are pitching well have the times through the order penalty, and if they are allowed to continue, they end up pitching worse than they do overall in a random game. That is what real life says. That is what happens. It is not my opinion, observation, or recollection. A wise person once said that, “Truth comes from evidence and not opinion or faith.”

But, again, we are living on Planet Farrell, so we are conceding that Lester is a great pitcher going into the 7th inning and the third time through the order. (Please don’t tell me how he did that inning. If you do or even think that, you need to leave and never come back. Seriously.)  We are calling him a 3.0 pitcher, around the same as a very good closer.

How bad does a replacement for Lester for 1.5 innings have to be to make up for that .3 runs? Again, we need .2 runs per inning, times 9 innings, or a total of 1.8 runs per 9. So the reliever to replace him would have to be a 4.8 pitcher. That is a replacement pitcher, folks. There is no one on either roster who is even close to that.
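Here is the break-even arithmetic as a small Python sketch, just to show both framings side by side (the function and its round-number inputs are illustrative, not output from my simulator):

```python
# A small sketch of the break-even arithmetic: how stingy the pitcher who stays
# in must be, in runs per 9, to pay back the ~.3 runs lost by letting him hit.
def required_ra9(reliever_ra9, runs_to_make_up, innings):
    """RA9 the starter must achieve over `innings` to save `runs_to_make_up`
    runs relative to the reliever he would displace."""
    return reliever_ra9 - runs_to_make_up / innings * 9

print(round(required_ra9(4.0, 0.3, 1.5), 1))  # 2.2 vs. an average (4.0 RA9) reliever
print(round(required_ra9(4.8, 0.3, 1.5), 1))  # 3.0, i.e. a 3.0 Lester only breaks even
                                              # against a 4.8 (replacement-level) reliever
```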

So there you have it. Like Keith Olbermann’s “Worst Person in the World,” we have the worst manager in baseball – John Farrell.

Addendum: Please keep in mind that some of the hyperbole and rhetoric is just that. Take comments like, “Farrell is an idiot,” or, “the worst manager in baseball,” with a grain of salt and chalk it up to flowery emotion. It is not relevant to the argument of course. The argument speaks for itself, and you, the reader, are free to conclude what you want about whether his moves, or any other managerial moves that I might discuss, were warranted or not.

I am not insensitive to factors that drive all managers’ decisions, like the reaction, desires, and opinions of the fans, media, upper management, and especially, the players. As several people have pointed out, if a manager were to do things that were “technically” correct, yet in doing so, alienate his players (and/or the fans) thereby affecting morale, loyalty, and perhaps a conscious or subconscious desire to win, then those “correct” decisions may become “incorrect” in the grand scheme of things.

That being said, my intention is to inform the reader and to take the hypothetical perspective of informing the manager of the relevant and correct variables and inputs such that they and you can make an informed decision. Imagine this scenario: I am sitting down with Farrell and perhaps the Red Sox front office and we are rationally and intelligently discussing ways to improve managerial strategy. Surely no manager can be so arrogant as to think that everything he does is correct. You would not want an employee like that working for your company no matter how much you respect him and trust his skills. Anyway, let’s say that we are discussing this very same situation, and Farrell says something like, “You know, I really didn’t care whether I removed Lester for a pinch hitter or not, and I don’t think he or my players would either. Plus, the preservation of my bullpen was really a secondary issue. I could have easily used Morales, Dempster, or even Breslow again. Managers have to make tough decisions like that all the time. I genuinely thought that with Lester pitching and us already being up a run, we had the best chance to win. But now that you have educated me on the numbers, I realize that that assumption on my part was wrong. In the future I will have to rethink my position if that or a similar situation should come up.”

That may not be a realistic scenario, but that is the kind of discussion and thinking I am trying to foster.

MGL