Archive for the ‘Pinch Hitters’ Category

Note: I updated the pinch hitting data to include a larger sample. Previously the data went back to 2008; it now goes back to 2000.

Note: It was pointed out by a commenter below and another one on Twitter that you can’t look only at innings where the #9 and #1 batters batted (eliminating innings where the #1 hitter led off), as Russell did in his study, and which he uses to support his theory (he says that it is the best evidence). That creates a huge bias, of course. It eliminates all PA in which the #9 hitter made the last out of an inning or at least an out was made while he was at the plate. In fact, the wOBA for a #9 hitter, who usually bats around .300, is .432 in innings where he and the #1 hitter bat (after eliminating so many PA in which an out was made). How that got past Russell, I have no idea.  Perhaps he can explain.

Recently, Baseball Prospectus published an article by one of their regular writers, Russell Carleton (aka Pizza Cutter), in which he examined whether the so-called “times through the order” penalty (TTOP) was in fact a function of how many times a pitcher has turned over the lineup in a game or whether it was merely an artifact of a pitcher’s pitch count. In other words, is it pitcher fatigue or batter familiarity (the more the batter sees the pitcher during the game, the better he performs) which causes this effect?

It is certainly possible that most or all of the TTOP is really due to fatigue, as “times through the order” is clearly a proxy for pitch count. In any case, after some mathematical gyrations that Mr. Carleton is wont to do in his articles (he is the “Warning: Gory Mathematical Details Ahead” guy), he concludes unequivocally that there is no such thing as a TTOP: it is really a PCP, or Pitch Count Penalty, effect that makes a pitcher less and less effective as he goes through the order, and it has little or nothing to do with batter/pitcher familiarity. In fact, in the first line of his article, he declares, “There is no such thing as the ‘times through the order’ penalty!”

If that is true, this is a major revelation which has slipped through the cracks in the sabermetric community and its readership. I don’t believe it is, however.

As one of the primary researchers (along with Tom Tango) of the TTOP, I was taken quite aback by Russell’s conclusion, not because I was personally affronted (the “truth” is not a matter of opinion), but because my research suggested that pitch count or fatigue was likely not a significant part of the penalty. In my BP article on the TTOP a little over 2 years ago, I wrote this: “…the TTOP is not about fatigue. It is about familiarity. The more a batter sees a pitcher’s delivery and repertoire, the more likely he is to be successful against him.” What was my evidence?

First, I looked at the number of pitches thrown going into the second, third, and fourth times through the order. I split that up into two groups—a low pitch count and a high pitch count. Here are those results. The numbers in parentheses are the average number of pitches thrown going into that “time through the order.”

| Times Through the Order | Low Pitch Count | High Pitch Count |
|---|---|---|
| 1 | .341 | .340 |
| 2 | .351 (28) | .349 (37) |
| 3 | .359 (59) | .359 (72) |
| 4 | .361 (78) | .360 (97) |


If Russell’s thesis were true, you should see a little more of a penalty in the “high pitch count” column on the right, which you don’t. The penalty appears to be the same regardless of whether the pitcher has thrown few or many pitches. To be fair, the difference in pitch count between the two groups is not large and there is obviously sample error in the numbers.
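To make the comparison concrete, here is a minimal Python sketch that takes the wOBA values from the table above and expresses each later time through the order as a penalty, in points of wOBA, relative to the first time through. The variable and function names are mine and purely illustrative.

```python
# wOBA by time through the order (1st to 4th), copied from the table above.
low_pitch_count = [0.341, 0.351, 0.359, 0.361]
high_pitch_count = [0.340, 0.349, 0.359, 0.360]


def penalty_points(woba_by_tto):
    """Return the wOBA change versus the first time through, in points (thousandths)."""
    first = woba_by_tto[0]
    return [round((w - first) * 1000) for w in woba_by_tto[1:]]


low_penalty = penalty_points(low_pitch_count)
high_penalty = penalty_points(high_pitch_count)

print("Low pitch count penalties (2nd, 3rd, 4th):", low_penalty)
print("High pitch count penalties (2nd, 3rd, 4th):", high_penalty)
```

The two lists come out within a point of each other at every step, which is the sense in which the penalty looks the same in both groups.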

The second way I examined the question was this: I looked only at individual batters in each group who had seen few or many pitches in their prior PA. For example, I looked at batters in their second time through the order who had seen fewer than three pitches in their first PA, and also batters who saw more than four pitches in their first PA. Those were my two groups. I did the same thing for each time through the order. Here are those results. The numbers in parentheses are the average number of pitches seen in the prior PA, for every batter in the group combined.


| Times Through the Order | Low Pitch Count each Batter | High Pitch Count each Batter |
|---|---|---|
| 1 | .340 | .340 |
| 2 | .350 (1.9) | .365 (4.3) |
| 3 | .359 (2.2) | .361 (4.3) |


As you can see, if a batter sees more pitches in his first or second PA, he performs better in his next PA than if he sees fewer pitches. The effect appears to be much greater from the first to the second PA. This lends credence to the theory of “familiarity” and not pitcher fatigue. It is unlikely that 2 or 3 extra pitches would cause enough fatigue to elevate a batter’s wOBA by 8.5 points per PA (the average of 15 and 2, the “bonuses” for seeing more pitches during the first and second PA, respectively).
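The arithmetic behind the 8.5-point figure can be checked with a few lines of Python. This is only a sketch of the subtraction and averaging described above; the dictionary names are my own.

```python
# wOBA in the current PA, keyed by time through the order, split by how many
# pitches the batter saw in his prior PA (values from the table above).
few_pitches_prior_pa = {2: 0.350, 3: 0.359}
many_pitches_prior_pa = {2: 0.365, 3: 0.361}

# "Bonus" for having seen more pitches in the prior PA, in points of wOBA.
bonus = {
    tto: round((many_pitches_prior_pa[tto] - few_pitches_prior_pa[tto]) * 1000, 1)
    for tto in few_pitches_prior_pa
}
average_bonus = sum(bonus.values()) / len(bonus)

print("Bonus by time through the order:", bonus)
print("Average bonus:", average_bonus)
```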

So how did Russell come to his conclusion, and is it right or wrong? I believe there is a fatal flaw in his methodology which led him to a faulty conclusion (that the TTOP does not exist).

Among other statistical tests, here is the primary one which led Russell to conclude that the TTOP is a mirage and merely a product of pitcher fatigue due to an ever-increasing pitch count:

This time, I tried something a little different. If we’re going to see a TTOP that is drastic, the place to look for it is as the lineup turns over. I isolated all cases in which a pitcher was facing the ninth batter in the lineup for the second time and then the first batter in the lineup for the third time. To make things fair, neither hitter was allowed to be the pitcher (this essentially limited the sample to games in AL parks), and the hitters needed to be faced in the same inning. Now, because the leadoff hitter is usually a better hitter, we need to control for that. I created a control variable for all outcomes using the log odds ratio method, which controls for the skills of the batter, as well as that of the pitcher. I also controlled for whether or not the pitcher had the platoon advantage in either case.

First of all, there was no reason to limit the data to “the same inning”. Regardless of whether the pitcher faces the 9th and 1st batters in the same inning or they are split up (the 9 hitter makes the last out), since one naturally follows the other, they will always have around the same pitch count, and the leadoff hitter will always be one time through the order ahead of the number nine hitter.

Anyway, what did Russell find? He found that TTOP was not a predictor of outcome. In other words, that the effect on the #9 hitter was the same as the #1 hitter, even though the #1 hitter had faced the pitcher one more time than the #9 hitter.

I thought about this for a long time and I finally realized why that would be the case even if there was a “times through the order” penalty (mostly) independent of pitch count. Remember that in order to compare the effect of TTO on that #9 and #1 hitter, he had to control for the overall quality of the hitter. The last hitter in the lineup is going to be a much worse hitter overall than the leadoff hitter, on the average, in his sample.

So the results should look something like this if there were a true TTOP: Say the #9 batters are normally .300 wOBA batters, and the leadoff guys are .330. In this situation, the #9 batters should bat around .300 (during the second time through the order we see around a normal wOBA) but the leadoff guys should bat around .340 – they should have a 10 point wOBA bonus for facing the pitcher for the third time.

Russell, without showing us the data (he should!), presumably gets something like .305 for the #9 batters (since the pitcher has gone essentially 2 ½ times through the lineup, pitch count-wise) and the leadoff hitters should hit .335, or 5 points above their norm as well (maybe .336 since they are facing a pitcher with a few more pitches under his belt than the #9 hitter).

So if he gets those numbers, .335 and .305, is that evidence that there is no TTOP? Do we need to see numbers like .340 and .300 to support the TTOP theory rather than the PCP theory? I submit that even if Russell sees numbers like the former ones, that is not evidence that there is no TTOP and it’s all about the pitch count. I believe that Russell made a fatal error.

Here is where he went wrong:

Remember that he uses the log-odds method to compute the baseline numbers, or what he would expect from a given batter-pitcher matchup, based on their overall season numbers. In this experiment, there is no need to do that, since both batters, #1 and #9, are facing the same pitcher the same number of times. All he has to do is use each batter’s seasonal numbers to establish the baseline.

But where do those baselines come from? Well, it is likely that the #1 hitters are mostly #1 hitters throughout the season and that #9 hitters usually hit at the bottom of the order. #1 hitters get around 150 more PA than #9 hitters over a full season. Where do those extra PA come from? Some of them come from relievers, of course. But many of them come from facing the starting pitcher more often per game than those bottom-of-the-order guys. In addition, #9 hitters are sometimes removed for pinch hitters late in a game against a starter, such that they lose even more of those 3rd and 4th time through the order PA. Here is a chart of the mean TTO per game versus the starting pitcher for each batting slot:


| Batting Slot | Mean TTO/game |
|---|---|
| 1 | 2.15 |
| 2 | 2.08 |
| 3 | 2.02 |
| 4 | 1.98 |
| 5 | 1.95 |
| 6 | 1.91 |
| 7 | 1.86 |
| 8 | 1.80 |
| 9 | 1.77 |

(By the way, if Russell’s thesis is true, bottom of the order guys have it even easier, since they are always batting when the pitcher has a higher pitch count, per time through the order. Also, this is the first time you have been introduced to the concept that the top of the order batters have it a little easier than the bottom of the order guys, and that switching spots in the order can affect overall performance because of the TTOP or PCP.)

What that does is result in the baseline for the #1 hitter being higher than for the #9 hitter, because the baseline includes more pitcher TTOP (more times facing the starter for the 3rd and 4th times). That makes it look like the #1 hitter is not getting his advantage as compared to the #9 hitter, or at least he is only getting a partial advantage in Russell’s experiment.

In other words, the #9 hitter is really a true .305 hitter and the #1 hitter is really a true .325 hitter, even though their seasonal stats suggest .300 and .330. The #9 hitters are being hurt by not facing starters late in the game compared to the average hitter and the #1 hitters are being helped by facing starters for the 3rd and 4th times more often than the average hitter.

So if #9 hitters are really .305 hitters, then the second time through the order, we expect them to hit .305, if the TTOP is true. If the #1 hitters are really .325 hitters, despite hitting .330 for the whole season, we expect them to hit .335 the third time through the order, if the TTOP is true. And that is exactly what we see (presumably).

But when Russell sees .305 and .335 he concludes, “no TTOP!” He sees what he thinks is a true .300 hitter hitting .305 after the pitcher has thrown around 65 pitches and what he thinks is a true .330 hitter hitting .335 after 68 or 69 pitches. He therefore concludes that both hitters are being affected equally even though one is batting for the second time and the other for the third time – thus, there is no TTOP!

As I have shown, those numbers are perfectly consistent with a TTOP of around 8-10 points per times through the order, which is exactly what we see.
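The following Python sketch restates that argument with the hypothetical numbers used above. It is not Russell's model or his data; the 10-point TTOP, the “true” talent levels, and all of the names are illustrative assumptions taken from this example.

```python
# All values are hypothetical wOBA figures from the example above, in points.
TTOP = 10  # assumed bonus for batting the third time through instead of the second

hitters = {
    # "seasonal" is the baseline from full-season stats; "true" is the talent
    # level once the extra (or missing) late-game PA versus starters are removed.
    "#9": {"seasonal": 300, "true": 305, "extra_tto": 0},  # second time through
    "#1": {"seasonal": 330, "true": 325, "extra_tto": 1},  # third time through
}

for slot, h in hitters.items():
    h["expected"] = h["true"] + TTOP * h["extra_tto"]
    h["vs_seasonal"] = h["expected"] - h["seasonal"]  # what the study measures
    h["vs_true"] = h["expected"] - h["true"]          # the actual TTOP effect
    print(slot, "expected:", h["expected"],
          "vs seasonal baseline:", h["vs_seasonal"],
          "vs true talent:", h["vs_true"])
```

Measured against the seasonal baselines, both hitters appear to gain the same 5 points, which looks like a pure pitch count effect. Measured against true talent, the #9 hitter gains nothing and the #1 hitter gains the full 10 points.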

Finally, I ran one other test which I think can give us more evidence one way or another. I looked at pinch hitting appearances against starting pitchers. If the TTOP is real and pitch count is not a significant factor in the penalty, we should see around the same performance for pinch hitters regardless of the pitcher’s pitch count, since the pinch hitter always faces the pitcher for the first time and the first time only. In fact, this is a test that Russell probably should have run. The only problem is sample size. Because there are relatively few pinch hitting PA versus starting pitchers, we have quite a bit of sample error in the numbers. I split the sample of pinch hitting appearances up into 2 groups: Low pitch count and high pitch count.


Here is what I got:

| PH TTO | Overall | Low Pitch Count | High Pitch Count |
|---|---|---|---|
| 2 | .295 (PA=4901) | .295 (PA=2494) | .293 (PA=2318) |
| 3 | .289 (PA=10774) | .290 (PA=5370) | .287 (PA=5404) |


I won’t comment on the fact that the pinch hitters performed a little better against pitchers with a low pitch count (the differences are not nearly statistically significant) other than to say that there is no evidence that pitch count has any influence on the performance of pinch hitters who are naturally facing pitchers for the first and only time. Keep in mind that the times through the order (the left column) is a good proxy for pitch count in and of itself and we also see no evidence that that makes a difference in terms of pinch hitting performance. In other words, if pitch count significantly influenced pitching effectiveness, we should see pinch hitters overall performing better when the pitcher is in the midst of his 3rd time through the order as opposed to the 2nd time (his pitch count would be around 30-35 pitches higher). We don’t. In fact, we see a worse performance (the difference is not statistically significant – one SD is 8 points of wOBA).


I have to say that it is difficult to follow Russell’s chain of logic and his methodology in many of his articles because he often fails to “show his work” and he relies on somewhat esoteric and opaque statistical techniques. In this case, I believe that he made a fatal mistake in his methodology, as I have described above, which led him to the erroneous conclusion that “The TTOP does not exist.” I believe that I have shown fairly strong evidence that the penalty that we see pitchers incur as the game wears on is mostly or wholly a result of the TTO and not of fatigue caused by an increasing pitch count.

I look forward to someone doing additional research to support one theory or the other.


In The Book: Playing the Percentages in Baseball, we found that when a batter pinch hits against right-handed relief pitchers (so there are no familiarity or platoon issues), his wOBA is 34 points (10%) worse than when he starts and bats against relievers, after adjusting for the quality of the pitchers in each pool (PH or starter). We called this the pinch hitting penalty.

We postulated that the reason for this was that a player coming off the bench in the middle or towards the end of a game is not as physically or mentally prepared to hit as a starter who has been hitting and playing the field for two or three hours. In addition, some of these pinch hitters are not starting because they are tired or slightly injured.

We also found no evidence that there is a “pinch hitting skill.” In other words, there is no such thing as a “good pinch hitter.” If a hitter has had exceptionally good (or bad) pinch hitting stats, it is likely that that was due to chance alone, and thus it has no predictive value. The best predictor of a batter’s pinch-hitting performance is his regular projection with the appropriate penalty added.

We found a similar situation with designated hitters. However, their penalty was around half that of a pinch hitter, or 17 points (5%) of wOBA. Similar to the pinch hitter, the most likely explanation for this is that the DH is not as physically (and perhaps mentally) prepared for each PA as a player who is constantly engaged in the game. As well, the DH may be slightly injured or tired, especially if he is normally a position player. It makes sense that the DH penalty would be less than the PH penalty, as the DH is more involved in a game than a PH. Pinch hitting is often considered “the hardest job in baseball.” The numbers suggest that that is true. Interestingly, we found a small “DH skill” such that different players seem to have more or less of a true DH penalty.

Andy Dolphin (one of the authors of The Book) revisited the PH penalty issue in this Baseball Prospectus article from 2006. In it, he found a PH penalty of 21 points in wOBA, or 6%, significantly less than what was presented in The Book (34 points).

Tom Thress, on his web site, reports a PH penalty of .009 in “player won-loss records” (offensive performance translated into a “w/l record”), which he says is similar to that found in The Book (34 points). However, he finds an even larger DH penalty of .011 wins, which is more than twice that which we presented in The Book. I assume that .011 is slightly larger than 34 points in wOBA.

So, everyone seems to be in agreement that there is a significant PH and DH penalty. However, there is some disagreement as to the magnitude of each (with empirical data, we can never be sure anyway). I am going to revisit this issue by looking at data from 1998 to 2012. The method I am going to use is the “delta method,” which is common when doing this kind of “either/or” research with many player seasons in which the number of opportunities (in this case, PA) in each “bucket” can vary greatly for each player (for example, a player may have 300 PA in the “either” bucket and only 3 PA in the “or” bucket) and from player to player.

The “delta method” looks something like this: Let’s say that we have 4 players (or player seasons) in our sample, and each player has a certain wOBA and number of PA in bucket A and in bucket B, say, DH and non-DH – the number of PA are in parentheses.

| Player | wOBA as DH (PA) | wOBA as Non-DH (PA) |
|---|---|---|
| Player 1 | .320 (150) | .330 (350) |
| Player 2 | .350 (300) | .355 (20) |
| Player 3 | .310 (350) | .325 (50) |
| Player 4 | .335 (100) | .350 (150) |

In order to compute the DH penalty (difference between when DH’ing and playing the field) using the “delta method,” we compute the difference for each player separately and take a weighted average of the differences, using the lesser of the two PA (or the harmonic mean) as the weight for each player. In the above example, we have:

((.330 – .320) * 150 + (.355 – .350) * 20 + (.325 – .310) * 50 + (.350 – .335) * 100) / (150 + 20 + 50 + 100)

If you didn’t follow that, that’s fine. You’ll just have to trust me that this is a good way to figure the “average difference” when you have a bunch of different player seasons, each with a different number of opportunities (e.g. PA) in each bucket.
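For readers who prefer code, here is a minimal Python sketch of the calculation above, using the lesser of the two PA as the weight. The function name and the tuple layout are my own choices for illustration, and the four players are the made-up ones from the example, not real data.

```python
def delta_method(player_rows):
    """Weighted average of per-player differences (bucket B minus bucket A).

    Each row is (woba_a, pa_a, woba_b, pa_b). The weight for each player is
    the lesser of his PA in the two buckets.
    """
    weighted_sum = 0.0
    total_weight = 0
    for woba_a, pa_a, woba_b, pa_b in player_rows:
        weight = min(pa_a, pa_b)
        weighted_sum += (woba_b - woba_a) * weight
        total_weight += weight
    return weighted_sum / total_weight


# (wOBA as DH, PA as DH, wOBA as non-DH, PA as non-DH) for the four players.
players = [
    (0.320, 150, 0.330, 350),
    (0.350, 300, 0.355, 20),
    (0.310, 350, 0.325, 50),
    (0.335, 100, 0.350, 150),
]

dh_penalty = delta_method(players)
print(f"DH penalty: {dh_penalty * 1000:.1f} points of wOBA")
```

For this toy sample the penalty works out to about 12 points of wOBA. Swapping `min(pa_a, pa_b)` for the harmonic mean of the two PA totals would give the alternative weighting mentioned above.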

In addition to figuring the PH and DH penalties (in various scenarios, as you will see), I am also going to look at some other interesting “penalty situations” like playing in a day game after a night game, or both games of a double header.

In my calculations, I adjust for the quality of the pitchers faced, the percentage of home and road PA, and the platoon advantage between the batter and pitcher. If I don’t do that, it is possible for one bucket to be inherently more hitter-friendly than the other bucket, either by chance alone or due to some selection bias, or both.

First let’s look at the DH penalty. Remember that in The Book, we found a roughly 17 point penalty, and Tom Thress found a penalty that was greater than that of a PH, presumably more than 34 points in wOBA.

Again, my data was from 1998 to 2012, and I excluded all inter-league games. I split the DH samples into two groups: One group had more DH PA than non-DH PA in each season (they were primarily DH’s), and vice versa in the other group (primarily position players).

The DH penalty was the same in both groups – 14 points in wOBA.

The total sample sizes were 10,222 PA for the primarily DH group and 32,797 for the mostly non-DH group. If we combine the two groups, we get a total of 43,019 PA. That number represents the total of the “lesser of the PA” for each player season. One standard deviation in wOBA for that many PA is around 2.5 wOBA points. For the difference between two groups of 43,000 each, it is 3.5 points (the square root of the sum of the variances). So we can say with 95% confidence that the true DH penalty is between 7 and 21 points with the most likely value being 14. This is very close to the 17 point value we presented in The Book.
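The interval arithmetic in that paragraph can be sketched in Python as follows. The function name is mine, and the sketch assumes, as the text does, that both buckets have the same standard deviation and that a 95% interval is roughly plus or minus two SD.

```python
import math


def difference_interval(observed, sd_one_group, n_sd=2.0):
    """SD of a difference between two equal-sized groups, plus a +/- n_sd interval.

    The variance of a difference is the sum of the two variances, so with equal
    SDs the SD of the difference is sqrt(2) times the single-group SD.
    """
    sd_diff = math.sqrt(sd_one_group ** 2 + sd_one_group ** 2)
    return sd_diff, observed - n_sd * sd_diff, observed + n_sd * sd_diff


# 14-point observed DH penalty; 2.5 points is one SD for roughly 43,000 PA.
sd_diff, low, high = difference_interval(observed=14.0, sd_one_group=2.5)
print(f"SD of the difference: {sd_diff:.1f} points")
print(f"Approximate 95% interval: {low:.0f} to {high:.0f} points")
```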

I expected that the penalty would be greater for position players who occasionally DH’d rather than DH’s who occasionally played in the field. That turned out not to be the case, but given the relatively small sample sizes, the true values could very well be different.

Now let’s move on to pinch hitter penalties. I split those into two groups as well: One, against starting pitchers and the other versus relievers. We would expect the former to show a greater penalty since a “double whammy” would be in effect – first, the “first time through the order” penalty, and second, the “sitting on the bench” penalty. In the reliever group, we would only have the “coming in cold” penalty. I excluded all ninth innings or later.

Versus starting pitchers only, the PH penalty was 19.5 points in 8,523 PA. One SD is 7.9 points, so the 95% confidence interval is a 4 to 35 point penalty.

Versus relievers only, the PH penalty was 12.8 points in 17,634 PA. One SD is 5.5 points – the 95% confidence interval is a 2 to 24 point penalty.

As expected, the penalty versus relievers, where batters typically only face the pitcher for the first and only time in the game, whether they are in the starting lineup or are pinch hitting, is less than that versus the starting pitcher, by around 7 points. Again, keep in mind that the sample sizes are small enough such that the true difference between the starter PH penalty and reliever PH penalty could be the same or could even be reversed. Of course, our prior when applying a Bayesian scheme is that there is a strong likelihood that the true penalty is larger against starting pitchers for the reason explained above. So it is likely that the true difference is similar to the one observed (a 7-point greater penalty versus starters).
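To see why the sample cannot settle whether the two penalties really differ, here is a rough Python sketch that compares the observed gap to its own standard deviation. It treats the starter and reliever samples as independent, which is my simplifying assumption and not something established above.

```python
import math

# Observed PH penalties and their standard deviations, in points of wOBA.
starter_penalty, starter_sd = 19.5, 7.9
reliever_penalty, reliever_sd = 12.8, 5.5

# Gap between the two penalties, and the SD of that gap (variances add
# when the two samples are independent).
gap = starter_penalty - reliever_penalty
gap_sd = math.sqrt(starter_sd ** 2 + reliever_sd ** 2)
gap_in_sds = gap / gap_sd

print(f"Gap: {gap:.1f} points, SD of the gap: {gap_sd:.1f} points")
print(f"The gap is {gap_in_sds:.2f} SD away from zero")
```

A gap of well under one SD is easily produced by chance, so it is the prior, more than the data, that suggests the starter penalty is the larger one.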

Notice that my numbers indicate penalties of a similar magnitude for pinch hitters and designated hitters. The PH penalty is a little higher than the DH penalty when pinch hitters face a starter, and a little lower than the DH penalty when they face a reliever. I expected the PH penalty to be greater than the DH penalty, as we found in The Book. Again, these numbers are based on relatively small sample sizes, so the true PH and DH penalties could be quite different.

| Role | Penalty (wOBA) |
|---|---|
| DH | 14 points |
| PH vs. Starters | 20 points |
| PH vs. Relievers | 13 points |

Now let’s look at some other potential “penalty” situations, such as the second game of a double-header and a day game following a night game.

In a day game following a night game, batters hit 6.2 wOBA points worse than in day games after day games or day games after not playing at all the previous day. The sample size was 95,789 PA. The 95% confidence interval is 1.5 to 11 points.

What about when a player plays both ends of a double-header (no PH or designated hitters)? Obviously many regulars sit out one or the other game, certainly the catchers.

Batters in the second game of a twin bill lose 8.1 points of wOBA compared to all other games. Unfortunately, the sample is only 9,055 PA, so the 2 SD interval is -7.5 to 23.5. If 8.1 wOBA points (or more) is indeed reflective of the true double-header penalty, it would be wise for teams to sit some of their regulars in one of the two games – which they do of course. It would also behoove teams to make sure that their two starters in a twin bill pitch with the same hand in order to discourage fortuitous platooning by the opposing team.

Finally, I looked at games in which a player and his team (in order to exclude times when the player sat because he wasn’t 100% healthy) did not play the previous day, versus games in which the player had played at least 8 days in a row. I am looking for a “consecutive-game fatigue” penalty and those are the two extremes. I excluded all games in April and all pinch-hitting appearances.

The “penalty” for playing at least 8 days in a row is 4.0 wOBA points in 92,287 PA. One SD is 2.4 so that is not a statistically significant difference. However, with a Bayesian prior such that we expect there to be a “consecutive-game fatigue” penalty, I think we can be fairly confident with the empirical results (although obviously there is not much certainty as to the magnitude).

To see whether the consecutive day result is a “penalty” or the day off result is a bonus, I compared them to all other games.

When a player and his team have had a day off the previous day, the player hits 0.1 points better than otherwise in 115,471 PA (-4.5 to +4.5). Without running the “consecutive days off” scenario, we can infer that there is an observed penalty when playing at least 8 days in a row, of around 4 points, compared to all other games (the same as compared to after an off-day).

So having a day off is not really a “bonus,” but playing too many days in a row creates a penalty. It probably behooves all players to take an occasional day off. Players like Cal Ripken, Steve Garvey, and Miguel Tejada (and others) may have had substantially better careers had they been rested more, at least rate-wise.

I also looked at players who played in fewer days in a row (5, 6, and 7) and found penalties of less than 4 points, suggesting that the more days in a row a player plays, the more his offense is penalized. It would be interesting to see if a day off after several days in a row restores a player to his normal offensive levels.

There are many other situations where batters and pitchers may suffer penalties (or bonuses), such as game(s) after coming back from the DL, getaway (where the home team leaves for another venue) games, Sunday night games, etc.

Unfortunately, I don’t have the time to run all of these potentially interesting scenarios – and I have to leave something for aspiring saberists to do!

Addendum: Tango Tiger suggested I split the DH into “versus relievers and starters.” I did not expect there to be a difference in penalties since, unlike a PH, a DH faces the starter the same number of times as when he isn’t DH’ing. However, I found a penalty difference of 8 points: the DH penalty versus starters was 16.3 and versus relievers, it was 8.3. Maybe the DH becomes “warmer” towards the end of the game, or maybe the difference is a random, statistical blip. I don’t know. We are often faced with these conundrums (what to conclude) when dealing with limited empirical data (relatively small sample sizes). Even if we are statistically confident that an effect exists (or doesn’t), we are usually quite uncertain as to the magnitude of that effect.

I also looked at getaway night games (where the home team goes on the road after the game). It has long been postulated that the home team does not perform as well in these games. Indeed, the home team batter penalty in these games was 1.6 wOBA points, again not a statistically significant difference, but consistent with the Bayesian prior. Interestingly, the road team batters performed 0.6 points better, suggesting that home team pitchers in getaway games might have a small penalty as well.