Archive for the ‘Managers’ Category

Richard Nichols (@RNicholsLV on Twitter) sent me this link. These are notes that the author, Lee Judge, a Royals blogger for the K.C. Star, took during the season. They reflect thoughts and comments from players, coaches, etc. I thought I’d briefly comment on each one. Hope you enjoy!

Random, but interesting, things about baseball – Lee Judge

▪ If a pitcher does not have a history of doubling up on pickoff throws (two in a row) take a big lead, draw a throw and then steal on the next pitch.

Of course you can do that. But how many times can you get away with it? Once? If the pitcher or one of his teammates or coaches notices it, he’ll pick you off the next time by “doubling up.” Basically by exploiting the pitcher’s non-random and thus exploitable strategy, the runner becomes exploitable himself. A pitcher, of course, should be picking a certain percentage of the time each time he goes into the set position, based on the likelihood of the runner stealing and the value of the steal attempt. That “percentage” must be randomized by the pitcher and it “resets” each time he throws a pitch or attempts a pickoff.

By “randomize” I mean the prior action, pick or no pick, cannot affect the percentage chance of a pick. If a pitcher is supposed to pick 50% prior to the next pitch he must do so whether he’s just attempted a pickoff 0, 1, 2, or 10 times in a row. The runner can’t know that a pickoff is more or less likely based on how many picks were just attempted. In fact you can tell him, “Hey every time I come set, there’s a 50% (or 20%, or whatever) chance I will attempt to pick you off,” and there’s nothing he can do to exploit that information.

For example, if he decides that he must throw over 50% of the time he comes set (in reality the optimal % changes with the count), then he flips a mental coin (or uses something – unknown to the other team – to randomize his decision, with a .5 mean). What will happen on the average is that he won’t pick half the time, 25% of the time he’ll pick once only, 12.5% of the time he’ll pick exactly twice, 25% of the time he’ll pick at least twice, etc.
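To make the "memoryless" part concrete, here is a quick sketch (mine, not from Judge's notes) of a pitcher who throws over 50% of the time every time he comes set, which is just the example rate from above. The distribution of 0, 1, 2, etc. consecutive throws falls out of the rule by itself, and knowing the rule doesn't help the runner at all.

```python
import random
from collections import Counter

def picks_before_pitch(p_pick=0.5, rng=random.random):
    """Each time the pitcher comes set, throw over with probability p_pick,
    independent of how many picks he has already made. Return the number of
    pickoff throws before he finally delivers the pitch."""
    picks = 0
    while rng() < p_pick:
        picks += 1
    return picks

trials = 100_000
counts = Counter(picks_before_pitch(0.5) for _ in range(trials))
for k in range(4):
    print(f"exactly {k} picks: {counts[k] / trials:.3f}")  # ~.500, .250, .125, .063
print(f"2 or more picks: {sum(v for k, v in counts.items() if k >= 2) / trials:.3f}")  # ~.250
```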

Now, the tidbit from the player or coach says, “does not have a history of doubling up.” I’m not sure what that means. Surely most pitchers, when they do pick, will pick once sometimes and twice sometimes, etc. Do any pitchers really never pick more than once per pitch? If they do, I would guess that it’s because the runner is not really a threat and the one-time pick is really a pick with a low percentage. If a runner is not much of a threat to run, then maybe the correct pick percentage is 10%. If that’s the case, then they will not double up 99% of the time and correctly so. That cannot be exploited, again, assuming that a 10% rate is optimal for that runner in that situation. So while it may look like they never double up, they do in fact double up 1% of the time, which is correct and cannot be exploited (assuming the 10% is correct for that runner and in that situation).

Basically what I’m saying is that this person’s comment is way too simple and doesn’t really mean anything without putting it into context as I explain above.

▪ Foul balls with two strikes can indicate a lack of swing-and-miss stuff; the pitcher can get the batters to two strikes, but then can’t finish them off.

Not much to say here. Some pitchers have swing-and-miss stuff and others don’t, and everything in-between. You can find that out by looking at…uh…their swing-and-miss percentages (presuming a large enough sample size to give you some minimum level of certainty). Foul balls with two strikes? That’s just silly. A pitcher without swing-and-miss stuff will get more foul balls and balls in play with two strikes. That’s a tautology. He’ll also get more foul balls and balls in play with no strikes, one strike, etc.

▪ Royals third-base coach Mike Jirschele will walk around the outfield every once in a while just to remind himself how far it is to home plate and what a great throw it takes to nail a runner trying to score.

If my coach has to do that I’m not sure I want him coaching for me. That being said, whatever little quirks he has or needs in order to send or hold runners the correct percentage of the time are fine by me. I don’t know that I would be teaching or recommending that to my coaches – again, not that there’s anything necessarily wrong with it.

Bottom line is that he better know the minimum percentages that runners need to be safe in any given situation (mostly # of outs) – i.e. the break-even points – and apply them correctly to the situation (arm strength and accuracy etc.) in order to make optimal decisions. I would surely be going over those numbers with my coaches from time to time and then evaluating his sends and holds to make sure he’s not making systematic errors or too many errors in general.

▪ For the most part, the cutter is considered a weak contact pitch; the slider is considered a swing-and-miss pitch.

If that’s confirmed by pitch f/x, fine. If it’s not, then I guess it’s not true. Swing-and-miss is really just a subset of weak contact and weak contact is a subset of contact which is a subset of a swing. The result of a swing depends on the naked quality of the pitch, where it is thrown, and the count. So while for the most part (however you want to define that – words are important!) it may be true, surely it depends on the quality of each of the pitches, on what counts they tend to be thrown, how often they are thrown at those counts, and the location they are thrown to. Pitches away from the heart of the plate tend to be balls and swing-and-miss pitches. Pitches nearer the heart tend to be contacted more often, everything else being equal.

▪ With the game on the line and behind in the count, walk the big-money guys; put your ego aside and make someone else beat you.

Stupid. Just. Plain. Stupid. Probably the dumbest thing a pitcher or manager can think/do in a game. I don’t even know what it means and neither do they. So tie game in the 9th, no one on base, 0 outs, count is 1-0. Walk the batter? That’s what he said! I can think of a hundred stupid examples like that. A pitcher’s approach changes with every batter and every score, inning, outs, runners, etc. A blanket statement like that, even as a rule of thumb, is Just. Plain. Dumb. Any interpretation of that by players and coaches can only lead to sub-optimal decisions – and does. All the time. Did I say that one is stupid?

▪ A pitcher should not let a hitter know what he’s thinking; if he hits a batter accidentally he shouldn’t pat his chest to say “my bad.” Make the hitter think you might have drilled him intentionally and that you just might do it again.

O.K. To each his own.

▪ Opposition teams are definitely trying to get into Yordano Ventura’s head by stepping out and jawing with him; anything to make him lose focus.

If he says so. I doubt much of that goes on in baseball. Not that kind of game. Some, but not much.

▪ In the big leagues, the runner decides when he’s going first-to-third; he might need a coach’s help on a ball to right field — it’s behind him — but if the play’s in front of him, the runner makes the decision.

Right, we teach that in Little League (a good manager that is). You teach your players that they are responsible for all base running decisions until they get to third. Then it’s up to the third base coach. It’s true that the third base coach can and should help the runner on a ball hit to RF, but ultimately the decision is on the runner whether to try and take third.

Speaking of taking third, while the old adage “don’t make the first or third out at third base” is a good rule of thumb, players should know that it doesn’t mean, “Never take a risk on trying to advance to third.” It means the risk has to be low (like 10-20%), but that the risk can be twice as high with 0 outs as with 2 outs. So really, the adage should be, “Never make the third out at third base, but you can sometimes make the first out at third base.”

You can also just forget about the first out part of that adage. Really, the no-out break-even point sits almost exactly in between the one-out and two-out points. In other words, with no outs, you need to be safe at third around 80% of the time, with one out, around 70%, and with two outs around 90%. Players should be taught that and not just the “rule of thumb.” They should also be taught that the numbers change with trailing runners, the pitcher, and who the next batter or batters are. For example, with a trailing runner, making the third out is really bad but making the first out where the trailing runner can advance is a bonus.
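If you want to see where numbers like 80/70/90 come from, here is a minimal sketch using an assumed, roughly league-average run-expectancy table (the RE values below are my approximations, not official figures); in a late, close game you would run the same comparison with win expectancy instead of run expectancy.

```python
# Approximate league-average run expectancies (assumed values, not from the post).
RE = {
    ("2B", 0): 1.10, ("2B", 1): 0.66, ("2B", 2): 0.32,
    ("3B", 0): 1.35, ("3B", 1): 0.95, ("3B", 2): 0.35,
    ("--", 1): 0.26, ("--", 2): 0.10, ("--", 3): 0.00,
}

def breakeven_taking_third(outs):
    """Minimum safe rate that makes going 2nd-to-3rd break even in run expectancy:
    stay at 2nd  vs  p * (3rd, same outs) + (1 - p) * (empty, one more out)."""
    stay = RE[("2B", outs)]
    safe = RE[("3B", outs)]
    out = RE[("--", outs + 1)]
    return (stay - out) / (safe - out)

for outs in (0, 1, 2):
    print(f"{outs} out(s): need ~{breakeven_taking_third(outs):.0%} to break even")
# roughly 77%, 66%, 91% with these assumed RE numbers -- close to the 80/70/90 in the text
```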

▪ Even in a blowout there’s something to play for; if you come close enough to make the other team use their closer, maybe he won’t be available the next night.

I’m pretty sure the evidence suggests that players play at their best (more or less) regardless of the score. That makes sense under almost any economic or cognitive theory of behavior since players get paid big money to have big numbers. Maybe they do partially because managers and coaches encourage them to do so with tidbits like that. I don’t know.

Depending on what they mean by blowout, what they’re saying is that, say you have a 5% chance of winning a game down six runs in the late innings. Now say you have a 20% chance of making it a 3-run or less game, and that means that the opponent’s closer comes into the game. And say that his coming into the game gives you another 2% chance of winning tomorrow because he might not be available, and an extra 1% the day after that (if it’s the first game in a series). So rather than a 5% win expectancy, you actually have 5% plus 20% * 3%, or 5.6% WE. Is that worth extra effort? To be honest, managers and coaches are supposed to teach their players to play hard (within reason) regardless of the score for two reasons: One, because it makes for better habits when the game is close, and two, at exactly what point is the game a blowout (Google the sorites paradox)?
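Here is that back-of-the-envelope math spelled out, with the same hypothetical numbers (nothing here is measured; it is just the arithmetic from the paragraph above).

```python
# Toy version of the back-of-the-envelope math above (all inputs are the post's
# hypothetical numbers, not measured values).
we_tonight       = 0.05   # chance of winning tonight, down six runs late
p_force_closer   = 0.20   # chance of getting close enough that their closer must pitch
we_gain_tomorrow = 0.02   # extra win probability tomorrow if the closer is unavailable
we_gain_next_day = 0.01   # and a little more the day after

total = we_tonight + p_force_closer * (we_gain_tomorrow + we_gain_next_day)
print(f"{total:.3f}")     # 0.056 -- i.e., 5.6% instead of 5%
```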

▪ If it’s 0-2, 1-2 and 2-2, those are curveball counts and good counts to run on. That’s why pitchers often try pickoffs in those counts.

On the other hand, 0-2 is not a good count to run on because of the threat of the pitchout. As it turns out, the majority of SB attempts (around 68%) occur at neutral counts. Only around 16% of all steal attempts occur at those pitchers’ counts. So whoever said that is completely wrong.

Of course pitchers should (and do) attempt more pickoffs the greater the chance of a steal attempt. That also tends to make it harder to steal (hence the game theory aspect).

That being said, some smart people (e.g., Professor Ted Turocy of Chadwick Baseball Bureau) believe that there is a Nash equilibrium between the offense and defense with respect to base stealing (for most players – not at the extremes) such that neither side can exploit the other by changing their strategy. I don’t know if it’s true or not. I think Professor Turocy may have a paper on this. You can check it out on the web or contact him.

▪ Don’t worry about anyone’s batting average until they have 100 at-bats.

How about “Don’t worry about batting average…period.” In so many ways this is wrong. I would have to immediately fire whoever said that if it was a coach, manager or executive.

▪ It’s hard to beat a team three times in a row; teams change starting pitchers every night and catching three different pitchers having a down night is not the norm.

Whoever said this should be fired sooner than the one above. As in, before they even finished that colossally innumerate sentence.

▪ At this level, “see-it-and-hit” will only take you so far. The best pitchers are throwing so hard you have to study the scouting reports and have some idea of what’s coming next.

If that’s your approach at any level you have a lot to learn. That goes for 20 or 50 years ago the same as it does today. If pitchers were throwing maybe 60 mph not so much I guess. But even at 85 you definitely need to know what you’re likely to get at any count and in any situation from that specific pitcher. Batters who tell you that they are “see-it-and-hit-it” batters are lying to you or to themselves. There is no such thing in professional baseball. Even the most unsophisticated batter in the world knows that at 3-0, no outs, no runners on, his team is down 6 runs, he’s likely to be getting 100% fastballs.

▪ If a pitcher throws a fastball in a 1-1 count, nine out of 10 times, guess fastball. But if it’s that 10th time and he throws a slider instead, you’re going to look silly.

WTF? If you go home expecting your house to be empty but there are two giraffes and a midget, you’re going to be surprised.

▪ Good hitters lock in on a certain pitch, look for it and won’t come off it. You can make a guy look bad until he gets the pitch he was looking for and then he probably won’t miss it.

Probably have to fire this guy too. That’s complete bullshit. Makes no sense from a game-theory perspective or from any perspective for that matter. So just never throw him that pitch right? Then he can’t be a good hitter. But now if you never throw him the pitch he’s looking for, he’ll stop looking for it, and will instead look for the alternative pitch you are throwing him. So you’ll stop throwing him that pitch and then…. Managers and hitting coaches (and players) really (really) need a primer on game theory. I am available for the right price.
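To show what I mean, here is a toy two-pitch matchup; the wOBA-like payoffs are invented purely for illustration. At the pitcher's equilibrium mix the batter's expected result is the same no matter which pitch he looks for, and a batter who truly "won't come off" one pitch gets buried.

```python
# Toy 2x2 pitcher/batter game. The wOBA-like payoffs are invented for
# illustration only; the point is the logic, not the numbers.
payoff = {("sit_fb", "fb"): 0.450, ("sit_fb", "brk"): 0.250,
          ("sit_brk", "fb"): 0.300, ("sit_brk", "brk"): 0.400}

def pitcher_mix():
    """Fraction of fastballs that leaves the batter indifferent between his two looks."""
    a, b = payoff[("sit_fb", "fb")], payoff[("sit_fb", "brk")]
    c, d = payoff[("sit_brk", "fb")], payoff[("sit_brk", "brk")]
    return (d - b) / ((a - b) + (d - c))   # solve a*q + b*(1-q) = c*q + d*(1-q)

q = pitcher_mix()
value = payoff[("sit_fb", "fb")] * q + payoff[("sit_fb", "brk")] * (1 - q)
print(f"equilibrium fastball rate: {q:.2f}, batter's result either way: {value:.3f}")
# If the batter instead "locks in" on the fastball 100% of the time, the pitcher
# simply stops throwing it and the batter's result drops to 0.250.
```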

▪ According to hitting coach Dale Sveum, hitters should not give pitchers too much credit; wait for a mistake and if the pitcher makes a great pitch, take it. Don’t start chasing great pitches; stick to the plan and keep waiting for that mistake.

Now why didn’t I think of that!

▪ The Royals are not a great off-speed hitting club, so opposition pitchers want to spin it up there.

Same as above. Actually, remember this: You cannot tell how good or bad a player or team is at hitting any particular pitch by looking at the results. You can only tell by how often they get each type of pitch. Game theory tells us that the results of all the different pitches (type, location, etc.) will be about the same to any hitter. What changes depending on that hitter’s strengths and weaknesses are the frequencies. And this whole, “Team is good/bad at X” is silly. It’s about the individual players of course. I’m pretty sure there was at least one hitter on the team who is good at hitting off-speed.

Also, never evaluate or define “good hitting” based on batting average which most coaches and managers do even in 2016. I don’t have to tell you, dear sophisticated reader, that. However, you should also not define good or bad hitting on a pitch level based on OPS or wOBA (presumably on contact) either. You need to include pitches not put into play and you need to incorporate count. For example, at a 3-ball count there is a huge premium on not swinging at a ball. Your result on contact is not so important. At 2-strike counts, not taking a strike is also especially important. Whenever you see pitch level numbers without including balls not swung at, or especially only on balls put into play (which is usually the case), be very wary of those numbers. For example, a good off-speed hitting player will tend to have good strike zone recognition (and not necessarily good results on contact) skills because many more off-speed pitches are thrown in pitchers’ counts and out of the strike zone.
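One straightforward way to do that bookkeeping is to value every pitch by how much it moves the expected outcome of the plate appearance, so takes count and the count is baked in. The count values below are rough placeholders of my own, not real research numbers; you would derive the real ones from RE or wOBA-by-count data.

```python
# Scoring individual pitches by the change in expected run value of the count,
# so takes count too. The numbers below are rough placeholders supplied for
# illustration; derive real ones from your own RE/wOBA-by-count data.
COUNT_VALUE = {
    (0, 0): 0.000, (1, 0): 0.035, (2, 0): 0.090, (3, 0): 0.150,
    (0, 1): -0.040, (1, 1): -0.010, (2, 1): 0.030, (3, 1): 0.120,
    (0, 2): -0.100, (1, 2): -0.080, (2, 2): -0.050, (3, 2): 0.030,
}
WALK, STRIKEOUT = 0.330, -0.270  # run value vs. an average PA, also placeholders

def take_value(balls, strikes, pitch_is_ball):
    """Run value (to the batter) of taking a pitch in a given count."""
    before = COUNT_VALUE[(balls, strikes)]
    if pitch_is_ball:
        after = WALK if balls == 3 else COUNT_VALUE[(balls + 1, strikes)]
    else:
        after = STRIKEOUT if strikes == 2 else COUNT_VALUE[(balls, strikes + 1)]
    return after - before

print(f"take a ball at 3-1:   {take_value(3, 1, True):+.3f}")   # big premium
print(f"take a ball at 0-0:   {take_value(0, 0, True):+.3f}")   # small
print(f"take a strike at 2-2: {take_value(2, 2, False):+.3f}")  # costly
```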

▪ According to catcher Kurt Suzuki, opposition pitchers should not try to strike out the Royals. Kansas City hitters make contact and a pitcher that’s going for punchouts might throw 100 pitches in five innings.

Wait. If they are a good contact team, doesn’t that mean that you can try and strike them out without running up your pitch count? Another dumb statement. Someone should tell Mr. Suzuki that pitch framing is really important.

▪ If you pitch down in the zone you can use the whole plate; any pitch at the knees is a pretty good pitch (a possible exception is down-and-in to lefties). If you pitch up in the zone you have to hit corners.

To some extent that’s true though it’s (a lot) more complicated than that. What’s probably more important is that when pitching down in the zone you want to pitch more away and when pitching up in the zone more inside. By the way, is it true lefties like (hit better) the down-and-in pitch more than righties? No, it is not. Where does that pervasive myth come from? Where do all the hundreds of myths that players, fans, coaches, managers, and pundits think are true come from?

▪ If you pitch up, you have to be above the swing path.

Not really sure what that means. Above the swing “path?” Swing path tends to follow the pitch so that doesn’t make too much sense. “Path” implies angle of attack, and to say “above” or “below” an angle of attack doesn’t really make sense. Maybe he means, “If you are going to pitch high, pitch really high?” Or, “If the batter tends to be a high ball hitter, pitch really high?”

▪ Numbers without context might be meaningless; or worse — misleading

I don’t know what that means. Anything might be misleading or worthless without context. Words, numbers, apple pie, dogs, cats…

▪ All walks are not equal: a walk at the beginning of an inning is worth more than a walk with two outs, a walk to Jarrod Dyson is worth more than a walk to Billy Butler.

Correct. I might give this guy one of the other guys’ (that I fired) jobs. Players, especially pitchers (but batters and fielders too), should always know the relative value of the various offensive events depending on the batter, pitcher, score, inning, count, runners, etc., and then tailor their approach to those values. This is one of the most important things in baseball.

▪ So when you look at a pitcher’s walks, ask yourself who he walked and when he walked them.

True. Walks should be weighted towards bases open, 2 outs, sluggers, close games, etc. If not, and the sample is large, then the pitcher is likely either doing something wrong or he has terrible command/control or both. For example, Greg Maddux went something like 10 years before he walked his first pitcher.

▪ When a pitcher falls behind 2-0 or 3-1, what pitch does he throw to get back in the count? Can he throw a 2-0 cutter, sinker or slider, or does he have to throw a fastball down the middle and hope for the best?

All batters, especially in this era of big data, should be acutely aware of a pitcher’s tendencies against their type of batter in any given situation and count. One of the most important ones is, “Does he have enough command of his secondary pitches (and how good is his fastball even when the batter knows it’s coming) to throw them in hitter’s counts, especially the 3-2 count?”

▪ Hitters who waggle the bat head have inconsistent swing paths.

I never heard that before. Doubt it is anything useful.

▪ The more violent the swing, the worse the pitch recognition. So if a guy really cuts it loose when he swings and allows his head to move, throw breaking stuff and change-ups. If he keeps his head still, be careful.

Honestly, if that’s all you know about a batter, someone is not doing their homework. And again, there’s game theory that must be accounted for and appreciated. Players, coaches and managers are just terrible at understanding this very important part of baseball especially the batter/pitcher matchup. If you think you can tell a pitcher to throw a certain type of pitch in a certain situation (like if the batter swings violently throw him off-speed), then surely the batter can and will know that too. If he does, which he surely will – eventually – then he basically knows what’s coming and the pitcher will get creamed!


In Game 7 of the World Series, anyone who was watching the top of the 9th inning probably remembers Javier Baez attempting a bunt (presumably a safety squeeze) on a 3-2 count with 1 out and Jason Heyward on 3rd base. You also remember that Baez struck out on a foul ball, much to the consternation of Cubs fans.

There was plenty of noise on social media criticizing Maddon (or Baez, if he did that on his own) for such an unusual play (you rarely see position players bunt on 2-strike counts, let alone with a 3-2 count and let alone with a runner on 3rd) and of course because it failed and eventually led to a scoreless inning. I was among those screaming bloody murder on Twitter and continuing my long-running criticism of Maddon’s dubious (in my opinion) post-season in-game tactics dating back to his Tampa days. I did, however, point out that I didn’t know off the top of my head (and it was anything but obvious or trivial to figure out) what the “numbers” were but that I was pretty sure it was a bad strategy.

Some “prima facie” evidence that it might be a bad play, as I also tweeted, was, “When have you ever seen a play like that in a baseball game?” That doesn’t automatically mean that it’s a bad play, but it is evidence nonetheless. And the fact that it was a critical post-season game meant nothing. If it was correct to do it in that game it would be correct to do it in any game – at least in the late innings of a tie or 1-run game.

Anyway, I decided to look at some numbers although it’s not an easy task to ascertain whether in fact this was a good, bad, or roughly neutral (or we just don’t know) play. I turned to Retrosheet as I often do, and looked at what happens when a generic batter (who isn’t walked, which probably eliminates lots of good batters) does not bunt (which is almost all of the time of course) on a 3-2 count with 1 out, runner on third base and no runner on first, in a tie game or one in which the batting team was ahead, in the late innings, when the infield would likely be playing in to prevent a run from scoring on a ground ball. This is what I found:

The runner scores around 28% of the time overall. There were 33% walks (pitcher should be pitching a bit around the batter in this situation), 25% strikeouts and 25% BIP outs. When the ball is put in play, which occurs 42% of the time, the runner scores 63% of the time.

Now let’s look at what happens when a pitcher simply bunts the ball on a 3-2 count in a sacrifice situation. We’ll use that as a proxy for what Baez might do when trying to bunt in this situation. Pitchers are decent bunters overall (although they don’t run well on a bunt) and Baez is probably an average bunter at best for a position player. In fact, Baez has a grand total of one sacrifice hit in his entire minor and major league career, so he may be a poor bunter – but to give him and Maddon the benefit of the doubt we’ll assume that he is as good at bunting as your typical NL pitcher.

On a 3-2 count in a sac situation when the pitcher is still bunting, he strikes out 40% of the time and walks 22% of the time. Compare that to the hitter who swings away at 3-2, runner on 3rd and 1 out, where he K’s 25% of the time and walks 33% of the time. Of those 40% strikeouts, lots are bunt fouls. In fact, pitchers strike out on a foul bunt with a 3-2 count 25% of the time. The rest, 15%, are called strikes and missed bunt attempts. It’s very easy to strike out on a foul bunt when you have two strikes, even when there are 3 balls (and you can take some close pitches).

How often does the run score on a 3-2 bunt attempt with a runner on 3rd such as in the Baez situation? From that data we can’t tell, because we’re only looking at 3-2 bunts from pitchers with no runner on 3rd, so we have to make some inferences.

The pitcher puts the ball in play 36% of the time when bunting on a 3-2 count. How often would a runner score if there were a runner on 3rd? We’ll have to make some more inferences. In situations where a batter attempts a squeeze (either a suicide or safety – for the most part, we can’t tell from the Retrosheet data), the runner scores 80% of the time when the ball is bunted in play. So let’s assume the same with our pitchers/Baez. So 36% of the time the ball is put in play on a 3-2 bunt, and 80% of the time the run scores. That’s a score rate of 29% – around the same as when swinging away.

So swinging away, the run scores 28% of the time. With a bunt attempt the run scores 29% of the time, so it would appear to be a tie with no particular strategy a clear winner. But wait….

When the run doesn’t score, the batter who is swinging away at 3-2 walks 33% of the time while the pitcher who is attempting a bunt on a 3-2 pitch walks only 25% of the time. But, we won’t count that as an advantage for the batter swinging away. The BB difference is likely due to the fact that pitchers are pitching around batters in that situation and they are going right after pitchers on 3-2 counts in sacrifice situations. In a situation like Baez’ the pitcher is going to issue more than 25% walks since he doesn’t mind the free pass and he is not going to groove one. So we’ll ignore the difference in walks. But wait again….

When a run scores on a squeeze play the batter is out 72% of the time and ends up mostly on first 28% of the time (a single, error, or fielder’s choice). When a run scores with a batter swinging away on a 3-2 count, the batter is out only 36% of the time. 21% of those are singles and errors and 15% are extra base hits including 10% triples and 5% HR.

So even though the run scores with both bunting and hitting away on a 3-2 count around the same percentage of the time, the batter is safe, including walks, hits, errors and fielder’s choices, only 26% of the time when bunting and 50% when swinging away. Additionally, when the batter swinging away gets a hit, 20% are triples and 6% are HR. So even though the runner on third scores around the same percentage of the time whether swinging away or bunting on that 3-2 count, when the run does score, the batter who is swinging away reaches base safely (with some extra-base hits, including HR) more than twice as often as the batter who is bunting.

I’m going to say that the conclusion is that while the bunt attempt was probably not a terrible play, it was still the wrong strategy given that it was the top of the inning. The runner from third will probably score around the same percentage of the time whether Baez is bunting or swinging away, but when the run does score, Baez is going to be safe a much higher percentage of the time, including via the double, triple or HR, leading to an additional run scoring significantly more often than with the squeeze attempt.

I’m not giving a pass to Maddon on this one. That would be true regardless of whether the bunt worked or not – of course.

Addendum: A quick estimate is that an additional run (or more) will score around 12% more often when swinging away. An extra run in the top of the 9th, going from a 1-run lead to a 2-run lead, increases a team’s chances of winning by 10% (after that, every additional run is worth half the value of the preceding run). So we get an extra 1.2% (10% times 12%) in win expectancy from swinging away rather than bunting via the extra hits that occur when the ball is put into play.
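For anyone who wants to check the arithmetic, here is the comparison with the estimates above plugged back in (nothing new, just the numbers already quoted).

```python
# Plugging the Retrosheet-based estimates from above back in (all rates are the
# ones quoted in the post, not new data).
swing = {"run_scores": 0.28, "batter_safe_if_run_scores": 0.50}
bunt  = {"ball_in_play": 0.36, "run_scores_if_in_play": 0.80,
         "batter_safe_if_run_scores": 0.26}

bunt["run_scores"] = bunt["ball_in_play"] * bunt["run_scores_if_in_play"]
print(f"run scores: swing {swing['run_scores']:.0%}, bunt {bunt['run_scores']:.0%}")
print(f"batter safe when run scores: swing {swing['batter_safe_if_run_scores']:.0%}, "
      f"bunt {bunt['batter_safe_if_run_scores']:.0%}")

# The addendum's estimate: an extra run happens about 12% more often swinging away,
# and going from a 1-run to a 2-run lead in the top of the 9th is worth roughly
# 10% of win expectancy.
extra_run_gap = 0.12
we_per_extra_run = 0.10
print(f"WE edge for swinging away: ~{extra_run_gap * we_per_extra_run:.1%}")
```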

 

 

Let me explain game theory wrt sac bunting using tonight’s CLE game as an example. Bottom of the 10th, leadoff batter on first, Gimenez is up. He is a very weak batter with little power or on-base skills, and the announcers say, “You would expect him to be bunting.” He clearly is.

Now, in general, to determine whether to bunt or not, you estimate the win expectancies (WE) based on the frequencies of the various outcomes of the bunt, versus the frequencies of the various outcomes of swinging away. Since, for a position player, those two final numbers are usually close, even in late tied-game situations, the correct decision usually hinges on: On the swing side, whether the batter is a good hitter or not, and his expected GDP rate. On the bunt side, how good of a sac bunter is he and how fast is he (which affect the single and ROE frequencies, which are an important part of the bunt WE)?

Gimenez is a terrible hitter which favors the bunt attempt but he is also not a good bunter and slow which favors hitting away. So the WE’s are probably somewhat close.

One thing that affects the WE for both bunting and swinging, of course, is where the third baseman plays before the pitch is thrown. Now, in this game, it was obvious that Gimenez was bunting all the way and everyone seemed fine with that. I think the announcers and probably everyone would have been shocked if he didn’t (we’ll ignore the count completely for this discussion – the decision to bunt or not clearly can change with it).

The announcers also said, “Sano is playing pretty far back for a bunt.” He was playing just on the dirt I think, which is pretty much “in between when expecting a bunt.” So it did seem like he was not playing up enough.

So what happens if he moves up a little? Maybe now it is correct to NOT bunt because the more he plays in, the lower the WE for a bunt and the higher the WE for hitting away! So maybe he shouldn’t play up more (the assumption is that if he is bunting, then the closer he plays, the better). Maybe then the batter will hit away and correctly so, which is now better for the offense than bunting with the third baseman playing only half way. Or maybe if he plays up more, the bunt is still correct but less so than with him playing back, in which case he SHOULD play up more.

So what is supposed to happen? Where is the third baseman supposed to play and what does the batter do? There is one answer and one answer only. How many managers and coaches do you think know the answer (they should)?

The third baseman is supposed to play all the way back “for starters” in his own mind, such that it is clearly correct for the batter to bunt. Now he knows he should play in a little more. So in his mind again, he plays up just a tad bit.

Now is it still correct for the batter to bunt? IOW, is the bunt WE higher than the swing WE given where the third baseman is playing? If it is, of course he is supposed to move up just a little more (in his head).

When does he stop? He stops of course when the WE from bunting is exactly the same as the WE from swinging. Where that is completely depends on those things I talked about before, like the hitting and bunting prowess of the batter, his speed, and even the pitcher himself.

What if he keeps moving up in his mind and the WE from bunting is always higher than hitting, like with most pitchers at the plate with no outs? Then the 3B simply plays in as far as he can, assuming that the batter is bunting 100%.

So in our example, if Sano is indeed playing at the correct depth which maybe he was and maybe he wasn’t, then the WE from bunting and hitting must be exactly the same, in which case, what does the batter do? It doesn’t matter, obviously! He can do whatever he wants, as long as the 3B is playing correctly.

So in a bunt situation like this, assuming that the 3B (and other fielders if applicable) is playing reasonably correctly, it NEVER matters what the batter does. That should be the case in every single potential sac bunt situation you see in a baseball game. It NEVER matters what the batter does. Either bunting or not are equally “correct.” They result in exactly the same WE.

The only exceptions (which do occur) are when the WE from bunting is always higher than swinging when the 3B is playing all the way up (a poor hitter and/or exceptional bunter) OR the WE from swinging is always higher even when the 3B is playing completely back (a good or great hitter and/or poor bunter).

So unless you see the 3B playing all the way in or all the way back and they are playing reasonably optimally it NEVER matters what the batter does. Bunt or not bunt and the win expectancy is exactly the same! And if the 3rd baseman plays all the way in or all the way back and is playing optimally, then it is always correct for the batter to bunt or not bunt 100% of the time.

I won’t go into this too much because the post assumed that the defense was playing optimally, i.e. it was in a “Nash equilibrium” (as I explained, it is playing at a depth such that the WE for bunting and swinging are exactly equal), or it was correctly playing all the way in (the WE for bunting is still equal to or greater than for swinging) or all the way back (the WE for swinging is >= that of bunting). But if the defense is NOT playing optimally, then the batter MUST bunt or swing away 100% of the time.
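Here is a toy version of that depth "negotiation," with made-up win-expectancy curves just to show the machinery: the defense picks the depth that minimizes the offense's better option, and at that depth (unless it is pinned all the way in or all the way back) bunting and swinging come out the same.

```python
# Toy model of the third-baseman-depth game described above. The two win-expectancy
# curves are invented linear placeholders (depth 0.0 = playing all the way in,
# 1.0 = all the way back); only their shapes matter: bunting gets better for the
# offense the deeper the 3B plays, swinging away gets better the closer he plays.
def we_bunt(depth):
    return 0.52 + 0.08 * depth

def we_swing(depth):
    return 0.57 - 0.06 * depth

# The defense picks the depth that minimizes the offense's better option.
depths = [i / 1000 for i in range(1001)]
best_depth = min(depths, key=lambda d: max(we_bunt(d), we_swing(d)))

print(f"optimal depth ~{best_depth:.2f}")
print(f"WE bunting {we_bunt(best_depth):.3f} vs swinging {we_swing(best_depth):.3f}")
# With these made-up curves the two are (nearly) equal at the optimum, i.e. the
# batter is indifferent -- unless the optimum lands at 0.0 or 1.0, in which case
# one action is simply correct 100% of the time.
```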

This is critical and amazingly there is not ONE manager or coach in MLB that understands it and thus correctly utilizes a correct bunt strategy or bunt defense.

There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That’s ridiculous of course. The outcome of a decision or the game has nothing to do with whether the decision was correct or not. Some decisions may raise or lower a team’s chances of winning from 90% and other decisions may affect a baseline of 10 or 15%.

If decision A results in a team’s theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious is that if a manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the “unwritten rule” we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second-guess the manager whether he chooses A or B. All of that is silly and counter-productive.

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well-above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts and assuming controversial facts and legal theories in a light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must accept the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I’ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We’ll assume he is an average-hitting pitcher or so (I have heard that he is a poor-hitting pitcher). An average pitcher’s batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..

63.2%.

If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?

Whirrrrr……

62.1%.

So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest, it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th, I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let’s assume 1.5 innings, which I think is generous.

What is DeGrom’s expected RA9 for those innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom’s normal expected RA9? Steamer, ZiPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter – an elite starter, but not quite the level of the Kershaws and Arrietas, or even the Prices and Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations, and all managers can do in making these decisions, is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 over 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 to get .085 leveraged runs, and convert that into WE at roughly 10% of WE per run. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
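Here is that calculation laid out end to end, using the same estimates as above (the RA9 figures, the 1.7 leverage, and roughly 10% of WE per extra run are all approximations already discussed).

```python
# Reproducing the arithmetic above with the post's estimates (the WE values,
# leverage, and WE-per-run conversion are the approximations used in the text).
we_degrom_bats  = 0.63    # Mets WE with DeGrom (roughly average-hitting pitcher) up
we_pinch_hitter = 0.688   # Mets WE with a league-average pinch hitter up
batting_cost    = we_pinch_hitter - we_degrom_bats            # ~0.058

ra9_degrom_3rd_time = 3.7  # 83% of league average plus the times-through-order penalty
ra9_reliever        = 4.0  # generic average reliever
innings_saved       = 1.5
leverage            = 1.7
we_per_run          = 0.10

runs_saved    = (ra9_reliever - ra9_degrom_3rd_time) / 9 * innings_saved
pitching_gain = runs_saved * leverage * we_per_run            # ~0.0085

print(f"cost of letting him hit:    {batting_cost:.3f}")
print(f"gain from him pitching on:  {pitching_gain:.4f}")
print(f"net loss from the decision: {batting_cost - pitching_gain:.3f}")  # ~0.05
```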

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of over 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.7% rather than 5.05%. 4.7% is still an enormous advantage. Reducing your team‘s chances of winning by 4.7% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention, or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference over 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets’ chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with were a replacement-level pitcher – the worst pitcher in the major leagues? He allows around 113% of league average runs, or 4.6 RA9. The difference between DeGrom and him over 1.5 innings? About 2.7%, for a net loss of 3.1% by letting DeGrom hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, “Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead of DeGrom, you will increase your chances of winning by 3.1%,” what do you think he would say?
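And here is the "tweak the inputs" exercise done systematically: move one disputed input at a time to the value most favorable to letting DeGrom hit (the alternative values are my assumptions, bracketing the cases above) and see whether pinch hitting still comes out ahead.

```python
# A one-input-at-a-time robustness check, in the spirit of the "summary judgment"
# argument above. The alternative values are assumptions bracketing the cases
# discussed in the text.
batting_cost = 0.058                      # PH vs. DeGrom batting, from the earlier estimate
base = {"ra9_degrom": 3.7, "ra9_reliever": 4.0, "innings": 1.5}
favorable_to_hitting = {"ra9_degrom": 2.9,    # Kershaw/Arrieta level, 3rd time through
                        "ra9_reliever": 4.6,  # replacement-level reliever
                        "innings": 2.0}       # he stays in for two full innings
leverage, we_per_run = 1.7, 0.10

def net_gain_from_ph(p):
    """WE gained by pinch hitting: the batting gain minus the pitching gain given up."""
    pitching_gain = (p["ra9_reliever"] - p["ra9_degrom"]) / 9 * p["innings"] * leverage * we_per_run
    return batting_cost - pitching_gain

for name, value in favorable_to_hitting.items():
    tweaked = dict(base, **{name: value})
    print(f"{name} = {value}: pinch hitting still gains {net_gain_from_ph(tweaked):+.3f} WE")
```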

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the best possible light of allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits, which is to be expected, as platoon splits in general get heavily regressed toward the mean because they tend to be laden with noise, for two reasons: One, the samples are rarely large, because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, less than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his actual platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large) simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.

Soler seems to be the obvious choice,  so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrest has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrest 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrest at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrest against a right handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrest may or may not be your ordinary left-handed pitcher. What if Siegrest actually has reverse splits? What if we expect him to pitch better against right-handed batters and worse against left-handed batters? In that case, Coghlan might actually be a better choice than Soler even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand. It would be exactly as if Coghlan or Soler were facing a RHP. Or maybe Siegrest has no splits – i.e. RH and LH batters of equal overall talent perform about the same against him. Or very small platoon splits compared to the average left-hander? So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrest no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrest does indeed have very small splits either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrest has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrest is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher's true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that's what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against for a pitcher is a worthless stat in and of itself (what, walks don't count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one-season splits for a batter or pitcher are not very useful for the reasons I explained above.

If you look at Siegrest’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrest is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrest to pitch to Coghlan knowing that Siegrest has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrest is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrest’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrest has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrest’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different than the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 

Coins

Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up head or tails in one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers platoon stats and no matter what they are and over how many TBF, you conclude that the pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can't know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let's get back to the coins to see why that is. Let's say that we don't have 100% fair coins. Our sly kid put a bunch of trick coins in his piggy bank, but none that are really, really tricky. Most are still 50/50, but some are 48/52, 52/48, a few less are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of "true coin talent" but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that. If your kid were smart enough to put in a normal distribution of "coin talent," even one with a small spread, the further away from 50/50, the fewer coins there are. Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60. Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It's called Bayesian inference, but you don't need to worry about what it's called or even how it technically works. It is true that a 60/40 heads result in 100 flips is much more likely to come from a true 60/40 coin than from a true 50/50 coin. That should be obvious too. But here's the catch. There are many, many more 50/50 coins in your kid's piggy bank than there are 60/40 coins. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chances of a 50/50 coin coming up exactly 60/40 are around .01. The chances of a true 60/40 coin coming up 60/40 are 8 times that amount, or .08. But, if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then the chances that your coin which came up 60/40 is a fair coin or a 60/40 biased coin are even. If there are 800 times more 50/50 coins than 60/40 coins in your bank, as there is likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin even though the coin came up 60 heads in 100 flips.
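For the code-inclined, here is a minimal Python sketch of that arithmetic – binomial likelihoods weighted by how many coins of each kind are in the bank. The 800-to-1 prior count is the illustrative number from the paragraph above, and the function name is just for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin whose true heads rate is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 100, 60                         # 100 flips, 60 heads observed
like_fair   = binom_pmf(k, n, 0.50)    # ~.011 – chance a fair coin does this
like_biased = binom_pmf(k, n, 0.60)    # ~.081 – chance a true 60/40 coin does this (~8x)

# Prior: how many coins of each kind are in the piggy bank (illustrative, from the text)
prior_fair, prior_biased = 800, 1

posterior_odds_fair = (prior_fair * like_fair) / (prior_biased * like_biased)
print(f"likelihood ratio, biased vs. fair: {like_biased / like_fair:.1f}")   # ~7.5
print(f"posterior odds the coin is fair:   {posterior_odds_fair:.0f} to 1")  # ~107, i.e., ~100 to 1 as in the text
```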

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but 10,000 of them (1 in 100) will have a false positive. So the odds you have AIDS is 10,000 to 10 or 1000 to 1 against.

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its "true talent" may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as "an estimate of true talent" and is the crucial factor in making decisions based on what we think the talent of the coins/players is likely to be in the present and in the future.

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. The chances of that happening with a true 60/40 coin are about 24,000 times greater than with a 50/50 coin! So now we are much more certain that we have a true 60/40 coin even if we don't have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about ½ certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be around 55%.

There is a much easier way to do it. With some math gyrations, which I won't go into here, we can figure out how much to regress our observed flip percentage toward the mean flip percentage of all the coins, 50%. For 100 flips it is a large regression, such that with a 60/40 result we might estimate a true flip talent of around 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less toward 50%, giving us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inference using all the possible true talent states assumes a normal distribution, or close to one.
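One common shortcut consistent with that description (not necessarily the exact gyrations the author has in mind) is to make the regression fraction equal to the share of the observed variance that is pure binomial noise. A sketch, assuming the 3% SD of true coin talent from above:

```python
def regressed_estimate(observed, mean, n, talent_sd, p=0.5):
    """Regress an observed success rate toward the population mean.
    Regression fraction = random (binomial) variance / (random variance + talent variance)."""
    random_var = p * (1 - p) / n          # noise variance for n trials
    talent_var = talent_sd ** 2           # spread of true talent in the population
    regression = random_var / (random_var + talent_var)
    return mean + (1 - regression) * (observed - mean), regression

# 60% heads observed, population mean 50%, talent SD of 3% (the assumption above)
est, reg = regressed_estimate(0.60, 0.50, 100, 0.03)
print(round(est, 3), round(reg, 2))   # 0.526 0.74 – regress ~74%, estimate ~52.6%
est, reg = regressed_estimate(0.60, 0.50, 500, 0.03)
print(round(est, 3), round(reg, 2))   # 0.564 0.36 – regress ~36%, estimate ~56.4%
```

This gives roughly 52.6% for 100 flips and 56% for 500 flips, in the same ballpark as the rough 52% and 55% figures above.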

The point is that the sample size of the observed measurement determines how much we regress the observed amount toward the mean. The larger the sample, the less we regress. With one season of observed splits, we regress a lot. With career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe a SD of 10 percent, so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at a true 60/40 and even some coins at 70/30 and 80/20. The chances that we have a truly biased coin when we get an unusual result are much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, that it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrest (I almost forgot what started this whole thing) his career TBF is 604, but that's not his sample size for platoon splits because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and righties. If you don't know what that means, don't worry about it. A shortcut is to use the lesser of the two, which is almost always TBF versus lefties, or in Siegrest's case, 231. That's not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrest was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it's not), and two, he has a really odd observed split of 47 points in reverse. That's like flipping a coin 100 times and getting 70 heads and 30 tails, or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, is the only thing that determines how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent is the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrest’s true platoon split if  the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.

How do we determine the spread of talent in a population, like platoon talent? That is actually easy, but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season, where 300 is the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn't matter. Nor does the number of opportunities – they don't even have to be the same for all pitchers; it is just easier to explain it that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial theorem. .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that the player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore that. They really are small. We can of course estimate that too and throw it into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can now calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember we need only 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: spread of talent and sample size of the observed results. That gives us the regression amount. From that we merely move the observed result toward the mean by that amount, like I did above with Siegrest’s -47 points and the mean of +29 for a league-average LHP.
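A sketch of method #1 using the made-up BA numbers above (note that the .000625 random variance corresponds to a .500 hitter; with a realistic .265 BA it would be closer to .00049, but the logic is identical). The function name is just for illustration:

```python
import math

def talent_sd_and_regression(observed_var, n, p):
    """Method 1: observed variance = random (binomial) variance + talent variance.
    The regression fraction for this sample size is random variance / total variance."""
    random_var = p * (1 - p) / n
    talent_var = max(observed_var - random_var, 0.0)
    return math.sqrt(talent_var), random_var / observed_var

# Players with 400 AB each, observed variance of .00125 – double the random variance
talent_sd, regression = talent_sd_and_regression(observed_var=0.00125, n=400, p=0.5)
print(f"talent SD ≈ {talent_sd:.3f}")             # ≈ .025, i.e., 25 points of BA
print(f"regression fraction ≈ {regression:.2f}")  # ≈ .50 for a 400 AB sample
```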

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or an intra-class correlation (ICC) which is essentially the same thing but it correlates every PA (or whatever the opportunity is) to every other PA within a sample.  When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
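And a sketch of method #2, plugging in the .300 hitter example from the previous paragraph:

```python
def regress_with_r(observed, league_mean, r):
    """With a year-to-year correlation of r at this sample size, the regression amount
    is (1 - r): estimate = league_mean + r * (observed - league_mean)."""
    return league_mean + r * (observed - league_mean)

# A .300 hitter in 500 AB, league average .250, r = .4 at that sample size
print(round(regress_with_r(0.300, 0.250, 0.4), 3))   # 0.27 – the observed BA moved 60% of the way to the mean
```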

Using method #1 tells us precisely what the spread in talent is. Method 2 tells us that implicitly by looking at the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.

Platoon

Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, BA, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample, the higher the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

These are for an average of 200 opportunities versus left-handed opponents (TBF versus LHB for pitchers, PA versus LHP for batters), so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrest's -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrest's true platoon split. The fact that -47 points is very far from the average left-handed pitcher's +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we "trust it more." If the regression were 99%, then whether the observed split were -47 or +10, we would arrive at nearly the same answer.

Don't confuse the regression with the observed result. One has nothing to do with the other. And don't think in terms of "trusting" the observed result or not. Regress the result and that's your answer. If you arrive at answer X, it makes no difference whether your starting point, the observed result, was B or C. None whatsoever. That is a very important point. I don't know how many times I have heard, "But he had a 47 point reverse split in his entire career! You can't possibly be saying that you estimate his real split to be +10 or +12 or whatever it is." Yes, that's exactly what I'm saying. A +10 estimated split is exactly the same whether the observed split were -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrest is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember the +10 is a weighted mean, which means that it is in the middle of our error bars. The answer to that is, "No, no and no!" Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution rarely should come into play. Basically, the noise in the result of a sample of 1 is so large that it doesn't matter at all what the uncertainty level of your estimate is.

So what do we estimate Siegrest's true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed split 1 minus 230/1255, or .82 (82%), toward the league average for lefty pitchers, which is around 29 points for a LHP. 82% of 76 points is 62 points. So we regress his -47 points 62 points in the plus direction, which gives us an estimate of +15 points in true platoon split. That is half the split of an average LHP, but it is plus nonetheless.
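As a sketch, here is that regression in code, along with the recalculation that appears further down in this piece, where the same observed split is regressed roughly 92% toward a repertoire-and-arm-angle-based mean of +14 points. The helper name is just for illustration:

```python
def regress_split(observed_split, mean_split, regression):
    """Move the observed split the given fraction of the way toward the population mean."""
    return observed_split + regression * (mean_split - observed_split)

# Career observed split of -47 points, regressed 82% toward the +29 point league-average LHP split
print(round(regress_split(-47, 29, 1 - 230 / 1255)))   # ≈ +15 points

# Same observed split, regressed ~92% toward the +14 point mean implied by his arm angle and repertoire
print(round(regress_split(-47, 14, 0.92)))              # ≈ +9 points
```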

That means that a left-handed hitter like Coghlan will hit better than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrest’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrest as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrest is small.

We also have another method for better estimating true platoon splits for pitchers, which can be used to enhance the method above that uses sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It's not like pitchers deliberately throw better or harder to one side or the other, or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things. One is arm angle. If you've ever played or watched baseball, that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrest? His arm angle is estimated by Jared Cross of Steamer, using pitch f/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrest’s 2015 numbers into that regression equation, we get an estimated platoon from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he uses around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But, the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn't want to piss off Lester, his teammates, and possibly the fan base. Who knows?

In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average than the ones who do not, assuming that the latter were allowed to stay in the game, of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.
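A quick sketch of that back-of-the-envelope logic; the .5 run gap between the stayers and the pulled pitchers is the only assumed input:

```python
true_talent = 4.0    # RA9 of the whole group of starters through 6 innings
stayers_ra9 = 4.3    # what the starters left in actually do in the 7th (true talent + ~.33 TTOP)
pulled_ra9  = 4.8    # assumption: what the removed starters would have done, .5 runs worse
share_stay  = 0.5    # proportion allowed to continue (the exact split doesn't change the point much)

everyone_ra9 = share_stay * stayers_ra9 + (1 - share_stay) * pulled_ra9
print(round(everyone_ra9, 2))                 # 4.55
print(round(everyone_ra9 - true_talent, 2))   # 0.55 – the TTO penalty this scenario would imply, vs. the ~.33 we observe
```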

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision as far as whether or not he should remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That's a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue, especially when close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0, or the game is tied 0-0, 1-1, 2-2, and the starter's pitch count is not high, managers are typically loath to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP penalty and then adjusting for batter and park.

Surely their managers, in allowing them to pitch in a very close game in the 7th know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them to weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitch through 6 were allowed to continue even in games that were not close. No different from the close games. The other numbers are similar too. The ones who are allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should be (and presumably think they are) making very careful decisions about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn't really matter who they pitch, in terms of the WE of the game, and the decision-making goal should be oriented toward the long-term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If we did, we would see much better performance from our starters than in not-so-close games and good performance in general. Instead we see rather poor performance, replacement level reliever numbers in the 7th inning of both close and not-so-close games. Surely that belies the, “Managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not,” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In all games in which a starter allows 1 or 0 runs through 6, even though his FIP was high, greater than 4, suggesting that he really wasn’t pitching such a great game, his manager let him continue 78% of the time, which was more than the 74% overall that starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP, less than 3, suggesting that he pitched better than his RA suggest, managers let them continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than component results when deciding whether to pull their starter in the 7th. We know that that is not a good decision-making process as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.

In addition, there is no evidence that managers can correctly determine who should stay and who to pull in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters has gas left in their tank and which do not? They don’t. They really, really, really don’t.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.
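A sketch of that adjustment with the note's round numbers (the helper name is just for illustration):

```python
def adjusted_baseline_ra9(season_ra9, season_ip, runs_allowed_first6):
    """Remove the first 6 innings from the season line before using it as the 7th-inning baseline."""
    season_runs = season_ra9 * season_ip / 9
    return (season_runs - runs_allowed_first6) / (season_ip - 6) * 9

print(round(adjusted_baseline_ra9(4.40, 150, 1), 2))   # 4.52 – dealing starters (1.5 RA9 through 6)
print(round(adjusted_baseline_ra9(4.40, 150, 4), 2))   # 4.33 – shelled starters (4 runs through 6)
```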

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

The other day I wrote that pitcher performance through 6 innings, as measured solely by runs allowed, is not a good predictor of performance in the 7th inning. Whether a pitcher is pitching a shutout or has allowed 4 runs thus far, his performance in the 7th is best projected mostly by his full-season true talent level plus a times through the order penalty of around .33 runs per 9 innings (the average batter faced in the 7th inning appears for the 3rd time). Pitch count has a small effect on those late inning projections as well.

Obviously if you have allowed no or even 1 run through 6 your component results will tend to be much better than if you have allowed 3 or 4 runs, however there is going to be some overlap. Some small proportion of 0 or 1 run starters will have allowed a HR, 6 or 7 walks and hits, and few if any strikeouts. Similarly, some small percentage of pitchers who allow 3 or 4 runs through 6 will have struck out 7 or 8 batters and only allowed a few hits and walks.

If we want to know whether pitching "well" or not through 6 innings has some predictive value for the 7th (and later) inning, it is better to focus on things that reflect the pitcher's raw performance rather than simply runs allowed. It is an established fact that pitchers have little control over whether their non-HR batted balls fall for hits or outs, or whether their hits and walks get "clustered" to produce lots of runs or are spread out such that few if any runs are scored.

It is also established that the components most under control by a pitcher are HR, walks, and strikeouts, and that pitchers who excel at the K, and limit walks and HR tend to be the most talented, and vice versa. It also follows that when a pitcher strikes out a lot of batters in a game and limits his HR and walks total that he is pitching “well,” regardless of how many runs he has allowed – and vice versa.

Accordingly, I have extended my inquiry into whether pitching "well" or not has some predictive value intra-game to focus on in-game FIP rather than runs allowed. My intra-game FIP is merely HR, walks, and strikeouts per inning, using the same weights as are used in the standard FIP formula – 13 for HR, 3 for walks, and minus 2 for strikeouts.
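Here is a sketch of what such an in-game FIP might look like. Whether the usual league constant (around 3.1) is added is not stated here; it is included below as an assumption so that the "dealing" (under 3) and "not dealing" (over 4) cutoffs that follow sit on a familiar FIP-like scale:

```python
def in_game_fip(hr, bb, k, innings, constant=3.10):
    """FIP-style score for a single start: (13*HR + 3*BB - 2*K) per inning, plus an assumed constant."""
    return (13 * hr + 3 * bb - 2 * k) / innings + constant

# A hypothetical dealing line through 6: no HR, 1 walk, 7 strikeouts
print(round(in_game_fip(hr=0, bb=1, k=7, innings=6), 2))   # ≈ 1.27 – comfortably under the "dealing" cutoff
```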

So, rather than defining dealing as allowing 1 or fewer runs through 6 and not dealing as 3 or more runs, I will define the former as an FIP through 6 innings below some maximum threshold and the latter as above some minimum threshold. Although I am not nearly convinced that managers and pitching coaches, and certainly not the casual fan, look much further than runs allowed, I think we can all agree that they should be looking at these FIP components instead.

Here is the same data that I presented in my last article, this time using FIP rather than runs allowed to differentiate pitchers who have been pitching very well through 6 innings or not.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings            | Avg runs allowed through 6 | # of Games | RA9 in the 7th inning
Dealing (FIP less than 3 through 6)   | 1.02                       | 5,338      | 4.39
Not-dealing (FIP greater than 4)      | 2.72                       | 3,058      | 5.03

The first thing that should jump out at you is while our pitchers who are not pitching well do indeed continue to pitch poorly, our dealing pitchers, based upon K, BB, and HR rate over the first 6 innings, are not exactly breaking the bank either in the 7th inning.

Let’s put some context into those numbers.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings            | True talent level based on season RA9 | Expected RA9 in 7th | RA9 in the 7th inning
Dealing (FIP less than 3 through 6)   | 4.25                                  | 4.50                | 4.39
Not-dealing (FIP greater than 4)      | 4.57                                  | 4.62                | 5.03

As you can see, our new dealing pitchers are much better pitchers. They normally allow 4.25 runs per game during the season. Yet they allow 4.39 runs in the 7th despite pitching very well through 6, irrespective of runs allowed (and of course they allow few runs too). In other words, we have eliminated those pitchers who allowed few runs but may have actually pitched badly or at least not as well as their meager runs allowed would suggest. All of these dealing pitchers had some combination of high K rates, and low BB and HR rates through 6 innings. But still, we see only around .1 runs per 9 in predictive value – not significantly different from zero or none.

On the other hand, pitchers who have genuinely been pitching badly, at least in terms of some combination of a low K rate and high BB and HR rates, do continue to pitch around .4 runs per 9 innings worse than we would expect given their true talent level and the TTOP.

There is one other thing that is driving some of the difference. Remember that in our last inquiry we found that pitch count was a factor in future performance. We found that while pitchers who only had 78 pitches through 6 innings pitched about as well as expected in the 7th, pitchers with an average of 97 pitches through 6 performed more than .2 runs worse than expected.

In our above 2 groups, the dealing pitchers averaged 84 pitches through 6 and the non-dealing 88, so we expect some bump in the 7th inning performance of the latter group because of a touch of fatigue, at least as compared to the dealing group.

So when we use a more granular approach to determining whether pitchers have been dealing through 6, there is not any evidence that it has much predictive value – the same thing we concluded when we looked at runs allowed only. These pitchers pitched only .11 runs per 9 better than expected.

On the other hand, if pitchers have been pitching poorly for 6 innings, as reflected in the components in which they exert the most control, K, BB, and HR rates, they do in fact pitch worse than expected, even after accounting for a slight elevation in pitch count as compared to the dealing pitchers. That decrease in performance is about .4 runs per 9.

I also want to take this time to state that based on this data and the data from my previous article, there is little evidence that managers are able to identify when pitchers should stay in the game or should be removed. We are only looking at pitchers who were chosen to continue pitching in the 7th inning by their managers and coaches. Yet, the performance of those pitchers is worse than their seasonal numbers, even for the dealing pitchers. If managers could identify those pitchers who were likely to pitch well, whether they had pitched well in prior innings or not, clearly we would see better numbers from them in the 7th inning. At best a dealing pitcher is able to mitigate his TTOP, and a non-dealing pitcher who is allowed to pitch the 7th pitches terribly, which does not bode well for the notion that managers know whom to pull and whom to keep in the game.

For example, in the above charts, we see that dealing pitchers threw .14 runs per 9 worse than their seasonal average – which also happens to be exactly at league average levels. The non-dealing pitchers, who were also deemed fit to continue by their managers, pitched almost ½ run worse than their seasonal performance and more than .6 runs worse than the league average pitcher. Almost any reliever in the 7th inning would have been a better alternative than either the dealing or non-dealing pitchers. Once again, I have yet to see some concrete evidence that the ubiquitous cry from some of the sabermetric naysayers, “Managers know more about their players’ performance prospects than we do,” has any merit whatsoever.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

Almost everyone, to a man, thinks that a manager’s decision as to whether to allow his starter to pitch in the 6th, 7th, or 8th (or later) innings of an important game hinges, at least in part, on whether said starter has been dealing or getting banged around thus far in the game.

Obviously there are many other variables that a manager can and does consider in making such a decision, including pitch count, times through the order (not high in a manager’s hierarchy of criteria, as analysts have been pointing out more and more lately), the quality and handedness of the upcoming hitters, and the state of the bullpen, both in term of quality and availability.

For the purposes of this article, we will put aside most of these other criteria. The two questions we are going to ask are these:

  • If a starter is dealing thus far, say, in the first 6 innings, and he is allowed to continue, how does he fare in the very next inning? Again, most people, including almost every baseball insider, (player, manager, coach, media commentator, etc.), will assume that he will continue to pitch well.
  • If a starter has not been dealing, or worse yet, he is achieving particularly poor results, these same folks will usually argue that it is time to take him out and replace him with a fresh arm from the pen. As with the starter who has been dealing, the presumption is that the pitcher’s bad performance over the first, say, 6 innings, is at least somewhat predictive of his performance in the next inning or two. Is that true as well?

Keep in mind that one thing we are not able to look at is how a poorly performing pitcher might have performed had he been left in a game when he was in fact removed. In other words, we can't do the controlled experiment we would like – start a bunch of pitchers, track how they perform through 6 innings, and then let all of them continue and look at their performance over the next inning or two.

So, while we have to assume that, in some cases at least, when a pitcher is pitching poorly and his manager allows him to pitch a while longer, that said manager still had some confidence in the pitcher’s performance over the remaining innings, we also must assume that if most people’s instincts are right, the dealing pitchers through 6 innings will continue to pitch exceptionally well and the not-so dealing pitchers will continue to falter.

Let’s take a look at some basic numbers before we start to parse them and do some necessary adjustments. The data below is from the AL only, 2003-2013.

 

 Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings                     | # of Games | RA9 in the 7th inning
Dealing (0 or 1 run allowed through 6)         | 5,822      | 4.46
Not-dealing (3 or more runs allowed through 6) | 2,960      | 4.48

First, let me explain what "RA9 in the 7th inning" means: It is the average number of runs allowed by the starter in the 7th inning extrapolated to 9 innings, i.e., runs per inning in the 7th multiplied by 9. Since the starter is often removed in the middle of the 7th inning, whether he has been dealing or not, I calculated his runs allowed in the entire inning by adding together his actual runs allowed while he was pitching plus the run expectancy of the average pitcher when he left the game, scaled to his talent level and adjusted for times through the order, based on the number of outs and base runners.
For example, let’s say that a starter who is normally 10% worse than a league average pitcher allowed 1 run in the 7th inning and then left with 2 outs and a runner on first base. He would be charged with allowing 1 plus (.231 * 1.1 * 1.08) runs or 1.274 runs in the 7th inning. The .231 is the average run expectancy for a runner on first base and 2 outs, the 1.1 multiplier is because he is 10% worse than a league average pitcher, and the 1.08 multiplier is because most batters in the 7th inning are appearing for the 3rd time (TTOP). When all the 7th inning runs are tallied, we can convert them into a runs per 9 innings or the RA9 you see in the chart above.

At first glance it appears that whether a starter has been dealing in prior innings or not has absolutely no bearing on how he is expected to pitch in the following inning, at least with respect to those pitchers who were allowed to remain in the game past the 6th inning. However, we have different pools of pitchers, batters, parks, etc., so the numbers will have to be parsed to make sure we are comparing apples to apples.

Let’s add some pertinent data to the above chart:

Starters through 6 RA9 in the 7th Seasonal RA9
Dealing 4.46 4.29
Not-dealing 4.48 4.46

As you can see, the starters who have been dealing are, not surprisingly, better pitchers. However, interestingly, we have a reverse hot and cold effect. The pitchers who have allowed 1 run or less through 6 innings pitch worse than expected in the 7th inning, based on their season-long RA9. Many of you will know why – the times through the order penalty. If you have not read my two articles on the TTOP, I suggest you do: each time through the order, a starting pitcher fares worse and worse, to the tune of about .33 runs per 9 innings each time he faces the entire lineup. In the 7th inning, the average TTO is 3.0, so we expect our good pitchers, the ones with the 4.29 RA9 during the season, to average around 4.76 RA9 in the 7th inning (the 3rd time through the order, a starter pitches about .33 runs per 9 worse than he pitches overall, and the seasonal adjustment – see the note above – adds another .14 runs). They actually pitch to the tune of 4.46, or .3 runs better than expected after considering the TTOP. What’s going on there?

Well, as it turns out, there are 3 contextual factors that depress a dealing starter’s results in the 7th inning that have nothing to do with his performance in the 6 previous innings:

  • The batters that a dealing pitcher is allowed to face in the 7th are 5 points lower in wOBA than the average batter he faces over the course of the season, after adjusting for handedness. This should not be surprising. If any starting pitcher is allowed to pitch the 7th inning, it is likely that the batters in that inning are slightly less formidable, or more advantageous platoon-wise, than is normally the case. Those 5 points of wOBA translate to around .17 runs per 9 innings, reducing our expected RA9 to 4.59.
  • The parks in which we find dealing pitchers are, not surprisingly, slightly pitcher friendly, with an average PF of .995, which reduces our expectation by another .02 runs per 9, to 4.57.
  • The temperature in which this performance occurs is also slightly more pitcher friendly, by around a degree Fahrenheit, although this has a de minimis effect on run scoring (it takes about a 10 degree difference in temperature to move run scoring by around .025 runs per game).

So our dealing starters pitch .11 runs per 9 innings better than expected, a small effect, but nothing to write home about, and well within the range of values that can be explained purely by chance.
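To make that adjustment chain explicit, here is the arithmetic in one small Python sketch (all of the numbers are the ones quoted above; the variable names are mine):

# Expected 7th-inning RA9 for the dealing starters, built up from the
# adjustments described above.
seasonal_ra9 = 4.29    # dealing starters' season-long RA9
seasonal_adj = 0.14    # seasonal adjustment (see the note above)
ttop_penalty = 0.33    # 3rd time through the order
batter_pool  = -0.17   # batters faced are 5 points of wOBA easier than average
park         = -0.02   # average park factor of .995

expected = seasonal_ra9 + seasonal_adj + ttop_penalty + batter_pool + park
print(round(expected, 2))          # 4.57
print(round(expected - 4.46, 2))   # 0.11 runs per 9 better than expected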

What about the starters who were not dealing? They out-perform their seasonal RA9 plus the TTOP by around .3 runs per 9. The batters they face in the 7th inning are 6 points worse than the average league batter after adjusting for the platoon advantage, and the average park and ambient temperature tend to slightly favor the hitter. Adjusting their seasonal RA9 to account for the fact that they pitched poorly through 6 (see my note at the beginning of this article), we get an expectation of 4.51. So these starters fare almost exactly as expected (4.48 to 4.51) in the 7th inning, after adjusting for the batter pool, despite allowing 3 or more runs for the first 6 innings. Keep in mind that we are only dealing with data from around 9,000 BF. One standard deviation in “luck” is around 5 points of wOBA which translates to around .16 runs per 9.

It appears to be quite damning that starters who are allowed to continue after 6 innings, whether those innings were stellar or mediocre to poor, pitch almost exactly as expected – their normal adjusted level plus .33 runs per 9 because of the TTOP – as if we had no idea how well or poorly they pitched in the prior 6 innings.

Score one for simply using a projection plus the TTOP to project how any pitcher is likely to pitch in the middle to late innings, regardless of how well or poorly he has pitched thus far in the game. Prior performance in the same game has almost no bearing on subsequent performance. If anything, when a manager allows a dealing pitcher to continue pitching after 6 innings, facing the lineup for the 3rd time on average, he is riding that pitcher too long. And, more importantly, he has presumably failed to identify anything that the pitcher might be doing, velocity-wise, mechanics-wise, repertoire-wise, command-wise, or results-wise, that would suggest he is indeed “on” that day and will continue to pitch well for another inning or so.

In fact, whether pitchers have pitched very well, very poorly, or anything in between for the first 6 innings of a game, managers and pitching coaches seem to have no ability to determine whether they are likely to pitch well if they remain in the game. The best predictor of 7th inning performance for any pitcher who is allowed to remain in the game is his seasonal performance (or projection) plus a fixed times through the order penalty. The TTOP is approximately .33 runs per 9 innings for every pass through the order. Since the second time through the order is roughly equal to a pitcher’s overall performance, starting with the 3rd time through the lineup we expect that starter to pitch .33 runs worse than he does overall, again, regardless of how he has pitched thus far in the game. The 4th time through the order, we expect a .66 run drop in performance. Pitchers rarely if ever get to face the order a 5th time.
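As a rule of thumb, the whole model boils down to something like this (a sketch, assuming a flat .33 run penalty per pass through the order beyond the 2nd; the function is mine, for illustration only):

# Expected RA9 for a starter's upcoming batters: his overall projection plus
# .33 runs per 9 for each pass through the order beyond the 2nd. In-game
# results so far are deliberately ignored.
def expected_ra9(projection, times_through_order):
    penalty = max(0, times_through_order - 2) * 0.33
    return projection + penalty

print(expected_ra9(4.00, 3))   # 4.33 the 3rd time through the order
print(expected_ra9(4.00, 4))   # 4.66 the 4th time through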

Fatigue and Pitch Counts

Let’s look at fatigue using pitch count as a proxy, and see if that has any effect on 7th inning performance for pitchers who allowed 3 or more runs through 6 innings. For example, if a pitcher has not pitched particularly well, should we allow him to continue if he has a low pitch count?

Pitch count and 7th inning performance for non-dealing pitchers:

Pitch count through 6 Expected RA9 Actual RA9
Less than 85 (avg=78) 4.56 4.70
Greater than 90 (avg=97) 4.66 4.97

 

Expected RA9 accounts for the pitchers’ adjusted seasonal RA9 plus the pool of batters faced in the 7th inning including platoon considerations, as well as park and weather. The latter 2 affect the numbers minimally. As you can see, pitchers who had relatively high pitch counts going into the 7th inning but were allowed to pitch for whatever reasons despite allowing at least 3 runs thus far, fared .3 runs worse than expected, even after adjusting for the TTOP. Pitchers with low pitch counts did only about .14 runs worse than expected, including the TTOP. Those 20 extra pitches appear to account for around .17 runs per 9, not a surprising result. Again, please keep in mind that we are dealing with limited sample sizes, so these small differences are inferential suggestions and are not to be accepted with a high degree of certainty. They do point us in a certain direction, however, and one which comports with our prior expectation – at least my prior expectation.

What about a pitcher who has been dealing and also has a low pitch count going into the 7th inning? Very few managers, if any, would remove a starter who allowed zero or 1 run through 6 innings and has only thrown 65 or 70 pitches. That would be baseball blasphemy. Besides the affront to the pitcher (which may be a legitimate concern, but one which is beyond the scope of this article), the assumption by nearly everyone is that the pitcher will continue to pitch exceptionally well. After all, he is not at all tired and he has been dealing! Let’s see if that is true – that these starters continue to pitch well, better than expected based on their projections or seasonal performance plus the TTOP.

Pitch count and 7th inning performance for dealing pitchers:

Pitch count through 6 Expected RA9 Actual RA9
Less than 80 (avg=72) 4.75 4.50
Greater than 90 (avg=96) 4.39 4.44

Keep in mind that these pitchers normally allow 4.30 runs per 9 innings during the entire season (4.44 after doing the seasonal adjustment). The reason the expected RA9 is so much higher for pitchers with a low pitch count is primarily due to the TTOP. For pitchers with a high pitch count, the batters they face in the 7th are 10 points less in wOBA than league average, thus the 4.39 expected RA9, despite the usual .3 to .35 TTOP.

Similar to the non-dealing pitchers, fatigue appears to be a factor in a dealing pitcher’s performance in the 7th. However, in either case, low-pitch or high-pitch, their performance through the first 6 innings has little bearing on their 7th inning performance. With no fatigue they out-performed their expectation by .25 runs per 9. The fatigued pitchers under-performed their overall season-long adjusted talent plus the usual TTOP by .05 runs per 9.

Again, we see that there is little value to taking out a pitcher who has been getting a little knocked around or leaving in a pitcher who has been dealing for 6 straight innings. Both groups will continue to perform at around their expected full-season levels plus any applicable TTOP, with a slight increase in performance for a low-pitch count pitcher and a slight decrease for a high-pitch count pitcher. The biggest increase we see, .25 runs, is for pitchers who were dealing and had very low pitch counts.

What about if we increase our threshold to pitchers who allow 4 or more runs over 6 innings and those who are pitching a shutout?

Starters through 6 Seasonal RA9 Expected RA9 7th inning RA9
Dealing (shutouts only) 4.23 4.62 4.70
Not-dealing (4 or more runs) 4.62 4.81 4.87

Here, we see no predictive value in the first 6 innings of performance. In fact, for some reason starters pitching a shutout pitched slightly worse than expected in the 7th inning, after adjusting for the pool of batters faced and the TTOP.

How about the holy grail of starters who are expected to keep lighting it up in the 7th inning – starters pitching a shutout and with a low pitch count? These were true talent 4.25 pitchers facing better than average batters in the 7th, mostly for the third time in the game, so we expect a .3 bump or so for the TTOP. Our expected RA9 was 4.78 after making all the adjustments, and the actual was 4.61. Nothing much to speak of. Their dealing combined with a low pitch count had a very small predictive value in the 7th. Less than .2 runs per 9 innings.

Conclusion

As I have been preaching for what seems like forever – and the data are in accordance – how a pitcher has pitched through X innings of a game, at least as measured by runs allowed, and even at the extremes, has very little bearing on how he is expected to pitch in subsequent innings. The best marker for whether to pull a pitcher or not seems to be pitch count.

If you want to know the most likely result, or the mean expected result at any point in the game, you should mostly ignore prior performance in that game and use a credible projection plus a fixed times through the order penalty, which is around .33 runs per 9 the 3rd time through, and another .33 the 4th time through. Of course the batters faced, park, weather, etc. will further dictate the absolute performance of the pitcher in question.

Keep in mind that I have not looked at a more granular approach to determining whether a pitcher has been pitching extremely well or getting shelled, such as hits, walks, strikeouts, and the like. It is possible that such an approach might yield a subset of pitching performance that indeed has some predictive value within a game. For now, however, you should be pretty convinced that run prevention alone during a game has little predictive value in terms of subsequent innings. Certainly a lot less than what most fans, managers, and other baseball insiders think.

Those of you who follow me on Twitter know that I am somewhat obsessed with how teams (managers) construct their lineups. With few exceptions, managers tend to do two things when it comes to setting their daily lineups: One, they follow more or less the traditional model of lineup construction, which is to put your best overall offensive player third, a slugger fourth, and scrappy, speedy players in the one and/or two holes. Two, they monkey with lineups based on things like starting pitcher handedness (relevant), hot and cold streaks, and batter/pitcher matchups, the latter two generally being not so relevant. For example, in 2012, the average team used 122 different lineups.

If you have read The Book (co-authored by Yours Truly, Tom Tango and Andy Dolphin), you may remember that the optimal lineup differs from the traditional one. According to The Book, a team’s 3 best hitters should bat 1, 2, and 4, and the 4th and 5th best hitters should bat 3 and 5. The 1 and 2 batters should be more walk prone than the 4 and 5 hitters. Slots 6 through 9 should feature the remaining hitters in more or less descending order of quality. As we know, managers violate or in some cases butcher this structure by batting poor, sometimes awful hitters, in the 1 and 2 holes, and usually slotting their best overall hitter third. They also sometimes bat a slow, but good offensive player, often a catcher, down in the order.

In addition to these guidelines, The Book suggests placing good base stealers in front of hitters who don’t walk much but hit a lot of singles and doubles. That often means the 6 hole rather than the traditional 1 and 2 holes in which managers like to put their speedy, base stealing players. Also, because the 3 hole faces a disproportionate number of GDP opportunities, putting a good hitter who hits into a lot of DP, like a Miguel Cabrera, into the third slot can be quite costly. Surprisingly, a good spot for a GDP-prone hitter is leadoff, where a hitter encounters relatively few GDP opportunities.

Of course, other than L/R considerations (and perhaps G/F pitcher/batter matchups for extreme players) and when substituting one player for another, optimal lineups should rarely if ever change. The notion that a team has to use 152 different lineups (as TB did in 2012) in 162 games is silly at best, and at worst a waste of a manager’s time and a source of sub-optimal play.

Contrary to the beliefs of some sabermetric naysayers, most good baseball analysts and sabermetricians are not unaware of or insensitive to the notion that some players may be more or less happy or comfortable in one lineup slot or another. In fact, the general rule should be that player preference trumps a “computer generated” optimal lineup slot. That is not to say that it is impossible to change or influence a player’s preferences.

For those of you who are thinking, “Batting order doesn’t really matter, as long as it is somewhat reasonable,” you are right and you are wrong. It depends on what you mean by “matter.” It is likely that in most cases the difference between a prevailing, traditional order and an optimal one, notwithstanding any effect from player preferences, is on the order of less than 1 win (10 or 11 runs) per season; however, teams pay on the free agent market over 5 million dollars for a player win, so maybe those 10 runs do “matter.” We also occasionally find that the difference between an actual and optimal lineup is 2 wins or more. In any case, as the old sabermetric saying goes, “Why do something wrong, when you can do it right?” In other words, in order to give up even a few runs per season, there has to be some relevant countervailing and advantageous argument, otherwise you are simply throwing away potential runs, wins, and dollars.

Probably the worst lineup offense that managers commit is putting a scrappy, speedy, bunt-happy, bat-control, but poor overall offensive player in the two hole. Remember that The Book (the real Book) says that the second slot in the lineup should be reserved for one of your two best hitters, not one of your worst. Yet teams like the Reds, Braves, and the Indians, among others, consistently put awful hitting, scrappy players in the two-hole. The consequence, of course, is that there are fewer base runners for the third and fourth hitters to drive in, and you give an awful hitter many more PA per season and per game. This might surprise some people, but the #2 hitter will get over 100 more PA than the #8 hitter, per 150 games. For a bad hitter, that means more outs for the team with less production. It is debatable what else a poor, but scrappy hitter batting second brings to the table to offset those extra empty 100 PA.
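To put a very rough number on those extra plate appearances, here is a back-of-the-envelope sketch (the -15 and +15 runs per 630 PA hitters are hypothetical examples, not real players):

# Rough value of giving ~100 extra PA (the 2 hole vs. the 8 hole over 150
# games) to a good hitter instead of a poor one. Hitter quality figures are
# hypothetical.
extra_pa = 100
poor_hitter = -15 / 630   # runs per PA for a -15 runs per 630 PA hitter
good_hitter = 15 / 630    # runs per PA for a +15 runs per 630 PA hitter

print(round((good_hitter - poor_hitter) * extra_pa, 1))   # ~4.8 runs per 150 games

In other words, just handing those PA to the better hitter is worth several runs a season, before we even get to the lineup interactions discussed below.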

The other mistake (among many) that managers make in constructing what they (presumably) think is an optimal order is using current season statistics, and often spurious ones like BA and RBI, rather than projections. I would venture to guess that you can count on one hand, at best, the number of managers that actually look at credible projections when making decisions about likely future performance, especially 4 or 5 months into the season. Unless a manager has a time machine, what a player has done so far during the season has nothing to do with how he is likely to do in the upcoming game, other than how those current season stats inform an estimate of future performance. While it is true that there is obviously a strong correlation between 4 or 5 months past performance and future performance, there are many instances where a hitter is projected as a good hitter but has had an awful season thus far, and vice versa. If you have read my previous article on projections, you will know that projections trump seasonal performance at any point in the season (good projections include current season performance to-date – of course). So, for example, if a manager sees that a hitter has a .280 wOBA for the first 4 months of the season, despite a .330 projection, and bats him 8th, he would be making a mistake, since we expect him to bat like a .330 hitter and not a .280 hitter, and in fact he does, according to an analysis of historical player seasons (again, see my article on projections).

Let’s recap the mistakes that managers typically make in constructing what they think are the best possible lineups. Again, we will ignore player preferences and other “psychological factors” not because they are unimportant, but because we don’t know when a manager might slot a player in a position that even he doesn’t think is optimal in deference to that player. The fact that managers constantly monkey with lineups anyway suggests that player preferences are not that much of a factor. Additionally, more often than not I think, we hear players say things like, “My job is to hit as well as I can wherever the manager puts me in the lineup.” Again, that is not to say that some players don’t have certain preferences and that managers shouldn’t give some, if not complete, deference to them, especially with veteran players. In other words, an analyst advising a team or manager should suggest an optimal lineup taking into consideration player preferences. No credible analyst is going to say (or at least they shouldn’t), “I don’t care where Jeter is comfortable hitting or where he wants to hit, he should bat 8th!”

Managers typically follow the traditional batting order philosophy, which is to bat your best hitter 3rd, your slugger 4th, and fast, scrappy, good bat-handlers 1 or 2, whether they are good overall hitters or not. This is not nearly the same as an optimal batting order, based on extensive computer and mathematical research, which suggests that your best hitter should bat 2 or 4, and that you need to put your worst hitters at the bottom of the order in order to limit the number of PA they get per game and per season. Probably the biggest and most pervasive mistake that managers make is slotting terrible hitters at the top, especially in the 2-hole. Managers also put too many base stealers in front of power hitters, and put hitters who are prone to the GDP in the 3 hole.

Finally, managers pay too much attention (they should pay none) to short term and seasonal performance as well as specific batter/pitcher past results when constructing their batting orders. In general, your batting order versus lefty and righty starting pitchers should rarely change, other than when substituting/resting players, or occasionally when player projections significantly change, in order to suit certain ballparks or weather conditions, or extreme ground ball or fly ball opposing pitchers (and perhaps according to the opposing team’s defense). Other than L/R platoon considerations (and avoiding batting consecutive lefties if possible), most of these other considerations (G/F, park, etc.) are marginal at best.

With that as a background and primer on batting orders, here is what I did: I looked at all 30 teams’ lineups as of a few days ago. No preference was made for whether the opposing pitcher was right or left-handed or whether full-time starters or substitutes were in the lineup on that particular day. Basically these were middle of August random lineups for all 30 teams.

The first thing I did was to compare a team’s projected runs scored based on adding up each player’s projected linear weights in runs per PA and then weighting each lineup slot by its average number of PA per game, to the number of runs scored using a game simulator and those same projections. For example, if the leadoff batter had a linear weights projection of -.01 runs per PA, we would multiply that by 4.8 since the average number of PA per game for a leadoff hitter is 4.8. I would do that for every player in the lineup in order to get a total linear weights for the team. In the NL, I assumed an average hitting pitcher for every team. I also added in every player’s base running (not base stealing) projected linear weights, using the UBR (Ultimate Base Running) stat you see on Fangraphs. The projections I used were my own. They are likely to be similar to those you see on Fangraphs, The Hardball Times, or BP, but in some cases they may be different.
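Here is a minimal sketch of that first method (the PA-per-game figures for slots 2 through 9 are my rough assumptions; the text above only cites 4.8 for the leadoff slot, and the example lineup is hypothetical):

# Method 1: add up each player's projected (batting + base running) linear
# weights in runs per PA, weighted by the average PA per game for his slot.
PA_PER_GAME = [4.8, 4.65, 4.55, 4.45, 4.3, 4.2, 4.1, 3.95, 3.85]

def lineup_linear_weights(runs_per_pa_by_slot, games=150):
    per_game = sum(r * pa for r, pa in zip(runs_per_pa_by_slot, PA_PER_GAME))
    return per_game * games

# Example: average hitters everywhere except a -.01 runs/PA leadoff man.
example = [-0.01] + [0.0] * 8
print(round(lineup_linear_weights(example), 1))   # -7.2 runs per 150 games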

In order to calculate runs per game in a simulated fashion, I ran a simple game simulator which uses each player’s projected singles, doubles, triples, HR, UIBB+HP, ROE, G/F ratio, GDP propensity, and base running ability. No bunts, steals or any in-game strategies (such as IBB) were used in the simulation. The way the base running works is this: Every player is assigned a base running rating from 1-5, based on their base running projections in runs above/below average (typically from -5 to +5 per season). In the simulator, every time a base running opportunity is encountered, like how many bases to advance on a single or double, or whether to score from third on a fly ball, it checks the rating of the appropriate base runner and makes an adjustment. For example, on an outfield single with a runner on first, if the runner is rated as a “1” (slow and/or poor runner), he advances to third just 18% of the time, whereas if he is a “5”, he advances 2 bases 41% of the time. The same thing is done with a ground ball and a runner on first (whether he is safe at second and the play goes to first), a ground ball, runner on second, advances on hits, tagging up on fly balls, and advancing on potential wild pitches, passed balls, and errors in the middle of a play (not ROE).
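A stripped-down sketch of how one of those base running checks might look inside such a simulator (the 18% and 41% advancement rates are the ones quoted above; the intermediate ratings and everything else are simplified for illustration):

import random

# Chance a runner on first takes third on an outfield single, by his 1-5
# base running rating. The endpoints (18% and 41%) are from the text; the
# middle values are interpolated for illustration.
FIRST_TO_THIRD_ON_SINGLE = {1: 0.18, 2: 0.24, 3: 0.30, 4: 0.35, 5: 0.41}

def bases_advanced_on_single(runner_rating):
    """Return 2 if the runner on first takes third, otherwise 1."""
    return 2 if random.random() < FIRST_TO_THIRD_ON_SINGLE[runner_rating] else 1

# A rating-5 runner takes third roughly 41% of the time.
random.seed(0)
trials = 100_000
rate = sum(bases_advanced_on_single(5) == 2 for _ in range(trials)) / trials
print(round(rate, 2))   # ~0.41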

Keep in mind that a lineup does 2 things. One, it gives players at the top more PA than players at the bottom, which is a pretty straightforward thing. Because of that, it should be obvious that you want your best hitters batting near the top and your worst near the bottom. But, if that were the only thing that lineups “do,” then you would simply arrange the lineup in a descending order of quality. The second way that a lineup creates runs is by each player interacting with other players, especially those near them in the order. This is very tricky and complex. Although a computer analysis can give us rules of thumb for optimal lineup construction, as we do in The Book, it is also very player dependent, in terms of each player’s exact offensive profile (again, ignoring things like player preferences or abilities of players to optimize their approach to each lineup slot). As well, if you move one player from one slot to another, you have to move at least one other player. When moving players around in order to create an optimal lineup, things can get very messy. As we discuss in The Book, in general, you want on base guys in front of power hitters and vice versa, good base stealers in front of singles hitters with low walk totals, high GDP guys in the one hole or at the bottom of the order, etc. Basically, constructing an optimal batting order is impossible for a human being to do. If any manager thinks he can, he is either lying or fooling himself. Again, that is not to say that a computer can necessarily do a better job. As with most things in MLB, the proper combination of “scouting and stats” is usually what the doctor ordered.

In any case, adding up each player’s batting and base running projected linear weights, after controlling for the number of PA per game in each batting slot, is one way to project how many runs a lineup will score per game. Running a simulation using the same projections is another way, one which also captures to some extent the complex interactions among the players’ offensive profiles. Presumably, if you just stack hitters from best to worst, the “adding up the linear weights” method will result in the maximum runs per game, while the simulation should produce quite a few fewer runs per game, and certainly fewer than with an optimal lineup construction.

I was curious as to the extent that the actual lineups I looked at optimized these interactions. In order to do that, I compared one method to the other. For example, for a given lineup, the total linear weights prorated by number of PA per game might be -30 per 150 games. That is a below average offensive lineup by 30/150 or .2 runs per game. If the lineup simulator resulted in actual runs scored of -20 per 150 games, presumably there were advantageous interactions among the players that added another 10 runs. Perhaps the lineup avoided a high GDP player in the 3-hole or perhaps they had high on base guys in front of power hitters. Again, this has nothing to do with order per se. If a lineup has poor hitters batting first and/or second, against the advice given in The Book, both the linear weights and the simulation methods would bear the brunt of that poor construction. In fact, if those poor hitters were excellent base runners and it is advisable to have good base runners at the top of the order (and I don’t know that it is), then presumably the simulation should reflect that and perhaps create added value (more runs per game) as compared to the linear weights method of projecting runs per game.
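In code, that comparison is nothing more than the difference between the two estimates (a sketch; the -20 and -30 figures are the hypothetical example from the paragraph above):

# Runs per 150 games attributable to how the players interact in the order,
# over and above the sum of their individual projections.
def interaction_gain(sim_runs_per_150, linear_weights_per_150):
    return sim_runs_per_150 - linear_weights_per_150

print(interaction_gain(-20, -30))   # +10 runs of favorable interactions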

The second thing I did was to try and use a basic model for optimizing each lineup, using the prescriptions in The Book. I then re-ran the simulation and re-calculated the total linear weights to see which teams could benefit the most from a re-working of their lineup, at least based on the lineups I chose for this analysis. This is probably the more interesting query. For the simulations, I ran 100,000 games per team, which is actually not a whole lot of games in terms of minimizing the random noise in the resultant average runs per game. One standard error in runs per 150 games is around 1.31. So take these results with a grain or two of salt.
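That 1.3-run standard error falls out of simple Monte Carlo arithmetic (a sketch; the roughly 2.75 runs per game standard deviation is my assumption for a typical team, not a number from this study):

import math

# Standard error of mean runs per game from the simulation, scaled to 150 games.
sd_runs_per_game = 2.75   # assumed per-game SD of team runs scored
n_games = 100_000

se_per_150_games = sd_runs_per_game / math.sqrt(n_games) * 150
print(round(se_per_150_games, 2))   # ~1.3, in line with the figure above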

In the NL, here are the top 3 and bottom 3 teams in terms of additional or fewer runs that a lineup simulation produced, as compared to simply adding up each player’s projected batting and base running runs, adjusting for the league average number of PA per game for each lineup slot.

Top 3

Team Linear Weights Lineup Simulation Gain per 150 games
ARI -97 -86 11
COL -23 -13 10
PIT 10 17 6

Here are those lineups:

ARI

Inciarte

Pennington

Peralta

Trumbo

Hill

Pacheco

Marte

Gosewisch

 

COL

Blackmon

Stubbs

Morneau

Arenado

Dickerson

Rosario

Culberson

Lemahieu

 

PIT

Harrison

Polanco

Martin

Walker

Marte

Snider

Davis

Alvarez

 

Bottom 3

Team Linear Weights Lineup Simulation Gain per 150 games
LAD 43 28 -15
SFN 35 27 -7
WAS 42 35 -7

 

 

LAD

Gordon

Puig

Gonzalez

Kemp

Crawford

Uribe

Ellis

Rojas

 

SFN

Pagan

Pence

Posey

Sandoval

Morse

Duvall

Panik

Crawford

 

WAS

Span

Rendon

Werth

Laroche

Ramos

Harper

Cabrera

Espinosa

 

In “optimizing” each of the 30 lineups, I used some simple criteria. I put the top two overall hitters in the 2 and 4 holes. Whichever of the two had the greatest SLG batted 4th. The next two best hitters batted 1 and 3, with the highest SLG in the 3 hole. From 5 through 8 or 9, I simply slotted the remaining hitters in descending order of quality.
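Here is roughly what that simple “optimization” looks like in code (a sketch of the criteria just described; the player tuples in the usage example are hypothetical):

# Simple lineup "optimization" per the criteria above: the two best overall
# hitters bat 2 and 4 (higher SLG 4th), the next two best bat 1 and 3
# (higher SLG 3rd), and everyone else follows in descending order of quality.
def simple_optimal_order(players):
    """players: list of (name, projected_runs, projected_slg) tuples."""
    ranked = sorted(players, key=lambda p: p[1], reverse=True)
    top2, next2, rest = ranked[:2], ranked[2:4], ranked[4:]

    slot4, slot2 = sorted(top2, key=lambda p: p[2], reverse=True)
    slot3, slot1 = sorted(next2, key=lambda p: p[2], reverse=True)

    return [slot1, slot2, slot3, slot4] + rest

lineup = simple_optimal_order([
    ("A", 30, .550), ("B", 25, .480), ("C", 20, .500),
    ("D", 10, .420), ("E", 5, .430), ("F", 0, .400),
    ("G", -5, .380), ("H", -15, .350), ("I", -25, .300),
])
print([p[0] for p in lineup])   # ['D', 'B', 'C', 'A', 'E', 'F', 'G', 'H', 'I']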

Here is a comparison of the simple “optimal” lineup to the lineups that the teams actually used. Remember, I am using the same personnel and changing only the batting orders.

Before giving you the numbers, the first thing that jumped out at me was how little most of the numbers changed. Conventional, and even most sabermetric, thought is that any one reasonable lineup is usually just about as good as any other, give or take a few runs. As well, a good lineup must strike a balance between putting the better hitters at the top of the lineup and finding spots for players who are good base runners but poor overall hitters.

The average absolute difference between the runs per game generated by the simulator for the actual and the “optimal” lineups was 3.1 runs per 150 games per team. Again, keep in mind that much of that is noise, since I am running only 100,000 games per team, which generates a standard error of something like 1.3 runs per 150 games.

The kicker, however, is that the “optimal” lineups, on average, outperformed the actual ones by only 2/3 of a run per team. Essentially there was no difference between the lineups chosen by the managers and ones that were “optimized” according to the simple rules explained above. Keep in mind that a real optimization – one that tried every possible batting order configuration and chose the best one – would likely generate better results.

That being said, here are the teams whose actual lineups out-performed and were out-performed by the “optimal” ones:

Most sub-optimal lineups

Team Actual Lineup Simulation Results (Runs per 150) “Optimal” Lineup Simulation Results Gain per 150 games
STL 62 74 12
ATL 31 37 6
CLE -33 -27 6
MIA 7 12 5

Here are those lineups. The numbers after each player’s name represent their projected batting runs per 630 PA (around 150 games). Keep in mind that these lineups faced either RH or LH starting pitchers. When I run my simulations, I am using overall projections for each player, which do not take into consideration the handedness of the opposing pitcher or any specific batter/pitcher matchup.

Cardinals

Name Projected Batting runs
Carpenter 30
Wong -11
Holliday 26
Adams 14
Peralta 7
Pierz -10
Jay 17
Robinson -18

Here, even though we have plenty of good bats in this lineup, Matheny prefers to slot one of the worst in the two hole. Many managers just can’t resist doing so, and I’m not really sure why, other than it seems to be a tradition without a good reason. Perhaps it harkens back to the day when managers would often sac bunt or hit and run after the leadoff hitter reached base with no outs. It is also a mystery why Jay bats 7th. He is even having a very good year at the plate, so it’s not like his seasonal performance belies his projection.

What if we swap Wong and Jay? That generates 69 runs above average per 150 games, which is 7 runs better than with Wong batting second, and 5 runs worse than my original “optimal” lineup. Let’s try another “manual” optimization. We’ll put Jay lead off, followed by Carp, Adams, Holliday, Peralta, Wong, Pierz, and Robinson. That lineup produces 76 runs above average, 14 runs better than the actual one, and better than my computer generated simple “optimal” one. So for the Cardinals, we’ve added 1.5 wins per season just by shuffling around their lineup, and especially by removing a poor hitter from the number 2 slot and moving up a good hitter in Jay (and who also happens to be an excellent base runner).

Braves

Name Projected Batting runs
Heyward 23
Gosselin -29
Freeman 24
J Upton 20
Johnson 9
Gattis -1
Simmons -16
BJ Upton -13

Our old friend Fredi Gonzalez finally moved BJ Upton from first to last (and correctly so, although he was about a year too late), and he puts Heyward at leadoff, which is pretty radical. Yet he somehow bats one of the worst hitters in all of baseball in the 2-hole, accumulating far too many outs at the top of the order. If we do nothing but move Gosselin down to 8th, where he belongs, we generate 35 runs, 4 more than with him batting second. Not a huge difference, but half a win is half a win. They all count and they all add up.

Indians

Name Projected Batting runs
Kipnis 5
Aviles -19
Brantley 13
Santana 6
Gomes 8
Rayburn -9
Walters -13
Holt -21
Jose Ramirez -32

The theme here is obvious. When a team puts a terrible hitter in the two-hole, they lose runs, which is not surprising. If we merely move Aviles down to the 7 spot and move everyone up accordingly, the lineup produces -28 runs rather than -33 runs, a gain of 5 runs just by removing Aviles from the second slot.

Marlins

Name Projected Batting runs
Yelich 15
Solano -21
Stanton 34
McGhee -8
Jones -10
Salty 0
Ozuna 4
Hechavarria -27

With the Fish, we have an awful batter in the two hole, a poor hitter in the 4 hole, and decent batters in the 6 and 7 hole. What if we just swap Solano for Ozuna, getting that putrid bat out of the 2 hole? Running another simulation results in 13 runs above average per 150 games, besting the actual lineup by 6 runs.

Just for the heck of it, let’s rework the entire lineup, putting Ozuna in the 2 hole, Salty in the 3 hole, Stanton in the 4 hole, then McGhee, Jones, Solano, and Hechy. Surprisingly, that only generates 12 runs above average per 150, better than their actual lineup, but slightly worse than just swapping Solano and Ozuna. The Achilles heel for that lineup, as it is for several others, appears to be the poor hitter batting second.

Most optimal lineups

Team Actual Lineup Simulation Results (Runs per 150) “Optimal” Lineup Simulation Results Gain per 150 games
LAA 160 153 -7
SEA 45 39 -6
DET 13 8 -5
TOR 86 82 -4

Finally, let’s take a look at the actual lineups that generate more runs per game than my simple “optimal” batting order.

Angels

Name Projected Batting runs
Calhoun 20
Trout 59
Pujols 7
Hamilton 17
Kendrick 10
Freese 8
Aybar 0
Iannetta 2
Cowgill -7

 

Mariners

Name Projected Batting runs
Jackson 11
Ackley -3
Cano 35
Morales 1
Seager 13
Zunino -14
Morrison -2
Chavez -24
Taylor -2

 

Tigers

Name Projected Batting runs
Davis -2
Kinsler 6
Cabrera 50
V Martinez 17
Hunter 10
JD Martinez -4
Castellanos -20
Holaday -44
Suarez -23

 

Blue Jays

Name Projected Batting runs
Reyes 11
Cabrera 15
Bautista 34
Encarnacion 20
Lind 6
Navarro -7
Rasmus -1
Valencia -9
Lawasaki -23

Looking at all these “optimal” lineups, the trend is pretty clear. Bat your best hitters at the top and your worst at the bottom, and do NOT put a scrappy, no-hit batter in the two hole! The average projected linear weights per 150 games for the number two hitter in our 4 best actual lineups is 19.25 runs. The average 2-hole hitter in our 4 worst lineups is -20 runs. That should tell you just about everything you need to know about lineup construction.
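You can verify those two averages straight from the tables above (the numbers are simply the projected batting runs listed for each 2-hole hitter):

# Projected batting runs (per 630 PA) for the 2-hole hitters in the four
# best and four worst actual lineups shown above.
best_two_hole = {"Trout": 59, "Cabrera (TOR)": 15, "Kinsler": 6, "Ackley": -3}
worst_two_hole = {"Wong": -11, "Gosselin": -29, "Aviles": -19, "Solano": -21}

print(sum(best_two_hole.values()) / 4)    # 19.25
print(sum(worst_two_hole.values()) / 4)   # -20.0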

Note: According to The Book, batting your pitcher 8th in an NL lineup generates slightly more runs per game than batting him 9th, as most managers do. Tony LaRussa sometimes did this, especially with McGwire in the lineup. Other managers, like Maddon, occasionally do the same. There is some controversy over which option is optimal.

When I ran my simulations above swapping the pitcher and the 8th hitter in the NL lineups, the resultant runs per game were around 2 runs worse (per 150) than with the traditional order. It probably depends on who the position player is at the bottom of the order, and perhaps on the players at the top of the order as well.

 

Note: These are rules of thumb which apply 90-99% of the time (or so). Some of them have a few or even many exceptions and nuances to consider. I do believe, however, that if every manager followed these religiously, even without employing any exceptions or considering any of the nuances, that he would be much better off than the status quo. There are also many other suggestions, commandments, and considerations that I would use, that are not included in this list.

1)      Thou shalt never use individual batter/pitcher matchups, recent batter or pitcher stats, or even seasonal batter or pitcher stats. Ever. The only thing this organization uses is projections based on long-term performance. You will use those constantly.

2)      Thou shalt never, ever use batting average again. wOBA is your new BA. Learn how to construct it and learn what it means.

3)      Thou shalt be given and thou shalt use the following batter/pitcher matchups every game: Each batter’s projection versus each pitcher. They include platoon considerations. Those numbers will be used for all your personnel decisions. They are your new “index cards.”

4)      Thou shalt never issue another IBB again, other than obvious late and close-game situations.

5)      Thou shalt instruct your batters whether to sacrifice bunt or not, in all sacrifice situations, based on a “commit line.” If the defense plays in front of that line, thy batters will hit away. If they play behind the line, thy batters will bunt. If they are at the commit line, they may do as they please. Each batter will have his own commit line against each pitcher. Some batters will never bunt.

6)      Thou shalt never sacrifice with runners at first and third, even with a pitcher at bat. You may squeeze if you want. With 1 out and a runner on 1st only your worst hitting pitchers will bunt.

7)      Thou shalt keep thy starter in or remove him based on two things and two things only: One, his pitch count, and two, the number of times he has faced the order. Remember that ALL pitchers lose 1/3 of a run in ERA each time through the order, regardless of how they are pitching thus far.

8)      Thou shalt remove thy starter for a pinch hitter in a high leverage situation if he is facing the order for the 3rd time or more, regardless of how he is pitching.

9)      Speaking of leverage, thou shalt be given a leverage chart with score, inning, runners, and outs. Use it!

10)   Thou shalt, if at all possible, use thy best pitchers in high leverage situations and thy worst pitchers in low leverage situations, regardless of the score or inning.  Remember that “best” and “worst” are based on your new “index cards” (batter v. pitcher projections) or your chart which contains each pitcher’s generic projection. It is never based on how they did yesterday, last week, or even the entire season. Thou sometimes may use “specialty” pitchers, such as when a GDP or a K are at a premium.

11)   Thou shalt be given a chart for every base runner and several of the most common inning, out, and score situations. There will be a number next to each player’s name for each situation. If the pitcher’s time home plus the catcher’s pop time are less than that number, thy runner will not steal. If it is greater, thy runner may steal. No runner shall steal second base with a lefty pitcher on the mound.

12)   Thou shalt not let thy heart be troubled by the outcome of your decisions. No one who works for this team will ever question your decision based on the outcome. Each decision you make is either right, wrong, or a toss-up, before we know, and regardless of, the outcome.

13)   Thou shalt be held responsible for your decisions, also regardless of the outcome. If your decisions are contrary to what we believe as an organization, we would like to hear your explanation and we will discuss it with you. However, you are expected to make the right decisions at all times, based on the beliefs and philosophies of the organization. We don’t care what the fans or the media think.  We will take care of that. We will all make sure that our players are on the same page as we are.

14)   Finally, thou shalt know that we respect and admire your leadership and motivational skills. That is one of the reasons we hired you. However, if you are not on board with our decision-making processes and willing to employ them at all times, please find yourself another team to manage.