Last night, in Game 4 of the 2017 World Series, the Astros’ manager, A.J. Hinch, sort of a sabermetric wunderkind, at least as far as managers go (the Astros are one of the more analytically oriented teams, if not the most), brought in his closer, Ken Giles, to pitch the 9th inning of a tie game. This is standard operating procedure for a sabermetrically inclined team: bring in your best pitcher in a tie game in the 9th inning or later, especially if you’re the home team, since at home you’ll never get the chance to protect a lead in those innings. The reasoning is simple: you want to guarantee that your best pitcher gets used in the 9th inning or later, in a high-leverage situation (in the 9th or later inning of a tie game, the leverage index is always at least 1.73 to start the inning).
So what’s the problem? Hinch did exactly what he was supposed to do. It was more or less the optimal move, although that depends a bit on the quality of the closer against the batters he’s going to face, as opposed to the alternative (as well as on other bullpen considerations). In this case, it was Giles versus, say, Devenski. Let’s look at their (my) normalized runs-allowed-per-9-innings projections, where 4.00 is league average:
Devenski: 3.37
That’s a very good reliever. That’s closer quality although not elite closer quality.
Giles: 2.71
That is an elite closer. In fact, I have Giles as the 6th best closer in baseball. The gap between the two pitchers is substantial: .66 runs per 9 innings. For one inning with a leverage index (LI) of 2.0, that translates to about a 1.5% win expectancy (WE) advantage for Giles over Devenski. As one-decision “swings” go (the difference between the optimal and a sub-optimal move), that’s huge. Of course, if you stay with Devenski for another inning or two and then use Giles later anyway, assuming the game goes that long, you get some of that WE back. Not all of it (because Giles may never get to pitch), but some of it. Anyway, that’s not really the issue I want to discuss.
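For the curious, here’s the back-of-the-envelope arithmetic behind that 1.5% figure. It’s a sketch: the 10 runs-per-win conversion is the standard rule of thumb, not a number pulled from actual WE tables.

```python
# Convert a RA9 gap into a rough win expectancy (WE) swing for one inning.
# The 10 runs-per-win figure is a rule of thumb, not from real WE tables.

ra9_devenski = 3.37
ra9_giles = 2.71

run_gap_per_inning = (ra9_devenski - ra9_giles) / 9  # ~0.073 runs
runs_per_win = 10.0      # rule-of-thumb runs-to-wins conversion
leverage_index = 2.0     # a typical 9th-inning, tie-game LI

we_gain = run_gap_per_inning / runs_per_win * leverage_index
print(f"WE advantage for one inning of Giles: {we_gain:.1%}")  # ~1.5%
```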
Why, then, were many of the so-called sabermetric writers (they often know just enough about sabermetrics, or about mathematical/logical thinking in general, to be “dangerous,” although that’s a bit unfair on my part – let’s just say they know enough to be “right” much of the time, but “wrong” some of the time) aghast at, or at least critical of, this seemingly correct move?
First, it was due to the result, of course, which is ironic given that these are sabermetric writers. The first thing they teach you in Sabermetrics 101 is not to be results-oriented. For the most part, the result of a decision has virtually no correlation with the “correctness” of the decision itself. Sure, some of them will claim that they thought, or even said publicly beforehand, that it was the wrong move, and some of them are not lying – but it doesn’t really matter. That’s only one reason why lots of people were complaining about this move – maybe even the secondary reason (or not the reason at all), especially for the saber-writers.
The primary reason (again, at least the stated one – I’m 100% certain that the result strongly influenced nearly all of the detractors) was that these naysayers had little or no confidence in Giles going into this game. He must have had a bad season then, right, despite my stellar projection? After all, good projection systems use 3, 4, or more years of data along with a healthy dose of regression, especially for relievers, who never accumulate a large sample of innings pitched or batters faced. Occasionally you can have a great projection for a player who had a mediocre or poor season, and that projection will be just as reliable as any other (because the projection model accurately includes the current season, but doesn’t give it as much weight as nearly all fans and media do). So what were Giles’ 2017 numbers?
Only a 2.30 ERA and 2.39 FIP in a league where the average ERA was 4.37! His career ERA and FIP are 2.43 and 2.25, and he throws 98 mph. He’s a great pitcher. One of the best. There’s little doubt that’s true. But….
He’s pitched terribly so far in the post-season. That is, his results have been poor. In 7.2 IP his ERA is 11.74. Of course, he’s also struck out 10 and has allowed a .409 BABIP. But he “looked terrible,” these naysayers keep saying. Well, no shit. When you give up 10 runs in 7.2 innings on the biggest stage in sports, you’re pretty much going to “look bad.” Is there any indication, other than the poor results themselves, that there’s “something wrong with Giles”? Given that his velocity is fine (97.9 mph so far) and that Hinch saw fit to remove Devenski, who was “pitching well,” and insert Giles in a critical situation, I think we can say with some certainty that there is no indication that anything is wrong with him. In fact, the data (his 12 K/9 rate, his normal velocity, and an “unlucky” .409 BABIP) all suggest that there is nothing “wrong with him.” But honestly, I’m not here to discuss that kind of thing. I think it’s a futile and silly discussion. I’ve written many times about how the notion that you can just tell (or that a manager can tell – which is not the case here, since Hinch was the one who decided to use him!) when a player is hot or cold by observing him is one of the sillier myths in sports, at least in baseball, and I have reams of data-driven evidence to support that assertion.
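One aside before we get to the data, since “multiple years plus regression” can sound like hand-waving: here is a minimal, Marcel-style sketch of what a projection like the ones above is doing. The 5/4/3 year weights, the 400 IP of regression ballast, and the sample seasons are all illustrative assumptions, not the parameters of my actual model.

```python
# Minimal Marcel-style projection sketch: weight recent seasons, then
# regress toward league average. All numbers here (weights, ballast,
# sample seasons) are illustrative, not any real system's parameters.

league_ra9 = 4.00
seasons = [(60, 2.30), (65, 2.60), (70, 2.90)]  # (IP, RA9), most recent first
weights = [5, 4, 3]                             # heavier weight on recent years

weighted_ip = sum(w * ip for w, (ip, _) in zip(weights, seasons))
weighted_runs = sum(w * ip * ra9 / 9 for w, (ip, ra9) in zip(weights, seasons))
raw_ra9 = 9 * weighted_runs / weighted_ip

# Regression: blend in "ballast" innings of league-average pitching. The
# smaller the real sample, the harder the projection gets pulled toward
# the mean, which is why one great (or awful) reliever season moves a
# projection far less than fans and media assume.
ballast_ip = 400
proj_ra9 = (raw_ra9 * weighted_ip + league_ra9 * ballast_ip) / (weighted_ip + ballast_ip)
print(f"raw weighted RA9 {raw_ra9:.2f} -> regressed projection {proj_ra9:.2f}")
```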
What I’m interested in discussing right now is this: what do the data say? How do we expect a reliever to pitch after 6 or 7 innings, or appearances, in which he’s gotten shelled? It doesn’t have to be exactly 7 IP, of course; for research like this, it doesn’t matter. Whatever you find in 7 IP you’re going to find in 5 IP or in 12 IP, assuming you have large enough samples and you don’t get really unlucky with a Type I or Type II error. The same goes for how you define “getting shelled.” You’re going to get the same answer whether you measure getting shelled (or pitching brilliantly) by wOBA against, runs allowed, hard-hit balls, FIP, etc. It also doesn’t matter much what thresholds you set – you’ll likely get the same answer.
Here’s what I did to answer this question – or at least to shed some light on it. I looked at all relievers over the last 10 years and split them into three groups, depending on how they pitched over all 6-game sequences. Group I pitched brilliantly over a 6-game span; the criterion I set was a wOBA against of less than .175. Group III got hammered over a 6-game stretch, at least as far as wOBA is concerned (in large samples, a given wOBA corresponds to an equivalent runs-allowed rate); they allowed a wOBA of at least .450. Group II was everyone else. Here’s what the groups looked like:
| Group | Average wOBA against | Equivalent RA9 |
|-------|----------------------|----------------|
| I     | .130                 | Around 0       |
| II    | .308                 | Around 3       |
| III   | .496                 | Around 10      |
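If you want to replicate something like this, the windowing step looks roughly like the sketch below. The DataFrame layout and column names are hypothetical; real play-by-play data (Retrosheet or the like) would need plenty of preparation before it looked like this.

```python
# Sketch of the grouping step: slide a 6-game window over each reliever's
# appearance log, bucket the window by wOBA against, and record the very
# next appearance. Columns and layout are hypothetical.
import pandas as pd

def classify_windows(apps: pd.DataFrame, span: int = 6) -> pd.DataFrame:
    """apps: one row per relief appearance, with columns
    ['pitcher_id', 'date', 'woba_against', 'pa']."""
    apps = apps.sort_values(['pitcher_id', 'date'])
    rows = []
    for pid, g in apps.groupby('pitcher_id'):
        g = g.reset_index(drop=True)
        for i in range(len(g) - span):
            window = g.iloc[i:i + span]
            # PA-weighted wOBA against over the 6-game window.
            woba = (window['woba_against'] * window['pa']).sum() / window['pa'].sum()
            if woba < 0.175:
                group = 'I'    # pitched brilliantly
            elif woba >= 0.450:
                group = 'III'  # got hammered
            else:
                group = 'II'
            nxt = g.iloc[i + span]  # the appearance we actually test
            rows.append({'pitcher_id': pid, 'group': group,
                         'window_woba': woba,
                         'next_woba': nxt['woba_against'],
                         'next_pa': nxt['pa']})
    return pd.DataFrame(rows)
```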
Then I looked at their very next appearance. Again, I could have looked at their next 2 or 3 appearances but it wouldn’t make any difference (other than increasing the sample size – at the risk of the “hot” or “cold” state wearing off).
| Group | Average wOBA against | wOBA next appearance |
|-------|----------------------|----------------------|
| I     | .130                 | .307                 |
| II    | .308                 | .312                 |
| III   | .496                 | .317                 |
While we certainly don’t see a large carryover effect, we do appear to see some effect. The relievers who had been throwing brilliantly continue to pitch 10 points (of wOBA) better than the ones who had been getting hammered. 10 points of wOBA is equivalent to about .3 runs per 9 innings, so that would move a pitcher like Giles closer to Devenski, but still not all the way. But wait! Are these groups of pitchers of the same quality? No. The ones who were pitching brilliantly come from a much better pool of pitchers than the ones who were getting hammered. Much better. This should not be surprising; I assumed as much when designing the research. How much better? Let’s look at their seasonal numbers (these will be a little biased, because we already established that these groups pitched brilliantly or terribly for some stretch of that same season):
| Group | Average wOBA against | wOBA next appearance | Season wOBA |
|-------|----------------------|----------------------|-------------|
| I     | .130                 | .307                 | .295        |
| II    | .308                 | .312                 | .313        |
| III   | .496                 | .317                 | .330        |
As you can see, our brilliant pitchers are much better than our terrible ones. Even if we back out the bias by using the previous season’s wOBA instead, we still get .305 for the brilliant relievers and .315 for the hammered ones. In fact, we’ll use those numbers instead:
| Group | Average wOBA against | wOBA next appearance | Prior season wOBA |
|-------|----------------------|----------------------|-------------------|
| I     | .130                 | .307                 | .305              |
| II    | .308                 | .312                 | .314              |
| III   | .496                 | .317                 | .315              |
Now that is brilliant. We do have some sampling error: the number of PA in the “next appearance” bucket for Groups I and III is around 40,000 each (SD of wOBA ≈ 2 points). But compare the “expected” wOBA against (essentially pitcher talent, the analogue of Giles’ and Devenski’s projections) to the actual next-appearance numbers. They are almost identical. Regardless of how a reliever has pitched in his last 6 appearances, he pitches almost exactly as his normal projection would suggest in that 7th appearance. The last 6 IP have virtually no predictive value, even at the extremes. And I don’t want to hear, “Well, he’s really (really, really) been getting hammered – what about that?” Allowing a .496 wOBA is getting really, really, really hammered, and .130 is throwing near no-hit baseball, so we’ve already looked at the extremes!
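For the record, that “2 points” figure is just standard-error arithmetic. The ~0.4 per-PA spread of single-PA wOBA outcomes is a ballpark value I’m assuming here, not something measured from this dataset.

```python
# Standard error of a group's observed wOBA over ~40,000 PA. The per-PA
# spread (~0.4) is an assumed ballpark figure, not measured from the data.
import math

per_pa_sd = 0.4   # rough spread of single-PA wOBA outcomes
n_pa = 40_000     # PA in the "next appearance" bucket for Groups I and III

se = per_pa_sd / math.sqrt(n_pa)
print(f"SE of group wOBA: {se:.3f}")  # ~0.002, i.e. about 2 points
```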
So, as you can clearly see, and exactly as you should have expected if you really know your sabermetrics (unlike some of these so-called saber-oriented writers and pundits who like to cherry-pick the sabermetric principles that suit their narratives and biases), 7 IP of pitching, compared to a career sample of 150 IP or more, is almost worthless information. The data don’t lie.
But you just know that something is wrong with Giles, right? You can just tell. You are absolutely certain that he’ll continue to pitch badly. You just knew that he was going to implode again last night (never mind that feelings like that have been wrong the vast majority of the time). It’s all bullshit, folks. But if it makes you feel smart or happy, it’s fine by me. I have nothing invested in any of this. I’m just trying to find the truth. It’s the nature of my personality. That makes me happy.
I basically have to repeat stuff like “Meaningless Sample Size” and “I Don’t Know This Pitcher’s Psychology” to myself like mantras to avoid being wrong about stuff like this all the time.
As for calling things in advance: Earlier in the game, when Springer came up for the third time, I told myself that Roberts was going to leave Wood in because he hadn’t allowed a hit; that I was inclined to leave him in because I (over)value getting innings out of each given pitcher so as not to “burn” more of the pen … and that MGL would pull Wood immediately, and MGL was surely right.
Then when Springer hit the home run, I laughed and observed that MGL would still point out that the home run DOES NOT validate his strategy; the home run was just one data point, insignificant by itself, to be added to the enormous data set that shows Springer was most likely to make an out in that spot no matter who was pitching, but was slightly (but importantly!) more likely to beat Wood than the first guy out of the pen.
Right. Never focus on the result. Ever. Ever. It will always lead you astray. Always. (OK, not always, but you know what I mean.)
It’s not that taking Wood out after 2 times through the order is “right” or “wrong.” It’s that we know starters typically degrade by around .25 runs per 9 innings (essentially in ERA) each time through the order, so we include that information in deciding whom to pitch and when.
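To make concrete how that penalty feeds into the decision, here’s a toy comparison. The starter and reliever projections are made-up numbers; the .25 RA9 penalty per pass through the order is the figure from the research.

```python
# Toy times-through-the-order (TTO) comparison. Projections are made-up;
# the 0.25 RA9-per-pass penalty is the figure from TTO research.

starter_ra9 = 4.00    # starter's overall projection (hypothetical)
tto_penalty = 0.25    # RA9 added each time through the order
reliever_ra9 = 3.90   # first man out of the pen (hypothetical)

# Center the starter's overall projection on his 2nd time through.
for tto in (1, 2, 3):
    expected = starter_ra9 + (tto - 2) * tto_penalty
    pick = "starter" if expected < reliever_ra9 else "reliever"
    print(f"TTO {tto}: starter ~{expected:.2f} RA9 -> edge to the {pick}")
```

The specific numbers don’t matter; the point is that a merely decent reliever can beat a good starter the third time through the order, and that comparison, not the starter’s last few innings, is what should drive the decision.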
If people take all this TTO research and bastardize it into, “Always take out your starter the 3rd TTO,” I’ll be deeply disappointed. It is simply about making good decisions using the best available information and especially NOT making decisions based upon faulty information, such as, “The starter has been pitching exceptionally well therefore I expect him to pitch exceptionally well.” That will often lead to a poor decision if your goal is to maximize your chances of winning a game or series.
As we all know, there are many pieces of information to process when pondering whether to take out your starter: the effect on bullpen fatigue, the quality of the replacement, how the move affects the “chain of relievers” for the remainder of the game, and its effect on the long- and short-term morale of the starter and the team.
All I ask is that the manager address those issues in the context of accurate information about the starter’s likely performance.
What if you run a similar analysis but just focus on something like walk rate? If a relief pitcher has been wild in his last 6 IP, is he more likely to be wilder than normal the next time out?
Intuitively it seems more likely that there *could* be a correlation.
It’s possible, but I doubt it. If there were any significant carryover effect for BB rate, it would show up in wOBA. As I said in the article, with this kind of research, no matter how you slice the data (walk rate, K rate, wOBA, RA, hard-hit balls, etc.), the result is likely going to be the same. 6 or 7 IP is just too TINY a sample to have much, if any, predictive value by itself. It’s too easy for a pitcher who is perfectly fine to get hammered over 6 or 7 IP. That’s ONE GAME for a starter. I’m sure you realize how easy it is for any pitcher to get hammered by chance alone over that many innings. Sure, one or two out of 100 may genuinely have something wrong that will carry over, but how are we supposed to identify that 1 or 2%? The research tells us that we can’t find it by looking at the numbers in those 6 or 7 IP. We’ll be wrong 98 or 99% of the time, by definition.
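To put a rough number on “easy,” here’s a quick simulation of a perfectly fine pitcher. The per-PA outcome probabilities below are illustrative, roughly league-average values, not fitted to any real pitcher.

```python
# How often does a perfectly fine (~.325 wOBA against) pitcher allow a
# .450+ wOBA over ~28 PA (roughly 6-7 IP) by chance alone? The outcome
# probabilities are illustrative, not fitted to real data.
import random

# (wOBA weight, probability) for each plate-appearance outcome
outcomes = [(0.00, 0.680),   # out
            (0.69, 0.090),   # walk / HBP
            (0.89, 0.150),   # single
            (1.27, 0.045),   # double
            (1.62, 0.004),   # triple
            (2.10, 0.031)]   # home run
values = [v for v, _ in outcomes]
probs = [p for _, p in outcomes]

random.seed(1)
trials, pa, hammered = 100_000, 28, 0
for _ in range(trials):
    woba = sum(random.choices(values, weights=probs, k=pa)) / pa
    hammered += woba >= 0.450
print(f"P(.450+ wOBA over {pa} PA): {hammered / trials:.1%}")
```

Under these assumptions it comes out on the order of one stretch in ten, which is exactly the point: nothing has to be “wrong” with the pitcher.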
This is good stuff. I think a lot of analytically minded commentators revert to being emotional baseball fans in the postseason – which is fine, but you see a lot of people trying to justify gut fan reactions with pseudo-analysis (notoriously: https://fivethirtyeight.com/features/send-alex-gordon/).
P.S. The “no spin zone” joke is in rather poor taste at this point.
Great post, and I get all that, but I also totally believe that there are players who get so nervous in the spotlight that they’re nothing like their normal selves. I just have little idea how to show or compute that statistically.
So, the way I approached it is this: I looked at Giles’ 2017 postseason performance, and he gave up runs in 6 of his 7 appearances! From 2014-2017, however, he gave up runs in only 50 of his 245 appearances, or 20%. What are the odds of someone who gives up runs in only 20% of his appearances suddenly giving up runs in 6 of 7? Seems like low odds, but I’m not sure whether a statistical method can work that out, and perhaps the noise from a sample of just 7 could produce wide gyrations like this. Or is that the wrong way to look at it?
I get that it’s an extremely small sample, only 40 batters faced, way below the thresholds at which peripherals stabilize. But is there any way to use the data from his regular-season appearances to compute the odds of him randomly giving up runs in 6 of 7 appearances? Would it be as simple as taking his 245 appearances, feeding them into a random-selection process, and simulating 1,000,000 different 7-game samples? Are there better ways? Or, again, is this the wrong way to approach it?
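For what it’s worth, under the crude assumption that appearances are independent with a fixed 20% chance of allowing a run, there’s a closed-form answer, and the million-trial resampling described above would converge to essentially the same number. A sketch:

```python
# Closed-form version of the resampling idea described above: treat each
# appearance as an independent 20% chance of allowing a run and ask for
# the probability of runs in at least 6 of 7 appearances. This ignores
# opponent quality, fatigue, selection effects, etc.
from math import comb

p = 50 / 245  # career rate of run-allowing appearances (~20%)
prob = sum(comb(7, k) * p**k * (1 - p)**(7 - k) for k in (6, 7))
print(f"P(runs in 6+ of 7 appearances): {prob:.3%}")  # ~0.04%
```

That looks damning, but remember the selection problem: hundreds of relievers generate thousands of overlapping 7-game windows every season, so somebody is always living through a roughly 1-in-2,400 cold stretch. That’s precisely why the article looks at what those pitchers do next rather than at how improbable the stretch was.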