
Let’s face it. Most of you just can’t process the notion that a pitcher who’s had 10 or 15 starts at mid-season can have an ERA of 5+ and still be expected to pitch well for the remainder of the season. Maybe, if they’re a Kershaw or Verlander or a known ace, but not some run-of-the-mill hurler. Similarly, if a previously unheralded and perhaps terrible starter were to be sporting a 2.50 ERA in July after 12 solid starts, the notion that he’s still a bad pitcher, although not quite as bad as we previously estimated, is antithetical to one of the strongest biases that human beings have when it comes to sports, gambling, and in fact, many other aspects of life in general – recency bias. According to the online Skeptic’s Dictionary, recency bias is “the tendency to think that trends and patterns we observe in the recent past will continue in the future.”

I looked at all starting pitchers in the last 3 years who either:

  1. In the first week of July, had a RA9 (runs allowed per 9 innings) adjusted for park, weather, and opponent, that was at least 1 run higher than their mid-season (as of June 30) projection. In addition, these pitchers had to have a projected context-neutral RA9 of less than 4.00 (good pitchers).
  2. In the first week of July, had an adjusted RA9 at least 1 run lower than their mid-season projection. They also had to have a projection greater than 4.50 (bad pitchers).

Basically, Group I pitchers above were projected to be good pitchers but had very poor results for around 3 months. Group II pitchers were projected to be bad pitchers despite having very good results in the first half of the season.
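The two selection rules can be written down compactly. What follows is only a minimal sketch of the criteria as stated above, not the code actually used for the study; the function name and the idea of passing in two pre-computed numbers are assumptions made for illustration.

```python
from typing import Optional


def classify_pitcher(adjusted_ra9: float, projected_ra9: float) -> Optional[str]:
    """Return "I" (cold first half), "II" (hot first half), or None.

    adjusted_ra9  -- season-to-date RA9, adjusted for park, weather, opponent
    projected_ra9 -- mid-season (June 30) context-neutral projection
    Both inputs are assumed to be computed elsewhere.
    """
    gap = adjusted_ra9 - projected_ra9
    # Group I: projected to be good (< 4.00) but at least 1 run worse so far.
    if projected_ra9 < 4.00 and gap >= 1.0:
        return "I"
    # Group II: projected to be bad (> 4.50) but at least 1 run better so far.
    if projected_ra9 > 4.50 and gap <= -1.0:
        return "II"
    return None


print(classify_pitcher(5.45, 3.76))  # I
print(classify_pitcher(3.33, 4.95))  # II
print(classify_pitcher(4.20, 4.10))  # None
```

Pitchers who fall in neither group (the vast majority) are simply left out of the comparison.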

A projection is equivalent to estimating a player’s most likely performance for the next game or for the remainder of the season (not accounting for aging). So in order to test a projection, we usually look at that player’s or a group of players’ performance in the future. In order to mimic the real-time question, “How do we expect this pitcher to pitch today?”, I looked at each pitcher’s performance over his next 3 games, in RA9.

Here are the aggregate results:

The average RA9 from 2015-2017 was around 4.39.

Group I pitchers (cold first half), N=36 starts after the first week in July

Season-to-date RA9    Projected RA9    Next 3 starts RA9
5.45                  3.76             3.71

Group II pitchers (hot first half), N=84 starts after the first week in July

Season-to-date RA9    Projected RA9    Next 3 starts RA9
3.33                  4.95             4.81


As you can see, the season-to-date context neutral (adjusted for park, weather and opponent) RA9 tells us almost nothing about how these pitchers are expected to pitch, independent of our projection. Keep in mind that the projection has the current season performance baked into the model, so it’s not that the projection is ignoring the “anomalous” performance, and somehow magically the pitcher reverts to somewhere around his prior performance.

Actually, two things are happening here to create these dissonant (within the context of recency bias) results: One, these projections are using 3 or 4 years of prior performance (including the minor leagues), if available, such that another 3 months, even the most recent 3 months (which gets more weight in our projection model), often doesn’t have much effect on the projection (depending on how much prior data there is). As well, even if there isn’t that much prior data the very bad or good 3-month performance is going to get regressed towards league average anyway.

Two, how much integrity is there in a very bad RA9 for a pitcher who was and is considered a very good pitcher, and vice versa? By that, I mean does it really reflect how well the pitcher has pitched in terms of the components allowed or was he just lucky or unlucky in terms of the timing of those events? We can attempt to answer that question by looking at our same pitchers above and seeing how their season-to-date RA9 compares to a component RA9, which is an RA9-looking number constructed from a pitcher’s component stats (using a BaseRuns formula). Let’s add that to the charts above.

Group I

Season-to-date RA9    To-date component RA9    Projected RA9    Next 3 starts RA9
5.45                  4.40                     3.76             3.71

Group II

Season-to-date RA9    To-date component RA9    Projected RA9    Next 3 starts RA9
3.33                  4.25                     4.84             4.81


These pitchers’ component results were not nearly as bad or good as their RA9 suggests.
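For readers who have not seen a component RA9 before, here is a rough Python sketch of the idea. The post does not say which BaseRuns version was used, so the coefficients below are an assumption: they come from one widely circulated basic version of the formula, and the stat line in the example is invented. Treat it as an illustration of the concept, not a reproduction of the numbers above.

```python
def base_runs(ab: float, h: float, tb: float, hr: float, bb: float) -> float:
    """Estimate runs allowed from component stats (basic BaseRuns form).

    The coefficients are ASSUMED (a commonly cited simple version);
    the version behind the tables in this post is not specified.
    """
    a = h + bb - hr                                        # baserunners
    b = (1.4 * tb - 0.6 * h - 3.0 * hr + 0.1 * bb) * 1.02  # advancement
    c = ab - h                                             # outs
    d = hr                                                 # automatic runs
    return a * b / (b + c) + d


def component_ra9(ab, h, tb, hr, bb, innings):
    """Scale estimated runs to a per-9-innings rate."""
    return 9.0 * base_runs(ab, h, tb, hr, bb) / innings


# Hypothetical first-half line, for illustration only.
print(round(component_ra9(ab=380, h=100, tb=160, hr=12, bb=30, innings=100), 2))
```

Because the estimate depends only on how many of each event were allowed, not on the order in which they happened, it strips out most of the sequencing luck that the actual RA9 contains.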

So, if a pitcher is still projected to be a good pitcher, even after a terrible first half (or vice versa), RA9-wise (and presumably ERA-wise), two things are going on to justify that projection: One, the first half may be a relatively small sample compared to 3 or 4 years prior performance – remember, everything counts (albeit recent performance is given more weight)! Two, and more importantly, that RA9 is mostly timing-driven luck. The to-date components suggest that both the hot and cold pitchers have not pitched nearly as badly or as well as their RA9 suggests. The to-date component RA9’s are around league-average for both groups.

The takeaway here is that your recency bias will cause you to reject these projections in favor of to-date performance as reflected in RA9 or ERA, when in fact the projections are still the best predictor of future performance.


Note: There is the beginning of a very good discussion about this topic on The Book blog. If this topic interests you, feel free to check it out and participate if you want to.

I’ve been thinking about this for many years and in fact I have been threatening to redo my UZR methodology, in order to try and reduce one of the biggest weaknesses inherent in most if not all of the batted ball advanced defensive metrics.

Here is how most of these metrics work: Let’s say a hard hit ball was hit down the third base line and the third baseman made the play and threw the runner out. He would be credited with an out minus the percentage of time that an average fielder would make the same or similar play, perhaps 40% of the time. So the third baseman would get credit for 60% of a “play” on that ball, which is roughly .9 runs (the difference between the average value of a hit down the 3rd base line and an out) times .6 or .54 runs. Similarly, if he does not make the play, he gets debited with .4 plays or minus .36 runs.
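The crediting scheme just described is easy to put in code. This is a minimal sketch using the made-up 40% catch rate and the roughly .9 run value from the example; the function names are mine and are not part of any actual UZR engine.

```python
RUN_VALUE_PER_PLAY = 0.9  # approx. difference between a hit and an out here


def play_credit(made: bool, avg_catch_rate: float) -> float:
    """Plays above average for a single batted ball in a bucket."""
    # A made play earns (1 - rate); a missed play costs the rate itself.
    return (1.0 - avg_catch_rate) if made else -avg_catch_rate


def run_credit(made: bool, avg_catch_rate: float,
               run_value: float = RUN_VALUE_PER_PLAY) -> float:
    """Convert the play credit to runs."""
    return play_credit(made, avg_catch_rate) * run_value


print(round(run_credit(True, 0.40), 2))   # 0.54
print(round(run_credit(False, 0.40), 2))  # -0.36
```

Note that an average fielder, who makes the play 40% of the time, nets out to zero: .4 * .6 - .6 * .4 = 0.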

There are all kinds of adjustments which can be made, such as park effects, handedness of the batter, speed of the runner, outs and base runners (these affect the positioning of the fielders and therefore the average catch rate), and even the G/F ratio of the pitcher (e.g., a ground ball pitcher’s “hard” hit balls will be a little softer than a fly ball pitcher’s “hard” hit balls).

Anyway, here is the problem with this methodology which, as I said, is basic to most if not all of these defensive metrics, and it has to do with our old friend Bayes. As is usually the case, this problem is greater in smaller sample sizes. We don’t really, really know the probability of an average fielder making any given play; we can only roughly infer it from the characteristics of the batted ball that we have access to and perhaps from the context that I described above (like the outs, runners, batter hand, park, etc.).

In the above example, a hard hit ground ball down the third base line, I said that the league average catch rate was 40%. Where did I get that number from? (Actually, I made it up, but let’s assume that that is a correct number in MLB over the last few years, given the batted ball location database that we are working with.) We looked at all hard hit balls hit to that approximate location (right down the third base line), according to the people who provide us with the database, and found out that of those 600 some odd balls over the last 4 years, 40% of them were turned into outs by the third baseman on the field.

So what is wrong with giving a third baseman .6 credit when he makes the play and .4 debit when he doesn’t? Well, surely not every single play, if you were to “observe” and “crunch” the play like, say, Statcast would do, is caught exactly 40% of the time. For any given play in that bucket, whether the fielder caught the ball or not, we know that he didn’t really have exactly a 40% chance of catching it if he were an average fielder. You knew that already. That 40% is the aggregate for all of the balls that fit into that “bucket” (“hard hit ground ball right down the third base line”).

Sometimes it’s 30%. Other times it’s 50%. Still other times it is near 0 (like if the 3rd baseman happens to be playing way off the line, and correctly so) or near 100% (like when he is guarding the line and he gets a nice big hop right in front of him), and everything in between.

On the average it is 40%, so you say, well, what are we to do? We can’t possibly tell from the data how much it really varies from that 40% on any particular play, which is true. So the best we can do is assume 40%, which is also true. That’s just part of the uncertainty of the metric. On the average, it’s right, but with error bars. Right? Wrong!

We do have information which helps us to nail down the true catch percentage of the average fielder given that exact same batted ball, at least how it is recorded by the people who provide us with the data. I’m not talking about the above-mentioned adjustments like the speed of the batter, his handedness, or that kind of thing. Sure, that helps us and we can use it or not. Let’s assume that we are using all of these “contextual adjustments” to the best of our ability. There is still something else that can help us to tweak those “league average caught” percentages such that we don’t have to use 40% on every hard hit ground ball down the line. Unfortunately, most metrics, including my own UZR, don’t take advantage of this valuable information even though it is staring us right in the face. Can you guess what it is?

The information that is so valuable is whether the player caught the ball or not! You may be thinking that that is circular logic or perhaps illogical. We are using that information to credit or debit the fielder. How and why would we also use it to change the base line catch percentage – in our example, 40%? In comes Bayes.

Basically what is happening is this: Hard ground ball is hit down the third base line. Overall 40% of those plays are made, but we know that not every play has a 40% chance of being caught because we don’t know where the fielder was positioned and we don’t really know the exact characteristics of the ball which greatly affect its chances of being caught: it was hit hard, but how hard? What kind of a bounce did it take? Did it have spin? Was it exactly down the line or 2 feet from the line (they were all classified as being in the same “location”)? We know the runner is fast (let’s say we created a separate bucket for those batted balls with a fast runner at the plate), but exactly how fast was he? Maybe he was a blazer and he beat it out by an eyelash.

So what does that have to do with whether the fielder caught the ball or not? That should be obvious by now. If the third baseman did not catch the ball, on the average, it should be clear that the ball tended to be one of those balls that were harder to catch than the average ball in that bucket. In other words, the chance that any ball that is not caught should or would have been caught by an average fielder is clearly less than 40%. Similarly, if a ball was caught, by any fielder, it was more likely to be an easier play than the average ball in that bucket. What we want are conditional probabilities, based on whether the ball was caught or not.
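One way to make that precise, as a sketch under simplifying assumptions: suppose every ball in the bucket has its own true average-fielder catch probability p, and across the bucket p has mean \mu (the 40%) and variance \sigma^2. These symbols are mine, and I am treating the fielders in the sample as average on the whole. A ball with catch probability p shows up among the caught balls in proportion to p, and among the non-caught balls in proportion to 1 - p, so Bayes gives:

```latex
\begin{align*}
E[p \mid \text{caught}]     &= \frac{E[p^2]}{E[p]}      = \mu + \frac{\sigma^2}{\mu} \\
E[p \mid \text{not caught}] &= \frac{E[p(1-p)]}{E[1-p]} = \mu - \frac{\sigma^2}{1-\mu}
\end{align*}
```

If every ball in the bucket really were a 40% play (\sigma^2 = 0), both numbers would collapse to 40%. The more the true difficulty varies within the bucket, the further apart they get.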

How much easier are the caught balls than the not-caught ones in any given bucket? That’s hard to say. Really hard to say. One would have to have lots of information in order to apply Bayes’ theorem to better estimate the “catch rate” of a ball in a particular bucket based on whether it is caught or not caught. I can tell you that I think the differences are pretty significant. It mostly depends on the spread (and what the actual distribution looks like) of actual catch rates in any given bucket. That depends on a lot of things. For one thing, the “size” and accuracy of the locations and other characteristics which make up the buckets. For example, if the unique locations were pretty large, say, one “location bucket” is anywhere from down the third base line to 20 feet off the bag (about 1/7 of the total distance from line to line), then the spread of actual catch rates versus the average catch rate in that bucket is going to be huge. Therefore the difference between the true catch rates for caught balls and non-caught balls is going to be large as well.

Speed of the batted ball is important as well. On very hard hit balls, the distribution of actual catch rates within a certain location will tend to be polarized or “bi-modal.” Either the ball will tend to be hit near the fielder and he makes the play or a little bit away from the fielder and he doesn’t. In other words, a catch might have a 75% true catch rate and non-catch, 15%, on the average, even if the overall rate is 40%.

Again, most metrics use the same base line catch rate for catches and non-catches because that seems like the correct and intuitive thing to do. It is incorrect! The problem, of course, is what number to assign to a catch and to a non-catch in any given bucket. How do we figure that out? Well, I haven’t gotten to that point yet, and I don’t think anyone else has either (I could be wrong). I do know, however, that it is guaranteed that if I use 39% for a non-catch and 41% for a catch, in that 40% bucket, I am going to be more accurate in my results, so why not do that? Probably 42/38 is better still. I just don’t know when to stop. I don’t want to go too far so that I end up cutting my own throat.

This is similar to the problem with park factors and MLE’s (among other “adjustments”). We don’t know that using 1.30 for Coors Field is correct but we surely know that using 1.05 is better than 1.00. We don’t know that taking 85% of a player’s AAA stats to convert them to a major league equivalency is correct, but we definitely know that 95% is better than nothing.

Anyway, here is what I did today (other than torture myself by watching the Ringling Brothers and…I mean the Republican debates). I took a look at all ground balls that were hit in vector “C” according to BIS and were either caught or went through the infield in less than 1.5 seconds, basically hard hit balls down the third base line. If you watch these plays, even though I would put them in the same bucket in the UZR engine, it is clear that some are easy to field and others are nearly impossible. You would be surprised at how much variability there is. On paper they “look” almost exactly the same. In reality they can vary from day to night and everything in between. Again, we don’t really care about the variance per se, but we definitely care about the mean catch rates when they are caught and when they are not.

Keep in mind that we can never empirically figure out those mean catch rates like we do when we aggregate all of the plays in the bucket (and then simply use the average catch rate of all of those balls). You can’t figure out the “catch rate” of a group of balls that were caught. It would be 100% right? We are interested in the catch rate of an average fielder when these balls were caught by these particular fielders, for whatever reasons they caught them. Likewise we want to know the league average catch rates of a group of balls that were not caught by these particular fielders for whatever reasons.

We can make these estimates (the catch rates of caught balls and non-caught balls in this bucket) in one of two ways: the first way is probably better and much less prone to human bias. It is also way more difficult to do in practice. We can try and observe all of the balls in this bucket and then try and re-classify them into many buckets according to the exact batted ball characteristics and fielder positioning. In other words, one bucket might be hard hit ground huggers right down the line with the third baseman playing roughly 8 feet off the line. Another might be, well, you get the point. Then we can actually use the catch rates in those sub-buckets.

When we are done, we can figure out the average catch rate on balls that were caught and those that were not, in the entire bucket. If that is hard to conceptualize, try constructing an example yourself and you will see how it works.
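For anyone who would rather see such an example than construct one, here is a minimal Python sketch. The three sub-buckets, their ball counts, and their catch rates are invented purely for illustration; nothing here comes from actual data, and the function name is mine.

```python
def conditional_catch_rates(sub_buckets):
    """Average-fielder catch rates overall, on caught balls, on missed balls.

    sub_buckets: list of (number_of_balls, avg_catch_rate) pairs, where
    avg_catch_rate is the league-average rate inside that sub-bucket.
    """
    total = sum(n for n, _ in sub_buckets)
    # Expected number of catches and misses across the whole bucket.
    caught = sum(n * p for n, p in sub_buckets)
    missed = total - caught
    overall = caught / total
    # Each sub-bucket contributes n*p caught balls, each "worth" a rate of p,
    # and n*(1-p) missed balls, each also "worth" a rate of p.
    rate_when_caught = sum(n * p * p for n, p in sub_buckets) / caught
    rate_when_missed = sum(n * (1 - p) * p for n, p in sub_buckets) / missed
    return overall, rate_when_caught, rate_when_missed


# Hypothetical split of one big bucket into hard, medium, and easy plays.
sub_buckets = [(50, 0.05), (30, 0.50), (20, 0.90)]
overall, when_caught, when_missed = conditional_catch_rates(sub_buckets)
print(f"overall {overall:.3f}, caught {when_caught:.3f}, missed {when_missed:.3f}")
```

With these made-up inputs the bucket as a whole is caught about 35.5% of the time, but the balls that end up caught were, on average, about 67% plays, while the ones that got through were about 18% plays.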

As I said, that is a lot of work. You have to watch a lot of plays and try and create lots and lots of sub-buckets. And then, even in the sub-buckets you will have the same situation, although much less problematic. For example, in one of those sub-buckets, a caught ball might be catchable 20% of the time in reality and a non-caught one only 15% – not much to worry about. In the large, original bucket, it might be 60% and 25%, as I said before. And that is a problem, especially for small samples.

Keep in mind that this problem will be mitigated in large samples but it will never go away. It will always overrate a good performance and underrate a bad one. But, in small samples, like even in one season, it will overrate so-called good fielding performance and underrate bad ones. The better the numbers the more they overstate the actual performance. The same is true for bad numbers. This is why I have been saying for years to regress what you see from UZR or DRS, even if you want to estimate “what happened.” (You would have to regress even more if you want to estimate true fielding talent.)

This is one of the problems with simply combining offense and defense to generate WAR. The defensive component needs to be regressed while the offensive one does not. (Base running needs to be regressed too; it suffers from the same malady as the defensive metrics.)

Anyway, I looked at 20 or so plays in one particular bucket and tried to use the second method of estimating true catch rates for catches and non-catches. I simply observed the play and tried to estimate how often an average fielder would have made the play whether it was caught or not.

This is not nearly as easy as you might think. For one thing, guessing an average “catch rate” number like 60% or 70%, even if you’ve watched thousands of games in your life like I have, is incredibly difficult. The 0-10% and 90-100% ones are not that hard. Everything else is. I would guess that my uncertainty is something like 25% on a lot of plays, and my uncertainty on that estimate of uncertainty is also high!

The other problem is bias. When a play is made, you will overrate the true average catch rate (how often an average fielder would have made the play) and vice versa for plays that are not made. Or maybe you will underrate them because you are trying to compensate for the tendency to overrate them. Either way, you will be biased by whether the play was made or not, and remember you are trying to figure out the true catch rate on every play you observe with no regard to whether the play was made or not. (In actuality maybe whether it was made or not can help you with that assessment).

Here is a condensed version of the numbers I got. In that one location, presumably from the third base line to around 6 feet off the line, for ground balls that arrive in less than 1.5 seconds (I have 4 such categories of speed/time for GB), the average catch rate overall was 36%. However, for balls that were not caught (and I only looked at 6 random ones), I estimated the average catch rate to be 11% (that varied from 0 to 35%). For balls that were caught (also 6 of them), it was 53% (from 10% to 95%). That is a ridiculously large difference, and look at the variation even within those two groups (caught and not-caught). Even though using 11% for non-catches and 53% for catches is better than using 40% for everything, we are still making lots of mistakes within the new caught and not-caught buckets!

How does that affect a defensive metric? Let’s look at a hypothetical example: Third baseman A makes 10 plays in that bucket and misses 20. Third baseman B makes 15 and misses 15. B clearly had a better performance, but how much better? Let’s assume that the average fielder makes 26% of the plays in the bucket and the misses are 15% and the catches are 56% (actually a smaller spread than I estimated). Using 15% and 56% yields an overall catch rate of around 26%.

UZR and most of the other metrics will do the calculations this way: Player A’s UZR is 10 * .74 – 20 * .26, or plus 2.2 plays which is around plus 2 runs. Player B is 15 * .74 – 15 * .26, or plus 7.2 plays, which equals plus 6.5 runs.

What if we use the better numbers, 15% for missed plays and 56% for made ones? Now for Player A we have: 10 * .44 – 20 * .15, or 1.4 plays, which is 1.3 runs. Player B is 15 * .44 – 15 * .15, or 4.35 plays, which is 3.9 runs. So Player A’s UZR for those 30 plays went from +2 to +1.3 and Player B went from +6.5 to +3.9. Each player regressed around 35-40% toward zero. That’s a lot!
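Here is the same arithmetic as a short Python sketch, so that you can plug in your own catch rates. It uses the hypothetical 26% / 56% / 15% numbers and the roughly .9 runs per play from the example; the function name is mine, not anything from the UZR code.

```python
RUNS_PER_PLAY = 0.9  # approximate run value of a play in this bucket


def plays_above_average(made, missed, rate_if_made, rate_if_missed):
    """Credit (1 - rate) for each made play, debit the rate for each miss."""
    return made * (1.0 - rate_if_made) - missed * rate_if_missed


# Sanity check: the split rates should average back to the bucket rate.
print(round(0.26 * 0.56 + 0.74 * 0.15, 3))  # about .26

for name, made, missed in [("A", 10, 20), ("B", 15, 15)]:
    # Usual method: one league-average rate for every ball in the bucket.
    standard = plays_above_average(made, missed, 0.26, 0.26)
    # Alternative: separate rates for caught and not-caught balls.
    adjusted = plays_above_average(made, missed, 0.56, 0.15)
    print(f"Player {name}: standard {standard * RUNS_PER_PLAY:+.1f} runs, "
          f"adjusted {adjusted * RUNS_PER_PLAY:+.1f} runs")
```

The gap between the two methods grows as the assumed spread between the caught and not-caught rates gets wider, which is exactly why picking those two numbers is the hard part.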

Now I have to figure out how to incorporate this “solution” to all of the UZR buckets in some kind of fairly elegant way, short of spending hundreds of hours observing plays. Any suggestions would be appreciated.