Baseball Data Normalization

The Kinds of Baseball-Stat Normalization

Park-Effect Normalization

It is by now well known that the 30 major-league baseball parks are, in effect, 30 fun-house mirrors, each reflecting player performance with a different set of distortions--some minor, some--too many--grotesque. The first step HBH (High Boskage House) applies in evaluating player-performance statistics is to normalize out those distortions.

Such normalization is not the trivial task too many "stat analysis" books and web sites seem to believe it is, with their "park factors" calculated in embarrassingly naive form from simple "Home/Road" split data. The crucial question always to keep in mind when thinking about park factors is: What compared to what? Our approach is based on the idea that the only rational basis for truly neutral comparison of men and teams playing baseball in different parks is what they would have achieved had they played all their games in a single imaginary park that exactly averages the idiosyncrasies of all the real parks in their league. (Ideally, we'd like to use one park representing all of major-league baseball, but the cross-league data is far too minimal--not to speak of the DH complication--so we use one imaginary "neutral" park for each league.)

Here is how we first get that neutral park. For simplicity, consider a league with just six teams, which we will unimaginatively call A through F. Also for simplicity, let's focus on a single statistic: say, runs scored. For illustrative purposes, let's suppose we somehow know, a priori, what the true park factors for run scoring--compared to an averaged or "neutral" park--really are; let's say they're:
One thing we know even without that Table is that the factors for the six parks must add up to 6.00 (that is, average out to exactly 1), by the definition of "average".

Unfortunately, what we do have in the real world is not the Table numbers--they're what we're trying to figure out how to get--but rather sets of numbers by paired stadia. That is, we have (if we know where to look) the results of Team B batting in Park A (against, of course, Team A's pitching), of Team B batting at home in Park B against Team A, of Team A batting at home against Team B, and of Team A batting at Park B--and all the same sorts of results for all the possible pairings of Team A with the other five teams.

We can thus construct a set of relative factors, such as Parks A:B and D:F and so on. To do that, we take what Team A's batters did in Park A against Team B plus what Team B's batters did in Park A against Team A and add the two to get Park A data for Teams A and B; then we do it in reverse--what Team A's batters did in Park B against Team B plus what Team B's batters did in Park B against Team A and add the two to get Park B data for Teams A and B. If we take the ratio of the Park B to Park A numbers, we have park factors for Park B relative to Park A. (We have to do it a team-pair at a time or we're measuring the teams, not the parks--remember, we must always be comparing like to like, so we can only compare what A and B combined did in one park with what that same combination--A and B--did in another park.)

(To clean up as we go along: In taking such ratios, we need first to convert the raw numbers themselves to ratios--the obvious base for the data of interest being plate appearances. We have to do that to get rid of possible imbalances owing to possible different numbers of games in each park or, much more likely, different lengths of games. Then, when we take the park-to-park ratios, they are "clean"--true, normalized ratios.)
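The pairing step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HBH's actual code: the `ParkSplit` record, the function names, and every number in it are hypothetical, made up to show the shape of the calculation (combine both teams' batting in one park, convert to a per-plate-appearance rate, then take the park-to-park ratio).

```python
from dataclasses import dataclass


@dataclass
class ParkSplit:
    """Batting totals for some set of batters in one park (hypothetical layout)."""
    runs: int
    plate_appearances: int

    def rate(self) -> float:
        # Runs per plate appearance: removes imbalances in game counts and game lengths.
        return self.runs / self.plate_appearances


def combine(first: ParkSplit, second: ParkSplit) -> ParkSplit:
    """Add what both teams' batters did in the same park."""
    return ParkSplit(
        runs=first.runs + second.runs,
        plate_appearances=first.plate_appearances + second.plate_appearances,
    )


def relative_factor(in_park: ParkSplit, in_base_park: ParkSplit) -> float:
    """Factor of one park relative to a base park, for the SAME team pair in both."""
    return in_park.rate() / in_base_park.rate()


# Made-up totals for the A-B pairing only.
park_a = combine(ParkSplit(runs=30, plate_appearances=240),   # Team A batting in Park A
                 ParkSplit(runs=26, plate_appearances=235))   # Team B batting in Park A
park_b = combine(ParkSplit(runs=25, plate_appearances=236),   # Team A batting in Park B
                 ParkSplit(runs=24, plate_appearances=230))   # Team B batting in Park B

b_vs_a = relative_factor(park_b, park_a)
print(f"Park B relative to Park A (runs): {b_vs_a:.3f}")
```

The same two functions would be applied to every other pairing, and to every other statistic of interest, always keeping the team pair fixed between the two parks being compared.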
Next, we consider what those ratios represent in terms of the neutral park we seek to define. Sticking with runs for our example, our a priori knowledge that Park A's runs factor is 1.03 times neutral and that Park B's is 0.96 times neutral shows us that runs scored in Park B compared to those scored in Park A will be in the ratio of 0.96 ÷ 1.03, or about 0.932; that is, Park B yields about 93.2% of the runs that Park A does. Let's see what we get if we compare each park with Park A (obviously, though we don't have in the real world any data for Park A compared to Park A, that number is self-evidently exactly 1).
So what? Well, let's add those six figures up. They sum to 5.825242716. That still doesn't float your boat? Well, there are six parks; let's divide that sum by 6. What do we get? The result is 0.970873786. Now--watch carefully, at no time do the fingers leave the hand--divide 1 by 0.970873786; we get 1.03--exactly Park A's true factor. (That is to say that 0.970873786 is the ratio of our imaginary neutral park to the real Park A's run scoring.)

That is not a card trick. It follows logically from the definition of "average" for our neutral park. When we know the relative ratios of all six parks to (for example) Park A, we also know that the absolute ratios of those six parks must average to 1.000000000 (that is, sum to 6.000000000). The ratio of the relative 5.825242716 to the absolute 6.000000000 necessarily gives us the absolute ratio of the neutral park to our base park for this exercise, Park A. From that, knowing actual run-scoring in Park A, we can deduce what run scoring would be in our wanted "neutral" park. And what we can do for runs, we can do for all the other important stats, from doubles to walks.

What further follows from all that is that we should, in principle, be able to fully deduce the wanted neutral park from the paired data for any one of the parks in the league. That does not work in practice. We get results that look only roughly alike. The problem, as a moment's reflection should suggest, is the paucity of data we are working with: in some cases, A against B will be a very few games in each park--things that would average out in larger numbers can affect the results. So, what we do is construct the neutral park from each park individually, then average out those six slightly different "neutral" parks to what we hope and believe will reasonably approximate the true league-neutral park.
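For readers who like to see the arithmetic run, here is a minimal Python sketch of that deduction. It is an illustration under assumptions, not HBH's implementation: the factors for Parks A and B are the ones used in the text, but the values for Parks C through F, the neutral rate of 0.12 runs per plate appearance, and all function names are hypothetical, chosen only so that the six factors sum to 6.00.

```python
from statistics import mean

# Runs factors versus the neutral park. A and B come from the text;
# C through F are hypothetical values chosen so that all six sum to 6.00.
TRUE_FACTORS = {"A": 1.03, "B": 0.96, "C": 1.08, "D": 0.94, "E": 1.01, "F": 0.98}


def base_park_factor(relative_to_base):
    """Recover the base park's absolute factor from the factors relative to it.

    The relative factors (the base park's own 1.0 included) average to
    neutral/base, so the base park's factor is the reciprocal of that mean.
    """
    return 1.0 / mean(relative_to_base.values())


def neutral_rate(base_rate, relative_to_base):
    """Estimate the neutral park's runs per plate appearance from one base park."""
    return base_rate * mean(relative_to_base.values())


def averaged_neutral_rate(park_rates, relative):
    """Average the neutral-park estimates obtained from each base park in turn."""
    return mean(neutral_rate(park_rates[base], relative[base]) for base in park_rates)


# Noise-free demonstration: derive every park-to-park ratio from the known factors.
relative = {
    base: {park: TRUE_FACTORS[park] / TRUE_FACTORS[base] for park in TRUE_FACTORS}
    for base in TRUE_FACTORS
}
# Hypothetical observed runs per plate appearance in each park (neutral rate 0.12).
park_rates = {park: 0.12 * factor for park, factor in TRUE_FACTORS.items()}

factor_a = base_park_factor(relative["A"])
neutral = averaged_neutral_rate(park_rates, relative)
print(f"Park A factor recovered from ratios: {factor_a:.4f}")
print(f"Averaged neutral runs per PA: {neutral:.4f}")
```

With noise-free inputs like these, every base park yields the same neutral park, so the final averaging changes nothing; it earns its keep only with real, sparse paired data, where the six estimates disagree.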
When the real leagues are 14 and 16 teams, we have reasonable confidence that our resultant averaged-out neutral park is a decent representation of a true neutral park. (This would all be a good deal easier and more accurate did not baseball clubs change the layouts of existing parks so frequently, much less change parks entirely, much, much less do so in mid-season. But apparently MLB has decided that no park over five years old can be considered anything but antiquated and inadequate.)

Now that we have a reasonably satisfactory league-average, or "neutral", park, what next? Now we abandon the normalizing we needed to get a fair representation of that park, because now we want to know how each team's numbers have been affected by playing in real, non-neutral parks. So, for each team--let's say our imaginary Team A in our imaginary six-team league--we calculate the number of plate appearances they made in every park in the league (including their own), and then construct a factor for that team (as opposed to any given park) by weighting the various park factors by the time the team played in that park. For example, suppose Team A played 100 games, half at home and the rest evenly divided among the other five stadia in its league. We would construct a runs factor for Team A this way:
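One way to sketch that weighting is the short Python example below. It is a hedged illustration, not HBH's code: the factors for Parks A and B are those given earlier in the text, the factors for Parks C through F are hypothetical stand-ins chosen so that all six sum to 6.00, and games are used as the weights only because the example is stated in games (the method as described weights by plate appearances). The function names and the 500-run total are likewise made up.

```python
# Runs factors versus neutral: A and B are from the text; C-F are hypothetical.
PARK_FACTORS = {"A": 1.03, "B": 0.96, "C": 1.08, "D": 0.94, "E": 1.01, "F": 0.98}


def team_factor(park_factors, playing_time):
    """Weight each park's factor by the share of the team's play that took place there."""
    total = sum(playing_time.values())
    return sum(park_factors[park] * amount / total
               for park, amount in playing_time.items())


def to_neutral(actual_total, factor):
    """Convert an actual season total to its neutral-park equivalent."""
    return actual_total / factor


# Team A: 50 games at home, 10 in each of the other five parks.
even_schedule = {"A": 50, "B": 10, "C": 10, "D": 10, "E": 10, "F": 10}
# An uneven split of the same 100 games.
uneven_schedule = {"A": 50, "B": 20, "C": 14, "D": 6, "E": 7, "F": 3}

even_factor = team_factor(PARK_FACTORS, even_schedule)
uneven_factor = team_factor(PARK_FACTORS, uneven_schedule)

print(f"Even schedule factor:   {even_factor:.4f}")
print(f"Uneven schedule factor: {uneven_factor:.4f}")
# Hypothetical season total of 500 runs, converted with the even-schedule factor.
print(f"500 actual runs, park-neutral: {to_neutral(500, even_factor):.1f}")
```

With an even split of road games the result does not depend on the individual road-park values, only on their sum; the uneven schedule is where the particular parks visited start to matter.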
If we add up all those partial factors, we get 1.012. That means that to see what Team A would have done playing all its games in our imaginary average, "neutral" park, we have to divide its actual runs total by 1.012.

The example above is actually too simple, because we made all the away parks get the same number of games; had we set games played (out of that made-up total of 100) as, say, 50 at home, then 20 in Park B, 14 in Park C, 6 in Park D, 7 in Park E and 3 in Park F, we would have better seen the power of the method in arriving at a truly representative conversion of actual data to park-neutral data.

Finally, when we have conversion factors for a given team (always for one given season), then we have the factors for the individual men on that team, which are the same thing.

Is that method of correcting for baseball park effects perfect? No--very far from it; but we think it is nonetheless much better than any other currently being proffered. Its shortcomings derive from the same source the shortcomings of any park-correction method must derive from: limited data. We ideally want dozens or hundreds of game records where we have a handful. But so long as new baseball parks come on line frequently, while existing parks are modified in game-affecting ways almost annually, there is nothing we or anyone can do about the size of the database. And for any size database, we feel what we do is the best that can be done.

Seasonal (SillyBall) Normalization

For many, many years (at least 16 in fact), the level of major-league baseball play--the posted statistics--when averaged across an entire league or all of major-league baseball, was quite stable from season to season. There might be an occasional freak year like 1987, but the large-scale totals, representing the norm against which individual men's performances must be judged, were constant enough that per-season adjustment was not vital.
Everyone had a pretty good idea what a .300 batting average or what a 3.65 American-League ERA meant about men's abilities.

But, starting with the 1993 season, we began to be exposed to yet more of The Lords Of Baseball deciding that if football is successful, baseball must emulate it in every imaginable way, including scoring. So, what we here call The SillyBall. (Check the discussion at that link for a full and final demonstration of the ball as the controlling factor in the recent explosion of offense and an evaluation of just how, how much, and why it is hurting--and may eventually destroy--the Grand Old Game.)

The ridiculous and sudden increase in all offensive statistics consequent on The SillyBall has made it vital that baseball statistics be also normalized for season of play just as much as for the ballpark, and for the same sorts of reasons. HBH thus now also normalizes all statistics against a baseline which is, roughly speaking, what most baseball followers of the '80s and early '90s would have called "normal." If the new folly persists long enough, we will have to change that baseline but, for now at least, all of our results are comparable not only one with another, but also with the established norms of major-league baseball play during the two decades or so prior to The SillyBall.

So that you may clearly see for yourself how great the differences are, and how accurate the now-to-baseline adjustments are, we create a daily page showing the raw and converted stats for the two leagues individually and for major-league baseball as a whole. Here is a link to that League Performances page for today.

Some Technical Stuff

In calculating the various performance measures, we have followed the policy of using the adjusted (as described above) data in decimal form, rather than rounding stats off to whole numbers.
So, while no batter can actually have, for example, 182.37594 hits in a season, that is the kind of number we use when calculating things like batting averages. We do this because of the fine sensitivity of baseball values: a point on, for example, a batting average is in fact one part in a thousand. (Or, to put it another way, the difference between a .240 hitter and a .280 hitter, both playing pretty much every day, is only about a hit a week on average through the season.) To force the normalized raw data to round numbers would be to throw away some of the accuracy, so we do not do that.

In calculating the various performance measures for batters, HBH follows the policy of ignoring sacrifice bunts, since those are always managerially commanded and invariably work to reduce overall offensive productivity. Also, while we do include an allowance for hit-by-pitch and sacrifice-fly occurrences, we use overall major-league average values for those data rather than a player's actual data (for HBP that is ordinarily well justified, since--excepting a very few players--the chances of getting hit by a pitch are essentially random).

The actual pitcher ERAs we show on these pages are based simply on normalized innings pitched and earned runs allowed. The "Quality of Pitching" measures, which are--to oversimplify here--reverse use of TOP analysis, applying it to what a given pitcher has given up to batters, require enough special discussion that they have, as noted earlier, their own page on this site.

HBH possesses but does not display a variety of other interesting baseball data (such as minor-league stats expressed in major-league equivalencies, full player TOP/ERA/TPP histories, and more) on these pages for a simple reason: we long did and may someday again derive income from presenting such specialized computations to baseball clubs, and there is a limit to what we can give away for free. Current-season converted performance measures for major-league players are our limit--sorry.
Measures calculated by High Boskage House Baseball Operations, using proprietary techniques.
All data soon will be (but is not yet) normalized for park effects and seasonal variations.
This site is one of The Owlcroft Company family of web sites.