The High Boskage House Baseball-Analysis Web Site
baseball team and player performance examined realistically and accurately


Normalizing Data For Baseball Analysis


Baseball Data Normalization

The Kinds of Baseball-Stat Normalization

Park-Effect Normalization

It is by now well known that the 30 major-league baseball parks are, in effect, 30 fun-house mirrors, each reflecting player performance with a different set of distortions: some minor, some--too many--grotesque. Ideally, the first step one would apply in baseball analysis is to normalize out those distortions, to make the data commensurate and "park-neutral".

Such normalization is not the trivial task too many "stat analysis" books and web sites seem to believe it is, with their "park factors" calculated in embarrassingly naive form from simple "Home/Road" split data. There are several crucial questions involved; perhaps the threshold one is What compared to what?

The only rational basis for truly neutral comparison of men and teams playing baseball in different parks is what they would have achieved had they played all their games in a single imaginary park that exactly averages the idiosyncrasies of all the real parks in their league.

(Ideally, we'd like to use one imaginary park representing all of major-league baseball, but the cross-league data is far too minimal--not to speak of the DH complication--so we would be satisfied with one imaginary "neutral" park for each league.)

Now the usual way of compiling "park factors" is to try to compare home and road data. The standard approach is to use the combined team-and-opponents data for home games versus those for away games. To take a simple example, to get a park factor for runs, one would take the sum of the runs scored at Team X's home park by both it and its opponents, then divide by home games played (so as to get a per-game value, in case games at home and games away are not equal); one would then do the same for away data: sum the runs scored on the road by Team X and its opponents and divide by away games played; finally, one would take the ratio of the home figure to the away figure. To continue the example, if the home total were 1700 runs over 80 games and the away total were 1600 runs over 80 games (giving per-game values of 21.25 at home and 20.00 on the road), one might conclude that Team X's home park increases run scoring by 6.25%--that is, 21.25/20.00 = 1.0625. (That is, to pick one, exactly the method used by ESPN.) While other sources tinker somewhat with that exact approach, it remains at the heart of most such calculating.
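The arithmetic of that standard home/road calculation can be sketched in a few lines of Python. This is only an illustration of the method as described above; the function name and its parameters are our own labels, not anything taken from ESPN or any other source.

```python
def naive_park_factor(home_runs, home_games, away_runs, away_games):
    """Home/road run park factor as described in the text.

    Each runs total is Team X plus its opponents, combined.
    The function and parameter names are illustrative assumptions.
    """
    home_per_game = home_runs / home_games   # e.g. 1700 / 80 = 21.25
    away_per_game = away_runs / away_games   # e.g. 1600 / 80 = 20.00
    return home_per_game / away_per_game


# The running example from the text: 1700 runs in 80 home games,
# 1600 runs in 80 away games.
print(naive_park_factor(1700, 80, 1600, 80))  # 1.0625
```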

Now, to avoid the real-world complications we'll get to just a little farther on, let's assume that all the parks in major-league use have remained the same, and--equally important--each in the same configuration for a good number of years running. (That assumption is so comically wrong that one at once gets an idea of some of the other problems, but let's make it for now). If we apply the methodology described above, what have we calculated? For each park, we have calculated how it affects, in our running example, run scoring. But, again, compared to what? In fact, we have compared each park to all the other parks. Think about it: no one park is being compared to the same base as any other park. Shea Stadium, to randomly pick one, would be being compared to all parks except Shea Stadium; Dodger Stadium (to randomly pick another park in the same league) would be being compared to all parks except Dodger Stadium. Those "compared to" bases are by no means the same thing. Our resultant "park factors" are, in reality, a pile of reeking garbage.

(Well, that's a bit of an exaggeration--but very obviously, they are severely defective.)

When we wake up to that defect, we realize that it can be almost wholly corrected for. What we need to do is to multiply the "away" per-game datum by one less than the number of teams in the league (that is, by the number of Team X's opponent teams), add to that Team X's home per-game datum (that is, Team X plus its opponents, at Team X's park), then divide the result by the full number of teams in the league; the park's home datum is then compared to that figure rather than to the raw "away" datum. That way, the basis becomes very nearly all parks in the league, rather than "all parks but the one we're talking about". (It's "very nearly", not exactly, because we have no data equivalent to Team X's batters facing Team X's pitchers in any park, but it does help a lot.)
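As a rough sketch of that correction, assuming that the figure added back into the basis is the park's own home per-game datum (which is what makes the basis cover all parks in the league), the calculation might look like the following. The function names and the 15-team league size are assumptions made for the example only.

```python
def all_parks_basis(home_per_game, away_per_game, n_teams):
    """Approximate per-game figure for 'all parks in the league'.

    The away datum stands for the (n_teams - 1) other parks; the home
    datum stands for Team X's own park. Names are illustrative only.
    """
    return ((n_teams - 1) * away_per_game + home_per_game) / n_teams


def adjusted_park_factor(home_per_game, away_per_game, n_teams):
    """Home datum relative to the near-all-parks basis."""
    return home_per_game / all_parks_basis(home_per_game, away_per_game, n_teams)


# Running example (21.25 at home, 20.00 away) in a hypothetical 15-team league.
naive = 21.25 / 20.00
adjusted = adjusted_park_factor(21.25, 20.00, 15)
print(naive, adjusted)
```

Because the park's own inflated (or deflated) figure is now part of the basis, the adjusted factor always lies between 1.0 and the naive home/road ratio; in the example it comes out a little under 1.06 rather than 1.0625.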

But the point of that discourse is not to suggest that we can with care get accurate park factors: it is to suggest that too many people have done too little thinking about what they're really doing when they try to construct "park factors". In fairness, the concept arose--and was very obviously needed--in a day when ballparks were much less subject to the dizzying pace of modification or outright replacement that is the modern norm, and multi-year per-park data were reasonably stable and could safely be cumulated, and thus--did one take the sorts of care described above--meaningful park factors calculated.

That is only the beginning of the complications. Consider, for example, the idea of using "games" as the normalizer, to get a "per-game" datum. That stinks, for what should be obvious reasons. The best normalizing basis for a given stat will vary with the stat in question. As others have pointed out, what's wanted as a basis for any given stat is the opportunities for achieving it. For walks, for example, the "opportunity" is all plate appearances, so the normalized walks datum should not be walks per game but walks per plate appearance. (Actually, to be precise one should probably subtract sacrifice bunts from plate appearances for most stats, since the player laying down a sac was ordered to do so, and that plate appearance was thus not an "opportunity" to take a walk.) For home runs, it could be argued that the basis ought to be at-bats minus strikeouts, though it could also be argued that since parks affect both strikeout and walk totals, there is an interdependence there that perhaps ought not to be. (But, again, this is not to try, right here and now, to develop perfect park-effects answers but only to indicate the related questions that need careful review, review that they rarely get.)
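To make the "per opportunity" idea concrete, here is a minimal sketch using the two bases mentioned above: walks per plate appearance (less sacrifice bunts) and home runs per at-bat less strikeouts. The function names and the sample numbers are hypothetical, and whether these are the right bases is, as noted, itself open to argument.

```python
def walk_rate(walks, plate_appearances, sacrifice_bunts=0):
    """Walks per opportunity: plate appearances minus sacrifice bunts."""
    opportunities = plate_appearances - sacrifice_bunts
    if opportunities <= 0:
        raise ValueError("no opportunities to draw a walk")
    return walks / opportunities


def home_run_rate(home_runs, at_bats, strikeouts):
    """Home runs per opportunity: at-bats minus strikeouts (one arguable basis)."""
    opportunities = at_bats - strikeouts
    if opportunities <= 0:
        raise ValueError("no opportunities to hit a home run")
    return home_runs / opportunities


# Hypothetical season line, invented purely for illustration.
print(walk_rate(walks=60, plate_appearances=650, sacrifice_bunts=10))  # 0.09375
print(home_run_rate(home_runs=25, at_bats=560, strikeouts=60))         # 0.05
```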

(Not everyone has been blind to these problems; there was an enlightening if technical 2007 paper titled "Improving Major League Baseball Park Factor Estimates", by Acharya, Ahmed, D'Amour, Lu, Morris, Oglevee, Peterson, and Swift, published by the Harvard Sports Analysis Collective. But, though their model seems to improve on "the ESPN model", the paper's Conclusion notes that:

"Unfortunately, the lack of longer-term data in Major League Baseball, particularly due to the park relocation undergone by eight National League teams, makes it extraordinarily difficult to assess the true contribution of a ballpark to a team's offense or defensive strength. While we openly admit this difficulty, we still feel that the ESPN model for Park Factors is inadequate and requires improvement. Its theoretical errors are too significant for the [extent to which] it is currently quoted."

Just so. But despite their model's theoretical improvements over the standard form, the result can never be better than the data, the issue we address next.)

Most of the sorts of issues raised above, however crucial, have become, in the modern era, essentially moot, the reason being the grotesque invalidity of the temporary assumption we made earlier: that all the parks in major-league use have remained the same, and each in the same configuration, for a good number of years running. Statistical data are only valid and useful when they represent a sample large enough that sheer chance is not likely to have a large effect on the result. If we toss a coin four times and happen to get three heads, we would be something a lot worse than ill-advised to pronounce as a verity that tossed coins come up heads 75% of the time. It is hard nowadays to find a major-league ballpark that does not have some structural change nearly every season, and it is remarkable how much effect even a seemingly minor change can have on certain stats. And remember: even if Park X remains unchanged for some years in a row, it is almost 100% sure that some other park, whose nature goes into the all-parks basis, will have changed--often several will have, and not infrequently in a major way. And, with the prevalence of retractable roofs these days, even an "unchanged" park can be a very different place from season to season depending on the number of times the roof is open versus closed in a given year (roofs typically have a huge effect on game stats), not to speak of whether the openings and closings are daytime or night-time.
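The coin-toss point is easy to put a number on. The short sketch below (our own illustration, using only the standard binomial formula; the function name is an assumption) computes how often a perfectly fair coin gives at least three heads in four tosses.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# A fair coin shows three or more heads in four tosses nearly a third of the time.
print(prob_at_least(3, 4))  # 0.3125
```

A result that a fair coin produces that often is no evidence of anything, which is the situation a park factor is in when it rests on a season or two of data from a park that keeps changing.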

When we consider those sorts of things (not to mention the wildly unequal numbers of games one team may play versus different opponents), it should be clear that attempting "park factor" numbers is a task so complicated as to perhaps be impossible--that is, any result we come up with may have flaws, which we cannot hope to correct for, that render it inaccurate. While we here at HBH continue to ponder these issues, for now our approach is to concede that various parks have obvious and often substantial effects on performance results, but that corrections for those factors are so unreliable that we are better off to use unadjusted data and make broad-brush mental corrections to the results. (That is, say to ourselves, "Well, sure, but that's from playing at Coors Field" or suchlike things.) Unpalatable, but less so, for now, than the alternatives.

(To give you a clear idea of how impossible making a meaningful park factor is without cluttering up this page, we have prepared a separate examination of what happens when one takes a practical run at the problem.)

Seasonal (SillyBall) Normalization

For a long stretch--sixteen years, from 1977 through 1992 inclusive, a period amounting to nearly a generation--the levels of major-league baseball performance, when averaged across an entire league or all of major-league baseball, were quite stable from season to season. There might be an occasional freak year like 1987, but the large-scale totals, which represent the norms against which we judge individual men's performances, were constant enough that per-season adjustment was not important. Everyone had a pretty good idea what a .300 batting average or a 3.65 American-League ERA meant about a man's abilities.

It is now clear that starting sometime in the 1993 season the baseball itself was somehow substantially juiced. That juicing, which created what we call The SillyBall, is demonstrated exhaustively elsewhere on this site, so here we merely accept it as the fact that it is. Baseball before 1993 (at least back to 1977 anyway, which is when a different, and more resilient, brand of baseball was introduced into the game) and baseball after 1993 right on through today are simply two different, incommensurable games.

For quite a while, we at HBH calculated and applied an adjustment factor to the raw stats, to make them commensurable with a baseline of results from that 16-year window. We did that so that no one looking at, say, 1998 numbers for a team or man and comparing them with the sort of "built-in" mental benchmarks of what for so long had been "normal" would be misled by the effects of the SillyBall. But we are now another 16 or so years on from the change, and most of today's observers' mental benchmarks have been conditioned by, or re-set to, the norms of the SillyBall era, and so corrections to a now-outdated era seem worse than useless. So we don't do them anymore. There are not a lot of men left in the game with seasonal data from 1993 or earlier, and for those few that part of their history no longer represents much of their career totals, so we can ignore such pre-SillyBall data with very little effect on those men's cumulative career results.

If, incidentally, you wonder if "SillyBall" isn't a needlessly pejorative term, we use it for a reason. That reason is our deeply held belief that there is such a thing as an "ideal" baseball scoring level, and that the SillyBall produces results well above that ideal. On what basis might one speak of an "ideal" scoring level? This: that on the one hand the scoring of a run in a ball game ought not to be so rare as to make the game boring, and make any given run almost invariably crucial (since a fair proportion of runs are as much luck as skill), but that on the other hand it ought not to be so trivial--so much just another rotation of the turnstile--that it's ho hum, and let's just see how many they pile up by the end. A run should be exciting yet not game-controlling. What exactly that translates to in numbers is, of course, somewhat subjective, but our feeling is that a combined game total (both teams, that is) of somewhere from at the least 7 to at the most 9 runs is about right. We don't want an endless procession of 3-2 and 2-1 games, but neither do we want a parade of 11-7 or 9-2 games. If no one will do away with the brain-dead Designated Hitter Rule, then let the NL average 8 runs a game and the AL 9 runs a game, and be done with it (the DH rather obviously raises scoring by, very roughly, 9:8, inasmuch as there are effectively nine run-producing batters instead of eight). And in fact, those are about the levels that were in effect prior to the advent of the SillyBall.
But, at least in the early 1990s (if not still today), The Lords Of Baseball were hypnotized by the relative market success of football, and apparently decided that baseball, rather than--as a sane person might think logical for "The National Pastime"--emphasize the things that make it unique and wonderful, should instead attempt to emulate football as much as possible, and pre-eminently by jacking run-scoring to the point where it is often difficult to determine at a glance if a game score in those hideously annoying screen-bottom tickers TV runs is from a football game or a baseball game. OK, </RANT>.






You loaded this page on Friday, 31 October 2014, at 1:27 am EDT;
it was last modified on Sunday, 29 March 2009, at 10:44 pm EDT.


All content copyright © 1999 - 2014 The Owlcroft Company.
