Probability and Baseball

(As a sidelight, it is both interesting and amusing that probability theory, which began in the seventeenth century, was originally contrived to answer the questions some aristocratic gentlemen had about the true odds in certain dice games they favored. So probability theory's relationship with games is historically sound.)

The main reason we need a little insight into probability is to understand just what it means to say we have a "formula" for team runs scored or some such. Some formulas are not probabilistic: if we know the velocity of the baseball and the distance from the pitcher's hand to the plate and the resistance of the air, we can calculate exactly the time the baseball will take to get to the batter--not "more or less" but exactly. But obviously we cannot calculate the runs a team will score from its plate appearances, hits, walks, and so on with that same kind of absolute, cause-and-effect, dead-on accuracy. What we can calculate with near-perfect accuracy is the probable number of runs the team will have scored.

But if we say this or that baseball formula is correct as to the probable result, what do we have? If the team actually scored 800 runs and the formula said 775, what--if anything--do we know? To gain understanding, we need no more complex tool than a coin, which we will, in imagination, undertake to toss.

We all understand implicitly that a fair, balanced coin will come up heads half the time (and tails the other half), which is why so many decisions that we want to make in an unbiased way are made by tossing a coin; so much is elementary. Or is it? Just what do we really mean by "half the time"? That needs to be looked at more closely. We certainly do not expect that a tossed coin will come up heads-tails-heads-tails et cetera forever. Nor do we require or expect that if we toss it 10 times we always and surely get 5 heads and 5 tails.
In short, what we mean by "half heads" is that the more times we toss the coin, the more closely we expect the percentage of heads to approach 50 percent. In 10 tosses, 7 heads is of no account; in 100 tosses, 70 heads would surprise us mightily; and in 10,000 tosses, 7,000 heads would shock us to the core (or, more correctly, would absolutely, positively convince us that the coin is not in fact a fair, balanced coin). As a matter of record, there is a precise, if complex, mathematical relationship between the number of times we "sample" something and our confidence that our samples represent "true" or very-long-term expected results. That is what is referred to in, for example, political polling results when the pollster speaks of "a confidence level of plus-or-minus X percent." An interesting sidelight is that added data become progressively less helpful in improving confidence. If a rookie bats .300 in his first full year, we are hopeful but not fully convinced; but if he bats .300 or so again the next year, we feel like the team has found a real .300 hitter. If he then bats .300 for his career--well, we already expected that, didn't we? The increase in at-bats from the 1200 or so of his first two seasons to the perhaps 10 times as many of his full career gave us less new information than the mere doubling of the number from his first year to his second. Mathematically stated, confidence goes up as the square root of the data multiple: that is, it takes four times as much data as we have to double our confidence in what the data are telling us. That's one big reason why pollsters can determine pretty well, for example, how popular a given TV show is by surveying only a few hundred households. If we do the math--which we won't here--using typical baseball-team numbers, we find that over a full 162-game season, for runs scored we can expect an average variation from target of a little under 3 percent (about 2.9%) from chance alone. 
Over a more restricted period of time--say the first month of a season--the expected average scatter rises significantly, to around 6%, owing to the reduced data sample.

Probability theory tells us more than just the expected average error. It also tells us how we should expect any actual set of results to be distributed around that average. Think of it this way: if we firmly clamp a rifle in a vise so that it is aimed directly and precisely at the bullseye of a target some distance away and then fire that rifle a number of times, what do we expect the pattern of the holes in the target to look like? If we did the test in an ideal indoor windless test room we might just get a large series of dead bullseyes; but if we do it outdoors, where there is a wind blowing in a moderate but randomly variable (in both direction and velocity) way, what we expect is a scattering--but one centered on the bullseye. Moreover, we expect to see most of the holes fairly near the bullseye, a few a little way out, and perhaps an occasional one quite a way out.

If we measured the distance of each hole from the exact bullseye center and then made a little graph plotting number of holes (data points) against distance from the bullseye (expected norm), the graph--given enough shots (data) to show its shape clearly--would look something like a cross-section of the Liberty Bell, which is why such distributions are called "bell curves" (you've probably heard the term). Technically they're "Gaussian distributions," named after the mathematician Carl Friedrich Gauss.

The point of this digression is that if you take the results of any competent analysis of baseball statistics--let's say HBH's "TOP" formula--and repeatedly compare its predictions against real-world results, you expect to see a bell curve whose exact size and shape depend on definitely known numbers. If that is the case, you have good cause to say that the formula is correct.
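To put rough numbers on the coin-tossing argument, here is a minimal Python sketch. It is an illustration only, not part of any HBH calculation; the function names are invented for this example, and the coin is assumed to be perfectly fair. The first function gives the exact chance (as a base-10 logarithm, because the numbers get very small) of seeing at least a given number of heads; the second gives the typical scatter in the share of heads, which shrinks as the square root of the number of tosses.

```python
from math import comb, log10, sqrt


def log10_prob_at_least(heads: int, tosses: int) -> float:
    """Base-10 log of the chance of at least `heads` heads in `tosses` fair tosses."""
    term = comb(tosses, heads)          # number of ways to get exactly `heads` heads
    favorable = 0
    for k in range(heads, tosses + 1):
        favorable += term
        # Step from C(tosses, k) to C(tosses, k + 1); the division is exact.
        term = term * (tosses - k) // (k + 1)
    # All 2**tosses sequences are equally likely for a fair coin.
    return log10(favorable) - tosses * log10(2)


def heads_share_scatter(tosses: int) -> float:
    """Standard deviation of the fraction of heads for a fair coin."""
    return 0.5 / sqrt(tosses)


for heads, tosses in [(7, 10), (70, 100), (7000, 10000)]:
    exponent = log10_prob_at_least(heads, tosses)
    print(f"{heads} or more heads in {tosses} tosses: about 10^{exponent:.1f}")

# Four times the data halves the scatter: the square-root rule.
print(heads_share_scatter(100), heads_share_scatter(400))
```

Seven or more heads in 10 tosses turns up about 17 percent of the time (176 chances in 1,024), which is why it is "of no account"; the same 70 percent share in 100 or 10,000 tosses is rarer by many orders of magnitude.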
There are minor differences in accuracy between the various formulae from various sources, but those differences are very small compared to the degree to which all of them, however and by whomever derived, generally agree with one another and with the expected scatter patterns probability mathematics demands of an accurate formula. So that you can see that we put our money where our mouth is, we include this demonstration tabulation of the HBH TOP formula tested on the most recent half-century of baseball. You can also take a look at short-term results on the Team-Performance page on this site, but we don't link it at this specific spot because you should read more before going there.

The Logic of Baseball Analysis

Winning Games

Individual baseball games are, obviously, won or lost based on a very clear and simple rule: the team that scores more runs than it gives up by the end of the game wins. Less obvious is that there is a definite and clear relationship between the runs a team scores and gives up over a series of games and the percentage of games it wins in that series (and, again, that of course is a probabilistic relation). Given that fact, if we knew how many runs a team could be expected to score and give up over a season, we could predict with reasonable accuracy how many games it would win in that season.

There are numerous versions of this formula; they often look very different, but when one does an engineering analysis with typical baseball numbers, they essentially resolve into the same thing. Naturally, they thus also each give almost exactly the same results for a given set of games and runs figures. Cook, in Percentage Baseball, used simply [formula not preserved in this copy] (where R and OR are Runs and Opponents' Runs) to get the expected win percentage. Bill James has used his so-called "Pythagorean" formula, not easily reproduced on a web page. We at High Boskage House have yet another.
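As one concrete instance of such a games-won formula, here is a small Python sketch of the "Pythagorean" relation in the form in which it is commonly published: win percentage equals runs squared divided by the sum of runs squared and opponents' runs squared. This is the widely known textbook form, not the High Boskage House formula, and the function name, the adjustable exponent parameter, and the run totals below are assumptions made for the example.

```python
def pythagorean_win_pct(runs: float, opp_runs: float, exponent: float = 2.0) -> float:
    """Expected win percentage from runs scored and runs allowed.

    Uses the commonly published form R^x / (R^x + OR^x), with x = 2 by default.
    """
    if runs < 0 or opp_runs < 0 or runs + opp_runs == 0:
        raise ValueError("run totals must be non-negative and not both zero")
    scored = runs ** exponent
    allowed = opp_runs ** exponent
    return scored / (scored + allowed)


# Hypothetical season: 800 runs scored, 700 allowed, 162 games.
pct = pythagorean_win_pct(800, 700)
print(f"expected win percentage: {pct:.3f}")
print(f"expected wins in 162 games: {pct * 162:.1f}")

# Equal runs scored and allowed gives an even record.
print(pythagorean_win_pct(750, 750))
```

With these made-up totals the sketch projects a win percentage of about .566, or roughly 92 wins over 162 games, and a team that scores exactly as many runs as it allows comes out at .500.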
None of them is really right, because "right" here would be a very messy probabilistic equation based on typical scatter of runs scored around its average value for a team (that is not a simple bell curve, because it is constrained at one side, the lower limit--you can't score fewer than zero runs--but there is no upper limit, especially with the SillyBall). But they all work quite well enough.

To make this less mystical, consider a team that plays a fairly large number of games against another team or set of teams--in fact, a typical baseball season. If, at the end of that time, the team has scored exactly as many runs as it has given up, it is no great leap of logic to say that on balance they have been neither better nor worse a team than their average opponent. That being so, we would expect that the most reasonable outcome is that they have won no more than they have lost: that they are .500 in those games. All games-won formulae thus must meet the test that at equal numbers of runs scored and runs yielded, they predict a .500 win percentage. (Consider, for example, Cook's formula as given above.)

Moreover, we certainly feel that if the team has outscored their opponents by a little, they should have won a little more than half their games; and if they outscored them a lot, then they should have won at well over a .500 clip. The various formulae quantify those expectations, giving specific, reliable win percentages for specific R and OR run sets.

Scoring Runs

If we can--and we indeed can--project probable games won from runs scored and runs yielded over any arbitrary set of games (with, of course, increasing accuracy as the number of games in the series rises), we would next like to be able to project runs scored and yielded for a team based on who is playing and pitching for it.
If we could do that as well--and here too we can--we could then project with some accuracy a team's ultimate win percentage just from the identities of its player personnel.

The essence of scoring runs in baseball is remarkably straightforward: put runners on base and then drive them in. The background is the ticking clock of baseball--outs. Of all the many and diverse numbers in baseball analysis, none is nearly so important as this one: three, the three outs that define an inning.

Another thing that probability mathematics tells us is that the chance of two things both happening is the chance of one multiplied by the independent chance of the other. The chance of a man getting on base is very simply expressed by a now-familiar (if late-arriving) stat: the on-base percentage. To get the chances of a man at the plate becoming a run scored, we need to take his on-base percentage and multiply it by some factor representing the chances that a teammate will knock him in. (We do need, naturally, to make some adjustment to the raw on-base percentage to allow for the facts that the man may get on by an error, and that he may be put out on the basepaths even after having reached safely.)

As an aside, we need to remember at all times--which many discussions and analyses we have seen do not--that the batter at the plate is also a base runner. That is, there is always at least one runner on for every batter: himself. He is the base runner on "zeroth base." What he does as a batter independently affects him as a base runner (analysts sometimes forget that, but the Rules Of Baseball don't, referring to the "batter-runner"). It is as if there are two men at the plate: a runner, just like a runner at any other base, and a batter who does what he does and then fades into thin air as the base runners (or runner--himself) do whatever is appropriate for what he as a batter did.
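The multiplication rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and both input values are invented for the example, and the second input is simply a stand-in for "the chances that a teammate will knock him in," not a value computed by any real method.

```python
def chance_of_scoring(on_base_rate: float, drive_in_chance: float) -> float:
    """Chance that a batter becomes a run, treating the two chances as independent."""
    for value in (on_base_rate, drive_in_chance):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each chance must lie between 0 and 1")
    return on_base_rate * drive_in_chance


# Hypothetical batter: reaches base 34% of the time, and a runner
# who has reached base is assumed to be driven in 35% of the time.
per_plate_appearance = chance_of_scoring(0.340, 0.35)
print(f"chance of becoming a run: {per_plate_appearance:.3f}")

# Over a hypothetical 600 plate appearances, the expected runs scored:
print(f"expected runs in 600 PA: {600 * per_plate_appearance:.1f}")
```

With these made-up inputs, about 12 percent of the man's trips to the plate end with him crossing it.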
The mechanics of what such an "RBI factor" might comprise, and of how it is derived, are somewhat complicated. Evidently base hits are going to be very important, and extra-base hits especially so; but walks have some value, and even minor factors like wild pitches and balks are not utterly negligible. In fact, the details of both the philosophy and practice of calculating an RBI factor of some sort are largely (but by no means wholly) what distinguish one school of analysis from another. High Boskage House has its own methods, which we will not detail here for a variety of reasons (brevity being one, their proprietary nature being another).

In many workers' formulations, the occurrence rate of Total Bases (the sum of all hits weighted by bases per hit--that is, for example, triples are 3 and singles are 1) is the only determinant in their RBI factor, whatever they call it (if they call it by a name at all). That can actually give a pretty fair result, and it has the virtue of simplicity. The first runs-scored formula Bill James widely published was indeed that simple:

Runs = (Hits + Walks) x Total Bases / (At-Bats + Walks)

Since Hits + Walks is, roughly anyway, the available base-runner total, and At-Bats + Walks is--also roughly--the plate-appearances total, manifestly James' "RBI Factor" in this formulation was indeed just the Total Bases rate (TB/PA, more or less).

Note that James did not use the on-base percentage, or any rough equivalent of it: he used what amounts to the actual number of base runners. That's OK for a quick, simple formulation which will serve to demonstrate how well analytic methods work, but it limits the utility of the formulation to evaluating what has happened; you cannot use it to predict what likely will happen because to know how many men will reach base, you need to state your formula in terms of an on-base rate. And that brings us to another important point.
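The simple formulation just described, in which the base-runner total (Hits + Walks) is multiplied by the Total Bases rate (Total Bases over At-Bats + Walks), can be sketched in Python as follows. The function names and the team totals are assumptions made for this example, and the sketch is not the HBH run-scoring equation.

```python
def total_bases_rate(total_bases: int, at_bats: int, walks: int) -> float:
    """Total Bases per (approximate) plate appearance: TB / (AB + BB)."""
    return total_bases / (at_bats + walks)


def simple_runs_estimate(hits: int, walks: int, total_bases: int, at_bats: int) -> float:
    """Estimated runs: base runners (H + BB) times the Total Bases rate."""
    base_runners = hits + walks
    return base_runners * total_bases_rate(total_bases, at_bats, walks)


# Hypothetical team season totals (invented for illustration).
estimate = simple_runs_estimate(hits=1450, walks=550, total_bases=2300, at_bats=5500)
print(f"estimated runs: {estimate:.0f}")
```

With these invented totals the estimate comes to about 760 runs. Because the inputs are counts of what actually happened, the sketch shares the limitation noted above: it evaluates a finished season rather than projecting a future one.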
The on-base percentage and an RBI factor, when multiplied, give the chance that a given batter will become a run scored; but the actual number of runs scored also depends on how many men come to the plate so as to have that chance. That number, actual total team plate appearances, varies significantly from team to team and year to year; but it does not do so without cause. Remember outs as the ticking clock of an inning: the less likely a team is to make an out at the plate, the more men they will get to the plate over the long haul.

That can be stated quite precisely in a mathematical formulation, but its essence for the purposes of understanding is this: a team's on-base percentage has a form of compound-interest effect on run scoring. First, it directly increases the chance that any one batter will ultimately become a run scored; and second, it increases the number of men who will get to have that chance. It is for these reasons that the single most important baseball statistic viewed in isolation is the on-base percentage; actual run scoring tracks on-base percentage more closely than any other single statistic (as we now understand that it should).

And one more time: if you have any doubts that the HBH run-scoring equation works, and works very well, look over the actual results again.

Rating Players

Now consider this: what we can calculate for a team from its statistics, we can also calculate for any one batter from his personal statistics. If we then set the number of available outs to what it is for a full team for a full season, we get a number that sums up that man's ability to contribute to his team's scoring of runs in one number; we can think of it as the runs that would be scored in a season by a team made up entirely of exact clones of that man.

High Boskage House calculates just such a measure, which we call the Total Offensive Productivity, or just TOP.
It is shown for all batters listed anywhere in these pages; in the by-team batting lists, the batters are arranged in order of descending TOP. Moreover, what one can calculate for a batter, one can correspondingly calculate for a pitcher, using the numbers that he gives up to batters. You will find on this site just such calculations, which yield a novel and very, very important measure that we call the "Quality of Pitching" stat (there is also a closely related stat that we call the TPP, for Total Pitching Productivity, because the term pairs nicely with the TOP). There is a separate page on this site that discusses those measures further, but you would be best off to finish this page before jumping there. Two other and somewhat related points need mention. (Actually, they need extensive discussion, and we hope in future, as we gradually expand these notes, to give them that discussion.) One is the predictability of individual men's performances. As we said earlier, there is a sort of law of diminishing returns for the meaning of increasing data; by the time we have roughly the equivalent of two seasons' full-time play for a batter, we have enough data to have defined his norms of performance pretty well. Pitchers, for complex reasons, take more time to evaluate, although using the TPP instead of the ERA gives results in time periods comparable to those needed for batters. Moreover, by the time a batter reaches double-A ball, he has become pretty much what he will be; if we have two seasons' worth of data above class A ball, we have the man defined. It is precisely that predictability that makes it possible to "engineer" a baseball team in a manner quite comparable to the process of engineering an automobile engine. By knowing the data for the components and the equations for how those components interact, we can design an engine or a team to meet a specified set of performance criteria. 
It is crucially important to understand what we are saying here: we are not saying that we can predict accurately how every man will do in a given season from how he has done in the past; that is, as common sense suggests, impossible. But, just as we certainly cannot predict how a pair of dice will come up in any given throw or small number of throws--which is why people gamble--we can equally certainly predict with great accuracy how much money a craps table will likely take in on one shift because we know the tendencies of the dice well. So with a ball club: if we know the tendencies of the batters and pitchers--what they have done in the past--we can predict with good accuracy how the cumulative results of 25 men over a full season will come out. We know some will be surprisingly low and some surprisingly high, but--most of the time--the net will be on target. (A full season for a ball club is enough for acceptable precision, but there is always room for the occasional burst of especially good or bad luck seen by most fans--as well as professionals who should know better--as either "clutch performance" or "choking." How come no sane craps player ever refers to "clutch" dice or "choking" dice when they win or lose?)

The second point, related to what we just said, is that minor-league statistics--long thought by most baseball professionals and fans to be nearly meaningless--can be translated so that we see what the man would have achieved playing at that same level of ability in a major-league ballpark against major-league competition. (That realization, and the mechanics to implement it, are among Bill James' most valuable contributions--probably his most valuable--to the art.)

Finally, we repeat that all statistics, to be useful, must be comparable. There is a separate page on this site that discusses the "normalization" processes that HBH applies to all raw data before transforming it into performance measures.
Measures calculated by High Boskage House Baseball Operations, using proprietary techniques.
All data soon will be (but is not yet) normalized for park effects and seasonal variations.