Sabermetric Book
Tango Tiger
03-27-2006, 07:50 PM
I see a few threads talking about books that we've read. Is it against the TOS to talk about books we've written?
leecemark
03-27-2006, 09:25 PM
--Depends what you have to say. Advertising is prohibited, but if you wanted to share some of the ideas in it I, for one, would be very interested. Welcome to BBF. It's a pleasure to have such a distinguished visitor.
Ubiquitous
03-27-2006, 09:43 PM
A couple of weeks back your book was brought up, and this was my quick book review:
I finished The Book the other day and overall it was good. Nothing revolutionary, but then again I wouldn't expect it to be. Basically the book is an in-depth look at baseball's most commonly held strategic decisions and cliches: things such as bunting, platooning, reliever usage, and so forth. There are lots of charts and explanations of how they do things, so one never really has to guess at how they came to their conclusions, but at the same time one can easily skim a section if the material is too laborious for one's interest and not lose track of what they are saying.
One of the things I liked about the book is that they mention game theory and go over it a little and how it relates to baseball. They do it in a way that I don't think most people do, and that's including stat-bashers and statheads. They don't just look at it in terms of success and failure but at how it forces other teams to play you. For instance, in terms of bunting, if you never bunt in sac bunt situations then opposing teams can alter their defense in a way that gives them an advantage. The corner infielders don't have to protect against the bunt, middle infielders can play deeper and play against a hit, and so forth. So by not forcing the opposing team to respect the bunt you make it harder to get a hit in those situations. So there are times when one has to bite the bullet and sac bunt just to show people you will.
Now for what it isn't. This isn't an exciting book, and it isn't a Bill James book. There are no stories and no history, and it isn't an essay-type sabermetric book like a Barra or Neyer book. This is more like a report prepared for a baseball manager than a fun summer read. This book isn't a book that ranks players; you won't be seeing people using quotes and passages from this book in these forums like they do with Bill James. This is simply a book that tries to explain what happens in a game when certain events occur, and I believe that they do this very well.
Captain Cold Nose
03-28-2006, 05:38 AM
Might I suggest you get ahold of Sean, the webmaster and owner of this site. (member webmaster.) On the Almanac part of the site, there is a book page, including a listing on one written by a Fever member.
Personally, I'd consider letting us know about your book a favor to the site and its members.
Tango Tiger
03-28-2006, 08:01 AM
Thanks leece. I guess I'll just tackle whatever issues are brought forth.
Ub, that was a fair and honest review.
Nose, good idea.
Ubiquitous
03-28-2006, 10:43 AM
From discussions about this sort of thing I believe it was decided that one is allowed to post a link and a short blurb to their site in their signature.
Anyway to kick off some talk about the book I'll throw you some questions.
In another thread a discussion came up about pitchers pitching to the score, with some believing this is a real effect. I tried to explain your book's view, but personally I think I did a horrible job of it, so perhaps you could explain your book's view here: what kind of effects you saw, whether pitch counts were different, whether pitch situations were different, and so on.
Second question:
What kind of impact does a baserunner have in the game? Again, on this site we have had debates in which some believe that good baserunning is not measurable: that the impact of a basestealer is not captured in the stats, the pressure they put on the defense and so forth. Is it measurable, and if so what's the impact? Was someone like Lou Brock "better" than his stats because the pressure he applied wasn't measured, or was there anything to measure? What has PbP data shown you guys when players steal bases, when great basestealers are on and when they try to steal?
Hopefully that will get the ball rolling for you.
BlueJay
03-28-2006, 11:27 AM
I'll second the second question. I'm particularly curious to what extent a player's ability to "take the extra base" has a significant impact on the outcome of the game.
To which I'll add a question about the relative value of a strong/accurate throwing arm. What "percentage" of an outfielder's defensive value does his throwing arm account for? Intuitively, one would think it's of less value than his range. Is this something discussed in your book?
Tango Tiger
03-28-2006, 12:25 PM
Re: baserunning
Everything is measurable. It's a question of finding that sensitive needle. Basestealers not only put pressure on the defense... they put pressure on the offense. The effect is rather powerful, especially for those runners who don't like to sit and wait. They also open up the hole between 1b and 2b, and so a LH or an opposite-field hitting RH would be able to leverage that situation.
The overall value of taking the extra base is certainly real. A quick way to think about it is this way: the fast runner will advance about 0.20 bases more than an average runner, when a single or double is hit, and when he is on 1b or 2b. Each base is worth about 0.25 runs. The average runner will find himself on 1b or 2b about 40% of the time (including "duplicates"). A single or double is hit about 20% of the time. So, .20 x .25 x .40 x .20 = .004 runs per PA, or almost 3 runs for a full season. Add a bit more for GB movements, and we are talking about 5 runs. That's the overall extent. However, in any single PA, the effect can be enormous or non-existent.
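The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The four rates are the ones quoted in the post; the 700 PA used for "a full season" is an assumption added here for illustration, since the post only gives the result as "almost 3 runs".

```python
# Rough value of "taking the extra base", using the rates quoted in the post.
EXTRA_BASES_PER_CHANCE = 0.20  # extra bases a fast runner gains on a single/double
RUNS_PER_BASE = 0.25           # approximate run value of one base
ON_FIRST_OR_SECOND = 0.40      # share of PA that find the runner on 1B or 2B
SINGLE_OR_DOUBLE = 0.20        # share of PA that end in a single or double
SEASON_PA = 700                # ASSUMPTION: full-season PA, not stated in the post


def extra_base_runs_per_pa():
    """Multiply the four factors to get extra runs per plate appearance."""
    return (EXTRA_BASES_PER_CHANCE * RUNS_PER_BASE
            * ON_FIRST_OR_SECOND * SINGLE_OR_DOUBLE)


runs_per_pa = extra_base_runs_per_pa()
runs_per_season = runs_per_pa * SEASON_PA
print(f"{runs_per_pa:.3f} runs per PA, {runs_per_season:.1f} runs per season")
```

With these inputs the sketch gives about .004 runs per PA and roughly 2.8 runs over 700 PA, in line with the "almost 3 runs" figure.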
For the throwing arm, you have a similar process, and you'll end up with roughly similar numbers. Of course if an OF is unduly tested, and he makes the most of it with his kills, he'll get more value than if the runners simply stayed put. (We don't talk about this in the book.)
Re: playing to the score, that was Andy's research, so I'd rather not comment on it without having the book by my side, so as to not misrepresent him.
misterdirt
03-28-2006, 01:13 PM
First, congratulations on your book, it represents a tremendous amount of work and discusses for the first time some very interesting topics. It is by far the best book to come out this year including "Baseball Between the Numbers", "The Fielding Bible", and the THT and Baseball Prospectus Annuals.
But I do think that Gary Huckabay's article in the 2006 Baseball Prospectus, in which he discusses statistical analysis with an unnamed GM, is the single most interesting read.
Even though I think that the discussions in your book are very informative and innovative, I have trouble with the conclusions being based on the Run Expectancy tables. I still feel that even though Run Expectancy tables are useful for many things, they are not appropriate for evaluating strategic decisions. The strategies that you discuss in the book are always decided on very specific variables for each situation: a specific batter facing a specific pitcher at a specific point in a game with a specific score. Run Expectancy tables are based on average values: an average batter facing an average pitcher in an average inning with an average score. Even though you mention or try to control for some of the variables in most of the analysis in your book, the situations are too complex for you to be able to control all of the relevant variables. It would seem that a proper analysis would require a much more sophisticated game simulation program than you have at your disposal.
A specific question about intentional walks. Your analysis identifies the 1 out, men on 2nd and 3rd situation as one which is a good possibility for using the intentional walk. I am using a different data set than you, 2003-2005 PBP data instead of 2000-2004, but in those three years intentionally walking the batter in that situation cost teams 125 runs, by far the biggest run-losing occurrence of any Base Out situation. The next biggest loser was walking the batter with 2 outs and a man on 2nd, which lost 23 runs. The use of the intentional walk in all other Base Out situations was either a plus for the defensive team or only an insignificant loss. Is this just a data quirk, or are managers overusing the intentional walk in the 2 situations where it would seem most likely to help them?
Tango Tiger
03-28-2006, 01:24 PM
Thank you very much for the kind words!
I should correct your claim on the RE tables. We in fact insist that Win Expectancy (WE) tables, and not RE tables, should be the driving force (a message that is noted especially in the sac bunt chapter). It's also noted very specifically in my basestealing chapter, where the breakeven points change drastically based on the inning and score. The RE tables are valuable as a starting point, but by no means are they the ending point (so I agree with you there). And certainly, the batters on deck, the relief pitchers, etc, all play a part in this. Mick, Andy, and I each have our own "sophisticated" game program, but we only used it in the book where it would add to the content.
Andy wrote the IBB chapter, so, again, I'll have to have the book next to me. I'll reply to these questions tonight or tomorrow, and ask Andy to chip in his two cents.
misterdirt
03-29-2006, 06:06 AM
I know that you use win expectancy tables, and even though they are an improvement on run expectancy tables they are also based on average teams and therefore are not very helpful in evaluating strategy. For example, win expectancy tables start with the assumption that two teams have an almost equal chance of winning a game (slightly better than 50% if you are the home team and slightly worse if you are away). But that is almost never the case. For the 16 games that Pittsburgh played against SL last year, the actual average chance that Pittsburgh had of winning each game was probably between .250 (the actual percentage they won against SL) and .355 (their chance of winning based on their overall league WP and SL's league winning percentage). Pittsburgh should use very different strategies to try and win games against SL than SL should use against Pittsburgh, but win expectancy tables would evaluate the strategies of each team in the same way if the game situations of score, out, inning, and men on base were the same. This type of analysis would lead to wrong conclusions.
Tango Tiger
03-29-2006, 06:09 AM
You are right about that. A game starting with Pedro, RJ, Clemens would have a 70% chance of winning, so the strategies against them would be different than with a triple-A pitcher.
I don't think we did a good job of explaining that. I'll have to reread the portions where that would apply, and see how well we addressed it.
Tango Tiger
03-29-2006, 08:09 AM
First off, I should correct you on the chance of Pittsburgh beating St Louis. It has nothing to do with how they actually performed against each other. It's a small sample size. A .600 team facing a .400 team will win 70% of the time.
On to the larger issue: when we did our analysis, I think we made a decent effort of explaining the "average" issue. The intro to the 9 pages of WE tables discussed that you certainly need to go beyond average.
The IBB analysis assumed average for everything, except the batters on deck. You could come up with a larger matrix, that also included the quality of pitcher on the mound, and the pitcher you have at your disposal. In the SB chapter, I think I said that teams should run more often with a great pitcher on the mound, but I'm not sure. I've said it in other places for sure (not necessarily the book).
All this to say that there are tons of variables to consider, and the book gives you the path to do that. It's a huge step up from the current analysis being done, and we still need to do the next huge step to consider the park, the opponent, your teammates, the count, and inning, score, base, out. I don't think we left anyone with the impression that you should only consider the variables we did.
misterdirt
03-29-2006, 02:16 PM
<I>First off, I should correct you on the chance of Pittsburgh beating St Louis. It has nothing to do with how they actually performed against each other. It's a small sample size. A .600 team facing a .400 team will win 70% of the time.</I>
Pittsburgh played as a .440 team last year, adjusted for their division and to a .500 opponent, and St. Louis as a .586 team. A .440 team playing a .586 team has a .355 chance of winning. I would trust that number as an average, but they did only win a quarter of their games against St. Louis, so even though it is a relatively meaningless small sample I thought that it should be mentioned as a lower bound to the range.
Tango Tiger
03-29-2006, 03:15 PM
Your .355 figure is pretty much what I get using the Odds Ratio method, if they were truly .440 and .586 teams. (They probably weren't.)
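For readers who want to check the matchup numbers, here is a minimal sketch of one common form of the Odds Ratio method. It assumes the two winning percentages are already expressed against a .500 opponent and ignores home-field advantage; the function name is made up for illustration, and this may not be exactly the calculation either poster ran.

```python
def odds_ratio_win_prob(team_wp, opp_wp):
    """Chance that a true team_wp team beats a true opp_wp team.

    ASSUMPTION: both inputs are winning percentages against a .500
    opponent, with no home-field adjustment.
    """
    team_odds = team_wp / (1 - team_wp)  # wins per loss vs. an average team
    opp_odds = opp_wp / (1 - opp_wp)
    ratio = team_odds / opp_odds         # head-to-head odds of winning
    return ratio / (1 + ratio)           # convert odds back to a probability


print(round(odds_ratio_win_prob(0.600, 0.400), 3))  # the .600 vs .400 example
print(round(odds_ratio_win_prob(0.440, 0.586), 3))  # Pittsburgh vs St. Louis
```

Under these assumptions the .600 vs .400 matchup comes out near .69 (the "70%" above) and the .440 vs .586 matchup near .357, close to the .355 quoted in the thread.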
You are correct in mentioning the other number, and their true chance of winning was between .250 and .355, though much closer to .355.
misterdirt
03-30-2006, 02:40 AM
<I>Your .355 figure is pretty much what I get using the Odds Ratio method, if they were truly .440 and .586 teams. (They probably weren't.)</I>
OOPS! I read the wrong column in my spreadsheet. How does .427 for Pittsburgh and .601 for SL sound to you? That would give Pittsburgh an average .326 chance of beating SL in the games last year.
Tango Tiger
04-07-2006, 09:54 AM
For those interested, we have another excerpt of the book here:
http://www.hardballtimes.com/main/article/pitching-around-batters
Tango Tiger
04-17-2006, 10:56 AM
Most likely the very last excerpt:
http://sportsillustrated.cnn.com/2006/baseball/mlb/04/17/thebook.excerpt/index.html
SABR Matt
04-29-2006, 11:40 AM
I just got my copy of THE BOOK...I'm specifically interested in custom and league linear weights based on the markov chain concept...hopefully there's enough information on how all of that is done that I can start to make some inferences into how to adapt the theory to the pre-PBP era.
Ubiquitous
04-29-2006, 12:46 PM
I don't believe that information is in that book. You're probably better off looking at Tangotiger's site for that. I'm thinking his essays on BaseRuns will probably show you how to do what Tango is doing better than the book. As for Markov chains, googling it is probably your best bet.
SABR Matt
04-29-2006, 01:26 PM
No...I already have a link on how markov chains work...that's not the problem. The problem is figuring out EXACTLY what input data I need to do specific things and then determining how to estimate state-to-state transition frequencies based on the frequency of annual statistics so that I can calculate BaseRuns for eras prior to 1957. I've read some of his writing on baseRuns already, but I don't believe he ever told us SPECIFICALLY how that can be done...if he has...I'll be happy to stand corrected (and informed. :) )
Ubiquitous
04-29-2006, 01:33 PM
I'm away from my resources, but I think (though I am not sure) that a book called Curve Ball (I think) might explain some of what needs to be inputted, but again I'm not sure. I seem to recall coming across some info on how to do a Markov chain for baseball, but I can't recall where. Best bet would be a series of e-mails with some of the professors who now and then get published for doing it, or of course with Tango.
Tango Tiger
04-29-2006, 03:05 PM
All you would need is the state-to-state transition matrix. I will probably post those on our site for the benefit of whoever bought the book.
As well, you could figure them out yourself. It's not that hard. For example, the state-to-state for the HR is a snap. Also fairly easy for triples and walks. Singles and doubles are a little harder, but I do have this on my site to help you out:
http://www.tangotiger.net/destmob.html
Strikeouts are straightforward. The outs are the hardest ones probably.
Tom
SABR Matt
04-29-2006, 03:21 PM
OK...so in order to do this kind of thing...you need to know:
How frequently (on average) each offensive event occurs in each base/out state. IOW, how often is a single hit with a runner at second and two outs...how often is a walk taken with the bases loaded and none out...how often does a batter K with a runner at third and one out? Etc for all base/out states.
and
How frequently does each event result in each specific state-to-state change (how often does a single...occurring with a man on first...result in that runner being on third and the batter advancing to second on the throw to third? Etc.)
and
How frequently do runners score from each base given the out state.
...
I fail to see how it is possible to generate such things for years prior to PBP.
Tango Tiger
04-29-2006, 05:32 PM
You need the first two. The third is implied by the second.
As for past years: the first one is fairly stable. Let's say 10% of PA are walks in the PBP years, with the walk happening in 8% of PA with a man on 1B and 12% of PA with 1B open. Then you go to some older year where 8% of PA are walks.... well, you could make these numbers 6.4% and 9.6%. This would be a fair estimate.
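The proportional scaling in that example can be written out as a short sketch. The function and dictionary names are hypothetical; the percentages are the ones from the post.

```python
def scale_split_rates(modern_overall, modern_splits, old_overall):
    """Scale situation-specific rates so they track an older overall rate.

    Assumes, as the post does, that the ratio between each split and the
    overall rate is stable across eras.
    """
    factor = old_overall / modern_overall
    return {situation: rate * factor for situation, rate in modern_splits.items()}


# Walk rates from the PBP years: 10% overall, split by whether 1B is occupied.
modern_walk_splits = {"man on 1B": 0.08, "1B open": 0.12}

# Older season where only 8% of PA are walks.
old_walk_splits = scale_split_rates(0.10, modern_walk_splits, 0.08)
print({k: round(v, 3) for k, v in old_walk_splits.items()})
```

This reproduces the 6.4% and 9.6% estimates given above.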
As for the second one, again, you can keep the same state/transition rates. After you apply these rates, you will end up with a final runs per inning. If this figure is less than reality, then you can assume that your second matrix was too conservative, and therefore, you bump up the rates a bit, until you get the number to match to reality. It's really just a matter of tweaking a bit.
Got it?
SABR Matt
04-29-2006, 05:46 PM
I was unaware that the frequency of events occurring in specific state situations was relatively stable. I assumed that in radically different run scoring environments, strategies would be different and the matrix would look very different for some of the events (IBB, SB, CS, SF, SH, Errors etc)
Tango Tiger
04-29-2006, 07:58 PM
Certainly they would be different from run environment to run environment. But, relative to the 24 base-out states, they would be the same.
For example, say there were 3 times more SB attempts in the 80s than today (as an illustration). It's likely that the % of SB with 0, 1, 2 outs stays the same (27%, 33%, 40%). It's likely that the % of throwing errors per SB is relatively the same.
When you create a model, you make a best-guess estimate, and you then test the model to determine how many runs your model creates, and the % of times that 0,1,2, etc runs scored per game. If your best-guess estimate results in an outcome that matches reality, then you probably have a sound model.
Pete Palmer just used a handful of play-by-play games from the World Series in the 50s to create his model. And it works.
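To make the "build a model, then test it against reality" idea concrete, here is a toy sketch of a base-out Markov chain. Everything numeric in it is a placeholder: the event probabilities are hypothetical, the same probabilities are used in every base-out state, and the advancement rule (every runner moves exactly as many bases as the hit) is a deliberate simplification rather than the transition matrix from the book or the site.

```python
from itertools import product

# HYPOTHETICAL per-PA event probabilities (they sum to 1); not from the book.
EVENT_PROBS = {"out": 0.68, "walk": 0.09, "single": 0.15,
               "double": 0.05, "triple": 0.005, "hr": 0.025}
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "hr": 4}


def walk(bases):
    """Batter to first; runners move up only when forced."""
    on1, on2, on3 = bases
    if not on1:
        return (True, on2, on3), 0
    if not on2:
        return (True, True, on3), 0
    if not on3:
        return (True, True, True), 0
    return (True, True, True), 1  # bases loaded: a run is forced in


def hit(bases, n):
    """SIMPLIFICATION: every runner advances exactly n bases on an n-base hit."""
    runs = 0
    new = [False, False, False]
    for i, occupied in enumerate(bases):  # i = 0, 1, 2 for 1B, 2B, 3B
        if occupied:
            if i + n >= 3:
                runs += 1
            else:
                new[i + n] = True
    if n == 4:
        runs += 1          # batter scores on a home run
    else:
        new[n - 1] = True  # batter ends up on base n
    return tuple(new), runs


def run_expectancy(probs, sweeps=200):
    """Expected runs to the end of the inning from each of the 24 base-out states."""
    states = [(outs, bases) for outs in range(3)
              for bases in product([False, True], repeat=3)]
    re = {s: 0.0 for s in states}
    for _ in range(sweeps):  # repeated sweeps converge because outs end the inning
        updated = {}
        for outs, bases in states:
            total = 0.0
            for event, p in probs.items():
                if event == "out":
                    nxt, runs = (outs + 1, bases), 0  # SIMPLIFICATION: runners hold
                elif event == "walk":
                    new_bases, runs = walk(bases)
                    nxt = (outs, new_bases)
                else:
                    new_bases, runs = hit(bases, HIT_BASES[event])
                    nxt = (outs, new_bases)
                total += p * (runs + re.get(nxt, 0.0))  # 3 outs: no more runs
            updated[(outs, bases)] = total
        re = updated
    return re


EMPTY = (False, False, False)
re_table = run_expectancy(EVENT_PROBS)
print(f"Runs per inning (0 out, bases empty): {re_table[(0, EMPTY)]:.3f}")
```

The value for 0 out, bases empty is the model's runs per inning. If it comes out below the real figure for the era being modeled, the advancement rules are too conservative and can be bumped up until the model matches, which is the tweaking step described earlier in the thread.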
Tango Tiger
05-01-2006, 09:08 AM
I have an article up here:
http://www.hardballtimes.com/main/article/crucial-situations
Tom
SABR Matt
05-01-2006, 11:16 AM
Wow...that is a fascinating read Tango...how do you propose we use the Leverage Index? What can we do with it in the analysis of, say...the value of a relief pitcher?
Tango Tiger
05-01-2006, 11:49 AM
Your book should be arriving in a few days... you'll see!
SABR Matt
05-01-2006, 01:05 PM
LOL...cool, so you discuss the LI in the book?
I look forward to reading it.
Tango Tiger
05-01-2006, 01:40 PM
I discuss leverage in as non-mathematical a way as I could. So, I establish the LI levels with a 3-run, 2-run, 1-run lead in the 8th and 9th innings, as well as how often teams used their ace in a low-leverage situation. Answer: A LOT!
SABR Matt
05-01-2006, 02:01 PM
Oh I definitely agree that closers are badly badly misused...I knew that intuitively before any math got involved...
but can it be said that an LI of 10 is 5 times more important to a game than an LI of 2 or is it just an ordinal comparative tool and not a linear model?
Tango Tiger
05-01-2006, 02:12 PM
Absolutely that 10 has 5 times more impact than 2. Reread the article a few times until you believe it. That's the whole point of LI: it measures the spread of possible outcomes... the gaps. If one has a gap level of 10, and another of 2, then the spread is 5 times wider, and therefore, the events will have 5 times the impact.
SABR Matt
05-01-2006, 02:20 PM
Excellent!
I just wanted to make sure that was true. I think this is outstanding work and can be used to more accurately analyze the TRUE value of good relief pitching and thus fairly rate relievers on the same level as starters (or at least...as close to the same level as is representative of their true worth)
Tango Tiger
05-08-2006, 11:23 AM
My Leverage Index has now been incorporated here:
http://www.fangraphs.com/wins.aspx?date=2006-05-03&team=Blue%20Jays&dh=0
You can also look at each team's season-to-date totals.
Lou Diamond
05-08-2006, 01:26 PM
Tom, I just purchased "the book". So far, it's an interesting read.
SABR Matt
05-08-2006, 04:22 PM
Hey Tom...just wanted to let you know I really appreciate the hard work you do...I've read the first four chapters in THE BOOK and thus far really enjoyed it and learned something too.
I'm taking a look through the LI stats you just posted...I really think this LI research is groundbreaking and VERY important.
SABR Matt
05-08-2006, 04:29 PM
This is really fascinating data...I wish I had your database...
Tango Tiger
05-08-2006, 08:35 PM
Thanks guys, I appreciate the support.
***
The data at Fangraphs is a deal between David and BIS. I have no involvement. I licenced my WE and LI data to David so that he can generate the WE/LI charts.
SABR Matt
05-18-2006, 02:51 AM
Question Tango...
In the Toolshed section of your book (I was rereading it today), why do the starting and ending win probabilities of each offensive event in the table on runs per win all well exceed .500 except the defensive indifference play? Seems to be suggesting that every team from 1999-2002, when it was on offense...had a better than 50% chance of winning.
Hey Tango, did you get my PM?
Tango Tiger
05-18-2006, 08:27 AM
Matt, good question. I've never thought about it too much, but I think it has to do with the "left on base" and "3 out" situation. On average, you will have 1 man left on base per inning, so you are leaving runs (or wins) on the table. So, on average you have to have a better than .500 chance of winning. If I included an "extra" PA (the one that wipes off the runners from the base), we would end up with an overall exactly .500 chance of winning.
SABR Matt
05-18-2006, 10:03 AM
Interesting...
I thought that all 2-out plays (the ones that conclude innings) were recorded though in this table...shouldn't the after-play win expectancy have dropped overall to near .500?
Tango Tiger
05-18-2006, 11:02 AM
Yes, they are included. All plays are included. The "starting WE" is the WE at the start of the PA. What I would want to add to complete this is the starting WE for the first batter of the next half-inning, since this will clear out the runners left on base of the previous half-inning. I think this is what I want, but I haven't thought about it too much.
misterdirt
05-18-2006, 11:55 AM
The starting value for the next half inning is what I used for my win expectancy table.
AstrosFan
05-18-2006, 03:16 PM
My copy arrived in the mail at home, and I can't read it, 'cause I'm stuck here at summer school.
Lou Diamond
05-18-2006, 03:31 PM
Just finished reading Tom's book. Fantastic stuff man.
Tango Tiger
05-18-2006, 03:35 PM
<I>Just finished reading Tom's book. Fantastic stuff man.</I>
Cool, thanks!
SABR Matt
05-18-2006, 04:50 PM
Yeah Tom...if you flushed out the LOBs with the team's WE after the inning completes...you would get a perfect .500 when totalled up I suspect.
Lou Diamond
05-18-2006, 09:26 PM
Tom, aren't you involved with the NHL too ?
Tango Tiger
05-19-2006, 10:59 AM
Yes, I do some work as a subcontractor for some teams. I can't really say any more than that.
digglahhh
05-31-2006, 11:04 AM
Here's a question for you, Tango.
I'm not completely finished with the book, but I just wanted to make a point about the RE tables.
I don't recall you saying this specifically, but the data, as well as common sense would seem to suggest it. It stands to reason that there are times when the overall RE would lessen, but the likelihood of scoring, period, would increase.
For example, in the sac bunt chapter, there is a lot of talk about the 1 out, runner on second state as having a lower RE than the runner on 1st with no out state. Isn't it possible though, that the likelihood of scoring in general increases with runner at 2nd and 1 out, though the likelihood of scoring MULTIPLE runs decreases?
In fact, isn't this kind of the subtext of the whole high and low offensive environment issue? When it is less likely, in general, to score multiple runs (to string together numerous favorable offensive events) strategies that may help to scratch out runs (at the expense of potential big innings) become better choices.
My intuition would lead me to believe that such a phenomenon should exist in certain situations. I think that, in fact, that has to do with why sometimes the run expectancies in two states, relative to each other, seem counterintuitive. How about the rate at which we can expect that specific runner to score? Would it be different from the overall tables sometimes?
SABR Matt
05-31-2006, 11:18 AM
I had similar thoughts, digglahhh. The bunt that moves the runner from first to second and gives up an out is trading guaranteed baserunner advancement for the POTENTIAL that the baserunner will advance. If you have confidence in the ability of the next hitter to get that single...it might still be worth it.
Ubiquitous
05-31-2006, 11:35 AM
BP did an article on this a few years back, and they showed the % of scoring just one run by base/out situation. If I recall correctly it still didn't look like a good play, but I think it was a simple study, one that didn't factor in other variables like errors and runner advancement through other means.
There will always be situations and scenarios in which averages will not apply, and in which one should do the opposite of what the RE tells you to do.
Tango Tiger
05-31-2006, 01:25 PM
What you want, ALWAYS, is win expectancy (WE). And you want the win expectancy based on the exact context you are in.
Run expectancy (RE) does a good job in standing-in for WE in most situations. It's in the late and close games where you should forget about RE and concentrate on WE. However, WE is based on RE. So, properly manipulating RE will get you WE.
I do provide the "chance of scoring" in Tables 8 and 9 I believe. You can see for yourself how often you will score 0,1,2,3,4,5 runs from any base-out state. You can bump them up or down based on whatever your current context is.
In short, establish the chances of winning given the context, and use the average-based RE/WE tables as a starting point.
So, I'm not really sure what the issue is.
digglahhh
06-01-2006, 03:40 PM
There's no issue. I was just making an observation.
According to table 8:
1B, 0 out- you will go scoreless 55.7% of the time.
2B, 1 out- you will go scoreless 58.6% of the time.
So right there, it looks like the bunt is a bad idea, even when trying to get only one run and accepting the one run despite downplaying the potential for a big inning.
It is interesting to note though, that the likelihood of scoring 1 run is almost 6% higher in the runner on 2nd, 1 out case than the runner on first, 0 out case.
The RE difference is big, almost .23 between the two states. But the likelihood of scoring itself shifts only 3% between the two. Taking into consideration your players due up and so on and so forth, that 3% can possibly disappear. So, if you are "playing for one," say in extra innings, the bunt could be a good play, depending on the players involved.
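The 3% figure follows directly from the two scoreless rates. A quick sketch, using only the two Table 8 numbers quoted in this post (the dictionary layout is just for illustration):

```python
# Chance of going scoreless for the rest of the inning, as quoted from Table 8.
P_SCORELESS = {"1B, 0 out": 0.557, "2B, 1 out": 0.586}

# Chance of scoring at least one run is the complement.
p_score = {state: 1 - p for state, p in P_SCORELESS.items()}

# How much the successful sac bunt costs in chance of scoring at all.
gap = p_score["1B, 0 out"] - p_score["2B, 1 out"]

for state, p in p_score.items():
    print(f"{state}: score at least once {p:.1%} of the time")
print(f"Gap: {gap * 100:.1f} percentage points")
```

That works out to 44.3% versus 41.4%, a gap of about 2.9 percentage points for average players, which is the margin the hitters due up would have to make up.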
I know that you talk about this and, again, I give you credit for reminding the reader that the percentages increase and decrease based upon the quality of the players involved.
A good thing to take from these tables is that, seeing as how you have to attempt things that may often be bad percentage plays just for their threat to remain credible, one can select the spots in which they "hurt you the least" to attempt them.
SABR Matt
06-01-2006, 07:15 PM
The missing element in the book is context.
In an average situation, it's a bad idea to bunt. If the batter you're bunting with is...say..Yuniesky Betancourt, and the next man up is ICHIRO...it's probably a very GOOD idea.
Tango Tiger
06-01-2006, 09:14 PM
Pages 253-257 show the RE from a weak and strong batter at the plate.
Pages 260-261 deal with the on-deck hitter, with the statement "sac more often with a low-walk, low-OBP hitter on deck".
The Book goes on to discuss the speed and proficiency of the batter as well as the speed of the baserunner, the count, the pitcher bunting, close and late games.
Which element is missing?
SABR Matt
06-01-2006, 10:23 PM
Yeah I know Tango...I was referring to the basic tables this discussion was talking about...not the whole big sac bunt chapter. :)
digglahhh
06-02-2006, 07:38 AM
<I>Pages 260-261 deal with the on-deck hitter, with the statement "sac more often with a low-walk, low-OBP hitter on deck".
The Book goes on to discuss the speed and proficiency of the batter as well as the speed of the baserunner, the count, the pitcher bunting, close and late games.
Which element is missing?</I>
Right, right, which is why I said that given the right context the likelihood of scoring in the RE tables on pg. 8 could be trumped by the situation.
Though you will go scoreless in the 1 out, runner on second state, 3% more often than the runner on first, no out state, if you have a speedy runner and the right type of hitter at the plate, you probably have a better chance of scoring, period, than if you would be able to revert to the runner on 1st, no out state. That's all I was saying.
Tango Tiger
06-02-2006, 11:18 AM
dig: I didn't have any issue with anything you said. I was directing my question to Matt's "missing element". I think we're all on the same page now.
SABR Matt
06-02-2006, 12:40 PM
Yeah...I was only referring to the element missing from the %chance of scoring tables at the front of the book...you covered the basics well in the main chapter.
misterdirt
06-02-2006, 05:18 PM
Tango - I think the problem, if there is one, is not a problem of missing elements but of misplaced emphasis. As you correctly point out, you mention the necessity of context on page 32 and then deal with some contextual situations in the various in-depth chapters. But the main focus of the first chapter is the construction of the RE and WE tables. Context is mentioned in the one sentence on page 32 followed by an introduction of the WE charts which are described as the "long sought after charts." So, for the average reader, the emphasis seems to be on how useful the charts will be and not on how necessary it is to establish the proper context to make a good evaluation of a particular strategy.
Now you and your fellow authors don't make the mistake of using the RE or WE charts without establishing the proper context. You are always very good at stating the parameters of the particular situation that you are evaluating and modifying the charts to properly evaluate that situation. You are so scrupulous about doing this that I can find no instance in the entire book where you actually use the RE or WE charts unmodified to reach any of your conclusions. But you have hyped the charts so much in the first chapter that readers want to be able to use them, and they don't have the tools necessary to make the modifications like you do, which seems to lead to some of the questions like digglahhh's.
Tango Tiger
06-02-2006, 09:31 PM
dirt: Thanks, and I agree to a point.
MGL, Andy and I used event files that are available from Retrosheet. All our work is reproducible (if you are willing to spend hundreds of hours). At the very least, a reader should be able to stop someone else cold in his tracks if he tries to misuse the RE and WE charts.
digglahhh
06-02-2006, 10:35 PM
Tango, did you find problems with the events files?
I found some, I made some lengthy posts about them over the winter.
But just quickly.
Some teams have tons of ground rule doubles while others have none.
Many of the hits are without trajectory or location.
Miscategorized double plays, double plays that were not really GDPs labeled as such...
Maybe tomorrow, I'll try to dig up my posts about it. I work for MLB.com and was involved in a project to try to convert the event files from Retrosheet into our code format in order to eventually archive the data on our site as well. I started noticing some funny stuff about data from team to team.
SABR Matt
06-02-2006, 11:28 PM
Tango didn't deal with ball trajectories in his book as far as I'm aware except in the platoon chapter when he asked about groundball and flyball batters vs. groundball and flyball pitchers. And the trajectory data is closer to complete the further into the modern era you go...but yes, there's a LOT of missing data before the 1990s and even still some after.
I never noticed the ground-rule double thing, but I suspect differences in ground-rule double percentages have to do with each park's eccentricities and its ground rules.
Tango Tiger
06-03-2006, 07:30 AM
dig: I think the ground-rule double is what Matt said. I didn't use GIDP, per se, just the starting/ending base states. You should report any problems to the Retrolist group. If those problems exist, then wouldn't we see an imbalance between the GIDP totals of Retro against the official source?
SABR Matt
06-03-2006, 09:20 AM
Yes...we likely would. BTW, as far as I'm aware, if you LINE out to second and it turns into a DP, the official offensive event scoring that goes on your permanent record is GIDP...in your seasonal batting line you get credit for causing a double play grounder even if it's a line-out I think.
digglahhh
06-03-2006, 03:35 PM
Well, lined-into DPs are not counted as GIDPs statistically.
The problem Tango, is that, as far as I'm aware, the event files have been composed from multiple sources.
Tango Tiger
06-03-2006, 06:31 PM
But I'm talking about the aggregate of those event files. Are you saying that if Damon has 8 GIDP, that Retrosheet will have GIDP from 8 games, but that those are not necessarily the correct 8 games?
But then you'd have the problem on the pitcher side as well.
If the aggregate matches to the hitter and pitcher (or team) totals, then that's a strong (though not bulletproof) indication that things were done correctly, especially true for "uncommon" events like GIDP, SB, etc.
digglahhh
06-03-2006, 07:39 PM
Tango, I'm going to take this to a PM if you don't mind.
Tango Tiger
06-12-2006, 02:16 PM
I'm pretty sure Matt will get a kick out of this, and anyone else interested in the Run Expectancy matrix:
http://www.insidethebook.com/ee/index.php/site/article/the_secret_recipes_of_the_run_expectancy_matrix/
Feel free to post your comments on that thread, or here.
AstrosFan
06-12-2006, 04:09 PM
Tom, I understand everything except the 30% chance of scoring, on average. Where did that figure come from?
Tango Tiger
06-12-2006, 05:34 PM
At this point, it's not important. You could have used 20% or 40%, and everything still would have worked out the same.
However, it will become important in the next step, and I will describe where it comes from. In short, (R-HR)/(H-HR+BB+HBP) = .30
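The ratio in the post above can be sketched in a few lines of Python. Only the formula comes from the thread; the function name `scoring_rate` and the season totals are made up for illustration and are not real team or league figures.

```python
def scoring_rate(r, hr, h, bb, hbp):
    """Share of non-HR baserunners who score: (R - HR) / (H - HR + BB + HBP)."""
    runs_by_baserunners = r - hr       # drop the run each HR batter scores himself
    baserunners = h - hr + bb + hbp    # times reached base, excluding homers
    return runs_by_baserunners / baserunners

# Hypothetical season totals, for illustration only.
rate = scoring_rate(r=750, hr=150, h=1450, bb=550, hbp=50)
print(round(rate, 3))
```

With real league totals plugged in, the result should land near the .30 quoted in the post.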
AstrosFan
06-12-2006, 10:10 PM
Thanks. I'm looking forward to reading about the rest of the study.
SABR Matt
06-12-2006, 10:33 PM
Wow...please do continue with this bit of logic...I'm very curious to see how this turns out...because I will probably adapt something like this to construct my RE matrices for seasons without PBP
Tango Tiger
06-13-2006, 02:10 PM
Wow...please do continue with this bit of logic...I'm very curious to see how this turns out...because I will probably adapt something like this to construct my RE matrices for seasons without PBP
Matt, and just to show how even having thousands of plays is not enough: in AL, 1964, the RE with man on 2b and 2 outs is .325, while with man on 3B, it's .292
(The NL that year is .312, .394. From 1960-2004, it was .229, .276, respectively)
The best model is one grounded in logic, or through a Markov chain.
SABR Matt
06-13-2006, 02:21 PM
Yep...there is definitely error associated with just the empirical data from season to season. I recall in your book how you showed that you couldn't even get the LW of a HR with no one on base correct using the empirical data...not perfectly so anyway.
SABR Matt
06-13-2006, 05:28 PM
Perhaps you could explain to me the logical basis for the 3/2/1 rule, Tango. I'm sure there is one...but I'd like to see it fleshed out so I can put a period on the sentence that is that assertion.
Tango Tiger
06-15-2006, 03:39 PM
The short answer is that you have 4.5 batters to the end of the inning when you have 0 outs, 3 batters with 1 out, and 1.5 batters with 2 outs. The fleshing out will be done on my site near the end of the series. (Btw, I updated the thread on my site with more info.)
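One way to read those numbers, shown here only as the arithmetic behind the "3/2/1" label and not as the full derivation promised for the site: dividing the expected batters remaining by the 2-out figure gives weights of 3, 2 and 1. The dictionary names are illustrative.

```python
# Expected batters remaining in the inning, as stated in the post above.
batters_left = {0: 4.5, 1: 3.0, 2: 1.5}

# Express each out state relative to the 2-out case.
weights = {outs: n / batters_left[2] for outs, n in batters_left.items()}
print(weights)  # {0: 3.0, 1: 2.0, 2: 1.0}
```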
Ubiquitous
06-15-2006, 06:57 PM
Okay Tango I gave it a whirl with the 2005 Cubs. Everything seems to make sense except I added a few things.
For instance, with a player on third, not only did I consider the error rate, but I also factored in the WP and PB rates along with balks. Secondly, I found that a walk did not occur for the Cubs around 10% of the time but somewhat less often, though the effect was so small it wasn't going to make a difference. But the WP and PB part does change the numbers.
The other thing is I went ahead and added in reached second on an error for the Cubs. It happened only once, and they actually got to third twice.
So anyway here is the 2005 Cubs RE:
Outs:      0       1       2
BE       0.488   0.267   0.104
1st      0.86    0.515   0.228
2nd      1.115   0.685   0.313
3rd      1.339   0.938   0.393
1st2nd   1.487   0.933   0.437
1st3rd   1.711   1.186   0.517
2nd3rd   1.966   1.356   0.602
BasFull  2.338   1.604   0.726
Ubiquitous
06-15-2006, 07:38 PM
Oh, I also forgot to mention that I didn't use the assumption of a 50% chance of scoring from third with less than 2 outs. I used the actual Cubs data: the Cubs scored from third with less than 2 outs 68 times out of 150 outs.
And I was wondering if it was correct that a man on third with one out was more valuable than men on first and second with one out?
Ubiquitous
06-15-2006, 07:57 PM
I came up with the Cubs having a .289 chance of scoring from third with 2 outs. The average was .268, and error and WP/PB was .021 plus negligible walks, so it comes out to .289. I realize that by adding the WP/PB to only third base, it is the only one getting inflated, so now I can understand why it is possible for it to be higher than 1st and 2nd. With 1 out it is .671, and with no outs it is .851.
If I ignore WP/PB, it goes down to .012 for that part, for a total of .280/.663/.842 as compared to .289/.671/.851, and that would knock third base with 1 out below 1st and 2nd with 1 out.
misterdirt
06-15-2006, 10:54 PM
0 1 2
BE .482 .266 .109
1st .741 .501 .245
2nd 1.183 .749 .318
3d 1.615 .857 .214
1st2nd 1.338 .801 .369
1st3rd 1.382 1.095 .250
2nd3rd 1.958 1.127 .464
BasFull 1.813 1.347 .587
This is the table I got for 2005 from the actual empirical data. Pretty darn close to what you got, considering that actual occurrences for some of the base-out states are only in the 20s. I am not sure what kind of study would benefit from having REs on a team basis rather than a league basis or even a multiple-year basis, given the sample size errors inherent in the smaller data set.
misterdirt
06-15-2006, 10:55 PM
Damn. How do you get the table to print nice and straight?
Ubiquitous
06-15-2006, 11:03 PM
Outs:      0       1       2
BE        .482    .266    .109
1st       .741    .501    .245
2nd      1.183    .749    .318
3d       1.615    .857    .214
1st2nd   1.338    .801    .369
1st3rd   1.382   1.095    .250
2nd3rd   1.958   1.127    .464
BasFul   1.813   1.347    .587
You mean like that?
I take it that is the Cubs RE?
Ubiquitous
06-15-2006, 11:05 PM
I see based on the empirical data that, indeed, for the Cubs a man on third with one out was worth more than men on 1st and 2nd with one out, and it looks that way as well with no outs.
Question for you: is there an easy way of doing it the empirical way? Meaning, I have the PbP data for the Cubs in 2005, and I hesitate to go the empirical route since I have to believe there is an easier way to manipulate the data than by setting up filters in Access and counting manually from there.
Ubiquitous
06-15-2006, 11:09 PM
Side by side, empirical on the left, Tango's on the right
Outs:      0       1       2     | Outs:      0       1       2
BE        .482    .266    .109   | BE       0.488   0.267   0.104
1st       .741    .501    .245   | 1st      0.86    0.515   0.228
2nd      1.183    .749    .318   | 2nd      1.115   0.685   0.313
3d       1.615    .857    .214   | 3rd      1.339   0.938   0.393
1st2nd   1.338    .801    .369   | 1st2nd   1.487   0.933   0.437
1st3rd   1.382   1.095    .250   | 1st3rd   1.711   1.186   0.517
2nd3rd   1.958   1.127    .464   | 2nd3rd   1.966   1.356   0.602
BasFul   1.813   1.347    .587   | BasFull  2.338   1.604   0.726
Ubiquitous
06-15-2006, 11:14 PM
I am not sure what kind of study would benefit from having REs on a team basis rather than a league basis or even a multiple year basis given the sample size errors inherent in the smaller data set
Well, wouldn't the different runs in the RE be a reflection of that team? Granted, some of the smaller data points would have to be taken with a grain of salt. But couldn't one look at a team as a whole with, say, a runner on first and then compare that to other teams' REs with runners on first? See if one team got more or less out of this situation and then from there try to find out why. Isolate why one team had a higher RE than another team in that situation, whether because of speed, power, or something else.
misterdirt
06-16-2006, 07:14 AM
Gosh, the charts look so pretty when you do them!
Well, wouldn't the different runs in the RE be a reflection of that team? Granted, some of the smaller data points would have to be taken with a grain of salt. But couldn't one look at a team as a whole with, say, a runner on first and then compare that to other teams' REs with runners on first? See if one team got more or less out of this situation and then from there try to find out why. Isolate why one team had a higher RE than another team in that situation, whether because of speed, power, or something else.
The RE tables would reflect the individual team's efforts but in terms of analysis I don't think they would tell you more than looking at production from each line-up slot. If you know that the #1 and #2 hitters are getting on base a lot and the #3 and #4 hitters are getting more than average extra base hits you have a pretty good idea how a team is scoring its runs. Learning team chemistry from the RE tables is tougher. Take for example the 70 times that the Cubs had a man on 3d with 1 out last year. I have them scoring 60 times. But you have no idea whether they are scoring because the next batter hit a single or a home run. Or a sacrifice fly or a bunt. Or the next three batters walked.
Question for you: is there an easy way of doing it the empirical way? Meaning, I have the PbP data for the Cubs in 2005, and I hesitate to go the empirical route since I have to believe there is an easier way to manipulate the data than by setting up filters in Access and counting manually from there.
Easy is in the mind of the beholder. I use Access, and add a base outs state field to every event in my EventsGeneral table. I also add a runs scored in the rest of inning field. With those, a single query can sum the runs scored in rest of inning grouped by base out state. The work is in adding the fields.
Ubiquitous
06-16-2006, 07:43 AM
For the fields, that makes sense. So what do you do? You create a field and then link it to the outs field and the runners on base as well? How does that work? Would you have to set up a separate query to do a count of runners on base and then link it back? Then how do you also add the outs to it, or does it simply become a code? Like 1 would be BE, no outs; 2 would be BE, 1 out; 3 would be BE, 2 outs; 4 would be runner on 1st, no outs; and so on. Either way, how is that achieved? Another question is what exactly is measured. You say you create a base/out number for every play; is that the base/out situation before the play or after the play? Meaning, if the play is a single and the bases are empty with 1 out, would the number in the box be BE, 1 out or runner on 1st, 1 out?
Thinking about it more,
I'm guessing to create the field you have to take it to Excel, right? Create the formula there for base/out and runs scored to the end of the inning and then bring it back, right? If that is the way, I think I could do that pretty easily, so I guess the main question would be my last one in the first paragraph above. Which base/out situation do you use? The one before the play or the one after?
Ubiquitous
06-16-2006, 08:00 AM
Take for example the 70 times that the Cubs had a man on 3d with 1 out last year. I have them scoring 60 times.
I guess that refutes the belief that the Cubs couldn't score from third base last year.
How did you get 60 times out of 70 with man on third 1 out? I got 31 times in 73 situations. That is in just that base/out situation the runner scored 31 times before the base/out situation changed. Is that how it is done or does any and all scoring even after the base/out situation changes count towards man on third 1 out? Meaning if you have a man on third with 1 out the batter K's and then the next batter gets a single that run counts toward man on third 1 out but yet it doesn't add another opp? We only count opps of when they initially make it to the base/out situation, and don't lock the result to just the very next play but whatever happens to the end of the inning?
Tango Tiger
06-16-2006, 08:09 AM
Guys,
I just want to say I'm impressed with the initiative and the great work being done. I'll reply to each point made shortly.
Tom
Tango Tiger
06-16-2006, 08:22 AM
For instance with a player on third not only did I consider the error rate but I also factored in WP and PB rate along with a balk.
Ub, certainly you can add as much as possible. I was getting worried about putting in "too much", but then again, maybe I shouldn't. Maybe I'll break out my state-transition matrix for each event for 1999-2002, and present those numbers. Then, I can leave it to the reader how he wants to create his own RE matrix. When I created my Markov RE charts for the book, I did in fact take all possible events and transitions.
I used the actual Cubs data which was the Cubs scored from third with less than 2 outs 68 times out of 150 outs.
Yes, you should use whatever information you do have on hand. After all, we are trying to construct a matrix that shows how the runs did score. You can see how it might be quite difficult if you use a very small number of games. So, not only are you thinking "how did they score", but you have to ask, "if they continue at this pace". When you've got 6000 PA in a season, that's a reasonable thing to agree to. But in terms of scoring from 3B with less than 2 outs, you would need some regression. A lot of it is park-dependent, so it would be nice to calculate this by park.
BE    .482   .266   .109
1st   .741   .501   .245
2nd  1.183   .749   .318
3d   1.615   .857   .214
As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
I am not sure what kind of study would benefit from having REs on a team basis rather than a league basis or even a multiple year basis given the sample size errors inherent in the smaller data set
This was answered quite well by Ub. The chances of scoring are dependent on the whole state-transition matrix. And those are not static. If I added an additional parameter, quality of batter, you will see how the RE matrix would balloon. Right now, we are assuming that it's always an average team batter at the plate. On my site, in the "Walking Bonds" blog entry, I show you how to change the win probability table based on who the batter is. A lot easier said than done.
How did you get 60 times out of 70 with man on third 1 out? I got 31 times in 73 situations.
As discussed, we are talking about the chance of that runner scoring, at all. So, that means to the end of the inning. If you go through my calculations, I show you how to figure out the chance of scoring for that particular PA, and then for the rest of the inning. So, it would be something like a 50% chance of scoring right there and then, and then, of the 50% of the time that he doesn't score, he's got a 30% chance of scoring later in the inning. So, .50 + .50*.30 = .65 (more or less).
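The arithmetic in that last step can be written out as a tiny function. This is only a sketch of the calculation quoted above; the function and parameter names are illustrative.

```python
def chance_runner_scores(p_this_pa, p_later):
    """Chance the runner scores now, or later in the inning if he does not score now."""
    return p_this_pa + (1.0 - p_this_pa) * p_later

print(round(chance_runner_scores(0.50, 0.30), 2))  # 0.65
```

This is also why a count limited to the very next play comes out lower than a count that runs to the end of the inning.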
Tango Tiger
06-16-2006, 08:27 AM
Oh, and as for how to figure out the empirical RE, that was also answered, and I do the same thing.
Create a view/query to generate a table that gives you:
game,half-inning,r
Something like:
-- total runs scored in each half-inning
create table InningRuns as
select game, half_inning, sum(r) as inning_runs
from events
group by game, half_inning;
(Half-inning could be inning,team, or inning,homeVisitor, or whatever. I convert the top/bottom inning into a number from 1 to 18 for a 9-inning game.)
Once you have that, you do a join of your events table to the InningRuns table. You already know how many runs have already scored from the events table, and you know how many did score for the inning in InningRuns table. The difference is the number of runs scored from that point to the end of the inning.
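A minimal runnable sketch of that join, using Python's built-in `sqlite3` instead of Access. The schema is an assumption made for illustration: one row per play, with `r` as runs scored on the play and `runs_before` as runs already scored in the half-inning before the play. The four sample plays are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one row per play.
con.execute(
    "CREATE TABLE events "
    "(game TEXT, half_inning INTEGER, seq INTEGER, r INTEGER, runs_before INTEGER)"
)
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("G1", 1, 1, 0, 0),  # walk
        ("G1", 1, 2, 0, 0),  # double play
        ("G1", 1, 3, 1, 0),  # home run
        ("G1", 1, 4, 0, 1),  # strikeout
    ],
)

# Total runs per half-inning.
con.execute("""
    CREATE TABLE InningRuns AS
    SELECT game, half_inning, SUM(r) AS inning_runs
    FROM events
    GROUP BY game, half_inning
""")

# Join back to get runs scored from each play to the end of the inning.
rows = con.execute("""
    SELECT e.seq, i.inning_runs - e.runs_before AS runs_rest_of_inning
    FROM events AS e
    JOIN InningRuns AS i
      ON i.game = e.game AND i.half_inning = e.half_inning
    ORDER BY e.seq
""").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 1), (4, 0)]
```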
misterdirt
06-16-2006, 09:27 AM
Something like:
-- total runs scored in each half-inning
create table InningRuns as
select game, half_inning, sum(r) as inning_runs
from events
group by game, half_inning;
Similar to what I do but I find it easier to use the max function on home score and visitor score grouped by inning and game. I then link that to the EventsGeneral table.
I do everything in Access. There is no need to go to Excel for this. For base-out state I query the EventsRunners table for each on-base situation individually. There is probably a way to use nested if functions to do it in one query, but I find nested ifs complicated to do in Access. I code a man on first as 1000, a man on second as 200, a man on third as 30. Outs are in the units column. These can be added together to create a single code for each base-out situation. Men on first and third with two outs would be 1032, etc. Makes things very readable and easy.
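The coding scheme described above fits in a single expression. A sketch in Python rather than Access; the function name `base_out_code` is made up for illustration.

```python
def base_out_code(on_first, on_second, on_third, outs):
    """Encode a base-out state: 1000 for first, 200 for second, 30 for third, outs in the units digit."""
    return 1000 * on_first + 200 * on_second + 30 * on_third + outs

print(base_out_code(True, False, True, 2))    # 1032: first and third, two outs
print(base_out_code(True, True, True, 1))     # 1231: bases loaded, one out
print(base_out_code(False, False, False, 0))  # 0: bases empty, no outs
```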
Run Expectancy counts all runs scored in the inning on all plays subsequent to the base-out situation being evaluated, not just on the following play. For example, a batter walks to lead off an inning. Base-out situation is 1000. The next batter up grounds into a double play. Third batter homers. Fourth batter strikes out. For this inning the base-out states and run scored totals would be base-out state 0, 1 run, 1 occurrence; BoS 1000, 1 run, 1 occurrence; BoS 2, 1 run, 2 occurrences.
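The bookkeeping for that example inning can be checked with a short script. This is a sketch in Python, not the Access queries described in the post; the variable names are illustrative, and each play is recorded with the base-out code in effect before the play.

```python
from collections import defaultdict

# (base-out code before the play, runs scored on the play) for the inning
# described above: walk, double play, home run, strikeout.
plays = [(0, 0), (1000, 0), (2, 1), (2, 0)]

inning_runs = sum(runs for _, runs in plays)

totals = defaultdict(lambda: [0, 0])  # code -> [runs to end of inning, occurrences]
scored_so_far = 0
for code, runs in plays:
    totals[code][0] += inning_runs - scored_so_far
    totals[code][1] += 1
    scored_so_far += runs

for code, (runs, n) in sorted(totals.items()):
    print(code, runs, n, runs / n)
```

Dividing total runs by occurrences for each code, across every inning in the data set, gives the empirical RE for that state.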
This was answered quite well by Ub. The chances of scoring are dependent on the whole state-transition matrix. And those are not static. If I added an additional parameter, quality of batter, you will see how the RE matrix would balloon. Right now, we are assuming that it's always an average team batter at the plate. On my site, in the "Walking Bonds" blog entry, I show you how to change the win probability table based on who the batter is. A lot easier said than done.
I certainly can see the benefit from having a working Markov chain RE where you can plug in a hypothetical situation and do a quick "what if" study. I still question whether Ub can glean much benefit from individual team REs. If he uses empirical data he has the small sample problems that we have already discussed. If he uses your method, which is kind of a question and answer Markov, or a Markov without matrix algebra, there are still some small sample problems plus you have supplied some of the transition information from large data sets that may not apply to the actual team being studied or even the wider range of run environments found at the team level.
Tango Tiger
06-16-2006, 10:19 AM
I certainly can see the benefit from having a working Markov chain RE where you can plug in a hypothetical situation and do a quick "what if" study. I still question whether Ub can glean much benefit from individual team REs. If he uses empirical data he has the small sample problems that we have already discussed. If he uses your method, which is kind of a question and answer Markov, or a Markov without matrix algebra, there are still some small sample problems plus you have supplied some of the transition information from large data sets that may not apply to the actual team being studied or even the wider range of run environments found at the team level.
For something like: chances of scoring from 3b with 2 outs, there's really no need to use empirical data. What you want is the team batting average and reached on error.
For chances of scoring with exactly 0 and 1 out, I would use the team/park data, regressed, if I thought the makeup of the team or the park was kinda unique. So, that's where those numbers come into play.
Otherwise, the empirical data isn't necessarily needed. It should all follow-through from the exact number of runs that did score.
If you had continued my process, you would also need to know how often you go from 1b to 3b on a single, for each out, etc ( http://www.tangotiger.net/destmob.html ), and again, you would care about the team/park makeup. The end result is that taking that complex approach, or the quick shorthand that I presented, will get you to pretty much the same spot.
SABR Matt
06-16-2006, 10:22 AM
Doesn't using a team's actual situational numbers in this shortcut approach sort of defeat the purpose of having the shortcut approach?
That purpose being twofold:
a) Save the user from needing situational data to come up with a reasonably accurate RE chart for a team or league
b) avoid the hazards of small sample size and ground your RE tables in logic
Just sayin...
Ubiquitous
06-16-2006, 10:58 AM
Using the data I used wasn't exactly time-consuming, and as with Runs Created, if the information is available, use it. The data I changed were the assumption that a team scores from third with 0 or 1 outs 50% of the time (the Cubs came in a little under), the assumption that there is a walk in an at bat 10% of the time, and the number of WP/PB. Tweaking the walk rate to reflect what you are looking at doesn't alter the logic, nor is it complicated to do or unavailable throughout history. Using WP/PB is like adding SB/CS data to the RC formula. Does it alter the logic of RC? It is minor, and if you have the data it is real quick to add it into the shortcut; nor is the shortcut exactly 2*2=4. In order to use the formula one has to look at data, and if one is looking at the data then one might as well use what is available if it is easy to incorporate.
If I'm using the shortcut method to look at a year, wouldn't I want the data to be as close as I can get it? The shortcut, I believe, is meant as an introductory lesson in making one's own RE. An ice-breaker: see, you probably thought it was hard, but look how simple it is. A way to create an RE if one lacks the resources or know-how to do one based on more advanced mathematical formulas. A way to let the armchair analyst use one of the tools of the more advanced stat-heads.
Tango Tiger
06-16-2006, 11:24 AM
Doesn't using a team's actual situational numbers in this shortcut approach sort of defeat the purpose of having the shortcut approach?
That purpose being twofold:
a) Save the user from needing situational data to come up with a reasonably accurate RE chart for a team or league
b) avoid the hazards of small sample size and ground your RE tables in logic
Just sayin...
That's why I said it needs to be regressed. We are lucky these days that parks are generally of the same dimensions, and that's with really wild parks like Fenway and Coors among others. Imagine the old days where the OF dimensions were 30 to 50 feet (or more) farther away.
In the simple approach, the only "event specific" number I used was the chance of scoring from 3B with less than 2 outs, in that particular at bat. Since this number is mostly dependent on the FB/GB tendency of the team and the dimensions of the park, it would make sense to use the team-specific data, and regress it.
The regression part is important, and should not be taken lightly.
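The thread does not say how the regression should be done. One common approach, shown here purely as an assumption and not as the method used in The Book, is to blend the observed rate with a baseline by adding a fixed number of "ballast" trials at the baseline rate. The function name `regress` and the ballast of 150 are made up for illustration; 68 of 150 is the Cubs figure quoted earlier in the thread and 0.50 is the default it replaced.

```python
def regress(successes, opportunities, baseline_rate, ballast):
    """Shrink an observed rate toward a baseline by adding `ballast` trials at the baseline rate."""
    return (successes + baseline_rate * ballast) / (opportunities + ballast)

observed = 68 / 150
regressed = regress(68, 150, 0.50, ballast=150)  # ballast value is hypothetical
print(round(observed, 3), round(regressed, 3))
```

The larger the ballast relative to the sample, the closer the result stays to the baseline; choosing it properly is the hard part and is not covered here.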
Ubiquitous
06-16-2006, 11:41 AM
Tango or Misterdirt, if it is possible, could you perhaps upload an Access file which has the empirical formula set up? Obviously it can't be the whole thing, but perhaps one line of data, for example ANA20050190 2nd inning 1 out so on so on and so on, with all the necessary event files linked and so forth. It would help me get my mind around what needs to be done to do this with Access. If it's too much trouble don't bother, either way thanks for the time.
SABR Matt
06-16-2006, 01:03 PM
I am having some trouble wrapping my brain around all of the logic in your latest set of articles Tango.
I'm having trouble seeing what all of the variables are that I need to consider when attempting to establish my own methodology for calculating an RE table for each league. If you could attempt to break down everything you said into a list of all of the information I need and what part of the analysis it is needed for, I would highly appreciate it. My apologies for my stupidity...I'm just having trouble figuring all of this out when it's given all at once like that.
Tango Tiger
06-16-2006, 01:50 PM
I am having some trouble wrapping my brain around all of the logic in your latest set of articles Tango.
I'm having trouble seeing what all of the variables are that I need to consider when attempting to establish my own methodology for calculating an RE table for each league. If you could attempt to break down everything you said into a list of all of the information I need and what part of the analysis it is needed for, I would highly appreciate it. My apologies for my stupidity...I'm just having trouble figuring all of this out when it's given all at once like that.
I've given it pretty much, step-by-step. I suggest you take it a bit at a time. When you come to a roadblock, I'll remove the obstacle.
I understand that I may have built an expressway, but maybe someone else can build the service road for you.
SABR Matt
06-16-2006, 02:06 PM
I understood the whole section on the no one on base line.
I'm just trying to figure out what variables you need for, like...the men on second and third with 1 out...line. Doesn't that require information on every possible event?
misterdirt
06-16-2006, 02:15 PM
Ub - I don't think that seeing the event files would help you any. As previously described, the event files are exactly the same as the existing event files but with two additional fields. One field has the base-out situation given in the type of code that I described, 1000 for a man on first and no outs, 1231 for the bases loaded and 1 out, etc. This field you can get through a series of queries on the EventsRunners table as I described. The other field has the total runs scored in the inning by the team at bat, minus the runs already scored in the inning up to the point of the event. You calculate that through queries using the pseudo code that Tango gave or the query that I described that uses the summation function max over the HOME_SCORE and VISITOR_SCORE fields for each game. You seem to be having trouble as well with how to add fields to a table. That process involves linking tables together in a query and creating a new table that combines fields from both tables. A good book on Access will have examples that will make that process clear. I don't think I could describe the process clearly enough. You really need to see it in a book where they can show good illustrations of what the computer screen looks like. Best of luck. Try it yourself and if you have trouble at a specific point I will try and help, either in this forum, emails or over the phone.
Tango Tiger
06-16-2006, 03:40 PM
I understood the whole section on the no one on base line.
I'm just trying to figure out what variables you need for, like...the men on second and third with 1 out...line. Doesn't that require information on every possible event?
The men on 2b/3b, 1 out
= bases empty, 1 out
+ chance of scoring from 2b, 1 out
+ chance of scoring from 3b, 1 out
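The additive breakdown above can be checked against the 1-out column of Ubiquitous's 2005 Cubs table earlier in the thread. In this sketch the function name is made up, and the per-runner chances are backed out of that table by subtracting the bases-empty value, which assumes the same additive model.

```python
def re_from_parts(re_bases_empty, *runner_scoring_chances):
    """Run expectancy = bases-empty RE for the out state + each runner's chance of scoring."""
    return re_bases_empty + sum(runner_scoring_chances)

# 1-out values from Ubiquitous's 2005 Cubs table.
be_1out = 0.267
from_2b = 0.685 - be_1out  # implied chance of scoring from 2b, 1 out
from_3b = 0.938 - be_1out  # implied chance of scoring from 3b, 1 out (.671)
print(round(re_from_parts(be_1out, from_2b, from_3b), 3))  # 1.356
```

The result matches the 2nd3rd, 1 out entry of that table.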
SABR Matt
06-16-2006, 06:03 PM
Oh...I didn't think of it like that...
With the second base situation though, wouldn't you need to know things like the average rate at which runners advance from second base on a single, and wouldn't that therefore make this impossible to do in the era before PBP?
Tango Tiger
06-16-2006, 06:47 PM
No, that's the point of using the total runs scored. You start with the total runs scored, you subtract the number of HR and you subtract the triples times the sr3 value. What's left is br1*sr1 + br2*sr2. And, since sr2 = sr1+.17, it becomes a simple math exercise to figure out the chance of scoring from 1B, on average. Apply the 3,2,1 rule, and you are all set.
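The "simple math exercise" can be sketched as follows. The reading of the symbols is an assumption: br1 and br2 are taken as the number of runners who reach first and second, and sr1, sr2, sr3 as the chance a runner scores from each base. The input totals are hypothetical and only show the algebra.

```python
def solve_sr1(r, hr, triples, sr3, br1, br2, gap=0.17):
    """Solve R - HR - 3B*sr3 = br1*sr1 + br2*(sr1 + gap) for sr1."""
    remaining = r - hr - triples * sr3  # runs left for runners from 1B and 2B
    return (remaining - gap * br2) / (br1 + br2)

# Hypothetical inputs, for illustration only.
sr1 = solve_sr1(r=750, hr=150, triples=30, sr3=0.60, br1=1500, br2=300)
sr2 = sr1 + 0.17
print(round(sr1, 3), round(sr2, 3))
```

Substituting sr2 = sr1 + .17 leaves one equation in one unknown, which is why no play-by-play advancement data is needed for this step.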
SABR Matt
06-16-2006, 07:33 PM
Why do you use the triple rate for br3? A lot of runners reach third base who had nothing at all to do with tripling.
Ubiquitous
06-16-2006, 09:34 PM
I think I can guess the answer to that one. Because it is a shortcut. Which is what this whole thing is. He uses .170 because he found that for the most part the actual number is going to be pretty close so simply using .170 is a shortcut.
misterdirt
06-16-2006, 10:25 PM
Tango - Your model's assumption that base runners and HRs are going to be split evenly between 0, 1, and 2 outs will be off a small amount because of double plays. HRs are split 36%, 32.6%, 31.4%. I couldn't figure out a quick way to directly compute baserunners but PAs are split 34.5%, 33.1%, and 32.4%. 2003-2005 PBP data with 9th inning+ removed. I don't know whether these small differences will make any significant difference in your estimated RE table but they probably should be used since it's no harder to do the math, it's theoretically correct, and double plays could be a more significant variable in smaller data sets. For those interested in working with smaller data sets, the relationship between double plays and PAs per out state could probably be worked out. I couldn't figure any logical reason why the HRs per PA would vary between the out states unless pitchers are being more careful with men on base or more GB relievers are brought in in the middle of an inning. Any thoughts? Also, would differing CS and OOB rates have an effect on your estimation process? This could make a difference for someone like Matt who wants to use your method to create RE tables for pre-1920 seasons where these rates might be very different from what they are today.
SABR Matt
06-17-2006, 01:17 AM
Ooh...nice catch about splitting baserunners and HRs evenly among the outs MrD.
Double plays make it a virtual certainty that you will always have slightly more PAs with 0 out than with 1 or 2 outs because sometimes you skip right past one of the out states.
BaseballHistoryNut
06-17-2006, 02:51 AM
A couple of weeks back your book was brought up this was my quick book review:
For my money, saying it's not a Neyer book is high praise. If any of y'all knows him, PLEASE pass that on.
BHN
misterdirt
06-18-2006, 07:39 AM
      0 outs  1 out   2 outs
BE     .482    .266    .109
1st    .741    .501    .245
2nd   1.183    .749    .318
3rd   1.615    .857    .214
As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table. The .245 value for base out state man on first 2 outs and the .214 value for the base out state man on third 2 outs do NOT represent the chances of scoring from these base out states. They represent the TOTAL runs scored from these two base out states. The occurrences of base out states are not evenly distributed throughout the lineup. If the man on first two out state occurs more often when the #2, #3, #4 or #5 batters are up then the total runs scored in the remainder of the inning are going to be considerably higher than the chances of the man on first scoring. In addition the likelihood of the man on first scoring with those batters is also much higher than the likelihood with batters in other lineup positions. If the man on third 2 out base state occurs more often when the #6, #7, #8, #9 batters are coming up then the total runs scored will be very close to the chances of the man on third scoring. Similarly, the likelihood of a man on third scoring with the bottom of the lineup batting is less than with the top of the lineup batting.
It is quite easy to imagine plausible lineups and strategies where base out states would be distributed in this way. The man on first 2 out state almost always is. For 2003-2005 PBP data that state occurred in one of the 2, 3, 4 or 5 lineup slots 48.7% of the time rather than a randomly distributed 44.4%. In smaller datasets on the individual season or team level the variation would be greater and the effect even more pronounced. The man on third 2 out situation does not normally occur more often in the bottom part of the lineup in large aggregate datasets. But it might on the team or even the individual season level if more than average triples hitters were in the #5 or 6 slots or if a manager sacrifice bunted a lot with a man on 2nd and no outs in the bottom half of the lineup. Or even if there were more than average left handed hitters in the bottom half of the lineup.
misterdirt
06-18-2006, 09:51 AM
In the simple approach, the only "event specific" number I used was the chance of scoring from 3B with less than 2 outs, in that particular at bat. Since this number is mostly dependent on the FB/GB tendency of the team and the dimensions of the park, it would make sense to use the team-specific data, and regress it.
The regression part is important, and should not be taken lightly.
I am not sure what you would be regressing this information to and why you would consider it important to do so. Regressing to the mean is a sabermetric expression for the statistical concept of estimating an unknown population mean from the known means of 2 or more datasets that are presumed to come from that population. The reason that you do this is because you want to construct a model of future performance that will be more accurate if based on the best estimate of the actual population mean rather than on a small sample dataset mean. That is not what you have here. Your RE estimator is trying to reconstruct the specific small dataset that you are trying to study (actually the evidence of some calculated information drawn from that dataset, i.e. the RE table). You are trying to do that by using some limited information drawn from that specific dataset (the overall scoring rate and the home run rate) plus information from a much larger aggregated dataset (the PBP information for years that you do have). The more information that you can include from the specific dataset that you are trying to study, the more representative the model of that dataset will be. It would defeat your purpose to "regress" that data to the more generalized data.
If you are trying to study the effects of a particular park on RE values from a time period before PBP data exists, the only park specific splits that you usually have are in runs scored and allowed home and away and HRs at home and away. It would seem to me that the proper way to use your shortcut method to study the RE effects of that park would be to aggregate several years of home team data on runs scored rate and HR rate for consecutive years where the runs scored rate is relatively stable and you know from historical information that no physical changes occurred to the park. Use your shortcut method to create an RE table from that and also a separate RE table from that team's away runs scored rate and away HR rate for that same period WITHOUT DOING ANY REGRESSION. Then if you compared the two tables and found differences, you could infer that they came from the effect of the park.
misterdirt
06-18-2006, 09:56 AM
Tango's answer to Matt about why he uses the 3-2-1 method:
The short answer is that you have 4.5 batters to the end of the inning when you have 0 outs, 3 batters with 1 out, and 1.5 batters with 2 outs. The fleshing out will be done on my site near the end of the series. (Btw, I updated the thread on my site with more info.)
I will be interested in reading this as the 3-2-1 method would not seem to logically follow from the division of plate appearances given above.
SABR Matt
06-18-2006, 11:28 AM
Tango regresses to the mean because he KNOWS that the small data sample he's using (for instance...one league/season) is going to have flukish data, MrD, and what he's trying to construct is a table of RE values that would occur if you played that league/season over again a thousand times, because that is the truer measure of player ability (the one based on a randomly generated league/season's RE table).
misterdirt
06-18-2006, 01:27 PM
Matt - First, why don't you let Tango answer for himself? Second, if you are worried about small sample aberrations then you should just use the larger aggregated data set. Third, there is nothing about creating a shortcut RE table that relates to a specific player's ability. You are estimating how a team or league has actually performed in the past, not predicting how a team or league or individual will perform in the future or should have performed. Aberrations from an RE table based on larger aggregate data may be due to a small sample size or they may be the result of actual differences in factors between the two populations. You learn more by investigating the possible causes of those aberrations than by masking them by automatically adjusting for small sample size.
SABR Matt
06-18-2006, 01:53 PM
MrD...I'm sorry I apparently offended you simply because I had an idea of what Tango was trying to do.
You speak as though rating the players' skills and diagnosing what already happened are two entirely separate issues that can never be bridged together into one form of analysis...I think that's a wrong-headed way to approach baseball analysis. If you always keep skill analysis and historical analysis separate, you will never make any headway toward understanding how contexts and intrinsic skills meld together.
And you don't necessarily want to use larger aggregated data sets because if you do that, you'll lose too much information about how the league contexts and team contexts change.
misterdirt
06-18-2006, 02:18 PM
Matt - You didn't offend me. I just don't think you have a clue about what I am trying to discuss or how Tango would respond.
SABR Matt
06-18-2006, 04:22 PM
Oh good...I've graduated from offensive to retarded...thanks MrD.
Might it just be that I'm not a complete idiot and that I am capable of understanding what Tango has said and what you are asking? Chew on that for a while.
Jesus.
misterdirt
06-18-2006, 04:36 PM
Matt - I never said or implied that you were either offensive or retarded or an idiot so please don't put words in my mouth. When I said you don't have a clue, it's because I don't think you read posts thoroughly and really make an effort to try and understand what the poster is trying to say before you respond.
You speak as though rating the players' skills and diagnosing what already happened are two entirely separate issues that can never be bridged together into one form of analysis...I think that's a wrong-headed way to approach baseball analysis.
It would be wrong headed if I had ever said it or implied it but I didn't. See paragraph 1 above.
SABR Matt
06-18-2006, 07:50 PM
You are estimating how a team or league has actually performed in the past, not predicting how a team or league or individual will perform in the future or should have performed.
Please...tell me how that sentence doesn't make the claim that predictive analysis (or skill analysis, which is the lead-in to predictive analysis) and historical analysis can never be linked...you cite the two objectives as an either/or proposition.
SABR Matt
06-19-2006, 01:22 AM
OK...I've given this some additional thought, trying to come up with a way to phrase what I was thinking to make more sense...
In your original question to Tango, you asked why you would want to regress components of the analysis that lead to RE tables because "you're trying to construct a realistic depiction of how a team scored its runs, not see how a team would score runs in the future"
I'm not convinced that's actually what Tango is using RE tables for. I don't think he (or most sabermetricians that use this kind of analysis) uses RE solely for the purpose of accurately rendering how runs are scored. I think he uses them as the basis for linear weights as well...and linear weights are designed so that you can diagnose a player's ability and make predictions about his future performance every bit as much as they are designed to capture with a lot of accuracy what already happened.
That's the issue I had with your question and I suspect that's what Tango had in mind when he was talking about regressing things to the mean. If all you want to do is accurately depict how runs were scored by a team, then you don't need any of this logic really...all you need is the empirical data (at least for seasons where PBP data exists), because you can explain anomalous results the precise way you explained the ones you obtained (inequities in base/out states over different spots in the batting order)...I don't think Tango's usage of RE tables is that limited, though.
misterdirt
06-19-2006, 05:35 AM
Your last post seems a much more thoughtful and balanced response. I agree with everything in it. I also use RE tables as a basis for linear weights to be able to project a player's future performance. I don't know whether Tango uses empirical data or a Markov chain to produce the RE tables that he has presented. I use empirical data aggregated over 3 seasons and do not regress the data to any larger aggregation. There is absolutely no difference between the linear weights that Tango generates from his RE table and the linear weights that I generate from my RE table.
I suspect that you have not actually gone through the process of creating linear weights from an RE table or you would understand why there would not be a difference. Even if you have anomalous values in some of the base out states because of small sample size for those base out states, the resulting linear weights are unaffected. This is because when you are determining the value of an offensive event (for example a single) you are multiplying the change in base out state value by the number of occurrences of that base out state and then adding the resulting values over all the base out states. Since the number of occurrences of the anomalous base out states is small (that is why you are having the anomalies to begin with) their effect on the total run value of a single does not register within the level of precision used in linear weights. Try calculating a linear weights value with a manually altered value for the man on third 0 outs state and you will see what I mean.
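The experiment suggested above can be sketched in a few lines of Python. The three RE values are taken from the 2004 table posted later in this thread; the transition counts are invented purely for illustration, and the function name is hypothetical. Each transition contributes (RE after - RE before + runs scored on the play), weighted by how often it happened.

```python
def run_value(transitions, re_table):
    """Average run value of one event type.

    transitions: list of (start_state, end_state, runs_scored, count).
    re_table: dict mapping state label -> run expectancy.
    An end_state of None means the play ended the inning (RE of 0).
    """
    total = 0.0
    plays = 0
    for start, end, runs, count in transitions:
        end_re = 0.0 if end is None else re_table[end]
        total += count * (end_re - re_table[start] + runs)
        plays += count
    return total / plays


re_2004 = {"empty,0": 0.54, "1st,0": 0.93, "3rd,0": 1.45}

# Invented counts: singles with the bases empty are common,
# singles with a man on third and no outs are rare.
singles = [
    ("empty,0", "1st,0", 0, 1000),
    ("3rd,0", "1st,0", 1, 10),
]

baseline = run_value(singles, re_2004)

# Manually distort the rare state and recompute.
altered = dict(re_2004)
altered["3rd,0"] = 1.25
print(round(baseline, 3), round(run_value(singles, altered), 3))
```

With these made-up counts, moving the man on third, 0 outs value by .20 shifts the run value of a single by roughly .002, which is the kind of insensitivity described above.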
So there is really no gain in regressing to try and remove those anomalies for those who create linear weights from RE tables. There is, however, a loss for people like Ub who are looking at the team level for explanations for how a team might be better or worse at creating runs. Regressing to remove the anomalies removes just the data he is looking for, i.e. the differences between that team and an average team. He still must investigate whether those differences in data are based on actual differences within the team or just on sample size but at least he has a starting point for his investigations.
Tango Tiger
06-19-2006, 07:57 AM
You guys did a great job with your posts, especially the last two. I'll go through all the posts, and add whatever clarity I can.
Tango Tiger
06-19-2006, 08:09 AM
Why do you use the triple rate for br3? A lot of runners reach third base who had nothing at all to do with tripling.
br = br1+br2+br3
br is the number of initial baserunners. That is, where did the batter land. br3 is the number of times the batter landed on third base, so that's essentially his triples.
br2 is basically his doubles, but we should include his reaching on 2b errors, and getting to 2b on throws to other bases.
HRs are split 36%, 32.6%, 31.4%. I couldn't figure out a quick way to directly compute baserunners but PA's are split 34.5%, 33.1%, and 32.4%.
Right, I'm trying to keep things nice and simple. The biggest non-randomness is with walks, as those are given a lot more with 1b open than not. Pitchers and hitters adjust based on the context. They implicitly understand the linear weights by the 24 base-out states, and understand each event has a different impact, which is why walks, the easiest of the outcomes to control, occur so non-randomly.
Tango said: As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.
This is a very good point, and deserves more explanation. When a reader looks at the RE table, all he sees is a static table, a table that is presumed to come from something whereby every cell in the table is affected a certain way.
In reality, each value in the table is in fact computed independent of the other 23. They are based on empirical data, and are nothing more than samples of reality (the way 600 PA from Todd Helton and 600 PA from Jose Cruz, Sr are nothing more than samples of their performance, one mostly at Coors, and one mostly at the Astrodome).
The empirical RE table simply says: "of the times that a runner happened to be on third base and there were two outs, how many runs ended up scoring for that inning". (And same question for first base.) It's clear that it's not the exact same players in the same number of PA that make up both samples. And even if it was, the size of the sample would be so small as to have a huge margin of error.
These empirical RE tables should come with a margin of error, or a reader has to be experienced enough to see the RE-man-on-3b-2-outs and the RE-man-on-1b-2-outs in the same light he'd see Todd Helton and Jose Cruz numbers, if stacked side-by-side.
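One rough way to attach a margin of error to each cell is to report the standard error of the mean alongside the average. The Python sketch below assumes each record is a (state, runs scored in the rest of the inning) pair; the state labels and sample records are made up, and the standard error treats events as independent, which real innings are not, so read it as a lower bound on the uncertainty.

```python
from collections import defaultdict
from math import sqrt


def empirical_re(records):
    """Return {state: (mean_runs, standard_error, count)}.

    records: iterable of (state, runs_rest_of_inning) pairs.
    The standard error is sample_std / sqrt(n); it is NaN when n == 1.
    """
    by_state = defaultdict(list)
    for state, runs in records:
        by_state[state].append(runs)

    table = {}
    for state, values in by_state.items():
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            std_err = sqrt(variance / n)
        else:
            std_err = float("nan")
        table[state] = (mean, std_err, n)
    return table


# Made-up records for two states.
sample = [("3rd,2", 0), ("3rd,2", 0), ("3rd,2", 1), ("3rd,2", 1),
          ("1st,2", 0), ("1st,2", 2)]
for state, (mean, std_err, n) in empirical_re(sample).items():
    print(state, round(mean, 3), round(std_err, 3), n)
```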
When I present RE tables, it only makes sense to use them if they've been adjusted, or the sample size is so large as to make the margin of error very small.
If there's something else that I didn't address, let me know.
misterdirt
06-19-2006, 08:48 AM
Pitchers and hitters adjust based on the context. They implicitly understand the linear weights by the 24 base-out states, and understand each event has a different impact, which is why walks, the easiest of the outcomes to control, occur so non-randomly.
True, and because pitchers have more control over the progress of the PA than batters, walks occur more frequently in situations that hurt the defensive team less.
Right, I'm trying to keep things nice and simple.
I can understand the desire to keep things simple and your method succeeds admirably at that. I was just trying to understand logically how it works and if there was any additional information that was available in the pre PBP era that could be incorporated. For some research purposes it would seem like all available information should be incorporated even if it makes the process slightly more complicated and the gains are minimal.
Do you also have an estimation process for the number of occurrences of each base out state? If not, how can you convert the RE table into an estimation of linear weights?
misterdirt
06-19-2006, 08:55 AM
Tango said:
As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.
What I was trying to express here, and doing a poor job of it, was that the RE tables don't show how easy it is to score from a base. How easy it is to score from a base is shown in a ONE_RUN+ table. Something I know that you know because you have calculated them. But a fact that is often forgotten when the uses of RE tables are discussed.
Tango Tiger
06-19-2006, 09:53 AM
Tango said:
As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.
What I was trying to express here, and doing a poor job of it, was that the RE tables don't show how easy it is to score from a base. How easy it is to score from a base is shown in a ONE_RUN+ table. Something I know that you know because you have calculated them. But a fact that is often forgotten when the uses of RE tables are discussed.
Yes, absolutely. You can *fairly* say that this is a true statement, for the leadrunner states (xx3, 1x3, x23, 123 for chance of scoring from 3B), (x2x, 12x for chance of scoring from 2b), (1xx for chance of scoring from 1b).
If you look in the book, table 9, you will probably find the chance of scoring at least one run (or 1 minus chance of scoring no runs), is around the .875 level for 3B, using any of those 4 base states, and 0 outs. It will probably follow similarly for the other 2 out states. It's not exactly the same, since bases loaded will allow the runner from 3B to score from a walk, and the x23 just needs two walks instead of the three that xx3 would need.
Studying Tables 9 and 10 is certainly something that is hugely recommended in following along.
Tango Tiger
06-19-2006, 11:50 AM
I can understand the desire to keep things simple and your method succeeds admirably at that. I was just trying to understand logically how it works and if there was any additional information that was available in the pre PBP era that could be incorporated. For some research purposes it would seem like all available information should be incorporated even if it makes the process slightly more complicated and the gains are minimal.
I have a basic Markov program that generates what you want here. It doesn't have basestealing or other non-batter events. It's pretty cool, because it simply takes a team's batting line, and generates the RE table from the Markov program (takes one second to run). I'll eventually release the code to the public.
So, we have a few ways to do the RE tables. The shortcut way, this basic Markov program, and then a really complex Markov.
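To give a feel for what a basic Markov approach involves, here is a small Python sketch. It is not Tango's program, which has not been released. It assumes only six batter events, with deliberately crude baserunning: on a hit every runner moves exactly as many bases as the batter, on a walk only forced runners move, and an out never moves a runner. The event probabilities are supplied by the caller and are assumed to sum to 1.

```python
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "hr": 4}
NON_OUT_EVENTS = ("walk", "single", "double", "triple", "hr")


def advance(bases, event):
    """bases is (first, second, third) as booleans. Return (new_bases, runs)."""
    first, second, third = bases
    if event == "walk":
        # Only forced runners move up.
        if first and second and third:
            return (True, True, True), 1
        if first and second:
            return (True, True, True), 0
        if first:
            return (True, True, third), 0
        return (True, second, third), 0
    gained = BASES_GAINED[event]
    # Position 0 is the batter; crude rule: everyone moves `gained` bases.
    positions = [0] + [b for b, occupied in zip((1, 2, 3), bases) if occupied]
    moved = [p + gained for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    return tuple(b in moved for b in (1, 2, 3)), runs


def run_expectancy(probs, sweeps=500):
    """Return {(bases, outs): expected runs to the end of the inning}.

    probs maps "out" and the names in NON_OUT_EVENTS to probabilities.
    The 24 values are found by repeatedly applying
    RE(state) = sum over events of p * (runs + RE(next state)).
    """
    all_bases = [(a, b, c) for a in (False, True)
                 for b in (False, True) for c in (False, True)]
    re = {(bases, outs): 0.0 for bases in all_bases for outs in range(3)}
    for _ in range(sweeps):
        for outs in (2, 1, 0):
            for bases in all_bases:
                # An out leaves the runners alone; the third out ends the inning.
                total = probs["out"] * (re[(bases, outs + 1)] if outs < 2 else 0.0)
                for event in NON_OUT_EVENTS:
                    new_bases, runs = advance(bases, event)
                    total += probs.get(event, 0.0) * (runs + re[(new_bases, outs)])
                re[(bases, outs)] = total
    return re


# Invented batting line, for illustration only.
line = {"out": 0.68, "walk": 0.08, "single": 0.16,
        "double": 0.045, "triple": 0.005, "hr": 0.03}
table = run_expectancy(line)
print(round(table[((False, False, False), 0)], 3))
```

A fuller version would add realistic runner advancement, double plays and basestealing, which is where the complexity mentioned above comes from.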
Do you also have an estimation process for the number of occurrences of each base out state? If not, how can you convert the RE table into an estimation of linear weights?
You're right, you'd need that if you want to get the LWTS (though, not necessarily... more on that later). Here again, I use a 3-2-1 rule, but I'll see if I can come up with something better.
SABR Matt
06-19-2006, 12:57 PM
MrD...you are correct that I have not thus far attempted to generate my own LWTS using this method...
I do have a problem with using three-year aggregate data to filter out sample size errors...if you do that...1930 doesn't come out right. Or 1987. Or 1911 and 1912. I'm operating under the presumption that major changes in the way runs are scored over a single season will cause major changes in the LWTS.
misterdirt
06-19-2006, 01:25 PM
I do have a problem with using three-year aggregate data to filter out sample size errors...if you do that...1930 doesn't come out right. Or 1987. Or 1911 and 1912. I'm operating under the presumption that major changes in the way runs are scored over a single season will cause major changes in the LWTS.
I debated about whether to go three years, single year, or single year single league (to differentiate DH from non DH), or 3 year single league. I finally decided that 3 year was best for the studies I was doing. But if I was doing a different study I might decide otherwise. Certainly if you have reason to believe that there are causal factors rather than random variation that are causing the year to year variations you should use a single year. But you still would not want to regress those variations to a larger data set. Why? Because you have just decided that they are not due to random variation.
misterdirt
06-19-2006, 01:38 PM
So, we have a few ways to do the RE tables. The shortcut way, this basic Markov program, and then a really complex Markov.
Does your really complex Markov vary the data by batting order position?
Tango Tiger
06-19-2006, 02:46 PM
Does your really complex Markov vary the data by batting order position?
Yes, that's how I did the batting order chapter (with the pitcher moving around, etc). The complex Markov used 5-dimensional arrays. The program itself is fairly small, but it is mind-numbing to program, and then to debug or enhance. It's one of the few programs I wrote that I had to put extensive documentation in.
SABR Matt
06-19-2006, 06:10 PM
I don't suppose I could convince you to let me have a look at your Markov program, Tango...my partner in crime Randy Fiato is a programmer who took an interest in what he was calling "baseball as a state machine" and the two of us might be able to make improvements if we could look at the code for a while and study it. Just a thought...I doubt I would ever be able to program something as complex as your version myself, and Randy doesn't have time to write whole new programs, but perhaps we can come to some sort of accommodation that would help further my own research and could result in continued improvements in our understanding?
Tango Tiger
06-20-2006, 05:34 AM
Right now, I don't release any of my work. But, eventually, I may.
SABR Matt
06-20-2006, 06:12 PM
I can certainly understand the impulse to keep your work under wraps while you're still tinkering with it and before you have the opportunity to get significant publications out to prove that you were the one who did the research...I just wanted to let you know that I'm interested in picking up some of the research threads that you have started and as of yet shown no interest in continuing (for example...the first basemen saving errors study...possibly a more advanced catchers study involving looking for their effect on pitchers' DIPS statistics...etc)
Some of the things I'd like to pursue really require Markov simulation to make work (though neither of the two things I mentioned above do...LOL)
Tango Tiger
06-20-2006, 08:42 PM
What I do is not hard. It just requires a lot of time and patience. I suggest you learn a programming language, any programming language. Learn arrays, and understand recursion.
SABR Matt
06-21-2006, 02:04 AM
I have some experience with EIGHT programming languages, Tango.
C++
Perl
MySQL
Visual Basic
IDL
R
Java
HTML (ok...not really a programming language...but still worth mentioning).
I was a computer science major for 3 semesters, and as soon as I hit object oriented programming and advanced recursion methods, my brain exploded. What you do may not be difficult for you, but not everyone is really wired to think in the way you need to think to program on a functional level. Believe me...it's not for lack of effort that I lack programming chops...I've attacked this research from so many angles it's not even funny...I haven't made headway in any of them.
Tango Tiger
06-21-2006, 08:50 AM
Perl is probably the one to focus on. You don't need OO or advanced recursion methods. Just be able to call a function without getting into a loop. As for making headway, the only suggestion I can offer is to pick up a Perl O'Reilly book, and do the exercises start to finish. If you have done all that, and you haven't made headway, then I guess programming is not for you.
SABR Matt
06-21-2006, 12:13 PM
I haven't done that with Perl yet...
And I shouldn't say I've made zero headway...my MySQL query writing ability has significantly improved in the last year...enough that I am confident that with a powerful enough computer I can organize and normalize all of the available data into one database.
I will see what I can do with Perl...the only problem with that language as I understand it is it's extremely slow for mathematical calculations.
Ubiquitous
06-21-2006, 12:27 PM
I downloaded PERL today and I am probably going to play around with it since Baseball Hacks has some scripts for it. Most notably the one for run expectancy.
SABR Matt
06-21-2006, 01:18 PM
There's a RE script in the Hacks book?
I didn't see that...I'll have to look through it.
SABR Matt
06-21-2006, 01:41 PM
Actually, Ubi...the RE hack is done entirely in MySQL...which is very encouraging, since that's the language I know best...LOL
Ubiquitous
06-21-2006, 03:08 PM
Yes, but to get the data quickly and easily in his book you will need PERL. I used his programs to download all of the event files in one go and then bevent them all in one shot instead of doing it one at a time.
SABR Matt
06-21-2006, 04:31 PM
Right...the warning I would give you on that is you get a lot of extra useless information and the database you construct with the standard bevent program is ungainly and too big for most computers. That's OK if you're just going to grab small pieces of the data at a time, as long as you have things well indexed, but it can cause problems...I've created the beginnings of my own database already...just waiting for my computer upgrades to be finished and I'll continue work on that...but I'm doing something much more streamlined so that I can see the entire dataset all at once.
Perl is useful for parsing text into your code, which makes it ideal for stealing data off the web.
Ubiquitous
06-21-2006, 04:52 PM
I'm finding out more and more that his hacks are pretty unwieldy. I've had to change a few things around just to get it to download all the files. I then had to basically strip his script for unpacking the zip files down to just unzipping the files. I couldn't get the bevent part of his script to work properly so I ended up using Tango's tips from the ASS conversation, and then added the header file and all of these text files into one mega file. It took a while and it is just finishing up. Hopefully when all that is done the historical PBP file will be set up just like he expects it to be set up in the Hacks book and then from there I can move on to other things.
Nice, a 3 gig file that should be a joy to work with.
SABR Matt
06-21-2006, 05:32 PM
He uses his pBP2K file which has 2000-2004 only...his hacks work in that context but if you try them on the HUGE file...it's going to explode and your computer will need to be restarted. Fair warning.
Ubiquitous
06-21-2006, 06:51 PM
The more I use his book the less I like it. Not really his fault, more my own. I'm a novice at programming and a lot of his stuff makes a lot of assumptions about the reader's level of expertise. That level is above my own. I've spent several hours and have basically not gotten anything accomplished from the book. It did cause me to branch out on my own and get things done, just not in the way he describes, but it was because of his book.
For instance I now have a quick way to load the header into my database instead of having to manually tell the database each time I create one what each field's name is. I also now have PBP data for each year available in a text file ready to be used in a database. Whereas before I did not.
I haven't tried to load the data yet into MySQL because I fear it will cause another wasted half day. I was really hoping the language he used would work in Access but alas it does not. So that shortcut is denied.
Ubiquitous
06-21-2006, 06:56 PM
Oh and I even got the roster problem solved. I don't know if you recall but in the last discussion about EVA files we were having problems dealing with the roster files. Well that is one script of his that does work rather well. Also while I was playing around with Tango's tip I figured out that his .bat file was not properly worded, which was causing some of the problems as well. The way it was worded each year's ev* would be extracted into one gigantic file instead of 49 separate files.
It has to look like this to get individual yearly files:
bevent -y 1957 1957*.ev* > 1957.txt
Whereas Tango's tip had this line:
bevent -y 1957 *.ev* > 1957.txt
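Rather than typing 49 of these lines by hand, the .bat file can be generated. The Python sketch below only builds the command text in the corrected form shown above and uses no bevent options beyond the ones in this post. The year range 1957 to 2005 is an assumption that happens to give 49 files, and the output file name is made up.

```python
def bevent_commands(first_year, last_year):
    """Return one bevent command line per season, each limited to that
    season's own event files so the output stays in separate yearly files."""
    return [f"bevent -y {year} {year}*.ev* > {year}.txt"
            for year in range(first_year, last_year + 1)]


if __name__ == "__main__":
    # Assumed range; adjust to the seasons you actually downloaded.
    lines = bevent_commands(1957, 2005)
    with open("run_bevent.bat", "w") as handle:   # hypothetical file name
        handle.write("\n".join(lines) + "\n")
    print(len(lines), "commands written")
```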
SABR Matt
06-22-2006, 12:12 AM
I would agree with you that his assumptions about a person's programming skill are a little lofty...I haven't been able to do much with his hacks either although I am starting to learn enough MySQL to have the ability to do some of these things on my own...his "check_field_sizes" hack works REALLY well and I now use it every time I want to read in data into MySQL for example.
Ubiquitous
06-22-2006, 12:31 AM
It works but I have finally realized after several missteps that you have to play with it slightly to use it. Something he neglects to tell you, I think he assumes you know you will have to. The part I am talking about is the loading data local infile part. In that part you have to tell it where the file is at but he neglects to tell you that. So after numerous head beatings I figured that part out. What I do now is run the checkfield.pl and then just copy and paste the lines into mysql. Works fine like that.
I finally got everything to work after several nerve wracking hours and I have finally produced an empirical run expectancy chart based on the PBP from 2004. Hopefully it is right, if somebody like misterdirt can check it I would appreciate it.
Outs:     0     1     2
BE        0.54  0.29  0.11
1st       0.93  0.55  0.24
2nd       1.17  0.71  0.34
3rd       1.45  0.97  0.37
1st&2nd   1.49  0.97  0.46
1st&3rd   1.86  1.24  0.54
2nd&3rd   2.15  1.48  0.63
BaseFull  2.27  1.60  0.82
Ubiquitous
06-22-2006, 12:48 AM
Also for some clarification:
Man on 3rd, no outs has a run expectancy of 1.45, and this situation occurred 509 times. So does this mean that 738 total runs were scored after teams entered this situation? 509*1.45, right? So how does one figure out the odds of a run scoring with this data? Or does one need other data besides this to figure that out?
SABR Matt
06-22-2006, 03:01 AM
For that, you would need to ask "how many times in those X number of times where the situation arose did zero runs score?"
Once you have the empirical odds of NOT scoring...you'll have the odds of scoring at least one run.
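The two calculations in this exchange can be written out as a small sketch. The 509 and 1.45 come from the posts above; the count of scoreless innings (80) is a made-up number used only to show the arithmetic, since the thread never gives it. Function names are illustrative.

```python
def total_runs(times_in_state, run_expectancy):
    """Runs scored after entering a state: occurrences times average runs."""
    return times_in_state * run_expectancy


def chance_of_scoring(times_in_state, times_no_run_scored):
    """Empirical odds of scoring at least one run from a state.

    This is 1 minus the share of occurrences in which zero runs scored.
    """
    return 1.0 - times_no_run_scored / times_in_state


if __name__ == "__main__":
    # Man on 3rd, no outs: 509 occurrences at RE 1.45.
    print(round(total_runs(509, 1.45), 2))
    # 80 scoreless occurrences is hypothetical, not a figure from the data.
    print(round(chance_of_scoring(509, 80), 3))
```

The point is that the RE table alone is not enough for the second number: you also need the count of occurrences where no run scored.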
misterdirt
06-22-2006, 06:00 AM
Ub - I ran the numbers for 2004 and got REs consistently lower than yours by about .01 or .02. If you could post the raw numbers of both the counts and additional runs scored for each event, I could check whether the fault lies in your program or mine. I suspect that it is in my count of additional runs scored, as I got 728 for man on third, 0 out and the same 509 count of events as you did. It is more likely that a program would fail to count a run than double count one.
Also, usually RE tables exclude all home batting events in the 9th inning or later because of the chance that a walk off run will shorten the inning before all potential runs are scored. But give me the numbers for all events as you have already calculated them.
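A rough sketch of how an RE table with that exclusion could be computed, assuming each event already carries the runs scored from that point to the end of the inning. The dictionary keys and the "1--" style base strings are hypothetical choices for this example, not field names from BEVENT or the book.

```python
from collections import defaultdict


def run_expectancy(events, skip_home_late=True):
    """Average runs scored from each base/out state to the end of the inning.

    Each event is a dict with these (hypothetical) keys:
      inning, home_batting, outs, bases, runs_rest_of_inning
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ev in events:
        # A walk-off can end the inning early, so home halves from the
        # 9th inning on are left out when skip_home_late is set.
        if skip_home_late and ev["home_batting"] and ev["inning"] >= 9:
            continue
        key = (ev["bases"], ev["outs"])
        totals[key] += ev["runs_rest_of_inning"]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in counts}
```

Calling it with `skip_home_late=False` gives the "all events" version asked for in the post, so the two tables can be compared side by side.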
misterdirt
06-22-2006, 06:40 AM
Ub -I checked my program. I did have a problem with additional runs scored on walk offs in the 9th inning or later. I had known that when I calculated my RE table several months ago but had ignored it since I was eliminating those innings anyway. I subsequently forgot that I had left those incorrect numbers in there. Try calculating the RE table eliminating Home batting after the 8th inning and then we can compare.
misterdirt
06-22-2006, 09:50 AM
Ub - Fixed my problem with additional runs. My chart now looks identical with yours.
Ubiquitous
06-22-2006, 10:08 AM
Thanks Misterdirt for checking.
Thanks Matt, I should have figured that one out. Hopefully I can blame it on it being late at night.
Now that I have done it, hopefully I can play around with his scripts to do the rest of what I want to do. I should also be able to come up with linear weights. But then there will be some regression issues, and he uses a program called R, so we'll see.
Ubiquitous
06-22-2006, 10:25 AM
I'm able to do the linear weights value of a home run. That was pretty easy, and I am thinking that a walk and a triple are going to be simple as well. But I imagine that everything else is a little more involved and that you need more data than just the RE and how many times each situation occurred.
SABR Matt
06-22-2006, 11:34 AM
What you need to calculate every linear weight is the average starting RE for each event and the average finishing RE for those events...
The PBP database includes all state data at the start of the event and the destination info for every play, so what you need is a script that creates a single number that represents the base/out state before and then another single number that represents the base/out state after each play...then you can just ask how many times each specific event resulted in each unique change in base/out state...since you know what the change in RE is between any two base/out states you can get an answer from there.
Ubiquitous
06-22-2006, 11:47 AM
Yes but creating the script is the part I'm not familiar with.
Tango Tiger
06-22-2006, 12:23 PM
Matt, you are right on. What you should do in a database is have 9 fields (which can be collapsed into two if you want to get fancy). Start1B, Start2B, Start3B, StartOuts, End1B, End2B, End3B, EndOuts, RunsOnPlay.
All the base fields should be set to 1 or 0, or True/False. All the starting fields and the RunsOnPlay should probably be available directly from BEVENT (been a while for me).
For the ending, I think you are told where a runner ends up, in BEVENT, right? So, something like
UPDATE eventTable
SET End1B = 1
WHERE DestBatter = 1;
UPDATE eventTable
SET End2B = 1
WHERE DestBatter = 2
or Dest1B = 2;
...
(Also make all the Dest fields 0 when EndOuts = 3.)
So, what you've done here is established the starting and ending states for each event.
Put your RE matrix in a table (24 rows) that looks like:
Runner1B, Runner2B, Runner3B, Outs, RE
(Add a 25th row for Outs = 3)
Then you join
SELECT s.RE, e.RE
FROM eventTable et, reMatrix s, reMatrix e
WHERE s.Runner1B = et.Start1B
AND e.Runner1B = et.End1B
etc, etc, etc
That'll give you the starting and ending RE for every event. The difference is your Linear Weights. (This is the derivation of Table 6 in The Book.)
SABR Matt
06-22-2006, 02:12 PM
See I would just create one column called StartBO and one called EndBO where each one went from 0 to 24 (0 = bases empty none out, 1 = bases empty 1 out...etc...and 24 = 3 outs) and then one row for the event type...then you just
GROUP BY StartBO, EndBO, EventType
and COUNT the number that fit in each unique grouping and SUM the number of runs scored on each play.
Tango Tiger
06-22-2006, 02:30 PM
Well, I did say "which can be collapsed into two if you want to get fancy".
***
Your group by will result in a 24x25x20 number of rows. You still need to then do a join. You do the join, and do sum(s.RE)/sum(n), sum(e.RE + runsOnPlay)/sum(n), and GROUP BY eventType.
Or you can collapse the two into one query, as I showed earlier.
SABR Matt
06-22-2006, 02:50 PM
I know that wasn't the only step..I was just saying I think it makes wording the queries a little easier if your data is as compact as possible.
Tango Tiger
06-22-2006, 03:24 PM
I do do it the way you are describing it, but I wouldn't recommend it until you are used to manipulating the 24 base/out states. The clean way is to do it the way I described it. Once you have a handle on that, then you can worry about collapsing them. After all, what if you are only interested in the 2-out states for some other reason? Now, you've got to join to a 24-row table that will decipher your baseOut field (which is what I also do, and also don't recommend).
When you are setting things up, keep it clean, and don't worry about efficiency or size.
misterdirt
06-22-2006, 03:57 PM
All the starting fields and the RunsOnPlay should probably be available directly from BEVENT (been a while for me).
RunsOnPlay probably should be a BEVENT field but it isn't. It has to be calculated.
(Also make all the Dest fields 0 when EndOuts = 3.)
If you do this you cheat players out of value when they get a hit that results in a third out on the basepaths. Example: a single with 2 outs and men on first and second where the lead runner makes the third out trying for home should not have the same result as a strikeout. The resulting base out state should be bases loaded 2 outs for computation purposes of hit value and value to the batter.
See I would just create one column called StartBO and one called EndBO where each one went from 0 to 24 (0 = bases empty none out, 1 = bases empty 1 out...etc...and 24 = 3 outs) and then one row for the event type...then you just
Again Matt, have you ever actually done this? Creating the EndBO state from the Runner destination fields is not trivial. I use a single code for base out state, as I mentioned in posts 97 and 106. Instead of 1 to 24 as Matt suggests, it has a man on first as a 1 in the thousands place, a man on 2nd as a 2 in the hundreds place, a man on 3rd as a 3 in the tens place and outs in the ones place. This has the advantage that the base out state is evident visually in the code itself.
and COUNT the number that fit in each unique grouping and SUM the number of runs scored on each play.
This wouldn't work. You have to do an intermediate step of calculating the change in run value on the play. Change in run value is the (RE of the EndBO state - the RE of the Start BO state) + Runs scored on the Play. Change in Run Value is what then gets SUMMED just for offensive event type, not for every StartBO state. The SUM of Change in Run value is then divided by the Count of the number of events.
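The intermediate step described here can be sketched in a few lines. The RE values are a handful of cells from the 2004 table posted earlier in the thread; the state encoding (a base string plus outs), the function names and the tuple layout of a play are assumptions made for this example only.

```python
from collections import defaultdict

# A few cells of the 2004 table from this thread: (bases, outs) -> RE.
RE = {
    ("---", 0): 0.54, ("---", 1): 0.29, ("---", 2): 0.11,
    ("1--", 0): 0.93, ("1--", 1): 0.55, ("1--", 2): 0.24,
}


def re_of(state):
    """RE of a base/out state; any state with 3 outs is worth 0."""
    bases, outs = state
    return 0.0 if outs >= 3 else RE[(bases, outs)]


def linear_weights(plays):
    """Average change in run value per event type.

    plays: iterable of (event_type, start_state, end_state, runs_on_play).
    Change in run value = RE(end) - RE(start) + runs scored on the play.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for event_type, start, end, runs in plays:
        total[event_type] += re_of(end) - re_of(start) + runs
        count[event_type] += 1
    return {event: total[event] / count[event] for event in count}
```

Note that the sum is grouped by event type only, not by starting state, which is the distinction drawn in the post above.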
Ub - Are you now more confused than ever?
SABR Matt
06-22-2006, 04:51 PM
MrD...I knew all of that (about how to calculate a linear weight)...I just didn't say it correctly in my post...sorry if I caused confusion.
I don't generally consider it a good idea to use a string of numbers as a code for the base/out state because that can be a pain in the ass to attempt to query even with usage of the substr function....but I do see the appeal of using that code in the visual inspection of the data.
I'm also contending with trying to create ONE database that includes ALL of the data for ALL of the years...I'm going for "all of the info is there" in "as small a data field as possible".
misterdirt
06-22-2006, 07:18 PM
I don't generally consider it a good idea to use a string of numbers as a code for the base/out state because that can be a pain in the ass to attempt to query even with usage of the substr function....but I do see the appeal of using that code in the visual inspection of the data.
It's not a string of numbers, it's a number. I find it simple to query it for any single base out state or any class of states. For example, looking at all men on first and third states would be >1029 and <1033. What's the problem?
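For readers who want to try this numeric code, here is a small sketch of encoding and decoding it, plus the first-and-third range check from the post. The function names are made up; the digit layout (1 in the thousands, 2 in the hundreds, 3 in the tens, outs in the ones) follows the description above.

```python
def encode_state(on_first, on_second, on_third, outs):
    """Pack a base/out state into one integer, e.g. 1031 = 1st and 3rd, 1 out."""
    return (
        (1000 if on_first else 0)
        + (200 if on_second else 0)
        + (30 if on_third else 0)
        + outs
    )


def decode_state(code):
    """Unpack the integer back into (on_first, on_second, on_third, outs)."""
    return (
        code // 1000 == 1,
        (code // 100) % 10 == 2,
        (code // 10) % 10 == 3,
        code % 10,
    )


def is_first_and_third(code):
    """Range test from the post: all first-and-third states, any outs."""
    return 1029 < code < 1033
```

Because the code is an ordinary integer, the same kind of range test works in a WHERE clause without any string functions.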
SABR Matt
06-22-2006, 07:44 PM
Interesting.
And clever.
I misunderstood what you meant by how you coded your data, MrD. That's a most efficient way to do it...and it doesn't increase the size of the piece of data I would need to utilize something similar by more than one byte per record (using just 0-24 I would need a 1 byte integer, using your method, I'd need a 2-byte integer).
Ubiquitous
06-22-2006, 08:40 PM
Ub - Are you now more confused than ever?
I understand what you are talking about, but for something like this, if it isn't written down step by step or the person isn't sitting next to me, it is lost on me. I understand what you need to do; I just don't know the computer language and the steps to do it.
Ubiquitous
06-22-2006, 08:47 PM
Wouldn't RBI on play work? Though I guess there are a few plays in which a run scores but no RBI.
misterdirt
06-22-2006, 09:28 PM
Wouldn't RBI on play work? Though I guess there are a few plays in which a run scores but no RBI.
Not really. You'd lose about 1100 runs a year. And it's not the total runs that you lose, it's that you lose them from only a couple of types of offensive events, errors and outs mostly. So those outcomes would be highly affected.
The method I use is to create a new field called previous event number by subtracting 1 from every event number. Then you create another new field that has every event number the same except it assigns 0 to the last event number of a game. Then you can map the EventsGeneral table onto itself by linking those two fields. This gives you the HOME_SCORE and VISITOR_SCORE of the next event that you can add to your database. The HOME_SCORE or VISITOR_SCORE of the next event is obviously the HOME_SCORE or VISITOR_SCORE of the original event plus the runs scored on the play so you can extract the runs scored on play by subtraction. For the final event of the game you have to subtract from the appropriate FINAL_SCORE. There is probably a much more elegant solution that everybody but me has already discovered.
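The same idea, outside of a database, looks roughly like this. It is a sketch under the assumption that the score fields hold the score before each play and that one game's events are already in order; the key names and function name are placeholders.

```python
def runs_on_each_play(game_events, final_vis_score, final_home_score):
    """Runs scored on every play of one game.

    game_events: the game's events in order, each a dict with the
    (hypothetical) keys vis_score and home_score, taken before the play.
    The next event's score minus this event's score gives the runs on the
    play; the last event is compared against the final score instead.
    """
    runs = []
    for i, ev in enumerate(game_events):
        if i + 1 < len(game_events):
            next_vis = game_events[i + 1]["vis_score"]
            next_home = game_events[i + 1]["home_score"]
        else:
            next_vis, next_home = final_vis_score, final_home_score
        runs.append((next_vis - ev["vis_score"]) + (next_home - ev["home_score"]))
    return runs
```

The special case for the last event is the part that is easy to miss: without the final score, a run scoring on the game's last play would never be counted.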
SABR Matt
06-22-2006, 09:48 PM
Or you can just use the vis_score and home_score fields from the NEXT event (by adding one to every event) and subtract from those the current score.
misterdirt
06-23-2006, 06:39 AM
Or you can just use the vis_score and home_score fields from the NEXT event (by adding one to every event) and subtract from those the current score.
That doesn't work.
Tango Tiger
06-23-2006, 08:15 AM
If you do this you cheat players out of value when they get a hit that results in a third out on the basepaths.
I wasn't worrying about the splitting up of hitting/baserunning values at this point.
***
As for runs on play, why not just sum all those destbatter, dest1B, dest2b, dest3b fields? Doing a self-join on the eventsTable is a huge processing cost.
misterdirt
06-23-2006, 08:30 AM
I wasn't worrying about the splitting up of hitting/baserunning values at this point
It is not the normal splitting up of baserunning/hitting values which I don't bother with either. It is the fact that with 2 outs the hitter loses all value for his hit except for the value of runs scored on the play, unless you are zeroing that out too, in which case he loses all value. The linear weights will probably work out the same as the value lost is probably within the margins of precision given. But the value lost to the hitter in calculating his changes in RE may be significant for that hitter.
As for runs on play, why not just sum all those destbatter, dest1B, dest2b, dest3b fields? Doing a self-join on the eventsTable is a huge processing cost.
Obviously, you can do that, but I find it more time consuming. I don't know what you mean about processing cost. A table maps on itself very quickly.
Tango Tiger
06-23-2006, 09:02 AM
Sure, it's time consuming to code (2 minutes, instead of 20 seconds for the self-join). But, as the eventsTable gets bigger, the processing cost is huge. Each year has 200,000 records. If you do what Matt is suggesting, you're going to have a database with 5 to 10 million records. Joining a 10-million row table to another 10-million row table to get a 10-million row output is an enormous cost.
On the other hand, if you want to just join one team, or maybe one year, that's ok. But, I highly recommend doing the update as I'm suggesting.
misterdirt
06-23-2006, 09:37 AM
It was Ub I was trying to help and he was only asking about doing one year. I did it with three years and it was not a problem. As for Matt, doing querying in Access with a database that large (if it is even possible in Access) would be time consuming, especially nested IF functions.
SABR Matt
06-23-2006, 10:42 AM
Who said anything about Access?
I do those kinds of complex queries in Query Browser...hard-coded MySQL
SABR Matt
06-23-2006, 10:44 AM
I don't have access to the data right now (my main computer is being upgraded almost entirely for the purpose of being able to handle the 7.4 million row PBP events database), so I have a question.
Do the visitor score and home score fields reflect the score BEFORE or AFTER a play has already occurred?
misterdirt
06-23-2006, 10:50 AM
Before the play.
Ubiquitous
06-23-2006, 11:15 AM
Alright, can anyone answer this?
I ran the hack #60 in the book the way it was written for 2004 and it worked like a charm. I then tried to do it for 2001 and got way off numbers. This is what the original hack looks like:
create table runs_by_inning2004 AS
select game_id, inning, batting_team,
if (batting_team=0,
min(vis_score) +
sum(if(runner_on_1st_dest>3,1,0)) +
sum(if(runner_on_2nd_dest>3,1,0)) +
sum(if(runner_on_3rd_dest>3,1,0)) +
sum(if(batter_dest>3,1,0)),
min(vis_score)) AS vis_score_end_of_inning,
min(vis_score) AS vis_score_beginning_of_inning,
if (batting_team=1,
min(home_score) +
sum(if(runner_on_1st_dest>3,1,0)) +
sum(if(runner_on_2nd_dest>3,1,0)) +
sum(if(runner_on_3rd_dest>3,1,0)) +
sum(if(batter_dest>3,1,0)),
min(home_score)) AS home_score_end_of_inning,
min(home_score) AS home_score_beginning_of_inning
from pbp.pbp2k
where substring(game_id,4,4)="2004"
group by game_id, inning, batting_team;
you then create an index:
create index runs_by_inning2004_idx
ON runs_by_inning2004(game_id, inning, batting_team);
Then you do the RE:
select p.outs,
if (p.first_runner != "", 1, 0) AS runner_on_1st,
if (p.second_runner != "", 1, 0) AS runner_on_2nd,
if (p.third_runner != "", 1, 0) AS runner_on_3rd,
sum(if (p.batting_team=0,
r.vis_score_end_of_inning - p.vis_score,
r.home_score_end_of_inning - p.home_score)
) / count(*)
AS expected_runs,
count(*) AS N
from fullpbp2004 p inner join runs_by_inning2004 r
on p.game_id=r.game_id AND p.inning=r.inning
AND p.batting_team=r.batting_team
group by runner_on_1st, runner_on_2nd, runner_on_3rd, outs;
Now the only change I make to that is I change the fullpbp2004 name to the name of the file with all my pbp data, since I don't have a table named fullpbp2004. I honestly don't know where he got that name, since he never introduced it before and he was getting his pbp data from a different file to begin with. But changing the name to the file in which I have all my pbp gets a proper return for 2004, as verified by Misterdirt. Now when I change it to 2001, by basically replacing every 2004 with 2001, I get oddball results. I get bases empty, no outs at 1.33 runs expected with 89,000+ occurrences. What am I doing wrong?
Tango Tiger
06-23-2006, 11:43 AM
It shouldn't change anything { and I'm not familiar with MySQL, but other databases sometimes have a problem with count(*) }, can you change
from: count(*)
to: sum(1)
(In some databases, you'd have to specify p.* or r.*, or even a field name, perhaps even non-nullable. Sum(1) simply bypasses all those issues. It might even be better processing-wise, but I've never tested it.)
Tango Tiger
06-23-2006, 11:57 AM
By the way, while this table is fine as coded:
runs_by_inning2004
I don't do it that way, especially if all you care about is the RE. This will suffice:
select game_id, inning, batting_team,
(
sum(if(runner_on_1st_dest>3,1,0)) +
sum(if(runner_on_2nd_dest>3,1,0)) +
sum(if(runner_on_3rd_dest>3,1,0)) +
sum(if(batter_dest>3,1,0))
) as bat_runs_for_inning
from pbp.pbp2k
group by game_id, inning, batting_team;
If you want to get fancy, you can do:
int(runner_on_1st_dest/4) ... this returns a 0 if the value is 0,1,2,3, and a 1 if it's 4,5,6,7 .... since the max value is 6, it'll work like a charm. I'm pretty sure that the "IF" costs more than the int and "/4".
I just did this quick, so test it first.
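The divide-by-four shortcut is easy to check for the destination codes in question. In this sketch Python's integer division stands in for int(x/4); whether it is actually cheaper than IF inside a given database is not tested here.

```python
def scored_with_if(dest):
    """The IF version: 1 when the destination code is above 3, else 0."""
    return 1 if dest > 3 else 0


def scored_with_division(dest):
    """The shortcut: integer-divide the destination code by 4."""
    return dest // 4


if __name__ == "__main__":
    # Destination codes run from 0 through 6, so only those are compared.
    for dest in range(7):
        print(dest, scored_with_if(dest), scored_with_division(dest))
```

The two agree only because the largest code is 6; a code of 8 or more would give 2 under the division version.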
SABR Matt
06-23-2006, 12:01 PM
Before the play.
OK...so if the Vis_Score and Home_Score Fields are the score before the play...then it follows that the NEXT home_score and vis_score will always be the score after the play except when it is the last play of the game.
It seems like there is a way that should be able to be used to access the score of the play after yours and the score of your current play and take the difference. Forgive my naivete if I'm wrong...I'm away from my database and can't play around with it yet, and have not attempted anything beyond some simple size-reduction queries...
Of course you could also just do something like create temporary scoring flags (batterScores, 1stRunnerScores, 2ndRunnerScores, 3rdRunnerScores) that equal 1 if the corresponding destination base is 4 and 0 in all other cases, and then just derive your runsOnPlay field from that by adding up the flag counts. That would probably be the fastest thing to do with a large database.
SABR Matt
06-23-2006, 12:03 PM
Tango...why is the max value of the destination fields 6? I thought it was 0 for no advance, 1 for first base, 2 for second base, 3 for third base and 4 for scoring...and nothing above that.
Tango Tiger
06-23-2006, 12:07 PM
Matt/188: that's what I do in 187.
Matt/189: I think it's 4 for scoring an earned run, 5 for an unearned run, and 6 for a team unearned run, or something like that. Check Retrosheet's documentation. It's there.
SABR Matt
06-23-2006, 12:10 PM
I've been through the documentation three dozen times and never noticed that...*sigh*...I'm sure it's there I just never saw it...when I get back my (upgraded) computer I'll have some work to do.
SABR Matt
06-23-2006, 12:14 PM
Oh and correct me if I'm wrong but doesn't your query in 187 produce runs scored in the ENTIRE inning? I'd rather have information on each PLAY that I can easily group up and add together, and I think rather than calling the SUM function four times and then asking for an IF statement to check it, it would be faster and more useful to calculate runsOnPlay for each single play as a simple sum on scoring flags, and then just delete the scoring flag fields and work without all the complex IF statements in the summation query when I start asking about run scoring for a whole inning.
You have to use the IF statements at some point...I'd rather do that in pre-processing and do it just once...and then leave no record of that for all future calculations.
Tango Tiger
06-23-2006, 01:28 PM
http://www.retrosheet.org/datause.txt
RETROSHEET: HOW TO USE OUR EVENT FILES
....
58 batter dest* (5 if scores and unearned, 6 if team unearned)
59 runner on 1st dest* (5 if scores and unearned, 6 if team unearned)
60 runner on 2nd dest* (5 if scores and unearned, 6 if team unearned)
61 runner on 3rd dest* (5 if scores and unearned, 6 if team unearned)
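Given those codes, a per-play runs count along the lines of the scoring flags discussed earlier could look like this. It is a sketch that treats 4, 5 and 6 all as "scored", as the excerpt and the posts above describe; the function and constant names are invented for the example.

```python
# Destination codes that mean the player scored:
# 4 = scores, 5 = scores and unearned, 6 = team unearned.
SCORING_CODES = {4, 5, 6}


def runs_from_destinations(batter_dest, first_dest, second_dest, third_dest):
    """Runs scored on one play: how many of the four destinations are scores."""
    destinations = (batter_dest, first_dest, second_dest, third_dest)
    return sum(1 for dest in destinations if dest in SCORING_CODES)
```

Testing only for a destination equal to 4 would silently drop every unearned run, which is why the hack's queries use >3 instead.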
Tango Tiger
06-23-2006, 01:30 PM
Matt/192: Ouch, that's what happens when I don't think too much. Yes, of course what I do won't work.
SABR Matt
06-23-2006, 05:02 PM
Obviously you know way more than I do about how to make things work the way you feel comfortable...I'm sure what you posted is a part of a larger method...
Ubiquitous
06-23-2006, 11:22 PM
I'm thinking that when I set up the database the first time I must have screwed something up because I created a whole new database with exactly the same info and I can now do RE for all the years without the data being screwy.
2000      0     1     2        2001      0     1     2
BE        0.58  0.31  0.12     BE        0.53  0.29  0.12
1st       0.98  0.60  0.27     1st       0.92  0.55  0.25
2nd       1.18  0.73  0.33     2nd       1.17  0.71  0.35
1st-2nd   1.62  1.01  0.49     1st-2nd   1.53  0.92  0.44
3rd       1.51  0.99  0.41     3rd       1.52  0.98  0.37
1st-3rd   1.89  1.22  0.51     1st-3rd   1.85  1.26  0.53
2nd-3rd   2.04  1.50  0.64     2nd-3rd   2.04  1.43  0.61
Bfull     2.51  1.70  0.82     Bfull     2.35  1.60  0.80
Ubiquitous
06-23-2006, 11:37 PM
And because I'm so happy I figured it out here is the 2000-2005 RE:
2000's    0     1     2
BE        0.54  0.29  0.11
1st       0.93  0.56  0.24
2nd       1.46  0.98  0.38
1st-2nd   1.86  1.22  0.53
3rd       1.17  0.71  0.34
1st-3rd   1.53  0.95  0.46
2nd-3rd   2.04  1.44  0.61
Bfull     2.38  1.59  0.80
So Tango is there more to your logical RE? I know at the bottom of the article it said more to come, any timeframe on that?
SABR Matt
06-24-2006, 12:53 AM
Once you have REs you have to create a unifying key for each base/out state that exists in both the PBP Event tables and the RE tables you generate so that you can link those two tables and use RE values to calculate linear weights...that's the tricky part.
Ubiquitous
06-24-2006, 01:13 AM
There are tricky parts all over this darn thing.
Right now I am downloading all of the 2006 PBP data and hope to have it in Retrosheet format. I'll do it overnight. I tried to get the 2006 seasonal stats but couldn't get it to work right. Hopefully the pbp data will work. It takes about a minute a game, 15 games or so a day, 70 days of games, so let's see, that is 17.5 hours of downloading. Geez, I hope it doesn't take that long.
SABR Matt
06-24-2006, 01:37 AM
How are you getting 2006 PBP data?
Ubiquitous
06-24-2006, 01:42 AM
Through the Baseball Hacks scripts. Though like I said, nothing is easy. For whatever reason the connection keeps cutting out and I have to restart the program. Don't know why yet and I don't know if I'll bother continuing.
SABR Matt
06-24-2006, 01:49 AM
Hmm...that doesn't sound promising. :(
Ubiquitous
06-24-2006, 01:58 AM
The good news is if you do have to run the program over again it will skip over any games you have already done. Bad news is that it will also do this for any partial games as well. You probably have to go back delete the partial game and then figure out what the number was for that game and reword the program to just download that game. Or the quickest way is to delete day and rerun it. I don't know why the connection keeps going out. Perhaps it is just my connection.
misterdirt
06-24-2006, 09:47 AM
Ub - How did the download go? Have you tried his program for parsing the data into Retrosheet format?
Ubiquitous
06-24-2006, 10:22 AM
For whatever reason so far it is extremely buggy. I don't know if it is my connection or what but around every 8 or 9 games the connection is lost and I have to delete the game that is being downloaded and rerun the program. Right now I have about 6 days worth of games. So it is going to take awhile to get to the next step.
I think this program wouldn't be so bad if you keep on top of the updating from the start of the season. Every night or every other night run the program that way you don't have to try and download over 700 games and the problems I'm having.
It is a good book but I will say nothing runs smoothly in it. Everything has to be finessed and there is very little explanation as to what he is doing and how to go about doing it yourself. For instance last night I was trying to spider MLB seasonal data using his Hack. It worked fine for 2005 fielding which was the example he gave but it didn't really work when I tried to get 2006 hitting. Instead of giving me 700 names and their stats it gave me the top 50 names over and over again. For whatever reason it wasn't looking to the next page for players 51 and beyond for hitting while it did so for fielding last year.
Ubiquitous
06-24-2006, 11:37 AM
Well the parser program doesn't work.
It gives me this line
junk after document element at line 1, column 875, byte 875 at C:/Perl/site/lib/XML/Parser.pm line 187
SABR Matt
06-24-2006, 01:22 PM
Oy...
I'm glad I didn't try his "how to get play by play data for current seasons" hacks...they don't sound fun. >(
Corn Beef Labia
06-24-2006, 01:47 PM
Baseball Hacks sounds interesting. How effective is it at presenting and teaching its material?
SABR Matt
06-24-2006, 04:53 PM
It's...OK...at presenting the material, but I wouldn't get it and expect it to make you a skilled baseball researcher...it's got problems...some of which are discussed in this thread.
Corn Beef Labia
06-24-2006, 07:17 PM
It's...OK...at presenting the material, but I wouldn't get it and expect it to make you a skilled baseball researcher...it's got problems...some of which are discussed in this thread.
I think I'm still going to buy it. I'm ready for the challenge.
SABR Matt
06-24-2006, 07:20 PM
I would never say "don't buy it"...it's worth having because it does give you a starting point to build off of. I don't regret buying it.
Ubiquitous
06-24-2006, 09:01 PM
Oy...
I'm glad I didn't try his "how to get play by play data for current seasons" hacks...they don't sound fun. >(
Well, that is the hack I'm doing, and when I have time off I'm going to have to play with his scripts to get the parser to work.
SABR Matt
06-24-2006, 09:49 PM
Yeah...I had all kinds of problems making his script to fetch PBP data from Retrosheet work (he made assumptions about what kind of OS you were working with...a number of his commands don't work in Windows unless you download some extra bundles, but he neglected to mention that)...the current seasons script sounds even more convoluted.
Ubiquitous
06-24-2006, 10:01 PM
I was able to get the PBP data from Retrosheet with his scripts but yeah you do have to play around with them a bit. He does tell you though in the beginning what platform he was using and describing in. Right now I am trying his Hack28 scripts unmodified to see if they work the way they were written. It usually doesn't work at first when you deviate from the info he is trying to setup. Right now the parser program is having trouble with the XML tags. I don't know enough about XML to do anything about it though. Also it seems that the spider program is running smoother with the 2005 data. That could be a fluke or that problem could be with my setup though. I'll check the parser program in a minute.
SABR Matt
06-24-2006, 10:22 PM
I was able to get the PBP fetching script to work too, it just took me about 8 tries and some consultation with a friend of mine who's a computer programmer by trade.
I expect similar issues in the future.
Ubiquitous
06-24-2006, 10:41 PM
Well I tried the PBP fetch and parse the way it is written for 2005 and of course it works like a charm. But if I do it for 2006 it doesn't work, and all I change in the script is 2005 to 2006. So something in the script is looking for something specific in time and I can't find it. If I run the original script on my 2006 files I don't get an error message but it won't convert the files because it is looking for 2005 files. I'll have to see if I can mess around with it and get it to look at 2006.
SABR Matt
06-24-2006, 10:49 PM
Yeah...I had problems with the ordinary Event File fetcher when I tried changing the year range from the standard 1960-1992 so that it would cover up through 1998...that blew up on me so I wound up doing that the hard way.
Ubiquitous
06-24-2006, 10:59 PM
That one took me a second try. I basically copied and pasted the 2000 part of the script and then changed it so it was looking for 1997 and 1998 ML instead of 2000 and up. Worked like a charm after that. The script to turn it into a readable format in MySQL did not work, though. I had to copy and paste the first part, where it creates the layout, into MySQL directly, and then copy and paste the second part, where it loads the data. The perl script as a whole for some reason wouldn't work.
But the up-to-date PBP fetcher is what I really want, and it is driving me crazy that it is not working.
Ubiquitous
06-24-2006, 11:13 PM
I don't know why and I don't know how but I got the parser to work and I did the exact same thing I did before. The only difference is that I started the spider on April 11th instead of the start of the season. So I'll try to run it again from the start and see what I can get. Now it looks like the biggest headache is the dropped connections.
You should give the hack a try; all you need is the parser, the spider, and the Hack-28 save_to_db file and you can run it. The only change is that you have to set the start date to 3,3,106 and the end date to 24,5,106. See if the connection ends on you too. If it doesn't, then it is my connection going screwy. If it works, then possibly you could email me the files in chunks at a time.
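Those odd-looking date settings make sense if the script hands them to a Perl time routine that takes day, month and year with months counted from 0 and years counted from 1900. That is an assumption about the script rather than something stated in the thread. Under it, the two settings decode as shown in this sketch (the function name is made up):

```python
from datetime import date


def perl_fields_to_date(mday, mon, year):
    """Convert Perl-style date fields to a calendar date.

    Assumes the Perl convention: mon is 0-based (0 = January) and
    year is the number of years since 1900 (106 = 2006).
    """
    return date(year + 1900, mon + 1, mday)


if __name__ == "__main__":
    print(perl_fields_to_date(3, 3, 106))   # start setting from the post
    print(perl_fields_to_date(24, 5, 106))  # end setting from the post
```

So to fetch a different range, the month you type has to be one less than the calendar month.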
SABR Matt
06-24-2006, 11:19 PM
I'll see if I can make it work once I re-download Perl (I just upgraded my computer, so I'm having to reload a lot of programs onto the new HD).
Ubiquitous
06-26-2006, 02:13 PM
I've finally downloaded all of the XML files through yesterday. Unfortunately the parser file is buggy. The XML tags for whatever reason are a little screwy on certain days. Removing the day allows it to keep going so somehow I'll have to figure out how to fix that or bypass it. The second flaw is that the MLB site stops giving out player.txt files after April 24th. I don't know why, in 2005 they had txt files all the way through the season. His parser program looks for this file so when it can't find it it stops running. Now I can create some sort of player.txt file only problem is that they are not constant so that probably would be buggy, I could create a program that creates the txt file or I could somehow remove that part of the program. All 3 are probably going to be hard to pull off without problems and it still leaves the XML tag problem.
SABR Matt
06-26-2006, 02:38 PM
I'm guessing MLB only updates the player.txt files once in a while...although I'd be surprised if they haven't updated it since the 23rd of April.
This sounds like way more trouble than it's worth even though eventually I would like to be able to calculate PCA ratings "in-season"...:(
Ubiquitous
06-26-2006, 03:03 PM
I'm thinking a computer programmer could quickly and easily create a player text file based on the XML files of the game in question. All it is is a text file that provides the name and jersey number for each player ID number on the team. Each game has XML files for each player that tell us that info. Unfortunately I am not a programmer. I sent an email off to the author of Baseball Hacks and hopefully he can answer some of these problems, along with the XML tag problem and the connection problem.
Ubiquitous
06-26-2006, 03:36 PM
The author contacted me back and unfortunately it wasn't the cure-all. I was right in that one can use the XML from MLB; now I have to figure out how to get the spider program to download it.
SABR Matt
06-26-2006, 03:41 PM
Perhaps if you ever get this thing to work, you ought to post your modified version of the code at one of the big Y! Groups so that others don't have to spend all of this agonizing time trying to figure it out.
That would definitely be a very popular bit of code.
Ubiquitous
06-26-2006, 05:53 PM
Okay, I'm pretty sure I'm going to call it quits on this one. I set up the new code to find the players.xml and it worked like a charm. Then I changed the parser to use that instead, and that worked like a charm too. But by doing this I discovered two fatal flaws in MLB's data. The first flaw is not all that fatal, just a pain in the ass: whenever a switch hitter comes up and the data has a line saying he turns around to bat left- or right-handed, the MLB data counts that as an at bat and in fact does not list the real at bat of the previous batter. So instead of, say, "Magglio Ordonez hits a homer to left" it will say "Dmitri Young turns around to bat right handed." Kind of a big error. It is fixable, but one has to go in and manually change that at bat by finding out what the real at bat was, the real player's ID, and so forth. Not a big problem if you were to upload the data once a night. But when you are doing 3 or so months or a whole year, it will happen enough to make you unhappy about it.
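Finding the bad rows, at least, could be automated even if repairing them stays manual. A rough sketch, assuming each parsed event is a dict with a free-text description under a "des" key (a made-up layout, with invented sample events):

```python
import re

# Matches descriptions such as "... turns around to bat right handed".
TURNAROUND = re.compile(r"turns around to bat (left|right)[- ]?handed", re.IGNORECASE)

def flag_turnarounds(events):
    """Return the positions of events that are really a batter switching sides."""
    return [i for i, event in enumerate(events) if TURNAROUND.search(event["des"])]

sample_events = [
    {"batter": "Batter A", "des": "Batter A flies out to center."},
    {"batter": "Batter B", "des": "Batter B turns around to bat right handed."},
    {"batter": "Batter B", "des": "Batter B grounds out to short."},
]
flagged = flag_turnarounds(sample_events)
print(flagged)
```

Each flagged position marks a spot where the real at bat has to be looked up and re-entered by hand.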
The second and most fatal flaw is that the PBP data is incomplete, and I have no real idea by how much. I discovered this while fixing the switch-hitter problems. In at least one game the files completely omitted the bottom half of an inning, about 6 or so at bats if I recall. If it did that for one inning, it has probably done it in other games as well, which would make doing all this pointless. I only really have two options for this. A) I could go through each game manually and make sure both halves of each inning are there, then manually type in the data for the missing parts. That is tedious, not totally guaranteed to catch everything, and not really going to happen. Or B) I could try to find another site and download the data from them. This causes a whole host of other problems, especially if they word their data differently, which would then force me to modify the scripts even more. Ideally someone could create a program that highlights the problem innings (for instance, if an inning doesn't have a top and bottom half and 3 outs on each side, unless of course it is the 9th inning or later) and then downloads that missing info from other sites.
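The highlighting half of that idea can be sketched fairly simply. The version below assumes the plays for one game are available as (inning, half, outs recorded on the play) tuples, which is a made-up layout, and it only flags suspects; it does not fetch anything from other sites.

```python
from collections import defaultdict

def suspect_half_innings(plays, regulation=9):
    """Return (inning, half) pairs that do not add up to exactly 3 outs.

    plays: iterable of (inning, half, outs_on_play), half is 'top' or 'bottom'.
    The bottom of the 9th or later is skipped, since the home team can
    legitimately leave it short or unplayed.
    """
    outs = defaultdict(int)
    last_inning = 0
    for inning, half, outs_on_play in plays:
        outs[(inning, half)] += outs_on_play
        last_inning = max(last_inning, inning)

    suspects = []
    for inning in range(1, last_inning + 1):
        for half in ("top", "bottom"):
            if half == "bottom" and inning >= regulation:
                continue
            if outs.get((inning, half), 0) != 3:
                suspects.append((inning, half))
    return suspects

# Made-up game fragment in which the bottom of the 1st is missing entirely.
sample_plays = [(1, "top", 1)] * 3 + [(2, "top", 1)] * 3 + [(2, "bottom", 1)] * 3
print(suspect_half_innings(sample_plays))
```

This would not catch a game whose final innings are missing altogether, and rain-shortened games would need their own handling.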
Ubiquitous
06-26-2006, 10:32 PM
Yep, it's official: I'm quitting this. The MLB.com data is missing lots of data, the script is having trouble interpreting substitutions, switch hitting is corrupting at bats, and I can't figure out the ESPN or Yahoo stat sites to get their data. So to stop the madness and the wasting of my time, I'm calling it.
SABR Matt
06-26-2006, 11:02 PM
I had a feeling that was coming.
Damnit. LOL
I salute you for giving it a shot though...real-time data acquisition would have been nice.
Ubiquitous
06-26-2006, 11:36 PM
If one is a programmer, I'm betting it would be pretty easy to fix his scripts so that they could easily handle 2006 data. Unfortunately the author has stated he hasn't had time to modify the scripts, which I think is a pretty poor excuse. He wrote a book and basically all of it is unworkable in some way. He doesn't even have a site or an area at O'Reilly books that allows one to discuss it with fellow readers and the author. There is no place to ask a question and have a community help out with the answer. Instead you are left scratching in the dark, discouraged to move at all.
SABR Matt
06-27-2006, 12:02 AM
Yeah...it's really unfortunate that a corporate type did this project rather than someone who actually wants to devote attention to sabermetric research. We need a guy like Tango (I know you're busy Tom, I'm just using you as a good example of someone who knows what the heck he's doing...LOL) or Tom Ruane to step up and tell the "idiots" out here who want to do research but lack the programming skill what they need to do to get the data.
Tango Tiger
06-27-2006, 02:01 PM
I've communicated with Joe Adler a few times, and he seems pretty detailed, and not at all a "corporate type". When you write code, you are beholden to outside forces, like MLB.com. If they change something on their side, then you have to change your hack as well. This is why they are called hacks and not software programs. MLB.com does not provide any APIs.
I haven't had a chance to implement any of his hacks, but if/when I do, I'll make sure to post my comments on my blog.
I'm not sure what kind of expectations a reader should have of the author. For example, in our book, the three of us spent in excess of 1,000 hours, for which we got, essentially, less than minimum wage. My guess is that Joe, and most writers, are in the same boat. I'm not sure that a writer should devote anything more than his finished product, and just let the market dictate the acceptance of it. You wouldn't expect an artist to change his painting, or a director to change his movie.
The same thing happened with Kerby's A.S.S. He creates a great product, gives it out for free, and then people complain when things don't work, and they expect Ray to devote more time to handle questions and supply fixes. In the end, his program is no longer supported, and the upgrades he wanted to implement did not materialize. (I don't blame him for not doing it.)
Ubiquitous
06-27-2006, 03:11 PM
I'm not complaining that he isn't doing more. I'm complaining that his publishing company isn't doing more. Providing a better service is a way to ensure a happy and repeat customer. I've talked with Mr. Adler and thanked him for his book and his time and do not expect anything more from him, nor blame him for most of the problems. I do think that he failed to meet the audience that he stated he was writing for. I don't think he adequately explained his scripts in a way that a novice could then use them in any kind of real functional manner. It was a how-to book that really didn't teach one to do; it simply let you copy and paste and do only exactly what he did beforehand, which by this point is outdated.
Does he have to continually update his book? No, but I do think his publisher should create some sort of online area that allows customers to talk about the book and work out the kinks and ask questions. It seems kind of silly that a company that specializes in teaching doesn't have a way to refine the education to make their product better.
Tango Tiger
06-27-2006, 09:56 PM
I think you should definitely contact the publisher for some of the claims they've made ("tested and verified to best of our ability").
As for the online support forum, I don't think they want to get into that, otherwise they would have already. In terms of cost/benefit, they're probably happy with the way they are.
misterdirt
06-27-2006, 10:02 PM
I do think that he failed to meet the audience that he stated he was writing for. I don't think he adequately explained his scripts in a way that a novice could then use them in any kind of real functional manner.
How did he fail? You said yourself that the hack worked on 2005 data. His job wasn't to teach you to be a programmer. Nor was it to anticipate changes that might be made in the way data was recorded that would necessitate changes in his programs. Just the fact that he has let his audience know that there are ways to access data that we might not have thought of before makes it a successful book. Hack #28, which seems to have given you the most trouble, is rated as an "expert" hack. To me that means that his audience has a fair amount of computer knowledge. That you, a self-proclaimed computer programming novice, were as successful at it as you were says a lot about your persistence but also shows that the book WAS successful at its job.
No, but I do think his publisher should create some sort of online area that allows customers to talk about the book and work out the kinks and ask questions.
The publisher does have just the area that you have described. It not only lists errata that have been confirmed by the author and publisher but also problems that have been discovered by the readers. Really, how can you expect more?
Ubiquitous
06-27-2006, 10:08 PM
I think you should definitely contact the publisher for some of the claims they've made ("tested and verified to best of our ability").
As for the online support forum, I don't think they want to get into that, otherwise they would have already. In terms of cost/benefit, they're probably happy with the way they are.
I'm sure they are happy with their setup, but that doesn't mean I have to be. Buying one of their books was always going to be a one-time shot for me, but if I had known then what I know now about the book and the lack of support I would not have bought it. Nor would I buy more books from them in the future knowing what trouble lies ahead. Bottom line is I think they have poor customer support, which to me is the most crucial part of how-to books, whether that be with a better how-to book itself or a better after-book help area.
Ubiquitous
06-27-2006, 10:26 PM
How did he fail? You said yourself that the hack worked on 2005 data. His job wasn't to teach you to be a programmer.
Yes, it worked, but it worked poorly. Also, his instructions on what to do were poor. For instance, Hack 28 requires a save_to_DB.pm file to be in the folder with the spider and the parser. He doesn't tell you this. You have to figure that out. This omitting of important instructions is common in this book; practically every hack I encountered neglected to tell you something that you had to do in order for the hack to work. I couldn't just take his hack and run it and have it work. I could get it to work, but only after I figured out what step he omitted or what he neglected to tell me I had to change to get it to work on my computer.
Nor was it to anticipate changes that might be made in the way data was recorded that would necessitate changes in his programs. Just the fact that he has let his audience know that there are ways to access data that we might not have thought of before makes it a successful book.
I don't hold him accountable for the changing data on the net. Never did. I understand that it will change but I
Hack #28, which seems to have given you the most trouble, is rated as an "expert" hack. To me that means that his audience has a fair amount of computer knowledge. That you, a self-proclaimed computer programming novice, were as successful at it as you were says a lot about your persistence but also shows that the book WAS successful at its job.
Looking back at Hack 28 I'm not totally satisfied that it would indeed work without problems for 2005 data. The spider.pl never worked smoothly no matter what it was used on, and after reviewing the logs it turns out the parser.pl program is extremely buggy as well.
The publisher does have just the area that you have described. It not only lists errata that have been confirmed by the author and publisher but also problems that have been discovered by the readers. Really, how can you expect more?
Because the errata is typos, and the reader area again pretty much deals with typos. What can I expect? I can expect them to say something like: to do this for 2006, here is what I did. Doing this for ESPN.com, this is what I did. Using Windows causes this problem; this is how it is overcome. The author himself doesn't even need to be involved; you could let the readers do it. I expect to be able to ask questions and have a community or some sort of tech support answer them. Or at least have the possibility of having the question answered. I sent an email asking about 3 problems. I got help on one, pretty much got the data problem handled the way I knew it would be (again no slight on him, I knew it wasn't something to do with his program), and the other question was ignored. I appreciated his time and thanked him for helping any way he could. But I don't think it should end there. No, I don't think I should take up the author's time, but I do think they should allow people to post questions and have others answer them. This is a how-to book; the author and publisher are selling you a product that is meant to teach you how to do something. The publisher also sells a whole line of books on this same theme. No follow-up and no way to fine-tune the learning is a pretty piss-poor way of creating loyal customers. All it takes is one competing business to do a better job at customer support and poof, there go your loyal customers.
Read his preface. Read what he says this book was meant to do. I don't think he met those goals.
SABR Matt
06-27-2006, 10:26 PM
There's essentially no overhead in terms of cost in running your basic discussion forum online...there's no reason they couldn't have done that.
Ubiquitous
06-27-2006, 10:29 PM
Oh, and I know Hack 28 doesn't work as written, because when you run the parser program it fails to create a teamyear file, which is required by bevent to convert into a readable .txt file, which is the whole point of the parser program.
Tango Tiger
06-28-2006, 08:52 AM
Bottom line is I think they have poor customer support, which to me is the most crucial part of how-to books.
Welcome to the world of Computer Programming Tutorials (and computer software in general).
There's essentially no overhead in terms of cost in running your basic discussion forum online...there's no reason they couldn't have done that.
Sure there is. O'Reilly's got several hundred books! They open up a forum, and you'll get millions of hits a month. On top of which is the expectation that if O'Reilly hosts such a site, then they need to have their authors or other experts be able to handle such a site.
Oracle, on the other hand, does have a Forum:
http://forums.oracle.com/forums/index.jspa?categoryID=1
But their software costs a bundle (not $20), so in terms of cost/benefit, it's definitely there. They don't want customers jumping ship to Microsoft.
O'Reilly is *extremely* successful. They put out tremendous books. If it means losing a few customers in return for not spending thousands of dollars on a discussion board, it makes perfect sense to me.
In any case, usenet is perfectly suitable to handle the discussions.
SABR Matt
06-28-2006, 04:20 PM
Usenet is an ironic name for something so completely unusable. It's a tremendous pain in the ass attempting to set up your computer to properly make use of the newsgroups.
Tango Tiger
06-28-2006, 07:41 PM
I simply go through Google:
http://groups.google.com/
piece of cake.
SABR Matt
06-28-2006, 07:55 PM
Didn't know this service was available. Been using all of the extra software and hunting for servers that work etc...pain in the ass.
Maybe this is easier.
Taco De Muerte
06-29-2006, 09:08 PM
I just picked up Baseball Hacks recently. So far it seems pretty easy, but from what I've read, it gets harder as the book goes on.
SABR Matt
07-05-2006, 02:49 PM
Dude...you guys should see some of the coding work I've done the last few days in MySQL. My goal is to (a) shrink the sizes of all of the components of my database as much as I can get away with without losing distinct information and (b) get the PBP event files, the gamelogs and the BDB talking to each other. I haven't finished optimizing some of the Lahman database tables, but I now have readily available (and checked as much as physically possible for accuracy) critical pieces of information in the PBP database, like runs scored on each play, the base/out state before and after each play, a more exhaustive event numbering that includes distinct event enumeration for sacrifice bunts, sacrifice flies, DP and TP (which were oddly excluded from the original event numbering system), and the like. All player information in the PBP logs is now in the form of a lahmanID (the numeric primary key of the master table in the BDB), and the reference to which game each event takes place in is now just a GameID, which is the new primary key of the gamelog table. The gamelog table references teams in Lahman format with a numeric key I added.
I believe I am starting to really get the hang of programming in MySQL.
Taco De Muerte
07-05-2006, 11:22 PM
To be honest, I'm having problems trying to install MySQL 5.0.
SABR Matt
07-06-2006, 03:02 AM
You're having difficulty installing MySQL?
I can probably help you with that if you describe the nature of the problem.
Taco De Muerte
07-06-2006, 01:55 PM
You're having difficulty installing MySQL?
I can probably help you with that if you describe the nature of the problem.
Got that problem fixed, but now I'm having trouble with step four (import the database) for Hack 10. I did everything the hack instructed, so then I moved on to step 5 to see if the database loaded, and it says no database selected.
SABR Matt
07-06-2006, 03:15 PM
OK Tango...I have a question.
I've been trying to figure this out all day.
I'm trying to calculate run expectancy for each league/year and each base/out state.
The way my database is structured, I have the starting and ending base/out states for each play, as well as the runs scored on each play and I have half-innings coded in one number rather than in an inning and side thing as the data normally comes in.
How do I get (in SQL code) the runs scored after each play to the end of a half-inning? I can't figure out how to tally the runs just to the point of the next end-of-inning.
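One possible approach is a correlated subquery that sums the runs over every play in the same game and half-inning from the current play onward. The sketch below runs it through Python's sqlite3 module so it works standalone. The table name, column names, and sample rows are invented, and the sum deliberately includes runs scored on the play itself, so the condition would need changing to `>` to count only what comes after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (game_id INTEGER, half_inning INTEGER, play_no INTEGER, runs INTEGER)"
)
# Made-up rows: one game, two half-innings.
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 0), (1, 1, 2, 1), (1, 1, 3, 0), (1, 1, 4, 2), (1, 2, 5, 0), (1, 2, 6, 1)],
)

QUERY = """
SELECT p.play_no,
       (SELECT SUM(q.runs)
          FROM plays q
         WHERE q.game_id = p.game_id
           AND q.half_inning = p.half_inning
           AND q.play_no >= p.play_no) AS runs_to_end
  FROM plays p
 ORDER BY p.game_id, p.play_no
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The query uses only plain SQL, so it should carry over to MySQL with little or no change, though on a full play-by-play table it would likely want an index on the game and half-inning columns to run in reasonable time.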
SABR Matt
07-07-2006, 05:34 AM
Never mind...I figured it out. :)
I suspect I took the long way round, but it works. :D
SABR Matt
07-07-2006, 06:58 AM
The hack in Baseball Hacks to calculate RE is horribly inelegant and not very nice as far as adaptability to other years or forms of the PBP database...I actually found run expectancy without even looking at that hack...
I now have REs for every year and league from 1957 to 2005. :D
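For anyone following along, the last step amounts to averaging the runs scored to the end of the half-inning over every play that started in a given base/out state, separately for each year and league. A rough sketch with invented rows and a made-up state encoding (not the actual database layout or method used here):

```python
from collections import defaultdict

def run_expectancy(rows):
    """Average runs-to-end-of-half-inning per (year, league, start_state).

    rows: iterable of (year, league, start_state, runs_to_end).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [run total, play count]
    for year, league, state, runs in rows:
        entry = totals[(year, league, state)]
        entry[0] += runs
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up plays; "000-0" stands for bases empty, nobody out.
sample_rows = [
    (2005, "AL", "000-0", 0),
    (2005, "AL", "000-0", 1),
    (2005, "AL", "100-1", 2),
]
re_table = run_expectancy(sample_rows)
print(re_table)
```

In SQL the same thing is a GROUP BY on year, league, and starting state with an AVG over the runs-to-end column.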
Now...the next challenge...Linear Weights.