Retrosheet


Frequently Asked Questions


When will additional seasons be posted on the web site?

12-7-2002

Retrosheet policy is firm that game information is posted on the web site only after a detailed process of proofing and comparison to official totals. This policy covers the narrative game accounts and box scores as well as the data files that are our primary form of storage. Depending on the number of corrections that need to be made and on the nature of our raw information, this proofing can take many months for a single season. It is, therefore, not possible to make promises or predictions concerning future releases. Proofing work is always underway and help is always needed in this area. See the FAQ on volunteering for more details.

In addition to the regular season data which comprise the bulk of our information, Retrosheet also has the data for all post-season games, beginning with the 1903 World Series. These games are also being proofed and are a high priority for release, although once again a specific timetable is not possible.

How can I volunteer to help?

12-14-2002

Retrosheet has always depended on the generous donation of time by volunteers. This is still true, but circumstances have changed since we began work in 1989. As a result, the nature of the help we need has changed as well. A brief historical summary is a good place to begin, although this is a multi-faceted subject with many points to address.

First, where did we get the raw data? We began by collecting scorebooks from the Major League teams. Teams vary widely in the extent of their historical records. Some of the “original 16” (clearly an incorrect term, but one that is commonly used to describe the era in which there were no franchise changes, from 1903 to 1952) have scorebooks going back to 1946, whereas some have nothing older than 1974. The majority have coverage back into the 1960s. After many years, we were able to persuade all 26 teams that existed in 1983 to allow us to copy what they had. While this was going on, we also made contacts with dozens of sportswriters and announcers (and the families of those who were deceased) and obtained a very large number of additional accounts that the teams did not have. The third major source is individual fans who scored the games, usually at the park on a standard scorecard. This has also been a very important way for us to get information, and it is ongoing as Retrosheet volunteers prowl eBay and other on-line auction services for these items. The final data source for us has been play by play descriptions from newspapers. Many fans are not aware that for much of the 20th century (and also the 19th century), it was common for newspapers to carry the full play by play text of the game played in that city that day (one example is the Cleveland Plain Dealer account of the game of April 30, 1930). This practice was especially important in the days before widespread radio coverage. Since only day games were played until the late 1930s, the usual pattern was that an evening paper, published at 6 PM or so, would have the game information. All told, Retrosheet has over 105,000 accounts of games played prior to 1984, representing about 70,000 games (there are many games for which we have multiple, independent accounts).

As these accounts were being collected, we began the process of translating them into our format so the information could be stored in the computer. Without a rational computer system for describing and recording the events of the games, it would be impossible to do analysis or even meaningful archiving of the games. Using the DiamondWare system, dozens of volunteers spent literally thousands of hours poring over scoresheets, deciphering occasionally mysterious notations, and entering them into our standard computer format. Anyone who keeps a scoresheet develops individual, special features and some of the ones we have were extremely challenging to translate, but our stalwart crew got the job done. At this point, we have very few scoresheets and programs left to process, almost all of them from earlier than 1953, and many of them are among our more difficult sources. It is not feasible to break in potential new volunteers on this kind of material. Therefore, Retrosheet is no longer looking for additional volunteers to enter games from scoresheets, although there are other ways in which volunteers can contribute.

We do still have a few thousand of the pre-World War II newspaper accounts on hand that need attention. However, many of these present a separate problem, namely that we do not have box scores to go with them. Without a box score, it is almost impossible to enter one of these newspaper accounts, since fielders are referred to by name, not position. To illustrate with a simple example, "a fly ball to Ruth" might be "7" or "9", depending on which park the Yankees were playing in. (Ruth played over 1000 games each in left and right fields. Although this particular fact has been known by SABR researchers for some time, there are many other interesting little tidbits that we uncover.) This problem can be solved by having the box score from The Sporting News or the New York Times. We can use help in making copies from either source (from microfilm), either as part of inputting activity or as a separate effort. Contact Dave Smith (dwsmith@udel.edu) if you have the time to spend on it. He will steer you to the seasons we need help with. Your costs for photocopying and postage will be reimbursed, but of course we have to rely on the generous donation of a much more valuable commodity, namely your time. Finally, note that there are still many thousands of newspaper accounts out there waiting to be copied, but this will usually entail borrowing microfilmed newspapers via Interlibrary Loan, which is not realistic for most people. However, we would welcome with open arms offers to do this kind of data collection. Once again, contact Dave for more details.
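The Ruth example can be made concrete with a short sketch. It is only an illustration of why the box score is needed, not a description of how Retrosheet's own software works. The function name and the two lineups are hypothetical; the numbers follow standard scoring notation, where 7 is left field, 8 is center field and 9 is right field.

```python
# Hypothetical lineups as they might be read from two box scores.
# Each maps a fielder's surname to the position number he played that day.
lineup_game_a = {"Ruth": 9, "Combs": 8}  # Ruth in right field
lineup_game_b = {"Ruth": 7, "Combs": 8}  # Ruth in left field


def fielder_position(name, lineup):
    """Return the position number for a fielder named in a newspaper account."""
    if name not in lineup:
        # Without a box score entry the play cannot be coded reliably.
        raise KeyError(f"{name} is not in the box score for this game")
    return lineup[name]


# "A fly ball to Ruth" is coded differently in the two games.
print(fielder_position("Ruth", lineup_game_a))
print(fielder_position("Ruth", lineup_game_b))
```

The same newspaper sentence yields a different play code in each game, which is why an account without its box score usually has to wait.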

There is another kind of data collection that does not involve play by play data, but rather a careful study of box scores. There are really two categories here. The first is what might be called ancillary information such as game length, attendance, umpires, starting time (or at least day/night indication). It is surprising how many of the scoresheets from the teams, writers and announcers did not record these items. Although our accounts are still useful without them, it is disappointing not to have such basic facts. The second category involves our game logs, which contain over 100 data items for every game (see http://www.retrosheet.org/gamelogs/index.html). There are varying degrees of completeness in the logs for different years. For the years from 1974 to the present, the logs are complete. However, for earlier seasons only the bare minimum of information is present, although we do have scores, managers and starting pitchers. As we get more seasons proofed and ready for release, we will be able to fill in the details for these years. However, for many seasons our holdings are quite incomplete, and it is unlikely that we can complete the logs from our own play by play files. Once again the good news is that all or most of the items we need are readily available from the daily newspaper box scores. If you are interested in spending some time at the microfilm machine tracking down and recording the data, that would be very helpful. Contact Dave for more details.

Our other major activity is the proofing of games that we have already computerized. As noted in the FAQ on data release, our policy on adequate proofing before release is firm. This is an arduous process under the best of conditions. Once again examples may help. When we have all the games for a league for a season, we generate the yearly totals for each player and use a custom program to compare them to the official totals. A typical season in the 1970s will have over 700 discrepancies between batters and the official totals in each league and about 350 pitcher differences. Each discrepancy counted in those numbers is a difference from the official totals for one player in one category. There are many causes of these differences: mistakes in the scoresheet, errors by the translator or inputter, or incorrect information in the official record. The reality is not quite as bad as the numbers make it appear, since many of the discrepancies are reciprocal. For example, in the mid-1960s the Orioles had both Bob Johnson and Davey Johnson play second base. Several of the games were entered into the computer with the wrong player. This happened because early in the season they only had one Johnson on the roster, so the scorer did not indicate “D.Johnson” or “B.Johnson” as he did after the second one arrived from the minors. As a result one of the Johnsons had too many games, at bats, hits, runs, doubles, etc. in our files when compared to the official totals and the other Johnson had the corresponding shortfall. When the files were edited to get the right Johnson in the right games, all of those discrepancies were resolved at once, probably 20 or so. We always end up with a few dozen differences that we cannot reconcile with the official totals. Our starting assumption is always that the official numbers are correct and that we made a mistake. For those in which we cannot find an error in our work, we make a note of the discrepancy and explain why we are leaving our files as they are. A list of these disagreements is published with the data files for that year. At this point, these discrepancy lists have only been posted for a few seasons, but others are in the works and will be available as soon as we get them into proper shape.
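As a rough illustration of this kind of comparison, the sketch below counts one discrepancy for each player and category in which two sets of totals differ. It is not Retrosheet's custom program; the function names and the data layout are assumptions, and the numbers are invented purely to show a reciprocal pair like the two Johnsons.

```python
from collections import defaultdict


def find_discrepancies(ours, official):
    """List (player, category, our_value, official_value) for every mismatch."""
    diffs = []
    for player in sorted(set(ours) | set(official)):
        our_stats = ours.get(player, {})
        off_stats = official.get(player, {})
        for category in sorted(set(our_stats) | set(off_stats)):
            our_value = our_stats.get(category, 0)
            official_value = off_stats.get(category, 0)
            if our_value != official_value:
                diffs.append((player, category, our_value, official_value))
    return diffs


def net_difference_by_category(diffs):
    """Sum (ours - official) per category; zeros hint at reciprocal errors."""
    net = defaultdict(int)
    for _player, category, our_value, official_value in diffs:
        net[category] += our_value - official_value
    return dict(net)


# Invented totals, for illustration only (not real statistics).
ours = {
    "B.Johnson": {"G": 12, "AB": 40, "H": 11},
    "D.Johnson": {"G": 8, "AB": 28, "H": 6},
}
official = {
    "B.Johnson": {"G": 10, "AB": 33, "H": 9},
    "D.Johnson": {"G": 10, "AB": 35, "H": 8},
}

diffs = find_discrepancies(ours, official)
for diff in diffs:
    print(diff)
print(net_difference_by_category(diffs))
```

In this invented case every category shows a surplus for one player and an equal shortfall for the other, so the net difference in each category is zero. That pattern suggests games credited to the wrong player rather than six unrelated errors.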

So, how can volunteers help with the proofing? All the full seasons, that is those for which we have all the games, have been completed or are under their final review. The problem for the partial seasons is that we cannot do the comparison of full season totals, since our files are missing some games. There are two choices for proceeding.

1. Generate box scores from our files for the games that we do have and then proof those against newspaper box scores. This is very tedious and subject to mistakes as all the comparisons are made by visual inspection. It is also the case that newspaper box scores (remember that these are all pre-1974 games) did not routinely list many categories we have come to expect, such as caught stealing, grounded into double play, walks and strikeouts for batters. In addition, pinch-runners and pinch-hitters were often not marked that way if the players stayed in the game and went into the field. Of course, the newspaper box scores were far from error-free, so we can end up trying to track down differences that aren’t really there. However, this kind of proofing is something which can be done by anyone in any location with access to old newspapers on microfilm. Many public libraries have extensive holdings of the New York Times, and this has been a helpful contribution from several volunteers.

2. Retrosheet is fortunate to have microfilmed copies of the official daily statistics for both leagues for many seasons. This information can be taken from the film and used to create a daily log in a spreadsheet for all official categories for the players who appeared in the games that we are missing. Totals can then be easily generated for these players and combined with the numbers from the games for which we do have the play by play data. The net result is computerized season totals that can be used for comparison to the official numbers, just as we have done for the complete seasons. This is not for the faint-hearted, since the image quality of these films varies enormously and the data format was not consistent over the years either. We have not followed this option very often although there would be great value in having the daily logs computerized. It may be possible to scan the daily film and generate electronic images, which would allow this work to be done at home and not a library (or other source of a microfilm reader), but that option has not been explored in detail yet.
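The second choice amounts to adding daily-log lines for the missing games to the totals from the games we do have. A minimal sketch follows, assuming a hypothetical layout in which each spreadsheet row is a dictionary with a player, a date and the official categories. The function name, player names, dates and numbers are all invented for illustration.

```python
from collections import Counter


def season_totals(pbp_totals, daily_log_rows):
    """Combine play by play totals with daily-log lines for missing games."""
    combined = {player: Counter(stats) for player, stats in pbp_totals.items()}
    for row in daily_log_rows:
        # Everything except the identifying columns is a counting category.
        stats = {k: v for k, v in row.items() if k not in ("player", "date")}
        combined.setdefault(row["player"], Counter()).update(stats)
    return combined


# Totals from the games for which play by play data exists (invented).
pbp_totals = {"Player A": {"AB": 500, "H": 150}}

# Daily-log rows keyed in from microfilm for the missing games (invented).
daily_log_rows = [
    {"player": "Player A", "date": "05-01", "AB": 4, "H": 2},
    {"player": "Player A", "date": "05-02", "AB": 3, "H": 0},
    {"player": "Player B", "date": "05-02", "AB": 1, "H": 1},
]

combined = season_totals(pbp_totals, daily_log_rows)
print(combined["Player A"]["AB"], combined["Player A"]["H"])
print(combined["Player B"]["AB"], combined["Player B"]["H"])
```

The combined totals can then be compared to the official season numbers in the same way as for a complete season.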

There is one final area to address, which is that many people who are interested in Retrosheet and our data files are highly skilled computer professionals, representing a broad range of skills. Several have volunteered to help with programming, analysis, and other computer-related work with our files. Additional analytic software has been written by others to process our data, but we do not include links to that work, since we believe that at this time it is best for the web site to be kept as simple as possible. However, some of these “third party” efforts have been posted to the RetroList, or at least described there with details on how to contact the author. Everyone appreciates the way that this sharing has taken place and interested individuals should check the RetroList to see what has been done.

Which player's running speed is sometimes a good description for the pace at which Retrosheet gets things done?

Cecil Fielder would be a good candidate. Keep in mind that we are an all-volunteer organization and that many of us have real jobs and families that demand a lot of our time. Remember that Cecil managed to get around the bases to score a lot of runs, and that it was not unusual for him to knock one out of the park so he could take all the time he needed to circle the sacks. We hope the success Retrosheet has enjoyed makes the last analogy quite appropriate.


Page Updated: 4/9/2007
All data contained at this site is copyright © 1996-2007 by Retrosheet. All Rights Reserved.