Data-Entry and Verification in the Minnesota Sample [1]
by William C. Block and Dianne L. Star
The data-entry system for the 1880, 1850, and 1920 PUMS emphasized the dual goals of accuracy and efficiency. Achieving these objectives was not an easy task, given the often poor condition of the filmed manuscript census schedules as well as the unorthodox ways in which some census takers enumerated persons, households, and even entire districts. This essay briefly describes four issues centered on data-entry and quality: the standardized system of abbreviations, symbols, and comments used during data-entry; difficulties encountered when the manuscript schedules had partially deteriorated before filming or when enumeration was problematic; the evolution of data-entry procedures from 1880 to 1850 to 1920; and verification and transcription error rates.
To move the census data efficiently from manuscript to computer, we developed a system of symbols and standardized comments for abbreviating common responses. These abbreviations ensured that data were entered using a minimum number of keystrokes. Such abbreviations were used for common responses to questions regarding relationship, occupation, birthplace, marital status, etc. A second purpose of standardized abbreviations was to reduce the size of data dictionaries. As we realized partway through the 1880 project, precision in entering each enumerated spelling variation led to unwieldy data dictionaries containing hundreds of superfluous entries.
Data symbols also allowed post-entry processing to be conducted efficiently. Various symbols were designated for missing and illegible responses, illegible letters within responses, changes or additions made by the Census Bureau, and suggestions on the part of data-entry operators in cases of unclear or ambiguous data. The missing-data symbol saved time during the checking process by indicating that the enumerator, rather than the data-entry operator, had neglected to record a response. A response with illegible-letter symbols might be deciphered at a later time or left intact, as often happened with names that were difficult to read. When the Census Bureau made changes or additions (e.g., more specific information about counties or cities), a symbol was entered that allowed this information to be included in the field.
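The abbreviation-and-symbol scheme described above can be illustrated with a small sketch. The specific codes here (`*` for a missing response, `?` for an illegible letter) and the expansion table are hypothetical stand-ins, since the article does not list the actual symbols the project used:

```python
# Hypothetical expansion table: short codes keyed by operators map to
# full responses.  The project's actual codes are not given in the
# article; these entries are illustrative only.
ABBREVIATIONS = {
    "h": "Head",
    "w": "Wife",
    "farm": "Farmer",
    "pa": "Pennsylvania",
}

# Hypothetical data symbols reserved for post-entry processing.
MISSING = "*"         # the enumerator left the field blank
ILLEGIBLE_CHAR = "?"  # a single letter could not be read

def expand(entry: str) -> str:
    """Expand an operator's keystrokes into the full response.

    Missing-data and illegible-letter symbols pass through untouched
    so that later cleaning programs can recognize and handle them.
    """
    if entry == MISSING or ILLEGIBLE_CHAR in entry:
        return entry
    return ABBREVIATIONS.get(entry, entry)
```

With a table like this, `expand("farm")` yields "Farmer" in a few keystrokes, while a flagged entry such as "Sm?th" is preserved exactly for later review.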
Over the course of entering ten nationally representative subsamples, a set of standardized comments was developed to provide concise, consistent, and machine-readable statements for later processing. These comments indicated that certain variables contained information changed or added by the Census Bureau, that responses were enumerated in the wrong columns but were still considered valid, or that certain fields contained information exceeding the length of the field in the data-entry program or contained other oddities requiring further investigation. Standard comments also made it possible to preserve extra information that went beyond what was requested in the schedule instructions (e.g., the number of months a particular child had been in school). Nonstandard comments captured the oddities of census enumeration that did not fit within the framework we developed for standardized comments.
When the filmed manuscript schedules were in good condition, and when enumerators followed instructions and wrote clearly, the task of data-entry was relatively straightforward. Deteriorated pages, poor handwriting, bad spelling, out-of-order pages, and idiosyncratic enumerators made data-entry more difficult. Often, however, it was possible to enter most cases with a high degree of certainty using information available on the manuscript. For example, when enumerators failed to record dwelling and family numbers correctly, other information, such as street addresses, relationships, and even blank lines between dwellings, was usually available and served as a proxy for those numbers. Some enumerators used their own systems of ditto marks and abbreviations; some dittoed their information vertically, others horizontally. Some used dashes, and some merely entered "do." Occasionally even blank fields indicated dittoed information, although such cases were unusual and required careful examination before data-entry proceeded. Finally, from the perspective of data-entry, a few variables seemed more prone to error than others. For example, some enumerators reversed first and last names; others inadvertently changed the literacy response from "cannot read or write" to "can read and write." The unemployment variable was also prone to error, as some enumerators seemed to misunderstand the question, listing the number of months employed rather than months unemployed. More systematic research has been conducted on this subject, and interested readers are referred to King and Magnuson (1993).
As data-entry progressed from 1880 to 1850 and then to 1920, we implemented new procedures based on our growing experience. During the first three of the ten subsamples for 1880, the data were entered exactly as they appeared on the census schedules. Such strict adherence to the original responses created unnecessarily large dictionaries and inconsistencies in the handling of odd occurrences. Beginning with the fourth subsample, the abbreviations, data symbols, and standard comments described previously were introduced. When we completed the entry of the 1880 data, project members met to discuss improvements in the data-entry phase of 1850. Several changes were made, the most important being that the missing-response entry was allowed in more fields. For the 1850 project, we allowed missing-field responses in the variables of city, name, birthplace, and occupation in addition to those allowed in 1880 (race, sex, and marital status). Other symbols were introduced that allowed data-entry operators to note changes made by the Census Bureau and to make suggestions in the city, occupation, and birthplace fields, replacing standard comments that had proved less helpful in later stages of the project.
Before the 1920 project began, further improvements in data-entry procedures were implemented. First, the data-entry software was replaced by software designed specifically for 1920. One enhancement of the new program was automatic entry of certain variables (e.g., reel number and state). Second, more standard abbreviations were created for certain fields (e.g., birthplace and mother tongue). Third, data-entry operators no longer corrected responses when other information in the case permitted the cleaning program to impute the response, such as blank parental birthplaces when the parents were present in the household. Finally, consistency checking became a dynamic part of the program, identifying inconsistent data during the data-entry process itself. For more information about the enhanced data-entry program for 1920, see Todd Gardner's article on software development in this issue.
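The third and fourth improvements, imputation of responses recoverable from other fields and consistency checks at entry time, can be sketched roughly as follows. The field names, record layout, and rules here are illustrative assumptions, not the project's actual specification:

```python
def clean_household(persons):
    """Impute blank parental birthplaces when the parent is present
    in the household, so operators need not key them redundantly.

    `persons` is a list of dicts; the keys used here are hypothetical.
    """
    head = next((p for p in persons if p.get("relation") == "Head"), None)
    wife = next((p for p in persons if p.get("relation") == "Wife"), None)
    for p in persons:
        if p.get("relation") == "Child":
            if not p.get("father_birthplace") and head:
                p["father_birthplace"] = head["birthplace"]
            if not p.get("mother_birthplace") and wife:
                p["mother_birthplace"] = wife["birthplace"]
    return persons

def check_consistency(person):
    """Flag inconsistent responses during entry (illustrative rules only)."""
    problems = []
    if person.get("relation") == "Wife" and person.get("sex") == "M":
        problems.append("relation/sex conflict")
    if person.get("marital_status") == "Married" and person.get("age", 99) < 10:
        problems.append("implausible age at marriage")
    return problems
```

Running such checks while the operator still has the reel in front of them, rather than in a later cleaning pass, is what made the 1920 checking "dynamic."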
As data-entry progressed, we instituted a system of verification to evaluate the accuracy of the data. This verification process served three purposes. First, it provided research assistants with a means of identifying data-quality problems with specific variables and/or data-entry operators. Second, verification allowed for the creation of a transcription error rate estimate for each variable. Third, verification allowed the creation of a nationally representative subsample, containing verified information on one-tenth of the persons in the entire sample. Users who do not require the entire dataset can select this verified subsample from the final dataset.
During the 1880 data-entry project specifically, verification was conducted on two levels. The first 10 percent of the cases entered were sight verified by research assistants who verified all sample point decisions and key variables. This ensured from the beginning of data-entry that sampling rules were followed correctly and data were entered consistently. Sight verification also permitted us to check closely the quality of the work during the early stages of our first data-entry project when the operators were relatively inexperienced with sampling procedures, census returns, and nineteenth-century handwriting.
In addition to sight verification, 10 percent of the 1880 and 1850 microfilm reels were subjected to reentry verification. For 1880, one reel out of every ten completed reels per data-entry operator was randomly selected for reentry by a second operator. In the 1850 project, the process of selection differed slightly, although we still verified one reel in ten. Both versions of the data were compared, field by field, and a report of differences was produced. A research assistant then made corrections to the original data, if required, and tallied a transcription error rate for each variable. The final error rates for 1850 and 1880 are shown in Table 1. Preliminary rates indicate acceptable levels of transcription error for 1920 as well, but these are based on too few cases to publish.
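The reentry comparison lends itself to a simple sketch: two keyings of the same reel are compared field by field, differences are reported for a research assistant to adjudicate, and a per-variable error rate is tallied. The record layout and field names below are assumptions for illustration:

```python
from collections import Counter

def verify_reel(original, reentry, fields):
    """Compare two keyings of the same reel field by field.

    Returns a list of differences (line, field, first keying, second
    keying) and a per-variable error rate in percent.
    """
    differences = []
    errors = Counter()
    for line, (a, b) in enumerate(zip(original, reentry), start=1):
        for field in fields:
            if a.get(field) != b.get(field):
                errors[field] += 1
                differences.append((line, field, a.get(field), b.get(field)))
    n = len(original)
    rates = {f: 100.0 * errors[f] / n for f in fields}
    return differences, rates
```

For example, if two keyings of a two-line reel disagree only on one name, the report contains that single difference and the name variable's rate is 50 percent while every other variable's rate is zero.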
The transcription error rates for both 1880 and 1850 are quite good. The total error rate for each project is less than 0.3 percent. To compare the error rates for each variable, however, it is important to understand how both substantive and procedural issues influenced the variation in error rates between the two datasets. The variable state, for example, was a computer-generated entry in 1850 (and thus has an error rate of zero), compared with manual entry in 1880, which had an error rate of 0.21 percent. Variables identifying families and dwellings--dwelling number, number of families, family size, and family number--on the other hand, all contain higher rates of error in 1850 than in 1880. Information on family relationships in 1880 allowed data-entry staff to resolve ambiguous cases in a more consistent fashion. The CITY variable contains a higher error rate for 1850 than for 1880 because of differences in verification procedures. During 1880 data-entry, when the CITY field contained an entry such as Covington, Covington Township, or Covington City, no error was assigned to the original reel if the compare reel contained any one of those entries. This failure to assign errors, however, made subsequent construction of the city dictionary much more difficult and consumed many more research-assistant hours than would have been the case had the data-entry phase been more sensitive to differences between townships, cities, and the like. This situation was corrected for the 1850 census, when such differences were carefully noted during data-entry. As a result, the error rate rose dramatically in 1850 and may be viewed as a more realistic assessment of the difficulty of entering the complicated CITY field.
Carefully devising a system of consistent and efficient data-entry for the censuses of 1880, 1850, and 1920 not only required much thought but also involved a process of trial and improvement as we moved from one stage of each project to the next. Timely completion of data-entry and the excellent transcription error rates indicate that we accomplished our most important objectives. Given the improvements implemented during each data-entry project, we are confident that the 1920 data will be of comparable or even better quality than the two preceding datasets. With the availability of three 1-in-100 samples of the 1880, 1850, and 1920 U.S. population, researchers will have three excellent individual-level datasets spanning the mid-nineteenth to early twentieth centuries.
[1] This article was originally published in Historical Methods, 28:1, pp. 63-65 (Winter 1995). Reprinted with permission of The Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th Street, N.W., Washington, D.C. 20036-1802. Copyright 1995.
Table 1. Transcription Error Rates by Variable, 1850 and 1880 (percent)

| Variable               | 1850 | 1880 |
|------------------------|------|------|
| NUMBER OF FAMILIES     | 0.74 | 0.15 |
| FAMILY SEQUENCE NUMBER | 0.00 | 0.02 |
| RELATION TO HEAD       | NA   | 0.10 |