NOTE: This article was originally published in Historical Methods, 28:1, pp. 59-62 (Winter 1995). Reprinted with permission of The Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th Street, N.W., Washington, D.C. 20036-1802. Copyright 1995.
The building blocks for the Integrated Public-Use Microdata Series are national datasets for eleven census years, created by the Census Bureau and by individual scholars. Here at the University of Minnesota we are responsible for three of those datasets, the 1850, 1880 and 1920 samples. The first step in the creation of each microdata project was software development. Each individual Public-Use Microdata Sample (PUMS) required software for data entry, post-entry data consistency checking, verification, and final data recoding. We began with the 1880 project and developed all of the software on microcomputers. As the volume of data with which we were working increased, however, the performance of the microcomputers became unacceptable for many of these tasks, primarily because of their limited memory. We therefore shifted post-entry data processing to a UNIX workstation to take advantage of its greater speed, memory and storage capacity. Microcomputers continued to be vital to our work, however, as we used them for data entry and editing. For the projects that followed, the Public Use Microdata Samples of the 1850 and 1920 censuses, we maintained this arrangement, performing data entry on microcomputers and post-entry data processing on a UNIX workstation.
When we began work on the 1880 project, the highest priority for software development was the data entry program. Once that was in place we developed programs to check for errors in the data and a final recoding program. These programs took shape gradually as we assessed data quality and received feedback from Research Assistants who were performing the post-entry data processing. As data entry proceeded, we constructed files known as data dictionaries to translate alphanumeric field entries to their numeric codes. Each of these data dictionaries required software to build and maintain the file. In the final stage of the project we developed the software that performed the numeric recoding of the alphanumeric data to produce the Public Use Microdata Sample. We were able to carry over much of the software developed for the 1880 project to the 1850 and 1920 projects with only minor modifications. In some cases, however, such as the 1920 data entry program, we had to develop new software.
Data Entry Software
The data entry software that we selected for entering 1880 data was the Integrated System for Survey Analysis (ISSA). ISSA is produced by the Institute for Resource Development in association with Westinghouse. This software offered some advantages over other software packages we examined. While household-level information had to be entered on a separate screen form, all of the individual-level information for a dwelling could be included on the same form. This meant that our Data Entry Operators could see the individual-level information for the entire dwelling while entering the data for each individual. Another advantage of ISSA was that the output was in ASCII format. The census data files needed to be in ASCII format so that the post-entry data processing programs could read them.
From early on, however, we encountered problems using ISSA because we were trying to use this software for a project for which it had not been designed. ISSA was designed to enter survey data that had already been numerically recoded. We had determined, though, that we wanted to enter the census data in alphanumeric form as close to the original enumerator entries as possible, and then perform numeric recoding as the last step in producing the Public Use Microdata Sample. Many of the fields on the census enumeration form required that we establish large strings on the screen form to accommodate verbose enumerator entries. For example, it was not uncommon to see an entry in the occupation field such as "Works in a shoe factory." Preserving the original entry provided us with a great deal of flexibility in how we went about numerically coding the data, but our record sizes were quite large and we were not able to perform many interactive data consistency checks within the data entry software. Because ISSA was designed to accommodate primarily numeric data, it offered few string handling functions. We were able to perform only limited operations on string data within the data entry program. We relied instead on post-entry data consistency checking software to carry out many of these functions.
We decided to continue using ISSA for the 1850 project because we did not have sufficient time to carry out a study of available data entry software and also because we had already developed a workable data entry program for the 1880 data. Because fewer questions were asked in the 1850 census than in the 1880 census, it was relatively easy to modify the 1880 version of the data entry program to accommodate the entry of 1850 census data. Because of our experience in dealing with the 1880 data, we were able to avoid many of the problems that we encountered when we had initially set up the data entry program. Our Data Entry Operators were also already familiar with the way ISSA operated, so they were able to carry over their skills to the 1850 project.
For the 1920 project we were forced to look elsewhere for our data entry software. More questions were asked on the 1920 census than on the 1880 and 1850 censuses, and ISSA could not accommodate the expanded length of the 1920 individual records. We began our search for an alternative data entry program by consulting with a variety of personnel at the University of Minnesota as well as a number of computer publications to determine what software was available. We encountered many of the same problems that we had when we searched for software for the 1880 project. One of the most common difficulties we faced with the database/data entry software we tested was that many of these packages were designed to allow users to view only one record on the screen at a time while entering data. We wanted, however, to design the screen form to look as much like the enumeration form as possible, showing as much information as possible about the household and all individuals in the dwelling. Some Windows-based software packages, such as FileMaker Pro, are good for designing complex screen forms, but with our computers the screen refresh rate was unacceptably slow as our Data Entry Operators began to enter data on the large, complicated screen forms designed to look as much as possible like the original census enumeration form.
Another common problem we encountered with the software that we tested was that most database/data entry software has its own proprietary output format. We wanted the output to be in ASCII format for use in our post-entry data processing programs. While the software packages we tested typically offered an option to convert the output to ASCII format, in most cases only the file in the proprietary format could be updated. Because of this we would have had to maintain two separate data files for every microfilm reel, one in ASCII format for use in our post-entry data processing programs as well as one in the proprietary format for updating. Each of our data files is associated with a microfilm reel, and because the 1920 census consisted of more than 2000 microfilm reels, we would have had to maintain more than 4000 files.
We decided to develop our own data entry software using Microsoft C along with a product called C-scape by Liant Software, which is a library of screen handling functions designed to allow users to create customized interactive software. Before proceeding with developing our own data entry software, we gave serious consideration to the disadvantages of this approach, especially the unforeseen difficulties that are a part of any new software development and the time necessary for this endeavor. We determined, however, that we had the expertise on staff as well as the time necessary to accomplish the task.
Developing a customized program allowed us a great deal of flexibility in designing the screen. We designed screen forms that looked similar to the census enumeration form, simultaneously displaying household-level information and individual-level information. We also included customized help screens in the data entry software. Developing our own data entry software also allowed us to build in sophisticated interactive data consistency checking for string data as well as numeric data. By including such logic in the data entry program we were able to reduce the number of post-entry changes to the data compared with the earlier projects.
One of the other great advantages of developing our own data entry software was that we were able to integrate the sample point selection and the data entry procedures. In the 1880 and 1850 projects we generated random sample point numbers in a separate program and then printed them out on a standard form. The Data Entry Operators marked these sheets by hand, indicating whether they accepted or rejected the sample point, along with any comments. This process was time consuming and also used a great deal of paper. In the 1920 project sample points were displayed on the screen and the Data Entry Operators accepted or rejected sample points by pressing the appropriate keys. This procedure not only saved many reams of paper, but it also sped up the process and eliminated the risk of a data entry error for the fields that uniquely identified that sample point because those fields were automatically filled in from the sample point selection screen.
Post-Entry Data Processing
With the exception of the data entry software, almost all of the programs developed for the census projects were written in FORTRAN. We used FORTRAN for convenience; we already had a number of programs written in that language, some of them obtained from the Census Bureau itself. It was quite easy to transport the programs we had developed on microcomputers to the UNIX workstation. The compilers were similar enough that only a few minor changes had to be made to the code.
Perhaps the most important reason for moving the post-entry data processing to the UNIX workstation was that we began to develop large data dictionaries. A data dictionary is simply a computer file containing all of the various entries from a field on the census form along with the numeric recode values for each entry. We constructed data dictionaries for fields with a large number of possible entries. For some fields, such as birthplace, the Data Entry Operators used standard abbreviations for the most common entries, but in many cases standardization was not possible. Because these data dictionaries had to include all of the spelling variations as well as infrequent entries that appeared in the data, they quickly became quite large--too large to be read into a DOS-based FORTRAN program.
We constructed data dictionaries for both household-level and individual-level fields. The fields for which we used data dictionaries in each census project are listed in Table 1.
An example of the contents of a data dictionary is shown in Table 2. This table shows a few lines from the data dictionary for the relationship to the head of household field. We used data dictionaries for data consistency checking as well as for the final numeric recoding. Each line of the data dictionary contains a field entry as it appears in the data file. In the relationship dictionary each entry has three associated values. The first column of numbers is a general relationship code, and the next column is a detailed code. These are the numeric recode values that appeared in the final Public Use Microdata Sample. The last column in the relationship data dictionary contains the valid entries in the sex field for the associated relationship--(M)ale, (F)emale or (E)ither. The data consistency checking program used these entries to make sure that the entries in the sex and relationship to head of household fields were consistent with one another.
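The structure described above can be sketched in a few lines of Python. This is an illustration only: the project's programs were written in FORTRAN, and the file layout (fields separated by `|`), the function names, and every code value below are assumptions invented for the example rather than the contents of the actual relationship dictionary.

```python
# Hypothetical dictionary lines: entry, general code, detailed code, valid sex.
# The layout and every code value here are invented for illustration.
SAMPLE_DICTIONARY = """\
HEAD|01|0101|E
WIFE|02|0201|F
SON|03|0301|M
DAUGHTER|03|0301|F
BOARDER|12|1205|E
"""


def load_relationship_dictionary(text):
    """Map each field entry to (general code, detailed code, valid sex)."""
    dictionary = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        entry, general, detailed, sex = line.split("|")
        dictionary[entry.strip().upper()] = (general, detailed, sex)
    return dictionary


def sex_is_consistent(dictionary, relationship, sex):
    """Return True or False, or None if the entry is not in the dictionary."""
    record = dictionary.get(relationship.strip().upper())
    if record is None:
        return None
    valid_sex = record[2]
    # "E" means either sex is acceptable for this relationship.
    return valid_sex == "E" or valid_sex == sex.strip().upper()


relationships = load_relationship_dictionary(SAMPLE_DICTIONARY)
wife_listed_as_male = sex_is_consistent(relationships, "Wife", "M")
```

Keeping the valid-sex letter beside the recode values lets one lookup serve both the consistency check and the final recoding, which matches the dual use of the dictionaries described in the text.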
Two programs were necessary for maintaining and updating each of the data dictionaries. The first of these programs read through all of the entered census data looking for field entries that had not yet been included in the data dictionary. This program generated an output file listing all unknown entries along with the locations of those entries. A Research Assistant then determined the appropriate recode values for each entry and typed them into the file. A second program then read in this file and added the unknown field entries to the data dictionary along with their associated values. The program then wrote out a new version of the data dictionary sorted in whatever was deemed to be the most functional format. Some dictionaries were sorted in order of the recode values while others were sorted in alphabetic order of the dictionary entries.
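The two maintenance passes can be outlined as follows. This is a minimal Python sketch under stated assumptions, not the original FORTRAN: the record layout, the function names, the location tuples, and the birthplace codes are all hypothetical.

```python
def find_unknown_entries(records, dictionary, field):
    """First pass: list entries missing from the dictionary, with locations."""
    unknown = {}
    for location, record in records:
        entry = record[field].strip().upper()
        if entry and entry not in dictionary:
            unknown.setdefault(entry, []).append(location)
    return unknown


def merge_coded_entries(dictionary, coded_entries, sort_by_code=False):
    """Second pass: add newly coded entries and return a sorted dictionary."""
    merged = dict(dictionary)
    merged.update(coded_entries)
    if sort_by_code:
        sort_key = lambda item: (item[1], item[0])  # by recode value
    else:
        sort_key = lambda item: item[0]             # alphabetic by entry
    return dict(sorted(merged.items(), key=sort_key))


# Invented sample data: codes and locations are placeholders.
birthplaces = {"NY": "036", "OHIO": "039"}
records = [
    (("reel 12", 1), {"birthplace": "NY"}),
    (("reel 12", 2), {"birthplace": "Bavaria"}),
    (("reel 12", 3), {"birthplace": "bavaria"}),
]
unknown = find_unknown_entries(records, birthplaces, "birthplace")
# A Research Assistant would supply the recode value for each unknown entry.
updated = merge_coded_entries(birthplaces, {"BAVARIA": "453"})
```

The `sort_by_code` switch stands in for the choice the text mentions between dictionaries ordered by recode value and dictionaries ordered alphabetically.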
Data Consistency Checking
After the Data Entry Operators finished entering all of the appropriate data from a microfilm reel, the file associated with that reel was put through a data consistency checking program. In the 1880 project we carried out two data consistency checking steps separately. The first step involved checking for inconsistencies in the data, such as a male respondent listed as "wife" or a seven-year-old listed as "married." This program generated a list of warning messages that a Research Assistant would then check to determine if any of the data needed to be corrected. A second step was later implemented that made consistency checks largely based on the contents of the relationship field. This second program performed a set of routines that linked spouses together and also linked parents and children. Based on these links the software then checked for inconsistencies based on relationships, for example, if the age of a father was only 10 years greater than the age of his son. We identified a number of minor systematic changes that needed to be made to the data. We incorporated the logic to identify these problems into the second program, so that the software could perform these changes to the data automatically.
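The kinds of checks described here might look like the following Python sketch. The field names, the two age thresholds, and the simplified rule that links every son or daughter to the head of household are assumptions made for illustration; the text does not give the actual linking rules or cutoffs.

```python
MIN_MARRIED_AGE = 12        # hypothetical threshold
MIN_PARENT_CHILD_GAP = 15   # hypothetical threshold, in years


def check_household(people):
    """Return warning messages for one household (a list of dicts)."""
    warnings = []
    head = next((p for p in people if p["relate"] == "HEAD"), None)
    for p in people:
        if p["relate"] == "WIFE" and p["sex"] == "M":
            warnings.append(f"line {p['line']}: male listed as wife")
        if p["marst"] == "MARRIED" and p["age"] < MIN_MARRIED_AGE:
            warnings.append(f"line {p['line']}: married at age {p['age']}")
        # Simplified link: treat the head as parent of each son or daughter.
        if head is not None and p["relate"] in ("SON", "DAUGHTER"):
            gap = head["age"] - p["age"]
            if gap < MIN_PARENT_CHILD_GAP:
                warnings.append(
                    f"line {p['line']}: parent only {gap} years older"
                )
    return warnings


household = [
    {"line": 1, "relate": "HEAD", "sex": "M", "age": 30, "marst": "MARRIED"},
    {"line": 2, "relate": "WIFE", "sex": "M", "age": 28, "marst": "MARRIED"},
    {"line": 3, "relate": "SON", "sex": "M", "age": 20, "marst": "SINGLE"},
    {"line": 4, "relate": "DAUGHTER", "sex": "F", "age": 7, "marst": "MARRIED"},
]
household_warnings = check_household(household)
```

The function only reports problems and changes nothing, in keeping with the review of warning messages by a Research Assistant described above.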
For the 1850 data, these two programs were combined into a single program. The program was not as sophisticated as the 1880 data consistency checking programs because the respondents in 1850 were not asked to provide their relationship to the head of the household, so the spouse and parent-child linking could not be performed on the data. As with the 1880 Public Use Microdata Sample, however, this program incorporated automatic changes to the data along with the warning messages that a Research Assistant then reviewed.
We decided to continue the single program approach in the 1920 project. The output from this program indicated what changes had automatically been made to the data as well as possible inconsistencies. This program also incorporated the linking procedures that we had initially employed in the 1880 data consistency checking programs.
In order to document the accuracy of the data entry and consistency checking procedures, one out of every ten microfilm reels was selected at random and entered again by a Data Entry Operator other than the Data Entry Operator who had originally entered the reel. We then read the two files into a verification program and compared the files field by field. Transferring the verification program to the UNIX workstation brought about a dramatic improvement in performance. No longer constrained by the limited memory of the microcomputers, we could read two complete data files, each of which was typically 100K or larger, into memory.
The trickiest problem that we had to deal with in the verification program stemmed from incorrectly entered key information. The key is made up of the fields that uniquely identify each case, consisting of a sequence number, the page number on the enumeration form, and the line number of the first individual taken. If any of these fields were entered incorrectly, the verification program would not be able to match a case with the corresponding case in the other file. To deal with this problem, the verification program displayed a list of all keys present in one of the files but not the other. A Research Assistant would then examine these cases to determine if the key was incorrect, and then indicate that when prompted by the program. We avoided these problems in the 1920 project because the data entry program automatically filled in the key values when the Data Entry Operator accepted a sample point.
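The key-matching step can be illustrated with a short Python sketch. The three key fields come from the description above; the record layout, the other field names, and the function names are hypothetical, and the original verification program was interactive rather than a single function call.

```python
def key_of(record):
    """Key fields named in the text: sequence, page, and line number."""
    return (record["sequence"], record["page"], record["line"])


def compare_files(original, reentered):
    """Return unmatched keys from each file and field-level differences."""
    first = {key_of(r): r for r in original}
    second = {key_of(r): r for r in reentered}
    only_first = sorted(first.keys() - second.keys())
    only_second = sorted(second.keys() - first.keys())
    differences = []
    for key in sorted(first.keys() & second.keys()):
        for field, value in first[key].items():
            other = second[key].get(field)
            if value != other:
                differences.append((key, field, value, other))
    return only_first, only_second, differences


# Invented sample cases: the second case has a mistyped line number.
original = [
    {"sequence": 1, "page": 4, "line": 12, "age": "34", "occ": "FARMER"},
    {"sequence": 2, "page": 9, "line": 3, "age": "7", "occ": ""},
]
reentered = [
    {"sequence": 1, "page": 4, "line": 12, "age": "84", "occ": "FARMER"},
    {"sequence": 2, "page": 9, "line": 8, "age": "7", "occ": ""},
]
only_first, only_second, differences = compare_files(original, reentered)
```

A case whose key was mistyped shows up once in each unmatched list, which is the situation a Research Assistant had to resolve by hand.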
In the final step, the program created an output data file in which all incorrect responses from the original file were replaced with the correct ones. The program also generated a number of reports that allowed us to calculate our overall error rates and enabled us to identify any systematic problems. When all differences had been checked, the program produced a report indicating the error rates for each field and an itemized list of all changes made. This report also contained a list of all sample point selection errors committed by the Data Entry Operator. The same procedures were followed for the 1880, 1850 and 1920 census projects.
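As a rough illustration of the corrected output file and the per-field report, consider the Python sketch below. The text does not define how error rates were computed, so the definition used here (corrected entries in a field divided by the number of cases checked) is an assumption, as are the field names and the shape of the `corrections` mapping.

```python
from collections import Counter


def apply_corrections(original, corrections):
    """Return a copy of the cases with values judged correct filled in.

    `corrections` maps (case index, field) to the corrected value.
    """
    corrected = [dict(case) for case in original]
    for (index, field), value in corrections.items():
        corrected[index][field] = value
    return corrected


def error_rates(original, corrected):
    """Share of cases, per field, where the original entry was changed."""
    changed = Counter()
    for before, after in zip(original, corrected):
        for field in before:
            if before[field] != after[field]:
                changed[field] += 1
    return {field: changed[field] / len(original) for field in original[0]}


# Invented sample data: four cases, one age entry found to be wrong.
entered = [
    {"age": "34", "sex": "M"},
    {"age": "84", "sex": "F"},
    {"age": "7", "sex": "F"},
    {"age": "51", "sex": "M"},
]
fixed = apply_corrections(entered, {(1, "age"): "34"})
rates = error_rates(entered, fixed)
```

Because `apply_corrections` works on copies, the file as first entered remains available for comparison against the corrected version.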
The final step in preparing the Public Use Microdata Sample consisted of running all of the data through a final recoding program. This program converted all alphanumeric data to numeric recode values and formatted the output into household records and person records. For those fields listed in Table 1, recoding was done using the data dictionaries. For other fields with a limited number of valid entries, such as marital status and sex, the recode values were "hardcoded," written into the program itself. This program read the data dictionaries into memory prior to processing the data. Because of the limited memory of the microcomputers, only one (or a part of one) dictionary could be read into memory at a time. For each data file, each dictionary had to be read into memory separately, so on the microcomputers the recoding process was enormously time-consuming. Because the UNIX workstation version of this program could hold all of the dictionaries in memory at one time, the program needed to read the data dictionaries only once. Where the microcomputers would have taken days to recode the entire Public Use Microdata Sample, the UNIX workstation software took only about three hours to complete the entire task.
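The difference between dictionary-based and hardcoded recoding, and the benefit of loading every dictionary once, can be sketched in Python as follows. All field names and code values are invented for the example, and the sketch leaves out the formatting of output into household and person records.

```python
# Hardcoded recodes for fields with few valid entries (values invented).
SEX_CODES = {"M": "1", "F": "2"}
MARST_CODES = {"MARRIED": "1", "WIDOWED": "2", "SINGLE": "6"}


def recode_person(person, dictionaries):
    """Convert one alphanumeric person record to numeric codes."""
    general, detailed = dictionaries["relate"][person["relate"]]
    return {
        "sex": SEX_CODES[person["sex"]],
        "marst": MARST_CODES[person["marst"]],
        "relate": general,
        "relate_detail": detailed,
        "bpl": dictionaries["bpl"][person["bpl"]],
    }


def recode_files(files, dictionaries):
    """Recode every file against dictionaries that are already in memory."""
    return [[recode_person(p, dictionaries) for p in people] for people in files]


# All dictionaries are built once, before any data file is processed.
dictionaries = {
    "relate": {"HEAD": ("01", "0101"), "WIFE": ("02", "0201")},
    "bpl": {"NY": "036", "OHIO": "039"},
}
files = [
    [{"sex": "M", "marst": "MARRIED", "relate": "HEAD", "bpl": "OHIO"}],
    [{"sex": "F", "marst": "MARRIED", "relate": "WIFE", "bpl": "NY"}],
]
recoded = recode_files(files, dictionaries)
```

Passing the same `dictionaries` object to every file mirrors the workstation version of the program, which read each dictionary only once instead of once per data file.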
The software and hardware markets are constantly changing. At each stage of these Public Use Microdata Sample projects we made decisions based on the best options available to us at the time. Improvements in hardware and software often brought about dramatic improvements in productivity. The challenge we constantly faced was to make the best use of such improvements while still maintaining compatibility with what we had already developed.
Table 1. Fields for which data dictionaries were used (surviving entries only)
|State||Relationship to Head of Household|
|Institution||Relationship to Head of Household|