IPUMS Documentation:
Helping Researchers Deal with Change in 150 Years of Census Data1
by Patricia Kelly Hall, Catherine Fitch, Margot Canaday,
Lisa Ebeltoft-Kraske, Carrie Ronnander and Kathleen M. Thomas
Census microdata are of little use without documentation to interpret them. Researchers using any dataset without a thorough understanding of the variables put their analysis at risk. But if using data without documentation for one census year is risky, it is even more perilous to analyze data across multiple census years without a thorough understanding of how the questions and responses have changed over time. IPUMS documentation describes each variable in historical detail and enables researchers to make intelligent use of this rich and complex database.
The purpose of the IPUMS documentation is to provide researchers with comprehensive support for every phase of analysis using the IPUMS microdata, with standards that meet or exceed those of the U.S. Census Bureau. Documentation for the IPUMS is not confined to codebooks describing the labels associated with each census category. We provide a wide variety of ancillary information to aid in the interpretation of the data, including enumerator instructions, details of sample designs and sampling errors, procedural histories of each census, descriptions of data-entry procedures, and documentation of error correction and other post-enumeration processing. Documentary material not readily presentable in text format, such as census forms and maps, is easily accessible as graphic images on the IPUMS website.
Every complex variable in the IPUMS data has an extensive comparability discussion in the documentation that points out consistencies and inconsistencies across all samples in the database. Important differences are highlighted, along with warnings about likely errors and strategies for enhancing compatibility for specific comparisons.
For many complex variables, such as birthplace, entire essays are necessary (see the article by Goeken, et al. in this issue). All this material is presented in a hypertext documentation system that allows users to locate relevant information quickly and easily.
Since the release of IPUMS-95 (Ruggles and Sobek 1995), the documentation has grown from an 800-page user's guide to more than 3,000 pages of text, tables and maps housed in five volumes. Rapid growth in the technology available for electronic delivery of information has allowed us to expand IPUMS documentation while still giving researchers navigational tools to move through it efficiently. The IPUMS homepage on the web has both a stationary frame with links to major sections of the site and a content frame displaying whatever page the user selects for viewing. Hypertext links throughout the site allow nonlinear access to information based on the user's needs and interests. A full-text search engine enables users to find a specific item quickly and easily. A hypertext version of the documentation is also available for downloading to the user's personal computer, making continual connection to the internet unnecessary.
The flexibility and availability of electronic delivery has virtually eliminated the need for paper copies of the documentation. The updated five-volume IPUMS-2000 documentation (Ruggles and Sobek 1998a, 1998b, 1998c, 1998d) will be the last to be released in paper form. For ease of presentation, the figures shown in this article are from the printed version of the IPUMS-2000 documentation. This allows us to display information from several linked web pages in one figure.
All Years...All Variables...One Electronic Document
No researcher should attempt to use the IPUMS data without consulting the variable descriptions. They are the core of the IPUMS documentation and a prerequisite to understanding how information captured in the variables has changed over time. Each variable description includes an availability table showing which samples contain the variable and a universe statement identifying characteristics of households or persons whose responses are reported. Most variables also include extensive discussions of comparability of information across time and a table listing all possible responses and the frequency with which they occur in each census year. When necessary, top code values, data quality flags, and other variable-specific information are included.
These entries collect and condense variable information from ten or more codebooks and track the intricacies and variations of each data item over many census years. Historical Census Projects staff also consulted the original census forms and enumerator instructions to ensure that the information presented was complete and comprehensive.
The variable entry for CLASSWKR (class of worker), shown in Figure 1, provides a good example of the common features found in the data dictionary. The availability table indicates that class of worker was enumerated in all available years since 1910. The universe statement, however, shows that the question was asked of a different population over this period. For example, in the earliest samples the question was asked only of gainfully employed persons, while in 1990 the question was asked of all persons age 16 or older who had worked in the last 5 years. Individuals with similar circumstances would have been asked the question in one year but not in the other. For example, in 1990 no person younger than 16 would have answered the question even if employed. Individuals who were not currently working but who had worked in the past 5 years would have responded in 1990 but not in 1910. Users would have to eliminate both groups of individuals from their analysis if they planned to compare 1990 with 1910. The comparability discussion for CLASSWKR suggests creating a common universe of all persons age 16+ who are currently employed by using the variables AGE and EMPSTAT (employment status).
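That restriction is straightforward to apply to an extract. The following is a minimal sketch in Python, assuming the extract has been loaded into a pandas DataFrame with columns YEAR, AGE, EMPSTAT and CLASSWKR; the EMPSTAT code treated as "employed" (1) and all of the toy values are assumptions to be verified against the relevant codes pages rather than definitive IPUMS codes.

    import pandas as pd

    def common_classwkr_universe(df: pd.DataFrame) -> pd.DataFrame:
        """Keep only persons age 16 or older who are currently employed."""
        employed = df["EMPSTAT"] == 1          # assumed general code for "employed"
        adult = df["AGE"] >= 16
        return df[adult & employed]

    # Toy extract mixing 1910 and 1990 records.
    extract = pd.DataFrame({
        "YEAR":     [1910, 1910, 1990, 1990],
        "AGE":      [14,   35,   17,   40],
        "EMPSTAT":  [1,    1,    1,    3],     # 3 assumed to mean "not in labor force"
        "CLASSWKR": [2,    1,    2,    0],
    })
    print(common_classwkr_universe(extract))   # drops the employed 14-year-old and the non-worker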
In addition to instructions for making the universe appropriate for multi-year analysis, the comparability discussion describes the different coding systems used in various enumerations. Because questions are not asked in a consistent manner each year, IPUMS codes provide varying levels of detail for each variable for each year it is available. As seen in the codes and frequencies table, the class of worker variable uses general and detailed coding schemes to create a comparable variable across all years. Two categories, "self-employed" and "works for wages or salary," are available in every year. Codes for all responses that fall under either of these two general definitions begin with the same number. All "works for wages or salary" response codes, for example, begin with the general code of "2." Detail codes within each general category retain information specific to a census year, such as public or private employment. "Wage/salary at non-profit" is coded "23": the general code of "2" indicates wage and salary status, while the detail code "3" preserves the non-profit detail available only in 1990.
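Collapsing the detailed codes back to their general categories is then a simple transformation. Here is a minimal sketch, assuming two-digit detailed codes whose leading digit is the general category; apart from "23" (wage/salary at non-profit), the example values are hypothetical.

    import pandas as pd

    def to_general(detailed: pd.Series) -> pd.Series:
        """Drop the year-specific detail digit, keeping the comparable general code."""
        return detailed // 10

    detailed = pd.Series([13, 22, 23], name="CLASSWKR_detailed")   # 23 = wage/salary at non-profit
    print(to_general(detailed))   # -> 1, 2, 2: both wage/salary variants collapse to general code 2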
Multi-Level Documentation
Variable discussions are not always sufficient in themselves. Every variable discussion, therefore, includes links to documents that show how the data were originally collected (census forms, census questions and enumerator instructions) as well as how they were transformed into microdata (sample design, data entry procedures, and editing and allocation protocols). These links allow IPUMS users to access material from the documentation at varying levels of complexity depending on the needs of the researcher. To illustrate how this multi-level documentation system works, we will look at metadata relating to data quality.
At the basic level, the variable description reports the names of any data quality flags associated with that variable. The presence of a data quality flag alerts users that original responses have been altered in some way, such as by editing or allocation of missing values. Some researchers ignore data quality flags and simply take all cases available, whether or not they contain original responses. Those who want information on the extent of changes made can access two charts to quickly determine the census years with changed values as well as the percentage of cases affected. For some researchers, these two charts are sufficient to decide whether or not to include altered cases in their analysis. Other researchers would prefer to decide which cases to include based on what type of change was made. Detailed descriptions of all editing and allocation procedures give these researchers the information they need to refine their analysis. Relevant material on post-enumeration processing, editing and allocation has been culled from all documentation published with the original samples. Some of the information, such as the Census Bureau's data allocation and editing procedures for the 1980 and 1990 samples, is published here for the first time.2
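For researchers who do choose to screen on the flags, the selection itself is simple. The sketch below assumes a flag variable named QCLASSWKR in which 0 marks an unaltered original response; both the name and the coding convention are assumptions to be checked against the flag's own documentation.

    import pandas as pd

    def original_responses_only(df: pd.DataFrame, var: str, flag: str) -> pd.DataFrame:
        """Return only the cases whose value for `var` was not edited or allocated."""
        kept = df[df[flag] == 0]
        altered_share = 1 - len(kept) / len(df)
        print(f"{var}: dropped {altered_share:.1%} of cases with edited or allocated values")
        return kept

    extract = pd.DataFrame({
        "CLASSWKR":  [1, 2, 2, 1],
        "QCLASSWKR": [0, 0, 4, 0],   # hypothetical: 4 marks a value allocated from a donor
    })
    clean = original_responses_only(extract, "CLASSWKR", "QCLASSWKR")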
Editing and allocation procedures used in samples constructed by the Historical Census Projects are illustrated in Figure 2. This document gives users detailed information on three critical procedures for ensuring high-quality data: internal consistency checks, internal edits and allocation of missing data. Internal consistency rules are the parameters used for deciding when responses within an individual record are inconsistent with one another. These rules also describe what is done to resolve any inconsistencies that are found. Internal edits identify opportunities for filling in missing responses, or clarifying ambiguous ones, with information found in other parts of the record. Usually this information is drawn from the individual's own record. In other cases, such as the birthplace and mother tongue variables, information from other members of the household is used (see the article by Goeken et al. in this issue).
After the internal consistency checking and internal editing procedures are completed, any values still missing are filled in using the allocation procedures. The allocation process searches the data for a "donor" person with characteristics similar to the individual with the missing value. The appropriate response in this donor's record is then copied into the missing value field. The first step in the allocation procedure is to identify a donor who matches the subject on all predictor variables. If this step fails to produce a donor, the rank order for eliminating predictor variables from the search criteria is spelled out.
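The logic of that search can be pictured in a few lines of Python. This is only an illustration of the matching idea, not the IPUMS allocation program itself: records are represented as dictionaries, the predictor names and values are invented, and dropping predictors from the end of the list stands in for whatever rank order the documentation specifies.

    def allocate(subject, donors, target, predictors):
        """Return a donated value, relaxing predictors from the end of the rank order."""
        for n in range(len(predictors), 0, -1):     # try a full match first, then drop the lowest-ranked predictor
            keys = predictors[:n]
            for donor in donors:
                if donor.get(target) is not None and all(donor[k] == subject[k] for k in keys):
                    return donor[target]
        return None                                  # no usable donor found

    subject = {"AGE": 34, "RACE": 1, "SEX": 2, "CLASSWKR": None}
    donors  = [{"AGE": 34, "RACE": 2, "SEX": 2, "CLASSWKR": 2}]
    subject["CLASSWKR"] = allocate(subject, donors, "CLASSWKR", ["AGE", "RACE", "SEX"])
    print(subject["CLASSWKR"])   # -> 2, found once RACE and SEX are dropped from the match criteria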
In the CLASSWKR (class of worker) example shown here, internal consistency rules require a check of the AGE variable to be sure that the individual is at least 13 years old. If AGE is less than 13, the program does further checks of RELATE (relationship to head) and MARST (marital status) to verify that the AGE was correctly reported. If the person is younger than 13 and the age is consistent with his or her relationship to head and marital status, then the class of worker response indicating employment is deemed inconsistent and changed to missing. Internal edit rules for 1910 note that missing CLASSWKR values may be captured from the EMPSTAT (employment status) response. If allocation of missing values is required, predictor variables used to identify donors for CLASSWKR are OCC1950 (occupation in the 1950 coding system), AGE, RACE and SEX.
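Expressed as code, those rules might look roughly like the sketch below. It is a paraphrase under stated assumptions rather than the actual editing program: the set of codes treated as "employed", the child-plausibility test and the EMPSTAT-to-CLASSWKR mapping are all stand-ins for the real rules spelled out in the documentation.

    MISSING = None
    EMPLOYED_CODES = {1, 2}                 # assumed: self-employed, works for wages or salary

    def check_classwkr(person, looks_like_child):
        """Consistency check: blank an employment response for an apparently under-13 child."""
        if (person["AGE"] < 13
                and looks_like_child(person["RELATE"], person["MARST"])
                and person["CLASSWKR"] in EMPLOYED_CODES):
            person["CLASSWKR"] = MISSING    # inconsistent response, left for the allocation step
        return person

    def edit_classwkr_1910(person, empstat_to_classwkr):
        """Internal edit: fill a missing 1910 CLASSWKR from the EMPSTAT response when possible."""
        if person["CLASSWKR"] is MISSING:
            person["CLASSWKR"] = empstat_to_classwkr.get(person["EMPSTAT"], MISSING)
        return person

    child = check_classwkr({"AGE": 9, "RELATE": 3, "MARST": 6, "CLASSWKR": 2},
                           lambda relate, marst: True)   # toy child-plausibility test
    print(child["CLASSWKR"])                             # -> None, to be filled by allocation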
Missing values in most variables are identified in separate matching procedures. If an individual record has missing values on several variables, the donors for these values may come from several separate individuals since the goal is to maximize the number of characteristics shared by donor and subject. For some variables, however, logic requires that one donor contribute values for an entire group of variables. Geographic variables, such as CITY, CITYPOP (city population) and URBAN (urban/rural status) shown in Figure 2, are one example of this logical clustering. The birthplace and language group of variables (described in Goeken, et al. elsewhere in this issue) is another, more complex example.
Similar allocation and editing procedures are documented for samples produced by the Census Bureau and others. For every procedure described, the appropriate data quality flag value is shown. This allows researchers to select cases based on the type of change made to the original data. By using information from these reference tools, researchers can tailor their case selection to the level appropriate to their analysis.
New and Old Research Tools
One of the guiding principles in the construction of the IPUMS data as well as its documentation has been to provide tools that facilitate research using all available census years. For example, measures such as the Duncan socio-economic index and poverty status formulas, analytical tools commonly used in the modern censuses, have been incorporated as variables in the IPUMS data to simplify diachronic analysis in the historical samples. In the documentation, this principle is reflected in such features as the high-resolution metropolitan boundary maps and the tables of change in county composition of metropolitan areas.
Some material in the IPUMS documentation not only helps researchers make sense of the data but also serves as a stand-alone reference tool. Figure 3 shows a section of one such reference tool, the County Composition of Nineteenth-Century Metropolitan Areas. This table identifies all places that could be classified as urban in 1900 and tracks the changing county composition of each metropolitan area over the entire period from 1850-1990. The information in this table was used to construct the METAREA (metropolitan area) variable.3 The METAREA code is the value assigned to each metropolitan area in the data. In the case of large metropolitan areas, such as New York, the first three digits of the code indicate an extensive area of shared economic activity. Within a complex metropolitan area, the last digit identifies separate areas that were large enough to meet the Census Bureau's criteria for metropolitan status in that year.
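A short sketch makes the code structure concrete; the four-digit layout follows the description above, but the specific value used here is hypothetical rather than an actual METAREA code.

    def split_metarea(code):
        """Split a four-digit METAREA code into (consolidated area, component area)."""
        return code // 10, code % 10

    consolidated, component = split_metarea(5602)   # hypothetical code within a New York-style consolidated area
    print(consolidated, component)                  # -> 560 2: 560 names the larger area, 2 a component metropolitan area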
Knowing the county composition of metropolitan areas is important to the intelligent analysis of findings when using the METAREA variable. Even on its own, however, the table conveys a good deal of information about the changing nature of metropolitan areas in the United States. For example, the table shows that by 1850 New York was already a large, five-county metropolis while Philadelphia's metropolitan area was contained in just two counties. New York experienced such rapid, extensive growth that by 1970 it had spawned three new metropolitan areas: Bergen-Passaic, Jersey City and Newark. Nassau-Suffolk was identified as a separate metropolitan area in 1980 and Middlesex-Somerset spun off in 1990. Bergen County bounced back and forth between the New York and Bergen-Passaic metropolitan areas several times between 1950 and 1990. By contrast, Philadelphia regularly incorporated counties into its metropolitan area and, as of 1990, no part had reached the threshold for separate metropolitan area status. Some of the change shown in this table is an indication of significant change in the area's population. Some, like the movement of Bergen County, may be more indicative of the Census Bureau's changing definitions of metropolitan status than of demographic change.
Information in the two historical county tables was unpublished and therefore not generally available to researchers before it was incorporated into the IPUMS documentation. These documents constitute a significant addition to the resources available for historical research. In addition to new research tools such as these, users will find other material, such as maps of metropolitan designations in the later census years, which had been published but was not readily available until now.
Integrating the IPUMS
The IPUMS contains 26 census samples and more than 300 variables. One of the biggest challenges facing the Historical Census Projects was devising an efficient method for translating the nine unique coding schemes used to build the original samples into an integrated system comparable across time. Transforming raw sample data into IPUMS code was accomplished with a series of translation tables which gave the IPUMS software program three basic instructions: (1) what to name the IPUMS variable being constructed, (2) year-by-year column locations for finding the input data in the original samples, and (3) what IPUMS value to assign each original value encountered.
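To make the mechanism concrete, the sketch below models one such translation table and the lookup it drives. The column positions, raw values and IPUMS codes are invented for illustration only; the real tables, reproduced in the documentation, carry this information for every variable and sample.

    # Toy fixed-width person records keyed by census year.
    RAW_RECORDS = {1940: "....2....", 1980: "..7......"}

    TRANSLATION = {
        "variable": "CLASSWKR",
        "columns":  {1940: (4, 5), 1980: (2, 3)},          # (start, end) positions, hypothetical
        "recode":   {1940: {"2": 22}, 1980: {"7": 13}},    # original value -> IPUMS code, hypothetical
    }

    def translate(year, record, table):
        """Read the raw value at the year-specific location and map it to an IPUMS code."""
        start, end = table["columns"][year]
        raw = record[start:end]
        return table["recode"][year][raw]

    for year, record in RAW_RECORDS.items():
        print(year, TRANSLATION["variable"], translate(year, record, TRANSLATION))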
Figure 4 shows the translation table used to construct CLASSWKR (class of worker). The table instructs the IPUMS program to look for input in nine census years, the only years for which this question was asked. Of these nine samples, however, only seven appear in the translation table. This is because the 1910 Hispanic oversample as well as the 1920 sample were created by the Historical Census Projects after the basic coding scheme for the IPUMS had been established (Gutmann 1997; also see the article by Gutmann and Ewbank in this issue). Since these two samples were constructed with IPUMS codes, the program simply reads the original values from the sample directly into the IPUMS data. For all other years, the table provides the information necessary to translate the raw values into IPUMS codes.
The CLASSWKR variable was constructed with a general code to facilitate comparability of responses across years as well as a detail code to preserve information unique to one year. Adding a detail code to the translation table was the principal means of preserving information in the IPUMS. Occasionally, information about the original values was not preserved, as was the case with two 1910 values shown on the CLASSWKR translation table: those coded missing or illegible. So although we translated the original 1910 raw values into IPUMS codes, the programming notes at the bottom indicate that these were allocated along with other missing values in the IPUMS allocation program described above. The category labels and IPUMS codes in the variable description seen in Figure 1 were created with these translation tables, with codes for illegible and missing being eliminated after allocation of missing values.
In some cases, data transformations are required which cannot be accommodated in the translation table. Notes at the bottom of each translation table indicate which procedures are handled with special programming. The CLASSWKR example shows that programming was required to create an NA (not applicable) category. This was necessary because in 1910, no separate NA category was established and the field was simply left blank for those who were outside the universe, that is, those not asked the question.
It took more than 500 translation tables to construct the IPUMS, one for each original variable, constructed variable and data quality flag in the database. All of them have been reproduced in the IPUMS documentation so that researchers can see exactly how the raw data from each sample were translated into IPUMS codes. The translation tables allow users to recreate research done with the original public use samples.
Conclusion
IPUMS documentation is a work in progress. While the 3,000-plus pages in the original printed volumes have all been entered on-line, we are still adding links that will make it even easier for users to navigate the documentation system. The five volumes of the updated IPUMS-2000 documentation are the last to be released on paper. The advent of a downloadable hypertext documentation program has made the paper volumes virtually obsolete.
The most exciting enhancement now being planned is a documentation system that would customize documentation output to the samples and variables selected by the user in the extract system. Guided by information from the extract request, the system would search the documentation for all information relevant to the data requested. Users would receive a downloadable, hypertext file containing information on only those variables and only those years included in their data extract.
Prompted by user requests, we are adding a new page to our web site which will serve as a launching point for users new to census data and the IPUMS. Links will be provided to short, clear essays explaining critical concepts such as weights, universe statements and comparability across time. Answers will be given to frequently asked questions about the analytical problems of historical census research as well as the technical difficulties of extracting and downloading data.
All new system enhancements and resource materials that become available will be added to the website. The downloadable files and revisions notification page will continue to be updated regularly. Our goal for the IPUMS documentation, as for the data, is to provide the highest quality research tools available for long-run analysis of the American population.
ENDNOTES:
1. This article appeared in Historical Methods, Volume 32, Number 3, Pages 111-118, Summer 1999. Reprinted with Permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W. Washington, D.C. 20036-1802. Copyright 1999.
2. The Census Bureau provided the Historical Census Projects with all available internal reports and memos related to specific editing and allocation procedures used in processing data from the 1980 and 1990 censuses. Information in these unpublished materials was distilled into a uniform format and included in the IPUMS-2000 documentation.
3. Because fire destroyed the census manuscripts, no 1890 data are included in the IPUMS. Counties that became part of metropolitan areas in 1890 are asterisked in the 1900 column. Since state is the smallest geographic unit available in the 1960 census, no county information is reported for that year. Although the concept of a metropolitan area did not exist prior to 1950, metropolitan areas were identified in the early samples using criteria similar to those used by the Census Bureau in 1950.
References
Gutmann, M. P., S. Ruggles, W. C. Block, C. A. Fitch, T. K. Gardner, P. K. Hall, D. Kallgren and M. Sobek. 1997. The Hispanic oversample of the 1910 U.S. Census of population: User's guide. Minneapolis: Historical Census Projects, Department of History, University of Minnesota.
Ruggles, S. and M. Sobek with P. K. Hall and C. Ronnander. 1995. Integrated public use microdata series: Version 1.0, User's guide. Minneapolis: Historical Census Projects, University of Minnesota.
Ruggles, S. and M. Sobek with C. A. Fitch, P. K. Hall and C. Ronnander. 1998a. Integrated public use microdata series: Version 2.0., Volume 1: User's guide, edited by C. A. Fitch. Minneapolis: Historical Census Projects, University of Minnesota.
Ruggles, S. and M. Sobek with C. A. Fitch, P. K. Hall and C. Ronnander. 1998b. Integrated public use microdata series: Version 2.0., Volume 2: User's guide supplement, edited by P. K. Hall. Minneapolis: Historical Census Projects, University of Minnesota.
Ruggles, S. and M. Sobek with C. A. Fitch, P. K. Hall and C. Ronnander. 1998c. Integrated public use microdata series: Version 2.0., Volume 3: Counting the past, edited by P. K. Hall. Minneapolis: Historical Census Projects, University of Minnesota.
Ruggles, S. and M. Sobek with C. A. Fitch, P. K. Hall and C. Ronnander. 1998d. Integrated public use microdata series: Version 2.0., Volumes 4 and 5: Translation tables, edited by C. A. Fitch. Minneapolis: Historical Census Projects, University of Minnesota.