The Minnesota Historical Census Projects1
by Steven Ruggles and Russell R. Menard
The census is our fundamental source of information about American social structure in the past. No other source can compete with respect to population coverage and reliability.2 The census provides the only data on population characteristics for the period before the mid-twentieth century that are not profoundly distorted with respect to class, race, gender, or education. Moreover, the census is the only historical source that provides comprehensive geographic coverage and broad chronological scope. Indeed, many Americans in earlier generations left no surviving trace of their existence except for their listings in the census. Thus, we must frequently turn to census data for basic historical generalizations about the lives of ordinary people.
Quantitative studies of long-term social change have always relied on published tabulations of the census. The published data are invaluable, but they do have significant limitations. Until recently, the Census Bureau itself carried out only the most rudimentary analysis with the data it collected. The topics addressed by census publications have always focussed on contemporary concerns, concerns that have shifted dramatically over the past two centuries. For example, although the Census Bureau has gathered data on family and household composition since 1880, it was not until 1940 that census publications began to provide statistics in this area. On the other hand, the Progressive-era censuses provided great detail on nativity and ethnicity, reflecting the contemporary concern with immigration. The high costs of tabulation before the introduction of modern data processing equipment meant that few cross-classifications of census data were feasible, and the Census Office had to make difficult choices about what to measure. As a result, much of the information collected by the census was never tabulated at all. Thus, census publications do not exploit the full potential of the census enumerations.
Development of Census Microdata Files
In the 1960s, New Social Historians adopted an innovation that multiplied the uses of census data. Instead of using the aggregate published tabulations of the census, scholars like Stephan Thernstrom (1963) turned directly to the enumerators' manuscripts. These manuscripts provided information on individual households or persons -- what we now call microdata, to distinguish it from data describing population aggregates. The microdata collected by enumerators contain far richer information than was ever published in the census volumes. The New Social Historians found that by using microdata they could make tabulations tailored to their particular research questions. In fact, the nineteenth-century manuscript censuses allowed researchers to address issues never contemplated by the Census Office. This was especially true for the census years from 1850 onwards, when the enumerators' manuscripts began to provide information on individuals as well as households. By combining the characteristics of different individuals within the same household, historians could construct new categories of information. For example, they analyzed the occupations of spouses and residence with extended kin, even though such information was collected only inadvertently by the census and was never tabulated. Moreover, historians using the nineteenth-century census manuscripts soon began to use analytic methods that did not even exist when the censuses were first tabulated, from multiple regression to own-child fertility analysis.
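The household-level constructions described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the procedure any particular historian followed: the field names (`hh`, `pernum`, `relate`, `occ`), the relationship labels, and the six sample records are all hypothetical, and each household is assumed to contain at most one head and one wife.

```python
from collections import defaultdict

# Hypothetical person records, one dict per enumerated individual.
# Field names and values are invented for illustration only.
persons = [
    {"hh": 1, "pernum": 1, "relate": "head", "occ": "farmer"},
    {"hh": 1, "pernum": 2, "relate": "wife", "occ": "keeping house"},
    {"hh": 1, "pernum": 3, "relate": "mother", "occ": None},
    {"hh": 2, "pernum": 1, "relate": "head", "occ": "teamster"},
    {"hh": 2, "pernum": 2, "relate": "wife", "occ": "seamstress"},
    {"hh": 2, "pernum": 3, "relate": "son", "occ": None},
]

# Relationship labels treated here as extended kin (an assumed list).
EXTENDED_KIN = {"mother", "father", "brother", "sister",
                "grandchild", "nephew", "niece"}
PARTNER_OF = {"head": "wife", "wife": "head"}


def add_household_variables(records):
    """Return copies of the records with two constructed variables:
    spouse_occ   -- occupation of the head's or wife's partner, else None
    extended_kin -- True if any household member is extended kin
    """
    by_household = defaultdict(list)
    for rec in records:
        by_household[rec["hh"]].append(rec)

    enriched = []
    for members in by_household.values():
        occ_by_role = {m["relate"]: m["occ"] for m in members}
        has_extended = any(m["relate"] in EXTENDED_KIN for m in members)
        for m in members:
            partner = PARTNER_OF.get(m["relate"])
            enriched.append({
                **m,
                "spouse_occ": occ_by_role.get(partner) if partner else None,
                "extended_kin": has_extended,
            })
    return enriched


result = add_household_variables(persons)
```

Neither constructed variable appears on any single person's line of the schedule; both exist only once the records of co-resident individuals are read together.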
At about the same time that historians began using nineteenth-century census microdata, the Census Bureau released the first national census microdata. Social science was booming in the early 1960s, and researchers flooded the Census Bureau with requests for specialized tabulations on particular topics. Some social scientists were even beginning to use computers. The Bureau came up with a seemingly obvious but innovative solution: they created a 1-in-1000 extract of the basic microdata they were using to create the tabulations for the published census volumes, and released it in machine-readable form to researchers who could then make their own tables (U.S. Bureau of the Census 1963). The data file was called a Public Use Sample, to distinguish it from the census samples being used internally by the Bureau. To preserve confidentiality, the Census Bureau removed the names, addresses, and other potentially identifying information.
The 1960 Public Use Sample was an immediate success. It allowed researchers to create their own measures and to adopt the latest multivariate analytic tools. But the sample did have two significant limitations. First, the sample size was relatively small. The 1-in-1000 sample density yielded information on about 180,000 persons. Given the modest capacity of computers in 1964, this was a lot of cases, but as researchers began to use the sample for detailed analysis of small population subgroups, its limitations became apparent. Halliman Winsborough, one of the early users of the 1960 sample, later recalled "analyses evaporating in a welter of empty cells." Second, the 1960 Public Use Sample provided limited geographic information. In its zeal to preserve confidentiality, the Census Bureau stripped off all information on places below the state level. This meant, for example, that it was impossible to extract a subsample of the New York City population.
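The arithmetic behind the "empty cells" problem is simple to reproduce. The sketch below is only an illustration: the national figure is the rounded population implied by about 180,000 sampled persons at 1-in-1000, while the subgroup size of 50,000 and the 40-cell table are hypothetical numbers chosen for the example.

```python
from fractions import Fraction


def expected_cases(population, density):
    """Expected number of sampled persons drawn from a group of this size."""
    return population * density


def expected_per_cell(population, density, n_cells):
    """Average sampled cases per table cell if the group is spread evenly."""
    return expected_cases(population, density) / n_cells


NATIONAL_1960 = 180_000_000  # rounded; implied by ~180,000 cases at 1-in-1000
SUBGROUP = 50_000            # hypothetical small population subgroup
CELLS = 5 * 2 * 4            # hypothetical table: 5 age groups x 2 sexes x 4 regions

for label, density in [("1-in-1000", Fraction(1, 1000)),
                       ("1-in-100", Fraction(1, 100))]:
    total = expected_cases(SUBGROUP, density)
    per_cell = expected_per_cell(SUBGROUP, density, CELLS)
    print(f"{label}: {float(total):.0f} sampled cases, "
          f"{float(per_cell):.2f} per cell")
```

Under these assumptions the 1-in-1000 density leaves barely more than one expected case per cell, so many cells will be empty by chance; a 1-in-100 density raises the average to 12.5.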
The Census Bureau addressed both problems with the release of the 1970 Public Use Samples (U.S. Bureau of the Census 1973a). The total sample size rose some sixty-five fold, to twelve million persons. In addition, the 1970 samples provided a variety of alternate geographic codes, although the Census Bureau still did not identify any places with a population of less than 250,000.
In conjunction with the 1970 public use samples, the Census Bureau released an expanded version of the 1960 public use sample, enlarging the sample density from 1 in 1000 to 1 in 100 (U.S. Bureau of the Census 1973b). The record layout and coding schemes of the new 1960 sample were reorganized to be virtually identical to those of the 1970 samples. This compatibility made it easy for investigators to pool data from 1960 and 1970, and thereby incorporate chronological change into their analyses.
By the late 1970s, the public use samples from the 1960 and 1970 censuses had become essential tools of American social science. It was in this climate that two separate teams of researchers independently came up with the idea of extending the series backwards by creating historical public use samples for earlier census years. Samuel Preston directed projects at the University of Washington and the University of Pennsylvania to produce a 1-in-750 sample of the 1900 census and a 1-in-250 sample of the 1910 census (Graham 1980; Strong et al. 1989). Meanwhile, Halliman Winsborough and a group of other demographers at the University of Wisconsin, in collaboration with the Census Bureau, created 1-in-100 samples for the census years 1940 and 1950 (U.S. Bureau of the Census 1984a, 1984b). But Preston and Winsborough both suffered federal budget cuts and both ran out of money.
All four historical census files were eventually completed, but not without strain. In the mid-1980s, the future looked dim for historical public use samples of the census. The 1900, 1940, and 1950 samples had all become available by 1984, but they failed to attract the scholarly interest that had been envisioned by their creators. The 1910 sample, which did eventually stimulate a great deal of research, was far behind schedule. Both Preston and Winsborough decided they had done enough.
Undaunted by numerous warnings about the perils of data collection projects, we decided to pick up where Preston and Winsborough left off. The 1880 census was the obvious candidate for the next national microdata project. It was the first census to include such key inquiries as marital status, family relationships, and parental birthplaces, but because the Census Office ran out of money in 1881, these variables were never tabulated. In late 1988 we submitted a modest proposal to the National Institutes of Health to create a 1-in-500 sample of the 1880 census. The study group reviewed the proposal with enthusiasm, but they wanted one small modification: they requested an enlargement of sample size to allow study of small population subgroups. We complied, raising the proposed sample density from 1/500 to 1/100. The proposal was funded, and in the summer of 1989 we began to enter data on some 550,000 individuals (Ruggles and Menard 1990). We finished the 1880 Public Use Microdata Sample in 1994.
Thus encouraged, we decided to complete the series of national census microdata files. In 1992 we began work on a 1 in 100 sample of the 1850 population census, with funding from the National Science Foundation. Since 1850 was the first census year to gather information at the level of individuals, it is the logical starting point for the series of national census microdata. The final version of the 1850 sample was completed in 1994.
We recently received funding for a 1 in 100 sample of the 1920 census from the National Institutes of Health. To protect the privacy of census respondents, census enumeration manuscripts remain confidential for 72 years. The public release of the 1920 manuscripts in 1992 made a national sample for that year feasible.3 The 1920 project, which involves data entry of information on 1.1 million persons, will be completed in 1998.
National public use microdata samples of the census therefore exist or are in preparation for all but four census years since 1850. The 1890 manuscript census was destroyed in a fire, so there is no possibility of creating a national census sample in that year. The remaining census years are 1860, 1870, and 1930. We are presently working on proposals to create new samples for the 1860 and 1870 census years. The 1930 census manuscripts will not be released until 2002, so that sample will be delayed.
In the meantime, we hope to enlarge and enhance the census files for 1900 and 1910. The 1900 sample was the first of the historical files to be produced, and at 100,000 persons it is also the smallest by a wide margin. Moreover, the sample design for 1900 is somewhat incompatible with later census years. We therefore plan to apply for funds to enlarge the 1900 sample to approximately 760,000 cases. The 1910 sample is larger than the 1900 sample, and it has already been expanded by adding an oversample of the black population. In addition, we are presently working with Myron Gutmann of the University of Texas to add an oversample of the Hispanic population in 1910.
As historical work was extending the series of national census files backward in time, the Census Bureau was extending it forward by releasing machine-readable samples of the 1980 and 1990 censuses (U.S. Bureau of the Census 1982a, 1993). We therefore now have a series of public use microdata samples of the U.S. census covering the years 1850, 1880, 1900, 1910, 1920, 1940, 1950, 1960, 1970, 1980 and 1990. The general characteristics of these samples are shown in Table 1. If we are successful in obtaining funds to fill the remaining gaps, we will eventually have a large national sample for every census year from 1850 onwards except for 1890.
Strengths and Weaknesses of the Public Use Microdata Series
The range of potential topics that can be addressed with the national census files is far too great to describe within the page limitations of this article. Some of the most obvious topics of investigation include household composition, fertility, life-course transitions, ethnicity, immigration, internal migration, female labor force participation, the household economy, industrial and occupational structure, urbanization, nuptiality, and education.4
Compared with the community-level census microdata files created by Thernstrom and many other historians in the 1960s and 1970s, the national census microdata samples have some liabilities. Many of the community-level samples created by historians linked local census data to other local sources, such as tax lists or business directories, thereby enriching the basic census data. Moreover, in some cases historians linked individuals from census year to census year, thus allowing longitudinal analysis. The national census microdata files do not incorporate such enrichment: they are independent random cross-sectional samples for successive years without any individual-level information added from other sources.5
The national census files do, however, offer three key strengths that make them far more powerful than the census samples of particular communities ordinarily used by historians. These strengths are complete geographic coverage, large scale, and broad chronological scope.
Complete geographic coverage is important not only because it allows generalization at the national level. Paradoxically, one of the greatest advantages of national samples is the potential for study of the ways local conditions affect behavior. Such analysis has always been one of the goals of community studies, and the authors of such studies have frequently criticized national analyses because they obscure local and regional diversity. The irony is that individual community studies do not permit the study of the impact of local conditions or geographic diversity: these topics can only be addressed by comparing localities. Comparison of community studies is complicated by inevitable differences in measures and methods among different historians.
The national public use census files allow systematic comparisons across space. City, county, and even ward-level local characteristics on topics such as agriculture, manufacturing, religion, voting, and property taxes are readily available, for the most part in machine-readable form. We can easily link these local characteristics to the historical census files for the period 1850 through 1920. We can then carry out contextual analyses of the effects of local conditions on individual and family behavior. Although such contextual analysis is still in its infancy, the early studies are indeed impressive (Elman 1993; Landale and Tolnay 1991; also see Ruggles, forthcoming).6
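The linkage of local characteristics to individual records can be sketched as a simple keyed lookup. The Python fragment below is a hedged illustration only: the county names serve as example keys, and the measures (`farms_per_1000`, `mfg_workers_per_1000`), their values, and the person records are invented. It does not reproduce the layout of any actual county-level file.

```python
# Hypothetical county-level characteristics keyed by (state, county).
# All measures and values are invented for illustration.
COUNTY_CONTEXT = {
    ("Minnesota", "Hennepin"): {"farms_per_1000": 12, "mfg_workers_per_1000": 85},
    ("Iowa", "Story"): {"farms_per_1000": 140, "mfg_workers_per_1000": 9},
}
CONTEXT_FIELDS = ("farms_per_1000", "mfg_workers_per_1000")

# Hypothetical individual records carrying a place of enumeration.
persons = [
    {"id": 1, "state": "Minnesota", "county": "Hennepin", "age": 34},
    {"id": 2, "state": "Iowa", "county": "Story", "age": 51},
    {"id": 3, "state": "Ohio", "county": "Athens", "age": 27},
]


def attach_context(person, context=COUNTY_CONTEXT):
    """Return a copy of a person record with local characteristics added.

    A record whose county is absent from the context table receives None
    for every contextual field, so unmatched cases stay visible.
    """
    local = context.get((person["state"], person["county"]))
    merged = dict(person)
    for name in CONTEXT_FIELDS:
        merged[name] = local[name] if local is not None else None
    return merged


linked = [attach_context(p) for p in persons]
unmatched = [p["id"] for p in linked if p["farms_per_1000"] is None]
```

Once each person carries the characteristics of his or her locality, those characteristics can enter an individual-level model alongside age, occupation, or household position.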
The second strength of the national public use census files is their large size. The number of cases available for each census year ranges from the hundreds of thousands to the tens of millions. This allows study of small and geographically dispersed population subgroups. For example, Minnesota graduate students using the historical public use samples have examined topics such as the professionalization of nursing, American Indian fertility patterns, the living arrangements of elderly urban blacks, the demography of the prison population, the gender composition of clerical workers, and the living arrangements of parentless children. These research topics could not be pursued using a general social survey of the scale ordinarily undertaken by academic social scientists. Indeed, even the largest social survey carried out by the government -- the Current Population Survey -- is far too small for the detailed analysis of topics like American Indian fertility or the composition of the clerical work force. The public use samples are the only general source of microdata available for any period with sufficient cases to study such small population subgroups.
The third -- and most important -- strength of the historical public use census files is their potential for the study of social and economic change over long periods of time. There is no other consistent source of information about the American population spanning more than a few decades. Despite frequent changes in subject content and modifications of enumeration procedures, the core of the census has remained remarkably stable over the past century and a half. So far, however, few researchers have exploited the great potential of the national census files for the study of long-term change. Instead, most investigators use the samples as isolated cross-sections. At Minnesota we recently began to compile a bibliography of research using the public use microdata samples; to date, over 80 percent of the studies use only one of the eleven census years currently available.
The main reason that the national census microdata files have not been widely used to study change is that it is difficult to use more than one of the national census files at a time. Each sample has a different format, different coding schemes, and different documentation. Since the original 1960 and 1970 samples, at least five separate research teams have been involved in the creation of the samples, and each of them has had their own ideas on how to organize the data. We are faced with eight different occupational classifications that have a total of 3200 different categories, and at least seven incompatible classifications for variables such as birthplace, household relationship, and institution type. Documentation for the eleven existing samples is contained in eleven separate volumes totaling over 3000 pages. These volumes are organized differently from one another, and their treatment of comparability issues is often cursory. The 1960 and 1970 public use samples constitute the only exception to the general rule of incompatibility. It has been relatively easy to use the 1960 and 1970 census years in combination; as a result, most of the research using more than one public use microdata sample has focussed on these two census years.
The incompatibility of the public use microdata samples in their present form means that multi-sample studies require a large initial investment of time and money to prepare the data for use. To use multiple census years in the same analysis, each investigator must prepare a special-purpose compatible extract of selected variables. This ad hoc approach has led to duplication of effort among the few investigators using multiple census years. Moreover, because of the complexity of the census files and the often subtle differences among them, the potential for error is large.
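The kind of special-purpose compatible extract described above usually rests on a crosswalk from each year's codes to a common scheme. The following Python sketch shows the idea under loudly stated assumptions: every code and label in it is invented, and it does not reproduce the actual coding of any public use sample.

```python
# Hypothetical year-specific codes for relationship to household head,
# each mapped to a common label. All codes here are invented.
CROSSWALKS = {
    1880: {"1": "head", "2": "spouse", "3": "child", "9": "boarder/lodger"},
    1960: {"00": "head", "01": "spouse", "02": "child", "10": "boarder/lodger"},
}


def harmonize(year, raw_code):
    """Translate a year-specific code into the common scheme.

    An unmapped code raises ValueError instead of being guessed at,
    because a silent miscoding is the hardest kind of error to detect.
    """
    try:
        return CROSSWALKS[year][raw_code]
    except KeyError:
        raise ValueError(f"no crosswalk entry for year {year}, code {raw_code!r}")


def pool(samples):
    """Combine raw codes from several census years into one list of
    records that share a single coding scheme, ordered by year."""
    pooled = []
    for year in sorted(samples):
        for raw_code in samples[year]:
            pooled.append({"year": year, "relate": harmonize(year, raw_code)})
    return pooled


pooled = pool({1960: ["00", "10"], 1880: ["1", "9"]})
```

Each investigator who builds such a table independently must make the same judgment calls about which categories are equivalent, which is where both the duplicated effort and the risk of subtle error arise.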
To reduce the incompatibility problems among the national census files, the Social History Research Laboratory is now converting the series of public use samples into a single coherent form with uniform documentation. This project, funded by the National Science Foundation, is called the Integrated Public Use Microdata Series.
The Integrated Public Use Microdata Series -- or IPUMS, as we call it -- promises to be a powerful tool for the study of social change. The database will include information on over 50 million individuals spread over 140 years of extraordinary social and economic change. We expect that the unprecedented potential to locate individual behavior in time and spatial context will generate important new research on topics such as fertility, urbanization, immigration, household composition, and occupational structure.
1. NOTE: This article was originally published in Historical Methods, 28:1, pp. 6-10 (Winter 1995). Reprinted with permission of The Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th Street, N.W., Washington, D.C. 20036-1802. Copyright 1995.
2. It has long been fashionable among historians to question the reliability, accuracy, and representativeness of the census (see, for example, Hill 1993; Sharpless and Shortridge 1975). In fact, the older censuses compare favorably to modern social survey data in all these respects. Many recent sociological surveys have non-response rates far greater than those of the nineteenth-century censuses. For a recent discussion of census coverage issues, see Magnuson and King (forthcoming).
3. The 1940 and 1950 census years were still covered by census confidentiality rules, so those historical samples were created by Census Bureau employees at far greater cost. A preliminary 1-in-1000 subsample of the 1920 PUMS was initially released; the final 1-in-100 version was completed in 1998.
4. For a discussion of some applications of the historical census microdata samples, see Ruggles (1993).
5. Record linkage -- either between census years or between census records and other sources -- is always a difficult and painstaking undertaking, but in the case of the national census files it is either impossible or extremely expensive. For the period from 1940 onwards no record linkage can be carried out because those census years are still protected by confidentiality rules and therefore provide no information that can be used to identify individual cases. Although some linkage is theoretically possible in the earlier years, the costs would be high. Most of the community-level census files created by historians include all the cases from a particular locality, which makes record linkage fairly straightforward. By contrast, the national census files are low-density random samples, so identifying links is many times slower.
6. For the period from 1940 onwards, the potential for contextual analysis of the census files will be somewhat more limited; because of the census confidentiality rules, the public use files for those years do not identify geographic location with great precision.
Table 1: Characteristics of the Public Use Census Files, 1850-1990
|Census Year|Principal Investigator/Sample Version|Release Date|Sample Density|No. of Cases (000's): Households|No. of Cases (000's): Persons|No. of Variables: Household Record|No. of Variables: Person Record|No. of Variables: Sample Line|
|---|---|---|---|---|---|---|---|---|
| |15% State Sample|1972|1/100|744|2,030|63|62| |
| |15% County Sample|1972|1/100|744|2,030|60|62| |
| |5% State Sample|1972|1/100|744|2,030|75|59| |
| |5% County Sample|1972|1/100|744|2,030|72|59| |
a excluding data-quality flags