What is IPUMS?
The Integrated Public Use Microdata Series (IPUMS) consists of over sixty high-precision samples of the American population drawn from sixteen federal censuses, from the American Community Surveys of 2000-present, and from the Puerto Rican Community Surveys of 2005-present. Some of these samples have existed for years, and others were created specifically for this database. These samples collectively constitute our richest source of quantitative information on long-term changes in the American population. However, because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. This has complicated efforts to use them to study change over time. The IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic change.
Contents of this page
- Uses of Microdata
- Subject Content
- Organization of Documentation
- File Structure
- Sample-Line Characteristics in 1940 and 1950
- Sample Weights
- Variations in Group Quarters Definitions
- Alternate Census Forms: 1960-1980
- Geographic Coding and Comparability
- Occupational and Industrial Classifications
- Constructed Variables on Family Interrelationships
- Coding Schemes
- Using the Data Dictionaries
- Public Use Microdata Series Source Materials
Uses of Microdata
Most population data - especially historical census data - have traditionally been available only in aggregated tabular form. The IPUMS is microdata, which means that it provides information about individual persons and households. This makes it possible for researchers to create tabulations tailored to their particular questions. Since the IPUMS includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the Census Bureau are often not comparable across time, and until recently the subject coverage of census publications was limited.
Microdata do pose some limitations, however. For the period since 1950, census microdata are subject to strict confidentiality measures that limit their usefulness for some applications. The IPUMS samples for these years include no names, addresses, or other potentially identifying information. To further ensure that no individuals can be identified, the Census Bureau limits the detail on place of residence, place of work, very high incomes, and several other variables. Most important, the microdata records for the period since 1950 identify no geographic areas with fewer than 100,000 inhabitants (250,000 in 1960 and 1970). Therefore, the IPUMS is inappropriate for research that requires the identification of specific small geographic areas in those census years.
Subject Content
The data series includes information on a broad range of population characteristics, including fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition. The information available in each sample varies according to the questions asked in that year and by differences in post-enumeration processing. In general, the later census years provide a greater range of characteristics than the earlier ones, though the earlier censuses often contain greater detail for the variables that are available. A full listing of available variables can be found in the documentation section entitled "Variable Availability." For the period since 1960, the data series also provides detailed housing characteristics.
Organization of Documentation
The IPUMS documentation is divided into five volumes. Volume I covers the minimum information most users need to use the database, including the overall design of the database, variable descriptions, and coding schemes for each variable. Volume II includes supplemental information on some of the more complicated variables, such as maps and coding schemes for the geographic areas identified in the samples, and alternate occupation and industry coding schemes available for particular census years. Volume III provides enumerator instructions, replicas of the census forms, procedural histories of the censuses, and descriptions of missing-data allocation procedures and other post-enumeration processing. Volumes IV and V consist of complete descriptions of all the data transformations we carried out to create the IPUMS.
This introduction to the database is essential reading for all prospective users. Additional chapters in this volume cover sample designs, sampling errors, occupational coding, and family interrelationship codes. We then provide a guide to the availability of variables in each IPUMS sample, variable descriptions, coding schemes, and marginal frequencies for the household record. Finally, we briefly describe the allocation of missing and inconsistent data and the data-quality flags.
File Structure
The Integrated Public Use Microdata Series consists of a series of compatible-format individual-level representative samples of the United States for the years 1850-1880, 1900-present, the American Community Surveys of 2000-present, and the Puerto Rican Community Surveys of 2005-present. Each of these samples is independent; it is not possible to trace individuals from one census year to the next. The data represent all persons in states, territories that eventually became states, Puerto Rico and the District of Columbia, with the following exceptions:
- The 1850 and 1860 samples exclude the slave population.
- The pre-1890 samples exclude "Indians not taxed."
- The pre-1960 samples, except 1900 and 1910, exclude Alaska and Hawaii.
Users will access the data through our on-line data extraction system. This system allows users to select a subset of cases and variables for analysis, so they do not have to cope with multiple gigabytes of data. The extraction system creates a record layout tailored to the needs of each user.
IPUMS samples consist of records that are of two types: household records and person records. Household records - identified with an "H" in the "Record Type" column on the "Variable Availability" pages - contain information pertaining to an entire household or group quarters residence. Each household record is followed in the raw data files by a series of person records, which contain information about each sampled individual in the unit. Person-level data is available for extraction through the IPUMS on-line data extraction system. Person records are identified with a "P" in the "Record Type" column on the "Variable Availability" pages.
Within households, individuals are generally sequenced according to their household relationship codes (see the variable RELATE for more detailed information). Household heads (or householders after 1970) appear first. They are ordinarily followed by a spouse (if any), children (in descending age order), other relatives, and non-relatives. In the pre-1960 census years there are a few exceptions to this sequence, since the original order of enumeration is preserved and not all enumerators followed the prescribed sequence.
Most analyses of IPUMS data require manipulation of both household and person records. Each record contains a serial number which links the persons in the housing unit to the appropriate household record. Many statistical software packages - such as SPSS, Stata, and SAS - will handle the IPUMS datasets easily. When they create an extract, users specify the package (SPSS, Stata, or SAS) for which their command files should be generated. Users also select whether they want their files to be hierarchical, where person-level records follow the households to which they belong, as in the raw data files, or rectangular, where household information is attached to each person record. They also have the option of selecting household records only. Constructed variables in the IPUMS make it easy to attach characteristics of household and family members, such as spouses, children, and parents; use of these variables is discussed in "Family Interrelationships" in the IPUMS documentation.
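The difference between the hierarchical and rectangular layouts can be sketched in a few lines of Python. This is only an illustration, not the extraction system's own logic: the field name SERIAL for the serial number, the AGE field, and all record values are assumptions made for the example, and records are represented as plain dictionaries rather than fixed-width lines.

```python
def rectangularize(records):
    """Attach each household's fields to the person records that follow it."""
    rows = []
    household = None
    for rec in records:
        if rec["RECTYPE"] == "H":
            # Remember the most recent household record.
            household = rec
        elif rec["RECTYPE"] == "P":
            # The serial number (assumed field name: SERIAL) links person to household.
            if household is None or household["SERIAL"] != rec["SERIAL"]:
                raise ValueError("person record without a matching household record")
            row = {k: v for k, v in household.items() if k != "RECTYPE"}
            row.update({k: v for k, v in rec.items() if k != "RECTYPE"})
            rows.append(row)
    return rows


# Made-up hierarchical file: each "H" record is followed by its "P" records.
hierarchical = [
    {"RECTYPE": "H", "SERIAL": 1, "GQ": 1},
    {"RECTYPE": "P", "SERIAL": 1, "AGE": 40},
    {"RECTYPE": "P", "SERIAL": 1, "AGE": 38},
    {"RECTYPE": "H", "SERIAL": 2, "GQ": 1},
    {"RECTYPE": "P", "SERIAL": 2, "AGE": 71},
]

rectangular = rectangularize(hierarchical)
for row in rectangular:
    print(row)
```

In the rectangular result there is one row per person, and the household field (GQ here) is repeated on every person in the same household.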
In general, the variables in the IPUMS are coded into numeric categories, but there are several exceptions. The following variables contain alphabetic characters:
- RECTYPE - Record Type
- STREET - Street Address
- NAMELAST - Name, Last
- NAMEFRST - Name, First
- INDNAICS - Industry, NAICS
- OCCSOC - Occupation, SOC
Sample-Line Characteristics in 1940 and 1950
Since 1940, the census has asked an extended set of questions of a sample of the population. In the first two years that the census used sampling, it was done on an individual basis: individuals who fell on a designated "sample line" of the census enumeration form were asked an extra set of questions (the enumeration forms are reproduced in Volume III). The 1940 and 1950 samples were constructed to ensure that each household contains one of these sample-line individuals. In the IPUMS, these variables are coded as "not applicable" for all individuals except the sample-line individual. Thus, users will find that for many variables in 1940 and 1950 the number of available cases is limited. The sample-line format of the census imposes some limitations on the kinds of analysis that can be carried out. For example, one might want to find out if native-born persons with foreign-born parents tended to marry other persons of the same ethnicity. Since the variables on parental birthplaces in 1940 and 1950 are on the sample line, and only one sample-line person is included in each household, it is impossible to compare the parental birthplaces of husbands and wives in these years.
In the censuses of 1960 through 2000, and in the ACS/PRCS, the Census Bureau also employed sampling, but it was carried out on a household basis. In the 1960-2000 censuses, a portion of households received "long form" questionnaires with an additional set of questions to be filled out for each member of the household. In the ACS, the Census Bureau employed a monthly rolling sample of households that was designed to replace the census long forms, where the sampling unit was the household and all persons residing in the household. Since the public use files were constructed entirely from these long forms for the 1960-2000 censuses, and from the ACS PUMS data, most sample items are available for all individuals in the IPUMS. The few exceptions are noted in the section of this volume entitled "Variable Availability" where "x" indicates availability for the variable for a particular year and "." indicates unavailability.
Sample Weights
With the exception of the samples for 1940, 1950, 1990 (State, Metro, Elderly and Labor Market Areas), 2000 (1% and 5%), the ACS/PRCS samples and the 1910 oversamples of Blacks and Hispanics, the IPUMS samples are unweighted "flat" samples. Flat samples are those in which each observation - whether a household or an individual - represents a fixed number of persons in the general US population. The 1860 and 1870 samples are also flat, with the exception of their oversample versions, in which households containing any black person were sampled at twice the density of other households; this allows for more detailed analysis of blacks in the years surrounding the Civil War. For 1990 and 2000, IPUMS also created flat sample versions. For samples other than the 1940, 1950, 1990 (State, Metro, Elderly and Labor Market Areas), and 2000 (1% and 5%) censuses and the ACS/PRCS, consult the HHWT variable for more information.
The 1860 and 1870 oversample versions, the 1940 1%, 1950, 1990 (State, Metro, Elderly, and Labor Market Areas), and 2000 (1% and 5%) samples, and the ACS/PRCS samples are weighted. This means that persons with some characteristics are over-represented in the samples, while others are under-represented. In the case of 1940 and 1950, a flat sample was impossible because of the complexities in sample design necessitated by the sample-line individuals. In the 1990 and 2000 censuses and the ACS, the Census Bureau opted against a flat sample design in order to maximize precision for persons residing in small localities. Where samples are weighted, users must take additional steps if they want to obtain statistics that are representative of the general population.
To obtain representative statistics, users have two basic options: they can either apply sample weights or select a representative unweighted subsample of the data.
To apply sample weights to an IPUMS file, users should follow one of the following procedures:
- For household-level analyses using the 1900 (except for the American Indian sample), 1940 1%, 1950, 1990 (except for the 1% unweighted sample), 2000 (except for the 1% unweighted sample), and the ACS/PRCS samples, weight the households using the HHWT variable. HHWT gives the number of households in the general population represented by each household in the sample.
- For person-level analyses using the 1900 (except for the American Indian sample), 1940 1%, 1950, 1990 (except for the 1% unweighted sample), 2000 (except for the 1% unweighted sample), and the ACS/PRCS samples, apply the PERWT variable. PERWT gives the population represented by each individual in the sample.
- For any analyses in 1940 1% or 1950 involving sample-line characteristics, apply the variable SLWT. SLWT gives the number of individuals in the general population represented by each sample-line individual.
The variables HHWT, PERWT and SLWT also exist for the unweighted IPUMS samples, but they are simply the overall ratio of the full population count to the number of cases in the sample. For example, the weights in 1880 average 99.74, and all those in 1900 average 758.94. Thus, applying the weights to the unweighted samples will inflate case counts to correspond to the entire population.
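As a rough sketch of what "applying a weight" means in practice, the Python below sums PERWT over person records to estimate population counts and shares. The records, the weight values, and the "employed" flag are invented for illustration; the same helper would be called with HHWT on household records or SLWT on sample-line persons.

```python
def weighted_count(records, weight_var, predicate=lambda r: True):
    """Estimated population count: sum of weights over matching cases."""
    return sum(r[weight_var] for r in records if predicate(r))


def weighted_share(records, weight_var, predicate):
    """Estimated population share of cases satisfying the predicate."""
    total = weighted_count(records, weight_var)
    return weighted_count(records, weight_var, predicate) / total


# Made-up person records from a weighted sample (weights differ across cases).
persons = [
    {"PERWT": 150, "employed": True},
    {"PERWT": 50, "employed": False},
    {"PERWT": 120, "employed": True},
    {"PERWT": 80, "employed": False},
]

is_employed = lambda r: r["employed"]
unweighted_share = sum(1 for r in persons if is_employed(r)) / len(persons)
estimated_employed = weighted_count(persons, "PERWT", is_employed)
estimated_share = weighted_share(persons, "PERWT", is_employed)
print(unweighted_share, estimated_employed, estimated_share)
```

In a flat sample every case carries the same weight, so the weighted and unweighted shares coincide and the weight only scales case counts up to population totals.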
To select an unweighted subsample of an IPUMS file, users should follow one of the following procedures:
- For any analyses of 1940 1% or 1950 that do not involve sample-line characteristics, select entire households with a value of "2" for the variable SELFWTHH.
- For any analyses of 1940 1% or 1950 involving sample-line characteristics, select individuals with a value of "2" for the variable SELFWTSL.
The unweighted subsamples have the disadvantage that the number of cases may be sharply reduced, but for some applications they are easier to work with than the weighted samples. To make things simpler for users of the 1990 and 2000 samples, we have created unweighted 1-in-100 extracts of the 1-in-20 weighted samples. These versions require no weights, and the values of HHWT and PERWT are always 100.
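Selecting the self-weighted subsample amounts to a simple filter on SELFWTHH or SELFWTSL. The sketch below assumes rectangular records held as dictionaries with the flags stored as integers; SERIAL and PERNUM are assumed identifier names, and all values are made up. Only the code "2" is taken from the text above; the meaning of other codes is not assumed here.

```python
def self_weighted_subsample(rows, uses_sample_line_items):
    """Keep only cases coded 2 on the relevant self-weighting variable."""
    flag = "SELFWTSL" if uses_sample_line_items else "SELFWTHH"
    return [row for row in rows if row[flag] == 2]


# Made-up rectangular records for a 1940 1% or 1950 extract.
rows = [
    {"SERIAL": 1, "PERNUM": 1, "SELFWTHH": 2, "SELFWTSL": 1},
    {"SERIAL": 1, "PERNUM": 2, "SELFWTHH": 2, "SELFWTSL": 2},
    {"SERIAL": 2, "PERNUM": 1, "SELFWTHH": 1, "SELFWTSL": 1},
]

# Analyses without sample-line items keep whole households flagged 2.
household_subsample = self_weighted_subsample(rows, uses_sample_line_items=False)
# Analyses with sample-line items keep individuals flagged 2.
sample_line_subsample = self_weighted_subsample(rows, uses_sample_line_items=True)
print(len(household_subsample), len(sample_line_subsample))
```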
Additional details on sample weights and sample designs can be found in "Sample Designs."
Variations in Group Quarters Definitions
In all the IPUMS files, persons in large units such as institutions and boarding houses were sampled at the individual level, and not as members of households. Such units are termed "group quarters." Unlike members of households, group quarters residents cannot be compared to their co-residents because the co-residents are not ordinarily contained in the sample. Furthermore, the samples for 1940 and subsequent census years provide fewer variables for residents of group quarters than they do for household members.
Unfortunately, different census years did not define group quarters consistently, and this complicates inter-year comparisons for many variables. Thus, individuals sampled as group quarters residents in some years might have been sampled as household members in other years. For example, the 1940 1%, 1950, 1960, and 1970 census years define group quarters as units containing five or more persons unrelated to the household head/householder. The IPUMS samples for 1980, 1990, the 1940 100% file, and those before 1940, define group quarters as institutions and other units with 10 or more members unrelated to the householder. In the 2000 and 2010 censuses, ACS and PRCS, no threshold was applied; for a household to be considered group quarters, it had to be on a list of group quarters that is continuously maintained by the Census Bureau. In earlier years, a similar list was used, with the unrelated-persons rule imposed as a safeguard.
Sufficient information is available in all census years to determine whether a sampling unit would have been treated as group quarters or a household under the 1940 1%-1970 rules, and this is done in the IPUMS variable GQ. Users who wish to create a common household universe for all census years can use GQ to eliminate from their research universe all units that would have been sampled as group quarters under the 1940 1%-1970 rules, by simply selecting households coded "1."
However, many researchers, particularly those concentrating on the earlier IPUMS years when boarders, lodgers, servants, and secondary families were much more common, will not wish to use GQ in this way. Doing so would eliminate from the universe many pre-1940 units for which the pre-1940 samples contain full information on all household members.
The differences among the samples in treatment of group quarters, like other variations in sample design, also affect the precision of the samples. See "Sample Designs" and "Sampling Errors" in this volume.
Alternate Census Forms: 1960-1980
In two census years, 1960 and 1970, the Census Bureau used two different sample "long forms" with slightly differing questions when they took the census. Moreover, in 1980 certain information was collected for only a fraction of forms. Depending on the census year, these have varying implications for the IPUMS.
In 1960, certain housing information (such as stories and elevators) was only collected in cities with 50,000 or more residents, where form PH-4 was distributed, and other information (such as sewage disposal) was only collected outside large cities, where form PH-3 was used. A third set of housing questions was asked for a 5 percent sample of all housing units, and a fourth set was asked of 20 percent of all units. The 1960 IPUMS 1% and 5% samples contain all four sets of questions, but each is available only for a subset of cases. The IPUMS variable SAMP1960 identifies which set of questions the household answered. For those housing items asked of only a subset of the population, the universe statement identifies which form included the question, and those not included in the universe are coded "not applicable." In the case of the PH-3 and PH-4 samples, users must be aware of the universe limitations imposed by the different forms. The distinction between the 20% form and the 5% form should have little effect for most analyses beyond limiting the number of available cases, but if users want to estimate the absolute number of occurrences of an item in the 1960 population, they should multiply 20% form items by 125 and the 5% form items by 500.
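One way to see where multipliers of 125 and 500 could come from is the arithmetic sketched below. It rests on an assumption that is not stated in this section: that the 1960 long form went to 25 percent of housing units, so that only a fraction of long-form cases carry each item. Under that assumption the quoted factors correspond to a 1-in-100 sample. The function name and parameters are purely illustrative.

```python
from fractions import Fraction


def inflation_factor(sample_fraction, long_form_fraction, item_fraction):
    """Population units represented by each sampled case that carries the item."""
    # Share of long-form cases that were asked this particular item.
    share_with_item = item_fraction / long_form_fraction
    # Effective sampling fraction for the item, inverted to give a multiplier.
    return 1 / (sample_fraction * share_with_item)


SAMPLE = Fraction(1, 100)    # 1-in-100 public use sample
LONG_FORM = Fraction(1, 4)   # ASSUMPTION: long form asked of 25% of units

factor_20_percent_items = inflation_factor(SAMPLE, LONG_FORM, Fraction(20, 100))
factor_5_percent_items = inflation_factor(SAMPLE, LONG_FORM, Fraction(5, 100))
print(factor_20_percent_items, factor_5_percent_items)
```

Exact fractions are used instead of floating-point numbers so that the multipliers come out as whole numbers without rounding noise.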
In 1970 the Census Bureau used two substantially different long forms with varying questions on both the person and household records. One form, referred to in the IPUMS as Form 1, was filled out by 5% of the population, and the other form, Form 2, was filled out by 15% of the population. For example, the Form 1 (5%) sample included questions such as age at first marriage, citizenship status, and occupation five years ago, whereas the Form 2 (15%) sample inquired about parental birthplaces, school attendance, and migration status. The IPUMS includes separate samples for each form. Users must decide which version they need before carrying out their analysis.
The 1980 census used a single long form, but the makers of the sample used only a portion of the forms for questions on travel time to work, place of work, and migration. In the IPUMS, these variables are available for half of all cases; the rest are coded "not applicable." This should have little effect on analysis, but if users wish to estimate the absolute number of occurrences of one of these items in the 1980 population, they should multiply the weight by two or apply the variable MIGSAMP.
Geographic Coding and Comparability
Every census collected precise information on residential location. This information is preserved in the pre-1950 census samples, but the 1950 and subsequent census years suppressed much of it in order to meet confidentiality requirements. Thus, the 1950 samples do not identify places that had fewer than 100,000 residents in 1980; the 1960 1% and 1970 samples do not identify places with fewer than 250,000 inhabitants; the 1960 5% sample does not identify places with fewer than 50,000 residents; and the 1980 samples, the 1990 samples, and the 2000 5% sample do not identify places with fewer than 100,000 inhabitants. The 2000 1% sample does not identify places with fewer than 400,000 inhabitants. In the ACS from 2000-2004, the smallest geographic unit identified was the state. The 2005 ACS does not identify places with fewer than 100,000 inhabitants. The impact of these rules on comparability and on geographic coverage of areas smaller than states is explained in the documentation for each geographic variable - see particularly the variables CNTYGP97 (County group, 1970), CNTYGP98 (County group, 1980), PUMA (Public Use Microdata Area, 1990), SEA (State economic area), METRO (Metropolitan status), METAREA (Metropolitan area), CITY (City identifier), and URBAN (Urban/rural status). Users should also consult the migration and place of work variables for further discussion of the effects of geographic comparability on those variables.
Since 1970, the Bureau has produced alternate versions of the public-use files containing different geographic codes. By providing multiple samples with different coding schemes, the Bureau was able to maximize flexibility without violating the confidentiality thresholds.
For 1960, the 5% sample provides geographic areas modeled on Public Use Microdata Areas (PUMAs), which are the smallest identifiable geographic areas in public-use microdata. See the 1960 Geography page for more information about the process used to create these areas and the improved geographic information they provide compared to the 1960 1% sample.
In 1970, three geographic versions were produced, and for each one the Bureau created a Form 1 sample and a Form 2 sample (see above):
- The State samples identify all states. No geographic subdivisions of states are identified, but for most states it is possible to distinguish rural from urban areas and metropolitan from non-metropolitan areas.
- The Metro (also known as County Group) samples identify economic areas within states of 250,000 or more, but those areas do not always follow state boundaries. Every metropolitan area of 250,000 or more is identified. Only four states can be completely identified, because metropolitan areas frequently cross state boundaries and identification of both state and metropolitan area would violate the confidentiality rules. No urban/rural distinctions are available, and smaller metropolitan areas cannot be identified.
- The Neighborhood samples provide only regional information and size of place, but they also give specific characteristics of the surrounding neighborhood that allow contextual analysis. The neighborhoods are not identified by name but represent areas approximately the size of census tracts, which contained about 4000 people. See "1970 Neighborhood Characteristics" in "Geographic Tools," Volume II: User's Guide Supplement, for the record layout and descriptions of the characteristics.
The Census Bureau created three geographic variations in 1980, labeled "A," "B," and "C":
- The State (A) sample identifies all states, larger metropolitan areas, and most counties over 100,000 population. In many cases individual cities are also identified. It does not identify rural/urban residence or residence in smaller metropolitan areas. The State sample is very large, including 1-in-20 (5%) of the US population.
- The Metro (B) sample identifies 282 metropolitan areas over 100,000 population. Only twenty states can be completely identified, because metropolitan areas frequently cross state boundaries and identification of both state and metropolitan area would violate the confidentiality rules. Metropolitan areas are distinguished from non-metropolitan areas, but the sample does not identify urban/rural residence.
- The Urban/Rural (C) sample identifies urban/rural residence, central city residence, and particular urbanized areas. For confidentiality reasons, only 28 states can be entirely identified, and no metropolitan areas are identified.
The Census Bureau produced three geographic versions in 1990:
- The State (5%) sample identifies all states, and within states, most counties or parts of counties with 100,000 or more population. It also identifies most metropolitan areas over 100,000 completely. The sample is the largest one in the IPUMS.
- The Metro (1%) sample is similar, but more metropolitan areas are identified and some states cannot be completely identified because of confidentiality restrictions.
- The Labor Market Area (0.5%) sample identifies no place smaller than 100,000 population, with Labor Market Area (LMA) defined as a combination of counties.
The Census Bureau produced two geographic versions in 2000:
- The Census (5%) sample identifies all states, and within states, most counties or parts of counties with 100,000 or more population. It also identifies most metropolitan areas over 100,000 completely.
- The Census (1%) sample identifies Super-PUMAs containing at least 400,000 persons. Super-PUMAs do not cross state boundaries.
The variable descriptions and comparability discussions that make up the data dictionaries for each IPUMS variable explain how variables are affected by confidentiality restrictions. For example, the variable description for METAREA indicates that the variable is available only for the Metro samples in 1970 and only for the State and Metro samples in 1980.
To maximize historical comparability, the IPUMS constructs a variety of geographic variables for earlier years that were not contained in the original samples. The variables SEA (State economic area), METRO (Metropolitan status), and METAREA (Metropolitan area), among others, represent concepts not yet in use when the earlier censuses were taken. The IPUMS applies these concepts to the pre-1940 samples in order to extend the series of comparable geographic codes backwards. In addition, the 1850-1940 census years include COUNTY (County of residence), which is especially useful for constructing variables describing the local socioeconomic context of each case. These county-level data are available in machine-readable form (ICPSR file 0003) from the Inter-University Consortium for Political and Social Research (http://www.icpsr.umich.edu/index.html).
Occupational and Industrial Classifications
Occupation and industry are among the most important variables for analyses of long-term social change because the early census years provide few alternative indicators of socioeconomic status or labor-force participation. The Census Bureau has modified its classification systems every decade, so all comparisons of occupation and industry require extensive reconciliation of codes. There are twelve different occupational classification systems consisting of between 285 and 550 categories each. Links to each of the occupational classification systems can be found on the OCC codes page. In addition, there are eight industry classification systems consisting of up to 270 categories each. Links to each of the industry classification systems can be found on the IND codes page. Although a complete reconciliation of these coding schemes is impossible, we provide variables that maximize the potential for consistent comparisons of occupational status and industry. In each census year, we provide the contemporary occupational and industry classifications and also impose a common coding scheme based on the 1950 Census Bureau classifications. A more in-depth discussion of occupational variables can be found in "Occupation Codes and Income Scores."
Constructed Variables on Family Interrelationships
The IPUMS contains three pointer variables - MOMLOC, POPLOC, and SPLOC - that give the location within the household of each individual's mother, father, and spouse. These variables allow users to easily attach characteristics of individuals to those of their kin, and they are convenient tools for constructing measures of fertility and co-residence. The IPUMS also includes several of the most commonly requested variables on own children: Number of own children (NCHILD), Number of own children under age five (NCHLT5), Age of eldest own child (ELDCH), and age of youngest own child (YNGCH). These and other constructed family interrelationship variables are fully described in this volume in "Family Interrelationships."
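The way a pointer variable such as SPLOC can be used to attach a spouse's characteristic is sketched below in Python. Several details are assumptions made for the example rather than statements from this documentation: persons carry a within-household position in a field called PERNUM, a SPLOC value of 0 means no spouse is present in the household, and AGE stands in for whatever characteristic is being attached. All values are made up.

```python
def attach_spouse_value(household_persons, source_var, new_var):
    """Copy source_var from each person's spouse into new_var (None if no spouse)."""
    # Index the household's members by their position within the household.
    by_position = {p["PERNUM"]: p for p in household_persons}
    result = []
    for person in household_persons:
        # ASSUMPTION: SPLOC 0 matches no position, so the lookup returns None.
        spouse = by_position.get(person["SPLOC"])
        enriched = dict(person)
        enriched[new_var] = spouse[source_var] if spouse is not None else None
        result.append(enriched)
    return result


# One made-up household: a married couple and a child.
household = [
    {"PERNUM": 1, "SPLOC": 2, "AGE": 40},
    {"PERNUM": 2, "SPLOC": 1, "AGE": 38},
    {"PERNUM": 3, "SPLOC": 0, "AGE": 12},
]

with_spouse_age = attach_spouse_value(household, "AGE", "AGE_SP")
for person in with_spouse_age:
    print(person)
```

The same pattern would apply to MOMLOC and POPLOC by looking up the mother's or father's position instead of the spouse's. The function must be applied one household at a time, since positions are only meaningful within a household.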
Coding Schemes
The original census samples that make up the IPUMS employed different classification systems and coding schemes in every census year. A central goal of the IPUMS is to reconcile these in order to create comparable codes for each variable. Perfect uniformity across years was our ideal, but it is achieved in only a few variables (most notably SEX) that are classified in precisely the same way in each sample. For most variables, such perfection could not be achieved without an unacceptable loss of information. This is primarily because the variables often contain more detail in some census years than in others; if we had reduced all census years to their lowest common denominator, this detail would have been lost.
The person-record variable RELATE (Relationship to household head/householder) is illustrative. Most of the census samples code it differently. For instance, household heads/householders are coded "1000" in the original 1910 sample, "01" in the 1940 sample, and "0" in the 1960 sample. To reconcile these, the IPUMS translates the various codes into a single code ("01") for household heads/householders in all years, thus easily eliminating this simple incompatibility. But some censuses code RELATE in more detail - that is, using more categories - than others. For example, the 1960 and 1970 samples use only 15 categories for RELATE, while the 1910 sample distinguishes 161 categories. Table 1 and Table 2 reproduce part of the original codebook pages describing the "Relationship to Household Head/Householder" for both the 1910 and 1960 samples. In this example, if we forced the 1910 categories into those for 1960, we would lose such categories as nephew, aunt, and domestic servant.
To avoid such problems and still maximize code comparability, we designed two-part coding systems for many variables. A general code, constituting the first one, two, or sometimes three columns of these variables, serves as a lowest common denominator that classifies information available in all samples containing the variable. The general codes are usually fully comparable across years. A detail code, contained in the columns that follow the general code, preserves sub-categories that are available in some but not all years. It must be read with the general code. The general-detailed coding system maximizes comparability without losing information, and has been applied to all complex categorical classifications except for occupation, which is discussed at length in "Occupation Codes and Income Scores."
Table 3 shows a sample representation of the IPUMS data dictionary for RELATE. The respective headers indicate that RELATE is available in both general and detailed form. The general codes comprise the first two columns of RELATE. These first two digits are comparable for all years. They generally follow the 1960 and 1970 codes, which served as our lowest common denominator for creating RELATE, through the process described above. The general codes allow researchers to comparatively analyze RELATE across all census and ACS/PRCS years for which it is available. Researchers who wish to use RELATE in its detailed form - that is, those who wish to analyze information that is available only for some years - should read all four columns of RELATE, thus incorporating the detailed sub-categories contained in the extended columns.
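The general/detailed layout described above can be illustrated with a short Python sketch that splits a four-column RELATE value into its two-column general code and the detail columns that follow. Apart from "01" for household heads/householders, the code values below are invented for the example and are not taken from the data dictionary.

```python
from collections import Counter


def split_relate(detailed_code):
    """Return (general, detail) parts of a four-column RELATE code."""
    # Zero-pad to four columns so that integer-coded values keep leading zeros.
    text = f"{int(detailed_code):04d}"
    return text[:2], text[2:]


# Made-up detailed codes, some stored as integers and some as strings.
detailed_codes = [101, "0101", 301, 302, "1205"]

general_codes = [split_relate(code)[0] for code in detailed_codes]
general_frequencies = Counter(general_codes)
print(split_relate(101))
print(general_frequencies)
```

Tabulating on the first two columns gives categories intended to be comparable across years, while the full four columns should only be compared across years in which the detail is available.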
Using the Data Dictionaries
The Data Dictionaries, split into household record and person record sections, contain descriptions of each variable. For each variable, the dictionaries provide a universe statement, a variable description, and a discussion of variable comparability across census and ACS/PRCS years, along with links to codes and frequencies and to relevant data quality flags. In certain cases, additional user notes caution researchers against potential problems that some uses of the variable may entail. Users should pay close attention to the universe statement, comparability discussion, and user notes. In many cases, variables with apparently comparable codes are actually defined slightly differently over time or are available for different populations in various census years.
For most variables, a frequency table gives the value label for each IPUMS code. The RELATE frequency table shown in Table 3 serves as an example. The numbers under each census year are the number of cases (frequencies) in each category for that year; a blank indicates that the category is not available for that year. The codes and frequencies table for the general codes typically has no blanks, indicating that all categories are available in all census and ACS years. By contrast, the table for the detailed codes is often filled with blanks, because many codes are available only in a subset of years. A blank in the pre-1950 censuses usually means that the response did not occur in the population, because, for the most part, the pre-1950 census samples preserved virtually all available detail. A blank in a table for the census years from 1950 to the present usually means that the variable was simply coded into broader categories by the Census Bureau.
The frequency counts in the Data Dictionaries are unweighted; therefore, they do not necessarily reflect the distribution for the general population accurately. In particular, frequencies for 1950 (a heavily weighted sample) often appear inconsistent with surrounding years. Applying the appropriate weights makes all samples representative of the general population.
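The difference between unweighted and weighted frequencies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IPUMS procedure: the record layout (a category label paired with a weight), the function name, and the weight values are all invented for the example.

```python
from collections import defaultdict


def frequencies(records):
    """Return (unweighted, weighted) frequency tables.

    `records` is an iterable of (category, weight) pairs; the layout is
    an assumption made for this sketch.
    """
    unweighted = defaultdict(int)
    weighted = defaultdict(float)
    for category, weight in records:
        unweighted[category] += 1       # each sampled case counts once
        weighted[category] += weight    # each case counts as `weight` people
    return dict(unweighted), dict(weighted)


# Invented data: three sampled cases with unequal weights.
records = [("head", 100.0), ("child", 100.0), ("child", 300.0)]
unweighted, weighted = frequencies(records)
```

In this made-up data, "child" accounts for two of three sampled cases but four fifths of the weighted total, which is the kind of gap that can make unweighted frequencies from a heavily weighted sample look inconsistent with other years.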
In years with multiple samples, we chose one for purposes of presenting frequencies. The 1970 frequencies are from the Form 2 (15%) State sample, by default. If a variable is only available in the Form 1 (5%) samples, then the Form 1 State sample frequencies are presented. In the case of METAREA in 1970, the Form 2 Metro (County Group) sample was used. The 1980 frequencies are taken from the Metro (1%, "B") sample, the 1990 frequencies are from the Metro (1%) sample, the 2000 frequencies are from the 1% sample, and the ACS/PRCS frequencies are taken from the entire sample for each year.
Indentations in the value labels column of the data dictionary are meaningful. Any item indented beneath another is a subset of the larger category. If a subcategory is not available in a given year, in general the cases would have been coded into the larger category. For example, in Table 3, panel 2, the category of "adopted child" is not available after 1940. Adopted children in more recent years are recorded in the larger category "child," under which "adopted child" is indented. Please see the RELATE codes and frequencies page for more detail.
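The indentation rule can be pictured as a parent lookup: when a subcategory is not available in a given year, its cases fall into the category it is indented beneath. The Python sketch below assumes a hand-built parent table and hypothetical availability sets; only the "adopted child" under "child" relationship is taken from the text above.

```python
# Parent of each indented label; top-level labels have no entry.
# Only the "adopted child" -> "child" relationship comes from the text.
PARENT = {"adopted child": "child"}


def collapse(label, available):
    """Walk up the hierarchy until reaching a label available in the year."""
    current = label
    while current not in available:
        if current not in PARENT:
            raise KeyError("no available category for %r" % (label,))
        current = PARENT[current]
    return current


# Hypothetical availability sets for an earlier and a later year.
early_year = {"child", "adopted child"}
later_year = {"child"}

kept = collapse("adopted child", early_year)
rolled_up = collapse("adopted child", later_year)
```

Here `kept` stays "adopted child" because the earlier year offers that subcategory, while `rolled_up` becomes "child" for the later year.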
Public Use Microdata Samples Source Materials
For all census years from 1850 to 1950 (except for 1890, whose schedules were destroyed by fire), the original manuscript population schedules are preserved on microfilm at the National Archives in Washington, D.C. In each year, the microfilm reels and the schedules within reels are organized geographically: alphabetically by state, within states alphabetically by county, and within counties numerically by enumeration district. For census and ACS/PRCS years since 1960, the census schedules exist in fully machine-readable form.
The basic sources for most of the IPUMS documentation are the documentation provided for each of the individual public use microdata samples. These are listed below. All of them should be available through the Inter-university Consortium for Political and Social Research (ICPSR), P.O. Box 1248, Ann Arbor, MI, 48106.
1850: Steven Ruggles, Russell R. Menard, et al., Public Use Microdata Sample of the 1850 United States Census of Population: User's Guide and Technical Documentation, Social History Research Laboratory, Department of History, University of Minnesota, 1995.
1860 and 1870: The 1860 and 1870 samples are currently being created by Steven Ruggles and others as part of the Historical Census Project at the Minnesota Population Center, University of Minnesota. The samples were designed at the outset to be compatible with the IPUMS, so a separate volume of documentation will not be created.
1880: Steven Ruggles, Russell R. Menard, et al., Public Use Microdata Sample of the 1880 United States Census of Population: User's Guide and Technical Documentation, Social History Research Laboratory, Department of History, University of Minnesota, 1994.
1900-1930: These samples were designed at the outset to be compatible with the IPUMS, so a separate volume of documentation was not created.
1910: Alberto Palloni, Halliman H. Winsborough, Francisco Scarano, Puerto Rico Census Project, 1910, University of Wisconsin Survey Center, Inter-University Consortium for Political and Social Research, 2005.
1920: Alberto Palloni, Halliman H. Winsborough, Francisco Scarano, Puerto Rico Census Project, 1920, University of Wisconsin Survey Center, Inter-University Consortium for Political and Social Research, 2005.
1940: United States Bureau of the Census, Census of Population, 1940: Public Use Sample Technical Documentation, Government Printing Office, 1984.
1950: United States Bureau of the Census, Census of Population, 1950: Public Use Sample Technical Documentation, Government Printing Office, 1984.
1960: United States Bureau of the Census, Technical Documentation for the 1960 Public Use Sample, Government Printing Office, 1973. See also the entry for 1970.
1970: United States Bureau of the Census, Public Use Samples of Basic Records from the 1970 Census: Description and Technical Documentation, Government Printing Office, 1972. Also contains fundamental information about the 1960 sample that is not contained in the 1960 documentation.
1970 Puerto Rico: United States Bureau of the Census, 1970 Puerto Rico Public Use Samples: Description and Technical Documentation, Government Printing Office, 1990.
1980: United States Bureau of the Census, Public Use Samples of Basic Records from the 1980 Census: Description and Technical Documentation, Government Printing Office, 1983.
1980 Puerto Rico: United States Bureau of the Census, Census of Population and Housing, 1980: Public-Use Microdata Samples, Technical Documentation: Supplement for Puerto Rico, Government Printing Office, 1983.
1990: United States Bureau of the Census, Census of Population and Housing, 1990: Public Use Microdata Samples, Technical Documentation, Government Printing Office, 1993.
1990 Puerto Rico: United States Bureau of the Census, 1990 Census of Population and Housing Public Use Microdata Samples (PUMS): Puerto Rico Technical Documentation, Government Printing Office, 1993.
2000: United States Bureau of the Census, Census of Population and Housing, 2000: Public Use Microdata Samples, Technical Documentation, Government Printing Office, 2003.
2010: United States Bureau of the Census, 2010 Census of Population and Housing, United States Public Use Microdata Sample (PUMS): Technical Documentation, U.S. Census Bureau, 2014.
ACS: United States Bureau of the Census, American Community Survey, 2000: Public Use Microdata Samples, Operation Plan, United States Bureau of the Census, 2003.