Constructing strata in the IPUMS samples

The IPUMS STRATA and CLUSTER variables permit the use of statistical techniques that produce standard errors that properly account for the complex survey design. These variables are to be used in addition to HHWT, PERWT, or SLWT. This page describes the creation of STRATA and CLUSTER in the pre-1960 IPUMS samples.

Overview

Taylor series linearization is the easiest and most widely-used method for estimating variance with complex sample designs, but it is not designed for samples with implicit stratification. Because the stratification in census records is implicit, there is no geographic unit in the data that corresponds precisely to the geographic stratification embedded in the page ordering of the manuscripts. This poses a major problem for Taylor series linearization, since the method requires explicit information about strata.
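To make the dependence on explicit strata concrete, the following is a minimal sketch of the textbook with-replacement linearization estimator for the variance of a weighted mean. It is not taken from any particular statistical package, and the function name and the (stratum, cluster, weight, y) tuple layout are assumptions made for illustration. Note that the estimator divides by the number of clusters in a stratum minus one, so a stratum containing a single cluster cannot contribute a variance estimate.

```python
from collections import defaultdict

def linearized_variance_of_mean(rows):
    """rows: iterable of (stratum, cluster, weight, y) tuples (hypothetical layout).

    Returns (weighted mean, Taylor-linearized variance of that mean), using
    the with-replacement between-cluster estimator within each stratum.
    """
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Linearized score for each cluster: sum of w * (y - mean) / total_w.
    scores = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in rows:
        scores[stratum][cluster] += w * (y - mean) / total_w

    variance = 0.0
    for stratum, by_cluster in scores.items():
        n = len(by_cluster)
        if n < 2:
            # One cluster gives no within-stratum variation to measure.
            raise ValueError(f"stratum {stratum!r} has a single cluster")
        z = list(by_cluster.values())
        z_bar = sum(z) / n
        variance += n / (n - 1) * sum((v - z_bar) ** 2 for v in z)
    return mean, variance

# Two strata with two clusters each, all weights equal to 1.
sample = [("A", 1, 1.0, 1.0), ("A", 2, 1.0, 3.0),
          ("B", 3, 1.0, 2.0), ("B", 4, 1.0, 6.0)]
mean, variance = linearized_variance_of_mean(sample)
```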

To create a proxy for the implicit geographic stratification of the historical IPUMS samples, we used the Census Bureau's page numbering scheme in combination with explicit geographic information written on the census page. The 1850-1930 samples are based on public data, so our strata were able to capitalize on a great deal of implicit and explicit geographic information. Many variables and values from the 1940-1950 samples are still protected by Census Bureau confidentiality restrictions, however, so the pseudo-strata in those years draw on slightly less information.

To give some additional background on the organization of the census records, the manuscripts from each census are stored on several thousand microfilm reels. Most reels contain several hundred pages. Each of these pages contains between 40 and 50 lines, with each line containing information on one person. Most reels are divided into two or more "page sequences"; at the beginning of each sequence, the page numbering starts over at one. Page sequences usually denote a change in the county of enumeration, though changes in county of enumeration also often occur within a given page sequence.

Strata in the 1850-1930 samples

For samples from 1850-1930, our modified strata definition takes advantage of all of this information. For each reel, we create strata of approximately 10,000 lines, which include information on about 2,000 households averaging 5 persons each. These strata contain no breaks in county of enumeration or page sequence. Whenever a new county or page sequence is encountered, we close out the current stratum and start a new one. For this reason, many strata ultimately draw on fewer than 10,000 lines. This was a necessary compromise, since the number of persons living in a contiguous geographic unit was very unlikely to be an exact multiple of 10,000.

To illustrate how this typically works out, consider a reel that has 421 pages, with 100 persons per page and a break in county on page 132:

stratum 1 will be pages 1-100
stratum 2 will be pages 101-131
stratum 3 will be pages 132-200
stratum 4 will be pages 201-300
stratum 5 will be pages 301-400
stratum 6 will be pages 401-421
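The page ranges in this example can be reproduced with a short sketch. It assumes that stratum boundaries fall on fixed 100-page marks within a page sequence (the third stratum in the example ends at page 200 rather than page 231) and that a change of county also opens a new stratum. The function name, the input layout, and the county labels are hypothetical, and the production procedure may differ in detail.

```python
def assign_strata(pages, pages_per_block=100):
    """pages: ordered list of (page_number, county) for one page sequence.

    Returns a list of (first_page, last_page) ranges, one per stratum.
    Assumes fixed blocks of pages_per_block pages, split further wherever
    the county of enumeration changes.
    """
    strata = []
    prev_key = None
    for page, county in pages:
        key = ((page - 1) // pages_per_block, county)
        if key != prev_key:
            strata.append([page, page])  # open a new stratum
            prev_key = key
        else:
            strata[-1][1] = page         # extend the current stratum
    return [tuple(r) for r in strata]

# The reel from the example: 421 pages, county changes on page 132.
reel = [(p, "County A" if p < 132 else "County B") for p in range(1, 422)]
ranges = assign_strata(reel)
print(ranges)
# [(1, 100), (101, 131), (132, 200), (201, 300), (301, 400), (401, 421)]
```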

After this is done, we check for strata with only 1 household ("singletons"). Of the 2.5 million households in all of the 1850-1930 samples combined, only 4,200 were in singleton strata. We placed these singletons by following the steps listed below; we have indicated the proportion of cases that were placed in each step.

Steps for placing singletons:

if the singleton's county is the same as that of the previous or following stratum, add the singleton household to the previous or following stratum (87% of singletons were placed in this step).
if the singleton's county is different from that of the previous and following strata, create a stratum with other singletons in the State Economic Area (SEA). SEAs are groups of contiguous counties sharing similar economic characteristics in 1950 (12% of singletons were placed in this step).
if there are no other singletons in the SEA, create a stratum with other singletons in the state (less than 1% of singletons were placed in this step).
if there are no other singletons in the state, create a stratum with other singletons in the nation (less than 1% of singletons were placed in this step).
if there are no other singletons in the nation, add the singleton household to the previous stratum (less than 1% of singletons were placed in this step).
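The cascade above can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual code: the dictionary keys ('county', 'sea', 'state', 'households') are hypothetical, SEA and county values are assumed to be unique across states, the previous stratum is tried before the following one, and an adjacent stratum only qualifies in the first step if it is not itself a singleton.

```python
from collections import defaultdict

def place_singletons(strata):
    """strata: ordered list of dicts with hypothetical keys
    'county', 'sea', 'state', 'households' (a list of household ids).

    Returns the household lists of the resulting strata, in order.
    """
    merged = [list(s["households"]) for s in strata]
    leftover = []

    # Step 1: merge into an adjacent, non-singleton stratum in the same county.
    for i, s in enumerate(strata):
        if len(s["households"]) != 1:
            continue
        neighbors = [j for j in (i - 1, i + 1)
                     if 0 <= j < len(strata)
                     and strata[j]["county"] == s["county"]
                     and len(strata[j]["households"]) > 1]
        if neighbors:
            merged[neighbors[0]].extend(merged[i])
            merged[i] = []
        else:
            leftover.append(i)

    # Steps 2-4: pool remaining singletons by SEA, then state, then nation.
    for level in ("sea", "state", None):
        groups = defaultdict(list)
        for i in leftover:
            groups[strata[i][level] if level else "nation"].append(i)
        leftover = []
        for members in groups.values():
            if len(members) > 1:
                first = members[0]
                for i in members[1:]:
                    merged[first].extend(merged[i])
                    merged[i] = []
            else:
                leftover.extend(members)

    # Step 5: a lone remaining singleton joins the nearest earlier stratum.
    for i in leftover:
        earlier = [j for j in range(i - 1, -1, -1) if merged[j]]
        if earlier:
            merged[earlier[0]].extend(merged[i])
            merged[i] = []
    return [m for m in merged if m]

example = [
    {"county": "A", "sea": 1, "state": "MN", "households": [1, 2, 3]},
    {"county": "A", "sea": 1, "state": "MN", "households": [4]},
    {"county": "B", "sea": 1, "state": "MN", "households": [5]},
    {"county": "C", "sea": 1, "state": "MN", "households": [6]},
    {"county": "D", "sea": 2, "state": "MN", "households": [7, 8]},
    {"county": "E", "sea": 3, "state": "WI", "households": [9]},
]
placed = place_singletons(example)
```

In this made-up input, household 4 joins its county neighbor, households 5 and 6 are pooled within their shared SEA, and household 9 falls through to the final step.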

Strata in the 1940-1950 samples

For the 1940-1950 samples, no information is available regarding reel, page, or page sequence. Even the county of enumeration is available for only some cases. For this reason, we defined strata as the Enumeration District in these years. Enumeration Districts consist of contiguous groups of neighborhoods or minor civil divisions. The boundaries of each Enumeration District were drawn to enclose the area that a door-to-door enumerator could be expected to cover in a two-week period. Enumeration Districts are consistently identified in these samples.

When an Enumeration District had only one household in the sample (a singleton household), we created a stratum with other singleton households in the State Economic Area. The 1940 and 1950 datasets together had 852,164 households. Of those, 84,487 were singletons within their Enumeration District. All of these cases were placed in SEA-level strata with other singletons.
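A minimal sketch of this rule follows, assuming hypothetical record keys 'ed' and 'sea' whose values are unique within the data being processed (for example, one census year at a time). It does not check whether an SEA-level stratum itself ends up with a single household; the text above reports that every singleton found other singletons in its SEA.

```python
from collections import Counter

def assign_ed_strata(households):
    """households: list of dicts with hypothetical keys 'ed' and 'sea'.

    Returns one stratum label per household: ("ED", ed) normally, or
    ("SEA", sea) when the household is the only one sampled in its district.
    """
    per_ed = Counter(h["ed"] for h in households)
    return [("ED", h["ed"]) if per_ed[h["ed"]] > 1 else ("SEA", h["sea"])
            for h in households]

# Made-up records: district 11 holds two households, districts 12 and 13 one each.
sample = [{"ed": 11, "sea": "X"}, {"ed": 11, "sea": "X"},
          {"ed": 12, "sea": "X"}, {"ed": 13, "sea": "X"}]
labels = assign_ed_strata(sample)

# Share of households that were singletons, from the counts quoted above.
singleton_share = f"{84487 / 852164:.1%}"
```

With the counts quoted above, singleton_share works out to roughly one household in ten.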

Strata in the 1960-2000 samples

For the 1960-2000 samples, strata were created based on the stratification criteria used to select the PUMS samples.

For 1960, 38 strata were created based on household size, home ownership, race, and group quarters residence.

The procedures used to select cases for inclusion in the 1970 public use samples were similar to those used in 1960 but slightly more elaborate. Seventy-five strata were created based on home ownership, race, sex of head, household size, presence of own children, inmate status, and other residence in group quarters.

For the 1980 samples, strata were created based on race, Spanish origin, home ownership, and presence of own children. Sampling rate was not used because it was unavailable, which halved the number of strata from 102 to 51.

For the 1990 and 2000 samples, strata were created based on presence of own children, race, Spanish origin, and home ownership. As in 1980, sampling rate was not included as a criterion. In addition, to avoid singletons, the Asian race categories were collapsed into one category; this collapsed classification was also used for the 2000 samples (White/Other Race/Two or More Races Hispanic; White/Other Race/Two or More Races Non-Hispanic; Asian and Pacific Islander; Black and American Indian). For the 1990 samples, age was not used for non-institutional group quarters, again to avoid singletons. Any remaining singletons were collapsed into the White Non-Hispanic Origin strata. These methods produced 119 strata for the 1990 samples and 131 strata for the 2000 samples.

For more information on the Census public use microdata sampling design, see Chapter 2: Sample Designs.

Strata in the American Community Survey samples

For the American Community Survey (ACS) samples, strata are based on the lowest level of geography available in the sample. For the 2000-2004 samples, each state forms a stratum. In the ACS samples from 2005 onward, each unique Public Use Microdata Area (PUMA) forms a stratum.

For further discussion of how researchers with the Minnesota Population Center have constructed and evaluated strata in the historical samples, please see "Drawing Statistical Inferences from Historical Census Data".