Sampling Errors
As described in "Sample Designs", the component samples of the IPUMS used a variety of sample designs. These variations significantly affect the precision of sample estimates. Estimates derived from any sample are subject to sampling variability, which is usually measured as the standard error. The standard error of a sample statistic estimates the variation of that statistic across many similar samples drawn from the same population. Approximately two-thirds of random samples will produce estimates within one standard error of the true population value, and approximately 95 percent of samples will produce estimates within two standard errors. Standard errors depend on both sample size and sample design. This chapter describes how sample design affects sample precision, estimates the resulting differences in standard errors across the IPUMS samples, and discusses strategies for obtaining realistic estimates of statistical significance.
Clustering
All the IPUMS samples are cluster samples. The most interesting census information describes individual characteristics, such as age, race, sex, income, education, and so on. But the IPUMS samples are not individual-level samples; instead, they are samples of households or dwellings. Information about individuals is gathered household by household because many important topics of analysis - such as fertility, household composition, and nuptiality - require information about multiple individuals within the same household. The number of independent observations in each census file is the number of households or dwellings, not the number of individuals.
Some individual characteristics, such as ancestry, are highly correlated within households. For example, if one household member is of Chinese descent, the odds are high that other household members are also of Chinese descent. Suppose we wish to estimate the standard error for the proportion of the population of Chinese ancestry. If we had the sort of sample generally assumed by statistics textbooks - an independent random sample of all individuals in the population - the standard error for Chinese ancestry would be inversely proportional to the square root of the number of individuals in the sample. But because the samples are cluster samples and Chinese ancestry is highly correlated within clusters, the usual method for calculating standard errors would overestimate sample precision. A better estimate of the standard error would be obtained by substituting the number of households for the number of individuals when making the calculation.
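The substitution described above can be illustrated with a minimal sketch. It uses the textbook standard error of a proportion, sqrt(p(1-p)/n); the proportion and the counts of persons and households are hypothetical values chosen only for illustration, not figures from any IPUMS sample.

```python
import math

def se_proportion(p: float, n: int) -> float:
    """Textbook standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical figures, chosen only for illustration.
p = 0.01                 # assumed share of the population with the characteristic
n_persons = 250_000      # assumed number of person records in the sample
n_households = 100_000   # assumed number of sampled households (clusters)

# Naive estimate: treats every person as an independent observation.
se_naive = se_proportion(p, n_persons)

# Conservative estimate: treats every household as one observation,
# which is appropriate when the trait is nearly uniform within households.
se_conservative = se_proportion(p, n_households)

# The inflation equals sqrt(n_persons / n_households).
ratio = se_conservative / se_naive
```

Under these assumed counts the household-based standard error is larger by a factor of the square root of the average number of persons per household.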
Standard errors in cluster samples depend on both the number of clusters sampled and on the homogeneity of variables within clusters. Calculation of standard errors for cluster samples is complicated.1 In the worst case, with perfect homogeneity within clusters, the standard errors for variables would be inversely proportional to the square root of the number of clusters rather than the number of individuals. For variables that are heterogeneous within clusters, such as age and sex, clustering may have little or no effect on sample precision.
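The two extremes described above are often summarized with the approximation for the design effect given in the survey sampling literature (for example, Kish). The formula below is a sketch under simplifying assumptions that are not stated in this chapter: all clusters have the same size m, and ρ denotes the intraclass correlation of the variable within clusters.

```latex
\mathrm{deff} = 1 + (m - 1)\,\rho,
\qquad
\mathrm{SE}_{\text{cluster}} \approx \mathrm{SE}_{\text{SRS}} \sqrt{1 + (m - 1)\,\rho}
```

With perfect homogeneity (ρ = 1) the standard error is inflated by the square root of m, which is equivalent to replacing the number of individuals n with the number of clusters n/m. With no within-cluster correlation (ρ = 0) clustering has no effect.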
The impact of clustering therefore varies from variable to variable. It also varies from census year to census year. The homogeneity of particular characteristics within clusters can change over time. For example, as ethnic intermarriage increases, ethnicity within households becomes less homogeneous.
The size of clusters also differs across census years. The larger the average size of clusters, the smaller the number of independent observations. Household size has fallen dramatically over the past century as fertility has declined, boarders and extended families have become less common, and more persons have chosen to live as primary individuals. Thus the samples for recent years have far smaller clusters, on average, than those for earlier years.
Changes in cluster size occur not only because of change over time in household size but also because of differences in the treatment of group quarters - large units such as institutions, boarding houses, and college dormitories. To maximize sample precision, such units are sampled at the individual level in all census years. Thus, instead of being treated as a single sample unit, a prison with 1,000 inmates would be sampled as if it were 1,000 one-person households. This procedure multiplies manyfold the number of independent observations for persons in large units.
The criteria for designating units to be sampled on an individual basis vary considerably among the samples. Table 1 summarizes the criteria used to define group quarters in each census year. Samples from Puerto Rico followed the same group quarters sampling strategies as the United States samples.
In general, the 1940 1%, 1950, 1960, and 1970 samples employ the broadest definition of group quarters: all persons in units with five or more persons unrelated to the household head are sampled as individuals. Thus, for example, a wealthy family with five co-resident servants would be treated as group quarters, and each member of the unit - including the primary family - would be sampled as if he or she resided in a separate one-person household. This inclusive definition of group quarters minimizes the size of clusters. In the 1850-1930 samples and the 1940 100%, on the other hand, a unit may have up to thirty members, related or unrelated, before it is sampled as group quarters. In a unit with over thirty members, each related group is sampled jointly in order to preserve all co-resident family relationships. This is a minimal definition of group quarters that maximizes the size of clusters; sample precision was reduced in order to preserve information about family and household interrelationships. Further details appear in "Sample Designs".
Census years | Criteria |
---|---|
1850-1900; 1910 (except cases where SAMP1910=5); 1920-1930; 1940 100% | Units of size 31 or more; related groups within group quarters sampled jointly. |
1910 (only cases where SAMP1910=5) | Units with 21 or more members unrelated to household head; related groups within group quarters sampled individually. See "1-in-250 national sample" section of the 1910 sample description. |
1940 1%-1970 | Institutions and other units with 5 or more members unrelated to household head; related groups within group quarters sampled individually. |
1980-1990 | Institutions and other units with 10 or more members unrelated to householder; related groups within group quarters sampled individually. |
2000-onward | In these samples, no threshold was applied. For a household to be considered group quarters, it had to be on a list of group quarters that is continuously maintained by the Census Bureau. |
There is one additional feature of the sample designs that affects the size of clusters. Most public use files are samples of households, individuals within households, and group quarters.2 The 1850-1930 samples, however, add another level of hierarchy in that multi-household dwellings containing thirty or fewer residents were sampled as dwellings instead of as households - that is, all households within the dwelling were sampled. This provides important additional information, since many dwellings contained two interrelated households. At the same time, sampling by dwellings sacrifices some sample precision, since it increases the average size of clusters.
Stratification
All of the public use samples are implicitly or explicitly stratified. That is, they divide the population into strata based on key characteristics, and then sample separately from each stratum. This ensures that each stratum is proportionately represented in the final sample.
Stratification has the opposite effect of clustering: it increases the precision of sample estimates. It does so not only for those characteristics that are explicitly stratified, but also for any other characteristics that are correlated with them. In some cases, the positive effects of stratification outweigh the adverse effects of clustering, so the IPUMS sample designs can actually yield smaller standard errors than would be obtained through a simple random sample of similar size.
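A minimal sketch of proportionate stratified sampling may make the idea concrete. The function name `stratified_sample`, the record layout, and the two-region population are hypothetical; the actual IPUMS and Census Bureau selection procedures are more elaborate and are described in "Sample Designs".

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same sampling fraction independently within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = round(len(members) * fraction)     # proportionate allocation
        sample.extend(rng.sample(members, k))  # sample without replacement
    return sample

# Hypothetical population: 900 households in region "A", 100 in region "B".
population = [{"id": i, "region": "A" if i < 900 else "B"} for i in range(1000)]
sample = stratified_sample(population, lambda r: r["region"], fraction=0.1)
counts = {reg: sum(1 for r in sample if r["region"] == reg) for reg in ("A", "B")}
```

Because each stratum is sampled separately, the regional split of the sample is fixed by design rather than left to chance, which is why estimates for the stratifying characteristic, and for anything correlated with it, become more precise.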
The samples for the four most recent census years all employ elaborate stratification schemes; these are described in some detail in the preceding chapter on sampling design. To select one example, the 1960 sample divided the population into 38 strata, based on household size, home ownership, race, and group quarters residence, and systematically selected households from each stratum for inclusion in the sample. Sample precision was further enhanced by a selection scheme that ensured even coverage within every geographic area. Public use samples of subsequent censuses are even more elaborately stratified - the 1990 sample was selected from 1,049 strata.
The samples for earlier censuses are not as stratified. Unlike the 1960 and subsequent samples, these were not drawn from an existing machine-readable source; instead, they were entered by hand from microfilm copies of the original enumerators' manuscripts. Most individual and household characteristics were unknown before the cases were entered, so they could not be efficiently used to stratify the samples.
However, since the microfilm reels and the manuscripts they contain are both arranged geographically, this characteristic was available for every case prior to data entry. Each of the pre-1960 samples could therefore be geographically stratified. This was done by dividing the raw data into individual census pages, groups of pages, or individual microfilm reels, and then sampling independently from each of these strata. The result is a significantly more even geographical distribution of cases than would be expected from a true random sample. This dramatically improves the precision of the geographic variables, such as region and urban residence. It also indirectly improves the precision of variables highly correlated with geography (e.g., race, ethnicity, education, occupation, farm residence, and home ownership).
Estimating Sampling Errors in the IPUMS
For a true random sample in which each case is an independent observation, it is a simple matter to estimate sampling error before the sample is actually taken. For samples that are simultaneously clustered and stratified, this is much more difficult. Though theoretical pre-sample estimates are possible, they are usually based upon oversimplifications of the problem even though the calculations involved are inevitably complex.
Once the samples are complete, however, it is fairly easy to develop empirical estimates of standard errors. As stated above, standard errors are simply estimates of the standard deviation of a statistic over all possible samples of a population. The IPUMS samples are large enough that we can divide them into many subsample replicates and then directly measure the distribution of a statistic across the subsamples.
Table 2 shows estimated design factors for selected variables in each IPUMS file from 1880 to 1980. These were arrived at by dividing each sample into fifty randomly selected subsample replicates, calculating the standard deviation of the estimate for each variable across the fifty subsamples, and dividing the result by the standard error that statistical theory predicts for a simple random sample of the same size as each subsample.3 Design factors for the 1990 and 2000 censuses and the ACS are available in the published Census Bureau documentation for each year (see pages 3-8 and 3-9 of the 1990 documentation, pages 4-21 to 4-73 of the 2000 documentation, and the ACS accuracy statements).
Variable | 1880 | 1900 | 1910 | 1940 1% | 1950 | 1960 | 1970 | 1980 |
---|---|---|---|---|---|---|---|---|
Relationship | 1.0 | 0.9 | 1.0 | 0.9 | 0.9 | 0.5 | 0.4 | 0.6 |
Age | 1.1 | 1.1 | 1.1 | 1.2 | 1.1 | 1.0 | 0.9 | 1.1 |
Sex | 0.8 | 0.9 | 0.8 | 0.8 | 0.7 | 0.7 | 0.8 | 0.7 |
Marital status | 1.0 | 1.0 | 0.9 | 1.0 | 1.0 | 0.7 | 0.5 | 0.8 |
Race | 2.2 | 2.1 | 2.1 | 2.1 | 2.1 | 0.5 | 1.4 | 1.3 |
Occupation | 1.2 | 1.1 | 1.2 | 1.1 | 0.8 | 0.9 | 1.0 | 1.0 |
Birthplace | 1.3 | 1.3 | 1.5 | 1.2 | 1.7 | 1.1 | 1.1 | 1.1 |
Language spoken | N/A | N/A | 1.7 | (0.8) | N/A | 1.3 | 1.6 | 1.4 |
Citizenship | N/A | 1.1 | 1.3 | 1.2 | 1.3 | 0.9 | 1.2 | 1.3 |
School attendance | 1.3 | 1.3 | 1.2 | 1.2 | (0.8) | 0.9 | 0.9 | 1.0 |
Ownership of dwelling | N/A | 0.9 | 0.9 | 0.9 | N/A | 0.1 | 0.4 | 0.9 |
Urban/rural status | 0.8 | 0.7 | 0.7 | N/A | N/A | 1.0 | 0.8 | N/A |
Farm | 0.9 | 1.0 | 0.9 | 0.9 | 1.1 | 1.0 | 1.0 | 1.1 |
The design factors represent the ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size. Thus, a design factor of 1.0 means that the effects of stratification and clustering on sample precision cancel one another out. If the design factor is 1.0, a standard statistical package like SPSS or SAS - or a standard statistics textbook - would produce reliable significance statistics. A design factor of 2.0 means that the empirically observed standard errors are twice as great as would be predicted for a simple random sample. For such variables, a statistical package would overestimate statistical significance. Conversely, a design factor of 0.5 means that the sample is twice as precise as would be predicted by standard statistical tests.
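The replicate procedure behind Table 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to produce the table: the function `design_factor` is hypothetical, the variable is reduced to a single 0/1 category, whole households are assigned at random to replicates so that clusters stay intact, and the data are synthetic.

```python
import math
import random
import statistics

def design_factor(persons, n_replicates=50, seed=0):
    """Estimate a design factor for a 0/1 characteristic.

    persons: list of (household_id, flag) pairs, with flag equal to 0 or 1.
    Returns the standard deviation of the proportion across replicates divided
    by the simple-random-sample standard error for one replicate.
    """
    rng = random.Random(seed)
    households = sorted({hh for hh, _ in persons})
    rng.shuffle(households)
    # Assign whole households to replicates so clusters are never split.
    replicate_of = {hh: i % n_replicates for i, hh in enumerate(households)}

    totals = [0] * n_replicates
    hits = [0] * n_replicates
    for hh, flag in persons:
        r = replicate_of[hh]
        totals[r] += 1
        hits[r] += flag

    proportions = [h / t for h, t in zip(hits, totals)]
    observed_sd = statistics.stdev(proportions)

    p = sum(hits) / sum(totals)
    mean_size = sum(totals) / n_replicates
    srs_se = math.sqrt(p * (1.0 - p) / mean_size)
    return observed_sd / srs_se

# Synthetic data: 5,000 four-person households (an assumption for illustration).
rng = random.Random(1)
clustered, independent = [], []
for hh in range(5000):
    shared = 1 if rng.random() < 0.3 else 0
    for _ in range(4):
        clustered.append((hh, shared))  # trait identical within the household
        independent.append((hh, 1 if rng.random() < 0.3 else 0))  # no clustering

deft_clustered = design_factor(clustered)
deft_independent = design_factor(independent)
```

With a trait that is identical within four-person households, the estimate should land near 2 (the square root of the cluster size); with independent persons it should land near 1. With only fifty replicates both values are themselves noisy estimates.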
Only a few variables have design factors that usually exceed 1.0 by a wide margin. The most dramatic case is RACE, where the design factor exceeds 2.0 in each of the census years before 1960. This reflects the impact of clustering, since households have historically been extremely homogeneous with respect to race. Stratification schemes adopted since 1960 have reduced the design factor for race. BPL (Birthplace), LANGUAGE, and CITIZEN (Citizenship status) also tend to be homogeneous within clusters and have relatively high design factors. The design factor for SCHOOL (attendance) probably dropped over time due to declining fertility: as the number of school-age children per household has fallen, the potential for clustering has diminished.
The design factors presented in Table 2 are only valid for analyses of all individuals in the nation as a whole; the results could be significantly different for any population subgroup. Moreover, particular categories within a given variable may have design factors markedly different from that of the variable itself. For example, the categories "head" and "wife" from RELATE (Relationship) have uniformly low design factors; because each household ordinarily contains only one head and no more than one wife, there is no potential for homogeneity within households. By contrast, the design factor for the category "child" is quite high because children tend to occur in combination.
Most common individual-level analyses have lower design factors than those listed in Table 2 because researchers tend to choose population subgroups in such a way as to effectively minimize the impact of clustering. For example, fertility studies most frequently focus on married women ages 15 to 49. Since the great majority of sample clusters contain only one such individual, the impact of clustering is trivial. Thus, fertility studies almost inevitably have design factors at or below 1.0, which yield conservative estimates of statistical significance. Likewise, investigations of such topics as the living arrangements of elderly women, the occupational status of young men, or the education of never-married adult women should generally yield precision at least as high as would be obtained from a simple random sample of the same size.
Researchers can usually set up their analyses to avoid high design factors. For example, consider the variable SCHOOL (attendance). For the population as a whole, the design factor is 1.33, indicating that the standard error for school attendance in the IPUMS sample would be 33 percent larger than in a simple random sample. But if we restrict the analysis by age and sex and look at the school attendance of girls ages 10 to 14, the design factor improves to 1.08 because most clusters include only one girl in this age group. Finally, if we randomly select one school-age child from each household, the design factor falls below 1.0.
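One way to read these figures is in terms of effective sample size, a standard survey-sampling concept that this chapter does not use explicitly. Writing the design factor as deft, a clustered sample of n persons carries roughly the information of a simple random sample of n divided by deft squared:

```latex
n_{\text{eff}} = \frac{n}{\mathrm{deft}^2},
\qquad
\frac{1}{1.33^2} = \frac{1}{1.7689} \approx 0.57,
\qquad
\frac{1}{1.08^2} = \frac{1}{1.1664} \approx 0.86
```

Under this reading, the full-population analysis of school attendance behaves like a simple random sample about 57 percent as large, while the analysis restricted to girls ages 10 to 14 behaves like one about 86 percent as large.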
In general, if studies are designed so that they rarely or never examine more than one individual per household, the effects of clustering can be altogether avoided. Especially when doing analyses of children, boarders, and other groups likely to appear multiple times in the same household, researchers should develop strategies to eliminate the redundant cases. Instead of assessing the characteristics of all children, for example, one can look at eldest children, or youngest children, or children of a particular age, or a randomly selected child from each household.
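The last strategy, keeping one randomly selected person per household, can be sketched as follows. The helper `one_per_household` is hypothetical, and the record layout is an assumption: each person is a dictionary whose `SERIAL` key identifies the household.

```python
import random
from collections import defaultdict

def one_per_household(persons, household_key="SERIAL", seed=0):
    """Keep one randomly chosen person from each household."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for person in persons:
        groups[person[household_key]].append(person)
    # Iterate in sorted household order so the result is reproducible.
    return [rng.choice(groups[key]) for key in sorted(groups)]

# Hypothetical extract already restricted to school-age children.
children = [
    {"SERIAL": 1, "AGE": 7},
    {"SERIAL": 1, "AGE": 10},
    {"SERIAL": 2, "AGE": 12},
    {"SERIAL": 3, "AGE": 6},
    {"SERIAL": 3, "AGE": 9},
    {"SERIAL": 3, "AGE": 14},
]
selected = one_per_household(children)
serials = [child["SERIAL"] for child in selected]
```

A child in a household with k eligible children is selected with probability 1/k, so children from large households are underrepresented in the result; depending on the analysis, it may be appropriate to weight each selected case by k.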
In practice, IPUMS users rarely take the trouble to estimate true standard errors, mainly because the methods for doing so are so cumbersome. Instead, researchers usually accept the significance statistics generated by their statistical packages. On the whole, the results presented here are reassuring: most of the analyses done by users of the IPUMS probably have design factors close to 1.0 or lower, which means that the estimates of the packages are not too far off. If users are aware of the dangers of clustering and design their studies to minimize it, they can safely use statistical procedures designed for simple random samples.
In addition, replicate weights (REPWT and REPWTP) are available for the 2005-onward samples of the ACS for researchers who want to estimate standard errors directly from the data. The Census Bureau supplied the replicate weights, which are generated using full sample design information that is not publicly available. Replicate weights are preferable to subsampling techniques used to estimate standard errors precisely because the full sample design is taken into account.
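A minimal sketch of a replicate-weight standard error follows. It assumes the formula given in the Census Bureau's ACS accuracy documentation, in which the squared deviations of the replicate estimates from the full-sample estimate are summed and multiplied by 4 divided by the number of replicates (80 in the ACS). The function names and the toy data are hypothetical, and only two replicates are used to keep the arithmetic visible.

```python
import math

def weighted_proportion(flags, weights):
    """Weighted share of cases with flag equal to 1."""
    return sum(f * w for f, w in zip(flags, weights)) / sum(weights)

def replicate_se(full_estimate, replicate_estimates, factor=4.0):
    """Standard error from replicate estimates.

    factor=4.0 is the multiplier assumed here for ACS replicate weights;
    other surveys use different multipliers.
    """
    r = len(replicate_estimates)
    ss = sum((est - full_estimate) ** 2 for est in replicate_estimates)
    return math.sqrt(factor / r * ss)

# Toy data: three persons, one full-sample weight and two replicate weights each.
flags = [1, 0, 1]
full_weights = [100, 200, 100]
replicate_weights = [
    [110, 190, 100],  # stands in for the first replicate weight
    [90, 210, 100],   # stands in for the second replicate weight
]

full = weighted_proportion(flags, full_weights)
reps = [weighted_proportion(flags, w) for w in replicate_weights]
se = replicate_se(full, reps)
```

In a real analysis the statistic would be recomputed once with the full-sample weight and once with each of the replicate weights supplied on the file, and the same function would be applied to the resulting list of estimates.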
Replicate weights represent a slightly altered version of the original sampling weights, and the variability between such altered versions is used to calculate standard errors. Using altered weights is sensible because the original sampling weights are not exactly correct. Due to nonresponse and incomplete information about the sampling frame, the weights cannot perfectly describe the number of people in the population represented by a particular sampled unit. By slightly altering the weights, one can assess the degree of uncertainty caused by imperfect specification of the original weights. For advice on using replicate weights with IPUMS-USA data, see
Researchers at the Minnesota Population Center are currently at work on an NIH-funded project to develop better user-friendly methods of dealing with the complex design of the IPUMS samples. For more information, see the project summary. We expect to release final project results in 2007.
ENDNOTES
1. See M.H. Hansen, W.N. Hurwitz, and W.G. Madow, Sample Survey Methods and Theory, 1953, and L. Kish, Survey Sampling, 1965.
2. "Household" and "group quarters" are the modern census terms. The census has historically employed a variety of terms and definitions for distinguishing housing units. Instead of households and group quarters, past censuses identified "quasi-households," "families," and "dwellings." For fuller discussion, see the preceding chapter, "Sample Designs." Definitional changes have modest consequences, and very close approximations of the modern concepts can be reconstructed for all census years since 1850.
3. For further discussion of this method for estimating design factors, see United States Bureau of the Census, Census of Population and Housing, 1990: Public Use Microdata Samples, Technical Documentation, 1993. For our estimates, we treated all variables as categorical variables, so our design factors are actually the weighted average of the factors for each category of each variable. We used the IPUMS general-code version of each variable, except that age was grouped in 10-year intervals and top-coded at 80, and for occupation, language, and birthplace we used only the first two digits of the IPUMS variable. The fifty subsamples were derived from the one hundred subsample replicates provided in the IPUMS. In all cases, selection of subsamples replicated the original sample selection procedures. For 1940 1% and 1950, we used the self-weighting versions of the samples to estimate design factors. Our estimates differ from the Census Bureau's for the 1970 and 1980 census years; ours are probably more reliable because the Census Bureau estimates were created before the samples were available and were derived from a simplified analytic model; see United States Bureau of the Census, Public Use Samples of Basic Records from the 1970 Census: Description and Technical Documentation, 1972, and Public Use Samples of Basic Records from the 1980 Census: Description and Technical Documentation, 1983.