Introduction to data editing and allocation

Most variables in the IPUMS have been edited for missing, illegible and inconsistent values. The paragraphs below describe the basic philosophy and procedures that the Census Bureau and IPUMS researchers follow when editing and allocating data values.  More information about how editing and allocation were carried out for each variable is available in the detailed editing and allocation procedures page.

The enumeration forms that make up the source data of the IPUMS have a variety of flaws. In some cases, enumerators or respondents neglected to answer a particular inquiry. In others, the item was completed but is illegible. In the older censuses, poor microfilm or deterioration of the paper occasionally makes it impossible to read a response. In other cases, a legible response is present but it is clearly incorrect, since it contradicts multiple other items in the unit. For example, if a woman is married, her household relationship is "spouse," and her age is under 12, the age is considered to be inconsistent with the other values.

The samples from 1940 onward incorporated computer edits at the time of their creation. These edits were conducted by the Census Bureau and are reported in the Data Quality Flag variables (see below). IPUMS does not perform additional edits on these samples.

For the samples from 1850-1930, the IPUMS imposes edits for nearly all substantive variables. These edits are generally comparable to those performed by the Census Bureau in the later samples. These edits are also reported in the Data Quality Flags.

Computer editing of census data falls into three basic categories:

  1. Logical Edits. (data quality codes 1-3) Logical edits infer missing or inconsistent values from other information available in the person record or elsewhere in the household, on the basis of a set of rules. For example, the marital status "never married" is inferred for persons under age twelve with missing information, and a race of "white" is inferred for persons with white parents present in the household. Logical edits are usually quite reliable.
  2. Hot Deck Allocation. (data quality codes 4-6) In cases where logical edits are impossible, hot deck techniques are used. This involves searching the data file for a "donor" record that shares key characteristics with the missing, illegible or inconsistent case. For example, in the 1950 sample, if sex was missing or illegible and could not be logically inferred from other characteristics, the value allocated was the sex of the most proximate person record that had the same household type and size, race, age, marital status, and relation to household head.
  3. Cold Deck Allocation. (data quality codes 7-8) Occasionally, when a hot deck allocation failed to find a suitable donor record, cold deck allocation was used. This method involves randomly assigning a value from a pre-determined distribution, or assigning a modal value.
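
The hot deck search and its cold deck fallback can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used by the Census Bureau or IPUMS: the function name, the field names, the outward search in both directions from the recipient, and the single fixed cold deck value are all assumptions made for the example.

```python
from typing import Any, Dict, Sequence, Tuple

MISSING = None  # stand-in for a missing, illegible, or inconsistent value


def hot_deck_allocate(
    records: Sequence[Dict[str, Any]],
    index: int,
    target: str,
    keys: Sequence[str],
    cold_deck_value: Any = None,
) -> Tuple[Any, str]:
    """Return (value, method) for records[index][target].

    Searches outward from `index` for the nearest record in file order
    that has a usable `target` value and matches the recipient on every
    field in `keys`. Falls back to `cold_deck_value` if no donor exists.
    """
    recipient = records[index]
    if recipient[target] is not MISSING:
        return recipient[target], "unaltered"

    for distance in range(1, len(records)):
        # At each distance, check the preceding record, then the following one.
        for j in (index - distance, index + distance):
            if 0 <= j < len(records):
                donor = records[j]
                if donor[target] is not MISSING and all(
                    donor[k] == recipient[k] for k in keys
                ):
                    return donor[target], "hot deck"

    # No suitable donor anywhere in the file (hypothetical fallback value).
    return cold_deck_value, "cold deck"


# Hypothetical person records; the fourth one is missing "sex".
people = [
    {"age_group": "adult", "relate": "head", "sex": "male"},
    {"age_group": "adult", "relate": "spouse", "sex": "female"},
    {"age_group": "child", "relate": "child", "sex": "female"},
    {"age_group": "adult", "relate": "spouse", "sex": None},
    {"age_group": "adult", "relate": "head", "sex": "male"},
]
result = hot_deck_allocate(
    people, 3, "sex", ["age_group", "relate"], cold_deck_value="female"
)
```

Because the search starts at the recipient and moves outward, a donor from the same neighborhood of the file is preferred over a more distant one with the same characteristics.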

In general, the allocation procedures provide better estimates than would be obtained by simply excluding the cases with missing values. Such exclusion implicitly assumes that the missing cases are representative of the population as a whole, an assumption equivalent to cold deck allocation. Hot deck allocation makes the less extreme assumption that missing cases are representative of the population that shares key predictor characteristics. Moreover, because hot deck allocation assigns the value of the most proximate suitable donor and the census files are organized geographically, the results reflect local and regional variations in characteristics.
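
The difference between the two assumptions can be seen in a small worked example. The data below are invented for illustration, and filling each missing case with the mean of observed cases in its own region is a simplified stand-in for choosing a similar donor, not hot deck allocation itself.

```python
from statistics import mean

# Hypothetical toy data: (region, value); None marks a missing value.
records = [
    ("north", 10), ("north", 12), ("north", 14), ("north", None),
    ("south", 30), ("south", None), ("south", None), ("south", None),
]

observed = [v for _, v in records if v is not None]

# Exclusion: drop missing cases, which treats them as resembling the
# observed population as a whole.
exclusion_estimate = mean(observed)

# Conditional fill: each missing case takes the mean of the observed
# cases in its own region.
region_means = {
    region: mean(v for r, v in records if r == region and v is not None)
    for region in {r for r, _ in records}
}
filled = [v if v is not None else region_means[r] for r, v in records]
conditional_estimate = mean(filled)

print(exclusion_estimate, conditional_estimate)
```

In this toy file most missing cases are in the south, where values are higher, so dropping them understates the overall mean (16.5 against 21 after the conditional fill).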

There are, however, certain circumstances in which allocated data may produce undesirable bias. For example, to allocate income in the 1990 sample, age was one of the key characteristics used to locate an appropriate donor. However, age was categorized, and the top category was 65 years old or older. If one is interested in the income of persons aged 80 or older, the allocated incomes are likely to be biased upwards, since persons aged 65 to 79 are more likely to have significant earnings than persons aged 80 or older.

If the allocation rate for a group is large and a bias in the allocated values is evident, it may be desirable to exclude allocated data from the analysis. It should be apparent from this illustration that knowledge of the specific allocation procedures used is valuable for detailed subject analyses. Unfortunately, the allocation procedures vary from one census year to the next, as does the amount of descriptive information. A separate page provides detailed information on allocation procedures for each variable in each year. All relevant information from the original sample codebooks and Census Bureau procedural histories is included. Information on 1980 and 1990 procedures was never published, so the description of allocation procedures for those years is thin. Users with queries on allocation of particular items may contact the Population Division of the Census Bureau for 1980 and 1990, or the Social History Research Laboratory, University of Minnesota for 1850-1920.
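
A minimal sketch of checking the allocation rate for a group and then excluding allocated cases is shown below. The field names (`income`, `income_flag`) and the records are hypothetical, and the set of codes that count as allocations must be chosen from the code list for the census year in question; the example assumes a scheme in which code 4 marks an allocated value.

```python
from typing import Any, Dict, List, Sequence, Set


def allocation_rate(
    records: Sequence[Dict[str, Any]], flag: str, allocated_codes: Set[int]
) -> float:
    """Share of records whose quality flag is one of `allocated_codes`."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r[flag] in allocated_codes)
    return hits / len(records)


def drop_allocated(
    records: Sequence[Dict[str, Any]], flag: str, allocated_codes: Set[int]
) -> List[Dict[str, Any]]:
    """Keep only records whose quality flag is not an allocation code."""
    return [r for r in records if r[flag] not in allocated_codes]


# Hypothetical records for persons aged 80 or older.
sample = [
    {"age": 82, "income": 9000, "income_flag": 0},
    {"age": 85, "income": 31000, "income_flag": 4},
    {"age": 81, "income": 7000, "income_flag": 0},
    {"age": 90, "income": 28000, "income_flag": 4},
]

rate = allocation_rate(sample, "income_flag", {4})
kept = drop_allocated(sample, "income_flag", {4})
```

Passing the allocation codes explicitly, rather than hard-coding a threshold, keeps the sketch usable across years whose flag codes differ.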

The IPUMS provides data quality flags for all computer edits. The name of each data quality flag that applies to a given household or person variable is given as part of its variable description in the data dictionary. Many data quality flags apply to more than one variable. Conversely, multiple flags can apply to the same variable in a given sample, depending on how the original sample handled the data quality flags. The availability of each flag is given in the table IPUMS Data Quality Flags. More information about how editing and allocation were carried out for each variable is available in the detailed editing and allocation procedures page.

The data quality flag codes vary slightly by census year. In all years, a code of "0" means the case was unaltered. The standard set of codes and their meanings are as follows:

0 - Unaltered case
2 - Logical hand edit by Census Office or by census sample research staff
3 - Logical computer edit by IPUMS
4 - Hot deck allocation by IPUMS

1940, 1950
0 - Unaltered case
1 - Logical edit, original value inconsistent
2 - Logical edit, original value illegible
3 - Logical edit, original value missing
4 - Hot deck allocation, original value inconsistent
5 - Hot deck allocation, original value illegible
6 - Hot deck allocation, original value missing
7 - Cold deck allocation, original value illegible
8 - Cold deck allocation, original value missing or inconsistent

0 - Unaltered case
4 - Allocated (method not specified)

0 - Unaltered case
3 - Logical edit (select housing variables only)
4 - Allocated (method not specified)
5 - Hot deck allocation (select housing variables only)
9 - Edited, method unknown (PUMS specifies both direct and indirect methods)

0 - Unaltered case
1 - Allocated, original entry fell outside acceptable range (select variables only)
3 - Logical edit
4 - Hot deck allocation
5 - Cold deck allocation (select variables)

0 - Unaltered case
4 - Allocated (method not specified)
5 - Allocated from "unanswered" (select variables)
6 - Allocated from "unknown" (select variables)

2000, ACS, PRCS
0 - Unaltered case
4 - Allocated (method not specified)
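
As an illustration of working with these codes, the 1940 and 1950 list above can be turned into a small lookup table. The names `FLAG_CODES_1940_1950` and `describe_flag` are invented for this sketch; the codes and meanings are copied from that list, and other years would need their own tables.

```python
from typing import Dict, Optional, Tuple

# code -> (editing method, problem with the original value)
FLAG_CODES_1940_1950: Dict[int, Tuple[str, Optional[str]]] = {
    0: ("unaltered case", None),
    1: ("logical edit", "inconsistent"),
    2: ("logical edit", "illegible"),
    3: ("logical edit", "missing"),
    4: ("hot deck allocation", "inconsistent"),
    5: ("hot deck allocation", "illegible"),
    6: ("hot deck allocation", "missing"),
    7: ("cold deck allocation", "illegible"),
    8: ("cold deck allocation", "missing or inconsistent"),
}


def describe_flag(code: int) -> str:
    """Return a readable description of a 1940/1950 data quality code."""
    if code not in FLAG_CODES_1940_1950:
        return f"unrecognized code {code}"
    method, problem = FLAG_CODES_1940_1950[code]
    if problem is None:
        return method
    return f"{method} (original value {problem})"


print(describe_flag(5))
```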