Data Consistency Checking
Daniel C. Kallgren and David Beck Ryden
NOTE: This article was originally published in Historical Methods, 28:1, pp. 66-69 (Winter 1995). Reprinted with permission of The Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th Street, N.W., Washington, D.C. 20036-1802. Copyright 1995.
Data consistency checking was integral to the construction of the 1850, 1880, and 1920 datasets. Consistency checking was a procedure designed to ferret out both data-entry errors and apparent enumeration mistakes or inconsistencies. By systematically applying a computerized checking routine to each case entered, we hoped to keep the number of errors in the final Integrated Public Use Microdata Series (IPUMS) to a minimum.
Overview
The checking process began after each reel of the microfilmed census manuscripts was coded into machine-readable form. These raw data were submitted to a consistency program that scanned each individual case for possible errors and logical inconsistencies. For example, the program was designed to detect unlikely relationship descriptions such as a 2-year-old boy who was coded as "divorced." The software also made a number of simple changes to the data automatically. The output generated by the checking software included a list of these changes and a list of problem cases that the research staff then examined individually. If the problem was the consequence of data-entry error, the research staff made the appropriate correction. If the difficulty lay with the enumerator's procedure and if the research staff was able to determine the problem, the information was then corrected. If the problem could not be resolved, the staff left the case as originally enumerated. All cases modified from the original census schedules are distinguished in the final datasets by a "data quality flag" variable. This variable indicates which changes were made by the software and which were made manually by research staff.
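As an illustration only, a per-case check of the kind described above might look like the following Python sketch. The original checking software is not reproduced here; the `Case` record, the `check_case` function, the message wording, and the `MIN_MARRIAGE_AGE` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical threshold: ever-married statuses below this age are reported.
MIN_MARRIAGE_AGE = 12


@dataclass
class Case:
    age: int
    sex: str           # "M" or "F"
    marital: str       # "S", "M", "W" (widowed), or "D" (divorced)
    relationship: str  # e.g. "HEAD", "WIFE", "SON"


def check_case(case):
    """Return a list of messages describing apparent inconsistencies."""
    problems = []
    if case.marital in {"M", "W", "D"} and case.age < MIN_MARRIAGE_AGE:
        problems.append(
            f"Age {case.age} implausible for marital status {case.marital}"
        )
    if case.relationship == "WIFE" and case.sex != "F":
        problems.append("WIFE not female")
    return problems


# The 2-year-old "divorced" boy mentioned above would be listed for review.
toddler = Case(age=2, sex="M", marital="D", relationship="SON")
print(check_case(toddler))
```

A case that returns an empty list would pass through unchanged; any other case would appear on the problem list for the research staff to examine.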
The amount of work related to consistency checking depended on the quality of the microfilm and the enumeration. At times, the microfilm was difficult to read because of poor film quality. Manuscript pages were occasionally out of order, disrupting the sequence of dwelling and family numbers and, on rare occasions, dividing households. A more common problem, however, was the deterioration of the census manuscripts themselves. Some information was lost due to frayed edges and corners of the enumeration schedules. The difficulty in reading these manuscript facsimiles was at times further compounded by enumerator carelessness or ignorance of proper census-taking procedure.
The number of enumeration errors was directly correlated with the scope and complexity of the census. Of the three PUMS coded at the University of Minnesota, the 1850 manuscripts contained the fewest errors and inconsistencies since they lacked the complex "relationship to head of household" variable as well as the "marital status" variable. The 1880 census, on the other hand, was the most problematic because the Census Bureau failed to instruct the enumerators properly on methods of recording new variables. Some census takers identified census families and dwellings incorrectly. Some enumerators reversed first and last names and sometimes reversed the unemployment variable, providing the weeks employed instead of the weeks unemployed. In most cases, we were able to determine the problem and make the appropriate corrections.
Interrelationship Checking Procedures
We used a variety of computerized checks for geographical, occupational, and data-entry inconsistencies. However, the largest number of problems came from the family relationship variable recorded in the 1880 and 1920 censuses. Unfortunately, these inconsistencies were typically too complicated to be corrected automatically by machine. We therefore designed the checking software to identify each problematic case as well as produce a message describing the nature of the inconsistency. The research staff looked up each of these cases on the microfilm and determined family relationships based on clues recorded on the manuscript. If the family relationship could not be determined, the researchers made no changes to the data.
Although there was a degree of subjectivity in solving interfamily relationship inconsistencies, the research staff followed a set of well-defined rules before making changes to the data. Most decisions involved either inferring relationships or altering the division of dwellings into households. The basic objective in making changes to the relationship variable was to ensure that each individual was identified in relation to the household head, and the household head was listed first in any family.
In achieving this goal, we encountered a variety of interrelationship problems that were resolved either by inferring family relationship, merging multiple households into one, or splitting an enumerated family into two. The following discussion details each of these modifications to the data and provides illustrative examples of particular difficulties that we encountered.
Inferring Family Relationship
Inferring family relationships was necessary when the field was left blank, or when relationships conflicted with other enumerated information. Potential conflicts were identified when children had different surnames from the head, the age difference between children and parents was too small, or the sex and family relationship were incompatible (such as a female enumerated as an uncle). Inferences were based on all other available information such as sex, age, race, marital status, birthplace, parental birthplace, and position in the family.
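The three kinds of potential conflict listed above can be sketched as a single screening function. This is a minimal illustration under stated assumptions, not the original program: the dictionary keys, the relationship sets, and the `MIN_PARENT_AGE_GAP` value are hypothetical, and only the messages "Blank REL" and "Child has different surname from head" are taken from the program output shown in the figures.

```python
# Hypothetical relationship vocabularies and threshold for this sketch.
MALE_RELS = {"SON", "FATHER", "UNCLE", "BROTHER"}
FEMALE_RELS = {"WIFE", "DAUGHTER", "MOTHER", "AUNT", "SISTER"}
CHILD_RELS = {"SON", "DAUGHTER"}
MIN_PARENT_AGE_GAP = 15


def relationship_conflicts(head, person):
    """Return messages for potential conflicts between a person and the head."""
    rel = person["rel"]
    if not rel:
        return ["Blank REL"]
    messages = []
    if rel in CHILD_RELS:
        if person["surname"] != head["surname"]:
            messages.append("Child has different surname from head")
        if head["age"] - person["age"] < MIN_PARENT_AGE_GAP:
            messages.append("Age difference between child and head too small")
    wrong_sex = (rel in MALE_RELS and person["sex"] != "M") or (
        rel in FEMALE_RELS and person["sex"] != "F"
    )
    if wrong_sex:
        messages.append("Sex incompatible with relationship")
    return messages


head = {"surname": "ALLEN", "sex": "M", "age": 59, "rel": "HEAD"}
annie = {"surname": "PIPPER", "sex": "F", "age": 8, "rel": "DAUGHTER"}
print(relationship_conflicts(head, annie))
```

The function only identifies cases for review. Resolving them still depended on the staff weighing the other enumerated information.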
In Figure 1, for example, George E. Morton, person 3, appears to be the son of George B. Morton because of surname, parental birthplaces, and age. After making this inference, we changed Mary A. to daughter-in-law, and the children to grandchildren so that they were properly related to the head of household.
Sometimes a child had a different surname from the head of the household because his or her mother had remarried. If this looked likely, then the children could actually be stepchildren, and we entered stepson or stepdaughter where appropriate. In about 10 percent of cases, a different surname occurred because the operator had failed to correctly indicate a reversal of the first and last names in the comment field. Even more common were slight spelling differences between the head and child. We standardized the spellings of surnames by adopting the head's spelling, unless it looked implausible. It was important to correct these two errors, because surname similarity was used for parent-child linking (see Ruggles, pp. xx-xx).
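The surname standardization described above was a judgment made by the research staff. Purely as an illustration, a similar rule could be approximated with a string-similarity score from Python's standard `difflib` module. This is a substitute technique, not the method used for the IPUMS; the function name and the `SIMILARITY_THRESHOLD` value are assumptions, and the surname pairs in the example are invented for demonstration.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: at or above this similarity, treat the difference
# as a slight spelling variation rather than a different surname.
SIMILARITY_THRESHOLD = 0.8


def standardize_surname(head_surname, child_surname):
    """Adopt the head's spelling when the two surnames differ only slightly."""
    if child_surname == head_surname:
        return child_surname
    similarity = SequenceMatcher(None, head_surname, child_surname).ratio()
    if similarity >= SIMILARITY_THRESHOLD:
        return head_surname
    return child_surname


print(standardize_surname("HAMMEL", "HAMEL"))   # slight variation
print(standardize_surname("ALLEN", "PIPPER"))   # genuinely different
```

A score-based rule cannot judge whether the head's spelling "looked implausible", so that exception would still need a human decision.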
Twenty percent of the time a surname difference occurred because the child was not really a child of the head but rather the child of a servant, boarder, or other relative. If the immediately preceding adult shared the surname of the child and was of a reasonable age to be a parent and there were no parental birthplace conflicts, we thought it safe to assume that he or she was the true parent and altered the child's relationship codes accordingly. For instance, in Figure 2 we decided based on surname, age, occupation, and mother's birthplace that Annie Pipper, person 5, was the daughter of Lucie. We therefore changed the relationship from daughter to servant's daughter.
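The preceding-adult rule can be sketched as follows, using the household from Figure 2. The field names, the age thresholds, the birthplace alias table, and the function name are assumptions for this example. "Immediately preceding adult" is interpreted here as the nearest adult listed before the child, which is one possible reading of the rule.

```python
# Hypothetical thresholds and abbreviation table for this sketch.
ADULT_AGE = 15
MIN_GAP, MAX_GAP = 15, 50
ALIASES = {"KY": "KENTUCKY"}


def norm(place):
    """Expand a known birthplace abbreviation."""
    return ALIASES.get(place, place)


def likely_parent(household, child_index):
    """Return the nearest preceding adult if he or she plausibly is the parent."""
    child = household[child_index]
    for candidate in reversed(household[:child_index]):
        if candidate["age"] < ADULT_AGE:
            continue  # skip other children listed in between
        gap = candidate["age"] - child["age"]
        # Compare the child's mother's or father's birthplace, as appropriate.
        field = "mbpl" if candidate["sex"] == "F" else "pbpl"
        if (
            candidate["surname"] == child["surname"]
            and MIN_GAP <= gap <= MAX_GAP
            and norm(child[field]) == norm(candidate["bpl"])
        ):
            return candidate
        return None  # only the nearest preceding adult is considered
    return None


def person(surname, sex, age, rel, bpl, pbpl, mbpl):
    return dict(surname=surname, sex=sex, age=age, rel=rel,
                bpl=bpl, pbpl=pbpl, mbpl=mbpl)


figure2 = [
    person("ALLEN", "M", 59, "HEAD", "KENTUCKY", "VIRGINIA", "VIRGINIA"),
    person("ALLEN", "F", 28, "WIFE", "KENTUCKY", "KENTUCKY", "KY"),
    person("ALLEN", "F", 6, "DAUGHTER", "KENTUCKY", "KY", "KY"),
    person("PIPPER", "F", 35, "SERVANT", "KENTUCKY", "KY", "KY"),
    person("PIPPER", "F", 8, "DAUGHTER", "KENTUCKY", "KY", "KY"),
]

parent = likely_parent(figure2, 4)
if parent is not None and parent["rel"] != "HEAD":
    figure2[4]["rel"] = f"{parent['rel']}'S {figure2[4]['rel']}"
print(figure2[4]["rel"])
```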
Merging Two Enumerated Households into One
When it came to identifying separate households, some enumerators were overzealous and split every dwelling into multiple families. To resolve this problem, we adopted the policy of letting the relationship field take priority over the family numbers. When the first person in the second family had a valid relationship to the head of the first family within the same dwelling, the second family number was replaced with that of the prior family. This change is indicated by the data quality flag QHHNUM on the person record.
In general, enumerators were instructed to treat servants and boarders as part of the family with whom they resided, even if they took their meals separately. However, we treated as separate units families of boarders who had their own family number. In the case of servants with their own family number, we checked the occupation of the dwelling head to see if it seemed plausible that they were servants of the head; if so, we merged them with the head's family.
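The merging policy for relatives can be illustrated with a short sketch. It assumes a dwelling is a list of person records in enumeration order, treats only a small hypothetical set of kin terms as a "valid relationship to the head", and leaves boarders and servants for manual review. The dictionary keys and the sample dwelling are invented for the example; only the flag name QHHNUM comes from the text.

```python
# Hypothetical set of kin terms counted as a valid relationship to the head.
KIN_RELS = {"WIFE", "SON", "DAUGHTER", "FATHER", "MOTHER", "BROTHER", "SISTER"}


def merge_families(dwelling):
    """Renumber families in place; return how many persons were changed."""
    changed = 0
    current_fam = None  # family number as enumerated for the current group
    target_fam = None   # family number the current group should carry
    for p in dwelling:
        if p["famno"] != current_fam:
            first_in_dwelling = current_fam is None
            current_fam = p["famno"]
            # A kin term on the first person of a later family means the
            # relationship field takes priority over the family number.
            if first_in_dwelling or p["rel"] not in KIN_RELS:
                target_fam = current_fam
        if p["famno"] != target_fam:
            p["famno"] = target_fam
            p["qhhnum"] = True  # data quality flag on the person record
            changed += 1
    return changed


dwelling = [
    {"famno": 10, "rel": "HEAD"},
    {"famno": 10, "rel": "WIFE"},
    {"famno": 11, "rel": "SON"},
    {"famno": 11, "rel": "DAUGHTER"},
    {"famno": 12, "rel": "HEAD"},
    {"famno": 12, "rel": "WIFE"},
]
print(merge_families(dwelling), [p["famno"] for p in dwelling])
```

In a fuller version, the dwelling-level "number of families" and family-size variables would also have to be recomputed after renumbering.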
Figure 3 presents a case where we merged two families. Vance Hammel was clearly the son of Crawford but is listed as the first person in the second family of the dwelling. We assumed that the family relationships were correct, which meant that the family numbers were wrong. To fix the problem, we changed Vance's family number to 123, changed the "number of families" variable to 1, and changed the family size to 6.
Splitting an Enumerated Family into Two
When two apparently unrelated families lived in the same dwelling but were given the same family number, we divided the unit into two. This problem occurred when an enumerator simply forgot to change the family number within a multifamily dwelling. Before making the decision to change a single-family dwelling into a multiple dwelling, we looked closely at a number of related questions listed on the census schedule. We checked the relevant surnames, birthplaces, parental birthplaces, ages, and sometimes occupation to ensure that the secondary related group had no clear relationship to the primary family. We kept in mind the fact that many relatives, such as in-laws and married daughters, may have had different surnames. If we were persuaded that a secondary related group was unrelated to the head, we changed the group into a separate family by assigning a new family number.
At first glance, Figure 4 looks like a classic secondary family. William and Julia Ferguson, persons 9 and 10, appear to be just an elderly married couple residing with the widow Vincent. But notice that the VA/AL parental birthplaces of the head exactly match the birthplaces of the Fergusons. This seems unlikely to be a coincidence. Thus we changed William to father and Julia to mother.
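The birthplace comparison applied to Figure 4 can be written as a simple test. This is only a sketch of that one clue; the function name and dictionary keys are hypothetical, and in practice the staff weighed it together with surnames, ages, and other information.

```python
def birthplaces_suggest_parents(head, man, woman):
    """True if the head's parental birthplaces match the couple's birthplaces."""
    return head["pbpl"] == man["bpl"] and head["mbpl"] == woman["bpl"]


# Figure 4: the head's father was born in VA and her mother in AL.
caterina = {"bpl": "AL", "pbpl": "VA", "mbpl": "AL"}
william = {"bpl": "VA"}
julia = {"bpl": "AL"}
print(birthplaces_suggest_parents(caterina, william, julia))

# Figure 5: the head's parents were both born in NY.
youmans = {"bpl": "NY", "pbpl": "NY", "mbpl": "NY"}
wallace = {"bpl": "OH"}
mary_j = {"bpl": "SCOTLAND"}
print(birthplaces_suggest_parents(youmans, wallace, mary_j))
```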
On the other hand, the Youmanses and the Wallaces in Figure 5 show no sign of being related. O.V. Wallace also has no relationship to the head listed, as was often the case for heads of households. We therefore assigned O.V. the relationship of "head" and assigned a new family number to each of the Wallaces.
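Once the staff had decided that a secondary group was unrelated, the change itself was mechanical. The sketch below applies it to the dwelling in Figure 5. The function name and record layout are assumptions, and the new family number 9001 is a placeholder, since the text does not say how new numbers were chosen.

```python
def split_family(dwelling, start, new_famno):
    """Move the group beginning at index `start` into a new family, in place."""
    old_famno = dwelling[start]["famno"]
    if not dwelling[start]["rel"]:
        dwelling[start]["rel"] = "HEAD"  # blank relationship becomes head
    for p in dwelling[start:]:
        if p["famno"] != old_famno:
            break  # reached the next enumerated family
        p["famno"] = new_famno
    return dwelling


figure5 = [
    {"famno": 315, "surname": "YOUMANS", "rel": "HEAD"},
    {"famno": 315, "surname": "YOUMANS", "rel": "WIFE"},
    {"famno": 315, "surname": "YOUMANS", "rel": "SON"},
    {"famno": 315, "surname": "WALLACE", "rel": ""},
    {"famno": 315, "surname": "WALLACE", "rel": "WIFE"},
    {"famno": 315, "surname": "WALLACE", "rel": "SON"},
]

split_family(figure5, start=3, new_famno=9001)  # 9001 is a placeholder
print([(p["famno"], p["rel"]) for p in figure5])
```

After the split, the relationships of Mary J. and Frank Wallace refer to O.V. Wallace as head rather than to S.G. Youmans.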
Conclusion
Early in the planning stages for the IPUMS, we decided not to preserve obviously erroneous information in the data. We thought the samples would be most useful to researchers if such enumerator errors were corrected. Although the changes made automatically by the computer and the changes made by the research staff followed a set of clearly defined rules, a certain amount of subjectivity was involved in all these changes. By using the data quality flags, however, researchers can choose between data that have been altered and those that have not.
Figure 1
Relationship of Person 3 Changed to "Son" and Subsequent Relationships Changed to Match
Dwelling | [6 2961 30] | Rule: 1 | DWSIZE: 7 | NBRFAMS: 1 | SEQFAM: 1 | FAMSIZE: 7 | NBRTAKEN: 7 |
( 3) Probably related to head
( 3) Operator Comment: REL ERR; COHEAD OR SON
( 4) WIFE not in second position
( 4) Operator Comment: REL ERR
( 5) Operator Comment: REL ERR
( 6) Operator Comment: REL ERR
( 7) Operator Comment: REL ERR
FAMNO | Last Name | First Name | R | S | AGE | Relationship | M | W | Occupation | BPL | PBPL | MBPL |
( 1) 44 | MORTON | GEORGE B | W | M | 71 | HEAD | M | | FARMER | MASS | MASS | MASS |
( 2) 44 | MORTON | MARY A | W | F | 69 | WIFE | M | | | MAINE | MAINE | R.I. |
( 3) 44 | MORTON | GEORGE E | W | M | 43 | | M | | FARNER | MASS | MASS | MAINE |
( 4) 44 | MORTON | MARY A | W | F | 37 | WIFE | M | | KEEPING HOUSE | MASS | MASS | VERMONT |
( 5) 44 | MORTON | CLARENCE E | W | M | 14 | SON | S | | ATTENDING SCHOOL | ILLINOIS | MASS | MASS |
( 6) 44 | MORTON | LESTER R | W | M | 11 | SON | S | | ATTENDING SCHOOL | ILLINOIS | MASS | MASS |
( 7) 44 | MORTON | HATTIE W | W | F | 4 | DAUGHTER | S | | | ILLINOIS | MASS | MASS |
Figure 2
Relationship of Person 5 Changed to "Servant's Daughter"
Dwelling | [176 241 20] | Rule: 1 | DWSIZE: 4 | NBRFAMS: 1 | SEQFAM: 1 | FAMSIZE: 4 | NBRTAKEN: 4 |
( 4) Child has different surname from head
FAMNO | Last Name | First Name | R | S | AGE | Relationship | M | W | Occupation | BPL | PBPL | MBPL |
( 1) 92 | ALLEN | ALBERT A | W | M | 59 | HEAD | M | | FARMER | KENTUCKY | VIRGINIA | VIRGINIA |
( 2) 92 | ALLEN | SARAH | W | F | 28 | WIFE | M | | KEEPING HOUSE | KENTUCKY | KENTUCKY | KY |
( 3) 92 | ALLEN | BETTY | W | F | 6 | DAUGHTER | S | | | KENTUCKY | KY | KY |
( 4) 92 | PIPPER | LUCIE | B | F | 35 | SERVANT | S | | SERVANT | KENTUCKY | KY | KY |
( 5) 92 | PIPPER | ANNIE | B | F | 8 | DAUGHTER | S | | | KENTUCKY | KY | KY |
Figure 3
Two Families Merged Into One
Dwelling | [ 76 2892 29] | Rule: 1 | DWSIZE: 8 | NBRFAMS: 2 | SEQFAM: 1 | FAMSIZE: 4 | NBRTAKEN: 8 |
( 5) First position not HEAD
( 7) Child has different surname from head
( 8) Child has different surname from head
FAMNO | Last Name | First Name | R | S | AGE | Relationship | M | W | Occupation | BPL | PBPL | MBPL |
( 1) 123 | HAMMEL | CRAWFORD | W | M | 69 | HEAD | M | | LABOURER | PENNSYLVANIA | PA | PA |
( 2) 123 | HAMMEL | PRUDENCE | W | F | 61 | WIFE | M | | KEEPING HOUSE | PENNSYLVANIA | PA | PA |
( 3) 123 | HAMMEL | NANCY | W | F | 27 | DAUGHTER | D | | | PENNSYLVANIA | PA | PA |
( 4) 123 | HAMMEL | JOSEPHINE | W | F | 19 | DAUGHTER | S | | LIVING OUT | PENNSYLVANIA | PA | PA |
( 5) 124 | HAMMEL | VANCE | W | M | 11 | SON | S | | | PENNSYLVANIA | PA | PA |
( 6) 124 | HAMMEL | CRAWFORD | W | M | 6 | SON | S | | | PENNSYLVANIA | PA | PA |
( 7) 124 | GRUB | MARGARET | W | F | 4 | DAUGHTER | S | | | PENNSYLVANIA | PA | PA |
( 8) 124 | GRUB | JOHN | W | M | 1 | SON | S | | | PENNSYLVANIA | PA | PA |
Figure 4
Relationships of Persons 9 and 10 Changed to "Father" and "Mother"
Dwelling | [311 5272 37] | Rule: 1 | DWSIZE: 10 | NBRFAMS: 1 | SEQFAM: 1 | FAMSIZE: 10 | NBRTAKEN: 10 |
( 9) Blank REL
(10) WIFE not in second position
FAMNO | Last Name | First Name | R | S | AGE | Relationship | M | W | Occupation | BPL | PBPL | MBPL |
( 1) 637 | VINCENT | CATERINA O | W | F | 46 | HEAD | W | W | KEEPING HOUSE | AL | VA | AL |
( 2) 637 | VINCENT | BENJAMIN | W | M | 21 | SON | S | | BOOKKEEPER | AL | AL | AL |
( 3) 637 | VINCENT | LOUISA O | W | F | 19 | DAU | S | | AT HOME | AL | AL | AL |
( 4) 637 | VINCENT | JOHN K | W | M | 16 | SON | S | | AT SCHOOL | AL | AL | AL |
( 5) 637 | VINCENT | ALEXINA T | W | F | 13 | DAU | S | | AT SCHOOL | AL | AL | AL |
( 6) 637 | VINCENT | ANNA M | W | F | 11 | DAU | S | | AT SCHOOL | AL | AL | AL |
( 7) 637 | VINCENT | CHARLES E | W | M | 9 | SON | S | | AT SCHOOL | AL | AL | AL |
( 8) 637 | VINCENT | FANNIE D | W | F | 7 | DAU | S | | AT SCHOOL | AL | AL | AL |
( 9) 637 | FERGUSON | WILLIAM | W | M | 78 | % | M | | RETIRED | VA | VA | VA |
(10) 637 | FERGUSON | JULIA | W | F | 74 | WIFE | M | | AT HOME | AL | AL | AL |
Figure 5
Relationship of Person 4 Changed to "Head"
Dwelling | [ 28 242 45] | Rule: 1 | DWSIZE: 6 | NBRFAMS: 1 | SEQFAM: 1 | FAMSIZE: 6 | NBRTAKEN: 6 |
( 4) Blank REL
( 5) WIFE not in second position
( 6) Child has different surname from head
FAMNO | Last Name | First Name | R | S | AGE | Relationship | M | W | Occupation | BPL | PBPL | MBPL |
( 1) 315 | YOUMANS | S G | W | M | 58 | HEAD | M | | FARMER | NY | NY | NY |
( 2) 315 | YOUMANS | RUTH | W | F | 61 | WIFE | M | | KEEPING HOUSE | NY | NY | NY |
( 3) 315 | YOUMANS | SAMUEL | W | M | 17 | SON | S | | AT SCHOOL | CA | NY | NY |
( 4) 315 | WALLACE | O V | W | M | 44 | | M | | CARPENTER | OH | PA | PA |
( 5) 315 | WALLACE | MARY J | W | F | 40 | WIFE | M | | KEEPING HOUSE | SCOTLAND | SCOTLAND | SCOTLAND |
( 6) 315 | WALLACE | FRANK | W | M | 19 | SON | S | | FARMER | CO | OH | SCOTLAND |