IPUMS Linked Representative Samples, 1850-1930
Final Data Release (June 2010)
The Minnesota Population Center is pleased to announce the final release of the IPUMS Linked Representative Samples. The new database links records from the 1880 complete-count database to 1% samples of the 1850 to 1930 U.S. censuses of the population. We have produced data samples for 7 pairs of years: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each of these contains three independent linked samples: one of men, one of women, and one of married couples.
This final release of the database supersedes the preliminary version that was made available beginning in August 2008. The final release differs from the preliminary release in several important ways that are described in more detail below. The most significant improvements were methodological: namely, we used better "training" data and we used a "name commonness" measure in determining linkage confidence. As a result of these methodological improvements, the final release contains many cases that were not present in the preliminary release (some cases were also removed). Overall, the samples have about a sixty- to seventy-percent overlap between the preliminary and final releases.
The final version of the datasets also used better weighting procedures than the preliminary release, resulting in numerous changes to the weight value attached to each case. And the 1880 observations in the final data include many more variables than they did in the preliminary release, including literacy, school attendance, physical infirmities, months unemployed, month of birth, and whether or not the respondent was married in the previous year.
See this FAQ for the latest information on the Linked Representative Samples.
All files contain weights that allow users to generate representative population estimates. The universes for the male and couple files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couple files represent all married couples who were in the U.S. and married to one another in both census years. The universe for the linked female files is more complicated; these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage (or remarriage) between the two census years. For this reason, the female files only link women who were single at both time periods, were married to the same person in both censuses, or who transitioned from married to widowed or divorced between time periods. The female files do not link women who entered into marriage between censuses.
The linking strategy relied on five variables that theoretically should not change over time: birth year, place of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and place of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. A strategic decision in the project was to ignore information from other co-resident individuals because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of people whose place of residence did not change (non-movers).
Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. All data comes from IPUMS-USA (in contrast to the preliminary release which used the version of the 1880 complete-count database from the North Atlantic Population Project). Variable descriptions are available via the IPUMS variable availability page.
All linked samples consist of independently linked records and their complete households in the two census years. The primary link has a value of 0 for LINKTYPE. We subsequently link common household members in households containing a primary link (indicated by LINKTYPE = 5). Table 1 gives a few examples from the 1870-1880 male linked sample.
All linked samples contain variables from a sample year and from 1880. In each sample, variables ending with "_1" contain data from the earlier year and variables ending with "_2" contain data from the later year. For instance, for the pre-1880 linked samples (i.e., 1850, 1860 and 1870), the sample year variables end with "_1" and the 1880 variables end with "_2". For the post-1880 linked samples, the 1880 variables end with "_1" and the sample year variables end with "_2".
We have structured each file such that the pernum for the sample year is numbered sequentially upward from 1 for each household. The pernum for the sample year always contains a value equal to or greater than 1 even if the household member was present in 1880 but not in the sample year. However, the pernum for 1880 will have a value of 0 if a household member was present in the sample year but not in 1880. The easiest way to view households in the linked files is to sort in ascending order by the sample year serial, HHSEQ (more on HHSEQ below), and the sample year pernum.
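The viewing order and the pernum conventions described above can be sketched in a few lines of Python. This is a minimal illustration only: the column spellings SERIAL_1, PERNUM_1, and PERNUM_2 and both function names are assumptions, and the rows are made up for a pre-1880 file, where "_1" is the sample year and "_2" is 1880.

```python
# Illustrative rows from a pre-1880 linked file ("_1" = sample year, "_2" = 1880).
# Column spellings are assumed, not confirmed variable names.
rows = [
    {"SERIAL_1": 6801, "HHSEQ": 1, "PERNUM_1": 3, "PERNUM_2": 2, "LINKTYPE": 0},
    {"SERIAL_1": 4301, "HHSEQ": 1, "PERNUM_1": 4, "PERNUM_2": 3, "LINKTYPE": 5},
    {"SERIAL_1": 6801, "HHSEQ": 1, "PERNUM_1": 1, "PERNUM_2": 0, "LINKTYPE": 9},
    {"SERIAL_1": 6801, "HHSEQ": 1, "PERNUM_1": 2, "PERNUM_2": 1, "LINKTYPE": 5},
]


def sort_for_viewing(records, sample_suffix="_1"):
    """Sort by sample-year serial, HHSEQ, and sample-year pernum."""
    return sorted(
        records,
        key=lambda r: (r["SERIAL" + sample_suffix], r["HHSEQ"], r["PERNUM" + sample_suffix]),
    )


def present_in_1880(record, suffix_1880="_2"):
    """A 1880 pernum of 0 marks a person who was absent from the 1880 household."""
    return record["PERNUM" + suffix_1880] > 0


ordered = sort_for_viewing(rows)
for r in ordered:
    print(r["SERIAL_1"], r["HHSEQ"], r["PERNUM_1"], present_in_1880(r))
```

For a post-1880 file the suffixes swap, so the call would be `sort_for_viewing(rows, sample_suffix="_2")` with `suffix_1880="_1"`.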
In the first example in Table 1, the primary link (indicated by LINKTYPE = 0) is the third record in the 1870 household, James H A, who was 10 years old in the 1870 census. Based on consistent race, birthplace, name and age information, we also establish household links for all other members of the 1870 household with the exception of the head of household, who does not appear to have been present in the 1880 household (and this is indicated by LINKTYPE = 9). In addition, the 1880 household has one member, Mary Gillan, a 40-year-old domestic servant, who was not present in 1870.
|0||6801||1||2||5||3177716||1||ASHBY||MARY J||ASHBY||MARY J.||2||1||30||40|
|1131.7||6801||1||3||0||3177716||2||ASHBY||JAMES H A||ASHBY||JAMES H.||3||3||10||21|
|0||6801||1||4||5||3177716||3||ASHBY||MARY J||ASHBY||MARY J.||3||3||9||20|
|0||6801||1||5||5||3177716||4||ASHBY||ANNIE V||ASHBY||ANNIE B.||3||3||8||19|
|0||6801||1||6||5||3177716||5||ASHBY||WILLIAM H||ASHBY||WM. H.||3||3||0||10|
|0||4301||1||4||5||222598||3||HUBARD||WILEY N||HUBBARD||WILLEY NEWTON||3||3||9||20|
The second household in Table 1 gives an example where we have two primary links in the 1870 household. Since we have two primary links, the 1870 household is written out twice (indicated by HHSEQ), with each version providing links to a different 1880 household. In this case the first primary link is the head of the household, 57-year-old Moses Woodfin. Of the seven household members in 1870, all are also co-resident in 1880 with the exception of William and 15-year-old Moses. However, 15-year-old Moses is also a primary link, indicated by LINKTYPE = 100 in the first version of the household. The second version of the 1870 household (i.e., HHSEQ = 2) shows 15-year-old Moses as the primary link; in 1880 he was 25 years old, living with his wife and three young children.
The third household in Table 1 gives an example of the imprecision of some of our household links. The primary link is Lafeyette, and we also establish household links for everyone except the head (W D), the eldest daughter (Mary J) and an unrelated individual (Robert Paromes). However, it is likely that the head of household in 1870, 37-year-old W D, is the same person as the head of household in 1880, 45-year-old Davis. Given our rules for establishing household links, we do not make this link because of first name ambiguity.
Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked in the primary linking process. Once this is done, LINKTYPE and HHSEQ can be ignored. Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.
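As a minimal sketch of the simplification just described, the function below drops every record with LINKTYPE greater than 0 and then discards LINKTYPE and HHSEQ. The function name, the PERNUM_1 column, and the sample rows are hypothetical.

```python
def keep_primary_links(records):
    """Keep only primary links (LINKTYPE of 0) and drop the now-unneeded columns."""
    kept = []
    for record in records:
        if record["LINKTYPE"] > 0:
            continue  # co-resident person, not linked in the primary linking process
        kept.append({k: v for k, v in record.items() if k not in ("LINKTYPE", "HHSEQ")})
    return kept


# Hypothetical household written out twice because it holds two primary links.
household = [
    {"HHSEQ": 1, "PERNUM_1": 1, "LINKTYPE": 0},
    {"HHSEQ": 1, "PERNUM_1": 2, "LINKTYPE": 5},
    {"HHSEQ": 1, "PERNUM_1": 3, "LINKTYPE": 100},
    {"HHSEQ": 2, "PERNUM_1": 1, "LINKTYPE": 9},
    {"HHSEQ": 2, "PERNUM_1": 3, "LINKTYPE": 0},
]
primary = keep_primary_links(household)
print(primary)
```

Each primary link survives exactly once, even when its household was written out more than once.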
Table 2 gives the number of linked records for the various sample years by nativity/race, and for females by nativity/race and marital status categories. The table shows the limitations of some of our samples due to small numbers for some population components; these are also the subgroups within which we weight. The purpose of the weights is to compensate for under- and over-representation, and they are constructed only for the primary links. We construct the weights based on terminal year population characteristics (i.e., 1880 characteristics for the pre-1880 samples and sample year characteristics for the post-1880 samples). Weights are based on an estimate of the "linkable" population. Using the 1870-1880 samples as an example, the linkable population for native-born groups is anyone who was 10 years or older in the 1880 census. However, since we do not have year of immigration information before 1900, for the foreign-born we have to look at the foreign-born in the 1870 census and apply life tables to estimate how many of these individuals would still have been alive in 1880. The difference between this estimate and the actual total in 1880 would be due to immigration between 1870 and 1880.
|Group||1850-1880||1860-1880||1870-1880||1880-1900||1880-1910||1880-1920||1880-1930|
|Native-born white males||7009||10424||17724||18595||14854||10049||9018|
|African American males||82||235||2181||1336||796||506||352|
|"Couples" samples (by male characteristics)|
|Native-born white males||2133||4537||8858||7745||4644||2105||612|
|African American males||6||19||408||180||102||26||7|
|Native-born white females, married||1077||2253||4254||3274||1891||793||221|
|Native-born white females, formerly married||811||951||1192||1161||1124||894||545|
|Native-born white females, single||468||1495||6699||4239||1407||849||700|
|Foreign-born females, married||63||215||671||554||317||111||26|
|Foreign-born females, formerly married||62||90||191||252||231||145||71|
|Foreign-born females, single||2||30||150||65||40||17||16|
|African American females, married||8||34||432||164||73||20||2|
|African American females, formerly married||12||34||152||88||51||34||29|
|African American females, single||9||35||902||210||40||20||7|
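The life-table step used to estimate the linkable foreign-born population can be sketched as follows. All counts and survival probabilities here are invented for illustration, and the function names are hypothetical; the sketch only shows the arithmetic of applying survival probabilities by age and taking the difference from the observed later total.

```python
def expected_survivors(population_by_age, survival_by_age):
    """Apply a life-table survival probability to each age group and sum."""
    return sum(count * survival_by_age[age] for age, count in population_by_age.items())


def implied_immigration(actual_later_total, survivors):
    """Whatever the survivors do not account for is attributed to immigration."""
    return actual_later_total - survivors


# Hypothetical foreign-born counts in 1870 and ten-year survival probabilities.
foreign_born_1870 = {20: 1000, 40: 500}
ten_year_survival = {20: 0.9, 40: 0.8}

linkable = expected_survivors(foreign_born_1870, ten_year_survival)
arrivals = implied_immigration(2000, linkable)
print(linkable, arrivals)
```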
After identifying the appropriate estimates for linkable subgroups, we multiply by the inverse of the group's linkage rate to get an initial weight that inflates the subgroup's number of linked records to the same number as that in the linkable population. We then calculate the specific weights for a number of weighting variables. Using 1870-1880 males as an example, we first estimate the relationship-to-head weight, which is calculated by the proportion of the linkable population by relationship-to-head categories divided by the proportions for the linked sample. We then apply the relationship-to-head weight to the linked records, and then calculate the proportions for the next weighting variable (in this case individual birthplaces). We repeat this process for 5-year age groups, size of place and occupational categories, with a specific record's weight getting modified with each iteration.
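One simplified reading of this sequential adjustment is sketched below. It starts every record at the inverse linkage rate, then for each weighting variable multiplies the weight by the ratio of the population proportion to the current weighted sample proportion. The function name, variable names, categories, and target proportions are all hypothetical, and the sketch assumes every category in the sample has a target proportion.

```python
from collections import defaultdict


def sequential_weights(linked, targets, linkable_total):
    """linked: list of dicts; targets: ordered list of (variable, {category: proportion})."""
    # Initial weight: inverse of the linkage rate, so weights sum to the linkable total.
    weights = [linkable_total / len(linked)] * len(linked)
    for variable, population_props in targets:
        total = sum(weights)
        weighted_by_category = defaultdict(float)
        for record, w in zip(linked, weights):
            weighted_by_category[record[variable]] += w
        for i, record in enumerate(linked):
            sample_prop = weighted_by_category[record[variable]] / total
            weights[i] *= population_props[record[variable]] / sample_prop
    return weights


linked = [
    {"relate": "head", "agegrp": "30-34"},
    {"relate": "head", "agegrp": "30-34"},
    {"relate": "head", "agegrp": "10-14"},
    {"relate": "child", "agegrp": "10-14"},
]
targets = [
    ("relate", {"head": 0.5, "child": 0.5}),
    ("agegrp", {"30-34": 0.4, "10-14": 0.6}),
]
weights = sequential_weights(linked, targets, linkable_total=100)
print(weights)
```

Because each pass modifies the weights left by the previous one, only the last variable is guaranteed to match its targets exactly; earlier variables may drift slightly.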
This process works fairly well if we have enough records. From Table 2, however, we can see that some of the subgroups have small numbers. As a result, some subgroups have a large range of weight values, with some linked records representing a relatively large number of records in the linkable population. We deal with this by imposing a minimum and maximum on the weight for all subgroups. The minimum is 1/5 of the average weight for the subgroup, and the maximum is 4 times the average weight for the subgroup. This has little effect on some of the larger groups; for example, under 1 percent of the native-born male links for 1870-1880 are affected by the minimum/maximum rule. However, almost 10 percent of the foreign-born records fall either below the initial minimum or above the initial maximum.
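The minimum/maximum rule can be sketched as a single clamping pass. The function name and the weights below are hypothetical, and the sketch assumes the subgroup average is computed once, before any weights are trimmed; the text does not say whether the average is recomputed afterward.

```python
def trim_weights(weights, floor_ratio=0.2, cap_ratio=4.0):
    """Clamp each weight to [1/5, 4] times the subgroup's average weight."""
    average = sum(weights) / len(weights)
    lowest, highest = floor_ratio * average, cap_ratio * average
    return [min(max(w, lowest), highest) for w in weights]


# Hypothetical subgroup with an average weight of 20.
subgroup_weights = [1.0, 5.0, 5.0, 5.0, 84.0]
trimmed = trim_weights(subgroup_weights)
print(trimmed)
```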
The ability to assess similarity can be enhanced by cleaning and standardizing the source data. The sample and complete-count data have been through a variety of cleaning and logical edits prior to release as part of the IPUMS. Age information, for example, is subject to a variety of consistency checks at the original data collection stage and later in IPUMS processing. Thus we felt no need to further process age information prior to linkage. The name fields, in contrast, receive little processing prior to IPUMS release. IPUMS data have separate fields for given and last name. While the last name field consistently contains a single string, the given name field can contain given and middle name, given name and middle initial, or even a lone first initial. Some enumerators also used abbreviations for common given names, which were transcribed verbatim in the data collection process.
We ultimately decided to do a minimal amount of processing on the surname field. We removed non-alpha characters, but did not attempt to standardize or correct perceived misspellings. We generally took the same approach with the given names in that we were not overly concerned with misspellings. For example, we felt that small variations in names would not be enough to confidently distinguish a true link from a false link. Another factor in our decision was the large number of names in the sample and complete count data. Although some of the variation is caused by the occasional presence of middle initials, when combined with sex, our given name dictionary contained approximately 1.7 million unique strings.
Table 3 shows the 24 most common first names for men, with the FNAME field containing the raw string. When a person had an initial, or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. For the first names listed below, the only name that was modified in the standard name field is "WM.", which is standardized as "WILLIAM."
Given the large number of unique strings, we standardized only first names with a frequency greater than 100, of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as those listed below in Table 4.
|Raw string||n1||n1 standard||Frequency|
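The name handling described above can be sketched roughly as follows. The field names XNAME1, XNAME2, and STD_XNAME1 come from the text, but the function names are hypothetical, the lookup table contains only the one mapping the text confirms (WM. to WILLIAM), and stripping punctuation and uppercasing before the lookup are assumptions about how the parsing was done.

```python
import re

# Only "WM" -> "WILLIAM" is taken from the text; a real dictionary would be far larger.
STANDARD_NAMES = {"WM": "WILLIAM"}


def clean_surname(raw):
    """Remove non-alpha characters without correcting perceived misspellings."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def parse_given_name(raw):
    """Return (XNAME1, XNAME2, STD_XNAME1) for a raw given-name string."""
    tokens = [re.sub(r"[^A-Za-z]", "", token).upper() for token in raw.split()]
    tokens = [token for token in tokens if token]
    xname1 = tokens[0] if tokens else ""
    xname2 = " ".join(tokens[1:])  # middle name or initial, if any
    return xname1, xname2, STANDARD_NAMES.get(xname1, xname1)


print(clean_surname("O'BRIEN"))
print(parse_given_name("WM. H."))
print(parse_given_name("WILLEY NEWTON"))
```

Names without a dictionary entry pass through unchanged, which matches the decision not to correct small spelling variations.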
The creation of the linked samples required millions of individual comparisons. Linked individuals were required to have the same sex, race, and birthplace, so the comparison of these variables is straightforward. Evaluating exact matches on name and age information is unambiguous, but we also had to assess the likelihood that near matches were correct links. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on the Freely Extensible Biomedical Record Linkage (FEBRL) software, created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.
Input files consisted of individuals with a given race, sex, and birthplace. For example, white males born in Michigan who were in the 1870 one-percent sample were compared to white males born in Michigan who were present in the 1880 complete-count data. Given consistent sex, race and birthplace information, the FEBRL software would compare all records within a seven-year sliding window. Thus, 30-year-olds in the 1870 sample (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not exact, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age. In addition to constructing age similarity scores, FEBRL would also assess name similarity. We used the Jaro-Winkler distance algorithm, which takes a mathematical approach to dealing with transpositions of particular letters or syllables in comparing the similarity of any two alphabetic strings (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002). Output files consisted of potential links that exceeded preset thresholds for age and name similarity.
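The two comparisons described above can be illustrated with a plain Python sketch. This is not FEBRL code or its API: it is a textbook Jaro-Winkler implementation with the usual prefix scale of 0.1 and a four-character prefix cap, both assumed rather than taken from the project. The age functions use the expected later age as the denominator, which is also an assumption, since the text does not say which age the proportion is based on.

```python
def within_age_window(age_earlier, age_later, years_apart, window=7):
    """True when the later age is within `window` years of the expected age."""
    return abs(age_later - (age_earlier + years_apart)) <= window


def age_gap_ratio(age_earlier, age_later, years_apart):
    """Difference from the expected age as a proportion of the expected age (assumed)."""
    expected = age_earlier + years_apart
    return abs(age_later - expected) / expected


def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    reach = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - reach), min(len(s2), i + reach + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    k = out_of_order = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(within_age_window(30, 47, 10), within_age_window(30, 48, 10))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("HUBARD", "HUBBARD"), 4))
```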
After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning. Our procedures draw heavily on the concept of Support Vector Machines, or SVMs (Vapnik 1995; Cristianini and Shawe-Taylor 2000; Abe 2005). SVMs are not physical machines; they are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University's Department of Computer Science.
SVM construction depends on the existence of training data, which typically consists of a verified set of true and false links. The classifier analyzes the training data, plots them in a multidimensional space, and then constructs a boundary (a hyperplane) between the two classes of records that maximizes the distance between the boundary and the nearest data points in each class (i.e., between the true and false links). After SVM construction (which is based on the training data), unclassified records are plotted in this multidimensional space, and the end result is a file consisting of potential links and the classifier-produced confidence score. Confidence scores are interpreted dichotomously: a positive score indicates a "true" link and a negative score indicates a "false" link.
At the classifier stage each potential link is evaluated independently, which often results in numerous potential links (from 1880) to a given sample record. We consider these links to be ambiguous, and they are not included in our linked data. Table 5 shows the confidence scores for potential links to John Bradley, a 25-year-old white male born in South Carolina from the 1870 data. Of the 43 potential links, only the top four receive positive confidence scores. Although the potential link with the highest confidence score is an exact match, the other three also have a high degree of similarity. If we had to choose, we would say the exact link is probably the correct link. However, we also feel that the probability that it is the correct link is significantly under 95 percent, and using these types of links would introduce an unacceptable error rate.
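The ambiguity rule can be sketched as a filter over classifier output. The function name, record identifiers, and scores below are hypothetical. The sketch keeps a link only when a sample record has exactly one candidate with a positive confidence score; it does not address the reverse case of one 1880 record attracting several sample records, which the text does not describe.

```python
def resolve_links(candidates):
    """candidates: iterable of (sample_id, id_1880, confidence_score) tuples."""
    positive = {}
    for sample_id, id_1880, score in candidates:
        if score > 0:  # a positive score is read as a "true" link
            positive.setdefault(sample_id, []).append(id_1880)
    # More than one positive candidate makes the sample record ambiguous, so it is dropped.
    return {sid: ids[0] for sid, ids in positive.items() if len(ids) == 1}


# Hypothetical classifier output: record "A" has four positive candidates, "B" has one.
scored = [
    ("A", 101, 1.8), ("A", 102, 0.9), ("A", 103, 0.4), ("A", 104, 0.1), ("A", 105, -0.7),
    ("B", 201, 1.2), ("B", 202, -0.3),
    ("C", 301, -1.1),
]
links = resolve_links(scored)
print(links)
```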
Final Data Release
The material above describes the process we used for the preliminary linked data release, which took place in Fall 2008. An expanded narrative describing methods will be available in a forthcoming issue of Historical Methods. An important part of the process was a set of links for 1870-1880 and 1880-1900 produced by Pleiades Software. The comparison of our preliminary linked data to the Pleiades linked data disclosed that we had produced accurate linked data, but our process was highly dependent on the precision of the data. More specifically, if the individuals that we attempted to link were accurately enumerated in both census years, we would either make the link or, in the case of multiple potential links, reject the link as ambiguous. But if the correct link was unidentifiable (because of mortality, underenumeration, or misenumeration of linking variable information), we would make an incorrect link if there was another person with similar characteristics in the 1880 complete-count data.
We dealt with this problem by constructing formal measures for the commonness of names. A fairly standard approach in record linkage projects is to construct frequency tables for names, which more or less assess the probability of a correct link. For example, based on frequency tables, record linkers would be more confident in linking someone with the name Roland Marsupial as opposed to John Smith. But given our minimalist approach to name cleaning and standardization, we run into a problem with minor typos and misspellings, which would show up as low frequency names, although many of these names have high similarity to high frequency names (and would in fact show up as potential links to records with high frequency names). Our solution was to construct name similarity scores based on the following: for a given sample record we determined the proportion of records (by race, birthplace, and sex) in the 1880 complete-count data that have a Jaro-Winkler similarity score greater than 0.9. The choice of this threshold is somewhat arbitrary, but based on the preliminary linked data, we rarely linked records that did not exceed this threshold. We also constructed a density of birth measure, which is the proportion of 1880 records for specific birthplaces, by race and sex.
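The name commonness measure can be sketched as the share of a comparison pool whose similarity to the sample name exceeds the threshold. To keep the example short, it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for Jaro-Winkler, so the scores differ from those the project computed; any function returning a similarity between 0 and 1 can be passed instead. The function names and the pool are hypothetical, the pool is assumed to be already restricted to one race, birthplace, and sex group, and comparing the full name as a single string is an assumption.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Stand-in string similarity in [0, 1]; not Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def name_commonness(name, pool, similarity=stand_in_similarity, threshold=0.9):
    """Proportion of pool names whose similarity to `name` exceeds the threshold."""
    if not pool:
        return 0.0
    similar = sum(1 for other in pool if similarity(name, other) > threshold)
    return similar / len(pool)


# Hypothetical 1880 pool for one race, birthplace, and sex group.
pool_1880 = ["JOHN SMITH", "JOHN SMITH", "JON SMITH", "ROLAND MARSUPIAL"]
common = name_commonness("JOHN SMITH", pool_1880)
rare = name_commonness("ROLAND MARSUPIAL", pool_1880)
print(common, rare)
```

Because the near-miss spelling "JON SMITH" counts toward the score for "JOHN SMITH", a misspelled common name is not mistaken for a rare one, which is the problem the measure was designed to avoid.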
The impact of name commonness on the linkage rates can be seen in Table 6. As expected, all three groups have higher linkage rates for records with less common names compared to those with the most common names. The linkage rate differentials within the ethnic/race groups are also extreme. Taking nothing else into account, native-born whites in the least common name category are approximately 30 times more likely to be linked compared to native-born whites in the most common name category.
However, there is one other basic factor to take into account when evaluating the name commonness scores. Table 7 presents linkage rates by name commonness categories and by birthplace categories, with birthplace rank determined by the proportion of all 1870 males by place of birth. The basic pattern of a higher linkage rate for less common names is evident regardless of birthplace rank. For records in the least common name category, the birthplace categories do not show much difference in linkage rates. But as we move to the more common name categories, differentials between less and more populated states become more extreme. For example, only the most common name categories show a linkage rate significantly lower than the average for all records from the small states.
We expected to see this pattern in the linked data. The pattern largely results from our decision not to link in ambiguous cases (i.e., where we had more than one potential link). But by generating name commonness scores for all records, we have also forced the classifiers not to make links in cases where a name is relatively common, even in the absence of competing potential links. And the tables make clear that we do have a biased sample of linked data; in general, we are more likely to link records from the smaller states, and we are more likely to link records with less common names.
We always assumed that we would deal with linkage differentials by constructing weights for the linked records; that all things being equal, linked individuals born in Delaware would have a lower weight than those born in New York. But we also assumed that there would not be an inherent bias between more versus less common names. The construction of the name scores allows us to examine this in a fairly straightforward way. Tables 8a and 8b give the mean occupational score for 1870 males by race/nativity and name category. Table 8a excludes non-occupational responses and the distribution is relatively flat. Table 8b excludes non-occupational responses and farmers, and we also see relatively little variation among name categories.
|Name commonness||Linkage rate||N|
|1 (least common)||19.00%||39797|
|7 (most common)||0.60%||10765|
|1 (least common)||4.70%||6081|
|7 (most common)||0.20%||5372|
|1 (least common)||6.00%||12525|
|7 (most common)||1.30%||6145|
|Birthplace rank||Name commonness||Linkage rate||N|
|1 (smallest)||1 (least common)||17.80%||5590|
|7 (most common)||2.70%||1858|
|2||1 (least common)||20.70%||11152|
|7 (most common)||0.30%||3473|
|3||1 (least common)||21.00%||12859|
|7 (most common)||0.10%||3450|
|4 (largest)||1 (least common)||15.20%||10196|
|7 (most common)||0.05%||1984|
|Name commonness||OCCSCORE mean||N|
|1 (least common)||19.32||18200|
|7 (most common)||19.07||5667|
|1 (least common)||22.11||5098|
|7 (most common)||21.25||4739|
|1 (least common)||12.43||7452|
|7 (most common)||13.56||3527|
|Name commonness||OCCSCORE mean||N|
|1 (least common)||22.26||11734|
|7 (most common)||21.80||3698|
|1 (least common)||23.83||4208|
|7 (most common)||22.70||3949|
|1 (least common)||12.17||6368|
|7 (most common)||13.51||3137|
This work was supported in part by the University of Minnesota Supercomputing Institute.
Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. London: Springer.
Cristianini, Nello and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and other kernel-based learning methods. London: Cambridge University Press.
Ferrie, Joseph. 1996. A New Sample of Males Linked from the Public-Use-Microdata-Sample of the 1850 US Federal Census of Population to the 1860 US Federal Census Manuscript Schedules. Historical Methods 29: 141-156.
Guest, Avery. 1987. Notes from the National Panel Study: Linkage and Migration in the Late Nineteenth Century. Historical Methods 20: 63-77.
Jaro, Matthew A. 1989. Advances in Record Linking Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-20.
Porter, Edward H. and William E. Winkler. 1997. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Census Bureau Research Report RR97/02 (Washington D.C.: U.S. Bureau of the Census).
Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. London: Springer-Verlag.
Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1990, Proceedings of the Section of Survey Research Methods, 354-359.
Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1993, Proceedings of the Section of Survey Research Methods, 274-279.
Winkler, William E. 2002. Methods for Record Linkage and Bayesian Networks. Census Bureau Research Report RRS2002-05 (Washington D.C.: U.S. Bureau of the Census).