IPUMS Linked Representative Samples, 1850-1930
Final Data Release (June 2010)

The Minnesota Population Center is pleased to announce the final release of the IPUMS Linked Representative Samples. The new database links records from the 1880 complete-count database to 1% samples of the 1850 to 1930 U.S. censuses of the population. We have produced data samples for 7 pairs of years: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each of these contains three independent linked samples: one of men, one of women, and one of married couples.

This final release of the database supersedes the preliminary version that was made available beginning in August 2008. The final release differs from the preliminary release in several important ways, described in more detail below. The most significant improvements were methodological: we used better "training" data and added a "name commonness" measure when determining linkage confidence. As a result of these methodological improvements, the final release contains many cases that were not present in the preliminary release, and some preliminary cases were removed. Overall, the samples have roughly a 60 to 70 percent overlap between the preliminary and final releases.

The final version of the datasets also used better weighting procedures than the preliminary release, resulting in numerous changes to the weight value attached to each case. And the 1880 observations in the final data include many more variables than they did in the preliminary release, including literacy, school attendance, physical infirmities, months unemployed, month of birth, and whether or not the respondent was married in the previous year.

See this FAQ for the latest information on the Linked Representative Samples.

Dataset basics

All files contain weights that allow users to generate representative population estimates. The universes for the male and couple files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couple files represent all married couples who were in the U.S. and married to one another in both census years. The universe for the linked female files is more complicated; these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage (or remarriage) between the two census years. For this reason, the female files only link women who were single at both time periods, were married to the same person in both censuses, or who transitioned from married to widowed or divorced between time periods. The female files do not link women who entered into marriage between censuses.

The linking strategy relied on five variables that theoretically should not change over time: birth year, place of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and place of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. A strategic decision in the project was to ignore information from other co-resident individuals because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of people whose place of residence did not change (non-movers).

Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. All data comes from IPUMS-USA (in contrast to the preliminary release which used the version of the 1880 complete-count database from the North Atlantic Population Project). Variable descriptions are available via the IPUMS variable availability page.

Formatting of linked data

All linked samples consist of independently linked records and their complete households in the two census years. The primary link has a value of 0 for LINKTYPE. After the primary linking, we also link household members who appear in both years in households containing a primary link (indicated by LINKTYPE = 5). Table 1 gives a few examples from the 1870-1880 male linked sample.

All linked samples contain variables from a sample year and from 1880. In each sample, variables ending with "_1" contain data from the earlier year and variables ending with "_2" contain data from the later year. For instance, for the pre-1880 linked samples (i.e., 1850, 1860 and 1870), the sample year variables end with "_1" and the 1880 variables end with "_2". For the post-1880 linked samples, the 1880 variables end with "_1" and the sample year variables end with "_2".

We have structured each file such that the pernum for the sample year is numbered sequentially upward from 1 for each household. The pernum for the sample year always contains a value equal to or greater than 1 even if the household member was present in 1880 but not in the sample year. However, the pernum for 1880 will have a value of 0 if a household member was present in the sample year but not in 1880. The easiest way to view households in the linked files is to sort in ascending order by the sample year serial, HHSEQ (more on HHSEQ below), and the sample year pernum.

In the first example in Table 1, the primary link (indicated by LINKTYPE = 0) is the third record in the 1870 household, James H A, who was 10 years old in the 1870 census. Based on consistent race, birthplace, name and age information, we also establish household links for all other members of the 1870 household with the exception of the head of household, who does not appear to have been present in the 1880 household (indicated by LINKTYPE = 9). In addition, the 1880 household has one member, Mary Gillan, a 40-year-old domestic servant, who was not present in 1870.

Table 1 Example of linked data from the 1870-1880 linked sample of males
perwt   serial_1  hhseq  pernum_1  linktype  serial_2  pernum_2  namelast_1  namefrst_1  namelast_2  namefrst_2     imprel_1  relate_2  age_1  age_2
0       6801      1      1         9         3177716   0         ASHBY       JAMES                                   1         .         42     .
0       6801      1      2         5         3177716   1         ASHBY       MARY J      ASHBY       MARY J.        2         1         30     40
1131.7  6801      1      3         0         3177716   2         ASHBY       JAMES H A   ASHBY       JAMES H.       3         3         10     21
0       6801      1      4         5         3177716   3         ASHBY       MARY J      ASHBY       MARY J.        3         3         9      20
0       6801      1      5         5         3177716   4         ASHBY       ANNIE V     ASHBY       ANNIE B.       3         3         8      19
0       6801      1      6         5         3177716   5         ASHBY       WILLIAM H   ASHBY       WM. H.         3         3         0      10
0       6801      1      7         9         3177716   6                                 GILLAN      MARY           .         12        .      40

perwt   serial_1  hhseq  pernum_1  linktype  serial_2  pernum_2  namelast_1  namefrst_1  namelast_2  namefrst_2     imprel_1  relate_2  age_1  age_2
437.71  70601     1      1         0         60191     1         WOODFIN     MOSES       WOODFIN     MOSES          1         1         57     67
0       70601     1      2         5         60191     2         WOODFIN     MARY        WOODFIN     MARY A.        2         2         49     61
0       70601     1      3         9         60191     0         WOODFIN     WILLIAM                                3         .         17     .
0       70601     1      4         1         60191     0         WOODFIN     MOSES                                  3         .         15     .
0       70601     1      5         5         60191     3         WOODFIN     LYDIA       WOODFIN     LIDDA A.       3         3         11     21
0       70601     1      6         5         60191     4         WOODFIN     MANDA       WOODFIN     AMANDA C.      3         3         6      16
0       70601     1      7         5         60191     5         WOODFIN     JAMES       WOODFIN     JAMES          3         3         3      12
0       70601     2      1         1         47253     0         WOODFIN     MOSES                                  1         .         57     .
0       70601     2      2         9         47253     0         WOODFIN     MARY                                   2         .         49     .
0       70601     2      3         9         47253     0         WOODFIN     WILLIAM                                3         .         17     .
1117.6  70601     2      4         0         47253     1         WOODFIN     MOSES       WOODFIN     MOSES          3         1         15     25
0       70601     2      5         9         47253     0         WOODFIN     LYDIA                                  3         .         11     .
0       70601     2      6         9         47253     0         WOODFIN     MANDA                                  3         .         6      .
0       70601     2      7         9         47253     0         WOODFIN     JAMES                                  3         .         3      .
0       70601     2      8         9         47253     2                                 WOODFIN     SALLIE         .         2         .      23
0       70601     2      9         9         47253     3                                 WOODFIN     JOHN           .         3         .      3
0       70601     2      10        9         47253     4                                 WOODFIN     SAMUEL         .         3         .      1
0       70601     2      11        9         47253     5                                 WOODFIN     ALACE          .         3         .      0

perwt   serial_1  hhseq  pernum_1  linktype  serial_2  pernum_2  namelast_1  namefrst_1  namelast_2  namefrst_2     imprel_1  relate_2  age_1  age_2
0       4301      1      1         9         222598    0         HUBARD      W D                                    1         .         37     .
0       4301      1      2         5         222598    2         HUBARD      REEBECCA    HUBBARD     REBECA N.      2         2         27     38
0       4301      1      3         9         222598    0         HUBARD      MARY J                                 3         .         13     .
0       4301      1      4         5         222598    3         HUBARD      WILEY N     HUBBARD     WILLEY NEWTON  3         3         9      20
0       4301      1      5         5         222598    4         HUBARD      WILLIAM     HUBBARD     WILLIAM D.     3         3         7      17
456.85  4301      1      6         0         222598    5         HUBARD      LAFAYETTE   HUBBARD     LAFAYETT M.    3         3         5      15
0       4301      1      7         5         222598    6         HUBARD      JASAPHENE   HUBBARD     JOSAPHINE      3         3         1      11
0       4301      1      8         9         222598    0         PAROMES     ROBERT                                 12        .         14     .
0       4301      1      9         9         222598    1                                 HUBBARD     DAVIS          .         1         .      45
0       4301      1      10        9         222598    7                                 HUBBARD     MARTHA V.      .         3         .      9
0       4301      1      11        9         222598    8                                 HUBBARD     SUSAN R.       .         3         .      6
0       4301      1      12        9         222598    9                                 HUBBARD     FANNIE         .         3         .      3

The second household in Table 1 gives an example where we have two primary links in the 1870 household. Since we have two primary links, the 1870 household is written out twice (indicated by HHSEQ), with each version providing links to a different 1880 household. In this case the first primary link is the head of the household, 57-year-old Moses Woodfin. Of the seven household members in 1870, all are also co-resident in 1880 with the exception of William and 15-year-old Moses. However, 15-year-old Moses is also a primary link, indicated by LINKTYPE = 1 in the first version of the household. The second version of the 1870 household (i.e., HHSEQ = 2) shows 15-year-old Moses as the primary link; in 1880 he was 25 years old, living with his wife and three young children.

The third household in Table 1 gives an example of the imprecision of some of our household links. The primary link is Lafayette, and we also establish household links for everyone except the head (W D), the eldest daughter (Mary J) and an unrelated individual (Robert Paromes). However, it is likely that the head of household in 1870, 37-year-old W D, is the same person as the head of household in 1880, 45-year-old Davis. Given our rules for establishing household links, we do not make this link because of first name ambiguity.

Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked in the primary linking process. Once this is done, LINKTYPE and HHSEQ can be ignored. Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.
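As a minimal sketch of the sorting and filtering described above, the following Python uses a handful of rows modeled on the second household in Table 1. The lowercase column names follow the table header; the dictionary layout and the helper names household_order and primary_links are illustrative assumptions, not part of the released files. For an 1870-1880 file the sample year is the "_1" year.

```python
# A few rows modeled on the second (Woodfin) household in Table 1.
rows = [
    {"serial_1": 70601, "hhseq": 2, "pernum_1": 4, "linktype": 0, "age_1": 15},
    {"serial_1": 70601, "hhseq": 1, "pernum_1": 2, "linktype": 5, "age_1": 49},
    {"serial_1": 70601, "hhseq": 1, "pernum_1": 1, "linktype": 0, "age_1": 57},
    {"serial_1": 70601, "hhseq": 1, "pernum_1": 4, "linktype": 1, "age_1": 15},
    {"serial_1": 70601, "hhseq": 2, "pernum_1": 1, "linktype": 1, "age_1": 57},
]

def household_order(row):
    """Sort key: sample-year serial, then HHSEQ, then sample-year pernum."""
    return (row["serial_1"], row["hhseq"], row["pernum_1"])

def primary_links(rows):
    """Keep only primary links, i.e. drop every case with LINKTYPE > 0."""
    return [r for r in rows if r["linktype"] == 0]

sorted_rows = sorted(rows, key=household_order)
primary = sorted(primary_links(rows), key=household_order)

for r in primary:
    print(r["hhseq"], r["pernum_1"], r["age_1"])
```

After filtering, each remaining row is one independently linked person, so HHSEQ and LINKTYPE no longer carry information.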

Table 2 gives the number of linked records for the various sample years by nativity/race, and for females by nativity/race and marital status categories. The table shows the limitations of some of our samples, since some population components have small numbers, and it also identifies the subgroups within which we construct weights. The purpose of the weights is to compensate for under- and over-representation; they are constructed only for the primary links. We construct the weights based on terminal year population characteristics (i.e., 1880 characteristics for the pre-1880 samples and sample year characteristics for the post-1880 samples). Weights are based on an estimate of the "linkable" population. Using the 1870-1880 samples as an example, the linkable population for native-born groups is anyone who was 10 years or older in the 1880 census. However, since we do not have year of immigration information before 1900, for the foreign-born we have to look at the foreign-born in the 1870 census and apply life tables to estimate how many of these individuals would still have been alive in 1880. The difference between this estimate and the actual total in 1880 would be due to immigration between 1870 and 1880.

Table 2. Number of Linked Records, by Sample Year and Linked Population Sub-group
"Males" samples
1850 1860 1870 1900 1910 1920 1930
Native-born white males 7009 10424 17724 18595 14854 10049 9018
Foreign-born males 299 634 877 1512 990 508 336
African American males 82 235 2181 1336 796 506 352
Total 7390 11293 20782 21443 16640 11063 9706
"Couples" samples (by male characteristics)
1850 1860 1870 1900 1910 1920 1930
Native-born white males 2133 4537 8858 7745 4644 2105 612
Foreign-born males 227 932 2256 2129 1101 415 105
African American males 6 19 408 180 102 26 7
Total 2366 5488 11522 10054 5847 2546 724
"Females" samples
1850 1860 1870 1900 1910 1920 1930
Native-born white females, married 1077 2253 4254 3274 1891 793 221
Native-born white females, formerly married 811 951 1192 1161 1124 894 545
Native-born white females, single 468 1495 6699 4239 1407 849 700
Foreign-born females, married 63 215 671 554 317 111 26
Foreign-born females, formerly married 62 90 191 252 231 145 71
Foreign-born females, single 2 30 150 65 40 17 16
African American females, married 8 34 432 164 73 20 2
African American females, formerly married 12 34 152 88 51 34 29
African American females, single 9 35 902 210 40 20 7
Total 2512 5137 14643 10007 5174 2883 1617

After identifying the appropriate estimates for linkable subgroups, we multiply by the inverse of the group's linkage rate to get an initial weight that inflates the subgroup's number of linked records to the same number as that in the linkable population. We then calculate the specific weights for a number of weighting variables. Using 1870-1880 males as an example, we first estimate the relationship-to-head weight, which is calculated by the proportion of the linkable population by relationship-to-head categories divided by the proportions for the linked sample. We then apply the relationship-to-head weight to the linked records, and then calculate the proportions for the next weighting variable (in this case individual birthplaces). We repeat this process for 5-year age groups, size of place and occupational categories, with a specific record's weight getting modified with each iteration.
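The sequential adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual weighting code: the function names, the toy records, the linkable total of 400, and the target proportions are all invented for the example, and only two of the weighting variables are shown.

```python
from collections import defaultdict

def initial_weight(linkable_total, linked_count):
    """Inverse of the linkage rate: how many linkable people each linked record represents."""
    return linkable_total / linked_count

def adjust_weights(records, weights, variable, target_props):
    """One pass: multiply each weight by (target proportion / weighted sample proportion)."""
    total = sum(weights)
    weighted = defaultdict(float)
    for rec, w in zip(records, weights):
        weighted[rec[variable]] += w
    factors = {cat: target_props[cat] / (wsum / total) for cat, wsum in weighted.items()}
    return [w * factors[rec[variable]] for rec, w in zip(records, weights)]

# Hypothetical linked records and hypothetical linkable-population proportions.
linked = [
    {"relate": "head", "agegroup": "30-34"},
    {"relate": "child", "agegroup": "10-14"},
    {"relate": "child", "agegroup": "10-14"},
    {"relate": "child", "agegroup": "15-19"},
]
targets = [
    ("relate", {"head": 0.5, "child": 0.5}),
    ("agegroup", {"30-34": 0.5, "10-14": 0.3, "15-19": 0.2}),
]

weights = [initial_weight(400, len(linked))] * len(linked)
for variable, props in targets:
    # Each pass starts from the weights produced by the previous pass.
    weights = adjust_weights(linked, weights, variable, props)

print([round(w, 2) for w in weights])
```

Because each pass rescales within categories whose target proportions sum to one, the weights continue to sum to the linkable total after every iteration.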

This process works fairly well if we have enough records. From Table 2, however, we can see that some of the sub-groups have small numbers. As a result some subgroups have a large range of weight values, with some records representing a relatively large number of records. We deal with this by imposing a minimum and maximum on the weight for all sub-groups. The minimum is 1/5 of the average weight for the subgroup, with the maximum capped at 4 times the average weight for the subgroup. This has little effect on some of the larger groups; for example, under 1 percent of the native-born male links for 1870-1880 are affected by the minimum/maximum rule. However, almost 10 percent of the foreign-born records are either below or above the initial minimum or maximum.
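A minimal sketch of the minimum/maximum rule, using the bounds stated above (one fifth of the subgroup's average weight and four times that average). The function name trim_weights, its parameter names, and the example weights are assumptions for illustration; the text does not say whether the average is recomputed after trimming, so this sketch computes it once from the untrimmed weights.

```python
def trim_weights(weights, min_divisor=5, max_multiple=4):
    """Clamp each weight to [mean / min_divisor, mean * max_multiple] within one subgroup."""
    mean_w = sum(weights) / len(weights)
    lower = mean_w / min_divisor
    upper = mean_w * max_multiple
    return [min(max(w, lower), upper) for w in weights]

# Hypothetical weights for one small subgroup (mean = 400).
subgroup = [10, 100, 100, 90, 1700]
trimmed = trim_weights(subgroup)
print(trimmed)
```

In this toy subgroup the smallest weight is raised to the floor and the largest is lowered to the cap, while the three middle weights are unchanged.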

Detailed linking procedures

The ability to assess similarity can be enhanced by cleaning and standardizing the source data. The sample and complete-count data has been through a variety of cleaning and logical edits prior to release as part of the IPUMS. Age information, for example, is subject to a variety of consistency checks at the original data collection stage and later in IPUMS processing. Thus we felt no need to further process age information prior to linkage. The name fields, in contrast, receive little processing prior to IPUMS release. IPUMS data have separate fields for given and last name. While the last name field consistently contains a single string, the given name field can contain given and middle name, given name and middle initial, or even a lone first initial. Some enumerators also used abbreviations for common given names, which were transcribed verbatim in the data collection process.

We ultimately decided to do a minimal amount of processing on the surname field. We removed non-alpha characters, but did not attempt to standardize or correct perceived misspellings. We generally took the same approach with the given names in that we were not overly concerned with misspellings. For example, we felt that small variations in names would not be enough to confidently distinguish a true link from a false link. Another factor in our decision was the large number of names in the sample and complete count data. Although some of the variation is caused by the occasional presence of middle initials, when combined with sex, our given name dictionary contained approximately 1.7 million unique strings.

Table 3 shows the 25 most common first-name strings for men, with the FNAME field containing the raw string. When a person had an initial, or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. Among the first names listed below, the standard name field differs from XNAME1 for "FRANK" (standardized as "FRANCIS") and for "WM." and "WILLIE" (both standardized as "WILLIAM").

Table 3. The most common first names for men in 1880, before and after standardization
SEX  FNAME    XNAME1   XNAME2  STD_XNAME1  FREQ
M    JOHN     JOHN             JOHN        3297148
M    GEORGE   GEORGE           GEORGE      1311868
M    WILLIAM  WILLIAM          WILLIAM     1161737
M    HENRY    HENRY            HENRY       1156464
M    JAMES    JAMES            JAMES       792520
M    CHARLES  CHARLES          CHARLES     591368
M    THOMAS   THOMAS           THOMAS      474816
M    JOSEPH   JOSEPH           JOSEPH      458782
M    FRANK    FRANK            FRANCIS     382949
M    PETER    PETER            PETER       382598
M    EDWARD   EDWARD           EDWARD      298570
M    ROBERT   ROBERT           ROBERT      241233
M    SAMUEL   SAMUEL           SAMUEL      217773
M    DAVID    DAVID            DAVID       189674
M    JACOB    JACOB            JACOB       186299
M    ALBERT   ALBERT           ALBERT      172205
M    DANIEL   DANIEL           DANIEL      157188
M    WM.      WM               WILLIAM     152013
M    MICHAEL  MICHAEL          MICHAEL     147171
M    ANDREW   ANDREW           ANDREW      137175
M    PATRICK  PATRICK          PATRICK     136865
M    WILLIE   WILLIE           WILLIAM     128333
M    RICHARD  RICHARD          RICHARD     123986
M    LOUIS    LOUIS            LOUIS       115083
M    JOHN W.  JOHN     W       JOHN        113715

Given the large number of unique strings, we standardized only first names with a frequency greater than 100, of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as those listed below in Table 4.

Table 4. The most common abbreviations and diminutives for male first names in 1880, before and after standardization
Raw string n1 n1 standard Frequency
Wm. Wm William 152013
Willie Willie William 128333
Fred Fred Frederick 102832
Chas. Chas Charles 65007
Joe Joe Joseph 63356
Geo. Geo George 47135
Charley Charley Charles 46206
Thos. Thos Thomas 42224
Sam Sam Samuel 39799
Charlie Charlie Charles 38491
Eddie Eddie Edward 36106
Robt. Robt Robert 28990
Alex Alex Alexander 26199
Jas. Jas James 25740
Ben Ben Benjamin 24261
Jno. Jno John 23679
Tom Tom Thomas 21772
Jim Jim James 20176
Saml. Saml Samuel 14691
Benj. Benj Benjamin 14298
Jos. Jos Joseph 12072
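The parsing and standardization steps can be sketched as below. This is a simplified illustration, not the project's cleaning code: the dictionary holds only a few entries copied from Table 4, and the function names parse_given_name and clean_surname, the whitespace tokenization, and the choice to discard anything after the second token are assumptions.

```python
import re

# A tiny excerpt of a standardization dictionary, with entries taken from Table 4.
STANDARD_NAMES = {
    "WM": "WILLIAM",
    "WILLIE": "WILLIAM",
    "CHAS": "CHARLES",
    "GEO": "GEORGE",
    "THOS": "THOMAS",
    "JNO": "JOHN",
}

def strip_non_alpha(text):
    """Uppercase the string and remove every character that is not A-Z."""
    return re.sub(r"[^A-Z]", "", text.upper())

def parse_given_name(raw):
    """Return (XNAME1, XNAME2, STD_XNAME1) for a raw given-name string."""
    tokens = [strip_non_alpha(t) for t in raw.split()]
    tokens = [t for t in tokens if t]
    xname1 = tokens[0] if tokens else ""
    xname2 = tokens[1] if len(tokens) > 1 else ""
    # Names without a dictionary entry are left as they are.
    return xname1, xname2, STANDARD_NAMES.get(xname1, xname1)

def clean_surname(raw):
    """Surnames only lose non-alpha characters; spelling is not corrected."""
    return strip_non_alpha(raw)

print(parse_given_name("WM."))
print(parse_given_name("JOHN W."))
print(clean_surname("O'Brien"))
```

A dictionary lookup keeps the standardization conservative: misspellings that are not listed pass through unchanged and are left to the string comparator.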

The creation of the linked samples required millions of individual comparisons. Linked individuals were required to have the same sex, race, and birthplace, so the comparison of these variables is straightforward. Evaluating exact matches on name and age information is unambiguous, but we also had to assess the likelihood that near matches were correct links. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on the Freely Extensible Biomedical Record Linkage (FEBRL), software created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.

Input files consisted of individuals with a given race, sex, and birthplace. For example, white males born in Michigan who were in the 1870 one-percent sample were compared to white males born in Michigan who were present in the 1880 complete-count data. Given consistent sex, race and birthplace information, the FEBRL software would compare all records within a seven-year sliding window. Thus, 30-year-olds in 1870 (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not exact, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age. In addition to constructing age similarity scores, FEBRL would also assess name similarity. We used the Jaro-Winkler distance algorithm, which takes a mathematical approach to dealing with transpositions of particular letters or syllables in comparing the similarity of any two alphabetic strings (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002). Output files consisted of potential links that exceeded preset thresholds for age and name similarity.
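The age window and the name comparison can be sketched in plain Python. This is not FEBRL code: it is a textbook version of the Jaro-Winkler measure (prefix scale 0.1, prefix length capped at 4), and FEBRL's implementation may differ in details. The function names are assumptions, as is the choice of the expected age as the denominator of the age difference, since the text does not say which age is used.

```python
def age_window(age_1, years_between=10, half_width=7):
    """Range of later-census ages compared with a record aged age_1 in the earlier census."""
    expected = age_1 + years_between
    return expected - half_width, expected + half_width

def age_difference_proportion(age_1, age_2, years_between=10):
    """Absolute gap from the expected age, as a share of the expected age (assumed denominator)."""
    expected = age_1 + years_between
    return abs(age_2 - expected) / expected

def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

print(age_window(30))
print(round(jaro_winkler("bradley", "bradly"), 4))
```

The prefix boost is what makes the measure forgiving of errors late in a name, such as a dropped letter in "bradly", while a different first letter is penalized more heavily.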

After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning. Our procedures draw heavily on the concept of Support Vector Machines, or SVMs (Vapnik 1995; Cristianini and Shawe-Taylor 2000; Abe 2005). SVMs are not physical machines; they are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University's Department of Computer Science.

SVM construction depends on the existence of training data, which typically consists of a verified set of true and false links. The classifier analyzes the training data, plots them in a multidimensional space, and then constructs a boundary (a hyperplane) between the two classes of records that maximizes the distance between the hyperplane and the nearest data points in each class (i.e., between the true and false links). After SVM construction (which is based on the training data), unclassified records are plotted in this multidimensional space and the end result is a file consisting of potential links and the classifier-produced confidence score. Confidence scores are interpreted dichotomously: a positive score indicates a "true" link and a negative score indicates a "false" link.
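For readers unfamiliar with SVMs, the maximum-margin idea can be written in its standard linear soft-margin form. This is the textbook formulation, not necessarily the kernel or parameter settings used with LIBSVM in this project. The symbols are notation introduced here: x_i is the vector of similarity scores for training pair i, y_i is +1 for a verified true link and -1 for a false link, w and b define the hyperplane, the xi_i are slack variables, and C is a penalty parameter.

```latex
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n,

f(x) = w^{\top}x + b, \qquad \hat{y}(x) = \operatorname{sign} f(x).
```

Under this formulation a new potential link is assigned the signed value f(x), and its sign gives the predicted class, which matches the dichotomous reading of the confidence scores.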

At the classifier stage each potential link is evaluated independently, which often results in numerous potential links (from 1880) to a given sample record. We consider these links to be ambiguous, and they are not included in our linked data. Table 5 shows the confidence scores for potential links to John Bradley, a 25-year-old white male born in South Carolina from the 1870 data. Of the 43 potential links, only the top four receive positive confidence scores. Although the potential link with the highest confidence score is an exact match, the other three also have a high degree of similarity. If we had to choose, we would say the exact link is probably the correct link. However, we also feel that the probability that it is the correct link is significantly under 95 percent, and using these types of links would introduce an unacceptable error rate.

Table 5. Output and confidence scores for potential links to John Bradley, 25-year-old South Carolinian, 1870-1880
serial_1 pernum_1 serial_2 pernum_2 name1_1 namelast_clean_1 name1_2 namelast_clean_2 age_1 age_2 confidence
22901 1 976277 1 john bradley john bradley 25 35 1.164
22901 1 193150 1 john bradley john bradly 25 34 1.000
22901 1 8618365 1 john bradley john bradley 25 37 0.999
22901 1 8743061 1 john bradley john bradley 25 38 0.879
22901 1 8741891 1 john bradley h bradley 25 35 -0.995
22901 1 9179845 1 john bradley j bailey 25 35 -0.995
22901 1 9064198 1 john bradley john shandley 25 34 -1.000
22901 1 1055670 1 john bradley john bryan 25 35 -1.000
22901 1 984100 1 john bradley john ragsdalle 25 35 -1.000
22901 1 8654004 1 john bradley john bryante 25 35 -1.001
22901 1 8597775 1 john bradley john bail 25 35 -1.002
22901 1 8578272 1 john bradley john darby 25 36 -1.004
22901 1 8724337 1 john bradley john nalley 25 35 -1.010
22901 1 8574468 1 john bradley john ashley 25 35 -1.010
22901 1 8646083 1 john bradley john rarden 25 35 -1.010
22901 1 8649875 1 john bradley john ashley 25 35 -1.010
22901 1 1163802 1 john bradley john trader 25 35 -1.010
22901 1 8732963 1 john bradley john bryce 25 34 -1.012
22901 1 8584289 1 john bradley john ready 25 36 -1.020
22901 1 8644204 1 john bradley john beasley 25 35 -1.024
22901 1 8668427 1 john bradley josiah bramlet 25 35 -1.026
22901 1 8616451 1 john bradley john bayler 25 33 -1.028
22901 1 8603992 1 john bradley john blake 25 35 -1.028
22901 1 8605006 1 john bradley john boyer 25 35 -1.028
22901 1 205053 1 john bradley john berry 25 35 -1.028
22901 1 5103940 1 john bradley john brownlee 25 36 -1.038
22901 1 8673698 7 john bradley john branch 25 34 -1.045
22901 1 8691338 1 john bradley john clardy 25 34 -1.047
22901 1 8726708 1 john bradley john brazell 25 36 -1.050
22901 1 8572344 1 john bradley john gray 25 35 -1.055
22901 1 8759890 1 john bradley john rainey 25 36 -1.056
22901 1 8577946 1 john bradley john breazeale 25 35 -1.057
22901 1 8671388 1 john bradley john brackman 25 35 -1.071
22901 1 1031068 1 john bradley john batey 25 35 -1.085
22901 1 8649108 1 john bradley john berry 25 36 -1.087
22901 1 8692014 1 john bradley john berry 25 34 -1.087
22901 1 5111582 1 john bradley john bailey 25 34 -1.090
22901 1 8669087 1 john bradley john bailey 25 34 -1.090
22901 1 8666039 1 john bradley john saddler 25 35 -1.092
22901 1 8654395 1 john bradley john mccarley 25 35 -1.096
22901 1 999367 1 john bradley john mcbryde 25 35 -1.108
22901 1 8563426 1 john bradley john brownley 25 35 -1.109
22901 1 8718581 1 john bradley john bolen 25 35 -1.111
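The ambiguity rule illustrated by Table 5 can be sketched as a small function: a sample record is linked only when exactly one potential link has a positive confidence score. The function name resolve_links and the tuple layout are assumptions for illustration; the candidate identifiers and scores are the first six rows of Table 5.

```python
def resolve_links(candidates):
    """candidates: list of ((serial_2, pernum_2), confidence) for one sample record.

    Returns ("linked", id), ("ambiguous", None) or ("unlinked", None).
    """
    positive = [cid for cid, score in candidates if score > 0]
    if len(positive) == 1:
        return "linked", positive[0]
    if len(positive) > 1:
        # More than one plausible 1880 record: the link is rejected as ambiguous.
        return "ambiguous", None
    return "unlinked", None

# First six potential links to John Bradley from Table 5.
john_bradley = [
    ((976277, 1), 1.164),
    ((193150, 1), 1.000),
    ((8618365, 1), 0.999),
    ((8743061, 1), 0.879),
    ((8741891, 1), -0.995),
    ((9179845, 1), -0.995),
]

print(resolve_links(john_bradley))
```

With four positive scores, John Bradley is classified as ambiguous and left out of the linked data, even though one candidate is an exact match.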

Final Data Release

The material above describes the process we used for the preliminary linked data release, which took place in Fall 2008. An expanded narrative describing methods will be available in a forthcoming issue of Historical Methods. An important part of the process was a set of links for 1870-1880 and 1880-1900 produced by Pleiades Software. The comparison of our preliminary linked data to the Pleiades linked data showed that we had produced accurate linked data, but that our process was highly dependent on the precision of the data. More specifically, if the individuals that we attempted to link were accurately enumerated in both census years, we would either make the link or, in the case of multiple potential links, reject the link as ambiguous. But if the correct link was unidentifiable (because of mortality, underenumeration, or misenumeration of linking variable information), we would make an incorrect link if there was another person with similar characteristics in the 1880 complete-count data.

We dealt with this problem by constructing formal measures for the commonness of names. A fairly standard approach in record linkage projects is to construct frequency tables for names, which more or less assess the probability of a correct link. For example, based on frequency tables, record linkers would be more confident in linking someone with the name Roland Marsupial as opposed to John Smith. But given our minimalist approach to name cleaning and standardization, we run into a problem with minor typos and misspellings, which would show up as low frequency names although many of these names have high similarity to high frequency names (and would in fact show up as potential links to records with high frequency names). Our solution was to construct name similarity scores based on the following: for a given sample record we determined the proportion of records (by race, birthplace, and sex) in the 1880 complete-count data that have a Jaro-Winkler similarity score greater than 0.9. The choice of this threshold is somewhat arbitrary, but based on the preliminary linked data, we rarely linked records that did not exceed this threshold. We also constructed a density of birth measure, which is the proportion of 1880 records for specific birthplaces, by race and sex.
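A minimal sketch of the name commonness score: the share of records in the same race, birthplace, and sex block whose name similarity to the sample record exceeds 0.9. To keep the example short and self-contained, the default comparator is the ratio from Python's difflib.SequenceMatcher, which is a stand-in and not Jaro-Winkler; a Jaro-Winkler function can be passed through the similarity parameter instead. The function names, the use of a single combined name string, and the four-record block are assumptions for illustration.

```python
from difflib import SequenceMatcher

def stand_in_similarity(a, b):
    """Stand-in string comparator (NOT Jaro-Winkler), returning a value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def name_commonness(name, block_names, similarity=stand_in_similarity, threshold=0.9):
    """Proportion of names in the block whose similarity to `name` exceeds the threshold."""
    if not block_names:
        return 0.0
    similar = sum(1 for other in block_names if similarity(name, other) > threshold)
    return similar / len(block_names)

# Hypothetical 1880 block: records sharing one race, birthplace, and sex.
block_1880 = ["john bradley", "john bradley", "john bradly", "moses woodfin"]

common = name_commonness("john bradley", block_1880)
rare = name_commonness("roland marsupial", block_1880)
print(common, rare)
```

Because near-misspellings such as "john bradly" count toward the score, a misspelled variant of a common name is not mistaken for a rare name, which is the problem with plain frequency tables noted above.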

The impact of name commonness on the linkage rates can be seen in Table 6. As expected, all three groups have higher linkage rates for records with less common names compared to those with the most common names. The linkage rate differentials within the ethnic/race groups are also extreme. Taking nothing else into account, native-born whites in the least common name category are approximately 30 times more likely to be linked compared to native-born whites in the most common name category.

However, there is one other basic factor to take into account when evaluating the name commonness scores. Table 7 presents linkage rates by name commonness categories and by birthplace categories, with birthplace rank determined by the proportion of all 1870 males by place of birth. The basic pattern of a higher linkage rate for less common names is evident regardless of birthplace rank. For records in the least common name category, the birthplace categories do not show much difference in linkage rates. But as we move to the more common name categories differentials between less and more populated states become more extreme. For example, only the most common name categories show a linkage rate significantly lower than the average for all records from the small states.

We expected to see this pattern in the linked data. The pattern largely results from our decision not to link in ambiguous cases (i.e., where we had more than one potential link). But by generating name commonness scores for all records, we have also forced the classifiers not to make links in cases where a name is relatively common, even in the absence of competing potential links. And the tables make clear that we do have a biased sample of linked data; in general, we are more likely to link records from the smaller states, and we are more likely to link records with less common names.

We always assumed that we would deal with linkage differentials by constructing weights for the linked records: all things being equal, linked individuals born in Delaware would have a lower weight than those born in New York. But we also assumed that there would not be an inherent bias between more versus less common names. The construction of the name scores allows us to examine this in a fairly straightforward way. Tables 8a and 8b give the mean occupational score for 1870 males by race/nativity and name category. Table 8a excludes non-occupational responses and the distribution is relatively flat. Table 8b excludes non-occupational responses and farmers, and we also see relatively little variation among name categories.

Table 6. Linkage Rate for 1870 Males by Nativity/Race and Name Commonness Scores
Name commonness Linkage rate N
Native-born white
1 (least common) 19.00% 39797
2 17.90% 35783
3 10.70% 15355
4 7.90% 14612
5 4.30% 14736
6 2.10% 14281
7 (most common) 0.60% 10765
Total 12.20% 145329
Foreign-born white
1 (least common) 4.70% 6081
2 5.60% 6129
3 3.40% 2821
4 2.30% 2751
5 1.60% 3052
6 0.90% 3162
7 (most common) 0.20% 5372
Total 3.00% 29368
African-American
1 (least common) 6.00% 12525
2 10.10% 10445
3 9.90% 4528
4 8.40% 4391
5 4.80% 4499
6 2.60% 4801
7 (most common) 1.30% 6145
Total 6.40% 47334

Table 7. Linkage Rate for Native-Born 1870 Males by Birthplace Rank (number of males by birthplace) and Name Commonness Scores
Birthplace rank Name commonness Linkage rate N
1 (smallest) 1 (least common) 17.80% 5590
2 26.60% 5140
3 21.60% 2046
4 19.50% 2112
5 14.10% 2121
6 8.90% 2288
7 (most common) 2.70% 1858
Total 17.80% 21155
2 1 (least common) 20.70% 11152
2 23.70% 10013
3 16.20% 4490
4 12.30% 4186
5 5.70% 4379
6 1.90% 4286
7 (most common) 0.30% 3473
Total 14.90% 41979
3 1 (least common) 21.00% 12859
2 17.10% 11039
3 7.80% 4761
4 4.20% 4521
5 1.70% 4526
6 0.40% 4277
7 (most common) 0.10% 3450
Total 11.60% 45433
4 (largest) 1 (least common) 15.20% 10196
2 8.10% 9591
3 2.70% 4058
4 1.10% 3793
5 0.30% 3710
6 0.10% 3430
7 (most common) 0.05% 1984
Total 6.80% 36762

Table 8a. Mean OCCSCORE value for 1870 Males by Nativity/Race and Name Commonness Scores, excluding non-occupational responses
Name commonness OCCSCORE mean N
Native-born white
1 (least common) 19.32 18200
2 19.40 20505
3 19.09 7603
4 19.28 7257
5 19.32 7201
6 18.51 7224
7 (most common) 19.07 5667
Foreign-born white
1 (least common) 22.11 5098
2 21.97 5329
3 21.99 2465
4 21.82 2378
5 22.14 2694
6 21.59 2814
7 (most common) 21.25 4739
African-American
1 (least common) 12.43 7452
2 12.41 6299
3 12.65 2697
4 12.68 2576
5 12.85 2654
6 13.08 2769
7 (most common) 13.56 3527

Table 8b. Mean OCCSCORE value for 1870 Males by Nativity/Race and Name Commonness Scores, excluding non-occupational responses and farmers
Name commonness OCCSCORE mean N
Native-born white
1 (least common) 22.26 11734
2 22.16 12943
3 21.61 5033
4 21.95 4849
5 21.91 4778
6 20.91 4856
7 (most common) 21.80 3698
Foreign-born white
1 (least common) 23.83 4208
2 23.96 4266
3 23.97 1976
4 23.47 1964
5 24.02 2189
6 23.32 2292
7 (most common) 22.70 3949
African-American
1 (least common) 12.17 6368
2 12.14 5389
3 12.44 2324
4 12.47 2226
5 12.68 2317
6 12.94 2404
7 (most common) 13.51 3137

Acknowledgements

This work was supported in part by the University of Minnesota Supercomputing Institute.

References Cited

Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. London: Springer.

Cristianini, Nello and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge: Cambridge University Press.

Ferrie, Joseph. 1996. A New Sample of Males Linked from the Public-Use-Microdata-Sample of the 1850 US Federal Census of Population to the 1860 US Federal Census Manuscript Schedules. Historical Methods 29: 141-156.

Guest, Avery. 1987. Notes from the National Panel Study: Linkage and Migration in the Late Nineteenth Century. Historical Methods 20: 63-77.

Jaro, Matthew A. 1989. Advances in Record Linking Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-20.

Porter, Edward H. and William E. Winkler. 1997. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Census Bureau Research Report RR97/02 (Washington D.C.: U.S. Bureau of the Census).

Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. London: Springer-Verlag.

Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1990, Proceedings of the Section of Survey Research Methods, 354-359.

Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1993, Proceedings of the Section of Survey Research Methods, 274-279.

Winkler, William E. 2002. Methods for Record Linkage and Bayesian Networks. Census Bureau Research Report RRS2002-05 (Washington D.C.: U.S. Bureau of the Census).