IPUMS Linked Representative Samples, 1850-1930
Final Data Release (June 2010)

The Minnesota Population Center is pleased to announce the final release of the IPUMS Linked Representative Samples. The new database links records from the 1880 complete-count database to 1% samples of the 1850 to 1930 U.S. censuses of the population. We have produced data samples for 7 pairs of years: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each of these contains three independent linked samples: one of men, one of women, and one of married couples.

This final release of the database supersedes the preliminary version that was made available beginning in August 2008. The final release differs from the preliminary release in several important ways that are described in more detail below. The most significant improvements were methodological: namely, we used better "training" data and we used a "name commonness" measure in determining linkage confidence. As a result of these methodological improvements, the final release contains many cases that were not present in the preliminary release (some cases were also removed). Overall, the samples have about a sixty- to seventy-percent overlap between the preliminary and final releases.

The final version of the datasets also used better weighting procedures than the preliminary release, resulting in numerous changes to the weight value attached to each case. And the 1880 observations in the final data include many more variables than they did in the preliminary release, including literacy, school attendance, physical infirmities, months unemployed, month of birth, and whether or not the respondent was married in the previous year.

See this FAQ for the latest information on the Linked Representative Samples.

DOWNLOAD LINKED DATA FILES

Dataset basics

All files contain weights that allow users to generate representative population estimates. The universes for the male and couple files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couple files represent all married couples who were in the U.S. and married to one another in both census years. The universe for the linked female files is more complicated; these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage (or remarriage) between the two census years. For this reason, the female files only link women who were single at both time periods, were married to the same person in both censuses, or who transitioned from married to widowed or divorced between time periods. The female files do not link women who entered into marriage between censuses.

The linking strategy relied on five variables that theoretically should not change over time: birth year, place of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and place of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. A strategic decision in the project was to ignore information from other co-resident individuals because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of people whose place of residence did not change (non-movers).

Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. All data comes from IPUMS-USA (in contrast to the preliminary release which used the version of the 1880 complete-count database from the North Atlantic Population Project). Variable descriptions are available via the IPUMS variable availability page.

Formatting of linked data

All linked samples consist of independently linked records and their complete households in the two census years. The primary link has a value of 0 for LINKTYPE. We subsequently link other household members who appear in both households of a primary link (indicated by LINKTYPE = 5); household members for whom no link was established have LINKTYPE = 9. Table 1 gives a few examples from the 1870-1880 male linked sample.

All linked samples contain variables from a sample year and from 1880. In each sample, variables ending with "_1" contain data from the earlier year and variables ending with "_2" contain data from the later year. For instance, for the pre-1880 linked samples (i.e., 1850, 1860 and 1870), the sample year variables end with "_1" and the 1880 variables end with "_2". For the post-1880 linked samples, the 1880 variables end with "_1" and the sample year variables end with "_2".

We have structured each file such that the pernum for the sample year is numbered sequentially upward from 1 for each household. The pernum for the sample year always contains a value equal to or greater than 1 even if the household member was present in 1880 but not in the sample year. However, the pernum for 1880 will have a value of 0 if a household member was present in the sample year but not in 1880. The easiest way to view households in the linked files is to sort in ascending order by the sample year serial, HHSEQ (more on HHSEQ below), and the sample year pernum.
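The suffix convention and the recommended sort order can be sketched in a few lines of Python. This is only an illustration: the lowercase field names follow the column headers in Table 1, the helper names year_suffixes and sort_households are hypothetical, and the records are assumed to have been read into dictionaries already.

```python
def year_suffixes(sample_year):
    """Return (sample-year suffix, 1880 suffix) for a sample paired with 1880."""
    if sample_year == 1880:
        raise ValueError("1880 is the anchor year, not a sample year")
    # The earlier census year always carries "_1" and the later year "_2".
    return ("_1", "_2") if sample_year < 1880 else ("_2", "_1")


def sort_households(records, sample_year):
    """Sort ascending by sample-year serial, HHSEQ, and sample-year pernum."""
    sample_sfx, _ = year_suffixes(sample_year)
    return sorted(
        records,
        key=lambda r: (r["serial" + sample_sfx], r["hhseq"], r["pernum" + sample_sfx]),
    )


# Three rows from the Woodfin household in Table 1, deliberately shuffled.
rows = [
    {"serial_1": 70601, "hhseq": 2, "pernum_1": 4, "linktype": 0},
    {"serial_1": 70601, "hhseq": 1, "pernum_1": 2, "linktype": 5},
    {"serial_1": 70601, "hhseq": 1, "pernum_1": 1, "linktype": 0},
]
ordered = sort_households(rows, 1870)
```

For a post-1880 sample such as 1880-1900, the same helper would sort on serial_2 and pernum_2 instead.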

In the first example in Table 1, the primary link (indicated by LINKTYPE = 0) is the third record in the 1870 household, James H A, who was 10 years old in the 1870 census. Based on consistent race, birthplace, name and age information, we also establish household links for all other members of the 1870 household with the exception of the head of household, who does not appear to have been present in the 1880 household (and this is indicated by LINKTYPE = 9). In addition, the 1880 household has one member, Mary Gillan, a 40-year-old domestic servant, who was not present in 1870.

Table 1. Example of linked data from the 1870-1880 linked sample of males
perwt	serial_1	hhseq	pernum_1	linktype	serial_2	pernum_2	namelast_1	namefrst_1	namelast_2	namefrst_2	imprel_1	relate_2	age_1	age_2
0	6801	1	1	9	3177716	0	ASHBY	JAMES			1	.	42	.
0	6801	1	2	5	3177716	1	ASHBY	MARY J	ASHBY	MARY J.	2	1	30	40
1131.7	6801	1	3	0	3177716	2	ASHBY	JAMES H A	ASHBY	JAMES H.	3	3	10	21
0	6801	1	4	5	3177716	3	ASHBY	MARY J	ASHBY	MARY J.	3	3	9	20
0	6801	1	5	5	3177716	4	ASHBY	ANNIE V	ASHBY	ANNIE B.	3	3	8	19
0	6801	1	6	5	3177716	5	ASHBY	WILLIAM H	ASHBY	WM. H.	3	3	0	10
0	6801	1	7	9	3177716	6			GILLAN	MARY	.	12	.	40
perwt	serial_1	hhseq	pernum_1	linktype	serial_2	pernum_2	namelast_1	namefrst_1	namelast_2	namefrst_2	imprel_1	relate_2	age_1	age_2
437.71	70601	1	1	0	60191	1	WOODFIN	MOSES	WOODFIN	MOSES	1	1	57	67
0	70601	1	2	5	60191	2	WOODFIN	MARY	WOODFIN	MARY A.	2	2	49	61
0	70601	1	3	9	60191	0	WOODFIN	WILLIAM			3	.	17	.
0	70601	1	4	1	60191	0	WOODFIN	MOSES			3	.	15	.
0	70601	1	5	5	60191	3	WOODFIN	LYDIA	WOODFIN	LIDDA A.	3	3	11	21
0	70601	1	6	5	60191	4	WOODFIN	MANDA	WOODFIN	AMANDA C.	3	3	6	16
0	70601	1	7	5	60191	5	WOODFIN	JAMES	WOODFIN	JAMES	3	3	3	12
0	70601	2	1	1	47253	0	WOODFIN	MOSES			1	.	57	.
0	70601	2	2	9	47253	0	WOODFIN	MARY			2	.	49	.
0	70601	2	3	9	47253	0	WOODFIN	WILLIAM			3	.	17	.
1117.6	70601	2	4	0	47253	1	WOODFIN	MOSES	WOODFIN	MOSES	3	1	15	25
0	70601	2	5	9	47253	0	WOODFIN	LYDIA			3	.	11	.
0	70601	2	6	9	47253	0	WOODFIN	MANDA			3	.	6	.
0	70601	2	7	9	47253	0	WOODFIN	JAMES			3	.	3	.
0	70601	2	8	9	47253	2			WOODFIN	SALLIE	.	2	.	23
0	70601	2	9	9	47253	3			WOODFIN	JOHN	.	3	.	3
0	70601	2	10	9	47253	4			WOODFIN	SAMUEL	.	3	.	1
0	70601	2	11	9	47253	5			WOODFIN	ALACE	.	3	.	0
perwt	serial_1	hhseq	pernum_1	linktype	serial_2	pernum_2	namelast_1	namefrst_1	namelast_2	namefrst_2	imprel_1	relate_2	age_1	age_2
0	4301	1	1	9	222598	0	HUBARD	W D			1	.	37	.
0	4301	1	2	5	222598	2	HUBARD	REEBECCA	HUBBARD	REBECA N.	2	2	27	38
0	4301	1	3	9	222598	0	HUBARD	MARY J			3	.	13	.
0	4301	1	4	5	222598	3	HUBARD	WILEY N	HUBBARD	WILLEY NEWTON	3	3	9	20
0	4301	1	5	5	222598	4	HUBARD	WILLIAM	HUBBARD	WILLIAM D.	3	3	7	17
456.85	4301	1	6	0	222598	5	HUBARD	LAFAYETTE	HUBBARD	LAFAYETT M.	3	3	5	15
0	4301	1	7	5	222598	6	HUBARD	JASAPHENE	HUBBARD	JOSAPHINE	3	3	1	11
0	4301	1	8	9	222598	0	PAROMES	ROBERT			12	.	14	.
0	4301	1	9	9	222598	1			HUBBARD	DAVIS	.	1	.	45
0	4301	1	10	9	222598	7			HUBBARD	MARTHA V.	.	3	.	9
0	4301	1	11	9	222598	8			HUBBARD	SUSAN R.	.	3	.	6
0	4301	1	12	9	222598	9			HUBBARD	FANNIE	.	3	.	3

The second household in Table 1 gives an example where we have two primary links in the 1870 household. Since we have two primary links, the 1870 household is written out twice (indicated by HHSEQ), with each version providing links to a different 1880 household. In this case the first primary link is the head of the household, 57-year-old Moses Woodfin. Of the seven household members in 1870, all are also co-resident in 1880 with the exception of William and 15-year-old Moses. However, 15-year-old Moses is also a primary link, indicated by LINKTYPE = 1 in the first version of the household. The second version of the 1870 household (i.e., HHSEQ = 2) shows 15-year-old Moses as the primary link; in 1880 he was 25 years old, living with his wife and three young children.

The third household in Table 1 gives an example of the imprecision of some of our household links. The primary link is Lafayette, and we also establish household links for everyone except the head (W D), the eldest daughter (Mary J) and an unrelated individual (Robert Paromes). However, it is likely that the head of household in 1870, 37-year-old W D, is the same person as the head of household in 1880, 45-year-old Davis. Given our rules for establishing household links, we do not make this link because of first name ambiguity.

Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked in the primary linking process. Once this is done, LINKTYPE and HHSEQ can be ignored. Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.
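A minimal Python sketch of this simplification, assuming the records are held as dictionaries keyed by the lowercase column names used in Table 1 (the function name primary_links_only is hypothetical):

```python
def primary_links_only(records):
    """Keep only records from the primary linking process (LINKTYPE = 0)."""
    return [r for r in records if r["linktype"] == 0]


# Three rows from the Ashby household in Table 1 (1870-1880 males).
household = [
    {"namefrst_1": "JAMES", "linktype": 9, "perwt": 0.0},
    {"namefrst_1": "MARY J", "linktype": 5, "perwt": 0.0},
    {"namefrst_1": "JAMES H A", "linktype": 0, "perwt": 1131.7},
]
primary = primary_links_only(household)
```

Only the primary link carries a non-zero weight in Table 1, so the filtered file retains every weighted case.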

Table 2 gives the number of linked records for the various sample years by nativity/race, and for females by nativity/race and marital status categories. The table shows the limitations of some of our samples, which have small numbers for some population components; these subgroups are also the groups within which we construct weights. The purpose of the weights is to compensate for under- and over-representation, and they are constructed only for the primary links. We construct the weights based on terminal year population characteristics (i.e., 1880 characteristics for the pre-1880 samples and sample year characteristics for the post-1880 samples). Weights are based on an estimate of the "linkable" population. Using the 1870-1880 samples as an example, the linkable population for native-born groups is anyone who was 10 years or older in the 1880 census. However, since we do not have year of immigration information before 1900, for the foreign-born we have to look at the foreign-born in the 1870 census and apply life tables to estimate how many of these individuals would have still been alive in 1880. The difference between this estimate and the actual total in 1880 would be due to immigration between 1870 and 1880.

Table 2. Number of Linked Records, by Sample Year and Linked Population Sub-group
"Males" samples
	1850	1860	1870	1900	1910	1920	1930
Native-born white males	7009	10424	17724	18595	14854	10049	9018
Foreign-born males	299	634	877	1512	990	508	336
African American males	82	235	2181	1336	796	506	352
Total	7390	11293	20782	21443	16640	11063	9706
"Couples" samples (by male characteristics)
	1850	1860	1870	1900	1910	1920	1930
Native-born white males	2133	4537	8858	7745	4644	2105	612
Foreign-born males	227	932	2256	2129	1101	415	105
African American males	6	19	408	180	102	26	7
Total	2366	5488	11522	10054	5847	2546	724
"Females" samples
	1850	1860	1870	1900	1910	1920	1930
Native-born white females, married	1077	2253	4254	3274	1891	793	221
Native-born white females, formerly married	811	951	1192	1161	1124	894	545
Native-born white females, single	468	1495	6699	4239	1407	849	700
Foreign-born females, married	63	215	671	554	317	111	26
Foreign-born females, formerly married	62	90	191	252	231	145	71
Foreign-born females, single	2	30	150	65	40	17	16
African American females, married	8	34	432	164	73	20	2
African American females, formerly married	12	34	152	88	51	34	29
African American females, single	9	35	902	210	40	20	7
Total	2512	5137	14643	10007	5174	2883	1617
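The life-table step for the foreign-born linkable population, described before Table 2, can be illustrated with a short Python sketch. All counts and survival probabilities here are hypothetical placeholders rather than project data, and the function name expected_survivors is an assumption made for illustration.

```python
def expected_survivors(pop_by_age, ten_year_survival):
    """Apply age-specific survival probabilities to an earlier-census population."""
    return sum(count * ten_year_survival[age_group]
               for age_group, count in pop_by_age.items())


# Hypothetical 1870 foreign-born counts and ten-year survival probabilities.
foreign_born_1870 = {"20-29": 1000, "30-39": 800, "40-49": 500}
survival_1870_1880 = {"20-29": 0.90, "30-39": 0.85, "40-49": 0.80}

# Estimated linkable population: 1870 residents expected to be alive in 1880.
linkable = expected_survivors(foreign_born_1870, survival_1870_1880)

# Hypothetical 1880 total; the excess over survivors is attributed to immigration.
observed_1880 = 2500
implied_immigration = observed_1880 - linkable
```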

After identifying the appropriate estimates for linkable subgroups, we multiply by the inverse of the group's linkage rate to get an initial weight that inflates the subgroup's number of linked records to the same number as that in the linkable population. We then calculate the specific weights for a number of weighting variables. Using 1870-1880 males as an example, we first estimate the relationship-to-head weight, which is calculated by the proportion of the linkable population by relationship-to-head categories divided by the proportions for the linked sample. We then apply the relationship-to-head weight to the linked records, and then calculate the proportions for the next weighting variable (in this case individual birthplaces). We repeat this process for 5-year age groups, size of place and occupational categories, with a specific record's weight getting modified with each iteration.
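One way to read the sequential procedure above is sketched below in Python. It is a simplified illustration under stated assumptions, not the project's code: the function name sequential_weights, the variable names, and the toy data are hypothetical, and the population shares are assumed to be known for every category that appears in the linked sample.

```python
from collections import defaultdict


def sequential_weights(linked, population_shares, linkage_rate, variables):
    """Start from the inverse linkage rate, then adjust one variable at a time."""
    weights = [1.0 / linkage_rate] * len(linked)
    for var in variables:
        # Weighted share of each category in the linked sample so far.
        totals = defaultdict(float)
        for rec, w in zip(linked, weights):
            totals[rec[var]] += w
        grand_total = sum(totals.values())
        # Population share divided by the current linked-sample share.
        factors = {cat: population_shares[var][cat] / (tot / grand_total)
                   for cat, tot in totals.items()}
        # Each record's weight is modified at every iteration.
        weights = [w * factors[rec[var]] for rec, w in zip(linked, weights)]
    return weights


# Hypothetical toy data: heads are over-represented among the linked records.
linked = [{"relate": "head"}, {"relate": "head"}, {"relate": "head"},
          {"relate": "child"}]
population_shares = {"relate": {"head": 0.5, "child": 0.5}}
weights = sequential_weights(linked, population_shares,
                             linkage_rate=0.1, variables=["relate"])
```

In this toy case the three heads are weighted down and the single child is weighted up, while the weights still sum to the linkable population of 40.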

This process works fairly well if we have enough records. From Table 2, however, we can see that some of the sub-groups have small numbers. As a result some subgroups have a large range of weight values, with some records representing a relatively large number of records. We deal with this by imposing a minimum and maximum on the weight for all sub-groups. The minimum is 1/5 of the average weight for the subgroup, with the maximum capped at 4 times the average weight for the subgroup. This has little effect on some of the larger groups; for example, under 1 percent of the native-born male links for 1870-1880 are affected by the minimum/maximum rule. However, almost 10 percent of the foreign-born records are either below or above the initial minimum or maximum.
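The minimum/maximum rule amounts to clamping each weight to a band around the subgroup mean. A minimal Python sketch follows; the function name trim_weights and the example weights are hypothetical, and the mean is assumed to be taken over the untrimmed weights of one subgroup.

```python
def trim_weights(weights, floor_ratio=1 / 5, cap_ratio=4):
    """Clamp weights to [mean * floor_ratio, mean * cap_ratio] within a subgroup."""
    mean_w = sum(weights) / len(weights)
    lower, upper = mean_w * floor_ratio, mean_w * cap_ratio
    return [min(max(w, lower), upper) for w in weights]


# Hypothetical subgroup with mean weight 20, so the allowed band is 4 to 80.
subgroup = [1.0, 5.0, 5.0, 5.0, 84.0]
trimmed = trim_weights(subgroup)
```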

Detailed linking procedures

The ability to assess similarity can be enhanced by cleaning and standardizing the source data. The sample and complete-count data has been through a variety of cleaning and logical edits prior to release as part of the IPUMS. Age information, for example, is subject to a variety of consistency checks at the original data collection stage and later in IPUMS processing. Thus we felt no need to further process age information prior to linkage. The name fields, in contrast, receive little processing prior to IPUMS release. IPUMS data have separate fields for given and last name. While the last name field consistently contains a single string, the given name field can contain given and middle name, given name and middle initial, or even a lone first initial. Some enumerators also used abbreviations for common given names, which were transcribed verbatim in the data collection process.

We ultimately decided to do a minimal amount of processing on the surname field. We removed non-alpha characters, but did not attempt to standardize or correct perceived misspellings. We generally took the same approach with the given names in that we were not overly concerned with misspellings. For example, we felt that small variations in names would not be enough to confidently distinguish a true link from a false link. Another factor in our decision was the large number of names in the sample and complete count data. Although some of the variation is caused by the occasional presence of middle initials, when combined with sex, our given name dictionary contained approximately 1.7 million unique strings.

Table 3 shows the 25 most common first names for men, with the FNAME field containing the raw string. When a person had an initial, or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. For the first names listed below, the standard name field differs from XNAME1 in only three cases: "WM." and "WILLIE", which are both standardized as "WILLIAM", and "FRANK", which is standardized as "FRANCIS".

Table 3. The most common first names for men in 1880, before and after standardization
SEX	FNAME	XNAME1	XNAME2	STD_XNAME1	FREQ
M	JOHN	JOHN		JOHN	3297148
M	GEORGE	GEORGE		GEORGE	1311868
M	WILLIAM	WILLIAM		WILLIAM	1161737
M	HENRY	HENRY		HENRY	1156464
M	JAMES	JAMES		JAMES	792520
M	CHARLES	CHARLES		CHARLES	591368
M	THOMAS	THOMAS		THOMAS	474816
M	JOSEPH	JOSEPH		JOSEPH	458782
M	FRANK	FRANK		FRANCIS	382949
M	PETER	PETER		PETER	382598
M	EDWARD	EDWARD		EDWARD	298570
M	ROBERT	ROBERT		ROBERT	241233
M	SAMUEL	SAMUEL		SAMUEL	217773
M	DAVID	DAVID		DAVID	189674
M	JACOB	JACOB		JACOB	186299
M	ALBERT	ALBERT		ALBERT	172205
M	DANIEL	DANIEL		DANIEL	157188
M	WM.	WM		WILLIAM	152013
M	MICHAEL	MICHAEL		MICHAEL	147171
M	ANDREW	ANDREW		ANDREW	137175
M	PATRICK	PATRICK		PATRICK	136865
M	WILLIE	WILLIE		WILLIAM	128333
M	RICHARD	RICHARD		RICHARD	123986
M	LOUIS	LOUIS		LOUIS	115083
M	JOHN W.	JOHN	W	JOHN	113715

Given the large number of unique strings, we standardized only first names with a frequency greater than 100, of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as those listed below in Table 4.

Table 4. The most common abbreviations and diminutives for male first names in 1880, before and after standardization
Raw string	n1	n1 standard	Frequency
Wm.	Wm	William	152013
Willie	Willie	William	128333
Fred	Fred	Frederick	102832
Chas.	Chas	Charles	65007
Joe	Joe	Joseph	63356
Geo.	Geo	George	47135
Charley	Charley	Charles	46206
Thos.	Thos	Thomas	42224
Sam	Sam	Samuel	39799
Charlie	Charlie	Charles	38491
Eddie	Eddie	Edward	36106
Robt.	Robt	Robert	28990
Alex	Alex	Alexander	26199
Jas.	Jas	James	25740
Ben	Ben	Benjamin	24261
Jno.	Jno	John	23679
Tom	Tom	Thomas	21772
Jim	Jim	James	20176
Saml.	Saml	Samuel	14691
Benj.	Benj	Benjamin	14298
Jos.	Jos	Joseph	12072
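The cleaning steps described in this section can be sketched as follows. This is an illustration only: the function names are hypothetical, the lookup dictionary holds just a handful of the Table 4 entries rather than the project's full list, and splitting on whitespace is an assumed parsing rule.

```python
import re

# A few entries from Table 4; the full standardization list is not reproduced.
STANDARD_NAMES = {"WM": "WILLIAM", "WILLIE": "WILLIAM", "CHAS": "CHARLES",
                  "GEO": "GEORGE", "THOS": "THOMAS", "JNO": "JOHN"}


def clean_surname(raw):
    """Remove non-alphabetic characters; misspellings are left untouched."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def parse_given_name(raw):
    """Split a raw given name into (first, middle, standardized first)."""
    tokens = [re.sub(r"[^A-Za-z]", "", t).upper() for t in raw.split()]
    tokens = [t for t in tokens if t]
    first = tokens[0] if tokens else ""
    middle = tokens[1] if len(tokens) > 1 else ""
    # Names without a dictionary entry pass through unchanged.
    return first, middle, STANDARD_NAMES.get(first, first)
```

For example, parse_given_name("JOHN W.") yields ("JOHN", "W", "JOHN"), which matches the last row of Table 3.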

The creation of the linked samples required millions of individual comparisons. Linked individuals were required to have the same sex, race, and birthplace, so the comparison of these variables is straightforward. Evaluating exact matches on name and age information is unambiguous, but we also had to assess the likelihood that near matches were correct links. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on the Freely Extensible Biomedical Record Linkage (FEBRL), software created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.

Input files consisted of individuals with a given race, sex, and birthplace. For example, white males born in Michigan who were in the 1870 one-percent sample were compared to white males born in Michigan who were present in the 1880 complete-count data. Given consistent sex, race and birthplace information, the FEBRL software would compare all records within a seven-year sliding window. Thus, 30-year-olds in 1870 (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not exact, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age. In addition to constructing age similarity scores, FEBRL would also assess name similarity. We used the Jaro-Winkler distance algorithm, which scores the similarity of any two alphabetic strings while allowing for transpositions of particular letters or syllables (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002). Output files consisted of potential links that exceeded preset thresholds for age and name similarity.
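For readers unfamiliar with these comparisons, here is a self-contained Python sketch of the age window and of a textbook Jaro-Winkler similarity. It is not FEBRL's implementation, and several details are assumptions: the prefix scale of 0.1 and maximum prefix of 4 are the conventional defaults, and the expected age is used as the denominator of the age discrepancy because the text does not say which age serves as the base.

```python
def in_age_window(age_earlier, age_later, years_between=10, window=7):
    """True if the later age is within `window` years of the expected age."""
    return abs(age_later - (age_earlier + years_between)) <= window


def age_discrepancy(age_earlier, age_later, years_between=10):
    """Difference from expected age as a proportion of the expected age."""
    expected = age_earlier + years_between
    return abs(age_later - expected) / expected


def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused identical character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix letters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

Under these assumptions, the surnames "bradley" and "bradly" from Table 5 score about 0.97, which illustrates why minor spelling variants remain strong candidates.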

After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning. Our procedures draw heavily on the concept of Support Vector Machines, or SVMs (Vapnik 1995; Cristianini and Shawe-Taylor 2000; Abe 2005). SVMs are not physical machines; they are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University's Department of Computer Science.

SVM construction depends on the existence of training data, which typically consists of a verified set of true and false links. The classifier analyzes the training data, plots them in a multidimensional space, and then constructs a boundary (a hyperplane) between the two classes of records that maximizes the distance between the hyperplane and the nearest data points in both of the classes (i.e., between the true and false links). After SVM construction (which is based on the training data), unclassified records are plotted in this multidimensional space and the end result is a file consisting of potential links and the classifier-produced confidence score. Confidence scores are interpreted dichotomously: a positive score indicates a "true" link and a negative score indicates a "false" link.

At the classifier stage each potential link is evaluated independently, which often results in numerous potential links (from 1880) to a given sample record. We consider these links to be ambiguous, and they are not included in our linked data. Table 5 shows the confidence scores for potential links to John Bradley, a 25-year-old white male born in South Carolina from the 1870 data. Of the 43 potential links, only the top four receive positive confidence scores. Although the potential link with the highest confidence score is an exact match, the other three also have a high degree of similarity. If we had to choose, we would say the exact link is probably the correct link. However, we also feel that the probability that it is the correct link is significantly under 95 percent, and using these types of links would introduce an unacceptable error rate.

Table 5. Output and confidence scores for potential links to John Bradley, 25-year-old South Carolinian, 1870-1880
serial_1	pernum_1	serial_2	pernum_2	name1_1	namelast_clean_1	name1_2	namelast_clean_2	age_1	age_2	confidence
22901	1	976277	1	john	bradley	john	bradley	25	35	1.164
22901	1	193150	1	john	bradley	john	bradly	25	34	1.000
22901	1	8618365	1	john	bradley	john	bradley	25	37	0.999
22901	1	8743061	1	john	bradley	john	bradley	25	38	0.879
22901	1	8741891	1	john	bradley	h	bradley	25	35	-0.995
22901	1	9179845	1	john	bradley	j	bailey	25	35	-0.995
22901	1	9064198	1	john	bradley	john	shandley	25	34	-1.000
22901	1	1055670	1	john	bradley	john	bryan	25	35	-1.000
22901	1	984100	1	john	bradley	john	ragsdalle	25	35	-1.000
22901	1	8654004	1	john	bradley	john	bryante	25	35	-1.001
22901	1	8597775	1	john	bradley	john	bail	25	35	-1.002
22901	1	8578272	1	john	bradley	john	darby	25	36	-1.004
22901	1	8724337	1	john	bradley	john	nalley	25	35	-1.010
22901	1	8574468	1	john	bradley	john	ashley	25	35	-1.010
22901	1	8646083	1	john	bradley	john	rarden	25	35	-1.010
22901	1	8649875	1	john	bradley	john	ashley	25	35	-1.010
22901	1	1163802	1	john	bradley	john	trader	25	35	-1.010
22901	1	8732963	1	john	bradley	john	bryce	25	34	-1.012
22901	1	8584289	1	john	bradley	john	ready	25	36	-1.020
22901	1	8644204	1	john	bradley	john	beasley	25	35	-1.024
22901	1	8668427	1	john	bradley	josiah	bramlet	25	35	-1.026
22901	1	8616451	1	john	bradley	john	bayler	25	33	-1.028
22901	1	8603992	1	john	bradley	john	blake	25	35	-1.028
22901	1	8605006	1	john	bradley	john	boyer	25	35	-1.028
22901	1	205053	1	john	bradley	john	berry	25	35	-1.028
22901	1	5103940	1	john	bradley	john	brownlee	25	36	-1.038
22901	1	8673698	7	john	bradley	john	branch	25	34	-1.045
22901	1	8691338	1	john	bradley	john	clardy	25	34	-1.047
22901	1	8726708	1	john	bradley	john	brazell	25	36	-1.050
22901	1	8572344	1	john	bradley	john	gray	25	35	-1.055
22901	1	8759890	1	john	bradley	john	rainey	25	36	-1.056
22901	1	8577946	1	john	bradley	john	breazeale	25	35	-1.057
22901	1	8671388	1	john	bradley	john	brackman	25	35	-1.071
22901	1	1031068	1	john	bradley	john	batey	25	35	-1.085
22901	1	8649108	1	john	bradley	john	berry	25	36	-1.087
22901	1	8692014	1	john	bradley	john	berry	25	34	-1.087
22901	1	5111582	1	john	bradley	john	bailey	25	34	-1.090
22901	1	8669087	1	john	bradley	john	bailey	25	34	-1.090
22901	1	8666039	1	john	bradley	john	saddler	25	35	-1.092
22901	1	8654395	1	john	bradley	john	mccarley	25	35	-1.096
22901	1	999367	1	john	bradley	john	mcbryde	25	35	-1.108
22901	1	8563426	1	john	bradley	john	brownley	25	35	-1.109
22901	1	8718581	1	john	bradley	john	bolen	25	35	-1.111
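The handling of the John Bradley case can be expressed as a simple rule. The Python sketch below assumes that a link is accepted only when exactly one candidate receives a positive confidence score; that reading of "ambiguous" and the function name resolve_link are assumptions, and the input rows are the first five rows of Table 5.

```python
def resolve_link(candidates):
    """Return the single positively scored candidate, or None otherwise."""
    positives = [c for c in candidates if c["confidence"] > 0]
    # Zero positives means no link; two or more means the link is ambiguous.
    return positives[0] if len(positives) == 1 else None


# First five potential links for John Bradley, taken from Table 5.
john_bradley = [
    {"serial_2": 976277, "confidence": 1.164},
    {"serial_2": 193150, "confidence": 1.000},
    {"serial_2": 8618365, "confidence": 0.999},
    {"serial_2": 8743061, "confidence": 0.879},
    {"serial_2": 8741891, "confidence": -0.995},
]
accepted = resolve_link(john_bradley)
```

Here accepted is None, because four candidates have positive scores and the rule treats the case as ambiguous.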

Final Data Release

The material above describes the process we used for the preliminary linked data release, which took place in Fall 2008. An expanded narrative describing methods will be available in a forthcoming issue of Historical Methods. An important part of the process was a set of links for 1870-1880 and 1880-1900 produced by Pleiades Software. The comparison of our preliminary linked data to the Pleiades linked data disclosed that we had produced accurate linked data, but our process was highly dependent on the precision of the data. More specifically, if the individuals that we attempted to link were accurately enumerated in both census years, we would either make the link or, in the case of multiple potential links, reject the link as ambiguous. But if the correct link was unidentifiable (because of mortality, underenumeration, or misenumeration of linking variable information), we would make an incorrect link if there was another person with similar characteristics in the 1880 complete-count data.

We dealt with this problem by constructing formal measures for the commonness of names. A fairly standard approach in record linkage projects is to construct frequency tables for names, which more or less assess the probability of a correct link. For example, based on frequency tables, record linkers would be more confident in linking someone with the name Roland Marsupial as opposed to John Smith. But given our minimalist approach to name cleaning and standardization, we run into a problem with minor typos and misspellings, which would show up as low frequency names, although many of these names have high similarity to high frequency names (and would in fact show up as potential links to records with high frequency names). Our solution was to construct name commonness scores based on the following: for a given sample record we determined the proportion of records (by race, birthplace, and sex) in the 1880 complete-count data that have a Jaro-Winkler similarity score greater than 0.9. The choice of this threshold is somewhat arbitrary, but based on the preliminary linked data, we rarely linked records that did not exceed this threshold. We also constructed a density of birth measure, which is the proportion of 1880 records for specific birthplaces, by race and sex.
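The commonness measure can be sketched in Python as the share of comparison records that exceed a similarity threshold. To keep the example short and self-contained, the default similarity function is the standard library's difflib ratio, which is a stand-in and not Jaro-Winkler; any function returning a score between 0 and 1 can be passed instead. The pool, the function names, and the choice to compare full names as single strings are hypothetical.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Stand-in string similarity (difflib ratio), not Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def name_commonness(name, pool, similarity=stand_in_similarity, threshold=0.9):
    """Share of names in the pool whose similarity to `name` exceeds the threshold."""
    if not pool:
        return 0.0
    similar = sum(1 for other in pool if similarity(name, other) > threshold)
    return similar / len(pool)


# Hypothetical pool: 1880 records sharing the sample record's race,
# birthplace, and sex.
pool = ["john smith", "john smith", "john smithe", "roland marsupial"]
common = name_commonness("john smith", pool)
rare = name_commonness("roland marsupial", pool)
```

The misspelled "john smithe" counts toward the commonness of "john smith", which is the behavior a plain frequency table would miss.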

The impact of name commonness on the linkage rates can be seen in Table 6. As expected, all three groups have higher linkage rates for records with less common names compared to those with the most common names. The linkage rate differentials within the ethnic/race groups are also extreme. Taking nothing else into account, native-born whites in the least common name category are approximately 30 times more likely to be linked compared to native-born whites in the most common name category.
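The native-born white figures in Table 6 can be checked with a few lines of arithmetic. The sketch below uses only the published rates and counts; because the rates are rounded to one decimal place, the results are approximate.

```python
# (linkage rate in percent, N) for native-born white males, from Table 6,
# ordered from name commonness category 1 (least common) to 7 (most common).
native_born_white = [(19.0, 39797), (17.9, 35783), (10.7, 15355), (7.9, 14612),
                     (4.3, 14736), (2.1, 14281), (0.6, 10765)]

# Ratio of the linkage rate for the least common to the most common names.
ratio = native_born_white[0][0] / native_born_white[-1][0]

# The "Total" row is the N-weighted average of the category rates.
total_n = sum(n for _, n in native_born_white)
pooled_rate = sum(rate * n for rate, n in native_born_white) / total_n
```

The ratio comes to a little under 32, consistent with "approximately 30 times", and the pooled rate rounds to the 12.2 percent shown in the Total row.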

However, there is one other basic factor to take into account when evaluating the name commonness scores. Table 7 presents linkage rates by name commonness categories and by birthplace categories, with birthplace rank determined by the proportion of all 1870 males by place of birth. The basic pattern of a higher linkage rate for less common names is evident regardless of birthplace rank. For records in the least common name category, the birthplace categories do not show much difference in linkage rates. But as we move to the more common name categories differentials between less and more populated states become more extreme. For example, only the most common name categories show a linkage rate significantly lower than the average for all records from the small states.

We expected to see this pattern in the linked data. The pattern largely results from our decision not to link in ambiguous cases (i.e., where we had more than one potential link). But by generating name commonness scores for all records, we have also forced the classifiers not to make links in cases where a name is relatively common, even in the absence of competing potential links. And the tables make clear that we do have a biased sample of linked data; in general, we are more likely to link records from the smaller states, and we are more likely to link records with less common names.

We always assumed that we would deal with linkage differentials by constructing weights for the linked records; that is, all things being equal, linked individuals born in Delaware would have a lower weight than those born in New York. But we also assumed that there would not be an inherent bias between more versus less common names. The construction of the name scores allows us to examine this in a fairly straightforward way. Tables 8a and 8b give mean occupational score for 1870 males by race/nativity and name category. Table 8a excludes non-occupational responses and the distribution is relatively flat. Table 8b excludes non-occupational responses and farmers, and we also see relatively little variation among name categories.

Table 6. Linkage Rate for 1870 Males by Nativity/Race and Name Commonness Scores
	Name commonness	Linkage rate	N
Native-born white
	1 (least common)	19.00%	39797
	2	17.90%	35783
	3	10.70%	15355
	4	7.90%	14612
	5	4.30%	14736
	6	2.10%	14281
	7 (most common)	0.60%	10765
	Total	12.20%	145329
Foreign-born white
	1 (least common)	4.70%	6081
	2	5.60%	6129
	3	3.40%	2821
	4	2.30%	2751
	5	1.60%	3052
	6	0.90%	3162
	7 (most common)	0.20%	5372
	Total	3.00%	29368
African-American
	1 (least common)	6.00%	12525
	2	10.10%	10445
	3	9.90%	4528
	4	8.40%	4391
	5	4.80%	4499
	6	2.60%	4801
	7 (most common)	1.30%	6145
	Total	6.40%	47334

Table 7. Linkage Rate for Native-Born 1870 Males by Birthplace Rank (number of males by birthplace)
and Name Commonness Scores
Birthplace rank	Name commonness	Linkage rate	N
1 (smallest)	1 (least common)	17.80%	5590
	2	26.60%	5140
	3	21.60%	2046
	4	19.50%	2112
	5	14.10%	2121
	6	8.90%	2288
	7 (most common)	2.70%	1858
	Total	17.80%	21155
2	1 (least common)	20.70%	11152
	2	23.70%	10013
	3	16.20%	4490
	4	12.30%	4186
	5	5.70%	4379
	6	1.90%	4286
	7 (most common)	0.30%	3473
	Total	14.90%	41979
3	1 (least common)	21.00%	12859
	2	17.10%	11039
	3	7.80%	4761
	4	4.20%	4521
	5	1.70%	4526
	6	0.40%	4277
	7 (most common)	0.10%	3450
	Total	11.60%	45433
4 (largest)	1 (least common)	15.20%	10196
	2	8.10%	9591
	3	2.70%	4058
	4	1.10%	3793
	5	0.30%	3710
	6	0.10%	3430
	7 (most common)	0.05%	1984
	Total	6.80%	36762

Table 8a. Mean OCCSCORE value for 1870 Males by Nativity/Race and Name Commonness Scores,
excluding non-occupational responses
	Name commonness	OCCSCORE mean	N
Native-born white
	1 (least common)	19.32	18200
	2	19.40	20505
	3	19.09	7603
	4	19.28	7257
	5	19.32	7201
	6	18.51	7224
	7 (most common)	19.07	5667
Foreign-born white
	1 (least common)	22.11	5098
	2	21.97	5329
	3	21.99	2465
	4	21.82	2378
	5	22.14	2694
	6	21.59	2814
	7 (most common)	21.25	4739
African-American
	1 (least common)	12.43	7452
	2	12.41	6299
	3	12.65	2697
	4	12.68	2576
	5	12.85	2654
	6	13.08	2769
	7 (most common)	13.56	3527

Table 8b. Mean OCCSCORE value for 1870 Males by Nativity/Race and Name Commonness Scores, excluding non-occupational responses and farmers
	Name commonness	OCCSCORE mean	N
Native-born white
	1 (least common)	22.26	11734
	2	22.16	12943
	3	21.61	5033
	4	21.95	4849
	5	21.91	4778
	6	20.91	4856
	7 (most common)	21.80	3698
Foreign-born white
	1 (least common)	23.83	4208
	2	23.96	4266
	3	23.97	1976
	4	23.47	1964
	5	24.02	2189
	6	23.32	2292
	7 (most common)	22.70	3949
African-American
	1 (least common)	12.17	6368
	2	12.14	5389
	3	12.44	2324
	4	12.47	2226
	5	12.68	2317
	6	12.94	2404
	7 (most common)	13.51	3137

Acknowledgements

This work was supported in part by the University of Minnesota Supercomputing Institute.

References Cited

Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. London: Springer.

Cristianini, Nello and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge: Cambridge University Press.

Ferrie, Joseph. 1996. A New Sample of Males Linked from the Public-Use-Microdata-Sample of the 1850 US Federal Census of Population to the 1860 US Federal Census Manuscript Schedules. Historical Methods 29: 141-156.

Guest, Avery. 1987. Notes from the National Panel Study: Linkage and Migration in the Late Nineteenth Century. Historical Methods 20: 63-77.

Jaro, Matthew A. 1989. Advances in Record Linking Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-20.

Porter, Edward H. and William E. Winkler. 1997. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Census Bureau Research Report RR97/02 (Washington D.C.: U.S. Bureau of the Census).

Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. London: Springer-Verlag.

Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1990, Proceedings of the Section of Survey Research Methods, 354-359.

Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1993, Proceedings of the Section of Survey Research Methods, 274-279.

Winkler, William E. 2002. Methods for Record Linkage and Bayesian Networks. Census Bureau Research Report RRS2002-05 (Washington D.C.: U.S. Bureau of the Census).
