IPUMS Linked Representative Samples, 1850-1930
Preliminary Data Release (August 2008)

The Minnesota Population Center is pleased to announce the preliminary release of the IPUMS Linked Representative Samples. The new database links cases from the 1880 census to 1% samples of all other censuses of 1850-1930. The project takes advantage of a complete transcription of the 1880 census of the United States, an extraordinary new data source produced and provided by The Church of Jesus Christ of Latter-day Saints.

The IPUMS Linked Representative Samples have two key differences from previous efforts to make longitudinal census databases. First, the central goal of previous efforts has been to maximize the proportion of the population that is linked. Our primary goal, on the other hand, has been to minimize selection bias and maximize representativeness of the linked cases. This approach has required a very conservative linking strategy. Second, the IPUMS linked data files provide far more cases than have ever before been available. Previous linkage projects have provided about 5,000 linked cases each (Ferrie 1996; Guest 1987). The IPUMS linked datasets contain information on nearly 500,000 people at two points in time, with one observation from the 1880 census and a second observation from one other census between 1850 and 1930.

Available data files

We have produced data samples for 7 pairs of years: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each pair of years contains three independent linked samples: one of men, one of women, and one of married couples.

All files contain weights that allow users to generate representative population estimates. The universes of people represented in the male files and couples files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couples files represent all married couples who were in the U.S. and married to one another in both census years. The universe of people represented in the linked female files is more complicated: these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage between the two census years. For this reason, the female files represent only women who were single at both time periods, were married to the same person at both time periods, or transitioned from married to widowed or divorced between time periods. The female files do not represent women who entered into marriage between time periods.

Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. The 1880 data comes from the North Atlantic Population Project (NAPP). Data from the other years comes from the IPUMS. Variable descriptions for 1880 variables (i.e., those ending in "_80") are available via the NAPP site, at http://www.nappdata.org/napp-action/variables/group. Variable descriptions for other variables are available via the IPUMS variable availability page, at http://usa.ipums.org/usa-action/variables/group.

There are three variables that are unique to the Linked Representative Samples: HHSEQ, LINKTYPE, and LINKWT. HHSEQ is used in combination with SERIAL to identify all members of the linked person's household in both census years. This variable was necessary because households with more than one linked person were written out multiple times. LINKTYPE specifies why each case was included in the Linked Data Sample. LINKWT indicates the weight value assigned to each linked case. These are all weighted samples, and the LINKWT variable must be used to generate representative population estimates. Each of these variables is described in more detail below.

The table below provides data files in SAS, SPSS, and Stata format.

Data file format

Years      SAS                  SPSS                 Stata
1850-1880  men, women, couples  men, women, couples  men, women, couples
1860-1880  men, women, couples  men, women, couples  men, women, couples
1870-1880  men, women, couples  men, women, couples  men, women, couples
1880-1900  men, women, couples  men, women, couples  men, women, couples
1880-1910  men, women, couples  men, women, couples  men, women, couples
1880-1920  men, women, couples  men, women, couples  men, women, couples
1880-1930  men, women, couples  men, women, couples  men, women, couples

All samples, all years (in a *.zip archive)
           all samples          all samples          all samples

Basic linking strategy

Each linked sample was produced by comparing cases from the 1880 complete-count (100%) data file to those from a sample of the U.S. population in another census year. For example, the 1870-1880 linked samples use the 1880 100% data and a 1% sample from the 1870 U.S. population census. More information on the 1880 complete-count database is available at http://www.nappdata.org. More information on the 1850-1930 sample data is available at http://usa.ipums.org/usa/sampdesc.shtml.

The linking strategy relied on five variables that should theoretically not change over time: birth year, state of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and state of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. Age was allowed to be up to seven years higher or lower than would be expected.

One of the major strategic decisions in the project was to ignore information from other co-resident individuals, because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of non-movers.

These strategies were designed to produce a conservative but representative set of linked data. We are confident that the links provided in these data are solid, and that, taken as a whole, the database is representative of those surviving from one census to the next.

The project linking strategy is described in more detail in this article: Steven Ruggles, "Linking Historical Censuses: A New Approach," History and Computing 14:1-2 (2006) 213-224.

Variables unique to the Linked Representative Samples: LINKTYPE, HHSEQ, LINKWT

The Linked Representative Samples contain many individuals other than those who were linked via the systematic linking procedures. Specifically, the files contain all people who lived in a linked person's household during either census year. Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked via our systematic linking procedures. Once this is done, LINKTYPE and HHSEQ can be ignored.
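The simplification described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the data have been read into a list of records keyed by variable name. The rows are hypothetical and the LINKWT values are invented; LINKWT is left as None on non-primary rows only because the sketch does not use it there.

```python
# Hypothetical rows for illustration; the LINKWT values are invented.
rows = [
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 1, "LINKTYPE": 5, "LINKWT": None},
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 6, "LINKTYPE": 0, "LINKWT": 250.0},
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 7, "LINKTYPE": 9, "LINKWT": None},
    {"SERIAL": 301, "HHSEQ": 1, "PERNUM": 3, "LINKTYPE": 0, "LINKWT": 180.0},
    {"SERIAL": 301, "HHSEQ": 1, "PERNUM": 6, "LINKTYPE": 1, "LINKWT": None},
    {"SERIAL": 301, "HHSEQ": 2, "PERNUM": 6, "LINKTYPE": 0, "LINKWT": 310.0},
]


def primary_links(records):
    """Drop every case where LINKTYPE is greater than 0."""
    return [r for r in records if r["LINKTYPE"] == 0]


def weighted_count(records):
    """Estimated number of persons represented, using LINKWT as the weight."""
    return sum(r["LINKWT"] for r in records)


linked = primary_links(rows)
estimate = weighted_count(linked)
```

Because a person with LINKTYPE 1 reappears with LINKTYPE 0 in a repeated household, filtering on LINKTYPE equal to 0 keeps each primary linked person exactly once.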

Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.

For each linked person, the SERIAL and HHSEQ variables identify all of the linked person's household members in both time periods. LINKTYPE specifies why each of those persons was included in the dataset. The meanings of the LINKTYPE and HHSEQ codes are described briefly below. Table 1 and the paragraphs below it show in more detail how these variables are used.

LINKTYPE: reason individual was included in the Linked Data Sample

0: Primary linked person.

1: Additional linked person who is also a primary link. This person will have a LINKTYPE value of 0 in a repeat of the non-1880 household.

5: Additional linked person who was only linked after the systematic linkage process. This person is not identified as a primary link in any household. These links were made for the convenience of researchers interested in the stability of primary linked persons' households.

9: Unlinked person, present in the household of a primary linked person during only one census year.

HHSEQ: Linked Data Sample household identification number, to be used in conjunction with SERIAL. Each primary linked person has a unique combination of SERIAL and HHSEQ that is shared by all members of the linked person's households in both census years. HHSEQ values range from 1 to 9.

Table 1 shows a selection of households from the file linking men in the 1870 1% sample to those in the 1880 100% file. The paragraphs below discuss the households in this table in detail, in order to explain the LINKTYPE and HHSEQ variables. The combination of SERIAL and HHSEQ is common to all members of the linked person's 1870 and 1880 households. All household members from both years are assigned a unique value in PERNUM.

Table 1. The first two households in the 1870-1880 linked data sample of men

SERIAL  HHSEQ  PERNUM  LINKTYPE  SERIAL_80  PERNUM_80  NAMELAST  NAMEFRST  NAMELAST_80  NAMEFRST_80  IMPREL  RELATE_80  AGE  AGE_80
201     1      1       5         311        1          MORGAN    SARAH T   MORGAN       SARAH        1       101        42   52
201     1      2       5         311        7          MORGAN    MARY F    MIMS         MARY         3       301        16   26
201     1      3       5         311        2          MORGAN    HENRY T   MORGAN       HENRY        3       301        13   23
201     1      4       5         311        3          MORGAN    ELLEN E   MORGAN       ELLEN        3       301        11   20
201     1      5       5         311        4          MORGAN    CARRIE S  MORGAN       CARRIE       3       301        7    17
201     1      6       0         311        5          MORGAN    JEREMIAH  MORGAN       JERREMINE    3       301        2    12
201     1      7       9         311        6                              MORGAN       AURELIA      0       301        0    9
201     1      8       9         311        8                              MIMS         MORGAN       0       901        0    7
201     1      9       9         311        9                              MIMS         FANNY        0       901        0    5
201     1      10      9         311        10                             MIMS         MARY E       0       901        0    3
201     1      11      9         311        11                             MIMS         ALEXANDER D  0       901        0    1
301     1      1       9         9265363    0          COUNTS    WILLIAM                             1       0          55   0
301     1      2       9         9265363    0          COUNTS    ELIZA J                             2       0          57   0
301     1      3       0         9265363    1          COUNTS    THOMAS W  COUNTS       THOMAS       3       101        17   27
301     1      4       9         9265363    0          COUNTS    MARY E                              3       0          10   0
301     1      5       9         9265363    0          COUNTS    MARTHA E                            3       0          7    0
301     1      6       1         9265363    0          NEWMAN    DAVID                               12      0          21   0
301     1      7       9         9265363    2                              COUNTS       FRANSINE     0       201        0    27
301     1      8       9         9265363    3                              COUNTS       TOCHIE       0       301        0    5
301     2      1       9         9335742    0          COUNTS    WILLIAM                             1       0          55   0
301     2      2       9         9335742    0          COUNTS    ELIZA J                             2       0          57   0
301     2      3       1         9335742    0          COUNTS    THOMAS W                            3       0          17   0
301     2      4       9         9335742    0          COUNTS    MARY E                              3       0          10   0
301     2      5       9         9335742    0          COUNTS    MARTHA E                            3       0          7    0
301     2      6       0         9335742    1          NEWMAN    DAVID     NEWMAN       DAVID        12      101        21   33
301     2      7       9         9335742    2                              NEWMAN       REBECCA      0       201        0    27
301     2      8       9         9335742    3                              NEWMAN       DAVID        0       301        0    13
301     2      9       9         9335742    4                              NEWMAN       WILLIAM      0       301        0    10
301     2      10      9         9335742    5                              NEWMAN       LIZZIE       0       301        0    8
301     2      11      9         9335742    6                              NEWMAN       ANNA         0       301        0    6
301     2      12      9         9335742    7                              NEWMAN       JOHN         0       301        0    4

In the first household in Table 1 (SERIAL = 201), Jeremiah Morgan (PERNUM = 6) is a primary link established between the 1870 1% sample and the 1880 100% data file. Jeremiah's status as the primary linked person is indicated by the LINKTYPE value of 0.

Persons 1-5 in Morgan's household also appear to have been present in both census years. As a convenience to researchers, we have identified these additional household links when possible, even though these links were not established through our systematic procedures used in the linked male samples. These additional links are indicated with a LINKTYPE value of 5. We identified additional household links by using information on sex, age, and name similarity within households that contained a primary link.

All other persons in SERIAL 201 were present in only one year (1880, in this case), and thus have a LINKTYPE value of 9. Note that although there are only six records in the 1870 household, PERNUM for this household goes to 11. This is to account for members of the 1880 household who were not present in 1870. The original order of these persons in their 1880 household is available in PERNUM_80. For example, Aurelia Morgan, who was not yet born in 1870, has a PERNUM_80 value of 6. This means that she was the 6th person listed in the 1880 household, which can be identified in the SERIAL_80 variable.

In the second household (SERIAL = 301), Thomas W. Counts (PERNUM = 3) is identified as the primary linked person, with a LINKTYPE value of 0. Counts was a 17-year-old in 1870. By 1880, he had left home and was married with a 5-year-old son, Tochie. Since Tochie was not present in 1870, his LINKTYPE value is 9.

This household also includes another primary link, David Newman (SERIAL = 301, PERNUM = 6). Newman's LINKTYPE value of 1 indicates that he is an additional primary link in the household with SERIAL number 301. A LINKTYPE value of 1 also indicates that the 1870 household will be written out again. In this case, the second iteration of SERIAL = 301 includes a repetition of the 1870 household, as well as all members of Newman's household in 1880.

In this second instance of SERIAL 301, HHSEQ is 2 and Newman's LINKTYPE value is 0. Since Newman and Counts lived apart in 1880, HHSEQ 2 of SERIAL 301 (Newman's 1870 and 1880 households) has a different set of 1880 individuals than did HHSEQ 1 of SERIAL 301 (Counts's 1870 and 1880 households). If there had been third and fourth primary links established in 1870 SERIAL 301, those persons would have been written out in additional households with SERIAL number 301 and HHSEQ values 3 and 4.
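The way SERIAL and HHSEQ work together can be illustrated with a short Python sketch. It uses a handful of rows copied from Table 1 (SERIAL, HHSEQ, PERNUM, LINKTYPE, and the person's name) and groups them into linked households. The function names are hypothetical, and the check for exactly one primary link per group is an assumption based on the male file shown here.

```python
from collections import defaultdict

# (SERIAL, HHSEQ, PERNUM, LINKTYPE, name), copied from selected rows of Table 1.
table1_rows = [
    (301, 1, 3, 0, "COUNTS THOMAS W"),
    (301, 1, 6, 1, "NEWMAN DAVID"),
    (301, 1, 8, 9, "COUNTS TOCHIE"),
    (301, 2, 3, 1, "COUNTS THOMAS W"),
    (301, 2, 6, 0, "NEWMAN DAVID"),
    (301, 2, 7, 9, "NEWMAN REBECCA"),
]


def households(rows):
    """Group person records by the (SERIAL, HHSEQ) combination."""
    groups = defaultdict(list)
    for serial, hhseq, pernum, linktype, name in rows:
        groups[(serial, hhseq)].append((pernum, linktype, name))
    return dict(groups)


def primary_person(members):
    """Return the name of the single LINKTYPE 0 person in a household group."""
    primaries = [name for _, linktype, name in members if linktype == 0]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary link per group")
    return primaries[0]


groups = households(table1_rows)
```

Grouping on SERIAL alone would merge the two copies of the 1870 household; adding HHSEQ keeps Counts's and Newman's linked households separate.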

LINKWT: weight value for Linked Data Sample. LINKWT values range up to 5000.

THE LINKWT VARIABLE WAS CORRECTED IN ALL SAMPLES ON AUGUST 11, 2009. DATA DOWNLOADED PRIOR TO THAT DATE SHOULD BE RE-DOWNLOADED FROM THE LINKS ABOVE.

LINKWT indicates the number of persons represented by each linked case. These weights are designed to produce population estimates for the universe of potential links in the sample's terminal year. For example, in the male 1880-1900 linked sample, LINKWT produces population estimates of males who were at least 20 years old in 1900, and who were in the United States in 1880. LINKWT achieves this representativeness both by inflating the value of all cases (since the cases come from a 1% sample), and by applying higher weight values to some cases than to others.

The creation of LINKWT was necessary, because some records were clearly more likely to be linked than others. For instance, persons from a birth-state with a small population always had a higher likelihood of successful linkage than persons from a birth-state with a large population. The larger the population for a place of birth, the more likely it was that the linkage engine would find more than one potential link for a given record in the sample data. When the linkage procedures revealed more than one high-quality potential link, no link was made and the case was discarded.

Persons unrelated to others in the household were also less likely to be linked. These include persons such as farm laborers and boarders. The lower linkage rate for these persons was at least partly due to the fact that household heads (who were usually the ones who spoke with census officials) would be less likely to have accurate age and birthplace information for the unrelated individual. Such "enumerator-respondent bias" was probably also more common among migrants, which would lead to a lower linkage rate for areas where interstate migrants were more numerous, such as larger cities.

To account for these and other types of biases, the LINKWT variable adjusts the value of each case based on five characteristics: age (divided into 5-year age groups), region of residence, population size of place of residence (divided into the 18 categories in SIZEPL), relationship to household head (separating those who were related from those who were not), and state or country of birth. The female samples are also adjusted for marital status.

The weights are calculated from the population data file for the linked sample's terminal year. For the 1880-1900 male linked sample, for example, we calculated the proportion of the 1900 population in a given category and divided it by the proportion of the linked population in that category. If 20 to 24 year olds represent 10 percent of the male population ages 20 and up, but only 5 percent of the linked males, linked men in this age group receive an age-weight adjustment of 2. Ultimately each linked record receives a specific weight adjustment for each of the weight adjustment variables listed above. The final weight is the product of these individual weight adjustments, multiplied by the sample record's original PERWT value (which is almost always 100, since these are generally 1% samples). In the couples file, both members of the couple receive the same LINKWT value, which was derived from the man's characteristics.
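The arithmetic in the paragraph above can be written out as a short Python sketch. The function names are hypothetical, and every adjustment other than the age adjustment is set to 1.0 purely for illustration; this is not the project's production code.

```python
from math import prod


def adjustment(population_share, linked_share):
    """Ratio of a category's share of the terminal-year population
    to its share of the linked cases."""
    return population_share / linked_share


def final_weight(adjustments, perwt=100):
    """Product of the individual adjustments, scaled by the record's PERWT."""
    return prod(adjustments) * perwt


# Example from the text: 20 to 24 year olds are 10 percent of the
# population but only 5 percent of the linked males.
age_adj = adjustment(0.10, 0.05)

# The remaining adjustments are assumed to be 1.0 for this illustration.
weight = final_weight([age_adj, 1.0, 1.0, 1.0, 1.0])
```

With these inputs the age adjustment is 2, so a record with a PERWT of 100 and no other adjustments would carry a weight of 200.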

Detailed linking procedures

Standardizing the first name strings

Names are the most important and problematic variables in any linking project. For instance, researchers relying on name information need to be aware of the inconsistent use of abbreviations, initials, diminutives, and of course simple mis-spellings. These are all significant problems in nineteenth-century census data, particularly for first names. Our approach to these issues was to standardize first name strings before beginning the linking process.

The first step in our name standardization was to construct "dictionaries" of first names, which contained all unique name strings from the 1880 complete-count data and the various samples. This resulted in approximately 1.7 million unique strings for first names. Table 2 shows the 25 most common first-name strings for men, with the FNAME field containing the raw string. When a person had an initial or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. Most of the names listed below are unchanged in the standard name field; the exceptions are "FRANK", which is standardized as "FRANCIS", and "WM." and "WILLIE", which are both standardized as "WILLIAM".

Table 2. The most common first names for men in 1880, before and after standardization

SEX  FNAME    XNAME1   XNAME2  STD_XNAME1  FREQ
M    JOHN     JOHN             JOHN        3297148
M    GEORGE   GEORGE           GEORGE      1311868
M    WILLIAM  WILLIAM          WILLIAM     1161737
M    HENRY    HENRY            HENRY       1156464
M    JAMES    JAMES            JAMES       792520
M    CHARLES  CHARLES          CHARLES     591368
M    THOMAS   THOMAS           THOMAS      474816
M    JOSEPH   JOSEPH           JOSEPH      458782
M    FRANK    FRANK            FRANCIS     382949
M    PETER    PETER            PETER       382598
M    EDWARD   EDWARD           EDWARD      298570
M    ROBERT   ROBERT           ROBERT      241233
M    SAMUEL   SAMUEL           SAMUEL      217773
M    DAVID    DAVID            DAVID       189674
M    JACOB    JACOB            JACOB       186299
M    ALBERT   ALBERT           ALBERT      172205
M    DANIEL   DANIEL           DANIEL      157188
M    WM.      WM               WILLIAM     152013
M    MICHAEL  MICHAEL          MICHAEL     147171
M    ANDREW   ANDREW           ANDREW      137175
M    PATRICK  PATRICK          PATRICK     136865
M    WILLIE   WILLIE           WILLIAM     128333
M    RICHARD  RICHARD          RICHARD     123986
M    LOUIS    LOUIS            LOUIS       115083
M    JOHN W.  JOHN     W       JOHN        113715

Given the large number of unique strings, we standardized only first names that appeared more than 100 times (evident in the FREQ column), of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as those listed below in Table 3.

Table 3. The most common abbreviations and diminutives for male first names in 1880, before and after standardization

SEX  FNAME    XNAME1   XNAME2  STD_XNAME1  FREQ
M    CHAS.    CHAS             CHARLES     65007
M    JOE      JOE              JOSEPH      63356
M    GEO.     GEO              GEORGE      47135
M    CHARLEY  CHARLEY          CHARLES     46206
M    THOS.    THOS             THOMAS      42224
M    SAM      SAM              SAMUEL      39799
M    CHARLIE  CHARLIE          CHARLES     38491
M    CHARLY   CHARLY           CHARLES     8852
M    DICK     DICK             RICHARD     8663
M    BOB      BOB              ROBERT      8460
M    ANDY     ANDY             ANDREW      8262
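The parse-then-look-up pattern shown in Tables 2 and 3 can be sketched in Python as follows. The lookup contains only entries that appear in those tables, and the parsing rule (strip periods, split on whitespace, treat the first token as XNAME1 and the rest as XNAME2) is an assumption made for illustration; the project's actual parsing rules are not documented here.

```python
# Lookup built only from rows shown in Tables 2 and 3.
STANDARD_NAMES = {
    "WM": "WILLIAM", "WILLIE": "WILLIAM", "FRANK": "FRANCIS",
    "CHAS": "CHARLES", "CHARLEY": "CHARLES", "CHARLIE": "CHARLES",
    "CHARLY": "CHARLES", "JOE": "JOSEPH", "GEO": "GEORGE",
    "THOS": "THOMAS", "SAM": "SAMUEL", "DICK": "RICHARD",
    "BOB": "ROBERT", "ANDY": "ANDREW",
}


def parse_first_name(fname):
    """Split a raw first-name string into (XNAME1, XNAME2).

    Assumed rule: periods are dropped, the first token becomes XNAME1,
    and any remaining tokens become XNAME2.
    """
    parts = fname.upper().replace(".", " ").split()
    xname1 = parts[0] if parts else ""
    xname2 = " ".join(parts[1:])
    return xname1, xname2


def standardize(fname):
    """Return the standardized form of XNAME1, or XNAME1 itself if unlisted."""
    xname1, _ = parse_first_name(fname)
    return STANDARD_NAMES.get(xname1, xname1)
```

For example, standardize("WM.") gives "WILLIAM", and parse_first_name("JOHN W.") gives ("JOHN", "W"), matching the corresponding rows of Table 2.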

Identifying potential links

The creation of the linked samples required millions of individual comparisons. Linked persons were required to have the same sex, race, and birthplace, so the comparison of these variables was straightforward. But since more tolerance was permitted with the name and age variables, many potential links required more complex decision-making. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on Freely Extensible Biomedical Record Linkage (FEBRL), software created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.

A typical input file to the FEBRL software would contain individuals with a given race, sex, birthplace, and--for the married samples--marital status. For example, for the creation of the 1870–1880 linked male sample, the software was tasked with comparing two groups of people: white males born in Michigan who were in the 1870 sample, and white males born in Michigan who were present in the 1880 data.
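Restricting comparisons to records that share race, sex, and birthplace is a form of what the record linkage literature calls blocking. The Python sketch below illustrates the idea with hypothetical field names and invented records; it is not the FEBRL interface.

```python
from collections import defaultdict


def block_key(record):
    """Fields that must match exactly before two records are compared."""
    return (record["race"], record["sex"], record["birthplace"])


def candidate_pairs(sample_records, full_count_records):
    """Yield (sample, 1880) pairs that fall in the same block."""
    blocks = defaultdict(list)
    for rec in full_count_records:
        blocks[block_key(rec)].append(rec)
    for rec in sample_records:
        for other in blocks.get(block_key(rec), []):
            yield rec, other


# Invented records for illustration.
sample_1870 = [
    {"id": "a", "race": "white", "sex": "M", "birthplace": "Michigan"},
]
data_1880 = [
    {"id": "x", "race": "white", "sex": "M", "birthplace": "Michigan"},
    {"id": "y", "race": "white", "sex": "M", "birthplace": "Ohio"},
]

pairs = list(candidate_pairs(sample_1870, data_1880))
```

Only the Michigan-born record in the 1880 data is paired with the sample case; the Ohio-born record is never compared.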

Given these constants, the FEBRL software made comparisons based on age, first name, and last name. Age was allowed to vary up to seven years in either direction. Thus, 30-year-olds in the 1870 sample (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not perfect, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age in the non-1880 year.
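The age rule can be expressed as a small Python sketch. The function names are hypothetical, and the treatment of a sample age of zero (raising an error) is an assumption, since the text does not say how that case was handled or how FEBRL converted the proportion into its similarity score.

```python
def expected_age_1880(age_sample, sample_year):
    """Age the person would be expected to have in 1880."""
    return age_sample + (1880 - sample_year)


def age_window(age_sample, sample_year, tolerance=7):
    """Lowest and highest 1880 ages that are eligible for comparison."""
    expected = expected_age_1880(age_sample, sample_year)
    return expected - tolerance, expected + tolerance


def age_discrepancy(age_sample, age_1880, sample_year):
    """Difference from the expected age as a proportion of the person's
    age in the non-1880 year."""
    if age_sample <= 0:
        raise ValueError("proportion is undefined for a sample age of 0")
    expected = expected_age_1880(age_sample, sample_year)
    return abs(age_1880 - expected) / age_sample
```

For the example in the text, age_window(30, 1870) returns (33, 47). The same functions work for later samples: a 50-year-old in the 1900 sample has an expected 1880 age of 30.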

As was true with age, we did not set the software to look for only exact name matches. First names and last names had already been standardized to account for common abbreviations and diminutives (as described above). The quantification of name comparisons relied on the "Jaro-Winkler distance algorithm," which takes a mathematical approach to dealing with transpositions of particular letters or syllables in comparing the similarity of any two alphabetic strings (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002).
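For readers unfamiliar with the measure, the following Python sketch implements the commonly published form of the Jaro-Winkler comparison. The prefix scale of 0.1 and maximum prefix length of 4 are the conventional defaults and are assumptions here; the exact variant and parameters used in the FEBRL runs are not stated in this document.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

Under these assumptions, jaro_winkler("bradley", "bradly") is about 0.971, while an unrelated surname pair such as "bradley" and "gray" scores much lower.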

Using these approaches for comparing ages and names, the FEBRL software generated "similarity scores" for each of the three variables, for every comparison made (i.e., often thousands of comparisons for each person). The next step was to convert these millions of similarity scores into either true or false links.

Distinguishing good links from bad links

After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning.

Our procedures draw heavily on the concept of Support Vector Machines, or SVMs (Vapnik 1995; Cristianini and Shawe-Taylor 2000; Abe 2005). SVMs are not physical machines; they are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University's Department of Computer Science.

In our case, there were two main inputs to the SVM software. The key input was our database of "similarity scores" of age and name variables, representing millions of individual person-to-person comparisons. The SVM was tasked with reviewing each comparison and classifying it as true or false.

The second input to the SVM was "training" data that had already been classified properly. To create this training set, data entry operators considered a random sample of potential links and coded each one as either good or bad. Operators were not permitted to view the original census forms; it was important that they have access only to the same information that would be input to the SVM. When a majority of operators classified a potential link as a "yes", it was coded as a "yes" in the training data. The remainder of cases were coded as "no".
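The majority rule for building the training data can be stated as a few lines of Python. This sketch assumes a strict majority is required, so that a tie is coded "no"; the text does not say how ties were handled.

```python
def training_label(operator_votes):
    """Code a potential link "yes" only when a strict majority voted "yes"."""
    yes_votes = sum(1 for vote in operator_votes if vote == "yes")
    return "yes" if yes_votes > len(operator_votes) / 2 else "no"
```

With three operators, two "yes" votes produce a "yes" label; with four operators, a two-to-two split produces "no" under this assumption.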

Using the training data and the "similarity scores" created by FEBRL, we then ran our potential links through the LIBSVM software. The end result of the SVM classifier process was a file consisting of all potential links and the SVM-produced "confidence score," where a positive score meant a "true" link and a negative score meant a "false" link.

The SVM classification often resulted in more than one "true" link being made to a given sample record. Whenever a person from a non-1880 sample had more than one true link in the 1880 data, we excluded the case from the IPUMS Linked Representative Samples. This meant that there were many cases where we declined to make a link even though there may have been only one perfect match in the 1880 population. Table 4 below shows such a case. The first John Bradley was not linked, even though he is the only candidate with exactly the same name as the sample case and exactly the expected age (35 in 1880). Our experience with nineteenth-century census data suggests that this type of conservative approach is necessary.

Table 4. Output and confidence scores for potential links to John Bradley, 25-year old South Carolinian, 1870-1880

xserial_70 pernum_70 xserial_80 pernum_80 name1_70 namelast_clean_70 name1_80 namelast_clean_80 age70 age80 confidence
22901 1 976277 1 john bradley john bradley 25 35 1.164
22901 1 193150 1 john bradley john bradly 25 34 1.000
22901 1 8618365 1 john bradley john bradley 25 37 0.999
22901 1 8743061 1 john bradley john bradley 25 38 0.879
22901 1 8741891 1 john bradley h bradley 25 35 -0.995
22901 1 9179845 1 john bradley j bailey 25 35 -0.995
22901 1 9064198 1 john bradley john shandley 25 34 -1.000
22901 1 1055670 1 john bradley john bryan 25 35 -1.000
22901 1 984100 1 john bradley john ragsdalle 25 35 -1.000
22901 1 8654004 1 john bradley john bryante 25 35 -1.001
22901 1 8597775 1 john bradley john bail 25 35 -1.002
22901 1 8578272 1 john bradley john darby 25 36 -1.004
22901 1 8724337 1 john bradley john nalley 25 35 -1.010
22901 1 8574468 1 john bradley john ashley 25 35 -1.010
22901 1 8646083 1 john bradley john rarden 25 35 -1.010
22901 1 8649875 1 john bradley john ashley 25 35 -1.010
22901 1 1163802 1 john bradley john trader 25 35 -1.010
22901 1 8732963 1 john bradley john bryce 25 34 -1.012
22901 1 8584289 1 john bradley john ready 25 36 -1.020
22901 1 8644204 1 john bradley john beasley 25 35 -1.024
22901 1 8668427 1 john bradley josiah bramlet 25 35 -1.026
22901 1 8616451 1 john bradley john bayler 25 33 -1.028
22901 1 8603992 1 john bradley john blake 25 35 -1.028
22901 1 8605006 1 john bradley john boyer 25 35 -1.028
22901 1 205053 1 john bradley john berry 25 35 -1.028
22901 1 5103940 1 john bradley john brownlee 25 36 -1.038
22901 1 8673698 7 john bradley john branch 25 34 -1.045
22901 1 8691338 1 john bradley john clardy 25 34 -1.047
22901 1 8726708 1 john bradley john brazell 25 36 -1.050
22901 1 8572344 1 john bradley john gray 25 35 -1.055
22901 1 8759890 1 john bradley john rainey 25 36 -1.056
22901 1 8577946 1 john bradley john breazeale 25 35 -1.057
22901 1 8671388 1 john bradley john brackman 25 35 -1.071
22901 1 1031068 1 john bradley john batey 25 35 -1.085
22901 1 8649108 1 john bradley john berry 25 36 -1.087
22901 1 8692014 1 john bradley john berry 25 34 -1.087
22901 1 5111582 1 john bradley john bailey 25 34 -1.090
22901 1 8669087 1 john bradley john bailey 25 34 -1.090
22901 1 8666039 1 john bradley john saddler 25 35 -1.092
22901 1 8654395 1 john bradley john mccarley 25 35 -1.096
22901 1 999367 1 john bradley john mcbryde 25 35 -1.108
22901 1 8563426 1 john bradley john brownley 25 35 -1.109
22901 1 8718581 1 john bradley john bolen 25 35 -1.111
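The exclusion rule illustrated by Table 4 can be sketched in Python. The first five candidate rows below are copied from Table 4; the last row, for sample household 99999, is invented to show a case that would be kept. The function name is hypothetical.

```python
from collections import defaultdict

# (xserial_70, pernum_70, xserial_80, pernum_80, confidence)
candidates = [
    (22901, 1, 976277, 1, 1.164),
    (22901, 1, 193150, 1, 1.000),
    (22901, 1, 8618365, 1, 0.999),
    (22901, 1, 8743061, 1, 0.879),
    (22901, 1, 8741891, 1, -0.995),
    (99999, 1, 123456, 2, 0.800),   # invented row for illustration
    (99999, 1, 654321, 1, -1.010),  # invented row for illustration
]


def accepted_links(rows):
    """Keep a sample person only if exactly one candidate has a positive
    confidence score; persons with several "true" links are dropped."""
    positives = defaultdict(list)
    for serial_70, pernum_70, serial_80, pernum_80, confidence in rows:
        if confidence > 0:
            positives[(serial_70, pernum_70)].append((serial_80, pernum_80))
    return {person: links[0]
            for person, links in positives.items()
            if len(links) == 1}


links = accepted_links(candidates)
```

John Bradley (xserial_70 22901) has four candidates with positive scores and is therefore excluded, while the invented case with a single positive score is retained.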

Future Work

In the next few months, we will be working with collaborators from the North Atlantic Population Project to extend these methods to IPUMS-style databases of census data from Canada, Great Britain, Iceland, Norway, and Sweden. In the Fall of 2009, we will release final versions of the Linked Representative Samples for the United States. All data will then be available via the IPUMS-USA Data Extract System.

References Cited

Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. London: Springer.

Cristianini, Nello and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and other kernel-based learning methods. London: Cambridge University Press.

Ferrie, Joseph. 1996. A New Sample of Males Linked from the Public-Use-Microdata-Sample of the 1850 US Federal Census of Population to the 1860 US Federal Census Manuscript Schedules. Historical Methods 29: 141-156.

Guest, Avery. 1987. Notes from the National Panel Study: Linkage and Migration in the Late Nineteenth Century. Historical Methods 20: 63-77.

Jaro, Matthew A. 1989. Advances in Record Linking Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-20.

Porter, Edward H. and William E. Winkler. 1997. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Census Bureau Research Report RR97/02 (Washington D.C.: U.S. Bureau of the Census).

Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. London: Springer-Verlag.

Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1990, Proceedings of the Section of Survey Research Methods, 354-359.

Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1993, Proceedings of the Section of Survey Research Methods, 274-279.

Winkler, William E. 2002. Methods for Record Linkage and Bayesian Networks. Census Bureau Research Report RRS2002-05 (Washington D.C.: U.S. Bureau of the Census).