IPUMS Linked Representative Samples, 1850-1930
Preliminary Data Release (August 2008)

This page has been preserved for reference. Users are encouraged to use linked full count census data via IPUMS MLP. Users may also wish to consult the final release of the Linked Representative Samples from June 2010. Please send any questions about data access, including access to the preliminary release of the Linked Representative Samples, to ipums@umn.edu.

The Minnesota Population Center is pleased to announce the preliminary release of the IPUMS Linked Representative Samples. The new database links cases from the 1880 census to 1% samples of all other censuses of 1850-1930. The project takes advantage of a complete transcription of the 1880 census of the United States, an extraordinary new data source produced and provided by The Church of Jesus Christ of Latter-day Saints.

The IPUMS Linked Representative Samples have two key differences from previous efforts to make longitudinal census databases. First, the central goal of previous efforts has been to maximize the proportion of the population that is linked. Our primary goal, on the other hand, has been to minimize selection bias and maximize representativeness of the linked cases. This approach has required a very conservative linking strategy. Second, the IPUMS linked data files provide far more cases than have ever before been available. Previous linkage projects have provided about 5,000 linked cases each (Ferrie 1996; Guest 1987). The IPUMS linked datasets contain information on nearly 500,000 people at two points in time, with one observation from the 1880 census and a second observation from one other census between 1850 and 1930.

Available data files

We have produced data samples for 7 pairs of years: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each pair of years contains three independent linked samples: one of men, one of women, and one of married couples.

All files contain weights that allow users to generate representative population estimates. The universes of people represented in the male files and couple files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couples files represent all married couples who were in the U.S. and married to one another in both census years. The universe of people represented in the linked female files is more complicated: these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage between the two census years. For this reason, the female file represents only women who were single at both time periods, were married to the same person at both time periods, or who transitioned from married to widowed or divorced between time periods. The female files do not represent women who entered into marriage between time periods.

Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. The 1880 data come from the North Atlantic Population Project (NAPP). Data from the other years come from the IPUMS. Variable descriptions for 1880 variables (i.e., those ending in "_80") are available via the NAPP site, at http://www.nappdata.org/napp-action/variables/group. Variable descriptions for other variables are available via the IPUMS variable availability page, at http://usa.ipums.org/usa-action/variables/group

There are three variables that are unique to the Linked Representative Samples: HHSEQ, LINKTYPE, and LINKWT. HHSEQ is used in combination with SERIAL to identify all members of the linked person's household in both census years. This variable was necessary because households with more than one linked person were written out multiple times. LINKTYPE specifies why each case was included in the Linked Data Sample. LINKWT indicates the weight value assigned to each linked case. These are all weighted samples, and the LINKWT variable must be used to generate representative population estimates. Each of these variables is described in more detail below.

Basic linking strategy

Each linked sample was produced by comparing cases from the 1880 complete-count (100%) data file to those from a sample of the U.S. population in another census year. For example, the 1870-1880 linked samples use the 1880 100% data and a 1% sample from the 1870 U.S. population census. More information on the 1880 complete-count database is available at http://www.nappdata.org. More information on the 1850-1930 sample data is available at http://usa.ipums.org/usa/sampdesc.shtml.

The linking strategy relied on five variables that should theoretically not change over time: birth year, state of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and state of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. Age was allowed to be up to seven years higher or lower than would be expected.

One of the major strategic decisions in the project was to ignore information from other co-resident individuals, because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of non-movers.

These strategies were designed to produce a conservative, but representative set of linked data. We are confident that the links provided in these data are solid, and that, taken as a whole, the database is representative of those surviving from one census to the next.

The project linking strategy is described in more detail in this article: Steven Ruggles, "Linking Historical Censuses: A New Approach," History and Computing 14:1-2 (2006) 213-224.

Variables unique to the Linked Representative Samples: LINKTYPE, HHSEQ, LINKWT

The Linked Representative Samples contain many individuals other than those who were linked via the systematic linking procedures. Specifically, the files contain all people who lived in a linked person's household during either census year. Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked via our systematic linking procedures. Once this is done, LINKTYPE and HHSEQ can be ignored.
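As a rough sketch of this simplification, assume the file has been read into a list of dictionaries keyed by variable name. The three records are taken from Table 1 and reduced to a few variables; the function name primary_links is illustrative and is not part of any IPUMS tool.

```python
# Three records from Table 1, reduced to a few variables.
records = [
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 5, "LINKTYPE": 5, "NAMEFRST": "CARRIE S"},
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 6, "LINKTYPE": 0, "NAMEFRST": "JEREMIAH"},
    {"SERIAL": 201, "HHSEQ": 1, "PERNUM": 7, "LINKTYPE": 9, "NAMEFRST": ""},
]


def primary_links(rows):
    """Keep only persons linked by the systematic procedures (LINKTYPE 0)."""
    return [row for row in rows if row["LINKTYPE"] == 0]


linked_only = primary_links(records)
```

After this step each remaining record is a primary link, so analyses need only the person's own variables and LINKWT.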

Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.

For each linked person, the SERIAL and HHSEQ variables identify all of the linked person's household members in both time periods. LINKTYPE specifies why each of those persons was included in the dataset. The meanings of the LINKTYPE and HHSEQ codes are described briefly below. Table 1 and the paragraphs below it show in more detail how these variables are used.

LINKTYPE: reason individual was included in the Linked Data Sample

0: Primary linked person.

1: Additional linked person who is also a primary link. This person will have a LINKTYPE value of 0 in a repeat of the non-1880 household.

5: Additional linked person who was only linked after the systematic linkage process. This person is not identified as a primary link in any household. These links were made for the convenience of researchers interested in the stability of primary linked persons' households.

9: Unlinked person, present in the household of a primary linked person during only one census year.

HHSEQ: Linked Data Sample household identification number, to be used in conjunction with SERIAL. Each primary linked person has a unique combination of SERIAL and HHSEQ that is shared by all members of the linked person's households in both census years. HHSEQ values range from 1 to 9.

Table 1 is a selection of households from the file linking men in the 1870 1% sample to those in the 1880 100% file. The paragraphs below discuss the households in this table in detail, in order to explain the LINKTYPE and HHSEQ variables. The combination of SERIAL and HHSEQ is common to all members of the linked person's 1870 and 1880 households. All household members from both years are assigned a unique value in PERNUM.

Table 1. The first two households in the 1870-1880 linked data sample of men
SERIAL	HHSEQ	PERNUM	LINKTYPE	SERIAL_80	PERNUM_80	NAMELAST	NAMEFRST	NAMELAST_80	NAMEFRST_80	IMPREL	RELATE_80	AGE	AGE_80
201	1	1	5	311	1	MORGAN	SARAH T	MORGAN	SARAH	1	101	42	52
201	1	2	5	311	7	MORGAN	MARY F	MIMS	MARY	3	301	16	26
201	1	3	5	311	2	MORGAN	HENRY T	MORGAN	HENRY	3	301	13	23
201	1	4	5	311	3	MORGAN	ELLEN E	MORGAN	ELLEN	3	301	11	20
201	1	5	5	311	4	MORGAN	CARRIE S	MORGAN	CARRIE	3	301	7	17
201	1	6	0	311	5	MORGAN	JEREMIAH	MORGAN	JERREMINE	3	301	2	12
201	1	7	9	311	6			MORGAN	AURELIA	0	301	0	9
201	1	8	9	311	8			MIMS	MORGAN	0	901	0	7
201	1	9	9	311	9			MIMS	FANNY	0	901	0	5
201	1	10	9	311	10			MIMS	MARY E	0	901	0	3
201	1	11	9	311	11			MIMS	ALEXANDER D	0	901	0	1
301	1	1	9	9265363	0	COUNTS	WILLIAM			1	0	55	0
301	1	2	9	9265363	0	COUNTS	ELIZA J			2	0	57	0
301	1	3	0	9265363	1	COUNTS	THOMAS W	COUNTS	THOMAS	3	101	17	27
301	1	4	9	9265363	0	COUNTS	MARY E			3	0	10	0
301	1	5	9	9265363	0	COUNTS	MARTHA E			3	0	7	0
301	1	6	1	9265363	0	NEWMAN	DAVID			12	0	21	0
301	1	7	9	9265363	2			COUNTS	FRANSINE	0	201	0	27
301	1	8	9	9265363	3			COUNTS	TOCHIE	0	301	0	5
301	2	1	9	9335742	0	COUNTS	WILLIAM			1	0	55	0
301	2	2	9	9335742	0	COUNTS	ELIZA J			2	0	57	0
301	2	3	1	9335742	0	COUNTS	THOMAS W			3	0	17	0
301	2	4	9	9335742	0	COUNTS	MARY E			3	0	10	0
301	2	5	9	9335742	0	COUNTS	MARTHA E			3	0	7	0
301	2	6	0	9335742	1	NEWMAN	DAVID	NEWMAN	DAVID	12	101	21	33
301	2	7	9	9335742	2			NEWMAN	REBECCA	0	201	0	27
301	2	8	9	9335742	3			NEWMAN	DAVID	0	301	0	13
301	2	9	9	9335742	4			NEWMAN	WILLIAM	0	301	0	10
301	2	10	9	9335742	5			NEWMAN	LIZZIE	0	301	0	8
301	2	11	9	9335742	6			NEWMAN	ANNA	0	301	0	6
301	2	12	9	9335742	7			NEWMAN	JOHN	0	301	0	4

In the first household in Table 1 (SERIAL = 201), Jeremiah Morgan (PERNUM = 6) is a primary link established between the 1870 1% sample and the 1880 100% data file. Jeremiah's status as the primary linked person is indicated by the LINKTYPE value of 0.

Persons 1-5 in Morgan's household also appear to have been present in both census years. As a convenience to researchers, we have identified these additional household links when possible, even though these links were not established through our systematic procedures used in the linked male samples. These additional links are indicated with a LINKTYPE value of 5. We identified additional household links by using information on sex, age, and name similarity within households that contained a primary link.

All other persons in SERIAL 201 were present in only one year (1880, in this case), and thus have a LINKTYPE value of 9. Note that although there are only six records in the 1870 household, PERNUM for this household goes to 11. This is to account for members of the 1880 household who were not present in 1870. The original order of these persons in their 1880 household is available in PERNUM_80. For example, Aurelia Morgan, who was not yet born in 1870, has a PERNUM_80 value of 6. This means that she was the 6th person listed in the 1880 household, which can be identified in the SERIAL_80 variable.

In the second household (SERIAL = 301), Thomas W. Counts (PERNUM = 3) is identified as the primary linked person, with a LINKTYPE value of 0. Counts was a 17-year-old in 1870. By 1880, he had left home and was married with a 5-year-old son, Tochie. Since Tochie was not present in 1870, his LINKTYPE value is 9.

This household also includes another primary link, David Newman (SERIAL = 301, PERNUM = 6). Newman's LINKTYPE value of 1 indicates that he is an additional primary link in the household with SERIAL number 301. A LINKTYPE value of 1 also indicates that the 1870 household will be written out again. In this case, the second iteration of SERIAL = 301 includes a repetition of the 1870 household, as well as all members of Newman's household in 1880.

In this second instance of SERIAL 301, HHSEQ is 2 and Newman's LINKTYPE value is 0. Since Newman and Counts lived apart in 1880, HHSEQ 2 of SERIAL 301 (Newman's 1870 and 1880 households) has a different set of 1880 individuals than did HHSEQ 1 of SERIAL 301 (Counts's 1870 and 1880 households). If there had been third and fourth primary links established in 1870 SERIAL 301, those persons would have been written out in additional households with SERIAL number 301 and HHSEQ values 3 and 4.
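The way SERIAL and HHSEQ jointly identify a household copy can be sketched as follows. The rows are the Counts and Newman records from Table 1, reduced to a few fields; the tuple layout and function name are assumptions made for illustration.

```python
from collections import defaultdict

# (SERIAL, HHSEQ, PERNUM, LINKTYPE, NAMELAST, NAMEFRST) for selected Table 1 rows.
rows = [
    (301, 1, 3, 0, "COUNTS", "THOMAS W"),
    (301, 1, 6, 1, "NEWMAN", "DAVID"),
    (301, 2, 3, 1, "COUNTS", "THOMAS W"),
    (301, 2, 6, 0, "NEWMAN", "DAVID"),
]


def group_households(rows):
    """Group person records by the (SERIAL, HHSEQ) pair that identifies a household copy."""
    households = defaultdict(list)
    for row in rows:
        serial, hhseq = row[0], row[1]
        households[(serial, hhseq)].append(row)
    return households


households = group_households(rows)

# Within each household copy, exactly one person carries LINKTYPE 0.
primaries = {
    key: [r[5] for r in members if r[3] == 0]
    for key, members in households.items()
}
```

Grouping on SERIAL alone would merge the two copies of household 301 and count its 1870 members twice, which is why HHSEQ is needed.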

LINKWT: weight value for Linked Data Sample. LINKWT values range up to 5000.

THE LINKWT VARIABLE WAS CORRECTED IN ALL SAMPLES ON AUGUST 11, 2009. DATA DOWNLOADED PRIOR TO THAT DATE SHOULD BE RE-DOWNLOADED FROM THE LINKS ABOVE.

LINKWT indicates the number of persons represented by each linked case. These weights are designed to produce population estimates for the universe of potential links in the sample's terminal year. For example, in the male 1880-1900 linked sample, LINKWT produces population estimates of males who were at least 20 years old in 1900, and who were in the United States in 1880. LINKWT achieves this representativeness both by inflating the value of all cases (since the cases come from a 1% sample), and by applying higher weight values to some cases than to others.
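A minimal sketch of how a weight such as LINKWT enters an estimate is shown here. The occupation groups and weight values are invented for illustration and do not come from the data files.

```python
# Hypothetical primary-link records: (occupation group, LINKWT). Values are invented.
cases = [
    ("farmer", 180.0),
    ("farmer", 220.0),
    ("laborer", 350.0),
    ("clerk", 250.0),
]


def weighted_share(cases, category):
    """Share of the represented population in a category, using the case weights."""
    total = sum(weight for _, weight in cases)
    in_category = sum(weight for group, weight in cases if group == category)
    return in_category / total


farmer_share = weighted_share(cases, "farmer")
unweighted_share = sum(1 for group, _ in cases if group == "farmer") / len(cases)
```

In this invented example the unweighted share of farmers is one half, while the weighted share is lower because the farmer records carry smaller weights.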

The creation of LINKWT was necessary, because some records were clearly more likely to be linked than others. For instance, persons from a birth-state with a small population always had a higher likelihood of successful linkage than persons from a birth-state with a large population. The larger the population for a place of birth, the more likely it was that the linkage engine would find more than one potential link for a given record in the sample data. When the linkage procedures revealed more than one high-quality potential link, no link was made and the case was discarded.

Persons unrelated to others in the household were also less likely to be linked. These include persons such as farm laborers and boarders. The lower linkage rate for these persons was at least partly due to the fact that household heads (who were usually the ones who spoke with census officials) would be less likely to have accurate age and birthplace information for the unrelated individual. Such "enumerator-respondent bias" was probably also more common among migrants, which would lead to a lower linkage rate for areas where interstate migrants were more numerous, such as larger cities.

To account for these and other types of biases, the LINKWT variable adjusts the value of each case based on five characteristics: age (divided into 5-year age groups), region of residence, population size of place of residence (divided into the 18 categories in SIZEPL), relationship to household head (separating those who were related from those who were not), and state or country of birth. The female samples also adjust for marital status.

The weight calculation uses the population data file from the linked sample's terminal year. For the 1880-1900 male linked sample, for instance, we calculated the proportion of the 1900 population in a given category and divided it by the proportion of the linked population in that category. If 20 to 24 year olds represent 10 percent of the male population ages 20 and up, but only 5 percent of the linked males, linked men in this age group would receive an age-weight adjustment of 2. Ultimately each linked record receives a specific weight adjustment for each of the five weight adjustment variables listed above. The final weight is the product of the five individual weight adjustments, multiplied by the sample record's original PERWT value (which is almost always 100, since these are generally 1% samples). In the couples file, both members of the couple receive the same LINKWT value, which was derived from the man's characteristics.
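The arithmetic can be sketched as follows. The age adjustment uses the 10 percent and 5 percent figures from the example above; the other four adjustments are set to 1.0 purely as placeholders, and the function names are illustrative.

```python
from math import prod


def adjustment(population_share, linked_share):
    """Ratio of a category's share in the full population to its share among linked cases."""
    return population_share / linked_share


def link_weight(adjustments, perwt=100.0):
    """Product of the individual adjustments, scaled by the record's PERWT."""
    return prod(adjustments) * perwt


# Age adjustment from the text: 10 percent of the population, 5 percent of linked males.
age_adj = adjustment(0.10, 0.05)

# Placeholder values of 1.0 for the other four adjustment variables.
weight = link_weight([age_adj, 1.0, 1.0, 1.0, 1.0])
```

With these placeholder values the record would represent about 200 people rather than the 100 implied by a 1% sample alone.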

Detailed linking procedures

Standardizing the first name strings

Names are the most important and problematic variables in any linking project. For instance, researchers relying on name information need to be aware of the inconsistent use of abbreviations, initials, diminutives, and of course simple mis-spellings. These are all significant problems in nineteenth-century census data, particularly for first names. Our approach to these issues was to standardize first name strings before beginning the linking process.

The first step in our name standardization was to construct "dictionaries" of first names, which contained all unique name strings from the 1880 complete-count data and the various samples. This resulted in approximately 1.7 million unique strings for first names. Table 2 shows the 25 most common first-name strings for men, with the FNAME field containing the raw string. When a person had an initial or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. Among the first names listed below, the names modified in the standard name field are "FRANK," which is standardized as "FRANCIS," and "WM." and "WILLIE," which are both standardized as "WILLIAM."

Table 2. The most common first names for men in 1880, before and after standardization
SEX	FNAME	XNAME1	XNAME2	STD_XNAME1	FREQ
M	JOHN	JOHN		JOHN	3297148
M	GEORGE	GEORGE		GEORGE	1311868
M	WILLIAM	WILLIAM		WILLIAM	1161737
M	HENRY	HENRY		HENRY	1156464
M	JAMES	JAMES		JAMES	792520
M	CHARLES	CHARLES		CHARLES	591368
M	THOMAS	THOMAS		THOMAS	474816
M	JOSEPH	JOSEPH		JOSEPH	458782
M	FRANK	FRANK		FRANCIS	382949
M	PETER	PETER		PETER	382598
M	EDWARD	EDWARD		EDWARD	298570
M	ROBERT	ROBERT		ROBERT	241233
M	SAMUEL	SAMUEL		SAMUEL	217773
M	DAVID	DAVID		DAVID	189674
M	JACOB	JACOB		JACOB	186299
M	ALBERT	ALBERT		ALBERT	172205
M	DANIEL	DANIEL		DANIEL	157188
M	WM.	WM		WILLIAM	152013
M	MICHAEL	MICHAEL		MICHAEL	147171
M	ANDREW	ANDREW		ANDREW	137175
M	PATRICK	PATRICK		PATRICK	136865
M	WILLIE	WILLIE		WILLIAM	128333
M	RICHARD	RICHARD		RICHARD	123986
M	LOUIS	LOUIS		LOUIS	115083
M	JOHN W.	JOHN	W	JOHN	113715

Given the large number of unique strings, we standardized only first names that appeared more than 100 times (as indicated by the FREQ column), of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as those listed below in Table 3.

Table 3. The most common abbreviations and diminutives for male first names in 1880, before and after standardization
SEX	FNAME	XNAME1	STD_XNAME1	FREQ
M	CHAS.	CHAS	CHARLES	65007
M	JOE	JOE	JOSEPH	63356
M	GEO.	GEO	GEORGE	47135
M	CHARLEY	CHARLEY	CHARLES	46206
M	THOS.	THOS	THOMAS	42224
M	SAM	SAM	SAMUEL	39799
M	CHARLIE	CHARLIE	CHARLES	38491
M	CHARLY	CHARLY	CHARLES	8852
M	DICK	DICK	RICHARD	8663
M	BOB	BOB	ROBERT	8460
M	ANDY	ANDY	ANDREW	8262
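The parsing and lookup steps can be sketched as below. The dictionary is a small excerpt built from Tables 2 and 3. The parsing rule (drop periods, split on whitespace, take the first two tokens) is an assumption made for illustration and may differ from the procedure actually used.

```python
# Excerpt of a standardization dictionary, taken from Tables 2 and 3.
STANDARD_NAMES = {
    "WM": "WILLIAM",
    "WILLIE": "WILLIAM",
    "FRANK": "FRANCIS",
    "CHAS": "CHARLES",
    "CHARLEY": "CHARLES",
    "JOE": "JOSEPH",
    "GEO": "GEORGE",
    "THOS": "THOMAS",
}


def standardize_first_name(fname):
    """Return (XNAME1, XNAME2, STD_XNAME1) for a raw first-name string.

    Assumed rule: periods are dropped and the string is split on whitespace.
    """
    tokens = fname.upper().replace(".", " ").split()
    xname1 = tokens[0] if tokens else ""
    xname2 = tokens[1] if len(tokens) > 1 else ""
    # Names absent from the dictionary are left unchanged.
    std_xname1 = STANDARD_NAMES.get(xname1, xname1)
    return xname1, xname2, std_xname1
```

For example, standardize_first_name("JOHN W.") separates the middle initial from the first name, and standardize_first_name("WM.") maps the abbreviation to WILLIAM, matching the corresponding rows of Table 2.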

Identifying potential links

The creation of the linked samples required millions of individual comparisons. Linked persons were required to have the same sex, race, and birthplace, so the comparison of these variables was straightforward. But since more tolerance was permitted with the name and age variables, many potential links required more complex decision-making. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on the Freely Extensible Biomedical Record Linkage (FEBRL) software, created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.

A typical input file to the FEBRL software would contain individuals with a given race, sex, birthplace, and--for the married samples--marital status. For example, for the creation of the 1870–1880 linked male sample, the software was tasked with comparing two groups of people: white males born in Michigan who were in the 1870 sample, and white males born in Michigan who were present in the 1880 data.
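Restricting comparisons to records that share these constants is commonly called blocking. The sketch below shows the idea in plain Python. It does not use FEBRL, and the field names and records are invented for illustration.

```python
from collections import defaultdict


def block_key(person):
    """Fields that must match exactly before two records are compared."""
    return (person["race"], person["sex"], person["bpl"])


def candidate_pairs(sample, full_count):
    """Yield (sample record, 1880 record) pairs that share a block key."""
    blocks = defaultdict(list)
    for person in full_count:
        blocks[block_key(person)].append(person)
    for person in sample:
        for candidate in blocks.get(block_key(person), []):
            yield person, candidate


# Invented records for illustration.
sample_1870 = [{"id": "a", "race": "white", "sex": "M", "bpl": "Michigan"}]
full_1880 = [
    {"id": "x", "race": "white", "sex": "M", "bpl": "Michigan"},
    {"id": "y", "race": "white", "sex": "M", "bpl": "Ohio"},
    {"id": "z", "race": "white", "sex": "F", "bpl": "Michigan"},
]

pairs = [(s["id"], c["id"]) for s, c in candidate_pairs(sample_1870, full_1880)]
```

Only the Michigan-born white male in the 1880 records is paired with the sample case; the other two records are never compared.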

Given these constants, the FEBRL software made comparisons based on age, first name, and last name. Age was allowed to vary up to seven years in either direction. Thus, 30-year-olds in the 1870 sample (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not perfect, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age in the non-1880 year.
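The age rule can be sketched as follows. The function names are illustrative, and the handling of age 0 (where the proportion is undefined) is an assumption; the text does not say how such cases were treated or how the proportion was converted into a similarity score.

```python
def expected_age_in_1880(age, census_year):
    """Age the person should report in 1880, given age in another census year."""
    return age + (1880 - census_year)


def within_age_window(age, census_year, age_1880, tolerance=7):
    """True if the 1880 age is within the tolerance of the expected age."""
    return abs(age_1880 - expected_age_in_1880(age, census_year)) <= tolerance


def age_difference_proportion(age, census_year, age_1880):
    """Deviation from expected age as a proportion of age in the non-1880 year.

    Returns None when age is 0, since the proportion is undefined (assumed handling).
    """
    if age == 0:
        return None
    return abs(age_1880 - expected_age_in_1880(age, census_year)) / age
```

For a 30-year-old in 1870, an 1880 age of 37 is inside the window and yields a proportion of 3/30, while an 1880 age of 48 falls outside the window and is never compared.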

As was true with age, we did not set the software to look for only exact name matches. First names and last names had already been standardized to account for common abbreviations and diminutives (as described above). The quantification of name comparisons relied on the "Jaro-Winkler distance algorithm," which takes a mathematical approach to dealing with transpositions of particular letters or syllables in comparing the similarity of any two alphabetic strings (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002).
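For readers unfamiliar with the comparator, a basic textbook form of the Jaro-Winkler similarity is sketched below. This is not FEBRL's code, and FEBRL's implementation may differ in details such as thresholds or character adjustments. The prefix scale of 0.1 and maximum prefix length of 4 are the conventional defaults, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 (no match) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

Under this sketch, a spelling variant such as "bradley" and "bradly" scores close to 1, while unrelated surnames score much lower.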

Using these approaches for comparing ages and names, the FEBRL software generated "similarity scores" for each of the three variables, for every comparison made (i.e., often thousands of comparisons for each person). The next step was to convert these millions of similarity scores into either true or false links.

Distinguishing good links from bad links

After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning.

Our procedures draw heavily on the concept of Support Vector Machines, or SVMs (Vapnik 1995; Cristianini and Shawe-Taylor 2000; Abe 2005). SVMs are not physical machines; they are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University's Department of Computer Science.

In our case, there were two main inputs to the SVM software. The key input was our database of "similarity scores" of age and name variables, representing millions of individual person-to-person comparisons. The SVM was tasked with reviewing each comparison and classifying it as true or false.

The second input to the SVM was "training" data that has already been classified properly. To create this training set, data entry operators considered a random sample of potential links and coded each one as either good or bad. Operators were not permitted to view the original census forms; it was important that they have access only to the same information that would be input to the SVM. When a majority of operators classified a potential link as a "yes", then it was coded as a "yes" in the training data. The remainder of cases were coded as "no".
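The majority rule for building training labels can be sketched as below. The vote lists are invented. Treating a tie as "no" follows from the statement that all cases without a "yes" majority were coded "no".

```python
def majority_label(votes):
    """Return "yes" only if more than half of the operators coded the link "yes"."""
    yes_votes = sum(1 for vote in votes if vote == "yes")
    return "yes" if yes_votes > len(votes) / 2 else "no"


# Invented operator codings for three potential links.
training_labels = [
    majority_label(["yes", "yes", "no"]),
    majority_label(["yes", "no", "no"]),
    majority_label(["yes", "no"]),  # a tie is not a majority
]
```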

Using the training data and the "similarity scores" created by FEBRL, we then ran our potential links through the LIBSVM software. The end result of the SVM classifier process was a file consisting of all potential links and the SVM-produced "confidence score," where a positive score meant a "true" link and a negative score meant a "false" link.

The SVM classification often resulted in more than one "true" link being made to a given sample record. Whenever a person from a non-1880 sample had more than one true link in the 1880 data, we excluded the case from the IPUMS Linked Representative Samples. This meant that there were many cases where we declined to make a link even though there may have been only one perfect match in the 1880 population. Table 4 below shows such a case. The first John Bradley was not linked, even though he is the only candidate whose name and age (35 in 1880) exactly match what would be expected for the 25-year-old sample case. Our experience with nineteenth-century census data suggests that this type of conservative approach is necessary.

Table 4. Output and confidence scores for potential links to John Bradley, 25-year-old South Carolinian, 1870-1880
xserial_70	pernum_70	xserial_80	pernum_80	name1_70	namelast_clean_70	name1_80	namelast_clean_80	age70	age80	confidence
22901	1	976277	1	john	bradley	john	bradley	25	35	1.164
22901	1	193150	1	john	bradley	john	bradly	25	34	1.000
22901	1	8618365	1	john	bradley	john	bradley	25	37	0.999
22901	1	8743061	1	john	bradley	john	bradley	25	38	0.879
22901	1	8741891	1	john	bradley	h	bradley	25	35	-0.995
22901	1	9179845	1	john	bradley	j	bailey	25	35	-0.995
22901	1	9064198	1	john	bradley	john	shandley	25	34	-1.000
22901	1	1055670	1	john	bradley	john	bryan	25	35	-1.000
22901	1	984100	1	john	bradley	john	ragsdalle	25	35	-1.000
22901	1	8654004	1	john	bradley	john	bryante	25	35	-1.001
22901	1	8597775	1	john	bradley	john	bail	25	35	-1.002
22901	1	8578272	1	john	bradley	john	darby	25	36	-1.004
22901	1	8724337	1	john	bradley	john	nalley	25	35	-1.010
22901	1	8574468	1	john	bradley	john	ashley	25	35	-1.010
22901	1	8646083	1	john	bradley	john	rarden	25	35	-1.010
22901	1	8649875	1	john	bradley	john	ashley	25	35	-1.010
22901	1	1163802	1	john	bradley	john	trader	25	35	-1.010
22901	1	8732963	1	john	bradley	john	bryce	25	34	-1.012
22901	1	8584289	1	john	bradley	john	ready	25	36	-1.020
22901	1	8644204	1	john	bradley	john	beasley	25	35	-1.024
22901	1	8668427	1	john	bradley	josiah	bramlet	25	35	-1.026
22901	1	8616451	1	john	bradley	john	bayler	25	33	-1.028
22901	1	8603992	1	john	bradley	john	blake	25	35	-1.028
22901	1	8605006	1	john	bradley	john	boyer	25	35	-1.028
22901	1	205053	1	john	bradley	john	berry	25	35	-1.028
22901	1	5103940	1	john	bradley	john	brownlee	25	36	-1.038
22901	1	8673698	7	john	bradley	john	branch	25	34	-1.045
22901	1	8691338	1	john	bradley	john	clardy	25	34	-1.047
22901	1	8726708	1	john	bradley	john	brazell	25	36	-1.050
22901	1	8572344	1	john	bradley	john	gray	25	35	-1.055
22901	1	8759890	1	john	bradley	john	rainey	25	36	-1.056
22901	1	8577946	1	john	bradley	john	breazeale	25	35	-1.057
22901	1	8671388	1	john	bradley	john	brackman	25	35	-1.071
22901	1	1031068	1	john	bradley	john	batey	25	35	-1.085
22901	1	8649108	1	john	bradley	john	berry	25	36	-1.087
22901	1	8692014	1	john	bradley	john	berry	25	34	-1.087
22901	1	5111582	1	john	bradley	john	bailey	25	34	-1.090
22901	1	8669087	1	john	bradley	john	bailey	25	34	-1.090
22901	1	8666039	1	john	bradley	john	saddler	25	35	-1.092
22901	1	8654395	1	john	bradley	john	mccarley	25	35	-1.096
22901	1	999367	1	john	bradley	john	mcbryde	25	35	-1.108
22901	1	8563426	1	john	bradley	john	brownley	25	35	-1.109
22901	1	8718581	1	john	bradley	john	bolen	25	35	-1.111
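The exclusion rule can be sketched as follows, using the first six candidates from Table 4. The function name and tuple layout are illustrative; the rule itself (a link is kept only when exactly one candidate has a positive confidence score) is as described above.

```python
def accept_link(candidates):
    """Return the single positively scored candidate, or None if there is not exactly one.

    Each candidate is (xserial_80, pernum_80, confidence).
    """
    positives = [c for c in candidates if c[2] > 0]
    return positives[0] if len(positives) == 1 else None


# First six candidates for John Bradley from Table 4.
bradley_candidates = [
    (976277, 1, 1.164),
    (193150, 1, 1.000),
    (8618365, 1, 0.999),
    (8743061, 1, 0.879),
    (8741891, 1, -0.995),
    (9179845, 1, -0.995),
]

bradley_link = accept_link(bradley_candidates)
```

Because four candidates have positive scores, bradley_link is None and the case is dropped, even though the top candidate matches exactly on name and expected age.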

Future Work

In the next few months, we will be working with collaborators from the North Atlantic Population Project to extend these methods to IPUMS-style databases of census data from Canada, Great Britain, Iceland, Norway, and Sweden. In the Fall of 2009, we will release final versions of the Linked Representative Samples for the United States. All data will then be available via the IPUMS-USA Data Extract System.

References Cited

Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. London: Springer.

Cristianini, Nello and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and other kernel-based learning methods. London: Cambridge University Press.

Ferrie, Joseph. 1996. A New Sample of Males Linked from the Public-Use-Microdata-Sample of the 1850 US Federal Census of Population to the 1860 US Federal Census Manuscript Schedules. Historical Methods 29: 141-156.

Guest, Avery. 1987. Notes from the National Panel Study: Linkage and Migration in the Late Nineteenth Century. Historical Methods 20: 63-77.

Jaro, Matthew A. 1989. Advances in Record Linking Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-20.

Porter, Edward H. and William E. Winkler. 1997. Approximate String Comparison and its Effect on an Advanced Record Linkage System. Census Bureau Research Report RR97/02 (Washington D.C.: U.S. Bureau of the Census).

Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. London: Springer-Verlag.

Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1990, Proceedings of the Section of Survey Research Methods, 354-359.

Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. American Statistical Association, 1993, Proceedings of the Section of Survey Research Methods, 274-279.

Winkler, William E. 2002. Methods for Record Linkage and Bayesian Networks. Census Bureau Research Report RRS2002-05 (Washington D.C.: U.S. Bureau of the Census).
