

MLP Linked Data Version 1.0

Data Description

The current version of the data consists of eight separate crosswalks, one for each pair of census years listed in Table 1 below. The number of successfully linked individuals across two censuses ranges from about 6 million for 1850-1860 to 52 million for 1930-1940.

Table 1. Number of linked individuals across censuses

Census years    Linked persons
1850-1860        5,879,611
1860-1870        8,949,304
1870-1880       13,829,390
1880-1900       13,782,479
1900-1910       30,017,630
1910-1920       37,682,985
1920-1930       45,326,985
1930-1940       52,490,126

Linking Strategy and Representativeness of the Data

As explained under Linking Method and in more detail in this study, the MLP algorithm uses household information and geography to find links across census years. This strategy, together with the linkage rate across different populations and the overall accuracy of the links, has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.

Accuracy of the Data

A key consideration when analyzing linked data is the accuracy of declared links: to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is the precision measure used in machine learning. Precision is the share of declared links that are correct, and it is typically estimated from the input data used to calibrate the linking algorithm (also called the training data). Precision ranges from 0 (no match is correct) to 1 (all matches are correct). Because the precision calculation relies on the original training data used to build the algorithm, we prefer to augment it with additional quality measures. To assess the accuracy of our links, we compare a randomly selected sample of 500,000 cases from our linked populations with i) the FamilySearch (FS) database hosted at Brigham Young University and ii) a manual review that confirms or rejects links using digitized records at Ancestry.com.
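The precision measure described above can be illustrated with a minimal Python sketch. It is not the MLP implementation: the function name `precision`, the pair representation, and the sample identifiers are all hypothetical, and the reference set stands in for whatever verified links (for example, training data) are available.

```python
def precision(declared_links, true_links):
    """Share of declared links that are correct: TP / (TP + FP)."""
    declared = set(declared_links)
    if not declared:
        raise ValueError("no declared links to evaluate")
    correct = declared & set(true_links)  # true positives
    return len(correct) / len(declared)


# Hypothetical (id_in_first_census, id_in_second_census) pairs.
declared = [("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b4")]
truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

score = precision(declared, truth)
print(score)  # 0.75: three of the four declared links are correct
```

A value of 0 would mean that no declared link is correct and a value of 1 that all of them are, matching the range given above.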

The results of the comparison with the FamilySearch database are presented in Table 2. Note that the database has yet to include 19th century data. Among the cases covered by the database, agreement between MLP and FamilySearch is very high, suggesting that at most 2.9 percent of declared links are incorrect. It is important to note that the FamilySearch database covers only between 7.7 percent (1930-1940) and 11.6 percent (1900-1910) of the random sample of 500,000 links. An obvious concern is that the accuracy of links not covered by the FamilySearch database may be lower, which was another motivation for the manual checks.

Table 2. Comparison to FamilySearch database

Census years    Cases checked against FS database    Coverage in FS database (%)    FS matches MLP (%)    Manual check matches MLP (%)
1850-1860       500,000                              36.8                           98.5                  93.9
1860-1870       500,000                              25.6                           98.1                  95.3
1870-1880       500,000                              33.3                           98.0                  94.5
1880-1900       n.a.                                 n.a.                           n.a.                  n.a.
1900-1910       500,000                              11.6                           99.1                  98.3
1910-1920       500,000                              10.7                           97.1                  99.5
1920-1930       500,000                              9.1                            98.2                  99.0
1930-1940       500,000                              7.7                            99.5                  99.5
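The arithmetic behind the Table 2 figures can be reproduced with a short Python sketch. The values are copied from Table 2. The helper names `covered_cases` and `disagreement_pct` are hypothetical, and the case counts are approximate because the published percentages are rounded to one decimal.

```python
SAMPLE_SIZE = 500_000  # cases checked per crosswalk, from Table 2

# (coverage in FS database %, FS matches MLP %) copied from Table 2.
# 1880-1900 is reported as n.a. and is therefore omitted.
table2 = {
    "1850-1860": (36.8, 98.5),
    "1860-1870": (25.6, 98.1),
    "1870-1880": (33.3, 98.0),
    "1900-1910": (11.6, 99.1),
    "1910-1920": (10.7, 97.1),
    "1920-1930": (9.1, 98.2),
    "1930-1940": (7.7, 99.5),
}


def covered_cases(coverage_pct, sample_size=SAMPLE_SIZE):
    """Approximate number of sampled links found in the FS database."""
    return round(sample_size * coverage_pct / 100)


def disagreement_pct(agreement_pct):
    """Share of covered links on which FS and MLP disagree."""
    return round(100 - agreement_pct, 1)


# Crosswalk with the largest disagreement among covered cases.
worst_pair = max(table2, key=lambda years: disagreement_pct(table2[years][1]))
worst_rate = disagreement_pct(table2[worst_pair][1])

print(worst_pair, worst_rate)                   # 1910-1920 2.9
print(covered_cases(table2["1900-1910"][0]))    # 58000
print(covered_cases(table2["1930-1940"][0]))    # 38500
```

The largest disagreement, 2.9 percent for 1910-1920, is the source of the "at most 2.9 percent" figure quoted in the text.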

Additionally, for the 20th century links, we manually reviewed 300 cases for each pair of census years. These cases were divided across three categories of matches from the FamilySearch comparison: we randomly extracted i) cases where MLP and FamilySearch agree, ii) cases where they disagree, and iii) cases not covered by the FamilySearch database. The manual checks confirm that cases covered by FamilySearch do not differ in accuracy from cases not in the FamilySearch database. The last column of Table 2 presents the results from the third category as an alternative indicator of precision. While the two estimates do not line up perfectly, the manual check figures indicate very high agreement and are not substantially different from the comparison with the FamilySearch database.

For the 19th century data, we randomly extracted 2,000 cases from each crosswalk and had two trained research assistants (RAs) independently evaluate each link against Ancestry.com. When the two RAs agreed, we treated their assessment as conclusive; when they disagreed, a third RA made the final determination.
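The two-reviewer procedure with a third-reviewer tie-break can be sketched in Python as follows. This is only an illustration of the decision rule described above: the function names, the boolean encoding (True means the link is confirmed), and the sample reviews are hypothetical, not the project's actual review tooling or data.

```python
def adjudicate(first_ra, second_ra, tie_breaker=None):
    """Return the final verdict for one link (True = link confirmed).

    If the two independent reviewers agree, their shared assessment is
    conclusive. Otherwise the third reviewer's assessment decides.
    """
    if first_ra == second_ra:
        return first_ra
    if tie_breaker is None:
        raise ValueError("reviewers disagree; a third assessment is required")
    return tie_breaker


def confirmed_share(verdicts):
    """Fraction of reviewed links that were confirmed."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts supplied")
    return sum(verdicts) / len(verdicts)


# Hypothetical reviews: (first RA, second RA, third RA or None).
reviews = [
    (True, True, None),    # agreement: confirmed
    (True, False, True),   # disagreement: third RA confirms
    (False, False, None),  # agreement: rejected
    (False, True, False),  # disagreement: third RA rejects
]

final = [adjudicate(*review) for review in reviews]
print(final)                   # [True, True, False, False]
print(confirmed_share(final))  # 0.5
```

Raising an error when reviewers disagree and no third assessment exists keeps an unresolved link from being silently counted as confirmed or rejected.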
