

MLP Linked Data: Version 1.1

Data Description

The current version of the data consists of links between adjoining censuses, representing the population described in Table 1, below. The number of successfully linked individuals across two censuses ranges from about 6 million for 1850-1860 to about 52 million for 1930-1940. Links can be chained across multiple pairs of census years to yield multi-decade longitudinal data.

Table 1. Number of linked individuals across censuses
Census years | Linked persons
1850-1860    |  5,997,363
1860-1870    |  8,925,067
1870-1880    | 13,798,054
1880-1900    | 10,705,233
1900-1910    | 30,313,883
1910-1920    | 37,205,366
1920-1930    | 44,642,307
1930-1940    | 52,480,797
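
The chaining of links mentioned above can be sketched as follows. This is a minimal illustration, not the MLP tooling: it assumes each crosswalk has been loaded as a dictionary mapping a person identifier in the earlier census to the identifier in the later census. The function name `chain_links` and the sample identifiers are hypothetical.

```python
def chain_links(first, *later):
    """Follow each person through successive crosswalks.

    Each crosswalk is assumed to be a dict
    {id_in_earlier_census: id_in_later_census}.
    Only persons linked in every crosswalk appear in the result.
    """
    chained = {}
    for start_id, current_id in first.items():
        for crosswalk in later:
            current_id = crosswalk.get(current_id)
            if current_id is None:  # chain breaks: no link in this census pair
                break
        if current_id is not None:
            chained[start_id] = current_id
    return chained


# Hypothetical identifiers, for illustration only.
links_1900_1910 = {"a1": "b1", "a2": "b2", "a3": "b3"}
links_1910_1920 = {"b1": "c1", "b3": "c3"}
links_1920_1930 = {"c1": "d1"}

panel_1900_1930 = chain_links(links_1900_1910, links_1910_1920, links_1920_1930)
print(panel_1900_1930)
```

Because a person must be linked in every pair, each additional census pair can only keep or reduce the number of persons in the chained panel.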

Linking Strategy and Representivity of the Data

As explained under Linking Method and in more detail in this study, the MLP algorithm uses household information and geography to find links across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.

Accuracy of the Data

A key consideration when analyzing linked data is the accuracy of declared links: to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is the machine-learning precision calculation. Precision is the share of declared matches that are correct and is typically estimated from the input data used to calibrate the linking algorithm (also called the training data). Precision ranges from 0 (no match is correct) to 1 (all matches are correct). Because the precision calculation relies on the original training data used to build the algorithm, we prefer to augment it with additional quality measures. To assess the accuracy of our links, we i) compared a randomly selected sample of 500,000 cases from our linked populations with the FamilySearch (FS) database hosted at Brigham Young University and ii) performed a manual review that confirmed or rejected links using digitized records at Ancestry.com. These comparisons were carried out for version 1.0 of the database, created in 2020.
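
As a minimal illustration of the precision measure described above, assume a reviewed sample in which each declared link has been labeled correct or incorrect. The function name `precision` and the sample labels are hypothetical and are not taken from the MLP training data.

```python
def precision(reviewed_links):
    """Share of declared links judged correct.

    `reviewed_links` is assumed to be an iterable of booleans, one per
    declared link: True if the link is correct, False otherwise.
    """
    reviewed = list(reviewed_links)
    if not reviewed:
        raise ValueError("precision is undefined without any declared links")
    return sum(reviewed) / len(reviewed)


# Hypothetical review of five declared links, four of them correct.
sample = [True, True, False, True, True]
print(precision(sample))  # 0.8
```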

The results of the comparison with the FamilySearch database are presented in Table 2. Among the cases covered by the database, agreement between MLP and FamilySearch is very high, suggesting that at most 2.9 percent of declared links are incorrect. It is important to note that the FamilySearch database covers only between 7.7 percent (1930-1940) and 36.8 percent (1850-1860) of the random sample of 500,000 links. An obvious concern is that the accuracy of links not covered by the FamilySearch database may be lower, which was another motivation for the manual checks.

Table 2. Comparison to FamilySearch database
Census years | Cases checked against FS database | Coverage in FS database (%) | FS matches MLP (%) | Manual check matches MLP (%)
1850-1860    | 500,000 | 36.8 | 98.5 | 93.9
1860-1870    | 500,000 | 25.6 | 98.1 | 95.3
1870-1880    | 500,000 | 33.3 | 98.0 | 94.5
1880-1900    | n.a.    | n.a. | n.a. | n.a.
1900-1910    | 500,000 | 11.6 | 99.1 | 98.3
1910-1920    | 500,000 | 10.7 | 97.1 | 99.5
1920-1930    | 500,000 | 9.1  | 98.2 | 99.0
1930-1940    | 500,000 | 7.7  | 99.5 | 99.5
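
The arithmetic behind the figures quoted for Table 2 can be reproduced with a short sketch. The percentages are copied from the table; the function names are hypothetical, and because the published coverage values are rounded to one decimal, the case counts are approximations.

```python
SAMPLE_SIZE = 500_000  # links checked per pair of census years (Table 2)

# (coverage in FS database %, FS matches MLP %) copied from Table 2;
# 1880-1900 is omitted because the table reports n.a. for it.
table2 = {
    "1850-1860": (36.8, 98.5),
    "1860-1870": (25.6, 98.1),
    "1870-1880": (33.3, 98.0),
    "1900-1910": (11.6, 99.1),
    "1910-1920": (10.7, 97.1),
    "1920-1930": (9.1, 98.2),
    "1930-1940": (7.7, 99.5),
}


def covered_cases(coverage_pct, sample_size=SAMPLE_SIZE):
    """Approximate number of sampled links found in the FS database."""
    return round(sample_size * coverage_pct / 100)


def disagreement_pct(agreement_pct):
    """Percent of covered links on which FS and MLP disagree."""
    return round(100 - agreement_pct, 1)


for years, (coverage, agreement) in table2.items():
    print(years, covered_cases(coverage), disagreement_pct(agreement))

# Largest disagreement across census pairs: the "at most" bound in the text.
worst = max(disagreement_pct(agreement) for _, agreement in table2.values())
print(worst)  # 2.9
```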

Additionally, for the 20th century links, we manually reviewed 300 cases for each pair of census years. These cases were divided across three categories of matches from the FamilySearch comparison. We randomly extracted i) cases where MLP and FamilySearch agree, ii) cases where they disagree, and iii) cases not covered by the FamilySearch database. The manual checks confirm that links covered by FamilySearch do not differ in accuracy from links not in the FamilySearch database. The last column of Table 2 presents the results from the third category as an alternative indicator of precision. While the two estimates do not line up perfectly, the manual check figures indicate very high agreement and are not substantially different from the comparison with the FamilySearch database. For the 19th century data, we randomly extracted 2,000 cases from each crosswalk and had two trained research assistants (RAs) independently evaluate each link against Ancestry.com. When the two RAs agreed, we considered their assessment conclusive; when they disagreed, a third RA made the final determination.
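
The review rule described for the 19th century data can be written as a small decision function. This is only a sketch of the rule as stated above: the name `adjudicate`, the boolean encoding of verdicts, and the use of a callable for the third reviewer are assumptions made for illustration.

```python
def adjudicate(first_ra, second_ra, third_ra):
    """Return the final verdict (True = link confirmed) for one link.

    `first_ra` and `second_ra` are the two independent verdicts; `third_ra`
    is a zero-argument callable that is consulted only on disagreement.
    """
    if first_ra == second_ra:
        return first_ra  # agreement is treated as conclusive
    return third_ra()    # a third reviewer makes the final determination


# Hypothetical verdicts for three links.
print(adjudicate(True, True, lambda: False))   # True: both RAs confirm
print(adjudicate(False, False, lambda: True))  # False: both RAs reject
print(adjudicate(True, False, lambda: True))   # True: third RA decides
```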

To ascertain that the version 1.1 links (released November 2022) have the same high accuracy as the version 1.0 links, we conducted extensive comparisons. Between 96.9 percent (1910-1920, 1920-1930) and 99.7 percent (1860-1870) of the census A individuals that were linked in version 1.0 were also linked in version 1.1. No fewer than 99.9 percent of these persons linked to the same census B record. In sum, the two versions overlap almost completely. Our examination of the small number of discrepant links could not determine whether v1.0 or v1.1 links should systematically be preferred.
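
The two overlap measures reported above can be computed along the following lines. This is a sketch under stated assumptions, not the procedure actually used: each version is assumed to be a dictionary mapping a census A identifier to a census B identifier, and the function name and sample identifiers are hypothetical.

```python
def compare_versions(links_v10, links_v11):
    """Compare two crosswalks given as dicts {census_A_id: census_B_id}.

    Returns a tuple:
      (percent of v1.0-linked persons also linked in v1.1,
       percent of those persons linked to the same census B record).
    The second value is None when no person is linked in both versions.
    """
    if not links_v10:
        raise ValueError("version 1.0 crosswalk is empty")
    in_both = [a_id for a_id in links_v10 if a_id in links_v11]
    retained_pct = 100 * len(in_both) / len(links_v10)
    if not in_both:
        return retained_pct, None
    same_target = sum(links_v10[a_id] == links_v11[a_id] for a_id in in_both)
    same_pct = 100 * same_target / len(in_both)
    return retained_pct, same_pct


# Hypothetical identifiers, for illustration only.
v10 = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
v11 = {"a1": "b1", "a2": "b2", "a3": "b9", "a5": "b5"}

retained, same = compare_versions(v10, v11)
print(retained, same)
```

In this toy input, three of the four persons linked in the first dictionary are also linked in the second, and two of those three point to the same census B record.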
