

MLP Linked Data Version 1.0

Data Description

The current version of the data consists of four separate crosswalks, representing the population described in Table 1, below. The number of successfully linked individuals across two censuses ranges from about 30 million between 1900-1910 to more than 52 million between 1930-1940, consistently representing a little more than 40 percent of the linkable population. As an example of the linkable population, the 1910 population that can realistically be found in the 1900 census, given the criteria imposed in our algorithm, consists of persons born no later than 1903 and, if foreign born, with an immigration year no later than 1900. For the linkable population between 1930 and 1940, no immigration year restriction can be implemented because the 1940 census lacks this information.
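The linkable-population rule for the 1910 population described above can be written as a simple filter. This is a minimal sketch, not the MLP implementation: the record layout and the field names (`birth_year`, `foreign_born`, `immigration_year`) are hypothetical, and only the two cutoffs (1903 and 1900) come from the text. How records with a missing immigration year are handled is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person1910:
    # Hypothetical record layout; field names are illustrative only.
    birth_year: int
    foreign_born: bool
    immigration_year: Optional[int] = None  # None if native born or unknown


def is_linkable_to_1900(p: Person1910) -> bool:
    """Apply the two criteria stated in the text for the 1910 population."""
    if p.birth_year > 1903:
        return False
    if p.foreign_born:
        # Assumption: a foreign-born person with no recorded immigration
        # year is treated as not linkable; the source does not say.
        return p.immigration_year is not None and p.immigration_year <= 1900
    return True
```

For the 1930-1940 pair, the same sketch would drop the immigration-year branch, since the 1940 census does not record that information.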

Table 1. Number of linked individuals across censuses

| Census years | Linked persons | Linkable to 1920 | Linkable to 1930 | Linkable to 1940 |
|---|---|---|---|---|
| 1900-1910 | 30,017,630 | 14,837,863 | 7,742,926 | 4,030,757 |
| 1910-1920 | 37,682,985 | -- | 18,869,941 | 9,625,976 |
| 1920-1930 | 45,326,985 | -- | -- | 22,159,831 |
| 1930-1940 | 52,490,126 | -- | -- | -- |

Linking Strategy and Representativeness of the Data

As explained under Linking Method and in more detail in our working paper, the MLP algorithm uses household information and geography to find links across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.

A key concern when analyzing linked data is the degree to which the individuals that an algorithm is able to confidently link resemble the full underlying linkable population. Table 2 displays, by sex, key sociodemographic characteristics of all currently available linked populations, as well as of their respective linkable populations. As expected, linked individuals are more likely than the unlinked population to be born in the United States, to be white, to reside with relatives, and to reside in their place of birth.

Users should consider how much their variable of interest (migration status, occupation, children ever born, etc.) might be influenced by who is and is not linked in these data. In general, the linking probability will be greatest for those residing with the same family members and in the same location. The linked data are therefore likely to underrepresent migrants and those making transitions between households. In many cases, however, weighting should be able to adjust for underrepresented and overrepresented groups. We will release weights in future versions of the data. Users should also consider the overall accuracy of linkages when comparing this linking strategy to other approaches.
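Until official weights are released, one generic way to adjust for groups that are under- or overrepresented among linked cases is cell-based (post-stratification) weighting. The sketch below is an illustration of that general technique only. It is not the weighting scheme MLP will release, the function name and cell labels are hypothetical, and the counts are made up.

```python
from typing import Dict, Hashable


def poststratification_weights(
    linkable_counts: Dict[Hashable, int],
    linked_counts: Dict[Hashable, int],
) -> Dict[Hashable, float]:
    """Weight per cell = (cell share in linkable pop) / (cell share in linked pop)."""
    total_linkable = sum(linkable_counts.values())
    total_linked = sum(linked_counts.values())
    weights = {}
    for cell, n_linked in linked_counts.items():
        if n_linked == 0:
            continue  # no linked cases available to carry this cell's weight
        pop_share = linkable_counts.get(cell, 0) / total_linkable
        linked_share = n_linked / total_linked
        weights[cell] = pop_share / linked_share
    return weights


# Made-up counts for illustration only (not MLP figures).
linkable = {"us_born": 800, "foreign_born": 200}
linked = {"us_born": 360, "foreign_born": 40}
weights = poststratification_weights(linkable, linked)
```

In this toy example the foreign-born cell makes up 20 percent of the linkable population but only 10 percent of linked cases, so its linked cases are weighted up, while the overrepresented US-born cell is weighted down. In practice the cells would be defined by the characteristics that drive linkage, such as those compared in Table 2.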

Accuracy of the Data

A second important consideration when analyzing linked data is the accuracy of declared links: to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is the machine-learning precision calculation. Precision is the share of declared links that are correct, and it is typically estimated from the input data used to calibrate the linking algorithm (also called the training data). Precision ranges from 0 (no match is correct) to 1 (all matches are correct). Because the precision calculation relies on the original training data used to build the algorithm, we prefer to augment it with additional quality measures. To assess the accuracy of our links, we compare a randomly selected sample of 500,000 cases from each of our linked populations with i) the Family Tree database hosted at Brigham Young University and ii) a manual review that confirms or rejects links using digitized records at Ancestry.com.
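For readers less familiar with the term, precision in its standard machine-learning sense is the number of correct declared links divided by all declared links. A minimal sketch follows; the function name and the example counts are illustrative and are not MLP results.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of declared links that are correct: TP / (TP + FP)."""
    declared = true_positives + false_positives
    if declared == 0:
        raise ValueError("precision is undefined when no links are declared")
    return true_positives / declared


# Made-up example: 97 correct and 3 incorrect declared links.
example_precision = precision(97, 3)
```

Note that precision says nothing about true matches the algorithm failed to declare; that aspect is reflected in the linkage rate discussed earlier.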

The results of the comparison with the Family Tree database are presented in Table 3. Among the cases covered by the database, agreement between MLP and Family Tree is very high, suggesting that at most 2.9 percent of declared links are incorrect. It is important to note that the Family Tree database covers only between 11.6 percent (1900-1910) and 7.7 percent (1930-1940) of the random sample of 500,000 links. An obvious concern is that the accuracy of links not covered by the Family Tree database could be considerably lower, which was another motivation for the manual checks.

Table 3. Comparison to Family Tree database

| | 1900-1910 | 1910-1920 | 1920-1930 | 1930-1940 |
|---|---|---|---|---|
| Cases checked against FT database | 500,000 | 500,000 | 500,000 | 500,000 |
| Coverage in FT database (%) | 11.6 | 10.7 | 9.1 | 7.7 |
| FT matches MLP (%) | 99.1 | 97.1 | 98.2 | 99.5 |
| Manual check matches MLP (%) | 98.3 | 99.5 | 99.0 | 99.5 |
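The arithmetic behind two figures quoted from Table 3 can be checked with a short script: the approximate number of sampled links that Family Tree covers for each census pair, and the largest implied disagreement rate. The percentages are copied from Table 3; the variable names are illustrative, and the covered-case counts are approximations derived from rounded percentages.

```python
SAMPLE_SIZE = 500_000

# Values copied from Table 3 (percent).
coverage_pct = {"1900-1910": 11.6, "1910-1920": 10.7, "1920-1930": 9.1, "1930-1940": 7.7}
ft_match_pct = {"1900-1910": 99.1, "1910-1920": 97.1, "1920-1930": 98.2, "1930-1940": 99.5}

# Approximate number of sampled links that Family Tree covers per census pair.
covered_cases = {k: round(SAMPLE_SIZE * v / 100) for k, v in coverage_pct.items()}

# Largest implied disagreement rate across census pairs, in percent.
max_disagreement_pct = round(max(100 - v for v in ft_match_pct.values()), 1)
```

Under these figures, the Family Tree comparison rests on roughly 38,500 to 58,000 links per census pair, and the 2.9 percent upper bound comes from the 1910-1920 pair.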

We manually reviewed 300 cases for each pair of census years; these cases were divided across three categories of matches from the Family Tree comparison. We randomly extracted i) cases where MLP and Family Tree agree, ii) cases where they disagree, and iii) cases not covered by the Family Tree database. The manual checks confirm that links covered by Family Tree do not differ in accuracy from links not in the Family Tree database. The last row of Table 3 presents the results from the third category as an alternative indicator of precision. While the two estimates do not line up perfectly, the manual check figures indicate very high agreement and are not substantially different from the comparison with the Family Tree database. In other words, the MLP algorithm has a very high rate of true positive matches. For more information on quality analysis, please see our working paper.
