MLP Linked Data Version 1.0
Data Description
The current version of the data consists of four separate crosswalks, representing the population described in Table 1, below. The number of successfully linked individuals across two censuses ranges from almost 30 million for 1900-1910 to 53 million for 1930-1940, consistently representing a little more than 40 percent of the linkable population. For example, the 1910 population that can realistically be found in the 1900 census, given the criteria imposed by our algorithm, consists of people born no later than 1903 and, if foreign born, with an immigration year no later than 1900. For the linkable population between 1930 and 1940, no immigration year restriction can be implemented because the 1940 census lacks this information.
Table 1. Number of linked individuals across censuses
Census years | ||||
---|---|---|---|---|
1900-1910 | ||||
1910-1920 | ||||
1920-1930 | ||||
1930-1940 |
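The linkable-population criteria described above for 1900-1910 can be sketched as a simple filter. This is a minimal illustration, not the MLP implementation: the function name, the parameter names, and the use of `None` for native-born individuals are assumptions made for the example. Only the 1903 and 1900 cutoffs come from the text.

```python
from typing import Optional


def is_linkable(
    birth_year: int,
    immigration_year: Optional[int],
    max_birth_year: int,
    max_immigration_year: Optional[int],
) -> bool:
    """Return True if a person could realistically appear in the earlier census.

    immigration_year is None for the native born (a hypothetical encoding).
    max_immigration_year is None when no immigration restriction is applied.
    """
    if birth_year > max_birth_year:
        return False
    if max_immigration_year is None or immigration_year is None:
        return True
    return immigration_year <= max_immigration_year


# Criteria stated in the text for finding the 1910 population in the 1900 census.
print(is_linkable(1895, None, max_birth_year=1903, max_immigration_year=1900))
print(is_linkable(1880, 1905, max_birth_year=1903, max_immigration_year=1900))
```

Passing `max_immigration_year=None` skips the immigration check, which mirrors the 1930-1940 case where the restriction cannot be applied.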
Linking Strategy and Representativeness of the Data
As explained under Linking Method and in more detail in our working paper, the MLP algorithm uses household information and geography to find links across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.
A key concern when analyzing linked data is the degree to which the individuals that an algorithm is able to confidently link resemble the full underlying linkable population. Table 2 displays, by sex, key sociodemographic characteristics of all currently available linked populations, as well as of their respective linkable populations. As expected, linked individuals are more likely than the unlinked population to be born in the United States, to be white, to reside with relatives, and to reside in their place of birth.
Users should consider how much their variable of interest (migration status, occupation, children ever born, etc.) might be influenced by who is and is not linked in these data. In general, the linking probability will be greatest for those residing with the same family members and in the same location. The linked data are likely to underrepresent migrants and those making transitions between households. In many cases, however, weighting should be able to adjust for under- and overrepresented groups. We will release weights in future versions of the data. Users should also consider the overall accuracy of linkages when comparing this linking strategy to other approaches.
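Until weights are released, one common way to adjust for under- and overrepresented groups is cell-based post-stratification. The sketch below is an assumption for illustration, not the weighting scheme MLP will publish. The function name, the cell labels, and the counts are all hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def cell_weights(
    linkable_cells: Iterable[Hashable],
    linked_cells: Iterable[Hashable],
) -> Dict[Hashable, float]:
    """Weight each cell by (share in linkable population) / (share in linked sample)."""
    linkable_counts = Counter(linkable_cells)
    linked_counts = Counter(linked_cells)
    n_linkable = sum(linkable_counts.values())
    n_linked = sum(linked_counts.values())

    weights = {}
    for cell, n_cell_linked in linked_counts.items():
        linkable_share = linkable_counts[cell] / n_linkable
        linked_share = n_cell_linked / n_linked
        weights[cell] = linkable_share / linked_share
    return weights


# Made-up counts: the foreign born are 20% of the linkable population
# but only 10% of the linked sample, so they are weighted up.
linkable = ["native"] * 80 + ["foreign"] * 20
linked = ["native"] * 45 + ["foreign"] * 5
weights = cell_weights(linkable, linked)
print(weights)
```

Cells can be defined on any characteristics observed in both the linked and linkable populations, such as nativity, race, or migration status. Very small cells give unstable weights and may need to be collapsed.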
Accuracy of the Data
A second important consideration when analyzing linked data pertains to the accuracy of declared links. In other words, to what extent is the linking method successful in identifying the same individual across two censuses? One way to assess this success is the machine-learning precision calculation. Precision measures the ability of a linking method to identify the correct match, that is, the share of declared links that are correct, and is typically estimated from the input data used to calibrate the linking algorithm (also called the training data). Precision ranges from 0 (no match is correct) to 1 (all matches are correct). Because the precision calculation relies on the original training data used to build the algorithm, we prefer to augment this calculation with additional quality measures. To assess the accuracy of our links, we compare a randomly selected sample of 500,000 cases from our linked populations with i) the Family Tree database hosted at Brigham Young University and ii) manual review that confirms or rejects links using digitized records at Ancestry.com.
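The precision calculation can be sketched as follows. This is a generic illustration of the standard definition (correct declared links divided by all declared links), not the MLP code. The function name and the record identifiers are hypothetical.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (id in earlier census, id in later census), hypothetical ids


def precision(declared_links: Iterable[Link], true_links: Iterable[Link]) -> float:
    """Share of declared links that also appear in the reference (true) links."""
    declared = set(declared_links)
    if not declared:
        raise ValueError("precision is undefined when no links are declared")
    correct = declared & set(true_links)
    return len(correct) / len(declared)


# Made-up example: four declared links, three of which are in the reference set.
declared = [(1, 10), (2, 20), (3, 31), (4, 40)]
reference = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
print(precision(declared, reference))  # 0.75
```

Note that the reference link `(5, 50)`, which the method failed to declare, does not lower precision. Missed links affect the linkage rate, not the accuracy of the links that were declared.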
The results of the comparison with the Family Tree database are presented in Table 3. Among the cases covered by the database, agreement between MLP and Family Tree is very high, suggesting that at most 2.9 percent of declared links are incorrect. It is important to note that the Family Tree database covers only between 7.7 percent (1930-1940) and 11.6 percent (1900-1910) of the random sample of 500,000 links. An obvious concern is that the accuracy of links not covered by the Family Tree database could be considerably lower, which was another motivation for the manual checks.
Table 3. Comparison to Family Tree database
Cases checked against FT database | ||||
Coverage in FT database | ||||
FT matches MLP | ||||
Manual check matches MLP |
We manually reviewed 300 cases for each pair of census years; these cases were divided across three categories of matches from the Family Tree comparison. We randomly extracted i) cases where MLP and Family Tree agree, ii) cases where they disagree, and iii) cases not covered by the Family Tree database. The manual checks confirm that cases covered by Family Tree do not differ in accuracy from cases not in the Family Tree database. The last row of Table 3 presents the results from the third category as an alternative indicator of precision. While the two estimates do not line up perfectly, the manual check figures indicate very high agreement and are not substantially different from the comparison with the Family Tree database. In other words, the MLP algorithm has a very high rate of true positive matches. For more information on the quality analysis, please see our working paper.