MLP Linked Data Description
The current version of the data consists of links between censuses representing a population described in the table below. The number of successfully linked individuals across adjacent censuses ranges from about 6.6 million between 1850 and 1860 to 54.9 million between 1940 and 1950. Links can be chained across multiple pairs of census years to yield multi-decade longitudinal data. The newest release also includes direct 20- and 30-year links as well as links between Social Security records and censuses. The IPUMS extract system currently contains version 1.1 of MLP.
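As an illustration of chaining, the sketch below composes two adjacent-census crosswalks into a single 20-year link set. It assumes each crosswalk can be represented as a mapping from a person identifier in the earlier census to the linked identifier in the later census; the identifiers, census years, and the `chain_links` helper are hypothetical and do not reflect the actual MLP file layout.

```python
# Hypothetical crosswalks: each maps a person ID in the earlier census
# to the linked person ID in the later census. All IDs are made up.
links_1900_1910 = {"a1": "b1", "a2": "b2", "a3": "b3"}
links_1910_1920 = {"b1": "c1", "b3": "c3", "b4": "c4"}


def chain_links(first, second):
    """Compose two crosswalks, keeping only people linked in both."""
    return {start: second[mid] for start, mid in first.items() if mid in second}


links_1900_1920 = chain_links(links_1900_1910, links_1910_1920)
print(links_1900_1920)  # {'a1': 'c1', 'a3': 'c3'}
```

A person drops out of the chained data as soon as any single link in the chain is missing, so chained samples shrink with each added decade.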
Number of individuals linked across census pairs: version 1.2 crosswalks
Linking Strategy and Representivity of the Data
As explained under Linking Method and in more detail in this study, the MLP algorithm uses household information and geography to find links across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.
Accuracy of the Data
A key consideration when analyzing linked data is the accuracy of declared links. In other words, to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is to calculate the precision obtained during the machine-learning calibration stage. Precision measures the ability of a linking algorithm to identify the correct match in the underlying training data, ranging from 0 (no match is correct) to 1 (all matches are correct). Since we rely on relatively small sets of training data that do not cover all combinations of censuses that we link, our preferred measure of data quality is obtained by overlapping our data with data from FamilySearch.org, accessed through our collaboration with the Record Linkage Lab at Brigham Young University.
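To make the precision measure concrete, the following minimal sketch computes it as the share of declared links that agree with hand-labelled training data. The identifiers and the `precision` helper are hypothetical, and the calculation is the generic definition of precision rather than the MLP calibration code.

```python
def precision(declared, truth):
    """Share of declared links that agree with the labelled truth."""
    if not declared:
        raise ValueError("no declared links to evaluate")
    # A declared link is correct only if the truth has the same target.
    correct = sum(1 for src, dst in declared.items() if truth.get(src) == dst)
    return correct / len(declared)


# Hypothetical training data: true links and the links an algorithm declared.
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}
declared = {"a1": "b1", "a2": "b2", "a3": "b9", "a4": "b4"}

print(precision(declared, truth))  # 0.75
```

Note that precision says nothing about the true links that were never declared (here `a5`); that is a question of linkage rate rather than accuracy.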
These results are presented in the table below, where the measure represents the conditional accuracy, i.e. it is based only on cases for which FamilySearch.org can provide a link for comparison. For example, for 61.5 percent of the 6.6 million 1850 records that MLP linked to an 1860 record, the FamilySearch data has an 1850-1860 link to compare to the MLP link. Naturally, if the links not covered by the FamilySearch data are substantially less accurate than the covered ones, the overall accuracy of the data set will be lower than the conditional accuracy. While we acknowledge that the conditional accuracy represents an upper bound of the overall accuracy, earlier investigations into the unconditional accuracy of MLP links are reassuring. Also note that the conditional accuracy of links made to the 1950 census is not provided, as this universe is not yet covered by the FamilySearch data. Among the cases covered by the database, the agreement between MLP and FamilySearch is very high, suggesting that at most 4.1 percent of declared links are incorrect for the worst-performing census pair, and that most census pairs are within 2 percent.
For the first version of MLP we also performed double-blind manual checks of a small subset of links to obtain an alternative measure of accuracy. The results from that assessment were comparable to the FamilySearch comparisons for the twentieth century and modestly lower in accuracy for the nineteenth century.
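The relationship between coverage, conditional accuracy, and overall accuracy can be sketched as follows. The example assumes both link sets are mappings from an earlier-census identifier to a later-census identifier; the identifiers, the function names, and the accuracy assumed for uncovered links are all hypothetical and chosen only for illustration.

```python
def coverage_and_conditional_accuracy(mlp, reference):
    """Return (coverage, conditional accuracy) of MLP links against a reference.

    coverage: share of MLP-linked source records for which the reference
        also has a link.
    conditional accuracy: among those covered records, the share where both
        sources point to the same target record.
    """
    covered = [src for src in mlp if src in reference]
    if not covered:
        raise ValueError("reference covers none of the MLP links")
    agree = sum(1 for src in covered if mlp[src] == reference[src])
    return len(covered) / len(mlp), agree / len(covered)


def overall_accuracy(coverage, conditional, uncovered):
    """Weighted average of accuracy on covered and uncovered links."""
    return coverage * conditional + (1 - coverage) * uncovered


# Hypothetical link sets; "a5" has no reference link, "a3" disagrees.
mlp = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}
reference = {"a1": "b1", "a2": "b2", "a3": "b7", "a4": "b4", "a9": "b9"}

cov, cond = coverage_and_conditional_accuracy(mlp, reference)
print(cov, cond)  # 0.8 0.75

# If uncovered links were less accurate (an assumed 0.5 here), the overall
# accuracy would fall below the conditional accuracy.
print(round(overall_accuracy(cov, cond, uncovered=0.5), 3))  # 0.7
```

The weighted average makes the upper-bound argument explicit: the overall figure equals the conditional accuracy only if uncovered links are as accurate as covered ones.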
Comparison to FamilySearch database - MLP version 1.2