

MLP Linked Data Description

The current version of the data consists of links between censuses, with the linked population described in the table below. The number of successfully linked individuals across adjacent censuses ranges from about 6.6 million for 1850-1860 to 54.9 million for 1940-1950. Links can be chained across multiple pairs of census years to yield multi-decade longitudinal data. The newest release also includes direct 20- and 30-year links, as well as links between Social Security records and censuses. The IPUMS extract system currently contains version 1.1 of MLP.

Number of individuals linked across census pairs: version 1.2 crosswalks

| Census years (10-year) | N links | Census years (20-year) | N links | Census years (30-year) | N links |
|---|---|---|---|---|---|
| 1850-1860 | 6,659,383 | 1850-1870 | 3,266,557 | 1850-1880 | 2,596,586 |
| 1860-1870 | 8,437,331 | 1860-1880 | 5,107,660 | 1870-1900 | 4,731,201 |
| 1870-1880 | 13,148,961 | 1880-1900 | 10,199,154 | 1880-1910 | 6,286,364 |
| 1900-1910 | 30,601,820 | 1900-1920 | 14,641,957 | 1900-1930 | 10,950,343 |
| 1910-1920 | 37,940,159 | 1910-1930 | 20,401,999 | 1910-1940 | 13,918,705 |
| 1920-1930 | 45,335,479 | 1920-1940 | 25,448,196 | 1920-1950 | 18,245,447 |
| 1930-1940 | 54,136,642 | 1930-1950 | 30,287,357 | | |
| 1940-1950 | 54,927,229 | | | | |
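
Chaining links across census pairs, as mentioned above, amounts to composing two crosswalks on the shared middle census. The following is a minimal sketch, assuming each crosswalk can be read as a mapping from a person identifier in the earlier census to the identifier of the linked record in the later census. The identifiers and the `chain_links` helper are hypothetical and are not part of the MLP release.

```python
from typing import Dict

# Hypothetical crosswalks (made-up identifiers): each maps a record in the
# earlier census to the linked record in the later census.
links_1850_1860: Dict[str, str] = {"a1": "b1", "a2": "b2", "a3": "b3"}
links_1860_1870: Dict[str, str] = {"b1": "c1", "b3": "c3", "b9": "c9"}


def chain_links(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Compose two crosswalks, keeping only people linked in both steps."""
    return {
        start: second[middle]
        for start, middle in first.items()
        if middle in second
    }


# Person "a2" drops out because record "b2" has no 1860-1870 link.
links_1850_1870_chained = chain_links(links_1850_1860, links_1860_1870)
print(links_1850_1870_chained)  # {'a1': 'c1', 'a3': 'c3'}
```

A chained link is correct only if every step in the chain is correct, and the chained sample is limited to people found in each intermediate census. The direct 20- and 30-year links do not require an intermediate link.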

Linking Strategy and Representativeness of the Data

As explained under Linking Method and in more detail in this study, the MLP algorithm uses household information and geography to find links across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the linkage section carefully when developing their own analytic approach.

Accuracy of the Data

A key consideration when analyzing linked data is the accuracy of declared links. In other words, to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is to calculate the precision obtained during the machine-learning calibration stage. Precision measures the ability of a linking algorithm to identify the correct match in the underlying training data, ranging from 0 (no match is correct) to 1 (all matches are correct). Since we rely on relatively small sets of training data that do not cover all combinations of censuses that we link, our preferred measure of data quality is obtained by overlapping our data with data from FamilySearch.org, accessed through our collaboration with the Record Linkage Lab at Brigham Young University.
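
As an illustration of the precision measure described above, the sketch below computes it as the share of declared links that are correct. The counts are hypothetical and are not MLP calibration results.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of declared links that are correct: TP / (TP + FP)."""
    declared = true_positives + false_positives
    if declared == 0:
        raise ValueError("no declared links to evaluate")
    return true_positives / declared


# Hypothetical training-data counts: 950 correct and 50 incorrect declared links.
print(precision(950, 50))  # 0.95
```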

These results are presented in the table below. The measure is the conditional accuracy, i.e. it is based only on cases for which FamilySearch.org can provide a link for comparison. For example, for 61.5 percent of the 6.6 million 1850 records that MLP linked to an 1860 record, the FamilySearch data has an 1850-1860 link to compare to the MLP link. Naturally, if the links not covered by the FamilySearch data are substantially less accurate than the covered ones, the overall accuracy of the data set will be lower than the conditional accuracy. While we acknowledge that the conditional accuracy represents an upper bound on the overall accuracy, earlier investigations into the unconditional accuracy of MLP links are reassuring. Also note that the conditional accuracy of links made to the 1950 census is not provided, as this universe is not yet covered by the FamilySearch data. Among the cases covered by the database, the agreement between MLP and FamilySearch is very high, suggesting that, in the worst case (1860-1870), 4.1 percent of declared links are incorrect, and most census pairs are within 2 percent.
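
To see how the conditional accuracy relates to the overall accuracy, the overall figure can be written as a coverage-weighted average of the accuracy among covered and uncovered links. The sketch below uses the 1850-1860 coverage (61.5 percent) and agreement rate (97.7 percent). It assumes that agreement with FamilySearch can be read as correctness, and the accuracy values for uncovered links are purely hypothetical, since that quantity is not observed.

```python
def overall_accuracy(coverage: float, conditional: float, uncovered: float) -> float:
    """Coverage-weighted average of accuracy among covered and uncovered links."""
    return coverage * conditional + (1.0 - coverage) * uncovered


coverage = 0.615     # share of 1850-1860 MLP links with a FamilySearch link
conditional = 0.977  # agreement rate among those covered links

# Hypothetical accuracy levels for the links FamilySearch does not cover.
for uncovered in (0.977, 0.95, 0.90):
    result = overall_accuracy(coverage, conditional, uncovered)
    print(f"uncovered={uncovered:.3f} -> overall={result:.3f}")
# uncovered=0.977 -> overall=0.977
# uncovered=0.950 -> overall=0.967
# uncovered=0.900 -> overall=0.947
```

Under these assumptions, the overall accuracy equals the conditional accuracy only when uncovered links are as accurate as covered ones, and it falls as the uncovered links become less accurate.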

For the first version of MLP we also performed double-blind manual checks of a small subset of links to obtain an alternative measure of accuracy. The accuracy found in that assessment was comparable to the FamilySearch-based figures for the twentieth century and modestly lower for the nineteenth century.

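Comparison to FamilySearch database - MLP version 1.2

| Link type | Census years | Coverage in FS database (%) | FS matches MLP (%) |
|---|---|---|---|
| 10-Year Links | 1850-1860 | 61.5 | 97.7 |
| 10-Year Links | 1860-1870 | 61.6 | 95.9 |
| 10-Year Links | 1870-1880 | 62.2 | 97.8 |
| 10-Year Links | 1900-1910 | 67.2 | 99.1 |
| 10-Year Links | 1910-1920 | 78.5 | 98.4 |
| 10-Year Links | 1920-1930 | 78.4 | 98.7 |
| 10-Year Links | 1930-1940 | 80.7 | 98.3 |
| 20-Year Links | 1850-1870 | 53.2 | 96.8 |
| 20-Year Links | 1860-1880 | 54.0 | 98.3 |
| 20-Year Links | 1880-1900 | 61.3 | 99.1 |
| 20-Year Links | 1900-1920 | 76.5 | 98.5 |
| 20-Year Links | 1910-1930 | 75.4 | 98.8 |
| 20-Year Links | 1920-1940 | 70.8 | 98.2 |
| 30-Year Links | 1850-1880 | 37.3 | 98.3 |
| 30-Year Links | 1870-1900 | 37.6 | 97.9 |
| 30-Year Links | 1880-1910 | 43.6 | 98.7 |
| 30-Year Links | 1900-1930 | 60.2 | 98.2 |
| 30-Year Links | 1910-1940 | 59.4 | 97.8 |

Note: 1950 data are not yet available in the FamilySearch database for comparison.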
