MLP Linked Data Description

The current version of the MLP data consists of links between pairs of censuses; the table below reports the number of individuals linked for each pair. The number of successfully linked individuals ranges from tens of millions for adjacent census years to roughly 100 thousand for the longest time spans. IPUMS MLP does not provide links across periods longer than 80 years.

The MLP linking process matches records across 10-, 20-, and 30-year census intervals. The links between the census pairs in this table are the result of chaining those constituent links together and removing inconsistencies, which makes it possible to identify individuals across many decades.
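
The chaining step can be illustrated with a minimal Python sketch. It is not the MLP implementation: the record identifiers, the function names `chain_links` and `merge_consistent`, and the two consistency rules (drop a record when its direct and chained links disagree, and drop any target claimed by more than one record) are assumptions chosen purely for illustration.

```python
from collections import Counter


def chain_links(links_ab, links_bc):
    """Compose A->B and B->C links into A->C links."""
    chained = {}
    for a, b in links_ab.items():
        c = links_bc.get(b)
        if c is not None:
            chained[a] = c
    return chained


def merge_consistent(direct, chained):
    """Combine direct and chained A->C links, dropping conflicts (assumed rules)."""
    merged = dict(chained)
    for a, c in direct.items():
        if a in merged and merged[a] != c:
            del merged[a]  # direct and chained links disagree
        else:
            merged[a] = c
    # Keep only targets that exactly one source record points to.
    target_counts = Counter(merged.values())
    return {a: c for a, c in merged.items() if target_counts[c] == 1}


# Hypothetical record identifiers for three census years.
links_1900_1910 = {"a1": "b1", "a2": "b2", "a3": "b3"}
links_1910_1920 = {"b1": "c1", "b2": "c2", "b3": "c9"}
direct_1900_1920 = {"a1": "c1", "a2": "c7", "a4": "c4", "a5": "c9"}

chained = chain_links(links_1900_1910, links_1910_1920)
links_1900_1920 = merge_consistent(direct_1900_1920, chained)
print(links_1900_1920)
```

In this toy example, record `a2` is dropped because its direct and chained links point to different people, and `a3` and `a5` are dropped because both claim the same 1920 record.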

Number of individuals linked across each census pair (N in millions)

From \ To   1860   1870   1880   1900   1910   1920   1930   1940   1950
1850         8.0    4.6    3.9    1.4    0.7    0.3    0.1      -      -
1860           -   10.4    6.8    2.8    1.7    0.9    0.4    0.1      -
1870           -      -   15.9    6.7    4.4    2.6    1.5    0.7    0.2
1880           -      -      -   13.8   10.0    6.3    4.2    2.3    1.0
1900           -      -      -      -   36.3   24.4   18.6   13.0    8.2
1910           -      -      -      -      -   46.2   32.0   24.0   16.6
1920           -      -      -      -      -      -   55.8   40.0   28.8
1930           -      -      -      -      -      -      -   70.2   46.5
1940           -      -      -      -      -      -      -      -   73.7
  * Enslaved persons are not included in the 1850 and 1860 census data.

Linking Strategy and Representativeness of the Data

The MLP algorithm uses individual, household, and geography information to link individuals across census years. This strategy — along with the linkage rate across different populations and overall accuracy — has consequences for different research questions. We encourage users to read the following overview and the description of the Linking Method when developing their own analytic approach.

Accuracy of the Data

A key consideration when analyzing linked data is the accuracy of declared links: to what extent does the linking method succeed in identifying the same individual across two censuses? One way to assess this is to calculate the precision obtained during the machine-learning calibration stage. Precision measures the ability of a linking algorithm to identify the correct match in the underlying training data, and ranges from 0 (no match is correct) to 1 (all matches are correct). Because we rely on relatively small sets of training data that do not cover all combinations of censuses that we link, our preferred measure of data quality is obtained by comparing our links with the genealogically derived Family Tree data that originated from FamilySearch.org and are available from the Census Tree.
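
As a small illustration of the precision measure, the sketch below computes the share of declared links that agree with hand-labeled training data. The function name `precision`, the dictionary layout, and the record identifiers are hypothetical; the actual MLP calibration code is not shown in this document.

```python
def precision(declared, truth):
    """Share of declared links whose target matches the labeled truth."""
    if not declared:
        raise ValueError("no declared links to evaluate")
    correct = sum(1 for src, tgt in declared.items() if truth.get(src) == tgt)
    return correct / len(declared)


# Hypothetical training sample: four declared links, one of them wrong.
declared_links = {"r1": "s1", "r2": "s2", "r3": "s3", "r4": "s4"}
labeled_truth = {"r1": "s1", "r2": "s2", "r3": "s3", "r4": "s9"}

training_precision = precision(declared_links, labeled_truth)
print(training_precision)
```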

The results of this comparison are presented in the table below. The measures represent conditional accuracy: they are based only on cases for which Family Tree can provide a link for comparison. The degree of overlap between the Family Tree and MLP links is reported in parentheses. For example, of the 8.0 million records that MLP links between 1850 and 1860, Family Tree also links 41.4 percent. These shared cases form our universe for comparison, within which the two sources agree 98.1 percent of the time. The conditional accuracy of links made to the 1950 census is not provided, as this universe is not yet covered by the Family Tree data.
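
The two figures reported for each census pair can be computed along the following lines. This is a sketch under stated assumptions: links are held as dictionaries from a source record to a target record, a record counts as covered when the reference data link it at all, and the name `compare_to_reference` and all identifiers are hypothetical.

```python
def compare_to_reference(mlp_links, reference_links):
    """Return (overlap, agreement) of MLP links against a reference set.

    overlap:   share of MLP-linked records that the reference also links
    agreement: among those shared records, share with the same target
    """
    if not mlp_links:
        raise ValueError("no MLP links to compare")
    covered = [src for src in mlp_links if src in reference_links]
    if not covered:
        raise ValueError("reference covers none of the MLP links")
    agree = sum(1 for src in covered if mlp_links[src] == reference_links[src])
    return len(covered) / len(mlp_links), agree / len(covered)


# Hypothetical links: the reference covers four of five MLP links
# and disagrees with MLP on one of them.
mlp = {"r1": "s1", "r2": "s2", "r3": "s3", "r4": "s4", "r5": "s5"}
family_tree = {"r1": "s1", "r2": "s2", "r3": "s3", "r4": "s8", "r9": "s9"}

overlap, agreement = compare_to_reference(mlp, family_tree)
print(overlap, agreement)
```

Applied to the 1850 to 1860 figures, an overlap of 41.4 percent of 8.0 million links implies a comparison universe of roughly 3.3 million links.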

Comparison to the Family Tree database
 (% agreement for linked cases; % overlap in coverage)

From \ To  1860         1870         1880         1900         1910         1920         1930         1940
1850       98.1 (41.4)  98.0 (36.9)  97.6 (40.4)  98.3 (40.3)  98.6 (40.1)  98.9 (36.6)  99.3 (31.3)  -
1860       -            98.4 (36.2)  98.5 (40.7)  98.6 (39.4)  98.7 (38.7)  98.8 (35.2)  99.1 (30.8)  99.3 (28.0)
1870       -            -            98.7 (37.4)  98.2 (35.7)  98.3 (35.9)  98.4 (32.4)  98.7 (28.9)  98.9 (27.7)
1880       -            -            -            99.0 (50.8)  98.7 (49.8)  98.8 (45.0)  98.8 (38.3)  98.8 (37.7)
1900       -            -            -            -            99.5 (63.2)  99.2 (50.0)  99.0 (40.8)  98.9 (41.7)
1910       -            -            -            -            -            99.5 (54.0)  99.2 (43.0)  99.0 (42.9)
1920       -            -            -            -            -            -            99.4 (39.4)  99.1 (38.4)
1930       -            -            -            -            -            -            -            99.4 (33.0)

The agreement between MLP and Family Tree links is very high overall, with a minimum rate of 97.6 percent in the nineteenth century and rates from 98.2 to 99.5 percent for twentieth-century pairings. Naturally, if the links not covered by the Family Tree data are substantially less accurate than the conditional accuracy suggests, the overall accuracy of the data set will be lower. While we acknowledge that the conditional accuracy represents an upper bound of the overall accuracy, earlier investigations into the unconditional accuracy of MLP links are reassuring. For the first version of MLP we performed double-blind manual checks of a small subset of links to obtain an alternative measure of accuracy. The results of that assessment were comparable to those of the comparison with FamilySearch genealogical links for the twentieth century and were modestly lower for the nineteenth century.
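
To see how much the uncovered links could matter, overall accuracy can be written as a weighted average of the conditional accuracy and the unknown accuracy of links outside the Family Tree universe. The sketch below uses the 1850 to 1860 figures from the table (98.1 percent agreement, 41.4 percent overlap); the accuracy values assumed for the uncovered links are hypothetical scenarios, not estimates, and the function name `overall_accuracy` is illustrative.

```python
def overall_accuracy(conditional, overlap, uncovered_accuracy):
    """Weighted average of accuracy inside and outside the comparison universe."""
    return overlap * conditional + (1 - overlap) * uncovered_accuracy


conditional = 0.981  # 1850-1860 agreement rate from the table
overlap = 0.414      # 1850-1860 overlap from the table

# Hypothetical accuracy levels for links that Family Tree does not cover.
scenarios = {
    u: overall_accuracy(conditional, overlap, u) for u in (0.981, 0.95, 0.90)
}
for assumed, overall in scenarios.items():
    print(f"uncovered accuracy {assumed:.3f} -> overall {overall:.3f}")
```

Because more than half of the 1850 to 1860 links fall outside the comparison universe, the overall figure moves most of the way toward whatever accuracy is assumed for the uncovered links.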

Representativeness

MLP does not link everyone, and inevitably some groups in the population are better represented than others due to our linking strategy, the source data, and the inherent difficulty of tracking certain groups over time. For example, the linked population includes a greater proportion of Whites, rural residents, and people living in their state of birth. To provide an overview across a variety of characteristics, this table compares the distribution of each census population to the population linked from that census to the next one. Most users will want to weight the data to adjust for biases in the linked data. MLP does not provide such weights, because they need to be tailored to the particular study. Inverse probability weighting is the most common approach.
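
A minimal sketch of inverse probability weighting follows, using the simplest cell-based variant: the linkage probability is estimated as the share of each cell of the base census that was linked, and each linked record receives the inverse of that share as its weight. The field names, the choice of cells, and the function name `inverse_probability_weights` are assumptions for illustration; studies often estimate the linkage probability with a regression model instead, and the right set of characteristics depends on the research question.

```python
from collections import Counter


def inverse_probability_weights(population, linked_ids, strata_keys):
    """Return {record id: weight} for the linked records.

    population:  list of dicts, one per person in the base census
    linked_ids:  set of ids of the records that were linked
    strata_keys: characteristics that define the weighting cells
    """
    def cell(person):
        return tuple(person[key] for key in strata_keys)

    total = Counter(cell(p) for p in population)
    linked = Counter(cell(p) for p in population if p["id"] in linked_ids)

    weights = {}
    for p in population:
        if p["id"] in linked_ids:
            # Estimated P(linked | cell) is linked / total, so weight is total / linked.
            weights[p["id"]] = total[cell(p)] / linked[cell(p)]
    return weights


# Hypothetical base census: rural residents are linked more often than urban ones.
census = [
    {"id": 1, "residence": "rural"},
    {"id": 2, "residence": "rural"},
    {"id": 3, "residence": "rural"},
    {"id": 4, "residence": "rural"},
    {"id": 5, "residence": "urban"},
    {"id": 6, "residence": "urban"},
]
linked = {1, 2, 3, 5}

weights = inverse_probability_weights(census, linked, ["residence"])
print(weights)
```

The weights of the linked records sum to the size of the base population, so the under-linked urban group counts for its full share. A cell with no linked records receives no weight at all, which is one reason the cells need to be chosen with the particular study in mind.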
