MLP Linking Method

The MLP linking method is a three-step probabilistic procedure using a supervised machine-learning approach. The first step aims to identify individual links across two censuses. The second step exploits links made during the previous step to link any remaining household members that may be present in both censuses. The general linking strategy used for these first two steps is detailed in this methodological publication, which describes the first iteration of MLP released in 2020. In the final step — new to Version 2.0 — links are created across adjacent censuses for single women who married in that interval and changed their names.

While the overall approach is consistent across census pairs, differing information provided by each respective census necessitates some variation in the linking features used by the algorithm (outlined in this linking features table). Using training data, the algorithm is calibrated based on the features available in the particular two censuses being linked.

The MLP linking process is summarized in this graphic and described in greater detail below:

Step 1

For each pair of census years, we start with the population of the earlier census. Each person is linked to a population of potential matches in the later census, subject to the following criteria:

Census A individual and Census B potential match have the same sex and place of birth.
Birth year of Census B potential match is within +/- three years of the Census A individual.
Ever-married females cannot match to never-married females; separated, divorced, and widowed females in Census A can only link to females with the same status in Census B; married males in Census A cannot match to never-married males in Census B.
Jaro-Winkler first and last name similarity score > 0.8
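The blocking criteria above can be sketched in a few lines of Python. This is only an illustration, not the Hlink implementation: the record fields, the marital status labels, and the function names are hypothetical. It also assumes that the 0.8 cutoff applies to first and last names separately, and that "the same status" means an exact match on separated, divorced, or widowed.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


# Hypothetical status labels; real census codes differ.
DISRUPTED = {"separated", "divorced", "widowed"}


def marital_ok(sex, status_a, status_b):
    """Apply the marital status restrictions from Census A to Census B."""
    if sex == "F":
        if status_a != "never_married" and status_b == "never_married":
            return False
        if status_a in DISRUPTED and status_b != status_a:
            return False  # assumption: exact status match required
    elif sex == "M":
        if status_a == "married" and status_b == "never_married":
            return False
    return True


def is_candidate(a, b, name_cutoff=0.8, year_window=3):
    """True if Census B record b is a potential match for Census A record a."""
    return (
        a["sex"] == b["sex"]
        and a["birthplace"] == b["birthplace"]
        and abs(a["birth_year"] - b["birth_year"]) <= year_window
        and marital_ok(a["sex"], a["marital_status"], b["marital_status"])
        and jaro_winkler(a["first"], b["first"]) > name_cutoff
        and jaro_winkler(a["last"], b["last"]) > name_cutoff
    )
```

Only pairs that pass is_candidate would go on to be scored by the model, which keeps the number of comparisons manageable.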

The XGBoost algorithm predicts the similarity of each unique Census A-Census B potential match. Our approach uses an extended set of characteristics when assessing a potential match, including individual, household, and contextual characteristics. For each Census A individual, a link to the highest predicted similarity Census B potential match is confirmed when two conditions are fulfilled:

The predicted similarity of the census record combination exceeds a given threshold (α).
The predicted similarity of the most probable census record combination (α₁) exceeds the predicted similarity of the second-most probable combination (α₂) by a factor of at least β; that is, α₁/α₂ > β.
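As a minimal sketch, the two conditions can be written as a single decision function. The function name, the input layout (a mapping from Census B identifiers to predicted similarities for one Census A person), and the threshold values in the usage note are hypothetical. The sketch also assumes that a lone candidate with no runner-up satisfies the ratio condition.

```python
def confirm_link(scores, alpha, beta):
    """Return the Census B id to link, or None if either condition fails.

    scores: dict mapping Census B id -> predicted similarity for ONE
            Census A individual (hypothetical layout).
    alpha:  minimum similarity the best candidate must exceed.
    beta:   minimum ratio of best to second-best similarity.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best = ranked[0]
    # Condition 1: the best candidate must exceed the threshold alpha.
    if best <= alpha:
        return None
    # Condition 2: the best must beat the runner-up by a ratio above beta.
    if len(ranked) > 1:
        second = ranked[1][1]
        if second > 0 and best / second <= beta:
            return None
    return best_id
```

For example, with illustrative thresholds alpha = 0.5 and beta = 2.0, the scores {"b1": 0.9, "b2": 0.3} yield a link to "b1", while {"b1": 0.9, "b2": 0.8} yield no link because the two candidates are too close to tell apart.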

The parameter estimates used to predict the similarity between two records, as well as the thresholds for α and β, are obtained through calibration of the algorithm with training data. In addition, training data is used to calculate the expected quality of the resulting linked data (also called precision and recall in the machine learning literature). For a short discussion of quality see the data description. Step 1 is complete after the identification of pairs of census records that fulfill both conditions listed above and the removal of duplicate matches (i.e., multiple individuals in Census A linked to the same Census B individual).
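The duplicate removal and the quality measures mentioned above might look like the following sketch. The function names and the data layout are hypothetical. The sketch assumes that when several Census A individuals point to the same Census B individual, all of those links are dropped; the text does not say whether one of them is kept.

```python
from collections import Counter


def drop_duplicate_targets(links):
    """Remove links whose Census B record is claimed more than once.

    links: list of (census_a_id, census_b_id) tuples.
    Assumption: every link to a contested Census B record is removed.
    """
    counts = Counter(b_id for _, b_id in links)
    return [(a_id, b_id) for a_id, b_id in links if counts[b_id] == 1]


def precision_recall(predicted, truth):
    """Precision and recall of predicted links against labeled true links."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Precision is the share of predicted links that are correct, and recall is the share of true links that were found; both are estimated here against hand-labeled training pairs.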

The Step 1 procedure is performed separately for 10, 20, and 30-year census comparisons, each of which uses distinct sets of training data and calibration processes.

Step 2

The linked census records identify not only a given individual at two points in time but also the household in which that person resides at the time of each census. In Step 2, we exploit individual links established during Step 1 to examine all household members for potential matches across the two censuses. The universe of Census A individuals whom we attempt to link in Step 2 is restricted to those residing in a household with at least one individual who was successfully linked to Census B in Step 1. This linking step includes a re-examination of the persons linked in Step 1 and can include comparisons to multiple households in Census B if multiple Step 1 links point to different households. For matching, we use an approach similar to Step 1, comparing all pairs of census records that meet the following blocking criteria:

Census A individual, residing in a household with a Step 1 linked individual, is matched to all Census B household members of the Step 1 linked individual(s).
Birth year of Census B potential match is within +/- ten years of Census A individual.
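The Step 2 blocking can be illustrated with a short sketch that generates candidate pairs from Step 1 links. The record layout (dictionaries with "id", "household", and "birth_year" keys) and the function name are hypothetical; this is not the Hlink code.

```python
from collections import defaultdict


def step2_candidates(census_a, census_b, step1_links, max_gap=10):
    """Pair Census A household members with Census B household members.

    census_a, census_b: lists of dicts with "id", "household", "birth_year".
    step1_links: dict mapping Census A id -> Census B id from Step 1.
    """
    b_by_id = {person["id"]: person for person in census_b}
    b_households = defaultdict(list)
    for person in census_b:
        b_households[person["household"]].append(person)

    # For each Census A household, collect the Census B households
    # reached through any Step 1 link (there may be more than one).
    targets = defaultdict(set)
    for person in census_a:
        b_id = step1_links.get(person["id"])
        if b_id is not None:
            targets[person["household"]].add(b_by_id[b_id]["household"])

    pairs = []
    for person in census_a:
        # Households with no Step 1 link produce no candidates.
        for household in sorted(targets.get(person["household"], ())):
            for other in b_households[household]:
                if abs(person["birth_year"] - other["birth_year"]) <= max_gap:
                    pairs.append((person["id"], other["id"]))
    return pairs
```

Note that persons already linked in Step 1 are included among the candidates, which mirrors the re-examination described above.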

The procedure employed to confirm a match follows a strategy similar to that of Step 1, albeit with a slightly reduced set of linking characteristics. More specifically, parameter estimates and α/β threshold values are calibrated on distinct training data and used to determine which links constitute a match.

In cases where Steps 1 and 2 match to different people, we retain the Step 2 link, which consistently achieved higher agreement rates with the Family Tree linked genealogical database. If a single woman links to a married woman and her surname is unchanged, we drop the link.
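A minimal sketch of these two rules follows. The function names, the record fields, and the status labels are hypothetical. The sketch assumes that a conflict means the same Census A individual was linked to different Census B individuals in the two steps, and that "unchanged" means an exact surname match.

```python
def reconcile_steps(step1, step2):
    """Merge Step 1 and Step 2 links, preferring Step 2 on disagreement.

    step1, step2: dicts mapping Census A id -> Census B id.
    """
    merged = dict(step1)
    merged.update(step2)  # Step 2 overrides Step 1 for the same person
    return merged


def is_implausible_marriage_link(a, b):
    """Single woman linked to a married woman with the same surname."""
    return (
        a["sex"] == "F"
        and a["marital_status"] == "never_married"
        and b["marital_status"] == "married"
        and a["last"] == b["last"]  # assumption: exact string equality
    )


def finalize_links(step1, step2, people_a, people_b):
    """Apply both rules; people_a and people_b map ids to record dicts."""
    merged = reconcile_steps(step1, step2)
    return {
        a_id: b_id
        for a_id, b_id in merged.items()
        if not is_implausible_marriage_link(people_a[a_id], people_b[b_id])
    }
```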

The Step 2 procedure is performed separately for 10, 20, and 30-year census comparisons, each of which uses distinct sets of training data and calibration processes.

Step 3

In Step 3, we use information from the Social Security Numident database to link women across decades in which they transitioned from single to married status. The universe for this step is women who were single in Census A and were not linked to a later census in Step 1 or Step 2. The universe is further limited to women whom we were able to link from the census to the Numident, which supplies both birth and married surnames. These Numident-derived links are available only for 1880-1950 and are created only for adjacent census year pairs. For matching, we use the approach outlined in Step 1, but compare only never-married women in Census A to currently married women in Census B.

Step 3 uses specialized training data oriented to females transitioning from single to married status.
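The Step 3 universe can be sketched as a simple filter. Everything here is illustrative: the field names, the status labels, and the shape of the Numident lookup are hypothetical, and the idea of carrying the married surname forward for comparison against Census B is an inference from the text rather than a documented detail.

```python
def is_adjacent(year_a, year_b):
    """Step 3 links are created only between adjacent censuses."""
    return year_b - year_a == 10


def step3_universe(census_a, already_linked, numident):
    """Select Census A women eligible for Step 3.

    census_a: list of dicts with "id", "sex", "marital_status", "last".
    already_linked: set of Census A ids linked in Step 1 or Step 2.
    numident: dict mapping Census A id -> {"married_surname": ...}
              (hypothetical layout for census-to-Numident links).
    """
    universe = []
    for person in census_a:
        if person["sex"] != "F" or person["marital_status"] != "never_married":
            continue
        if person["id"] in already_linked or person["id"] not in numident:
            continue
        # Assumption: the married surname is what gets compared to Census B.
        married = numident[person["id"]]["married_surname"]
        universe.append({**person, "compare_surname": married})
    return universe
```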

Conflict resolution

The 10, 20, and 30-year links can conflict with one another. In such cases, we suppress the entire chain of interconnected links for all census years that involve those cases. This results in the removal of approximately 2.1% of all links.
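One way to picture this suppression is as a graph problem, sketched below with a union-find structure. The sketch rests on an assumption about what counts as a conflict: a chain of interconnected links is treated as conflicted when it contains more than one record from the same census year. Records are represented as hypothetical (year, id) tuples.

```python
from collections import defaultdict


def suppress_conflicts(links):
    """Drop every link in any chain holding two records from one census.

    links: list of ((year, id), (year, id)) tuples pooled from the
           10, 20, and 30-year comparisons.
    """
    parent = {}

    def find(record):
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    # Join the two ends of every link into one chain.
    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    # A chain is conflicted if a census year appears in it twice.
    years_seen = defaultdict(set)
    conflicted = set()
    for record in list(parent):
        root = find(record)
        year = record[0]
        if year in years_seen[root]:
            conflicted.add(root)
        years_seen[root].add(year)

    return [(l, r) for l, r in links if find(l) not in conflicted]
```

In this picture, a 20-year link that disagrees with the path formed by two 10-year links pulls two different records from the same census into one chain, and the whole chain is removed.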

Historical Identification Keys (HIKs)

Once matches are finalized, Hlink assigns a consistent Historical Identification Key (HIK) that uniquely identifies each linked person in all census years in which they appear. This involves chaining together the 10, 20, and 30-year links created in Steps 1 through 3 to create sets of links spanning as many as 80 years. HIKs are analogous to Census Bureau PIKs (protected identification keys) for linking census and administrative records. The HIKs are incorporated into the IPUMS data dissemination system to enable the creation of custom data extracts composed of individuals linked across multiple censuses. A new HIK is generated for an individual if there is any change among the links associated with that person. In such cases, the old HIK is archived for future reference and reproducibility.
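The chaining and key assignment can be sketched as follows. This is an illustration under stated assumptions, not the production procedure: records are hypothetical (year, id) tuples, keys are plain integers rather than the real HIK format, and "any change among the links" is approximated as any change in the set of records in a person's chain.

```python
from collections import defaultdict


def chain_links(links):
    """Group records that are connected through any sequence of links."""
    graph = defaultdict(set)
    for left, right in links:
        graph[left].add(right)
        graph[right].add(left)
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(frozenset(group))
    return groups


def assign_keys(groups, previous, next_id):
    """Reuse a key only when a chain is unchanged; otherwise mint a new one.

    previous: dict mapping frozenset of records -> key from the last release.
    next_id:  first unused integer key (hypothetical key format).
    Returns (current mapping, sorted list of archived keys).
    """
    current = {}
    for group in groups:
        if group in previous:
            current[group] = previous[group]
        else:
            current[group] = next_id
            next_id += 1
    archived = sorted(set(previous.values()) - set(current.values()))
    return current, archived
```

A chain that gains or loses a record between releases therefore receives a fresh key, and its earlier key moves to the archive list.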

Software

A custom-built Python package called Hlink performs probabilistic record linkage for MLP. It employs a supervised machine-learning approach to calculate the probability of a link between records, using training data classified as matches or non-matches by a team of researchers. The current version of MLP uses the XGBoost (Extreme Gradient Boosting) algorithm for matching, which outperformed several competing algorithms in our testing. The open-source software used for production is available from GitHub.