MLP Linking Method
For the MLP project, we developed a probabilistic record linkage method and built the infrastructure needed to manage record linkage across multiple full-count census databases. Our study describes the method in detail. Researchers who wish to implement the method described in the working paper can access the Stata code used to design the process. The open-source Hlink software used for production is available from GitHub. Hlink is built on top of Apache Spark and uses a distributed processing engine to spread the workload across a cluster of computers, which speeds up the linking process. It lets researchers tailor the procedure to each dataset by choosing which linking variables to include or exclude and which linking methodologies to employ.
The linking method described on this page and related documents is specific to version 1.0 of MLP, released in late 2020. The two subsequent MLP releases follow the same design, making relatively minor improvements apart from adding new or improved data sources.
The linking algorithm is a two-step procedure. The first step aims to identify individual links across two censuses. The second step uses the links made in the first step to link the remaining household members who are present in both censuses. While the overall approach is consistent across census pairs, the linking features used by the algorithm vary slightly because each census provides different information, as outlined in this table. For version 1.0, the algorithm is calibrated separately for each census pair, based on the features available in the two censuses being linked. Consequently, while the algorithms used to link 1850-1860 and 1930-1940 are both calibrated on 1900-1910 training data, they use slightly different sets of features, based on the variables present in the 1850/1860 and 1930/1940 censuses, respectively. All census pairs separated by ten years use training data developed from 1900-1910 data, whereas all pairs separated by twenty years use training data for the 1880-1900 period. Lastly, census pairs separated by thirty years use 1900-1930 training data for Step I of the algorithm and 1880-1900 training data for Step II.
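The mapping from census gap to training period described above can be written as a small lookup table. This is only an illustrative sketch: the names `TRAINING_PERIODS` and `training_period` are hypothetical and do not come from the MLP or Hlink code, and the sketch assumes that a single period stated for a gap applies to both steps.

```python
from typing import Dict, Tuple

# Training-data periods by census gap (in years), as stated in the text.
# Assumption: where the text gives one period for a gap, it applies to both steps.
TRAINING_PERIODS: Dict[int, Dict[str, Tuple[int, int]]] = {
    10: {"step1": (1900, 1910), "step2": (1900, 1910)},
    20: {"step1": (1880, 1900), "step2": (1880, 1900)},
    30: {"step1": (1900, 1930), "step2": (1880, 1900)},
}


def training_period(year_a: int, year_b: int, step: str) -> Tuple[int, int]:
    """Return the training-data period for a census pair and step ("step1" or "step2")."""
    gap = year_b - year_a
    if gap not in TRAINING_PERIODS or step not in TRAINING_PERIODS[gap]:
        raise ValueError(f"no training period listed for gap={gap}, step={step!r}")
    return TRAINING_PERIODS[gap][step]
```

For example, `training_period(1850, 1860, "step1")` returns the 1900-1910 period, matching the 1850-1860 case discussed above.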
For the NUMIDENT-to-census links, we developed training data for men and women separately, with links to all censuses between 1900 and 1940. The algorithm was trained and implemented separately for men, ever-married women, and never-married women and separately for each target census. The NUMIDENT-1940 calibration exercises were used for the NUMIDENT-1950 links. Although it was necessary to employ fewer features compared to the census-to-census links, the procedure used for the NUMIDENT-census links followed the process outlined under Step I, below. More details on the NUMIDENT linking method are described here.
Step I
For each pair of census years, we start with the population of males from the earlier of the two censuses (Census A). Each individual is paired with a pool of potential matches from the later census (Census B), subject to the following criteria:
- Census A individual and Census B potential match have the same sex and place of birth.
- Birth year of Census B potential match is within +/- three years of the Census A individual.
- Agreement on at least one adjusted bigram on first and last name between Census A individual and Census B potential match.
- Jaro-Winkler first and last name similarity score > 0.7
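The four criteria above can be sketched as a single filter applied to each pair of records. This is a minimal illustration, not the production Hlink logic. It makes several assumptions: records are plain dictionaries with hypothetical keys (`sex`, `birthplace`, `birth_year`, `first`, `last`); plain character bigrams stand in for the "adjusted" bigrams, whose definition is not given here; and the bigram and Jaro-Winkler conditions are applied to the first name and the last name separately.

```python
from typing import Set


def bigrams(name: str) -> Set[str]:
    """Plain character bigrams (a stand-in for the 'adjusted' bigrams)."""
    s = name.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings."""
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def is_candidate(a: dict, b: dict, max_year_gap: int = 3, min_jw: float = 0.7) -> bool:
    """Apply the four Step I blocking criteria to one Census A / Census B pair."""
    if a["sex"] != b["sex"] or a["birthplace"] != b["birthplace"]:
        return False
    if abs(a["birth_year"] - b["birth_year"]) > max_year_gap:
        return False
    for field in ("first", "last"):
        x, y = a[field].lower(), b[field].lower()
        if not bigrams(x) & bigrams(y):   # no shared bigram
            return False
        if jaro_winkler(x, y) <= min_jw:  # similarity must exceed 0.7
            return False
    return True
```

Only pairs that pass this filter would go on to the scoring stage, which keeps the number of comparisons manageable for full-count data.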
We rely on a regression approach (a logit estimator) to predict the similarity of each unique pairing of a Census A individual with a Census B potential match. Our approach is distinctive in the range of characteristics it compares. In addition to the individual-level characteristics used by most historical census linkage projects, it exploits an extended set of individual, household, and contextual characteristics when predicting the degree of similarity between a given pair of census records. For each Census A individual, a link to the Census B potential match with the highest predicted similarity is confirmed when two conditions are fulfilled:
- The predicted similarity of a combination of census records exceeds a given threshold (α)
- The predicted similarity of the most probable combination of census records (α1) exceeds that of the second-most probable combination (α2) by a factor of more than β; that is, α1/α2 > β. Here α1 and α2 denote predicted similarities, not thresholds.
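The scoring and the two confirmation conditions can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, feature names, coefficients, and threshold values are hypothetical, and the handling of a Census A individual with only one potential match (accepted whenever it exceeds α) is an assumption that is not stated above.

```python
import math
from typing import Dict, Optional


def predict_similarity(features: Dict[str, float],
                       coefs: Dict[str, float],
                       intercept: float) -> float:
    """Logit prediction: the logistic function of a linear combination of features."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def confirm_link(scores: Dict[str, float], alpha: float, beta: float) -> Optional[str]:
    """Return the Census B id to link, or None if either condition fails.

    `scores` maps each potential match's id to its predicted similarity.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best = ranked[0]
    # Condition 1: the best predicted similarity must exceed alpha.
    if best <= alpha:
        return None
    # Condition 2: best / second-best must exceed beta.
    # Assumption: with a single potential match there is nothing to compare against.
    if len(ranked) > 1:
        second = ranked[1][1]
        if second > 0 and best / second <= beta:
            return None
    return best_id
```

With hypothetical thresholds `alpha=0.5` and `beta=2.0`, scores of 0.9 and 0.3 confirm the first candidate (ratio 3.0), while scores of 0.9 and 0.6 confirm nothing (ratio 1.5). The ratio test rejects individuals who have two similarly plausible matches.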
The parameter estimates used to predict the similarity between two records, as well as the values of the thresholds α and β, are obtained by calibrating the algorithm with training data. The training data are also used to estimate the expected quality of the resulting linked data, measured by what the machine learning literature calls precision and recall. For a short description of quality see the data description, or our linking paper for more details. Step I is complete once we have identified the pairs of census records that fulfill both conditions listed above and removed duplicate matches (i.e., multiple individuals in Census A linked to the same Census B individual).
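The duplicate-removal step and the two quality measures can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes that every link involved in a duplicate is dropped; the text above does not say whether one of the competing links is kept.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def drop_duplicate_links(links: Dict[str, str]) -> Dict[str, str]:
    """Keep only links whose Census B record is claimed by exactly one Census A record.

    `links` maps Census A ids to Census B ids.
    Assumption: all links sharing a Census B id are dropped.
    """
    counts = Counter(links.values())
    return {a: b for a, b in links.items() if counts[b] == 1}


def precision_recall(predicted: Set[Tuple[str, str]],
                     truth: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision: share of predicted links that are true.
    Recall: share of true links that were predicted."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

In this setting `truth` would be the set of links in the training data, and `predicted` the links the algorithm confirms for the same records.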
Step II
The linked census records identify not only a given individual at two points in time, but also the household in which he or she resides at the time of each census. In Step II, we use the individual links established during Step I to link the remaining household members, men as well as women, who were present at both points in time. The universe of Census A individuals whom we attempt to link in Step II is restricted to those residing in a household with at least one other individual who was successfully linked to Census B in Step I. For males, we further restrict the universe to those unlinked in Step I. We use a logistic regression approach similar to Step I, comparing all pairs of census records that meet the following blocking criteria:
- Census A individual, residing in a household with a Step I linked individual, is matched to all Census B household members of the Step I linked individual(s).
- Birth year of Census B potential match is within +/- ten years of Census A individual.
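The Step II universe and blocking criteria can be sketched as a function that generates candidate pairs from the Step I links. This is an illustrative sketch only: the record layout (dictionaries keyed by person id, with hypothetical `household`, `sex`, and `birth_year` fields) and the function name are assumptions, and the sketch does not remove Census B records that were already linked in Step I, since the text does not say whether they are excluded.

```python
from typing import Dict, List, Set, Tuple


def step2_candidates(people_a: Dict[str, dict],
                     people_b: Dict[str, dict],
                     step1_links: Dict[str, str],
                     max_year_gap: int = 10) -> List[Tuple[str, str]]:
    """Generate (Census A id, Census B id) pairs that meet the Step II blocking criteria."""
    # Index Census B people by household.
    members_b: Dict[str, List[str]] = {}
    for b_id, rec in people_b.items():
        members_b.setdefault(rec["household"], []).append(b_id)

    # For each Census A household, record who was linked in Step I and
    # which Census B household that link leads to.
    anchors: Dict[str, List[Tuple[str, str]]] = {}
    for a_id, b_id in step1_links.items():
        hh_a = people_a[a_id]["household"]
        anchors.setdefault(hh_a, []).append((a_id, people_b[b_id]["household"]))

    pairs: List[Tuple[str, str]] = []
    for a_id, rec in people_a.items():
        # Males already linked in Step I are outside the Step II universe.
        if rec["sex"] == "M" and a_id in step1_links:
            continue
        # Require at least one OTHER household member linked in Step I.
        households_b: Set[str] = {
            hh_b for linked_id, hh_b in anchors.get(rec["household"], [])
            if linked_id != a_id
        }
        for hh_b in sorted(households_b):
            for b_id in members_b[hh_b]:
                if abs(people_b[b_id]["birth_year"] - rec["birth_year"]) <= max_year_gap:
                    pairs.append((a_id, b_id))
    return pairs
```

Because candidates come only from the households reached through Step I links, the pool for each person is small, which is what allows the much wider ten-year birth-year window.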
The procedure used to confirm a match follows a strategy similar to that of Step I, but exploits a different set of linking characteristics. More specifically, point estimates and α/β threshold values are calibrated on specialized training data and used to determine which links constitute a match. For women, where a Step II match competes with a Step I match, we select the Step II match.