MLP Linking Method
For the MLP project, we developed a probabilistic method of record linkage and built the infrastructure necessary to manage record linkage across multiple full-count census databases. Our study describes the method in detail. Researchers who wish to implement the method described in the working paper can access the Stata code used in the design of the process. The Hlink open-source software used for production will be available soon. Hlink is built on top of Apache Spark and uses a distributed processing engine to spread the workload across a cluster of computers, increasing the efficiency and speed of the linking process. The program is designed so that the researcher can tailor the procedure to each dataset, choosing which linking variables to include or exclude and which linking methodologies to employ.
The linking algorithm is a two-step procedure, where the first step aims to identify individual links across two censuses. In the second step, the algorithm exploits links made during the previous step to link remaining household members that are present in both censuses. While the overall approach is consistent across census pairs, there is slight variation in terms of the linking features used by the algorithm, due to limitations in the information provided by each respective census. This table outlines this variation, by census pairs.
For each pair of census years, we start with the population of males from the earlier of the two censuses. Each individual is linked to a population of potential matches from the later census, subject to the following criteria:
- Census A individual and Census B potential match have the same sex and place of birth.
- Birth year of Census B potential match is within +/- three years of the Census A individual.
- Agreement on at least one adjusted bigram on first and last name between Census A individual and Census B potential match.
- Jaro-Winkler first and last name similarity score > 0.7.
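The four blocking criteria above can be sketched in Python as follows. This is a minimal illustration, not the project's Stata or Hlink code. The `Person` record, its field names, and `is_candidate` are hypothetical. The sketch uses plain character bigrams because the "adjusted" bigram is not defined here, and it assumes that the bigram and Jaro-Winkler conditions must each hold for both the first and the last name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout, for illustration only.
    first: str
    last: str
    sex: str
    birthplace: str
    birth_year: int


def bigrams(name: str) -> set:
    """Plain character bigrams (assumption: no 'adjustment' is applied)."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    if not matched1:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) / 2
    m = len(matched1)
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def is_candidate(a: Person, b: Person) -> bool:
    """True if the Census B record b passes all blocking criteria for a."""
    return (
        a.sex == b.sex
        and a.birthplace == b.birthplace
        and abs(a.birth_year - b.birth_year) <= 3
        and bool(bigrams(a.first) & bigrams(b.first))
        and bool(bigrams(a.last) & bigrams(b.last))
        and jaro_winkler(a.first.lower(), b.first.lower()) > 0.7
        and jaro_winkler(a.last.lower(), b.last.lower()) > 0.7
    )
```

Blocking of this kind keeps the number of record pairs passed to the regression step manageable, since only pairs for which `is_candidate` returns true need to be scored.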
We use a regression approach (a logit estimator) to predict the similarity of each unique pairing of a Census A individual with a Census B potential match. What distinguishes our approach is the extended set of characteristics it compares: in addition to the individual-level characteristics used by most historical census linkage projects, it exploits household and contextual characteristics when predicting the degree of similarity between a given pair of census records. For each Census A individual, a link to the Census B potential match with the highest predicted similarity is confirmed when two conditions are fulfilled:
- The predicted similarity of a combination of census records exceeds a given threshold (α).
- The predicted similarity of the most probable combination of census records (α1) exceeds that of the second-most probable combination (α2) by at least a factor of β; that is, α1/α2 > β.
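The two confirmation conditions can be illustrated with a short Python sketch. The function names, the feature dictionary, and the coefficient values a caller would pass are hypothetical. The real parameter estimates and thresholds come from calibration on training data. The sketch also assumes that a Census A individual with a single candidate passes the ratio test automatically, which the description above does not specify.

```python
import math


def predicted_similarity(features, coefficients, intercept):
    """Logistic (logit) prediction from a dict of feature values."""
    z = intercept + sum(coefficients[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def confirm_link(scores, alpha, beta):
    """Return the id of the confirmed Census B match, or None.

    scores maps each candidate's id to its predicted similarity.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best = ranked[0]
    # Condition 1: the best score must exceed the threshold alpha.
    if best <= alpha:
        return None
    # Condition 2: best / second-best must exceed beta.
    # Assumption: with no second candidate the ratio test passes.
    if len(ranked) > 1:
        second = ranked[1][1]
        if second > 0 and best / second <= beta:
            return None
    return best_id
```

For example, with `alpha=0.5` and `beta=1.5`, candidate scores of 0.9 and 0.3 confirm the first candidate, while scores of 0.9 and 0.8 confirm nothing because the two candidates are too close to distinguish.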
The parameter estimates used to predict the similarity between two records, as well as the thresholds for α and β, are obtained by calibrating the algorithm with training data. The training data are also used to estimate the expected quality of the resulting linked data (measured as precision and recall in the machine learning literature). For a short description of quality see the data description, or our linking paper for more details. Step I is complete after the identification of pairs of census records that fulfill both conditions listed above and the removal of duplicate matches (i.e., multiple individuals in Census A linked to the same Census B individual).
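The duplicate-removal step and the two quality measures can be sketched as below. The function names are hypothetical, and the sketch assumes that every link involved in a duplicate is dropped. The text above does not say whether one of the competing links is retained instead.

```python
from collections import Counter


def drop_duplicate_links(links):
    """Drop every (a_id, b_id) link whose b_id is claimed more than once.

    Assumption: all competing links are removed, none is kept.
    """
    counts = Counter(b_id for _, b_id in links)
    return [(a_id, b_id) for a_id, b_id in links if counts[b_id] == 1]


def precision_recall(predicted, truth):
    """Precision and recall of predicted links against known true links."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Precision is the share of confirmed links that are correct, and recall is the share of true links that the algorithm recovers. Raising α or β would generally be expected to trade recall for precision.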
The linked census records identify not only a given individual at two points in time, but also the household in which he or she resides at the time of both censuses. In Step II, we exploit the individual links established during Step I to link the remaining household members, men as well as women, who were present at both points in time. The universe of Census A individuals whom we attempt to link in Step II is restricted to those residing in a household with at least one other individual who was successfully linked to Census B in Step I. For males, we further restrict the universe to those unlinked in Step I. We use a logistic regression approach similar to that of Step I, comparing all pairs of census records that meet the following blocking criteria:
- Census A individual, residing in a household with a Step I linked individual, is matched to all Census B household members of the Step I linked individual(s).
- Birth year of Census B potential match is within +/- ten years of Census A individual.
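The Step II blocking criteria above can be sketched as a candidate-pair generator. The data layout (dictionaries keyed by person id with `household`, `birth_year`, and `sex` fields) and the function name are hypothetical. The sketch does not exclude Census B individuals who were already linked in Step I, since the criteria above do not mention such a restriction.

```python
from collections import defaultdict


def step2_candidates(people_a, people_b, step1_links, max_year_gap=10):
    """Return (a_id, b_id) pairs that pass the Step II blocking criteria.

    people_a / people_b: dict id -> {"household", "birth_year", "sex"}
    step1_links: dict a_id -> b_id for links confirmed in Step I
    """
    # For each Census A household, record which Census B households
    # its Step I linked members ended up in.
    linked = defaultdict(list)
    for a_id, b_id in step1_links.items():
        linked[people_a[a_id]["household"]].append(
            (a_id, people_b[b_id]["household"])
        )

    # Index Census B individuals by household.
    members_b = defaultdict(list)
    for b_id, b in people_b.items():
        members_b[b["household"]].append(b_id)

    pairs = []
    for a_id, a in people_a.items():
        # Males already linked in Step I are outside the Step II universe.
        if a["sex"] == "M" and a_id in step1_links:
            continue
        # Only links made by *other* household members count.
        target_households = {
            hh_b for linker, hh_b in linked.get(a["household"], [])
            if linker != a_id
        }
        for hh_b in target_households:
            for b_id in members_b[hh_b]:
                gap = abs(people_b[b_id]["birth_year"] - a["birth_year"])
                if gap <= max_year_gap:
                    pairs.append((a_id, b_id))
    return pairs
```

Because candidates are limited to the households of already linked individuals, the birth-year window can be much wider than in Step I without producing an unmanageable number of pairs.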
The procedure used to confirm a match follows a strategy similar to that of Step I, albeit with a different set of linking characteristics. More specifically, point estimates and α/β threshold values are calibrated on specialized training data and used to determine which links constitute a match. For women, when a Step II match competes with a Step I match, we select the Step II match.