Making Sense of Census Responses: Coding Complex Variables in the 1920 PUMS¹

by Ronald Goeken, Marjorie Bryer and Cassandra Lucas

The production of a machine-readable dataset drawn from census manuscripts involves two major steps: data entry and data processing. Staff of the Historical Census Projects entered data for the 1920 public use microdata sample from microfilmed copies of the original census manuscripts.² The resulting raw data files were direct alphabetic transcriptions of the original census records. Specially designed software streamlined the data entry process wherever possible. Some information was automatically entered by the software. For example, since each microfilm reel only contained individuals from the same state, both microfilm reel number and state of residence were entered just once at the beginning of each reel. The data entry software also allowed operators to duplicate some information across members of households with special keystrokes (birthplace and language responses were common examples). The use of standard abbreviations and symbols was another time-saving feature of the 1920 data entry program. For example, operators entered "HD" for "head" or added "?" before an entry to indicate their best guess about an illegible response.

After data entry, most variables consisted of alphabetic strings that had to be converted to numeric form to be analyzed efficiently by statistical software packages. Some variables were simply recoded because of the small number of responses. For example, marital status was recorded by census enumerators as "S" (for single), "M" (married), "W" (widowed), and "D" (divorced). A computer program easily converted these four letters to the appropriate numeric codes. Other variables, however, had thousands of different responses in the raw data files.
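A simple recode of this kind amounts to a fixed lookup table. The following is a minimal sketch only: the numeric values are hypothetical placeholders (this article does not give the codes actually used), and the function name is invented for illustration.

```python
# Hypothetical numeric codes for illustration; the codes actually used
# in the 1920 sample are not stated in this article.
MARITAL_CODES = {"S": 1, "M": 2, "W": 3, "D": 4}


def recode_marital(raw):
    """Return the numeric code for a transcribed marital status.

    Returns None when the string is not one of the four expected letters,
    so that unexpected responses can be reviewed rather than guessed.
    """
    return MARITAL_CODES.get(raw.strip().upper())
```

A table this small can be written by hand; variables with thousands of distinct responses call for a different approach.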

We dealt with the problem of more complex variables by creating variable dictionaries. A variable dictionary is a machine-readable file containing a separate record for each alphabetic string encountered in the data. Each record combines a field for the alpha string and one or more numeric fields containing the appropriate numeric category assigned to that string. Variable dictionaries were compiled for the following variables: relationship-to-head, institution, occupation, industry, birthplace, mother tongue, city of residence and county of residence. The variable dictionary process is described below, using the relationship-to-head variable as an example. Subsequent sections outline problems specific to the coding of birthplace and language responses. A separate article by Ronnander in this issue deals exclusively with complexities in the occupation and industry variables.

Creating a Variable Dictionary

Data entry of the 1920 PUMS was divided into ten subsamples, with each subsample containing all microfilm reels ending in the same number. For example, "subsample 1" contained reels 0001, 0011, 0021 ... 2021, a total of 203 microfilm reels. Work on the variable dictionaries began once the data entry operators completed the first entire subsample.
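The subsample rule can be checked with a few lines of code. This sketch assumes reel numbers are four-digit strings as written above. The label used for reels ending in 0 is an assumption, since the article does not say how that subsample was numbered.

```python
def subsample_of(reel):
    """Return the subsample for a reel, based on the last digit of its number."""
    last_digit = int(reel) % 10
    # Assumption: reels ending in 0 are labeled subsample 10.
    return last_digit if last_digit != 0 else 10


# Reels 0001, 0011, 0021 ... 2021, as listed for "subsample 1".
subsample_1 = [f"{n:04d}" for n in range(1, 2022, 10)]
reel_count = len(subsample_1)
```

Here `reel_count` comes to 203, matching the total given in the text.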

The 1920 project made use of the dictionaries from previous samples, so the most common entries and their appropriate abbreviations were already in the dictionary for the first subsample. Figure 1 shows a section of codes in the relationship-to-head dictionary corresponding to parents, including parent, stepparent, parent-in-law, and stepparent-in-law. As seen in this example, the dictionary already contained standard abbreviations and many variations in spelling for these relationships. The two-part numeric general/detail code allows the IPUMS to provide compatibility across census years for complex variables like relationship. The two digits on the left represent the general code for that category, such as 05 for parents and 06 for parents-in-law. The two digits on the right give more detail about relationships within the category. For example, in the relationship code 05 02, the general code 05 tells users that an individual is a parent, while the detail code 02 indicates a relationship of stepparent.

Figure 1. Relationship to Head Dictionary Showing Codes for Parents

IPUMS Code
General	Detail	Relationship
05	01	PARENT
05	01	MOM
05	01	MOM1
05	01	MOTHER
05	01	DAD
05	01	FAT
05	01	FATHER
05	02	STMOM
05	02	ST FAT
05	02	STFAT
06	01	MOMLAW
06	01	MOMLAW
06	01	MOTLAW
06	01	WFSMOM
06	01	FAT LAW
06	01	FATHERLAW
06	01	FATLAW
06	02	STFATLAW
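In programming terms, a dictionary such as the one in Figure 1 is a mapping from alpha string to a (general, detail) pair. The sketch below is one possible representation, not the project's actual software. The records are copied from Figure 1, while the function names and the whitespace and case normalization are assumptions made for illustration.

```python
# A few records copied from Figure 1: (general, detail, alpha string).
RECORDS = [
    ("05", "01", "PARENT"),
    ("05", "01", "MOM"),
    ("05", "02", "STMOM"),
    ("06", "01", "FATLAW"),
    ("06", "02", "STFATLAW"),
]


def build_dictionary(records):
    """Index the records by alpha string for direct lookup."""
    return {alpha: (general, detail) for general, detail, alpha in records}


def lookup(dictionary, raw):
    """Return (general, detail) for a transcribed string, or None if unknown.

    Stripping whitespace and upper-casing are assumptions of this sketch.
    """
    return dictionary.get(raw.strip().upper())


relationship_dictionary = build_dictionary(RECORDS)
```

Keeping the general and detail parts as separate fields means a user who only needs broad categories can read the first element and ignore the second.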

Once the data entry operators completed the first subsample, a computer program processed the raw data files, comparing each alpha string in the relationship field to those in the dictionary. This program generated an "unknown file" of all responses that did not appear in the dictionary. Figure 2 shows a segment of one of these raw unknown files. The responses given for relationship in Figure 2 are fairly typical of those found in unknown files. The most common relationships (for example head, wife, son, daughter, mother, father, boarder, and lodger) are already in the variable dictionary and "known" to the program. Most of the unknown relationships are typos, less common relationships (especially non-familial relationships), and responses that are not even relationships. Figure 2 includes examples of the more common types of unknown responses encountered in the relationship field: MOT is most likely a typo for MOM, the standard abbreviation for mother. FATSWF is an abbreviation of FATHER'S WIFE. For uncommon entries such as ABBY'S DAU, the only way to correctly identify the relationship is to return to the microfilm and review the household's original enumeration. The unknown file contains additional information, such as section, page and line numbers, to allow research assistants to quickly find the household on the microfilm reel. In most cases, research assistants find information in the original enumeration that allows them to correctly code the response in the data. In other cases, however, they have to change the data itself. For example, ABBY'S DAU was changed to G DAU (09 01) when the microfilm showed that this person was the head's granddaughter.
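The comparison step described above can be sketched as a single pass over the person records. This is an illustration under stated assumptions rather than the program the project used. The `Person` record layout and the function name are hypothetical, the two unknown rows are taken from Figure 2, and the first row is an invented record included only to show a string that is already known.

```python
from collections import namedtuple

# Hypothetical record layout; the fields mirror the columns of the unknown file.
Person = namedtuple("Person", "reel section page line loc rel")

DICTIONARY = {"MOM": ("05", "01"), "FATHER": ("05", "01")}

PEOPLE = [
    Person("0001", 1, 1, 1, 2, "MOM"),        # invented example of a known string
    Person("0421", 1, 254, 8, 4, "MOT"),      # from Figure 2
    Person("0231", 2, 191, 32, 4, "FATSWF"),  # from Figure 2
]


def find_unknowns(people, dictionary):
    """Split records into coded rows and rows that need manual review."""
    coded, unknown = [], []
    for person in people:
        code = dictionary.get(person.rel)
        if code is None:
            # Keep the whole record so the reel, section, page and line
            # are available for locating the household on microfilm.
            unknown.append(person)
        else:
            coded.append((person, code))
    return coded, unknown


coded_rows, unknown_rows = find_unknowns(PEOPLE, DICTIONARY)
```

Carrying the full record into the unknown list, rather than only the string, is what makes the later microfilm check practical.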

Figure 3 shows the same unknown file after research assistants had assigned numeric codes to the responses. From this file an updated dictionary file was created, incorporating the new alpha strings and corresponding codes. The updated unknown file was then merged with the existing dictionary. The new dictionary then contained every alpha string encountered in the relationship field in the first subsample. In Figure 4, the relationship codes for parents now include alpha strings for MOT and FATSWF. Once an alpha string appears in the dictionary, identical responses that occur in later subsamples will not appear in subsequent unknown files.
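The merge step can be sketched as follows. The function name is hypothetical, and the conflict check is an assumption of this sketch rather than a documented feature of the project's software. The codes for MOT and FATSWF are those shown in Figures 3 and 4.

```python
def merge_reviewed(dictionary, reviewed):
    """Return a new dictionary that also contains the reviewed strings.

    `reviewed` is a list of (alpha string, (general, detail)) pairs.
    A string that is already present with a different code raises an
    error instead of being silently overwritten.
    """
    merged = dict(dictionary)
    for alpha, code in reviewed:
        existing = merged.get(alpha)
        if existing is not None and existing != code:
            raise ValueError(f"conflicting codes for {alpha!r}: {existing} vs {code}")
        merged[alpha] = code
    return merged


old_dictionary = {"MOM": ("05", "01"), "STMOM": ("05", "02")}
reviewed = [("MOT", ("05", "01")), ("FATSWF", ("05", "02"))]
new_dictionary = merge_reviewed(old_dictionary, reviewed)
```

After the merge, a later subsample containing MOT or FATSWF finds the string in `new_dictionary` and no longer sends it for review.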

Birthplace

Census enumerators in 1920 received specific instructions for recording an individual's birthplace (along with the birthplace of an individual's father and mother). Enumerators were instructed to record the specific country with the following exceptions:

If a person says he was born in Austria Hungary, Germany, Russia, or Turkey as they were before the war, enter the name of the Province (State or Region) in which born, as Alsace-Lorraine, Bohemia, Bavaria, German or Russian Poland, Croatia, Galacia, Finland, Slovakland, etc.; or the name of the city or town in which born, as Berlin, Prague, Vienna, etc.... Instead of Great Britain, write Ireland, England, Scotland, or Wales. If the person was born in Cuba or Puerto Rico, so state, and do not write West Indies. (U.S. Bureau of the Census 1919, paragraphs 139 and 141)

Figure 2. Raw Unknown File for Relationship to Head

Not found in dictionary

IPUMS Code		Microfilm Information
General	Detail	Reel #	[Sec-pg-line]	Loc	Relationship
{ }	{ }	reel:0101	[2- 68- 29]	( 5)	rel:AUNT BY MARRIA
{ }	{ }	reel:0231	[2-191- 32]	( 4)	rel:FATSWF
{ }	{ }	reel:0971	[2- 45- 74]	( 1)	rel:ABBY'S DAU
{ }	{ }	reel:0241	[1-265- 88]	(15)	rel:ARMY FIELD CLK
{ }	{ }	reel:0421	[1-254- 8]	( 4)	rel:MOT
{ }	{ }	reel:0451	[2- 46- 10]	( 9)	rel:TENANT
{ }	{ }	reel:1771	[1-193- 34]	( 4)	rel:ADOPTED BRO
{ }	{ }	reel:1791	[1-295- 92]	( 4)	rel:BACHELOR

Loc = location of the individual within the household record

Figure 3. Unknown File with Appropriate Relationship Codes Assigned

Not found in dictionary

IPUMS Code		Microfilm Information
General	Detail	Reel #	[Sec-pg-line]	Loc	Relationship
{10}	{22}	reel:0101	[2- 68- 29]	( 5)	rel:AUNT BY MARRIA
{05}	{02}	reel:0231	[2-191- 32]	( 4)	rel:FATSWF
{03}	{01}	reel:0971	[2- 45- 74]	( 1)	rel:ABBY'S DAU
{12}	{21}	reel:0241	[1-265- 88]	(15)	rel:ARMY FIELD CLK
{05}	{01}	reel:0421	[1-254- 8]	( 4)	rel:MOT
{12}	{05}	reel:0451	[2- 46- 10]	( 9)	rel:TENANT
{03}	{02}	reel:1771	[1-193- 34]	( 4)	rel:ADOPTED BRO
{12}	{30}	reel:1791	[1-295- 92]	( 4)	rel:BACHELOR

Loc = location of the individual within the household record

Figure 4. Updated Relationship to Head Dictionary Showing Codes for Parents

IPUMS Code
General	Detail	Relationship
05	01	PARENT
05	01	MOM
05	01	MOM1
05	01	MOT
05	01	MOTHER
05	01	DAD
05	01	FAT
05	01	FATHER
05	02	FATSWF
05	02	STMOM
05	02	ST FAT
05	02	STFAT
06	01	MOMLAW
06	01	MOMLAW
06	01	MOTLAW
06	01	WFSMOM
06	01	FAT LAW
06	01	FATHERLAW
06	01	FATLAW
06	02	STFATLAW

Adherence to these instructions produced many responses containing specific cities and provinces. This multitude of responses greatly complicated the construction of the birthplace dictionary and required frequent reference to historical maps of central and eastern Europe. At the most basic level, it added to the size of the dictionary (which contained over 6,000 unique entries). The format for the birthplace classification system, which contains a general and a detailed code, allowed us to preserve a reasonable amount of geographic specificity. For example, Germany has a general code of '453.' There are also 42 separate detailed codes for Germany (453 01 to 453 42), which correspond to various provinces and large cities. The logic that resulted in 42 different codes for Germany was also applied to other countries. For example, the general code for Czechoslovakia is '452.' But there are also separate two-digit detail codes for Bohemia, Bohemia-Moravia, and Slovakia.
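One practical consequence of the general/detail format is that a user can collapse detailed birthplaces to countries by dropping the detail part. The sketch below assumes codes are written as a three-digit general code, a space, and a two-digit detail code, as they are printed in this article. The function names are hypothetical, and the sample codes are ones that appear in the text.

```python
from collections import Counter


def split_code(code):
    """Split a code such as '453 01' into its general and detail parts."""
    general, detail = code.split()
    return general, detail


def count_by_general(codes):
    """Tally responses by general code, ignoring the detail part."""
    return Counter(split_code(code)[0] for code in codes)


# Codes that appear in this article: two German detail codes,
# Syria (541 00) and Iceland (402 00).
sample = ["453 01", "453 42", "541 00", "402 00"]
tally = count_by_general(sample)
```

In this sample both German detail codes fall under general code 453, so `tally["453"]` is 2.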

The sheer size of the birthplace dictionary was greater than it otherwise might have been because the data-entry operators were instructed to preserve any editing on the original manuscript by census clerks. Most of the edits reflected categorization performed by clerks while tabulating the returns. For example, if 'Vienna' was listed as a birthplace, the clerk would typically write 'Austria,' which was the appropriate 1920 Census Bureau category for this response. Although the edits sometimes clarified obscure place names, they more often represented the Census Bureau imposing a general category on a more detailed response. Where appropriate, we preserved the detail of the original response. There also were cases where the opposite occurred. Enumerators did not always follow the instructions concerning the reporting of provinces and cities for central and eastern Europe; many individuals responded with answers that reflected 1920 nation-state boundaries rather than specific provinces or cities. For example, 'Czechoslovakia' was a frequent birthplace response, whereas the enumerator instructions indicated that the correct response should have been either Bohemia, Moravia, or Slovakia, the three Austro-Hungarian provinces that were united as Czechoslovakia in 1918. In this case we had no choice but to assign the general code for Czechoslovakia. While our policy was to preserve as much detail as possible, researchers should be aware that limitations such as these do exist in the data.

Original manuscript entries were reviewed to resolve incongruities in the individual or household records. In most cases, checking the entered values against the microfilmed manuscript resulted in an easy coding decision. The task was further complicated, however, when the manuscript revealed discrepancies between what the enumerator recorded and "D.C. edits," the changes made by the Census Bureau editor in Washington.

A few specific examples demonstrate the complexities involved in creating an efficient birthplace coding system that also preserves data quality.

The coding schemes used by D.C. editors for birthplace and parents' birthplace were not the same, reflecting geopolitical place names at the time each generation was born. If individuals reported a birthplace different from that of their parents and this discrepancy was due to such geopolitical boundary changes, we made all birthplaces consistent using the individual's place of birth (the most recent boundary division) as the definitive code. For example, if an individual's birthplace field contained Yugoslavia but the father's and mother's birthplaces were listed as the Hungarian Empire, we coded all three birthplaces (individual, paternal and maternal) as Yugoslavia.
Grodno (Hrodno), Poland, was on the border between Lithuania and Poland between 1918 and 1923. Today it is part of Belarus. In the 1920 public use sample it is coded as Russian Poland.
Danzig was a free city between 1920 and 1939 but is coded here as part of West Prussia (455 26).
A number of respondents told enumerators their birthplace was Byrote/Syria and their mother tongue was Syrian. Syria and Lebanon were French mandates in 1920, part of the Levant States. These people are coded as Syrian (541 00) in the 1920 sample.
In 1920 Iceland was an independent kingdom in personal union with Denmark. Individuals who gave their birthplace as Denmark/Iceland are coded as Iceland (402 00).
A number of cases listed Turkey as the birthplace and Greek as the mother tongue. Subsequently, the D.C. editor changed these birthplaces to Greece based on the language response. Since it is impossible to determine exactly where these individuals were born, the detail (ambiguous though it may be) is preserved by using a code for Turkey-Greece (433 20). Although Turkey did not officially become a republic until 1923, we retained all responses that occurred in the 1920 data.
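The boundary-change rule in the first example above (the individual's own birthplace is taken as definitive) can be sketched in code. This is an illustration only. The `SUPERSEDES` table and the function name are hypothetical, and treating each parent separately is an assumption of this sketch, since in practice these decisions were made by reviewing the records.

```python
# Hypothetical table: a birthplace label mapped to the older labels it may
# supersede because of boundary changes. Only the article's example is listed.
SUPERSEDES = {"YUGOSLAVIA": {"HUNGARIAN EMPIRE"}}


def harmonize(own, father, mother):
    """Make parents' birthplaces consistent with the individual's own.

    A parent's birthplace is replaced only when it is an older label for the
    same territory; any other parental birthplace is left unchanged.
    """
    older_labels = SUPERSEDES.get(own, set())

    def fix(parent):
        return own if parent in older_labels else parent

    return own, fix(father), fix(mother)


result = harmonize("YUGOSLAVIA", "HUNGARIAN EMPIRE", "HUNGARIAN EMPIRE")
```

With the article's example, all three birthplaces in `result` come out as Yugoslavia.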

Mother Tongue

Mother tongue was frequently used to help clarify birthplace coding. Similarly, construction of the mother tongue dictionary was facilitated by the birthplace dictionary. Because mother tongue is most often contingent upon birthplace, values in the unknown files for mother tongue also contained an individual's place of birth. The enumerator instructions for mother tongue were fairly straightforward, but enumerator error or misinterpretation complicated the coding of languages. Despite instructions to record mother tongue for all foreign-born persons, some enumerators left the column blank. As in samples for other census years, all missing or incomplete responses are allocated according to procedures described in Ruggles and Sobek (1998). The missing data allocation procedure uses information from persons with shared characteristics to assign values to all blanks. In most cases, imputed values in birthplace and language variables were derived from information contained in records of other household members.

Some enumerators also apparently interpreted the mother tongue question to mean the language that the respondent currently spoke. Thus some enumerators recorded 'English' as the mother tongue of immigrants from predominantly non-English speaking countries. The Census Bureau was aware of this problem and instructed clerks to correct the mistake when it appeared that an enumerator had systematically misinterpreted the question. Since our data entry staff entered clerk edits along with the original enumerator response, we were able to verify on a case-by-case basis whether the clerk's edit was the probable correct response.

Another type of census clerk edit involved cases where enumerators entered a pseudo-language. Examples include reporting Mexican for individuals born in Mexico, Hungarian for individuals born in Hungary and Belgian for individuals born in Belgium. The clerks would typically change these responses to Spanish, Magyar and Flemish, respectively. In these cases we preserved the enumerator's original response. Thus Mexican has a code of '12 50,' which is a subset of Spanish, coded as '12 00.' Although we do not feel that the pseudo-language responses have substantive meaning, given our policy of retaining as much detail as possible for the user, we decided to preserve the detail and let users determine whether they can extract meaning from it.

Conclusion

Overall, birthplace and mother tongue were two of the most complicated variables to code. Work on the data dictionaries for the other variables was, for the most part, much less time-consuming. There are two major exceptions. First, coding for the occupation and industry variables also required interpreting and categorizing thousands of unique responses, a complex process described elsewhere in this issue. Second, the construction of the city of residence dictionary involved assigning populations to incorporated municipalities, producing a dictionary with close to 10,000 specific entries. Although this was still a massive task, improvements in enumeration procedures for the 1920 census made the process more straightforward than that for 1850 and 1880, the samples constructed earlier by the Historical Census Projects. For example, the inclusion of separate spaces to write incorporated place and civil division on 1920 enumeration forms made it easier to distinguish between villages and townships with the same name.

Coding these complex response variables required a substantial investment of time by the 1920 research staff. This investment has resulted in coding schemes that preserve virtually all meaningful detail and impose a logical order on the responses that makes them easier for researchers to use.

ENDNOTES

¹ This article appeared in Historical Methods, Volume 32, Number 3, Pages 134-138, Summer 1999. Reprinted with permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W., Washington, D.C. 20036-1802. Copyright 1999.
² The 1920 public use sample project was funded by the National Institute of Child Health and Human Development, grant HD29015.

References

U.S. Bureau of the Census. 1919. Fourteenth census of the United States, January 1, 1920: Instructions to enumerators. Washington: GPO.
