1910 Hispanic Oversample: Introduction to the User's Guide


This document describes the design and characteristics of the Hispanic Oversample of the 1910 US Census of Population. The background research, planning, and design of this project were conducted at the University of Texas at Austin under a research team headed by Myron Gutmann with funding provided by the National Institute of Child Health and Human Development (Grant number 5 R01 HD 32325). Data entry, design and creation of the dataset were carried out under subcontract by the Historical Census Projects at the University of Minnesota under the direction of Steven Ruggles.

Prior to recent censuses that asked and tabulated an explicit question about Hispanic identity, the Hispanic population of the United States remained something of a mystery. To better enable researchers to study the Hispanic population of the United States, we created a new supplement to the frequently-used Public Use Sample of the 1910 Census (Strong, 1989). This relatively large oversample contains data pertaining to roughly 10 percent of the Hispanic-headed households and Hispanic individuals in 1910, plus about four-tenths as many non-Hispanic-headed households living near them. The sample is drawn from 57 counties in six states (California, Arizona, New Mexico, Texas, Kansas and Florida) with predominantly Mexican-, Spanish-, and Cuban-origin populations (see Map 1).

The Hispanic Oversample is a component sample of the Integrated Public Use Microdata Series (IPUMS). The IPUMS combines samples from 13 census years into a consistent database. The Hispanic Oversample uses the same record layout and coding scheme as all other samples in the IPUMS. This User's Guide describes only those elements that are contained in the Oversample.

Map 1: Location of Hispanic Oversample Counties

Sample Design

The Hispanic Oversample of the 1910 US Census of Population consists of two differently sampled populations. One section of the sample is a one-in-five sample of households headed by Mexican-origin individuals chosen from 55 counties in Texas, California, Arizona, Kansas, and New Mexico. The second section of the sample is a one-in-two selection of households headed by Cuban- and Spanish-origin individuals, drawn from two Florida counties.

The Hispanic Oversample also includes a control-group sample drawn from the same counties. The control group sample includes one non-Hispanic household for every two Hispanic households drawn in the sample, with an upper size limit of one-in-five non-Hispanic households in the states where we drew a Mexican-origin sample, and an upper size limit of one-in-two non-Hispanic households in the two Florida counties.

In practice, the control group sample is somewhat less than one-half as large as the Hispanic sample in the oversample counties, for two reasons. First, the number of non-Hispanic households is slightly less than half the number of Hispanic households, because some border counties, especially in Texas, had such a high proportion of Hispanics that there were not enough non-Hispanics to sample. Second, the number of non-Hispanic individuals is less than half the number of Hispanic individuals because, on average, non-Hispanic households were smaller than Hispanic households.

We envision two different ways that the Hispanic oversample can be used. While it can be used by itself – and we will describe how and why a researcher would take that approach – it is designed to be used with the original public use microsample of the 1910 census. In that situation, the user can have confidence that she or he has relatively accurate national estimates of the Hispanic population in a national context. (This approach is similar to that used by Morgan and Ewbank, 1990, in their oversample of black-headed households.) The combined sample will thus be especially useful for researchers who want to estimate the number of Hispanics in the United States or to tabulate their characteristics. The combined sample is also appropriate for comparisons with groups that did not reside in the same counties as Hispanics. If one wanted to compare Mexican immigrants and their descendants, for example, with German, Chinese, or Russian immigrants, the combined sample would be ideal.

An alternative approach would be to use the Hispanic oversample on its own. If one wanted to analyze Hispanic Americans in 1910 in the context of their own localities and the people living near them, then just using the Hispanic sample counties would be the correct choice. One advantage of this approach is that dealing with the weights is conceptually more straightforward in the Hispanic oversample. In the oversample, the Mexican-origin and Cuban-origin Hispanics have the same weight regardless of their county of residence. Non-Hispanics in the oversample counties have weights which vary by county of residence. (A description of the various weights used in the Hispanic Oversample follows this section.)

In 1998 the University of Minnesota will release a single 1910 sample that will include the original PUMS together with the Hispanic and black oversamples. Appropriate sample weights will be given for each individual, including weights for use in a combined sample as well as individual sample weights. These data are designed to share codes and formats with other parts of the Integrated Public Use Microdata Series (Ruggles and Menard, 1995; Ruggles and Sobek, 1995).

In 1910, roughly three-fourths of all Hispanics lived in fewer than 80 US counties. To locate Hispanics for sampling in an economical fashion, we needed to concentrate on those places. The sample is stratified on a small number of counties in which more than half of all US Hispanics lived, according to the published data from the US Census of 1910 (US Bureau of the Census, 1913). To achieve the desired sample density with reasonable costs, we chose 50 counties in Texas, New Mexico, Arizona, Kansas, and California, in which significant numbers of Mexican-born people lived in 1910. While these 50 counties are fairly representative of the location of the Mexican-origin population in 1910 in terms of distance from Mexico and urban-rural distribution, they over-represent counties in which larger-than-normal proportions of Mexican-origin people lived.

Not all the places where Hispanic people in the United States lived in 1910 can be identified by the published tables of birthplaces by county, which were the criteria available to us. Already in 1910 many Hispanic families of Spanish and Mexican origin – and especially those living in New Mexico – had lived in the United States long enough for the current generation of household heads to be native-born children of native-born parents. To represent those people well in the new sample, we needed an additional criterion for selecting counties to sample. To meet this need we chose five additional counties in New Mexico from a list of counties in which many individuals identified themselves as being of Hispanic Ancestry in 1990, but which did not have a large proportion of Mexican-born people in 1910. These five counties are Guadalupe, Santa Fe, Socorro, Taos, and Valencia, all in northern New Mexico (see Map 2).

Map 2: Hispanic Oversample Counties in California, Arizona, New Mexico, Texas and Kansas

In addition to the 55 counties from which we sampled Mexican-origin people, we also chose two counties in Florida in which a large proportion of Cuban- and Spanish-born people resided: Hillsborough County (Tampa) and Monroe County (Key West). By sampling 50 percent of the Hispanic households in these two counties we expect to arrive at a total sample of about 40 percent of Spanish- and Cuban-origin households in the United States.

Maps 1 and 2 show the location of all 57 counties in the Hispanic Oversample.

In the census of 1910, all individuals were assigned to a "family," a term that the census then defined very similarly to the modern census concept "household." A family was an individual or group of individuals "living together in the same dwelling place." Census instructions defined dwelling places as any occupied structure; they included both wigwams and tenement houses. Two or more families could reside in a single dwelling place, provided they occupied separate parts of the house and their housekeeping was separate. However, all the permanent occupants of hotels, institutions, and military barracks constituted single families. Census enumerators likewise counted boarders, lodgers, and servants as part of the family occupying the dwelling place where they slept, regardless of their housekeeping arrangements.

Public use samples are organized hierarchically; they are simultaneously samples of families (or households) and of individuals, and within groups the relationships among individuals are known. This hierarchical structure allows the creation of a wide variety of variables. In creating the 1910 Oversample of Hispanics, we followed a modified version of the general household selection rules used for the public use samples of 1850, 1860, 1870, 1880, and 1920.

The sampling strategy is based on the census page. Each census page contains 100 persons. We randomly generated the appropriate number of sample points for each census page based on the desired sample density for the county. For example, for those counties in which we needed a 20 percent sample, we generated 20 sample points. To simplify the data entry, these sample lines were grouped in four clusters of five contiguous lines. To ensure that households had an equal probability of being included in the sample regardless of their size, they were entered only if a sample point fell on the line containing the first person in the household. When the sample point fell on any other household member, the household was skipped. If the household was included, all of its members were entered. Under this procedure each household and individual in the population had an equal probability of inclusion.
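The page-level selection described above can be sketched in Python for a 20 percent county. This is only an illustration: the function names are hypothetical, and aligning clusters on fixed five-line blocks (so that they never overlap) is an assumption, since the guide does not say how cluster positions were drawn.

```python
import random

LINES_PER_PAGE = 100  # persons per census page, as stated in the guide
CLUSTER_LEN = 5       # contiguous lines per cluster


def draw_sample_lines(n_clusters=4, rng=random):
    """Return the sorted sample lines (1-based) for one census page.

    Assumption: clusters start on fixed 5-line block boundaries
    (lines 1-5, 6-10, ...), so two clusters can never overlap.
    """
    n_blocks = LINES_PER_PAGE // CLUSTER_LEN
    blocks = rng.sample(range(n_blocks), n_clusters)
    lines = []
    for block in blocks:
        start = block * CLUSTER_LEN + 1
        lines.extend(range(start, start + CLUSTER_LEN))
    return sorted(lines)


def selected_households(first_lines, sample_lines):
    """Keep households whose first-listed person sits on a sample line.

    first_lines maps a household id to the page line of its first person.
    """
    points = set(sample_lines)
    return [hh for hh, line in first_lines.items() if line in points]
```

Because only the line of the first-listed person counts, a household of ten and a household of two on the same page each have the same chance of selection under this sketch (4 blocks out of 20).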

We modified this procedure for persons residing in institutions and large group quarters. The previous public use samples incorporated a variety of sampling strategies for handling such cases. In general, members of large units were sampled on an individual basis, simply by treating each member as if they lived in their own one-person household. This procedure increases the efficiency of the sample by raising the number of observations, while at the same time maintaining representativeness. The following set of inclusion rules assures compatibility with the sample designs of the 1910 PUMS and all other previous PUMS. These rules result in equal probabilities of inclusion, regardless of household size. They also ensure that all relationships among kin are preserved regardless of dwelling or family size.

  1. If the household contains 30 or fewer residents:
    1. accept the entire household if the sample point falls on the first listed individual in the household and that individual is of Hispanic origin.
    2. reject the entire household if the sample point falls on any other household resident.
  2. If the household contains 31 or more persons and the sample point falls within any group of related persons within the household (in 1960 census usage, within a primary or secondary family):
    1. accept the group of related persons if the sample point falls on the first listed individual within the related group, and that individual is of Hispanic origin. Also enter data on overall household size.
    2. reject the entire related group if the sample point falls on any other member of the related group.
  3. If the household contains 31 or more persons and the sample point falls on an individual with no relatives in the family: accept the individual if she/he is of Hispanic origin. Also enter data on overall household size.

We also collected a geographically proximate sample of non-Hispanics as a control group. We obtained this sample by waiving the requirement of Hispanic ethnicity in rules 1 through 3 after every second case of Hispanics was accepted.
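Rules 1 through 3, together with the control-group waiver, can be expressed as a small decision function. The sketch below is an illustration only; the function name, argument names, and return labels are hypothetical and do not come from the project's data-entry software.

```python
LARGE_UNIT_MIN = 31  # households with 31 or more residents are split up


def accept_sample_point(household_size, on_first_of_unit, has_relatives,
                        is_hispanic, waive_ethnicity=False):
    """Decide what to enter for one sample point.

    on_first_of_unit: the point falls on the first listed person of the
        household (rule 1) or of the related group (rule 2).
    has_relatives: the sampled person has relatives in the household.
    is_hispanic: the sampled person is of Hispanic origin.
    waive_ethnicity: True when collecting a control-group case.
    Returns 'household', 'related_group', 'individual', or None (reject).
    """
    eligible = is_hispanic or waive_ethnicity
    if household_size < LARGE_UNIT_MIN:
        # Rule 1: small households are taken whole or not at all.
        return "household" if (on_first_of_unit and eligible) else None
    if has_relatives:
        # Rule 2: large households are sampled by related group.
        return "related_group" if (on_first_of_unit and eligible) else None
    # Rule 3: unrelated individuals in large units are sampled singly.
    return "individual" if eligible else None
```

Under this reading, a data-entry loop would set `waive_ethnicity=True` for the next sample point after every second accepted Hispanic case.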

At the heart of our process of collecting data is the need to make decisions about whether a given head-of-household can be described as "Hispanic." Our strategy has been to make these decisions as liberally as possible so as to include the largest number of people who might be identified as Hispanic. We then include variables in the final data set that allow individual users to exclude (or consider excluding) those individuals whose inclusion is problematic from the researcher's point of view. These are the rules we have adopted for identifying "Hispanic" individuals in the Oversample:

  1. Mexican-origin sample (Texas, Arizona, New Mexico, California, and Kansas): A Mexican-origin person is someone enumerated as "White" who meets at least one of these tests (by limiting the test to whites we eliminate American Indians with Hispanic surnames):
    - Born in Mexico
    - Had at least one parent born in Mexico
    - Speaks Spanish
    - Identified as being of Mexican "Race"
    - Has a Hispanic surname. We identified Hispanic surnames as being those in the 1980 US Census Hispanic name list, as well as other Hispanic surnames identified by the data entry operator.
  2. Cuban- and Spanish-origin sample (Florida): A Cuban- or Spanish-origin person is someone enumerated as "White" who meets at least one of these tests:
    - Born in Cuba or Spain
    - At least one parent born in Cuba or Spain
    - Speaks Spanish
    - Identified as being of "Spanish" or "Cuban" race.
    - Has a Hispanic surname.

Sample Weights

The original 1910 Public Use Sample is a self-weighted 1/250 sample of all households in the United States. The Hispanic oversample is not a self-weighting sample and requires a complex weighting scheme because sampling fractions varied among counties. First, not every county in the US was sampled. Second, two different sampling fractions were used for Hispanic counties. Finally, each sampled county has a different sampling fraction for non-Hispanics. This discussion of sample weights follows Morgan and Ewbank (1990) in their presentation of weights for the 1910 Black Oversample.1

In the case of the combined Hispanic oversample and original 1910 PUMS, the sampling fractions and weights for the sampled counties apply to all persons in those counties, whether they were chosen in the original 1/250 sample or in one of the two oversamples. That means that a member of a Hispanic family in the original sample has the same sampling fraction, as well as the same weight, as a member of a Hispanic family in the oversample. The sampling fraction for Hispanics in the sampled counties is thus the sum of the sampling fraction for the original sample (1/250, or 0.004), and the sampling fraction for the new sample (1/5, or 0.2 for the Mexican-origin counties, and 1/2, or 0.5 for the Florida counties).
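As a worked check of this arithmetic, the short Python sketch below adds the two sampling fractions and inverts the sum to obtain a weight. It assumes, as in Table 1, that a weight is the reciprocal of the sampling fraction; the names used are illustrative.

```python
PUMS_FRACTION = 1 / 250  # original 1910 PUMS, 0.004
OVERSAMPLE_FRACTION = {
    "mexican_origin": 1 / 5,  # 0.2 in the Mexican-origin counties
    "florida": 1 / 2,         # 0.5 in the two Florida counties
}


def combined_hispanic_weight(county_type):
    """Weight for Hispanic households in the combined PUMS + Oversample."""
    fraction = PUMS_FRACTION + OVERSAMPLE_FRACTION[county_type]
    return 1 / fraction


print(round(combined_hispanic_weight("mexican_origin"), 3))  # 4.902
print(round(combined_hispanic_weight("florida"), 3))         # 1.984
```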

It is important to recognize that the sampling scheme essentially chooses families or households, rather than individuals, and that weights are applied to households rather than individuals. A non-Hispanic selected as a member of a Hispanic household receives the same weight as a Hispanic member of that household. Similarly, a Hispanic member of a non-Hispanic household receives the same weight as a non-Hispanic member of that non-Hispanic household.

For purposes of assigning weights, we selected Hispanic household heads in the original sample based on their place of birth and their parents' places of birth, and on the language they speak, but not on race or surname. The original 1910 PUMS combined all non-standard races into one "other" category, and its creators did not preserve surnames.2

The non-Hispanic sampling fractions and weights differ for each county. This is because the number of non-Hispanics sampled is proportional to the number of Hispanics, and not to the universe of non-Hispanics. Rather than taking one non-Hispanic in five, we took one non-Hispanic for every two Hispanic households.

Weights are therefore derived as follows:

For the Mexican American counties:

    P(N) = P(C) - (5 x P(H))

For the Florida counties:

    P(N) = P(C) - (2 x P(H))

where:

    P(N) = estimated non-Hispanic population of the county
    P(C) = county population as published in 1910
    P(H) = number of Hispanics in the county's Hispanic Oversample

The sampling fraction for non-Hispanics in an oversample county is the sum of 0.004 and P(O)/P(N), where P(O) = number of non-Hispanics in the Hispanic Oversample for that county.

Table 1 presents an example of the computation of the sampling fraction for non-Hispanics in a Hispanic oversample county. Non-Hispanic weights for oversample counties cover a broad range, from a low of roughly 5 in counties with a relatively large number of Hispanics (such as Reeves County, Texas, and Santa Cruz County, Arizona) to a high of 132 for Los Angeles County, California, which had a relatively small Hispanic population in 1910.


Table 1: Computation of sample fraction and weight for non-Hispanics in Luna County, New Mexico

| Item | Notation/Computation | N |
|---|---|---|
| Hispanic Oversample count of Hispanics | P(H) | 277 |
| Hispanic Oversample count of non-Hispanics | P(O) | 135 |
| Total published population | P(C) | 3,913 |
| Estimate of total Hispanic population | 5 x P(H) = 5 x 277 | 1,385 |
| Estimate of total non-Hispanic population | P(N) = P(C) - [5 x P(H)] = 3,913 - 1,385 | 2,528 |
| Non-Hispanic sampling fraction | .004 (note 1) + [P(O)/P(N)] = .004 + (135 / 2,528) | 0.0574 |
| Sample weight for Luna County non-Hispanics | 1 / sampling fraction = 1 / .0574 | 17.421 (note 2) |

Notes to Table 1:
1. The value .004 represents the sampling fraction for all persons in the original 1910 PUMS.
2. To achieve integer weights, 78 non-Hispanics would be assigned a weight of 17 and 57 non-Hispanics would have a weight of 18. The average weight for all non-Hispanics in Luna County would then be 17.422, differing slightly from the sampling weight shown above because of rounding.
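The Table 1 computation can be reproduced with a few lines of Python. This is a sketch under the definitions given above; the function and parameter names are illustrative, and `hispanic_rate` is a hypothetical argument standing for the multiplier 5 (Mexican-origin counties) or 2 (Florida counties).

```python
PUMS_FRACTION = 0.004  # sampling fraction of the original 1910 PUMS


def non_hispanic_weight(p_c, p_h, p_o, hispanic_rate=5, include_pums=True):
    """Return (P(N), sampling fraction, weight) for a county's non-Hispanics.

    p_c: published 1910 county population, P(C)
    p_h: Hispanics in the county's oversample, P(H)
    p_o: non-Hispanics in the county's oversample, P(O)
    """
    p_n = p_c - hispanic_rate * p_h           # estimated non-Hispanics
    fraction = p_o / p_n
    if include_pums:
        fraction += PUMS_FRACTION             # add the original 1/250 sample
    return p_n, fraction, 1 / fraction


# Luna County, New Mexico, using the counts from Table 1.
p_n, fraction, weight = non_hispanic_weight(3913, 277, 135)
print(p_n, round(fraction, 4), round(weight, 3))  # 2528 0.0574 17.421
```

Passing `include_pums=False` drops the 0.004 term, which corresponds to the stand-alone use of the Oversample that the guide describes.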


If a researcher wishes to use the Hispanic Oversample on its own, the weights are very similar to those just described for the combined samples. All that differs is that the sampling fractions do not include the added 0.004 (representing the contribution of those households sampled in the original PUMS). In the Oversample, the weight for all members of Hispanic-headed households is 5.0 in the Mexican-origin counties (sampling fraction 1/5) and, by the same calculation, 2.0 in the two Florida counties (sampling fraction 1/2). For all members of non-Hispanic-headed households the sampling fraction is calculated as above, without the addition of 0.004.

The Hispanic Oversample file as distributed has integer weights, so we have converted the weights as calculated above into whole numbers. For example, for a calculated weight of 4.92, we assign a weight of 4 to 8 percent of the cases that have this weight, and a weight of 5 to the remaining 92 percent. The result is an average weight of 4.92. Cases are assigned to the two integer weights at random.
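One way to carry out this conversion is sketched below in Python. The guide does not say whether the split was made with fixed counts or with an independent random draw per case; this sketch assumes fixed counts that are then shuffled, and the function name is hypothetical.

```python
import math
import random


def integer_weights(weight, n_cases, rng=random):
    """Split a fractional weight into two adjacent integer weights.

    The share of cases given the higher integer equals the fractional
    part of the weight, so the mean stays close to the original value.
    Assumption: counts are fixed first, then assigned to cases at random.
    """
    low = math.floor(weight)
    n_high = round((weight - low) * n_cases)
    weights = [low + 1] * n_high + [low] * (n_cases - n_high)
    rng.shuffle(weights)
    return weights


w = integer_weights(4.92, 100, random.Random(0))
print(w.count(4), w.count(5))  # 8 92
```

Applied to the Luna County weight of 17.421 over 135 non-Hispanics, the same function gives 78 cases a weight of 17 and 57 cases a weight of 18.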

Identifying Hispanics in the 1910 Oversample and PUMS

Researchers can identify all members of the Hispanic population in the Oversample by using the variables for birthplace, parental birthplaces, race, language, relationship, and Spanish surname. In the Hispanic Oversample, a person is Hispanic if he or she:

  1. Was a Hispanic Head of Household identified in the original sampling process.
  2. Was related to a person defined as a Hispanic Head of Household in the original sampling process.
  3. If unrelated to a Hispanic Head of Household, was born in Mexico, Cuba or Spain, was the child of a person born in one of those places, spoke Spanish, was enumerated as being of Mexican race, or had a Spanish surname as defined by the Census Bureau for the creation of the 1980 census (US Bureau of the Census, 1983, p. K-44).

    If the person identified as Hispanic under Rule 3 is the spouse of a non-Hispanic Head of Household, the Head of Household is then classified as Hispanic too, and all persons related to the Head of Household are also classified as Hispanic.

  4. Is related to the spouse of the Household Head, if the spouse is defined as Hispanic under Rule 3, above.

Researchers can identify all members of the Hispanic population in the original 1910 PUMS by using the variables for birthplace, parental birthplaces, race, language, and relationship. A person can be considered Hispanic in the original PUMS if he or she:

  1. Is a Hispanic Head of Household, defined as a head of household who was born in Mexico, Cuba, or Spain, was the child of someone born in one of those countries, or speaks Spanish.
  2. Was related to a person defined as a Hispanic Head of Household.
  3. If unrelated to a Hispanic Head of Household, was born in Mexico, Cuba or Spain, was the child of a person born in one of those places, or spoke Spanish.

    If the person identified as Hispanic under Rule 3 is the spouse of a non-Hispanic Head of Household, the Head of Household is then classified as Hispanic too, and all persons related to the Head of Household are also classified as Hispanic.

  4. Is related to the spouse of the Household Head, if the spouse is defined as Hispanic under Rule 3, above.
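The two rule sets above share the same logic and differ only in whether surname and race are available. A minimal Python sketch follows. The record layout (`relation`, `origin_marker`, `surname_or_race_marker`) is hypothetical and does not correspond to actual IPUMS variable names, and the sketch assumes that a Hispanic head identified in sampling is simply a head who meets the individual tests.

```python
def classify_household(persons, use_surname_and_race=True):
    """Return one True/False Hispanic flag per person in a household.

    Each person is a dict with:
      'relation': 'head', 'spouse', 'relative', or 'nonrelative'
      'origin_marker': born in Mexico, Cuba, or Spain, child of someone
          born there, or Spanish-speaking
      'surname_or_race_marker': Spanish surname or Mexican race
          (usable in the Oversample; set use_surname_and_race=False
          to mimic the original PUMS)
    """
    def meets_tests(p):
        if p["origin_marker"]:
            return True
        return use_surname_and_race and p.get("surname_or_race_marker", False)

    # Rule 1 and the note under rule 3: a Hispanic head, or a Hispanic
    # spouse of the head, makes the head's whole family Hispanic.
    family_is_hispanic = any(
        meets_tests(p) for p in persons if p["relation"] in ("head", "spouse")
    )
    flags = []
    for p in persons:
        related = p["relation"] in ("head", "spouse", "relative")
        if related and family_is_hispanic:
            flags.append(True)             # rules 1, 2, and 4
        else:
            flags.append(meets_tests(p))   # rule 3: judged individually
    return flags
```

For example, a household with a non-Hispanic head, a Mexican-born spouse, one child, and an unrelated boarder with no markers would yield `[True, True, True, False]` under this sketch.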

The number of Hispanics yielded by this identification process for the original 1910 PUMS will be slightly smaller than that produced from the 1910 Hispanic Oversample because of the additional criterion of having a Spanish surname. In practice, the difference is negligible.

Variables Added to the Hispanic Oversample

The variables FSCHDNUM (Farm schedule number) and HISPRULE (Hispanic sample point selection rule) are unique to the Hispanic oversample.

As individual enumerators completed their work, each farm schedule was assigned a different number. FSCHDNUM thus represents the actual number of the farm schedule completed by each person in the household (if more than one) and makes it possible to link data from the farm schedules with household- and person-record data from the census of population.

HISPRULE simply refers to the Hispanic sample-point selection rule that guided data entry decision-making. Hispanic sample points were selected based on the following criteria: birthplace, parental birthplaces, language, race or surname, using the procedures described in the previous section.

In addition, two variables have been captured in the Hispanic oversample which were not included in the original 1910 PUMS, even though the data were available in the census. These variables are ENUMMO (month of enumeration) and ENUMDAY (day of enumeration).

See the Data Dictionary listing for each of these variables.

Endnotes:

  1. Although the instructions to enumerators did not permit "Mexican" as a race in 1910, a large number of individuals were given a race of "M" or "Mx" by their enumerator.
  2. We have located but not yet completely mastered the archive tapes with the original transcription of the 1910 public use sample. Surnames are not recorded there, but it is possible that the race variable is.

References:

Morgan, S. Philip and Douglas Ewbank. 1990. Census of Population, 1910: Oversample of Black-Headed Households. Philadelphia: University of Pennsylvania, Population Studies Center.
Ruggles, S. and R.R. Menard. 1995. "The Minnesota Historical Census Projects." Historical Methods 28 (1995), 6-78.
Ruggles, S. and M. Sobek. 1995. Integrated Public Use Microdata Series: Version 1.0. Minneapolis: Social History Research Laboratory, University of Minnesota.
Strong, M.A., S. H. Preston, A.R. Miller, M. Hereward, H.R. Lentzner, J.R. Seman, H.C. Williams. 1989. User's Guide, Public Use Sample. 1910 United States Census of Population. Philadelphia: Population Studies Center, University of Pennsylvania.
United States Bureau of the Census. 1913. Thirteenth Census of the United States, 1910. Washington: US G.P.O.
_____. 1983. Census of Population and Housing, 1980: Public-Use Microdata Samples Technical Documentation. Washington: US G.P.O.
