The 1910 Black and Hispanic Oversamples[1]

by Myron P. Gutmann and Douglas Ewbank

When the 1910 PUMS was originally conceived and released, it was remarkable for its size and comprehensiveness, and for the technology used to construct it. With a sample density of approximately 1-in-250, it was more than three times as large as the earlier 1900 PUMS, in relative terms. For the first time, there was not only a nationally representative sample but also the possibility of studying significant subgroups of the population, as Watkins (1994) showed. Populous geographical areas were well represented, with 15,599 cases for Texas, 36,027 for New York State, and large numbers for major cities, including 19,081 cases for New York City, 6,081 for Philadelphia, and 8,584 for Chicago.

Despite the impressive overall size of the sample and the ability to use it to study carefully selected subgroups within the population, there is research for which it was not designed. It is not possible, for example, to analyze the characteristics of small populations within the United States in 1910 if it is also necessary to subdivide them into a significant number of additional subgroups. This limitation proved important for the study of Americans of African and Hispanic origin in 1910, and it led two research groups to design and construct oversamples of the 1910 PUMS that could be used for the detailed study of those groups. This article describes those samples in general terms. Fuller discussion can be found in the IPUMS documentation (Gutmann et al. 1998), as well as in Gutmann, Frisbie and Blanchard (1999) and in Morgan and Ewbank (1990).

The Black Oversample

The Black Oversample was designed to increase the sample of persons living in black-headed households in the 1910 microsample. It was designed by Douglas Ewbank at the University of Pennsylvania, and carried out by the same team that produced the original 1910 microsample. Its purpose was to enrich opportunities for studying differentials within the black population.

The sample is a random sample of households headed by individuals whose race was classified as Negro, black or mulatto. All individuals in each selected household were included. The sampling procedure used computer-generated page and line numbers. If the individual on the selected line was a head of household and was listed as black, Negro or mulatto, all individuals in the household were included in the sample.
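The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run by the Pennsylvania team: the record layout (`Line`), the function names, and the way a random page and line are drawn are all hypothetical, and drawing a page first and then a line is uniform over lines only if every page holds the same number of lines.

```python
import random
from dataclasses import dataclass

# Race labels that qualified a household head, as named in the text.
ELIGIBLE_RACES = {"negro", "black", "mulatto"}


@dataclass
class Line:
    """One enumerated person on a census page (hypothetical layout)."""
    household_id: int
    is_head: bool
    race: str


def draw_households(pages, n_draws, seed=0):
    """Draw random (page, line) positions and keep a household only when
    the line drawn is a household head with an eligible race label.

    `pages` is a list of non-empty lists of Line records.
    """
    rng = random.Random(seed)
    selected = set()
    for _ in range(n_draws):
        page = rng.choice(pages)   # computer-generated page number
        line = rng.choice(page)    # computer-generated line number
        if line.is_head and line.race.lower() in ELIGIBLE_RACES:
            selected.add(line.household_id)
    return selected


def household_members(pages, household_id):
    """Return every person enumerated in the selected household."""
    return [ln for page in pages for ln in page
            if ln.household_id == household_id]
```

A draw that lands on a non-head, or on a head of another race, is simply discarded, which is why the rule samples households rather than persons.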

Preliminary tests for a few counties in Louisiana demonstrated that sampling only black-headed households was not feasible in areas where less than 10% of the population was black. Therefore, most of the sample was drawn from counties where the population exceeded this threshold. The research team excluded the states with the four largest black populations (Alabama, Georgia, South Carolina, and Mississippi), because blacks in these states were already well represented in the original microsample. In the two states with the next largest black populations (Maryland and Kentucky), the sample includes 0.5% of black-headed households in counties where more than 10% of the population was black. In six states with smaller black populations (Virginia, North Carolina, Florida, Tennessee, and Arkansas) the sample includes 1.0% in counties with large black populations. In Louisiana, the sample includes 0.5% in some counties and 1.0% in others.
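The allocation of sampling fractions just described can be written down as a small lookup. The sketch below is hypothetical and simplifies in two ways that should be kept in mind: it treats the 10% threshold as a strict cutoff for every state, although the text says only that most of the sample came from such counties, and it includes only the states the text names. Because the text does not say which Louisiana counties received which fraction, the caller must supply it.

```python
THRESHOLD = 0.10  # share of county population that was black

# States named in the text, grouped by their treatment in the oversample.
EXCLUDED = {"Alabama", "Georgia", "South Carolina", "Mississippi"}
HALF_PERCENT = {"Maryland", "Kentucky"}
ONE_PERCENT = {"Virginia", "North Carolina", "Florida",
               "Tennessee", "Arkansas"}


def oversample_fraction(state, share_black, louisiana_fraction=None):
    """Return the oversample fraction for black-headed households in a
    county, given its state and the black share of its population.

    Assumption: counties at or below THRESHOLD get no oversample.
    """
    if state in EXCLUDED or share_black <= THRESHOLD:
        return 0.0
    if state in HALF_PERCENT:
        return 0.005
    if state in ONE_PERCENT:
        return 0.010
    if state == "Louisiana":
        # The text gives 0.5% in some counties and 1.0% in others,
        # without listing them, so the value must be passed in.
        if louisiana_fraction not in (0.005, 0.010):
            raise ValueError("supply 0.005 or 0.010 for Louisiana")
        return louisiana_fraction
    return 0.0  # states outside the oversample
```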

By itself, the Black Oversample does not constitute a random sample of the black population of the U.S. in 1910. However, it can be combined with the original 1910 microsample to produce a data set that includes 14,698 black-headed households. The data files for the 1910 microsample and the Black Oversample include indicators for whether the household comes from a county that was sampled only in the original microsample, from a county that was also included in the 1% sample, or from a county that was also included in the 0.5% sample. When combined using the appropriate sample weights, the combined data are representative of all blacks in 1910. The exact weighting procedure is described in detail in Morgan and Ewbank (1990).
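To show why the three county indicators matter, the following Python sketch computes a simple inverse-probability weight for each stratum. It is an assumption made for illustration and is not the procedure of Morgan and Ewbank (1990): it supposes that the original sample and the oversample were drawn independently, so that a black-headed household's chance of inclusion is roughly the sum of the two sampling fractions, and it takes the original fraction to be exactly 1-in-250. The stratum labels and function name are hypothetical.

```python
ORIGINAL_FRACTION = 1 / 250  # approximate density of the original PUMS

# Additional fraction applied to black-headed households, by county type.
OVERSAMPLE_FRACTION = {
    "original_only": 0.0,
    "half_percent": 0.005,
    "one_percent": 0.010,
}


def combined_weight(stratum, black_headed=True):
    """Inverse of the approximate combined inclusion probability.

    Households not headed by a black person were never eligible for the
    oversample, so they keep the weight of the original sample.
    """
    extra = OVERSAMPLE_FRACTION[stratum] if black_headed else 0.0
    return 1.0 / (ORIGINAL_FRACTION + extra)


for name in OVERSAMPLE_FRACTION:
    print(name, round(combined_weight(name), 1))
```

Under these assumptions a household in a 1% county stands for far fewer households than one in a county covered only by the original sample, which is the reason unweighted tabulations of the combined file would overstate the oversampled areas.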

The Black Oversample is an excellent resource for studying the black population of the U.S. in 1910. Its design makes it especially valuable for research on geographic differences in the black population: when it is combined with the original microsample, it ensures relatively large sample sizes for the 13 states with the largest black populations.

The Hispanic Oversample

The Hispanic Oversample was created by a team headed by Myron Gutmann and Steven Ruggles at the Universities of Texas and Minnesota during 1995-1998. The overall sample was designed to reflect the presence of persons of Mexican, Spanish, and Cuban origin in the United States in 1910, as well as a significant number of their non-Hispanic neighbors. The goal of the researchers was to create a data set that would allow them to study the variations within the Hispanic population in 1910, in order to show regional differences, as well as differences by origin, urban residence, and head's occupation. Research by Gutmann, Haines, Frisbie, and Blanchard (1998) shows that there are major differences in levels of child mortality by region, with Hispanic children in Tampa having the best survival and those in rural New Mexico the worst.

Samples were drawn from 57 counties in six states: Arizona, California, Florida, Kansas, New Mexico, and Texas. In the two Florida counties, where most of the Spanish- and Cuban-origin population lived, the sample design selected half of the households headed by Hispanics; in all the other counties, the sample design produced a twenty percent sample of households headed by Hispanics in each county. The overall sample includes roughly ten percent of the 845,000 Hispanics living in the United States in 1910, a very large proportion (Gutmann, Frisbie, and Blanchard 1999). In addition, one household not headed by a Hispanic was selected for every two Hispanic-headed households in the sample counties, except where the non-Hispanic population was so small as to render this selection impossible.[2] The coding procedures and case-selection rules for group quarters used in the Hispanic oversample generally follow those used in the 1920 PUMS created at the University of Minnesota.
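The arithmetic of this design can be illustrated with a short Python sketch. The function name and arguments are hypothetical, and the treatment of remainders (integer division) is an assumption, since the text does not say how fractional households were handled.

```python
def target_counts(hispanic_households, non_hispanic_households,
                  florida_county=False):
    """Return (Hispanic-headed, non-Hispanic-headed) households to select
    in one sample county under the design described in the text.

    One in two Hispanic-headed households in the two Florida counties,
    one in five elsewhere; one non-Hispanic household for every two
    Hispanic-headed households selected, capped by the number available.
    """
    interval = 2 if florida_county else 5
    n_hispanic = hispanic_households // interval
    n_non_hispanic = min(n_hispanic // 2, non_hispanic_households)
    return n_hispanic, n_non_hispanic


# A county with 1,000 Hispanic-headed and 5,000 other households:
print(target_counts(1000, 5000))
# A county where too few non-Hispanic households exist to fill the quota:
print(target_counts(900, 30))
```

The cap in the second case corresponds to the exception noted in the text for counties where the non-Hispanic population was too small.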

Defining an individual as Hispanic is one of the complexities involved in the creation of the Hispanic oversample. For the purposes of sample creation, heads of household were defined as Hispanic if they were born in Mexico, Cuba, or Spain, if one of their parents was born in one of those countries, if they spoke Spanish, if they had a name that was judged to be Hispanic by the data entry operator, or if the census enumerator in 1910 enumerated their race as "Mexican."[3] Each household in the Hispanic oversample has a variable that indicates the reason why it was selected for the sample. It is important to know that persons whom virtually all researchers would agree were not Hispanic could have been selected in Hispanic households, while persons whom a consensus would agree are Hispanic could have been selected into the sample in non-Hispanic households. Gutmann and Gratton (forthcoming) have devised criteria for classifying individuals in the 1910 IPUMS (as well as other IPUMS years) as Hispanic, and will produce a publicly-available data set that assigns those classifications to each individual and can be merged with the IPUMS.
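The selection criteria for household heads amount to a rule that is true if any one of five conditions holds. A minimal Python sketch follows; the field names and reason labels are hypothetical and do not correspond to actual IPUMS variable names or codes.

```python
from dataclasses import dataclass

ORIGIN_COUNTRIES = {"Mexico", "Cuba", "Spain"}


@dataclass
class Head:
    """Characteristics of a household head (hypothetical field names)."""
    birthplace: str
    mother_birthplace: str
    father_birthplace: str
    language: str
    name_judged_hispanic: bool   # judgment of the data entry operator
    race_as_enumerated: str      # what the 1910 enumerator wrote


def selection_reasons(head):
    """List every criterion from the text that the head satisfies."""
    reasons = []
    if head.birthplace in ORIGIN_COUNTRIES:
        reasons.append("birthplace")
    if (head.mother_birthplace in ORIGIN_COUNTRIES
            or head.father_birthplace in ORIGIN_COUNTRIES):
        reasons.append("parent birthplace")
    if head.language == "Spanish":
        reasons.append("language")
    if head.name_judged_hispanic:
        reasons.append("name")
    if head.race_as_enumerated in {"Mexican", "Mx"}:
        reasons.append("race as enumerated")
    return reasons


def is_hispanic_head(head):
    """A head counts as Hispanic if at least one criterion applies."""
    return bool(selection_reasons(head))
```

Returning the list of reasons, rather than a bare yes or no, mirrors the household-level variable that records why each household was selected.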

The Hispanic oversample can be used by itself or together with the original 1910 PUMS. When the two are combined, the user can have confidence that she or he has relatively accurate estimates of the Hispanic population in a national context. The combined sample is appropriate for comparisons with groups that did not reside in the same counties as Hispanics. If one wanted to compare Mexican immigrants and their descendants, for example, with German, Chinese, or Russian immigrants, the combined sample would be ideal. If, on the other hand, one wanted to analyze Hispanic Americans in 1910 in their own localities and with the people who lived near them, using the Hispanic sample counties alone would be the correct choice.

Although Hispanic-headed households in the Hispanic oversample make up a flat sample, those headed by non-Hispanics do not constitute a flat sample, and the sample when combined with the original 1910 PUMS is not a flat sample. The designers of the oversample have constructed appropriate weights that should be applied to every case in the oversample and in the combined samples. These weights are part of the combined 1910 IPUMS data set as distributed. Their computation is described in detail in the oversample documentation, as well as in Gutmann, Frisbie and Blanchard (1999).
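Because the combined file is not a flat sample, any tabulation must sum weights rather than count cases. The Python sketch below shows a weighted proportion; the pairing of each record with its weight and the example values are hypothetical, and the weights actually distributed with the data should be used in practice.

```python
def weighted_share(cases, predicate):
    """Weighted proportion of cases for which `predicate` is true.

    `cases` is an iterable of (record, weight) pairs, where the weight is
    the number of persons or households the case represents.
    """
    cases = list(cases)
    total = sum(weight for _, weight in cases)
    if total == 0:
        raise ValueError("total weight is zero")
    hits = sum(weight for record, weight in cases if predicate(record))
    return hits / total


# Two heavily sampled cases and one lightly sampled case (made-up values).
example = [
    ({"urban": True}, 5.0),
    ({"urban": False}, 5.0),
    ({"urban": False}, 250.0),
]
print(weighted_share(example, lambda r: r["urban"]))
```

In this made-up example one case in three is urban, yet the weighted share is far smaller, because the urban case comes from a heavily sampled stratum and so represents few households.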


The Black and Hispanic Oversamples of the 1910 Public Use Sample constitute unique means to supplement the national PUMS for that year, in order to study geographic and other differences within these important sub-populations. In both Oversamples, researchers found areas in the United States with significant minority populations, and sampled them in a way that matched the format of the existing general sample. Both require special weighting procedures to be useful with the national sample, but when used appropriately they greatly enhance the research opportunities possible with the 1910 IPUMS.


  1. This article appeared in Historical Methods, Volume 32, Number 3, Pages 156-158, Summer 1999. Reprinted with Permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W. Washington, D.C. 20036-1802. Copyright 1999.
  2. This was the case in some Texas counties that had populations that were more than two-thirds Hispanic. In such a situation, there were not enough non-Hispanic households to select.
  3. Although the enumerator's instructions did not allow a race code for Mexican, a number of enumerators coded "Mx" for Mexican in 1910.


