1900 Sampling Procedures
For sampling purposes, the 1900 Census of Population was divided into logical groups called sample units. Each sample unit began with a head of household and continued until another head of household was found, or until some other logical break was found, such as blank lines on the page or the end of the enumeration district. The division into sample units followed very closely the Census division into "families", and each sample unit usually had a unique family number, but the enumerator's use of family numbers was inconsistent and occasional discrepancies would occur.
The individuals contained in the sample unit were divided into two classes, family members and primary individuals. Family members consisted of the head of household, all individuals related by blood to the head of household and all individuals employed by the head of household (servants and domestic farm workers). In institutional cases, the family members were those employed at the institution. Primary individuals were boarders and lodgers, inmates in institutions such as hospitals and prisons, and military
The family members within a sample unit made up a cluster, called the "family cluster." The primary individuals within a sample unit were also logically divided into clusters called "primary clusters." Most primary individuals were a cluster of size one, but when two or more primaries were living together as a family, they were considered a single cluster. Each sample unit contained at least one cluster, but it might contain many more, depending on the number of primaries in it.
Each cluster in turn contained a "key individual." In a family cluster the key individual was always the head of household. In a primary cluster of size one, the key individual was of course the only individual in the cluster. In primary clusters with more than one individual, the key individual was the first person listed in the cluster, regardless of the relationship to head code.
Our sampling procedure was performed in two stages. First, a random sample of lines was drawn from among all of the lines on all of the manuscripts in the Twelfth Census. The lines were selected using a random number generator, so that each line had the same probability of being selected. Second, each selected line was examined by a data entry operator. If the line contained the key individual from a cluster, the entire cluster was entered into the sample, regardless of its size. If the line contained any other member of a cluster, or was blank, no data were entered into the sample.
This sampling procedure had the advantage that every cluster was sampled with the same probability, regardless of its size, and no weighting of the sample was necessary. This can be seen in the following way. Each cluster of size n occupies exactly n lines, since each individual in the manuscript is recorded on a separate line. Each line in the cluster has the probability p of being selected, since the sample of lines is purely random. Therefore the probability that a line will fall somewhere within the cluster is np, since there are n lines, each with probability p of being selected. If a line falls within a cluster, the probability that it will land on the key individual is always 1/n, since each cluster contains one and only one key individual. Therefore, the probability that a given cluster will be taken into the sample is given by
np(1/n) = p
Thus it can be seen that each cluster has the probability p of being selected, and this probability is completely independent of cluster size. The same result can be demonstrated in a different way as follows:
We divide the manuscript of the Twelfth Census into clusters such that
- Each individual belongs to one and only one cluster.
- Each cluster has one and only one key individual.
We then draw a random sample of lines from the manuscript so that each line has the same probability p of being selected. Since each key individual occupies one and only one line:
- Each key individual has the probability p of being selected.
- Each cluster has the probability p of being selected, since each cluster has one and only one key individual.
- Each individual in the Twelfth Census has the probability p of being selected, since each individual belongs to one and only one cluster.
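The equal-probability argument above can be checked with a small simulation. The following is a minimal sketch, not part of the original procedure: the cluster sizes, the value of p, and the assumption that the key individual occupies the first line of each cluster are all illustrative choices.

```python
import random

def simulate_cluster_selection(cluster_sizes, p, trials, seed=0):
    """Estimate how often each cluster's key-individual line is drawn."""
    rng = random.Random(seed)
    # Lay the clusters out as consecutive manuscript lines; the key
    # individual is assumed to sit on the first line of each cluster.
    key_lines = []
    start = 0
    for size in cluster_sizes:
        key_lines.append(start)
        start += size
    total_lines = start

    hits = [0] * len(cluster_sizes)
    for _ in range(trials):
        # Every line is selected independently with the same probability p.
        selected = {i for i in range(total_lines) if rng.random() < p}
        for c, key in enumerate(key_lines):
            if key in selected:
                hits[c] += 1  # the cluster is taken only via its key line
    return [h / trials for h in hits]

sizes = [1, 4, 12, 40]  # hypothetical cluster sizes
rates = simulate_cluster_selection(sizes, p=0.05, trials=20000)
for size, rate in zip(sizes, rates):
    print(f"cluster size {size:2d}: selected in {rate:.3f} of trials")
```

Under these assumptions every cluster should be selected in roughly the same share of trials, close to p, whatever its size.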
This procedure also has the advantage that it is very simple to operationalize, can be quickly learned, and shows a very low error rate. The basis for the sampling procedure was the Roll Master File, containing one record for each of the 1854 microfilm rolls in the 1900 Census of Population. The data entry operator would request a sample for a particular roll. The computer would calculate an estimated number of frames for that roll, based on the length of the roll and the density of the frames per inch of microfilm. Each frame was known to contain 50 lines, so a total number of lines for the roll was calculated.
The total number of lines for the roll was divided by 750, and the result was the total number of sample lines to be examined on that roll. A random number generator was used to select sample lines (without replacement) from the total available on the roll. The computer sorted the selected lines into sequence and returned them to the operator, giving the frame number and line number of each selected line. The operator looked up each selected sample line on the microfilm. Then:
- If the sample line contained a head of household, all of the family members in the corresponding sample unit were taken as a single case. No primary individuals were taken, but the number of primaries was recorded.
- If the sample line contained a primary individual, then:
  - If the primary individual was living alone, not with other related primaries, the individual was entered into the data set as a case.
  - If the primary lived with other related primaries, and was the "head" of the family (i.e., listed first), then all of the members of the family were entered as a case. Relatedness in these cases was imputed on the basis of surname, age and marital status data, since relation to head code usually was Boarder or Lodger.
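The line-selection step for a single roll might be sketched as follows. This is an illustration under stated assumptions rather than the project's actual program: the function name, the roll length, the frame density, and the rounding of the sample count (integer division) are hypothetical. Only the 50 lines per frame and the divisor of 750 come from the text.

```python
import random

LINES_PER_FRAME = 50     # from the text
SAMPLING_INTERVAL = 750  # from the text: one sample line per 750 lines

def draw_sample_lines(roll_length_inches, frames_per_inch, rng):
    """Return sorted (frame, line) pairs to look up on one roll."""
    estimated_frames = int(roll_length_inches * frames_per_inch)
    total_lines = estimated_frames * LINES_PER_FRAME
    n_sample = total_lines // SAMPLING_INTERVAL  # rounding rule is assumed
    # Draw without replacement, then sort into film order.
    picks = sorted(rng.sample(range(total_lines), n_sample))
    # Convert each 0-based line index to 1-based frame and line numbers.
    return [(i // LINES_PER_FRAME + 1, i % LINES_PER_FRAME + 1) for i in picks]

rng = random.Random(1900)
# Hypothetical roll: 1200 inches at 0.75 frames per inch.
sample = draw_sample_lines(roll_length_inches=1200, frames_per_inch=0.75, rng=rng)
for frame, line in sample[:5]:
    print(f"frame {frame}, line {line}")
```

Sorting the picks means the operator can wind through the roll in one direction.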
By selecting primaries individually, we hoped to produce a more representative sample of this important subgroup. Primary individuals appeared in clusters which varied widely in size, ranging from households with a single boarder up to institutional situations with several thousand individuals living in a single sample unit. It was felt that an individual level sample was the best approach to this population. For each case taken, the number of family members and the number of primaries in the sample unit were computed by the operator and entered as part of the case.
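The operator's decision for a selected line can be expressed as a short rule: take the whole cluster if the line holds its key individual, otherwise take nothing. The sketch below assumes a hypothetical `Person` record with a `cluster` label assigned in advance (one label for the family cluster, one per primary cluster). In the actual project these judgments were made by the operator from the manuscript.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Person:
    name: str
    cluster: int  # hypothetical label: 0 = family cluster, 1+ = primary clusters

def case_for_line(unit: List[Optional[Person]], line_index: int):
    """Return the cluster to enter as a case, or None if nothing is taken."""
    person = unit[line_index]
    if person is None:
        return None  # blank line: nothing entered
    members = [p for p in unit if p is not None and p.cluster == person.cluster]
    # The key individual is the first person listed in the cluster
    # (the head of household, in the case of the family cluster).
    if members[0] is not person:
        return None
    return members

# Hypothetical sample unit, in manuscript order.
unit = [
    Person("head", 0), Person("wife", 0), Person("child", 0),
    Person("boarder A", 1),
    Person("boarder B", 2), Person("boarder B's wife", 2),
]
for i in range(len(unit)):
    case = case_for_line(unit, i)
    taken = [p.name for p in case] if case else "nothing taken"
    print(f"line {i}: {taken}")
```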
Sometimes the relation to head field was either illegible or missing, and the data entry staff had to impute head of household from other information, such as family number, name, etc. In other cases, usually large institutional situations, there was no head of household listed at all. In these cases, the sample unit was taken to be the entire institutional population and the number of family members was set to zero.
A more serious difficulty with our sampling scheme occurred in institutional situations. The staff of some institutions would sometimes be listed as a large "family," with a head of household shown first and large numbers of live-in staff members listed immediately afterwards. Under the rules of our sampling scheme, all individuals in such a "family" were taken as a single case. This resulted in a few large clusters of institutional staff members being included in our sample. We have six cases with over twenty individuals, among them a case with 40, a case with 50, and a case with 130. These cases did not turn up until fairly late in the data collection process, so we were not able to revise our sampling procedure to allow for them. They are included in the sample, but the user should exercise extreme caution in attempting to generalize about group quarters staff on the basis of our sample. Clustering effects can be expected to be severe.
When sampling was begun, the goal was to obtain a sample of 1/750 of the population of the U.S. When sampling was completed the final sample size was 100,438 for an effective sampling rate of 1/759.73. The difference requires some discussion.
Our sampling strategy was based on the use of a Kodak Motormatic Microfilm reader, a motorized 35 mm reader with a built-in counter. The counter measures the distance along the film in inches, but is not capable of counting frames. The number of frames on the roll was calculated by the computer based on an average density of frames/inch on a sample of microfilm rolls taken early in the project. While the average density of frames/inch varied slightly from roll to roll, the difference amounted to not more than one sample line more or less per roll, and so the difference was neglected in favor of the much-increased efficiency of the sampling process when using a motorized reader. Because of the variations from roll to roll, we were not able to maintain an exact sampling fraction of 1/750, but an actual fraction of 1/759.73 was considered quite a good fit.
There remains the question of whether the sample is uniform throughout the entire domain. We did not calculate a frame density for all of the 1854 microfilm rolls, due to time and cost restrictions. But we are able to estimate an expected sample size by state and compare it to the observed.
Our approach is based on the fact that the Census "family" is nearly identical to our "sample unit". Since statistics on the number of "families" were reported by state in the 1900 Census documents, we have a figure very close to the number of heads of household for each state. If our effective sampling fraction is close to 1/760, we can expect that the number of heads of household taken will be 1/760 of the number of families reported for the state in the Census publications. We treat the sampling process as if it were a binomial process with replacement. This allows us to estimate the variance of the sample size and evaluate the success of the sampling strategy.
Table 2 shows the results of applying this technique of estimation to the 1900 Public Use Sample. For each state, the proportion p was taken to be the number of families reported by Census divided by the total population of the state. This is the probability that a given individual examined by the operator will be a head of household. The number of lines examined by the operator, n, is estimated by dividing the total population of the state by the overall constant 760. This assumes that the effective sampling rate in each state, i.e., the number of lines actually examined by the operator, was equal to 1/760 of the total population of the state. The expected number of heads of household actually taken into the sample is given by np, and the variance by np(1-p). We would expect the sample sizes to be normally distributed around their expected values.
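The binomial calculation described above can be written out as a brief sketch. The function names are illustrative, and the state figures in the example are hypothetical round numbers chosen for easy checking; they are not taken from Table 2.

```python
import math

def expected_heads(population, families, interval=760):
    """Binomial mean and standard deviation of heads of household sampled."""
    p = families / population  # chance an examined line is a head of household
    n = population / interval  # estimated number of lines examined
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd

def standardized_deviation(observed, population, families, interval=760):
    """Distance of the observed count from expectation, in standard deviations."""
    mean, sd = expected_heads(population, families, interval)
    return (observed - mean) / sd

# Hypothetical state: 760,000 people in 152,000 families, 215 heads sampled.
mean, sd = expected_heads(population=760_000, families=152_000)
z = standardized_deviation(215, population=760_000, families=152_000)
print(f"expected {mean:.1f} heads, sd {sd:.2f}, observed is {z:+.2f} sd away")
```

A state whose standardized deviation exceeds 2 in absolute value would fall in the "more than 2 standard deviations" group discussed for Table 2.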
An examination of Table 2 reveals that our expectations are generally met well by the 1900 Public Use Sample. Three states, New Hampshire, North Carolina, and Indiana, are more than 2 standard deviations away from their expected values. Fourteen states are between 1 and 2 standard deviations, and 36 states are less than 1 standard deviation from expected sample size. The corresponding percentages are 5.6, 26.4, and 68.
It must be emphasized that the same technique CANNOT be applied at the individual level. Our sample was a random sample of clusters, not a random sample of individuals. For this reason, the calculation of variance at the individual level is problematical. For some variables, variance will be very nearly the same as a random sample of individuals; for others, it may be very much larger.
For example, case number 1809 in the 1900 Public Use Sample is a family in Connecticut made up of six individuals, five of whom were born in Wales. The total sample drawn from Connecticut was 1201 individuals, so the proportion Welsh in our Connecticut sample was 5/1201 or .004163. We could use this proportion to estimate the number of Welsh in Connecticut. To do this we would multiply the total population of Connecticut, 908,420, by our proportion Welsh, .004163, arriving at an estimated Welsh population of 3,781. The actual number of Welsh in Connecticut reported by Census was 629, for an error of estimation of 3,152, or 501 per cent of the reported figure (the estimate itself is 601 per cent of it). This error is due, not to a lack of randomness in the sample, but to the fact that the sample is taken at the cluster level, not at the individual level.
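The arithmetic of the Connecticut example can be reproduced directly. The sketch below uses only the figures quoted in the text; the variable names are illustrative. Because it carries the unrounded proportion through the calculation, the results differ slightly from the rounded values in the text.

```python
sample_welsh = 5         # Welsh-born individuals in the Connecticut sample
sample_size = 1201       # total Connecticut sample
population = 908_420     # total population of Connecticut
reported_welsh = 629     # Welsh-born reported by the Census

proportion = sample_welsh / sample_size
estimate = population * proportion
error = estimate - reported_welsh

print(f"proportion Welsh in sample: {proportion:.6f}")
print(f"estimated Welsh population: {estimate:,.0f}")
print(f"error of estimation:        {error:,.0f}")
print(f"estimate / reported:        {estimate / reported_welsh:.2f}")
print(f"error / reported:           {error / reported_welsh:.2f}")
```

The estimate comes out at roughly six times the reported count, all of it traceable to a single sampled family.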
We can see this more clearly by looking again at the sampling process. Our overall sampling probability was 1/760. The total number of Welsh in Connecticut was 629, so the probability that one of our sample lines would fall on a Welshman in Connecticut was 629/760 or .8276. This is a fairly strong probability, and in fact one of our sample lines did fall on a Welsh individual. This individual happened quite by chance to be a head of household, so all members of the family were taken into the sample as a cluster. Not surprisingly, most of the family members were also Welsh, so the number of Welsh taken into the sample was five, although only one sample line was Welsh.
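As a side note on the figure 629/760: strictly speaking this is the expected number of sample lines falling on a Welsh-born resident, which serves as a convenient approximation. A minimal sketch, assuming each of the 629 lines is selected independently with probability 1/760, gives the probability that at least one such line is drawn:

```python
welsh_in_connecticut = 629
p_line = 1 / 760  # overall sampling probability per line

# Expected number of sample lines landing on a Welsh-born resident.
expected_hits = welsh_in_connecticut * p_line

# Probability that at least one of the 629 lines is selected,
# assuming independent selection of lines.
p_at_least_one = 1 - (1 - p_line) ** welsh_in_connecticut

print(f"expected Welsh sample lines: {expected_hits:.4f}")
print(f"P(at least one Welsh line):  {p_at_least_one:.4f}")
```

Either way, a single Welsh sample line was an unsurprising outcome; what inflated the count was the cluster that came with it.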
We can see from this example that great care must be taken in making estimates at the individual level from our data. The variance of a variable at the individual level depends on the uniformity of the variable within the cluster. A variable such as race or ethnicity will tend to be very uniform within clusters, so the clustering effect on its variance will be larger than for a variable like sex or month of birth, which occurs more randomly within clusters.
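One common way to gauge the size of this effect is the approximate design effect for clusters of roughly equal size, 1 + (m - 1)ρ, where m is the average cluster size and ρ is the intraclass correlation of the variable. The sketch below is a rough illustration only, not a substitute for a full variance calculation; the cluster size and the two correlation values are hypothetical, not estimates from the sample.

```python
def design_effect(mean_cluster_size, intraclass_correlation):
    """Approximate variance inflation relative to a simple random sample."""
    return 1 + (mean_cluster_size - 1) * intraclass_correlation

# Hypothetical values: clusters of 5, with a highly uniform variable
# (such as birthplace) versus a nearly random one (such as month of birth).
for label, rho in [("highly uniform within clusters", 0.9),
                   ("nearly random within clusters", 0.02)]:
    deff = design_effect(5, rho)
    print(f"{label}: variance inflated about {deff:.2f} times")
```

With ρ near zero the clustered sample behaves almost like a random sample of individuals; with ρ near one, each cluster contributes little more information than a single person.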
The problem of actually computing the variance of individual variables in a clustered sample is quite complex and beyond the scope of this document. The reader is referred to an extensive discussion of this problem in Chaps. 5-7 of Survey Sampling by Leslie Kish (John Wiley & Sons, 1965).
- Adapted from the following text: Stephen N. Graham, "Chapter 4: Sampling," 1900 Public Use Sample: User's Handbook (Draft Version), Seattle: University of Washington, Center for Studies in Demography and Ecology, 1980, pp. 39-47.