1900
Sampling Procedures1
For sampling purposes, the 1900 Census of Population was divided into
logical groups called sample units. Each sample unit began with a head
of household and continued until another head of household was found, or
until some other logical break was found, such as blank lines on the page
or the end of the enumeration district. The division into sample units
followed very closely the Census division into "families", and each sample
unit usually had a unique family number, but the enumerator's use of family
numbers was inconsistent and occasional discrepancies would occur.
The individuals contained in the sample unit were divided into two classes,
family members and primary individuals. Family members consisted of the
head of household, all individuals related by blood to the head of household
and all individuals employed by the head of household (servants and domestic
farm workers). In institutional cases, the family members were those employed
at the institution. Primary individuals were boarders and lodgers, inmates
in institutions such as hospitals and prisons, and military
The family members within a sample unit made up a cluster, called the
"family cluster." The primary individuals within a sample unit were also
logically divided into clusters called "primary clusters." Most primary
individuals were a cluster of size one, but when two or more primaries
were living together as a family, they were considered a single cluster.
Each sample unit contained at least one cluster, but it might contain
many more, depending on the number of primaries in it.
Each cluster in turn contained a "key individual." In a family cluster
the key individual was always the head of household. In a primary cluster
of size one, the key individual was of course the only individual in the
cluster. In primary clusters with more than one individual, the key individual
was the first person listed in the cluster, regardless of the relationship
to head code.
Our sampling procedure was performed in two stages. First, a random
sample of lines was drawn from among all of the lines on all of the manuscripts
in the Twelfth Census. The lines were selected using a random number generator,
so that each line had the same probability of being selected. Each selected
line was examined by a data entry operator. If the line contained the key
individual from a cluster, the entire cluster was entered into the sample,
regardless of its size. If the line contained any other member of a cluster,
or was blank, no data was entered into the sample.
This sampling procedure had the advantage that every cluster was sampled
with the same probability, regardless of its size, and no weighting of
the sample was necessary. This can be seen in the following way. Each cluster
of size n occupies exactly n lines, since each individual in the manuscript
is recorded on a separate line. Each line in the cluster has the probability
p of being selected, since the sample of lines is purely random. Therefore
the probability that a line will fall somewhere within the cluster is np,
since there are n lines, each with probability p of being selected. If
a line falls within a cluster, the probability that it will land on the
key individual is always 1/n, since each cluster contains one and only
one key individual. Therefore, the probability that a given cluster will
be taken into the sample is given by
np(1/n) = p
Thus it can be seen that each cluster has the probability p of being
selected, and this probability is completely independent of cluster size.
The same result can be demonstrated in a different way as follows:
We divide the manuscript of the Twelfth Census into clusters such that:
1) Each individual belongs to one and only one cluster.
2) Each cluster has one and only one key individual.
We then draw a random sample of lines from the manuscript so that each
line has the same probability p of being selected. Since each key individual
occupies one and only one line:
1) Each key individual has the probability p of being selected.
2) Each cluster has the probability p of being selected, since each
cluster has one and only one key individual.
3) Each individual in the Twelfth Census has the probability p of being
selected, since each individual belongs to one and only one cluster.
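The argument above can be checked with a small simulation. The sketch below is illustrative only: the cluster sizes, the value of p, and the function name are hypothetical, and each line is drawn independently with probability p, which is a simplifying assumption rather than the project's actual selection routine.

```python
import random

def simulate_cluster_selection(cluster_sizes, p, trials, seed=0):
    """Estimate how often each cluster enters the sample.

    A cluster enters the sample only when the line holding its key
    individual (taken here to be the first line of the cluster) is drawn.
    """
    rng = random.Random(seed)
    hits = [0] * len(cluster_sizes)
    for _ in range(trials):
        for i, size in enumerate(cluster_sizes):
            # Draw every line of the cluster independently with probability p.
            drawn = [rng.random() < p for _ in range(size)]
            # Lines other than the key individual's never trigger entry.
            if drawn[0]:
                hits[i] += 1
    return [h / trials for h in hits]

# Hypothetical clusters of size 1, 5 and 40 lines, with p = 0.05.
rates = simulate_cluster_selection([1, 5, 40], p=0.05, trials=20000)
for size, rate in zip([1, 5, 40], rates):
    print(f"cluster of size {size}: selected in {rate:.4f} of trials")
```

All three estimated rates should sit close to p, whatever the cluster size. A large cluster is hit by sample lines more often, but a proportionally smaller share of those hits land on its key individual.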
This procedure also has the advantage that it is very simple to operationalize,
can be quickly learned, and shows a very low error rate. The basis for
the sampling procedure was the Roll Master File, containing one record
for each of the 1854 microfilm rolls in the 1900 Census of Population.
The data entry operator would request a sample for a particular roll. The
computer would calculate an estimated number of frames for that roll, based
on the length of the roll and the density of the frames per inch of microfilm.
Each frame was known to contain 50 lines, so a total number of lines for
the roll was calculated.
The total number of lines for the roll was divided by 750, and the result
was the total number of sample lines to be examined on that roll. A random
number generator was used to select sample lines (without replacement)
from the total available on the roll. The computer sorted the selected
lines into sequence and returned them to the operator, giving the frame
number and line number of each selected line. The operator looked up each
selected sample line on the microfilm. Then:
1) If the sample line contained a head of household, all of the family
members in the corresponding sample unit were taken as a single case. No
primary individuals were taken, but the number of primaries was recorded.
2) If the sample line contained a primary individual, then:
a) If the primary individual was living alone, not with other related
primaries, the individual was entered into the data set as a case.
b) If the primary lived with other related primaries, and was the "head"
of the family (i.e., listed first), then all of the members of the family
were entered as a case. Relatedness in these cases was imputed on the basis
of surname, age and marital status data, since relation to head code usually
was Boarder or Lodger.
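The per-roll selection of sample lines described before the operator's rules can be sketched roughly as follows. The 50 lines per frame and the divisor of 750 come from the text; the roll length, the frame density, the function name, and the rounding of fractional results are assumptions made for illustration, since the handbook does not say how fractions were handled.

```python
import random

LINES_PER_FRAME = 50     # stated in the text
SAMPLING_DIVISOR = 750   # stated in the text

def select_sample_lines(roll_length_inches, frames_per_inch, seed=None):
    """Return sorted (frame, line) pairs to examine on one microfilm roll."""
    rng = random.Random(seed)
    # Estimate the number of frames from roll length and frame density.
    est_frames = round(roll_length_inches * frames_per_inch)
    total_lines = est_frames * LINES_PER_FRAME
    # Number of sample lines for this roll (rounding is an assumption).
    n_sample = round(total_lines / SAMPLING_DIVISOR)
    # Draw line indices without replacement, then sort them into sequence.
    chosen = sorted(rng.sample(range(total_lines), n_sample))
    # Convert each 0-based index to a 1-based frame number and line number.
    return [(i // LINES_PER_FRAME + 1, i % LINES_PER_FRAME + 1) for i in chosen]

# Hypothetical roll: 1200 inches of film at 0.6 frames per inch.
picks = select_sample_lines(roll_length_inches=1200, frames_per_inch=0.6, seed=1)
for frame, line in picks[:5]:
    print(f"frame {frame}, line {line}")
```

Sorting the selected lines lets the operator work through the roll in a single forward pass on the reader.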
By selecting primaries individually, we hoped to produce a more representative
sample of this important subgroup. Primary individuals appeared in clusters
which varied widely in size, ranging from households with a single boarder
up to institutional situations with several thousand individuals living
in a single sample unit. It was felt that an individual level sample was
the best approach to this population. For each case taken, the number of
family members and the number of primaries in the sample unit were computed
by the operator and entered as part of the case.
Sometimes the relation to head field was either illegible or missing,
and the data entry staff had to impute head of household from other information,
such as family number, name, etc. In other cases, usually large institutional
situations, there was no head of household listed at all. In these cases,
the sample unit was taken to be the entire institutional population and
the number of family members was set to zero.
A more serious difficulty with our sampling scheme occurred in institutional
situations. The staff of some institutions would sometimes be listed as
a large "family," with a head of household shown first and large numbers
of live-in staff members listed immediately afterwards. Under the rules
of our sampling scheme, all individuals in such a "family" were taken as
a single case. This resulted in a few large clusters of institutional staff
members being included in our sample. We have six cases with over twenty
individuals, among them a case with 40, a case with 50, and a case with
130. These cases did not turn up until fairly late in the data collection
process, so we were not able to revise our sampling procedure to allow
for them. They are included in the sample, but the user should exercise
extreme caution in attempting to generalize about group quarters staff
on the basis of our sample. Clustering effects can be expected to be severe.
When sampling was begun, the goal was to obtain a sample of 1/750 of
the population of the U.S. When sampling was completed the final sample
size was 100,438 for an effective sampling rate of 1/759.73. The difference
requires some discussion.
Our sampling strategy was based on the use of a Kodak Motormatic Microfilm
reader, a motorized 35 mm reader with a built-in counter. The counter measures
the distance along the film in inches, but is not capable of counting frames.
The number of frames on the roll was calculated by the computer based on
an average density of frames/inch on a sample of microfilm rolls taken
early in the project. While the average density of frames/inch varied slightly
from roll to roll, the difference amounted to not more than one sample
line more or less per roll, and so the difference was neglected in favor
of the much-increased efficiency of the sampling process when using a motorized
reader. Because of the variations from roll to roll, we were not able to
maintain an exact sampling fraction of 1/750, but an actual fraction of
1/759.73 was considered quite a good fit.
There remains the question of whether the sample is uniform throughout
the entire domain. We did not calculate a frame density for all of the
1854 microfilm rolls, due to time and cost restrictions. But we are able
to estimate an expected sample size by state and compare it to the observed.
Our approach is based on the fact that the Census "family" is nearly
identical to our "sample unit". Since statistics on the number of "families"
were reported by state in the 1900 Census documents, we have a figure very
close to the number of heads of household for each state. If our effective
sampling fraction is close to 1/760, we can expect that the number of heads
of household taken will be 1/760 of the number of families reported for
the state in the Census publications. We treat the sampling process as if it were a binomial
process with replacement. This allows us to estimate the variance of the
sample size and evaluate the success of the sampling strategy.
Table 2 shows the results of applying this
technique of estimation to the 1900 Public Use Sample. For each state,
the proportion p was taken to be the number of families reported by Census
divided by the total population of the state. This is the probability that
a given individual examined by the operator will be a head of household.
The number of lines examined by the operator, n, is estimated by dividing
the total population of the state by the overall constant 760. This assumes
that the effective sampling rate in each state, i.e., the number of lines
actually examined by the operator, was equal to 1/760 of the total population
of the state. The expected number of heads of household actually taken
into the sample is given by np, and the variance by np(1-p). We would
expect the sample sizes to be normally distributed around their expected
values.
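As a rough illustration of this check, the sketch below computes the expected number of heads of household, its standard deviation, and the distance of an observed count from expectation in standard deviations. The function names and the state figures are hypothetical and are not taken from Table 2.

```python
import math

def expected_heads(population, families, divisor=760):
    """Binomial mean and standard deviation of heads of household sampled."""
    n = population / divisor    # estimated number of lines examined
    p = families / population   # chance that an examined line is a head
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd

def z_score(observed, population, families, divisor=760):
    """Distance of the observed count from expectation, in standard deviations."""
    mean, sd = expected_heads(population, families, divisor)
    return (observed - mean) / sd

# Hypothetical state: 760,000 people, 152,000 families, 215 heads sampled.
mean, sd = expected_heads(760_000, 152_000)
z = z_score(215, 760_000, 152_000)
print(f"expected {mean:.1f}, sd {sd:.2f}, z = {z:.2f}")
```

A state whose absolute z value exceeds 2 would fall in the "more than 2 standard deviations" group discussed in connection with Table 2.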
An examination of Table 2 reveals that
our expectations are generally met well by the 1900 Public Use Sample.
Three states, New Hampshire, North Carolina, and Indiana, are more than
2 standard deviations away from their expected values. Fourteen states
are between 1 and 2 standard deviations, and 36 states are less than 1
standard deviation from expected sample size. The corresponding percentages
are 5.6, 26.4, and 68.
It must be emphasized that the same technique CANNOT be applied at the
individual level. Our sample was a random sample of clusters, not a random
sample of individuals. For this reason, the calculation of variance at
the individual level is problematical. For some variables, variance will
be very nearly the same as a random sample of individuals; for others,
it may be very much larger.
For example, case number 1809 in the 1900 Public Use Sample is a
family in Connecticut made up of six individuals, five of whom were born
in Wales. The total sample drawn from Connecticut was 1201 individuals,
so the proportion Welsh in our Connecticut sample was 5/1201 or .004163.
We could use this proportion to estimate the number of Welsh in Connecticut.
To do this we would multiply the total population of Connecticut, 908,420,
by our proportion Welsh, .004163, arriving at an estimated Welsh population
of 3,781. The actual number of Welsh in Connecticut reported by Census
was 629, for an error of estimation of 3,152 or 501 per cent (the estimate
is 601 per cent of the reported figure). This error
is due, not to a lack of randomness in the sample, but to the fact that
the sample is taken at the cluster level, not at the individual level.
We can see this more clearly by looking again at the sampling process.
Our overall sampling probability was 1/760. The total number of Welsh in
Connecticut was 629, so the expected number of our sample lines falling
on a Welsh individual in Connecticut was 629/760 or .8276. It was thus fairly
likely that at least one would, and in fact one of our sample lines did fall on a Welsh
individual. This individual happened quite by chance to be a head of household,
so all members of the family were taken into the sample as a cluster. Not
surprisingly, most of the family members were also Welsh, so the number
of Welsh taken into the sample was five, although only one sample line
was Welsh.
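The arithmetic of the Connecticut example can be reproduced in a few lines. All input figures come from the text; the variable names are illustrative. The last two lines assume independent draws at a rate of 1/760 and distinguish the expected number of sample lines falling on Welsh individuals (629/760) from the chance that at least one does.

```python
welsh_in_sample = 5       # Welsh-born individuals in case 1809
ct_sample_size = 1201     # individuals sampled from Connecticut
ct_population = 908_420   # total population of Connecticut
welsh_reported = 629      # Welsh in Connecticut as reported by Census

# Naive individual-level estimate from the clustered sample.
proportion = welsh_in_sample / ct_sample_size   # roughly .004163
estimate = ct_population * proportion           # roughly 3,782
error = estimate - welsh_reported
relative_error = error / welsh_reported         # roughly 5, i.e. 500 per cent

# Expected number of sample lines landing on a Welsh individual,
# and the chance of at least one (assuming independent draws).
expected_lines = welsh_reported / 760
p_at_least_one = 1 - (1 - 1 / 760) ** welsh_reported

print(f"estimate {estimate:.0f}, error {error:.0f} ({relative_error:.0%})")
print(f"expected lines {expected_lines:.4f}, P(at least one) {p_at_least_one:.3f}")
```

A single sample line on one Welsh head of household brings five Welsh individuals into the sample, which is why the individual-level estimate overshoots so badly.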
We can see from this example that great care must be taken in making
estimates at the individual level from our data. The variance of a variable
at the individual level is dependent on the uniformity of the variable within
the cluster. A variable such as race or ethnicity will tend to be very
uniform within clusters, so the clustering effect on the variance will
be larger than for a variable like sex or month of birth which occurs more
randomly within clusters.
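One common rule of thumb for gauging this effect, which is not part of the original handbook, is the approximate design effect for clusters of roughly equal size: 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation of the variable. The sketch below uses hypothetical values of m and rho purely for illustration; it is not a substitute for the full treatment of clustered variance.

```python
def design_effect(avg_cluster_size, intraclass_corr):
    """Approximate variance inflation relative to a simple random sample."""
    return 1 + (avg_cluster_size - 1) * intraclass_corr

def effective_sample_size(n_individuals, avg_cluster_size, intraclass_corr):
    """Size of a simple random sample carrying about the same information."""
    return n_individuals / design_effect(avg_cluster_size, intraclass_corr)

# Hypothetical: clusters of 5, a highly uniform variable (rho = 0.9)
# versus a variable that is unrelated to cluster membership (rho = 0).
deff_uniform = design_effect(5, 0.9)
deff_random = design_effect(5, 0.0)
print(f"uniform variable: deff = {deff_uniform:.1f}")
print(f"random variable:  deff = {deff_random:.1f}")
print(f"effective n for 1201 individuals: {effective_sample_size(1201, 5, 0.9):.0f}")
```

Under these assumed values, a variable that is nearly constant within families behaves as if the sample were several times smaller, while a variable unrelated to family membership loses nothing.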
The problem of actually computing the variance of individual variables
in a clustered sample is quite complex and beyond the scope of this document.
The reader is referred to an extensive discussion of this problem in Chaps.
5-7 of Survey Sampling by Leslie Kish (John Wiley & Sons, 1965).
ENDNOTES:
Adapted from the following text: Stephen N. Graham, "Chapter 4: Sampling," 1900 Public
Use Sample: User's Handbook (Draft Version), Seattle: University of
Washington, Center for Studies in Demography and Ecology, 1980, pp. 39-47.