Frequently Asked Questions (FAQ)
What is IPUMS-USA?
What's in the future for IPUMS?
Does IPUMS-USA add value to the data?
Where should a new user start?
How do I get access to IPUMS-USA data?
What are microdata?
What are "pointer variables"?
What are "general" and "detailed" versions of variables?
What are "weights"?
What does "universe" mean in the variable descriptions?
What are "data quality flags"?
How do I obtain data?
What format are the data in?
How long does a data extract take?
What if the samples are too big for me to handle?
How does "sample selection" work on the IPUMS-USA web site?
What do "add to cart" and "in cart" mean?
Why can't I open the data file?
Is there a preferred statistical package for using the IPUMS?
Can I analyze IPUMS-USA data without a statistical package?
Can I get the original data?
How is a record uniquely identified?
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of?
How do I decide what sample to use from any given year?
You say the IPUMS is nationally representative; is it also representative of smaller areas?
What are the major limitations of the data?
Can I find particular individuals in the IPUMS data?
How do I cite IPUMS-USA?
Can I use IPUMS for genealogy?
Using the variables page
Variables page menu
Variables page details
Using the data extract system
Your data cart
Why are some variables in my data cart preselected?
What is "Type"?
Extract request page
Extract definition: Data structure
Extract option: Customize sample sizes
Extract option: Select cases
Extract option: Attach characteristics
Extract option: Select data quality flags
Extract option: Describe your extract
General information about the project
What is IPUMS-USA? [top]
The Integrated Public Use Microdata Series (IPUMS-USA) consists of more than fifty high-precision samples of the American population drawn from fifteen federal censuses and from the American Community Surveys of 2000-2012. Some of these samples have existed for years, and others were created specifically for this database. These samples, which draw on every surviving census from 1850-2000, and the 2000-2012 ACS samples, collectively constitute our richest source of quantitative information on long-term changes in the American population. However, because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. This has complicated efforts to use them to study change over time. The IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic change.
IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data are individuals and not tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require.
IPUMS-International is the world's largest collection of publicly available individual-level census data. IPUMS-International integrates samples from population censuses from around the world taken since 1960. Scholars interested only in the United States are better served using IPUMS-USA, which is optimized for U.S. research.
IPUMS-CPS is an integrated set of data from the March Current Population Survey (CPS), beginning in 1962 and continuing until the present. This harmonized dataset is also compatible with the data from the U.S. decennial censuses that are part of the IPUMS-USA. Researchers can take advantage of the relatively large sample size of IPUMS-USA at ten-year intervals and fill in information for the intervening years using IPUMS-CPS.
What's in the future for IPUMS? [top]
IPUMS-USA is funded through 2012 by several grants from the National Institute of Child Health and Human Development. In addition to working on new features for the website and data extraction system, we are currently making new high-density samples for 1900, 1930, and 1960. Over the next four years we plan semi-annual data releases every March and October. The precise sample composition of each data release is hard to predict very far ahead of time, but our latest plans can be found on our data release schedule page.
We have every expectation of continuing the project beyond 2012, but will have to secure further funding as our current grants expire. To be successful, we need to have a large body of users and published works we can point to. Please inform us if you have any presentations or publications using IPUMS data.
Does IPUMS-USA add value to the data? [top]
IPUMS data is integrated over time and across samples by assigning uniform codes to variables. This process itself adds value to the data by fully documenting all codes and compiling all variable documentation in a hyperlinked web format. But we do many other things as well:
IPUMS creates a consistent set of constructed variables on family interrelationships for all samples. The "pointer" variables indicate the location within the household of every person's mother, father, and spouse.
IPUMS data also includes harmonized income and occupation variables. The Census Bureau has reorganized its occupational and industrial classification systems in almost every census administered since 1850. Although IPUMS retains the original occupation and industry codes, a variety of occupation and industry variables have been created for long-term analysis. More information on these variables can be found on the Occupation and Industry Variables page.
Where should a new user start? [top]
The documentation is a natural starting place for new IPUMS-USA users. First, the User's Guide provides an overview of the database and detailed documentation on the variables and samples available.
The Variables page is the primary tool for exploring the contents of IPUMS-USA. On the variables page, clicking on a variable name brings up its documentation. It contains a description of the variable and discussions of comparability issues over time. The variables page also has direct links to the codes page for each variable. The codes page shows the coding structure and labels for a variable and the availability of categories across samples. These categories can suggest the types of research possible with a given sample. Links at the bottom of each variable description lead to original questionnaire text and instructions pertaining to the census question for every sample. Click on the "Tutorial" button on the variables page for an overview of how to make an extract, or see the FAQ item on the variables page.
The Samples page describes the characteristics of the samples in the data series and the censuses from which they were derived.
If you are already registered to use IPUMS-USA, you can click on "create an extract" and use the data access system. To start, users can reference our instructions for the extraction system (pdf).
How do I get access to IPUMS-USA data? [top]
Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
The IPUMS-USA data is also available for analysis online through the IPUMS Online Data Analysis System.
What are microdata? [top]
Census microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different census questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the IPUMS data.
Microdata are inherently flexible. One need not depend on published statistics from a census that compiled the data in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including individual-level multivariate analyses.
See an image of IPUMS data here. All IPUMS data are in this general format.
What are "pointer variables"? [top]
The IPUMS "pointer" variables indicate the location within the household of every person's mother, father, and spouse. Nearly all samples indicate the relationship of each person to the head of household, but it is much harder to relate individuals to persons other than the head (for example, grandchildren to children, sons-in-law to daughters, or unrelated persons to each other). We have developed a complex core algorithm to make such connections, and we customize it as needed to account for peculiarities of specific samples. The pointer variables are called MOMLOC, POPLOC and SPLOC in the IPUMS system, and accompanying variables indicate the major rules under which a specific link was made.
The pointer variables make it easy to construct individual-level variables representing the characteristics of co-resident persons, such as occupation of spouse, age of mother, or educational attainment of father. You need to include the serial and person ID variables (SERIAL and PERNUM) in your extract, as well as the pointer variables themselves, to perform these data manipulations.
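As a rough illustration of the manipulation described above, the following sketch attaches a spouse's characteristic to each person record using SERIAL, PERNUM, and SPLOC. It is not the IPUMS linking algorithm, only a way of using the finished pointers. The variable OCC stands in for any characteristic in your extract, the output name OCC_SP is hypothetical, all values are invented, and the sketch assumes that an SPLOC of 0 means no spouse is present in the household.

```python
# Toy person records from a single sample; all values are invented.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "SPLOC": 2, "OCC": 100},
    {"SERIAL": 1, "PERNUM": 2, "SPLOC": 1, "OCC": 200},
    {"SERIAL": 1, "PERNUM": 3, "SPLOC": 0, "OCC": 0},
    {"SERIAL": 2, "PERNUM": 1, "SPLOC": 0, "OCC": 300},
]


def attach_spouse_value(records, source_var, new_var):
    """Copy source_var from each person's co-resident spouse into new_var."""
    # Index every person by household serial number and person number.
    by_location = {(r["SERIAL"], r["PERNUM"]): r for r in records}
    for r in records:
        # Assumption: SPLOC == 0 means no spouse is present in the household.
        spouse = None
        if r["SPLOC"] != 0:
            spouse = by_location.get((r["SERIAL"], r["SPLOC"]))
        r[new_var] = spouse[source_var] if spouse is not None else None
    return records


attach_spouse_value(persons, "OCC", "OCC_SP")
```

The same function should work for MOMLOC or POPLOC if the pointer name is swapped in. When an extract pools several samples, the lookup key would also need YEAR and DATANUM, since SERIAL is only unique within a sample.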
What are "general" and "detailed" versions of variables? [top]
Most variables in the IPUMS have a composite coding structure, where the first digit is largely comparable across samples, and second and subsequent digits provide progressively more detail available in some samples and not others. For some highly requested variables, the composite coding structure is formally recognized in our system by distinguishing separate "general" and "detailed" versions of the variable. For example, researchers can access an internationally comparable 1-digit general version of "employment status", or they can use the fully detailed 3-digit version, if their research requires finer distinctions. The two sets of codes are completely consistent with one another; one simply provides more categories, while the other is simpler to use and usually more comparable across samples.
The variables with general versions have a checkbox in the "general version" column in the data extract variable selection screen. Other variables only have the default full-detail version. It is possible to include both the general and the detailed version of a variable in a data extract. Both versions of a variable come with appropriate syntax labels. In data extracts, the detailed version of the variable gets a "D" appended onto the end of its mnemonic (for example, for "employment status," EMPSTAT is general and EMPSTATD is detailed).
The general and detailed versions of a variable both correspond to the same description in the documentation system. The codes and frequencies of each version are viewable separately on the relevant variable codes page.
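The relationship between the two versions can be sketched as follows, assuming the general code is simply the leading digit of a zero-padded detailed code. The text above suggests this for composite codes, but it should be checked against the codes page for each variable. The code values and the default width are hypothetical and are not taken from any IPUMS codes page.

```python
def general_code(detailed_code, detailed_width=3):
    """Return the leading digit of a zero-padded detailed code."""
    padded = str(detailed_code).zfill(detailed_width)
    return int(padded[0])


# Hypothetical detailed codes, invented for illustration only.
detailed = [110, 120, 210, 0]
general = [general_code(code) for code in detailed]
```

In practice there is rarely a need to derive the general version yourself, because both versions can be included in the same extract.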
What are "weights"? [top]
All datasets in the IPUMS are samples; each sample case represents anywhere from 20 to 1000 people in the complete population for the given year. The "weight" variables indicate how many persons in the population are represented by each sample case. Many IPUMS samples are unweighted or "flat", meaning that every person in the sample data represents the same number of persons in the population. In the 2000 1% Unweighted sample, for instance, the weight for all sample cases is fixed at 100; each case represents 100 people in the population. But many samples in the IPUMS are "weighted", meaning that some sample cases represent more people in the population than others. Persons and households with some characteristics are overrepresented in the samples, while others are underrepresented. Weight variables allow researchers to create accurate population estimates using weighted samples. See the PERWT variable or What is IPUMS? page for a listing of the weighted IPUMS-USA samples.
To obtain representative statistics from the weighted samples, users must apply sample weights. Follow one of the following procedures:
1. For person-level analyses using a weighted sample, apply the PERWT variable. PERWT gives the population represented by each individual in the sample; more precision is available in the 1850-1930 samples than in the 1940-onward samples.
2. For household-level analyses using a weighted sample, weight the households using the HHWT variable. HHWT gives the number of households in the general population represented by each household in the sample; more precision is available in the 1850-1930 samples than in the 1940-onward samples. If you are using a rectangularized sample, select only one person (e.g., PERNUM = 1) to represent that household.
3. For analyses using sample-line variables from 1940 and 1950, apply the SLWT variable. SLWT gives the number of persons in the general population represented by each sample-line person. (For other samples, SLWT contains PERWT, so applying SLWT to other years of data will not alter results for those years.)
Even the unweighted samples have values for HHWT and PERWT, but every record in those samples receives an identical weight. This allows the application of the weight variables in pooled extracts that contain both weighted and unweighted samples. Otherwise, the use of the weights is optional in the unweighted samples.
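A minimal sketch of procedures 1 and 2, using a toy rectangular extract. All values are invented, and EMPLOYED is a hypothetical indicator rather than an IPUMS variable. Person-level estimates sum PERWT over persons; household-level estimates sum HHWT over one person per household (PERNUM = 1).

```python
# Toy rectangular extract: the household weight is repeated on each person record.
records = [
    {"SERIAL": 1, "PERNUM": 1, "HHWT": 100, "PERWT": 100, "EMPLOYED": True},
    {"SERIAL": 1, "PERNUM": 2, "HHWT": 100, "PERWT": 120, "EMPLOYED": False},
    {"SERIAL": 2, "PERNUM": 1, "HHWT": 50, "PERWT": 50, "EMPLOYED": True},
]


def estimated_persons(rows):
    """Estimate the number of persons in the population (sum of PERWT)."""
    return sum(r["PERWT"] for r in rows)


def estimated_households(rows):
    """Estimate the number of households, counting each household once."""
    return sum(r["HHWT"] for r in rows if r["PERNUM"] == 1)


def weighted_share(rows, flag):
    """Estimate the population share of persons for whom flag is true."""
    total = sum(r["PERWT"] for r in rows)
    matching = sum(r["PERWT"] for r in rows if r[flag])
    return matching / total


person_total = estimated_persons(records)        # 100 + 120 + 50
household_total = estimated_households(records)  # 100 + 50
employed_share = weighted_share(records, "EMPLOYED")
```

Here the unweighted share of employed persons would be 2 of 3, while the weighted share is 150 of 270, which shows why weights matter in weighted samples.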
What does "universe" mean in the variable descriptions? [top]
The universe is the population at risk of having a response for the variable in question. In most cases these are the households or persons to whom the census question was asked, as reflected on the census questionnaire. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. Cases that are outside of the universe for a variable are labeled "NIU" on the codes page. Differences in a variable's universe across samples are a common data comparability issue.
The universes will not always be entirely clean of apparently erroneous cases. Some persons or households that should not have answered the question did, and some that should have answered may be included in the "NIU" (not in universe) category. But until we perform comprehensive data editing and allocation in the future, we do not know whether the variable in question is in error or whether the variables that define the universe (for example, age or employment status) are incorrect.
What are "data quality flags"? [top]
Many variables in the IPUMS have been edited for missing, illegible and inconsistent values. Data quality flags indicate which values are edited or allocated. More information on the philosophy and procedures used for editing and allocating the IPUMS data can be found on the Introduction to data editing and allocation page.
Each data quality flag corresponds to one or more IPUMS variables, and the codes for each flag vary based on the sample. Data quality flags can be included in your extract by selecting the boxes in the "Set Variable Options" step of the extract system.
How do I obtain data? [top]
All IPUMS data are delivered through our data extraction system. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. To start, users can reference our instructions for the data extraction system (pdf) and instructions for opening an IPUMS extract on your computer.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.
What format are the data in? [top]
IPUMS produces fixed-column ASCII data. Data are entirely numeric. By default, the extraction system rectangularizes the data: that is, it puts household information on the person records and does not retain the households as separate records. No information is lost, and this is the format preferred by most researchers; however, the extraction system includes the option of hierarchical data or household record only data.
In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
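The supplied syntax files are the supported way to read an extract. For readers curious about what "fixed-column ASCII" means in practice, the sketch below reads a gzipped file by slicing each line at known column positions. The layout shown is hypothetical; the real positions for a given extract come from its codebook or syntax file.

```python
import gzip

# Hypothetical layout: (variable name, start, end), 0-based with end exclusive.
# Real column positions must be taken from the extract's codebook.
LAYOUT = [("YEAR", 0, 4), ("SERIAL", 4, 12), ("PERNUM", 12, 16), ("AGE", 16, 19)]


def read_fixed_width(path, layout=LAYOUT):
    """Read a gzipped fixed-column ASCII file into a list of dictionaries."""
    rows = []
    # gzip.open in text mode uncompresses the file as it is read.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # The data are entirely numeric, so each field converts to int.
            rows.append({name: int(line[start:end]) for name, start, end in layout})
    return rows
```

Calling read_fixed_width with the path to a downloaded .gz file would return one dictionary per record. For large extracts, reading line by line into a statistical package is likely to be more practical than holding every record in memory this way.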
How long does a data extract take? [top]
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts can take from a few minutes to an hour or more. The system sends an email when the extract is completed, so there is no need to stay active on the IPUMS site while the extract is being made.
What if the samples are too big for me to handle? [top]
It is possible to make samples that are extremely large. Extracts over a certain size will not be allowed by our server. If you have a legitimate reason to make extremely large extracts, email us to request a higher threshold.
Roughly speaking, the number of records times the number of columns of data requested yields the file size of your extract in bytes. The number of records in each sample is given on the samples page. Among the current samples, the 1880 complete-count database, the 1980-2000 decennial samples, and the multi-year ACS samples are particularly large.
The "extract request" screen predicts the size of your data file. If it is too large for your purposes, there are several things you can do.
1) Select fewer variables and/or samples. In particular, requesting replicate weights (REPWT and REPWTP) will increase your extract size dramatically, so we recommend that you select these only if you intend to use them for this extract.
2) Use the "Select cases" feature on the extract page to include only the particular kinds of people or households you want to include in your data. Instructions can be found here. (Note: the estimated file size does not include case selection in its calculation, so this number will not change.)
3) Use the "Customize sample sizes" feature to draw smaller random subsets of some or all of the samples in your extract. The sample weights in the data will be adjusted appropriately. Instructions can be found here.
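The rule of thumb above (records times columns gives bytes) can be sketched as a quick calculation. The record and column counts below are invented, and the result is only a rough estimate, as the text says.

```python
def estimated_extract_bytes(n_records, n_columns):
    """Rough extract size: one byte per column of data per record."""
    return n_records * n_columns


def to_megabytes(n_bytes):
    """Convert bytes to megabytes (10**6 bytes) for easier reading."""
    return n_bytes / 1_000_000


# Invented example: one million records with 80 columns of data each.
full_size = estimated_extract_bytes(1_000_000, 80)

# Drawing a 10 percent subset of the records shrinks the estimate in proportion.
subset_size = estimated_extract_bytes(100_000, 80)
```

Dropping variables reduces the column count, while case selection and smaller sample sizes reduce the record count, so either approach lowers the product.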
How does "sample selection" work on the IPUMS-USA web site? [top]
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples, those selections are carried into the data extract system.
What do "add to cart" and "in cart" mean? [top]
Much like a shopping cart on commercial websites, your data cart contains the variables and samples that you want to include in your customized extract. You can place variables into (and remove them from) your data cart while browsing the documentation system by clicking the icons in the "Add to Cart" column as described in the variable selection FAQ item. When you are viewing your data cart, the icons appear in the "In Cart" column and work the same way.
Why can't I open the data file? [top]
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the data file directly with a statistical package. The file is a simple ASCII file, not a system file in the format of any statistical package. The extract system does, however, generate a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data.
Is there a preferred statistical package for using the IPUMS? [top]
IPUMS supports SPSS, SAS and Stata. The system does not make data files in those formats, but does generate syntax files with which to read in the ASCII data.
Can I analyze IPUMS-USA data without a statistical package? [top]
The IPUMS Online Data Analysis System allows users to analyze all IPUMS-USA samples online. The system performs a wide range of operations on the data based on specifications made by the user, from simple tabulations to advanced statistical analyses. Examples and screenshots are available on our short instructions page.
Can I get the original data? [top]
For all census years from 1850 to 1950 (except for 1890, which was destroyed by fire), the original manuscript population schedules are preserved on microfilm at the National Archives in Washington D.C. In each year, the microfilm reels and the schedules within reels are organized geographically: alphabetically by state, within states alphabetically by county, and within counties numerically by enumeration district. For census and ACS years since 1960, the census schedules exist in fully machine-readable form.
The basic sources for most of the IPUMS documentation are the documentation provided for each of the individual public use microdata samples. All of them should be available through the Inter-university Consortium for Political and Social Research (ICPSR), P.O. Box 1248, Ann Arbor, MI, 48106.
How is a record uniquely identified? [top]
Three variables constitute a unique identifier for each household record in the IPUMS: YEAR, DATANUM, and SERIAL (year, data set number, and household serial number).
Four variables constitute a unique identifier for each person record in the IPUMS: YEAR, DATANUM, SERIAL, and PERNUM (year, data set number, household serial number, and person serial number).
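As an illustration, the identifiers above can be combined into a single key and checked for duplicates, for example after pooling several samples. This is a minimal sketch with invented records; the function names are hypothetical.

```python
from collections import Counter


def household_key(record):
    """Identifier for a household: year, data set number, household serial."""
    return (record["YEAR"], record["DATANUM"], record["SERIAL"])


def person_key(record):
    """Identifier for a person: the household key plus the person number."""
    return household_key(record) + (record["PERNUM"],)


def duplicated_keys(records, key_func):
    """Return every key that occurs more than once in the records."""
    counts = Counter(key_func(r) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Invented records: the same SERIAL and PERNUM recur in different years.
records = [
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1},
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 2},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1},
]

person_duplicates = duplicated_keys(records, person_key)
serial_only_duplicates = duplicated_keys(records, lambda r: (r["SERIAL"], r["PERNUM"]))
```

The second check shows why SERIAL and PERNUM alone are not enough in a pooled extract: the pair (1, 1) appears in both years.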
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of? [top]
Some samples are weighted: each individual does not represent the same number of persons in the population. It is important to use the weight variables when performing analyses with these samples. See the PERWT variable for a listing of the weighted IPUMS-USA samples. For other samples the use of weights is optional.
The 2000-2004 ACS samples do not include persons in group quarters.
It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. In other words, the syntax labels are not enough. There are two things to pay particular attention to. The universe for a variable -- the population at risk for answering the question -- can differ subtly or markedly across samples. Also, read the variable comparability discussions for the samples you are interested in. Important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you are also well served to read the enumeration text associated with it. This text is linked directly to the variable, so it is quite easy to call it up.
By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level. The number of observations will be inflated to the number of person records. You can either select the first person in each household (PERNUM = 1) or select the "hierarchical" box in the extract system to get the proper number of household observations. The rectangularizing feature also drops any vacant households, which are otherwise available in some samples. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
How do I decide what sample to use from any given year? [top]
In some years, multiple IPUMS-USA samples are available. Which sample to use depends on your research questions; typically, the decision involves trading less precision on some variables for increased detail on other variables. Aside from the obvious distinction between USA and Puerto Rico samples, there are four major dimensions along which the samples vary:
In 1880 and 1980-2000 data, multiple densities are available. Lower-density samples are smaller in size and therefore quicker to work with, but may not contain as much geographic detail as the higher-density samples. Additionally, users who want to study small subgroups of the population should use higher-density samples that will contain more cases of interest.
The 1860-1870 and 1900-1910 samples contain oversamples of racial and ethnic minorities. (While not technically an oversample, a 3% sample of households containing elderly people is also available in 1990.) These oversamples yield a greater number of minority cases; researchers who are interested in such subgroups should consider using them.
In 1990 and 2000, both weighted and unweighted samples are available. (It is also possible to select unweighted subsamples of the 1940 and 1950 data.) Unweighted samples may contain fewer cases from sparsely populated areas, but are easier to work with for statistical analyses that do not accept weights. For more information, see Sample Weights in the IPUMS introduction.
In 1970-2000, many samples contain different sets of variables. This is particularly important when using the 1970 data; many variables appear only in "Form 1" samples, while others appear only in "Form 2" samples; information on neighborhood characteristics is available in some samples but not others. In the other years, it is mostly geographic identifiers that vary across samples: some samples do not include identifiers for metropolitan areas, for example, while others do but suppress state identification for some cases.
You say the IPUMS is nationally representative; is it also representative of smaller areas? [top]
IPUMS-USA data can be considered representative of geographic areas of all sizes, as long as the proper weights are applied. The standard errors of statistics based on smaller geographic areas will be larger than those based on national data, but this is due only to the reduced sample size rather than to the fact that one is analyzing a small area.
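To see how sample size drives precision, the sketch below computes the standard error of an estimated proportion under the simplifying assumption of simple random sampling. IPUMS samples have more complex designs, so the numbers are illustrative only, and the case counts are invented.

```python
import math


def se_proportion(p, n):
    """Standard error of a proportion p estimated from n sample cases,
    assuming simple random sampling (a simplification)."""
    return math.sqrt(p * (1 - p) / n)


# The same proportion estimated from a national sample and from a small area.
national_se = se_proportion(0.2, 1_000_000)
small_area_se = se_proportion(0.2, 1_000)
ratio = small_area_se / national_se
```

With 1,000 times fewer cases, the standard error is larger by a factor of the square root of 1,000, or roughly 32, whatever the size of the geographic area itself.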
What are the major limitations of the data? [top]
The data are composed entirely of individual person and household records from population censuses. There are no macroeconomic, business, or aggregate statistics. We do not deliver the published statistics from the population censuses.
IPUMS is composed entirely of sample data, with sample densities ranging from 1 percent to 5 percent of national populations. Some subpopulations may be too small to study with the sample data.
Because the data are public-use, measures have been taken to assure confidentiality. Names and other identifying information are suppressed. Most importantly for many researchers, geographic information is usually limited, sometimes severely.
Can I find particular individuals in the IPUMS data? [top]
No. A variety of steps have been taken to ensure the confidentiality of the data. Most fundamentally, the modern samples do not contain names or addresses. The data are only samples, so there is no guarantee any given individual will be in the dataset.
How do I cite IPUMS-USA? [top]
See the IPUMS USA Citation page.
Any publications, research reports, presentations, or educational material making use of the data or documentation should be added to our Bibliography. Continued funding for the IPUMS depends on our ability to show our sponsor agencies that researchers are using the data for productive purposes.
Can I use IPUMS for genealogy? [top]
You are welcome to work with the IPUMS data, but you should be aware of its severe limitations for genealogical purposes. Only the 1850-1880 and 1920 samples contain names of individuals. Moreover, the 1850-1870 and 1920 samples are 1-in-100 samples of the population, meaning there is a 1 percent chance of finding any particular person in the data.
The data extraction system is not suited for genealogical research. The system is not a search engine. You cannot search for any particular name in the data, and the system will not select out cases on the basis of name.
If you wish to try your luck with the data, you should also be aware, if you are not already, that the data is "raw" (i.e., it is just strings of numbers with names and addresses embedded). Each line represents an individual or household. The IPUMS codebook is required to interpret the numbers. Most software - especially word processors - is unable to handle files as large as these. SPSS or SAS are the most frequently used programs to deal with the data.
Data for the complete 1880 census are available on two web sites which contain programs to search the data by name (Ancestry.com and FamilySearch.org). These sites also help genealogists search other historical census databases and are far better suited to genealogical research than the IPUMS is.
Using the variables page
Variables page menu [top]
Use the "Variables" menu to browse or search variables:
Household: household variables by group
Person: person variables by group
A-Z: integrated variables by letter
Search: display only variables that contain specified text in particular fields
Use the links on the right side of the menu to:
Select Samples: limit the display of variable information to selected samples
Options and Help: alter how the variable list is displayed or get help for this page
Variables page details [top]
The variables page allows you to browse variables while limiting and controlling how the information is displayed.
The "Variables" menu is for browsing the variables. You may also search variables by specifying search terms for specific fields of variable metadata. The system will return a list of variables that include any of the search terms you indicate.
When you "Select Samples" you limit the variable list to display only variables that are available in at least one of those samples. But the effect of selecting samples extends into all the variable descriptions and codes pages you can access through the variable system. Only information relevant to your selected samples will be displayed in any context while you browse the variables. You can change your sample selections at any point.
Selecting samples is a good practice when exploring the IPUMS, because the amount of information can be unwieldy. On the other hand, sometimes you need to see everything to determine what kinds of research are possible using the database.
The final choices are "Options" and "Help." The "Display Options" item brings up a screen that offers a number of choices regarding the display of the variable list. Each selection has a default choice.
View one group / View all groups
Switch between viewing one variable group at a time and viewing all variable groups on one screen. Unless you have a limited number of samples selected, your browser may be slow to display all groups. The default view is one group at a time.
Show availability detail / Show availability summary
Switch between displaying the full sample-specific availability matrix, and a view that only displays the total number of samples that contain each variable. Both views only display or sum the samples that the user has selected in "Select samples." The default view is the detailed availability information.
View available variables / View all variables
Switch between a view that displays only variables present in at least one of your selected samples and a view that displays every variable, including those not available in your selected samples. The default view displays only available variables.
Samples are displayed chronologically / Samples . . . reverse chronologically
Display the sample columns that indicate variable availability in chronological order (oldest to newest) or reverse chronological order (newest to oldest). The default is reverse chronological (newest to oldest).
The Variable List
As you browse the variables, they are displayed in a list containing a number of columns. The variable name links to the variable description, which includes detailed comparability discussions, universes, and enumeration text. The variable codes -- and their associated labels -- can be accessed directly using the "codes" links. The "type" column indicates if it is a person or household variable. In some contexts, like the alphabetic view, the two types are pooled together.
In the area to the right of the "codes" column is a column for every sample that the user chose in "Select samples." By default, the most commonly requested samples from each year are selected. The four-digit year and brief sample identifier identify each sample at the top of every column. Hover over the year with the mouse to see the full sample name. If a variable is available in a given sample, an "x" is printed in that column.
In the column labeled "Add to cart", each variable has a yellow circle with a "+" on the far left. Click a circle to add that variable to your data cart (the circle turns green when you hover over it, indicating that you may select the variable). Once clicked, the icon changes to a checked box, indicating that the variable is in your data cart. To remove the variable from your data cart, simply click the checkbox.
Using the data extract system
Your data cart [top]
You must be logged in to use the data extract system. If you are not registered, you must apply for access.
At the top right corner of the variables page is a summary of your data cart. This box displays the number of variables and samples you have selected. Clicking the yellow circle by a given variable places it in your data cart. You can view your data cart at any time by clicking "View Cart".
Your data cart lists the variables preselected by the extract system as well as any variables you selected while browsing the documentation. As with the variable selection page, you can remove variables from your extract in this step by clicking the checkbox next to the variable in the "Add to cart" column. If you chose a variable but subsequently altered your sample selections in such a way that the variable is no longer available, it is indicated by an "i" icon.
The data cart also includes record type, links to codes pages, and sample availability for the variables and samples in your cart.
Buttons are provided to return to the variable list to make more selections or to alter your sample choices. If you return to the variable list, click on "View Cart" again to return to your data cart.
When you are satisfied with your data selections, click "CHECK OUT: Create Data Extract" to finalize your extract request.
Why are some variables in my data cart preselected? [top]
Certain variables appear in your data cart even if you did not select them, and they are not counted in the running total of variables shown for your data cart.
These variables are important for various reasons. In IPUMS-USA, four variables (YEAR, DATANUM, SERIAL, and PERNUM) together uniquely identify every observation in the IPUMS-USA database. Group quarters status (GQ) is included because it is a critical part of obtaining consistent household definitions across time. And two weighting variables, at the household (HHWT) and person (PERWT) levels, are essential for obtaining representative statistics from sample data.
Unless you are absolutely certain you will not need one of these variables, we recommend that you not remove them from your data cart.
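The role of the preselected variables can be illustrated with a small sketch. The records below are invented toy values, not real IPUMS data, and the helper names are hypothetical. The sketch checks that YEAR, DATANUM, SERIAL, and PERNUM identify each record once, then forms weighted person and household counts. It assumes a rectangular file, where HHWT repeats on every person record and so must be counted once per household.

```python
from collections import Counter

# Toy rectangular records; all values are invented for illustration.
records = [
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1, "HHWT": 20, "PERWT": 20},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 2, "HHWT": 20, "PERWT": 22},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 2, "PERNUM": 1, "HHWT": 25, "PERWT": 25},
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1, "HHWT": 100, "PERWT": 100},
]

ID_VARS = ("YEAR", "DATANUM", "SERIAL", "PERNUM")


def duplicate_keys(rows):
    """Return every identifier combination that occurs more than once."""
    counts = Counter(tuple(row[var] for var in ID_VARS) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def weighted_persons(rows, year):
    """Estimate the number of persons in a year by summing PERWT."""
    return sum(row["PERWT"] for row in rows if row["YEAR"] == year)


def weighted_households(rows, year):
    """Estimate the number of households, counting each HHWT only once."""
    weights = {}
    for row in rows:
        if row["YEAR"] == year:
            weights[(row["DATANUM"], row["SERIAL"])] = row["HHWT"]
    return sum(weights.values())


print(duplicate_keys(records))
print(weighted_persons(records, 2000), weighted_households(records, 2000))
```

Dropping any of the identifier or weight variables from an extract would make checks like these impossible to run later.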
What is "Type"? [top]
The "Type" column on the variables selection pages and in your data cart indicates the record type of the variable. The variables with a "P" are from the person record, and the variables with an "H" are from the household record. Data at the household level pertain to each person in the household, and are identical on each person record within a household in the rectangular data file.
Extract request page [top]
When you click "Create data extract" in the Data Cart, you come to the Extract Request page. All of the
actions on this page are optional. If you wish, you can simply hit the "Submit" button and create your
data extract. You will be prompted to log in if you have not done so already.
The page summarizes your data extract and provides a number of options for customizing it. A link
at the top expands to show the samples you selected. If any samples have notes associated with them,
a message will appear on the samples bar to encourage you to review that information. Click the
appropriate links to go back to the variable browsing and sample selection pages to alter your choices.
You return to the extract request page via the data cart, where you can review the availability matrix for
selections and easily drop variables by unchecking them. If you selected a variable that has both general and detailed versions, it appears as two variables on this screen. You can unselect any versions that you do not want.
A separate link lets you choose the preferred data structure for your extract:
rectangular, hierarchical, or household-only. Rectangular format is the default.
Another row on the page estimates the size of your extract. If the estimated size is too large, click on the
link to reduce extract size. Two of the methods for reducing the size of extracts
involve options buttons on the lower half of the extract request page.
When you submit an extract, there will be a delay ranging from minutes to hours, depending on the size
of the job. You do not need to wait on our site for the job to be completed. Our system will send you an
e-mail when your extract is ready.
All data are produced in gzip compressed format, and you must decompress the data before reading it. Most compression software has no difficulty with the files. The system produces only ASCII fixed column-format data, but with each extract it generates SAS, SPSS, and Stata command files to read the data into one of those statistical packages. The system also creates a codebook file that describes the content of your extract. Remember that you must decompress the data file before reading it with a statistical software package.
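For users working outside SAS, SPSS, or Stata, the following is a minimal sketch of reading a gzip-compressed, fixed-column ASCII file with the Python standard library. The column positions in LAYOUT and the file name are hypothetical; the real positions for any extract come from its codebook or command files. The sketch writes a two-record stand-in file so that it runs on its own.

```python
import gzip
import os
import tempfile

# Hypothetical layout for illustration: variable -> (start, end) as 0-based
# Python slice positions. Real positions come from the codebook or the
# command files generated with each extract.
LAYOUT = {"YEAR": (0, 4), "SERIAL": (4, 12), "PERNUM": (12, 16), "AGE": (16, 19)}


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-column record into integer fields."""
    return {name: int(line[start:end]) for name, (start, end) in layout.items()}


def read_extract(path, layout=LAYOUT):
    """Decompress a gzip file on the fly and parse every non-blank line."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return [parse_line(line.rstrip("\n"), layout) for line in handle if line.strip()]


# Build a tiny two-record stand-in for a downloaded extract, then read it back.
lines = [
    f"{2000:04d}{1:08d}{1:04d}{34:03d}",
    f"{2000:04d}{1:08d}{2:04d}{7:03d}",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "extract.dat.gz")
    with gzip.open(path, "wt", encoding="ascii") as handle:
        handle.write("\n".join(lines) + "\n")
    rows = read_extract(path)

print(rows)
```

Reading through gzip.open avoids a separate decompression step, but the decompressed size still determines how much memory the parsed records need.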
The definitions of every extract will remain on our server indefinitely, but the data files are subject to
deletion after three days. However, the screen where you download extracts has a feature that lets you
revise old extracts. When you click on "revise," all your selections for that extract will be loaded into the
system, after which you can edit or regenerate it. Note, however, that each successive data release can
create difficulties for recreating old extracts, because codes might change.
Extract definition: Data structure [top]
In Step 1 of the extract procedure you define some general characteristics of your desired extract.
Choose the preferred file structure for your extract: rectangular (only person records, with all requested household information attached to respective household members), hierarchical (household record followed by person records), or household-only (household records without any person records; use this only if you are interested primarily in the characteristics of housing units). The system defaults to rectangular format, which is the overwhelming choice of researchers. Note that vacant housing units are not available in rectangular data, so select one of the other options if you need data for them.
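The relationship between the three structures can be sketched in a few lines. This is an illustration under assumptions, not the extract system's own code: it supposes a hierarchical file in which each record carries a RECTYPE of "H" or "P", and the other variable values are invented. It shows why a vacant unit, which has a household record but no person records, disappears from rectangular output.

```python
def rectangularize(hier_records):
    """Attach each household's variables to the person records that follow it."""
    rect = []
    household = None
    for rec in hier_records:
        fields = {k: v for k, v in rec.items() if k != "RECTYPE"}
        if rec["RECTYPE"] == "H":
            household = fields
        elif rec["RECTYPE"] == "P":
            if household is None:
                raise ValueError("person record found before any household record")
            rect.append({**household, **fields})
    return rect


def household_only(hier_records):
    """Keep household records and drop all person records."""
    return [rec for rec in hier_records if rec["RECTYPE"] == "H"]


# Invented hierarchical records; household 2 is vacant (no person records).
hierarchical = [
    {"RECTYPE": "H", "SERIAL": 1, "OWNERSHP": 1},
    {"RECTYPE": "P", "PERNUM": 1, "AGE": 41},
    {"RECTYPE": "P", "PERNUM": 2, "AGE": 39},
    {"RECTYPE": "H", "SERIAL": 2, "OWNERSHP": 0},
    {"RECTYPE": "H", "SERIAL": 3, "OWNERSHP": 2},
    {"RECTYPE": "P", "PERNUM": 1, "AGE": 67},
]

rectangular = rectangularize(hierarchical)
households = household_only(hierarchical)
print(len(rectangular), len(households))
```

In the rectangular result, the household value is repeated on every person record of that household, and household 2 contributes no rows at all.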
Extract option: Customize sample sizes [top]
This step is optional. At the top of the screen the extract system calculates the expected size of your data extract once you have uncompressed it. (Note that the predicted extract size does not take account of any case selection you may have implemented in Step 2. If you used case selection, your extract probably will be smaller than the size reported on this screen.)
If this is too large, or if you want to manage the sizes of the individual samples in your extract, click on the "customize sample sizes" link at the bottom of the screen to use the sample size tool. In the sample size tool, the samples you included in your extract are listed in the first column. In the next three columns (labeled "Full IPUMS Samples"), you will see the number of household records and person records (in thousands) with the sample density (the percent of the total population included in the sample). These are the cases you will receive if you proceed to the next step without making alterations.
In the next three columns (labeled "Customize sample sizes"), you can type a number in any cell to adjust your sample sizes, then submit your selection by clicking the "Calculate" button at the bottom of the screen, pressing the "Tab" key, or clicking anywhere else on the screen. The other two cells will be calculated automatically to accord with your entry, and the total number of records and your total estimated extract size will be adjusted at the bottom.
For example, you can alter the density of the 2000 5 percent sample from 5.0 percent to 2.5 percent--cutting the sample in half--and the number of household records and person records will be automatically divided in half as well.
Or you can enter a number in the first row ("All samples") to apply the same selection to each and every sample in your extract. For example, you can enter "1.0" in the density column to give every sample a density of 1 percent, or enter "50" in the households column to receive 50,000 households for every sample.
The minimum number of cases for any sample is 10,000 households. If your entry would require more cases than are available in a given sample, the tool will indicate the number from the full sample. At any point you can clear your selections and return to the full sizes for every sample by clicking the "Reset selections" button at the bottom of the screen.
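The arithmetic behind the sample size tool can be approximated as follows. This is a sketch under assumptions: the function name and the full-sample counts are hypothetical, counts are given as raw numbers rather than thousands, and the tool's actual rounding is not documented here. It scales households, persons, and density by the same fraction, applies the 10,000-household minimum, and caps the result at the full sample.

```python
MIN_HOUSEHOLDS = 10_000  # minimum stated for any sample


def customize(full_households, full_persons, full_density, *, density=None, households=None):
    """Scale a sample by a requested density (percent) or household count."""
    if (density is None) == (households is None):
        raise ValueError("specify exactly one of density or households")
    if density is not None:
        fraction = density / full_density
    else:
        fraction = households / full_households
    # Enforce the household minimum, then cap at the full sample.
    fraction = max(fraction, MIN_HOUSEHOLDS / full_households)
    fraction = min(fraction, 1.0)
    return {
        "households": round(full_households * fraction),
        "persons": round(full_persons * fraction),
        "density": full_density * fraction,
    }


# Hypothetical full-sample counts for a 5 percent sample.
half = customize(5_000_000, 14_000_000, 5.0, density=2.5)
fixed = customize(5_000_000, 14_000_000, 5.0, households=50_000)
floor = customize(5_000_000, 14_000_000, 5.0, households=1_000)
capped = customize(5_000_000, 14_000_000, 5.0, density=10.0)
print(half, fixed, floor, capped, sep="\n")
```

Halving the density halves both record counts, a request below the minimum is raised to 10,000 households, and a request above the full sample returns the full sample.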
The sampling unit for the sample size tool is the household. The system will draw a systematic sample of every Nth household -- after a random start -- at the proper density to produce the number of cases you requested. The weights in your extract are automatically adjusted to reflect the new sample densities, so your smaller extract will still yield approximately the same results as a full extract. Your extract must include the appropriate weight variable (PERWT and possibly HHWT) to calculate the new weights in your statistical package.
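The logic of a systematic draw with reweighting can be sketched as below. This is an illustration, not the extract system's implementation: it assumes a whole-number sampling interval, uses invented households with a constant weight, and the function name is hypothetical. Each retained household's weight is multiplied by the interval so that weighted totals stay comparable to the full sample.

```python
import random


def systematic_sample(households, interval, rng=None):
    """Take every Nth household after a random start and scale its weight."""
    if rng is None:
        rng = random.Random()
    start = rng.randrange(interval)  # random start in 0 .. interval - 1
    chosen = households[start::interval]
    # Each retained household now stands in for `interval` times as many units.
    return [{**hh, "HHWT": hh["HHWT"] * interval} for hh in chosen]


# Invented full sample: 100 households, each with a household weight of 20.
full = [{"SERIAL": serial, "HHWT": 20} for serial in range(1, 101)]
subsample = systematic_sample(full, interval=4, rng=random.Random(0))

full_total = sum(hh["HHWT"] for hh in full)
sub_total = sum(hh["HHWT"] for hh in subsample)
print(len(subsample), full_total, sub_total)
```

With constant weights the totals match exactly; with varying weights, as in real samples, the smaller extract reproduces the full-sample total only approximately.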
Due to random sampling error, the accuracy of estimates from your smaller extract relative to those from the full dataset is affected by the size of the sample you request. While small-sample estimates will likely be within 1 or 2 percent of the full-sample estimate even at the minimum sample size (10,000 households), estimates for small geographic areas or small population subgroups can be farther off. Use this feature with caution.
Note also that vacant households decrease the number of households in the customized sample. Vacant units exist only in certain years; those years and their vacancy rates are listed in the next paragraph. These vacancy rates also affect the number of person records calculated in a customized sample at the smallest level; the number of persons will be decreased by the vacancy rate.
You should know that the extract system is designed to produce the requested number of cases in a hierarchical extract, where both household and person records are included separately. If you requested a rectangularized extract (which is the default), you will likely receive fewer cases. This is because many samples contain vacant housing units that are selected by the sampling program but are dropped from rectangularized samples because they contain no person records. The higher the proportion of housing unit vacancies, the greater the extent to which the number of households you receive in a rectangular extract will be lower than specified on the website. Users may wish to adjust their requested household sizes accordingly. The following years have vacant housing units: 1860 (2 percent vacant), 1870 (1.5 percent), 1970 (7 percent for US samples, 10 percent for Puerto Rico samples), 1980 (8 percent for US, 12 percent for Puerto Rico), 1990 (12 percent for US, 10 percent for Puerto Rico), 2000 (8 percent for US, 10 percent for Puerto Rico), the 2000-onward ACS (6-7 percent), and the 2005-onward PRCS (10-14 percent). All other samples have no vacant units.
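The adjustment suggested above is simple arithmetic, sketched here with hypothetical function names. It assumes the share of vacant units in a customized sample roughly equals the published vacancy rate, which will hold only approximately in any particular draw. The 12 percent rate used in the example is the 1990 US figure given above.

```python
import math


def expected_occupied(requested_households, vacancy_rate):
    """Approximate households left in a rectangular extract after vacancies drop out."""
    return round(requested_households * (1 - vacancy_rate))


def request_needed(target_occupied, vacancy_rate):
    """Approximate request size needed to end up with a target number of occupied households."""
    return math.ceil(target_occupied / (1 - vacancy_rate))


# 1990 US samples: 12 percent of housing units are vacant.
received = expected_occupied(50_000, 0.12)
to_request = request_needed(50_000, 0.12)
print(received, to_request)
```

A request for 50,000 households from a 1990 US sample would therefore yield roughly 44,000 households in a rectangular file, and a request of roughly 57,000 would be needed to keep about 50,000.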
Extract option: Select cases [top]
The "select cases" feature allows users to limit their dataset to contain only records with specific values for selected variables, such as persons age 65 and older. Multiple variables can be used in combination during case selection. Selections for multiple variables are cumulative: they are implicitly connected by a logical "AND" for processing purposes, so a record must satisfy every selection to be included. You can only perform case selection on either the general or the detailed version of a variable, not both.
Simply extracting selected cases can be too crude, however, because you may need the persons who co-resided with your selected population. Accordingly, the case selection function also lets you choose to include everyone living in a household with a person with the selected characteristics.
Users should be careful with the case selection feature. It is possible to select a specific variable category (e.g., polygamous marriage) that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.
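The behavior described in this section can be sketched as follows. The function name, the records, and the codes are invented for illustration, and this is not the extract system's own code. Conditions on different variables are combined with a logical "AND", and an option keeps everyone who lives in a household with a selected person.

```python
def select_cases(records, conditions, include_household=False):
    """Keep records meeting every condition, optionally with their co-residents."""

    def matches(rec):
        # Every variable's value must fall in its allowed set (logical AND).
        return all(rec[var] in allowed for var, allowed in conditions.items())

    def household_key(rec):
        return (rec["YEAR"], rec["DATANUM"], rec["SERIAL"])

    hits = [rec for rec in records if matches(rec)]
    if not include_household:
        return hits
    selected_households = {household_key(rec) for rec in hits}
    return [rec for rec in records if household_key(rec) in selected_households]


# Invented person records.
people = [
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1, "AGE": 70, "SEX": 2},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 2, "AGE": 40, "SEX": 1},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 2, "PERNUM": 1, "AGE": 66, "SEX": 1},
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1, "AGE": 30, "SEX": 2},
]

# Persons age 65 and older AND coded 2 on SEX.
conditions = {"AGE": range(65, 200), "SEX": {2}}
individuals = select_cases(people, conditions)
with_coresidents = select_cases(people, conditions, include_household=True)
years_remaining = {rec["YEAR"] for rec in with_coresidents}
print(len(individuals), len(with_coresidents), years_remaining)
```

Checking which years remain after selection, as in the last step, is one way to notice that a sample has been excluded entirely.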
Extract option: Attach characteristics [top]
The data extract system can attach a characteristic of a person's mother, father, or spouse as a new variable on the person's record. It can also attach the characteristics of the household head. For example, using the variable "Occupation," it can make a new variable for "Occupation of mother." All persons in the extract who reside in a household with their mother would receive a value for this new variable. Persons without a mother present in the household would receive a missing value. The extract system automatically generates a unique name for the new variable.
The attached-characteristics feature uses the constructed IPUMS family interrelationship "pointer variables" that identify co-resident mothers, fathers, and spouses for each person. The pointer variables identify social mothers and fathers, not strictly biological parents.
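A minimal sketch of the idea follows, assuming a mother pointer named MOMLOC that holds the PERNUM of a co-resident mother and 0 when no mother is present. The records, the function name, the OCC codes, and the generated variable name are all illustrative; the name the extract system actually assigns may differ.

```python
def attach_mother(records, var, new_name=None):
    """Copy a variable from each person's co-resident mother onto the person's record."""
    if new_name is None:
        new_name = f"{var}_MOM"  # illustrative name only
    by_person = {
        (rec["YEAR"], rec["DATANUM"], rec["SERIAL"], rec["PERNUM"]): rec
        for rec in records
    }
    attached = []
    for rec in records:
        mother = None
        if rec["MOMLOC"]:  # 0 means no mother identified in the household
            key = (rec["YEAR"], rec["DATANUM"], rec["SERIAL"], rec["MOMLOC"])
            mother = by_person.get(key)
        # None stands in for the missing value given to persons without a mother present.
        attached.append({**rec, new_name: mother[var] if mother else None})
    return attached


# Invented household: person 1 is the mother of person 2.
family = [
    {"YEAR": 1900, "DATANUM": 1, "SERIAL": 7, "PERNUM": 1, "MOMLOC": 0, "OCC": 120},
    {"YEAR": 1900, "DATANUM": 1, "SERIAL": 7, "PERNUM": 2, "MOMLOC": 1, "OCC": 0},
    {"YEAR": 1900, "DATANUM": 1, "SERIAL": 8, "PERNUM": 1, "MOMLOC": 0, "OCC": 300},
]

result = attach_mother(family, "OCC")
print([rec["OCC_MOM"] for rec in result])
```

The lookup is restricted to the same household, so a pointer value can never match a person in another household who happens to share the same PERNUM.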
Extract option: Select data quality flags [top]
You can include data quality flags for certain variables by checking the boxes next to the variable name. Checking the box at the top of the "Include flags" column will select all data quality flags available for the variables and samples you have chosen.
Extract option: Describe your extract [top]
You can describe your extract for future reference. Our system will display the description on the page
where you download your data extract.