IPUMS Extracts#

IPUMS-py can be used to read extracts made via the IPUMS web interface into Python. This page discusses how to request an IPUMS extract via the API using IPUMS-py.

Extract Definition#

An extract is defined by:

  1. A data collection name

  2. A list of IPUMS sample IDs from that collection

  3. A list of IPUMS variable names from that collection
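As a rough mental model, the three parts listed above can be pictured as a small record. The sketch below is illustrative only: ExtractDefinition and is_complete() are hypothetical names, not part of IPUMS-py, and the sample and variable values are simply the ones used elsewhere on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractDefinition:
    """Hypothetical stand-in for the three parts of an extract definition."""

    collection: str  # data collection name, e.g. "usa"
    samples: List[str] = field(default_factory=list)  # IPUMS sample IDs
    variables: List[str] = field(default_factory=list)  # IPUMS variable names

    def is_complete(self) -> bool:
        # All three parts must be present before a request makes sense.
        return bool(self.collection and self.samples and self.variables)


definition = ExtractDefinition("usa", ["us2012b"], ["AGE", "SEX"])
print(definition.is_complete())
```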

IPUMS metadata is not currently accessible via API. Sample IDs and IPUMS variable names can be browsed on each data collection’s website. See the table below for data collection abbreviations and links to sample IDs and variable browsing. Note that not all IPUMS data collections are currently available via API; the table will be filled in as new IPUMS data collections become accessible.

IPUMS data collections metadata resources#

IPUMS data collection    collection IDs    sample IDs        variable names
IPUMS USA                usa               usa samples       usa variables
IPUMS CPS                cps               cps samples       cps variables
IPUMS International      ipumsi            ipumsi samples    ipumsi variables

Extract Objects#

Each IPUMS data collection that is accessible via API has its own extract class. Using this class to create your extract object obviates the need to specify a data collection.

For example:

extract = UsaExtract(
    ["us2012b"],
    ["AGE", "SEX"],
)

instantiates a UsaExtract object for the IPUMS USA data collection that includes the us2012b (2012 PRCS) sample, and the variables AGE and SEX.

IPUMS extracts can be requested as rectangular or hierarchical files. The data_structure argument defaults to {"rectangular": {"on": "P"}} to request a rectangular, person-level extract. The code snippet below requests a hierarchical USA extract.

extract = UsaExtract(
    ["us2012b"],
    ["AGE", "SEX"],
    data_structure={"hierarchical": {}}
)
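The two data_structure dictionaries shown above are easy to mistype, so a script might build them through a small helper. This is only a sketch: build_data_structure() is a hypothetical function, not something IPUMS-py provides, and it covers only the two shapes that appear on this page.

```python
def build_data_structure(kind: str = "rectangular", on: str = "P") -> dict:
    """Return a data_structure dictionary in one of the two shapes shown above."""
    if kind == "rectangular":
        # Rectangular on "P" is the person-level default described in the text.
        return {"rectangular": {"on": on}}
    if kind == "hierarchical":
        return {"hierarchical": {}}
    raise ValueError(f"unknown data structure: {kind!r}")


print(build_data_structure())
print(build_data_structure("hierarchical"))
```

The returned dictionary could then be passed as the data_structure argument when the extract object is created.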

Users also have the option to specify a data format and an extract description when creating an extract object.

extract = UsaExtract(
    ["us2012b"],
    ["AGE", "SEX"],
    data_format="csv",
    description="My first IPUMS USA extract!"
)

Once an extract object has been created, the extract must be submitted to the API.

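from pathlib import Path

from ipumspy import IpumsApiClient, UsaExtract

# replace these placeholders with your own API key and download directory
IPUMS_API_KEY = "your_api_key"
DOWNLOAD_DIR = Path("your_download_dir")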

ipums = IpumsApiClient(IPUMS_API_KEY)

# define your extract
extract = UsaExtract(
    ["us2012b"],
    ["AGE", "SEX"],
)

# submit your extract
ipums.submit_extract(extract)

Once an extract has been submitted, an extract ID number will be assigned to it.

extract.extract_id

returns the extract id number assigned by the IPUMS extract system. In the case of your first extract, this code will return

1

You can use this extract ID number along with the data collection name to check on or download your extract later if you lose track of the original extract object.

Extract status#

After your extract has been submitted, you can check its status using

ipums.extract_status(extract)

returns:

'started'
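Because a freshly submitted extract reports 'started', a script typically has to check the status repeatedly before the files are ready. The sketch below shows one way to structure such a polling loop. wait_until_completed() is a hypothetical helper, the status check is passed in as a callable so the example runs without an API key, and only the 'started' and 'completed' values shown on this page are assumed.

```python
import time
from typing import Callable


def wait_until_completed(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call get_status() until it returns "completed" or max_polls is reached."""
    for _ in range(max_polls):
        if get_status() == "completed":
            return True
        sleep(poll_seconds)  # pause between checks
    return False


# Demonstration with canned statuses instead of a live API call.
statuses = iter(["started", "started", "completed"])
finished = wait_until_completed(lambda: next(statuses), sleep=lambda seconds: None)
print(finished)
```

In real use, the callable might be something like lambda: ipums.extract_status(extract).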

While IPUMS retains all of a user’s extract definitions, after a certain period, the extract data and syntax files are purged from the IPUMS cache - these extracts are said to be “expired”. Importantly, if an extract’s data and syntax files have been removed, the extract is still considered to have been completed, and extract_status() will return “completed.”

# extract number 1 has expired
ipums.extract_status(collection="usa", extract="1")

returns:

'completed'

If an extract has expired:

ipums.extract_is_expired(collection="usa", extract="1")

returns:

True

For extracts that have expired, the data collection name and extract ID number can be used to re-create and re-submit the old extract. Note that re-creating and re-submitting an expired extract results in a new extract with its own unique ID number!

# create a UsaExtract object from the expired extract definition
renewed_extract = ipums.get_extract_by_id(collection="usa", extract_id=1)

# submit the renewed extract to re-generate the data and syntax files
resubmitted_extract = ipums.submit_extract(renewed_extract)

resubmitted_extract.extract_id

returns:

2
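Taken together, the status and expiry checks in this section imply a simple decision: keep waiting, download the files, or re-submit the definition. The sketch below captures that logic as a pure function. next_step() and its return labels are hypothetical, and the inputs are assumed to be the values returned by extract_status() and extract_is_expired().

```python
def next_step(status: str, is_expired: bool) -> str:
    """Decide what to do with an extract, given its status and expiry flag."""
    if status != "completed":
        # The extract system has not finished building the files yet.
        return "wait"
    if is_expired:
        # Completed but purged: re-create the definition and submit it again.
        return "resubmit"
    return "download"


print(next_step("started", False))
print(next_step("completed", True))
print(next_step("completed", False))
```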

Extract Features#

IPUMS Extract features can be added or updated before an extract request is submitted. This section demonstrates adding features to the following IPUMS CPS extract.

extract = CpsExtract(
    ["cps2022_03s"],
    ["AGE", "SEX", "RACE"],
)

Attach Characteristics#

IPUMS allows users to create variables that reflect the characteristics of other household members. The example below uses the attach_characteristics() method to attach the spouse’s SEX value, creating a new variable called SEX_SP in the extract that will contain the sex of a person’s spouse if they have one and be 0 otherwise. The attach_characteristics() method takes the name of the variable to attach and a list of the household members whose values the new variables will include. Valid household members include “spouse”, “mother”, “father”, and “head”.

extract.attach_characteristics("SEX", ["spouse"])

The following would add variables for the RACE value of both parents:

extract.attach_characteristics("RACE", ["mother", "father"])
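To make the effect of an attached characteristic concrete, the toy example below mimics what a SEX_SP column would look like for one household. This is only an illustration: the attachment itself is performed by the IPUMS extract system, attach_spouse_value() is a hypothetical function, and the record layout (PERNUM as the person number, SPLOC pointing at the spouse's person number, 0 for no spouse) is an assumption made for the demonstration.

```python
def attach_spouse_value(household, variable):
    """Add <variable>_SP to each person record in a single household."""
    by_pernum = {person["PERNUM"]: person for person in household}
    new_name = f"{variable}_SP"
    for person in household:
        spouse = by_pernum.get(person["SPLOC"])
        # Persons without a spouse in the household get 0, as described above.
        person[new_name] = spouse[variable] if spouse else 0
    return household


household = [
    {"PERNUM": 1, "SPLOC": 2, "SEX": 1},
    {"PERNUM": 2, "SPLOC": 1, "SEX": 2},
    {"PERNUM": 3, "SPLOC": 0, "SEX": 2},
]
result = attach_spouse_value(household, "SEX")
print([person["SEX_SP"] for person in result])
```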

Select Cases#

IPUMS allows users to limit their extract based on values of included variables. The code below uses the select_cases() method to select only the female records in the example IPUMS CPS extract. This method takes a variable name and a list of the values of that variable for which records should be included in the extract. Note that the variable must be included in the IPUMS extract object in order to use this feature; also note that this feature is only available for categorical variables.

extract.select_cases("SEX", ["2"])

The select_cases() method defaults to using “general” codes to select cases. Some variables also have detailed codes that can be used to select cases. Consider the following example extract of the 2021 ACS data from IPUMS USA:

extract = UsaExtract(
    ["us2021a"],
    ["AGE", "SEX", "RACE"]
)

In IPUMS USA, the RACE variable has both general and detailed codes. A user interested in respondents who identify themselves with two major race groups can use general codes:

extract.select_cases("RACE", ["8"])

A user interested in respondents who identify as both White and Asian can use detailed case selection to include only those who chose White and one of the available Asian categories. To do this, in addition to specifying the correct detailed codes, set the general flag to False:

extract.select_cases("RACE",
                     ["810", "811", "812", "813", "814", "815", "816", "818"],
                     general=False)

By default, case selection includes only individuals with the specified values for the specified variables. In the previous example, only persons who identified as both White and Asian are included in the extract. To make an extract that contains individuals in households that include an individual who identifies as both White and Asian, set the case_select_who flag to "households" when instantiating the extract object. The code snippet below creates such an extract. Note that whether to select individuals or households must be specified at the extract level, while what values to select on and whether these values are general or detailed codes is specified at the variable level.

extract = UsaExtract(
    ["us2021a"],
    ["AGE", "SEX", "RACE"],
    case_select_who = "households"
)
extract.select_cases("RACE",
                     ["810", "811", "812", "813", "814", "815", "816", "818"],
                     general=False)
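The difference between selecting individuals and selecting households can be seen on a handful of toy records. The sketch below is not how IPUMS implements case selection; toy_select_cases() is a hypothetical function, the records are invented, and SERIAL is assumed to identify the household each person belongs to.

```python
def toy_select_cases(records, variable, values, who="individuals"):
    """Filter toy person records by matching individuals or whole households."""
    matches = [rec for rec in records if rec[variable] in values]
    if who == "individuals":
        return matches
    if who == "households":
        # Keep every member of any household that contains a matching person.
        keep = {rec["SERIAL"] for rec in matches}
        return [rec for rec in records if rec["SERIAL"] in keep]
    raise ValueError(f"unknown selection unit: {who!r}")


records = [
    {"SERIAL": 1, "PERNUM": 1, "RACE": "810"},
    {"SERIAL": 1, "PERNUM": 2, "RACE": "100"},
    {"SERIAL": 2, "PERNUM": 1, "RACE": "100"},
]
people = toy_select_cases(records, "RACE", ["810"])
households = toy_select_cases(records, "RACE", ["810"], who="households")
print(len(people), len(households))
```

With individual selection only the matching person is kept; with household selection the second member of household 1 is kept as well.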

Add Data Quality Flags#

Data quality flags can be added to an extract on a per-variable basis or for the entire extract. The CPS extract example above could be re-defined as follows in order to add all available data quality flags:

extract = CpsExtract(
    ["cps2022_03s"],
    ["AGE", "SEX", "RACE"],
    data_quality_flags=True
)

This extract specification adds a data quality flag to the extract for every variable in the variable list that has one in the sample(s) in the samples list.

Data quality flags can also be selected for specific variables using the add_data_quality_flags() method.

# add the data quality flag for AGE to the extract
extract.add_data_quality_flags("AGE")

# note that this method will also accept a list!
extract.add_data_quality_flags(["AGE", "SEX"])

Using Variable Objects to Include Extract Features#

It is also possible to define all variable-level extract features when the IPUMS extract object is first defined using ipumspy.api.extract.Variable objects. The example below defines an IPUMS CPS extract that includes a variable for the age of the spouse (attached_characteristics), limits the sample to women (case_selections), and includes the data quality flag for RACE (data_quality_flags).

from ipumspy import CpsExtract
from ipumspy.api.extract import Variable

fancy_extract = CpsExtract(
    ["cps2022_03s"],
    [
        Variable(name="AGE",
                 attached_characteristics=["spouse"]),
        Variable(name="SEX",
                 case_selections={"general": ["2"]}),
        Variable(name="RACE",
                 data_quality_flags=True)
     ]
)

Unsupported Features#

Not all features available through the IPUMS extract web UI are currently supported for extracts made via API. For a list of currently unsupported features, see the developer documentation. This list will be updated as more features become available.

Extract Histories#

ipumspy offers several ways to peruse your extract history for a given IPUMS data collection.

get_previous_extracts() retrieves your 10 most recent extracts for a given collection by default. Use the limit argument to request a different number of recent extracts.

from ipumspy import CpsExtract, IpumsApiClient

ipums = IpumsApiClient("YOUR_API_KEY")

# get my 10 most-recent USA extracts
recent_extracts = ipums.get_previous_extracts("usa")

# get my 20 most-recent CPS extracts
more_recent_extracts = ipums.get_previous_extracts("cps", limit=20)

The get_extract_history() generator makes it easy to filter your extract history to pull out extracts with certain variables, samples, features, file formats, etc. By default, this generator returns pages of extract definitions of the maximum possible size, 2500. The page size can be set to a lower number using the page_size argument.

# make a list of all of my extracts from IPUMS CPS that include the variable STATEFIP
extracts_with_state = []
# get pages with 100 CPS extracts per page
for page in ipums.get_extract_history("cps", page_size=100):
    for ext in page["data"]:
        extract_obj = CpsExtract(**ext["extractDefinition"])
        if "STATEFIP" in [var.name for var in extract_obj.variables]:
            extracts_with_state.append(extract_obj)
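The same page-by-page filtering pattern can be tried offline on hand-built pages before pointing it at a real account. The sketch below is an approximation: extracts_with_variable() is a hypothetical helper, and the page layout (a "data" list whose entries carry a "number" and an "extractDefinition" with "variables" keyed by name) is an assumption for the demonstration rather than a documented response format.

```python
def extracts_with_variable(pages, variable_name):
    """Yield every extract entry whose definition includes variable_name."""
    for page in pages:
        for ext in page["data"]:
            variables = ext["extractDefinition"].get("variables", {})
            if variable_name in variables:
                yield ext


# Two invented pages standing in for the output of a history generator.
fake_pages = [
    {"data": [
        {"number": 3, "extractDefinition": {"variables": {"AGE": {}, "STATEFIP": {}}}},
        {"number": 2, "extractDefinition": {"variables": {"AGE": {}}}},
    ]},
    {"data": [
        {"number": 1, "extractDefinition": {"variables": {"STATEFIP": {}}}},
    ]},
]
numbers = [ext["number"] for ext in extracts_with_variable(fake_pages, "STATEFIP")]
print(numbers)
```

Writing the filter as a generator keeps memory use flat even when the history spans many pages.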