Aggregate Data Extracts#

IPUMS aggregate data collections distribute aggregated statistics for a set of geographic units. IPUMS contains two aggregate data collections, both of which are supported by the IPUMS API:

IPUMS NHGIS provides three types of data sources:

Datasets/data tables
Time series tables
Shapefiles

IPUMS IHGIS provides one type of data source:

Datasets/data tables

Note

IHGIS does provide boundary shapefiles, but these are not provided via the IPUMS API. Shapefiles from IHGIS can be downloaded directly from the IHGIS website.

Extract Objects#

Construct an extract for an IPUMS aggregate data collection using the AggregateDataExtract class. An AggregateDataExtract must contain an IPUMS collection ID and at least one data source. We also recommend providing an extract description to make it easier to identify and retrieve your extract in the future.

For example:

from ipumspy import AggregateDataExtract, NhgisDataset

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS extract example",
   datasets=[
      NhgisDataset(name="1990_STF1", data_tables=["NP1", "NP2"], geog_levels=["county"])
   ]
)

This instantiates an AggregateDataExtract object for the IPUMS NHGIS data collection that includes a request for county-level data from tables NP1 (total population) and NP2 (total families) of the 1990 STF 1 decennial census file.

After instantiation, an AggregateDataExtract object can be submitted to the API for processing.
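
The submit, wait, and download steps can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: `run_extract` is a hypothetical name, and it assumes `ipums` is an authenticated `IpumsApiClient` exposing `submit_extract()`, `wait_for_extract()`, and `download_extract()` with a `download_dir` keyword.

```python
def run_extract(ipums, extract, download_dir):
    """Submit an extract, wait for it to finish, then download its files.

    `ipums` is assumed to be an authenticated IpumsApiClient and `extract`
    an AggregateDataExtract like the one defined above.
    """
    # Register the extract request with the IPUMS API
    ipums.submit_extract(extract)
    # Block until the API reports that processing is complete
    ipums.wait_for_extract(extract)
    # Save the resulting files into download_dir
    ipums.download_extract(extract, download_dir=download_dir)
    return extract


# Example usage (requires a valid API key and network access):
# run_extract(ipums, extract, download_dir="data")
```

Passing the client in as an argument keeps the helper independent of how the API key is stored.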

Note

The IPUMS API provides a set of metadata endpoints for aggregate data collections that allow you to browse available data sources and identify their associated API codes (see below for examples).

Datasets + Data Tables#

An IPUMS dataset contains a collection of data tables that each correspond to a particular tabulated summary statistic. A dataset is distinguished by the years, geographic levels, and topics that it covers. For instance, 2021 1-year data from the American Community Survey (ACS) is encapsulated in a single dataset. In other cases, a single census product will be split into multiple datasets, typically based on the lowest-level geography for which a set of tables is available. See the NHGIS and IHGIS documentation for more details.

To request data contained in an IPUMS dataset, you need to specify the name of the dataset, name of the data table(s) to request from that dataset, and the geographic level at which those tables should be aggregated.

NHGIS Datasets#

For NHGIS extracts, use the NhgisDataset class to specify these parameters:

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract",
   datasets=[
      NhgisDataset(name="2000_SF1a", data_tables=["NP001A", "NP031A"], geog_levels=["state"])
   ],
)

Some datasets span multiple years and require a selection of years:

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract",
   datasets=[
      NhgisDataset(
         name="1988_1997_CBPa",
         data_tables=["NT004"],
         geog_levels=["county"],
         years=[1988, 1989, 1990],
      )
   ],
)

Tip

To select all years in a dataset, use years=["*"].

You can also optionally request specific breakdown values for a dataset with the breakdown_values keyword argument:

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract",
   datasets=[
      NhgisDataset(
         name="2000_SF1a",
         data_tables=["NP001A", "NP031A"],
         geog_levels=["state"],
         breakdown_values=["bs21.ge01", "bs21.ge43"],  # Urban + Rural breakdowns
      )
   ],
)

By default, the first available breakdown (typically, the total count) will be selected. As a result, when you retrieve a previously submitted extract from the IPUMS API, its definition may include a breakdown value code even though you did not explicitly request one when submitting the extract.

For datasets with multiple breakdowns or data types (e.g., the American Community Survey contains both estimates and margins of error), you can request that the data for each be provided in separate files or together in a single file using the breakdown_and_data_type_layout argument.

IHGIS Datasets#

For IHGIS, each dataset must be associated with a selection of data tables and tabulation geographies (the level of geographic aggregation for the requested data). These are the only available parameters for IHGIS dataset requests.

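# Assumes IhgisDataset is exported at the package top level, like NhgisDataset
from ipumspy import AggregateDataExtract, IhgisDataset

extract = AggregateDataExtract(
   collection="ihgis",
   description="An IHGIS example extract",
   datasets=[
      IhgisDataset(
         "KZ2009pop",
         data_tables=["KZ2009pop.AAA"],
         tabulation_geographies=["KZ2009pop.g0"]
      )
   ]
)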

Caution

IHGIS extract requests only accept input for description and datasets. Other AggregateDataExtract arguments do not apply to IHGIS extracts and will be omitted from the extract request if included.

Dataset + Data Table Metadata#

You can obtain a listing of datasets and data tables as well as detailed information about individual datasets and data tables via the IPUMS Metadata API.

Use the NhgisDatasetMetadata and IhgisDatasetMetadata data classes to browse the available specification options for a particular dataset and identify the codes to use when requesting data from the API:

import os

from ipumspy import IpumsApiClient, NhgisDatasetMetadata

ipums = IpumsApiClient(os.environ.get("IPUMS_API_KEY"))

ds = ipums.get_metadata(NhgisDatasetMetadata("2000_SF1a"))

The returned object will contain the metadata for the requested dataset. For example:

# Description of the dataset
ds.description

# Dictionary of data table codes for this dataset
ds.data_tables

# etc...

You can also request metadata for individual data tables using the same workflow with the NhgisDataTableMetadata and IhgisDataTableMetadata data classes.

Time Series Tables#

IPUMS NHGIS also provides time series tables: longitudinal data sources that link comparable statistics from multiple U.S. censuses in a single package. A table comprises one or more related time series, each of which describes a single summary statistic measured at multiple times for a given geographic level.

Use the TimeSeriesTable class to add time series tables to your NHGIS extract request.

Time series tables are already associated with a specific summary statistic, so they don’t require an additional selection of data tables as is required for NHGIS datasets. However, you will need to specify the geographic level for the data:

from ipumspy import AggregateDataExtract, TimeSeriesTable

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract: time series tables",
   time_series_tables=[TimeSeriesTable("CW3", geog_levels=["county", "state"])],
)

By default, a time series table request will provide data for all years available for that time series table. You can select a subset of available years with the years argument:

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract: time series tables",
   time_series_tables=[
      TimeSeriesTable("CW3", geog_levels=["county", "state"], years=[1990, 2000])
   ],
)

For extract requests that contain time series tables, you can indicate the desired layout of the time series data with the tst_layout argument. Time points can be arranged in columns, arranged in rows, or split into separate files (by default, time is arranged across columns):

extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract: time series tables",
   time_series_tables=[
      TimeSeriesTable("CW3", geog_levels=["county", "state"], years=[1990, 2000])
   ],
   tst_layout="time_by_row_layout",
)
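
The three layout options can be captured in a small lookup. This is a sketch under stated assumptions: only "time_by_row_layout" appears in the example above; the other two codes are assumed from the NHGIS API documentation, and `TST_LAYOUTS` and `tst_layout_code` are hypothetical names used for illustration.

```python
# Only "time_by_row_layout" is shown in the text above; the column and file
# codes are assumptions based on the NHGIS API documentation.
TST_LAYOUTS = {
    "columns": "time_by_column_layout",  # default: one column per time point
    "rows": "time_by_row_layout",        # one row per time point
    "files": "time_by_file_layout",      # one file per time point
}


def tst_layout_code(arrangement="columns"):
    """Translate a plain-language arrangement into a tst_layout code."""
    try:
        return TST_LAYOUTS[arrangement]
    except KeyError:
        raise ValueError(
            f"unknown arrangement {arrangement!r}; "
            f"expected one of {sorted(TST_LAYOUTS)}"
        ) from None


print(tst_layout_code("rows"))  # time_by_row_layout
```

The returned string can then be passed as the tst_layout argument of AggregateDataExtract.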

Time Series Table Metadata#

As with datasets and data tables, you can request metadata about the available specification options for a specific time series table using the TimeSeriesTableMetadata class with get_metadata().

Geographic Extent Selection#

When working with small geographies, nationwide data can be computationally intensive to process. To avoid this problem, you can request data for a specific geographic area using the geographic_extents argument. This argument is only available for NHGIS extracts.

The following extract requests ACS 5-year sex-by-age counts at the census block group level, but only includes block groups that fall within Alabama and Arkansas (identified by their FIPS codes with a trailing 0):

extract = AggregateDataExtract(
   collection="nhgis",
   description="Extent selection example",
   datasets=[
      NhgisDataset(name="2018_2022_ACS5a", data_tables=["B01001"], geog_levels=["blck_grp"]),
      NhgisDataset(name="2017_2021_ACS5a", data_tables=["B01001"], geog_levels=["blck_grp"])
   ],
   geographic_extents=["010", "050"]
)

Tip

You can see available extent selection API codes, if any, in the geographic_instances attribute of a submitted NhgisDatasetMetadata or TimeSeriesTableMetadata object. The geog_levels attribute indicates whether a given geographic level supports extent selection.

Note that the selected extents are applied to all datasets and time series tables in an extract. It is not possible to request different extents for different data sources in a single extract.
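
State extent codes like those in the example above can be derived from state FIPS codes. The helper below is a minimal sketch that follows the "FIPS code with a trailing 0" pattern described in this section; `state_extent_code` is a hypothetical name, and you should still confirm the resulting codes against the dataset's metadata before submitting.

```python
def state_extent_code(state_fips):
    """Build a state extent code: the 2-digit state FIPS code plus a trailing 0."""
    # int() lets the helper accept either "01" or 1
    return f"{int(state_fips):02d}0"


# Alabama (FIPS 01) and Arkansas (FIPS 05)
geographic_extents = [state_extent_code(fips) for fips in ("01", "05")]
print(geographic_extents)  # ['010', '050']
```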

Shapefiles#

IPUMS shapefiles contain geographic data for a given geographic level and year. Typically, these files are composed of polygon geometries containing the boundaries of census reporting areas.

Because there are no additional selection parameters for shapefiles, you can include them in your request simply by specifying their names:

AggregateDataExtract(
   collection="nhgis",
   shapefiles=["us_county_2021_tl2021", "us_county_2020_tl2020"]
)

As mentioned above, IHGIS shapefiles must be downloaded directly from the IHGIS website.

Shapefile Metadata#

You can access a listing of shapefile API codes and descriptions via the IPUMS Metadata API using get_metadata_catalog() with metadata_type="shapefiles". The IPUMS API does not provide detailed metadata for individual shapefiles.
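
Searching the shapefile listing might look like the sketch below. It rests on assumptions that are not stated in this section: that get_metadata_catalog() yields pages of results, that each page stores its records under a "data" key, and that each record has a "name" field. `find_shapefiles` is a hypothetical helper.

```python
def find_shapefiles(ipums, keyword):
    """Collect shapefile records whose name contains `keyword`.

    `ipums` is assumed to be an authenticated IpumsApiClient.
    """
    matches = []
    # Assumption: the catalog is returned one page at a time
    for page in ipums.get_metadata_catalog("nhgis", metadata_type="shapefiles"):
        # Assumption: each page lists its records under "data"
        for shapefile in page["data"]:
            if keyword in shapefile["name"]:
                matches.append(shapefile)
    return matches


# Example usage (requires a valid API key and network access):
# county_files = find_shapefiles(ipums, "us_county")
```

The matching names can be passed directly to the shapefiles argument of AggregateDataExtract.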

Multiple Data Sources#

You can request any combination of datasets, time series tables, and shapefiles in a single extract. For instance, to request spatial boundary data to go along with the tabular data requested in a set of datasets:

# Total state-level population from 2000 and 2010 decennial census
extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract",
   datasets=[
      NhgisDataset(name="2000_SF1a", data_tables=["NP001A"], geog_levels=["state"]),
      NhgisDataset(name="2010_SF1a", data_tables=["P1"], geog_levels=["state"])
   ],
   shapefiles=["us_state_2000_tl2010", "us_state_2010_tl2010"]
)

In some cases, data table codes are consistent across datasets. This is often the case for the American Community Survey (ACS) datasets. This makes it easy to build an extract request for a specific data table for several ACS years at once using list comprehensions. For instance:

acs1_names = ["2017_ACS1", "2018_ACS1", "2019_ACS1"]
acs1_specs = [
   NhgisDataset(name, data_tables=["B01001"], geog_levels=["state"]) for name in acs1_names
]

# Total state-level population from 2017-2019 ACS 1-year estimates
extract = AggregateDataExtract(
   collection="nhgis",
   description="An NHGIS example extract",
   datasets=acs1_specs,
)

Data Format#

By default, NHGIS extracts are provided in CSV format with only a single header row. If you like, you can request that your CSV data include a second header row containing a description of each column’s contents by setting data_format="csv_header".
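
A file requested with data_format="csv_header" has two header rows, which most CSV readers will not handle automatically. The sketch below uses only the Python standard library to separate the column names from their descriptions; `read_csv_with_descriptions` is a hypothetical helper, and the UTF-8 default encoding is an assumption you may need to change for your files.

```python
import csv


def read_csv_with_descriptions(path, encoding="utf-8"):
    """Read a CSV whose second row describes each column.

    Returns (descriptions, rows): a dict mapping column name to description,
    and a list of dicts with one entry per data row.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        names = next(reader)         # first header row: column names
        descriptions = next(reader)  # second header row: column descriptions
        rows = [dict(zip(names, row)) for row in reader]
    return dict(zip(names, descriptions)), rows


# Example usage with a hypothetical file path:
# descriptions, rows = read_csv_with_descriptions("nhgis0001_ds146_2000_state.csv")
```

All values are returned as strings; convert numeric columns as needed.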

While you can also request your data in fixed-width format, NHGIS is likely to phase out support for this format in the future. We therefore suggest that you request data in CSV format. Also note that, unlike for microdata projects, NHGIS does not provide DDI codebook files (in XML format), which allow ipumspy to parse microdata fixed-width files. Thus, loading an NHGIS fixed-width file will require manual work to parse the file correctly.

Supplemental Data#

IPUMS NHGIS also provides some data products via direct download, without the need to create an extract request. These sources are available via the IPUMS API. However, since you access these files directly, you must know a file’s URL before you can download it.

Many NHGIS supplemental data files can be found under the “Supplemental Data” heading on the left side of the NHGIS homepage. See the IPUMS developer documentation page for all supported supplemental data endpoints and advice on how to convert file URLs found on the website into acceptable API request URLs.

Once you’ve identified a file’s location, you can use the ipumspy get() method to download it. For instance, to download a state-level NHGIS crosswalk file, we could use the following:

import os

from ipumspy import IpumsApiClient

ipums = IpumsApiClient(os.environ.get("IPUMS_API_KEY"))

file_name = "nhgis_blk2010_blk2020_10.zip"
url = f"{ipums.base_url}/supplemental-data/nhgis/crosswalks/nhgis_blk2010_blk2020_state/{file_name}"

download_path = "<your-download-path-here>"

# Stream the response to disk in chunks rather than loading it all into memory
with ipums.get(url, stream=True) as response:
   with open(download_path, "wb") as outfile:
      for chunk in response.iter_content(chunk_size=8192):
         outfile.write(chunk)
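
Because the crosswalk file in this example is a .zip archive, you will usually want to unpack it after downloading. The following is a minimal sketch using only the Python standard library; `unpack_download` is a hypothetical helper, not part of ipumspy.

```python
import zipfile
from pathlib import Path


def unpack_download(zip_path, target_dir):
    """Extract a downloaded zip archive and return the names of its members."""
    target = Path(target_dir)
    # Create the output directory if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Example usage, continuing from the download above:
# members = unpack_download(download_path, "crosswalks")
```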