Reading IPUMS Data#

Reading IPUMS Extracts#

Reading IPUMS data into a Pandas data frame using ipumspy requires a fixed-width or csv IPUMS extract data file and an IPUMS xml DDI file.

To read a fixed-width rectangular IPUMS extract:

# Get the DDI
ddi = readers.read_ipums_ddi(path/to/ddi_file.xml)
# Get the data
ipums_df = readers.read_microdata(ddi, path/to/data_file.dat.gz)

As these files are often large, users may wish to filter or read in chunks. The readers.read_microdata_chunked() method can help. For example, the following reads only rows from Minnesota:

iter_microdata = read_microdata_chunked(ddi, chunksize=1000)
df = pd.concat([df[df["STATEFIP"] == 27] for df in iter_microdata])

The readers.read_hierarchical_microdata() method is for reading hierarchical extracts. By default, this method returns a dictionary with a data frame for each record type. Record types are keys, data frames are values.

extract_dict = readers.read_hierarchical_microdata(ddi,
                                                   path/to/hierarhcical_file.dat.gz)

To get a single data frame for a hierarchical extract, set the as_dict flag in readers.read_hierarchical_microdata() to False.

Reading Non-Extractable IPUMS Collections#

The IPUMS YRBSS and IPUMS NYTS data collections are not accessed through the IPUMS extract system, but are available for download in their entirety. ipumspy has functionality to download these datasets (download_noextract_data()) and parse the yml format codebooks that come packaged with the ipumspy library (read_noextract_codebook()). This codebook object can then be used to read the downloaded dataset into a Pandas data frame using read_microdata() as with other IPUMS datasets retrieved via the IPUMS extract system.