ipumspy.readers.read_microdata_chunked

ipumspy.readers.read_microdata_chunked(ddi, filename=None, encoding=None, subset=None, chunksize=None, dtype=None, **kwargs)

Read microdata in chunks, as specified by the Codebook. Because these files are often large, you may wish to filter them as you read. For example, the following keeps only rows for Rhode Island:

    iter_microdata = read_microdata_chunked(ddi, chunksize=1000)
    df = pd.concat([df[df['STATEFIP'] == 44] for df in iter_microdata])

This method also works for large hierarchical files. When reading these files in chunks, users will want to be sure to filter on the RECTYPE variable. For example, the code below reads in only household records in Rhode Island:

    iter_microdata = read_microdata_chunked(ddi, chunksize=1000)
    df = pd.concat([df[(df['RECTYPE'] == 'H') & (df['STATEFIP'] == 44)] for df in iter_microdata])
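The filter-then-concatenate pattern above can be tried without an IPUMS extract. The sketch below is only an illustration: it uses pandas.read_csv with chunksize on a small made-up in-memory table as a stand-in for read_microdata_chunked, and the SERIAL column and its values are invented for the example.

```python
import io

import pandas as pd

# Tiny made-up hierarchical table: "H" rows are households, "P" rows are persons.
raw = io.StringIO(
    "RECTYPE,STATEFIP,SERIAL\n"
    "H,44,1\n"
    "P,44,1\n"
    "H,25,2\n"
    "P,25,2\n"
    "P,25,2\n"
    "H,44,3\n"
    "P,44,3\n"
)

# Stand-in for read_microdata_chunked(ddi, chunksize=3): an iterator of data frames.
iter_chunks = pd.read_csv(raw, chunksize=3)

# Filter each chunk before concatenating, so only matching rows are retained.
households_ri = pd.concat(
    [
        chunk[(chunk["RECTYPE"] == "H") & (chunk["STATEFIP"] == 44)]
        for chunk in iter_chunks
    ],
    ignore_index=True,
)

print(households_ri)
```

Because each chunk is filtered before it is kept, only the matching rows accumulate in memory, even when the full file would not fit.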

Parameters:
  • ddi (Codebook) – The codebook representing the data

  • filename (Union[str, Path, IOBase, None]) – The path to the data file. If not provided, the path is taken from ddi and is assumed to be relative to the current working directory

  • encoding (Optional[str]) – The encoding of the data file. If not present, reads from ddi

  • subset (Optional[List[str]]) – A list of variable names to keep. If None, will keep all

  • dtype (Optional[dict]) – A dictionary with variable names as keys and variable types as values. Has an effect only when the file is read with pd.read_fwf or pd.read_csv. If None, pd.read_fwf or pd.read_csv uses the type given by ddi.data_description.pandas_type for all variables. See ipumspy.ddi.VariableDescription for more detail on ddi.data_description.pandas_type. If the files are csv and dtype is not None, pandas converts the column types once, in the pd.read_csv call. When the file format is .dat or .csv and dtype is None, two conversions occur: one on load, and one when returning the dataframe.

  • chunksize (Optional[int]) – The number of rows per chunk returned by the iterator. See pandas.read_csv

  • kwargs – keyword args to be passed to pd.read_fwf
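As a rough illustration of the shape of the dtype argument described above, the sketch below passes a mapping of variable names to pandas types directly to pandas.read_csv together with chunksize. It does not call ipumspy; the table, the AGE column, and the chosen types are assumptions made for the example.

```python
import io

import pandas as pd

# Made-up data standing in for a csv extract.
raw = io.StringIO("STATEFIP,AGE\n44,31\n25,7\n44,65\n")

# Hypothetical dtype mapping: variable names as keys, pandas types as values.
dtype = {"STATEFIP": "Int64", "AGE": "int16"}

# Each chunk is parsed with the requested types in the read_csv call itself.
chunk_dtypes = [
    chunk.dtypes.astype(str).to_dict()
    for chunk in pd.read_csv(raw, dtype=dtype, chunksize=2)
]

print(chunk_dtypes)
```

Every chunk carries the same column types, so the chunks can be concatenated later without a further conversion step.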

Yields:

An iterator of data frames

Return type:

Iterator[DataFrame]