ipumspy.ddi.Codebook.get_all_types

Codebook.get_all_types(type_format, string_pyarrow=False)

Retrieve all column types

Parameters:
  • type_format (str) – Type format. Must be one of ["numpy_type", "pandas_type", "pandas_type_efficient", "python_type", "vartype"].

  • string_pyarrow (bool) – Only has an effect when True and type_format is one of ["pandas_type", "pandas_type_efficient"]. In that case, the string dtype pd.StringDtype() is replaced with pd.StringDtype(storage="pyarrow").
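
The substitution described for string_pyarrow can be pictured with plain pandas. This is only a rough sketch, not ipumspy's implementation: the helper use_pyarrow_strings and the column names YEAR and NAME are hypothetical, and the helper leaves the mapping unchanged when pyarrow is not installed.

```python
import importlib.util

import pandas as pd


def use_pyarrow_strings(dtypes):
    """Hypothetical helper: swap pandas string dtypes for pyarrow-backed ones."""
    # pd.StringDtype(storage="pyarrow") needs pyarrow, so fall back to a
    # plain copy of the mapping when the package is not available.
    if importlib.util.find_spec("pyarrow") is None:
        return dict(dtypes)
    return {
        name: pd.StringDtype(storage="pyarrow")
        if isinstance(dtype, pd.StringDtype)
        else dtype
        for name, dtype in dtypes.items()
    }


# Hypothetical column-name-to-dtype mapping (not taken from a real extract).
example = {"YEAR": pd.Int64Dtype(), "NAME": pd.StringDtype()}
converted = use_pyarrow_strings(example)
print(converted)
```

Only the string entries are touched; numeric dtypes pass through unchanged, which matches the parameter description above.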

Return type:

dict

Returns:

A dict mapping column names to column dtypes.

Examples

Let’s see an example of usage with the pandas.read_csv engine:

>>> from ipumspy import readers
>>> ddi_codebook = readers.read_ipums_ddi('extract_ddi.xml')
>>> dataframe_dtypes = ddi_codebook.get_all_types(type_format='pandas_type', string_pyarrow=False)
>>> df = readers.read_microdata(ddi=ddi_codebook, filename="extract.csv", dtype=dataframe_dtypes)
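
For readers without an extract at hand, the effect of passing a column-name-to-dtype dict can be tried with pandas alone. This is a minimal sketch under stated assumptions: the CSV content, the column names YEAR and NAME, and the dtype values are made up for illustration, and the exact objects returned by get_all_types for a real codebook may differ.

```python
import io

import pandas as pd

# Made-up CSV content standing in for an extract file.
csv_text = "YEAR,NAME\n2020,alpha\n2021,beta\n"

# Hypothetical mapping, shaped like a column-name-to-dtype dict.
dtypes = {"YEAR": pd.Int64Dtype(), "NAME": pd.StringDtype()}

# pandas applies each dtype to the column of the same name while parsing.
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

Passing the mapping at read time avoids a separate conversion step after loading.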

And an example use case with string_pyarrow set to True:

>>> from ipumspy import readers
>>> ddi_codebook = readers.read_ipums_ddi('extract_ddi.xml')
>>> dataframe_dtypes = ddi_codebook.get_all_types(type_format='pandas_type', string_pyarrow=True)
>>> # No particular impact when reading from CSV.
>>> df = readers.read_microdata(ddi=ddi_codebook, filename="extract.csv", dtype=dataframe_dtypes)
>>> # The benefit of string_pyarrow shows when converting to parquet: the writing time is reduced.
>>> df.to_parquet("extract.parquet")
>>> # Also, loading data from the derived extract.parquet will be faster than if the CSV file had been converted
>>> # using string_pyarrow=False.