Sagemaker-python-sdk: Pandas and the IngestionManagerPandas class for the FeatureStore class

Created on 12 Dec 2020 · 4 Comments · Source: aws/sagemaker-python-sdk

Describe the bug
To write from a pandas DataFrame to the Feature Store, IngestionManagerPandas iterates through the DataFrame using the .iterrows() method. iterrows() returns each row as a Series with a single dtype, so when every column is numeric, integer values come back as floats regardless of the column's dtype. When the Feature Store tries to save such a float to a feature that was configured as an integer, it throws an error. You can fix this by iterating through the DataFrame with .loc or .iloc to get the proper datatype. As it stands, this will likely confuse customers when they attempt to use the .ingest method in the SDK.
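One possible reading of the .loc/.iloc suggestion is cell-level access rather than row-level access. The sketch below is only an illustration under that assumption, using a made-up two-column frame; it is not the SDK's code. Note that selecting a whole row with df.iloc[i] still returns a Series and upcasts just like iterrows(), so only per-cell access (here via .iat) keeps the integer intact.

```python
import numpy as np
import pandas as pd

# Illustrative frame (an assumption, not taken from the SDK): one float and one int column.
df = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
    }
)

# Row-level access returns a Series with one common dtype, so the int is upcast.
row_series = df.iloc[0]
print(row_series.dtype)              # float64
print(str(row_series["feature2"]))   # 0.0

# Cell-level access reads each value from its own column, so the dtype survives.
cell_strings = [str(df.iat[0, j]) for j in range(df.shape[1])]
print(cell_strings)                  # ['0.0', '0']
```

Per-cell access is usually slower than itertuples() on large frames, so this is better suited to showing the behavior than to bulk ingestion.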

bug

All 4 comments

@aarsanjani

Which Python version do you see this bug on? We do have integration tests that write int dtype columns to the Feature Store. I understand that with iterrows we lose the dtype, but the type of the data itself seems to have been mapped correctly in my experiment. Could you show me how to reproduce this error?

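import numpy as np
import pandas as pd

# Mixed frame: the string column forces each row Series to object dtype.
df = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
        "feature3": pd.Series(["2020-10-30T03:43:21Z"] * 3, dtype="string"),
    }
)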

for _, row in df.iterrows():
    print("-----------------------------")
    for idx, value in row.items():
        print(idx, value, type(value), str(value))

-----------------------------
feature1 0.0 <class 'float'> 0.0
feature2 0 <class 'int'> 0
feature3 2020-10-30T03:43:21Z <class 'str'> 2020-10-30T03:43:21Z
-----------------------------
feature1 1.0 <class 'float'> 1.0
feature2 1 <class 'int'> 1
feature3 2020-10-30T03:43:21Z <class 'str'> 2020-10-30T03:43:21Z
-----------------------------
feature1 2.0 <class 'float'> 2.0
feature2 2 <class 'int'> 2
feature3 2020-10-30T03:43:21Z <class 'str'> 2020-10-30T03:43:21Z

@icywang86rui

Your example only works because the dataframe has a string column, which makes iterrows() return each row as an object-dtype Series, so every value keeps its own type. If there is no string column in the dataframe, iterrows() returns each row as a float64 Series, which causes the issue described by @aarsanjani.

import numpy as np
import pandas as pd

# Numeric-only frame: each row Series is upcast to float64.
df = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
    }
)

for _, row in df.iterrows():
    print("-----------------------------")
    for idx, value in row.items():
        print(idx, value, type(value), str(value))

-----------------------------
feature1 0.0 <class 'float'> 0.0
feature2 0.0 <class 'float'> 0.0
-----------------------------
feature1 1.0 <class 'float'> 1.0
feature2 1.0 <class 'float'> 1.0
-----------------------------
feature1 2.0 <class 'float'> 2.0
feature2 2.0 <class 'float'> 2.0

From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

Notes
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

row = next(df.iterrows())[1]

row
int      1.0
float    1.5
Name: 0, dtype: float64

print(row['int'].dtype)
float64

print(df['int'].dtype)
int64

To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

@icywang86rui I edited your example so it works with itertuples. One caveat is that itertuples() doesn't return a pandas Series, so we need an extra step (_asdict()) to iterate over the items.


import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
        # "feature3": pd.Series(["2020-10-30T03:43:21Z"] * 3, dtype="string"),
    }
)

for row in df.itertuples(index=False):
    print("-----------------------------")
    for idx, value in row._asdict().items():
        print(idx, value, type(value), str(value))

-----------------------------
feature1 0.0 <class 'float'> 0.0
feature2 0 <class 'int'> 0
-----------------------------
feature1 1.0 <class 'float'> 1.0
feature2 1 <class 'int'> 1
-----------------------------
feature1 2.0 <class 'float'> 2.0
feature2 2 <class 'int'> 2

I believe that the code from this line, https://github.com/aws/sagemaker-python-sdk/blob/e8deeb37c59524a0bb8c016bf13954d90f0ba9fa/src/sagemaker/feature_store/feature_group.py#L182, through line 186:

for _, row in data_frame[start_index:end_index].iterrows():
    record = [
        FeatureValue(feature_name=name, value_as_string=str(value))
        for name, value in row.items()
    ]

could be changed to this:

for row in data_frame[start_index:end_index].itertuples(index=False):
    record = [
        FeatureValue(feature_name=name, value_as_string=str(value))
        for name, value in row._asdict().items()
    ]
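To make the effect of the proposed change concrete, here is a minimal sketch that compares the strings each approach would produce for value_as_string. It uses plain dicts in place of FeatureValue so it runs without the SDK, and the two-column frame is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Illustrative numeric-only frame (an assumption, mirroring the thread's example).
data_frame = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
    }
)

# iterrows(): each row is a float64 Series, so the int column is stringified as a float.
iterrows_strings = [
    {name: str(value) for name, value in row.items()}
    for _, row in data_frame.iterrows()
]

# itertuples(): each value comes from its own column, so ints stay ints.
itertuples_strings = [
    {name: str(value) for name, value in row._asdict().items()}
    for row in data_frame.itertuples(index=False)
]

print(iterrows_strings[1])    # {'feature1': '1.0', 'feature2': '1.0'}
print(itertuples_strings[1])  # {'feature1': '1.0', 'feature2': '1'}
```

The "1.0" string in the first output is what would be sent for an integer feature when iterrows() is used.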

I don't know exactly how to propose this as a pull request.

@Previatto Thanks for the suggestion. This was indeed implemented with itertuples before. The problem with itertuples is that on Python 3.6 or older, when the number of columns is large (more than 254, per the pandas docs), regular tuples are returned instead of named tuples. I see that you saw my PR that changed from itertuples to iterrows.

I think I can switch back to itertuples, iterate each row as a regular tuple, and get the column names from the dataframe itself instead.
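The approach described in the last comment could look roughly like the sketch below. This is an assumption about the shape of the fix, not the code that was merged: rows_as_string_pairs is a hypothetical helper, and plain (name, string) pairs stand in for FeatureValue so the example runs without the SDK. Passing name=None to itertuples() asks for plain tuples regardless of column count, which avoids depending on named tuples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "feature1": pd.Series(np.arange(3.0), dtype="float64"),
        "feature2": pd.Series(np.arange(3), dtype="int64"),
    }
)


def rows_as_string_pairs(data_frame):
    """Hypothetical helper: yield a list of (column name, str(value)) pairs per row."""
    columns = list(data_frame.columns)
    # name=None returns regular tuples, so the result has the same shape
    # no matter how many columns the frame has.
    for row in data_frame.itertuples(index=False, name=None):
        yield [(name, str(value)) for name, value in zip(columns, row)]


records = list(rows_as_string_pairs(df))
print(records[1])  # [('feature1', '1.0'), ('feature2', '1')]
```

Taking the names from data_frame.columns also keeps the original column labels, whereas named tuples rename fields that are not valid Python identifiers.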
