Machinelearningnotebooks: ValidationException when running linear regression in pipeline

Created on 4 Oct 2020 · 12 Comments · Source: Azure/MachineLearningNotebooks

I reproduced the issue with the Boston dataset so that I could validate it. I can also confirm that this works with classification examples but not with regression.

Here is the exact code that will reproduce the error:
https://github.com/BillmanH/learn-azureml/blob/main/main_run_linear.py

2020-10-04 18:37:04.069 - CRITICAL - Type: InvalidInputDatatype
Class: ValidationException
Message: ValidationException:
    Message: Input of type 'string' is not supported. Supported types: [decimal, mixed-integer-float, floating, integer]
    InnerException: None
    ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Input of type 'string' is not supported. Supported types: [decimal, mixed-integer-float, floating, integer]",
        "details_uri": "https://aka.ms/AutoMLConfig",
        "target": "y",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "ArgumentInvalid",
                "inner_error": {
                    "code": "InvalidInputDatatype"
                }
            }
        },
        "reference_code": "08629a21-bbcb-4088-959b-2d35c00cae32"
    }
}
Traceback:
  File "_remote_script.py", line 551, in setup_wrapper
    script_directory, dataprep_json, entry_point, parent_run, verifier)
  File "_remote_script.py", line 737, in _prep_input_data
    verifier=verifier
  File "_remote_script.py", line 95, in _prepare_data
    automl_settings=automl_settings_obj)
  File "training_utilities.py", line 140, in set_task_parameters
    target='y'

ExceptionTarget: y

It seems to be saying that the target y was given a string input. However, I'm just using the Boston Housing prices dataset in this example, so I don't think that is the issue. To check, I added a df.dtypes printout to the previous step, mung.py, to confirm that all the values are floats.

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
price      float64

Obviously no strings.

Either AutoML is converting the values to a string, OR this error message is wrong. Either way it is wrapped heavily in the AutoML library and I can't troubleshoot it.

Auto ML product-issue

All 12 comments

I updated the repo: https://github.com/BillmanH/learn-azureml
Just to show the issue.

main_run_classifier.py and main_run_linear.py access the same munge_step. The only difference is the precursor step. The automl run protests about the input. I was wondering if it somehow can't find the dataset but the extensive use of wrappers makes debugging difficult.

Adding @kimix92 and @anupsms to follow up.

@BillmanH: Can you post the training file gold_data.csv used in main_run_linear.py? That is, the Boston housing dataset after any modifications are made, as read by the AutoML code.

The error you hit is raised when a string type label column is used for a regression task.

I don't see anything odd in the code for that error; though perhaps we can repro with your data.

I'd have to dig it up, but it's easy to generate. I do no transformations on that dataset.
You can see in the pipe steps that I just generate the data, save it, and move on.
You can repro the issue with the code that I've provided. The Boston dataset is made from sklearn.datasets.load_boston, as shown in the get_linear.py step.
https://github.com/BillmanH/learn-azureml/tree/main/pipes

You can also see the data types above confirming that they are numbers. If there is a process in the OutputDataset that saves it as strings then it would still be a bug, right?

Greetings @BillmanH, would you have time to post your training file gold_data.csv?

gold_data.zip
Here is the file. I just went to the storage location in the blob from the get_linear step and downloaded it. However, you can reproduce the file just by running the code.

I did a quick run using the AutoML GUI -- https://ml.azure.com/automl/startrun

There are no issues when using your dataset in that interface. That said, the data loader is a bit different, as it loads from Azure Datasets.

Next up is a run within the SDK.

I've also done similar things by submitting AML runs directly from notebooks and had no problems there either. However, I have a client who needs an AutoML run as part of a pipeline.

@BillmanH -- thanks for your patience. We are still looking into the issue.

@BillmanH, could you please try the latest SDK (1.16), if you are not already using it, and see whether that resolves the issue? The OutputFileDatasetConfig class is marked as experimental and may have had some issues in earlier versions of the SDK. Internal testing of the code in question with the latest SDK did not produce the error, and the target price column was correctly set to float.

Note, the online documentation for OutputFileDatasetConfig.read_delimited_files() states that the column types read in will all be seen as strings unless the set_column_types parameter is used to override the type (which was done correctly for the price target column).
See here for more information: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig?view=azure-ml-py#read-delimited-files-include-path-false--separator------header--promoteheadersbehavior-all-files-have-same-headers--3---partition-format-none--path-glob-none--set-column-types-none-
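For readers following along, the set_column_types override described above might look roughly like the sketch below. It assumes the azureml-core SDK is installed. The output name "gold_data" and both helper functions are hypothetical names chosen for illustration, not code from the linked repo; the column list is copied from the dtypes printout earlier in the thread.

```python
# Column names copied from the df.dtypes printout earlier in this thread.
BOSTON_COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
    "RAD", "TAX", "PTRATIO", "B", "LSTAT", "price",
]


def float_column_types(columns, float_type):
    """Map every column name to the same type marker."""
    return {name: float_type for name in columns}


def build_training_data(output_name="gold_data"):
    """Hypothetical helper: declare a pipeline output and read it back as floats.

    The azureml imports live inside the function so the rest of the sketch
    can be imported without the SDK installed.
    """
    from azureml.data import OutputFileDatasetConfig
    from azureml.data.dataset_factory import DataType

    # "gold_data" is an assumed output name, not taken from the repo.
    output = OutputFileDatasetConfig(name=output_name)
    column_types = float_column_types(BOSTON_COLUMNS, DataType.to_float())
    # Without set_column_types, every column is read back as a string.
    training_data = output.read_delimited_files(set_column_types=column_types)
    return output, training_data
```

In a pipeline, the first returned object would typically be passed to the step that writes the CSV, and the second used as the AutoML training data. Typing every column, not only the label, also keeps the features numeric.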

If the error is still encountered while using the latest SDK, consider saving the data as a parquet file instead of CSV in the munge step since this will preserve the column types. The parquet file can be read in using OutputFileDatasetConfig.read_parquet_files().

@pieths -- thanks for the deeper investigation.

@BillmanH -- I'm going to close the issue. Feel free to reopen if you continue encountering the issue with the latest SDK version (1.16). As @pieths mentioned, using the parquet format should also work as the format internally stores the datatypes for each column, whereas CSV/TSV files require auto-detection or user specified datatypes.
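The point about CSV carrying no type information can be seen with the standard library alone. This is a small illustration using two made-up rows (not values from gold_data.csv), and it says nothing about how AutoML or DataPrep read files internally:

```python
import csv
import io

# Two made-up rows; the values are illustrative, not from gold_data.csv.
rows = [{"RM": 6.5, "price": 24.0}, {"RM": 6.4, "price": 21.6}]

# Write the floats out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["RM", "price"])
writer.writeheader()
writer.writerows(rows)

# Read the text back: CSV has no type metadata, so every value is a str.
buffer.seek(0)
raw = list(csv.DictReader(buffer))
raw_types = {type(value).__name__ for row in raw for value in row.values()}

# The reader has to be told (or has to guess) the numeric type.
typed = [{name: float(value) for name, value in row.items()} for row in raw]
```

Here the floats survive only because the reader casts them explicitly, which is the role set_column_types plays for the pipeline output. A format such as parquet stores the column types alongside the data, so no such step is needed.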

I reproduced the issue with the Boston dataset so that I could validate it. I can also confirm that this works with classification examples but not with regression.

Since the label column is being read in as a string, the label datatype is acceptable for a classification task, but regression requires a numeric label column. When you implement the fix (version change or parquet), I expect you'll see an increase in classification accuracy, since your numeric feature columns were presumably also being treated as strings instead of as numeric.
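A rough way to picture the difference between the two tasks is a check on the label column before a regression run is submitted. The function below is a hypothetical stand-in written for this thread, not AutoML's actual validation code:

```python
def label_is_numeric(rows, label):
    """Return True if every value in the label column is an int or float.

    bool is excluded explicitly because it is a subclass of int in Python.
    """
    return all(
        isinstance(row[label], (int, float)) and not isinstance(row[label], bool)
        for row in rows
    )


# Made-up rows: the first mimics a column read back as strings,
# the second mimics the same column after it has been typed as float.
as_read = [{"price": "24.0"}, {"price": "21.6"}]
as_typed = [{"price": 24.0}, {"price": 21.6}]

strings_ok_for_regression = label_is_numeric(as_read, "price")
floats_ok_for_regression = label_is_numeric(as_typed, "price")
```

A classifier can treat "24.0" and "21.6" as class names, which is why the same pipeline did not fail for classification.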

I finally got around to re-running with version 1.16.0. Still crashing. However, the SDK may have already moved on by now, so this may no longer be relevant. No worries, no need to re-open. I've moved on to other solutions.

Here is the new 70 log:

2020/11/02 02:38:57 logger.go:297: Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2020/11/02 02:38:57 logger.go:297: Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2020-11-02T02:38:59.029511] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['setup_85aab669-9a70-4c3d-a851-0dbb2cae440b.py', 'automl_driver.py'])
Starting the daemon thread to refresh tokens in background for process with pid = 119
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/pjx-d-cu1-mlw-models/azureml/85aab669-9a70-4c3d-a851-0dbb2cae440b_setup/mounts/workspaceblobstore/azureml/85aab669-9a70-4c3d-a851-0dbb2cae440b_setup
Preparing to call script [ setup_85aab669-9a70-4c3d-a851-0dbb2cae440b.py ] with arguments: ['automl_driver.py']
After variable expansion, calling script [ setup_85aab669-9a70-4c3d-a851-0dbb2cae440b.py ] with arguments: ['automl_driver.py']

Script type = None
Starting the setup....
2020-11-02 02:39:06.346 - INFO - Changing AutoML temporary path to current working directory.
2020-11-02 02:39:06.704 - INFO - Successfully got the cache data store, caching enabled.
2020-11-02 02:39:06.704 - INFO - Took 0.34537768363952637 seconds to retrieve cache data store
2020-11-02 02:39:07.879 - INFO - ActivityStarted: load
2020-11-02 02:39:07.918 - INFO - ActivityCompleted: Activity=load, HowEnded=Success, Duration=39.51[ms]
2020-11-02 02:39:07.918 - INFO - Preparing input data for setup iteration for run 85aab669-9a70-4c3d-a851-0dbb2cae440b_setup.
2020-11-02 02:39:07.919 - INFO - Resolving dataflows using dprep json.
2020-11-02 02:39:07.919 - INFO - DataPrep version: 2.3.4
2020-11-02 02:39:07.919 - INFO - DataPrep log client session id: f70f61be-665e-41e5-8ce3-3e93b665cc09
2020-11-02 02:39:07.919 - INFO - ActivityStarted: ParsingDataprepJSON
2020-11-02 02:39:07.920 - INFO - Deserializing dataflow.
2020-11-02 02:39:14.610 - INFO - ActivityCompleted: Activity=ParsingDataprepJSON, HowEnded=Success, Duration=6690.66[ms]
2020-11-02 02:39:14.610 - INFO - ActivityStarted: BuildingDataCharacteristics
2020-11-02 02:39:14.611 - INFO - Starting data characteristics calculation. This might take a while...
2020-11-02 02:39:24.787 - INFO - ActivityCompleted: Activity=BuildingDataCharacteristics, HowEnded=Success, Duration=10176.2[ms]
2020-11-02 02:39:24.787 - INFO - Successfully retrieved data using dataprep.
2020-11-02 02:39:25.333 - INFO - Service responded with streaming disabled
2020-11-02 02:39:26.621 - CRITICAL - Type: MissingColumnsInData
Class: DataException
Message: DataException:
    Message: Expected column(s) price (label_column_name) not found in X.
    InnerException: None
    ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Expected column(s) price (label_column_name) not found in X.",
        "target": "label_column_name",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "MissingColumnsInData"
            }
        },
        "reference_code": "06328553-815b-486d-98dd-98e4c8952a7c"
    }
}
Traceback:
  File "_data_preparer.py", line 72, in prepare
    fit_params = self._get_fit_params(automl_settings_obj)
  File "_data_preparer.py", line 169, in _get_fit_params
    training_data_row_count)
  File "_data_preparer.py", line 458, in _helper_get_data_from_dict
    automl_settings_obj.cv_split_column_names
  File "training_utilities.py", line 1896, in _extract_data_from_combined_dataflow
    reference_code=ReferenceCodes._EXTRACT_DATA_FROM_COMBINED_DATAFLOW_LABEL_COLUMN_MISSING)

ExceptionTarget: label_column_name
2020-11-02 02:39:26.622 - INFO - Error in setup_wrapper.
2020-11-02 02:39:26.622 - ERROR - Marking Run 85aab669-9a70-4c3d-a851-0dbb2cae440b_setup as Failed.
2020-11-02 02:39:26.623 - CRITICAL - Type: MissingColumnsInData
Class: DataException
Message: DataException:
    Message: Expected column(s) price (label_column_name) not found in X.
    InnerException: None
    ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Expected column(s) price (label_column_name) not found in X.",
        "target": "label_column_name",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "MissingColumnsInData"
            }
        },
        "reference_code": "06328553-815b-486d-98dd-98e4c8952a7c"
    }
}
Traceback:
  File "_remote_script.py", line 560, in setup_wrapper
    script_directory, dataprep_json, entry_point, parent_run, verifier)
  File "_remote_script.py", line 746, in _prep_input_data
    verifier=verifier
  File "_remote_script.py", line 89, in _prepare_data
    data_dict = data_preparer.prepare(automl_settings_obj)
  File "_data_preparer.py", line 72, in prepare
    fit_params = self._get_fit_params(automl_settings_obj)
  File "_data_preparer.py", line 169, in _get_fit_params
    training_data_row_count)
  File "_data_preparer.py", line 458, in _helper_get_data_from_dict
    automl_settings_obj.cv_split_column_names
  File "training_utilities.py", line 1896, in _extract_data_from_combined_dataflow
    reference_code=ReferenceCodes._EXTRACT_DATA_FROM_COMBINED_DATAFLOW_LABEL_COLUMN_MISSING)

ExceptionTarget: label_column_name
2020-11-02 02:39:26.691 - WARNING - Failed to retrieve AutoMLSettings for run: 85aab669-9a70-4c3d-a851-0dbb2cae440b_setup
2020-11-02 02:39:26.693 - ERROR - Type: Unclassified
Class: TypeError
Message: the JSON object must be str, bytes or bytearray, not 'NoneType'
Traceback:
  File "run.py", line 165, in _is_local
    settings = self._get_automl_settings()
  File "run.py", line 727, in _get_automl_settings
    **json.loads(self._get_property('AMLSettingsJsonString')))
  File "__init__.py", line 348, in loads
    'not {!r}'.format(s.__class__.__name__))

ExceptionTarget: Unspecified
2020-11-02 02:39:26.693 - WARNING - Encountered an exception of type: <class 'azureml.automl.core.shared.exceptions.DataException'>, interpreted as error code MissingColumnsInData.
Setup run completed successfully!
Starting the daemon thread to refresh tokens in background for process with pid = 119


[2020-11-02T02:39:27.076005] The experiment completed successfully. Finalizing run...
[2020-11-02T02:39:27.076124] FinalizingInRunHistory is not called
Cleaning up all outstanding Run operations, waiting 900.0 seconds
4 items cleaning up...
Cleanup took 0.3062152862548828 seconds
[2020-11-02T02:39:27.722739] Finished context manager injector.
2020/11/02 02:39:32 logger.go:297: Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
2020/11/02 02:39:32 logger.go:297: Process Exiting with Code:  0
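The log above reports that the label column price was not found, but it does not say why. One thing that could be checked locally is whether the header row of the saved file actually contains the expected column names. A minimal standard-library sketch, where the function name and the sample CSV text are made up for illustration:

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Made-up samples: one file with a header row, one written without it.
with_header = "CRIM,price\n0.1,24.0\n"
without_header = "0.1,24.0\n"

ok = missing_columns(with_header, ["price"])
broken = missing_columns(without_header, ["price"])
```

An empty list means every required column is present. This only inspects the file itself; it cannot tell whether the pipeline step handed AutoML the file that was expected.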