Machinelearningnotebooks: Error in Datasets: Engine process terminated. Try running again.

Created on 19 Mar 2020  路  10Comments  路  Source: Azure/MachineLearningNotebooks

Runs are being terminated with this error Engine process terminated. Try running again:

  • I'm seeing these with AutoML runs submitted through the SDK (1.1.5.1)
  • The error itself isn't informative, but looking through the logs (complete 70-driver log below)
  • Switching back to the old experience, Failed to get data from DataPrep pops in
data_dict = data_preparer.prepare(automl_settings_obj)
y = input.keep_columns([label_column_name], validate_column_exists=True)
df.verify_has_data()

seems like the dataset is reporting that no data was found. I tried downloading the dataset locally and subetting to the columns needed and the data is there.

@swanderz have you seen this error before?

@anliakho2 @MayMSFT

Traceback:

The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
3 items cleaning up...
Cleanup took 0.2822861671447754 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 154
Traceback (most recent call last):
File "setup_AutoML_aaa431be-9eb6-482f-961f-0260f307c912.py", line 729, in
result = setup_run()
File "setup_AutoML_aaa431be-9eb6-482f-961f-0260f307c912.py", line 723, in setup_run
num_iterations=num_iterations
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 778, in setup_wrapper
verifier=verifier
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 128, in _prepare_data
data_dict = data_preparer.prepare(automl_settings_obj)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_data_preparer.py", line 76, in prepare
generic_msg)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_data_preparer.py", line 65, in prepare
return self._get_fit_params(automl_settings_obj)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_data_preparer.py", line 152, in _get_fit_params
self._logger)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/train/automl/runtime/_data_preparer.py", line 451, in _helper_get_data_from_dict
df, automl_settings_obj.label_column_name, automl_settings_obj.weight_column_name)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/automl/runtime/training_utilities.py", line 2166, in _extract_data_from_combined_dataflow
y = input.keep_columns([label_column_name], validate_column_exists=True)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py", line 2603, in keep_columns
df.verify_has_data()
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/_loggerfactory.py", line 151, in wrapper
return func(args, *kwargs)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py", line 852, in verify_has_data
profile = self.take(1)._get_profile()
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py", line 549, in _get_profile
include_string_lengths)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/dataprofile.py", line 53, in _from_execution
inspector_response = _Inspector._from_execution(engine_api, context, table_inspector)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/inspector.py", line 219, in _from_execution
row_count = _MAX_ROW_COUNT))
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/_aml_helper.py", line 38, in wrapper
return send_message_func(op_code, message, cancellation_token)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/api.py", line 107, in execute_inspector
response = self._message_channel.send_message('Engine.ExecuteInspector', message_args, cancellation_token)
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/engine.py", line 154, in send_message
response = self._read_response()
File "/azureml-envs/azureml_e021c6cf8c1865aaf501e22e7207a17e/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/engine.py", line 116, in _read_response
raise RuntimeError(crash_msg + ' Try running again.')
azureml.train.automl.exceptions.ClientException: ClientException:
Message: Engine process terminated. Try running again.
InnerException Engine process terminated. Try running again.
ErrorResponse
{
"error": {
"code": "System",
"message": "Engine process terminated. Try running again."
}
}

Auto鈥疢L awaiting-product-team-response cxp product-question triaged

Most helpful comment

Closing the loop:
We found out that the "Engine process terminated" error is due to OS running out of memory and killing our process. This cannot be caught by our current error handling to report "out of memory" properly.
To this extent next release will add better handling and usefull error message for this case.

Regarding to memory usage, our parquet file reading reads full RowGroup based on metadata of the parquet file. This means the entire RowGroup of data needs be materialized into RAM. That can cause OOM when the system is already under high memory pressure.

Here are workaround options to consider:

  1. Using bigger machine, but this highly depend on the size of data and is not scale-able as you saw.
  2. Reduce the RowGroup size when generating parquet data file.
  3. Partition big data into multiple smaller parquet files, then create dataset from those. Here is a sample gist modified on top of yours to demonstrate the idea.

All 10 comments

Thanks for reporting the error. We are investigating the case. Will reach back if we need more info.

@jadhosn This means there was a fatal exception in the execution engine.
Could you please share your driver logs or search for "session id" there, so we could try to look for clue in the telemetry.
Having a repro would be also very helpful, so if possible to share the data - would be cool.

@jadhosn this is somewhat helpful 馃憤
The bad news that it looks like our engine died on the last get_profile call and didn't even have a chance to send telemetry... I do see that we are reading pretty big parquet files beforehand (like 500 MB ) so I wonder if we just ran out of memory trying to materialize the data...
Sadly there is nothing else I can glean from logs.
The only next step here is to try to get a repro, but that would require data access. (or you trying to get_profile() on the datasets in question)

In order to isolate the root cause, would it be possible for you to try similar code but using SDK 1.0.83 instead of 1.1.5.1?

Note that if using SDK 1.0.83 the old import is:
from azureml.train.automl.runtime import AutoMLStep
instead of the new:
from azureml.pipeline.steps import AutoMLStep

@CESARDELATORRE We are not using any AutoML steps, instead we submit AutoML config as their own runs.

We tried switching to 1.0.83 and 1.0.85 (for the local SDK as I don't have control over the AutoML config dependencies).

Both SDK version came back with:

(Stream was too long.)|session_id=455152c3-b2e8-4d09-895f-2cc42313ea7b

What's worth noting that these runs were running on a DS14_v2 (16 cores, 112 GB RAM, 224 GB disk).

70-driver log traceback

Traceback (most recent call last):
File "setup_AutoML_9ba731a8-1342-46db-bcb8-d1d4cc111ada.py", line 729, in
result = setup_run()
File "setup_AutoML_9ba731a8-1342-46db-bcb8-d1d4cc111ada.py", line 723, in setup_run
num_iterations=num_iterations
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 1058, in setup_wrapper
verifier=verifier
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 471, in _prepare_data
data_dict = _get_data_from_dataprep(dataprep_json, automl_settings_obj, logger)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 406, in _get_data_from_dataprep
automl_settings_obj)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 107, in _get_data_from_serialized_dataflow
return _helper_get_data_from_dict(dataflow_dict, automl_settings_obj, logger)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 185, in _helper_get_data_from_dict
for k in data_columns
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/train/automl/runtime/_remote_script.py", line 185, in
for k in data_columns
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/automl/runtime/dataprep_utilities.py", line 44, in retrieve_numpy_array
raise DataException.from_exception(e)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/automl/runtime/dataprep_utilities.py", line 33, in retrieve_numpy_array
df = dataflow.to_pandas_dataframe(on_error='null')
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/_loggerfactory.py", line 149, in wrapper
return func(args, *kwargs)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py", line 708, in to_pandas_dataframe
ExecuteAnonymousActivityMessageArguments(anonymous_activity=Dataflow._dataflow_to_anonymous_activity_data(dataflow_to_execute)))
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/_aml_helper.py", line 38, in wrapper
return send_message_func(op_code, message, cancellation_token)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/api.py", line 94, in execute_anonymous_activity
response = self._message_channel.send_message('Engine.ExecuteActivity', message_args, cancellation_token)
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/engine.py", line 120, in send_message
raise_engine_error(response['error'])
File "/azureml-envs/azureml_bb5f448562b2c4428838c28d9b22ab0f/lib/python3.6/site-packages/azureml/dataprep/api/errorhandlers.py", line 24, in raise_engine_error
raise ExecutionError(error_response)
automl.client.core.common.exceptions.DataException: DataException:
Message: (Stream was too long.)|session_id=455152c3-b2e8-4d09-895f-2cc42313ea7b
InnerException (Stream was too long.)|session_id=455152c3-b2e8-4d09-895f-2cc42313ea7b
ErrorResponse
{
"error": {
"code": "UserError",
"inner_error": {
"code": "InvalidData"
},
"message": "(Stream was too long.)|session_id=455152c3-b2e8-4d09-895f-2cc42313ea7b"
}
}

@jadhosn the stream was too long is an issue that we have fixed since 1.0.85 and has to do max buffer size for the stream being 2GB and http client tryin to allocate such a buffer and failing, In the latest version we should progress further, but then we clearly hit some sort of bug... if you have a consistent repro for a specific file I would really like to check it out.

Closing the loop:
We found out that the "Engine process terminated" error is due to OS running out of memory and killing our process. This cannot be caught by our current error handling to report "out of memory" properly.
To this extent next release will add better handling and usefull error message for this case.

Regarding to memory usage, our parquet file reading reads full RowGroup based on metadata of the parquet file. This means the entire RowGroup of data needs be materialized into RAM. That can cause OOM when the system is already under high memory pressure.

Here are workaround options to consider:

  1. Using bigger machine, but this highly depend on the size of data and is not scale-able as you saw.
  2. Reduce the RowGroup size when generating parquet data file.
  3. Partition big data into multiple smaller parquet files, then create dataset from those. Here is a sample gist modified on top of yours to demonstrate the idea.

@anliakho2 Can you provide a pointer to the "sample gist to demonstrate the approach of Partition big data into multiple smaller parquet files, then create dataset from those"?
I don't see the gist you mentioned.

Also, since this is an OOM issue, can we close the issue already?

@jadhosn
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and @YutongTie-MSFT and we will gladly continue the discussion.

Was this page helpful?
0 / 5 - 0 ratings