MachineLearningNotebooks: train-hyperparameter-tune-deploy-with-keras: deployment failing

Created on 5 Apr 2020 · 4 Comments · Source: Azure/MachineLearningNotebooks

The notebook fails at the "Deploy to ACI" step:

service = Model.deploy(workspace=ws,
                       name='keras-mnist-svc',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aciconfig)

service.wait_for_deployment(True)
print(service.state)

Inspecting the logs shows that the container setup appears to fail because python3.6 cannot be found in the new conda environment:

Step 8/15 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_c40c8ee583f9d7d8ef0c7fde8bb2f880 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
---> Running in 75964cf66a3e
Solving environment: ...working...
done


Downloading and Extracting Packages
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... failed

ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::astor-0.7.1-py_0'.
FileNotFoundError(2, "No such file or directory: '/azureml-envs/azureml_c40c8ee583f9d7d8ef0c7fde8bb2f880/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: ...working... done

FileNotFoundError(2, "No such file or directory: '/azureml-envs/azureml_c40c8ee583f9d7d8ef0c7fde8bb2f880/bin/python3.6'")

The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_c40c8ee583f9d7d8ef0c7fde8bb2f880 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 1
2020/04/05 04:14:20 Container failed during run: acb_step_0. No retries remaining.
failed to run step ID: acb_step_0: exit status 1
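
For a build log like the one above, the two lines that matter are the conda link error (which names the failing package) and the `FileNotFoundError` (which names the missing path). A minimal sketch of pulling both out of the log text follows. The function name `summarize_conda_failure` and the regular expressions are illustrative assumptions, written only against the log format shown here, and are not part of the Azure ML SDK.

```python
import re

# Matches the conda link error line, e.g.
# ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::astor-0.7.1-py_0'.
PACKAGE_RE = re.compile(r"An error occurred while installing package '([^']+)'")
# Matches the missing-file line, e.g.
# FileNotFoundError(2, "No such file or directory: '/.../bin/python3.6'")
MISSING_RE = re.compile(r"No such file or directory: '([^']+)'")


def summarize_conda_failure(log_text):
    """Return (channel, package, missing_path); any part not found is None.

    Hypothetical helper: it only understands the log format quoted in this issue.
    """
    channel = package = missing = None
    pkg_match = PACKAGE_RE.search(log_text)
    if pkg_match:
        spec = pkg_match.group(1)
        if "::" in spec:
            # A spec of the form channel::package names the source channel.
            channel, package = spec.split("::", 1)
        else:
            package = spec
    missing_match = MISSING_RE.search(log_text)
    if missing_match:
        missing = missing_match.group(1)
    return channel, package, missing


# Two lines copied from the build log in this issue.
sample_log = """ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::astor-0.7.1-py_0'.
FileNotFoundError(2, "No such file or directory: '/azureml-envs/azureml_c40c8ee583f9d7d8ef0c7fde8bb2f880/bin/python3.6'")
"""

channel, package, missing = summarize_conda_failure(sample_log)
print("channel:", channel)
print("package:", package)
print("missing path:", missing)
```

Seeing the channel and the missing interpreter path side by side makes it easier to tell whether the problem is the package source or the Python install inside the environment.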

Very grateful for your prompt help resolving this.

Auto ML assigned-to-author machine-learninsvc needs-more-info product-question triaged

All 4 comments

@nswitanek Thank you for reaching out. Can you please share the link to the document you are referring to so that we can help you accordingly? Thanks

@nswitanek you'd need to update to the latest SDK or add anaconda as the first conda channel in your environment.

My guess is that your environment has only the conda-forge channel (the log shows conda-forge::astor-0.7.1-py_0). That package no longer exists there, as conda shuffled their channels recently. You can still find it at https://repo.anaconda.com/pkgs/main/linux-64/ though.
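
Conda treats the order of the `channels` list as priority order, so "first channel" means the first entry in that list. A minimal sketch of the reordering, using a plain Python list to stand in for the environment's channel list (the helper `prioritize_channel` is hypothetical and not an Azure ML or conda API):

```python
def prioritize_channel(channels, preferred):
    """Return a new list with `preferred` first; the other channels keep their order."""
    rest = [name for name in channels if name != preferred]
    return [preferred] + rest


# Assumed starting point: an environment that lists only conda-forge.
channels = ["conda-forge"]
print(prioritize_channel(channels, "anaconda"))
```

The resulting list is what would go under `channels:` in the environment file, with anaconda listed before conda-forge.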

It's always worth trying an update to the newest SDK!

I've updated to the latest SDK and that appears to have resolved the deployment issue. Thank you!

The conda-forge channel idea was a good guess, but the anaconda channel is actually in the env.yml file:

    - pip:
      - azureml-defaults
    - tensorflow=1.13.1
    - keras==2.2.5
    channels:
    - anaconda
    - conda-forge