Machinelearningnotebooks: BUG: HyperDriveStep's allow_reuse param does not work.

Created on 17 Jan 2020  路  28Comments  路  Source: Azure/MachineLearningNotebooks

I made a reproducible example in this branch of a fork of this repo. This is a big time suck for our team as the majority of our pipelines use HyperDriveStep and we always have downstream steps from HyperDrive.
https://github.com/swanderz/MachineLearningNotebooks/blob/hyperdrive_allow_reuse/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-parameter-tuning-with-hyperdrive.ipynb

Auto鈥疢L Hyperdrive assigned-to-author product-issue triaged

Most helpful comment

Yes "hd_step01 will run and generate new outputs" is what we get when the step is not reused. Will try to repro.

All 28 comments

@swanderz I see that you are copying training_files to script_folder, at cell
"Copy the training files into the script folder" which would be changing the hash of the source_directory and preventing the reuse. Are you running that everytime?

@purnesh42H thanks for jumping on to help out! The copy script magic was just part of the minimum viable reproducible example I used. Here's the steps to reproduce:

  1. Run the whole notebook through
pipeline = Pipeline(workspace=ws, steps=[best_run_step])
pipeline_run = exp.submit(pipeline)
pipeline_run.wait_for_completion()
  1. Submit the pipeline again, with the above cells, changing nothing else.
  2. HyperDriveStep will run again, ignoring allow_reuse.

I'm interested to see what you get on your side!

@swanderz for some of the services such as HD or AutoML, they possibly produce similar output with same config but not exactly same afaik.
I will confirm with HD team and let you know.

actually regardless of HD produce same output or not allow_reuse needs to skip HD runs so that user can control it. Will check why it didn't work.

@sonnypark I appreciate you also helping me get answers. Can you please run the reprex and see what you get? When I do it, the stdout says something along the lines of hd_step01 will run and generate new outputs and I wait ~12 minutes for the step to complete, which its a lot longer than I would expect if the step is being reused.
image

Yes "hd_step01 will run and generate new outputs" is what we get when the step is not reused. Will try to repro.

synced with @clmccart 95% chance this is related to #734.

734 is unrelated. In the case of hyperdrive, there is a change in hyperdrive config setting which is preventing the reuse

Any update on this?

Fix has been checked in and will be available on the next release (which will be next week).

@sonnypark good to hear. in the next release, will the changes in the experimental branch also be included? I'd like to have this issue and #771 fixed in the same release.

the experimental branch?

I realized that Experimental release is the different name of Release cut.
Yes, the fix has been added to the latest release cut yesterday.
AzureML SDK Experimental Release Candidate is 1.1.0a1.dev0
And #771 fix should be in the release too.

@sonnypark @rastala @purnesh42H @sanpil @nemasobhani
remember this bug? it's back....
here's the reprex
here are the package versions I'm using.

azureml-automl-core                  1.1.5.1
azureml-contrib-notebook             1.1.5.1
azureml-core                         1.1.5.4
azureml-dataprep                     1.3.2
azureml-dataprep-native              14.1.0
azureml-interpret                    1.1.5
azureml-pipeline                     1.1.5
azureml-pipeline-core                1.1.5
azureml-pipeline-steps               1.1.5
azureml-sdk                          1.1.5
azureml-telemetry                    1.1.5.3
azureml-train                        1.1.5
azureml-train-automl-client          1.1.5.1
azureml-train-core                   1.1.5
azureml-train-restclients-hyperdrive 1.1.5
azureml-widgets                      1.1.5

Alt Text

@swanderz I can see 'finished' run on the link you posted with new step created without reusing. But I guess you already ran the same notebook before with no changes.
Any changes on your packages from the ones which was able to reuse?

Oh I found that you ran the same job twice in the same notebook!

Yeah to prove the bug. I submit the run twice. In the 2nd PipelineRun, we should see that the HyperDriveStep is not re-run... but we don't.

I want to say that the bug does not happen in 1.1.0a1.dev0, but does happen in 1.1.5.1

looks strange to me. It looks the second step is reused:
Created step get best run [31c67e94][a9842d81-7995-43c1-bb55-df20aa92e1b0], (This step is eligible to reuse a previous run's output)
Created step hd_step01 [54aa5211][a10bf668-d90f-44a4-9524-8759d419aa95], (This step will run and generate new outputs)

@swanderz did you actually could see the second run took same amount of time to run the job?

I was able to reuse running the published HyperDriveStep notebook. You can check the run was really reused in portal:
image

@sonnypark I appreciate you helping out again. Are you 100% that you have the same environment as me? Can you run pip list | grep 'azureml' and share the results? It is very frustrating that I can't help you to reproduce the error on your end. In our workflows, HyperDriveStep is sporadically reused. @sanpil @yanrez what can I do to determine if a step will be re-used beyond the console stdout and the portal info?

All the PipelineRuns

image

All HyperdriveStepRuns

Here's the StepRuns that corresponds to past 3 HyperDriveStep's that have run.
image

1st PipelineRun 62

this is the first time i ran the pipeline in two months.
note that for hd_step01:

  • duration is 9m 15s
  • "Reuse" is "No"
    image

2nd PipelineRun 71

This is a run i submitted immediately after the first. I'm 97% sure I didn't change anything, just submitted again. Note:

  • duration is 10m 28s
  • "Reuse" is again "No"
    image

3rd PipelineRun 80 (6 hours later)

I submitted this after seeing @sonnypark 's reply. Still didn't re-use... Strange. Note:

  • duration is 8m 57s
  • "Reuse" is again "No"
    image

4th PipelineRun 89

Now it's working?? Note:

  • duration is 0s
  • "Reuse" is finally "Yes"
    image

Considering you were able to reuse on the 4th run, I suspect either

  1. backend service ended up rerunning even with the same fingerprint for the module hash calculation or
  2. There were some unnoticed changed on 2/3rd run such as path or contents of files.
    I can take a look at the backend side to see any possible reasons for such. Is there any possibility on your side for such changes you didn't noticed?

@sonnypark to my knowledge, nothing changed. Though this could an example of me fundamentally misunderstanding how allow_reuse works and what's entailed. One idea is that I made a Git commit on a file that has nothing to do with the step... would that trigger a new run?

What would help me to debug is:

  • a list of every factor that goes into whether the step will be re-used, and
  • a way for me to do the module hashing locally so I can tell what verify if the step will be reused or not before I submit my two hour pipelines.

I think git commit can be a potential change can lead to different fingerprint in that any changes on the source_directory might result in different fingerprint. Since you were able to reuse on 4th run I don't think it is python package related.
Afaik, any changes on source_dir including file content changes, path, file name can result in different fingerprint.

I need help manually verifying:

  • are the shapshots/hashes unique from run to run?
  • if so, what is causing the snapshots/hashes to be different? I can't think of anything that's different.

Working with Sonny, we're narrowed things down to the following testable hypotheses.
Does allow_reuse work for HyperdriveStep if:

  1. I re-submit the exact same pipeline object, after the first PipelineRun completes?
  2. Re-create the pipeline object with line below, then submit again?
    pipeline = Pipeline(workspace=ws, steps=[best_run_step])
  3. restart my kernel then re-run the notebook?

To test these hypotheses:

  1. clone my fork of MachineLearningNotebooks (I've directly committed the necessary scripts and data)
  2. Ensure the data is in default datastore with the 'mnist' prefix (if not follow the original version of the notebook to do so.)
  3. Ensure you environment matches
  4. add your own config.json
  5. Run the entire notebook (Run->"Run All Cells") to test hypotheses 1&2.
  6. Kernel->Restart Kernel and Run All Cells to test hypothesis 3.

i've updated the reproducible example to ensure that within the notebook no changes are being made to the scripts nor the underlying data.

Again, here's my environment for reference:

azureml-automl-core                  1.1.5.1
azureml-contrib-notebook             1.1.5.1
azureml-core                         1.1.5.4
azureml-dataprep                     1.3.2
azureml-dataprep-native              14.1.0
azureml-interpret                    1.1.5
azureml-pipeline                     1.1.5
azureml-pipeline-core                1.1.5
azureml-pipeline-steps               1.1.5
azureml-sdk                          1.1.5
azureml-telemetry                    1.1.5.3
azureml-train                        1.1.5
azureml-train-automl-client          1.1.5.1
azureml-train-core                   1.1.5
azureml-train-restclients-hyperdrive 1.1.5
azureml-widgets                      1.1.5

We had a sync up for the issue. I mentioned 3 things I found:

  1. source_directory should not be current directory if notebook is used in the same directory to avoid the change in the source_directory
  2. HDstep uses snapshotId as part of HDConfig which is part of reuse hash calculation.
  3. It turns out snapshot creation has limit on its cache so that even same path provided without any changes results in different snapshotId.

In summary, combination of 2 and 3 above, sometimes HDStep is reused sometimes not depends on creation of snapshot from other steps.
So obvious fix on HDStep side is removing snapshotId from the hash calcuation so that reuse doesn't depends on snapshot creation. Since HDStep already checking the contents of the source_dir any changes on source_dir would still be part of reuse hash calculation.

Bug has been created to change it.
https://msdata.visualstudio.com/vienna/_queries/edit/674431/?triage=true

bug referenced has been resloved

@sanpil @sonnypark after some testing, I'm 97% sure the bug is fixed! Thanks for your hard work.

Thanks. I am worried about that 3% :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vineetgarhewal picture vineetgarhewal  路  3Comments

tkawchak picture tkawchak  路  5Comments

nswitanek picture nswitanek  路  4Comments

wagenrace picture wagenrace  路  3Comments

BillmanH picture BillmanH  路  5Comments