MachineLearningNotebooks: Using a FileDataset within Estimator

Created on 19 Dec 2019 · 8 Comments · Source: Azure/MachineLearningNotebooks

I can't for the life of me access a FileDataset within an estimator. I've tried mounting and downloading to no avail. Might be because my dataset is ~1500 images within a folder that has ~300k images? @MayMSFT @rastala

Theories:

I'm not being patient enough.
The FileDataset is defined by a list of ~1500 DataPaths.
Chaos monkeys

mount_dataset.py

import argparse
import datetime
import time
import os
from pprint import pprint

from azureml.core import Run

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_folder',
                        dest='data_folder',
                        default="C:\\\\temp")
    parser.add_argument('--random_seed',
                        dest='random_seed',
                        type=int,
                        default=None)
    parser.add_argument('--n_rows',
                        dest="n_rows",
                        type=int)

    args = parser.parse_args()
    print("all args:")
    pprint(vars(args))

    run = Run.get_context()

    print('data_folder', args.data_folder)

    print('input dataset', run.input_datasets)

    dataset = run.input_datasets['ds_file']
    print('input dataset["ds_file"]', run.input_datasets['ds_file'])

    print('dataset listdir')
    print(os.listdir(dataset))

control plane

compute_target = get_or_create_compute(
    ws,
    compute_target_name=make_resource_name(prefix, name, max_len = 16),
    vm_size='Standard_NC24',
    max_nodes=4,
    idle_seconds_before_scaledown=600
)

ds_file = Dataset.get_by_name(ws, name='RoadkillSubset', version=3)

estimator = Estimator(
    source_directory='./compute',
    entry_script='mount_dataset.py',
    script_params={
        '--data_folder': ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')},
    inputs=[  ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')],
    compute_target=compute_target,
    pip_packages=['azureml-dataprep[fuse]==1.1.33']
        )
run = experiment.submit(estimator)

1) Attempt at Downloading `70_driver_log.txt` excerpt

Stuck here for almost an hour

bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 152
Enter __enter__ of DatasetContextManager
Processing 'ds_file'
Processing dataset FileDataset
{
  "source": [
    "('corpcarinput', 'Roadkill/v1/MVRK-02FEB2019/MVRK-02FEB2019-CPU/MVRK-02FEB2019-PHOTOS/100RECNX/RCNX0001.JPG')",
    ~~~~~~~~~~~~ OMITTED ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "e1f8aedd-6e5a-4ed6-99c3-4d71ca9b8d8a",
    "name": "RoadkillSubset",
    "version": 3,
    "description": "v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things",
    "workspace": "Workspace.create(name='corpcarml', subscription_id='c5dad506-4ded-408d-9433-d39c6497b759', resource_group='CorpusCarcass')"
  }
}
Downloading ds_file to ./files
/azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py:689: UserWarning: Please install pyarrow>=0.11.0 for improved performance of to_pandas_dataframe. You can ensure the correct version is installed by running: pip install azureml-dataprep[pandas].
  warnings.warn('Please install pyarrow>=0.11.0 for improved performance of to_pandas_dataframe. '

2) Attempt at Mounting `70_driver_log.txt` excerpt

manually cancelled after 10 min.

bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 153
Enter __enter__ of DatasetContextManager
Processing 'ds_file'
Processing dataset FileDataset
{
  "source": [
    "('corpcarinput', 'Roadkill/v1/MVRK-02FEB2019/MVRK-02FEB2019-CPU/MVRK-02FEB2019-PHOTOS/100RECNX/RCNX0001.JPG')",
    ~~~~~~~ omitted ~~~~~~~~~~~~~~~~~~~~~~~~~
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "e1f8aedd-6e5a-4ed6-99c3-4d71ca9b8d8a",
    "name": "RoadkillSubset",
    "version": 3,
    "description": "v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things",
    "workspace": "Workspace.create(name='corpcarml', subscription_id='c5dad506-4ded-408d-9433-d39c6497b759', resource_group='CorpusCarcass')"
  }
}
Mounting ds_file to /Dataset/ds_file
Mounted ds_file to /Dataset/ds_file
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
all args:
{'data_folder': '/Dataset/ds_file', 'n_rows': None, 'random_seed': None}
data_folder /Dataset/ds_file
input dataset {}
input dataset["ds_file"] /Dataset/ds_file
dataset listdir

awaiting-product-team-response cxp doc-enhancement triaged

swanderz

All 8 comments

Per @lostmygithubaccount's suggestion, I tried the following, to no avail:
inputs=[ ds_file.as_named_input('ds_file').as_download(path_on_compute='/files/')],
instead of
inputs=[ ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')],

swanderz on 19 Dec 2019

What's now driving me crazy is that doing the following works where 1<n<=600, but fails at 700.
inputs=[ ds_file.take(n).as_named_input('ds_file').as_download(path_on_compute='/files/')],

The files are all in the blob container corpcarinput, which has a "folder" in it, Roadkill/v1/, that holds ~300k photos. Rather than load all of the images, I made a subset of 1600 images and created a Dataset, RoadkillSubset, which is given an array of DataPaths, where each photo has its own data path, e.g.

ds_file = Dataset.File.from_files(
    path=[tuple([datastore, i]) for i in df_labels['FilePath']]
)
ds_file = ds_file.register(
    workspace=ws,
    name='RoadkillSubset',
    description='v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things',
    create_new_version=True
)

As I grasp at straws trying to fix this, I thought to make a Dataset, Roadkill that refers to the whole folder of files and compare performance between them. My suspicion about my dataset's construction is reinforced when I look at the performance of these two lines.

%%timeit
# Roadkill dataset is all ~300k photos given by RelativePath: 'Roadkill/v1/**'
ds_file = (Dataset.get_by_name(workspace, name='Roadkill', version=1)
           .take(100)
           .download(target_path='./tmp/', overwrite=True))

13.8 s ± 3.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# RoadkillSubset is ~1500 photos given by an array of DataPaths, where each photo has its own data path
ds_file = (Dataset.get_by_name(workspace, name='RoadkillSubset', version=3)
           .take(100)
           .download(target_path='./tmp/', overwrite=True))

39 s ± 1.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

swanderz on 20 Dec 2019

Hi Anders,

As a short-term mitigation, could you try creating multiple FileDatasets across the list of paths you'd like to download to the compute?
This function will split the list of paths into distinct sets:

def slice_per(source, step):
    return [source[i::step] for i in range(step)]

I recommend 50-100 files per FileDataset, as you've found it works for <=600 files approximately.
I'll be online to help you get this working tomorrow!

slices = slice_per(df_labels['FilePath'], 100)
file_datasets = []
for path_slice in slices:
    # each entry is a (datastore, relative_path) tuple
    file_datasets.append(Dataset.File.from_files(
        path=[(datastore, i) for i in path_slice]
    ))

Cheers,
Lucas

tot0 on 20 Dec 2019
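
A minimal sketch of how the `slice_per` helper above divides a list, using plain Python only. Note that `step` is the number of slices, not the slice length, so with roughly 1600 paths `slice_per(paths, 100)` yields 100 groups of 16. If fixed-size groups are wanted (for example the 50-100 files per FileDataset suggested above), a contiguous chunker is one option. `chunk_by_size` and the sample paths are hypothetical names introduced for illustration, not part of the thread or the SDK.

```python
def slice_per(source, step):
    """Stride-based split: returns `step` slices (not slices of length `step`)."""
    return [source[i::step] for i in range(step)]


def chunk_by_size(source, size):
    """Contiguous split: every chunk holds at most `size` items."""
    return [source[i:i + size] for i in range(0, len(source), size)]


# Hypothetical stand-in for df_labels['FilePath']: 1600 made-up blob paths.
paths = [f"Roadkill/v1/photo_{i:04d}.JPG" for i in range(1600)]

strided = slice_per(paths, 100)
chunked = chunk_by_size(paths, 100)

print(len(strided), len(strided[0]))  # 100 slices of 16 paths each
print(len(chunked), len(chunked[0]))  # 16 chunks of 100 paths each
```

Either split covers every path exactly once; they differ only in how many FileDatasets would be created and how many files each would hold.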

thanks for the workaround, @tot0 and @shuyums2. I had a local download of the 300k images so I just:

copied the 1600 subset into its own folder,
uploaded them to their own blob folder,
registered a new version of the dataset pointing to that.
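
The first of those steps can be sketched with the standard library alone. This is an illustration under assumptions, not the script actually used: `copy_subset`, the folder names, and the sample file names are hypothetical, and the demo runs against throwaway temp folders rather than the real photos.

```python
import shutil
import tempfile
from pathlib import Path


def copy_subset(src_root, relative_paths, dest_root):
    """Copy the listed files from src_root into dest_root, keeping sub-folders."""
    src_root, dest_root = Path(src_root), Path(dest_root)
    copied = []
    for rel in relative_paths:
        target = dest_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_root / rel, target)
        copied.append(target)
    return copied


# Demo with temporary folders standing in for the local photo download.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "all_photos"
    dest = Path(tmp) / "subset"
    for name in ["a/1.JPG", "a/2.JPG", "b/3.JPG"]:
        (src / name).parent.mkdir(parents=True, exist_ok=True)
        (src / name).write_bytes(b"x")
    copied = copy_subset(src, ["a/1.JPG", "b/3.JPG"], dest)
    copied_names = sorted(p.relative_to(dest).as_posix() for p in copied)

print(copied_names)  # ['a/1.JPG', 'b/3.JPG']
```

Once the subset sits in its own folder, the dataset can point at that single folder path (in the same style as the 'Roadkill/v1/**' path used for the full set) instead of a long list of per-file DataPaths.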

My (incorrect) assumptions that should probably be addressed in the documentation are:

FileDatasets could help me selectively sample a larger section of blob files,
arbitrarily long arrays of DataPath as the FileDataset source will not result in a good time, and
FileDatasets with more than 100 files are not recommended.

swanderz on 20 Dec 2019

I’m sorry you’ve had to slog through this trial-and-error process. Improving our recommendations around ‘reasonable’ or performant usage of FileDatasets is something we’re actively working on. We have some fixes coming to mounting/downloading that improve the way we talk to blob and should handle cases like this, where blob is rate limiting us above a few hundred requests in a short time.

We expect mounting many files (1000s) to work for your use case of only accessing a subset. The problem arises when doing a listdir, since we need to ask blob for details of every single file at once. We are confirming that your access pattern would work if no listdir was done.

tot0 on 20 Dec 2019
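
A minimal sketch of the access pattern described above: opening files by paths that are already known instead of calling `os.listdir` on the mount point. It assumes the relative layout under the mount point is known in advance (for example from the same list used to define the dataset), which the thread does not confirm. `known_paths` and `read_first_bytes` are hypothetical helpers, and a temp folder stands in for the mount so the snippet runs anywhere.

```python
import os
import tempfile


def known_paths(mount_point, relative_paths):
    """Build full paths from names we already have, without listing the folder."""
    return [os.path.join(mount_point, rel) for rel in relative_paths]


def read_first_bytes(path, n=16):
    """Open a single file directly by its path and read the first n bytes."""
    with open(path, "rb") as f:
        return f.read(n)


# A temp folder stands in for the mounted dataset (e.g. /Dataset/ds_file).
with tempfile.TemporaryDirectory() as mount_point:
    rel = os.path.join("100RECNX", "RCNX0001.JPG")
    os.makedirs(os.path.join(mount_point, "100RECNX"))
    with open(os.path.join(mount_point, rel), "wb") as f:
        f.write(b"not-a-real-jpeg")

    full_paths = known_paths(mount_point, [rel])
    header = read_first_bytes(full_paths[0], n=5)

print(header)  # b'not-a'
```

Whether this avoids the per-file metadata requests on a real mount is the point the product team says it is still confirming.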

makes sense. i was just calling os.listdir() to try and debug what was going on.

swanderz on 20 Dec 2019

@swanderz
As Lucas replied above, we have a few perf improvements in flight, and they will be released in January. I will update you once the fix is available in the public SDK.

In the meantime, we are benchmarking and testing dataset perf. We will update our public docs with dataset best practices once the benchmarking is done.

Thank you!

MayMSFT on 27 Dec 2019

@swanderz
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and tag @YutongTie-MSFT, and we will gladly continue the discussion.