I can't for the life of me access a FileDataset within an estimator. I've tried mounting and downloading to no avail. Might be because my dataset is ~1500 images within a folder that has ~300k images? @MayMSFT @rastala
DataPath list of length ~1500
import argparse
import datetime
import time
import os
from pprint import pprint
from azureml.core import Run

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_folder',
                        dest='data_folder',
                        default="C:\\\\temp")
    parser.add_argument('--random_seed',
                        dest='random_seed',
                        type=int,
                        default=None)
    parser.add_argument('--n_rows',
                        dest="n_rows",
                        type=int)
    args = parser.parse_args()
    print("all args:")
    pprint(vars(args))
    run = Run.get_context()
    print('data_folder', args.data_folder)
    print('input dataset', run.input_datasets)
    dataset = run.input_datasets['ds_file']
    print('input dataset["ds_file"]', run.input_datasets['ds_file'])
    print('dataset listdir')
    print(os.listdir(dataset))
compute_target = get_or_create_compute(
    ws,
    compute_target_name=make_resource_name(prefix, name, max_len=16),
    vm_size='Standard_NC24',
    max_nodes=4,
    idle_seconds_before_scaledown=600
)
ds_file = Dataset.get_by_name(ws, name='RoadkillSubset', version=3)
estimator = Estimator(
    source_directory='./compute',
    entry_script='mount_dataset.py',
    script_params={
        '--data_folder': ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')},
    inputs=[ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')],
    compute_target=compute_target,
    pip_packages=['azureml-dataprep[fuse]==1.1.33']
)
run = experiment.submit(estimator)
70_driver_log.txt excerpt
Stuck here for almost an hour
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 152
Enter __enter__ of DatasetContextManager
Processing 'ds_file'
Processing dataset FileDataset
{
"source": [
"('corpcarinput', 'Roadkill/v1/MVRK-02FEB2019/MVRK-02FEB2019-CPU/MVRK-02FEB2019-PHOTOS/100RECNX/RCNX0001.JPG')",
~~~~~~~~~~~~ OMITTED ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
],
"definition": [
"GetDatastoreFiles"
],
"registration": {
"id": "e1f8aedd-6e5a-4ed6-99c3-4d71ca9b8d8a",
"name": "RoadkillSubset",
"version": 3,
"description": "v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things",
"workspace": "Workspace.create(name='corpcarml', subscription_id='c5dad506-4ded-408d-9433-d39c6497b759', resource_group='CorpusCarcass')"
}
}
Downloading ds_file to ./files
/azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/python3.6/site-packages/azureml/dataprep/api/dataflow.py:689: UserWarning: Please install pyarrow>=0.11.0 for improved performance of to_pandas_dataframe. You can ensure the correct version is installed by running: pip install azureml-dataprep[pandas].
warnings.warn('Please install pyarrow>=0.11.0 for improved performance of to_pandas_dataframe. '
70_driver_log.txt excerpt
Manually cancelled after 10 min.
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_ff5f64206ec4de970c43d700cd178866/lib/libtinfo.so.5: no version information available (required by bash)
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 153
Enter __enter__ of DatasetContextManager
Processing 'ds_file'
Processing dataset FileDataset
{
"source": [
"('corpcarinput', 'Roadkill/v1/MVRK-02FEB2019/MVRK-02FEB2019-CPU/MVRK-02FEB2019-PHOTOS/100RECNX/RCNX0001.JPG')",
~~~~~~~ omitted ~~~~~~~~~~~~~~~~~~~~~~~~~
],
"definition": [
"GetDatastoreFiles"
],
"registration": {
"id": "e1f8aedd-6e5a-4ed6-99c3-4d71ca9b8d8a",
"name": "RoadkillSubset",
"version": 3,
"description": "v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things",
"workspace": "Workspace.create(name='corpcarml', subscription_id='c5dad506-4ded-408d-9433-d39c6497b759', resource_group='CorpusCarcass')"
}
}
Mounting ds_file to /Dataset/ds_file
Mounted ds_file to /Dataset/ds_file
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
all args:
{'data_folder': '/Dataset/ds_file', 'n_rows': None, 'random_seed': None}
data_folder /Dataset/ds_file
input dataset {}
input dataset["ds_file"] /Dataset/ds_file
dataset listdir
Per @lostmygithubaccount's suggestion, I tried the following, to no avail:
inputs=[ ds_file.as_named_input('ds_file').as_download(path_on_compute='/files/')],
instead of
inputs=[ ds_file.as_named_input('ds_file').as_download(path_on_compute='./files')],
What's now driving me crazy is that doing the following works where 1<n<=600, but fails at 700.
inputs=[ ds_file.take(n).as_named_input('ds_file').as_download(path_on_compute='/files/')],
The files are all in a blob container, corpcarinput, which has a "folder" in it, Roadkill/v1/, that holds ~300k photos. Rather than load all of the images, I made a subset of 1600 images and created a Dataset, RoadkillSubset, from an array of DataPaths, where each photo has its own data path, e.g.
ds_file = Dataset.File.from_files(
    path=[tuple([datastore, i]) for i in df_labels['FilePath']]
)
ds_file = ds_file.register(
    workspace=ws,
    name='RoadkillSubset',
    description='v3) clean up tags & no empty shots v2) larger set 1600 have clear carcass v1) the original 2,000 subset of photos before Becca cleaned things',
    create_new_version=True
)
As I grasp at straws trying to fix this, I thought to make a Dataset, Roadkill, that refers to the whole folder of files, and to compare performance between the two. My suspicion about my dataset's construction is reinforced when I look at the timings of these two cells.
%%timeit
# Roadkill dataset is all ~300k photos given by RelativePath: 'Roadkill/v1/**'
ds_file = (Dataset.get_by_name(workspace, name='Roadkill', version=1)
.take(100)
.download(target_path='./tmp/', overwrite=True))
13.8 s ± 3.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# RoadkillSubset is ~1500 photos given by an array of DataPaths, where each photo has its own data path
ds_file = (Dataset.get_by_name(workspace, name='RoadkillSubset', version=3)
.take(100)
.download(target_path='./tmp/', overwrite=True))
39 s ± 1.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
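To put the two timings above side by side, here is a small back-of-the-envelope calculation. It only uses the mean values reported by %%timeit (13.8 s and 39 s for 100 files each); the variable names are made up for illustration, and the spread (± 3.91 s and ± 1.35 s) is ignored.

```python
# Mean wall-clock times reported by %%timeit above, for take(100).download(...)
glob_dataset_s = 13.8       # 'Roadkill': one RelativePath glob 'Roadkill/v1/**'
datapath_dataset_s = 39.0   # 'RoadkillSubset': one DataPath per photo
n_files = 100

per_file_glob = glob_dataset_s / n_files
per_file_datapath = datapath_dataset_s / n_files
slowdown = datapath_dataset_s / glob_dataset_s

print(f"{per_file_glob:.3f} s/file vs {per_file_datapath:.3f} s/file "
      f"({slowdown:.1f}x slower)")
```

So, on these means, the per-photo DataPath dataset is a bit under three times slower per file than the glob-based one.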
Hi Anders,
As a short-term mitigation, could you try creating multiple FileDatasets across the list of paths you'd like to download to the compute?
This function will split the list of paths into distinct sets:
def slice_per(source, step):
    return [source[i::step] for i in range(step)]
I recommend 50-100 files per FileDataset, as you've found it works for <=600 files approximately.
I'll be online to help you get this working tomorrow!
slices = slice_per(df_labels['FilePath'], 100)
file_datasets = []
for path_slice in slices:
    # One FileDataset per slice; each path is a (datastore, relative_path) tuple
    file_datasets.append(Dataset.File.from_files(
        path=[(datastore, i) for i in path_slice]
    ))
Cheers,
Lucas
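A quick way to sanity-check the slicing suggested above, without any Azure dependency, is to run the helper on a plain list. Note that slice_per(source, step) returns `step` interleaved slices, so the second argument is the number of datasets, not the number of files per dataset. The chunk_by_size helper and the file names below are hypothetical additions for illustration, for the case where a fixed number of files per FileDataset is wanted instead.

```python
def slice_per(source, step):
    """Split source into `step` interleaved slices (same helper as in the reply)."""
    return [source[i::step] for i in range(step)]


def chunk_by_size(source, size):
    """Hypothetical alternative: consecutive chunks of at most `size` items."""
    return [source[i:i + size] for i in range(0, len(source), size)]


# Made-up stand-in for df_labels['FilePath'] with 1600 entries
paths = [f"photo_{i:04d}.JPG" for i in range(1600)]

interleaved = slice_per(paths, 100)
chunks = chunk_by_size(paths, 100)

print(len(interleaved), len(interleaved[0]))  # 100 slices, 16 files each
print(len(chunks), len(chunks[0]))            # 16 chunks, 100 files each
```

With 1600 paths, slice_per(paths, 100) therefore gives 100 small datasets of 16 files, whereas chunking by size gives 16 datasets of 100 files; either stays within the 50-100 files per FileDataset ceiling mentioned in the reply.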
thanks for the workaround, @tot0 and @shuyums2. I had a local download of the 300k images so I just:
My (incorrect) assumptions that should probably be addressed in the documentation are:
FileDatasets could help me selectively sample a larger section of blob files,
DataPath as the FileDataset source will not result in a good time, and
FileDatasets with more than 100 files are not recommended.
I'm sorry you've had to slog through this trial-and-error process. Improving our recommendations around 'reasonable' or performant usage of FileDatasets is something we're actively working on. We have some fixes coming to mounting/downloading which improve the way we talk to blob, and should handle cases like this where blob is rate limiting us above a few hundred requests in a short time.
We expect mounting many files (1000s) to work for your use case of only accessing a subset. The problem arises when doing a listdir, since we need to ask blob for details of every single file at once. We are confirming that your access pattern would work if no listdir was done.
makes sense. i was just calling os.listdir() to try and debug what was going on.
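For debugging without os.listdir(), one option is to probe a handful of paths that are already known (for example from df_labels['FilePath']) instead of listing the whole mount. The sketch below is only an illustration: check_known_files is a hypothetical helper, the temporary directory stands in for the real mount point, and whether per-file checks actually avoid the blob throttling on a real mount is an assumption based on the explanation above, not something verified here.

```python
import os
import tempfile


def check_known_files(mount_root, relative_paths):
    """Map each relative path to its size in bytes, or None if it is missing.

    Only touches the named files; never lists the directory.
    """
    results = {}
    for rel in relative_paths:
        full = os.path.join(mount_root, rel)
        results[rel] = os.path.getsize(full) if os.path.isfile(full) else None
    return results


# Local stand-in for the mount point (on the compute this would be the
# path the run receives via --data_folder)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "100RECNX"))
    with open(os.path.join(root, "100RECNX", "RCNX0001.JPG"), "wb") as f:
        f.write(b"\xff\xd8")  # two placeholder bytes, not a real photo
    demo = check_known_files(
        root, ["100RECNX/RCNX0001.JPG", "100RECNX/MISSING.JPG"]
    )

print(demo)
```

Printing a few sizes like this confirms that the files are reachable while keeping the number of requests proportional to the files actually checked.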
@swanderz
As Lucas replied above, we have a few perf improvements in flight, and they will be released in January. I will update you once the fix is available in the public SDK.
In the meantime, we are benchmarking and testing dataset performance. We will update our public docs with dataset best practices once the benchmarking is done.
Thank you!
@swanderz
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and @YutongTie-MSFT and we will gladly continue the discussion.