Pyro: Incomplete download of MNIST data

Created on 21 May 2019  路  9Comments  路  Source: pyro-ppl/pyro

Description

make test-examples fails to fully download mnist data dependency for the air example

Details

On osx==10.14.1 and docker==18.09.2

git checkout 3ac5a02e0e6b0a11ae796707413c11df2c14ff6b
make build pyro_branch=dev pytorch_branch=release python_verion=3.6
make run pyro_branch=dev pytorch_branch=release python_verion=3.6
cd pyro
pip install numpy==1.15
pip install scipy==1.2
make test-examples > output 2>&1

Output

Resulting files in .data

ls -lh -rw-r--r--  1 matt  staff    19M May 21 15:03 train-images-idx3-ubyte.gz.part

Most helpful comment

I wonder if the original issue @mattwescott is running into is a race condition that occurs when trying to fetch a data set from multiple processes? Trying this locally, if I clear out the directory in which the VAE example caches data (rm -rf examples/vae/utils/.data/) and run pytest tests/test_examples.py -k 'vae/vae.py' then I see a success. However, clearing the cached data and running pytest -n auto tests/test_examples.py -k 'vae/vae.py' results in a failure that looks quite similar to the one reported. (Something similar happens with the AIR example.) @neerajprad Is it possible that this could be the problem without it biting us elsewhere?

All 9 comments

I think the observations library is trying to download the files from http://yann.lecun.com/exdb/mnist/. Is this an intermittent failure, or does it fail on multiple retries? I do not have a better solution for this other than us moving away from observations.

@null-a - I think only the AIR example is using observations. Do we have any other sources for the multi-MNIST dataset? The dataset generation code seems straightforward, so if not, we can write our own data loader for this. Also see #1871.

It seemed to fail on multiple retries. And IIRC the same error came up outside of docker yesterday.

Looks like the download might be coming from here...

https://github.com/pyro-ppl/pyro/blob/835aa31f141a6f1d5a7b3a50f0702e5acbb8a789/pyro/contrib/examples/util.py#L15

The original output seemed to indicate that the download was successful, with the load attempt then failing because .part was all that made it into the folder.

Looks like the download might be coming from here...

The AIR example does not use torchvision.datasets and the error is coming from the observations library download, which seems to be a google storage link mirroring the original site.

I'm not sure what's going on here, this is working consistently for me at the moment. (dev branch, no docker, Mac OS.) Out of interest, are you able to grab the file that failed manually, with something like curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz? If so perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

Do we have any other sources for the multi-MNIST dataset?

@neerajprad I'm not aware of a canonical version of this, everyone seems to roll their own from MNIST, and of those I don't know of any that are hosted in a suitable way.

The dataset generation code seems straightforward

That's right. In my original implementation I just generated the data on demand from the MNIST data available via pytorch, only later was this switched over to observations. Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like? (Also happy for someone else to do it, of course.)

This time I saved the initial error output, which might be more helpful

$ make build pyro_branch=dev pytorch_branch=release python_version=3.6
$ make run pyro_branch=dev pytorch_branch=release python_version=3.6
$ pip install numpy==1.15.0 #  prevent binary warnings that are escalated into errors #1873
$ curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz 
$ echo $?
0
$ make test-examples > make_text_examples.txt 2>&1

make_text_examples.txt

If so perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

Not blocked by this but I'm reluctant to add more environment workarounds. For now I'll remove -x from make test, so it doesn't bail on the first failure. Seems easier to track specific test failures than environment state.

As an aside, I've seen it work well to run the CI tests inside the docker container(s). IIRC, we factored out a base container that is rebuilt infrequently, and on each push, rebuilt the application container with --no-cache to closely match a fresh dev environment, running the tests using the same flow and commands that were used locally.

Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like? (Also happy for someone else to do it, of course.)

Thanks, @null-a, that will be great! The observations library is deprecated in favor of tensorflow datasets which doesn't have multi MNIST at the moment, and it might be best to avoid adding tensorflow as a dependency anyways. I think we can just reinstate your original dataset creation utility in pyro.contrib.examples module. We already have the MNIST dataset in S3 backed by cloudfront at this url. Could you add a PR reinstating your earlier code? I'll be happy to contribute any fixes required, and get it merged.

I wonder if the original issue @mattwescott is running into is a race condition that occurs when trying to fetch a data set from multiple processes? Trying this locally, if I clear out the directory in which the VAE example caches data (rm -rf examples/vae/utils/.data/) and run pytest tests/test_examples.py -k 'vae/vae.py' then I see a success. However, clearing the cached data and running pytest -n auto tests/test_examples.py -k 'vae/vae.py' results in a failure that looks quite similar to the one reported. (Something similar happens with the AIR example.) @neerajprad Is it possible that this could be the problem without it biting us elsewhere?

@null-a : I think you are right, and that is exactly what is happening in this case since we have multiple tests for each example (there are two for AIR). This probably leads to a race condition when two subprocesses are trying to download to the same directory. I think we should just change our make test-examples to not run pytest with xdist, and given that our examples are compute heavy, I doubt if there is much benefit to distributing them across different cores anyways. I will test and make this change shortly.

EDIT: Running with xdist halves the amount of time it takes to run test-examples on my laptop with 12 cores, from 21 minutes to 10 minutes. If there was a way to group tests to run on the same worker using xdist, that would be ideal, but the support for this feature is lacking. See https://github.com/pytest-dev/pytest-xdist/issues/365. I think it is best if we change the default to not use xdist so that new users aren't tripped. There are other ways around this, for instance by populating all the datasets beforehand, but not sure if it is worth spending too much time on this.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

neerajprad picture neerajprad  路  4Comments

fritzo picture fritzo  路  5Comments

robsalomone picture robsalomone  路  4Comments

fehiepsi picture fehiepsi  路  3Comments

jpchen picture jpchen  路  5Comments