Vision: Downloading MNIST dataset with torchvision gives HTTP Error 403

Created on 4 Mar 2020  ·  15 Comments  ·  Source: pytorch/vision

🐛 Bug

I'm getting a 403 error when I try to download MNIST dataset with torchvision 0.4.2.

To Reproduce

../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:68: in __init__
    self.download()
../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:135: in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:248: in download_and_extract_archive
    download_url(url, download_root, filename, md5)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:96: in download_url
    raise e
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:84: in download_url
    reporthook=gen_bar_updater()
/usr/local/lib/python3.6/urllib/request.py:248: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/usr/local/lib/python3.6/urllib/request.py:223: in urlopen
    return opener.open(url, data, timeout)
/usr/local/lib/python3.6/urllib/request.py:532: in open
    response = meth(req, response)
/usr/local/lib/python3.6/urllib/request.py:642: in http_response
    'http', request, response, code, msg, hdrs)
/usr/local/lib/python3.6/urllib/request.py:570: in error
    return self._call_chain(*args)
/usr/local/lib/python3.6/urllib/request.py:504: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7efbf9edaac8>
req = <urllib.request.Request object at 0x7efbf9eda8d0>
fp = <http.client.HTTPResponse object at 0x7efbf9edaf98>, code = 403
msg = 'Forbidden', hdrs = <http.client.HTTPMessage object at 0x7efbf9ea22b0>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 403: Forbidden

Environment

  • torch==1.3.1
  • torchvision==0.4.2

Additional context

https://app.circleci.com/jobs/github/PyTorchLightning/pytorch-lightning/6877

bug help wanted datasets

All 15 comments

Thanks for reporting! I can reproduce the issue locally, and downloading from the browser works.

I don't yet know what the root cause is though.

I think we might need to pass a header in the download_url function https://github.com/pytorch/vision/blob/c3e2b018517dedcbda18462f5d3e62e1fd913003/torchvision/datasets/utils.py#L59-L100 according to https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
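As a rough illustration of what "passing a header" could look like, here is a minimal standard-library sketch. The function name download_with_user_agent and its parameters are hypothetical and are not part of torchvision; the real download_url also handles progress reporting and checksum verification, which this sketch leaves out.

```python
import shutil
import urllib.request


def download_with_user_agent(url, fpath, user_agent="Mozilla/5.0"):
    """Fetch url into fpath, sending an explicit User-Agent header.

    Hypothetical helper for illustration only; not torchvision's API.
    """
    # Attach the header to this one request instead of changing global state.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response, open(fpath, "wb") as out:
        shutil.copyfileobj(response, out)
```

Unlike installing a global opener, this only affects the single request it builds, so other code in the same process keeps the default behaviour.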

cc @cpuhrsch @vincentqb @zhangguanheng66 for awareness

This is because the download links for MNIST at https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py#L33-L36 are hosted on yann.lecun.com, and that server has moved under Cloudflare protection.
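The suspected failure mode can be reproduced offline with a small local server. This is only a sketch under an assumption: PickyHandler is a hypothetical stand-in that rejects urllib's default User-Agent, and the actual rule applied by the real host is not known from this thread. fetch_status and run_demo are likewise illustrative names.

```python
import http.server
import threading
import urllib.error
import urllib.request


class PickyHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical server that answers 403 to urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 403 if agent.startswith("Python-urllib") else 200
        body = b"ok" if status == 200 else b"forbidden"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_status(url, user_agent=None):
    """Return the HTTP status for url, optionally overriding the User-Agent."""
    # An empty ProxyHandler keeps the request local even if proxy variables are set.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    if user_agent is not None:
        opener.addheaders = [("User-agent", user_agent)]
    try:
        with opener.open(url) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_demo():
    """Start the local server and query it with and without a browser-like agent."""
    server = http.server.HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        return fetch_status(url), fetch_status(url, "Mozilla/5.0")
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns the status seen with the default agent followed by the status seen with 'Mozilla/5.0', which mirrors the difference between torchvision's download and a browser download reported above.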

@fmassa we may need to mirror the files and change the URLs, perhaps to the PyTorch S3 bucket.

so could we make a hot-fix somehow?

@Borda I haven't tried the hotfix I mentioned, but I think it might work. Would you be able to try it and send a PR? Otherwise I'll look into it early next week (I'm working towards the ECCV deadline tomorrow).

And I would rather avoid hosting the datasets ourselves, as this would set a precedent for us storing datasets.

Is there any way to have a quick fix without using master?
I am concerned about the changes I would have to make in my code to go from the version I am using (1.4.0) to master.

@eduardo4jesus You can explicitly add headers as stated above, something like:

opener = urllib.request.URLopener()
opener.addheader('User-Agent', some_user_agent)
opener.retrieve(
    url, fpath,
    reporthook=gen_bar_updater()
)

(line 81 and onwards in vision/torchvision/datasets/utils.py). Seems to be a quick workaround that works.

@eduardo4jesus You could patch your model script at the top using:

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

It will use that user agent for the entire script assuming the opener does not get overwritten somewhere else.

To make it work for Python 2 as well:

import urllib
try:
    # For Python 2
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/5.0"

    urllib._urlopener = AppURLopener()
except AttributeError:
    # For Python 3: the request submodule must be imported explicitly
    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib.request.install_opener(opener)

So for Python 3 I now use the following snippet:

from torchvision import datasets
import torchvision.transforms as transforms
import urllib.request

num_workers = 0
batch_size = 20
basepath = 'some/base/path'
transform = transforms.ToTensor()

def set_header_for(url, filename):
    opener = urllib.request.URLopener()
    opener.addheader('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
    opener.retrieve(url, f'{basepath}/{filename}')

set_header_for('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'train-labels-idx1-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz')
train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=False, transform=transform)

You would need to modify the basepath variable, of course.
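For the pre-download approach to help, the files have to land where torchvision looks for them. The sketch below assumes the raw files are expected under <root>/MNIST/raw, which matches the torchvision versions discussed here as far as I know but should be checked against your install. The names raw_targets and prefetch are hypothetical, and the sketch uses urllib.request.Request instead of the legacy URLopener class.

```python
import os
import shutil
import urllib.request

MNIST_BASE_URL = "http://yann.lecun.com/exdb/mnist/"
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def raw_targets(root):
    """Map each MNIST URL to its destination under <root>/MNIST/raw (assumed layout)."""
    raw_dir = os.path.join(root, "MNIST", "raw")
    return {MNIST_BASE_URL + name: os.path.join(raw_dir, name) for name in MNIST_FILES}


def prefetch(root, user_agent="Mozilla/5.0"):
    """Download any missing raw file, sending an explicit User-Agent header."""
    for url, path in raw_targets(root).items():
        if os.path.exists(path):
            continue  # already fetched, nothing to do
        os.makedirs(os.path.dirname(path), exist_ok=True)
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response, open(path, "wb") as out:
            shutil.copyfileobj(response, out)
```

A call such as prefetch('data') followed by datasets.MNIST(root='data', download=True, ...) would then use the same root for both steps, instead of a separate basepath that has to be kept in sync by hand.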

I've just got the same problem. Waiting for the answer without changing codes... (ROOKIE ALERT)

Clone this to your working dir:
https://github.com/knamdar/data

The problem is that Yann LeCun’s site changed its host, if I got it right, and the new one checks whether the HTTP headers are set.

I currently work around with the following code:

from torchvision import datasets
import torchvision.transforms as transforms
import urllib.request

num_workers = 0
batch_size = 20
basepath = 'some/base/path'
transform = transforms.ToTensor()

def set_header_for(url, filename):
    opener = urllib.request.URLopener()
    opener.addheader('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
    opener.retrieve(url, f'{basepath}/{filename}')

set_header_for('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'train-labels-idx1-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz')
train_data = datasets.MNIST(root='data', train=True,
                            download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                           download=False, transform=transform)

You need to change the base path, of course.

On 05.03.2020, at 05:26, Nikita Makarin notifications@github.com wrote:

I've the same issue when I'm trying to get datasets:

import torch
import torchvision
from torchvision import transforms, datasets

train = datasets.MNIST("", train=True, download=True,
                       transform=transforms.Compose([transforms.ToTensor()]))

test = datasets.MNIST("", train=False, download=True,
                      transform=transforms.Compose([transforms.ToTensor()]))

@nvcastet, thank you so much for the clarification. At that point I had misunderstood and thought I would have to go into the torchvision library and change one of its internal files (vision/torchvision/datasets/utils.py), which would not be a smooth move on Colab/Kaggle.

This should be fixed now; there is no need to update torchvision.

All should be working as before, without any change on the user side.

This was fixed on the server hosting the original dataset (thanks @soumith !).

As such, I'm closing this issue but let us know if you still face this issue.
