Vision: Downloading MNIST dataset with torchvision gives HTTP Error 403

Created on 4 Mar 2020 · 15 Comments · Source: pytorch/vision

🐛 Bug

I'm getting a 403 error when I try to download MNIST dataset with torchvision 0.4.2.

To Reproduce

../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:68: in __init__
    self.download()
../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:135: in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:248: in download_and_extract_archive
    download_url(url, download_root, filename, md5)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:96: in download_url
    raise e
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:84: in download_url
    reporthook=gen_bar_updater()
/usr/local/lib/python3.6/urllib/request.py:248: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/usr/local/lib/python3.6/urllib/request.py:223: in urlopen
    return opener.open(url, data, timeout)
/usr/local/lib/python3.6/urllib/request.py:532: in open
    response = meth(req, response)
/usr/local/lib/python3.6/urllib/request.py:642: in http_response
    'http', request, response, code, msg, hdrs)
/usr/local/lib/python3.6/urllib/request.py:570: in error
    return self._call_chain(*args)
/usr/local/lib/python3.6/urllib/request.py:504: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7efbf9edaac8>
req = <urllib.request.Request object at 0x7efbf9eda8d0>
fp = <http.client.HTTPResponse object at 0x7efbf9edaf98>, code = 403
msg = 'Forbidden', hdrs = <http.client.HTTPMessage object at 0x7efbf9ea22b0>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 403: Forbidden

Environment

torch==1.3.1
torchvision==0.4.2

Additional context

https://app.circleci.com/jobs/github/PyTorchLightning/pytorch-lightning/6877

Labels: bug, help wanted, datasets

Source

Borda

All 15 comments

Thanks for reporting! I can reproduce the issue locally, and downloading from the browser works.

I don't yet know what the root cause is though.

fmassa on 4 Mar 2020

I think we might need to pass header in the download_url function https://github.com/pytorch/vision/blob/c3e2b018517dedcbda18462f5d3e62e1fd913003/torchvision/datasets/utils.py#L59-L100 according to https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden

cc @cpuhrsch @vincentqb @zhangguanheng66 for awareness

fmassa on 4 Mar 2020
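
A minimal sketch of what "passing a header" in a download helper could look like, using only the standard library. The function name download_with_user_agent and the default agent string are hypothetical, chosen for illustration; this is not torchvision's actual download_url implementation.

```python
import shutil
import urllib.request


def download_with_user_agent(url, fpath, user_agent="Mozilla/5.0"):
    """Download url to fpath, sending an explicit User-Agent header.

    Hypothetical helper: the name and default agent are assumptions.
    """
    # Attach the header to this one request instead of changing global state.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response, open(fpath, "wb") as out:
        # Stream the body to disk in chunks rather than reading it all at once.
        shutil.copyfileobj(response, out)
    return fpath
```

Because the header is attached per request, this variant does not affect other code that uses urllib in the same process.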

This is because the download links for MNIST at https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py#L33-L36 are hosted on yann.lecun.com, and that server has moved under CloudFlare protection.

@fmassa we may need to mirror the files and change the URLs to point somewhere else, for example the PyTorch S3 bucket.

soumith on 4 Mar 2020

so could we make a hot-fix somehow?

Borda on 4 Mar 2020

@Borda I haven't tried the hotfix I mentioned, but I think it might work. Would you be able to try it and send a PR? Otherwise I'll look into it early next week (I'm working towards the ECCV deadline tomorrow).

And I would rather avoid hosting the datasets ourselves, as this would set a precedent for us storing the datasets.

fmassa on 4 Mar 2020

Is there any way to have a quick fix without using master?
I am concerned about the changes I might have to make in my code to go from the version I am using (1.4.0) to master.

eduardo4jesus on 4 Mar 2020

👍2

@eduardo4jesus You can explicitly add headers as stated above, something like:

opener = urllib.request.URLopener()
opener.addheader('User-Agent', some_user_agent)
opener.retrieve(
    url, fpath,
    reporthook=gen_bar_updater()
)

(line 81 and onwards in vision/torchvision/datasets/utils.py). Seems to be a quick workaround that works.

mvelebit on 4 Mar 2020

👍1

@eduardo4jesus You could patch your model script at the top using:

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

It will use that user agent for the entire script assuming the opener does not get overwritten somewhere else.

nvcastet on 4 Mar 2020

👍11 🚀5 🎉1
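
The effect of the installed opener can be checked without contacting the real host. The sketch below is an illustration under a stated assumption: the thread never says which rule triggers the 403, so the hypothetical PickyHandler simply rejects urllib's default "Python-urllib" agent. All names here (PickyHandler, fetch_status, demo) are made up for the example.

```python
import http.server
import threading
import urllib.error
import urllib.request


class PickyHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for the remote host: rejects urllib's default agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if agent.startswith("Python-urllib"):
            self.send_error(403)
            return
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status(url):
    """Return the HTTP status code for url, fetched through the global opener."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def demo():
    """Return (status with the default agent, status after install_opener)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    try:
        # Start from a fresh default opener so the first request uses urllib's own agent.
        urllib.request.install_opener(urllib.request.build_opener())
        before = fetch_status(url)

        # Same steps as the snippet above, using the standard library directly.
        opener = urllib.request.build_opener()
        opener.addheaders = [("User-agent", "Mozilla/5.0")]
        urllib.request.install_opener(opener)
        after = fetch_status(url)
    finally:
        server.shutdown()
        server.server_close()
    return before, after
```

Calling demo() returns the two status codes. Since urlretrieve goes through the same global opener as urlopen, library code that downloads with urlretrieve picks up the installed header as well.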

To make it work for python 2 as well:

import urllib
try:
    # For python 2
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/5.0"

    urllib._urlopener = AppURLopener()
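except AttributeError:
    # For python 3: urllib.FancyURLopener does not exist at the top level,
    # and the request submodule has to be imported explicitly
    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib.request.install_opener(opener)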

nvcastet on 4 Mar 2020

So for python 3 I now use the following snippet:

from torchvision import datasets
import torchvision.transforms as transforms
import urllib.request  # import the submodule explicitly; a bare "import urllib" may not load it

num_workers = 0
batch_size = 20
basepath = 'some/base/path'  # placeholder: folder the archives are saved to
transform = transforms.ToTensor()

def set_header_for(url, filename):
    # Download url into basepath while sending a browser-like User-Agent
    # (URLopener is deprecated since Python 3.3)
    opener = urllib.request.URLopener()
    opener.addheader('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
    opener.retrieve(url, f'{basepath}/{filename}')

set_header_for('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'train-labels-idx1-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz')
train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=False, transform=transform)

You would need to modify the basepath variable of course

joergsimon on 4 Mar 2020
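
For the pre-download approach above to help, the archives have to land where the dataset class looks for them. The sketch below assumes that torchvision's MNIST keeps its raw archives in <root>/MNIST/raw and reuses files it finds there; that layout is an assumption about the library, not something stated in this thread. The names mnist_raw_folder and prefetch_mnist are hypothetical, and fetch is any callable taking (url, target_path), for example a downloader that sets a User-Agent header.

```python
import os

# URL prefix and file names taken from the snippet above.
MNIST_BASE_URL = "http://yann.lecun.com/exdb/mnist/"
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def mnist_raw_folder(root):
    # Assumption: the MNIST dataset class stores raw archives in <root>/MNIST/raw.
    return os.path.join(root, "MNIST", "raw")


def prefetch_mnist(root, fetch, base_url=MNIST_BASE_URL):
    """Fetch any missing MNIST archive into the assumed raw folder.

    fetch is a caller-supplied callable fetch(url, target_path).
    Returns the list of file names that were fetched.
    """
    raw = mnist_raw_folder(root)
    os.makedirs(raw, exist_ok=True)
    fetched = []
    for name in MNIST_FILES:
        target = os.path.join(raw, name)
        if not os.path.exists(target):
            fetch(base_url + name, target)
            fetched.append(name)
    return fetched
```

With root='data', the files would end up in data/MNIST/raw, which matches the root passed to datasets.MNIST in the snippet above. Files that already exist are skipped, so the call is safe to repeat.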

I've just got the same problem. Waiting for the answer without changing codes... (ROOKIE ALERT)

fatihbeyhan on 4 Mar 2020

I've just got the same problem. Waiting for the answer without changing codes... (ROOKIE ALERT)

Clone this to your working dir:
https://github.com/knamdar/data

knamdar on 5 Mar 2020

👍6

The problem is that Yann LeCun's site changed its hosting provider, if I understood correctly, and the new one checks whether the HTTP headers are set.

I currently work around it with the same snippet I posted above: download the four MNIST archives with a browser-like User-Agent header, then load them with datasets.MNIST. You need to change the base path, of course.

On 05.03.2020, at 05:26, Nikita Makarin notifications@github.com wrote:

I've the same issue when I'm trying to get datasets:

import torch
import torchvision
from torchvision import transforms, datasets

train = datasets.MNIST("", train=True, download=True,
                       transform=transforms.Compose([transforms.ToTensor()]))

test = datasets.MNIST("", train=False, download=True,
                      transform=transforms.Compose([transforms.ToTensor()]))

joergsimon on 5 Mar 2020

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

@nvcastet, thank you so much for the clarification. At that point I had misunderstood and thought I would have to go into the torchvision library and change one of its internal files, which would not be a smooth move on Colab/Kaggle.