Unable to download the CelebA dataset and load it into a DataLoader.
import torch
from torchvision import datasets, transforms

batch_size = 25
train_loader = torch.utils.data.DataLoader(
    datasets.CelebA('../data', split="train", download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,))
                    ])),
    batch_size=batch_size, shuffle=True)
This raises:
/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in __init__(self, root, split, target_type, transform, target_transform, download)
64
65 if download:
---> 66 self.download()
67
68 if not self._check_integrity():
/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in download(self)
118 download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
119
--> 120 with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
121 f.extractall(os.path.join(self.root, self.base_folder))
122
/usr/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
1129 try:
1130 if mode == 'r':
-> 1131 self._RealGetContents()
1132 elif mode in ('w', 'x'):
1133 # set the modified flag so central directory gets written
/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
1196 raise BadZipFile("File is not a zip file")
1197 if not endrec:
-> 1198 raise BadZipFile("File is not a zip file")
1199 if self.debug > 1:
1200 print(endrec)
BadZipFile: File is not a zip file
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
Python version: 3.6
Versions of relevant libraries:
This has nothing to do with the loader. We can get the same result with
from torchvision import datasets
dataset = datasets.CelebA(".", split="train", download=True)
The underlying problem was reported in #1920: Google Drive enforces a daily download quota per file, and it appears to have been exceeded for the CelebA files. You can see this in the response, which is written unchecked to every .txt and .zip file:
<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>[...]</head><body>[...]<div class="uc-main"><div id="uc-text"><p class="uc-error-caption">Sorry, you can't view or download this file at this time.</p><p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.</p></div></div>[...]</body></html>
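To confirm locally that a failed download left an HTML error page on disk rather than an archive, the file can be inspected directly. The following is a minimal sketch: the helper name `describe_download` is hypothetical and not part of torchvision, and it only looks at the first bytes of the file.

```python
import zipfile


def describe_download(path):
    """Classify a downloaded file as 'zip', 'html', or 'unknown'.

    Hypothetical helper for diagnosis only; not part of torchvision.
    """
    path = str(path)
    if zipfile.is_zipfile(path):
        return "zip"
    # An HTML error page starts with a doctype or an <html> tag,
    # so reading the first few hundred bytes is enough.
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

If this reports `html` for `img_align_celeba.zip`, the file should be deleted before retrying, since it is the quota page and not the dataset.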
@ajayrfhp The only "solution" we can offer is to wait and try again, since we have no control over the quota. You can ask the authors of the dataset to host it on a platform that does not have daily quotas. If you do and they go through with your proposal, please inform us so that we can adapt our code.
@fmassa We should check the contents of the response first before we write them to the files and raise a descriptive error message.
@pmeier thanks for looking into this!
Is this something that could be done in the download_from_url function, or would it need to be done on a case-by-case basis?
CelebA uses download_file_from_google_drive and I would put the fix before L167:
Maybe it is as easy as checking the response.status_code.
The problem I see is that we need to wait for a day on which the quota is exceeded and then fix it right away. Furthermore, I have no idea how to test this.
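One way the proposed check could look, and how it might be exercised without waiting for a real quota error, is sketched below. All names (`QuotaExceededError`, `save_response`, `FakeResponse`) are hypothetical, and this is not claimed to be the fix that landed in torchvision. The sketch assumes only a response object with a requests-style `iter_content(chunk_size)` method, and that the error page contains the title string seen in the HTML quoted earlier. It inspects the body rather than `response.status_code`, because the thread does not establish which status code the quota page carries.

```python
import itertools
import os
import tempfile

# Title of the error page quoted in this thread; assumed to be a stable marker.
QUOTA_MARKER = b"Google Drive - Quota exceeded"


class QuotaExceededError(RuntimeError):
    """Hypothetical error type for this sketch."""


def save_response(response, destination, chunk_size=32768):
    """Write a streamed response to disk, refusing to save the quota page.

    Only the first chunk is inspected, so chunk_size should be large
    enough to contain the page title.
    """
    chunks = response.iter_content(chunk_size)
    first = next(chunks, b"")
    if QUOTA_MARKER in first:
        raise QuotaExceededError(
            "Google Drive reports that the download quota for this file "
            "is exceeded; try again later."
        )
    with open(destination, "wb") as f:
        for chunk in itertools.chain([first], chunks):
            if chunk:
                f.write(chunk)


class FakeResponse:
    """Stand-in for a streamed HTTP response, so no network is needed."""

    def __init__(self, payload):
        self._payload = payload

    def iter_content(self, chunk_size):
        for start in range(0, len(self._payload), chunk_size):
            yield self._payload[start:start + chunk_size]


def test_quota_page_is_rejected():
    html = (b"<!DOCTYPE html><html><head>"
            b"<title>Google Drive - Quota exceeded</title></head></html>")
    target = os.path.join(tempfile.mkdtemp(), "img_align_celeba.zip")
    try:
        save_response(FakeResponse(html), target)
    except QuotaExceededError:
        assert not os.path.exists(target)  # nothing was written to disk
    else:
        raise AssertionError("quota page was not detected")
```

Feeding a canned copy of the error page through a stub like `FakeResponse` makes the failure reproducible on any day, which sidesteps the need to catch the quota while it is actually exceeded.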
I see. Thanks, I will download at a later point then.
@pmeier your fix sounds good to me, but indeed, this might be difficult to test.
@fmassa I suggest we wait for another issue raising this problem. At least I won't check daily whether the quota is exceeded. If there is another issue for this and I miss it, or you somehow find a day when we can fix this, feel free to tag me. I'll see what I can do.
Sounds good, thanks a lot @pmeier !
Seems this is a known issue, but I wanted to raise it again as per @pmeier's comment. I didn't want to open another ticket for this, though.
@jotterbach This was fixed in e757d52 but didn't make it into the latest release.