Datasets: option to use datasets behind a proxy

Created on 18 Mar 2019 · 18 Comments · Source: tensorflow/datasets

Is your feature request related to a problem? Please describe.
Right now, behind a proxy, it does not work:

ds_train = tfds.load(name="cats_vs_dogs", split=tfds.Split.TRAIN)

C:\Program Files\Anaconda3\envs\env_gcp_dl_2_0_ds\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515
--> 516             raise ConnectionError(e, request=request)
    517
    518         except ClosedPoolError as e:
ConnectionError: HTTPConnectionPool(host='storage.googleapis.com', port=80): Max retries exceeded with url: /tfds-data/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000192F3D06668>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))

I don't think this is supported for now (I didn't see it in the documentation):
https://www.tensorflow.org/datasets/api_docs/python/tfds/load

This will affect quite a lot of people working in companies and universities.

Describe the solution you'd like
I am not an expert, but using requests seems to be the standard way. Below is an example from a Google GCP tool:

import os

from google.cloud import storage

# The 'xxx' values are placeholders; replace them with your own settings.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'xxx'
os.environ['HTTPS_PROXY'] = 'xxx'
os.environ['REQUESTS_CA_BUNDLE'] = '/xxx/xxx'
client = storage.Client()

Ignore GOOGLE_APPLICATION_CREDENTIALS, which is specific to GCP. The user needs to set one or two environment variables and everything is done in the background (I guess this is using requests).

http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification

enhancement

All 18 comments

@rsepassi @Conchylicultor @cyfra Is this issue fixed? If not, can you assign it to me?

It should be fixed with @captain-pool's contribution #488.

Thanks @captain-pool

"To Configure the Proxy Settings, The User needs to set the Proxies for HTTP, HTTPS and FTP in the Environment Variables
TFDS_HTTP_PROXY, TFDS_HTTPS_PROXY, TFDS_FTP_PROXY respectively."
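
As a rough illustration of the setup quoted above, the sketch below collects the three TFDS_* variables into the scheme-keyed mapping that requests accepts as its `proxies` argument. The helper name `tfds_proxies_from_env` and the proxy address are hypothetical, and the scheme mapping is an assumption; how the TFDS downloader actually reads these variables is not shown in this thread.

```python
import os

# Variable names come from the quoted description; pairing each one with a
# URL scheme is an assumption made for this sketch.
_TFDS_PROXY_VARS = {
    "http": "TFDS_HTTP_PROXY",
    "https": "TFDS_HTTPS_PROXY",
    "ftp": "TFDS_FTP_PROXY",
}


def tfds_proxies_from_env(environ=None):
    """Return a {scheme: proxy_url} dict for the TFDS_* variables that are set."""
    environ = os.environ if environ is None else environ
    proxies = {}
    for scheme, var in _TFDS_PROXY_VARS.items():
        value = environ.get(var)
        if value:  # skip unset or empty variables
            proxies[scheme] = value
    return proxies


if __name__ == "__main__":
    # Hypothetical proxy address, for illustration only.
    example_env = {"TFDS_HTTPS_PROXY": "http://proxy.example:3128"}
    print(tfds_proxies_from_env(example_env))
```

Passing a plain dict instead of `os.environ` keeps the helper easy to try without touching the real environment.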

Do you also have an option to pass a CA certificate for SSL?

Right now it is crashing with:

requests.exceptions.SSLError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /record/53169/files/Kather_texture_2016_image_tiles_5000.zip (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))

This is typical of SSL interception: you need to either set SSL verify to false (if possible) or simply pass the CA certificate. Did you implement a REQUESTS_CA_BUNDLE environment variable as well? Which library is used in your implementation? requests?

Hey @tarrade, the downloader uses both requests and urllib. Sorry, I totally missed the feature request for a CA file; I only made it flexible for proxies. I will add support for CA certificates ASAP.

@Conchylicultor should I skip the certificate verification by passing CERT_NONE from ssl, or should I add an option for supplying a certificate file?
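
For readers comparing the two options in the question above, here is a minimal sketch using only the standard `ssl` module. The function name `make_ssl_context` and its parameters are hypothetical; this is not the code from the pull request.

```python
import ssl


def make_ssl_context(ca_file=None, verify=True):
    """Build an SSLContext for one of the two options discussed above.

    ca_file: optional path to a PEM bundle containing the company CA.
    verify:  set to False to skip certificate verification entirely.
    """
    if not verify:
        context = ssl.create_default_context()
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        return context
    # With ca_file=None this falls back to the system's default trust store.
    return ssl.create_default_context(cafile=ca_file)
```

Skipping verification removes protection against tampered downloads, so supplying the CA file is generally the safer of the two.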

Hi @captain-pool, no problem. I know it is mostly companies that use a proxy and a CA certificate, and we suffer from that every day. I will be happy to test it when you have it ready. Just tell me which nightly build it lands in. Thanks

Can you reopen the issue?

On my side I cannot reopen this ticket. By the way, I forgot to thank you for implementing this request. You will help a lot of people who use company laptops.

@tarrade I think #663 should fix this. Give it a check.

cc: @Conchylicultor

Hi @captain-pool, it seems the build failed, right? https://source.cloud.google.com/results/invocations/e044b82b-65e9-4b34-9d5f-abd96aaba0a8/targets/tensorflow_datasets%2Fgh_testing%2Fpresubmit/log

I tested with 1.0.2.dev201906110105 but it is still failing with "bad handshake".

If the fix is already in 1.0.2.dev201906110105, then I will check that I have all the CA certificates in my file.

@tarrade it is failing because I'm using an SSL context, which is supported from Python 2.7.9 onward. However, Kokoro is using a Python version <= 2.7.8, which doesn't allow that. Let me find an alternative; I will fix it soon.
@Conchylicultor @rsepassi @vbardiovskyg @cyfra is it possible to upgrade Kokoro's configuration for Python 2 to Python 2.7.9?

@rsepassi is the expert here, but from what I see it might not be that easy :-(
We'd have to move from the "common" Kokoro cluster/image to a custom one (and pay the cost of managing it).

I see other places in our code where we had to add workarounds in the past to accommodate the fact that Linux machines on Kokoro use 2.7.8.

Would it make sense to have this feature "disabled" when running on an old Python version?

Done :)
Disabling it for Python versions <= 2.7.8 seemed like the only valid way out.
The builds are passing.
@tarrade after @cyfra verifies and merges, it should be ready :)
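
A version gate of the kind described above could look roughly like the sketch below. The function name `ssl_context_supported` is hypothetical, and the actual check in the pull request may differ.

```python
import sys

# ssl.SSLContext and ssl.create_default_context reached Python 2 in 2.7.9.
_MIN_VERSION_FOR_SSL_CONTEXT = (2, 7, 9)


def ssl_context_supported(version_info=None):
    """Return True if the interpreter is new enough for the SSL-context path."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro); sys.version_info has extra fields.
    return tuple(version_info[:3]) >= _MIN_VERSION_FOR_SSL_CONTEXT
```

A caller would apply the custom CA bundle only when this returns True and fall back to the default behaviour otherwise.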

@captain-pool good idea to disable it for Python versions <= 2.7.8. I am quite new to this business: how can I see which build this fix landed in? Is it already in tfds-nightly==1.0.2.dev201906120105, or should I wait for tomorrow's build?

You need to wait until it is merged into the master branch; you can then get it in the
nightly build the next day.

Or, if it is urgent, you can clone it from my fork. Then cd into the
local repository and git checkout issue_275.
Finally:

  1. "pip install ."
    OR
  2. "python setup.py install"

Either one of these will do the job :)

I tested the latest build, 1.0.2.dev201906180105, and I confirm that it works with a proxy and a CA certificate.

Here is my test setup:

export TFDS_HTTPS_PROXY="http://user:password@ip:port/"
export TFDS_CA_BUNDLE=path/ca_certs

It works for the following datasets:

dataset = tfds.load(name="colorectal_histology_large", split=tfds.Split.TEST)
dataset = tfds.load(name="colorectal_histology", split=tfds.Split.TRAIN)

I get some crashes when the dataset is hosted on AWS:

tfds.load(name="fashion_mnist", split=tfds.Split.TRAIN)

requests.exceptions.ConnectionError: HTTPConnectionPool(host='fashion-mnist.s3-website.eu-central-1.amazonaws.com', port=80): Max retries exceeded with url: /train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f22bc27df98>: Failed to establish a new connection: [Errno 110] Connection timed out',))

I don't know what the issue with AWS is. I can download the file manually. I need to retry later; I am at a conference with a not-so-great network.

Overall it is working. The question is on which side the issue with AWS lies.

Of course, I need to set both:

export TFDS_HTTPS_PROXY="http://user:password@ip:port/"
export TFDS_HTTP_PROXY="http://user:password@ip:port/"

and then everything works fine.
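
The fashion_mnist traceback above shows an HTTPConnectionPool on port 80, i.e. a plain-HTTP URL, which is consistent with the HTTPS proxy variable alone not covering it. The sketch below shows one way to check which variable a given download URL would need. The helper name `proxy_variable_for` is hypothetical, and it assumes the variable is selected purely by URL scheme.

```python
from urllib.parse import urlsplit

# Assumed pairing of URL scheme and TFDS proxy variable.
_PROXY_VAR_BY_SCHEME = {
    "http": "TFDS_HTTP_PROXY",
    "https": "TFDS_HTTPS_PROXY",
    "ftp": "TFDS_FTP_PROXY",
}


def proxy_variable_for(url):
    """Return the TFDS_* proxy variable matching the URL's scheme, or None."""
    scheme = urlsplit(url).scheme.lower()
    return _PROXY_VAR_BY_SCHEME.get(scheme)


if __name__ == "__main__":
    url = ("http://fashion-mnist.s3-website.eu-central-1.amazonaws.com"
           "/train-images-idx3-ubyte.gz")
    print(proxy_variable_for(url))
```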

All is working perfectly. Thanks @captain-pool . Closing
