Apologies if this one is going to be challenging to reproduce. I'll try to get as many details as I can from my cloud provider.
I'm trying to serve a DVC backend behind a proxied cloud instance with an access portal. You first get a redirect, and if you auth wrong, you get a 200 and a password prompt page (this may not be compliant behavior, I didn't write it), which might be the problem. Fetch shows no errors. Basically the client is telling me everything is A-OK when I know it can't possibly be.
I tried to add as much salient information as I could; if you want me to try anything specific, let me know. I also tried to Wireshark the conversation, but it was all TLS-encrypted. I think there is a way to decrypt it, but I haven't tried that yet :/
Config looks like:
[core]
remote = test2
['remote "test"']
url = http://dvc.company.com/test
['remote "test2"']
url = http://dvc.company.com/test
custom_auth_header = Authorization: redacted
Fresh git init, dvc init, add some junk, dvc add it.
dvc push -v shows the following with exit 0, no error.
2020-08-28 11:50:29,014 DEBUG: Check for update is enabled.
2020-08-28 11:50:29,017 DEBUG: fetched: [(3,)]
2020-08-28 11:50:29,026 DEBUG: Preparing to upload data to 'http://dvc.company.com/test'
2020-08-28 11:50:29,027 DEBUG: Preparing to collect status from http://dvc.company.com/test
2020-08-28 11:50:29,027 DEBUG: Collecting information from local cache...
2020-08-28 11:50:29,027 DEBUG: Assuming '/Users/mike/rando/3dvctest/.dvc/cache/bc/9ae05a848582740df3c01234e889be' is unchanged since it is read-only
2020-08-28 11:50:29,029 DEBUG: Collecting information from remote cache...
2020-08-28 11:50:29,030 DEBUG: Matched '0' indexed hashes
2020-08-28 11:50:29,030 DEBUG: Querying 4 hashes via object_exists
2020-08-28 11:50:30,200 DEBUG: fetched: [(9,)]
Everything is up to date.
2020-08-28 11:50:30,209 DEBUG: Analytics is enabled.
2020-08-28 11:50:30,287 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/wp/kcxqg3tx3sg5cyl2sq9hw6qh0000gn/T/tmpjch6hi1q']'
2020-08-28 11:50:30,289 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/wp/kcxqg3tx3sg5cyl2sq9hw6qh0000gn/T/tmpjch6hi1q']'
No indication anything has gone wrong. My server sees no request, so I think the proxy is fielding it and returning a 200, but it's not the 200 from DVC. This occurs with, without, or even with an incorrect custom_auth_header.
dvc fetch -v also shows a clean exit, even though it definitely can't fetch that endpoint the way I have it currently configured.
2020-08-28 12:04:38,945 DEBUG: Check for update is enabled.
2020-08-28 12:04:38,948 DEBUG: fetched: [(3,)]
2020-08-28 12:04:38,963 DEBUG: Preparing to download data from 'http://dvc.company.com/test'
2020-08-28 12:04:38,963 DEBUG: Preparing to collect status from http://dvc.company.com/test
2020-08-28 12:04:38,963 DEBUG: Collecting information from local cache...
2020-08-28 12:04:38,964 DEBUG: Assuming '/Users/mike/rando/3dvctest/.dvc/cache/39/00f2a36279dd10ad3df80e9bf4a3fe' is unchanged since it is read-only
...
2020-08-28 12:04:39,014 DEBUG: Path '.dvc/cache/5a/b0042a3dd99d0fefb2f92d5193e83c' inode '21557239'
2020-08-28 12:04:39,016 DEBUG: fetched: [('1598628669975163904', '2175', '5ab0042a3dd99d0fefb2f92d5193e83c', '1598629499065528064')]
...
2020-08-28 12:04:39,018 DEBUG: fetched: [(9,)]
Everything is up to date.
2020-08-28 12:04:39,020 DEBUG: Analytics is enabled.
2020-08-28 12:04:39,091 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/wp/kcxqg3tx3sg5cyl2sq9hw6qh0000gn/T/tmp_ah2pv_h']'
2020-08-28 12:04:39,093 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/wp/kcxqg3tx3sg5cyl2sq9hw6qh0000gn/T/tmp_ah2pv_h']'
Seen on dvc v 1.1.11 and 1.6.3. Macbook source / Ubuntu 18 server. My remote is running this utility I am working on. Bog standard http daemon.
The server is on a cluster run by a private 'cloud' company. They have their own nginx proxy and firewall which lets you define your own subdomain, e.g. xkortex.company.com -> cluster-machine-ip:8888. If you browse to one of these sites, you have to log in to a portal with username/pass. They also define an api-token.company.com which gives you a token to insert into a header like curl -X GET --header 'Authorization: ASDF....DEADBEEF' so as to avoid the login.
Currently, when I run curl -v --header 'Authorization: redacted ' http://dvc.company.com, I get
* Rebuilt URL to: http://dvc.company.com/
* Trying 192.#.#.#...
* TCP_NODELAY set
* Connected to dvc.company.com (192.#.#.#) port 80 (#0)
> GET / HTTP/1.1
> Host: dvc.company.com
> User-Agent: curl/7.54.0
> Accept: */*
> Authorization: redacted
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.13.9
< Date: Fri, 28 Aug 2020 15:52:14 GMT
< Content-Type: text/html
< Content-Length: 185
< Connection: keep-alive
< Location: https://dvc.company.com/
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.13.9</center>
</body>
</html>
so I try curl -vL --header 'Authorization: redacted ' http://dvc.company.com, this time:
* Ignoring the response-body
* Connection #0 to host dvc.company.com left intact
* Issue another request to this URL: 'https://dvc.company.com/'
* Trying 192.#.#.#...
* TCP_NODELAY set
* Connected to dvc.company.com (192.#.#.#) port 443 (#1)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=company.com
* start date: Jul 18 07:55:45 2020 GMT
* expire date: Oct 16 07:55:45 2020 GMT
* subjectAltName: host "dvc.company.com" matched cert's "*.company.com"
* issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3
* SSL certificate verify ok.
> GET / HTTP/1.1
> Host: dvc.company.com
> User-Agent: curl/7.54.0
> Accept: */*
> Authorization: redacted
>
< HTTP/1.1 200 OK
< Server: nginx/1.13.9
< Date: Fri, 28 Aug 2020 15:53:41 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 375
< Connection: keep-alive
< Accept-Ranges: bytes
<
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
expected stuff from my server instance
</html>
Hi @xkortex !
Not seeing anything obvious right away :slightly_frowning_face: The best bet might be for you to look into our extremely simple http implementation https://github.com/iterative/dvc/blob/master/dvc/tree/http.py and try to tinker with it to make it work with your server. Let us know if you'll have any questions. We also have a #dev-talk channel on discord http://dvc.org/chat , feel free to join.
https://github.com/iterative/dvc/blob/master/dvc/tree/http.py is exactly what simple_http_server is based on actually :p
Probably going to chat it out on the chat server. I need to study up a bit more on what exactly the dvc http protocol is looking for/trying to do.
What does the server (or the proxy) return when you send HEAD and GET requests directly to a file URL?
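One way to answer that question without curl is a small probe script. This is only a sketch using Python's standard urllib; build_probe, send_probe, and NoRedirect are hypothetical helper names, and the redirect handler is there so the raw 301 and any later status stay visible instead of being followed silently.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx instead.
        return None


def build_probe(url, method="HEAD", auth_value=None):
    """Build (but do not send) a request for a single object URL."""
    headers = {"Authorization": auth_value} if auth_value else {}
    return urllib.request.Request(url, headers=headers, method=method)


def send_probe(request, timeout=10):
    """Send the request and return (status, headers) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, dict(response.headers)
    except urllib.error.HTTPError as err:
        # 3xx/4xx/5xx land here; the status and headers are still informative.
        return err.code, dict(err.headers)
```

Calling send_probe(build_probe(url, "HEAD", auth_value=...)) for an object URL such as http://dvc.company.com/test/bc/9ae05a848582740df3c01234e889be, and again for the https:// form of the same URL, should show whether the proxy answers 200, 401, or 404 for a file that was never uploaded.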
When you try to push .dvc/cache/bc/9ae05a848582740df3c01234e889be, we will eventually make a HEAD or GET request for http://dvc.company.com/test/bc/9ae05a848582740df3c01234e889be (where the URL will start with whatever your remote URL is configured as) to check whether or not the file already exists in the remote. If the server is returning 200's here and not any other response, DVC will take that to mean that the file already exists (and does not need to be pushed).
edit: it sounds like the server will return a redirect to some login page (and an eventual 200 OK response)? The issue here would be that we would normally expect a 401 (if there is no auth header) or 404 (if we are auth'd but the file doesn't exist) in this scenario.
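To illustrate why a login page that answers 200 would make push report "Everything is up to date", here is a toy model of the decision described above. It is not DVC's actual code: naive_exists, files_to_upload, login_portal, and empty_remote are made-up names, and the statuses are hard-coded rather than fetched over the network.

```python
def naive_exists(final_status):
    """Treat any 200 as 'object is already in the remote'."""
    return final_status == 200


def files_to_upload(hashes, final_status_for):
    """Return the hashes a push would upload, given a per-hash status lookup."""
    return [h for h in hashes if not naive_exists(final_status_for(h))]


def login_portal(_hash):
    """A portal that answers every path with its login page: always 200."""
    return 200


def empty_remote(_hash):
    """A well-behaved remote that holds none of the objects yet: always 404."""
    return 404


hashes = [
    "bc9ae05a848582740df3c01234e889be",
    "3900f2a36279dd10ad3df80e9bf4a3fe",
]

print(files_to_upload(hashes, login_portal))  # prints []
print(files_to_upload(hashes, empty_remote))  # prints both hashes
```

With the portal in front, every object looks present, so nothing is uploaded and the command exits cleanly. The same reasoning would apply to any client that equates a final 200 with existence.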
Hey, sorry about the delay, work stuffs. Didn't mean to close the issue, my keyboard glitched and hit "Close with comment" when typing this reply!
curl dvc.company.com
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.13.9</center>
</body>
</html>
HEAD
HTTP/1.1 301 Moved Permanently
Server: nginx/1.13.9
Date: Wed, 16 Sep 2020 23:08:00 GMT
Content-Type: text/html
Content-Length: 185
Connection: keep-alive
Location: https://dvc.company.com/
curl --head -L dvc.company.com
HTTP/1.1 301 Moved Permanently
Server: nginx/1.13.9
Date: Wed, 16 Sep 2020 23:08:54 GMT
Content-Type: text/html
Content-Length: 185
Connection: keep-alive
Location: https://dvc.company.com/
HTTP/1.1 200 OK
Server: nginx/1.13.9
Date: Wed, 16 Sep 2020 23:08:54 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 497
Connection: keep-alive
WWW-Authenticate: Basic realm="Restricted"
I'm not sure if 200 is the correct response after a successful 3XX redirect but that seems reasonable since the landing page resolves. I agree that a 503 or a 401 is more desirable, but I don't think 200 here is totally out of line. If I visit gitlab.com/my/private/repo, I get a 302 then a 200 for the sign-in page.
Could we, as a primitive sanity check, ensure that Content-Length matches the expected size of the object we are looking for?
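A rough sketch of what such a check could look like, assuming the client already knows the object's size in bytes. response_matches_object is a hypothetical helper, not an existing DVC function. The extra WWW-Authenticate test is my own addition, based on the header the proxy sends with its 200 above. The check is only a heuristic: a login page could happen to have the same length, and a server using chunked transfer encoding may send no Content-Length at all.

```python
def response_matches_object(status, headers, expected_size):
    """Heuristic: does a 200 response plausibly describe the requested object?"""
    if status != 200:
        return False
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "www-authenticate" in lowered:
        # An auth challenge, even though the status line says 200.
        return False
    length = lowered.get("content-length")
    if length is None or not length.strip().isdigit():
        # The size cannot be verified, so do not assume the object exists.
        return False
    return int(length) == expected_size


# Headers copied from the curl --head -L output above; 2175 is an illustrative size.
login_page_headers = {
    "Content-Type": "text/html;charset=utf-8",
    "Content-Length": "497",
    "WWW-Authenticate": 'Basic realm="Restricted"',
}
print(response_matches_object(200, login_page_headers, 2175))  # prints False
```

Failing closed when Content-Length is missing means a few extra uploads at worst, which seems safer than silently skipping a push.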
If I wanted to hack around a bit, would this be a good place to start? https://github.com/iterative/dvc/blob/master/dvc/tree/http.py#L98
@xkortex Were you able to make it work? :slightly_smiling_face: