Dvc.org: man.dvc.org/import-url redirect is broken

Created on 3 Nov 2019  路  23Comments  路  Source: iterative/dvc.org

It supposed to behave in the same way as man.dvc.org/get-url

The fix should be applies to the server.js file.

bug doc-content doc-engine priority-p0

All 23 comments

That is funny. Probably a bug in the redirect code...

This is still broken after attempting a hotfix. In fact now man.dvc.org/get-url is also broken! Which is weird because I tested the logic in server.js and it works well locally. Here's the latest prod deploy diff: https://github.com/iterative/dvc.org/compare/37aab6a0...a07bd0c5#diff-78c12f5adc1848d13b1c6f07055d996eL37-R37 for this code:

https://github.com/iterative/dvc.org/blob/a07bd0c5b1efd749e830dd950a19cc4a8a45202c/server.js#L34-L42

['/get-url', '/import-url'].indexOf(pathname) < 0 basically checks that pathname is not one of those 2 exceptions, before the code replaces - for / in any other path. Tested locally with node:

> ['/get-url', '/import-url'].indexOf('/get-url')
0
> ['/get-url', '/import-url'].indexOf('/import-url')
1
> ['/get-url', '/import-url'].indexOf('/cache-dir')
-1

However on prod everything gets replaced, even when indexOf results in >= 0

Changed code to this:

https://github.com/iterative/dvc.org/blob/1376f2a3fce6886faa2ff344e66826039cd43395/server.js#L34-L46

to make it easier to read. This code also works well in manual tests, but it's also not working yet on prod ):

There's 2 things going on here ^ BTW:

  1. Replace - for / in man.dvc.org/ paths (except /get-url and /import-url).
  2. Redirect man.dvc.org -> dvc.org/doc/command-reference .

We could separate each one for even further clarity, but that would result in an extra redirect.

Related: #757

I think the reason it does not work for me is caching of redirects:

Request URL: https://man.dvc.org/import-url
Request Method: GET
Status Code: 301 Moved Permanently (from disk cache)
Remote Address: 35.170.171.200:443
Referrer Policy: no-referrer-when-downgrade

If I run it a private windows everything works fine. So, we should check if there is a way to manage cache lifetime for redirects.

I guess this is the issue we bumped into! 301 Redirects: The Horror That Cannot Be Uncached

Basically you can only do a 301 redirect once...

So right now we have to deal with the fact that browsers and search engines have already cached all of our 301 redirects. We would need to turn the ones we're not so sure are permanent into 302 redirects and then wait a month or more to assume the caching is gone.

In the mean time, for these 2 specific cases we could setup 302 redirects from /import/url and /get/url to /import-url and /get-url respectively. These would be second redirects after the cached one from man.dvc.org...

I would try to avoid 302s . As I mentioned our redirects are 301s in nature. We do not plan to go back to the previous page and serve content directly from man.dvc.org/something for example.

Let's do some experiments with no-cache. Let's not worry about stuff that is already cached somewhere outside. First let's worry about the solution moving forwards (new redirects, new clients).

Right, that's an example of a permanent one, from man.dvc.org to dvc.org/doc/command-reference

But it used to turn - into / and now we don't really want that, so that part was not so permanent. 馃構

  • [ ] We should keep the 301 redirect from the man. subdomain to the cmd ref path, but make a separate 302 that turns - to / in some paths only (instead of excluding just import-url and get-url) for backward compatibility.
  • And review all the other redirects to see if any one should be 302.

What do you think?

Let's not worry about stuff that is already cached somewhere outside.

I guess in this case we can indeed ignore it. I doubt many people have used man.dvc.org/import-url or /get-url and definitely no search engines since man.dvc.org URLs are only printed in our command line. 馃檪 (I think)

I think

there are probably some links outside, not many indeed

We should keep the 301 redirect from the man. subdomain to the cmd ref path, but make a separate 302 that turns - to / in some paths only (instead of excluding just import-url and get-url) for backward compatibility.

sounds too complicated to be honest. Let's first see what we can with no-cache policy.

Almost all our redirects are permanent in nature, I actually don't see a good example of 302.

Almost all our redirects are permanent in nature, I actually don't see a good example of 302.

Well, let's see our current redirects in server.js:

  • Enforce https protocol and remove www from host = Permanent (SEO)
  • man.dvc.org/{cmd} -> dvc.org/doc/command-reference/{cmd} = Permanent (route design)

    • replace - for / in {cmd} except for /get-url, /import-url = Temporary (backward compat)

  • {code/data/remote}.dvc.org -> corresponding S3 bucket -> Permanent (route design)
  • path /(deb|rpm) -> corresponding S3 bucket = Permanent (route design)
  • path /(help|chat) -> Discord chat invite = Permanent (route design)
  • path /doc/commands-reference... -> /doc/command-reference... = Permanent (backward compat)
  • path /doc/tutorial/... -> /doc/tutorials/deep/... = Permanent (backward compat)
  • path /doc/tutorial -> /doc/tutorials = Permanent (backward compat)
  • path /doc/use-cases/data-and-model-files-versioning -> /doc/use-cases/versioning-data-and-model-files = Permanent (backward compat)
  • path /doc*/... -> /doc/... = Permanent (backward compat)

Seems like you're right! Only the - for / replacement under man.dvc.org redirects seems to me like something we should address with 302s or 303s or 307s or 308s. (See https://github.com/iterative/dvc.org/issues/757#issuecomment-549494538)

Let's first see what we can (do) with no-cache policy.

But OK! Let's try this first...

replace - for / in {cmd} except for /get-url, /import-url = Temporary (backward compat)

it's also permanent, I think. We don't plan to change it back right? It's temporary in a sense that we will remove it completely soon and start returning 404. Tbh, would not worry that much about redirects like man.dvc.org/remote-add at all because of this.

302s or 303s or 307s or 308s.

need to be careful with this - I don't know anything about 30x, x>2. :)

Let's first see what we can (do) with no-cache policy.

OK I added Cache-Control: no-cache to the 301 header. Will have to merge #772 to check the effect (on new client only).

I don't know anything about 30x

302 (Found (Previously "Moved temporarily")) has been superseded by 303 and 307.
303 See Other: "The response to the request can be found under another URI using the GET method."
307 Temporary Redirect (since HTTP/1.1): "The request should be repeated with another URI; however, future requests should still use the original URI... the request method is not allowed to be changed..."

Seems to me 307 is exactly what we want during a period of backward compatibility with certain man.dvc.org/cmd-subcmd URL paths that used to need converting - to /.

From https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_Redirection

yep, I just don't know how well all these codes are supported that's why a bit cautious

OK so now that #772 is merged, I opened an incognito window in Chrome and went to https://man.dvc.org/import-url, then tried again. Both times I see this same Network response:

General

Request URL: https://man.dvc.org/cache-dir
Request Method: GET
Status Code: 301 Moved Permanently
Remote Address: 52.72.245.79:443
Referrer Policy: no-referrer-when-downgrade

Response Headers

HTTP/1.1 301 Moved Permanently
Server: Cowboy
Connection: keep-alive
Cache-Control: no-cache
Location: https://dvc.org/doc/command-reference/import-url
Date: Tue, 05 Nov 2019 00:08:45 GMT
Transfer-Encoding: chunked
Via: 1.1 vegur

No more (from disk cache). I guess this means the issue is fixed? Not sure how else to confirm.

Can also text with curl -Lv man.dvc.org/some-thing > /dev/null.

Alright, seems like this is fixed now even for previous clients.

Having a no-cache 301 "Moved Permanently" seems counterintuitive, but I guess that's the easiest solution here, and allowed by the HTTP 1.1 standard: https://tools.ietf.org/html/rfc2616#section-14.9

I still think we should use 307 for some or all the backward compatibility redirects (see https://github.com/iterative/dvc.org/issues/768#issuecomment-549494662) but OK, will address this later in #757.

@jorgeorpinel I don't think it's fixed unfortunately. I think Firefox at least does not respect the flag.

Boo FF. Happily it has like 5% of the market. https://en.wikipedia.org/wiki/Usage_share_of_web_browsers#Summary_tables

Safari is more relevant, but I can't tell whether it respects the cache-directive.

It's 11% for dvc.org, the same as Safari :)

So should we separate the 301 redirect from man.dvc.org to /doc/cmd-ref from a new 302 or 307 redirect that replaces - for / in certain paths only? And if so, do you know which exact commands with dash we are trying to provide backward compatibility for in this redirect?

And if so, do you know which exact commands

almost all of them (except import-url and get-url)

from a new 302 or 307 redirect that replaces - for / in certain paths only

let's not do this. Not that we understand that it's only caching issue, I think we can keep 301s. Most of our users will see the right result. When we get to the #757 we can think one more time about what codes we apply where - most likely most of them will be 301s anyway to keep SEO.

let's close this for now? good findings and good research for 757.

let's not do this... When we get to the #757 we can think one more time about what codes

Sounds like a plan.

most likely most of them will be 301s anyway to keep SEO

We'll see haha 馃檲

let's close this for now?

Sure! You reopened it so I was just thinking out-loud what alternatives we had left. Closing! 馃檪

Was this page helpful?
0 / 5 - 0 ratings