Apparently the URL fetcher does not follow HTTP redirects; it probably should.
I'm already looking at the URL fetching as part of #1217, so I can look at this as well.
OK - the issue seems to be that Java's URLConnection does not follow a redirect when the target uses a different protocol from the original request. In particular, if you request an http URI and the server redirects to an https URI, the redirect is not followed. Redirects that stay on the same protocol (http->http or https->https) work as expected in OpenRefine.
This behaviour is deliberate, as it stops redirects taking the user from a secure protocol to an unsecured one (https->http). The second answer in this StackOverflow post gives a good explanation of why this is a bad idea https://stackoverflow.com/questions/1884230/urlconnection-doesnt-follow-redirect
What is less clear to me is whether there are any issues in supporting http->https redirects, which are by far the more common scenario and are becoming more common still. It feels like we could support them without any security concerns.
Any other views?
I agree there should not be any security concern with HTTP -> HTTPS. By the way, if URLConnection does this sort of nonsense, it might be worth migrating to a more modern library (https://stackoverflow.com/questions/1322335/what-is-the-best-java-library-to-use-for-http-post-get-etc). Ideally one that we already have in our dependencies, and in a dream world something that can be easily mocked for tests.
Looks like a workaround for cross-protocol redirects was already implemented for data import in response to #748. The relevant commit is 4f7da9d18e05361a6b1135528394b59f1e13b244.
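For anyone following along, here is a minimal sketch of what manually following http->https redirects could look like with plain `HttpURLConnection`. It is not taken from the commit above, and the class name `RedirectSketch`, the method names and the `MAX_REDIRECTS` limit are all made up for illustration. It assumes the URL being fetched is http or https.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not OpenRefine code: follows redirects by hand so that
// http->https is allowed while https->http is still refused.
public class RedirectSketch {
    // Arbitrary cap (an assumption) to avoid redirect loops.
    public static final int MAX_REDIRECTS = 5;

    // Allows http->http, https->https and http->https; rejects everything else,
    // including the https->http downgrade.
    public static boolean isAllowedRedirect(String fromProtocol, String toProtocol) {
        String from = fromProtocol.toLowerCase(Locale.ROOT);
        String to = toProtocol.toLowerCase(Locale.ROOT);
        if (to.equals("https")) {
            return from.equals("http") || from.equals("https");
        }
        if (to.equals("http")) {
            return from.equals("http");
        }
        return false;
    }

    public static boolean isRedirectStatus(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Opens the URL and follows allowed redirects; the caller reads from and
    // disconnects the returned connection.
    public static HttpURLConnection openFollowingRedirects(URL url) throws IOException {
        URL current = url;
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            // Turn off the built-in handling so every hop goes through our check.
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (!isRedirectStatus(status)) {
                return conn;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without a Location header from " + current);
            }
            // Resolve relative Location values against the current URL.
            URL next = new URL(current, location);
            if (!isAllowedRedirect(current.getProtocol(), next.getProtocol())) {
                throw new IOException("Refusing redirect from " + current + " to " + next);
            }
            current = next;
        }
        throw new IOException("Too many redirects starting from " + url);
    }
}
```

Keeping the protocol check in its own method means the policy can be unit tested without any network access, which goes some way towards the "easily mocked for tests" wish.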