Describe the bug
When fetching data from an external source over HTTPS (e.g. using "add column by fetching URL" with an https URL), the fetch can fail if the server's SSL certificate is not trusted by the local system.
To Reproduce
As a result of #1265, documentation was added to the wiki for diagnosing and fixing these problems manually: https://github.com/OpenRefine/OpenRefine/wiki/Troubleshooting-Fetching-data-from-URLs
Expected behavior
Where SSL errors are encountered on attempting to fetch data it would be good if:
It would be even better if the SSL issues could be overcome easily by the user by one or more of the following:
It is not clear to what extent we can achieve these changes with the current way we bundle/distribute OpenRefine (again, see the discussion on #1265).
@ostephens I have reproduced the issue and gone through the other issues mentioned too. I'm not sure whether I can resolve this, but I want to give it a try. Can you please point me to where the related code is located, so I know where to look? Thank you.
@darecoder1999 Just search the code for "fetch" in your IDE and this should be apparent.
I had mentioned to others in #2031 that moving to a smarter library like OkHttp might make things easier for us all around in the long term, such as when dealing with SSL 3.0 etc. But that's a major design decision that needs to be evaluated by our lead developers @wetneb @ostephens. I personally feel it would be the right move forward, as it would also allow us and future extensions to use custom ConnectionSpecs to support the various cipher suites still in the wild, through COMPATIBLE_TLS for example. The other good thing about OkHttp is that they track all of that: https://square.github.io/okhttp/tls_configuration_history/
Basically, we are doing things in OpenRefine code a bit too low level in places, and should use better open source libraries to abstract away a lot of our historical complexity, IMHO.
(Anyways, I'm not a web security expert by any means, but worked with many and learned a lot from them.)
@thadguidry thanks for the quick response. I'll search that and try to explore the code. If stuck anywhere will ask in the main group or over here.
And I agree with you that we should make apt use of existing open-source libraries. In my opinion, it's the best way to develop faster and more reliably.
Thanks once again for the help.
Just noting that this is probably a pretty involved issue, but definitely worth tackling. To locate the places where we need to upgrade to a more high-level HTTP library, look for URLConnection in the code. A first step would be to list all the places where it is used, see what the requirements are in terms of API granularity (streaming, encoding, headers, …) and see if we can find a single library that covers all these use cases. It would be good to have better tests before we migrate, although mocking these things in a way that is independent of the underlying HTTP API is probably quite hard. Perhaps wide-ranging integration tests are better for this (but they would need to be run separately from the test suite, ideally).
I found out that this exception is raised from here while adding column by fetching URL.
https://github.com/OpenRefine/OpenRefine/blob/70b4c6a6d09d5fa7f733b22636e3f8a897cc6af7/main/src/com/google/refine/operations/column/ColumnAdditionByFetchingURLsOperation.java#L397-L400
I am thinking of a way to provide an option to trust that SSL certificate directly from the UI. For that, we need to get the SSL certificate that was initially presented by the remote host. So far, I have tried the getServerCertificates() method of HttpsURLConnection, but it gives me a null pointer exception when I call it. Are there any leads I can follow to get the root CA of the remote host's SSL certificate?
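One possible lead, offered as a sketch rather than as how OpenRefine does or should do it: per its Javadoc, getServerCertificates() throws IllegalStateException when called before the connection is established, and once the handshake has been rejected there is no verified peer to ask. A common workaround is to wrap the default X509TrustManager so it records whatever chain the server presents before delegating the actual trust decision. The class and method names below (RecordingTrustManager, systemDefault, fetchChain) are hypothetical; only the javax.net.ssl and java.security calls are real JDK API.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper: remembers the chain a server presents, then defers to the real trust manager. */
final class RecordingTrustManager implements X509TrustManager {
    private final X509TrustManager delegate;
    private volatile X509Certificate[] lastChain = new X509Certificate[0];

    RecordingTrustManager(X509TrustManager delegate) {
        this.delegate = delegate;
    }

    /** Returns a copy of the most recently presented server chain (empty if none was seen). */
    X509Certificate[] lastChain() {
        return lastChain.clone();
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        // Record first, so the chain survives even if the delegate rejects it.
        lastChain = (chain == null) ? new X509Certificate[0] : chain.clone();
        delegate.checkServerTrusted(chain, authType);
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        delegate.checkClientTrusted(chain, authType);
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return delegate.getAcceptedIssuers();
    }

    /** Looks up the JVM's default trust manager (backed by the default trust store, usually cacerts). */
    static X509TrustManager systemDefault() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init((KeyStore) null); // null selects the default trust store
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return (X509TrustManager) tm;
                }
            }
            throw new IllegalStateException("No X509TrustManager available");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Attempts a TLS handshake and returns the presented chain, trusted or not. */
    static X509Certificate[] fetchChain(String host, int port) throws Exception {
        RecordingTrustManager recorder = new RecordingTrustManager(systemDefault());
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { recorder }, null);
        try (SSLSocket socket = (SSLSocket) context.getSocketFactory().createSocket(host, port)) {
            socket.startHandshake();
        } catch (SSLException e) {
            // Expected for an untrusted server; the chain was recorded before the rejection.
        }
        return recorder.lastChain();
    }
}
```

Calling RecordingTrustManager.fetchChain(host, 443) for the failing host would then return the certificates the server sent, leaf first. Servers do not always send the root itself, so the last element may only name the root as its issuer. The chain also stays empty if the handshake fails before certificates are exchanged (for example on a protocol or cipher mismatch).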
New root certificates are needed pretty infrequently, so the last suggestion:
Improve the starting point in terms of which certs/encryption are supported by OpenRefine out of the box
would greatly reduce the need for the others.
Java 8u151 and later includes the Unlimited Strength Crypto. The IdenTrust CA root cert needed for Let's Encrypt was included in Java 8u101 and 7u111.
For some reason we seem to specify JRE 8.0_241 on Windows but 8.0_181 on Mac, though both should be fine for these purposes.
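To check what a given bundled JRE actually supports, rather than inferring it from the version number, something like the following could be run against it. This is only a sketch: the class name JreCryptoCheck and its methods are made up for illustration, and "DST Root CA X3" is my assumption for the subject name of the IdenTrust root in question. That root may be absent from newer JREs, so a false result there is not necessarily a problem.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.Locale;
import javax.crypto.Cipher;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

/** Hypothetical diagnostic: reports the crypto policy and trusted roots of the running JRE. */
final class JreCryptoCheck {

    /** True when the JRE allows AES keys longer than 128 bits (unlimited strength policy). */
    static boolean hasUnlimitedCrypto() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES") > 128;
        } catch (GeneralSecurityException e) {
            return false;
        }
    }

    /** Returns the root certificates in the JRE's default trust store. */
    static X509Certificate[] trustedRoots() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init((KeyStore) null); // null selects the default trust store
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return ((X509TrustManager) tm).getAcceptedIssuers();
                }
            }
            return new X509Certificate[0];
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True when some trusted root's subject contains the given text (case-insensitive). */
    static boolean trustsRootMatching(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        for (X509Certificate root : trustedRoots()) {
            String subject = root.getSubjectX500Principal().getName().toLowerCase(Locale.ROOT);
            if (subject.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Java version:       " + System.getProperty("java.version"));
        System.out.println("Unlimited crypto:   " + hasUnlimitedCrypto());
        System.out.println("Trusted roots:      " + trustedRoots().length);
        // Assumed subject name for the IdenTrust root; adjust as needed.
        System.out.println("Has DST Root CA X3: " + trustsRootMatching("DST Root CA X3"));
    }
}
```

Running this with the Windows and Mac bundled JREs would show directly whether the two differ in any way that matters for fetching.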
As for
Making it easy to install additional certs
I think that, generally, we don't want this to be TOO easy because it opens people to social engineering attacks (and it's only very rarely necessary).
Finally, better error reporting is always a plus. @ostephens did you have some specific ideas on ways that it could be improved? Having the error message buried in the cell doesn't make it easy to spot, but it's difficult to know how to improve it. Perhaps a project-wide error console/log?
Switching to a different HTTP library like Retrofit, OkHttp, Netty, or Unirest seems like a useful project, but it seems largely orthogonal to fixing this.
The issue was highlighted to the user in a user-friendly and obvious way (rather than the current situation where the user needs to do quite a bit of troubleshooting and understand obscure error messages)
👍
Simply display a friendly error message with the link to the wiki 😃
A significant contributor to the problem here was poor error reporting. Now instead of the obscure "Received fatal alert: handshake_failure" or "sun.security.validator.ValidatorException: PKIX path building failed", we report the main exception "javax.net.ssl.SSLHandshakeException" first, which is much more recognizable as an SSL problem.
We also use the Apache HTTP client library consistently everywhere now, including for Fetch URL.
By far the easiest way to get support for new root certificates is to install a more recent version of Java. When the spate of problems happened a few years ago, there were patch releases of Java 8 available that included the necessary new root certificates.
On Mac the problem was exacerbated by the bundled Java in OR 2.7/2.8 not having the certs (documented in #1265) - I wasn't sure at the time (or now) what determined which version of Java was bundled in the distributed version but I'd hope that the current versions use up to date Java releases which include the missing certs that gave rise to a fair number of these issues.
Now instead of the obscure "Received fatal alert: handshake_failure" or "sun.security.validator.ValidatorException: PKIX path building failed", we report the main exception "javax.net.ssl.SSLHandshakeException" first, which is much more recognizable as an SSL problem.
Do the more detailed errors still get reported as well? These are documented on the wiki so helpful for users trying to solve the specific issue they are seeing https://github.com/OpenRefine/OpenRefine/wiki/Troubleshooting-Fetching-data-from-URLs
I wasn't sure at the time (or now) what determined which version of Java was bundled in the distributed version
The About Open Refine could display the Java version.
The main page too:

I have added a note to the wiki page about the release process to emphasize that it is important to download fresh JREs to ensure the latest certificates are there:
https://github.com/OpenRefine/OpenRefine/wiki/Releasing-Version
Do the more detailed errors still get reported as well?
Yes, that's what the "first" was meant to imply. The full message would be something like javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed
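As a rough illustration of that message shape (not the actual OpenRefine code), one way to get "main exception first, details after" is to walk the cause chain from the outside in. The class SslErrorText and its describe method are hypothetical names. The example uses java.security.cert.CertificateException as a stand-in for the JDK-internal ValidatorException, which application code cannot construct directly.

```java
import java.security.cert.CertificateException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.net.ssl.SSLHandshakeException;

/** Hypothetical formatter: outermost exception class first, innermost message last. */
final class SslErrorText {

    static String describe(Throwable error) {
        StringBuilder text = new StringBuilder();
        // Identity set guards against a cause chain that loops back on itself.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable t = error; t != null && seen.add(t); t = t.getCause()) {
            if (text.length() > 0) {
                text.append(": ");
            }
            text.append(t.getClass().getName());
            // Only the root cause's message is kept; outer messages usually repeat it.
            if (t.getCause() == null && t.getMessage() != null) {
                text.append(": ").append(t.getMessage());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the real failure: a handshake exception wrapping a validation error.
        Throwable cause = new CertificateException("PKIX path building failed");
        Throwable failure = new SSLHandshakeException("handshake aborted").initCause(cause);
        System.out.println(describe(failure));
    }
}
```

With the stand-in above, describe returns javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: PKIX path building failed, so the recognizable SSL exception leads and the detail the wiki page is keyed on is still there for searching.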
These are documented on the wiki so helpful for users trying to solve the specific issue they are seeing https://github.com/OpenRefine/OpenRefine/wiki/Troubleshooting-Fetching-data-from-URLs
I gave @allanaaa the updates as comments on the new docs. Are we going to continue to have this information in two places?
The About Open Refine could display the Java version.
@pyrog The best way to suggest/request a feature is to create an issue. I've created #3240 for you.
@tfmorris Hi Tom, the plan is to have all product reference (user manual) and technical reference material moved over to https://docs.openrefine.org and then remove that reference material from the various wiki pages and leave what remains to be the community maintained wiki. (Recipes, my Jython experiments, Tony's Jupyter stuff, etc.)
It's still undecided whether we will even keep the wiki on GitHub or move it elsewhere (its search and interface aren't the most welcoming). @magdmartin previously made good points that moving it would perhaps make the wiki more accessible to the average nontechnical user, with which I tend to agree, but then there's maintenance on any host chosen, unless Wikimedia or some other open source hoster has a way for us.