Jabref: Google Scholar: Preview with wrong encoding

Created on 17 Aug 2016 · 9 Comments · Source: JabRef/jabref

JabRef 3.6dev--snapshot--2016-08-16--master--257ba21 windows 10 10.0 amd64 Java 1.8.0_101

Steps to reproduce:

  1. Web search interface
  2. Google Scholar
  3. Fetch
  4. Preview with wrong encoding

fetcher bug 🐛

All 9 comments

Confirmed in latest dev build. Maybe somehow related to #1694
Search term used: non linear component analysis as a kernal eigenvalue

Have not checked the source... @oscargus Have you made a change in one of your recent PRs so that an encoding must always be set in the fetchers? I don't remember exactly...

Yes, but what I did was to remove

public String downloadToString() throws IOException {
    return downloadToString(Globals.prefs.getDefaultEncoding());
}

and replace downloadToString() with a direct call to downloadToString(Globals.prefs.getDefaultEncoding()), see https://github.com/JabRef/jabref/commit/90044ac423016f8b5931ea6a4067ff314ed3047a

While this is the only change from our side, it is not obvious that this should affect it.

Could it be that they are not using HTML-encoded characters anymore, but UTF-8?

Okay... I think I got it.

The problem is that, because we are using the user agent "JabRef", Google is serving the results not in UTF-8 but in ISO-8859-1. We are using new URLDownload(urlQuery).downloadToString(Globals.prefs.getDefaultEncoding()); which is generally a bad idea. Thus it works if a user has ISO-8859-1 as the default encoding, but not in any other case...
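To illustrate the mismatch described above, here is a minimal sketch (the class and method names are made up for illustration, not JabRef code). It encodes a string the way a server might and decodes it with a different charset, as a client with another default encoding would.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JabRef.
final class EncodingMismatchDemo {

    // Encode text with the server's charset, then decode the raw bytes
    // with whatever charset the client assumes.
    static String roundTrip(String text, Charset serverCharset, Charset clientCharset) {
        byte[] wire = text.getBytes(serverCharset);
        return new String(wire, clientCharset);
    }

    public static void main(String[] args) {
        String author = "Sch\u00f6lkopf"; // "Schölkopf", escaped to be independent of source encoding

        // Matching charsets: the text survives unchanged.
        System.out.println(roundTrip(author, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Server sends ISO-8859-1, client assumes UTF-8: the umlaut byte is not
        // valid UTF-8 and is replaced by U+FFFD.
        System.out.println(roundTrip(author, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

The reverse mismatch (UTF-8 bytes read as ISO-8859-1) does not lose data but turns each non-ASCII character into two wrong ones.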

The question is how to solve this... we can either switch the user agent to a "Mozilla-like" string to get UTF-8 data from Google _AND_ hard-code downloadToString(StandardCharsets.UTF_8), or we keep "JabRef" and use the ISO charset (both variants also solve #1694)
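Either variant boils down to sending a chosen user agent and decoding with a fixed charset instead of the user's default. A rough sketch using only the JDK follows. The class and both helpers are hypothetical and do not reflect JabRef's URLDownload API, and which charset Google returns for which user agent is taken from the observation above, not verified here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper class, not JabRef's URLDownload.
final class FixedCharsetDownload {

    // Read the whole stream and decode it with an explicit charset,
    // independent of any user preference.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Send the given User-Agent header and decode the response with a fixed charset.
    static String download(URL url, String userAgent, Charset charset) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = connection.getInputStream()) {
            return readAll(in, charset);
        }
    }
}
```

With this shape, the first variant would call download(url, someMozillaLikeString, StandardCharsets.UTF_8) and the second download(url, "JabRef", StandardCharsets.ISO_8859_1).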

Opinions?

Hm. Theoretically UTF-8 would be the preferred version. Not sure if Google treats our "JabRef" user agent any differently than any other string (except for the encoding)

I think hard-coding is fine. As long as the imported entries are converted
to plain ASCII-LaTeX it should work fine for either case (probably slightly
better for UTF-8, depending on how clever Google is). If we try to convert
to some other encoding, we open ourselves up to lots of work (or at least
the need for some library).

I'd go with some Mozilla-like string and UTF-8, as is done in some of the
other fetchers (I assume there is no way to tell Google what encoding we
want, or, as part of the download, to read the header, get the header
information, and use that in a clever way). (ISO wouldn't work for #1694,
right?)
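The header-based idea mentioned above could be sketched as follows: take the charset from the response's Content-Type header and fall back to a fixed one when it is missing or unknown. This is only an illustration under the assumption that the server declares its charset there. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, not part of JabRef.
final class ContentTypeCharset {

    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; return the fallback if none is usable.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String parameter = part.trim();
            if (parameter.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = parameter.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

The header value itself can be obtained from java.net.URLConnection.getContentType() after opening the connection.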

Best would be JabRef as user agent and UTF-8 as response, if this is somehow possible.

Try adding "&oe=utf-8" to the calling URL.
Found at http://www.codesimple.net/2006/08/google-maps-utf-8-problem_9034.html

Will check later if this also helps for Google scholar queries...
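A small sketch of that suggestion: append the oe parameter to a query URL unless it is already present. Whether Google Scholar honors oe is not confirmed in this thread. The helper and the example URL are hypothetical, and the sketch assumes the URL has no fragment part.

```java
// Hypothetical helper class, not part of JabRef.
final class OutputEncodingParam {

    // Append "oe=utf-8" as a query parameter unless an oe parameter already exists.
    static String withOutputEncoding(String url) {
        if (url.contains("?oe=") || url.contains("&oe=")) {
            return url;
        }
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "oe=utf-8";
    }

    public static void main(String[] args) {
        // Placeholder URL for illustration only.
        System.out.println(withOutputEncoding("https://example.org/search?q=kernel"));
    }
}
```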
