OpenRefine: How to determine what the "User Agent" is - and how to change it?

Created on 31 Jul 2017 · 16 comments · Source: OpenRefine/OpenRefine

When using Edit Column -> Add columns by fetching URLs..., I sometimes get blank/no results for certain URLs/queries, and when I select Store error it typically returns something like the following:

<h1>Access denied</h1> <p>The owner of this website has banned your access based on your browser's signature ...

The same URL resolves fine when using my web browser, so I _assume_ it doesn't like/accept OpenRefine's User Agent, and/or expects my browser's User Agent, e.g.:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12


How might I change the User Agent in OpenRefine so that it reads the same as my web browser?

Or, how can I determine what the User Agent is for OpenRefine so that I can use it in my web browser (Develop -> User Agent -> Custom)?
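One quick way to see exactly what User-Agent a client sends (without digging into its source) is to point it at a tiny local echo server that replies with whatever User-Agent header it received. This is a sketch in plain Python 3, not part of OpenRefine; you would run it locally and then fetch the printed URL from OpenRefine:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoAgentHandler(BaseHTTPRequestHandler):
    """Replies with whatever User-Agent header the client sent."""
    def do_GET(self):
        agent = self.headers.get("User-Agent", "(none)")
        body = agent.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet

def start_echo_server():
    """Start the echo server on a free port; returns (server, url)."""
    server = HTTPServer(("127.0.0.1", 0), EchoAgentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://127.0.0.1:%d/" % server.server_address[1]

# Usage: run start_echo_server(), then use the returned URL in
# "Add columns by fetching URLs" - the new cell will contain the
# User-Agent string OpenRefine actually sent.
```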


Similarly, regarding websites that require credentials (_username/password_), is it somehow possible to add/enter credentials while using Edit Column -> Add columns by fetching URLs...?


_Assuming there is not already a way to address one or both of the above_... Is there anyone -_amongst us_- who is capable (_and willing_) of writing an extension that would provide the following:
_(1.)_ _editing/changing_ the User Agent
_(2.)_ adding a _username/password_ for fetching URLs that require credentials?

Although I am unable to contribute code-wise (_I am not a programmer_), I would certainly contribute a _fair_ bounty, based on feedback to this request from other members of the community.

Thanks,

Eric

enhancement fetch urls Medium

Most helpful comment

The ability to set HTTP headers when using Edit Column -> Add columns by fetching URLs is now part of the 3.0 (beta) release:

  • User-Agent
  • Accept
  • Authorization

https://github.com/OpenRefine/OpenRefine/releases

All 16 comments

@ericjarvies thanks for this very detailed issue!

The User Agent OpenRefine uses is currently determined by the version of the underlying Java library, which is not a very good thing. For instance, OpenRefine 2.7 uses "Java/1.7.0_121" as User Agent on my platform. This is definitely something we should change (the default User Agent should mention OpenRefine and give its version number).

It would totally make sense to let users define their own headers. There is a long-standing need for that and the earliest issue seems to be #218. It would make sense to put a bounty on that.

Hi Eric,

For now, the function only accepts GET requests, and I have no idea how to change the internal agent.

+1 for the bounty. In the meantime, you can try this: add a column based on the column that contains your URLs, then use Python / Jython instead of GREL.

import urllib2

# Build an opener that sends a browser-like User-Agent
# instead of the Java default (e.g. "Java/1.7.0_121")
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12')]

# 'value' is the cell content, i.e. the URL to fetch
response = opener.open(value)
return response.read()

Can you give it a try? If it works, you can handle a site that requires a username/password with a few more lines.
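For the username/password case, the "few more lines" usually boil down to sending an HTTP Basic Authorization header, whose value is just a base64 encoding of `username:password`. Here is a sketch in plain Python of building that value (the credentials are placeholders); the resulting string could then be added as an extra `('Authorization', ...)` pair to the opener's headers in the Jython expression above:

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")

# In the Jython expression, add the result to the opener's headers, e.g.:
# opener.addheaders = [('User-Agent', '...'),
#                      ('Authorization', basic_auth_header('user', 'secret'))]
```

Note that Basic auth sends the credentials essentially in the clear, so it is only appropriate over HTTPS.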

Thanks @wetneb and @ettorerizza for the feedback.

What would be a fair bounty for this feature/option (_the ability to set/change_ User Agent)?

@ettorerizza - Your above mentioned advice/instructions worked, thank you.

@ettorerizza - If I may; what are the _"few more lines"_ that allow credentials (username/password) to be used?

It seems both the User Agent and Credentials features would not be terribly difficult to add, right?

Although you've offered _interim_ solutions to both of these, wouldn't it be better if they were included as default menu items/options?

Do you have an opinion regarding what a fair bounty is (_vs. programming time_) for each of these tasks?

Eric

@ericjarvies I agree this should be supported natively by the "Add columns by fetching URLs" operation.

It is hard to judge what a fair bounty would be for this feature - in terms of orders of magnitude, the last two bounties on this project were around $300. In theory I guess people should just contribute what they can afford, and it would add up for issues where there is more demand.

@ericjarvies It all depends on the authorization system that the site uses. But if it's the kind to ban user agents and impose a username/password, it must also have terms of use that prohibit scraping. It will not take long for it to detect that your queries are being sent by software. OpenRefine can do basic scraping, but this is not its primary function.

I've been fetching records from this site for several years and have never been blocked, but I only ever fetch their 'free' data; otherwise I am a regular/ongoing paying customer for their 'fee'-based data.

I have a handful of other sources/sites that also yield the same un-styled XML, but I can't remember which ones they are. As I do my routine updating I'll make note of them as they occur and post them on GitHub so there are some examples to work with.

thanks,

eric


FWIW, some sites prohibit anonymous and library-default user-agent strings because of past bad behavior by bots that don't identify themselves and that violate robots.txt directives or scrape so aggressively that site performance suffers. This has been our experience, prior to a recent upgrade, with the website of the Pleiades gazetteer (see intersecting discussion at #1265).

If there is interest, I can see if I can find Pleiades-related funds for a bounty for resolving this OpenRefine issue. Please advise.

@paregorios it would be great if you can help with this issue! I agree that it is an important topic (that's also why I have worked on caching for URL fetching, so that OpenRefine is a bit more polite with data providers).

It would be easy to set a hard-coded User-Agent for OpenRefine. However, this issue is slightly broader than that - offering the user the ability to set the user agent themselves - which makes it a more challenging enhancement.

Would it be feasible to just set the Agent to Chrome's? I cannot see the benefit of OpenRefine setting its own Agent; rather the opposite.

I'm against us (as a default) deliberately sending an agent string that identifies a different piece of software. I understand that if we make it possible to set the Agent, then no doubt it will be used in this way, but I don't think we should make this the default behaviour of the software.

My other aim will be to support some other HTTP headers - which will probably be the more useful part of the enhancement, to be honest - for example, to enable content negotiation.

@ostephens You are probably right, but the hard part is not adding those HTTP headers in OpenRefine, IMO. The hard part is getting third parties to accept and support them. I am not sure how other similar software deals with this situation - they must come across the same issue. It is worth taking a look.

I've submitted a PR #1434 which allows the user to set three http request headers:

  • User-Agent
  • Accept
  • Authorization

Adding additional headers can be done easily either in an extension or in the core code. @thadguidry has suggested adding the 'Proxy-Authorization' request header as well - very happy to do this if there is a need for it.

If anyone (@paregorios?) is in a position to test and give feedback on the functionality from the user perspective, that would be very welcome.

