OpenRefine: Do not display the entire element returned by "fetch urls"

Created on 23 Jan 2018  ·  15 Comments  ·  Source: OpenRefine/OpenRefine

If you click on this VIAF API URL, even your browser will have trouble displaying the returned JSON. Now imagine having 100 of them and using "Add column by fetching URLs" in OpenRefine... Even with only two URLs, the whole interface slows down considerably. This is particularly the case in the transformation window: the preview takes forever to appear and everything freezes.

[screencast]

The same problem occurs when extracting the source code of a large web page for scraping purposes. I wonder if there would be any way to display in Refine only the first lines of the JSON/XML or source code. After all, I doubt that users really read this tag soup. Instead, they most likely use the web developer tools in their browser to identify the path of the elements they later extract with the parseJson() or parseHtml() functions.

enhancement fetch urls High

Most helpful comment

Until the problem is solved, here is the best workaround I have found to help the browser cope with these large chunks of HTML or JSON:

1. Fetch the URLs and store the result in a column called, for example, "HTML_RESULT".

2. Immediately click on "View -> Collapse this column".

3. Extract the HTML elements that interest you into an empty column by using cells['HTML_RESULT'].value.parseHtml().select(<YOUR JSOUP SELECTOR>)

(or cells['JSON_RESULT'].value.parseJson().pathtotheelement, of course)

All 15 comments

I think a cap on text length would be reasonable.

Otherwise, one possible workaround is to do the fetching in Python and extract the interesting data in the same go (so, without storing the full response).
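The "fetch and extract in the same go" idea could look roughly like the sketch below. It is only an illustration in plain Python, not OpenRefine code: the function names, the `fetch` parameter and the `mainHeading` field are all assumptions made for the example.

```python
import json
from urllib.request import urlopen


def fetch_text(url, timeout=30):
    """Download a URL and return its body decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_and_extract(url, extract, fetch=fetch_text):
    """Fetch a JSON document and return only what `extract` pulls out of it.

    The full response exists only inside this function, so it never has
    to be stored in (or displayed from) a cell.
    """
    document = json.loads(fetch(url))
    return extract(document)


def main_heading(document):
    # "mainHeading" is a hypothetical field name, used only for illustration.
    return document.get("mainHeading")
```

Passing the downloader in as `fetch` is just a convenience so the extraction logic can be tried on a canned response without hitting the network.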

@wetneb I had thought of this possibility. After all, we often know in advance what we want to extract from the JSON / XML / HTML. The only case where storing the response is useful is when one wants to create several columns from the same response. The ideal would be to have the option either to store the response or to parse it on the fly with something like:

value.fetchUrl().parseJson()...

This sounds like a preference that we should set...and then if users want to override it...they update the preference.

@wetneb So... any thoughts on a reasonable cap that we can set up in preferences.vt?

I'm not sure, maybe something like 1024 characters?

Capping the text might make it hard to suit users' needs.

Maybe the extraction can be done on the VIAF side. That would ease the burden on the OR side, though I am not familiar with the syntax.

Yes if there is a text cap it should be large by default and configurable by the user.

On the VIAF side if you use the SRU interface you can limit the number of records returned, but I don't think this is the problem here. The issue is that requesting a single record that has many alternative name representations etc. you get a very large chunk of JSON which you can't limit through the API.

I'm in favour of a user-configurable cap on the amount of text shown in a cell, with a couple of caveats:

  • There needs to be a 'show everything' setting (rather than just putting large numbers to the limit and hoping)
  • When you click 'edit' it should always offer the full text for editing
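To make the two caveats concrete, here is a minimal sketch in Python (not OpenRefine's actual code). The function names are hypothetical, the 1024 default is simply the figure floated earlier in the thread, and `None` is an assumed way of spelling "show everything".

```python
ELLIPSIS = "..."


def display_text(value, limit=1024):
    """Return the text to show in a cell.

    limit=None is the explicit "show everything" setting, so nobody has
    to type a huge number and hope it is big enough.
    """
    if limit is None or len(value) <= limit:
        return value
    return value[:limit] + ELLIPSIS


def edit_text(value):
    """The edit box ignores any display limit and always gets the full value."""
    return value
```

Only the displayed string is shortened; the stored cell value is never modified.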

Hello all. Looks like OR 3 has taken the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page. Now, scraping 100 URLs, or even 10, has become a pain. OR crashes or slows down.

@ettorerizza I was not even aware that we had added pretty-printing there…

@thadguidry, @ostephens, @weblate & @ettorerizza: how about we bring this to the next step:

  • default to showing a max of 1024 chars in the display, per column, but,
  • user can edit a column pref and change that limit per column.
  • when a cell value is truncated, a little icon is shown in the cell (like a + on the lower right).

This would only affect display.

@antoine2711 I'm generally happy with the idea but I have some concerns about the details:

  • Should the max setting be configurable at the top OpenRefine level (applies to all projects)? And/or at Project level? And/or Column level?
  • I think we need to support setting the display to unlimited (rather than specify a large number of characters)
  • I think it should be possible to set display limits across multiple columns, or all columns in a project easily
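One possible way to reconcile these levels is a "most specific setting wins" lookup, sketched below in Python. This is an assumption about how it could work, not a description of OpenRefine: the function names, the `UNLIMITED` sentinel and the 1024 application default are all illustrative.

```python
UNLIMITED = "unlimited"  # explicit sentinel, so "not set" (None) stays distinct


def resolve_display_limit(column=None, project=None, application=1024):
    """Return the most specific limit that is set: column, then project, then application.

    Returns an int, or None when the winning setting is UNLIMITED
    (or when nothing is set at any level).
    """
    for setting in (column, project, application):
        if setting is not None:
            return None if setting == UNLIMITED else setting
    return None


def set_limit_for_columns(column_limits, column_names, setting):
    """Return a copy of the per-column settings with `setting` applied to several columns."""
    updated = dict(column_limits)
    for name in column_names:
        updated[name] = setting
    return updated
```

With this shape, a project default plus per-column overrides covers the "different limits in the same project" case, and `UNLIMITED` gives a real "show everything" option at any level.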

Yes good questions, @ostephens.

  • Should the max setting be configurable at the top OpenRefine level (applies to all projects)? And/or at Project level? And/or Column level?

For sure at the Column Level. Probably also at either project or host level (or both?! not very complex/time consuming to code). But I would set the default app max at 5K, not 1K.

  • I think we need to support setting the display to unlimited (rather than specify a large number of characters)

Even when showing only 10 lines, working with large cells is very impractical. Here is a test with one cell containing a 1000-digit number and another cell containing 50 rows of 100-digit numbers. This made me think that a feature request for a maximum column width is going to be the next step…
[image]

It is not really workable, I would say. I have had columns with far less data, and I ended up deleting them for convenience.

  • I think it should be possible to set display limits across multiple columns, or all columns in a project easily

Well, if we implement a project default that gets set at the creation/import step, that would amount to the same thing, with an added override option. I can very easily imagine someone wanting different limits on several columns in the same project.

Regards, Antoine

👍 to allow setting on a per column basis, easily.

Looks like OR 3 has taken the exact opposite of what has been discussed here: the HTML extracted from URLs is now pretty-printed, which greatly increases the weight of the page.

Does anyone know where/when this change was made?

From a practical point of view, thousands of characters displayed for a cell or transformation preview provide no benefit to the user. I'd be fine with truncating them by default, perhaps providing an ellipsis (...) or some other affordance that they could use to display all.
