OpenRefine: File type detection does not recognize JSON in certain cases

Created on 24 Jun 2020 · 9 Comments · Source: OpenRefine/OpenRefine

Describe the bug

Creating a project by fetching JSON from certain URLs does not pick the JSON importer by default.

To Reproduce
Steps to reproduce the behavior:

  1. Go to "Create project"
  2. "Web addresses"
  3. Use "https://query.wikidata.org/sparql?query=SELECT%20%3Fitem%20%3Furl%20WHERE%20%7B%0A%20%20%3Fitem%20p%3AP6269%20%5Bps%3AP6269%20%3Furl%20%3B%0A%20%20%20%20%20%20%20%20pq%3AP2700%20wd%3AQ458022%5D.%0A%7D%20LIMIT%20100&format=json"

Current Results

The default importer is "line-based text files"

Expected behavior

The JSON importer should be proposed by default.

Desktop:

  • OS: Windows, Linux

OpenRefine :

  • Version 3.3, still present in master

Additional context
Encountered in this video: https://www.twitch.tv/videos/659297903

bug import

All 9 comments

Note that the importer being selected is "Line-based text files" as opposed to "text files". This is the default fallback when OpenRefine is unable to guess anything else. The distinction is important because JSON is a text file, and it is possible (likely, IMO) that the importer never reaches the TextFormatGuesser.

Thanks, I corrected the issue.

The server is sending content-type: application/sparql-results+json;charset=utf-8. This is one of the cases that would potentially be helped by my suggestion (#2805) that we handle mime type suffixes (e.g. +json).
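To make the suffix idea concrete, here is a minimal Python sketch of how a content type such as the one above could be resolved to a format. It is not OpenRefine's Java code: the function name `format_from_content_type`, the two lookup tables, and the format identifiers in them are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical lookup tables for illustration only; not OpenRefine's real registry.
EXACT_FORMATS = {
    "application/json": "text/json",
    "application/xml": "text/xml",
}
SUFFIX_FORMATS = {
    "json": "text/json",
    "xml": "text/xml",
}


def format_from_content_type(content_type: str) -> Optional[str]:
    """Map a Content-Type header value to a format, honoring "+suffix" types."""
    # Drop parameters such as ";charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in EXACT_FORMATS:
        return EXACT_FORMATS[mime]
    # Fall back to the structured syntax suffix after the last "+", if any.
    _, plus, suffix = mime.rpartition("+")
    if plus:
        return SUFFIX_FORMATS.get(suffix)
    return None
```

With these tables, `format_from_content_type("application/sparql-results+json;charset=utf-8")` resolves to the JSON entry through the suffix fallback, while a type with no suffix and no exact match, such as `text/plain`, yields `None` and would be left to whatever fallback the caller uses.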

Having said that, it seems like the character frequency analysis in TextFormatGuesser should still spot this as JSON, so perhaps that could be made smarter.
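For readers unfamiliar with the idea, a character frequency heuristic could look roughly like the Python sketch below. This is not the actual TextFormatGuesser algorithm, which this thread does not describe in detail; the function name `looks_like_json` and the choice of marker characters are assumptions.

```python
from collections import Counter


def looks_like_json(sample: str) -> bool:
    """Rough guess at whether a text sample is JSON, based on character counts."""
    text = sample.lstrip()
    # JSON documents of interest start with an object or an array.
    if not text or text[0] not in "{[":
        return False
    counts = Counter(text)
    # Compare punctuation typical of JSON against angle brackets typical of XML/HTML.
    json_marks = sum(counts[c] for c in '{}[]":,')
    xml_marks = counts["<"] + counts[">"]
    return json_marks > xml_marks
```

A heuristic like this only sees the sample it is given, so it complements rather than replaces what the server declares in its content type.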

Perhaps the recently added wildcards for MIME types could also be used (#2598)? We could add application/*+json as a mapping? But #2612's implementation would not support that, I think.

Edit: the PR is not merged yet.

@tfmorris I would like to see the TextFormatGuesser made smarter, because I have occasionally seen JSON not guessed correctly even when pasting from the clipboard (rather than downloading from a URL). I'll see if I can dig up one of the messy JSON files where I saw that happen...

I think it is worth doing both, to cover a large range of scenarios.

> Perhaps the recently added wildcards for MIME types could also be used (#2598)? We could add application/*+json as a mapping? But #2612's implementation would not support that, I think.

I think that's the wrong way to do this. I've promoted my suggestion to an issue so that we can reference it (and I've lost the original context). I think we should implement #2805 instead. It should be part of the framework and not require anything from the importers themselves.

> I would like to see the TextFormatGuesser made smarter. Because I have seen on occasion where even using the clipboard (and not URL downloading) was not guessing JSON correctly. I'll see if I can dig up one of the messy JSON's where I saw that happen...

The TextFormatGuesser never gets invoked here, so that wouldn't help. Besides, the server told us explicitly that it's JSON, so there's no guessing needed.

Having said that, please do open issues for any cases where you think the guesser(s) could do a better job. If you do it when you first run into the problem, you won't have to worry about tracking down the files later.

> It should be part of the framework and not require anything from the importers themselves.

I think there is a misunderstanding - I don't think I suggested that this should be handled by importers, rather by the ImportingManager, which handles the logic of mapping MIME types to formats. So there is no disagreement here!

> The TextFormatGuesser never gets invoked here, so that wouldn't help.

This is another unnecessarily deep hole that I noticed imports can fall into. Our default fallback is text/line-based but the LineBasedFormatGuesser will only attempt TSV, CSV, and Fixed formats and will never invoke the TextFormatGuesser where the character frequency analysis is done. I've changed the default to text instead of text/line-based which should make this less of a trap, hopefully.
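The trap described above can be illustrated with a small Python sketch of a chain of format guessers. Everything here is a stand-in: the two guesser functions, the `GUESSERS` table, `refine_format`, and the format identifiers are hypothetical and much simpler than OpenRefine's real guessers. The point is only that the starting format decides which guessers ever get a chance to run.

```python
from typing import Callable, Dict, Optional


def guess_text(sample: str) -> Optional[str]:
    # Stand-in for a guesser that inspects the content and can spot JSON.
    return "text/json" if sample.lstrip().startswith(("{", "[")) else None


def guess_line_based(sample: str) -> Optional[str]:
    # Stand-in for a guesser that only knows separator-based sub-formats.
    first_line = sample.splitlines()[0] if sample else ""
    if "\t" in first_line or "," in first_line:
        return "text/line-based/*sv"
    return None


# Each guesser is registered for the format it refines (hypothetical registry).
GUESSERS: Dict[str, Callable[[str], Optional[str]]] = {
    "text": guess_text,
    "text/line-based": guess_line_based,
}


def refine_format(start: str, sample: str) -> str:
    """Repeatedly ask the guesser for the current format for a better match."""
    current = start
    while current in GUESSERS:
        better = GUESSERS[current](sample)
        if better is None or better == current:
            break
        current = better
    return current
```

In this sketch, starting from `"text/line-based"` with a JSON sample never consults `guess_text`, so the result stays line-based, whereas starting from `"text"` lets the JSON check run first.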
