Is your feature request related to a problem or area of OpenRefine? Please describe.
Parsing JSON with unknown and variable field names is difficult in GREL because the 'get' function requires you to know the field name you want to access. This causes problems where the JSON contains an entity ID/name as a key. Examples:
Wikimedia Commons -> https://commons.wikimedia.org/w/api.php?action=query&titles=File:Google%202015%20logo.svg&prop=imageinfo&iiprop=url&format=json
MySociety MapIt -> https://mapit.mysociety.org/point/4326/0.1293497,51.5464828
This issue has been discussed at https://groups.google.com/d/msg/openrefine/O72bg8GZkj8/vBoEzmGxBwAJ and occasional issues for parsing JSON with this pattern are raised on StackOverflow and on the Google Group - e.g. https://stackoverflow.com/questions/10782737/google-refine-iterate-over-a-json-dictionary
e.g. consider:
{
"results": [
{
"id": "result1_ID",
"value": "result1_value"
},
{
"id": "result2_ID",
"value": "result2_value"
}
]
}
You can extract the 'values' with GREL easily:
forEach(value.parseJson().results,v,v.value)
However, the same information expressed in JSON but in a different way cannot be parsed like this:
{
"result1_ID": {
"value": "result1_value"
},
"result2_ID": {
"value": "result2_value"
}
}
This is difficult to parse with GREL because you don't know the property key to use
Describe the solution you'd like
If it was possible to retrieve a list of property/field keys (as an Array) from the JSON object with a GREL "getFields" function, it would be straightforward to then iterate over the array to get the property keys - e.g. taking the second example above again:
value.parseJson().getFields() -> [ "result1_ID" ,"result2_ID"]
You could then use these keys to retrieve each value:
forEach(value.parseJson().getFields() ,k,value.parseJson().get(k).value)
Describe alternatives you've considered
In his response to https://stackoverflow.com/questions/10782737/google-refine-iterate-over-a-json-dictionary @tfmorris suggested he wanted to implement some functionality that would allow users to
"iterate through the keys of a dict where they keys are unknown"
I'm not sure what the intention was here, but I feel that getting the list of keys for the dictionary offers flexibility and can be used to achieved the desired outcome.
Additional context
I wonder if it would make sense to add "getFields" to the "hasFields" interface and so require it for any class implementing hasFields. Currently this is:
But the advantages might be slight - it might be helpful to be able to do something like:
row.getFields() -> ["flagged","starred","cells","record"]
from a self-documenting point of view this might avoid having to check documentation sometimes - but it is hard to see a situation where you want to iterate through the results of this in GREL
Great ticket, thoroughly documented and motivated! And makes a lot of sense.
Adding a getFields method to the HasFields interface would make sense. We should decide how to introduce this:
getFields function.No strong feelings about this…
By the way version 3.3 will already include some breaking changes for #2164 (which we have to fix since it is a vulnerability).
I think having a default implementation is a good idea - better to avoid breaking things unless we have to :)
@wetneb (or others) can you provide any guidance on the use of HasFieldsListImpl? I can't work out what it is doing and therefor what getFields() should do in that context?
@ostephens here is my take on it:
Consider the following CSV project with records:
artist,album
Gilberto Gil,Toda menina Bahiana
,Expresso 2222
,Refazenda
Hiromi Uehara,Desire
,Deep Into The Night
Assume that all cells are reconciled.
In GREL you can evaluate row.record.cells which will give you a RecordCells object. You can then get row.record.cells.album which is going to look like a list but is in fact a HasFieldsListImpl. The cool thing with that is that you can then access sub-fields of the items transparently:
row.record.cells.album.value or even row.record.cells.album.recon.match.id. This is going to return the list of values of these attributes on each item. This is sort of a hack (as almost always in GREL!) since a list does not, in itself, have a value or recon attribute, only its children have. This is probably quite helpful in GREL which does not have list comprehensions. (In Python, you would rewrite this as: [album.recon.match.id for album in row.record.cells].)
So to implement getFields on this, I see a couple of options:
getFields on the first item, since we assume the items are of a homogeneous type and therefore have the same fields, and return that?getFields on all the items and return the union of these sets?GREL has historically grown slowly, organically as helper functions were needed.
The term Get Fields kind of implies "union" to a user. Its not Get Fields On First Item after all.
We might therefore have 2 GREL functions needed or even 3 depending.
I think its better to get the terminology right for 1 function first, code it, get a feel for it.
Then if we decide we need a 2nd function for unions/intersections, we can introduce that later.
Regardless, the end result is that it should feel very natural for novice users to type out in GREL either 1 or 2 functions to control result sets of JSON substructures.
I would also like to see an analysis/summary of how other tools/languages deal with substructures of JSON. Perhaps then we can see how they deal with that and come up with good terminology and handling to make it "easy" for our OpenRefine users. (for instance, Jackson by default will include null fields, and we might want to explore taking advantage of @JsonView and @JsonIgnore , etc.)
To explain my thoughts on this:
Our users expect to be able to easily navigate JSON as a Tree substructure and sometimes the branches have no twigs as @ostephens pointed out and only direct leaves on their branch. How to handle twig-less branches is the use case here (those without a key that represents a twig with some leaves) Jackson has support for Tree nodes that are missing or considered virtual for instance.
Summary:
I'd rather punt on this .getFields issue, which actually is limiting somewhat, and instead work on a GREL wrapper syntax for supporting JSONPath Plus.
Details:
So after thinking about this for a few months, and in the meantime often just using jq alot for these needs, and sometimes just Clojure. I think what we are really missing is a good library for querying JSON, but without the extras like jq rolls in. JSONPath is an extremely popular specification for querying JSON and has wide support. But the original spec missed some good operators that help with JSON querying like filtering through parent/sibling.
There has been really good activity from JSONPath Plus in the Node.js community over the last 4 years that provides those missing operators and adds additional selectors. It's used by 1000's of companies and developers now. Its compliant to the original JSONPath spec, and just adds to it.
I think incorporating a GREL wrapper for JSONPath Plus would put our users into very comfortable territory. Similar to what @iainsproat did for us back in the day for our jsoup wrapper for parsing HTML.
Bonus:
SmartBear (the guys who wrote the Swagger API Spec and contribute to OpenAPI support and use JSONPath and it's been a part of their products since the birth of JSONPath by Mr. Goessner )
@wetneb:
So to implement
getFieldson this, I see a couple of options:
- only call
getFieldson the first item, since we assume the items are of a homogeneous type and therefore have the same fields, and return that?- call
getFieldson all the items and return the union of these sets?- return the intersection instead?
Well, the way my user eyes see it, it a bit of the 3.
But, at the end, I guess I would want, at each iteration to have all the « fields » contained by the object.
@thadguidry: definitely, implementing JSONPath plus is a good thing, so is getFields as it can also be a debugging helper.
I'd would like for sure to go forward with both.
Regards, Antoine
In his response to https://stackoverflow.com/questions/10782737/google-refine-iterate-over-a-json-dictionary @tfmorris suggested he wanted to implement some functionality that would allow users to
"iterate through the keys of a dict where they keys are unknown"
I'm not sure what the intention was here, but I feel that getting the list of keys for the dictionary offers flexibility and can be used to achieved the desired outcome.
Most likely it was just extending the forEach to be able to iterate over dict items in addition to array items, but this wouldn't give you access to the key (although most data sources that use this braindead format duplicate the key in the payload of the item).
This ticket seems to have grown into something quite a bit bigger.
After reviewing the existing controls, forEach and forEachIndex, I think the solution which would fit best with our current structure is to introduce an forEachItem iterator which provided the key and value for each item in dict in the same way that forEachIndex provides the index and item for an array.
Optionally, also:
forEach to iterate over dict items (but without providing the key)forEachItem polymorphic and have it return index + item for arrays and key + item for dicts (and deprecate forEachIndex). This is probably what we should have done originally.This probably still relies on a similar underlying mechanism to what was started here - extending the HasFields interface to have a keys or items iterator. Having said that, none of our standard objects have dynamic fields and JSON access is not done via that interface.
As an aside, I don't see any evidence of a Java package for JSONPath Plus.
@thadguidry did you have a package in mind?
@tfmorris Yeah the old jayway package... https://github.com/json-path/JsonPath ohh, I didn't add it to that linked issue. Let me do that now.
But it's missing the Plus features. We'll probably have to ask them or look around more? or enhance it?
I found the big issue in Jayway for the primary need and feature of Plus that folks like "nested filters" ... https://github.com/json-path/JsonPath/issues/287
Most helpful comment
I think having a default implementation is a good idea - better to avoid breaking things unless we have to :)