Openrefine: Extend cross() function to support multiple-value-cell-input

Created on 26 Oct 2017  ·  21Comments  ·  Source: OpenRefine/OpenRefine

From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed".

Example

project "Christmas Gifts"

| gift | recipient |
| ------------- | ------------- |
| lamp | mary |
| clock | john |
| sweets | mary,john |
| wine | anne,mary |

project "My Address Book"

| friend | address |
| ------------- | ------------- |
| john | 120 Main St. |
| mary | 50 Broadway Ave. |
| anne | 17 Morning Crescent |

Transformation

add values from column "address" from project "My Address Book" based on keys (with multiple values separated by ",") in column "recipient" from project "Christmas Gifts"

forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(",")

Output

| gift | recipient | address |
| ------------- | ------------- | ------------- |
| lamp | mary | 50 Broadway Ave. |
| clock | john | 120 Main St. |
| sweets | mary,john | 50 Broadway Ave.,120 Main St.
| wine | anne,mary | 17 Morning Crescent, 50 Broadway Ave.

History

see last few comments in #1204

enhancement logic performance

Most helpful comment

@felixlohmeier OK - hadn't noted the performance difference, and agree that's a big factor.

I have a common use for cross in which the target project for the cross contains multiple matches for a row in the source project - which I'm not sure this deals with so effectively - although maybe I need to experiment a bit more with the possible GREL!

I'm still not completely convinced the best thing to do here is change the cross function this much but I'd be v happy to see a new function that works as you need - I don't particularly mind it being hacky, just slightly uncomfortable that it introduces ambiguity into the function.

I'm just one voice though

All 21 comments

An issue here is that you cannot tell from the results which value resulted in the listed outcome. To take the simplest example:

project "Christmas Gifts"

| gift | recipient |
| ------------- | ------------- |
| lamp | mary |
| clock | john |
| sweets | mary,john |
| wine | anne,mary |

project "My Address Book"

| friend | address |
| ------------- | ------------- |
| john | 120 Main St. |
| mary | 50 Broadway Ave. |

Transformation

add values from column "address" from project "My Address Book" based on keys (with multiple values separated by ",") in column "recipient" from project "Christmas Gifts"

forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(",")

Output

| gift | recipient | address |
| ------------- | ------------- | ------------- |
| lamp | mary | 50 Broadway Ave. |
| clock | john | 120 Main St. |
| sweets | mary,john | 50 Broadway Ave.,120 Main St.
| wine | anne,mary | 50 Broadway Ave.

In the last row of this Output it is impossible to know whether the address is related to anne or mary.

In more complex situations you could have multiple matching rows in the project named in 'cross', resulting in a list of matching values, which you cannot tell which key matched.

The only way I can see you could make this explicit would be to return a more structured response (e.g. maybe JSON which labelled each returned array of values with the key used to look them up).

I'd suggest this would be better in a separate operation, rather than making cross more complex or capable of returning different format of results depending on the parameters

I agree… and splitting the cell contents on the fly with a regex is quite hacky too. The cross function already feels hacky in the first place actually. This is just a hack around the fact that OpenRefine's model is inherently designed for single-table databases. Ideally, we would need a proper join operation between two tables.

Also worth noting that with the change to use a string rather than just a 'cell' object for the lookup, the desired outcome can be achieved in GREL:

forEach(value.split(","),v,forEach(v.cross("My Address Book","friend"),r,r.cells["address"].value).join(",")).join(",")

(although I agree it isn't pretty)

I agree that it is quite hacky, but it is very useful and it works for all cases i can imagine. Before I have realized that the pull request got reverted I extended the documentation of the cross() function in the wiki, see https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions (cross, note for dev version)

@ostephens, I think there are many possibilities with the new operator and combined forEach, forNonBlank or value.split() operations. What do you think as a GREL pro :-) ?

Example for splitting multi-value field values in from column and extract more than one value from the results:

forEach(value.cross("My Address Book", "friend", ","),r,forNonBlank(r.cells["address"].value,v,v,"")).join("|")

Example for splitting multi-value field values in from column, extract only the first value from the results and return a custom string if there is a match in foreign key ("friend") but no value in target column ("address"):

forEach(value.split(","),v,forNonBlank(v.cross("My Address Book", "friend", ",")[0].cells["address"].value,x,x,"!")).join("|")

See example data attached: example-data-1289.zip

I have tested the cross-func-split branch with quite "big" data: project A with 1.5 million rows and 100 columns, project B with 500.000 rows and 50 columns. Keys in project A are separated by a string. Without the new cross() function I need to split multi-valued cells first (and join after doing the cross). This process (split, cross, join) takes about 30 minutes on my machine. If I use the new cross() function with integrated split possibility it takes only seconds!

@ostephens write:

Also worth noting that with the change to use a string rather than just a 'cell' object for the lookup, the desired outcome can be achieved in GREL:
forEach(value.split(","),v,forEach(v.cross("My Address Book","friend"),r,r.cells["address"].value).join(",")).join(",")

The old cross() function builds a "cache" containing all cell values in from and to column. The GREL above does not produce the desired outcome if there are keys in the from column that are only present in multi-value form (e.g. mary, anne in one cell but no cell containing mary alone -> mary will not be "cached"). See this example: https://github.com/OpenRefine/OpenRefine/issues/1204#issuecomment-316012444

@felixlohmeier OK - hadn't noted the performance difference, and agree that's a big factor.

I have a common use for cross in which the target project for the cross contains multiple matches for a row in the source project - which I'm not sure this deals with so effectively - although maybe I need to experiment a bit more with the possible GREL!

I'm still not completely convinced the best thing to do here is change the cross function this much but I'd be v happy to see a new function that works as you need - I don't particularly mind it being hacky, just slightly uncomfortable that it introduces ambiguity into the function.

I'm just one voice though

Team, S L O W D O W N here. :)

Overall there are several ways to approach the use case and making @felixlohmeier happy with essentially a faster "split or join operations that are based on keys".

But what I need to understand and what confuses me is a mixing of discussions all over the place and I want to know which one that @felixlohmeier is after. Is it...

  1. We are trying to attempt to use cross() in order to handle a key lookup. And cross() does that very well now. When the keys match, @felixlohmeier wants to perform further operation(s).
  2. We are trying to perform a "split function who's output goes into a key lookup argument in cross()"

The problem with 2 is that it complicates matter because it means more memory is needed since data copying and cache storage is needed to hold the split key while lookups are performed.

So @felixlohmeier perhaps the problem your facing is that the split() is taking a really long time on your 1.5 million rows in Project A when you do only that operation ? How long does it take if you do a Edit column -> Split into several columns ?

Team, S L O W D O W N here. :)

Hmm, can you remind us who merged this PR on sight? ^^

@wetneb the PR made a bit of sense to me, but now we have @felixlohmeier last 2 hours of comments. So that changes things Antonin. But yeah, I can admit that perhaps I was hasty and the use case was not quite so clear. So let's clear it up now that I'm here on the issue.

I agree the idea of simplicity. cross function behaves like the database join operation but lacks the validation aspect due to the unawareness of schema of each other. so we have to make assumption that user understand their data before applying this operation. The ideal situation is the data elements are normalised and can be matched without extra tweak. that's from the functionality point of view as well as the performance point of view.

Also there is a myth that cross can handle the foreach properly with given separator. It can NOT. because the cache will always cache the whole cell value as the key(ie, "ABC, EFG" etc), not the element itself.

Many thanks to ALL that you dive into my use case! I am happy with a more general approach and a S L O W D O W N :-). @claussni developed a solution that works for me, so it is not urgent for me anymore. But maybe the extended cross() functionality in #1294 is useful for someone else right now and could be merged until a more general approach is discussed and developed?

@thadguidry: here are some stats...

system

  • VM with 4 XEON CPU E5-2660 v3 @ 2.60 GhZ and 98GB RAM
  • OpenRefine 2.7
  • Xms: 70G ; Xmx: 70G

initial load base project:

  • 1521345 rows, 138 columns
  • loaded project from disk in 50 sec(s)
  • RSS after initial load: 32483896

initial load target project:

  • 457966 rows, 34 columns
  • loaded project from disk in 3 sec(s)

the "old" way

split multi-valued cells

  • command: edit cells / split multi-valued cells
  • run time: 4m39s
  • RSS after split: 38782472
  • 1864858 rows

cross

  • expression: forEach(cross(cell,"A","B"),v,v.cells["C"].value)[0]
  • duration: 0m12s

join multi-valued cells

  • command: edit cells / join multi-valued cells (for key column AND new column)
  • run time: 4m43s + 4m33s = 9m16s
  • RSS after join: 45576532
  • 1521345 rows

total run time: 4m39s + 0m12s + 9m16s = 14m7s
highest memory load: 45576532

There are several columns with foreign keys in the base project which multiplies run time in my use case by ~10.

with cross-func-split branch

cross

  • expression: forEach(value.split("␟"),v,forNonBlank(v.cross("A","B","␟")[0].cells["C"].value,x,x,"␞")).join("␟")
  • run time: 0m09s
  • RSS after cross: 33681272

split into several columns

  • split by separator: ␟
  • guess cell type and remove column unselected
  • run time: 4m02s
  • 40 new columns

@felixlohmeier Why are you using Split Multi-Valued cells instead of 'Edit column -> Split into several columns' ? Is it because you need record rows made from Split Multi-Valued cells ?

@thadguidry There are up to 40 keys in the from column in the example above. If I use "split into several columns" I will get up to 40 new columns. I would have to apply the cross function 40 times (one transformation for each column) and concat resulting values into one column afterwards, right? Performance is only slightly better (see above).

@felixlohmeier well I was thinking also of you playing with our Transpose while your doing that. OpenRefine's data model prefers 'more rows, less columns'. Honestly however, if you've got that much data to manipulate, I'd be more than happy to have a Google hangout with you this weekend and just train you and show you tips in Pentaho ETL or any other tool designed better for this. We can certainly put more into OpenRefine (and we will eventually I promise) to make this sort of thing easier, but honestly, and I am being dead serious, OpenRefine's core itself is not well designed to handle this use case. However, tons of free off-the-shelf tools handle it with ease and in seconds. I can do what your asking to do in Pentaho on 1 billion rows in about 22 secs.

Thank you @thadguidry, @jackyq2015, @ostephens and @wetneb. I understand your hesitation to put functionality into OpenRefine that is out of it's main scope. I am grateful for your comments.

I work with library data (books, articles, etc.) and in this particular use case we built a discovery tool based on existing data that is only available in proprietary structured fixed-width text files. The project team (including non-tech-librarians) loves the opportunity to explore and experiment with their data in OpenRefine. We built an simple ETL workflow from these fixed-width text files to Apache Solr with OpenRefine, a python client for OpenRefine and some shell scripts (openrefine-batch.sh). This works great overall. Librarians can explore and manipulate data and transformation rules in a daily updated OpenRefine instance. Administration for the IT department in the library is easy. The templating feature allows us to build custom library metadata formats (e.g. dublin core xml). I am not keen to introduce more tools to the workflow. I want to keep it simple.

I am aware of real ETL tools like Pentaho, although I have not much experience using it. I have used Pandas recently, which provides merge and join functions that would also handle this use case. There are library specific tools (without GUIs) like catmandu or metafacture. But as I said, I prefer to stick with OpenRefine. The code from @claussni works very well for this use case and solves the performance bottleneck in the workflow.

Well, works for me != works for us... I thought that other people might be interested in this extended cross() function. I did not get the full picture why it is exactly troublesome to add #1294 to the code base but I understand the general approach to keep OpenRefine straightforward and maintainable.

@felixlohmeier AH... and now I have the proper context. :) Ok, then your fine. I'm sure other folks WILL find a splitCross() function useful, but I am actually against having additional argument based approaches. Early on we developed GREL and its syntax to be more functional oriented and support the wonderful dot notation chaining. If we overload our existing cross() with tons of arguments, then I fear the complexity becomes a tedious headache rather than GREL being a breath of fresh air. Which is the full picture why its troublesome actually, its not about functionality as @jackyq2015 or @wetneb might worry about, but for me its about our user base and keeping GREL's ease of use.

Incidentally, one of the ideas that is yet to be implemented is making it easier to write GREL to be "a walk in the park" (yeah, I actually hate our Expression Window and it could be done much more slicker if I had a UI expert latched to my designer brain) and also having CodeMirror support to make our GREL syntax even more vivid and appealing like autocompletion, autoclosing brackets, tons more. Just need to write a tokenizer for GREL language, but before we do, I want to actually improve GREL syntax a bit, perhaps next year. https://codemirror.net/doc/manual.html#modeapi

Sounds great @thadguidry ! In the meantime i will maintain a fork that includes #1294 here:

I have reverted my changes to the documentation of the cross() function in the official OpenRefine wiki (and added a recipe from previous discussions from the mailing list).

I will close this issue now. Thank you all for your help!

@felixlohmeier Sure thing! Nice working with you Felix on your interesting use case!

@thadguidry CodeMirror is great! but AutoCompletion seems broken.

@felixlohmeier I submitted one issue for your repo:
https://github.com/opencultureconsulting/openrefine-batch/issues/4

@jackyq2015 @felixlohmeier @wetneb For the big picture, I've begun describing the future state I'd like to see with cross() here https://github.com/OpenRefine/OpenRefine/projects/1#card-5380183

Was this page helpful?
0 / 5 - 0 ratings