From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed".
| gift | recipient |
| ------------- | ------------- |
| lamp | mary |
| clock | john |
| sweets | mary,john |
| wine | anne,mary |
| friend | address |
| ------------- | ------------- |
| john | 120 Main St. |
| mary | 50 Broadway Ave. |
| anne | 17 Morning Crescent |
add values from column "address" from project "My Address Book" based on keys (with multiple values separated by ",") in column "recipient" from project "Christmas Gifts"
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(",")
| gift | recipient | address |
| ------------- | ------------- | ------------- |
| lamp | mary | 50 Broadway Ave. |
| clock | john | 120 Main St. |
| sweets | mary,john | 50 Broadway Ave.,120 Main St.
| wine | anne,mary | 17 Morning Crescent, 50 Broadway Ave.
see last few comments in #1204
An issue here is that you cannot tell from the results which value resulted in the listed outcome. To take the simplest example:
| gift | recipient |
| ------------- | ------------- |
| lamp | mary |
| clock | john |
| sweets | mary,john |
| wine | anne,mary |
| friend | address |
| ------------- | ------------- |
| john | 120 Main St. |
| mary | 50 Broadway Ave. |
add values from column "address" from project "My Address Book" based on keys (with multiple values separated by ",") in column "recipient" from project "Christmas Gifts"
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(",")
| gift | recipient | address |
| ------------- | ------------- | ------------- |
| lamp | mary | 50 Broadway Ave. |
| clock | john | 120 Main St. |
| sweets | mary,john | 50 Broadway Ave.,120 Main St.
| wine | anne,mary | 50 Broadway Ave.
In the last row of this Output it is impossible to know whether the address is related to anne or mary.
In more complex situations you could have multiple matching rows in the project named in 'cross', resulting in a list of matching values, which you cannot tell which key matched.
The only way I can see you could make this explicit would be to return a more structured response (e.g. maybe JSON which labelled each returned array of values with the key used to look them up).
I'd suggest this would be better in a separate operation, rather than making cross more complex or capable of returning different format of results depending on the parameters
I agree… and splitting the cell contents on the fly with a regex is quite hacky too. The cross function already feels hacky in the first place actually. This is just a hack around the fact that OpenRefine's model is inherently designed for single-table databases. Ideally, we would need a proper join operation between two tables.
Also worth noting that with the change to use a string rather than just a 'cell' object for the lookup, the desired outcome can be achieved in GREL:
forEach(value.split(","),v,forEach(v.cross("My Address Book","friend"),r,r.cells["address"].value).join(",")).join(",")
(although I agree it isn't pretty)
I agree that it is quite hacky, but it is very useful and it works for all cases i can imagine. Before I have realized that the pull request got reverted I extended the documentation of the cross() function in the wiki, see https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions (cross, note for dev version)
@ostephens, I think there are many possibilities with the new operator and combined forEach, forNonBlank or value.split() operations. What do you think as a GREL pro :-) ?
Example for splitting multi-value field values in from column and extract more than one value from the results:
forEach(value.cross("My Address Book", "friend", ","),r,forNonBlank(r.cells["address"].value,v,v,"")).join("|")
Example for splitting multi-value field values in from column, extract only the first value from the results and return a custom string if there is a match in foreign key ("friend") but no value in target column ("address"):
forEach(value.split(","),v,forNonBlank(v.cross("My Address Book", "friend", ",")[0].cells["address"].value,x,x,"!")).join("|")
See example data attached: example-data-1289.zip
I have tested the cross-func-split branch with quite "big" data: project A with 1.5 million rows and 100 columns, project B with 500.000 rows and 50 columns. Keys in project A are separated by a string. Without the new cross() function I need to split multi-valued cells first (and join after doing the cross). This process (split, cross, join) takes about 30 minutes on my machine. If I use the new cross() function with integrated split possibility it takes only seconds!
@ostephens write:
Also worth noting that with the change to use a string rather than just a 'cell' object for the lookup, the desired outcome can be achieved in GREL:
forEach(value.split(","),v,forEach(v.cross("My Address Book","friend"),r,r.cells["address"].value).join(",")).join(",")
The old cross() function builds a "cache" containing all cell values in from and to column. The GREL above does not produce the desired outcome if there are keys in the from column that are only present in multi-value form (e.g. mary, anne in one cell but no cell containing mary alone -> mary will not be "cached"). See this example: https://github.com/OpenRefine/OpenRefine/issues/1204#issuecomment-316012444
@felixlohmeier OK - hadn't noted the performance difference, and agree that's a big factor.
I have a common use for cross in which the target project for the cross contains multiple matches for a row in the source project - which I'm not sure this deals with so effectively - although maybe I need to experiment a bit more with the possible GREL!
I'm still not completely convinced the best thing to do here is change the cross function this much but I'd be v happy to see a new function that works as you need - I don't particularly mind it being hacky, just slightly uncomfortable that it introduces ambiguity into the function.
I'm just one voice though
Team, S L O W D O W N here. :)
Overall there are several ways to approach the use case and making @felixlohmeier happy with essentially a faster "split or join operations that are based on keys".
But what I need to understand and what confuses me is a mixing of discussions all over the place and I want to know which one that @felixlohmeier is after. Is it...
The problem with 2 is that it complicates matter because it means more memory is needed since data copying and cache storage is needed to hold the split key while lookups are performed.
So @felixlohmeier perhaps the problem your facing is that the split() is taking a really long time on your 1.5 million rows in Project A when you do only that operation ? How long does it take if you do a Edit column -> Split into several columns ?
Team, S L O W D O W N here. :)
Hmm, can you remind us who merged this PR on sight? ^^
@wetneb the PR made a bit of sense to me, but now we have @felixlohmeier last 2 hours of comments. So that changes things Antonin. But yeah, I can admit that perhaps I was hasty and the use case was not quite so clear. So let's clear it up now that I'm here on the issue.
I agree the idea of simplicity. cross function behaves like the database join operation but lacks the validation aspect due to the unawareness of schema of each other. so we have to make assumption that user understand their data before applying this operation. The ideal situation is the data elements are normalised and can be matched without extra tweak. that's from the functionality point of view as well as the performance point of view.
Also there is a myth that cross can handle the foreach properly with given separator. It can NOT. because the cache will always cache the whole cell value as the key(ie, "ABC, EFG" etc), not the element itself.
Many thanks to ALL that you dive into my use case! I am happy with a more general approach and a S L O W D O W N :-). @claussni developed a solution that works for me, so it is not urgent for me anymore. But maybe the extended cross() functionality in #1294 is useful for someone else right now and could be merged until a more general approach is discussed and developed?
@thadguidry: here are some stats...
initial load base project:
initial load target project:
split multi-valued cells
cross
join multi-valued cells
total run time: 4m39s + 0m12s + 9m16s = 14m7s
highest memory load: 45576532
There are several columns with foreign keys in the base project which multiplies run time in my use case by ~10.
cross
@felixlohmeier Why are you using Split Multi-Valued cells instead of 'Edit column -> Split into several columns' ? Is it because you need record rows made from Split Multi-Valued cells ?
@thadguidry There are up to 40 keys in the from column in the example above. If I use "split into several columns" I will get up to 40 new columns. I would have to apply the cross function 40 times (one transformation for each column) and concat resulting values into one column afterwards, right? Performance is only slightly better (see above).
@felixlohmeier well I was thinking also of you playing with our Transpose while your doing that. OpenRefine's data model prefers 'more rows, less columns'. Honestly however, if you've got that much data to manipulate, I'd be more than happy to have a Google hangout with you this weekend and just train you and show you tips in Pentaho ETL or any other tool designed better for this. We can certainly put more into OpenRefine (and we will eventually I promise) to make this sort of thing easier, but honestly, and I am being dead serious, OpenRefine's core itself is not well designed to handle this use case. However, tons of free off-the-shelf tools handle it with ease and in seconds. I can do what your asking to do in Pentaho on 1 billion rows in about 22 secs.
Thank you @thadguidry, @jackyq2015, @ostephens and @wetneb. I understand your hesitation to put functionality into OpenRefine that is out of it's main scope. I am grateful for your comments.
I work with library data (books, articles, etc.) and in this particular use case we built a discovery tool based on existing data that is only available in proprietary structured fixed-width text files. The project team (including non-tech-librarians) loves the opportunity to explore and experiment with their data in OpenRefine. We built an simple ETL workflow from these fixed-width text files to Apache Solr with OpenRefine, a python client for OpenRefine and some shell scripts (openrefine-batch.sh). This works great overall. Librarians can explore and manipulate data and transformation rules in a daily updated OpenRefine instance. Administration for the IT department in the library is easy. The templating feature allows us to build custom library metadata formats (e.g. dublin core xml). I am not keen to introduce more tools to the workflow. I want to keep it simple.
I am aware of real ETL tools like Pentaho, although I have not much experience using it. I have used Pandas recently, which provides merge and join functions that would also handle this use case. There are library specific tools (without GUIs) like catmandu or metafacture. But as I said, I prefer to stick with OpenRefine. The code from @claussni works very well for this use case and solves the performance bottleneck in the workflow.
Well, works for me != works for us... I thought that other people might be interested in this extended cross() function. I did not get the full picture why it is exactly troublesome to add #1294 to the code base but I understand the general approach to keep OpenRefine straightforward and maintainable.
@felixlohmeier AH... and now I have the proper context. :) Ok, then your fine. I'm sure other folks WILL find a splitCross() function useful, but I am actually against having additional argument based approaches. Early on we developed GREL and its syntax to be more functional oriented and support the wonderful dot notation chaining. If we overload our existing cross() with tons of arguments, then I fear the complexity becomes a tedious headache rather than GREL being a breath of fresh air. Which is the full picture why its troublesome actually, its not about functionality as @jackyq2015 or @wetneb might worry about, but for me its about our user base and keeping GREL's ease of use.
Incidentally, one of the ideas that is yet to be implemented is making it easier to write GREL to be "a walk in the park" (yeah, I actually hate our Expression Window and it could be done much more slicker if I had a UI expert latched to my designer brain) and also having CodeMirror support to make our GREL syntax even more vivid and appealing like autocompletion, autoclosing brackets, tons more. Just need to write a tokenizer for GREL language, but before we do, I want to actually improve GREL syntax a bit, perhaps next year. https://codemirror.net/doc/manual.html#modeapi
Sounds great @thadguidry ! In the meantime i will maintain a fork that includes #1294 here:
I have reverted my changes to the documentation of the cross() function in the official OpenRefine wiki (and added a recipe from previous discussions from the mailing list).
I will close this issue now. Thank you all for your help!
@felixlohmeier Sure thing! Nice working with you Felix on your interesting use case!
@thadguidry CodeMirror is great! but AutoCompletion seems broken.
@felixlohmeier I submitted one issue for your repo:
https://github.com/opencultureconsulting/openrefine-batch/issues/4
@jackyq2015 @felixlohmeier @wetneb For the big picture, I've begun describing the future state I'd like to see with cross() here https://github.com/OpenRefine/OpenRefine/projects/1#card-5380183
Most helpful comment
@felixlohmeier OK - hadn't noted the performance difference, and agree that's a big factor.
I have a common use for cross in which the target project for the cross contains multiple matches for a row in the source project - which I'm not sure this deals with so effectively - although maybe I need to experiment a bit more with the possible GREL!
I'm still not completely convinced the best thing to do here is change the cross function this much but I'd be v happy to see a new function that works as you need - I don't particularly mind it being hacky, just slightly uncomfortable that it introduces ambiguity into the function.
I'm just one voice though