On Saturday, 13 October 2012 01:12:23 UTC+2, Tom Morris wrote:
Currently the cross() function is one of the few (only) that takes a cell as its first argument instead of value.
Can anyone think of a good reason that we shouldn't extend it to take either a cell or a value?
Tom
On Sunday, 14 October 2012 23:13:57 UTC+2, David Huynh wrote:
If I remember correctly--and I should have documented this, the reasons for taking a cell rather than an arbitrary expression were to make the caching logic easy, and also to avoid any expression that yields something other than a single literal value (e.g., an array of value, a "cell" or "row" or "record" object).
We should definitely make it consistent with the other functions if we can work out the caching logic (which has been buggy).
David
Current state:
Pushed the change f03be76, it covers:
@felixlohmeier Thanks for the summary. Do you have links the the original sources of the quotes? Was it one of the mailing lists or another issue?
@jackyq2015 The referenced commit appears to be to master, not a branch. Does it resolve this issue? Who reviewed it? (Apparently Thad has reviewed it post-facto based on his typo comment)
"Thad" being @thadguidry, of course (so he gets notified)
@tfmorris quotes from https://groups.google.com/forum/#!topic/openrefine/7ESn_Xh0tRU
Nice, that was quick! If you have time to write one or two unit tests that would be fantastic too!
Also, what happens if cross
gets something else than just a single literal value?
@jackyq2015 Yeah, thanks for the quick solution! I am afk until July, 16. Shall I close the issue now, so you can get your reward at bountysource?
@felixlohmeier That would be nice Felix. Jacky and I chatted about a few other things on Sunday, like perhaps adding 1 or 2 more unit tests for the cross() work...other than that, its currently shippable and I'm using the trunk build now. Any bugs or issues we can probably resolve down the line, so I'd say go ahead and close the issue so that Jacky can withdraw his funds :) He did a great job on it, and now as a bonus, because of this work, we now have a way to easily generate summed up column values now as part of a cross() function, a much requested feature that Excel does in Pivot Tables but we never did https://groups.google.com/forum/#!topic/openrefine/CfNurLR5CII :) I'll post about how to do that later on the wiki... we could probably even add that function as a column operation later on perhaps if folks want it easier.
@wetneb Will add few test case for non-literal value.
Again, thank you very much @jackyq2015 for the great job!
I wonder if this is a breaking change that needs to be addressed in upcoming release notes. The original proposal by @tfmorris in 2012 was to...
extend it to take either a cell or a value
The new cross() function takes a string only, which is fine by me but breaks existing user transformation history and the VIB-BITS OpenRefine extension. Is there a chance to make the new cross() function backward compatible (to take either a cell or a value)?
I added a note in the wiki to reflect this different behavior between the stable release (2.7) and the development version.
@felixlohmeier can you please elaborate the symptom of " breaks existing user transformation history"? Definitely we can address it. Either make it back compatible or fix the history issue you refer to.
@jackyq2015 So, if an existing transformation script assumes that the cross function expects a cell, and the new version of that function does not accept cells anymore, this transformation script that worked for OpenRefine 2.7 will not work for a later version. (That is how I understand the problem)
I think of OpenRefine projects that use the old cross function that expects cell.cross()
and will be opened someday in a future release of OpenRefine that includes the new cross function that expects value.cross()
. In this case the undo/redo function would not work properly.
Additionally, there are many tutorials out there (stackoverflow, google forum, ...) that suggest to use the cross function with cell.cross()
. If they try them with the new version they will get an error.
Is it possible to extend the cross function one step further to allow both forms of user input (cell and string)?
cell.cross("My Address Book", "friend")[0].cells["address"].value
value.cross("My Address Book", "friend")[0].cells["address"].value
I have another issue... maybe there is still a problem in the caching logic. I would like to use the new cross() function together with split and join:
forEach(value.split(","),v,v.cross("My Address Book","friend")[0].cells["address"].value).join(",")
This works for "mary,john" but not for "anne,mary"
In this example, the value "anne" is not part of the recipient column. If I add a row for "anne" then "anne,mary" works too.
I think this behavior is unexpected. It seems that the new cross() function still expects that the string input value is present as a single cell.
I will review the document related to the cross function to make it consistent and can have the code also support cell if necessary.
For the transformation, I can still redo and undo between cross function. could you please give an example?
For the logic of cache, the key of the cache is based on:
fromProject + ";" + fromColumn + ";" + toProject + ";" + toColumn;
which means:
... can have the code also support cell if necessary
I think that would be great to make it backwards compatible.
The cross function does not and should not aware of the custom operation(in your case, split).
Thank you @jackyq2015, now I understand the caching logic. The reason why I raised this issue was the idea that combining the split, cross and join in one expression could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects. Do you have an idea how to support custom operations (e.g. split) together with the cross function? I am willing to give another bounty if this helps.
Will find some time this weekend for the compatibility issue.
For the cache logic I think the good practice is that always do the cross on normalized data(source and destination) rather than row data to prevent the unpredictable result. Also in this way you can leverage the cache efficiently.
As I feared, this change "breaks" the VibBits extension, which allows to make a cell.cross() visually. Too bad, it is very convenient.
@ettorerizza
Added the backward compatibility for cross function as discussed. https://github.com/OpenRefine/OpenRefine/commit/4950d2907444990bd5584dea91179c5a7ff3c8f3
Could you please test from the head?
Thank you @jackyq2015, it works.
We should update the integrated help too: line 85 in Cross.java
writer.key("params"); writer.value("string s or cell c, string projectName, string columnName");
@felixlohmeier corrected the description
Hi @jackyq2015 and ALL, may I ask again if there is a way to support multiple-value-cell-input in the cross function? I am willing to give another bounty. I had a look at the InterProjectModel again and I wonder if it could be possible to add an optional split function before computing the cache? From a user perspective, I would like to call the cross function with a separator as 4th argument, so that multiple values (separated by a custom separator, e.g. ␟) in an input cell get "crossed". Multiple output values should be joined with the same separator.
Example (see screenshots above):
The reason why I raised this issue was the idea that combining split, cross and join in one (cached) function could reduce total run time. Row operations (split/join multi-valued-cells, delete rows) are very expensive and often amount to more than 75% of total run time in my projects.
Implemented as optional 4th parameter to cross() function: https://github.com/claussni/OpenRefine/tree/cross-func-split
@felixlohmeier can you test this, please? Here is the expression I used:
forEach(value.cross("My Address Book", "friend", ","),r,r.cells["address"].value).join(";")
The result should be:
result for anne should be "17 Morning Crescent"; i am going to open a new issue with example data