Please consider the following feature request.
Currently, nomatch accepts only NA or 0:
DT[i, nomatch = NULL]
# Error in if (!is.na(nomatch) && nomatch != 0L) stop("nomatch must either be NA or 0, or (ideally) NA_integer_ or 0L") :
By expanding it to accept a wider range of values, one could perform, for example, an na.fill operation _during_ the join rather than after it, leading to faster code. Related SO question.
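For illustration, a minimal sketch of today's two-step approach that this proposal would collapse into the join itself (the tables and column names here are made up):
library(data.table)
X <- data.table(id = c("a","b"), v = c(1, 2), key = "id")
Y <- data.table(id = c("a","b","c"))
res <- X[Y]             # today: the non-matching row for "c" gets NA in v
res[is.na(v), v := 0]   # second pass over the data just to fill
# proposed: X[Y, nomatch = 0] would produce the filled result in one step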
Proposed behavior:
DT[i, nomatch = any_character_or_real] # fill with supplied value; could be 0, "foo", even NA ...
DT[i, nomatch = NA] # fill with NA. Special case of above. Same behavior as today.
DT[i, nomatch = 0] # fill with zeroes. Special case of above.
DT[i, nomatch = NULL] # omit non-matched rows. Equivalent to today's nomatch = 0
DT[i, nomatch = locf] # reserved symbol. Equivalent to roll=+Inf
DT[i, nomatch = nocb] # reserved symbol. Equivalent to roll=-Inf
Alternatively, this could be achieved by introducing a new argument, fill, accepting any value:
[.data.table(..., fill=NA, ...)
jangorecki: adding some features to the scope:
nomatch=.(0L) # fill numeric columns with 0, character columns with NA
nomatch=.(0L, "missing") # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF()) # fill NA, other than price column which should be rolled
Merging with #940 as it is the same feature.
Extending Daniel's examples with a _prepared_ error message in the case of _nomatch_.
data.table join nomatch=quote(stop())
We have two data.tables:
library(data.table)
dict <- data.table(symbol = c("a","b","c"), ratio = c(5,10,15), key = "symbol")
dt <- data.table(symbol = c("a","b","c","d"), value = rnorm(4))
Which we want to join:
dict[dt][,list(symbol, new_value = ratio * value)]
Consider the case when we expect a match for all rows in the join.
In the case of a no-match, we want to raise an error.
Currently it is possible using:
new_dt <- dict[dt]
if(any(is.na(new_dt$ratio))) stop("missing data in the reference dictionary")
This prevents us from chaining in such a case.
If we could use something like this:
dict[dt,nomatch=quote(stop("missing data in the reference dictionary"))
][,list(symbol, new_value = ratio * value)]
just to stop the process and raise the error in case of any no-match.
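For reference, a sketch of one chainable workaround that already works today, validating inside j before returning the table (the braced expression is just one way to express the check):
dict[dt][
  , { if (anyNA(ratio)) stop("missing data in the reference dictionary"); .SD }
][, list(symbol, new_value = ratio * value)]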
Added to 1.9.6. Will add a global option and issue a warning message so that transition can be made in the next major release. Setting the global option will allow users to use the new implementation already.
nomatch = NA - current behaviour (and default).
nomatch = 0 - fills with 0 instead of removing non-matched rows.
nomatch = NULL - current behaviour of nomatch = 0.
nomatch = LOCF(val, rollends=TRUE/FALSE) and nomatch = NOCB(val, rollends=TRUE/FALSE) - could potentially replace roll and rollends arguments??
nomatch = val - should type convert appropriately and use val for all columns.
I think the second option is a bit revolutionary and potentially going to "piss off" a lot of people who either don't read documentation/warnings or will have to rewrite huge code chunks...
It's bad enough that this will break a lot of code, but now we'd have to type even more characters for this very common operation? Uhh...
I don't like just saying I dislike something without presenting a viable alternative, but the only thing I can think of is a new argument, which sucks in its own little way :-
On the extra characters, I guess you're talking about the "roll" arguments? I'm not quite certain what the extra characters are otherwise. And I am not sure about the "roll" part, hence the "??". The idea is to implement nomatch = <any_value> first.
# previously
X[Y, roll = Inf, rollends=TRUE]
# after
X[Y, nomatch = LOCF(Inf, TRUE)] # default case can be just nomatch = "locf"
Is this what you find hard?
For one, I find it straightforward, as "roll" actually is used to "fill in" missing values. Second, the number of mutually exclusive arguments in [.data.table is one too many, and this helps address that (by removing roll and rollends as separate arguments). Right now, using roll=TRUE and nomatch = 0L returns the rolled result, which I find confusing. Providing the roll via nomatch is explicit in that it asks for the results to be rolled in the event of a no-match.
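A small reproducible illustration of that roll/nomatch interaction, using a made-up table:
library(data.table)
X <- data.table(id = "a", t = 1:2, v = 1:2, key = c("id", "t"))
X[.("a", 3L), roll = TRUE, nomatch = 0L]  # t = 3 is rolled to t = 2, so the row is returned despite nomatch = 0L
X[.("a", 0L), roll = TRUE, nomatch = 0L]  # t = 0 cannot be rolled forward, so the row is dropped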
On breaking code: like I said before, in this version (1.9.6) we'll have a global argument which will be set to the current behaviour and will issue a warning that nomatch = 0L should be replaced with nomatch = NULL, and the current behaviour will be deprecated in the next version (which is plenty of time), and we'll add a new vignette to explain the benefits. There will also be an option to switch off this warning (or at least this is what I have in mind right now). @mattdowle thoughts?
I think this is a hugely welcome change, similar to by = .EACHI (which you advocated, btw, @eddi, and which also would have resulted in a lot of code being broken). The point is to bring in as much consistency as possible by transitioning smoothly (and by giving ample time for users to shift their code); this kind of "major" change shouldn't happen often (it's a lot of effort for us as well), but it's essential here.
I'm talking about typing NULL instead of 0, I didn't even think about the roll. For me, nomatch=0 has been a major annoyance from the very beginning, and I'm not too happy about it becoming even more annoying.
That said I totally understand the need for filling with other numbers/throwing an error, I'm just not particularly happy with this implementation.
Actually, the root of my annoyance comes from nomatch=NA being the default, instead of nomatch=NULL/0. Since we're talking about a breaking change, can we please discuss making NULL/0 the default? I'll make a short "pro" argument: it's the default for base merge and, in my experience, by far the more common type of join/merge.
Taking into account what Gsee mentioned about having a release as a fallback in case the changes are too major (and possibly affect current code too much), I've changed it to 1.9.8. 1.9.6 will keep things as expected for people who don't want their code to be affected (and they need not upgrade if they so wish).
I agree, purely on consistency, that nomatch = NULL should be the default. IMO, we'll have to do that as well. The default should be an inner join.
This means, this'll be default in 2.0.0 (depending on what Matt feels as well).
Ok, if nomatch=NULL is made the default, then I have no issues, and the change would be great.
As for breaking the code: the sooner we have it as an option, the better tested it will be, so if you are not in a rush with 1.9.6 (sad) it may be worth putting it there. But for the breaking change itself - moving from option to default - the best place will be the 2.0.0 release.
If we consider adding a new argument to [, it is worth reviewing other open issues that would like new arguments of their own. Maybe they can be nicely combined. A very similar one would be #614, and maybe #830 would also need one - perhaps that can be combined with roll?
PS. are you dropping the nomatch=quote(stop("my error")) FR entirely?
If such a breaking change is going to happen, I think we should add an informative packageStartupMessage before the 1.9.8 release - an example is below.
"There is a breaking change coming in next stable release to CRAN (2.0.0) related to `nomatch` argument. That affects joins and binary search filtering. To keep the current behavior of `nomatch` in 2.0.0+ recode missing `nomatch` to `nomatch=NA` and `nomatch=0` to `nomatch=NULL` / missing. Alternatively use `options('datatable.oldnomatch'=TRUE)`."
Maybe we could also provide an option datatable.newnomatch which would make the new behavior available in 1.9.8? This could make the switch easier.
Adding that this feature would make a lot of problems simpler, and that breaking existing code with nomatch = 0 is worth it in the long run - I've always thought that choice of notation was strange anyway.
Meanwhile, there are people writing production code today that need the current functionality of nomatch=0L and every time they write a new line of code that includes it, they're increasing their workload in the future if the behavior is changed. Regardless of whether the behavior of nomatch=0 is ever changed, it would be nice to be able to start using nomatch=NULL, because at least we know that once that is implemented, it's not likely to break in the foreseeable future.
@gsee that's a good idea, but I do think you're exaggerating claims of an increased workload... ctrl+F will still only need to be pressed once (or grep could be used to do all R scripts at once)
The point is that I don't like writing a line of code when I know it's going to break in the future. If I had a way to write code such that I wouldn't have to fix it, I'd prefer to do that.
@gsee totally agree. Many times when I'm using nomatch I wonder whether I can write it in a way that will still be supported in the future. One option is to use nomatch = if (packageVersion("data.table") >= "2.0.0") NULL else 0, but I'm not sure if that is the recommended way.
Another option is to not upgrade your production environment. If everything works as expected then I would say it is a good practice.
Having good test coverage for one's own package would easily detect most issues when migrating to a new version of a dependency. Providing an API for future changes, like the one written above, can help package maintainers, as can pointing out which tests they could include in their test workflow.
Just adding my two-cents to highlight that, as @jangorecki knows, this is an issue that commonly comes up on StackExchange, etc. The nomatch=0 syntax always confused me (nomatch=null always seemed more logical). Here's a current thread with a very basic case. Thanks for your great work and this upcoming feature, however it is implemented!
Could you provide an example on how to use this nomatch roll method, please?
For example, how can I do this
zoo::na.locf(c(NA, 1,2,NA, NA, 1, NA))
using just data.table?
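Not an official answer, but a sketch of two ways this can be done today: the classic rolling-join trick, and nafill(), which (if I recall correctly) was added in 1.12.4:
library(data.table)
x <- c(NA, 1, 2, NA, NA, 1, NA)
# rolling-join trick: join the positions of the non-NA values onto all positions
d <- data.table(pos = seq_along(x), val = x)
d[!is.na(val)][d[, .(pos)], on = "pos", roll = Inf]$val
# [1] NA  1  2  2  2  1  1
# or, more directly:
nafill(x, type = "locf")
# note: zoo::na.locf() drops the leading NA by default (na.rm=TRUE); use na.rm=FALSE to keep the input length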
This significant breaking change can be relatively safely implemented using one of the following strategies.
1.12.0: allow nomatch=NULL, identical to nomatch=0L
1.13.0: nomatch=0L prints message about future change
1.14.0: nomatch=0L raises warning
1.15.0: nomatch=0L raises error
1.16.0: nomatch=0L fill missing rows with 0L
So anyone upgrading data.table at least once per major release cannot end up with incorrect results, because nomatch=0L will raise an error at some point before changing to the behavior requested in this issue. Of course, options can be provided to relax those rules.
Optionally, to limit the breaking change to the extreme, we could keep 0L mapping to NULL "forever" unless a special option options("datatable.nomatch.fill.0"=TRUE) is set. Then old code is still portable to new versions, and the not-so-common case of filling non-matching rows with 0 can be achieved with a little more effort (setting the option) than any other nomatch value. This would need to be smart enough to determine whether nomatch was hardcoded or provided as a parameter, and if a parameter, then warn (because the user might want to fill non-matching rows with a dynamically calculated value).
So, in 1.15.0, only nomatch = NULL is accepted? Anything else is an error? Or nomatch = x will fill with x unless x == 0?
@MichaelChirico no decisions yet, I just threw out a few ideas to make the transition smooth. We could also ask users to wrap nomatch=0 as nomatch=I(0) if they want to fill with zeros, just for a few releases, to keep nomatch=0 backward compatible for a longer period. R's lazy evaluation gives many ways to handle the transition. Other nomatch values can be supported at any time as they won't break any code.
Note that at the current release frequency, 1.15.0 will be released around 2024 :-)
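A sketch of how the I(0) idea above could be distinguished internally, assuming a hypothetical helper (resolve_nomatch is not an existing data.table function):
resolve_nomatch <- function(nomatch) {
  if (inherits(nomatch, "AsIs"))
    return(list(action = "fill", value = unclass(nomatch)))   # nomatch = I(0): explicit fill with 0
  if (is.null(nomatch) || identical(nomatch, 0L) || identical(nomatch, 0))
    return(list(action = "drop"))                             # old nomatch = 0 meaning kept for now
  if (length(nomatch) == 1L && is.na(nomatch))
    return(list(action = "fill", value = NA))                 # current default behaviour
  list(action = "fill", value = nomatch)                      # any other value fills
}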
Yep: nomatch=NULL good and agree with Jan's PR #3095.
The desire is to specify fill values in nomatch=, right? So nomatch=0L will eventually mean use 0L instead of NA to fill the missing rows. But what about differing column types? There might be a mixture of character columns and numeric columns to fill, and what should nomatch=0L mean for the character columns?
Rather than changing the meaning of nomatch=0L, how about allowing a new list value.
nomatch=.(0L) # fill numeric columns with 0, character columns with NA
nomatch=.(0L, "missing") # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF()) # fill NA, other than price column which should be rolled
This would allow the ability of filling with 0 rather than NA, without needing to change the meaning of nomatch=0L and the associated breakage.
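A sketch of what the list form could look like from the user's side (hypothetical semantics, not implemented; the tables are made up):
dict <- data.table(symbol = c("a","b"), ratio = c(5, 10), grp = c("x","y"), key = "symbol")
dt   <- data.table(symbol = c("a","b","c"))
dict[dt, nomatch = .(0, "missing")]   # proposed: symbol "c" would get ratio = 0, grp = "missing"
dict[dt, nomatch = .(NA)]             # proposed: NA for all columns, same as today's nomatch = NA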
Btw, the reason for the default being nomatch=NA (outer join) is for statistical safety. I found that in the majority of cases in my day to day work, I expected there to be data available and if there wasn't, then I wanted to measure how much was missing or at least be tripped up with the NAs afterwards. It's just too easy to miss incorrect joins due to missing data when the default is inner join. Inner join to me means ignore missing data silently which doesn't feel robust. If you expect missing data and that's ok, then I like to see nomatch=NULL in the code to remind myself that missing data is being dropped silently. I also like the nomatch="error" option idea too. In other words, I'm not keen on changing the default behaviour away from nomatch=NA since there's a good reason for it.
Very good question about filling multiple columns, and a brilliant idea to address it while keeping nomatch=0L backward compatible at the same time.
Also to include: rollequal=TRUE|FALSE #857
It seems like there are 2 discrete ideas that are being combined in a single parameter: the type of join, and the values to fill unmatched rows with.
granted, if you do an inner join, you don't need fill values, but adding new parameters would get you around the transition and backward compatibility quite easily:
add join.type as a new parameter (currently only allowing inner, right)
add join.fill as a new parameter taking a list of name/value pairs as described in Matt's Oct 4th post.
join.type could be implemented now with warnings, and backward compatibility would be a simple thunk to the new parameter.
this would also leave open the possibility for supporting full join in the future, or possibly even the sacrilegious left join ;)
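A sketch of what that alternative API could look like (join.type and join.fill are hypothetical parameter names, not existing data.table arguments):
X[Y, join.type = "inner"]                         # hypothetical: same as nomatch = NULL
X[Y, join.type = "right", join.fill = .(v = 0)]   # hypothetical: keep all rows of Y, fill v with 0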
To summarize current requirements
# `nomatch` argument value on the left, translation on the right
NA -> list(a=NA, b=NA)
0L -> NULL
"locf" -> list(a=LOCF(), b=LOCF())
"stop" -> list(a=stop(), b=stop())
NULL -> "[remove non matching rows]"
list(0L) -> list(a=0L, b=0L) # fill with zeros
list(a=NA, b=LOCF()) # fill with a column with NA, and b with LOCF
list(a=NA, b=stop()) # the fill for column a is irrelevant because we will raise an exception anyway
# this should be supported as well, so `nomatch` can be more easily passed as a parameter
list(a=NA, b=quote(LOCF()))
list(a=NA, b=quote(stop()))
# or we could also provide an LOCF() function that creates a specification object, rather than a symbol that we handle in an unevaluated expression.
list(a=NA, b=LOCF())
Rolling join options rollends and rollequal (#579) are whole-table join specifications, rather than per-column specifications. Not sure if we want to have them as arguments specified in LOCF(rollends, rollequal), as then they would be defined per column.
# some extra guessing of user intent
list(0L, "missing") # fill numeric columns with 0, character columns with "missing"
list(NA, price=LOCF()) # fill NA, other than price column which should be rolled
Assuming we implement an argument that controls the join type ("inner", "left", "right", "full", #3946), then nomatch will be provided automatically:
left, right, full -> NA
inner -> NULL
Note that there is an idea of providing two kinds of nomatch, one for the LHS table and another for the RHS, which would control whether the join is inner, left, right, or full: https://github.com/Rdatatable/data.table/issues/614#issuecomment-415370474