Hello.
While brewing an answer for this stackoverflow question, i came across something that suprised me in the join implementation of data.table.
The question is about a non-equi (possibly cartesian) join, returning the joined column values (Keeping both of the joined columns seems logical on a non-equi join). But while answering the question i hit a bump. To avoid confusion in the following the RHS is both the Right-Hand-Side in the on statement, equivalently the RHS in DT1[DT2] is DT2, the reverse being true for LHS (Left-Hand-Side).
j to force output of both RHS and LHS of the on statement, two identical columns (the RHS values) were returned under the name of the LHS and the RHS respectively.by = .EACHI returned something similar to the expected result, but in the case of j = .(LHS, RHS), (no other columns) only the first match of the non-equi is returned. Knowing that by = .EACHI is meant for use in evaluating j for each match in the LHS, such as finding number of matches examplified here. It seems odd however that in no unique columns, outside the join statement, the result is reduced to only the first match.
Using the example from the original question, (1 - 3) can be reproduced quite simply. The data is provided below
df1 <- fread("Lastname_Census firstname_Census census_year
C2last C2first 1880
C3last C3first 1850
C4last C4first 1850",
key = "census_year")
df2 <- fread("Lastname_Reg firstname_reg birth_year
P2Last P2first 1818
P3last P3first 1879
P4last P4first 1903",
key = "birth_year")
(1)
So the initial 'logical' idea would be to simply join each table on a non-equi join.
>df2[df1, on = .(birth_year >= census_year)]
Lastname_Reg firstname_reg birth_year Lastname_Census firstname_Census
1: P3last P3first 1850 C3last C3first
2: P4last P4first 1850 C3last C3first
3: P3last P3first 1850 C4last C4first
4: P4last P4first 1850 C4last C4first
5: P4last P4first 1880 C2last C2first
Here's the first unexpected behaviour. The return value for birth_year, contains the values for census_year, while it is not stated or shown that this is in fact the RHS values under the alias of the LHS (the values from census_ year). This seems like inconsistent behaviour, and in the case of large data.table merges the chance to overlook this seems great.
(2)
Now if one noted this behaviour the next step seems to force the return of the original column, using j. And so i did:
>df2[df1,
on = .(birth_year >= census_year),
j = .(i.Lastname_Census,
i.firstname_Census,
Lastname_Reg,
firstname_reg,
birth_year,
i.census_year)]
i.Lastname_Census i.firstname_Census Lastname_Reg firstname_reg birth_year i.census_year
1: C3last C3first P3last P3first 1850 1850
2: C3last C3first P4last P4first 1850 1850
3: C4last C4first P3last P3first 1850 1850
4: C4last C4first P4last P4first 1850 1850
5: C2last C2first P4last P4first 1880 1880
As one can see, now it is glaringly obvious that birth_year and census_year are identical in the output, while they are not in the original data.
One could create an index in the original data, and use this to join the new table. This would give me the original birth_year, but requires an additional join.
(3)
Now an alternative way one would think is to evaluate the join on each value in i, my thought being that the original value must be returned if it has to be evaluated. And initially it seems to work (kind off).
>df2[df1,
on = .(birth_year >= census_year),
j = .(i.Lastname_Census,
i.firstname_Census,
Lastname_Reg,
firstname_reg,
birth_year,
i.census_year),
by = .EACHI]
birth_year i.Lastname_Census i.firstname_Census Lastname_Reg firstname_reg birth_year i.census_year
1: 1850 C3last C3first P3last P3first 1879 1850
2: 1850 C3last C3first P4last P4first 1879 1850
3: 1850 C4last C4first P3last P3first 1879 1850
4: 1850 C4last C4first P4last P4first 1879 1850
5: 1880 C2last C2first P4last P4first 1903 1880
We obtain the original values, but also an additional birth_year column. This is not too bad however as it could just be removed.
Now this seems all dandy, but trying to join only the non-equi join statement a problem emerges.
>df2[df1,
on = .(birth_year >= census_year),
j = .(birth_year, census_year),
by = .EACHI]
birth_year birth_year census_year
1: 1850 1879 1850
2: 1850 1879 1850
3: 1880 1903 1880
One again we get the additional column, but more importantly, one can see that only the initial match is returned, which seems inconsistent, as j only include the original values without any form of aggregate measures. This seems to happen when no additional column from the LHS is included.
>df2[df1,
on = .(birth_year >= census_year),
j = .(birth_year, census_year, Lastname_Census),
by = .EACHI]
birth_year birth_year census_year Lastname_Census
1: 1850 1879 1850 C3last
2: 1850 1879 1850 C4last
3: 1880 1903 1880 C2last
>df2[df1,
on = .(birth_year >= census_year),
j = .(birth_year, census_year, Lastname_Reg),
by = .EACHI]
birth_year birth_year census_year Lastname_Reg
1: 1850 1879 1850 P3last
2: 1850 1879 1850 P4last
3: 1850 1879 1850 P3last
4: 1850 1879 1850 P4last
5: 1880 1903 1880 P4last
Now again it seems weird to me, that on a non-equi join only one of the join statements are returned, however even more weird is that the RHS overwrites the LHS values in the output. One could naively think this is due to the name difference. If this was the case, changing the name and using i.newname in j should return the original values. This is however not the case:
>setnames(df1, "census_year", "birth_year")
>df2[df1, on = .(birth_year >= birth_year), j = .(i.birth_year, birth_year)]
i.birth_year birth_year
1: 1850 1850
2: 1850 1850
3: 1850 1850
4: 1850 1850
5: 1880 1880
I would suggest a future bug fix and structure change of non-equi joins, such that both columns are returned. This would assure that each row is uniquely identifiable in the old data. In addition renaming the joined column within the internal environment seems dangerous, as the original data (from the join statement) seems to be eradicated from the data.table. This seems more like a bug to me however.
For completion i have provided my session info below.
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.1.0 dplyr_0.7.8 data.table_1.12.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 rstudioapi_0.9.0 whisker_0.3-2 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.5
[7] munsell_0.5.0 lattice_0.20-38 colorspace_1.4-0 R6_2.3.0 rlang_0.3.1 plyr_1.8.4
[13] tools_3.5.2 parallel_3.5.2 grid_3.5.2 gtable_0.2.0 withr_2.1.2 tfruns_1.4
[19] yaml_2.2.0 lazyeval_0.2.1 assertthat_0.2.0 tibble_2.0.1 crayon_1.3.4 Matrix_1.2-15
[25] bindrcpp_0.2.2 purrr_0.3.0 base64enc_0.1-3 glue_1.3.0 labeling_0.3 compiler_3.5.2
[31] pillar_1.3.1 scales_1.0.0 jsonlite_1.6 reticulate_1.10 pkgconfig_2.0.2
```
Related: https://github.com/Rdatatable/data.table/pull/3093 and issues linked from there
Most helpful comment
Related: https://github.com/Rdatatable/data.table/pull/3093 and issues linked from there