Data.table: possible inconsistency in non-equi join, returning join columns

Created on 28 Feb 2019  路  1Comment  路  Source: Rdatatable/data.table

Hello.

While brewing an answer for this stackoverflow question, i came across something that suprised me in the join implementation of data.table.

The question is about a non-equi (possibly cartesian) join, returning the joined column values (Keeping both of the joined columns seems logical on a non-equi join). But while answering the question i hit a bump. To avoid confusion in the following the RHS is both the Right-Hand-Side in the on statement, equivalently the RHS in DT1[DT2] is DT2, the reverse being true for LHS (Left-Hand-Side).

  1. On non-equi join the RHS of the inequality is returned in the output, but under the name of the LHS.
  2. Using j to force output of both RHS and LHS of the on statement, two identical columns (the RHS values) were returned under the name of the LHS and the RHS respectively.
  3. Experimenting a little, adding by = .EACHI returned something similar to the expected result, but in the case of j = .(LHS, RHS), (no other columns) only the first match of the non-equi is returned.

Knowing that by = .EACHI is meant for use in evaluating j for each match in the LHS, such as finding number of matches examplified here. It seems odd however that in no unique columns, outside the join statement, the result is reduced to only the first match.

Reproducable example

Using the example from the original question, (1 - 3) can be reproduced quite simply. The data is provided below

df1 <- fread("Lastname_Census     firstname_Census     census_year                
C2last                C2first            1880
C3last                C3first            1850
C4last                C4first            1850", 
key = "census_year")
df2 <- fread("Lastname_Reg      firstname_reg       birth_year               
P2Last              P2first         1818
P3last              P3first         1879
P4last              P4first         1903", 
key = "birth_year")

(1)
So the initial 'logical' idea would be to simply join each table on a non-equi join.

>df2[df1, on = .(birth_year >= census_year)]
   Lastname_Reg firstname_reg birth_year Lastname_Census firstname_Census
1:       P3last       P3first       1850          C3last          C3first
2:       P4last       P4first       1850          C3last          C3first
3:       P3last       P3first       1850          C4last          C4first
4:       P4last       P4first       1850          C4last          C4first
5:       P4last       P4first       1880          C2last          C2first

Here's the first unexpected behaviour. The return value for birth_year, contains the values for census_year, while it is not stated or shown that this is in fact the RHS values under the alias of the LHS (the values from census_ year). This seems like inconsistent behaviour, and in the case of large data.table merges the chance to overlook this seems great.

(2)
Now if one noted this behaviour the next step seems to force the return of the original column, using j. And so i did:

>df2[df1, 
       on = .(birth_year >= census_year), 
       j = .(i.Lastname_Census, 
             i.firstname_Census, 
             Lastname_Reg, 
             firstname_reg, 
             birth_year, 
             i.census_year)]

   i.Lastname_Census i.firstname_Census Lastname_Reg firstname_reg birth_year i.census_year
1:            C3last            C3first       P3last       P3first       1850          1850
2:            C3last            C3first       P4last       P4first       1850          1850
3:            C4last            C4first       P3last       P3first       1850          1850
4:            C4last            C4first       P4last       P4first       1850          1850
5:            C2last            C2first       P4last       P4first       1880          1880

As one can see, now it is glaringly obvious that birth_year and census_year are identical in the output, while they are not in the original data.
One could create an index in the original data, and use this to join the new table. This would give me the original birth_year, but requires an additional join.

(3)
Now an alternative way one would think is to evaluate the join on each value in i, my thought being that the original value must be returned if it has to be evaluated. And initially it seems to work (kind off).

>df2[df1, 
      on = .(birth_year >= census_year), 
      j = .(i.Lastname_Census, 
            i.firstname_Census, 
            Lastname_Reg, 
            firstname_reg, 
            birth_year, 
            i.census_year),
      by = .EACHI]
   birth_year i.Lastname_Census i.firstname_Census Lastname_Reg firstname_reg birth_year i.census_year
1:       1850            C3last            C3first       P3last       P3first       1879          1850
2:       1850            C3last            C3first       P4last       P4first       1879          1850
3:       1850            C4last            C4first       P3last       P3first       1879          1850
4:       1850            C4last            C4first       P4last       P4first       1879          1850
5:       1880            C2last            C2first       P4last       P4first       1903          1880

We obtain the original values, but also an additional birth_year column. This is not too bad however as it could just be removed.
Now this seems all dandy, but trying to join only the non-equi join statement a problem emerges.

>df2[df1, 
        on = .(birth_year >= census_year), 
        j = .(birth_year, census_year), 
        by = .EACHI]

   birth_year birth_year census_year
1:       1850       1879        1850
2:       1850       1879        1850
3:       1880       1903        1880

One again we get the additional column, but more importantly, one can see that only the initial match is returned, which seems inconsistent, as j only include the original values without any form of aggregate measures. This seems to happen when no additional column from the LHS is included.

>df2[df1, 
      on = .(birth_year >= census_year), 
      j = .(birth_year, census_year, Lastname_Census), 
      by = .EACHI]

   birth_year birth_year census_year Lastname_Census
1:       1850       1879        1850          C3last
2:       1850       1879        1850          C4last
3:       1880       1903        1880          C2last

>df2[df1, 
      on = .(birth_year >= census_year), 
      j = .(birth_year, census_year, Lastname_Reg), 
      by = .EACHI]

   birth_year birth_year census_year Lastname_Reg
1:       1850       1879        1850       P3last
2:       1850       1879        1850       P4last
3:       1850       1879        1850       P3last
4:       1850       1879        1850       P4last
5:       1880       1903        1880       P4last

Now again it seems weird to me, that on a non-equi join only one of the join statements are returned, however even more weird is that the RHS overwrites the LHS values in the output. One could naively think this is due to the name difference. If this was the case, changing the name and using i.newname in j should return the original values. This is however not the case:

>setnames(df1, "census_year", "birth_year")
>df2[df1, on = .(birth_year >= birth_year), j = .(i.birth_year, birth_year)]

   i.birth_year birth_year
1:         1850       1850
2:         1850       1850
3:         1850       1850
4:         1850       1850
5:         1880       1880

I would suggest a future bug fix and structure change of non-equi joins, such that both columns are returned. This would assure that each row is uniquely identifiable in the old data. In addition renaming the joined column within the internal environment seems dangerous, as the original data (from the join statement) seems to be eradicated from the data.table. This seems more like a bug to me however.

Session info

For completion i have provided my session info below.

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_3.1.0 dplyr_0.7.8 data.table_1.12.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 rstudioapi_0.9.0 whisker_0.3-2 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.5
[7] munsell_0.5.0 lattice_0.20-38 colorspace_1.4-0 R6_2.3.0 rlang_0.3.1 plyr_1.8.4
[13] tools_3.5.2 parallel_3.5.2 grid_3.5.2 gtable_0.2.0 withr_2.1.2 tfruns_1.4
[19] yaml_2.2.0 lazyeval_0.2.1 assertthat_0.2.0 tibble_2.0.1 crayon_1.3.4 Matrix_1.2-15
[25] bindrcpp_0.2.2 purrr_0.3.0 base64enc_0.1-3 glue_1.3.0 labeling_0.3 compiler_3.5.2
[31] pillar_1.3.1 scales_1.0.0 jsonlite_1.6 reticulate_1.10 pkgconfig_2.0.2
```

non-equi joins

Most helpful comment

Related: https://github.com/Rdatatable/data.table/pull/3093 and issues linked from there

>All comments

Related: https://github.com/Rdatatable/data.table/pull/3093 and issues linked from there

Was this page helpful?
0 / 5 - 0 ratings