Data.table: How is order determined in the result of a non-equi join?

Created on 14 Jan 2017 · 2Comments · Source: Rdatatable/data.table

This question was originally posted on SO. Arun asked me to file an issue, so here we go. Not sure about the ethics on data.table github, if a 'SO link only' is OK? Anyway, here's the question, verbatim from SO.

I'm trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered _within_ each level of the on-variable.

Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other data.table operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.

I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output, as well as instances where the pattern differs between the joins of the two data sets.

# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output    
#     x y  id
# 1:  b 1   1
# 2:  b 3   2
# 3:  b 6   3
# 4:  a 1   4
# 5:  a 3   5
# 6:  a 6   6
# 7:  c 1   7
# 8:  c 3   8
# 9:  c 6   9
# 10: a 6  10
# 11: b 6  11

# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))   
#     id val
# 1:   1   4
# 2:   2   2

Non-equi join between the first large data set and the small, on = .(y >= val).

d1[d2, on = .(y >= val)]
#     x y  id  i.id
# 1:  b 4   3     1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2:  a 4   6     1 # The rows within this match have the same order as the original data
# 3:  c 4   9     1 # and runs consecutively from first to last match
# 4:  a 4  10     1
# 5:  b 4  11     1

# 6:  b 2   2     2 # Row 6-13, second match: y >= val[2]; y >= 2 
# 7:  a 2   5     2 # The rows within this match do not have the same order as the original data
# 8:  c 2   8     2 # Rather, they seem to be come in chunks (6-8, 9-11, 12-13) 
                    # First chunk starts with the match with lowest index, y[2] 
# 9:  b 2   3     2  
# 10: a 2   6     2 
# 11: c 2   9     2 

# 12: a 2  10     2
# 13: b 2  11     2

The second 'large' data set:

d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
#    x y id
# 1: a 6  1
# 2: a 1  2
# 3: a 3  3
# 4: b 6  4
# 5: b 1  5
# 6: b 3  6
# 7: c 6  7
# 8: c 1  8
# 9: c 3  9

Same non-equi join between the second large data set with the small:

d3[d2, on = .(y >= val)]

#    x y   id i.id
# 1: a 4   1     1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4   4     1
# 3: c 4   7     1

# 4: a 2   3     2 # Row 4-9, second match (y >= 2).  
# 5: b 2   6     2 # Again, rows not consecutive.
# 6: c 2   9     2 # However, now the first chunk does not start with the match with lowest index,
                   # y[3] instead of y[1]

# 7: a 2   1     2 # y[1] appears after y[3]
# 8: b 2   4     2 # ditto
# 9: c 2   7     2

Can anyone explain the logic of (1) the order _within_ each level of the on-variable, here especially within the _second_ match, where original order of the data isn't kept in the result. And (2) why does the order _between_ chunks _within_ matches differ when the two different data sets are used?

An even smaller example which makes it easier to track the re-ordering spotted by @franknarf1; in the result of the join, x is sorted by its join variable:

d1 <- data.table(end = c(4, 1, 3), ix = 1:3)
d1
#    end ix
# 1:   4  1 
# 2:   1  2
# 3:   3  3

d2 <- data.table(start = 2)
d2
#    start
# 1:     2


# result of join not in original order, but in order of "x.end"
d1[d2, .(x.end, ix), on = .(end > start)]
#    x.end ix
# 1:     3  3
# 2:     4  1

Low bug non-equi joins

Source

Henrik-P

Most helpful comment

Update SO post https://stackoverflow.com/a/47148117/559784

arunsrinivasan on 7 Nov 2017

👍2

All 2 comments

I cannot explain the logic, but the pattern is d1[order(y)][y > 2, id] and d3[order(y)][y > 2, id], right? (Or in place of 2, put whatever other value from i.)