Data.table: join vignette

Created on 27 May 2017 · 20 Comments · Source: Rdatatable/data.table

Several places in the available vignettes refer to this mysterious vignette about join and rolling join. When will it be up? Thanks

documentation · joins · non-equi joins · top request

lilchow

All 20 comments

See #944 . There are many ideas for vignettes there.

If you want some examples, you could see stackoverflow.com or maybe my notes.

franknarf1 on 27 May 2017

Yes, there's no time frame for this currently. See the #944 issue for progress.

arunsrinivasan on 29 Jun 2017

Hi,

So, does this "joins & rolling joins" vignette exist? It's referred to in many of the existing (very helpful!) vignettes, but I can't find it when I search for it. Need it! Help!

batcheneden on 13 Jul 2017

@batcheneden please read above and follow #944 for updates.

MichaelChirico on 13 Jul 2017

@MichaelChirico thank you. Just wanted to make sure there indeed isn't an existing one, despite the references in existing vignettes.

thanks!

batcheneden on 16 Jul 2017

As #944 is a pretty epic issue, I would like to re-open this one, asking for the joins vignette only. I've seen people asking for it on Twitter too, so let's try to have it for 1.13.0.

jangorecki on 26 Aug 2019

👍2

Noting for reference: joins vignette would be a good place to have an example of replacing nested ifelse with a join.
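
A minimal sketch of what such an example could look like, assuming a made-up table DT and a hypothetical lookup table (neither comes from the vignette draft): the nested ifelse is replaced by a small lookup table and an update on join.

```r
library(data.table)

# Hypothetical data: each code should be mapped to a label.
DT = data.table(id = 1:5, code = c("a", "b", "c", "a", "d"))

# Nested ifelse version: one branch per code.
DT[, label_ifelse := ifelse(code == "a", "alpha",
                     ifelse(code == "b", "beta",
                     ifelse(code == "c", "gamma", NA_character_)))]

# Join version: one row per code in a lookup table, then update on join.
lookup = data.table(code  = c("a", "b", "c"),
                    label = c("alpha", "beta", "gamma"))
DT[lookup, label_join := i.label, on = "code"]

# Both columns hold the same values; the unmatched code "d" stays NA.
DT[]
```

Adding a new mapping then means adding a row to lookup rather than another level of nesting.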

MichaelChirico on 29 Aug 2019

👍2

summarizing the scope

equi joins
join types: inner, left, right, full
semi join
cross join
not join
natural join
set operators (fintersect, etc.)
non-equi join
overlapping join
rolling join (#857 may change rolling join API)
aggregate on join (mention that it is currently not possible to aggregate by fields that are not on fields)
update on join (example vs [f]ifelse and fcase, and laziness of the latter)
mergelist (if ready #599)
nomatch (if ready #857, mention fill, locf, stop)
mult
allow.cartesian
mention join with on or with key, or pre-computing index - explain it is good to setkey/setindex if you are repeatedly joining
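
For orientation, here is a rough sketch of how the first few items in this list (join types, semi join, not join) are commonly written today. The tables X and Y are made up for illustration, and this is only one possible way to express each join, not the wording planned for the vignette.

```r
library(data.table)

# Two tiny illustrative tables sharing the column "id".
X = data.table(id = c(1L, 2L, 3L), x = c("a", "b", "c"))
Y = data.table(id = c(2L, 3L, 4L), y = c("p", "q", "r"))

# X[Y] keeps every row of Y (the table in i).
right = X[Y, on = "id"]                      # ids 2, 3, 4
# Swapping the tables keeps every row of X instead.
left  = Y[X, on = "id"]                      # ids 1, 2, 3
# nomatch = NULL drops rows of i without a match.
inner = X[Y, on = "id", nomatch = NULL]      # ids 2, 3
# A full join goes through merge().
full  = merge(X, Y, by = "id", all = TRUE)   # ids 1, 2, 3, 4
# Not join (anti join): rows of X with no match in Y.
anti  = X[!Y, on = "id"]                     # id 1
# Semi join: rows of X that have a match in Y, without adding Y's columns.
semi  = X[unique(X[Y, on = "id", which = TRUE, nomatch = NULL])]
```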

jangorecki on 10 Apr 2020

👍1

Copied from #944.

I have a question on joining by reference that came up while preparing the vignettes. X[Y, new_col := old_col] performs something similar to a traditional left join on X. However, if multiple rows of Y match the same row of X, only the last (or first?) matching value is retained. ~Is this explicitly documented somewhere? I had tried searching for this back when I encountered it, but had to resort to my understanding of updating by reference for the reason~ Documented in set. For a reproducible example,

> X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
> Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))
> X[Y, new_col := i.n, on = "a == b"][]
   a m new_col
1: 1 a       y
2: 2 b    <NA>
3: 3 c    <NA>

# an ideal left join - expected behaviour per a new user, given below
# not possible because updating row by reference isn't implemented
   a m new_col
1: 1 a       x
2: 1 a       y
3: 2 b    <NA>
4: 3 c    <NA>

This is expected behaviour, but isn't exactly straightforward for a new user. mult does not impact the output either. Any suggestions on how I mention this in the vignette? Add merge as a workaround for a proper left join?
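
A minimal sketch of the merge workaround, reusing X and Y from the example above. The names res and res2 are only illustrative. Because the result has more rows than X, it has to be a new table rather than an update by reference.

```r
library(data.table)

X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))

# Left join on X that keeps every matching row of Y:
# the row with a == 1 appears twice, so the result has 4 rows.
res = merge(X, Y, by.x = "a", by.y = "b", all.x = TRUE)

# The same join written with [ : X goes in i, so all of its rows are kept.
# In on = c(b = "a"), the name is Y's column and the value is X's column.
res2 = Y[X, on = c(b = "a")]

nrow(res)
nrow(res2)
```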

zeomal on 24 Apr 2020

@jangorecki, given that #3453 is being prepared, where @Henrik-P covers rolling joins in detail, would it make sense to add separate vignettes for equi and non-equi joins? I believe the latter is far more relevant for time series analysis. Given your scope above, both vignettes would already have significant content.

zeomal on 24 Apr 2020

For Joins vignette:

https://github.com/Rdatatable/data.table/issues/2396

_Originally posted by @MichaelChirico in https://github.com/Rdatatable/data.table/issues/944#issuecomment-521710383_

jangorecki on 24 Apr 2020

@zeomal better to have 2 bigger vignettes, than 3 smaller IMO. We already have many vignettes.

jangorecki on 24 Apr 2020

@jangorecki, I've created a draft pull request for this vignette. It's a first version, bound to have many changes, but covers the basics of equi-joins. This is my first pull request ever, so if I've done something wrong, please correct me.

zeomal on 25 Apr 2020

mult usage can nicely present _temporal joins_, like:

take the _most recent_ address of customers

setkey(customer_addresses, customer_id, address_date)
customer_addresses[customer, mult="last", on="customer_id"]

address of customers _as of_ end of 2019

customer_addresses[address_date<as.IDate("2020-01-01")
  ][customer, mult="last", on="customer_id"]

Probably better to use some known dataset, like airlines, whenever possible.
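
To make the two snippets above runnable, here is a small self-contained sketch. The table contents are invented for illustration; only the table and column names follow the snippets.

```r
library(data.table)

# Hypothetical data: customer 1 has two addresses, customer 2 has three,
# customer 3 has none.
customer = data.table(customer_id = 1:3)
customer_addresses = data.table(
  customer_id  = c(1L, 1L, 2L, 2L, 2L),
  address_date = as.IDate(c("2018-03-01", "2020-06-15",
                            "2017-01-10", "2019-11-30", "2021-02-01")),
  address      = c("A1", "A2", "B1", "B2", "B3")
)

# Sorting by customer and date makes "last" mean "most recent".
setkey(customer_addresses, customer_id, address_date)

# Most recent address per customer.
latest = customer_addresses[customer, mult = "last", on = "customer_id"]

# Address as of the end of 2019: filter first, then take the last match.
asof = customer_addresses[address_date < as.IDate("2020-01-01")
  ][customer, mult = "last", on = "customer_id"]

latest$address  # "A2" "B3" NA
asof$address    # "A1" "B2" NA
```

Customer 3 has no address at all, so both results carry NA for that row.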

jangorecki on 6 May 2020

@jangorecki it's been a while since mergelist and the changes to nomatch have been on the milestones, so I thought I might go ahead and complete the vignettes, at least from an equi-join standpoint. Currently, I'm using the nycflights13 package, but it looks like it isn't present on travis-ci and thus the build is failing. You also mentioned using a known dataset; nycflights13 fits that criterion, yeah? Is there some alternate way I can link to the package or data that doesn't involve loading it (I'm not sure linking directly to the CSV files on GitHub is appropriate)?

zeomal on 17 Dec 2020

@zeomal we have flights dataset in vignettes directory, see https://github.com/Rdatatable/data.table/blob/7aa22ee6b245b9308352acd66384373a99376c13/vignettes/datatable-intro.Rmd#L50-L55 for examples of usage.

jangorecki on 17 Dec 2020

@jangorecki correct me if I'm wrong, but I think this includes only the flights dataset of the series. That is, airlines, planes, airports and weather are missing, and they would be needed to demonstrate joins. I don't see any of the other files here either, so let me know if I'm missing something. The only other dataset I might be able to use is the Lahman one, which is present as RData files in the same location (used in the .SD vignette).

zeomal on 18 Dec 2020

@zeomal you are absolutely correct.

jangorecki on 18 Dec 2020

So... my question is, what should I use as a data source for the vignette that will be feasible? @jangorecki

zeomal on 18 Dec 2020

No worries, I am not replying because I don't have any valuable answer to questions asked, at the moment.
Choosing datasets for presenting join cases is a very complex task. I don't want to make a recommendation without first properly investigating the problem. There are many factors to consider, and it is not easy to construct dummy datasets that cover them well, so choosing real datasets is even trickier.
Joining should present (of course) matches, some matches, zero matches, duplicated matches (many-to-many join), rolling matches, mult arg, overlapping, non-equi, multi col, matches on NA....
Complete list of features to cover could be really long.

Personally I think that using multiple different datasets (each 10-20 rows to be visually easier to grasp) in this single vignette could be more appropriate but do not take that as a suggestion to write this vignette now like this. I would like to hear from @mattdowle and @arunsrinivasan as they are authors of (AFAIR) all join scenarios in data.table.
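
As a rough illustration of how small such dummy datasets can be, the sketch below covers two of the cases listed above (rolling matches and non-equi matches, each with a zero-match row). All table names and values are invented. It is not a proposal for the vignette's actual data.

```r
library(data.table)

# Rolling join: for each trade, take the last quote at or before its time.
quotes = data.table(time = c(1L, 5L, 10L), price = c(100, 101, 102))
trades = data.table(time = c(0L, 5L, 7L, 12L))
rolled = quotes[trades, on = "time", roll = TRUE]
rolled$price   # NA 101 101 102 (time 0 precedes every quote)

# Non-equi join: assign each value to the band with lo <= v < hi.
bands = data.table(lo = c(0, 10, 20), hi = c(10, 20, 30),
                   band = c("low", "mid", "high"))
obs = data.table(v = c(3, 10, 25, 40))
banded = bands[obs, on = .(lo <= v, hi > v)]
banded$band    # "low" "mid" "high" NA (40 falls outside every band)
```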

jangorecki on 23 Dec 2020
