Data.table: Delete rows by reference

Created on 8 Jun 2014 · 22Comments · Source: Rdatatable/data.table

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

Since deleting 1 column is DT[,colname:=NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need a LHS to indicate "all" columns, leading to :

 DT[i,.:=NULL]   # delete rows by reference

 DT[,.:=NULL]    # error("must specify i to delete rows. To delete all rows from a table use DT[TRUE,.:=NULL], or, DT=DT[0].  This is deliberately a little harder, to avoid accidents such as "delete from table" a coomon accident in SQL.")

We can also add an attribute "read only" or "protect" to a data.table, and if the user had protected the data.table in that way, .:= would not work on it.

feature request top request

Source

arunsrinivasan

👍63

Most helpful comment

To complement @andrewrech's answer. Here is code as function and example of its usage.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs);  # select row indexes to keep
  cols = names(DT);
  DT.subset <- data.table(DT[[1]][keep.idxs]); # this is the subsetted table
  setnames(DT.subset, cols[1]);
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]];
    DT[, (col) := NULL];  # delete
  }
   return(DT.subset);
}

And example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
>

This is my very first GitHub post, btw.

Jarno-P on 18 Nov 2016

👍9 😄2

All 22 comments

Second that.

Thank you.
Kind regards,
TY

Co0olCat on 7 May 2015

👍1

Subset data table without using <-

zx8754 on 1 Oct 2015

👍1

http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table

mattdowle on 1 Oct 2015

👍1

Just delete by reference is not that hard. The benefit would be mainly memory efficiency rather than speed so much.

mattdowle on 1 Oct 2015

How about adding both
delete(DT, b>=8 | a<=3)
and
DT[b>=8 | a<=8, .ROW:=NULL]
The advantage of the latter would be combining with other features of [] such as row numbers in i, join in i and roll. All benefiting from [i,j,by] optimization.
As per : http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table/10791729?noredirect=1#comment54633906_10791729

mattdowle on 29 Oct 2015

👍3

More advanced example :

DT[ b>=8, .SD[1, .ROW:=NULL], by=group]
# remove by reference the 1st observation in each group within a subset

Is .ROW the right name for this new symbol?

mattdowle on 29 Oct 2015

👍1

Re right name: doesn't .SD already carry the right meaning for that (instead of introducing a new name a la .ROW)?

eantonya on 29 Oct 2015

👍2

I think syntax for selecting rows to keep (which just deletes their complement) would be convenient.

delete(DT, b >= 8 | a <= 3) # or
keep(  DT, b <  8 & a >  3)

I don't know that there's a sensible way to extend this logic to work inside j. I'd just as well have Matt's second example only work via

badrows = DT[b >= 8, .I[1], by=g]$V1
delete(DT, badrows)

Just as new columns cannot be created by set (last I checked), it could be that row modifications cannot be done inside [.data.table.

franknarf1 on 29 Oct 2015

👍7

if anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function to select rows for each col then delete by reference based on a SO answer by vc273.

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

# Example dt
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, col := 1:1e6, with = F] }
keep.idxs = sample(1e6, 9e4, FALSE) # keep 90% of

delete <- function(DT, keep.idxs){
cols <- copy(names(DT))
DT_subset <- DT[[1]][keep.idxs] %>% as.data.table
setnames(DT_subset, ".", cols[1])
for (col in cols){
  DT_subset[, (col) := DT[[col]][keep.idxs]]
  set(DT, NULL, col, NULL)
}
return(DT_subset)
}

str(delete(DT, keep.idxs))
str(DT)

andrewrech on 23 Aug 2016

@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

> dim(d1)
[1] 0 0

vinhdizzo on 25 Aug 2016

👍1

To complement @andrewrech's answer. Here is code as function and example of its usage.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs);  # select row indexes to keep
  cols = names(DT);
  DT.subset <- data.table(DT[[1]][keep.idxs]); # this is the subsetted table
  setnames(DT.subset, cols[1]);
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]];
    DT[, (col) := NULL];  # delete
  }
   return(DT.subset);
}

And example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
>

This is my very first GitHub post, btw.

Jarno-P on 18 Nov 2016

👍9 😄2

Is it already implemented ?

skanskan on 10 Jan 2017

👍4

is it implemented now. Its kinda necessary function.

vikram-rawat on 16 Aug 2017

👍5

What kind of work needs to be done in order to add this functionality to data.table ? Would be glad to help, but not totally sure where to start !

The delete function could be added using @Jarno-P awnser and later on modified to be more efficient and works with [] references, don't you think ?

MiloParigi on 3 Jan 2018

👍1

I think the open question is the best API. data.table-like syntax would suggest the following should "work":

DT[rows_to_delete := NULL]

The functional approach of @Jarno-P would be a change from this, where row deletion would become functional & require DT <- f(DT) constructions. This may be best since := usages are _truly_ by reference, whereas row deletions as exemplified thus far are only _fast_ (compared to full copies), and not _truly_ by reference.

MichaelChirico on 8 Jan 2018

👍7

Although I am all but qualified to comment, should the syntax user perspective be more like:

DT[ i , .SR := NULL ]

Where the "i" is a DT-expression to select rows. .SR is similar to .SD, except it is always defined within DT and it includes references to all the rows selected by i. But such an approach may add overhead in expressions not intending to delete rows.

Alternative way is to change the behavior of .SD and have it defined also when by-expression is not used and when used without "by", .SD would refer to the whole rows instead (.SD excludes grouping columns).

Jarno-P on 9 Jan 2018

👍2

An approach to bypass X <- f(X) might be to find out the name of X via deparse + substitute and than use the assign function. E.g. like this (adjusting the function of @Jarno-P):

del_rows <- function(X,delete) {

  keep <- -delete
  name_of_X <- deparse(substitute(X))
  X_names <- copy(names(X))
  X_new <- X[keep,X_names[1L],with=F]
  set(X,i=NULL,j=1L,value=NULL)

  for(j in seq_len(ncol(X))) {

    set(X_new,i=NULL,j=X_names[1L+j],value=X[[1L]][keep] )
    set(X,i=NULL,j=1L,value=NULL)

  }
  assign(name_of_X,value=X_new, envir = .GlobalEnv)
}

You would need to find out the environment of X for general cases.

matthiaskaeding on 18 Jan 2018

👍1

Have already reacted. Bumping this if possible!

Gayyam on 14 Jun 2019

There is an interesting question on SO:

Subsetting a large vector uses unnecessarily large amounts of memory

Not directly related to data.table but a potential use case for deleting rows by reference.

UweBlock on 6 Aug 2019

👍5

Yes, please. This would help a lot a work I'm doing on leveraging data.table in individual-based models.

rkanitz on 23 Aug 2019

👍3

I provided one design idea to address this issue in https://github.com/Rdatatable/data.table/issues/4345#issuecomment-608348313
Are there any other ideas? If not it should be safe to start working on implementation of the idea.

jangorecki on 7 Apr 2020

Proof of concept based on https://github.com/Rdatatable/data.table/issues/4345#issuecomment-608348313

setsubset = function(x, i) {
  stopifnot(is.data.table(x), is.integer(i))
  if (!length(i)) return(x)
  if (anyNA(i) || anyDuplicated(i) || any(i < 1L || i > nrow(x) || is.unsorted(i))) stop("i must be non-NA, no dups, in range of 1:nrow(x) and sorted")
  drop = setdiff(1:nrow(x), i)
  last_ii = drop[1L]-1L
  do_i = i[i > last_ii]
  for (ii in do_i) {
    last_ii = last_ii+1L
    set(x, last_ii, names(x), as.list(x[ii]))
  }
  ## we need to set true length here but this needs C
  invisible(x)
}

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

jangorecki on 7 Apr 2020

❤2 🚀1

Was this page helpful?

0 / 5 - 0 ratings