Data.table: Delete rows by reference

Created on 8 Jun 2014  路  22Comments  路  Source: Rdatatable/data.table

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

Since deleting 1 column is DT[,colname:=NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need a LHS to indicate "all" columns, leading to :

 DT[i,.:=NULL]   # delete rows by reference

 DT[,.:=NULL]    # error("must specify i to delete rows. To delete all rows from a table use DT[TRUE,.:=NULL], or, DT=DT[0].  This is deliberately a little harder, to avoid accidents such as "delete from table" a coomon accident in SQL.")

We can also add an attribute "read only" or "protect" to a data.table, and if the user had protected the data.table in that way, .:= would not work on it.

feature request top request

Most helpful comment

To complement @andrewrech's answer. Here is code as function and example of its usage.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs);  # select row indexes to keep
  cols = names(DT);
  DT.subset <- data.table(DT[[1]][keep.idxs]); # this is the subsetted table
  setnames(DT.subset, cols[1]);
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]];
    DT[, (col) := NULL];  # delete
  }
   return(DT.subset);
}

And example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

This is my very first GitHub post, btw.

All 22 comments

Second that.

Thank you.
Kind regards,
TY

Just delete by reference is not that hard. The benefit would be mainly memory efficiency rather than speed so much.

How about adding both
delete(DT, b>=8 | a<=3)
and
DT[b>=8 | a<=8, .ROW:=NULL]
The advantage of the latter would be combining with other features of [] such as row numbers in i, join in i and roll. All benefiting from [i,j,by] optimization.
As per : http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table/10791729?noredirect=1#comment54633906_10791729

More advanced example :

DT[ b>=8, .SD[1, .ROW:=NULL], by=group]
# remove by reference the 1st observation in each group within a subset

Is .ROW the right name for this new symbol?

Re right name: doesn't .SD already carry the right meaning for that (instead of introducing a new name a la .ROW)?

I think syntax for selecting rows to keep (which just deletes their complement) would be convenient.

delete(DT, b >= 8 | a <= 3) # or
keep(  DT, b <  8 & a >  3)

I don't know that there's a sensible way to extend this logic to work inside j. I'd just as well have Matt's second example only work via

badrows = DT[b >= 8, .I[1], by=g]$V1
delete(DT, badrows)

Just as new columns cannot be created by set (last I checked), it could be that row modifications cannot be done inside [.data.table.

if anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function to select rows for each col then delete by reference based on a SO answer by vc273.

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

# Example dt
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, col := 1:1e6, with = F] }
keep.idxs = sample(1e6, 9e4, FALSE) # keep 90% of

delete <- function(DT, keep.idxs){
cols <- copy(names(DT))
DT_subset <- DT[[1]][keep.idxs] %>% as.data.table
setnames(DT_subset, ".", cols[1])
for (col in cols){
  DT_subset[, (col) := DT[[col]][keep.idxs]]
  set(DT, NULL, col, NULL)
}
return(DT_subset)
}

str(delete(DT, keep.idxs))
str(DT)

@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

> dim(d1)
[1] 0 0

To complement @andrewrech's answer. Here is code as function and example of its usage.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs);  # select row indexes to keep
  cols = names(DT);
  DT.subset <- data.table(DT[[1]][keep.idxs]); # this is the subsetted table
  setnames(DT.subset, cols[1]);
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]];
    DT[, (col) := NULL];  # delete
  }
   return(DT.subset);
}

And example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

This is my very first GitHub post, btw.

Is it already implemented ?

is it implemented now. Its kinda necessary function.

What kind of work needs to be done in order to add this functionality to data.table ? Would be glad to help, but not totally sure where to start !

The delete function could be added using @Jarno-P awnser and later on modified to be more efficient and works with [] references, don't you think ?

I think the open question is the best API. data.table-like syntax would suggest the following should "work":

DT[rows_to_delete := NULL]

The functional approach of @Jarno-P would be a change from this, where row deletion would become functional & require DT <- f(DT) constructions. This may be best since := usages are _truly_ by reference, whereas row deletions as exemplified thus far are only _fast_ (compared to full copies), and not _truly_ by reference.

Although I am all but qualified to comment, should the syntax user perspective be more like:

DT[ i , .SR := NULL ]

Where the "i" is a DT-expression to select rows. .SR is similar to .SD, except it is always defined within DT and it includes references to all the rows selected by i. But such an approach may add overhead in expressions not intending to delete rows.

Alternative way is to change the behavior of .SD and have it defined also when by-expression is not used and when used without "by", .SD would refer to the whole rows instead (.SD excludes grouping columns).

An approach to bypass X <- f(X) might be to find out the name of X via deparse + substitute and than use the assign function. E.g. like this (adjusting the function of @Jarno-P):

del_rows <- function(X,delete) {

  keep <- -delete
  name_of_X <- deparse(substitute(X))
  X_names <- copy(names(X))
  X_new <- X[keep,X_names[1L],with=F]
  set(X,i=NULL,j=1L,value=NULL)

  for(j in seq_len(ncol(X))) {

    set(X_new,i=NULL,j=X_names[1L+j],value=X[[1L]][keep] )
    set(X,i=NULL,j=1L,value=NULL)

  }
  assign(name_of_X,value=X_new, envir = .GlobalEnv)
}

You would need to find out the environment of X for general cases.

Have already reacted. Bumping this if possible!

There is an interesting question on SO:

Subsetting a large vector uses unnecessarily large amounts of memory

Not directly related to data.table but a potential use case for deleting rows by reference.

Yes, please. This would help a lot a work I'm doing on leveraging data.table in individual-based models.

I provided one design idea to address this issue in https://github.com/Rdatatable/data.table/issues/4345#issuecomment-608348313
Are there any other ideas? If not it should be safe to start working on implementation of the idea.

Proof of concept based on https://github.com/Rdatatable/data.table/issues/4345#issuecomment-608348313

setsubset = function(x, i) {
  stopifnot(is.data.table(x), is.integer(i))
  if (!length(i)) return(x)
  if (anyNA(i) || anyDuplicated(i) || any(i < 1L || i > nrow(x) || is.unsorted(i))) stop("i must be non-NA, no dups, in range of 1:nrow(x) and sorted")
  drop = setdiff(1:nrow(x), i)
  last_ii = drop[1L]-1L
  do_i = i[i > last_ii]
  for (ii in do_i) {
    last_ii = last_ii+1L
    set(x, last_ii, names(x), as.list(x[ii]))
  }
  ## we need to set true length here but this needs C
  invisible(x)
}

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])
Was this page helpful?
0 / 5 - 0 ratings

Related issues

arunsrinivasan picture arunsrinivasan  路  3Comments

tcederquist picture tcederquist  路  3Comments

mattdowle picture mattdowle  路  3Comments

mattdowle picture mattdowle  路  3Comments

franknarf1 picture franknarf1  路  3Comments