I wonder if it would be possible to make data.table compatible with the new sf library. The library sf is promising to be a game changer for spatial analysis in R so it sounds like good idea to bring together the power of both libraries.
Currently, the class of an sf object is "sf" "data.frame" and it brings a column named geometry of class "sfc_MULTIPOLYGON" "sfc".
Right now, I think the main incompatibility between dt and sf is that when we convert an sf data.frame into a data.table using setDT(), the geometry column is ruined and the object is not recognised anymore as an sf class.
I'm sure there are other points to take into account when making these two great libraries together so I just wanted to put the ball rolling if someone has not done this before.
My biggest problem in my R pipeline continues to be the trouble spatial libraries have communicating with data.table. I'm not sure how much can be done on the data.table side though...
Reproducible example please.
```
library(data.table)
library(sf)
nc <- st_read( system.file("shape/nc.shp", package="sf") )
class(nc)
[1] "sf" "data.frame"
head(nc)
Simple feature collection with 6 features and 14 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
epsg (SRID): 4267
proj4string: +proj=longlat +datum=NAD27 +no_defs
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 geometry
1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0 19 1 MULTIPOLYGON(((-81.47275543...
2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3 12 2 MULTIPOLYGON(((-81.23989105...
3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6 260 3 MULTIPOLYGON(((-80.45634460...
4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1 123 830 2 145 4 MULTIPOLYGON(((-76.00897216...
5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9 1066 1606 3 1197 5 MULTIPOLYGON(((-77.21766662...
6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7 954 1838 5 1237 6 MULTIPOLYGON(((-76.74506378...
````
Now when we try setting the sf object into data.table, note that the object looses both its "sf" class and the information in the geometry column
`````
setDT(nc)
class(nc)
[1] "data.table" "data.frame"
head(nc)
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 geometry1: 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0 19
2: 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3 12
3: 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6 260
4: 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1 123 830 2 145
5: 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9 1066 1606 3 1197
6: 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7 954 1838 5 1237
```
For the record, I've also created an issue in the sf github page #428 because the compatibility between the two libraries might request a bit of collaboration from both sides.
If you lose the sf class you can still rely on the sfc class, but you have to take care of more stuff. Also, returning a geometry in a data.table seems tricky, but my DT skills are rusty, so there might be a better way. There is no reason it shouldn't work in theory, but in practice some things might have to be ironed. Looking at the tidy verbs implementation should give a good idea of what needs to be reconciled.
library(sf)
#> Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.2.1 r14555
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' (...)
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
nc <- setDT(nc)
nc[AREA > 0.1, st_area(geometry)]
#> Units: m^2
#> [1] 1137388604 1423489919 1520740530 1179347051 1232769242 1136287383
#> [...]
#> [61] 1264050755 2288682896 2181174167 2450241815 2165843695
nc[AREA > 0.1, sum(st_area(st_union(geometry))), by = SID74]
#> SID74 V1
#> 1: 1 3598847644 m^2
#> 2: 5 11299175600 m^2
#> [...]
#> 20: 15 3850824500 m^2
#> 21: 29 1978619669 m^2
#> 22: 31 2439553215 m^2
#> SID74 V1
But returning geometries is a problem
nc[, st_union(geometry), by = SID74]
#> SID74
#> 1: 1
#> [...]
#> 83: 31
#> SID74
#> V1
#> 1: <list>
#> 2: <list>
[...]
#> 83: -78.86451,-78.91947,-78.95074,-78.97536,-79.00224,-79.00642, 34.47720, 34.45364, 34.44938, 34.39917, 34.38804, 34.36627,
#> V1
Another path is to nc <- st_as_sf(setDT(nc)), but there are still problems.
A quick note that @SymbolixAU has been working on a project creating a spatial.data.table More info HERE. I'm not sure creating a new class is the way forward, but I'm sure there is a good overlap between all these projects.
FYI, the reason I'm extending the data.table class is because i want to make use of Google's encoded polylines to store shape information. But these polylines can be hundreds of characters, so I wanted a custom print method to handle them.
more info/background on what I'm doing
SO: https://stackoverflow.com/questions/44819369/extend-data-table-class-with-custom-print-method
blog: https://www.symbolix.com.au/blog-main/x7gdctwdhre678bsplakplkclg8hg4
The problem @etiennebr is describing is because aggregation of geometry can lead to two different kinds of geometries (POLYGON and MULTIPOLYGON in this case). Try the following:
nc <- st_read(system.file("shape/nc.shp",package="sf"))
nc_DT <- as.data.table(nc)
nc %>% group_by(SID74) %>% summarise(geom = st_union(geometry))
nc_DT[,st_union(geometry),by=SID74][,table(SID74)] # indices where frequency > 1 are multipolygon geometries!
I still don't understand why data.table shows 83 rows for nc_DT[, st_union(geometry),by=SID74]. Regardless of the geometry aggregation issue shouldn't data.table still return only 23 rows here?
I have realized my error and I know that I have to do nc_DT[, .(st_union(geometry)), by=.(SID74)] but that still doesn't give me a valid object that I can plot!
nc %>% group_by(SID74) %>% summarise(geom=st_union(geometry)) %>% plot # works!
plot(nc_DT[, .(st_union(geometry)), by=.(SID74)]) # does not work...some weird error!
@vlulla try
setDT(nc)[, .(st_union(geometry)), by=.(SID74)] %>% sf::st_as_sf() %>% plot
Thank you @SymbolixAU !! I don't know why I didn't think of this... :-(
Am I right in thinking it would still be nice if setDT() retained the sf in the class?
Or can this issue be closed now on the data.table side. The solutions above are workarounds?
I reran the reproducible example above with 1.12.0 but it's still the same. Note that geometry column is not really lost though. That's just a printing/display difference. It's still all there :
> nc$geometry
Geometry set for 100 features
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
epsg (SRID): 4267
proj4string: +proj=longlat +datum=NAD27 +no_defs
First 5 geometries:
MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
>
I just tried
setDT(nc)
attr(nc, "class") <- c(attr(nc, "class"), "sf")
and it seems to cause more issues, so not sure if there's any immediate benefit in retaining the sf class.
At the moment _I'm_ happy with workarounds. There's lots which can be done inside the j argument using anonymous functions and the like. And as has been mentioned, you don't lose the sfc class on the geometry column, which is the main component.
Instead of putting sf last in the class I tried putting it first :
setattr(nc, "class", c("sf", "data.table", "data.frame"))
but this still resulted in problems. Perhaps you could try that and see if those problems can be overcome.
The root issue is perhaps that sf has its own [.sf method which either masks or is masked by [.data.table depending on whether sf is before or after data.table in class().
The class attribute being c("sf", "data.table", "data.frame") makes sense to me on first glance. data.table could change setDT() to retain sf like that. But then sf would likely need to become data.table-aware (import data.table) and call NextMethod() inside its [.sf if inherts(x,"data.table"). Something like that.
Where in https://github.com/r-spatial/sf/blob/master/R/sf.R#L299-L335 should this NextMethod() be called? In the else around Line 318?
At one time (when print.sf was not tbl_df aware) I had the following in my .Rprofile but it caused various problems that I did not know how to resolve so it's commented now:
#### print.sf <- function(x, ..., n = ifelse(options("max.print")[[1]] == 99999, 20, options("max.print")[[1]])) {
#### geoms = which(vapply(x,function(col) inherits(col,"sfc"), TRUE))
#### nf = length(x) - length(geoms)
#### app = paste("and", nf, ifelse(nf == 1, "field", "fields"))
#### if (any(!is.na(st_agr(x))))
#### app = paste0(app,"\n","Attribute-geometry relationship: ", sf:::summarize_agr(x))
#### if (length(geoms) > 1)
#### app = paste0(app,"\n","Active geometry column: ", attr(x, "sf_column"))
#### print(st_geometry(x),n=0,what="Simple feature collection with",append=app)
#### if(is.data.table(x)) {
#### ## data.table:::print.data.table(x, ...)
#### NextMethod()
#### } else {
#### print.data.frame(x, ...)
#### }
#### ## print.data.frame(x, ...)
#### invisible(x)
#### }
The lines around 318 in [.sf are like this :
class(x) = setdiff(class(x), "sf") # one step down
x = if (missing(j)) {
if (nargs == 2) # `[`(x,i)
x[i] # do sth else for tbl?
else
x[i, , drop = drop]
} else
x[i, j, drop = drop]
You don't need to remove sf from class(x) like that. That's what NextMethod() does for you.
I'm not sure exactly what you'd like to achieve in sf in terms of print or [ methods so I'm not sure what exactly to be done. If you can provide some inputs and expected outputs then I could maybe help out.
Since last Matt's reply almost 2 years ago there was no feedback. Also previously mentioned issues were addressed in comments by workarounds or suggestions on how to improve integration on sf side. This is quite popular issue so I am not closing it yet but encourage anyone interested into it, to read it, and provide comment what compatibility improvement could be added on data.table side, providing an example of course.
I personally am a massive user of data.table and sf and the best way to integrate them together is to use sfc column inside a data.table instead of using an sf object. Everything integrates very well, the only known issue for me is #4217 and I suggested a temporary workaround in the issue.
Maybe this thread helps https://github.com/paleolimbot/wk/issues/12. It seems that (1) data.table fully supports vctrs, or that (2) data.table supports geovctrs (wk) that are the bridge to sf can be potential solutions.
Most helpful comment
I personally am a massive user of
data.tableandsfand the best way to integrate them together is to usesfccolumn inside adata.tableinstead of using ansfobject. Everything integrates very well, the only known issue for me is #4217 and I suggested a temporary workaround in the issue.