I think the dplyr-friendly sf object is really nice, and it would be great to see it one day become a base class imported by other spatial packages. However, the sf package introduces system dependencies (geos and gdal in particular), which we've seen to be a common barrier and source of frustration for users. While it makes sense for sf itself to have these, I think it would be really nice for other developers to be able to depend on the sf data structure without having to encumber their package with heavy system dependencies inherited from sf, if that makes sense.
Would it be possible to split out the base data structure and core methods from the gdal-based read/write infrastructure (and perhaps from geos deps as well?) or are the external dependencies pretty much inseparable from any of the base class definition & methods?
Carl, really interesting that you bring this up. When sp started in 2003-2004, that was what we did, letting rgdal develop by itself, and adding rgeos (GSoC 2010, @rundel) as GEOS became feasible. So the original "design" was rgdal and rgeos providing services to sp classes (S4 vintage).
Over time, we've found that to read and write data, rgdal is loaded almost always in use cases (see the code for our book, except for Edzer's chapters where built-in data sets were used). No rgeos in the 1st edition - didn't exist - but rgeos is used a lot in data preparation, all the aggregate, group_by, etc. things depend on it.
So when we considered re-booting and adopting SF standards, it was logical to build on what we'd learned since 2003, that the classes should be together with read/write and manipulation functions. Yes, external dependencies are clunky, but they are needed anyway in the vast majority of use cases with real data (except for points, which could be coordinate columns).
Without the external dependencies, you also lose the ability to transform data sets to the same coordinate reference system. mapproj has built-in code, but it is very limited, and now very out of date (I inserted Plan 9 licensed code in 2009, the code really dates back to Doug McIlroy flying his plane - certainly over the Atlantic - using it as written for S about 30 years ago). The proj4 package has the same external dependency.
Yes, it is possible to separate, but we've been there and done that with sp.
I agree with what Roger said. Users will practically always need the whole thing, where developers prefer something slim, light and modular.
Developer objections I've heard so far were CI build times (resolved), difficulties installing on Mac OSX from source (resolved in README), and needing administrator rights to install gdal/geos/proj.4 (exceptional, but resolved). There's an earlier discussion where ropensci developers brought up the latter aspect here.
Thanks for the reply, it's certainly invaluable to have the perspective of your experience with sp.
Just to be clear though, I was not suggesting that sf repeat the sp model itself and exist as a user-facing package that can be installed without all of these pieces included. I completely agree with your perspective that it makes sense that sf include the read/write, precision/transform, and projection abilities provided by these standards, and I see how it streamlines the introduction and presentation of the package to end users when you don't have to caveat everything with "oh, you need the optional rgdal dependency for that". So I'm completely on board with sf not repeating the opt-in dependency structure of sp.
However, I don't think the observation that most users will load all these dependencies need make a modular approach impossible. For instance, no user uses dplyr without tibble, and yet it makes sense to provide tibble as an independent package. As you surely already know, this simplifies things both for other developers and maintainers, independent of the external libraries thing (which obviously doesn't affect tibble/dplyr). Once something is modular it's easy to aggregate, but an aggregated package can't easily be taken apart. As maintainers, surely you would find it easier to deal with reverse-dependency check issues on CRAN if developers could depend only on the parts they need?
So do you think a similar approach could work here, in which the dependencies of sf would remain unchanged and thus have no user-facing impact, but developers would be able to use the base class alone without introducing the dependencies? Or is there some other compelling reason I'm overlooking about how a modular approach could create a problem?
Certainly the complete integration of gdal, geos, and proj is a big part of what makes sf so compelling, but if there truly were no use cases separating them then they probably wouldn't be three separate libraries to begin with. (Though of course gdal itself has adopted a very kitchen-sink approach which precludes building a toolchain focused on a single serialization format, so oh well ;-)).
You have indeed done some excellent work at addressing the developer issues regarding peculiarities of Mac OSX, travis caching, and linux environments, though at the same time the existence of these issues in the first place underscores the point that portability & modularity are not (at least not yet) completely passé.
Rightly or wrongly, many software developers' instincts are not to introduce unnecessary dependencies, and I would far rather see developers depend on the careful, knowledgeable and robust implementation of sf (and where relevant its underlying component libraries). sf defines a very clever, simple data structure (the data.frame with geometry column) that I think is a powerful idea that could be exploited to great advantage in spatial and spatial-adjacent packages. I think that, practically speaking, adoption would be faster if that structure were available to developers in a modular way.
I completely follow you, and the dplyr/tibble model looks attractive.
Creating an sf object using st_sf, or a geometry column using st_sfc allows/requires setting a coordinate reference system. This is either a number or a proj4 string; in the former case the number is resolved to its proj4 string, in the second case the string is validated, in both cases Proj.4 is used for this. This means we'd have to exclude the base constructor functions? Or allow construction without validation? What happens then, when they're invalid?
If you, or anyone else, can give me a list of sf functions and methods you'd like to have factored out, and these don't involve gdal, geos or proj.4, I will give it a serious thought. @mdsumner
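For readers following along, the constructor behaviour described above can be illustrated with a minimal sketch. It assumes the sf package is installed together with its system libraries; the coordinates and the `id` column are made up for illustration, and the exact form in which the CRS is stored may differ between sf versions.

```r
library(sf)

# Geometry column built from an EPSG code; sf resolves the number to a full CRS.
geom_epsg <- st_sfc(st_point(c(5, 52)), st_point(c(6, 53)), crs = 4326)

# Geometry column built from a proj4 string; the string is checked when the
# column is constructed.
geom_proj <- st_sfc(st_point(c(5, 52)),
                    crs = "+proj=longlat +datum=WGS84 +no_defs")

# An sf object is a data.frame with a geometry column attached.
x <- st_sf(data.frame(id = 1:2), geometry = geom_epsg)

# Inspect the coordinate reference system carried by the object.
st_crs(x)
```

Both `crs` forms go through the external libraries, which is why the constructors cannot simply be moved into a dependency-free package as they stand.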
Very cool.
Yeah, I suspected proj might be the most difficult to isolate; the projection information is after all fundamental to the data structure. My intuition would be that it probably does make sense for the sf object module to maintain a dependency on Proj4 (I'm not opposed to system dependencies, just shooting for the appropriate level of modularity here). In an ideal world, having Proj4 baked into a widely used base class may encourage other packages to also bother using and validating Proj4 strings instead of the heterogeneous practices we currently see ;-).
@cboettig the reasons that sf interfaces separately to geos, proj and gdal, and not to all three through gdal (which gdal offers) are that
I do not share your enthusiasm for an sf_base package depending on proj.4. Proj.4 is not a substantially simpler dependency than gdal or geos. Modular yes, but convoluted too.
I don't have any objections to sf linking all three, and I see your point that any package using gdal may be better off if it uses proj4 projections rather than gdal's own. I also realize that CRAN's approach to compilation of platform-specific binaries makes this all much harder -- like you say, essentially gdal is modular at the source level but not really at the binary level without static linking. But all these are more technical nuisances than fundamental barriers.
We have a data type given by a well-defined standard. It seems reasonable to be able to work with that data type without any dependency on a particular serialization or transformation of it, and an inability to separate these things usually means we still lack the proper abstractions, which I don't think is the case here. It may mean convoluted makevars, but separating concept from serialization isn't convoluted, right? (E.g. no one would advocate that dplyr should have been built inseparably from MySQL and postgres).
Perhaps a base module that leaves projections as unvalidated strings by default, to gain the benefits of more portability, would be cleaner after all.
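To make that idea concrete, here is a rough base R sketch of what such a dependency-free base structure might look like. Everything in it is hypothetical: `new_geom_col`, `new_spatial_df`, `geom_crs`, and the class names are invented for illustration and are not part of sf or any existing package. The CRS is stored as a plain string and is never checked.

```r
# Hypothetical geometry column: a list of coordinate vectors or matrices with
# the CRS kept as an unvalidated string attribute. No GDAL, GEOS, or PROJ.
new_geom_col <- function(geoms, crs = NA_character_) {
  stopifnot(is.list(geoms), is.character(crs), length(crs) == 1)
  structure(geoms, crs = crs, class = c("geom_col", "list"))
}

# Hypothetical constructor: a data.frame with the geometry column attached.
new_spatial_df <- function(df, geoms) {
  stopifnot(is.data.frame(df), inherits(geoms, "geom_col"),
            nrow(df) == length(geoms))
  df$geometry <- geoms
  class(df) <- c("spatial_df", class(df))
  df
}

# Hypothetical accessor: return the stored CRS string as-is.
geom_crs <- function(x) attr(x$geometry, "crs")

pts <- new_geom_col(list(c(1, 2), c(3, 4)),
                    crs = "+proj=longlat +datum=WGS84")
x <- new_spatial_df(data.frame(id = 1:2), pts)
geom_crs(x)
```

The trade-off raised earlier in the thread still applies: a misspelled or meaningless CRS string would be accepted here and only fail later, when a package that does link to the projection library tries to use it.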
You're the experts on this stuff of course, and I do appreciate the expediency of kitchen sink approaches. But long term I think everyone wins when we get the abstractions right.
We are very supportive of efforts at modularization.
@cboettig just for the record:
In case you want to make comparisons with the tidyverse, I'd like to include
I think it's fair to summarize that @edzer supports modularization but not its priority, i.e. no interest in doing it but also is not opposed to it.
I really want a raw GDAL package in R from which we could build sf-alike packages, with different purposes. I also want access to the projection library in GDAL without having to construct sf-types. A relatively straightforward first step for read-access would be a tibble from a vector data source with or without the geometry column, and with the option of the geometry column, if present, being unpacked nested lists of coordinates (an unstructured list col) or the packed raw binary in a blob-list-column. Other options would be WKT geometry or GeoJSON, and the other varieties of binary that are needed by different database engines.
R will be a central general engine for many of the standard database engines, a way better ETL than FME ultimately, much like Radian is going for and it's absolutely going to need these standard core tools.
I believe many existing projects could make good use of this capacity, and with a little practice and testing we could design a way to have sf use this as well, though this is just a side-benefit, not a primary need here.
I believe the success of sf leads many to wonder why we need anything more, but there are many projects that would indeed use it, and until we have it or clearly identify the need for it we're stuck as isolated islands of need with a hard programming job ahead. Very keen to hear from anyone interested in pursuing this. We (me and colleagues) will get to it eventually, but I know there's quite a bit of interest and potential to do this as a community and leverage different levels of programming and domain-specific expertise. sf is a good model of how to go about it, and in the first instance this will simply involve a lot of code removal to get at the core Rcpp hooks into GDAL. rgdal2 is a more general implementation also useful for seeing how to do this.
As has been commented elsewhere, ideas may be great, but wanting is really not part of the real world. rgdal maintenance was taken over by me (and Edzer and others, but mostly me) because the author was busy with his own stuff and non-responsive. The problem is not, as Mike and Carl seem to think, being cool or aspiring to fashionable ideas, it is effectively only about ensuring performance stability so that the users can get their work done.
Robert Hijmans does use some lower-level access to GDAL and PROJ4 in rgdal from raster, but it is very hard to update these things, and IIRC raster still does not support ob_tran at all, while rgdal does (logically, not like PROJ.4).
Things should only (ever) be as tidy as possible, never more so. If making them tidier than desirable also uses up time and commitment to long-term maintenance resources that are the major constraints here, really its priority is never going to trump real modal user needs. By the way, rgdal2 was started three years ago, last commit 11 months ago PR from Edzer, so if Mike wants it, you're welcome. Remember to stop by and use what we've learned over 15 years about maintaining this stuff - Mike knows about Windows GDAL builds, and the current system works for effectively everybody.
Further, remember that GDAL is a moving target, so actually getting at the Rcpp hooks only works if your assumptions about GDAL continue to hold (they typically don't), otherwise each release involves a lot of back-tracking. The reasons for their upstream changes are almost always good (both vector and raster), so we have to adapt (and rgdal and sf lean on each other in this).
I'll stay away. I'm mystified as to why offence is caused here, and insults are unacceptable. Wanting exactly is a very real part of everything, and good behaviour and friendly conduct help everywhere and at every level. Wanting orthogonal modular components that can be put together flexibly is a core development strategy with a very good pedigree, it's not just fashionable and cool but also now is a very productive part of modern R package development.
Let's not get emotional here.
With stars coming up, and sf still a _very_ young and, in my opinion, rather small package, breaking it up has absolutely no priority for me right now. Since nobody else seems motivated to do actual work on this or to come up with concrete proposals on how to do it, I'm closing this issue.