I've had a very quick go at the new feather format by @wesm & @hadley to (de-)serialise DataFrame objects, here's a working example. Given the substantial gains in read/write times, would it make sense to include an experimental driver for feather in geopandas? It would go something like:
db = gpd.read_file('mygeo.feather', driver='feather')
db.to_file('mynewgeo.feather', driver='feather')
Under the hood it'd serialise the geometry column into wkb_hex (or any other format if faster/more efficient, this was my first go at it) and back into shapely geoms when reading.
Tagging @ljwolf as this is the fruit of discussing with him too.
I think we could also consider parquet; the idea would be very similar, and I certainly think it's a good idea!
And with the cython branch, serializing to and deserializing from WKB will be even faster, as this part is currently still the biggest overhead compared to the actual reading/writing to feather or parquet.
I think it might be better to write the actual bytes (so wkb instead of wkb_hex, but not sure)
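To make the wkb versus wkb_hex question concrete, here is a minimal sketch using only the standard library. It hand-encodes a single 2D point as little-endian WKB; the helper names point_to_wkb and wkb_to_point are made up for illustration, and a real implementation would rely on shapely's WKB support rather than struct.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # "<" little-endian, B = byte-order flag (1 = little-endian),
    # I = geometry type (1 = Point), dd = x and y as doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode a WKB point back into an (x, y) tuple (hypothetical helper)."""
    endian = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", data[1:])
    if geom_type != 1:
        raise ValueError("this sketch only handles Point geometries")
    return (x, y)


raw = point_to_wkb(-122.33, 47.61)
hexed = raw.hex()  # the wkb_hex form is a text string of hex digits

print(len(raw), len(hexed))  # 21 bytes raw, 42 characters as hex
assert wkb_to_point(bytes.fromhex(hexed)) == (-122.33, 47.61)
```

The hex form is exactly twice the size of the raw bytes, which is the main argument for storing a binary column when the file format supports one.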
Is this still being considered for inclusion? My colleagues and I regularly use parquet for storing geodataframes so I'd be happy to submit a PR if it's of interest
@knaaptime how do you store geodataframes as parquet and then load a geodataframe from parquet? i tried it a few times but it did not work for me. (with normal pandas dataframes i had no problem storing and loading parquet files)
Normally, you'd:
geometry column from wkb to shapely objects. I could still do a PR to encapsulate the logic @ljwolf just described but was holding off because I thought I remember a conversation somewhere that we get this for free if the geometry accessor in the ExtensionArray is implemented?
Found this the other day: https://github.com/brendan-ward/geofeather
ping @brendan-ward
Is there a way to store the CRS as metadata in the feather or parquet file so you don't have to keep track of more than one file?
@snowman2 I didn't do an extensive investigation into what other information could be stored in feather beyond that in the data frame itself, but a preliminary search did not reveal anything obvious - hence the need to dump CRS info into a separate file. It would be great if we could encapsulate everything in the same file.
I'd be open to turning the idea in geofeather into a PR here, since it would be much nicer as a direct function in Geopandas instead of stand-alone. Apologies for not checking back in here first, my first task was to generalize it outside of some of my other projects since I was starting to use it more widely.
@snowman2 i decided to just use pickle. it is the easiest solution for storing python objects and also very fast (at least fast enough for me). when loading a pickle file you get the exact same object back.
you could create a dict with the geodataframe and even more additional information and store this object in one file.
dataframe file format comparison: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
pickle tutorial: https://youtu.be/2Tw39kZIbhs
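A minimal sketch of the "dict in one pickle file" idea described above. The bundle contents are hypothetical stand-ins (plain records instead of a real GeoDataFrame) so the example runs with the standard library alone; pickle.dump and pickle.load would be used the same way with an open file.

```python
import pickle

# Hypothetical stand-in for a GeoDataFrame plus extra information;
# in practice the "data" entry would be the GeoDataFrame itself.
bundle = {
    "crs": "EPSG:4326",
    "data": [
        {"name": "a", "geometry": (1.0, 2.0)},
        {"name": "b", "geometry": (3.0, 4.0)},
    ],
    "notes": "anything else worth keeping alongside the data",
}

# Serialise to bytes and restore; the restored object compares equal.
payload = pickle.dumps(bundle, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
assert restored == bundle
```

Note that pickle files are Python-specific and should only be loaded from trusted sources, so this does not help with sharing data with R or other tools.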
I'd be open to turning the idea in geofeather into a PR here,
@brendan-ward I was actually writing an issue in your repo yesterday to ask about that, but didn't finish it yet. But so, I think this would be very welcome!
And thanks for making that package and moving it forward!
There is actually also already a "geoparquet" (https://github.com/darcy-r/geoparquet-python) that would be nice to include as well (cc @darcy-r)
About the actual implementation and the use of the separate CRS file: I think @brendan-ward is correct that there is currently no way to include metadata in a feather file using the python API (although there is nothing inherently preventing that, I think, I will open an issue about that on that Arrow project).
But, for writing parquet, this is actually already possible (pyarrow already stores pandas-specific metadata as well for a faithful roundtrip), and this is also what @darcy-r did.
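As a rough illustration of what storing the CRS as file metadata could look like: pyarrow exposes schema metadata as a dict of bytes keys to bytes values, so structured information is typically JSON-encoded under a single key. The key name b"geo" and the field names below are assumptions made for this sketch, not an agreed specification, and the sketch stops short of writing an actual parquet file.

```python
import json


def encode_geo_metadata(crs: str, geometry_column: str) -> dict:
    """Build a bytes-to-bytes metadata entry (hypothetical key and layout)."""
    meta = {
        "crs": crs,
        "geometry_column": geometry_column,
        "encoding": "WKB",
    }
    return {b"geo": json.dumps(meta).encode("utf-8")}


def decode_geo_metadata(file_metadata: dict) -> dict:
    """Recover the dict stored by encode_geo_metadata."""
    return json.loads(file_metadata[b"geo"].decode("utf-8"))


stored = encode_geo_metadata("EPSG:4326", "geometry")
assert decode_geo_metadata(stored)["crs"] == "EPSG:4326"
```

Keeping everything under one JSON-encoded key leaves the pandas-specific metadata that pyarrow already writes untouched.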
I'd be open to turning the idea in geofeather into a PR here,
@brendan-ward That would be great, thanks!
About the actual implementation and the use of the separate CRS file ... I will open an issue about that on that Arrow project
@jorisvandenbossche I think that would be very beneficial to be able to keep all of the information in a single file in the future. It makes using/sharing the file so much easier. Mind linking the issue you create here?
I think geofeather and geoparquet could eventually transform into file specifications for how to properly store the data in the feather & parquet formats similar to GeoJSON.
I think geofeather and geoparquet could eventually transform into file specifications for how to properly store the data in the feather & parquet formats similar to GeoJSON.
@snowman2 I agree and think that any feather and parquet extensions to geopandas should include sufficient metadata to accommodate a potential future feather and parquet driver for GDAL.
@brendan-ward and @knaaptime I am a coding newbie so let me know if I can help out at all :)
fwiw, Wes recommends parquet
Now that there is a well-supported Parquet implementation available for both Python and R, we recommend it as a “gold standard” columnar storage format.
@darcy-r's implementation is more elegant and general purpose than mine so how would @jorisvandenbossche and @darcy-r feel about moving geoparquet into geopandas? (and maybe incorporating multiprocessing like we do in geosnap?)
I'd like to see write benchmarks - esp. for geo data - before dropping feather. In my case, writes consume more time than reads, and in my own tiny benchmarks in geofeather I have seen more variation in write times than read times (compared to shapefile).
For those engaged here, I'd recommend supporting these as optional dependencies, as pandas does for parquet and pyarrow. If the implementations move directly into geopandas, the dependencies on pyarrow etc. should likewise stay optional.
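One common way to keep a dependency optional is to import it lazily, only when the feature is used, and to raise a helpful error otherwise. The helper below is a hypothetical sketch of that pattern using only importlib; the name import_optional is made up here and is not an existing pandas or geopandas function.

```python
import importlib


def import_optional(name: str, feature: str):
    """Import a module on demand, with a clear error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"'{name}' is required for {feature}; "
            "install it to use this functionality."
        ) from err


# Usage: the import only happens when the feature is actually called.
json_module = import_optional("json", "the demo")
print(json_module.dumps({"ok": True}))
```

A to_parquet or to_feather method would call such a helper for pyarrow as its first step, so that importing geopandas itself never requires pyarrow.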
@darcy-r not sure that a GDAL driver for feather makes sense, esp. given the above post from Wes, but a parquet driver may make sense since it is a storage format.
@snowman2 I think there are really only two parts of a spec we are talking about here, since the substantive specs are at the feather and parquet levels:
wkb
It seems like parquet can give us all of the above, and perhaps by disabling compression, keep I/O competitive or better than feather. We just need to build out support first in a consistent way, then benchmark.
@darcy-r I'd be happy to run with trying to integrate your work with parquet with mine with feather and pull together a PR to introduce both, if that seems reasonable to you?
I am happy to work on code that includes support for all formats, regardless of their merits and demerits. @brendan-ward I will be happy to work with you on a PR including feather.
@brendan-ward I would be keen to err on the side of caution with a more expansive specification, at least for Parquet files, with my primary motivation here being future GDAL integration.
@knaaptime I wasn't aware of the multiprocessing module; I'm keen to integrate it now I've seen it in action.
Chipping in here just to say a) this is _super_ exciting :) and b) I've not checked in a while and maybe parquet now does a similar thing, but one of the advantages for feather was interoperability between Python and R. It'd be cool to have files that both geopandas and sf could read/write without loss of detail :-)
@darribas Python/R interoperability is a big reason Wes endorses parquet in the post I linked above :P
@darribas with the latest release of the R arrow package, that should now also fully support reading/writing parquet files in R (with a functionality on par with the pandas interface).
Feather still has a few advantages over parquet: it can be faster (single threaded) if you don't care about file size, and it can be memory-mapped. On the other hand, parquet is much more of an industry standard, so it has better interoperability with other ecosystems, and it has much better compression support for small file sizes.
Very draft skeleton of doing this for geofeather here
The basic ideas are:
1) inherit read_feather and to_feather from pandas, which gives us optional dependency handling and other pandas level checks for free.
2) override the to_feather() function on the GeoDataFrame so that the interface is the same, but wraps geometry in WKB before export (and adds a CRS file...)
3) provide a read_feather() function in geopandas that shadows the pandas read_feather() but returns a GeoDataFrame instead.
I think a similar approach for parquet could work here too, but I need to do more research before I start stubbing that out based on @darcy-r 's work.
I did not find any way to avoid writing a separate CRS file; there is no python API into the feather writer that would let us add arbitrary metadata. For now, I suggest we don't stress over that detail until parquet support is in place, and then we can evaluate if feather provides enough speed to justify the tradeoff of having an additional file...
Likewise, there doesn't appear to be any way to serialize the name of the geometry column, but it seems reasonable that read_feather always return it as geometry in a new GeoDataFrame, and we can name the WKB field whatever we want for I/O. (I opted for _wkb)
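The three ideas above can be sketched with a toy stand-in. Everything here is hypothetical: Frame stands in for pandas.DataFrame, pickle stands in for the actual feather writer, and geometries are plain (x, y) points encoded by hand, so the example runs with the standard library only. The point is the shape of the override, not the I/O itself.

```python
import io
import pickle
import struct

WKB_COLUMN = "_wkb"  # on-disk column name, as chosen in the comment above


class Frame:
    """Toy stand-in for pandas.DataFrame: column name -> list of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def to_feather(self, buffer):
        # Stand-in for the inherited writer (pickle instead of feather).
        pickle.dump(self.columns, buffer)


def read_feather(buffer):
    """Stand-in for the pandas-level reader."""
    return Frame(pickle.load(buffer))


class GeoFrame(Frame):
    """Toy stand-in for GeoDataFrame with an overridden to_feather()."""

    def to_feather(self, buffer):
        columns = dict(self.columns)
        # Wrap geometry in WKB before export, then delegate to the parent.
        points = columns.pop("geometry")
        columns[WKB_COLUMN] = [struct.pack("<BIdd", 1, 1, x, y) for x, y in points]
        Frame(columns).to_feather(buffer)


def read_geo_feather(buffer):
    """Shadow the plain reader but return a GeoFrame with decoded geometry."""
    columns = read_feather(buffer).columns
    wkb_values = columns.pop(WKB_COLUMN)
    # Always restore the geometry under the name "geometry".
    columns["geometry"] = [struct.unpack("<BIdd", w)[2:] for w in wkb_values]
    return GeoFrame(columns)


buffer = io.BytesIO()
GeoFrame({"name": ["a"], "geometry": [(1.0, 2.0)]}).to_feather(buffer)
buffer.seek(0)
restored = read_geo_feather(buffer)
assert restored.columns["geometry"] == [(1.0, 2.0)]
```

Because the geometry column name is not stored, the reader restores it as geometry regardless of what it was called before writing, which mirrors the limitation described above.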
@brendan-ward that approach sounds good. I think you can open the PR, that will make it easier to comment on the actual code.
I did not find any way to avoid writing a separate CRS file; there is no python API into the feather writer that would let us add arbitrary metadata.
Yes, you don't have to look further: there is currently no way. Hopefully there will be in the near future though (I opened https://issues.apache.org/jira/browse/ARROW-6823 for this, there are plans to improve the feather format, but will probably take a few more months)
Given those current limitations (also regarding the column name as you mention), would it make sense to start with parquet in geopandas, and wait for feather until it supports metadata?
I haven't read through the associated PR yet, so take these comments with a grain of salt, but this is really exciting. I think there's a lot of potential here, not just within Python but especially cross languages since Parquet is designed to be language-independent.
My main comment is that it might be helpful to try to engage non-Python geospatial communities into this discussion, and make sure that instead of being a one-off Python or Python+R format, there's no part of this "GeoParquet" driver that's incompatible with other languages. I assume the files would conform to the Parquet spec, but a specification for how the metadata and geometries are encoded would be a good thing I think.
In particular, I'm interested in the potential for a "GeoArrow" as a cross-language in-memory specification for geospatial data. I'm not sure how there could be benefits to Pyarrow + Geopandas + Pygeos, or even if such an integration is possible, but if there's interest I'd be happy to engage in that discussion. (Probably best as a separate issue)
Yes, we really need to get back to that PR (as I am also excited about it)
The upcoming pyarrow 0.17 will include a "Feather 2.0" (which makes it basically exactly the Arrow IPC format on disk), which will also allow metadata to be included, and thus allow the same metadata spec to be used for the feather format as what is now being discussed for parquet (the lack thereof was a reason to focus on parquet up to now).
You're fully right on trying to engage other geospatial communities (I was planning to mail the GDAL dev mailing list to ask for feedback. Do you have ideas of other channels or projects to ask explicitly?)
With the improvements of arrow in R (the latest version of the R arrow package can now also read feather and parquet), it should be possible that the sf and other packages in that community could also use those formats.
Will look tomorrow at your feedback on the other PR. And thanks for jumping in on this!
The upcoming pyarrow 0.17 will include a "Feather 2.0" (which makes it basically exactly the Arrow IPC format on disk), which will also allow metadata to be included, and thus allow the same metadata spec to be used for the feather format as what is now being discussed for parquet (the lack thereof was a reason to focus on parquet up to now).
That's really cool. I personally am more excited about Parquet as a serialization format because of its great compression savings, but I'm not opposed to support for both, and memory mapping Feather files might have some potential.
You're fully right on trying to engage other geospatial communities (I was planning to mail the GDAL dev mailing list to ask for feedback. Do you have ideas of other channels or projects to ask explicitly?)
Well my suggestion is only so helpful when I don't know where to look for all the communities! I've been working with browser-based geospatial visualization, using libraries such as Deck.gl. We've been discussing improved support for Arrow, and I opened an issue about Parquet/Arrow here on a sister project that provides centralized loaders for geospatial data formats for the browser.
I also think emailing GDAL wouldn't be a bad idea. @bjornharrtell might have interesting thoughts from developing a similar cross-language high performance format in https://github.com/bjornharrtell/flatgeobuf.
With the improvements of arrow in R (the latest version of the R arrow package can now also read feather and parquet), it should be possible that the sf and other packages in that community could also use those formats.
Also very cool. I don't know anything about sf, but it might be worth opening an issue there.
@kylebarron for playing with pygeos & feather, I have this minimally stubbed out in geofeather here.
There are some hackish things done re: CRS and DataFrame handling that will be handled much better in #1180, and it will be obviated once pygeos support lands in geopandas. However, I've been using it for all my recent projects that are able to use pygeos directly; once you get the higher I/O it's hard to go back. 😄
That's really cool. I personally am more excited about Parquet as a serialization format because of its great compression savings, but I'm not opposed to support for both, and memory mapping Feather files might have some potential.
@kylebarron the upcoming pyarrow 0.17 release with Feather v2 will actually also include basic compression for feather ;) (not as advanced / fine grained as parquet, only basic per-column compression with lz4 or zstd, but it should already give a lot of value if compression is desired)
May very well be that FlatGeobuf can be of use here, as the goal of it is to be a high performance serialization of simple features. But I'm a complete noob when it comes to Python and R, I don't even understand what the relationship between Python and R is. The experience of @tim-salabim and @jeroen might be of interest, they did some work to integrate FlatGeobuf with sf and R-related visualisation.
Given that FlatGeobuf has a GDAL driver I don't see any issues with python/R interoperability. I haven't tried the GDAL driver from R yet, but I don't expect any issues really. If it's available, it should work. Writing from R via V8 was slow...