Sf: NA_integer_ handling in shapefiles

Created on 19 Jan 2017  路  7Comments  路  Source: r-spatial/sf

df <- data.frame(
    a = c(0, 1, NA, -Inf, Inf),
    b = c("a", "b", NA, "c", ""),
    x = 1:5,
    y = 1:5)

st_point2 <- function(x, y) st_sfc(purrr::map2(x, y, ~c(.x, .y) %>% st_point))

sf <- df %>% 
    mutate(geom = st_point2(x, y),
           `90%` = x) %>% 
    st_as_sf()

st_write(sf, "tests/test.shp")

z <- st_read("tests/test.shp")
# Reading layer `test' from data source `C:\Users\etienne.racine\Documents\workspace\sfr\tests\test.shp' using driver `ESRI Shapefile'
# Simple feature collection with 5 features and 5 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
# epsg (SRID):    NA
# proj4string:    NA
# Warning message:
#   In CPL_read_ogr(dsn, layer, as.character(options), quiet, iGeomField -  :
#    GDAL Message 1: Value '1.#SNAN0000000000' of field test.a parsed incompletely to real 1.
z
# Simple feature collection with 5 features and 5 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
# epsg (SRID):    NA
# proj4string:    NA
# a    b x y X90.   geometry
# 1    0    a 1 1    1 POINT(1 1)
# 2    1    b 2 2    2 POINT(2 2)
# 3    1 <NA> 3 3    3 POINT(3 3)
# 4 -Inf    c 4 4    4 POINT(4 4)
# 5  Inf <NA> 5 5    5 POINT(5 5)

Also, the warning message GDAL Message 1: Value '1.#SNAN0000000000' of field test.a parsed incompletely to real 1. can fill a screen because it's thrown for every NA value, probably because of the evil Rcout. ogrinfo -al raises a warning as well (for each record).

What should NA's be changed to? I haven't looked at rgdal's solution.

Most helpful comment

Maybe, it is possible to add some kind of message for users who try to write shapefile:
"Using ESRI Shapefile seriously harms you and others around you" (or something else) ;)

All 7 comments

I just looked at rgdal and seems fine:

rgdal::writeOGR(as(sf, "Spatial"), "tests", "test_rgdal", "ESRI Shapefile")
y <- st_read("tests/test_rgdal.shp")
# Reading layer `test_rgdal' from data source `tests/test_rgdal.shp' using driver `ESRI Shapefile'
# Simple feature collection with 5 features and 5 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
# epsg (SRID):    NA
# proj4string:    NA
y
# Simple feature collection with 5 features and 5 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
# epsg (SRID):    NA
# proj4string:    NA
#      a    b x y X90_   geometry
# 1    0    a 1 1    1 POINT(1 1)
# 2    1    b 2 2    2 POINT(2 2)
# 3   NA <NA> 3 3    3 POINT(3 3)
# 4 -Inf    c 4 4    4 POINT(4 4)
# 5  Inf <NA> 5 5    5 POINT(5 5)

This is what I get with 0e37e39 (I added dates, because I faced a problem today as well):

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3

df <- data.frame(
    a = c(0, 1, NA, -Inf, Inf),
    b = c("a", "b", NA, "c", ""),
    c = c(Sys.Date(), NA, -99, 0, 1),
    d = c(Sys.time(), NA, -99, 0, 1),
    x = 1:5,
    y = 1:5)

st_point2 <- function(x, y) st_sfc(purrr::map2(x, y, ~st_point(c(.x, .y))))

sf <- df %>% 
    mutate(geom = st_point2(x, y),
           `90%` = x) %>% 
    st_as_sf()

tf <- replicate(2, paste0(tempfile(), ".shp"))
st_write(sf, tf[1])
#> Writing layer `file14d077ae1843.shp' to data source `C:\Users\ETIENN~1.RAC\AppData\Local\Temp\RtmpQfYR0m\file14d077ae1843.shp' using driver `ESRI Shapefile'
#> Warning in CPL_write_ogr(obj, dsn, layer, driver,
#> as.character(dataset_options), : GDAL Message 6: Field d create as date
#> field, though DateTime requested.
#> features:       5
#> fields:         7
#> geometry type:  Point
#> Warning in CPL_write_ogr(obj, dsn, layer, driver,
#> as.character(dataset_options), : GDAL Error 6: Years < -32768 or > 32767
#> are not supported
#> Warning in CPL_write_ogr(obj, dsn, layer, driver,
#> as.character(dataset_options), : GDAL Error 6: Years < -32768 or > 32767
#> are not supported

z <- st_read(tf[1])
#> Reading layer `file14d077ae1843' from data source `C:\Users\etienne.racine\AppData\Local\Temp\RtmpQfYR0m\file14d077ae1843.shp' using driver `ESRI Shapefile'
#> Warning in CPL_read_ogr(dsn, layer, as.character(options), quiet,
#> iGeomField - : GDAL Message 1: Value '1.#SNAN0000000000' of field
#> file14d077ae1843.a parsed incompletely to real 1.
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
z
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>      a    b          c          d x y X90.   geometry
#> 1    0    a 2017-01-20 2017-01-20 1 1    1 POINT(1 1)
#> 2    1    b       <NA>       <NA> 2 2    2 POINT(2 2)
#> 3    1 <NA> 1969-09-24 1969-12-31 3 3    3 POINT(3 3)
#> 4 -Inf    c 1970-01-01 1969-12-31 4 4    4 POINT(4 4)
#> 5  Inf <NA> 1970-01-02 1969-12-31 5 5    5 POINT(5 5)

rgdal::writeOGR(as(sf, "Spatial"), dirname(tf[2]), gsub("\\.shp$", "", basename(tf[2])), "ESRI Shapefile", overwrite = TRUE)
y <- st_read(tf[2])
#> Reading layer `file14d04222d13' from data source `C:\Users\etienne.racine\AppData\Local\Temp\RtmpQfYR0m\file14d04222d13.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA

sf
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>      a    b          c                   d x y 90%       geom
#> 1    0    a 2017-01-20 2017-01-20 10:27:33 1 1   1 POINT(1 1)
#> 2    1    b       <NA>                <NA> 2 2   2 POINT(2 2)
#> 3   NA <NA> 1969-09-24 1969-12-31 18:58:21 3 3   3 POINT(3 3)
#> 4 -Inf    c 1970-01-01 1969-12-31 19:00:00 4 4   4 POINT(4 4)
#> 5  Inf      1970-01-02 1969-12-31 19:00:01 5 5   5 POINT(5 5)
z
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>      a    b          c          d x y X90.   geometry
#> 1    0    a 2017-01-20 2017-01-20 1 1    1 POINT(1 1)
#> 2    1    b       <NA>       <NA> 2 2    2 POINT(2 2)
#> 3    1 <NA> 1969-09-24 1969-12-31 3 3    3 POINT(3 3)
#> 4 -Inf    c 1970-01-01 1969-12-31 4 4    4 POINT(4 4)
#> 5  Inf <NA> 1970-01-02 1969-12-31 5 5    5 POINT(5 5)
y
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>      a    b          c                   d x y X90_   geometry
#> 1    0    a 2017-01-20 2017-01-20 10:27:33 1 1    1 POINT(1 1)
#> 2    1    b       <NA>                <NA> 2 2    2 POINT(2 2)
#> 3   NA <NA> 1969-09-24 1969-12-31 18:58:21 3 3    3 POINT(3 3)
#> 4 -Inf    c 1970-01-01 1969-12-31 19:00:00 4 4    4 POINT(4 4)
#> 5  Inf <NA> 1970-01-02 1969-12-31 19:00:01 5 5    5 POINT(5 5)

I believe it still writes a 1 rather than a NA or it just suppresses the warning. We lose the empty string (also happens with sp), and datetime was cast to date.

I can put these into a proper test, so it's clear if it passes (and doesn't get broken in the future).

Are you 100% sure this is https://github.com/edzer/sfr/commit/0e37e39b5f64228cced4ffb6130d72729952df85? I get

> z[[1]][3]
[1] NA

IIRC, shapefiles don't support datetime, please try it out with GPKG.

Indeed, sorry. I apparently didn't rebuild from the commit 馃憥 ! I get a a proper NA now.

GPKG are beatifully supported:

gpkg
#> Simple feature collection with 5 features and 7 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 5 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>      a    b          c                   d x y       geom
#> 1    0    a 2017-01-20 2017-01-20 12:02:23 1 1 POINT(1 1)
#> 2    1    b       <NA>                <NA> 2 2 POINT(2 2)
#> 3   NA <NA> 1969-09-24 1969-12-31 18:58:21 3 3 POINT(3 3)
#> 4 -Inf    c 1970-01-01 1969-12-31 19:00:00 4 4 POINT(4 4)
#> 5  Inf      1970-01-02 1969-12-31 19:00:01 5 5 POINT(5 5)

So these are driver specific. I find that sp's strategy to coerce datetime to character for shapefile makes sense, would it be possible to do reproduce that behavior?

In principle, but why not leave that to the user?

We should avoid at all times to write shapefiles with R. Use GPKG instead.

Maybe, it is possible to add some kind of message for users who try to write shapefile:
"Using ESRI Shapefile seriously harms you and others around you" (or something else) ;)

Good point about GPKG. I'm not familiar with it, and I should use it more. I'm not very fond of shapefiles. It's just that many applications still use shapefiles, which is my situation right now, so I happen to currently test this driver.
I also tried to read this very GPKG with ArcMap 10.3 and it failed, haven't investigated further though.

Maybe it would be worth a warning when casting a variable (for shapefiles, but there are probably others)? Is factor -> string considered casting ?

Was this page helpful?
0 / 5 - 0 ratings