Sf: st_write tries to convert encoding to ISO-8859-1

Created on 21 Dec 2018 · 5 Comments · Source: r-spatial/sf

When I use st_write to write a shapefile, it always tries to convert the encoding to ISO-8859-1. My session information is:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
 [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] sf_0.7-1       nvimcom_0.9-75 colorout_1.2-0

 loaded via a namespace (and not attached):
  [1] compiler_3.5.1 magrittr_1.5   class_7.3-14   DBI_1.0.0      tools_3.5.1    units_0.6-2
  [7] Rcpp_1.0.0     grid_3.5.1     e1071_1.7-0    classInt_0.2-3 spData_0.2.9.6

Here is a reproducible example:

> library(sf)
Linking to GEOS 3.5.1, GDAL 2.1.2, PROJ 4.9.3
>
> nc <- st_read(system.file("shape/nc.shp", package="sf"))
Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
epsg (SRID):    4267
proj4string:    +proj=longlat +datum=NAD27 +no_defs
> nc$test <- "èáéíñ字" # this is added to create the warning
> st_write(nc, "nc.shp", delete_layer = TRUE)
Deleting layer `nc' using driver `ESRI Shapefile'
Writing layer `nc' to data source `nc.shp' using driver `ESRI Shapefile'
features:       100
fields:         15
geometry type:  Multi Polygon
Warning message:
In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options),  :
  GDAL Message 1: One or several characters couldn't be converted correctly from UTF-8 to ISO-8859-1.
  This warning will not be emitted anymore.

Is there a way to force it to save with UTF-8 encoding? Thanks in advance.

All 5 comments

st_write(nc, "nc.shp", layer_options = "ENCODING=UTF-8", delete_layer = TRUE)

I just found out that the code above solves the problem. However, I still do not understand why it tries to convert the encoding to ISO-8859-1 by default.
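For anyone who wants to confirm that the `layer_options = "ENCODING=UTF-8"` call above really preserves the text, here is a minimal round-trip sketch. It assumes a GDAL build with iconv support. The temporary path and the names `txt`, `shp`, `back` and `roundtrip_ok` are only illustrative and not part of the original report.

```r
library(sf)

# Same test data as in the report; \u escapes keep the script itself ASCII-only.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
txt <- "\u00e8\u00e1\u00e9\u00ed\u00f1\u5b57"  # the string used in the report
nc$test <- txt

# Write to a fresh temporary path, so delete_layer is not needed here.
shp <- tempfile(fileext = ".shp")
st_write(nc, shp, layer_options = "ENCODING=UTF-8", quiet = TRUE)

# Read the file back and compare the attribute with what was written.
back <- st_read(shp, quiet = TRUE, stringsAsFactors = FALSE)
roundtrip_ok <- all(back$test == txt)
print(roundtrip_ok)
```

If the comparison fails on your system, the check for iconv support described in the rgdal vignette is probably the first thing to look at.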

Maybe this is a property of the shapefile format or of the GDAL shapefile driver; see https://www.gdal.org/drv_shapefile.html

You may also consult the rgdal vignette: https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf. On Linux, your GDAL should have been built with iconv support, but you can check as shown in the vignette. Because DBF is unpredictable when handling multi-byte characters, many now prefer to use SQLite or, better, GPKG, both of which handle UTF-8 properly.
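As a rough illustration of the GPKG suggestion, the sketch below writes the same data to a GeoPackage without any encoding option and reads it back. It assumes your GDAL build includes the GPKG driver. The temporary file and the names `txt`, `gpkg`, `back` and `gpkg_ok` are made up for this example.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
txt <- "\u00e8\u00e1\u00e9\u00ed\u00f1\u5b57"  # the string used in the report
nc$test <- txt

# The driver is guessed from the .gpkg extension; no ENCODING option is passed.
gpkg <- tempfile(fileext = ".gpkg")
st_write(nc, gpkg, quiet = TRUE)

# Read the layer back and check that the attribute text is unchanged.
back <- st_read(gpkg, quiet = TRUE, stringsAsFactors = FALSE)
gpkg_ok <- all(back$test == txt)
print(gpkg_ok)
```

Note that the geometry column may come back under a different name (for example `geom`) than in the original object, which does not affect the attribute check.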

I didn't want to bring this up but fully agree. Having to read shapefiles is something you can't always avoid, but writing them is something you can avoid, and should avoid, by all means.

I can see that it is due to GDAL properties. Thank you @edzer and @rsbivand for the links and the suggestion. I will try out the other alternatives.
