When using st_write to write a shapefile, it always converts the encoding to ISO-8859-1. My session information is:
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sf_0.7-1 nvimcom_0.9-75 colorout_1.2-0
loaded via a namespace (and not attached):
[1] compiler_3.5.1 magrittr_1.5 class_7.3-14 DBI_1.0.0 tools_3.5.1 units_0.6-2
[7] Rcpp_1.0.0 grid_3.5.1 e1071_1.7-0 classInt_0.2-3 spData_0.2.9.6
Here is a reproducible example:
> library(sf)
Linking to GEOS 3.5.1, GDAL 2.1.2, PROJ 4.9.3
>
> nc <- st_read(system.file("shape/nc.shp", package="sf"))
Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
epsg (SRID): 4267
proj4string: +proj=longlat +datum=NAD27 +no_defs
> nc$test <- "èáéÃñå—" # this is added to create the warning
> st_write(nc, "nc.shp", delete_layer = TRUE)
Deleting layer `nc' using driver `ESRI Shapefile'
Writing layer `nc' to data source `nc.shp' using driver `ESRI Shapefile'
features: 100
fields: 15
geometry type: Multi Polygon
Warning message:
In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), :
GDAL Message 1: One or several characters couldn't be converted correctly from UTF-8 to ISO-8859-1.
This warning will not be emitted anymore.
Is there a way to force it to save with UTF-8 encoding? Thanks in advance.
st_write(nc, "nc.shp", layer_options = "ENCODING=UTF-8", delete_layer = TRUE)
I just found out that I can use the above code to solve the problem. However, I still do not understand why, by default, it tries to change the encoding to ISO-8859-1.
Maybe this is a shapefile property or a GDAL shapefile driver property; see https://www.gdal.org/drv_shapefile.html
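To see which characters in the test string trigger the warning, one can try the same UTF-8 to ISO-8859-1 conversion in base R. This is only a sketch: it uses `iconv()` from base R rather than GDAL's own recoding, and it assumes the platform's iconv returns `NA` for input it cannot convert (the documented behaviour when `sub = NA`). The string is written with `\u` escapes so the example does not depend on the encoding of the script file.

```r
# Same characters as the `test` column in the example: è á é Ã ñ å and an em dash
x <- "\u00e8\u00e1\u00e9\u00c3\u00f1\u00e5\u2014"

# Split into single characters and attempt the conversion one by one
chars <- strsplit(x, "")[[1]]
converted <- iconv(chars, from = "UTF-8", to = "ISO-8859-1")

# Characters with no ISO-8859-1 representation come back as NA
unconvertible <- chars[is.na(converted)]
print(unconvertible)
```

The accented letters all exist in ISO-8859-1; the em dash (U+2014) does not, so it is the character that cannot be stored under the default encoding.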
You may also consult the rgdal vignette: https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf. On Linux, your GDAL should have been built with iconv support, but you can check as shown in the vignette. Because DBF is unpredictable when handling multi-byte characters, many now prefer to use SQLite or, better, GPKG, both of which handle UTF-8 properly.
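A minimal sketch of the GPKG route, assuming the installed GDAL includes the GPKG driver (sf picks the driver from the `.gpkg` extension). The output path under `tempdir()` and the object name `back` are chosen only for illustration.

```r
library(sf)

# Read the sample data and add the same non-ASCII test column as above
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc$test <- "\u00e8\u00e1\u00e9\u00c3\u00f1\u00e5\u2014"

# Write to a GeoPackage instead of a shapefile; no encoding option is passed
out <- file.path(tempdir(), "nc.gpkg")
st_write(nc, out, delete_dsn = TRUE, quiet = TRUE)

# Read it back and compare the text column with what was written
back <- st_read(out, quiet = TRUE, stringsAsFactors = FALSE)
print(all(as.character(back$test) == nc$test))
```

If the comparison fails on a given setup, the locale and the GDAL build are the first things to check, as described in the vignette.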
I didn't want to bring this up but fully agree. Having to read shapefiles is something you can't always avoid, but writing them is something you can avoid, and should avoid, by all means.
I can see that it is due to GDAL properties. Thank you @edzer and @rsbivand for the links and the suggestion. I will try out the other alternatives.