Sf: st_read() breaks with column names that have weird encodings

Created on 18 Jun 2020  ·  6 Comments  ·  Source: r-spatial/sf

I tried to import the following shapefile using st_read: http://dados.prefeitura.sp.gov.br/dataset/b7242ef1-3add-4ce9-8e74-af7f9288762a/resource/84055ad9-49d6-46dc-af08-573ae1012d48/download/layerfavelas2015.zip.

It broke with the following error:

Error in make.names(vnames, unique = TRUE) : string multibyte inválida 2
Erros durante o embrulho: string multibyte inválida em '<b2>'

(Sorry for the non-English error message; in English it reads "invalid multibyte string 2", followed by "Error during wrapup: invalid multibyte string at '<b2>'".)
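
For illustration, the situation behind this message can be reproduced in base R without the shapefile. This is only a minimal sketch: the variable names are made up for the example, and treating the offending byte as ISO-8859-1 is an assumption based on the '<b2>' in the error (0xb2 is the superscript two in that encoding).

```r
# Assumed field name: "AREA_m" followed by the single byte 0xb2.
raw_name <- rawToChar(as.raw(c(0x41, 0x52, 0x45, 0x41, 0x5f, 0x6d, 0xb2)))

# The byte sequence is not valid UTF-8, which is what an
# "invalid multibyte string" error points to in a UTF-8 locale.
is_valid_before <- validUTF8(raw_name)

# Reinterpreting the bytes as ISO-8859-1 (an assumption) and converting
# to UTF-8 yields a valid string ending in the superscript-two character.
fixed <- iconv(raw_name, from = "ISO-8859-1", to = "UTF-8")
is_valid_after <- validUTF8(fixed)

print(c(before = is_valid_before, after = is_valid_after))
print(fixed)
```

If the converted name looks right, that suggests an encoding mismatch in the DBF header rather than a corrupt file.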

I assumed this had something to do with encoding, so I opened the shapefile in QGIS and renamed the column whose name contained the odd character sequence. After that, st_read() imported the file without problems.

I wonder if st_read() should have a locale option, though? Or maybe one already exists and I didn't realize?

All 6 comments

I'm a bit at a loss whether this actually should work: rgdal::readOGR breaks on the same error, as does foreign::read.dbf("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.dbf"). How do you set your locale? Can you read the dbf with read.dbf?

The DBF reads into LibreOffice Calc when opened as ISO-8859-1 (screenshot not included). Editing out the square-metres superscript in the field name resolves the problem. I'll look to see where something might be done in rgdal::ogrInfo(). Further, the .cpg file says System, which is wrong. With the original DBF:

> ogrInfo("DEINFO_FAVELAS_2015.shp")
Source: "/home/rsb/tmp/bigshape/LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.shp", layer: "DEINFO_FAVELAS_2015"
Driver: ESRI Shapefile; number of rows: 1677 
Feature type: wkbPolygon with 2 dimensions
Extent: (315618.7 7358833) - (360628 7411543)
CRS: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs 
LDID: 0 
Number of fields: 7 
        name type length  typeName
1         ID    4      5    String
2 AREA_m\xb2   12     10 Integer64
3       NOME    4    100    String
4  DATULTATZ    9     10      Date
5   ENDERECO    4    200    String
6    NOMESEC    4    150    String
7    PROPRTR    4    100    String
Warning message:
In OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS = dumpSRS,  :
  Discarded datum South_American_Datum_1969 in CRS definition: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs

I refer to https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf, and find on my system with iconv that:

library(rgdal)
getCPLConfigOption("SHAPE_ENCODING")
setCPLConfigOption("SHAPE_ENCODING", "CP1250")
o <- ogrInfo("DEINFO_FAVELAS_2015.shp")
o
oo <- readOGR("DEINFO_FAVELAS_2015.shp")
setCPLConfigOption("SHAPE_ENCODING", NULL)

works. I'm unsure whether the code page is correct, but at least it doesn't break.

Thanks! After

Sys.setenv(SHAPE_ENCODING = "CP1250")

I also get

> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
> read_sf("/tmp/LAYER_FAVELAS_2015/")
Simple feature collection with 1677 features and 7 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 315618.7 ymin: 7358833 xmax: 360628 ymax: 7411543
projected CRS:  SAD69 / UTM zone 23S
# A tibble: 1,677 x 8
   ID    AREA_m. NOME  DATULTATZ  ENDERECO NOMESEC PROPRTR
   <chr>   <dbl> <chr> <date>     <chr>    <chr>   <chr>  
 1 1       23687 Parq… 2011-07-04 Rua Dom… Pirapo… Munici…
 2 2         404 Vila… 2011-06-16 Rua Con… NA      Munici…
 3 3         486 Pedr… 2012-04-18 Avenida… NA      Munici…
 4 4         453 Tols… 2011-01-19 Avenida… NA      NA     
 5 5       20843 Vila… 2011-08-11 R Júlio… Justin… Munici…
 6 6        3416 Sant… 2011-08-11 Rua Dua… NA      NA     
 7 7        1455 Esme… 2012-02-10 Avenida… NA      Munici…
 8 8        1658 Vila… 2010-11-23 Av. Pro… Vila D… Munici…
 9 9        4155 Viel… 2011-06-16 Rua Ope… Viela … NA     
10 10       6129 Mauro 2011-01-19 Avenida… Whitak… Munici…
# … with 1,667 more rows, and 1 more variable: geometry <MULTIPOLYGON [m]>

Would it make sense for st_read() to take an encoding option, then? Taking values like those of SHAPE_ENCODING, that is.

No, because the example shows how to use an environment variable. In rgdal, the problem was resolved 12 years ago by using GDAL's internal CPL variables, but back then shapefiles needed to be read and written often. ESRI shapefiles are end-of-life, also for ESRI. So existing files should be read once and then written out in another format, preferably GeoPackage.
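
The read-once-then-convert advice could be sketched roughly as follows. This is a sketch under stated assumptions, not part of sf: shp_to_gpkg() is a hypothetical helper, the encoding must be chosen by the caller (CP1250 is only the value tried earlier in this thread), and it relies on GDAL picking up SHAPE_ENCODING from the environment, as the Sys.setenv() call above suggests.

```r
library(sf)

# Hypothetical helper: read a shapefile once with an explicit encoding,
# then write it out as GeoPackage so the encoding question does not recur.
shp_to_gpkg <- function(shp, gpkg, encoding) {
  # Remember the previous value so the session is left unchanged.
  old <- Sys.getenv("SHAPE_ENCODING", unset = NA)
  Sys.setenv(SHAPE_ENCODING = encoding)
  on.exit(
    if (is.na(old)) Sys.unsetenv("SHAPE_ENCODING") else Sys.setenv(SHAPE_ENCODING = old),
    add = TRUE
  )

  x <- st_read(shp, quiet = TRUE)   # read with the chosen encoding
  st_write(x, gpkg, quiet = TRUE)   # driver is guessed from the .gpkg extension
  invisible(x)
}
```

A call might look like shp_to_gpkg("DEINFO_FAVELAS_2015.shp", "favelas_2015.gpkg", encoding = "CP1250"), where the output file name is made up for the example. Restoring the environment variable on exit keeps the setting from leaking into later reads of other files.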
