When I write a shp using
st_write(shp1, "myshp.shp", driver="ESRI Shapefile")
If the column names in the attribute table are too long for ESRI, the output .dbf shortens these column names BUT ALSO deletes any data in those columns. I have tested this with integer and numeric data: columns write fine if column names are < 10 characters but end up blank if column names are > 10 characters.
The field name length restriction is a known feature of shapefiles, and dates way back (10 is generous, MS-DOS had 8 as maximum in file names). Migrate to GPKG and other more modern formats, or manually shorten and disambiguate field names before writing, for example using base::abbreviate().
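As a rough sketch of the "manually shorten and disambiguate" route, the helper below (`shorten_field_names()` is a hypothetical name, not part of sf or rgdal) abbreviates names to 10 characters with base::abbreviate(), swaps dots for underscores, and stops if the result is still too long or no longer unique. The example column names are made up.

```r
# Hypothetical helper, not part of sf or rgdal: shorten field names so they
# fit the 10-character limit discussed above, and stop rather than return
# over-long or ambiguous names.
shorten_field_names <- function(nms, max_len = 10L) {
  # abbreviate() leaves names that are already short enough unchanged
  short <- abbreviate(nms, minlength = max_len, named = FALSE)
  # replace dots by underscores
  short <- gsub(".", "_", short, fixed = TRUE)
  # abbreviate() may return longer strings to keep them unique, so re-check
  too_long <- nchar(short, type = "bytes") > max_len
  if (any(too_long) || anyDuplicated(short) > 0L) {
    stop("shortened field names are too long or not unique")
  }
  short
}

# Made-up column names for illustration
old <- c("id", "population_density_2015", "mean.annual.rainfall")
new <- shorten_field_names(old)
print(data.frame(old = old, new = new))
```

The shortened names would then be assigned to the attribute columns (not the geometry column) before calling st_write().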
Can we check and abbreviate column names before attempting to st_write using "ESRI Shapefile"? The driver will shorten them anyway, causing data loss. We could warn users that the field names have been abbreviated to comply with the ESRI driver's limitations.
sp uses base::abbreviate() to handle this issue automatically, and I see no reason why sf can't do the same thing. If not, there needs to be a clear warning on st_write that long column names = data loss when the output format is shp.
Not sp, rgdal::writeOGR(). Shapefiles were the only option then; maptools::writeSpatial() did this through foreign::write.dbf(), whose helpfile says:
Dots in column names are replaced by underlines in the DBF file, and names are truncated to 11 characters.
Why help people to (ab)use shapefiles when we want them to migrate?
Ah, so it is. Well, I'd love to migrate, but I'm stuck in a very ESRI-centric workplace with change-averse colleagues, so moving on is a bit of a fraught process. Without broader institutional support, I'm just That Coworker. The other issue for now is gpkg's slow disk write speed, which can be very inconvenient.
I unfortunately second @obrl-soil 's comment, in a non-academic setting. Although I could push gpkg through internally (as I already do for small data sets), the disk write speed is a big inconvenience for large data sets. That doesn't mean I am in favour of trimming names automatically, btw.
This is related to this thread? The foreign approach is the same as OGR/shapelib: truncate, risking non-unique field names. In rgdal::writeOGR(), base::abbreviate() is used and . is replaced by _ when the "ESRI Shapefile" driver is being used.
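To illustrate the difference between the two approaches, here is a minimal sketch with made-up field names. It uses substr() as a stand-in for plain truncation (not the actual OGR/shapelib code) and compares it with base::abbreviate().

```r
# Made-up field names: two of them share their first 10 characters
nms <- c("temperature_min", "temperature_max", "elevation")

# Plain truncation to 10 characters: both temperature fields collide
truncated <- substr(nms, 1, 10)
print(truncated)
print(anyDuplicated(truncated) > 0L)

# abbreviate() drops characters from within the name and, by default
# (strict = FALSE), lengthens abbreviations where needed to keep them unique
abbreviated <- abbreviate(nms, minlength = 10, named = FALSE)
print(abbreviated)
print(anyDuplicated(abbreviated) > 0L)
```

Because abbreviate() may lengthen a result to keep it unique, the output still needs a length check before writing.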
Does anyone know how encoding affects the length constraint - it is bytes, not characters, isn't it?
On UTF-8:
> nchar("Fjærland")
[1] 8
> length(charToRaw("Fjærland"))
[1] 9
> abbreviate("Fjærland")
Fjærland
"Fjær"
Warning message:
In abbreviate("Fjærland") : abbreviate used with non-ASCII chars
If a input element contains non-ASCII characters, the corresponding value will be in UTF-8 and marked as such (see ‘Encoding’).
@rsbivand yes, that is what I see too. I really need to find some time to investigate...
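If the limit does turn out to be counted in bytes (an assumption here, as the question above is still open), a check along these lines would flag names that look short enough by character count but are not. `exceeds_limit()` is a hypothetical helper, and the sketch assumes the names are written as UTF-8.

```r
# Hypothetical helper: TRUE where a name needs more than max_bytes bytes
# when encoded as UTF-8 (assumes the limit applies to bytes)
exceeds_limit <- function(nms, max_bytes = 10L) {
  nchar(enc2utf8(nms), type = "bytes") > max_bytes
}

# "\u00e6" is the letter ae-ligature, which takes two bytes in UTF-8
nms <- c("Fj\u00e6rland", "Fj\u00e6rlandet", "Fjaerlands")

print(data.frame(
  chars    = nchar(nms, type = "chars"),
  bytes    = nchar(enc2utf8(nms), type = "bytes"),
  too_long = exceeds_limit(nms)
))
```

The second name has 10 characters but 11 bytes, so a character-based check would let it through.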