Cartodb: Invalid geometries after export/import

Created on 1 Dec 2016  路  53Comments  路  Source: CartoDB/cartodb

Context

The dataset with all geometries as valid generates, a different number of, invalid geometries when importing it depending on the export format.

Steps to Reproduce

  1. Go to https://team.carto.com/u/saleiva/tables/chicago_census_tracts/public?redirected=true
  2. Export:
  3. Import all of them.
  4. Run the queries:
  • SELECT count(*) FROM chicago_census_tracts_csv where st_isvalidreason(the_geom) != 'Valid Geometry'
  • SELECT count(*) FROM chicago_census_tracts_shp where st_isvalidreason(the_geom) != 'Valid Geometry'
  • SELECT count(*) FROM chicago_census_tracts_gpkg where st_isvalidreason(the_geom) != 'Valid Geometry'

Current Result

Tables count for invalid geometries:

  • chicago_census_tracts_csv: 0
  • chicago_census_tracts_shp: 3
  • chicago_census_tracts_gpkg: 36

Expected result

I would expect to have all geometries as valid, independently of the export format.

Additional info

Using QGIS Geometry Checker Plugin I wasn't able to detect invalid geometries from step 2 files. However, I was able to detect invalid geometries after exporting again the imported gpkg dataset at step 3.

cc @javisantana

bug

All 53 comments

Just curious and to let me manage priorities: who added this to Data services project?

Me.

This should be a priority since it's making some of the .CARTO we create for the use cases to not import correctly.

Hey, If the issue is in GDAL ping me ASAP so I can talk with Even

sure

thanks guys

I reproduced the issue but got different numbers:

Current Result

Tables count for invalid geometries:

  • chicago_census_tracts_csv: 0
  • chicago_census_tracts_shp: 1
  • chicago_census_tracts_gpkg: 5

Wondering what could have changed since then or if there's some source of variance in the process.

No modifications to the original table since the issue was reported:

 # SELECT * from cartodb.cdb_tablemetadata where tabname = 'saleiva.chicago_census_tracts'::regclass;
            tabname            |          updated_at           
-------------------------------+-------------------------------
 saleiva.chicago_census_tracts | 2016-12-01 18:25:49.107569+00
(1 row)

and I checked again and got the same numbers

Invalid reasons

  • chicago_census_tracts_gpkg:
Hole lies outside shell[-87.5999603271484 41.6572640635485]
Self-intersection[-87.6329271739131 41.932908826087]
Self-intersection[-87.6247062332533 41.9068444417047]
Self-intersection[-87.614171 41.888397]
Self-intersection[-87.575521 41.795392]
  • chicago_census_tracts_shp:
Self-intersection[-87.632939 41.932914]

It looks like a precision problem to me.

Recursion test

I got the same numbers in local. Then I thought of running it recursively (feeding back the outputs to the inputs) and got the numbers again:

  • CSV: http://cdb.localhost.lan:8080/api/v2/sql?filename=chicago_census_tracts_csv_csv&q=SELECT+*+FROM+public.chicago_census_tracts_csv&skipfields=the_geom_webmercator&format=csv
  • SHP: http://cdb.localhost.lan:8080/api/v2/sql?filename=chicago_census_tracts_shp_shp&q=SELECT+*+FROM+public.chicago_census_tracts_shp&skipfields=the_geom_webmercator&format=shp
  • GPKG: http://cdb.localhost.lan:8080/api/v2/sql?filename=chicago_census_tracts_gpkg_gpkg&q=SELECT+*+FROM+public.chicago_census_tracts_gpkg&skipfields=the_geom_webmercator&format=gpkg

Tables count for invalid geometries:

  • chicago_census_tracts_csv_csv: 0
  • chicago_census_tracts_shp_shp: 2
  • chicago_census_tracts_gpkg_gpkg: 7

Same invalid reasons (holes and self-intersections). IMO that reinforces the hypothesis of a precision issue.

@rochoa in production and following the STR I only got 500 rows imported for SHP and GPKG formats, though the original file has 1031 rows. Do you know of any reason for that behavior?

Regarding the 500 rows apparent limit, it seems to be a different issue related to permissions when exporting public tables without api_key.

From the DB logs:

ERROR:  permission denied for relation geometry_columns
STATEMENT:  SELECT type, coord_dimension, srid FROM geometry_columns WHERE f_table_name = 'chicago_census_tracts' AND f_geometry_column='the_geom' AND f_table_schema = 'saleiva'
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  SELECT pg_class.relname FROM pg_class WHERE oid = (SELECT pg_inherits.inhparent FROM pg_inherits WHERE inhrelid = (SELECT c.oid FROM pg_class c, pg_namespace n WHERE c.relname = 'chicago_census_tracts' AND c.relnamespace=n.oid AND n.nspname = 'saleiva'))

...

ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  FETCH 500 in OGRPGLayerReader0x2549e90
ERROR:  current transaction is aborted, commands ignored until end of transaction block
STATEMENT:  CLOSE OGRPGLayerReader0x2549e90

and it has been introduced sometime between the 1st of December and today.

I'd consider it pretty important if that happens for every single export. Could be related to the onpremise release?

Exporting issues

Let's keep the api_key permissions discussion apart in its own ticket, as these are two separate issues.

Exporting the SHP

Exporting the SHP file in local, with no api_key:

http://cdb.localhost.lan:8080/api/v2/sql?filename=chicago_census_tracts_shp&q=SELECT+*+FROM+public.chicago_census_tracts&skipfields=the_geom_webmercator&format=shp

Using the Check validity feature of qgis 2.18, with the QGIS method:

| ogr2ogr binary | rows exported | invalid geoms (qgis) |
| -------------: | ------------: | -------------------: |
| ogr2ogr | 1031 | 1 |
| ogr2ogr2 | 1031 | 1 |
| ogr2ogr2.1 | 1031 | 1 |

Exporting the GPKG

Exporting the GPKG file in local, with no api_key:

http://cdb.localhost.lan:8080/api/v2/sql?filename=chicago_census_tracts_gpkg&q=SELECT+*+FROM+public.chicago_census_tracts&skipfields=the_geom_webmercator&format=gpkg

I couldn't import the resulting gpkg back in qgis, so measuring invalid geoms as per st_isvalid(the_geom) Update: Added the invalid geoms as per qgis. Note: qgis uses ogr2ogr to import geopackages among other formats (used gdal-bin 2.1.0+dfsg-1~trusty2 from ubuntugis):

SELECT count(*) FROM chicago_census_tracts_gpkg where not st_isvalid(the_geom);

| ogr2ogr binary (export) | rows exported | invalid geoms (qgis) | invalid geoms (postgis) |
| -------------: | ------------: | -------------------: | ----: |
| ogr2ogr | 1031 | 1 | 7 |
| ogr2ogr2 | 1031 | 1 | 7 |
| ogr2ogr2.1 | 1031 | 1 | 7 |

The problem reported by qgis: segments 167 and 173 of line 0 intersect at -87.658697, 41.909483

Actually qgis complains about the same error on the original table imported directly from postgres: segments 167 and 173 of line 0 intersect at -87.658697, 41.909483, though all rows are OK as per postgis.

So, instead of using qgis (which underneath uses ogr2ogr as well) let's focus on a possible precision loss and cumulative errors in export/import process and how they impact feature validity from postgis point of view.

but wait, when exporting and importing geopackage there should NOT be any precision loss. It uses WKB to store geometries so there should be something.

I don't actually mind about SHP (but anyway it should be fine as well since geometries use 8 bytes/coord as well) but with geopackage we have a important problem since .carto use it.

In conclusion, there should not be any transformation process when exporting geometry

(As suggested by Santana) Let's pick one invalid feature:

SELECT cartodb_id, st_isvalidreason(the_geom) FROM chicago_census_tracts_gpkg where st_isvalidreason(the_geom) != 'Valid Geometry' LIMIT 1;
 cartodb_id |                   st_isvalidreason                    
------------+-------------------------------------------------------
        967 | Self-intersection[-87.6783312117263 41.8422366970676]
(1 row)

and compare the original WKB with the BLOB stored in the sqlite file:

$ sqlite3 chicago_census_tracts_gpkg.gpkg
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .mode csv
sqlite> .output gpkg_blob_out.csv
sqlite> select hex(geom) from chicago_census_tracts_gpkg  where cartodb_id = 967;
sqlite> .output stdout

image

The binary representations of the original and the exported blob look very similar, except for the first few bytes (it's a 9 KB multi-polygon feature).

I tried this as well in sqlite:

SELECT Hex(ST_AsBinary(geom)) from chicago_census_tracts_gpkg where cartodb_id = 967;

but couldn't get any output. See https://www.gaia-gis.it/gaia-sins/spatialite-cookbook/html/wkt-wkb.html

After importing the GPKG and inspecting the WKB directly in the database, the diff shows subtle differences in "random" bytes, hopefully some LSB's in coordinates:

image

So at least the issue with GPKG (which matter most because of .carto format) is not exporting but importing GPKG files.

The next question is whether this bug happens before or after ogr2ogr command execution. This is the command line from my dev machine:

OSM_USE_CUSTOM_INDEXING=NO PG_USE_COPY=YES PGCLIENTENCODING=UTF-8  /usr/bin/ogr2ogr2.1 \
  -f PostgreSQL  PG:"host=localhost port=5432 user=development_cartodb_user_95dc74f3-84b8-4738-a16e-92bab6a9a9af dbname=cartodb_dev_user_95dc74f3-84b8-4738-a16e-92bab6a9a9af_db password=f351585ea8623d35033ea5684c462e724051626adevelopment_cartodb_user_95dc74f3-84b8-4738-a16e-92bab6a9a9af" \
  -t_srs EPSG:4326  -lco DIM=2 -lco PRECISION=NO /tmp/imports/20170105-17614-5uhokg/chicago_census_tracts_gpkg.gpkg  \
  -nln cdb_importer.importer_79ce31b0d36511e69bd9080027880ca6 -nlt PROMOTE_TO_MULTI  \
  -doo PRELUDE_STATEMENTS="SET statement_timeout TO '1h'" -doo CLOSING_STATEMENTS='SET statement_timeout TO DEFAULT' -update

(new lines added for readability)

Admittedly the option -lco PRECISION=NO stinks.

Right after executing it and checking the resulting table:

select count(*) from cdb_importer.importer_79ce31b0d36511e69bd9080027880ca6 where not st_isvalid(wkb_geometry);
NOTICE:  Self-intersection at or near point -87.67833121172626 41.842236697067641
NOTICE:  Self-intersection at or near point -87.658417636367616 41.909339727274777
NOTICE:  Self-intersection at or near point -87.632927173913075 41.932908826086972
NOTICE:  Self-intersection at or near point -87.624706233253292 41.906844441704735
NOTICE:  Self-intersection at or near point -87.614170999999985 41.888396999999991
NOTICE:  Self-intersection at or near point -87.575520999999981 41.795391999999993
NOTICE:  Self-intersection at or near point -87.704673999999983 41.83199399999998
 count 
-------
     7
(1 row)

So at least we can discard cartodby'ation and further processing as the source FTM.

With -lco PRECISION=YES I got the same result :disappointed:

The PRECISION option is targeted at postgres output (lco stands for "layer creation option") and that option has been there for ages (w/o affecting CSV format for instance). See http://www.gdal.org/drv_pg.html

I made a small test with one of the "offending" features:

original.zip

# convert csv -> gpkg -> csv
ogr2ogr -f GPKG original.gpkg original.csv
ogr2ogr -f CSV converted.csv original.gpkg

# compare them, ignoring CR new lines
$ diff --strip-trailing-cr original.csv converted.csv  && echo "OK"
OK

It doesn't look like the issue is in the gdal gpkg driver, but a combination with postgis driver.

This seems to be a datum conversion issue; the geometry in the geopackage (or shapefile) is not correctly interpreted as EPSG:4326.

Omiting the -t_srs EPSG:4326 from the ogr2ogr import avoids the problem, but when it is present, the coordinates are slightly altered.

On second thought, It could be not an SRS misinterpretation,but that the presence of -t_srs introduces a transformation step (in this case applying an identity transformation if SRIDs are right) which introduces some roundoff errors.

seems like that's also an optimization in gdal, right? we might send a patch for this...

The underlying problem we have here is that in some cases (involving conversion to PostGIS with ogr2ogr) in which the WKB form of geometry should be preserved exactly (because no coordinate transformation is involved and the data source preserves the their full precision) the coordinates are altered in their least significant digits.

This in turn can cause the invalid geometry result observed for complex-enough polygons, since tiny displacements derived from the changed coordinates can change the topology of lines.

Here's how to reproduce the coordinate-changing problem using the 2.1 version of ogr2ogr used at CARTO.

We'll use this point to reveal the problem:

  • WKB: (hex) 0101000020E610000007A359D93EEB55C05812A0A696EB4440
  • Text: POINT(-87.675711 41.840535)

We'll be using a pnt.csv file with that point

id,geom
1,0101000020E610000007A359D93EEB55C05812A0A696EB4440

And a PostGIS table:

CREATE TABLE pnt(cartodb_id integer, the_geom geometry);
INSERT INTO pnt(cartodb_id, the_geom)
  VALUES (1, '0101000020E610000007A359D93EEB55C05812A0A696EB4440'::geometry);

We can export this data to a GPKG using CARTO (the SQL API):

curl "http://cdb.localhost.lan:8080/api/v2/sql?filename=pntexp&q=SELECT+*+FROM+pnt&skipfields=the_geom_webmercator&format=gpkg" > pnt.gpkg

or directly with ogr2ogr obtaining an equivalent result:

ogr2ogr -f GPKG /pnt.gpkg pnt.csv -oo GEOM_POSSIBLE_NAMES=wkb_geom -a_srs EPSG:4326

Now we have a GPKG that preserves the WKB geometry exactly:

$ sqlite3 pntexp.gpkg
sqlite> select * from gpkg_contents;
pntexp|features|pntexp||2017-01-10T11:26:53.000Z|-87.6757|41.8405|-87.6757|41.8405|4326
sqlite> select * from pntexp;
1|GP|1
sqlite> select hex(geom) from pntexp;
47500001E6100000010100000007A359D93EEB55C05812A0A696EB4440
sqlite> .quit

Now we convert the GPKG back to PostGIS with ogr2ogr using the -t_srs
option to transform the geometry to WGS84 (4326). Since has already that
reference as shown in the GPKG metadata above, the transformation should be
the identity.

ogr2ogr \
 -f PostgreSQL  PG:"host=localhost port=5432 user=development_cartodb_user_bd5106b3-3463-4182-a819-d8a53d6f95ca dbname=cartodb_dev_user_bd5106b3-3463-4182-a819-d8a53d6f95ca_db password=c21cde21e4abf48bb6ec3b0505bc8bca922cf476development_cartodb_user_bd5106b3-3463-4182-a819-d8a53d6f95ca" \
 -t_srs EPSG:4326  \
 pntexp.gpkg \
 -nln cdb_importer.pnt_imp

But the geometry has been (slightly) altered:

select * from cdb_importer.pnt_imp;
 fid | cartodb_id |                    wkb_geometry
-----+------------+----------------------------------------------------
   1 |          1 | 0101000020E610000006A359D93EEB55C05812A0A696EB4440
(1 row)

cartodb_dev_user_bd5106b3-3463-4182-a819-d8a53d6f95ca_db=> select st_astext(wkb_geometry) from cdb_importer.pnt_imp;
             st_astext
------------------------------------
 POINT(-87.6757109999999 41.840535)
(1 row)

What has happened here is that the IEEE 64 floating point number (X coordinate) 07A359D93EEB55C0 (-0x1.5EB3ED959A307p6) has changed by 1 unit in the least significant digit of the significand
into 06A359D93EEB55C0 (-0x1.5EB3ED959A306p6).

If we omit the -t_srs (or if we replace it by -a_srs which assigns the SRS without transformation)
this problem doesn't happen:

ogr2ogr \
 -f PostgreSQL  PG:"host=localhost port=5432 user=development_cartodb_user_bd5106b3-3463-4182-a819-d8a53d6f95ca dbname=cartodb_dev_user_bd5106b3-3463-4182-a819-d8a53d6f95ca_db password=c21cde21e4abf48bb6ec3b0505bc8bca922cf476development_cartodb_user_bd5106b3-3463-4182-a819-d8a53d6f95ca" \
 pntexp.gpkg \
 -nln cdb_importer.pnt_imp2
cartodb_dev_user_bd5106b3-3463-4182-a819-d8a53d6f95ca_db=> select * from cdb_importer.pnt_imp2;
 fid | cartodb_id |                    wkb_geometry
-----+------------+----------------------------------------------------
   1 |          1 | 0101000020E610000007A359D93EEB55C05812A0A696EB4440
(1 row)

cartodb_dev_user_bd5106b3-3463-4182-a819-d8a53d6f95ca_db=> select st_astext(wkb_geometry) from cdb_importer.pnt_imp2;
          st_astext
-----------------------------
 POINT(-87.675711 41.840535)
(1 row)

Converting the same GPKG to CSV instead of PostGIS with ogr2ogr
is free of the problem as well:

ogr2ogr -f CSV pnt2.csv pntexp.gpkg -t_srs EPSG:4326 -lco GEOMETRY=AS_WKT
cat pnt2.csv
WKT,cartodb_id
"POINT (-87.675711 41.840535)",1

Note that changes were introduced in OGR 1.11.0 to treat identity transformations: https://trac.osgeo.org/gdal/changeset/26274
(in principle it seems that this should have solve this problem)

re. https://trac.osgeo.org/gdal/changeset/26274 these lines look a bit fragile to me:

    755     /* Determine if we really have a transformation to do */
    756     bIdentityTransform = (strcmp(pszSrcProj4Defn, pszDstProj4Defn) == 0);

I'd compile it with debug traces enabled and/or run it through a debugger.

That text comparison seems really fragile, but it is not the problem in our case (the definitions are identical, maybe both have been extracted from the PROJ4 epsg file, because I don't think PostGIS spatial_ref_sys or the GPKG gpkg_spatial_ref_sys is being used at this point).

In our case the transformation is correctly identified as an Identity, but the coordinates have been converted to radians, and the error is introduced when converting back to degrees:

Debugging in ogrct.cpp from GDAL 2.1.1 we have in our problematic conversion:

(gdb) s
1162                    x[i] *= dfTargetFromRadians;
(gdb) p x[i]
$5 = -1.5302298309770097
(gdb) s
1163                    y[i] *= dfTargetFromRadians;
(gdb) p x[i]
$6 = -87.67571099999995

Tomorrow I'll take a look at why are coordinates being converted to radians here.

I've been checking the radias/degrees conversions:

First, the deg->rad->deg conversions could been done preserving the original value (at least in our test case) if using the DEG_TO_RAD constant, but a different value is used here which corresponds to the less precise literal in the projection definition 0.0174532925199433

These values (with less precision than IEEE 64 bit floating point numbers) are used in a lot of projection definition files, so we don't have control over it.

What can be done is to try to detect deg->rad->deg round-trip conversions (allowing some tolerance in the conversion factors) and avoid them. And this precisely the intention of the deactivated and commented-out code here (which has been completely removed from trunk)

I've checked that code and it fixes our problem. I'll try to find out why was it dismissed.

EDITED: (clarification) the fact that the factor used to convert degrees to radians has less precision than desired is not the culprit here as has been implied above, since it is compensated for here. It just happens that for that particular floating point factor value, multiplication by it, then by its inverse losses 1 ulp of precision in our test case, whereas for full precision value (DEG_TO_RAD) doesn't. So, the problem is really that any round trip conversion of units can potentially alter the coordinates and, as stated above, we should avoid it when possible to preserve exact values.

For the record, here's a snippet showing how I got to debug our test case:

First, compile GDAL with debug symbols and open the resulting ogr2ogr with gdb:

wget http://download.osgeo.org/gdal/2.1.1/gdal-2.1.1.tar.gz
tar -xzf gdal-2.1.1.tar.gz
cd gdal-2.1.1
./configure --without-ld-shared --disable-shared --enable-static --with-ogdi=no --with-spatialite=yes --enable-debug
make
export GDAL_DATA=$(pwd)/data
gdb apps/ogr2ogr

Now we can set breakpoints as needed and execute the conversion of our test case:

b ogrct.cpp:859
...
r -f PostgreSQL PG:"host=localhost port=5432 user={DBUSER} dbname={DBNAME} password={PASSWORD}"  -t_srs EPSG:4326  pntexp.gpkg  -nln pnt_imp_dbg
...

Hey @rouault, could you shed some light on why this code hasn't been used? It would avoid the loss of precision due to the use of the degree definition common in WKT files when source and target are in the same long/lat units.

It seems that code originated in this ticket: https://trac.osgeo.org/gdal/ticket/5188

EDIT: the loss of precision we're worried about (1 ulp) can be caused with any round-trip units conversions, and the precision of the units definition factor is irrelevant.

@jgoizueta: I'm not sure why the code has been commented. Probably because it looked like a hack to set bSourceLatLong = FALSE; and bTargetLatLong = FALSE. A potential fix would be to add a bNoOperation = TRUE flag ( bIdentifyTransform only applies to proj.4 related conversions, not angular unit conversions )

Thanks for the info and your suggestions @rouault ! I'll take a look at adding a new flag to skip conversions as you propose.

A note on why we haven't had this problem with CSV files:

With the options passed to ogr2ogr to import files into PostGIS , ogr2ogr does not regard the geometry columns as geometry, and they're imported verbatim as text (hex WKB representation of the geometry). Then the import process (cartodbfication etc.) detects the columns as geometric and converts it, but no coordinate transformation is involved.

This works nicely for files exported from CARTO since the geometry is always EPSG:4326 and they are transparently converted back when imported.

If we use ogr2ogr options to make it handle the geometry the same problem occurs as for gpkg:

ogr2ogr -f PostgreSQL PG:"host=localhost port=5432 user=USER dbname=DBNAME password=PASSwoRD"  -t_srs EPSG:4326  pnt.csv  -nln cdb_importer.pnt_imp_csv -oo GEOM_POSSIBLE_NAMES=the_geom -t_srs EPSG:4326 -s_srs EPSG:4326

Here we use GEOM_POSSIBLE_NAMES to specify the name of the geometry column, -s_srs to define its reference system and -t_srs to transform it when imported.

After looking carefully into the code I agree the existing (disabled) patch is rather hackish and a equivalent but cleaner changer is in order.

It would consist simply of adding a flag bNoTransformation to class OGRProj4CT to skip transformation completely, assigned in OGRProj4CT::InitializeNoLock as:

bNoTransformation = bIdentityTransform && bSourceLatLong && !bSourceWrap &&
                    bTargetLatLong && !bTargetWrap &&
                    fabs(dfSourceToRadians * dfTargetFromRadians - 1.0) < EPSILON;

And then used it in OGRProj4CT::TransformEx to avoid transforming the coordinates.

Now, since we're considering changing ogr2ogr version I think we could create a patch for this too, include it in our next version build and submit it to upstream (so we can remove it from our build if/when accepted).

Here's a patch to fix this problem: https://github.com/OSGeo/gdal/pull/184

The fix has been committed by rouault (thanks!) in trunk r37138 and branches/2.1 r37139, with a fix in the test when running on 32bit Linux ( hex() adds a trailing 'L' in that case).

Proposed acceptance test:

(In a local or staging environment)

  1. Load the CSV data into an account, check 1031 rows are imported and none are invalid

  2. build ogr2ogr from branch 2.1, install it in the test environment

  3. Export the data to CSV, SHP & Geopackage

  4. Reimport each format, check it to have 1031 rows and no invalid geometries

@ethervoid in order to coordinate, acceptance test the 2 bugfixes and prepare the release: what commit/revision are you testing against?

At the time of writing this, it is https://github.com/OSGeo/gdal/commit/3c42908b43090c367450f641e916c23ed643b105 / r37152

@rafatower I've packaged this one: r37152 | rouault | 2017-01-15 14:07:29 +0000 (Sun, 15 Jan 2017) | 1 line

It looks like we still have a problem with the current patch:

Using the production version of GDAL 2.1, I've generated three new files (CSV, GPKG and SHP) using this file as base:

PGCLIENTENCODING=UTF-8 ogr2ogr2.1 -f CSV chicago_census.csv chicago_census_tracts_csv.csv -oo GEOM_POSSIBLE_NAMES=the_geom -s_srs EPSG:4326 -t_srs EPSG:4326
PGCLIENTENCODING=UTF-8 ogr2ogr2.1 -f GPKG chicago_census.GPKG chicago_census_tracts_csv.csv -oo GEOM_POSSIBLE_NAMES=the_geom -s_srs EPSG:4326 -t_srs EPSG:4326
PGCLIENTENCODING=UTF-8 ogr2ogr2.1 -f "ESRI Shapefile" chicago_census.shp chicago_census_tracts_csv.csv -oo GEOM_POSSIBLE_NAMES=the_geom -s_srs EPSG:4326 -t_srs EPSG:4326

The results after import them into CARTO and check for invalid geometries are the following:

chicago_census_tracts_csv: 0
chicago_census_tracts_shp: 8
chicago_census_tracts_gpkg: 10

I've packaged the GDAL in the last revision of the branch 2.1 and installed it. After generating the same three files with it, the results are:

chicago_census_tracts_csv: 0
chicago_census_tracts_shp: 3
chicago_census_tracts_gpkg: 0

So we still have a problem with the reprojection part even with the patch. After talk with @jgoizueta he has found the problem: the condition

irb(main):004:0> a = 0.0174532925199433
=> 0.0174532925199433
irb(main):005:0> b = 1/0.017453292519943295
=> 57.29577951308232
irb(main):006:0> (a*b - 1.0) < Float::EPSILON
=> false
irb(main):007:0> (a*b - 1.0) == Float::EPSILON
=> true

We have input units defined by some factor A and output units defined by other factor B.

Using DBL_EPSILON as the tolerance to consider if A/B = 1 (i.e. if A and B define the same units) I was giving enough tolerance for the floating point arithmetic involved if A and B where identical, but not for the actual case we have found where A and B are defined with varying precision: degrees are defined as 0.0174532925199433 (radians per degree) in PROJ4 configuration files (epsg) and as 0.017453292519943295 in a typical shapefile's .prj.

I should have accounted for that and use a larger tolerance 馃槩

Now, for this particular case it just happens that A/B -1 == DBL_EPSILON so if a <= had been used instead of a < it would have worked! 馃槅 (but that's just a coincidence)

I think we can work out a better value for this tolerance by considering the minimum number of correct significant number that we should require the units definitions to have.

I think we can look at that and make another change to GDAL to increase the tolerance. Otherwise, with the current fix we have GPPK (.carto) files working transparently but shapefiles with slight discrepancies (all depending on how are units defined for the particular format/file)

In the found case both definitions have at least 14 correct significant digits, so 10^-13 would be an appropriate tolerance for such cases. If we require N correct digits, we should use 10^(-N+1).

By looking at the GDAL code and configuration files, the pi/180 constant (radians per degree) is found defined as:

  • .0174532925199432958 default value used in code; correct to 18 rounded digits; enough to be casted to a double which is the best possible approximation to pi/180.
  • .017453292519943295 value used in code and in esri_extra.wkt; correct to 16 rounded digits; enough to be casted to a double which is the best possible approximation to pi/180. This is the value that is used int .prj files in EPSG:4326 shapefiles generated with ogr2ogr.
  • 0.0174532925199433 most common value found in configuration files, including GPKG's gpkg_spatial_ref_sys and PostGIS spatial_ref_sys, and assigned internally to WGS84, correct to 15 rounded digits. This is used when assigning EPSG:4326 to a CSV with ogr2ogr, when generating GPKG, etc.
  • 0.0174532925199432955 used in some configuration files (esri_StatePlane_extra), correct to 16 rounded digits.
  • 0.01745329251994328 used in bsbdataset.cpp, correct to 15 rounded digits.

Note: for reference, 180/pi = 0.0174532925199432957692369076848861271344287188854172545609719144017100911460344944368224156963450948221230449250...

So, when input and output is in degrees, we can in general expect to have it defined by slightly different values with at least 15 correct digits.

As a result, I'd use a tolerance not smaller than 10^-14 in the check for same input/output units. In fact I see no problem having 10^-13 or even something larger to cover even less accurate definitions that could surface. This is only used only when the input and output is long/lat with equivalent datums, so I see no risk of regarding different units as the same.

I've found this in the code where the angular units are checked to be degrees:

The discrepancy in the degrees definition we have found is because when saving shapefiles, the degrees definition used for EPSG:4326 is converted to a more precise value to be saved in the .prj file:

/* -------------------------------------------------------------------- */
/*      reset constants for decimal degrees to the exact string ESRI    */
/*      expects when encountered to ensure a matchup.                   */
/* -------------------------------------------------------------------- */
    OGR_SRSNode *poUnit = GetAttrNode( "GEOGCS|UNIT" );

    if( poUnit != NULL && poUnit->GetChildCount() >= 2
        && std::abs(GetAngularUnits()-0.0174532925199433) < 0.00000000001 )
    {
        poUnit->GetChild(0)->SetValue("Degree");
        poUnit->GetChild(1)->SetValue("0.017453292519943295");
    }

Note how the absolute tolerance for units equivalence used here is 1E-11. Taking into account this is for a magnitude of the order of 1E-2 (0.017...) that would be equivalent to using 1E-10 1E-9 for our relative tolerance (fabs(dfSourceToRadians * dfTargetFromRadians - 1.0) < 1E-9)

Now we have two possible criteria to choose the new tolerance:

To admit all observed definitions of degrees could expect to have 14 correct digits in the definitions and use a value of 10E-13. This is the minimum value we should use to allow all those cases.

To be consistent with the ESRI definitions and allow for any variation introduced in that case we should use a value of 10E-9 (equivalent require no more than 10 correct digits).

IMHO I don't think we really want to do "conversions" from "ESRI degrees" to "whatever degrees". All degrees should be dealt as equivalent. The numerical differences in WKT definitions are a matter of history, but not really significant (ie we don't want 45 deg to be converted to 45.0000000001 deg when converting from a shapefile to a geopackage). So a rather relaxed test to ensure that the source and target units are the "same" should be OK.

I'm wondering if the ogrct code shouldn't even force the use of .0174532925199432958 when the unit is close-enough to degree.

My decision is to use 10-E9. The greater tolerance shouldn't be a problem (what can an angular unit that fits 359.999999 of them in a circle be other than degrees?)

Hi rouault glad to have your opinion here. I agree with you with having a relaxed acceptance of degrees.

I also think it's a good idea adjusting the conversion factors when close enough to degrees. It would make for more accurate and consistent coordinate conversions. I'll leave it out of this ticket, though. Here our purpose is to have exact round-trip conversions when possible and we'll achieve it with the tolerance change.

Testing again against gdal2.1-static-bin (2.1.2+svn.37167-precise1) in staging, counting invalid geometries:

SELECT count(*) FROM table where st_isvalid(the_geom) is false;

Results:

chicago_census_tracts_csv: 0
chicago_census_tracts_shp: 0
chicago_census_tracts_gpkg: 0

vamos!

I ran the following regression tests as well:

  • Import a non-trivial .carto file: OK
  • Duplicate a dataset (and add new data to it): OK
  • Create a table from a SQL query: OK
  • Create a sync table: OK
  • Force sync: OK
  • raster import: OK
  • Create a table with ogr2ogr 2.1.2+svn.37167-precise1: OK

Ready for production.

Deployed to production, should be fixed by now

Was this page helpful?
0 / 5 - 0 ratings

Related issues

saleiva picture saleiva  路  4Comments

makella picture makella  路  3Comments

noguerol picture noguerol  路  5Comments

ivanmalagon picture ivanmalagon  路  3Comments

fernando-carto picture fernando-carto  路  5Comments