Cartodb: Automatic geocoding should always be done with an account's specified provider

Created on 21 Jun 2017 · 10 Comments · Source: CartoDB/cartodb

Context

Please explain here below what you were doing when the issue happened

We tested this with two separate CARTO accounts configured in Superadmin to use Here.com geocoding today and symptoms were identical.

We had a spreadsheet (can share if needed) that among other columns has an address and city column of store locations in the USA. We uploaded this by dragging and dropping onto the browser window ("Maps" page). The file was uploaded, automatically geocoded, and the Builder map created. We grew skeptical when we saw points in New York and Alabama that should have been in Tennessee and I investigated further.

I then took the same file, un-checked the "Let CARTO automatically guess data types and content on import." check-box, and uploaded it. This time, presumably because the auto-guess box was unchecked, the file was imported but not geocoded, and the the_geom column was null for all records. I then created a map with this second file in Builder, applied the Georeference analysis with the appropriate columns, and spot-checked the results. This time, all locations that were supposed to be in Tennessee were in fact in Tennessee, as expected.

Working conclusions:
Even if your account is configured to use HERE as the geocoding provider, if you upload a spreadsheet with the “Let CARTO automatically guess data types and content on import.” checkbox checked, and the file has some of our auto-recognized column names such as address, the file will be automatically geocoded using _Mapzen_ and not _Here_, and the results will therefore most likely be of poor quality. The workaround is to upload the file with that “automatically guess” checkbox de-selected and then apply a Georeference analysis in Builder. That is the only way you can be sure your data will be geocoded with Here.

Steps to Reproduce

Please break down here below all the needed steps to reproduce the issue

Using an account with Here configured as the geocoding provider, upload a spreadsheet with an address column and the "auto-guess" checkbox checked.
Examine the results and look for a high rate of incorrect geocodes (easy to do with a widget if you have another "state" column), which would indicate that the file was likely auto-geocoded with Mapzen and not Here.

Current Result

Please describe here below the current result you got

I can't be certain (as we don't provide metadata yet, per @kevin-reilly's #12371), but I'm confident beyond a reasonable doubt that this file is being auto-geocoded with Mapzen, even though the superadmin setting is "heremaps":
[screenshot, 2017-06-21 at 4:10:23 pm]

Expected result

Please describe here below what should be the expected behaviour

I would think/hope any account configured to use Here for geocoding would use Here in all contexts. One possible exception might be our geocoding "search box" that appears on maps, which I know is 100% Mapzen across the board for all accounts, but that is perhaps another business decision we should reconsider too.

Data-services stale

Source

andrewbt

Most helpful comment

I have some other proposals:

advanced import options combined with guessing.
Do not "guess content" at all. Or at least measure the value to the user and consider removing that feature. That is a pre-Builder feature that to me makes little sense.

Some remarks:

Content guessing limits: say you have the perfect geocoding provider that is 100% accurate and has 100% coverage. Taking a sample from the whole dataset and with the best statistical method (insert ML or neural networks or whatever you want here) there will always be uncertainty in the results, that is, the sample can always mislead the decision about the column contents of the whole dataset.

Metadata: for a given query, instead of just returning a geometry, we'd need to return several other things. That means changes in the API but also changes in the UX. E.g.: is the metadata to be added to the columns of the analysis? Do users want to have results below X accuracy or Y accuracy?
Internal geocoder: we all know it can be improved. Are we willing to prioritize work on that? It is the provider of the content guessing. It has its limitations, but it lets us do some things without incurring costs.
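
The point about content-guessing limits can be made concrete with a small sketch. It is not how the import guesser works; the function name and the independence assumption (rows sampled at random, with replacement) are simplifications introduced here. It computes how likely a sample is to contain no counterexample at all when some fraction of the column does not match the guessed content type.

```python
def miss_probability(bad_fraction: float, sample_size: int) -> float:
    """Probability that a random sample contains no 'bad' rows at all.

    Assumes rows are drawn independently (with replacement), which is a
    simplification made for illustration.
    """
    if not 0.0 <= bad_fraction <= 1.0:
        raise ValueError("bad_fraction must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return (1.0 - bad_fraction) ** sample_size


# With 5% non-matching rows and a 20-row sample, the sample looks
# perfectly clean a bit more than a third of the time.
print(round(miss_probability(0.05, 20), 3))
```

Larger samples shrink this probability but never bring it to zero, which is the uncertainty described above.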

@saleiva and @kevin-reilly I beg you, please: if you really want the situation to improve we'd need a proper feature doc with high level requirements broken down into smaller requirements with a guarantee of consistency and completeness. And then prioritize the feature.

rafatower on 23 Jun 2017

👍2 ❤1 🎉1 😄1

All 10 comments

cc @rochoa @rafatower

jorgesancha on 22 Jun 2017

The automatic geocoding always happens with the internal geocoder, not with an external provider.

https://github.com/CartoDB/cartodb-management/wiki/Guessing-of-named-places
https://carto.com/docs/carto-engine/import-api/importing-geospatial-data#import-guessing

I don't think anything related to providers belongs in this repository.

rafatower on 22 Jun 2017

Woah. I forgot about the internal geocoder. I did not realize we have automatic geocoding of place names. (Pretty dangerous without making that transparent to the user - another reason for metadata in #12371!) This does explain why my point located in Jackson, Tennessee (a small city) was geocoded to Jackson, Mississippi (a large city, which probably outranks it in the internal geocoder database).

The CSV had these columns (this is a semi-anonymized excerpt):

id | address | city | st | zip_code
-- | ------- | ---- | -- | --------
1 | 136 Stonebrook Pl | Jackson | TN | 38305
2 | 7689 Poplar Ave. | Germantown | TN | 38138
3 | 349 Brentwood Pike Rd | Brentwood | TN | 37027
4 | 830 James Campbell Blvd South | Columbia | TN | 38402
5 | 314 Paul Huff Pkwy NW | Cleveland | TN | 37312
6 | 728 Thompson Lane | Nashville | TN | 37204
7 | 4435 Summer Ave | Memphis | TN | 38122

They were automatically geocoded to the centroids of:

Jackson, Mississippi
https://en.wikipedia.org/wiki/Germantown,_Maryland
https://en.wikipedia.org/wiki/Brentwood,_New_York
https://en.wikipedia.org/wiki/Columbia,_South_Carolina
https://en.wikipedia.org/wiki/Cleveland,_Ohio
"Correct" centroid for Nashville, Tennessee, but the actual address is miles away from here
"Correct" centroid for Memphis, Tennessee, but the actual address is miles away from here

This means that automatic geocoding of place names is substantially biased toward larger/more populated places. _Even_ if files have fully detailed addresses with states and postal codes and so on to clarify where to geocode, the automatic guessing will place them in the largest city with that city name. This is a huge methodological error because place names are not unique, not even within countries:
https://en.wikipedia.org/wiki/List_of_the_most_common_U.S._place_names
https://en.wikipedia.org/wiki/List_of_popular_place_names
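
The bias described above can be illustrated with a minimal sketch. This is not the internal geocoder's code: the `GAZETTEER` entries, the `rank` values, and both function names are invented for illustration. It only shows how a lookup keyed on the city name alone resolves to the highest-ranked match, while a lookup that also uses the state column does not.

```python
from typing import NamedTuple, Optional


class Place(NamedTuple):
    name: str
    state: str
    rank: int  # hypothetical prominence score; higher means "bigger"


# Toy gazetteer; the rank values are made up for illustration only.
GAZETTEER = [
    Place("Jackson", "MS", 90),
    Place("Jackson", "TN", 40),
    Place("Cleveland", "OH", 95),
    Place("Cleveland", "TN", 30),
]


def guess_by_name(city: str) -> Optional[Place]:
    """Name-only lookup: among same-named places, the highest rank wins."""
    matches = [p for p in GAZETTEER if p.name == city]
    return max(matches, key=lambda p: p.rank, default=None)


def guess_by_name_and_state(city: str, state: str) -> Optional[Place]:
    """Same lookup, but restricted to the state given in the row."""
    matches = [p for p in GAZETTEER if p.name == city and p.state == state]
    return max(matches, key=lambda p: p.rank, default=None)


print(guess_by_name("Jackson").state)                  # name only
print(guess_by_name_and_state("Jackson", "TN").state)  # name + state
```

In this toy setup the name-only lookup lands in Mississippi even though the row says TN, which mirrors the excerpt above.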

I noticed that even though the source dataset's the_geom column had been permanently saved by the automatic geocoding on import, I was able to apply the Georeference analysis in Builder and the points were properly geocoded (all to Tennessee in my sample).

Given that information, this issue is incorrect in its description, but still fairly serious. Without some kind of metadata, or transparency of "automatic geocoding" and which fields it is based on to the user, or other user notification of what's happening, this is a really confusing "feature". @rafatower , would you prefer I closed this and posted a revised issue in https://github.com/CartoDB/data-services instead? Though to be honest this one may be more than a simple issue and need input from UI/UX Design, etc...

andrewbt on 22 Jun 2017

The internal geocoder geocodes place names, not addresses; that's why you get that result. We guess the data in columns, and if we detect country names or place names (for example) without a lot of duplicates and so on, we basically geocode it for free (that's why we use the internal geocoder).

I can come up with two solutions:

Add some metadata to any geocoding process (https://github.com/CartoDB/cartodb/issues/12371) so that one of the columns stores the provider that was used and the column that was used (if it was done by guessing).
Improve our guessing mechanisms so we can identify addresses (BOOM!), and then not geocode by anything else until the user selects the operation within the map view.
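
As a rough sketch of what the first proposal might record per geocoded row: the `GeocodeMetadata` type, its field names, and the `needs_review` helper are all hypothetical and not part of any CARTO API or of #12371. The idea is only that storing the provider, the source columns, and whether guessing triggered the geocode would let a user or a UI flag questionable results.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GeocodeMetadata:
    provider: str                    # e.g. "internal" or "heremaps"
    source_columns: Tuple[str, ...]  # columns the geocode was based on
    guessed: bool                    # True if triggered by import guessing
    precision: Optional[str] = None  # hypothetical: "place" or "address"


def needs_review(meta: GeocodeMetadata) -> bool:
    """Flag rows geocoded by guessing at less than address precision."""
    return meta.guessed and meta.precision != "address"


auto = GeocodeMetadata("internal", ("city",), guessed=True, precision="place")
manual = GeocodeMetadata(
    "heremaps", ("address", "city", "st"), guessed=False, precision="address"
)

print(asdict(auto))
print(needs_review(auto), needs_review(manual))
```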

saleiva on 23 Jun 2017

Another important disclaimer: the more features we add, the slower the process will be.

rochoa on 23 Jun 2017

It'd be interesting to understand how useful the "guessing" is for customers. IMO the metadata about the geocode is interesting (even if we just store those results for our own use) because it opens the door to finding out how accurate results are (and how useful if they are not very accurate)

jorgesancha on 23 Jun 2017

FYI, just last week I experienced the exact same problem uploading a dataset. The auto geocoding done with the auto-guess box checked produced a large error rate (20%+) of points in the wrong US state. Could this be a serious problem that customers have and don't realize?

I believe @terrett101 had the same problem just today while trying to do some work for a new customer.

jeffkaplan88 on 13 Sep 2017

Yes, I've been running into this problem all day. My greater concern though is that customers have no idea that the automatic internal geocoder is looking at cities or places only, and not accurately geocoding street addresses when present in the dataset.

If a customer uploads a dataset that contains street addresses and sees the points plotted on a map, they assume the points are plotted accurately based on the locational information in the file. It's the same assumption I had been making until today, so I'm sure our customers are too.

I'm not opposed to either of Rafa's suggestions. Either removing the guesswork by CARTO (unless perhaps there is a lat/lon in the dataset) or providing the user with georeference options as they import (if no geometry is detected) would be an improvement over the current workflow.

terrett101 on 13 Sep 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.