Arctos: Geography Proposal

Created on 3 Dec 2020 · 34 Comments · Source: ArctosDB/arctos

Background

I can see no evidence that the recent efforts in geography cleanup have resulted in more discoverable catalog record data, which I presume to be a core use case for maintaining geography. It's still possible for data entry personnel to assign arbitrary geography to records, and without consistency, predictable results from geography text searches are not possible. See https://github.com/ArctosDB/arctos/issues/3249 for example.

https://github.com/ArctosDB/arctos/issues/3186 is a proposal to find more consistency in these data, but it will result in significantly reduced functionality in several areas. I don't see this as an acceptable tradeoff, and I don't think Curators will or should either.

Our current geography model does offer several valuable tools for georeferencing and confirming that georeferences fall within specified geography areas, but this still does not provide a consistent mechanism for locating cataloged records by geography.

Arctos has for some time been using various webservices to find coordinates for records without them, and to associate coordinates (both asserted and derived) with place names from various webservices. This is useful for search, but there is no formality or consistency in these data; they're just search strings.

Proposal

  1. Retain the existing geography model, which allows "traditional" curatorial assertions (which support various internal functions - organizing material by Quad, for example).

  2. Split the derived geography out into a separate, structured, formal table. This would allow consistent searching - all records from http://www.geonames.org/5880054/barrow.html would be discoverable as "United States","Alaska" and "North Slope" for example. For contrast, current data would require somewhere between three and 16 queries (depending on level) to find the desired "Barrow-ish" records.


 CONTINENT_OCEAN              | COUNTRY       | STATE_PROV | COUNTY              | QUAD          | SEA
------------------------------+---------------+------------+---------------------+---------------+--------------
 Arctic Ocean                 |               |            |                     |               | Beaufort Sea
 Arctic Ocean                 |               |            |                     |               | Chukchi Sea
 Arctic Ocean                 |               |            |                     |               |
 no higher geography recorded |               |            |                     |               |
 North America                | United States | Alaska     | North Slope Borough |               |
 North America                | United States | Alaska     |                     | Barrow        | Beaufort Sea
 North America                | United States | Alaska     |                     | Barrow        | Chukchi Sea
 North America                | United States | Alaska     |                     | Barrow        |
 North America                | United States | Alaska     |                     | Barter Island | Beaufort Sea
 North America                | United States | Alaska     |                     | Iliamna       |
 North America                | United States | Alaska     |                     | Meade River   |
 North America                | United States | Alaska     |                     | St. Lawrence  | Bering Sea
 North America                | United States | Alaska     |                     | Teshekpuk     |
 North America                | United States | Alaska     |                     |               | Beaufort Sea
 North America                | United States | Alaska     |                     |               | Chukchi Sea
 North America                | United States | Alaska     |                     |               |

Implications

This would immediately result in more discoverable (by virtue of consistency) data in Arctos. One query - rather than the currently-required 16 - would find records from Barrow.
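A minimal sketch of that difference, in Python with an in-memory SQLite database. The table names (`locality`, `derived_geography`), the columns, and the three sample rows are hypothetical stand-ins, not the Arctos schema. The only point is that a consistent derived value lets one predicate find records whose curatorial geography differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for curatorial and service-derived geography.
    CREATE TABLE locality (
        locality_id INTEGER PRIMARY KEY,
        county TEXT, quad TEXT, sea TEXT
    );
    CREATE TABLE derived_geography (
        locality_id INTEGER REFERENCES locality(locality_id),
        term_type TEXT NOT NULL,
        term_value TEXT NOT NULL
    );
""")

# Three made-up localities near Barrow, each with different curatorial geography.
conn.executemany(
    "INSERT INTO locality VALUES (?, ?, ?, ?)",
    [
        (1, "North Slope Borough", None, None),
        (2, None, "Barrow", "Chukchi Sea"),
        (3, None, None, "Beaufort Sea"),
    ],
)
# Assumption: the service resolves every coordinate to the same county-level term.
conn.executemany(
    "INSERT INTO derived_geography VALUES (?, 'county', 'North Slope')",
    [(1,), (2,), (3,)],
)

def find_by_derived(term_type, term_value):
    """Return locality IDs whose derived geography matches one term."""
    rows = conn.execute(
        "SELECT locality_id FROM derived_geography "
        "WHERE term_type = ? AND term_value = ? ORDER BY locality_id",
        (term_type, term_value),
    )
    return [r[0] for r in rows]

# A single curatorial predicate finds only one of the three localities...
curatorial_hits = [r[0] for r in conn.execute(
    "SELECT locality_id FROM locality WHERE county = 'North Slope Borough'")]
# ...while a single derived predicate finds all of them.
derived_hits = find_by_derived("county", "North Slope")
```

In this toy data the curatorial query would need further predicates on `quad` and `sea` to catch localities 2 and 3, which is the multi-query problem described above.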

Longer term, we could discuss making these data more visible, perhaps sharing them via DWC, etc. This is essentially an implementation of https://github.com/ArctosDB/arctos/issues/3186 but as an enhancement rather than a replacement.

This approach also has significant future-proof qualities. A county's new name will become available for searching as soon as it's entered into a service we use, with no curatorial work involved. Using a new/better/specialized service would be a matter of making Arctos aware of it.

No changes would be required to catalog new material.

Future changes to "curatorial geography" would not be so wide-ranging; we might be able to more readily accommodate curatorial needs without reducing functionality to users.

In short, I think this would result in drastically more discoverable data with no additional curatorial work, and without asking Curators to give up anything. It would also retain all of the work we've put into cleaning and organizing geography.

Followup

This approach would rely on coordinates to retrieve the consistent geography data, and so I also propose that we make the derived coordinates more visible, and more available to collections who wish to use them, as an immediate followup. It would be trivial to create a georeferenced Specimen Event for cataloged records without one, for example. This would not be a particularly "good" georeference, but it would make any problems much more discoverable by providing a path to spatial tools, and could be flagged as automation in various ways (a new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctverificationstatus is perhaps most "filter-able").

For scale, Arctos currently holds 688778 localities, 496467 (72%) of which have curatorial coordinate assertions. 668709 (97%) have service-derived coordinate assertions.

Related Issues

In no particular order. I got overwhelmed and gave up trying to better organize these, you can too! There are a few "themes" in these, but they're often broad and intermingled.

  1. Some Issues are incorporated in this proposal. There's nothing new here, it's just a no-compromises merger of existing ideas. Restructuring geography, incorporating various Standards and Services, and being a more involved member of the larger community are inevitable, for example.

  2. Some Issues become less important if not irrelevant under this proposal. Choosing curatorial functionality over discovery has little impact with this 2-part approach. Inconsistent data has a much shorter reach. Using "modern" geography is not as pressing, perhaps not even desirable. A universal definition of geography, or a shared idea of the goals, is no longer necessary.

  3. Some Issues change very little, or not at all, under this. Adding spatial data will enable the same awesomeness under this proposal, for example.

Structure

Table formal_geography could take two general shapes.

A normalized structure would provide more flexibility, but is more difficult and expensive to query

formal_geography_id serial
term varchar not null
rank varchar null
order int not null
source varchar not null
metadata various

would support any number of terms of any rank (including none), and generally be more capable of representing whatever comes in from Services (including that cool new thing which hasn't been built yet). It would also be expensive to query, difficult to access, impractical to flatten, and perhaps difficult to "translate" (eg, we end up with 12 ways of saying "country" from various sources).

A more flattened approach would serve the core use case of discoverability, could be treated like a spreadsheet for various purposes, but would not be completely faithful to service data.

formal_geography_id serial
term_1 varchar  <--- map continent-level-ish data here
term_2 varchar  <--- map country-level-ish data here
term_3 varchar  <--- map state-level-ish data here
term_4 varchar  <--- map county-level-ish data here
term_5 varchar  <--- map municipality-level-ish data here
source varchar not null
metadata various
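The relationship between the two shapes can be sketched as a "translation" step that collapses normalized (term, rank, source) rows into the five flattened slots. This is only an illustration in Python: the `RANK_TO_SLOT` mapping and the `flatten` helper are hypothetical, and the rank labels are examples of what a service might return, not a settled vocabulary.

```python
# Hypothetical mapping from source-specific rank labels to the flattened slots.
# This is where "12 ways of saying country" would have to be reconciled.
RANK_TO_SLOT = {
    "continent": 1,
    "country": 2, "GADM0": 2,
    "administrative_area_level_1": 3, "GADM1": 3,
    "administrative_area_level_2": 4, "GADM2": 4,
    "administrative_area_level_3": 5,
}

def flatten(terms, source):
    """Collapse normalized (term, rank, source) rows from one source into a flat record.

    Rows whose rank has no slot are dropped, which is the loss of
    fidelity to service data that the flattened shape accepts.
    """
    row = {f"term_{i}": None for i in range(1, 6)}
    row["source"] = source
    for term, rank, term_source in terms:
        slot = RANK_TO_SLOT.get(rank)
        if term_source == source and slot is not None:
            row[f"term_{slot}"] = term
    return row

# Illustrative normalized rows; "SeaVoX" has no slot and is discarded.
normalized = [
    ("United States", "GADM0", "GBIF API"),
    ("New Mexico", "GADM1", "GBIF API"),
    ("Sandoval", "GADM2", "GBIF API"),
    ("NORTH AMERICA MAINLAND", "SeaVoX", "GBIF API"),
    ("United States", "country", "Google API"),
]
flat = flatten(normalized, "GBIF API")
```

Going the other direction (flat to normalized) cannot recover the dropped ranks, which is one argument for storing the normalized form and deriving the flat one.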

Both would require some way to tie to "core" or "curatorial" data (probably Locality). A linking table would provide a mechanism to tie many assertions to a locality, which seems necessary, and a mechanism to tie many localities to an assertion (which could reduce the data we must store, but I don't anticipate using this direction).

geo_link_id serial
formal_geography_id fkey-->formal_geography
locality_id fkey-->locality

An alternate would be adding locality_id fkey-->locality directly into the formal_geography table, which might make sense with the flatter version.

Labels: Enhancement, Function-Locality/Event/Georeferencing, Help wanted, Service-related

All 34 comments

Issues meeting:

  • allow eg "just use GADM" as geog for data entry-->don't assert anything, just pull from coordinates-at-source

has potential, implement, gather some data, expose internally and in limited scope (eg, from higher geog edit page), then analyze and decide how to proceed

AWG: Go

This is a goldmine. I am going to blithely steal from it as I work on the Locality Services.

as I work on the Locality Services.

Build it, and we shall steal....

Seems fair.


I went with a fairly-normalized model, should be pretty easy to shuffle things around if it causes some sort of problem.

create table place_terms (
  place_term_id serial not null,
  locality_id bigint references locality(locality_id) on delete cascade,
  term_type varchar not null,
  term_value varchar not null,
  source varchar not null,
  last_date date default current_date
);

It's talking to Google, and keeping only

administrative_area_level_1,administrative_area_level_2,administrative_area_level_3,country

which are the only "geography-like" terms I could find in that particular API. That's easy to adjust if someone wants something else; Google seems to know a lot about rooftops...
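As a rough illustration of that filtering step, here is a Python sketch that turns an already-parsed geocoding response into `place_terms`-style rows. It assumes the usual Google Geocoding response layout (`results`, `address_components`, `long_name`, `types`). The `extract_place_terms` helper and the sample payload are hypothetical, hand-written for the example, and are not the Arctos code or a recorded API response.

```python
# The four component types kept, as listed above.
KEPT_TYPES = (
    "administrative_area_level_1",
    "administrative_area_level_2",
    "administrative_area_level_3",
    "country",
)

def extract_place_terms(geocode_response, locality_id, source="Google API"):
    """Return sorted, de-duplicated (locality_id, term_type, term_value, source) rows."""
    rows = set()
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            for term_type in component.get("types", []):
                if term_type in KEPT_TYPES:
                    rows.add((locality_id, term_type, component["long_name"], source))
    return sorted(rows)

# Hand-written sample shaped like a reverse-geocoding response (illustrative only).
sample_response = {
    "status": "OK",
    "results": [
        {"address_components": [
            {"long_name": "Jemez Springs", "short_name": "Jemez Springs",
             "types": ["locality", "political"]},
            {"long_name": "Sandoval County", "short_name": "Sandoval County",
             "types": ["administrative_area_level_2", "political"]},
            {"long_name": "New Mexico", "short_name": "NM",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ]},
        # Later results usually repeat the broader components; the set removes repeats.
        {"address_components": [
            {"long_name": "New Mexico", "short_name": "NM",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ]},
    ],
}
rows = extract_place_terms(sample_response, 1178173)
```

Keeping something else would be a one-line change to `KEPT_TYPES`.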

Plugging in to other APIs should be trivial, so if anyone knows of anything that'll take coordinates and return something that someone might consider geography, please let me know about it.

http://test.arctos.database.museum/place.cfm?action=detail&locality_id=1178173 looks like....

[image: Screen Shot 2020-12-08 at 4 45 25 PM]

It would be pretty easy to use those terms and/or ranks in search, assert them instead of or alongside "curatorial geography," or whatever turns out to be handy.

It won't be very interesting until some data are gathered. @mkoo if we have the bandwidth I could temporarily be more aggressive with the cacher after this goes to production, which might happen in a couple hours.

GBIF has made a reverse geocoding API available that uses GADM and
marineregions.org EEZs

The code is here:
https://github.com/gbif/geocode

And here is an example API call:
http://api.gbif.org/v1/geocode/reverse?lat=-41.0570673&lng=-71.5268821

In the response, if a distance is non-zero, then it is the minimum distance
in degrees to that administrative division.


Thx - I did eventually remember that...

[image: Screen Shot 2020-12-09 at 1 41 01 PM]

I've got it set to grab everything for now - I suspect we'll end up filtering and deleting some stuff at some point. Given the (vague and potential) intent of this, perhaps it's best to preemptively reject everything with distance>0?

For the followup of making generated coordinates more visible, there's a new operator button on specimen detail for no-coordinate events. Two clicks...

[image: Screen Shot 2020-12-09 at 3 57 18 PM]

[image: Screen Shot 2020-12-09 at 3 57 28 PM]

...and...

[image: Screen Shot 2020-12-09 at 4 02 36 PM]

... happens. It's not a great georeference - there is no error calculation - but I've clicked the button perhaps 50 times and nothing meaningfully "wrong" has happened. (Maybe I'm bad at picking test cases!) There is a map available before the second click, should anyone want to review it before clicking - this is simply a new path to an old tool. The georeference will need further work to be suitable for all use cases, but it also makes the record available to spatial tools where it can be more efficiently improved; even horribly incorrect georeferences seem like an improvement from that perspective.

I'd be happy to talk about further lowering the bar, should anyone or everyone want magical coordinates without the clicking.

I would reject everything with distance >0. Those are near neighbors in
case the geometry is vague or if you want to apply admin values to
near-offshore coordinates. But I wouldn't want to propagate false
positives. But maybe I just like things too simple.
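The "reject everything with distance > 0" rule can be sketched as a small filter over the parsed reverse-geocode response. This is a Python illustration under assumptions: the field names `type` and `name` are assumed, `distance` is the field described above, and the sample layers (including the 0.02 degree value) are invented for the example rather than taken from a real response.

```python
def exact_matches(layers):
    """Keep only layers with a reported distance of exactly zero.

    A non-zero distance means the point lies outside that area and the
    area is only a near neighbor, so it is dropped to avoid false positives.
    Layers with no distance at all are also dropped, as a cautious default.
    """
    return [layer for layer in layers if layer.get("distance") == 0]

def to_place_terms(layers, locality_id, source="GBIF API"):
    """Convert the surviving layers into (locality_id, term_type, term_value, source) rows."""
    return [
        (locality_id, layer["type"], layer["name"], source)
        for layer in exact_matches(layers)
    ]

# Invented sample: the point is inside Sandoval, and Bernalillo is merely nearby.
sample_layers = [
    {"type": "GADM1", "name": "New Mexico", "distance": 0.0},
    {"type": "GADM2", "name": "Sandoval", "distance": 0.0},
    {"type": "GADM2", "name": "Bernalillo", "distance": 0.02},
]
kept = to_place_terms(sample_layers, 1116141)
```

A looser variant could keep near neighbors but store the distance alongside each term, so that search could include or exclude them later.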


too simple.

Sounds scary, but I guess we could give it a try....

Done, in production, cache-checker-thingee is running a little harder than normal @mkoo

This has processed ~20K localities so far, there's perhaps enough data for patterns to begin emerging.

https://arctos.database.museum/place.cfm?action=detail&locality_id=1116141 had just finished when I checked in, seems fairly normal.

Locality terms:

        term_value        |          term_type          |   source   
--------------------------+-----------------------------+------------
 United States of America | Political                   | GBIF API
 United States            | GADM0                       | GBIF API
 New Mexico               | GADM1                       | GBIF API
 Sandoval                 | GADM2                       | GBIF API
 NORTH AMERICA MAINLAND   | SeaVoX                      | GBIF API
 New Mexico               | WGSRPD                      | GBIF API
 United States            | country                     | Google API
 New Mexico               | administrative_area_level_1 | Google API
 Sandoval County          | administrative_area_level_2 | Google API
 Sandoval County          | political                   | Google API
 New Mexico               | political                   | Google API
 United States            | political                   | Google API
 Jemez Springs            | political                   | Google API

GBIF:GADM0,GBIF:GADM1,GBIF:GADM2 pretty consistently form country:state:county, so they seem like a suitable solution to https://github.com/ArctosDB/arctos/issues/3186. Google:country,Google:administrative_area_level_1,Google:administrative_area_level_2 could serve the same purpose. [Dis]agreement between those things could be a useful metric.
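One way such a [dis]agreement metric might look, as a Python sketch. The level pairing, the `normalize` rule (lowercase, strip a trailing "County"), and the `agreement` function are all assumptions made for illustration. A real comparison would need a broader normalization for other countries and naming styles.

```python
import re

# Assumed pairing of GBIF/GADM levels with Google levels.
LEVEL_PAIRS = [
    ("GADM0", "country"),
    ("GADM1", "administrative_area_level_1"),
    ("GADM2", "administrative_area_level_2"),
]

def normalize(name):
    """Lowercase and drop a trailing 'County' so 'Sandoval' matches 'Sandoval County'."""
    return re.sub(r"\s+county$", "", name.strip().lower())

def agreement(terms):
    """Return the fraction of comparable levels on which GBIF and Google agree.

    `terms` holds (term_value, term_type, source) rows for one locality.
    Levels missing from either source are skipped. Returns None when
    no level can be compared.
    """
    by_key = {(source, term_type): normalize(value)
              for value, term_type, source in terms}
    compared = matched = 0
    for gadm_type, google_type in LEVEL_PAIRS:
        gbif_value = by_key.get(("GBIF API", gadm_type))
        google_value = by_key.get(("Google API", google_type))
        if gbif_value is None or google_value is None:
            continue
        compared += 1
        matched += gbif_value == google_value
    return matched / compared if compared else None

# Rows taken from the locality terms listed above.
terms = [
    ("United States", "GADM0", "GBIF API"),
    ("New Mexico", "GADM1", "GBIF API"),
    ("Sandoval", "GADM2", "GBIF API"),
    ("United States", "country", "Google API"),
    ("New Mexico", "administrative_area_level_1", "Google API"),
    ("Sandoval County", "administrative_area_level_2", "Google API"),
]
score = agreement(terms)
```

Localities with a low score would be candidates for review, though a low score could equally reflect a service quirk rather than bad coordinates.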

This does seem capable of providing a consistent, limited set of search parameters which will return ALL (or the 97% I can get coordinates for) items from a placename.

The "all localities" report @mkoo asked for is a decent reflection of the all-localities map.

[image: Screen Shot 2020-12-10 at 8 47 52 AM]

        term_value        |          term_type          |   source   
--------------------------+-----------------------------+------------
 United States of America | Political                   | GBIF API
 United States            | GADM0                       | GBIF API
 Michigan                 | GADM1                       | GBIF API
 New Mexico               | GADM1                       | GBIF API
 Los Alamos               | GADM2                       | GBIF API
 Sandoval                 | GADM2                       | GBIF API
 Roosevelt                | GADM2                       | GBIF API
 Otero                    | GADM2                       | GBIF API
 Santa Fe                 | GADM2                       | GBIF API
 Taos                     | GADM2                       | GBIF API
 Rio Arriba               | GADM2                       | GBIF API
 Wexford                  | GADM2                       | GBIF API
 NORTH AMERICA MAINLAND   | SeaVoX                      | GBIF API
 New Mexico               | WGSRPD                      | GBIF API
 Michigan                 | WGSRPD                      | GBIF API
 United States            | country                     | Google API
 New Mexico               | administrative_area_level_1 | Google API
 Michigan                 | administrative_area_level_1 | Google API
 Colfax County            | administrative_area_level_2 | Google API
 Roosevelt County         | administrative_area_level_2 | Google API
 Sandoval County          | administrative_area_level_2 | Google API
 Los Alamos County        | administrative_area_level_2 | Google API
 Otero County             | administrative_area_level_2 | Google API
 Santa Fe County          | administrative_area_level_2 | Google API
 Taos County              | administrative_area_level_2 | Google API
 Wexford County           | administrative_area_level_2 | Google API
 Rio Arriba County        | administrative_area_level_2 | Google API
 Boon Township            | administrative_area_level_3 | Google API
 Sandoval County          | political                   | Google API
 San Ildefonso Pueblo     | political                   | Google API
 San Luis                 | political                   | Google API
 San Pedro                | political                   | Google API
 Santa Fe                 | political                   | Google API
 Santa Fe County          | political                   | Google API
 San Ysidro               | political                   | Google API
 Taos                     | political                   | Google API
 Taos County              | political                   | Google API
 United States            | political                   | Google API
 Wexford County           | political                   | Google API
 Algodones                | political                   | Google API
 White Rock               | political                   | Google API
 Angel Fire               | political                   | Google API
 Boon                     | political                   | Google API
 Boon Township            | political                   | Google API
 Budaghers                | political                   | Google API
 Cloudcroft               | political                   | Google API
 Cochiti Lake             | political                   | Google API
 Colfax County            | political                   | Google API
 Corrales                 | political                   | Google API
 Coyote                   | political                   | Google API
 Cuba                     | political                   | Google API
 Golden                   | political                   | Google API
 Jemez Pueblo             | political                   | Google API
 Jemez Springs            | political                   | Google API
 La Jara                  | political                   | Google API
 Los Alamos               | political                   | Google API
 Los Alamos County        | political                   | Google API
 Mescalero                | political                   | Google API
 Michigan                 | political                   | Google API
 New Mexico               | political                   | Google API
 Otero County             | political                   | Google API
 Pep                      | political                   | Google API
 Placitas                 | political                   | Google API
 Questa                   | political                   | Google API
 Rio Arriba County        | political                   | Google API
 Rio Rancho               | political                   | Google API
 Roosevelt County         | political                   | Google API
 Sandia Park              | political                   | Google API

They're both all over the place. That might be useful for demonstrating that we need funding to resolve https://github.com/ArctosDB/arctos/issues/1679, but they're not useful for addressing spatial questions.

There's some limited oceanic data in GBIF - https://arctos.database.museum/place.cfm?action=detail&locality_id=80080 is the first "mostly wet" locality I stumbled across, the service seems to be at least as useful as the asserted data. I think the important point for this is that figuring out marine things isn't an Arctos problem under this model, it's a community problem. If GBIF (who certainly has far more resources than Arctos) does something clever it'll magically find its way in to Arctos, if someone else does something we should be able to plug in to their API. @sharpphyl

This seems to be working far better than I'd expected. I suggest we begin thinking about how to make it available in the UIs, how to distinguish it from "curatorial geography," and perhaps even how to share it back to GBIF via DWC (which should stop the flagging that seems to annoy some users).

https://arctos.database.museum/place.cfm?action=detail&locality_id=10824871 is interesting.

There's no WKT for the drainage-in-county.

Without something like https://github.com/ArctosDB/arctos/issues/3108 (which would get at "in county" but not "in drainage") it's difficult to say if the coordinates are reasonable or not.

GBIF is returning "Bernalillo" for GADM2, strongly suggesting that the coordinate/curatorial geography alignment is in fact not reasonable.

While not a replacement for better WKT, this looks like it will expose useful ways of detecting low-quality data.

Scattered links to place detail around a bit
Indexed the table
Added "Standardized Place Name" to specimendetail

[image: Screen Shot 2020-12-10 at 3 27 48 PM]

with some light styling to separate it from "data"

Nice.

Re: standardized place name - It isn't really the place name, it is the geography, right? Why smaller? Maybe there is some other way to separate it, or just call it "Service Asserted Geography"? Also, how about a "more" link to that? Possible?

Maybe "Higher Geography" should be titled "Curatorial Asserted Higher Geography"? Or maybe we just need a section here that is "Curatorial Asserted" and another that is "Service Asserted" or something like that.

it is the geography

For now - yea, more or less, I think, whatever that means.....

Potentially, it's whatever we find at some place - certainly marine (no geo) stuff, maybe there's something cool in Google's rooftop data, whatever. I'm struggling to find a name that might accommodate that, suggestions greatly appreciated.

"more" link

There are 2 in the area that will get you there. The one with locality is the more relevant, that may or may not say something useful about the label.

[image: Screen Shot 2020-12-11 at 7 21 39 AM]

"Curatorial Asserted Higher Geography"

That's what it IS in my view, but we use higher_geog[raphy] in many places, and I don't want this to turn in to something that someone finds offensive - I think that might be a little overly aggressive.

Service Asserted

It's "Service-Derived" in /place - "Asserted" might be better - accurate, but does everyone know what that means?

There are 2 in the area that will get you there. The one with locality is the more relevant, that may or may not say something useful about the label.

Those "more" take you to things that are more of those. This is probably a bad example because the HG and the "Standardized Place Name" are essentially the same, but if the SPN was different from the HG, then I would assume that "more" would be a different set of stuff - No?

"Curatorial Asserted Higher Geography"

That's what it IS in my view, but we use higher_geog[raphy] in many places, and I don't want this to turn in to something that someone finds offensive - I think that might be a little overly aggressive.

Verbatim?

Service Asserted

It's "Service-Derived" in /place - "Asserted" might be better - accurate, but does everyone know what that means?

Service Derived seems good.

different set of stuff

It will anyway - /place will have a table

[image: Screen Shot 2020-12-11 at 7 59 19 AM]

SpecimenDetail is just pulling a few terms from the data that make that table and concatenating them into a hopefully-familiar form.
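A guess at what that concatenation could look like, as a Python sketch. Which terms are used, their order, the preferred source, and the separator are all assumptions here. `standardized_place_name` is a hypothetical helper, not the SpecimenDetail code.

```python
# Assumed display order: country, then state/province, then county.
DISPLAY_ORDER = ("GADM0", "GADM1", "GADM2")

def standardized_place_name(terms, source="GBIF API", separator=", "):
    """Join selected terms from one source into a single display string.

    `terms` holds (term_value, term_type, source) rows for one locality.
    If a term type occurs more than once, the first value seen wins.
    Missing levels are simply skipped.
    """
    by_type = {}
    for value, term_type, term_source in terms:
        if term_source == source:
            by_type.setdefault(term_type, value)
    return separator.join(by_type[t] for t in DISPLAY_ORDER if t in by_type)

# Rows in the same shape as the /place table.
terms = [
    ("United States", "GADM0", "GBIF API"),
    ("New Mexico", "GADM1", "GBIF API"),
    ("Sandoval", "GADM2", "GBIF API"),
    ("NORTH AMERICA MAINLAND", "SeaVoX", "GBIF API"),
    ("Sandoval County", "administrative_area_level_2", "Google API"),
]
label = standardized_place_name(terms)
```

The "first value wins" rule is the weak point: when a locality has several GADM2 values, some explicit way of choosing a winner would be needed.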

See https://arctos.database.museum/info/reviewAnnotation.cfm?ANNOTATION_GROUP_ID=37714

The webservice data is pulling in a nearby county, in this case unnecessarily/incorrectly. Can/should we do anything about that?

Maybe I'm missing something here, but I think we need a way to deal with Standardized Place Names (SPN) that do not match the higher_geog that were curatorially assigned. @sharpphyl brought the following examples to my attention (but I know I've seen others that bother me):
https://arctos.database.museum/guid/DMNS:Inv:1287
https://arctos.database.museum/guid/DMNS:Inv:1400
https://arctos.database.museum/guid/DMNS:Inv:1160
In the first example, all we know is that this lot of shells was collected in the waters around Bali. So we picked a point in the middle and gave it a 91,000m error to include a reasonable amount of surrounding seas in all directions. Based on that center point, the SPN asserts that it is from the Penebel Province of the Tabanan Regency. This lot certainly wasn't collected in Penebel and we have no reason to believe it was collected offshore of Tabanan Regency. This adds unnecessary confusion to the record and there is no way an outside/novice user will understand why this difference is displayed.
In the second example, the SPN indicates Richmond Shire which is not anywhere near the coordinates provided. It is just one of many shires that are included within the error radius and it is certainly not home to the Great Barrier Reef. I really don't get how that was picked as the SPN based on the coordinates.
I understand that the SPN will result in more inclusive search results (which is good), but does it have to be displayed on the specimen event? When the SPN does match the higher_geog it is just needlessly taking up screen space. When they don't match it seems like we are presenting inaccurate data.
Perhaps, if it must be displayed, it belongs down in the Coordinates section and should be called "Coordinate-based Place Name"; but even then it seems wrong to display information that doesn't correspond to reality (e.g. inland locations for marine species). Or maybe a way to toggle the display of SPN on or off? Any other suggestions?

Just a side comment on your case 1. That is not georeferenced to best
practices. I realize that a best practice georeference wouldn't solve your
problems either, but I wanted to point this out. Best practices would not
put a marine location on land, regardless of the geometry. For references,
see:

https://docs.gbif.org/georeferencing-best-practices/1.0/en/#polygons
https://docs.gbif.org/georeferencing-quick-reference-guide/1.0/en/#s-corrected-center


I think that's mostly a limitation of the services; right now, I pass in a point and get back some data. What I really need to do is pass in a polygon (eg, giant circle) and get back some data.
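For what it's worth, the "giant circle" can be approximated from a point and an error radius without any service support. The Python sketch below builds a ring of vertices with the standard great-circle destination formula and emits it as WKT. It assumes a spherical Earth and is not reliable very near the poles. The function names and the vertex count are arbitrary, and the example coordinates are hypothetical rather than taken from any record.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius, spherical approximation

def destination(lat, lng, bearing_deg, distance_m):
    """Great-circle destination from a start point, a bearing, and a distance."""
    lat1, lng1 = math.radians(lat), math.radians(lng)
    bearing = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(bearing))
    lng2 = lng1 + math.atan2(math.sin(bearing) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    # Wrap longitude into [-180, 180).
    return math.degrees(lat2), (math.degrees(lng2) + 540) % 360 - 180

def circle_points(lat, lng, radius_m, n=16):
    """Return n (lat, lng) vertices on the circle, in counterclockwise order."""
    return [destination(lat, lng, -360.0 * i / n, radius_m) for i in range(n)]

def circle_wkt(lat, lng, radius_m, n=16):
    """Return the circle as a closed WKT polygon (x = longitude, y = latitude)."""
    ring = circle_points(lat, lng, radius_m, n)
    ring.append(ring[0])  # WKT rings must repeat the first vertex at the end
    coords = ", ".join(f"{x:.6f} {y:.6f}" for y, x in ring)
    return f"POLYGON(({coords}))"

# Hypothetical point with a 91,000 m error radius.
wkt = circle_wkt(-8.4, 115.2, 91000)
```

Since the services in use take points, one possible stopgap would be to query the center plus each vertex from `circle_points` and keep the union of the returned terms, at the cost of many more API calls per locality.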

That said, I don't think getting low-quality data from low-quality data is very surprising.

does it have to be displayed on the specimen event

I think it's important - it reveals low-quality data on our end, it reveals things missing from the services, and it reveals limitations in how I'm processing what I get back from the services. (You can see it all on the locality page - click the small 'details' link by the locality - it's a bit overwhelming at times, I'm up for ideas on how to choose a winner.)

I'm not sure it'll ever be a straight match - as far as I can tell there is no "correct" way to handle this stuff, the variations help with search.

Moving stuff around is easy enough, and I don't much care what we call the derived data.

Best practices would not put a marine location on land, regardless of the geometry

If I understand these references and your polygon is a ring around an island, you would place the coordinates on, say, the western shoreline. Then the uncertainty radius is going to be twice as big as if you put it in the center of the island. It will display the uncertainty circle as extending way out into the western sea and also including all of the land of the island. I feel that this is much less desirable than having a circle that only includes the extent of reasonable collection locations, even if that central point is on land. In any situation where you have a large uncertainty, you are going to include habitats where you will never find a given species; land versus sea is just an obvious distinction to see on a map.
I'd agree that just using WKTs is preferable, but until the tools for efficiently generating and displaying those are readily available, it's not worth our time.
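
The "twice as big" claim can be checked with a small worked example. This sketch assumes a hypothetical, roughly circular island 10 km across, described in a local planar frame in kilometres; the shoreline points and pin positions are made up for illustration.

```python
import math


def covering_radius(pin, points):
    """Smallest radius around `pin` that still contains every point."""
    return max(math.dist(pin, p) for p in points)


# Hypothetical island: shoreline sampled every 10 degrees on a 5 km circle.
shoreline = [
    (5 * math.cos(math.radians(a)), 5 * math.sin(math.radians(a)))
    for a in range(0, 360, 10)
]

center_pin = (0.0, 0.0)       # pin in the middle of the island
west_shore_pin = (-5.0, 0.0)  # pin on the western shoreline

center_radius = covering_radius(center_pin, shoreline)
west_radius = covering_radius(west_shore_pin, shoreline)
print(center_radius, west_radius)
```

For this idealized shape the shoreline pin needs a radius equal to the island's full width, while the central pin needs only half of it.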

That's fine, as long as you recognize that your methods are counter to best
practice and that you therefore should not cite them as a
georeferenceProtocol.



Just a side comment on your case 1. That is not georeferenced to best practices.

@tucotuco I read the two GBIF references but need your help to understand them. We've tried to create polygons in GeoLocate but that service didn't work then. If it works now, we'll try again. Are you saying that we need to create a WKT for every island and peninsula with a center point on the locality, then move the center point into the water somewhere and adjust the polygon? My head spins just thinking about it.

Where would you put the center point and the polygon (or circles) for this locality: the specimen was collected at a depth of 15-20 fathoms around Bataan Province, P.I. We don't know which side of the peninsula it was on. To the east is Manila Bay. To the west is the South China Sea. To the northwest is Subic Bay.

Screen Shot 2021-04-05 at 5 17 59 PM

No matter which water I put the pin in, I'm creating specious accuracy: I'm asserting that we know which side of the island or peninsula the specimen is from. I only know the reasonable circle within which it could have been collected. Is there a better way to represent it?

https://arctos.database.museum/editLocality.cfm?locality_id=10847398. We put the center point on the peninsula with an error radius (judgment call) that would include this depth.

Screen Shot 2021-04-05 at 8 17 56 PM

The "service" interprets not the written data but the point we selected (without any consideration of the error radius) and takes the geography to a level of detail that is not in the catalog record. Our higher geography was Asia, Philippines, Bataan, Philippine Islands, Luzon. The "service" interprets the point (not the data) as Bataan,Central Luzon,Philippines,Barangay Cupang Proper Rural, City of Balanga,Balanga City,JFRJ+MP (on the locality page) and inserts Standardized Place Name: Philippines, Central Luzon, Bataan, Balanga City into our catalog record https://arctos.database.museum/guid/DMNS:Inv:21889. I don't think increasing the specificity of the locality to include Balanga City will help anyone looking for a _Conus laterculatus_. If the service interpreted the verbatim locality instead of the GeoLocated center of the collection area, that would improve its accuracy.

To not geolocate these specimens because of these issues seems a huge loss of scientific data, despite the imperfections of the current system.

@dustymc

That said, I don't think getting low-quality data from low-quality data is very surprising.

We know that putting the lat/long coordinates in the middle of the area is low-quality data but don't know how to make it high quality data with today's tools. I would be pleased to get a "low-quality data" report with these details. But to put this data in the public record is misleading and confusing. Again, all we're asking is that the decision to include the "service created" locality in the catalog record be the decision of the collection curator (Paula) or Andy and not an automatic mandate.

There are also some bugs with the system - totally beyond the "put the pin in the middle of the island" issue. Check https://arctos.database.museum/guid/DMNS:Inv:15325.

Screen Shot 2021-04-05 at 7 56 31 PM

Here's the locality info. I've refreshed the webservice but I still get Denver along with Dampier and it's in the public catalog record and there's nothing I can do about it.

Screen Shot 2021-04-05 at 7 55 29 PM

Sometimes I refresh and just get Denver.

Screen Shot 2021-04-05 at 7 55 04 PM

Lastly, do I need to create a marine set of higher geography that begins with the Ocean, then the country, state, etc.? So Dampier Archipelago would be in Indian Ocean, Australia (not Australia, Australia), Western Australia etc.? A specimen from the west coast of Florida would be North Atlantic Ocean, Gulf of Mexico, United States, Florida, Lee County. What would GBIF do with that since they put the USA in North America and the Gulf of Mexico in Water Body?

I believe @tucotuco is talking about the assigned geography. What you're doing is about the same as saying "2000 miles west of Denver" and asserting Colorado geography for a California specimen. That can't answer the questions people want to ask of these data; the assigned geography should encompass the spatial data.

Coordinates of an island for a marine critter are not wrong, but it's not unrealistic to expect some service to include the island either.

As above, "return the most precise thing that describes the entirety of this shape" would be a super cool service, and if someone builds it I'll drop everything to use it. That's the best solution to this (and a bunch more similar things that we've been successful in ignoring), I just don't think it exists yet.
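
As a rough illustration of what "the most precise thing that describes the entirety of this shape" could mean, the sketch below picks the smallest named region whose bounding box fully contains the shape's bounding box. Everything here is hypothetical: the region names, the box coordinates, and the use of axis-aligned boxes instead of real polygons. It does not describe any existing service.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer`.

    Boxes are (min_x, min_y, max_x, max_y).
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def most_precise_container(shape_box, regions):
    """Name of the smallest region containing all of `shape_box`, or None."""
    candidates = [
        (area(box), name)
        for name, box in regions.items()
        if contains(box, shape_box)
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical nested regions in arbitrary planar units.
regions = {
    "Country": (0, 0, 100, 100),
    "Province": (10, 10, 50, 50),
    "City": (20, 20, 25, 25),
}
print(most_precise_container((18, 18, 30, 30), regions))  # shape spills out of City
```

The point of the design is that a shape spilling outside the most specific region gets the next level up instead of a falsely precise name.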

"low-quality data" report with these details.

That's one thing these service-derived data exist for.

decision to include the "service created" locality in the catalog record be the decision of the collection curator

That probably needs a dedicated Issue. I think these data are extremely valuable (even if they're imperfect) and I would not want to undo this work (and I don't really see a different way to what you're suggesting at the moment), but that's not my call either.

North Atlantic Ocean, Gulf of Mexico, United States, Florida, Lee County.

I don't think there is any such place; as far as I can tell, Florida counties end at the water line.

How do I/we/you get this locality to NOT show that the Australian Dampier Archipelago is in Denver, Colorado in our six catalog records? Yes, I refreshed several times.

https://arctos.database.museum/place.cfm?action=detail&collecting_event_id=10673555

Screen Shot 2021-04-07 at 9 40 42 AM

decision to include the "service created" locality in the catalog record be the decision of the collection curator

That probably needs a dedicated Issue. I think these data are extremely valuable (even if they're imperfect) and I would not want to undo this work..

Will do. I agree it's valuable information and it would be great to get a report or query for localities that are inconsistent with website data just as I look for catalog records using invalid taxon names. Then I could focus my geography efforts where the biggest problems are. Having it automatically be in the public catalog record is the problem.

I'll look at that when I can; I'm not sure what's going on there. Whatever it is, it's probably not being helped by your non-standard specific locality - https://handbook.arctosdb.org/documentation/locality.html#specific-locality.

Having it automatically be in the public catalog record is the problem.

I think this needs addressed by The Community. There's all kinds of service-derived data in various places, Arctos would be a much more boring place without it, sometimes it does strange things because computers. I need to know about that, sometimes the services providing the data need to know about it, but it's not "data" (call it metadata) and it shouldn't really be seen as your problem. Is that just a matter of documentation or labeling??

focus my geography efforts

Screen Shot 2021-04-07 at 8 51 11 AM

or

Screen Shot 2021-04-07 at 8 52 32 AM

https://arctos.database.museum/info/reviewAnnotation.cfm?action=show&atype=&guid_prefix=DMNS%3AInv&institution=&reviewer_comment=NULL&submitter=&reviewer=

is mostly asserted spatial data not agreeing with itself.

I can't say that's a direct cause of any of this, but I would not be surprised if a service does something strange when given two conflicting data points.

So I changed the specific locality from "no specific locality recorded" to "No specific locality recorded." Is that what you were referring to?

That took away the duplication of the US/Denver locality interwoven with a Western Australia one. Now it just shows Denver, Colorado as the locality for the Dampier Archipelago. Too absurd for comment.

Screen Shot 2021-04-07 at 12 46 38 PM

Yes, I'm painfully aware of hundreds of annotations pointing out that our marine specimens were found outside a terrestrial WKT. I'm referring to a report that would show the inconsistency between the curatorially selected geography vs. the service derived geography.
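
A report like that could start from a comparison along these lines. The sketch assumes each record carries a list of curatorially asserted geography terms and a list of service-derived terms; the record layout, the placeholder GUIDs, and the sample values are hypothetical and are not the Arctos schema.

```python
def shared_terms(asserted, derived):
    """Case-insensitive overlap between asserted and derived place terms."""
    a = {t.strip().lower() for t in asserted}
    d = {t.strip().lower() for t in derived}
    return a & d


def conflict_report(records):
    """GUIDs of records whose derived terms share nothing with the asserted ones."""
    return [
        r["guid"]
        for r in records
        if not shared_terms(r["asserted"], r["derived"])
    ]


# Hypothetical sample records, loosely modelled on the cases discussed here.
records = [
    {"guid": "EX:1",
     "asserted": ["Australia", "Western Australia", "Dampier Archipelago"],
     "derived": ["United States", "Colorado", "Denver"]},
    {"guid": "EX:2",
     "asserted": ["Philippines", "Bataan"],
     "derived": ["Philippines", "Central Luzon", "Bataan", "Balanga City"]},
]
print(conflict_report(records))
```

Exact string matching is a deliberately crude test; it would flag spelling variants as conflicts, so a real report would probably need some normalization first.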

I'm attempting to create a new higher geography for the Great Barrier Reef in the Pacific Ocean, Coral Sea, to replace the existing one (Australia, Australia, Queensland, Great Barrier Reef), which produces a problematic Standardized Place Name in Richmond Shire, in the middle of Queensland. I get this error message. @dustymc Do you have to create this? Will that correct the SPN conflict?

Screen Shot 2021-04-07 at 2 06 51 PM

Ok, ZC.15325 (located = Dampier Archipelago) now shows an SPN of Australia, Western Australia, Shire of Leonora which is at least in Australia but, unfortunately, in the middle of the Western Australia desert.

Shire of Leonora

Additionally, I'm getting annotations for ZC.20009. On this one, I followed your directions and put the "off Cape San Blas" location in the Gulf of Mexico.

Screen Shot 2021-04-09 at 1 11 16 PM

But the SPN thinks it belongs back on dry land in Florida.

Screen Shot 2021-04-09 at 1 12 22 PM

Are the SPNs creating the annotations or something else?

I'll start another issue to make the SPNs an internal tool, and part of the public catalog record only if the collection chooses to include them.

at least in Australia

I changed the specific locality - anything other than the recommended 'don't know' value messes with the services. (I think that was missing the period.)

I doubt if "off Cape San Blas" is something services will ever be able to understand. You can play with that in geolocate - your data as provided result in....

Screen Shot 2021-04-09 at 12 34 38 PM

I added a "better" locality description (not sure if more accurate, I'm just guessing from dots on a map...) and ....

Screen Shot 2021-04-09 at 12 34 24 PM

creating the annotations or something else

Automated annotations come from the asserted coordinates being outside the asserted geography and nothing else. SPN is not involved in any way.
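
The kind of check described here, asserted coordinates falling outside the asserted geography, can be sketched as a point-in-polygon test. This is a generic planar ray-casting version with hypothetical function names; it ignores the uncertainty radius and is not the actual Arctos implementation.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; `ring` is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def needs_annotation(lon, lat, geography_ring):
    """Flag a record when its coordinates fall outside its geography shape."""
    return not point_in_polygon(lon, lat, geography_ring)


# Hypothetical square "geography" and two test points.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(needs_annotation(5, 5, square), needs_annotation(15, 5, square))
```

A marine point just off the shoreline of a terrestrial shape fails this test in the same way as a point on the other side of the world, which matches the annotations described above.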

Consider:

  1. Fire up a new locality attribute "category" (or something of the sort) with which a locality can be designated as marine (and maybe eventually some other stuff).
  2. Use that in assembling the SPN.

Example:

https://arctos.database.museum/place.cfm?action=detail&locality_id=10904124 returns from webservices


arctosprod@arctos>> select source, term_type,term_value from place_terms where locality_id=10904124;
   source   |          term_type          |             term_value             
------------+-----------------------------+------------------------------------
 Google API | political                   | Dampier Archipelago
 Google API | administrative_area_level_2 | City of Karratha
 Google API | political                   | City of Karratha
 Google API | administrative_area_level_1 | Western Australia
 Google API | political                   | Western Australia
 Google API | country                     | Australia
 Google API | political                   | Australia
 Google API | political                   | Dampier
 Google API | political                   | Maitland
 GBIF API   | EEZ                         | Australian Exclusive Economic Zone
 GBIF API   | IHO                         | Indian Ocean
 GBIF API   | SeaVoX                      | INDIAN OCEAN
 Google API | political                   | Lake Darlot
 Google API | administrative_area_level_2 | Shire of Leonora
 Google API | political                   | Shire of Leonora
 Google API | political                   | Leonora
 GBIF API   | Political                   | Australia
 GBIF API   | GADM1                       | Western Australia
 GBIF API   | GADM0                       | Australia
 GBIF API   | WGSRPD                      | Western Australia
 GBIF API   | GADM2                       | Leonora

And we pick a few terms from that list to assemble

Australia, Western Australia, City of Karratha; Shire of Leonora

which works well for whatever I was looking at when I wrote whatever code assembles the SPN.
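
The thread does not show the assembly code, so the following is only a guess at its shape: one selection rule that happens to reproduce the string above from the Google API rows in the query output. The function name, the choice of term types, and the separators are assumptions.

```python
# Google API rows copied from the place_terms output: (source, term_type, term_value).
ROWS = [
    ("Google API", "political", "Dampier Archipelago"),
    ("Google API", "administrative_area_level_2", "City of Karratha"),
    ("Google API", "administrative_area_level_1", "Western Australia"),
    ("Google API", "country", "Australia"),
    ("Google API", "administrative_area_level_2", "Shire of Leonora"),
    ("Google API", "political", "Leonora"),
]

# Assumed ordering, from least to most specific.
LEVELS = ["country", "administrative_area_level_1", "administrative_area_level_2"]


def assemble_spn(rows, source="Google API"):
    """Join one entry per level with ', '; join ties within a level with '; '."""
    parts = []
    for level in LEVELS:
        values = []
        for row_source, term_type, value in rows:
            if row_source == source and term_type == level and value not in values:
                values.append(value)
        if values:
            parts.append("; ".join(values))
    return ", ".join(parts)


print(assemble_spn(ROWS))
```

Under this rule, the two conflicting level-2 values both survive into the result, which is how a coastal locality can end up paired with an inland shire.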

Given some indicator, I could be selective about which data are used to assemble the SPN - maybe SeaVoX, or SeaVoX+EEZ, or WHATEVER is more appropriate for this locality (and potentially something else for some other category of locality).
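
A minimal sketch of that selectivity, assuming a hypothetical locality "category" value and a hypothetical mapping from category to the term types worth using. The rows come from the query output above; the category names and the mapping are made up.

```python
# Subset of the place_terms output above: (source, term_type, term_value).
PLACE_TERMS = [
    ("Google API", "administrative_area_level_2", "City of Karratha"),
    ("Google API", "administrative_area_level_1", "Western Australia"),
    ("Google API", "country", "Australia"),
    ("GBIF API", "EEZ", "Australian Exclusive Economic Zone"),
    ("GBIF API", "IHO", "Indian Ocean"),
    ("GBIF API", "SeaVoX", "INDIAN OCEAN"),
    ("Google API", "administrative_area_level_2", "Shire of Leonora"),
]

# Hypothetical: which term types to trust for each locality category.
TERM_TYPES_BY_CATEGORY = {
    "marine": ["SeaVoX", "EEZ"],
    "terrestrial": [
        "country",
        "administrative_area_level_1",
        "administrative_area_level_2",
    ],
}


def terms_for_category(rows, category):
    """Pick term values in the order the category's term types are listed."""
    picked = []
    for wanted_type in TERM_TYPES_BY_CATEGORY[category]:
        for _source, term_type, value in rows:
            if term_type == wanted_type and value not in picked:
                picked.append(value)
    return picked


print(terms_for_category(PLACE_TERMS, "marine"))
```

With a marine category, the inland Shire of Leonora never enters the result because administrative term types are simply not consulted.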

That's a poor replacement for "smarter" services (eg one that could take WKT and return things based on that, rather than a point), but as far as I know those don't exist and we could do this now.

How the webservices work is changing a bit (unless @mkoo has a dramatic change of heart!). This is running in test, and will probably be in production tonight. The data will take some time to catch up.

GeoLocate is now the primary source of coordinate-from-text data, and it generally returns NULL (translation: "I have no idea what you're talking about") for variations of No specific locality recorded. When I get a NULL return from GeoLocate I replace locality with the most precise available term from geography, which I think generally all comes together as an accidentally more sophisticated way of ignoring No specific locality recorded. (I increment to the next geography "field" if that doesn't work, see below.)

"Most precise available term from geography" is currently "feature,quad,island,island_group,drainage,sea". Geography isn't very consistent at scale, and I don't think there's a "correct" ordering of those terms (or other stuff in the table), but I can easily rearrange them if anyone has better ideas.

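The fallback described above can be sketched as follows. The `geolocate` argument is a stand-in callable that returns coordinates or `None`; the real service call is not shown in the thread, and the function name, return shape, and sample coordinates are assumptions.

```python
# Field order as stated above; easy to rearrange.
FALLBACK_ORDER = ["feature", "quad", "island", "island_group", "drainage", "sea"]


def lookup_with_fallback(locality_text, geography, geolocate):
    """Try the locality text first, then each geography field in order.

    Returns (coordinates, field_used), or (None, None) if nothing resolves.
    """
    result = geolocate(locality_text)
    if result is not None:
        return result, "specific_locality"
    for field in FALLBACK_ORDER:
        term = geography.get(field)
        if not term:
            continue  # field is empty for this geography, move to the next one
        result = geolocate(term)
        if result is not None:
            return result, field
    return None, None


# Stand-in for the service: a lookup table with made-up coordinates.
known_places = {"Dampier Archipelago": (-20.5, 116.6)}
fake_geolocate = known_places.get

geography = {"island_group": "Dampier Archipelago", "sea": None}
print(lookup_with_fallback("No specific locality recorded.", geography, fake_geolocate))
```

Returning the field that produced the match keeps the provenance visible, which fits the more explicit source labelling described next.
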
I am now being more explicit in source. The locality detail page now looks like...

Screen Shot 2021-05-28 at 8 41 06 AM

note "asserted" (from curatorially-supplied coordinates) and "derived" (from coordinates I've produced from the text data).

The catalog record now looks like...

Screen Shot 2021-05-28 at 8 42 08 AM

  • The label is more distinct from verbatim locality
  • There's a distinct style (easy to change, should be developed and applied to all non-asserted data)
  • There's a mouseover with an explanation (also easy to change)

I don't think any of this is incompatible with the idea of "categorizing" localities (from a couple of comments up); that would add another dimension to what we can use to detect conflicting data, and would still be useful (eg in ignoring terrestrial, overly precise, whatever terms) if we do want to assert a "standardized" place name at some point.
