Arctos: Locality character conversion issues

Created on 12 May 2020  ·  54 Comments  ·  Source: ArctosDB/arctos

Documentation is http://handbook.arctosdb.org/documentation/encoding.html

Suggest we move the NOPRINT check to a function and add "not contains �" to all possible free-text fields. The records below will need to be cleaned up first.
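A minimal sketch of what such a shared check might look like, written in Python purely for illustration. The function name `has_bad_characters` is hypothetical, and the definition of "non-printing" used here (Unicode control characters other than tab, newline, and carriage return) is an assumption; the actual NOPRINT check in Arctos may be defined differently.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # the � left behind by a failed character conversion


def has_bad_characters(text: str) -> bool:
    """Return True if text contains U+FFFD or a non-printing control character.

    Assumption: "non-printing" means Unicode category Cc, excluding tab,
    newline, and carriage return. The real NOPRINT rule may differ.
    """
    for ch in text:
        if ch == REPLACEMENT_CHAR:
            return True
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            return True
    return False
```

Keeping both rules in one function would mean every free-text field is held to the same standard, which is the point of the suggestion above.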

select guid_prefix, count(*) c
from collection
inner join cataloged_item on collection.collection_id = cataloged_item.collection_id
inner join specimen_event on cataloged_item.collection_object_id = specimen_event.collection_object_id
inner join collecting_event on specimen_event.collecting_event_id = collecting_event.collecting_event_id
inner join locality on collecting_event.locality_id = locality.locality_id
where spec_locality like '%�%'
group by guid_prefix
order by count(*);

GUID_PREFIX                               C
------------------------------------------------------------ ----------
UMNH:Teach                                1
ALMNH:ES                                  1
UTEP:Mamm                                 1
HWML:Para                                 1
MVZ:Egg                                   1
UCM:Bird                                  2
DMNS:Bird                                 3
NMMNH:Ento                                3
UTEP:Teach                                3
UAM:Herb                                  4
UTEPObs:Herp                              5
CHAS:Mamm                                 9
UTEP:Herp                                 9
UAM:Alg                                   9
UTEP:Inv                                 10
UAM:Inv                                  11
CHAS:Bird                                12
UWBM:Herp                                16
MSB:Para                                 17
MSB:Bird                                 21
MSB:Host                                 22
UNR:Herp                                 41
UTEP:Ento                                43
MSB:Fish                                 75
MSB:Mamm                                110
NMMNH:Mamm                              110
UAMb:Herb                               128
ASNHC:Herp                              143
MSB:Herp                                159
ASNHC:Mamm                             1312

Function-LocalitEvenGeoreferencing NeedsDocumentation Priority-Critical

All 54 comments

I have attempted to assign people who are responsible for these collections.

Put %�% in specific locality and search your collection to find what needs fixing.

UTEP:Teach corrected

ALMNH:ES corrected

Put %�% in specific locality and search your collection to find what needs fixing.

Yup, or let me know if you need some other query.

Is there a faster way of editing the localities? Or is it one by one?
Also in Verbatim locality and locality remarks

Editing is one by one, unfortunately....

We can make a list and work on them together. Or I can give Paula access and she can help - as long as she knows exactly what to do.

If there's some pattern, I can mass-update (post-Postgres). "Everything" is my ideal filter for that, but I can do it for smaller sets as well; I just need to know how to find them and what to replace them with.

The main problem I'm seeing for us is the TRS data: we won't know what it's supposed to be unless we go through the records one by one and edit them. Does this have to get fixed now, for the transfer into Postgres, or is it something that could be fixed along with georeferencing?

MSB will need to wait until after PG because we have a couple hundred of
these.

I have 611 just for NMMNH:Ento when I put %�% into Any Geography Term

Yes, it's fine to wait for PG, and I'm going to be extremely hesitant to do anything in Oracle anyway.

These will still be OK as data, it just won't be possible to save them - which will eventually break some script and I'll be forced to replace them with "[funky unicode fail diamond thingee was here]" or something equally annoying....

611 just for NMMNH:Ento

You'll only need to fix localities, not each specimen. Looks like you fixed it while I was typing?

select guid_prefix, count(*) c
from collection
inner join cataloged_item on collection.collection_id = cataloged_item.collection_id
inner join specimen_event on cataloged_item.collection_object_id = specimen_event.collection_object_id
inner join collecting_event on specimen_event.collecting_event_id = collecting_event.collecting_event_id
inner join locality on collecting_event.locality_id = locality.locality_id
where spec_locality like '%�%'
group by guid_prefix
order by guid_prefix;

GUID_PREFIX                               C
------------------------------------------------------------ ----------
ASNHC:Herp                              143
ASNHC:Mamm                             1312
CHAS:Bird                                12
CHAS:Mamm                                 9
DMNS:Bird                                 3
HWML:Para                                 1
MSB:Bird                                 21
MSB:Fish                                 75
MSB:Herp                                159
MSB:Host                                 22
MSB:Mamm                                110
MSB:Para                                 17
MVZ:Egg                                   1
NMMNH:Inv                                25
NMMNH:Mamm                              110
UAM:Alg                                   9
UAM:Herb                                  4
UAM:Inv                                  11
UAMb:Herb                               128
UCM:Bird                                  2
UMNH:Teach                                1
UNR:Herp                                 41
UTEP:Ento                                43
UTEP:Herp                                 9
UTEP:Inv                                 10
UTEP:Mamm                                 1
UTEPObs:Herp                              5
UWBM:Herp                                16


Oh.

Any Geography Term

Here's verbatim - we should get them too.

GUID_PREFIX                               C
------------------------------------------------------------ ----------
ASNHC:Herp                               27
ASNHC:Mamm                             2296
CHAS:Bird                                 1
CHAS:Mamm                                 8
DMNS:Bird                               161
DMNS:Inv                                  3
HWML:Para                                 1
MSB:Bird                                 87
MSB:Fish                                 53
MSB:Herp                                 80
MSB:Host                                  9
MSB:Mamm                                156
MSB:Para                                  2
MVZ:Bird                                  2
NMMNH:Ento                              606
NMMNH:Inv                               375
NMMNH:Mamm                              156
UAM:Alg                                   9
UAM:Herb                                  4
UAM:Inv                                  11
UAM:Mamm                                  2
UAMObs:Ento                               1
UAMb:Herb                               163
UCM:Bird                                  2
UCM:Mamm                                 14
UCM:Obs                                   2
UMNH:Teach                                1
UNR:Fish                                  1
UNR:Herp                                 41
UTEP:Bird                                 1
UTEP:ES                                2520
UTEP:Ento                               133
UTEP:Herb                                29
UTEP:Herp                               104
UTEP:HerpOS                               7
UTEP:Inv                                104
UTEP:Mamm                                 4
UTEP:Teach                                1
UTEP:Zoo                                  1
UTEPObs:Herp                             29
UWBM:Herp                                17

I'll try and at least get the specific locality ones done now as it doesn't contain counties and TRS data, which is where a lot of errors are coming from. But there are a lot of these issues in Verbatim Locality and Locality Remarks

Locality Remarks

select guid_prefix, count(*) c
from collection
inner join cataloged_item on collection.collection_id = cataloged_item.collection_id
inner join specimen_event on cataloged_item.collection_object_id = specimen_event.collection_object_id
inner join collecting_event on specimen_event.collecting_event_id = collecting_event.collecting_event_id
inner join locality on collecting_event.locality_id = locality.locality_id
where locality_remarks like '%�%'
group by guid_prefix
order by guid_prefix;



GUID_PREFIX                               C
------------------------------------------------------------ ----------
ALMNH:ES                                  1
MSB:Mamm                                558
NMMNH:Ento                               20
NMMNH:Herb                                8
NMMNH:Mamm                              557
UAM:Alg                                 186
UCM:Bird                                  3
UCM:Herp                                 18
UCM:Mamm                                  3
UMZM:Mamm                                 1
UTEP:Ento                               289
UTEP:Herb                                19
UTEP:Herp                               520
UTEP:HerpOS                               5
UTEP:Inv                                 14

I have 611 just for NMMNH:Ento when I put %�% into Any Geography Term

A lot of these are duplicates (13 specimens share a locality and collecting event) so it is less than you think, but still a lot of work. For these kinds of things, I like to tackle 5 a day until they are done. Although I also end up getting on a roll and end up spending an hour so that I can finish up the 56 from some specific collection.

The only issue is that until we fix what is there and Dusty can change data validation, people can continue to create more.....

The only issue is that until we fix what is there and Dusty can change data validation, people can continue to create more.....

So until everyone fixes everything in specific locality, verbatim locality, and locality remarks, we can't change data validation?

I believe that the NMMNH and MSB Mamm localities are the same. These are
all likely Dave Hafner's Mexico material, probably all enyes. So we only
have to fix 557 between us.

Probably best to have the exact text characters in the verbatim but not in the specific locality for search reasons. I think there is a mix of letters with accents.


Jonathan L. Dunnum Ph.D.
Senior Collection Manager
Division of Mammals, Museum of Southwestern Biology
University of New Mexico
Albuquerque, NM 87131
(505) 277-9262
Fax (505) 277-1351

MSB Mammals website: http://www.msb.unm.edu/mammals/index.html
Facebook: http://www.facebook.com/MSBDivisionofMammals

Shipping Address:
Museum of Southwestern Biology
Division of Mammals
University of New Mexico
CERIA Bldg 83, Room 204
Albuquerque, NM 87131


For the Hafner material, we at least have an original spreadsheet with the
actual values. What about pulling these from the global bulkload file
archive? If we have everything that was ever bulkloaded, we should be able
to find the original locality values and replace the invalid characters
with the original.

letters with accents.

See http://handbook.arctosdb.org/documentation/encoding.html

Accents (hieroglyphics, cyrillic, kanji, whatever) are fine.

All of that stuff HTML-encoded is acceptable, but not searchable.

This problem - some character or characters replaced by a 'I have no idea what you mean' unicode character - comes about when you have those in some non-UTF encoding and your editor doesn't properly convert them to UTF before they're loaded to Arctos.
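To make the mechanism concrete, here is a small Python sketch of how a single accented letter turns into �. It assumes the source file was saved in Windows-1252 (`cp1252`), which is a guess about the spreadsheet in question and not something confirmed in this thread; the helper name `misread_as_utf8` is hypothetical.

```python
def misread_as_utf8(text: str, legacy_encoding: str = "cp1252") -> str:
    """Simulate saving text in a legacy encoding and then reading it as UTF-8."""
    # Step 1: the editor writes the text using a one-byte-per-character code page.
    legacy_bytes = text.encode(legacy_encoding)
    # Step 2: the loader assumes UTF-8. Bytes that are not valid UTF-8 cannot
    # be interpreted, so the decoder substitutes U+FFFD for them.
    return legacy_bytes.decode("utf-8", errors="replace")


print(misread_as_utf8("8 mi. S, 3 mi. W La Purísima"))
# prints: 8 mi. S, 3 mi. W La Pur�sima
```

Once the substitution has happened the original byte is gone, which is why the correct letter cannot be recovered from the stored value alone.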

@campmlc I can look but I think it's unlikely this happened after bulkloading.

It happened during bulkloading. I have also gone in and made various edits based on fixes Dave sent me directly after he found the georeferencing problems from Steven's (volunteer) re-georeferencing. He had various locality changes based on the data from his catalogs, said he wasn't sure when the errors got introduced but most were there when we got the original data from Patty.


Just looked at the original spreadsheet used for bulkloading.
This is a locality from that file: 8 mi. S, 3 mi. W La Purísima
This is the same locality after being bulkloaded into Arctos and redownloaded (there is a ? in the blank space): 8 mi. S, 3 mi. W La Pur sima

That is from MSB:Mamm:274283

@campmlc is that CSV or something proprietary?

In either case, what character encoding is used?

UAM@ARCTOS> select spec_locality from bulkloader_deletes where spec_locality like '8 mi. S, 3 mi. W La Pur%';

SPEC_LOCALITY
------------------------------------------------------------------------------------------------------------------------
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima
8 mi. S, 3 mi. W La Pur�sima

7 rows selected.

@dustymc Here are some you can bulk edit?

All verbatim locality for events in locality nickname UTEP:ES:Site 21

replace "Do�a Ana" with "Doña Ana"

Is that do-able?

Is that do-able?

Easily, but let's wait until we're in a less-meltable environment?

CSV.
There are likely georeferencing problems.

MVZ:Egg fixed for spec locality, MVZ:Bird fixed for verbatim locality. I think that's it for MVZ, but let me know if I'm missing something. Thanks.

NMMNH:Inv; NMMNH:Mamm; NMMNH:Ento specific localities done

Do�a Ana

I just got all of the unverified events while I was in there, here's the original if someone wants me to undo something.

temp_donaana.csv.zip

 update collecting_event set verbatim_locality=replace(verbatim_locality,'Do�a Ana','Doña Ana') where verbatim_locality like '%Do�a Ana%' and collecting_event_id not in (select collecting_event_id from specimen_event where VERIFICATIONSTATUS='verified and locked');

198 rows updated.

YASSSSSS! Thanks!

OK, here is another possible batch.

All named localities starting with UTEP:ES:Site

that have 34� in verbatim locality, replace with 34°

also

that have 106� in verbatim locality, replace with 106°

I try to read and pick the right thing - it doesn't always work, but dang it, I do try!

So say we all!

FYI my usual go-to for that is http://www.fileformat.info/info/unicode/char/search.htm?q=%C2%B0&preview=entity

And now for the rest of that thought: I wonder if we can and/or should block some of those? Do we really need to accept ᵒ and &deg; and &#176; and <sup>o</sup> and the bajillion other ways to make something that sorta looks like a degree symbol? If we should filter, is blocking them worth the investment - does it MATTER that "34o" is slightly less searchable in a field that's fundamentally not searchable, or is that vastly outweighed by the work to clean data? If you've found the specimen-or-whatever it still adequately conveys the idea to humans - is doing more worth the effort?
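For a sense of what filtering or normalizing those variants would involve, here is a rough Python sketch. The function name `normalize_degrees` and the short look-alike list are illustrative assumptions, not anything Arctos does today. Note that `html.unescape` converts every HTML entity in the string, not only the degree ones, so this would need care in fields where entities are stored on purpose.

```python
import html
import re

DEGREE = "\u00b0"  # ° DEGREE SIGN

# Characters sometimes typed where a degree sign was meant.
# Illustrative only; the real list would be longer.
LOOKALIKES = {
    "\u1d52": DEGREE,  # ᵒ MODIFIER LETTER SMALL O
    "\u02da": DEGREE,  # ˚ RING ABOVE
}


def normalize_degrees(text: str) -> str:
    """Map a few common degree-sign stand-ins to U+00B0."""
    # Treat <sup>o</sup> as a degree sign only when it directly follows a digit.
    text = re.sub(r"(?<=\d)<sup>o</sup>", DEGREE, text, flags=re.IGNORECASE)
    # &deg; and &#176; both unescape to U+00B0 (as do all other entities).
    text = html.unescape(text)
    for lookalike, replacement in LOOKALIKES.items():
        text = text.replace(lookalike, replacement)
    return text
```

A plain "34o" is deliberately left alone, since there is no safe way to tell a letter o from an intended degree sign.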

I tried using the Unicode and it failed to do anything. I just ended up with "34U+00B0" so I used the HTML instead. If we are going to select one, let's make sure it is one that ends up being readable even if it isn't searchable.

Unicode and it failed to do anything.

In what context?

http://test.arctos.database.museum/editLocality.cfm?locality_id=84325

[Screenshot: Screen Shot 2020-05-21 at 12 10 19 PM]

ends up being readable

Also depends on context - eg, the HTML looks like &deg; or &#176; or <sup>o</sup> or whatever in many views (CSV probably most relevant here).

[image]

Changed to "ñ" and now it's
[image]

@dustymc here are some bulk replacements you can make.

In Specific locality:
M�zquiz = Múzquiz
38�#0' = 38°#0'
26� ENE = 26° ENE
Do�a Ana = Doña Ana
130� = 130°
Ca�oncito = Cañoncito
Volc�n Po�s = Volcán Poás
Ca�on = Cañon
Mayag�ez = Mayagüez

In verbatim locality
130� = 130°
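A list like this maps naturally onto a lookup table. The Python sketch below shows one way to apply it to a single value; the function name `apply_replacements` is hypothetical, and only pairs taken from the list above are included. It is an illustration of the idea, not the update that was run against the database.

```python
BAD = "\ufffd"  # the � replacement character

# Known-bad substring -> corrected substring, taken from the list above.
REPLACEMENTS = {
    f"M{BAD}zquiz": "Múzquiz",
    f"Do{BAD}a Ana": "Doña Ana",
    f"Ca{BAD}oncito": "Cañoncito",
    f"Volc{BAD}n Po{BAD}s": "Volcán Poás",
    f"Ca{BAD}on": "Cañon",
    f"Mayag{BAD}ez": "Mayagüez",
    f"26{BAD} ENE": "26° ENE",
    f"130{BAD}": "130°",
}


def apply_replacements(value: str, replacements: dict = REPLACEMENTS) -> str:
    """Apply every known fix to value, leaving unknown � characters untouched."""
    # Longest patterns first, so a short pattern never pre-empts a longer one
    # that contains it.
    for bad in sorted(replacements, key=len, reverse=True):
        value = value.replace(bad, replacements[bad])
    return value
```

Any � that survives this pass is one nobody has identified yet, which makes the leftovers easy to list for manual review.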

Can we deal with this systematically in https://github.com/ArctosDB/arctos/issues/2678 instead of sniping away at problems which immediately return if we let them?

I got Do�a Ana on May 20, it's apparently back.

it's apparently back.

Yeah - I thought these were getting caught at data entry - not true?

I've been working on these - should I stop or keep going?

Keep going!

OK, I have edited all of the localities with the � that I can. The remaining 61 I cannot determine what the replacement character(s) should be. I vote that we replace with [?] and STOP THIS MADNESS.

agree.

I agree and am ok with the replacement

STOP THIS MADNESS.

See https://github.com/ArctosDB/arctos/issues/2678

2 possibilities for madness-stopping

  1. Replace � with something else everywhere, update the function. This would be an entirely consistent solution - � would be banned from all of Arctos, yay everybody. We'd also end up with a lot of replacement "something" ([?] is nice) in various places.

  2. Clean up one field, swap that field, and only that field, to a new check that disallows � (plus whatever the current check does). This would NOT be consistent, would be confusing, drags at least part of this problem out indefinitely, means I have two functions to get confused by. Not so yay, but maybe tolerable.

I'm a big fan of (1) but reality might not be.
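Option (1) boils down to a one-line substitution applied to every free-text value. A minimal Python sketch, assuming "[?]" is the agreed placeholder and that each � gets its own placeholder (whether runs of � should collapse into one is a choice this thread does not settle); `mark_unknown` is a hypothetical name:

```python
REPLACEMENT_CHAR = "\ufffd"  # the � replacement character
PLACEHOLDER = "[?]"


def mark_unknown(text: str) -> str:
    """Replace every U+FFFD with a visible, searchable placeholder."""
    return text.replace(REPLACEMENT_CHAR, PLACEHOLDER)


print(mark_unknown("8 mi. S, 3 mi. W La Pur\ufffdsima"))
# prints: 8 mi. S, 3 mi. W La Pur[?]sima
```

After a pass like this, a check that rejects � can be switched on everywhere at once, because no existing value would fail it.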

Sheesh - where else do we have �?

I'm fine with the replacement. If you send a list maybe I can fit a few more.


where else do we have �?

https://github.com/ArctosDB/arctos/issues/2678#issuecomment-729226025

From Excel, save your bulkload csv files as UTF-8! Spread the word and document!
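For files that were already exported in some other encoding, a quick check before bulkloading can catch the problem early. The Python sketch below is one possible approach; the function names are hypothetical, and the default source encoding of Windows-1252 is an assumption that should be confirmed for each file before converting.

```python
def is_valid_utf8(path: str) -> bool:
    """Return True if the file's bytes decode as UTF-8 without errors."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Re-save src as UTF-8, assuming it was written in source_encoding."""
    # newline="" keeps the CSV's original line endings intact.
    with open(src, "r", encoding=source_encoding, newline="") as infile:
        text = infile.read()
    with open(dst, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
```

If `is_valid_utf8` returns False, the file should be re-exported from Excel as UTF-8 or converted, and then spot-checked by eye, since a wrong guess at the source encoding yields wrong letters without raising an error.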

@ebraker add end screen to tutorials and @Jegelewicz will peruse documentation to add the above.
