Arctos: Clean up sex attribute table

Created on 11 Mar 2021 · 36 Comments · Source: ArctosDB/arctos

In addition to #1237 we need to deal with the following terms in the sex code table.

SEX_CDE | Documentation
-- | --
female ? | The examiner believes the specimen to be a female, but is uncertain.
male ? | The examiner believes the specimen to be a male, but is uncertain.
sexes mixed | Lot contains individuals of both sexes.

Suggestions?

CodeTableCleanup Function-CodeTables Function-ObjectRecord Priority-Critical

All 36 comments

Suggestions?

female ?-->unknown + remark="possibly female"
male ?-->unknown + remark="possibly male"
sexes mixed-->unknown + remark or something, I suppose. ("Both" in reference to 9 values can't make too much sense, can it?!)

sexes mixed-->unknown + remark or something, I suppose. ("Both" in reference to 9 values can't make too much sense, can it?!)

How about "sexes mixed" in the remark - this would make it possible to find and then add two, five, or however many "sexes" are represented if one so desired...

female ?-->unknown + remark="possibly female"
male ?-->unknown + remark="possibly male"

That doesn't really represent an unknown though. It represents a sex determination that the examiner isn't really positive about. We use that when we have things like adult male plumage, but we cannot find the gonads.
That's really different from where I would mark it as unknown, where I have no idea and no hints.

Would we lose data by marking them as unknown and being more conservative?

Maybe it would be better to mark them as:
female ? ---> female + remark = "sex determination has question"
male ? ---> male + remark = "sex determination has question"

adult male plumage

We are severely under-utilizing method.

we cannot find the gonads

Add a second determination - sex=unknown, method=cannot find the gonads.

If we really want to be "research grade," we have to find a way to shift our entire mindset from "it's a boy" to "AGENT on DATE using METHOD thinks it's a boy." (And then somehow get researchers to follow....)
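The "AGENT on DATE using METHOD" idea can be sketched as a small data structure. This is only an illustration: the class name `SexDetermination`, its field names, and the agent name are hypothetical and are not taken from the Arctos schema. It shows the plumage-but-no-gonads case recorded as two separate determinations rather than as a single "male ?" value.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SexDetermination:
    """One assertion about sex: who said what, when, and on what evidence.

    Hypothetical structure for illustration only; not the Arctos schema.
    """
    value: str                          # e.g. "male", "female", "unknown"
    method: str                         # the evidence behind the assertion
    agent: Optional[str] = None         # who made the determination, if known
    determined_date: Optional[date] = None


# The "adult male plumage, but we cannot find the gonads" case from the thread,
# recorded as two determinations on the same specimen.
determinations = [
    SexDetermination("male", "adult male plumage",
                     "hypothetical preparator", date(2021, 3, 11)),
    SexDetermination("unknown", "cannot find the gonads",
                     "hypothetical preparator", date(2021, 3, 11)),
]


def values_with_methods(dets: List[SexDetermination]) -> List[Tuple[str, str]]:
    """Return (value, method) pairs so a reader can weigh each assertion."""
    return [(d.value, d.method) for d in dets]
```

A user who only wants "someone thinks it's a male" can read the first value; someone digging deeper sees both assertions and the evidence behind each.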

Maybe we need to have a rethink on the definition of how we determine sex?

For museum specimens this has always been two things for me:

  1. a historic tag or notes records the sex
  2. we find the gonads while prepping an animal

But what about observation data or data without a vouchered specimen? When banding a bird I record male when the measurements, plumage, or behavior fall under a key for male. But I don't open the bird up to peek at the gonads.

How do we make the two data sets align and make sure you are searching for sex that is recorded to the best of our ability under the current research parameters?

historic tag or notes

Method! If you happen to know and are willing to record it, who and when wrote the tag would be cool and useful; if you can't/won't then "tag" still seems infinitely better than nothing to me.

observation data

Method! Some way of sorting out "according to this blurry camera trap picture, ...." and "Some known ornithologist dug around in there on DATE (at which time they had 20 years of relevant experience) and ...." seems pretty useful to me.

make the two data sets align

Not our problem (just because I don't think we can do anything useful beyond recording what we know). For most questions, "someone thinks it's a male" is probably a sufficient answer anyway. Someone REALLY looking probably expects to have to dig a bit; we can at least give them everything we know in one place, not buried in "specimen remarks" with 38 other kinds of data. (And we can add their interpretation back as another determination.)

If they knew it was a possibility, maybe they'd even help make media for those tags in some way.

Not our problem (just because I don't think we can do anything useful beyond recording what we know). For most questions, "someone thinks it's a male" is probably a sufficient answer anyway. Someone REALLY looking probably expects to have to dig a bit; we can at least give them everything we know in one place, not buried in "specimen remarks" with 38 other kinds of data. (And we can add their interpretation back as another determination.)

This is why I think we can't say:
female ? ---> unknown
male ? ----> unknown
Switching them to unknown is losing recorded data. They are not unknown. They are probably one sex or the other.

Gotcha. I was just suggesting a "safe" (=doesn't make any unfounded assertions) migration path. I'm totally fine with some other approach, either for everything or by collection or WHATEVER. If you think "female ?" should be "female" (plus remarks or something) then I do too; I'll get behind about anything that moves us towards cleaner data!

@ccicero may have an opinion as well. She led the discussion on cleaning up the GitHub bird data at the workshop.

Would we lose data by marking them as unknown and being more conservative?

Maybe it would be better to mark them as:
female ? ---> female + remark = "sex determination has question"
male ? ---> male + remark = "sex determination has question"

That is probably a better path, but instead of "sex determination has question" I suggest "determination has low confidence".

Do we need "confidence" for attributes like we have for identifications?

Do we need "confidence" for attributes like we have for identifications?

I don't think so. You can assess yourself via remarks, everyone else can assess you via method+agent/date. MAYBE there's some small bit of usefulness in there, but it would be a huge change in code and work required - I don't think that balances out.

I think it might be best to be conservative here and go with "unknown" and then put the other legacy data in another field.
If the user is someone doing searches from GBIF or VertNet in order to just pull all of one sex for some reason they won't see the low confidence and could get specimens which aren't the correct sex.
If the user is doing work at the individual specimen level they will be going into the record itself where all the information on confidence or ambiguity of the determination is there. They can then make a judgment on whether or not to use the data.


HMMM - maybe this should be a collection by collection decision? Although I agree with @jldunnum that @ewommack's option could mislead people at the aggregators....

collection by collection

I have no problem with that - it's not ideal for users, but neither is what we're starting with.

I suspect these data vary from "IDK, maybe...." to "we're not 100% positive...." across time, collections, people, taxa, etc., etc., etc. - I doubt there is one true answer, whatever we do at least stops more of that.

I think getting to a place from where we are producing better data should outweigh about anything else.

Remarks and method are 4000 character fields (and could be bigger if needed) - we can be VERY verbose if that somehow facilitates this.

Although I agree with @jldunnum that @ewommack's option could mislead people at the aggregators....

I feel like I need to hear from someone who might work with the dataset. Is it better to have the data in there that has a medium level of confidence, or better to just throw it out? It is going to be a small part of the data set.

I think I also keep getting tied up with the different levels of confidence I apply between live trapping animals and museum specimens. The choice eventually is going to be different no matter what by the collection.
A banding station's value for male would equal the same as our bird collections male ?, just because of the difference in how they determine the sex.

A banding station's value for male would equal the same as our bird collections male ?, just because of the difference in how they determine the sex.

And that would be covered if an appropriate method was applied to the attribute.

I agree with Jon that method and remarks would be difficult to parse for aggregators and also for Arctos, because we can't currently download attribute remarks in a usable format. We are losing information if we go this route. Why not keep "female ?", but start advocating the use of method, date, and determiner more rigorously?

Why not keep "female ?", but start advocating the use of method, date, and determiner more rigorously?

Because, as @ccicero pointed out, female ? is not a sex attribute, it is a question.

Also, as long as that term is in the code table, people will use it - documentation be damned!

"possible female"? That contains more info than unknown. It flags the record as requiring further scrutiny. "Unknown" does not.

Sorry, just jumping in here. I agree that we don't want to lose information, and 'female ?' has more information to me than just 'unknown' - yet from our workshop, sex concepts should be restricted to what is a sex: female, male, atypical, unknown.

There are three options for 'unknown' in Arctos - unknown, recorded as unknown, and not recorded. These have different meanings, at least how we use them:
recorded as unknown --> someone tried but couldn't sex the animal
not recorded ---> have data but there's nothing about sex
unknown ---> no data (notes etc.) to indicate anything about sex

These should be combined into a single unknown, but with at least "recorded as unknown" or "not recorded" going into attribute remarks.

gynandromorph, hermaphrodite ---> ATYPICAL with that value in the remarks field.

female ? and male ? ---> in the workshop mappings, we put these as 'FEMALE' and 'MALE', which I think is better than just unknown as the latter is less informative, but we need to make the uncertainty known. I like the idea of a confidence score for sex, or we could put the uncertainty in remarks, although I know that wouldn't get mapped to aggregators or downloaded. Still, if someone is looking for females, they may want to look at those specimens and may have better methods (e.g., DNA, size) for confirming the sex. We just need a way of downloading the attribute remarks in a usable way, and also the determination method, which I agree should be used more. What about having a controlled vocab for determination method, with anything else that's free text going in remarks - e.g., gonads, phenotype, genetic, behavior... Re: aggregators, the non-controlled sex values (other than the few concepts which get mapped to SEX) should go in DYNAMIC PROPERTIES.

mixed - in the workshop, we mapped those as ATYPICAL and the details again can go in remarks or for determination method, we have another controlled value of 'mixed'

Here is the file with our concept list and mappings for sex from the workshop.

'female ?' has more information to me than just 'unknown'

It's not usable information though - it's not "research grade." That's clearly demonstrated above, where a bird in one situation (banding, from examining characteristics) would get "female" and a bird in another (lab, gonads can't be found) would receive "female ?" The evidence is the same, the results are different.

This is a proposal to put those data into a usable place (method - not remarks!).

confidence score

Method is a USEFUL (if occasionally complicated) confidence score. "I'm sure this is a female" (because I just got this job banding and it doesn't look like a male!) and "I'm sure this is a female" (because I'm an experienced ornithologist and I'm looking right at the female-bits) is just a more complicated way of staying where we are, not producing research-grade data.

downloading the attribute remarks in a usable way

They are. If they're not for you, tell me what you want (and perhaps provide the resources I'll need, depending on what that is) in another Issue. Even if true, I do not think we should allow things like this to distract us from creating research-grade data.

controlled vocab for determination method,

The goal should not be to do a researcher's work for them, but to provide them data from which they can confidently make their own categorizations as needed, using whatever tools they wish. A controlled vocabulary cannot do that in sufficient detail; those data can be useful only to the levels of sophistication we've baked in, which would be very low.

DYNAMIC PROPERTIES

Let's keep that a separate discussion. DWC should not drive what we do, and I can very easily change how we map things to DWC.

ATYPICAL

I think that should also be a separate conversation - I'm not sure what's atypical for birds is also atypical for other collections in Arctos, and I don't think that conversation should distract us from where most of the subpar data production is happening.

I guess I am still very leery about just assigning confidence values (for identifications too) in that they are subjective and there are a lot of very confident idiots in this world. 😉
Maybe if we utilize methods those could auto generate a confidence value? For example, gonad examination would generate a "high confidence" value. Might need to have varied methods for different collection types though.

Bottom line is that research grade data should mean that the data are unambiguous. We need clear parameters so that if a researcher only wants data on females, they can filter and be 100% sure they are getting only females. But using a slightly less rigorous filter can also get those specimens which have a better than 50% chance of being females.



if a researcher only wants data on females, they can filter and be 100% sure they are getting only females. But using a slightly less rigorous filter can also get those specimens which have a better than 50% chance of being females.

But that is just another kind of confidence meter? I think I am in agreement with @dustymc on this one. If we add appropriate information in method, the person who is using the data can determine for themselves what confidence to apply to each determination. I don't think there is any way, without examining specimens yourself, to be 100% sure that what someone said is a female is actually a female.

gonad examination would generate a "high confidence" value

I've been through way too many shrews with Dokuchaev to believe that....

research grade data should mean that the data are unambiguous

Fully agreed, but that doesn't have to (and can't) lead to absolute confidence. What we can do is remove the mystery in how we made the determination.

Sex=female, remarks=testes finds 40 records at the moment, which is pretty good but evidence that mistakes and misinterpretations are inevitable. (It's also evidence that we record data in inappropriate fields - "testes" should be in attribute remarks or methods, not in our official junkyard.) "Here's how we got there" in a predictable place is as close to research grade as anyone can realistically expect of us. (That doesn't even require us to change anything about our values, although I agree that we're just adding confusion by having lots of ways of hiding methodology.)
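A search like the "Sex=female, remarks=testes" one above can be sketched against a toy table. This is a minimal illustration using an in-memory SQLite database; the `specimens` table, its columns, and the sample rows are assumptions made up for the example and do not reflect how Arctos actually stores sex or remarks.

```python
import sqlite3

# Hypothetical, simplified table: one row per record, with sex and free-text remarks.
conn = sqlite3.connect(":memory:")
conn.execute("create table specimens (guid text, sex text, remarks text)")
conn.executemany(
    "insert into specimens values (?, ?, ?)",
    [
        ("EX:Mamm:1", "female", "testes 4x2"),      # contradictory: worth a second look
        ("EX:Mamm:2", "female", "2 embryos"),       # consistent
        ("EX:Mamm:3", "male", "Testes scrotal"),    # consistent
    ],
)


def contradictory_females(connection):
    """Return guids recorded as female whose remarks mention testes."""
    rows = connection.execute(
        """
        select guid
        from specimens
        where sex = 'female'
          and lower(remarks) like '%testes%'
        order by guid
        """
    )
    return [row[0] for row in rows]


flagged = contradictory_females(conn)
```

A hit only flags a record for review. It cannot say which of the two values is wrong, which is the argument for recording the evidence in method in the first place.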

add appropriate information in method

... for everything in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctattribute_type, not just sex!

OK, good points Teresa and Dusty. Do you think this will work for the aggregators as well, or will this be another place where our pretty robust data get distilled down and we lose that extra info on that end?


aggregators

It could - I have dynamicProperties mapped to key-value string data, DWC now says they like JSON, we have JSON that contains all of the information, it's trivial to change the mapping.
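As a rough idea of what a JSON dynamicProperties value might carry, the sketch below groups attribute determinations by type and serializes them. The function name `to_dynamic_properties` and the key names (`type`, `value`, `method`, `remark`) are hypothetical; the actual Arctos keys and mapping are not shown in this thread.

```python
import json


def to_dynamic_properties(attributes):
    """Serialize attribute determinations as one JSON object keyed by attribute type.

    Keys are illustrative only. Empty (None) fields are dropped so the
    output stays compact.
    """
    grouped = {}
    for attr in attributes:
        entry = {k: v for k, v in attr.items() if k != "type" and v is not None}
        grouped.setdefault(attr["type"], []).append(entry)
    return json.dumps(grouped, sort_keys=True)


example = [
    {
        "type": "sex",
        "value": "unknown",
        "method": 'Formerly "recorded as unknown"',
        "remark": None,
    },
]

dynamic_properties = to_dynamic_properties(example)
```

Because method travels with the value, an aggregator user who parses the JSON gets the same "how we got there" information as someone looking at the record in Arctos.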

Sorting https://github.com/ArctosDB/arctos/issues/2131#issuecomment-800712123 (our keys are pretty cryptic, they don't have to be) out before changing stuff is probably worthwhile.

Suggest taking a babystep: from https://github.com/ArctosDB/arctos/issues/3516#issuecomment-802913954

  • not recorded--->unknown, method+="There is data in the form of a label or field notes, and there is no mention of sex."
  • recorded as unknown-->unknown, method+="There are data in the form of a label or field notes, and these indicate that the examiner was unable to determine the sex."

which probably means we also need a more comprehensive definition for https://arctos.database.museum/info/ctDocumentation.cfm?table=ctsex_cde#unknown

How about two babysteps?

  • sexes mixed: Lot contains individuals of both sexes.

becomes two attributes:

  • male
  • female

based on "both" in the definition.

sexes mixed: Lot contains individuals of both sexes.

becomes two attributes:

male
female

based on "both" in the definition.

YES to this. But probably should add remark "sexes mixed" for clarity

DLM edit: add "and/or method"

not recorded--->unknown, method+="There is data in the form of a label or field notes, and there is no mention of sex."
recorded as unknown-->unknown, method+="There are data in the form of a label or field notes, and these indicate that the examiner was unable to determine the sex."

which probably means we also need a more comprehensive definition for https://arctos.database.museum/info/ctDocumentation.cfm?table=ctsex_cde#unknown

How about

unknown = "Sex is either not determinable or not recorded/there was no attempt to determine. Remarks and/or method should be used to elaborate."

add remark "sexes mixed" for clarity

-- Back up the rows being migrated; the two inserts below read from this copy.
create table temp_sexmixed as select * from attributes where attribute_type='sex' and attribute_value='sexes mixed';

insert into attributes (
  collection_object_id,
  determined_by_agent_id,
  attribute_type,
  attribute_value,
  attribute_remark,
  determination_method,
  determined_date
) (
  select
    collection_object_id,
    determined_by_agent_id,
    'sex',
    'male',
    concat_ws('; ',attribute_remark,'Formerly "sexes mixed"'),
    determination_method,
    determined_date
  from
    temp_sexmixed
);

insert into attributes (
  collection_object_id,
  determined_by_agent_id,
  attribute_type,
  attribute_value,
  attribute_remark,
  determination_method,
  determined_date
) (
  select
    collection_object_id,
    determined_by_agent_id,
    'sex',
    'female',
    concat_ws('; ',attribute_remark,'Formerly "sexes mixed"'),
    determination_method,
    determined_date
  from
    temp_sexmixed
);

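-- Remove the original 'sexes mixed' rows now that each has a male and a female replacement.
delete from attributes where attribute_type='sex' and attribute_value='sexes mixed';

-- Retire the term from the code table so it can no longer be chosen.
delete from ctsex_cde where sex_cde='sexes mixed';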
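The same split can be rehearsed on a throwaway copy before touching production. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database with a cut-down `attributes` table (agent and date columns omitted), and a CASE expression stands in for `concat_ws`, which not every SQLite build provides. The function name and sample rows are invented for the example.

```python
import sqlite3

NOTE = 'Formerly "sexes mixed"'


def split_sexes_mixed(connection):
    """Replace each 'sexes mixed' sex attribute with one 'male' and one 'female' row."""
    for new_value in ("male", "female"):
        connection.execute(
            """
            insert into attributes (collection_object_id, attribute_type,
                                    attribute_value, attribute_remark,
                                    determination_method)
            select collection_object_id, 'sex', ?,
                   -- append the note, keeping any existing remark
                   case when attribute_remark is null then ?
                        else attribute_remark || '; ' || ? end,
                   determination_method
            from attributes
            where attribute_type = 'sex' and attribute_value = 'sexes mixed'
            """,
            (new_value, NOTE, NOTE),
        )
    connection.execute(
        "delete from attributes "
        "where attribute_type = 'sex' and attribute_value = 'sexes mixed'"
    )
    connection.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table attributes (collection_object_id integer, attribute_type text, "
    "attribute_value text, attribute_remark text, determination_method text)"
)
conn.executemany(
    "insert into attributes values (?, ?, ?, ?, ?)",
    [
        (1, "sex", "sexes mixed", None, None),
        (2, "sex", "sexes mixed", "lot of 12", "label"),
        (3, "sex", "female", None, None),   # untouched by the migration
    ],
)

split_sexes_mixed(conn)
rows = conn.execute(
    "select collection_object_id, attribute_value, attribute_remark "
    "from attributes order by collection_object_id, attribute_value"
).fetchall()
```

Each former "sexes mixed" lot ends up with exactly two sex attributes, and any pre-existing remark is kept in front of the note.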

Code table definition for sex = "unknown" has been updated.

-- Back up the rows being migrated before updating them in place.
create table temp_attr_sx_unk_run as select * from attributes where attribute_type='sex' and attribute_value in ('not recorded','recorded as unknown');

update 
  attributes 
set 
  attribute_value='unknown',
  determination_method=concat_ws('; ',determination_method,'Formerly "not recorded": There is data in the form of a label or field notes, and there is no mention of sex.') 
where 
  attribute_type='sex' and 
  attribute_value='not recorded'
;


update 
  attributes 
set 
  attribute_value='unknown',
  determination_method=concat_ws('; ',determination_method,'Formerly "recorded as unknown": There are data in the form of a label or field notes, and these indicate that the examiner was unable to determine the sex.') 
where 
  attribute_type='sex' and 
  attribute_value='recorded as unknown'
;

delete from ctsex_cde where sex_cde='not recorded';
delete from ctsex_cde where sex_cde='recorded as unknown';
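One check that may be worth running around a cleanup like this is whether any attribute still uses a value that is no longer in the code table. The sketch below rehearses the consolidation on an in-memory SQLite database and then looks for such orphans. The cut-down tables, the function names, and the sample rows are assumptions for illustration, and a CASE expression again stands in for `concat_ws`.

```python
import sqlite3

# Method notes copied from the migration statements above.
METHOD_NOTES = {
    "not recorded": (
        'Formerly "not recorded": There is data in the form of a label or '
        "field notes, and there is no mention of sex."
    ),
    "recorded as unknown": (
        'Formerly "recorded as unknown": There are data in the form of a label '
        "or field notes, and these indicate that the examiner was unable to "
        "determine the sex."
    ),
}


def consolidate_unknown(connection):
    """Fold the two legacy values into 'unknown', appending a note to method."""
    for old_value, note in METHOD_NOTES.items():
        connection.execute(
            """
            update attributes
            set attribute_value = 'unknown',
                determination_method = case
                    when determination_method is null then ?
                    else determination_method || '; ' || ? end
            where attribute_type = 'sex' and attribute_value = ?
            """,
            (note, note, old_value),
        )
        connection.execute("delete from ctsex_cde where sex_cde = ?", (old_value,))
    connection.commit()


def orphaned_values(connection):
    """Sex values used in attributes but absent from the code table."""
    rows = connection.execute(
        """
        select distinct attribute_value
        from attributes
        where attribute_type = 'sex'
          and attribute_value not in (select sex_cde from ctsex_cde)
        order by attribute_value
        """
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("create table ctsex_cde (sex_cde text)")
conn.execute(
    "create table attributes (collection_object_id integer, attribute_type text, "
    "attribute_value text, determination_method text)"
)
conn.executemany(
    "insert into ctsex_cde values (?)",
    [("female",), ("male",), ("unknown",), ("not recorded",), ("recorded as unknown",)],
)
conn.executemany(
    "insert into attributes values (?, ?, ?, ?)",
    [
        (1, "sex", "not recorded", None),
        (2, "sex", "recorded as unknown", "original tag"),
        (3, "sex", "male", "gonads"),
    ],
)

consolidate_unknown(conn)
after = conn.execute(
    "select collection_object_id, attribute_value, determination_method "
    "from attributes order by collection_object_id"
).fetchall()
orphans = orphaned_values(conn)
```

An empty orphan list after the run suggests the code table deletes did not strand any attribute rows.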

Easy stuff is gone, female ? and male ? remain.

Is there any consensus whether those should be "unknown" or "male"/"female"? Either case will involve verbose remarks. Should we flip a coin?

Easy stuff is gone, female ? and male ? remain.

Is there any consensus whether those should be "unknown" or "male"/"female"? Either case will involve verbose remarks. Should we flip a coin?

Now I will inject complexity. Do we actually need "attribute confidence" just as we have with identification?

If no one wants that, I suggest that male ? be changed to male with the remark "sex determination is uncertain, but assumed to be male" and ditto for female ? with appropriate wording. @ccicero
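To make the two candidate mappings easier to compare side by side, here is a small sketch that returns the value and remark each option would produce. It takes no position on which option is right; the function name `propose` and the `keep_sex` flag are hypothetical, and the remark wording is taken from the suggestions in this thread.

```python
# The two remaining uncertain terms and the sex each one leans toward.
UNCERTAIN = {"female ?": "female", "male ?": "male"}


def propose(old_value, keep_sex):
    """Return (new_value, remark) for an uncertain term under one of two options.

    keep_sex=True  -> keep the leaning sex and flag the uncertainty in the remark.
    keep_sex=False -> fall back to 'unknown' and record the leaning in the remark.
    """
    sex = UNCERTAIN[old_value]
    if keep_sex:
        return sex, f"sex determination is uncertain, but assumed to be {sex}"
    return "unknown", f"possibly {sex}"


keep_option = propose("male ?", keep_sex=True)
conservative_option = propose("female ?", keep_sex=False)
```

Either way the uncertainty survives in text; the options differ only in what a filter on the sex value alone would return.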

confidence

I still don't think that can be useful, nor do anything that existing data doesn't. "I think I'm pretty good at this!" (confidence) isn't Research Grade. "I can't find {x} but it has {y} so it's probably female" (method) is.

https://github.com/ArctosDB/arctos/issues/3516#issuecomment-802926571
