Description
Relates to #806 #1440
Dates are stored in a variety of formats, which breaks sorting. There have been a few discussions about date storage (e.g. archive.org, wikidata), and I think we should choose one format, then enforce and encode it canonically moving forward.
Stakeholders
@seabelis, @hornc, @cdrini, @tfmorris, @leadsongdog, @JeffKaplan
Alexis Rossi (archive.org) has suggestions here:
https://docs.google.com/spreadsheets/d/18sYxYd5ZY-gKv55HkUg37fYGHVdPjMQ4RnBOogRk910/edit#gid=0
Sorting should obviously not care what format the date is stored in. All input string formats should be rendered to one consistent stored form _for sorting purposes_. That said, the _stored and displayed_ form should not create falsely precise dates, and should handle real uncertainty in or absence of source dates. Source catalogue records for old books in some cases may not clearly establish even a decade or year: "17??" or "171?" are legitimately distinct from "1700" and should not be forced to an incorrect numeric value.
I would strongly recommend against attempting to invent a new date standard. If ISO 8601:2004 isn't powerful enough, there's EDTF, which is going into the next revision (ISO 8601:2019) and which has a Python package supporting it.
The idea of separate display and sort dates seems attractive on the surface, but one still needs to derive the sort date (which would likely be used not just for sorting, but also for searching and selecting).
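To make the "derive the sort date" point concrete, here is a minimal sketch that turns a year-level EDTF-style string into an (earliest, latest) pair and sorts on that pair, so "17XX" is never forced to 1700. The helper names `year_bounds` and `sort_key` are hypothetical, and it only handles four-character years with trailing `X` digits and an optional `?`, `~` or `%` qualifier; it is not a full EDTF parser.

```python
import re

# Assumption: only four-character years such as "1700", "171X", "17XX", "1973?".
_YEAR = re.compile(r"(\d+X*)[?~%]?")

def year_bounds(value):
    """Return the (earliest, latest) year a year-level value could mean."""
    match = _YEAR.fullmatch(value)
    if match is None or len(match.group(1)) != 4:
        raise ValueError(f"unsupported date: {value!r}")
    digits = match.group(1)
    # An unspecified digit "X" can be anything from 0 to 9.
    return int(digits.replace("X", "0")), int(digits.replace("X", "9"))

def sort_key(value):
    # Sort by earliest possible year; among ties, the more precise value first.
    return year_bounds(value)

dates = ["17XX", "1700", "171X", "1705?"]
print(sorted(dates, key=sort_key))
```

The stored value stays exactly as entered; the bounds are derived on demand, so the same pair could also back range searches.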
Wow, EDTF looks awesome! It supports just about everything I can think of (except different calendar models (e.g. Gregorian), but eh). Here's how it compares:
Class | Draft proposal | EDTF
-- | -- | --
Date and Time | 2017-07-11T18:24:23+00:00 | 2017-07-11T18:24:23Z
Year, Month and Day | 2017-07-11 | 2017-07-11
Year and Month | 2017-07 | 2017-07
Year | 2017 | 2017
BCE dates | 32 BCE | -0032
Approximate dates and ranges | |
No known date | n.d. | ??? Not sure if this can be represented
Approximate date | 1973? | 1973?
Missing digit | 197-? | 197X
Year range, both years known | 2014-2017 | 2014/2017
Year range, one date known and one approximate | 1923-1926? | 1923/1926?
Year range, both dates approximate | 192-?-1933? | 192X/1933?
Year range, one date unknown, one known | ----?-1933 | /1933
BCE date range | 400-372 BCE | -0400/-0372
BCE to CE date range | 32 BCE-41 CE | -0032/0041
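For what it's worth, the CE rows of the table above map mechanically from the draft notation to EDTF. A rough sketch, assuming only the shapes shown in the table (the function name `draft_to_edtf` is made up for illustration). BCE rows are left out because ISO 8601 year numbering includes a year 0000, so that mapping deserves its own check, and "n.d." is left out because the table has no EDTF equivalent for it.

```python
import re

# One year token in the draft notation: "1933", "1933?", "197-?" or "----?".
_TOKEN = r"\d{4}\??|\d{3}-\?|-{4}\?"
_RANGE = re.compile(rf"({_TOKEN})-({_TOKEN})")

def _convert_token(token):
    if token == "----?":
        return ""                  # unknown endpoint is left empty in an interval
    if token.endswith("-?"):
        return token[:-2] + "X"    # missing digit: "197-?" -> "197X"
    return token                   # already identical in both notations

def draft_to_edtf(value):
    """Convert a CE draft-proposal date or year range to EDTF notation."""
    match = _RANGE.fullmatch(value)
    if match:
        start, end = match.groups()
        return f"{_convert_token(start)}/{_convert_token(end)}"
    # Single dates ("2017-07-11", "1973?", "197-?") need at most a digit fix.
    return _convert_token(value)

for draft in ["197-?", "2014-2017", "192-?-1933?", "----?-1933", "2017-07-11"]:
    print(draft, "->", draft_to_edtf(draft))
```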
EDTF certainly looks to be well considered, with the latest adaptations suited for librarianship purposes. The repos at https://github.com/search?l=Python&q=edtf&type=Repositories and https://pypi.org/search/?q=edtf seem not to have caught up yet, but they're already a vast improvement over the current chaos. How hard would it be to adopt?
+1 for EDTF
Fodder for thought:
Question: Does a given field accept any EDTF date representation, or only a subset? Should our UI, especially on user input, steer the user away from certain types of degradation (to keep data quality high)? If you allow the user to enter any type of degraded date, you'll get 'em.
Question: What is the localized presentation of degraded dates? If we're doing a big cutover, can we set ourselves up for ease of localization going forward? Just a thought.
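On the subset question, one option is a whitelist check on input. The sketch below is only an illustration of that idea, not a proposal for the actual subset: it assumes a field accepts CE years, year-months and full dates with an optional `?`, `~` or `%` qualifier, plus decade and century forms like "197X" and "17XX". The name `is_accepted` is hypothetical.

```python
import re
from datetime import date

# Assumed subset: YYYY, YYYY-MM, YYYY-MM-DD, each with an optional qualifier.
_FULL = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?[?~%]?")
# Decade ("197X") or century ("17XX") with unspecified trailing digits.
_PARTIAL_YEAR = re.compile(r"\d{2}(?:\dX|XX)")

def is_accepted(value):
    """Return True if value is inside the assumed accepted subset."""
    if _PARTIAL_YEAR.fullmatch(value):
        return True
    match = _FULL.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Rejects impossible dates such as month 13 or February 30.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True

for candidate in ["1980-05-10", "1973?", "197X", "1980-13-01", "May 10th, 1980"]:
    print(candidate, is_accepted(candidate))
```

Anything outside the subset could then be rejected with a hint, rather than silently stored.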
It sounds like we've decided on EDTF when we're ready to make a move. There doesn't seem to be a clear next step or effort, so let's consider the result of this ticket to be a decision: EDTF.
Please note the community decision above (to inform our future developments) @seabelis, @cdrini, @hornc
@mekarpeles Even if the form input field will not be modified in the near future, I would like to add this to FAQs with a link from the form to at least provide guidance. Any issues with that?
I'm going to be coming from a data dump perspective:
To save space in the data dump (the editions dump is approaching 6GB), I propose the following format there:
1) No lettering (like months)
2) No punctuation, such as forward slashes (/) or hyphens (-)
3) Year goes first
How the format looks on an edition page in OL will be different from the data dump. Maybe on the edition page it'll look like May 10th, 1980, but 1980 05 10 in the data dump. The UI and database formats could be different.
For the UI, it'll be based on locale, as @guyjeangilles mentioned in #3301. I would say hyphens are good, as they keep the date together and are easier to read than slashes. I'm unsure if the month should be written out, as it takes a while to type in, takes up space to read, etc. However, it'll prevent people from confusing the day with the month, and that's important!
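As a rough illustration of keeping the stored and displayed forms separate, the sketch below renders a stored `YYYY`, `YYYY-MM` or `YYYY-MM-DD` value for an English-language page, with the month written out so day and month can't be confused. The function name `display_date` and the hard-coded English month names are assumptions for the example; a real UI would go through the site's localization.

```python
# English month names, hard-coded only for this illustration.
_MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

def display_date(stored):
    """Render a stored YYYY[-MM[-DD]] value without adding precision."""
    parts = stored.split("-")
    year = parts[0]
    if len(parts) == 1:
        return year                       # year only: show just the year
    month = _MONTHS[int(parts[1]) - 1]
    if len(parts) == 2:
        return f"{month} {year}"          # no day known, so none is shown
    return f"{month} {int(parts[2])}, {year}"

print(display_date("1980-05-10"))
print(display_date("1980-05"))
print(display_date("1980"))
```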
Stakeholders
@seabelis @dcapillae
@BrittanyBunk Let's not start down the rabbit hole of roll-your-own date formats. Wiki folk battled years over details of date formats. Just adopt an existing EDTF code base being used by librarians elsewhere.
@BrittanyBunk If you're going to attempt to overrule the previous consensus (EDTF), you're going to need a LOT stronger justification than saving a few bytes in the data dump.
I wouldn't say I'm overriding it; I was talking about a UI for it. I think I made a mistake, though: @seabelis said what I wrote in another issue was a duplicate, but I'm starting to realize this is a different issue. I'm talking about the UI for the edition editing form, so that what is entered there appears as EDTF in the data dump (right now it doesn't really follow it).
Since the year goes first here, it looks fine. I would say the next step on this issue is to relay this info to editors when they're editing the editions - so what goes there can follow this format.
There is one thing in the table I noticed:
"No known date | n.d. | ??? Not sure if this can be represented"
I would say that it should be left blank if the date's unknown - as there's no need to fill in what's unknown.
See the level 1 portion of https://github.com/ixc/python-edtf for discussion of how approximate, undefined, or partly undefined dates are represented.