Openlibrary: Standardizing Publication Date format

Created on 10 Apr 2019 · 16 Comments · Source: internetarchive/openlibrary

Description

What problem are we solving? What does the experience look like today? What are the symptoms?

Relates to #806 #1440

Dates are stored in a variety of different formats, causing sorting to break. There have been a few discussions about date storage (e.g. archive.org, wikidata), and I think we should choose one format and enforce + encode it canonically moving forward.

Stakeholders
@seabelis, @hornc, @cdrini, @tfmorris, @leadsongdog, @JeffKaplan

Data OpenLibrary Metadata Standard (OLMS) 3 RFC Bug discussion

Source

mekarpeles

All 16 comments

Alexis Rossi (archive.org) has suggestions here:
https://docs.google.com/spreadsheets/d/18sYxYd5ZY-gKv55HkUg37fYGHVdPjMQ4RnBOogRk910/edit#gid=0

mekarpeles on 10 Apr 2019

Sorting should obviously not care what format the date is stored in. All input string formats should be rendered to one consistent stored form _for sorting purposes_. That said, the _stored and displayed_ form should not create falsely precise dates, and should handle real uncertainty in or absence of source dates. Source catalogue records for old books in some cases may not clearly establish even a decade or year: "17??" or "171?" are legitimately distinct from "1700" and should not be forced to an incorrect numeric value.

LeadSongDog on 11 Apr 2019

I would strongly recommend against attempting to invent a new date standard. If ISO 8601:2004 isn't powerful enough, there's EDTF, which is going into the next revision (ISO 8601:2019) and which has a Python package supporting it.

The idea of separate display and sort dates seems attractive on the surface, but one still needs to derive the sort date (which would likely be used for more than just sorting, but also searching & selecting).
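
A minimal sketch of deriving a sort range from a stored date string without altering the stored value. It assumes the stored form uses `X` for unknown year digits and a trailing `?` or `~` as a qualifier, handles only a small subset of such strings, and limits years to 1000 through 9999. `sort_bounds` is a hypothetical helper, not part of Open Library or of any EDTF package.

```python
import calendar
import re
from datetime import date
from typing import Tuple

# Subset only: YYYY, YYYY-MM, YYYY-MM-DD, "X" for unknown year digits,
# and an optional trailing qualifier ("?" or "~").
_SUBSET = re.compile(r"([1-9][0-9X]{3})(?:-(\d{2})(?:-(\d{2}))?)?[?~]?")


def sort_bounds(stored: str) -> Tuple[date, date]:
    """Return the earliest and latest calendar dates the string could mean."""
    match = _SUBSET.fullmatch(stored)
    if match is None:
        raise ValueError(f"outside the supported subset: {stored!r}")
    year, month, day = match.groups()
    # Unknown digits widen the range: "17XX" spans 1700 through 1799.
    first_year = int(year.replace("X", "0"))
    last_year = int(year.replace("X", "9"))
    first_month = int(month) if month else 1
    last_month = int(month) if month else 12
    first_day = int(day) if day else 1
    last_day = int(day) if day else calendar.monthrange(last_year, last_month)[1]
    return (date(first_year, first_month, first_day),
            date(last_year, last_month, last_day))


stored_dates = ["1973?", "17XX", "2017-07-11", "171X", "1700"]
# The stored strings are untouched; only the derived key is used for ordering.
print(sorted(stored_dates, key=sort_bounds))
```

Values that share an earliest date are ordered by their latest date, so the precise `1700` sorts ahead of the vaguer `17XX`. The same pair of bounds could also back range searches.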

tfmorris on 11 Apr 2019

Wow, EDTF looks awesome! It supports just about everything I can think of (except different calendar models (e.g. Gregorian), but eh). Here's how it compares:

Class | Draft proposal | EDTF
-- | -- | --
Date and Time | 2017-07-11T18:24:23+00:00 | 2017-07-11T18:24:23Z
Year, Month and Day | 2017-07-11 | 2017-07-11
Year and Month | 2017-07 | 2017-07
Year | 2017 | 2017
BCE dates | 32 BCE | -0031 (ISO 8601 year 0000 is 1 BCE)
Approximate dates and ranges | |
No known date | n.d. | ??? Not sure if this can be represented
Approximate date | 1973? | 1973?
Missing digit | 197-? | 197X
Year range, both years known | 2014-2017 | 2014/2017
Year range, one date known and one approximate | 1923-1926? | 1923/1926?
Year range, both dates approximate | 192-?-1933? | 192X/1933?
Year range, one date unknown, one known | ----?-1933 | /1933
BCE date range | 400-372 BCE | -0399/-0371
BCE to CE date range | 32 BCE-41 CE | -0031/0041
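
A rough sketch of how some rows of this table could be converted mechanically. It covers only the CE year rows and the plain year-month and year-month-day rows. The date-time, BCE, "no known date" and unknown-endpoint rows are deliberately left out. `draft_to_edtf` and the regular expressions are hypothetical names for illustration only.

```python
import re

# One year in the draft notation: "1973", "1973?" or "197-?" (last digit missing).
_YEAR = r"\d{4}\??|\d{3}-\?"
_SINGLE = re.compile(rf"({_YEAR})")
_RANGE = re.compile(rf"({_YEAR})-({_YEAR})")
# Year-month and full dates are written identically in both columns.
_ISO_DATE = re.compile(r"\d{4}-\d{2}(-\d{2})?")


def _year_to_edtf(year: str) -> str:
    # "197-?" becomes "197X"; "1973" and "1973?" are kept as they are.
    return year[:3] + "X" if year.endswith("-?") else year


def draft_to_edtf(value: str) -> str:
    """Convert a draft-proposal date string to the EDTF column's form."""
    match = _RANGE.fullmatch(value)
    if match:
        # Draft ranges join years with "-"; EDTF intervals use "/".
        return "/".join(_year_to_edtf(y) for y in match.groups())
    if _SINGLE.fullmatch(value):
        return _year_to_edtf(value)
    if _ISO_DATE.fullmatch(value):
        return value
    raise ValueError(f"not covered by this sketch: {value!r}")


for draft in ["197-?", "2014-2017", "1923-1926?", "192-?-1933?", "2017-07"]:
    print(draft, "->", draft_to_edtf(draft))
```

Ranges are checked first so that `2014-2017` is read as two years rather than as a malformed year-month.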

cdrini on 11 Apr 2019

EDTF certainly looks to be well considered, with the latest adaptations suited for librarianship purposes. The repos at https://github.com/search?l=Python&q=edtf&type=Repositories and https://pypi.org/search/?q=edtf seem not to have caught up yet, but they're already a vast improvement over the current chaos. How hard would it be to adopt?

LeadSongDog on 12 Apr 2019

+1 for EDTF

mekarpeles on 12 Apr 2019

👍2

Fodder for thought:
Question: Does a given field accept any EDTF date representation, or only a subset? Should our UI, especially on user input, steer the user away from certain types of degradation (to keep data quality high)? If you allow the user to enter any type of degraded date, you'll get 'em.

Question: What is the localized presentation of degraded dates? If we're doing a big cutover, can we set ourselves up for ease of localization going forward? Just a thought.
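
On the first question, one possible approach is an explicit allow-list of date shapes per field. The sketch below is only an illustration: the policy (plain dates, an uncertain year, unknown trailing year digits, and nothing else) and the names `ACCEPTED_SHAPES` and `classify` are assumptions, not a decision from this thread. It checks shape only, not calendar validity.

```python
import re
from typing import Optional

# Hypothetical per-field policy: only these shapes are accepted.
ACCEPTED_SHAPES = {
    "date": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),       # 1980, 1980-05, 1980-05-10
    "uncertain_year": re.compile(r"\d{4}\?"),             # 1973?
    "unspecified_digits": re.compile(r"\d{2}(\dX|XX)"),   # 197X, 17XX
}


def classify(value: str) -> Optional[str]:
    """Return the name of the accepted shape, or None if the input is rejected."""
    for name, pattern in ACCEPTED_SHAPES.items():
        if pattern.fullmatch(value):
            return name
    return None


for entry in ["1980-05-10", "1973?", "17XX", "2014/2017", "May 10th, 1980"]:
    print(entry, "->", classify(entry))
```

A form could use the `None` result to prompt the user for a supported form instead of storing free text.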

brad2014 on 15 May 2019

It sounds like we've decided on EDTF when we're ready to make a move. There doesn't seem to be a clear next step or effort, so let's consider the result of this ticket to be a decision: EDTF.

mekarpeles on 12 Dec 2019

Please note the community decision above (to inform our future developments) @seabelis, @cdrini, @hornc

mekarpeles on 12 Dec 2019

👀1

@mekarpeles Even if the form input field will not be modified in the near future, I would like to add this to FAQs with a link from the form to at least provide guidance. Any issues with that?

seabelis on 17 Dec 2019

I'm coming at this from a data dump perspective.

To save space in the data dump (the editions dump is approaching 6 GB), I propose the following format there:
1) No lettering (such as month names)
2) No punctuation, such as forward slashes (/) or hyphens (-)
3) Year goes first

How the date looks on an edition page in OL can differ from the data dump. On the edition page it might read May 10th, 1980, while the data dump has 1980 05 10. The UI and database formats can be different.

For the UI, the format will depend on locale, as @guyjeangilles mentioned in #3301. I would say hyphens are good, as they keep the date together and are easier to read than slashes. I'm unsure whether the month should be written out: it takes longer to type and takes up more space. However, it would prevent people from confusing the day with the month, which is important!

Stakeholders:
@seabelis @dcapillae

BrittanyBunk on 3 Apr 2020

@BrittanyBunk Let’s not start down the rabbit hole of roll-your-own date formats. Wiki folk battled years over details of date formats. Just adopt an existing EDTF code base being used by librarians elsewhere.

LeadSongDog on 3 Apr 2020

@BrittanyBunk If you're going to attempt to overrule the previous consensus (EDTF), you're going to need a LOT stronger justification than saving a few bytes in the data dump.

tfmorris on 3 Apr 2020

I wouldn't say I'm overriding it; I was talking about a UI for it. I think I made a mistake, though: @seabelis said what I wrote in another issue was a duplicate, but I'm starting to realize this is a different issue. I'm talking more about the UI for the edition editing form, so that what is entered there matches the EDTF in the data dump (right now it doesn't really follow it).

Since the year goes first here, it looks fine. I would say the next step on this issue is to relay this info to editors when they're editing the editions - so what goes there can follow this format.

BrittanyBunk on 3 Apr 2020

👍1

There is one thing on the table I noticed:
"No known date | n.d. | ??? Not sure if this can be represented"

I would say that it should be left blank if the date's unknown - as there's no need to fill in what's unknown.

BrittanyBunk on 3 Apr 2020

👍1

See the level 1 portion of https://github.com/ixc/python-edtf for discussion of how approximate, undefined, or partly undefined dates are represented.

LeadSongDog on 3 Apr 2020
