Openlibrary: Roadmapping 2018 Q2

Created on 12 Mar 2018 · 19 Comments · Source: internetarchive/openlibrary

@leadsongdog, @hornc, @skylerbunny, @cdrini, @salman-bhai, @bfalling, @whatisgalen

If you can add your top 5-10 issues or features you believe should be prioritized in Q2 (preferably w/ github issue #'s)...
https://docs.google.com/spreadsheets/d/10Cp2xwcLY4NqN5gvDz6Jf8dcsz_eI5cF8lF6ZzfiskI/edit#gid=0

We'll be discussing these points this upcoming Tuesday on zoom @ 11:30am PT during our weekly community call: https://zoom.us/j/369477551

P.S. @anandology, @EdwardBetts, @rajbot, @mouse-reeve, et al -- if you have features (or votes) you'd like to push for, please feel free to add them to the spreadsheet!

Next Tuesday, we'll take the results of the spreadsheet and fill out https://github.com/internetarchive/openlibrary/projects/6

All 19 comments

Several issues are more or less blockers for author cleanup and deduping. These should be getting more attention. Only after authority is deduped will it be feasible to dedupe works and editions.

@LeadSongDog do you know which issues those are?

  • The Biodiversity Heritage Library has its content in the Internet Archive. I expect that IA is aware of the BHL identifiers. We are adding these identifiers to Wikidata and are disambiguating them in the process. Many of these authors also have books in the Open Library, so there are OL identifiers. The request is to share identifiers so that the disambiguation work on both sides is optimized.

  • It will become possible to import the OL identifiers from Freebase into Wikidata in one batch. When this happens, please run the process so that we can have the latest OL identifiers and have them disambiguated.

  • The basic data for authors, like date of birth and date of death: please update the information to what Wikidata holds. It is a service to your readers and we will celebrate your use of our data.

  • Can we please have a list of all the identifiers for books by authors who have a Wikidata identifier? Included should be a name, an authorID, a LoC identifier or ISBN. Objective is to load all of them into Wikidata and seek a wider audience.
    Thanks,
    GerardM

@GerardMeijssen I'm not sure exactly what is required for point #1 -- it would be most helpful and get the most exposure if each of these were opened as separate issues which we can add to our triage spreadsheet.

Who is coordinating this Freebase batch import? When/how will we learn when this happens?

re: author info, when you say, "please update the information to what Wikidata holds" are you talking about synchronizing the keys? Or pulling in wikidata values into OL?

re: "Can we please have a list of all the identifiers for books by authors who have a Wikidata identifier?", is the request for a 1-time data dump (e.g. our existing monthly authors dump)? Or an API to retrieve all authors with wikidata IDs?
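For the dump option, here is a rough sketch of what the filtering could look like. It assumes each dump row is tab-separated as type, key, revision, last_modified, JSON, and that a Wikidata ID (when present) sits under `remote_ids.wikidata` in the author JSON. Both are assumptions to check against the real dump, and the sample keys, names, and QIDs below are made up.

```python
import json
from typing import Iterable, Iterator

def authors_with_wikidata(lines: Iterable[str]) -> Iterator[dict]:
    """Yield key, name and Wikidata ID for author rows that carry a Wikidata ID."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed row layout: type, key, revision, last_modified, JSON
        parts = line.split("\t", 4)
        if len(parts) != 5 or parts[0] != "/type/author":
            continue
        record = json.loads(parts[4])
        # Assumed location of the Wikidata ID inside the author record
        qid = (record.get("remote_ids") or {}).get("wikidata")
        if qid:
            yield {
                "key": record.get("key", parts[1]),
                "name": record.get("name"),
                "wikidata": qid,
            }

def _row(rtype: str, key: str, data: dict) -> str:
    """Build one hypothetical dump row for the sample below."""
    return "\t".join([rtype, key, "1", "2018-03-01T00:00:00", json.dumps(data)])

# Hypothetical sample rows, not real Open Library or Wikidata identifiers
SAMPLE_DUMP = [
    _row("/type/author", "/authors/OL1A",
         {"key": "/authors/OL1A", "name": "Example Author One",
          "remote_ids": {"wikidata": "Q100001"}}),
    _row("/type/author", "/authors/OL2A",
         {"key": "/authors/OL2A", "name": "Example Author Two"}),
    _row("/type/work", "/works/OL1W",
         {"key": "/works/OL1W", "title": "Example Work"}),
]

if __name__ == "__main__":
    for author in authors_with_wikidata(SAMPLE_DUMP):
        print(author["key"], author["name"], author["wikidata"])
```

The same generator could sit behind an API endpoint later; the one-time dump is just the cheaper thing to try first.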

The first thing is to expose BHL identifiers in combination with IA / OL identifiers.
You are disambiguating and so is Wikidata.

  • When we add a BHL identifier it would be good to have a method to add your (IA or OL) identifiers.

  • When there are multiple links for the same author, further processing is our standard disambiguation process (in place for OL).

  • It is for the Biodiversity Heritage Library to consider what their policy is for disambiguation. In the meantime we do the donkey work. We can and will invite people to help when these processes are in place.

The Freebase import is on my radar. When this is done, I will ask (Charles ?) to run the update functionality.

As to updating the information at OL, I ask for you to import information like date of birth and date of death. Particularly when Wikidata has info and you don't it will be an improvement for the OL readers.

When you have information where we do not or where there is a difference, we appreciate a list so that we can curate Wikidata.
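The fill-in-and-report idea in the two paragraphs above could be sketched roughly like this. It assumes both sides expose `birth_date` and `death_date` as plain strings already normalized to the same format (real data would need proper date parsing first), and the function name and record shapes are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

DATE_FIELDS = ("birth_date", "death_date")

Report = List[Tuple[str, Optional[str], Optional[str]]]

def reconcile_dates(ol: Dict[str, str], wd: Dict[str, str]) -> Tuple[Dict[str, str], Report]:
    """Fill missing OL dates from Wikidata and list what Wikidata should review.

    Returns (merged_ol_record, report) where report holds
    (field, ol_value, wikidata_value) for every field that OL has and
    Wikidata lacks, or where the two values differ.
    """
    merged = dict(ol)
    report: Report = []
    for field in DATE_FIELDS:
        ol_val = ol.get(field) or None
        wd_val = wd.get(field) or None
        if ol_val is None and wd_val is not None:
            # Wikidata has info and OL does not: import it
            merged[field] = wd_val
        elif ol_val is not None and ol_val != wd_val:
            # OL-only value or a disagreement: send to the curation list
            report.append((field, ol_val, wd_val))
    return merged, report

if __name__ == "__main__":
    # Hypothetical records for illustration only
    ol_author = {"name": "Example Author", "death_date": "1950"}
    wd_author = {"birth_date": "1880", "death_date": "1951"}
    merged, report = reconcile_dates(ol_author, wd_author)
    print(merged)
    print(report)
```

Naive string comparison will over-report differences such as "1880" versus "1 January 1880", so the report is a list for humans to curate, not something to apply automatically.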

By importing the books for the authors we have in common, at Wikidata we will have the information to enable people to read books from the OL. We do not necessarily need a dump; what I can do is get authorisation for running a bot. Having you run a bot makes the collaboration even more prominent.

I've moved mine from a comment into the spreadsheet now that I have edit rights. I'll update them with issue numbers, etc later, although they align with what @LeadSongDog raised.

  • BHL - I'm opposed to promoting them until they give at least minimal credit to IA/OL, which is the source of 99% of their data. The only reason that BHL identifiers even exist is that they mint them to make the connection to IA opaque.

  • Freebase - Any OL related data from that source needs to be used with great care, because it dates from a period before any author dedupe had been done, so the OLIDs can point to redirects, deleted records, etc (I say this as someone who's worked with Freebase since early 2010).

  • Duplicating data in general - I think we should have a more general discussion about this before we start copying data to and fro. Having multiple, duplicate, editable data stores makes the reconciliation problem very difficult. I'd be much more tempted to push as much of this as possible to Wikidata and just pull things from there when available. This includes identifiers, biographical info like birth/death dates, and a whole host of other data.
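To make the Freebase point concrete, here is a minimal sketch of checking a stale OLID before using it. It assumes records are available as an in-memory dict keyed by OL key, that redirects carry type `/type/redirect` with a `location` field, and that deleted records carry type `/type/delete`. Treat those shapes as assumptions to verify; the keys below are made up.

```python
from typing import Dict, Optional

def resolve_olid(key: str, records: Dict[str, dict]) -> Optional[str]:
    """Follow redirects from key to a live record key.

    Returns None if the chain hits a missing or deleted record,
    or loops back on itself.
    """
    seen = set()
    while key not in seen:
        seen.add(key)
        record = records.get(key)
        if record is None:
            return None  # unknown key
        rtype = (record.get("type") or {}).get("key")
        if rtype == "/type/delete":
            return None  # record was deleted
        if rtype == "/type/redirect":
            key = record.get("location")
            if not key:
                return None  # redirect without a target
            continue
        return key  # live record
    return None  # redirect loop

# Hypothetical records for illustration only
SAMPLE_RECORDS = {
    "/authors/OL1A": {"type": {"key": "/type/redirect"}, "location": "/authors/OL2A"},
    "/authors/OL2A": {"type": {"key": "/type/author"}, "name": "Example Author"},
    "/authors/OL3A": {"type": {"key": "/type/delete"}},
    "/authors/OL4A": {"type": {"key": "/type/redirect"}, "location": "/authors/OL5A"},
    "/authors/OL5A": {"type": {"key": "/type/redirect"}, "location": "/authors/OL4A"},
}

if __name__ == "__main__":
    for olid in SAMPLE_RECORDS:
        print(olid, "->", resolve_olid(olid, SAMPLE_RECORDS))
```

Running something like this over Freebase-era OLIDs before import would at least separate the IDs that still resolve from the ones that need a human look.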

Hoi,
When we link Wikidata OL / IA for BHL, we will gain a lot of friends in
many libraries worldwide. The BHL is already very happy that we are
including data into Wikidata and they will be supremely happy when we
together provide them with an even better service.

Yes, the data of Freebase is stale. It is exactly why the inclusion will be
synchronised with Charles because we will want to update Wikidata
afterwards with the latest and greatest information of OL / IA. This is
easier than the manual process that is happening now. Once this is done,
there will be no new stale information in Wikidata. This makes this a win
for everyone involved.

The only really important thing in what we do is linking identifiers. This
is key, not the associated data. I am very happy when the IA and OL find
use for the Wikidata data. However, it is for them to decide what to do.
Our data is freely available. For me it is key that we collaborate and
share a mission of bringing more and better information; getting people
to read is (for me personally) a dream come true.
Thanks,
GerardM


Issues with authority:

#790, #757, #756, #714, #699, #669, #667, #604, #513, #498, #486, #366, #352, #351, #349, #178, #149, #145, #89, #77

Issues with authority:

Those look like they're mostly issues with author records and/or author search. "authority" is an archaic librarian's term rooted in their belief that they're in charge of everything. (Not that I have an issue with authority. :-) ) I added # signs to all the issue numbers so they'll act as hot links.

Well, I think of authority simply as answers to "Who authored what?", but https://www.loc.gov/standards/mads/mads-doc.html says:


The <authority> element is a container that includes a standardized "authoritative" form of an agent (person or organization), an event, a title, or a term (topic, genre, geographic). The authority container may only be repeated to give multiple authoritative forms in different languages or scripts.

The geographicSubdivision attribute can be used with the <authority> element to indicate whether or not a concept can append a geographic facet, such as the name of a country or other jurisdiction, region, or geographic feature. This information is important to some controlled vocabularies, such as LCSH. For vocabularies to which this does not apply, the attribute would not be used. The geographicSubdivision attribute is comparable to MARC Authority 008/06, and can carry the following values:

none - no geographic facet applies
direct - a geographic facet may be applied without its larger geographic entity
indirect - a geographic facet may be applied with the name of its larger geographic entity
not applicable - a geographic facet is not appropriate

Then, https://www.loc.gov/standards/sourcelist/name-title.html says a bunch more...
Personally I have no issues with authority either, so long as I'm the authority :-)

@here -- reminder: at this week's Tuesday community call @ 11:30am PT, we'll be doing 2018 Q2 planning.

Join the call
https://zoom.us/j/369477551

Please nominate issues for Q2
https://github.com/internetarchive/openlibrary/projects/7

Browse open issues
https://github.com/internetarchive/openlibrary/issues

Last quarter's goals; Q1
https://github.com/internetarchive/openlibrary/projects/3

Evolving project board for Q2
https://github.com/internetarchive/openlibrary/projects/6

@tfmorris sorry to make life difficult. Instead of using the gdoc spreadsheet, I've moved all our Q2 nominations to this board: https://github.com/internetarchive/openlibrary/projects/7

Your list is currently
Search alternate names
Search (dedupe req.)
Search I18N (UX & dedupe req)
Data quality (author dedupe to start)
Improved UX

Where applicable, can you please add existing issue cards to your name on that board? And/or create issues where necessary for them? Thank you!! I've already added author search dupes (848) to your list. Note, I've also added internationalization (i18n) elsewhere on the board, but issues apparently can only belong to a single column (as a heads up, just so we don't re-create those existing issues).

@GerardMeijssen if you can do the same, that would be a huge help. You mentioned 3 or so points above -- if each of these can be turned into an issue with the correct context, we can add it to the Q2 planning board

@LeadSongDog I've already added / migrated all the issues you nominated -- thank you!

@hornc + @cdrini + @bfalling if you can also update the Q2 planning board w/ your issue nominations, it would be a big help!

https://github.com/internetarchive/openlibrary/projects/7

salman-bhai [9:53 PM]
I'd like to add these Issues to my Board

  1. https://github.com/internetarchive/openlibrary/issues/830 #830
  2. https://github.com/internetarchive/openlibrary/issues/439 #439
  3. https://github.com/internetarchive/openlibrary/issues/846 #846
  4. https://github.com/internetarchive/openlibrary/issues/835 (Slightly challenging) #835

In addition to these, kindly add these issues as well, if they are not to be closed (these pertain to Recaptcha v2)

  1. https://github.com/internetarchive/openlibrary/issues/433 #433
  2. https://github.com/internetarchive/openlibrary/issues/431 #431

In addition to this I'd like to work on Docker with Charles. Not sure an Issue has been created for that as of now!

Also @mekarpeles @hornc

  1. https://github.com/internetarchive/openlibrary/issues/811
  2. https://github.com/internetarchive/openlibrary/issues/817

I think you can remove these from the Board. They're almost done! I'm just awaiting Merge for the two for now!

@GerardMeijssen I'll try to add the issues you listed in this thread

Here's a 1st draft consolidated/prioritized list of Q2 goals
https://github.com/internetarchive/openlibrary/projects/7

We'll continue to integrate author + search related issues as we have a better idea where to spray the firehose

Next Tuesday we'll follow up to discuss the final prioritized list. Closing this issue for now! Thanks everyone for adding your issues to the board.
