Openlibrary: Author Merging: Bob Simmon

Created on 20 Feb 2019 · 7 Comments · Source: internetarchive/openlibrary

Posting on behalf of paul on Open Library

Description

What problem are we solving? What does the experience look like today? What are the symptoms?

Merging/Deleting Authors/Works

There are two authors named "Bob Simmons" - one a stuntman, and one a cook. The stuntman has written one work entitled "Nobody Does It Better", and the cook has written lots of cookery books.

However, in the database their records appear to be all mixed up - there are two entries for "Bob Simmons", both of which have entries for both "Nobody..." and some cookbooks.

The book "Nobody Does It Better" appears to have three different entries:
https://openlibrary.org/books/OL21671339M/Nobody_does_it_better
https://openlibrary.org/books/OL2471542M/Nobody_does_it_better
https://openlibrary.org/books/OL18708485M/Nobody_does_it_better
...two listed under one "Bob Simmons", and one under the other - even though there exists only one edition of the book.

There is also in the database a "Bob Simmon" who has co-written a cookbook with Coleen Simmons - This is also the cook "Bob Simmons", but with a misspelled name.
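A near-miss like "Bob Simmon" versus "Bob Simmons" can be flagged with a simple string-similarity check. The following is only a rough sketch using Python's standard `difflib`; the function name `likely_misspellings` and the 0.9 cutoff are assumptions for illustration, not anything Open Library actually runs.

```python
import difflib


def likely_misspellings(name, known_names, cutoff=0.9):
    """Return known author names that are very close to `name` but not identical.

    `cutoff` is an arbitrary similarity threshold chosen for this sketch.
    """
    candidates = [n for n in known_names if n != name]
    return difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)


if __name__ == "__main__":
    # Names taken from this issue.
    print(likely_misspellings("Bob Simmon", ["Bob Simmons", "Coleen Simmons"]))
```

A match found this way is only a candidate for review; two similar names can still belong to different people.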

It's all a bit of a mess!

There needs to be two "Bob Simmonses" - one with all the cookbooks, and the other with a single entry for "Nobody Does It Better"

authors merging

All 7 comments

Author records which are conflated should be deleted. Normal users can't do this, but they can move the works to appropriate other records and petition the OL gods to remove the offending record(s).

If an author record is clearly (birth date, predominance of works, etc) about a specific author, but just has the occasional work mixed in by error, that work can be moved to the correct place.

@mekarpeles I would have guessed that you knew all this already, but if any of it is unclear, sing out.

Perhaps we need better user documentation for situations like this.

Consider https://openlibrary.org/search/authors?q=undifferentiated&has_fulltext=true
Aside from the trivial issue of whether and how to punctuate the 'name' of such author records, marking records as undifferentiated is a helpful step toward refining the assignment of works to the correct author. Still, there is a real need for power tools to address the proliferation of incomplete names in the catalog. Once a record is identified as undifferentiated, it should be possible to prioritize similar names for the addition of external identifiers (VIAF, ISNI, Wikidata). I generally find it helpful to create notes in the biography field of the 'undifferentiated' record along the lines of "Distinguish between: OL12345A the stuntman; OL67890A the cook; any others"
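If such notes were generated rather than typed by hand, the format could be kept consistent. A minimal sketch, assuming a hypothetical helper `distinguish_note` and using the placeholder keys from the example above (OL12345A and OL67890A are not real records):

```python
def distinguish_note(candidates):
    """Build a 'Distinguish between' note from (author_key, description) pairs."""
    parts = [f"{key} {description}" for key, description in candidates]
    return "Distinguish between: " + "; ".join(parts)


if __name__ == "__main__":
    # Placeholder keys, as in the note format quoted above.
    print(distinguish_note([("OL12345A", "the stuntman"), ("OL67890A", "the cook")]))
```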

UX note: IMDB.com deals with multiple actors having the same name by suffixing a roman numeral after the name (when there are multiples) to cue the user that there is more than one, and that you have to pick carefully. E.g. https://www.imdb.com/find?ref_=nv_sr_fn&q=Fran+Smith&s=nm
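To make the idea concrete, here is a rough sketch of that kind of suffixing in Python. It is an illustration only: the function names are made up, the numbering simply follows list order, and nothing here reflects how IMDb or Open Library actually assigns labels.

```python
from collections import Counter


def to_roman(n):
    """Convert a positive integer to a Roman numeral string."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def label_duplicates(names):
    """Suffix a Roman numeral to every name that occurs more than once."""
    totals = Counter(names)
    seen = Counter()
    labelled = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            labelled.append(f"{name} ({to_roman(seen[name])})")
        else:
            labelled.append(name)
    return labelled


if __name__ == "__main__":
    print(label_duplicates(["Bob Simmons", "Coleen Simmons", "Bob Simmons"]))
```

Unique names are left untouched, so the numeral only appears where the user actually has to choose.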

Have a close look at https://openlibrary.org/books/OL8040406M.json and you'll see two author records shown: "authors": [{"key": "/authors/OL2764562A"}, {"key": "/authors/OL2845554A"}]
Comparing the display (screenshot: staleauthors), notice that the outdated edition-level author record for David G. Smith (undifferentiated) is shown on the default cover image at the left, while the updated work-level author record for David G. Smith (collector) shows to the right.
Until the stale (for years!!!) data in the edition record gets corrected, there's not much point in sorting out the work records, but of course mere mortal users are not allowed to access those edition authors.
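For anyone wanting to spot this kind of mismatch in bulk, a minimal sketch in Python follows. It works on already-loaded JSON dicts and makes two assumptions: that a work record lists its authors as `{"author": {"key": ...}}` entries, and that the work shown here has only one author. The `work` dict is hypothetical, and picking OL2764562A as the surviving author is arbitrary, since the thread does not say which of the two keys is the stale one.

```python
def edition_author_keys(edition):
    """Author keys listed directly on an edition record."""
    return {a["key"] for a in edition.get("authors", [])}


def work_author_keys(work):
    """Author keys on a work record (assumed shape: {"author": {"key": ...}})."""
    return {a["author"]["key"] for a in work.get("authors", []) if "author" in a}


def stale_edition_authors(edition, work):
    """Author keys that appear on the edition but not on its work."""
    return sorted(edition_author_keys(edition) - work_author_keys(work))


if __name__ == "__main__":
    # Edition authors as quoted above from OL8040406M.json.
    edition = {"authors": [{"key": "/authors/OL2764562A"},
                           {"key": "/authors/OL2845554A"}]}
    # Hypothetical work record; the choice of key is arbitrary.
    work = {"authors": [{"author": {"key": "/authors/OL2764562A"}}]}
    print(stale_edition_authors(edition, work))
```

Any key this returns is a candidate for review, not proof of an error, since an edition can legitimately credit someone the work does not.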

Not sure that this is a merging issue, but there are two Bob Simmons entries in OL:

https://openlibrary.org/authors/OL1102082A/Bob_Simmons 1933-
https://openlibrary.org/authors/OL259808A/Bob_Simmons

The "1933-" record appears to me to be the stuntman, who was actually born in 1923: https://www.wikidata.org/wiki/Q888254 and the 1933 came from a cataloguer typo, recorded by VIAF here: https://viaf.org/processed/LC%7Cn%20%2087941596

There are probably multiple ways to fix this, but correcting the dates for OL1102082A, moving all the stuntman's books there, and moving the cookbooks to OL259808A would be my pick.
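That plan could be drafted as a list of proposed moves before anyone edits records by hand. The sketch below is purely illustrative: the `is_cookbook` flag, the current author assignments, and the title "Example Cookbook" are made up for the example, and only the two author keys and "Nobody Does It Better" come from this thread.

```python
STUNTMAN = "/authors/OL1102082A"  # record whose dates need correcting
COOK = "/authors/OL259808A"


def plan_moves(works):
    """List (title, current_author, target_author) for works on the wrong record.

    Each work is a dict with hypothetical fields: title, author, is_cookbook.
    """
    plan = []
    for work in works:
        target = COOK if work["is_cookbook"] else STUNTMAN
        if work["author"] != target:
            plan.append((work["title"], work["author"], target))
    return plan


if __name__ == "__main__":
    # Current assignments below are invented for illustration.
    works = [
        {"title": "Nobody Does It Better", "author": COOK, "is_cookbook": False},
        {"title": "Example Cookbook", "author": STUNTMAN, "is_cookbook": True},
        {"title": "Another Example Cookbook", "author": COOK, "is_cookbook": True},
    ]
    for move in plan_moves(works):
        print(move)
```

Works already on the right record produce no entry, so the output is just the review list.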

It would be great for someone to pick up #853 too to enable others to add authority control easily.

All but one of the cooking books listed with the stunt man record were orphaned editions with two authors. I associated them with existing works and have separated out the two authors, so if someone can review the changes, we can close this specific data issue?

@LeadSongDog I see you have fixed that cover by associating it with a work, thank you. I'm going to put that particular issue down to the fact that everything slightly wrong with OL is compounded by orphaned editions, which are a pain for anyone to fix, regardless of access level. This month I have been re-importing ~1M+ early items that for various reasons skipped the step where OL added works, so hopefully the era of orphans getting in the way of everything is over. At the beginning of the year we had 5M orphans. I expect to see that at least halve when the next db dumps are generated at the end of the month, and I'm going to keep pushing to reduce that number to 0 so we can get past this constant pain of un-synced and difficult-to-correct data.
