Openlibrary: ImportBot sometimes attaches IA ID to wrong edition

Created on 24 Jul 2019 · 13 Comments · Source: internetarchive/openlibrary

Description

ImportBot sometimes attaches the IA ID to the wrong edition record. This is a problem because the IA ID cannot be changed, so all the details on the record must be changed to match the linked edition.

Evidence / Screenshot (if possible)

Relevant url?

https://openlibrary.org/books/OL25936281M/Most_Dangerous
This one was already updated, but you can see the history: https://openlibrary.org/books/OL26412866M/Born_a_Crime
And here, the importBot replaced a good ID with a bad one (wrong details on IA as well, but why change the ID?). https://openlibrary.org/books/OL23277896M/Desperation

Expectation

If a matching record does not exist, the bot should create one for the imported book.

Details

  • Logged in (Y/N)?
  • Browser type/version?
  • Operating system?

Proposal & Constraints

Stakeholders

@hornc

Data Librarians Import Lead 2 Work In Progress Identifiers Bug

All 13 comments

@seabelis What exactly is the difference between the https://openlibrary.org/books/OL25936281M/Most_Dangerous example and the archive.org id? All the metadata I see, OL record, archive.org metadata, and attached MARC record seem to be the 2015 Roaring Brook Press edition of the "Most dangerous" book.

Also for https://openlibrary.org/books/OL26412866M/Born_a_Crime I'm not sure where I should be looking either.

The Desperation example shows a problem with the archive.org item. Judging by its id, it was intended to be a Stephen King book, and presumably some of the metadata relates to the Stephen King book, which is why the bot picked it to match. The scan and most of the other metadata of the archive.org item are something else, though. I'm not sure how the archive.org data got to be that mixed. The reason the bot overwrote the id was that the previous one was printdisabled access only and the new one was borrowable by all, so it should have been an improvement. I'll try to figure out what caused the problem with the archive.org source data, but it looks like the bot was trying to do the correct thing with bad source data.

Most_Dangerous: same title, different ISBN. This is my error. The OCLC ISBN search pulls up a record for a different ISBN; I did not catch that and updated the rest of the details based on the mistake. Disregard.

Born A Crime was entered with the ebook ISBN and identifiers. The Internet Archive record shows several IDs -- hard to say which is actually correct for the book, since it's not viewable-- but none of the ISBNs match what was on the initial record. https://openlibrary.org/books/OL26412866M/Born_a_Crime?v=1

The ID swap for a lendable vs non-lendable edition is okay if the edition is exactly the same; not just the same title, but same everything. I'm not certain the bot can tell the difference, and frequently the details on IA are not correct for the specific edition.
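To make the "same everything" rule concrete, here is a minimal sketch of a conservative guard on ISBNs alone. The function names (`normalize_isbn`, `safe_to_swap_ocaid`) are hypothetical and not part of the Open Library codebase; the rule is an assumption: only allow a swap when the IA item lists at least one ISBN and every ISBN it lists is already on the OL edition.

```python
def normalize_isbn(raw: str) -> str:
    """Drop hyphens and spaces; uppercase a trailing 'x' check digit."""
    return "".join(ch for ch in raw if ch.isalnum()).upper()


def safe_to_swap_ocaid(edition_isbns, item_isbns) -> bool:
    """Hypothetical guard: True only if the IA item's ISBNs are a
    non-empty subset of the ISBNs already on the OL edition."""
    edition = {normalize_isbn(i) for i in edition_isbns}
    item = {normalize_isbn(i) for i in item_isbns}
    # An IA item with extra ISBNs may be conflated with other editions,
    # so it is refused rather than matched.
    return bool(item) and item <= edition


# Same single ISBN on both sides: swap allowed.
print(safe_to_swap_ocaid(["978-0-399-58818-1"], ["9780399588181"]))
# IA item lists an additional ISBN: refuse and leave for a human.
print(safe_to_swap_ocaid(["9780399588181"], ["9780399588181", "9780399590979"]))
```

This sketch does not treat an ISBN-10 and its ISBN-13 equivalent as the same identifier, and it ignores publisher, date, and format, so it would only be one part of a real check.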

In general trying to match the OL record with the linked copy on IA should not be a moving target. I think the problem has mostly to do with the IA records 1) being incorrect and/or 2) conflated with details for many editions. A book's record should reflect the specific edition. Most books will only have one set of ISBNs (sometimes two in the case of misprints or when the book has been bound for libraries resulting in a copyright page/cover mismatch).

I've already notified Jeff about fixing the record for the Desperation/USA mix-up, but since I don't want to flood him with messages, I only notify him about major mistakes, as in this case where the record is just entirely incorrect. I don't message him in the cases where the title/author are correct but the other particulars don't match. I DO correct it on the corresponding OL record, however, so if the links to IA items are being swapped out, this is potentially a problem.

I have just looked further into Born a Crime. The archive.org metadata lists a lot of ISBNs, but the original ISBN of 9780399588181 (ebook) matches the ISBN printed in the front of the actual archive.org scan, and the removed Goodreads and Amazon ids {'amazon': ['B01DHWACVY'], 'goodreads': ['37039065']} also match this ISBN. So I think the problem is that the archive.org record lists several supposedly equivalent ISBNs and isn't clear about which one the scan is.

I think the import did the correct thing here, but I suspect that it probably was more by chance since the extra ISBNs have the potential to confuse things.

That said, I haven't been able to pinpoint the difference between the ISBNs printed in the scan and the 9780399590979 ISBN. There isn't much concrete information about 9780399590979; it tends to redirect to one of the others in OCLC: https://www.worldcat.org/title/born-a-crime-stories-from-a-south-african-childhood/oclc/945946648

How are you able to access the scan for non-lendable editions?

@seabelis I have admin access on ia to check bibliographic data like this. Unfortunately there is no preview on non-borrowable items like this, so there isn't a good way for you to confirm yourself :(

@hornc FWIW, the entry at
https://catalog.loc.gov/vwebv/search?searchArg=2016031399&searchCode=GKEY%5E*&searchType=0&recCount=25
shows "Invalid ISBN 9780399588181 (ebook)"
Perhaps there's a lesson here: rather than ignore bad identifiers, should we flag them as such? After all the internet echo chamber will keep bringing them up...
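A minimal sketch of what flagging could look like, assuming a hypothetical `FlaggedIdentifier` record (not an existing Open Library type). It uses the standard ISBN-13 check-digit rule (weights alternating 1 and 3, total divisible by 10). Note that 9780399588181 passes that test, so whatever the catalogue means by "invalid" here, it is not a malformed number; that is exactly the kind of source-supplied note a flag could preserve.

```python
from dataclasses import dataclass


def isbn13_checksum_ok(isbn: str) -> bool:
    """Standard ISBN-13 check: weights 1,3,1,3,... and total % 10 == 0."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


@dataclass
class FlaggedIdentifier:
    """Hypothetical record: keep the identifier, plus why it is suspect."""
    value: str
    checksum_ok: bool
    note: str = ""


def flag_isbn(value: str, note: str = "") -> FlaggedIdentifier:
    return FlaggedIdentifier(value, isbn13_checksum_ok(value), note)


flagged = flag_isbn("9780399588181", note="catalog.loc.gov: Invalid ISBN (ebook)")
print(flagged.checksum_ok)  # True: well-formed, yet flagged by the source
```

Keeping the identifier with a note, rather than dropping it, means a later import that brings the same ISBN back can be recognised as a known problem instead of being treated as new information.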

Weirdness. This https://archive.org/details/works0000melv_j5b0 links to https://openlibrary.org/books/OL5886401M/White_Jacket which links to https://archive.org/details/works0000melv_q2c8/page/n5. They are not the same work, just the same series.

Already edited on the OpenLibrary side before I realized the discrepancy, so that's why the history looks a bit muddled.

Not completely incorrect, but not correct either. The original record on Open Library is meant to represent the series of 9 volumes. The IA ID is for one of the 9 volumes. A record for the one volume should be associated with the IA copy. https://openlibrary.org/books/OL18274985M/Gesammelte_Werke

This is a common issue. Volumes with individual ISBNs should have individual records as they are not equal to each other. In many cases, there is an ISBN for the set which should represent the set on one record; the volumes should each have their own records with their respective ISBNs.

@seabelis
I'm sure you are aware that this is a long-standing defect in the OL way of doing things. Here's a workaround. There should be no OCAID associated with the series ISBN. Rather, we should show that series ISBN on an OL edition record whose TOC has links to each of the nine single-volume OLIDs. Then the links from each of the IA scans could go directly to their specific volume's OLID edition record.
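A rough sketch of that workaround as a data structure. The `Edition` class, its fields, the `attach_scan` helper, and all identifiers below are hypothetical simplifications for illustration, not the real Open Library schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edition:
    """Simplified, hypothetical edition record."""
    olid: str
    title: str
    ocaid: Optional[str] = None                    # IA scan id, if any
    toc: List[str] = field(default_factory=list)   # OLIDs of member volumes


def attach_scan(set_record: Edition, volumes: Dict[str, Edition],
                volume_olid: str, ocaid: str) -> Edition:
    """Attach an IA scan to one volume; the set record never gets an OCAID."""
    if volume_olid not in set_record.toc:
        raise ValueError(f"{volume_olid} is not listed in the set's TOC")
    volume = volumes[volume_olid]
    volume.ocaid = ocaid
    return volume


# Placeholder identifiers, invented for the example.
vols = {f"VOL{n}": Edition(f"VOL{n}", f"Volume {n}") for n in range(1, 10)}
series = Edition("SET", "Nine-volume set", toc=list(vols))
attach_scan(series, vols, "VOL3", "example_scan_id")
print(series.ocaid, vols["VOL3"].ocaid)  # None example_scan_id
```

The point of the design is that the set-level ISBN lives on a record that can never be linked to a scan, while each scan resolves to exactly one volume record.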

@seabelis @tfmorris That scan was for _War for power and power for peace_, but Import bot conflated it with the identifiers, description, etc for Julius Caesar

https://openlibrary.org/books/OL6447918M/War_for_power_and_power_for_peace?v=6
was based on
https://openlibrary.org/books/OL6447918M/War_for_power_and_power_for_peace?b=5&a=4&_compare=Compare&m=diff

The ia metadata showed Openlibrary_work as OL362702W which may have caused the bad import, but I can't tell when that first happened.
Anyhow, I think it is now sorted out, with no residual linkage to Julius.

I've reverted the edit that initially linked to Julius Caesar and added the ID to the appropriate edition in OL. The link from IA to Julius Caesar still exists, but I've asked Jeff to fix when he can. I did want to point it out as an example so @hornc could take a look at where things went awry.
