Openlibrary: ImportBot importing titlepage instead of cover

Created on 24 May 2019 · 8 Comments · Source: internetarchive/openlibrary

Description

ImportBot does not seem to be choosing covers correctly from archive.org. It seems to be using the title page even when a good cover is available. I wonder if this is happening in:

https://github.com/internetarchive/openlibrary/blob/02990e2138ed756d16c4f1eb66da70b53fe08bae/openlibrary/plugins/importapi/code.py#L247

This is using:

https://github.com/internetarchive/openlibrary/blob/02990e2138ed756d16c4f1eb66da70b53fe08bae/openlibrary/plugins/importapi/code.py#L339

which I think is wrong. We should be using e.g. https://archive.org/download/greatdebatesback0000unse/page/cover_t.jpg which gives the cover _or_ the title page (if the cover is not useful).

Evidence / Screenshot (if possible)

image

Relevant url?

e.g. https://openlibrary.org/books/OL26968796M/Guan_li_cheng_jiu_sheng_huo

Expectation

Should display e.g. https://archive.org/download/guanlichengjiush0002fred/page/cover_t.jpg

Details

  • Logged in (Y/N)? Y
  • Browser type/version? Firefox 67.0
  • Operating system? Windows 10 Home

Stakeholders

@mekarpeles @hornc

Data Cover Service Import 2 Bug

All 8 comments

@mekarpeles @hornc Is there anything special that needs to be done to deploy this, or do we just have to fix the line above?

Note also that for your example, https://archive.org/download/greatdebatesback0000unse/page/title.jpg (the current form) doesn't resolve at all, but in that case the title page would arguably be better than a plain blank green cover.

In addition to fixing the current code, we'll also need to figure out which editions need to have their covers fixed.

It seems that there are four conflated issues here.

First, both the cover and the title page should be captured into the coverstore, not just one or the other. Capturing TP verso would also be helpful.

Second, the better of the two (by some metric) should be presented in search results and carousels. I would argue that when there is not a minimum amount of legible text on the cover (to identify the author and title), seeing the title page is often essential to confirming the edition is correctly described.

Third, absent a good identifying image for the edition, should a useful default cover for the work be presented instead?

Fourth, are all useful sources for cover images being exploited?

@tfmorris could you create a new issue (probably on https://github.com/internetarchive/openlibrary-client ) for cleaning up the incorrect covers?

@LeadSongDog Trying to keep the scope of this issue small. Baby steps :)

  1. That would require a redesign of the way we store covers. We currently don't store any extra semantic data with the images; it's just a list of pictures labelled covers. This would require labeling the images as "front cover", "title page", etc. Could you create a new issue to investigate redesigning our cover storage schema?
  2. This is what this issue is trying to fix. Currently ImportBot _indiscriminately_ imports/displays the title page. The proposed solution would display the cover page if it's "better" (as deemed by the folks who scanned the book), and the title page if it's not.
  3. Strong disagree; displaying the work's cover is misleading. What if the edition is in a different language?
  4. This bot's focus is on importing from internet archive; to keep it concise/manageable, it should only be importing from there. New bots should be created to deal with other sources.
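
To make point 1 concrete, here is a minimal sketch of what labelling stored images by role could look like. It is not the current coverstore schema; `ImageRole`, `LabelledImage`, and `pick_display_image` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Sequence


class ImageRole(Enum):
    """Hypothetical labels for what a stored image depicts."""
    FRONT_COVER = "front cover"
    TITLE_PAGE = "title page"
    TITLE_PAGE_VERSO = "title page verso"


@dataclass
class LabelledImage:
    cover_id: int      # assumed to be the identifier of the stored image
    role: ImageRole


def pick_display_image(
    images: Iterable[LabelledImage],
    preference: Sequence[ImageRole] = (ImageRole.FRONT_COVER, ImageRole.TITLE_PAGE),
) -> Optional[LabelledImage]:
    """Return the first image matching the highest-preference role, if any."""
    images = list(images)
    for role in preference:
        for image in images:
            if image.role is role:
                return image
    return None
```

With labels like these, the "which image to show" question becomes a preference order that can be changed without re-importing anything.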

@cdrini

  1. Done.
  2. Understood, but how often do the scanners make that distinction?
  3. I take your point, but is seeing no cover at all really better than another-edition-so-annotated? Of course we should not be deceptive: other sites indicate this with "Other editions" or "Similar items"
  4. Good to know. Perhaps a more informative name, to more accurately match the scope? Say, "IAImportBot" perhaps? Alternatively, widen the scope to exploit a list of usable sources?

I'm not sure we can fully rely on cover being set to the title page for books where that is appropriate.

It seems to be the case on many items, e.g. https://archive.org/download/hesiodtheognis00daviuoft/page/cover.jpg

But it looks like title and cover are independent things, and there is no logic that redirects to the other if one is not set.
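
Since nothing redirects from one to the other, a fallback would have to live on our side. A rough sketch, assuming the URL pattern from the examples in this thread; `first_resolving_page` and the `resolves` callback are hypothetical, and the callback is injected so the ordering logic can be checked without network access (in practice it might wrap an HTTP HEAD request).

```python
from typing import Callable, Iterable, Optional


def first_resolving_page(
    ocaid: str,
    resolves: Callable[[str], bool],
    order: Iterable[str] = ("cover", "title"),
) -> Optional[str]:
    """Return the first page-image URL that resolves, trying pages in `order`."""
    for page in order:
        url = f"https://archive.org/download/{ocaid}/page/{page}.jpg"
        if resolves(url):
            return url
    return None  # neither image is available


# Example with a stand-in resolver that only "finds" title pages.
only_title = lambda url: url.endswith("/title.jpg")
chosen = first_resolving_page("hesiodtheognis00daviuoft", only_title)
```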

archive.org's logic appears to prefer the title page if a book is pre-1923 and the cover otherwise, but that is explicitly for choosing a preview image to display. We'll need to make a choice ourselves.
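
If we wanted to mirror that behaviour as a starting point, it might look like the sketch below. This encodes only the rule as described in this comment, not verified archive.org code; `preferred_page` and its handling of a missing year are assumptions.

```python
from typing import Optional


def preferred_page(publish_year: Optional[int], cutoff: int = 1923) -> str:
    """Pick 'title' for books published before `cutoff`, otherwise 'cover'.

    An unknown year falls back to 'cover' (an arbitrary choice for this sketch).
    """
    if publish_year is not None and publish_year < cutoff:
        return "title"
    return "cover"
```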

Also, the correct URL is https://archive.org/download/guanlichengjiush0002fred/page/cover.jpg
cover_s4.jpg would scale the image, but the _t is not meaningful and is stripped.
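
For reference, a small sketch of building these URLs with the archive.org id used only once. The helper name `ia_page_image_url` is hypothetical; the `_s<N>` suffix follows the `cover_s4.jpg` form mentioned above, and its exact scaling semantics are not verified here.

```python
from typing import Optional

IA_DOWNLOAD = "https://archive.org/download"


def ia_page_image_url(ocaid: str, page: str = "cover", scale: Optional[int] = None) -> str:
    """Build an archive.org page-image URL such as .../page/cover.jpg."""
    if page not in ("cover", "title"):
        raise ValueError(f"unsupported page type: {page!r}")
    # Optional scale suffix, e.g. scale=4 -> cover_s4.jpg
    suffix = f"_s{scale}" if scale else ""
    return f"{IA_DOWNLOAD}/{ocaid}/page/{page}{suffix}.jpg"


url = ia_page_image_url("guanlichengjiush0002fred")
scaled = ia_page_image_url("guanlichengjiush0002fred", scale=4)
```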

Also, in the current code the archive.org id is duplicated, which does not seem necessary :man_shrugging:

I'm not sure cover.jpg is always better than title.jpg. It _seems_ like it will be better in this case, and I can't find a concrete example where it is _worse_. It seems very dependent on the source data, and not on any system smarts.

We could:
A. Make the change and see if we notice any issues
B. Investigate further to make a stronger case

Assigning @hornc per slack discussions because this issue is import related.

Also removing the Good First Issue label as this seems a little involved for a newcomer.

I'm not sure there is a clear action to take here. Closing.

