Openlibrary: Identify Book Page Types from fulltext (e.g. "Chapter page")

Created on 15 Apr 2020 · 8 Comments · Source: internetarchive/openlibrary

Write a program to identify page types [Cover, Table of Contents, Chapter Pages, etc] from Internet Archive djvu.xml files.

Goal / Opportunity:

The Internet Archive Bookreader Preview is able to show front-matter (such as table of contents, copyright page, cover, etc) if these pages are “asserted” (i.e. marked) in our metadata. This means we could make hundreds of thousands of book page previews available that aren’t available today.

History

Once upon a time, our digitizing team (as they scanned books) would manually select page-types from a dropdown menu (e.g. "cover page", "chapter page").
We have hundreds of thousands of books which include these page-type tags.
If a book has page-types, you can find them within the archive.org item in a file called scandata.xml.
In ~2019, as we scaled up to our Super Scanning Center (we now process >1000 books a day) it became too costly and slow for our imaging specialists to manually tag book page-types while digitizing.

Proposal & Constraints

Naive approach/questions:

Where does the word “copyright” first appear in the book? That’s probably the copyright page.
Where does the word “Chapter” appear in the book?
Where does the phrase “Table of Contents” first appear in the book?
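
A minimal sketch of the naive approach above, assuming the OCR text has already been split into one string per page. The helper name `first_page_containing` and the sample pages are made up for illustration and are not part of any existing Open Library code.

```python
from typing import Optional, Sequence


def first_page_containing(pages: Sequence[str], phrase: str) -> Optional[int]:
    """Return the 0-based index of the first page whose text contains phrase."""
    needle = phrase.casefold()
    for index, text in enumerate(pages):
        # Collapse whitespace so a phrase split across OCR lines still matches.
        normalized = " ".join(text.split()).casefold()
        if needle in normalized:
            return index
    return None


# Invented sample pages, one string per page.
pages = [
    "A SAMPLE BOOK",
    "Copyright 1920 by the\nSample Press",
    "Table of\nContents\nChapter I ... 1",
    "Chapter I\nThe Earth",
]

print(first_page_containing(pages, "copyright"))          # 1
print(first_page_containing(pages, "table of contents"))  # 2
print(first_page_containing(pages, "chapter"))            # 2
```

Note that the last lookup lands on the table of contents rather than on the first chapter page, which is one reason a plain first-occurrence search is only a starting point.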

Example

Find a public book and download the djvu.xml file (which has OCR + page numbers) using these instructions.
Here is an example of an open item with Table of Contents, etc:

Stakeholders

@tabshaikh


Most helpful comment

I cannot really say how much time I would be able to spend on this as I am busy with another project ATM.
But I can help write small scripts that will allow fetching and iterating over the XML files and look for the keywords that you mentioned.

ishank-dev on 16 Apr 2020


All 8 comments

@devarshigoswami @ishank-dev This would be an interesting scoped project with a huge impact if you are interested :)

tabshaikh on 15 Apr 2020

There's considerable variability in the information that's already available. Some volumes have the page types in the scandata.xml file (this volume only has the title page marked) and others have it in the ABBYY OCR file. Those should all be checked before resorting to guessing.
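
As a rough sketch of checking scandata.xml first: the snippet below assumes the file contains `page` elements with a `leafNum` attribute and a `pageType` child, and that unmarked pages carry the type `Normal`. That layout is an assumption based on typical archive.org items, and the sample XML is invented; real files may differ (for example, by using XML namespaces).

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Invented sample; the element names are an assumed scandata.xml layout.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="1"><pageType>Normal</pageType></page>
    <page leafNum="2"><pageType>Title</pageType></page>
    <page leafNum="3"><pageType>Normal</pageType></page>
  </pageData>
</book>"""


def asserted_page_types(scandata_xml: str, default: str = "Normal") -> Dict[int, str]:
    """Map leaf number to page type, skipping pages that only have the default type."""
    root = ET.fromstring(scandata_xml)
    asserted = {}
    for page in root.iter("page"):
        page_type = (page.findtext("pageType") or "").strip()
        leaf = page.get("leafNum")
        if leaf is not None and page_type and page_type != default:
            asserted[int(leaf)] = page_type
    return asserted


print(asserted_page_types(SAMPLE_SCANDATA))  # {0: 'Cover', 2: 'Title'}
```

An empty result would suggest the volume has no useful assertions and that the OCR-based guessing is worth running.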

Don't forget that "Copyright" and "Table of Contents" are spelled differently in different languages.
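
One way to account for this is a per-page-type phrase table. The sketch below is illustrative only: the phrase list is a small hand-picked sample (English, French, German, Spanish), not a complete or authoritative vocabulary, and `matching_page_types` is a hypothetical helper name.

```python
import unicodedata
from typing import List

# Small hand-picked sample of phrases; a real list would need many more languages.
KEYWORDS = {
    "copyright": [
        "copyright",
        "all rights reserved",
        "tous droits réservés",
        "alle rechte vorbehalten",
        "derechos reservados",
    ],
    "contents": [
        "table of contents",
        "table des matières",
        "inhaltsverzeichnis",
        "índice",
    ],
}


def normalize(text: str) -> str:
    """Collapse whitespace, unify Unicode composition, and ignore case."""
    return unicodedata.normalize("NFC", " ".join(text.split())).casefold()


def matching_page_types(text: str) -> List[str]:
    """Return the page types whose phrases occur (as substrings) in the page text."""
    haystack = normalize(text)
    return sorted(
        page_type
        for page_type, phrases in KEYWORDS.items()
        if any(normalize(phrase) in haystack for phrase in phrases)
    )


print(matching_page_types("Alle Rechte vorbehalten. Leipzig 1911"))  # ['copyright']
print(matching_page_types("TABLE DES\nMATIÈRES"))                    # ['contents']
```

Matching is by substring, so short phrases can produce false positives; treat hits as candidates rather than answers.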

This sounds like a good application for training a machine learning classifier to identify the page types. Volumes that have the page types in the scandata.xml can be used to create the training set.
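
A hedged sketch of how such a training set might be assembled. It assumes the page texts are in leaf order and that the page-type labels are keyed by the same leaf numbers; that alignment is an assumption and would need to be verified per item. The feature choices and the name `build_training_rows` are illustrative, not an existing pipeline.

```python
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, object]


def build_training_rows(
    page_texts: Sequence[str],
    page_types: Dict[int, str],
    default: str = "Normal",
) -> List[Tuple[Features, str]]:
    """Pair simple per-page features with the asserted page type as the label."""
    rows = []
    total = len(page_texts)
    for leaf, text in enumerate(page_texts):
        words = text.split()
        features = {
            # 0.0 for the first page, 1.0 for the last.
            "relative_position": leaf / max(total - 1, 1),
            "word_count": len(words),
            "first_word": words[0].casefold() if words else "",
        }
        rows.append((features, page_types.get(leaf, default)))
    return rows


rows = build_training_rows(["Cover", "", "Chapter I The Earth"], {0: "Cover"})
for features, label in rows:
    print(label, features)
```

Rows in this shape could then be fed to whichever classifier is chosen; the thread does not settle on one.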

BTW, a better link for the download is the pre-redirect URL which can be easily computed: https://archive.org/download/in.ernet.dli.2015.504418/2015.504418.Cambridge-Geographical_djvu.xml
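
Judging from the URL above, the pattern appears to be `https://archive.org/download/<identifier>/<filename>`. A small sketch under that assumption (the function name is hypothetical; the file name is passed in explicitly because, as this example shows, it is not always derived from the identifier):

```python
from urllib.parse import quote


def download_url(identifier: str, filename: str) -> str:
    """Build a pre-redirect archive.org download URL for one file of an item."""
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"


print(download_url(
    "in.ernet.dli.2015.504418",
    "2015.504418.Cambridge-Geographical_djvu.xml",
))
```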

tfmorris on 16 Apr 2020


Relates to #2156

LeadSongDog on 21 Apr 2020

There's existing code here already for doing similar processing on Archive.org books:
https://github.com/Open-Book-Genome-Project/sequencer

This is a guide on downloading and using Archive.org files (e.g. djvu.xml):
https://docs.google.com/document/u/1/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh

There's plenty of background information here:
https://bookgenomeproject.org/ and the whitepaper https://docs.google.com/document/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh

mekarpeles on 23 Apr 2020

Next Steps
Let's start with identifying copyright pages (this seems like an easier one). The code we currently have, https://github.com/Open-Book-Genome-Project/sequencer/blob/master/bgp/runner.py, works off of the djvu.txt because the word-frequency modules we're running do not need page numbers. Identifying page types will require page numbers, however, so we'll want some way to use djvu.xml for this new copyright page-detection module.
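
A minimal sketch of reading djvu.xml page by page. It assumes each page is an `OBJECT` element whose OCR words are nested `WORD` elements, which matches the usual DjVuXML layout but should be checked against real files; the sample document and the function names are invented and are not part of the sequencer code. The page index here is simply the position of the `OBJECT` in the file.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

# Invented two-page sample in the assumed DjVuXML layout.
SAMPLE_DJVU_XML = """<DjVuXML>
<BODY>
<OBJECT data="file://localhost/sample.djvu" type="image/x.djvu">
<PARAM name="PAGE" value="sample_0000.djvu"/>
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="1,2,3,4">A</WORD> <WORD coords="1,2,3,4">Sample</WORD> <WORD coords="1,2,3,4">Book</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
<OBJECT data="file://localhost/sample.djvu" type="image/x.djvu">
<PARAM name="PAGE" value="sample_0001.djvu"/>
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="1,2,3,4">Copyright</WORD> <WORD coords="1,2,3,4">1920</WORD></LINE>
<LINE><WORD coords="1,2,3,4">Sample</WORD> <WORD coords="1,2,3,4">Press</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def iter_pages(djvu_xml: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, page_text) for every OBJECT element, in file order."""
    root = ET.fromstring(djvu_xml)
    for index, obj in enumerate(root.iter("OBJECT")):
        words = [word.text for word in obj.iter("WORD") if word.text]
        yield index, " ".join(words)


def first_copyright_page(djvu_xml: str) -> Optional[int]:
    """Return the index of the first page mentioning 'copyright', if any."""
    for index, text in iter_pages(djvu_xml):
        if "copyright" in text.casefold():
            return index
    return None


print(list(iter_pages(SAMPLE_DJVU_XML)))
print(first_copyright_page(SAMPLE_DJVU_XML))  # 1
```

For full-size books, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be kinder to memory than loading the whole file.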

mekarpeles on 23 Apr 2020


A heuristic to detect a chapter page: the word "chapter" is usually the first word on the page, and pages with fewer than 10 words that contain the word "chapter" are strong candidates.
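
The heuristic could be sketched as a small score, one point per signal. The function name and the scoring scheme are illustrative assumptions; only the two signals themselves come from the comment above.

```python
import re


def chapter_page_score(text: str, max_words: int = 10) -> int:
    """Score a page 0 to 2: one point per chapter-page signal that holds."""
    words = re.findall(r"\w+", text.casefold())
    if "chapter" not in words:
        return 0
    starts_with_chapter = words[0] == "chapter"
    is_short_page = len(words) < max_words
    return int(starts_with_chapter) + int(is_short_page)


print(chapter_page_score("CHAPTER I\nThe Earth"))                  # 2
print(chapter_page_score("Contents Chapter I 1 Chapter II 20"))    # 1
print(chapter_page_score(
    "In the previous chapter we saw how rivers shape the land over many years"
))                                                                 # 0
```

The middle example shows why a short table of contents can still score a point, so a threshold of 2 may be the safer cut-off.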

tabshaikh on 24 Apr 2020

Copyright pages in any language generally can be expected to have:
- a placement near the start of the volume;
- one or more 4-digit numbers, less than +2;
- far fewer words than typical pages; and
- one or more place names.
They are normally found verso to (and following) the title page, which can be expected to:
- state the long title (Duh!)
- state at least one author name (though often not in the form or spelling used by OL)
Not infrequently the publisher, publication date and place are shown on the title page instead of the verso, but taken together the pair of pages should be readily identifiable.
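
A rough scoring sketch of the language-independent signals listed above. Several parts are assumptions made for illustration: "near the start" is taken as the first 20% of pages, "far fewer words" as under half the median page length, and 4-digit numbers are accepted between 1400 and the current year (the upper bound in the list above is unclear, so this range is a guess). Place names are left out because they would need a gazetteer, and the title-page pairing is not modelled.

```python
import re
from datetime import date
from statistics import median
from typing import List, Optional, Sequence


def copyright_page_scores(
    page_texts: Sequence[str],
    front_fraction: float = 0.2,     # assumed meaning of "near the start"
    min_year: int = 1400,            # assumed lower bound for a plausible year
    max_year: Optional[int] = None,  # assumed upper bound; defaults to this year
) -> List[int]:
    """Score each page 0 to 3: one point per copyright-page signal that holds."""
    if max_year is None:
        max_year = date.today().year
    word_counts = [len(text.split()) for text in page_texts]
    typical = median(word_counts) if word_counts else 0
    front_limit = max(1, int(len(page_texts) * front_fraction))
    scores = []
    for index, text in enumerate(page_texts):
        years = [
            int(number)
            for number in re.findall(r"\b\d{4}\b", text)
            if min_year <= int(number) <= max_year
        ]
        near_start = index < front_limit
        is_short = 0 < word_counts[index] < typical / 2
        scores.append(int(near_start) + int(bool(years)) + int(is_short))
    return scores


# Invented sample: a title page, a copyright page, and eight body pages.
sample = ["A SAMPLE BOOK by An Author", "Copyright 1920 Sample Press London"]
sample += ["word " * 40] * 8
scores = copyright_page_scores(sample)
print(scores)                     # [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
print(scores.index(max(scores)))  # 1
```

The title page scores nearly as high as the copyright page here, which fits the observation that the two pages are best identified as a pair.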

LeadSongDog on 24 Apr 2020
