Openlibrary: Identify Book Page Types from fulltext (e.g. "Chapter page")

Created on 15 Apr 2020 · 8 Comments · Source: internetarchive/openlibrary


Write a program to identify page types [Cover, Table of Contents, Chapter Pages, etc] from Internet Archive djvu.xml files.

Goal / Opportunity:

The Internet Archive Bookreader Preview is able to show front-matter (such as table of contents, copyright page, cover, etc) if these pages are “asserted” (i.e. marked) in our metadata. This means we could make available hundreds of thousands of book page previews that aren’t available today.

History

  • Once upon a time, our digitizing team (as they scanned books) would manually select page-types from a dropdown menu (e.g. "cover page", "chapter page").
  • We have hundreds of thousands of books which include these page-type tags.
  • If a book has page-types, you can find them within the archive.org item in a file called scandata.xml.
  • In ~2019, as we scaled up to our Super Scanning Center (we now process >1000 books a day) it became too costly and slow for our imaging specialists to manually tag book page-types while digitizing.

Proposal & Constraints

Naive approach/questions:

Where does the word “copyright” first appear in the book? That’s probably the copyright page.
Where does the word “Chapter” appear in the book?
Where does the phrase “Table of Contents” first appear in the book?
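
The naive questions above can be sketched as a first-occurrence search. This is only a minimal illustration, assuming the book text has already been split into one string per page; the function name `first_occurrences` and the keyword list are hypothetical and not part of any existing Open Library code.

```python
from typing import Dict, List, Optional

# Hypothetical keyword list taken from the three questions above (English only).
KEYWORDS = ["copyright", "table of contents", "chapter"]


def first_occurrences(pages: List[str],
                      keywords: List[str] = KEYWORDS) -> Dict[str, Optional[int]]:
    """Return the 0-based index of the first page containing each keyword.

    A keyword that never appears maps to None.
    """
    result: Dict[str, Optional[int]] = {kw: None for kw in keywords}
    for index, text in enumerate(pages):
        # Lowercase and collapse whitespace so a phrase broken across
        # OCR lines ("Table of\nContents") still matches.
        normalized = " ".join(text.lower().split())
        for kw in keywords:
            if result[kw] is None and kw in normalized:
                result[kw] = index
    return result


pages = ["Cover", "Copyright 1920 by X", "Table of\nContents", "CHAPTER I"]
print(first_occurrences(pages))
```

A plain substring match like this will also hit running text (e.g. "in the next chapter"), so it is best treated as a way to shortlist pages rather than a final answer.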

Example

Find a public book and download the djvu.xml file (which has OCR + page numbers) using these instructions.
Here is an example of an open item with Table of Contents, etc:

Stakeholders


@tabshaikh


All 8 comments

@devarshigoswami @ishank-dev This would be an interesting scoped project with a huge impact if you are interested :)

There's considerable variability in the information that's already available. Some volumes have the page types in the scandata.xml file (this volume only has the title page marked) and others have it in the ABBYY OCR file. Those should all be checked before resorting to guessing.

Don't forget that "Copyright" and "Table of Contents" are spelled differently in different languages.

This sounds like a good application for training a machine learning classifier to identify the page types. Volumes that have the page types in the scandata.xml can be used to create the training set.
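
As a rough sketch of how existing labels could be pulled out for a training set: the snippet below assumes scandata.xml contains `<page leafNum="...">` elements, each with a `<pageType>` child. That layout is an assumption and should be checked against a real file; the function name `page_types_from_scandata` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Tiny stand-in for a scandata.xml file; the element and attribute names
# here are assumed, not taken from a specification.
SAMPLE_SCANDATA = """
<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="1"><pageType>Title</pageType></page>
    <page leafNum="2"><pageType>Normal</pageType></page>
  </pageData>
</book>
"""


def page_types_from_scandata(xml_text: str) -> Dict[int, str]:
    """Map leaf number -> page type for every page that has both."""
    root = ET.fromstring(xml_text)
    labels: Dict[int, str] = {}
    for page in root.iter("page"):
        leaf = page.get("leafNum")
        page_type = page.findtext("pageType")
        if leaf is not None and page_type:
            labels[int(leaf)] = page_type.strip()
    return labels


print(page_types_from_scandata(SAMPLE_SCANDATA))
```

Volumes where every page comes back with the same type (or only the title page is marked) would probably need to be filtered out before training.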

BTW, a better link for the download is the pre-redirect URL which can be easily computed: https://archive.org/download/in.ernet.dli.2015.504418/2015.504418.Cambridge-Geographical_djvu.xml
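
Computing that URL could look something like this. It assumes the pattern `https://archive.org/download/<identifier>/<filename>` seen in the link above; the helper name `djvu_xml_url` is hypothetical, and the filename still has to be known (it is not always derivable from the identifier, as this example shows).

```python
from urllib.parse import quote


def djvu_xml_url(identifier: str, filename: str) -> str:
    """Build a direct download URL for a file inside an archive.org item."""
    # quote() percent-encodes characters that are unsafe in a URL path
    # (spaces, for example) and leaves letters, digits, and "_.-~" alone.
    return "https://archive.org/download/{}/{}".format(
        quote(identifier), quote(filename)
    )


print(djvu_xml_url(
    "in.ernet.dli.2015.504418",
    "2015.504418.Cambridge-Geographical_djvu.xml",
))
```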

I cannot really say how much time I would be able to spend on this as I am busy with another project ATM.
But I can help write small scripts that will allow fetching and iterating over the XML files and look for the keywords that you mentioned.

Relates to #2156

There's existing code here already for doing similar processing on Archive.org books:
https://github.com/Open-Book-Genome-Project/sequencer

This is a guide on downloading and using Archive.org files (e.g. djvu.xml):
https://docs.google.com/document/u/1/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh

There's plenty of background information here:
https://bookgenomeproject.org/ and the whitepaper https://docs.google.com/document/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh

Next Steps
Let's start with identifying copyright pages (this seems like an easier one). The code we currently have, https://github.com/Open-Book-Genome-Project/sequencer/blob/master/bgp/runner.py, works off of the djvu.txt because the word-frequency modules we're running do not need page numbers. However, identifying page types will require page numbers, so we'll want some way to use djvu.xml for this new copyright page-detection module.
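
For reference, here is one possible way to get page-aware text out of a djvu.xml file. It is a sketch, not the sequencer's implementation: it assumes each page is an `<OBJECT>` element whose OCR words are nested `<WORD>` elements, and the generator name `iter_page_words` is invented for this example.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, List

# Minimal stand-in for a djvu.xml file (two pages, the second one blank).
SAMPLE_DJVU_XML = b"""<DjVuXML><BODY>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>
<WORD coords="1,2,3,4">Copyright</WORD> <WORD coords="5,6,7,8">1920</WORD>
</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
<OBJECT><HIDDENTEXT></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>"""


def iter_page_words(source) -> Iterator[List[str]]:
    """Yield the list of OCR words for each page, in document order.

    `source` is a filename or a binary file object. iterparse streams the
    file, so a large book is not held in memory all at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "OBJECT":
            words = [w.text.strip() for w in elem.iter("WORD")
                     if w.text and w.text.strip()]
            yield words
            elem.clear()  # release this page's subtree before moving on


for page_index, words in enumerate(iter_page_words(io.BytesIO(SAMPLE_DJVU_XML))):
    print(page_index, words)
```

The index yielded here is the position of the page in the file, which is not necessarily the printed page number.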

One heuristic for detecting chapter pages: the word "chapter" is usually the first word on a chapter page, and pages with fewer than 10 words that contain the word "chapter" are strong candidates.
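
A quick sketch of that heuristic, assuming each page is available as a list of OCR words. The function name `chapter_signals` and the returned keys are hypothetical, and the check is English-only.

```python
import string
from typing import Dict, List


def chapter_signals(words: List[str], max_words: int = 10) -> Dict[str, bool]:
    """Report the two chapter-page signals for one page of OCR words."""
    # Strip surrounding punctuation and lowercase, so "CHAPTER," matches.
    normalized = [w.strip(string.punctuation).lower() for w in words]
    starts_with_chapter = bool(normalized) and normalized[0] == "chapter"
    short_page_with_chapter = (len(words) < max_words
                               and "chapter" in normalized)
    return {
        "starts_with_chapter": starts_with_chapter,
        "short_page_with_chapter": short_page_with_chapter,
    }


print(chapter_signals(["CHAPTER", "I."]))
print(chapter_signals("He read the last chapter again before going to sleep".split()))
```

Keeping the two signals separate lets a caller decide how to weigh them, for example treating a page where both are true as a stronger candidate.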

Copyright pages in any language generally can be expected to have:
-a placement near the start of the volume
-one or more 4-digit numbers, less than +2;
-far fewer words than typical pages; and
-one or more place names.
They are normally found verso to (and following) the Title page, which can be expected to:
-state the long title (Duh!)
-state at least one author name (though often not in the form or spelling used by OL)
Not infrequently the publisher, publication date, and place are shown on the title page instead of its verso, but taken together the pair of pages should be readily identifiable.
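
Some of the language-independent signals above could be combined roughly like this. It is a sketch under several assumptions: pages are lists of OCR words, "near the start" means the first 10% of pages, "far fewer words" means under half the median page length, and the upper bound on the year is passed in as `year_limit` because the exact bound is not clear from the comment. Place-name and title/author matching are left out since they need outside data. All names here are hypothetical.

```python
import re
from statistics import median
from typing import List, Optional

YEAR_RE = re.compile(r"\b\d{4}\b")  # standalone 4-digit numbers


def copyright_page_candidate(pages: List[List[str]],
                             year_limit: int,
                             front_fraction: float = 0.1,
                             sparse_ratio: float = 0.5) -> Optional[int]:
    """Return the index of the first front-matter page that looks like a
    copyright page, or None if no page qualifies."""
    if not pages:
        return None
    typical_length = median(len(p) for p in pages)
    # Only look at the front of the volume (assumed: first 10% of pages).
    front_count = max(1, int(len(pages) * front_fraction))
    for index in range(front_count):
        words = pages[index]
        years = [int(y) for y in YEAR_RE.findall(" ".join(words))]
        has_year = any(y < year_limit for y in years)
        is_sparse = 0 < len(words) < typical_length * sparse_ratio
        if has_year and is_sparse:
            return index
    return None


body = [["word"] * 100 for _ in range(18)]
book = [["A", "Long", "Title", "by", "Some", "Author"],
        ["Copyright", "1920,", "London"]] + body
print(copyright_page_candidate(book, year_limit=2021))
```

Since the title page often carries the date instead, it may be worth flagging the returned page together with its neighbor rather than treating the single index as definitive.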
