Write a program to identify page types [Cover, Table of Contents, Chapter Pages, etc] from Internet Archive djvu.xml files.
The Internet Archive Bookreader Preview is able to show front-matter (such as table of contents, copyright page, cover, etc) if these pages are “asserted” (i.e. marked) in our metadata. This means we can possibly make hundreds of thousands of book page previews available which aren’t today.
Where does the word “copyright” first appear in the book? That’s probably the copyright page.
Where does the word “Chapter” appear in the book?
Where does the phrase “Table of Contents” first appear in the book?
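The three questions above amount to a first-occurrence search over per-page text. Here is a minimal sketch, assuming the OCR text has already been split into one string per page. The function name first_page_containing and the sample pages are hypothetical, made up for illustration only.

```python
import re

def first_page_containing(pages, phrase):
    """Return the 0-based index of the first page containing phrase, or None."""
    # Match case-insensitively and allow any run of whitespace between words,
    # since OCR output often breaks a phrase across lines.
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    for index, text in enumerate(pages):
        if pattern.search(text):
            return index
    return None

# Hypothetical sample pages, not taken from a real book.
pages = [
    "A GEOGRAPHY OF THE WORLD",
    "Copyright 1921\nAll rights reserved",
    "TABLE OF\nCONTENTS\nChapter I ... 1",
    "CHAPTER I\nIntroduction",
]
for phrase in ("copyright", "table of contents", "chapter"):
    print(phrase, "->", first_page_containing(pages, phrase))
```

In this sample the first hit for "chapter" is the contents page rather than the first chapter page, which shows why a first occurrence is only a guess and should be combined with other signals.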
Find a public book and download the djvu.xml file (which has OCR + page numbers) using these instructions.
Here is an example of an open item with Table of Contents, etc:
@tabshaikh
@devarshigoswami @ishank-dev This would be an interesting scoped project with a huge impact if you are interested :)
There's considerable variability in the information that's already available. Some volumes have the page types in the scandata.xml file (this volume only has the title page marked) and others have it in the ABBYY OCR file. Those should all be checked before resorting to guessing.
Don't forget that "Copyright" and "Table of Contents" are spelled differently in different languages.
This sounds like a good application for training a machine learning classifier to identify the page types. Volumes that have the page types in the scandata.xml can be used to create the training set.
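As a rough sketch of how existing assertions could be harvested as training labels: the code below assumes scandata.xml stores each page as a page element with a leafNum attribute and a pageType child, and that "Normal" is the default type for unmarked pages. Both are assumptions to verify against real files; page_type_labels and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

def page_type_labels(scandata_xml):
    """Return {leaf number: page type} for pages that have a type recorded."""
    root = ET.fromstring(scandata_xml)
    labels = {}
    # Assumed layout: <page leafNum="N"><pageType>...</pageType></page>
    for page in root.iter("page"):
        page_type = page.findtext("pageType")
        leaf = page.get("leafNum")
        if page_type and leaf is not None:
            labels[int(leaf)] = page_type.strip()
    return labels

# Hypothetical sample, shaped like the assumed layout.
SAMPLE = """<book><pageData>
  <page leafNum="0"><pageType>Cover</pageType></page>
  <page leafNum="1"><pageType>Title</pageType></page>
  <page leafNum="2"><pageType>Normal</pageType></page>
</pageData></book>"""

labels = page_type_labels(SAMPLE)
# Keep only explicitly marked front matter; "Normal" is assumed to be the default.
front_matter = {leaf: t for leaf, t in labels.items() if t != "Normal"}
print(front_matter)
```

A volume where front_matter comes back empty probably has nothing asserted yet, so it is a candidate for prediction rather than for the training set.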
BTW, a better link for the download is the pre-redirect URL which can be easily computed: https://archive.org/download/in.ernet.dli.2015.504418/2015.504418.Cambridge-Geographical_djvu.xml
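Computing that URL is just string assembly from the item identifier and the file name, following the pattern of the link above. A small sketch, where download_url is a hypothetical helper name:

```python
from urllib.parse import quote

def download_url(identifier, filename):
    """Build the pre-redirect download URL for a file in an Archive.org item."""
    # quote() percent-encodes characters such as spaces that may occur in file names.
    return "https://archive.org/download/{}/{}".format(
        quote(identifier), quote(filename)
    )

url = download_url(
    "in.ernet.dli.2015.504418",
    "2015.504418.Cambridge-Geographical_djvu.xml",
)
print(url)
```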
I cannot really say how much time I would be able to spend on this as I am busy with another project ATM.
But I can help write small scripts that will allow fetching and iterating over the XML files and look for the keywords that you mentioned.
There's existing code here already for doing similar processing on Archive.org books:
https://github.com/Open-Book-Genome-Project/sequencer
This is a guide on downloading and using Archive.org files (e.g. djvu.xml):
https://docs.google.com/document/u/1/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh
There's plenty of background information here:
https://bookgenomeproject.org/ and the whitepaper https://docs.google.com/document/d/1eybbw_qZ3EE9CJg868BhPuq5z_36Wq2G0Ki3Lkde9v8/edit?ts=5e4516a6#heading=h.u35u6r32vmeh
Next Steps
Let's start with identifying copyright pages (this seems like an easier one). The code we currently have, https://github.com/Open-Book-Genome-Project/sequencer/blob/master/bgp/runner.py, works off the djvu.txt because the word-frequency modules we currently run do not need page numbers. Identifying page types does require page numbers, so we'll want some way to use djvu.xml for this new copyright page-detection module.
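One possible way to get page-level text out of djvu.xml is sketched below. It assumes each scanned page is an OBJECT element whose OCR text sits in nested WORD elements; that layout should be checked against a real file. The helper words_by_page and the sample document are hypothetical and are not part of the sequencer code.

```python
import xml.etree.ElementTree as ET

def words_by_page(djvu_xml):
    """Return a list with one list of OCR words per page, in file order."""
    root = ET.fromstring(djvu_xml)
    pages = []
    # Assumed layout: one <OBJECT> per scanned page, OCR text in nested <WORD>s.
    for page in root.iter("OBJECT"):
        words = [w.text.strip() for w in page.iter("WORD")
                 if w.text and w.text.strip()]
        # Pages without text still get an entry so that indexes stay aligned.
        pages.append(words)
    return pages

# Hypothetical two-page sample shaped like the assumed layout.
SAMPLE = """<DjVuXML><BODY>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
  <LINE><WORD>Cambridge</WORD> <WORD>Geographical</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
  <LINE><WORD>Copyright</WORD> <WORD>1921</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>"""

for index, words in enumerate(words_by_page(SAMPLE)):
    print(index, " ".join(words))
```

The index here is the position of the page in the file, not the printed page number. For full-size books, xml.etree.ElementTree.iterparse would avoid loading the whole file into memory at once.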
One heuristic for detecting chapter pages: the word "chapter" is usually the first word on a chapter page, and pages with fewer than 10 words that contain the word "chapter" are strong candidates.
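A sketch of that heuristic, assuming each page is available as a list of OCR words. The two-point scoring scheme and the name chapter_score are illustrative choices, not something settled in this thread.

```python
def chapter_score(words, max_words=10):
    """Score a page: +1 if it opens with "chapter", +1 if it is short and mentions it."""
    # Strip punctuation that OCR tends to leave attached to words.
    normalized = [w.strip(".,:;!?\"'()[]").lower() for w in words]
    score = 0
    if normalized and normalized[0] == "chapter":
        score += 1
    if len(normalized) < max_words and "chapter" in normalized:
        score += 1
    return score

# Hypothetical pages: a chapter title page, a long chapter opening, running text.
title_page = ["CHAPTER", "I.", "Introduction"]
long_opening = ["Chapter", "II"] + ["word"] * 20
running_text = "In the previous chapter we saw that rivers shape the land.".split()

for page in (title_page, long_opening, running_text):
    print(chapter_score(page))
```

Pages scoring 2 would be the strongest candidates. The English keyword would need to be replaced or extended for books in other languages.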
Copyright pages in any language generally can be expected to have:
-a placement near the start of the volume
-one or more 4-digit numbers, less than
-far fewer words than typical pages; and
-one or more place names.
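Three of the signals in this list can be turned into simple numeric features, sketched below. This assumes pages are lists of OCR words. No upper bound is applied to the 4-digit numbers because the list above does not state one, and place-name detection is left out since it would need a gazetteer. The name copyright_features and the sample volume are hypothetical.

```python
import re
from statistics import median

FOUR_DIGITS = re.compile(r"\d{4}")

def copyright_features(pages, index):
    """Compute simple features for pages[index]; pages is a list of word lists."""
    words = pages[index]
    # Median page length stands in for a "typical" page; avoid dividing by zero.
    typical = median(len(p) for p in pages) or 1
    return {
        # 0.0 for the first page, approaching 1.0 toward the end of the volume.
        "relative_position": index / len(pages),
        # Tokens that are exactly four digits after stripping punctuation.
        "four_digit_numbers": sum(
            1 for w in words if FOUR_DIGITS.fullmatch(w.strip(".,;:()[]"))
        ),
        # Small values mean far fewer words than a typical page.
        "length_ratio": len(words) / typical,
    }

# Hypothetical five-page volume: title, copyright page, three body pages.
body = ["word"] * 200
volume = [["Cambridge", "Geographical"], ["Copyright", "1921,", "London"],
          body, body, body]
features = copyright_features(volume, 1)
print(features)
```

These features could feed either hand-tuned thresholds or a trained classifier; none of them depends on the language of the book.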
They are normally found verso to (and following) the Title page, which can be expected to:
-state the long title (Duh!)
-state at least one author name (though often not in the form or spelling used by OL)
Not infrequently the publisher, publication date and place are shown on the t.p. instead of verso, but taken together the pair of pages should be readily identifiable.