Guide to digitising public domain books from scans

(1) Choose your digitisation platform

A handful of digitisation projects take on the digitising and proofreading of ‘recent’ Latin texts. Before you start, check whether one of them already plans to digitise your book or is in the process of doing so. These are:

Project Gutenberg and Distributed Proofreaders, including their proposal list
Project Gutenberg Canada and their Distributed Proofreaders
Possibly, Project Gutenberg Australia
Latin Wikisource

For older Latin texts, there are academic sites where digitised texts may reside. You should check these before proceeding.

There are advantages to each approach. The Gutenberg projects are less visible until work is complete, but they encourage teamwork, and a project is probably easier to complete if others are keen to help you. Once you get agreement to take a digitisation project forward, the Gutenberg projects have very clear procedures for checking and eliminating errors.

Wikisource, on the other hand, is easy to start, and the results are immediately visible. If you are familiar with Wiki markup or already volunteer on Wikipedia, it is very convenient to use. However, the tools for collaborating and co-ordinating work still need development, and many projects are left incomplete as a result. Introduce yourself at the Scriptorium if you would like to work with others on Latin texts.

(2) Setting up a text for digitisation on Wiki Commons and Wikisource

Setting a book up to digitise is easy enough.

Find and download a text from Google Books or Archive.org. Remove any Google Books cover page (WikiCommons don’t allow these title pages, as they are arguably under Google’s copyright).
You can use a PDF or DJVU file. WikiCommons say that DJVU is preferred, but it is easier to use PDFs, not least because you may need to add or remove pages in the document later.
Make sure you have a Wikimedia account that you can use on both WikiCommons and Wikisource.
Upload the document to WikiCommons. You have to provide information about the source and the reason it is out of copyright. Roughly speaking, any book published at least 96 years ago is fair game under US law, which WikiCommons says it publishes under. As of 2021, books published in the USA before 1926 are regarded as public domain. (Some other books are public domain in the USA if their copyright was not “renewed”, but that needs another guide.)
Note the name of the document. For instance, File:Ad_Alpes.djvu is the name of the Ad Alpes book on WikiCommons.
Decide if the book is mainly in English or Latin. Place books mostly in English on en.wikisource.org and books mainly in Latin on la.wikisource.org.
Make a page for the book digitisation, using the same file name, but replacing “File:” with “Index:” on en.wikisource.org, or “Liber:” on la.wikisource.org. Ad Alpes is at https://la.wikisource.org/wiki/Liber:Ad_Alpes.djvu and the Key to Easy Latin Stories is at Index:Key to Easy Latin Stories for beginners.djvu on en.wikisource.org.
When you edit the page, a number of automatic fields are created for author, year, source and so on; fill these in as appropriate.
Add <pagelist/> to the “Pages” field. If this is not done, you won’t see any scans for editing. You can now view the individual pages for editing. The pagelist tag allows you to define the page numbers of the book; any page sequence using Roman numerals and so on; plus pages to ignore. This is easiest to learn by copying from other books.
Click any page number to create a page and start the editing process. Save a few of these, then check that the book is complete, whether any pages have been accidentally scanned twice, and where the page numbers start.
You can use the “OCR” button to pull a digitised version of the text onto the wiki page for editing.
You will need to edit and adjust the pagelist tag as you do this. The scan numbers will not match the page numbers printed in the book, and the pagelist tag lets you map “scan numbers” to “page numbers”. Ad Alpes has:
<pagelist 1to3="–"
4=1
4to20=roman
6to17="–"
18=iii
21=1 />
This tells you and the Wiki software that the first three scans are ignored; the first book page is the fourth scan; scans 4-20 are for pages in Roman numerals; scans 6-17 are ignored; numbering restarts at scan 18 with page iii; and scan 21 is page 1 of the book.
Go through the whole book to find errors. Duplicate scanned pages are common, but sometimes you will also find pages are missing.
If you find missing scans you will need to add them. Find another scan, add the missing pages to the file on your computer and replace the original file on WikiCommons with the updated copy. Don’t upload it as a whole new file; use the “Upload a new version of this file” option instead.
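
To make the pagelist rules easier to reason about, here is a minimal Python sketch that works out the label each scan would get from the Ad Alpes rules. It follows the description given in this guide, not the actual Wikisource implementation, so treat it as an approximation. The function names and the total of 22 scans are assumptions made for illustration.

```python
import re

# Roman numeral table, largest value first (lowercase, as in the guide's example).
ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(number):
    """Convert a positive integer to a lowercase Roman numeral."""
    parts = []
    for value, numeral in ROMAN:
        count, number = divmod(number, value)
        parts.append(numeral * count)
    return "".join(parts)


def from_roman(text):
    """Convert a lowercase Roman numeral such as 'iii' to an integer."""
    values = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
    total = 0
    for char, following in zip(text, text[1:] + " "):
        value = values[char]
        # A smaller numeral before a larger one is subtracted (iv = 4).
        total += -value if values.get(following, 0) > value else value
    return total


def page_labels(rules, total_scans):
    """Map each scan number to a page label (hypothetical helper).

    Simplified reading of the rules, following the guide's explanation:
    a number or Roman numeral restarts the count at that scan, 'roman'
    sets the display style for a range, and anything else is a literal label.
    """
    starts, roman_ranges, literal_ranges = {}, [], []
    for key, value in rules:
        match = re.fullmatch(r"(\d+)(?:to(\d+))?", key)
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if value.isdigit():
            starts[first] = int(value)
        elif value == "roman":
            roman_ranges.append((first, last))
        elif re.fullmatch(r"[ivxlcdm]+", value):
            starts[first] = from_roman(value)
        else:
            literal_ranges.append((first, last, value))

    labels = {}
    number = 0
    for scan in range(1, total_scans + 1):
        number = starts.get(scan, number + 1)
        literal = next((label for lo, hi, label in literal_ranges
                        if lo <= scan <= hi), None)
        if literal is not None:
            labels[scan] = literal
        elif any(lo <= scan <= hi for lo, hi in roman_ranges):
            labels[scan] = to_roman(number)
        else:
            labels[scan] = str(number)
    return labels


# The Ad Alpes rules from the guide; 22 scans is an assumed total.
AD_ALPES_RULES = [("1to3", "–"), ("4", "1"), ("4to20", "roman"),
                  ("6to17", "–"), ("18", "iii"), ("21", "1")]
labels = page_labels(AD_ALPES_RULES, total_scans=22)
for scan in (1, 4, 5, 6, 18, 20, 21):
    print(scan, labels[scan])
```

Printing a few labels like this before saving the index page is a quick way to spot a rule that is off by one scan.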

(3) Set up book pages for the online book

At some point you will want to set up Wiki pages to show the compiled book. You don’t need to do all of the proofreading first, so it is up to you when you want to do this.

There are a lot of options to learn for these pages, so the best approach is to look at the source of other pages, copy it and experiment until you get the style you want. Feel free to poke around the pages set up for Ad Alpēs or Colloquia Familiara to see how it is done.

The main things to note are that:

the page needs navigation; this is at the top and bottom on most books;
the code <pages index="Ad Alpes.djvu" from=19 to=19 /> draws the relevant scans by scan number from the book index, in this case just scan 19;
the landing page for the book chooses the pages for export to ebook; all links to pages underneath the front page will be exported, in order, including duplicate links. So make sure each section or chapter is listed, once only, in the natural order of reading.
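
As a rough illustration of the last two points, the Python sketch below builds the pages tag for a range of scans and checks a list of landing-page links for repeats, since repeated links would be exported twice. The helper names and the chapter titles are hypothetical and exist only for this example.

```python
from collections import Counter


def pages_tag(index_file, first_scan, last_scan):
    """Build the tag that pulls scans first_scan..last_scan from the index."""
    return f'<pages index="{index_file}" from={first_scan} to={last_scan} />'


def repeated_links(links):
    """Return links that appear more than once, in order of first appearance."""
    counts = Counter(links)
    return [link for link in dict.fromkeys(links) if counts[link] > 1]


# Tag for a single scan, matching the example in the guide.
print(pages_tag("Ad Alpes.djvu", 19, 19))

# Hypothetical landing-page links; "Caput I" has been listed twice by mistake.
toc_links = ["Ad Alpes/Caput I", "Ad Alpes/Caput II", "Ad Alpes/Caput I"]
print(repeated_links(toc_links))
```

An empty result from the second check suggests that each section is listed only once; the reading order still needs to be checked by eye.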
