Working with PDF, DjVu and EPUB

Despite my reluctance to get along with this Internet world, I appreciate that we have so many e-books available online. This wealth of knowledge and literature excites me and is an invaluable gift of our age. The reading experience might be terrible compared to paper books, but I am used to it now. In this post, I share my experience with various useful everyday tasks involving e-books in three formats: PDF, DjVu and EPUB.

PDF compatibility issues

PDF documents produced by different tools are not always compatible with every PDF reader or printer software. The ps2pdf command from Ghostscript, which rewrites the whole document, usually solves this problem. The usage is as simple as:

ps2pdf old-problematic.pdf new-perfect.pdf

Reduce PDF file size

For non-scanned PDF documents ^[1], one can again use the ps2pdf command mentioned above. If the plain command given above does not reduce the file size, one can adjust the distiller parameters by passing an argument such as -dPDFSETTINGS=/ebook to ps2pdf.
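Whether a preset actually helps is easy to check by comparing byte counts. Here is a minimal sketch in portable POSIX sh; the function names file_size and shrink_pdf are hypothetical, and /ebook is only the default I picked among the Ghostscript presets (/screen, /ebook, /printer, /prepress).

```shell
# Print the size of file $1 in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Rewrite $1 into $2 with a Ghostscript preset (default: /ebook) and
# report both sizes. Fails, and removes $2, if the result is not smaller.
shrink_pdf() (
    src=$1 dst=$2 preset=${3:-/ebook}
    ps2pdf -dPDFSETTINGS="$preset" "$src" "$dst" || exit 1
    before=$(file_size "$src")
    after=$(file_size "$dst")
    echo "$src: $before bytes -> $dst: $after bytes"
    if [ "$after" -ge "$before" ]; then
        rm -f "$dst"
        exit 1
    fi
)
```

For instance, shrink_pdf big.pdf small.pdf /screen tries the most aggressive preset and keeps small.pdf only if it is really smaller.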

For PDF documents made of scanned images, I highly recommend converting them to DjVu documents. See the following section for details.

Convert scanned PDF

The DjVu format, pronounced like déjà vu, is designed to compress scanned images into a document. To start with, we create a working directory and then use pdfimages to extract all images, which get names of the form D-XXX.pbm or D-XXX.ppm.

mkdir -p /var/tmp/extract
cd /var/tmp/extract
pdfimages a-large-scanned-document.pdf D

Now in our working directory we usually have two different types of images:

PBM, Portable Bit Map for black and white images;
PPM, Portable Pixel Map usually meant for colorful images.
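Before deciding what to do with each type, it helps to count them. A small sketch for POSIX sh or bash; count_ext is a hypothetical helper name, not part of any tool mentioned here.

```shell
# Count the files in directory $1 whose names end in .$2.
count_ext() (
    cd "$1" || exit 1
    n=0
    for f in *."$2"; do
        # An unmatched pattern stays literal, so test that the file exists
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)
```

In the working directory, echo "PBM: $(count_ext . pbm), PPM: $(count_ext . ppm)" then shows at a glance which of the two cases discussed next applies.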

If there are only a few PPM images, they are usually the book cover or front pages. I suggest keeping them in color since they won't take much space. But if all images are in PPM format, keeping the color is not necessary at all: it is better to convert them to PBM images, which reduces the document size considerably. To convert images, we use the convert command from ImageMagick. At this stage, we can also improve the contrast or details of the scanned pages. We tune an exemplar image with varying parameters and preview the results with the display command to find the best parameters for the current document. See the documentation of the -blur parameter for finer adjustment.

convert -blur 2x0.3 -threshold 73% D-011.ppm /tmp/tmp.pbm && display /tmp/tmp.pbm

In case the exemplar image (D-011.ppm) has inverted colors, that is to say, white text on a black background, we should add the -negate argument immediately after the convert command. Also, sometimes one may need to replace the page margins with pure white, which can be done by combining the -shave and -border arguments:

convert -negate -shave 80x100 -bordercolor white -border 80x100 D-011.pbm /tmp/tmp.pbm && display /tmp/tmp.pbm
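The search for good parameters can be partly automated by writing one preview per candidate value and then comparing them with display. The following is only a sketch in portable POSIX sh: the function name threshold_sweep and the output naming are my own choices for illustration, and the blur value is simply copied from the command above.

```shell
# Write one preview of image $1 per threshold value into directory $2.
# The remaining arguments are threshold percentages, e.g. 65 70 75.
threshold_sweep() (
    img=$1 outdir=$2
    shift 2
    base=${img##*/}   # file name without its directory
    base=${base%.*}   # file name without its extension
    mkdir -p "$outdir" || exit 1
    for t in "$@"; do
        convert "$img" -blur 2x0.3 -threshold "$t%" "$outdir/$base-t$t.pbm" || exit 1
        echo "$outdir/$base-t$t.pbm"
    done
)
```

Calling threshold_sweep D-011.ppm /tmp/preview 65 70 75 80 prints the paths of the four previews, which can then be opened one by one with display.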

After finding the best adjustment, we first create subdirectories to store the processed images and the future DjVu pages. Then we use a for loop to convert all images.

mkdir opt-img djvu
# ${file:r} is zsh syntax for the file name without its extension
for file in D*.pbm; do
    convert -blur 2x0.3 -threshold 73% "$file" "opt-img/${file:r}.pbm"
done

You should decide which images to process. In the above example, I only process the PBM images, since I usually keep the very few PPM images as they are. After this image processing step, we convert each image into a single-page DjVu file.

cd opt-img
for file in *.pbm; do
    cjb2 "$file" "../djvu/${file:r}.djvu"
done

Here the cjb2 command from DjVuLibre converts PBM images to single-page DjVu files. For PPM images, we should use the c44 command from DjVuLibre instead. Here is an example that converts a few PPM images, namely the four files D-000.ppm to D-003.ppm.

cd /var/tmp/extract
for file in 00{0..3}; do
    c44 D-$file.ppm djvu/D-$file.djvu
done
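The two loops above can also be folded into one that picks the encoder from the file extension. This is a sketch under my own assumptions: encode_pages is a hypothetical name, and the code is written in portable POSIX sh, so it uses ${name%.*} where the zsh loops above use ${file:r}.

```shell
# Encode every PBM and PPM image of directory $1 as a single-page
# DjVu file in directory $2: cjb2 for bitonal pages, c44 for color ones.
encode_pages() (
    src=$1 dst=$2
    mkdir -p "$dst" || exit 1
    for img in "$src"/*.pbm "$src"/*.ppm; do
        [ -e "$img" ] || continue   # skip a pattern that matched nothing
        page=${img##*/}
        page=${page%.*}
        case $img in
            *.pbm) cjb2 "$img" "$dst/$page.djvu" || exit 1 ;;
            *.ppm) c44 "$img" "$dst/$page.djvu" || exit 1 ;;
        esac
    done
)
```

With the layout used in this section, one would call it once per source directory, for example encode_pages opt-img djvu for the processed PBM pages.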

Finally, we combine all single-page DjVu files into one DjVu document using the djvm command provided by DjVuLibre.

cd /var/tmp/extract
djvm -c my-awesome-and-light-document.djvu djvu/*
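As a sanity check, one can compare the page count of the bundled document with the number of single-page files. A minimal sketch, assuming the djvused tool from DjVuLibre is installed (its n command prints the number of pages); check_page_count is a hypothetical name.

```shell
# Succeed if the bundled document $1 has as many pages as there are
# .djvu files in directory $2.
check_page_count() (
    doc=$1 dir=$2
    expected=0
    for f in "$dir"/*.djvu; do
        if [ -e "$f" ]; then
            expected=$((expected + 1))
        fi
    done
    actual=$(djvused -e n "$doc") || exit 1
    echo "pages in $doc: $actual, single-page files: $expected"
    [ "$actual" -eq "$expected" ]
)
```

For the example above this would be check_page_count my-awesome-and-light-document.djvu djvu.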

Add bookmarks

For a textbook in PDF or DjVu format, it is convenient to have bookmarks (an outline) in the document so that one can jump between chapters. To add them, we need the page numbers of the chapters and sections. These are usually found in the document itself, namely on the table of contents pages. Hence, we should first extract the page number data. ^[2] Then we edit it into a form that programs can easily process and add the necessary extra data, such as the page offset, since page 1 in a book's table of contents is usually not the first page of the document. Finally we embed the data into our document.

# Extract the table of contents of a PDF document, located on pages 8 and 9.
pdftotext -f 8 -l 9 -layout this-document-need-bookmarks.pdf toc-raw


# Perform the same task for a DjVu document, using djvutxt from DjVuLibre
djvutxt -page=8-9 this-document-also-need-bookmarks.djvu toc-raw

I have processed the raw tables of contents extracted from many books; please use this file as a standard example, where I use \t indentation for bookmark levels and d=13 to indicate the page offset, i.e., a shift of 13 pages. I also wrote the scripts add_bkmk and export_bkmk to embed or export bookmarks; one can find them here.

./add_bkmk this-document-will-get-bookmarks.pdf toc-processed
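To illustrate the page offset, here is a sketch of how a tab-indented table of contents can be turned into shifted page numbers. It does not reproduce add_bkmk or its input format: the function name shift_toc, the assumption that each line ends with the printed page number, and the level/page/title output are all hypothetical.

```shell
# Read a table of contents on stdin, one entry per line, indented with
# one tab per nesting level and ending with the printed page number.
# Print "level<TAB>page<TAB>title", where page is shifted by offset $1.
shift_toc() {
    awk -v offset="$1" '
        {
            # Count and strip leading tabs to get the nesting level
            level = 1
            while (substr($0, 1, 1) == "\t") {
                level++
                $0 = substr($0, 2)
            }
            # Skip lines that do not end with a page number
            if (NF < 2 || $NF !~ /^[0-9]+$/) next
            page = $NF + offset
            # Drop the page number and any dot leaders from the title
            title = $0
            sub(/[ .\t]+[0-9]+[ \t]*$/, "", title)
            printf "%d\t%d\t%s\n", level, page, title
        }'
}
```

With an offset of 13, running shift_toc 13 < toc-raw maps an entry printed as page 1 to page 14 of the document.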

As for EPUB documents, use Sigil, as described in the section on EPUB documents.

EPUB documents

I use Sigil ^[3] to edit EPUB documents. Since many tasks are performed more efficiently without graphical interfaces and their possibly stupid designs, I wrote this script to do some basic things such as:

upgrade to EPUB 2 documents;
tidy HTML files;
fix errors detected by epubcheck.
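Which of these steps a book needs depends partly on its EPUB version, which is recorded in the version attribute of the package element of its OPF file. The sketch below is not the script mentioned above; opf_version is a hypothetical helper, and it assumes that the attribute is double-quoted and sits on the same line as the opening package tag.

```shell
# Read an OPF package document on stdin and print its EPUB version,
# e.g. 2.0 or 3.0. Prints nothing if no version attribute is found.
opf_version() {
    sed -n 's/.*<package[^>]*[[:space:]]version="\([^"]*\)".*/\1/p' | head -n 1
}
```

Since an EPUB file is a zip archive, one can feed it with unzip -p book.epub OEBPS/content.opf | opf_version; the path OEBPS/content.opf is common but not universal, and the real one is listed in META-INF/container.xml.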

I store and read all my edited EPUB books on Google Play Books.

Print webpages

The dedicated tool for this task is wkhtmltopdf. I once made a web page for my curriculum vitae so that I can always print it to a PDF file with this convenient tool.
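A typical call takes a URL and an output file. The wrapper below is only a sketch: page_to_pdf is a hypothetical name, the A4 page size is my assumption, and the output name is derived from the last component of the URL when none is given.

```shell
# Print the web page at URL $1 to a PDF file.
# The output name defaults to the last URL component with a .pdf extension.
page_to_pdf() (
    url=$1
    name=${url%/}      # drop a trailing slash
    name=${name##*/}   # keep the last path component
    name=${name%.*}    # drop the extension, if any
    [ -n "$name" ] || name=page
    wkhtmltopdf --page-size A4 "$url" "${2:-$name.pdf}"
)
```

For example, page_to_pdf https://example.org/cv.html writes cv.pdf in the current directory.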

Document reader software

I recommend zathura for reading PDF and DjVu on computers; here is my zathurarc file if you are interested. For mobile platforms, ReadEra seems to be the best.

Text from image-based documents

Optical character recognition (OCR) is designed to achieve this goal. There are many tools based on Tesseract, such as ocrmypdf and ocrodjvu. For mathematical documents, one can consider the commercial product Mathpix or the open source project LaTeX-OCR.
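OCR is slow, so it is worth checking first whether a PDF already carries text. A minimal sketch in portable POSIX sh, assuming pdftotext and ocrmypdf are installed; the function names are hypothetical, and eng is just the default Tesseract language code I chose.

```shell
# Succeed if pdftotext finds at least one letter or digit in PDF $1.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Write a searchable version of PDF $1 to $2. Files that already
# contain text are copied unchanged. $3 is the OCR language.
ocr_if_needed() {
    if has_text_layer "$1"; then
        cp "$1" "$2"
    else
        ocrmypdf -l "${3:-eng}" "$1" "$2"
    fi
}
```

For a scanned Chinese book one would call, for instance, ocr_if_needed scan.pdf searchable.pdf chi_sim, provided the corresponding Tesseract language data is installed.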

Download e-books

Usually, I use Library Genesis and Z-Library for English books. For Chinese books, I rely on direct searches with Google or on search engines for files shared on 阿里云盘 (Aliyun Drive), such as UP云搜. To download shared files from 阿里云盘, I wrote the script aliyun-share, which stores the metadata of shared files or downloads a file via its id.

# Download the metadata of the files in a folder from a share URL
./aliyun-share https://www.aliyundrive.com/s/9y1gY9mbhfk/folder/625538b114ca25f59c1643ea9f2ecde089f32271

# Download a specific file by its id
./aliyun-share 625538b9b3e1a10642784adb8c7443d4e6dd9aaf

^[1] For example, PDF documents produced by LaTeX, Adobe InDesign or some Print-to-PDF tools.
^[2] The extraction could be complicated in some cases, see my comments.
^[3] On the Wayland platform, Qt is buggy; to fix it, one can use XWayland instead through the environment variable QT_QPA_PLATFORM=xcb.
