Utente:Aubrey/djvu

See also:

Specific trials
OCR with Tesseract
See also Wikimedia Commons:Commons:DjVu.

Conversion

Converting to DjVu

Windows

Images → virtual printer → DjVu

If the page scans are made available as a PDF file, e.g. Google Books scans, then this can be directly converted into a DjVu file using either:

The free Any2DjVu online service; this can also OCR the text and embed it in the .djvu file.
The freeware Pdf To Djvu GUI. Note that this requires the Cygwin environment to be installed first.
The free software command-line tool pdf2djvu (available in repositories, also for Linux), which is usually as simple as pdf2djvu -o output.djvu input.pdf. There's also a GUI available.
If you need to crop the PDF, you can use pdfcrop.pl (see below) for black margins or the freeware Govert's PDF Cropper for white margins (it requires Ghostscript and .Net 2.0).

If the scanned images are made available as individual images, then the easiest option is to print them to a PDF via one of the many "virtual printer" tools, such as the free PDFCreator; then convert the PDF to DjVu as described above.

Many other options exist for converting pages to .djvu. One could use PostScript or multipage TIFF as the intermediate format rather than PDF, but this would require different conversion tools. It is also possible to convert from .pdf or .ps to .djvu with the DjVuLibre software and its GSDjVu plug-in, but because of licensing restrictions, installing the plug-in is a fairly intricate process that involves compiling a patched version of Ghostscript.

Another free Windows tool that can come in handy for the images-to-pdf-to-djvu process is ConcatPDF, a GUI tool that permits easy splitting and merging of PDF files. For example, if a 100-page document has previously been scanned and converted to .djvu and the single page #42 needs to be re-scanned, ConcatPDF allows that one page to be inserted into the intermediate .pdf file without tracking down the other page images and re-composing the entire document. Installing ConcatPDF version 1.1 requires two free Microsoft libraries to be installed beforehand: Microsoft .NET Framework Version 1 and the corresponding Visual J# .NET Redistributable Package.

Images directly to DjVu

However, a far higher quality document can be achieved using the DjVuLibre software library. JPEG images can be directly encoded into individual DjVu pages using the c44 encoder. Images in lossless formats such as PNG should be converted to PPM (for colour scans) or PGM (for greyscale scans), then encoded using c44. For bitonal (i.e. black-and-white) scans, such as most page text images, a smaller DjVu file can be obtained by converting the page images to the monochrome PBM format, then encoding to DjVu using the cjb2 encoder. All of these image format conversions can be performed by the freeware ImageMagick library (in batch, with mogrify). Individual DjVu pages can be aggregated into a multi-page DjVu using the djvm program; this program can also be used to insert pages into or delete pages from a DjVu file.
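The choice of encoder described above can be sketched as a small shell helper. This is only an illustration: the function names encoder_for and encode_page are hypothetical and not part of DjVuLibre, and the sketch assumes c44 and cjb2 are on the PATH.

```shell
#!/bin/bash
# encoder_for IMAGE: print the DjVuLibre encoder suited to the file extension.
# (Hypothetical helper name; the mapping follows the paragraph above.)
encoder_for() {
    case "${1##*.}" in
        jpg|jpeg|JPG|JPEG|ppm|pgm) echo c44 ;;   # colour or greyscale input
        pbm)                       echo cjb2 ;;  # bitonal input
        *)                         echo "" ;;    # e.g. PNG: convert it first
    esac
}

# encode_page IMAGE: encode one image into a single-page DjVu next to it.
encode_page() {
    local img=$1 out="${1%.*}.djvu" enc
    enc=$(encoder_for "$img")
    if [ -z "$enc" ]; then
        echo "unsupported format, convert with ImageMagick first: $img" >&2
        return 1
    fi
    "$enc" "$img" "$out"
}

# Example use (file names are placeholders):
#   for f in page-*.pbm; do encode_page "$f"; done
#   djvm -c book.djvu page-*.djvu
```

The final djvm -c call bundles the single-page files in the order the shell glob lists them, so zero-padded page numbers in the file names keep the pages in sequence.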

An important caveat of this process is that high-quality scans come at the cost of larger files, and there is currently a 100 MB limit on uploads to Commons.

Scripting DjVuLibre

There is a script at Commons that will allow you to take a whole directory of jpg images and convert and collate them automatically.

Linux

See also: User:GrafZahl/How to digitalise works for Wikisource

Method 1

You need the DjVuLibre software, which includes a viewer and some tools for creating and handling DjVu files. You will probably also need the ImageMagick software for converting scans from one format to another. The tool cjb2 is used to create a DjVu file from a PBM or TIFF file, so you need to convert your scans if they are not already in one of these formats.

Conversion from PNG format to PBM format with the convert tool from ImageMagick

convert rig_veda-000.png rig_veda-000.pbm

Depending on the quality of the original scans, you may find it useful to process them with the unpaper utility, which deletes black borders around the pages and aligns the scanned text squarely on the page. Unpaper can also extract two separate page images where facing pages of a book have been scanned into a single image. Other utilities are mkbitmap and pdfcrop.pl (Perl-based free software; it requires Ghostscript and, on Ubuntu, texlive-extra-utils; it uses the BoundingBox and can crop a whole multipage PDF in a single pass). PDFCrop (a different tool with a similar name) deletes white margins.

Creation of a DJVU file from a PBM file

cjb2 -clean rig_veda-000.pbm rig_veda-000.djvu

Adding the DJVU file to the final document

djvm -i rig_veda.djvu rig_veda-000.djvu

You need to repeat these steps with a script for each page of the book. Example:

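#!/bin/bash
# Convert each PNG scan to PBM, encode it with cjb2, and add it to rig_veda.djvu.
for i in rig_veda-*.png
do
        j=$(basename "$i" .png)
        convert "$i" "$j.pbm"
        cjb2 -clean "$j.pbm" "$j.djvu"
        if [ -f rig_veda.djvu ]; then
                djvm -i rig_veda.djvu "$j.djvu"   # insert into the existing document
        else
                djvm -c rig_veda.djvu "$j.djvu"   # first page: create the document
        fi
done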

Alternatively, all the *.djvu parts can be combined into one document in a single step:

djvm -c rig_veda.djvu rig_veda-000.djvu rig_veda-001.djvu rig_veda-002.djvu

See the following section for an automated process for multiple pages.

Method 2

Use this script, which converts a PDF (single or multiple pages) into images, automatically crops them with ImageMagick, converts them to DjVu and bundles them. This is very slow (a huge PDF can take days) but a little more efficient than the following method. The resulting DjVu is quite big and low-quality, probably because of poor font recognition, which may be fixed by newer versions of poppler (the library used): the version available in repositories is usually several months old.¹

You can also remove the pdftoppm part and use the script to convert multiple images directly into a multi-page DjVu. If the images are not in PBM format, you can convert them with a single command using mogrify from ImageMagick.
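The overall shape of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the script referred to above: the function name pdf_to_djvu and the 300 dpi default are made up for the example, the cropping step is left out, and pdftoppm (poppler), cjb2 and djvm are assumed to be installed.

```shell
#!/bin/bash
# pdf_to_djvu INPUT.pdf OUTPUT.djvu [DPI]
# Hypothetical helper: rasterise the PDF, encode each page, bundle the pages.
pdf_to_djvu() {
    local pdf=$1 out=$2 dpi=${3:-300} tmp page status
    tmp=$(mktemp -d) || return 1
    # 1. PDF pages -> bitonal PBM images named page-N.pbm
    pdftoppm -mono -r "$dpi" "$pdf" "$tmp/page" || return 1
    # 2. Each PBM -> single-page DjVu (PBM carries no resolution, so pass it)
    for page in "$tmp"/page-*.pbm; do
        cjb2 -dpi "$dpi" -clean "$page" "${page%.pbm}.djvu" || return 1
    done
    # 3. Bundle; the glob is in page order because pdftoppm zero-pads numbers
    djvm -c "$out" "$tmp"/page-*.djvu
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Example use (file names are placeholders):
#   pdf_to_djvu scan.pdf scan.djvu 300
# To start from existing images instead, skip step 1 and convert them first:
#   mogrify -format pbm *.png
```

Dropping step 1 and pointing the loop at existing PBM files gives the images-only variant mentioned above.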

Method 3

Simply install the pdf2djvu tool from your repository to convert a PDF (single or multiple pages) directly into DjVu. This is slow (several hours for a PDF of about 100 MB, depending on your hardware), but requires little memory and CPU. The resulting DjVu file is quite low-quality and large, and has no OCR.

Moreover, you need to crop the PDF itself before the conversion. On Linux this is quite difficult. You could use ImageMagick convert -crop, but beware: with a big multi-page PDF, this can take several GB of memory (the limit is 16 TB!) and kill your computer if you don't use the -limit area 1 option directly after -crop. This makes the conversion very long. The resulting PDF is larger and lower in quality because of rasterization.² See the other crop tools above.

Method 4

Use djvudigital,³ which like pdf2djvu converts PDF directly into DjVu.⁴ There are licensing problems because the GSDjVu library has a different license, so you'll need to compile it yourself; the included utilities make this step quite easy, but still long (about 1 hour) and a bit annoying.⁵
Once that is done, you can convert a PDF into DjVu with a single command (see the previous section for cropping). The conversion is very slow and can take several days for a big PDF. The resulting DjVu is of higher quality and smaller than the output of the previous two methods.¹ djvudigital has many advanced options to improve results, but they're very difficult to master.⁶

Online ([almost] all systems)

Any2Djvu

Another way to convert the images to DjVu is to zip them and use the Any2Djvu site to create the DjVu file. Any2Djvu will extract the images from the zip and create an OCRed DjVu. OCR works well only with English text. Any2Djvu cannot handle huge files. Big files are best handled by uploading them by URL (e.g. by entering a link like ftp://ftp.bnf.fr/005/N0051165_PDF_1_-1DM.pdf). Conversion can take several hours.

Internet Archive

Another method is to upload the PDF to the Internet Archive. You need to log in (don't use OpenID, it won't work⁷). Click "Upload" at the top-right corner. The JavaScript upload ("Share" button) won't work with Firefox (use Opera or Internet Explorer instead⁸) or on Linux. You can use the FTP upload instead, but this is slower and seems prone to crashing.

When the upload has completed, archive.org will start the "derive" work: OCR to create a PDF with text, then conversion to DjVu with text, text only, etc. This is very, very slow and can take several days, but you don't need to do anything.⁹ The Internet Archive uses professional, proprietary, commercial ABBYY software, which gives quite good images and OCR output in many languages and fonts, with an aggressive compression¹⁰ that maintains a high quality in the final DjVu file.¹

DjVu to text

OCR via Any2DjVu

The OCR option of the free conversion service Any2DjVu performs OCR on the scanned images, but the resulting text is embedded within the .djvu file itself and must be extracted before it can be used on Wikisource.

One way to do this is to use the DjVuLibre software to extract the text, via a command like

djvused myfile.djvu -e 'print-pure-txt' > myfile.txt

or

 djvutxt myfile.djvu > myfile-ocr.txt

JVbot can automatically upload the text layer of a DjVu to the pages on Wikisource. For example, Robert the Bruce and the struggle for Scottish independence - 1909.
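When the text is wanted page by page (one text file per Wikisource page) rather than as a single file, the two DjVuLibre tools shown above can be combined in a loop. The following is a sketch only: the function name extract_page_texts and the page-N.txt naming are assumptions made for the example.

```shell
#!/bin/bash
# extract_page_texts FILE.djvu OUTDIR
# Hypothetical helper: write the text layer of each page to OUTDIR/page-N.txt.
extract_page_texts() {
    local djvu=$1 outdir=$2 total n
    # djvused's 'n' command prints the number of pages in the document
    total=$(djvused "$djvu" -e 'n') || return 1
    mkdir -p "$outdir" || return 1
    for n in $(seq 1 "$total"); do
        # -page= restricts djvutxt to a single page
        djvutxt -page="$n" "$djvu" > "$outdir/page-$n.txt" || return 1
    done
}

# Example use (file names are placeholders):
#   extract_page_texts myfile.djvu myfile-text
```

Pages without a text layer simply produce empty or near-empty files, which makes missing OCR easy to spot.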

OCR via Internet Archive

See above: if you upload a DjVu file, the derive process will OCR it.

OCR with Tesseract

OCR can be done with Tesseract, a free OCR software, and a script: OCR with Tesseract.

DjVu to Images

Linux

To extract images from a DjVu file, you can use ddjvu

ddjvu -page=8 -format=tiff myfile.djvu myfile.tif

If you extracted all the pages (by omitting -page=), you can split the multi-page TIFF into single-page PNG files (or any other format):

convert -limit area 1 myfile.tif myfile.png
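An alternative that avoids handling one very large multi-page TIFF is to extract and convert the pages one at a time. The sketch below assumes ddjvu, djvused and ImageMagick's convert are installed; the function name djvu_to_png and the page-0001.png naming scheme are made up for the example.

```shell
#!/bin/bash
# djvu_to_png FILE.djvu OUTDIR
# Hypothetical helper: write one PNG per page, named page-0001.png and so on.
djvu_to_png() {
    local djvu=$1 outdir=$2 total n name
    total=$(djvused "$djvu" -e 'n') || return 1   # number of pages
    mkdir -p "$outdir" || return 1
    for n in $(seq 1 "$total"); do
        name=$(printf '%s/page-%04d' "$outdir" "$n")
        # ddjvu has no PNG output, so go through a temporary single-page TIFF
        ddjvu -page="$n" -format=tiff "$djvu" "$name.tif" || return 1
        convert "$name.tif" "$name.png" || return 1
        rm -f "$name.tif"
    done
}

# Example use (file names are placeholders):
#   djvu_to_png myfile.djvu myfile-pages
```

Zero-padded names keep the pages in order when they are later listed with a shell glob.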

Manipulating

Splitting DjVu files

Large works cannot be uploaded onto Wikimedia servers, which have a 100 MB upload limit. To split the DjVu, use DjVuLibre "Save as" and specify a page range that will produce a file small enough to be uploaded. Some trial and error may be necessary.

djvused can do this too:

 djvused myfile.djvu -e 'select 10; save-page-with p10.djvu'

This can be done for every page.

To find the number of pages in the file:

 djvused myfile.djvu -e 'n'
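The two djvused commands above can be combined with djvm -c to cut a document into several smaller multi-page files. This is a sketch under assumptions: the helper names page_ranges and split_djvu and the output naming (myfile-part1.djvu and so on) are invented for the example, and file names are assumed to contain no spaces. The number of pages per part still has to be found by trial and error.

```shell
#!/bin/bash
# page_ranges TOTAL CHUNK: print "first last" pairs covering pages 1..TOTAL.
page_ranges() {
    local total=$1 chunk=$2 first=1 last
    while [ "$first" -le "$total" ]; do
        last=$((first + chunk - 1))
        if [ "$last" -gt "$total" ]; then last=$total; fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# split_djvu FILE.djvu CHUNK: write FILE-part1.djvu, FILE-part2.djvu, ...
# with at most CHUNK pages each (hypothetical helper).
split_djvu() {
    local djvu=$1 chunk=$2 base=${1%.djvu} total
    total=$(djvused "$djvu" -e 'n') || return 1
    page_ranges "$total" "$chunk" | {
        part=1
        while read -r first last; do
            set --                      # collect this part's page files in "$@"
            n=$first
            while [ "$n" -le "$last" ]; do
                # stdin is redirected so djvused cannot swallow the range list
                djvused "$djvu" -e "select $n; save-page-with $base-p$n.djvu" \
                    < /dev/null || exit 1
                set -- "$@" "$base-p$n.djvu"
                n=$((n + 1))
            done
            djvm -c "$base-part$part.djvu" "$@" || exit 1
            rm -f "$@"
            part=$((part + 1))
        done
    }
}

# Example use: split myfile.djvu into parts of at most 200 pages
#   split_djvu myfile.djvu 200
```

For instance, page_ranges 5 2 yields the ranges 1 to 2, 3 to 4 and 5 to 5, so a five-page file split with a chunk size of 2 produces three parts.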

Displaying a particular page

The [[Image:...]] link tag accepts a named parameter "page" so that, for example, this wiki code displays the image of page 164 of the file Emily Dickinson Poems (1890).djvu on the right, 150 pixels wide (the rear cover of the book, containing no text):

[[Image:Emily Dickinson Poems (1890).djvu|right|150px|page=164]]

The page image can be displayed in the DjVu in place of text as in Page:Personal Recollections of Joan of Arc.djvu/9 using:

{{use page image|caption=JOAN'S VISION}}

The page image can be displayed in the book's Wikisource main space, as with Personal Recollections of Joan of Arc/Book I/Chapter 2, using:

[[Image:Personal_Recollections_of_Joan_of_Arc.djvu|page=27|right|thumbnail|200px|THE FAIRY TREE]]

Notes

¹ Example: this 205 MB PDF of a 1691 book from Gallica is converted by the pdf2djvu.sh script into a barely readable 382.4 MB DjVu, by djvudigital into a slightly more readable 316.7 MB DjVu, and by the Internet Archive into a better-quality 51.3 MB DjVu.
² For instance, this 55 MB PDF, when cropped with ImageMagick, gives a 100 MB PDF, which pdf2djvu converts into an 86.2 MB DjVu, while the Internet Archive directly gives a 10.1 MB DjVu of better quality.
³ Man page.
⁴ A comparison here.
⁵ Complete instructions here.
⁶ Moreover, they can require the proprietary msepdjvu library instead of csepdjvu: see superhero pres: is it independently reproducible?.
⁷ See forums: Authentication error; not a valid OpenID, Login problems when I click "Share".
⁸ See forum.
⁹ Example: Vocabolario degli accademici della Crusca, 1691 took 5.1 days to derive.
¹⁰ In the example, the size is 1/6 of the djvudigital output.