I scan a lot of documentation for BitSavers. Sometimes the documentation makes limited use of color, such as red text highlighting user input. Eventually, scanned documents on BitSavers will be run through an optical character recognition (OCR) program. High resolution bitonal (on or off, a single bit per pixel) images work best for the OCR process. However, scanning such a document in bitonal form loses the color highlighting present in the original. This can be particularly confusing when the original document mixes highlighted text and black text in the same example. In this blog post, I’ll describe how I scan these documents in a manner that preserves the original limited use of color and also keeps the document ready for OCR.
Let’s motivate this discussion with an example. Here is page 34 from the “IRIS Programming Tutorial”, an SGI manual from 1986 that instructs the reader on how to get started programming with the IRIS graphics library in C:
For the purposes of blogging, I’ve reduced the file size to 58K by reducing resolution to 75 dpi (dots per inch) and converting the file to JPEG. The original raw scan at 600 dpi is just about 7 MB. For comparison, a typical bitonal page from this manual scanned at 600 dpi is about 48K. Clearly most of this page would be perfectly fine in a bitonal representation and it’s only the little bit of color that needs to be accounted for in order to have a faithful rendition in the PDF file.
My approach is to take the full range color image file and quantize it to a reduced number of colors. In the case of this manual, the colors black, white, red, green, blue, yellow, magenta and cyan are sufficient:
| Index | Color   | RGB Values (0-255) |
|-------|---------|--------------------|
| 0     | Black   | 0 0 0              |
| 1     | White   | 255 255 255        |
| 2     | Red     | 255 0 0            |
| 3     | Green   | 0 255 0            |
| 4     | Blue    | 0 0 255            |
| 5     | Yellow  | 255 255 0          |
| 6     | Magenta | 255 0 255          |
| 7     | Cyan    | 0 255 255          |
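As far as I know, the NetPBM tools take a colormap as a small PPM image rather than as a text table, so it helps to have the table in that form. Here is a minimal sketch that writes the eight colors as a plain (P3) PPM, one pixel per color. The names `PALETTE`, `palette_as_plain_ppm` and `write_colormap` are my own for illustration and are not part of NetPBM.

```python
# The eight fully saturated colors from the table, in index order.
PALETTE = [
    (0, 0, 0),        # 0 black
    (255, 255, 255),  # 1 white
    (255, 0, 0),      # 2 red
    (0, 255, 0),      # 3 green
    (0, 0, 255),      # 4 blue
    (255, 255, 0),    # 5 yellow
    (255, 0, 255),    # 6 magenta
    (0, 255, 255),    # 7 cyan
]


def palette_as_plain_ppm(palette):
    """Return the palette as plain (P3) PPM text: an N x 1 image, maxval 255."""
    lines = ["P3", f"{len(palette)} 1", "255"]
    lines += [f"{r} {g} {b}" for r, g, b in palette]
    return "\n".join(lines) + "\n"


def write_colormap(path, palette=PALETTE):
    """Write the palette to `path` as a plain PPM (hypothetical helper)."""
    with open(path, "w", encoding="ascii") as f:
        f.write(palette_as_plain_ppm(palette))
```

Calling `write_colormap("reduced.map")` should produce a file suitable for use as a map file, assuming the tool accepts plain PPM input.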
You may already have noticed that the green on your monitor doesn’t exactly match the green in the scanned image shown above. Exactly matching the color of a printed ink to the color from a monitor is difficult. A monitor produces color by emitting light (CRT, plasma or OLED) or by filtering a backlight (LCD), while printed ink reflects ambient light, so a monitor will not have the same range of representable colors as combinations of printed inks on a page. However, the text in the manual clearly indicates that the printed image represents full intensities of colors on the monitor as you work through the examples, so we’re OK using these colors. The computer graphics industry has struggled with the general problem of matching screen colors to printed inks for a long time. A number of solutions have appeared; discussing these may be the topic of another blog post some day.
So what we want to do is convert the full color 24-bit image (8 bits each for red, green and blue) into a reduced color image using the above colormap. This process is called quantization. We can do this using the ppmquant utility from the NetPBM tools, assuming we’ve put the above colormap into a file called reduced.map:
ppmquant -mapfile reduced.map pg34.ppm > pg34-quantized.ppm
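To make the idea of quantization concrete, here is a minimal sketch that maps each pixel to the nearest colormap entry by squared distance in RGB. This is only an illustration of the technique, not how ppmquant is implemented; it does no dithering, and the names `nearest_index` and `quantize` are hypothetical.

```python
# The eight colormap entries from the table, in index order.
PALETTE = [
    (0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 255, 0),
    (0, 0, 255), (255, 255, 0), (255, 0, 255), (0, 255, 255),
]


def nearest_index(pixel, palette=PALETTE):
    """Index of the palette color closest to `pixel` (squared RGB distance)."""
    r, g, b = pixel

    def distance(i):
        pr, pg, pb = palette[i]
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2

    return min(range(len(palette)), key=distance)


def quantize(pixels, palette=PALETTE):
    """Replace every (r, g, b) pixel with its nearest palette color."""
    return [palette[nearest_index(p, palette)] for p in pixels]
```

With this sketch, a slightly off-white paper pixel such as `(245, 240, 230)` maps to index 1 (white) and a dull scanned red such as `(200, 30, 40)` maps to index 2 (red), which is the behavior we want for mostly black-and-white pages with a little highlighting.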
There are other tools that can perform the same quantization; this is just one way of doing it. Once the file has been quantized to the reduced colormap, we can combine it into a PDF with the rest of the scanned images.
You can see the final results for the SGI IRIS Programming Tutorial. This manual is mostly black and white, with color illustrations on pages 41, 48, 54, 56 and 69 of the PDF. In this case, the color illustrations are not continuous-tone full color images, but simple renderings with the fully saturated colors shown in the above table.