Scanning Documents with Limited Colors for Optical Character Recognition

I scan a lot of documentation for BitSavers. Sometimes the documentation makes limited use of color, such as red text highlighting user input, and so on. Eventually, scanned documents on BitSavers will be run through an optical character recognition (OCR) program. High-resolution bitonal (on or off, single bit per pixel) images work best for the OCR process. However, scanning a document with limited use of color in bitonal form means that the color highlighting present in the original document is lost. This can be particularly confusing when the original document mixes highlighted text and black text in the same example. In this blog post, I’ll describe how I scan these documents in a manner that preserves the original limited use of color and also keeps the document ready for OCR.

Let’s motivate this discussion with an example. Here is page 34 from the “IRIS Programming Tutorial”, an SGI manual from 1986 that instructs the reader on how to get started programming with the IRIS graphics library in C:

For the purposes of blogging, I’ve reduced the file size to 58K by reducing resolution to 75 dpi (dots per inch) and converting the file to JPEG. The original raw scan at 600 dpi is just about 7 MB. For comparison, a typical bitonal page from this manual scanned at 600 dpi is about 48K. Clearly most of this page would be perfectly fine in a bitonal representation and it’s only the little bit of color that needs to be accounted for in order to have a faithful rendition in the PDF file.

My approach is to take the full range color image file and quantize it to a reduced number of colors. In the case of this manual, the colors black, white, red, green, blue, yellow, magenta and cyan are sufficient:

Index  Color    RGB Values (0-255)
0      Black    0   0   0
1      White    255 255 255
2      Red      255 0   0
3      Green    0   255 0
4      Blue     0   0   255
5      Yellow   255 255 0
6      Magenta  255 0   255
7      Cyan     0   255 255
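
The post does not show the contents of the colormap file, so the following is only a sketch of one way to produce it. It assumes that the NetPBM quantizer reads its map as an ordinary PPM image whose pixels are the allowed colors, and writes the eight table entries as a one-row plain-text (P3) PPM. The names PALETTE, palette_to_ppm and write_palette are illustrative, not part of any tool.

```python
# Sketch: build the eight-color map as a plain-text (P3) PPM image.
# Assumption: the quantizer accepts any PPM whose pixels list the allowed colors.

PALETTE = [
    (0, 0, 0),        # 0 black
    (255, 255, 255),  # 1 white
    (255, 0, 0),      # 2 red
    (0, 255, 0),      # 3 green
    (0, 0, 255),      # 4 blue
    (255, 255, 0),    # 5 yellow
    (255, 0, 255),    # 6 magenta
    (0, 255, 255),    # 7 cyan
]


def palette_to_ppm(palette):
    """Return P3 PPM text: a single row with one pixel per palette entry."""
    lines = ["P3", f"{len(palette)} 1", "255"]
    lines += [f"{r} {g} {b}" for r, g, b in palette]
    return "\n".join(lines) + "\n"


def write_palette(path, palette=PALETTE):
    """Write the palette image to the given path (hypothetical helper)."""
    with open(path, "w", encoding="ascii") as f:
        f.write(palette_to_ppm(palette))
```

Calling write_palette("reduced.map") would create a file with the name used later in this post.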

You may already have noticed that the green on your monitor doesn’t exactly match the green in the scanned image shown above. Exactly matching the reflective color of a printed ink to the color from a monitor is difficult. A monitor either emits light directly (CRT or plasma) or filters a backlight (LCD, including LED-backlit panels), whereas ink on a page reflects ambient light, so a monitor will not have the same range of representable colors as combinations of printed inks on a page. However, the text in the manual clearly indicates that the printed images represent full-intensity colors on the monitor as you work through the examples, so we’re OK in using these colors. The computer graphics industry has struggled with the general color-matching problem over time, and a number of solutions have appeared; discussing these may be the topic of another blog post some day.

So what we want to do is process the full color 24-bit image (8 bits each for red, green and blue) into a reduced color image using the above colormap. This process is called quantization. We can do this using the ppmquant utility from the NetPBM tools, assuming we’ve put the above colormap into a file called reduced.map:

ppmquant -mapfile reduced.map pg34.ppm > pg34-quantized.ppm

There are other tools that can perform the same quantization operation; this is just one way of doing it. Once the file has been quantized to a reduced colormap, we can combine it into a PDF with all the rest of the scanned images.
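
To make the operation itself concrete, here is a minimal sketch of fixed-palette quantization in plain Python. It assumes the simplest possible rule: each pixel is replaced by the palette color with the smallest squared distance in RGB space. This illustrates the idea only; it is not a description of how NetPBM matches colors internally, and the names nearest_index and quantize are made up for this example.

```python
# Sketch: map each pixel to the closest of the eight palette colors.
# Assumption: "closest" means smallest squared Euclidean distance in RGB.

PALETTE = [
    (0, 0, 0),        # 0 black
    (255, 255, 255),  # 1 white
    (255, 0, 0),      # 2 red
    (0, 255, 0),      # 3 green
    (0, 0, 255),      # 4 blue
    (255, 255, 0),    # 5 yellow
    (255, 0, 255),    # 6 magenta
    (0, 255, 255),    # 7 cyan
]


def nearest_index(pixel, palette=PALETTE):
    """Return the index of the palette color closest to an (r, g, b) pixel."""
    r, g, b = pixel

    def dist2(i):
        pr, pg, pb = palette[i]
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2

    return min(range(len(palette)), key=dist2)


def quantize(pixels, palette=PALETTE):
    """Replace every pixel in a flat list with its nearest palette color."""
    return [palette[nearest_index(p, palette)] for p in pixels]
```

For example, a slightly faded red such as (250, 10, 5) maps to index 2, and off-white paper such as (240, 240, 235) maps to index 1, which is what keeps the text areas effectively bitonal.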

You can see the final results for the SGI IRIS Programming Tutorial. This manual is mostly black and white, with color illustrations on pages 41, 48, 54, 56 and 69 of the PDF. In this case, the color illustrations are not continuous tone full color images, but simple renderings with fully saturated colors as shown in the above table.

2 thoughts on “Scanning Documents with Limited Colors for Optical Character Recognition”

  1. Hello.
    I have also had to pre-process some manuals for OCR in the past, so maybe
    I can offer some help to improve your process.
    The result you obtain by quantizing the images with the fixed palette is
    very good, but as you write, you are losing the real tone of the original colors.
    Furthermore, if you are unlucky and a color lies near the boundary between two quantization bins, its pixels will be mapped seemingly at random to one palette color or to its neighbor. Those are two completely different colors, and neither of them
    represents the original color well…
    I think a better approach would be to analyze the colors in the image statistically, then compute the n most frequent colors, discarding colors
    that account for less than a specified percentage of pixels (noise colors).
    You could also specify that if the process finds two colors that are not too distant
    from each other, they should be merged into a single one.
    Then you quantize the images with the new palette, which should avoid any kind of error in color representation.
    You could still face a problem, however: most color images are printed using CMYK inks, with tones obtained through a screening process.
    In this case the real color is the result of all the color dots inside a larger macro-pixel merged together into an average value (this is what the eye does in true color perception). As none of the processes above work with such a color representation, you would need to scan the original document at very high resolution (to avoid moiré and aliasing effects from sub-sampling the screening pattern), then separate the black/white component (text) from the color component, then downsample the text to the desired output resolution, and low-pass filter and then resample the color content to a lower resolution, to reproduce the original pre-print color.
    Of course other methods, such as adaptive area averaging, could be better for the descreening operation, but they are also more complex.
    What do you think?

    • The method I propose here, as it says in the title, is for documents with limited colors such as this SGI manual that uses full intensity primaries. It is also acceptable for other manuals that use color to highlight specific text. In both these cases, it is not important to match in the scanned PDF the exact appearance of the printed inks on the page. What is important is to carry over the semantic content of the document. Printed colors fade and change over time, so even if you did capture the colors as they exist now with a scanned image, they may not be a faithful representation of the colors in the original document and they may not match the visual representation of the document at a later point in time.

      Manuals with limited numbers of colors are printed with one ink per color. What we are attempting to do with this quantization method is to separate the combined printing from several printing plates and inks into distinct colors per pixel. Usually there is no overprinting of inks in documents printed with limited colors in this manner, so this separation is faithful to the original document.

      If the document has no color and only half-tone printing for grayscale photographs, simply scanning at a density high enough to resolve the half-tone dot pattern with bitonal (one bit per pixel) encoding is sufficient to capture the document faithfully. In my experience, 600 dpi is usually dense enough to resolve the underlying half-tone dot patterns without aliasing, but some documents may need higher resolution scans.

      Documents with overprinting of color inks or the use of half-tone processes to approximate continuous tone color images are a different problem. A future blog post will be dedicated to a proposed method for dealing with these documents that attempts to preserve the original document faithfully while still allowing quality OCR from bitonal text.
