Images converted to text

In days past, a newspaper archive was a room stacked from floor to ceiling with preserved copies of each paper exactly as it was produced. In the twenty-first century, we have access to a number of more efficient ways of storing and accessing the papers.

The earliest advance was to take a photographic image of each page of the paper and to convert it to one of the micro-formats. In sheets (microfiche) or rolls (microfilm), these can be read on specialised readers.

To make the micro-images available on a computer screen, they were put through a process of digitisation to create a digital image file (which might be stored in JPEG or TIFF format, depending on the system). Although a digitised image is easier to store, access and handle, it is still simply a picture of the page and has to be read by eye in the same way.

A major breakthrough came with the development of software that could recognise a letter by its shape on the page and write a corresponding character into a computer text file. This is called optical character recognition (OCR). In one sense it reads a page in the same way you do: it sees some black marks in a pattern that it remembers. Where you associate that pattern with a sound, the software associates it with a character.

Some pitfalls when using newspapers converted by optical character recognition (OCR):

The upper-case letter W might be read as VV (double V).
Upper-case H is frequently converted to II (double I).
Rounded lower-case letters are easily confused by software, so check u and n carefully.
Sometimes lower-case m is read as a pair of letters, either in or iu.
Remember that in pre-1800 papers there may be instances of a "long s" that will almost always be misinterpreted as a lower-case f.
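
When searching OCR text, it can help to search for the likely misreadings of a word as well as its correct spelling. The following is a minimal Python sketch of that idea, not part of any archive's search system. The names CONFUSIONS and ocr_variants are made up for this illustration, and the table contains only the confusions listed above. The long-s handling assumes that a long s can occur anywhere except at the end of a word, which is a simplification of real printing practice.

```python
from itertools import product

# Misreadings listed above: correct character -> what OCR may have written.
# (Illustrative table only; real OCR output contains many other errors.)
CONFUSIONS = {
    "W": ["VV"],
    "H": ["II"],
    "u": ["n"],
    "n": ["u"],
    "m": ["in", "iu"],
}


def ocr_variants(word, long_s=False):
    """Return the set of spellings OCR might have produced for `word`.

    Set long_s=True for pre-1800 papers, where a long s may be read as f.
    """
    options = []
    for i, ch in enumerate(word):
        # Each character may survive intact or be replaced by a misreading.
        choices = [ch] + CONFUSIONS.get(ch, [])
        # Assumption: a long s never appears as the last letter of a word.
        if long_s and ch == "s" and i < len(word) - 1:
            choices.append("f")
        options.append(choices)
    # Combine every choice for every position into whole-word spellings.
    return {"".join(combo) for combo in product(*options)}


print(sorted(ocr_variants("Hull")))
# ['Hnll', 'Hull', 'IInll', 'IIull']
```

The number of variants grows quickly with the length of the word, so this approach is best suited to short, distinctive search terms such as surnames or place names.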