If you go to the North Carolina Digital Collections and click on an item, you’ll see that for items with text there’s usually a tab that says “text” (see the image at right) where you can see the full text that appears in the image. One of the comments we sometimes hear regarding the full text goes along the lines of “my cat could provide a better transcription.” So I thought I’d take a minute or two to describe why the full text is sometimes – frankly – horrible.
We add a lot of scanned, or digitized, print items to the Digital Collections. For items with text, we want to make the text searchable. If you’ve ever used a scanner before or even just taken a photograph of something that contains words, you know that you can’t automatically “select” the text or search the text in the image. To give our items that capability, we’ve incorporated Optical Character Recognition – OCR – into our workflow. But what does that really mean?
Let’s start with an image of a page of text. While it might be meaningful to us, especially if it’s in a language we understand, a computer won’t know or care that there are words in the image. That’s where OCR comes in. OCR is the set of instructions that tells a computer which parts of an image to interpret as text, and how to do so.
To get a little more specific, OCR software contains instructions on how to take an image, analyze where letters stand out from the background, guess what those letters might be based on established patterns recorded in the software, and then give us the letters not in the form of a single image but in the kind of text you might use in word processing software. In other words, it gives us machine-readable text.
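If you’re curious what “guessing letters based on established patterns” might look like, here’s a toy sketch in Python. It is not how any real OCR product works (those are far more sophisticated), and the tiny 3x5 letter patterns, the names, and the sample “scan” are all made up for illustration. The idea is simply to cut the image into letter-sized pieces, compare each piece against known patterns, and keep the closest match.

```python
# Toy illustration only: real OCR software is far more sophisticated.
# Each hypothetical pattern is a 3x5 bitmap: '#' is ink, '.' is background.
TEMPLATES = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def split_glyphs(image_rows, width=3):
    """Cut a line of fixed-width letters (one blank column between them) into pieces."""
    glyphs = []
    for start in range(0, len(image_rows[0]), width + 1):
        glyphs.append([row[start:start + width] for row in image_rows])
    return glyphs

def similarity(glyph, template):
    """Count the pixels where the scanned letter and the pattern agree."""
    return sum(g == t
               for glyph_row, template_row in zip(glyph, template)
               for g, t in zip(glyph_row, template_row))

def recognize(image_rows):
    """Guess the best-matching letter for each piece and join them into text."""
    text = ""
    for glyph in split_glyphs(image_rows):
        text += max(TEMPLATES, key=lambda letter: similarity(glyph, TEMPLATES[letter]))
    return text

# A clean "scan" of the word CAT.
clean = ["###..#..###",
         "#...#.#..#.",
         "#...###..#.",
         "#...#.#..#.",
         "###.#.#..#."]

# The same scan with one stray speck of ink inside the C.
noisy = list(clean)
noisy[2] = "##..###..#."

print(recognize(clean))  # CAT
print(recognize(noisy))  # CAT (one speck isn't enough to fool the matcher)
```

Add enough specks, fading, or unusual letter shapes, and the closest pattern can easily turn out to be the wrong letter.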
Once we have that machine-readable text, we can feed it into search engines and do things like full-text search. That’s how many of the scanned items in the Digital Collections become full-text searchable.
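To give a rough idea of what a search engine does with that text, here’s a minimal Python sketch of a word index. This is an assumption about how full-text search works in general, not a description of the software behind the Digital Collections, and the item IDs and text are invented. Notice that the item with poor OCR never shows up in the results, because the misread words don’t match what you typed.

```python
import re
from collections import defaultdict

# Hypothetical item IDs and OCR text, invented for illustration.
ocr_text = {
    "item-001": "Annual report of the State Library",
    "item-002": "Report on the condition of public roads",
    "item-003": "Annua1 rep0rt of the Sfate Librarv",  # the same title, badly OCR'd
}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of items whose text contains it."""
    index = defaultdict(set)
    for item_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index

def search(index, query):
    """Return the items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(ocr_text)
print(sorted(search(index, "report")))         # ['item-001', 'item-002']
print(sorted(search(index, "state library")))  # ['item-001']
```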
The quality of text generated by OCR varies wildly, depending on the original document. If you printed out this blog post, scanned it, and then ran the image through OCR software, you’d probably get text that’s almost 100% accurate.
But images of old documents? Text with lots of creative serifs or embellished type? Handwriting? OCR doesn’t do as well. The Digital Collections has a lot of examples of really poor and really great OCR (see the examples to the right).
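One common way to put a number on “accurate” is to count how many single-character fixes (insertions, deletions, or swaps) it would take to turn the OCR text into the correct text, and compare that to the length of the correct text. Here’s a small Python sketch of that idea. The function names and the sample strings are ours, chosen for illustration, and this isn’t a measurement we run on the Digital Collections.

```python
def edit_distance(a, b):
    """Smallest number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # swap (or keep) a character
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr):
    """Share of the correct text that the OCR got right, from 0.0 to 1.0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))

truth = "State Library"
ocr = "Sfate Librarv"  # hypothetical OCR output with two misread letters

print(edit_distance(truth, ocr))                # 2
print(f"{character_accuracy(truth, ocr):.0%}")  # 85%
```

Two wrong letters out of thirteen doesn’t sound like much, but it’s enough to keep both words from matching a search.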
It can be frustrating, if you’re looking for something very specific, to find that the OCR is incorrect. So why don’t we fix it? Well, we’ve decided that we’d be more responsible with our resources, and serve more people, by putting up more content with OCR that isn’t as clean, rather than putting up less content and spending pretty much all of our staff time correcting the OCR.
There are some exceptions, though. For some types of items where our users really value accuracy, and where the text is handwritten, of poorer quality, or hard to recognize using OCR, we’ve asked generous volunteers to provide us with transcriptions. You can join these volunteers if you head over to our Flickr page.
Want to learn even more about OCR or how to use it with your own scans? This article is really readable (no pun intended), and also goes into more depth about OCR technology, its history, and modern-day uses.
If you have other questions about OCR or why we do what we do with the Digital Collections, leave a comment below and we’d be happy to answer them.