Digital Collections

Inside digitization: A brief introduction to OCR

If you go to the North Carolina Digital Collections and click on an item, you’ll see that for items with text there’s usually a tab that says “text” (see the image at right) where you can see the full text that appears in the image. One of the comments we sometimes hear regarding the full text goes along the lines of “my cat could provide a better transcription.” So I thought I’d take a minute or two to describe why the full text is sometimes – frankly – horrible.

We add a lot of scanned, or digitized, print items to the Digital Collections. For items with text, we want to make the text searchable. If you’ve ever used a scanner before or even just taken a photograph of something that contains words, you know that you can’t automatically “select” the text or search the text in the image. To give our items that capability, we’ve incorporated Optical Character Recognition – OCR – into our workflow. But what does that really mean?

How a computer feels about what your image contains.

Let’s start with an image of a page of text. While it might be meaningful to us, especially if it’s in a language we understand, a computer won’t know or care that there are words in the image. That’s where OCR comes in. OCR is the set of directions that tells a computer to interpret parts of an image as text, as well as how to do that.

To get a little more specific, OCR software contains instructions on how to take an image, analyze where letters stand out from the background, guess what those letters might be based on established patterns recorded in the software, and then give us the letters not in the form of a single image but in the kind of text you might use in word processing software. In other words, it gives us machine-readable text.
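To make that a bit more concrete, here is a toy sketch in Python of the steps just described. It is not how our OCR software actually works. The three-letter alphabet, the 3x3 letter shapes, and the names `TEMPLATES`, `binarize`, `recognize`, `ocr_line`, and `make_pixels` are all made up for illustration. The sketch separates dark pixels from the light background, cuts the image into letter-sized cells, and picks the stored pattern that agrees with each cell in the most places.

```python
# Toy illustration only: real OCR software is far more sophisticated.
# Hypothetical letter patterns: '#' is ink, '.' is background.
TEMPLATES = {
    "L": ("#..", "#..", "###"),
    "O": ("###", "#.#", "###"),
    "T": ("###", ".#.", ".#."),
}
GLYPH_WIDTH = 3  # assumption: every letter is exactly 3 pixels wide


def binarize(pixels, threshold=128):
    """Mark each grayscale pixel (0-255) as ink (True) or background (False)."""
    return [[value < threshold for value in row] for row in pixels]


def match_score(cell, template):
    """Count the positions where the cell and the template agree."""
    return sum(
        cell[r][c] == (template[r][c] == "#")
        for r in range(len(template))
        for c in range(len(template[r]))
    )


def recognize(cell):
    """Guess the letter whose stored pattern best matches this cell."""
    return max(TEMPLATES, key=lambda letter: match_score(cell, TEMPLATES[letter]))


def ocr_line(pixels):
    """Turn one line of fixed-width letters into machine-readable text."""
    ink = binarize(pixels)
    letters = []
    for start in range(0, len(ink[0]), GLYPH_WIDTH):
        cell = [row[start:start + GLYPH_WIDTH] for row in ink]
        letters.append(recognize(cell))
    return "".join(letters)


def make_pixels(rows):
    """Build a fake grayscale scan: 0 is black ink, 255 is white paper."""
    return [[0 if ch == "#" else 255 for ch in row] for row in rows]


page = make_pixels(["#..######", "#..#.#.#.", "######.#."])
page[1][6] = 100  # a gray smudge that gets read as ink
print(ocr_line(page))  # prints LOT
```

Even with the smudge, the last letter still matches the "T" pattern better than the others. Pile on more smudges, faded ink, or unusual type, and the best guess can easily become the wrong letter.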

Once we have that machine-readable text, we can feed it into search engines and do things like full-text search. It’s how many of our scanned items available in the Digital Collections become full-text searchable.
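Here is a minimal sketch of what full-text search over OCR text can look like, assuming a simple word index rather than whatever search engine sits behind the Digital Collections. The item names and sample text are invented for the example, as are the function names `build_index` and `search`.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of items whose text contains it."""
    index = defaultdict(set)
    for item_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return the items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical OCR output for three scanned items.
documents = {
    "clean-report": "Annual report of the state library",
    "messy-report": "Annua1 rep0rt of the statc 1ibrary",
    "cookbook": "Favorite recipes: applesauce cake",
}

index = build_index(documents)
print(search(index, "annual report"))  # finds only the cleanly read item
```

The second item says the same thing on paper, but because the OCR misread a few letters, a search for "annual report" never finds it.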

Text generated by OCR can be wildly variable, depending on the original document. If you printed out this blog post, scanned it, and then ran the image through OCR software, you’d probably get text that’s almost 100% accurate.

Comparison of the OCR results for two documents from very different time periods in the North Carolina Digital Collections.

But images of old documents? Text with lots of creative serifs or embellished type? Handwriting? OCR doesn’t do as well. The Digital Collections has a lot of examples of really poor and really great OCR (see the examples to the right).
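If you wanted to put a rough number on "poor" versus "great," one option is to compare the OCR text against a hand-typed version of the same passage. The sketch below uses Python's standard `difflib` module for that. The sample strings are invented, and this similarity ratio is only one of several ways to score OCR quality.

```python
from difflib import SequenceMatcher


def similarity(expected, ocr_text):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, expected, ocr_text).ratio()


# Hypothetical examples: what a person typed vs. what the OCR produced.
original = "Report of the State Librarian"
modern_scan = "Report of the State Librarian"
old_scan = "Rcport of thc Statc Iibrarian"

print(f"Modern document: {similarity(original, modern_scan):.0%}")
print(f"Old document: {similarity(original, old_scan):.0%}")
```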

It can be frustrating, if you’re looking for something very specific, to find that the OCR is incorrect. So why don’t we fix it? Well, we’ve decided that we’d be more responsible with our resources and we’d serve more people by putting up more content with OCR that isn’t as clean, rather than putting up less content and spending pretty much all of our staff time correcting the OCR.

There are some exceptions, though. For some types of items where our users really value accuracy, and where the text is handwritten, of poorer quality, or hard to recognize using OCR, we’ve asked generous volunteers to provide us with transcriptions. You can join these volunteers if you head over to our Flickr page.

Want to learn even more about OCR or how to use it with your own scans? This article is really readable (no pun intended), and also goes into more depth about OCR technology, its history, and modern-day uses.

If you have other questions about OCR or why we do what we do with the Digital Collections, leave a comment below and we’d be happy to answer them.

NC Government Publication Named Notable of 2012 by Library Journal

ACCESS North Carolina: A Vacation and Travel Guide for People with Disabilities was recently named by Library Journal as one of the notable government documents for 2012.

ACCESS North Carolina mixes text and icons to offer basic tourist site data on nearly 400 different locations throughout North Carolina. Users can tell at a glance if a site is accessible or partially accessible for persons with disabilities. In addition to benefiting people with disabilities, the information in ACCESS North Carolina can also benefit people who are aging and parents with child and infant strollers.

ACCESS North Carolina was produced by the Division of Vocational Rehabilitation Services, a division of the North Carolina Department of Health and Human Services.

To read the complete Library Journal article, go here. To see the listing for ACCESS North Carolina, scroll down the article to State and Local, and then to North Carolina.

National Applesauce Cake Day

Applesauce cake, care of flickr user Patent and the Pantry. (Made from a recipe similar to, but not the same as the one in this post.)

The internet tells me that today is National Applesauce Cake Day. I couldn’t find out who officially declared that June 6 be reserved for the celebration of applesauce cake, but I did find some information on the origin of the recipe itself, claiming that it’s a 20th century creation related to traditional fruitcake. During World War I, applesauce cakes were “promoted as patriotic (less butter, sugar, eggs)” – coming from the idea that using fewer food resources was a way of contributing to the war on the domestic front*. The earliest mention I could find of applesauce cake in the United States comes from a 1904 letter to the Boston Globe’s Household Department. It reads as follows:

Nellie Bly—Your apple sauce cake was delicious. I have read the household department ever since it started and find many good recipes. Sunnyside. (February 6, 1904)

Google’s Ngram Viewer, which I’ve mentioned before, shows a sharp uptick in mentions of applesauce cake leading into World War II.

So here’s a recipe from Favorite Recipes of North Carolina (page 33, digital page 35), available in our North Carolina Digital Collections. I’ve tried it myself, and have to say it’s definitely a frugal-tasting cake, best for when you need a filling but only slightly sweet snack.

applesaucecake

*See http://www.foodtimeline.org/foodcakes.html#applesaucecakes

State Doc Pick of the Week: About going to college

School is out and summer is here! And for a lot of high schoolers who have made their way across the stage, a university beckons in the fall. This document, published in 1957, provides information that every ’50s high school graduate needs to know about college. It includes sections on choosing a college, admissions, financing college, and academic programs, and it provides a list of all of the white, Indian and negro colleges of North Carolina. Take a step back in time and read about what going to college was like when the fee for taking the SAT was $6.00, and students could serve in the dining hall or work at a local soda fountain to pay for their education.

This publication can be downloaded, printed, saved, and viewed by clicking here.

This blog is a service of the State Library of North Carolina, part of the NC Department of Cultural Resources. Blog comments and posts may be subject to Public Records Law and may be disclosed to third parties.