Behind the Scenes in a Digitization Project

by Ruth Ann Jones

Digitized collections of primary sources on the World Wide Web are providing new and exciting research alternatives for students and teachers of history. Many rare and fragile works, once accessible only through microfilm reproductions or travel to the library that owned them, are becoming freely available in their digitized formats to anyone with Internet access.

Happily, women's history is fairly well represented in the online repositories being created. Significant digitization projects of interest to feminist scholars include the Victorian Women Writers Project, the Suffragists Oral History Project, and African American Women Writers of the 19th Century. In addition, women writers are well represented in the holdings of larger archives, such as the University of Virginia Electronic Text Center. (See URLs below.) Because of copyright restrictions, many archives concentrate on material published before 1923.

In the process of mounting these archives, academic and public libraries and historical museums are extending their activities beyond the traditional functions of acquiring, preserving, and organizing information. Many of the digitization projects now in progress expand not only access but also the ways the material can be used, through the search capabilities offered by electronic text and the textual analysis made possible with Standard Generalized Markup Language (SGML) encoding.

What's involved in planning and executing a successful digitization project? What kinds of skills are necessary to pull it off? And why should libraries venture outside their traditional functions of acquiring, preserving, and organizing information to begin creating and disseminating it as well? These questions have become the subtext of my work life in the last eighteen months.

In the summer of 1999, the Michigan State University Libraries were awarded a $123,000 grant in the Library of Congress/Ameritech National Digital Library Competition. The project "Shaping the Values of Youth: Sunday School Books in 19th-Century America" will include approximately ninety texts held at the MSU Libraries and thirty-five held at the Clarke Historical Library at Central Michigan University. The librarian originally assigned as project manager left MSU in August, shortly before work on the grant was scheduled to begin. I agreed to transfer part-time to the library's Digital Sources Center, and immediately began an exciting, stressful, and fascinating baptism of fire in the arena of digital archives work.

Why digitize Sunday School books? The nineteenth century was a time of intense religious fervor in America. Many rural areas lacked a public library, but Sunday School books were widely available and read by children and adults. The subjects treated in these religious books go far beyond the doctrinal issues one might expect. Missionary travels, temperance, natural history, the evils of slavery, and advice on daily conduct are all represented in the genre. Stories for children warn against laziness and dishonesty and extol the rewards of obedience, kindness, and piety. Works like Maternal Love, The Young Lady's Guide, and Helps Over Hard Places: Stories for Girls offer revealing portraits of daily life and social expectations. Biographies of women who were considered exemplary Christians, such as missionary Ann Judson and African-American teacher Mary Peake, provided role models to young readers.

When digitization is complete, each book will exist in three formats. The first is color page images scanned from the copies held in MSU's Special Collections Division. Every page is included, from front cover to back--a total of about 15,000 images. The second format is HTML, which offers users a faster download and the ability to easily adjust the font size of the display as desired.

The third format is SGML. Although similar to HTML in its mechanics (tags enclosed in angle brackets, elements, attributes), SGML provides more precise encoding and is infinitely customizable. Tagsets exist or are currently being developed for material as diverse as computer manuals, chemical formulas, and musical notation. Closer to home, the TEI (Text Encoding Initiative) tagset is widely used in library digitization projects for literature and primary historical sources, and the EAD (Encoded Archival Description) tagset is used to create archival finding aids. Like HTML source code, SGML data uses only ASCII characters, so it can be created in one operating system, edited in another, and displayed in a third. The SGML-encoded versions of the books will become the basis for a full-text search function.

Of the three formats, production of the image files is the least complicated. Books are first scanned on a Hewlett-Packard flatbed scanner at a high resolution for archival purposes; the scans are then converted in batches to .jpg files for the web and burned onto CD-ROMs for long-term storage. The initial scanning is the slowest process: 23 images per hour is a good speed. All three processes together (scanning, converting, and burning CDs) work out to a production speed of about 20 images per hour, or 750 hours total for 15,000 images.

The HTML and SGML formats start with a word-processed copy of a book's text. There are several ways of creating the initial text: optical character recognition, contracting with a data entry service bureau, or typing in-house. The last option, which we use at MSU, has the benefit of providing employment to the students of your own institution, but requires a commitment of staff time for hiring, training, and supervising.

Once typed, the texts must be proofread, either by "traditional" proofreading methods or computer-aided file comparison. In the Sunday School books project, the goal is 99.995% accuracy: that is, no more than one error every 20,000 characters (roughly ten standard pages of 400 words each). That requires a highly skilled proofreader--a rare bird in these days of automated spell-checking. The file comparison method involves having each book typed twice (since two people will rarely make the same keyboarding error) and using a file comparison program to compare the two texts, character by character, and show where they differ.
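The file comparison step can be illustrated with a short script. This is only a sketch using Python's standard difflib module, not the comparison program used in the project; the function name and the two sample transcriptions are invented for the example.

```python
import difflib


def find_differences(first_typing: str, second_typing: str):
    """Return the spans where two independent transcriptions disagree.

    Each result is (offset_in_first, text_in_first, text_in_second).
    """
    # autojunk=False stops difflib from treating very frequent characters
    # (such as spaces) as noise when the texts are long.
    matcher = difflib.SequenceMatcher(
        None, first_typing, second_typing, autojunk=False
    )
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((i1, first_typing[i1:i2], second_typing[j1:j2]))
    return differences


# Invented sample: the second typist made one keyboarding error.
typist_a = "Stories for children warn against laziness and dishonesty."
typist_b = "Stories for children warn against lasiness and dishonesty."

for offset, text_a, text_b in find_differences(typist_a, typist_b):
    print(f"offset {offset}: {text_a!r} in first copy, {text_b!r} in second copy")
```

Each reported span still has to be checked against the printed book by a person, since the comparison shows only that the two copies differ, not which one is right.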

Students' typing speeds vary, so estimates of the time needed for typing and proofreading are harder to make than those for image production. The total collection has an estimated 4.2 million words, or just over 10,000 "standard pages" of 400 words each. (The actual number of words per page varies from one book to another.) Typing twice for file comparison raises the total to 20,000 pages. I use a rough average of 10 pages per hour for planning, or 2000 hours of typing to be accomplished over 18 months. File comparison can take 3-4 hours per book: 500 hours for 125 books is a safe estimate.
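The planning arithmetic in this paragraph can be reproduced in a few lines. The sketch below is illustrative only: the variable names are invented, the figures are the ones quoted in the text, and file comparison is costed at the upper end of the 3-4 hour range. Because it does not round 10,500 pages down to 10,000, its typing total comes out slightly above the 2000 hours used for planning.

```python
# Figures quoted in the article; variable names are illustrative.
TOTAL_WORDS = 4_200_000          # estimated words in the collection
WORDS_PER_STANDARD_PAGE = 400    # "standard page" used for planning
TYPING_PASSES = 2                # each book is typed twice for comparison
PAGES_PER_HOUR = 10              # rough average typing speed
BOOKS = 125                      # 90 MSU titles + 35 Clarke titles
COMPARISON_HOURS_PER_BOOK = 4    # upper end of the 3-4 hour range

standard_pages = TOTAL_WORDS / WORDS_PER_STANDARD_PAGE   # 10,500 pages
pages_to_type = standard_pages * TYPING_PASSES           # 21,000 pages
typing_hours = pages_to_type / PAGES_PER_HOUR            # 2,100 hours
comparison_hours = BOOKS * COMPARISON_HOURS_PER_BOOK     # 500 hours

print(f"Standard pages:        {standard_pages:,.0f}")
print(f"Pages to type (twice): {pages_to_type:,.0f}")
print(f"Typing hours:          {typing_hours:,.0f}")
print(f"File comparison hours: {comparison_hours:,}")
```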

Finally, the proofread texts are coded in SGML. The time required for SGML coding depends on the complexity of the work: simple prose is easier to code than a work with many footnotes, illustrations, poems, or other textual features. Once a coder is trained, the average "easy" book may take three to five hours; the average "difficult" book may take eight to ten hours. Students doing SGML coding require much more training than those doing image production or file comparison. The HTML version of the text can then be derived from the SGML using a Perl script or a style sheet conversion. Alternatively, XML-compliant SGML files can be displayed in XML-capable web browsers with an appropriate style sheet.
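Deriving HTML from encoded text amounts to mapping each element onto an HTML equivalent. The sketch below shows the idea in Python rather than Perl, and it is not the project's conversion script. It assumes the file is XML-compliant (Python's standard parser cannot read full SGML), the tag mapping is a hypothetical one covering only a handful of TEI element names, and the sample passage is invented.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from a few TEI element names to HTML tags.
TAG_MAP = {
    "div": "div",
    "head": "h2",
    "p": "p",
    "lg": "blockquote",  # line group (verse)
    "hi": "em",          # highlighted text
}


def to_html(element: ET.Element) -> str:
    """Recursively convert one element and its children to an HTML string."""
    parts = [escape(element.text or "")]
    for child in element:
        parts.append(to_html(child))
        # Text that follows a child element belongs to the parent.
        parts.append(escape(child.tail or ""))
    inner = "".join(parts)
    if element.tag == "l":       # verse line: end it with a line break
        return inner + "<br>"
    html_tag = TAG_MAP.get(element.tag)
    if html_tag is None:         # unmapped element: keep only its content
        return inner
    return f"<{html_tag}>{inner}</{html_tag}>"


# Invented sample passage, encoded with the element names mapped above.
SAMPLE = """<div>
<head>Chapter I</head>
<p>She was <hi>always</hi> kind.</p>
</div>"""

print(to_html(ET.fromstring(SAMPLE)))
```

A real conversion would need rules for many more elements (footnotes, illustrations, page breaks) and for attributes, which is one reason a style sheet is often preferred over a hand-written script.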

How do these totals translate into a weekly or monthly workload? About 60 hours of student labor per week are needed to do the typing, file comparison, and image production (3250 hours total, divided by four 14-week semesters, or 56 weeks). Students doing SGML coding add another 15-18 hours per week. These totals may sound a little daunting to libraries considering a digitization project. The time spent supervising student workers is a significant commitment in itself. However, starting with a smaller collection is always an option, as is concentrating on only one format. Some excellent digital collections, especially those featuring visual materials, provide only page images. Others concentrate on text transcriptions and include only images of a book's illustrations.
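The weekly figure follows directly from the three production totals given earlier. As a rough check, using the article's rounded planning numbers and illustrative variable names:

```python
# Rounded planning totals from the article; variable names are illustrative.
IMAGE_HOURS = 15_000 / 20     # 15,000 images at about 20 per hour = 750
TYPING_HOURS = 2000           # double typing at about 10 pages per hour
COMPARISON_HOURS = 500        # file comparison for 125 books
SEMESTERS = 4
WEEKS_PER_SEMESTER = 14

total_hours = IMAGE_HOURS + TYPING_HOURS + COMPARISON_HOURS   # 3250
weeks = SEMESTERS * WEEKS_PER_SEMESTER                        # 56
hours_per_week = total_hours / weeks                          # about 58

print(f"Total hours: {total_hours:,.0f}")
print(f"Weeks available: {weeks}")
print(f"Student hours per week: {hours_per_week:.1f}")
```

The result, a little over 58 hours, is the basis for the "about 60 hours" quoted above; SGML coding time is separate and is not included in this total.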

Is digitization a worthwhile endeavor for a library? Some librarians say no, seeing it as a distraction from the library's primary mission: acquiring information, making it accessible, and teaching our patrons how to find it and use it. But our historical collections put us in possession of valuable content resources, which are in great demand in the web environment, and librarians already have the intellectual skills needed to organize the content and make it usable in the electronic environment. Library digitization projects are one response to the increasing commercialization of knowledge. The alternative to creating our own electronic archives now may be paying publishers later for access to material owned by our sister institutions.

URLs for the digital archives mentioned in this article:

Indiana University Library, Victorian Women Writers Project:

http://www.indiana.edu/~letrs/vwwp/

UC Berkeley Library, Suffragists Oral History Project:

http://www.lib.berkeley.edu/BANC/ROHO/ohonline/suffragists.html

New York Public Library, African American Women Writers of the 19th Century:

http://digital.nypl.org/schomburg/writers_aa19/toc.html

University of Virginia, Electronic Text Center:

http://etext.lib.virginia.edu/

Michigan State University Libraries, Shaping the Values of Youth: Sunday School Books in 19th-Century America:

http://digital.lib.msu.edu/ssb/

Other useful websites:

U.S. Copyright Office, "Copyright Basics."

http://www.loc.gov/copyright/circs/circ1.html

American Women's History: A Research Guide, "Digital Collections of Primary Sources."

http://frank.mtsu.edu/~kmiddlet/history/women/wh-digcoll.html

Council on Library and Information Resources, "Selecting Research Collections for Digitization."

http://www.clir.org/pubs/reports/hazen/pub74.html

[Ruth Ann Jones is a librarian in the Digital Sources Center at the Michigan State University Libraries. She currently manages the "Shaping the Values of Youth" project and several other digitization projects.]

[Editor's note: For more examples of primary sources on women that are available on the World Wide Web, see Phyllis Holman Weisbard, "The World Wide Web: A Primary Resource for Women's History," Feminist Collections, v.21, no.4 (Summer 2000), pp.19-25.]

FEMINIST COLLECTIONS is published by the
University of Wisconsin System Women's Studies Librarian
430 Memorial Library, 728 State Street, Madison, WI 53706
(608) 263-5754

FEMINIST COLLECTIONS' copyright is held by the Regents of the University of Wisconsin System.
Single issues of FEMINIST COLLECTIONS may be purchased for $3.50 (plus postal charges for non-U.S. requests--inquire about rates). Please send a check made payable to University of Wisconsin-Madison to Women's Studies Librarian's Office, 430 Memorial Library, 728 State Street, Madison, WI 53706

Mounted July 19, 2001.