Technology

Physical Description of the Collection

The GATT Digital Library was produced by digitizing, converting to text, and XML encoding the GATT microfiche collection housed at the Stanford University Libraries & Academic Information Resources. The microfiche collection consists of approximately 17,000 microfiche pages with over 480,000 frames. Two physical microfiche formats are represented in the collection:

  • Silver-based, positive polarity, 145 x 105 mm, 58 frames per microfiche, 20:1 reduction ratio.
  • Silver-based, positive polarity, 145 x 105 mm, 98 frames per microfiche, 24:1 reduction ratio.

The microfiche used for this project have been produced incrementally since the late 1940s and are third-generation copies.

In addition to the microfiche, 166 volumes of print publications were scanned, consisting of approximately 25,000 additional pages.

The Conversion Process

Apex CoVantage, a commercial vendor selected in a competitive bidding process, converted the microfiche and printed volumes to digital formats. Apex performed the following conversion services:

  • Conversion of microfiche and printed volumes to Group IV bitonal TIFF 6.0 images.
  • Full-text conversion of resulting TIFF images to ASCII text at a minimum character accuracy level of 99%.
  • XML encoding of converted text using the TEI Lite DTD for description of basic document structure (TEI Level 1).
  • Creation of presentation derivatives for all documents in PDF Searchable Image (Exact) format.
  • Creation of descriptive metadata for each document.
  • Creation of technical metadata for all files.

Scanning Specification

All images were scanned as bitonal images, using the TIFF 6.0 specification and CCITT Group IV compression. Image files were scanned at 400 dots-per-inch, not interpolated, relative to the original document size. Image treatments, such as page trim, rotation and deskew, were applied as necessary.
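As a rough illustration of what 400 dots-per-inch relative to the original document size implies, the sketch below computes the pixel dimensions of a scanned page and the equivalent sampling resolution on the film for the two reduction ratios in the collection. The A4 page size is an assumption chosen for illustration only; the original page dimensions are not recorded here, and the function names are hypothetical.

```python
# Sketch: pixel dimensions and on-film resolution for a 400 dpi scan.
# Assumption: an A4 original (210 x 297 mm); actual page sizes are not stated.

MM_PER_INCH = 25.4
SCAN_DPI = 400  # dots per inch, relative to the original document size


def pixel_dimensions(width_mm: float, height_mm: float, dpi: int = SCAN_DPI) -> tuple[int, int]:
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def on_film_dpi(reduction_ratio: int, dpi: int = SCAN_DPI) -> int:
    """Sampling resolution on the film frame that yields `dpi` at original size."""
    return dpi * reduction_ratio


if __name__ == "__main__":
    print("A4 page in pixels:", pixel_dimensions(210, 297))
    for ratio in (20, 24):
        print(f"{ratio}:1 reduction: {on_film_dpi(ratio)} dpi on film")
```

Because the resolution is stated relative to the original document, a frame filmed at a higher reduction ratio has to be sampled more finely on the film to produce the same image size.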

Descriptive Metadata

Apex captured descriptive metadata for each separate document found in the microfiche and print collections. The primary source of the document-level descriptive metadata was the text of the first page of the document itself. When an element was not found in the document text, it was taken instead from the header of the microfiche page.

SULAIR librarians and content experts provided the vendor with rules for capturing document-level descriptive metadata. The vendor used a combination of zoned Optical Character Recognition (OCR) and manual data entry to capture descriptive elements of each document. SULAIR required 99.95% character accuracy for descriptive metadata capture. The vendor captured the metadata in a Microsoft SQL relational database, using a schema designed by SULAIR.

Text Conversion

Apex also converted the scanned images into text to allow full-text searching of the collection. Converted text was stored in two formats: plain text, and XML encoded using the TEI Lite schema. Because the collection consists exclusively of English, French, Spanish, and Portuguese documents, SULAIR chose ISO-8859-1 as the character set for both the plain text and TEI documents.
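The character set choice can be illustrated with a short sketch that checks whether a string is representable in ISO-8859-1 and encodes it at one byte per character. The sample strings are illustrative and are not taken from the collection; the helper names are hypothetical.

```python
# Sketch: checking that text is representable in ISO-8859-1 (Latin-1).
# The sample strings below are illustrative, not drawn from the collection.

SAMPLES = {
    "English": "General Agreement on Tariffs and Trade",
    "French": "Accord général sur les tarifs douaniers et le commerce",
    "Spanish": "Acuerdo General sobre Aranceles Aduaneros y Comercio",
    "Portuguese": "Acordo Geral sobre Tarifas Aduaneiras e Comércio",
}


def fits_latin1(text: str) -> bool:
    """Return True if every character in `text` exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


def to_latin1_bytes(text: str) -> bytes:
    """Encode `text` as ISO-8859-1; each character becomes exactly one byte."""
    return text.encode("iso-8859-1")


if __name__ == "__main__":
    for language, sample in SAMPLES.items():
        encoded = to_latin1_bytes(sample)
        print(language, fits_latin1(sample), len(sample), len(encoded))
```

A single-byte character set keeps the plain text and TEI files compact, at the cost of rejecting any character outside the Latin-1 repertoire.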

It is important to remember that the goal of the project was to create an interface that allows end users to discover the page images related to their search terms. Neither the plain text nor the TEI file for a document is displayed directly to end users. Rather, the full text of the documents is used to build a search index, and images of the original pages are delivered to the user for reading. This, coupled with the scale and budgetary constraints of the project, led SULAIR to specify a 99% character accuracy requirement for text conversion.

The primary means of text conversion was automated Optical Character Recognition (OCR). Apex conducted only minimal and selective human correction of text conversion errors in order to achieve the 99% character accuracy requirement.
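To make the 99% figure concrete, the sketch below measures character accuracy as one minus the edit distance between a hand-checked reference text and the OCR output, divided by the length of the reference. This definition is an assumption made for illustration; the exact formula used to measure accuracy on the project is not given here.

```python
# Sketch: character accuracy of OCR output against a reference transcription.
# Assumption: accuracy = 1 - (edit distance / reference length).


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of reference characters reproduced correctly by the OCR."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, ocr_text) / len(reference)


def meets_requirement(reference: str, ocr_text: str, threshold: float = 0.99) -> bool:
    """Check the OCR output against a minimum character accuracy threshold."""
    return character_accuracy(reference, ocr_text) >= threshold
```

Under this definition, a page of 2,000 characters may contain up to 20 character errors and still meet a 99% requirement, which is acceptable for index building but would be distracting if the text were shown to readers.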

PDF Files

Apex created delivery surrogates in PDF Searchable Image (Exact) format for each document. The PDF files are used for both online viewing and printing, so their formatting was optimized to produce high-quality printouts as well as an efficient online viewing experience. All PDFs are saved in version 1.4 of the PDF file format and can be viewed in Adobe Acrobat 5.0 and above.

Technical Metadata

The vendor captured technical metadata for all files produced, including TIFF images, raw text files, XML-encoded text files, and PDF document files. The technical metadata specification used can be found at:

http://www-sul.stanford.edu/depts/ts/tsdepts/cat/units/metadata/docs/taskforce/QuickVuDsc_Tech_Img.pdf [PDF]

Quality Control

Upon delivery of metadata and image files from the vendor, SULAIR staff conducted quality control of all products. Automated procedures were used where possible, for example to verify MD5 checksums and to validate XML-encoded files. For document-level descriptive metadata, a random sample of between 1% and 5% of all documents was chosen to validate the 99.95% character accuracy requirement. Similarly, to validate the 99% accuracy requirement for full-text conversion, SULAIR staff evaluated a random sample of between 1% and 5% of all pages converted. Random samples of TIFF images and PDF files were also checked to verify that specifications were followed.
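The automated checks and the sampling step described above might look something like the following sketch. It is not the tooling SULAIR used; the function names are hypothetical, and the XML check tests only well-formedness, since Python's standard library parser does not validate against a DTD.

```python
# Sketch: delivery checks of the kind described above (hypothetical helpers).
import hashlib
import random
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with the checksum supplied at delivery."""
    return md5_of_file(path) == expected_hex.strip().lower()


def is_well_formed_xml(path: Path) -> bool:
    """Return True if the file parses as XML (well-formedness only, no DTD)."""
    try:
        ET.parse(str(path))
    except ET.ParseError:
        return False
    return True


def draw_sample(items: list, fraction: float, seed=None) -> list:
    """Draw a random sample covering `fraction` of the items for manual review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    if not items:
        return []
    size = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, size)
```

Passing a fixed `seed` makes a sample reproducible, which is useful if a reviewer needs to revisit the same set of pages.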

Indexing, Search, and Document Delivery

The Apache Group's Lucene search engine powers the search and browse sections of the library. SULAIR staff used Lux, a Lucene front-end that facilitates building Lucene indexes from collections of XML documents, to build an index for the site from the TEI-encoded text produced by Apex. The index, which measured 1.2 gigabytes at the time of writing, is updated periodically to allow for metadata corrections and enhancements.
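The general idea of searching converted text in order to deliver page images can be sketched with a tiny inverted index. This is a conceptual illustration only: it does not use or reproduce the Lucene or Lux APIs, and the page identifiers and function names are hypothetical.

```python
# Sketch: a minimal inverted index mapping terms to page image identifiers.
# Conceptual only; this is not the Lucene or Lux API.
import re
from collections import defaultdict

TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of page identifiers whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches
```

The search returns identifiers of page images rather than the converted text itself, which mirrors the decision to show users the original pages and keep the OCR text behind the scenes.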

The rest of the website is a fairly standard Java 2 web application. It uses Apache Struts as its MVC framework, Apache Tomcat 4 as its servlet engine, and a MySQL database for persistent storage of session and user data.

The library is hosted on a Sun Netra T12 at Forsythe Hall, Stanford University, and is ably maintained by SULAIR Systems and Stanford University ITSS.

 
Copyright 2004 The Board and Trustees of the Leland Stanford Junior University