Technical Documentation

Physical Description of the Collection

The GATT Digital Library was produced by digitizing the GATT microfiche collection housed at the Stanford University Libraries & Academic Information Resources (SULAIR), converting it to text, and encoding the text in XML. The microfiche collection consists of approximately 17,000 microfiche pages with over 480,000 frames. Two physical microfiche formats are represented in the collection:

  • Silver-based, positive polarity, 145 x 105 mm, 58 frames per microfiche, 20:1 reduction ratio.
  • Silver-based, positive polarity, 145 x 105 mm, 98 frames per microfiche, 24:1 reduction ratio.

The microfiche used for this project were produced incrementally beginning in the late 1940s and are third-generation copies.

In addition to the microfiche, 166 volumes of print publications were scanned, consisting of approximately 25,000 additional pages.

Conversion Process

Apex CoVantage, a commercial vendor selected in a competitive bidding process, converted the microfiche and printed volumes to digital formats. Apex performed the following conversion services:

  • Conversion of microfiche and printed volumes to Group IV bitonal TIFF 6.0 images.
  • Full-text conversion of resulting TIFF images to ASCII text at a minimum character accuracy level of 99%.
  • XML encoding of converted text using the TEI Lite DTD for description of basic document structure (TEI Level 1).
  • Creation of presentation derivatives for all documents in PDF Searchable Image (Exact) format.
  • Creation of descriptive metadata for each document.
  • Creation of technical metadata for all files.
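The XML encoding step listed above can be pictured with a small sketch. This is an illustration only, not the vendor's actual output: the element names follow TEI Lite as commonly used with a DTD (a `TEI.2` root, a `teiHeader`, and page breaks marked with `pb`), while the function name, the page identifiers, and the header wording are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_tei_lite(title: str, pages: list[tuple[str, str]]) -> str:
    """Return a minimal TEI Lite-style document as a string.

    `pages` is a list of (page_id, page_text) pairs. The id scheme and
    the header statements are assumptions made for this example.
    """
    root = ET.Element("TEI.2")

    # Minimal header: title, publication, and source statements.
    header = ET.SubElement(root, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Hypothetical publication statement"
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = "Converted from microfiche"

    # Basic structure only: one division, a page break before each page.
    text = ET.SubElement(root, "text")
    body = ET.SubElement(text, "body")
    div = ET.SubElement(body, "div", {"type": "document"})
    for number, (page_id, page_text) in enumerate(pages, start=1):
        ET.SubElement(div, "pb", {"n": str(number), "id": page_id})
        ET.SubElement(div, "p").text = page_text

    return ET.tostring(root, encoding="unicode")


print(build_tei_lite("Sample document", [("p0001", "First page text.")]))
```

The point of the sketch is that encoding at this level records only basic structure, so each converted page can be tied back to its page image through the page break markers.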

Scanning Specification

All pages were scanned as bitonal images, using the TIFF 6.0 specification and CCITT Group IV compression. Images were scanned at 400 dots per inch (dpi), not interpolated, measured relative to the original document size. Image treatments, such as page trim, rotation, and deskew, were applied as necessary.
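Because the resolution is stated relative to the original document size, the sampling density on the film itself must be higher by the reduction ratio. The arithmetic can be sketched as follows. The 8.5 x 11 inch page size is an assumption chosen for illustration (the source does not state the size of the original documents), and the function names are hypothetical.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int = 400) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at `dpi` relative to its original size."""
    return round(width_in * dpi), round(height_in * dpi)


def film_dpi(document_dpi: int, reduction_ratio: int) -> int:
    """Samples per inch of film needed to reach `document_dpi` at original size."""
    return document_dpi * reduction_ratio


def uncompressed_bytes(width_px: int, height_px: int) -> int:
    """Size of a 1-bit image before compression, with rows padded to whole bytes."""
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


width, height = scan_pixels(8.5, 11)       # assumed page size
print(width, height)                       # 3400 4400
print(film_dpi(400, 20))                   # 8000
print(film_dpi(400, 24))                   # 9600
print(uncompressed_bytes(width, height))   # 1870000
```

Under these assumptions a single page is about 1.9 MB before compression, which is one reason a compressed bitonal format is a natural fit for a collection of this size.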

Descriptive Metadata

Apex captured descriptive metadata for each separate document found in the microfiche and print collections. The primary source of the document-level descriptive metadata was the text of the first page of the document itself. When an element could not be found in the document text, it was taken from the header of the microfiche page instead.

SULAIR librarians and content experts provided the vendor with rules for capturing document-level descriptive metadata. The vendor used a combination of zoned Optical Character Recognition (OCR) and manual data entry to capture descriptive elements of each document. SULAIR required 99.95% character accuracy for descriptive metadata capture. The vendor captured the metadata in a Microsoft SQL relational database, using a schema designed by SULAIR.

Text Conversion

Apex also converted the scanned images into text to allow full-text searching of the collection. Converted text was stored in two formats: plain text and XML encoded using the TEI Lite DTD. Because the collection consists exclusively of English, French, Spanish, and Portuguese documents, SULAIR chose ISO-8859-1 as the character set for both the plain text and TEI documents.
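A character set choice like this can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: it reports any characters in a string that ISO-8859-1 cannot represent. The function name and sample strings are invented for the example.

```python
def latin1_problems(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs that ISO-8859-1 cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode("iso-8859-1")
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# Common accented letters of French, Spanish, and Portuguese fit in one byte each.
print(latin1_problems("général año ação"))   # []

# The ligature œ (U+0153) is outside ISO-8859-1.
print(latin1_problems("œuvre"))              # [(0, 'œ')]
```

Every character that passes this check occupies exactly one byte in the stored file, so the byte length of a converted text equals its character count.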

It is important to remember that the goal of the project was to create an interface that allows end users to discover the page images related to their search terms. Neither the plain text nor the TEI file for a document is displayed directly to end users. Rather, the full text of the documents is used to build an index for searching, and images of the original pages are delivered to the user for reading. This, coupled with the scale and budgetary constraints of the project, led SULAIR to specify a 99% character accuracy requirement for text conversion.
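The search-then-deliver-images design described here can be illustrated with a toy inverted index. This is a sketch of the general technique, not the search software used by the project; the image file names, the tokenization rule, and the function names are all assumptions.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased term to the set of page image ids whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for image_id, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(image_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the ids of page images whose text contains every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical converted text keyed by page image file name.
pages = {
    "doc1_p1.tif": "Tariff negotiations in Geneva",
    "doc1_p2.tif": "Schedule of tariff concessions",
}
index = build_index(pages)
print(search(index, "tariff geneva"))   # ['doc1_p1.tif']
```

In a design like this, a text conversion error never appears on screen. Its only effect is that a misrecognized word fails to match a query, which is why a lower accuracy target can be acceptable for the full text than for the descriptive metadata.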

The primary means of text conversion was automated Optical Character Recognition (OCR). Apex conducted only minimal and selective human correction of text conversion errors in order to achieve the 99% character accuracy requirement.
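The source does not say how character accuracy was measured. One common approach, sketched here as an assumption rather than as the method actually used, compares converted text with a hand-verified reference using edit distance and reports one minus the error rate.

```python
def edit_distance(reference: str, candidate: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a candidate character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, candidate: str) -> float:
    """Fraction of reference characters converted correctly, by edit distance."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, candidate) / len(reference)


# Typical OCR confusions: "l" read as "1", "m" read as "rn".
accuracy = character_accuracy("General Agreement", "Genera1 Agreernent")
print(f"{accuracy:.4f}")
```

At the stated thresholds, a sample of 10,000 characters could contain up to 100 errors under the 99% requirement for full text, but only 5 under the 99.95% requirement for descriptive metadata.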