FALVEY MEMORIAL LIBRARY



Digital Library upgrade provides enhanced discovery

Villanova University’s Digital Library has recently upgraded its discovery interface, introducing a more detailed search experience. This is the first major upgrade to the application’s current structure, which was introduced a year ago when the Digital Library migrated to a Fedora-Commons Repository and debuted a public interface built on VuFind, the Open Source faceted search engine.

Part 1 – Modeling the Repository

First we will discuss the system architecture and its components. Fedora (Flexible Extensible Digital Object Repository Architecture) provides the core architecture and services necessary for digital preservation, all accessible through a well-defined Application Programming Interface (API). It also provides numerous support services to facilitate harvesting, fixity, and messaging, and it supports the Resource Description Framework (RDF) by including the Mulgara triple store.

Figure 1

It is through these RDF semantic descriptions that Fedora models the relationships between the objects within the repository. An object’s RDF description contains declarative information regarding what kind of object it is. In our case we created one top-level model (CoreModel) that describes attributes common to all objects (thumbnails, metadata, licensing information) and two second-level models that represent the basic shapes in the repository (Collections and Data). Collections represent groups of objects, and Data objects represent the actual content being stored. (See Figure 1)

Figure 2

From here we further specialized these two models into specific types. Collections can be either Folders or Resources, and Data objects can be Images, Audio files, Documents, etc. (See Figure 2)

Figure 3

Another important component of the RDF description is the object’s relationship to other objects. It is this relationship that places Resources within their parent Folder, and book pages within their parent Resource. (See Figure 3)

Look at the following RDF description for our Cuala Press Collection. It contains two “hasModel” relationships stating that it is both a Collection and a Folder (Fedora does not support inheritance, favoring a mixin approach instead). Note also the single “isMemberOf” relationship referencing vudl:3, the top-level collection of the Digital Library.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fedora="info:fedora/fedora-system:def/model#" xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/vudl:2001">
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:CollectionModel"/>
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:FolderCollection"/>
    <rel:isMemberOf rdf:resource="info:fedora/vudl:3"/>
  </rdf:Description>
</rdf:RDF>
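To make the structure of this description concrete, here is a minimal Python sketch that reads the same RDF and pulls out the object's identifier, its models, and its parents. It uses only the standard library; the function names (`strip_prefix`, `read_relationships`) and the returned dictionary layout are illustrative assumptions, not part of Fedora or VuDL.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from the RDF description shown above.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MODEL_NS = "info:fedora/fedora-system:def/model#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"

CUALA_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fedora="info:fedora/fedora-system:def/model#" xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/vudl:2001">
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:CollectionModel"/>
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:FolderCollection"/>
    <rel:isMemberOf rdf:resource="info:fedora/vudl:3"/>
  </rdf:Description>
</rdf:RDF>"""


def strip_prefix(uri):
    """Turn 'info:fedora/vudl:3' into 'vudl:3' (hypothetical helper)."""
    prefix = "info:fedora/"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def read_relationships(rdf_xml):
    """Collect the identifier, models, and parents from one RDF description."""
    root = ET.fromstring(rdf_xml)
    description = root.find(f"{{{RDF_NS}}}Description")
    resource = f"{{{RDF_NS}}}resource"
    return {
        "pid": strip_prefix(description.get(f"{{{RDF_NS}}}about")),
        "models": [strip_prefix(e.get(resource))
                   for e in description.findall(f"{{{MODEL_NS}}}hasModel")],
        "parents": [strip_prefix(e.get(resource))
                    for e in description.findall(f"{{{REL_NS}}}isMemberOf")],
    }


info = read_relationships(CUALA_RDF)
print(info["pid"])      # vudl:2001
print(info["models"])   # ['vudl-system:CollectionModel', 'vudl-system:FolderCollection']
print(info["parents"])  # ['vudl:3']
```

Because the models come back as a plain list, an object that is "both a Collection and a Folder" is simply one with two entries, which mirrors the mixin approach described above.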

A more detailed explanation of this data model was presented at Open Repositories 2013. Abstract

Part 2 – The Discovery Layer

Villanova’s Falvey Library is the focal point and lead development partner for VuFind, an Open Source search engine designed specifically around the discovery of bibliographic content. Its recently redesigned core provides a flexible model for searching and displaying our Digital Library, making it the perfect match for the public interface.

The backbone of VuFind is Apache Solr, a Java-based search engine. A simple explanation of how it works: you put “records” into the Solr search index, each containing predefined fields (title, author, description, etc.), and the application can then search the contents of the index quickly and efficiently.
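The record-and-field idea can be illustrated with a toy inverted index in Python. This is only a sketch of the general concept, not how Solr is implemented or called; the `ToyIndex` class, the record ids, and the sample field values are all made up for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory index: each word points at the records containing it."""

    def __init__(self):
        self.records = {}
        self.postings = defaultdict(set)

    def add(self, record):
        """Store a record and index every word of every field except 'id'."""
        self.records[record["id"]] = record
        for field, value in record.items():
            if field == "id":
                continue
            for word in value.lower().split():
                self.postings[word].add(record["id"])

    def search(self, query):
        """Return ids of records that contain every word in the query."""
        words = query.lower().split()
        if not words:
            return []
        matches = set.intersection(*(self.postings.get(w, set()) for w in words))
        return sorted(matches)


index = ToyIndex()
# Hypothetical sample records, invented for this example.
index.add({"id": "rec:1", "title": "Frontier stories", "author": "A. Writer"})
index.add({"id": "rec:2", "title": "Harbor stories", "author": "B. Author"})

print(index.search("stories"))           # ['rec:1', 'rec:2']
print(index.search("frontier stories"))  # ['rec:1']
```

The lookup cost depends on the number of query words rather than the number of records, which is the basic reason an index of this kind stays fast as the collection grows.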

Our initial index contained all Resources and Folders from the repository, which allows us to browse through collections by hierarchy and to run searches that return both Resources and Folders in the results.

Figure 4

An early enhancement to the browse module added support for Collections that reside in multiple locations. For example, our Dime Novel collection contains sub-collections whose resources can exist in two places. (See Figure 4)

Look at the Buffalo Bill collection and notice how its breadcrumb trail denotes residency in multiple places. This is achieved by adding an additional “isMemberOf” relationship to its RDF description:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fedora="info:fedora/fedora-system:def/model#" xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/vudl:279438">
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:CollectionModel"/>
    <fedora:hasModel rdf:resource="info:fedora/vudl-system:FolderCollection"/>
    <rel:isMemberOf rdf:resource="info:fedora/vudl:280419"/>
    <rel:isMemberOf rdf:resource="info:fedora/vudl:280425"/>
  </rdf:Description>
</rdf:RDF>
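One way to picture how several “isMemberOf” relationships turn into several breadcrumb trails is the Python sketch below. It assumes the relationships have already been read into a plain dictionary mapping each object to its parents and that the hierarchy contains no cycles; the function name `breadcrumb_trails` is hypothetical and this is not the VuFind implementation. Only the relationships shown in the RDF above are included, so the two parents are treated as the top of each trail.

```python
def breadcrumb_trails(pid, parents):
    """Return every path from a top-level object down to pid.

    parents maps an object id to the list of ids it isMemberOf.
    An object with no recorded parents is treated as the top of a trail.
    """
    parent_ids = parents.get(pid, [])
    if not parent_ids:
        return [[pid]]
    trails = []
    for parent in parent_ids:
        # Extend each trail that reaches the parent with the current object.
        for trail in breadcrumb_trails(parent, parents):
            trails.append(trail + [pid])
    return trails


# Relationships taken from the RDF description above.
parents = {"vudl:279438": ["vudl:280419", "vudl:280425"]}

for trail in breadcrumb_trails("vudl:279438", parents):
    print(" > ".join(trail))
# vudl:280419 > vudl:279438
# vudl:280425 > vudl:279438
```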

Part 3 – The Upgrade

The existing search interface supports “full text” searching. We routinely perform Optical Character Recognition (OCR) on all scanned Resources using Google’s Tesseract application, storing the resulting derivative in the accompanying Data object. When the parent Resource is ingested into Solr, a loop runs over all of the associated child Data objects, grabbing each OCR file and adding its contents to the full text field for the Resource. This works: a search will match text from a particular page of the book and direct the patron to the parent Resource. From there, however, it is often difficult to determine which page matched the query.
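The ingest loop described above might look roughly like the following Python sketch. The field names (`id`, `title`, `fulltext`, `ocr_text`) and the function `build_resource_document` are assumptions chosen for illustration, and the sample data is invented; the real indexing code and Solr schema may differ.

```python
def build_resource_document(resource, pages):
    """Build one index record for a Resource from its child pages.

    The OCR text of every page is concatenated into a single full text
    field, so a match on any page leads to the parent Resource.
    """
    page_texts = [page["ocr_text"] for page in pages if page.get("ocr_text")]
    return {
        "id": resource["id"],
        "title": resource["title"],
        "fulltext": "\n".join(page_texts),
    }


# Hypothetical sample data.
resource = {"id": "book:1", "title": "Sample Book"}
pages = [
    {"id": "page:1", "ocr_text": "first page text"},
    {"id": "page:2", "ocr_text": "second page text"},
    {"id": "page:3", "ocr_text": ""},  # a page with no OCR output is skipped
]

doc = build_resource_document(resource, pages)
print("second" in doc["fulltext"])  # True
print(sorted(doc))                  # ['fulltext', 'id', 'title']
```

Note that the resulting record carries no page identifiers, which is exactly the limitation described: the index knows the book matched, but not which page.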

Figure 5

We solved this dilemma by including all Data objects in the Solr index. This allows specific pages to be searched in the catalog, leading users to the individual pages that match the query. The first obvious problem with this idea is that the search results would then be cluttered with individual pages rather than the more useful Folders and Resources. We ultimately overcame this by taking advantage of a newer Solr feature called Field Collapsing, which allows the result set to be grouped by a particular field. (See Figure 5) In our case we group on the parent Resource, which allows us to display both the Resource and the matching page in the result set. (See Figure 6) A live example of this can be seen here.

Figure 6
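The grouping step can be sketched in Python as follows. `grouping_params` builds the standard Solr result grouping parameters (`group`, `group.field`, `group.limit`), while `collapse_by_parent` imitates the effect on a list of page hits held in memory. The field name `parent_id`, both function names, and the sample hits are assumptions for illustration, not the actual VuFind schema or code.

```python
def grouping_params(query, group_field="parent_id"):
    """Build Solr request parameters that turn on result grouping."""
    return {
        "q": query,
        "group": "true",             # enable grouping (field collapsing)
        "group.field": group_field,  # collapse hits that share this field value
        "group.limit": 1,            # keep only the best hit per group
    }


def collapse_by_parent(hits, group_field="parent_id"):
    """Group ranked page hits by parent, keeping the original ranking order."""
    groups = {}
    for hit in hits:
        groups.setdefault(hit[group_field], []).append(hit["id"])
    return list(groups.items())


# Hypothetical page-level hits, already sorted by relevance.
hits = [
    {"id": "page:12", "parent_id": "book:1"},
    {"id": "page:40", "parent_id": "book:2"},
    {"id": "page:13", "parent_id": "book:1"},
]

for parent, page_ids in collapse_by_parent(hits):
    print(parent, page_ids)
# book:1 ['page:12', 'page:13']
# book:2 ['page:40']
```

Each group corresponds to one line in the result list (the parent Resource), and the page ids inside the group identify which pages matched.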

We are pleased to make this available to the world, with the hopes that it will be helpful.

Happy searching…

Useful Links

The components of our infrastructure are all Open Source, freely available applications.

Fedora-Commons Repository
The backbone of the system

VuFind
The public Discovery interface

VuDL
The admin used to ingest objects into Fedora

File Information Tool Set (FITS)
A file metadata extraction tool

Tesseract
An OCR engine

New Digital Library Administration Software

Falvey’s Digital Library has just been upgraded with new backend software that will improve its ability to continue growing and improving the online collection. The Digital Library’s first incarnation was launched in August 2006. Over the course of 4 years, the DL’s collection grew to over 9,000 items, and we accumulated a substantial wish-list of software functionality:

  • Add support for more file formats, so our collection can include a broader range of materials
  • Incorporate an OCR process to facilitate full-text searching of collection content
  • Add support for inclusion of transcriptions with hand-written materials

Our initial software used a variety of technologies to achieve its goal of storing information about digital documents. Unfortunately, not all of these tools worked well together. While the new version of the software retains the METS metadata format and the eXist-db XML database, it replaces nearly all of the other components with a suite of more closely related technologies. The new, all-XML, all-Open-Source framework consists of the following components:

New Key Features:

Root level Document Attachment

Catalogers now have the ability to add document-level items to each object. The most relevant use of this feature is to attach a hand-transcribed, fully annotated companion document to a digitally scanned book. More information on this feature can be found here, and a live example can be found by viewing the Lane Manuscript.


AJAX-based metadata editor

Orbeon Forms, a Java-based XForms engine, integrates with the YUI JavaScript Library to provide a rich user interface for metadata editing.


Document layout and file attachment configurations

The system incorporates a batch-attach routine for adding multiple files (in our case the pages of a scanned book) to a digital object as a single operation. An interface is available to customize the arrangement and location of these files, and to add or delete files when appropriate.


OAI harvestable

OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) is a standard for serving and harvesting metadata. The Digital Library is now fully harvestable using this standard.
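For readers unfamiliar with the protocol, a harvest request is an ordinary HTTP URL with a `verb` parameter. The Python sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`) metadata, both of which are defined by the OAI-PMH standard. The function name and the base URL `https://example.org/oai` are placeholders, not the Digital Library's actual endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI endpoint (a placeholder in this example).
    set_spec optionally limits the harvest to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


print(list_records_url("https://example.org/oai"))
# https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester would fetch this URL, parse the XML response, and repeat the request with the returned resumption token until the full record list has been retrieved.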


In the coming months we will extend the software to include custom drivers for a VuFind front-end and modularize the metadata editor to support a wide range of options, including Dublin Core, MODS, EAD, and PREMIS for preservation metadata.

Our plan is to launch the software as a simple, open-source platform for preservation and presentation of digital collections. So stay tuned! We are targeting April 2011 for the Beta Release.

We are always looking for development partners! If you are interested, please contact us at digitallibrary@villanova.edu

“What next?”

Written by Darren G. Poley, Outreach Librarian, Falvey Memorial Library.

There are several consortia that have for many years been trying to promote the idea and utility of digital collections. The concept, of course, is simple: either digitize print material in the public domain or archive digital works that are not under copyright, with the aim of making works more widely available to the scholarly community via the Web. The Digital Library Federation has worked primarily on standards. The D-Lib Alliance has an online journal and runs workshops. The Association of Research Libraries developed the Scholarly Publishing and Academic Resources Coalition (SPARC®), which for over a decade has been a major advocacy group for policy change. For Catholic universities there is the Catholic Research Resources Alliance, which is working on preserving access to rare Catholic materials. While membership in these various groups is commendable and their work continues to be necessary, several recent developments document the change in the milieu of digital libraries.

Our Cultural Commonwealth [PDF] (2006), from the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, was meant to map out the horizon for greater collaboration. The Ithaka Report, University Publishing in a Digital Age [PDF], released last year, forecasts the changing nature of university publishing due to the digital environment in which we now work. The Open Access mandate passed in February 2008 by the Arts and Sciences Faculty at Harvard University is a hotly debated effort that uses institutional weight to promote an opt-out policy, one that will make much that would have been less accessible available as OA, thereby “disseminating the fruits of its research and scholarship as widely as possible.” Finally, the Research Library Publishing Services [PDF] study, published in March 2008, assesses the lay of the land on this front, at least among major research libraries in the United States.

These hallmark statements show that eventually all universities will need to look at their efforts and policies concerning the necessity and viability of digital libraries, and at how they are an increasingly essential means for more than just reformatting old books. The digital library may very well become the vehicle for preserving a larger and richer deposit of current scholarship.


Last Modified: June 5, 2008