Highlighting and Snippets in VuFind 1.1

Posted by: Demian Katz
Posted Date: March 23, 2011
Filed Under: Library News, Technology Developments
Tags: vufind

One of the perils of keyword-based searching is that sometimes it is not totally clear why certain results show up after performing a search. Fortunately, two common conventions help ease this problem: highlighting matching keywords and displaying snippets of text to show matches in context. The Solr index engine has supported both of these features for a long time, but VuFind has only provided robust support for them starting in version 1.1.

Activating Highlighting and Snippets in VuFind

As a VuFind administrator, if you want to take advantage of these new features, all you have to do is upgrade to VuFind 1.1 and they will be turned on by default. If you want to turn them off or adjust some of the behavior, you can make a few adjustments to your searches.ini file as described in the VuFind wiki. Unless you are interested in the technical workings behind the scenes, that is all you need to know. Have fun! Solr power users, VuFind developers and other interested techies, please read on….

Highlighting and Snippets at the Solr Level

Solr’s support for highlighting and snippets is straightforward. By means of some search parameters (set in the solrconfig.xml configuration file and/or as part of the search request), you tell Solr whether or not to apply highlighting, which fields to highlight, how to mark highlighted words, and so forth. When highlighting is requested, Solr adds a new section to its search response listing all of the highlighted phrases found in all of the documents in the search response. The highlighting information is completely separate from the main list of search results, so highlighting does not actually alter the main part of the Solr response — the details need to be merged in by the calling code.

Problem #1: Marking Highlighted Text

One of the first problems that needs to be addressed is how to mark highlighted words in the Solr response. Solr provides hl.simple.pre and hl.simple.post parameters which can be used to specify text to mark the beginning and ending of highlighted words. The obvious first temptation is to simply stick some HTML in here — "<em>" and "</em>", for example. This can lead to pitfalls, however — if you are escaping your output, the HTML won’t make it through, and the end user will actually see the HTML code. If you are not escaping your output, then text between or around the emphasis tags may get misinterpreted as HTML, leading to garbled displays (never assume you won’t have angle brackets somewhere in your records!).

VuFind’s solution to this problem is fairly obvious — it uses markers that are extremely unlikely to show up in record text (“{{{{START_HILITE}}}}” and “{{{{END_HILITE}}}}”) and defines a special escaping routine used only for highlighted text. When displaying something that it knows has been highlighted, it first escapes any possible HTML entities, and THEN it replaces the highlighting markers with HTML code that achieves the actual highlighting logic. You can see the Smarty modifier that achieves this work here. Note that the Smarty code contains some extra logic for finding and highlighting words, since it is also designed for use by other modules of VuFind that are unable to rely on Solr’s highlighting capabilities — this logic is ignored when Solr results are being displayed.

Problem #2: Merging Highlighting Data with Records

As mentioned above, Solr provides highlighting information completely separately from its search result list. This can be rather inconvenient since it requires code to look in two different places during record processing. The first temptation when encountering this problem is to write code that merges everything together, overwriting fields in the main response with highlighted versions found elsewhere in the response. However, as with many first temptations, that’s a bad idea. First of all, you will very likely lose data if you do this. In a multi-valued field, it is possible that only certain values will be highlighted and others omitted entirely. Also, unless the hl.fragsize parameter is set to 0, snippets will be truncated to only show a few words around the highlighted term. Additionally, data loss aside, it is often convenient to have both highlighted and non-highlighted versions of fields available; for example, if you want to create a link to a page about an author, you want to use the non-highlighted text for inclusion in the target URL, but you want to use the highlighted version to display the link text.

Again, VuFind works through these issues in a fairly straightforward way. For convenience, it does merge the highlighting data with the search results so that code doesn’t need to look in two completely separate arrays for information about each record. However, it doesn’t overwrite any fields; instead, it creates a fake “_highlighting” field within the body of the record and stores all of the highlighting details in there. Whenever VuFind displays a field that might be subject to highlighting, it looks in two places — first it checks the _highlighting array and displays properly processed, highlighted text if it finds any. If no highlighted version exists, it fails over to the standard, non-highlighted text. Admittedly, this adds a bit more complexity to the display templates, but it seems a reasonable price to pay to ensure data integrity. It also helps to remind template designers where they need to use the Smarty highlight modifier described above, greatly reducing the risk of any “{{{START_HILITE}}}” tags accidentally slipping through to the end user’s display.

Problem #3: Highlighted Text May Be Truncated

As discussed above, highlighted text may be truncated in some circumstances (by default, snippets are limited to about 100 characters). This is reasonable, since search results should be brief and easy to read. Indeed, even before it supported highlighting, VuFind already had code to trim down super-long titles in search results. The critical difference between the old title-trimming code and the new reliance on Solr snippets is that the old code always showed the beginning of a title, while Solr snippets occasionally come from the middle of a title, yielding strange-looking results. Setting the hl.fragsize parameter to 0 is an option, though that will lead to very long titles in search results. VuFind’s solution relies on another new Smarty modifier (modifier.addEllipsis.php) which compares highlighted text against non-highlighted text and adds periods of ellipsis on each end if truncation is detected. This may not be a perfect solution, but at least it adds a little more visual context to the truncated text.

There is one additional caveat that should be noted: multi-valued fields are still a problem. If a field contains five values and only two of them match search terms, then the highlighting data will only contain (at most) two values. VuFind does not currently contain any mechanisms for matching up partial highlighted results with longer lists of non-highlighted results. The problem is avoided in the simplest way possible: the highlighted fields currently used in VuFind’s search result templates (title and primary author) are single-valued. Multi-valued fields are only displayed as snippets (see below).

Problem #4: Displaying Snippets

As discussed above, there are certain Solr fields which VuFind will always display in search results: most importantly, title and author. However, keyword matches may fall outside of these displayed fields. For that reason, it is helpful to display snippets showing matches in other fields. Since there may be many snippets, and the search result listing should be kept reasonably brief, it makes sense to try to display just one snippet, preferably the most relevant one.

Snippet selection is handled by the IndexRecord record driver, the base class that handles display of all records retrieved from the Solr index. This class contains two arrays: $preferredSnippetFields, an array of fields that are very likely to have good snippet data and should be checked first, and $forbiddenSnippetFields, an array of fields with bad or redundant data that should never be considered for use as a snippet. By default, $preferredSnippetFields contains subject headings and table of contents entries, since these tend to offer valuable information, while $forbiddenSnippetFields contains author and title fields (unnecessary for snippets since they are always displayed elsewhere in the template), ID values (obviously uninformative) and the spelling field (a jumble of data duplicated from other fields, necessary for spell checking but misleading as a snippet). The getHighlightedSnippet method uses these arrays to pick a single best snippet, first checking the preferred fields and then taking the first available non-forbidden field if necessary. Since the method and its related arrays are all protected, it is possible to extend the IndexRecord class and create custom behavior as needed on a driver-by-driver basis.

One further detail helps make things more clear: some snippets make little sense out of context, so searches.ini contains a [Snippet_Captions] section where Solr fields can be assigned labels that will be used as captions in front of snippets. Snippets for fields not listed in this section will display as stand-alone, uncaptioned lines in the search results.

Conclusions

Highlighting and snippets really aren’t too difficult to work with, but as with almost anything, they turn out to be a little more complicated than expected once you look at all of the details. I hope this post has helped point out the most obvious pitfalls and explain the reasoning behind VuFind’s implementation. There is still plenty more that could be done — some of the behavior could be made even smarter, and more of Solr’s power could be exposed through VuFind configuration settings. If you have ideas or questions, please feel free to share them as comments on this post or via the vufind-tech mailing list.

1 People Like This Post

2 comments so far

2 Comments »

Pingback by Infobib » Neuer Biblioblog: Villanova Library Technology Development — March 23, 2011 @ 4:46 PM

[…] erste Posting beschreibt Highlighting and Snippets in VuFind 1.1. Aggregiert gibt’s die VuFind-Neuigkeiten auch im Planet […]
Comment by Bill Lawson — March 24, 2011 @ 10:03 PM

As an alum, I’m so glad to see that the Library is keeping current with today’s technology. Maybe we can convince the students that libraries aren’t outdated and that they should use them more often!

RSS feed for comments on this post. TrackBack URI

Falvey Library Blog

Highlighting and Snippets in VuFind 1.1

2 Comments »

Leave a comment