FALVEY MEMORIAL LIBRARY


Using Dismax for VuFind’s Advanced Search

  • Posted by: Demian Katz
  • Posted Date: April 15, 2011
  • Filed Under: VuFind

The Problem

One of the complexities of dealing with Solr searching is the fact that it has multiple query parsers with different strengths and weaknesses. The “Standard” query parser (sometimes referred to as the “Lucene” parser) offers traditional features like wildcards and boolean operators, but it doesn’t always do a good job when you need to search multiple index fields at the same time. The “Dismax” query parser uses fancy logic to do cross-fielded keyword searching that often seems to work like magic, but it lacks support for all the operators found in the Standard parser. VuFind currently uses a blend of these two mechanisms — most of the time, it relies on the Dismax handler, since that tends to yield the best results… but when a search contains features that Dismax can’t cope with (like a boolean AND or a * wildcard), it fails over to the Standard handler.

One of the big limitations of this arrangement was that VuFind’s advanced search screen always generated a Standard query, since the advanced search form forces the use of boolean operators, and Dismax doesn’t support booleans. This meant that advanced searches were often inconsistent with basic searches, not to mention slightly less effective in some cases. Fortunately, thanks to some little-known and little-documented Solr features, the next VuFind release will address this problem.

The Solution

As it turns out, the Standard query parser supports a pseudo-field called “_query_” which allows you to combine multiple non-Standard queries using Standard operators. You can specify the parser to use in each subquery through the {!parser} syntax. As a result, as long as each individual field of the advanced search form can be handled by Dismax, it is possible to use the Dismax parser for the separate chunks of the advanced search while still combining the chunks together using the Standard parser’s boolean capabilities!

For example, suppose you wanted to combine a Dismax author search with a Dismax title search. You could do it through this Standard search:

_query_:"{!dismax qf=\"author^100 author2^50\"}charles dickens" AND _query_:"{!dismax qf=\"title^100 alt_title^50\"}tale of two cities"

This will perform two Dismax searches (note that you can specify qf boosts inline) and then return only the results that match both of them. It’s not pretty thanks to the need to escape quotes inside the subquery, but it works… and attractiveness doesn’t really matter when it’s all generated automatically by code. Admittedly, VuFind’s search generation logic is fairly convoluted right now, but adding support for this capability only required the addition of a few more lines, as you can see from the patch posted in JIRA, and the benefits are significant.
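To make the quoting rules concrete, here is a rough Python sketch of how code might assemble such a query. The helper names and the specific field boosts are illustrative, not VuFind's actual implementation:

```python
# Hypothetical sketch of assembling the combined query shown above.
# The field names and boosts are examples, not VuFind's real configuration.

def dismax_subquery(qf, terms):
    """Wrap a Dismax subquery in the Standard parser's _query_ pseudo-field."""
    # Quotes inside the subquery must be escaped with backslashes.
    inner = '{!dismax qf=\\"' + qf + '\\"}' + terms
    return '_query_:"' + inner + '"'

def build_advanced_query(clauses, operator="AND"):
    """Join several Dismax subqueries with a Standard boolean operator."""
    parts = [dismax_subquery(qf, terms) for qf, terms in clauses]
    return (" %s " % operator).join(parts)

query = build_advanced_query([
    ("author^100 author2^50", "charles dickens"),
    ("title^100 alt_title^50", "tale of two cities"),
])
```

The escaping is handled once in the helper, so the calling code never has to think about the nested quotes.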

The Future

Hopefully things can be improved even further in the near future. The latest release of Solr (version 3.1) adds an “extended Dismax” parser which combines many of the best features of the Standard and Dismax parsers. This should greatly reduce the number of situations in which we need to use Standard instead of Dismax, and it may even eliminate the need for the current nest of recursive code that builds cross-field-capable Standard queries. Once I find time to upgrade VuFind’s Solr instance to the new version, I will begin investigating how much of the search logic can be simplified through the use of this new feature.


Java Tuning Made Easier

  • Posted by: Demian Katz
  • Posted Date: March 31, 2011
  • Filed Under: VuFind

If you run a constantly-growing Solr index (as many VuFind users do), chances are that sooner or later, you will need to do some Java tuning in order to solve performance problems. There are some good resources already on the web about this topic (for example, Sun’s Java Tuning White Paper), but they tend to be somewhat dense and technical. This article is intended to give a shorter introduction to the problem and the most basic strategies for solving it. If you need more details, by all means refer to more technical sources; I just wanted to offer an easier starting point.

Why Java Needs Tuning

The main reason Java needs tuning has to do with how it handles memory management. I was studying computer science when my university switched its curriculum from C++ to Java, so I’m very familiar with Java’s distinctive approach to this subject. In C++, the programmer is responsible for all the fine details of memory management — you have to request all of the memory that you plan to use, then return it to the operating system when you are done with it. Failing to do this properly leads to the dreaded “memory leak.” Java relieves this burden by taking an entirely different approach: the programmer uses memory without worrying about where it came from, and Java uses something called a “garbage collector” to figure out which pieces of memory are no longer needed and free them up for others to use. The C++ to Java transition caused many lessons to abruptly change from “memory management is of vital importance to all of your work” to “don’t worry about memory management; the magic box will do it for you.”

Usually, sparing the programmer from worrying about memory is a great improvement — it removes a lot of tedium from the work of writing code, and most of the time, the garbage collector just does its job, and nobody has to think about it. The problem is that for complex, memory-hungry applications like Solr, the garbage collector sometimes can’t keep up. The longer the program runs, the more time Java spends on garbage collection and the less time it spends on actually running the program. In extreme situations, a Java program can become completely unresponsive, devoting all of its effort to cleaning up after itself. If you run into problems with VuFind searches becoming extremely slow and find that the problem goes away after you restart VuFind, the cause is almost certainly the garbage collector. A restart frees up all memory and gives Java a clean start, so it’s usually an easy fix to performance problems… but it’s only a matter of time before garbage once again accumulates to a critical level and the problem returns!

Possible Solutions

There are basically three answers to the Java garbage collection problem, and you don’t have to pick just one. Using multiple strategies at the same time often makes sense.

• Regularly restart your Java application — call it cheating or postponing the inevitable if you like, but it’s a very simple approach: if it takes several days for your application to start performing poorly, just schedule it to automatically restart in the middle of the night every night to get a clean slate and consistent stability.

• Give Java more memory — intense garbage collection is triggered by high memory use, so the more memory you have available, the longer it will take for a program to fill it all. Adding memory reduces (or at least postpones) the need for garbage collection, and it’s as simple as changing a couple of parameters (see details in the VuFind wiki). Generally, more memory is always better… but there is one important caveat: don’t give Java all of your system’s available memory, since that can crowd out your operating system and cause other performance problems — always leave a bit of a buffer.

• Change garbage collector behavior — Java has several different garbage collection strategies available, and some of them have additional tuneable parameters. This is where things start to get complicated, but Lucid Imagination’s Java Garbage Collection Boot Camp offers a good run-down of the available choices (not to mention providing some more detailed technical background). Even if you don’t understand all the gory details, knowing the available options means you can do some trial and error.
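To illustrate the second option above, the memory settings boil down to a couple of standard JVM options. The variable name and values below are only examples; the right place to set these for VuFind is described in the wiki:

```
# -Xms sets the initial heap size and -Xmx sets the maximum.
# Leave some memory in reserve for the operating system.
JAVA_OPTIONS="-Xms1024m -Xmx1024m"
```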

Testing Your Strategy

Trial and error is an inevitable part of solving Java tuning problems. The biggest shortcoming I found in other articles on the subject is that they don’t offer a simple strategy for doing this. Fortunately, it’s not too hard to test your progress using some simple tools.

Java is capable of recording a log of all of its garbage collection behavior, telling you how often it performs garbage collection and how long each collection takes to complete. While the exact parameter for generating a log may vary depending on the Java Virtual Machine that you are using, for VuFind’s preferred OpenJDK version, you can add something like this to your Java options:

-Xloggc:/tmp/garbage.log

As you can probably guess, this outputs the garbage collection data to a log file called /tmp/garbage.log. If you want something fancier, you could do this instead:

-Xloggc:$VUFIND_HOME/solr/jetty/logs/gc-`/bin/date +%F-%H-%M`.log

Through the magic of the Unix shell, this version stores logs inside VuFind’s solr/jetty/logs folder, naming each log file with the date and time that VuFind started up so that you can track behavior across multiple restarts.
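If you just want a quick summary before reaching for a dedicated tool, a GC log can also be tallied by hand. This rough Python sketch sums the pause times; the exact log format varies between JVM versions, so the pattern may need adjusting:

```python
# A minimal sketch of summarizing a GC log by hand; the exact log format
# depends on the JVM version, so the regex below may need adjusting.
import re

# Matches the pause duration at the end of a typical HotSpot GC log line,
# e.g. "7.301: [GC 66048K->8627K(251392K), 0.0210000 secs]"
PAUSE = re.compile(r'([\d.]+) secs\]')

def total_gc_pause(lines):
    """Return (collection count, total seconds paused) for a GC log."""
    pauses = [float(m.group(1)) for line in lines for m in PAUSE.finditer(line)]
    return len(pauses), sum(pauses)
```

Feed it an open file handle (for example, `total_gc_pause(open('/tmp/garbage.log'))`) and compare the totals before and after each tuning change.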

So far, so good… except that these log files are really hard to read. Fortunately, an excellent tool exists to help visualize the data: gcviewer. With gcviewer, you can see a graph of your memory usage and the time spent on garbage collection, plus there are a number of handy statistics available (average collection time, total collection time, longest collection time, etc.). If gcviewer doesn’t meet your needs, there is also an IBM tool called PMAT which is slightly less convenient to download but which supports a broader range of log formats.

By logging data for several days between each tweak to your Java settings and using gcviewer or PMAT to analyze your logs, you can usually get a pretty good sense of whether you’ve made things better or worse… and how long it takes for your application to fall into the pit of inefficient garbage collection.

Conclusion

Java tuning is never going to be an easy subject to understand deeply, but that doesn’t mean you need to be afraid of it. There are several simple strategies available to help solve your problems even if you don’t know all the details of what is going on under the hood, and there are readily available tools to help you support your inevitable trial and error with empirical data. In fact, even if you are experiencing perfect performance today, it might not be a bad idea to examine garbage collection logs occasionally to see if you can prevent future problems before they become noticeable! Magic problem-solving boxes are great most of the time, but a bit of knowledge is always helpful for those times when they let you down.


Highlighting and Snippets in VuFind 1.1

  • Posted by: Demian Katz
  • Posted Date: March 23, 2011
  • Filed Under: VuFind

One of the perils of keyword-based searching is that sometimes it is not totally clear why certain results show up after performing a search. Fortunately, two common conventions help ease this problem: highlighting matching keywords and displaying snippets of text to show matches in context. The Solr index engine has supported both of these features for a long time, but VuFind has only provided robust support for them starting in version 1.1.

Activating Highlighting and Snippets in VuFind

As a VuFind administrator, if you want to take advantage of these new features, all you have to do is upgrade to VuFind 1.1 and they will be turned on by default. If you want to turn them off or adjust some of the behavior, you can make a few adjustments to your searches.ini file as described in the VuFind wiki. Unless you are interested in the technical workings behind the scenes, that is all you need to know. Have fun! Solr power users, VuFind developers and other interested techies, please read on….

Highlighting and Snippets at the Solr Level

Solr’s support for highlighting and snippets is straightforward. By means of some search parameters (set in the solrconfig.xml configuration file and/or as part of the search request), you tell Solr whether or not to apply highlighting, which fields to highlight, how to mark highlighted words, and so forth. When highlighting is requested, Solr adds a new section to its search response listing all of the highlighted phrases found in all of the documents in the search response. The highlighting information is completely separate from the main list of search results, so highlighting does not actually alter the main part of the Solr response — the details need to be merged in by the calling code.
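For illustration, the relevant request parameters look something like this. The `hl.*` parameter names are standard Solr; the field names and marker values are just examples:

```python
# Example Solr request parameters for highlighting; field names are
# illustrative, the hl.* parameter names are standard Solr.
params = {
    "q": "dickens",
    "hl": "true",                             # turn highlighting on
    "hl.fl": "title,author",                  # fields to highlight
    "hl.simple.pre": "{{{{START_HILITE}}}}",  # marker before each match
    "hl.simple.post": "{{{{END_HILITE}}}}",   # marker after each match
    "hl.fragsize": "100",                     # max snippet length in characters
}
```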

Problem #1: Marking Highlighted Text

One of the first problems that needs to be addressed is how to mark highlighted words in the Solr response. Solr provides hl.simple.pre and hl.simple.post parameters which can be used to specify text to mark the beginning and ending of highlighted words. The obvious first temptation is to simply stick some HTML in here — "<em>" and "</em>", for example. This can lead to pitfalls, however — if you are escaping your output, the HTML won’t make it through, and the end user will actually see the HTML code. If you are not escaping your output, then text between or around the emphasis tags may get misinterpreted as HTML, leading to garbled displays (never assume you won’t have angle brackets somewhere in your records!).

VuFind’s solution to this problem is fairly obvious — it uses markers that are extremely unlikely to show up in record text (“{{{{START_HILITE}}}}” and “{{{{END_HILITE}}}}”) and defines a special escaping routine used only for highlighted text. When displaying something that it knows has been highlighted, it first escapes any possible HTML entities, and THEN it replaces the highlighting markers with HTML code that achieves the actual highlighting logic. You can see the Smarty modifier that achieves this work here. Note that the Smarty code contains some extra logic for finding and highlighting words, since it is also designed for use by other modules of VuFind that are unable to rely on Solr’s highlighting capabilities — this logic is ignored when Solr results are being displayed.
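The same escape-then-replace order can be sketched in a few lines of Python (VuFind's real implementation is a Smarty modifier in PHP; the CSS class name here is an assumption):

```python
# A Python sketch of the escape-then-replace order that keeps record text
# safe while still rendering highlights as HTML. The class name "highlight"
# is an assumption for illustration.
import html

def render_highlighted(text):
    # Step 1: escape everything, so angle brackets that appear in the
    # record itself can never be interpreted as HTML...
    escaped = html.escape(text)
    # Step 2: ...THEN turn the highlighting markers into real tags.
    return (escaped
            .replace("{{{{START_HILITE}}}}", '<span class="highlight">')
            .replace("{{{{END_HILITE}}}}", "</span>"))
```

Because the markers contain no characters that `html.escape` touches, they survive step 1 intact and can be safely swapped for markup in step 2.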

Problem #2: Merging Highlighting Data with Records

As mentioned above, Solr provides highlighting information completely separately from its search result list. This can be rather inconvenient since it requires code to look in two different places during record processing. The first temptation when encountering this problem is to write code that merges everything together, overwriting fields in the main response with highlighted versions found elsewhere in the response. However, as with many first temptations, that’s a bad idea. First of all, you will very likely lose data if you do this. In a multi-valued field, it is possible that only certain values will be highlighted and others omitted entirely. Also, unless the hl.fragsize parameter is set to 0, snippets will be truncated to only show a few words around the highlighted term. Additionally, data loss aside, it is often convenient to have both highlighted and non-highlighted versions of fields available; for example, if you want to create a link to a page about an author, you want to use the non-highlighted text for inclusion in the target URL, but you want to use the highlighted version to display the link text.

Again, VuFind works through these issues in a fairly straightforward way. For convenience, it does merge the highlighting data with the search results so that code doesn’t need to look in two completely separate arrays for information about each record. However, it doesn’t overwrite any fields; instead, it creates a fake “_highlighting” field within the body of the record and stores all of the highlighting details in there. Whenever VuFind displays a field that might be subject to highlighting, it looks in two places — first it checks the _highlighting array and displays properly processed, highlighted text if it finds any. If no highlighted version exists, it fails over to the standard, non-highlighted text. Admittedly, this adds a bit more complexity to the display templates, but it seems a reasonable price to pay to ensure data integrity. It also helps to remind template designers where they need to use the Smarty highlight modifier described above, greatly reducing the risk of any “{{{{START_HILITE}}}}” tags accidentally slipping through to the end user’s display.
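The merge-without-overwriting approach can be sketched in Python; the data structures here are simplified from what Solr actually returns:

```python
# A sketch of merging Solr's separate highlighting section into each record
# without overwriting any original fields; structures are simplified.

def merge_highlighting(docs, highlighting):
    """Attach highlighting details to each record under a fake field."""
    for doc in docs:
        # Solr keys its highlighting section by document ID.
        doc["_highlighting"] = highlighting.get(doc["id"], {})
    return docs

def display_field(doc, field):
    """Prefer the highlighted version of a field, fall back to the plain one."""
    hilited = doc.get("_highlighting", {}).get(field)
    if hilited:
        return hilited[0]      # highlighting values arrive as lists
    return doc.get(field, "")
```

Note that the original field values are always left in place, so code that needs the plain text (for building a URL, say) can still get at it.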

Problem #3: Highlighted Text May Be Truncated

As discussed above, highlighted text may be truncated in some circumstances (by default, snippets are limited to about 100 characters). This is reasonable, since search results should be brief and easy to read. Indeed, even before it supported highlighting, VuFind already had code to trim down super-long titles in search results. The critical difference between the old title-trimming code and the new reliance on Solr snippets is that the old code always showed the beginning of a title, while Solr snippets occasionally come from the middle of a title, yielding strange-looking results. Setting the hl.fragsize parameter to 0 is an option, though that will lead to very long titles in search results. VuFind’s solution relies on another new Smarty modifier (modifier.addEllipsis.php) which compares highlighted text against non-highlighted text and adds an ellipsis to either end where truncation is detected. This may not be a perfect solution, but at least it adds a little more visual context to the truncated text.
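The addEllipsis idea can be sketched in Python as a comparison between the snippet (with its markers stripped) and the full field value:

```python
# A Python sketch of the addEllipsis idea: strip the highlighting markers,
# compare against the full field value, and add "..." wherever the snippet
# was cut short. VuFind's real version is a Smarty modifier in PHP.

def add_ellipsis(snippet, full_text):
    plain = (snippet.replace("{{{{START_HILITE}}}}", "")
                    .replace("{{{{END_HILITE}}}}", ""))
    if not full_text.startswith(plain):
        snippet = "..." + snippet      # truncated at the front
    if not full_text.endswith(plain):
        snippet = snippet + "..."      # truncated at the back
    return snippet
```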

There is one additional caveat that should be noted: multi-valued fields are still a problem. If a field contains five values and only two of them match search terms, then the highlighting data will only contain (at most) two values. VuFind does not currently contain any mechanisms for matching up partial highlighted results with longer lists of non-highlighted results. The problem is avoided in the simplest way possible: the highlighted fields currently used in VuFind’s search result templates (title and primary author) are single-valued. Multi-valued fields are only displayed as snippets (see below).

Problem #4: Displaying Snippets

As discussed above, there are certain Solr fields which VuFind will always display in search results: most importantly, title and author. However, keyword matches may fall outside of these displayed fields. For that reason, it is helpful to display snippets showing matches in other fields. Since there may be many snippets, and the search result listing should be kept reasonably brief, it makes sense to try to display just one snippet, preferably the most relevant one.

Snippet selection is handled by the IndexRecord record driver, the base class that handles display of all records retrieved from the Solr index. This class contains two arrays: $preferredSnippetFields, an array of fields that are very likely to have good snippet data and should be checked first, and $forbiddenSnippetFields, an array of fields with bad or redundant data that should never be considered for use as a snippet. By default, $preferredSnippetFields contains subject headings and table of contents entries, since these tend to offer valuable information, while $forbiddenSnippetFields contains author and title fields (unnecessary for snippets since they are always displayed elsewhere in the template), ID values (obviously uninformative) and the spelling field (a jumble of data duplicated from other fields, necessary for spell checking but misleading as a snippet). The getHighlightedSnippet method uses these arrays to pick a single best snippet, first checking the preferred fields and then taking the first available non-forbidden field if necessary. Since the method and its related arrays are all protected, it is possible to extend the IndexRecord class and create custom behavior as needed on a driver-by-driver basis.
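The two-pass priority described above can be sketched in Python; the field names mirror the description, but the code is illustrative rather than a translation of VuFind's PHP:

```python
# A sketch of the snippet-selection priority: preferred fields first, then
# any field that is not forbidden. Field names here are illustrative.

PREFERRED = ["topic", "contents"]            # e.g. subjects, table of contents
FORBIDDEN = {"title", "author", "id", "spelling"}

def pick_snippet(highlighting):
    """Pick a single best snippet from one record's highlighting data."""
    # First pass: fields known to make good snippets.
    for field in PREFERRED:
        if highlighting.get(field):
            return field, highlighting[field][0]
    # Second pass: take the first available field that isn't forbidden.
    for field, values in highlighting.items():
        if field not in FORBIDDEN and values:
            return field, values[0]
    return None, None
```

Returning the field name alongside the snippet makes it easy for a caller to look up a caption for it (see the [Snippet_Captions] discussion below).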

One further detail helps make things more clear: some snippets make little sense out of context, so searches.ini contains a [Snippet_Captions] section where Solr fields can be assigned labels that will be used as captions in front of snippets. Snippets for fields not listed in this section will display as stand-alone, uncaptioned lines in the search results.

Conclusions

Highlighting and snippets really aren’t too difficult to work with, but as with almost anything, they turn out to be a little more complicated than expected once you look at all of the details. I hope this post has helped point out the most obvious pitfalls and explain the reasoning behind VuFind’s implementation. There is still plenty more that could be done — some of the behavior could be made even smarter, and more of Solr’s power could be exposed through VuFind configuration settings. If you have ideas or questions, please feel free to share them as comments on this post or via the vufind-tech mailing list.


Last Modified: March 23, 2011