
Lucene and Hyphens

16 Dec

I have used Apache Lucene to provide intelligent, intuitive search in my products many times over.  Well, that was until yesterday! No, I did not ditch Lucene.  But what threw a monkey wrench into things was the ‘intuitive’ nature of the product.  Let me explain.

Say I have the string ‘abc-def’ to be indexed.  The way Lucene works (well, actually the StandardAnalyzer, which delegates the tokenizing to StandardTokenizer) is to split ‘abc-def’ into ‘abc’ and ‘def’ and index the two words separately.  The problem is that when someone then searches for “abc-def”, Lucene says there are no matching records.  This, even to us defenders of engineering dignity, was plainly unacceptable.  Not much digging later, I saw that this is exactly how StandardAnalyzer is designed to work.  No problem, I said.  One of the perks of being a heavily open source shop, aware of the GPL vs. LGPL pitfalls, is knowing we can tweak the platforms we use to make them work the way we want.  So the plan was to modify the StandardAnalyzer to not tokenize strings containing hyphens (‘-’).
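
To see the splitting for yourself, here is a minimal sketch against the Lucene 3.0-era API (the version constant, class name, and field name are just illustrative) that prints the tokens StandardAnalyzer emits:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class HyphenTokenDemo {
	public static void main(String[] args) throws Exception {
		StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		// Tokenize the problematic string the same way indexing would.
		TokenStream stream = analyzer.tokenStream("body", new StringReader("abc-def"));
		TermAttribute term = stream.addAttribute(TermAttribute.class);
		while (stream.incrementToken()) {
			System.out.println(term.term());
		}
	}
}

With the stock analyzer this prints ‘abc’ and ‘def’ on separate lines, which is exactly why the search for the hyphenated term finds nothing.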

Now, the challenge was that the bug was reported at 6 PM sharp, right after I had tucked in my laptop and told my wife that I would pick up the pasta on my way home.   The techie hubris got the better of me.  Out came the laptop, but not the power cord, since I figured it was below my dignity to work beyond whatever charge was left on the machine for such a trivial fix. Big mistake – never underestimate a problem, never assume while debugging.

Shortly after I started to dig in, I realized it would take more than Java to fix this tiny bug.  StandardAnalyzer under the hood delegates most of the heavy lifting to StandardTokenizer – which, to my shock, was a generated Java file.  The source was a .jflex grammar file, written for a generator unknown to me until yesterday – JFlex.  As we speak, we are the best of buddies.  Alright then, out came the power cord and I got down to RTFM-ing JFlex.  An hour and a few test cases later, I was reasonably confident I could perform invasive surgery on StandardTokenizerImpl.jflex.  Or so I thought.

Though my micro test cases around JFlex worked well, my main test case on the product code base refused to budge.  It was adamant that tokenizing happened at hyphens, irrespective of what I wrote in the JFlex grammar.  By 11 PM I accepted defeat and drove home.

There is wisdom in whoever said that sometimes the best solution to a hard problem is sleep.  This morning I sat down to tackle the gorilla, and saw that the problem had been hiding in plain sight.  The nice folks on the Lucene development team had left a note in the same package where they bundled StandardTokenizerImpl.jflex.  It said, for the love of goo, use JDK 1.4 to run JFlex to generate the code.  Wait! 1.4?? 1.4?? A decade-old JDK?  ‘Yes’, it said.  So I went and got hold of JDK 1.4 (that took the JDK count on my notebook to 5).

WARNING: if you change StandardTokenizerImpl.jflex and need to regenerate
the tokenizer, only use Java 1.4 !!!
This grammar currently uses constructs (eg :digit:, :letter:) whose
meaning can vary according to the JRE used to run jflex. See
https://issues.apache.org/jira/browse/LUCENE-1126 for details.
For current backwards compatibility it is needed to support
only Java 1.4 – this will change in Lucene 3.1.

Original:

// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = (
             {HAS_DIGIT} ({P} {ALPHANUM})+
           | {ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

Modified:

// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = (
             {ALPHANUM} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM})+
           | {ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
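
With the grammar changed, the tokenizer has to be regenerated.  Here is a hedged sketch of kicking the generator off from Java rather than the command line (it assumes the 1.4-era JFlex build, whose entry point was the JFlex.Main class, is on the classpath; the grammar file name is just an example) – and remember, per the warning above, run it under JDK 1.4:

public class GenerateTokenizer {
	public static void main(String[] args) {
		// Run under JDK 1.4, per the warning bundled with the grammar.
		// Assumes the 1.4-era JFlex jar is on the classpath; the grammar
		// file name below is illustrative.
		JFlex.Main.main(new String[] { "MyTokenizer.jflex" });
	}
}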

The custom analyzer code is below. MyTokenizer.java is the Java file generated when the modified .jflex grammar is run through JFlex.  The two new alternatives at the top of NUM let an all-letter string like ‘abc-def’ match the rule, so it survives as a single token instead of being split at the hyphen.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public final class MyCustomAnalyzer extends Analyzer {

	private final Version luceneVersion;
	private final StandardAnalyzer standard;

	public MyCustomAnalyzer(Version version) {
		this.luceneVersion = version;
		// Delegate to a StandardAnalyzer whose tokenizer is swapped for
		// MyTokenizer, the class generated from the modified .jflex grammar.
		this.standard = new StandardAnalyzer(luceneVersion) {
			@Override
			public TokenStream tokenStream(String fieldName, Reader reader) {
				MyTokenizer tokenizer = new MyTokenizer(luceneVersion, reader);
				tokenizer.setMaxTokenLength(getMaxTokenLength());
				// Keep the rest of StandardAnalyzer's filter chain intact.
				TokenStream result = new StandardFilter(tokenizer);
				result = new LowerCaseFilter(result);
				result = new StopFilter(
						StopFilter.getEnablePositionIncrementsVersionDefault(luceneVersion),
						result, StandardAnalyzer.STOP_WORDS_SET);
				return result;
			}
		};
	}

	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		return standard.tokenStream(fieldName, reader);
	}

	@Override
	public TokenStream reusableTokenStream(String fieldName, Reader reader)
			throws IOException {
		// StandardAnalyzer's reusable path would rebuild its own stock
		// StandardTokenizer and bypass MyTokenizer, so route through the
		// overridden tokenStream() instead.
		return standard.tokenStream(fieldName, reader);
	}
}
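
To close the loop, here is a small, self-contained sketch of the analyzer in action against the same 3.0-era API (the field name, sample text, and in-memory RAMDirectory are all illustrative, and MyTokenizer must already be generated and compiled).  Index and search with the same analyzer, and the hyphenated term comes back:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class HyphenSearchDemo {
	public static void main(String[] args) throws Exception {
		RAMDirectory dir = new RAMDirectory();
		MyCustomAnalyzer analyzer = new MyCustomAnalyzer(Version.LUCENE_30);

		// Index a document containing the hyphenated term.
		IndexWriter writer = new IndexWriter(dir, analyzer,
				IndexWriter.MaxFieldLength.UNLIMITED);
		Document doc = new Document();
		doc.add(new Field("body", "abc-def", Field.Store.YES, Field.Index.ANALYZED));
		writer.addDocument(doc);
		writer.close();

		// Search with the same analyzer; ‘abc-def’ is now a single token.
		IndexSearcher searcher = new IndexSearcher(dir);
		Query query = new QueryParser(Version.LUCENE_30, "body", analyzer)
				.parse("\"abc-def\"");
		TopDocs hits = searcher.search(query, 10);
		System.out.println("hits: " + hits.totalHits); // 1 with the custom analyzer, not 0
		searcher.close();
	}
}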

Miracle!  It worked as I wanted.  Exactly as I wanted.  My test cases passed with flying colors and I was grinning from ear to ear.  Great!

So, for the unlucky or the intrigued who are left to write a custom analyzer to introduce grammar changes in tokenizing – be prepared to befriend JFlex.  Believe me, though, it is a nice tool, and I can already see a few places in parsing where my team would love to put those muscles to work.

Ciao!