2014-01-23 09:39:37

Apache SOLR 4.x and IKanalyzer to Search Drupal in Chinese

For a certain project, the client wanted to be able to search on a single Chinese character, and the standard Drupal search failed to come up with acceptable results, even with simple CJK handling. Understandable, as Drupal will split up the text so that every character becomes its own search word and search record.

[Image: IKanalyzer loaded into Apache SOLR]

To give you an example, searching for "河內道" would give results for "河", "內", and "道" separately.

The first option I considered was just going with a standard SOLR search. After setting one up and testing, the results were also not as expected. Searching for "河內道" would return only records with "河內道", but searching for "河" would not return any records. Close, but no banana.

Looking back at the quick SOLR setup, it makes sense that this would not work. Tokenizing and stemming English is completely different from handling Chinese, which has no spaces between words. There are quite a few plugins available, and the most attractive one seemed to be IKanalyzer. Unfortunately for me, all information and documentation is only available in Chinese. Could have seen that one coming.

Installing IKanalyzer for SOLR 4.x

Let's get to the point of setting up IKanalyzer. Start by downloading the latest version from https://code.google.com/p/ik-analyzer/downloads/list (for me, and probably for you too, this was IK Analyzer 2012FF_hf1.zip).

Extract the files IKAnalyzer.cfg.xml, IKAnalyzer2012FF_u1.jar, and stopword.dic to the lib directory of your SOLR instance. In my case SOLR was running at /data/client/, meaning the files would go in /data/client/collection1/lib/.

Up next, edit your solrconfig.xml to tell your SOLR instance where to look for plugins. This is usually done by finding the line <lib dir="./lib" /> and uncommenting it, or by simply adding it to your solrconfig.xml.
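As a rough sketch of where that line sits, assuming the usual SOLR 4.x layout in which relative paths in solrconfig.xml are resolved against the core's instance directory (here /data/client/collection1/), the relevant part of the file would look something like this:

```xml
<config>
  <!-- Relative to the core's instance directory, so with the setup
       described above this resolves to /data/client/collection1/lib/ -->
  <lib dir="./lib" />

  <!-- ... the rest of your existing solrconfig.xml stays unchanged ... -->
</config>
```

If your jar lives somewhere else, point dir at that directory instead.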

Last item on the list is loading the analyzer for the fields you wish to use it on. For this project, it was only the text field type (solr.TextField). Open up schema.xml and add the definition seen below. Make sure this is the only occurrence of <fieldType name="text" class="solr.TextField"> or you are going to have a bad time.
<fieldType name="text" class="solr.TextField">
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
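The analyzer only has an effect on fields that actually use the text type. A minimal sketch of such a field in schema.xml follows; the field name content is purely an assumption for illustration, so check which fields your own schema.xml already declares with type="text" before adding anything:

```xml
<fields>
  <!-- Hypothetical field name. Any field declared with type="text"
       is analyzed by the fieldType named "text" defined above. -->
  <field name="content" type="text" indexed="true" stored="true"/>
</fields>
```

If your schema already has fields of type text, there is nothing to add here; they pick up the new analyzer after a restart and re-index.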

And that is it: restart your SOLR, re-index your content, and you are good to go! For more information on the analyzer plugin IKanalyzer see https://code.google.com/p/ik-analyzer/ and to read more about analyzers in general see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Chinese Stopwords

SOLR already comes with a default list of stopwords for the English language, but you might want to have one for your Chinese search too. Check out the list provided by Kevin Bouge at https://sites.google.com/site/kevinbouge/stopwords-lists for the stopwords file. In case the site goes down, you may also download it from my hosting: stopwords_zh
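One way to hook such a list in is through IKAnalyzer.cfg.xml. The sketch below assumes the file follows the Java properties XML layout and that the ext_stopwords entry accepts a semicolon-separated list of dictionary files, which is how the 2012 releases appear to be set up; verify this against the file shipped in your download. The file name chinese_stopwords.dic is hypothetical and stands for the downloaded list saved next to stopword.dic:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- Assumed key: ext_stopwords lists stopword dictionaries,
       separated by semicolons. chinese_stopwords.dic is a
       hypothetical name for the downloaded Chinese stopword list. -->
  <entry key="ext_stopwords">stopword.dic;chinese_stopwords.dic;</entry>
</properties>
```

The dictionary itself is presumably expected to be a UTF-8 text file with one stopword per line. As with the other changes, restart SOLR and re-index afterwards.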

The Results

Results in general seemed to make much more sense. Searching for "河內道" now returns articles actually matching "河內道", and searching for single characters is also working. When you enter "河 內道" (with a space), it also matches properly on both words, returning results that match both "河" and "內道" before results that contain only one of those words.