Sitecore + Lucene's Search Algorithm
Bhavik Patel • 6/20/2017
As many of you know, Sitecore’s Content Search API does a great job of hiding the complexities of Lucene, giving implementers a quick way to provide search functionality to their end users when building Sitecore websites. Wrapping search in this way also allows Sitecore to swap the underlying provider for Solr, Coveo, or Azure Search without requiring implementers to change their search code.
That, however, is not the focus of this article. Many of our customers have asked how Lucene works “out-of-the-box”. It’s a very broad question with no simple answer, so I thought I would dig a little deeper and provide some insight into how Lucene indexes content and how it scores the indexed documents when a search is performed. Note that the Lucene provider you get with Sitecore is not customized in any way; it’s plain old Lucene. Sitecore has simply wrapped it in their Content Search API.
Lucene scores search results using a variant of the TF-IDF algorithm, where TF stands for term frequency and IDF for inverse document frequency. If you are interested, the detailed documentation can be found here: https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
For all the examples below, please assume that we are performing an exact search with matches on a single field in Lucene’s index.
At the highest level, TF means that Lucene ranks a document higher the more often the search term occurs in it. For example, if two documents are returned for “Energy”, the one with more occurrences of “Energy” in the content being searched will be ranked higher.
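To make the TF idea concrete, here is a minimal Python sketch. It assumes the square-root damping used by Lucene's classic TF-IDF similarity, so ten occurrences do not count ten times as much as one. The function names and sample documents are hypothetical and exist only for illustration; this is not Sitecore or Lucene code.

```python
import math


def count_occurrences(text: str, term: str) -> int:
    """Count case-insensitive whole-word occurrences of term in text."""
    tokens = text.lower().split()
    return tokens.count(term.lower())


def term_frequency_score(freq: int) -> float:
    """TF factor: grows with the raw count, but only as its square root
    (an assumption modeled on Lucene's classic similarity)."""
    return math.sqrt(freq)


# Two hypothetical documents that both match "Energy".
doc_a = "Energy prices rose as energy demand grew"
doc_b = "Energy prices rose last year"

score_a = term_frequency_score(count_occurrences(doc_a, "Energy"))
score_b = term_frequency_score(count_occurrences(doc_b, "Energy"))

# doc_a mentions the term twice, doc_b once, so doc_a gets the higher TF factor.
print(score_a > score_b)
```

The damping is a design choice: it still rewards repetition, but with diminishing returns.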
IDF means that the more documents across the entire index contain a search term, the less that term contributes to a document’s score (the idea is that common terms are less important than uncommon ones). This matters mainly when searching on multiple terms. For example, when searching for “Energy Savings”, if “Energy” appears in most documents in the index but “Savings” appears in only a few, a match on “Savings” counts for more than a match on “Energy”, so documents containing “Savings” are prioritized.
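The IDF effect can be sketched the same way. The snippet below assumes the formula 1 + ln((docCount + 1) / (docFreq + 1)), which is how I understand the classic similarity in the Lucene version linked above to compute it; older versions differ slightly. The tiny four-document index and the helper names are made up for illustration.

```python
import math


def document_frequency(term: str, docs: list[str]) -> int:
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc.lower().split())


def inverse_document_frequency(doc_freq: int, doc_count: int) -> float:
    """IDF factor: the rarer a term is across the index, the larger the value
    (formula assumed from Lucene's classic similarity)."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical index: "energy" is in every document, "savings" in only one.
index = [
    "energy costs rise",
    "energy savings tips",
    "solar energy news",
    "energy policy update",
]

idf_energy = inverse_document_frequency(document_frequency("energy", index), len(index))
idf_savings = inverse_document_frequency(document_frequency("savings", index), len(index))

# The rare term carries more weight than the common one.
print(idf_savings > idf_energy)
```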
In addition to TF and IDF, Lucene applies some prioritization logic of its own. Specifically, when multiple search terms are used, documents that match more of those terms are ranked higher than documents that match fewer. Also, a match in a document with fewer terms overall is prioritized over a match in a document with more terms overall.
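These two extra factors correspond to what the TFIDFSimilarity documentation linked above calls coord and lengthNorm. A rough Python sketch, assuming the classic definitions (the fraction of query terms matched, and one over the square root of the number of terms in the field), could look like this. The function names are illustrative, not Lucene's API.

```python
import math


def coord(matched_terms: int, query_terms: int) -> float:
    """Coordination factor: fraction of the query terms found in the document."""
    return matched_terms / query_terms


def length_norm(num_terms: int) -> float:
    """Length normalization: shorter fields get a larger factor
    (assumed to be 1 / sqrt(number of terms), as in the classic similarity)."""
    return 1.0 / math.sqrt(num_terms)


# A document matching both terms of a two-term query beats one matching only one.
print(coord(2, 2) > coord(1, 2))

# A match in a 4-term field outweighs the same match in a 100-term field.
print(length_norm(4) > length_norm(100))
```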
After all of the above has been evaluated, any index-time and query-time boosting is factored in to prioritize or deprioritize documents further.
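Putting the pieces together, the sketch below shows how these factors might combine into a single score. It assumes the general shape of the classic formula, coord × Σ(tf × idf² × term boost × norm), with the index-time boost folded into the norm. It leaves out the query normalization factor, which does not change the ordering of results. All names, parameters, and sample data are hypothetical.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, doc_freqs, doc_count,
          term_boosts=None, field_boost=1.0):
    """Simplified TF-IDF score for one document against a multi-term query."""
    term_boosts = term_boosts or {}          # query-time boosts, default 1.0
    counts = Counter(doc_tokens)
    matched = [t for t in query_terms if counts[t] > 0]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)  # share of query terms matched
    norm = field_boost / math.sqrt(len(doc_tokens))  # index-time boost + length norm
    total = 0.0
    for term in matched:
        tf = math.sqrt(counts[term])
        idf = 1.0 + math.log((doc_count + 1) / (doc_freqs.get(term, 0) + 1))
        total += tf * idf ** 2 * term_boosts.get(term, 1.0) * norm
    return coord * total


# Hypothetical two-document index.
docs = {
    "a": "energy savings for home energy use".split(),
    "b": "energy news today".split(),
}
doc_freqs = Counter(t for tokens in docs.values() for t in set(tokens))
query = ["energy", "savings"]

base_a = score(query, docs["a"], doc_freqs, len(docs))
base_b = score(query, docs["b"], doc_freqs, len(docs))
boosted_b = score(query, docs["b"], doc_freqs, len(docs), field_boost=2.0)

print(base_a > base_b)      # "a" matches both terms, so it ranks first
print(boosted_b > base_b)   # an index-time boost lifts "b"
```

Because the boosts are plain multipliers, a large enough boost can override the ranking that TF and IDF alone would produce, which is worth keeping in mind when tuning.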