
Lucene matching and scoring



In Lucene/Solr parlance, several terms are commonly used to describe how a search engine matches and scores documents.
Below are some of the most common ones.


Term Frequency (TF): Measures how often a term occurs in a document.

Inverse Document Frequency (IDF): The inverse of how frequently a term occurs across all documents in the index.
IDF allows the search engine to reduce the importance of words like "of", "in", "to" etc. which occur very frequently in all documents but do not add much value to a search.
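As a rough illustration (not Lucene's exact code, and the exact formulas vary by version and similarity implementation), TF and IDF can be sketched in Java as follows. The class name, document counts and dampening functions (square root for TF, logarithm for IDF) are illustrative and mirror the classic TF-IDF scheme:

// A minimal sketch of classic TF-IDF style weighting (illustrative only).
public class TfIdfSketch {

    // Term frequency: how often the term occurs in the document.
    // A square root dampens the effect of very high raw counts.
    static double tf(int rawCountInDoc) {
        return Math.sqrt(rawCountInDoc);
    }

    // Inverse document frequency: terms that are rare across the whole index score higher.
    // Common words like "of", "in", "to" appear in almost every document,
    // so their IDF stays close to 1 and they contribute little to the score.
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;                                              // hypothetical index size
        System.out.println("rare term  : " + tf(3) * idf(numDocs, 50));       // high weight
        System.out.println("common term: " + tf(3) * idf(numDocs, 900_000));  // weight barely above tf alone
    }
}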

Length Norm: This is defined as the inverse of the square root of the number of terms in the field.
It makes search results more relevant by effectively taking into account what proportion of the field the search term makes up. For example, if the searched word occurs an equal number of times in a big book and a small book, the length norm makes the smaller book more relevant because the word forms a higher proportion of its text, as the sketch below illustrates.
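A minimal sketch of the idea, assuming the 1/sqrt(number-of-terms) form described above; the book sizes and term counts are hypothetical:

public class LengthNormSketch {

    // Length norm: inverse of the square root of the number of terms in the field.
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    public static void main(String[] args) {
        // The searched word occurs 10 times in both books,
        // but the smaller book gets the larger norm and hence the higher weight.
        System.out.println(10 * lengthNorm(500_000)); // big book,   ~0.014
        System.out.println(10 * lengthNorm(5_000));   // small book, ~0.141
    }
}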

Field Norm: This is defined in terms of document boost, field boost and length norm.
It is calculated at index time and costs an additional byte per field in the index. Document boost is applied to all the fields in the document and is just a shorthand for not having to specify a boost on each field individually.
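A sketch of how these pieces combine, assuming the field norm is simply the product of the three factors named above (the lossy one-byte encoding Lucene uses to store it is omitted, and the boost values are hypothetical):

// Field norm: document boost x field boost x length norm,
// computed at index time and stored with each field.
public class FieldNormSketch {

    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    static double fieldNorm(double docBoost, double fieldBoost, int numTermsInField) {
        return docBoost * fieldBoost * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Boosting the whole document is shorthand for boosting every field in it.
        System.out.println(fieldNorm(2.0, 1.5, 100)); // 2.0 * 1.5 * 0.1 = 0.3
    }
}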

The COORD Factor: Consider the expression A AND (B OR C).
In ordinary Boolean logic, this expression simply yields true or false. In the Solr/Lucene world it still evaluates to true or false, but it also produces a COORD factor: if all three terms A, B and C match, the COORD factor is 3/3; if only two match, it is 2/3; and so on. This factor gives more preference to documents that contain more of the search terms over documents that contain only a few of them.
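A minimal sketch of the idea, assuming COORD is simply matched-terms divided by total query terms; the query and documents are hypothetical:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// COORD factor: documents matching more of the query terms get a bigger multiplier.
public class CoordSketch {

    static double coord(Set<String> docTerms, List<String> queryTerms) {
        long overlap = queryTerms.stream().filter(docTerms::contains).count();
        return (double) overlap / queryTerms.size();
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("A", "B", "C");
        Set<String> doc1 = new HashSet<>(Arrays.asList("A", "B", "C")); // matches all three terms
        Set<String> doc2 = new HashSet<>(Arrays.asList("A", "C"));      // matches only two terms
        System.out.println(coord(doc1, query)); // 3/3 = 1.0
        System.out.println(coord(doc2, query)); // 2/3 ~ 0.67
    }
}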


Precision and Recall


Precision is a measure of how accurately the returned documents match the query.
If the returned result contains only relevant documents, precision is 100% (or 1.0).
If the returned result contains some irrelevant documents, precision is less than 100%.
Mathematically, precision is defined as:
number of relevant documents returned / total number of documents returned
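For example, if a query returns 10 documents and only 8 of them are relevant, precision is 0.8. A minimal sketch, with hypothetical counts:

// Precision: fraction of the returned documents that are actually relevant.
public class PrecisionSketch {

    static double precision(int relevantReturned, int totalReturned) {
        return (double) relevantReturned / totalReturned;
    }

    public static void main(String[] args) {
        System.out.println(precision(8, 10));  // 0.8 -> two irrelevant documents slipped in
        System.out.println(precision(10, 10)); // 1.0 -> only relevant documents were returned
    }
}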

Precision alone is not sufficient to determine search relevance because it only penalizes the score when irrelevant documents appear in the result set. It does not take into account the documents that were relevant to the search but did not make it into the result set. The measure for those documents is Recall (also called thoroughness).
Recall is defined as:
RR / (RR + RNR)
where RR = Number of relevant documents returned in the result.
RNR = Number of relevant documents not returned in the result.
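Continuing the hypothetical example above: if 8 relevant documents were returned but 4 other relevant documents were missed, recall is 8 / (8 + 4), or about 0.67. A minimal sketch:

// Recall: fraction of all relevant documents that actually made it into the result set.
public class RecallSketch {

    static double recall(int relevantReturned, int relevantNotReturned) {
        return (double) relevantReturned / (relevantReturned + relevantNotReturned);
    }

    public static void main(String[] args) {
        System.out.println(recall(8, 4)); // ~0.67 -> four relevant documents were missed
        System.out.println(recall(8, 0)); // 1.0  -> every relevant document was returned
    }
}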


In summary, precision does not care whether all relevant results are returned, while recall does not care whether all returned results are relevant. Together they balance the measurement of a search engine's querying capabilities. A search engine is good if it scores high on both parameters, and this is difficult because the two usually work against each other: when one goes up, it typically does so at the cost of the other. Most applications aim for high precision in the first few results (say, the top 20-30) and then aim for higher recall to make the search exhaustive.









