Make delicious recipes!

Why Solr (Search engine vs DB)



Suppose you are Google News and want to provide a search box for searching news stories.
Assume that every news has a title and the search is considered successful if the title of found news matches as closely as possible to the searched phrase.

If a traditional database was used to store news, it would contain a column for title.
Imagine what would happen if this database was searched with a phrase such as Trend of home prices in California.
Some DB queries written to find this phrase would be:
Select * from news where title = 'Trend of home prices in California';
Select * from news where title like '%Trend of home prices in California%';
Select * from news where title like '%Trend*home*prices*California%';
Select * from news where title like '%Trend%' and title like '%home%' and title like '%prices%' and title like '%California%';

All the queries above are way too restrictive and will ignore several news with slight word modifications or change of order.
For example, if a news contains the words 'house' instead of home or 'change' instead of 'trend', it will be left out.
This brings to the following conclusion:
  1. Phrase/word variations like synonyms, tense of verbs and singular/plural forms are all missing from this search.

  2. The matched documents also have no concept of relevance, for example: News stories having just the word 'prices' or the word 'home' are equally likely to appear at the top as those stories with full phrase 'Trend of home prices in California'.

  3. Like queries have a very bad reputation for performance and would kill performance if used for search.
These drawbacks make SQL or relational DB less than ideal for using as a search engine.


How does a search engine like Solr solve this problem?

A search engine solves this problem by building an inverted index.
It breaks down the input document of words into individual words, finds synonyms, stemmed versions and singular/plurals for each word and maps each one of these words to the containing document.
With this structure in place, it becomes very easy to search.
When a phrase is input for search, Solr breaks it down in a similar manner: word splitting, stemming and synonym searching and then searches all those words in the inverted index.
For each word, it gets a list of document IDs.
To build a relevancy score, intersections of these lists are found with each other and the documents containing the most amount of words are ranked higher. Breaking up the search phrase is very light weight since the phrase is small and then looking up the words is also fast because its a hash-map like lookup. Intersections of lists is also very fast since they are mostly bitwise operations on large bit-vectors.
All in all, this operation yields much better results and in much lesser time.


Why the term inverted index?

The index is called inverted in search engines because it is the inverse of relational databases where a document ID is mapped to the content. In an inverted index, content is mapped to the document ID, hence the term "inverted".


Example of an "Inverted Index"

Imagine following 3 documents (news title in the above example):
Document 1: Foo has cat.
Document 2: Bar is good.
Document 3: Foo has good cat.

Inverted index for them would be:
word document-ID
bar 2
cat 1, 3
foo 1, 3
good 2, 3
has 1, 3
is 2
Note: The index is kept sorted always, so it helps to make range queries much faster.

Now if a search for cat is made, a single lookup in the above index would give documents 1 and 3
For a search like:
Get all documents having words between foo and has
the search engine would search for the two words in O(logN) time and then return the matching set as 1, 2 and 3.
Such a range search was not possible without sorting the index. So the search engine always keeps the index sorted to speed up such queries.


Phrase matching with Term Positions

What if the search is made for a phrase like: has cat.
Lucene/Solr have a feature called Term Positions. If that is ON, then the index also contains the position number of the term.
So, with that information, our index above would become:
Document 1: Foo has cat.
Document 2: Bar is good.
Document 3: Foo has good cat.

word document-ID
bar 2 (index=0)
cat 1(index=2), 3(index=3)
foo 1(index=0), 3(index=0)
good 2(index=2), 3(index=2)
has 1(index=1), 3(index=1)
is 2(index=1)

With term positions stored in the index, phrase matching is a breeze.
Lucene searches the individual words as usual and then does an intersection to get those documents having all the words in the phrase. Then it matches the indexes to make sure the matched documents actually have the phrase. For the above documents and search phrase of has cat, this would look like:
Doc search:
   has: 1(index=1), 3(index=1)
   cat: 1(index=2), 3(index=3)
Intersection:
   Docs 1 and 3 have both the words.
Match index orders:
   Only doc 1 has both the words in succession (index 1 and 2).
   Doc 3 does not have the words in succession (index 1 and 3).
   Hence doc 1 is the result of the phrase search.

Note: The term positions stored with each word/doc combination need not be a single integer. It can be a list of integers if the same word is expected to occur multiple times in a document. In such a case, a list of all occurrences of the word in a document is stored in the term-position index and is called a term-vector

The inverted index, sorted by terms and storing term-vectors can be used to perform:
  1. Wild-card search like "school*" will first get all words from index like "schools", "schooling" etc. and then search.

  2. Fuzzy search like "school~" accounts for one-character spelling mistakes by humans and will match "shcool", "schol", "schoo" etc. If more fuzzy searching is required, a numerical index can be used to control that. For example: "school~2" will match two character spelling mistakes and "school~N" will match N character mistakes.

  3. Proximity search: Example: "nice food"~1 will match "nice healthy food" as well as "nice tasty food". This is done by relaxing the term-position matching as requested by the fuzz-index.







Like us on Facebook to remain in touch
with the latest in technology and tutorials!


Got a thought to share or found a
bug in the code?
We'd love to hear from you:

Name:
Email: (Your email is not shared with anybody)
Comment:

Facebook comments:

Site Owner: Sachin Goyal