Suppose you are Google News and want to provide a search box for searching news stories.
Assume that every news story has a title, and the search is considered successful if the title of a found story matches
the searched phrase as closely as possible.
If a traditional relational database were used to store the news, it would contain a column for the title.
Imagine what would happen if this database were searched with a phrase such as:
Trend of home prices in California.
Some DB queries written to find this phrase might be:
Select * from news where title = 'Trend of home prices in California';
Select * from news where title like '%Trend of home prices in California%';
Select * from news where title like '%Trend%home%prices%California%';
Select * from news where title like '%Trend%' and title like '%home%' and title like '%prices%' and title like '%California%';
All the queries above are far too restrictive and will miss many news stories with slight word changes or a different word order.
For example, if a story contains the word 'house' instead of 'home', or 'change' instead of 'trend', it will be left out.
This brings us to the following conclusions:
- Phrase/word variations such as synonyms, verb tenses and singular/plural forms are all missing from this search.
- The matched documents have no concept of relevance: news stories containing just the word 'prices' or the word 'home' are as likely to appear at the top as stories containing the full phrase 'Trend of home prices in California'.
- LIKE queries have a very bad reputation for performance and would cripple the database if used for search.
These drawbacks make a relational database less than ideal for use as a search engine.
How does a search engine like Solr solve this problem?
A search engine solves this problem by building an inverted index.
It breaks the input document down into individual words, finds synonyms, stemmed versions and singular/plural forms
for each word, and maps each of these words back to the containing document.
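As a rough sketch of the idea in Python (the function name and sample titles are ours, not Solr's API, and the analysis here is simplified to lowercasing and whitespace splitting; a real analyzer also applies stemming and synonym expansion):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplified analysis: strip periods, lowercase, split on whitespace.
        for word in text.lower().replace(".", "").split():
            index[word].add(doc_id)
    # Keep the terms and the posting lists sorted.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {
    1: "Home prices rise in California",
    2: "California weather report",
}
index = build_inverted_index(docs)
# index["california"] -> [1, 2]; index["home"] -> [1]
```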
With this structure in place, it becomes very easy to search.
When a phrase is input for search, Solr breaks it down in a similar manner: word splitting, stemming and synonym
expansion, and then looks up each of those words in the inverted index.
For each word, it gets a list of document IDs.
To build a relevancy score, these lists are intersected with each other, and the documents containing the
most words are ranked higher. Breaking up the search phrase is very lightweight since the phrase is small,
looking up the words is also fast because it's a hash-map-like lookup, and intersecting the lists is also very
fast since these are mostly bitwise operations on large bit-vectors.
All in all, this approach yields much better results in much less time.
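The intersect-and-count idea can be sketched as follows (the posting lists here are made up for illustration; real engines use TF-IDF/BM25-style scoring and compressed bit-vectors rather than this naive counting):

```python
# Hypothetical posting lists: term -> set of document IDs containing it.
postings = {
    "trend":      {1, 4},
    "home":       {1, 2, 4},
    "prices":     {1, 3, 4},
    "california": {4},
}

def rank(query_words, postings):
    """Score each document by how many of the query words it contains."""
    scores = {}
    for word in query_words:
        for doc_id in postings.get(word, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Documents matching the most query words come first.
    return sorted(scores, key=scores.get, reverse=True)

result = rank(["trend", "home", "prices", "california"], postings)
# Doc 4 contains all four words and ranks first, doc 1 (three words) second.
```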
Why the term inverted index?
The index is called inverted in search engines because it is the inverse of a relational database, where a document
ID is mapped to its content. In an inverted index, content is mapped to the document ID, hence the term "inverted".
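The two layouts can be contrasted side by side (the data is hypothetical):

```python
# Forward index, as in a relational table: document ID -> content.
forward_index = {
    1: "Foo has cat",
    2: "Bar is good",
}

# Inverted index: term -> IDs of the documents containing it.
inverted_index = {
    "bar": [2], "cat": [1], "foo": [1],
    "good": [2], "has": [1], "is": [2],
}

# The forward index answers "what is in document 1?";
# the inverted index answers "which documents contain 'cat'?".
```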
Example of an "Inverted Index"
Imagine the following 3 documents (news titles in the above example):

Document 1: Foo has cat.
Document 2: Bar is good.
Document 3: Foo has good cat.
The inverted index for them (with terms lowercased during analysis) would be:

    bar  -> 2
    cat  -> 1, 3
    foo  -> 1, 3
    good -> 2, 3
    has  -> 1, 3
    is   -> 2

Note: The index is always kept sorted by term, which helps make range queries much faster.
Now if a search for cat is made, a single lookup in the above index gives documents 1 and 3.
For a search like "get all documents having words between foo and has",
the search engine finds the two boundary words in O(log N) time and then returns the matching set: 1, 2 and 3.
Such a range search was not possible without sorting the index. So the search engine always keeps the index sorted to
speed up such queries.
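A minimal sketch of such a range lookup over the sorted term list, using binary search (Python's bisect module; the index data mirrors the example above):

```python
import bisect

# The inverted index from the example above, with terms kept sorted.
index = {
    "bar":  [2],
    "cat":  [1, 3],
    "foo":  [1, 3],
    "good": [2, 3],
    "has":  [1, 3],
    "is":   [2],
}
terms = sorted(index)  # ['bar', 'cat', 'foo', 'good', 'has', 'is']

def range_search(lo, hi):
    """Return IDs of all documents containing any term between lo and hi."""
    # Both boundaries are located in O(log N) because terms are sorted.
    start = bisect.bisect_left(terms, lo)
    end = bisect.bisect_right(terms, hi)
    docs = set()
    for term in terms[start:end]:
        docs.update(index[term])
    return sorted(docs)

# Words between "foo" and "has" are foo, good, has -> documents 1, 2 and 3.
```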
Phrase matching with Term Positions
What if the search is made for a phrase like: has cat?
Lucene/Solr has a feature called Term Positions. If it is turned on, the index also
contains the position of each term within its document.
So, with that information, our index above would become:

    bar  -> 2(index=0)
    cat  -> 1(index=2), 3(index=3)
    foo  -> 1(index=0), 3(index=0)
    good -> 2(index=2), 3(index=2)
    has  -> 1(index=1), 3(index=1)
    is   -> 2(index=1)
With term positions stored in the index, phrase matching is a breeze.
Lucene searches the individual words as usual and then intersects the results to get the documents containing all the
words in the phrase. It then compares the term positions to make sure the matched documents actually contain the phrase.
For the above documents and the search phrase has cat, this looks like:

Doc search:
    has: 1(index=1), 3(index=1)
    cat: 1(index=2), 3(index=3)

Intersection:
    Docs 1 and 3 have both the words.

Match index order:
    Only doc 1 has the words in succession (indexes 1 and 2).
    Doc 3 does not have the words in succession (indexes 1 and 3).

Hence doc 1 is the result of the phrase search.
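The walk-through above can be sketched in Python (the data mirrors the example; the function is ours, not Lucene's API, and it ignores analysis details like lowercasing multi-word phrases against stemmed terms):

```python
# Term-position index: term -> {doc_id: [positions of the term in that doc]}.
positions = {
    "bar":  {2: [0]},
    "cat":  {1: [2], 3: [3]},
    "foo":  {1: [0], 3: [0]},
    "good": {2: [2], 3: [2]},
    "has":  {1: [1], 3: [1]},
    "is":   {2: [1]},
}

def phrase_search(phrase, positions):
    words = phrase.lower().split()
    # Step 1: intersect - keep only docs containing every word of the phrase.
    docs = set(positions.get(words[0], {}))
    for w in words[1:]:
        docs &= set(positions.get(w, {}))
    # Step 2: keep only docs where the positions line up consecutively.
    hits = []
    for d in docs:
        for start in positions[words[0]][d]:
            if all(start + i in positions[w][d] for i, w in enumerate(words)):
                hits.append(d)
                break
    return sorted(hits)

# phrase_search("has cat", positions) -> [1]
```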
Note: The term positions stored with each word/doc combination need not be a single
integer. It can be a list of integers if the same word occurs multiple times in a document. In such a case,
the list of all occurrences of the word in a document is stored in the term-position index and is called a term-vector.

The inverted index, sorted by terms and storing term-vectors, can be used to perform:
- Wildcard search: "school*" first gets all matching words from the index, like "schools",
"schooling" etc., and then searches for them.
- Fuzzy search: "school~" accounts for one-character spelling mistakes by humans and
will match "shcool", "schol", "schoo" etc. If fuzzier matching is required,
a numeric parameter can control it: "school~2" will match two-character
spelling mistakes and "school~N" will match N-character mistakes.
- Proximity search: "nice food"~1 will match "nice healthy food" as well
as "nice tasty food". This is done by relaxing the term-position matching by the requested distance.
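As a rough sketch of how the "school~N" style of matching can be realised, one can filter the index's term list by edit distance (this is only to illustrate the idea: Lucene builds efficient automata instead of scanning terms, and counts a transposition such as "shcool" as a single edit, whereas the plain Levenshtein distance below counts it as two):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # Deletion, insertion, or substitution (free if chars match).
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_terms(query, terms, max_edits=1):
    """Return the indexed terms within max_edits edits of the query word."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]

terms = ["shcool", "schol", "schools", "scholar"]
# fuzzy_terms("school", terms, 1) -> ['schol', 'schools']
# fuzzy_terms("school", terms, 2) also admits 'shcool' (two edits here).
```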
Got a thought to share or found a bug in the code? We'd love to hear from you: