Solr is a high-performance search engine built on top of the Apache Lucene library.
It has several out-of-the-box features that can be up and running quickly:
Auto-completion
Spell-checker
Hit highlighting
Geo spatial indexing (searching and ranking documents by latitude/longitude)
Indexing of geographical shapes such as polygons
Integration with Apache Tika to support various document formats such as PDF and Word
Optimistic locking out of the box (concurrency control using compare-and-swap on a per-document version)
Durable writes - a document is available in real time for get requests even before it is indexed.
This is achieved by placing a transaction log between the client and the Lucene index;
get requests are serviced from this log, making gets real-time even though the document is not yet indexed.
Automatic sharding and replication with Apache ZooKeeper
Since version 4.0, Solr provides a cloud mode out of the box that takes care of sharding, replication,
and load balancing, and scales linearly. This is referred to as SolrCloud.
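To make the geo-spatial feature above concrete, here is a small sketch that builds the query parameters for a radius search using Solr's standard {!geofilt} query parser. The field name "store_location" and the core name "places" are assumptions for illustration; geofilt, sfield, pt, d, and geodist are real Solr spatial parameters and functions.

```python
# Build Solr query parameters for "documents within radius_km of a
# lat/lon point", using the {!geofilt} parser and geodist() for
# distance-based ranking. Field/core names are illustrative.
from urllib.parse import urlencode

def geo_query_params(lat, lon, radius_km, field="store_location"):
    return {
        "q": "*:*",
        # filter to documents within radius_km kilometers of (lat, lon)
        "fq": f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}",
        # rank nearest documents first
        "sort": f"geodist({field},{lat},{lon}) asc",
    }

params = geo_query_params(45.15, -93.85, 5)
url = "http://localhost:8983/solr/places/select?" + urlencode(params)
```

Sending the resulting URL to a running Solr core with a spatial field would return the matching documents nearest-first.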
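The optimistic-locking feature listed above can be sketched as a toy in-memory model. Real Solr tracks a _version_ field per document: an update that supplies the version it last read succeeds only if the stored version still matches (compare-and-swap), otherwise Solr rejects it with an HTTP 409 conflict. The class and method names below are illustrative, not Solr APIs.

```python
# Toy model of optimistic concurrency control in the style of Solr's
# _version_ field: writes are applied only when the caller's expected
# version matches the stored version (compare-and-swap).

class VersionConflict(Exception):
    """Stands in for Solr's HTTP 409 conflict response."""

class ToyVersionedStore:
    def __init__(self):
        self.docs = {}          # doc id -> (version, document)
        self._next_version = 1  # monotonically increasing version counter

    def get(self, doc_id):
        # returns (version, document) or None if absent
        return self.docs.get(doc_id)

    def update(self, doc_id, doc, expected_version):
        # compare-and-swap: apply the write only if expected_version
        # matches what is currently stored (0 means "must not exist yet")
        current = self.docs.get(doc_id)
        stored_version = current[0] if current else 0
        if expected_version != stored_version:
            raise VersionConflict(f"expected {expected_version}, have {stored_version}")
        self.docs[doc_id] = (self._next_version, doc)
        self._next_version += 1

store = ToyVersionedStore()
store.update("1", {"title": "v1"}, expected_version=0)        # create
version, _ = store.get("1")
store.update("1", {"title": "v2"}, expected_version=version)  # CAS succeeds
try:
    store.update("1", {"title": "stale"}, expected_version=version)
except VersionConflict:
    pass  # a concurrent writer got there first, as intended
```

The conflict exception plays the role of the 409 response a client would see from Solr when its read is stale.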
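The durable-writes item above can also be sketched as a toy model: a transaction log (tlog) sits in front of the index, writes land in the tlog first, and gets consult the tlog before the index, so a freshly written document is visible immediately. Real Solr implements this with its update log and the real-time get handler; the names below are illustrative.

```python
# Toy model of real-time get backed by a transaction log: writes are
# recorded in the tlog before indexing; a commit moves them into the
# searchable index; gets check the tlog first, so documents are
# visible even before they have been indexed.

class ToySolrCore:
    def __init__(self):
        self.tlog = {}    # uncommitted writes, keyed by document id
        self.index = {}   # "indexed" (committed) documents

    def add(self, doc):
        # durable write: recorded in the tlog before any indexing happens
        self.tlog[doc["id"]] = doc

    def commit(self):
        # flush the tlog into the index (Solr does this on commit)
        self.index.update(self.tlog)
        self.tlog.clear()

    def realtime_get(self, doc_id):
        # serve from the tlog first, making gets real-time
        if doc_id in self.tlog:
            return self.tlog[doc_id]
        return self.index.get(doc_id)

core = ToySolrCore()
core.add({"id": "1", "title": "hello"})
assert core.realtime_get("1")["title"] == "hello"  # visible before commit
core.commit()
assert core.realtime_get("1")["title"] == "hello"  # still visible after
```

Note that only gets by id are real-time in this scheme; search queries still see only what has been committed to the index.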
When not to use Solr
Solr (or a search engine in general) is not a good fit when:
A query returns thousands of documents (for example, bootstrapping another Solr instance by querying
the current one), because search engines store fields on disk in a format optimized for retrieving a few documents,
not millions.
Many hierarchical relationships are expected in the design, along with queries over them.
Document-level security is required.
You are building a very large-scale index.
Solr is not recommended for web-scale inverted indexes like the ones used by Google.
For such cases, it is better to use Hadoop MapReduce to create the indexes.
Apache Nutch is one such project: it uses Hadoop to map-reduce web links and feeds the resulting index to Solr.
Got a thought to share or found a bug in the code? We'd love to hear from you: