Solr Cloud

Since version 4.0, Solr has supported a cloud mode that automates sharding of the search index.
This mode is called Solr Cloud.

In this mode, Solr only needs to know the number of shards ("numShards") the application wants.
Given that number, Solr Cloud automatically handles:
  1. Sharding
  2. Replication and
  3. Node-failures

Whenever a new Solr node joins the Solr Cloud, the cloud checks whether enough nodes are already available to fulfill the "numShards" requirement.
If not, the new node becomes a master shard-node (the leader of a new shard).
If so, the new node becomes a replica of an existing master shard-node.
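
The assignment rule above can be sketched as a small in-memory model. This is only an illustration, not Solr code: the class name ToyCluster, the shard naming, and the choice of the shard with the fewest replicas are all assumptions made for the example.

```python
class ToyCluster:
    """Hypothetical model of the node-assignment rule; not part of Solr."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = []  # one dict per shard: {"leader": node, "replicas": [...]}

    def add_node(self, node):
        # Fewer leaders than numShards: the new node starts a new shard.
        if len(self.shards) < self.num_shards:
            self.shards.append({"leader": node, "replicas": []})
            return f"{node}: leader of shard{len(self.shards)}"
        # Every shard already has a leader: the node becomes a replica.
        # Assumption: attach it to the shard with the fewest replicas.
        index, shard = min(
            enumerate(self.shards), key=lambda pair: len(pair[1]["replicas"])
        )
        shard["replicas"].append(node)
        return f"{node}: replica in shard{index + 1}"
```

With ToyCluster(2), the first two calls to add_node create the two shard leaders, and every later call adds a replica.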

Importance of "numShards"

As of Solr 4.1, "numShards" must be specified when the cloud is first set up and cannot be changed later. This is because "numShards" determines how the data is distributed among the shards, so later changes to it are difficult to accommodate. However, it is possible to split the index on a shard into two, even while Solr is running. This feature helps reduce the load on a shard whose index holds too many documents.
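
As a rough sketch of what splitting a shard could mean, assume that each shard owns an inclusive range of integer hash values and that a split cuts this range into two halves. The function name split_range is hypothetical, and the text does not say how Solr picks the split point.

```python
def split_range(low, high):
    """Split an inclusive hash range into two contiguous halves (toy model)."""
    if low >= high:
        raise ValueError("range too small to split")
    mid = low + (high - low) // 2
    # The lower half keeps [low, mid]; the upper half takes [mid + 1, high].
    return (low, mid), (mid + 1, high)
```

Each half then covers about half of the original shard's hash values, so a new shard serving one half would take over roughly half of the documents.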

To understand why "numShards" is important, we need to understand the behavior of the Solr Document Router (SDR).
Whenever a new document arrives in Solr Cloud for indexing, SDR inspects its document ID and computes an integer hash of that ID. The whole range of 32-bit integers is divided into "numShards" buckets, so that every shard is responsible for one range of document hash-codes. SDR uses these ranges to forward each document to a particular shard based on its hash-code.
This avoids the need to broadcast every put/get-by-ID query to all the shards, which greatly improves performance.
(In fact, this works very much like a hash-map.)
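
The routing idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names are hypothetical, and CRC32 is used only as a stand-in hash because the text does not say which hash function Solr uses.

```python
import zlib

MIN_HASH = -2**31      # smallest signed 32-bit integer
MAX_HASH = 2**31 - 1   # largest signed 32-bit integer

def shard_ranges(num_shards):
    """Divide the signed 32-bit range into num_shards contiguous ranges."""
    step = 2**32 // num_shards
    ranges = []
    start = MIN_HASH
    for i in range(num_shards):
        # The last shard absorbs any remainder left by integer division.
        end = MAX_HASH if i == num_shards - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def hash_id(doc_id):
    """Stand-in hash (assumption): CRC32 shifted into the signed 32-bit range."""
    return zlib.crc32(doc_id.encode("utf-8")) - 2**31

def route(doc_id, ranges):
    """Return the index of the shard whose range contains the document's hash."""
    h = hash_id(doc_id)
    for index, (low, high) in enumerate(ranges):
        if low <= h <= high:
            return index
    raise ValueError("hash outside all ranges")
```

For example, route("doc-42", shard_ranges(4)) returns a shard index between 0 and 3, and the same ID always maps to the same shard as long as the ranges stay the same.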

If "numShards" were not fixed and changed with every restart of Solr Cloud, the document-routing strategy would change as well, and SDR would look for documents stored earlier on the wrong shards, making them effectively inaccessible.
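
To illustrate the problem, here is a toy model that counts how many document IDs would be routed to a different shard when the shard count changes. The function names and the CRC32 stand-in hash are assumptions made for the example, not Solr's actual implementation.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Index of the shard whose hash range contains doc_id's hash (toy model)."""
    h = zlib.crc32(doc_id.encode("utf-8"))   # unsigned 32-bit stand-in hash
    width = 2**32 // num_shards              # size of each shard's hash range
    return min(h // width, num_shards - 1)   # last shard absorbs the remainder

def moved_documents(doc_ids, old_num_shards, new_num_shards):
    """IDs whose target shard differs between the two shard counts."""
    return [
        d for d in doc_ids
        if shard_for(d, old_num_shards) != shard_for(d, new_num_shards)
    ]
```

Calling moved_documents with a list of IDs and two different shard counts, say 2 and 3, returns the IDs that the router would now look for on a shard that never stored them. With the same count on both sides, the list is always empty.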

Apache Zookeeper

Solr uses Apache Zookeeper to manage shard-nodes.
Zookeeper is software that keeps track of live server nodes. It knows which nodes in a cloud are shard-leaders and which are replicas. If the leader of a shard dies and a new leader is elected, Zookeeper keeps track of this change and helps the outside world interact with the cloud based on this knowledge. Similarly, when a new node is added to the cloud, Zookeeper registers that node with the system and lets the other nodes know about the change.
So, in summary, Zookeeper keeps track of node health and is responsible for communicating the birth and death of nodes to the outside world and to the other nodes in the family.
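
The bookkeeping described above can be pictured with a toy in-memory registry. This is not the Zookeeper API: the class ToyRegistry and its methods are hypothetical, and promoting the longest-lived surviving node to leader is an assumption made only to keep the example short.

```python
class ToyRegistry:
    """Hypothetical stand-in for tracking live nodes and shard leaders."""

    def __init__(self):
        self.shards = {}  # shard name -> list of live nodes, leader first

    def node_up(self, shard, node):
        # A newly registered node joins the end of the shard's live list.
        self.shards.setdefault(shard, []).append(node)

    def node_down(self, shard, node):
        # Dropping the first entry promotes the next live node to leader.
        self.shards[shard].remove(node)

    def leader(self, shard):
        nodes = self.shards.get(shard, [])
        return nodes[0] if nodes else None
```

A client that asks leader("shard1") before and after the current leader goes down gets the old and the new leader respectively, which is the kind of knowledge the outside world relies on.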

Solr comes bundled with a Zookeeper instance, but it can also be run against a separate, external Zookeeper installation if required.

Site Owner: Sachin Goyal