Since version 4.0, Solr supports a cloud mode, also called SolrCloud, which automates the process of sharding the search index.
When used in this mode, Solr only needs to know the number of shards (numShards) the application wishes to have.
With that knowledge, SolrCloud automatically handles node assignment:
Whenever a new Solr node is added to the SolrCloud, it checks whether enough nodes are already available to provide the requested number of shards.
If not, the new node is made the leader of a new shard.
If so, the new node is made a replica of an existing shard-leader.
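The assignment rule above can be sketched in a few lines. This is an illustrative simplification, not Solr's actual implementation; the method and class names are hypothetical.

```java
public class NodeAssignment {

    // Decide the role of a newly joined node: until numShards leaders
    // exist, each new node becomes a shard-leader; after that, new
    // nodes become replicas. (Illustrative sketch, not Solr internals.)
    static String roleForNewNode(int currentLeaders, int numShards) {
        return currentLeaders < numShards ? "shard-leader" : "replica";
    }

    public static void main(String[] args) {
        // With numShards = 2: the first two nodes lead, the third replicates.
        System.out.println(roleForNewNode(0, 2));
        System.out.println(roleForNewNode(1, 2));
        System.out.println(roleForNewNode(2, 2));
    }
}
```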
Importance of "numShards"
As of Solr 4.1, numShards has to be specified at the beginning and cannot
be changed later. This is because "numShards" determines how the data is distributed
among the shards, and modifications to it are difficult to accommodate later.
However, it is possible to split the index on a shard into two (even while Solr is running). This feature helps reduce
the load on a shard whose index has grown too large.
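Shard splitting is exposed through Solr's Collections API as the SPLITSHARD action. The sketch below just assembles the request URL for it; the collection and shard names are placeholders.

```java
public class SplitShardRequest {

    // Build the Collections API call that splits one shard into two.
    // "mycollection" and "shard1" below are placeholder names.
    static String splitShardUrl(String solrBase, String collection, String shard) {
        return solrBase + "/admin/collections?action=SPLITSHARD"
                + "&collection=" + collection
                + "&shard=" + shard;
    }

    public static void main(String[] args) {
        System.out.println(
            splitShardUrl("http://localhost:8983/solr", "mycollection", "shard1"));
    }
}
```

Issuing this request (with curl or a browser) asks Solr to divide the shard's hash range in half and create two sub-shards covering the two halves.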
To understand why "numShards" is important, we need to understand the behavior of
the Solr Document Router (SDR).
Whenever a new document arrives in the Solr cloud for indexing, the SDR inspects its document-ID and computes an
integer hash of the ID. The whole range of 32-bit
integers is divided into "numShards" buckets, so that every shard becomes responsible for handling a range
of document hash-codes. The SDR uses this range to forward a document to a particular shard based on its hash-code.
This avoids the need to broadcast every put/get-by-ID query to all the shards and greatly improves performance.
(In fact, this works very much like a hash-map.)
If "numShards" were not kept fixed and changed with every restart of the Solr-Cloud, the document-routing strategy
would also keep changing, and documents stored earlier would become inaccessible.
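The routing idea, and why it breaks when numShards changes, can be seen in a small sketch. This is a simplification with hypothetical names; the real SDR uses its own hash function, and String.hashCode stands in for it here.

```java
public class DocumentRouter {

    // Map a document-ID onto one of numShards contiguous hash-range
    // buckets. (Sketch only: String.hashCode stands in for Solr's hash.)
    static int shardFor(String docId, int numShards) {
        long hash = docId.hashCode();                    // 32-bit signed hash
        long shifted = hash - (long) Integer.MIN_VALUE;  // shift into 0 .. 2^32-1
        long bucketSize = (1L << 32) / numShards;        // width of each shard's range
        return (int) Math.min(shifted / bucketSize, numShards - 1);
    }

    public static void main(String[] args) {
        String id = "doc-42";
        // The same document generally lands on a different shard if
        // numShards changes -- which is why it must stay fixed.
        System.out.println("numShards=4 -> shard " + shardFor(id, 4));
        System.out.println("numShards=5 -> shard " + shardFor(id, 5));
    }
}
```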
Solr uses Apache Zookeeper to manage shard-nodes.
Zookeeper is a service that keeps track of live server nodes.
It knows which nodes in a cloud are shard-leaders and which are replicas. If a leader in a shard dies and a new leader is
elected, Zookeeper keeps track of this change and helps the outside world interact with the cloud based on this knowledge.
Similarly, when a new node is added to the cloud, Zookeeper adds that node to the system and helps the other nodes learn about it.
So, in summary, Zookeeper keeps track of node health and is responsible for communicating the birth and death of nodes to
the outside world and to the other nodes in the family.
Solr comes bundled with a Zookeeper instance, but it can also be run with other shard-management utilities if required.
Got a thought to share or found a bug in the code? We'd love to hear from you: