CloudSolrServer is a class in the SolrJ client library for connecting to a SolrCloud cluster.
It connects to ZooKeeper and keeps track of the state of each node in the cluster. With this knowledge, a CloudSolrServer
client knows which nodes are the leaders and sends requests to leaders only, saving time.
Without CloudSolrServer, requests are sent in a round-robin fashion to all the Solr nodes (leaders and replicas).
If N is the total number of nodes in the cloud (i.e. leaders plus replicas) and S is the number of shard leaders,
there is an S/N chance of hitting a shard leader and a (1 - S/N) chance of hitting a non-leader node,
which is wasteful because the non-leader node then has to forward the request to its leader. With CloudSolrServer, requests
are sent only to shard leaders, which performs much better.
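To make the odds concrete, here is a small sketch of the arithmetic above (the example numbers, three shards across nine nodes, are hypothetical):

```java
// Chance of a round-robin request landing on a shard leader,
// using the S/N formula from the text. Example numbers are made up.
public class LeaderHitProbability {
    public static void main(String[] args) {
        int s = 3;  // S: number of shard leaders
        int n = 9;  // N: total nodes (leaders + replicas)
        double pLeader = (double) s / n;    // chance of hitting a leader directly
        double pNonLeader = 1.0 - pLeader;  // chance of an extra forwarding hop
        System.out.printf("P(leader) = %.2f, P(non-leader) = %.2f%n",
                          pLeader, pNonLeader);
    }
}
```

So with these numbers, two out of three round-robin requests pay for an extra hop that CloudSolrServer avoids.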
If a node crashes, ZooKeeper notifies CloudSolrServer, which then removes that node from its list of eligible
Solr instances. CloudSolrServer clients are likewise notified when a new leader is elected.
In fact, Solr itself uses CloudSolrServer internally to communicate with other nodes in the cluster.
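As a sketch of typical client usage (SolrJ 4.x API; the ZooKeeper addresses, collection name, and field values below are hypothetical, and a live SolrCloud cluster is required to actually run this):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // Connect via ZooKeeper rather than a fixed Solr URL; the client
        // discovers the cluster state (and the leaders) from ZK.
        // The ZK host list below is a placeholder.
        CloudSolrServer server =
            new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        server.setDefaultCollection("collection1");  // hypothetical collection

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title_t", "hello cloud");
        server.add(doc);     // routed to the relevant shard leader
        server.commit();

        server.shutdown();
    }
}
```

(In later Solr releases this class was renamed CloudSolrClient, but the idea is the same: hand the client a ZooKeeper connect string and let it track the cluster.)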
Node failures
If a node fails for some time and then rejoins the cluster, it syncs itself with the leader. If only a few updates
have been missed (the threshold was hard-coded to 100 in early releases of Solr), the node just pulls
those updates from the leader's update log. This is called Peer Sync. If too many
updates have been missed, the node replicates the entire index snapshot from the leader. This is called
Snapshot Replication.
Also note that the leader does not need to track replicas going down, because such replicas automatically sync themselves
from the leader when they come back up.
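The recovery choice above can be sketched as plain logic (the 100-update threshold is from the text; the method and variable names here are hypothetical, not Solr's actual recovery code):

```java
public class RecoveryDecision {
    // From the text: early Solr releases kept roughly 100 updates
    // available in the update log for Peer Sync.
    static final int PEER_SYNC_LIMIT = 100;

    static String chooseRecovery(long leaderUpdateCount, long replicaUpdateCount) {
        long missedUpdates = leaderUpdateCount - replicaUpdateCount;
        if (missedUpdates <= PEER_SYNC_LIMIT) {
            // Few changes missed: replay them from the leader's update log.
            return "PEER_SYNC";
        }
        // Too far behind: copy the leader's whole index.
        return "SNAPSHOT_REPLICATION";
    }

    public static void main(String[] args) {
        System.out.println(chooseRecovery(1050, 1000)); // 50 missed   -> PEER_SYNC
        System.out.println(chooseRecovery(5000, 1000)); // 4000 missed -> SNAPSHOT_REPLICATION
    }
}
```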
The ZK ensemble keeps running as long as a majority (more than half) of its servers are running.
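Concretely, a ZooKeeper ensemble of n servers needs a strict majority, floor(n/2) + 1, to stay available, which is why an even-sized ensemble tolerates no more failures than the next-smaller odd one:

```java
public class ZkQuorum {
    // Strict majority for an ensemble of the given size.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 4, 5}) {
            int q = quorumSize(n);
            System.out.printf("ensemble=%d quorum=%d tolerates %d failure(s)%n",
                              n, q, n - q);
        }
        // ensemble=3 quorum=2 tolerates 1 failure(s)
        // ensemble=4 quorum=3 tolerates 1 failure(s)
        // ensemble=5 quorum=3 tolerates 2 failure(s)
    }
}
```

This is why ZooKeeper ensembles are conventionally sized at 3 or 5.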