Make delicious recipes!

Flume vs Kafka


"What tool is the best for transporting data/logs across servers in a system?"

This question is very commonly faced when designing a big-data system.
Here we try to compare both these tools and see which is the best suited for the job.


Problems targeted by these systems

Flume is designed to ease the ingestion of data from one component to other.
It's focus is mostly on Hadoop although now it has sources and sinks for several other tools also, like Solr.

Kafka on the other hand is a messaging system that can store data for several days (depending on the data size of-course).
Kafka focuses more on the pipe while Flume focuses more on the end-points of the pipe.
That's why Kafka does not provide any sources or sinks specific to any component like Hadoop or Solr.
It just provides a reliable way of getting the data across from one system to another.
Kafka uses partitioning for achieving higher throughput of writes and uses replication for reliability and higher read throughput.

Push / Pull

Flume pushes data while Kafka needs the consumers to pull the data. Due to push nature, Flume needs some work at the consumers' end for replicating data-streams to multiple sinks. With Kafka, each consumer manages its own read pointer, so its relatively easy to replicate channels in Kafka and also much easier to parallelize data-flow into multiple sinks like Solr and Hadoop.


Latest trend is to use both Kafka and Flume together.
KafkaSource and KafkaSink for Flume are available which help in doing so.
The combination of these two gives a very desirable system because Flume's primary effort is to help ingest data into Hadoop and doing this in Kafka without Flume is a significant effort. Also note that Flume by itself is not sufficient for processing streams of data as Flume's primary purpose is not to store and replicate data streams. Hence it would be poor system if it uses only a single one of these two tools.

Persistence storage on disk

Kafka keeps the messages on disk till the time it is configured to keep them.
Thus if a Kafka broker goes offline for a while and comes back, the messages are not lost.
Flume also maintains a write-ahead-log which helps it to restore messages during a crash.

Flume error handling and transactional writes

Flume is meant to pass messages from source to sink (All of which implement Flume interfaces for get and put, thus treating Flume as an adapter). For example, a Flume log reader could send messages to a Flume sink which duplicates the incoming stream to Hadoop Flume Sink and Solr Flume Sink. For a chained system of Flume sources and sinks, Flume achieves reliability by using transactions - a sending Flume client does not close its write transaction unless the receiving client writes the data to its own Write-Ahead-Log and informs the sender about the same. If the receiver does not acknowledge the writing of WAL to the sender, then the sender marks this as a failure. The sender then begins to buffer all such events unless it can no longer hold any more. At this point, it begins to reject writes from its own upstream clients as well.



References:
  1. Cloudera Blog
  2. Apache Flume
  3. Apache Blogs
  4. Kafka Design







Like us on Facebook to remain in touch
with the latest in technology and tutorials!


Got a thought to share or found a
bug in the code?
We'd love to hear from you:

Name:
Email: (Your email is not shared with anybody)
Comment:

Facebook comments:

Site Owner: Sachin Goyal