
Differences between Kafka and Flume


Kafka is often compared with Flume because both technologies can be used in the data ingestion phase of a data pipeline. In this article, we will discuss how these technologies have evolved over time.

Below are some of the key differences between these two technologies:


Development

Kafka was developed by LinkedIn in 2010 and Flume was developed by Cloudera in 2011.

Kafka was developed to simplify LinkedIn's architecture, while Flume was developed to perform log file aggregation.

A Hadoop environment has multiple worker machines. When a job runs, it generates a lot of logs, and these logs are distributed across many machines. We want to aggregate this data from all of those machines into one central place, let's say HDFS. This is the basic use case of Flume. Once the data is in one place, you can easily analyze it, find errors in the log files, and do any number of other things.

Flume comes from the Hadoop world, so it focused heavily on Hadoop-related integrations. Kafka is not specific to Hadoop, so it addresses the entire enterprise architecture. An enterprise architecture does not contain just Hadoop; it also has NoSQL databases, RDBMSs, OLAP systems, etc. Kafka simplified LinkedIn's architecture and helped ensure that all of its parts work together properly.

Thus, Kafka goes beyond Hadoop, while Flume is focused on Hadoop, so many Hadoop integrations and connectors are available for it. That is why Flume became very popular at that time.


Initially Started

Kafka initially started as a Pub/Sub system, which means it has Publishers and Subscribers. Publishers publish the data and Subscribers pull the data. Flume, in contrast, is specifically for streaming ingestion.
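The pull model described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Kafka's client API: `TinyLog`, `publish`, and `pull` are hypothetical names, and the log lives in memory. It shows publishers appending to a shared log while each subscriber pulls at its own pace by tracking its own read position.

```python
from collections import defaultdict


class TinyLog:
    """Hypothetical in-memory stand-in for a single topic (not a real Kafka API)."""

    def __init__(self):
        self.records = []                # append-only list of published records
        self.offsets = defaultdict(int)  # next position to read, per subscriber

    def publish(self, record):
        """Publisher side: append the record and return its position."""
        self.records.append(record)
        return len(self.records) - 1

    def pull(self, subscriber, max_records=10):
        """Subscriber side: fetch unread records and advance that subscriber's position."""
        start = self.offsets[subscriber]
        batch = self.records[start:start + max_records]
        self.offsets[subscriber] = start + len(batch)
        return batch


log = TinyLog()
for event in ["login", "click", "logout"]:
    log.publish(event)

first = log.pull("analytics", max_records=2)  # analytics reads two records
second = log.pull("analytics")                # then the remaining one
audit = log.pull("audit")                     # audit independently reads all three
print(first, second, audit)
```

Because each subscriber keeps its own position, a slow subscriber does not hold back a fast one, and publishers never need to know who is reading.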


Architecture

Kafka's architecture is built around Brokers. In Kafka, a Broker is a JVM daemon, although many companies also use the term to refer to a machine running Kafka. Flume has an Agent-based architecture; Agents are the JVM daemons in Flume. An Agent has 3 components:
1. Source, which connects to the origin of the data or to another Agent
2. Channel, the propagation path of the data
3. Sink, which connects to the destination of the data or to another Agent
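The three Agent components can be modelled with a short Python sketch. It is a toy model under stated assumptions rather than Flume's actual implementation or configuration format: `run_agent` and the `capacity` parameter are hypothetical, and plain lists stand in for the log files and for the destination.

```python
from collections import deque


def run_agent(source, sink, capacity=2):
    """Move events from a source to a sink through a bounded channel."""
    channel = deque()
    for event in source:
        channel.append(event)         # the source puts the event on the channel
        if len(channel) >= capacity:  # channel is full: let the sink drain it
            while channel:
                sink(channel.popleft())
    while channel:                    # flush whatever is left at the end
        sink(channel.popleft())


log_lines = ["INFO job started", "ERROR disk full", "INFO job finished"]
hop = []        # output of the first agent, input of the second
collected = []  # stands in for a destination such as HDFS

# Chain two agents: the first agent's sink feeds the second agent's source.
run_agent(log_lines, hop.append)
run_agent(hop, collected.append)
print(collected)
```

Chaining the two calls mirrors the point in the list above that a Source or Sink may connect to another Agent instead of to the original data or the final destination.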


Time Period

Kafka can buffer data for the long term. You define a retention period, which is how long you want to keep the data, and after that the data is deleted from the Kafka cluster. In Flume, data is held in the Channel (the propagation path along which events travel) only for a short interval.
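The idea of a retention period can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `RetainedLog` is a hypothetical class that expires individual records by timestamp, whereas a real Kafka cluster applies retention settings such as `retention.ms` to whole log segments.

```python
class RetainedLog:
    """Hypothetical log that keeps records for a fixed retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (timestamp, value) pairs, oldest first

    def append(self, timestamp, value):
        self.records.append((timestamp, value))

    def expire(self, now):
        """Drop every record older than the retention period."""
        cutoff = now - self.retention_seconds
        self.records = [(ts, v) for ts, v in self.records if ts >= cutoff]

    def values(self):
        return [v for _, v in self.records]


log = RetainedLog(retention_seconds=60)
log.append(0, "a")
log.append(30, "b")
log.append(90, "c")
log.expire(now=100)  # cutoff is 40, so "a" and "b" are dropped
remaining = log.values()
print(remaining)
```

Until a record expires, any consumer can still read it, which is what lets Kafka act as a longer-term buffer rather than only a short-lived channel.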


Fault Tolerance

In Kafka, fault tolerance is implicit: you define the replication you need for your data, and Kafka keeps the copies for you. It is just a configuration setting that can be changed very quickly.

In Flume, fault tolerance is explicit: you have to configure your Flume data pipeline yourself to ensure that the data is replicated. So Flume requires more work to achieve the same kind of replication.
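To make the effect of a replication setting concrete, here is a toy Python sketch. It assumes a deliberately simplified model: `replicate` and `read_surviving` are hypothetical helpers, brokers are plain lists, and every record goes to the same first few brokers, which is not how Kafka actually assigns replicas.

```python
def replicate(record, brokers, replication_factor):
    """Write the record to the first `replication_factor` brokers."""
    for broker in brokers[:replication_factor]:
        broker.append(record)


def read_surviving(brokers, failed):
    """Return the records held by the first broker that has not failed."""
    for index, broker in enumerate(brokers):
        if index not in failed:
            return list(broker)
    return []


brokers = [[], [], []]  # three hypothetical brokers
for record in ["r1", "r2"]:
    replicate(record, brokers, replication_factor=2)

after_failure = read_surviving(brokers, failed={0})  # broker 0 is down
print(after_failure)
```

With a replication factor of 2, the data survives the loss of one broker; if both brokers holding copies fail, the remaining broker has nothing to serve. Changing the single `replication_factor` value is the whole adjustment here, which is the sense in which the setting is "just a configuration".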


ETL

In Kafka you can do true, very powerful ETL, and you can also achieve millisecond latency. In Flume only very lightweight ETL is performed.

Thus, Kafka is not just a Pub/Sub system; it is a complete Event Streaming Platform. Flume is just for data ingestion, whereas Kafka can be used for ingestion, long-term storage of data, real-time processing of data, etc.


Earlier, some companies used Flafka, which combined the good parts of Flume with the good parts of Kafka. These days, people strongly prefer Kafka because many connectors, for Hadoop and for many other systems, have been built on the Kafka Connect API. The Kafka Connect API was released in 2015, and since then the focus has shifted heavily from Flume towards Kafka. Flume is no longer a priority for Cloudera, and most enterprises are focused on Kafka.


We hope this discussion helps you understand the differences between Kafka and Flume in the big data world. If you liked this article, please share it with others. Thank you!
