New to Big Data? Start with Kafka

Kafka has become a popular distributed messaging system for a big data environment, so it made good sense for me to write an article about it.

In this article, we will look at the high level architecture, the components, and terminologies of Kafka, and understand the way it works.

Introduction

In the big data world, the data is expected to be pumped to the destination at very high frequencies and high volumes, so distributed systems such as Kafka become easy to persist the data temporarily and later consume it in batches. Following are a few of the advantages of using Kafka as a distributed messaging system.

  • High throughput
  • Easy linear scalability
  • Inbuilt partitioning
  • Easy and quick replication
  • High fault tolerance
  • A publish-subscribe model

System Architecture

The Kafka environment is a distributed environment; this means it consists of a cluster of servers. Figure 1 provides a high level architecture of Kafka.

Kafka1
Figure 1: The architecture of Kafka

Broker

Each server in the Kafka cluster is called the broker. At any point in time, the cluster can be linearly scaled by simply adding a new broker to the cluster.

Topic

The messages which get persisted in the Kafka brokers are categorized as topics. A topic can be partitioned and thus the messages get distributed across the cluster.

Producer

A producer is the one which pumps in messages to the Kafka cluster. There can be multiple producers that pump in the data simultaneously to the Kafka brokers. Each producer publishes the message to a particular topic.

Consumer

A consumer is one that subscribes to the Kafka brokers to receive the messages. The consumers will listen for messages from the particular topic.

Partition

A number of partitions can be configured at the Kafka level. Each topic can be divided into multiple partitions. The partitions get distributed across multiple brokers. Based on the replication factor, the partitions also get replicated between the clusters. Among a partition and its replications, one of them acts as a “leader” and others act as “followers.” When the leader fails, one of the followers automatically steps up to be a leader. This ensures high fault tolerance and less down time.

Figure 2 shows a diagram of a sample partition.

Kafka2
Figure 2: A sample partitioned diagram

In Figure 2, you see that there are three brokers, three partitions, and the replication factor is 3. The leader partition is marked in green and the followers are marked in brown. I have also expanded a partition to show how the messages are stored in a partition and indexed with an offset value.

Offset

Each message persisted inside a partition is assigned to a number offset value. In each partition, the messages are ordered by the offset value and then stored. When the message is consumed by a consumer, it also gets the partition ID and offset value of the received message.

Key Points to Note

Now, let us see a few key points to remember about the Kafka framework:

  • The messages don’t get deleted upon consumption. They live until a configured expiry time is reached.
  • At any point in time, the messages can be re-consumed by using their offset value.
  • A message can be published to a topic and a message can be consumed from a topic.
  • A message can be uniquely identified by using the combination of the topic name, partition ID, and the offset value of the message.

Conclusion

I hope this article gave you an overview and an architectural insight to Apache Kafka. I’ll see you in my future article about creating a producer and consumer application for Kafka using .NET in the C# language.

Happy reading!

More by Author

Must Read