Apache Kafka Partitions, Replicas and Topics

Ramesh Babu Chayapathi
4 min read · Dec 16, 2017

Apache Kafka is very interesting, but there is also a lot we need to learn before we can understand and control it.
In this article, we will discuss some important concepts inside Apache Kafka. Hopefully, it can help you build a better picture of Kafka so you can use it more effectively.

Why use Apache Kafka?
There are many use cases, and some of those are discussed in Kafka’s documentation. The benefits of Kafka are many: scalability, speed, durability. That’s all great, but here’s my biggest reason for using it: it serves as a central data bus for all streaming data. This is especially important when you may not know in advance who will be producing data, and who will be consuming that data. Apache Kafka stores the data and lets clients publish and subscribe to it, making it easy to stream data across a distributed system.
Topic
We can understand Kafka as a streaming log system or a queue system. Think of it as tail -f in UNIX speak. In Linux, if there is a process that is producing some log output, it is very common to run tail -f [filename] on the file to track the log file updates as they happen. A Kafka topic is exactly that: it’s just a log file that lives in the Kafka broker ecosystem. The big difference is that instead of tailing a single file on a single server, you can consume from a topic from anywhere that has access to Kafka. That topic could also have multiple producers writing to it from many different places.
This is very useful in distributed applications or microservice systems that need to stream data. Say you have a distributed application that lives across more than one server, and this application produces some output. Where should that output be written to? The usual choices are a flat file or a database. But what if you don’t have a database set up, or don’t know which one to use at first? What if you need to do some additional processing on this distributed data before sending it to a database? You could write all the data out to flat files, do some processing on it, and then ingest the data into the database. But then you have to worry about managing all the data between the servers.
With Kafka, instead of each server node writing its data out to a different place, they all write their data to a common Kafka topic. This is the power of Kafka. In Kafka, the client that writes data to a topic is called a producer. Now, if someone wants to read that data stream, they have one single place to go: the Kafka topic. The client reading the data is called a consumer.
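To make the log analogy concrete, here is a toy model of a topic as an append-only log. This is purely an illustration of the semantics described above (producers append, consumers read from an offset), not how the broker is actually implemented — real Kafka persists log segments on disk and handles replication, batching, and retention.

```python
# Toy model of a Kafka topic as an append-only log (illustration only;
# the real broker stores segments on disk and replicates them).
class ToyTopicLog:
    def __init__(self):
        self._log = []  # each entry is one record; its index is its offset

    def append(self, record):
        """Producer side: append a record and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def read_from(self, offset):
        """Consumer side: read every record at or after the given offset."""
        return self._log[offset:]


log = ToyTopicLog()
log.append("server-1: request handled")   # offset 0
log.append("server-2: request handled")   # offset 1
print(log.read_from(1))  # a consumer that has already seen offset 0
```

Note that reading does not remove records: many consumers can read the same log independently, each tracking its own position, which is exactly what distinguishes Kafka from a traditional work queue.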
Partitions and Replicas
Two other important features of Kafka are parallelism and redundancy. Kafka handles this by giving each topic a certain number of partitions and replicas.
PARTITIONS
Partitions: A single piece of a Kafka topic. The number of partitions is configurable on a per-topic basis. More partitions allow for greater parallelism when reading from the topics. The number of partitions sets the maximum number of consumers in a consumer group that can usefully read in parallel. For example, if a topic has 3 partitions, you can have 3 consumers in a consumer group balancing consumption between the partitions. In this way you have a parallelism of 3. This partition number is somewhat hard to determine until you know how fast you are producing data and how fast you are consuming the data. If you have a topic that you know will be high volume, you will need to have more partitions.
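The relationship between partitions and consumer-group parallelism can be sketched with a simplified round-robin assignment. The `choose_partition` hash here is just for illustration (Kafka’s default partitioner actually uses murmur2 hashing on the key, and Python’s `hash()` is not stable across processes), and real Kafka assigns partitions via its group coordinator, not this loop.

```python
# Simplified sketch of keyed-record partitioning. Kafka's real default
# partitioner uses murmur2; plain hash() here is illustration only.
def choose_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions


# Round-robin sketch of assigning partitions to consumers in a group.
# With 3 partitions and 3 consumers, each consumer owns one partition.
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


print(assign_partitions(3, ["c1", "c2", "c3"]))
# With more consumers than partitions, the extra consumers sit idle:
print(assign_partitions(3, ["c1", "c2", "c3", "c4"]))
```

This is why adding a fourth consumer to a 3-partition topic buys you nothing: a partition is consumed by at most one member of a group at a time.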
REPLICAS
Replicas: These are copies of the partitions. Clients never write to or read from them directly; their only purpose is data redundancy. If your topic has n replicas, n-1 brokers can fail before there is any data loss. Additionally, you cannot have a topic with a replication factor greater than the number of brokers that you have. For example, if you have 5 Kafka brokers, you could have a topic with a maximum replication factor of 5, and 5-1=4 brokers could go down before there is any data loss.
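The fault-tolerance arithmetic above is simple enough to state as a function. This is just the rule of thumb from the paragraph, written out; the function name is my own, not a Kafka API.

```python
# Sketch of the fault-tolerance rule: with a replication factor of n,
# up to n - 1 brokers can fail without data loss, and the replication
# factor can never exceed the number of brokers.
def max_broker_failures(replication_factor: int, num_brokers: int) -> int:
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed broker count")
    return replication_factor - 1


print(max_broker_failures(5, 5))  # 4 brokers can go down safely
```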
This is the command to create a Kafka topic with replication and partition options:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic kafkatopic
Offsets: An “offset” is just a pointer to a location in the “topic”. Each client or “consumer” belongs to a “consumer-group” that is used to track the offset where it is in the topic. The actual offset values are stored in a special internal Kafka topic called “__consumer_offsets”. Why is it called a “consumer-group” and not just a “consumer”? This is because Kafka supports balanced consuming, meaning that you can have more than one consumer reading from a topic to increase parallelism.
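A toy model of per-group offset tracking makes this concrete. In real Kafka the committed offsets live in the internal `__consumer_offsets` topic; a dictionary keyed by (group, topic, partition) is enough to show the idea that each group keeps its own independent position.

```python
# Toy model of consumer-group offset tracking (real Kafka commits these
# to the internal __consumer_offsets topic).
committed = {}

def commit(group, topic, partition, offset):
    """Record how far this consumer group has read in a partition."""
    committed[(group, topic, partition)] = offset

def fetch(group, topic, partition):
    """A group that never committed starts from the beginning (0 here)."""
    return committed.get((group, topic, partition), 0)


commit("analytics", "test", 0, 42)
print(fetch("analytics", "test", 0))  # 42
print(fetch("billing", "test", 0))    # 0 -- an independent group keeps its own position
```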
LEADERS AND IN SYNC REPLICAS
Each partition has a broker leader, and the replicas simply “follow” the leader and duplicate the data. If a broker that is a leader goes down, Kafka will automatically select a new broker leader by default. Note that if consumers are consuming a topic that temporarily loses its leader, they may need to reconnect to fetch the new metadata from the cluster.
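A minimal sketch of that failover step, under the assumption that the new leader is simply the first surviving broker in the in-sync replica (ISR) list — in reality the Kafka controller runs this election with more bookkeeping, and the function name here is my own.

```python
# Toy sketch of leader failover: when the leader dies, a new leader is
# chosen from the remaining in-sync replicas. Real Kafka's controller
# does this with far more bookkeeping; this only shows the selection idea.
def elect_leader(failed_leader: int, isr: list) -> int:
    candidates = [b for b in isr if b != failed_leader]
    if not candidates:
        raise RuntimeError("no in-sync replica available for election")
    return candidates[0]


# Using partition 0 from the describe output below: Leader 4, Isr 1,4,5.
print(elect_leader(4, [1, 4, 5]))  # broker 1 takes over
```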
Once your topic has been created, you can use Kafka’s built-in tool to describe the topics on your Kafka cluster. You might see something like this:
$ cd /path/to/kafka
$ ./bin/kafka-topics.sh --describe --zookeeper localhost:2181

Topic: test PartitionCount:3 ReplicationFactor:3 Configs:
Topic: test Partition: 0 Leader: 4 Replicas: 4,5,1 Isr: 1,4,5
Topic: test Partition: 1 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
Topic: test Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Kafka is a great tool. Hopefully, with the information in this article, you can understand Apache Kafka better.
