Depending on which side of the project you are on - business, organizational, or technical - you may know how to work with certain tools, you may have an idea of what some of them do, or you may have heard of some but still have no idea what benefits they can bring to your project.
This article aims to explain what Apache Kafka is, what it does, and how it can help you as an IT architect or developer who hasn't had the chance to work with it yet - or as any other professional who may benefit from expanding their knowledge of technologies and tools.
The official definition of Kafka has changed over the years, so let's take a short journey through them. According to the official website, Apache Kafka was:
In 2013 "a distributed publish-subscribe messaging system."
In 2014 "a publish-subscribe messaging rethought as a distributed commit log."
In 2017 "a distributed streaming platform."
In 2021 "an open-source distributed event streaming platform."
As you already know, whatever goes online stays online and can be found with the help of the Wayback Machine, which is also how I found the definitions above.
However, the problem with all these definitions is not necessarily that they're wrong; it's that they are built from difficult-to-grasp concepts, so people still don't understand what Kafka really is. Unless you already know a few technical terms such as messaging system or event streaming, these definitions don't tell you much, right?
Well, if I were to ELI5 it, Kafka is simply a storage system - it can store and retrieve data.
**ELI5 = explain it like I'm 5
Kafka stores a particular kind of data: messages (also called events or records), which are small pieces of data (the default size limit is 1 MB). Each message has a key, a value, and an (implicit) timestamp.
Producers are applications, written by developers, that connect to Kafka and write messages into topics.
Consumers are applications, also written by developers, that read (and process) messages from topics.
A topic is a collection of related messages, similar to a table in a database.
Kafka is a distributed and replicated system, which ensures scalability (by adding disks or nodes) and high availability. The nodes in Kafka have a special name - brokers - and they can be physical servers, virtual machines, or containers/pods in Kubernetes, as long as they can run the broker process and have storage attached.
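To make these definitions concrete, here is a minimal sketch of a producer and a consumer. It assumes the kafka-python client library and a broker reachable at localhost:9092; the topic and group names are made up for illustration - any Kafka client in any language follows the same pattern.

```python
# pip install kafka-python  (one of several Python client libraries for Kafka)
from kafka import KafkaProducer, KafkaConsumer

# A producer connects to a broker and writes key/value messages into a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    key_serializer=str.encode,
    value_serializer=str.encode,
)
producer.send("demo-topic", key="some-key", value="some value")
producer.flush()  # make sure the message has actually left the client

# A consumer connects to the same broker and reads messages from that topic.
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-app",
    auto_offset_reset="earliest",         # start from the oldest retained message
)
for message in consumer:
    # Every message carries a key, a value, and a timestamp, as described above.
    print(message.key, message.value, message.timestamp)
```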
So what do you need to know to start using Kafka? Let's review the main points so far:
- Messages are small pieces of data with a key, a value, and an implicit timestamp.
- A topic is a collection of related messages, similar to a table in a database.
- Producers write messages into topics; consumers read (and process) them.
- Brokers are the nodes that make up the distributed, replicated Kafka cluster.
But let’s understand Kafka even better:
Let's say you have a fleet of trucks, each of them transmitting location data. In this case, the trucks are sending messages - the message key is the truck ID and the value is the GPS coordinates (messages also get an implicit timestamp). You can store that data in Kafka, in a topic. Kafka is very fast and scalable, so you will easily handle thousands of trucks at once. You basically have a Kafka producer in each truck. You can then have various independent microservices reading from that topic and processing truck locations - these are the consumers. A message is not deleted once it is read by a consumer, because it can be re-read by other consumers as well; by design, Kafka retains messages for as long as you configure it to, which can be long-term.
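As a rough sketch of the producer side of this scenario (the topic name, truck ID, and the kafka-python client are assumptions for illustration), each truck would periodically send its coordinates keyed by its ID:

```python
import json
from kafka import KafkaProducer

# Hypothetical broker address and topic name, for illustration only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def report_position(truck_id: str, lat: float, lon: float) -> None:
    # Key = truck ID, value = GPS coordinates; the timestamp is added implicitly.
    producer.send("truck-locations", key=truck_id, value={"lat": lat, "lon": lon})

report_position("truck-042", 44.4268, 26.1025)
producer.flush()
```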
Now let’s take another example and imagine you run a large online retail company that processes millions of orders every day. Each time a customer places an order, that order information needs to be processed in various ways. In this scenario, each order is a message. The message key could be the order ID, and the value could be the order details, such as customer information, items purchased, price, and shipping address.
You can use Apache Kafka to handle this stream of order data. When a customer places an order, your e-commerce platform acts as a Kafka producer, sending the order details to a Kafka topic dedicated to orders. Kafka's robust architecture can easily handle the high volume of orders your platform generates, ensuring no data is lost and orders are processed in real-time.
Now, you have various microservices that need to process these orders. For example:
- a billing service that charges the customer and generates an invoice;
- an inventory service that updates stock levels for the purchased items.
Each of these services acts as a Kafka consumer, subscribing to the orders topic. They read the order messages independently and perform their respective tasks. For instance, the billing service reads an order message, charges the customer's credit card, and then generates an invoice. Meanwhile, the inventory service might read the same message to decrease the stock levels for the purchased items.
Because Kafka retains messages for a configurable period, these services can process orders at different rates without losing data. If a new service is added, such as a data analytics service that analyzes buying patterns, it can start consuming messages from the orders topic without affecting the other services.
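Here is a sketch of what one of these consumers might look like (topic name, group names, and the kafka-python client are again assumptions): the billing service reads orders in its own consumer group, and a newly added analytics service could use the same pattern with a different group_id, starting from the oldest retained message.

```python
import json
from kafka import KafkaConsumer

def order_consumer(group_id: str) -> KafkaConsumer:
    # Each service uses its own group_id, so each one gets its own view of the stream.
    return KafkaConsumer(
        "orders",                          # hypothetical topic name
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",      # a new service can replay retained history
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

billing = order_consumer("billing-service")
for message in billing:
    order = message.value
    # ... charge the customer's card and generate an invoice here ...
    print("billing processed order", message.key)

# The inventory or analytics service would run the same loop with a different
# group_id, e.g. order_consumer("inventory-service"), reading the very same messages.
```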
In summary, Kafka serves as the central nervous system for your order processing, reliably handling a massive flow of data and enabling decoupled services to work together seamlessly.
Let us now look at a few more advanced concepts found in Kafka:
1. Batches. For efficiency, producers don't send messages one by one; instead, they group them into batches. You could think of it as sending a bunch of letters in one package rather than mailing each letter separately.
2. Partitions represent one of the most important ideas in Kafka. Topics are broken down into partitions, and each message is assigned to a partition based on its key. Partitioning the topics allows for better overall scalability. Message order is guaranteed per partition, not across the whole topic - although this might seem like a downside of Kafka, it's actually necessary to achieve such scalability. Messages with the same key go to the same partition, so their correct timeline is preserved.
Continuing the truck example from before, if we wish to see the route of one particular truck, its messages will be in order: the truck has a particular ID, and all messages with that ID are stored in the same partition.
3. Consumer groups. Multiple consumers can read in parallel from the same topic by forming a consumer group. Within a group, each partition is read by at most one consumer, which allows for better speed. Think of it like a team of workers where each worker is assigned a specific task, so that all tasks are completed without duplicating effort. The sketch after this list shows all three ideas in code.
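A minimal sketch tying these three ideas together, again assuming the kafka-python client, a broker at localhost:9092, and made-up topic and group names:

```python
from kafka import KafkaProducer, KafkaConsumer

# 1. Batches: let the producer wait up to 50 ms and fill batches of up to 32 KB
#    before sending, instead of shipping every message on its own.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=str.encode,
    batch_size=32 * 1024,
    linger_ms=50,
)

# 2. Partitions: messages with the same key always land in the same partition,
#    so all positions of "truck-042" keep their relative order.
for i in range(100):
    producer.send("truck-locations", key="truck-042", value=f"position {i}")
producer.flush()

# 3. Consumer groups: consumers sharing a group_id split the topic's partitions
#    between them; within the group, each partition is read by at most one consumer.
def make_worker() -> KafkaConsumer:
    return KafkaConsumer(
        "truck-locations",
        bootstrap_servers="localhost:9092",
        group_id="location-processors",
    )

worker_a = make_worker()
worker_b = make_worker()  # in practice these would run as separate processes or pods
```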
We've mentioned a few of the benefits Kafka brings to the table throughout this article, but let's get into them a bit more.
The concepts outlined before - batches, partitions, and consumer groups - allow for efficient, scalable, and fault-tolerant message processing. Batches optimize network and disk usage, partitions provide parallelism and scalability, and consumer groups enable load balancing and fault tolerance among consumers.
That is why Kafka can handle large volumes of data quickly, making it ideal for real-time data processing, among other things. Kafka offers high throughput, low latency, scalability, and reliability. You can also integrate it with other tools, from databases to messaging systems to data processing frameworks such as Hadoop, Spark, and Flink, which makes Kafka about as versatile as a tool can be.
Apache Kafka can be used in several industries and their respective use cases, such as retail (product recommendations, supply-chain optimization, inventory management), banking (real-time fraud detection, cybersecurity), healthcare (real-time monitoring systems), IoT (real-time data processing), and more.
At eSolutions, our experience with Kafka spans numerous projects: facilitating asynchronous communication between microservices within event-driven architectures, event streaming and stream processing (such as constructing real-time stock aggregates with Kafka Streams), and data ingestion and change data capture tasks. The case studies outlined below showcase a selection of the projects we've delivered, in which we've seamlessly incorporated Kafka into the solutions we provided to our partners.
ReefBeat / ReefWave - Smart Reef Pumps System
ReefBeat / ReefLED - Smart Reef Lighting System
On-Premise Big Data Platform for Carrefour
Regina Maria On-Premise Big Data Platform
Viorel Anghel has 20+ years of experience as an IT Professional, taking on various roles, such as Systems Architect, Sysadmin, Network Engineer, SRE, DevOps, and Tech Lead. He has a background in Unix/Linux systems administration, high availability, scalability, change, and config management. Also, Viorel is a RedHat Certified Engineer and AWS Certified Solutions Architect, working with Docker, Kubernetes, Xen, AWS, GCP, Cassandra, Kafka, and many other technologies. He is the Head of Cloud and Infrastructure at eSolutions.