How does Kafka decouple services?

Throughout my career, I’ve heard “Kafka decouples services” repeated over and over, almost like a mantra, without ever hearing how it actually helps with decoupling. I’ve been guilty of this too, telling my mentee exactly this without fully understanding the reasons behind it.

It became clear to me while reading the incredible book Foundations of Scalable Systems. In this post, I’ll go over the reasons why Kafka decouples services (because yes, it does).

Why the traditional REST API leads to service coupling

Having one service call another through a REST API introduces coupling because the caller depends on the callee’s response time; this is the nature of HTTP requests. Since your service needs to wait for the other to respond before proceeding, threads end up blocked while waiting for the response. The blocked-thread part is easily avoided with the lightweight tasks modern languages offer (e.g., Go goroutines, Kotlin coroutines, Rust tokio::task), although the request itself still has to wait for the response.
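To make that concrete, here is a minimal sketch using Python’s asyncio. The call_other_service function is a hypothetical stand-in for the HTTP call (a sleep simulates the response time), so no real network is involved. Ten requests share a single thread without blocking it, yet each one still takes as long as the service it calls.

```python
import asyncio
import time


async def call_other_service(delay_s: float) -> str:
    # Hypothetical stand-in for an HTTP call; the sleep simulates
    # the other service's response time.
    await asyncio.sleep(delay_s)
    return "response"


async def handle_request(delay_s: float) -> float:
    # The task yields the thread while waiting, but the request
    # cannot finish before the other service answers.
    started = time.monotonic()
    await call_other_service(delay_s)
    return time.monotonic() - started


async def main() -> list[float]:
    # Ten concurrent requests on one thread.
    return await asyncio.gather(*(handle_request(0.2) for _ in range(10)))


started = time.monotonic()
latencies = asyncio.run(main())
total_s = time.monotonic() - started
```

The total time stays close to a single call rather than ten calls back to back, which is the thread-blocking problem solved. The per-request latency is untouched, which is the coupling that remains.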

Now, if the API you’re calling is down, you have to deal with it properly. Do you want to keep sending requests, perhaps even with retries, while the service is timing out? That has two consequences: your app hangs while waiting for the timeout, so your users sit through those 2-5 second waits, and you make it harder for the API to recover.

How would that make recovery harder? Imagine the API has 2 overloaded instances. A third instance coming up would immediately be hit with tons of requests, potentially getting overloaded right away, with no time to warm up its cache.

So, you would want to have:

circuit breakers (i.e., stop calling the API for some time once it starts failing)
fail-fast requests (i.e., treat a request as failed when it takes longer than, say, 1 second)
bulkheads (i.e., allocate X threads for calling this API and keep separate threads for the other operations the service does)
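The three patterns can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a production implementation: the class and function names, the thresholds, and the pool size of 4 are all made up for the example, and in a real service you would likely reach for an existing resilience library.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote API."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("skipping call, circuit is open")
            # Cool-down elapsed: allow one trial call. A single failure
            # will open the circuit again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Bulkhead: calls to this API only ever use these 4 threads, so a slow
# API cannot starve the rest of the service.
api_pool = ThreadPoolExecutor(max_workers=4)


def fail_fast(fn, timeout_s=1.0):
    # Fail fast: stop waiting after timeout_s. Note that cancel() cannot
    # interrupt a call that is already running; the worker thread stays
    # busy until fn returns, the caller just no longer waits for it.
    future = api_pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()
        raise
```

A caller would combine them along the lines of breaker.call(fail_fast, fetch_purchases), where fetch_purchases is whatever function performs the actual request.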

All of these also apply to WebSockets and SSE. Those are not comparable to Kafka: they are designed for shorter-lived message subscriptions, suited for websites and the like, rather than for service-to-service communication.

How does Kafka avoid coupling?

You can see that none of the issues above with REST apply to Kafka. Your app doesn’t hang waiting on a response, because you consume whenever there is a new message (and only a few threads/tasks do the consuming). You get the data when it becomes available, rather than asking for it when you need it.

With that, you end up not having to deal with the case of the service being down. You aren’t even aware of it, since all you know is whether records are arriving on the topic or not (although this can cause an observability problem when you’re not receiving any records and you’re supposed to).
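One partial mitigation for that observability gap is to track how long the topic has been silent and alert past a threshold. The sketch below is only a heuristic under my own assumptions: TopicFreshness and max_silence_s are hypothetical names, and silence can just as well mean that nothing happened upstream, so this only fits topics with a reasonably steady flow.

```python
import time


class TopicFreshness:
    """Tracks how long it has been since the last record arrived."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.clock = clock
        self.last_record_at = clock()

    def record_seen(self) -> None:
        # Call this from the consumer loop for every record received.
        self.last_record_at = self.clock()

    def is_stale(self) -> bool:
        # True when the topic has been silent for longer than expected.
        return self.clock() - self.last_record_at > self.max_silence_s
```

A periodic health check could then call is_stale() and raise an alert, without the consumer ever calling the producing service directly.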

Take a scenario where you want something like the purchases of a user account from a given shopping service. Instead of requesting them when you need them, you could use Kafka: the shopping service publishes purchases as they are made, and your app consumes those purchases and feeds them into a database (or cache, data lake, data warehouse, etc.). Then, when your app needs that information, it simply queries said database.
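Here is a rough sketch of that flow. To keep it runnable without a broker, a plain list of JSON strings stands in for the records a Kafka consumer would hand you, and an in-memory SQLite database stands in for your store. The table layout and the field names (purchase_id, user_email, item) are assumptions for the example.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE purchases ("
        "purchase_id TEXT PRIMARY KEY, "
        "user_email TEXT NOT NULL, "
        "item TEXT NOT NULL)"
    )
    return db


def consume(db: sqlite3.Connection, records) -> None:
    # `records` stands in for the messages read from the purchases topic.
    for raw in records:
        purchase = json.loads(raw)
        # Keyed on purchase_id, so a redelivered record overwrites
        # itself instead of creating a duplicate row.
        db.execute(
            "INSERT OR REPLACE INTO purchases VALUES (?, ?, ?)",
            (purchase["purchase_id"], purchase["user_email"], purchase["item"]),
        )
    db.commit()


def purchases_of(db: sqlite3.Connection, user_email: str) -> list[str]:
    # The read path never touches the shopping service.
    rows = db.execute(
        "SELECT item FROM purchases WHERE user_email = ? ORDER BY purchase_id",
        (user_email,),
    )
    return [row[0] for row in rows]
```

The point of the design is that purchases_of keeps working even while the shopping service is down; it just returns whatever had been consumed up to that moment.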

What’s even better about this is that you can enrich the data there. Imagine you don’t just want the user’s email and purchase; you also want to record whether your app showed the user a given ad that day (for marketing reasons). You can include that when adding the row to the table.
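As a small illustration of that enrichment step, the function below adds an ad flag to each purchase before it is stored. The field names (user_email, purchased_on, ad_shown_that_day) and the idea of keeping shown ads as a set of (email, day) pairs are assumptions made up for this sketch.

```python
from datetime import date


def enrich(purchase: dict, ads_shown: set[tuple[str, date]]) -> dict:
    # `ads_shown` holds (user_email, day) pairs your own app recorded
    # whenever it displayed the ad to a user.
    day = date.fromisoformat(purchase["purchased_on"])
    return {
        **purchase,
        "ad_shown_that_day": (purchase["user_email"], day) in ads_shown,
    }
```

The enriched dictionary is what you would write to the table, so later queries get the marketing flag without joining against another system.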

Other options for decoupling

It’s worth mentioning that the above doesn’t apply only to Kafka; it also applies to RabbitMQ and similar solutions.

In a nutshell, RabbitMQ is also a message broker, but it is more tailored to complex message routing, rather than simply subscribing to a topic and receiving everything. Read this great answer for more details.

It’s not perfect

Although these options decouple the services, they are not a one-size-fits-all solution. Firstly, this is only an option when you have some influence over the technology used by the other service. If your team or company owns it, this is possible; if not, you need a good collaboration model with the other company.

Secondly, as I mentioned for Kafka, when the service is down, you simply aren’t aware of it (and there is no clean way of finding out). If the service belongs to your team, this isn’t an issue, but the further the service is from you, the more troublesome it gets.

And finally, the data only eventually reaches your app. This offers numerous performance benefits, but it can be a UX (user experience) nightmare. In essence, the user makes a purchase and then doesn’t see it until the service has produced it to Kafka and the app has consumed it. This wouldn’t happen in the REST scenario because it’s blocking: the user would wait until the write request completed (the server wait) and only then read the result, and that read would go to the service itself (hence the read from primary in the linked article).

Final note

In essence, Kafka and RabbitMQ provide a way of decoupling services, which is even more crucial when you own all those services (in a microservices architecture).

Beware, though, that you don’t always need a microservices architecture. If your team handles both business areas (like the account registrations and shopping purchases), why would you introduce two services and have a Kafka cluster?

This can make sense if one of them deals with a lot of traffic (to avoid one business area disrupting the other), although a bulkhead could deal with that too.

It can also make sense if they are completely independent from one another (in which case a Kafka cluster possibly wouldn’t be needed either).

Just leaving some food for thought on that.