June 23, 2022 | Episode 221


COMMON APACHE KAFKA MISTAKES TO AVOID



KRIS JENKINS: (00:00)

Hello, you're listening to Streaming Audio. In this week's episode, I sat down with Nikoleta Verbeck to talk about performance and tuning, monitoring, and how Kafka works under the hood to run efficiently. All those things that either you know before you deploy, or you are forced to learn really quickly after you've deployed. And frankly, we couldn't have got a better guest for this. Nikoleta is absolutely amazing. She's a gold mine of information. And I just got to spend this episode hitting her with an axe and collecting valuable nuggets.


KRIS JENKINS: (00:36)

That's probably not the best metaphor I could have gone with, but you get my
point, right? This is one episode where I can guarantee you you're going to
learn something. And you're going to learn something you didn't know, and you're
going to learn something you didn't know you needed to know. Let's just skip my
usual introduction and dive straight in. Streaming Audio is brought to you by
Confluent Developer. And I will tell you all about that at the end. Grab your
notebooks folks because here we go.


KRIS JENKINS: (01:07)

My guest today is Nikoleta Verbeck, who is a principal solutions architect at Confluent. Nikoleta, welcome to the show.


NIKOLETA VERBECK: (01:15)

Hey, nice to be here.


KRIS JENKINS: (01:17)

Good to have you. My first question is what's a principal solutions architect?
What do you actually do?


NIKOLETA VERBECK: (01:23)

Yeah. Our professional service team kind of handles any troubleshooting, any
problem solving, any architecture design, helping you devise a solution to use
cases you have. And so that's kind of what I do all day, is help customers
realize their use of Kafka in event streaming.


KRIS JENKINS: (01:45)

Is that mostly troubleshooting, or figuring out how they're going to tackle new projects? Is it a mix?


NIKOLETA VERBECK: (01:53)

A little of both, actually. We spend a lot of time problem-solving issues they're currently having. Maybe they've grown bigger and now they're trying to figure out how to scale what they've already built, or this is their first time with Kafka and they want to do it right out of the box. And so we'll help them all along the way.


KRIS JENKINS: (02:14)

Okay. You must have some stories.


NIKOLETA VERBECK: (02:19)

I have quite a few.


KRIS JENKINS: (02:20)

Okay. Well, we're going to get into those. But so, we got in touch and said, "Do you have any suggestions for how to run Kafka properly?" Right. And [inaudible 00:02:32]. And wow, do you have suggestions. You've sent us a laundry list. We're going to go through this, and I know I'm going to learn a lot, so I'm sure everyone else will. Let me pick one at random. Let's start with producers. One of the things you said is, "It's a problem to use multiple producers in a single service," which took me by surprise.


NIKOLETA VERBECK: (02:57)

Yeah. Yeah. This tends to be a lot of people migrating to Kafka from things like RabbitMQ or a lot of your traditional message queue systems, where they push a model of just creating new clients for every single topic or queue that you want to write to. While in Kafka, our Kafka producer is a thread-safe producer, especially in the Java realm, and it's actually able to talk to many topics with one instance. We treat the topic as metadata. We treat the partitions as the actual shards of the data set. And we do a lot of interesting things behind the scenes to improve that throughput. And the biggest thing of that is this notion of batching. And we'll actually batch multiple records together as they target a particular topic and partition.


NIKOLETA VERBECK: (03:57)

But we're also able to batch those batches up together as they target a particular broker in the cluster. If we're able to share topics, even if it's a single topic or multiple topics, we can put them together and get a little bit better throughput, because now we're bundling every partition that was on that node together, targeting that particular broker, sending it across, and we're reducing our broker workload. Because a typical thing most people think about is records per second. Well, the broker doesn't measure its performance in that; it measures its performance in requests per second, not records. And so if we can get more batching in each of those two different segments, the batch of records per partition and the batch of records per node, we get higher throughput, we get fewer requests per second, and we get less load on the broker to perform all that.
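
To make what she's describing concrete, here's a minimal sketch, using the Java kafka-clients API, of one shared, thread-safe producer writing to several topics. The topic names, keys, values, and broker address are invented for the example:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SharedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // One KafkaProducer instance is thread safe and can serve every
            // topic in the service; batches headed to the same broker get
            // bundled into a single request behind the scenes.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "o-1", "order-created"));
                producer.send(new ProducerRecord<>("payments", "p-1", "payment-received"));
            }
        }
    }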


KRIS JENKINS: (04:58)

Right. And you don't normally have that much control from a programmer's
perspective over which broker you're connected to. Right?


NIKOLETA VERBECK: (05:05)

Not typically, because... you can kind of manipulate where partitions live and try to isolate common partitions and topics together on common nodes to help that. You can also override with your own partitioner if you choose. But no, not too much control. At least having them together saves you a ton of performance potential.


KRIS JENKINS: (05:33)

Is this fair? You're better off, rather than trying to control which broker you're connecting to, having a single producer that's going to manage that for you.


NIKOLETA VERBECK: (05:46)

Yeah. Yeah. Yeah. Because now we're aggregating topics, we're aggregating partitions, we're aggregating connections. I mean, you look at Confluent Cloud and you look at what we're measuring you on, your performance, and CKU values: that's connections, that's bandwidth, that's requests-per-second load on the backside and stuff. This helps minimize that cost margin and increase your performance capabilities.


KRIS JENKINS: (06:19)

See, I'm often interested in things like throughput for real-time systems. Batching is great, but, sorry, latency. That's what I'm trying to get to: what's the trade-off between throughput and latency with batching? How do you decide what to do there?

NIKOLETA VERBECK: (06:38)

Yeah. That comes down to a few adjustments. A lot of the time, it's adjusting your linger.ms and your size boundaries, because in all things Kafka, we typically have the two boundaries, time and size. It's manipulating those, but there's always a base cost. It's going to cost me X amount of time, because typically we think of latency as the time to send a message from your producer to your broker and your broker to your consumer and such. And that's just going to have a flat cost.


NIKOLETA VERBECK: (07:15)

And so it's finding what that cost is and adjusting those thresholds to meet it, either on par or halfway through, because we are asynchronous, so we can have more in flight. It's adjusting that to where we're having enough requests in flight, but not too many requests in flight at the same time, and we're taking advantage of the fact that it's going to cost us 30 seconds, well, or 30 milliseconds. We're now able to say, "Well, if my baseline is 30 milliseconds, maybe I set my linger at 15. Maybe I adjust for that aspect." And now-


KRIS JENKINS: (07:56)

I think we should pause there. You should explain to me and all of us what
linger.ms is, because not everyone's going to know that.


NIKOLETA VERBECK: (08:05)

Yeah. Linger.ms is one of the primary Kafka producer settings, and it controls the time boundary of batching. The Kafka producer always attempts to batch records by topic and partition before sending them off. We have the size boundary, but linger.ms tends to be the time boundary, and most people run into it first because, again, they're thinking of latency and time: let's adjust our time. But that controls the notion of, when we open a new batch, how long am I willing to leave that batch open before I consider it closed and ready to send?
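
In the Java producer, the two boundaries she mentions map onto two configuration properties. A sketch, continuing the props object from the producer example above; the values are purely illustrative, not recommendations:

    // Batching boundaries on the Java producer. Defaults are
    // linger.ms=0 and batch.size=16384 (16 KB).
    props.put("linger.ms", "15");                        // time boundary: wait up to 15 ms for a batch to fill
    props.put("batch.size", String.valueOf(64 * 1024));  // size boundary: close the batch at 64 KB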


KRIS JENKINS: (08:48)

Right. Say I decide I'm going to set linger.ms to the very minimum, and that way it will send every record immediately, which sounds great. Is it a bad idea?


NIKOLETA VERBECK: (09:01)

Yeah, it can be. I mean, there are some use cases out there where doing that is perfectly fine. Typically, those are seen in our IoT solutions, things like that, where you're trying to get the minimal latency possible. You're willing to have massive Kafka clusters to do so, even though you might not be sending a whole lot of data in terms of size, but you want that minimal latency. You don't care about ordering. You don't care about a number of things to get that latency. But in 90% of the cases out there, you're going to want to set that linger. You're going to want to balance that with your cost.


NIKOLETA VERBECK: (09:49)

What's it going to cost me to send across my network, things like that? And get that batching, because that batching is going to increase performance and reduce the workload on your brokers. Not only is it going to help the producer, it's going to help the load the brokers are under. Because now I'm getting more records into a single batch, and that single batch into a request, or multiple batches into a request to that broker. And now I'm not sending one request for every single record, which is the worst variant for the broker.


KRIS JENKINS: (10:24)

Yeah. Right. Almost, that's kind of analogous to the transactional overhead that you're paying for each batch. Okay. Okay.


NIKOLETA VERBECK: (10:36)

Exactly.


KRIS JENKINS: (10:36)

That makes sense to me. Before we get off that, are there any signs I should be
looking for that I've set linger.ms badly? What's the symptom?


NIKOLETA VERBECK: (10:56)

Usually the symptom, a lot of the time, is, A, kind of to the point earlier, a lot of people don't even know linger.ms exists to begin with. We're used to plugging in the Kafka client and just rolling with it, and typically out of the box it's perfect. Like, a lot of use cases just run out of the box, just fine. If you're still running on the defaults, you probably need to go and evaluate it from that perspective. That's usually kind of our first look: let's go look at these properties you're setting, are you adjusting them? The other is looking at the metrics the producer is going to report, and this is one of the big things too: make sure you're monitoring everything. Kafka, in our Java realm, kicks out a ton of metrics into MBeans and JMX, same with the brokers.


NIKOLETA VERBECK: (11:51)

And some of those metrics are actually, what is your number of records per request in terms of counts. And you can back that out to a one-minute rate and stuff like that, and actually see, am I getting a good record-to-request ratio? And knowing, am I doing well there? Because that's going to boil down to really looking at your brokers, and looking primarily at two metrics that come out of there. One is the request handler idle thread percentage, and the other is that request handler queue, right? Because these are the threads, and this is the queue, that handle the requests that the producers and consumers and admin clients and everything talking to Kafka have to pass through. And if that queue is filling up and those threads aren't very idle, that means you're potentially overworking your brokers. And let's see if we can tune that and get more out of those brokers for their current value.
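
On the client side, those producer metrics can be read programmatically as well as over JMX. A small sketch against the Java producer's metrics() map; records-per-request-avg and batch-size-avg are real names in the producer's metrics group, and the broker-side counterpart she mentions is the kafka.server KafkaRequestHandlerPool RequestHandlerAvgIdlePercent MBean:

    // Spot-check batching health from inside the application.
    producer.metrics().forEach((name, metric) -> {
        if (name.name().equals("records-per-request-avg")
                || name.name().equals("batch-size-avg")) {
            System.out.printf("%s = %s%n", name.name(), metric.metricValue());
        }
    });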


KRIS JENKINS: (12:59)

My first port of call would be to do larger batches, so that the queue deals with more in each step. Is that right? Okay.


NIKOLETA VERBECK: (13:09)

Exactly. Yeah. That's usually the biggest starting area: let's get the batches better, and keep adjusting. We might figure out what a good batch size is at the time, but a year from now maybe you've doubled or tripled or quadrupled the number of records you're trying to produce, so let's reevaluate it. It's something that's never going to stop changing, and you should probably look at it over time.


KRIS JENKINS: (13:40)

I mean, it seems like one of those things where, in a lot of cases, people are writing software to produce to Kafka and there's no natural batching in their domain. Do you think-


NIKOLETA VERBECK: (13:51)

Not typically. A lot of them don't think about the batching, they're just trying to get the records across. They might be thinking of it in terms of ordering by the key, things like that. So it's kind of an abstracted-away thing that we put in these smart Kafka producers and stuff. It's just a matter of adjusting and tuning for that.


KRIS JENKINS: (14:20)

Yeah. So you got to start thinking of it as you grow, is that fair to say?


NIKOLETA VERBECK: (14:24)

Yeah.


KRIS JENKINS: (14:27)

Well, that's fair. I expect to learn more about the internals of a system as I
grow with it.


NIKOLETA VERBECK: (14:33)

Yeah. That's what makes learning new tech fun: you get to start out with it, you get to start playing with it. And then as you start getting more workloads on it, you get to discover all those little intricacies that you didn't expect to discover when you first started out. How does this work under the covers? I want to go adjust the engine and play. Yeah. It's kind of the heart of being an engineer. It's getting to play and learn and experiment.


KRIS JENKINS: (15:00)

Yeah. You've always got to know one step below the covers to do a really good job, I find. Yeah. Okay. I'm going back to your laundry list of things that can go wrong. You said not enabling compression, which sounds fair to me, but where does compression matter? And there are choices of compression. Which kind of compression do you enable?


NIKOLETA VERBECK: (15:29)

Yeah. Kind of to that aspect from earlier, it's one of those settings that out of the box is not enabled. We don't have compression on by default, but we do offer four different kinds of compression, and it's something you should turn on. That compression happens at that batching layer. All those records that you get into that batch targeting that partition, we're able to compress those down at that time, before they're transmitted across the wire. That alone reduces our network overhead, our network consumption, which is important, especially in clouds where you pay for that network, for how many bytes you're sending across. But it also counts on the other side, which goes unnoticed by a lot of people, which is the consumer. That producer's compressing it; that consumer is the one who actually gets to deflate it. Now not only am I saving bandwidth going to Kafka, I'm also saving bandwidth coming out of Kafka.

NIKOLETA VERBECK: (16:33)

I'm also saving my disk consumption. Now I'm able to fit more into a smaller space on the disks of every broker. I'm able to handle things a little smoother if you've got SSL turned on. We all know Kafka and its zero-copy capabilities; well, once TLS is there, zero copy kind of goes out the window. Well, compression's also going to help there. That's less that I have to copy, less that I have to move, less that I have to stream to disk. So you get some performance gains there. But when setting it, there are some trade-offs. We're going to have CPU cost. That producer's going to take the brunt of the compression cost, and that consumer's going to take the brunt of the decompression cost. And so that's why we offer the options we do for compression: what sacrifice are you willing to make?


NIKOLETA VERBECK: (17:33)

And we have things like LZ4, which has really fast compression with the benefit of reduced CPU cost. But then you have all the way up to something like gzip, which is going to get you the best compression you can, but with a lot more CPU cost. And then there are some of the newer ones: with a lot of the newer Kafka clients, you actually have access to ZSTD now, which is kind of a balancing act between those two different compression capabilities. It's figuring out, well, I do want the utmost compression that I can, but I'm willing to give a little more CPU to my producer, so you can account for that. But in the grand scheme of things, I can typically scale producers a lot cheaper than I can scale my brokers. And so that's something to keep in mind while you're thinking through those cost measurements.
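
Compression is a single producer property. A sketch, again continuing the props object from the earlier producer example; note that zstd support needs reasonably recent clients and brokers (Kafka 2.1+), so the codec choice is workload-dependent:

    // Pick one of Kafka's built-in codecs: "gzip", "snappy", "lz4", "zstd".
    // Compression is applied per batch on the producer, so it composes with
    // the linger.ms / batch.size tuning shown earlier.
    props.put("compression.type", "zstd");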


KRIS JENKINS: (18:30)

You are offloading that compute cost to the producer side and the consumer side?


NIKOLETA VERBECK: (18:34)

Mm-hmm (affirmative).


KRIS JENKINS: (18:35)

Yeah. And I can see how that would free up the broker. So that brings up two questions, but let's start by clarifying the first one. When you enable compression, you are compressing a batch of records on the producer side, the compressed data goes over the wire, and it's actually stored as a compressed batch of records on the topic?


NIKOLETA VERBECK: (18:57)

Correct.


KRIS JENKINS: (18:58)

And it will be shipped back out as the same batch and not actually decompress
till it reaches the consumer side?


NIKOLETA VERBECK: (19:06)

Correct. Yeah. The producer's able to... It's going to put the whole batch, or attempt to put the whole batch, in the request, depending again on size boundary constraints. But that's going to get it over there. We're going to write it to disk. Part of that payload is a bunch of metadata. I'm going to have my raw record compression, but then I have metadata that's going to say, "Okay, here's the number of records that are in this batch. Here's any extra information I need to know about it, the size, things like that." That's going to get on the broker, and that's how that broker's going to know how to handle it on the consumption side.


NIKOLETA VERBECK: (19:46)

And on the consumption side, that batch might get in one request, or it might need to take two fetch requests. Again, that all depends on size tunings. Those are some things you can start messing with when you start getting into deep performance tuning: "Well, I know my average batch size is this; well, let me go adjust my fetch size on the other side to account for that and try to minimize splitting and things like that." But that's deep performance, really trying to squeak out as much as possible.


KRIS JENKINS: (20:20)

That sounds very deep in the weeds. Just to be clear, this isn't a day one
concern, right?


NIKOLETA VERBECK: (20:27)

No, no.


KRIS JENKINS: (20:29)

Good.


NIKOLETA VERBECK: (20:29)

This is, I've bought million dollar hardware and I really want to see where I
can take it.


KRIS JENKINS: (20:36)

Right. Yeah. Because you get to work with clients of all those sizes?


NIKOLETA VERBECK: (20:40)

Oh yeah. I mean-


KRIS JENKINS: (20:41)

Which must be fun.


NIKOLETA VERBECK: (20:42)

... between customers that are running in cloud and customers that are running on hardware with very expensive RAID arrays, or going straight JBOD on NVMes. There's all kinds out there.


KRIS JENKINS: (20:57)

I can believe it. The other question that came to mind from that is, actually, two questions. Oh, God, you're making me think of so many questions. That's my job. That's okay. I'm allowed to ask you as many questions as I like.
First one is, if you are compressing on the producer side and maybe you don't
know this, maybe you don't know the answer, but there are plans to have schema
validation from schema registry on the broker side, right?


NIKOLETA VERBECK: (21:24)

Yep.


KRIS JENKINS: (21:24)

If it's compressed, how's that going to play in?


NIKOLETA VERBECK: (21:30)

It gets a little more complicated in those. The broker, for the most part, is scanning through those records. And for those that don't know how the schema registry and that kind of thing works: one of the things we put in there is this magic byte. It's attached to every single record at the beginning of its serialized data set. And so we're able to just scan through that real quickly, because part of that metadata is the offsets and where things are at. So we're actually able to pick that up, get that metadata, and go, "Oh, this is its ID.


NIKOLETA VERBECK: (22:07)

I know that based on this topic, its subject naming strategy is this." Now I can go look that up and evaluate: okay, is this schema ID the correct and valid schema ID for this topic, in its order and stuff like that? Because in schema registry and all that, it's a subject that contains many schemas as the version history goes, and the schema ID is what we put in the metadata there with that magic byte. And I'm sure there are a lot more details that I'm skimming over and simplifying on the broker side, but in layman's terms, that's kind of what's happening.


KRIS JENKINS: (22:51)

Okay. We'll also pull people in from the schema registry team for the podcast at
some point. But the other thing I was going to ask you, and this is really in
the weeds, but I'm going to ask you anyway. Because you've got an overhead,
you've got computation overhead for compression, you've also got one for
encryption. Right?


NIKOLETA VERBECK: (23:07)

Mm-hmm (affirmative).


KRIS JENKINS: (23:08)

Do you know what the balance of that tends to be?


NIKOLETA VERBECK: (23:13)

It's kind of one of those where your mileage is going to vary, and you really want to, to my earlier point, monitor everything. Even if you think you don't need it right now, you'll need it eventually. And it's better to have it now and be able to see that history when it comes up. Otherwise it's going to be, "Well, now I'm scrambling. I'm trying to get some stuff out. I didn't monitor this, now I've got to go get it in there, and now I don't have any history, I have just now." And that's one of those things that's going to help you with that balancing act of, "When I turn these things on, how did my CPU profiles change? How did my load profile change? Do I need to add more nodes to my producer pool or my consumer pool?" You know, you won't know if you don't measure it.


KRIS JENKINS: (24:10)

Yeah. Yeah. I've definitely been in that situation where you think you've got a performance problem and you think, ah, these look like things I should monitor, they're important, I'll start monitoring those. And there's no help at all for a couple of weeks, until you've actually got some data in.


NIKOLETA VERBECK: (24:24)

Oh yeah. Yeah. I mean, to that point, if you ever talk to any of our people at Confluent, we do have sample dashboards for Grafana and things like that, and Prometheus configurations, and YAML configs that come with our Ansible and things like that, which can drastically help get you off the bat with best practices on that monitoring aspect, with minimal effort.


KRIS JENKINS: (24:55)

Yep. Yeah. There's probably something out there for whatever tool stack you use
to monitor at the moment, right?


NIKOLETA VERBECK: (25:00)

We're trying to build out quite a bit. I know we've built a lot around Prometheus and Grafana, because those are kind of the de facto anymore, and then into your Elasticsearch, and Splunk, and AppDynamics, and New Relic, and Datadog. And so there are dashboards for a lot.


KRIS JENKINS: (25:19)

Yeah. Cool. Another one from your producer list, which, again, like compression, it's like, yeah, of course, but give me the details, Nikoleta: not defining a producer callback. Because when I produce a record... Explain what it is first, to the people that don't know, and then tell me how you should do it properly.


NIKOLETA VERBECK: (25:42)

Yeah. When you use the Kafka producer, you can send just your Kafka record in it. That's typically where most people start: that's your key, value, headers, timestamp, all that kind of stuff. But there's a second attribute you can optionally give, and that is the producer callback. And while the Kafka producer is going to try to get that record over there as best as it can, using all of its little retry policies and attempts and stuff that it has baked into itself, there are inevitably times where that's just not going to happen. Whether that's brokers being down for a long period of time and the retries getting exhausted, or you get into this concept of back pressure, and it's not something most people get into right off the bat, you need to have a way to handle that.


NIKOLETA VERBECK: (26:43)

And that's this callback and that callback allows you to subscribe to your
record and know that was it successful or was it not successful? And you get
that through typical callback structure, in Java it's an interface definition
with a single function, that's going to get back either the record that you
produced with some extra metadata, like which partition did it actually write
to? Which node? Things like that. But it also comes back with a potential,
optional exception and you want to evaluate that exception looking for, well,
what happened? And this can come into play, if you've set your linger too low,
you've lowered your retries too much. And you're actually erroring out. Because
if I don't capture that retry, well then I don't know my record didn't actually
make it.


NIKOLETA VERBECK: (27:40)

I assumed it did, but I didn't know. I exhausted retries. The producer tried to tell me, but I gave it nothing to be able to tell me with.
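
A minimal sketch of that callback in the Java client; Callback is a single-method interface, so a lambda works, and the logging here is a stand-in for whatever your application actually does with the signal:

    producer.send(record, (metadata, exception) -> {
        if (exception != null) {
            // Retries exhausted, or a non-retriable error: the record did NOT make it.
            System.err.println("Delivery failed: " + exception.getMessage());
        } else {
            // metadata says where it landed: topic, partition, offset.
            System.out.printf("Wrote to %s-%d@%d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    });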


KRIS JENKINS: (27:49)

Again, we need to get away from ship and hope.


NIKOLETA VERBECK: (27:53)

Yeah, exactly. But it's also a way to get that signal back from the broker side that says, "Hey, you're sending me too much. Everybody's sending me too much. I need to push back a little bit." That request queue is full, those request handlers are doing as much work as they can take on, maybe the network threads are maxed out; something on the brokers is slow. And that callback gives you that extra signal, on top of just failure, of, "Hey, you're timed out. We timed out for whatever reason. I need you to pause and handle that gracefully back to yourself."


NIKOLETA VERBECK: (28:39)

Whether that's maybe auto-scaling your cluster out a little bit further, to handle this pause and the cycle a little bit more. Maybe it's pushing further upstream from your application and saying, "Hey, can you slow it down a little bit? You're spamming the system." And it's a good measure to make sure you're doing both: capture that exception and make sure you're retrying. Maybe the error is, to your earlier point, that the schema you are using is wrong and actually not valid, and we are trying to tell you that this record will never get produced because it is invalid.


NIKOLETA VERBECK: (29:19)

And you want to capture that and be able to handle that. Whether that's used in a policy of a dead letter queue, sending it off to another topic. For those that don't know, dead letter queuing is sending it off to a secondary topic to be evaluated at a later time, whether that's through an automated system, or feeding it into an Elasticsearch log somewhere to search against. Or just saying, "Hey, I got this, I need to back pressure." Things like that. So there are a lot of things that can be captured out of that callback.
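
One way to act on that distinction inside the callback, sketched under assumptions: the "orders.dlq" dead letter topic and the backPressure hook are hypothetical, while RetriableException is the Java client's real marker type for transient errors:

    import org.apache.kafka.common.errors.RetriableException;

    producer.send(record, (metadata, exception) -> {
        if (exception instanceof RetriableException) {
            // Transient (e.g. broker overloaded): signal back pressure upstream.
            backPressure.signal(); // hypothetical hook into your application
        } else if (exception != null) {
            // Non-retriable (e.g. record invalid): park it on a dead letter topic.
            producer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
        }
    });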


KRIS JENKINS: (29:56)

Yes. Yeah. I've definitely got it in my mind for error handling, but I'm
definitely guilty of not considering it for things like back pressure.


NIKOLETA VERBECK: (30:07)

Yeah. Back pressure's usually a concept that goes overlooked, or just isn't heard of or known, until it becomes an issue. It is something I would encourage everyone to learn and understand, because as you grow, it is a concept you will run into, and you'll run into it a lot in distributed systems.


KRIS JENKINS: (30:29)

Yeah. Yeah. Do you have any tips? Because it feels like, you send a request, you produce a record, you send it off, and eventually, at some point, asynchronously, you'll get this thing saying, "Whoa, slow down." And it often doesn't feel like a natural thing to put in the programming model, to say, "I'm going to slow down on the new stuff because the old thing told me to slow down." Do you have any tips on how to actually program that?


NIKOLETA VERBECK: (30:56)

It depends on the use case. I've seen certain cases where they absolutely cannot pause the upstream for whatever reason, whether that's SLAs to third-party customers of the Kafka cluster or whatnot. And so they'll just start buffering to disk. You use something to buffer it to disk and hold it there until things catch back up. Or maybe, if this is happening frequently, this is probably a good case of: let's adjust our linger, let's increase that batching, let's try to get more out of it to handle that a little more gracefully.


NIKOLETA VERBECK: (31:40)

I've also seen instances where you absolutely can't even do the buffer to disk, you need guarantees. They'll just have a secondary Kafka cluster that's available off to the side, and that's the spill cluster, effectively.


KRIS JENKINS: (31:54)

Oh really?


NIKOLETA VERBECK: (31:55)

My first one's overtaken, let me spill over to this secondary cluster. I'll
start writing it there and we'll aggregate it on the backside when it comes to
it.


KRIS JENKINS: (32:06)

Okay.


NIKOLETA VERBECK: (32:07)

And this kind of happens with launch events. Think video games launching, where I need to have this guarantee. I've got my Kafka cluster, and while we can scale that up reasonably fast, that still takes time. I'm going to pay the cost; let me have just a secondary cluster off to the side, just in case. And so you have this spillover for it, because you don't know. I mean, sometimes your launch event goes really well. Sometimes it doesn't go as well as you thought it would, though. And sometimes it goes-


KRIS JENKINS: (32:36)

I've definitely seen that in the world.


NIKOLETA VERBECK: (32:36)

Yeah. Sometimes it goes beyond what you could even measure.


KRIS JENKINS: (32:44)

Yeah. And that's a nice problem to have, but you still want to solve it, right?


NIKOLETA VERBECK: (32:47)

Yep, exactly.


KRIS JENKINS: (32:49)

Yeah. Okay. So to summarize that then you are saying, you might have a
programmatic solution to back pressure, but at the very least you want to be
aware that it's a requirement to tune for?


NIKOLETA VERBECK: (33:00)

Exactly.


KRIS JENKINS: (33:01)

Yeah. Okay. Let's move on in your... We should publish this list as a handy reference guide for people. Maybe we'll do that in the show notes. But moving on on your list to the consumer side, right?


NIKOLETA VERBECK: (33:15)

Yeah.


KRIS JENKINS: (33:16)

And it feels like this is going to be a parallel to the producer side. You've said to me, there's a problem with using one consumer per thread in a service.


NIKOLETA VERBECK: (33:24)

Yeah. Kind of the same as on the producer side. I mean, we try to take advantage of those fetches. We try to take advantage of in-memory buffers, things like that. And it's not necessarily a requirement: while the consumer out of the box isn't itself a thread-safe thing, it is capable of working in that model, and that's usually what we encourage. A lot of the time, maybe the evaluation of a single record does take a while. Your first inclination, and it's kind of always been the push in Kafka, is, "Well, let me just add more partitions. Let me just add more consumers and I'll scale it that way." Well, you might still be one-threaded microservices, things like that. That's not always the best approach. Because more partitions, each partition does have a cost.


NIKOLETA VERBECK: (34:20)

Each consumer does have a cost. You have fault tolerances there to deal with and such. So moving to a multi-threaded model can help at a certain point. You don't want to get too many consumers out there, too many partitions, because it's going to affect some things, and I'll go down that in a second. But you can move to a model where I have one consumer, I thread that off, and the records coming out of the consumer go to a thread pool, and I manage that. This usually means you're starting to get a little more into the weeds of the Kafka consumer and how you tune it and manipulate it and do stuff like that.


NIKOLETA VERBECK: (35:03)

You're also going to have to start managing your own offsets at this point, because now you've got to do thread coordination. It's not safe enough to just say, "Well, if I got to my next poll, then that most likely means I can commit all the records that were just handled," because out of the box, that's what the consumer assumes. You're going to need to start managing your own offsets, keeping track of that. And when you do do that, make sure you start using the rebalance callback, which is on my list of other things.
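
A skeleton of that one-consumer, many-worker-threads model, with the application tracking its own offsets. This is deliberately naive, and running, process, and the consumer itself are hypothetical application pieces: it can commit an offset whose earlier records are still in flight, so a real version needs to track contiguous completion per partition, plus the rebalance listener discussed later:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.*;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    ExecutorService pool = Executors.newFixedThreadPool(8);
    Map<TopicPartition, OffsetAndMetadata> completed = new ConcurrentHashMap<>();

    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> rec : records) {
            pool.submit(() -> {
                process(rec);
                // A commit names the NEXT offset to read, hence offset + 1.
                completed.put(new TopicPartition(rec.topic(), rec.partition()),
                              new OffsetAndMetadata(rec.offset() + 1));
            });
        }
        // The consumer itself is not thread safe: commit only from this thread.
        if (!completed.isEmpty()) {
            consumer.commitAsync(new HashMap<>(completed), null);
            completed.clear();
        }
    }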


KRIS JENKINS: (35:36)

Yeah. But you can choose if it's time to go into that right now or push on?


NIKOLETA VERBECK: (35:41)

We'll come back to that one.


KRIS JENKINS: (35:43)

Okay.


NIKOLETA VERBECK: (35:45)

But kind of to the earlier point there, when I get too many consumers, I start getting to a point where each consumer's going to require at least one partition to get the parallelism of work. And now I get to a point where my fault vector kind of grows. I've got a lot more services out there, and if one of them fails, I've got to rebalance. Well, now my consumer group is so big that the rebalance can take a little longer. I've got to shuffle between all of these nodes, hundreds of consumers potentially, and that's going to take some time. So that means my time to recover is a little more delayed. That's going to affect my guarantees, latency guarantees, things like that.


NIKOLETA VERBECK: (36:28)

It's also one of the things that goes overlooked: well, now I've got more partitions, and that can actually affect the producer. To some of those earlier things we talked about, batching is a very big key to performance in Kafka. And well, now if I have more partitions, I'm going to have more of this concept of what I call spray. I'm going to have to spread my records across more brokers, across more partitions, and now I'm reducing my potential for a bigger batch.


KRIS JENKINS: (37:01)

Right. Yeah. Yeah. I've always got why you want more partitions, but now I start
to see why you want fewer as well.


NIKOLETA VERBECK: (37:10)

Yeah. Yeah. It's a balancing act. I mean, a lot of what people think about on this is, "Oh, just more partitions, I'll scale up." And that's kind of what everybody gets out of the box, but that's not necessarily the case. It's something to think about. Don't just always add more partitions. Think about it a little bit. Is there a better way to go about this that isn't just adding more partitions? While it's the simplest answer, it can be a costly answer.


KRIS JENKINS: (37:40)

Right? Yeah. And then it's not easy to change that number down the line.


NIKOLETA VERBECK: (37:45)

No. Yeah. Shrinking partition counts is a little harder than growing partition
counts.


KRIS JENKINS: (37:52)

Do you have any tips for that, if you have to?


NIKOLETA VERBECK: (37:56)

Not really. It's a-


KRIS JENKINS: (37:58)

It's that bad?


NIKOLETA VERBECK: (37:59)

It's a painful experience to begin with.


KRIS JENKINS: (38:00)

Okay.


NIKOLETA VERBECK: (38:03)

Because usually you're talking about new topics, different partition counts, possibly the use of a replicator or something to migrate data if you need to. It's not an easy endeavor to go down at the moment. Hopefully we will get to a point where we're able to do that for you, but that's not available at the moment.


KRIS JENKINS: (38:28)

Feels like there's a technical reason why it's hard, but there's also, you're
changing the semantics of ordering a bit?


NIKOLETA VERBECK: (38:34)

Yeah. I mean, you've got-


KRIS JENKINS: (38:37)

That's always going to be with us?


NIKOLETA VERBECK: (38:37)

Yeah. You've got the ordering faults, like, "Well, do I re-partition all this data that was in these partitions I'm trying to effectively delete, if I want to preserve that data?" And it has an adverse effect the other way, too: for those folks that care about ordering and stuff like that, adding partitions is a big event, potentially. And a lot of times you need to time it for a certain time of day, because you need to guarantee that order somehow, and when I add that partition, my ordering's going to be out of whack for a little bit while all those partition assignments and tables realign.


KRIS JENKINS: (39:16)

Okay. The best advice is to pick the right number first time as always?


NIKOLETA VERBECK: (39:24)

Yeah. And that's one of the things I start out with when people ask about that: in your development phase, in your staging phase, let's figure out what your unit of scale is first. How much is one consumer going to be able to do? See what that looks like. Is that X number of records per second? What does that look like? When I need to add that next consumer, what does that look like? What's that threshold? And go, "Okay, well, I know to date, I'm going to be dealing with this much data. I know it's going to be like this. That means I'm going to start out with this many consumers. I know in a quarter, two quarters, three quarters, ideally a year out, that my growth trajectory's going to look like this. And now that I know what my unit of scale is, I know how many consumers I'm possibly going to have in a year's time or half a year's time, depending on how far out you can forecast."


KRIS JENKINS: (40:25)

Yeah.


NIKOLETA VERBECK: (40:26)

That's going to give you a good idea of how many partitions you should start out with today. If I know in a year's time I'm going to go from 20 consumers to 100, I should probably start out with 100. Let's minimize those events of repartitioning and ordering adjustments and stuff to once a year.
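
As a worked example of that arithmetic, with every number invented purely for illustration:

    // Unit of scale measured in staging: one consumer handles ~2,000 records/sec.
    long perConsumer = 2_000;
    // Forecast volume a year out: ~180,000 records/sec.
    long forecast = 180_000;
    // Consumers needed, rounded up: 90. Since each consumer needs at least
    // one partition, start the topic at roughly 90-100 partitions today.
    long consumersNeeded = (forecast + perConsumer - 1) / perConsumer;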


KRIS JENKINS: (40:48)

Yeah, good. It's the first time anyone's ever given me a concrete plan of how to
guess this.


NIKOLETA VERBECK: (40:54)

It's the best approach I've figured out that tends to work: to the earlier point, let's monitor, let's know where we're at, let's evaluate this, and let's really try to send as much data as we can through. Let's baseline our application. Because a lot of people don't even know what the baseline of their application is to begin with. And if you don't know that, you don't know what the scaling capabilities are.


KRIS JENKINS: (41:21)

Yeah. It's often very hard to predict, but if you can simulate it, it gives you
a lot more intelligence.


NIKOLETA VERBECK: (41:27)

Oh yeah. Yeah. I mean, we have a number of tools that can help you simulate that data too. It doesn't have to be end to end. You can use the CLI tools or, if you really want to get into it, Trogdor, if you haven't played with that, comes with [inaudible 00:41:44]. That kind of-


KRIS JENKINS: (41:45)

I only know that name from Strong Bad. What's Trogdor?


NIKOLETA VERBECK: (41:49)

It is kind of a utility that comes bundled with Kafka that lets you really evaluate performance metrics. It usually goes unnoticed, but it is one of the major tools we use to evaluate the performance of every single release we do here, to make sure we're not backtracking on performance when we make changes, things like that. It can be something that helps you test your applications as well, beyond just the typical producer-perf-test CLI tool, or even JMeter. JMeter has some capabilities, for those that have used JMeter and like it; you can do Kafka stuff.


KRIS JENKINS: (42:32)

What's Trogdor actually doing to test?


NIKOLETA VERBECK: (42:37)

It's able to basically act as a producer or consumer at scale and really replay a number of different scenarios. Maybe my batching needs to look like this, my request rate needs to look like that, and you can really get into it.


KRIS JENKINS: (42:55)

That's-


NIKOLETA VERBECK: (42:57)

And I think we actually have a bunch of stuff on Developer about it.


KRIS JENKINS: (43:00)

Okay. We'll link to that in the show notes.


NIKOLETA VERBECK: (43:02)

It's on Developer, or a blog post?


KRIS JENKINS: (43:04)

That sounds like a huge topic in itself, but at least we know where to go to
start simulating these things.


NIKOLETA VERBECK: (43:13)

Oh yes. Performance simulation can be a couple hour long conversation.


KRIS JENKINS: (43:19)

Well, we'll move on, because we don't... We'll spin that out to a separate podcast, perhaps. But okay, going back to your list, to something that looks like my task board, actually: you've said it's dangerous to over-commit. Tell me about that.


NIKOLETA VERBECK: (43:37)

Yeah. Yeah, this starts a lot of the time when you get into that multi-threading of the consumer: you start managing your own offsets and you start picking a boundary for commits. A lot of people try to just do it at the end of the poll, or on every record, and stuff. And when you think about that, and think back to the producer aspect, well, every commit is a produced record. Every time I commit something, I've got to produce a record back to Kafka to commit that. It's usually a record going into the consumer offsets topic, for those that haven't seen it, which is a compacted topic, if you didn't know. We've got all these records going back, and if I'm over-committing, I'm over-producing. And the way the consumer does that is, one commit is one request to the broker, and getting to the other point, well, the broker handles requests.


NIKOLETA VERBECK: (44:37)

And so I'm putting work on it, on those aspects. So I can overload the work
queue and take away capabilities from just getting records in and out of the
broker to just handle offset commands. The secondary that a lot of people don't
even think about is, it's load that it's going to put on the compactor.


KRIS JENKINS: (44:58)

Oh, okay.


NIKOLETA VERBECK: (44:59)

Which is this backend thread that goes around compacting these compacted topics, cleaning them up, stuff like that. Well, if I'm over-committing, I've now got a ton of data that that compactor has got to crawl through and compact out to get to its final stage, and that's going to put load on it. Maybe it's not going to get to your other compacted topics as fast as you would like it to. Now you're dealing with excess tuning there.


NIKOLETA VERBECK: (45:27)

It's also going to put extra load on your disk volumes to deal with all that. It can cause a number of issues on the broker side if you're over-committing, and some performance loss on the consumer. That single sender thread that those consumers have on the back end, that talks to the broker, now has to handle all those requests and the return trips and all that, while also trying to facilitate fetch requests. So it can be a performance penalty [inaudible 00:45:55]-


KRIS JENKINS: (45:55)

So while you're committing, you're kind of blocked on fetching the next one?
Which makes sense. But-


NIKOLETA VERBECK: (46:02)

It's able to do both, but it's now juggling both the sends and the responses of both of those request types, right? The fetch consumer requests and the commit request types.


KRIS JENKINS: (46:17)

I'm pleased with that, because it's a very similar mental model: I have to worry about committing those offsets the same way I worry about producing. I feel like the same driving principle, keeping the two efficient, applies to both. So I don't have to think of two different principles at once, which makes me happy.


NIKOLETA VERBECK: (46:36)

Yeah. The only downside, again: you can't batch up commits into a single request. You're kind of stuck with one at a time, because if we tried to batch them, that kind of removes the point of trying to-


KRIS JENKINS: (46:51)

You're trying to commit like a high watermark?


NIKOLETA VERBECK: (46:55)

Exactly.


KRIS JENKINS: (46:55)

So your solution is to try and consume several records before you try and commit
the offsets?


NIKOLETA VERBECK: (47:04)

Yeah. I try to find a balancing act. Typically, we like to do either a count threshold or a time threshold of records. Time usually ends up being where everybody goes, because again, most people are thinking in records per second. They know the volume of records they're producing and managing and stuff like that; it's usually a concept everybody's familiar with. So doing it on a time boundary tends to come a little more second nature.
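
A sketch of that time-boundary commit in a Java poll loop, assuming enable.auto.commit=false; running and handle are hypothetical application pieces, and the five-second threshold is purely illustrative:

    long commitIntervalMs = 5_000; // illustrative threshold
    long lastCommit = System.currentTimeMillis();

    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
        records.forEach(rec -> handle(rec));
        if (System.currentTimeMillis() - lastCommit >= commitIntervalMs) {
            // One commit request covers everything consumed since the last one.
            consumer.commitAsync();
            lastCommit = System.currentTimeMillis();
        }
    }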


KRIS JENKINS: (47:38)

Okay. But that surprises me because it seems like something that's so domain
specific.


NIKOLETA VERBECK: (47:44)

It can be. You might go to a count threshold or a transactional threshold. Maybe you need to do a number of things before you can say, "Okay, my watermark moves." Like, maybe you're putting data into a database table. Maybe you're doing a transaction there and you want to manage that commit that way. But a lot of the time, that's not usually the use case, though it is out there. Most of the time, your time boundary or number-of-records boundary tends to be the de facto everybody starts out with.


KRIS JENKINS: (48:24)

And does that cover everything we need to think about on performance, tuning the
actual fetching of data?


NIKOLETA VERBECK: (48:34)

Not completely. One of the setting sets that a lot of people just don't know about is how you tune the fetch. And a lot of the time it starts with, maybe I turn my consumers off for a little bit. Maybe my deployment is a red/green, or one of those types where I'm stopping everything and spinning up everything. Or maybe I had a bug, the whole thing didn't work, and I'm relaunching it. Or you're migrating an older batch-driven ETL system to Kafka, so you still have some of that batch consumption lingering. And what you usually tend to see a lot of the time is, you'll be watching your consumer lags, and you'll see one partition that's just fine for a consumer, but all the other partitions assigned to that consumer are falling behind. And that's usually the symptom of needing to tune your fetches.


NIKOLETA VERBECK: (49:38)

And there's two main ones that come into play. And like all things again, we
have time and size boundaries. In this case, it's usually the size boundary
that's coming into play here. And when we fetch your requests from the brokers,
we usually say one of two things, "I'm willing to wait this long to get my data.
Or I want this much at the minimum or the maximum." And so we start needing to
tune the maximum on that threshold, which is, I need to get more data back. Out
of the default, we're saying, 50 Megs of data currently that I want in total.
But we do have another setting that goes unnoticed a lot of the time, which is
our fetch per partition or max partition, fetch bytes.


NIKOLETA VERBECK: (50:35)

Which is one Meg and out of the box for default. And so if I'm doing ETL stuff,
I probably am going to want to bring that up. And a lot of people's first is,
"Well, let's take that to the 50 Meg." Well, if I'm saying I'm willing to take
50 Megs on a single partition, but my max fetch request response size is 50
Megs. Well now I'm locked on that partition. I'm going to keep reading from that
partition before I go to the next ones, because the brokers and the way all of
this consumption wants to do, is it wants to drain a partition first before
moving its kind of cursor to the next partition that it might be handling for
you. If I can adjust that to where I say, even though that one partition might
have 50 Megs worth of data available to me, but I have three partitions I'm
consuming from as a single consumer-


KRIS JENKINS: (51:27)

Because it's load balanced.


NIKOLETA VERBECK: (51:29)

Yeah. Yeah. I'm going to want to set my fetch max as a whole to something big
enough to account for all three of those partitions, getting a chance to send
data back to me in that fetch request. So I'm not locking on that one, because
that broker's going to kind of send, "Okay, well here's your first Meg from this
partition, I've backed out that partition' setting, you told me I'm going to
give you the next same amount from the next partition," and so forth. So this
kind of helps balance that issue that pops up.
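
The two consumer settings in question, on a consumer Properties object like the ones sketched earlier. The defaults noted in the comments are the real ones; the raised per-partition value is purely illustrative:

    // Total cap per fetch response (default 52428800, i.e. 50 MB).
    props.put("fetch.max.bytes", String.valueOf(50 * 1024 * 1024));
    // Per-partition cap (default 1048576, i.e. 1 MB). Raise it for ETL-style
    // catch-up, but keep it well under the total so one hot partition can't
    // starve the others assigned to this consumer.
    props.put("max.partition.fetch.bytes", String.valueOf(16 * 1024 * 1024));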


KRIS JENKINS: (52:02)

Right. In that case, am I going to say to myself, "Well, once it's load balanced, my consumers will probably consume from three partitions on average, so divide the big value by three and that will be my smaller value"?


NIKOLETA VERBECK: (52:18)

You can do it that way. That's usually a good baseline to start with when you
start getting into it. I mean, out of the box, remember, the default is 50
megs total, one meg per partition. Usually where people start out is they
start manipulating that partition fetch, and they start increasing it and
forget about the other one. Before they know it, they're fetching 50 megs from
a single partition and they're locked onto that partition. It is a balancing
act as time goes on, as your volume increases and maybe your performance gains
increase. Especially if you start doing that multi-threaded consumption model,
maybe you start getting more threads on a single instance. Maybe they're able
to handle a lot more. Maybe you're not keeping up with that thread queue and
you're starving those threads. So you start increasing those fetch
capabilities and things like that, increasing that buffer. So just some things
to keep in mind as you start tuning: tune both and account for both.
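As a concrete illustration of that advice, here's a minimal sketch of a Java
consumer config where the total fetch budget accounts for three partitions;
the specific byte values are illustrative starting points, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public final class FetchTuning {
    // Fetch settings for a consumer that typically owns ~3 partitions.
    public static Properties fetchProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Per-partition cap (default 1 MB), raised for ETL-style reads:
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 16 * 1024 * 1024);
        // Total response cap (default 50 MB), sized so all three partitions
        // get a chance in one fetch rather than one hot partition
        // monopolizing the whole response:
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 3 * 16 * 1024 * 1024);
        // The time/size trade-off on the other axis (both at their defaults):
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        return props;
    }
}
```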


KRIS JENKINS: (53:24)

What we're going to have to put in the show notes is a cheat sheet for all the
parameters Nikoleta says you really should know about.


NIKOLETA VERBECK: (53:32)

There are a few of them. There are a lot of metrics, too, to keep an eye on
that go overlooked. Like the earlier stuff on the producer: the metrics around
records per request, the average compression ratio, things like that. They're
usually ones people don't think about. They want to monitor and make sure the
system's healthy, but the performance of the system usually goes overlooked
initially.
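For instance, a sketch of pulling those two numbers out of a Java producer's
built-in metrics; `records-per-request-avg` and `compression-rate-avg` are the
client's own metric names, while the helper around them is illustrative:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public final class ProducerHealth {
    // Log the often-overlooked batching-efficiency metrics.
    public static void logBatchingHealth(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            MetricName n = e.getKey();
            if (n.group().equals("producer-metrics")
                    && (n.name().equals("records-per-request-avg")
                        || n.name().equals("compression-rate-avg"))) {
                System.out.printf("%s = %s%n", n.name(), e.getValue().metricValue());
            }
        }
    }
}
```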


KRIS JENKINS: (53:58)

Yeah. Yeah. It's true. I've got a lot of sympathy for the people on the other
side of your desk when you're talking to them, because, like you say, out of
the box it seems to work quite well. And then you go and look at the docs for
which parameters look interesting, and there are loads of them. How do you
navigate that?


NIKOLETA VERBECK: (54:20)

Well, and the docs alone are daunting. You go through them and it's hundreds
of these things, and you've got to read every single one of them. So it's a
lot to do. It is one of those things where I would highly encourage those who
want to really understand them to check out the videos we're doing on these,
and the work our education team has done, because they help you understand
exactly what a parameter covers and what it's actually doing. It's one thing
to read the blurb that's in the docs; it's another thing to truly understand
what you are doing. If you were just to read the stuff around producer
batching, you might not get the full picture of what's actually going on.


KRIS JENKINS: (55:10)

Yeah. Yeah. And we have a course on the internals of how Kafka works by Jun Rao,
which feels like it's definitely your first port of call for understanding this
stuff under the hood.


NIKOLETA VERBECK: (55:23)

Oh, absolutely.


KRIS JENKINS: (55:24)

We'll link to that in the show notes. We have a podcast with him too. We'll
link to that in the show notes. A richly cross-referenced podcast, this one.


NIKOLETA VERBECK: (55:36)

Oh yes. Anything from Jun is worth watching.


KRIS JENKINS: (55:40)

Yeah. Yeah. He's a smart guy and he knows the insides of it like the back of his
hand.


NIKOLETA VERBECK: (55:48)

Oh yes.


KRIS JENKINS: (55:48)

Okay. I have another one from your list and you've touched on this briefly, but
let's go into it in depth now. It's a mistake to forget to provide a
ConsumerRebalanceListener. What's that? And why should I be providing it? And
what should it do?


NIKOLETA VERBECK: (56:06)

Yeah. Kind of similar to the producer callback, this is more on the
consumption side, and it's not really dealing with the per-record case. It's
dealing with the failure scenario. When you start managing your own offsets,
most people forget that this even exists, or never noticed it, because
typically it's another one of those properties, or, if you're in the
librdkafka realm, it's a function call you've got to make, or some definition
of that kind. I'll speak to the Java side: it's an interface that you
implement that gets called when there's a rebalance event. And rebalance
events happen for a number of reasons. You're redeploying your consumer
applications. You've got consumers going online and offline as they roll
through their restarts, or a broker went down and partitions need to move,
whatever. So the consumer group needs to rebalance the partition assignments,
right?


KRIS JENKINS: (57:17)

Yeah.


NIKOLETA VERBECK: (57:19)

This listener actually gets called on every single consumer before the
rebalance starts. And there are two functions defined in this interface. One
is, "I'm about to rebalance, close things up." The other is, "The rebalance is
done, here are your new assignments." And this gives you a chance, especially
if you're doing the threaded model, to say, "Hey, the rebalance event's about
to happen. Let me stop the work I'm doing because I might lose partitions."


NIKOLETA VERBECK: (57:49)

And it will tell you, if you're using the new cooperative sticky stuff, you
get some hint of which partitions are going away and which aren't. But it
gives you the chance to stop the work a little bit and wrap up where you're
at, because you can hold it for a period of time. There is a timeout, so be
cautious of that. But you can hold the rebalance a little bit, wrap up what
you're doing, do your final offset commits, clean yourself up.


KRIS JENKINS: (58:17)

Oh, so you are allowed to...


NIKOLETA VERBECK: (58:19)

[inaudible 00:58:19].


KRIS JENKINS: (58:19)

Sorry. You are allowed to do that last bit of work, saying "commit my
offsets," before it goes down?


NIKOLETA VERBECK: (58:24)

Yeah.


KRIS JENKINS: (58:24)

Okay.


NIKOLETA VERBECK: (58:25)

Yeah. That's in effect what it's designed for: to give you the opportunity to
say, "Okay, something's going to happen. The partitions I'm currently latched
onto are going to change. Somebody else is going to get this work. If I don't
commit my offsets now, there's potential for duplicate consumption. I might
have finished the work, but I haven't committed offsets yet; there's that
differential. This is my chance to commit that, not replay the work. Cancel
the work in progress, clear any work queues I might have." It's your chance to
clean up and get ready for new records to come from different partitions.
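In Java, a minimal sketch of that interface looks like this; the
commit-on-revoke strategy is the one described above, while the class name and
topic are illustrative:

```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

final class CommitOnRebalance implements ConsumerRebalanceListener {
    private final KafkaConsumer<String, String> consumer;

    CommitOnRebalance(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // "I'm about to rebalance, close things up": wrap up or cancel
        // in-flight work, then commit offsets so the next owner of these
        // partitions doesn't replay it. Mind the rebalance timeout here.
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // "The rebalance is done, here are your new assignments": re-seed
        // any per-partition state or work queues before records arrive.
    }
}

// Registered when you subscribe, e.g.:
// consumer.subscribe(List.of("orders"), new CommitOnRebalance(consumer));
```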


KRIS JENKINS: (59:05)

Hit save before the laptop reboots. That kind of thing.


NIKOLETA VERBECK: (59:08)

Exactly. Exactly.


KRIS JENKINS: (59:11)

Okay.


NIKOLETA VERBECK: (59:12)

And that gets broadcast to every consumer who's still in the consumer group.
So if you left, if you issued your leave-group request, you're probably not
going to get it. But as long as you're in the group, you'll get it and you can
handle it. Every consumer gets it, so everybody gets to handle it. It's not
just the group leader that's going to get it, and it's not like trying to do
your own consumer partition assignment or any of that.


KRIS JENKINS: (59:41)

This feels like a real post-research-and-development concern. This is when
you're really in production. You're living with the reality of computers
coming up and down all the time.


NIKOLETA VERBECK: (59:52)

Oh yeah.


KRIS JENKINS: (59:52)

This is-


NIKOLETA VERBECK: (59:53)

Yeah. I mean, it's never a question of, is it going to die? It's always, it is
going to die, so when is it going to die? And am I going to handle that death
gracefully? Especially when we start thinking about Kubernetes and its whole
cattle-herd model of deployment management.


KRIS JENKINS: (01:00:16)

Yeah. If you're prepared for it, then you're going to survive it. Yeah. Okay.
Looking at my list again, there's so much you've given us here. Undersized
Kafka consumer instances. What does that mean?


NIKOLETA VERBECK: (01:00:42)

This kind of plays into some of that, and it happens on the producer side too.
A lot of the time when you first start out, maybe you pick that small AWS
instance node type, because for whatever reason you wanted three or four
parallel instances. But as time's gone on, you're now at 10, 15, 20 consumers
in your pool. They're spreading that work across each other, but they're
really small, right?


KRIS JENKINS: (01:01:10)

Yeah.


NIKOLETA VERBECK: (01:01:10)

And on both sides, well, now I've increased my number of producers, or I've
increased my number of consumers, and the work they're doing, yes, it's taking
up maybe the one or two CPUs on that node. It's better if I can actually
consolidate that. Let's go to four, five, six CPUs, because now I'm reducing
that spray. That earlier concept comes back: it's not just the producer or the
consumer going out to the brokers, it's how many producers I have spraying
that data at them.


NIKOLETA VERBECK: (01:01:47)

And vice versa on the consumption side, which is, well, if I have all my data
spraying against that front to begin with, that data's going to be sprayed
correspondingly on the back side, because I've reduced my potential for
batching, right?


KRIS JENKINS: (01:02:03)

Yeah. Yeah.


NIKOLETA VERBECK: (01:02:04)

A record that could have gone to the same partition in the same batch, if it
had landed on the same producer, well, now it landed on this producer instead
of that one. Sometimes it's best to not just go out, but also go up. Let's
scale that up vertically a little bit. Let's densify those instances to
increase those batch efficiencies.


NIKOLETA VERBECK: (01:02:28)

Let's increase those request efficiencies. Now I'm getting that single
consumer fetch request able to get a lot more data from a number of
partitions. Let's spread that work; maybe we start really moving into that
multi-threaded model a little more and taking advantage of it. Because if I
keep scaling out horizontally, I'm going to run into those same earlier
problems: I'm adding more partitions, I'm adding more consumers to the
consumer group, I'm adding more producers to their side of the pool. And I'm
just spreading myself thinner, when I could come back up a little bit, go a
little vertical, and get a little more bang for my buck.


KRIS JENKINS: (01:03:10)

Yeah. Again, you've got trade-offs in the different ways you scale, and you've
got to be aware of both and decide which is going to fit the underlying model
best.


NIKOLETA VERBECK: (01:03:20)

Yeah. Yeah. And I mean, it's a little bit like a rubber band. It's good to go
horizontal a little bit as time goes on, but at a certain point, reevaluate
and pull that elastic back in a little bit: go a little more vertical, tune
different properties, and then go a little more elastic again. Because some of
these adjustments, like changing your instance type or adjusting settings,
have a little more effort involved than just adding another node to the pool.


NIKOLETA VERBECK: (01:03:58)

So it's one of those things where, maybe every quarter or every other quarter,
think about, "Okay, how can we bring this back in? We've stretched ourselves
out a little bit. Let's bring it in, let's adjust, let's make that part of our
quarterly retrospective: okay, what have we done? How have we grown? Let's
adjust, consolidate, and get reset for the next quarter."


KRIS JENKINS: (01:04:24)

Yeah. Yeah. We've talked a lot about adjusting lots of different parameters
and changing layouts and stuff. I just want to be clear here: is this
something I'm expecting to do a lot, or is it in response to changing growth?
Is Kafka this constant thing you're always going to be maintaining and tuning?
Or is life going to change around your business and you have to respond to
that?


NIKOLETA VERBECK: (01:04:53)

I think it's a little of all of the above, really. It depends on the use case.
There are a lot of use cases out there where the work doesn't grow. Maybe it's
a simplified workload, or it's different in some other way, and your growth is
minimal. So a lot of these settings may be something you look into once a
year, or once every other year, or once every five years.


NIKOLETA VERBECK: (01:05:26)

And in other areas, maybe you've got Kafka in the main line of your business.
You're handling events, you're growing successfully as a startup, or even as
an established enterprise with this new workload, or you're gradually
migrating data. Well, all of that comes with some effort that needs to be put
forth. While we try to make Kafka as scalable and as low-maintenance as
possible, sometimes you've got to put a little more effort in. You've got to
figure it out, tune some settings, and dive deep.


KRIS JENKINS: (01:06:02)

Yeah. Yeah. Okay. Let me see if I can wrap this up with what's top of my mind
going away from this, and you can tell me if I've missed anything. I don't
really want to reduce the number of partitions I've got, I need to understand
batching and compression, and I should really worry about those callback
hooks. And if I do that, I won't be too unhappy in the future. Oh, and
monitoring. And monitoring.


NIKOLETA VERBECK: (01:06:27)

You're monitoring.


KRIS JENKINS: (01:06:29)

Yeah.


NIKOLETA VERBECK: (01:06:29)

Yes. Don't forget monitoring.


KRIS JENKINS: (01:06:30)

If I go away from this with those four things, will I be happy?


NIKOLETA VERBECK: (01:06:35)

You'll be happy for a good while. That'll get you to some pretty reasonable
scales, really tuning the batching and stuff. This is that next tier; this is
going from "I'm just starting out with Kafka" to "I'm really using Kafka, I
need to scale this up." Right?


KRIS JENKINS: (01:06:54)

Yeah.


NIKOLETA VERBECK: (01:06:55)

And so that's where a lot of these settings and monitoring and metrics really
start coming into their own, at that next stage where I'm getting a little
more advanced with my Kafka usage.


KRIS JENKINS: (01:07:07)

Yeah. Okay. Well, happy for a long time is probably the best guarantee I've
been offered for a while. I think we should probably leave it there, though I
could pick your brains for another few hours. We'll probably have to bring you
back for more another day. But Nikoleta, thanks very much. This has been
really educational and interesting. Thanks for being on the show.


NIKOLETA VERBECK: (01:07:28)

Yeah thanks. Yeah. Thank you.


KRIS JENKINS: (01:07:30)

And that was Nikoleta Verbeck, sharing some common mistakes and a lot of
hard-won knowledge. I think there are going to be a few people who get to the
end of this episode and start tweaking a few cluster parameters and batch
sizes. Good luck if you're one of them. We're going to put as much of that
detail as we can in the show notes. But if you want some more, then take a
look at developer.confluent.io. That's our education site for Kafka, and it
will give you in-depth guides to a lot of what we've covered today. There's
also a really useful course by Jun Rao that explains how Kafka works under the
hood, through so many different parts of the system. And if you look at that,
you'll really understand how Kafka works and exactly how Nikoleta's advice
plays out as your data travels through network threads and I/O queues and onto
disk.


KRIS JENKINS: (01:08:24)

That course also has exercises to cement your knowledge. And if you want to go
through those, you'll probably want a Kafka cluster to play with. The easiest
way to get one is with Confluent Cloud. Sign up with the code PODCAST100 and
we'll give you $100 of extra free credit so you can spend longer on it.
Meanwhile, if you have thoughts or questions about today's episode, please get
in touch. My contact details are in the show notes, or you can leave us a
comment, or a like, or a thumbs up, or a review; just let us know that you've
enjoyed it. And if you are out there saying, "I already knew everything in
this episode," well, you should be a guest on a future episode, my friend. So
get in touch. And with that, it remains for me to thank Nikoleta Verbeck for
joining us, and you for listening. I've been your host, Kris Jenkins, and I'll
catch you next time.

What are some of the common mistakes that you have seen with Apache Kafka®
record production and consumption? Nikoleta Verbeck (Principal Solutions
Architect at Professional Services, Confluent) has a role that specifically
tasks her with performance tuning as well as troubleshooting Kafka installations
of all kinds. Based on her field experience, she put together a comprehensive
list of common issues with recommendations for building, maintaining, and
improving Kafka systems that are applicable across use cases.

Kris and Nikoleta begin by discussing the fact that it is common for those
migrating to Kafka from other message brokers to implement too many producers,
rather than one per service. Kafka is thread safe, and one producer instance
can talk to multiple topics, unlike with traditional message brokers, where
you may tend to use a client per topic.

Monitoring is an unabashed good in any Kafka system. Nikoleta notes that it is
better to monitor from the start of your installation as thoroughly as possible,
even if you don't think you ultimately will require so much detail, because it
will pay off in the long run. A major advantage of monitoring is that it lets
you predict your potential resource growth in a more orderly fashion, as well as
helps you to use your current resources more efficiently. Nikoleta mentions the
many dashboards that have been built out by her team to accommodate leading
monitoring platforms such as Prometheus, Grafana, New Relic, Datadog, and
Splunk. 

They also discuss a number of useful elements that are optional in Kafka, so
people tend to be unaware of them. Compression is the first of these, and
Nikoleta absolutely recommends that you enable it. Another is producer
callbacks, which you can use to catch exceptions. A third is setting a
`ConsumerRebalanceListener`, which notifies you about rebalancing events,
letting you prepare for any issues that may result from them.
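A minimal sketch of such a producer callback in Java; the topic, key, and
error handling are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

final class SafeSend {
    // Send with a callback so failures surface instead of being
    // silently dropped by fire-and-forget sends.
    static void send(KafkaProducer<String, String> producer,
                     String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception != null) {
                // e.g. TimeoutException, RecordTooLargeException
                System.err.println("send to " + topic + " failed: " + exception);
            }
        });
    }
}
```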

Other topics covered in the episode are batching and the `linger.ms` Kafka
producer setting, how to figure out your units of scale, and Trogdor, Kafka's
test framework for fault injection and workload benchmarking.
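For reference, a sketch of those batching settings on a Java producer; the
values are illustrative starting points rather than recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public final class BatchingProps {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Wait up to 20 ms to fill a batch before sending (default 0):
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        // Per-partition batch ceiling (default 16 KB):
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Compression is off by default; the episode recommends enabling it:
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return props;
    }
}
```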




EPISODE LINKS

 * 5 Common Pitfalls when Using Apache Kafka
 * Kafka Internals course
 * linger.ms producer configs.
 * Fault Injection—Trogdor
 * From Apache Kafka to Performance in Confluent Cloud
 * Kafka Compression
 * Interface ConsumerRebalanceListener
 * Watch the video version of this podcast
 * Nikoleta Verbeck’s Twitter
 * Kris Jenkins’ Twitter
 * Streaming Audio Playlist 
 * Join the Confluent Community
 * Learn more on Confluent Developer
 * Use PODCAST100 to get $100 of free Confluent Cloud usage (details)  

