
AMAZON MANAGED STREAMING FOR APACHE KAFKA


DEVELOPER GUIDE



BEST PRACTICES


This topic outlines some best practices to follow when using Amazon MSK.


RIGHT-SIZE YOUR CLUSTER: NUMBER OF PARTITIONS PER BROKER


The following table shows the recommended number of partitions (including leader
and follower replicas) per broker.

Broker type                                      Recommended number of partitions
                                                 (including leader and follower
                                                 replicas) per broker

kafka.t3.small                                   300
kafka.m5.large or kafka.m5.xlarge                1000
kafka.m5.2xlarge                                 2000
kafka.m5.4xlarge, kafka.m5.8xlarge,
  kafka.m5.12xlarge, kafka.m5.16xlarge,
  or kafka.m5.24xlarge                           4000
kafka.m7g.large or kafka.m7g.xlarge              1000
kafka.m7g.2xlarge                                2000
kafka.m7g.4xlarge, kafka.m7g.8xlarge,
  kafka.m7g.12xlarge, or kafka.m7g.16xlarge      4000

If the number of partitions per broker exceeds the recommended value and your
cluster becomes overloaded, you may be prevented from performing the following
operations:

 * Update the cluster configuration

 * Update the Apache Kafka version for the cluster

 * Update the cluster to a smaller broker type

 * Associate an AWS Secrets Manager secret with a cluster that has SASL/SCRAM
   authentication

A high number of partitions can also result in missing Kafka metrics in
CloudWatch and in Prometheus scrapes.

For guidance on choosing the number of partitions, see Apache Kafka Supports
200K Partitions Per Cluster. We also recommend that you perform your own testing
to determine the right type for your brokers. For more information about the
different broker types, see Broker types.


RIGHT-SIZE YOUR CLUSTER: NUMBER OF BROKERS PER CLUSTER


To determine the right number of brokers for your MSK cluster and understand
costs, see the MSK Sizing and Pricing spreadsheet. This spreadsheet provides an
estimate for sizing an MSK cluster and the associated costs of Amazon MSK
compared to a similar, self-managed, EC2-based Apache Kafka cluster. For more
information about the input parameters in the spreadsheet, hover over the
parameter descriptions. Estimates provided by this sheet are conservative and
provide a starting point for a new cluster. Cluster performance, size, and
costs depend on your use case, and we recommend that you verify them with
actual testing.

To understand how the underlying infrastructure affects Apache Kafka
performance, see Best practices for right-sizing your Apache Kafka clusters to
optimize performance and cost in the AWS Big Data Blog. The blog post provides
information about how to size your clusters to meet your throughput,
availability, and latency requirements. It also provides answers to questions
such as when you should scale up versus scale out, and guidance on how to
continuously verify the size of your production clusters.


OPTIMIZE CLUSTER THROUGHPUT FOR M5.4XL, M7G.4XL OR LARGER INSTANCES


When using m5.4xl, m7g.4xl, or larger instances, you can optimize the cluster
throughput by tuning the num.io.threads and num.network.threads configurations.

num.io.threads is the number of threads that a broker uses for processing
requests. Adding more threads, up to the number of CPU cores supported for the
instance type, can help improve cluster throughput.

num.network.threads is the number of threads the broker uses for receiving all
incoming requests and returning responses. Network threads place incoming
requests on a request queue, where the I/O threads pick them up for processing.
Setting num.network.threads to half the number of CPU cores supported for the
instance type allows full utilization of the instance type.

IMPORTANT

Do not increase num.network.threads without first increasing num.io.threads,
as doing so can lead to congestion from request-queue saturation.

Recommended settings

Instance type     Recommended value for     Recommended value for
                  num.io.threads            num.network.threads

m5.4xl            16                        8
m5.8xl            32                        16
m5.12xl           48                        24
m5.16xl           64                        32
m5.24xl           96                        48
m7g.4xlarge       16                        8
m7g.8xlarge       32                        16
m7g.12xlarge      48                        24
m7g.16xlarge      64                        32
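
For example, here is a minimal sketch of applying the m5.4xl values through an
Amazon MSK custom configuration. The properties file name, configuration name,
ARNs, and cluster version are placeholders to adapt.

   # msk-throughput.properties: illustrative values for kafka.m5.4xl (16 vCPUs)
   num.io.threads=16
   num.network.threads=8

   # Create the configuration, then apply it to the cluster (rolling update).
   aws kafka create-configuration --name m5-4xl-throughput \
       --server-properties fileb://msk-throughput.properties
   aws kafka update-cluster-configuration --cluster-arn <cluster-ARN> \
       --configuration-info '{"Arn": "<configuration-ARN>", "Revision": 1}' \
       --current-version <current-cluster-version>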


USE LATEST KAFKA ADMINCLIENT TO AVOID TOPIC ID MISMATCH ISSUE


A topic's ID is lost (error: "does not match the topic Id for partition") when
you use a Kafka AdminClient version lower than 2.8.0 with the --zookeeper flag
to increase or reassign topic partitions on a cluster running Kafka version
2.8.0 or higher. Note that the --zookeeper flag was deprecated in Kafka 2.5 and
is removed starting with Kafka 3.0. See Upgrading to 2.5.0 from any version
0.8.x through 2.4.x.

To prevent topic ID mismatch, use a Kafka client version 2.8.0 or higher for
Kafka admin operations. Alternatively, clients 2.5 and higher can use the
--bootstrap-server flag instead of the --zookeeper flag.
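
For example, a sketch of increasing a topic's partition count over the
bootstrap-server path; $bs and TopicName are placeholders (as in the retention
examples later on this page), and 12 is an illustrative partition count.

   kafka-topics.sh --bootstrap-server $bs --alter --topic TopicName --partitions 12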


BUILD HIGHLY AVAILABLE CLUSTERS


Use the following recommendations so that your MSK cluster can be highly
available during an update (such as when you're updating the broker type or
Apache Kafka version, for example) or when Amazon MSK is replacing a broker.

 * Set up a three-AZ cluster.

 * Ensure that the replication factor (RF) is at least 3. Note that an RF of 1
   can lead to offline partitions during a rolling update, and an RF of 2 may
   lead to data loss.

 * Set minimum in-sync replicas (minISR) to at most RF - 1. A minISR that is
   equal to the RF can prevent producing to the cluster during a rolling update.
   A minISR of 2 allows three-way replicated topics to be available when one
   replica is offline.

 * Ensure client connection strings include at least one broker from each
   availability zone. Having multiple brokers in a client's connection string
   allows for failover when a specific broker is offline for an update. For
   information about how to get a connection string with multiple brokers, see
   Getting the bootstrap brokers for an Amazon MSK cluster.
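
A sketch of applying the last two recommendations, assuming a three-way
replicated topic; the cluster ARN, $bs, and TopicName are placeholders.

   # Retrieve a connection string that includes brokers from each
   # Availability Zone.
   aws kafka get-bootstrap-brokers --cluster-arn <cluster-ARN>

   # Set minISR to RF - 1 (here, 2 for a topic with an RF of 3).
   kafka-configs.sh --bootstrap-server $bs --alter --entity-type topics \
       --entity-name TopicName --add-config min.insync.replicas=2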


MONITOR CPU USAGE


Amazon MSK strongly recommends that you maintain the total CPU utilization for
your brokers (defined as CPU User + CPU System) under 60%. When you have at
least 40% of your cluster's total CPU available, Apache Kafka can redistribute
CPU load across brokers in the cluster when necessary. One example of when this
is necessary is when Amazon MSK detects and recovers from a broker fault; in
this case, Amazon MSK performs automatic maintenance, like patching. Another
example is when a user requests a broker-type change or version upgrade; in
these two cases, Amazon MSK deploys rolling workflows that take one broker
offline at a time. When brokers with lead partitions go offline, Apache Kafka
reassigns partition leadership to redistribute work to other brokers in the
cluster. By following this best practice you can ensure you have enough CPU
headroom in your cluster to tolerate operational events like these.

You can use Amazon CloudWatch metric math to create a composite metric that is
CPU User + CPU System. Set an alarm that is triggered when the composite metric
reaches an average CPU utilization of 60%; a sketch of such an alarm follows
the options list below. When the alarm is triggered, scale the cluster using
one of the following options:

 * Option 1 (recommended): Update your broker type to the next larger type. For
   example, if the current type is kafka.m5.large, update the cluster to use
   kafka.m5.xlarge. Keep in mind that when you update the broker type in the
   cluster, Amazon MSK takes brokers offline in a rolling fashion and
   temporarily reassigns partition leadership to other brokers. A size update
   typically takes 10-15 minutes per broker.

 * Option 2: If there are topics with all messages ingested from producers that
   use round-robin writes (in other words, messages aren't keyed and ordering
   isn't important to consumers), expand your cluster by adding brokers. Also
   add partitions to existing topics with the highest throughput. Next, use
   kafka-topics.sh --describe to ensure that newly added partitions are assigned
   to the new brokers. The main benefit of this option compared to the previous
   one is that you can manage resources and costs more granularly. Additionally,
   you can use this option if CPU load significantly exceeds 60% because this
   form of scaling doesn't typically result in increased load on existing
   brokers.

 * Option 3: Expand your cluster by adding brokers, then reassign existing
   partitions by using the partition reassignment tool named
   kafka-reassign-partitions.sh. However, if you use this option, the cluster
   will need to spend resources to replicate data from broker to broker after
   partitions are reassigned. Compared to the two previous options, this can
   significantly increase the load on the cluster at first. As a result, Amazon
   MSK doesn't recommend using this option when CPU utilization is above 70%
   because replication causes additional CPU load and network traffic. Amazon
   MSK only recommends using this option if the two previous options aren't
   feasible.
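
A sketch of the composite alarm described before the options list, using
CloudWatch metric math with the AWS CLI. The alarm name, cluster name, broker
ID, SNS topic, period, and evaluation settings are illustrative; you would
create one such alarm per broker.

   aws cloudwatch put-metric-alarm \
       --alarm-name msk-broker1-cpu-60 \
       --comparison-operator GreaterThanOrEqualToThreshold \
       --evaluation-periods 3 --threshold 60 \
       --alarm-actions <SNS-topic-ARN> \
       --metrics '[
         {"Id": "total", "Expression": "user + system",
          "Label": "CpuUser + CpuSystem"},
         {"Id": "user", "ReturnData": false,
          "MetricStat": {"Period": 300, "Stat": "Average",
           "Metric": {"Namespace": "AWS/Kafka", "MetricName": "CpuUser",
            "Dimensions": [{"Name": "Cluster Name", "Value": "MyCluster"},
                           {"Name": "Broker ID", "Value": "1"}]}}},
         {"Id": "system", "ReturnData": false,
          "MetricStat": {"Period": 300, "Stat": "Average",
           "Metric": {"Namespace": "AWS/Kafka", "MetricName": "CpuSystem",
            "Dimensions": [{"Name": "Cluster Name", "Value": "MyCluster"},
                           {"Name": "Broker ID", "Value": "1"}]}}}
       ]'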

Other recommendations:

 * Monitor total CPU utilization per broker as a proxy for load distribution.
   If brokers have consistently uneven CPU utilization, it might be a sign that
   load isn't evenly distributed within the cluster. Amazon MSK recommends
   using Cruise Control to continuously manage load distribution via partition
   assignment.

 * Monitor produce and consume latency. Produce and consume latency can increase
   linearly with CPU utilization.

 * JMX scrape interval: If you enable open monitoring with the Prometheus
   feature, use a scrape interval of 60 seconds or higher (scrape_interval:
   60s) in your Prometheus host configuration (prometheus.yml). Lowering the
   scrape interval can lead to high CPU usage on your cluster. An example
   configuration follows this list.
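
A minimal prometheus.yml sketch for the recommendation above; the broker host
names are placeholders, and port 11001 is the JMX Exporter port that MSK open
monitoring exposes.

   global:
     scrape_interval: 60s
   scrape_configs:
     - job_name: msk-jmx
       static_configs:
         - targets:
             - <broker-1-DNS>:11001
             - <broker-2-DNS>:11001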


MONITOR DISK SPACE


To avoid running out of disk space for messages, create a CloudWatch alarm that
watches the KafkaDataLogsDiskUsed metric. When the value of this metric reaches
or exceeds 85%, perform one or more of the following actions:

 * Use Automatic scaling. You can also manually increase broker storage as
   described in Manual scaling.

 * Reduce the message retention period or log size. For information on how to do
   that, see Adjust data retention parameters.

 * Delete unused topics.

For information on how to set up and use alarms, see Using Amazon CloudWatch
Alarms. For a full list of Amazon MSK metrics, see Monitoring an Amazon MSK
cluster.
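
A sketch of such an alarm with the AWS CLI; the alarm name, cluster name,
broker ID, and SNS topic are placeholders, and you would create one alarm per
broker.

   aws cloudwatch put-metric-alarm \
       --alarm-name msk-broker1-disk-85 \
       --namespace AWS/Kafka --metric-name KafkaDataLogsDiskUsed \
       --dimensions Name="Cluster Name",Value=MyCluster Name="Broker ID",Value=1 \
       --statistic Maximum --period 300 --evaluation-periods 1 \
       --threshold 85 --comparison-operator GreaterThanOrEqualToThreshold \
       --alarm-actions <SNS-topic-ARN>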


ADJUST DATA RETENTION PARAMETERS


Consuming messages doesn't remove them from the log. To free up disk space
regularly, you can explicitly specify a retention time period, which is how
long messages stay in the log. You can also specify a retention log size. When
either the retention time period or the retention log size is reached, Apache
Kafka starts removing inactive segments from the log.

To specify a retention policy at the cluster level, set one or more of the
following parameters: log.retention.hours, log.retention.minutes,
log.retention.ms, or log.retention.bytes. For more information, see Custom MSK
configurations.
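
For example, a sketch of cluster-level retention values in an MSK custom
configuration; seven days and 250 GiB per partition are illustrative numbers,
not recommendations.

   log.retention.hours=168
   log.retention.bytes=268435456000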

You can also specify retention parameters at the topic level:

 * To specify a retention time period per topic, use the following command.
   
   kafka-configs.sh --bootstrap-server $bs --alter --entity-type topics --entity-name TopicName --add-config retention.ms=DesiredRetentionTimePeriod

 * To specify a retention log size per topic, use the following command.
   
   kafka-configs.sh --bootstrap-server $bs --alter --entity-type topics --entity-name TopicName --add-config retention.bytes=DesiredRetentionLogSize

The retention parameters that you specify at the topic level take precedence
over cluster-level parameters.
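
To confirm the values in effect for a topic, including overrides, you can
describe its configuration; $bs and TopicName are placeholders as above.

   kafka-configs.sh --bootstrap-server $bs --describe --entity-type topics --entity-name TopicName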


SPEEDING UP LOG RECOVERY AFTER UNCLEAN SHUTDOWN


After an unclean shutdown, a broker can take a while to restart because it
performs log recovery. By default, Kafka uses only a single thread per log
directory to perform this recovery. For example, if you have thousands of
partitions, log recovery can take hours to complete. To speed up log recovery,
we recommend increasing the number of threads using the configuration property
num.recovery.threads.per.data.dir. You can set it to the number of CPU cores.
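
For example, in an MSK custom configuration for brokers with 16 CPU cores (the
value should match your broker's core count):

   num.recovery.threads.per.data.dir=16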


MONITOR APACHE KAFKA MEMORY


We recommend that you monitor the memory that Apache Kafka uses. Otherwise, the
cluster may become unavailable.

To determine how much memory Apache Kafka uses, you can monitor the
HeapMemoryAfterGC metric. HeapMemoryAfterGC is the percentage of total heap
memory that is in use after garbage collection. We recommend that you create a
CloudWatch alarm that takes action when HeapMemoryAfterGC increases above 60%.
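
Such an alarm can follow the same put-metric-alarm pattern shown in the
disk-space section above; a compact sketch with placeholder names:

   aws cloudwatch put-metric-alarm \
       --alarm-name msk-broker1-heap-60 \
       --namespace AWS/Kafka --metric-name HeapMemoryAfterGC \
       --dimensions Name="Cluster Name",Value=MyCluster Name="Broker ID",Value=1 \
       --statistic Average --period 300 --evaluation-periods 3 \
       --threshold 60 --comparison-operator GreaterThanThreshold \
       --alarm-actions <SNS-topic-ARN>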

The steps that you can take to decrease memory usage vary. They depend on the
way that you configure Apache Kafka. For example, if you use transactional
message delivery, you can decrease the transactional.id.expiration.ms value in
your Apache Kafka configuration from 604800000 ms to 86400000 ms (from 7 days to
1 day). This decreases the memory footprint of each transaction.
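
For example, in an MSK custom configuration (the change only matters if you
use transactional producers):

   # 86400000 ms = 1 day (down from the 604800000 ms / 7 day default)
   transactional.id.expiration.ms=86400000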


DON'T ADD NON-MSK BROKERS


For ZooKeeper-based clusters, if you use Apache ZooKeeper commands to add
brokers, these brokers don't get added to your MSK cluster, and your Apache
ZooKeeper will contain incorrect information about the cluster. This might
result in data loss. For supported cluster operations, see Amazon MSK: How it
works.


ENABLE IN-TRANSIT ENCRYPTION


For information about encryption in transit and how to enable it, see Encryption
in transit.


REASSIGN PARTITIONS


To move partitions to different brokers on the same cluster, you can use the
partition reassignment tool named kafka-reassign-partitions.sh. For example,
after you add new brokers to expand a cluster, or when you want to move
partitions off brokers before removing them, you can rebalance that cluster by
reassigning partitions. For information about how to add brokers to a cluster,
see Expanding an Amazon MSK cluster. For information about how to remove
brokers from a cluster, see Remove a broker from an Amazon MSK cluster. For
information about the partition reassignment tool, see Expanding your cluster
in the Apache Kafka documentation.
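
A sketch of a typical reassignment flow; topics.json, reassignment.json, and
the broker IDs are placeholders.

   # Generate a candidate plan that moves the listed topics onto
   # brokers 4, 5, and 6.
   kafka-reassign-partitions.sh --bootstrap-server $bs \
       --topics-to-move-json-file topics.json --broker-list "4,5,6" --generate

   # Apply the saved plan, then check its progress.
   kafka-reassign-partitions.sh --bootstrap-server $bs \
       --reassignment-json-file reassignment.json --execute
   kafka-reassign-partitions.sh --bootstrap-server $bs \
       --reassignment-json-file reassignment.json --verify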
