




Confluent Platform


INTELLIGENTLY MONITOR AND AVOID CRITICAL APACHE KAFKA ISSUES WITH HEALTH+


Jesse Miller

April 20, 2022

When it comes to alerts, monitoring, and support for Apache Kafka®, how do you
know when you’ve got a critical problem that needs your immediate attention?

You likely won’t be sitting in front of a live dashboard somewhere simply
waiting for something to go wrong. Your time is best used elsewhere. Instead,
you want to have the right alerts already configured for mission-critical Kafka
services that identify and notify you of problems as they occur, complete with
recommended actions for remediation. Having the right alerts allows you to focus
on more important matters—knowing that if an issue does arise, you’ll know
immediately.

So, what exactly does success look like when it comes to alerts, monitoring, and
support for Kafka with Confluent Platform? There are likely a few steps:

 * When you receive an alert, you want to quickly understand where the issue is
   by jumping into the metrics to diagnose it.
 * Time to resolution matters, so your monitoring dashboards need to surface
   the right metrics for spotting patterns and correlations and, ultimately,
   for root cause analysis.
 * If you need help troubleshooting or resolving the issue at hand, the
   Confluent Support Team helps get you back up and running as quickly as
   possible to minimize business-disrupting downtime.


SOUNDS SIMPLE, DOESN’T IT? WE RECOGNIZE THERE’S MORE INVOLVED IN TAKING THESE
STEPS.

For example:

 * How do you know which alerts to set up?
 * What metrics should you monitor?
 * Are you sure you’re alerting on the right metrics to detect issues?
 * What metric thresholds should you be setting in your alerts?

Alerts are notoriously difficult and time-consuming to set up. Additionally, if
you add in a new cluster, broker, or other service, you have to repeat the setup
process. And don’t forget the metrics that come with upgrades and new features;
you need to set up alerts for those, too!

With monitoring dashboards, similar questions arise. Are you displaying the
right metrics on your dashboards? How do you tell what is good or bad for a
metric? If you’re well-versed in the internals of Kafka, you may already know
what all the metrics mean—but as your team grows and brings on new members,
providing context and explanation across all of these metrics can be difficult.
Collecting and storing metrics can be expensive, too. If you’re hosting your own
monitoring solution, storing multiple days, weeks, or months of monitoring data
can lead to escalating infrastructure costs.

And last, if you need assistance from the world-class Confluent Support Team,
you have to provide a good bit of context so a support engineer can orient
themselves and help you resolve the issue. To best support you, the Support Team
typically asks for JMX metric dumps, historical values, or configuration files.
These take time to collect, upload, and review, all before troubleshooting
begins, which lengthens your time to resolution and puts you at increased risk
of business-disrupting downtime.


WE MAKE CLUSTER HEALTH AND MONITORING EASY FOR OUR CUSTOMERS

At Confluent, we’re hyper-aware of the challenges involved in managing alerts,
monitoring, and resolution. Over the years, we’ve developed several tools,
written blogs and whitepapers, and built a world-class support team, all with
the goal of helping our customers keep their mission-critical Kafka systems
healthy and reliably set their data in motion. Recently, we released a new
product for Confluent Platform that takes this one step further.

Confluent Health+ provides the tools and visibility needed to best monitor your
Kafka environments and minimize business disruptions. Health+ has three main
benefits:

 1. Intelligent Alerts to reduce the risk of downtime and data loss by
    identifying potential issues before they occur. These alerts are based on
    expert-tested rules and algorithms developed through years of experience
    running thousands of clusters in Confluent Cloud.
 2. Cloud-based monitoring dashboards to ensure the health of your
    environment(s) and quickly troubleshoot issues through real-time and
    historical visualizations of monitoring data. This scalable, cloud-based
    solution also offloads expensive and infrastructure-intensive monitoring of
    your self-managed services, helping you reduce your monitoring costs by up
    to 70%.
 3. Accelerated Confluent support to speed up issue resolution and minimize
    business disruption with a streamlined support experience. We enable
    customers to securely share contextual metadata and metrics about their
    services without manual entry. This helps our Support Team diagnose your
    issues much more quickly, lowering your overall time to resolution by up
    to 30%.

Let’s dig into each of these a bit more.


INTELLIGENT ALERTS

When you connect your Confluent Platform service to Health+, all alerts are
automatically set up for you, removing the need for you to manually review each
metric and set up individual alerts and thresholds. Instead, you simply
choose the channel on which you want to be notified and the severity level
of alerts you want to receive. Health+ Intelligent Alerts provide three severity
levels to help you prioritize:

 * Critical – Issues are present that may limit or prevent data from moving
   across your cluster. We recommend these be addressed with urgency.
 * Warn – Metrics that are close to exceeding their normal operational range and
   may cause future issues. We recommend these metrics be reviewed expediently
   along with their recommended actions.
 * Info – Informational events on the normal operation of the cluster. We
   recommend these be reviewed regularly.

To set up a new notification in Health+, you select the severity levels to
include and the channel to receive them on. Today we support three channels:
Slack, email, and a webhook. The webhook can be used to build additional
integrations into other tools as needed.
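
As an illustration of what a webhook integration might look like, here is a
minimal Python sketch of a receiver that accepts alert notifications and
escalates only those at or above a chosen severity. The payload shape (a JSON
object with `severity` and `title` fields), the port, and every function name
are assumptions made for illustration; this post does not describe the actual
Health+ webhook payload, so check the documentation before relying on any field
name.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The three severity levels named in this post, lowest to highest.
SEVERITY_RANK = {"info": 0, "warn": 1, "critical": 2}


def should_escalate(alert: dict, minimum: str = "warn") -> bool:
    """Return True if the alert's severity is at or above `minimum`.

    Assumes a hypothetical payload with a "severity" field. Unknown or
    missing severities are escalated so nothing is silently dropped.
    """
    severity = str(alert.get("severity", "")).lower()
    rank = SEVERITY_RANK.get(severity)
    if rank is None:
        return True
    return rank >= SEVERITY_RANK[minimum]


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = json.loads(self.rfile.read(length) or b"{}")
            if not isinstance(alert, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Malformed body: reject it and stop.
            self.send_response(400)
            self.end_headers()
            return
        if should_escalate(alert):
            # Stand-in for forwarding to a pager or ticketing tool.
            print(f"ESCALATE [{alert.get('severity')}]: {alert.get('title')}")
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the severity check in a plain
function, separate from the HTTP handler, makes the routing rule easy to test
and to change once the real payload fields are known.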



 

Today we provide more than 30 Intelligent Alerts in Health+ across various
metrics and severities. We’re constantly adding new Intelligent Alerts, while
also tuning our existing ones to proactively identify issues. Through our
Health+ product backend, we’re able to seamlessly release new alerts as new
metrics become available or as clusters are upgraded to new versions—no
intervention is needed by users to start tracking the new metrics or to set up
new alerts.


CLOUD-BASED MONITORING DASHBOARDS

Similar to the Intelligent Alerts, when you connect a Confluent Platform cluster
into Health+, the monitoring dashboards instantly come to life, showing the
active health of the cluster along with a summary of its overall status. When
there are issues, the dashboards highlight the trouble areas to help you zero in
and diagnose further.



When building Health+ monitoring dashboards, we wanted to ensure that a user
wasn’t just thrown into an unwieldy dashboard and left wading through endless
pages of metrics. Instead, Health+ surfaces the metrics that matter most and
visually indicates where there are potential issues. Each of these metrics can
be expanded for a closer look.



Digging deeper into a metric allows you to observe it over different time
periods in order to identify when issues began, and then compare that view
against historical trends and other metrics. When Health+ detects that a metric
is not in a good state, additional information is shown with the metric to offer
an explanation as to what the metric means and recommended steps for addressing
the underlying issue.

As new features and metrics are added to Confluent Platform, the Health+
dashboards automatically update depending on the version you’re on without any
additional configuration needed on your end.


ACCELERATED CONFLUENT SUPPORT

And finally, when you do reach out to Confluent Support for additional help and
troubleshooting, our team is able to view the same monitoring details you see in
Health+ and address your issues more quickly. Instead of you needing to capture
and upload JMX metrics and broker configuration details, our team is able to view
metrics in real time along with the historical details, all with the goal of
shortening your time to resolution.


HOW DOES HEALTH+ WORK?

Health+ works by sending telemetry data from your Confluent Platform components
to the Telemetry Collector in Confluent Cloud. Each Confluent Platform component
has the Telemetry Reporter plugin pre-installed. Once configured, the Telemetry
Reporter sends monitoring data over an encrypted HTTPS connection to the
Telemetry Collector located at https://collector.telemetry.confluent.cloud/ for
collection and storage against your organization.
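
To make the configuration step more concrete, the Python sketch below renders
the small set of properties that would enable the Telemetry Reporter on a
component. The property names are assumptions based on the Telemetry Reporter
documentation and should be verified against the docs for your Confluent
Platform version; the helper name and the placeholder credentials are
hypothetical.

```python
def telemetry_properties(api_key: str, api_secret: str) -> str:
    """Render properties lines that enable the Telemetry Reporter.

    The property names are assumptions to verify against the
    documentation for your Confluent Platform version.
    """
    if not api_key or not api_secret:
        raise ValueError("both an API key and an API secret are required")
    settings = {
        "confluent.telemetry.enabled": "true",
        "confluent.telemetry.api.key": api_key,
        "confluent.telemetry.api.secret": api_secret,
    }
    return "\n".join(f"{name}={value}" for name, value in settings.items())


# Placeholder credentials; real ones are generated during Health+ sign-up.
print(telemetry_properties("<API_KEY>", "<API_SECRET>"))
```

The rendered lines would be appended to each component's properties file, so
generating them from one function keeps every component on the same
credentials.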



Similar to other cloud-hosted monitoring tools, setting up Health+ requires
allowing for outbound traffic from your Confluent Platform components to enable
the telemetry data to be sent. For ease of setup, the Telemetry Reporter also
supports routing traffic through a proxy with only outbound access allowed.
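
Before enabling the Telemetry Reporter, it can help to confirm that a host is
actually allowed to reach the collector. The Python sketch below derives the
host and port from the collector URL given above and attempts a TLS handshake.
The function names and the timeout are illustrative assumptions, and the check
covers direct connectivity only: it does not go through a proxy, authenticate,
or send any telemetry.

```python
import socket
import ssl
from typing import Tuple
from urllib.parse import urlsplit

COLLECTOR_URL = "https://collector.telemetry.confluent.cloud/"


def collector_endpoint(url: str = COLLECTOR_URL) -> Tuple[str, int]:
    """Return the (host, port) that must be allowed for outbound traffic."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError(f"expected an https URL, got {url!r}")
    # HTTPS defaults to port 443 when the URL names no port.
    return parts.hostname, parts.port or 443


def can_reach_collector(url: str = COLLECTOR_URL, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection and TLS handshake; True on success."""
    host, port = collector_endpoint(url)
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts, and TLS errors.
        return False
```

Running `can_reach_collector()` on each component host gives a quick yes or no
answer. A `False` result on a host that should use a proxy is expected, since
this sketch only tests the direct path.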




WHAT DATA IS SENT TO CONFLUENT WHEN USING HEALTH+?

“Data” can be broken into two main categories:

 * Message content refers to data sent to and stored on Kafka topics. This is
   the message-level data your organization processes using applications that
   are built on top of Kafka.
 * Telemetry data refers to data about the health and operational status of your
   Kafka services. This data doesn’t contain any message content. This
   information is typically requested by Confluent’s Support Team when
   troubleshooting an issue with you.

All data captured by the Telemetry Reporter is thoroughly detailed in our
documentation. Each metric that we capture is accompanied by a description of
the metric, along with the version of Confluent Platform in which we started
capturing it.


WHAT ABOUT OTHER SECURITY QUESTIONS/CONCERNS?

We built Health+ with the most security-conscious customers in mind and
understand that you or your Infosec team may have questions. The Confluent
Health+ FAQ has helped many of our customers address the most common questions
that come up. If you have additional questions not addressed in that document,
please reach out to our support team at support@confluent.io.


POTENTIAL INFRASTRUCTURE SAVINGS FOR SELF-HOSTED MONITORING

If you run Confluent Control Center or another self-hosted monitoring platform
today to track your Confluent Platform metrics, you’re probably aware of the
infrastructure costs associated with storing all of the historical monitoring
data. This monitoring data accumulates over time and can grow quickly
as you add new clusters and services. With Health+, you no longer need to
store monitoring data on your own infrastructure. If you’re currently using
Confluent Control Center, Reduced infrastructure mode can be enabled for
continued use of Control Center for all management capabilities in Confluent
Platform, while disabling the heavy-weight monitoring features in favor of
Health+. In this mode, Control Center’s system requirements can be greatly
reduced. We estimate that customers who leverage Reduced infrastructure mode
along with Health+ will see ~70% savings in infrastructure costs.


GETTING STARTED WITH HEALTH+ IS EASY

Getting started with Health+ is simple (and free). When you sign up for Health+,
you are guided through the steps needed to generate your secure credentials and
set up the Telemetry Reporter on each of your Confluent Platform components. Get
started today and say goodbye to endless troubleshooting and costly downtime!


Jesse Miller is a senior product manager for Health+ and other observability
products at Confluent. Prior to Confluent, Jesse led Learndot by ServiceRocket,
an LMS specifically tailored to help fast-growing software companies train their
end-users and drive product adoption.



Copyright © Confluent, Inc. 2014-2022. Apache, Apache Kafka, Kafka, and
associated open source project names are trademarks of the Apache Software
Foundation


