www.zebrium.com
2606:2c40::c73c:671d
Public Scan
Submitted URL: https://click.sciencelogic.com/api/mailings/click/PMRGSZBCHI2DINJUHE3DCLBCOVZGYIR2EJUHI5DQOM5C6L3XO53S46TFMJZGS5LNFZRW63JPMJWG6...
Effective URL: https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause
Submission: On April 17 via api from US — Scanned from DE
Form analysis
1 forms found in the DOMPOST https://forms.hsforms.com/submissions/v3/public/submit/formsnext/multipart/4228532/58100fc5-95c6-42d4-8a0d-66e03308e25f
<form id="hsForm_58100fc5-95c6-42d4-8a0d-66e03308e25f" method="POST" accept-charset="UTF-8" enctype="multipart/form-data" novalidate=""
action="https://forms.hsforms.com/submissions/v3/public/submit/formsnext/multipart/4228532/58100fc5-95c6-42d4-8a0d-66e03308e25f"
class="hs-form-private hsForm_58100fc5-95c6-42d4-8a0d-66e03308e25f hs-form-58100fc5-95c6-42d4-8a0d-66e03308e25f hs-form-58100fc5-95c6-42d4-8a0d-66e03308e25f_0ac80d6a-e773-40c9-ae87-4c9a41154467 hs-form stacked"
target="target_iframe_58100fc5-95c6-42d4-8a0d-66e03308e25f" data-instance-id="0ac80d6a-e773-40c9-ae87-4c9a41154467" data-form-id="58100fc5-95c6-42d4-8a0d-66e03308e25f" data-portal-id="4228532" data-hs-cf-bound="true">
<div class="hs_email hs-email hs-fieldtype-text field hs-form-field"><label id="label-email-58100fc5-95c6-42d4-8a0d-66e03308e25f" class="" placeholder="Enter your " for="email-58100fc5-95c6-42d4-8a0d-66e03308e25f"><span></span></label>
<legend class="hs-field-desc" style="display: none;"></legend>
<div class="input"><input id="email-58100fc5-95c6-42d4-8a0d-66e03308e25f" name="email" required="" placeholder="name@company.com*" type="email" class="hs-input" inputmode="email" autocomplete="email" value=""></div>
</div>
<div class="hs_submit hs-submit">
<div class="hs-field-desc" style="display: none;"></div>
<div class="actions"><input type="submit" class="hs-button primary large" value="SUBMIT"></div>
</div><input name="hs_context" type="hidden"
value="{"embedAtTimestamp":"1681752897157","formDefinitionUpdatedAt":"1592510946971","renderRawHtml":"true","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.49 Safari/537.36","pageTitle":"How Cisco uses Zebrium ML to Analyze Logs for Root Cause","pageUrl":"https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause","pageId":"67930464917","isHubSpotCmsGeneratedPage":true,"canonicalUrl":"https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause","contentType":"blog-post","hutk":"cdd4320ee402d0a1e7718cad6d6b8e6b","__hsfp":287657573,"__hssc":"161145634.1.1681752898573","__hstc":"161145634.cdd4320ee402d0a1e7718cad6d6b8e6b.1681752898573.1681752898573.1681752898573.1","formTarget":"#hbspt-form-0ac80d6a-e773-40c9-ae87-4c9a41154467","locale":"en","timestamp":1681752898591,"originalEmbedContext":{"portalId":"4228532","formId":"58100fc5-95c6-42d4-8a0d-66e03308e25f","region":"na1","target":"#hbspt-form-0ac80d6a-e773-40c9-ae87-4c9a41154467","isBuilder":false,"isTestPage":false,"cssRequired":".submitted-message { color: #ffffff; }","isMobileResponsive":true},"correlationId":"0ac80d6a-e773-40c9-ae87-4c9a41154467","renderedFieldsIds":["email"],"captchaStatus":"NOT_APPLICABLE","emailResubscribeStatus":"NOT_APPLICABLE","isInsideCrossOriginFrame":false,"source":"forms-embed-1.3033","sourceName":"forms-embed","sourceVersion":"1.3033","sourceVersionMajor":"1","sourceVersionMinor":"3033","_debug_allPageIds":{"analyticsPageId":"67930464917","pageContextPageId":"67930464917"},"_debug_embedLogLines":[{"clientTimestamp":1681752897504,"level":"INFO","message":"Retrieved pageContext values which may be overriden by the embed context: {\"pageTitle\":\"How Cisco uses Zebrium ML to Analyze Logs for Root Cause\",\"pageUrl\":\"https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause\",\"userAgent\":\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.49 Safari/537.36\",\"pageId\":\"67930464917\",\"isHubSpotCmsGeneratedPage\":true}"},{"clientTimestamp":1681752897507,"level":"INFO","message":"Retrieved countryCode property from normalized embed definition response: \"DE\""},{"clientTimestamp":1681752898586,"level":"INFO","message":"Retrieved analytics values from API response which may be overriden by the embed context: {\"hutk\":\"cdd4320ee402d0a1e7718cad6d6b8e6b\",\"canonicalUrl\":\"https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause\",\"contentType\":\"blog-post\",\"pageId\":\"67930464917\"}"}]}"><iframe
name="target_iframe_58100fc5-95c6-42d4-8a0d-66e03308e25f" style="display: none;"></iframe>
</form>
Text Content
HOW CISCO USES ZEBRIUM ML TO ANALYZE LOGS FOR ROOT CAUSE

Atri Basu (Innovation Product Manager @ Cisco), Necati Cehreli (Technical Leader @ Cisco) & Gavin Cohen (VP Product @ Zebrium)

The Cisco Technical Assistance Center (TAC) has over 11,000 engineers handling 2.2 million Service Requests (analogous to incidents or support cases) a year. Although 44% of them are resolved in one day or less, many take longer because they involve log analysis to determine the root cause. This not only impacts the time a case remains open, but at Cisco’s scale, translates to thousands of hours spent each month analyzing logs.

AUTOMATION AS THE FIRST LINE OF DEFENSE

In addition to employing some of the industry’s most talented support engineers, Cisco TAC makes extensive use of automation and cutting-edge technologies such as Natural Language Processing to help engineers find relevant content to resolve customer issues based on the problem description and symptoms articulated in the case notes.

As a first step towards automated log analysis, when log bundles are received from a customer, they are scanned by a proprietary rule engine called BORG.
BORG scans each log bundle against a set of known problem signatures to determine if there are any matches. Signatures can be as complex as pieces of code that cross-reference multiple sources, or as simple as regex rules that search for occurrences of specific well-known patterns. The goal is to quickly see if a particular customer is experiencing a known problem for which a signature exists. If there’s a match, the case can be resolved quickly.

However, building and maintaining signatures manually is challenging. When a new type of problem is uncovered, it must be characterized based on how it appears in the logs. While this can be simple when there is a single log line that identifies the problem, it is often more complex (e.g. when a problem presents itself as multiple log events from multiple different sources that must occur within a period of time). This means there is always a backlog of signatures waiting to be created.

In addition, maintaining and testing signatures is an ongoing burden because log formats and payloads can change in new software releases. Because of this, existing signatures can become outdated and stop providing alerts that engineers might have expected. This can result in false negatives that may lead an engineer astray.

Furthermore, using signatures to speed up problem resolution is not always possible. For example, sometimes a signature will only catch a symptom rather than its root cause. And since a single symptom could have many different causes, a support engineer must still analyze the logs to determine what happened. There is also a large class of problems that have never been seen before. In these cases, the only option is manual log analysis.

MANUAL LOG ANALYSIS IS STILL NECESSARY

Each month, Cisco engineers analyze over 20,000 log bundles to help resolve SRs. The analysis typically starts with searching around the time the problem occurred for error messages and keywords known to be related to the problem description.
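As a rough illustration of that first manual step, the sketch below filters log lines to a time window around the reported problem and keeps those that mention a keyword. It is not Cisco's tooling: the log format, the function name search_near_incident, the five-minute window and the sample lines are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log format: "2021-06-01 12:00:03 component: message".
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")


def search_near_incident(lines, incident_time, keywords, window_minutes=5):
    """Return (timestamp, text) pairs inside the window that mention a keyword."""
    window = timedelta(minutes=window_minutes)
    # Case-insensitive alternation of the literal keywords.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        if abs(stamp - incident_time) <= window and pattern.search(text):
            hits.append((stamp, text))
    return hits


# Made-up sample data, for illustration only.
sample_log = [
    "2021-06-01 11:40:00 auth: session started",
    "2021-06-01 11:58:12 db: connection timeout after 30s",
    "2021-06-01 12:00:03 api: ERROR upstream unavailable",
    "2021-06-01 12:01:45 api: request served",
    "2021-06-01 12:30:00 db: ERROR disk full",
]
incident = datetime(2021, 6, 1, 12, 0, 0)

for stamp, text in search_near_incident(sample_log, incident, ["error", "timeout"]):
    print(stamp.isoformat(sep=" "), text)
```

A search like this only surfaces lines that match keywords someone already thought of, which is why the rare or unexpected lines discussed next still depend on an engineer's experience.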
Tenured engineers with experience troubleshooting a particular product will also keep an eye out for rare or unexpected log lines which aren’t usually present. Depending on what is found, the search will often continue across multiple log files or even bundles, and the engineers might need to correlate events from these logs to uncover important details. A common method of identifying problems is to compare logs from a working sequence and a non-working sequence, using the difference in the log lines between the two sequences as problem indicators. And depending upon the expertise required, the case might need to be escalated to a developer for further analysis. In some situations, custom scripts need to be written to extract and order a subset of log lines to make them “human legible”.

Overall, the “log hunting” process requires skill, intuition, and experience and can take significant time and resources. Finding the proverbial needle in the haystack whilst poring over esoteric log bundles is hardly anyone’s favorite way to spend their time!

IN SEARCH OF IMPROVEMENT: COULD ML PLAY A PART IN THIS PROCESS?

For several years, our team, the Innovation, Automation & Disruption Team within the TAC, had been investigating the feasibility of using AI and ML to assist with log analysis. We were exploring both building a tool internally and evaluating available commercial products. In early 2021, we came across Zebrium.

After seeing a demonstration, we decided to try Zebrium with a set of logs from one of our internal systems. Within minutes of ingesting the first log bundle, Zebrium generated a root cause report that contained the exact log lines that explained the root cause. This promising result allowed us to quickly gain approval to carry out a more extensive evaluation of the Zebrium technology.

The initial trial leveraged Zebrium’s SaaS offering, but for a more thorough proof of concept, we required a solution that could be easily used with static log bundles.
Though originally designed to analyze log streams from live applications, the Zebrium ML engine works equally well on static log bundles. When Zebrium ingests log data, its machine learning automatically learns the structure of each unique type of log line and, therefore, works well with unstructured logs of any format. All that Zebrium requires is that the logs be in plain text and that most lines contain a timestamp. In the case of static log files, it’s important to send each file together with metadata about where the log came from (e.g. server name, container name, etc.) so that Zebrium can correctly identify correlations. When using streaming log files, this is taken care of automatically by the open-source log collectors.

Additionally, to avoid Cisco customer data leaving Cisco premises, we also required an on-premises solution. Fortunately, Zebrium was in the final stages of making available its on-premises deployment option and was able to deliver this to Cisco in June 2021.

HOW TO VALIDATE THE EFFICACY OF ZEBRIUM’S ML TECHNOLOGY?

The only way to realistically test Zebrium’s technology was to ingest customer logs obtained from actual customer Service Requests (SR is the term Cisco uses for a customer case) and see if Zebrium could assist in performing a thorough and accurate Root Cause Analysis. So, an experiment was devised to take historically solved SRs, containing log bundles and where the root cause was known, and ingest them into Zebrium. TAC subject matter experts would then be asked to use the incident reports generated by Zebrium to identify the root cause of the customer’s issue and compare the results with the already known root causes.

To keep the experiment concise, but also validate the technology-agnostic nature of Zebrium’s solution, it was decided to conduct this experiment with SRs from four different and disparate technologies and products.
The experiment would be deemed successful if we were able to meet or exceed the following success criteria.

Success Criteria (actual results are further below):

* TAC SMEs were able to identify the correct root cause for the SRs from the incident reports generated by Zebrium at least 50% of the time.
* A minimum of 100 log bundles were analyzed.
* TAC SMEs found the process of identifying the root cause using Zebrium simpler and more efficient than their current methods.

SYSTEMATIC TESTING OF ZEBRIUM ML FOR LOGS USING ACTUAL CUSTOMER CASES

As mentioned, for the Proof of Concept (POC), we focused on four very different product lines:

* Cisco Webex client
* Cisco DNA Center (DNAC)
* Cisco Identity Services Engine (ISE)
* Cisco Unified Computing System (UCS)

These products were specifically chosen because the log bundle structure, volume of logs in a bundle and the individual log formats varied significantly across each of them.

For each of these products, we selected historical Cisco Service Requests (SR) that had already been solved by Cisco TAC engineers. Each SR that was chosen contained notes that included the specific log lines that were identified by the engineer handling the SR as explaining the root cause. In total, 192 log bundles were analyzed as follows:

[Table: log bundles analyzed per product line]

Once Zebrium was set up on an on-prem cluster, adapters were written for each product to ingest an entire log bundle for that product and forward it to the Zebrium ML engine. The adapters are designed to do the following:

* Use the SR number to let the ML know that all files from the bundle are related (Zebrium uses the term “service group” for this)
* Extract and decompress each file
* Identify and label the namespace to which events in the log files belonged (Zebrium uses the term “log basename” for this)
* Use the Zebrium API to send each file together with relevant metadata (its service group name, host name, log type, etc.)
* Normalize log lines with disparate structures (e.g.
JSON dumps or stack traces) into unstructured text that looks and feels similar to regular log lines

After each bundle was ingested, the root cause report generated by Zebrium was perused by a TAC SME to identify the root cause and compare it against that identified by the original TAC SR owner.

* If the TAC SME was able to find an incident generated by the Zebrium ML that contained log lines matching those identified by the original SR owner as the root cause, the result was noted as “positive”.
* If the TAC SME was able to identify an incident generated by the Zebrium ML that highlighted log lines different from what was identified by the original SR owner, but still with enough necessary detail and context to pinpoint the actual root cause, the result was noted as “positive”. It turned out that in some cases, Zebrium identified log lines with a clearer explanation of root cause than those identified by the original engineer!
* If the TAC SME identified a problem that was different from the reason the customer opened the original SR but was still pertinent to the customer and required fixing, it was considered a “Beyond the Fix” result. While this didn’t count toward the success metric, it was still a positive outcome.
* If the TAC SME was unable to identify any incident that helped identify the root cause of the customer’s issue, the result was noted as “negative”.

THE RESULTS

Across the 192 bundles that TAC SMEs were able to analyze during the period designated for the POC, Zebrium’s ML was able to correctly identify the root cause 95.8% of the time (a significant improvement over the success criterion of 50% accuracy).
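The scoring rules above can be sketched as a small tally. This is an illustration only, not the team's actual tooling: the label strings, the function names and the ten-item sample are made up, and the sketch assumes that “Beyond the Fix” results stay in the denominator without counting as successes, which the text does not spell out.

```python
from collections import Counter

# Outcome labels follow the evaluation rules; the exact strings are assumptions.
POSITIVE = "positive"
BEYOND_THE_FIX = "beyond the fix"
NEGATIVE = "negative"


def success_rate(outcomes):
    """Fraction of analyzed bundles marked positive.

    Assumption: "beyond the fix" and "negative" results both remain in the
    denominator and neither counts as a success.
    """
    if not outcomes:
        raise ValueError("no outcomes recorded")
    counts = Counter(outcomes)
    return counts[POSITIVE] / len(outcomes)


def meets_criteria(outcomes, threshold=0.5, min_bundles=100):
    """Check the two numeric success criteria: enough bundles, enough positives."""
    return len(outcomes) >= min_bundles and success_rate(outcomes) >= threshold


# Hypothetical sample of ten results, not data from the study.
sample = [POSITIVE] * 8 + [BEYOND_THE_FIX] + [NEGATIVE]
print(f"success rate: {success_rate(sample):.1%}")
print("criteria met (sample needs only 10 bundles):", meets_criteria(sample, min_bundles=10))
```

For reference, if the reported rate was computed over all 192 bundles, 184 positive results would round to 95.8% (184 / 192 ≈ 0.9583).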
Just as importantly, the user feedback from TAC SMEs highlighted just how much of an improvement the addition of Zebrium made to the log analysis workflow:

“It was very easy to find the errors, they pop up immediately…I’ll find relevant logs much faster.”

“It's a really smart tool that can quickly narrow down problem error messages.”

“This is amazing to see. This wasn’t an easy issue and to see that [the ML] analysis was able to point us in right direction is very positive.”

“Zebrium added color to the drab black and white process of log analysis”

Following are the results for each of the four product lines:

[Figure: results for each of the four product lines]

CONCLUSION

In this detailed study of 192 actual customer Service Requests, Zebrium’s machine learning was able to correctly identify the root cause in 95.8% of the cases. Users praised the quality of the root cause reports and reported significant time savings compared to manual log analysis.

As a result of this, at the time of writing, we are in the process of rolling this solution into production for use across an initial tranche of 8 product lines in TAC centers around the world. It is believed that the addition of Zebrium’s machine learning to the existing TAC tool kit will allow TAC to achieve a significant reduction in case resolution times for SRs that require log file analysis. This will not only improve customer satisfaction but will also drive significant cost savings by slashing thousands of hours of time each month that would otherwise be spent manually analyzing logs.