Effective URL: https://www.gremlin.com/blog/podcast-break-things-on-purpose-jj-tang-people-process-culture-tools/?utm_medium=email&utm_...

April 19, 2022 - 9 min read


PODCAST: BREAK THINGS ON PURPOSE | JJ TANG: PEOPLE, PROCESS, CULTURE, TOOLS


Jason Yee
Director of Advocacy

For this episode we’re continuing to “Build Things on Purpose” with JJ Tang,
co-founder of Rootly, who joins us to talk about incident response, the tool
he’s built, and his many lessons learned from incidents. Rootly is aiming to
automate some of the more tedious work around incidents while keeping the
process consistent. JJ chats about why he and his co-founder built Rootly, and the
problems they’re trying to fix and eliminate when it comes to reliability. JJ
reflects on what sets Rootly apart, how they handle chaos engineering, and more!

EPISODE HIGHLIGHTS

In this episode, we cover:

 * 00:00:00 - Introduction
 * 00:00:57 - Rootly, an incident management platform
 * 00:02:20 - Why build Rootly
 * 00:06:00 - Unique aspects of Rootly
 * 00:09:50 - How people should use Rootly

Links:

 * Rootly: https://rootly.com/demo

TRANSCRIPT

JJ: How do you now get this massive organization to change the way that they
work? Even if they were following, like, a checklist in Google Docs, that still
marks as a fairly significant cultural change, and so we need to be very mindful
of it.

Jason: Welcome to another episode of Build Things on Purpose, part of the Break
Things on Purpose podcast. In our build episodes, we chat with the engineers and
developers who create tools that help us build modern applications, or help us
fix them when they break. In this episode, JJ Tang, co-founder of Rootly, joins
us to chat about incident response, the tool he’s built, and the lessons he’s
learned from incidents.

So, in this episode, we’ve got with us JJ Tang, who’s the co-founder of a
company and a tool called Rootly, welcome to the show.

JJ: Thank you, Jason, super excited to be here. Big fan of what you guys are
doing over at Gremlin and all things Chaos Engineering. Quick intro on my side.
I’m JJ, as you mentioned. We are building Rootly, which is an incident
management platform built on top of Slack.

So, we help a bunch of different companies automate what we believe to be some
of the most manual and tedious work when it comes to incidents, like creating
virtual war rooms, Zoom Bridges, tracking your action items on Jira, generating
your postmortem timeline, adding the right responders, and generally just
helping build that consistency. So, we work with a bunch of different
fast-growing tech companies like Canva, Grammarly, Bolt, Faire, Productboard,
and also some of the more traditional ones like Ford and Shell. So, super
excited to be here. Hopefully, I have some somewhat engaging insight, I hope.
[laugh].

Jason: Yeah, I think you will because in our discussions previously, we’ve
always had fantastic conversations. So, you’ve kind of covered a lot of the
first question that I normally ask, and that’s what did you build? And so as you
explained, Rootly is an incident management tool; works with Slack. But that
naturally leads into the other question that I ask our Build Things guests,
and that’s why did you build this? Was it something from your experience as an
engineer that you’re just like, “I need a tool to solve this?” What’s the story
behind Rootly?

JJ: Yeah, definitely. Sorry to jump the gun on the first question. I was a
little bit too excited, I think. But yeah, so my co-founder, Quinton, and I both
used to work at Instacart, the grocery delivery startup. He was
there super, super early days; he was actually one of the first SREs there and
kind of built out that team.

And I was more on the product side of things, so I helped us build out our
enterprise and last-mile delivery products. If you’re curious what does [laugh]
grocery have to do with reliability, actually, not that much, but the challenges
we were dealing with were at very great scale. So, it all started back when the
pandemic first started getting kicked off. Instacart was growing rapidly at the
time, we were scaling really well, we were hitting the numbers where we wanted
them to be, but with the lockdowns suddenly occurring, everyone overnight who
didn’t care about grocery delivery and thought, “Well, why don’t I just drive to
Walmart,” [laugh] suddenly wanted to order things on Instacart. So, the company
grew 500, 600%, nearly overnight.

And with that, our systems just could not handle the load. And it’d be the most
obscure incidents you wouldn’t think would break, but under such immense stress
and demand, we just couldn’t keep the site up all the time. And what that really
exposed on our end was, we don’t have a really good incident management process.
What we were doing was, we kind of just had every engineer in a single incident
channel on Slack. And if you got paged, you just kind of ping in there. “I just
got woken up. Did anyone else? Does this look legit?”

And there was no formal way, so there was no consistency in terms of how the
incidents were created. And then, of course, from that top-of-funnel into the
postmortem, there wasn’t too much discipline there. So, we really thought about,
you know, after the dust kind of settled, there must be a better way to do this.
And like most organizations that we work with, you start thinking about how can
I build this myself?

I think there’s probably a little bit of a gap right now in this space. People
generally understand monitoring tools really well, like New Relic, Datadog,
alerting tools super well, PagerDuty, Opsgenie, they do a really good job at it.
But everything afterwards, the actual orchestration and learning from the
incidents tends to be a little bit sparse. So, we started embarking on our own.
And for my co-founder’s side of things, he was more at the heart of the incident
than I was. I think I was the one complaining about and breathing down his neck
a little bit about why things [laugh] sometimes weren’t working.

And—yeah, and, you know, as we started thinking about internal solutions, we
took a step back and thought, “Well, you know, if Instacart is facing this
problem then I think a lot of companies must be as well.” And luckily, our
hypothesis has proven to be true, and yeah, the rest is just history now.

Jason: That’s really fascinating, particularly because, I mean, it is such a
widespread issue, right? And I think I’ve experienced that as well, where you’ve
got a general on-call or incidents channel, and literally everybody in the
organization’s in there, not just engineers, but—like yourself—product people
and customer success or support folks are all in there. And the idea is this,
sort of—it’s a giant, giant crowd of folks who are just, like, waiting and
wondering. And so having a tool to help manage that is extremely useful. As you
started building out this tool, I think there are starting to be a lot more
incident management tools or incident response management tools, so talk to me
about what the unique points about Rootly are.

Because I suspect that a lot of it is influenced from, “These are the pain
points that I had during my incidents,” and so you pulled them over? And so I’m
curious, what are those that you brought to the tool that really help it shine
during an incident?

JJ: Yeah, definitely. I think the space that we’re in right now is certainly
heating up as you go to the different conferences and the content that’s put out
there. Which is great because that means everyone is educating the broader
audience of what’s going on and just makes my job just a little bit easier.
There’s a couple, you know, original hypotheses that we had for the product that
just ended up not being as important. And that has really defined how we think
about Rootly and how we differentiate a lot of what we do.

How we did incidents at Instacart wasn’t all that unique, you know? We used the
same tools everyone else did. We had Opsgenie, we used Slack, Datadog, Jira, we
wrote our postmortems on Confluence, stuff like that, and our initial reaction
was, “Well, people are using the same tools, they must be following a very
similar process.” And we also looked and worked a lot with people that are deep
into the space, you know, Google, Stripe, the Airbnbs of the world, people that
have a very formal process. And so we actually embarked on this journey building
a relatively opinionated tool; “This is how we think the best incidents can be
run.” And that actually isn’t the best fit for everyone.

I think if you had no incident management process whatsoever, that’s great. You
know, we give you super powerful defaults out of the box, like we do today, and
you kind of can just hit the ground running super fast. But what we found is
despite everyone using basically the same kind of tools, the way they use them is
super different. You might only want to create a Zoom Bridge for, you know, high
severity incidents, whereas someone else wants to create it for every single
incident, for example. So, what we did was really focus on how do we balance
between building something that’s opinionated versus flexible, where should
customers be able to turn the knobs and the dials.
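
[Editor's sketch] The Zoom Bridge example above can be pictured as a set of
per-team rules. The snippet below is only an illustration of that idea in
Python. It is not Rootly's API or configuration format; every name in it
(Incident, WorkflowRule, plan_actions, the action strings, the severity levels)
is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity levels, ordered from least to most severe.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Incident:
    title: str
    severity: str  # must be one of the keys in SEVERITY_RANK


@dataclass(frozen=True)
class WorkflowRule:
    """One 'knob': run `action` when the incident is at least `min_severity`."""
    action: str
    min_severity: str = "low"

    def applies_to(self, incident: Incident) -> bool:
        return SEVERITY_RANK[incident.severity] >= SEVERITY_RANK[self.min_severity]


def plan_actions(incident: Incident, rules: list[WorkflowRule]) -> list[str]:
    """Return the actions a team's rules would trigger for this incident."""
    return [rule.action for rule in rules if rule.applies_to(incident)]


# Team A only opens a Zoom bridge for high-severity incidents.
team_a = [
    WorkflowRule("create_slack_channel"),
    WorkflowRule("create_zoom_bridge", min_severity="high"),
]
# Team B opens one for every incident.
team_b = [
    WorkflowRule("create_slack_channel"),
    WorkflowRule("create_zoom_bridge"),
]

minor = Incident("Slow checkout page", "low")
print(plan_actions(minor, team_a))  # ['create_slack_channel']
print(plan_actions(minor, team_b))  # ['create_slack_channel', 'create_zoom_bridge']
```

Both teams use the same tools here; only the rules differ, which is the
opinionated-versus-flexible balance JJ describes.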

And a big part of it is we built what we call our workflows, and that allows
customers to create a process that’s very similar to theirs. And a part of
that we didn’t anticipate at the very beginning was, although the tool is super
simple to use, I think our average install time is probably 13 minutes, all the
integrations and everything on a quick call with our customers, the really heavy
lifting comes with, how do you now get this massive organization to change the
way that they work? Even if they were following, like, a checklist in Google
Docs, that still marks as a fairly significant cultural change, and so we need
to be very mindful of it. So, we can’t be just ripping tools out of their
existing stack, we can’t be wildly changing every process; everything has to
happen progressively, almost, in a way. And that is a lot more digestible than
saying you’re going to replace everything.

So, I think that’s probably one of the key differences is we tend to lean more
on the side of playing with your existing stack versus changing everything up.

Jason: That’s a really good insight, particularly because coming from Chaos
Engineering, and that is almost entirely changing the way that people work,
right? Chaos Engineering is a new practice, so I definitely empathize with
you, or sympathize with you on that struggle of, like, how do you change what
people are doing and really get them to embrace it? That said, being opinionated
is also a really good thing because you have a chance to lead people, and so
that leads me to our final question that we always ask folks—and this is where
being opinionated is good—but if folks were to use Rootly, or just even wanted
to improve their incident response processes in general, what are some of those
opinions that you had about how people should be doing that, that they should
consider embracing?

JJ: Yeah, that’s an awesome question. So, a couple things, a little bit related
to your second question that we initially thought but just proved to not be as
important for us, everything that we built at the beginning, and still build, is
relatively laser-focused on helping you get to that resolution as fast as
possible. But from an organizational perspective, what we found is, people don’t
think about incident management success as how quickly they can resolve an
incident. A lot of it’s actually just having that security and framework and
consistency around the incident. So ironically, as a tool in incident
management, the most important things are actually around your people and the
process and the culture that you can develop around the tool.

No matter how good of something that we build, you know—let’s say you’re an
organization, you just bring in Rootly, you have a very blameful way of handling
postmortems, no one generally understands how severities in the organization
work, you’re super laser-focused on, you know, tracking MTTR, which is not always
the best metric, but you still want to interpret it as such, it’s very difficult
to make the tool successful. So, that’s the biggest advice that we give to our
customers is when we see those types of red flags from, like, a process and
culture standpoint, we’ll try to guide them the best that we can. And we’ll also
do it from a product perspective. What you get out of the box today, we have
companies as small as, you know, 20 for example, just kind of being able to hit
the ground running; they’ll use workflow templates that are pre-built based on
some best practices that we’ve seen to just kind of layer in that framework. So,
I think that would be a really big one that we’ve noticed is it’s not all about
us; it’s not all about the product and the benefits that we can provide; it’s
about how we can actually enable our customers to get to that stage.
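
[Editor's sketch] JJ does not say why MTTR is not always the best metric. One
common reason, offered here as an assumption and not as his argument, is that
a mean is sensitive to outliers. The resolution times below are made up for
illustration, and the summarize helper is hypothetical.

```python
from statistics import mean, median


def summarize(resolution_minutes: list[int]) -> tuple[float, float]:
    """Return (mean, median) time to resolution, in minutes."""
    return mean(resolution_minutes), median(resolution_minutes)


# Hypothetical data: four quick incidents and one eight-hour outage.
times = [12, 15, 9, 20, 480]

mttr, typical = summarize(times)
print(f"MTTR (mean): {mttr:.1f} min")  # MTTR (mean): 107.2 min
print(f"Median:      {typical} min")   # Median:      15 min
```

In this made-up data set a single long outage pushes the mean to about seven
times the median, so the same incidents can look alarming or routine depending
on which number is reported.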

Jason: I love that answer. Well, JJ, thanks for being a guest on the show and
sharing a bit more about your journey and the journey of Rootly. If folks are
interested in trying out the product and getting better at incident response,
where can they find more info about you and about Rootly?

JJ: Yeah. You can just visit rootly.com/demo. We do offer a 14-day trial if you
want to sign up for free.

If you want to talk to one of us or our partnerships team, you’re welcome to book a
personalized session. I recommend that because then you get to see my super cute
dog that isn’t with me right now and wouldn’t matter because this is audio only,
but I love showing her off. That’s my favorite part of my job.

Jason: So, if you want to go see JJ's dog, or learn more about Rootly and
incident management, go check it out. Thanks again.

JJ: Yeah, thanks for having me.

Jason: For links to all the information mentioned, visit our website at
gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on
Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform.
Our theme song is called Battle of Pogs by Komiku and is available on
loyaltyfreakmusic.com.

Categories
Podcasts, Industry