Could AI End Human QA?
The original title for this episode was AI
Killed the QA Star, which would make more
sense if you knew the eighties lore of
the very first music video played on MTV.
Um, that's a music television channel. Back in the day, that's how you watched videos before the internet.
That was in 1981 and it was called
Video Killed the Radio Star.
But I decided that a deep-cut title was too obscure for this conversation.
Yet the question still remains: could the increased velocity of shipping AI-generated code cause businesses to leave human-based QA behind? Presumably because we're not going to hire more of them, and we don't want to grow those QA and operations teams just because we created AI code.
And would we start relying more on
production observability to detect code
issues that affect user experience?
And that's the theory of today's guest, Andrew Tunall, the President and Chief Product Officer at Embrace.
They're a mobile observability
platform company that I first
met at KubeCon London this year.
Their pitch was that mobile apps were
ready for the full observability stack
and that we now have SDKs to let mobile
dev teams integrate with the same
tools that we platform engineers and
DevOps people and operators have been
building and enjoying for years now.
I can tell you that we don't yet have
exactly a full picture on how AI will
affect those roles, but I can tell you
that business management is being told
that similar to software development,
they can expect gains from using AI to
assist or replace operators, testers,
build engineers, QA and DevOps.
That's not true, or at least not yet.
But it seems to be an expectation.
And that's not gonna stop us from trying
to integrate LLMs into our jobs more.
So I wanted to hear from observability
experts on how they think this
is all going to shake out.
So I hope you enjoy this conversation with Andrew of Embrace.
Hello.
Hey, Bret.
How are you doing?
Andrew Tunall is the President and Chief Product Officer of Embrace. And if you've not heard of Embrace, I think your claim the first time we talked was that you're the first mobile-focused, or mobile-only, observability company in the Cloud Native Computing Foundation. Is that a correct statement?
I don't know if, well, probably in the CNCF, right? Because we were certainly the first company that started going all in on OpenTelemetry as the means with which we publish instrumentation. Obviously the CNCF has a bunch of observability vendors, some of which do mobile and web RUM, but none completely focused on that.
Yeah.
I would say we're the first.
Yeah.
I mean, we can just, we can
say that until it's not true.
the internet will judge us harshly,
for whatever claims we make today.
Yeah.
well, and that just means
people are listening.
Andrew.
Okay, I told people on the internet that we were going to talk about the idea that you gave me: that QA is at risk of falling behind if we're producing more code, shipping more code, because of AI. Even counting the pessimistic stuff, you know, we've seen some studies in the last quarter around effective use of AI showing anywhere between negative 20 percent and positive 30 percent in productivity, depending on the team and the organization.
So let's assume for a minute it's on the positive side, and the AI has helped the team produce more code and ship more releases. Even in the perfect world where I see teams automating everything, there's still usually, almost always, in fact I will say in my experience 100 percent of cases, a human at some point during the deployment process.
Whether that's QA.
Whether that's a PR reviewer,
whether that's someone who spun
up the test instances to run
it in a staging environment.
There's always something.
So tell me a little bit about where
this idea came from and what you
think might be a solution to that.
Yeah.
I'll start with the belief that AI is going to fundamentally alter the productivity of software engineering organizations. I mean, I think the CTOs I talk to out there are making a pretty big bet that it will. You know, there's today, and then there's, okay, think about even just the pace that AI has evolved at in the past year, and what it'll look like given the investment that's flowing into it.
But if you start with that, the claim that AI is going to kill QA is more about the fact that we built our software development life cycle under the assumption that software was slow to build and relatively expensive to do so. And so if those start to change, a lot of the systems and processes we put around our software development life cycle probably need to change too.
Because ultimately, if you say, okay, we had 10 engineers and formerly we were going to have to double that number to go build the number of features we want, and suddenly those engineers become more productive at building the features and capabilities you want inside your apps,
I find it really hard to believe that
those organizations are going to make
the same investment in QA organizations
and automated testing, etc. to keep
pace with that level of development.
The underlying hypothesis was like
productivity allows them to build
more software cheaper, right?
And like cheaper rarely correlates
with adding more humans to the loop.
Yeah, the promise. I make this joke, and it's probably not true in most cases because it's a little old, but I used to say things like, yeah, CIO magazine told them to deploy Kubernetes. You know, whatever executive, VP level, C-level, CIO, CTO, and you're at that chief level, so I'm making a joke about you, but...
Funny.
But that's when you're not in the engineering ranks and you maybe have multiple levels of management, and you get that sort of overview where you're reading and discussing things with the other suits. You're reading the suit magazines, the CIO magazines, the IT pro magazines, and all that stuff.
Yeah, the point I'm making here is that people are being told that AI is going to save the business money. I think I've said this on several podcasts already, but you weren't here, so I'm telling you: I was at a media event at KubeCon in London, and they gave me a media pass for some reason, because I make a podcast.
So they act like I'm a journalist, standing around real journalists and pretending. But I was there, and I watched an analyst who, from my understanding, I don't actually know the company, but they sound like the Gartner of Europe or something like that. And the words came out of their mouth at an infrastructure conference; the person said, my clients are looking for humanless ops.
And I visibly, I think, chuckled in the room, because I thought, well, that's great. That's rich, that you're at a conference of 12,000 ops people telling us that your clients want none of this. These are all humans here doing this work.
The premise of your argument about QA matches my exact same thoughts: nobody is budgeting for more QA or more operations personnel or more DevOps personnel just because the engineers in the app teams and the feature teams are able to produce more code with AI. And so we've all got to do better and figure out where AI can help us, because if we don't, they're just going to hire the person who says they can, even though maybe it's a little bit of a shit show.
My belief is that software organizations need to change the way they work for an AI future. That might be cultural changes, it might be role changes. You know, words like human in the loop get tossed around a lot when it's about engineers interacting with AI, and the question is, okay, what does that actually look like, right? Are we just reviewing AI's PRs and kind of blindly saying, yep, it wrote the unit tests for something that works? Or are we actually doing critical thinking, critical systems thinking, unique thinking that allows us, as owners of the business and our users' success, to design and build better software with AI as a tool?
and it's not just QA, it's kind of
like all along the software development
life cycle, how do we put the right
practices in place, and how do we build
an organization that actually allows us,
with this new AI driven future, you know,
whether it's agents, doing work on our
behalf, or whether it's us just with AI
assisted coding, to build better software.
And, yeah, I'm interested in what that
looks like over the next couple of years.
That's kind of the premise of my new Agentic DevOps podcast.
And also, as I'm building out this new GitHub Actions course, I'm realizing that I'm having to make up the best practices. I'm having to figure out what is risky and what's not, because no one has really figured this out yet in any great detail. In fact, at KubeCon London in April, which feels like a lifetime ago, there was only one talk about using AI anywhere in the DevOps and operations path to benefit us.
It was all about how to
run infrastructure for AI.
And granted, KubeCon is an infrastructure
conference for platform engineering
builders and all that, so it makes sense.
But the fact is there was really only one talk, and it was a team from Cisco, which I don't think of as the bleeding-edge AI company, but it was a team from Cisco, simply trying to get a workflow, or maybe you would call it an agentic workflow, for PR review. In that case, I'm presuming that humans are still writing their code and AI is reviewing the code.
Actually, just yesterday I was on GitHub trying to figure out if there was a way to make branch rules, or some sort of automation rule, so that if the AI wrote the PR, the AI doesn't get to review the PR.
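For anyone who wants to wire up that kind of guard today: GitHub's built-in branch protection can require an approving review, but it doesn't distinguish a bot approval from a human one out of the box. A minimal sketch of the check Bret describes, written against the GitHub API with Octokit, might look like this; the bot login names are assumptions you would replace with whatever accounts actually author PRs in your org.

```typescript
// Sketch: fail a required status check when a PR authored by an AI bot has no
// human approval yet. The bot logins below are hypothetical placeholders.
import { Octokit } from "@octokit/rest";

const AI_BOT_LOGINS = new Set(["copilot-swe-agent[bot]", "devin-ai[bot]"]); // assumed names

export async function requireHumanReview(owner: string, repo: string, pull_number: number): Promise<void> {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

  // Who opened the PR?
  const { data: pr } = await octokit.pulls.get({ owner, repo, pull_number });
  const author = pr.user?.login ?? "";

  // Is there at least one approval from a non-bot account?
  const { data: reviews } = await octokit.pulls.listReviews({ owner, repo, pull_number });
  const humanApproved = reviews.some(
    (r) => r.state === "APPROVED" && r.user !== null && !AI_BOT_LOGINS.has(r.user.login)
  );

  if (AI_BOT_LOGINS.has(author) && !humanApproved) {
    throw new Error("AI-authored PR has no human approval yet; blocking merge.");
  }
}
```

Run something like that as a required status check and the merge button stays red until a person has actually looked at the change.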
Yeah, yeah, right.
We've got this coming at us from both angles. We've got AI in our IDEs. We've got multiple companies, Devin and GitHub themselves. They now have the GitHub Copilot coding agent, I have to make sure I get that term right, the GitHub Copilot coding agent, which will write the PR and the code to solve your issue. And then they have a Copilot code review agent that will review the code.
It's the same models, different context, but it feels like that's not a human in the loop. So we're going to need these guardrails and these checks in order to make sure that code doesn't end up in production with literally no human eyeballs ever in the path of that code being created, reviewed, tested, and shipped.
cause we can do that now.
Like we did, we
Yeah, totally, you're totally good.
And I mean, you can easily perceive the mistakes that could happen too, right? Before I took this role at Embrace, I was at New Relic for four and a half years, and before that I was at AWS. So obviously I've spent a lot of time around CloudFormation templates, Terraform, etc.
You can see a world where AI builds your CloudFormation template for you and selects an EC2 instance type because, from the information it has about your workload, that instance type is optimal. But in the region you're running, that instance type isn't freely available for you to autoscale to. And pretty soon you go try to provision more instances and, poof, you hit your cap, because that instance type just doesn't have availability in Singapore.
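To make that concrete, here is the kind of pre-flight check a human operator, or a well-instructed agent, could run before trusting a generated template. It's a sketch using the AWS SDK for JavaScript v3; the instance type and region are only examples.

```typescript
// Sketch: verify an instance type is actually offered in the target region
// before baking it into a template. Instance type and region are illustrative.
// Note: being offered is not the same as having spare capacity at scale-out time.
import { EC2Client, DescribeInstanceTypeOfferingsCommand } from "@aws-sdk/client-ec2";

export async function isOfferedInRegion(instanceType: string, region: string): Promise<boolean> {
  const ec2 = new EC2Client({ region });
  const resp = await ec2.send(
    new DescribeInstanceTypeOfferingsCommand({
      LocationType: "region",
      Filters: [{ Name: "instance-type", Values: [instanceType] }],
    })
  );
  return (resp.InstanceTypeOfferings ?? []).length > 0;
}

// Example: isOfferedInRegion("m6i.2xlarge", "ap-southeast-1").then(console.log);
```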
And us as humans and operators, we learn a lot about our operating environments, we learn about our workloads, we learn about their character, their peculiarities, things that don't make sense to a computer but are based on reality.
Over time, maybe AI gets really good at those things, right? But the question is, how do we build the culture to guide our army of assistants to build software that really works for our users, instead of just trusting it to go do the right thing because we view everything as having a true and pure result, which I don't think is true?
A lot of the tech we build is for people who build consumer mobile apps and websites. I mean, that is what the tech we build is for. And you can easily see, some of our engineers have been playing around with using AI-assisted coding to implement our SDK in a consumer mobile app, and it works quite well, right?
You can see situations where an engineer gets asked by somebody, a product manager, a leader, who says, hey, we got a note from marketing. They want to implement this new attribution SDK that's going to go build up a profile of our users and help us build more customer-friendly experiences. You have the bots go do it, it tests the code, everything works just fine. And then, for some reason, that SDK makes an uncached request out to US West 2 for users globally, and for your users in Southeast Asia it ends up adding an additional six and a half seconds to app startup, because physics.
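Even when tests pass, that kind of regression shows up in production telemetry if you measure for it. As a rough sketch, assuming an OpenTelemetry tracer is already configured in the app's JavaScript layer, you could wrap the third-party SDK's initialization in a span so its contribution to startup time becomes data instead of app store reviews; the SDK name and init function here are hypothetical.

```typescript
// Sketch: time a third-party SDK's startup work as an OpenTelemetry span, so a
// regression (say, an uncached cross-region call) is visible in telemetry.
// "attribution-sdk" and initAttribution() are hypothetical stand-ins.
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { initAttribution } from "attribution-sdk"; // hypothetical third-party SDK

const tracer = trace.getTracer("app-startup");

export async function initThirdPartySdks(): Promise<void> {
  const span = tracer.startSpan("startup.attribution_sdk_init");
  try {
    await initAttribution(); // the call that quietly added seconds for some users
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end(); // span duration is roughly how long this SDK delayed startup
  }
}
```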
And what do those users do? If you start an app and it sits there hanging for four, five, six seconds, and you're trying to use the app because you're waiting for your boarding pass to come up and you're about to get on the plane, you probably perceive it as broken and abandon it. And to me, that's a reliability problem that requires systems thinking, and cultural design around how your engineering organization works, to avoid. One that I don't think is immediately solved with AI.
Right. Observability is only getting more complex. It's a cat and mouse game, I think, in all regards, and everything we do has a yin and yang. I'm watching organizations that are full steam ahead, aggressively using AI, and...
I'd love to see some stats. I don't know if you have empirical evidence or just anecdotal stuff from your clients, but do you see them accelerate with AI and then almost have a pulling-back effect because they realize how easy it is for them to ship more bugs? And then, sure, we could have the AI writing tests, but it can also write really horrible tests, or it can just delete tests because it doesn't like them, because they keep failing. It's done that to me multiple times.
In fact, I think, oh, what's his name? Gene, no, it wasn't Gene Kim. There's a famous, it might've been the guy who created Extreme Programming. I think I heard him on a podcast talking about how he wished he could make all of his test files read-only. Because he writes the tests, and then he expects the AI to make them pass, and the AI will eventually give up and then want to rewrite his tests. And he doesn't seem to be able to stop the AI from doing that, other than just deny, deny, deny, or cancel, cancel, cancel.
And there's a scenario where that can easily happen in the automation of CI and CD, where suddenly it decides that the tests failing are okay, and then it pushes to production anyway, or whatever craziness ensues.
Do you see stuff happening
on the ground there?
That's
Did you read that article about the AI-driven store, the fake store purveyor that Anthropic created, named Claudius? When it got pushback from the Anthropic employees about what it was stocking in its fake fridge, its fake store, it called security on the employees to try to displace them.
I mean, I think what you're pointing to is that the moral compass of AI is not necessarily there, and that's a very complex thing to solve, right? Is this the right thing to do or not? Because it's trying to solve the problem however it can, with whatever tools it has, not asking whether the problem is the right one to solve. And what's right, I mean, obviously even humans struggle with that one.
Right.
I've just been labeling it all as taste.
like the AI lacks taste
Yeah, that's
like in a certain team culture,
if you have a bad team culture,
deleting tests might be acceptable.
I used to work with a team
that would ignore linting.
So I would implement linting.
I'd give them the tools to customize
linting and then they would ignore
it and accept the PR anyway.
And they did not care. They basically followed whatever linting style the file already had. Now, they were in a monolithic repo, a 10-year-old Ruby app with thousands and thousands of files. But they would just ignore the linting.
And I always saw that as a culture problem for them, one that I as the consultant can't fix. I would always tell the boss I was working for there that I can't solve your culture problems. I'm just a consultant here trying to help you with your DevOps. You need a full-time person leading teams to tell the rest of the team that this is important and that these things matter, over the course of years, as you change out engineers.
You can't just go with, well, this file is tabs, this file is spaces, and this one follows something else, especially in places like Ruby and Python, where there are multiple sort of style guides and whatnot. But I just call that a taste issue or a culture issue.
AI doesn't have culture or taste. It just follows the rules we give it, and we're not really good yet as an industry at defining rules. I think that's actually part of the course I'm creating: figuring out all the different places where we can put AI rules.
We've seen them put in repos. We've seen them put in your UI and your IDEs. Now I'm trying to figure out how I put that into CI, and even into operations. If you're going to put AIs somewhere where they're going to help with troubleshooting or visibility or discovery of issues, they also need to have taste, which is why I want to change all my rules files to taste files. I guess that's the platform I'm standing on.
Yeah.
Cause it,
Yeah, I mean, at the same time, you probably don't want to be reviewing hundreds of PRs where an AI robot is just going through changing the casing of legacy code to meet your rules. Change obviously introduces the opportunity for failure, bugs, etc., and it's somewhat arbitrary, simply because you've given it a new rule that this is what it has to do.
I mean, I had a former co worker
many years ago, like 15, 20 years
ago, who, was famous for just, going
through periodically and making
nothing but casing code changes
because that's what he preferred.
And it was just this endless stream of trivial changes on code we hadn't touched in months or years. And that would inevitably lead to some sort of problem, right?
The toil of that. I'm just like, yeah, you're giving me heartburn just by telling me that story.
Yeah, well, I've gotten over the PTSD of that. It was a long time ago.
So, okay.
So, what are you seeing on the ground?
do you have some examples of
how this is manifesting in apps?
I mean, you've already given me
a couple of things, but I'm just
curious if you've got some more,
I kind of alluded to it at the very beginning. Obviously, in my role I talk to a lot of senior executives about where they're going directionally with their engineering organizations. The software we build does a bunch of tactical things, but in a broader sense it lets people measure reliability as expressed through customers staying engaged in your experience: given that they otherwise like your software, are they having a better technical experience with the apps you build on the front end? Measuring that, and then thinking through all of the underlying root causes, is a cultural change in these organizations and in how they think about reliability.
It's no longer just, are the API endpoints at the edge of our data center delivering their payload, responding in a timely manner and error-free, and the user experience is well tested, and we ship it to prod, and that's enough.
It really is a shift in how people think about it. And when I talk to them, a lot of CTOs really are taking a bet that AI is going to be the way they get productivity gains. I won't say it makes their business more efficient, but it does allow them to do more with less, right?
You know, like it or not, especially over the past few years, I mean, we're in the B2B SaaS business, and times have definitely been tougher than they were in the early 2020s, for kind of everyone. I think there's a lot of pressure in consumer tech, with tariffs and everything, to do things more cost effectively. And first and foremost, on the ground this is a change we are seeing, whether we like it or not. We can argue about whether it's going to work and how long it'll take, but the fact is that, like you said, you mentioned the CIO magazines, leadership is taking a bet this is going to happen. And as we start to talk about that with these executives, the question is, is the existing set of tools I have to cope with that reality good enough?
And, yeah, I guess my underlying
hypothesis is it probably isn't
for most companies, right?
If you think about the world of the web as an example, a lot of companies that are consumer-facing tech companies will measure their Core Web Vitals, in large part because it has SEO impact. And then they'll put an exception handler in there, like Sentry, that grabs JavaScript errors and tries to give you some level of impact around how many are affecting your users and whether it's high severity, and then you kind of have to sort through them to figure out which ones really matter for you to solve.
So take the existing frustration, at the current human pace of delivering code: teams are already frustrated that Core Web Vitals are hard to quantify in terms of what they really mean for user impact and whether a user decides to stay on the site or not.
so, yeah.
And the fact that they're overwhelmed with the number of JavaScript errors that could be out there. I mean, go to any site, open developer tools, look at the number of JavaScript errors you see, and then take your human, experienced idea of how you're interacting with the site, and chances are most of those don't impact you, right? It's an analytics pixel that failed to load, or it's a library that's just barfing some error but is otherwise working fine.
So take that and put it on steroids now
where you have a bunch of AI assisted
developers doubling or tripling the
number of things they're doing or
just driving more experimentation,
right, which I think a lot of
businesses have always wanted to do.
But again, software has been kind
of slow and expensive to build.
And so if it's slow and expensive to build, my thirst for delivering an experiment across three customer cohorts, when I can only deliver one given my budget, just means I only have to test for one variation. Now triple that, or quadruple it, and multiply it by any number of arbitrary user factors. It just gets more challenging, and I think we need to think about how we measure things differently.
once the team has the tools in place to
manage multiple experiments at the same
time, that just ramps up exponentially
until the team can't handle it anymore.
But if AI is giving them a chance
to go further, then yeah, they're
just, they're going to do it.
I mean,
Yeah, I mean, you're going to get overwhelmed with support tickets and bad app reviews and whatever it is, which I think most people are. And most business leaders would be pretty upset if that's how they're responding to reliability issues.
I was just gonna say, we've already had
decades now of pressure at all levels
of engineering to reduce personnel.
you need to justify every new hire pretty
significantly unless you're funded and
you're just, you know, in an early stage
startup and they're just growing to the
point that they can burn all their cash.
Having been around 30 years in tech, I've watched operations get merged into DevOps. I've watched DevOps teams, which we didn't traditionally call that, we might have called them sysadmins or automation or build engineers or CI engineers, get merged into the product teams themselves. And the teams have to take on that responsibility.
I mean, we've got this weird culture where we go to this Kubernetes conference all the time, and one of the biggest complaints is devs who don't want to be operators, but it's saddled on them because somehow the industry got the word DevOps confused with something else, and we all thought, oh, that means the developers can do ops. That's not what the word meant. But we've gotten to this world where I'm getting hired as a consultant to help teams deal with the sheer amount of ridiculous expectations placed on a single engineer: not just the knowledge you're supposed to have, but the systems you're supposed to be able to run while also making features.
And it already felt unsustainable a decade ago. So now here we are, having to give some of that work to AI, when it's still producing random hallucinations on a daily basis, even in the best models.
I think I was just ranting yesterday on a podcast about SWE-bench, which is like an engineering benchmark for AI models, measuring how well they solve GitHub issues, essentially. And the best models in the world can barely get two thirds of them correct. And that's if you're paying the premium bucks, you've got the premium foundation models, and you're on the bleeding-edge stuff, which most teams are not, because they have rules or limitations on which model they can use, or they can only use the ones in house, or they can only use a particular one.
And it's just one of those things where I feel like we're being pushed from all sides. And at some point, it's amazing that any of this even works.
It's amazing that apps
actually load on phones.
It just feels like garbage on top of garbage, on top of garbage, turtles all the way down, whatever you want to call it. So where are you coming in here to help solve some of these problems?
Yeah, that's fair. I'll even add to that: all of that is even discounting the fact that new tools are coming that make it even simpler to push out software that barely works, without even the guiding hands of a software engineer who has any professional experience writing that code.
Some of the vibe coding tools, our product management team uses those, largely for rapid prototyping. And I can write OK Python. I used to write OK C sharp. I have never been particularly good at writing JavaScript. I can read it OK, but when it comes to fixing a particular problem, I quickly get out of my depth. That's not to say I couldn't be capable of doing it. It's just not what I do every day, nor do I particularly have the energy, when I'm done with a full workday, to go teach myself JavaScript.
And, I'll build an app with
one of the Vibe coding tools
as a means of communicating
how I expect something to work.
And I'm like, ah, it's better to just delete the project and start all over again.
And, you know, if you can make
it work, that doesn't necessarily
mean it'll work at scale.
It doesn't necessarily mean that there
aren't like a myriad of use cases
you haven't tested for as you click
through the app in the simulator.
And so, you know, I think the question is, okay, given the fact that we have to accept that more software is going to make its way into human beings' hands, because we build software for human beings, and it's going to get to more human beings, how do we build a reliability paradigm where we can measure whether or not it's working?
And I think that means we stop focusing on, I guess to go back to the intentionally inflammatory title of today's discussion, a zero-bug paradigm where I test things, I test every possible pathway for my users, I have a set of requirements around performance and stuff, and I try to put up these kinds of barriers to getting code into human hands.
Instead, I accept the fact that more code is going to get into human hands faster, at a pace I can't possibly control. And so, therefore, I have to put measurements in the real world and use a lot of different tools around my app so that I can be as responsive as possible in resolving those issues when I find them. Which is, I guess, Charity Majors, who co-founded Honeycomb, she and I were talking a few months ago, and she's a big fan of stickers, she shipped me an entire envelope of stickers, and they're all like, you know, ship fast and break things, I test in production, stuff like that.
And somehow I feel like, in our world, because we build observability for front-end and mobile experiences, like web and mobile experiences, that message just hadn't gotten through historically. Part of it's because release cycles on mobile were really slow; you had to wait days for an app to get out there.
Part of it was that software was expensive and slow to build, and so getting feature flags out there so you can operate in production was hard. Part of it was just that the observability paradigm hadn't shifted, right? The paradigm of measure everything and then find root cause had not made its way to the front end. It was more like measure the known things you look for, like web exceptions and Core Web Vitals, or on mobile, look at crashes, and that's about it. And the notion of, okay, measure whether users are starting the app successfully, and when you see some unknown outcomes start to occur or users start to abandon, how can you then sift through the data to find the root cause? That hadn't really migrated over.
And that's what we're trying to do. We're trying to bring over that paradigm of how you define, for lack of a better term, the APIs: the things that your humans interact with in your apps, the things they do. You don't build an API in your app for human beings; you build a login screen, you build a checkout screen, a cart experience, a product catalog. How do we take those things, measure the success of them, and then try to attribute them to underlying technical causes, where your teams can better have those socio-technical conversations so that they can understand and then resolve them, probably using AI, right, as we grow? It allows better system knowledge and a better interplay between real human activities and the telemetry we're gathering.
Yeah.
Have you been, I'm just curious, have you been playing with having AI look at observability data, whether it's logs or metrics? Have you had any experience with that? I'm asking that simply as a generic question, because from the conversations I've had in the last few months, it sounds like AI is much better at reading logs than it is at reading metrics or dashboards or anything that lacks context. It's not like we're putting alt text on every single dashboard graph.
And that's probably all going to have to come from the providers, because if they're expecting AI to look at this stuff, they're going to have to give it more context. But it sounds like it's not as easy as just giving AI access to all those systems and saying, yeah, go read the website for my Grafana dashboard and figure out what the problem is.
I've seen it deployed in two ways, one of which I find really interesting and something we're actively working on, because I think it has a high degree of utility and, given the state of LLMs, it's probably something that's relatively easy to get right. Which is the notion that when you come into a product like ours, or a Grafana, or a New Relic, you probably have an objective in mind. Maybe you have a question you're trying to ask: what's the health of this service? Or, I got paged on a particular issue and I need to build a chart that shows me the interplay between latency for this particular service and the success rate of some other type of thing, like database calls or something like that.
Today, people broadly have to manually create those queries, and it requires a lot of human knowledge around the query language or the schema of your data. And I think there's a ton of opportunity to simply ask a human question: you know, show me all active sessions on this mobile app for the latest iOS version, and the number of startup traces that took greater than a second and a half. And have it simply pull up your dashboard and query language and build the chart for you quite rapidly, which is a massive time savings, right?
And it also just makes our tech, which, you're right, can get quite complex, more approachable by your average engineer. I'm a big believer that if every engineer in your organization understands how your systems work, and the data around them, you're going to build a lot better software, especially as they use AI, right? Because now they understand how things work and they can better provide instructions to the robots. So I think that's a really useful, interesting application. And we've seen people start to roll that type of assistant functionality out.
The second way I've seen it deployed, with mixed results, is: given an incident, go look at every potential signal related to it, try to tell me what's going on, and get to the root cause. More often than not, I find it's just a summarization of stuff that you, as an experienced user, probably would have come to the exact same conclusions on. I think there's utility there, certainly; it gets you a written summary quickly of what you see. But I do also worry that it doesn't apply a high degree of critical thinking.
And, you know, an example of where, lacking context, it wouldn't be very smart: you've probably seen it, every chart of service traffic, depending on how the service runs, tends to be pretty lumpy with time of day.
Because most companies don't have an equivalent distribution of traffic across the globe, and not every country has an equivalent population. So you tend to see these spikes where the number of service requests jumps during daylight hours, or on the Monday of every week because people come into the office and suddenly start e-commerce shopping, and then it tapers throughout the week, or tapers into the evening.
I think that's normal.
You understand that as an
operator of your services, because
it's unique to your business.
I think the AI would struggle, lacking context around your business, to understand somewhat normal fluctuations, or the fact that marketing dropped a campaign, where there's no data inside your observability system to tell you that the campaign dropped. It's not a release.
See, your AI is lacking context. It's lacking the historical context that the humans already have implicitly.
Yeah,
And I mean, that context might be in a Slack channel where marketing said, we just dropped an email, so expect an increased number of requests to this endpoint as people retrieve their special offer token or whatever it is that will let them use it in our checkout flow. Today, unless we provided that scope to an AI model within our system, it would lack that type of context.
Yeah. I'm not creating any of these apps, and I'm sure they've already thought of all of this, but the first thing that comes to mind is, well, the hack for me would be to give it access to our ops Slack room.
Right.
We're probably all having those conversations of, oh, what's going on here? And someone who had happened to reach out to marketing says, well, you know, yesterday we did send out a new coupon sale or whatever. So, yeah, having it read all that stuff might be necessary for it to understand, because you're right.
It's not like we have a dashboard in Grafana for the number of marketing emails sent per day, or the level of sales we're expecting based on, in America, the 4th of July sale, or some holiday in a certain region of the world. Or a social media influencer dropping some sort of link to your product, and suddenly it's a new green doll that people attach to their designer handbags. I don't know anything about what the kids are into these days, but it seems kind of arbitrary, and I would struggle to predict that. Let me put it that way.
they're into everything old.
So it's into everything that I'm into.
Yes.
all right.
So if we're talking about this at a high level: we talked a little bit before the show about how Embrace is thinking about observability, particularly on mobile, but really anything front end.
The tooling ecosystem for engineers on web and mobile is pretty rich, but the tools all tend to be hammers for a particular nail, right? How do we give you a better crash reporter, a better exception handler, how do we go measure X or Y? Some of the stuff that we're thinking about is really how we define the objective of observability for the coming digital age, right?
Which is, you know, as creators of user experiences, my opinion is that we shouldn't just be measuring crash rate on an app. We should be measuring: are users staying engaged with our experience? And when we see they're not, sometimes it's crashes, and then the answer is obvious, they can't stay engaged, right? Because the app explodes.
But I was talking to a senior executive at a massive food delivery app, and it's, listen, we know anecdotally there's more than just crashes that make our users throw their phone at the wall. You're trying to order lunch at noon and something's really slow, or you keep running into a validation error because we shipped you an experiment thinking it worked, and you can't order the item you want on the two-for-one promotion.
You're enraged because you really want the, you know, the spicy dry-fried chicken...
Hangry.
And you want two of them, because I want to eat the other one tonight. And you've already suckered me into that offer, you've convinced me I want it, and now I'm having trouble completing my objective.
And broadly speaking, the observability ecosystem on the front end really hasn't measured that, right? We've used all sorts of proxy measurements out of the data center, because the reliability story has been really well told and has evolved over the past 10 to 15 years in the data center world, but it just really hasn't materially evolved on the front end.
And so a lot of that is shifting the objective from how do I just measure counts of things I already know are bad, to measuring what user engagement looks like and whether I can attribute changes in it to a change in my software or defects I've introduced. So that's kind of the take.
Just about anybody who has ever built a consumer mobile app has Firebase Crashlytics in the app, which is a free service provided by Google. It was a standalone company a long time ago, Crashlytics, that got bought by Twitter and later acquired by Google. It basically gives you rough-cut performance metrics and crash reporting.
Right.
I would consider this the foundational requirement for any level of app quality, but to call this, on its own, app quality, I think, you know, our opinion is that would be a misnomer. So we're going to kind of go through what this looks like, right? It's giving you the things you would expect to see, a number of events that are crashes, et cetera, and you can do what you would expect here: crashes are bad, so I need to solve a crash, so I'm going to go into a crash, view stack traces, and get the information I need to actually be able to resolve it.
And, you know, I think we see a lot of customers, before we talk to them, who are just like, well, I have crash reporting and I have QA, that's enough. There are a lot of products that have other features, like Core Web Vitals measurements on a page level; this is Sentry. It's a lot of data, but I don't really know what to do with it beyond, okay, it probably has some SEO impact. There's a bad Core Web Vital, or a slow render on this page. How do I actually go figure out the root cause? But again, this is a single signal. You don't know whether the P75 Core Web Vital here, which Google scores as bad, is actually causing your users to bounce.
And I think that's important, because I was reading this article the other day on this notion of a performance plateau. There's empirical science proving that faster Core Web Vitals, especially largest contentful paint and interaction to next paint, improve bounce rate materially; people are less likely to bounce if the page loads really fast. But at some point there's this massive long tail of people who just have a rotten experience, and you kind of have to figure out: I can't make everyone globally have a great experience, so where's the plateau where I know I'm improving the experience, and the bounce rate, for people I'm likely to retain, versus where I'm just going to live with it?
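For web teams who want to move past the single-signal view being described here, one small, hedged sketch: report Core Web Vitals with Google's web-vitals library, but attach your own session or flow identifier so the numbers can later be joined against engagement and abandonment data. The reporting endpoint and the session-ID helper are assumptions, not part of any particular vendor's SDK.

```typescript
// Sketch: send Core Web Vitals tagged with a session id so they can later be
// correlated with what the user actually did (completed vs. abandoned a flow).
// "/telemetry/vitals" and getSessionId() are hypothetical.
import { onCLS, onINP, onLCP, type Metric } from "web-vitals";

declare function getSessionId(): string; // however your app identifies a session

function report(metric: Metric): void {
  const body = JSON.stringify({
    sessionId: getSessionId(),
    name: metric.name,     // "CLS" | "INP" | "LCP"
    value: metric.value,
    rating: metric.rating, // "good" | "needs-improvement" | "poor"
  });
  // sendBeacon survives page unloads better than fetch for exit-time metrics
  navigator.sendBeacon("/telemetry/vitals", body);
}

onCLS(report);
onINP(report);
onLCP(report);
```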
And so we kind of have a different take, which is that we wanted to center our experience less on individual signals and more on these flows, these tasks that users are performing in your app. So you think about the key flows, breaking them down into the types of activities I actually built for my end users, and I want to say, okay, how many of them were successful, versus how many ended in an error, like something went truly bad and you just could not proceed?
Versus how many abandoned, and when they abandoned, why, right? Did they abandon because they clicked into a product catalog screen and saw some stuff they didn't like? Or did they abandon because the product catalog was so slow to load, and the images so slow to hydrate, that they perceived it as broken, lost interest in the experience, and ended up leaving?
And so the way you do that is you basically take the telemetry we're emitting from the app, the exhaust we collect by default, and you create these start and end events. Then we post-process the data: we go through all of these sessions we're collecting, which are basically a play-by-play of the linear events that users went through, and we hydrate the flow to tell you where people are dropping off.
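In rough terms, and using the generic OpenTelemetry API rather than Embrace's exact SDK surface, marking a flow looks something like this: open a span when the user enters the flow, close it with a completed, error, or abandoned outcome, and let the backend aggregate the rates. The attribute names here are made up for the example.

```typescript
// Sketch: mark a user flow (e.g. checkout) with start/end telemetry so a
// backend can compute completion vs. error vs. abandonment rates.
// This uses the generic OpenTelemetry API, not Embrace's specific SDK calls.
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("user-flows");

let checkoutSpan: Span | undefined;

export function startCheckoutFlow(cartSize: number): void {
  checkoutSpan = tracer.startSpan("flow.checkout", {
    attributes: { "flow.name": "checkout", "cart.size": cartSize },
  });
}

export function endCheckoutFlow(outcome: "completed" | "error" | "abandoned"): void {
  if (!checkoutSpan) return;
  checkoutSpan.setAttribute("flow.outcome", outcome);
  if (outcome === "error") checkoutSpan.setStatus({ code: SpanStatusCode.ERROR });
  checkoutSpan.end(); // duration plus outcome is what the flow dashboards aggregate
  checkoutSpan = undefined;
}
```

The "abandoned" end can be fired from an app-background or navigate-away handler; the important part is that every start eventually gets exactly one terminal outcome.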
And so you can see their actual completion rates over time. Obviously it's a test app, so there's not a ton of data there, but what gets really cool is we start to build out this notion of: once you see the issues happen, how can I now go look at all of the various attributes of those populations under the hood, to figure out which of them are most likely to be attributed to the population suffering the issue?
So that could be an experiment. It could be a particular device, a particular app version they're on. It could be an OS version, right? You just shipped an experiment that isn't supported on older OSes, and those users start having a bad experience.
And then each of those gets you down to what we call the user play-by-play session timeline, where you basically get a full recreation of every part of the exhaust stream we're gathering from your interactions with the app or website, just for reproduction purposes. Once you've distilled it down, you can say, okay, now let me look at that cohort of users, and I can do pattern recognition, which I think is pretty...
Hmm.
So for the audio audience that didn't get to watch the video: if someone's on a mobile or front-end team, and this is actually going back to a conversation I had with one of your Embrace team members at KubeCon in London, what are some of the key changes or things that they need to be doing?
I guess if I back up, the premise here is that there are almost two archetypes. What am I trying to say here? There are two archetypes that I'm thinking about. I'm thinking about me, the DevOps slash observability system maintainer. I've probably set up ELK, or, you know, I've got the key names: the Lokis, the Prometheuses, the Grafanas.
I've got all these things
that I've implemented.
I've brought my
engineering teams on board.
They like these tools.
They tend to have, especially for mobile, other tools that I don't deal with. They might have platform-specific tools, like the...
Traditionally, right?
We're obviously trying to change that.
But yeah, traditionally, they have five or six other tools that don't play into the observability ecosystem you have set up.
Yeah.
So we're on this journey to try to centralize, to bring them into the observability world. You know, traditional mobile app developers might not even be aware of what's going on in the cloud native observability space, and we're bringing them on board here. Now suddenly they get even more code coming at them that's slightly less reliable, or that maybe presents some unusual problems we didn't anticipate. So now we're in a world where suddenly what we have in observability isn't enough.
yeah,
You're a potential solution. What are you looking at in terms of behaviors they need to change, things that people can take home with them?
I mean, I guess the way I think about it is, the reason observability became so widely adopted in server-side products was that, in an effort to more easily maintain our software and to avoid widespread defects with a high blast radius, we shifted from a paradigm of monoliths deployed on bare metal to virtualization, then to various container schemes, which right now have mostly settled around Kubernetes and microservices, because you could scale them independently and deploy them independently. And the complexity of that deployment scheme, and of the different apps and services interplaying with each other, necessitated X-ray vision into your entire system, where you could understand system-wide impacts out to the end of your world.
And the end of your world, for the most part, became your API surface layer, the things that served your web and mobile experiences. And, you know, there are businesses that just serve APIs, right, but broadly speaking, the brands we interact with as human beings serve us visual experiences that we interact with.
right.
It's the server team managing the server analytics, not so much the client device analytics.
right.
The world has gotten a lot more complicated in terms of what the front-end experience looks like. You could have a service that consistently responds, has only a nominal increase in latency, and is well within your alert thresholds, but where the SDK or library designed for your front-end experience suddenly starts retrying a lot more frequently, delivering perceived latency to your end user.
And so I think the question is, could you even uncover that incident? Because if users suffer perceived latency and therefore abandon, what metrics do you have to measure whether users are performing the actions you care about them performing, and whether that's attributable to a system change? In most instances, I don't think most observability systems have that.
And then the second question is, and by the way, Bret, that's the underlying supposition: that in a real observability scheme, mean time to detect is as important, if not more so, than mean time to resolve. The existing tooling ecosystem for front end and mobile has been set up to optimize mean time to resolve for known problems, where I can basically just count the instances and then alert you.
And, you know, there's the lack of desire to be on call. I've heard this stupid saying that there's no such thing as a front-end emergency, which is ridiculous. If I'm a major travel website and I run a thousand different experiments, and a team in Eastern Europe drops an experiment in the middle of my night that affects 1 percent of users globally, breaks the calendar control, and means some segment of that population can't book their flights, that sounds a lot like a production emergency to me. That has material business impact in terms of revenue.
Or the font color changes
and the font's not readable
Yeah, I guess I am imploring the world to shift to a paradigm where they view users' willingness and ability to interact with your experiences as a reliability signal. And I think the underlying supposition is that this only becomes a more acute problem as the number of features we ship grows.
I guess I'm starting with the
belief that from what I hear, people
are doubling down on this, right?
They're saying we need to make software
cheaper to build and faster to build.
Because it is a competitive environment
and if we don't do it, somebody else will.
and as that world starts to
expand, the acuity of the
problem space only increases.
Yeah.
I think your theory here matches up well with some others. We're all kind of in this spot where we've got a little bit of evidence, we hear some things, and we've got theories. We don't have years of facts about exactly how AI is affecting a lot of these things, but I've heard other compatible theories recently, actually from three guests on the show, that come to mind.
One of them is that, because of this increase in velocity, it's only going to increase the desire for standardization in CI and deployment, which those of us living in this Kubernetes world have all been trying to approach. We're all leaning into Argo CD as the number one way on Kubernetes to deploy software, and we've got this GitOps idea of how to standardize change as we deliver in CI.
It's still completely wild west. You've got a thousand vendors and a thousand ways you can create your pipelines, and hence the reason I need to make courses on it for people, because there's a lot of art still to it. We don't have a checkbox for this is exactly how we do it.
And in that world, the theory right now is that maybe AI is going to force us to standardize, because we can't have a thousand different workflows or pipelines that are all slightly different for different parts of our software stack. Because then, when we get to production and start having problems, or if the AI starts to take more control, it's just going to get worse, because the AI...
Yeah, and at the end of the day, right, standardization is more for the mean than the outliers. And I think a lot of people assume, oh, we can do it right because we have hundreds of engineers working on our automation and stuff. Do you know how many companies there are out there that are not technology companies? They're a warehouse company with technology.
Right.
And standardization allows them to more often than not build good technology, to measure correctly, to deploy things the right way. We're all going to interact with it in some way, right? And as the pressure for more velocity speeds up how fast they build technology, the need for a base layer of doing things consistently right only increases.
And I think that's, you know,
for the world it's a challenge,
for us it's an opportunity.
Yeah.
Awesome.
I think that's a perfect
place to wrap it up.
we've been going for a while now.
But, Andrew, it's embrace.io, right? That's...
Yeah.
www.embrace.io.
Let's just bring those up. We didn't show a lot of that stuff, but...
In case people didn't know, I made a short a few months ago about Embrace, or rather with one of the Embrace team members at KubeCon, which we touched on a little bit on this show. We talked about how observability is here for your mobile apps now, and your front-end apps, and how those developers can now join the rest of us in this world of modern metrics collection and consolidation of logging tools, bringing it all together into, ideally, one single pane of glass.
If you're advanced enough to figure all that out. The tools are making it a little bit easier nowadays, but I still think there's a lot of effort, in terms of implementation engineering, to get all this stuff to work the way we hope. But it sounds like you all are making it easier for those people with your platform.
We're definitely trying.
Yeah.
I think a future state that would be pretty cool would be the ability for operators to look at a Grafana dashboard, or a Chronosphere or New Relic dashboard or whatever it is, see an engagement decrease on login on a web property, and immediately open an incident where they page in the front-end team and the core team servicing the auth APIs in the company, and have them operating on shared data, where the front-end team can be like:
We're seeing a number of retries happening after we updated to the new version of the API that you serve for login credentials. Even though they're all 200s, they're all successful requests, what's going on? And that backend team says, well, it looks like we were slow-rolling it out and the P90 latency is actually 250 milliseconds longer. Why would that impact you? And they say, well, the SDK retries after 500 milliseconds, and our P50 latency before this was 300 milliseconds, so 10 percent of our users or something are starting to retry, and that's why we're seeing this.
You know, the answer here is to increase the resource provisioning for the auth service to get latency back down, and/or change our SDK to have a more permissive retry policy.
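That back-and-forth is easy to reproduce in code. Here's a sketch of the kind of client retry policy being described, where a fixed timeout silently turns a modest server-side latency shift into a wave of aborted-and-retried requests; the numbers mirror the hypothetical above rather than any real SDK default.

```typescript
// Sketch: a client retry policy with a fixed 500 ms timeout. If server P50 is
// ~300 ms, almost nothing retries; once a rollout pushes a slice of requests
// past 500 ms, those users see aborted-and-retried calls as extra latency.
async function fetchWithRetry(url: string, attempts = 2, timeoutMs = 500): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fetch(url, { signal: controller.signal });
    } catch (err) {
      lastError = err; // timed out or failed; loop around and retry
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

The "more permissive retry policy" mentioned here is as simple as raising timeoutMs or adding backoff between attempts.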
And, you know, have teams be able to collaborate around the right design of software for their end users and understand the problem from both perspectives, but be able to kick off that incident because they saw real people failing and disengaging, not just some server-side metric. Which I think would be pretty neat.
Yeah.
And I should mention, you all, if I remember correctly, are lead maintainers in cloud native on the mobile observability SDKs. Am I getting that right? I'm trying...
We have engineers who are approvers on Android and iOS. We have, from what I'm aware of, the only production React Native OpenTelemetry SDK. We are also participants in a new browser SIG, which is a subset of the former JavaScript SDK effort. So our OpenTelemetry SDK for web properties is basically a very slimmed-down chunk of instrumentation that's only relevant for browser React implementations. So yeah, we're working to advance the standards community in the cloud native environment for instrumenting the real-world runtimes where Swift, Kotlin, and JavaScript are executed.
Nice.
Andrew, thanks so much for being here.
So we'll see you soon.
Bye everybody.