Could AI End Human QA?


The original title for this episode was AI
Killed the QA Star, which would make more

sense if you knew the eighties lore of
the very first music video played on MTV.

Um, that's a music television channel.

Back in the day, that's how you
watched videos before the internet.

That was in 1981 and it was called
Video Killed the Radio Star.

But I decided that a deep-cut title
was too obscure for this conversation.

Yet the question still remains, could
the increased velocity of shipping

AI generated code cause businesses
to leave human based QA behind,

presumably, because we're not gonna
hire any more of them, and we don't

want to grow those teams and operations
teams just because we created AI code.

And would we start relying more on
production observability to detect code

issues that affect user experience?

And that's the theory of today's
guest, Andrew Tunall, the President

and Chief Product Officer at Embrace.

They're a mobile observability
platform company that I first

met at KubeCon London this year.

Their pitch was that mobile apps were
ready for the full observability stack

and that we now have SDKs to let mobile
dev teams integrate with the same

tools that we platform engineers and
DevOps people and operators have been

building and enjoying for years now.

I can tell you that we don't yet have
exactly a full picture on how AI will

affect those roles, but I can tell you
that business management is being told

that similar to software development,
they can expect gains from using AI to

assist or replace operators, testers,
build engineers, QA and DevOps.

That's not true, or at least not yet.

But it seems to be an expectation.

And that's not gonna stop us from trying
to integrate LLMs into our jobs more.

So I wanted to hear from observability
experts on how they think this

is all going to shake out.

So I hope you enjoy this
conversation with Andrew of Embrace.

Hello.

Hey, Bret.

How are you doing?

Andrew Tunall is the president and
chief product officer of Embrace.

And if you've not heard of Embrace,
I think your claim the first time we

talked was that you're the first
mobile-focused, or mobile-only,

observability company in the
Cloud Native Computing Foundation.

Is that a correct statement?

I don't know if in,
probably in CNCF, right?

Because like we're, we were certainly
the first company that started going

all in on OpenTelemetry as the means
with which we published instrumentation.

Obviously CNCF has a bunch of
observability vendors, some of

which do mobile and web RUM, but
aren't completely focused on that.

Yeah.

I would say we're the first.

Yeah.

I mean, we can just, we can
say that until it's not true.

the internet will judge us harshly,
for whatever claims we make today.

Yeah.

well, and that just means
people are listening.

Andrew.

Okay, I told people on the internet that
we were going to talk about the idea

that you gave me: that QA is at risk
of falling behind.

If we're producing more code, if we're
shipping more code because of AI, even

the pessimistic stuff about, you know,
we've seen some studies in the last

quarter around effective use of AI, it's
anywhere between negative 20 percent and

positive 30 percent in productivity, depending
on the team and the organization.

So let's assume for a minute it's on
the positive side, and the AI has helped

the team produce more code and ship more
releases. Even in the perfect world

where I see teams automating everything,
there's still usually, and almost always,

in fact, I will say in my experience,
100 percent of cases, there is a human at

some point during the deployment process.

Whether that's QA.

Whether that's a PR reviewer,
whether that's someone who spun

up the test instances to run
it in a staging environment.

There's always something.

So tell me a little bit about where
this idea came from and what you

think might be a solution to that.

Yeah.

I'll start with the belief that, I
mean, AI is going to fundamentally

alter the productivity of software
engineering organizations.

I mean, I think the, CTOs I
talk to out there are making

a pretty big bet that it will.

you know, there's today and then
there's, okay, think about the, even

just the pace that, that AI has evolved
in, in the past, the past year and

you know, what it'll look like given
the investment that's flowing into it.

But if you start with that, the claim
that AI is going to kill QA is more

about the fact that, we built, Our
software development life cycle under

the assumption that software was slow to
build and relatively expensive to do so.

And so if those start to change,
right, like a lot of the, systems

and processes we put around, our
software development life cycle,

they probably need to change too.

Because ultimately, if you say, okay,
we had 10 engineers and formerly we were

going to have to double that number to
go build the number of features we want.

And suddenly those engineers become, more
productive to go build the features and

capabilities you want inside your apps.

I find it really hard to believe that
those organizations are going to make

the same investment in QA organizations
and automated testing, etc. to keep

pace with that level of development.

The underlying hypothesis was like
productivity allows them to build

more software cheaper, right?

And like cheaper rarely correlates
with adding more humans to the loop.

yeah, the promise.

Like I make this joke and it's probably
not true in most cases, 'cause it's a

little old, but I used to say things

like, yeah, CIO magazine told them to
deploy Kubernetes or, you know, like

whatever, whatever executive VP level C,
CIO, CTO, and you're at that chief level.

So I'm making a joke about you, but,

funny.

but that's when you're not in
the engineering ranks and you

maybe have multiple levels of
management, you get that sort of

overview where you're reading and you're
discussing things with other suits.

You're reading the suits magazines,
the CIO magazines, the uh, IT

pro magazines and all that stuff.

yeah, the point I'm making here is
that, people are being told that AI

is going to save the business money.

I think I've said this on several
podcasts already, but you weren't here.

So I'm telling you that I was at a
media event in London, at KubeCon, and

they gave me a media pass for some
reason, because I make a podcast.

So they act like I'm a journalist
standing around all real journalists

and pretending, but I was there and
I watched an analyst who's, from my

understanding, I don't actually know
the company, but they sound like the

Gartner of Europe or something like that.

and the words came out of the mouth
at an infrastructure conference,

the person said, my clients
are looking for humanless ops.

And I visibly, I think,
chuckled in the room because

I thought, well, that's great.

That's rich that you're at a
12,000-ops-person conference

telling us that your companies
want none of this.

these are all humans here doing this work.

The premise of your argument about QA
matches my exact same thoughts: nobody

is budgeting for more QA or more
operations personnel or more DevOps

personnel, just because the engineers
are able to produce more code with AI,

in the app teams and the feature teams.

And so we've all got to do better
and we've got to figure out where

AI can help us, because if we don't,
they're just going to hire the person

that says they can, even though maybe
it's a little bit of a shit show.

My belief is that software
organizations need to change the

way they work for an AI future. That
might be cultural changes, it might

be role changes; you know, words like
human in the loop get tossed around a

lot when it's about engineers interacting
with AI, and the question is, okay, what

does that actually look like, right?

Are we just reviewing AI's PRs and,
you know, kind of blindly saying, yep,

they wrote the unit tests for
something that works? Or are we

actually doing critical thinking,
critical systems thinking, unique

thinking that allows us, as owners of
the business and our users' success,

to design and build better
software with AI as a tool?

and it's not just QA, it's kind of
like all along the software development

life cycle, how do we put the right
practices in place, and how do we build

an organization that actually allows us,
with this new AI driven future, you know,

whether it's agents, doing work on our
behalf, or whether it's us just with AI

assisted coding, to build better software.

And, yeah, I'm interested in what that
looks like over the next couple of years.

That's kind of the premise of my
new Agentic DevOps podcast.

And also, as I'm building out this new
GitHub Actions course, I'm realizing that

I'm having to come up with the
best practices myself.

I'm having to figure out what is risky
and what's not because no one has really

figured this out yet in any great detail
and, in fact, at, KubeCon London in April,

which feels like a lifetime ago, there
was only one talk around using AI anywhere

in the DevOps and operations
path to benefit us.

It was all about how to
run infrastructure for AI.

And granted, KubeCon is an infrastructure
conference for platform engineering

builders and all that, so it makes sense.

But the fact that really only one talk,
and it was a team from Cisco, which

I don't think of Cisco as like the
bleeding edge AI company, but it was a

team from Cisco, simply trying to get
an AI workflow, or maybe you would call it

an agentic workflow, for PR review, where,
in that case, I'm presuming that humans

are still writing their code and AI
is reviewing the code.

Actually, just yesterday I was on
GitHub trying to figure out if there

was a way to make branch rules or some
sort of automation rule so that if the

AI wrote the PR, the AI doesn't
get to review the PR.
Yeah, yeah, right.

we've got this coming
at us from both angles.

We've got AI in our IDEs.

We've got, multiple companies,
Devon and GitHub themselves.

They now have the GitHub
Copilot coding agent.

I have to make sure I
get that term right.

GitHub Copilot coding agent,
which will write the PR and

the code to solve your issue.

And then they have a Copilot code
review agent that will review the code.

It's the same models, different
context, but it feels like

that's not a human in the loop.

So we're going to need these guardrails
and these checks in order to make sure

that code didn't end up in production
with literally no human eyeballs

ever in the path of that code being
created, reviewed, tested, and shipped.

cause we can do that now.

Like we did, we

Yeah, totally, you're totally good.

And I mean, you can easily perceive the
mistakes that could happen too, right?

I mean, before I, I took this role, at
Embrace, I was at New Relic for four and a

half years, and before that I was at AWS.

And so, obviously I've spent a
lot of time around CloudFormation

templates, Terraform, etc.

You can see a world where, you
know, AI builds your CloudFormation

template for you and selects an EC2
instance type because, from the

information it has about your workload,
that instance type looks optimal.

But in the region you're running,
that instance type's not freely

available for you to autoscale to.

And pretty soon, you go try
to provision more instances,

and poof, you hit your cap.

Because, that instance type just
doesn't have availability in Singapore.
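(A minimal sketch of the kind of pre-flight check a human, or an agent given the right tools, could run before trusting a generated template. The instance type and region are illustrative, and note this only checks whether the type is offered in the region at all; capacity at autoscale time is a separate problem.)

```python
# check_offering.py - verify an EC2 instance type is actually offered in the
# target region before baking it into a template. The instance type and
# region below are illustrative placeholders.
import boto3

def instance_type_offered(instance_type: str, region: str) -> bool:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_type_offerings(
        LocationType="region",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return len(resp["InstanceTypeOfferings"]) > 0

if __name__ == "__main__":
    if not instance_type_offered("m6g.2xlarge", "ap-southeast-1"):
        raise SystemExit(
            "Instance type not offered in this region; pick another before deploying."
        )
    print("Instance type is offered in the region.")
```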

And, us as humans and operators, we learn
a lot about our operating environments, we

learn about our workloads, we learn about
the, the character, the peculiarities

of them that don't make sense to a
computer, but are, like, based on reality.

over time, maybe AI gets really
good at those things, right?

But the question is, how do we build
the culture to kind of be guiding our,

you know, army of assistants to build
software that really works for our

users, instead of just trusting it to go
do the right thing because we view, you

know, everything as having a true and
pure result, which I don't think is true.

a lot of the tech we build is
for people who build consumer

mobile apps and websites.

I mean, that is what the
tech we build is for.

And you can easily see, you know, some
of our engineers have been playing

around with using AI-assisted coding to
implement our SDK in a consumer mobile

app, and it works quite well, right?

You can see situations where, an
engineer gets asked by somebody, a

product manager, a leader to say, Hey,
you know, we got a note from marketing.

They want to implement this new
attribution SDK that's going to go

build up a profile of our users
and help us build more, you know,

customer friendly experiences.

You have the bots go do it, it tests
the code, everything works just fine.

And then, for some reason, that SDK makes
an uncached request out to us-west-2

for users globally, which, you know, for
your users in Southeast Asia ends up

adding an additional six and a half
seconds to app startup, because physics.

And, what do those users do?

If you start an app and it just
sits there hanging for four, five,

six seconds, and you have to use
the app because, you know, you're

waiting for your boarding pass to come
up and you're about to get on the plane,

you probably perceive it
as broken and abandon it.

And to me, that's like a reliability
problem that requires systems thinking

and cultural design around how your
engineering organization works to avoid,

one that I don't think is
immediately solved with AI.

right.

Observability is only getting more
complex. Even, you know, it's a cat and

mouse game, I think, in all regards, and I
think everything we do has a yin and yang.

I'm watching organizations that are full
steam ahead, aggressively using AI.

I'd love to see some stats.

I don't know if you have empirical
evidence or just sort of anecdotal stuff

from your clients, but where you see them
accelerate with AI and then almost have a

pulling back effect because they realize
how easy it is for them to ship more bugs.

and then sure, we could have the AI
writing tests, but it can also write

really horrible tests, or it can
just delete tests because it doesn't

like them because they keep failing.

It's done that to me multiple times.

In fact, I think, oh, what's his name?

Gene, no, it wasn't Gene Kim.

There's a famous, it might've been
the guy who created Extreme Programming.

I think I heard him on a podcast talking
about how he wished he could make all

of his test files read-only.

Because he writes the tests, and then
he expects the AI to make them pass,

and the AI will eventually give up
and then want to rewrite his tests.

And he doesn't seem to be able to stop
the AI from doing that, other than just

denying, deny, or cancel, cancel, cancel.
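(A minimal sketch of that guardrail as a CI check, assuming tests live under a tests/ directory and the base branch is main: it fails whenever existing test files are modified or deleted, so a human has to explicitly sign off on those changes.)

```python
# protect_tests.py - fail CI if existing test files were modified or deleted,
# so an agent can't quietly rewrite failing tests. The test directory and
# base branch are assumptions about this repo's layout.
import subprocess
import sys

BASE = "origin/main"   # assumed base branch
TEST_DIR = "tests/"    # assumed test location

def changed_test_files() -> list[tuple[str, str]]:
    # --name-status emits lines like "M<TAB>path", "D<TAB>path", "A<TAB>path".
    out = subprocess.run(
        ["git", "diff", "--name-status", f"{BASE}...HEAD", "--", TEST_DIR],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        status, path = line.split("\t", 1)
        rows.append((status, path))
    return rows

def main() -> int:
    suspicious = [(s, p) for s, p in changed_test_files() if s[0] in ("M", "D")]
    if suspicious:
        print("Existing tests were modified or deleted; a human must approve this:")
        for status, path in suspicious:
            print(f"  {status}\t{path}")
        return 1
    print("No existing tests were modified or deleted.")
    return 0

if __name__ == "__main__":
    sys.exit(main())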

And there's a scenario where
that can easily happen in

the automation of CI and CD.

Where, you know, suddenly it decided
that the tests failing were okay.

And then it's going to push
it to production anyway, or

whatever craziness ensues.

Do you see stuff happening
on the ground there?

That's

Did you read that article about the
AI-driven store, a fake store purveyor

that Anthropic created, named Claudius?

When it got pushback from the Anthropic
employees around what it was stocking in

its fake fridge, its fake store, it
called security on the employees to

try to displace them.

I mean, I think what you're
pointing to is that the moral compass of

AI is not necessarily there, and that's
a very complex thing to solve, right?

Is this the right thing to do or not?

Because it's trying to solve the
problem however it can solve the

problem with whatever tools it has.

Not whether the problem
is the right one to solve.

And what's right is, I mean, obviously,
you know, something even humans struggle with.

Right.

I've just been labeling it all as taste.

like the AI lacks taste

Yeah, that's

like in a certain team culture,
if you have a bad team culture,

deleting tests might be acceptable.

I used to work with a team
that would ignore linting.

So I would implement linting.

I'd give them the tools to customize
linting and then they would ignore

it and accept the PR anyway.

And did not care.

They basically followed the
linting rules of the file.

Now, they were in a monolithic
repo with thousands and thousands

of files in a 10-year-old Ruby app.

But they would just ignore the linting.

And I always saw that as a
culture problem for them.

That, I as the consultant can't fix.

I would always tell the boss I was
working for there that I can't

solve your culture problems.

I'm just a consultant here trying
to help you with your DevOps.

You need a full time person leading
teams to tell the rest of the team

that this is important and that these
things matter over the course of

years as you change out engineers.

You can't just go with, well,
this file is tabs, this file is

spaces, and this one follows something
else, especially in places like Ruby and

Python and whatnot, where there's
multiple, sort of, style guides

and whatnot like that.

But I just call that a taste
issue or culture issue.

AI doesn't have culture or taste.

It just follows the rules we give it.

we're not really good yet as
an industry at defining rules.

I think that's actually part
of the course I'm creating is

figuring out all the different
places where we can put AI rules.

We've seen them put in repos.

We've seen them put in
your UI and your IDEs.

Now I'm trying to figure out how do I
put that into CI and how do I put that

into even, like, the operations side. If
you're going to put AIs somewhere where

they're going to help with
troubleshooting or visibility or discovery

of issues, they also need to have

taste, which is why I want to change all
my rules files to taste files.

I guess that's the platform
I'm standing on.

Yeah.

Cause it,

Yeah, I mean, at the same time, you
probably don't want to be reviewing

hundreds of PRs that an AI robot is
going through, just changing the casing

of legacy code to meet your rules. Change
obviously introduces the opportunity

for failure, bugs, etc., and it's
somewhat arbitrary

simply because you've given it a new
rule that this is what it has to do.

I mean, I had a former coworker
many years ago, like 15, 20 years

ago, who was famous for just going
through periodically and making

nothing but casing code changes
because that's what he preferred.

And it was just this endless stream
of trivial changes on code we

hadn't touched in months or years.

And that would inevitably lead to
some sort of problem, right?

The toil of that.

I'm just like, yeah, that you're
giving me heartburn just by

telling me about that story.

Yeah, well, I've gotten over the
PTSD of that. It was a long time ago.

So, okay.

So, what are you seeing on the ground?

do you have some examples of
how this is manifesting in apps?

I mean, you've already given me
a couple of things, but I'm just

curious if you've got some more,

I kind of alluded to it
at the very beginning.

Obviously, in my role, I talk to a lot
of senior executives about directionally

where they're going with their engineering
organizations. Because I think the

software we build, it does a bunch of
tactical things, but in a broader sense,

it allows people to measure reliability as
expressed through, you know, your customers

staying engaged with your experience by
virtue of the fact that, beyond otherwise

liking your software, they're having a
better technical experience with the apps

you build on the front end. And ultimately,
measuring that and then thinking through

all of the underlying root causes is a
cultural change in these organizations

and how they think about reliability.

It's no longer just like are
the API endpoints at the edge of

our data center delivering their
payload, you know, responding in

a timely manner and error free.

And then the user experience is well
tested and, you know, we ship it to prod,

and that's enough.

It really isn't just a shift in how
people think about it. When I talk to

them, I mean, a lot of CTOs really are
taking a bet that AI is going to be how

productivity gains come, I won't say to
make their business more efficient, but

to allow them to do more with less, right?

You know, like it or not, especially
over the past few years, I mean, in B2B,

we're in the B2B SaaS business, there's
been, you know, times have definitely

been tougher than they were in the early
2020s for kind of everyone. I think

there's a lot of pressure in consumer
tech with tariffs and everything to do

things more cost effectively. And first
and foremost, on the ground, this is a

change we are seeing, whether we like
it or not, and, you know, we can argue

about whether it's going to work and
how long it'll take, but the fact is

that, like you said with the CIO
magazines, leadership is taking

a bet this is going to happen. And I
think as we start to talk about that

with these executives, the question is
like, is the existing set of tools I have

to cope with that reality good enough?

And, yeah, I guess my underlying
hypothesis is it probably isn't

for most companies, right?

If you think about the world of,
you know, the web as an example, a

lot of companies that are
consumer-facing tech companies will

measure their Core Web Vitals, in
large part because it has SEO impact.

And then they'll put an exception handler
in there, like Sentry, that grabs, you

know, kind of JavaScript errors and
tries to give you some level of impact

around, you know, how many are impacting
your users and whether it's high

severity, and then you kind of have to
sort through them to figure out which

ones really matter for you to solve.

So take existing frustration with the
human efficiency of delivering code,

at the existing pace: people are already
frustrated that Core Web Vitals are hard

to quantify in terms of what they really
mean for user impact and whether a user

decides to stay on the site or not.

so, yeah.

And the fact that they're overwhelmed with
the number of JavaScript errors that could

be out there. Because really, I mean, you
go to any site, go to developer tools,

look at the number of JavaScript errors
you see, and then, you know, take your

human, experienced idea of how you're
interacting with the site, and chances

are most of those don't impact you, right?

They're just like, it's a, it's an
analytics pixel that failed to load, or

it's a library that's just barfing some
error, but it's otherwise working fine.

So take that and put it on steroids now
where you have a bunch of AI assisted

developers doubling or tripling the
number of things they're doing or

just driving more experimentation,
right, which I think a lot of

businesses have always wanted to do.

But again, software has been kind
of slow and expensive to build.

And so if it's slow and expensive
to build, my thirst for delivering

an experiment across three customer
cohorts when I can only deliver one,

given my budget, just means that I
only have to test for one variation.

Well, now triple that, or quadruple
it, and, you know, multiply that

by a number of arbitrary user factors.

It just gets more challenging,
and I think we need to think about

how we measure things differently.

once the team has the tools in place to
manage multiple experiments at the same

time, that just ramps up exponentially
until the team can't handle it anymore.

But if AI is giving them a chance
to go further, then yeah, they're

just, they're going to do it.

I mean,

Yeah, I mean, you're going to get
overwhelmed with support tickets

and bad app reviews and whatever it
is, and I think most people, most

business leaders, would be pretty
upset if that's how they are

responding to reliability issues.

I was just gonna say, we've already had
decades now of pressure at all levels

of engineering to reduce personnel.

you need to justify every new hire pretty
significantly unless you're funded and

you're just, you know, in an early stage
startup and they're just growing to the

point that they can burn all their cash.

Having been around 30 years in
tech, I've watched operations

get, you know, merged into DevOps.

I've watched DevOps teams, which we
didn't traditionally call that, we

might have called them sysadmins
or automation or build engineers

or CI engineers, get
merged into the teams themselves.

And the teams have to take
on that responsibility.

I mean, we've got this weird culture
where we go to this Kubernetes conference

all the time and one of the biggest
complaints is devs who don't want to

be operators, but it's saddled on them
because somehow the industry got the

word DevOps confused with something,
and we all thought, oh, that

means the developers can do ops.

That's not what the word meant, but
we've gotten to this world

where I'm getting hired as a consultant
to help teams deal with just the sheer

amount of ridiculous expectations put
on a single engineer: not just the

knowledge you're supposed to have, but
the systems you're supposed to be

able to run while making features.

And even a decade ago,
it felt unsustainable.

So now here we are having to
give some of that work to AI.

When it's still doing random
hallucinations on a daily basis,

at least even in the best models.

I think I was just ranting yesterday
on a podcast that SWE-bench, which is

like an engineering benchmark website
for AI models and how well they

solve GitHub issues, essentially, shows
that the best models in the world

can barely get two thirds of
them correct.

And that's, if you're paying the
premium bucks and you've got the premium

foundational models and you are on the
bleeding edge stuff, which most teams

are not because they have rules or
limitations on which model they can use,

or they can only use the ones in house,
or they can only use a particular one.

And it's just one of
those things where I feel like

we're being pushed from all sides.

and at some point, it's amazing
that any of this even works.

It's amazing that apps
actually load on phones.

it's just, it just, it feels like
garbage on top of garbage, on top

of garbage, turtles all the way
down, whatever you want to call it.

So where are you coming in here to
help solve some of these problems?

yeah, that's fair, yeah.

I'll even add to that: all of that's
even discounting the fact that new tools

are coming that make it even simpler
to push out software that like barely

works without even the guided hands
of a software engineer who has any

professional experience writing that code.

Some of the vibe coding tools, like the
ones our product management team uses,

largely for rapid prototyping.

And I can write OK Python.

I used to write OK C sharp.

I have never been particularly
good at writing JavaScript.

I can read it OK, but like when
it comes to fixing a particular

problem, quickly get out of my depth.

That's not to say, I couldn't
be capable of doing it.

It's just not what I do every day,
nor do, I particularly have the energy

when I'm, you know, done with a full
workday to, go teach myself JavaScript.

And, I'll build an app with
one of the Vibe coding tools

as a means of communicating
how I expect something to work.


and I'm like, ah, it's better
to just delete the project

and start all over again.

And, you know, if you can make
it work, that doesn't necessarily

mean it'll work at scale.

It doesn't necessarily mean that there
aren't like a myriad of use cases

you haven't tested for as you click
through the app in the simulator.

and so, you know, I think the
question is, okay, you know, given

the fact that we have to accept more
software is going to make its way

into human beings' hands, because we
build software for human beings, it's

going to get to more human beings.

How do we build a, a reliability
paradigm where we can measure

whether or not it's working?

And I think that stops focusing
on, kind of, I guess to go back

to the intentionally inflammatory
title of today's discussion, it

stops focusing on like a zero
bug paradigm where I test things.

I test every possible
pathway for my users.

I, you know, have a set of
requirements, again, around

trivialized performance and stuff.

And trying to put up these kinds of
barriers to getting code into human hands.

Instead, I just accept the fact more code
is going to get into human hands faster,

at a pace I can't possibly control.

And so, therefore, I have to put
measurements in the real world.

So, I use a lot of different tools around
my app so that I can be as responsive

as possible to resolving those issues
when I find them. Which is, I guess, you

know, Charity Majors, who co-founded
Honeycomb, she and I were talking a

few months ago, and she's a big fan of
stickers, she shipped me an entire

envelope of stickers, and they're all
like, you know, ship fast and break

things, right? I test in production,
stuff like that.

And somehow I feel like, you know,
in our world, because we build

observability for front end and mobile
experiences, like web and mobile

experiences, I feel like that message
just hadn't gotten through historically.

part of it's because like release
cycles on mobile were really slow.

you had to wait days for
an app to get out there.

Part of it was software is expensive
to build and slow to build and so

getting feature flags out there where
you can operate in production was hard.

Part of it was just the observability
paradigm hadn't shifted, right?

Like the, the paradigm of measure
everything and then find root cause

had not made its way to frontend.

It was more like measure the known
things you look for, like web exceptions,

like Core Web Vitals, or on mobile,
look at crashes, and that's about it.

And the notion of, okay, measure
whether users are starting the app

successfully, and when you see some
unknown outcomes start to occur or users

start to abandon, how can you then sift
through the data to find the root cause,

hadn't really migrated its way over.

And that's what we're trying to do.

We're trying to bring that paradigm of
how do you define the, for lack of a

better term, APIs: the things that your
humans interact with in your apps, the

things they do. You don't build an API
in your app for human beings, you build

a login screen, you build a checkout
screen, a cart experience, a product catalog.

How do we take those things and measure
the success of them and then try to

attribute them to underlying technical
causes, where your teams can better

have those socio-technical conversations,

so that they can understand and then
resolve them, probably using AI, right,

as we grow?

But it allows better system knowledge
and understanding of the interplay

between real human activities and
the telemetry we're gathering.

Yeah.

Have you been, I'm just curious,
have you been playing with, having

AI look at observability data,
whether it's logs or metrics?

have you had any experience with that?

I'm asking that simply as a generic
question, because the conversations

I've had in the last few months, it
sounds like AI is much better at reading

logs than it is at reading metrics or
dashboards or anything that sort of

lacks context. You know, it's not
like we're putting in alt text

for every single dashboard graph.

And that's probably all coming from
the providers, just because if

they're expecting AI to look at stuff,
they're gonna have to give more context.

But it sounds like it's not as easy
as just giving AI access to all those

systems and saying, yeah, go read the
website for my Grafana dashboard

and figure out what's the problem.

I've seen it deployed in two ways, one
of which I find really interesting and

something we're actively working on,
because I think it just has a high degree

of utility and, you know, given
the current state of LLMs, I think it's

probably something that's relatively easy
to get right. Which is the notion of:

when you come into a product like ours,
or a Grafana, or, you know, a New Relic,

you probably have an objective in mind.
Maybe you have a question you're trying

to ask: what's the health of this service?

Or, I got paged on a particular issue.

I'm going to, I need to build a chart
that shows me like the interplay between

latency for this particular service
and the success rate of, you know,

some other type of thing, more like
database calls or something like that.

Today, like people broadly have to
manually create those queries and it

requires a lot of human knowledge around
the query language or schema of your data.

And I think there's a ton of opportunity
for us to simply ask a human question

of, you know, show me a query of all
active sessions on this mobile app for

the latest iOS version and the number
of traces, like startup traces, that

took greater than a second and a half.

And have it just simply pull up your
dashboard and query language and build

the chart for you quite rapidly, which
is a massive time savings, right?
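(A rough sketch of what that kind of assistant could look like under the hood, using OpenAI's Python SDK. The schema and the little query grammar in the prompt are invented for illustration; this is not Embrace's, or anyone's, actual query language.)

```python
# nl_to_query.py - translate a plain-English question into a telemetry query.
# The schema and query grammar in the prompt are invented for illustration;
# this is not any vendor's real query language.
from openai import OpenAI

SCHEMA_HINT = """
Tables: sessions(app_version, os_version, duration_ms, started_at)
        traces(name, duration_ms, session_id)
Query grammar: FIND <table> WHERE <conditions> [GROUP BY <field>]
"""

def question_to_query(question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Translate the user's question into one query using this "
                        "schema and grammar, and return only the query:\n" + SCHEMA_HINT},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    print(question_to_query(
        "Show me all active sessions on the latest iOS version with "
        "startup traces that took longer than 1.5 seconds."
    ))
```

The hard parts in practice are grounding the model in your real schema and validating the generated query before running it, which is part of why this works better as an assistant than as an autonomous operator.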

And.

It also just makes our tech, which
you're right, can get quite complex, more

approachable by your average engineer.

Which, you know, I'm a big believer that
if every engineer in your organization

understands how your systems work.

And the data around it, you're going
to build a lot better software,

especially as they use AI, right?

Because now they, they understand
how things work and they can better

provide instructions to the robots.

so I think that's really
a useful, interesting way.

And we've seen people start to roll that
type of assistant functionality out.

The second way I've seen it deployed, I
see mixed results, which is: I have an

incident, go look at every potential
signal that I can see related to this

incident and try to tell me what's
going on and get to the root cause. And

more often than not, I find it's just a
summarization of stuff that you, as an

experienced user, probably would come to
the exact same conclusions on. I think

there's utility there, certainly; it gets
you a written summary quickly of what you see.

But I do also worry that, it doesn't
apply a high degree of critical thinking.

And, you know, an example of where,
lacking context, it wouldn't be very

smart, right, is, you've probably seen
it, every chart of service

traffic, depending upon how it runs,
tends to be pretty lumpy with time of day.

Because most companies don't
have equivalent distribution

of traffic across the globe.

Not every country across the globe
has an equivalent population.

And so you'd tend to see these spikes
of where, you know, you have a number

of service requests spiking during
daylight hours or the Monday of every

week because people come into the office
and suddenly start e commerce shopping.

And you see it taper throughout the
week, or taper off into the evening.

I think that's normal.

You understand that as an
operator of your services, because

it's unique to your business.

I think the AI would struggle, lacking
context around your business, to

understand somewhat normal fluctuations,
or the fact that, you know, marketing

dropped a campaign, where there's no
data inside your observability system

to tell you that that campaign dropped.

It's not a release.

See, your AI is lacking context.

It's lacking the historical context
that the humans already have implicitly.

Yeah,

and I mean that context might be in a
Slack channel where marketing said we

just dropped an email and you know an
email drop so expect an increased number

of requests to this endpoint as you know
people retrieve their, their special

offer token or whatever that will allow
them to use it in our checkout flow.

Today, unless we provided that scope
to an AI model within our system,

we would lack that type of context.

Yeah, I'm not creating any of
these apps, and I'm sure they've already

thought of all of this, but the
first thing that comes to mind is,

well, the hack for me would be to give
it access to our ops Slack room.

Right.

We're probably all having those
conversations of, oh, what's going on here?

And someone who happened to have
reached out to marketing was like,

well, you know, yesterday we did send
out a, you know, a new coupon sale or whatever.

So, yeah, having it read all that
stuff might be necessary for it to

understand, because you're right.

It's not like we have a dashboard in
Grafana that's the number of marketing

emails, you know, emails sent per day,
or the level of sales we're expecting

based on, in America, the 4th of
July sale, or, you know, some holiday

in a certain region of the world.

Or a social media influencer dropping,
like, some sort of link to your

product, that suddenly, you know, it's
a new green doll that people

attach to their designer handbags.

I don't know like anything about what
the kids are into these days, but

it seems kind of arbitrary and like
I, I would struggle to predict that.

Let me put it that way.

They're into everything old.

So they're into everything that I'm into.

Yes.

all right.

So if we're talking about this at a
high level, we talked a little bit

before the show about how Embrace is
thinking about observability,

particularly on mobile, but,
you know, anything front end there.

the tooling ecosystem for, engineers
on web and mobile is pretty rich.

But they all tend to be just like
hammers for a particular nail, right?

It's, you know, how do we give you a
better crash reporter, a better exception

handler, how do we go measure X or Y?

Some of the stuff that we're thinking
about is really how we define

the objective of observability
for the coming digital age, right?

Which is, you know, as
creators of user experiences.

I think my opinion is that we
shouldn't just be measuring

like crash rate on an app.

We should be measuring, are users
staying engaged with our experience?

And when we see they are not, sometimes
it's crashes, I mean, the obvious

answer is obviously they can't, right?

Because the app explodes.

But, I think, you know, I was
talking to a senior executive at

a massive food delivery app.

And it's, listen, we know, anecdotally,
there's more than just crashes that make

our users throw their phone at the wall.

You're trying to order lunch at
noon and something's really slow,

or you keep running into just a
validation error because we

shipped you an experiment thinking it
worked and you can't order the item

you want on the two-for-one promotion.

you're enraged because you really want
the, you know, the spicy dry fried chicken

hangry.

and you want two of them because I
want to eat the other one tonight.

and you've already suckered me
into that offer, you've convinced

me I want it, and now I'm having
trouble, completing my objective.

And broadly speaking, the observability
ecosystem on the front end really

hasn't measured that, right?

We've used all sorts of proxy
measurements out of the data

center because the reliability story
has been really well told and has evolved

over the past 10 to 15 years in the data
center world, but it just really hasn't

materially evolved on the front end.

And so, a lot of that's like shifting
the objective from how do I just measure

counts of things I already know are bad, to
measuring what user engagement looks like

and whether I can attribute that to changes
in my software or defects I've introduced.

So, that's kind of the take.

just about anybody who has ever built
a consumer mobile app has Firebase

Crashlytics in the app, which is
a free service provided by Google.

It was a company a long time ago,

Crashlytics, that got bought by Twitter
and then got acquired by Google.

It basically gives you rough-cut
performance metrics and crash reporting.

Right.

I would consider this like the
foundational requirement of any level

of app quality, but to call this, in and
of itself, app quality, I think, you know,

our opinion is that would be a misnomer.

So we're going to kind of go
through what this looks like, right?

Which is, it's giving you things
you would expect to see, a number of

events that are crashes, et cetera, and
you can do what you would expect here:

crashes are bad, so I need to solve a
crash, so I'm going to go into a crash,

view stack traces, get the information I
need to actually be able to resolve it.

and, you know, I think we see a lot
of customers before we talk to them

who are just like, well, I have crash
reporting and I have QA, that's enough.

There's a lot of products
that have other features.

So, like, Core Web Vital measurements
on a page level, this is Sentry.

It's a lot of data, but I don't
really know what to do with this

beyond, okay, it probably
has some SEO impact.

There's a, you know, bad Core
Web Vital, or a slow, you know, slow

something for a render on this page.

How do I actually go
figure out root cause?

But again, right, this is a single signal.

So you don't know whether or not
the P75 Core Web Vital here that is

considered scored badly by Google
is actually causing your users to bounce.

And I think that's important because I
was reading this article the other day

on this, like, notion of a performance
plateau. There's empirical science

proving that faster Core Web
Vitals, especially, like, contentful

paint and interaction to next paint,
et cetera, improve bounce rate

materially, like people are less likely
to bounce if the page loads really fast.

But at some point, like if it's long
enough, there's this massive long

tail of people who just have a rotten
experience and you kind of have to

figure out, I can't make everyone
globally have a great experience.

Where's this plateau where, I know
that I'm improving the experience for

people who I'm likely to retain and
improve their bounce rate versus I'm

just, you know, going to live with this.

And so we kind of have a different
take, which is like we wanted to

center our experience less on just
individual signals, and more on

like these flows, these tasks that
users are performing in your app.

So if you think about the key flows,
like I'm breaking these down into

the types of activities that I
actually built for my end users.

And I want to say, okay, how many
of them were successful versus

how many ended in an error, like
something went truly bad, right?

You just could not proceed.

Versus how many abandoned, and
when they abandoned, why, right?

Did they abandon because they, they
clicked on a product catalog screen, they

saw some stuff that they didn't like?

Or did they abandon because the product
catalog was so slow to load, and images

so slow to hydrate,

that they perceived it as
broken, lost interest in the

experience, and ended up leaving?

And so, the way you do that is you
basically take the telemetry we're

emitting from the app, the exhaust
we collect by default, and you create

these start and end events that allow
us to then post-process the data.

We go through all of these sessions
we're collecting, which is basically

a play by play of like linear
events that users went through.

And we hydrate the flow to tell
you where people are dropping off.
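(A rough sketch of what those start and end events can look like if you model a user flow as an OpenTelemetry span. The span, attribute, and outcome names are generic illustrations rather than Embrace's actual SDK; a mobile app would do the same thing through the Swift, Kotlin, or JavaScript APIs.)

```python
# checkout_flow.py - model a user flow as a span with an explicit outcome, so a
# backend can post-process sessions into completed / errored / abandoned counts.
# Span and attribute names are illustrative, not a specific vendor's schema.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("app.flows")

def run_checkout_flow(cart, submit_order):
    # Start event: the user entered the flow.
    with tracer.start_as_current_span("flow.checkout") as span:
        span.set_attribute("flow.name", "checkout")
        span.set_attribute("cart.items", len(cart))
        try:
            order_id = submit_order(cart)
        except TimeoutError:
            # The request hung long enough that the user likely gave up:
            # record it as an abandonment-style outcome.
            span.set_attribute("flow.outcome", "abandoned")
            span.set_status(Status(StatusCode.ERROR, "checkout timed out"))
            raise
        except Exception as exc:
            # Hard failure: the user could not proceed at all.
            span.set_attribute("flow.outcome", "error")
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
        # End event: the flow completed successfully.
        span.set_attribute("flow.outcome", "completed")
        span.set_attribute("order.id", order_id)
        return order_id
```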

and so you can see like their actual
completion rates over time, you know,

obviously it's a test app, so there's
not a ton of data there, but what gets

really cool is we start to
build out this notion of, once you see

the issues happen, well, how can I now
go look at all of the various attributes

of those populations under the hood
to try to specify which of the things

are most likely to be attributed to
the population suffering the issue?

So that could be an experiment.

it could be a particular mobile,
a particular version they're on.

It could be, an OS version, right?

You just shipped an experiment that
isn't supported in older OSs, and those

users start having a bad experience.

And then each of those gets you down to
what we call this user play-by-play

session timeline, where you basically
get a full recreation of every part of

the exhaust stream that we're gathering
from you interacting with the app or

website, just for reproduction purposes.

once you've distilled here,
you can say, okay, now let me

look at that cohort of users.

And so I can do pattern
recognition, which I think is pretty

Hmm.

So for the audio audience that didn't,
that didn't get to watch the video,

what are some of the key sort of, if
someone's in a mobile and front end

team and this is actually going back
to a conversation I had with one of

your Embrace team members at KubeCon in
London, what are some of the key changes

or things that they need to be doing?

I guess if I back up and say the
premise here, there's

almost like two archetypes.

What am I trying to say here?

There's two archetypes
that I'm thinking about.

I'm thinking about me, the DevOps
slash observability system maintainer.

I've probably set up ELK or, you know,
I've got the key names: the Lokis,

the Prometheuses, the Grafanas.

I've got all these things
that I've implemented.

I've brought my
engineering teams on board.

They like these tools.

They tend to have, especially for
mobile, they tend to have other

tools that I don't deal with.

They might have platform, like the, the

Traditionally, right?

We're obviously trying to change that.

But yeah, traditionally, they
have five or six other tools that

don't play into the ecosystem,
observability ecosystem you have set up.

Yeah.

So, so we're on this journey to
try to centralize, bring them

into the observability world.

You know, traditional mobile
app developers might not even

be aware of what's going on in the
cloud native observability space, and

we're bringing them on board here.

Now suddenly they get even more code
coming at them that's slightly less

reliable or maybe presents some unusual
problems that we didn't anticipate.

So now, you know, we're in a
world where suddenly what we have

in observability isn't enough.

yeah,

You're a potential solution.

What are you looking at for behaviors
that they need to change, things

that people can take home with them?

I mean, I guess the way I think about
it is, right, the reason

observability became so widely adopted
in server-side products was because, in

an effort to more easily maintain our
software and to avoid widespread defects

of high blast radius, we shifted from a
paradigm of, like, monoliths deployed on

bare metal to virtualization, to, you
know, various container schemes, which

right now has most widely settled around
Kubernetes and microservices,

because you could scale them independently
and you could deploy them independently.

Right.

And that complexity of the deployment
scheme, and the different apps and services

interplaying with each other, necessitated
an x-ray vision into your entire system

where you could understand system-wide
impacts to the end of your world.

And the end of your world, for the
most part, became your API surface

layer, the things that served
your web and mobile experiences.

And, you know, there are
businesses that just serve APIs.

Right, but broadly speaking, the brands
we interact with as human beings serve us

visual experiences that we interact with.

right.

It's the server team managing
the server analytics, not so much

the client device analytics that

right.

the world has gotten a lot more
complicated in what the front

end experience looks like.

And you could have a service that
consistently responds and has a nominal

increase in latency and is well within
your alert thresholds, but where the

SDK or library designed for your front
end experience suddenly starts retrying

a lot more frequently, delivering
perceived latency to your end user.

And so I think the question is,
could you uncover that incident?

Because if users suffer perceived latency
and therefore abandon, what metrics do

you have to go measure whether or not
users are performing the actions you

care about them performing, and whether
that's attributable to a system change?

In most instances, I don't think
most observability systems have that.

And then the second question is, right,
and by the way, Bret, that's the

underlying supposition: that in a real
observability scheme, mean time to detect

is as important, if not
more so, than mean time to resolve.

The existing tooling ecosystem for
frontend and mobile has been set up to

optimize mean time to resolve for known
problems, where I can basically just

count the instances and then alert you.

And, you know, with the lack of
desire to be on call, I've

heard this stupid saying that
there's no such thing as a front end

emergency, which is, like, ridiculous.

If I'm, you know, a major travel
website and I run a thousand different

experiments, and a team in Eastern
Europe drops an experiment that affects

1 percent of users globally
in the middle of my night,

that makes the calendar control
broken so that some segment of that

population can't book their flights,

that sounds a lot like a
production emergency to me.

That has material business
impact in terms of revenue.

Or the font color changes
and the font's not readable

Yeah, I guess I am imploring the
world to shift to a paradigm where

they view, like, users' willingness
and ability to interact with your

experiences as a reliability signal.

And I think the underlying supposition
is that this only becomes a more acute

problem as the number of features
we ship grows.

I guess I'm starting with the
belief that from what I hear, people

are doubling down on this, right?

They're saying we need to make software
cheaper to build and faster to build.

Because it is a competitive environment
and if we don't do it, somebody else will.

and as that world starts to
expand, the acuity of the

problem space only increases.

Yeah.

I think your theory here matches up
well with some others. We're

all kind of in this place where
we've got a little bit of evidence,

we hear some things,
and we've got theories.

We don't have years of facts of
exactly how AI is affecting a lot of

these things, but other compatible
theories I've heard recently on this,

actually from three guests on the show,
are just coming to mind.

One of them is, because of the velocity
change, this increase in velocity is only

going to increase the desire for
standardization in CI and deployment, which,

for those of us who have been living
in this Kubernetes world, we've

all been trying to approach that.

Like, we're all leaning into Argo CD
as the number one way

on Kubernetes to deploy software.

You know, we've got this GitOps idea of
how to standardize change as we deliver in CI.

It's still completely wild, wild west.

You've got a thousand vendors and
a thousand ways you

can create your pipelines.

And hence the reason I need to make
courses on it for people, because

there's a lot of art still to it.

We don't have a checkbox of
this is exactly how we do it.

And in that world, the theory right now
is that maybe AI is going to allow us,

or it's going to force us,
to standardize, because we can't have a

thousand different workflows or pipelines
that are all slightly different for

different parts of our software stack.
Because then, when we get to production and

we start having problems, or if the
AI is starting to take more control, it's

just going to get worse, because the AI

Yeah, and at the end of the day,
right, standardization is more

for the mean than the outliers.

And I think a lot of people assume they're
like, oh, we can do it right because we

have hundreds of engineers working on like
our, you know, our automation and stuff.

It's do you know how many
companies there are out there

that are not technology companies?

They're a warehouse
company with technology.

Right.

And, like, standardization allows them
to more often than not build good

technology, to measure correctly, to
deploy things the right way. Like, we're

all going to interact with it in some
way, right? And as the pressure for more

velocity in them building technology
speeds up, the need for, you know, a

base layer of doing things
consistently right only increases.

And I think that's, you know,
for the world it's a challenge,

for us it's an opportunity.

Yeah.

Awesome.

I think that's a perfect
place to wrap it up.

we've been going for a while now.

But, Andrew, it's embrace.io, right?

That's,
That's,

Yeah.

www.embrace.io.

let's just bring those up.

We didn't show a lot of
that stuff, but, the,

In case people didn't know, I made a short
a few months ago about Embrace, or with one

of the Embrace team members, at KubeCon,

which we had touched on a little bit on
this show, but we talked about that.

Observability is here for your mobile
apps now and your front end apps, and

those people, those developers,
can now join the rest of us in this

world of modern metrics collection
and consolidation of logging tools,

bringing it all together into,
ideally, one single pane of glass,

if you're advanced enough to figure
all that out. The tools are making

it a little bit easier nowadays,
but I still think there's a lot of

effort in terms of implementation
engineering to get all

this stuff to work the way we hope.

But, it sounds like that you all
are making it easier for those

people, with your platform.

We're definitely trying.

Yeah.

I think a future state that would
be pretty cool would be the ability for

operators to look at a Grafana dashboard,
or a, you know, Chronosphere

or New Relic dashboard or whatever it is,

see that, you know, they see an
engagement decrease on login on a web

property, and immediately open an incident
where they page in the front end team and

the core team servicing the auth APIs in
the company, and have them operating on

data where the front end team can be like,

we're seeing a number of retries
happening after we updated to the

new version of the API that you
serve for login credentials.

even though they're all 200s, they're
all successful requests, what's going on?

And that backend team says, well, it
looks like our, you know, we were slow

rolling it out and the P90 latency
is actually 250 milliseconds longer.

Why would that impact you?

And they say, well, the SDK retries
after 500 milliseconds and our P50

latency before this was 300 milliseconds.

So, 10 percent of our users or
something are starting to retry

and that's why we're seeing this.

You know, the answer here is to increase
the resource provisioning for the

auth service to get latency back
down, and/or change our SDK to have

a more permissive retry policy.
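(To make the arithmetic in that scenario concrete, here's a small sketch that estimates what fraction of requests cross a fixed retry timeout before and after a latency shift. The lognormal latency model and the specific numbers are assumptions for illustration, not measurements from any real service.)

```python
# retry_math.py - estimate how many requests exceed the SDK's retry timeout
# before and after a latency regression. The distribution and numbers are
# illustrative assumptions, not measurements.
import math
import random

RETRY_TIMEOUT_MS = 500.0

def lognormal_from_p50_p90(p50_ms: float, p90_ms: float):
    """Build a lognormal latency sampler from its P50 and P90."""
    mu = math.log(p50_ms)
    # The P90 of a lognormal is exp(mu + 1.2816 * sigma).
    sigma = (math.log(p90_ms) - mu) / 1.2816
    return lambda: random.lognormvariate(mu, sigma)

def retry_fraction(sample, n=100_000) -> float:
    # Fraction of simulated requests slower than the retry timeout.
    return sum(sample() > RETRY_TIMEOUT_MS for _ in range(n)) / n

if __name__ == "__main__":
    random.seed(42)
    before = lognormal_from_p50_p90(p50_ms=300, p90_ms=480)        # P90 just under the timeout
    after = lognormal_from_p50_p90(p50_ms=300, p90_ms=480 + 250)   # P90 shifted 250 ms
    print(f"retrying before the rollout: {retry_fraction(before):.1%}")
    print(f"retrying after the rollout:  {retry_fraction(after):.1%}")
```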

And, you know, have teams be able to
collaborate around the right design

of software for their end users and
understand the problem from both

perspectives, but be able to kick
off that incident because they saw

real people disengaging,
not just some server-side metrics,

which I think would be pretty neat.

Yeah.

And I should mention, you all, if I
remember correctly, in cloud native

you're lead maintainers on the
mobile observability SDK.

Is that, am I getting that right?

I'm trying

We have engineers who are approvers
on Android and iOS.

We have, from what I'm aware
of, the only production

React Native OpenTelemetry SDK.

we are also participants in a new
browser SIG, which is a, a subset

of the former JavaScript SDK.

So our OpenTelemetry SDK
for web properties is

basically a very slimmed-down chunk of
instrumentation that's only relevant

for browser React implementations.

So yeah, we're working to advance the kind
of standards community in the cloud native

environment for instrumenting
real-world runtimes where Swift,

Kotlin, and JavaScript are executed.

Nice.

Andrew, thanks so much for being here.

So we'll see you soon.

Bye everybody.

Creators and Guests

Bret Fisher
Host
Bret Fisher
Cloud native DevOps Dude. Course creator, YouTuber, Podcaster. Docker Captain and CNCF Ambassador. People person who spends too much time in front of a computer.
Andrew Tunall
Guest
Andrew Tunall
President & CPO at Embrace
Beth Fisher
Producer
Beth Fisher
Producer of the DevOps and Docker Talk and Agentic DevOps podcasts. Assistant producer on Bret Fisher Live show on YouTube. Business and proposal writer by trade.
Cristi Cotovan
Editor
Cristi Cotovan
Video editor and educational content producer. Descript and Camtasia coach.