Move K8s Stateful Pods Between Nodes

Kubernetes Pod Live Migration.

That's what Cast AI calls it when they
dynamically move your pod to a different

node without downtime or data loss.

They've built a Kubernetes controller
that works with CSI storage and CNI

network plugins to copy your running
pod data, the memory, the IP address,

and the TCP connections from one
node to another in real time.

Welcome to DevOps and Docker Talk, and
I'm your solo host today, Bret Fisher.

This YouTube live stream was a fun one
because I got to nerd out with engineers

Philip Andrews and Dan Muret from Cast
AI on Kubernetes pod live migration

and how it works under the hood.

We talk about use cases for
this feature, including hardware

or OS maintenance on a node.

Maybe for right-sizing or bin packing your
pods for cost savings, or moving off of a

spot instance that's about to shut down.

Or really anytime you need to move a
daemon set that would cause an outage if

the pod had to restart or be redeployed.

I don't know of any other turnkey way
to do this on Kubernetes today, but

I've got a feeling that Cast AI has got
a winning feature on their hands, and

I'm glad that we got to dig into it.

Over 20 years ago, the virtual machine
vendors built this live migration feature

into their products, and finally, in 2025,
we're now able to do that in Kubernetes.

Let's get into the episode.

Welcome to the show.

All right, both of these
gentlemen are from Cast AI.

Philip is the Global
Field CTO at Cast AI.

What exactly does a Global Field CTO do?

Handling a lot of our large customers,

a lot of the strategic partnerships,
usually new technologies, so working

with a lot of our, you know, customers on
showing them new technologies, helping in

proof of concepts with new technologies.

It's been kind of a cool role,

basically on the technical
customer-facing side.

I get to do a lot with our
largest customers in solving

some of the hardest problems.

Nice.

When you don't know what the
title is, it makes it sound like

you just live in an airplane.

It sounds impressive.

And we've got Dan Muret, or Muret.

I really don't know my French, so that's
probably a horrible pronunciation.

Dan is here; he's a senior sales
engineer with Cast AI, or one of them.

I'm gonna make you the senior sales
engineer so that you sound very elite.

Welcome, Dan.

Who can tell me the elevator
pitch for Cast AI? Because

I've known about you all for years.

I've visited your booth at KubeCon at
least a half a dozen times over the years.

What's the main benefit of Cast AI?

Sure, I can grab that one.

When Cast AI was founded, it was to
solve a problem that our founders

had in their last startup years ago,
which was every month their AWS bill

was going up 10% regardless of what
they did the month before to try to

mitigate and manage costs
across the infrastructure.

They sold that startup to Oracle, and when
they finished that Oracle time, they went

and figured out how to solve this problem.

They said, well, Kubernetes is
gonna be the future platform.

We're gonna make our bet on Kubernetes.

And the only way to solve this
problem is through automation.

Because doing things manually
every month, one, it's tedious

and takes up a lot of time.

And two, it's just not helpful, right?

You save a little bit every month,
but it's two steps back

for every step forward you make.

With Cast AI, it's fully
automation first, right?

We made the effort that everything was
going to be automated from the start.

When it comes to node autoscaling,
node selection, node right-sizing,

workload right-sizing, everything
to do with Kubernetes, everything

we implement is automated.

And that's where the live migration
piece came in: being able to automatically

move applications around within the
cluster without having downtime.

And that's where application
performance automation comes

in, moving from this application
performance monitoring mindset.

Datadog has made a lot of money on that.

I dunno if I'm allowed to say that,

but they've done very well
and have a fantastic platform.

We love Datadog, but
you get data overload.

You end up with metric overload, and
actioning on those is very hard.

Where we need to go from here,
especially with the AI mindset that

we're moving into, is automation
of that application performance.

And that's what Cast AI
is leading the way in.

Nice. When you all reached out and I
learned about the fact that you now have

live migration, it took me back to almost
25 years ago, when that was first invented

for VMs. At the time it felt like magic.

It did not seem real.

We all had to try it to believe
it because it seemed impossible.

To move from one host to
another, maintain the IP,

maintain the TCP connections?

Surely it's gonna freeze up, and
it's gonna be like a frozen screen.

We all just assumed that.

Eventually, and maybe at first it
was a little hiccupy, I think, if I

remember correctly, like 2003, 2005,

it was one of those where it wasn't
quite live; there were

very short gaps.

Then eventually it got good
enough that it was live.

I was running data centers for
local governments at the time.

So I was very interested in this,

'cause we were running
both ESX and Hyper-V.

So I was heavily invested in
that feature and functionality.

So when I saw that you were doing
it in Kubernetes, my first thought

was, why did this take so long?

Why don't we have this yet on everything?

Because it's clearly possible,
it's technically possible.

Obviously, it's not super easy, and it
requires a lot of low-level tooling

that has to understand networking
and memory and, you know, disk

writes and all that kind of stuff.

So I'm excited for us to get
into exactly how this sort of

operates for a Kubernetes admin.

And I really feel like this show's gonna
be great for anyone learning Kubernetes,

or Kubernetes admins, because we talk
about the StatefulSet and DaemonSet

problem: we've got stateful work. Almost
everyone I know

has stateful workloads in Kubernetes.

I would say, I don't know about you
all's experience, but to me it's

an exception when everything is
stateless in Kubernetes nowadays.

Do you find that to be the case?

We talk to customers all the time, right?

And yeah.

It used to be, just
a couple years ago,

a lot more stateless.

Web servers, whatever.

Now we're definitely seeing a shift
to more stateful workloads, whether

it's legacy applications being
forced into Kubernetes as part of a

modernization project or whatever.

We're seeing a lot more stateful
workloads in Kubernetes for sure.

Particularly amongst the Fortune 100s,
Fortune 500s, right?

'Cause you've got this modernization,
and I put it in quotes, where

modernization means taking some
crusty old 15-year-old application,

containerizing it, shoving it in
Kubernetes, and calling it cloud native:

the flawed approach of lift and shift.

And you end up with a lot of
applications that are in Kubernetes.

They're listed as a Deployment, but
you can't restart them without your

customer having a significant outage.

It goes against everything
Kubernetes was built on.

But that's the world we live in today.

When we first launched live migration
and I posted about it on LinkedIn,

one of the first questions I
got was, why is this even needed?

If you're doing Kubernetes correctly, live
migration shouldn't even be a real thing.

Yes, but 95% of the customers I deal with

don't do Kubernetes the right way.

Well, yeah, I honestly think we
could argue that's a great point.

When Docker and Kubernetes were
both created, it was all stateless,

'cause it's easy, you know,
move everything around.

It's wonderful.

But I mean, this channel,
if there's anything consistent about this

channel over the almost decade it has
existed, it's that it's all containers.

Like, I don't care what the tool
is, we're doing it in containers.

The large success of containers is
because we could put everything in them.

So many evolutions or attempted
evolutions in tech have been, well,

you're gonna have to rewrite to,
you know, functions.

You're going serverless, you're gonna
have to write functions now, or you're

gonna have to rewrite in this
language or whatever.

And I think that's the
secret sauce of containers: we

could literally shove everything in them.

It's also the negative.

And so there's a thing that,
I don't know if I learned it from a

therapist or whatever, but often our
weaknesses are just overdone strengths.

And I feel like the strength of containers
is that you can do everything with them.

You can put every known
app on the planet in them.

They will eventually work
if you figure it out.

The overdone weakness is that we're
putting everything in there, which makes

managing these infrastructures very hard.

You have to assume everything's
fragile until you are sure

that it's truly stateless.

Even then, people say stateless,
and what they really mean is it

doesn't care about disk, but it definitely
cares about connections, which, when

we're trying to talk about stateless,
is not technically accurate.

When we say stateless,
we should probably mean it also

doesn't care about connections,

at least once the connections are drained.

It's an interesting dilemma we all have
in the infrastructure: we have the

power to be able to move everything and
do everything, but everything we're

running is also super fragile at the
same time, so how do we even manage that?

We've encountered a lot of teams
that swore up and down they were

stateless, right up until you
started bin packing their cluster.

They said, wait, wait, wait.

Why are we having all
these restarted pods?

We're like, because we're bin
packing and we're moving things, and

we're getting better optimization.

They're like, but my container restarted.

Well, yes, that's what
containers do in Kubernetes.

Right.

And for those that are maybe just getting
into Kubernetes, or haven't dealt with

large enterprise forever-workloads
that just can't be touched:

I've had 30 years in tech
of don't touch that server,

don't touch that workload.

It's fragile, it's precious, but
it's also probably on the oldest

hardware and the least maintained.

So one of the performance
measures that any significant-size

Kubernetes team is dealing with
is the cost of infrastructure.

And then we keep getting told, I think
this was just in the last year at

KubeCon, that even on top of Kubernetes,
we're still only averaging like 10%

CPU utilization across nodes.

Like, we still are struggling with
the same infrastructure problems that

we were dealing with for the last 30
years, even before VMs, before

virtualization.

That was the same problem we had then,
because everybody would want their

own server, and they always had to plan
for the worst, busiest day of the year.

So they would buy huge servers,
put 'em in, and they'd sit idle

almost all the time, because
they barely got 5% utilization.

So I can see where one of the core
premises of something like an application

performance tool is that we're gonna
save tons of money by bin packing.

Can you explain the bin packing process?

What does that look like?

So one of the big things with Kubernetes
is the scheduler will typically

round-robin assign pods to nodes.

If you have 10 nodes in a cluster,
your pods will more or less get evenly

distributed to those nodes in the cluster.

You can manage that with different
scheduler hints and

suggestions to
steer that towards, you know, most

utilized, least utilized, et cetera.

But at the end of the day,
you're gonna have spread-out

workloads across your nodes.

Bin packing is basically the
defragmentation of Kubernetes, right?

Back, you know, Bret, when you were first
starting out, when I was first starting

out, you could actually defragment
a hard drive and you got to move the

little Tetris blocks around the screen.

Those were the days.

Being able to do that in a Kubernetes
cluster can mean massive, massive

savings on the actual utilization of that
cluster, because now you free up a bunch

of workloads, a bunch of nodes in the
cluster that are no longer necessary;

you can delete those, and when you
need them, you just add them back.

That's the joy of being in a cloud
environment: you can use the least amount

of resources when you don't need 'em.

So, for instance, at your off-busy
hours, your nighttime

hours, and then when you start needing
'em again, you spin 'em up, you add

more, you scale up during the
day, and being able to do that process

over and over again every day is how
you can optimize your cloud resources.
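The defragmentation idea can be sketched with a toy bin-packing pass. This is an illustrative first-fit-decreasing sketch, not Cast AI's actual placement logic; the pod sizes and node capacity are made up.

```python
# Toy sketch of bin packing pods onto nodes (illustrative only, not Cast AI's
# real algorithm): first-fit decreasing on pod CPU requests.

def bin_pack(pod_requests, node_capacity):
    """Place each pod CPU request (in cores) on the first node with room."""
    nodes = []  # each node is a list of the pod requests placed on it
    for request in sorted(pod_requests, reverse=True):
        for node in nodes:
            if sum(node) + request <= node_capacity:
                node.append(request)
                break
        else:
            nodes.append([request])  # no existing node had room: add one
    return nodes

# Ten pods that a spread-out scheduler might have scattered across ten nodes
# can often consolidate onto far fewer:
pods = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]   # CPU requests in cores
packed = bin_pack(pods, node_capacity=8)
print(len(packed))  # → 3 eight-core nodes needed after consolidation
```

The pods total 20 cores, so three 8-core nodes is the theoretical floor, and first-fit decreasing reaches it here; the remaining nodes can be deleted until load returns.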

What we see is that people have
so many stateful workloads,

whether it's stateful in a real sense, or
stateful as in this is a really poorly

architected application, or stateful as in
this application takes 15 minutes to start

up, it's a monolith, and I can only
run one copy of it, so I can't move it.

All of those things mean you
can't bin pack a cluster, right?

You can't move those things around.

So what ends up happening is people just
end up with these stateful workloads

scattered throughout all 10 nodes.

And even if the 10 nodes are only
60% utilized, you can't get rid

of any of them because it'll cause
some kind of service interruption.

And that's where live migration
allows you to move those

stateful sensitive workloads.

So now those 10 nodes can go down
to six or seven nodes without having

a service interruption, even if
there's less than ideal workloads

scattered throughout the cluster.

Stateful versus stateless: where's
the scenario where we need

live pod migration? To those
that are perfect in all their software,

who run and control all
the software that runs on Kubernetes,

I don't know who those people are,
but let's just say they exist.

Then this isn't needed.

Every database has a replica
or database mirror, so you

can always take a node down.

Every pod has a proper
shutdown that ensures connections

are properly moved to a new pod.

By the way, I used to do a whole
conference talk on TCP packets

and resetting the connection to
make sure it moves properly through

the load balancer to the next one,
and having a long shutdown time

so that you can drain connections.

That world of shutting down a pod is so
much more complicated than anyone gives

it credit for. Everyone treats it like
it's casual and easy, and it's just not if

you're dealing with hundreds of
thousands or millions of connections.

There is a lot of nuance
and detail to this.
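The drain-before-exit pattern being described can be sketched roughly as below. This is a hypothetical illustration, not a real API: `DrainableServer` is a made-up name, and a real pod would pair this with a readiness probe and `terminationGracePeriodSeconds`.

```python
import signal
import time

# Hypothetical sketch of the drain-on-SIGTERM pattern discussed above.
# Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then SIGKILL,
# so the pod should stop accepting new work and drain in-flight connections
# before exiting.

class DrainableServer:
    def __init__(self, grace_seconds=30.0):
        self.grace_seconds = grace_seconds
        self.accepting = True   # mirrored by the readiness probe in a real pod
        self.in_flight = 0      # open connections still being served

    def handle_sigterm(self, signum, frame):
        # Step 1: stop accepting, so the load balancer takes us out of rotation.
        self.accepting = False

    def drain(self):
        # Step 2: wait for in-flight connections to finish, up to the grace period.
        deadline = time.monotonic() + self.grace_seconds
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(0.05)
        return self.in_flight == 0  # True means every connection finished cleanly

server = DrainableServer(grace_seconds=5.0)
signal.signal(signal.SIGTERM, server.handle_sigterm)
```

If `drain()` returns False, connections were cut off when the grace period ran out, which is exactly the kind of customer-visible break discussed above.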

And I often end up with teams
where they implement Kubernetes.

It's a sort of predictable pattern, right?

They implement Kubernetes,
move workloads to it.

They think Kubernetes gives their
workloads magic, and then they just

start trying to move things around,
and they realize when their customers

complain that the rules of TCP/IP,
load balancers, and connection state,

like, all these rules still apply.

You have to understand those lower levels,
and obviously disk and writing to disk

and logs for databases and all that stuff.

That's still there too, I think.

I think the networking is where
I see a lot of junior engineers

hand-waving, because quite
honestly, the cloud has made a lot

of the networking problems go away.

So we don't have to have
Cisco certifications just

to run servers anymore.

We used to, but now we can get away
without it until a certain point in

the career or complexity level.

And then suddenly you're having to really
understand the difference between TCP and

UDP, and how session state, long polling,
or WebSockets, how all these things affect

whether you're going to break
customers when you decide to restart

that pod or redeploy a new pod.

I love that stuff because it's super
technical and you can get really into

the weeds of it, and I wouldn't
call it a solved problem for everyone.

My understanding of something
like live migration is that it

takes on most of those concerns.

It doesn't make them irrelevant, but
it does deal with those concerns.

Am I right?

In terms of networking, we're talking
about live migrations having

to be concerned with IPs and
connection state, stuff like that.

Yeah, so with being able to do the
networking move of things, to your

point, reestablishing those sessions:

one of the big things we
see is long-running jobs.

If you've got a job that's running for
eight hours and it gets interrupted

at six, you've lost that job.

Even if you try to move it from one node
to another, there's some checkpointing

involved. A lot of times, like on a Spark
workload, the driver will just kill the

pod and restart it if it senses any kind
of interruption in the networking.

So the networking's super important there.

Being able to maintain that.

Long-running sessions, WebSockets.

To your point, we've actually tested
this extensively with WebSockets.

WebSockets stay connected.

And we're still in those
early vMotion days:

there is a slight pause
when we move things.

It took vMotion multiple years before
they got it really ironed out.

We're probably moving faster
than them, 'cause we have

the benefit of the experiences
they went through,

and the research that's
happened since then.

So I think we're moving pretty fast
on shortening that time window.

But what we found is you queue up all
the traffic, and once the pod is live on

the new node, that traffic is replayed
and all the messages come through.

So even on something like a WebSocket,
you don't actually lose messages.

They're just held up for a few seconds.
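The queue-and-replay behaviour being described (hold inbound traffic during the pause, flush it in order once the pod is live on the new node) amounts to something like this sketch. The class and its names are a hypothetical illustration, not Cast AI's implementation.

```python
from collections import deque

# Illustrative sketch of queue-and-replay during the migration pause:
# messages arriving while the pod is frozen are buffered, then replayed in
# arrival order once the pod resumes on the new node, so a WebSocket client
# sees added latency instead of lost messages.

class ReplayBuffer:
    def __init__(self):
        self.paused = False
        self.pending = deque()   # traffic held during the pause
        self.delivered = []      # what the application actually saw

    def receive(self, message):
        if self.paused:
            self.pending.append(message)   # hold, don't drop
        else:
            self.delivered.append(message)

    def pause(self):
        self.paused = True                 # migration pause window begins

    def resume(self):
        self.paused = False                # pod is live on the new node
        while self.pending:
            self.delivered.append(self.pending.popleft())  # replay in order

buf = ReplayBuffer()
buf.receive("a")
buf.pause()
buf.receive("b")
buf.receive("c")
buf.resume()
print(buf.delivered)  # → ['a', 'b', 'c']: delayed, never lost
```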

And that's extremely important
from maintaining that connection

state, like you were mentioning.

One of our customers that we're
working with heavily on this,

they run Spark streaming jobs.

So they're 24/7, 365, pulling off a
queue, running data transformations

and detections, and then pushing
somewhere else for alerting mechanisms.

If they have a pod go down, it takes about
two minutes to get that process restarted

and pull in all the data that they need
again. That's two minutes of backlog.

They have super tight SLAs.

They have a five minute SLA from
message creation to end run through

the entire detection pipeline.

So if you've got a two minute
delay on that shard in your Kafka

topic, that's a huge chunk of that
five minutes that you just ate up.

With all the rest of the pipeline,
it's very easy to start

missing SLAs there.

You can't take maintenance
windows if you're 24/7, 365 and

you're doing security processing.

You can't be like, well, security's
gonna be offline for 10 minutes

while we move our pods around.

That's just not
acceptable in that world.

So keeping that connectivity, keeping
the connection state, being able to

keep everything intact, keeping the
Kafka connection, keeping the Spark

driver connection, is all super important
with being able to move that entire

TCP/IP stack over from one node to
another during that migration process.

Yeah, and I mean, we're really talking
about a lot of the different kinds of

problems that come with shifting
workloads. Like walking

into an environment and sort of being
your own wrecking ball and your own chaos

monkey and saying, I'm gonna go over here
and push the power button on this node,

or I'm gonna properly shut down this node:
do you have everything set up correctly so

that connections are properly drained?
That is such a moving target,

especially because every time we've had
these processes, where I've had clients

where we go through this exercise of
we're going to do maintenance on a node,

and we're even gonna plan for it, and
then we do it, and then we fix all the

issues of the pods and the shutdown timing
and the Argo CD deployment settings

that we need to massage and perfect.

And then, you know, six months later, if
we do it again, the same thing happens,

because now there's new workloads that
weren't perfected and weren't well tested.

If I could make a career out
of actually being, like,

a pod migration guru, that sounds
like my kind of dream job, where we crash

and break everything, and then we track
all of the potential issues of that, and

we're like a tiger team that goes pod
by pod and certifies: yep,

this pod can now move safely without risk,

because we've got everything dialed in.

We've got all the right settings.

I feel like that's a workshop opportunity.

Maybe sell something on that, because
there are so many levels of complexity

we haven't even talked about, like
database logs and database mirroring.

You can't really spin up a new node
of a database and let it sit there

idle as a pod while you're waiting for
the old one to shut down; they can't

access the same files, blah, blah, blah.

It just depends on the workload
how complex this all gets.

But I'm assuming also that when we talk
about something like live migrations,

we're not just concerned with networking.

We're also somehow shifting to storage.

I'm guessing there's certain
limitations to that, where you're not

replicating on the backend volumes.

I guess you're just using,
like, iSCSI reconnects, or

how does that work?

We haven't really gotten into the
solution, but I know you're only on

certain clouds right now, and I'm assuming
that's partly due to the technical

limitations of their infrastructure.

Right, exactly.

Each cloud has different kinds
of quirks around how they

function, what the different
technologies look like around them.

Somebody had asked about being able
to move, you know, larger systems,

and what the limits around it are, and
it depends on the use case, right?

If you're talking spot instances,
being able to move from one spot

instance to another spot instance
in a two-minute interruption window

on AWS depends on how much data.

If you're trying to move 120 gigs of
data, physics is working against you.

You don't have enough time in that
two-minute window to get it through

the pipe over to the new system.

Now if you're talking small pods,
if you're talking less than 32-gig

nodes, you can move that fast enough.

64 gigs, maybe you're on the edge.

It depends on how much other network
traffic is tying up the bandwidth, but

64 gigs is getting to the edge of what
you can move in a two-minute window.
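The "physics" constraint here is simple arithmetic: data volume over effective bandwidth against the two-minute spot warning. The 5 Gbit/s effective rate below is an assumed figure for illustration, not an AWS specification.

```python
# Back-of-the-envelope check of the transfer-window physics described above.
# The effective bandwidth is an assumption; real throughput depends on the
# instance type and whatever else is sharing the NIC.

def transfer_seconds(data_gib, bandwidth_gbps):
    bits = data_gib * 1024**3 * 8            # GiB -> bits on the wire
    return bits / (bandwidth_gbps * 1e9)     # seconds at the given line rate

SPOT_WINDOW = 120  # AWS gives a two-minute spot interruption notice

for gib in (32, 64, 120):
    t = transfer_seconds(gib, bandwidth_gbps=5)  # assumed 5 Gbit/s effective
    verdict = "fits" if t < SPOT_WINDOW else "does not fit"
    print(f"{gib} GiB -> {t:.0f}s ({verdict} in the {SPOT_WINDOW}s window)")
```

At this assumed rate, 32 GiB takes about 55 seconds, 64 GiB about 110 seconds (right at the edge of the window), and 120 GiB about 206 seconds, matching the intuition above.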

That other example I was talking about,
those long-running Spark streaming

jobs: if they're running on demand,
live migration is still a massive

benefit, because now you can do server
upgrades without taking an outage.

You can create a new node running your
new patched version of Kubernetes,

running your new patched OS, and
migrate the pod from one to the other.

Your time to replicate is less important.

Even if it takes you three minutes,
four minutes, or five minutes

to replicate the memory from
one box to the other, who cares?

It's not gonna be paused for that
long, because what we're doing

is delta replication.

You replicate a big chunk, and then
a smaller chunk, and then a smaller

chunk, until the chunk is small enough
that you can do it in a pause window.

And so when you're moving a huge
service from one to the other,

same thing.

If you're talking, like, NVMe local
storage, we've got another customer we're

working with, and it's a different
set of problems.

They have a terabyte of NVMe that they
use as local ephemeral disk on every node

that needs to be replicated
from node to node.

When they do node upgrades, that
takes about 20 minutes to replicate

all of that from one node to
another, even on high-throughput

disks, on high-throughput nodes.

But if it's happening in the
background while everything else

is humming along nicely, who cares?

Keep replicating it over.

You keep going down to deltas, and
then once your deltas get small enough,

you pause for six to 10 seconds,
depending on how big the service is.

And then you slide it over.
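The shrinking-delta loop described here is the classic iterative pre-copy algorithm from VM live migration. Below is a small simulation of the convergence math; all the rates are made-up assumptions chosen to show the shape of the process, not measured figures.

```python
# Simulation of iterative pre-copy delta replication, as described above:
# copy everything once, then repeatedly copy only what was dirtied during
# the previous round, until the remainder fits inside a short pause.

def precopy_rounds(memory_mb, dirty_rate_mb_s, bandwidth_mb_s,
                   pause_budget_mb, max_rounds=30):
    """Return (rounds, final_delta_mb) for an iterative pre-copy migration."""
    delta = memory_mb                             # round 1 copies everything
    for round_no in range(1, max_rounds + 1):
        copy_time = delta / bandwidth_mb_s        # time to ship this round
        delta = dirty_rate_mb_s * copy_time       # memory dirtied meanwhile
        if delta <= pause_budget_mb:
            return round_no, delta                # final stop-and-copy fits the pause
    return max_rounds, delta  # dirtying outpaces the link: won't converge

rounds, final_delta = precopy_rounds(
    memory_mb=32 * 1024,    # 32 GiB working set
    dirty_rate_mb_s=200,    # assumed page-dirtying rate
    bandwidth_mb_s=1000,    # assumed ~8 Gbit/s effective link
    pause_budget_mb=100,    # what fits inside a short pause
)
print(rounds, round(final_delta, 1))  # → 4 52.4
```

Each round shrinks the delta by the ratio of dirty rate to bandwidth (0.2 here), which is why the final pause can be far smaller than the working set; if the dirty rate exceeds the bandwidth, the loop never converges and you fall back to a longer stop-and-copy.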

A lot of these things are being solved.

We're actively reducing these pause
times by being able to do more prep

behind the scenes, being able to do
more processing behind the scenes.

Everything is operating
as a containerd plugin.

I saw somebody asked about on-prem.

We will be supporting on-prem; we
will be supporting other solutions.

The big catch there is everybody
has some different flavor of

networking and different flavors
of things behind the scenes.

The actual live migration piece right now
could apply to any Kubernetes anywhere.

It's the IP stack that gets a little
trickier, because you've got Cilium

running places, you've got Calico running
places, you've got the VPC CNI

running places; you know, everybody
has different networking flavors.

So being able to maintain network
connections when you do the

migration is largely the more
difficult part of the whole process.

Hmm.

Being able to move the pod isn't that
bad. So if you've got workloads where

you can reestablish connections and the
connection resetting is not a big deal,

but you don't wanna have to restart
the pod, that's fairly straightforward.

We could pretty much do that today across
any containerd-compatible Kubernetes.

It's specifically the networking
that causes a lot more hardships,

because everybody has a
different flavor of networking.

For AWS, we were able to fork
the open source AWS node CNI and

create our own flavor of it that
now handles the networking piece.

So we're using the open source
AWS CNI code, and we've modified

it, and now it works just
fine for our purposes.

We're doing something similar on GCP.

GKE, if I recall, is using Cilium
under the hood for their networking,

so we're gonna be building a
similar plugin for their Cilium side.

Yeah, and the nice thing is, I
guess if you build it for Cilium,

would it work universally
across any Cilium deployment?

In theory, I mean, I'm just thinking of
the most popular CNIs, and if you check

those off the list, it suddenly gives you,
you know, a lot more reach than having to

go cloud by cloud or OS by OS, you know?

Exactly.

Our first iteration of this back in
January, February, the first version

that we demoed, was actually Calico.

A lot of people were like, I don't
wanna have to rip out my cluster and

rebuild it with Calico as the CNI.

We were able to figure out
a way to work with the VPC

CNI as a backing basis there.

So Calico's pretty much already built,
we've got the AWS CNI now built, and

Cilium is our next target. I saw
somebody asked about Azure; Azure

is probably gonna be early 2026.

We'll be EKS, GKE, then
we'll work on AKS, and then we'll

work on on-prem solutions after that.

So on-prem will probably
be sometime in 2026.

Yeah, I can remember going
back to the two thousands.

I can remember when we went from
delayed migrations or paused

migrations to live migrations.

I can remember reading the technical
papers coming out of VMware and

Microsoft, and they were talking about
the idea of these deltas: continually

repeating the delta process until
you get down to zero, or like you can

fit it in a packet, and then that's
the final packet kind of thing.

I don't know why I remember that all
these years later, but I do remember

that I thought that was some pretty cool
science, like some pretty cool physics

across the wire, because back then we
were lucky if our servers had one gigabit,

let alone 200-gig workloads
or anything like that.

This actually led me, during my
research, to the idea that there are

attempts in Linux over the years
to try to solve this universally.

I did some research before the show and
saw some projects around ML workloads

in particular: a lot of engineers, whether
it's platform engineering or just the ML

engineers themselves, interested in this
because of the problems of

large ML or AI workloads today, where you
can't interrupt them; if you interrupt

them, you have to basically start over.

It's sort of a precious workload
while it's running, and it

might be running a long time.

Do you have AI and ML workload customers?

Are they maybe part of the first movers
to move onto something like this?

I'm basing it on the KubeCon talks
and things that I've seen out there.

Large-scale data analytics is
definitely one of the big players here.

A lot of it's Spark-driven data
analytics that we're seeing,

because of exactly that problem.

A lot of these jobs will be running
for 8, 10, 12, 14, 16 hours, and

running those on demand at the
scale that they're running them

at is extraordinarily expensive.

So the big ask is how do we get those
workloads onto spot instances where,

when we get the interruption notice, we
can fall back to some type of reserved

capacity and then fail back to spot.

So basically the goal is to move
to this new concept where, in your

Kubernetes cluster, you have some swap
space, whether that's excess spot

capacity, two or three extra nodes of
spot capacity, or a couple of nodes of

on-demand capacity, where if you get a
node interruption, you can quickly swap

into those nodes, and then once you stand
your new spot instance back up,

you can swap back to that spot instance.
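Concretely, on AWS the reclaim warning shows up in instance metadata: the `spot/instance-action` document appears once a reclaim is scheduled, carrying the action and a UTC time. The sketch below only parses that payload and makes the swap decision; the helper names and thresholds are hypothetical, and a real controller would poll the metadata service and drive the migration itself.

```python
import json
from datetime import datetime, timezone

# Hedged sketch of reacting to the AWS spot interruption notice. On EC2, the
# instance metadata service exposes /latest/meta-data/spot/instance-action
# once a reclaim is scheduled. These helper names are illustrative, not part
# of any real SDK.

def seconds_of_warning(instance_action_json, now):
    """How many seconds remain before the node is reclaimed?"""
    payload = json.loads(instance_action_json)
    reclaim_at = datetime.strptime(
        payload["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return max(0.0, (reclaim_at - now).total_seconds())

def choose_fallback(warning_seconds, est_migration_seconds, standby_ready):
    """Swap into standby ("swap space") capacity only if the move can finish."""
    if standby_ready and est_migration_seconds < warning_seconds:
        return "live-migrate-to-standby"
    return "checkpoint-and-restart"

notice = '{"action": "terminate", "time": "2025-01-01T12:00:00Z"}'
now = datetime(2025, 1, 1, 11, 58, 0, tzinfo=timezone.utc)
remaining = seconds_of_warning(notice, now)
print(remaining, choose_fallback(remaining, est_migration_seconds=55,
                                 standby_ready=True))
# → 120.0 live-migrate-to-standby
```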

That's where we're headed.

That's what Q4 is gonna be working on
this year: being able to automate that

entire process, to where you can float
back and forth between reserved capacity

and spot capacity to really save on those
data analytics jobs, those large ML jobs.

We're not to the GPU side of things yet.

I'd love to get us to where we could
migrate GPU workloads, 'cause that's

where the next big bottleneck is gonna be.

The hooks aren't there in the
NVIDIA toolsets yet, for the CUDA

toolsets in a lot of places, to be
able to get what we need for the data.

We're figuring our way around that.

They tend to be much larger, so the time
taken to move them is very expensive.

It might be 20 minutes to be able
to move a job from one to the other,

'cause it took 20 minutes to get
it started up in the first place,

just due to the size of the models and
how much data you have to replicate.

We're starting to put some POC work
into the GPU side of things while we're

continuing on full steam with building
out the expansion of the feature set

of the CPU- and memory-based workloads.

Alright.

Dan, I was curious what you've seen
on the implementation side of this.

When we talk about the need to live
migrate a pod, whether it's for

maintenance, then it almost
feels like the next level is the idea

of spot instances. I love that idea of
my infrastructure dynamically failing

and my applications being able to
handle it.

Is there a maturity level where
you see people start out? It's hard

for me to imagine, like, on day one
someone's like, yeah, let's just put

it all on spot instances, YOLO.

We don't care.

It's all good.

Live migration will solve it all.

Because obviously, there are
physics limits to the amount of

data we can transmit over the wire.

I'm imagining this scenario where you're
accrediting certain workloads,

like, this ReplicaSet is good
for spot because it's low data;

we don't need to transfer a hundred
gigs of data during a two-minute outage,

or a two-minute notice of an outage.

Do you see that as a maturity scale?

Yeah.

I mean, it is absolutely
a maturity scale, right?

Kind of going back to the references
we talked about, the early days of

VMware: nobody started doing vMotion
in production. Everyone started with,

oh, we've got this five-second
interruption; development and test

boxes can handle that all day long.

so it's the same concept really
that we're living in now.

We're going through that same evolution.

I agree with Phil.

I think we're doing it much faster
than VMware did in 2002, 2003.

I was around when that happened as well.

So I remember racking and
stacking all those boxes.

but yeah, it's very much the same thing.

container live migration is brand new,
we've just been GA for a month with it.

So we've had conversations at trade
shows and with customers and there's

a lot of excitement around it.

I think we're still trying to
figure out where it fits, what

are the exact workloads that it
makes the most sense to do this in.

And yeah, I think it's going
to be a process of adoption.

there's definitely a lot of use cases.

I think the spot is a very interesting
use case, especially the large data models

and things that we're processing today.

I'm working with a customer now that's
doing a lot of video processing in

Kubernetes and that's a very, you
know, CPU and memory intensive job.

I mean, you know, we're talking
a cluster that scales up to 6,500

CPUs while they're processing this.

we're really trying to figure out
where it makes the most sense to

apply this type of technology.

no one wants to have that kind of
dynamic scale and then have to pay

for reserved instances for all of
that, like, worst case scenario.

that sounds like a billing nightmare.

and you don't want a job that
runs for, you know, hours that

cost you tons of money to fail 80%
through and have to restart it I

mean, that's just not efficient.

So, yeah, I think the ability to
really move this and allow those

workloads to finish is gonna
be huge for the market.

Alright, so we have been
talking a lot about the problem

and some of the solution.

we do have some slides that give
visualizations for those on YouTube.

this will turn into a podcast.

So audio listeners, we will give
you the alt text, version of it

while we're talking about it.

But, Philip, I'm curious exactly what
is happening in the process of a live

migration. Like, how does it kick off?

what's really going on in the
background when it starts,

Absolutely.

and we do have some better
demos in other places,

I think on the website.

basically what we have is a live
migration controller looking

across all the workloads and nodes
that are live migration enabled.

You don't necessarily have to
turn this on for everything.

You've got all your stateless
workloads; you don't need to live

migrate stateless workloads,
just treat them as normal.

You've got your stateful workloads
that you do want to use this for,

so you could set up a specific,
you know, node group for that.

that's gonna allow you to select what you
actually want to do live migration for.

You could use it for everything, but
it just eats up more network bandwidth

if you're using it for the stuff
that already tolerates being moved.

that controller's gonna be looking for
different signals within the cluster of

when something needs to be live migrated.

Instance interruption is a good one.

being able to do bin packing, evicting
a node from the cluster because it's

underutilized, and then migrating those
workloads to another node in the cluster.

what we call rebalancing.

basically, rebuilding the
cluster with a new set of nodes.

And that could be because you're
doing a node upgrade, you're doing a

Kubernetes upgrade, you're doing an
OS upgrade, or you're just trying to

get a more efficient set of nodes.

all of those are good reasons that you
would want to do your live migration.

So what's gonna happen in that process
is the two daemon sets on the source

node and the destination node are
gonna start talking to each other.

They're going to look at the pods on
the source node, start synchronizing

them over to the destination node.

So behind the scenes, all of that
kind of memory is being copied over,

any disk state is being copied over,
any TCP/IP connection state is

being copied over, and you're doing
all that prep work behind the scenes.

If you have, ephemeral storage, on
the node that'll start getting copied.

Obviously, how much there is will
determine how long it takes.

Once the two nodes have identical
copies of the data, that's when the

live migration controller will say it's
time to cut over. It will cut the

connections from one, pause it and put
it into a paused state in containerd,

then it will unpause on the new node.

It'll come up with a new name.

Right now we call 'em
clone one, clone two.

We just add clone to 'em.

So you can tell which was the
before and which was the after.

when that clone one unpauses,
then traffic will be going to it.

It'll have the same exact IP address that
it had while it was on the previous node.

All the traffic continues onto that node.

It picks up exactly where it left off,
and the old pod disappears, right?

The old pod gets shut down and torn down.

If you have something like a PVC attached,
say an EBS PVC, there is a longer pause

because you have to do a detach reattach.

with the API calls, it works.

It just takes a little bit
longer for that pause state.

That's the downfall of
having to work with APIs.

it takes time to do an unbind
rebind, to the new node.

but it works today.

If you're using NFS where you can
do a multi-attach, then it's instant,

it doesn't actually add any delay.

just NFS is a slower storage technology.
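The sequence Philip walks through can be sketched as a toy model (hypothetical names throughout; this illustrates the ordering, not Cast AI's implementation):

```python
# Toy model of the cutover ordering described above -- not Cast AI's code.
# Invariant: the source pod is paused before the destination unpauses,
# so the pod's IP is never actively serving on two nodes at once.

def live_migrate(source: dict, dest: dict, pod: str) -> None:
    """Copy a pod's state from the source node to the dest, then cut over."""
    state = source[pod]
    # 1. Create a paused placeholder with a clone name but the same IP.
    clone = pod + "-clone1"
    dest[clone] = {"ip": state["ip"], "paused": True, "mem": {}}
    # 2. Pre-copy memory while the source keeps running (in reality this
    #    loops until the remaining dirty delta is small).
    dest[clone]["mem"] = dict(state["mem"])
    # 3. Pause the source so no new state is produced, copy the final delta.
    state["paused"] = True
    dest[clone]["mem"].update(state["mem"])
    # 4. Unpause the clone; traffic follows the unchanged IP.
    dest[clone]["paused"] = False
    # 5. Tear down the old pod, freeing its (unique) name.
    del source[pod]

src = {"web-0": {"ip": "10.0.1.5", "paused": False, "mem": {"k": 1}}}
dst = {}
live_migrate(src, dst, "web-0")
print(dst)  # {'web-0-clone1': {'ip': '10.0.1.5', 'paused': False, 'mem': {'k': 1}}}
print(src)  # {}
```

The real system does this against containerd and the CNI, with iterative memory sync and TCP state capture, but the ordering of steps 3 through 5 is the part that keeps the migration seamless.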

So does that sort of make
sense from a high level?

Yeah.

when we talk about Cast AI as a
solution, does it do live migrations

based on certain criteria?

Is it making decisions around, if you
say I want to bin pack all the time,

is it just doing live migrations
on your behalf in the background?

Or is this something where you're
largely doing it with humans clicking

buttons and controlling the chaos.

No, this goes back to what we
had talked about at the beginning

where, automation is key.

when a node is underutilized,
our bin packer is, probably the

most sophisticated on the market.

It analyzes and runs live tests on
every node in the cluster of whether

that node can be deleted, whether that
node doesn't need to be there anymore.

And it'll simulate all the pods being
redistributed throughout the cluster.

if the answer is we don't need this
node, it will automatically kick

off a live migration of all the
pods on that node. Once it's empty,

it'll just get garbage collected.

once it's gone, all your pods
are running on the new nodes.

Everything's moved seamlessly.

You haven't seen any interruption.

the cluster keeps continuing as normal.
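A drastically simplified version of that eviction test might look like this (a first-fit simulation over CPU alone; the actual bin packer weighs memory, affinity, and more, so treat this purely as an illustration):

```python
# Simplified sketch of the eviction simulation described above -- not the
# actual bin packer. Question: can every pod on a candidate node fit into
# the spare capacity of the other nodes? If yes, the node can be drained.

def can_evict(candidate_pods: list, other_nodes_free: list) -> bool:
    """First-fit-decreasing over per-node free CPU (millicores)."""
    free = sorted(other_nodes_free, reverse=True)
    for cpu in sorted(candidate_pods, reverse=True):
        for i, slack in enumerate(free):
            if cpu <= slack:
                free[i] -= cpu  # place the pod on this node
                break
        else:
            return False  # some pod has nowhere to go
    return True

print(can_evict([500, 300], [600, 400]))  # True: 500 -> 600, 300 -> 400
print(can_evict([900, 300], [600, 400]))  # False: 900 fits nowhere
```

When the simulation says yes, the controller can kick off live migrations for each pod and garbage collect the empty node.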

Most of our customers do
scheduled rebalances, so those

are just in the background.

It's evaluating how efficiently
designed the nodes in the cluster are.

If the nodes in the cluster are
not as efficient as they could be, and

different shapes and sizes would be

better for, that setup at that point
in the day, based on the mixture of

workloads there, it'll do a blue-green
deployment: set up new nodes, live

migrate the workloads to those new
nodes and tear down the old ones.

So everything that we're
talking about here can either

be scheduled or it's automatic.

It's running every few minutes on a cycle.

but yeah, no, it's entirely
seamless to the users.

Nice.

So in the technical details, we're
moving the IP address, I think you had a

diagram, showing the pod, on the nodes,
when we get down to the nitty gritty of

Kubernetes level stuff: the pod is
recreated, but pod names have to be unique,

and you can't have the IP on both nodes at once.

And then there's the difference
between TCP and UDP and other, you

know, IP protocols and the, there's
a lot of little devils in the

details that I'm super interested in.

We won't have time to go into all
of it, but I do remember you showed

the replica, the pod that you're
creating, step one is we create a

pod and download an image, right?

this is all still going
through containerd.

So it's not, there's not like voodoo
magic in the background happening

outside of the purview of containerd.

maybe you can talk about that for a minute

Right, Exactly.

So by changing the pod name,
you now have a placeholder for

your new pod information to go into.

And it does maintain the same IP address
when it moves over from one to the other.

So to your point, that's when that
switch has to kick in where the

old pod definition disappears.

And the new pod definition appears
in your control plane with the

API calls up to the kube API.

that cutover is extremely important
because you can't have the same

pod living in the same place twice.

that's why we do have to change
a name when we switch it over.

there are certain services that cause
some trickiness because they have an

operator structure where they expect
there to be a certain pod name.

So when you move it and add the clone
suffix to it, we're working on finding

workarounds to that, in certain areas.

that is a little bit tricky on
certain workloads because you can't

have the same pod existing with the
same name in two different spots.

They have to be unique.

but yeah, definitely.

the IP is the same, but the pod
name is gonna have a clone dash

one or something like that on it

Yep.

yeah, so it starts with your pod
and then there's an event that

happens outside of the pod that is
talking into containerd and you're

It's a second pod.

Correct.

It's adding that placeholder.

And because the placeholder pod is
actually still in a paused state,

it can have the same IP address.

it's not actually routing traffic to
it, 'cause it's not an active pod yet.

it'll be in a staged state.

So you stage it up with all
the information, but it has

to be named differently.

And then when you do the cutover,
that's when you switch it from

being inactive to active and
switch the old pod to be inactive.

and that's the final stage when,
the clone pod becomes the primary.

And because it's maintained the
exact, IP address within the

system, it's not losing any traffic.

So the networking system within
Kubernetes routes it to the new node

and the routing tables are updated and
the pod goes to the new destination.

Yeah, that does sound like the hard
part, like the old pod is shut

down so the IP can be released.

I assume the IP can't be taken out
while that old pod is still active.

it's one of these things where it's like,
I understand it at the theory level, but

I have no idea what containerd and
kube-proxy and all these different things that

are binding to a virtual interface and the
order of the things that have to happen

in the exact right sequence in order for
you to first assign that IP to the new

node and then also replay all the packets.

it does seem like a very discrete
order of things that have to happen.

it has to go in a certain order or
they all just fail, it feels like.

so that's the part that took
us about a year to figure out.

there had been a lot of studies
and some research papers around,

the moving of memory and the
snapshotting of different workloads.

Like that part was a little bit
more straightforward because it

was really out of the vMotion
playbook days, from early on.

there were also some college
studies around using CRIU to

replicate and migrate containers.

None of them had been able to
solve the IP side of things,

the connectivity side of things.

that's what Cast AI was able to solve for.

and it took a lot of research,
took a lot of, in-depth work.

We started on this, early 2024,
with a team of about five engineers,

deep kernel level Linux engineers,
Kubernetes engineers, people

very familiar with the code.

they contributed to the Kubernetes
open source project. It was

10 months before we had a demo,
and that was using Calico.

before we could demo, we had to
have a custom AMI at that point in

time in AWS because everything was
kernel level at the AMI level.

we knew that was not feasible going
forward to production, but that

was the first demoable version. Like
anything else, there's a lot of warts

and vaporware in the first version.

since then we were able to move
the logic up to a containerd plugin,

which makes it a lot more portable.

Now it can be applied to different clouds.

It's much less invasive.

You don't need a specific AMI.

And under the hood, we were able
to move it to the AWS VPC CNI.

So you don't need the custom Calico CNI.

all of those were iterative steps to
build this and make it more, production

viable and adoptable by the industry.

Now it's a matter of, we've
got kind of two forks going on.

One is continuing to build out
additional platforms, figuring out

Cilium, figuring out Azure CNI.

The other is performance tuning the
existing migrations, reducing time to

migrate, being able to reduce the size
of the deltas down further and further

so we can migrate it faster and faster.

so we've got kind of those
two tracks right now.

the team's up to I think
10 or 12 engineers.

kind of working on those two paths,
and this is probably one of our most

heavily invested, areas in the company
is being able to further this technology.

'cause we see how much value it brings.

Yeah, I imagine it won't be very long.

This technology is pretty advanced,
but others will probably eventually

attempt it, if it's truly the
thing that we all are looking for.

And it sounds like, it feels like it is,
it feels like the kind of, tooling where

it's a hard problem to solve and we'll
maybe see other people attempt to do it.

I mean, the research
I had to do for the show.

'cause I was very curious.

I was like, what's the
history of all this?

And someone mentioned CRIU,
which I believe you're

using at least some of.

that's a project that's been around
for quite some time over a decade.

And it's not a new idea, but like
a lot of these other technologies,

the devil's in the details.

we never really had an ability, to
capture and understand what a binary's

true dependencies were, whether it's
disk or networking things,

until we had containers. You mentioned
on here it's in LXC, it's in

Docker, it's in Podman; this
tool is actually used widely.

it's just maybe not, well known
to us end users because it's

packaged as a part of other tooling.

And I can sort of see a world where,

if this becomes more widespread, you're
gonna end up with haves and have nots

where my solution doesn't have live
migration or my solution does have live

migration at some point, maybe it's
ubiquitous, you're building functionality

on top of it, like your automation that
truly adds value around, spot instances

where my company maybe has never done spot
instances because it was too risky for

us and we didn't have the tooling to take
advantage of it without risking downtime.

I definitely have a couple of clients
that I've worked with over the last

couple of years that are like that,
where they're a hundred percent reserved

instances because they want the cheapest,
but they also need to guarantee uptime.

And they can't do that at a level
that live migrations would provide.

So they have to pay that extra
surcharge for avoiding ephemeral

instances and stuff like that.

to me, it gives me comfort
that the technology stack

is part open source,
part community driven.

there's also the product, and private IP
side of this as well, but it's not like

you're reinventing the Linux kernel.

It wouldn't have been that long ago
where you had to actually throw in a

kernel module that would only work on
certain operating system distributions

of Linux, and you would have to
deploy a custom ISO.

That wasn't that far in the past.

But now that we've got all these modern
things, I don't know if eBPF is involved

in this at all, but we've got more modern
abstractions that it feels like you can

just plug and play as long as you've
got the right networking components.

From an engineering perspective,
pretty awesome because it allows you

to build stuff on the stack like this.

the team's not on the call, but the
team that's developing this, good job,

Bravo, that's, some great engineering.

obviously anytime something is a year
long effort to crack a nut like this, I

feel sad for the people that had a six
month, like, no one's gonna see this

feature for a year and I'm gonna work all
year on it and I hope someone likes it.

So that's from a software development
perspective, that's the hard part.

That's the true engineering.

Um.

Yeah, absolutely.

we could talk about this forever,
but people have their jobs to do.

okay.

How do people get started? Do they just
go sign up for Cast AI, and this is

like a feature out of the box that they
can implement in their clusters, or...

Yep, absolutely.

it's in the UI now, so if people
want to sign up and onboard, we do

recommend having somebody on our sales
engineering team work with folks.

So reach out to us, We'll also reach
out when people sign up as well.

it's all straightforward.

There's no caveats.

It's helm charts to do the install
and then you set up the autoscaler.

We will be adding support for Karpenter.

Today it's using our autoscaler,
but we will support Karpenter

around end of Q4 or early Q1.

Yeah, that's great.

my usual co-host is with AWS, so
they would greatly appreciate that.

I know that Karpenter's been
out a little over a year.

We've had, a surprising number
of people on our Discord server.

For those of you watching, there's
a Discord server you can join.

There's a lot of people on Discord
talking about using Karpenter.

I'm really impressed with
the uptake on that project.

And in case you're wondering what
Karpenter is, it's with a K and it's

for Kubernetes, you can look on this
YouTube channel later because we did a

show on it and had people talking
about it and demoing it

whenever it was released.

I think that was 2024.

I can't remember exactly.

alright, so everyone
knows how to get started.

Everyone now knows that they wish they
had live migrations and they currently

don't unless they're a Cast AI customer.

Where can we find you on the internet?

Where can people learn more
about what you're doing?

Are you gonna be at conferences soon?

I'm assuming Cast AI is probably
gonna have a booth at KubeCon again.

They always seemed to have a booth there.

we've got a big booth
at KubeCon this year.

I think we've got a 20 by 20.

We're gonna be doing demos and
presentations in the booth.

this is gonna be a big part of that.

We'll also be at re:Invent in Vegas
in early December, I guess it's

the first week of December.

so I'll be at both of those events.

I'm also really active on LinkedIn, so
if anybody wants to reach out to me on

LinkedIn, if you wanna set up a session
just to go into more detail, feel free

to ping me. I post a lot of Kubernetes
content in general, best practices,

things that we see in the industry from
a Kubernetes evolution side of things,

and also obviously a bunch of Cast AI stuff.

so, you know, feel free to follow or
connect. Happy to share more information.

Awesome.

well, I'm looking forward to hearing about
that continual proliferation of all things

live migration on every possible setup.

someday it'll be on FreeBSD with
some esoteric, Kubernetes variant.

It's, pretty cool to see
the evolution of this.

Well, thank you both for
being here, Philip and Dan.

see you all later.

Ciao.

Thanks for watching, and I'll
see you in the next episode.

Creators and Guests

Bret Fisher
Host
Bret Fisher
Cloud native DevOps Dude. Course creator, YouTuber, Podcaster. Docker Captain and CNCF Ambassador. People person who spends too much time in front of a computer.
Beth Fisher
Producer
Beth Fisher
Producer of the DevOps and Docker Talk and Agentic DevOps podcasts. Assistant producer on Bret Fisher Live show on YouTube. Business and proposal writer by trade.
Cristi Cotovan
Editor
Cristi Cotovan
Video editor and educational content producer. Descript and Camtasia coach.
Dan Muret
Guest
Dan Muret
Solution Architect/Sales Engineer/Tech Enthusiast
Philip Andrews
Guest
Philip Andrews
Global Field CTO - Working with customers to improve efficiency at scale through AI Automation.