Move K8s Stateful Pods Between Nodes
Kubernetes Pod Live Migration.
That's what Cast AI calls it when they
dynamically move your pod to a different
node without downtime or data loss.
They've built a Kubernetes controller
that works with CSI storage and CNI
network plugins to copy your running
pod data, the memory, the IP address, and the TCP connections from one node to another in real time.

Welcome to DevOps and Docker Talk; I'm your solo host today, Bret Fisher.
This YouTube live stream was a fun one
because I got to nerd out with engineers
Philip Andrews and Dan Muret from Cast
AI on Kubernetes pod live migrations
and how it works under the hood.
We talk about use cases for this feature, including hardware or OS maintenance on a node, right-sizing or bin packing your pods for cost savings, or moving off of a spot instance that's about to shut down. Or really, any time you need to move a pod that would cause an outage if it had to restart or be redeployed.
I don't know of any other turnkey way
to do this on Kubernetes today, but
I've got a feeling that Cast AI has got
a winning feature on their hands, and
I'm glad that we got to dig into it.
Over 20 years ago, the virtual machine
vendors built this live migration feature
into their products, and finally, in 2025,
we're now able to do that in Kubernetes.
Let's get into the episode.
Welcome to the show.
All right, both of these gentlemen are from Cast AI. Philip is the global field CTO at Cast AI.
What exactly does a global field CTO do?

Handling a lot of our large customers and a lot of the strategic partnerships, usually around new technologies: working with a lot of our customers, showing them new technologies, helping with proofs of concept.
It's been kind of a cool role, basically on the technical customer-facing side. I get to do a lot with our largest customers in solving some of the hardest problems.
Nice.
When you don't know what the title is, it makes it sound like you just live in an airplane. It sounds impressive.
And we've got Dan Muret, or Muret; I really don't know my French, so that's probably a horrible pronunciation. Dan is here; he's the senior sales engineer with Cast AI, or one of them. I'm gonna make you the senior sales engineer so that you sound very elite. Welcome, Dan.
Who can tell me the elevator pitch for Cast AI? Because I've known about you all for years; I've visited your booth at KubeCon at least half a dozen times. What's the main benefit of Cast AI?

Sure, I can grab that one.
When Cast AI was founded, it was to solve a problem that our founders had in their last startup years ago, which was that every month their AWS bill was going up 10%, regardless of what they did the month before to try to mitigate and manage costs across the infrastructure. They sold that startup to Oracle, and when they finished that Oracle time, they went and figured out how to solve this problem.
They said, well, Kubernetes is
gonna be the future platform.
We're gonna make our bet on Kubernetes.
And the only way to solve this
problem is through automation.
Because doing things manually every month, one, is tedious and takes up a lot of time. And two, it's just not helpful, right? You save a little bit, and then it's two steps back for every step forward you make.
With Cast AI, it's fully automation first, right? We made the effort that everything was going to be automated from the start. When it comes to node autoscaling, node selection, node right-sizing, workload right-sizing: everything to do with Kubernetes, everything we implement, is automated.
And that's where the live migration
piece came in being able to automatically
move applications around within the
cluster without having downtime.
And that's where application performance automation comes in, moving from this application performance monitoring mindset. Datadog has made a lot of money on that; I dunno if I'm allowed to say that, but they've done very well and have a fantastic platform. We love Datadog, but you get data overload. You end up with metric overload, and acting on those metrics is very hard.
Where we need to go from here,
especially with the AI mindset that
we're moving into, is automation
of that application performance.
And that's what Cast AI
is leading the way in.
Nice. When you all reached out and I learned that you now have live migration, it took me back almost 25 years, to when that was first invented for VMs. At the time it felt like magic.
It did not seem real.
We all had to try it to believe
it because it seemed impossible.
To move from one host to another, maintain the IP, maintain the TCP connections? Surely it's gonna freeze up, it's gonna be like a frozen screen. We all just assumed that. And maybe at first it was a little hiccupy; if I remember correctly, around 2003 to 2005 it wasn't quite live, and there were very short gaps. Then eventually it got good enough that it was truly live.
I was running data centers for local governments at the time, so I was very interested in this, 'cause we were running both ESX and Hyper-V. I was heavily invested in that feature and functionality. So when I saw that you were doing it in Kubernetes, my first thought was: why did this take so long? Why don't we have this yet on everything? Because it's clearly possible; it's technically possible. Obviously, it's not super easy, and it requires a lot of low-level tooling that has to understand networking and memory and, you know, disk writes and all that kind of stuff.
So I'm excited for us to get
into exactly how this sort of
operates for a Kubernetes admin.
And I really feel like this show's gonna be great for anyone learning Kubernetes, and for Kubernetes admins, because we'll talk about the StatefulSet and DaemonSet problem. We've got stateful work; almost everyone I know has stateful workloads in Kubernetes. I don't know about your experience, but to me it's an exception nowadays when everything in Kubernetes is stateless. Do you find that to be the case?

We talk to customers all the time, right? And yeah, just a couple of years ago it was a lot more stateless.
Web servers, whatever.
Now we're definitely seeing a shift to more stateful workloads, whether it's legacy applications being forced into Kubernetes as part of a modernization project or whatever. We're seeing a lot more stateful workloads in Kubernetes, for sure. Particularly amongst the Fortune 100s and Fortune 500s, right?
'Cause you've got this "modernization," and I put it in quotes, where modernization means taking some crusty old 15-year-old application, containerizing it, shoving it into Kubernetes, and calling it cloud native: the flawed approach of lift and shift. And you end up with a lot of applications in Kubernetes that are listed as a Deployment, but you can't restart them without your customer having a significant outage.
it goes against everything
Kubernetes was built on.
But that's the world we live in today.
When we first launched live migration and I posted about it on LinkedIn, some of the first questions I got were: why is this even needed? If you're doing Kubernetes correctly, live migration shouldn't even be a real thing. Yes, but 95% of the customers I deal with don't do Kubernetes the right way.
Well, yeah, I honestly think that's a great point. When Docker and Kubernetes were both created, it was all about stateless, 'cause it's easy: move everything around, it's wonderful. But if there's anything consistent about this channel over the almost decade it has existed, it's that it's all containers. Like, I don't care what the tool is, we're doing it in containers. The large success of containers is because we could put everything in them.
So many evolutions, or attempted evolutions, in tech have been: well, you're gonna have to rewrite for functions, you're going serverless, you're gonna have to write functions now, or you're gonna have to rewrite in this language, or whatever. And I think that's the secret sauce of containers: we could literally shove everything in them. It's also the negative.
There's a saying, I don't know if I learned it from a therapist or whatever, that often our weaknesses are just overdone strengths. And I feel like the strength of containers is that you can do everything with them. You can put every known app on the planet in a container; they will eventually work if you figure it out.
The overdone weakness is that we're putting everything in there, which makes managing these infrastructures very hard. You have to assume everything's fragile until you're sure it's truly stateless. And even then, people say stateless when what they really mean is it doesn't care about disk; it definitely cares about connections, which isn't technically stateless. When we say stateless, we should probably mean it also doesn't care about connections, at least once the connections are drained.
It's an interesting dilemma we all have in infrastructure: we have the power to move everything and do everything, but everything we're running is also super fragile at the same time. So how do we even manage that?
We've encountered a lot of teams
that, swore up and down they were
stateless, right up until you
started bin packing their cluster.
They said, wait, wait, wait, why are we having all these restarted pods? We're like, because we're bin packing and moving things, and we're getting better optimization. They're like, but my container restarted. Well, yes, that's what containers do in Kubernetes.
Right.
And for those that are maybe just getting into Kubernetes, or haven't dealt with large enterprise forever-workloads that just can't be touched: I've had 30 years in tech of "don't touch that server, don't touch that workload." It's fragile, it's precious, but it's also probably on the oldest, least maintained hardware.
So one of the performance measures that any significant-size Kubernetes team is dealing with is the cost of infrastructure. And we keep getting told, I think just in the last year at KubeCon, that even on top of Kubernetes, we're still only averaging around 10% CPU utilization across nodes.
Like, we're still struggling with the same infrastructure problems we've been dealing with for the last 30 years, even before VMs, before virtualization. It was the same problem back then, because everybody wanted their own server and always had to plan for the worst, busiest day of the year. So they would buy huge servers, put 'em in, and they'd sit idle almost all the time, because they barely got 5% utilization. So I can see where one of the core premises of something like an application performance tool is that we're gonna save tons of money by bin packing. Can you explain the bin packing process? What does that look like?
So one of the big things with Kubernetes is that the scheduler will typically round-robin-assign pods to nodes. If you have 10 nodes in a cluster, your pods will more or less get evenly distributed across those nodes. You can manage that with different scheduler hints and suggestions to steer it toward most utilized, least utilized, et cetera.
But at the end of the day,
you're gonna have spread out
workloads across your nodes.
Bin packing is basically the defragmentation of Kubernetes. Back, you know, Bret, when you were first starting out, when I was first starting out, you could actually defragment a hard drive and watch the little Tetris blocks move around the screen. Those were the days.
Being able to do that in a Kubernetes cluster can mean massive savings on the actual utilization of that cluster, because now you free up a bunch of nodes that are no longer necessary. You can delete those, and when you need them, you just add them back. That's the joy of being in a cloud environment: you can use the least amount of resources when you don't need 'em. For instance, in your off-hours, your nighttime hours; then when you start needing them again, you spin 'em up, you add more, you scale up during the day. Being able to do that process over and over again every day is how you optimize your cloud resources.
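The defrag idea described above can be sketched as a classic first-fit-decreasing bin-packing pass. This is a toy illustration in Python, not Cast AI's scheduler; the pod CPU requests and node capacity are made-up numbers.

```python
# First-fit-decreasing bin packing: a rough sketch of the
# "defragmentation" idea. Sort pods by CPU request (largest first)
# and place each on the first node that still has room.

def bin_pack(pod_requests, node_capacity):
    """Pack pods (by CPU request) onto as few nodes as possible."""
    nodes = []  # each node is a list of pod CPU requests
    for req in sorted(pod_requests, reverse=True):
        for node in nodes:
            if sum(node) + req <= node_capacity:
                node.append(req)  # fits on an existing node
                break
        else:
            nodes.append([req])  # no room anywhere: add a node
    return nodes

# Ten pods spread across ten nodes can often fit on far fewer:
pods = [0.5, 1.0, 0.25, 2.0, 0.75, 1.5, 0.5, 1.0, 0.25, 0.25]
packed = bin_pack(pods, node_capacity=4.0)
print(len(packed))  # nodes needed after packing
```

The real scheduler also has to respect memory, affinity rules, and disruption budgets, but the cost intuition is the same: fewer, fuller nodes.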
What we see is that people have so many stateful workloads, whether it's stateful in the real sense, or stateful as in "this is a really poorly architected application," or stateful as in "this application takes 15 minutes to start up, it's a monolith, and I can only run one copy of it, so I can't move it." All of those things mean you can't bin pack a cluster, right? You can't move those things around.
So what ends up happening is people just
end up with these stateful workloads
scattered throughout all 10 nodes.
And even if the 10 nodes are only
60% utilized, you can't get rid
of any of them because it'll cause
some kind of service interruption.
And that's where live migration
allows you to move those
stateful sensitive workloads.
So now those 10 nodes can go down
to six or seven nodes without having
a service interruption, even if
there's less than ideal workloads
scattered throughout the cluster.
Stateful versus stateless: where's the scenario where we need a live pod migration? For those who are perfect in all their software and control everything that runs on their Kubernetes (I don't know who those people are, but let's just say they exist), this isn't needed. Every database has a replica or a mirror, so you can always take a node down. Every pod has proper shutdown handling to ensure that connections are properly moved to a new pod.
By the way, I used to do a whole conference talk on TCP packets and resetting the connection to make sure it moves properly through the load balancer to the next one, and on having a long shutdown time so that you can drain connections. That world of shutting down a pod is so much more complicated than anyone gives it credit for. Everyone treats it like it's casual and easy, and it's just not if you're dealing with hundreds of thousands or millions of connections. There is a lot of nuance and detail to this.
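The draining step being described (stop accepting new connections, wait for in-flight ones to finish, then exit before the grace period runs out) can be sketched roughly like this. All names are illustrative; in a real pod this logic would live behind a preStop hook or a SIGTERM handler.

```python
# Sketch of connection draining during pod shutdown: poll the set of
# in-flight connections until they all finish or the grace period
# (think terminationGracePeriodSeconds) expires. Illustrative only.
import time

def drain_and_shutdown(active_connections, grace_seconds=30.0,
                       poll=0.01, now=time.monotonic):
    """Each connection is a callable returning True once finished.
    Returns True if everything drained before the deadline."""
    deadline = now() + grace_seconds
    while active_connections and now() < deadline:
        # keep only the connections that are still in flight
        active_connections = [c for c in active_connections if not c()]
        time.sleep(poll)
    return not active_connections  # False means we'd force-close

# Two connections that report "finished" on the first check:
print(drain_and_shutdown([lambda: True, lambda: True]))
```

The hard part in production is everything around this loop: telling the load balancer to stop sending traffic first, and choosing a grace period long enough for your slowest real connection.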
And I often end up with teams that implement Kubernetes; it's a sort of predictable pattern, right? They implement Kubernetes, move workloads to it, think Kubernetes gives their workloads magic, and then they just start moving things around. Then they realize, when their customers complain, that the rules of TCP/IP, load balancers, and connection state all still apply. You have to understand those lower levels, and obviously disk, writing to disk, logs for databases, and all that stuff. That's still there too, I think.
I think networking is where I see a lot of junior engineers hand-waving, because quite honestly, the cloud has made a lot of the networking problems go away. We don't have to have Cisco certifications just to run servers anymore. We used to, but now we can get away without them until a certain point in the career or complexity level. And then suddenly you're having to really understand the difference between TCP and UDP, and how session state, long polling, and WebSockets all affect whether you're going to break customers when you decide to restart that pod or redeploy a new one.
I love that stuff, because it's super technical and you can get really into the weeds of it, and I wouldn't call it a solved problem for everyone. My understanding is that something like live migration takes on most of those concerns. It doesn't make them irrelevant, but it does deal with them. Am I right? In terms of networking, with live migrations we're talking about having to be concerned with IPs, connection state, stuff like that.

Yeah, so there's being able to do the networking move and, to your point, reestablishing those sessions.
One of the big things we see is long-running jobs. If you've got a job that's running for eight hours and it gets interrupted at six, you've lost that job. Even if you try to move it and there's some checkpointing involved, a lot of times, like on a Spark workload, the driver will just kill the pod and restart it if it senses any kind of interruption in the networking. So the networking's super important there: being able to maintain that. Long-running sessions, WebSockets. To your point, we've actually tested this extensively with WebSockets. WebSockets stay connected.
And we're still in those early vMotion days; there is a slight pause when we move things. It took vMotion multiple years before they got it really ironed out. We're probably moving faster than they did, 'cause we have the benefit of the experiences they went through and the research that's happened since then. So I think we're moving pretty fast on shortening that time window. But what we found is you queue up all the traffic, and once the pod is live on the new node, that traffic is replayed and all the messages come through. So even on something like a WebSocket, you don't actually lose messages; they're just held up for a few seconds.
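The queue-and-replay behavior described here can be sketched as a small buffer: while the migration pause is in effect, messages are held rather than dropped, then flushed in arrival order once the pod is live on the new node. This is an illustrative toy, not Cast AI's actual mechanism.

```python
# Sketch of queue-and-replay during a migration pause: buffer inbound
# messages instead of dropping them, then replay once the pod resumes.
from collections import deque

class MigrationBuffer:
    def __init__(self):
        self.queue = deque()
        self.migrating = False

    def deliver(self, msg, handler):
        if self.migrating:
            self.queue.append(msg)      # hold messages during the move
        else:
            handler(msg)                # normal delivery

    def resume(self, handler):
        self.migrating = False
        while self.queue:               # replay in arrival order
            handler(self.queue.popleft())

received = []
buf = MigrationBuffer()
buf.deliver("m1", received.append)      # delivered immediately
buf.migrating = True                    # migration pause begins
buf.deliver("m2", received.append)      # buffered
buf.deliver("m3", received.append)      # buffered
buf.resume(received.append)             # pod live on new node: replay
print(received)                         # nothing lost, only delayed
```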
And that's extremely important for maintaining that connection state, like you were mentioning. One of our customers that we're working with heavily on this runs Spark streaming jobs. They're 24/7/365, pulling off a queue, running data transformations and detections, and then pushing somewhere else for alerting mechanisms. If they have a pod go down, it takes about two minutes to get that process restarted and pull in all the data they need again. That's two minutes of backlog. They have super tight SLAs: a five-minute SLA from message creation to the end of the run through the entire detection pipeline. So if you've got a two-minute delay on that shard in your Kafka topic, that's a huge chunk of that five minutes you just ate up, on top of the rest of the pipeline. It's very easy to start missing SLAs there.
You can't take maintenance windows if you're 24/7/365 and doing security processing. You can't be like, well, security's gonna be offline for 10 minutes while we move our pods around. That's just not acceptable in that world. So keeping the connectivity and connection state, keeping everything intact (the Kafka connection, the Spark driver connection) is all super important, along with being able to move that entire TCP/IP stack from one node to another during the migration process.
Yeah, and I mean, we're really talking about a lot of the different kinds of problems that come with shifting workloads. Like walking into an environment and being your own wrecking ball, your own chaos monkey, and saying: I'm gonna go push the power button on this node, or I'm gonna properly shut down this node. Do you have everything set up correctly so that connections are properly drained? That is such a moving target, especially because every time I've had clients go through this exercise of planned maintenance on a node, we do it, and then we fix all the issues with the pods, the shutdown timing, and the Argo CD deployment settings that we need to massage and perfect. And then, six months later, if we do it again, the same thing happens, because now there are new workloads that weren't perfected and weren't well tested.
If I could make a career out of being a pod migration guru, that sounds like my kind of dream job: we crash and break everything, track all the potential issues, and we're a tiger team that goes pod by pod and certifies, yep, this pod can now move safely without risk, because we've got everything dialed in and all the right settings. I feel like that's a workshop opportunity. Maybe sell something on that, because there are so many levels of complexity we haven't even talked about, like database logs and database mirroring. You can't really spin up a new database node and let it sit there idle as a pod while you're waiting for the old one to shut down; they can't access the same files, blah, blah, blah. It just depends on the workload how complex this all gets.
But I'm assuming that when we talk about something like live migrations, we're not just concerned with networking; we're also somehow shifting the storage. I'm guessing there are certain limitations there, where you're not replicating the backend volumes; I guess you're just using something like iSCSI reconnects? How does that work? We haven't really gotten into the solution yet, but I know you're only on certain clouds right now, and I'm assuming that's partly due to the technical limitations of their infrastructure.
Right, exactly. Each cloud has different quirks around how it functions and what the different technologies around it look like. Somebody had asked about being able to move larger systems and what the limits are, and it depends on the use case, right? If you're talking spot instances, being able to move from one spot instance to another in a two-minute interruption window on AWS depends on how much data you have.
If you're trying to move 120 gigs of data, physics is working against you; you don't have enough time in that two-minute window to get it all through the pipe to the new system. Now, if you're talking small pods, less than 32-gig nodes, you can move that fast enough. At 64 gigs, maybe you're on the edge; depending on how much other network traffic is tying up the bandwidth, 64 gigs is getting to the edge of what you can move in a two-minute window.
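You can sanity-check those numbers with back-of-envelope arithmetic. Assuming a 10 Gbps link at roughly 80% effective throughput (both assumed figures, not from the episode), 32 GB fits comfortably in a two-minute window, 64 GB is right on the edge, and 120 GB eats the entire window:

```python
# Back-of-envelope transfer times for the two-minute spot window.
# Link speed and efficiency are illustrative assumptions.

def transfer_seconds(gigabytes, gbps=10.0, efficiency=0.8):
    """Seconds to move `gigabytes` over a `gbps` link, with protocol
    overhead modeled as a flat efficiency factor."""
    bits = gigabytes * 8e9
    return bits / (gbps * 1e9 * efficiency)

for size in (32, 64, 120):
    print(f"{size} GB -> {transfer_seconds(size):.0f} s")
```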
In that other example I was talking about, those long-running Spark streaming jobs: if they're running on demand, live migration is still a massive benefit, because now you can do server upgrades without taking an outage. You can create a new node running your newly patched version of Kubernetes and your newly patched OS, and migrate the pod from one to the other. Your time to replicate is less important. Even if it takes three, four, or five minutes to replicate the memory from one box to the other, who cares?
It's not gonna be paused for that long, because we're doing delta replication. You replicate a big chunk, then a smaller chunk, then a smaller chunk, until the chunk is small enough that you can do it in a pause window.
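That iterative delta process (often called "pre-copy" in the VM migration literature) can be sketched as a loop: copy everything once, then keep copying only what was dirtied since the last pass, until the remaining delta fits in a short pause. The dirty-rate model and numbers here are illustrative only, not Cast AI's implementation.

```python
# Sketch of iterative pre-copy delta replication: each round copies
# the memory dirtied during the previous round, shrinking until the
# final delta can be moved inside a brief pause window.

def precopy_rounds(memory_gb, dirty_fraction=0.2, pause_budget_gb=0.5,
                   max_rounds=10):
    """Return per-round transfer sizes in GB; the last entry is the
    final delta copied during the pause before switchover."""
    rounds = [memory_gb]               # round 1: copy everything
    remaining = memory_gb * dirty_fraction
    while remaining > pause_budget_gb and len(rounds) < max_rounds:
        rounds.append(remaining)       # copy pages dirtied meanwhile
        remaining *= dirty_fraction    # less time copying -> less dirtied
    rounds.append(remaining)           # final delta: pause, copy, switch
    return rounds

print(precopy_rounds(16.0))  # e.g. 16 GB shrinks to a sub-GB final delta
```

Note the convergence assumption: if the workload dirties memory faster than the network can copy it, the deltas never shrink, which is why `max_rounds` caps the loop.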
And so now, when you're moving a huge service from one node to the other, it's the same thing. If you're talking NVMe local storage, we've got another customer we're working with, and it's a different set of problems. They have a terabyte of NVMe local ephemeral disk on every node that needs to be replicated from node to node. When they do node upgrades, it takes about 20 minutes to replicate all of that from one node to another, even on high-throughput disks and high-throughput nodes.
But if it's happening in the background while everything else is humming along nicely, who cares? Keep replicating it over. You keep going down through the deltas, and once your deltas get small enough, you pause for six to 10 seconds, depending on how big the service is, and then you slide it over. A lot of these things are being solved. We're actively reducing these pause times by doing more prep and more processing behind the scenes. Everything operates as a containerd plugin.
I saw somebody asked about on-prem: we will be supporting on-prem, and we will be supporting other solutions. The big catch is that everybody has some different flavor of networking, and different flavors of things behind the scenes. The actual live migration piece right now could apply to any Kubernetes anywhere. It's the IP stack that gets trickier, because you've got Cilium running places, you've got Calico running places, you've got the VPC CNI running places; everybody has different networking flavors. So being able to maintain network connections when you do the migration is largely the more difficult part of the whole process.
Hmm.
Being able to move the pod isn't that bad. If you've got workloads where you can reestablish connections and a connection reset is not a big deal, but you don't wanna have to restart the pod, that's fairly straightforward. We could pretty much do that today across any containerd-compatible Kubernetes. It's specifically the networking that causes more hardship, because everybody has a different flavor of networking.
For AWS, we were able to fork the open source AWS node CNI and create our own flavor of it that now handles the networking piece. So we're using the open source AWS CNI code, we've modified it, and now it works just fine for our purposes. We're doing something similar on GCP; GKE, if I recall, is using Cilium under the hood for its networking, so we're gonna be building a similar plugin for the Cilium side.
Yeah, and the nice thing is, I guess, if you build it for Cilium, would it work universally across any Cilium deployment? In theory, I mean. I'm just thinking of the most popular CNIs: if you check those off the list, it suddenly gives you a lot more reach than having to go cloud by cloud or OS by OS, you know?
Exactly. Our first iteration of this back in January, February, the first version that we demoed, was actually Calico. A lot of people were like, I don't wanna have to rip out my cluster and rebuild it with Calico as the CNI, so we figured out a way to work with the VPC CNI as a backing basis there. So Calico's pretty much already built, we've got the AWS CNI now built, and Cilium is our next target. I saw somebody asked about Azure; Azure is probably gonna be early 2026. We'll do EKS and GKE, then we'll work on AKS, and then on-prem solutions after that. So on-prem will probably be sometime in 2026.
Yeah, I can remember, going back to the two thousands, when we went from delayed or paused migrations to live migrations. I can remember reading the technical papers coming out of VMware and Microsoft, talking about the idea of these deltas: continually repeating the delta process until you get down to zero, or until it fits in a packet, and then that's the final packet kind of thing. I don't know why I remember that all these years later, but I thought that was some pretty cool science, some pretty cool physics across the wire, because back then we were lucky if our servers had one gigabit, let alone 200-gig workloads or anything like that.
This actually led me, during my research, to the fact that there have been attempts in Linux over the years to try to solve this universally. I did some research before the show and saw some projects around ML workloads in particular. A lot of engineers, whether it's platform engineers or the ML engineers themselves, are interested in this because of the problems with large ML or AI workloads today: you can't interrupt them, and if you do, you basically have to start over. It's sort of a precious workload while it's running, and it might be running a long time.
Do you have AI and ML workload customers? Are they maybe part of the first movers onto something like this? I'm basing this on the KubeCon talks and things I've seen out there.
Large-scale data analytics is definitely one of the big players here. A lot of what we're seeing is Spark-driven data analytics, because of exactly that problem. A lot of these jobs will be running for 8, 10, 12, 14, 16 hours, and running those on demand at the scale they're running them at is extraordinarily expensive.
So the big ask is: how do we get those workloads onto spot instances, where when we get the interruption notice, we can fall back to some type of reserved capacity and then fail back to spot? Basically, the goal is to move to this new concept where your Kubernetes cluster has some swap space, whether that's excess spot capacity (two or three extra nodes) or a couple of nodes of on-demand capacity. If you get a node interruption, you can quickly swap into those nodes, and then once you stand a new spot instance back up, you can swap back to it.
That's where we're headed. That's what we're working on in Q4 this year: automating that entire process so you can float back and forth between reserved capacity and spot capacity, to really save on those data analytics jobs, those large ML jobs.
We're not to the GPU side of things yet. I'd love to get us to where we could migrate GPU workloads, 'cause that's where the next big bottleneck is gonna be. The hooks aren't there yet in the NVIDIA toolsets, the CUDA toolsets, in a lot of places to get the data we need, but we're figuring our way around that. GPU workloads tend to be much larger, so the time taken to move them is very expensive. It might be 20 minutes to move a job from one node to the other, 'cause it took 20 minutes to start it up in the first place, just due to the size of the models and how much data you have to replicate. We're starting to put some POC work into the GPU side of things while continuing full steam on expanding the feature set for CPU- and memory-based workloads.
Alright. Dan, I was curious what you've seen on the implementation side of this. When we talk about the need to live-migrate a pod, whether it's for maintenance or not, it almost feels like the next level is the idea of spot instances. I love that idea of my infrastructure dynamically failing and my applications handling it. Is there a maturity level where you see people start out? It's hard for me to imagine that on day one someone's like, yeah, let's just put it all on spot instances, YOLO, we don't care, it's all good, live migration will solve it all. Because obviously there are physics limits to the amount of data we can transmit over the wire. I'm imagining a scenario where you're accrediting certain workloads, like: this ReplicaSet is good for spot because it's low data; we don't need to transfer a hundred gigs of data during a two-minute outage notice. Do you see that as a maturity scale you have to go through?
Yeah, I mean, it is absolutely a maturity scale, right? Going back to the references we talked about, the early days of VMware: nobody started doing vMotion in production. Everyone started with, oh, we've got this five-second interruption; development and test boxes can handle that all day long. So it's really the same concept now; we're going through that same evolution. I agree with Phil, I think we're doing it much faster than VMware did in 2002, 2003. I was around when that happened as well, so I remember racking and stacking all those boxes. But yeah, it's very much the same thing.
Container live migration is brand new; we've just been GA with it for a month. We've had conversations at trade shows and with customers, and there's a lot of excitement around it. I think we're still trying to figure out where it fits, what exact workloads it makes the most sense for. And yeah, I think it's going to be a process of adoption.
there's definitely a lot of use cases.
I think the spot is a very interesting
use case, especially the large data models
and things that we're processing today.
I'm working with a customer now that's
doing a lot of video processing in
Kubernetes, and that's a very CPU
and memory intensive job.
I mean, we're talking a cluster
that scales from a handful of CPUs
to 6,500 CPUs while they're
processing this.
we're really trying to figure out
where it makes the most sense to
apply this type of technology.
no one wants to have that kind of
dynamic scale and then have to pay
for reserved instances for all of
that, like, worst case scenario.
that sounds like a billing nightmare.
and you don't want a job that
runs for, you know, hours that
cost you tons of money to fail 80%
through and have to restart it I
mean, that's just not efficient.
So, yeah, I think the ability to
really move this and allow those
workloads to finish is gonna be
huge for the market.
Alright, so we have been
talking a lot about the problem
and some of the solution.
we do have some slides that give
visualizations for those on YouTube.
this will turn into a podcast.
So audio listeners, we will give
you the alt text, version of it
while we're talking about it.
But, Philip, walk me through exactly
what is happening in the process of a
live migration. How does it kick off?
what's really going on in the
background when it starts,
Absolutely.
and we do have some
better demos other places.
I think on the website.
Basically what we have is a live
migration controller looking
across all the workloads and nodes
that are live migration enabled.
You don't necessarily have to
turn this on for everything.
You've got all your stateless
workloads; you don't need to
live-migrate stateless workloads,
just treat them as normal.
You've got your stateful workloads
that you do want to use this for,
so you could set up a specific
node group for that.
That's gonna allow you to select what you
actually want to do live migration for.
You could use it for everything, but
it just eats up more network bandwidth
if you're using it for the stuff
that already tolerates being moved.
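That opt-in selection can be sketched as a toy filter in Python. The `live-migration` label, the pod shapes, and the function name here are invented for illustration; they are not Cast AI's actual configuration or API:

```python
# Toy model: only pods explicitly opted in (via a hypothetical label)
# are candidates for live migration; everything else is rescheduled
# the normal Kubernetes way.

pods = [
    {"name": "web-1",   "labels": {}},                             # stateless
    {"name": "redis-0", "labels": {"live-migration": "enabled"}},  # stateful
]

def migration_candidates(pods):
    """Return the names of pods that should be live-migrated."""
    return [p["name"] for p in pods
            if p["labels"].get("live-migration") == "enabled"]

print(migration_candidates(pods))   # ['redis-0']
```

The stateless `web-1` pod is simply left to the scheduler, which is the cheaper path whenever a restart is tolerable.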
That controller's gonna be looking for
different signals within the cluster of
when something needs to be live-migrated.
Spot instance interruption is a good one.
being able to do bin packing, evicting
a node from the cluster because it's
underutilized, and then migrating those
workloads to another node in the cluster.
what we call rebalancing.
basically, rebuilding the
cluster with a new set of nodes.
And that could be because you're
doing a node upgrade, you're doing a
Kubernetes upgrade, you're doing an
OS upgrade, or you're just trying to
get a more efficient set of nodes.
all of those are good reasons that you
would want to do your live migration.
So what's gonna happen in that process
is the two daemon sets on the source
node and the destination node are
gonna start talking to each other.
They're going to look at the pods on
the source node, start synchronizing
them over to the destination node.
So behind the scenes, all of that
kind of memory is being copied over
any disc states being copied over
any TCP/IP connection statuses are
being copied over and you're doing
all that prep work behind the scenes.
If you have, ephemeral storage, on
the node that'll start getting copied.
Obviously, depending on how much
there is, that determines how long it takes.
Once the two nodes have identical
copies of the data, that's when
the live migration controller will say
it's time to cut over. It will cut the
connections from one, pause it and put
it into a paused state in containerd,
then it will unpause on the new node.
It'll come up with a new name.
Right now we call 'em
clone one, clone two.
We just add clone to 'em.
So you can tell which was the
before and which was the after.
When that clone one unpauses,
then traffic will be going to it.
It'll have the same exact IP address that
it had while it was on the previous node.
All the traffic continues onto that node.
It picks up exactly where it left off,
and the old pod disappears, right?
The old pod gets shut down and torn down.
If you have something like a PVC attached,
so if you've got an EBS PVC attached,
there is a longer pause because you
have to do a detach and reattach
with the API calls. It works; it
just takes a little bit longer
for that paused state.
That's The downfall of
having to work with, APIs.
it takes time to do an unbind
rebind, to the new node.
but it works today.
If you're using NFS, where you can
do a multi-attach, then it's instant;
it doesn't actually add any delay.
It's just that NFS is a slower
storage technology.
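The sequence Philip just walked through (stage a paused clone, sync state, then cut over) can be sketched as a toy model. The `Pod` class, the field names, and the `-clone-1` suffix handling below are illustrative only, not Cast AI's implementation:

```python
# Toy sketch of the live-migration sequence described above.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    ip: str
    state: str = "running"          # running | paused | terminated
    memory: dict = field(default_factory=dict)

def live_migrate(source: Pod) -> Pod:
    # 1. Stage a paused clone on the destination with a unique name;
    #    it can hold the same IP because it isn't routing traffic yet.
    clone = Pod(name=f"{source.name}-clone-1", ip=source.ip, state="paused")
    # 2. Sync state (memory, disk, TCP connections) while the source runs.
    clone.memory = dict(source.memory)
    # 3. Cut over: pause the source, unpause the clone.
    source.state = "paused"
    clone.state = "running"
    # 4. Tear down the old pod so the name and IP bookkeeping stay clean.
    source.state = "terminated"
    return clone

pod = Pod(name="video-worker-0", ip="10.0.4.17", memory={"frames": 128})
clone = live_migrate(pod)
print(clone.name, clone.ip, clone.state, pod.state)
# video-worker-0-clone-1 10.0.4.17 running terminated
```

The key property the real system has to guarantee, mirrored here, is that the clone carries the same IP but only becomes the active endpoint at the instant the source stops serving.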
So does that sort of make
sense from a high level?
Yeah.
When we talk about Cast AI as a
solution, does it do live migrations
based on certain criteria?
Is it making decisions around, if you
say I want to bin pack all the time,
in the background, is it just
doing live migrations on your behalf?
Or is this something where you're
largely doing it with humans clicking
buttons and controlling the chaos?
No, this goes back to what we
had talked about at the beginning
where, automation is key.
When a node is underutilized, our
bin packer steps in; it's probably
the most sophisticated on the market.
It analyzes and runs live tests on
every node in the cluster of whether
that node can be deleted, whether that
node doesn't need to be there anymore.
And it'll simulate all the pods being
redistributed throughout the cluster.
if the answer is we don't need this
node, it would automatically kick
off a live migration of all the
pods on that node Once it's empty,
it'll just get garbage collected.
once it's gone, all your pods
are running on the new nodes.
Everything's moved seamlessly.
You haven't seen any interruption.
the cluster keeps continuing as normal.
Most of our customers do
scheduled rebalances, so those
are just in the background.
It's evaluating how efficiently
designed the nodes in the cluster are.
If the nodes in the cluster are
not as efficient as they could be, and
different shapes and sizes would be
better for that setup at that point
in the day, based on the mixture of
workloads there, it'll do a blue-green
deployment, set up new nodes live,
migrate the workloads to those new
nodes and tear down the old ones.
So everything that we're
talking about here can either
be scheduled or it's automatic.
It's running every few minutes on a cycle.
but yeah, no, it's entirely
seamless to the users.
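A toy version of the feasibility check Philip describes: simulate redistributing a candidate node's pods onto the spare capacity of the remaining nodes, and only evict the node if everything fits. This uses first-fit decreasing on CPU requests as a stand-in; it is not Cast AI's actual bin-packing algorithm, and the numbers are made up:

```python
# Toy "can this node be removed?" simulation.

def can_evict(node_pods, spare_capacity):
    """node_pods: CPU requests of the pods on the candidate node.
    spare_capacity: free CPU on each of the remaining nodes."""
    spare = sorted(spare_capacity, reverse=True)
    for cpu in sorted(node_pods, reverse=True):   # first-fit decreasing
        for i, free in enumerate(spare):
            if free >= cpu:
                spare[i] -= cpu        # place the pod on this node
                break
        else:
            return False               # some pod fits nowhere: keep the node
    return True                        # all pods fit: live-migrate, then delete

print(can_evict([2, 1, 1], [3, 2]))   # True
print(can_evict([4, 4], [3, 3]))      # False
```

In the real system the simulation would cover memory, affinity rules, and disruption budgets too; CPU alone keeps the sketch readable.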
Nice.
So in the technical details, we're
moving the IP address. I think you had a
diagram showing the pod on the nodes.
When we get down to the nitty gritty of
Kubernetes level stuff, the pod is
recreated, but pod names have to be
unique, and you can't have the IP
on both nodes at once.
And then there's the difference
between TCP and UDP and other IP
protocols, and there are a lot
of little devils in the
details that I'm super interested in.
We won't have time to go into all
of it, but I do remember you showed
the replica, the pod that you're
creating, step one is we create a
pod and download an image, right?
This is all still going
through containerd.
So there's not, like, voodoo
magic in the background happening
outside of the purview of containerd.
Maybe you can talk about that for a minute.
Right, exactly.
So by changing the pod name,
you now have a placeholder for
your new pod information to go into.
And it does maintain the same IP address
when it moves over from one to the other.
So to your point, that's when that
switch has to kick in, where the
old pod definition disappears
and the new pod definition appears
in your control plane with the
API calls up to the kube API.
that cutover is extremely important
because you can't have the same
pod living in the same place twice.
that's why we do have to change
a name when we switch it over.
There are certain services that
cause some trickiness because they have
an operator structure where they expect
there to be a certain pod name.
So when you move it and add the clone
suffix, we're working on finding
workarounds to that in certain areas.
that is a little bit tricky on
certain workloads because you can't
have the same pod existing with the
same name in two different spots.
They have to be unique.
but yeah, definitely.
the IP is the same, but the pod
name is gonna have a clone dash
one or something like that on it
Yep.
Yeah, so it starts with your pod,
and then there's an event that
happens outside of the pod that is
talking into containerd, and you're...
It's a second pod.
Correct.
It's adding that placeholder.
And because the placeholder pod is
actually still in a paused state,
it can have the same IP address.
It's not actually routing traffic to
it, 'cause it's not an active pod yet.
it'll be in a staged state.
So you stage it up with all
the information, but it has
to be named differently.
And then when you do the cutover,
that's when you switch it from
being inactive to active and
switch the old pod to be inactive.
and that's the final stage when,
the clone pod becomes the primary.
And because it's maintained the
exact IP address within the
system, it's not losing any traffic.
So the networking system within
Kubernetes routes it to the new node
and the routing tables are updated and
the pod goes to the new destination.
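The ordering constraint being described here can be written down as a tiny checker. The event names are made up for illustration; the real system's steps are more granular than this:

```python
# Toy checker for the cutover ordering described above: the clone must
# be staged (paused, same IP, not routable) before the source pauses,
# and only one pod may ever be the active endpoint for the IP.

REQUIRED_ORDER = [
    "stage_clone_paused",   # placeholder pod holds the IP, inactive
    "pause_source",         # original stops serving
    "activate_clone",       # routing tables flip to the new node
    "teardown_source",      # old pod definition removed from the API server
]

def valid_cutover(events):
    """True iff every required event is present and occurs in order."""
    positions = [events.index(e) for e in REQUIRED_ORDER if e in events]
    return len(positions) == len(REQUIRED_ORDER) and positions == sorted(positions)

print(valid_cutover(REQUIRED_ORDER))                             # True
print(valid_cutover(["activate_clone", "pause_source",
                     "stage_clone_paused", "teardown_source"]))  # False
```

Activating the clone before the source is paused would mean two active pods claiming one IP, which is exactly the failure mode the sequencing exists to prevent.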
Yeah, that does sound like the hard
part: the old pod is shut
down so the IP can be released.
I assume the IP can't be taken over
while that old pod is still active.
It's one of these things where
I understand it at the theory level,
but I have no idea what containerd
and kube-proxy and all these different
things that are binding to a virtual
interface do, and the order of things
that have to happen in the exact right
sequence in order for you to first
assign that IP to the new node and
then also replay all the packets.
It does seem like a very discrete
order of things that has to happen.
It has to go in a certain order or
they'll just all fail, it feels like.
so that's the part that took
us about a year to figure out.
there had been a lot of studies
and some research papers around,
the moving of memory and the
snapshotting of different workloads.
Like that part was a little bit
more straightforward because it
was really out of the vMotion
playbook days, from early on.
There were also some college
studies around using CRIU to
replicate and migrate containers.
None of them had been able to
solve the IP side of things,
the connectivity side of things.
that's what Cast AI was able to solve for.
and it took a lot of research,
took a lot of, in-depth work.
We started on this in early 2024
with a team of about five engineers:
deep kernel level Linux engineers,
Kubernetes engineers, people
very familiar with the code.
They'd contributed to the Kubernetes
open source project. It was
10 months before we had a demo,
and that was using Calico.
Before we could demo, we had to
have a custom AMI at that point in
time in AWS, because everything was
kernel level at the AMI level.
we knew that was not feasible going
forward to production, but that
was the first demoable version Like
anything else, There's a lot of warts
and vaporware in the first version.
since then we were able to move
the logic up to a containerd plugin
makes it a lot more portable.
Now it can be applied to different clouds.
It's much less invasive.
You don't need a specific AMI.
And under the hood, we were able
to move it to the AWS VPC CNI,
so you don't need the custom Calico CNI.
all of those were iterative steps to
build this and make it more, production
viable and adoptable by the industry.
Now it's a matter of, we've
got kind of two forks going on.
One is continuing to build out
additional platforms: figuring out
Cilium, figuring out Azure CNI.
The other is performance tuning the
existing migrations, reducing time to
migrate, being able to reduce the size
of the deltas down further and further
so we can migrate it faster and faster.
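Reducing the deltas is the classic iterative pre-copy idea from the vMotion era mentioned earlier: keep re-copying only the memory dirtied since the last pass, until the remainder is small enough to move during the brief pause. A toy model with made-up page counts, not a description of Cast AI's actual tuning:

```python
# Toy iterative pre-copy: per-round transfer sizes until the remaining
# dirty delta is small enough to copy during the final pause.

def precopy_rounds(total_pages, dirty_rate, threshold, max_rounds=10):
    """total_pages: pages to copy initially.
    dirty_rate: pages the workload dirties during one copy pass.
    threshold: delta small enough to move during the cutover pause."""
    rounds, to_copy = [], total_pages
    for _ in range(max_rounds):
        rounds.append(to_copy)
        if to_copy <= threshold:
            break                  # small enough: pause and cut over
        to_copy = dirty_rate       # only the newly dirtied pages remain
    return rounds

# 10,000 pages total; ~400 dirtied per pass; cut over once <= 500 remain.
print(precopy_rounds(10_000, 400, 500))   # [10000, 400]
```

The sketch also shows the physics limit raised earlier in the conversation: if the dirty rate stays above the threshold, the loop never converges, which is why write-heavy workloads are the hard case for fast migration.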
so we've got kind of those
two tracks right now.
the team's up to I think
10 or 12 engineers.
kind of working on those two paths,
and this is probably one of our most
heavily invested, areas in the company
is being able to further this technology.
'cause we see how much value it brings.
Yeah, I imagine it won't be very
long. This technology
is pretty advanced, but others
will probably eventually follow if it's
truly the thing we're all looking for.
And it feels like it is;
it feels like the kind of tooling where
it's a hard problem to solve, and we'll
maybe see other people attempt it.
I mean, in the research I had
to do for the show,
'cause I was very curious,
I was like, what's the
history of all this?
Someone mentioned CRIU, which I
believe you're using at
least some of.
That's a project that's been around
for quite some time, over a decade.
And it's not a new idea, but like
a lot of these other technologies,
the devil's in the details.
We never really had an ability to
capture and understand what a binary's
true dependencies were, whether
it's disk or networking things,
until we had containers. You mentioned
on here that it's in LXC, it's
in Docker, it's in Podman;
this tool is actually used widely.
It's just maybe not well known
to us end users because it's
packaged as a part of other tooling.
And I can sort of see a world where,
if this becomes more widespread, you're
gonna end up with haves and have nots:
my solution doesn't have live
migration, or my solution does have live
migration. At some point, maybe it's
ubiquitous and you're building functionality
on top of it, like your automation that
truly adds value around spot instances,
where my company maybe has never done spot
instances because it was too risky for
us and we didn't have the tooling to take
advantage of it without risking downtime.
I definitely have a couple of clients
that I've worked with over the last
couple of years that are like that,
where they're a hundred percent reserved
instances because they want the cheapest,
but they also need to guarantee uptime.
And they can't do that at a level
that live migrations would provide.
So they have to pay that extra
surcharge, for avoiding, ephemeral
instances and stuff like that.
To me, it gives me comfort
that the technology stack
is part open source,
part community driven.
there's also the product, and private IP
side of this as well, but it's not like
you're reinventing the Linux kernel.
It wouldn't have been that long ago that
you had to actually throw in a
kernel module that would only work
on certain distributions of Linux,
and you would have to deploy a custom
ISO. That wasn't that far in the past.
But now we've got all these modern
things (I don't know if eBPF is involved
in this at all), more modern
abstractions, so it feels like you can
just plug and play as long as you've
got the right networking components.
From an engineering perspective,
pretty awesome because it allows you
to build stuff on the stack like this.
the team's not on the call, but the
team that's developing this, good job,
Bravo, that's, some great engineering.
Obviously, anytime something is a year
long effort to crack a nut like this, I
feel for the people who were six
months in, like, no one's gonna see this
feature for a year, and I'm gonna work all
year on it, and I hope someone likes it.
From a software development
perspective, that's the hard part.
That's the true engineering.
Um.
Yeah, absolutely.
we could talk about this forever,
but people have their jobs to do.
okay.
How do people get started? Do they just
go sign up for Cast AI, and is this
a feature out of the box that they
can implement in their clusters, or...
Yep, absolutely.
it's in the UI now, so if people
want to sign up and onboard, we do
recommend having somebody on our sales
engineering team work with folks.
So reach out to us, We'll also reach
out when people sign up as well.
It's all straightforward.
There are no caveats.
It's Helm charts to do the install,
and then you set up the autoscaler.
Today it's using our autoscaler,
but we will be adding support
for Karpenter around the
end of Q4 or early Q1.
Yeah, that's great.
my usual co-host is with AWS, so
they would greatly appreciate that.
I know that Karpenter's been
out a little over a year.
We've had a surprising number
of people on our Discord server
(for those of you watching, there's
a Discord server you can join)
talking about using Karpenter.
I'm really impressed with
the uptake on that project.
And in case you're wondering what
Karpenter is: it's with a K, and it's
for Kubernetes. You can look on this
YouTube channel later, because we did a
show on it and had people talking
about it and demoing it
whenever it was released.
I think that was 2024.
I can't remember exactly.
alright, so everyone
knows how to get started.
Everyone now knows that they wish they
had live migrations and they currently
don't unless they're a Cast AI customer.
Where can we find you on the internet?
Where can people learn more
about what you're doing?
Are you gonna be at conferences soon?
I'm assuming, cast is probably
gonna have a booth at KubeCon again.
They always seemed to have a booth there.
we've got a big booth
at KubeCon this year.
I think we've got a 20 by 20.
We're gonna be doing demos and
presentations in the booth.
this is gonna be a big part of that.
We'll also be at re:Invent in Vegas
in early December; I guess
it's the first week of December.
so I'll be at both of those events.
I'm also really active on LinkedIn, so
if anybody wants to reach out to me on
LinkedIn, or if you wanna set up a session
just to go into more detail, feel free
to ping me. I post a lot of Kubernetes
content in general: best practices,
things that we see in the industry from
a Kubernetes evolution side of things,
and also obviously a bunch of Cast stuff.
So feel free to follow or connect.
Happy to share more information.
Awesome.
Well, I'm looking forward to hearing about
the continual proliferation of all things
live migration on every possible setup.
Someday it'll be on FreeBSD with
some esoteric Kubernetes variant.
It's, pretty cool to see
the evolution of this.
Well, thank you both for
being here, Philip and Dan.
see you all later.
Ciao.
Thanks for watching, and I'll
see you in the next episode.