CPU Scheduler imbalance with cgroups
The scenario
Job isolation is so hot right now. Everybody has loads of giant servers and loads of jobs, but many of those jobs don’t need an entire server all to themselves. It would be a much more efficient use of resources to stack a bunch of applications on the same server and let them do their thing. Ideally you don’t want any application to affect the other applications on the system, so you want to employ resource limits on the applications to ensure everybody behaves.
Enter cgroups, which allow you to put resource limits on various parts of an application’s behavior. You can limit memory usage, cpu usage, io bandwidth, etc. To verify that this works properly, one of our teams was testing the cpu part of this controller. They made two jobs: one that was a stress -n <# cpus> command, and another that was a real live application we use. The amount of cpu time is tracked per cgroup in /sys/fs/cgroup/path/to/group/cpu.stat in cgroupsv2, and they noticed that the stress group got around 50% more cpu time than the real application, despite being given equal weight.
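For reference, the comparison boils down to something like the following sketch. This is not the team's actual setup: the cgroup names are hypothetical, but cpu.weight and cpu.stat are the real cgroup v2 files, and usage_usec in cpu.stat is the number being compared.

#include <stdio.h>

/* give a cgroup a cpu.weight (100 is the cgroup v2 default) */
static void set_weight(const char *group, int weight)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpu.weight", group);
        f = fopen(path, "w");
        if (f) {
                fprintf(f, "%d\n", weight);
                fclose(f);
        }
}

/* print the group's cpu.stat; usage_usec is the line we compare */
static void dump_usage(const char *group)
{
        char path[256], line[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpu.stat", group);
        f = fopen(path, "r");
        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                printf("%s: %s", group, line);
        fclose(f);
}

int main(void)
{
        set_weight("stress", 100);
        set_weight("realapp", 100);
        /* ... run the two workloads in their groups, then ... */
        dump_usage("stress");
        dump_usage("realapp");
        return 0;
}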
First things first
The first thing I needed to do was characterize our real application, because starting all of the infrastructure for this test took a lot of time and was kind of fragile. I also needed to be able to show upstream the problem, and I couldn’t very well take our giant application, hand it to them, and tell them how to build and set it all up.
I created this, which I used to generate an rt-app configuration that was representative of our workload. Then I wrote a simple reproducer script, here, in order to quickly iterate on my problem.
The problem generally
If you haven’t read my previous post on scheduler basics you are going to want to do that now, because we’re going to get into the weeds a bit here.
The problem is generally because of how the processes behave. Whenever a process is runnable it sits on its respective runqueue. Because stress is just a cpu consumer it never leaves the runqueue, which means that a lot of the work the scheduler does just doesn’t happen for any of the stress processes. The real application, however, does real application things: it goes to sleep, wakes up, wakes up other threads, waits on IO, etc. All of these actions interact with the scheduler in various ways, which means there were lots of little ways our application would lose CPU time to the stress application.
The problem turned out to be a combination of three things. First was the lazy weight and load propagation up the cfs_rq hierarchy. Second was how we calculated the cfs_rq shares on the cpu at enqueue time. And finally our good friend wake affinity reared its ugly head again and made a mess of everything.
Load propagation up the hierarchy
There are two load measurements we make on a cfs_rq and sched_entity: the load_avg and the runnable_load_avg. load_avg is the load of the task that gets updated when we are running and sleeping. When we sleep load_avg goes down, when we run load_avg goes up. runnable_load_avg is slightly different in that it only gets updated when we’re running. It can go down if we are asleep for a long period of time, but it only reflects the load of the process when it’s running. Both of these get propagated up the hierarchy, but only the runnable_load_avg matters when it comes to load balancing the cpus (ish, it’s more complicated than this, but for all intents and purposes it’s the important one).
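To make the distinction concrete, here is a minimal sketch of why the load balancer prefers runnable_load_avg. This is not the kernel’s actual PELT code and the toy_* names are mine; the point is only that a sleeper’s contribution lingers in load_avg while runnable_load_avg only reflects what is on the runqueue right now.

struct toy_se {
        unsigned long load_avg;         /* per-entity average, decays while running and sleeping */
};

struct toy_cfs_rq {
        unsigned long load_avg;         /* still carries the decaying load of sleepers */
        unsigned long runnable_load_avg;        /* only entities currently on the runqueue */
};

/* the entity contributes to load_avg for as long as it is attached to this cfs_rq */
static void toy_attach(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
        cfs_rq->load_avg += se->load_avg;
}

static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
        cfs_rq->runnable_load_avg += se->load_avg;
}

static void toy_dequeue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
        /*
         * A sleeper falls out of runnable_load_avg immediately, but its
         * contribution lingers in load_avg and only decays away over time.
         */
        cfs_rq->runnable_load_avg -= se->load_avg;
}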
Prior to Peter Zijlstra’s patches to fix this problem we never propagated the runnable_load_avg of newly attached cfs_rq’s up the hierarchy, only the load_avg. This meant that the load balancer was usually dealing with stale information, and would make improper load balancing decisions. This affected our workload because our tasks would go on and off the runqueue while the stress group would not, which exacerbated the issue.
Peter’s patches fixed this problem by making the runnable_load_avg be completely re-calculated every time we have a weight change in the hierarchy. Before, we would only change the load_avg, and we only updated the load_avg by a delta of the load change based on the new weights. Now with his patches we keep a runnable_load_sum, which is the time spent runnable, and then calculate the runnable_load_avg by using our new weight in the calculation with runnable_load_sum. Changes are now immediate and more accurate.
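Roughly, the recalculation looks like the sketch below. This is not the kernel’s code: the toy_* names and the bare LOAD_AVG_MAX-style divisor are my simplifications, and the real PELT update is more careful about the contribution of the current period.

#define TOY_LOAD_AVG_MAX        47742   /* ceiling a fully-runnable sum converges to */

struct toy_sched_avg {
        unsigned long runnable_load_sum;        /* geometrically decayed time spent runnable */
        unsigned long runnable_load_avg;
};

static void toy_reweight(struct toy_sched_avg *sa, unsigned long new_weight)
{
        /*
         * The sum is pure "time spent runnable", so scaling it by the new
         * weight yields an immediately correct average, rather than nudging
         * the old average by a delta computed against the old weight.
         */
        sa->runnable_load_avg = new_weight * sa->runnable_load_sum / TOY_LOAD_AVG_MAX;
}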
A slight aside.
Peter’s patches introduced a sharp regression to my test case. I spent a good amount of time trying to track this down, and eventually realized that his calculation of runnable_weight suffered from the same problem as how we calculated our cfs_rq shares, as described below. The fix for this specific problem is in this patch.
Mis-calculation of cfs_rq shares at enqueue time
The code in my previous post is the updated version of calc_cfs_shares, but previously our current load was calculated like this
load = cfs_rq->avg.load_avg
which means that if we suddenly woke up 4 processes on this cfs_rq, the load would be the historical load, rather than a reflection of the current load. This code got changed to the following
load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
load_avg operates in the range of 0 to load.weight. This change meant that at enqueue time we would get credit for our theoretical maximum load, as cfs_rq->load.weight is updated with the sched_entity->load.weight at enqueue time.
The result was that we were now getting an immediate reflection of probable load by newly woken tasks, which made load balancing more accurate.
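To see the effect, here is a rough sketch of the shares math from the previous post with the new load calculation plugged in. The toy_* helper and the way I fold the rest of the group’s load into one parameter are simplifications of the real calc_cfs_shares, not the kernel’s exact code.

static unsigned long toy_calc_shares(unsigned long tg_shares,
                                     unsigned long tg_load_elsewhere,  /* load on the group's other cpus */
                                     unsigned long cfs_rq_weight,      /* sum of enqueued se weights */
                                     unsigned long cfs_rq_load_avg)    /* historical average */
{
        /* credit the theoretical maximum load of freshly woken tasks immediately */
        unsigned long load = cfs_rq_weight > cfs_rq_load_avg ?
                             cfs_rq_weight : cfs_rq_load_avg;

        return tg_shares * load / (tg_load_elsewhere + load);
}

If four tasks just woke up on this cpu, cfs_rq_weight already includes their weight, so the group’s sched_entity gets a proportionally bigger share right away instead of waiting for cfs_rq_load_avg to catch up.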
Wake affinity effective load miscalculation
Wake affinity is the scheduler’s way of trying to prefer cache locality over wakeup latency. The idea is that, all things being equal, we’d rather pay the cost of migrating a task to the cpu where we think it may have cache locality than deal with migrating cache between cpus. This logic isn’t super complicated: we basically have a counter to see if we have a multi-waker to wakee relationship, and if we do, we make a sort of half-assed load balance decision and migrate the task to our current cpu.
The half-assed load balance decision is part of the problem. Since we make load balancing decisions at the cpu level, and our sched_entity is buried under a cfs_rq hierarchy, we have to calculate the load difference each cpu would see by waking this task up on either cpu. How this happens is we do a recursive walk up the cfs_rq hierarchy, doing a modified version of calc_cfs_shares all the way up, until we have the delta of the load that would be visible to the cpu.
You’re going to need to read that paragraph a few more times.
How this is different from the normal state of things is that we add our sched_entity->avg.load_avg to the cfs_rq->avg.load_avg, calculate what our theoretical shares would be for the cfs_rq’s sched_entity, subtract the actual value of the cfs_rq’s sched_entity, and do this recursively. The problem is that in reality, at enqueue time, we only add our sched_entity->load.weight to our cfs_rq->load.weight, then we calculate our shares from there, and set our sched_entity->load.weight to the new weight.
Now this isn’t a huge deal; it’s supposed to be an approximation, and it’s definitely approximate, but it isn’t a reflection of what happens in reality. And as we saw above, the load_avg can bias us against the group that goes to sleep occasionally. With load balancing decisions we want to be using the theoretical maximum load a new task would place on the hierarchy.
Next we have a significantly more subtle problem. We are calculating our new sched_entity->load.weight using current load_avg numbers. The sched_entity->load.weight that we subtract from our newly calculated weight was computed using historic load_avg numbers, which could have been higher at the time. This means that when we’re trying to calculate how much load we would add to the cpu by moving our task to that cpu, the calculation could actually make it look like we were removing load from the cpu. This meant we were constantly migrating tasks when in reality we should not have been. Since we can easily calculate the “old” weight by using the current numbers without the task’s added weight, we do that, so our delta is consistent with our current load_avg numbers.
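Put as a sketch, the fix is to derive both the “with task” and “without task” weights from the same, current numbers and take their difference. toy_shares() below is a made-up stand-in for one step of the modified calc_cfs_shares walk, not the real helper.

/* stand-in for one step of the shares walk; the 2048 represents the load
 * contributed by the group's other cpus and is entirely made up */
static unsigned long toy_shares(unsigned long cfs_rq_load)
{
        return 1024 * cfs_rq_load / (cfs_rq_load + 2048);
}

static long toy_effective_load_delta(unsigned long cfs_rq_load,
                                     unsigned long task_load)
{
        /*
         * Both weights come from the current load numbers, so the delta
         * cannot go negative just because the stored
         * sched_entity->load.weight was computed from stale, higher
         * load_avg values.
         */
        unsigned long new_weight = toy_shares(cfs_rq_load + task_load);
        unsigned long old_weight = toy_shares(cfs_rq_load);

        return (long)new_weight - (long)old_weight;
}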
This was a doozy. Even once I understood everything that was going on it still took me a few days (probably more like a week) to get the math and logic right here and realize what was going on.
Wake affinity ping-ponging
The last aspect of the problem was wake affinity ping-ponging our tasks. As I stated above, if everything is equal between two cpus, we will prefer to wake the task up on the waker’s cpu. This blanket assumption that these things may share cache meant that we were ping-ponging tasks around constantly, which was still causing problems. We would wake affine a process, then the load balancer would come behind us and put the process back on another cpu.
This problem was much easier to understand and more straightforward to fix. I simply kept track of the last time the load balancer moved us off of a cpu, and if it was within the last second we don’t do an affinity wakeup. This got us the rest of the way there, and now we have an even 50-50 split between the two groups.
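In sketch form, the throttle looks like the following; the field and helper names are hypothetical, and the real change lives in the scheduler’s wakeup and load balancing paths.

struct toy_task {
        unsigned long last_lb_migration;        /* time the load balancer last moved us */
};

/* called by the load balancer when it migrates the task to another cpu */
static void toy_note_lb_migration(struct toy_task *p, unsigned long now)
{
        p->last_lb_migration = now;
}

/* called at wakeup: skip the affine wakeup if the load balancer overruled
 * us within the last second (one_second is in the same time units as now) */
static int toy_wake_affine_allowed(struct toy_task *p, unsigned long now,
                                   unsigned long one_second)
{
        return now - p->last_lb_migration > one_second;
}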
Conclusion
This problem sucked; I spent a solid two months on it. Many thanks to Peter Zijlstra and Tejun Heo for answering my questions and being my sounding board. I had very little scheduler experience before tackling this problem, and it ended up being much more involved than I expected. The whole series to fix the problem has been sent upstream; I imagine there will be a lot of discussion, and the patches that eventually go upstream won’t look anything like these ones, but for now you can look at them here.