From: Roman Gushchin <>
Subject: [PATCH RFC 00/14] The new slab memory controller
Date: Thu, 5 Sep 2019 14:45:44 -0700

The existing slab memory controller is based on the idea of replicating

slab allocator internals for each memory cgroup. This approach promises

a low memory overhead (one pointer per page), and isn't adding too much

code on hot allocation and release paths. But it has a very serious flaw:

it leads to low slab utilization.



Using a drgn* script, I've estimated slab utilization on

a number of machines running different production workloads. In most

cases it was between 45% and 65%, and the best number I've seen was

around 85%. Turning kmem accounting off brings it into the high 90s

and frees up 30-50% of slab memory. This means that the real price

of the existing slab memory controller is way bigger than a pointer

per page.



The real reason why the existing design leads to a low slab utilization

is simple: slab pages are used exclusively by one memory cgroup.

If a cgroup makes only a few allocations of a certain size,

or if some active objects (e.g. dentries) are left after the cgroup is

deleted, or the cgroup contains a single-threaded application which barely

allocates any kernel objects, but does so each time on a new CPU:

in all these cases the resulting slab utilization is very low.

If kmem accounting is off, the kernel is able to use free space

on slab pages for other allocations.



Arguably this wasn't an issue back in the days when the kmem controller was

introduced as an opt-in feature, which had to be turned on

individually for each memory cgroup. But now it's turned on by default

on both cgroup v1 and v2. And modern systemd-based systems tend to

create a large number of cgroups.



This patchset provides a new implementation of the slab memory controller,

which aims to reach a much better slab utilization by sharing slab pages

between multiple memory cgroups. Below is a short description of the new

design (more details in commit messages).



Accounting is performed per-object instead of per-page. Slab-related

vmstat counters are converted to bytes. Charging is performed on a per-page

basis, rounding up and remembering the leftovers.



Memcg ownership data is stored in a per-slab-page vector: for each slab page

a vector of corresponding size is allocated. To keep slab memory reparenting

working, instead of saving a pointer to the memory cgroup directly, an

intermediate object is used. It's simply a pointer to a memcg (which can be

easily changed to the parent) with a built-in reference counter. This scheme

allows reparenting all allocated objects without walking them and updating

the memcg pointer in each one.



Instead of creating an individual set of kmem_caches for each memory cgroup,

two global sets are used: the root set for non-accounted and root-cgroup

allocations and the second set for all other allocations. This simplifies

the lifetime management of individual kmem_caches: they are destroyed

together with their root counterparts. It removes a good amount of code

and makes things generally simpler.



The patchset contains a couple of semi-independent parts, which can also be

used outside of the slab memory controller:

1) subpage charging API, which can be used in the future for accounting of

other non-page-sized objects, e.g. percpu allocations.

2) mem_cgroup_ptr API (refcounted pointers to a memcg), which can be reused

for the efficient reparenting of other objects, e.g. pagecache.



The patchset has been tested on a number of different workloads in our

production. In all cases, it saved hefty amounts of memory:

1) web frontend, 650-700 MB, ~42% of slab memory

2) database cache, 750-800 MB, ~35% of slab memory

3) dns server, 700 MB, ~36% of slab memory



So far I haven't found any regressions on the tested workloads, but

a potential CPU regression caused by the more precise accounting is a concern.



Obviously the amount of saved memory depends on the number of memory cgroups,

uptime and specific workloads, but overall it feels like the new controller

saves 30-40% of slab memory, sometimes more. Additionally, it should lead

to lower memory fragmentation, both because of a smaller number of

non-movable pages and because there is no longer a need to move all

slab objects to a new set of pages when a workload is restarted in a new

memory cgroup.



* https://github.com/osandov/drgn





Roman Gushchin (14):

mm: memcg: subpage charging API

mm: memcg: introduce mem_cgroup_ptr

mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat

mm: vmstat: convert slab vmstat counter to bytes

mm: memcg/slab: allocate space for memcg ownership data for non-root

slabs

mm: slub: implement SLUB version of obj_to_index()

mm: memcg/slab: save memcg ownership data for non-root slab objects

mm: memcg: move memcg_kmem_bypass() to memcontrol.h

mm: memcg: introduce __mod_lruvec_memcg_state()

mm: memcg/slab: charge individual slab objects instead of pages

mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h

mm: memcg/slab: replace memcg_from_slab_page() with

memcg_from_slab_obj()

mm: memcg/slab: use one set of kmem_caches for all memory cgroups

mm: slab: remove redundant check in memcg_accumulate_slabinfo()



drivers/base/node.c | 11 +-

fs/proc/meminfo.c | 4 +-

include/linux/memcontrol.h | 102 ++++++++-

include/linux/mm_types.h | 5 +-

include/linux/mmzone.h | 12 +-

include/linux/slab.h | 3 +-

include/linux/slub_def.h | 9 +

include/linux/vmstat.h | 8 +

kernel/power/snapshot.c | 2 +-

mm/list_lru.c | 12 +-

mm/memcontrol.c | 431 +++++++++++++++++++++--------------

mm/oom_kill.c | 2 +-

mm/page_alloc.c | 8 +-

mm/slab.c | 37 ++-

mm/slab.h | 300 +++++++++++++------------

mm/slab_common.c | 449 ++++---------------------------------

mm/slob.c | 12 +-

mm/slub.c | 63 ++----

mm/vmscan.c | 3 +-

mm/vmstat.c | 38 +++-

mm/workingset.c | 6 +-

21 files changed, 683 insertions(+), 834 deletions(-)



--

2.21.0



