From 58bd244cb29c7446c5914be6c7949f932242416d Mon Sep 17 00:00:00 2001
From: Jiangtian Feng
Date: Sat, 30 May 2026 16:58:19 +0800
Subject: [PATCH] anolis: mm: memcg: add per-memcg pgdemote_kswapd/direct/khugepaged stats

ANBZ: #36699

Surface pgdemote_kswapd, pgdemote_direct and pgdemote_khugepaged in the
cgroup v2 memory.stat file so that operators can attribute page demotion
(demotion to a lower memory tier performed by kswapd, by direct reclaim,
or by khugepaged) per memcg instead of only at the system level via
/proc/vmstat.

This is an Anolis-variant hand-port of upstream commit f77f0c751478
("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
rather than a cherry-pick. Upstream exposes these counters as per-node
lruvec stats, which requires PGDEMOTE_* to first be converted from enum
vm_event_item to enum node_stat_item. That conversion is upstream commits
23e9f0138963 ("mm/vmstat: move pgdemote_* to per-node stats") and
b805ab3c6935 ("mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING"),
neither of which is on devel-6.6. Pulling them in would grow
NR_VM_NODE_STAT_ITEMS, which sizes by-value arrays embedded in
header-exported structs (struct lruvec_stats inlined in struct
mem_cgroup_per_node in include/linux/memcontrol.h, and struct
per_cpu_nodestat / struct pglist_data in include/linux/mmzone.h) -- a
KABI break.

On devel-6.6 PGDEMOTE_* are still vm_event_item entries
(include/linux/vm_event_item.h:44-46, added by commit 668e4147d885
"mm/vmscan: add page demotion counter" and follow-ups, already on this
tree). The smallest correct change is therefore to plumb them through the
existing memcg_vm_event_stat[] table, following the shape of commit
a45974350d00 ("mm: memcg: add THP swap out info for anonymous reclaim")
and b1de8455e3da ("mm: memcg: add per-memcg zswap writeback stat"), both
of which added vm_event_item entries to that array with no enum moves.

devel-7.0 and master already carry the node_stat_item conversion, so
devel-6.6 deliberately diverges in the internal representation while
keeping the user-visible memory.stat field names identical to upstream.

Changes:

1) mm/memcontrol.c: append PGDEMOTE_KSWAPD/DIRECT/KHUGEPAGED to
   memcg_vm_event_stat[]. The render loop in memcg_stat_format() iterates
   this array via vm_event_name(), so the three counters automatically
   appear in memory.stat as "pgdemote_kswapd", "pgdemote_direct" and
   "pgdemote_khugepaged".

2) mm/vmscan.c: account the demotion to its owning memcg inside
   demote_folio_list() alongside the existing __count_vm_events() site.
   Every folio reaching demote_folio_list() is isolated from a single
   lruvec and so belongs to a single memcg (shrink_inactive_list and the
   MGLRU evict_folios path each isolate one lruvec; the only mixed-memcg
   caller, reclaim_folio_list(), sets sc->no_demotion and never reaches
   here). The owning memcg is captured before migrate_pages() consumes
   the list and is charged once afterwards with nr_succeeded, so the
   per-memcg counter is monotonic and equals the global counter's
   contribution exactly. Attributing to that descendant memcg rather than
   sc->target_mem_cgroup is required: shrink_node_memcgs() walks the
   descendants of the reclaim root and calls shrink_lruvec() on each, so
   charging the reclaim root would make it absorb all demotions and leave
   descendant memory.stat at 0. MGLRU is covered through the shared
   shrink_folio_list -> demote_folio_list path.

KABI: memcg_vm_event_stat[] is a file-local static const in
mm/memcontrol.c and NR_MEMCG_EVENTS is a file-private macro derived from
ARRAY_SIZE(). The structs that embed events[NR_MEMCG_EVENTS] (struct
memcg_vmstats_percpu, struct memcg_vmstats) are defined inside
mm/memcontrol.c with only forward declarations in
include/linux/memcontrol.h, so growing NR_MEMCG_EVENTS by 3 changes no
exported struct layout.

Scope:
- cgroup v2 only. memcg1_events[] is a separate file-local table for v1
  and is left untouched, mirroring a45974350d00 / b1de8455e3da.
- PGPROMOTE_SUCCESS and the NUMA_* counters from upstream f77f0c751478
  are out of scope: PGPROMOTE_SUCCESS is a node_stat_item bumped via
  mod_node_page_state(), and the NUMA hint counters bump from sites
  lacking a clean folio->memcg handle. Both are follow-ups.

Tested on a 6.6.102 host: with the patch the three pgdemote_* lines
appear in /sys/fs/cgroup//memory.stat and /proc/vmstat is unchanged.

Inspired-by: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Jiangtian Feng
---
 mm/memcontrol.c |  3 +++
 mm/vmscan.c     | 18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b2ddf4a4640..caf4556cb1b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -629,6 +629,9 @@ static const unsigned int memcg_vm_event_stat[] = {
 	THP_SWPOUT,
 	THP_SWPOUT_FALLBACK,
 #endif
+	PGDEMOTE_KSWAPD,
+	PGDEMOTE_DIRECT,
+	PGDEMOTE_KHUGEPAGED,
 };
 
 #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3fe6b3d1a89d..6740ce3f6adf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1051,6 +1051,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	enum vm_event_item item;
+	struct mem_cgroup *memcg;
 	nodemask_t allowed_mask;
 
 	struct migration_target_control mtc = {
@@ -1074,12 +1076,26 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 
 	node_get_allowed_targets(pgdat, &allowed_mask);
 
+	/*
+	 * All folios reaching demote_folio_list() are isolated from a single
+	 * lruvec, hence belong to a single memcg: both shrink_inactive_list()
+	 * and the MGLRU evict_folios() path isolate one lruvec at a time, and
+	 * the only mixed-memcg caller, reclaim_folio_list(), sets
+	 * sc->no_demotion so its folios never reach here. Capture that memcg
+	 * before migrate_pages() consumes the list, then attribute the
+	 * demotion to it once below.
+	 */
+	memcg = folio_memcg(list_first_entry(demote_folios, struct folio, lru));
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
+	item = PGDEMOTE_KSWAPD + reclaimer_offset();
+	__count_vm_events(item, nr_succeeded);
+	if (memcg)
+		count_memcg_events(memcg, item, nr_succeeded);
 
 	return nr_succeeded;
 }
-- 
Gitee
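
To check the result from userspace, the three new fields can be picked out of memory.stat text. The following is only a minimal sketch and is not part of the patch: struct pgdemote_stats, match_pgdemote_line(), parse_pgdemote_text() and read_pgdemote_file() are hypothetical names, and the only things taken from the patch are the three field names. It assumes the usual one "name value" pair per line layout of memory.stat.

```c
#include <stdio.h>
#include <string.h>

/* Holds the three counters this patch adds to memory.stat. */
struct pgdemote_stats {
	unsigned long long kswapd;
	unsigned long long direct;
	unsigned long long khugepaged;
};

/*
 * Hypothetical helper: if "line" is one of the pgdemote_* entries,
 * store its value in *out and return 1, otherwise return 0.
 */
int match_pgdemote_line(const char *line, struct pgdemote_stats *out)
{
	char name[64];
	unsigned long long val;

	if (sscanf(line, "%63s %llu", name, &val) != 2)
		return 0;
	if (strcmp(name, "pgdemote_kswapd") == 0)
		out->kswapd = val;
	else if (strcmp(name, "pgdemote_direct") == 0)
		out->direct = val;
	else if (strcmp(name, "pgdemote_khugepaged") == 0)
		out->khugepaged = val;
	else
		return 0;
	return 1;
}

/*
 * Parse memory.stat-style text held in memory. Returns how many of the
 * three pgdemote_* fields were found (0..3); lines longer than the local
 * buffer are skipped.
 */
int parse_pgdemote_text(const char *text, struct pgdemote_stats *out)
{
	int found = 0;

	memset(out, 0, sizeof(*out));
	while (*text) {
		size_t len = strcspn(text, "\n");
		char line[128];

		if (len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			found += match_pgdemote_line(line, out);
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return found;
}

/* Same as above, but reads from a file; returns -1 if it cannot be opened. */
int read_pgdemote_file(const char *path, struct pgdemote_stats *out)
{
	char line[128];
	int found = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	memset(out, 0, sizeof(*out));
	while (fgets(line, sizeof(line), f))
		found += match_pgdemote_line(line, out);
	fclose(f);
	return found;
}
```

A caller would pass the path of the memory.stat file of the cgroup under test to read_pgdemote_file(). A return value below 3 suggests the running kernel does not expose all three fields.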