page_owner Part 1: a quick introduction

This blog post is part of a series about the page_owner debug feature in the Linux memory management subsystem, related to the talk “Improving page_owner for profiling and monitoring memory usage per allocation stack trace”, presented at Linux Plumbers Conference 2025.

What is page_owner?

In the Linux kernel, page_owner is a debug feature that tracks the allocation (and freeing) of pages in the system – so it can tell the ‘owner’ of a page. ;-)

For each page allocation, page_owner stores the order, GFP flags, stack trace, timestamp, command name (comm), process ID (PID) and thread-group ID (TGID), among other details. It also stores similar information when pages are freed (stack trace, timestamp, PID and TGID).

With page_owner, one can answer questions such as “What allocated this page?” and “How many pages are allocated by this particular stack trace, PID, or comm?”.

This is struct page_owner in Linux v6.19. This additional information is stored per page, as an extension of struct page provided by CONFIG_PAGE_EXTENSION.

struct page_owner {
        unsigned short order;
        short last_migrate_reason;
        gfp_t gfp_mask;
        depot_stack_handle_t handle;
        depot_stack_handle_t free_handle;
        u64 ts_nsec;
        u64 free_ts_nsec;
        char comm[TASK_COMM_LEN];
        pid_t pid;
        pid_t tgid;
        pid_t free_pid;
        pid_t free_tgid;
};

Usage

In order to use page_owner, build the kernel with CONFIG_PAGE_OWNER=y (see mm/Kconfig.debug) and boot the kernel with page_owner=on.

The debugfs file /sys/kernel/debug/page_owner provides the information in struct page_owner for every page, listed per PFN (page frame number).

This example shows the entry for a page (line continuation added for clarity) – it tells “What allocated this page?”:

# cat /sys/kernel/debug/page_owner
...
Page allocated via order 0, \
  mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), \
  pid 5640, tgid 5640 (stress-ng-brk), ts 414987114269 ns
PFN 0x114 type Unmovable Block 0 type Unmovable Flags 0x200(workingset|node=0|zone=0)
 get_page_from_freelist+0x1416/0x1600
 __alloc_frozen_pages_noprof+0x18c/0x1000
 alloc_pages_mpol+0x43/0x100
 new_slab+0x349/0x460
 ___slab_alloc+0x811/0xd90
 __kmem_cache_alloc_bulk+0xb8/0x1f0
 __prefill_sheaf_pfmemalloc+0x42/0x90
 kmem_cache_prefill_sheaf+0xa9/0x240
 mas_preallocate+0x32f/0x420
 __split_vma+0xdc/0x300
 vms_gather_munmap_vmas+0xa4/0x240
 do_vmi_align_munmap+0xe9/0x180
 do_vmi_munmap+0xcb/0x160
 __vm_munmap+0xa7/0x150
 __x64_sys_munmap+0x16/0x20
 do_syscall_64+0xa4/0x310

...

One can use tools/mm/page_owner_sort to process the information in the file, or come up with custom commands, scripts, or programs.

For example: calculate the total size of pages allocated by stress-ng-brk with any order, in MiB:

# COMM=stress-ng-brk
# cat /sys/kernel/debug/page_owner \
  | awk -F '[ ,]' \
    '/^Page allocated via order .* \('${COMM}'\)/ { PAGES+=2^$5 } 
     END { print PAGES*4096/2**20 " MiB" }'
0.0429688 MiB
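Building on that one-liner, a hypothetical variant can aggregate across all commands at once. This is a sketch, not an official tool: page_owner_by_comm is just an illustrative name, and the field numbers assume the single-line entry format shown earlier (the real file does not contain the line continuations added above for clarity).

```shell
# Hypothetical helper: total base pages per command name (comm), summed
# over all "Page allocated via order ..." entries read from stdin or files.
# Splitting on spaces, commas and parentheses, the order is field 5 and
# the comm is field 13 of each entry's header line.
page_owner_by_comm() {
    awk -F'[ ,()]+' \
        '/^Page allocated via order/ { pages[$13] += 2^$5 }
         END { for (c in pages) print c, pages[c] }' "$@" | sort
}

# Demonstration on two sample entries (not live debugfs data):
page_owner_by_comm <<'EOF'
Page allocated via order 0, mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN), pid 5640, tgid 5640 (stress-ng-brk), ts 414987114269 ns
Page allocated via order 2, mask 0x140dc2(GFP_HIGHUSER|__GFP_ZERO), pid 100, tgid 100 (systemd), ts 1 ns
EOF
# prints:
#   stress-ng-brk 1
#   systemd 4
```

Against a live system, one would instead pipe /sys/kernel/debug/page_owner into the function.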

More information about page_owner is available in Documentation/mm/page_owner.rst.

Problem: output size

In the page_owner file, note the significant amount of text produced per page: 745 bytes in the example above.

Considering a system with 1 GiB of RAM and 4 kB pages, fully allocated, with similarly sized entries per page, the output size might reach approximately 186 MiB! (745 [bytes/page] * (2**30 [bytes of RAM] / 4096 [bytes/page]) / 2**20 [bytes/MiB])

For validation, a test VM with 1 GiB of RAM, after just a warm-up level of stress (stress-ng --sequential --timeout 1), produced a 125 MiB file – not quick to read even with the system idle:

# time cat /sys/kernel/debug/page_owner \
  | wc --bytes | numfmt --to=iec
125M

real    0m3.009s
user    0m0.512s
sys     0m3.542s

While this might not be a serious issue when the file is read and processed only once, the cost adds up quickly when it is read repeatedly – for example, for periodic monitoring.

Alternative: optimized output

Fortunately, another debugfs file, /sys/kernel/debug/page_owner_stacks/show_stacks, provides an optimized output for obtaining the memory usage per stack trace. Even though it doesn’t cover all the use cases of the generic output, it resembles the default operation of page_owner_sort (without the PFN lines) and provides information that is often interesting for kernel development or analysis.

This example shows the entry for a stack trace – it tells “How many pages are allocated by this particular stack trace?”:

# cat /sys/kernel/debug/page_owner_stacks/show_stacks
...
 get_page_from_freelist+0x1416/0x1600
 __alloc_frozen_pages_noprof+0x18c/0x1000
 alloc_pages_mpol+0x43/0x100
 folio_alloc_noprof+0x56/0xa0
 page_cache_ra_unbounded+0xd9/0x230
 filemap_fault+0x305/0x1000
 __do_fault+0x2c/0xb0
 __handle_mm_fault+0x6f4/0xeb0
 handle_mm_fault+0xd9/0x210
 do_user_addr_fault+0x205/0x600
 exc_page_fault+0x61/0x130
 asm_exc_page_fault+0x26/0x30
nr_base_pages: 9643

...

The nr_base_pages field tells the number of base pages (i.e., not huge pages) allocated by a stack trace. So, this particular stack trace for readahead (page_cache_ra_unbounded()) has allocated approximately 38 MiB (9643 [pages] * 4096 [bytes/page] / 2**20 [bytes/MiB]).
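As a sketch of how the nr_base_pages lines can be post-processed (show_stacks_total_mib is an illustrative name, and 4 KiB base pages are assumed), the counters can be summed to see how much memory the recorded stacks account for in total:

```shell
# Hypothetical helper: sum all nr_base_pages counters from show_stacks
# output (stdin or files) and convert to MiB, assuming 4 KiB base pages.
show_stacks_total_mib() {
    awk '/^nr_base_pages:/ { total += $2 }
         END { printf "%.1f MiB\n", total * 4096 / 2^20 }' "$@"
}

# Live usage would be:
#   show_stacks_total_mib /sys/kernel/debug/page_owner_stacks/show_stacks
# Demonstration on sample counters:
show_stacks_total_mib <<'EOF'
nr_base_pages: 9643
nr_base_pages: 357
EOF
# prints: 39.1 MiB
```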

Note that this file is much more efficient for this particular purpose: just 402 KiB in less than 0.05 seconds – that is, 0.3% of the size and 1.7% of the time:

# time cat /sys/kernel/debug/page_owner_stacks/show_stacks \
  | wc --bytes | numfmt --to=iec
402K

real    0m0.042s
user    0m0.004s
sys     0m0.046s
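When only the heaviest stack traces are of interest, the output can be reduced further with the count_threshold file described in Documentation/mm/page_owner.rst. A small config fragment (paths assume debugfs mounted at /sys/kernel/debug):

```shell
# Only print stacks currently holding at least 1000 base pages
# (count_threshold filters the show_stacks output; see
# Documentation/mm/page_owner.rst):
echo 1000 > /sys/kernel/debug/page_owner_stacks/count_threshold
cat /sys/kernel/debug/page_owner_stacks/show_stacks
```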

Conclusion

The page_owner debug feature (enabled with CONFIG_PAGE_OWNER=y and page_owner=on) provides information about the page allocations in the system through two debugfs files: /sys/kernel/debug/page_owner, with a generic format (a dense description per page), and /sys/kernel/debug/page_owner_stacks/show_stacks, with an optimized format (the number of base pages per stack trace).