    mm/numa: automatically generate node migration order (commit 79c28a41)
    Author: Dave Hansen
    Patch series "Migrate Pages in lieu of discard", v11.
    
    We're starting to see systems with more and more kinds of memory such as
    Intel's implementation of persistent memory.
    
    Let's say you have a system with some DRAM and some persistent memory.
    Today, once DRAM fills up, reclaim will start and some of the DRAM
    contents will be thrown out.  Allocations will, at some point, start
    falling over to the slower persistent memory.
    
    That has two nasty properties.  First, the newer allocations can end up in
    the slower persistent memory.  Second, reclaimed data in DRAM are just
    discarded even if there are gobs of space in persistent memory that could
    be used.
    
    This patchset implements a solution to these problems.  At the end of the
    reclaim process in shrink_page_list() just before the last page refcount
    is dropped, the page is migrated to persistent memory instead of being
    dropped.
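
    As a purely conceptual sketch of that hook (try_demote_instead_of_discard()
    and migrate_cold_page_to_node() are hypothetical placeholders, not the
    series' actual helpers; next_demotion_node() is the per-node lookup
    sketched further below):

            static bool try_demote_instead_of_discard(struct page *page)
            {
                    /* Which slower node, if any, should this node's cold
                     * pages fall back to? */
                    int target = next_demotion_node(page_to_nid(page));

                    /* Terminal node: no slower tier, reclaim as usual. */
                    if (target == NUMA_NO_NODE)
                            return false;

                    /* Move the page's contents to the slower node instead
                     * of discarding them (hypothetical helper). */
                    return migrate_cold_page_to_node(page, target);
            }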
    
    While I've talked about a DRAM/PMEM pairing, this approach would function
    in any environment where memory tiers exist.
    
    This is not perfect.  It "strands" pages in slower memory and never brings
    them back to fast DRAM.  Huang Ying has follow-on work which repurposes
    NUMA balancing to promote hot pages back to DRAM.
    
    This is also all based on an upstream mechanism that allows persistent
    memory to be onlined and used as if it were volatile:
    
    	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
    
    With that, the DRAM and PMEM in each socket will be represented as two
    separate NUMA nodes, with the CPUs attached to the DRAM node.  So the
    general inter-NUMA demotion mechanism introduced in the patchset can
    migrate the cold DRAM pages to the PMEM node.
    
    We have tested the patchset with postgresql and pgbench.  On a
    2-socket server machine with DRAM and PMEM, the kernel with the patchset
    improves the pgbench score by up to 22.1% compared with the
    DRAM only + disk case.  This comes from the reduced disk read
    throughput (which drops by up to 70.8%).
    
    == Open Issues ==
    
     * Memory policies and cpusets that, for instance, restrict allocations
       to DRAM can still have their pages demoted to PMEM whenever they opt
       in to this new mechanism.  A cgroup-level API to opt in to or out of
       these migrations will likely be required as a follow-on.
     * Reclaim could be more aggressive about scanning anon LRU pages
       since demotion no longer necessarily involves I/O.  get_scan_count(),
       for instance, says: "If we have no swap space, do not bother
       scanning anon pages"
    
    This patch (of 9):
    
    Prepare for the kernel to auto-migrate pages to other memory nodes with a
    node migration table.  This allows creating a single migration target for
    each NUMA node to enable the kernel to do NUMA page migrations instead of
    simply discarding colder pages.  A node with no target is a "terminal
    node", so reclaim acts normally there.  The migration target does not
    fundamentally _need_ to be a single node, but this implementation starts
    there to limit complexity.
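
    A minimal sketch of that table and its lookup (simplified; the exact
    code in the patch may differ):

            /* One demotion target per node; NUMA_NO_NODE marks a terminal node. */
            static int node_demotion[MAX_NUMNODES] __read_mostly = {
                    [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE
            };

            /* Which node should @node demote its cold pages to, if any? */
            static int next_demotion_node(int node)
            {
                    return node_demotion[node];
            }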
    
    When memory fills up on a node, memory contents can be automatically
    migrated to another node.  The biggest problems are knowing when to
    migrate and where the migration should be targeted.
    
    The most straightforward way to generate the "to where" list would be to
    follow the page allocator fallback lists.  Those lists already tell us,
    if memory is full, where to look next.  It would also be logical to move
    memory in that order.
    
    But, the allocator fallback lists have a fatal flaw: most nodes appear in
    all the lists.  This would potentially lead to migration cycles (A->B,
    B->A, A->B, ...).
    
    Instead of using the allocator fallback lists directly, keep a separate
    node migration ordering.  But, reuse the same data used to generate page
    allocator fallback in the first place: find_next_best_node().
    
    This means that the firmware data used to populate node distances
    essentially dictates the ordering for now.  It should also be
    architecture-neutral since all NUMA architectures have a working
    find_next_best_node().
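
    As an illustration only (a simplified sketch, not the code in the
    patch; build_demotion_order() is a hypothetical name), the order can
    be built by starting at the CPU-bearing nodes and letting
    find_next_best_node() pick each target, with a "used" nodemask
    guaranteeing that no node is chosen twice, which is what rules out
    cycles:

            static void build_demotion_order(void)
            {
                    nodemask_t used = NODE_MASK_NONE;
                    int node;

                    /*
                     * Allocations land near CPUs first, so start every
                     * demotion path at the nodes that have CPUs.
                     */
                    nodes_or(used, used, node_states[N_CPU]);

                    for_each_node_state(node, N_CPU) {
                            /*
                             * find_next_best_node() marks its choice in
                             * 'used', so each node gets at most one target
                             * and no target is ever reused.
                             */
                            node_demotion[node] = find_next_best_node(node, &used);
                    }
            }

    A real implementation would also revisit the nodes chosen as targets so
    that multi-hop chains (as walked by the loop in the next section) can be
    built; that is omitted here for brevity.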
    
    RCU is used to allow lock-less reads of node_demotion[] and to prevent
    demotion cycles from being observed.  If multiple reads of node_demotion[] are
    performed, a single rcu_read_lock() must be held over all reads to ensure
    no cycles are observed.  Details are as follows.
    
    === What does RCU provide? ===
    
    Imagine a simple loop which walks down the demotion path looking
    for the last node:
    
            terminal_node = start_node;
            while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                    terminal_node = node_demotion[terminal_node];
            }
    
    The initial values are:
    
            node_demotion[0] = 1;
            node_demotion[1] = NUMA_NO_NODE;
    
    and are updated to:
    
            node_demotion[0] = NUMA_NO_NODE;
            node_demotion[1] = 0;
    
    What guarantees that the cycle is not observed:
    
            node_demotion[0] = 1;
            node_demotion[1] = 0;
    
    and would loop forever?
    
    With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed around
    the loop.  Since the write side does a synchronize_rcu(), any loop that
    observed the old contents is known to have completed before the
    synchronize_rcu() returns.
    
    RCU, combined with disable_all_migrate_targets(), ensures that the old
    migration state is not visible by the time __set_migration_target_nodes()
    is called.
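
    An illustrative write-side sequence (update_migration_order() is a
    hypothetical wrapper; how the actual patch splits these steps across
    functions may differ):

            static void update_migration_order(void)
            {
                    /*
                     * 1. Wipe all targets so that no reader can chase a
                     *    stale, possibly cyclic, demotion path.
                     */
                    disable_all_migrate_targets();

                    /*
                     * 2. Wait for every reader that might have seen the
                     *    old table to leave its RCU read-side critical
                     *    section.
                     */
                    synchronize_rcu();

                    /*
                     * 3. Only now build the new node_demotion[] contents.
                     */
                    __set_migration_target_nodes();
            }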
    
    === What does READ_ONCE() provide? ===
    
    READ_ONCE() forbids the compiler from merging or reordering successive
    reads of node_demotion[].  This ensures that any updates are *eventually*
    observed.
    
    Consider the above loop again.  The compiler could theoretically read the
    entirety of node_demotion[] into local storage (registers) and never go
    back to memory, and *permanently* observe bad values for node_demotion[].
    
    Note: RCU does not provide any universal compiler-ordering
    guarantees:
    
    	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
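
    Putting both pieces together, the reader-side walk shown earlier would
    roughly become (an illustrative sketch; find_terminal_node() is a
    hypothetical name):

            static int find_terminal_node(int start_node)
            {
                    int terminal_node = start_node;
                    int target;

                    rcu_read_lock();
                    /* READ_ONCE() forces a fresh load on every iteration. */
                    while ((target = READ_ONCE(node_demotion[terminal_node])) != NUMA_NO_NODE)
                            terminal_node = target;
                    rcu_read_unlock();

                    return terminal_node;
            }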
    
    This code is unused for now.  It will be called later in the
    series.
    
    Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
    
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>