1. 04 Feb, 2020 3 commits
    • David Hildenbrand's avatar
      mm/page_alloc: fix and rework pfn handling in memmap_init_zone() · 948c436e
      David Hildenbrand authored
      Let's update the pfn manually whenever we continue the loop.  This makes
      the code easier to read but also less error prone (and we can directly fix
      one issue).
      
      When overlap_memmap_init() returns true, pfn is updated to
      "memblock_region_memory_end_pfn(r)".  So it already points at the *next*
      pfn to process.  Incrementing the pfn another time is wrong, we might
      leave one uninitialized.  I spotted this by inspecting the code, so I have
      no idea if this is relevant in practise (with kernelcore=mirror).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
      Fixes: a9a9e77f
      
       ("mm: move mirrored memory specific code outside of memmap_init_zone")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      948c436e
    • David Hildenbrand's avatar
      mm/page_alloc.c: initialize memmap of unavailable memory directly · 4b094b78
      David Hildenbrand authored
      Let's make sure that all memory holes are actually marked PageReserved(),
      that page_to_pfn() produces reliable results, and that these pages are not
      detected as "mmap" pages due to the mapcount.
      
      E.g., booting a x86-64 QEMU guest with 4160 MB:
      
      [    0.010585] Early memory node ranges
      [    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
      [    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
      [    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
      
      max_pfn is 0x144000.
      
      Before this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000800           16384       64  ___________M_______________________________        mmap
                   total           16384       64
      
      After this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000100000000           16384       64  ___________________________r_______________        reserved
                   total           16384       64
      
      IOW, especially the unavailable physical memory ("memory hole") in the
      last section would not get properly marked PageReserved() and is indicated
      to be "mmap" memory.
      
      Drop the trace of that function from include/linux/mm.h - nobody else
      needs it, and rename it accordingly.
      
      Note: The fake zone/node might not be covered by the zone/node span.  This
      is not an urgent issue (for now, we had the same node/zone due to the
      zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
      type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
      marking these memory holes PageReserved().
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b094b78
    • David Hildenbrand's avatar
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · e822969c
      David Hildenbrand authored
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for a x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is, that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100
      
       ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e822969c
  2. 31 Jan, 2020 4 commits
    • Qian Cai's avatar
      mm/page_isolation: fix potential warning from user · 3d680bdf
      Qian Cai authored
      It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
      from start_isolate_page_range(), but should avoid triggering it from
      userspace, i.e, from is_mem_section_removable() because it could crash
      the system by a non-root user if warn_on_panic is set.
      
      While at it, simplify the code a bit by removing an unnecessary jump
      label.
      
      Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw
      
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Suggested-by: default avatarMichal Hocko <mhocko@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d680bdf
    • Qian Cai's avatar
      mm/hotplug: silence a lockdep splat with printk() · 4a55c047
      Qian Cai authored
      It is not that hard to trigger lockdep splats by calling printk from
      under zone->lock.  Most of them are false positives caused by lock
      chains introduced early in the boot process and they do not cause any
      real problems (although most of the early boot lock dependencies could
      happen after boot as well).  There are some console drivers which do
      allocate from the printk context as well and those should be fixed.  In
      any case, false positives are not that trivial to workaround and it is
      far from optimal to lose lockdep functionality for something that is a
      non-issue.
      
      So change has_unmovable_pages() so that it no longer calls dump_page()
      itself - instead it returns a "struct page *" of the unmovable page back
      to the caller so that in the case of a has_unmovable_pages() failure,
      the caller can call dump_page() after releasing zone->lock.  Also, make
      dump_page() is able to report a CMA page as well, so the reason string
      from has_unmovable_pages() can be removed.
      
      Even though has_unmovable_pages doesn't hold any reference to the
      returned page this should be reasonably safe for the purpose of
      reporting the page (dump_page) because it cannot be hotremoved in the
      context of memory unplug.  The state of the page might change but that
      is the case even with the existing code as zone->lock only plays role
      for free pages.
      
      While at it, remove a similar but unnecessary debug-only printk() as
      well.  A sample of one of those lockdep splats is,
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        test.sh/8653 is trying to acquire lock:
        ffffffff865a4460 (console_owner){-.-.}, at:
        console_unlock+0x207/0x750
      
        but task is already holding lock:
        ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&(&zone->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock+0x2f/0x40
               rmqueue_bulk.constprop.21+0xb6/0x1160
               get_page_from_freelist+0x898/0x22c0
               __alloc_pages_nodemask+0x2f3/0x1cd0
               alloc_pages_current+0x9c/0x110
               allocate_slab+0x4c6/0x19c0
               new_slab+0x46/0x70
               ___slab_alloc+0x58b/0x960
               __slab_alloc+0x43/0x70
               __kmalloc+0x3ad/0x4b0
               __tty_buffer_request_room+0x100/0x250
               tty_insert_flip_string_fixed_flag+0x67/0x110
               pty_write+0xa2/0xf0
               n_tty_write+0x36b/0x7b0
               tty_write+0x284/0x4c0
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               redirected_tty_write+0x6a/0xc0
               do_iter_write+0x248/0x2a0
               vfs_writev+0x106/0x1e0
               do_writev+0xd4/0x180
               __x64_sys_writev+0x45/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&(&port->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               tty_port_tty_get+0x20/0x60
               tty_port_default_wakeup+0xf/0x30
               tty_port_tty_wakeup+0x39/0x40
               uart_write_wakeup+0x2a/0x40
               serial8250_tx_chars+0x22e/0x440
               serial8250_handle_irq.part.8+0x14a/0x170
               serial8250_default_handle_irq+0x5c/0x90
               serial8250_interrupt+0xa6/0x130
               __handle_irq_event_percpu+0x78/0x4f0
               handle_irq_event_percpu+0x70/0x100
               handle_irq_event+0x5a/0x8b
               handle_edge_irq+0x117/0x370
               do_IRQ+0x9e/0x1e0
               ret_from_intr+0x0/0x2a
               cpuidle_enter_state+0x156/0x8e0
               cpuidle_enter+0x41/0x70
               call_cpuidle+0x5e/0x90
               do_idle+0x333/0x370
               cpu_startup_entry+0x1d/0x1f
               start_secondary+0x290/0x330
               secondary_startup_64+0xb6/0xc0
      
        -> #1 (&port_lock_key){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               serial8250_console_write+0x3e4/0x450
               univ8250_console_write+0x4b/0x60
               console_unlock+0x501/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
      
        -> #0 (console_owner){-.-.}:
               check_prev_add+0x107/0xea0
               validate_chain+0x8fc/0x1200
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               console_unlock+0x269/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
               __offline_isolated_pages.cold.52+0x2f/0x30a
               offline_isolated_pages_cb+0x17/0x30
               walk_system_ram_range+0xda/0x160
               __offline_pages+0x79c/0xa10
               offline_pages+0x11/0x20
               memory_subsys_offline+0x7e/0xc0
               device_offline+0xd5/0x110
               state_store+0xc6/0xe0
               dev_attr_store+0x3f/0x60
               sysfs_kf_write+0x89/0xb0
               kernfs_fop_write+0x188/0x240
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               ksys_write+0xc6/0x160
               __x64_sys_write+0x43/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&(&zone->lock)->rlock);
                                       lock(&(&port->lock)->rlock);
                                       lock(&(&zone->lock)->rlock);
          lock(console_owner);
      
         *** DEADLOCK ***
      
        9 locks held by test.sh/8653:
         #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
        vfs_write+0x25f/0x290
         #1: ffff888277618880 (&of->mutex){+.+.}, at:
        kernfs_fop_write+0x128/0x240
         #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
        kernfs_fop_write+0x138/0x240
         #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
        lock_device_hotplug_sysfs+0x16/0x50
         #4: ffff8884374f4990 (&dev->mutex){....}, at:
        device_offline+0x70/0x110
         #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
        __offline_pages+0xbf/0xa10
         #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
        percpu_down_write+0x87/0x2f0
         #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
         #8: ffffffff865a4920 (console_lock){+.+.}, at:
        vprintk_emit+0x100/0x340
      
        stack backtrace:
        Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
        BIOS U34 05/21/2019
        Call Trace:
         dump_stack+0x86/0xca
         print_circular_bug.cold.31+0x243/0x26e
         check_noncircular+0x29e/0x2e0
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
      
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a55c047
    • David Hildenbrand's avatar
      mm: remove "count" parameter from has_unmovable_pages() · fe4c86c9
      David Hildenbrand authored
      Now that the memory isolate notifier is gone, the parameter is always 0.
      Drop it and cleanup has_unmovable_pages().
      
      Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe4c86c9
    • Kirill A. Shutemov's avatar
      mm/page_alloc: skip non present sections on zone initialization · 3f135355
      Kirill A. Shutemov authored
      memmap_init_zone() can be called on the ranges with holes during the
      boot.  It will skip any non-valid PFNs one-by-one.  It works fine as
      long as holes are not too big.
      
      But huge holes in the memory map causes a problem.  It takes over 20
      seconds to walk 32TiB hole.  x86-64 with 5-level paging allows for much
      larger holes in the memory map which would practically hang the system.
      
      Deferred struct page init doesn't help here.  It only works on the
      present ranges.
      
      Skipping non-present sections would fix the issue.
      
      Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarBaoquan He <bhe@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f135355
  3. 14 Jan, 2020 2 commits
    • Vlastimil Babka's avatar
      mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      Vlastimil Babka authored
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
      earlier, so no issue was found while testing the commit, but same is not
      true for e.g.  ppc64 and s390 where the kernel would not boot with
      debug_pagealloc=on as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f
      
       ("mm, debug_pagelloc: use static keys to enable debugging")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8e57f8ac
    • Vlastimil Babka's avatar
      mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      Vlastimil Babka authored
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim. This
         however affects also the second attempt, which is not limited to
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts:
      
      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMA's or when synchronous compaction is enabled always - THP
         allocation from any node with effort determined by global defrag setting
         and VMA madvise
      3. fallback to base pages on any node
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2
      
       ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc638f32
  4. 01 Dec, 2019 6 commits
  5. 06 Nov, 2019 2 commits
    • Johannes Weiner's avatar
      mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Johannes Weiner authored
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console, and in
      our case trigger lockup warnings because of a serial console too slow to
      handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1be334e5
    • Mel Gorman's avatar
      mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Mel Gorman authored
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before, this was the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
  6. 14 Oct, 2019 1 commit
    • David Rientjes's avatar
      mm, hugetlb: allow hugepage allocations to reclaim as needed · 3f36d866
      David Rientjes authored
      Commit b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when
      compaction may not succeed") has chnaged the allocator to bail out from
      the allocator early to prevent from a potentially excessive memory
      reclaim.  __GFP_RETRY_MAYFAIL is designed to retry the allocation,
      reclaim and compaction loop as long as there is a reasonable chance to
      make forward progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
      the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
      
      The most obvious affected subsystem is hugetlbfs which allocates huge
      pages based on an admin request (or via admin configured overcommit).  I
      have done a simple test which tries to allocate half of the memory for
      hugetlb pages while the memory is full of a clean page cache.  This is
      not an unusual situation because we try to cache as much of the memory
      as possible and sysctl/sysfs interface to allocate huge pages is there
      for flexibility to allocate hugetlb pages at any time.
      
      System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
      after the memory is prefilled by a clean page cache:
      
        root@test1:~# cat hugetlb_test.sh
      
        set -x
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 3 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
        dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
        TS=$(date +%s)
        echo 256 > /proc/sys/vm/nr_hugepages
        cat /proc/sys/vm/nr_hugepages
      
      The results for 2 consecutive runs on clean 5.3
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
        + date +%s
        + TS=1569905284
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
        + date +%s
        + TS=1569905311
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
      
      Now with b39d0ee2 applied
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
        + date +%s
        + TS=1569905516
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        11
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
        + date +%s
        + TS=1569905541
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        12
      
      The success rate went down by factor of 20!
      
      Although hugetlb allocation requests might fail and it is reasonable to
      expect them to under extremely fragmented memory or when the memory is
      under a heavy pressure but the above situation is not that case.
      
      Fix the regression by reverting back to the previous behavior for
      __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
      those requests.
      
      Mike said:
      
      : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
      : boot where this may not be as much of an issue.  However, I am aware of at
      : least three use cases where allocations are made after the system has been
      : up and running for quite some time:
      :
      : - DB reconfiguration.  If sysctl/sysfs fails to get required number of
      :   huge pages, system is rebooted to perform allocation after boot.
      :
      : - VM provisioning.  If unable get required number of huge pages, fall
      :   back to base pages.
      :
      : - An application that does not preallocate pool, but rather allocates
      :   pages at fault time for optimal NUMA locality.
      :
      : In all cases, I would expect b39d0ee2 to cause regressions and
      : noticable behavior changes.
      :
      : My quick/limited testing in
      : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
      : was insufficient.  It was also mentioned that if something like
      : b39d0ee2 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
      : requests as in this patch.
      
      [mhocko@suse.com: reworded changelog]
      Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
      Fixes: b39d0ee2
      
       ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f36d866
  7. 07 Oct, 2019 1 commit
    • Qian Cai's avatar
      mm/page_alloc.c: fix a crash in free_pages_prepare() · 234fdce8
      Qian Cai authored
      On architectures like s390, arch_free_page() could mark the page unused
      (set_page_unused()) and any access later would trigger a kernel panic.
      Fix it by moving arch_free_page() after all possible accessing calls.
      
       Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
       Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
                  R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
       Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
                  000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
                  00000000275cca48 0000000000000100 0000000000000008 000003d080010000
                  00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
       Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
                  0000000026c2b962: e32014000036 pfd 2,1024(%r1)
                 #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
                 >0000000026c2b96e: 41101100  la %r1,256(%r1)
                  0000000026c2b972: a737fff8  brctg %r3,26c2b962
                  0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
                  0000000026c2b97c: e31003400004 lg %r1,832
                  0000000026c2b982: ebff1430016a asi 5168(%r1),-1
       Call Trace:
       __free_pages_ok+0x16a/0x5d8)
       memblock_free_all+0x206/0x290
       mem_init+0x58/0x120
       start_kernel+0x2b0/0x570
       startup_continue+0x6a/0xc0
       INFO: lockdep is turned off.
       Last Breaking-Event-Address:
       __free_pages_ok+0x372/0x5d8
       Kernel panic - not syncing: Fatal exception: panic_on_oops
       00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C
      
      In the past, only kernel_poison_pages() would trigger this but it needs
      "page_poison=on" kernel cmdline, and I suspect nobody tested that on
      s390.  Recently, kernel_init_free_pages() (commit 6471384a ("mm:
      security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
      was added and could trigger this as well.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
      Fixes: 8823b1db ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
      Fixes: 6471384a
      
       ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      234fdce8
  8. 28 Sep, 2019 1 commit
    • David Rientjes's avatar
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed · b39d0ee2
      David Rientjes authored
      
      Memory compaction has a couple significant drawbacks as the allocation
      order increases, specifically:
      
       - isolate_freepages() is responsible for finding free pages to use as
         migration targets and is implemented as a linear scan of memory
         starting at the end of a zone,
      
       - failing order-0 watermark checks in memory compaction does not account
         for how far below the watermarks the zone actually is: to enable
         migration, there must be *some* free memory available.  Per the above,
         watermarks are not always suffficient if isolate_freepages() cannot
         find the free memory but it could require hundreds of MBs of reclaim to
         even reach this threshold (read: potentially very expensive reclaim with
         no indication compaction can be successful), and
      
       - if compaction at this order has failed recently so that it does not even
         run as a result of deferred compaction, looping through reclaim can often
         be pointless.
      
      For hugepage allocations, these are quite substantial drawbacks because
      these are very high order allocations (order-9 on x86) and falling back to
      doing reclaim can potentially be *very* expensive without any indication
      that compaction would even be successful.
      
      Reclaim itself is unlikely to free entire pageblocks and certainly no
      reliance should be put on it to do so in isolation (recall lumpy reclaim).
      This means we should avoid reclaim and simply fail hugepage allocation if
      compaction is deferred.
      
      It is also not helpful to thrash a zone by doing excessive reclaim if
      compaction may not be able to access that memory.  If order-0 watermarks
      fail and the allocation order is sufficiently large, it is likely better
      to fail the allocation rather than thrashing the zone.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b39d0ee2
  9. 24 Sep, 2019 4 commits
  10. 03 Sep, 2019 1 commit
    • Matt Fleming's avatar
      sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming authored
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4
      
       ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a55c7454
  11. 25 Aug, 2019 1 commit
    • David Rientjes's avatar
      mm, page_alloc: move_freepages should not examine struct page of reserved memory · cd961038
      David Rientjes authored
      After commit 907ec5fc ("mm: zero remaining unavailable struct
      pages"), struct page of reserved memory is zeroed.  This causes
      page->flags to be 0 and fixes issues related to reading
      /proc/kpageflags, for example, of reserved memory.
      
      The VM_BUG_ON() in move_freepages_block(), however, assumes that
      page_zone() is meaningful even for reserved memory.  That assumption is
      no longer true after the aforementioned commit.
      
      There's no reason why move_freepages_block() should be testing the
      legitimacy of page_zone() for reserved memory; its scope is limited only
      to pages on the zone's freelist.
      
      Note that pfn_valid() can be true for reserved memory: there is a
      backing struct page.  The check for page_to_nid(page) is also buggy but
      reserved memory normally only appears on node 0 so the zeroing doesn't
      affect this.
      
      Move the debug checks to after verifying PageBuddy is true.  This
      isolates the scope of the checks to only be for buddy pages which are on
      the zone's freelist which move_freepages_block() is operating on.  In
      this case, an incorrect node or zone is a bug worthy of being warned
      about (and the examination of struct page is acceptable bcause this
      memory is not reserved).
      
      Why does move_freepages_block() gets called on reserved memory? It's
      simply math after finding a valid free page from the per-zone free area
      to use as fallback.  We find the beginning and end of the pageblock of
      the valid page and that can bring us into memory that was reserved per
      the e820.  pfn_valid() is still true (it's backed by a struct page), but
      since it's zero'd we shouldn't make any inferences here about comparing
      its node or zone.  The current node check just happens to succeed most
      of the time by luck because reserved memory typically appears on node 0.
      
      The fix here is to validate that we actually have buddy pages before
      testing if there's any type of zone or node strangeness going on.
      
      We noticed it almost immediately after bringing 907ec5fc in on
      CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
      the per-zone free area where the math in move_freepages() will bring the
      start or end pfn into reserved memory and wanting to claim that entire
      pageblock as a new migratetype.  So the path will be rare, require
      CONFIG_DEBUG_VM, and require fallback to a different migratetype.
      
      Some struct pages were already zeroed from reserve pages before
      907ec5fca3c so it theoretically could trigger before this commit.  I
      think it's rare enough under a config option that most people don't run
      that others may not have noticed.  I wouldn't argue against a stable tag
      and the backport should be easy enough, but probably wouldn't single out
      a commit that this is fixing.
      
      Mel said:
      
      : The overhead of the debugging check is higher with this patch although
      : it'll only affect debug builds and the path is not particularly hot.
      : If this was a concern, I think it would be reasonable to simply remove
      : the debugging check as the zone boundaries are checked in
      : move_freepages_block and we never expect a zone/node to be smaller than
      : a pageblock and stuck in the middle of another zone.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd961038
  12. 20 Aug, 2019 1 commit
  13. 19 Jul, 2019 4 commits
    • Dan Williams's avatar
      mm/sparsemem: support sub-section hotplug · ba72b4c8
      Dan Williams authored
      The libnvdimm sub-system has suffered a series of hacks and broken
      workarounds for the memory-hotplug implementation's awkward
      section-aligned (128MB) granularity.
      
      For example the following backtrace is emitted when attempting
      arch_add_memory() with physical address ranges that intersect 'System
      RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
      
          # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
          100000000-1ffffffff : System RAM
          200000000-303ffffff : Persistent Memory (legacy)
          304000000-43fffffff : System RAM
          440000000-23ffffffff : Persistent Memory
          2400000000-43bfffffff : Persistent Memory
            2400000000-43bfffffff : namespace2.0
      
          WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
          [..]
          RIP: 0010:add_pages+0x5c/0x60
          [..]
          Call Trace:
           devm_memremap_pages+0x460/0x6e0
           pmem_attach_disk+0x29e/0x680 [nd_pmem]
           ? nd_dax_probe+0xfc/0x120 [libnvdimm]
           nvdimm_bus_probe+0x66/0x160 [libnvdimm]
      
      It was discovered that the problem goes beyond RAM vs PMEM collisions as
      some platform produce PMEM vs PMEM collisions within a given section.
      The libnvdimm workaround for that case revealed that the libnvdimm
      section-alignment-padding implementation has been broken for a long
      while.
      
      A fix for that long-standing breakage introduces as many problems as it
      solves as it would require a backward-incompatible change to the
      namespace metadata interpretation.  Instead of that dubious route [1],
      address the root problem in the memory-hotplug implementation.
      
      Note that EEXIST is no longer treated as success as that is how
      sparse_add_section() reports subsection collisions, it was also obviated
      by recent changes to perform the request_region() for 'System RAM'
      before arch_add_memory() in the add_memory() sequence.
      
      [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      
      [osalvador@suse.de: fix deactivate_section for early sections]
        Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
      Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba72b4c8
    • Dan Williams's avatar
      mm: kill is_dev_zone() helper · 46d945ae
      Dan Williams authored
      Given there are no more usages of is_dev_zone() outside of 'ifdef
      CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.
      
      Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46d945ae
    • Dan Williams's avatar
      mm/sparsemem: add helpers track active portions of a section at boot · f46edbd1
      Dan Williams authored
      Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
      sub-section active bitmask, each bit representing a PMD_SIZE span of the
      architecture's memory hotplug section size.
      
      The implications of a partially populated section is that pfn_valid()
      needs to go beyond a valid_section() check and either determine that the
      section is an "early section", or read the sub-section active ranges
      from the bitmask.  The expectation is that the bitmask (subsection_map)
      fits in the same cacheline as the valid_section() / early_section()
      data, so the incremental performance overhead to pfn_valid() should be
      negligible.
      
      The rationale for using early_section() to short-ciruit the
      subsection_map check is that there are legacy code paths that use
      pfn_valid() at section granularity before validating the pfn against
      pgdat data.  So, the early_section() check allows those traditional
      assumptions to persist while also permitting subsection_map to tell the
      truth for purposes of populating the unused portions of early sections
      with PMEM and other ZONE_DEVICE mappings.
      
      Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarQian Cai <cai@lca.pw>
      Tested-by: default avatarJane Chu <jane.chu@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f46edbd1
    • Dan Williams's avatar
      mm/sparsemem: introduce struct mem_section_usage · f1eca35a
      Dan Williams authored
      Patch series "mm: Sub-section memory hotplug support", v10.
      
      The memory hotplug section is an arbitrary / convenient unit for memory
      hotplug.  'Section-size' units have bled into the user interface
      ('memblock' sysfs) and can not be changed without breaking existing
      userspace.  The section-size constraint, while mostly benign for typical
      memory hotplug, has and continues to wreak havoc with 'device-memory'
      use cases, persistent memory (pmem) in particular.  Recall that pmem
      uses devm_memremap_pages(), and subsequently arch_add_memory(), to
      allocate a 'struct page' memmap for pmem.  However, it does not use the
      'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
      never exposes the userspace memblock interface for pmem.  This leaves an
      opening to redress the section-size constraint.
      
      To date, the libnvdimm subsystem has attempted to inject padding to
      satisfy the internal constraints of arch_add_memory().  Beyond
      complicating the code, leading to bugs [2], wasting memory, and limiting
      configuration flexibility, the padding hack is broken when the platform
      changes this physical memory alignment of pmem from one boot to the
      next.  Device failure (intermittent or permanent) and physical
      reconfiguration are events that can cause the platform firmware to
      change the physical placement of pmem on a subsequent boot, and device
      failure is an everyday event in a data-center.
      
      It turns out that sections are only a hard requirement of the
      user-facing interface for memory hotplug and with a bit more
      infrastructure sub-section arch_add_memory() support can be added for
      kernel internal usages like devm_memremap_pages().  Here is an analysis
      of the current design assumptions in the current code and how they are
      addressed in the new implementation:
      
      Current design assumptions:
      
       - Sections that describe boot memory (early sections) are never
         unplugged / removed.
      
       - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
         valid_section() check
      
       - __add_pages() and helper routines assume all operations occur in
         PAGES_PER_SECTION units.
      
       - The memblock sysfs interface only comprehends full sections
      
      New design assumptions:
      
       - Sections are instrumented with a sub-section bitmask to track (on
         x86) individual 2MB sub-divisions of a 128MB section.
      
       - Partially populated early sections can be extended with additional
         sub-sections, and those sub-sections can be removed with
         arch_remove_memory(). With this in place we no longer lose usable
         memory capacity to padding.
      
       - pfn_valid() is updated to look deeper than valid_section() to also
         check the active-sub-section mask. This indication is in the same
         cacheline as the valid_section() so the performance impact is
         expected to be negligible. So far the lkp robot has not reported any
         regressions.
      
       - Outside of the core vmemmap population routines which are replaced,
         other helper routines like shrink_{zone,pgdat}_span() are updated to
         handle the smaller granularity. Core memory hotplug routines that
         deal with online memory are not touched.
      
       - The existing memblock sysfs user api guarantees / assumptions are not
         touched since this capability is limited to !online
         !memblock-sysfs-accessible sections.
      
      Meanwhile the issue reports continue to roll in from users that do not
      understand when and how the 128MB constraint will bite them.  The current
      implementation relied on being able to support at least one misaligned
      namespace, but that immediately falls over on any moderately complex
      namespace creation attempt.  Beyond the initial problem of 'System RAM'
      colliding with pmem, and the unsolvable problem of physical alignment
      changes, Linux is now being exposed to platforms that collide pmem ranges
      with other pmem ranges by default [3].  In short, devm_memremap_pages()
      has pushed the venerable section-size constraint past the breaking point,
      and the simplicity of section-aligned arch_add_memory() is no longer
      tenable.
      
      These patches are exposed to the kbuild robot on a subsection-v10 branch
      [4], and a preview of the unit test for this functionality is available
      on the 'subsection-pending' branch of ndctl [5].
      
      [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      [3]: https://github.com/pmem/ndctl/issues/76
      [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
      [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
      
      This patch (of 13):
      
      Towards enabling memory hotplug to track partial population of a section,
      introduce 'struct mem_section_usage'.
      
      A pointer to a 'struct mem_section_usage' instance replaces the existing
      pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
      'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
      a new 'subsection_map' bitmap.  The new bitmap enables the memory
      hot{plug,remove} implementation to act on incremental sub-divisions of a
      section.
      
      SUBSECTION_SHIFT is defined as global constant instead of per-architecture
      value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
      subsection users.  Specifically a common subsection size allows for the
      possibility that persistent memory namespace configurations be made
      compatible across architectures.
      
      The primary motivation for this functionality is to support platforms that
      mix "System RAM" and "Persistent Memory" within a single section, or
      multiple PMEM ranges with different mapping lifetimes within a single
      section.  The section restriction for hotplug has caused an ongoing saga
      of hacks and bugs for devm_memremap_pages() users.
      
      Beyond the fixups to teach existing paths how to retrieve the 'usemap'
      from a section, and updates to usemap allocation path, there are no
      expected behavior changes.
      
      Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1eca35a
  14. 17 Jul, 2019 1 commit
  15. 12 Jul, 2019 7 commits
    • Alexander Potapenko's avatar
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko authored
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out m...
      6471384a
    • Nicholas Piggin's avatar
      mm/large system hash: clear hashdist when only one node with memory is booted · e03a5125
      Nicholas Piggin authored
      CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
      when booting on single node machines.  This causes the large system hashes
      to be allocated with vmalloc, and mapped with small pages.
      
      This change clears hashdist if only one node has come up with memory.
      
      This results in the important large inode and dentry hashes using memblock
      allocations.  All others are within 4MB size up to about 128GB of RAM,
      which allows them to be allocated from the linear map on most non-NUMA
      images.
      
      Other big hashes like futex and TCP should eventually be moved over to the
      same style of allocation as those vfs caches that use HASH_EARLY if
      !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.
      
      This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
      ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
      (performance is in the noise, under 1% difference, page tables are likely
      to be well cached for this workload).
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
      
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e03a5125
    • Nicholas Piggin's avatar
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist · ec11408a
      Nicholas Piggin authored
      The kernel currently clamps large system hashes to MAX_ORDER when hashdist
      is not set, which is rather arbitrary.
      
      vmalloc space is limited on 32-bit machines, but this shouldn't result in
      much more used because of small physical memory limiting system hash
      sizes.
      
      Include "vmalloc" or "linear" in the kernel log message.
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com
      
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec11408a
    • Vlastimil Babka's avatar
      mm, debug_pagealloc: use a page type instead of page_ext flag · 3972f6bb
      Vlastimil Babka authored
      When debug_pagealloc is enabled, we currently allocate the page_ext
      array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
      we have the page_type field in struct page, we can use that instead, as
      guard pages are neither PageSlab nor mapped to userspace.  This reduces
      memory overhead when debug_pagealloc is enabled and there are no other
      features requiring the page_ext array.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3972f6bb
    • Vlastimil Babka's avatar
      mm, page_alloc: more extensive free page checking with debug_pagealloc · 4462b32c
      Vlastimil Babka authored
      The page allocator checks struct pages for expected state (mapcount,
      flags etc) as pages are being allocated (check_new_page()) and freed
      (free_pages_check()) to provide some defense against errors in page
      allocator users.
      
      Prior commits 479f854a ("mm, page_alloc: defer debugging checks of
      pages allocated from the PCP") and 4db7548c ("mm, page_alloc: defer
      debugging checks of freed pages until a PCP drain") this has happened
      for order-0 pages as they were allocated from or freed to the per-cpu
      caches (pcplists).  Since those are fast paths, the checks are now
      performed only when pages are moved between pcplists and global free
      lists.  This however lowers the chances of catching errors soon enough.
      
      In order to increase the chances of the checks to catch errors, the
      kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
      multiple other internal debug checks (VM_BUG_ON() etc), which is
      suboptimal when the goal is to catch errors in mm users, not in mm code
      itself.
      
      To catch some wrong users of the page allocator we have
      CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
      unless enabled at boot time.  Memory corruptions when writing to freed
      pages have often the same underlying errors (use-after-free, double free)
      as corrupting the corresponding struct pages, so this existing debugging
      functionality is a good fit to extend by also perform struct page checks
      at least as often as if CONFIG_DEBUG_VM was enabled.
      
      Specifically, after this patch, when debug_pagealloc is enabled on boot,
      and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
      freed to the pcplists *in addition* to being moved between pcplists and
      free lists.  When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
      pages are checked when being moved between pcplists and free lists *in
      addition* to when allocated from or freed to the pcplists.
      
      When debug_pagealloc is not enabled on boot, the overhead in fast paths
      should be virtually none thanks to the use of static key.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4462b32c
    • Vlastimil Babka's avatar
      mm, debug_pagelloc: use static keys to enable debugging · 96a2b03f
      Vlastimil Babka authored
      Patch series "debug_pagealloc improvements".
      
      I have been recently debugging some pcplist corruptions, where it would be
      useful to perform struct page checks immediately as pages are allocated
      from and freed to pcplists, which is now only possible by rebuilding the
      kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
      
      To make this kind of debugging simpler in future on a distro kernel, I
      have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
      when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
      and extended it to perform the struct page checks more often when enabled
      (Patch 2).  Now it can be configured in when building a distro kernel
      without extra overhead, and debugging page use after free or double free
      can be enabled simply by rebooting with debug_pagealloc=on.
      
      This patch (of 3):
      
      CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
      allow being always enabled in a distro kernel, but only perform its
      expensive functionality when booted with debug_pagelloc=on.  We can
      further reduce the overhead when not boot-enabled (including page
      allocator fast paths) using static keys.  This patch introduces one for
      debug_pagealloc core functionality, and another for the optional guard
      page functionality (enabled by booting with debug_guardpage_minorder=X).
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96a2b03f
    • Denis Efremov's avatar
      mm: remove the exporting of totalram_pages · 98ef2046
      Denis Efremov authored
      Previously totalram_pages was the global variable.  Currently,
      totalram_pages is the static inline function from the include/linux/mm.h
      However, the function is also marked as EXPORT_SYMBOL, which is at best an
      odd combination.  Because there is no point for the static inline function
      from a public header to be exported, this commit removes the
      EXPORT_SYMBOL() marking.  It will be still possible to use the function in
      modules because all the symbols it depends on are exported.
      
      Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
      Fixes: ca79b0c2
      
       ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
      Signed-off-by: default avatarDenis Efremov <efremov@linux.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98ef2046
  16. 05 Jul, 2019 1 commit