1. 29 Jul, 2020 6 commits
  2. 22 Jul, 2020 1 commit
  3. 09 Jul, 2020 6 commits
    • mm/cma.c: use exact_nid true to fix possible per-numa cma leak · 84524e70
      Barry Song authored
      commit 40366bd7 upstream.
      
      Calling cma_declare_contiguous_nid() with false exact_nid for per-numa
      reservation can easily cause cma leak and various confusion.  For example,
      mm/hugetlb.c is trying to reserve per-numa cma for gigantic pages.  But it
      can easily leak cma and make users confused when system has memoryless
      nodes.
      
      In case the system has 4 numa nodes and only numa node0 has memory, if
      we set hugetlb_cma=4G in bootargs, mm/hugetlb.c will get 4 cma areas for
      the 4 different numa nodes.  Since exact_nid=false in the current code,
      all 4 numa nodes will get cma successfully from node0, but
      hugetlb_cma[1] to hugetlb_cma[3] will never be available to hugepage as
      mm/hugetlb.c will only allocate memory from hugetlb_cma[0].
      
      In case the system has 4 numa nodes where only numa node0 and node2
      have memory and the other nodes have none, if we set hugetlb_cma=4G in
      bootargs, mm/hugetlb.c will get 4 cma areas for the 4 different numa
      nodes.  Since exact_nid=false in the current code, all 4 numa nodes will
      get cma successfully from node0 or node2, but hugetlb_cma[1] and [3]
      will never be available to hugepage as mm/hugetlb.c will only allocate
      memory from hugetlb_cma[0] and hugetlb_cma[2].  This causes a permanent
      leak of the cma areas which are supposed to be used by the memoryless
      nodes.
      
      Of course we can work around the issue by letting mm/hugetlb.c scan all
      cma areas in alloc_gigantic_page() even when node_mask includes node0
      only.  That means when node_mask includes node0 only, we can get a page
      from hugetlb_cma[1] to hugetlb_cma[3].  But this will cause a kernel
      crash in free_gigantic_page() when it tries to free the page via:
      cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order)
      
      On the other hand, exact_nid=false doesn't consider numa distance, so it
      might not be that useful to leverage cma areas on remote nodes.  I feel it is
      much simpler to make exact_nid true to make everything clear.  After that,
      memoryless nodes won't be able to reserve per-numa CMA from other nodes
      which have memory.
      
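      A minimal sketch of the intended behaviour (illustrative only: the
      reserve_per_node_cma() helper and per_node_size below are hypothetical
      stand-ins for the real cma_declare_contiguous_nid()/memblock path):

        int nid;

        for_each_online_node(nid) {
                /*
                 * With exact_nid == true the reservation must come from
                 * 'nid' itself; a memoryless node simply gets no CMA area
                 * instead of silently consuming node0's memory.
                 */
                if (reserve_per_node_cma(nid, per_node_size, /* exact_nid */ true))
                        pr_warn("hugetlb_cma: reservation failed for node %d\n", nid);
        }
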
      Fixes: cf11e85f ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Aslan Bakirov <aslan@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andreas Schaufler <andreas.schaufler@gmx.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200628074345.27228-1-song.bao.hua@hisilicon.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      84524e70
    • mm/hugetlb.c: fix pages per hugetlb calculation · faf8c29a
      Mike Kravetz authored
      commit 1139d336 upstream.
      
      The routine hpage_nr_pages() was incorrectly used to calculate the number
      of base pages in a hugetlb page.  hpage_nr_pages is designed to be called
      for THP pages and will return HPAGE_PMD_NR for hugetlb pages of any size.
      
      Due to the context in which hpage_nr_pages was called, it is unlikely to
      produce a user-visible error.  The routine with the incorrect call is only
      exercised in the case of hugetlb memory error or migration.  In addition,
      this would need to be on an architecture which supports huge page sizes
      less than PMD_SIZE.  And the vma containing the huge page would also need
      to be smaller than PMD_SIZE.
      
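      For reference, the distinction boils down to this (a sketch using the
      usual hugetlb helpers, not the exact hunk from the patch; 'hpage' is a
      hugetlb head page):

        struct hstate *h = page_hstate(hpage);
        unsigned long nr_base_pages = pages_per_huge_page(h);  /* any hugetlb size */

        /*
         * hpage_nr_pages(hpage) would be wrong here: it is a THP helper and
         * returns HPAGE_PMD_NR regardless of the actual hugetlb page size.
         */
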
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200629185003.97202-1-mike.kravetz@oracle.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      faf8c29a
    • mm, dump_page(): do not crash with invalid mapping pointer · a76d4f62
      Vlastimil Babka authored
      [ Upstream commit 002ae705 ]
      
      We have seen the following problem on an RPi4 with 1G RAM:
      
          BUG: Bad page state in process systemd-hwdb  pfn:35601
          page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
          Unable to handle kernel paging request at virtual address efd8fe765bc80080
          Mem abort info:
            ESR = 0x96000004
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000004
            CM = 0, WnR = 0
          [efd8fe765bc80080] address between user and kernel address ranges
          Internal error: Oops: 96000004 [#1] SMP
          Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
          Supported: No, Unreleased kernel
          CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
          Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
          pstate: 40000085 (nZcv daIf -PAN -UAO)
          pc : __dump_page+0x268/0x368
          lr : __dump_page+0xc4/0x368
          sp : ffff000012563860
          x29: ffff000012563860 x28: ffff80003ddc4300
          x27: 0000000000000010 x26: 000000000000003f
          x25: ffff7e0000d58040 x24: 000000000000000f
          x23: efd8fe765bc80080 x22: 0000000000020095
          x21: efd8fe765bc80080 x20: ffff000010ede8b0
          x19: ffff7e0000d58040 x18: ffffffffffffffff
          x17: 0000000000000001 x16: 0000000000000007
          x15: ffff000011689708 x14: 3030386362353637
          x13: 6566386466653a67 x12: 6e697070616d2031
          x11: 32323133313a746e x10: 756f6370616d2035
          x9 : ffff00001168a840 x8 : ffff00001077a670
          x7 : 000000000000013d x6 : ffff0000118a43b5
          x5 : 0000000000000001 x4 : ffff80003dd9e2c8
          x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
          x1 : dead000000000100 x0 : efd8fe765bc80080
          Call trace:
           __dump_page+0x268/0x368
           bad_page+0xd4/0x168
           check_new_page_bad+0x80/0xb8
           rmqueue_bulk.constprop.26+0x4d8/0x788
           get_page_from_freelist+0x4d4/0x1228
           __alloc_pages_nodemask+0x134/0xe48
           alloc_pages_vma+0x198/0x1c0
           do_anonymous_page+0x1a4/0x4d8
           __handle_mm_fault+0x4e8/0x560
           handle_mm_fault+0x104/0x1e0
           do_page_fault+0x1e8/0x4c0
           do_translation_fault+0xb0/0xc0
           do_mem_abort+0x50/0xb0
           el0_da+0x24/0x28
          Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)
      
      Besides the underlying issue with page->mapping containing a bogus value
      for some reason, we can see that __dump_page() crashed by trying to read
      the pointer at mapping->host, turning a recoverable warning into full
      Oops.
      
      It can be expected that when a page is reported as being in a bad state
      for some reason, the pointers there should not be trusted blindly.
      
      So this patch treats all data in __dump_page() that depends on
      page->mapping as lava, using probe_kernel_read_strict().  Ideally this
      would include the dentry->d_parent recursively, but that would mean
      changing the printk handler for %pd.  Chances of reaching the dentry
      printing part with an initially bogus mapping pointer should be rather
      low, though.
      
      Also prefix the printing of mapping->a_ops with a description of what is
      being printed.  In case the value is bogus, %ps will print the raw value
      instead of the symbol name, and then it is not obvious at all that it is
      printing a_ops.
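
      A rough sketch of the pattern (illustrative only; the real __dump_page()
      change covers more fields, and probe_kernel_read_strict() is assumed to
      have the same calling convention as probe_kernel_read()):

        struct address_space *mapping = page->mapping;
        struct inode *host;

        /* Copy the pointer out safely instead of dereferencing 'mapping'. */
        if (probe_kernel_read_strict(&host, &mapping->host, sizeof(host)))
                pr_warn("page->mapping looks bogus, not following it\n");
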
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a76d4f62
    • mm/slub: fix stack overruns with SLUB_STATS · 01a4f375
      Qian Cai authored
      [ Upstream commit a68ee057 ]
      
      There is no need to copy SLUB_STATS items from the root memcg cache to new
      memcg cache copies.  Doing so could result in stack overruns because the
      store function only accepts 0 to clear the stat and returns an error for
      everything else, while the show method would print out the whole stat.

      Then, the mismatch between the lengths returned from the show and store
      methods happens in memcg_propagate_slab_attrs():
      
      	else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
      		buf = mbuf;
      
      max_attr_size is only 2 from slab_attr_store(); it then uses mbuf[64]
      in show_stat() later, where a bunch of sprintf() calls would overrun the
      stack variable.  Fix it by always allocating a page-sized buffer to be
      used in show_stat() if SLUB_STATS=y, which should only be used for
      debugging purposes.
      
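      The shape of the fix (a sketch based on the surrounding code in
      memcg_propagate_slab_attrs(), not necessarily the verbatim hunk) is to
      never reuse the small on-stack buffer when SLUB_STATS is enabled:

        if (buffer) {
                buf = buffer;
        } else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
                   !IS_ENABLED(CONFIG_SLUB_STATS)) {
                buf = mbuf;     /* the stack buffer is only safe without SLUB_STATS */
        } else {
                buffer = (char *)get_zeroed_page(GFP_KERNEL);
                if (WARN_ON(!buffer))
                        continue;
                buf = buffer;
        }
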
        # echo 1 > /sys/kernel/slab/fs_cache/shrink
        BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
        Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251
      
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
        Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
        Call Trace:
          number+0x421/0x6e0
          vsnprintf+0x451/0x8e0
          sprintf+0x9e/0xd0
          show_stat+0x124/0x1d0
          alloc_slowpath_show+0x13/0x20
          __kmem_cache_create+0x47a/0x6b0
      
        addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
         process_one_work+0x0/0xb90
      
        this frame has 1 object:
         [32, 72) 'lockdep_map'
      
        Memory state around the buggy address:
         ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
                                                               ^
         ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
         ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
        Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
        Call Trace:
          __kmem_cache_create+0x6ac/0x6b0
      
      Fixes: 107dab5c ("slub: slub-specific propagation changes")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Glauber Costa <glauber@scylladb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      01a4f375
    • mm/slub.c: fix corrupted freechain in deactivate_slab() · 91b602b4
      Dongli Zhang authored
      [ Upstream commit 52f23478 ]
      
      slub_debug is able to fix a corrupted slab freelist/page.  However,
      alloc_debug_processing() only checks the validity of the current and
      next freepointers during the allocation path.  As a result, once some
      objects have their freepointers corrupted, deactivate_slab() may lead
      to a page fault.

      Below is from a test kernel module run with 'slub_debug=PUF,kmalloc-128
      slub_nomerge'.  The test module corrupts the freepointer of one free
      object on purpose.  Unfortunately, deactivate_slab() does not detect it
      when iterating the freechain.
      
        BUG: unable to handle page fault for address: 00000000123456f8
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        ... ...
        RIP: 0010:deactivate_slab.isra.92+0xed/0x490
        ... ...
        Call Trace:
         ___slab_alloc+0x536/0x570
         __slab_alloc+0x17/0x30
         __kmalloc+0x1d9/0x200
         ext4_htree_store_dirent+0x30/0xf0
         htree_dirblock_to_tree+0xcb/0x1c0
         ext4_htree_fill_tree+0x1bc/0x2d0
         ext4_readdir+0x54f/0x920
         iterate_dir+0x88/0x190
         __x64_sys_getdents+0xa6/0x140
         do_syscall_64+0x49/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Therefore, this patch adds an extra consistency check in
      deactivate_slab().  Once an object's freepointer is found to be
      corrupted, all following objects starting at that object are isolated.
      
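      Conceptually, the added check looks like this (a sketch; the helper
      naming and exact wiring into deactivate_slab() may differ from the final
      patch):

        /*
         * While walking the freechain, verify that the next free pointer
         * still points inside the slab page.  If it does not, truncate the
         * chain here so the corrupted tail is isolated.
         */
        if (!check_valid_pointer(s, page, nextfree)) {
                object_err(s, page, freelist, "Freechain corrupt");
                freelist = NULL;
        }
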
      [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Jin <joe.jin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      91b602b4
    • mm: fix swap cache node allocation mask · 972b36cc
      Hugh Dickins authored
      [ Upstream commit 243bce09 ]
      
      Chris Murphy reports that a slightly overcommitted load, testing swap
      and zram along with i915, splats and keeps on splatting, when it had
      better fail less noisily:
      
        gnome-shell: page allocation failure: order:0,
        mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
        nodemask=(null),cpuset=/,mems_allowed=0
        CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
        Call Trace:
          dump_stack+0x64/0x88
          warn_alloc.cold+0x75/0xd9
          __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
          __alloc_pages_nodemask+0x2df/0x320
          alloc_slab_page+0x195/0x310
          allocate_slab+0x3c5/0x440
          ___slab_alloc+0x40c/0x5f0
          __slab_alloc+0x1c/0x30
          kmem_cache_alloc+0x20e/0x220
          xas_nomem+0x28/0x70
          add_to_swap_cache+0x321/0x400
          __read_swap_cache_async+0x105/0x240
          swap_cluster_readahead+0x22c/0x2e0
          shmem_swapin+0x8e/0xc0
          shmem_swapin_page+0x196/0x740
          shmem_getpage_gfp+0x3a2/0xa60
          shmem_read_mapping_page_gfp+0x32/0x60
          shmem_get_pages+0x155/0x5e0 [i915]
          __i915_gem_object_get_pages+0x68/0xa0 [i915]
          i915_vma_pin+0x3fe/0x6c0 [i915]
          eb_add_vma+0x10b/0x2c0 [i915]
          i915_gem_do_execbuffer+0x704/0x3430 [i915]
          i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
          drm_ioctl_kernel+0x86/0xd0 [drm]
          drm_ioctl+0x206/0x390 [drm]
          ksys_ioctl+0x82/0xc0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x5b/0xf0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported on 5.7, but it really goes back to 3.1: when
      shmem_read_mapping_page_gfp() was implemented for use by i915, and
      allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
      missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
      __read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
      from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.
      
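      The essence of the fix is a one-line change of the mask used for the
      swap cache insertion in __read_swap_cache_async() (sketch; local
      variable names assumed):

        /* was 'gfp_mask & GFP_KERNEL', which dropped __GFP_NORETRY|__GFP_NOWARN */
        err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_RECLAIM_MASK);
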
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
      Fixes: 68da9f05 ("tmpfs: pass gfp to shmem_getpage_gfp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Analyzed-by: Vlastimil Babka <vbabka@suse.cz>
      Analyzed-by: Matthew Wilcox <willy@infradead.org>
      Tested-by: Chris Murphy <lists@colorremedies.com>
      Cc: <stable@vger.kernel.org>	[3.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      972b36cc
  4. 30 Jun, 2020 6 commits
    • mm/memory_hotplug.c: fix false softlockup during pfn range removal · 566ea141
      Ben Widawsky authored
      commit b7e3debd upstream.
      
      When working with very large nodes, poisoning the struct pages (of which
      there will be very many) can take a very long time.  If the system is
      using voluntary preemption, the software watchdog will not be able to
      detect forward progress.  This patch addresses the issue by offering to
      give up time like __remove_pages() does.  This behavior was introduced in
      v5.6 with commit d33695b1 ("mm/memory_hotplug: poison memmap in
      remove_pfn_range_from_zone()").
      
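      A sketch of the idea (the chunk size and the poisoning call are shown
      for illustration; the real loop in remove_pfn_range_from_zone() may
      differ in detail):

        unsigned long cursor;

        for (cursor = start_pfn; cursor < end_pfn; cursor += MAX_ORDER_NR_PAGES) {
                cond_resched();         /* give up time, like __remove_pages() does */
                page_init_poison(pfn_to_page(cursor),
                                 min(end_pfn - cursor,
                                     (unsigned long)MAX_ORDER_NR_PAGES) *
                                        sizeof(struct page));
        }
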
      Alternately, init_page_poison could do this cond_resched(), but it seems
      to me that the caller of init_page_poison() is what actually knows whether
      or not it should relax its own priority.
      
      Based on Dan's notes, I think this is perfectly safe: commit f931ab47
      ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
      
      Aside from fixing the lockup, it is also a friendlier thing to do on lower
      core systems that might wipe out large chunks of hotplug memory (probably
      not a very common case).
      
      Fixes this kind of splat:
      
        watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
        irq event stamp: 138450
        hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
        hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
        softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
        CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
        Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
        RIP: 0010:memset_erms+0x9/0x10
        Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
        Call Trace:
         remove_pfn_range_from_zone+0x3a/0x380
         memunmap_pages+0x17f/0x280
         release_nodes+0x22a/0x260
         __device_release_driver+0x172/0x220
         device_driver_detach+0x3e/0xa0
         unbind_store+0x113/0x130
         kernfs_fop_write+0xdc/0x1c0
         vfs_write+0xde/0x1d0
         ksys_write+0x58/0xd0
         do_syscall_64+0x5a/0x120
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
        Built 2 zonelists, mobility grouping on.  Total pages: 49050381
        Policy zone: Normal
        Built 3 zonelists, mobility grouping on.  Total pages: 49312525
        Policy zone: Normal
      
      David said: "It really only is an issue for devmem.  Ordinary
      hotplugged system memory is not affected (onlined/offlined in memory
      block granularity)."
      
      Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
      Fixes: d33695b1 ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
      Reported-by: Ben Widawsky <ben.widawsky@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      566ea141
    • mm/memcontrol.c: add missed css_put() · bdfba338
      Muchun Song authored
      commit 3a98990a upstream.
      
      We should put the css reference when the memory allocation fails.
      
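      The pattern is the usual error-path reference drop, roughly the shape of
      the allocation in memcg_schedule_kmem_cache_create() (sketch):

        cw = kzalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
        if (!cw) {
                css_put(&memcg->css);   /* drop the reference taken just above */
                return;
        }
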
      Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
      Fixes: f0a3a24b ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bdfba338
    • mm: memcontrol: handle div0 crash race condition in memory.low · 2652d59d
      Johannes Weiner authored
      commit cd324edc upstream.
      
      Tejun reports seeing rare div0 crashes in memory.low stress testing:
      
        RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
        Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
        RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
        RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
        RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
        RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
        R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
        R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
        FS:  0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
        Call Trace:
          shrink_node+0x1e5/0x6c0
          balance_pgdat+0x32d/0x5f0
          kswapd+0x1d7/0x3d0
          kthread+0x11c/0x160
          ret_from_fork+0x1f/0x30
      
      This happens when parent_usage == siblings_protected.
      
      We check that usage is bigger than protected, which should imply
      parent_usage being bigger than siblings_protected.  However, we don't
      read (or even update) these values atomically, and they can be out of
      sync as the memory state changes under us.  A bit of fluctuation around
      the target protection isn't a big deal, but we need to handle the div0
      case.
      
      Check the parent state explicitly to make sure we have a reasonable
      positive value for the divisor.
      
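      Schematically, the calculation in effective_protection() only divides
      when the divisor is known to be positive (a sketch of the intent, not
      necessarily the exact hunk):

        if (parent_effective > siblings_protected &&
            parent_usage > siblings_protected &&        /* keep the divisor positive */
            usage > protected) {
                unsigned long unclaimed;

                unclaimed = parent_effective - siblings_protected;
                unclaimed *= usage - protected;
                unclaimed /= parent_usage - siblings_protected;

                ep += unclaimed;
        }
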
      Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
      Fixes: 8a931f80 ("mm: memcontrol: recursive memory.low protection")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2652d59d
    • mm/slab: use memzero_explicit() in kzfree() · b60dbf12
      Waiman Long authored
      commit 8982ae52 upstream.
      
      The kzfree() function is normally used to clear some sensitive
      information, like encryption keys, in the buffer before freeing it back to
      the pool.  memset() is currently used for the buffer clearing.  However
      unlikely, there is still a non-zero probability that the compiler may
      choose to optimize away the memory clearing, especially if LTO is used in
      the future.

      To make sure that this optimization will never happen, memzero_explicit(),
      which was introduced in v3.18, is now used in kzfree() to future-proof it.
      
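      The resulting kzfree() is essentially (a sketch consistent with the
      description above):

        void kzfree(const void *p)
        {
                size_t ks;
                void *mem = (void *)p;

                if (unlikely(ZERO_OR_NULL_PTR(mem)))
                        return;
                ks = ksize(mem);
                memzero_explicit(mem, ks);  /* cannot be elided, unlike memset() */
                kfree(mem);
        }
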
      Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
      Fixes: 3ef0e5ba ("slab: introduce kzfree()")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b60dbf12
    • mm, slab: fix sign conversion problem in memcg_uncharge_slab() · 768e7c02
      Waiman Long authored
      commit d7670879 upstream.
      
      It was found that running the LTP test on a PowerPC system could produce
      erroneous values in /proc/meminfo, like:
      
        MemTotal:       531915072 kB
        MemFree:        507962176 kB
        MemAvailable:   1100020596352 kB
      
      Using bisection, the problem is tracked down to commit 9c315e4d ("mm:
      memcg/slab: cache page number in memcg_(un)charge_slab()").
      
      In memcg_uncharge_slab() with an "int order" argument:
      
        unsigned int nr_pages = 1 << order;
          :
        mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
      
      The mod_lruvec_state() function will eventually call the
      __mod_zone_page_state() which accepts a long argument.  Depending on the
      compiler and how inlining is done, "-nr_pages" may be treated as a
      negative number or a very large positive number.  Apparently, it was
      treated as a large positive number on that PowerPC system, leading to
      incorrect stat counts.  This problem hasn't been seen on x86-64 yet;
      perhaps the gcc compiler there behaves slightly differently.
      
      It is fixed by making nr_pages a signed value.  For consistency, a similar
      change is applied to memcg_charge_slab() as well.
      
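      The underlying C pitfall is easy to reproduce in a standalone program
      (illustration only, not kernel code): negating an unsigned int and then
      widening it to a 64-bit long yields a huge positive value, never -4.

        #include <stdio.h>

        int main(void)
        {
                unsigned int nr_pages = 1U << 2;  /* order = 2 */
                long delta = -nr_pages;           /* unsigned negation, then widening */

                printf("%ld\n", delta);           /* prints 4294967292 on 64-bit */
                return 0;
        }

      With "int nr_pages", -nr_pages is an ordinary negative int and the
      conversion to long preserves the sign.
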
      Link: http://lkml.kernel.org/r/20200620184719.10994-1-longman@redhat.com
      Fixes: 9c315e4d ("mm: memcg/slab: cache page number in memcg_(un)charge_slab()")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      768e7c02
    • mm, compaction: make capture control handling safe wrt interrupts · 196a44d3
      Vlastimil Babka authored
      commit b9e20f0d upstream.
      
      Hugh reports:
      
       "While stressing compaction, one run oopsed on NULL capc->cc in
        __free_one_page()'s task_capc(zone): compact_zone_order() had been
        interrupted, and a page was being freed in the return from interrupt.
      
        Though you would not expect it from the source, both gccs I was using
        (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
        ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
        callq compact_zone - long after the "current->capture_control =
        &capc". An interrupt in between those finds capc->cc NULL (zeroed by
        an earlier rep stos).
      
        This could presumably be fixed by a barrier() before setting
        current->capture_control in compact_zone_order(); but would also need
        more care on return from compact_zone(), in order not to risk leaking
        a page captured by interrupt just before capture_control is reset.
      
        Maybe that is the preferable fix, but I felt safer for task_capc() to
        exclude the rather surprising possibility of capture at interrupt
        time"
      
      I have checked that gcc10 also behaves the same.
      
      The advantage of the fix in compact_zone_order() is that we don't add
      another test to the page freeing hot path, and that it might prevent
      future problems if we stop exposing pointers to uninitialized structures
      in the current task.
      
      So this patch implements the suggestion for compact_zone_order() with
      barrier() (and WRITE_ONCE() to prevent store tearing) for setting
      current->capture_control, and prevents page leaking with
      WRITE_ONCE/READ_ONCE in the proper order.
      
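      The resulting pattern in compact_zone_order() looks roughly like this
      (a sketch of the ordering, not the verbatim hunk):

        /*
         * Make capc fully initialised before it becomes visible to
         * task_capc() running from an interrupt on this CPU.
         */
        barrier();
        WRITE_ONCE(current->capture_control, &capc);

        ret = compact_zone(&cc, &capc);

        /*
         * Hide the control structure before reading the result, so a page
         * captured by an interrupt after this point cannot be leaked.
         */
        WRITE_ONCE(current->capture_control, NULL);
        *capture = READ_ONCE(capc.page);
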
      Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
      Fixes: 5e1f0f09 ("mm, compaction: capture a page under direct compaction")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Li Wang <liwang@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[5.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      196a44d3
  5. 22 Jun, 2020 4 commits
  6. 17 Jun, 2020 5 commits
    • mm/slub: fix a memory leak in sysfs_slab_add() · 5e028cc3
      Wang Hai authored
      commit dde3c6b7 upstream.
      
      syzkaller reports a memory leak when kobject_init_and_add() returns an
      error in sysfs_slab_add() [1].

      When this happens, kobject_put() is not called for the corresponding
      kobject, which potentially leads to a memory leak.

      This patch fixes the issue by calling kobject_put() even if
      kobject_init_and_add() fails.
      
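      This follows the documented kobject rule: once kobject_init_and_add()
      has been called, the error path must use kobject_put() rather than a
      plain kfree(), so the name allocated for the kobject is released.  A
      sketch of the pattern:

        err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
        if (err) {
                kobject_put(&s->kobj);  /* frees the kobject name on failure */
                goto out;
        }
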
      [1]
        BUG: memory leak
        unreferenced object 0xffff8880a6d4be88 (size 8):
        comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s)
        hex dump (first 8 bytes):
          70 69 64 5f 33 00 ff ff                          pid_3...
        backtrace:
           kstrdup+0x35/0x70 mm/util.c:60
           kstrdup_const+0x3d/0x50 mm/util.c:82
           kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
           kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289
           kobject_add_varg lib/kobject.c:384 [inline]
           kobject_init_and_add+0xd8/0x170 lib/kobject.c:473
           sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811
           __kmem_cache_create+0x50a/0x570 mm/slub.c:4384
           create_cache+0x113/0x1e0 mm/slab_common.c:407
           kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505
           kmem_cache_create+0xd/0x10 mm/slab_common.c:564
           create_pid_cachep kernel/pid_namespace.c:54 [inline]
           create_pid_namespace kernel/pid_namespace.c:96 [inline]
           copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148
           create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95
           unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229
           ksys_unshare+0x3d2/0x770 kernel/fork.c:2969
           __do_sys_unshare kernel/fork.c:3037 [inline]
           __se_sys_unshare kernel/fork.c:3035 [inline]
           __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035
           do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295
      
      Fixes: 80da026a ("mm/slub: fix slab double-free in case of duplicate sysfs filename")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Hai <wanghai38@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e028cc3
    • gup: document and work around "COW can break either way" issue · 8e45fdaf
      Linus Torvalds authored
      commit 17839856 upstream.
      
      Doing a "get_user_pages()" on a copy-on-write page for reading can be
      ambiguous: the page can be COW'ed at any time afterwards, and the
      direction of a COW event isn't defined.
      
      Yes, whoever writes to it will generally do the COW, but if the thread
      that did the get_user_pages() unmapped the page before the write (and
      that could happen due to memory pressure in addition to any outright
      action), the writer could also just take over the old page instead.
      
      End result: the get_user_pages() call might result in a page pointer
      that is no longer associated with the original VM, and is associated
      with - and controlled by - another VM having taken it over instead.
      
      So when doing a get_user_pages() on a COW mapping, the only really safe
      thing to do would be to break the COW when getting the page, even when
      only getting it for reading.
      
      At the same time, some users simply don't even care.
      
      For example, the perf code wants to look up the page not because it
      cares about the page, but because the code simply wants to look up the
      physical address of the access for informational purposes, and doesn't
      really care about races when a page might be unmapped and remapped
      elsewhere.
      
      This adds logic to force a COW event by setting FOLL_WRITE on any
      copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
      pointer as a result.
      
      The current semantics end up being:
      
       - __get_user_pages_fast(): no change. If you don't ask for a write,
         you won't break COW. You'd better know what you're doing.
      
       - get_user_pages_fast(): the fast-case "look it up in the page tables
         without anything getting mmap_sem" now refuses to follow a read-only
         page, since it might need COW breaking.  Which happens in the slow
         path - the fast path doesn't know if the memory might be COW or not.
      
       - get_user_pages() (including the slow-path fallback for gup_fast()):
         for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
         very similar semantics to FOLL_FORCE.
      
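      The core of the change can be sketched like this (an approximation: the
      helper name follows the idea above, FOLL_PIN is treated the same way as
      FOLL_GET, and the check is the standard "private but could become
      writable" COW test):

        static inline bool should_force_cow_break(struct vm_area_struct *vma,
                                                  unsigned int flags)
        {
                return (vma->vm_flags & (VM_MAYWRITE | VM_SHARED)) == VM_MAYWRITE &&
                       (flags & FOLL_GET);
        }

        /* in the slow path, before following/faulting the page: */
        if (should_force_cow_break(vma, foll_flags))
                foll_flags |= FOLL_WRITE;
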
      If it turns out that we want finer granularity (ie "only break COW when
      it might actually matter" - things like the zero page are special and
      don't need to be broken) we might need to push these semantics deeper
      into the lookup fault path.  So if people care enough, it's possible
      that we might end up adding a new internal FOLL_BREAK_COW flag to go
      with the internal FOLL_COW flag we already have for tracking "I had a
      COW".
      
      Alternatively, if it turns out that different callers might want to
      explicitly control the forced COW break behavior, we might even want to
      make such a flag visible to the users of get_user_pages() instead of
      using the above default semantics.
      
      But for now, this is mostly commentary on the issue (this commit message
      being a lot bigger than the patch, and that patch in turn is almost all
      comments), with that minimal "enable COW breaking early" logic using the
      existing FOLL_WRITE behavior.
      
      [ It might be worth noting that we've always had this ambiguity, and it
        could arguably be seen as a user-space issue.
      
        You only get private COW mappings that could break either way in
        situations where user space is doing cooperative things (ie fork()
        before an execve() etc), but it _is_ surprising and very subtle, and
        fork() is supposed to give you independent address spaces.
      
        So let's treat this as a kernel issue and make the semantics of
        get_user_pages() easier to understand. Note that obviously a true
        shared mapping will still get a page that can change under us, so this
        does _not_ mean that get_user_pages() somehow returns any "stable"
        page ]
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill Shutemov <kirill@shutemov.name>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e45fdaf
    • x86: mm: ptdump: calculate effective permissions correctly · 81845dd7
      Steven Price authored
      commit 1494e0c3 upstream.
      
      Patch series "Fix W+X debug feature on x86"
      
      Jan alerted me[1] that the W+X detection debug feature was broken in x86
      by my change[2] to switch x86 to use the generic ptdump infrastructure.
      
      Fundamentally the approach of trying to move the calculation of
      effective permissions into note_page() was broken because note_page() is
      only called for 'leaf' entries and the effective permissions are passed
      down via the internal nodes of the page tree.  The solution I've taken
      here is to create a new (optional) callback which is called for all
      nodes of the page tree and therefore can calculate the effective
      permissions.
      
      Secondly, on some configurations (32 bit with PAE) "unsigned long" is not
      large enough to store the table entries.  The fix here is simple - let's
      just use a u64.
      
      [1] https://lore.kernel.org/lkml/d573dc7e-e742-84de-473d-f971142fa319@suse.com/
      [2] 2ae27137 ("x86: mm: convert dump_pagetables to use walk_page_range")
      
      This patch (of 2):
      
      By switching the x86 page table dump code to use the generic code the
      effective permissions are no longer calculated correctly because the
      note_page() function is only called for *leaf* entries.  To calculate
      the actual effective permissions it is necessary to observe the full
      hierarchy of the page tree.
      
      Introduce a new callback for ptdump which is called for every entry and
      can therefore update the prot_levels array correctly.  note_page() can
      then simply access the appropriate element in the array.
      
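      Schematically, the new hook sits next to note_page() in the generic
      ptdump state (field names here follow the description above and may not
      match the merged header exactly):

        struct ptdump_state {
                /* called for leaf entries only, as before */
                void (*note_page)(struct ptdump_state *st, unsigned long addr,
                                  int level, u64 val);
                /*
                 * New: called for every level, so the effective permissions
                 * can be accumulated down the page-table hierarchy.
                 */
                void (*effective_prot)(struct ptdump_state *st, int level,
                                       u64 val);
                const struct ptdump_range *range;
        };
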
      [steven.price@arm.com: make the assignment conditional on val != 0]
        Link: http://lkml.kernel.org/r/430c8ab4-e7cd-6933-dde6-087fac6db872@arm.com
      Fixes: 2ae27137 ("x86: mm: convert dump_pagetables to use walk_page_range")
      Reported-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200521152308.33096-1-steven.price@arm.com
      Link: http://lkml.kernel.org/r/20200521152308.33096-2-steven.price@arm.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      81845dd7
    • usercopy: mark dma-kmalloc caches as usercopy caches · b33e28d6
      Vlastimil Babka authored
      commit 49f2d241 upstream.
      
      We have seen a "usercopy: Kernel memory overwrite attempt detected to
      SLUB object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as
      IUCV uses kmalloc() with __GFP_DMA because of memory address
      restrictions.  The issue has been discussed [2] and it has been noted
      that if all the kmalloc caches are marked as usercopy, there's little
      reason not to mark dma-kmalloc caches too.  The 'dma' part merely means
      that __GFP_DMA is used to restrict memory address range.
      
      As Jann Horn put it [3]:
       "I think dma-kmalloc slabs should be handled the same way as normal
        kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is
        just normal kernel memory - even if it might later be used for DMA -,
        and it should be perfectly fine to copy_from_user() into such
        allocations at that point, and to copy_to_user() out of them at the
        end. If you look at the places where such allocations are created, you
        can see things like kmemdup(), memcpy() and so on - all normal
        operations that shouldn't conceptually be different from usercopy in
        any relevant way."
      
      Thus this patch marks the dma-kmalloc-* caches as usercopy.
      
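      For context, "marked as usercopy" means the cache is created with a
      non-empty usercopy window, the same thing a caller opts into explicitly
      with kmem_cache_create_usercopy(); a sketch (the cache name and
      object_size are placeholders):

        struct kmem_cache *cache;

        /* whole-object window: useroffset 0, usersize == object size */
        cache = kmem_cache_create_usercopy("example-cache", object_size, 0,
                                           SLAB_HWCACHE_ALIGN,
                                           0, object_size, NULL);
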
      [1] https://bugzilla.suse.com/show_bug.cgi?id=1156053
      [2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c68@suse.cz/
      [3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYAVk8=XYaKj4_+6e6Hg@mail.gmail.com/
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Julian Wiedmann <jwi@linux.ibm.com>
      Cc: Ursula Braun <ubraun@linux.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Luis de Bethencourt <luisbg@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Garrett <mjg59@google.com>
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b33e28d6
    • mm: add kvfree_sensitive() for freeing sensitive data objects · f492c6b6
      Waiman Long authored
      [ Upstream commit d4eaa283 ]
      
      For kvmalloc'ed data objects that contain sensitive information like
      cryptographic keys, we need to make sure that the buffer is always cleared
      before freeing it.  Using memset() alone for buffer clearing may not
      provide certainty as the compiler may compile it away.  To be sure, the
      special memzero_explicit() has to be used.
      
      This patch introduces a new kvfree_sensitive() for freeing those sensitive
      data objects allocated by kvmalloc().  The relevant places where
      kvfree_sensitive() can be used are modified to use it.
      
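      The new helper is essentially (a sketch consistent with the description
      above):

        void kvfree_sensitive(const void *addr, size_t len)
        {
                if (likely(!ZERO_OR_NULL_PTR(addr))) {
                        memzero_explicit((void *)addr, len);
                        kvfree(addr);
                }
        }
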
      Fixes: 4f088249 ("KEYS: Avoid false positive ENOMEM error on key read")
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f492c6b6
  7. 07 Jun, 2020 1 commit
  8. 28 May, 2020 2 commits
  9. 23 May, 2020 2 commits
  10. 14 May, 2020 4 commits
  11. 09 May, 2020 1 commit
  12. 08 May, 2020 2 commits
    • mm: limit boost_watermark on small zones · 14f69140
      Henry Willard authored
      Commit 1c30844d ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
      
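      The guard is along these lines (a sketch; the factor of four is the
      estimate described above):

        /* at the top of boost_watermark(): skip zones too small to benefit */
        if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
                return;
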
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Henry Willard <henry.willard@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14f69140
    • mm/vmscan: remove unnecessary argument description of isolate_lru_pages() · 17e34526
      Qiwu Chen authored
      Since commit a9e7c39f ("mm/vmscan.c: remove 7th argument of
      isolate_lru_pages()"), the explanation of the 'mode' argument has been
      unnecessary.  Let's remove it.
      Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501090346.2894-1-chenqiwu@xiaomi.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17e34526