1. 22 Sep, 2021 1 commit
  2. 14 Oct, 2020 1 commit
  3. 26 Aug, 2020 2 commits
    • mm, page_alloc: fix core hung in free_pcppages_bulk() · 1a4029e9
      Charan Teja Reddy authored
      commit 88e8ac11 upstream.
      
      The following race is observed with repeated online and offline operations
      and a delay between two successive onlines of memory blocks in the movable
      zone.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values, i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to online the second memory
      block in the movable zone; it has
      entered online_pages() but has not
      yet called zone_pcp_update().
      					This process enters the exit
      					path and tries to release its
      					order-0 pages to the pcp lists
      					through free_unref_page_commit().
      					As pcp->high = 0 and
      					pcp->count = 1, it proceeds to
      					call free_pcppages_bulk().
      Update the pcp values; the new
      values are, say, pcp->high = 378
      and pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass it to
      					free_pcppages_bulk(); the pcp
      					values passed here are batch = 63,
      					count = 1.
      
      					Since the number of pages in the
      					pcp lists is less than ->batch,
      					it gets stuck in the
      					while(list_empty(list)) loop
      					with interrupts disabled, thus
      					hanging the core.
      
      Avoid this by ensuring that free_pcppages_bulk() is called with a proper
      count of pcp list pages.
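
      As a hedged sketch of the kind of guard this implies (illustrative only;
      drain_pcp() is a hypothetical caller, and the exact upstream hunk may
      differ), the requested count is clamped to what the pcp lists actually
      hold:

      	/*
      	 * Sketch: never ask free_pcppages_bulk() to free more pages than the
      	 * pcp lists currently hold, so its drain loop cannot spin on empty
      	 * lists with interrupts disabled.
      	 */
      	static void drain_pcp(struct zone *zone, struct per_cpu_pages *pcp)
      	{
      		int to_free = READ_ONCE(pcp->batch);
      
      		to_free = min(to_free, pcp->count);	/* clamp to what exists */
      		if (to_free)
      			free_pcppages_bulk(zone, to_free, pcp);
      	}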
      
      The mentioned race is somewhat easy to reproduce without [1] because the
      pcps are not updated for the first memory block online, and thus there is
      a large enough race window for P2 between alloc+free and the pcp struct
      values update through onlining of the second memory block.
      
      With [1], the race still exists but it is very narrow as we update the pcp
      struct values for the first memory block online itself.
      
      This is not limited to the movable zone; it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory, or
      no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a4029e9
    • mm: include CMA pages in lowmem_reserve at boot · 8c6a0bcb
      Doug Berger authored
      commit e08d3fdf upstream.
      
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by the function
      init_per_zone_wmark_min() and may be called later by accesses of the
      /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the values
      used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
      ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation because, as a side effect, it forces a call to
      setup_per_zone_lowmem_reserve().
      
      This commit breaks the link order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
      have a chance to be properly accounted in their zone(s), allowing the
      lowmem_reserve arrays to receive consistent values.
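
      In initcall terms the change is small; a hedged sketch of its shape
      (the real function body lives in mm/page_alloc.c and is not reproduced
      here):

      	/*
      	 * postcore initcalls run after core initcalls, so by the time this
      	 * runs, cma_init_reserved_areas() (a core_initcall) has already added
      	 * the CMA pages to the zones' managed counts.
      	 */
      	static int __init init_per_zone_wmark_min(void)
      	{
      		/* ... compute min_free_kbytes, watermarks, lowmem_reserve ... */
      		return 0;
      	}
      	postcore_initcall(init_per_zone_wmark_min);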
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c6a0bcb
  4. 21 Aug, 2020 1 commit
    • mm: Avoid calling build_all_zonelists_init under hotplug context · 23feab18
      Oscar Salvador authored
      Recently a customer of ours experienced a crash when booting the
      system while enabling memory-hotplug.
      
      The problem is that Normal zones on different nodes don't get their private
      zone->pageset allocated, and keep sharing the initial boot_pageset.
      The sharing between zones is normally safe, as explained by the comment for
      boot_pageset - it's a percpu structure, manipulations are done with
      interrupts disabled, and boot_pageset is set up so that any page placed on
      its pcplist is immediately flushed to the shared zone's freelist, because
      pcp->high == 1.
      However, the hotplug operation updates pcp->high to a higher value as it
      expects to be operating on a private pageset.
      
      The problem is in build_all_zonelists(), which is called when the first range
      of pages is onlined for the Normal zone of node X or Y:
      
      	if (system_state == SYSTEM_BOOTING) {
      		build_all_zonelists_init();
      	} else {
      	#ifdef CONFIG_MEMORY_HOTPLUG
      		if (zone)
      			setup_zone_pag...
      23feab18
  5. 20 May, 2020 1 commit
  6. 24 Apr, 2020 1 commit
  7. 22 Jun, 2019 1 commit
    • mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 121100e4
      Linxu Fang authored
      [ Upstream commit 299c83dc ]
      
      Commit 342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.
      
      Commit e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
      node") fixed part of this, but the current code still has a problem:
      when we have a node with only ZONE_MOVABLE and the node id is not zero,
      the node spanned pages are counted twice.
      
      That's because we have an empty Normal zone, and zone_start_pfn or
      zone_end_pfn is not between arch_zone_lowest_possible_pfn and
      arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
      range, just like commit 96e907d1 ("bootmem: Reimplement
      __absent_pages_in_range() using for_each_mem_pfn_range()").
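
      A hedged sketch of the clamping idea (illustrative; zone_low_pfn and
      zone_high_pfn stand in for the arch_zone_*_possible_pfn bounds, and the
      real code in zone_spanned_pages_in_node() may differ in detail):

      	/*
      	 * Constrain the zone's span to the architectural limits of that zone
      	 * type, so an empty zone on a non-zero node does not inherit a bogus
      	 * [zone_start_pfn, zone_end_pfn) range.
      	 */
      	*zone_start_pfn = clamp(node_start_pfn, zone_low_pfn, zone_high_pfn);
      	*zone_end_pfn   = clamp(node_end_pfn,   zone_low_pfn, zone_high_pfn);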
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that the memory of node 1 is counted twice.  After this patch,
      the problem is fixed:
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
      Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
      
      Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      121100e4
  8. 23 Mar, 2019 1 commit
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 484e89a9
      Jann Horn authored
      [ Upstream commit 2c2ade81 ]
      
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
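
      A hedged, simplified sketch of that accounting (the struct and function
      names here are hypothetical, not the kernel's page_frag code):

      	struct frag_cache {
      		struct page *page;
      		unsigned int size;		/* bytes served from this page */
      		unsigned int pagecnt_bias;	/* pre-charged, not yet consumed */
      	};
      
      	/*
      	 * The page starts with one reference; pre-charge one more per possible
      	 * allocation plus one kept by the cache itself, so the refcount cannot
      	 * reach zero while the cache still points at the page.
      	 */
      	static void frag_cache_charge(struct frag_cache *nc, struct page *page)
      	{
      		nc->page = page;
      		page_ref_add(page, nc->size);	/* refcount becomes size + 1 */
      		nc->pagecnt_bias = nc->size + 1;
      	}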
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      484e89a9
  9. 13 Dec, 2018 1 commit
    • mm: don't warn about allocations which stall for too long · fbb78e97
      Tetsuo Handa authored
      commit 400e2249 upstream.
      
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step for reducing the possibility of silent hang-up
      problems caused by memory allocation stalls.  But this commit reverts it,
      because it is possible to trigger an OOM lockup and/or soft lockups when
      many threads concurrently call warn_alloc() (in order to warn about memory
      allocation stalls) due to the current implementation of printk(), and it is
      difficult to obtain useful information due to the limitations of the
      synchronous warning approach.
      
      Current printk() implementation flushes all pending logs using the
      context of a thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually unless somebody continues
      appending to printk() buffer.
      
      Since warn_alloc() started appending to the printk() buffer while waiting
      for oom_kill_process() to make forward progress when oom_kill_process()
      is processing pending logs, it became possible for warn_alloc() to force
      oom_kill_process() to loop inside printk().  As a result, warn_alloc()
      significantly increased the possibility of preventing oom_kill_process()
      from making forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds, an
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep running an
      almost-busy loop, the thread doing print_one_log() gets little CPU
      time.  Therefore, this situation can be simplified like
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
      
      One possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() calls (either oom_kill_process()
      or warn_alloc()) can append to the printk() buffer when one series of
      printk() calls (either oom_kill_process() or warn_alloc()) is already in
      progress.
      
      Such serialization would also help obtain kernel messages in a readable
      form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock)
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go in that direction, because we don't want to
      introduce a new lock dependency, and we are unlikely to obtain useful
      information even if we serialized oom_kill_process() and warn_alloc().
      
      The synchronous approach is prone to unexpected results (e.g.  too late [1],
      too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
      helped with providing information other than "something is going wrong".
      I want to consider an asynchronous approach which can obtain information
      during stalls from possibly relevant threads (e.g.  the owner of
      oom_lock and kswapd-like threads) and serve as a trigger for actions
      (e.g.  turn on/off tracepoints, ask the libvirt daemon to take a memory dump
      of a stalling KVM guest for diagnostic purposes).
      
      This commit temporarily loses the ability to report e.g.  an OOM lockup
      caused by being unable to invoke the OOM killer due to a !__GFP_FS
      allocation request.  But the asynchronous approach will be able to detect
      such a situation and emit a warning.  Thus, let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever"))
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [Resolved backport conflict due to missing 82251963, a8e99259, 9e80c719 and
       9a67f648 in 4.9 -- all of which modified this hunk being removed.]
      Signed-off-by: Amit Shah <amit@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fbb78e97
  10. 11 Jul, 2018 1 commit
    • mm, page_alloc: do not break __GFP_THISNODE by zonelist reset · 6cfbbdd2
      Vlastimil Babka authored
      commit 7810e678 upstream.
      
      In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
      allocations that can ignore memory policies.  The zonelist is obtained
      from the current CPU's node.  This is a problem for __GFP_THISNODE
      allocations that want to allocate on a different node, e.g.  because the
      allocating thread has been migrated to a different CPU.
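
      For context, a hedged usage sketch of what __GFP_THISNODE promises the
      caller (the caller shown is hypothetical; alloc_pages_node() is the
      ordinary kernel API):

      	/*
      	 * With __GFP_THISNODE the caller expects the page to come from @nid
      	 * or not at all - never silently from whatever node the allocating
      	 * CPU happens to be running on.
      	 */
      	struct page *page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
      	if (page)
      		WARN_ON(page_to_nid(page) != nid);	/* the guarantee relied upon */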
      
      This has been observed to break SLAB in our 4.4-based kernel, because
      there it relies on __GFP_THISNODE working as intended.  If a slab page
      is put on the wrong node's list, then further list manipulations may
      corrupt the list, because page_to_nid() is used to determine which node's
      list_lock should be taken, and thus we may take the wrong lock and race.
      
      Current SLAB implementation seems to be immune by luck thanks to commit
      511e3a05 ("mm/slab: make cache_grow() handle the page allocated on
      arbitrary node") but there may be others assuming that __GFP_THISNODE
      works as promised.
      
      We can fix it by simply removing the zonelist reset completely.  There
      is actually no reason to reset it, because memory policies and cpusets
      don't affect the zonelist choice in the first place.  This was different
      when commit 183f6371 ("mm: ignore mempolicies when using
      ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
      own restricted zonelists.
      
      We might consider this for 4.17 although I don't know if there's
      anything currently broken.
      
      SLAB is currently not affected, but in kernels older than 4.7 that don't
      yet have 511e3a05 ("mm/slab: make cache_grow() handle the page
      allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
      ones I'll have to check.
      
      So stable backports should be more important, but will have to be
      reviewed carefully, as the code went through many changes.  BTW I think
      that also the ac->preferred_zoneref reset is currently useless if we
      don't also reset ac->nodemask from a mempolicy to NULL first (which we
      probably should for the OOM victims etc?), but I would leave that for a
      separate patch.
      
      Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cfbbdd2
  11. 31 Jan, 2018 2 commits
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · 19a7db1e
      Johannes Weiner authored
      commit c73322d0 upstream.
      
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
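
      A hedged sketch of the back-off bookkeeping this describes (the
      kswapd_failures field and pgdat_reclaimable() helper names are
      assumptions here, not quoted from the patch):

      	#define MAX_RECLAIM_RETRIES 16
      
      	/*
      	 * Count consecutive kswapd runs that reclaimed nothing; once the count
      	 * reaches MAX_RECLAIM_RETRIES, treat the node as unreclaimable and let
      	 * kswapd sleep.  A successful direct reclaim resets the counter and
      	 * wakes kswapd again.
      	 */
      	static bool pgdat_reclaimable(struct pglist_data *pgdat)
      	{
      		return pgdat->kswapd_failures < MAX_RECLAIM_RETRIES;
      	}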
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      19a7db1e
    • mm, page_alloc: fix potential false positive in __zone_watermark_ok · 685cce58
      Vlastimil Babka authored
      commit b050e376 upstream.
      
      Since commit 97a16fc8 ("mm, page_alloc: only enforce watermarks for
      order-0 allocations"), __zone_watermark_ok() check for high-order
      allocations will shortcut per-migratetype free list checks for
      ALLOC_HARDER allocations, and return true as long as there's free page
      of any migratetype.  The intention is that ALLOC_HARDER can allocate
      from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
      
      However, as a side effect, the watermark check will then also return
      true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
      CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
      allocation cannot actually obtain isolated pages, and might not be able
      to obtain CMA pages, this can result in a false positive.
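
      A hedged sketch of the stricter per-migratetype check implied above
      (simplified; the real __zone_watermark_ok() logic has more cases):

      	/*
      	 * After the order-0 watermark passes, require a suitably sized page on
      	 * a free list the request can actually use - MIGRATE_ISOLATE (and,
      	 * before the ZONE_MOVABLE conversion, MIGRATE_CMA for non-CMA
      	 * requests) must not count.
      	 */
      	for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
      		if (!list_empty(&area->free_list[mt]))
      			return true;
      	}
      	if ((alloc_flags & ALLOC_HARDER) &&
      	    !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
      		return true;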
      
      The condition should be rare and perhaps the outcome is not a fatal one.
      Still, it's better if the watermark check is correct.  There also
      shouldn't be a performance tradeoff here.
      
      Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
      Fixes: 97a16fc8 ("mm, page_alloc: only enforce watermarks for order-0 allocations")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      685cce58
  12. 09 Dec, 2017 1 commit
  13. 05 Dec, 2017 1 commit
  14. 24 Nov, 2017 1 commit
  15. 27 Sep, 2017 1 commit
  16. 25 Aug, 2017 1 commit
    • mm: discard memblock data later · 87395eeb
      Pavel Tatashin authored
      commit 3010f876 upstream.
      
      There is an existing use-after-free bug when deferred struct pages are
      enabled:
      
      The memblock_add() allocates memory for the memory array if more than
      128 entries are needed.  See comment in e820__memblock_setup():
      
        * The bootstrap memblock region count maximum is 128 entries
        * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
        * than that - so allow memblock resizing.
      
      This memblock memory is freed here:
              free_low_memory_core_early()
      
      We access the freed memblock.memory later in boot when deferred pages
      are initialized in this path:
      
              deferred_init_memmap()
                      for_each_mem_pfn_range()
                        __next_mem_pfn_range()
                          type = &memblock.memory;
      
      One possible explanation for why this use-after-free hasn't been hit
      before is that the limit of INIT_MEMBLOCK_REGIONS has never been
      exceeded, at least on systems where deferred struct pages were enabled.
      ...
      87395eeb
  17. 16 Aug, 2017 1 commit
    • mm: ratelimit PFNs busy info message · b56cd77c
      Jonathan Toppins authored
      commit 75dddef3 upstream.
      
      The RDMA subsystem can generate several thousand of these messages per
      second, eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
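
      A hedged sketch of the change's shape, using the kernel's standard
      rate-limited printk helper (the format string and variable names are
      assumptions, not the exact hunk):

      	/* The same diagnostic as before, but emitted at a bounded rate. */
      	pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
      			    __func__, outer_start, end);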
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the printk has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
      
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: Doug Ledford <dledford@redhat.com>
      Tested-by: Doug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b56cd77c
  18. 11 Aug, 2017 2 commits
  19. 07 Jun, 2017 1 commit
    • mm: consider memblock reservations for deferred memory initialization sizing · 292f70cd
      Michal Hocko authored
      commit 864b9a39 upstream.
      
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init estimates the initial memory to initialize
      to at least 2GB, but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimate.  This is still not ideal, because
      reset_deferred_meminit doesn't consider holes and so a reservation might
      be above the initial estimate, which we ignore; but let's keep the
      logic simple until we really need to handle more complicated cases.
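
      A hedged sketch of such a helper (simplified; reserved_pages_below() is a
      hypothetical name and the real helper's interface may differ):

      	/*
      	 * Walk memblock's reserved regions and sum the memory of every region
      	 * that starts below @limit, so the deferred-init estimate can be
      	 * bumped by the amount already consumed at boot.
      	 */
      	static unsigned long reserved_pages_below(phys_addr_t limit)
      	{
      		struct memblock_region *r;
      		phys_addr_t total = 0;
      
      		for_each_memblock(reserved, r)
      			if (r->base < limit)
      				total += r->size;
      
      		return PHYS_PFN(total);
      	}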
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      292f70cd
  20. 20 May, 2017 1 commit
    • mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC · 4e434d4f
      Vlastimil Babka authored
      commit 62be1511 upstream.
      
      Patch series "more robust PF_MEMALLOC handling"
      
      This series aims to unify the setting and clearing of PF_MEMALLOC, which
      prevents recursive reclaim.  There are some places that clear the flag
      unconditionally from current->flags, which may result in clearing a
      pre-existing flag.  This already resulted in a bug report that Patch 1
      fixes (without the new helpers, to make backporting easier).  Patch 2
      introduces the new helpers, modelled after existing memalloc_noio_* and
      memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
      and 4 convert non-mm code.
      
      This patch (of 4):
      
      __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
      during page migration by lock_page() (see the comment in
      __unmap_and_move()).  Then it unconditionally clears the flag, which can
      clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
      This was not a problem until commit a8161d1e ("mm, page_alloc:
      restructure direct compaction handling in slowpath"), because direct
      compaction was called only after direct reclaim, which was skipped when
      the PF_MEMALLOC flag was set.
      
      Even now it's only a theoretical issue, as the new callsite of
      __alloc_pages_direct_compact() is reached only for costly orders and
      when gfp_pfmemalloc_allowed() is true, which means either
      __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
      such known context, but let's play it safe and make
      __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
      already set.
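
      A hedged sketch of the save/restore pattern this implies (illustrative
      only; the series goes on to add memalloc_noreclaim_*-style helpers for
      exactly this):

      	/*
      	 * Remember whether PF_MEMALLOC was already set, so that clearing it
      	 * afterwards cannot wipe out a caller's pre-existing flag and re-open
      	 * the recursive-reclaim window.
      	 */
      	unsigned int pflags = current->flags & PF_MEMALLOC;
      
      	current->flags |= PF_MEMALLOC;
      	/* ... compaction / migration that must not recurse into reclaim ... */
      	current->flags = (current->flags & ~PF_MEMALLOC) | pflags;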
      
      Fixes: a8161d1e ("mm, page_alloc: restructure direct compaction handling in slowpath")
      Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e434d4f
  21. 12 Apr, 2017 1 commit
  22. 12 Mar, 2017 1 commit
    • mm/page_alloc: fix nodes for reclaim in fast path · d1e80426
      Gavin Shan authored
      commit e02dc017 upstream.
      
      When @node_reclaim_mode isn't 0, the page allocator tries to reclaim
      pages if the amount of free memory in the zones is below the low
      watermark.  On the Power platform, none of the NUMA nodes is scanned for
      page reclaim, because no node matches the condition in
      zone_allows_reclaim().  On the Power platform, RECLAIM_DISTANCE is set to
      10, which is the distance from Node-A to Node-A, so not even the preferred
      node will be scanned for page reclaim.
      
         __alloc_pages_nodemask()
         get_page_from_freelist()
            zone_allows_reclaim()
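
      A hedged sketch of the condition in question (illustrative, not the exact
      upstream hunk): with RECLAIM_DISTANCE == 10 and a local-node distance of
      10, a strict less-than comparison rejects even the preferred node, so the
      check needs to accept the local distance.

      	static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
      	{
      		/* '<=' rather than '<' so the local/preferred node qualifies */
      		return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
      					RECLAIM_DISTANCE;
      	}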
      
      Anton proposed the test code as below:
      
         # cat alloc.c
            :
         int main(int argc, char *argv[])
         {
      	void *p;
      	unsigned long size;
      	unsigned long start, end;
      
      	start = time(NULL);
      	size = strtoul(argv[1], NULL, 0);
      	printf("To allocate %ldGB memory\n", size);
      
      	size <<= 30;
      	p = malloc(size);
      	assert(p);
      	memset(p, 0, size);
      
      	end = time(NULL);
      	printf("Used time: %ld seconds\n", end - start);
      	sleep(3600);
      	return 0;
         }
      
      The system I use for testing has two NUMA nodes, each with 128GB of
      memory.  In the scenario below, the page cache on node#0 should be
      reclaimed when it encounters pressure to accommodate an allocation request.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches; \
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619712 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619840 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:          186816 kB
      
      With the patch applied, the pagecache on node-0 is reclaimed when its
      free memory is running out.  It's the expected behaviour.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33605568 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:        1379520 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:           317120 kB
      
      Fixes: 5f7a75ac ("mm: page_alloc: do not cache reclaim distances")
      Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
      
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1e80426
  23. 01 Feb, 2017 4 commits
  24. 06 Jan, 2017 1 commit
  25. 11 Nov, 2016 1 commit
  26. 31 Oct, 2016 1 commit
    • latent_entropy: Fix wrong gcc code generation with 64 bit variables · 58bea414
      Kees Cook authored
      
      The stack frame size could grow too large when the plugin used long long
      on 32-bit architectures and the given function had too many basic blocks.
      
      The gcc warning was:
      
      drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
      drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]
      
      This switches latent_entropy from u64 to unsigned long.
      
      Thanks to PaX Team and Emese Revfy for the patch.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      58bea414
  27. 28 Oct, 2016 1 commit
  28. 27 Oct, 2016 1 commit
    • mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds authored
      
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9dcb8b68
  29. 10 Oct, 2016 2 commits
    • latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy authored
      
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
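
      A hedged usage sketch of the attribute as described above (the function
      and variable names here are hypothetical):

      	/* A variable gets random initial contents from the plugin at build time. */
      	static unsigned long seed_pool[4] __latent_entropy;
      
      	/*
      	 * A function gets instrumented so its control flow perturbs the
      	 * plugin's global entropy variable.
      	 */
      	static int __init __latent_entropy my_early_init(void)
      	{
      		return 0;
      	}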
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      0766f788
    • gcc-plugins: Add latent_entropy plugin · 38addce8
      Emese Revfy authored
      This adds a new gcc plugin named "latent_entropy". It is designed to
      extract as much possible uncertainty from a running system at boot time as
      possible, hoping to capitalize on any possible variation in CPU operation
      (due to runtime data differences, hardware differences, SMP ordering,
      thermal timing variation, cache behavior, etc).
      
      At the very least, this plugin is a much more comprehensive example for
      how to manipulate kernel code using the gcc plugin internals.
      
      The need for very-early boot entropy tends to be very architecture or
      system design specific, so this plugin is more suited for those sorts
      of special cases. The existing kernel RNG already attempts to extract
      entropy from reliable runtime variation, but this plugin takes the idea to
      a logical extreme by permuting a global variable based on any variation
      in code execution (e.g. a different value (and permutation function)
      is used to permute the global based on loop count, case statement,
      if/then/else bra...
      38addce8
  30. 08 Oct, 2016 4 commits