1. 19 Sep, 2018 2 commits
  2. 06 Jun, 2018 1 commit
  3. 30 May, 2018 2 commits
    • mm: fix races between address_space dereference and free in page_evictable · d7f4e948
      Huang Ying authored
      [ Upstream commit e92bb4dd ]
      
      When page_mapping() is called and the mapping is dereferenced in
      page_evictable() through shrink_active_list(), it is possible for the
      inode to be truncated and the embedded address space to be freed at the
      same time.  This may lead to the following race.
      
      CPU1                                                CPU2
      
      truncate(inode)                                     shrink_active_list()
        ...                                                 page_evictable(page)
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                         mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                              mapping_unevictable(mapping)
                                                                test_bit(AS_UNEVICTABLE, &mapping->flags);
      - we've dereferenced mapping which is potentially already free.
      
      A similar race exists between swap cache freeing and page_evictable().
      
      The address_space in inode and swap cache will be freed after a RCU
      grace period.  So the races are fixed via enclosing the page_mapping()
      and address_space usage in rcu_read_lock/unlock().  Some comments are
      added in code to make it clear what is protected by the RCU read lock.
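
The shape of this fix can be modelled in user space. The sketch below is not kernel code: rcu_read_lock()/rcu_read_unlock() are replaced by a nesting counter, the structures are reduced to the fields involved, and the AS_UNEVICTABLE bit number is an assumption chosen for illustration. It only shows that the mapping is looked up and dereferenced inside one read-side section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit number, for illustration only. */
#define AS_UNEVICTABLE 3

/* Reduced stand-ins for the kernel structures. */
struct address_space { unsigned long flags; };
struct page { struct address_space *mapping; bool mlocked; };

/* User-space stand-ins for the RCU read-side primitives. */
static int rcu_depth;
static void rcu_read_lock(void) { rcu_depth++; }
static void rcu_read_unlock(void) { assert(rcu_depth > 0); rcu_depth--; }

static bool mapping_unevictable(const struct address_space *mapping)
{
    /* In this model the mapping may only be touched under the read lock. */
    assert(rcu_depth > 0);
    return mapping && (mapping->flags & (1UL << AS_UNEVICTABLE));
}

static bool page_evictable(const struct page *page)
{
    bool ret;

    /* Keep the address_space from being freed while it is inspected. */
    rcu_read_lock();
    ret = !mapping_unevictable(page->mapping) && !page->mlocked;
    rcu_read_unlock();
    return ret;
}
```

The assertion inside mapping_unevictable() stands in for the comments the patch adds: it documents, in executable form, which accesses the read-side section is meant to protect.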
      
      Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: pin address_space before dereferencing it while isolating an LRU page · ab88b8a2
      Mel Gorman authored
      [ Upstream commit 69d763fc ]
      
      Minchan Kim asked the following question: what lock protects the
      address_space from being destroyed when a race happens between inode
      truncation and __isolate_lru_page()?  Jan Kara clarified by describing
      the race as follows:
      
      CPU1                                            CPU2
      
      truncate(inode)                                 __isolate_lru_page()
        ...
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                   mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                        if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.
      
      The race is theoretically possible but unlikely.  Before the
      delete_from_page_cache, truncate_cleanup_page is called so the page is
      likely to be !PageDirty or PageWriteback which gets skipped by the only
      caller that checks the mapping in __isolate_lru_page.  Even if the race
      occurs, a substantial amount of work has to happen during a tiny window
      with no preemption but it could potentially be done using a virtual
      machine to artificially slow one CPU or halt it during the critical
      window.
      
      This patch should eliminate the race with truncation by try-locking the
      page before dereferencing mapping and aborting if the lock was not
      acquired.  There was a suggestion from Huang Ying to use RCU as a
      side-effect to prevent mapping being freed.  However, I do not like the
      solution as it's an unconventional means of preserving a mapping and
      it's not a context where rcu_read_lock is obviously protecting rcu data.
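
The try-lock approach can be sketched as a single-threaded user-space model. Everything here is a stand-in: the page lock is a plain boolean rather than a real lock bit, the structures carry only the fields used, and noop_migratepage() and can_isolate_dirty_page() are hypothetical names introduced for the example. The condition mirrors the check shown in the race diagram above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct address_space_operations { int (*migratepage)(void); };
struct address_space { const struct address_space_operations *a_ops; };
struct page { bool locked; bool dirty; struct address_space *mapping; };

/* Hypothetical helper so a mapping can advertise migration support. */
static int noop_migratepage(void) { return 0; }

/* Single-threaded model of the page lock; not atomic. */
static bool trylock_page(struct page *page)
{
    if (page->locked)
        return false;
    page->locked = true;
    return true;
}

static void unlock_page(struct page *page)
{
    assert(page->locked);
    page->locked = false;
}

/* Returns true if a page may be isolated by a caller that migrates pages. */
static bool can_isolate_dirty_page(struct page *page)
{
    bool migrate_dirty;

    if (!page->dirty)
        return true;

    /* Abort instead of waiting: truncation may hold the lock. */
    if (!trylock_page(page))
        return false;

    /* With the page locked, the mapping cannot be torn down under us. */
    migrate_dirty = !page->mapping || page->mapping->a_ops->migratepage;
    unlock_page(page);
    return migrate_dirty;
}
```

Failing the try-lock is treated the same as "cannot isolate", which is the abort path the message describes.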
      
      Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
      Fixes: c8244935 ("mm: compaction: make isolate_lru_page() filter-aware again")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 08 Apr, 2018 1 commit
  5. 24 Mar, 2018 1 commit
  6. 25 Feb, 2018 1 commit
    • mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed. · 274ee93f
      Tetsuo Handa authored
      commit bb422a73 upstream.
      
      Syzbot caught an oops at unregister_shrinker() because the combination
      of commit 1d3d4437 ("vmscan: per-node deferred work") and fault
      injection made register_shrinker() fail and the caller of
      register_shrinker() did not check for failure.
      
      ----------
      [  554.881422] FAULT_INJECTION: forcing a failure.
      [  554.881422] name failslab, interval 1, probability 0, space 0, times 0
      [  554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
      [  554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      [  554.881445] Call Trace:
      [  554.881459]  dump_stack+0x194/0x257
      [  554.881474]  ? arch_local_irq_restore+0x53/0x53
      [  554.881486]  ? find_held_lock+0x35/0x1d0
      [  554.881507]  should_fail+0x8c0/0xa40
      [  554.881522]  ? fault_create_debugfs_attr+0x1f0/0x1f0
      [  554.881537]  ? check_noncircular+0x20/0x20
      [  554.881546]  ? find_next_zero_bit+0x2c/0x40
      [  554.881560]  ? ida_get_new_above+0x421/0x9d0
      [  554.881577]  ? find_held_lock+0x35/0x1d0
      [  554.881594]  ? __lock_is_held+0xb6/0x140
      [  554.881628]  ? check_same_owner+0x320/0x320
      [  554.881634]  ? lock_downgrade+0x990/0x990
      [  554.881649]  ? find_held_lock+0x35/0x1d0
      [  554.881672]  should_failslab+0xec/0x120
      [  554.881684]  __kmalloc+0x63/0x760
      [  554.881692]  ? lock_downgrade+0x990/0x990
      [  554.881712]  ? register_shrinker+0x10e/0x2d0
      [  554.881721]  ? trace_event_raw_event_module_request+0x320/0x320
      [  554.881737]  register_shrinker+0x10e/0x2d0
      [  554.881747]  ? prepare_kswapd_sleep+0x1f0/0x1f0
      [  554.881755]  ? _down_write_nest_lock+0x120/0x120
      [  554.881765]  ? memcpy+0x45/0x50
      [  554.881785]  sget_userns+0xbcd/0xe20
      (...snipped...)
      [  554.898693] kasan: CONFIG_KASAN_INLINE enabled
      [  554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [  554.898732] general protection fault: 0000 [#1] SMP KASAN
      [  554.898737] Dumping ftrace buffer:
      [  554.898741]    (ftrace buffer empty)
      [  554.898743] Modules linked in:
      [  554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
      [  554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      [  554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
      [  554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
      [  554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
      [  554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
      [  554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
      [  554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
      [  554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
      [  554.898800] FS:  0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
      [  554.898804] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      [  554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
      [  554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
      [  554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      [  554.898818] Call Trace:
      [  554.898828]  unregister_shrinker+0x79/0x300
      [  554.898837]  ? perf_trace_mm_vmscan_writepage+0x750/0x750
      [  554.898844]  ? down_write+0x87/0x120
      [  554.898851]  ? deactivate_super+0x139/0x1b0
      [  554.898857]  ? down_read+0x150/0x150
      [  554.898864]  ? check_same_owner+0x320/0x320
      [  554.898875]  deactivate_locked_super+0x64/0xd0
      [  554.898883]  deactivate_super+0x141/0x1b0
      ----------
      
      Since allowing register_shrinker() callers to call unregister_shrinker()
      when register_shrinker() failed can simplify the error recovery path,
      this patch makes unregister_shrinker() a no-op when register_shrinker()
      failed.  Also, reset shrinker->nr_deferred in case unregister_shrinker()
      is called twice by mistake.
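
A user-space sketch of this behaviour follows. It is a model under stated assumptions: the shrinker list is reduced to an on_list flag, kmalloc()/kfree() become calloc()/free(), and the inject_failure parameter is a hypothetical stand-in for fault injection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Reduced stand-in: the real structure also carries callbacks and a list node. */
struct shrinker {
    long *nr_deferred;   /* NULL means "not registered" in this model */
    bool on_list;
};

/* inject_failure is a hypothetical knob that mimics failslab. */
static int register_shrinker(struct shrinker *s, bool inject_failure)
{
    assert(s != NULL);
    s->nr_deferred = inject_failure ? NULL : calloc(1, sizeof(*s->nr_deferred));
    if (!s->nr_deferred)
        return -1;   /* the kernel would return -ENOMEM */
    s->on_list = true;
    return 0;
}

static void unregister_shrinker(struct shrinker *s)
{
    assert(s != NULL);
    /* Registration failed or already unregistered: nothing to undo. */
    if (!s->nr_deferred)
        return;
    s->on_list = false;
    free(s->nr_deferred);
    /* Reset so that a second call is harmless. */
    s->nr_deferred = NULL;
}
```

A caller that ignores the return value of register_shrinker() can then call unregister_shrinker() unconditionally on its teardown path.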

      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Glauber Costa <glauber@scylladb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 31 Jan, 2018 1 commit
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · 19a7db1e
      Johannes Weiner authored
      commit c73322d0 upstream.
      
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
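
The back-off rule can be written down as a small counter model. This is a sketch, not the patch: the node_state structure, the kswapd_failures field name and the three helper functions are assumptions made for the example; only MAX_RECLAIM_RETRIES and its value of 16 come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/* Hypothetical per-node state; only the failure counter is modelled. */
struct node_state { int kswapd_failures; };

/* Called after each kswapd cycle with the number of pages it reclaimed. */
static void note_kswapd_cycle(struct node_state *n, unsigned long nr_reclaimed)
{
    assert(n->kswapd_failures >= 0);
    if (nr_reclaimed)
        n->kswapd_failures = 0;
    else
        n->kswapd_failures++;
}

/* A direct reclaimer that makes progress proves the node reclaimable again. */
static void note_direct_reclaim(struct node_state *n, unsigned long nr_reclaimed)
{
    if (nr_reclaimed)
        n->kswapd_failures = 0;
}

/* kswapd sleeps on the node once it has failed too many times in a row. */
static bool kswapd_should_back_off(const struct node_state *n)
{
    return n->kswapd_failures >= MAX_RECLAIM_RETRIES;
}
```

Because the counter depends only on pages reclaimed, not on pages scanned, a node with nothing left to scan still reaches the threshold.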
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 12 Mar, 2017 2 commits
  9. 19 Jan, 2017 1 commit
    • mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · 07fc9575
      Michal Hocko authored
      commit b4536f0c upstream.
      
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations on 32-bit kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The OOM killer is clearly premature because there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocation requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
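
The helper's role can be illustrated with a reduced user-space model. The structure layout, the memcg_enabled flag, the three-zone setup and the eligible_lru_size() function are assumptions for this sketch; only the name lruvec_zone_lru_size and the idea of choosing between zone counters and per-memcg counters come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 3   /* DMA, Normal, HighMem in this model */

/* Reduced stand-in for a lruvec with a single LRU list. */
struct lruvec {
    bool memcg_enabled;
    unsigned long zone_counters[MAX_NR_ZONES];    /* global per-zone LRU pages */
    unsigned long memcg_counters[MAX_NR_ZONES];   /* per-memcg per-zone LRU pages */
};

/* Pick the counter that matches how the LRU size itself was accounted. */
static unsigned long lruvec_zone_lru_size(const struct lruvec *lruvec, int zone_idx)
{
    assert(zone_idx >= 0 && zone_idx < MAX_NR_ZONES);
    return lruvec->memcg_enabled ? lruvec->memcg_counters[zone_idx]
                                 : lruvec->zone_counters[zone_idx];
}

/* LRU pages usable by a request limited to zones <= highest_zone_idx. */
static unsigned long eligible_lru_size(const struct lruvec *lruvec, int highest_zone_idx)
{
    unsigned long total = 0, ineligible = 0;

    for (int zid = 0; zid < MAX_NR_ZONES; zid++) {
        unsigned long size = lruvec_zone_lru_size(lruvec, zid);

        total += size;
        if (zid > highest_zone_idx)
            ineligible += size;
    }
    /* Both terms come from the same set of counters, so this cannot underflow. */
    return total - ineligible;
}
```

The bug described above corresponds to mixing the two arrays: taking the total from the memcg counters while subtracting the global highmem counter, which can exceed the total and leave nothing.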
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 06 Jan, 2017 1 commit
  11. 03 Dec, 2016 1 commit
    • mm, vmscan: add cond_resched() into shrink_node_memcg() · bd041733
      Michal Hocko authored
      Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
         23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
         (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
        Task dump for CPU 23:
        kswapd1         R  running task        0   148      2 0x00000008
        Call Trace:
          shrink_node+0xd2/0x2f0
          kswapd+0x2cb/0x6a0
          mem_cgroup_shrink_node+0x160/0x160
          kthread+0xbd/0xe0
          __switch_to+0x1fa/0x5c0
          ret_from_fork+0x1f/0x40
          kthread_create_on_node+0x180/0x180
      
      A closer code inspection has shown that we might indeed miss all the
      scheduling points in the reclaim path if no pages can be isolated from
      the LRU list.  This is a pathological case but other reports from Donald
      Buczek have shown that we might indeed hit such a path:
      
              clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
               kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
        [...]
               kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
      
      This is a minute-long snapshot which didn't take a single page from the
      LRU.  It is not entirely clear why only 1303 pages have been scanned
      during that time (maybe there was a heavy IRQ activity interfering).
      
      In any case it looks like we can really hit long periods without
      scheduling on non-preemptive kernels so an explicit cond_resched() in
      shrink_node_memcg which is independent of the reclaim operation is due.
      
      Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
      Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
      Reported-by: Donald Buczek <buczek@molgen.mpg.de>
      Reported-by: "Christopher S. Aker" <caker@theshore.net>
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 28 Oct, 2016 1 commit
    • mm: memcontrol: do not recurse in direct reclaim · 89a28483
      Johannes Weiner authored
      On 4.0, we saw a stack corruption from a page fault entering direct
      memory cgroup reclaim, calling into btrfs_releasepage(), which then
      tried to allocate an extent and recursed back into a kmem charge ad
      nauseam:
      
        [...]
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        memcg_charge_kmem+0x40/0x80
        new_slab+0x2d9/0x5a0
        __slab_alloc+0x2fd/0x44f
        kmem_cache_alloc+0x193/0x1e0
        alloc_extent_state+0x21/0xc0
        __clear_extent_bit+0x2b5/0x400
        try_release_extent_mapping+0x1a3/0x220
        __btrfs_releasepage+0x31/0x70
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        mem_cgroup_try_charge+0x65/0x1c0
        handle_mm_fault+0x117f/0x1510
        __do_page_fault+0x177/0x420
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
      
      On later kernels, kmem charging is opt-in rather than opt-out, and that
      particular kmem allocation in btrfs_releasepage() is no longer being
      charged and won't recurse and overrun the stack anymore.
      
      But it's not impossible for an accounted allocation to happen from the
      memcg direct reclaim context, and we needed to reproduce this crash many
      times before we even got a useful stack trace out of it.
      
      Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
      avoid recursing into any other form of direct reclaim.  Then let
      recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
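
The two steps can be modelled in user space as follows. This is a sketch under assumptions: the task and memcg structures, the depth bookkeeping, the PF_MEMALLOC value and the "reclaim frees four pages" rule are all invented for the example. It shows a charge that triggers reclaim, and a nested charge made during reclaim that bypasses the limit instead of recursing.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_MEMALLOC 0x1u   /* flag value is arbitrary in this model */

/* Hypothetical reduced structures. */
struct task { unsigned int flags; int depth; int max_depth; };
struct memcg { long usage; long limit; };

static bool try_charge(struct task *t, struct memcg *cg, long nr_pages);

/* Model of memcg direct reclaim: it frees pages but allocates while doing so. */
static void memcg_reclaim(struct task *t, struct memcg *cg)
{
    assert(!(t->flags & PF_MEMALLOC));   /* never entered recursively */
    t->flags |= PF_MEMALLOC;
    if (++t->depth > t->max_depth)
        t->max_depth = t->depth;

    /* Releasing a page needs one accounted allocation of its own. */
    (void)try_charge(t, cg, 1);
    /* Assume this pass then frees four pages. */
    cg->usage = cg->usage > 4 ? cg->usage - 4 : 0;

    t->depth--;
    t->flags &= ~PF_MEMALLOC;
}

static bool try_charge(struct task *t, struct memcg *cg, long nr_pages)
{
    if (cg->usage + nr_pages <= cg->limit) {
        cg->usage += nr_pages;
        return true;
    }
    /* Already reclaiming: bypass the limit rather than reclaim again. */
    if (t->flags & PF_MEMALLOC) {
        cg->usage += nr_pages;
        return true;
    }
    memcg_reclaim(t, cg);
    if (cg->usage + nr_pages > cg->limit)
        return false;
    cg->usage += nr_pages;
    return true;
}
```

Without the PF_MEMALLOC check, the nested try_charge() would call memcg_reclaim() again while still over the limit, which is the unbounded recursion seen in the stack trace.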
      
      Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 08 Oct, 2016 5 commits
    • mm: use zonelist name instead of using hardcoded index · c9634cf0
      Aneesh Kumar K.V authored
      Use the existing enums instead of a hardcoded index when looking at the
      zonelist.  This makes the code more readable.  No functional change in
      this patch.
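
As an illustration of the kind of change meant here, the sketch below indexes a zonelist array by enumerator instead of by literal. The enumerator names ZONELIST_FALLBACK and ZONELIST_NOFALLBACK are assumed from later kernel headers and are not confirmed by this message; the structures and the fallback_zonelist() helper are reduced, hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed enumerator names; the message only says "existing enums". */
enum { ZONELIST_FALLBACK, ZONELIST_NOFALLBACK, MAX_ZONELISTS };

/* Reduced stand-ins for the kernel structures. */
struct zonelist { int first_zone; };
struct pglist_data { struct zonelist node_zonelists[MAX_ZONELISTS]; };

/* Hypothetical helper: returns the zonelist that allows fallback to other nodes. */
static struct zonelist *fallback_zonelist(struct pglist_data *pgdat)
{
    assert(pgdat != NULL);
    /* Before: &pgdat->node_zonelists[0] */
    return &pgdat->node_zonelists[ZONELIST_FALLBACK];
}
```

Both spellings select the same element, which is why the change carries no functional difference.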
      
      Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko authored
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback, so the LRUs would be full of unreclaimable pages
      until the IO completed, and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite the contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently into throttle_vm_writeout.  If there are too many dirty
      pages or pages under writeback it will get throttled even though it is
      in fact a flusher that clears pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which introduced it [1].  The setup was
      a small virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk
      image on a rather slow NFS in a sync mode on the host) with 8 parallel
      writers each writing 1G worth of data.  As soon as the pagecache fills
      up and the direct reclaim hits, I start an anon memory consumer in a
      loop (allocating 300M and exiting after populating it) in the
      background to make the memory pressure even stronger as well as to
      disrupt the steady state for the IO.  The direct reclaim is throttled
      because of the congestion as well as kswapd hitting congestion_wait due
      to nr_immediate but throttle_vm_writeout doesn't ever trigger the sleep
      throughout the test.  Dirty+writeback are close to nr_dirty_threshold
      with some fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf484383
    • Vlastimil Babka's avatar
      mm, vmscan: make compaction_ready() more accurate and readable · fdd4c614
      Vlastimil Babka authored
      compaction_ready() is used during direct reclaim for costly order
      allocations to skip reclaim for zones where compaction should be
      attempted instead.  It combines the standard compaction_suitable()
      check with its own watermark check based on the high watermark with an
      extra gap, and the result is confusing at best.
      
      This patch attempts to better structure and document the checks
      involved.  First, compaction_suitable() can determine that the
      allocation should either succeed already, or that compaction doesn't
      have enough free pages to proceed.  The third possibility is that
      compaction has enough free pages, but we still decide to reclaim first -
      unless we are already above the high watermark with gap.  This does not
      mean that the reclaim will actually reach this watermark during a single
      attempt; it is rather an over-reclaim protection.  So document the
      code as such.  The check for compaction_deferred() is removed
      completely, as it in fact had no proper role here.
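
The three outcomes described above can be sketched as a small userspace C
model.  This is an illustration under stated assumptions, not the patched
kernel function: the toy_ names, the flat free_pages/high_wmark arguments, and
the plain "free pages above watermark" comparison are simplifications
introduced here, and the gap is assumed to be the (2UL << order) value used by
compaction.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the results of the suitability check. */
enum toy_compact_result {
	TOY_COMPACT_SKIPPED,	/* not enough free pages for compaction */
	TOY_COMPACT_CONTINUE,	/* compaction could run now */
	TOY_COMPACT_SUCCESS,	/* allocation should succeed already */
};

/* Assumed gap: twice the number of pages in the requested allocation. */
static unsigned long toy_compact_gap(unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT - 1);
	return 2UL << order;
}

/*
 * Returns true when reclaim should be skipped for this zone because
 * compaction (or the allocation itself) is ready to go.
 */
static bool toy_compaction_ready(enum toy_compact_result suitable,
				 unsigned long free_pages,
				 unsigned long high_wmark,
				 unsigned int order)
{
	if (suitable == TOY_COMPACT_SUCCESS)
		return true;	/* allocation should succeed: don't reclaim */
	if (suitable == TOY_COMPACT_SKIPPED)
		return false;	/* compaction cannot proceed yet: reclaim */

	/*
	 * Compaction is possible, but reclaim first unless the zone is
	 * already above high watermark + gap (over-reclaim protection).
	 */
	return free_pages > high_wmark + toy_compact_gap(order);
}
```

For example, with a high watermark of 100 pages and order 3, the model stops
reclaiming only once more than 116 pages are free.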
      
      The result after this patch is mainly less confusing code.  We also
      skip some over-reclaim in cases where the allocation should already
      succeed.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd4c614
    • Vlastimil Babka's avatar
      mm, compaction: create compact_gap wrapper · 9861a62c
      Vlastimil Babka authored
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation.
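
A minimal userspace sketch of such a wrapper follows.  The function name and
the (2UL << order) expression come from the description above; the assert
guarding the shift width is an addition made here for the standalone example,
and the explanatory comment the commit refers to is not reproduced.

```c
#include <assert.h>
#include <limits.h>

/*
 * Watermark gap used by compaction: (2UL << order) pages, i.e. twice the
 * number of pages in an order-`order` allocation.
 */
static unsigned long compact_gap(unsigned int order)
{
	/* Keep the shift within the width of unsigned long. */
	assert(order < sizeof(unsigned long) * CHAR_BIT - 1);
	return 2UL << order;
}
```

For order 0 this gives a gap of 2 pages, for order 3 a gap of 16 pages, and
for order 9 a gap of 1024 pages.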
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9861a62c
    • Vlastimil Babka's avatar
      mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS · cf378319
      Vlastimil Babka authored
      COMPACT_PARTIAL has historically meant that compaction returned after
      doing some work without fully compacting a zone.  It however didn't
      distinguish if compaction terminated because it succeeded in creating
      the requested high-order page.  This has changed recently and now we
      only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
      high-order watermark check in compaction_suitable() passes and no
      compaction needs to be done.
      
      So at this point we can make the return value clearer by renaming it to
      COMPACT_SUCCESS.  The next patch will remove some redundant tests for
      success where compaction just returned COMPACT_SUCCESS.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf378319
  14. 24 Sep, 2016 1 commit
    • Hugh Dickins's avatar
      mm: delete unnecessary and unsafe init_tlb_ubc() · b385d21f
      Hugh Dickins authored
      init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
      initialized with zeroes in the init_task, and copied from parent to
      child while it is quiescent in arch_dup_task_struct(); so I went to
      delete it.
      
      But I inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
      check that it was always empty at that point, and found them firing:
      because memcg reclaim can recurse into global reclaim (when allocating
      biosets for swapout in my case), and arrive back at the init_tlb_ubc()
      in shrink_node_memcg().
      
      Resetting tlb_ubc.flush_required at that point is wrong: if the upper
      level needs a deferred TLB flush, but the lower level turns out not to,
      we miss a TLB flush.  But fortunately, that's the only part of the
      protocol that does not nest: with the initialization removed, cpumask
      collects bits from upper and lower levels, and flushes TLB when needed.
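
The missed flush can be reproduced in a toy userspace model.  Everything below
is hypothetical scaffolding (the toy_ names, a single unsigned long standing in
for the cpumask, and a return value standing in for the IPIs sent); it only
mirrors the nesting argument above, not the kernel's batching code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task unmap batching state. */
struct toy_tlb_ubc {
	unsigned long cpumask;	/* CPUs that may hold stale TLB entries */
	bool flush_required;	/* a deferred flush is pending */
};

/* Record that `cpu` needs a deferred TLB flush. */
static void toy_set_flush_pending(struct toy_tlb_ubc *ubc, unsigned int cpu)
{
	assert(cpu < 8 * sizeof(unsigned long));
	ubc->cpumask |= 1UL << cpu;
	ubc->flush_required = true;
}

/* Flush if needed; returns the mask of CPUs that were flushed. */
static unsigned long toy_try_flush(struct toy_tlb_ubc *ubc)
{
	unsigned long flushed;

	if (!ubc->flush_required)
		return 0;
	flushed = ubc->cpumask;
	ubc->cpumask = 0;
	ubc->flush_required = false;
	return flushed;
}

/*
 * Upper-level reclaim defers a flush for CPU 1, then recurses into a
 * lower level that unmaps nothing.  Returns every CPU flushed overall.
 */
static unsigned long toy_nested_reclaim(bool reset_on_entry)
{
	struct toy_tlb_ubc ubc = { 0, false };
	unsigned long flushed = 0;

	toy_set_flush_pending(&ubc, 1);		/* upper level */

	if (reset_on_entry)			/* models the removed init */
		ubc.flush_required = false;
	flushed |= toy_try_flush(&ubc);		/* lower level finishes */

	flushed |= toy_try_flush(&ubc);		/* back in the upper level */
	return flushed;
}
```

In this model, resetting the flag on entry leaves CPU 1 unflushed, while
dropping the reset lets the pending bit survive the recursion and be flushed.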
      
      Fixes: 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b385d21f
  15. 02 Sep, 2016 1 commit
  16. 02 Aug, 2016 1 commit
  17. 28 Jul, 2016 17 commits