1. 04 Jan, 2020 1 commit
    • Hangbin Liu's avatar
      net: add bool confirm_neigh parameter for dst_ops.update_pmtu · d49ce85c
      Hangbin Liu authored
      [ Upstream commit bd085ef6
      
       ]
      
      The MTU update code is supposed to be invoked in response to real
      networking events that update the PMTU. In IPv6 PMTU update function
      __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
      confirmed time.
      
      But for tunnel code, it will call pmtu before xmit, like:
        - tnl_update_pmtu()
          - skb_dst_update_pmtu()
            - ip6_rt_update_pmtu()
              - __ip6_rt_update_pmtu()
                - dst_confirm_neigh()
      
      If the tunnel remote dst mac address changed and we still do the neigh
      confirm, we will not be able to update neigh cache and ping6 remote
      will failed.
      
      So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
      should not be invoking dst_confirm_neigh() as we have no evidence
      of successful two-way communication at this point.
      
      On the other hand it is also important to keep the neigh reachability fresh
      for TCP flows, so we cannot remove this dst_confirm_neigh() call.
      
      To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
      to choose whether we should do neigh update or not. I will add the parameter
      in this patch and set all the callers to true to comply with the previous
      way, and fix the tunnel code one by one on later patches.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Suggested-by: default avatarDavid Miller <davem@davemloft.net>
      Reviewed-by: default avatarGuillaume Nault <gnault@redhat.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d49ce85c
  2. 31 Dec, 2019 1 commit
    • Geert Uytterhoeven's avatar
      net: dst: Force 4-byte alignment of dst_metrics · f0465803
      Geert Uytterhoeven authored
      [ Upstream commit 258a980d ]
      
      When storing a pointer to a dst_metrics structure in dst_entry._metrics,
      two flags are added in the least significant bits of the pointer value.
      Hence this assumes all pointers to dst_metrics structures have at least
      4-byte alignment.
      
      However, on m68k, the minimum alignment of 32-bit values is 2 bytes, not
      4 bytes.  Hence in some kernel builds, dst_default_metrics may be only
      2-byte aligned, leading to obscure boot warnings like:
      
          WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x44/0x9a
          refcount_t: underflow; use-after-free.
          Modules linked in:
          CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G        W         5.5.0-rc2-atari-01448-g114a1a1038af891d-dirty #261
          Stack from 10835e6c:
      	    10835e6c 0038134f 00023fa6 00394b0f 0000001c 00000009 00321560 00023fea
      	    00394b0f 0000001c 001a70f8 00000009 00000000 10835eb4 00000001 00000000
      	    04208040 0000000a 00394b4a 10835ed4 00043aa8 001a70f8 00394b0f 0000001c
      	    00000009 00394b4a 0026aba8 003215a4 00000003 00000000 0026d5a8 00000001
      	    003215a4 003a4361 003238d6 000001f0 00000000 003215a4 10aa3b00 00025e84
      	    003ddb00 10834000 002416a8 10aa3b00 00000000 00000080 000aa038 0004854a
          Call Trace: [<00023fa6>] __warn+0xb2/0xb4
           [<00023fea>] warn_slowpath_fmt+0x42/0x64
           [<001a70f8>] refcount_warn_saturate+0x44/0x9a
           [<00043aa8>] printk+0x0/0x18
           [<001a70f8>] refcount_warn_saturate+0x44/0x9a
           [<0026aba8>] refcount_sub_and_test.constprop.73+0x38/0x3e
           [<0026d5a8>] ipv4_dst_destroy+0x5e/0x7e
           [<00025e84>] __local_bh_enable_ip+0x0/0x8e
           [<002416a8>] dst_destroy+0x40/0xae
      
      Fix this by forcing 4-byte alignment of all dst_metrics structures.
      
      Fixes: e5fd387a
      
       ("ipv6: do not overwrite inetpeer metrics prematurely")
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f0465803
  3. 01 Jul, 2019 1 commit
  4. 29 Jun, 2019 1 commit
    • Florian Westphal's avatar
      net: make skb_dst_force return true when dst is refcounted · b60a7738
      Florian Westphal authored
      
      netfilter did not expect that skb_dst_force() can cause skb to lose its
      dst entry.
      
      I got a bug report with a skb->dst NULL dereference in netfilter
      output path.  The backtrace contains nf_reinject(), so the dst might have
      been cleared when skb got queued to userspace.
      
      Other users were fixed via
      if (skb_dst(skb)) {
      	skb_dst_force(skb);
      	if (!skb_dst(skb))
      		goto handle_err;
      }
      
      But I think its preferable to make the 'dst might be cleared' part
      of the function explicit.
      
      In netfilter case, skb with a null dst is expected when queueing in
      prerouting hook, so drop skb for the other hooks.
      
      v2:
       v1 of this patch returned true in case skb had no dst entry.
       Eric said:
         Say if we have two skb_dst_force() calls for some reason
         on the same skb, only the first one will return false.
      
       This now returns false even when skb had no dst, as per Erics
       suggestion, so callers might need to check skb_dst() first before
       skb_dst_force().
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b60a7738
  5. 21 Mar, 2019 1 commit
  6. 18 Oct, 2018 1 commit
  7. 20 Jul, 2018 1 commit
    • Benedict Wong's avatar
      xfrm: Remove xfrmi interface ID from flowi · bc56b334
      Benedict Wong authored
      In order to remove performance impact of having the extra u32 in every
      single flowi, this change removes the flowi_xfrm struct, prefering to
      take the if_id as a method parameter where needed.
      
      In the inbound direction, if_id is only needed during the
      __xfrm_check_policy() function, and the if_id can be determined at that
      point based on the skb. As such, xfrmi_decode_session() is only called
      with the skb in __xfrm_check_policy().
      
      In the outbound direction, the only place where if_id is needed is the
      xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
      directly passed into the xfrm_lookup_with_ifid() call. All existing
      callers can still call xfrm_lookup(), which uses a default if_id of 0.
      
      This change does not change any behavior of XFRMIs except for improving
      overall system performance via flowi size reduction.
      
      This change has been tested against the Android Kernel Networking Tests:
      
      https://android.googlesource.com/kernel/tests/+/master/net/test
      
      Signed-off-by: default avatarBenedict Wong <benedictwong@google.com>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      bc56b334
  8. 05 Mar, 2018 1 commit
  9. 25 Jan, 2018 1 commit
  10. 30 Nov, 2017 8 commits
  11. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard...
      b2441318
  12. 27 Oct, 2017 1 commit
    • Paolo Abeni's avatar
      net: updating dst lastusage is an unlikely event. · 32d18ab1
      Paolo Abeni authored
      Since commit 0da4af00
      
       ("ipv6: only update __use and lastusetime
      once per jiffy at most"), updating the dst lastuse field is an
      unlikely action: it happens at most once per jiffy, out of
      potentially millions of calls per second.
      
      Mark explicitly the code as such, and let the compiler generate
      better code.
      
      Note: gcc 7.2 and several older versions do actually generate
      different - better - code when the unlikely() hint is in place,
      avoid jump in the fast path and keeping better code locality.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      32d18ab1
  13. 16 Oct, 2017 1 commit
    • Wei Wang's avatar
      ipv6: only update __use and lastusetime once per jiffy at most · 0da4af00
      Wei Wang authored
      
      In order to not dirty the cacheline too often, we try to only update
      dst->__use and dst->lastusetime at most once per jiffy.
      As dst->lastusetime is only used by ipv6 garbage collector, it should
      be good enough time resolution.
      And __use is only used in ipv6_route_seq_show() to show how many times a
      dst has been used. And as __use is not atomic_t right now, it does not
      show the precise number of usage times anyway. So we think it should be
      OK to only update it at most once per jiffy.
      
      According to my latest syn flood test on a machine with intel Xeon 6th
      gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
      on 2 NUMA nodes:
      With this patch, the packet process rate increases from ~3.49Mpps to
      ~3.75Mpps with a 7% increase rate.
      
      Note: dst_use() is being renamed to dst_hold_and_use() to better specify
      the purpose of the function.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@googl.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0da4af00
  14. 07 Oct, 2017 1 commit
    • Wei Wang's avatar
      ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang authored
      
      With all the preparation work before, we are now ready to replace rwlock
      with rcu and spinlock in fib6_table.
      That means now all fib6_node in fib6_table are protected by rcu. And
      when freeing fib6_node, call_rcu() is used to wait for the rcu grace
      period before releasing the memory.
      When accessing fib6_node, corresponding rcu APIs need to be used.
      And all previous sessions protected by the write lock will now be
      protected by the spin lock per table.
      All previous sessions protected by read lock will now be protected by
      rcu_read_lock().
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
      and is tagged with __rcu and rcu APIs are used in corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
      thread. This makes the issue a bit complicated. We think a valid
      solution for it is to let rt6_select() grab the tb6_lock if it decides
      to change it. As it is not in the normal operation and only happens when
      there is no valid neighbor cache for the route, we think the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even in the
      route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. It is because
      fib6_walk_continue() makes modifications to the walker structure, and so
      are fib6_repair_tree() and fib6_del_route(). In order to do proper
      syncing between them, we need to let fib6_walk_continue() hold the lock.
      We may be able to do further improvement on the way we do the tree walk
      to get rid of the need for holding the spin lock. But not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no longer
      mark rt->dst.rt6_next to NULL to make simultaneous reader be able to
      further traverse the list with rcu. However, rt->dst.rt6_next is only
      valid within this same rcu period. No one should access it later.
      
      5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
      performed before we publish this route (either by linking it to fn->leaf
      or insert it in the list pointed by fn->leaf) just to be safe because as
      soon as we publish the route, some read thread will be able to access it.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66f5d6ce
  15. 22 Sep, 2017 1 commit
    • Eric Dumazet's avatar
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet authored
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f
      
       ("net: remove DST_NOCACHE flag")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      222d7dbd
  16. 18 Aug, 2017 1 commit
  17. 18 Jun, 2017 8 commits
    • Wei Wang's avatar
      net: add debug atomic_inc_not_zero() in dst_hold() · 44ebe791
      Wei Wang authored
      
      This patch is meant to add a debug warning on the situation where dst is
      being held during its destroy phase. This could potentially cause double
      free issue on the dst.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44ebe791
    • Wei Wang's avatar
      net: reorder all the dst flags · 1eb04e7c
      Wei Wang authored
      
      As some dst flags are removed, reorder the dst flags to fill in the
      blanks.
      Note: these flags are not exposed into user space. So it is safe to
      reorder.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1eb04e7c
    • Wei Wang's avatar
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang authored
      
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and remove this flag
      completely.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a4c2fd7f
    • Wei Wang's avatar
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang authored
      
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behaves as DST_NOCACHE.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b2a9c0ed
    • Wei Wang's avatar
      net: remove dst gc related code · 5b7c9a8f
      Wei Wang authored
      
      This patch removes all dst gc related code and all the dst free
      functions
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5b7c9a8f
    • Wei Wang's avatar
      xfrm: take refcnt of dst when creating struct xfrm_dst bundle · 52df157f
      Wei Wang authored
      
      During the creation of xfrm_dst bundle, always take ref count when
      allocating the dst. This way, xfrm_bundle_create() will form a linked
      list of dst with dst->child pointing to a ref counted dst child. And
      the returned dst pointer is also ref counted. This makes the link from
      the flow cache to this dst now ref counted properly.
      As the dst is always ref counted properly, we can safely mark
      DST_NOGC flag so dst_release() will release dst based on refcnt only.
      And dst gc is no longer needed and all dst_free() and its related
      function calls should be replaced with dst_release() or
      dst_release_immediate().
      
      The special handling logic for dst->child in dst_destroy() can be
      replaced with a simple dst_release_immediate() call on the child to
      release the whole list linked by dst->child pointer.
      Previously used DST_NOHASH flag is not needed anymore as well. The
      reason that DST_NOHASH is used in the existing code is mainly to prevent
      the dst inserted in the fib tree to be wrongly destroyed during the
      deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
      flag is marked in all the dst children except the one which is in the
      fib tree.
      However, with this patch series to remove dst gc logic and release dst
      only based on ref count, it is safe to release all the children from a
      xfrm_dst bundle as long as the dst children are all ref counted
      properly which is already the case in the existing code.
      So, this patch removes the use of DST_NOHASH flag.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      52df157f
    • Wei Wang's avatar
      net: introduce a new function dst_dev_put() · 4a6ce2b6
      Wei Wang authored
      
      This function should be called when removing routes from fib tree after
      the dst gc is no longer in use.
      We first mark DST_OBSOLETE_DEAD on this dst to make sure next
      dst_ops->check() fails and returns NULL.
      Secondly, as we no longer keep the gc_list, we need to properly
      release dst->dev right at the moment when the dst is removed from
      the fib/fib6 tree.
      It does the following:
      1. change dst->input and output pointers to dst_discard/dst_dscard_out to
         discard all packets
      2. replace dst->dev with loopback interface
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4a6ce2b6
    • Wei Wang's avatar
      net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt · 5f56f409
      Wei Wang authored
      
      The current mechanism of freeing dst is a bit complicated. dst has its
      ref count and when user grabs the reference to the dst, the ref count is
      properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
      code due to some historic reasons.
      
      If the reference to dst is always taken properly, we should be able to
      simplify the logic in dst_release() to destroy dst when dst->__refcnt
      drops from 1 to 0. And this should be the only condition to determine
      if we can call dst_destroy().
      And as dst is always ref counted, there is no need for a dst garbage
      list to hold the dst entries that already get removed by the routing
      code but are still held by other users. And the task to periodically
      check the list to free dst if ref count become 0 is also not needed
      anymore.
      
      This patch introduces a temporary flag DST_NOGC(no garbage collector).
      If it is set in the dst, dst_release() will call dst_destroy() when
      dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
      and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
      issue.
      This temporary flag is mainly used so that we can make the transition
      component by component without breaking other parts.
      This flag will be removed after all components are properly transitioned.
      
      This patch also introduces a new function dst_release_immediate() which
      destroys dst without waiting on the rcu when refcnt drops to 0. It will
      be used in later patches.
      
      Follow-up patches will correct all the places to properly take ref count
      on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
      be used to release the dst instead of dst_free() and its related
      functions.
      And final clean-up patch will remove the DST_NOGC flag.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5f56f409
  18. 26 May, 2017 1 commit
    • Eric Dumazet's avatar
      ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet authored
      Andrey Konovalov reported crashes in ipv4_mtu()
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time run following loop :
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested to add a refcount to the metrics themselves,
      being a standalone object (in particular, no reference to other objects)
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      need to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f
      
       ("ipv4: Kill rt->fi")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarJulian Anastasov <ja@ssi.bg>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3fb07daf
  19. 18 May, 2017 1 commit
    • Alexey Dobriyan's avatar
      net: make struct dst_entry::dev first member · 66727145
      Alexey Dobriyan authored
      
      struct dst_entry::dev is used most often. Move it so it can be
      accessed without imm8 offset on x86_64.
      
      	add/remove: 0/0 grow/shrink: 9/239 up/down: 52/-413 (-361)
      	function                                     old     new   delta
      	dst_rcu_free                                 126     138     +12
      	fnhe_flush_routes                            211     219      +8
      	rt_set_nexthop                               747     754      +7
      	rt_cache_route                                85      91      +6
      	rt6_release                                  209     215      +6
      	dst_release                                  107     111      +4
      	dst_destroy_rcu                               29      33      +4
      	dn_dst_check_expire                          329     333      +4
      	dn_insert_route                              484     485      +1
      	xfrm_resolve_and_create_bundle              2991    2990      -1
      					...
      	ip_route_me_harder                          1163    1157      -6
      	__ip_append_data.isra                       2730    2724      -6
      	ip6_forward                                 3052    3045      -7
      	callforward_do_filter                        659     651      -8
      	dst_gc_task                                  571     549     -22
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66727145
  20. 12 Feb, 2017 1 commit
  21. 07 Feb, 2017 2 commits
  22. 25 Apr, 2016 1 commit
    • Jiri Benc's avatar
      route: move lwtunnel state to a single place · 0868e253
      Jiri Benc authored
      Commit 751a587a
      
       ("route: fix breakage after moving lwtunnel state")
      moved lwtstate to the end of dst_entry for 32bit archs. This makes it share
      the cacheline with __refcnt which had an unkown effect on performance. For
      this reason, the pointer was kept in place for 64bit archs.
      
      However, later performance measurements showed this is of no concern. It
      turns out that every performance sensitive path that accesses lwtstate
      accesses also struct rtable or struct rt6_info which share the same cache
      line.
      
      Thus, to get rid of a few #ifdefs, move the field to the end of the struct
      also for 64bit.
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0868e253
  23. 18 Mar, 2016 1 commit
  24. 15 Dec, 2015 1 commit
    • Eric Dumazet's avatar
      net: fix IP early demux races · 5037e9ef
      Eric Dumazet authored
      
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst to skb, after a lookup in TCP/UDP
      sockets.
      
      When socket cache is not properly set, we want to store into
      sk->sk_dst_cache the dst for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      
      Problem is this acquisition is simply using an atomic_inc(),
      which works well, unless the dst was queued for destruction from
      dst_release() noticing dst refcount went to zero, if DST_NOCACHE
      was set on dst.
      
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      
      This patch, being a stable candidate, adds two new helpers, and use
      them only from IP early demux problematic paths.
      
      It might be possible to merge in net-next skb_dst_force() and
      skb_dst_force_safe(), but I prefer having the smallest patch for stable
      kernels : Maybe some skb_dst_force() callers do not expect skb->dst
      can suddenly be cleared.
      
      Can probably be backported back to linux-3.6 kernels
      Reported-by: default avatarDavid J. Wilder <dwilder@us.ibm.com>
      Tested-by: default avatarDavid J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5037e9ef
  25. 08 Oct, 2015 1 commit