1. 21 Dec, 2019 18 commits
  2. 18 Dec, 2019 22 commits
    • Linux 5.4.5 · 9a088971
      Greg Kroah-Hartman authored
    • r8169: add missing RX enabling for WoL on RTL8125 · 68159412
      Heiner Kallweit authored
      [ Upstream commit 00222d13 ]
      
      RTL8125 also requires RX to be enabled for WoL.
      
      v2: add missing Fixes tag
      
      Fixes: f1bce4ad ("r8169: add support for RTL8125")
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: mscc: ocelot: unregister the PTP clock on deinit · 157560f9
      Vladimir Oltean authored
      [ Upstream commit 9385973f ]
      
      Currently a switch driver deinit frees the regmaps, but the PTP clock is
      still out there, available to user space via /dev/ptpN. Any PTP
      operation is a ticking time bomb, since it will attempt to use the freed
      regmaps and thus trigger kernel panics:
      
      [    4.291746] fsl_enetc 0000:00:00.2 eth1: error -22 setting up slave phy
      [    4.291871] mscc_felix 0000:00:00.5: Failed to register DSA switch: -22
      [    4.308666] mscc_felix: probe of 0000:00:00.5 failed with error -22
      [    6.358270] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088
      [    6.367090] Mem abort info:
      [    6.369888]   ESR = 0x96000046
      [    6.369891]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    6.369892]   SET = 0, FnV = 0
      [    6.369894]   EA = 0, S1PTW = 0
      [    6.369895] Data abort info:
      [    6.369897]   ISV = 0, ISS = 0x00000046
      [    6.369899]   CM = 0, WnR = 1
      [    6.369902] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020d58c7000
      [    6.369904] [0000000000000088] pgd=00000020d5912003, pud=00000020d5915003, pmd=0000000000000000
      [    6.369914] Internal error: Oops: 96000046 [#1] PREEMPT SMP
      [    6.420443] Modules linked in:
      [    6.423506] CPU: 1 PID: 262 Comm: phc_ctl Not tainted 5.4.0-03625-gb7b2a5dadd7f #204
      [    6.431273] Hardware name: LS1028A RDB Board (DT)
      [    6.435989] pstate: 40000085 (nZcv daIf -PAN -UAO)
      [    6.440802] pc : css_release+0x24/0x58
      [    6.444561] lr : regmap_read+0x40/0x78
      [    6.448316] sp : ffff800010513cc0
      [    6.451636] x29: ffff800010513cc0 x28: ffff002055873040
      [    6.456963] x27: 0000000000000000 x26: 0000000000000000
      [    6.462289] x25: 0000000000000000 x24: 0000000000000000
      [    6.467617] x23: 0000000000000000 x22: 0000000000000080
      [    6.472944] x21: ffff800010513d44 x20: 0000000000000080
      [    6.478270] x19: 0000000000000000 x18: 0000000000000000
      [    6.483596] x17: 0000000000000000 x16: 0000000000000000
      [    6.488921] x15: 0000000000000000 x14: 0000000000000000
      [    6.494247] x13: 0000000000000000 x12: 0000000000000000
      [    6.499573] x11: 0000000000000000 x10: 0000000000000000
      [    6.504899] x9 : 0000000000000000 x8 : 0000000000000000
      [    6.510225] x7 : 0000000000000000 x6 : ffff800010513cf0
      [    6.515550] x5 : 0000000000000000 x4 : 0000000fffffffe0
      [    6.520876] x3 : 0000000000000088 x2 : ffff800010513d44
      [    6.526202] x1 : ffffcada668ea000 x0 : ffffcada64d8b0c0
      [    6.531528] Call trace:
      [    6.533977]  css_release+0x24/0x58
      [    6.537385]  regmap_read+0x40/0x78
      [    6.540795]  __ocelot_read_ix+0x6c/0xa0
      [    6.544641]  ocelot_ptp_gettime64+0x4c/0x110
      [    6.548921]  ptp_clock_gettime+0x4c/0x58
      [    6.552853]  pc_clock_gettime+0x5c/0xa8
      [    6.556699]  __arm64_sys_clock_gettime+0x68/0xc8
      [    6.561331]  el0_svc_common.constprop.2+0x7c/0x178
      [    6.566133]  el0_svc_handler+0x34/0xa0
      [    6.569891]  el0_sync_handler+0x114/0x1d0
      [    6.573908]  el0_sync+0x140/0x180
      [    6.577232] Code: d503201f b00119a1 91022263 b27b7be4 (f9004663)
      [    6.583349] ---[ end trace d196b9b14cdae2da ]---
      [    6.587977] Kernel panic - not syncing: Fatal exception
      [    6.593216] SMP: stopping secondary CPUs
      [    6.597151] Kernel Offset: 0x4ada54400000 from 0xffff800010000000
      [    6.603261] PHYS_OFFSET: 0xffffd0a7c0000000
      [    6.607454] CPU features: 0x10002,21806008
      [    6.611558] Memory Limit: none
      
      And now that ocelot->ptp_clock is checked at exit, prevent a potential
      error where ptp_clock_register returned a pointer-encoded error, which
      we were keeping in the ocelot private data structure. As a result,
      ocelot->ptp_clock is now either NULL or a valid pointer.
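
As a rough illustration of the "either NULL or a valid pointer" invariant, the following Python sketch models a driver that never stores an encoded error and only unregisters a clock it actually holds. The classes `ClockRegistry` and `Switch`, and the error value -22, are hypothetical stand-ins for this sketch; this is not the ocelot driver's code.

```python
class ClockRegistry:
    """Hypothetical stand-in for the PTP clock subsystem."""

    def __init__(self):
        self.clocks = set()

    def register(self, clock, fail=False):
        # On failure return a negative errno-style int, loosely mimicking
        # a pointer-encoded error; otherwise return the registered clock.
        if fail:
            return -22
        self.clocks.add(clock)
        return clock

    def unregister(self, clock):
        self.clocks.remove(clock)


class Switch:
    """Hypothetical driver state holding an optional PTP clock."""

    def __init__(self, registry):
        self.registry = registry
        self.ptp_clock = None

    def init_timestamp(self, fail=False):
        result = self.registry.register(object(), fail=fail)
        if isinstance(result, int):
            # Never keep the encoded error in the private data.
            self.ptp_clock = None
            return result
        self.ptp_clock = result
        return 0

    def deinit(self):
        # Unregister before other resources go away, and only if a
        # clock was actually registered.
        if self.ptp_clock is not None:
            self.registry.unregister(self.ptp_clock)
            self.ptp_clock = None
```

With this shape, `deinit()` is safe to call after both a successful and a failed `init_timestamp()`, because the stored value is never anything other than `None` or a registered clock.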
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Cc: Antoine Tenart <antoine.tenart@bootlin.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ionic: keep users rss hash across lif reset · dd561233
      Shannon Nelson authored
      [ Upstream commit ffac2027 ]
      
      If the user has specified their own RSS hash key, don't
      lose it across queue resets such as DOWN/UP, MTU change,
      and number of channels change.  This is fixed by moving
      the key initialization to a little earlier in the lif
      creation.
      
      Also, let's clean up the RSS config a little better on
      the way down by setting it all to 0.
      
      Fixes: aa319881 ("ionic: Add RSS support")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xdp: obtain the mem_id mutex before trying to remove an entry. · 9bd01a33
      Jonathan Lemon authored
      [ Upstream commit 86c76c09 ]
      
      A lockdep splat was observed when trying to remove an xdp memory
      model from the table since the mutex was obtained when trying to
      remove the entry, but not before the table walk started:
      
      Fix the splat by obtaining the lock before starting the table walk.
      
      Fixes: c3f812ce ("page_pool: do not release pool until inflight == 0.")
      Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • page_pool: do not release pool until inflight == 0. · 05f646cb
      Jonathan Lemon authored
      [ Upstream commit c3f812ce ]
      
      The page pool keeps track of the number of pages in flight, and
      it isn't safe to remove the pool until all pages are returned.
      
      Disallow removing the pool until all pages are back, so the pool
      is always available for page producers.
      
      Make the page pool responsible for its own delayed destruction
      instead of relying on XDP, so the page pool can be used without
      the xdp memory model.
      
      When all pages are returned, free the pool and notify xdp if the
      pool is registered with the xdp memory system.  Have the callback
      perform a table walk since some drivers (cpsw) may share the pool
      among multiple xdp_rxq_info.
      
      Note that the increment of pages_state_release_cnt may result in
      inflight == 0, resulting in the pool being released.
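
The deferred-destruction idea can be sketched in Python as follows. This is a simplified model under stated assumptions: the class `PagePool`, its counters, and the `on_free` callback are hypothetical names for this sketch, and the real page pool uses kernel data structures and different mechanics.

```python
class PagePool:
    """Toy model: the pool is only freed once no pages are in flight."""

    def __init__(self, on_free=None):
        self.hold_cnt = 0        # pages handed out
        self.release_cnt = 0     # pages returned
        self.destroy_requested = False
        self.freed = False
        self.on_free = on_free   # e.g. a hook to notify another subsystem

    def inflight(self):
        return self.hold_cnt - self.release_cnt

    def alloc_page(self):
        self.hold_cnt += 1
        return object()

    def return_page(self, page):
        # Incrementing the release counter may bring inflight to 0,
        # which can trigger the actual release of the pool.
        self.release_cnt += 1
        self._maybe_free()

    def destroy(self):
        # Only records the request; the pool stays usable for producers
        # until every page has come back.
        self.destroy_requested = True
        self._maybe_free()

    def _maybe_free(self):
        if self.destroy_requested and not self.freed and self.inflight() == 0:
            self.freed = True
            if self.on_free is not None:
                self.on_free(self)
```

In this model, calling `destroy()` while pages are outstanding does nothing visible; the final `return_page()` is what frees the pool and fires the callback.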
      
      Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: ethtool, Fix analysis of speed setting · 6b2377de
      Aya Levin authored
      [ Upstream commit 3d7cadae ]
      
      When setting the speed to 100G via ethtool (with AN off), only 25G*4 is
      configured, while a user whose hardware supports extended PTYS also
      expects 50G*2 to be configured.
      With this patch, when extended PTYS mode is available, PTYS is configured
      via the extended fields.
      
      Fixes: 4b95840a ("net/mlx5e: Fix matching of speed to PRM link modes")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix translation of link mode into speed · dd544845
      Aya Levin authored
      [ Upstream commit 6d485e5e ]
      
      Add a missing value in translation of PTYS ext_eth_proto_oper to its
      corresponding speed. When ext_eth_proto_oper bit 10 is set, ethtool
      shows unknown speed. With this fix, ethtool shows speed is 100G as
      expected.
      
      Fixes: a08b4ed1 ("net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix freeing flow with kfree() and not kvfree() · 65523f0f
      Roi Dayan authored
      [ Upstream commit a23dae79 ]
      
      Flows are allocated with kzalloc() so free with kfree().
      
      Fixes: 04de7dda ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Eli Britstein <elibr@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix SFF 8472 eeprom length · 2e4e7670
      Eran Ben Elisha authored
      [ Upstream commit c431f859 ]
      
      SFF 8472 eeprom length is 512 bytes. Fix module info return value to
      support 512 bytes read.
      
      Fixes: ace329f4 ("net/mlx5e: ethtool, Remove unsupported SFP EEPROM high pages query")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Aya Levin <ayal@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • act_ct: support asymmetric conntrack · 4e57c233
      Aaron Conole authored
      [ Upstream commit 95219afb ]
      
      The act_ct TC module shares a common conntrack and NAT infrastructure
      exposed via netfilter.  It's possible that a packet needs both SNAT and
      DNAT manipulation, due to e.g. tuple collision.  Netfilter can support
      this because it runs through the NAT table twice - once on ingress and
      again after egress.  The act_ct action doesn't have such capability.
      
      Like netfilter hook infrastructure, we should run through NAT twice to
      keep the symmetry.
      
      Fixes: b57dc7c1 ("net/sched: Introduce action ct")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix TXQ indices to be sequential · 411fdb97
      Eran Ben Elisha authored
      [ Upstream commit c55d8b10 ]
      
      Cited patch changed (channel index, tc) => (TXQ index) mapping to be a
      static one, in order to keep indices consistent when changing number of
      channels or TCs.
      
      For 32 channels (OOB) and 8 TCs, real num of TXQs is 256.
      When reducing the amount of channels to 8, the real num of TXQs will be
      changed to 64.
      This indexing method is buggy:
      - For channel #0, TC 3, the TXQ index is 96, which exceeds the real
        number of TXQs (64).
      - Index 8 is not valid, as there is no such TXQ from driver perspective
        (As it represents channel #8, TC 0, which is not valid with the above
        configuration).
      
      As part of the driver's select queue, it calls netdev_pick_tx, which
      returns an index in the range of the real number of TXQs. Depending on
      the return value, with the examples above, the driver could have returned
      an index larger than the real number of TX queues, or crashed the kernel
      by reading an invalid address of an SQ that was not allocated.
      
      Fix that by allocating sequential TXQ indices, and hold a new mapping
      between (channel index, tc) => (real TXQ index). This mapping will be
      updated as part of priv channels activation, and is used in
      mlx5e_select_queue to find the selected queue index.
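
The two index schemes can be compared with a small Python sketch. The formulas are inferred from the numbers quoted in this message (channel #0, TC 3 giving index 96 with 32 channels) and are an assumption, not code taken from the driver; the function names are hypothetical.

```python
MAX_CHANNELS = 32  # out-of-box channel count from the example


def static_txq(channel, tc, max_channels=MAX_CHANNELS):
    """Static scheme: the stride is the maximum channel count."""
    return channel + tc * max_channels


def sequential_txq(channel, tc, num_channels):
    """Sequential scheme: the stride is the current channel count."""
    return channel + tc * num_channels


num_channels, num_tc = 8, 8
real_num_txqs = num_channels * num_tc  # 64 after reducing to 8 channels

# Static scheme: channel 0, TC 3 lands outside the real TXQ range.
static_index = static_txq(0, 3)

# Sequential scheme: every (channel, tc) pair maps into 0..63 exactly once.
realtxq = {
    (ch, tc): sequential_txq(ch, tc, num_channels)
    for ch in range(num_channels)
    for tc in range(num_tc)
}
```

Under these assumptions the static index for channel 0, TC 3 is 96, while the sequential mapping yields 24 and never leaves the range of real TXQs.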
      
      The existing indices mapping (channel_tc2txq) is no longer needed, as it
      is used only for statistics structures and can be calculated at run time.
      Delete its definition and updates.
      
      Fixes: 8bfaf07f ("net/mlx5e: Present SW stats when state is not opened")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: Fixed updating of ethertype in skb_mpls_push() · cd477d06
      Martin Varghese authored
      [ Upstream commit d04ac224 ]
      
      The skb_mpls_push was not updating ethertype of an ethernet packet if
      the packet was originally received from a non ARPHRD_ETHER device.
      
      In the below OVS data path flow, since the device corresponding to
      port 7 is an l3 device (ARPHRD_NONE) the skb_mpls_push function does
      not update the ethertype of the packet even though the previous
      push_eth action had added an ethernet header to the packet.
      
      recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no),
      actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
      push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4
      
      Fixes: 8822e270 ("net: core: move push MPLS functionality from OvS to core helper")
      Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • hsr: fix a NULL pointer dereference in hsr_dev_xmit() · 10fec3e5
      Taehee Yoo authored
      [ Upstream commit df95467b ]
      
      hsr_dev_xmit() calls hsr_port_get_hsr() to find the master node, which
      returns NULL if the master node does not exist in the list.
      But hsr_dev_xmit() doesn't check the returned pointer, so a NULL
      dereference could occur.
      
      Test commands:
          ip netns add nst
          ip link add veth0 type veth peer name veth1
          ip link add veth2 type veth peer name veth3
          ip link set veth1 netns nst
          ip link set veth3 netns nst
          ip link set veth0 up
          ip link set veth2 up
          ip link add hsr0 type hsr slave1 veth0 slave2 veth2
          ip a a 192.168.100.1/24 dev hsr0
          ip link set hsr0 up
          ip netns exec nst ip link set veth1 up
          ip netns exec nst ip link set veth3 up
          ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3
          ip netns exec nst ip a a 192.168.100.2/24 dev hsr1
          ip netns exec nst ip link set hsr1 up
          hping3 192.168.100.2 -2 --flood &
          modprobe -rv hsr
      
      Splat looks like:
      [  217.351122][ T1635] kasan: CONFIG_KASAN_INLINE enabled
      [  217.352969][ T1635] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [  217.354297][ T1635] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  217.355507][ T1635] CPU: 1 PID: 1635 Comm: hping3 Not tainted 5.4.0+ #192
      [  217.356472][ T1635] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  217.357804][ T1635] RIP: 0010:hsr_dev_xmit+0x34/0x90 [hsr]
      [  217.373010][ T1635] Code: 48 8d be 00 0c 00 00 be 04 00 00 00 48 83 ec 08 e8 21 be ff ff 48 8d 78 10 48 ba 00 b
      [  217.376919][ T1635] RSP: 0018:ffff8880cd8af058 EFLAGS: 00010202
      [  217.377571][ T1635] RAX: 0000000000000000 RBX: ffff8880acde6840 RCX: 0000000000000002
      [  217.379465][ T1635] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000010
      [  217.380274][ T1635] RBP: ffff8880acde6840 R08: ffffed101b440d5d R09: 0000000000000001
      [  217.381078][ T1635] R10: 0000000000000001 R11: ffffed101b440d5c R12: ffff8880bffcc000
      [  217.382023][ T1635] R13: ffff8880bffcc088 R14: 0000000000000000 R15: ffff8880ca675c00
      [  217.383094][ T1635] FS:  00007f060d9d1740(0000) GS:ffff8880da000000(0000) knlGS:0000000000000000
      [  217.384289][ T1635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  217.385009][ T1635] CR2: 00007faf15381dd0 CR3: 00000000d523c001 CR4: 00000000000606e0
      [  217.385940][ T1635] Call Trace:
      [  217.386544][ T1635]  dev_hard_start_xmit+0x160/0x740
      [  217.387114][ T1635]  __dev_queue_xmit+0x1961/0x2e10
      [  217.388118][ T1635]  ? check_object+0xaf/0x260
      [  217.391466][ T1635]  ? __alloc_skb+0xb9/0x500
      [  217.392017][ T1635]  ? init_object+0x6b/0x80
      [  217.392629][ T1635]  ? netdev_core_pick_tx+0x2e0/0x2e0
      [  217.393175][ T1635]  ? __alloc_skb+0xb9/0x500
      [  217.393727][ T1635]  ? rcu_read_lock_sched_held+0x90/0xc0
      [  217.394331][ T1635]  ? rcu_read_lock_bh_held+0xa0/0xa0
      [  217.395013][ T1635]  ? kasan_unpoison_shadow+0x30/0x40
      [  217.395668][ T1635]  ? __kasan_kmalloc.constprop.4+0xa0/0xd0
      [  217.396280][ T1635]  ? __kmalloc_node_track_caller+0x3a8/0x3f0
      [  217.399007][ T1635]  ? __kasan_kmalloc.constprop.4+0xa0/0xd0
      [  217.400093][ T1635]  ? __kmalloc_reserve.isra.46+0x2e/0xb0
      [  217.401118][ T1635]  ? memset+0x1f/0x40
      [  217.402529][ T1635]  ? __alloc_skb+0x317/0x500
      [  217.404915][ T1635]  ? arp_xmit+0xca/0x2c0
      [ ... ]
      
      Fixes: 311633b6 ("hsr: switch ->dellink() to ->ndo_uninit()")
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Fixed updating of ethertype in function skb_mpls_pop · 2cbaf5fb
      Martin Varghese authored
      [ Upstream commit 040b5cfb ]
      
      The skb_mpls_pop was not updating ethertype of an ethernet packet if the
      packet was originally received from a non ARPHRD_ETHER device.
      
      In the below OVS data path flow, since the device corresponding to port 7
      is an l3 device (ARPHRD_NONE) the skb_mpls_pop function does not update
      the ethertype of the packet even though the previous push_eth action had
      added an ethernet header to the packet.
      
      recirc_id(0),in_port(7),eth_type(0x8847),
      mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
      actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
      pop_mpls(eth_type=0x800),4
      
      Fixes: ed246cee ("net: core: move pop MPLS functionality from OvS to core helper")
      Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gre: refetch erspan header from skb->data after pskb_may_pull() · 23fbdd5d
      Cong Wang authored
      [ Upstream commit 0e494092 ]
      
      After pskb_may_pull() we should always refetch the header
      pointers from the skb->data in case it got reallocated.
      
      In gre_parse_header(), the erspan header is still fetched
      from the 'options' pointer which is fetched before
      pskb_may_pull().
      
      Found this during code review of a KMSAN bug report.
      
      Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup")
      Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cls_flower: Fix the behavior using port ranges with hw-offload · 71bc12b1
      Yoshiki Komachi authored
      [ Upstream commit 8ffb055b ]
      
      The recent commit 5c72299f ("net: sched: cls_flower: Classify
      packets using port ranges") added filtering based on port ranges
      to tc flower. However, the commit missed necessary changes in the
      hw-offload code, so the feature generated incorrect offloaded flow
      keys in the NIC.
      
      One more detailed example is below:
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 100-200 action drop
      
      With the setup above, an exact match filter with dst_port == 0 will be
      installed in NIC by hw-offload. IOW, the NIC will have a rule which is
      equivalent to the following one.
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 0 action drop
      
      The behavior was caused by the flow dissector which extracts packet
      data into the flow key in the tc flower. More specifically, regardless
      of exact match or specified port ranges, fl_init_dissector() set the
      FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
      numbers from skb in skb_flow_dissect() called by fl_classify(). Note
      that device drivers received the same struct flow_dissector object as
      used in skb_flow_dissect(). Thus, offloaded drivers could not identify
      which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
      set to struct flow_dissector in either case.
      
      This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
      tp_range field in struct fl_flow_key to recognize which filters are applied
      to offloaded drivers. At this point, when filters based on port ranges
      are passed to drivers, the drivers return the EOPNOTSUPP error because
      they do not support the feature (the newly created
      FLOW_DISSECTOR_KEY_PORTS_RANGE flag).
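
The effect of a separate range key can be modeled with a short Python sketch. The helper names (`dissector_keys`, `driver_offload`, `DRIVER_SUPPORTED_KEYS`) are hypothetical and exist only for this illustration; the two key names are the flags mentioned in this message, represented here as plain strings rather than kernel enum values.

```python
import errno

KEY_PORTS = "FLOW_DISSECTOR_KEY_PORTS"
KEY_PORTS_RANGE = "FLOW_DISSECTOR_KEY_PORTS_RANGE"

# Assumption for the sketch: the driver only understands exact port matches.
DRIVER_SUPPORTED_KEYS = {KEY_PORTS}


def dissector_keys(dst_port=None, dst_port_range=None):
    """Return the set of keys a filter would mark as used."""
    keys = set()
    if dst_port is not None:
        keys.add(KEY_PORTS)
    if dst_port_range is not None:
        keys.add(KEY_PORTS_RANGE)
    return keys


def driver_offload(keys):
    """Refuse any filter that uses a key the driver does not support."""
    if keys - DRIVER_SUPPORTED_KEYS:
        return -errno.EOPNOTSUPP
    return 0


exact = driver_offload(dissector_keys(dst_port=80))
ranged = driver_offload(dissector_keys(dst_port_range=(100, 200)))
```

Because the range filter now carries a distinct key, the modeled driver can reject it instead of silently installing an exact match on port 0.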
      
      Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges")
      Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: sched: allow indirect blocks to bind to clsact in TC · 554d2e14
      John Hurley authored
      [ Upstream commit 25a443f7 ]
      
      When a device is bound to a clsact qdisc, bind events are triggered to
      registered drivers for both ingress and egress. However, if a driver
      registers to such a device using the indirect block routines then it is
      assumed that it is only interested in ingress offload and so only replays
      ingress bind/unbind messages.
      
      The NFP driver supports the offload of some egress filters when
      registering to a block with qdisc of type clsact. However, on unregister,
      if the block is still active, it will not receive an unbind egress
      notification which can prevent proper cleanup of other registered
      callbacks.
      
      Modify the indirect block callback command in TC to send messages of
      ingress and/or egress bind depending on the qdisc in use. NFP currently
      supports egress offload for TC flower offload so the changes are only
      added to TC.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: core: rename indirect block ingress cb function · 1b511a9d
      John Hurley authored
      [ Upstream commit dbad3408 ]
      
      With indirect blocks, a driver can register for callbacks from a device
      that it does not 'own', for example, a tunnel device. When registering to
      or unregistering from a new device, a callback is triggered to generate
      a bind/unbind event. This, in turn, allows the driver to receive any
      existing rules or to properly clean up installed rules.
      
      When first added, it was assumed that all indirect block registrations
      would be for ingress offloads. However, the NFP driver can, in some
      instances, support clsact qdisc binds for egress offload.
      
      Change the name of the indirect block callback command in flow_offload to
      remove the 'ingress' identifier from it. While this does not change
      functionality, a follow-up patch will implement a more generic
      callback than the current ones, which support only ingress offload.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() · ee0dc0c3
      Guillaume Nault authored
      [ Upstream commit 721c8daf ]
      
      Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
      timestamp of the last synflood. Protect them with READ_ONCE() and
      WRITE_ONCE() since reads and writes aren't serialised.
      
      Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
      introduced by a0f82f64 ("syncookies: remove last_synq_overflow from
      struct tcp_sock"). But unprotected accesses were already there when
      timestamp was stored in .last_synq_overflow.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: tighten acceptance of ACKs not matching a child socket · e70ee164
      Guillaume Nault authored
      [ Upstream commit cb44a08f ]
      
      When no synflood occurs, the synflood timestamp isn't updated.
      Therefore it can be so old that time_after32() can consider it to be
      in the future.
      
      That's a problem for tcp_synq_no_recent_overflow() as it may report
      that a recent overflow occurred while, in fact, it's just that jiffies
      has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
      
      Spurious detection of recent overflows leads to extra syncookie
      verification in cookie_v[46]_check(). At that point, the verification
      should fail and the packet should be dropped. But we should have dropped
      the packet earlier as we didn't even send a syncookie.
      
      Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
      only if jiffies is within the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
      way, no spurious recent overflow is reported when jiffies wraps and
      'last_overflow' becomes in the future from the point of view of
      time_after32().
      
      However, if jiffies wraps and enters the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
      'last_overflow' being a stale synflood timestamp), then
      tcp_synq_no_recent_overflow() still erroneously reports an
      overflow. In such cases, we have to rely on syncookie verification
      to drop the packet. We unfortunately have no way to differentiate
      between a fresh and a stale syncookie timestamp.
      
      In practice, using last_overflow as lower bound is problematic.
      If the synflood timestamp is concurrently updated between the time
      we read jiffies and the moment we store the timestamp in
      'last_overflow', then 'now' becomes smaller than 'last_overflow' and
      tcp_synq_no_recent_overflow() returns true, potentially dropping a
      valid syncookie.
      
      Reading jiffies after loading the timestamp could fix the problem,
      but that'd require a memory barrier. Let's just accommodate for
      potential timestamp growth instead and extend the interval using
      'last_overflow - HZ' as lower bound.
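
The interval test described in this message can be modeled in Python with 32-bit wrap-around arithmetic. This is a sketch under stated assumptions: HZ=1000 and TCP_SYNCOOKIE_VALID=120*HZ are example values, and the helper `time_between32` and the function `no_recent_overflow` are written for this illustration rather than copied from the kernel.

```python
HZ = 1000
TCP_SYNCOOKIE_VALID = 120 * HZ
U32 = 0xFFFFFFFF


def s32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= U32
    return x - (1 << 32) if x & (1 << 31) else x


def time_after32(a, b):
    """True if a is after b in wrap-around 32-bit time."""
    return s32(b - a) < 0


def time_between32(t, low, high):
    """True if t lies in [low, high], modulo 2**32."""
    return ((high - low) & U32) >= ((t - low) & U32)


def old_no_recent_overflow(now, last_overflow):
    # Single-sided check: fooled once now is more than 2**31 past the bound.
    return time_after32(now, last_overflow + TCP_SYNCOOKIE_VALID)


def no_recent_overflow(now, last_overflow):
    # Two-sided check with 'last_overflow - HZ' as the lower bound.
    return not time_between32(now, last_overflow - HZ,
                              last_overflow + TCP_SYNCOOKIE_VALID)


last_overflow = 5000
stale_now = last_overflow + TCP_SYNCOOKIE_VALID + 2**31 + 5
```

With a stale timestamp (`stale_now`), the single-sided check wrongly reports a recent overflow, while the interval check does not. A `now` slightly smaller than `last_overflow`, as in the concurrent-update case, still counts as a recent overflow thanks to the extended lower bound.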

      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: fix rejected syncookies due to stale timestamps · 9afe6901
      Guillaume Nault authored
      [ Upstream commit 04d26e7b ]
      
      If no synflood happens for a long enough period of time, then the
      synflood timestamp isn't refreshed and jiffies can advance so much
      that time_after32() can't accurately compare them any more.
      
      Therefore, we can end up in a situation where time_after32(now,
      last_overflow + HZ) returns false, just because these two values are
      too far apart. In that case, the synflood timestamp isn't updated as
      it should be, which can trick tcp_synq_no_recent_overflow() into
      rejecting valid syncookies.
      
      For example, let's consider the following scenario on a system
      with HZ=1000:
      
        * The synflood timestamp is 0, either because that's the timestamp
          of the last synflood or, more commonly, because we're working with
          a freshly created socket.
      
        * We receive a new SYN, which triggers synflood protection. Let's say
          that this happens when jiffies == 2147484649 (that is,
          'synflood timestamp' + HZ + 2^31 + 1).
      
        * Then tcp_synq_overflow() doesn't update the synflood timestamp,
          because time_after32(2147484649, 1000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 1000: the value of 'last_overflow' + HZ.
      
        * A bit later, we receive the ACK completing the 3WHS. But
          cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
          says that we're not under synflood. That's because
          time_after32(2147484649, 120000) returns true.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
      
          Of course, in reality jiffies would have increased a bit, but this
          condition will last for the next 119 seconds, which is long enough
          to accommodate the growth of jiffies.
      
      Fix this by updating the overflow timestamp whenever jiffies isn't
      within the [last_overflow, last_overflow + HZ] range. That shouldn't
      have any performance impact since the update still happens at most once
      per second.
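
The arithmetic of the scenario can be checked with a small Python model of 32-bit wrap-around comparisons. The values come from the example in this message (HZ=1000, synflood timestamp 0, jiffies 2147484649); the helper `time_between32` and the variable names are assumptions of this sketch, not the kernel's code.

```python
HZ = 1000
TCP_SYNCOOKIE_VALID = 120 * HZ  # 120000 in the example
U32 = 0xFFFFFFFF


def s32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= U32
    return x - (1 << 32) if x & (1 << 31) else x


def time_after32(a, b):
    """True if a is after b in wrap-around 32-bit time."""
    return s32(b - a) < 0


def time_between32(t, low, high):
    """True if t lies in [low, high], modulo 2**32."""
    return ((high - low) & U32) >= ((t - low) & U32)


now = 2147484649      # 'synflood timestamp' + HZ + 2**31 + 1
last_overflow = 0

# Old update condition: the two values are too far apart to compare.
old_updates = time_after32(now, last_overflow + HZ)

# In this model, the ACK check sees 'now' as after the validity window,
# which is what makes the syncookie look stale.
looks_expired = time_after32(now, last_overflow + TCP_SYNCOOKIE_VALID)

# Range-based update condition: refresh whenever 'now' is outside
# [last_overflow, last_overflow + HZ].
new_updates = not time_between32(now, last_overflow, last_overflow + HZ)
```

In this model `old_updates` is false, so the stale timestamp is kept, whereas `new_updates` is true and the timestamp would be refreshed.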
      
      Now we're guaranteed to have fresh timestamps while under synflood, so
      tcp_synq_no_recent_overflow() can safely use it with time_after32() in
      such situations.
      
      Stale timestamps can still make tcp_synq_no_recent_overflow() return
      the wrong verdict when not under synflood. This will be handled in the
      next patch.
      
      For 64-bit architectures, the problem was introduced with the
      conversion of ->tw_ts_recent_stamp to a 32-bit integer by commit
      cca9bab1 ("tcp: use monotonic timestamps for PAWS").
      The problem has always been there on 32-bit architectures.
      
      Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>