- 06 Jun, 2016 1 commit
-
-
Michal Kubeček authored
Before commit 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting"), setting the reassembly high threshold to 0 prevented fragment reassembly as first fragment would be always evicted before second could be added to the queue. While inefficient, some users apparently relied on this method. Since the commit mentioned above, a percpu counter is used for reassembly memory accounting and high batch size avoids taking slow path in most common scenarios. As a result, a whole full sized packet can be reassembled without the percpu counter's main counter changing its value so that even with high_thresh set to 0, fragmented packets can be still reassembled and processed. Add explicit check preventing reassembly if high threshold is zero. Signed-off-by:
Michal Kubecek <mkubecek@suse.cz> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 06 Jan, 2016 1 commit
-
-
Florian Westphal authored
The only user was removed in commit 029f7f3b ("netfilter: ipv6: nf_defrag: avoid/free clone operations"). Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 03 Nov, 2015 1 commit
-
-
Eric Dumazet authored
This patch fixes following problems : 1) percpu_counter_init() can return an error, therefore init_frag_mem_limit() must propagate this error so that inet_frags_init_net() can do the same up to its callers. 2) If ip[46]_frags_ns_ctl_register() fail, we must unwind properly and free the percpu_counter. Without this fix, we leave freed object in percpu_counters global list (if CONFIG_HOTPLUG_CPU) leading to crashes. This bug was detected by KASAN and syzkaller tool (http://github.com/google/syzkaller) Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
Dmitry Vyukov <dvyukov@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 27 Jul, 2015 4 commits
-
-
Nikolay Aleksandrov authored
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags race conditions with the evictor and use a participation test for the evictor list, when we're at that point (after inet_frag_kill) in the timer there're 2 possible cases: 1. The evictor added the entry to its evictor list while the timer was waiting for the chainlock or 2. The timer unchained the entry and the evictor won't see it In both cases we should be able to see list_evictor correctly due to the sync on the chainlock. Joint work with Florian Westphal. Tested-by:
Frank Schreuder <fschreuder@transip.nl> Signed-off-by:
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Frank reports 'NMI watchdog: BUG: soft lockup' errors when load is high. Instead of (potentially) unbounded restarts of the eviction process, just skip to the next entry. One caveat is that, when a netns is exiting, a timer may still be running by the time inet_evict_bucket returns. We use the frag memory accounting to wait for outstanding timers, so that when we free the percpu counter we can be sure no running timer will trip over it. Reported-and-tested-by:
Frank Schreuder <fschreuder@transip.nl> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Followup patch will call it after inet_frag_queue was freed, so q->net doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would). Tested-by:
Frank Schreuder <fschreuder@transip.nl> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
commit 65ba1f1e ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Reported-by:
Johan Schuijt <johan@transip.nl> Tested-by:
Frank Schreuder <fschreuder@transip.nl> Signed-off-by:
Nikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 03 Apr, 2015 1 commit
-
-
Ian Morris authored
The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by:
Ian Morris <ipm@chirality.org.uk> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 11 Nov, 2014 1 commit
-
-
Joe Perches authored
Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by:
Joe Perches <joe@perches.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 29 Oct, 2014 2 commits
-
-
Nikolay Aleksandrov authored
The WARN_ON in inet_evict_bucket can be triggered by a valid case: inet_frag_kill and inet_evict_bucket can be running in parallel on the same queue which means that there has been at least one more ref added by a previous inet_frag_find call, but inet_frag_kill can delete the timer before inet_evict_bucket which will cause the WARN_ON() there to trigger since we'll have refcnt!=1. Now, this case is valid because the queue is being "killed" for some reason (removed from the chain list and its timer deleted) so it will get destroyed in the end by one of the inet_frag_put() calls which reaches 0 i.e. refcnt is still valid. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Reported-by:
Patrick McLean <chutzpah@gentoo.org> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
When the evictor is running it adds some chosen frags to a local list to be evicted once the chain lock has been released but at the same time the *frag_queue can be running for some of the same queues and it may call inet_frag_kill which will wait on the chain lock and will then delete the queue from the wrong list since it was added in the eviction one. The fix is simple - check if the queue has the evict flag set under the chain lock before deleting it, this is safe because the evict flag is set only under that lock and having the flag set also means that the queue has been detached from the chain list, so no need to delete it again. An important note to make is that we're safe w.r.t refcnt because inet_frag_kill and inet_evict_bucket will sync on the del_timer operation where only one of the two can succeed (or if the timer is executing - none of them), the cases are: 1. inet_frag_kill succeeds in del_timer - then the timer ref is removed, but inet_evict_bucket will not add this queue to its expire list but will restart eviction in that chain 2. inet_evict_bucket succeeds in del_timer - then the timer ref is kept until the evictor "expires" the queue, but inet_frag_kill will remove the initial ref and will set INET_FRAG_COMPLETE which will make the frag_expire fn just to remove its ref. In the end all of the queue users will do an inet_frag_put and the one that reaches 0 will free it. The refcount balance should be okay. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Suggested-by:
Eric Dumazet <eric.dumazet@gmail.com> Reported-by:
Patrick McLean <chutzpah@gentoo.org> Tested-by:
Patrick McLean <chutzpah@gentoo.org> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 02 Aug, 2014 4 commits
-
-
Nikolay Aleksandrov authored
Use kmem_cache to allocate/free inet_frag_queue objects since they're all the same size per inet_frags user and are alloced/freed in high volumes thus making it a perfect case for kmem_cache. Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Acked-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
Now that we have INET_FRAG_EVICTED we might as well use it to stop sending icmp messages in the "frag_expire" functions instead of stripping INET_FRAG_FIRST_IN from their flags when evicting. Also fix the comment style in ip6_expire_frag_queue(). Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
Fix a couple of functions' declaration alignments. Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
The last_in field has been used to store various flags different from first/last frag in so give it a more descriptive name: flags. Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 28 Jul, 2014 7 commits
-
-
Florian Westphal authored
rehash is rare operation, don't force readers to take the read-side rwlock. Instead, we only have to detect the (rare) case where the secret was altered while we are trying to insert a new inetfrag queue into the table. If it was changed, drop the bucket lock and recompute the hash to get the 'new' chain bucket that we have to insert into. Joint work with Nikolay Aleksandrov. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
merge functionality into the eviction workqueue. Instead of rebuilding every n seconds, take advantage of the upper hash chain length limit. If we hit it, mark table for rebuild and schedule workqueue. To prevent frequent rebuilds when we're completely overloaded, don't rebuild more than once every 5 seconds. ipfrag_secret_interval sysctl is now obsolete and has been marked as deprecated, it still can be changed so scripts won't be broken but it won't have any effect. A comment is left above each unused secret_timer variable to avoid confusion. Joint work with Nikolay Aleksandrov. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
no longer used. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
The 'nqueues' counter is protected by the lru list lock, once thats removed this needs to be converted to atomic counter. Given this isn't used for anything except for reporting it to userspace via /proc, just remove it. We still report the memory currently used by fragment reassembly queues. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
When the high_thresh limit is reached we try to toss the 'oldest' incomplete fragment queues until memory limits are below the low_thresh value. This happens in softirq/packet processing context. This has two drawbacks: 1) processors might evict a queue that was about to be completed by another cpu, because they will compete wrt. resource usage and resource reclaim. 2) LRU list maintenance is expensive. But when constantly overloaded, even the 'least recently used' element is recent, so removing 'lru' queue first is not 'fairer' than removing any other fragment queue. This moves eviction out of the fast path: When the low threshold is reached, a work queue is scheduled which then iterates over the table and removes the queues that exceed the memory limits of the namespace. It sets a new flag called INET_FRAG_EVICTED on the evicted queues so the proper counters will get incremented when the queue is forcefully expired. When the high threshold is reached, no more fragment queues are created until we're below the limit again. The LRU list is now unused and will be removed in a followup patch. Joint work with Nikolay Aleksandrov. Suggested-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
First step to move eviction handling into a work queue. We lose two spots that accounted evicted fragments in MIB counters. Accounting will be restored since the upcoming work-queue evictor invokes the frag queue timer callbacks instead. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
hide actual hash size from individual users: The _find function will now fold the given hash value into the required range. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 06 Mar, 2014 2 commits
-
-
Florian Westphal authored
Quoting Alexander Aring: While fragmentation and unloading of 6lowpan module I got this kernel Oops after few seconds: BUG: unable to handle kernel paging request at f88bbc30 [..] Modules linked in: ipv6 [last unloaded: 6lowpan] Call Trace: [<c012af4c>] ? call_timer_fn+0x54/0xb3 [<c012aef8>] ? process_timeout+0xa/0xa [<c012b66b>] run_timer_softirq+0x140/0x15f Problem is that incomplete frags are still around after unload; when their frag expire timer fires, we get crash. When a netns is removed (also done when unloading module), inet_frag calls the evictor with 'force' argument to purge remaining frags. The evictor loop terminates when accounted memory ('work') drops to 0 or the lru-list becomes empty. However, the mem accounting is done via percpu counters and may not be accurate, i.e. loop may terminate prematurely. Alter evictor to only stop once the lru list is empty when force is requested. Reported-by:
Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Reported-by:
Alexander Aring <alex.aring@gmail.com> Tested-by:
Alexander Aring <alex.aring@gmail.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
I stumbled upon this very serious bug while hunting for another one, it's a very subtle race condition between inet_frag_evictor, inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically the users of inet_frag_kill/inet_frag_put). What happens is that after a fragment has been added to the hash chain but before it's been added to the lru_list (inet_frag_lru_add) in inet_frag_intern, it may get deleted (either by an expired timer if the system load is high or the timer sufficiently low, or by the fraq_queue function for different reasons) before it's added to the lru_list, then after it gets added it's a matter of time for the evictor to get to a piece of memory which has been freed leading to a number of different bugs depending on what's left there. I've been able to trigger this on both IPv4 and IPv6 (which is normal as the frag code is the same), but it's been much more difficult to trigger on IPv4 due to the protocol differences about how fragments are treated. The setup I used to reproduce this is: 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be seen on multiple cards at the same time. Then I used multiple instances of ping/ping6 to generate fragmented packets and flood the machines with them while running other processes to load the attacked machine. *It is very important to have the _same flow_ coming in on multiple CPUs concurrently. Usually the attacked machine would die in less than 30 minutes, if configured properly to have many evictor calls and timeouts it could happen in 10 minutes or so. An important point to make is that any caller (frag_queue or timer) of inet_frag_kill will remove both the timer refcount and the original/guarding refcount thus removing everything that's keeping the frag from being freed at the next inet_frag_put. All of this could happen before the frag was ever added to the LRU list, then it gets added and the evictor uses a freed fragment. An example for IPv6 would be if a fragment is being added and is at the stage of being inserted in the hash after the hash lock is released, but before inet_frag_lru_add executes (or is able to obtain the lru lock) another overlapping fragment for the same flow arrives at a different CPU which finds it in the hash, but since it's overlapping it drops it invoking inet_frag_kill and thus removing all guarding refcounts, and afterwards freeing it by invoking inet_frag_put which removes the last refcount added previously by inet_frag_find, then inet_frag_lru_add gets executed by inet_frag_intern and we have a freed fragment in the lru_list. The fix is simple, just move the lru_add under the hash chain locked region so when a removing function is called it'll have to wait for the fragment to be added to the lru_list, and then it'll remove it (it works because the hash chain removal is done before the lru_list one and there's no window between the two list adds when the frag can get dropped). With this fix applied I couldn't kill the same machine in 24 hours with the same setup. Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of rwlock") CC: Florian Westphal <fw@strlen.de> CC: Jesper Dangaard Brouer <brouer@redhat.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by:
Nikolay Aleksandrov <nikolay@redhat.com> Acked-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 23 Oct, 2013 1 commit
-
-
Hannes Frederic Sowa authored
All fragmentation hash secrets now get initialized by their corresponding hash function with net_get_random_once. Thus we can eliminate the initial seeding. Also provide a comment that hash secret seeding happens at the first call to the corresponding hashing function. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 03 Jul, 2013 1 commit
-
-
Jiang Liu authored
The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by:
Jiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 20 Jun, 2013 1 commit
-
-
Rami Rosen authored
This patch removes an empty ifdef from inet_frag_intern() in net/ipv4/inet_fragment.c. commit b67bfe0d (hlist: drop the node parameter from iterators) removed hlist from net/ipv4/inet_fragment.c, but did not remove the enclosing ifdef command, which is now empty. Signed-off-by:
Rami Rosen <ramirose@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 06 May, 2013 1 commit
-
-
Konstantin Khlebnikov authored
This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add() which was introduced in commit 3ef0eb0d ("net: frag, move LRU list maintenance outside of rwlock") One cpu already added new fragment queue into hash but not into LRU. Other cpu found it in hash and tries to move it to the end of LRU. This leads to NULL pointer dereference inside of list_move_tail(). Another possible race condition is between inet_frag_lru_move() and inet_frag_lru_del(): move can happens after deletion. This patch initializes LRU list head before adding fragment into hash and inet_frag_lru_move() doesn't touches it if it's empty. I saw this kernel oops two times in a couple of days. [119482.128853] BUG: unable to handle kernel NULL pointer dereference at (null) [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0 [119482.140221] Oops: 0000 [#1] SMP [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii [119482.152692] CPU 3 [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D [119482.161478] RIP: 0010:[<ffffffff812ede89>] [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.166004] RSP: 0018:ffff880216d5db58 EFLAGS: 00010207 [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200 [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00 [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014 [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00 [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0 [119482.194140] FS: 00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000 [119482.198928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0 [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0) [119482.223113] Stack: [119482.228004] ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001 [119482.233038] ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000 [119482.238083] 00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00 [119482.243090] Call Trace: [119482.248009] [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10 [119482.252921] [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0 [119482.257803] [<ffffffff8154485b>] nf_iterate+0x8b/0xa0 [119482.262658] [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40 [119482.267527] [<ffffffff815448e4>] nf_hook_slow+0x74/0x130 [119482.272412] [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40 [119482.277302] [<ffffffff8155d068>] ip_rcv+0x268/0x320 [119482.282147] [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0 [119482.286998] [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60 [119482.291826] [<ffffffff8151a650>] process_backlog+0xa0/0x160 [119482.296648] [<ffffffff81519f29>] net_rx_action+0x139/0x220 [119482.301403] [<ffffffff81053707>] __do_softirq+0xe7/0x220 [119482.306103] [<ffffffff81053868>] run_ksoftirqd+0x28/0x40 [119482.310809] [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0 [119482.315515] [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40 [119482.320219] [<ffffffff8106d870>] kthread+0xc0/0xd0 [119482.324858] [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40 [119482.329460] [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0 [119482.334057] [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40 [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08 [119482.343787] RIP [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.348675] RSP <ffff880216d5db58> [119482.353493] CR2: 0000000000000000 Oops happened on this path: ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry() Signed-off-by:
Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Florian Westphal <fw@strlen.de> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Acked-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 04 Apr, 2013 1 commit
-
-
Jesper Dangaard Brouer authored
This patch implements per hash bucket locking for the frag queue hash. This removes two write locks, and the only remaining write lock is for protecting hash rebuild. This essentially reduce the readers-writer lock to a rebuild lock. This patch is part of "net: frag performance followup" http://thread.gmane.org/gmane.linux.network/263644 of which two patches have already been accepted: Same test setup as previous: (http://thread.gmane.org/gmane.linux.network/257155) Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses Ethernet flow-control. A third interface is used for generating the DoS attack (with trafgen). Notice, I have changed the frag DoS generator script to be more efficient/deadly. Before it would only hit one RX queue, now its sending packets causing multi-queue RX, due to "better" RX hashing. Test types summary (netperf UDP_STREAM): Test-20G64K == 2x10G with 65K fragments Test-20G3F == 2x10G with 3x fragments (3*1472 bytes) Test-20G64K+DoS == Same as 20G64K with frag DoS Test-20G3F+DoS == Same as 20G3F with frag DoS Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS When I rebased this-patch(03) (on top of net-next commit a210576c) and removed the _bh spinlock, I saw a performance regression. BUT this was caused by some unrelated change in-between. See tests below. Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de. Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02. Test (C) is what I reported before for this-patch Test (D) is net-next master HEAD (commit a210576c), which reveals some (unknown) performance regression (compared against test (B)). Test (D) function as a new base-test. Performance table summary (in Mbit/s): (#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ ---------- ------- ------- ---------- --------- -------- ------- (A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2 (B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2 (C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6 (D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2 (E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6 (F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3 Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks. I cannot explain the slow down for 20G64K (but its an artificial "lab-test" so I'm not worried). But the other results does show improvements. And test (E) "with _bh" version is slightly better. Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by:
Eric Dumazet <edumazet@google.com> ---- V2: - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't need the spinlock _bh versions, as Netfilter currently does a local_bh_disable() before entering inet_fragment. - Fold-in desc from cover-mail V3: - Drop the chain_len counter per hash bucket. Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 27 Mar, 2013 2 commits
-
-
Jesper Dangaard Brouer authored
Move the protection of netns_frags.nqueues updates under the LRU_lock, instead of the write lock. As they are located on the same cacheline, and this is also needed when transitioning to use per hash bucket locking. Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesper Dangaard Brouer authored
The LRU list is protected by its own lock, since commit 3ef0eb0d (net: frag, move LRU list maintenance outside of rwlock), and no-longer by a read_lock. This makes it possible, to remove the inet_frag_queue, which is about to be "evicted", from the LRU list head. This avoids the problem, of several CPUs grabbing the same frag queue. Note, cannot remove the inet_frag_lru_del() call in fq_unlink() called by inet_frag_kill(), because inet_frag_kill() is also used in other situations. Thus, we use list_del_init() to allow this double list_del to work. Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 24 Mar, 2013 1 commit
-
-
Hannes Frederic Sowa authored
This patch just moves some code arround to make the ip4_frag_ecn_table and IPFRAG_ECN_* constants accessible from the other reassembly engines. I also renamed ip4_frag_ecn_table to ip_frag_ecn_table. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by:
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 19 Mar, 2013 1 commit
-
-
Hannes Frederic Sowa authored
This patch introduces a constant limit of the fragment queue hash table bucket list lengths. Currently the limit 128 is choosen somewhat arbitrary and just ensures that we can fill up the fragment cache with empty packets up to the default ip_frag_high_thresh limits. It should just protect from list iteration eating considerable amounts of cpu. If we reach the maximum length in one hash bucket a warning is printed. This is implemented on the caller side of inet_frag_find to distinguish between the different users of inet_fragment.c. I dropped the out of memory warning in the ipv4 fragment lookup path, because we already get a warning by the slab allocator. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Signed-off-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 28 Feb, 2013 1 commit
-
-
Sasha Levin authored
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist...
-
- 29 Jan, 2013 3 commits
-
-
Jesper Dangaard Brouer authored
Updating the fragmentation queues LRU (Least-Recently-Used) list, required taking the hash writer lock. However, the LRU list isn't tied to the hash at all, so we can use a separate lock for it. Original-idea-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesper Dangaard Brouer authored
Replace the per network namespace shared atomic "mem" accounting variable, in the fragmentation code, with a lib/percpu_counter. Getting percpu_counter to scale to the fragmentation code usage requires some tweaks. At first view, percpu_counter looks superfast, but it does not scale on multi-CPU/NUMA machines, because the default batch size is too small, for frag code usage. Thus, I have adjusted the batch size by using __percpu_counter_add() directly, instead of percpu_counter_sub() and percpu_counter_add(). The batch size is increased to 130.000, based on the largest 64K fragment memory usage. This does introduce some imprecise memory accounting, but its does not need to be strict for this use-case. It is also essential, that the percpu_counter, does not share cacheline with other writers, to make this scale. Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesper Dangaard Brouer authored
This change is primarily a preparation to ease the extension of memory limit tracking. The change does reduce the number atomic operation, during freeing of a frag queue. This does introduce a some performance improvement, as these atomic operations are at the core of the performance problems seen on NUMA systems. Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 19 Sep, 2012 1 commit
-
-
Amerigo Wang authored
Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Signed-off-by:
Cong Wang <amwang@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 08 Jun, 2012 1 commit
-
-
Gao feng authored
add struct net as a parameter of inet_getpeer_v[4,6], use net to replace &init_net. and modify some places to provide net for inet_getpeer_v[4,6] Signed-off-by:
Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- 12 Jul, 2010 1 commit
-
-
Eric Dumazet authored
CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by:
Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-