- 14 Feb, 2020 1 commit
-
-
Jens Axboe authored
Carter reported an issue where he could produce a stall on ring exit, when we're cleaning up requests that match the given file table. For this particular test case, a combination of a few things caused the issue: - The cq ring was overflown - The request being canceled was in the overflow list The combination of the above means that the cq overflow list holds a reference to the request. The request is canceled correctly, but since the overflow list holds a reference to it, the final put won't happen. Since the final put doesn't happen, the request remains in the inflight. Hence we never finish the cancelation flush. Fix this by removing requests from the overflow list if we're canceling them. Cc: stable@vger.kernel.org # 5.5 Reported-by:
Carter Li 李通洲 <carter.li@eoitek.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 13 Feb, 2020 1 commit
-
-
Jens Axboe authored
Glauber reports a crash on init on a box he has: RIP: 0010:__alloc_pages_nodemask+0x132/0x340 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0 FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: alloc_slab_page+0x46/0x320 new_slab+0x9d/0x4e0 ___slab_alloc+0x507/0x6a0 ? io_wq_create+0xb4/0x2a0 __slab_alloc+0x1c/0x30 kmem_cache_alloc_node_trace+0xa6/0x260 io_wq_create+0xb4/0x2a0 io_uring_setup+0x97f/0xaa0 ? io_remove_personalities+0x30/0x30 ? io_poll_trigger_evfd+0x30/0x30 do_syscall_64+0x5b/0x1c0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4d116cb1ed which is due to the 'wqe' and 'worker' allocation being node affine. But it isn't valid to call the node affine allocation if the node isn't online. Setup structures for even offline nodes, as usual, but skip them in terms of thread setup to not waste resources. If the node isn't online, just alloc memory with NUMA_NO_NODE. Reported-by:
Glauber Costa <glauber@scylladb.com> Tested-by:
Glauber Costa <glauber@scylladb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 09 Feb, 2020 4 commits
-
-
Jens Axboe authored
Jonas reports that he sometimes sees -97/-22 error returns from sendmsg, if it gets punted async. This is due to not retaining the sockaddr_storage between calls. Include that in the state we copy when going async. Cc: stable@vger.kernel.org # 5.3+ Reported-by:
Jonas Bonn <jonas@norrbonn.se> Tested-by:
Jonas Bonn <jonas@norrbonn.se> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Normally we cancel all work we track, but for untracked work we could leave the async worker behind until that work completes. This is totally fine, but does leave resources pending after the task is gone until that work completes. Cancel work that this task queued up when it goes away. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Add a helper that allows the caller to cancel work based on what mm it belongs to. This allows io_uring to cancel work from a given task or thread when it exits. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We want to use the cancel functionality for canceling based on not just the work itself. Instead of matching on the work address manually, allow a match handler to tell us if we found the right work item or not. No functional changes in this patch. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 08 Feb, 2020 15 commits
-
-
Hans de Goede authored
VirtualBox hosts can share folders with guests, this commit adds a VFS driver implementing the Linux-guest side of this, allowing folders exported by the host to be mounted under Linux. This driver depends on the guest <-> host IPC functions exported by the vboxguest driver. Acked-by:
Christoph Hellwig <hch@infradead.org> Signed-off-by:
Hans de Goede <hdegoede@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Pavel Begunkov authored
As in the previous patch, make openat*_prep() and statx_prep() handle double preparation to avoid resource leakage. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Requests may be prepared multiple times with ->io allocated (i.e. async prepared). Preparation functions don't handle it and forget about previously allocated resources. This may happen in case of: - spurious defer_check - non-head (i.e. async prepared) request executed in sync (via nxt). Make the handlers check, whether they already allocated resources, which is true IFF REQ_F_NEED_CLEANUP is set. Cc: stable@vger.kernel.org # 5.5 Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
First, io_close() misses filp_close() and io_cqring_add_event(), when f_op->flush is defined. That's because in this case it will io_queue_async_work() itself not grabbing files, so the corresponding chunk in io_close_finish() won't be executed. Second, when submitted through io_wq_submit_work(), it will do filp_close() and *_add_event() twice: first inline in io_close(), and the second one in call to io_close_finish() from io_close(). The second one will also fire, because it was submitted async through generic path, and so have grabbed files. And the last nice thing is to remove this weird pilgrimage with checking work/old_work and casting it to nxt. Just use a helper instead. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Don't just check for dirfd == -1, we should allow AT_FDCWD as well for relative lookups. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This passes it in to io-wq, so it assumes the right fs_struct when executing async work that may need to do lookups. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Some work items need this for relative path lookup, make it available like the other inherited credentials/mm/etc. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a raw block device, this yields an -EOPNOTSUPP return, as non-blocking writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry from blocking context with IOCB_NOWAIT cleared. Cc: stable@vger.kernel.org # 5.5 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
openat() and statx() may have allocated ->open.filename, which should be be put. Add cleanup handlers for them. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Allocated iovec is freed only in io_{read,write,send,recv)(), and just leaves it if an error occured. There are plenty of such cases: - cancellation of non-head requests - fail grabbing files in __io_queue_sqe() - set REQ_F_NOWAIT and returning in __io_queue_sqe() Add REQ_F_NEED_CLEANUP, which will force such requests with custom allocated resourses go through cleanup handlers on put. Cc: stable@vger.kernel.org # 5.5 Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
struct io_async_open is unused, remove it. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Stefano Garzarella authored
In io_uring_poll() we must flush overflowed CQ events before to check if there are CQ events available, to avoid missing events. We call the io_cqring_events() that checks and flushes any overflow and returns the number of CQ events available. Signed-off-by:
Stefano Garzarella <sgarzare@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
All of these opcodes take a directory file descriptor. We can't easily support fixed files for these operations, and the use case for that probably isn't all that clear (or sensible) anyway. Disable IOSQE_FIXED_FILE for these operations. Reported-by:
Stefan Metzmacher <metze@samba.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Linus Torvalds authored
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space). While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication. In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe. This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token. A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it): #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader. But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows 231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches 14.621273891 seconds time elapsed 0.018243000 seconds user 3.611468000 seconds sys before this commit. After this commit, I get 5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches 1.702148350 seconds time elapsed 0.004868000 seconds user 0.110786000 seconds sys instead. Much better. [ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs."). But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ] Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver. All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug" Tested-by:
Josh Triplett <josh@joshtriplett.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Arnd Bergmann authored
My final cleanup patch for sys_compat_ioctl() introduced a regression on the FIONREAD ioctl command, which is used for both regular and special files, but only works on regular files after my patch, as I had missed the warning that Al Viro put into a comment right above it. Change it back so it can work on any file again by moving the implementation to do_vfs_ioctl() instead. Fixes: 77b90401 ("compat_ioctl: simplify the implementation") Reported-and-tested-by:
Christian Zigotzky <chzigotzky@xenosoft.de> Reported-and-tested-by:
youling257 <youling257@gmail.com> Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
- 07 Feb, 2020 19 commits
-
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Don't bother with "mixed" options that would allow both the form with and without argument (i.e. both -o foo and -o foo=bar). Rather than trying to shove both into a single fs_parameter_spec, allow having with-argument and no-argument specs with the same name and teach fs_parse to handle that. There are very few options of that sort, and they are actually easier to handle that way - callers end up with less postprocessing. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
The former contains nothing but a pointer to an array of the latter... Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Eric Sandeen authored
Unused now. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Acked-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
... turning it into struct p_log embedded into fs_context. Initialize the prefix with fs_type->name, turning fs_parse() into a trivial inline wrapper for __fs_parse(). This makes fs_parameter_description->name completely unused. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
... and now errorf() et.al. are never called with NULL fs_context, so we can get rid of conditional in those. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
fs_parse() analogue taking p_log instead of fs_context. fs_parse() turned into a wrapper, callers in ceph_common and rbd switched to __fs_parse(). As the result, fs_parse() never gets NULL fs_context and neither do fs_context-based logging primitives Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Its behaviour is identical to that of fs_value_is_filename. It makes no sense, anyway - LOOKUP_EMPTY affects nothing whatsoever once the pathname has been imported from userland. And both fs_value_is_filename and fs_value_is_filename_empty carry an already imported pathname. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Have the arrays of constant_table self-terminated (by NULL ->name in the final entry). Simplifies lookup_constant() and allows to reuse the search for enum params as well. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Chen Zhou authored
Fix sparse warning: fs/nfsd/filecache.c:55:25: warning: symbol 'nfsd_filecache_wq' was not declared. Should it be static? Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
Chen Zhou <chenzhou10@huawei.com> Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
Damien Le Moal authored
zonefs is a very simple file system exposing each zone of a zoned block device as a file. Unlike a regular file system with zoned block device support (e.g. f2fs), zonefs does not hide the sequential write constraint of zoned block devices to the user. Files representing sequential write zones of the device must be written sequentially starting from the end of the file (append only writes). As such, zonefs is in essence closer to a raw block device access interface than to a full featured POSIX file system. The goal of zonefs is to simplify the implementation of zoned block device support in applications by replacing raw block device file accesses with a richer file API, avoiding relying on direct block device file ioctls which may be more obscure to developers. One example of this approach is the implementation of LSM (log-structured merge) tree structures (such as used in RocksDB and LevelDB) on zoned block devices by allowing SSTables to be stored in a zone file similarly to a regular file system rather than as a range of sectors of a zoned device. The introduction of the higher level construct "one file is one zone" can help reducing the amount of changes needed in the application as well as introducing support for different application programming languages. Zonefs on-disk metadata is reduced to an immutable super block to persistently store a magic number and optional feature flags and values. On mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration and populates the mount point with a static file tree solely based on this information. E.g. file sizes come from the device zone type and write pointer offset managed by the device itself. The zone files created on mount have the following characteristics. 1) Files representing zones of the same type are grouped together under a common sub-directory: * For conventional zones, the sub-directory "cnv" is used. * For sequential write zones, the sub-directory "seq" is used. These two directories are the only directories that exist in zonefs. Users cannot create other directories and cannot rename nor delete the "cnv" and "seq" sub-directories. 2) The name of zone files is the number of the file within the zone type sub-directory, in order of increasing zone start sector. 3) The size of conventional zone files is fixed to the device zone size. Conventional zone files cannot be truncated. 4) The size of sequential zone files represent the file's zone write pointer position relative to the zone start sector. Truncating these files is allowed only down to 0, in which case, the zone is reset to rewind the zone write pointer position to the start of the zone, or up to the zone size, in which case the file's zone is transitioned to the FULL state (finish zone operation). 5) All read and write operations to files are not allowed beyond the file zone size. Any access exceeding the zone size is failed with the -EFBIG error. 6) Creating, deleting, renaming or modifying any attribute of files and sub-directories is not allowed. 7) There are no restrictions on the type of read and write operations that can be issued to conventional zone files. Buffered, direct and mmap read & write operations are accepted. For sequential zone files, there are no restrictions on read operations, but all write operations must be direct IO append writes. mmap write of sequential files is not allowed. Several optional features of zonefs can be enabled at format time. * Conventional zone aggregation: ranges of contiguous conventional zones can be aggregated into a single larger file instead of the default one file per zone. * File ownership: The owner UID and GID of zone files is by default 0 (root) but can be changed to any valid UID/GID. * File access permissions: the default 640 access permissions can be changed. The mkzonefs tool is used to format zoned block devices for use with zonefs. This tool is available on Github at: git@github.com:damien-lemoal/zonefs-tools.git. zonefs-tools also includes a test suite which can be run against any zoned block device, including null_blk block device created with zoned mode. Example: the following formats a 15TB host-managed SMR HDD with 256 MB zones with the conventional zones aggregation feature enabled. $ sudo mkzonefs -o aggr_cnv /dev/sdX $ sudo mount -t zonefs /dev/sdX /mnt $ ls -l /mnt/ total 0 dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq The size of the zone files sub-directories indicate the number of files existing for each type of zones. In this example, there is only one conventional zone file (all conventional zones are aggregated under a single file). $ ls -l /mnt/cnv total 137101312 -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 This aggregated conventional zone file can be used as a regular file. $ sudo mkfs.ext4 /mnt/cnv/0 $ sudo mount -o loop /mnt/cnv/0 /data The "seq" sub-directory grouping files for sequential write zones has in this example 55356 zones. $ ls -lv /mnt/seq total 14511243264 -rw-r----- 1 root root 0 Nov 25 13:23 0 -rw-r----- 1 root root 0 Nov 25 13:23 1 -rw-r----- 1 root root 0 Nov 25 13:23 2 ... -rw-r----- 1 root root 0 Nov 25 13:23 55354 -rw-r----- 1 root root 0 Nov 25 13:23 55355 For sequential write zone files, the file size changes as data is appended at the end of the file, similarly to any regular file system. $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct 1+0 records in 1+0 records out 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s $ ls -l /mnt/seq/0 -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 The written file can be truncated to the zone size, preventing any further write operation. $ truncate -s 268435456 /mnt/seq/0 $ ls -l /mnt/seq/0 -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 Truncation to 0 size allows freeing the file zone storage space and restart append-writes to the file. $ truncate -s 0 /mnt/seq/0 $ ls -l /mnt/seq/0 -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 Since files are statically mapped to zones on the disk, the number of blocks of a file as reported by stat() and fstat() indicates the size of the file zone. $ stat /mnt/seq/0 File: /mnt/seq/0 Size: 0 Blocks: 524288 IO Block: 4096 regular empty file Device: 870h/2160d Inode: 50431 Links: 1 Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2019-11-25 13:23:57.048971997 +0900 Modify: 2019-11-25 13:52:25.553805765 +0900 Change: 2019-11-25 13:52:25.553805765 +0900 Birth: - The number of blocks of the file ("Blocks") in units of 512B blocks gives the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone size in this example. Of note is that the "IO block" field always indicates the minimum IO size for writes and corresponds to the device physical sector size. This code contains contributions from: * Johannes Thumshirn <jthumshirn@suse.de>, * Darrick J. Wong <darrick.wong@oracle.com>, * Christoph Hellwig <hch@lst.de>, * Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and * Ting Yao <tingyao@hust.edu.cn>. Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Dave Chinner <dchinner@redhat.com>
-
Al Viro authored
no real difference now Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-