Commits · ca60ad6a6bc4aa88c02c6f103dd80df54689ea4d · Upstream / linux-stable

14 Feb, 2020 1 commit

io_uring: prune request from overflow list on flush · 2ca10259

Jens Axboe authored 5 years ago


Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:

- The cq ring was overflown
- The request being canceled was in the overflow list

The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.

Fix this by removing requests from the overflow list if we're canceling
them.

Cc: stable@vger.kernel.org # 5.5
Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2ca10259

13 Feb, 2020 1 commit

io-wq: don't call kXalloc_node() with non-online node · 7563439a

Jens Axboe authored 5 years ago


Glauber reports a crash on init on a box he has:

 RIP: 0010:__alloc_pages_nodemask+0x132/0x340
 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
 FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  alloc_slab_page+0x46/0x320
  new_slab+0x9d/0x4e0
  ___slab_alloc+0x507/0x6a0
  ? io_wq_create+0xb4/0x2a0
  __slab_alloc+0x1c/0x30
  kmem_cache_alloc_node_trace+0xa6/0x260
  io_wq_create+0xb4/0x2a0
  io_uring_setup+0x97f/0xaa0
  ? io_remove_personalities+0x30/0x30
  ? io_poll_trigger_evfd+0x30/0x30
  do_syscall_64+0x5b/0x1c0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f4d116cb1ed

which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.

Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.
Reported-by: Glauber Costa <glauber@scylladb.com>
Tested-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7563439a

09 Feb, 2020 4 commits

io_uring: retain sockaddr_storage across send/recvmsg async punt · b537916c

Jens Axboe authored 5 years ago


Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.

Cc: stable@vger.kernel.org # 5.3+
Reported-by: Jonas Bonn <jonas@norrbonn.se>
Tested-by: Jonas Bonn <jonas@norrbonn.se>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b537916c

io_uring: cancel pending async work if task exits · 6ab23144

Jens Axboe authored 5 years ago


Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.

Cancel work that this task queued up when it goes away.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ab23144

io-wq: add io_wq_cancel_pid() to cancel based on a specific pid · 36282881

Jens Axboe authored 5 years ago


Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

36282881

io-wq: make io_wqe_cancel_work() take a match handler · 00bcda13

Jens Axboe authored 5 years ago


We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.

No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

00bcda13

08 Feb, 2020 15 commits

fs: Add VirtualBox guest shared folder (vboxsf) support · 0fd16957

Hans de Goede authored 5 years ago


VirtualBox hosts can share folders with guests, this commit adds a
VFS driver implementing the Linux-guest side of this, allowing folders
exported by the host to be mounted under Linux.

This driver depends on the guest <-> host IPC functions exported by
the vboxguest driver.
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0fd16957

io_uring: fix openat/statx's filename leak · 0bdbdd08

Pavel Begunkov authored 5 years ago


As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0bdbdd08

io_uring: fix double prep iovec leak · 5f798bea

Pavel Begunkov authored 5 years ago


Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).

Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5f798bea

io_uring: fix async close() with f_op->flush() · a93b3331

Pavel Begunkov authored 5 years ago


First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.

Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.

And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a93b3331

io_uring: allow AT_FDCWD for non-file openat/openat2/statx · 0b5faf6b

Jens Axboe authored 5 years ago


Don't just check for dirfd == -1, we should allow AT_FDCWD as well for
relative lookups.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b5faf6b

io_uring: grab ->fs as part of async preparation · ff002b30

Jens Axboe authored 5 years ago


This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff002b30

io-wq: add support for inheriting ->fs · 9392a27d

Jens Axboe authored 5 years ago


Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9392a27d

io_uring: retry raw bdev writes if we hit -EOPNOTSUPP · faac996c

Jens Axboe authored 5 years ago


For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>

faac996c

io_uring: add cleanup for openat()/statx() · 8fef80bf

Pavel Begunkov authored 5 years ago


openat() and statx() may have allocated ->open.filename, which should be
be put. Add cleanup handlers for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8fef80bf

io_uring: fix iovec leaks · 99bc4c38

Pavel Begunkov authored 5 years ago


Allocated iovec is freed only in io_{read,write,send,recv)(), and just
leaves it if an error occured. There are plenty of such cases:
- cancellation of non-head requests
- fail grabbing files in __io_queue_sqe()
- set REQ_F_NOWAIT and returning in __io_queue_sqe()

Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resourses go through cleanup handlers on put.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

99bc4c38

io_uring: remove unused struct io_async_open · e96e9779

Pavel Begunkov authored 5 years ago


struct io_async_open is unused, remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e96e9779

io_uring: flush overflowed CQ events in the io_uring_poll() · 63e5d81f

Stefano Garzarella authored 5 years ago


In io_uring_poll() we must flush overflowed CQ events before to
check if there are CQ events available, to avoid missing events.

We call the io_cqring_events() that checks and flushes any overflow
and returns the number of CQ events available.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

63e5d81f

io_uring: statx/openat/openat2 don't support fixed files · cf3040ca

Jens Axboe authored 5 years ago


All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.

Disable IOSQE_FIXED_FILE for these operations.
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cf3040ca

pipe: use exclusive waits when reading or writing · 0ddad21d

Linus Torvalds authored 5 years ago


This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
                int i;
                read(fd[0], &i, sizeof(i));
                if (i < 0)
                        continue;
                counters[0] = i+1;
                write(fd[1], counters, (1+(i & 1)) *sizeof(int));
        } while (counters[0] < 1000000);
        return 0;
    }

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys

before this commit.

After this commit, I get

          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
  race condition in the make jobserver (in GNU make 4.2.1) for me. It's
  a long known bug that was fixed back in June 2017 by GNU make commit
  b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
  avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
  so a number of distributions may still have the buggy version. Some
  have backported the fix to their 4.2.1 release, though, and even
  without the fix it's quite timing-dependent whether the bug actually
  is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive
  wait queues) for a month or so, on several different systems, and I've
  run into no issues with it. The patch *substantially* improves
  parallel build times on large (~100 CPU) systems, both with parallel
  make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable
  distributions) have versions of GNU make that no longer have the
  jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ddad21d

compat_ioctl: fix FIONREAD on devices · 0a061743

Arnd Bergmann authored 5 years ago

My final cleanup patch for sys_compat_ioctl() introduced a regression on
the FIONREAD ioctl command, which is used for both regular and special
files, but only works on regular files after my patch, as I had missed
the warning that Al Viro put into a comment right above it.

Change it back so it can work on any file again by moving the implementation
to do_vfs_ioctl() instead.

Fixes: 77b90401

 ("compat_ioctl: simplify the implementation")
Reported-and-tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Reported-and-tested-by: youling257 <youling257@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

0a061743

07 Feb, 2020 19 commits

procfs: switch to use of invalfc() · bf45f7fc
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
bf45f7fc
hugetlbfs: switch to use of invalfc() · b5db30cf
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
b5db30cf
cramfs: switch to use of errofc() et.al. · e1ee7d85
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
e1ee7d85
gfs2: switch to use of errorfc() et.al. · 77cb271e
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
77cb271e
fuse: switch to use errorfc() et.al. · 2e28c49e
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
2e28c49e
ceph: use errorfc() and friends instead of spelling the prefix out · d53d0f74
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
d53d0f74
turn fs_param_is_... into functions · 328de528
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
328de528

fs_parse: handle optional arguments sanely · 48ce73b1

Al Viro authored 5 years ago


Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

48ce73b1

fs_parse: fold fs_parameter_desc/fs_parameter_spec · d7167b14

Al Viro authored 5 years ago


The former contains nothing but a pointer to an array of the latter...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d7167b14

fs_parser: remove fs_parameter_description name field · 96cafb9c

Eric Sandeen authored 5 years ago


Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

96cafb9c

add prefix to fs_context->log · cc3c0b53

Al Viro authored 5 years ago


... turning it into struct p_log embedded into fs_context.  Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cc3c0b53

ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log · c80c98f0

Al Viro authored 5 years ago


... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c80c98f0

new primitive: __fs_parse() · 7f5d3814

Al Viro authored 5 years ago


fs_parse() analogue taking p_log instead of fs_context.
fs_parse() turned into a wrapper, callers in ceph_common and rbd
switched to __fs_parse().

As the result, fs_parse() never gets NULL fs_context and neither
do fs_context-based logging primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7f5d3814

teach logfc() to handle prefices, give it saner calling conventions · 9f09f649
Al Viro authored 5 years ago
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
9f09f649

get rid of fs_value_is_filename_empty · aa1918f9

Al Viro authored 5 years ago


Its behaviour is identical to that of fs_value_is_filename.
It makes no sense, anyway - LOOKUP_EMPTY affects nothing
whatsoever once the pathname has been imported from userland.
And both fs_value_is_filename and fs_value_is_filename_empty
carry an already imported pathname.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

aa1918f9

don't bother with explicit length argument for __lookup_constant() · 34264ae3

Al Viro authored 5 years ago


Have the arrays of constant_table self-terminated (by NULL ->name
in the final entry).  Simplifies lookup_constant() and allows to
reuse the search for enum params as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

34264ae3

nfsd: make nfsd_filecache_wq variable static · 50d0def9

Chen Zhou authored 5 years ago


Fix sparse warning:

fs/nfsd/filecache.c:55:25: warning:
	symbol 'nfsd_filecache_wq' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

50d0def9

fs: New zonefs file system · 8dcc1a9d

Damien Le Moal authored 5 years ago


zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices to the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reducing the
amount of changes needed in the application as well as introducing
support for different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
   under a common sub-directory:
     * For conventional zones, the sub-directory "cnv" is used.
     * For sequential write zones, the sub-directory "seq" is used.
  These two directories are the only directories that exist in zonefs.
  Users cannot create other directories and cannot rename nor delete
  the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
   type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
   Conventional zone files cannot be truncated.
4) The size of sequential zone files represent the file's zone write
   pointer position relative to the zone start sector. Truncating these
   files is allowed only down to 0, in which case, the zone is reset to
   rewind the zone write pointer position to the start of the zone, or
   up to the zone size, in which case the file's zone is transitioned
   to the FULL state (finish zone operation).
5) All read and write operations to files are not allowed beyond the
   file zone size. Any access exceeding the zone size is failed with
   the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
   sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
   that can be issued to conventional zone files. Buffered, direct and
   mmap read & write operations are accepted. For sequential zone files,
   there are no restrictions on read operations, but all write
   operations must be direct IO append writes. mmap write of sequential
   files is not allowed.

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
  zones can be aggregated into a single larger file instead of the
  default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0
  (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
  changed.

The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on Github at:

git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any
zoned block device, including null_blk block device created with zoned
mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.

$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of the zone files sub-directories indicate the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).

$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file.

$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.

$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.

$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any
further write operation.

$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and
restart append-writes to the file.

$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.

$ stat /mnt/seq/0
  File: /mnt/seq/0
  Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
Device: 870h/2160d      Inode: 50431       Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
 Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

8dcc1a9d

fold struct fs_parameter_enum into struct constant_table · 5eede625
Al Viro authored 5 years ago
```
no real difference now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
5eede625