Commits · bbf423c28efcde2beec2b187806eda0041cb0582 · Upstream / linux-stable

03 Sep, 2020 40 commits

x86/irq: Unbreak interrupt affinity setting · bbf423c2

Thomas Gleixner authored 4 years ago

commit e027ffff upstream.

Several people reported that 5.8 broke the interrupt affinity setting
mechanism.

The consolidation of the entry code reused the regular exception entry code
for device interrupts and changed the way how the vector number is conveyed
from ptregs->orig_ax to a function argument.

The low level entry uses the hardware error code slot to push the vector
number onto the stack which is retrieved from there into a function
argument and the slot on stack is set to -1.

The reason for setting it to -1 is that the error code slot is at the
position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
indicates that the entry came via a syscall. If it's not set to a negative
value then a signal delivery on return to userspace would try to restart a
syscall. But there are other places which rely on pt_regs::orig_ax being a
valid indicator for syscall entry.

But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
interrupt affinity setting mechanism, which was overlooked when this change
was made.

Moving interrupts on x86 happens in several steps. A new vector on a
different CPU is allocated and the relevant interrupt source is
reprogrammed to that. But that's racy and there might be an interrupt
already in flight to the old vector. So the old vector is preserved until
the first interrupt arrives on the new vector and the new target CPU. Once
that happens the old vector is cleaned up, but this cleanup still depends
on the vector number being stored in pt_regs::orig_ax, which is now -1.

That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
always false. As a consequence the interrupt is moved once, but then it
cannot be moved anymore because the cleanup of the old vector never
happens.

There would be several ways to convey the vector information to that place
in the guts of the interrupt handling, but on deeper inspection it turned
out that this check is pointless and a leftover from the old affinity model
of X86 which supported multi-CPU affinities. Under this model it was
possible that an interrupt had an old and a new vector on the same CPU, so
the vector match was required.

Under the new model the effective affinity of an interrupt is always a
single CPU from the requested affinity mask. If the affinity mask changes
then either the interrupt stays on the CPU and on the same vector when that
CPU is still in the new affinity mask or it is moved to a different CPU, but
it is never moved to a different vector on the same CPU.

Ergo the cleanup check for the matching vector number is not required and
can be removed which makes the dependency on pt_regs:orig_ax go away.

The remaining check for new_cpu == smp_processsor_id() is completely
sufficient. If it matches then the interrupt was successfully migrated and
the cleanup can proceed.

For paranoia sake add a warning into the vector assignment code to
validate that the assumption of never moving to a different vector on
the same CPU holds.

Fixes: 633260fa

 ("x86/irq: Convey vector as argument and not in ptregs")
Reported-by: Alex bykov <alex.bykov@scylladb.com>
Reported-by: Avi Kivity <avi@scylladb.com>
Reported-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexander Graf <graf@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

bbf423c2

irqchip/stm32-exti: Avoid losing interrupts due to clearing pending bits by mistake · 66e1e9bd

qiuguorui1 authored 4 years ago

commit e579076a upstream.

In the current code, when the eoi callback of the exti clears the pending
bit of the current interrupt, it will first read the values of fpr and
rpr, then logically OR the corresponding bit of the interrupt number,
and finally write back to fpr and rpr.

We found through experiments that if two exti interrupts,
we call them int1/int2, arrive almost at the same time. in our scenario,
the time difference is 30 microseconds, assuming int1 is triggered first.

there will be an extreme scenario: both int's pending bit are set to 1,
the irq handle of int1 is executed first, and eoi handle is then executed,
at this moment, all pending bits are cleared, but the int 2 has not
finally been reported to the cpu yet, which eventually lost int2.

According to stm32's TRM description about rpr and fpr: Writing a 1 to this
bit will trigger a rising edge event on event x, Writing 0 has no
effect.

Therefore, when clearing the pending bit, we only need to clear the
pending bit of the irq.

Fixes: 927abfc4

 ("irqchip/stm32: Add stm32mp1 support with hierarchy domain")
Signed-off-by: qiuguorui1 <qiuguorui1@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # v4.18+
Link: https://lore.kernel.org/r/20200820031629.15582-1-qiuguorui1@huawei.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

66e1e9bd

genirq/matrix: Deal with the sillyness of for_each_cpu() on UP · a1b11651

Thomas Gleixner authored 4 years ago

commit 784a0830 upstream.

Most of the CPU mask operations behave the same way, but for_each_cpu() and
it's variants ignore the cpumask argument and claim that CPU0 is always in
the mask. This is historical, inconsistent and annoying behaviour.

The matrix allocator uses for_each_cpu() and can be called on UP with an
empty cpumask. The calling code does not expect that this succeeds but
until commit e027ffff ("x86/irq: Unbreak interrupt affinity setting")
this went unnoticed. That commit added a WARN_ON() to catch cases which
move an interrupt from one vector to another on the same CPU. The warning
triggers on UP.

Add a check for the cpumask being empty to prevent this.

Fixes: 2f75d9e1

 ("genirq: Implement bitmap matrix allocator")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

a1b11651

usbip: Implement a match function to fix usbip · 8cb3561d

M. Vefa Bicakci authored 4 years ago

commit 7a2f2974 upstream.

Commit 88b7381a ("USB: Select better matching USB drivers when
available") introduced the use of a "match" function to select a
non-generic/better driver for a particular USB device. This
unfortunately breaks the operation of usbip in general, as reported in
the kernel bugzilla with bug 208267 (linked below).

Upon inspecting the aforementioned commit, one can observe that the
original code in the usb_device_match function used to return 1
unconditionally, but the aforementioned commit makes the usb_device_match
function use identifier tables and "match" virtual functions, if either of
them are available.

Hence, this commit implements a match function for usbip that
unconditionally returns true to ensure that usbip is functional again.

This change has been verified to restore usbip functionality, with a
v5.7.y kernel on an up-to-date version of Qubes OS 4.0, which uses
usbip to redirect USB devices between VMs.

Thanks to Jonathan Dieter for the effort in bisecting this issue down
to the aforementioned commit.

Fixes: 88b7381a ("USB: Select better matching USB drivers when available")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208267
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1856443
Link: https://github.com/QubesOS/qubes-issues/issues/5905

Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Cc: <stable@vger.kernel.org> # 5.7
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Bastien Nocera <hadess@hadess.net>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200810160017.46002-1-m.v.b@runbox.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

8cb3561d

crypto: af_alg - Work around empty control messages without MSG_MORE · 3c491c44

Herbert Xu authored 4 years ago

commit c195d66a

 upstream.

The iwd daemon uses libell which sets up the skcipher operation with
two separate control messages.  As the first control message is sent
without MSG_MORE, it is interpreted as an empty request.

While libell should be fixed to use MSG_MORE where appropriate, this
patch works around the bug in the kernel so that existing binaries
continue to work.

We will print a warning however.

A separate issue is that the new kernel code no longer allows the
control message to be sent twice within the same request.  This
restriction is obviously incompatible with what iwd was doing (first
setting an IV and then sending the real control message).  This
patch changes the kernel so that this is explicitly allowed.
Reported-by: Caleb Jorden <caljorden@hotmail.com>
Fixes: f3c802a1

 ("crypto: algif_aead - Only wake up when...")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3c491c44

device property: Fix the secondary firmware node handling in set_primary_fwnode() · 1d35dfde

Heikki Krogerus authored 4 years ago

commit c15e1bdd upstream.

When the primary firmware node pointer is removed from a
device (set to NULL) the secondary firmware node pointer,
when it exists, is made the primary node for the device.
However, the secondary firmware node pointer of the original
primary firmware node is never cleared (set to NULL).

To avoid situation where the secondary firmware node pointer
is pointing to a non-existing object, clearing it properly
when the primary node is removed from a device in
set_primary_fwnode().

Fixes: 97badf87

 ("device property: Make it possible to use secondary firmware nodes")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

1d35dfde

powerpc/perf: Fix crashes with generic_compat_pmu & BHRB · 9a9cc8c9

Alexey Kardashevskiy authored 5 years ago

commit b460b512 upstream.

The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
only defined in raw CPUs' power_pmu structs. The "architected" CPUs
use generic_compat_pmu, which does not have this callback, and crashes
occur if a user tries to enable branch stack for an event.

This add a NULL pointer check for bhrb_filter_map() which behaves as
if the callback returned an error.

This does not add the same check for config_bhrb() as the only caller
checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.

Fixes: be80e758

 ("powerpc/perf: Add generic compat mode pmu driver")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200602025612.62707-1-aik@ozlabs.ru

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

9a9cc8c9

powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU · bdae0167

Christophe Leroy authored 4 years ago

commit 4a133eb3 upstream.

low_sleep_handler() can't restore the context from virtual
stack because the stack can hardly be accessed with MMU OFF.

For now, disable VMAP stack when CONFIG_ADB_PMU is selected.

Fixes: cd08f109

 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
Cc: stable@vger.kernel.org # v5.6+
Reported-by: Giuseppe Sacco <giuseppe@sguazz.it>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ec96c15bfa1a7415ab604ee1c98cd45779c08be0.1598553015.git.christophe.leroy@csgroup.eu

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

bdae0167

PM: sleep: core: Fix the handling of pending runtime resume requests · 3b555853

Rafael J. Wysocki authored 4 years ago

commit e3eb6e8f upstream.

It has been reported that system-wide suspend may be aborted in the
absence of any wakeup events due to unforseen interactions of it with
the runtume PM framework.

One failing scenario is when there are multiple devices sharing an
ACPI power resource and runtime-resume needs to be carried out for
one of them during system-wide suspend (for example, because it needs
to be reconfigured before the whole system goes to sleep).  In that
case, the runtime-resume of that device involves turning the ACPI
power resource "on" which in turn causes runtime-resume requests
to be queued up for all of the other devices sharing it.  Those
requests go to the runtime PM workqueue which is frozen during
system-wide suspend, so they are not actually taken care of until
the resume of the whole system, but the pm_runtime_barrier()
call in __device_suspend() sees them and triggers system wakeup
events for them which then cause the system-wide suspend to be
aborted if wakeup source objects are in active use.

Of course, the logic that leads to triggering those wakeup events is
questionable in the first place, because clearly there are cases in
which a pending runtime resume request for a device is not connected
to any real wakeup events in any way (like the one above).  Moreover,
it is racy, because the device may be resuming already by the time
the pm_runtime_barrier() runs and so if the driver doesn't take care
of signaling the wakeup event as appropriate, it will be lost.
However, if the driver does take care of that, the extra
pm_wakeup_event() call in the core is redundant.

Accordingly, drop the conditional pm_wakeup_event() call fron
__device_suspend() and make the latter call pm_runtime_barrier()
alone.  Also modify the comment next to that call to reflect the new
code and extend it to mention the need to avoid unwanted interactions
between runtime PM and system-wide device suspend callbacks.

Fixes: 1e2ef05b

 ("PM: Limit race conditions between runtime PM and system sleep (v2)")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Utkarsh H Patel <utkarsh.h.patel@intel.com>
Tested-by: Utkarsh H Patel <utkarsh.h.patel@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3b555853

arm64: vdso32: make vdso32 install conditional · 17d66e62

Frank van der Linden authored 4 years ago

commit 5d28ba5f upstream.

vdso32 should only be installed if CONFIG_COMPAT_VDSO is enabled,
since it's not even supposed to be compiled otherwise, and arm64
builds without a 32bit crosscompiler will fail.

Fixes: 8d75785a

 ("ARM64: vdso32: Install vdso32 from vdso_install")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Cc: stable@vger.kernel.org [5.4+]
Link: https://lore.kernel.org/r/20200827234012.19757-1-fllinden@amazon.com

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

17d66e62

KVM: arm64: Set HCR_EL2.PTW to prevent AT taking synchronous exception · d36a6712

James Morse authored 4 years ago

commit 71a7f8cb upstream.

AT instructions do a translation table walk and return the result, or
the fault in PAR_EL1. KVM uses these to find the IPA when the value is
not provided by the CPU in HPFAR_EL1.

If a translation table walk causes an external abort it is taken as an
exception, even if it was due to an AT instruction. (DDI0487F.a's D5.2.11
"Synchronous faults generated by address translation instructions")

While we previously made KVM resilient to exceptions taken due to AT
instructions, the device access causes mismatched attributes, and may
occur speculatively. Prevent this, by forbidding a walk through memory
described as device at stage2. Now such AT instructions will report a
stage2 fault.

Such a fault will cause KVM to restart the guest. If the AT instructions
always walk the page tables, but guest execution uses the translation cached
in the TLB, the guest can't make forward progress until the TLB entry is
evicted. This isn't a problem, as since commit 5dcd0fdb ("KVM: arm64:
Defer guest entry when an asynchronous exception is pending"), KVM will
return to the host to process IRQs allowing the rest of the system to keep
running.

Cc: stable@vger.kernel.org # <v5.3: 5dcd0fdb

 ("KVM: arm64: Defer guest entry when an asynchronous exception is pending")
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d36a6712

io-wq: fix hang after cancelling pending hashed work · 13f35a2c

Pavel Begunkov authored 4 years ago

commit 204361a7

 upstream.

Don't forget to update wqe->hash_tail after cancelling a pending work
item, if it was hashed.

Cc: stable@vger.kernel.org # 5.7+
Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
Fixes: 86f3cd1b

 ("io-wq: handle hashed writes in chains")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

13f35a2c

xhci: Always restore EP_SOFT_CLEAR_TOGGLE even if ep reset failed · 96d020dd

Ding Hui authored 4 years ago

commit f1ec7ae6 upstream.

Some device drivers call libusb_clear_halt when target ep queue
is not empty. (eg. spice client connected to qemu for usb redir)

Before commit f5249461 ("xhci: Clear the host side toggle
manually when endpoint is soft reset"), that works well.
But now, we got the error log:

    EP not empty, refuse reset

xhci_endpoint_reset failed and left ep_state's EP_SOFT_CLEAR_TOGGLE
bit still set

So all the subsequent urb sumbits to the ep will fail with the
warn log:

    Can't enqueue URB while manually clearing toggle

We need to clear ep_state EP_SOFT_CLEAR_TOGGLE bit after
xhci_endpoint_reset, even if it failed.

Fixes: f5249461

 ("xhci: Clear the host side toggle manually when endpoint is soft reset")
Cc: stable <stable@vger.kernel.org> # v4.17+
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://lore.kernel.org/r/20200821091549.20556-4-mathias.nyman@linux.intel.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

96d020dd

xhci: Do warm-reset when both CAS and XDEV_RESUME are set · 7d31eaa7

Kai-Heng Feng authored 4 years ago

commit 904df64a

 upstream.

Sometimes re-plugging a USB device during system sleep renders the device
useless:
[  173.418345] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0x14203e2, return 0x10262
...
[  176.496485] usb 2-4: Waited 2000ms for CONNECT
[  176.496781] usb usb2-port4: status 0000.0262 after resume, -19
[  176.497103] usb 2-4: can't resume, status -19
[  176.497438] usb usb2-port4: logical disconnect

Because PLS equals to XDEV_RESUME, xHCI driver reports U3 to usbcore,
despite of CAS bit is flagged.

So proritize CAS over XDEV_RESUME to let usbcore handle warm-reset for
the port.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://lore.kernel.org/r/20200821091549.20556-3-mathias.nyman@linux.intel.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

7d31eaa7

usb: host: xhci: fix ep context print mismatch in debugfs · 50a7c735

Li Jun authored 4 years ago

commit 0077b1b2 upstream.

dci is 0 based and xhci_get_ep_ctx() will do ep index increment to get
the ep context.

[rename dci to ep_index -Mathias]
Cc: stable <stable@vger.kernel.org> # v4.15+
Fixes: 02b6fdc2

 ("usb: xhci: Add debugfs interface for xHCI driver")
Signed-off-by: Li Jun <jun.li@nxp.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://lore.kernel.org/r/20200821091549.20556-2-mathias.nyman@linux.intel.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

50a7c735

usb: host: xhci-tegra: fix tegra_xusb_get_phy() · 55c0eeab

JC Kuo authored 4 years ago

commit d54343a8

 upstream.

tegra_xusb_get_phy() should take input argument "name".
Signed-off-by: JC Kuo <jckuo@nvidia.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200811092553.657762-1-jckuo@nvidia.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

55c0eeab

usb: host: xhci-tegra: otg usb2/usb3 port init · c08e590b

JC Kuo authored 4 years ago

commit 316a2868

 upstream.

tegra_xusb_init_usb_phy() should initialize "otg_usb2_port" and
"otg_usb3_port" with -EINVAL because "0" is a valid value
represents usb2 port 0 or usb3 port 0.
Signed-off-by: JC Kuo <jckuo@nvidia.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200811093143.699541-1-jckuo@nvidia.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c08e590b

usb: renesas-xhci: remove version check · 68adec46

Vinod Koul authored 4 years ago

commit d66a57be upstream.

Some devices in wild are reporting bunch of firmware versions, so remove
the check for versions in driver

Reported by: Anastasios Vacharakis <vacharakis@gmail.com>
Reported by: Glen Journeay <journeay@gmail.com>
Fixes: 2478be82 ("usb: renesas-xhci: Add ROM loader for uPD720201")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208911

Signed-off-by: Vinod Koul <vkoul@kernel.org>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200818071739.789720-1-vkoul@kernel.org

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

68adec46

XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN... · 2b32323f

Thomas Gleixner authored 4 years ago

XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.

commit c330fb1d

 upstream.

handler data is meant for interrupt handlers and not for storing irq chip
specific information as some devices require handler data to store internal
per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers.

This obviously creates a conflict of interests and crashes the machine
because the XEN pointer is overwritten by the driver pointer.

As the XEN data is not handler specific it should be stored in
irqdesc::irq_data::chip_data instead.

A simple sed s/irq_[sg]et_handler_data/irq_[sg]et_chip_data/ cures that.

Cc: stable@vger.kernel.org
Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Roman Shaposhnik <roman@zededa.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/87lfi2yckt.fsf@nanos.tec.linutronix.de

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2b32323f

writeback: Fix sync livelock due to b_dirty_time processing · 05ae7cf3

Jan Kara authored 5 years ago

commit f9cae926 upstream.

When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's steady stream of inodes added to
b_dirty_time list as writeback rechecks dirty lists after each writeback
round whether there's more work to be done. Fix the problem by using
sync(2) start time is inode expiry value when processing b_dirty_time
list similarly as for ordinarily dirtied inodes. This requires some
refactoring of older_than_this handling which simplifies the code
noticeably as a bonus.

Fixes: 0ae45f63

 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

05ae7cf3

writeback: Avoid skipping inode writeback · d74c235b

Jan Kara authored 5 years ago

commit 5afced3b

 upstream.

Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.

However there are two flaws in the current logic:

1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).

2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:

writeback_single_inode()		__mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
					  inode->i_state |= I_DIRTY_PAGES;
					  if (inode->i_state & I_SYNC)
					    bail
  if (!(inode->i_state & I_DIRTY_ALL))
  - not true so nothing done

We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).

Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63

 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d74c235b

writeback: Protect inode->i_io_list with inode->i_lock · dd71dd1d

Jan Kara authored 5 years ago

commit b35250c0

 upstream.

Currently, operations on inode->i_io_list are protected by
wb->list_lock. In the following patches we'll need to maintain
consistency between inode->i_state and inode->i_io_list so change the
code so that inode->i_lock protects also all inode's i_io_list handling.
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

dd71dd1d

io_uring: clear req->result on IOPOLL re-issue · 1506fdcd

Jens Axboe authored 4 years ago

commit 56450c20

 upstream.

Make sure we clear req->result, which was set to -EAGAIN for retry
purposes, when moving it to the reissue list. Otherwise we can end up
retrying a request more than once, which leads to weird results in
the io-wq handling (and other spots).

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

1506fdcd

serial: 8250: change lock order in serial8250_do_startup() · 116790cf

Sergey Senozhatsky authored 4 years ago

commit 205d300a

 upstream.

We have a number of "uart.port->desc.lock vs desc.lock->uart.port"
lockdep reports coming from 8250 driver; this causes a bit of trouble
to people, so let's fix it.

The problem is reverse lock order in two different call paths:

chain #1:

 serial8250_do_startup()
  spin_lock_irqsave(&port->lock);
   disable_irq_nosync(port->irq);
    raw_spin_lock_irqsave(&desc->lock)

chain #2:

  __report_bad_irq()
   raw_spin_lock_irqsave(&desc->lock)
    for_each_action_of_desc()
     printk()
      spin_lock_irqsave(&port->lock);

Fix this by changing the order of locks in serial8250_do_startup():
 do disable_irq_nosync() first, which grabs desc->lock, and grab
 uart->port after that, so that chain #1 and chain #2 have same lock
 order.

Full lockdep splat:

 ======================================================
 WARNING: possible circular locking dependency detected
 5.4.39 #55 Not tainted
 ======================================================

 swapper/0/0 is trying to acquire lock:
 ffffffffab65b6c0 (console_owner){-...}, at: console_lock_spinning_enable+0x31/0x57

 but task is already holding lock:
 ffff88810a8e34c0 (&irq_desc_lock_class){-.-.}, at: __report_bad_irq+0x5b/0xba

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&irq_desc_lock_class){-.-.}:
        _raw_spin_lock_irqsave+0x61/0x8d
        __irq_get_desc_lock+0x65/0x89
        __disable_irq_nosync+0x3b/0x93
        serial8250_do_startup+0x451/0x75c
        uart_startup+0x1b4/0x2ff
        uart_port_activate+0x73/0xa0
        tty_port_open+0xae/0x10a
        uart_open+0x1b/0x26
        tty_open+0x24d/0x3a0
        chrdev_open+0xd5/0x1cc
        do_dentry_open+0x299/0x3c8
        path_openat+0x434/0x1100
        do_filp_open+0x9b/0x10a
        do_sys_open+0x15f/0x3d7
        kernel_init_freeable+0x157/0x1dd
        kernel_init+0xe/0x105
        ret_from_fork+0x27/0x50

 -> #1 (&port_lock_key){-.-.}:
        _raw_spin_lock_irqsave+0x61/0x8d
        serial8250_console_write+0xa7/0x2a0
        console_unlock+0x3b7/0x528
        vprintk_emit+0x111/0x17f
        printk+0x59/0x73
        register_console+0x336/0x3a4
        uart_add_one_port+0x51b/0x5be
        serial8250_register_8250_port+0x454/0x55e
        dw8250_probe+0x4dc/0x5b9
        platform_drv_probe+0x67/0x8b
        really_probe+0x14a/0x422
        driver_probe_device+0x66/0x130
        device_driver_attach+0x42/0x5b
        __driver_attach+0xca/0x139
        bus_for_each_dev+0x97/0xc9
        bus_add_driver+0x12b/0x228
        driver_register+0x64/0xed
        do_one_initcall+0x20c/0x4a6
        do_initcall_level+0xb5/0xc5
        do_basic_setup+0x4c/0x58
        kernel_init_freeable+0x13f/0x1dd
        kernel_init+0xe/0x105
        ret_from_fork+0x27/0x50

 -> #0 (console_owner){-...}:
        __lock_acquire+0x118d/0x2714
        lock_acquire+0x203/0x258
        console_lock_spinning_enable+0x51/0x57
        console_unlock+0x25d/0x528
        vprintk_emit+0x111/0x17f
        printk+0x59/0x73
        __report_bad_irq+0xa3/0xba
        note_interrupt+0x19a/0x1d6
        handle_irq_event_percpu+0x57/0x79
        handle_irq_event+0x36/0x55
        handle_fasteoi_irq+0xc2/0x18a
        do_IRQ+0xb3/0x157
        ret_from_intr+0x0/0x1d
        cpuidle_enter_state+0x12f/0x1fd
        cpuidle_enter+0x2e/0x3d
        do_idle+0x1ce/0x2ce
        cpu_startup_entry+0x1d/0x1f
        start_kernel+0x406/0x46a
        secondary_startup_64+0xa4/0xb0

 other info that might help us debug this:

 Chain exists of:
   console_owner --> &port_lock_key --> &irq_desc_lock_class

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&irq_desc_lock_class);
                                lock(&port_lock_key);
                                lock(&irq_desc_lock_class);
   lock(console_owner);

  *** DEADLOCK ***

 2 locks held by swapper/0/0:
  #0: ffff88810a8e34c0 (&irq_desc_lock_class){-.-.}, at: __report_bad_irq+0x5b/0xba
  #1: ffffffffab65b5c0 (console_lock){+.+.}, at: console_trylock_spinning+0x20/0x181

 stack backtrace:
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.39 #55
 Hardware name: XXXXXX
 Call Trace:
  <IRQ>
  dump_stack+0xbf/0x133
  ? print_circular_bug+0xd6/0xe9
  check_noncircular+0x1b9/0x1c3
  __lock_acquire+0x118d/0x2714
  lock_acquire+0x203/0x258
  ? console_lock_spinning_enable+0x31/0x57
  console_lock_spinning_enable+0x51/0x57
  ? console_lock_spinning_enable+0x31/0x57
  console_unlock+0x25d/0x528
  ? console_trylock+0x18/0x4e
  vprintk_emit+0x111/0x17f
  ? lock_acquire+0x203/0x258
  printk+0x59/0x73
  __report_bad_irq+0xa3/0xba
  note_interrupt+0x19a/0x1d6
  handle_irq_event_percpu+0x57/0x79
  handle_irq_event+0x36/0x55
  handle_fasteoi_irq+0xc2/0x18a
  do_IRQ+0xb3/0x157
  common_interrupt+0xf/0xf
  </IRQ>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Fixes: 768aec0b

 ("serial: 8250: fix shared interrupts issues with SMP and RT kernels")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Raul Rangel <rrangel@google.com>
BugLink: https://bugs.chromium.org/p/chromium/issues/detail?id=1114800
Link: https://lore.kernel.org/lkml/CAHQZ30BnfX+gxjPm1DUd5psOTqbyDh4EJE=2=VAMW_VDafctkA@mail.gmail.com/T/#u

Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200817022646.1484638-1-sergey.senozhatsky@gmail.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

116790cf

serial: 8250_exar: Fix number of ports for Commtech PCIe cards · 89171ef8

Valmer Huhn authored 4 years ago

commit c6b9e95d upstream.

The following in 8250_exar.c line 589 is used to determine the number
of ports for each Exar board:

nr_ports = board->num_ports ? board->num_ports : pcidev->device & 0x0f;

If the number of ports a card has is not explicitly specified, it defaults
to the rightmost 4 bits of the PCI device ID. This is prone to error since
not all PCI device IDs contain a number which corresponds to the number of
ports that card provides.

This particular case involves COMMTECH_4222PCIE, COMMTECH_4224PCIE and
COMMTECH_4228PCIE cards with device IDs 0x0022, 0x0020 and 0x0021.
Currently the multiport cards receive 2, 0 and 1 port instead of 2, 4 and
8 ports respectively.

To fix this, each Commtech Fastcom PCIe card is given a struct where the
number of ports is explicitly specified. This ensures 'board->num_ports'
is used instead of the default 'pcidev->device & 0x0f'.

Fixes: d0aeaa83

 ("serial: exar: split out the exar code from 8250_pci")
Signed-off-by: Valmer Huhn <valmer.huhn@concurrent-rt.com>
Tested-by: Valmer Huhn <valmer.huhn@concurrent-rt.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200813165255.GC345440@icarus.concurrent-rt.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

89171ef8

serial: stm32: avoid kernel warning on absence of optional IRQ · 0a60539b

Holger Assmann authored 4 years ago

commit fdf16d78 upstream.

stm32_init_port() of the stm32-usart may trigger a warning in
platform_get_irq() when the device tree specifies no wakeup interrupt.

The wakeup interrupt is usually a board-specific GPIO and the driver
functions correctly in its absence. The mainline stm32mp151.dtsi does
not specify it, so all mainline device trees trigger an unnecessary
kernel warning. Use of platform_get_irq_optional() avoids this.

Fixes: 2c58e560

 ("serial: stm32: fix the get_irq error case")
Signed-off-by: Holger Assmann <h.assmann@pengutronix.de>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200813152757.32751-1-h.assmann@pengutronix.de

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

0a60539b

serial: pl011: Don't leak amba_ports entry on driver register error · df264303

Lukas Wunner authored 4 years ago

commit 89efbe70 upstream.

pl011_probe() calls pl011_setup_port() to reserve an amba_ports[] entry,
then calls pl011_register_port() to register the uart driver with the
tty layer.

If registration of the uart driver fails, the amba_ports[] entry is not
released.  If this happens 14 times (value of UART_NR macro), then all
amba_ports[] entries will have been leaked and driver probing is no
longer possible.  (To be fair, that can only happen if the DeviceTree
doesn't contain alias IDs since they cause the same entry to be used for
a given port.)   Fix it.

Fixes: ef2889f7

 ("serial: pl011: Move uart_register_driver call to device")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v3.15+
Cc: Tushar Behera <tushar.behera@linaro.org>
Link: https://lore.kernel.org/r/138f8c15afb2f184d8102583f8301575566064a6.1597316167.git.lukas@wunner.de

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

df264303

serial: pl011: Fix oops on -EPROBE_DEFER · 6648c599

Lukas Wunner authored 4 years ago

commit 27afac93 upstream.

If probing of a pl011 gets deferred until after free_initmem(), an oops
ensues because pl011_console_match() is called which has been freed.

Fix by removing the __init attribute from the function and those it
calls.

Commit 10879ae5 ("serial: pl011: add console matching function")
introduced pl011_console_match() not just for early consoles but
regular preferred consoles, such as those added by acpi_parse_spcr().
Regular consoles may be registered after free_initmem() for various
reasons, one being deferred probing, another being dynamic enablement
of serial ports using a DeviceTree overlay.

Thus, pl011_console_match() must not be declared __init and the
functions it calls mustn't either.

Stack trace for posterity:

Unable to handle kernel paging request at virtual address 80c38b58
Internal error: Oops: 8000000d [#1] PREEMPT SMP ARM
PC is at pl011_console_match+0x0/0xfc
LR is at register_console+0x150/0x468
[<80187004>] (register_console)
[<805a8184>] (uart_add_one_port)
[<805b2b68>] (pl011_register_port)
[<805b3ce4>] (pl011_probe)
[<80569214>] (amba_probe)
[<805ca088>] (really_probe)
[<805ca2ec>] (driver_probe_device)
[<805ca5b0>] (__device_attach_driver)
[<805c8060>] (bus_for_each_drv)
[<805c9dfc>] (__device_attach)
[<805ca630>] (device_initial_probe)
[<805c90a8>] (bus_probe_device)
[<805c95a8>] (deferred_probe_work_func)

Fixes: 10879ae5

 ("serial: pl011: add console matching function")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.10+
Cc: Aleksey Makarov <amakarov@marvell.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Christopher Covington <cov@codeaurora.org>
Link: https://lore.kernel.org/r/f827ff09da55b8c57d316a1b008a137677b58921.1597315557.git.lukas@wunner.de

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

6648c599

serial: samsung: Removes the IRQ not found warning · e3f8041d

Tamseel Shams authored 4 years ago

commit 8c6c378b

 upstream.

In few older Samsung SoCs like s3c2410, s3c2412
and s3c2440, UART IP is having 2 interrupt lines.
However, in other SoCs like s3c6400, s5pv210,
exynos5433, and exynos4210 UART is having only 1
interrupt line. Due to this, "platform_get_irq(platdev, 1)"
call in the driver gives the following false-positive error:
"IRQ index 1 not found" on newer SoC's.

This patch adds the condition to check for Tx interrupt
only for the those SoC's which have 2 interrupt lines.
Tested-by: Alim Akhtar <alim.akhtar@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Signed-off-by: Tamseel Shams <m.shams@samsung.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200810030021.45348-1-m.shams@samsung.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

e3f8041d

vt_ioctl: change VT_RESIZEX ioctl to check for error return from vc_resize() · edc8a4eb

George Kennedy authored 4 years ago

commit bc5269ca

 upstream.

vc_resize() can return with an error after failure. Change VT_RESIZEX ioctl
to save struct vc_data values that are modified and restore the original
values in case of error.
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Cc: stable <stable@vger.kernel.org>
Reported-by: syzbot+38a3699c7eaf165b97a6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/1596213192-6635-2-git-send-email-george.kennedy@oracle.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

edc8a4eb

vt: defer kfree() of vc_screenbuf in vc_do_resize() · 2392eea6

Tetsuo Handa authored 4 years ago

commit f8d1653d upstream.

syzbot is reporting UAF bug in set_origin() from vc_do_resize() [1], for
vc_do_resize() calls kfree(vc->vc_screenbuf) before calling set_origin().

Unfortunately, in set_origin(), vc->vc_sw->con_set_origin() might access
vc->vc_pos when scroll is involved in order to manipulate cursor, but
vc->vc_pos refers already released vc->vc_screenbuf until vc->vc_pos gets
updated based on the result of vc->vc_sw->con_set_origin().

Preserving old buffer and tolerating outdated vc members until set_origin()
completes would be easier than preventing vc->vc_sw->con_set_origin() from
accessing outdated vc members.

[1] https://syzkaller.appspot.com/bug?id=6649da2081e2ebdc65c0642c214b27fe91099db3

Reported-by: syzbot <syzbot+9116ecc1978ca3a12f43@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1596034621-4714-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2392eea6

USB: lvtest: return proper error code in probe · e863ac5f

Evgeny Novikov authored 4 years ago

commit 53141249

 upstream.

lvs_rh_probe() can return some nonnegative value from usb_control_msg()
when it is less than "USB_DT_HUB_NONVAR_SIZE + 2" that is considered as
a failure. Make lvs_rh_probe() return -EINVAL in this case.

Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Evgeny Novikov <novikov@ispras.ru>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200805090643.3432-1-novikov@ispras.ru

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

e863ac5f

fbcon: prevent user font height or width change from causing potential out-of-bounds access · 34cf1aff

George Kennedy authored 4 years ago

commit 39b3cffb

 upstream.

Add a check to fbcon_resize() to ensure that a possible change to user font
height or user font width will not allow a font data out-of-bounds access.
NOTE: must use original charcount in calculation as font charcount can
change and cannot be used to determine the font data allocated size.
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Cc: stable <stable@vger.kernel.org>
Reported-by: syzbot+38a3699c7eaf165b97a6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/1596213192-6635-1-git-send-email-george.kennedy@oracle.com

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

34cf1aff

btrfs: detect nocow for swap after snapshot delete · bb77dd02

Boris Burkov authored 4 years ago

commit a84d5d42

 upstream.

can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.

The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.

Note: Until the snapshot is competely cleaned after deletion,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must until 'btrfs subvolu sync' finished before
activating the swapfile swapon.

CC: stable@vger.kernel.org # 5.4+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

bb77dd02

btrfs: fix space cache memory leak after transaction abort · b40d12b7

Filipe Manana authored 4 years ago

commit bbc37d6e

 upstream.

If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:

1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
   and it's about to start writing out dirty extent buffers;

2) Transaction N + 1 already started and another task, task A, just called
   btrfs_commit_transaction() on it;

3) Block group B was dirtied (extents allocated from it) by transaction
   N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
   very beginning of the transaction commit, it starts writeback for the
   block group's space cache by calling btrfs_write_out_cache(), which
   allocates the pages array for the block group's io_ctl with a call to
   io_ctl_init(). Block group A is added to the io_list of transaction
   N + 1 by btrfs_start_dirty_block_groups();

4) While transaction N's commit is writing out the extent buffers, it gets
   an IO error and aborts transaction N, also setting the file system to
   RO mode;

5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
   btrfs_commit_transaction() and has set transaction N + 1 state to
   TRANS_STATE_COMMIT_START. Immediately after that it checks that the
   filesystem was turned to RO mode, due to transaction N's abort, and
   jumps to the "cleanup_transaction" label. After that we end up at
   btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
   That helper finds block group B in the transaction's io_list but it
   never releases the pages array of the block group's io_ctl, resulting in
   a memory leak.

In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().

We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.

So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().

This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:

unreferenced object 0xffff9bbf009fa600 (size 512):
  comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
  hex dump (first 32 bytes):
    00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
    80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
  backtrace:
    [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
    [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
    [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
    [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
    [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
    [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
    [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
    [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
    [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
    [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
    [<0000000043ace2c9>] do_syscall_64+0x33/0x80
    [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

b40d12b7

btrfs: check the right error variable in btrfs_del_dir_entries_in_log · c7e8c6f4

Josef Bacik authored 4 years ago

commit fb2fecba upstream.

With my new locking code dbench is so much faster that I tripped over a
transaction abort from ENOSPC.  This turned out to be because
btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
function sets err on error, and returns err.  So instead of properly
marking the inode as needing a full commit, we were returning -ENOSPC
and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
variable so that we return the correct thing in the case of ENOSPC.

The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
can return -ENOENT if the dir item isn't in the tree log (which would
happen if we hadn't fsync'ed this guy).  We actually handle that case in
__btrfs_unlink_inode, so it's an expected error to get back.

Fixes: 4a500fd1

 ("Btrfs: Metadata ENOSPC handling for tree log")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note and comment about ENOENT ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c7e8c6f4

btrfs: reset compression level for lzo on remount · 204ed5f3

Marcos Paulo de Souza authored 4 years ago

commit 282dd7d7

 upstream.

Currently a user can set mount "-o compress" which will set the
compression algorithm to zlib, and use the default compress level for
zlib (3):

  relatime,compress=zlib:3,space_cache

If the user remounts the fs using "-o compress=lzo", then the old
compress_level is used:

  relatime,compress=lzo:3,space_cache

But lzo does not expose any tunable compression level. The same happens
if we set any compress argument with different level, also with zstd.

Fix this by resetting the compress_level when compress=lzo is
specified.  With the fix applied, lzo is shown without compress level:

  relatime,compress=lzo,space_cache

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

204ed5f3

blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · b4cbbc13

Ming Lei authored 4 years ago

commit d7d8535f upstream.

SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.

memory barriers have to be used for ordering the following two pair of OPs:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().

Without the added memory barrier, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side

or

2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.

IO hang in ltp/fs_fill test is reported by kernel test robot:

	https://lkml.org/lkml/2020/7/26/77

Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.

Fixes: bd166ef1

 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

b4cbbc13

HID: i2c-hid: Always sleep 60ms after I2C_HID_PWR_ON commands · 649b6c86

Hans de Goede authored 4 years ago

commit eef40162 upstream.

Before this commit i2c_hid_parse() consists of the following steps:

1. Send power on cmd
2. usleep_range(1000, 5000)
3. Send reset cmd
4. Wait for reset to complete (device interrupt, or msleep(100))
5. Send power on cmd
6. Try to read HID descriptor

Notice how there is an usleep_range(1000, 5000) after the first power-on
command, but not after the second power-on command.

Testing has shown that at least on the BMAX Y13 laptop's i2c-hid touchpad,
not having a delay after the second power-on command causes the HID
descriptor to read as all zeros.

In case we hit this on other devices too, the descriptor being all zeros
can be recognized by the following message being logged many, many times:

hid-generic 0018:0911:5288.0002: unknown main item tag 0x0

At the same time as the BMAX Y13's touchpad issue was debugged,
Kai-Heng was working on debugging some issues with Goodix i2c-hid
touchpads. It turns out that these need a delay after a PWR_ON command
too, otherwise they stop working after a suspend/resume cycle.
According to Goodix a delay of minimal 60ms is needed.

Having multiple cases where we need a delay after sending the power-on
command, seems to indicate that we should always sleep after the power-on
command.

This commit fixes the mentioned issues by moving the existing 1ms sleep to
the i2c_hid_set_power() function and changing it to a 60ms sleep.

Cc: stable@vger.kernel.org
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208247

Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reported-and-tested-by: Andrea Borgia <andrea@borgia.bo.it>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

649b6c86

block: loop: set discard granularity and alignment for block device backed loop · 7aaaf975

Ming Lei authored 4 years ago

commit bcb21c8c upstream.

In case of block device backend, if the backend supports write zeros, the
loop device will set queue flag of QUEUE_FLAG_DISCARD. However,
limits.discard_granularity isn't setup, and this way is wrong,
see the following description in Documentation/ABI/testing/sysfs-block:

	A discard_granularity of 0 means that the device does not support
	discard functionality.

Especially 9b15d109 ("block: improve discard bio alignment in
__blkdev_issue_discard()") starts to take q->limits.discard_granularity
for computing max discard sectors. And zero discard granularity may cause
kernel oops, or fail discard request even though the loop queue claims
discard support via QUEUE_FLAG_DISCARD.

Fix the issue by setup discard granularity and alignment.

Fixes: c52abf56

 ("loop: Better discard support for block devices")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Gwendal Grignou <gwendal@chromium.org>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

7aaaf975

GitLab

Menu