-
Notifications
You must be signed in to change notification settings - Fork 7
Support for CoVE's deployment model 3 #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for CoVE's deployment model 3 #3
Conversation
|
As discussed on last Monday, we should expose TSM's capabilities to the host. Based on them, the host can provide different behavior when creating and managing TVMs. In this PR, the host recognizes TSM's capabilities based on, for example, presence of COVI extension. Once |
747de52 to
c9a3bf1
Compare
|
@wojciechozga : Anup agreed to host the staging repo in kvm-riscv. I think it would better to merge these patches directly there once you update these with the items discussed last time. Wdyt ? |
|
yes, we can do it that way! |
c9a3bf1 to
1e93d62
Compare
1e93d62 to
998239d
Compare
arch/riscv/kernel/setup.c
Outdated
| void __init setup_arch(char **cmdline_p) | ||
| { | ||
| parse_dtb(); | ||
| promote_to_cove_guest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you invoke it after sbi_init and riscv_cove_sbi_init so that this can be called only if is_cove_guest is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The promote call might actually move to different place, i.e., head.S in _start_kernel. Like this, the kernel requests the promotion before it touches the stack or heap, so there is no runtime state associated with the execution. Consequently the TVM's memory at the time of promotion equals the linux kernel image, so integrity measurements are easier to calculate.
The approach presented in this patch was intended to match what OpenPower PEF did. But in such a case we would need to do a few additional tricks (like PEF in powerpc part of Linux kernel) that would unnecessarily complicate the implementation. My team believes that we might try a simpler solution, but I am interested in learning your opinion and the opinion of the community. Guerney and I will present details on local attestation during RV CoVE TG on November 5th.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit #fb1b6080a118b70305be6bb1e0243a5e1b119e05 implements the simpler approach to VM promotion, which I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to convert so early, why not just invoke it from the VMM if the kernel supports it ? The VMM can create a single step or multi-step TVM. It should be even simpler and you don't have to bounce the PROMOTE_TVM ecalls from the TVM->TSM->HOST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comments and suggestions.
There are a few advantages of implementing promotion inside the kernel that early. (1) TSM can move just kernel+initrd+fdt to confidential memory, keeping entire boot firmware / rom etc out of TCB, (2) proposed solution is then VMM-independent, (3) implementation is relatively simple even though not rich in features, (4) reference measurements are easy to calculate since there is no runtime state we have to care about.
There are alternatives and every solution comes with its own tradeoffs. Doing promotion before the proposed point of time (e.g., in VMM or early firmware) will include firmware in TCB and will be vmm- or firmware-specific. Doing promotion after the proposed time will result in inclusion of runtime state in TCB and will require calculating the measurements of a VM's snapshot at given point of time when promotion occurs.
In OpenPOWER PEF (see powerpc arch code in Linux kernel) we did it also that early in the kernel boot. Specifically inside PROM. The _start procedure (in arch/powerpc/kernel/head_book3s_32.S) calls prom_init, which calls setup_secure_guest (arch/powerpc/kernel/prom_init.c) if the correct boot arg param is set. I am not aware of any PROM-like code in Riscv-dependent code, that is why I placed it directly in arch/riscv/kernel/head.S.
In Secure Execution for IBM Z (see s390 arch code in Linux kernel) we did it differently since the business requirements were different. We wanted to provide confidentiality and integrity of the TVM's Linux kernel and initrd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deployment model 3 is primarily aimed at embedded systems although it can also be used by cloud or other systems. Putting the execution in the kernel as proposed is a minimal code approach. This approach will fit better into the development cycles for embedded systems. The tooling to generate the Attestation Payload will be simpler. There is no need to actually run the VM prior to generating the payload. If the VMM does not support confidential computing the call will fail which is not a problem because no secrets will be exposed because VM will not have access to the key that unlocks its disk. This approach gives the VM creator the ability to lock their TVM to a specific set of hardware. To do this requires that each platform have unique information, such as a serial number, that is incorporated into the attestation. The TVM can be locked to a single platform or a set of platforms, depending on the hardware environment. This is also important for embedded systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojciechozga Thanks for sharing the slides for the Nov 5th meeting. I understand the local attestation requirement. I have couple of questions:
- However, I wanted to understand what type of guest firmware you typically run on an embedded systems ? It can just boot directly similar to what we do it in kvm ?
- Why it is a problematic to make VMM aware of TAP and pass the TAP to the VMM ? It can be part of the guest image that vmm can load it.
We can discuss the details in Nov 5th call if you like as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guest content is system-specific. It may be a guest-level "bare-metal" application, a directly booted kernel, a full-fledged OS, or a single-tasking system with or without boot firmware. It is the TVM creator's decision on how to structure the TVM, depending on system design requirements. It is important for RISC-V CoVE to enable different use cases, also because of legacy processes to build and certify systems. The TVM owner should decide when and what type of attestation to perform—whether promotion should happen in firmware, kernel, or operating system, and whether it should be local or remote attestation.
Leaving the decision on when to execute the promotion call allows a TVM creator to decide if firmware should be part of the TCB and whether the guest kernel's code should be confidential, among other considerations. For example, if a VM requests promotion during kernel execution, the TSM can exclude rom reset vec and firmware from the TCB. Conversely, if the initial firmware is intended to be part of the TCB and to retrieve a key to decrypt the kernel, promotion occurs in firmware.
Implementing this level of granularity within a VMM presents challenges, as it would require trapping the VM’s execution at specific instructions based on some policy. Moreover, making a VMM aware of the TAP would necessitate modifications to the VMM itself, potentially reducing the solution's portability across various VMM implementations. The proposed separation of concerns simplifies the VMM's role, enabling more flexible development and evolution of the attestation mechanisms within the TVM.
@atishp04, let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: I was not suggesting to trap any VM's execution at any certain stage and promote it to TVM.
I was suggesting to start the TVM in the single step boot mode at the beginning itself. No change in promote_to_tvm API.
Instead of this path
VM->(promote_to_tvm)->TSM traps and forward ->Host->promote_to_tvm (COVH)->TSM
We can just simply start the TVM. Are there any issue with this execution model ?
VMM->host (start TVM in single_step_mode)
The flexibility you mentioned still exists for anybody to implement in a different way. I was mainly concerned about the odd boot flow for a Linux based TVM. My earlier assumption was a Linux VM can be promoted to TVM at any point during the boot. If it is supposed to promoted right after the boot, VMM may as well tell the host to start a single step TVM. I understand that the firmware needs to be part of the TCB in this case but I am not sure if we need that for this use case.
We will anyways have have VMM modification for CoVE for other models. So adding another ABI for single step booting shouldn't be a issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atishp04 thanks for the explanation.
The current path is different that what you described. The VM traps into Host, which reflects the call to TSM:
VM->(promote_to_tvm VM>KVM ABI)->Host->promote_to_tvm (COVH)->TSM
If I understood correctly, you propose that VMM+Host do the promotion of a Linux VM that hasn't executed yet, right?
VMM->(single_step_mode VMM>KVM ABI)->Host->promote_to_tvm (COVH)->TSM
My understanding is that your motivation to shift promotion call from VM to VMM is that it sounds odd to call promote_to_tvm ABI early in Linux kernel while the same effect could be achieved by doing promotion before the VM execution, right? In Linux kernel's powerpc (Openpower PEF) we promote TVMs the same way as proposed now for risc-v, so within first instructions after entering the Linux kernel code.
I agree that we can call promote_to_tvm before kernel execution (what you propose) but we can also do it during early kernel boot (current proposal) or later during runtime (this will be more like a snapshots of a running system). Starting a TVM in a single-step-mode would be ok for directly boot Linux kernel but would not support use cases with firmware that is not intended to be in TCB. Also, this approach would require modification of every VMM. These are some tradeoffs we can discuss today during the call. Do you know what are the plans for VMM modifications? Do you know which VMMs are expected to get support for CoVE (qemu, kvmtool, firecracker)?
Some other question I have is how could we make a VMM aware of TAP? Should we pass it as an extra argument or should VMM somehow extract TAP address from the VM image? Current approach embeds TAP inside the Linux kernel image and Linux kernel is aware about the TAP's address so it can expose it during the promote call.
998239d to
894ae0c
Compare
|
@atishp04 thank you for valuable comments and hints. I addressed the comments and pushed the improved version of these patches. |
arch/riscv/kernel/head.S
Outdated
| mv s2, a1 | ||
| li a7, COVE_PROMOTE_SBI_EXT_ID | ||
| li a6, COVE_PROMOTE_SBI_FID | ||
| mv a0, a1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't invoke SBI calls like this without checks. It will just fail if the higher privilege mode system doesn't support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. This piece of code is compiled only for non-M mode system (HS/VS mode) compiled with dedicated kernel build flag CONFIG_RISCV_COVE_GUEST_PROMOTE, so won't execute for a M-mode kernel or a regular COVE guest created in a multi-step fashion. If the more privileged mode doesn't support promote_to_tvm call, then it will return an SBI error reflected in a0/a1 gprs. We can ignore the SBI result here the same as we would do it at any later guest's execution stage. This is because the result of promotion is reflected via attestation. The VM that failed the promotion will continue as a regular VM and won't retrieve secrets from attestation.
only if the kernel was built with the CONFIG_RISCV_COVE_GUEST_PROMOTE build parameter. CoVE guest images created in the multi-step creation process should not use this parameter. Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
…ic page conversion is not supported. When a single share_memory_region() call for a memory region that contains multiple 4KiB pages fails, execute multiple 4KiB share_memory_region() calls until the request is completed. (See deployment model 3 in the CoVE spec.) Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
…he correct NACL features by testing whether a feature required for NACL exploitation in nested virtualized environments is present. If nested virtualization is not present, use only the NACL setup_shared_memory() ABI otherwise use the entire NACL ABI. Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
…etect AIA presence by discovering that the TEE security monitor (TSM) supports AIA capability. If AIA is not present, inject external interrupts using the HVIP register when resuming execution of a virtual processor via the COVH tvm_vcpu_run() call. Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
…d in a single-step. Preload VM pages into memory, fill the NACL shared memory with boot vcpu state, and reflect the promote_to_tvm() call to the TSM. Support CoVE implementations that do not support dynamic page conversion. A TSM that does not support dynamic page conversion does not require the donation of pages to store VCPU state in confidential memory. Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
894ae0c to
daba2b6
Compare
daba2b6 to
793938e
Compare
…e for TVM Attestation Payload (TAP). Pass the physical address of the TAP when requesting to be promoted to a TVM. Signed-off-by: Wojciech Ozga <woz@zurich.ibm.com>
793938e to
a08f811
Compare
The function blk_revalidate_disk_zones() calls the function
disk_update_zone_resources() after freezing the device queue. In turn,
disk_update_zone_resources() calls queue_limits_start_update() which
takes a queue limits mutex lock, resulting in the ordering:
q->q_usage_counter check -> q->limits_lock. However, the usual ordering
is to always take a queue limit lock before freezing the queue to commit
the limits updates, e.g., the code pattern:
lim = queue_limits_start_update(q);
...
blk_mq_freeze_queue(q);
ret = queue_limits_commit_update(q, &lim);
blk_mq_unfreeze_queue(q);
Thus, blk_revalidate_disk_zones() introduces a potential circular
locking dependency deadlock that lockdep sometimes catches with the
splat:
[ 51.934109] ======================================================
[ 51.935916] WARNING: possible circular locking dependency detected
[ 51.937561] 6.12.0+ #2107 Not tainted
[ 51.938648] ------------------------------------------------------
[ 51.940351] kworker/u16:4/157 is trying to acquire lock:
[ 51.941805] ffff9fff0aa0bea8 (&q->limits_lock){+.+.}-{4:4}, at: disk_update_zone_resources+0x86/0x170
[ 51.944314]
but task is already holding lock:
[ 51.945688] ffff9fff0aa0b890 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: blk_revalidate_disk_zones+0x15f/0x340
[ 51.948527]
which lock already depends on the new lock.
[ 51.951296]
the existing dependency chain (in reverse order) is:
[ 51.953708]
-> #1 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
[ 51.956131] blk_queue_enter+0x1c9/0x1e0
[ 51.957290] blk_mq_alloc_request+0x187/0x2a0
[ 51.958365] scsi_execute_cmd+0x78/0x490 [scsi_mod]
[ 51.959514] read_capacity_16+0x111/0x410 [sd_mod]
[ 51.960693] sd_revalidate_disk.isra.0+0x872/0x3240 [sd_mod]
[ 51.962004] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.962993] really_probe+0xd5/0x330
[ 51.963898] __driver_probe_device+0x78/0x110
[ 51.964925] driver_probe_device+0x1f/0xa0
[ 51.965916] __driver_attach_async_helper+0x60/0xe0
[ 51.967017] async_run_entry_fn+0x2e/0x140
[ 51.968004] process_one_work+0x21f/0x5a0
[ 51.968987] worker_thread+0x1dc/0x3c0
[ 51.969868] kthread+0xe0/0x110
[ 51.970377] ret_from_fork+0x31/0x50
[ 51.970983] ret_from_fork_asm+0x11/0x20
[ 51.971587]
-> #0 (&q->limits_lock){+.+.}-{4:4}:
[ 51.972479] __lock_acquire+0x1337/0x2130
[ 51.973133] lock_acquire+0xc5/0x2d0
[ 51.973691] __mutex_lock+0xda/0xcf0
[ 51.974300] disk_update_zone_resources+0x86/0x170
[ 51.975032] blk_revalidate_disk_zones+0x16c/0x340
[ 51.975740] sd_zbc_revalidate_zones+0x73/0x160 [sd_mod]
[ 51.976524] sd_revalidate_disk.isra.0+0x465/0x3240 [sd_mod]
[ 51.977824] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.978917] really_probe+0xd5/0x330
[ 51.979915] __driver_probe_device+0x78/0x110
[ 51.981047] driver_probe_device+0x1f/0xa0
[ 51.982143] __driver_attach_async_helper+0x60/0xe0
[ 51.983282] async_run_entry_fn+0x2e/0x140
[ 51.984319] process_one_work+0x21f/0x5a0
[ 51.985873] worker_thread+0x1dc/0x3c0
[ 51.987289] kthread+0xe0/0x110
[ 51.988546] ret_from_fork+0x31/0x50
[ 51.989926] ret_from_fork_asm+0x11/0x20
[ 51.991376]
other info that might help us debug this:
[ 51.994127] Possible unsafe locking scenario:
[ 51.995651] CPU0 CPU1
[ 51.996694] ---- ----
[ 51.997716] lock(&q->q_usage_counter(queue)#3);
[ 51.998817] lock(&q->limits_lock);
[ 52.000043] lock(&q->q_usage_counter(queue)#3);
[ 52.001638] lock(&q->limits_lock);
[ 52.002485]
*** DEADLOCK ***
Prevent this issue by moving the calls to blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() around the call to queue_limits_commit_update()
in disk_update_zone_resources(). In case of revalidation failure, the
call to disk_free_zone_resources() in blk_revalidate_disk_zones()
is still done with the queue frozen as before.
Fixes: 843283e ("block: Fake max open zones limit when there is no limit")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241126104705.183996-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kernel will hang on destroy admin_q while we create ctrl failed, such as following calltrace: PID: 23644 TASK: ff2d52b40f439fc0 CPU: 2 COMMAND: "nvme" #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15 #1 [ff61d23de260fc08] schedule at ffffffff8323c014 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25 RIP: 00007fda7891d574 RSP: 00007ffe2ef06958 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 000055e8122a4d90 RCX: 00007fda7891d574 RDX: 000000000000012b RSI: 000055e8122a4d90 RDI: 0000000000000004 RBP: 00007ffe2ef079c0 R8: 000000000000012b R9: 000055e8122a4d90 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004 R13: 000055e8122923c0 R14: 000000000000012b R15: 00007fda78a54500 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b This due to we have quiesced admi_q before cancel requests, but forgot to unquiesce before destroy it, as a result we fail to drain the pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and simplify the code. Fixes: 958dc1d ("nvme-rdma: add clean action for failed reconnection") Reported-by: Yingfu.zhou <yingfu.zhou@shopee.com> Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Signed-off-by: Yue.zhao <yue.zhao@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Konstantin Shkolnyy says: ==================== vsock/test: fix wrong setsockopt() parameters Parameters were created using wrong C types, which caused them to be of wrong size on some architectures, causing problems. The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64 didn't show it. After the fix, all tests pass on s390. Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have a similar problem, which turned out to be true, hence, the second patch. Changes for v8: - Fix whitespace warnings from "checkpatch.pl --strict" - Add maintainers to Cc: Changes for v7: - Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git - Add the "net" tags to the subjects Changes for v6: - rework the patch #3 to avoid creating a new file for new functions, and exclude vsock_perf from calling the new functions. - add "Reviewed-by:" to the patch #2. Changes for v5: - in the patch #2 replace the introduced uint64_t with unsigned long long to match documentation - add a patch #3 that verifies every setsockopt() call. Changes for v4: - add "Reviewed-by:" to the first patch, and add a second patch fixing SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now a patch series.) Changes for v3: - fix the same problem in vsock_perf and update commit message Changes for v2: - add "Fixes:" lines to the commit message ==================== Link: https://patch.msgid.link/20241203150656.287028-1-kshk@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We have several places across the kernel where we want to access another
task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making
a call to syscall_get_arguments().
This works for register arguments right away by accessing the task's
`regs' member of `struct pt_regs', however for stack arguments seen with
32-bit/o32 kernels things are more complicated. Technically they ought
to be obtained from the user stack with calls to an access_remote_vm(),
but we have an easier way available already.
So as to be able to access syscall stack arguments as regular function
arguments following the MIPS calling convention we copy them over from
the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in
handle_sys(), to the current stack frame's outgoing argument space at
the top of the stack, which is where the handler called expects to see
its incoming arguments. This area is also pointed at by the `pt_regs'
pointer obtained by task_pt_regs().
Make the o32 stack argument space a proper member of `struct pt_regs'
then, by renaming the existing member from `pad0' to `args' and using
generated offsets to access the space. No functional change though.
With the change in place the o32 kernel stack frame layout at the entry
to a syscall handler invoked by handle_sys() is therefore as follows:
$sp + 68 -> | ... | <- pt_regs.regs[9]
+---------------------+
$sp + 64 -> | $t0 | <- pt_regs.regs[8]
+---------------------+
$sp + 60 -> | $a3/argument #4 | <- pt_regs.regs[7]
+---------------------+
$sp + 56 -> | $a2/argument #3 | <- pt_regs.regs[6]
+---------------------+
$sp + 52 -> | $a1/argument #2 | <- pt_regs.regs[5]
+---------------------+
$sp + 48 -> | $a0/argument #1 | <- pt_regs.regs[4]
+---------------------+
$sp + 44 -> | $v1 | <- pt_regs.regs[3]
+---------------------+
$sp + 40 -> | $v0 | <- pt_regs.regs[2]
+---------------------+
$sp + 36 -> | $at | <- pt_regs.regs[1]
+---------------------+
$sp + 32 -> | $zero | <- pt_regs.regs[0]
+---------------------+
$sp + 28 -> | stack argument #8 | <- pt_regs.args[7]
+---------------------+
$sp + 24 -> | stack argument #7 | <- pt_regs.args[6]
+---------------------+
$sp + 20 -> | stack argument #6 | <- pt_regs.args[5]
+---------------------+
$sp + 16 -> | stack argument #5 | <- pt_regs.args[4]
+---------------------+
$sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3]
+---------------------+
$sp + 8 -> | psABI space for $a2 | <- pt_regs.args[2]
+---------------------+
$sp + 4 -> | psABI space for $a1 | <- pt_regs.args[1]
+---------------------+
$sp + 0 -> | psABI space for $a0 | <- pt_regs.args[0]
+---------------------+
holding user data received and with the first 4 frame slots reserved by
the psABI for the compiler to spill the incoming arguments from $a0-$a3
registers (which it sometimes does according to its needs) and the next
4 frame slots designated by the psABI for any stack function arguments
that follow. This data is also available for other tasks to peek/poke
at as reqired and where permitted.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #3 - Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout. - Bring back the VMID allocation to the vcpu_load phase, ensuring that we only setup VTTBR_EL2 once on VHE. This cures an ugly race that would lead to running with an unallocated VMID.
Use raw_spinlock in order to fix spurious messages about invalid context
when spinlock debugging is enabled. The lock is only used to serialize
register access.
[ 4.239592] =============================
[ 4.239595] [ BUG: Invalid wait context ]
[ 4.239599] 6.13.0-rc7-arm64-renesas-05496-gd088502a519f #35 Not tainted
[ 4.239603] -----------------------------
[ 4.239606] kworker/u8:5/76 is trying to lock:
[ 4.239609] ffff0000091898a0 (&p->lock){....}-{3:3}, at: gpio_rcar_config_interrupt_input_mode+0x34/0x164
[ 4.239641] other info that might help us debug this:
[ 4.239643] context-{5:5}
[ 4.239646] 5 locks held by kworker/u8:5/76:
[ 4.239651] #0: ffff0000080fb148 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x190/0x62c
[ 4.250180] OF: /soc/sound@ec500000/ports/port@0/endpoint: Read of boolean property 'frame-master' with a value.
[ 4.254094] #1: ffff80008299bd80 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x1b8/0x62c
[ 4.254109] #2: ffff00000920c8f8
[ 4.258345] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'bitclock-master' with a value.
[ 4.264803] (&dev->mutex){....}-{4:4}, at: __device_attach_async_helper+0x3c/0xdc
[ 4.264820] #3: ffff00000a50ca40 (request_class#2){+.+.}-{4:4}, at: __setup_irq+0xa0/0x690
[ 4.264840] #4:
[ 4.268872] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'frame-master' with a value.
[ 4.273275] ffff00000a50c8c8 (lock_class){....}-{2:2}, at: __setup_irq+0xc4/0x690
[ 4.296130] renesas_sdhi_internal_dmac ee100000.mmc: mmc1 base at 0x00000000ee100000, max clock rate 200 MHz
[ 4.304082] stack backtrace:
[ 4.304086] CPU: 1 UID: 0 PID: 76 Comm: kworker/u8:5 Not tainted 6.13.0-rc7-arm64-renesas-05496-gd088502a519f #35
[ 4.304092] Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
[ 4.304097] Workqueue: async async_run_entry_fn
[ 4.304106] Call trace:
[ 4.304110] show_stack+0x14/0x20 (C)
[ 4.304122] dump_stack_lvl+0x6c/0x90
[ 4.304131] dump_stack+0x14/0x1c
[ 4.304138] __lock_acquire+0xdfc/0x1584
[ 4.426274] lock_acquire+0x1c4/0x33c
[ 4.429942] _raw_spin_lock_irqsave+0x5c/0x80
[ 4.434307] gpio_rcar_config_interrupt_input_mode+0x34/0x164
[ 4.440061] gpio_rcar_irq_set_type+0xd4/0xd8
[ 4.444422] __irq_set_trigger+0x5c/0x178
[ 4.448435] __setup_irq+0x2e4/0x690
[ 4.452012] request_threaded_irq+0xc4/0x190
[ 4.456285] devm_request_threaded_irq+0x7c/0xf4
[ 4.459398] ata1: link resume succeeded after 1 retries
[ 4.460902] mmc_gpiod_request_cd_irq+0x68/0xe0
[ 4.470660] mmc_start_host+0x50/0xac
[ 4.474327] mmc_add_host+0x80/0xe4
[ 4.477817] tmio_mmc_host_probe+0x2b0/0x440
[ 4.482094] renesas_sdhi_probe+0x488/0x6f4
[ 4.486281] renesas_sdhi_internal_dmac_probe+0x60/0x78
[ 4.491509] platform_probe+0x64/0xd8
[ 4.495178] really_probe+0xb8/0x2a8
[ 4.498756] __driver_probe_device+0x74/0x118
[ 4.503116] driver_probe_device+0x3c/0x154
[ 4.507303] __device_attach_driver+0xd4/0x160
[ 4.511750] bus_for_each_drv+0x84/0xe0
[ 4.515588] __device_attach_async_helper+0xb0/0xdc
[ 4.520470] async_run_entry_fn+0x30/0xd8
[ 4.524481] process_one_work+0x210/0x62c
[ 4.528494] worker_thread+0x1ac/0x340
[ 4.532245] kthread+0x10c/0x110
[ 4.535476] ret_from_fork+0x10/0x20
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250121135833.3769310-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
…cal section
A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernel without causing
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, semaphore is the only locking primitive left that is still
using try_to_wake_up() to do wakeup inside critical section, all the
other locking primitives had been migrated to use wake_q to do wakeup
outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let just do the migration now to wake_q to avoid headache
like this.
Reported-by: yzbot+ed801a886dfdbfe7136d@syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
When mb-xdp is set and return is XDP_PASS, packet is converted from xdp_buff to sk_buff with xdp_update_skb_shared_info() in bnxt_xdp_build_skb(). bnxt_xdp_build_skb() passes incorrect truesize argument to xdp_update_skb_shared_info(). The truesize is calculated as BNXT_RX_PAGE_SIZE * sinfo->nr_frags but the skb_shared_info was wiped by napi_build_skb() before. So it stores sinfo->nr_frags before bnxt_xdp_build_skb() and use it instead of getting skb_shared_info from xdp_get_shared_info_from_buff(). Splat looks like: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at net/core/skbuff.c:6072 skb_try_coalesce+0x504/0x590 Modules linked in: xt_nat xt_tcpudp veth af_packet xt_conntrack nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_coms CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.14.0-rc2+ #3 RIP: 0010:skb_try_coalesce+0x504/0x590 Code: 4b fd ff ff 49 8b 34 24 40 80 e6 40 0f 84 3d fd ff ff 49 8b 74 24 48 40 f6 c6 01 0f 84 2e fd ff ff 48 8d 4e ff e9 25 fd ff ff <0f> 0b e99 RSP: 0018:ffffb62c4120caa8 EFLAGS: 00010287 RAX: 0000000000000003 RBX: ffffb62c4120cb14 RCX: 0000000000000ec0 RDX: 0000000000001000 RSI: ffffa06e5d7dc000 RDI: 0000000000000003 RBP: ffffa06e5d7ddec0 R08: ffffa06e6120a800 R09: ffffa06e7a119900 R10: 0000000000002310 R11: ffffa06e5d7dcec0 R12: ffffe4360575f740 R13: ffffe43600000000 R14: 0000000000000002 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffffa0755f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f147b76b0f8 CR3: 00000001615d4000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> ? __warn+0x84/0x130 ? skb_try_coalesce+0x504/0x590 ? report_bug+0x18a/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 ? skb_try_coalesce+0x504/0x590 inet_frag_reasm_finish+0x11f/0x2e0 ip_defrag+0x37a/0x900 ip_local_deliver+0x51/0x120 ip_sublist_rcv_finish+0x64/0x70 ip_sublist_rcv+0x179/0x210 ip_list_rcv+0xf9/0x130 How to reproduce: <Node A> ip link set $interface1 xdp obj xdp_pass.o ip link set $interface1 mtu 9000 up ip a a 10.0.0.1/24 dev $interface1 <Node B> ip link set $interfac2 mtu 9000 up ip a a 10.0.0.2/24 dev $interface2 ping 10.0.0.1 -s 65000 Following ping.py patch adds xdp-mb-pass case. so ping.py is going to be able to reproduce this issue. Fixes: 1dc4c55 ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://patch.msgid.link/20250309134219.91670-2-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.15, round #3 - Avoid use of uninitialized memcache pointer in user_mem_abort() - Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts to be taken while TGE=0 and fixing an ugly bug on AmpereOne that occurs when taking an interrupt while clearing the xMO bits (AC03_CPU_36) - Prevent VMMs from hiding support for AArch64 at any EL virtualized by KVM - Save/restore the host value for HCRX_EL2 instead of restoring an incorrect fixed value - Make host_stage2_set_owner_locked() check that the entire requested range is memory rather than just the first page
This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V. This allows each ftrace callsite to provide an ftrace_ops to the common ftrace trampoline, allowing each callsite to invoke distinct tracer functions without the need to fall back to list processing or to allocate custom trampolines for each callsite. This significantly speeds up cases where multiple distinct trace functions are used and callsites are mostly traced by a single tracer. The idea and most of the implementation is taken from the ARM64's implementation of the same feature. The idea is to place a pointer to the ftrace_ops as a literal at a fixed offset from the function entry point, which can be recovered by the common ftrace trampoline. We use -fpatchable-function-entry to reserve 8 bytes above the function entry by emitting 2 4 byte or 4 2 byte nops depending on the presence of CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer to the associated ftrace_ops for that callsite. Functions are aligned to 8 bytes to make sure that the accesses to this literal are atomic. This approach allows for directly invoking ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or part of a module), without going via ftrace_ops_list_func. We've benchamrked this with the ftrace_ops sample module on Spacemit K1 Jupiter: Without this patch: baseline (Linux rivos 6.14.0-09584-g7d06015d936c #3 SMP Sat Mar 29 +-----------------------+-----------------+----------------------------+ | Number of tracers | Total time (ns) | Per-call average time | |-----------------------+-----------------+----------------------------| | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | |----------+------------+-----------------+------------+---------------| | 0 | 0 | 1357958 | 13 | - | | 0 | 1 | 1302375 | 13 | - | | 0 | 2 | 1302375 | 13 | - | | 0 | 10 | 1379084 | 13 | - | | 0 | 100 | 1302458 | 13 | - | | 0 | 200 | 1302333 | 13 | - | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 13677833 | 136 | 123 | | 1 | 1 | 18500916 | 185 | 172 | | 1 | 2 | 22856459 | 228 | 215 | | 1 | 10 | 58824709 | 588 | 575 | | 1 | 100 | 505141584 | 5051 | 5038 | | 1 | 200 | 1580473126 | 15804 | 15791 | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 13561000 | 135 | 122 | | 2 | 0 | 19707292 | 197 | 184 | | 10 | 0 | 67774750 | 677 | 664 | | 100 | 0 | 714123125 | 7141 | 7128 | | 200 | 0 | 1918065668 | 19180 | 19167 | +----------+------------+-----------------+------------+---------------+ Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. With this patch: v4-rc4 (Linux rivos 6.14.0-09598-gd75747611c93 #4 SMP Sat Mar 29 +-----------------------+-----------------+----------------------------+ | Number of tracers | Total time (ns) | Per-call average time | |-----------------------+-----------------+----------------------------| | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | |----------+------------+-----------------+------------+---------------| | 0 | 0 | 1459917 | 14 | - | | 0 | 1 | 1408000 | 14 | - | | 0 | 2 | 1383792 | 13 | - | | 0 | 10 | 1430709 | 14 | - | | 0 | 100 | 1383791 | 13 | - | | 0 | 200 | 1383750 | 13 | - | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 5238041 | 52 | 38 | | 1 | 1 | 5228542 | 52 | 38 | | 1 | 2 | 5325917 | 53 | 40 | | 1 | 10 | 5299667 | 52 | 38 | | 1 | 100 | 5245250 | 52 | 39 | | 1 | 200 | 5238459 | 52 | 39 | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 5239083 | 52 | 38 | | 2 | 0 | 19449417 | 194 | 181 | | 10 | 0 | 67718584 | 677 | 663 | | 100 | 0 | 709840708 | 7098 | 7085 | | 200 | 0 | 2203580626 | 22035 | 22022 | +----------+------------+-----------------+------------+---------------+ Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. As can be seen from the above: a) Whenever there is a single relevant tracer function associated with a tracee, the overhead of invoking the tracer is constant, and does not scale with the number of tracers which are *not* associated with that tracee. b) The overhead for a single relevant tracer has dropped to ~1/3 of the overhead prior to this series (from 122ns to 38ns). This is largely due to permitting calls to dynamically-allocated ftrace_ops without going through ftrace_ops_list_func. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> [update kconfig, asm, refactor] Signed-off-by: Andy Chiu <andybnac@gmail.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-10-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Currently there is no ISB between __deactivate_cptr_traps() disabling traps that affect EL2 and fpsimd_lazy_switch_to_host() manipulating registers potentially affected by CPTR traps. When NV is not in use, this is safe because the relevant registers are only accessed when guest_owns_fp_regs() && vcpu_has_sve(vcpu), and this also implies that SVE traps affecting EL2 have been deactivated prior to __guest_entry(). When NV is in use, a guest hypervisor may have configured SVE traps for a nested context, and so it is necessary to have an ISB between __deactivate_cptr_traps() and fpsimd_lazy_switch_to_host(). Due to the current lack of an ISB, when a guest hypervisor enables SVE traps in CPTR, the host can take an unexpected SVE trap from within fpsimd_lazy_switch_to_host(), e.g. | Unhandled 64-bit el1h sync exception on CPU1, ESR 0x0000000066000000 -- SVE | CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT | Hardware name: FVP Base RevC (DT) | pstate: 604023c9 (nZCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __kvm_vcpu_run+0x6f4/0x844 | lr : __kvm_vcpu_run+0x150/0x844 | sp : ffff800083903a60 | x29: ffff800083903a90 x28: ffff000801f4a300 x27: 0000000000000000 | x26: 0000000000000000 x25: ffff000801f90000 x24: ffff000801f900f0 | x23: ffff800081ff7720 x22: 0002433c807d623f x21: ffff000801f90000 | x20: ffff00087f730730 x19: 0000000000000000 x18: 0000000000000000 | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 | x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 | x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff000801f90d70 | x5 : 0000000000001000 x4 : ffff8007fd739000 x3 : ffff000801f90000 | x2 : 0000000000000000 x1 : 00000000000003cc x0 : ffff800082f9d000 | Kernel panic - not syncing: Unhandled exception | CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT | Hardware name: FVP Base RevC (DT) | Call trace: | show_stack+0x18/0x24 (C) | dump_stack_lvl+0x60/0x80 | dump_stack+0x18/0x24 | panic+0x168/0x360 | __panic_unhandled+0x68/0x74 | el1h_64_irq_handler+0x0/0x24 | el1h_64_sync+0x6c/0x70 | __kvm_vcpu_run+0x6f4/0x844 (P) | kvm_arm_vcpu_enter_exit+0x64/0xa0 | kvm_arch_vcpu_ioctl_run+0x21c/0x870 | kvm_vcpu_ioctl+0x1a8/0x9d0 | __arm64_sys_ioctl+0xb4/0xf4 | invoke_syscall+0x48/0x104 | el0_svc_common.constprop.0+0x40/0xe0 | do_el0_svc+0x1c/0x28 | el0_svc+0x30/0xcc | el0t_64_sync_handler+0x10c/0x138 | el0t_64_sync+0x198/0x19c | SMP: stopping secondary CPUs | Kernel Offset: disabled | CPU features: 0x0000,000002c0,02df4fb9,97ee773f | Memory Limit: none | ---[ end Kernel panic - not syncing: Unhandled exception ]--- Fix this by adding an ISB between __deactivate_traps() and fpsimd_lazy_switch_to_host(). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-3-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #3 - Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation - A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry - Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process - Add yet another fix for the dreaded arch_timer_edge_cases selftest
The issue arises when kzalloc() is invoked while holding umem_mutex or
any other lock acquired under umem_mutex. This is problematic because
kzalloc() can trigger fs_reclaim_aqcuire(), which may, in turn, invoke
mmu_notifier_invalidate_range_start(). This function can lead to
mlx5_ib_invalidate_range(), which attempts to acquire umem_mutex again,
resulting in a deadlock.
The problematic flow:
CPU0 | CPU1
---------------------------------------|------------------------------------------------
mlx5_ib_dereg_mr() |
→ revoke_mr() |
→ mutex_lock(&umem_odp->umem_mutex) |
| mlx5_mkey_cache_init()
| → mutex_lock(&dev->cache.rb_lock)
| → mlx5r_cache_create_ent_locked()
| → kzalloc(GFP_KERNEL)
| → fs_reclaim()
| → mmu_notifier_invalidate_range_start()
| → mlx5_ib_invalidate_range()
| → mutex_lock(&umem_odp->umem_mutex)
→ cache_ent_find_and_store() |
→ mutex_lock(&dev->cache.rb_lock) |
Additionally, when kzalloc() is called from within
cache_ent_find_and_store(), we encounter the same deadlock due to
re-acquisition of umem_mutex.
Solve by releasing umem_mutex in dereg_mr() after umr_revoke_mr()
and before acquiring rb_lock. This ensures that we don't hold
umem_mutex while performing memory allocations that could trigger
the reclaim path.
This change prevents the deadlock by ensuring proper lock ordering and
avoiding holding locks during memory allocation operations that could
trigger the reclaim path.
The following lockdep warning demonstrates the deadlock:
python3/20557 is trying to acquire lock:
ffff888387542128 (&umem_odp->umem_mutex){+.+.}-{4:4}, at:
mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib]
but task is already holding lock:
ffffffff82f6b840 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at:
unmap_vmas+0x7b/0x1a0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
fs_reclaim_acquire+0x60/0xd0
mem_cgroup_css_alloc+0x6f/0x9b0
cgroup_init_subsys+0xa4/0x240
cgroup_init+0x1c8/0x510
start_kernel+0x747/0x760
x86_64_start_reservations+0x25/0x30
x86_64_start_kernel+0x73/0x80
common_startup_64+0x129/0x138
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x91/0xd0
__kmalloc_cache_noprof+0x4d/0x4c0
mlx5r_cache_create_ent_locked+0x75/0x620 [mlx5_ib]
mlx5_mkey_cache_init+0x186/0x360 [mlx5_ib]
mlx5_ib_stage_post_ib_reg_umr_init+0x3c/0x60 [mlx5_ib]
__mlx5_ib_add+0x4b/0x190 [mlx5_ib]
mlx5r_probe+0xd9/0x320 [mlx5_ib]
auxiliary_bus_probe+0x42/0x70
really_probe+0xdb/0x360
__driver_probe_device+0x8f/0x130
driver_probe_device+0x1f/0xb0
__driver_attach+0xd4/0x1f0
bus_for_each_dev+0x79/0xd0
bus_add_driver+0xf0/0x200
driver_register+0x6e/0xc0
__auxiliary_driver_register+0x6a/0xc0
do_one_initcall+0x5e/0x390
do_init_module+0x88/0x240
init_module_from_file+0x85/0xc0
idempotent_init_module+0x104/0x300
__x64_sys_finit_module+0x68/0xc0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #1 (&dev->cache.rb_lock){+.+.}-{4:4}:
__mutex_lock+0x98/0xf10
__mlx5_ib_dereg_mr+0x6f2/0x890 [mlx5_ib]
mlx5_ib_dereg_mr+0x21/0x110 [mlx5_ib]
ib_dereg_mr_user+0x85/0x1f0 [ib_core]
uverbs_free_mr+0x19/0x30 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x80 [ib_uverbs]
uverbs_destroy_uobject+0x60/0x3d0 [ib_uverbs]
uobj_destroy+0x57/0xa0 [ib_uverbs]
ib_uverbs_cmd_verbs+0x4d5/0x1210 [ib_uverbs]
ib_uverbs_ioctl+0x129/0x230 [ib_uverbs]
__x64_sys_ioctl+0x596/0xaa0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #0 (&umem_odp->umem_mutex){+.+.}-{4:4}:
__lock_acquire+0x1826/0x2f00
lock_acquire+0xd3/0x2e0
__mutex_lock+0x98/0xf10
mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib]
__mmu_notifier_invalidate_range_start+0x18e/0x1f0
unmap_vmas+0x182/0x1a0
exit_mmap+0xf3/0x4a0
mmput+0x3a/0x100
do_exit+0x2b9/0xa90
do_group_exit+0x32/0xa0
get_signal+0xc32/0xcb0
arch_do_signal_or_restart+0x29/0x1d0
syscall_exit_to_user_mode+0x105/0x1d0
do_syscall_64+0x79/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Chain exists of:
&dev->cache.rb_lock --> mmu_notifier_invalidate_range_start -->
&umem_odp->umem_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&umem_odp->umem_mutex);
lock(mmu_notifier_invalidate_range_start);
lock(&umem_odp->umem_mutex);
lock(&dev->cache.rb_lock);
*** DEADLOCK ***
Fixes: abb604a ("RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/3c8f225a8a9fade647d19b014df1172544643e4a.1750061612.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When operating in concurrent STA/AP mode with host MLME enabled, the firmware incorrectly sends disassociation frames to the STA interface when clients disconnect from the AP interface. This causes kernel warnings as the STA interface processes disconnect events that don't apply to it: [ 1303.240540] WARNING: CPU: 0 PID: 513 at net/wireless/mlme.c:141 cfg80211_process_disassoc+0x78/0xec [cfg80211] [ 1303.250861] Modules linked in: 8021q garp stp mrp llc rfcomm bnep btnxpuart nls_iso8859_1 nls_cp437 onboard_us [ 1303.327651] CPU: 0 UID: 0 PID: 513 Comm: kworker/u9:2 Not tainted 6.16.0-rc1+ #3 PREEMPT [ 1303.335937] Hardware name: Toradex Verdin AM62 WB on Verdin Development Board (DT) [ 1303.343588] Workqueue: MWIFIEX_RX_WORK_QUEUE mwifiex_rx_work_queue [mwifiex] [ 1303.350856] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1303.357904] pc : cfg80211_process_disassoc+0x78/0xec [cfg80211] [ 1303.364065] lr : cfg80211_process_disassoc+0x70/0xec [cfg80211] [ 1303.370221] sp : ffff800083053be0 [ 1303.373590] x29: ffff800083053be0 x28: 0000000000000000 x27: 0000000000000000 [ 1303.380855] x26: 0000000000000000 x25: 00000000ffffffff x24: ffff000002c5b8ae [ 1303.388120] x23: ffff000002c5b884 x22: 0000000000000001 x21: 0000000000000008 [ 1303.395382] x20: ffff000002c5b8ae x19: ffff0000064dd408 x18: 0000000000000006 [ 1303.402646] x17: 3a36333a61623a30 x16: 32206d6f72662063 x15: ffff800080bfe048 [ 1303.409910] x14: ffff000003625300 x13: 0000000000000001 x12: 0000000000000000 [ 1303.417173] x11: 0000000000000002 x10: ffff000003958600 x9 : ffff000003625300 [ 1303.424434] x8 : ffff00003fd9ef40 x7 : ffff0000039fc280 x6 : 0000000000000002 [ 1303.431695] x5 : ffff0000038976d4 x4 : 0000000000000000 x3 : 0000000000003186 [ 1303.438956] x2 : 000000004836ba20 x1 : 0000000000006986 x0 : 00000000d00479de [ 1303.446221] Call trace: [ 1303.448722] cfg80211_process_disassoc+0x78/0xec [cfg80211] (P) [ 1303.454894] cfg80211_rx_mlme_mgmt+0x64/0xf8 [cfg80211] [ 1303.460362] mwifiex_process_mgmt_packet+0x1ec/0x460 [mwifiex] [ 1303.466380] mwifiex_process_sta_rx_packet+0x1bc/0x2a0 [mwifiex] [ 1303.472573] mwifiex_handle_rx_packet+0xb4/0x13c [mwifiex] [ 1303.478243] mwifiex_rx_work_queue+0x158/0x198 [mwifiex] [ 1303.483734] process_one_work+0x14c/0x28c [ 1303.487845] worker_thread+0x2cc/0x3d4 [ 1303.491680] kthread+0x12c/0x208 [ 1303.495014] ret_from_fork+0x10/0x20 Add validation in the STA receive path to verify that disassoc/deauth frames originate from the connected AP. Frames that fail this check are discarded early, preventing them from reaching the MLME layer and triggering WARN_ON(). This filtering logic is similar with that used in the ieee80211_rx_mgmt_disassoc() function in mac80211, which drops disassoc frames that don't match the current BSSID (!ether_addr_equal(mgmt->bssid, sdata->vif.cfg.ap_addr)), ensuring only relevant frames are processed. Tested on: - 8997 with FW 16.68.1.p197 Fixes: 3699589 ("wifi: mwifiex: add host mlme for client mode") Cc: stable@vger.kernel.org Signed-off-by: Vitor Soares <vitor.soares@toradex.com> Reviewed-by: Jeff Chen <jeff.chen_1@nxp.con> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> Link: https://patch.msgid.link/20250701142643.658990-1-ivitro@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
alloc_tag_top_users() attempts to lock alloc_tag_cttype->mod_lock even
when the alloc_tag_cttype is not allocated because:
1) alloc tagging is disabled because mem profiling is disabled
(!alloc_tag_cttype)
2) alloc tagging is enabled, but not yet initialized (!alloc_tag_cttype)
3) alloc tagging is enabled, but failed initialization
(!alloc_tag_cttype or IS_ERR(alloc_tag_cttype))
In all cases, alloc_tag_cttype is not allocated, and therefore
alloc_tag_top_users() should not attempt to acquire the semaphore.
This leads to a crash on memory allocation failure by attempting to
acquire a non-existent semaphore:
Oops: general protection fault, probably for non-canonical address 0xdffffc000000001b: 0000 [#3] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000000d8-0x00000000000000df]
CPU: 2 UID: 0 PID: 1 Comm: systemd Tainted: G D 6.16.0-rc2 #1 VOLUNTARY
Tainted: [D]=DIE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:down_read_trylock+0xaa/0x3b0
Code: d0 7c 08 84 d2 0f 85 a0 02 00 00 8b 0d df 31 dd 04 85 c9 75 29 48 b8 00 00 00 00 00 fc ff df 48 8d 6b 68 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 88 02 00 00 48 3b 5b 68 0f 85 53 01 00 00 65 ff
RSP: 0000:ffff8881002ce9b8 EFLAGS: 00010016
RAX: dffffc0000000000 RBX: 0000000000000070 RCX: 0000000000000000
RDX: 000000000000001b RSI: 000000000000000a RDI: 0000000000000070
RBP: 00000000000000d8 R08: 0000000000000001 R09: ffffed107dde49d1
R10: ffff8883eef24e8b R11: ffff8881002cec20 R12: 1ffff11020059d37
R13: 00000000003fff7b R14: ffff8881002cec20 R15: dffffc0000000000
FS: 00007f963f21d940(0000) GS:ffff888458ca6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f963f5edf71 CR3: 000000010672c000 CR4: 0000000000350ef0
Call Trace:
<TASK>
codetag_trylock_module_list+0xd/0x20
alloc_tag_top_users+0x369/0x4b0
__show_mem+0x1cd/0x6e0
warn_alloc+0x2b1/0x390
__alloc_frozen_pages_noprof+0x12b9/0x21a0
alloc_pages_mpol+0x135/0x3e0
alloc_slab_page+0x82/0xe0
new_slab+0x212/0x240
___slab_alloc+0x82a/0xe00
</TASK>
As David Wang points out, this issue became easier to trigger after commit
780138b ("alloc_tag: check mem_profiling_support in alloc_tag_init").
Before the commit, the issue occurred only when it failed to allocate and
initialize alloc_tag_cttype or if a memory allocation fails before
alloc_tag_init() is called. After the commit, it can be easily triggered
when memory profiling is compiled but disabled at boot.
To properly determine whether alloc_tag_init() has been called and its
data structures initialized, verify that alloc_tag_cttype is a valid
pointer before acquiring the semaphore. If the variable is NULL or an
error value, it has not been properly initialized. In such a case, just
skip and do not attempt to acquire the semaphore.
[harry.yoo@oracle.com: v3]
Link: https://lkml.kernel.org/r/20250624072513.84219-1-harry.yoo@oracle.com
Link: https://lkml.kernel.org/r/20250620195305.1115151-1-harry.yoo@oracle.com
Fixes: 780138b ("alloc_tag: check mem_profiling_support in alloc_tag_init")
Fixes: 1438d34 ("lib: add memory allocations report in show_mem()")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202506181351.bba867dd-lkp@intel.com
Acked-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Casey Chen <cachen@purestorage.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Yuanyuan Zhong <yzhong@purestorage.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If "try_verify_in_tasklet" is set for dm-verity, DM_BUFIO_CLIENT_NO_SLEEP
is enabled for dm-bufio. However, when bufio tries to evict buffers, there
is a chance to trigger scheduling in spin_lock_bh, the following warning
is hit:
BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2745
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 123, name: kworker/2:2
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
4 locks held by kworker/2:2/123:
#0: ffff88800a2d1548 ((wq_completion)dm_bufio_cache){....}-{0:0}, at: process_one_work+0xe46/0x1970
#1: ffffc90000d97d20 ((work_completion)(&dm_bufio_replacement_work)){....}-{0:0}, at: process_one_work+0x763/0x1970
#2: ffffffff8555b528 (dm_bufio_clients_lock){....}-{3:3}, at: do_global_cleanup+0x1ce/0x710
#3: ffff88801d5820b8 (&c->spinlock){....}-{2:2}, at: do_global_cleanup+0x2a5/0x710
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 2 UID: 0 PID: 123 Comm: kworker/2:2 Not tainted 6.16.0-rc3-g90548c634bd0 #305 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: dm_bufio_cache do_global_cleanup
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
__might_resched+0x360/0x4e0
do_global_cleanup+0x2f5/0x710
process_one_work+0x7db/0x1970
worker_thread+0x518/0xea0
kthread+0x359/0x690
ret_from_fork+0xf3/0x1b0
ret_from_fork_asm+0x1a/0x30
</TASK>
That can be reproduced by:
veritysetup format --data-block-size=4096 --hash-block-size=4096 /dev/vda /dev/vdb
SIZE=$(blockdev --getsz /dev/vda)
dmsetup create myverity -r --table "0 $SIZE verity 1 /dev/vda /dev/vdb 4096 4096 <data_blocks> 1 sha256 <root_hash> <salt> 1 try_verify_in_tasklet"
mount /dev/dm-0 /mnt -o ro
echo 102400 > /sys/module/dm_bufio/parameters/max_cache_size_bytes
[read files in /mnt]
Cc: stable@vger.kernel.org # v6.4+
Fixes: 450e8de ("dm bufio: improve concurrent IO performance")
Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com>
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The existing code move the VF NIC to new namespace when NETDEV_REGISTER is received on netvsc NIC. During deletion of the namespace, default_device_exit_batch() >> default_device_exit_net() is called. When netvsc NIC is moved back and registered to the default namespace, it automatically brings VF NIC back to the default namespace. This will cause the default_device_exit_net() >> for_each_netdev_safe loop unable to detect the list end, and hit NULL ptr: [ 231.449420] mana 7870:00:00.0 enP30832s1: Moved VF to namespace with: eth0 [ 231.449656] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 231.450246] #PF: supervisor read access in kernel mode [ 231.450579] #PF: error_code(0x0000) - not-present page [ 231.450916] PGD 17b8a8067 P4D 0 [ 231.451163] Oops: Oops: 0000 [#1] SMP NOPTI [ 231.451450] CPU: 82 UID: 0 PID: 1394 Comm: kworker/u768:1 Not tainted 6.16.0-rc4+ #3 VOLUNTARY [ 231.452042] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/21/2024 [ 231.452692] Workqueue: netns cleanup_net [ 231.452947] RIP: 0010:default_device_exit_batch+0x16c/0x3f0 [ 231.453326] Code: c0 0c f5 b3 e8 d5 db fe ff 48 85 c0 74 15 48 c7 c2 f8 fd ca b2 be 10 00 00 00 48 8d 7d c0 e8 7b 77 25 00 49 8b 86 28 01 00 00 <48> 8b 50 10 4c 8b 2a 4c 8d 62 f0 49 83 ed 10 4c 39 e0 0f 84 d6 00 [ 231.454294] RSP: 0018:ff75fc7c9bf9fd00 EFLAGS: 00010246 [ 231.454610] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 61c8864680b583eb [ 231.455094] RDX: ff1fa9f71462d800 RSI: ff75fc7c9bf9fd38 RDI: 0000000030766564 [ 231.455686] RBP: ff75fc7c9bf9fd78 R08: 0000000000000000 R09: 0000000000000000 [ 231.456126] R10: 0000000000000001 R11: 0000000000000004 R12: ff1fa9f70088e340 [ 231.456621] R13: ff1fa9f70088e340 R14: ffffffffb3f50c20 R15: ff1fa9f7103e6340 [ 231.457161] FS: 0000000000000000(0000) GS:ff1faa6783a08000(0000) knlGS:0000000000000000 [ 231.457707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 231.458031] CR2: 0000000000000010 CR3: 0000000179ab2006 CR4: 0000000000b73ef0 [ 231.458434] Call Trace: [ 231.458600] <TASK> [ 231.458777] ops_undo_list+0x100/0x220 [ 231.459015] cleanup_net+0x1b8/0x300 [ 231.459285] process_one_work+0x184/0x340 To fix it, move the ns change to a workqueue, and take rtnl_lock to avoid changing the netdev list when default_device_exit_net() is using it. Cc: stable@vger.kernel.org Fixes: 4c26280 ("hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1754511711-11188-1-git-send-email-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…dlock
When a user creates a dualpi2 qdisc it automatically sets a timer. This
timer will run constantly and update the qdisc's probability field.
The issue is that the timer acquires the qdisc root lock and runs in
hardirq. The qdisc root lock is also acquired in dev.c whenever a packet
arrives for this qdisc. Since the dualpi2 timer callback runs in hardirq,
it may interrupt the packet processing running in softirq. If that happens
and it runs on the same CPU, it will acquire the same lock and cause a
deadlock. The following splat shows up when running a kernel compiled with
lock debugging:
[ +0.000224] WARNING: inconsistent lock state
[ +0.000224] 6.16.0+ #10 Not tainted
[ +0.000169] --------------------------------
[ +0.000029] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ +0.000000] ping/156 [HC0[0]:SC0[2]:HE1:SE0] takes:
[ +0.000000] ffff897841242110 (&sch->root_lock_key){?.-.}-{3:3}, at: __dev_queue_xmit+0x86d/0x1140
[ +0.000000] {IN-HARDIRQ-W} state was registered at:
[ +0.000000] lock_acquire.part.0+0xb6/0x220
[ +0.000000] _raw_spin_lock+0x31/0x80
[ +0.000000] dualpi2_timer+0x6f/0x270
[ +0.000000] __hrtimer_run_queues+0x1c5/0x360
[ +0.000000] hrtimer_interrupt+0x115/0x260
[ +0.000000] __sysvec_apic_timer_interrupt+0x6d/0x1a0
[ +0.000000] sysvec_apic_timer_interrupt+0x6e/0x80
[ +0.000000] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ +0.000000] pv_native_safe_halt+0xf/0x20
[ +0.000000] default_idle+0x9/0x10
[ +0.000000] default_idle_call+0x7e/0x1e0
[ +0.000000] do_idle+0x1e8/0x250
[ +0.000000] cpu_startup_entry+0x29/0x30
[ +0.000000] rest_init+0x151/0x160
[ +0.000000] start_kernel+0x6f3/0x700
[ +0.000000] x86_64_start_reservations+0x24/0x30
[ +0.000000] x86_64_start_kernel+0xc8/0xd0
[ +0.000000] common_startup_64+0x13e/0x148
[ +0.000000] irq event stamp: 6884
[ +0.000000] hardirqs last enabled at (6883): [<ffffffffa75700b3>] neigh_resolve_output+0x223/0x270
[ +0.000000] hardirqs last disabled at (6882): [<ffffffffa7570078>] neigh_resolve_output+0x1e8/0x270
[ +0.000000] softirqs last enabled at (6880): [<ffffffffa757006b>] neigh_resolve_output+0x1db/0x270
[ +0.000000] softirqs last disabled at (6884): [<ffffffffa755b533>] __dev_queue_xmit+0x73/0x1140
[ +0.000000]
other info that might help us debug this:
[ +0.000000] Possible unsafe locking scenario:
[ +0.000000] CPU0
[ +0.000000] ----
[ +0.000000] lock(&sch->root_lock_key);
[ +0.000000] <Interrupt>
[ +0.000000] lock(&sch->root_lock_key);
[ +0.000000]
*** DEADLOCK ***
[ +0.000000] 4 locks held by ping/156:
[ +0.000000] #0: ffff897842332e08 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0x41e/0xf40
[ +0.000000] #1: ffffffffa816f880 (rcu_read_lock){....}-{1:3}, at: ip_output+0x2c/0x190
[ +0.000000] #2: ffffffffa816f880 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0xad/0x950
[ +0.000000] #3: ffffffffa816f840 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x73/0x1140
I am able to reproduce it consistently when running the following:
tc qdisc add dev lo handle 1: root dualpi2
ping -f 127.0.0.1
To fix it, make the timer run in softirq.
Fixes: 320d031 ("sched: Struct definition and parsing of dualpi2 qdisc")
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250815135317.664993-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the snd_utimer_create() function, if the kasprintf() function return NULL, snd_utimer_put_id() will be called, finally use ida_free() to free the unallocated id 0. the syzkaller reported the following information: ------------[ cut here ]------------ ida_free called for id=0 which is not allocated. WARNING: CPU: 1 PID: 1286 at lib/idr.c:592 ida_free+0x1fd/0x2f0 lib/idr.c:592 Modules linked in: CPU: 1 UID: 0 PID: 1286 Comm: syz-executor164 Not tainted 6.15.8 #3 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.fc42 04/01/2014 RIP: 0010:ida_free+0x1fd/0x2f0 lib/idr.c:592 Code: f8 fc 41 83 fc 3e 76 69 e8 70 b2 f8 (...) RSP: 0018:ffffc900007f79c8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 1ffff920000fef3b RCX: ffffffff872176a5 RDX: ffff88800369d200 RSI: 0000000000000000 RDI: ffff88800369d200 RBP: 0000000000000000 R08: ffffffff87ba60a5 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f6f1abc1740(0000) GS:ffff8880d76a0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6f1ad7a784 CR3: 000000007a6e2000 CR4: 00000000000006f0 Call Trace: <TASK> snd_utimer_put_id sound/core/timer.c:2043 [inline] [snd_timer] snd_utimer_create+0x59b/0x6a0 sound/core/timer.c:2184 [snd_timer] snd_utimer_ioctl_create sound/core/timer.c:2202 [inline] [snd_timer] __snd_timer_user_ioctl.isra.0+0x724/0x1340 sound/core/timer.c:2287 [snd_timer] snd_timer_user_ioctl+0x75/0xc0 sound/core/timer.c:2298 [snd_timer] vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl fs/ioctl.c:893 [inline] __x64_sys_ioctl+0x198/0x200 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x7b/0x160 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] The utimer->id should be set properly before the kasprintf() function, ensures the snd_utimer_put_id() function will free the allocated id. Fixes: 3774591 ("ALSA: timer: Introduce virtual userspace-driven timers") Signed-off-by: Dewei Meng <mengdewei@cqsoftware.com.cn> Link: https://patch.msgid.link/20250821014317.40786-1-mengdewei@cqsoftware.com.cn Signed-off-by: Takashi Iwai <tiwai@suse.de>
Commit 3c7ac40 ("scsi: ufs: core: Delegate the interrupt service routine to a threaded IRQ handler") introduced an IRQ lock inversion issue. Fix this lock inversion by changing the spin_lock_irq() calls into spin_lock_irqsave() calls in code that can be called either from interrupt context or from thread context. This patch fixes the following lockdep complaint: WARNING: possible irq lock inversion dependency detected 6.12.30-android16-5-maybe-dirty-4k #1 Tainted: G W OE -------------------------------------------------------- kworker/u28:0/12 just changed the state of lock: ffffff881e29dd60 (&hba->clk_gating.lock){-...}-{2:2}, at: ufshcd_release_scsi_cmd+0x60/0x110 but this lock took another, HARDIRQ-unsafe lock in the past: (shost->host_lock){+.+.}-{2:2} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(shost->host_lock); local_irq_disable(); lock(&hba->clk_gating.lock); lock(shost->host_lock); <Interrupt> lock(&hba->clk_gating.lock); *** DEADLOCK *** 4 locks held by kworker/u28:0/12: #0: ffffff8800ac6158 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x1bc/0x65c #1: ffffffc085c93d70 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x1e4/0x65c #2: ffffff881e29c0e0 (&shost->scan_mutex){+.+.}-{3:3}, at: __scsi_add_device+0x74/0x120 #3: ffffff881960ea00 (&hwq->cq_lock){-...}-{2:2}, at: ufshcd_mcq_poll_cqe_lock+0x28/0x104 the shortest dependencies between 2nd lock and 1st lock: -> (shost->host_lock){+.+.}-{2:2} { HARDIRQ-ON-W at: lock_acquire+0x134/0x2b4 _raw_spin_lock+0x48/0x64 ufshcd_sl_intr+0x4c/0xa08 ufshcd_threaded_intr+0x70/0x12c irq_thread_fn+0x48/0xa8 irq_thread+0x130/0x1ec kthread+0x110/0x134 ret_from_fork+0x10/0x20 SOFTIRQ-ON-W at: lock_acquire+0x134/0x2b4 _raw_spin_lock+0x48/0x64 ufshcd_sl_intr+0x4c/0xa08 ufshcd_threaded_intr+0x70/0x12c irq_thread_fn+0x48/0xa8 irq_thread+0x130/0x1ec kthread+0x110/0x134 ret_from_fork+0x10/0x20 INITIAL USE at: lock_acquire+0x134/0x2b4 _raw_spin_lock+0x48/0x64 ufshcd_sl_intr+0x4c/0xa08 ufshcd_threaded_intr+0x70/0x12c irq_thread_fn+0x48/0xa8 irq_thread+0x130/0x1ec kthread+0x110/0x134 ret_from_fork+0x10/0x20 } ... key at: [<ffffffc085ba1a98>] scsi_host_alloc.__key+0x0/0x10 ... acquired at: _raw_spin_lock_irqsave+0x5c/0x80 __ufshcd_release+0x78/0x118 ufshcd_send_uic_cmd+0xe4/0x118 ufshcd_dme_set_attr+0x88/0x1c8 ufs_google_phy_initialization+0x68/0x418 [ufs] ufs_google_link_startup_notify+0x78/0x27c [ufs] ufshcd_link_startup+0x84/0x720 ufshcd_init+0xf3c/0x1330 ufshcd_pltfrm_init+0x728/0x7d8 ufs_google_probe+0x30/0x84 [ufs] platform_probe+0xa0/0xe0 really_probe+0x114/0x454 __driver_probe_device+0xa4/0x160 driver_probe_device+0x44/0x23c __driver_attach_async_helper+0x60/0xd4 async_run_entry_fn+0x4c/0x17c process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 -> (&hba->clk_gating.lock){-...}-{2:2} { IN-HARDIRQ-W at: lock_acquire+0x134/0x2b4 _raw_spin_lock_irqsave+0x5c/0x80 ufshcd_release_scsi_cmd+0x60/0x110 ufshcd_compl_one_cqe+0x2c0/0x3f4 ufshcd_mcq_poll_cqe_lock+0xb0/0x104 ufs_google_mcq_intr+0x80/0xa0 [ufs] __handle_irq_event_percpu+0x104/0x32c handle_irq_event+0x40/0x9c handle_fasteoi_irq+0x170/0x2e8 generic_handle_domain_irq+0x58/0x80 gic_handle_irq+0x48/0x104 call_on_irq_stack+0x3c/0x50 do_interrupt_handler+0x7c/0xd8 el1_interrupt+0x34/0x58 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x68/0x6c _raw_spin_unlock_irqrestore+0x3c/0x6c debug_object_assert_init+0x16c/0x21c __mod_timer+0x4c/0x48c schedule_timeout+0xd4/0x16c io_schedule_timeout+0x48/0x70 do_wait_for_common+0x100/0x194 wait_for_completion_io_timeout+0x48/0x6c blk_execute_rq+0x124/0x17c scsi_execute_cmd+0x18c/0x3f8 scsi_probe_and_add_lun+0x204/0xd74 __scsi_add_device+0xbc/0x120 ufshcd_async_scan+0x80/0x3c0 async_run_entry_fn+0x4c/0x17c process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 INITIAL USE at: lock_acquire+0x134/0x2b4 _raw_spin_lock_irqsave+0x5c/0x80 ufshcd_hold+0x34/0x14c ufshcd_send_uic_cmd+0x28/0x118 ufshcd_dme_set_attr+0x88/0x1c8 ufs_google_phy_initialization+0x68/0x418 [ufs] ufs_google_link_startup_notify+0x78/0x27c [ufs] ufshcd_link_startup+0x84/0x720 ufshcd_init+0xf3c/0x1330 ufshcd_pltfrm_init+0x728/0x7d8 ufs_google_probe+0x30/0x84 [ufs] platform_probe+0xa0/0xe0 really_probe+0x114/0x454 __driver_probe_device+0xa4/0x160 driver_probe_device+0x44/0x23c __driver_attach_async_helper+0x60/0xd4 async_run_entry_fn+0x4c/0x17c process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 } ... key at: [<ffffffc085ba6fe8>] ufshcd_init.__key+0x0/0x10 ... acquired at: mark_lock+0x1c4/0x224 __lock_acquire+0x438/0x2e1c lock_acquire+0x134/0x2b4 _raw_spin_lock_irqsave+0x5c/0x80 ufshcd_release_scsi_cmd+0x60/0x110 ufshcd_compl_one_cqe+0x2c0/0x3f4 ufshcd_mcq_poll_cqe_lock+0xb0/0x104 ufs_google_mcq_intr+0x80/0xa0 [ufs] __handle_irq_event_percpu+0x104/0x32c handle_irq_event+0x40/0x9c handle_fasteoi_irq+0x170/0x2e8 generic_handle_domain_irq+0x58/0x80 gic_handle_irq+0x48/0x104 call_on_irq_stack+0x3c/0x50 do_interrupt_handler+0x7c/0xd8 el1_interrupt+0x34/0x58 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x68/0x6c _raw_spin_unlock_irqrestore+0x3c/0x6c debug_object_assert_init+0x16c/0x21c __mod_timer+0x4c/0x48c schedule_timeout+0xd4/0x16c io_schedule_timeout+0x48/0x70 do_wait_for_common+0x100/0x194 wait_for_completion_io_timeout+0x48/0x6c blk_execute_rq+0x124/0x17c scsi_execute_cmd+0x18c/0x3f8 scsi_probe_and_add_lun+0x204/0xd74 __scsi_add_device+0xbc/0x120 ufshcd_async_scan+0x80/0x3c0 async_run_entry_fn+0x4c/0x17c process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 stack backtrace: CPU: 6 UID: 0 PID: 12 Comm: kworker/u28:0 Tainted: G W OE 6.12.30-android16-5-maybe-dirty-4k #1 ccd4020fe444bdf629efc3b86df6be920b8df7d0 Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Spacecraft board based on MALIBU (DT) Workqueue: async async_run_entry_fn Call trace: dump_backtrace+0xfc/0x17c show_stack+0x18/0x28 dump_stack_lvl+0x40/0xa0 dump_stack+0x18/0x24 print_irq_inversion_bug+0x2fc/0x304 mark_lock_irq+0x388/0x4fc mark_lock+0x1c4/0x224 __lock_acquire+0x438/0x2e1c lock_acquire+0x134/0x2b4 _raw_spin_lock_irqsave+0x5c/0x80 ufshcd_release_scsi_cmd+0x60/0x110 ufshcd_compl_one_cqe+0x2c0/0x3f4 ufshcd_mcq_poll_cqe_lock+0xb0/0x104 ufs_google_mcq_intr+0x80/0xa0 [ufs dd6f385554e109da094ab91d5f7be18625a2222a] __handle_irq_event_percpu+0x104/0x32c handle_irq_event+0x40/0x9c handle_fasteoi_irq+0x170/0x2e8 generic_handle_domain_irq+0x58/0x80 gic_handle_irq+0x48/0x104 call_on_irq_stack+0x3c/0x50 do_interrupt_handler+0x7c/0xd8 el1_interrupt+0x34/0x58 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x68/0x6c _raw_spin_unlock_irqrestore+0x3c/0x6c debug_object_assert_init+0x16c/0x21c __mod_timer+0x4c/0x48c schedule_timeout+0xd4/0x16c io_schedule_timeout+0x48/0x70 do_wait_for_common+0x100/0x194 wait_for_completion_io_timeout+0x48/0x6c blk_execute_rq+0x124/0x17c scsi_execute_cmd+0x18c/0x3f8 scsi_probe_and_add_lun+0x204/0xd74 __scsi_add_device+0xbc/0x120 ufshcd_async_scan+0x80/0x3c0 async_run_entry_fn+0x4c/0x17c process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 Cc: Neil Armstrong <neil.armstrong@linaro.org> Cc: André Draszik <andre.draszik@linaro.org> Reviewed-by: Peter Wang <peter.wang@mediatek.com> Fixes: 3c7ac40 ("scsi: ufs: core: Delegate the interrupt service routine to a threaded IRQ handler") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250815155842.472867-2-bvanassche@acm.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
These iterations require the read lock, otherwise RCU lockdep will splat: ============================= WARNING: suspicious RCU usage 6.17.0-rc3-00014-g31419c045d64 #6 Tainted: G O ----------------------------- drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by rtcwake/547: #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98 stack backtrace: CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G O 6.17.0-rc3-00014-g31419c045d64 #6 VOLUNTARY Tainted: [O]=OOT_MODULE Stack: 223721b3a80 6089eac6 00000001 00000001 ffffff00 6089eac6 00000535 6086e528 721b3ac0 6003c294 00000000 60031fc0 Call Trace: [<600407ed>] show_stack+0x10e/0x127 [<6003c294>] dump_stack_lvl+0x77/0xc6 [<6003c2fd>] dump_stack+0x1a/0x20 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e [<603d980f>] device_suspend+0x528/0x541 [<603da24b>] dpm_suspend+0x1a2/0x267 [<603da837>] dpm_suspend_start+0x5d/0x72 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736 [...] Add the fourth argument to the iteration to annotate this and avoid the splat. Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents") Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The QuickI2C ACPI _DSD methods return ICRS and ISUB data with a trailing byte, making the actual length is one more byte than the structs defined. It caused stack-out-of-bounds and kernel crash: kernel: BUG: KASAN: stack-out-of-bounds in quicki2c_acpi_get_dsd_property.constprop.0+0x111/0x1b0 [intel_quicki2c] kernel: Write of size 12 at addr ffff888106d1f900 by task kworker/u33:2/75 kernel: kernel: CPU: 3 UID: 0 PID: 75 Comm: kworker/u33:2 Not tainted 6.16.0+ #3 PREEMPT(voluntary) kernel: Workqueue: async async_run_entry_fn kernel: Call Trace: kernel: <TASK> kernel: dump_stack_lvl+0x76/0xa0 kernel: print_report+0xd1/0x660 kernel: ? __pfx__raw_spin_lock_irqsave+0x10/0x10 kernel: ? __kasan_slab_free+0x5d/0x80 kernel: ? kasan_addr_to_slab+0xd/0xb0 kernel: kasan_report+0xe1/0x120 kernel: ? quicki2c_acpi_get_dsd_property.constprop.0+0x111/0x1b0 [intel_quicki2c] kernel: ? quicki2c_acpi_get_dsd_property.constprop.0+0x111/0x1b0 [intel_quicki2c] kernel: kasan_check_range+0x11c/0x200 kernel: __asan_memcpy+0x3b/0x80 kernel: quicki2c_acpi_get_dsd_property.constprop.0+0x111/0x1b0 [intel_quicki2c] kernel: ? __pfx_quicki2c_acpi_get_dsd_property.constprop.0+0x10/0x10 [intel_quicki2c] kernel: quicki2c_get_acpi_resources+0x237/0x730 [intel_quicki2c] [...] kernel: </TASK> kernel: kernel: The buggy address belongs to stack of task kworker/u33:2/75 kernel: and is located at offset 48 in frame: kernel: quicki2c_get_acpi_resources+0x0/0x730 [intel_quicki2c] kernel: kernel: This frame has 3 objects: kernel: [32, 36) 'hid_desc_addr' kernel: [48, 59) 'i2c_param' kernel: [80, 224) 'i2c_config' ACPI DSD methods return: \_SB.PC00.THC0.ICRS Buffer 000000003fdc947b 001 Len 0C = 0A 00 80 1A 06 00 00 00 00 00 00 00 \_SB.PC00.THC0.ISUB Buffer 00000000f2fcbdc4 001 Len 91 = 00 00 00 00 00 00 00 00 00 00 00 00 Adding reserved padding to quicki2c_subip_acpi_parameter/config. Fixes: 5282e45 ("HID: intel-thc-hid: intel-quicki2c: Add THC QuickI2C ACPI interfaces") Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Reviewed-by: Even Xu <even.xu@intel.com> Tested-by: Even Xu <even.xu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
…C regs save Improper use of secondary pointer (&dev->i2c_subip_regs) caused kernel crash and out-of-bounds error: BUG: KASAN: slab-out-of-bounds in _regmap_bulk_read+0x449/0x510 Write of size 4 at addr ffff888136005dc0 by task kworker/u33:5/5107 CPU: 3 UID: 0 PID: 5107 Comm: kworker/u33:5 Not tainted 6.16.0+ #3 PREEMPT(voluntary) Workqueue: async async_run_entry_fn Call Trace: <TASK> dump_stack_lvl+0x76/0xa0 print_report+0xd1/0x660 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? kasan_complete_mode_report_info+0x26/0x200 kasan_report+0xe1/0x120 ? _regmap_bulk_read+0x449/0x510 ? _regmap_bulk_read+0x449/0x510 __asan_report_store4_noabort+0x17/0x30 _regmap_bulk_read+0x449/0x510 ? __pfx__regmap_bulk_read+0x10/0x10 regmap_bulk_read+0x270/0x3d0 pio_complete+0x1ee/0x2c0 [intel_thc] ? __pfx_pio_complete+0x10/0x10 [intel_thc] ? __pfx_pio_wait+0x10/0x10 [intel_thc] ? regmap_update_bits_base+0x13b/0x1f0 thc_i2c_subip_pio_read+0x117/0x270 [intel_thc] thc_i2c_subip_regs_save+0xc2/0x140 [intel_thc] ? __pfx_thc_i2c_subip_regs_save+0x10/0x10 [intel_thc] [...] The buggy address belongs to the object at ffff888136005d00 which belongs to the cache kmalloc-rnd-12-192 of size 192 The buggy address is located 0 bytes to the right of allocated 192-byte region [ffff888136005d00, ffff888136005dc0) Replaced with direct array indexing (&dev->i2c_subip_regs[i]) to ensure safe memory access. Fixes: 4228966 ("HID: intel-thc-hid: intel-thc: Add THC I2C config interfaces") Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Reviewed-by: Even Xu <even.xu@intel.com> Tested-by: Even Xu <even.xu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
A malicious HID device can trigger a slab out-of-bounds during mt_report_fixup() by passing in report descriptor smaller than 607 bytes. mt_report_fixup() attempts to patch byte offset 607 of the descriptor with 0x25 by first checking if byte offset 607 is 0x15 however it lacks bounds checks to verify if the descriptor is big enough before conducting this check. Fix this bug by ensuring the descriptor size is at least 608 bytes before accessing it. Below is the KASAN splat after the out of bounds access happens: [ 13.671954] ================================================================== [ 13.672667] BUG: KASAN: slab-out-of-bounds in mt_report_fixup+0x103/0x110 [ 13.673297] Read of size 1 at addr ffff888103df39df by task kworker/0:1/10 [ 13.673297] [ 13.673297] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.15.0-00005-gec5d573d83f4-dirty #3 [ 13.673297] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/04 [ 13.673297] Call Trace: [ 13.673297] <TASK> [ 13.673297] dump_stack_lvl+0x5f/0x80 [ 13.673297] print_report+0xd1/0x660 [ 13.673297] kasan_report+0xe5/0x120 [ 13.673297] __asan_report_load1_noabort+0x18/0x20 [ 13.673297] mt_report_fixup+0x103/0x110 [ 13.673297] hid_open_report+0x1ef/0x810 [ 13.673297] mt_probe+0x422/0x960 [ 13.673297] hid_device_probe+0x2e2/0x6f0 [ 13.673297] really_probe+0x1c6/0x6b0 [ 13.673297] __driver_probe_device+0x24f/0x310 [ 13.673297] driver_probe_device+0x4e/0x220 [ 13.673297] __device_attach_driver+0x169/0x320 [ 13.673297] bus_for_each_drv+0x11d/0x1b0 [ 13.673297] __device_attach+0x1b8/0x3e0 [ 13.673297] device_initial_probe+0x12/0x20 [ 13.673297] bus_probe_device+0x13d/0x180 [ 13.673297] device_add+0xe3a/0x1670 [ 13.673297] hid_add_device+0x31d/0xa40 [...] Fixes: c8000de ("HID: multitouch: Add support for GT7868Q") Cc: stable@vger.kernel.org Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-3-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When the "proxy" option is enabled on a VXLAN device, the device will suppress ARP requests and IPv6 Neighbor Solicitation messages if it is able to reply on behalf of the remote host. That is, if a matching and valid neighbor entry is configured on the VXLAN device whose MAC address is not behind the "any" remote (0.0.0.0 / ::). The code currently assumes that the FDB entry for the neighbor's MAC address points to a valid remote destination, but this is incorrect if the entry is associated with an FDB nexthop group. This can result in a NPD [1][3] which can be reproduced using [2][4]. Fix by checking that the remote destination exists before dereferencing it. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] CPU: 4 UID: 0 PID: 365 Comm: arping Not tainted 6.17.0-rc2-virtme-g2a89cb21162c #2 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014 RIP: 0010:vxlan_xmit+0xb58/0x15f0 [...] Call Trace: <TASK> dev_hard_start_xmit+0x5d/0x1c0 __dev_queue_xmit+0x246/0xfd0 packet_sendmsg+0x113a/0x1850 __sock_sendmsg+0x38/0x70 __sys_sendto+0x126/0x180 __x64_sys_sendto+0x24/0x30 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [2] #!/bin/bash ip address add 192.0.2.1/32 dev lo ip nexthop add id 1 via 192.0.2.2 fdb ip nexthop add id 10 group 1 fdb ip link add name vx0 up type vxlan id 10010 local 192.0.2.1 dstport 4789 proxy ip neigh add 192.0.2.3 lladdr 00:11:22:33:44:55 nud perm dev vx0 bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 10 arping -b -c 1 -s 192.0.2.1 -I vx0 192.0.2.3 [3] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] CPU: 13 UID: 0 PID: 372 Comm: ndisc6 Not tainted 6.17.0-rc2-virtmne-g6ee90cb26014 #3 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1v996), BIOS 1.17.0-4.fc41 04/01/2x014 RIP: 0010:vxlan_xmit+0x803/0x1600 [...] Call Trace: <TASK> dev_hard_start_xmit+0x5d/0x1c0 __dev_queue_xmit+0x246/0xfd0 ip6_finish_output2+0x210/0x6c0 ip6_finish_output+0x1af/0x2b0 ip6_mr_output+0x92/0x3e0 ip6_send_skb+0x30/0x90 rawv6_sendmsg+0xe6e/0x12e0 __sock_sendmsg+0x38/0x70 __sys_sendto+0x126/0x180 __x64_sys_sendto+0x24/0x30 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f383422ec77 [4] #!/bin/bash ip address add 2001:db8:1::1/128 dev lo ip nexthop add id 1 via 2001:db8:1::1 fdb ip nexthop add id 10 group 1 fdb ip link add name vx0 up type vxlan id 10010 local 2001:db8:1::1 dstport 4789 proxy ip neigh add 2001:db8:1::3 lladdr 00:11:22:33:44:55 nud perm dev vx0 bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 10 ndisc6 -r 1 -s 2001:db8:1::1 -w 1 2001:db8:1::3 vx0 Fixes: 1274e1c ("vxlan: ecmp support for mac fdb entries") Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250901065035.159644-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
vxlan: Fix NPDs when using nexthop objects
With FDB nexthop groups, VXLAN FDB entries do not necessarily point to
a remote destination but rather to an FDB nexthop group. This means that
first_remote_{rcu,rtnl}() can return NULL and a few places in the driver
were not ready for that, resulting in NULL pointer dereferences.
Patches #1-#2 fix these NPDs.
Note that vxlan_fdb_find_uc() still dereferences the remote returned by
first_remote_rcu() without checking that it is not NULL, but this
function is only invoked by a single driver which vetoes the creation of
FDB nexthop groups. I will patch this in net-next to make the code less
fragile.
Patch #3 adds a selftests which exercises these code paths and tests
basic Tx functionality with FDB nexthop groups. I verified that the test
crashes the kernel without the first two patches.
====================
Link: https://patch.msgid.link/20250901065035.159644-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When transmitting a PTP frame which is timestamp using 2 step, the
following warning appears if CONFIG_PROVE_LOCKING is enabled:
=============================
[ BUG: Invalid wait context ]
6.17.0-rc1-00326-ge6160462704e #427 Not tainted
-----------------------------
ptp4l/119 is trying to lock:
c2a44ed4 (&vsc8531->ts_lock){+.+.}-{3:3}, at: vsc85xx_txtstamp+0x50/0xac
other info that might help us debug this:
context-{4:4}
4 locks held by ptp4l/119:
#0: c145f068 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x58/0x1440
#1: c29df974 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x5c4/0x1440
#2: c2aaaad0 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x108/0x350
#3: c2aac170 (&lan966x->tx_lock){+.-.}-{2:2}, at: lan966x_port_xmit+0xd0/0x350
stack backtrace:
CPU: 0 UID: 0 PID: 119 Comm: ptp4l Not tainted 6.17.0-rc1-00326-ge6160462704e #427 NONE
Hardware name: Generic DT based system
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x7c/0xac
dump_stack_lvl from __lock_acquire+0x8e8/0x29dc
__lock_acquire from lock_acquire+0x108/0x38c
lock_acquire from __mutex_lock+0xb0/0xe78
__mutex_lock from mutex_lock_nested+0x1c/0x24
mutex_lock_nested from vsc85xx_txtstamp+0x50/0xac
vsc85xx_txtstamp from lan966x_fdma_xmit+0xd8/0x3a8
lan966x_fdma_xmit from lan966x_port_xmit+0x1bc/0x350
lan966x_port_xmit from dev_hard_start_xmit+0xc8/0x2c0
dev_hard_start_xmit from sch_direct_xmit+0x8c/0x350
sch_direct_xmit from __dev_queue_xmit+0x680/0x1440
__dev_queue_xmit from packet_sendmsg+0xfa4/0x1568
packet_sendmsg from __sys_sendto+0x110/0x19c
__sys_sendto from sys_send+0x18/0x20
sys_send from ret_fast_syscall+0x0/0x1c
Exception stack(0xf0b05fa8 to 0xf0b05ff0)
5fa0: 00000001 0000000e 0000000e 0004b47a 0000003a 00000000
5fc0: 00000001 0000000e 00000000 00000121 0004af58 00044874 00000000 00000000
5fe0: 00000001 bee9d420 00025a10 b6e75c7c
So, instead of using the ts_lock for tx_queue, use the spinlock that
skb_buff_head has.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Fixes: 7d272e6 ("net: phy: mscc: timestamping and PHC support")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20250902121259.3257536-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Problem description
===================
Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.
phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
-> phy_config_inband() // acquires &pl->phydev->lock
whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().
The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.
phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.
Problem impact
==============
I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.
Proposed solution
=================
Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.
Solution details, considerations, notes
=======================================
This is the phy_config_inband() call graph:
sfp_upstream_ops :: connect_phy()
|
v
phylink_sfp_connect_phy()
|
v
phylink_sfp_config_phy()
|
| sfp_upstream_ops :: module_insert()
| |
| v
| phylink_sfp_module_insert()
| |
| | sfp_upstream_ops :: module_start()
| | |
| | v
| | phylink_sfp_module_start()
| | |
| v v
| phylink_sfp_config_optical()
phylink_start() | |
| phylink_resume() v v
| | phylink_sfp_set_config()
| | |
v v v
phylink_mac_initial_config()
| phylink_resolve()
| | phylink_ethtool_ksettings_set()
v v v
phylink_major_config()
|
v
phy_config_inband()
phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().
phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.
phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.
Other solutions
===============
The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.
Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250904125238.193990-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1
This fixes the following UAF caused by not properly locking hdev when processing HCI_EV_NUM_COMP_PKTS: BUG: KASAN: slab-use-after-free in hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 Read of size 4 at addr ffff8880740f0940 by task kworker/u11:0/54 CPU: 1 UID: 0 PID: 54 Comm: kworker/u11:0 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci1 hci_rx_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 hci_num_comp_pkts_evt+0x1c8/0xa50 net/bluetooth/hci_event.c:4404 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 54: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 le_conn_complete_evt+0x3d6/0x1220 net/bluetooth/hci_event.c:5628 hci_le_enh_conn_complete_evt+0x189/0x470 net/bluetooth/hci_event.c:5794 hci_event_func net/bluetooth/hci_event.c:7474 [inline] hci_event_packet+0x78c/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Freed by task 9572: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_abort_conn_sync+0x5d1/0xdf0 net/bluetooth/hci_sync.c:5689 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: 134f4b3 ("Bluetooth: add support for skb TX SND/COMPLETION timestamping") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the following UFA in hci_acl_create_conn_sync where a connection still pending is command submission (conn->state == BT_OPEN) maybe freed, also since this also can happen with the likes of hci_le_create_conn_sync fix it as well: BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 Write of size 2 at addr ffff88805ffcc038 by task kworker/u11:2/9541 CPU: 1 UID: 0 PID: 9541 Comm: kworker/u11:2 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci3 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 123736: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 hci_conn_add_unset net/bluetooth/hci_conn.c:1051 [inline] hci_connect_acl+0x16c/0x4e0 net/bluetooth/hci_conn.c:1634 pair_device+0x418/0xa70 net/bluetooth/mgmt.c:3556 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:727 sock_write_iter+0x258/0x330 net/socket.c:1131 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x54b/0xa90 fs/read_write.c:686 ksys_write+0x145/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 103680: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:47 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:548 insert_work+0x3d/0x330 kernel/workqueue.c:2183 __queue_work+0xbd9/0xfe0 kernel/workqueue.c:2345 queue_delayed_work_on+0x18b/0x280 kernel/workqueue.c:2561 pairing_complete+0x1e7/0x2b0 net/bluetooth/mgmt.c:3451 pairing_complete_cb+0x1ac/0x230 net/bluetooth/mgmt.c:3487 hci_connect_cfm include/net/bluetooth/hci_core.h:2064 [inline] hci_conn_failed+0x24d/0x310 net/bluetooth/hci_conn.c:1275 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: aef2aa4 ("Bluetooth: hci_event: Fix creating hci_conn object on error status") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Ido Schimmel says: ==================== nexthop: Various fixes Patch #1 fixes a NPD that was recently reported by syzbot. Patch #2 fixes an issue in the existing FIB nexthop selftest. Patch #3 extends the selftest with test cases for the bug that was fixed in the first patch. ==================== Link: https://patch.msgid.link/20250921150824.149157-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This PR adds support for the deployment model 3 of the RISC-V CoVE spec. It adds the missing API call in COVH ABI and implementation of single-step TEE VM (TVM) creation. It modifies the architecture-dependent KVM code to support deployment models in which: (1) host runs in HS mode, (2) TEE security monitor (TSM) or hardware do not support RISC-V AIA, (3) TSM or hardware do not support dynamic page conversion.
Promotion to TVM (guest side)
At the guest side, request isolation against the hypervisor early in the kernel boot process only if the kernel was built with the CONFIG_RISCV_COVE_GUEST_PROMOTE command-line kernel parameter. CoVE guest images created in the multi-step creation process should not use this parameter.
Promotion to TVM (host side)
At the host side, preload VM pages into memory, fill the NACL shared memory with boot vcpu state, and reflect the promote_to_tvm() call to the TSM. Support CoVE implementations that do not support dynamic page conversion. A TSM that does not support dynamic page conversion does not require the donation of pages to store VCPU state in confidential memory.
Support sharing pages with a hypervisor in an environment where dynamic page conversion is not supported.
When a single share_memory_region() call for a memory region that contains multiple 4KiB pages fails, execute multiple 4KiB share_memory_region() calls until the request is completed. (See deployment model 3 in the CoVE spec.)
Support the use of NACL in CoVE for KVM running in HS mode.
Utilize the correct NACL features by testing whether a feature required for NACL exploitation in nested virtualized environments is present. If nested virtualization is not present, use only the NACL setup_shared_memory() ABI otherwise use the entire NACL ABI.Mandatory NACL support for CoVE This commit mandates presence of NACL extension for KVM's CoVE support. To support deployments running host in HS or VS mode, KVM checks presence of specific NACL features. Specifically, CoVE deployments with host running in HS mode must implement only NACL setup_shared_memory() and no NACL feature. CoVE deployments with host running in VS mode must, however, implement all required NACL features.
Enable CoVE guests to run in an environment that does not have AIA.
Detect AIA presence by discovering that the TEE security monitor (TSM) implements the SBI COVI extension. If AIA is not present, inject external interrupts using the HVIP register when resuming execution of a virtual processor via the COVH tvm_vcpu_run() call.
TSM Capabilities
Discover TSM capabilities introduced as part of the GetTsmInfo call with the version 0.7 of the CoVE spec.
Local attestation
Support placement of TEE attestation payload inside the Linux kernel image and exposing its address during promotion to TVM.