35 Commits

Author SHA1 Message Date
Marc-André Lureau
70dfabeaa7 seccomp: set the seccomp filter to all threads
When using "-seccomp on", the seccomp policy is only applied to the
main thread, the vcpu worker thread and other worker threads created
after seccomp policy is applied; the seccomp policy is not applied to
e.g. the RCU thread because it is created before the seccomp policy is
applied and SECCOMP_FILTER_FLAG_TSYNC isn't used.

This can be verified with
for task in /proc/`pidof qemu`/task/*; do cat $task/status | grep Secc ; done
Seccomp:	2
Seccomp:	0
Seccomp:	0
Seccomp:	2
Seccomp:	2
Seccomp:	2

Starting with libseccomp 2.2.0 and kernel >= 3.17, we can use
seccomp_attr_set(ctx, > SCMP_FLTATR_CTL_TSYNC, 1) to update the policy
on all threads.

libseccomp requirement was bumped to 2.2.0 in previous patch.
libseccomp should fail to set the filter if it can't honour
SCMP_FLTATR_CTL_TSYNC (untested), and thus -sandbox will now fail on
kernel < 3.17.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
2018-08-23 16:45:44 +02:00
Marc-André Lureau
bda08a5764 seccomp: prefer SCMP_ACT_KILL_PROCESS if available
The upcoming libseccomp release should have SCMP_ACT_KILL_PROCESS
action (https://github.com/seccomp/libseccomp/issues/96).

SCMP_ACT_KILL_PROCESS is preferable to immediately terminate the
offending process, rather than having the SIGSYS handler running.

Use SECCOMP_GET_ACTION_AVAIL to check availability of kernel support,
as libseccomp will fallback on SCMP_ACT_KILL otherwise, and we still
prefer SCMP_ACT_TRAP.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
2018-08-23 16:45:23 +02:00
Marc-André Lureau
6f2231e9b0 seccomp: use SIGSYS signal instead of killing the thread
The seccomp action SCMP_ACT_KILL results in immediate termination of
the thread that made the bad system call. However, qemu being
multi-threaded, it keeps running. There is no easy way for parent
process / management layer (libvirt) to know about that situation.

Instead, the default SIGSYS handler when invoked with SCMP_ACT_TRAP
will terminate the program and core dump.

This may not be the most secure solution, but probably better than
just killing the offending thread. SCMP_ACT_KILL_PROCESS has been
added in Linux 4.14 to improve the situation, which I propose to use
by default if available in the next patch.

Related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1594456

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
2018-08-23 16:45:20 +02:00
Marc-André Lureau
056de1e894 seccomp: allow sched_setscheduler() with SCHED_IDLE policy
Current and upcoming mesa releases rely on a shader disk cash. It uses
a thread job queue with low priority, set with
sched_setscheduler(SCHED_IDLE). However, that syscall is rejected by
the "resourcecontrol" seccomp qemu filter.

Since it should be safe to allow lowering thread priority, let's allow
scheduling thread to idle policy.

Related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1594456

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
2018-07-12 14:52:39 +02:00
Yi Min Zhao
9d0fdecbad sandbox: disable -sandbox if CONFIG_SECCOMP undefined
If CONFIG_SECCOMP is undefined, the option 'elevatedprivileges' remains
compiled. This would make libvirt set the corresponding capability and
then trigger failure during guest startup. This patch moves the code
regarding seccomp command line options to qemu-seccomp.c file and
wraps qemu_opts_foreach finding sandbox option with CONFIG_SECCOMP.
Because parse_sandbox() is moved into qemu-seccomp.c file, change
seccomp_start() to static function.

Signed-off-by: Yi Min Zhao <zyimin@linux.ibm.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Tested-by: Ján Tomko <jtomko@redhat.com>
Acked-by: Eduardo Otubo <otubo@redhat.com>
2018-06-01 13:44:15 +02:00
Eduardo Otubo
24f8cdc572 seccomp: add resourcecontrol argument to command line
This patch adds [,resourcecontrol=deny] to `-sandbox on' option. It
blacklists all process affinity and scheduler priority system calls to
avoid any bigger of the process.

Signed-off-by: Eduardo Otubo <otubo@redhat.com>
2017-09-15 10:15:06 +02:00
Eduardo Otubo
995a226f88 seccomp: add spawn argument to command line
This patch adds [,spawn=deny] argument to `-sandbox on' option. It
blacklists fork and execve system calls, avoiding Qemu to spawn new
threads or processes.

Signed-off-by: Eduardo Otubo <otubo@redhat.com>
2017-09-15 10:15:06 +02:00
Eduardo Otubo
73a1e64725 seccomp: add elevateprivileges argument to command line
This patch introduces the new argument
[,elevateprivileges=allow|deny|children] to the `-sandbox on'. It allows
or denies Qemu process to elevate its privileges by blacklisting all
set*uid|gid system calls. The 'children' option will let forks and
execves run unprivileged.

Signed-off-by: Eduardo Otubo <otubo@redhat.com>
2017-09-15 10:15:06 +02:00
Eduardo Otubo
2b716fa6d6 seccomp: add obsolete argument to command line
This patch introduces the argument [,obsolete=allow] to the `-sandbox on'
option. It allows Qemu to run safely on old system that still relies on
old system calls.

Signed-off-by: Eduardo Otubo <otubo@redhat.com>
2017-09-15 10:15:05 +02:00
Eduardo Otubo
1bd6152ae2 seccomp: changing from whitelist to blacklist
This patch changes the default behavior of the seccomp filter from
whitelist to blacklist. By default now all system calls are allowed and
a small black list of definitely forbidden ones was created.

Signed-off-by: Eduardo Otubo <otubo@redhat.com>
2017-09-15 10:13:35 +02:00
Eduardo Otubo
cf9dc9e480 seccomp: adding getrusage to the whitelist
getrusage is used in a number of places throughout the qemu codebase
(notably, in crypto/pbkdf.c).  Without this syscall being whitelisted,
qemu ends up getting killed by the kernel whenever you try to connect to
a VNC console.

Signed-off-by: Brian Rak <brak@gameservers.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2016-09-21 11:26:02 +02:00
Miroslav Rezanina
8e08f8a4a7 seccomp: adding sysinfo system call to whitelist
Newer version of nss-softokn libraries (> 3.16.2.3) use sysinfo call
so qemu using rbd image hang after start when run in sandbox mode.

To allow using rbd images in sandbox mode we have to whitelist it.

Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2016-04-16 20:27:44 +02:00
James Hogan
81bed73b53 seccomp: Whitelist cacheflush since 2.2.0 not 2.2.3
The cacheflush system call (found on MIPS and ARM) has been included in
the libseccomp header since 2.2.0, so include it back to that version.
Previously it was only enabled since 2.2.3 since that is when it was
enabled properly for ARM.

This will allow seccomp support to be enabled for MIPS back to
libseccomp 2.2.0.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Reviewed-By: Andrew Jones <drjones@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2016-04-16 20:27:41 +02:00
Peter Maydell
d38ea87ac5 all: Clean up includes
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.

This commit was created with scripts/clean-includes.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1454089805-5470-16-git-send-email-peter.maydell@linaro.org
2016-02-04 17:41:30 +00:00
Andrew Jones
47d2067af3 seccomp: add cacheflush to whitelist
cacheflush is an arm-specific syscall that qemu built for arm
uses. Add it to the whitelist, but only if we're linking with
a recent enough libseccomp.

Signed-off-by: Andrew Jones <drjones@redhat.com>
2015-11-16 09:48:53 +01:00
Eduardo Otubo
f8d82b8eb8 seccomp: add memfd_create to whitelist
This is used by memfd code.

Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Thibaut Collet <thibaut.collet@6wind.com>
2015-10-22 14:34:50 +03:00
Paolo Bonzini
4b45b05549 seccomp: add mlockall to whitelist
This is used by "-realtime mlock=on".

Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Tested-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2015-01-23 14:07:08 +01:00
Paul Moore
ea259acae5 seccomp: add mbind() to the syscall whitelist
The "memory-backend-ram" QOM object utilizes the mbind(2) syscall to
set the policy for a memory range.  Add the syscall to the seccomp
sandbox whitelist.

Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Acked-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
Tested-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
2015-01-05 18:13:38 +01:00
Philipp Gesang
f73adec709 seccomp: whitelist syscalls fallocate(), fadvise64(), inotify_init1() and inotify_add_watch()
fallocate() is needed for snapshotting. If it isn’t whitelisted

    $ qemu-img create -f qcow2 x.qcow 1G
    Formatting 'x.qcow', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
    $ qemu-kvm -display none -monitor stdio -sandbox on x.qcow
    QEMU 2.1.50 monitor - type 'help' for more information
    (qemu) savevm foo
    (qemu) loadvm foo

will fail, as will subsequent savevm commands on the same image.

fadvise64(), inotify_init1(), inotify_add_watch() are needed by
the SDL display. Without the whitelist entries,

    qemu-kvm -sandbox on

fails immediately.

In my tests fadvise64() is called 50--51 times per VM run. That
number seems independent of the duration of the run. fallocate(),
inotify_init1(), inotify_add_watch() are called once each.
Accordingly, they are added to the whitelist at a very low
priority.

Signed-off-by: Philipp Gesang <philipp.gesang@intra2net.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2014-11-11 17:01:35 +01:00
Paul Moore
b22876cc2f seccomp: add semctl() to the syscall whitelist
QEMU needs to call semctl() for correct operation.  This particular
problem was identified on shutdown with the following commandline:

 # qemu -sandbox on -monitor stdio \
   -device intel-hda -device hda-duplex -vnc :0

Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <eduardo.otubo@profitbricks.com>
2014-08-21 10:29:16 +02:00
Paul Moore
e3f9bb011a seccomp: add shmctl(), mlock(), and munlock() to the syscall whitelist
Additional testing reveals that PulseAudio requires shmctl() and the
mlock()/munlock() syscalls on some systems/configurations.  As before,
on systems that do require these syscalls, the problem can be seen with
the following command line:

  # qemu -monitor stdio  -sandbox on \
         -device intel-hda -device hda-duplex

Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
2014-04-25 14:52:03 -03:00
Felix Geyer
8439761852 seccomp: add timerfd_create and timerfd_settime to the whitelist
libusb calls timerfd_create() and timerfd_settime() when it's built with
timerfd support.

Command to reproduce:

       -device usb-host,hostbus=1,hostaddr=3,id=hostdev0

Log messages:

audit(1390730418.924:135): auid=4294967295 uid=121 gid=103 ses=4294967295
                           pid=5232 comm="qemu-system-x86" sig=31 syscall=283
                           compat=0 ip=0x7f2b0f4e96a7 code=0x0
audit(1390733100.580:142): auid=4294967295 uid=121 gid=103 ses=4294967295
                           pid=16909 comm="qemu-system-x86" sig=31 syscall=286
                           compat=0 ip=0x7f03513a06da code=0x0

Reading a few hundred MB from a USB drive on x86_64 shows this syscall distribution.
Therefore the timerfd_settime priority is set to 242.

    calls  syscall
 --------- ----------------
   5303600 write
   2240554 read
   2167030 ppoll
   2134828 ioctl
    704023 timerfd_settime
    689105 poll
     83122 futex
       803 writev
       476 rt_sigprocmask
       287 recvmsg
       178 brk

Signed-off-by: Felix Geyer <debfx@fobos.de>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
2014-04-25 14:51:59 -03:00
Paul Moore
918b94e287 seccomp: add some basic shared memory syscalls to the whitelist
PulseAudio requires the use of shared memory so add shmget(), shmat(),
and shmdt() to the syscall whitelist.

Reported-by: xuhan@redhat.com
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-01-20 11:19:34 -02:00
Paul Moore
0c2acb163f seccomp: add mkdir() and fchmod() to the whitelist
The PulseAudio library attempts to do a mkdir(2) and fchmod(2) on
"/run/user/<UID>/pulse" which is currently blocked by the syscall
filter; this patch adds the two missing syscalls to the whitelist.
You can reproduce this problem with the following command:

 # qemu -monitor stdio -device intel-hda -device hda-duplex

If watched under strace the following syscalls are shown:

 mkdir("/run/user/0/pulse", 0700)
 fchmod(11, 0700) [NOTE: 11 is the fd for /run/user/0/pulse]

Reported-by: xuhan@redhat.com
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-01-20 11:19:29 -02:00
Corey Bryant
2a13f99112 seccomp: exit if seccomp_init() fails
This fixes a bug where we weren't exiting if seccomp_init() failed.

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Acked-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Acked-by: Paul Moore <pmoore@redhat.com>
2013-12-20 16:38:29 -02:00
Paul Moore
e9eecb5bf8 seccomp: add kill() to the syscall whitelist
The kill() syscall is triggered with the following command:

 # qemu -sandbox on -monitor stdio \
        -device intel-hda -device hda-duplex -vnc :0

The resulting syslog/audit message:

 # ausearch -m SECCOMP
 ----
 time->Wed Nov 20 09:52:08 2013
 type=SECCOMP msg=audit(1384912328.482:6656): auid=0 uid=0 gid=0 ses=854
  subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=12087
  comm="qemu-kvm" sig=31 syscall=62 compat=0 ip=0x7f7a1d2abc67 code=0x0
 # scmp_sys_resolver 62
 kill

Reported-by: CongLi <coli@redhat.com>
Tested-by: CongLi <coli@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
2013-12-03 10:21:32 -02:00
Eduardo Otubo
c236f4519c seccomp: fine tuning whitelist by adding times()
This was causing Qemu process to hang when using -sandbox on as
discribed on RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1004175

Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Acked-by: Paul Moore <pmoore@redhat.com>
2013-09-24 15:15:16 -03:00
Paul Moore
d2509b667c seccomp: add arch_prctl() to the syscall whitelist
It appears that even a very simple /etc/qemu-ifup configuration can
require the arch_prctl() syscall, see the example below:

	#!/bin/sh
	/sbin/ifconfig $1 0.0.0.0 up
	/usr/sbin/brctl addif <switch> $1

Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Message-id: 20130718135703.8247.19213.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2013-07-29 19:56:52 -05:00
Paul Moore
94113bd8a1 seccomp: add additional asynchronous I/O syscalls
A previous commit, "seccomp: add the asynchronous I/O syscalls to the
whitelist", added several asynchronous I/O syscalls but left out the
io_submit() and io_cancel() syscalls.  This patch corrects this by
adding the two missing asynchronous I/O syscalls.

Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Message-id: 20130715193201.943.4913.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2013-07-29 19:56:52 -05:00
Eduardo Otubo
2fb861eb02 seccomp: removing unused syscalls gtom whitelist
v3 update:
 - reincluding getrlimit(), it is used by Xen.

v2 update:
 - reincluding setrlimit(), it is used by Xen.

Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1374518017-10424-3-git-send-email-otubo@linux.vnet.ibm.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2013-07-26 16:54:08 -05:00
Eduardo Otubo
7d7b2ad436 seccomp: no need to check arch in syscall whitelist
v2 update:
- set libseccomp 2.1.0 as requirement on configure script.

Since libseccomp 2.0 there's no need to check the architecture type
anymore.

Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1374518017-10424-2-git-send-email-otubo@linux.vnet.ibm.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2013-07-26 16:54:08 -05:00
Paul Moore
fd21faadb1 seccomp: add the asynchronous I/O syscalls to the whitelist
In order to enable the asynchronous I/O functionality when using the
seccomp sandbox we need to add the associated syscalls to the
whitelist.

Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Message-id: 20130529203001.20939.83322.stgit@localhost
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2013-05-30 11:46:07 -05:00
Paolo Bonzini
9c17d615a6 softmmu: move include files to include/sysemu/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-12-19 08:32:45 +01:00
Eduardo Otubo
fe512d65e0 seccomp: adding new syscalls (bugzilla 855162)
According to the bug 855162[0] - there's the need of adding new syscalls
to the whitelist when using Qemu with Libvirt.

[0] - https://bugzilla.redhat.com/show_bug.cgi?id=855162

Reported-by: Paul Moore <pmoore@redhat.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2012-11-30 08:27:27 -06:00
Eduardo Otubo
2f668be775 Adding qemu-seccomp.[ch] (v8)
Signed-off-by: Eduardo Otubo <otubo@linux.vnet.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
v1:
 - I added a syscall struct using priority levels as described in the
   libseccomp man page. The priority numbers are based to the frequency
   they appear in a sample strace from a regular qemu guest run under
   libvirt.

   Libseccomp generates linear BPF code to filter system calls, those rules
   are read one after another. The priority system places the most common
   rules first in order to reduce the overhead when processing them.

v1 -> v2:
 - Fixed some style issues
 - Removed code from vl.c and created qemu-seccomp.[ch]
 - Now using ARRAY_SIZE macro
 - Added more syscalls without priority/frequency set yet

v2 -> v3:
 - Adding copyright and license information
 - Replacing seccomp_whitelist_count just by ARRAY_SIZE
 - Adding header protection to qemu-seccomp.h
 - Moving QemuSeccompSyscall definition to qemu-seccomp.c
 - Negative return from seccomp_start is fatal now.
 - Adding open() and execve() to the whitelis

v3 -> v4:
 - Tests revealed a bigger set of syscalls.
 - seccomp_start() now has an argument to set the mode according to the
   configure option trap or kill.

v4 -> v5:
 - Tests on x86_64 required a new specific set of system calls.
 - libseccomp release 1.0.0: part of the API have changed in this last
   release, had to adapt to the new function signatures.
2012-08-16 13:41:16 -05:00