qemu-e2k/docs/tools/virtiofsd.rst
Dr. David Alan Gilbert e586edcb41 virtiofs: drop remapped security.capability xattr as needed
On Linux, the 'security.capability' xattr holds a set of
capabilities that can change when an executable is run, giving
a limited form of privilege escalation to those programs that
the writer of the file deemed worthy.

Any write causes the 'security.capability' xattr to be dropped,
stopping anyone from gaining privilege by modifying a blessed
file.

Fuse relies on the daemon to do this dropping, and in turn the
daemon relies on the host kernel to drop the xattr for it.  However,
with the addition of -o xattrmap, the xattr that the guest
stores its capabilities in is now not the same as the one that
the host kernel automatically clears.

Where the mapping changes 'security.capability', explicitly clear
the remapped name to preserve the same behaviour.

This bug is assigned CVE-2021-20263.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
2021-03-04 10:26:16 +00:00

311 lines
9.4 KiB
ReStructuredText

QEMU virtio-fs shared file system daemon
========================================
Synopsis
--------
**virtiofsd** [*OPTIONS*]
Description
-----------
Share a host directory tree with a guest through a virtio-fs device. This
program is a vhost-user backend that implements the virtio-fs device. Each
virtio-fs device instance requires its own virtiofsd process.
This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
but should work with any virtual machine monitor (VMM) that supports
vhost-user. See the Examples section below.
This program must be run as the root user. The program drops privileges where
possible during startup although it must be able to create and access files
with any uid/gid:
* The ability to invoke syscalls is limited using seccomp(2).
* Linux capabilities(7) are dropped.
In "namespace" sandbox mode the program switches into a new file system
namespace and invokes pivot_root(2) to make the shared directory tree its root.
A new pid and net namespace is also created to isolate the process.
In "chroot" sandbox mode the program invokes chroot(2) to make the shared
directory tree its root. This mode is intended for container environments where
the container runtime has already set up the namespaces and the program does
not have permission to create namespaces itself.
Both sandbox modes prevent "file system escapes" due to symlinks and other file
system objects that might lead to files outside the shared directory.
Options
-------
.. program:: virtiofsd
.. option:: -h, --help
Print help.
.. option:: -V, --version
Print version.
.. option:: -d
Enable debug output.
.. option:: --syslog
Print log messages to syslog instead of stderr.
.. option:: -o OPTION
* debug -
Enable debug output.
* flock|no_flock -
Enable/disable flock. The default is ``no_flock``.
* modcaps=CAPLIST
Modify the list of capabilities allowed; CAPLIST is a colon separated
list of capabilities, each preceded by either + or -, e.g.
''+sys_admin:-chown''.
* log_level=LEVEL -
Print only log messages matching LEVEL or more severe. LEVEL is one of
``err``, ``warn``, ``info``, or ``debug``. The default is ``info``.
* posix_lock|no_posix_lock -
Enable/disable remote POSIX locks. The default is ``no_posix_lock``.
* readdirplus|no_readdirplus -
Enable/disable readdirplus. The default is ``readdirplus``.
* sandbox=namespace|chroot -
Sandbox mode:
- namespace: Create mount, pid, and net namespaces and pivot_root(2) into
the shared directory.
- chroot: chroot(2) into shared directory (use in containers).
The default is "namespace".
* source=PATH -
Share host directory tree located at PATH. This option is required.
* timeout=TIMEOUT -
I/O timeout in seconds. The default depends on cache= option.
* writeback|no_writeback -
Enable/disable writeback cache. The cache allows the FUSE client to buffer
and merge write requests. The default is ``no_writeback``.
* xattr|no_xattr -
Enable/disable extended attributes (xattr) on files and directories. The
default is ``no_xattr``.
.. option:: --socket-path=PATH
Listen on vhost-user UNIX domain socket at PATH.
.. option:: --socket-group=GROUP
Set the vhost-user UNIX domain socket gid to GROUP.
.. option:: --fd=FDNUM
Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
The file descriptor must already be listening for connections.
.. option:: --thread-pool-size=NUM
Restrict the number of worker threads per request queue to NUM. The default
is 64.
.. option:: --cache=none|auto|always
Select the desired trade-off between coherency and performance. ``none``
forbids the FUSE client from caching to achieve best coherency at the cost of
performance. ``auto`` acts similar to NFS with a 1 second metadata cache
timeout. ``always`` sets a long cache lifetime at the expense of coherency.
The default is ``auto``.
xattr-mapping
-------------
By default the name of xattr's used by the client are passed through to the server
file system. This can be a problem where either those xattr names are used
by something on the server (e.g. selinux client/server confusion) or if the
virtiofsd is running in a container with restricted privileges where it cannot
access some attributes.
A mapping of xattr names can be made using -o xattrmap=mapping where the ``mapping``
string consists of a series of rules.
The first matching rule terminates the mapping.
The set of rules must include a terminating rule to match any remaining attributes
at the end.
Each rule consists of a number of fields separated with a separator that is the
first non-white space character in the rule. This separator must then be used
for the whole rule.
White space may be added before and after each rule.
Using ':' as the separator a rule is of the form:
``:type:scope:key:prepend:``
**scope** is:
- 'client' - match 'key' against a xattr name from the client for
setxattr/getxattr/removexattr
- 'server' - match 'prepend' against a xattr name from the server
for listxattr
- 'all' - can be used to make a single rule where both the server
and client matches are triggered.
**type** is one of:
- 'prefix' - is designed to prepend and strip a prefix; the modified
attributes then being passed on to the client/server.
- 'ok' - Causes the rule set to be terminated when a match is found
while allowing matching xattr's through unchanged.
It is intended both as a way of explicitly terminating
the list of rules, and to allow some xattr's to skip following rules.
- 'bad' - If a client tries to use a name matching 'key' it's
denied using EPERM; when the server passes an attribute
name matching 'prepend' it's hidden. In many ways it's use is very like
'ok' as either an explicit terminator or for special handling of certain
patterns.
**key** is a string tested as a prefix on an attribute name originating
on the client. It maybe empty in which case a 'client' rule
will always match on client names.
**prepend** is a string tested as a prefix on an attribute name originating
on the server, and used as a new prefix. It may be empty
in which case a 'server' rule will always match on all names from
the server.
e.g.:
``:prefix:client:trusted.:user.virtiofs.:``
will match 'trusted.' attributes in client calls and prefix them before
passing them to the server.
``:prefix:server::user.virtiofs.:``
will strip 'user.virtiofs.' from all server replies.
``:prefix:all:trusted.:user.virtiofs.:``
combines the previous two cases into a single rule.
``:ok:client:user.::``
will allow get/set xattr for 'user.' xattr's and ignore
following rules.
``:ok:server::security.:``
will pass 'securty.' xattr's in listxattr from the server
and ignore following rules.
``:ok:all:::``
will terminate the rule search passing any remaining attributes
in both directions.
``:bad:server::security.:``
would hide 'security.' xattr's in listxattr from the server.
A simpler 'map' type provides a shorter syntax for the common case:
``:map:key:prepend:``
The 'map' type adds a number of separate rules to add **prepend** as a prefix
to the matched **key** (or all attributes if **key** is empty).
There may be at most one 'map' rule and it must be the last rule in the set.
Note: When the 'security.capability' xattr is remapped, the daemon has to do
extra work to remove it during many operations, which the host kernel normally
does itself.
xattr-mapping Examples
----------------------
1) Prefix all attributes with 'user.virtiofs.'
::
-o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"
This uses two rules, using : as the field separator;
the first rule prefixes and strips 'user.virtiofs.',
the second rule hides any non-prefixed attributes that
the host set.
This is equivalent to the 'map' rule:
::
-o xattrmap=":map::user.virtiofs.:"
2) Prefix 'trusted.' attributes, allow others through
::
"/prefix/all/trusted./user.virtiofs./
/bad/server//trusted./
/bad/client/user.virtiofs.//
/ok/all///"
Here there are four rules, using / as the field
separator, and also demonstrating that new lines can
be included between rules.
The first rule is the prefixing of 'trusted.' and
stripping of 'user.virtiofs.'.
The second rule hides unprefixed 'trusted.' attributes
on the host.
The third rule stops a guest from explicitly setting
the 'user.virtiofs.' path directly.
Finally, the fourth rule lets all remaining attributes
through.
This is equivalent to the 'map' rule:
::
-o xattrmap="/map/trusted./user.virtiofs./"
3) Hide 'security.' attributes, and allow everything else
::
"/bad/all/security./security./
/ok/all///'
The first rule combines what could be separate client and server
rules into a single 'all' rule, matching 'security.' in either
client arguments or lists returned from the host. This stops
the client seeing any 'security.' attributes on the server and
stops it setting any.
Examples
--------
Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
``/var/run/vm001-vhost-fs.sock``:
::
host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
host# qemu-system-x86_64 \
-chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \
-device vhost-user-fs-pci,chardev=char0,tag=myfs \
-object memory-backend-memfd,id=mem,size=4G,share=on \
-numa node,memdev=mem \
...
guest# mount -t virtiofs myfs /mnt