qemu-e2k/block
Sam Eiderman 98eb9733f4 vmdk: Add read-only support for seSparse snapshots
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).

This format was lacking in the following:

    * Grain directory (L1) and grain table (L2) entries were 32-bit,
      allowing access to only 2TB (slightly less) of data.
    * The grain size (default) was 512 bytes - leading to data
      fragmentation and many grain tables.
    * For space reclamation purposes, it was necessary to find all the
      grains which are not pointed to by any grain table - so a reverse
      mapping of "offset of grain in vmdk" to "grain table" must be
      constructed - which takes large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).

This format fixes the above issues:

    * All entries are now 64-bit.
    * The grain size (default) is 4KB.
    * Grain directory and grain tables are now located at the beginning
      of the file.
      + seSparse format reserves space for all grain tables.
      + Grain tables can be addressed using an index.
      + Grains are located in the end of the file and can also be
        addressed with an index.
      - seSparse vmdks of large disks (64TB) have huge preallocated
        headers - mainly due to L2 tables, even for empty snapshots.
    * The header contains a reverse mapping ("backmap") of "offset of
      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
      specifies for each grain - whether it is allocated or not.
      Using these data structures we can implement space reclamation
      efficiently.
    * Due to the fact that the header now maintains two mappings:
        * The regular one (grain directory & grain tables)
        * A reverse one (backmap and free bitmap)
      These data structures can lose consistency upon crash and result
      in a corrupted VMDK.
      Therefore, a journal is also added to the VMDK and is replayed
      when the VMware reopens the file after a crash.

Since ESXi 6.7 - SESparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.

The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:

    /var/log/hostd.log

Whenever an seSparse snapshot is created - the log is being populated
with seSparse records.

Relevant log records are of the form:

[...] Const Header:
[...]  constMagic     = 0xcafebabe
[...]  version        = 2.1
[...]  capacity       = 204800
[...]  grainSize      = 8
[...]  grainTableSize = 64
[...]  flags          = 0
[...] Extents:
[...]  Header         : <1 : 1>
[...]  JournalHdr     : <2 : 2>
[...]  Journal        : <2048 : 2048>
[...]  GrainDirectory : <4096 : 2048>
[...]  GrainTables    : <6144 : 2048>
[...]  FreeBitmap     : <8192 : 2048>
[...]  BackMap        : <10240 : 2048>
[...]  Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber  = 0
[...] replayJournal     = 0

The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>

This commit is a strict implementation which enforces:
    * magics
    * version number 2.1
    * grain size of 8 sectors  (4KB)
    * grain table size of 64 sectors
    * zero flags
    * extent locations

Additionally, this commit proivdes only a subset of the functionality
offered by seSparse's format:
    * Read-only
    * No journal replay
    * No space reclamation
    * No unmap support

Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However there are several differences in the grain access itself.
Grain directory (L1):
    * Grain directory entries are indexes (not offsets) to grain
      tables.
    * Valid grain directory entries have their highest nibble set to
      0x1.
    * Since grain tables are always located in the beginning of the
      file - the index can fit into 32 bits - so we can use its low
      part if it's valid.
Grain table (L2):
    * Grain table entries are indexes (not offsets) to grains.
    * If the highest nibble of the entry is:
        0x0:
            The grain in not allocated.
            The rest of the bytes are 0.
        0x1:
            The grain is unmapped - guest sees a zero grain.
            The rest of the bits point to the previously mapped grain,
            see 0x3 case.
        0x2:
            The grain is zero.
        0x3:
            The grain is allocated - to get the index calculate:
            ((entry & 0x0fff000000000000) >> 48) |
            ((entry & 0x0000ffffffffffff) << 12)
    * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
      grain which results from the guest using sg_unmap to unmap the
      grain - but the grain itself still exists in the grain extent - a
      space reclamation procedure should delete it.
      Unmapping a zero grain has no effect (0x2 will not change to 0x1)
      but unmapping an unallocated grain will (0x0 to 0x1) - naturally.

In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24 15:53:02 +02:00
..
Makefile.objs block/nbd: merge nbd-client.* to nbd.c 2019-06-13 09:55:09 -05:00
accounting.c block/accounting: introduce latency histogram 2018-03-19 14:58:37 -05:00
backup.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
blkdebug.c blkdebug: Inject errors on .bdrv_co_block_status() 2019-06-14 14:16:57 +02:00
blklogwrites.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
blkreplay.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
blkverify.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
block-backend.c block/block-backend: blk_iostatus_reset: drop usage of bs->job 2019-06-18 16:41:10 +02:00
bochs.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
cloop.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
commit.c block/commit: Drop bdrv_child_try_set_perm() 2019-06-18 16:41:10 +02:00
copy-on-read.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
create.c jobs: utilize job_exit shim 2018-08-31 16:28:33 +02:00
crypto.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
crypto.h Clean up ill-advised or unusual header guards 2019-05-13 08:58:55 +02:00
curl.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
dirty-bitmap.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
dmg-bz2.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
dmg-lzfse.c block: adding lzfse decompressing support as a module. 2018-12-14 11:52:40 +01:00
dmg.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
dmg.h Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
file-posix.c file-posix: Update open_flags in raw_set_perm() 2019-06-18 16:41:10 +02:00
file-win32.c avoid TABs in files that only contain a few 2019-01-11 15:46:56 +01:00
gluster.c block/gluster: update .help of BLOCK_OPT_PREALLOC option 2019-06-12 18:32:32 +02:00
io.c block/io: bdrv_pdiscard: support int64_t bytes parameter 2019-06-04 16:55:58 +02:00
iscsi-opts.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
iscsi.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
linux-aio.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
mirror.c block/mirror: Fix child permissions 2019-06-18 16:41:10 +02:00
nbd.c block/nbd: merge NBDClientSession struct back to BDRVNBDState 2019-06-13 10:00:42 -05:00
nfs.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
null.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
nvme.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
parallels.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
parallels.h Clean up includes 2018-02-09 05:05:11 +01:00
qapi.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
qcow.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
qcow2-bitmap.c qcow2-bitmap: initialize bitmap directory alignment 2019-05-28 20:30:55 +02:00
qcow2-cache.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
qcow2-cluster.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
qcow2-refcount.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
qcow2-snapshot.c qcow2.h: add missing include 2019-05-28 20:30:55 +02:00
qcow2-threads.c qcow2: do encryption in threads 2019-05-28 20:30:55 +02:00
qcow2.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
qcow2.h block: avoid recursive block_status call if possible 2019-06-04 15:20:41 +02:00
qed-check.c block/qed: add missed coroutine_fn markers 2019-04-30 15:29:00 +02:00
qed-cluster.c qed: protect table cache with CoMutex 2017-07-17 11:34:11 +08:00
qed-l2-cache.c qed: protect table cache with CoMutex 2017-07-17 11:34:11 +08:00
qed-table.c block/qed: add missed coroutine_fn markers 2019-04-30 15:29:00 +02:00
qed.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
qed.h block/qed: add missed coroutine_fn markers 2019-04-30 15:29:00 +02:00
quorum.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
raw-format.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
rbd.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
replication.c block/replication: drop usage of bs->job 2019-06-18 16:41:09 +02:00
sheepdog.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
snapshot.c block/snapshot: remove bdrv_snapshot_delete_by_id_or_name 2019-02-25 15:03:18 +01:00
ssh.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
stream.c block/stream: use buffer-based io 2019-04-30 15:29:00 +02:00
throttle-groups.c throttle-groups: fix restart coroutine iothread race 2019-01-24 10:02:28 +00:00
throttle.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
trace-events block: drop bs->job 2019-06-18 16:41:10 +02:00
vdi.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
vhdx-endian.c Include qemu-common.h exactly where needed 2019-06-12 13:20:20 +02:00
vhdx-log.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
vhdx.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
vhdx.h block/vhdx: Use IEC binary prefixes for size constants 2019-04-30 15:29:00 +02:00
vmdk.c vmdk: Add read-only support for seSparse snapshots 2019-06-24 15:53:02 +02:00
vpc.c block: Add BlockBackend.ctx 2019-06-04 15:22:22 +02:00
vvfat.c qemu-common: Move qemu_isalnum() etc. to qemu/ctype.h 2019-06-11 20:22:09 +02:00
vxhs.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
win32-aio.c Include qemu/module.h where needed, drop it from qemu-common.h 2019-06-12 13:18:33 +02:00
write-threshold.c qapi: Drop qapi_event_send_FOO()'s Error ** argument 2018-08-28 18:21:38 +02:00