Documentation: describe MIDX-based bitmaps

Update the technical documentation to describe the multi-pack bitmap
format. This patch merely introduces the new format, and describes its
high-level ideas. Git does not yet know how to read nor write these
multi-pack variants, and so the subsequent patches will:

  - Introduce code to interpret multi-pack bitmaps, according to this
    document.

  - Then, introduce code to write multi-pack bitmaps from the 'git
    multi-pack-index write' sub-command.

Finally, the implementation will gain tests in subsequent patches (as
opposed to inline with the patch teaching Git how to write multi-pack
bitmaps) to avoid a cyclic dependency.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit is contained in:
Taylor Blau 2021-08-24 12:15:59 -04:00 committed by Junio C Hamano
parent 1d7f7f242c
commit 917a54c017
2 changed files with 60 additions and 21 deletions

View File

@ -1,6 +1,44 @@
GIT bitmap v1 format
====================
== Pack and multi-pack bitmaps
Bitmaps store reachability information about the set of objects in a packfile,
or a multi-pack index (MIDX). The former is defined obviously, and the latter is
defined as the union of objects in packs contained in the MIDX.
A bitmap may belong to either one pack, or the repository's multi-pack index (if
it exists). A repository may have at most one bitmap.
An object is uniquely described by its bit position within a bitmap:
- If the bitmap belongs to a packfile, the __n__th bit corresponds to
the __n__th object in pack order. For a function `offset` which maps
objects to their byte offset within a pack, pack order is defined as
follows:
o1 <= o2 <==> offset(o1) <= offset(o2)
- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
__n__th object in MIDX order. With an additional function `pack` which
maps objects to the pack they were selected from by the MIDX, MIDX order
is defined as follows:
o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
The ordering between packs is done according to the MIDX's .rev file.
Notably, the preferred pack sorts ahead of all other packs.
The on-disk representation (described below) of a bitmap is the same regardless
of whether or not that bitmap belongs to a packfile or a MIDX. The only
difference is the interpretation of the bits, which is described above.
Certain bitmap extensions are supported (see: Appendix B). No extensions are
required for bitmaps corresponding to packfiles. For bitmaps that correspond to
MIDXs, both the bit-cache and rev-cache extensions are required.
== On-disk format
- A header appears at the beginning:
4-byte signature: {'B', 'I', 'T', 'M'}
@ -14,17 +52,19 @@ GIT bitmap v1 format
The following flags are supported:
- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
This flag must always be present. It implies that the bitmap
index has been generated for a packfile with full closure
(i.e. where every single object in the packfile can find
its parent links inside the same packfile). This is a
requirement for the bitmap index format, also present in JGit,
that greatly reduces the complexity of the implementation.
This flag must always be present. It implies that the
bitmap index has been generated for a packfile or
multi-pack index (MIDX) with full closure (i.e. where
every single object in the packfile/MIDX can find its
parent links inside the same packfile/MIDX). This is a
requirement for the bitmap index format, also present in
JGit, that greatly reduces the complexity of the
implementation.
- BITMAP_OPT_HASH_CACHE (0x4)
If present, the end of the bitmap file contains
`N` 32-bit name-hash values, one per object in the
pack. The format and meaning of the name-hash is
pack/MIDX. The format and meaning of the name-hash is
described below.
4-byte entry count (network byte order)
@ -33,7 +73,8 @@ GIT bitmap v1 format
20-byte checksum
The SHA1 checksum of the pack this bitmap index belongs to.
The SHA1 checksum of the pack/MIDX this bitmap index
belongs to.
- 4 EWAH bitmaps that act as type indexes
@ -50,7 +91,7 @@ GIT bitmap v1 format
- Tags
In each bitmap, the `n`th bit is set to true if the `n`th object
in the packfile is of that type.
in the packfile or multi-pack index is of that type.
The obvious consequence is that the OR of all 4 bitmaps will result
in a full set (all bits set), and the AND of all 4 bitmaps will
@ -62,8 +103,9 @@ GIT bitmap v1 format
Each entry contains the following:
- 4-byte object position (network byte order)
The position **in the index for the packfile** where the
bitmap for this commit is found.
The position **in the index for the packfile or
multi-pack index** where the bitmap for this commit is
found.
- 1-byte XOR-offset
The xor offset used to compress this bitmap. For an entry
@ -146,10 +188,11 @@ Name-hash cache
---------------
If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
a cache of 32-bit values, one per object in the pack. The value at
a cache of 32-bit values, one per object in the pack/MIDX. The value at
position `i` is the hash of the pathname at which the `i`th object
(counting in index order) in the pack can be found. This can be fed
into the delta heuristics to compare objects with similar pathnames.
(counting in index or multi-pack index order) in the pack/MIDX can be found.
This can be fed into the delta heuristics to compare objects with similar
pathnames.
The hash algorithm used is:

View File

@ -71,14 +71,10 @@ Future Work
still reducing the number of binary searches required for object
lookups.
- The reachability bitmap is currently paired directly with a single
packfile, using the pack-order as the object order to hopefully
compress the bitmaps well using run-length encoding. This could be
extended to pair a reachability bitmap with a multi-pack-index. If
the multi-pack-index is extended to store a "stable object order"
- If the multi-pack-index is extended to store a "stable object order"
(a function Order(hash) = integer that is constant for a given hash,
even as the multi-pack-index is updated) then a reachability bitmap
could point to a multi-pack-index and be updated independently.
even as the multi-pack-index is updated) then MIDX bitmaps could be
updated independently of the MIDX.
- Packfiles can be marked as "special" using empty files that share
the initial name but replace ".pack" with ".keep" or ".promisor".