4bdde337f4
The end of the cache tree index extension format trails off with
ellipses ever since 23fcc98
(doc: technical details about the index
file format, 2011-03-01). While an intuitive reader could gather what
this means, it could be better to use "and so on" instead.
Really, this is only justified because I also wanted to point out that
the number of subtrees in the index format is used to determine when the
recursive depth-first-search stack should be "popped." This should help
to add clarity to the format.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
388 lines
14 KiB
Plaintext
388 lines
14 KiB
Plaintext
Git index format
|
|
================
|
|
|
|
== The Git index file has the following format
|
|
|
|
All binary numbers are in network byte order.
|
|
In a repository using the traditional SHA-1, checksums and object IDs
|
|
(object names) mentioned below are all computed using SHA-1. Similarly,
|
|
in SHA-256 repositories, these values are computed using SHA-256.
|
|
Version 2 is described here unless stated otherwise.
|
|
|
|
- A 12-byte header consisting of
|
|
|
|
4-byte signature:
|
|
The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
|
|
|
|
4-byte version number:
|
|
The current supported versions are 2, 3 and 4.
|
|
|
|
32-bit number of index entries.
|
|
|
|
- A number of sorted index entries (see below).
|
|
|
|
- Extensions
|
|
|
|
Extensions are identified by signature. Optional extensions can
|
|
be ignored if Git does not understand them.
|
|
|
|
Git currently supports cache tree and resolve undo extensions.
|
|
|
|
4-byte extension signature. If the first byte is 'A'..'Z' the
|
|
extension is optional and can be ignored.
|
|
|
|
32-bit size of the extension
|
|
|
|
Extension data
|
|
|
|
- Hash checksum over the content of the index file before this checksum.
|
|
|
|
== Index entry
|
|
|
|
Index entries are sorted in ascending order on the name field,
|
|
interpreted as a string of unsigned bytes (i.e. memcmp() order, no
|
|
localization, no special casing of directory separator '/'). Entries
|
|
with the same name are sorted by their stage field.
|
|
|
|
32-bit ctime seconds, the last time a file's metadata changed
|
|
this is stat(2) data
|
|
|
|
32-bit ctime nanosecond fractions
|
|
this is stat(2) data
|
|
|
|
32-bit mtime seconds, the last time a file's data changed
|
|
this is stat(2) data
|
|
|
|
32-bit mtime nanosecond fractions
|
|
this is stat(2) data
|
|
|
|
32-bit dev
|
|
this is stat(2) data
|
|
|
|
32-bit ino
|
|
this is stat(2) data
|
|
|
|
32-bit mode, split into (high to low bits)
|
|
|
|
4-bit object type
|
|
valid values in binary are 1000 (regular file), 1010 (symbolic link)
|
|
and 1110 (gitlink)
|
|
|
|
3-bit unused
|
|
|
|
9-bit unix permission. Only 0755 and 0644 are valid for regular files.
|
|
Symbolic links and gitlinks have value 0 in this field.
|
|
|
|
32-bit uid
|
|
this is stat(2) data
|
|
|
|
32-bit gid
|
|
this is stat(2) data
|
|
|
|
32-bit file size
|
|
This is the on-disk size from stat(2), truncated to 32-bit.
|
|
|
|
Object name for the represented object
|
|
|
|
A 16-bit 'flags' field split into (high to low bits)
|
|
|
|
1-bit assume-valid flag
|
|
|
|
1-bit extended flag (must be zero in version 2)
|
|
|
|
2-bit stage (during merge)
|
|
|
|
12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
|
|
is stored in this field.
|
|
|
|
(Version 3 or later) A 16-bit field, only applicable if the
|
|
"extended flag" above is 1, split into (high to low bits).
|
|
|
|
1-bit reserved for future
|
|
|
|
1-bit skip-worktree flag (used by sparse checkout)
|
|
|
|
1-bit intent-to-add flag (used by "git add -N")
|
|
|
|
13-bit unused, must be zero
|
|
|
|
Entry path name (variable length) relative to top level directory
|
|
(without leading slash). '/' is used as path separator. The special
|
|
path components ".", ".." and ".git" (without quotes) are disallowed.
|
|
Trailing slash is also disallowed.
|
|
|
|
The exact encoding is undefined, but the '.' and '/' characters
|
|
are encoded in 7-bit ASCII and the encoding cannot contain a NUL
|
|
byte (iow, this is a UNIX pathname).
|
|
|
|
(Version 4) In version 4, the entry path name is prefix-compressed
|
|
relative to the path name for the previous entry (the very first
|
|
entry is encoded as if the path name for the previous entry is an
|
|
empty string). At the beginning of an entry, an integer N in the
|
|
variable width encoding (the same encoding as the offset is encoded
|
|
for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
|
|
by a NUL-terminated string S. Removing N bytes from the end of the
|
|
path name for the previous entry, and replacing it with the string S
|
|
yields the path name for this entry.
|
|
|
|
1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
|
|
while keeping the name NUL-terminated.
|
|
|
|
(Version 4) In version 4, the padding after the pathname does not
|
|
exist.
|
|
|
|
Interpretation of index entries in split index mode is completely
|
|
different. See below for details.
|
|
|
|
== Extensions
|
|
|
|
=== Cache tree
|
|
|
|
Since the index does not record entries for directories, the cache
|
|
entries cannot describe tree objects that already exist in the object
|
|
database for regions of the index that are unchanged from an existing
|
|
commit. The cache tree extension stores a recursive tree structure that
|
|
describes the trees that already exist and completely match sections of
|
|
the cache entries. This speeds up tree object generation from the index
|
|
for a new commit by only computing the trees that are "new" to that
|
|
commit. It also assists when comparing the index to another tree, such
|
|
as `HEAD^{tree}`, since sections of the index can be skipped when a tree
|
|
comparison demonstrates equality.
|
|
|
|
The recursive tree structure uses nodes that store a number of cache
|
|
entries, a list of subnodes, and an object ID (OID). The OID references
|
|
the existing tree for that node, if it is known to exist. The subnodes
|
|
correspond to subdirectories that themselves have cache tree nodes. The
|
|
number of cache entries corresponds to the number of cache entries in
|
|
the index that describe paths within that tree's directory.
|
|
|
|
The extension tracks the full directory structure in the cache tree
|
|
extension, but this is generally smaller than the full cache entry list.
|
|
|
|
When a path is updated in index, Git invalidates all nodes of the
|
|
recursive cache tree corresponding to the parent directories of that
|
|
path. We store these tree nodes as being "invalid" by using "-1" as the
|
|
number of cache entries. Invalid nodes still store a span of index
|
|
entries, allowing Git to focus its efforts when reconstructing a full
|
|
cache tree.
|
|
|
|
The signature for this extension is { 'T', 'R', 'E', 'E' }.
|
|
|
|
A series of entries fill the entire extension; each of which
|
|
consists of:
|
|
|
|
- NUL-terminated path component (relative to its parent directory);
|
|
|
|
- ASCII decimal number of entries in the index that is covered by the
|
|
tree this entry represents (entry_count);
|
|
|
|
- A space (ASCII 32);
|
|
|
|
- ASCII decimal number that represents the number of subtrees this
|
|
tree has;
|
|
|
|
- A newline (ASCII 10); and
|
|
|
|
- Object name for the object that would result from writing this span
|
|
of index as a tree.
|
|
|
|
An entry can be in an invalidated state and is represented by having
|
|
a negative number in the entry_count field. In this case, there is no
|
|
object name and the next entry starts immediately after the newline.
|
|
When writing an invalid entry, -1 should always be used as entry_count.
|
|
|
|
The entries are written out in the top-down, depth-first order. The
|
|
first entry represents the root level of the repository, followed by the
|
|
first subtree--let's call this A--of the root level (with its name
|
|
relative to the root level), followed by the first subtree of A (with
|
|
its name relative to A), and so on. The specified number of subtrees
|
|
indicates when the current level of the recursive stack is complete.
|
|
|
|
=== Resolve undo
|
|
|
|
A conflict is represented in the index as a set of higher stage entries.
|
|
When a conflict is resolved (e.g. with "git add path"), these higher
|
|
stage entries will be removed and a stage-0 entry with proper resolution
|
|
is added.
|
|
|
|
When these higher stage entries are removed, they are saved in the
|
|
resolve undo extension, so that conflicts can be recreated (e.g. with
|
|
"git checkout -m"), in case users want to redo a conflict resolution
|
|
from scratch.
|
|
|
|
The signature for this extension is { 'R', 'E', 'U', 'C' }.
|
|
|
|
A series of entries fill the entire extension; each of which
|
|
consists of:
|
|
|
|
- NUL-terminated pathname the entry describes (relative to the root of
|
|
the repository, i.e. full pathname);
|
|
|
|
- Three NUL-terminated ASCII octal numbers, entry mode of entries in
|
|
stage 1 to 3 (a missing stage is represented by "0" in this field);
|
|
and
|
|
|
|
- At most three object names of the entry in stages from 1 to 3
|
|
(nothing is written for a missing stage).
|
|
|
|
=== Split index
|
|
|
|
In split index mode, the majority of index entries could be stored
|
|
in a separate file. This extension records the changes to be made on
|
|
top of that to produce the final index.
|
|
|
|
The signature for this extension is { 'l', 'i', 'n', 'k' }.
|
|
|
|
The extension consists of:
|
|
|
|
- Hash of the shared index file. The shared index file path
|
|
is $GIT_DIR/sharedindex.<hash>. If all bits are zero, the
|
|
index does not require a shared index file.
|
|
|
|
- An ewah-encoded delete bitmap, each bit represents an entry in the
|
|
shared index. If a bit is set, its corresponding entry in the
|
|
shared index will be removed from the final index. Note, because
|
|
a delete operation changes index entry positions, but we do need
|
|
original positions in replace phase, it's best to just mark
|
|
entries for removal, then do a mass deletion after replacement.
|
|
|
|
- An ewah-encoded replace bitmap, each bit represents an entry in
|
|
the shared index. If a bit is set, its corresponding entry in the
|
|
shared index will be replaced with an entry in this index
|
|
file. All replaced entries are stored in sorted order in this
|
|
index. The first "1" bit in the replace bitmap corresponds to the
|
|
first index entry, the second "1" bit to the second entry and so
|
|
on. Replaced entries may have empty path names to save space.
|
|
|
|
The remaining index entries after replaced ones will be added to the
|
|
final index. These added entries are also sorted by entry name then
|
|
stage.
|
|
|
|
== Untracked cache
|
|
|
|
Untracked cache saves the untracked file list and necessary data to
|
|
verify the cache. The signature for this extension is { 'U', 'N',
|
|
'T', 'R' }.
|
|
|
|
The extension starts with
|
|
|
|
- A sequence of NUL-terminated strings, preceded by the size of the
|
|
sequence in variable width encoding. Each string describes the
|
|
environment where the cache can be used.
|
|
|
|
- Stat data of $GIT_DIR/info/exclude. See "Index entry" section from
|
|
ctime field until "file size".
|
|
|
|
- Stat data of core.excludesfile
|
|
|
|
- 32-bit dir_flags (see struct dir_struct)
|
|
|
|
- Hash of $GIT_DIR/info/exclude. A null hash means the file
|
|
does not exist.
|
|
|
|
- Hash of core.excludesfile. A null hash means the file does
|
|
not exist.
|
|
|
|
- NUL-terminated string of per-dir exclude file name. This usually
|
|
is ".gitignore".
|
|
|
|
- The number of following directory blocks, variable width
|
|
encoding. If this number is zero, the extension ends here with a
|
|
following NUL.
|
|
|
|
- A number of directory blocks in depth-first-search order, each
|
|
consists of
|
|
|
|
- The number of untracked entries, variable width encoding.
|
|
|
|
- The number of sub-directory blocks, variable width encoding.
|
|
|
|
- The directory name terminated by NUL.
|
|
|
|
- A number of untracked file/dir names terminated by NUL.
|
|
|
|
The remaining data of each directory block is grouped by type:
|
|
|
|
- An ewah bitmap, the n-th bit marks whether the n-th directory has
|
|
valid untracked cache entries.
|
|
|
|
- An ewah bitmap, the n-th bit records "check-only" bit of
|
|
read_directory_recursive() for the n-th directory.
|
|
|
|
- An ewah bitmap, the n-th bit indicates whether hash and stat data
|
|
is valid for the n-th directory and exists in the next data.
|
|
|
|
- An array of stat data. The n-th data corresponds with the n-th
|
|
"one" bit in the previous ewah bitmap.
|
|
|
|
- An array of hashes. The n-th hash corresponds with the n-th "one" bit
|
|
in the previous ewah bitmap.
|
|
|
|
- One NUL.
|
|
|
|
== File System Monitor cache
|
|
|
|
The file system monitor cache tracks files for which the core.fsmonitor
|
|
hook has told us about changes. The signature for this extension is
|
|
{ 'F', 'S', 'M', 'N' }.
|
|
|
|
The extension starts with
|
|
|
|
- 32-bit version number: the current supported versions are 1 and 2.
|
|
|
|
- (Version 1)
|
|
64-bit time: the extension data reflects all changes through the given
|
|
time which is stored as the nanoseconds elapsed since midnight,
|
|
January 1, 1970.
|
|
|
|
- (Version 2)
|
|
A null terminated string: an opaque token defined by the file system
|
|
monitor application. The extension data reflects all changes relative
|
|
to that token.
|
|
|
|
- 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap.
|
|
|
|
- An ewah bitmap, the n-th bit indicates whether the n-th index entry
|
|
is not CE_FSMONITOR_VALID.
|
|
|
|
== End of Index Entry
|
|
|
|
The End of Index Entry (EOIE) is used to locate the end of the variable
|
|
length index entries and the beginning of the extensions. Code can take
|
|
advantage of this to quickly locate the index extensions without having
|
|
to parse through all of the index entries.
|
|
|
|
Because it must be able to be loaded before the variable length cache
|
|
entries and other index extensions, this extension must be written last.
|
|
The signature for this extension is { 'E', 'O', 'I', 'E' }.
|
|
|
|
The extension consists of:
|
|
|
|
- 32-bit offset to the end of the index entries
|
|
|
|
- Hash over the extension types and their sizes (but not
|
|
their contents). E.g. if we have "TREE" extension that is N-bytes
|
|
long, "REUC" extension that is M-bytes long, followed by "EOIE",
|
|
then the hash would be:
|
|
|
|
Hash("TREE" + <binary representation of N> +
|
|
"REUC" + <binary representation of M>)
|
|
|
|
== Index Entry Offset Table
|
|
|
|
The Index Entry Offset Table (IEOT) is used to help address the CPU
|
|
cost of loading the index by enabling multi-threading the process of
|
|
converting cache entries from the on-disk format to the in-memory format.
|
|
The signature for this extension is { 'I', 'E', 'O', 'T' }.
|
|
|
|
The extension consists of:
|
|
|
|
- 32-bit version (currently 1)
|
|
|
|
- A number of index offset entries each consisting of:
|
|
|
|
- 32-bit offset from the beginning of the file to the first cache entry
|
|
in this block of entries.
|
|
|
|
- 32-bit count of cache entries in this block
|