docs: move cruft pack docs to gitformat-pack
Integrate the cruft packs documentation initially added in
3d89a8c118 (Documentation/technical: add cruft-packs.txt,
2022-05-20) to the newly created "gitformat-pack" documentation.

Like the "bitmap-format" added before it in
0d4455a3ab (documentation: add documentation for the bitmap format,
2013-11-14) the "cruft-packs" were documented in their own file.

As the diff move detection will show there is no change to
"Documentation/technical/cruft-packs.txt" here except to move it, and
to "indent" the existing sections by adding an extra "=" to them.

We could similarly convert the "bitmap-format.txt", but let's leave it
for now due to a conflict with the in-flight ac/bitmap-lookup-table
series.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
parent 977c47b46d
commit 6b6029dd1d
@@ -105,7 +105,6 @@ TECH_DOCS += MyFirstObjectWalk
TECH_DOCS += SubmittingPatches
TECH_DOCS += ToolsForGit
TECH_DOCS += technical/bitmap-format
TECH_DOCS += technical/cruft-packs
TECH_DOCS += technical/hash-function-transition
TECH_DOCS += technical/http-protocol
TECH_DOCS += technical/long-running-process-protocol
@@ -11,6 +11,7 @@ SYNOPSIS
[verse]
$GIT_DIR/objects/pack/pack-*.{pack,idx}
$GIT_DIR/objects/pack/pack-*.rev
$GIT_DIR/objects/pack/pack-*.mtimes
$GIT_DIR/objects/pack/multi-pack-index

DESCRIPTION
@@ -507,6 +508,131 @@ packs arranged in MIDX order (with the preferred pack coming first).
The MIDX's reverse index is stored in the optional 'RIDX' chunk within
the MIDX itself.

== cruft packs

The cruft packs feature offers an alternative to Git's traditional mechanism of
removing unreachable objects. This document provides an overview of Git's
pruning mechanism, and how a cruft pack can be used instead to accomplish the
same.

=== Background

To remove unreachable objects from your repository, Git offers `git repack -Ad`
(see linkgit:git-repack[1]). Quoting from the documentation:

----
[...] unreachable objects in a previous pack become loose, unpacked objects,
instead of being left in the old pack. [...] loose unreachable objects will be
pruned according to normal expiry rules with the next 'git gc' invocation.
----

Unreachable objects aren't removed immediately, since doing so could race with
an incoming push which may reference an object which is about to be deleted.
Instead, those unreachable objects are stored as loose objects and stay that way
until they are older than the expiration window, at which point they are removed
by linkgit:git-prune[1].

Git must store these unreachable objects loose in order to keep track of their
per-object mtimes. If these unreachable objects were written into one big pack,
then either freshening that pack (because an object contained within it was
re-written) or creating a new pack of unreachable objects would cause the pack's
mtime to get updated, and the objects within it would never leave the expiration
window. Instead, objects are stored loose in order to keep track of the
individual object mtimes and avoid a situation where all cruft objects are
freshened at once.

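As a rough illustration of the problem described above, the following Python
sketch compares expiry decisions when every object carries its own mtime
against a single pack whose only timestamp is the pack file's mtime. It is a
toy model, not Git's implementation: the two-week `GRACE_PERIOD`, the helper
names, and the object names are all assumptions made for this example.

```python
from typing import Dict, Set

DAY = 24 * 3600
GRACE_PERIOD = 14 * DAY  # assumed two-week expiration window, in seconds


def expired(mtime: int, now: int, grace: int = GRACE_PERIOD) -> bool:
    """An object may be pruned once its mtime is older than the grace period."""
    return now - mtime > grace


def prunable_loose(object_mtimes: Dict[str, int], now: int) -> Set[str]:
    # Loose storage: each object is judged by its own mtime.
    return {oid for oid, mtime in object_mtimes.items() if expired(mtime, now)}


def prunable_in_single_pack(object_ids: Set[str], pack_mtime: int, now: int) -> Set[str]:
    # One big pack without per-object data: every object inherits the
    # pack's mtime, so the objects expire together or not at all.
    return set(object_ids) if expired(pack_mtime, now) else set()


now = 100 * DAY
object_mtimes = {"old": now - 30 * DAY, "recent": now - 1 * DAY}

# Freshening "recent" also refreshes the only timestamp the pack has.
pack_mtime = max(object_mtimes.values())

loose_result = prunable_loose(object_mtimes, now)
packed_result = prunable_in_single_pack(set(object_mtimes), pack_mtime, now)
print(sorted(loose_result))
print(sorted(packed_result))
```

In this model `old` is prunable when stored loose, but once both objects share
one pack whose mtime was refreshed along with `recent`, nothing in that pack
leaves the expiration window.
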
This can lead to undesirable situations when a repository contains many
unreachable objects which have not yet left the grace period. Having large
directories in the shards of `.git/objects` can lead to decreased performance in
the repository. Given enough unreachable objects, it can even lead to inode
starvation and degrade the performance of the whole system. Because those
objects can never be packed, these repositories also often take up a large
amount of disk space: the objects can only be zlib compressed individually, not
stored in delta chains.

=== Cruft packs

A cruft pack eliminates the need for storing unreachable objects in a loose
state by including the per-object mtimes in a separate file alongside a single
pack containing all loose objects.

A cruft pack is written by `git repack --cruft` when generating a new pack,
using linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
is a classic all-into-one repack, meaning that everything in the resulting pack is
reachable, and everything else is unreachable. Once that pack is written, the
`--cruft` option instructs `git repack` to generate another pack containing only
objects not packed in the previous step (which equates to packing all
unreachable objects together). This progresses as follows:

  1. Enumerate every object, marking as a traversal tip any object which (a) is
     not contained in a kept-pack, and (b) has an mtime within the grace
     period.

  2. Perform a reachability traversal based on the tips gathered in the previous
     step, adding every object along the way to the pack.

  3. Write the pack out, along with a `.mtimes` file that records the per-object
     timestamps.

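The three steps above can be sketched in Python as follows. This is a
simplified model under stated assumptions, not Git's actual code:
`CruftObject`, `build_cruft_pack`, the in-memory object graph, and the returned
dictionary standing in for the `.mtimes` file are all hypothetical, and
skipping objects that already live in a kept pack during the traversal is an
assumption of this sketch rather than something the steps spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CruftObject:
    mtime: int                                      # last-modified time, in seconds
    refs: List[str] = field(default_factory=list)   # ids of objects this one points to


def build_cruft_pack(
    objects: Dict[str, CruftObject],
    kept: Set[str],
    now: int,
    grace: int,
) -> Dict[str, int]:
    # Step 1: every object outside the kept packs whose mtime is still
    # within the grace period becomes a traversal tip.
    tips = [
        oid
        for oid, obj in objects.items()
        if oid not in kept and now - obj.mtime <= grace
    ]

    # Step 2: walk from the tips, adding every object met along the way.
    # Assumption: objects already stored in a kept pack are not added again.
    packed: Set[str] = set()
    stack = list(tips)
    while stack:
        oid = stack.pop()
        if oid in packed or oid in kept or oid not in objects:
            continue
        packed.add(oid)
        stack.extend(objects[oid].refs)

    # Step 3: "write" the pack; the returned mapping plays the role of the
    # .mtimes file by recording each packed object's own timestamp.
    return {oid: objects[oid].mtime for oid in sorted(packed)}


DAY = 24 * 3600
now = 100 * DAY
objects = {
    "reachable": CruftObject(mtime=now - 50 * DAY),
    "fresh": CruftObject(mtime=now - 1 * DAY, refs=["stale_child", "reachable"]),
    "stale_child": CruftObject(mtime=now - 40 * DAY),
    "stale_orphan": CruftObject(mtime=now - 40 * DAY),
}
mtimes = build_cruft_pack(objects, kept={"reachable"}, now=now, grace=14 * DAY)
print(mtimes)
```

In this model `stale_child` is carried into the pack with its own, older mtime
because a recent object still refers to it, while `stale_orphan` is neither
recent nor reachable from a recent tip and is therefore left out.
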
This mode is invoked internally by linkgit:git-repack[1] when instructed to
write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
of packs which will not be deleted by the repack; in other words, they contain
all of the repository's reachable objects.

When a repository already has a cruft pack, `git repack --cruft` typically only
adds objects to it. An exception to this is when `git repack` is given the
`--cruft-expiration` option, which allows the generated cruft pack to omit
expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
later on.

It is linkgit:git-gc[1] that is typically responsible for removing expired
unreachable objects.

=== Caution for mixed-version environments

Repositories that have cruft packs in them will continue to work with any older
version of Git. Note, however, that previous versions of Git which do not
understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
all of the objects in it. In other words, do not expect older (pre-cruft pack)
versions of Git to interpret or even read the contents of the `.mtimes` file.

Note that having mixed versions of Git GC-ing the same repository can lead to
unreachable objects never being completely pruned. This can happen under the
following circumstances:

  - An older version of Git running GC explodes the contents of an existing
    cruft pack loose, using the cruft pack's mtime.
  - A newer version running GC collects those loose objects into a cruft pack,
    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
    cruft pack mtime is "now".

Repeating this process will lead to unreachable objects not getting pruned as a
result of repeatedly resetting the objects' mtimes to the present time.

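The cycle described above can be reproduced with a small Python simulation.
This is only a sketch of the interaction, not a model of real `git gc`: the
`Repo` class, the `old_gc` and `new_gc` functions, the five-day interval
between runs, and the two-week grace period are all assumptions chosen for
illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

DAY = 24 * 3600
GRACE = 14 * DAY  # assumed expiration window


@dataclass
class Repo:
    loose: Dict[str, int] = field(default_factory=dict)         # oid -> file mtime
    cruft_mtimes: Dict[str, int] = field(default_factory=dict)  # .mtimes contents
    cruft_pack_mtime: Optional[int] = None                      # mtime of the pack file


def old_gc(repo: Repo, now: int) -> None:
    # An older Git ignores .mtimes: exploding the cruft pack gives every
    # object the pack's own mtime.
    if repo.cruft_pack_mtime is not None:
        for oid in repo.cruft_mtimes:
            repo.loose[oid] = repo.cruft_pack_mtime
        repo.cruft_mtimes = {}
        repo.cruft_pack_mtime = None
    repo.loose = {oid: m for oid, m in repo.loose.items() if now - m <= GRACE}


def new_gc(repo: Repo, now: int) -> None:
    # A newer Git records each object's mtime in .mtimes, drops expired
    # objects, and writes a pack whose own mtime is "now".
    candidates = {**repo.cruft_mtimes, **repo.loose}
    repo.cruft_mtimes = {oid: m for oid, m in candidates.items() if now - m <= GRACE}
    repo.loose = {}
    repo.cruft_pack_mtime = now if repo.cruft_mtimes else None


def survives(schedule: List[Callable[[Repo, int], None]], rounds: int) -> bool:
    repo = Repo(loose={"obj": 0})  # one unreachable object created at time 0
    now = 0
    for i in range(rounds):
        now += 5 * DAY
        schedule[i % len(schedule)](repo, now)
    return "obj" in repo.loose or "obj" in repo.cruft_mtimes


mixed = survives([new_gc, old_gc], rounds=20)
new_only = survives([new_gc], rounds=20)
print(mixed, new_only)
```

With alternating versions the object's recorded mtime is reset on every
round trip and it outlives the grace period indefinitely; with only the newer
version running, the same object is dropped once its original mtime falls
outside the window.
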
If you are GC-ing repositories in a mixed version environment, consider omitting
the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
leaving the `gc.cruftPacks` configuration unset until all writers understand
cruft packs.

=== Alternatives

Notable alternatives to this design include:

  - The location of the per-object mtime data, and
  - Storing unreachable objects in multiple cruft packs.

On the location of mtime data, a new auxiliary file tied to the pack was chosen
to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
support for optional chunks of data, it may make sense to consolidate the
`.mtimes` format into the `.idx` itself.

Storing unreachable objects among multiple cruft packs (e.g., creating a new
cruft pack during each repacking operation including only unreachable objects
which aren't already stored in an earlier cruft pack) is significantly more
complicated to construct, and so isn't pursued here. The obvious drawback to
the current implementation is that the entire cruft pack must be re-written from
scratch.

GIT
---
Part of the linkgit:git[1] suite
@@ -1,123 +0,0 @@
= Cruft packs

The cruft packs feature offers an alternative to Git's traditional mechanism of
removing unreachable objects. This document provides an overview of Git's
pruning mechanism, and how a cruft pack can be used instead to accomplish the
same.

== Background

To remove unreachable objects from your repository, Git offers `git repack -Ad`
(see linkgit:git-repack[1]). Quoting from the documentation:

[quote]
[...] unreachable objects in a previous pack become loose, unpacked objects,
instead of being left in the old pack. [...] loose unreachable objects will be
pruned according to normal expiry rules with the next 'git gc' invocation.

Unreachable objects aren't removed immediately, since doing so could race with
an incoming push which may reference an object which is about to be deleted.
Instead, those unreachable objects are stored as loose objects and stay that way
until they are older than the expiration window, at which point they are removed
by linkgit:git-prune[1].

Git must store these unreachable objects loose in order to keep track of their
per-object mtimes. If these unreachable objects were written into one big pack,
then either freshening that pack (because an object contained within it was
re-written) or creating a new pack of unreachable objects would cause the pack's
mtime to get updated, and the objects within it would never leave the expiration
window. Instead, objects are stored loose in order to keep track of the
individual object mtimes and avoid a situation where all cruft objects are
freshened at once.

This can lead to undesirable situations when a repository contains many
unreachable objects which have not yet left the grace period. Having large
directories in the shards of `.git/objects` can lead to decreased performance in
the repository. Given enough unreachable objects, it can even lead to inode
starvation and degrade the performance of the whole system. Because those
objects can never be packed, these repositories also often take up a large
amount of disk space: the objects can only be zlib compressed individually, not
stored in delta chains.

== Cruft packs

A cruft pack eliminates the need for storing unreachable objects in a loose
state by including the per-object mtimes in a separate file alongside a single
pack containing all loose objects.

A cruft pack is written by `git repack --cruft` when generating a new pack,
using linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
is a classic all-into-one repack, meaning that everything in the resulting pack is
reachable, and everything else is unreachable. Once that pack is written, the
`--cruft` option instructs `git repack` to generate another pack containing only
objects not packed in the previous step (which equates to packing all
unreachable objects together). This progresses as follows:

  1. Enumerate every object, marking as a traversal tip any object which (a) is
     not contained in a kept-pack, and (b) has an mtime within the grace
     period.

  2. Perform a reachability traversal based on the tips gathered in the previous
     step, adding every object along the way to the pack.

  3. Write the pack out, along with a `.mtimes` file that records the per-object
     timestamps.

This mode is invoked internally by linkgit:git-repack[1] when instructed to
write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
of packs which will not be deleted by the repack; in other words, they contain
all of the repository's reachable objects.

When a repository already has a cruft pack, `git repack --cruft` typically only
adds objects to it. An exception to this is when `git repack` is given the
`--cruft-expiration` option, which allows the generated cruft pack to omit
expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
later on.

It is linkgit:git-gc[1] that is typically responsible for removing expired
unreachable objects.

== Caution for mixed-version environments

Repositories that have cruft packs in them will continue to work with any older
version of Git. Note, however, that previous versions of Git which do not
understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
all of the objects in it. In other words, do not expect older (pre-cruft pack)
versions of Git to interpret or even read the contents of the `.mtimes` file.

Note that having mixed versions of Git GC-ing the same repository can lead to
unreachable objects never being completely pruned. This can happen under the
following circumstances:

  - An older version of Git running GC explodes the contents of an existing
    cruft pack loose, using the cruft pack's mtime.
  - A newer version running GC collects those loose objects into a cruft pack,
    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
    cruft pack mtime is "now".

Repeating this process will lead to unreachable objects not getting pruned as a
result of repeatedly resetting the objects' mtimes to the present time.

If you are GC-ing repositories in a mixed version environment, consider omitting
the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
leaving the `gc.cruftPacks` configuration unset until all writers understand
cruft packs.

== Alternatives

Notable alternatives to this design include:

  - The location of the per-object mtime data, and
  - Storing unreachable objects in multiple cruft packs.

On the location of mtime data, a new auxiliary file tied to the pack was chosen
to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
support for optional chunks of data, it may make sense to consolidate the
`.mtimes` format into the `.idx` itself.

Storing unreachable objects among multiple cruft packs (e.g., creating a new
cruft pack during each repacking operation including only unreachable objects
which aren't already stored in an earlier cruft pack) is significantly more
complicated to construct, and so isn't pursued here. The obvious drawback to
the current implementation is that the entire cruft pack must be re-written from
scratch.