2018-04-02 22:34:17 +02:00
|
|
|
Git Commit Graph Design Notes
|
|
|
|
=============================
|
|
|
|
|
|
|
|
Git walks the commit graph for many reasons, including:
|
|
|
|
|
|
|
|
1. Listing and filtering commit history.
|
|
|
|
2. Computing merge bases.
|
|
|
|
|
|
|
|
These operations can become slow as the commit count grows. The merge
|
|
|
|
base calculation shows up in many user-facing commands, such as 'merge-base'
|
|
|
|
or 'status' and can take minutes to compute depending on history shape.
|
|
|
|
|
|
|
|
There are two main costs here:
|
|
|
|
|
|
|
|
1. Decompressing and parsing commits.
|
|
|
|
2. Walking the entire graph to satisfy topological order constraints.
|
|
|
|
|
2018-09-27 21:12:22 +02:00
|
|
|
The commit-graph file is a supplemental data structure that accelerates
|
2018-04-02 22:34:17 +02:00
|
|
|
commit graph walks. If a user downgrades or disables the 'core.commitGraph'
|
|
|
|
config setting, then the existing ODB is sufficient. The file is stored
|
|
|
|
as "commit-graph" either in the .git/objects/info directory or in the info
|
|
|
|
directory of an alternate.
|
|
|
|
|
2018-09-27 21:12:22 +02:00
|
|
|
The commit-graph file stores the commit graph structure along with some
|
2018-04-02 22:34:17 +02:00
|
|
|
extra metadata to speed up graph walks. By listing commit OIDs in lexi-
|
|
|
|
cographic order, we can identify an integer position for each commit and
|
|
|
|
refer to the parents of a commit using those integer positions. We use
|
|
|
|
binary search to find initial commits and then use the integer positions
|
|
|
|
for fast lookups during the walk.
|
|
|
|
|
|
|
|
A consumer may load the following info for a commit from the graph:
|
|
|
|
|
|
|
|
1. The commit OID.
|
|
|
|
2. The list of parents, along with their integer position.
|
|
|
|
3. The commit date.
|
|
|
|
4. The root tree OID.
|
|
|
|
5. The generation number (see definition below).
|
|
|
|
|
|
|
|
Values 1-4 satisfy the requirements of parse_commit_gently().
|
|
|
|
|
|
|
|
Define the "generation number" of a commit recursively as follows:
|
|
|
|
|
|
|
|
* A commit with no parents (a root commit) has generation number one.
|
|
|
|
|
|
|
|
* A commit with at least one parent has generation number one more than
|
|
|
|
the largest generation number among its parents.
|
|
|
|
|
|
|
|
Equivalently, the generation number of a commit A is one more than the
|
|
|
|
length of a longest path from A to a root commit. The recursive definition
|
|
|
|
is easier to use for computation and observing the following property:
|
|
|
|
|
|
|
|
If A and B are commits with generation numbers N and M, respectively,
|
|
|
|
and N <= M, then A cannot reach B. That is, we know without searching
|
|
|
|
that B is not an ancestor of A because it is further from a root commit
|
|
|
|
than A.
|
|
|
|
|
|
|
|
Conversely, when checking if A is an ancestor of B, then we only need
|
|
|
|
to walk commits until all commits on the walk boundary have generation
|
|
|
|
number at most N. If we walk commits using a priority queue seeded by
|
|
|
|
generation numbers, then we always expand the boundary commit with highest
|
|
|
|
generation number and can easily detect the stopping condition.
|
|
|
|
|
|
|
|
This property can be used to significantly reduce the time it takes to
|
|
|
|
walk commits and determine topological relationships. Without generation
|
|
|
|
numbers, the general heuristic is the following:
|
|
|
|
|
|
|
|
If A and B are commits with commit time X and Y, respectively, and
|
|
|
|
X < Y, then A _probably_ cannot reach B.
|
|
|
|
|
|
|
|
This heuristic is currently used whenever the computation is allowed to
|
|
|
|
violate topological relationships due to clock skew (such as "git log"
|
|
|
|
with default order), but is not used when the topological order is
|
|
|
|
required (such as merge base calculations, "git log --graph").
|
|
|
|
|
|
|
|
In practice, we expect some commits to be created recently and not stored
|
|
|
|
in the commit graph. We can treat these commits as having "infinite"
|
|
|
|
generation number and walk until reaching commits with known generation
|
|
|
|
number.
|
|
|
|
|
2018-05-01 14:47:25 +02:00
|
|
|
We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
|
|
|
|
in the commit-graph file. If a commit-graph file was written by a version
|
|
|
|
of Git that did not compute generation numbers, then those commits will
|
|
|
|
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
|
|
|
|
|
|
|
|
Since the commit-graph file is closed under reachability, we can guarantee
|
|
|
|
the following weaker condition on all commits:
|
|
|
|
|
|
|
|
If A and B are commits with generation numbers N amd M, respectively,
|
|
|
|
and N < M, then A cannot reach B.
|
|
|
|
|
|
|
|
Note how the strict inequality differs from the inequality when we have
|
|
|
|
fully-computed generation numbers. Using strict inequality may result in
|
|
|
|
walking a few extra commits, but the simplicity in dealing with commits
|
|
|
|
with generation number *_INFINITY or *_ZERO is valuable.
|
|
|
|
|
|
|
|
We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
|
|
|
|
generation numbers are computed to be at least this value. We limit at
|
|
|
|
this value since it is the largest value that can be stored in the
|
|
|
|
commit-graph file using the 30 bits available to generation numbers. This
|
|
|
|
presents another case where a commit can have generation number equal to
|
|
|
|
that of a parent.
|
|
|
|
|
2018-04-02 22:34:17 +02:00
|
|
|
Design Details
|
|
|
|
--------------
|
|
|
|
|
2018-09-27 21:12:22 +02:00
|
|
|
- The commit-graph file is stored in a file named 'commit-graph' in the
|
2018-04-02 22:34:17 +02:00
|
|
|
.git/objects/info directory. This could be stored in the info directory
|
|
|
|
of an alternate.
|
|
|
|
|
|
|
|
- The core.commitGraph config setting must be on to consume graph files.
|
|
|
|
|
|
|
|
- The file format includes parameters for the object ID hash function,
|
|
|
|
so a future change of hash algorithm does not require a change in format.
|
|
|
|
|
2018-08-20 20:24:21 +02:00
|
|
|
- Commit grafts and replace objects can change the shape of the commit
|
|
|
|
history. The latter can also be enabled/disabled on the fly using
|
|
|
|
`--no-replace-objects`. This leads to difficultly storing both possible
|
|
|
|
interpretations of a commit id, especially when computing generation
|
|
|
|
numbers. The commit-graph will not be read or written when
|
|
|
|
replace-objects or grafts are present.
|
|
|
|
|
|
|
|
- Shallow clones create grafts of commits by dropping their parents. This
|
|
|
|
leads the commit-graph to think those commits have generation number 1.
|
|
|
|
If and when those commits are made unshallow, those generation numbers
|
|
|
|
become invalid. Since shallow clones are intended to restrict the commit
|
|
|
|
history to a very small set of commits, the commit-graph feature is less
|
|
|
|
helpful for these clones, anyway. The commit-graph will not be read or
|
|
|
|
written when shallow commits are present.
|
|
|
|
|
2019-06-18 20:14:23 +02:00
|
|
|
Commit Graphs Chains
|
|
|
|
--------------------
|
|
|
|
|
|
|
|
Typically, repos grow with near-constant velocity (commits per day). Over time,
|
|
|
|
the number of commits added by a fetch operation is much smaller than the
|
|
|
|
number of commits in the full history. By creating a "chain" of commit-graphs,
|
|
|
|
we enable fast writes of new commit data without rewriting the entire commit
|
|
|
|
history -- at least, most of the time.
|
|
|
|
|
|
|
|
## File Layout
|
|
|
|
|
|
|
|
A commit-graph chain uses multiple files, and we use a fixed naming convention
|
|
|
|
to organize these files. Each commit-graph file has a name
|
|
|
|
`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
|
|
|
|
valued hash stored in the footer of that file (which is a hash of the file's
|
|
|
|
contents before that hash). For a chain of commit-graph files, a plain-text
|
|
|
|
file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
|
|
|
|
hashes for the files in order from "lowest" to "highest".
|
|
|
|
|
|
|
|
For example, if the `commit-graph-chain` file contains the lines
|
|
|
|
|
|
|
|
```
|
|
|
|
{hash0}
|
|
|
|
{hash1}
|
|
|
|
{hash2}
|
|
|
|
```
|
|
|
|
|
|
|
|
then the commit-graph chain looks like the following diagram:
|
|
|
|
|
|
|
|
+-----------------------+
|
|
|
|
| graph-{hash2}.graph |
|
|
|
|
+-----------------------+
|
|
|
|
|
|
|
|
|
+-----------------------+
|
|
|
|
| |
|
|
|
|
| graph-{hash1}.graph |
|
|
|
|
| |
|
|
|
|
+-----------------------+
|
|
|
|
|
|
|
|
|
+-----------------------+
|
|
|
|
| |
|
|
|
|
| |
|
|
|
|
| |
|
|
|
|
| graph-{hash0}.graph |
|
|
|
|
| |
|
|
|
|
| |
|
|
|
|
| |
|
|
|
|
+-----------------------+
|
|
|
|
|
|
|
|
Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
|
|
|
|
commits in `graph-{hash1}.graph`, and X2 be the number of commits in
|
|
|
|
`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
|
|
|
|
then we interpret this as being the commit in position (X0 + X1 + i), and that
|
|
|
|
will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
|
|
|
|
positions to refer to their parents, which may be in `graph-{hash1}.graph` or
|
|
|
|
`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
|
|
|
|
its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
|
|
|
|
X2).
|
|
|
|
|
2018-04-02 22:34:17 +02:00
|
|
|
Related Links
|
|
|
|
-------------
|
|
|
|
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
|
|
|
|
Chromium work item for: Serialized Commit Graph
|
|
|
|
|
|
|
|
[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
|
|
|
|
An abandoned patch that introduced generation numbers.
|
|
|
|
|
|
|
|
[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
|
|
|
|
Discussion about generation numbers on commits and how they interact
|
|
|
|
with fsck.
|
|
|
|
|
|
|
|
[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
|
|
|
|
More discussion about generation numbers and not storing them inside
|
|
|
|
commit objects. A valuable quote:
|
|
|
|
|
|
|
|
"I think we should be moving more in the direction of keeping
|
|
|
|
repo-local caches for optimizations. Reachability bitmaps have been
|
|
|
|
a big performance win. I think we should be doing the same with our
|
|
|
|
properties of commits. Not just generation numbers, but making it
|
|
|
|
cheap to access the graph structure without zlib-inflating whole
|
|
|
|
commit objects (i.e., packv4 or something like the "metapacks" I
|
|
|
|
proposed a few years ago)."
|
|
|
|
|
|
|
|
[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
|
|
|
|
A patch to remove the ahead-behind calculation from 'status'.
|