clear_##slabname() frees only the memory allocated for a commit slab
itself, but entries in the commit slab might own additional memory
outside the slab that should be freed as well. We already have (at
least) one such commit slab, and this patch series is about to add one
more.
To free all additional memory owned by entries on the commit slab the
user of such a slab could iterate over all commits it knows about,
peek whether there is a valid entry associated with each commit, and
free the additional memory, if any. Or it could rely on intimate
knowledge about the internals of the commit slab implementation, and
could itself iterate directly through all entries in the slab, and
free the additional memory. Or it could just leak the additional
memory...
Introduce deep_clear_##slabname() to allow releasing memory owned by
commit slab entries by invoking the 'void free_fn(elemtype *ptr)'
function specified as parameter for each entry in the slab.
Use it in get_shallow_commits() in 'shallow.c' to replace an
open-coded iteration over a commit slab's entries.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph format specifies that "All 4-byte numbers are in
network order", but the commit-graph contains 8-byte integers as well
(file offsets in the Chunk Lookup table), and their byte order is
unspecified.
Clarify that all multi-byte integers are in network byte order.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph file format specifies that the chunks may be in any
order. However, if the OID Lookup chunk happens to be the last one in
the file, then any command attempting to access the commit-graph data
will fail with:
fatal: invalid commit position. commit-graph is likely corrupt
In this case the error is wrong, the commit-graph file does conform to
the specification, but the parsing of the Chunk Lookup table is a bit
buggy, and leaves the field holding the number of commits in the
commit-graph zero-initialized.
The number of commits in the commit-graph is determined while parsing
the Chunk Lookup table, by dividing the size of the OID Lookup chunk
with the hash size. However, the Chunk Lookup table doesn't actually
store the size of the chunks, but it stores their starting offset.
Consequently, the size of a chunk can only be calculated by
subtracting the starting offsets of that chunk from the offset of the
subsequent chunk, or in case of the last chunk from the offset
recorded in the terminating label. This is currenly implemented in a
bit complicated way: as we iterate over the entries of the Chunk
Lookup table, we check the ID of each chunk and store its starting
offset, then we check the ID of the last seen chunk and calculate its
size using its previously saved offset if necessary (at the moment
it's only necessary for the OID Lookup chunk). Alas, while parsing
the Chunk Lookup table we only interate through the "real" chunks, but
never look at the terminating label, thus don't even check whether
it's necessary to calulate the size of the last chunk. Consequently,
if the OID Lookup chunk is the last one, then we don't calculate its
size and turn don't run the piece of code determining the number of
commits in the commit graph, leaving the field holding that number
unchanged (i.e. zero-initialized), eventually triggering the sanity
check in load_oid_from_graph().
Fix this by iterating through all entries in the Chunk Lookup table,
including the terminating label.
Note that this is the minimal fix, suitable for the maintenance track.
A better fix would be to simplify how the chunk sizes are calculated,
but that is a more invasive change, less suitable for 'maint', so that
will be done in later patches.
This additional flexibility of scanning more chunks breaks a test for
"git commit-graph verify" so alter that test to mutate the commit-graph
to have an even lower chunk count.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Submodules should be handled the same as regular directories with
respect to the presence of a trailing slash, i.e. commands like:
git diff rev1 rev2 -- $path
git rev-list HEAD -- $path
should produce the same output whether $path is 'submod' or 'submod/'.
This has been fixed in commit 74b4f7f277 (tree-walk.c: ignore trailing
slash on submodule in tree_entry_interesting(), 2014-01-23).
Unfortunately, that commit had the unintended side effect to handle
'submod/anything' the same as 'submod' and 'submod/' as well, e.g.:
$ git log --oneline --name-only -- sha1collisiondetection/whatever
4125f78222 sha1dc: update from upstream
sha1collisiondetection
07a20f569b Makefile: fix unaligned loads in sha1dc with UBSan
sha1collisiondetection
23e37f8e9d sha1dc: update from upstream
sha1collisiondetection
86cfd61e6b sha1dc: optionally use sha1collisiondetection as a submodule
sha1collisiondetection
Fix this by rejecting submodules as partial pathnames when their
trailing slash is followed by anything.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.
Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.
Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.
There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.
The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):
git log --pretty=format: --name-only -n 10000 | sort | \
uniq -c | sort -rg | head -10
which results in
121 MAINTAINERS
63 fs/namei.c
60 arch/x86/kvm/cpuid.c
59 fs/io_uring.c
58 arch/x86/kvm/vmx/vmx.c
51 arch/x86/kvm/x86.c
45 arch/x86/kvm/svm.c
42 fs/btrfs/disk-io.c
42 Documentation/scsi/index.rst
(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results for the real command time:
| | Entire History | First Result |
| Path | Before | After | Before | After |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file
which is
* edited frequently,
* low in the directory heirarchy, and
* quite a large file.
All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous patch made it possible to perform line-level filtering
during history traversal instead of in an expensive preprocessing
step, but it still requires some simpler preprocessing steps, notably
topo-ordering. However, nowadays we have commit-graphs storing
generation numbers, which make it possible to incrementally traverse
the history in topological order, without the preparatory limit_list()
and sort_in_topological_order() steps; see b45424181e (revision.c:
generation-based topo-order algorithm, 2018-11-01).
This patch combines the two, so we can do both the topo-ordering and
the line-level filtering during history traversal, eliminating even
those simpler preprocessing steps, and thus further reducing the delay
before showing the first commit modifying the given line range.
The 'revs->limited' flag plays the central role in this, because, due
to limitations of the current implementation, the generation
number-based topo-ordering is only enabled when this flag remains
unset. Line-level log, however, always sets this flag in
setup_revisions() ever since the feature was introduced in 12da1d1f6f
(Implement line-history search (git log -L), 2013-03-28). The reason
for setting 'limited' is unclear, though, because the line-level log
itself doesn't directly depend on it, and it doesn't affect how the
limit_list() function limits the revision range. However, there is an
indirect dependency: the line-level log requires topo-ordering, and
the "traditional" sort_in_topological_order() requires an already
limited commit list since e6c3505b44 (Make sure we generate the whole
commit list before trying to sort it topologically, 2005-07-06). The
new, generation numbers-based topo-ordering doesn't require a limited
commit list anymore.
So don't set 'revs->limited' for line-level log, unless it is really
necessary, namely:
- The user explicitly requested parent rewriting, because that is
still done in the line_log_filter() preprocessing step (see
previous patch), which requires sort_in_topological_order() and in
turn limit_list() as well.
- A commit-graph file is not available or it doesn't yet contain
generation numbers. In these cases we had to fall back on
sort_in_topological_order() and in turn limit_list(). The
existing condition with generation_numbers_enabled() has already
ensured that the 'limited' flag is set in these cases; this patch
just makes sure that the line-level log sets 'revs->topo_order'
before that condition.
While the reduced delay before showing the first commit is measurable
in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were
modified rather close to the starting revisions, so the effect of this
change is most noticable.
# git.git
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
After:
real 0m0.058s
user 0m0.050s
sys 0m0.005s
# linux.git
$ time git --no-pager log \
-L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2
Before:
real 0m1.129s
user 0m1.061s
sys 0m0.069s
After:
real 0m0.096s
user 0m0.087s
sys 0m0.009s
Additional testing by Derrick Stolee: Since this patch improves
the performance for the first result, I repeated the experiment
from the previous patch on the Linux kernel repository, reporting
real time here:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 0.71 s
After: 0.05 s
Now, we have dropped the full topo-order of all ~910,000 commits
before reporting the first result. The remaining performance
improvements then are:
1. Update the parent-rewriting logic to be incremental similar to
how "git log --graph" behaves.
2. Use changed-path Bloom filters to reduce the time spend in the
tree-diff to see if the path(s) changed.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range. This preprocessing affects both responsiveness
and correctness:
- Git doesn't produce any output during this preprocessing step.
Checking whether a commit modified the given line range is
somewhat expensive, so depending on the size of the given revision
range this preprocessing can result in a significant delay before
the first commit is shown.
- Limiting the number of displayed commits (e.g. 'git log -3 -L...')
doesn't limit the amount of work during preprocessing, because
that limit is applied during history traversal. Alas, by that
point this expensive preprocessing step has already churned
through the whole revision range to find all commits modifying the
revision range, even though only a few of them need to be shown.
- It rewrites parents, with no way to turn it off. Without the user
explicitly requesting parent rewriting any parent object ID shown
should be that of the immediate parent, just like in case of a
pathspec-limited history traversal without parent rewriting.
However, after that preprocessing step rewrote history, the
subsequent "regular" history traversal (i.e. get_revision() in a
loop) only sees commits modifying the given line range.
Consequently, it can only show the object ID of the last ancestor
that modified the given line range (which might happen to be the
immediate parent, but many-many times it isn't).
This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:
- Make process_ranges_arbitrary_commit(), the static function in
'line-log.c' deciding whether a commit modifies the given line
range, public by removing the static keyword and adding the
'line_log_' prefix, so it can be called from other parts of the
revision walking machinery.
- If the user didn't explicitly ask for parent rewriting (which, I
believe, is the most common case):
- Call this now-public function during regular history traversal,
namely from get_commit_action() to ignore any commits not
modifying the given line range.
Note that while this check is relatively expensive, it must be
performed before other, much cheaper conditions, because the
tracked line range must be adjusted even when the commit will
end up being ignored by other conditions.
- Skip the line_log_filter() call, i.e. the expensive
preprocessing step, in prepare_revision_walk(), because, thanks
to the above points, the revision walking machinery is now able
to filter out commits not modifying the given line range while
traversing history.
This way the regular history traversal sees the unmodified
history, and is therefore able to print the object ids of the
immediate parents of the listed commits. The eliminated
preprocessing step can greatly reduce the delay before the first
commit is shown, see the numbers below.
- However, if the user did explicitly ask for parent rewriting via
'--parents' or a similar option, then stick with the current
implementation for now, i.e. perform that expensive filtering and
history rewriting in the preprocessing step just like we did
before, leaving the initial delay as long as it was.
I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :) Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.
This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.
The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:
# no parent rewriting requested, no commit-graph present
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m9.570s
user 0m9.494s
sys 0m0.076s
After:
real 0m0.718s
user 0m0.674s
sys 0m0.044s
A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list(). With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:
Before:
real 0m8.874s
user 0m8.816s
sys 0m0.057s
After:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
The next patch will further reduce the remaining delay.
To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.
Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 3.88 s
After: 0.71 s
If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.
As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
None of the tests in 't4211-line-log.sh' really check which parent
object IDs are shown in the output, either implicitly as part of
"Merge: ..." lines [1] or explicitly via the '%p' or '%P' format
specifiers in a custom pretty format.
Add two tests to 't4211-line-log.sh' to check which parent object IDs
are shown, one without and one with explicitly requested parent
rewriting, IOW without and with the '--parents' option.
The test without '--parents' is marked as failing, because without
that option parent rewriting should not be performed, and thus the
parent object ID should be that of the immediate parent, just like in
case of a pathspec-limited history traversal without parent rewriting.
The current line-level log implementation, however, performs parent
rewriting unconditionally and without a possibility to turn it off,
and, consequently, it shows the object ID of the most recent ancestor
that modified the given line range.
In both of these new tests we only really care about the object IDs of
the listed commits and their parents, but not the diffs of the line
ranges; the diffs have already been thoroughly checked in the previous
tests.
[1] While one of the tests ('-M -L ':f:b.c' parallel-change') does
list a merge commit, both of its parents happen to modify the
given line range and are listed as well, so the implications of
parent rewriting remained hidden and untested.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove the unused fields 'status', 'arg_alloc', 'arg_nr' and 'args'
from 'struct line_log_data'. They were already part of the struct
when it was introduced in commit 12da1d1f6 (Implement line-history
search (git log -L), 2013-03-28), but as far as I can tell none of
them have ever been actually used.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As diff_tree_oid() computes a diff, it will terminate early if the
total number of changed paths is strictly larger than max_changes.
This includes the directories that changed, not just the file paths.
However, only the file paths are reflected in the resulting diff
queue's "nr" value.
Use the "num_changes" from diffopt to check if the diff terminated
early. This is incredibly important, as it can result in incorrect
filters! For example, the first commit in the Linux kernel repo
reports only 471 changes, but since these are nested inside several
directories they expand to 513 "real" changes, and in fact the
total list of changes is not reported. Thus, the computed filter
for this commit is incorrect.
Demonstrate the subtle difference by using one fewer file change
in the 'get bloom filter for commit with 513 changes' test. Before,
this edited 513 files inside "bigDir" which hit this inequality.
However, dropping the file count by one demonstrates how the
previous inequality was incorrect but the new one is correct.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When computing a changed-path Bloom filter, we need to take the
files that changed from the diff computation and extract the parent
directories. That way, a directory pathspec such as "Documentation"
could match commits that change "Documentation/git.txt".
However, the current code does a poor job of this process. The paths
are added to a hashmap, but we do not check if an entry already
exists with that path. This can create many duplicate entries and
cause the filter to have a much larger length than it should. This
means that the filter is more sparse than intended, which helps the
false positive rate, but wastes a lot of space.
Properly use hashmap_get() before hashmap_add(). Also be sure to
include a comparison function so these can be matched correctly.
This has an effect on a test in t0095-bloom.sh. This makes sense,
there are ten changes inside "smallDir" so the total number of
paths in the filter should be 11. This would result in 11 * 10 bits
required, and with 8 bits per byte, this results in 14 bytes.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In Documentation/technical/commit-graph-format.txt, the definition
of the BIDX chunk specifies the length is a number of 8-byte words.
During development we discovered that using 8-byte words in the
Murmur3 hash algorithm causes issues with big-endian versus little-
endian machines. Thus, the hash algorithm was adapted to work on a
byte-by-byte basis. However, this caused a change in the definition
of a "word" in bloom.h. Now, a "word" is a single byte, which allows
filters to be as small as two bytes. These length-two filters are
demonstrated in t0095-bloom.sh, and a larger filter of length 25 is
demonstrated as well.
The original point of using 8-byte words was for alignment reasons.
It also presented opportunities for extremely sparse Bloom filters
when there were a small number of changes at a commit, creating a
very low false-positive rate. However, modifying the format at this
point is unlikely to be a valuable exercise. Also, this use of
single-byte granularity does present opportunities to save space.
It is unclear if 8-byte alignment of the filters would present any
meaningful performance benefits.
Modify the format document to reflect reality.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When computing changed-path Bloom filters for a commit, we need to
know if the commit has a parent or not. If the commit is not parsed,
then its parent pointer will be NULL.
As far as I can tell, the only opportunity to reach this code
without parsing the commit is inside "test-tool bloom
get_filter_for_commit" but it is best to be safe.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix alignment issues that were likely introduced due to an editor
using tab lengths of 4 instead of 8.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If "test-tool bloom" is not fed a command, or if arguments are missing
for some commands, it will just segfault. Let's check argc and write a
friendlier usage message.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We don't use the "parent" parameter at all (probably because the bloom
filter for a commit is always defined against a single parent anyway).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The changed-path Bloom filters help reduce the amount of tree
parsing required during history queries. Before calculating a
diff, we can ask the filter if a path changed between a commit
and its first parent. If the filter says "no" then we can move
on without parsing trees. If the filter says "maybe" then we
parse trees to discover if the answer is actually "yes" or "no".
When computing a blame, there is a section in find_origin() that
computes a diff between a commit and one of its parents. When this
is the first parent, we can check the Bloom filters before calling
diff_tree_oid().
In order to make this work with the blame machinery, we need to
initialize a struct bloom_key with the initial path. But also, we
need to add more keys to a list if a rename is detected. We then
check to see if _any_ of these keys answer "maybe" in the diff.
During development, I purposefully left out this "add a new key
when a rename is detected" to see if the test suite would catch
my error. That is how I discovered the issues with
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change.
With that change, we can feel some confidence in the coverage of
this change.
If a user requests copy detection using "git blame -C", then there
are more places where the set of "important" files can expand. I
do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this
mode. A later change could expand the bloom_key data with an
appropriate call (or calls) to add_bloom_key().
If we did not disable this mode, then the following tests would
fail:
t8003-blame-corner-cases.sh
t8011-blame-split-file.sh
Generally, this is a performance enhancement and should not
change the behavior of 'git blame' in any way. If a repo has a
commit-graph file with computed changed-path Bloom filters, then
they should notice improved performance for their 'git blame'
commands.
Here are some example timings that I found by blaming some paths
in the Linux kernel repository:
git blame arch/x86/kernel/topology.c >/dev/null
Before: 0.83s
After: 0.24s
git blame kernel/time/time.c >/dev/null
Before: 0.72s
After: 0.24s
git blame tools/perf/ui/stdio/hist.c >/dev/null
Before: 0.27s
After: 0.11s
I specifically looked for "deep" paths that were also edited many
times. As a counterpoint, the MAINTAINERS file was edited many
times but is located in the root tree. This means that the cost of
computing a diff relative to the pathspec is very small. Here are
the timings for that command:
git blame MAINTAINERS >/dev/null
Before: 20.1s
After: 18.0s
These timings are the best of five. The worst-case runs were on the
order of 2.5 minutes for both cases. Note that the MAINTAINERS file
has 18,740 lines across 17,000+ commits. This happens to be one of
the cases where this change provides the least improvement.
The lack of improvement for the MAINTAINERS file and the relatively
modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine
which lines were changed by each commit. That makes up a large
proportion of the computation time, and this change does not
attempt to improve on that section of the algorithm. The
MAINTAINERS file is large and changed often, so it takes time to
determine which lines were updated by which commit. In contrast,
the code files are much smaller, and it takes longer to comute
the line-by-line diff for a single patch on the Linux mailing
lists.
Outside of the "-C" integration, I believe there is little more to
gain from the changed-path Bloom filters for 'git blame' after this
patch.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.
To expand the number of tests that have commits in the commit-graph
file, add a helper method that computes the commit-graph and place
that helper inside "git commit" and "git merge".
In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
to ensure we are writing changed-path Bloom filters whenever
possible.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The changed-path Bloom filters work only when we can compute an
explicit Bloom filter key in advance. When a pathspec is given
that allows case-insensitive checks or wildcard matching, we
must disable the Bloom filter performance checks.
By checking the pathspec in prepare_to_use_bloom_filters(), we
avoid setting up the Bloom filter data and thus revert to the
usual logic.
Before this change, the following tests would fail*:
t6004-rev-list-path-optim.sh (Tests 6-7)
t6130-pathspec-noglob.sh (Tests 3-6)
t6131-pathspec-icase.sh (Tests 3-5)
*These tests would fail when using GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
environment variable was not set up correctly to write the changed-
path Bloom filters in the test suite. That will be fixed in the
next change.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The changed-path Bloom filters record an entry in the filter for
every path that was changed. This includes every add and delete,
regardless of whether a rename was detected. Detecting renames
causes significant performance issues, but also will trigger
downloading missing blobs in partial clone.
The simple fix is to disable rename detection when computing a
changed-path Bloom filter. This should already be disabled by
default, but it is good to explicitly enforce the intended
behavior.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.
The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
These tests exercises writing commit graph with Bloom filters
and exercises 'git log -- path' with all the applicable
options. They check that the output is the same with and
without Bloom filters, confirm Bloom filters were used by
checking if trace2 statistics were logged correctly.
Also confirms cases where Bloom filters are not used:
1. Multiple path specs,
2. --walk-reflogs (see patch titled 'revision.c: use Bloom filters...'
for details,
3. If the latest commit graph does not have Bloom filters
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add trace2 statistics around Bloom filter usage and behavior
for 'git log -- path' commands that are hoping to benefit from
the presence of computed changed paths Bloom filters.
These statistics are great for performance analysis work and
for formal testing, which we will see in the commit following
this one.
Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Revision walk will now use Bloom filters for commits to speed up
revision walks for a particular path (for computing history for
that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step,
currently only when dealing with a single pathspec. Extending
it to work with multiple pathspecs can be explored and built on
top of this series in the future.
While comparing trees in rev_compare_trees(), if the Bloom filter
says that the file is not different between the two trees, we don't
need to compute the expensive diff. This is where we get our
performance gains. The other response of the Bloom filter is '`:maybe',
in which case we fall back to the full diff calculation to determine
if the path was changed in the commit.
We do not try to use Bloom filters when the '--walk-reflogs' option
is specified. The '--walk-reflogs' option does not walk the commit
ancestry chain like the rest of the options. Incorporating the
performance gains when walking reflog entries would add more
complexity, and can be explored in a later series.
Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
and some internal large repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
faster.
Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add logic to
a) parse Bloom filter information from the commit graph file and,
b) re-use existing Bloom filters.
See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.
To read Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for
filter of commit i+1. It is essentially the index past the end
of the filter of commit i. It is called end_index in the code.
2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
for filter of commit i. It is called the start_index in the code.
For the first commit, where i = 0, Bloom filter data starts at the
beginning, just past the header in the BDAT chunk. Hence, start_index
will be 0.
3. The length of the filter will be end_index - start_index, because
BIDX[i] gives the cumulative 8-byte words including the ith
commit's filter.
We toggle whether Bloom filters should be recomputed based on the
compute_if_not_present flag.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
'--reachable' flag.
If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.
On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.
Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.
Probably the "--reachable" code path would want something similar.
Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc instead of sorting. And then we could keep
the original order.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
Bloom filters for the paths that changed between a commit and it's
first parent, for each commit in the commit-graph. This computation
is done on a commit-by-commit basis.
We will write these Bloom filters to the commit-graph file, to store
this data on disk, in the next change in this series.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.
Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.
Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time reduced from
6m15s to 3m48s.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add the core implementation for computing Bloom filters for
the paths changed between a commit and it's first parent.
We fill the Bloom filters as (const char *data, int len) pairs
as `struct bloom_filters" within a commit slab.
Filters for commits with no changes and more than 512 changes,
is represented with a filter of length zero. There is no gain
in distinguishing between a computed filter of length zero for
a commit with no changes, and an uncomputed filter for new commits
or for commits with more than 512 changes. The effect on
`git log -- path` is the same in both cases. We will fall back to
the normal diffing algorithm when we can't benefit from the
existence of Bloom filters.
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce the constructs for Bloom filters, Bloom filter keys
and Bloom filter settings.
For details on what Bloom filters are and how they work, refer
to Dr. Derrick Stolee's blog post [1]. It provides a concise
explanation of the adoption of Bloom filters as described in
[2] and [3].
Implementation specifics:
1. We currently use 7 and 10 for the number of hashes and the
size of each entry respectively. They served as great starting
values, the mathematical details behind this choice are
described in [1] and [4]. The implementation, while not
completely open to it at the moment, is flexible enough to allow
for tweaking these settings in the future.
Note: The performance gains we have observed with these values
are significant enough that we did not need to tweak these
settings. The performance numbers are included in the cover letter
of this series and in the commit message of the subsequent commit
where we use Bloom filters to speed up `git log -- path`.
2. As described in [1] and [3], we do not need 7 independent hashing
functions. We use the Murmur3 hashing scheme, seed it twice and
then combine those to procure an arbitrary number of hash values.
3. The filters will be sized according to the number of changes in
each commit, in multiples of 8 bit words.
[1] Derrick Stolee
"Supercharging the Git Commit Graph IV: Bloom Filters"
https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
[2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
"An Improved Construction for Counting Bloom Filters"
http://theory.stanford.edu/~rinap/papers/esa2006b.pdfhttps://doi.org/10.1007/11841036_61
[3] Peter C. Dillinger and Panagiotis Manolios
"Bloom Filters in Probabilistic Verification"
http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdfhttps://doi.org/10.1007/978-3-540-30494-4_26
[4] Thomas Mueller Graf, Daniel Lemire
"Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
https://arxiv.org/abs/1912.08258
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In preparation for computing changed paths Bloom filters,
implement the Murmur3 hash algorithm as described in [1].
It hashes the given data using the given seed and produces
a uniformly distributed hash value.
[1] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is a minor cleanup to make it easier to change
the number of chunks being written to the commit
graph.
Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'pack.useSparse' configuration variable now defaults to 'true',
enabling an optimization that has been experimental since Git 2.21.
* ds/default-pack-use-sparse-to-true:
pack-objects: flip the use of GIT_TEST_PACK_SPARSE
config: set pack.useSparse=true by default
"git pull" learned to warn when no pull.rebase configuration
exists, and neither --[no-]rebase nor --ff-only is given (which
would result a merge).
* ah/force-pull-rebase-configuration:
pull: warn if the user didn't say whether to rebase or to merge
"git stash" has kept an escape hatch to use the scripted version
for a few releases, which got stale. It has been removed.
* tg/retire-scripted-stash:
stash: remove the stash.useBuiltin setting
stash: get git_stash_config at the top level
When "git describe C" finds an annotated tag with tagname A to be
the best name to explain commit C, and the tag is stored in a
"wrong" place in the refs/tags hierarchy, e.g. refs/tags/B, the
command gave a warning message but used A (not B) to describe C.
If C is exactly at the tag, the describe output would be "A", but
"git rev-parse A^0" would not be equal as "git rev-parse C^0". The
behavior of the command has been changed to use the "long" form
i.e. A-0-gOBJECTNAME, which is correctly interpreted by rev-parse.
* jc/describe-misnamed-annotated-tag:
describe: force long format for a name based on a mislocated tag
The "--fork-point" mode of "git rebase" regressed when the command
was rewritten in C back in 2.20 era, which has been corrected.
* at/rebase-fork-point-regression-fix:
rebase: --fork-point regression fix
Provide more information (e.g. the object of the tree-ish in which
the blob being converted appears, in addition to its path, which
has already been given) to smudge/clean conversion filters.
* bc/filter-process:
t0021: test filter metadata for additional cases
builtin/reset: compute checkout metadata for reset
builtin/rebase: compute checkout metadata for rebases
builtin/clone: compute checkout metadata for clones
builtin/checkout: compute checkout metadata for checkouts
convert: provide additional metadata to filters
convert: permit passing additional metadata to filter processes
builtin/checkout: pass branch info down to checkout_worktree
The code to interface with GnuPG has been refactored.
* hi/gpg-prefer-check-signature:
gpg-interface: prefer check_signature() for GPG verification
t: increase test coverage of signature verification output
SHA-256 transition continues.
* bc/sha-256-part-1-of-4: (22 commits)
fast-import: add options for rewriting submodules
fast-import: add a generic function to iterate over marks
fast-import: make find_marks work on any mark set
fast-import: add helper function for inserting mark object entries
fast-import: permit reading multiple marks files
commit: use expected signature header for SHA-256
worktree: allow repository version 1
init-db: move writing repo version into a function
builtin/init-db: add environment variable for new repo hash
builtin/init-db: allow specifying hash algorithm on command line
setup: allow check_repository_format to read repository format
t/helper: make repository tests hash independent
t/helper: initialize repository if necessary
t/helper/test-dump-split-index: initialize git repository
t6300: make hash algorithm independent
t6300: abstract away SHA-1-specific constants
t: use hash-specific lookup tables to define test constants
repository: require a build flag to use SHA-256
hex: add functions to parse hex object IDs in any algorithm
hex: introduce parsing variants taking hash algorithms
...
Fix "git checkout --recurse-submodules" of a nested submodule
hierarchy.
* pb/recurse-submodules-fix:
t/lib-submodule-update: add test removing nested submodules
unpack-trees: check for missing submodule directory in merged_entry
unpack-trees: remove outdated description for verify_clean_submodule
t/lib-submodule-update: move a test to the right section
t/lib-submodule-update: remove outdated test description
t7112: remove mention of KNOWN_FAILURE_SUBMODULE_RECURSIVE_NESTED