This patch fixes an extreme slowdown in pack-objects when you have more
than 1023 packs. See below for numbers.
Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry,
2018-04-14), we use a complicated system to save some per-object memory.
Each object_entry structs gets a 10-bit field to store the index of the
pack it's in. We map those indices into pointers using
packing_data->in_pack_by_idx, which we initialize at the start of the
program. If we have 2^10 or more packs, then we instead create an array
of pack pointers, one per object. This is packing_data->in_pack.
So far so good. But there's one other tricky case: if a new pack arrives
after we've initialized in_pack_by_idx, it won't have an index yet. We
solve that by calling oe_map_new_pack(), which just switches on the fly
to the less-optimal in_pack mechanism, allocating the array and
back-filling it for already-seen objects.
But that logic kicks in even when we've switched to it already (whether
because we really did see a new pack, or because we had too many packs
in the first place). The result doesn't produce a wrong outcome, but
it's very slow. What happens is this:
- imagine you have a repo with 500k objects and 2000 packs that you
want to repack.
- before looking at any objects, we call prepare_in_pack_by_idx(). It
starts allocating an index for each pack. On the 1024th pack, it
sees there are too many, so it bails, leaving in_pack_by_idx as
NULL.
- while actually adding objects to the packing list, we call
oe_set_in_pack(), which checks whether the pack already has an
index. If it's one of the packs after the first 1023, then it
doesn't have one, and we'll call oe_map_new_pack().
But there's no useful work for that function to do. We're already
using in_pack, so it just uselessly walks over the complete list of
objects, trying to backfill in_pack.
And we end up doing this for almost 1000 packs (each of which may be
triggered by more than one object). And each time it triggers, we
may iterate over up to 500k objects. So in the absolute worst case,
this is quadratic in the number of objects.
The solution is simple: we don't need to bother checking whether the
pack has an index if we've already converted to using in_pack, since by
definition we're not going to use it. So we can just push the "does the
pack have a valid index" check down into that half of the conditional,
where we know we're going to use it.
The current test in p5303 sadly doesn't notice this problem, since it
maxes out at 1000 packs. If we add a new test to it at 2000 packs, it
does show the improvement:
Test HEAD^ HEAD
----------------------------------------------------------------------
5303.12: repack (2000) 26.72(39.68+0.67) 15.70(28.70+0.66) -41.2%
However, these many-pack test cases are rather expensive to run, so
adding larger and larger numbers isn't appealing. Instead, we can show
it off more easily by using GIT_TEST_FULL_IN_PACK_ARRAY, which forces us
into the absolute worst case: no pack has an index, so we'll trigger
oe_map_new_pack() pointlessly for every single object, making it truly
quadratic.
Here are the numbers (on git.git) with the included change to p5303:
Test HEAD^ HEAD
----------------------------------------------------------------------
5303.3: rev-list (1) 2.05(1.98+0.06) 2.06(1.99+0.06) +0.5%
5303.4: repack (1) 33.45(33.46+0.19) 2.75(2.73+0.22) -91.8%
5303.6: rev-list (50) 2.07(2.01+0.06) 2.06(2.01+0.05) -0.5%
5303.7: repack (50) 34.21(35.18+0.16) 3.49(4.50+0.12) -89.8%
5303.9: rev-list (1000) 2.87(2.78+0.08) 2.88(2.80+0.07) +0.3%
5303.10: repack (1000) 41.26(51.30+0.47) 10.75(20.75+0.44) -73.9%
Again, those improvements aren't realistic for the 1-pack case (because
in the real world, the full-array solution doesn't kick in), but it's
more useful to be testing the more-complicated code path.
While we're looking at this issue, we'll tweak one more thing: in
oe_map_new_pack(), we call REALLOC_ARRAY(pack->in_pack). But we'd never
expect to get here unless we're back-filling it for the first time, in
which case it would be NULL. So let's switch that to ALLOC_ARRAY() for
clarity, and add a BUG() to document the expectation. Unfortunately this
code isn't well-covered in the test suite because it's inherently racy
(it only kicks in if somebody else adds a new pack while we're in the
middle of repacking).
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Inspired by 21416f0a07 ("restore: fix typo in docs", 2019-08-03), I ran
"git grep -E '(\b[a-zA-Z]+) \1\b' -- Documentation/" to find other cases
where words were duplicated, e.g. "the the", and in most cases removed
one of the repeated words.
There were many false positives by this grep command, including
deliberate repeated words like "really really" or valid uses of "that
that" which I left alone, of course.
I also did not correct any of the legitimate, accidentally repeated
words in old RelNotes.
Signed-off-by: Mark Rushakoff <mark.rushakoff@gmail.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
My IEE 'home for life' email service is being withdrawn on 30 Sept 2019.
Replace with my new email domain.
I also have a secondary (backup) 'home for life' through
<philipoakley@dunelm.org.uk>.
Signed-off-by: Philip Oakley <philipoakley@iee.email>
Signed-off-by: Philip Oakley <philipoakley@iee.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE4fA2sf7nIh/HeOzvsLXohpav5ssFAl1NqkwACgkQsLXohpav
5ssjXQ//VsqRnuVu947TP0x/3vJzAuLSsTW1qE4kJUNQbRCRz64ejSiKiVlfDtpb
yk4rWbdnVVVCZwCUCNp421SAKWVWuFvEDhqd6JMe69DF1MqOwdn7gVleRiVa59Sv
2aVMzCO0FcRwUxuSkHogExJp94z2kyzL6EdAVYyalU8InR54cBML+in+gqtWToXE
7gGzAyu6g2Dv7/Wx2laohm05xppvbgsnrGqZvMhoYR1rl5pf9LlERvS/CjNl4FBc
mFqhJYgYjjvWfVPmv7WSce1JxlGd/AdDK0eMl6rnorHwSfDbeNsmvDT5a62YioQ7
9eC2/2woRom5T56NuEwobMYhpEG7ttlZDHEDg0YULSW7gbaNJdEoYJ78T0p7yQL3
HIljlg2H+l/I2wxeiMDg50oLCIWptT8d0E9TkEX89UkLq8Lc0XQeA7oxIM8HpjQy
n/Zx7sfDE4DVd7mFZb9UmzvHpzwXKl1NEy9a2/Mb7gRIUwO1DEHL8ATjar+j3AbO
uh3vOShC4u1Ya1vUOY7wRbmxfxIGIRiqRHtEmx60j1GCSDQMl71fTyO/QnAi71KH
CNzBRWauiyuJqwXQfzhZzXKLBDjfufoPudVHlWm0UC5oY3MXuLv9jUH6JAoaRP1U
46gauPfLinOeB0XQ4Uo3xbHJ6j2e91BLt2TzyQMMz0n6upGM9QE=
=hE45
-----END PGP SIGNATURE-----
Merge tag 'v2.23.0-rc2' of git://git.kernel.org/pub/scm/git/git
Git 2.23-rc2
* tag 'v2.23.0-rc2' of git://git.kernel.org/pub/scm/git/git: (63 commits)
Git 2.23-rc2
t0000: reword comments for "local" test
t: decrease nesting in test_oid_to_path
sha1-file: release strbuf after use
test-dir-iterator: use path argument directly
dir-iterator: release strbuf after use
commit-graph: release strbufs after use
l10n: reformat some localized strings for v2.23.0
merge-recursive: avoid directory rename detection in recursive case
commit-graph: fix bug around octopus merges
restore: fix typo in docs
doc: typo: s/can not/cannot/ and s/is does/does/
Git 2.23-rc1
log: really flip the --mailmap default
RelNotes/2.23.0: fix a few typos and other minor issues
RelNotes/2.21.1: typofix
log: flip the --mailmap default unconditionally
config: work around bug with includeif:onbranch and early config
A few more last-minute fixes
repack: simplify handling of auto-bitmaps and .keep files
...
Compilation fix.
* cb/xdiff-no-system-includes-in-dot-c:
xdiff: remove duplicate headers from xpatience.c
xdiff: remove duplicate headers from xhistogram.c
xdiff: drop system includes in xutils.c
The internal diff machinery can be made to read out of bounds while
looking for --funcion-context line in a corner case, which has been
corrected.
* jk/xdiff-clamp-funcname-context-index:
xdiff: clamp function context indices in post-image
"merge-recursive" hit a BUG() when building a virtual merge base
detected a directory rename.
* en/disable-dir-rename-in-recursive-merge:
merge-recursive: avoid directory rename detection in recursive case
commit-graph did not handle commits with more than two parents
correctly, which has been corrected.
* ds/commit-graph-octopus-fix:
commit-graph: fix bug around octopus merges
Commit 01d3a526ad (t0000: check whether the shell supports the "local"
keyword, 2017-10-26) added a test to gather data on whether people run
the test suite with shells that don't support "local".
After almost two years, nobody has complained, and several other uses
have cropped up in test-lib-functions.sh. Let's declare it acceptable to
use.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
t1410.3 ("corrupt and checks") fails when run using dash versions
before 0.5.8, with a cryptic message:
mv: cannot stat '.git/objects//e84adb2704cbd49549e52169b4043871e13432': No such file or directory
The function generating that path:
test_oid_to_path () {
echo "${1%${1#??}}/${1#??}"
}
which is supposed to produce a result like
12/3456789....
But a dash bug[*] causes it to instead expand to
/3456789...
The stream of symbols that makes up this function is hard for humans
to follow, too. The complexity mostly comes from the repeated use of
the expression ${1#??} for the basename of the loose object. Use a
variable instead --- nowadays, the dialect of shell used by Git
permits local variables, so this is cheap.
An alternative way to work around [*] is to remove the double-quotes
around test_oid_to_path's return value. That makes the expression
easier for dash to read, but harder for humans. Let's prefer the
rephrasing that's helpful for humans, too.
Noticed by building on Ubuntu trusty, which uses dash 0.5.7.
[*] Fixed by v0.5.8~13 ("[EXPAND] Propagate EXP_QPAT in subevalvar, 2013-08-23).
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Avoid allocating and leaking a strbuf for holding a verbatim copy of the
path argument and pass the latter directly to dir_iterator_begin()
instead.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ever since commit 8c8e5bd6eb ("merge-recursive: switch directory
rename detection default", 2019-04-05), the default handling with
directory rename detection was to report a conflict and leave unstaged
entries in the index. However, when creating a virtual merge base in
the recursive case, we absolutely need a tree, and the only way a tree
can be written is if we have no unstaged entries -- otherwise we hit a
BUG().
There are a few fixes possible here which at least fix the BUG(), but
none of them seem optimal for other reasons; see the comments with the
new testcase 13e in t6043 for details (which testcase triggered a BUG()
prior to this patch). As such, just opt for a very conservative and
simple choice that is still relatively reasonable: have the recursive
case treat 'conflict' as 'false' for opt->detect_directory_renames.
Reported-by: Emily Shaffer <emilyshaffer@google.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 1771be90 "commit-graph: merge commit-graph chains" (2019-06-18),
the method sort_and_scan_merged_commits() was added to merge the
commit lists of two commit-graph files in the incremental format.
Unfortunately, there was an off-by-one error in that method around
incrementing num_extra_edges, which leads to an incorrect offset
for the base graph chunk.
When we store an octopus merge in the commit-graph file, we store
the first parent in the normal place, but use the second parent
position to point into the "extra edges" chunk where the remaining
parents exist. This means we should be adding "num_parents - 1"
edges to this list, not "num_parents - 2". That is the basic error.
The reason this was not caught in the test suite is more subtle.
In 5324-split-commit-graph.sh, we test creating an octopus merge
and adding it to the tip of a commit-graph chain, then verify the
result. This _should_ have caught the problem, except that when
we load the commit-graph files we were overly careful to not fail
when the commit-graph chain does not match. This care was on
purpose to avoid race conditions as one process reads the chain
and another process modifies it. In such a case, the reading
process outputs the following message to stderr:
warning: commit-graph chain does not match
These warnings are output in the test suite, but ignored. By
checking the stderr of `git commit-graph verify` to include
the expected progress output, it will now catch this error.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"Can not" suggests one has the option to not do something, whereas
"cannot" more strongly suggests something is disallowed or impossible.
Noticed "can not", mistakenly used instead of "cannot" in git help
glossary, then ran git grep 'can not' and found many other instances.
Only files in the Documentation folder were modified.
'Can not' also occurs in some source code comments and some test
assertion messages, and there is an error message and translation "can
not move directory into itself" which I may fix and submit separately
from the documentation change.
Also noticed and fixed "is does" in git help fetch, but there are no
other occurrences of that typo according to git grep.
Signed-off-by: Mark Rushakoff <mark.rushakoff@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>