07e7dbf0db
This commit message is long and has lots of background and numbers. The summary is: the current default of 250 doesn't save much space, and costs CPU. It's not a good tradeoff. Read on for details. The "--aggressive" flag to git-gc does three things: 1. use "-f" to throw out existing deltas and recompute from scratch 2. use "--window=250" to look harder for deltas 3. use "--depth=250" to make longer delta chains Items (1) and (2) are good matches for an "aggressive" repack. They ask the repack to do more computation work in the hopes of getting a better pack. You pay the costs during the repack, and other operations see only the benefit. Item (3) is not so clear. Allowing longer chains means fewer restrictions on the deltas, which means potentially finding better ones and saving some space. But it also means that operations which access the deltas have to follow longer chains, which affects their performance. So it's a tradeoff, and it's not clear that the tradeoff is even a good one. The existing "250" numbers for "--aggressive" come originally from this thread: http://public-inbox.org/git/alpine.LFD.0.9999.0712060803430.13796@woody.linux-foundation.org/ where Linus says: So when I said "--depth=250 --window=250", I chose those numbers more as an example of extremely aggressive packing, and I'm not at all sure that the end result is necessarily wonderfully usable. It's going to save disk space (and network bandwidth - the delta's will be re-used for the network protocol too!), but there are definitely downsides too, and using long delta chains may simply not be worth it in practice. There are some numbers in that thread, but they're mostly focused on the improved window size, and measure the improvement from --depth=250 and --window=250 together. E.g.: http://public-inbox.org/git/9e4733910712062006l651571f3w7f76ce64c6650dff@mail.gmail.com/ talks about the improved run-time of "git-blame", which comes from the reduced pack size. But most of that reduction is coming from --window=250, whereas most of the extra costs come from --depth=250. There's a link in that thread showing that increasing the depth beyond 50 doesn't seem to help much with the size: https://vcscompare.blogspot.com/2008/06/git-repack-parameters.html but again, no discussion of the timing impact. In an earlier thread from Ted Ts'o which discussed setting the non-aggressive default (from 10 to 50): http://public-inbox.org/git/20070509134958.GA21489%40thunk.org/ we have more numbers, with the conclusion that going past 50 does not help size much, and hurts the speed of normal operations. So from that, we might guess that 50 is actually a sweet spot, even for aggressive, if we interpret aggressive to "spend time now to make a better pack". It is not clear that "--depth=250" is actually a better pack. It may be slightly _smaller_, but it carries a run-time penalty. Here are some more recent timings I did to verify that. They show three things: - the size of the resulting pack (so disk saved to store, bandwidth saved on clones/fetches) - the cost of "rev-list --objects --all", which shows the effect of the delta chains on trees (commits typically don't delta, and the command doesn't touch the blobs at all) - the cost of "log -Sfoo", which will additionally access each blob All cases were repacked with "git repack -adf --depth=$d --window=250" (so basically, what would happen if we tweaked the "gc --aggressive" default depth). The timings are all wall-clock best-of-3. The machine itself has plenty of RAM compared to the repositories (which is probably typical of most workstations these days), so we're really measuring CPU usage, as the whole thing will be in disk cache after the first run. The core.deltaBaseCacheLimit is at its default of 96MiB. It's possible that tweaking it would have some impact on the tests, as some of them (especially "log -S" on a large repo) are likely to overflow that. But bumping that carries a run-time memory cost, so for these tests, I focused on what we could do just with the on-disk pack tradeoffs. Each test is done for four depths: 250 (the current value), 50 (the current default that tested well previously), 100 (to show something on the larger side, which previous tests showed was not a good tradeoff), and 10 (the very old default, which previous tests showed was worse than 50). Here are the numbers for linux.git: depth | size | % | rev-list | % | log -Sfoo | % -------+-------+-------+----------+--------+-----------+------- 250 | 967MB | n/a | 48.159s | n/a | 378.088 | n/a 100 | 971MB | +0.4% | 41.471s | -13.9% | 342.060 | -9.5% 50 | 979MB | +1.2% | 37.778s | -21.6% | 311.040s | -17.7% 10 | 1.1GB | +6.6% | 32.518s | -32.5% | 279.890s | -25.9% and for git.git: depth | size | % | rev-list | % | log -Sfoo | % -------+-------+-------+----------+--------+-----------+------- 250 | 48MB | n/a | 2.215s | n/a | 20.922s | n/a 100 | 49MB | +0.5% | 2.140s | -3.4% | 17.736s | -15.2% 50 | 49MB | +1.7% | 2.099s | -5.2% | 15.418s | -26.3% 10 | 53MB | +9.3% | 2.001s | -9.7% | 12.677s | -39.4% You can see that that the CPU savings for regular operations improves as we decrease the depth. The savings are less for "rev-list" on a smaller repository than they are for blob-accessing operations, or even rev-list on a larger repository. This may mean that a larger delta cache would help (though setting core.deltaBaseCacheLimit by itself doesn't). But we can also see that the space savings are not that great as the depth goes higher. Saving 5-10% between 10 and 50 is probably worth the CPU tradeoff. Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250 is probably not. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
---|---|---|
.. | ||
add.c | ||
annotate.c | ||
apply.c | ||
archive.c | ||
bisect--helper.c | ||
blame.c | ||
branch.c | ||
bundle.c | ||
cat-file.c | ||
check-attr.c | ||
check-ignore.c | ||
check-mailmap.c | ||
check-ref-format.c | ||
checkout-index.c | ||
checkout.c | ||
clean.c | ||
clone.c | ||
column.c | ||
commit-tree.c | ||
commit.c | ||
config.c | ||
count-objects.c | ||
credential.c | ||
describe.c | ||
diff-files.c | ||
diff-index.c | ||
diff-tree.c | ||
diff.c | ||
fast-export.c | ||
fetch-pack.c | ||
fetch.c | ||
fmt-merge-msg.c | ||
for-each-ref.c | ||
fsck.c | ||
gc.c | ||
get-tar-commit-id.c | ||
grep.c | ||
hash-object.c | ||
help.c | ||
index-pack.c | ||
init-db.c | ||
interpret-trailers.c | ||
log.c | ||
ls-files.c | ||
ls-remote.c | ||
ls-tree.c | ||
mailinfo.c | ||
mailsplit.c | ||
merge-base.c | ||
merge-file.c | ||
merge-index.c | ||
merge-ours.c | ||
merge-recursive.c | ||
merge-tree.c | ||
merge.c | ||
mktag.c | ||
mktree.c | ||
mv.c | ||
name-rev.c | ||
notes.c | ||
pack-objects.c | ||
pack-redundant.c | ||
pack-refs.c | ||
patch-id.c | ||
prune-packed.c | ||
prune.c | ||
push.c | ||
read-tree.c | ||
receive-pack.c | ||
reflog.c | ||
remote-ext.c | ||
remote-fd.c | ||
remote.c | ||
repack.c | ||
replace.c | ||
rerere.c | ||
reset.c | ||
rev-list.c | ||
rev-parse.c | ||
revert.c | ||
rm.c | ||
send-pack.c | ||
shortlog.c | ||
show-branch.c | ||
show-ref.c | ||
stripspace.c | ||
symbolic-ref.c | ||
tag.c | ||
unpack-file.c | ||
unpack-objects.c | ||
update-index.c | ||
update-ref.c | ||
update-server-info.c | ||
upload-archive.c | ||
var.c | ||
verify-commit.c | ||
verify-pack.c | ||
verify-tag.c | ||
write-tree.c |