git-commit-vandalism/builtin
Jeff King 07e7dbf0db gc: default aggressive depth to 50
This commit message is long and has lots of background and
numbers. The summary is: the current default of 250 doesn't
save much space, and costs CPU. It's not a good tradeoff.
Read on for details.

The "--aggressive" flag to git-gc does three things:

  1. use "-f" to throw out existing deltas and recompute from
     scratch

  2. use "--window=250" to look harder for deltas

  3. use "--depth=250" to make longer delta chains

Items (1) and (2) are good matches for an "aggressive"
repack. They ask the repack to do more computation work in
the hopes of getting a better pack. You pay the costs during
the repack, and other operations see only the benefit.

Item (3) is not so clear. Allowing longer chains means fewer
restrictions on the deltas, which means potentially finding
better ones and saving some space. But it also means that
operations which access the deltas have to follow longer
chains, which affects their performance. So it's a tradeoff,
and it's not clear that the tradeoff is even a good one.

The existing "250" numbers for "--aggressive" come
originally from this thread:

  http://public-inbox.org/git/alpine.LFD.0.9999.0712060803430.13796@woody.linux-foundation.org/

where Linus says:

  So when I said "--depth=250 --window=250", I chose those
  numbers more as an example of extremely aggressive
  packing, and I'm not at all sure that the end result is
  necessarily wonderfully usable. It's going to save disk
  space (and network bandwidth - the delta's will be re-used
  for the network protocol too!), but there are definitely
  downsides too, and using long delta chains may
  simply not be worth it in practice.

There are some numbers in that thread, but they're mostly
focused on the improved window size, and measure the
improvement from --depth=250 and --window=250 together.
E.g.:

  http://public-inbox.org/git/9e4733910712062006l651571f3w7f76ce64c6650dff@mail.gmail.com/

talks about the improved run-time of "git-blame", which
comes from the reduced pack size. But most of that reduction
is coming from --window=250, whereas most of the extra costs
come from --depth=250. There's a link in that thread showing
that increasing the depth beyond 50 doesn't seem to help
much with the size:

  https://vcscompare.blogspot.com/2008/06/git-repack-parameters.html

but again, no discussion of the timing impact.

In an earlier thread from Ted Ts'o which discussed setting
the non-aggressive default (from 10 to 50):

  http://public-inbox.org/git/20070509134958.GA21489%40thunk.org/

we have more numbers, with the conclusion that going past 50
does not help size much, and hurts the speed of normal
operations.

So from that, we might guess that 50 is actually a sweet
spot, even for aggressive, if we interpret aggressive to
"spend time now to make a better pack". It is not clear that
"--depth=250" is actually a better pack. It may be slightly
_smaller_, but it carries a run-time penalty.

Here are some more recent timings I did to verify that. They
show three things:

  - the size of the resulting pack (so disk saved to store,
    bandwidth saved on clones/fetches)

  - the cost of "rev-list --objects --all", which shows the
    effect of the delta chains on trees (commits typically
    don't delta, and the command doesn't touch the blobs at
    all)

  - the cost of "log -Sfoo", which will additionally access
    each blob

All cases were repacked with "git repack -adf --depth=$d
--window=250" (so basically, what would happen if we tweaked
the "gc --aggressive" default depth).

The timings are all wall-clock best-of-3. The machine itself
has plenty of RAM compared to the repositories (which is
probably typical of most workstations these days), so we're
really measuring CPU usage, as the whole thing will be in
disk cache after the first run.

The core.deltaBaseCacheLimit is at its default of 96MiB.
It's possible that tweaking it would have some impact on the
tests, as some of them (especially "log -S" on a large repo)
are likely to overflow that. But bumping that carries a
run-time memory cost, so for these tests, I focused on what
we could do just with the on-disk pack tradeoffs.

Each test is done for four depths: 250 (the current value),
50 (the current default that tested well previously), 100
(to show something on the larger side, which previous tests
showed was not a good tradeoff), and 10 (the very old
default, which previous tests showed was worse than 50).

Here are the numbers for linux.git:

   depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
  -------+-------+-------+----------+--------+-----------+-------
    250  | 967MB |  n/a  | 48.159s  |   n/a  | 378.088   |   n/a
    100  | 971MB | +0.4% | 41.471s  | -13.9% | 342.060   |  -9.5%
     50  | 979MB | +1.2% | 37.778s  | -21.6% | 311.040s  | -17.7%
     10  | 1.1GB | +6.6% | 32.518s  | -32.5% | 279.890s  | -25.9%

and for git.git:

   depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
  -------+-------+-------+----------+--------+-----------+-------
    250  |  48MB |  n/a  |  2.215s  |   n/a  |  20.922s  |   n/a
    100  |  49MB | +0.5% |  2.140s  |  -3.4% |  17.736s  | -15.2%
     50  |  49MB | +1.7% |  2.099s  |  -5.2% |  15.418s  | -26.3%
     10  |  53MB | +9.3% |  2.001s  |  -9.7% |  12.677s  | -39.4%

You can see that that the CPU savings for regular operations improves as we
decrease the depth. The savings are less for "rev-list" on a smaller repository
than they are for blob-accessing operations, or even rev-list on a larger
repository. This may mean that a larger delta cache would help (though setting
core.deltaBaseCacheLimit by itself doesn't).

But we can also see that the space savings are not that great as the depth goes
higher. Saving 5-10% between 10 and 50 is probably worth the CPU tradeoff.
Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250 is
probably not.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 11:53:19 -07:00
..
add.c Merge branch 'jk/add-e-kill-editor' into maint 2015-06-05 12:00:09 -07:00
annotate.c annotate: use argv_array 2014-07-16 11:10:11 -07:00
apply.c builtin/apply.c: fix a memleak 2015-03-23 11:12:58 -07:00
archive.c replace {pre,suf}fixcmp() with {starts,ends}_with() 2013-12-05 14:13:21 -08:00
bisect--helper.c Replace deprecated OPT_BOOLEAN by OPT_BOOL 2013-08-05 11:32:19 -07:00
blame.c Sync with 2.3.10 2015-09-28 15:28:31 -07:00
branch.c Merge branch 'dl/branch-error-message' into maint 2015-06-05 12:00:29 -07:00
bundle.c bundle: verify arguments more strictly 2015-05-08 10:52:11 -07:00
cat-file.c Merge branch 'ah/usage-strings' 2015-02-11 13:44:20 -08:00
check-attr.c standardize usage info string format 2015-01-14 09:32:04 -08:00
check-ignore.c standardize usage info string format 2015-01-14 09:32:04 -08:00
check-mailmap.c standardize usage info string format 2015-01-14 09:32:04 -08:00
check-ref-format.c standardize usage info string format 2015-01-14 09:32:04 -08:00
checkout-index.c standardize usage info string format 2015-01-14 09:32:04 -08:00
checkout.c standardize usage info string format 2015-01-14 09:32:04 -08:00
clean.c Merge branch 'rs/janitorial' into maint 2015-06-16 14:33:47 -07:00
clone.c clone: simplify string handling in guess_dir_name() 2015-07-09 14:21:29 -07:00
column.c standardize usage info string format 2015-01-14 09:32:04 -08:00
commit-tree.c commit-tree: simplify parsing of option -S using skip_prefix() 2014-12-29 09:32:45 -08:00
commit.c Merge branch 'pt/xdg-config-path' into maint 2015-06-05 12:00:04 -07:00
config.c Merge branch 'pt/xdg-config-path' into maint 2015-06-05 12:00:04 -07:00
count-objects.c count-objects: use for_each_loose_file_in_objdir 2014-10-16 10:10:41 -07:00
credential.c
describe.c standardize usage info string format 2015-01-14 09:32:04 -08:00
diff-files.c standardize usage info string format 2015-01-14 09:32:04 -08:00
diff-index.c standardize usage info string format 2015-01-14 09:32:04 -08:00
diff-tree.c standardize usage info string format 2015-01-14 09:32:04 -08:00
diff.c lockfile.h: extract new header file for the functions in lockfile.c 2014-10-01 13:56:14 -07:00
fast-export.c teach fast-export an --anonymize option 2014-08-27 10:42:16 -07:00
fetch-pack.c standardize usage info string format 2015-01-14 09:32:04 -08:00
fetch.c Merge branch 'mh/refs-have-new' 2015-03-05 12:45:39 -08:00
fmt-merge-msg.c Merge branch 'jc/plug-fmt-merge-msg-leak' into maint 2015-06-05 12:00:05 -07:00
for-each-ref.c Merge branch 'mh/reporting-broken-refs-from-for-each-ref' into maint 2015-08-03 10:41:31 -07:00
fsck.c fsck: report errors if reflog entries point at invalid objects 2015-06-08 12:40:36 -07:00
gc.c gc: default aggressive depth to 50 2016-08-11 11:53:19 -07:00
get-tar-commit-id.c use skip_prefix() to avoid more magic numbers 2014-10-07 11:09:16 -07:00
grep.c Merge branch 'ps/grep-help-all-callback-arg' 2015-04-20 15:28:34 -07:00
hash-object.c Merge branch 'jc/hash-object' into maint 2015-05-26 13:49:25 -07:00
help.c Merge branch 'sb/leaks' 2015-03-20 13:11:53 -07:00
index-pack.c Merge branch 'jk/index-pack-reduce-recheck' into maint 2015-07-27 12:21:38 -07:00
init-db.c Merge branch 'jk/init-core-worktree-at-root' into maint 2015-05-13 14:05:49 -07:00
interpret-trailers.c trailer: add interpret-trailers command 2014-10-13 13:55:27 -07:00
log.c Merge branch 'jc/do-not-feed-tags-to-clear-commit-marks' into maint 2015-07-15 11:41:16 -07:00
ls-files.c Merge branch 'jc/report-path-error-to-dir' into maint 2015-03-31 14:53:08 -07:00
ls-remote.c standardize usage info string format 2015-01-14 09:32:04 -08:00
ls-tree.c ls-tree: disable negative pathspec because it's not supported 2014-12-01 11:33:45 -08:00
mailinfo.c standardize usage info string format 2015-01-14 09:32:04 -08:00
mailsplit.c mailsplit: remove unnecessary unlink(2) call 2014-10-07 10:49:57 -07:00
merge-base.c standardize usage info string format 2015-01-14 09:32:04 -08:00
merge-file.c Sync with 2.3.10 2015-09-28 15:28:31 -07:00
merge-index.c standardize usage info string format 2015-01-14 09:32:04 -08:00
merge-ours.c
merge-recursive.c replace {pre,suf}fixcmp() with {starts,ends}_with() 2013-12-05 14:13:21 -08:00
merge-tree.c react to errors in xdi_diff 2015-09-28 14:57:10 -07:00
merge.c Revert "merge: pass verbosity flag down to merge-recursive" 2015-04-16 08:03:14 -07:00
mktag.c
mktree.c builtin/mktree.c: use ALLOC_GROW() in append_to_tree() 2014-03-03 14:54:45 -08:00
mv.c standardize usage info string format 2015-01-14 09:32:04 -08:00
name-rev.c standardize usage info string format 2015-01-14 09:32:04 -08:00
notes.c standardize usage info string format 2015-01-14 09:32:04 -08:00
pack-objects.c list-objects: pass full pathname to callbacks 2016-03-16 10:41:04 -07:00
pack-redundant.c standardize usage info string format 2015-01-14 09:32:04 -08:00
pack-refs.c standardize usage info string format 2015-01-14 09:32:04 -08:00
patch-id.c patch-id: make it stable against hunk reordering 2014-06-10 13:09:24 -07:00
prune-packed.c standardize usage info string format 2015-01-14 09:32:04 -08:00
prune.c prune: turn on ref_paranoia flag 2015-03-20 12:40:56 -07:00
push.c push: allow --follow-tags to be set by config push.followTags 2015-03-14 15:08:35 -07:00
read-tree.c lockfile.h: extract new header file for the functions in lockfile.c 2014-10-01 13:56:14 -07:00
receive-pack.c Merge branch 'jc/update-instead-into-void' 2015-04-14 11:49:10 -07:00
reflog.c reflog: improve and update documentation 2015-03-05 12:35:36 -08:00
remote-ext.c use skip_prefix() to avoid more magic numbers 2014-10-07 11:09:16 -07:00
remote-fd.c
remote.c Merge branch 'ah/usage-strings' 2015-02-11 13:44:20 -08:00
repack.c Merge branch 'jk/prune-with-corrupt-refs' 2015-03-25 12:54:26 -07:00
replace.c ref_transaction_update(): remove "have_old" parameter 2015-02-17 11:22:50 -08:00
rerere.c Sync with 2.3.10 2015-09-28 15:28:31 -07:00
reset.c lockfile.h: extract new header file for the functions in lockfile.c 2014-10-01 13:56:14 -07:00
rev-list.c list-objects: pass full pathname to callbacks 2016-03-16 10:41:04 -07:00
rev-parse.c standardize usage info string format 2015-01-14 09:32:04 -08:00
revert.c standardize usage info string format 2015-01-14 09:32:04 -08:00
rm.c use file_exists() to check if a file exists in the worktree 2015-05-20 13:49:10 -07:00
send-pack.c send-pack.c: add --atomic command line argument 2015-01-07 19:56:44 -08:00
shortlog.c standardize usage info string format 2015-01-14 09:32:04 -08:00
show-branch.c Sync with 2.3.9 2015-09-04 10:34:19 -07:00
show-ref.c standardize usage info string format 2015-01-14 09:32:04 -08:00
stripspace.c builtin/stripspace.c: fix broken indentation 2013-09-06 13:33:17 -07:00
symbolic-ref.c standardize usage info string format 2015-01-14 09:32:04 -08:00
tag.c Merge branch 'jk/tag-h-column-is-a-listing-option' into maint 2015-03-27 13:00:23 -07:00
unpack-file.c
unpack-objects.c index-pack: terminate object buffers with NUL 2014-12-09 11:56:37 -08:00
update-index.c update-index: fix a memleak 2015-03-22 12:26:31 -07:00
update-ref.c ref_transaction_verify(): new function to check a reference's value 2015-02-17 11:24:59 -08:00
update-server-info.c
upload-archive.c replace {pre,suf}fixcmp() with {starts,ends}_with() 2013-12-05 14:13:21 -08:00
var.c
verify-commit.c standardize usage info string format 2015-01-14 09:32:04 -08:00
verify-pack.c standardize usage info string format 2015-01-14 09:32:04 -08:00
verify-tag.c standardize usage info string format 2015-01-14 09:32:04 -08:00
write-tree.c i18n: write-tree: mark parseopt strings for translation 2012-08-22 10:58:29 -07:00