git-commit-vandalism/builtin
Junio C Hamano 1b4bb16b9e pack-objects: optimize "recency order"
This optimizes the "recency order" (see pack-heuristics.txt in
Documentation/technical/ directory) used to order objects within a
packfile in three ways:

 - Commits at the tip of tags are written together, in the hope that
   revision traversal done in incremental fetch (which starts by
   putting them in a revision queue marked as UNINTERESTING) will see a
   better locality of these objects;

 - In the original recency order, trees and blobs are intermixed. Write
   trees together before blobs, in the hope that this will improve
   locality when running pathspec-limited revision traversal, i.e.
   "git log paths...";

 - When writing blob objects out, write the whole family of blobs that use
   the same delta base object together, by starting from the root of the
   delta chain, and writing its immediate children in a breadth-first
   manner, in the hope that this will again improve locality when reading
   blobs that belong to the same path, which are likely to be deltified
   against each other.
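
To make the three rules above concrete, here is a minimal sketch of such an
ordering. It is not the code in pack-objects.c: the function name, the tuple
layout and the choice to put tagged commits first are assumptions made for
illustration, and only commit, tree and blob objects are handled.

```python
from collections import defaultdict, deque


def write_order(objects, tagged_tips):
    """Return object ids in a simplified version of the order described above.

    objects     -- list of (oid, type, delta_base) tuples in the original
                   recency order; delta_base is None for non-delta objects
                   (hypothetical layout, chosen for this sketch)
    tagged_tips -- set of commit ids that tags point at
    """
    order = []

    # Rule 1: commits at the tip of tags are written together.
    order += [oid for oid, typ, _ in objects
              if typ == "commit" and oid in tagged_tips]
    # The remaining commits keep their relative recency order.
    order += [oid for oid, typ, _ in objects
              if typ == "commit" and oid not in tagged_tips]

    # Rule 2: all trees are written together, before any blob.
    order += [oid for oid, typ, _ in objects if typ == "tree"]

    # Rule 3: each family of blobs sharing a delta root is written together,
    # walking the family from its root in a breadth-first manner.
    blobs = [(oid, base) for oid, typ, base in objects if typ == "blob"]
    blob_ids = {oid for oid, _ in blobs}
    children = defaultdict(list)
    for oid, base in blobs:
        if base in blob_ids:
            children[base].append(oid)

    seen = set()
    for oid, base in blobs:
        if base in blob_ids:
            continue  # not a root; it is reached from its family root
        queue = deque([oid])
        while queue:
            cur = queue.popleft()
            if cur in seen:
                continue
            seen.add(cur)
            order.append(cur)
            queue.extend(children[cur])
    return order
```

A blob whose delta base is not itself a blob in the list is treated as the
root of its own family in this sketch.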

I tried various workloads in the Linux kernel repositories (HEAD at
v3.0-rc6-71-g4dd1b49) packed with v1.7.6 and with this patch, measuring how
far we have to seek between adjacent accesses to objects in the pack, and
the results look promising.  The history has 2072052 objects, weighing
some 490MiB.
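
The tables below could be produced from a trace of pack offsets along the
following lines. This is only a sketch under assumptions: the tool actually
used is not shown here, so the function names, the use of start offsets for
the distance, and the nearest-rank percentile definition are all hypothetical.

```python
def seek_distances(offsets):
    """Absolute distance between each pair of adjacent accesses."""
    return [abs(b - a) for a, b in zip(offsets, offsets[1:])]


def percentile(sorted_values, pct):
    """Nearest-rank style percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no seek distances to summarize")
    idx = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[idx]


def summarize(offsets, pcts=(0.0, 50.0, 99.9), threshold=2 * 1024 * 1024):
    """Summarize a trace of pack offsets (assumed: one offset per access)."""
    distances = sorted(seek_distances(offsets))
    below = sum(1 for d in distances if d < threshold)
    return {
        "accesses": len(distances),
        "percentiles": {p: percentile(distances, p) for p in pcts},
        # Share of seeks shorter than the threshold, in percent.
        "below_threshold_pct": 100.0 * below / len(distances),
    }
```

Calling summarize() on the offsets recorded while running one of the
workloads gives one column of a table; running it once per packing gives
both columns.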

 * Simple commit-only log.

   $ git log >/dev/null

   There are 254656 commits in total.

                                  v1.7.6  with patch
   Total number of access :      258,031     258,032
          0.0% percentile :           12          12
         10.0% percentile :          259         259
         20.0% percentile :          294         294
         30.0% percentile :          326         326
         40.0% percentile :          363         363
         50.0% percentile :          415         415
         60.0% percentile :          513         513
         70.0% percentile :          857         858
         80.0% percentile :       10,434      10,441
         90.0% percentile :       91,985      91,996
         95.0% percentile :      260,852     260,885
         99.0% percentile :    1,150,680   1,152,811
         99.9% percentile :    3,148,435   3,148,435
       Less than 2MiB seek:       99.70%      99.69%

   95% of the pack accesses look at data that is no further than 260kB
   from the previous location we accessed. The patch does not change the
   order of commit objects very much, and the result is very similar.

 * Pathspec-limited log.

   $ git log drivers/net >/dev/null

   The path is touched by 26551 commits and merges (among 254656 total).

                                  v1.7.6  with patch
   Total number of access :      559,511     558,663
          0.0% percentile :            0           0
         10.0% percentile :          182         167
         20.0% percentile :          259         233
         30.0% percentile :          357         304
         40.0% percentile :          714         485
         50.0% percentile :        5,046       3,976
         60.0% percentile :      688,671     443,578
         70.0% percentile :  319,574,732 110,370,100
         80.0% percentile :  361,647,599 123,707,229
         90.0% percentile :  393,195,669 128,947,636
         95.0% percentile :  405,496,875 131,609,321
         99.0% percentile :  412,942,470 133,078,115
         99.5% percentile :  413,172,266 133,163,349
         99.9% percentile :  413,354,356 133,240,445
       Less than 2MiB seek:       61.71%      62.87%

   With the current pack heuristics, more than 30% of accesses have to
   seek further than 300MB; the updated pack heuristics ensure that less
   than 0.1% of accesses have to seek further than 135MB. This is largely
   due to the fact that the updated heuristics do not mix blobs and
   trees together.

 * Blame.

   $ git blame drivers/net/ne.c >/dev/null

   The path is touched by 34 commits and merges.

                                  v1.7.6  with patch
   Total number of access :      178,147     178,166
          0.0% percentile :            0           0
         10.0% percentile :          142         139
         20.0% percentile :          222         194
         30.0% percentile :          373         300
         40.0% percentile :        1,168         837
         50.0% percentile :       11,248       7,334
         60.0% percentile :  305,121,284 106,850,130
         70.0% percentile :  361,427,854 123,709,715
         80.0% percentile :  388,127,343 128,171,047
         90.0% percentile :  399,987,762 130,200,707
         95.0% percentile :  408,230,673 132,174,308
         99.0% percentile :  412,947,017 133,181,160
         99.5% percentile :  413,312,798 133,220,425
         99.9% percentile :  413,352,366 133,269,051
       Less than 2MiB seek:       56.47%      56.83%

   The result is very similar to the pathspec-limited log above, which
   only looks at the tree objects.

 * Packing recent history.

   $ (git for-each-ref --format='^%(refname)' refs/tags; echo HEAD) |
     git pack-objects --revs --stdout >/dev/null

   This should pack data worth 71 commits.

                                  v1.7.6  with patch
   Total number of access :       11,511      11,514
          0.0% percentile :            0           0
         10.0% percentile :           48          47
         20.0% percentile :          134          98
         30.0% percentile :          332         178
         40.0% percentile :        1,386         293
         50.0% percentile :        8,030         478
         60.0% percentile :       33,676       1,195
         70.0% percentile :      147,268      26,216
         80.0% percentile :    9,178,662     464,598
         90.0% percentile :   67,922,665     965,782
         95.0% percentile :   87,773,251   1,226,102
         99.0% percentile :   98,011,763   1,932,377
         99.5% percentile :  100,074,427  33,642,128
         99.9% percentile :  105,336,398 275,772,650
       Less than 2MiB seek:       77.09%      99.04%

   The long tail of the result looks worse with the patch, but the change
   helps the majority of accesses. 99.04% of the accesses need less than
   2MiB of seeking, compared to 77.09% with the current packing
   heuristics.

 * Index pack.

   $ git index-pack -v .git/objects/pack/pack*.pack

                                  v1.7.6  with patch
   Total number of access :    2,791,228   2,788,802
          0.0% percentile :            9           9
         10.0% percentile :          140          89
         20.0% percentile :          233         167
         30.0% percentile :          322         235
         40.0% percentile :          464         310
         50.0% percentile :          862         423
         60.0% percentile :        2,566         686
         70.0% percentile :       25,827       1,498
         80.0% percentile :    1,317,862       4,971
         90.0% percentile :   11,926,385     119,398
         95.0% percentile :   41,304,149     952,519
         99.0% percentile :  227,613,070   6,709,650
         99.5% percentile :  321,265,121  11,734,871
         99.9% percentile :  382,919,785  33,155,191
       Less than 2MiB seek:       81.73%      96.92%

   As the index-pack command already walks objects in the delta chain
   order, writing the blobs out in the delta chain order seems to
   drastically improve the locality of access.

Note that a half-a-gigabyte packfile comfortably fits in the buffer cache,
and you would be unlikely to see much performance difference on a modern
and reasonably beefy machine with enough memory and local disks.
Benchmarking with a cold cache (or over NFS) would be interesting.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-08 10:03:24 -07:00
add.c Merge branch 'ci/commit--interactive-atomic' 2011-05-16 16:47:10 -07:00
annotate.c
apply.c Merge branch 'jc/maint-add-p-overlapping-hunks' into maint 2011-05-16 16:36:46 -07:00
archive.c
bisect--helper.c
blame.c blame: add --line-porcelain output format 2011-05-09 15:27:50 -07:00
branch.c Merge branch 'maint-1.7.5' into maint 2011-06-29 16:41:55 -07:00
bundle.c
cat-file.c plug a few coverity-spotted leaks 2011-06-20 14:27:36 -07:00
check-attr.c
check-ref-format.c
checkout-index.c
checkout.c Merge branch 'jc/maint-1.7.3-checkout-describe' 2011-06-29 17:03:12 -07:00
clean.c
clone.c Merge branch 'ab/i18n-fixup' into maint 2011-05-31 12:00:27 -07:00
commit-tree.c
commit.c Merge branch 'bc/maint-status-z-to-use-porcelain' 2011-06-06 11:40:08 -07:00
config.c config: Give error message when not changing a multivar 2011-05-17 21:01:29 -07:00
count-objects.c
describe.c Merge branch 'lt/default-abbrev' into maint 2011-04-03 12:32:51 -07:00
diff-files.c
diff-index.c
diff-tree.c diffcore-rename: fall back to -C when -C -C busts the rename limit 2011-03-22 14:29:07 -07:00
diff.c plug a few coverity-spotted leaks 2011-06-20 14:27:36 -07:00
fast-export.c
fetch-pack.c Merge branch 'jk/haves-from-alternate-odb' 2011-05-29 23:51:22 -07:00
fetch.c fetch: do not leak a refspec 2011-06-08 17:21:08 -07:00
fmt-merge-msg.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
for-each-ref.c
fsck.c Remove unused variables 2011-03-22 11:43:27 -07:00
gc.c builtin/gc.c: add missing newline in message 2011-06-19 14:46:39 -07:00
grep.c grep: add --heading 2011-06-05 18:15:27 -07:00
hash-object.c index_fd(): turn write_object and format_check arguments into one flag 2011-05-09 11:58:19 -07:00
help.c
index-pack.c sparse: Fix an "symbol 'cmd_index_pack' not declared" warning 2011-04-11 10:35:25 -07:00
init-db.c Merge branch 'ab/i18n-fixup' into maint 2011-05-31 12:00:27 -07:00
log.c Merge branch 'jk/format-patch-am' 2011-05-31 12:19:11 -07:00
ls-files.c pathspec: rename per-item field has_wildcard to use_wildcard 2011-04-05 09:30:36 -07:00
ls-remote.c ls-remote: the --exit-code option reports "no matching refs" 2011-05-18 14:37:46 -07:00
ls-tree.c pathspec: rename per-item field has_wildcard to use_wildcard 2011-04-05 09:30:36 -07:00
mailinfo.c mailinfo: always clean up rfc822 header folding 2011-05-26 14:13:38 -07:00
mailsplit.c
merge-base.c Documentation: update to git-merge-base --octopus 2011-04-15 10:13:52 -07:00
merge-file.c
merge-index.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
merge-ours.c
merge-recursive.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
merge-tree.c sparse: Fix an "symbol 'merge_file' not decared" warning 2011-04-11 10:35:25 -07:00
merge.c Merge branch 'jk/format-patch-am' 2011-05-31 12:19:11 -07:00
mktag.c read_sha1_file(): get rid of read_sha1_file_repl() madness 2011-05-15 15:23:33 -07:00
mktree.c
mv.c
name-rev.c
notes.c notes remove: --stdin reads from the standard input 2011-05-19 10:54:16 -07:00
pack-objects.c pack-objects: optimize "recency order" 2011-07-08 10:03:24 -07:00
pack-redundant.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
pack-refs.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
patch-id.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
prune-packed.c
prune.c
push.c Merge branch 'ab/i18n-st' 2011-04-01 17:55:55 -07:00
read-tree.c Teach read-tree the -n|--dry-run option 2011-05-25 15:04:25 -07:00
receive-pack.c receive-pack: eliminate duplicate .have refs 2011-05-19 20:02:31 -07:00
reflog.c
remote-ext.c Remove unused variables 2011-03-22 11:43:27 -07:00
remote-fd.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
remote.c Merge branch 'jk/maint-remote-mirror-safer' 2011-05-31 12:08:52 -07:00
replace.c
rerere.c rerere: libify rerere_clear() and rerere_gc() 2011-05-08 12:55:34 -07:00
reset.c Merge branch 'ab/i18n-st' 2011-04-01 17:55:55 -07:00
rev-list.c Merge branch 'jk/format-patch-am' 2011-05-31 12:19:11 -07:00
rev-parse.c show: --ignore-missing 2011-05-19 10:55:54 -07:00
revert.c revert: allow reverting a root commit 2011-05-16 13:01:45 -07:00
rm.c
send-pack.c Merge branch 'jk/git-connection-deadlock-fix' into maint 2011-05-26 09:33:25 -07:00
shortlog.c Merge branch 'jk/format-patch-am' 2011-05-31 12:19:11 -07:00
show-branch.c Merge branch 'jk/format-patch-am' 2011-05-31 12:19:11 -07:00
show-ref.c
stripspace.c
symbolic-ref.c
tag.c tag: disallow '-' as tag name 2011-05-10 08:45:37 -07:00
tar-tree.c
unpack-file.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
unpack-objects.c
update-index.c plug a few coverity-spotted leaks 2011-06-20 14:27:36 -07:00
update-ref.c
update-server-info.c
upload-archive.c
var.c Fix sparse warnings 2011-03-22 10:16:54 -07:00
verify-pack.c
verify-tag.c
write-tree.c