rev-list: add --disk-usage option for calculating disk usage

It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).

You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:

  git rev-list --objects --no-object-names main..branch |
  git cat-file --batch-check='%(objectsize:disk)' |
  perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.
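
To see the base/delta relationship behind that caveat, one option is to ask
cat-file for the delta base alongside the on-disk size. This is only a sketch:
the helper name show_delta_bases is hypothetical and not part of the patch,
while %(objectname), %(objectsize:disk) and %(deltabase) are existing cat-file
format atoms. Objects that are not stored as deltas report an all-zero base.

```shell
# Hypothetical helper (not part of this patch): for every object selected by
# the given rev-list arguments, print "<oid> <on-disk size> <delta base oid>".
show_delta_bases () {
	git rev-list --objects --no-object-names "$@" |
	git cat-file --batch-check='%(objectname) %(objectsize:disk) %(deltabase)'
}
```

For example, `show_delta_bases main..branch` lists the objects the pipeline
above would sum, so it is possible to check whether a small size belongs to a
delta whose base lives outside the range.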

But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.

  2. If bitmaps are in use, for objects that are in the
     bitmapped packfile we can skip the oid_object_info()
     lookup entirely, and just ask the revindex for the
     on-disk size.

This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:

  [rev-list piped to cat-file, no bitmaps]
  $ time git rev-list --objects --no-object-names --all |
    git cat-file --buffer --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real	0m29.635s
  user	0m38.003s
  sys	0m1.093s

  [internal, no bitmaps]
  $ time git rev-list --disk-usage --objects --all
  1459938510
  real	0m31.262s
  user	0m30.885s
  sys	0m0.376s

Even though the wall-clock time is slightly worse due to parallelism,
notice the CPU savings between the two. We saved 21% of the CPU just by
avoiding the pipes.

But the real win is with bitmaps. If we use them without the new option:

  [rev-list piped to cat-file, bitmaps]
  $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real	0m6.244s
  user	0m8.452s
  sys	0m0.311s

then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:

  [internal, bitmaps]
  $ time git rev-list --disk-usage --objects --all --use-bitmap-index
  1459938510
  real	0m0.219s
  user	0m0.169s
  sys	0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:

  $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
  374798628
  real	0m0.429s
  user	0m0.356s
  sys	0m0.072s
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-09 11:53:50 +01:00
#!/bin/sh

test_description='basic tests of rev-list --disk-usage'
. ./test-lib.sh

# we want a mix of reachable and unreachable, as well as
# objects in the bitmapped pack and some outside of it
test_expect_success 'set up repository' '
	test_commit --no-tag one &&
	test_commit --no-tag two &&
	git repack -adb &&
	git reset --hard HEAD^ &&
	test_commit --no-tag three &&
	test_commit --no-tag four &&
	git reset --hard HEAD^
'

# We don't want to hardcode sizes, because they depend on the exact details of
# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
# machinery works and compare the --disk-usage output to that.
disk_usage_slow () {
	git rev-list --no-object-names "$@" |
	git cat-file --batch-check="%(objectsize:disk)" |
	perl -lne '$total += $_; END { print $total}'
}

# check behavior with given rev-list options; note that
# whitespace is not preserved in args
check_du () {
	args=$*

	test_expect_success "generate expected size ($args)" "
		disk_usage_slow $args >expect
	"

	test_expect_success "rev-list --disk-usage without bitmaps ($args)" "
		git rev-list --disk-usage $args >actual &&
		test_cmp expect actual
	"

	test_expect_success "rev-list --disk-usage with bitmaps ($args)" "
		git rev-list --disk-usage --use-bitmap-index $args >actual &&
		test_cmp expect actual
	"
}

check_du HEAD
check_du --objects HEAD
check_du --objects HEAD^..HEAD
2022-08-11 06:47:54 +02:00
# As mentioned above, don't hardcode the expected sizes; compare against the
# output from git cat-file instead.
test_expect_success 'rev-list --disk-usage=human' '
	git rev-list --objects HEAD --disk-usage=human >actual &&
	disk_usage_slow --objects HEAD >actual_size &&
	grep "$(cat actual_size) bytes" actual
'

test_expect_success 'rev-list --disk-usage=human with bitmaps' '
	git rev-list --objects HEAD --use-bitmap-index --disk-usage=human >actual &&
	disk_usage_slow --objects HEAD >actual_size &&
	grep "$(cat actual_size) bytes" actual
'

test_expect_success 'rev-list use --disk-usage improperly' '
	test_must_fail git rev-list --objects HEAD --disk-usage=typo 2>err &&
	cat >expect <<-\EOF &&
	fatal: invalid value for '\''--disk-usage=<format>'\'': '\''typo'\'', the only allowed format is '\''human'\''
	EOF
	test_cmp err expect
'

test_done