builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
#!/bin/sh
|
|
|
|
|
|
|
|
test_description='git repack --geometric works correctly'
|
|
|
|
|
|
|
|
. ./test-lib.sh
|
|
|
|
|
|
|
|
GIT_TEST_MULTI_PACK_INDEX=0
|
|
|
|
|
|
|
|
objdir=.git/objects
|
t7703: demonstrate object corruption with pack.packSizeLimit
When doing a `--geometric=<d>` repack, `git repack` determines a
splitting point among packs ordered by their object count such that:
- each pack above the split has at least `<d>` times as many objects
as the next-largest pack by object count, and
- the first pack above the split has at least `<d>` times as many
object as the sum of all packs below the split line combined
`git repack` then creates a pack containing all of the objects contained
in packs below the split line by running `git pack-objects
--stdin-packs` underneath. Once packs are moved into place, then any
packs below the split line are removed, since their objects were just
combined into a new pack.
But `git repack` tries to be careful to avoid removing a pack that it
just wrote, by checking:
struct packed_git *p = geometry->pack[i];
if (string_list_has_string(&names, hash_to_hex(p->hash)))
continue;
in the `delete_redundant` and `geometric` conditional towards the end of
`cmd_repack`.
But it's possible to trick `git repack` into not recognizing a pack that
it just wrote when `names` is out-of-order (which violates
`string_list_has_string()`'s assumption that the list is sorted and thus
binary search-able).
When this happens in just the right circumstances, it is possible to
remove a pack that we just wrote, leading to object corruption.
Luckily, this is quite difficult to provoke in practice (for a couple of
reasons):
- we ordinarily write just one pack, so `names` usually contains just
one entry, and is thus sorted
- when we do write more than one pack (e.g., due to `--max-pack-size`)
we have to: (a) write a pack identical to one that already
exists, (b) have that pack be below the split line, and (c) have
the set of packs written by `pack-objects` occur in an order which
tricks `string_list_has_string()`.
Demonstrate the above scenario in a failing test, which causes `git
repack --geometric` to write a pack which occurs below the split line,
_and_ fail to recognize that it wrote that pack.
The following patch will fix this bug.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-20 21:01:48 +02:00
|
|
|
packdir=$objdir/pack
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
midx=$objdir/pack/multi-pack-index
|
|
|
|
|
|
|
|
test_expect_success '--geometric with no packs' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:18 +02:00
|
|
|
git repack --write-midx --geometric 2 >out &&
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
test_i18ngrep "Nothing new to pack" out
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
2021-03-05 16:21:37 +01:00
|
|
|
test_expect_success '--geometric with one pack' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
test_commit "base" &&
|
|
|
|
git repack -d &&
|
|
|
|
|
|
|
|
git repack --geometric 2 >out &&
|
|
|
|
|
|
|
|
test_i18ngrep "Nothing new to pack" out
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
test_expect_success '--geometric with an intact progression' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
# These packs already form a geometric progression.
|
|
|
|
test_commit_bulk --start=1 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 2 && # 6 objects
|
|
|
|
test_commit_bulk --start=4 4 && # 12 objects
|
|
|
|
|
|
|
|
find $objdir/pack -name "*.pack" | sort >expect &&
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
find $objdir/pack -name "*.pack" | sort >actual &&
|
|
|
|
|
|
|
|
test_cmp expect actual
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
2021-03-05 16:21:43 +01:00
|
|
|
test_expect_success '--geometric with loose objects' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
# These packs already form a geometric progression.
|
|
|
|
test_commit_bulk --start=1 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 2 && # 6 objects
|
|
|
|
# The loose objects are packed together, breaking the
|
|
|
|
# progression.
|
|
|
|
test_commit loose && # 3 objects
|
|
|
|
|
|
|
|
find $objdir/pack -name "*.pack" | sort >before &&
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
find $objdir/pack -name "*.pack" | sort >after &&
|
|
|
|
|
|
|
|
comm -13 before after >new &&
|
|
|
|
comm -23 before after >removed &&
|
|
|
|
|
|
|
|
test_line_count = 1 new &&
|
|
|
|
test_must_be_empty removed &&
|
|
|
|
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
find $objdir/pack -name "*.pack" | sort >after &&
|
|
|
|
|
|
|
|
# The progression (3, 3, 6) is combined into one new pack.
|
|
|
|
test_line_count = 1 after
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
test_expect_success '--geometric with small-pack rollup' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
test_commit_bulk --start=1 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 1 && # 3 objects
|
|
|
|
find $objdir/pack -name "*.pack" | sort >small &&
|
|
|
|
test_commit_bulk --start=3 4 && # 12 objects
|
|
|
|
test_commit_bulk --start=7 8 && # 24 objects
|
|
|
|
find $objdir/pack -name "*.pack" | sort >before &&
|
|
|
|
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
|
|
|
|
# Three packs in total; two of the existing large ones, and one
|
|
|
|
# new one.
|
|
|
|
find $objdir/pack -name "*.pack" | sort >after &&
|
|
|
|
test_line_count = 3 after &&
|
|
|
|
comm -3 small before | tr -d "\t" >large &&
|
|
|
|
grep -qFf large after
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success '--geometric with small- and large-pack rollup' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
# size(small1) + size(small2) > size(medium) / 2
|
|
|
|
test_commit_bulk --start=1 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 3 && # 7 objects
|
|
|
|
test_commit_bulk --start=6 9 && # 27 objects &&
|
|
|
|
|
|
|
|
find $objdir/pack -name "*.pack" | sort >before &&
|
|
|
|
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
|
|
|
|
find $objdir/pack -name "*.pack" | sort >after &&
|
|
|
|
comm -12 before after >untouched &&
|
|
|
|
|
|
|
|
# Two packs in total; the largest pack from before running "git
|
|
|
|
# repack", and one new one.
|
|
|
|
test_line_count = 1 untouched &&
|
|
|
|
test_line_count = 2 after
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success '--geometric ignores kept packs' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
test_commit kept && # 3 objects
|
|
|
|
test_commit pack && # 3 objects
|
|
|
|
|
|
|
|
KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
|
|
|
|
refs/tags/kept
|
|
|
|
EOF
|
|
|
|
) &&
|
|
|
|
PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
|
|
|
|
refs/tags/pack
|
|
|
|
^refs/tags/kept
|
|
|
|
EOF
|
|
|
|
) &&
|
|
|
|
|
|
|
|
# neither pack contains more than twice the number of objects in
|
|
|
|
# the other, so they should be combined. but, marking one as
|
|
|
|
# .kept on disk will "freeze" it, so the pack structure should
|
|
|
|
# remain unchanged.
|
|
|
|
touch $objdir/pack/pack-$KEPT.keep &&
|
|
|
|
|
|
|
|
find $objdir/pack -name "*.pack" | sort >before &&
|
|
|
|
git repack --geometric 2 -d &&
|
|
|
|
find $objdir/pack -name "*.pack" | sort >after &&
|
|
|
|
|
|
|
|
# both packs should still exist
|
|
|
|
test_path_is_file $objdir/pack/pack-$KEPT.pack &&
|
|
|
|
test_path_is_file $objdir/pack/pack-$PACK.pack &&
|
|
|
|
|
|
|
|
# and no new packs should be created
|
|
|
|
test_cmp before after &&
|
|
|
|
|
|
|
|
# Passing --pack-kept-objects causes packs with a .keep file to
|
|
|
|
# be repacked, too.
|
|
|
|
git repack --geometric 2 -d --pack-kept-objects &&
|
|
|
|
|
repack: don't remove .keep packs with `--pack-kept-objects`
`git repack` supports a `--pack-kept-objects` flag which more or less
translates to whether or not we pass `--honor-pack-keep` down to `git
pack-objects` when assembling a new pack.
This behavior has existed since ee34a2bead (repack: add
`repack.packKeptObjects` config var, 2014-03-03). In that commit, the
documentation was extended to say:
[...] Note that we still do not delete `.keep` packs after
`pack-objects` finishes.
Unfortunately, this is not the case when `--pack-kept-objects` is
combined with a `--geometric` repack. When doing a geometric repack, we
include `.keep` packs when enumerating available packs only when
`pack_kept_objects` is set.
So this all works fine when `--no-pack-kept-objects` (or similar) is
given. Kept packs are excluded from the geometric roll-up, so when we go
to delete redundant packs (with `-d`), no `.keep` packs appear "below
the split" in our geometric progression.
But when `--pack-kept-objects` is given, things can go awry. Namely,
when a kept pack is included in the list of packs tracked by the
`pack_geometry` struct *and* part of the pack roll-up, we will delete
the `.keep` pack when we shouldn't.
Note that this *doesn't* result in object corruption, since the `.keep`
pack's objects are still present in the new pack. But the `.keep` pack
itself is removed, which violates our promise from back in ee34a2bead.
But there's more. Because `repack` computes the geometric roll-up
independently from selecting which packs belong in a MIDX (with
`--write-midx`), this can lead to odd behavior. Consider when a `.keep`
pack appears below the geometric split (ie., its objects will be part of
the new pack we generate).
We'll write a MIDX containing the new pack along with the existing
`.keep` pack. But because the `.keep` pack appears below the geometric
split line, we'll (incorrectly) try to remove it. While this doesn't
corrupt the repository, it does cause us to remove the MIDX we just
wrote, since removing that pack would invalidate the new MIDX.
Funny enough, this behavior became far less noticeable after e4d0c11c04
(repack: respect kept objects with '--write-midx -b', 2021-12-20), which
made `pack_kept_objects` be enabled by default only when we were writing
a non-MIDX bitmap.
But e4d0c11c04 didn't resolve this bug, it just made it harder to notice
unless callers explicitly passed `--pack-kept-objects`.
The solution is to avoid trying to remove `.keep` packs during
`--geometric` repacks, even when they appear below the geometric split
line, which is the approach this patch implements.
Co-authored-by: Victoria Dye <vdye@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-18 04:26:06 +02:00
|
|
|
# After repacking, two packs remain: one new one (containing the
|
|
|
|
# objects in both the .keep and non-kept pack), and the .keep
|
|
|
|
# pack (since `--pack-kept-objects -d` does not actually delete
|
|
|
|
# the kept pack).
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
find $objdir/pack -name "*.pack" >after &&
|
repack: don't remove .keep packs with `--pack-kept-objects`
`git repack` supports a `--pack-kept-objects` flag which more or less
translates to whether or not we pass `--honor-pack-keep` down to `git
pack-objects` when assembling a new pack.
This behavior has existed since ee34a2bead (repack: add
`repack.packKeptObjects` config var, 2014-03-03). In that commit, the
documentation was extended to say:
[...] Note that we still do not delete `.keep` packs after
`pack-objects` finishes.
Unfortunately, this is not the case when `--pack-kept-objects` is
combined with a `--geometric` repack. When doing a geometric repack, we
include `.keep` packs when enumerating available packs only when
`pack_kept_objects` is set.
So this all works fine when `--no-pack-kept-objects` (or similar) is
given. Kept packs are excluded from the geometric roll-up, so when we go
to delete redundant packs (with `-d`), no `.keep` packs appear "below
the split" in our geometric progression.
But when `--pack-kept-objects` is given, things can go awry. Namely,
when a kept pack is included in the list of packs tracked by the
`pack_geometry` struct *and* part of the pack roll-up, we will delete
the `.keep` pack when we shouldn't.
Note that this *doesn't* result in object corruption, since the `.keep`
pack's objects are still present in the new pack. But the `.keep` pack
itself is removed, which violates our promise from back in ee34a2bead.
But there's more. Because `repack` computes the geometric roll-up
independently from selecting which packs belong in a MIDX (with
`--write-midx`), this can lead to odd behavior. Consider when a `.keep`
pack appears below the geometric split (ie., its objects will be part of
the new pack we generate).
We'll write a MIDX containing the new pack along with the existing
`.keep` pack. But because the `.keep` pack appears below the geometric
split line, we'll (incorrectly) try to remove it. While this doesn't
corrupt the repository, it does cause us to remove the MIDX we just
wrote, since removing that pack would invalidate the new MIDX.
Funny enough, this behavior became far less noticeable after e4d0c11c04
(repack: respect kept objects with '--write-midx -b', 2021-12-20), which
made `pack_kept_objects` be enabled by default only when we were writing
a non-MIDX bitmap.
But e4d0c11c04 didn't resolve this bug, it just made it harder to notice
unless callers explicitly passed `--pack-kept-objects`.
The solution is to avoid trying to remove `.keep` packs during
`--geometric` repacks, even when they appear below the geometric split
line, which is the approach this patch implements.
Co-authored-by: Victoria Dye <vdye@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-18 04:26:06 +02:00
|
|
|
test_line_count = 2 after
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
)
|
|
|
|
'
|
|
|
|
|
2022-05-20 21:01:45 +02:00
|
|
|
test_expect_success '--geometric ignores --keep-pack packs' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
# Create two equal-sized packs
|
|
|
|
test_commit kept && # 3 objects
|
|
|
|
git repack -d &&
|
|
|
|
test_commit pack && # 3 objects
|
|
|
|
git repack -d &&
|
|
|
|
|
|
|
|
find $objdir/pack -type f -name "*.pack" | sort >packs.before &&
|
|
|
|
git repack --geometric 2 -dm \
|
|
|
|
--keep-pack="$(basename "$(head -n 1 packs.before)")" >out &&
|
|
|
|
find $objdir/pack -type f -name "*.pack" | sort >packs.after &&
|
|
|
|
|
|
|
|
# Packs should not have changed (only one non-kept pack, no
|
|
|
|
# loose objects), but $midx should now exist.
|
|
|
|
grep "Nothing new to pack" out &&
|
|
|
|
test_path_is_file $midx &&
|
|
|
|
|
|
|
|
test_cmp packs.before packs.after &&
|
|
|
|
|
|
|
|
git fsck
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
2021-09-29 03:55:20 +02:00
|
|
|
test_expect_success '--geometric chooses largest MIDX preferred pack' '
|
|
|
|
git init geometric &&
|
|
|
|
test_when_finished "rm -fr geometric" &&
|
|
|
|
(
|
|
|
|
cd geometric &&
|
|
|
|
|
|
|
|
# These packs already form a geometric progression.
|
|
|
|
test_commit_bulk --start=1 1 && # 3 objects
|
|
|
|
test_commit_bulk --start=2 2 && # 6 objects
|
|
|
|
ls $objdir/pack/pack-*.idx >before &&
|
|
|
|
test_commit_bulk --start=4 4 && # 12 objects
|
|
|
|
ls $objdir/pack/pack-*.idx >after &&
|
|
|
|
|
|
|
|
git repack --geometric 2 -dbm &&
|
|
|
|
|
|
|
|
comm -3 before after | xargs -n 1 basename >expect &&
|
|
|
|
test-tool read-midx --preferred-pack $objdir >actual &&
|
|
|
|
|
|
|
|
test_cmp expect actual
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
2022-05-20 21:01:51 +02:00
|
|
|
test_expect_success '--geometric with pack.packSizeLimit' '
|
t7703: demonstrate object corruption with pack.packSizeLimit
When doing a `--geometric=<d>` repack, `git repack` determines a
splitting point among packs ordered by their object count such that:
- each pack above the split has at least `<d>` times as many objects
as the next-largest pack by object count, and
- the first pack above the split has at least `<d>` times as many
object as the sum of all packs below the split line combined
`git repack` then creates a pack containing all of the objects contained
in packs below the split line by running `git pack-objects
--stdin-packs` underneath. Once packs are moved into place, then any
packs below the split line are removed, since their objects were just
combined into a new pack.
But `git repack` tries to be careful to avoid removing a pack that it
just wrote, by checking:
struct packed_git *p = geometry->pack[i];
if (string_list_has_string(&names, hash_to_hex(p->hash)))
continue;
in the `delete_redundant` and `geometric` conditional towards the end of
`cmd_repack`.
But it's possible to trick `git repack` into not recognizing a pack that
it just wrote when `names` is out-of-order (which violates
`string_list_has_string()`'s assumption that the list is sorted and thus
binary search-able).
When this happens in just the right circumstances, it is possible to
remove a pack that we just wrote, leading to object corruption.
Luckily, this is quite difficult to provoke in practice (for a couple of
reasons):
- we ordinarily write just one pack, so `names` usually contains just
one entry, and is thus sorted
- when we do write more than one pack (e.g., due to `--max-pack-size`)
we have to: (a) write a pack identical to one that already
exists, (b) have that pack be below the split line, and (c) have
the set of packs written by `pack-objects` occur in an order which
tricks `string_list_has_string()`.
Demonstrate the above scenario in a failing test, which causes `git
repack --geometric` to write a pack which occurs below the split line,
_and_ fail to recognize that it wrote that pack.
The following patch will fix this bug.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-20 21:01:48 +02:00
|
|
|
git init pack-rewrite &&
|
|
|
|
test_when_finished "rm -fr pack-rewrite" &&
|
|
|
|
(
|
|
|
|
cd pack-rewrite &&
|
|
|
|
|
|
|
|
test-tool genrandom foo 1048576 >foo &&
|
|
|
|
test-tool genrandom bar 1048576 >bar &&
|
|
|
|
|
|
|
|
git add foo bar &&
|
|
|
|
test_tick &&
|
|
|
|
git commit -m base &&
|
|
|
|
|
|
|
|
git rev-parse HEAD:foo HEAD:bar >p1.objects &&
|
|
|
|
git rev-parse HEAD HEAD^{tree} >p2.objects &&
|
|
|
|
|
|
|
|
# These two packs each contain two objects, so the following
|
|
|
|
# `--geometric` repack will try to combine them.
|
|
|
|
p1="$(git pack-objects $packdir/pack <p1.objects)" &&
|
|
|
|
p2="$(git pack-objects $packdir/pack <p2.objects)" &&
|
|
|
|
|
|
|
|
# Remove any loose objects in packs, since we do not want extra
|
|
|
|
# copies around (which would mask over potential object
|
|
|
|
# corruption issues).
|
|
|
|
git prune-packed &&
|
|
|
|
|
|
|
|
# Both p1 and p2 will be rolled up, but pack-objects will write
|
|
|
|
# three packs:
|
|
|
|
#
|
|
|
|
# - one containing object "foo",
|
|
|
|
# - another containing object "bar",
|
|
|
|
# - a final pack containing the commit and tree objects
|
|
|
|
# (identical to p2 above)
|
|
|
|
git repack --geometric 2 -d --max-pack-size=1048576 &&
|
|
|
|
|
|
|
|
# Ensure `repack` can detect that the third pack it wrote
|
|
|
|
# (containing just the tree and commit objects) was identical to
|
|
|
|
# one that was below the geometric split, so that we can save it
|
|
|
|
# from deletion.
|
|
|
|
#
|
|
|
|
# If `repack` fails to do that, we will incorrectly delete p2,
|
|
|
|
# causing object corruption.
|
|
|
|
git fsck
|
|
|
|
)
|
|
|
|
'
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
test_done
|