2013-09-15 17:33:20 +02:00
|
|
|
#include "builtin.h"
|
|
|
|
#include "cache.h"
|
2017-06-14 20:07:36 +02:00
|
|
|
#include "config.h"
|
2013-09-15 17:33:20 +02:00
|
|
|
#include "dir.h"
|
|
|
|
#include "parse-options.h"
|
|
|
|
#include "run-command.h"
|
|
|
|
#include "sigchain.h"
|
|
|
|
#include "strbuf.h"
|
|
|
|
#include "string-list.h"
|
2020-07-28 22:23:39 +02:00
|
|
|
#include "strvec.h"
|
2018-07-12 21:39:40 +02:00
|
|
|
#include "midx.h"
|
2018-08-09 00:34:06 +02:00
|
|
|
#include "packfile.h"
|
2020-03-24 02:07:52 +01:00
|
|
|
#include "prune-packed.h"
|
2018-08-20 20:33:55 +02:00
|
|
|
#include "object-store.h"
|
2019-06-25 15:40:31 +02:00
|
|
|
#include "promisor-remote.h"
|
2020-04-30 21:48:50 +02:00
|
|
|
#include "shallow.h"
|
2021-01-12 09:21:59 +01:00
|
|
|
#include "pack.h"
|
2021-10-02 00:38:10 +02:00
|
|
|
#include "pack-bitmap.h"
|
|
|
|
#include "refs.h"
|
2013-09-15 17:33:20 +02:00
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
#define ALL_INTO_ONE 1
|
|
|
|
#define LOOSEN_UNREACHABLE 2
|
|
|
|
#define PACK_CRUFT 4
|
|
|
|
|
2022-05-21 01:18:08 +02:00
|
|
|
#define DELETE_PACK 1
|
2022-05-21 01:18:11 +02:00
|
|
|
#define CRUFT_PACK 2
|
2022-05-21 01:18:08 +02:00
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
static int pack_everything;
|
2013-09-15 17:33:20 +02:00
|
|
|
static int delta_base_offset = 1;
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`)
is turned on, but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 21:04:20 +01:00
|
|
|
static int pack_kept_objects = -1;
|
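The `-1` initializer works as an "unset" marker, so the default can be derived later from the bitmap setting, as the commit message above describes. A minimal sketch of that resolution step, assuming a hypothetical helper name (`resolve_pack_kept_objects` is not part of the file) and considering only the bitmap setting:

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 * configured:    -1 if neither config nor command line set the option,
 *                otherwise 0 or 1.
 * write_bitmaps: -1 (unset), 0, or 1.
 */
static int resolve_pack_kept_objects(int configured, int write_bitmaps)
{
	/* An explicit setting always wins. */
	if (configured >= 0)
		return configured;
	/* Otherwise, pack kept objects only when bitmaps are being written. */
	return write_bitmaps > 0;
}
```

With this shape, an explicit `repack.packKeptObjects=false` still takes effect even when bitmaps are enabled.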
2019-03-14 10:12:54 +01:00
|
|
|
static int write_bitmaps = -1;
|
2018-08-16 08:13:10 +02:00
|
|
|
static int use_delta_islands;
|
2022-03-14 08:42:51 +01:00
|
|
|
static int run_update_server_info = 1;
|
repack: avoid loosening promisor objects in partial clones
When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all objects unused during this invocation. Unfortunately,
this includes promisor objects.
Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place. For example, in a newly-cloned partial repo that
filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: in my copy
of linux.git, the object directory peaked at 26GB of additional disk usage.
In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened.
For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit the count via trace2, which allows inspecting the debug events after
the process has finished. This new event is used in the added regression
test.
Lastly, add a new perf test to evaluate the performance impact
of these changes (tested on git.git):
    Test          HEAD^                 HEAD
    ----------------------------------------------------------
    5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%
For a bigger repository, such as linux.git, the improvement is
even bigger:
    Test          HEAD^                      HEAD
    -------------------------------------------------------------------
    5600.3: gc    6833.00(918.07+3162.74)    268.79(227.02+39.18) -96.1%
These improvements are particularly big because every object in the
newly-cloned partial repository is a promisor object.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-21 21:32:12 +02:00
|
|
|
static char *packdir, *packtmp_name, *packtmp;
|
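The fix described in the commit message above amounts to adding one `--keep-pack=<name>` argument per promisor packfile to the second `pack-objects` invocation. A minimal sketch of formatting such an argument in plain C; the helper name `format_keep_pack_arg` and the fixed-size buffer are assumptions for illustration, whereas the real code builds its argument list with the strvec API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes "--keep-pack=<pack_name>" into out.
 * Returns 0 on success, -1 if the result did not fit in outsz bytes.
 */
static int format_keep_pack_arg(char *out, size_t outsz, const char *pack_name)
{
	int n = snprintf(out, outsz, "--keep-pack=%s", pack_name);

	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Checking the `snprintf()` return value keeps a truncated pack name from being passed on silently.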
2022-05-21 01:18:03 +02:00
|
|
|
static char *cruft_expiration;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
static const char *const git_repack_usage[] = {
|
2015-01-13 08:44:47 +01:00
|
|
|
N_("git repack [<options>]"),
|
2013-09-15 17:33:20 +02:00
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
2016-12-28 23:45:42 +01:00
|
|
|
static const char incremental_bitmap_conflict_error[] = N_(
|
|
|
|
"Incremental repacks are incompatible with bitmap indexes. Use\n"
|
2022-06-17 12:03:09 +02:00
|
|
|
"--no-write-bitmap-index or disable the pack.writeBitmaps configuration."
|
2016-12-28 23:45:42 +01:00
|
|
|
);
|
|
|
|
|
2022-05-21 01:18:06 +02:00
|
|
|
struct pack_objects_args {
|
|
|
|
const char *window;
|
|
|
|
const char *window_memory;
|
|
|
|
const char *depth;
|
|
|
|
const char *threads;
|
|
|
|
const char *max_pack_size;
|
|
|
|
int no_reuse_delta;
|
|
|
|
int no_reuse_object;
|
|
|
|
int quiet;
|
|
|
|
int local;
|
|
|
|
};
|
2016-12-28 23:45:42 +01:00
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
static int repack_config(const char *var, const char *value, void *cb)
|
|
|
|
{
|
2022-05-21 01:18:06 +02:00
|
|
|
struct pack_objects_args *cruft_po_args = cb;
|
2013-09-15 17:33:20 +02:00
|
|
|
if (!strcmp(var, "repack.usedeltabaseoffset")) {
|
|
|
|
delta_base_offset = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2014-03-03 21:04:20 +01:00
|
|
|
if (!strcmp(var, "repack.packkeptobjects")) {
|
|
|
|
pack_kept_objects = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2014-06-10 22:20:30 +02:00
|
|
|
if (!strcmp(var, "repack.writebitmaps") ||
|
|
|
|
!strcmp(var, "pack.writebitmaps")) {
|
2014-06-10 22:10:07 +02:00
|
|
|
write_bitmaps = git_config_bool(var, value);
|
2014-06-10 22:09:23 +02:00
|
|
|
return 0;
|
|
|
|
}
|
2018-08-16 08:13:10 +02:00
|
|
|
if (!strcmp(var, "repack.usedeltaislands")) {
|
|
|
|
use_delta_islands = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2022-03-14 08:42:51 +01:00
|
|
|
if (!strcmp(var, "repack.updateserverinfo")) {
|
|
|
|
run_update_server_info = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2022-05-21 01:18:06 +02:00
|
|
|
if (!strcmp(var, "repack.cruftwindow"))
|
|
|
|
return git_config_string(&cruft_po_args->window, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftwindowmemory"))
|
|
|
|
return git_config_string(&cruft_po_args->window_memory, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftdepth"))
|
|
|
|
return git_config_string(&cruft_po_args->depth, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftthreads"))
|
|
|
|
return git_config_string(&cruft_po_args->threads, var, value);
|
2013-09-15 17:33:20 +02:00
|
|
|
return git_default_config(var, value, cb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-09-29 03:55:12 +02:00
|
|
|
* Adds each pack's hex string to either fname_nonkept_list or
|
|
|
|
* fname_kept_list based on whether each pack has a corresponding
|
|
|
|
* .keep file or not. Packs without a .keep file are not to be kept
|
|
|
|
* if we are going to pack everything into one file.
|
2013-09-15 17:33:20 +02:00
|
|
|
*/
|
2021-09-29 03:55:12 +02:00
|
|
|
static void collect_pack_filenames(struct string_list *fname_nonkept_list,
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need this list when writing a MIDX while repacking without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:10 +02:00
|
|
|
struct string_list *fname_kept_list,
|
|
|
|
const struct string_list *extra_keep)
|
2013-09-15 17:33:20 +02:00
|
|
|
{
|
|
|
|
DIR *dir;
|
|
|
|
struct dirent *e;
|
|
|
|
char *fname;
|
|
|
|
|
|
|
|
if (!(dir = opendir(packdir)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
while ((e = readdir(dir)) != NULL) {
|
2014-06-30 18:58:51 +02:00
|
|
|
size_t len;
|
2018-04-15 17:36:13 +02:00
|
|
|
int i;
|
|
|
|
|
2021-09-29 03:55:10 +02:00
|
|
|
if (!strip_suffix(e->d_name, ".pack", &len))
|
|
|
|
continue;
|
|
|
|
|
2018-04-15 17:36:13 +02:00
|
|
|
for (i = 0; i < extra_keep->nr; i++)
|
|
|
|
if (!fspathcmp(e->d_name, extra_keep->items[i].string))
|
|
|
|
break;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
fname = xmemdupz(e->d_name, len);
|
|
|
|
|
2021-09-29 03:55:10 +02:00
|
|
|
if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
|
2022-05-21 01:18:11 +02:00
|
|
|
(file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
|
2021-09-29 03:55:10 +02:00
|
|
|
string_list_append_nodup(fname_kept_list, fname);
|
2022-05-21 01:18:11 +02:00
|
|
|
} else {
|
|
|
|
struct string_list_item *item;
|
|
|
|
item = string_list_append_nodup(fname_nonkept_list,
|
|
|
|
fname);
|
|
|
|
if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
|
|
|
|
item->util = (void*)(uintptr_t)CRUFT_PACK;
|
|
|
|
}
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
|
|
|
closedir(dir);
|
2022-05-20 21:01:45 +02:00
|
|
|
|
|
|
|
string_list_sort(fname_kept_list);
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
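The directory walk in collect_pack_filenames() depends on a suffix check that both filters for `.pack` files and yields the length of the base name, which is then duplicated with xmemdupz(). A minimal standalone sketch of such a check, assuming a hypothetical `sketch_strip_suffix` that stands in for the project's own strip_suffix() helper:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for a strip_suffix()-style helper.
 * Returns 1 and stores the length of str without the suffix in *len
 * when str ends with suffix; returns 0 and leaves *len untouched otherwise.
 */
static int sketch_strip_suffix(const char *str, const char *suffix, size_t *len)
{
	size_t slen = strlen(str);
	size_t xlen = strlen(suffix);

	if (slen < xlen || memcmp(str + slen - xlen, suffix, xlen))
		return 0;
	*len = slen - xlen;
	return 1;
}
```

For an entry named `pack-1234.pack`, the stored length covers only `pack-1234`, which is the name the kept and non-kept lists record.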
|
|
|
|
|
|
|
static void remove_redundant_pack(const char *dir_name, const char *base_name)
|
|
|
|
{
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
midx: traverse the local MIDX first
When a repository has an alternate object directory configured, callers
can traverse through each alternate's MIDX by walking the '->next'
pointer.
But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it
places the new ones at the front of this pointer chain, not at the end.
This can be confusing for callers such as 'git repack -ad', causing test
failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'.
This occurs when dropping a pack known to the local MIDX with alternates
configured that have their own MIDX. Since the alternate's MIDX is
returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns
true (which is correct, since it traverses through the '->next' pointer
to find the MIDX in the chain that does contain the requested object).
But, we call 'clear_midx_file()' on 'the_repository', which drops the
MIDX at the path of the first MIDX in the chain, which (in the case of
t7700.6) is the one in the alternate.
This patch addresses that by:
  - placing the local MIDX first in the chain when calling
    'prepare_multi_pack_index_one()', and
  - introducing a new 'get_local_multi_pack_index()', which explicitly
    returns the repository-local MIDX, if any.
Don't impose an additional order on the MIDX's '->next' pointer beyond
that the first item in the chain must be local if one exists so that we
avoid a quadratic insertion.
Likewise, use 'get_local_multi_pack_index()' in
'remove_redundant_pack()' to fix the formerly broken t7700.6 when run
with 'GIT_TEST_MULTI_PACK_INDEX=1'.
Finally, note that the MIDX ordering invariant is only preserved by the
insertion order in 'prepare_packed_git()', which traverses through the
ODB's '->next' pointer, meaning we visit the local object store first.
This fragility makes this an undesirable long-term solution if more
callers are added, but it is acceptable for now since this is the only
caller.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-28 22:22:13 +02:00
|
|
|
struct multi_pack_index *m = get_local_multi_pack_index(the_repository);
|
2020-08-25 18:04:36 +02:00
|
|
|
strbuf_addf(&buf, "%s.pack", base_name);
|
|
|
|
if (m && midx_contains_pack(m, buf.buf))
|
|
|
|
clear_midx_file(the_repository);
|
|
|
|
strbuf_insertf(&buf, 0, "%s/", dir_name);
|
2019-06-11 01:35:22 +02:00
|
|
|
unlink_pack_path(buf.buf, 1);
|
2013-09-15 17:33:20 +02:00
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
|
|
|
|
2018-08-09 00:34:05 +02:00
|
|
|
static void prepare_pack_objects(struct child_process *cmd,
|
|
|
|
const struct pack_objects_args *args)
|
|
|
|
{
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "pack-objects");
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->window)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_pushf(&cmd->args, "--window=%s", args->window);
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->window_memory)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_pushf(&cmd->args, "--window-memory=%s", args->window_memory);
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->depth)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_pushf(&cmd->args, "--depth=%s", args->depth);
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->threads)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_pushf(&cmd->args, "--threads=%s", args->threads);
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->max_pack_size)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->no_reuse_delta)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "--no-reuse-delta");
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->no_reuse_object)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "--no-reuse-object");
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->local)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "--local");
|
2018-08-09 00:34:05 +02:00
|
|
|
if (args->quiet)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "--quiet");
|
2018-08-09 00:34:05 +02:00
|
|
|
if (delta_base_offset)
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd->args, "--delta-base-offset");
|
|
|
|
strvec_push(&cmd->args, packtmp);
|
2018-08-09 00:34:05 +02:00
|
|
|
cmd->git_cmd = 1;
|
|
|
|
cmd->out = -1;
|
|
|
|
}
|
|
|
|
|
2018-08-09 00:34:06 +02:00
|
|
|
/*
|
|
|
|
* Write oid to the given struct child_process's stdin, starting it first if
|
|
|
|
* necessary.
|
|
|
|
*/
|
|
|
|
static int write_oid(const struct object_id *oid, struct packed_git *pack,
|
|
|
|
uint32_t pos, void *data)
|
|
|
|
{
|
|
|
|
struct child_process *cmd = data;
|
|
|
|
|
|
|
|
if (cmd->in == -1) {
|
|
|
|
if (start_command(cmd))
|
2018-11-10 06:16:10 +01:00
|
|
|
die(_("could not start pack-objects to repack promisor objects"));
|
2018-08-09 00:34:06 +02:00
|
|
|
}
|
|
|
|
|
2019-08-18 22:04:18 +02:00
|
|
|
xwrite(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz);
|
2018-08-09 00:34:06 +02:00
|
|
|
xwrite(cmd->in, "\n", 1);
|
|
|
|
return 0;
|
|
|
|
}
|
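write_oid() feeds the child process one hex object ID per line. The same pattern can be sketched with plain POSIX calls; `write_all` and `write_hex_line` are hypothetical names, and the retry loop is only an approximation of what a wrapper like xwrite() is for, not a copy of it:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Write a hex ID of hexsz bytes followed by a newline. */
static int write_hex_line(int fd, const char *hex, size_t hexsz)
{
	if (write_all(fd, hex, hexsz) < 0)
		return -1;
	return write_all(fd, "\n", 1);
}
```

Passing the length explicitly mirrors the original's use of `the_hash_algo->hexsz`, so the sketch does not depend on any particular hash size.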
|
|
|
|
2020-11-16 19:41:12 +01:00
|
|
|
static struct {
|
|
|
|
const char *name;
|
|
|
|
unsigned optional:1;
|
|
|
|
} exts[] = {
|
|
|
|
{".pack"},
|
packfile: prepare for the existence of '*.rev' files
Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.
The reverse index maps from pack-relative positions (i.e., an index into
the array of objects sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.
This has two major drawbacks:
First, the in-memory cost scales linearly with the number of objects in
a pack. Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.
To observe this, force Git to load the reverse index by, e.g.,
running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking
for a single object in a fresh clone of the kernel, Git needs to
allocate 120+ MB of memory in order to hold the reverse index in memory.
Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.
As an example, it takes ~60x longer to print the _size_ of an object as
it does to print that entire object's _contents_:
Benchmark #1: git.compile cat-file --batch <obj
Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms]
Range (min … max): 3.2 ms … 3.7 ms 726 runs
Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms]
Range (min … max): 193.7 ms … 224.4 ms 13 runs
Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.
The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack. The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.
This trades off between the on-disk size of the 'pack-*.rev' file for
runtime to chase down the offset for some object. Even though the lookup
is constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double wide offset table).
Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:
Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms]
Range (min … max): 21.4 ms … 23.5 ms 41 runs
Benchmark #2: git.compile cat-file --batch <obj >/dev/null
Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms]
Range (min … max): 15.6 ms … 18.2 ms 45 runs
(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:
- We could include the object offsets in the reverse index format.
Predictably, this does result in fewer page faults, but it triples
the size of the file, while simultaneously duplicating a ton of data
already available in the .idx file. (This was the original way I
implemented the format, and it did show
`--batch-check='%(objectsize:disk)'` winning out against `--batch`.)
On the other hand, this increase in size also results in a large
block-cache footprint, which could potentially hurt other workloads.
- We could store the mapping from pack to index position in a more
cache-friendly way, like constructing a binary search tree from the
table and writing the values in breadth-first order. This would
result in much better locality, but the price you pay is trading
O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
can no longer directly index the table).
So, neither of these approaches is taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex (i.e., asking about
many objects rather than a single one).
The trade-off is worth it, since we will avoid so much of the
cost of generating the revindex that the extra pointer chase will look
like noise in the following patch's benchmarks.
This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin using the writer in the patches that
follow.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26 00:37:14 +01:00
|
|
|
{".rev", 1},
|
2022-05-21 01:17:35 +02:00
|
|
|
{".mtimes", 1},
|
2020-11-16 19:41:12 +01:00
|
|
|
{".bitmap", 1},
|
|
|
|
{".promisor", 1},
|
2021-09-10 01:24:44 +02:00
|
|
|
{".idx"},
|
2020-11-16 19:41:12 +01:00
|
|
|
};
|
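The '*.rev' commit message quoted inside the table above describes the on-disk format as an array of uint32_t values in which the ith entry holds the index position of the ith object in pack-offset order. A minimal sketch of computing such a table from per-object offsets, under these assumptions: the input is indexed by .idx position, the names `build_rev_table` and `cmp_by_offset` are hypothetical, and `qsort()` is swapped in for the radix sort the message mentions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Offsets indexed by index position; set before sorting (not reentrant). */
static const uint64_t *sort_offsets;

static int cmp_by_offset(const void *a, const void *b)
{
	uint64_t oa = sort_offsets[*(const uint32_t *)a];
	uint64_t ob = sort_offsets[*(const uint32_t *)b];

	return (oa > ob) - (oa < ob);
}

/* Fill rev[i] with the index position of the i-th object in pack order. */
static void build_rev_table(const uint64_t *offsets, uint32_t nr, uint32_t *rev)
{
	uint32_t i;

	for (i = 0; i < nr; i++)
		rev[i] = i;
	sort_offsets = offsets;
	qsort(rev, nr, sizeof(*rev), cmp_by_offset);
}
```

Because the table stores only index positions, an object's offset still has to be looked up in the .idx data, which is the extra pointer chase the message weighs against file size.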
|
|
|
|
2022-10-22 02:21:45 +02:00
|
|
|
struct generated_pack_data {
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by simulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 02:21:54 +02:00
|
|
|
struct tempfile *tempfiles[ARRAY_SIZE(exts)];
|
2022-10-22 02:21:45 +02:00
|
|
|
};
|
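The core idea of the tempfile commit message above is to record the temporary file names ahead of time so that cleanup needs no directory walk and no allocation. A minimal sketch of that idea with fixed storage and `unlink()`, which is on the POSIX async-signal-safe list; the names and limits here are hypothetical, and this is not the tempfile API the file actually uses:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_TMP 16
#define SKETCH_MAX_PATH 256

/* Fixed storage, so the cleanup path never allocates memory. */
static char tmp_paths[SKETCH_MAX_TMP][SKETCH_MAX_PATH];
static volatile int tmp_nr;

/* Record a path in normal context; returns -1 if it does not fit. */
static int register_tmp_path(const char *path)
{
	if (tmp_nr >= SKETCH_MAX_TMP || strlen(path) >= SKETCH_MAX_PATH)
		return -1;
	strcpy(tmp_paths[tmp_nr], path);
	tmp_nr++; /* publish the entry only after the copy is complete */
	return 0;
}

/* Intended for a signal handler: reads fixed storage and calls unlink(). */
static void cleanup_tmp_paths(void)
{
	int i;

	for (i = 0; i < tmp_nr; i++)
		unlink(tmp_paths[i]);
}
```

A real implementation also has to handle renames, exit-time cleanup, and re-entrancy, which this sketch leaves out.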
|
|
|
|
|
|
|
static struct generated_pack_data *populate_pack_exts(const char *name)
|
2020-11-16 19:41:17 +01:00
|
|
|
{
|
|
|
|
struct stat statbuf;
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
2022-10-22 02:21:45 +02:00
|
|
|
struct generated_pack_data *data = xcalloc(1, sizeof(*data));
|
2020-11-16 19:41:17 +01:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(exts); i++) {
|
|
|
|
strbuf_reset(&path);
|
|
|
|
strbuf_addf(&path, "%s-%s%s", packtmp, name, exts[i].name);
|
|
|
|
|
|
|
|
if (stat(path.buf, &statbuf))
|
|
|
|
continue;
|
|
|
|
|
2022-10-22 02:21:54 +02:00
|
|
|
data->tempfiles[i] = register_tempfile(path.buf);
|
2020-11-16 19:41:17 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
strbuf_release(&path);
|
2022-10-22 02:21:45 +02:00
|
|
|
return data;
|
2020-11-16 19:41:17 +01:00
|
|
|
}
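The probing loop in populate_pack_exts() can be tried outside of Git with a small stand-alone sketch. The extension list, the `probe_pack_exts` name, and the fixed path buffer are assumptions made for illustration; the real function uses a strbuf and registers a tempfile for each file it finds.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Illustrative extension list; the real table in repack.c differs. */
static const char *const sketch_exts[] = { ".pack", ".idx", ".rev" };
#define SKETCH_EXT_NR (sizeof(sketch_exts) / sizeof(sketch_exts[0]))

/*
 * For each extension, check whether "<prefix>-<name><ext>" exists and
 * store 1 or 0 in found[i]. Returns the number of files found, or -1
 * if a path does not fit in the buffer.
 */
int probe_pack_exts(const char *prefix, const char *name, int *found)
{
	char path[4096];
	struct stat st;
	size_t i;
	int count = 0;

	for (i = 0; i < SKETCH_EXT_NR; i++) {
		int len = snprintf(path, sizeof(path), "%s-%s%s",
				   prefix, name, sketch_exts[i]);

		found[i] = 0;
		if (len < 0 || (size_t)len >= sizeof(path))
			return -1; /* path would not fit */
		if (stat(path, &st))
			continue; /* missing file: leave found[i] at 0 */
		found[i] = 1;
		count++;
	}
	return count;
}
```

As in the original loop, a missing file is not an error. It just leaves the corresponding slot empty.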
|
|
|
|
|
2018-08-09 00:34:06 +02:00
|
|
|
static void repack_promisor_objects(const struct pack_objects_args *args,
|
|
|
|
struct string_list *names)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
FILE *out;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
|
|
|
|
prepare_pack_objects(&cmd, args);
|
|
|
|
cmd.in = -1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* NEEDSWORK: Giving pack-objects only the OIDs without any ordering
|
|
|
|
* hints may result in suboptimal deltas in the resulting pack. See if
|
|
|
|
* the OIDs can be sent with fake paths such that pack-objects can use a
|
|
|
|
* {type -> existing pack order} ordering when computing deltas instead
|
|
|
|
* of a {type -> size} ordering, which may produce better deltas.
|
|
|
|
*/
|
|
|
|
for_each_packed_object(write_oid, &cmd,
|
|
|
|
FOR_EACH_OBJECT_PROMISOR_ONLY);
|
|
|
|
|
2021-10-28 22:25:48 +02:00
|
|
|
if (cmd.in == -1) {
|
2018-08-09 00:34:06 +02:00
|
|
|
/* No packed objects; cmd was never started */
|
2021-10-28 22:25:48 +02:00
|
|
|
child_process_clear(&cmd);
|
2018-08-09 00:34:06 +02:00
|
|
|
return;
|
2021-10-28 22:25:48 +02:00
|
|
|
}
|
2018-08-09 00:34:06 +02:00
|
|
|
|
|
|
|
close(cmd.in);
|
|
|
|
|
|
|
|
out = xfdopen(cmd.out, "r");
|
|
|
|
while (strbuf_getline_lf(&line, out) != EOF) {
|
2020-11-16 19:41:17 +01:00
|
|
|
struct string_list_item *item;
|
2018-08-09 00:34:06 +02:00
|
|
|
char *promisor_name;
|
2021-01-12 09:21:59 +01:00
|
|
|
|
2018-10-15 02:01:50 +02:00
|
|
|
if (line.len != the_hash_algo->hexsz)
|
2019-01-04 22:33:31 +01:00
|
|
|
die(_("repack: Expecting full hex object ID lines only from pack-objects."));
|
2020-11-16 19:41:17 +01:00
|
|
|
item = string_list_append(names, line.buf);
|
2018-08-09 00:34:06 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* pack-objects creates the .pack and .idx files, but not the
|
|
|
|
* .promisor file. Create the .promisor file, which is empty.
|
2019-10-15 02:12:31 +02:00
|
|
|
*
|
|
|
|
* NEEDSWORK: fetch-pack sometimes generates non-empty
|
|
|
|
* .promisor files containing the ref names and associated
|
|
|
|
* hashes at the point of generation of the corresponding
|
|
|
|
* packfile, but this would not preserve their contents. Maybe
|
|
|
|
* concatenate the contents of all .promisor files instead of
|
|
|
|
* just creating a new empty file.
|
2018-08-09 00:34:06 +02:00
|
|
|
*/
|
|
|
|
promisor_name = mkpathdup("%s-%s.promisor", packtmp,
|
|
|
|
line.buf);
|
2021-01-12 09:21:59 +01:00
|
|
|
write_promisor_file(promisor_name, NULL, 0);
|
2020-11-16 19:41:17 +01:00
|
|
|
|
2022-10-22 02:21:45 +02:00
|
|
|
item->util = populate_pack_exts(item->string);
|
2020-11-16 19:41:17 +01:00
|
|
|
|
2018-08-09 00:34:06 +02:00
|
|
|
free(promisor_name);
|
|
|
|
}
|
|
|
|
fclose(out);
|
|
|
|
if (finish_command(&cmd))
|
2018-11-10 06:16:10 +01:00
|
|
|
die(_("could not finish pack-objects to repack promisor objects"));
|
2018-08-09 00:34:06 +02:00
|
|
|
}
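The read loop above accepts a line from pack-objects only when its length equals the_hash_algo->hexsz. A stand-alone sketch of that validation follows. The `is_full_hex_oid` helper is hypothetical, and its per-character hex check is an extra assumption that goes beyond the length test in the source.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if 'line' is exactly 'hexsz' hexadecimal digits, 0 otherwise.
 * The source only compares the length; the digit check is stricter.
 */
int is_full_hex_oid(const char *line, size_t hexsz)
{
	size_t i;

	if (strlen(line) != hexsz)
		return 0;
	for (i = 0; i < hexsz; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	return 1;
}
```

For a SHA-1 repository hexsz would be 40, and for SHA-256 it would be 64.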
|
|
|
|
|
builtin/repack.c: add '--geometric' option

Often it is useful to both:

  - have relatively few packfiles in a repository, and

  - avoid having so few packfiles in a repository that we repack its
    entire contents regularly

This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to
'objects(Pi)'. With a geometric factor of 'r', it should be that:

  objects(Pi) >= r*objects(P(i-1))

for all i in [2, n], where the packs are sorted by

  objects(P1) <= objects(P2) <= ... <= objects(Pn).

Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:

  1. We assume that there is a cutoff of packs _before starting the
     repack_ where everything to the right of that cutoff already forms
     a geometric progression (or no cutoff exists and everything must be
     repacked).

  2. We assume that everything smaller than the cutoff count must be
     repacked. This forms our base assumption, but it can also cause
     even the "heavy" packs to get repacked. For example, if we have 6
     packs containing the following number of objects:

       1, 1, 1, 2, 4, 32

     then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
     rolling up the first two packs into a pack with 2 objects. That
     breaks our progression and leaves us:

       2, 1, 2, 4, 32
        ^

     (where the '^' indicates the position of our split). To restore a
     progression, we move the split forward (towards larger packs)
     joining each pack into our new pack until a geometric progression
     is restored. Here, that looks like:

       2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
        ^                   ^                ^                   ^

This has the advantage of not repacking the heavy side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
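The two-step approximation described in this message can be sketched on a plain array of object counts. This is an illustration written from the description, not the code in this file: the `geometric_split` name is hypothetical, the input is assumed to be sorted in ascending order, and 64-bit arithmetic stands in for explicit overflow checks.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given 'nr' object counts sorted in ascending order, return how many of
 * the smallest packs would be rolled up into one new pack so that the
 * remaining packs form a geometric progression with ratio 'factor'.
 */
size_t geometric_split(const uint32_t *count, size_t nr, uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0; /* wide enough for sums of 32-bit counts here */

	if (!nr)
		return 0;

	/* Step 1: walk down from the largest pack until the progression breaks. */
	for (i = nr - 1; i > 0; i--)
		if ((uint64_t)count[i] < (uint64_t)factor * count[i - 1])
			break;
	split = i;
	if (split)
		split++; /* the larger pack of the failing pair is rolled up too */

	/* Step 2: expected size of the new pack. */
	for (i = 0; i < split; i++)
		total += count[i];

	/* Absorb heavier packs until the progression holds again. */
	for (i = split; i < nr; i++) {
		if ((uint64_t)count[i] >= factor * total)
			break;
		total += count[i];
		split++;
	}
	return split;
}
```

With the counts 1, 1, 1, 2, 4, 32 and a factor of 2, the sketch rolls up the five smallest packs into one pack of 9 objects and leaves the 32-object pack alone, which matches the final state '9, 32' in the example.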
2021-02-23 03:25:27 +01:00
|
|
|
struct pack_geometry {
|
|
|
|
struct packed_git **pack;
|
|
|
|
uint32_t pack_nr, pack_alloc;
|
|
|
|
uint32_t split;
|
|
|
|
};
|
|
|
|
|
|
|
|
static uint32_t geometry_pack_weight(struct packed_git *p)
|
|
|
|
{
|
|
|
|
if (open_pack_index(p))
|
|
|
|
die(_("cannot open index for %s"), p->pack_name);
|
|
|
|
return p->num_objects;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int geometry_cmp(const void *va, const void *vb)
|
|
|
|
{
|
|
|
|
uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
|
|
|
|
bw = geometry_pack_weight(*(struct packed_git **)vb);
|
|
|
|
|
|
|
|
if (aw < bw)
|
|
|
|
return -1;
|
|
|
|
if (aw > bw)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
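geometry_cmp() compares the two weights explicitly instead of returning their difference. A minimal sketch of why that matters, using plain uint32_t values and the standard qsort(): the names `cmp_u32` and `sort_counts` are hypothetical. Subtracting two unsigned 32-bit values and converting the result to int can produce the wrong sign when the operands are far apart, so the comparator below derives its result from the comparisons alone.

```c
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison that never subtracts the operands. */
int cmp_u32(const void *va, const void *vb)
{
	uint32_t a = *(const uint32_t *)va;
	uint32_t b = *(const uint32_t *)vb;

	return (a > b) - (a < b); /* -1, 0, or 1 */
}

/* Sort an array of counts in ascending order. */
void sort_counts(uint32_t *v, size_t n)
{
	qsort(v, n, sizeof(*v), cmp_u32);
}
```

Sorting the values 4000000000, 1, 7 this way yields 1, 7, 4000000000. A comparator built on `(int)(a - b)` would misorder such inputs.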
|
|
|
|
|
2022-05-20 21:01:45 +02:00
|
|
|
static void init_pack_geometry(struct pack_geometry **geometry_p,
|
|
|
|
struct string_list *existing_kept_packs)
|
2021-02-23 03:25:27 +01:00
|
|
|
{
|
|
|
|
struct packed_git *p;
|
|
|
|
struct pack_geometry *geometry;
|
2022-05-20 21:01:45 +02:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
2021-02-23 03:25:27 +01:00
|
|
|
|
|
|
|
*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
|
|
|
|
geometry = *geometry_p;
|
|
|
|
|
|
|
|
for (p = get_all_packs(the_repository); p; p = p->next) {
|
2022-05-20 21:01:45 +02:00
|
|
|
if (!pack_kept_objects) {
|
|
|
|
/*
|
|
|
|
* Any pack that has its pack_keep bit set will appear
|
|
|
|
* in existing_kept_packs below, but this saves us from
|
|
|
|
* doing a more expensive check.
|
|
|
|
*/
|
|
|
|
if (p->pack_keep)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The pack may be kept via the --keep-pack option;
|
|
|
|
* check 'existing_kept_packs' to determine whether to
|
|
|
|
* ignore it.
|
|
|
|
*/
|
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
|
|
|
if (string_list_has_string(existing_kept_packs, buf.buf))
|
|
|
|
continue;
|
|
|
|
}
|
2022-05-21 01:18:03 +02:00
|
|
|
if (p->is_cruft)
|
|
|
|
continue;
|
2021-02-23 03:25:27 +01:00
|
|
|
|
|
|
|
ALLOC_GROW(geometry->pack,
|
|
|
|
geometry->pack_nr + 1,
|
|
|
|
geometry->pack_alloc);
|
|
|
|
|
|
|
|
geometry->pack[geometry->pack_nr] = p;
|
|
|
|
geometry->pack_nr++;
|
|
|
|
}
|
|
|
|
|
|
|
|
QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
|
2022-05-20 21:01:45 +02:00
|
|
|
strbuf_release(&buf);
|
2021-02-23 03:25:27 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
static void split_pack_geometry(struct pack_geometry *geometry, int factor)
|
|
|
|
{
|
|
|
|
uint32_t i;
|
|
|
|
uint32_t split;
|
|
|
|
off_t total_size = 0;
|
|
|
|
|
2021-03-05 16:21:37 +01:00
|
|
|
if (!geometry->pack_nr) {
|
2021-02-23 03:25:27 +01:00
|
|
|
geometry->split = geometry->pack_nr;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, count the number of packs (in descending order of size) which
|
|
|
|
* already form a geometric progression.
|
|
|
|
*/
|
|
|
|
for (i = geometry->pack_nr - 1; i > 0; i--) {
|
|
|
|
struct packed_git *ours = geometry->pack[i];
|
|
|
|
struct packed_git *prev = geometry->pack[i - 1];
|
2021-03-05 16:21:56 +01:00
|
|
|
|
|
|
|
if (unsigned_mult_overflows(factor, geometry_pack_weight(prev)))
|
|
|
|
die(_("pack %s too large to consider in geometric "
|
|
|
|
"progression"),
|
|
|
|
prev->pack_name);
|
|
|
|
|
2021-03-05 16:21:50 +01:00
|
|
|
if (geometry_pack_weight(ours) < factor * geometry_pack_weight(prev))
|
2021-02-23 03:25:27 +01:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2021-03-05 16:21:50 +01:00
|
|
|
split = i;
|
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
if (split) {
|
|
|
|
/*
|
|
|
|
* Move the split one to the right, since the top element in the
|
|
|
|
* last-compared pair can't be in the progression. Only do this
|
|
|
|
* when we split in the middle of the array (otherwise if we got
|
|
|
|
* to the end, then the split is in the right place).
|
|
|
|
*/
|
|
|
|
split++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Then, anything to the left of 'split' must be in a new pack. But,
|
|
|
|
* creating that new pack may cause packs in the heavy half to no longer
|
|
|
|
* form a geometric progression.
|
|
|
|
*
|
|
|
|
* Compute an expected size of the new pack, and then determine how many
|
|
|
|
* packs in the heavy half need to be joined into it (if any) to restore
|
|
|
|
* the geometric progression.
|
|
|
|
*/
|
2021-03-05 16:21:56 +01:00
|
|
|
for (i = 0; i < split; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
|
|
|
|
if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
|
|
|
|
die(_("pack %s too large to roll up"), p->pack_name);
|
|
|
|
total_size += geometry_pack_weight(p);
|
|
|
|
}
|
2021-02-23 03:25:27 +01:00
|
|
|
for (i = split; i < geometry->pack_nr; i++) {
|
|
|
|
struct packed_git *ours = geometry->pack[i];
|
2021-03-05 16:21:56 +01:00
|
|
|
|
|
|
|
if (unsigned_mult_overflows(factor, total_size))
|
|
|
|
die(_("pack %s too large to roll up"), ours->pack_name);
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to 'objects(Pi)'.
With a geometric factor of 'r', it should be that:
objects(Pi) >= r*objects(P(i-1))
for all i in (1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
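The roll-up described in this message can be sketched as a small standalone function. This is only an illustration under stated assumptions: `sketch_geometric_split()` and its plain array of weights are hypothetical stand-ins for `struct pack_geometry`, the weights are assumed to be sorted in ascending order, and the overflow checks that the real loop performs with `unsigned_mult_overflows()` and `unsigned_add_overflows()` are left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of git: given pack weights sorted in
 * ascending order, return how many of the lightest packs should be
 * rolled up into one new pack so that the new pack and the remaining
 * packs form a geometric progression with the given factor.
 */
static size_t sketch_geometric_split(const uint64_t *weight, size_t nr,
				     uint64_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	if (!nr)
		return 0;

	/* Walk down from the heaviest pack while the progression holds. */
	for (i = nr - 1; i > 0; i--)
		if (weight[i] < factor * weight[i - 1])
			break;

	/*
	 * If the progression broke at the pair (i - 1, i), pack i cannot
	 * stay on the heavy side either, so the split moves past it.
	 */
	split = i ? i + 1 : 0;

	/* Expected weight of the new pack. */
	for (i = 0; i < split; i++)
		total += weight[i];

	/* Absorb heavier packs until the new pack fits the progression. */
	while (split < nr && weight[split] < factor * total) {
		total += weight[split];
		split++;
	}

	return split;
}
```

For the weights 1, 1, 1, 2, 4, 32 and a factor of 2, this sketch returns 5: the five lightest packs are combined into one pack of 9 objects, leaving 9, 32. The intermediate steps differ slightly from the illustration in the message, but the end state is the same.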
2021-02-23 03:25:27 +01:00
|
|
|
if (geometry_pack_weight(ours) < factor * total_size) {
|
2021-03-05 16:21:56 +01:00
|
|
|
if (unsigned_add_overflows(total_size,
|
|
|
|
geometry_pack_weight(ours)))
|
|
|
|
die(_("pack %s too large to roll up"),
|
|
|
|
ours->pack_name);
|
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
split++;
|
|
|
|
total_size += geometry_pack_weight(ours);
|
|
|
|
} else
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
geometry->split = split;
|
|
|
|
}
|
|
|
|
|
2021-09-29 03:55:20 +02:00
|
|
|
static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
|
|
|
|
{
|
|
|
|
if (!geometry) {
|
|
|
|
/*
|
|
|
|
* No geometry means either an all-into-one repack (in which
|
|
|
|
* case there is only one pack left and it is the largest) or an
|
|
|
|
* incremental one.
|
|
|
|
*
|
|
|
|
* If repacking incrementally, then we could check the size of
|
|
|
|
* all packs to determine which should be preferred, but leave
|
|
|
|
* this for later.
|
|
|
|
*/
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
if (geometry->split == geometry->pack_nr)
|
|
|
|
return NULL;
|
|
|
|
return geometry->pack[geometry->pack_nr - 1];
|
|
|
|
}
|
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
static void clear_pack_geometry(struct pack_geometry *geometry)
|
|
|
|
{
|
|
|
|
if (!geometry)
|
|
|
|
return;
|
|
|
|
|
|
|
|
free(geometry->pack);
|
|
|
|
geometry->pack_nr = 0;
|
|
|
|
geometry->pack_alloc = 0;
|
|
|
|
geometry->split = 0;
|
|
|
|
}
|
|
|
|
|
2021-10-02 00:38:10 +02:00
|
|
|
struct midx_snapshot_ref_data {
|
|
|
|
struct tempfile *f;
|
|
|
|
struct oidset seen;
|
|
|
|
int preferred;
|
|
|
|
};
|
|
|
|
|
2022-08-25 19:09:48 +02:00
|
|
|
static int midx_snapshot_ref_one(const char *refname UNUSED,
|
2021-10-02 00:38:10 +02:00
|
|
|
const struct object_id *oid,
|
2022-08-25 19:09:48 +02:00
|
|
|
int flag UNUSED, void *_data)
|
2021-10-02 00:38:10 +02:00
|
|
|
{
|
|
|
|
struct midx_snapshot_ref_data *data = _data;
|
|
|
|
struct object_id peeled;
|
|
|
|
|
|
|
|
if (!peel_iterated_oid(oid, &peeled))
|
|
|
|
oid = &peeled;
|
|
|
|
|
|
|
|
if (oidset_insert(&data->seen, oid))
|
|
|
|
return 0; /* already seen */
|
|
|
|
|
|
|
|
if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
|
|
|
|
oid_to_hex(oid));
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void midx_snapshot_refs(struct tempfile *f)
|
|
|
|
{
|
|
|
|
struct midx_snapshot_ref_data data;
|
|
|
|
const struct string_list *preferred = bitmap_preferred_tips(the_repository);
|
|
|
|
|
|
|
|
data.f = f;
|
|
|
|
data.preferred = 0;
|
|
|
|
oidset_init(&data.seen, 0);
|
|
|
|
|
|
|
|
if (!fdopen_tempfile(f, "w"))
|
|
|
|
die(_("could not open tempfile %s for writing"),
|
|
|
|
get_tempfile_path(f));
|
|
|
|
|
|
|
|
if (preferred) {
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
|
|
|
data.preferred = 1;
|
|
|
|
for_each_string_list_item(item, preferred)
|
|
|
|
for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
|
|
|
|
data.preferred = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
for_each_ref(midx_snapshot_ref_one, &data);
|
|
|
|
|
|
|
|
if (close_tempfile_gently(f)) {
|
|
|
|
int save_errno = errno;
|
|
|
|
delete_tempfile(&f);
|
|
|
|
errno = save_errno;
|
|
|
|
die_errno(_("could not close refs snapshot tempfile"));
|
|
|
|
}
|
|
|
|
|
|
|
|
oidset_clear(&data.seen);
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
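The two-loop arrangement described in this message (mark first, act later) can be sketched as follows. This is a minimal illustration, not the code from the patch: `struct sketch_pack`, its `redundant` field, and both functions are hypothetical, and only the bit value is borrowed from the `DELETE_PACK` define near the top of the file.

```c
#include <stddef.h>

/* Same bit value as the DELETE_PACK flag defined near the top of the file. */
#define SKETCH_DELETE_PACK 1u

/* Hypothetical stand-in for a string_list item and its util bits. */
struct sketch_pack {
	const char *name;
	int redundant;		/* result of whatever redundancy check applies */
	unsigned int flags;
};

/* Pass 1: decide which packs are worth deleting, but delete nothing yet. */
static void sketch_mark_redundant(struct sketch_pack *packs, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (packs[i].redundant)
			packs[i].flags |= SKETCH_DELETE_PACK;
}

/*
 * Shared check used by both consumers: copy into 'out' the names of the
 * packs whose delete bit matches 'want_deleted', and return how many
 * names were copied. 'out' must have room for 'nr' entries.
 */
static size_t sketch_select(const struct sketch_pack *packs, size_t nr,
			    int want_deleted, const char **out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		int deleted = !!(packs[i].flags & SKETCH_DELETE_PACK);

		if (deleted == !!want_deleted)
			out[n++] = packs[i].name;
	}
	return n;
}
```

Calling `sketch_select()` with `want_deleted` set to 0 yields the packs to hand to the MIDX writer, and calling it with 1 yields the packs to unlink afterwards. Because both calls read the same bit, the two lists cannot disagree.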
2021-09-29 03:55:18 +02:00
|
|
|
static void midx_included_packs(struct string_list *include,
|
|
|
|
struct string_list *existing_nonkept_packs,
|
|
|
|
struct string_list *existing_kept_packs,
|
|
|
|
struct string_list *names,
|
|
|
|
struct pack_geometry *geometry)
|
|
|
|
{
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
|
|
|
for_each_string_list_item(item, existing_kept_packs)
|
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
|
|
|
for_each_string_list_item(item, names)
|
|
|
|
string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
|
|
|
|
if (geometry) {
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
uint32_t i;
|
|
|
|
for (i = geometry->split; i < geometry->pack_nr; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
strbuf_addstr(&buf, ".idx");
|
|
|
|
|
|
|
|
string_list_insert(include, strbuf_detach(&buf, NULL));
|
|
|
|
}
|
2022-05-21 01:18:11 +02:00
|
|
|
|
|
|
|
for_each_string_list_item(item, existing_nonkept_packs) {
|
|
|
|
if (!((uintptr_t)item->util & CRUFT_PACK)) {
|
|
|
|
/*
|
|
|
|
* no need to check DELETE_PACK, since we're not
|
|
|
|
* doing an ALL_INTO_ONE repack
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
|
|
|
}
|
2021-09-29 03:55:18 +02:00
|
|
|
} else {
|
|
|
|
for_each_string_list_item(item, existing_nonkept_packs) {
|
2022-05-21 01:18:08 +02:00
|
|
|
if ((uintptr_t)item->util & DELETE_PACK)
|
2021-09-29 03:55:18 +02:00
|
|
|
continue;
|
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int write_midx_included_packs(struct string_list *include,
|
2021-09-29 03:55:20 +02:00
|
|
|
struct pack_geometry *geometry,
|
2021-10-02 00:38:10 +02:00
|
|
|
const char *refs_snapshot,
|
2021-09-29 03:55:18 +02:00
|
|
|
int show_progress, int write_bitmaps)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct string_list_item *item;
|
2021-09-29 03:55:20 +02:00
|
|
|
struct packed_git *largest = get_largest_active_pack(geometry);
|
2021-09-29 03:55:18 +02:00
|
|
|
FILE *in;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!include->nr)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
cmd.in = -1;
|
|
|
|
cmd.git_cmd = 1;
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "multi-pack-index");
|
|
|
|
strvec_pushl(&cmd.args, "write", "--stdin-packs", NULL);
|
|
|
|
|
|
|
|
if (show_progress)
|
|
|
|
strvec_push(&cmd.args, "--progress");
|
|
|
|
else
|
|
|
|
strvec_push(&cmd.args, "--no-progress");
|
|
|
|
|
|
|
|
if (write_bitmaps)
|
|
|
|
strvec_push(&cmd.args, "--bitmap");
|
|
|
|
|
2021-09-29 03:55:20 +02:00
|
|
|
if (largest)
|
|
|
|
strvec_pushf(&cmd.args, "--preferred-pack=%s",
|
|
|
|
pack_basename(largest));
|
|
|
|
|
2021-10-02 00:38:10 +02:00
|
|
|
if (refs_snapshot)
|
|
|
|
strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
|
|
|
|
|
2021-09-29 03:55:18 +02:00
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
in = xfdopen(cmd.in, "w");
|
|
|
|
for_each_string_list_item(item, include)
|
|
|
|
fprintf(in, "%s\n", item->string);
|
|
|
|
fclose(in);
|
|
|
|
|
|
|
|
return finish_command(&cmd);
|
|
|
|
}
|
|
|
|
|
2022-10-18 04:45:12 +02:00
|
|
|
static void remove_redundant_bitmaps(struct string_list *include,
|
|
|
|
const char *packdir)
|
|
|
|
{
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
|
|
|
struct string_list_item *item;
|
|
|
|
size_t packdir_len;
|
|
|
|
|
|
|
|
strbuf_addstr(&path, packdir);
|
|
|
|
strbuf_addch(&path, '/');
|
|
|
|
packdir_len = path.len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove any pack bitmaps corresponding to packs which are now
|
|
|
|
* included in the MIDX.
|
|
|
|
*/
|
|
|
|
for_each_string_list_item(item, include) {
|
|
|
|
strbuf_addstr(&path, item->string);
|
|
|
|
strbuf_strip_suffix(&path, ".idx");
|
|
|
|
strbuf_addstr(&path, ".bitmap");
|
|
|
|
|
|
|
|
if (unlink(path.buf) && errno != ENOENT)
|
|
|
|
warning_errno(_("could not remove stale bitmap: %s"),
|
|
|
|
path.buf);
|
|
|
|
|
|
|
|
strbuf_setlen(&path, packdir_len);
|
|
|
|
}
|
|
|
|
strbuf_release(&path);
|
|
|
|
}
|
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
static int write_cruft_pack(const struct pack_objects_args *args,
|
|
|
|
const char *pack_prefix,
|
|
|
|
struct string_list *names,
|
|
|
|
struct string_list *existing_packs,
|
|
|
|
struct string_list *existing_kept_packs)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
struct string_list_item *item;
|
|
|
|
FILE *in, *out;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
prepare_pack_objects(&cmd, args);
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "--cruft");
|
|
|
|
if (cruft_expiration)
|
|
|
|
strvec_pushf(&cmd.args, "--cruft-expiration=%s",
|
|
|
|
cruft_expiration);
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "--honor-pack-keep");
|
|
|
|
strvec_push(&cmd.args, "--non-empty");
|
|
|
|
strvec_push(&cmd.args, "--max-pack-size=0");
|
|
|
|
|
|
|
|
cmd.in = -1;
|
|
|
|
|
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* names has a confusing double use: it both provides the list
|
|
|
|
* of just-written new packs, and accepts the name of the cruft
|
|
|
|
* pack we are writing.
|
|
|
|
*
|
|
|
|
* By the time it is read here, it contains only the pack(s)
|
|
|
|
* that were just written, which is exactly the set of packs we
|
|
|
|
* want to consider kept.
|
|
|
|
*/
|
|
|
|
in = xfdopen(cmd.in, "w");
|
|
|
|
for_each_string_list_item(item, names)
|
|
|
|
fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
|
|
|
|
for_each_string_list_item(item, existing_packs)
|
|
|
|
fprintf(in, "-%s.pack\n", item->string);
|
|
|
|
for_each_string_list_item(item, existing_kept_packs)
|
|
|
|
fprintf(in, "%s.pack\n", item->string);
|
|
|
|
fclose(in);
|
|
|
|
|
|
|
|
out = xfdopen(cmd.out, "r");
|
|
|
|
while (strbuf_getline_lf(&line, out) != EOF) {
|
repack: populate extension bits incrementally
After generating the main pack and then any additional cruft packs, we
iterate over the "names" list (which contains hashes of packs generated
by pack-objects), and call populate_pack_exts() for each.
There's one small problem with this. In repack_promisor_objects(), we
may add entries to "names" and call populate_pack_exts() for them.
Calling it again is mostly just wasteful, as we'll stat() the filename
with each possible extension, get the same result, and just overwrite
our bits.
So we could drop the call there, and leave the final loop to populate
all of the bits. But instead, this patch does the reverse: drops the
final loop, and teaches the other two sites to populate the bits as they
add entries.
This makes the code easier to reason about, as you never have to worry
about when the util field is valid; it is always valid for each entry.
It also serves my ulterior purpose: recording the generated filenames as
soon as possible will make it easier for a future patch to use them for
cleaning up from a failed operation.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 02:21:48 +02:00
|
|
|
struct string_list_item *item;
|
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
if (line.len != the_hash_algo->hexsz)
|
|
|
|
die(_("repack: Expecting full hex object ID lines only "
|
|
|
|
"from pack-objects."));
|
2022-10-22 02:21:48 +02:00
|
|
|
|
|
|
|
item = string_list_append(names, line.buf);
|
|
|
|
item->util = populate_pack_exts(line.buf);
|
2022-05-21 01:18:03 +02:00
|
|
|
}
|
|
|
|
fclose(out);
|
|
|
|
|
|
|
|
strbuf_release(&line);
|
|
|
|
|
|
|
|
return finish_command(&cmd);
|
|
|
|
}
|
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
int cmd_repack(int argc, const char **argv, const char *prefix)
|
|
|
|
{
|
2014-08-19 21:09:35 +02:00
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
2013-09-15 17:33:20 +02:00
|
|
|
struct string_list_item *item;
|
|
|
|
struct string_list names = STRING_LIST_INIT_DUP;
|
2021-09-29 03:55:12 +02:00
|
|
|
struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:10 +02:00
|
|
|
struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to 'objects(Pi)'.
With a geometric factor of 'r', it should be that:
objects(Pi) >= r*objects(P(i-1))
for all i in [2, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked; e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
 ^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
 ^                 ^              ^                  ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
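The two approximation steps above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the implementation in repack.c: `sketch_geometric_split` is a hypothetical name, the input is a plain array of object counts sorted in ascending order, and overflow of `factor * count` is ignored. It treats a pack as fitting the progression when it holds at least `factor` times as many objects as its smaller neighbor, which is what the worked example requires.

```c
#include <assert.h>
#include <stddef.h>

/*
 * counts[] holds per-pack object counts, sorted ascending.
 * Returns how many leading packs get rolled up into one new pack and
 * stores that new pack's object count in *new_count (0 if none).
 * Multiplication overflow is not handled in this sketch.
 */
static size_t sketch_geometric_split(const unsigned long *counts, size_t n,
				     unsigned long factor,
				     unsigned long *new_count)
{
	size_t split, i;
	unsigned long total = 0;

	assert(factor > 0);
	*new_count = 0;
	if (!n)
		return 0;

	/* Step 1: walk down from the largest pack while the progression holds. */
	for (i = n - 1; i > 0; i--)
		if (counts[i] < factor * counts[i - 1])
			break;
	split = i;

	/* Everything left of the cutoff goes into the new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Step 2: absorb heavier packs until the progression is restored. */
	while (split < n && counts[split] < factor * total) {
		total += counts[split];
		split++;
	}

	*new_count = total;
	return split;
}
```

For the counts 1, 1, 1, 2, 4, 32 with a factor of 2, step 1 puts the cutoff after the first two packs, and step 2 then absorbs the packs holding 1, 2 and 4 objects, leaving a new pack of 9 objects next to the untouched pack of 32.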
|
|
|
struct pack_geometry *geometry = NULL;
|
2013-09-15 17:33:20 +02:00
|
|
|
struct strbuf line = STRBUF_INIT;
|
2021-10-02 00:38:10 +02:00
|
|
|
struct tempfile *refs_snapshot = NULL;
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the corresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of these files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 21:15:16 +01:00
|
|
|
int i, ext, ret;
|
2013-09-15 17:33:20 +02:00
|
|
|
FILE *out;
|
2021-12-20 15:48:11 +01:00
|
|
|
int show_progress;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
/* variables to be filled by option parsing */
|
|
|
|
int delete_redundant = 0;
|
2014-01-23 02:28:30 +01:00
|
|
|
const char *unpack_unreachable = NULL;
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 06:36:28 +02:00
|
|
|
int keep_unreachable = 0;
|
2018-04-15 17:36:13 +02:00
|
|
|
struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
|
2018-08-09 00:34:05 +02:00
|
|
|
struct pack_objects_args po_args = {NULL};
|
2022-05-21 01:18:06 +02:00
|
|
|
struct pack_objects_args cruft_po_args = {NULL};
|
builtin/repack.c: add '--geometric' option
2021-02-23 03:25:27 +01:00
|
|
|
int geometric_factor = 0;
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:18 +02:00
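The "decide first, delete second" structure described above can be sketched like this. All names here (`sketch_pack`, `SKETCH_DELETE_PACK`, and the helper functions) are hypothetical, and the rule used to decide redundancy (an existing pack is redundant when this repack did not write a pack of the same name) is a simplifying assumption. The sketch also covers only existing packs, whereas the real MIDX additionally includes new and .keep packs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DELETE_PACK 1u

struct sketch_pack {
	const char *name;
	unsigned flags;
};

static int sketch_contains(const char *const *names, size_t nr, const char *name)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(names[i], name))
			return 1;
	return 0;
}

/* Loop 1: decide. Mark existing packs that this repack did not rewrite. */
static size_t sketch_mark_redundant(struct sketch_pack *existing, size_t nr,
				    const char *const *fresh, size_t fresh_nr)
{
	size_t i, marked = 0;

	for (i = 0; i < nr; i++) {
		if (!sketch_contains(fresh, fresh_nr, existing[i].name)) {
			existing[i].flags |= SKETCH_DELETE_PACK;
			marked++;
		}
	}
	return marked;
}

/* The single check shared by the MIDX step and by loop 2. */
static int sketch_is_redundant(const struct sketch_pack *p)
{
	return (p->flags & SKETCH_DELETE_PACK) != 0;
}

/* Between the loops: collect the existing packs a MIDX should still cover. */
static size_t sketch_midx_included(const struct sketch_pack *existing, size_t nr,
				   const char **out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (!sketch_is_redundant(&existing[i]))
			out[n++] = existing[i].name;
	return n;
}

/* Loop 2: act. Pass each marked pack to an optional remover callback. */
static size_t sketch_remove_redundant(const struct sketch_pack *existing, size_t nr,
				      void (*remove_pack)(const char *name))
{
	size_t i, removed = 0;

	for (i = 0; i < nr; i++) {
		if (!sketch_is_redundant(&existing[i]))
			continue;
		if (remove_pack)
			remove_pack(existing[i].name);
		removed++;
	}
	return removed;
}
```

Calling `sketch_midx_included()` between the two loops mirrors the ordering the message asks for: the MIDX is written while the redundant packs still exist on disk, yet it already excludes them.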
|
|
|
int write_midx = 0;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
struct option builtin_repack_options[] = {
|
|
|
|
OPT_BIT('a', NULL, &pack_everything,
|
|
|
|
N_("pack everything in a single pack"), ALL_INTO_ONE),
|
|
|
|
OPT_BIT('A', NULL, &pack_everything,
|
|
|
|
N_("same as -a, and turn unreachable objects loose"),
|
|
|
|
LOOSEN_UNREACHABLE | ALL_INTO_ONE),
|
2022-05-21 01:18:03 +02:00
|
|
|
OPT_BIT(0, "cruft", &pack_everything,
|
|
|
|
N_("same as -a, pack unreachable cruft objects separately"),
|
|
|
|
PACK_CRUFT),
|
|
|
|
OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
|
|
|
|
N_("with --cruft, expire objects older than this")),
|
2013-09-15 17:33:20 +02:00
|
|
|
OPT_BOOL('d', NULL, &delete_redundant,
|
|
|
|
N_("remove redundant packs, and run git-prune-packed")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("pass --no-reuse-delta to git-pack-objects")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("pass --no-reuse-object to git-pack-objects")),
|
2022-03-14 08:42:46 +01:00
|
|
|
OPT_NEGBIT('n', NULL, &run_update_server_info,
|
|
|
|
N_("do not run git-update-server-info"), 1),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT__QUIET(&po_args.quiet, N_("be quiet")),
|
|
|
|
OPT_BOOL('l', "local", &po_args.local,
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("pass --local to git-pack-objects")),
|
2014-06-10 22:10:07 +02:00
|
|
|
OPT_BOOL('b', "write-bitmap-index", &write_bitmaps,
|
2013-12-21 15:00:31 +01:00
|
|
|
N_("write bitmap index")),
|
2018-08-16 08:13:10 +02:00
|
|
|
OPT_BOOL('i', "delta-islands", &use_delta_islands,
|
|
|
|
N_("pass --delta-islands to git-pack-objects")),
|
2013-09-15 17:33:20 +02:00
|
|
|
OPT_STRING(0, "unpack-unreachable", &unpack_unreachable, N_("approxidate"),
|
|
|
|
N_("with -A, do not loosen objects older than this")),
|
repack: add --keep-unreachable option
2016-06-13 06:36:28 +02:00
|
|
|
OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
|
|
|
|
N_("with -a, repack unreachable objects")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_STRING(0, "window", &po_args.window, N_("n"),
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("size of the window used for delta compression")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_STRING(0, "window-memory", &po_args.window_memory, N_("bytes"),
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("same as the above, but limit memory size instead of entries count")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_STRING(0, "depth", &po_args.depth, N_("n"),
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("limits the maximum delta depth")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_STRING(0, "threads", &po_args.threads, N_("n"),
|
2017-04-27 01:09:25 +02:00
|
|
|
N_("limits the maximum number of threads")),
|
2018-08-09 00:34:05 +02:00
|
|
|
OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
|
2013-09-15 17:33:20 +02:00
|
|
|
N_("maximum size of each packfile")),
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 21:04:20 +01:00
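The defaulting rule described above (an unset option follows the bitmap setting, an explicit value wins) can be sketched as a tri-state resolution. The function names are hypothetical and the sketch assumes -1 encodes "not set by config or command line".

```c
#include <assert.h>

/*
 * requested: -1 (unset), 0 (explicitly off) or 1 (explicitly on).
 * write_bitmaps: nonzero when bitmaps will be written.
 * Returns the effective value of the "pack kept objects" setting.
 */
static int sketch_resolve_pack_kept_objects(int requested, int write_bitmaps)
{
	assert(requested >= -1 && requested <= 1);
	if (requested < 0)
		return write_bitmaps > 0; /* default: follow the bitmap setting */
	return requested;                 /* explicit choice overrides */
}

/* Whether --honor-pack-keep would still be passed to pack-objects. */
static int sketch_honor_pack_keep(int requested, int write_bitmaps)
{
	return !sketch_resolve_pack_kept_objects(requested, write_bitmaps);
}
```

With bitmaps enabled and nothing requested, `sketch_honor_pack_keep(-1, 1)` is 0, so kept objects are duplicated into the new pack. Passing an explicit 0 keeps honoring .keep files even when bitmaps are on.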
|
|
|
OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
|
|
|
|
N_("repack objects in packs marked with .keep")),
|
2018-04-15 17:36:13 +02:00
|
|
|
OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
|
|
|
|
N_("do not repack this pack")),
|
builtin/repack.c: add '--geometric' option
2021-02-23 03:25:27 +01:00
|
|
|
OPT_INTEGER('g', "geometric", &geometric_factor,
|
|
|
|
N_("find a geometric progression with factor <N>")),
|
builtin/repack.c: support writing a MIDX while repacking
2021-09-29 03:55:18 +02:00
|
|
|
OPT_BOOL('m', "write-midx", &write_midx,
|
|
|
|
N_("write a multi-pack index of the resulting packs")),
|
2013-09-15 17:33:20 +02:00
|
|
|
OPT_END()
|
|
|
|
};
|
|
|
|
|
2022-05-21 01:18:06 +02:00
|
|
|
git_config(repack_config, &cruft_po_args);
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
argc = parse_options(argc, argv, prefix, builtin_repack_options,
|
|
|
|
git_repack_usage, 0);
|
|
|
|
|
2015-06-23 12:54:11 +02:00
|
|
|
if (delete_redundant && repository_format_precious_objects)
|
|
|
|
die(_("cannot delete packs in a precious-objects repo"));
|
|
|
|
|
repack: add --keep-unreachable option
2016-06-13 06:36:28 +02:00
|
|
|
if (keep_unreachable &&
|
|
|
|
(unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
|
2022-01-05 21:02:16 +01:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
|
repack: add --keep-unreachable option
2016-06-13 06:36:28 +02:00
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
if (pack_everything & PACK_CRUFT) {
|
|
|
|
pack_everything |= ALL_INTO_ONE;
|
|
|
|
|
|
|
|
if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
|
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
|
|
|
|
if (keep_unreachable)
|
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
|
|
|
|
}
|
|
|
|
|
2019-06-29 21:13:59 +02:00
|
|
|
if (write_bitmaps < 0) {
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
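The two-loop arrangement described in the message above (decide first, act later) can be sketched as follows. This is a minimal illustration, not the code in builtin/repack.c: `struct pack_entry`, `mark_redundant()`, `count_midx_included()` and `remove_marked()` are hypothetical names, the `removed` field stands in for actually unlinking a pack, and only the `DELETE_PACK` bit name is borrowed from the file's own defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DELETE_PACK 1 /* same bit name as the define near the top of the file */

/* Hypothetical stand-in for one entry in the list of existing packs. */
struct pack_entry {
	const char *name;
	unsigned flags;
	int removed; /* stands in for unlinking the pack on disk */
};

/* Loop 1: only decide. A pack is marked if its name is in rewritten[]. */
static void mark_redundant(struct pack_entry *packs, size_t nr,
			   const char **rewritten, size_t nr_rewritten)
{
	size_t i, j;

	assert(packs || !nr);
	for (i = 0; i < nr; i++)
		for (j = 0; j < nr_rewritten; j++)
			if (!strcmp(packs[i].name, rewritten[j]))
				packs[i].flags |= DELETE_PACK;
}

/* Between the loops: the MIDX covers exactly the unmarked packs. */
static size_t count_midx_included(const struct pack_entry *packs, size_t nr)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (!(packs[i].flags & DELETE_PACK))
			n++;
	return n;
}

/* Loop 2: act on the earlier decision, using the same flag check. */
static size_t remove_marked(struct pack_entry *packs, size_t nr)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		if (!(packs[i].flags & DELETE_PACK))
			continue;
		packs[i].removed = 1;
		n++;
	}
	return n;
}
```

Calling `mark_redundant()`, then `count_midx_included()`, then `remove_marked()` mirrors the ordering the message argues for: the index is computed while the redundant packs still exist, and both steps agree on which packs those are because they test the same flag.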
2021-09-29 03:55:18 +02:00
		if (!write_midx &&
		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
2019-07-31 07:39:27 +02:00
			write_bitmaps = 0;
2021-08-31 22:52:43 +02:00
	} else if (write_bitmaps &&
		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0)) {
		write_bitmaps = 0;
2019-06-29 21:13:59 +02:00
	}
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
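The defaulting rule described above, combined with the expression the annotated code uses for it, can be sketched as a small tri-state helper. This is an illustrative sketch only: `default_pack_kept_objects()` and `passes_honor_pack_keep()` are hypothetical names, and the convention that -1 means "not set by the user" is inferred from the `< 0` checks in the surrounding code.

```c
#include <assert.h>

/*
 * Resolve the tri-state setting: -1 = unset, 0 = off, 1 = on.
 * An explicit value always wins. When unset, kept objects are packed
 * only if bitmaps were explicitly requested (write_bitmaps > 0) and no
 * MIDX is being written.
 */
static int default_pack_kept_objects(int pack_kept_objects,
				     int write_bitmaps, int write_midx)
{
	assert(pack_kept_objects >= -1 && pack_kept_objects <= 1);
	if (pack_kept_objects < 0)
		return write_bitmaps > 0 && !write_midx;
	return pack_kept_objects;
}

/* --honor-pack-keep is passed exactly when kept objects are NOT packed. */
static int passes_honor_pack_keep(int resolved_pack_kept_objects)
{
	return !resolved_pack_kept_objects;
}
```

Keeping the "unset" state distinct from "off" is what lets an explicit command-line or config value override the bitmap-driven default in either direction.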
2014-03-03 21:04:20 +01:00
	if (pack_kept_objects < 0)
2021-12-20 15:48:10 +01:00
		pack_kept_objects = write_bitmaps > 0 && !write_midx;
2014-03-03 21:04:20 +01:00

2021-09-29 03:55:18 +02:00
	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
2016-12-28 23:45:42 +01:00
		die(_(incremental_bitmap_conflict_error));

2021-10-02 00:38:10 +02:00
	if (write_midx && write_bitmaps) {
		struct strbuf path = STRBUF_INIT;

		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
			    "bitmap-ref-tips");

		refs_snapshot = xmks_tempfile(path.buf);
		midx_snapshot_refs(refs_snapshot);

		strbuf_release(&path);
	}

2022-05-20 21:01:45 +02:00
	packdir = mkpathdup("%s/pack", get_object_directory());
	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);

	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
			       &keep_pack_list);

builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
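The roll-up procedure described in the message above can be sketched over a plain array of object counts. This is a simplified illustration under stated assumptions, not the `split_pack_geometry()` implementation: `geometric_split()` is a hypothetical name, packs are reduced to their object counts, equality (`count == factor * previous`) is treated as satisfying the progression, and arithmetic overflow is not checked. The sketch reproduces the final state of the worked example (9, 32); its intermediate cut position is not claimed to match the narrative step by step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] must be sorted in ascending order. Returns the split index:
 * packs [0, split) are rolled up into one new pack and packs
 * [split, nr) are left untouched. If rolled_up is non-NULL it receives
 * the object count of the new pack.
 */
static size_t geometric_split(const uint64_t *counts, size_t nr,
			      uint64_t factor, uint64_t *rolled_up)
{
	size_t i, split;
	uint64_t total = 0;

	assert(factor > 0);
	if (!nr) {
		if (rolled_up)
			*rolled_up = 0;
		return 0;
	}

	/* Walk from the largest pack down while the progression holds. */
	for (i = nr - 1; i > 0; i--)
		if (counts[i] < factor * counts[i - 1])
			break;
	split = i;

	/* The larger pack of the failing pair cannot stay in the progression. */
	if (split)
		split++;

	/* Everything left of the split goes into the new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb heavier packs until the new pack no longer breaks the progression. */
	while (split < nr && counts[split] < factor * total) {
		total += counts[split];
		split++;
	}

	if (rolled_up)
		*rolled_up = total;
	return split;
}
```

With the counts 1, 1, 1, 2, 4, 32 and a factor of 2 (an assumed value; the message does not state the factor), the function returns a split of 5 and a rolled-up count of 9, leaving only the 32-object pack untouched.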
2021-02-23 03:25:27 +01:00
	if (geometric_factor) {
		if (pack_everything)
2022-01-05 21:02:16 +01:00
			die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
2022-05-20 21:01:45 +02:00
		init_pack_geometry(&geometry, &existing_kept_packs);
2021-02-23 03:25:27 +01:00
		split_pack_geometry(geometry, geometric_factor);
	}

2018-08-09 00:34:05 +02:00
	prepare_pack_objects(&cmd, &po_args);

2021-12-20 15:48:11 +01:00
	show_progress = !po_args.quiet && isatty(2);

2020-07-28 22:24:27 +02:00
	strvec_push(&cmd.args, "--keep-true-parents");
2014-03-03 21:04:20 +01:00
	if (!pack_kept_objects)
2020-07-28 22:24:27 +02:00
		strvec_push(&cmd.args, "--honor-pack-keep");
2018-04-15 17:36:13 +02:00
	for (i = 0; i < keep_pack_list.nr; i++)
2020-07-28 22:24:27 +02:00
		strvec_pushf(&cmd.args, "--keep-pack=%s",
strvec: fix indentation in renamed calls
Code which split an argv_array call across multiple lines, like:
argv_array_pushl(&args, "one argument",
"another argument", "and more",
NULL);
was recently mechanically renamed to use strvec, which results in
mis-matched indentation like:
strvec_pushl(&args, "one argument",
"another argument", "and more",
NULL);
Let's fix these up to align the arguments with the opening paren. I did
this manually by sifting through the results of:
git jump grep 'strvec_.*,$'
and liberally applying my editor's auto-format. Most of the changes are
of the form shown above, though I also normalized a few that had
originally used a single-tab indentation (rather than our usual style of
aligning with the open paren). I also rewrapped a couple of obvious
cases (e.g., where previously too-long lines became short enough to fit
on one), but I wasn't aggressive about it. In cases broken to three or
more lines, the grouping of arguments is sometimes meaningful, and it
wasn't worth my time or reviewer time to ponder each case individually.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28 22:26:31 +02:00
			     keep_pack_list.items[i].string);
2020-07-28 22:24:27 +02:00
	strvec_push(&cmd.args, "--non-empty");
2021-02-23 03:25:27 +01:00
	if (!geometry) {
		/*
2021-03-05 16:22:02 +01:00
		 * We need to grab all reachable objects, including those that
		 * are reachable from reflogs and the index.
2021-02-23 03:25:27 +01:00
		 *
2021-03-05 16:22:02 +01:00
		 * When repacking into a geometric progression of packs,
		 * however, we ask 'git pack-objects --stdin-packs', and it is
		 * not about packing objects based on reachability but about
		 * repacking all the objects in specified packs and loose ones
		 * (indeed, --stdin-packs is incompatible with these options).
2021-02-23 03:25:27 +01:00
		 */
		strvec_push(&cmd.args, "--all");
		strvec_push(&cmd.args, "--reflog");
		strvec_push(&cmd.args, "--indexed-objects");
	}
2019-06-25 15:40:31 +02:00
	if (has_promisor_remote())
2020-07-28 22:24:27 +02:00
		strvec_push(&cmd.args, "--exclude-promisor-objects");
2021-09-29 03:55:18 +02:00
	if (!write_midx) {
		if (write_bitmaps > 0)
			strvec_push(&cmd.args, "--write-bitmap-index");
		else if (write_bitmaps < 0)
			strvec_push(&cmd.args, "--write-bitmap-index-quiet");
	}
2018-08-16 08:13:10 +02:00
	if (use_delta_islands)
2020-07-28 22:24:27 +02:00
		strvec_push(&cmd.args, "--delta-islands");
2013-09-15 17:33:20 +02:00

builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:10 +02:00
	if (pack_everything & ALL_INTO_ONE) {
2018-08-09 00:34:06 +02:00
		repack_promisor_objects(&po_args, &names);

2022-05-21 01:18:03 +02:00
		if (existing_nonkept_packs.nr && delete_redundant &&
		    !(pack_everything & PACK_CRUFT)) {
repack: avoid loosening promisor objects in partial clones
When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all objects unused during this invocation. Unfortunately,
this includes promisor objects.
Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place. For example, in a newly-cloned partial repo that
filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: In my copy
of the linux.git, the object directory peaked 26GB of more disk usage.
In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened.
For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit via trace2 thus allowing inspecting the debug events after the
process is finished. This new event is used on the added regression
test.
Lastly, add a new perf test to evaluate the performance impact
made by this changes (tested on git.git):
Test HEAD^ HEAD
----------------------------------------------------------
5600.3: gc 134.38(41.93+90.95) 7.80(6.72+1.35) -94.2%
For a bigger repository, such as linux.git, the improvement is
even bigger:
Test HEAD^ HEAD
-------------------------------------------------------------------
5600.3: gc 6833.00(918.07+3162.74) 268.79(227.02+39.18) -96.1%
These improvements are particular big because every object in the
newly-cloned partial repository is a promisor object.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
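The `--keep-pack` arguments mentioned above are built from the temporary pack prefix and each promisor pack's name, using the format string that appears in the annotated code (`"--keep-pack=%s-%s.pack"`). A minimal standalone sketch of that formatting follows; `format_keep_pack_arg()` and `is_keep_pack_arg()` are hypothetical helpers written against plain `snprintf()` rather than git's strvec API, and the example values used with them are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "--keep-pack=<packtmp_name>-<hash>.pack" into buf.
 * Returns 0 on success, -1 if the result did not fit or formatting failed.
 */
static int format_keep_pack_arg(char *buf, size_t len,
				const char *packtmp_name, const char *hash)
{
	int n;

	assert(buf && len > 0);
	n = snprintf(buf, len, "--keep-pack=%s-%s.pack", packtmp_name, hash);
	return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* True if arg starts with the "--keep-pack=" option prefix. */
static int is_keep_pack_arg(const char *arg)
{
	static const char prefix[] = "--keep-pack=";

	return !strncmp(arg, prefix, sizeof(prefix) - 1);
}
```

One such argument would be produced per promisor pack, so the second `pack-objects` invocation treats those objects as already safely packed instead of loosening them.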
2021-04-21 21:32:12 +02:00
			for_each_string_list_item(item, &names) {
				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
					     packtmp_name, item->string);
			}
2015-03-20 19:43:13 +01:00
			if (unpack_unreachable) {
2020-07-28 22:24:27 +02:00
				strvec_pushf(&cmd.args,
2020-07-28 22:26:31 +02:00
|
|
|
"--unpack-unreachable=%s",
|
|
|
|
unpack_unreachable);
|
2015-03-20 19:43:13 +01:00
|
|
|
} else if (pack_everything & LOOSEN_UNREACHABLE) {
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd.args,
|
2020-07-28 22:26:31 +02:00
|
|
|
"--unpack-unreachable");
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 06:36:28 +02:00
|
|
|
} else if (keep_unreachable) {
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd.args, "--keep-unreachable");
|
|
|
|
strvec_push(&cmd.args, "--pack-loose-unreachable");
|
2015-03-20 19:43:13 +01:00
|
|
|
}
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
  ^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
  ^                 ^              ^                  ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
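The roll-up described above can be sketched in a few lines of C. This is
only an illustration of the walkthrough, not the code added by this patch:
the function name, its signature, and the plain array of object counts are
all assumptions made for the example. It treats a pack as fitting the
progression when its count is at least 'factor' times the previous one, and
it skips the overflow checks a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not git's actual code. `counts` holds per-pack
 * object counts sorted in ascending order. Returns the number of leading
 * packs to roll up into one new pack and stores that pack's object count
 * in `*rolled_up`. Integer overflow is ignored for brevity.
 */
size_t geometric_split(const unsigned long *counts, size_t nr,
		       unsigned long factor, unsigned long *rolled_up)
{
	size_t i, split;
	unsigned long total = 0;

	*rolled_up = 0;
	if (!nr)
		return 0;

	for (i = 1; i < nr; i++)
		assert(counts[i - 1] <= counts[i]); /* input must be sorted */

	/* Walk down from the largest pack until the progression breaks. */
	for (i = nr - 1; i > 0; i--)
		if (counts[i] < factor * counts[i - 1])
			break;
	split = i;

	/* Everything to the left of the cutoff goes into the new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb heavier packs until the new pack fits the progression. */
	for (i = split; i < nr; i++) {
		if (counts[i] >= factor * total)
			break;
		total += counts[i];
		split++;
	}

	*rolled_up = total;
	return split;
}
```

Assuming a factor of 2, the counts 1, 1, 1, 2, 4, 32 from the example roll
up the first five packs into a new pack of 9 objects and leave the pack of
32 alone, matching the "9, 32" result shown above.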
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 03:25:27 +01:00
|
|
|
} else if (geometry) {
|
|
|
|
strvec_push(&cmd.args, "--stdin-packs");
|
|
|
|
strvec_push(&cmd.args, "--unpacked");
|
2013-09-15 17:33:20 +02:00
|
|
|
} else {
|
2020-07-28 22:24:27 +02:00
|
|
|
strvec_push(&cmd.args, "--unpacked");
|
|
|
|
strvec_push(&cmd.args, "--incremental");
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
if (geometry)
|
|
|
|
cmd.in = -1;
|
|
|
|
else
|
|
|
|
cmd.no_stdin = 1;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
2013-09-15 17:33:21 +02:00
|
|
|
return ret;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
if (geometry) {
|
|
|
|
FILE *in = xfdopen(cmd.in, "w");
|
|
|
|
/*
|
|
|
|
* The resulting pack should contain all objects in packs that
|
|
|
|
* are going to be rolled up, but exclude objects in packs which
|
|
|
|
* are being left alone.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < geometry->split; i++)
|
|
|
|
fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
|
|
|
|
for (i = geometry->split; i < geometry->pack_nr; i++)
|
|
|
|
fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
|
|
|
|
fclose(in);
|
|
|
|
}
|
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
out = xfdopen(cmd.out, "r");
|
2016-01-14 00:31:17 +01:00
|
|
|
while (strbuf_getline_lf(&line, out) != EOF) {
|
repack: populate extension bits incrementally
After generating the main pack and then any additional cruft packs, we
iterate over the "names" list (which contains hashes of packs generated
by pack-objects), and call populate_pack_exts() for each.
There's one small problem with this. In repack_promisor_objects(), we
may add entries to "names" and call populate_pack_exts() for them.
Calling it again is mostly just wasteful, as we'll stat() the filename
with each possible extension, get the same result, and just overwrite
our bits.
So we could drop the call there, and leave the final loop to populate
all of the bits. But instead, this patch does the reverse: drops the
final loop, and teaches the other two sites to populate the bits as they
add entries.
This makes the code easier to reason about, as you never have to worry
about when the util field is valid; it is always valid for each entry.
It also serves my ulterior purpose: recording the generated filenames as
soon as possible will make it easier for a future patch to use them for
cleaning up from a failed operation.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 02:21:48 +02:00
|
|
|
struct string_list_item *item;
|
|
|
|
|
2018-10-15 02:01:50 +02:00
|
|
|
if (line.len != the_hash_algo->hexsz)
|
2019-01-04 22:33:31 +01:00
|
|
|
die(_("repack: Expecting full hex object ID lines only from pack-objects."));
|
2022-10-22 02:21:48 +02:00
|
|
|
item = string_list_append(&names, line.buf);
|
|
|
|
item->util = populate_pack_exts(item->string);
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
|
|
|
fclose(out);
|
|
|
|
ret = finish_command(&cmd);
|
|
|
|
if (ret)
|
2013-09-15 17:33:21 +02:00
|
|
|
return ret;
|
2013-09-15 17:33:20 +02:00
|
|
|
|
2018-08-09 00:34:05 +02:00
|
|
|
if (!names.nr && !po_args.quiet)
|
2018-11-10 06:16:10 +01:00
|
|
|
printf_ln(_("Nothing new to pack."));
|
2013-09-15 17:33:20 +02:00
|
|
|
|
2022-05-21 01:18:03 +02:00
|
|
|
if (pack_everything & PACK_CRUFT) {
|
|
|
|
const char *pack_prefix;
|
|
|
|
if (!skip_prefix(packtmp, packdir, &pack_prefix))
|
|
|
|
die(_("pack prefix %s does not begin with objdir %s"),
|
|
|
|
packtmp, packdir);
|
|
|
|
if (*pack_prefix == '/')
|
|
|
|
pack_prefix++;
|
|
|
|
|
2022-05-21 01:18:06 +02:00
|
|
|
if (!cruft_po_args.window)
|
|
|
|
cruft_po_args.window = po_args.window;
|
|
|
|
if (!cruft_po_args.window_memory)
|
|
|
|
cruft_po_args.window_memory = po_args.window_memory;
|
|
|
|
if (!cruft_po_args.depth)
|
|
|
|
cruft_po_args.depth = po_args.depth;
|
|
|
|
if (!cruft_po_args.threads)
|
|
|
|
cruft_po_args.threads = po_args.threads;
|
|
|
|
|
|
|
|
cruft_po_args.local = po_args.local;
|
|
|
|
cruft_po_args.quiet = po_args.quiet;
|
|
|
|
|
|
|
|
ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
|
2022-05-21 01:18:03 +02:00
|
|
|
&existing_nonkept_packs,
|
|
|
|
&existing_kept_packs);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-05-20 21:01:51 +02:00
|
|
|
string_list_sort(&names);
|
|
|
|
|
2019-05-17 20:41:49 +02:00
|
|
|
close_object_store(the_repository->objects);
|
2018-12-15 23:04:01 +01:00
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
/*
|
|
|
|
* Ok we have prepared all new packfiles.
|
|
|
|
*/
|
|
|
|
for_each_string_list_item(item, &names) {
|
2022-10-22 02:21:45 +02:00
|
|
|
struct generated_pack_data *data = item->util;
|
|
|
|
|
2013-12-21 15:00:19 +01:00
|
|
|
for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by simulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
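To illustrate why recording filenames up front makes signal-time cleanup
safe, here is a minimal standalone sketch. It does not use git's tempfile
API; every name in it (track_tempfile, remove_tracked_tempfiles, the
fixed-size table) is invented for the example. The point is that the
handler touches only preallocated storage and calls only unlink(),
signal(), and raise(), all of which POSIX lists as async-signal-safe, so it
never needs opendir() or malloc().

```c
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_TRACKED 16
#define MAX_PATH_LEN 256

/* Preallocated storage, so the signal handler never allocates. */
static char tracked[MAX_TRACKED][MAX_PATH_LEN];
static volatile sig_atomic_t tracked_nr;

/*
 * Hypothetical helper: remember a path for later cleanup. Call this from
 * normal code, not from a signal handler. Returns 0 on success, -1 if the
 * table is full or the path is too long.
 */
int track_tempfile(const char *path)
{
	size_t len;

	assert(path != NULL);
	len = strlen(path);
	if (tracked_nr >= MAX_TRACKED || len >= MAX_PATH_LEN)
		return -1;
	memcpy(tracked[tracked_nr], path, len + 1);
	tracked_nr++; /* publish the entry only once it is fully written */
	return 0;
}

/* Remove every tracked file; uses nothing but unlink(). */
void remove_tracked_tempfiles(void)
{
	int i;

	for (i = 0; i < tracked_nr; i++)
		unlink(tracked[i]);
	tracked_nr = 0;
}

/* Signal handler: clean up, then re-raise with the default action. */
void cleanup_on_signal(int signo)
{
	remove_tracked_tempfiles();
	signal(signo, SIG_DFL);
	raise(signo);
}

/* Hook the signals we want to clean up after. */
void install_cleanup_handlers(void)
{
	signal(SIGINT, cleanup_on_signal);
	signal(SIGTERM, cleanup_on_signal);
}
```

As with the scheme described above, a file that exists on disk but has not
yet been passed to track_tempfile() would be missed, which is the same
small race the message accepts.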
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 02:21:54 +02:00
|
|
|
char *fname;
|
2018-07-12 21:39:40 +02:00
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
fname = mkpathdup("%s/pack-%s%s",
|
2013-12-21 15:00:23 +01:00
|
|
|
packdir, item->string, exts[ext].name);
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the corresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of these files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 21:15:16 +01:00
|
|
|
|
2022-10-22 02:21:54 +02:00
|
|
|
if (data->tempfiles[ext]) {
|
|
|
|
const char *fname_old = get_tempfile_path(data->tempfiles[ext]);
|
2020-11-17 21:15:16 +01:00
|
|
|
struct stat statbuffer;
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by simulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
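
The idea described above (record the file names ahead of time, then do nothing in the signal handler except unlink them) can be sketched as follows. This is a minimal illustration and not Git's tempfile API: `register_tmp()`, `cleanup_registered()`, `on_signal()`, `install_cleanup_handlers()`, and the fixed-size table are hypothetical names and choices made for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_TMP 16
#define MAX_PATH_LEN 256

/* Hypothetical registry; Git's real tempfile API is more elaborate. */
static char tmp_paths[MAX_TMP][MAX_PATH_LEN];
static volatile sig_atomic_t tmp_count;

/* Record a path up front so the handler never has to scan a directory. */
int register_tmp(const char *path)
{
	int n;

	assert(path != NULL);
	if (tmp_count >= MAX_TMP)
		return -1;
	n = snprintf(tmp_paths[tmp_count], MAX_PATH_LEN, "%s", path);
	if (n < 0 || n >= MAX_PATH_LEN)
		return -1;
	tmp_count++; /* publish the slot only once the path is complete */
	return 0;
}

/* Callable from a handler: unlink() is async-signal-safe in POSIX. */
void cleanup_registered(void)
{
	for (int i = 0; i < tmp_count; i++)
		unlink(tmp_paths[i]);
}

/* Clean up, then re-raise so the process still dies from the signal. */
void on_signal(int signo)
{
	cleanup_registered();
	signal(signo, SIG_DFL);
	raise(signo);
}

void install_cleanup_handlers(void)
{
	signal(SIGINT, on_signal);
	signal(SIGTERM, on_signal);
}
```

The handler allocates no memory and opens no directory, which is the property the message is after. The same race noted above applies to the sketch: a file created before `register_tmp()` is called for it would not be cleaned up.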
2022-10-22 02:21:54 +02:00
|
|
|
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' variable does
so unsafely). Putting the hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the corresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of these files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
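
To make the naming argument above concrete, here is a small sketch contrasting the two schemes. It is only an illustration: FNV-1a stands in for the real hash, strings stand in for objects, and `name_by_object_set()` and `name_by_content()` are hypothetical helpers, not functions from Git.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FNV_INIT 0xcbf29ce484222325ULL

/* FNV-1a, a stand-in for the cryptographic hash that Git really uses. */
uint64_t fnv1a(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

static int cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Old scheme: the name depends only on the (sorted) set of objects. */
uint64_t name_by_object_set(const char **objs, size_t n)
{
	const char **sorted = malloc(n * sizeof(*sorted));
	uint64_t h = FNV_INIT;

	assert(sorted != NULL);
	memcpy(sorted, objs, n * sizeof(*sorted));
	qsort(sorted, n, sizeof(*sorted), cmp_str);
	for (size_t i = 0; i < n; i++)
		h = fnv1a(h, sorted[i], strlen(sorted[i]) + 1);
	free(sorted);
	return h;
}

/* New scheme: the name is a checksum of the bytes in on-disk order. */
uint64_t name_by_content(const char **objs, size_t n)
{
	uint64_t h = FNV_INIT;

	for (size_t i = 0; i < n; i++)
		h = fnv1a(h, objs[i], strlen(objs[i]) + 1);
	return h;
}
```

Two layouts of the same objects get the same name under the first function but, barring a hash collision, different names under the second. With content-based names, a file that already exists under the new pack's name holds the same bytes, so there is nothing to move out of the way.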
2020-11-17 21:15:16 +01:00
|
|
|
if (!stat(fname_old, &statbuffer)) {
|
|
|
|
statbuffer.st_mode &= ~(S_IWUSR | S_IWGRP | S_IWOTH);
|
|
|
|
chmod(fname_old, statbuffer.st_mode);
|
|
|
|
}
|
|
|
|
|
repack: use tempfiles for signal cleanup
2022-10-22 02:21:54 +02:00
|
|
|
if (rename_tempfile(&data->tempfiles[ext], fname))
|
|
|
|
die_errno(_("renaming pack to '%s' failed"), fname);
|
builtin/repack.c: don't move existing packs out of the way
2020-11-17 21:15:16 +01:00
|
|
|
} else if (!exts[ext].optional)
|
repack: expand error message for missing pack files
If pack-objects tells us it generated pack $hash, we expect to find
.tmp-$$-pack-$hash.pack, .idx, .rev, and so on. Some of these files are
optional, but others are not. For the required ones, we'll bail with an
error if any of them is missing.
The error message is just "missing required file", which is a bit vague.
We should be more clear that it is not the user's fault, but rather that
the sub-program we called is not operating as expected. In practice,
nobody should ever see this message, as it would generally only be
caused by a bug in Git.
It probably doesn't make sense to convert this to a BUG(), though, as
there are other (unlikely) possibilities, such as somebody else racily
deleting the files, filesystem errors causing stat() to fail, and so on.
A nice side effect here is that we stop relying on fname_old in this
code path, which will let us deal with it only in the first part of the
conditional.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 02:21:50 +02:00
|
|
|
die(_("pack-objects did not write a '%s' file for pack %s-%s"),
|
|
|
|
exts[ext].name, packtmp, item->string);
|
builtin/repack.c: don't move existing packs out of the way
2020-11-17 21:15:16 +01:00
|
|
|
else if (unlink(fname) < 0 && errno != ENOENT)
|
|
|
|
die_errno(_("could not unlink: %s"), fname);
|
2013-09-15 17:33:20 +02:00
|
|
|
|
2015-08-10 11:35:38 +02:00
|
|
|
free(fname);
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
/* End of pack replacement. */
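
The final branch of the loop above removes a stale file but treats "already gone" as success. A minimal sketch of that pattern in isolation, using a hypothetical helper name (`unlink_if_exists()`) and returning an error code where the code above calls die_errno():

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Remove 'path' if it exists. A missing file is not an error, because
 * the goal ("no file at this path") is already met. Returns 0 on
 * success and -1 (with errno set by unlink) on any other failure.
 */
int unlink_if_exists(const char *path)
{
	assert(path != NULL);
	if (unlink(path) < 0 && errno != ENOENT)
		return -1;
	return 0;
}
```

Checking errno only after unlink() has reported failure matters here, since errno is not guaranteed to be reset on success.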
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
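
The two-loop structure described above can be sketched roughly as follows. This is an illustration under simplifying assumptions: `struct pack`, `mark_redundant()`, and `count_kept()` are hypothetical, and plain arrays replace Git's string_list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct pack {
	const char *name; /* e.g. "pack-<hex>" */
	int redundant;    /* set by pass one, read by every later step */
};

/* Pass one: a pack is redundant when its trailing hash is not a new pack. */
void mark_redundant(struct pack *packs, size_t n,
		    const char **fresh, size_t nfresh, size_t hexsz)
{
	assert(packs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(packs[i].name);
		const char *hash;

		if (len < hexsz)
			continue; /* too short to end in a hash; leave alone */
		hash = packs[i].name + len - hexsz;
		packs[i].redundant = 1;
		for (size_t j = 0; j < nfresh; j++)
			if (!strcmp(hash, fresh[j]))
				packs[i].redundant = 0;
	}
}

/* Later steps share one check instead of re-deriving the decision. */
size_t count_kept(const struct pack *packs, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		if (!packs[i].redundant)
			kept++;
	return kept;
}
```

Both the step that chooses packs for the MIDX and the step that unlinks packs would consult `redundant`, so the two cannot disagree about which packs are going away.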
2021-09-29 03:55:18 +02:00
|
|
|
if (delete_redundant && pack_everything & ALL_INTO_ONE) {
|
2018-10-15 02:01:50 +02:00
|
|
|
const int hexsz = the_hash_algo->hexsz;
|
builtin/repack.c: support writing a MIDX while repacking
2021-09-29 03:55:18 +02:00
|
|
|
for_each_string_list_item(item, &existing_nonkept_packs) {
|
2013-09-15 17:33:20 +02:00
|
|
|
char *sha1;
|
|
|
|
size_t len = strlen(item->string);
|
2018-10-15 02:01:50 +02:00
|
|
|
if (len < hexsz)
|
2013-09-15 17:33:20 +02:00
|
|
|
continue;
|
2018-10-15 02:01:50 +02:00
|
|
|
sha1 = item->string + len - hexsz;
|
builtin/repack.c: support writing a MIDX while repacking
2021-09-29 03:55:18 +02:00
|
|
|
/*
|
|
|
|
* Mark this pack for deletion, which ensures that this
|
|
|
|
* pack won't be included in a MIDX (if `--write-midx`
|
|
|
|
* was given) and that we will actually delete this pack
|
|
|
|
* (if `-d` was given).
|
|
|
|
*/
|
2022-05-21 01:18:08 +02:00
|
|
|
if (!string_list_has_string(&names, sha1))
|
|
|
|
item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
|
builtin/repack.c: support writing a MIDX while repacking
2021-09-29 03:55:18 +02:00
|
|
|
}
|
|
|
|
}
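
The loop above stores a flag in the item's `util` pointer rather than pointing it at allocated data. A small sketch of that technique on its own, assuming a simplified `struct item` and hypothetical helpers `mark()` and `is_marked()`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/* Simplified stand-in for a string-list item. */
struct item {
	const char *string;
	void *util; /* used as a small bit set, never dereferenced */
};

/* OR a flag into the pointer-sized slot; earlier flags are preserved. */
void mark(struct item *it, unsigned int flag)
{
	assert(it != NULL);
	it->util = (void *)((uintptr_t)it->util | flag);
}

int is_marked(const struct item *it, unsigned int flag)
{
	assert(it != NULL);
	return ((uintptr_t)it->util & flag) != 0;
}
```

This only works while `util` is never treated as a real pointer; going through `uintptr_t` keeps the integer-to-pointer round trip explicit.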
|
|
|
|
|
|
|
|
if (write_midx) {
|
|
|
|
struct string_list include = STRING_LIST_INIT_NODUP;
|
|
|
|
midx_included_packs(&include, &existing_nonkept_packs,
|
|
|
|
&existing_kept_packs, &names, geometry);
|
|
|
|
|
2021-09-29 03:55:20 +02:00
|
|
|
ret = write_midx_included_packs(&include, geometry,
|
2021-10-02 00:38:10 +02:00
|
|
|
refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
|
builtin/repack.c: support writing a MIDX while repacking
2021-09-29 03:55:18 +02:00
|
|
|
show_progress, write_bitmaps > 0);
|
|
|
|
|
2022-10-18 04:45:12 +02:00
|
|
|
if (!ret && write_bitmaps)
|
|
|
|
remove_redundant_bitmaps(&include, packdir);
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
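The two-pass selection described in the message above can be sketched roughly as follows. This is a minimal illustration, not the code in builtin/repack.c: `struct sketch_pack`, `SKETCH_DELETE_PACK`, and every `sketch_*` function are hypothetical names, and the rule used for marking (an existing pack is redundant unless a freshly written pack has the same name) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DELETE_PACK 1u /* hypothetical flag bit */

struct sketch_pack {
	const char *name;
	unsigned flags;
};

/* Pass 1: decide which packs are redundant, but do not touch them yet. */
void sketch_mark_redundant(struct sketch_pack *packs, size_t nr,
			   const char *const *fresh, size_t fresh_nr)
{
	size_t i, j;

	assert(packs || !nr);
	for (i = 0; i < nr; i++) {
		int found = 0;
		for (j = 0; j < fresh_nr; j++)
			if (!strcmp(packs[i].name, fresh[j]))
				found = 1;
		if (!found)
			packs[i].flags |= SKETCH_DELETE_PACK;
	}
}

/* The single check shared by both consumers below. */
int sketch_is_redundant(const struct sketch_pack *p)
{
	return (p->flags & SKETCH_DELETE_PACK) != 0;
}

/* Consumer 1: names of packs a MIDX would cover (everything not marked). */
size_t sketch_midx_include(const struct sketch_pack *packs, size_t nr,
			   const char **out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (!sketch_is_redundant(&packs[i]))
			out[n++] = packs[i].name;
	return n;
}

/*
 * Consumer 2 (pass 2): remove the marked packs. The caller supplies the
 * removal callback; passing NULL only counts what would be removed.
 */
size_t sketch_delete_redundant(const struct sketch_pack *packs, size_t nr,
			       void (*remove_fn)(const char *name))
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		if (!sketch_is_redundant(&packs[i]))
			continue;
		if (remove_fn)
			remove_fn(packs[i].name);
		n++;
	}
	return n;
}
```

Because both consumers go through `sketch_is_redundant()`, the set of packs left out of the MIDX and the set of packs removed cannot drift apart, which is the property the message is after.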
2021-09-29 03:55:18 +02:00
|
|
|
string_list_clear(&include, 0);
|
|
|
|
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-08-09 00:34:06 +02:00
|
|
|
reprepare_packed_git(the_repository);
|
|
|
|
|
2013-09-15 17:33:20 +02:00
|
|
|
if (delete_redundant) {
|
2014-09-13 09:28:01 +02:00
|
|
|
int opts = 0;
|
2021-09-29 03:55:18 +02:00
|
|
|
for_each_string_list_item(item, &existing_nonkept_packs) {
|
2022-05-21 01:18:08 +02:00
|
|
|
if (!((uintptr_t)item->util & DELETE_PACK))
|
2021-09-29 03:55:18 +02:00
|
|
|
continue;
|
|
|
|
remove_redundant_pack(packdir, item->string);
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to 'objects(Pi)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [2, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked; e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
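The split search described in the message above can be sketched as below. This is a simplified illustration and not Git's implementation: `sketch_geometric_split` is a hypothetical name, packs are reduced to bare object counts sorted in ascending order, the comparison follows the "at least a factor times as large" wording, and multiplication overflow is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns how many of the smallest packs are rolled
 * up into one new pack, and stores that pack's object count in
 * *rolled_up when the pointer is non-NULL.
 */
size_t sketch_geometric_split(const uint64_t *counts, size_t nr,
			      uint64_t factor, uint64_t *rolled_up)
{
	size_t i, split;
	uint64_t total = 0;

	assert(factor > 0);
	assert(counts || !nr);

	/*
	 * Step 1: walk down from the largest pack and stop at the first
	 * adjacent pair that breaks the progression. Everything from
	 * index i upwards already forms a geometric progression.
	 */
	for (i = nr > 0 ? nr - 1 : 0; i > 0; i--)
		if (counts[i] < factor * counts[i - 1])
			break;
	split = i;

	/* Object count of the pack made from everything below the split. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/*
	 * Step 2: the new pack may itself break the progression, so keep
	 * absorbing the next pack until the next one is large enough.
	 */
	while (split < nr && counts[split] < factor * total) {
		total += counts[split];
		split++;
	}

	if (rolled_up)
		*rolled_up = total;
	return split;
}
```

For the counts 1, 1, 1, 2, 4, 32 with a factor of 2, step 1 places the split after the first two packs, and step 2 then absorbs packs until only the 32-object pack is left alone, which reproduces the "9, 32" end state from the message.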
2021-02-23 03:25:27 +01:00
|
|
|
|
|
|
|
if (geometry) {
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
|
|
|
|
uint32_t i;
|
|
|
|
for (i = 0; i < geometry->split; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
if (string_list_has_string(&names,
|
|
|
|
hash_to_hex(p->hash)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
repack: don't remove .keep packs with `--pack-kept-objects`
`git repack` supports a `--pack-kept-objects` flag which more or less
translates to whether or not we pass `--honor-pack-keep` down to `git
pack-objects` when assembling a new pack.
This behavior has existed since ee34a2bead (repack: add
`repack.packKeptObjects` config var, 2014-03-03). In that commit, the
documentation was extended to say:
[...] Note that we still do not delete `.keep` packs after
`pack-objects` finishes.
Unfortunately, this is not the case when `--pack-kept-objects` is
combined with a `--geometric` repack. When doing a geometric repack, we
include `.keep` packs when enumerating available packs only when
`pack_kept_objects` is set.
So this all works fine when `--no-pack-kept-objects` (or similar) is
given. Kept packs are excluded from the geometric roll-up, so when we go
to delete redundant packs (with `-d`), no `.keep` packs appear "below
the split" in our geometric progression.
But when `--pack-kept-objects` is given, things can go awry. Namely,
when a kept pack is included in the list of packs tracked by the
`pack_geometry` struct *and* part of the pack roll-up, we will delete
the `.keep` pack when we shouldn't.
Note that this *doesn't* result in object corruption, since the `.keep`
pack's objects are still present in the new pack. But the `.keep` pack
itself is removed, which violates our promise from back in ee34a2bead.
But there's more. Because `repack` computes the geometric roll-up
independently from selecting which packs belong in a MIDX (with
`--write-midx`), this can lead to odd behavior. Consider when a `.keep`
pack appears below the geometric split (i.e., its objects will be part of
the new pack we generate).
We'll write a MIDX containing the new pack along with the existing
`.keep` pack. But because the `.keep` pack appears below the geometric
split line, we'll (incorrectly) try to remove it. While this doesn't
corrupt the repository, it does cause us to remove the MIDX we just
wrote, since removing that pack would invalidate the new MIDX.
Funnily enough, this behavior became far less noticeable after e4d0c11c04
(repack: respect kept objects with '--write-midx -b', 2021-12-20), which
made `pack_kept_objects` be enabled by default only when we were writing
a non-MIDX bitmap.
But e4d0c11c04 didn't resolve this bug; it just made it harder to notice
unless callers explicitly passed `--pack-kept-objects`.
The solution is to avoid trying to remove `.keep` packs during
`--geometric` repacks, even when they appear below the geometric split
line, which is the approach this patch implements.
Co-authored-by: Victoria Dye <vdye@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-18 04:26:06 +02:00
|
|
|
if ((p->pack_keep) ||
|
|
|
|
(string_list_has_string(&existing_kept_packs,
|
|
|
|
buf.buf)))
|
|
|
|
continue;
|
|
|
|
|
2021-02-23 03:25:27 +01:00
|
|
|
remove_redundant_pack(packdir, buf.buf);
|
|
|
|
}
|
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
2021-12-20 15:48:11 +01:00
|
|
|
if (show_progress)
|
2014-09-13 09:28:01 +02:00
|
|
|
opts |= PRUNE_PACKED_VERBOSE;
|
|
|
|
prune_packed_objects(opts);
|
repack -ad: prune the list of shallow commits
`git repack` can drop unreachable commits without further warning,
making the corresponding entries in `.git/shallow` invalid, which causes
serious problems when deepening the branches.
One scenario where unreachable commits are dropped by `git repack` is
when a `git fetch --prune` (or even a `git fetch` when a ref was
force-pushed in the meantime) can make a commit unreachable that was
reachable before.
Therefore it is not safe to assume that a `git repack -adlf` will keep
unreachable commits alone (under the assumption that they had not been
packed in the first place, which is an assumption at least some of Git's
code seems to make).
This is particularly important to keep in mind when looking at the
`.git/shallow` file: if any commits listed in that file become
unreachable, it is not a problem, but if they go missing, it *is* a
problem. One symptom of this problem is that a deepening fetch may now
fail with
fatal: error in object: unshallow <commit-hash>
To avoid this problem, let's prune the shallow list in `git repack` when
the `-d` option is passed, unless `-A` is passed, too (which would force
the now-unreachable objects to be turned into loose objects instead of
being deleted). Additionally, we also need to take `--keep-reachable`
and `--unpack-unreachable=<date>` into account.
Note: an alternative solution discussed during the review of this patch
was to teach `git fetch` to simply ignore entries in .git/shallow if the
corresponding commits do not exist locally. A quick test, however,
revealed that the .git/shallow file is written during a shallow *clone*,
in which case the commits do not exist, either, but the "shallow" line
*does* need to be sent. Therefore, this approach would be a lot more
finicky than the approach presented by this patch.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-24 17:56:13 +02:00
|
|
|
|
|
|
|
if (!keep_unreachable &&
|
|
|
|
(!(pack_everything & LOOSEN_UNREACHABLE) ||
|
|
|
|
unpack_unreachable) &&
|
|
|
|
is_repository_shallow(the_repository))
|
|
|
|
prune_shallow(PRUNE_QUICK);
|
2013-09-15 17:33:20 +02:00
|
|
|
}
|
|
|
|
|
2022-03-14 08:42:46 +01:00
|
|
|
if (run_update_server_info)
|
2014-09-13 09:28:01 +02:00
|
|
|
update_server_info(0);
|
2018-10-12 19:34:20 +02:00
|
|
|
|
2021-08-31 22:52:43 +02:00
|
|
|
if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) {
|
|
|
|
unsigned flags = 0;
|
|
|
|
if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
|
|
|
|
flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
|
midx: preliminary support for `--refs-snapshot`
To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.
This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.
This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).
By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.
This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e., are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.
The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).
When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
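The snapshot format described in the message above (one object name per line, with an optional leading '+' for preferred tips) could be read with something like the following. This is a sketch under stated assumptions, not the parser in midx.c: `sketch_parse_snapshot_line` and `SKETCH_HEXSZ` are hypothetical names, and object names are assumed to be 40-character SHA-1 hex strings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SKETCH_HEXSZ 40 /* assumption: SHA-1 object names */

/*
 * Hypothetical helper: parse one snapshot line of the form
 *     [+]<hex object name>
 * optionally terminated by a newline. On success, returns 0, copies the
 * object name into `oid`, and sets *preferred to 1 when the line started
 * with '+'. Returns -1 if the line is malformed.
 */
int sketch_parse_snapshot_line(const char *line, int *preferred,
			       char oid[SKETCH_HEXSZ + 1])
{
	size_t i;

	assert(line && preferred && oid);

	*preferred = (*line == '+');
	if (*preferred)
		line++;

	/* A short line hits its NUL terminator here and is rejected. */
	for (i = 0; i < SKETCH_HEXSZ; i++) {
		if (!isxdigit((unsigned char)line[i]))
			return -1;
		oid[i] = line[i];
	}
	oid[SKETCH_HEXSZ] = '\0';

	/* Nothing but an optional newline may follow the object name. */
	if (line[i] != '\0' && line[i] != '\n')
		return -1;
	return 0;
}
```

A caller would feed each line of the snapshot file to this function and hand the preferred tips to bitmap selection first.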
2021-09-29 03:55:07 +02:00
|
|
|
write_midx_file(get_object_directory(), NULL, NULL, flags);
|
2021-08-31 22:52:43 +02:00
|
|
|
}
|
2018-10-12 19:34:20 +02:00
|
|
|
|
2022-10-22 02:21:45 +02:00
|
|
|
string_list_clear(&names, 1);
|
2021-09-29 03:55:12 +02:00
|
|
|
string_list_clear(&existing_nonkept_packs, 0);
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need this list when writing a MIDX during a repack that is not
all-into-one (a-i-o).
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 03:55:10 +02:00
|
|
|
string_list_clear(&existing_kept_packs, 0);
|
2021-02-23 03:25:27 +01:00
|
|
|
clear_pack_geometry(geometry);
|
2013-09-15 17:33:20 +02:00
|
|
|
strbuf_release(&line);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|