2018-04-02 22:34:19 +02:00
#ifndef COMMIT_GRAPH_H
#define COMMIT_GRAPH_H
2018-04-10 14:56:02 +02:00
#include "git-compat-util.h"
2018-06-27 15:24:32 +02:00
#include "repository.h"
2018-06-27 15:24:44 +02:00
#include "string-list.h"
2018-08-15 19:54:05 +02:00
#include "cache.h"
2020-02-04 06:51:50 +01:00
#include "object-store.h"
2018-08-29 14:49:04 +02:00
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
commit-graph write: don't die if the existing graph is corrupt
When the commit-graph is written we end up calling
parse_commit(). This will in turn invoke code that'll consult the
existing commit-graph about the commit; if the graph is corrupted, we
die.
We thus get into a state where a failing "commit-graph verify" can't
be followed up with a "commit-graph write" if core.commitGraph=true is
set; the graph either needs to be manually removed to proceed, or
core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new
parse_commit_no_graph() helper instead of parse_commit() to avoid
this. The latter will call repo_parse_commit_internal() with
use_commit_graph=1 as seen in 177722b344 ("commit: integrate commit
graph with commit parsing", 2018-04-10).
Not using the old graph at all slows down the writing of the new graph
by some small amount, but is a sensible way to prevent an error in the
existing commit-graph from spreading.
Just fixing the current issue would be likely to result in code that's
inadvertently broken in the future. New code might use the
commit-graph at a distance. To detect such cases introduce a
"GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" setting used when we do our
corruption tests, and test that a "write/verify" combo works after
every one of our current test cases where we now detect commit-graph
corruption.
Some of the code changes here might be strictly unnecessary, e.g. I
was unable to find cases where the parse_commit() called from
write_graph_chunk_data() didn't exit early due to
"item->object.parsed" being true in
repo_parse_commit_internal() (before the use_commit_graph=1 has any
effect). But let's also convert those cases for good measure, since we
do not have exhaustive tests for all possible types of commit-graph
corruption.
This might need to be re-visited if we learn to write the commit-graph
incrementally, but probably not. Hopefully we'll just start by finding
out what commits we have in total, then read the old graph(s) to see
what they cover, and finally write a new graph file with everything
that's missing. In that case the new graph writing code just needs to
continue to use e.g. a parse_commit() that doesn't consult the
existing commit-graphs.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-25 13:08:33 +01:00
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
2018-07-12 00:42:39 +02:00
struct commit;
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports an uninitialized
memory read [1] for the following command:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-03 22:18:02 +01:00
char *get_commit_graph_filename(struct object_directory *odb);
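The commit message above describes two ideas: deriving the graph path from an object directory handle rather than a raw string, and comparing handles by pointer instead of comparing path strings. A minimal sketch of both follows. `struct odb_sketch`, `same_odb()` and `graph_filename_sketch()` are hypothetical stand-ins, not git's API, and the `/info/commit-graph` suffix is an assumption about where the file lives under the object directory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for git's struct object_directory. */
struct odb_sketch {
	const char *path; /* assumed to already be in normalized form */
};

/* Identity check: two handles name the same directory iff the pointers match. */
static int same_odb(const struct odb_sketch *a, const struct odb_sketch *b)
{
	return a == b;
}

/*
 * Build "<odb path>/info/commit-graph" without any normalization step.
 * The caller frees the result; NULL is returned on allocation failure.
 */
static char *graph_filename_sketch(const struct odb_sketch *odb)
{
	static const char suffix[] = "/info/commit-graph";
	size_t len;
	char *out;

	assert(odb && odb->path);
	len = strlen(odb->path) + sizeof(suffix); /* sizeof counts the NUL */
	out = malloc(len);
	if (!out)
		return NULL;
	snprintf(out, len, "%s%s", odb->path, suffix);
	return out;
}
```

Because the path is never re-normalized, there is no return value to forget to check, which is the failure mode footnote [2] describes.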
2019-03-25 13:08:30 +01:00
int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
2018-04-10 14:56:05 +02:00
/*
 * Given a commit struct, try to fill the commit struct info, including:
 *  1. tree object
 *  2. date
 *  3. parents.
 *
 * Returns 1 if and only if the commit was found in the packed graph.
 *
 * See parse_commit_buffer() for the fallback after this call.
 */
2018-07-12 00:42:42 +02:00
int parse_commit_in_graph(struct repository *r, struct commit *item);
2018-05-01 14:47:13 +02:00
/*
 * It is possible that we loaded commit contents from the commit buffer,
 * but we also want to ensure the commit-graph content is correctly
 * checked and filled. Fill the graph_pos and generation members of
 * the given commit.
 */
2018-07-12 00:42:42 +02:00
void load_commit_graph_info(struct repository *r, struct commit *item);
2018-07-12 00:42:42 +02:00
struct tree *get_commit_tree_in_graph(struct repository *r,
				      const struct commit *c);
2018-04-10 14:56:02 +02:00
struct commit_graph {
	int graph_fd;

	const unsigned char *data;
	size_t data_len;

	unsigned char hash_len;
	unsigned char num_chunks;
	uint32_t num_commits;
	struct object_id oid;
2019-06-18 20:14:27 +02:00
	char *filename;
2020-02-03 22:18:00 +01:00
	struct object_directory *odb;
2019-06-18 20:14:24 +02:00
	uint32_t num_commits_in_base;
	struct commit_graph *base_graph;
2018-04-10 14:56:02 +02:00
	const uint32_t *chunk_oid_fanout;
	const unsigned char *chunk_oid_lookup;
	const unsigned char *chunk_commit_data;
commit-graph: rename "large edges" to "extra edges"
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges(). However, it's not really a great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk. It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits. And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refers to the same thing, and this "extra edges" term is much better at
describing these edges.
So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges". There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.
We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text. The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its output, but that command
is explicitly documented as debugging aid, so we can change its output
and the affected tests safely.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-19 21:21:13 +01:00
	const unsigned char *chunk_extra_edges;
2019-06-18 20:14:26 +02:00
	const unsigned char *chunk_base_graphs;
2018-04-10 14:56:02 +02:00
};
2020-02-03 22:18:04 +01:00
struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
						 struct object_directory *odb);
2020-02-03 22:18:00 +01:00
struct commit_graph *read_commit_graph_one(struct repository *r,
					   struct object_directory *odb);
2019-01-15 23:25:50 +01:00
struct commit_graph *parse_commit_graph(void *graph_map, int fd,
					size_t graph_size);
commit-reach: use can_all_from_reach
The is_descendant_of method previously used in_merge_bases() to check if
the commit can reach any of the commits in the provided list. This had
two performance problems:
1. The performance is quadratic in the worst case.
2. A single in_merge_bases() call requires walking beyond the target
commit in order to find the full set of boundary commits that may be
merge-bases.
The can_all_from_reach method avoids this quadratic behavior and can
limit the search beyond the target commits using generation numbers. It
requires a small prototype adjustment to stop using commit-date as a
cutoff, as that optimization is no longer appropriate here.
Since in_merge_bases() uses paint_down_to_common(), is_descendant_of()
naturally found cutoffs to avoid walking the entire commit graph. Since
we want to always return the correct result, we cannot use the
min_commit_date cutoff in can_all_from_reach. We then rely on generation
numbers to provide the cutoff.
Since not all repos will have a commit-graph file, nor will we always
have generation numbers computed for a commit-graph file, create a new
method, generation_numbers_enabled(), that checks for a commit-graph
file and sees if the first commit in the file has a non-zero generation
number. In the case that we do not have generation numbers, use the old
logic for is_descendant_of().
Performance was measured on a copy of the Linux repository using the
'test-tool reach is_descendant_of' command using this input:
A:v4.9
X:v4.10
X:v4.11
X:v4.12
X:v4.13
X:v4.14
X:v4.15
X:v4.16
X:v4.17
X:v3.0
Note that this input is tailored to demonstrate the quadratic nature of
the previous method, as it will compute merge-bases for v4.9 versus all
of the later versions before checking against v4.1.
Before: 0.26 s
After: 0.21 s
Since we previously used the is_descendant_of method in the ref_newer
method, we also measured performance there using
'test-tool reach ref_newer' with this input:
A:v4.9
B:v3.19
Before: 0.10 s
After: 0.08 s
By adding a new commit with parent v3.19, we test the non-reachable case
of ref_newer:
Before: 0.09 s
After: 0.08 s
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20 18:33:30 +02:00
/*
 * Return 1 if and only if the repository has a commit-graph
 * file and generation numbers are computed in that file.
 */
int generation_numbers_enabled(struct repository *r);
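The commit message above says the check looks for a commit-graph file and tests whether the first commit in it has a non-zero generation number. A minimal sketch of that logic follows. `struct graph_sketch`, `read_be32()` and `generation_numbers_enabled_sketch()` are hypothetical names, and the row layout used here (a big-endian 32-bit word at offset `hash_len + 8` whose upper 30 bits hold the generation) is an assumption about the Commit Data chunk, not something this header states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a loaded commit-graph. */
struct graph_sketch {
	const unsigned char *chunk_commit_data;
	unsigned char hash_len;
	uint32_t num_commits;
};

/* Read a 32-bit big-endian integer from p. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 1 only if a graph is present and its first commit has a generation. */
static int generation_numbers_enabled_sketch(const struct graph_sketch *g)
{
	uint32_t first_generation;

	if (!g || !g->num_commits)
		return 0; /* no graph, or an empty one */
	assert(g->chunk_commit_data != NULL);

	/* Assumed layout: generation sits in the upper 30 bits of this word. */
	first_generation = read_be32(g->chunk_commit_data + g->hash_len + 8) >> 2;
	return first_generation != 0;
}
```

Checking only the first row keeps the test cheap; callers that get 0 back are expected to use the older, generation-free logic.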
2019-08-05 10:02:39 +02:00
enum commit_graph_write_flags {
	COMMIT_GRAPH_WRITE_APPEND = (1 << 0),
	COMMIT_GRAPH_WRITE_PROGRESS = (1 << 1),
2019-08-05 10:02:40 +02:00
	COMMIT_GRAPH_WRITE_SPLIT = (1 << 2),
	/* Make sure that each OID in the input is a valid commit OID. */
	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
2019-08-05 10:02:39 +02:00
};
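Each enumerator above occupies its own bit, so a caller can combine several with bitwise OR and test them individually. The sketch below copies the enum from the header; `pick_write_flags()` and `write_flag_set()` are hypothetical helpers added only for illustration.

```c
#include <assert.h>

/* Copied from the header above so the sketch compiles on its own. */
enum commit_graph_write_flags {
	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
};

/* Hypothetical helper: non-zero if the single bit `flag` is set in `flags`. */
static int write_flag_set(enum commit_graph_write_flags flags,
			  enum commit_graph_write_flags flag)
{
	assert(flag != 0);
	return (flags & flag) != 0;
}

/* Hypothetical helper: translate three booleans into a flag set. */
static enum commit_graph_write_flags pick_write_flags(int append, int split,
						       int show_progress)
{
	int flags = 0; /* accumulate as int; ORed values are not named enumerators */

	if (append)
		flags |= COMMIT_GRAPH_WRITE_APPEND;
	if (split)
		flags |= COMMIT_GRAPH_WRITE_SPLIT;
	if (show_progress)
		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
	return (enum commit_graph_write_flags)flags;
}
```

For example, `pick_write_flags(1, 1, 0)` yields a value with the APPEND and SPLIT bits set and the PROGRESS bit clear.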
2019-06-18 20:14:32 +02:00
struct split_commit_graph_opts {
	int size_multiple;
	int max_commits;
	timestamp_t expire_time;
};
2019-06-12 15:29:37 +02:00
/*
 * The write_commit_graph* methods return zero on success
 * and a negative value on failure. Note that if the repository
 * is not compatible with the commit-graph feature, then the
 * methods will return 0 without writing a commit-graph.
 */
2020-02-04 06:51:50 +01:00
int write_commit_graph_reachable(struct object_directory *odb,
2019-08-05 10:02:39 +02:00
				 enum commit_graph_write_flags flags,
2019-06-18 20:14:32 +02:00
				 const struct split_commit_graph_opts *split_opts);
2020-02-04 06:51:50 +01:00
int write_commit_graph(struct object_directory *odb,
2019-06-12 15:29:37 +02:00
		       struct string_list *pack_indexes,
		       struct string_list *commit_hex,
2019-08-05 10:02:39 +02:00
		       enum commit_graph_write_flags flags,
2019-06-18 20:14:32 +02:00
		       const struct split_commit_graph_opts *split_opts);
2019-06-18 20:14:32 +02:00
#define COMMIT_GRAPH_VERIFY_SHALLOW (1 << 0)

int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
2019-05-17 20:41:47 +02:00
void close_commit_graph(struct raw_object_store *);
2018-07-12 00:42:40 +02:00
void free_commit_graph(struct commit_graph *);
upload-pack: disable commit graph more gently for shallow traversal
When the client has asked for certain shallow options like
"deepen-since", we do a custom rev-list walk that pretends to be
shallow. Before doing so, we have to disable the commit-graph, since it
is not compatible with the shallow view of the repository. That's
handled by 829a321569 (commit-graph: close_commit_graph before shallow
walk, 2018-08-20). That commit literally closes and frees our
repo->objects->commit_graph struct.
That creates an interesting problem for commits that have _already_ been
parsed using the commit graph. Their commit->object.parsed flag is set,
their commit->graph_pos is set, but their commit->maybe_tree may still
be NULL. When somebody later calls repo_get_commit_tree(), we see that
we haven't loaded the tree oid yet and try to get it from the commit
graph. But since it has been freed, we segfault!
So the root of the issue is a data dependency between the commit's
lazy-load of the tree oid and the fact that the commit graph can go
away mid-process. How can we resolve it?
There are a couple of general approaches:
1. The obvious answer is to avoid loading the tree from the graph when
we see that it's NULL. But then what do we return for the tree oid?
If we return NULL, our caller in do_traverse() will rightly
complain that we have no tree. We'd have to fall back to loading the
actual commit object and re-parsing it. That requires teaching
parse_commit_buffer() to understand re-parsing (i.e., not starting
from a clean slate and not leaking any allocated bits like parent
list pointers).
2. When we close the commit graph, walk through the set of in-memory
objects and clear any graph_pos pointers. But this means we also
have to "unparse" any such commits so that we know they still need
to open the commit object to fill in their trees. So it's no less
complicated than (1), and is more expensive (since we clear objects
we might not later need).
3. Stop freeing the commit-graph struct. Continue to let it be used
for lazy-loads of tree oids, but let upload-pack specify that it
shouldn't be used for further commit parsing.
4. Push the whole shallow rev-list out to its own sub-process, with
the commit-graph disabled from the start, giving it a clean memory
space to work from.
I've chosen (3) here. Options (1) and (2) would work, but are
non-trivial to implement. Option (4) is more expensive, and I'm not sure
how complicated it is (shelling out for the actual rev-list part is
easy, but we do then parse the resulting commits internally, and I'm not
clear which parts need to be handling shallow-ness).
The new test in t5500 triggers this segfault, but see the comments there
for how horribly intimate it has to be with how both upload-pack and
commit graphs work.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 16:44:45 +02:00
/*
 * Disable further use of the commit graph in this process when parsing a
 * "struct commit".
 */
void disable_commit_graph(struct repository *r);
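Option (3) in the commit message above keeps the commit-graph struct alive for lazy tree loads while preventing further commit parsing from it. One way to picture that split is a flag next to the graph pointer, as in the sketch below. All names here (`struct object_store_sketch`, `commit_graph_disabled`, and the three helpers) are hypothetical and do not claim to match git's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded commit-graph. */
struct graph_sketch {
	int num_commits;
};

/* Hypothetical per-process state: the graph plus a "stop parsing" flag. */
struct object_store_sketch {
	struct graph_sketch *commit_graph; /* stays allocated after disabling */
	int commit_graph_disabled;
};

/* Mark the graph as off-limits for parsing without freeing it. */
static void disable_commit_graph_sketch(struct object_store_sketch *o)
{
	assert(o != NULL);
	o->commit_graph_disabled = 1;
}

/* Parsing path: consult the graph only while it is present and enabled. */
static int may_parse_from_graph(const struct object_store_sketch *o)
{
	return o->commit_graph != NULL && !o->commit_graph_disabled;
}

/* Lazy tree-load path: already-parsed commits can still reach the graph. */
static int may_lazy_load_tree(const struct object_store_sketch *o)
{
	return o->commit_graph != NULL;
}
```

The point of the design is that the pointer is never freed mid-process, so a commit parsed earlier with a graph position can still resolve its tree, which avoids the segfault described in the commit message.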
2018-04-02 22:34:19 +02:00
#endif