Change the error emitted when a commit-graph file is corrupt so that
we actually mention the commit-graph, e.g. change errors like:
error: improper chunk offset 0000000000385e0c
To:
error: commit-graph improper chunk offset 0000000000385e0c
As discussed in the commits leading up to this one the commit-graph
machinery is now used by common commands like "status". If the graph
was corrupt we'd often emit some error that gave no indication what
was wrong. Now some of them are still cryptic, but they'll at least
mention "commit-graph" to give the user a hint as to where to look.
While I'm at it mark some of the strings that hadn't been marked for
translation. It's clear from the commit history and the code that this
was merely forgotten at the time, and wasn't intentional.p5
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the commit-graph is written we end up calling
parse_commit(). This will in turn invoke code that'll consult the
existing commit-graph about the commit, if the graph is corrupted we
die.
We thus get into a state where a failing "commit-graph verify" can't
be followed-up with a "commit-graph write" if core.commitGraph=true is
set, the graph either needs to be manually removed to proceed, or
core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new
parse_commit_no_graph() helper instead of parse_commit() to avoid
this. The latter will call repo_parse_commit_internal() with
use_commit_graph=1 as seen in 177722b344 ("commit: integrate commit
graph with commit parsing", 2018-04-10).
Not using the old graph at all slows down the writing of the new graph
by some small amount, but is a sensible way to prevent an error in the
existing commit-graph from spreading.
Just fixing the current issue would be likely to result in code that's
inadvertently broken in the future. New code might use the
commit-graph at a distance. To detect such cases introduce a
"GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" setting used when we do our
corruption tests, and test that a "write/verify" combo works after
every one of our current test cases where we now detect commit-graph
corruption.
Some of the code changes here might be strictly unnecessary, e.g. I
was unable to find cases where the parse_commit() called from
write_graph_chunk_data() didn't exit early due to
"item->object.parsed" being true in
repo_parse_commit_internal() (before the use_commit_graph=1 has any
effect). But let's also convert those cases for good measure, we do
not have exhaustive tests for all possible types of commit-graph
corruption.
This might need to be re-visited if we learn to write the commit-graph
incrementally, but probably not. Hopefully we'll just start by finding
out what commits we have in total, then read the old graph(s) to see
what they cover, and finally write a new graph file with everything
that's missing. In that case the new graph writing code just needs to
continue to use e.g. a parse_commit() that doesn't consult the
existing commit-graphs.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
An earlier change implemented load_commit_graph_one_fd_st() in a way
that was bug-compatible with earlier code in terms of the "graph file
%s is too small" error message printing out the path to the
commit-graph (".git/objects/info/commit-graph").
But change that, because:
* A function that takes an already-open file descriptor also needing
the filename isn't very intuitive.
* The vast majority of errors we might emit when loading the graph
come from parse_commit_graph(), which doesn't report the
filename. Let's not do that either in this case for consistency.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make the commit-graph loading code work as a library that returns an
error code instead of calling exit(1) when the commit-graph is
corrupt. This means that e.g. "status" will now report commit-graph
corruption as an "error: [...]" at the top of its output, but then
proceed to work normally.
This required splitting up the load_commit_graph_one() function so
that the code that deals with open()-ing and stat()-ing the graph can
now be called independently as open_commit_graph().
This is needed because "commit-graph verify" where the graph doesn't
exist isn't an error. See the third paragraph in
283e68c72f ("commit-graph: add 'verify' subcommand",
2018-06-27). There's a bug in that logic where we conflate the
intended ENOENT with other errno values (e.g. EACCES), but this change
doesn't address that. That'll be addressed in a follow-up change.
I'm then splitting most of the logic out of load_commit_graph_one()
into load_commit_graph_one_fd_st(), which allows for providing an
existing file descriptor and stat information to the loading
code. This isn't strictly needed, but it would be redundant and
confusing to open() and stat() the file twice for some of the
codepaths, this allows for calling open_commit_graph() followed by
load_commit_graph_one_fd_st(). The "graph_file" still needs to be
passed to that function for the the "graph file %s is too small" error
message.
This leaves load_commit_graph_one() unused by everything except the
internal prepare_commit_graph_one() function, so let's mark it as
"static". If someone needs it in the future we can remove the "static"
attribute. I could also rewrite its sole remaining
user ("prepare_commit_graph_one()") to use
load_commit_graph_one_fd_st() instead, but let's leave it at this.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When core.commitGraph=true is set, various common commands now consult
the commit graph. Because the commit-graph code is very trusting of
its input data, it's possibly to construct a graph that'll cause an
immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In
some other cases where git immediately exits with a cryptic error
about the graph being broken.
The root cause of this is that while the "commit-graph verify"
sub-command exhaustively verifies the graph, other users of the graph
simply trust the graph, and will e.g. deference data found at certain
offsets as pointers, causing segfaults.
This change does the bare minimum to ensure that we don't segfault in
the common fill_commit_in_graph() codepath called by
e.g. setup_revisions(), to do this instrument the "commit-graph
verify" tests to always check if "status" would subsequently
segfault. This fixes the following tests which would previously
segfault:
not ok 50 - detect low chunk count
not ok 51 - detect missing OID fanout chunk
not ok 52 - detect missing OID lookup chunk
not ok 53 - detect missing commit data chunk
Those happened because with the commit-graph enabled setup_revisions()
would eventually call fill_commit_in_graph(), where e.g.
g->chunk_commit_data is used early as an offset (and will be
0x0). With this change we get far enough to detect that the graph is
broken, and show an error instead. E.g.:
$ git status; echo $?
error: commit-graph is missing the Commit Data chunk
1
That also sucks, we should *warn* and not hard-fail "status" just
because the commit-graph is corrupt, but fixing is left to a follow-up
change.
A side-effect of changing the reporting from graph_report() to error()
is that we now have an "error: " prefix for these even for
"commit-graph verify". Pseudo-diff before/after:
$ git commit-graph verify
-commit-graph is missing the Commit Data chunk
+error: commit-graph is missing the Commit Data chunk
Changing that is OK. Various errors it emits now early on are prefixed
with "error: ", moving these over and changing the output doesn't
break anything.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The codepath to show progress meter while writing out commit-graph
file has been improved.
* ab/commit-graph-write-progress:
commit-graph write: emit a percentage for all progress
commit-graph write: add itermediate progress
commit-graph write: remove empty line for readability
commit-graph write: add more descriptive progress output
commit-graph write: show progress for object search
commit-graph write: more descriptive "writing out" output
commit-graph write: add "Writing out" progress output
commit-graph: don't call write_graph_chunk_extra_edges() unnecessarily
commit-graph: rename "large edges" to "extra edges"
The codepath to write out commit-graph has been optimized by
following the usual pattern of visiting objects in in-pack order.
* ab/commit-graph-write-optim:
commit-graph write: use pack order when finding commits
The codepath to read from the commit-graph file attempted to read
past the end of it when the file's table-of-contents was corrupt.
* js/commit-graph-chunk-table-fix:
Makefile: correct example fuzz build
commit-graph: fix buffer read-overflow
commit-graph, fuzz: add fuzzer for commit-graph
The in-core repository instances are passed through more codepaths.
* sb/more-repo-in-api: (23 commits)
t/helper/test-repository: celebrate independence from the_repository
path.h: make REPO_GIT_PATH_FUNC repository agnostic
commit: prepare free_commit_buffer and release_commit_memory for any repo
commit-graph: convert remaining functions to handle any repo
submodule: don't add submodule as odb for push
submodule: use submodule repos for object lookup
pretty: prepare format_commit_message to handle arbitrary repositories
commit: prepare logmsg_reencode to handle arbitrary repositories
commit: prepare repo_unuse_commit_buffer to handle any repo
commit: prepare get_commit_buffer to handle any repo
commit-reach: prepare in_merge_bases[_many] to handle any repo
commit-reach: prepare get_merge_bases to handle any repo
commit-reach.c: allow get_merge_bases_many_0 to handle any repo
commit-reach.c: allow remove_redundant to handle any repo
commit-reach.c: allow merge_bases_many to handle any repo
commit-reach.c: allow paint_down_to_common to handle any repo
commit: allow parse_commit* to handle any repo
object: parse_object to honor its repository argument
object-store: prepare has_{sha1, object}_file to handle any repo
object-store: prepare read_object_file to deal with any repo
...
Add sha-256 hash and plug it through the code to allow building Git
with the "NewHash".
* bc/sha-256:
hash: add an SHA-256 implementation using OpenSSL
sha256: add an SHA-256 implementation using libgcrypt
Add a base implementation of SHA-256 support
commit-graph: convert to using the_hash_algo
t/helper: add a test helper to compute hash speed
sha1-file: add a constant for hash block size
t: make the sha1 test-tool helper generic
t: add basic tests for our SHA-1 implementation
cache: make hashcmp and hasheq work with larger hashes
hex: introduce functions to print arbitrary hashes
sha1-file: provide functions to look up hash algorithms
sha1-file: rename algorithm to "sha1"
Follow-up 01ca387774 ("commit-graph: split up close_reachable()
progress output", 2018-11-19) by making the progress bars in
close_reachable() report a completion percentage. This fixes the last
occurrence where in the commit graph writing where we didn't report
that.
The change in 01ca387774 split up the 1x progress bar in
close_reachable() into 3x, but left them as dumb counters without a
percentage completion. Fixing that is easy, and the only reason it
wasn't done already is because that commit was rushed in during the
v2.20.0 RC period to fix the unrelated issue of over-reporting commit
numbers. See [1] and follow-ups for ML activity at the time and [2]
for an alternative approach where the progress bars weren't split up.
Now for e.g. linux.git we'll emit:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
Expanding reachable commits in commit graph: 100% (815990/815980), done.
Computing commit graph generation numbers: 100% (815983/815983), done.
Writing out commit graph in 4 passes: 100% (3263932/3263932), done.
1. https://public-inbox.org/git/20181119202300.18670-1-avarab@gmail.com/
2. https://public-inbox.org/git/20181122153922.16912-11-avarab@gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add progress output to sections of code between "Annotating[...]" and
"Computing[...]generation numbers". This can collectively take 5-10
seconds on a large enough repository.
On a test repository with I have with ~7 million commits and ~50
million objects we'll now emit:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (124763727/124763727), done.
Loading known commits in commit graph: 100% (18989461/18989461), done.
Expanding reachable commits in commit graph: 100% (18989507/18989461), done.
Clearing commit marks in commit graph: 100% (18989507/18989507), done.
Counting distinct commits in commit graph: 100% (18989507/18989507), done.
Finding extra edges in commit graph: 100% (18989507/18989507), done.
Computing commit graph generation numbers: 100% (7250302/7250302), done.
Writing out commit graph in 4 passes: 100% (29001208/29001208), done.
Whereas on a medium-sized repository such as linux.git these new
progress bars won't have time to kick in and as before and we'll still
emit output like:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
Expanding reachable commits in commit graph: 815990, done.
Computing commit graph generation numbers: 100% (815983/815983), done.
Writing out commit graph in 4 passes: 100% (3263932/3263932), done.
The "Counting distinct commits in commit graph" phase will spend most
of its time paused at "0/*" as we QSORT(...) the list. That's not
optimal, but at least we don't seem to be stalling anymore most of the
time.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove the empty line between a QSORT(...) and the subsequent oideq()
for-loop. This makes it clearer that the QSORT(...) is being done so
that we can run the oideq() loop on adjacent OIDs. Amends code added
in 08fd81c9b6 ("commit-graph: implement write_commit_graph()",
2018-04-02).
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make the progress output shown when we're searching for commits to
include in the graph more descriptive. This amends code I added in
7b0f229222 ("commit-graph write: add progress output", 2018-09-17).
Now, on linux.git, we'll emit this sort of output in the various modes
we support:
$ git commit-graph write
Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
[...]
# Actually we don't emit this since this takes almost no time at
# all. But if we did (s/_delayed//) we'd show:
$ git for-each-ref --format='%(objectname)' | git commit-graph write --stdin-commits
Finding commits for commit graph from 630 refs: 100% (630/630), done.
[...]
$ (cd .git/objects/pack/ && ls *idx) | git commit-graph write --stdin-pack
Finding commits for commit graph in 3 packs: 6529159, done.
[...]
The middle on of those is going to be the output users might see in
practice, since it'll be emitted when they get the commit graph via
gc.writeCommitGraph=true. But as noted above you need a really large
number of refs for this message to show. It'll show up on a test
repository I have with ~165k refs:
Finding commits for commit graph from 165203 refs: 100% (165203/165203), done.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Show the percentage progress for the "Finding commits for commit
graph" phase for the common case where we're operating on all packs in
the repository, as "commit-graph write" or "gc" will do.
Before we'd emit on e.g. linux.git with "commit-graph write":
Finding commits for commit graph: 6529159, done.
[...]
And now:
Finding commits for commit graph: 100% (6529159/6529159), done.
[...]
Since the commit graph only includes those commits that are packed
(via for_each_packed_object(...)) the approximate_object_count()
returns the actual number of objects we're going to process.
Still, it is possible due to a race with "gc" or another process
maintaining packs that the number of objects we're going to process is
lower than what approximate_object_count() reported. In that case we
don't want to stop the progress bar short of 100%. So let's make sure
it snaps to 100% at the end.
The inverse case is also possible and more likely. I.e. that a new
pack has been added between approximate_object_count() and
for_each_packed_object(). In that case the percentage will go beyond
100%, and we'll do nothing to snap it back to 100% at the end.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make the "Writing out" part of the progress output more
descriptive. Depending on the shape of the graph we either make 3 or 4
passes over it.
Let's present this information to the user in case they're wondering
what this number, which is much larger than their number of commits,
has to do with writing out the commit graph. Now e.g. on linux.git we
emit:
$ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/linux commit-graph write
Finding commits for commit graph: 6529159, done.
Expanding reachable commits in commit graph: 815990, done.
Computing commit graph generation numbers: 100% (815983/815983), done.
Writing out commit graph in 4 passes: 100% (3263932/3263932), done.
A note on i18n: Why are we using the Q_() function and passing a
number & English text for a singular which'll never be used? Because
the plural rules of translated languages may not match those of
English, and to use the plural function we need to use this format.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add progress output to be shown when we're writing out the
commit-graph, this adds to the output already added in 7b0f229222
("commit-graph write: add progress output", 2018-09-17).
As noted in that commit most of the progress output isn't displayed on
small repositories, but before this change we'd noticeably hang for
2-3 seconds at the end on medium sized repositories such as linux.git.
Now we'll instead show output like this, and reduce the
human-observable times at which we're not producing progress output:
$ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/2015-04-03-1M-git commit-graph write
Finding commits for commit graph: 13064614, done.
Expanding reachable commits in commit graph: 1000447, done.
Computing commit graph generation numbers: 100% (1000447/1000447), done.
Writing out commit graph: 100% (3001341/3001341), done.
This "Writing out" number is 3x or 4x the number of commits, depending
on the graph we're processing. A later change will make this explicit
to the user.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The optional 'Extra Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents. Since the
chunk is optional, write_commit_graph() looks through all commits to
find those with more than two parents, and then writes the commit
graph file header accordingly, i.e. if there are no such commits, then
there won't be a 'Extra Edge List' chunk written, only the three
mandatory chunks.
However, when it later comes to writing actual chunk data,
write_commit_graph() unconditionally invokes
write_graph_chunk_extra_edges(), even when it was decided earlier that
that chunk won't be written. Strictly speaking there is no bug here,
because write_graph_chunk_extra_edges() won't write anything if it
doesn't find any commits with more than two parents, but then it
unnecessarily and in vain looks through all commits once again in
search for such commits.
Don't call write_graph_chunk_extra_edges() when that chunk won't be
written to spare an unnecessary iteration over all commits.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges(). However, it's not a really great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk. It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits. And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refer to the same thing, and this "extra edges" term is much better at
describing these edges.
So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges". There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.
We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text. The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its input, but that command
is explicitly documented as debugging aid, so we can change its output
and the affected tests safely.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Slightly optimize the "commit-graph write" step by using
FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack(). See commit
[1] and [2] for the facility and a similar optimization for "cat-file".
On Linux it is around 5% slower to run:
echo 1 >/proc/sys/vm/drop_caches &&
cat .git/objects/pack/* >/dev/null &&
git cat-file --batch-all-objects --batch-check --unordered
Than the same thing with the "cat" omitted. This is as expected, since
we're iterating in pack order and the "cat" is extra work.
Before this change the opposite was true of "commit-graph write". We
were 6% faster if we first ran "cat" to efficiently populate the FS
cache for our sole big pack on linux.git, than if we had populated it
via for_each_object_in_pack(). Now we're 3% faster without the "cat"
instead.
My tests were done on an unloaded Linux 3.10 system with 10 runs for
each. Derrick Stolee did his own tests on Windows[3] showing a 2%
improvement with a high degree of accuracy.
1. 736eb88fdc ("for_each_packed_object: support iterating in
pack-order", 2018-08-10)
2. 0750bb5b51 ("cat-file: support "unordered" output for
--batch-all-objects", 2018-08-10)
3. https://public-inbox.org/git/f71fa868-25e8-a9c9-46a6-611b987f1a8f@gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
fuzz-commit-graph identified a case where Git will read past the end of
a buffer containing a commit graph if the graph's header has an
incorrect chunk count. A simple bounds check in parse_commit_graph()
prevents this.
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Break load_commit_graph_one() into a new function, parse_commit_graph().
The latter function operates on arbitrary buffers, which makes it
suitable as a fuzzing target. Since parse_commit_graph() is only called
by load_commit_graph_one() (and the fuzzer described below), we omit
error messages that would be duplicated by the caller.
Adds fuzz-commit-graph.c, which provides a fuzzing entry point
compatible with libFuzzer (and possibly other fuzzing engines).
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When writing a commit-graph, we write GRAPH_MISSING_PARENT if the
parent's object id does not appear in the list of commits to be
written into the commit-graph. This was done as the initial design
allowed commits to have missing parents, but the final version
requires the commit-graph to be closed under reachability. Thus,
this GRAPH_MISSING_PARENT value should never be written.
However, there are reasons why it could be written! These range
from a bug in the reachable-closure code to a memory error causing
the binary search into the list of object ids to fail. In either
case, we should fail fast and avoid writing the commit-graph file
with bad data.
Remove the GRAPH_MISSING_PARENT constant in favor of the constant
GRAPH_EDGE_LAST_MASK, which has the same value.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert all functions to handle arbitrary repositories in commit-graph.c
that are used by functions taking a repository argument already.
Notable exclusion is write_commit_graph and its local functions as that
only works on the_repository.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Amend the progress output added in 7b0f229222 ("commit-graph write:
add progress output", 2018-09-17) so that the total numbers it reports
aren't higher than the total number of commits anymore. See [1] for a
bug report pointing that out.
When I added this I wasn't intending to provide an accurate count, but
just have some progress output to show the user the command wasn't
hanging[2]. But since we are showing numbers, let's make them
accurate. The progress descriptions were suggested by Derrick Stolee
in [3].
As noted in [2] we are unlikely to show anything except the "Expanding
reachable..." message even on fairly large repositories such as
linux.git. On a test repository I have with north of 7 million commits
all of these are displayed. Two of them don't show up for long, but as
noted in [5] future-proofing this for if the loops become more
expensive in the future makes sense.
1. https://public-inbox.org/git/20181010203738.GE23446@szeder.dev/
2. https://public-inbox.org/git/87pnwhea8y.fsf@evledraar.gmail.com/
3. https://public-inbox.org/git/f7a0cbee-863c-61d3-4959-5cec8b43c705@gmail.com/
4. https://public-inbox.org/git/20181015160545.GG19800@szeder.dev/
5. https://public-inbox.org/git/87murle8da.fsf@evledraar.gmail.com/
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of using hard-coded constants for object sizes, use
the_hash_algo to look them up. In addition, use a function call to look
up the object ID version and produce the correct value. For now, we use
version 1, which means to use the default algorithm used in the rest of
the repository.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In preparation for unifying the handling of alt odb's and the normal
repo object directory, let's use a more neutral name. This patch is
purely mechanical, swapping the type name, and converting any variables
named "alt" to "odb". There should be no functional change, but it will
reduce the noise in subsequent diffs.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The recently introduced commit-graph auxiliary data is incompatible
with mechanisms such as replace & grafts that "breaks" immutable
nature of the object reference relationship. Disable optimizations
based on its use (and updating existing commit-graph) when these
incompatible features are in use in the repository.
* ds/commit-graph-with-grafts:
commit-graph: close_commit_graph before shallow walk
commit-graph: not compatible with uninitialized repo
commit-graph: not compatible with grafts
commit-graph: not compatible with replace objects
test-repository: properly init repo
commit-graph: update design document
refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback
refs.c: migrate internal ref iteration to pass thru repository argument
Generation of (experimental) commit-graph files have so far been
fairly silent, even though it takes noticeable amount of time in a
meaningfully large repository. The users will now see progress
output.
* ab/commit-graph-progress:
gc: fix regression in 7b0f229222 impacting --quiet
commit-graph verify: add progress output
commit-graph write: add progress output
While writing a commit-graph file, we store the full list of
commits in a flat list. We use this list for sorting and ensuring
we are closed under reachability.
The initial allocation assumed that (at most) one in four objects
is a commit. This is a dramatic over-count for many repos,
especially large ones. Since we grow the repo dynamically, reduce
this count by a factor of eight. We still set it to a minimum of
1024 before allocating.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The write_commit_graph() method in commit-graph.c leaks some lits
and strings during execution. In addition, a list of strings is
leaked in write_commit_graph_reachable(). Clean these up so our
memory checking is cleaner.
Further, if we use a list of pack-files to find the commits, we
can leak the packed_git structs after scanning them for commits.
Running the following commands demonstrates the leak before and
the fix after:
* valgrind --leak-check=full ./git commit-graph write --reachable
* valgrind --leak-check=full ./git commit-graph write --stdin-packs
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
spatch transformation to replace boolean uses of !hashcmp() to
newly introduced oideq() is added, and applied, to regain
performance lost due to support of multiple hash algorithms.
* jk/cocci:
show_dirstat: simplify same-content check
read-cache: use oideq() in ce_compare functions
convert hashmap comparison functions to oideq()
convert "hashcmp() != 0" to "!hasheq()"
convert "oidcmp() != 0" to "!oideq()"
convert "hashcmp() == 0" to hasheq()
convert "oidcmp() == 0" to oideq()
introduce hasheq() and oideq()
coccinelle: use <...> for function exclusion
The code for computing history reachability has been shuffled,
obtained a bunch of new tests to cover them, and then being
improved.
* ds/reachable:
commit-reach: correct accidental #include of C file
commit-reach: use can_all_from_reach
commit-reach: make can_all_from_reach... linear
commit-reach: replace ref_newer logic
test-reach: test commit_contains
test-reach: test can_all_from_reach_with_flags
test-reach: test reduce_heads
test-reach: test get_merge_bases_many
test-reach: test is_descendant_of
test-reach: test in_merge_bases
test-reach: create new test tool for ref_newer
commit-reach: move can_all_from_reach_with_flags
upload-pack: generalize commit date cutoff
upload-pack: refactor ok_to_give_up()
upload-pack: make reachable() more generic
commit-reach: move commit_contains from ref-filter
commit-reach: move ref_newer from remote.c
commit.h: remove method declarations
commit-reach: move walk methods from commit.c
For the reasons explained in the "commit-graph write: add progress
output" commit leading up to this one, emit progress on "commit-graph
verify". Since e0fd51e1d7 ("fsck: verify commit-graph", 2018-06-27)
"git fsck" has called this command if core.commitGraph=true, but
there's been no progress output to indicate that anything was
different. Now there is (on my tiny dotfiles.git repository):
$ git -c core.commitGraph=true -C ~/ fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (2821/2821), done.
dangling blob 5b8bbdb9b788ed90459f505b0934619c17cc605b
Verifying commits in commit graph: 100% (867/867), done.
And on a larger repository, such as the 2015-04-03-1M-git.git test
repository:
$ time git -c core.commitGraph=true -C ~/g/2015-04-03-1M-git/ commit-graph verify
Verifying commits in commit graph: 100% (1000447/1000447), done.
real 0m7.813s
[...]
Since the "commit-graph verify" subcommand is never called from "git
gc", we don't have to worry about passing some some "report_progress"
progress variable around for this codepath.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before this change the "commit-graph write" command didn't report any
progress. On my machine this command takes more than 10 seconds to
write the graph for linux.git, and around 1m30s on the
2015-04-03-1M-git.git[1] test repository (a test case for a large
monorepository).
Furthermore, since the gc.writeCommitGraph setting was added in
d5d5d7b641 ("gc: automatically write commit-graph files", 2018-06-27),
there was no indication at all from a "git gc" run that anything was
different. This why one of the progress bars being added here uses
start_progress() instead of start_delayed_progress(), so that it's
guaranteed to be seen. E.g. on my tiny 867 commit dotfiles.git
repository:
$ git -c gc.writeCommitGraph=true gc
Enumerating objects: 2821, done.
[...]
Computing commit graph generation numbers: 100% (867/867), done.
On larger repositories, such as linux.git the delayed progress bar(s)
will kick in, and we'll show what's going on instead of, as was
previously happening, printing nothing while we write the graph:
$ git -c gc.writeCommitGraph=true gc
[...]
Annotating commits in commit graph: 1565573, done.
Computing commit graph generation numbers: 100% (782484/782484), done.
Note that here we don't show "Finding commits for commit graph", this
is because under "git gc" we seed the search with the commit
references in the repository, and that set is too small to show any
progress, but would e.g. on a smaller repo such as git.git with
--stdin-commits:
$ git rev-list --all | git -c gc.writeCommitGraph=true write --stdin-commits
Finding commits for commit graph: 100% (162576/162576), done.
Computing commit graph generation numbers: 100% (162576/162576), done.
With --stdin-packs we don't show any estimation of how much is left to
do. This is because we might be processing more than one pack. We
could be less lazy here and show progress, either by detecting that
we're only processing one pack, or by first looping over the packs to
discover how many commits they have. I don't see the point in doing
that work. So instead we get (on 2015-04-03-1M-git.git):
$ echo pack-<HASH>.idx | git -c gc.writeCommitGraph=true --exec-path=$PWD commit-graph write --stdin-packs
Finding commits for commit graph: 13064614, done.
Annotating commits in commit graph: 3001341, done.
Computing commit graph generation numbers: 100% (1000447/1000447), done.
No GC mode uses --stdin-packs. It's what they use at Microsoft to
manually compute the generation numbers for their collection of large
packs which are never coalesced.
The reason we need a "report_progress" variable passed down from "git
gc" is so that we don't report this output when we're running in the
process "git gc --auto" detaches from the terminal.
Since we write the commit graph from the "git gc" process itself (as
opposed to what we do with say the "git repack" phase), we'd end up
writing the output to .git/gc.log and reporting it to the user next
time as part of the "The last gc run reported the following[...]"
error, see 329e6e8794 ("gc: save log from daemonized gc --auto and
print it next time", 2015-09-19).
So we must keep track of whether or not we're running in that
demonized mode, and if so print no progress.
See [2] and subsequent replies for a discussion of an approach not
taken in compute_generation_numbers(). I.e. we're saying "Computing
commit graph generation numbers", even though on an established
history we're mostly skipping over all the work we did in the
past. This is similar to the white lie we tell in the "Writing
objects" phase (not all are objects being written).
Always showing progress is considered more important than
accuracy. I.e. on a repository like 2015-04-03-1M-git.git we'd hang
for 6 seconds with no output on the second "git gc" if no changes were
made to any objects in the interim if we'd take the approach in [2].
1. https://github.com/avar/2015-04-03-1M-git
2. <c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com>
(https://public-inbox.org/git/c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com/)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This rounds out the previous three patches, covering the
inequality logic for the "hash" variant of the functions.
As with the previous three, the accompanying code changes
are the mechanical result of applying the coccinelle patch;
see those patches for more discussion.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is the flip side of the previous two patches: checking
for a non-zero oidcmp() can be more strictly expressed as
inequality. Like those patches, we write "!= 0" in the
coccinelle transformation, which covers by isomorphism the
more common:
if (oidcmp(E1, E2))
As with the previous two patches, this patch can be achieved
almost entirely by running "make coccicheck"; the only
differences are manual line-wrap fixes to match the original
code.
There is one thing to note for anybody replicating this,
though: coccinelle 1.0.4 seems to miss the case in
builtin/tag.c, even though it's basically the same as all
the others. Running with 1.0.7 does catch this, so
presumably it's just a coccinelle bug that was fixed in the
interim.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using the more restrictive oideq() should, in the long run,
give the compiler more opportunities to optimize these
callsites. For now, this conversion should be a complete
noop with respect to the generated code.
The result is also perhaps a little more readable, as it
avoids the "zero is equal" idiom. Since it's so prevalent in
C, I think seasoned programmers tend not to even notice it
anymore, but it can sometimes make for awkward double
negations (e.g., we can drop a few !!oidcmp() instances
here).
This patch was generated almost entirely by the included
coccinelle patch. This mechanical conversion should be
completely safe, because we check explicitly for cases where
oidcmp() is compared to 0, which is what oideq() is doing
under the hood. Note that we don't have to catch "!oidcmp()"
separately; coccinelle's standard isomorphisms make sure the
two are treated equivalently.
I say "almost" because I did hand-edit the coccinelle output
to fix up a few style violations (it mostly keeps the
original formatting, but sometimes unwraps long lines).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph feature is tested in isolation by
t5318-commit-graph.sh and t6600-test-reach.sh, but there are many
more interesting scenarios involving commit walks. Many of these
scenarios are covered by the existing test suite, but we need to
maintain coverage when the optional commit-graph structure is not
present.
To allow running the full test suite with the commit-graph present,
add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar
to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try
to load the commit-graph when parsing commits, and writes the
commit-graph file after every 'git commit' command.
There are a few tests that rely on commits not existing in
pack-files to trigger important events, so manually set
GIT_TEST_COMMIT_GRAPH to false for the necessary commands.
There is one test in t6024-recursive-merge.sh that relies on the
merge-base algorithm picking one of two ambiguous merge-bases, and
the commit-graph feature changes which merge-base is picked.
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Call close_commit_graph() when about to start a rev-list walk that
includes shallow commits. This is necessary in code paths that "fake"
shallow commits for the sake of fetch. Specifically, test 351 in
t5500-fetch-pack.sh runs
git fetch --shallow-exclude one origin
with a file-based transfer. When the "remote" has a commit-graph, we do
not prevent the commit-graph from being loaded, but then the commits are
intended to be dynamically transferred into shallow commits during
get_shallow_commits_by_rev_list(). By closing the commit-graph before
this call, we prevent this interaction.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Augment commit_graph_compatible(r) to return false when the given
repository r has commit grafts or is a shallow clone. Test that in these
situations we ignore existing commit-graph files and we do not write new
commit-graph files.
Helped-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create new method commit_graph_compatible(r) to check if a given
repository r is compatible with the commit-graph feature. Fill the
method with a check to see if replace-objects exist. Test this
interaction succeeds, including ignoring an existing commit-graph and
failing to write a new commit-graph. However, we do ensure that
we write a new commit-graph by setting read_replace_refs to 0, thereby
ignoring the replace refs.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The API to iterate over all objects learned to optionally list
objects in the order they appear in packfiles, which helps locality
of access if the caller accesses these objects while as objects are
enumerated.
* jk/for-each-object-iteration:
for_each_*_object: move declarations to object-store.h
cat-file: use a single strbuf for all output
cat-file: split batch "buf" into two variables
cat-file: use oidset check-and-insert
cat-file: support "unordered" output for --batch-all-objects
cat-file: rename batch_{loose,packed}_object callbacks
t1006: test cat-file --batch-all-objects with duplicates
for_each_packed_object: support iterating in pack-order
for_each_*_object: give more comprehensive docstrings
for_each_*_object: take flag arguments as enum
for_each_*_object: store flag definitions in a single location
Many more strings are prepared for l10n.
* nd/i18n: (23 commits)
transport-helper.c: mark more strings for translation
transport.c: mark more strings for translation
sha1-file.c: mark more strings for translation
sequencer.c: mark more strings for translation
replace-object.c: mark more strings for translation
refspec.c: mark more strings for translation
refs.c: mark more strings for translation
pkt-line.c: mark more strings for translation
object.c: mark more strings for translation
exec-cmd.c: mark more strings for translation
environment.c: mark more strings for translation
dir.c: mark more strings for translation
convert.c: mark more strings for translation
connect.c: mark more strings for translation
config.c: mark more strings for translation
commit-graph.c: mark more strings for translation
builtin/replace.c: mark more strings for translation
builtin/pack-objects.c: mark more strings for translation
builtin/grep.c: mark strings for translation
builtin/config.c: mark more strings for translation
...