git-commit-vandalism

Author	SHA1	Message	Date
Junio C Hamano	2d436678a7	Merge branch 'jk/log-cherry-pick-duplicate-patches' into maint When more than one commit with the same patch ID appears on one side, "git log --cherry-pick A...B" did not exclude them all when a commit with the same patch ID appears on the other side. Now it does. * jk/log-cherry-pick-duplicate-patches: patch-ids: handle duplicate hashmap entries	2021-02-05 16:31:28 -08:00
Jeff King	c9e3a4e76d	patch-ids: handle duplicate hashmap entries This fixes a bug introduced in `dfb7a1b4d0` (patch-ids: stop using a hand-rolled hashmap implementation, 2016-07-29) in which git rev-list --cherry-pick A...B will fail to suppress commits reachable from A even if a commit with matching patch-id appears in B. Around the time of that commit, the algorithm for "--cherry-pick" looked something like this: 0. Traverse all of the commits, marking them as being on the left or right side of the symmetric difference. 1. Iterate over the left-hand commits, inserting a patch-id struct for each into a hashmap, and pointing commit->util to the patch-id struct. 2. Iterate over the right-hand commits, checking which are present in the hashmap. If so, we exclude the commit from the output _and_ we mark the patch-id as "seen". 3. Iterate again over the left-hand commits, checking whether commit->util->seen is set; if so, exclude them from the output. At the end, we'll have eliminated commits from both sides that have a matching patch-id on the other side. But there's a subtle assumption here: for any given patch-id, we must have exactly one struct representing it. If two commits from A both have the same patch-id and we allow duplicates in the hashmap, then we run into a problem: a. In step 1, we insert two patch-id structs into the hashmap. b. In step 2, our lookups will find only one of these structs, so only one "seen" flag is marked. c. In step 3, one of the commits in A will have its commit->util->seen set, but the other will not. We'll erroneously output the latter. Prior to `dfb7a1b4d0`, our hashmap did not allow duplicates. Afterwards, it used hashmap_add(), which explicitly does allow duplicates. At that point, the solution would have been easy: when we are about to add a duplicate, skip doing so and return the existing entry which matches. But it gets more complicated. In `683f17ec44` (patch-ids: replace the seen indicator with a commit pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2, when the right-hand side finds a matching patch_id from the left-hand side, we can directly mark the left-hand patch_id->commit to be omitted. Solving that would be easy, too; there's a one-to-many relationship of patch-ids to commits, so we just need to keep a list. But there's more. Commit `b3dfeebb92` (rebase: avoid computing unnecessary patch IDs, 2016-07-29) built on that by lazily computing the full patch-ids. So we don't even know when adding to the hashmap whether two commits truly have the same id. We'd have to tentatively assign them a list, and then possibly split them apart (possibly into N new structs) at the moment we compute the real patch-ids. This could work, but it's complicated and error-prone. Instead, let's accept that we may store duplicates, and teach the lookup side to be more clever. Rather than asking for a single matching patch-id, it will need to iterate over all matching patch-ids. This does mean examining every entry in a single hash bucket, but the worst-case for a hash lookup was already doing that. We'll keep the hashmap details out of the caller by providing a simple iteration interface. We can retain the simple has_commit_patch_id() interface for the other callers, but we'll simplify its return value into an integer, rather than returning the patch_id struct. That way they won't be tempted to look at the "commit" field of the return value without iterating. Reported-by: Arnaud Morin <arnaud.morin@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-12 11:13:32 -08:00
Junio C Hamano	5dfb976460	Merge branch 'ma/grep-init-default' Code clean-up. * ma/grep-init-default: MyFirstObjectWalk: drop `init_walken_defaults()` grep: copy struct in one fell swoop grep: use designated initializers for `grep_defaults` grep: don't set up a "default" repo for grep	2020-12-08 15:11:20 -08:00
Martin Ågren	96313423a7	grep: use designated initializers for `grep_defaults` In `15fabd1bbd` ("builtin/grep.c: make configuration callback more reusable", 2012-10-09), we learned to fill a `static struct grep_opt grep_defaults` which we can use as a blueprint for other such structs. At the time, we didn't consider designated initializers to be widely useable, but these days, we do. (See, e.g., `cbc0f81d96` ("strbuf: use designated initializers in STRBUF_INIT", 2017-07-10).) Use designated initializers to let the compiler set up the struct and so that we don't need to remember to call `init_grep_defaults()`. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-21 14:50:33 -08:00
Martin Ågren	1d3878799f	grep: don't set up a "default" repo for grep `init_grep_defaults()` fills a `static struct grep_opt grep_defaults`. This struct is then used by `grep_init()` as a blueprint for other such structs. Notably, `grep_init()` takes a `struct repo ` and assigns it into the target struct. As a result, it is unnecessary for us to take a `struct repo ` in `init_grep_defaults()` as well. We assign it into the default struct and never look at it again. And in light of how we return early if we have already set up the default struct, it's not just unnecessary, but is also a bit confusing: If we are called twice and with different repos, is it a bug or a feature that we ignore the second repo? Drop the repo parameter for `init_grep_defaults()`. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-21 14:50:29 -08:00
Elijah Newren	b19315d8ab	Use new HASHMAP_INIT macro to simplify hashmap initialization Now that hashamp has lazy initialization and a HASHMAP_INIT macro, hashmaps allocated on the stack can be initialized without a call to hashmap_init() and in some cases makes the code a bit shorter. Convert some callsites over to take advantage of this. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-11 12:55:27 -08:00
Elijah Newren	6da1a25814	hashmap: provide deallocation function names hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-02 12:15:50 -08:00
Jeff King	e885a84f1b	drop unused argc parameters Many functions take an argv/argc pair, but never actually look at argc. This makes it useless at best (we use the NULL sentinel in argv to find the end of the array), and misleading at worst (what happens if the argc count does not match the argv NULL?). In each of these instances, the argv NULL does match the argc count, so there are no bugs here. But let's tighten the interfaces to make it harder to get wrong (and to reduce some -Wunused-parameter complaints). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-30 12:53:47 -07:00
Junio C Hamano	288ed98bf7	Merge branch 'tb/bloom-improvements' "git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'	2020-09-29 14:01:20 -07:00
Junio C Hamano	4831c23f75	Merge branch 'mf/submodule-summary-with-correct-repository' "git diff/show" on a change that involves a submodule used to read the information on commits in the submodule from a wrong repository and gave a wrong information when the commit-graph is involved. * mf/submodule-summary-with-correct-repository: submodule: use submodule repository when preparing summary revision: use repository from rev_info when parsing commits	2020-09-18 17:58:05 -07:00
Taylor Blau	312cff5207	bloom: split 'get_bloom_filter()' in two 'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-17 09:31:25 -07:00
Junio C Hamano	0df670bc0b	Merge branch 'jt/interpret-branch-name-fallback' "git status" has trouble showing where it came from by interpreting reflog entries that recordcertain events, e.g. "checkout @{u}", and gives a hard/fatal error. Even though it inherently is impossible to give a correct answer because the reflog entries lose some information (e.g. "@{u}" does not record what branch the user was on hence which branch 'the upstream' needs to be computed, and even if the record were available, the relationship between branches may have changed), at least hide the error to allow "status" show its output. * jt/interpret-branch-name-fallback: wt-status: tolerate dangling marks refs: move dwim_ref() to header file sha1-name: replace unsigned int with option struct	2020-09-09 13:53:09 -07:00
Junio C Hamano	c1ce30d364	Merge branch 'so/separate-field-for-m-and-diff-merges' Internal API clean-up to handle two options "diff-index" and "log" have, which happen to share the same short form, more sensibly. * so/separate-field-for-m-and-diff-merges: revision: add separate field for "-m" of "diff-index -m"	2020-09-09 13:53:07 -07:00
Taylor Blau	4f3644056a	commit-graph: introduce 'get_bloom_filter_settings()' Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since `a759bfa9ee` (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings ' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter '. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-09 12:51:48 -07:00
Jonathan Tan	a4f66a7876	sha1-name: replace unsigned int with option struct In preparation for a future patch adding a boolean parameter to repo_interpret_branch_name(), which might be easily confused with an existing unsigned int parameter, refactor repo_interpret_branch_name() to take an option struct instead of the unsigned int parameter. The static function interpret_branch_mark() is also updated to take the option struct in preparation for that future patch, since it will also make use of the to-be-introduced boolean parameter. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-02 14:39:17 -07:00
Junio C Hamano	cacab0c856	Merge branch 'jk/rev-input-given-fix' Feeding "$ZERO_OID" to "git log --ignore-missing --stdin", and running "git log --ignore-missing $ZERO_OID" fell back to start digging from HEAD; it has been corrected to become a no-op, like "git log --tags=no-tag-matches-this-pattern" does. * jk/rev-input-given-fix: revision: set rev_input_given in handle_revision_arg()	2020-08-31 15:49:52 -07:00
Sergey Organov	572fc9aa54	revision: add separate field for "-m" of "diff-index -m" Add separate 'match_missing' field for diff-index to use and set it when we encounter "-m" option. This field won't then be cleared when another meaning of "-m" is reverted (e.g., by "--no-diff-merges"), nor it will be affected by future option(s) that might drive 'ignore_merges' field. Use this new field from diff-lib:do_oneway_diff() instead of reusing 'ignore_merges' field. Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-31 13:42:58 -07:00
Jeff King	04a0e98515	revision: set rev_input_given in handle_revision_arg() Commit `7ba826290a` (revision: add rev_input_given flag, 2017-08-02) added a flag to rev_info to tell whether we got any revision arguments. As explained there, this is necessary because some revision arguments may not produce any pending traversal objects, but should still inhibit default behaviors (e.g., a glob that matches nothing). However, it only set the flag in the globbing code, but not for revisions we get on the command-line or via stdin. This leads to two problems: - the command-line code keeps its own separate got_rev_arg flag; this isn't wrong, but it's confusing and an extra maintenance burden - even specifically-named rev arguments might end up not adding any pending objects: if --ignore-missing is set, then specifying a missing object is a noop rather than an error. And that leads to some user-visible bugs: - when deciding whether a default rev like "HEAD" should kick in, we check both got_rev_arg and rev_input_given. That means that "--ignore-missing $ZERO_OID" works on the command-line (where we set got_rev_arg) but not on --stdin (where we don't) - when rev-list decides whether it should complain that it wasn't given a starting point, it relies on rev_input_given. So it can't even get the command-line "--ignore-missing $ZERO_OID" right Let's consistently set the flag if we got any revision argument. That lets us clean up the redundant got_rev_arg, and fixes both of those bugs (but note there are three new tests: we'll confirm the already working git-log command-line case). A few implementation notes: - conceptually we want to set the flag whenever handle_revision_arg() finds an actual revision arg ("handles" it, you might say). But it covers a ton of cases with early returns. Rather than annotating each one, we just wrap it and use its success exit-code to set the flag in one spot. - the new rev-list test is in t6018, which is titled to cover globs. This isn't exactly a glob, but it made sense to stick it with the other tests that handle the "even though we got a rev, we have no pending objects" case, which are globs. - the tests check for the oid of a missing object, which it's pretty clear --ignore-missing should ignore. You can see the same behavior with "--ignore-missing a-ref-that-does-not-exist", because --ignore-missing treats them both the same. That's perhaps less clearly correct, and we may want to change that in the future. But the way the code and tests here are written, we'd continue to do the right thing even if it does. Reported-by: Bryan Turner <bturner@atlassian.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-26 13:30:08 -07:00
Junio C Hamano	a555b514cd	Merge branch 'so/log-diff-merges-opt' Earlier, to countermand the implicit "-m" option when the "--first-parent" option is used with "git log", we added the "--[no-]diff-merges" option in the jk/log-fp-implies-m topic. To leave the door open to allow the "--diff-merges" option to take values that instructs how patches for merge commits should be computed (e.g. "cc"? "-p against first parent?"), redefine "--diff-merges" to take non-optional value, and implement "off" that means the same thing as "--no-diff-merges". * so/log-diff-merges-opt: t/t4013: add test for --diff-merges=off doc/git-log: describe --diff-merges=off revision: change "--diff-merges" option to require parameter	2020-08-17 17:02:50 -07:00
Junio C Hamano	eca8c62a50	Merge branch 'jk/log-fp-implies-m' "git log --first-parent -p" showed patches only for single-parent commits on the first-parent chain; the "--first-parent" option has been made to imply "-m". Use "--no-diff-merges" to restore the previous behaviour to omit patches for merge commits. * jk/log-fp-implies-m: doc/git-log: clarify handling of merge commit diffs doc/git-log: move "-t" into diff-options list doc/git-log: drop "-r" diff option doc/git-log: move "Diff Formatting" from rev-list-options log: enable "-m" automatically with "--first-parent" revision: add "--no-diff-merges" option to counteract "-m" log: drop "--cc implies -m" logic	2020-08-17 17:02:49 -07:00
Junio C Hamano	47f0f94bc7	Merge branch 'al/bisect-first-parent' "git bisect" learns the "--first-parent" option to find the first breakage along the first-parent chain. * al/bisect-first-parent: bisect: combine args passed to find_bisection() bisect: introduce first-parent flag cmd_bisect__helper: defer parsing no-checkout flag rev-list: allow bisect and first-parent flags t6030: modernize "git bisect run" tests	2020-08-17 17:02:45 -07:00
Sergey Organov	6501580ff8	revision: change "--diff-merges" option to require parameter --diff-merges=off is the only accepted form for now, a synonym for --no-diff-merges. This patch is a preparation for adding more values, as well as supporting --diff-merges=<parent>, where <parent> is single parent number to output diff against. Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-11 14:20:24 -07:00
Junio C Hamano	abde3d39ec	Merge branch 'so/rev-parser-errormessage-fix' Error message fix. * so/rev-parser-errormessage-fix: revision: fix die() message for "--unpacked="	2020-08-10 10:24:03 -07:00
Junio C Hamano	1aa3dff4ba	Merge branch 'jk/compiler-fixes-and-workarounds' Small fixes and workarounds. * jk/compiler-fixes-and-workarounds: revision: avoid leak when preparing bloom filter for "/" revision: avoid out-of-bounds read/write on empty pathspec config: work around gcc-10 -Wstringop-overflow warning	2020-08-10 10:24:02 -07:00
Junio C Hamano	46b225f153	Merge branch 'jk/strvec' The argv_array API is useful for not just managing argv but any "vector" (NULL-terminated array) of strings, and has seen adoption to a certain degree. It has been renamed to "strvec" to reduce the barrier to adoption. * jk/strvec: strvec: rename struct fields strvec: drop argv_array compatibility layer strvec: update documention to avoid argv_array strvec: fix indentation in renamed calls strvec: convert remaining callers away from argv_array name strvec: convert more callers away from argv_array name strvec: convert builtin/ callers away from argv_array name quote: rename sq_dequote_to_argv_array to mention strvec strvec: rename files from argv-array to strvec argv-array: rename to strvec argv-array: use size_t for count and alloc	2020-08-10 10:23:57 -07:00
Aaron Lipman	0fe305a5d3	rev-list: allow bisect and first-parent flags Add first_parent_only parameter to find_bisection(), removing the barrier that prevented combining the --bisect and --first-parent flags when using git rev-list Based-on-patch-by: Tiago Botelho <tiagonbotelho@hotmail.com> Signed-off-by: Aaron Lipman <alipman88@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-07 15:11:59 -07:00
Sergey Organov	f649aaaf82	revision: fix die() message for "--unpacked=" Get rid of the trailing dot and mark for translation. Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-04 17:01:37 -07:00
Jeff King	398e659e1e	revision: avoid leak when preparing bloom filter for "/" If we're given an empty pathspec, we refuse to set up bloom filters, as described in `f3c2a36810` (revision: empty pathspecs should not use Bloom filters, 2020-07-01). But before the empty string check, we drop any trailing slash by allocating a new string without it. So a pathspec consisting only of "/" will allocate that string, but then still cause us to bail, leaking the new string. Let's make sure to free it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-04 09:31:57 -07:00
Jeff King	fd9a631c56	revision: avoid out-of-bounds read/write on empty pathspec Running t4216 with ASan results in it complaining of an out-of-bounds read in prepare_to_use_bloom_filter(). The issue is this code to strip a trailing slash: last_index = pi->len - 1; if (pi->match[last_index] == '/') { because we have no guarantee that pi->len isn't zero. This can happen if the pathspec is ".", as we translate that to an empty string. And if that read of random memory does trigger the conditional, we'd then do an out-of-bounds write: path_alloc = xstrdup(pi->match); path_alloc[last_index] = '\0'; Let's make sure to check the length before subtracting. Note that for an empty pathspec, we'd end up bailing from the function a few lines later, which makes it tempting to just: if (!pi->len) return; early here. But our code here is stripping a trailing slash, and we need to check for emptiness after stripping that slash, too. So we'd have two blocks, which would require repeating some cleanup code. Instead, just skip the trailing-slash for an empty string. Setting last_index at all in the case is awkward since it will have a nonsense value (and it uses an "int", which is a too-small type for a string anyway). So while we're here, let's: - drop last_index entirely; it's only used in two spots right next to each other and writing out "pi->len - 1" in both is actually easier to follow - use xmemdupz() to duplicate the string. This is slightly more efficient, but more importantly makes the intent more clear by allocating the correct-sized substring in the first place. It also eliminates any question of whether path_alloc is as long as pi->match (which it would not be if pi->match has any embedded NULs, though in practice this is probably impossible). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-04 09:31:02 -07:00
Jeff King	d70a9eb611	strvec: rename struct fields The "argc" and "argv" names made sense when the struct was argv_array, but now they're just confusing. Let's rename them to "nr" (which we use for counts elsewhere) and "v" (which is rather terse, but reads well when combined with typical variable names like "args.v"). Note that we have to update all of the callers immediately. Playing tricks with the preprocessor is hard here, because we wouldn't want to rewrite unrelated tokens. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-30 19:18:06 -07:00
Junio C Hamano	70cdbbe3a7	Merge branch 'ds/commit-graph-bloom-updates' into master Updates to the changed-paths bloom filter. * ds/commit-graph-bloom-updates: commit-graph: check all leading directories in changed path Bloom filters revision: empty pathspecs should not use Bloom filters revision.c: fix whitespace commit-graph: check chunk sizes after writing commit-graph: simplify chunk writes into loop commit-graph: unify the signatures of all write_graph_chunk_*() functions commit-graph: persist existence of changed-paths bloom: fix logic in get_bloom_filter() commit-graph: change test to die on parse, not load commit-graph: place bloom_settings in context	2020-07-30 13:20:31 -07:00
Junio C Hamano	de6dda0dc3	Merge branch 'sg/commit-graph-cleanups' into master The changed-path Bloom filter is improved using ideas from an independent implementation. * sg/commit-graph-cleanups: commit-graph: simplify write_commit_graph_file() #2 commit-graph: simplify write_commit_graph_file() #1 commit-graph: simplify parse_commit_graph() #2 commit-graph: simplify parse_commit_graph() #1 commit-graph: clean up #includes diff.h: drop diff_tree_oid() & friends' return value commit-slab: add a function to deep free entries on the slab commit-graph-format.txt: all multi-byte numbers are in network byte order commit-graph: fix parsing the Chunk Lookup table tree-walk.c: don't match submodule entries for 'submod/anything'	2020-07-30 13:20:30 -07:00
Jeff King	6fae74b418	revision: add "--no-diff-merges" option to counteract "-m" The "-m" option sets revs->ignore_merges to "0", but there's no way to undo it. This probably isn't something anybody overly cares about, since "1" is already the default, but it will serve as an escape hatch when we flip the default for ignore_merges to "0" in more situations. We'll also add a few extra niceties: - initialize the value to "-1" to indicate "not set", and then resolve it to the normal 0/1 bool in setup_revisions(). This lets any tweak functions, as well as setup_revisions() itself, avoid clobbering the user's preference (which until now they couldn't actually express). - since we now have --no-diff-merges, let's add the matching --diff-merges, which is just a synonym for "-m". Then we don't even need to document --no-diff-merges separately; it countermands the long form of "-m" in the usual way. The new test shows that this behaves just the same as the current behavior without "-m". Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-29 13:43:57 -07:00
Jeff King	c972bf4cf5	strvec: convert remaining callers away from argv_array name We eventually want to drop the argv_array name and just use strvec consistently. There's no particular reason we have to do it all at once, or care about interactions between converted and unconverted bits. Because of our preprocessor compat layer, the names are interchangeable to the compiler (so even a definition and declaration using different names is OK). This patch converts all of the remaining files, as the resulting diff is reasonably sized. The conversion was done purely mechanically with: git ls-files '.c' '.h' \| xargs perl -i -pe ' s/ARGV_ARRAY/STRVEC/g; s/argv_array/strvec/g; ' We'll deal with any indentation/style fallouts separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-28 15:02:18 -07:00
Jeff King	dbbcd44fb4	strvec: rename files from argv-array to strvec This requires updating #include lines across the code-base, but that's all fairly mechanical, and was done with: git ls-files '.c' '.h' \| xargs perl -i -pe 's/argv-array.h/strvec.h/' Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-28 15:02:17 -07:00
Junio C Hamano	20d451c4da	Merge branch 'rs/line-log-until' into master "git log -Lx,y:path --before=date" lost track of where the range should be because it didn't take the changes made by the youngest commits that are omitted from the output into account. * rs/line-log-until: revision: disable min_age optimization with line-log	2020-07-09 14:00:42 -07:00
Junio C Hamano	645f63111b	Merge branch 'es/get-worktrees-unsort' API cleanup for get_worktrees() * es/get-worktrees-unsort: worktree: drop get_worktrees() unused 'flags' argument worktree: drop get_worktrees() special-purpose sorting option	2020-07-06 22:09:15 -07:00
René Scharfe	01faa91cb7	revision: disable min_age optimization with line-log If one of the options --before, --min-age or --until is given, limit_list() filters out younger commits early on. Line-log needs all those commits to trace the movement of line ranges, though. Skip this optimization if both are used together. Reported-by: Мария Долгополова <dolgopolovamariia@gmail.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-06 18:38:03 -07:00
SZEDER Gábor	c525ce95b4	commit-graph: check all leading directories in changed path Bloom filters The file 'dir/subdir/file' can only be modified if its leading directories 'dir' and 'dir/subdir' are modified as well. So when checking modified path Bloom filters looking for commits modifying a path with multiple path components, then check not only the full path in the Bloom filters, but all its leading directories as well. Take care to check these paths in "deepest first" order, because it's the full path that is least likely to be modified, and the Bloom filter queries can short circuit sooner. This can significantly reduce the average false positive rate, by about an order of magnitude or three(!), and can further speed up pathspec-limited revision walks. The table below compares the average false positive rate and runtime of git rev-list HEAD -- "$path" before and after this change for 5000+ randomly* selected paths from each repository: Average false Average Average positive rate runtime runtime before after before after difference ------------------------------------------------------------------ git 3.220% 0.7853% 0.0558s 0.0387s -30.6% linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% *Path selection was done with the following pipeline: git ls-tree -r --name-only HEAD \| sort -R \| head -n 5000 The improvements in runtime are much smaller than the improvements in average false positive rate, as we are clearly reaching diminishing returns here. However, all these timings depend on that accessing tree objects is reasonably fast (warm caches). If we had a partial clone and the tree objects had to be fetched from a promisor remote, e.g.: $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ commit-graph write --reachable $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ rev-list HEAD -- "$path" then checking all leading path component can reduce the runtime from over an hour to a few seconds (and this is with the clone and the promisor on the same machine). This adjusts the tracing values in t4216-log-bloom.sh, which provides a concrete way to notice the improvement. Helped-by: Taylor Blau <me@ttaylorr.com> Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-01 14:17:43 -07:00
Taylor Blau	f3c2a36810	revision: empty pathspecs should not use Bloom filters The prepare_to_use_bloom_filter() method was not intended to be called on an empty pathspec. However, 'git log -- .' and 'git log' are subtly different: the latter reports all commits while the former will simplify commits that do not change the root tree. This means that the path used to construct the bloom_key might be empty, and that value is not added to the Bloom filter during construction. That means that the results are likely incorrect! To resolve the issue, be careful about the length of the path and stop filling Bloom filters. To be completely sure we do not use them, drop the pointer to the bloom_filter_settings from the commit-graph. That allows our test to look at the trace2 logs to verify no Bloom filter statistics are reported. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-01 14:17:43 -07:00
Derrick Stolee	dc8e95ba7c	revision.c: fix whitespace Here, four spaces were used instead of tab characters. Reported-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-01 14:17:43 -07:00
Derrick Stolee	949197420e	bloom: fix logic in get_bloom_filter() The get_bloom_filter() method is a bit complicated in some parts where it does not need to be. In particular, it needs to return a NULL filter only when compute_if_not_present is zero AND the filter data cannot be loaded from a commit-graph file. This currently happens by accident because the commit-graph does not load changed-path Bloom filters from an existing commit-graph when writing a new one. This will change in a later patch. Also clean up some style issues while we are here. One side-effect of returning a NULL filter is that the filters that are reported as "too large" will now be reported as NULL insead of length zero. This case was not properly covered before, so add a test. Further, remote the counting of the zero-length filters from revision.c and the trace2 logs. Helped-by: René Scharfe <l.s.r@web.de> Helped-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-01 14:17:43 -07:00
Michael Forney	ea3f7e598c	revision: use repository from rev_info when parsing commits This is needed when repo_init_revisions() is called with a repository that is not the_repository to ensure appropriate repository is used in repo_parse_commit_internal(). If the wrong repository is used, a fatal error is the commit-graph machinery occurs: fatal: invalid commit position. commit-graph is likely corrupt Since revision.c was the only user of the parse_commit_gently compatibility define, remove it from commit.h. Signed-off-by: Michael Forney <mforney@mforney.org> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-24 09:13:04 -07:00
Eric Sunshine	03f2465bb1	worktree: drop get_worktrees() unused 'flags' argument get_worktrees() accepts a 'flags' argument, however, there are no existing flags (the lone flag GWT_SORT_LINKED was recently retired) and no behavior which can be tweaked. Therefore, drop the 'flags' argument. Signed-off-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-22 10:31:15 -07:00
Abhishek Kumar	c752ad09c4	commit-graph: minimize commit_graph_data_slab access In an earlier patch, multiple struct acccesses to `graph_pos` and `generation` were auto-converted to multiple method calls. Since the values are fixed and commit-slab access costly, we would be better off with storing the values as a local variable and reusing it. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-17 14:37:52 -07:00
Abhishek Kumar	c49c82aa4c	commit: move members graph_pos, generation to a slab We remove members `graph_pos` and `generation` from the struct commit. The default assignments in init_commit_node() are no longer valid, which is fine as the slab helpers return appropriate default values and the assignments are removed. We will replace existing use of commit->generation and commit->graph_pos by commit_graph_data_slab helpers using `contrib/coccinelle/commit.cocci'. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-17 14:37:30 -07:00
Junio C Hamano	c3a02824cf	Merge branch 'ds/line-log-on-bloom' "git log -L..." now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system. * ds/line-log-on-bloom: line-log: integrate with changed-path Bloom filters line-log: try to use generation number-based topo-ordering line-log: more responsive, incremental 'git log -L' t4211-line-log: add tests for parent oids line-log: remove unused fields from 'struct line_log_data'	2020-06-08 18:06:26 -07:00
SZEDER Gábor	0ee3cb888d	diff.h: drop diff_tree_oid() & friends' return value ll_diff_tree_oid() has only ever returned 0 [1], so it's return value is basically useless. It's only caller diff_tree_oid() has only ever returned the return value of ll_diff_tree_oid() as-is [2], so its return value is just as useless. Most of diff_tree_oid()'s callers simply ignore its return value, except: - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and returns with its return value, but all of diff_root_tree_oid()'s callers ignore its return value. - rev_compare_tree() and rev_same_tree_as_empty() do look at the return value in a condition, but, since the return value is always 0, the former's < 0 condition is never fulfilled, while the latter's >= 0 condition is always fulfilled. So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid() and diff_root_tree_oid(), and drop those conditions from rev_compare_tree() and rev_same_tree_as_empty() as well. [1] ll_diff_tree_oid() and its ancestors have been returning only 0 ever since it was introduced as diff_tree() in `9174026cfe` (Add "diff-tree" program to show which files have changed between two trees., 2005-04-09). [2] diff_tree_oid() traces back to diff-tree.c:main() in `9174026cfe` as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-08 12:28:49 -07:00
Derrick Stolee	f32dde8c12	line-log: integrate with changed-path Bloom filters The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results. There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs. Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters. The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently. In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time. That is ripe for improvement with changed-path Bloom filters. We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph. Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal. Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time. There are two cases to care about: merge commits and "ordinary" commits. Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent. Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge\|ordinary]_commit() methods. By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff and then the blob-diff if there was a real edit. The Linux kernel repository is a good testing ground for the performance improvements claimed here. There are two different cases to test. The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history. The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal. To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from StackOverflow [1]): git log --pretty=format: --name-only -n 10000 \| sort \| \ uniq -c \| sort -rg \| head -10 which results in 121 MAINTAINERS 63 fs/namei.c 60 arch/x86/kvm/cpuid.c 59 fs/io_uring.c 58 arch/x86/kvm/vmx/vmx.c 51 arch/x86/kvm/x86.c 45 arch/x86/kvm/svm.c 42 fs/btrfs/disk-io.c 42 Documentation/scsi/index.rst (along with a bogus first result). It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time: \| \| Entire History \| First Result \| \| Path \| Before \| After \| Before \| After \| \|------------------------------\|--------\|--------\|--------\|--------\| \| MAINTAINERS \| 4.26 s \| 3.87 s \| 0.41 s \| 0.39 s \| \| fs/namei.c \| 1.99 s \| 0.99 s \| 0.42 s \| 0.21 s \| \| arch/x86/kvm/cpuid.c \| 5.28 s \| 1.12 s \| 0.16 s \| 0.09 s \| \| fs/io_uring.c \| 4.34 s \| 0.99 s \| 0.94 s \| 0.27 s \| \| arch/x86/kvm/vmx/vmx.c \| 5.01 s \| 1.34 s \| 0.21 s \| 0.12 s \| \| arch/x86/kvm/x86.c \| 2.24 s \| 1.18 s \| 0.21 s \| 0.14 s \| \| fs/btrfs/disk-io.c \| 1.82 s \| 1.01 s \| 0.06 s \| 0.05 s \| \| Documentation/scsi/index.rst \| 3.30 s \| 0.89 s \| 1.46 s \| 0.03 s \| It is worth noting that the least speedup comes for the MAINTAINERS file which is * edited frequently, * low in the directory heirarchy, and * quite a large file. All of those points lead to spending more time doing the blob diff and less time doing the tree diff. Still, we see some improvement in that case and significant improvement in other cases. A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-05-11 09:33:56 -07:00
SZEDER Gábor	002933f3fe	line-log: try to use generation number-based topo-ordering The previous patch made it possible to perform line-level filtering during history traversal instead of in an expensive preprocessing step, but it still requires some simpler preprocessing steps, notably topo-ordering. However, nowadays we have commit-graphs storing generation numbers, which make it possible to incrementally traverse the history in topological order, without the preparatory limit_list() and sort_in_topological_order() steps; see `b45424181e` (revision.c: generation-based topo-order algorithm, 2018-11-01). This patch combines the two, so we can do both the topo-ordering and the line-level filtering during history traversal, eliminating even those simpler preprocessing steps, and thus further reducing the delay before showing the first commit modifying the given line range. The 'revs->limited' flag plays the central role in this, because, due to limitations of the current implementation, the generation number-based topo-ordering is only enabled when this flag remains unset. Line-level log, however, always sets this flag in setup_revisions() ever since the feature was introduced in `12da1d1f6f` (Implement line-history search (git log -L), 2013-03-28). The reason for setting 'limited' is unclear, though, because the line-level log itself doesn't directly depend on it, and it doesn't affect how the limit_list() function limits the revision range. However, there is an indirect dependency: the line-level log requires topo-ordering, and the "traditional" sort_in_topological_order() requires an already limited commit list since `e6c3505b44` (Make sure we generate the whole commit list before trying to sort it topologically, 2005-07-06). The new, generation numbers-based topo-ordering doesn't require a limited commit list anymore. So don't set 'revs->limited' for line-level log, unless it is really necessary, namely: - The user explicitly requested parent rewriting, because that is still done in the line_log_filter() preprocessing step (see previous patch), which requires sort_in_topological_order() and in turn limit_list() as well. - A commit-graph file is not available or it doesn't yet contain generation numbers. In these cases we had to fall back on sort_in_topological_order() and in turn limit_list(). The existing condition with generation_numbers_enabled() has already ensured that the 'limited' flag is set in these cases; this patch just makes sure that the line-level log sets 'revs->topo_order' before that condition. While the reduced delay before showing the first commit is measurable in git.git, it takes a bigger repository to make it clearly noticable. In both cases below the line ranges were chosen so that they were modified rather close to the starting revisions, so the effect of this change is most noticable. # git.git $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0 Before: real 0m0.107s user 0m0.091s sys 0m0.013s After: real 0m0.058s user 0m0.050s sys 0m0.005s # linux.git $ time git --no-pager log \ -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2 Before: real 0m1.129s user 0m1.061s sys 0m0.069s After: real 0m0.096s user 0m0.087s sys 0m0.009s Additional testing by Derrick Stolee: Since this patch improves the performance for the first result, I repeated the experiment from the previous patch on the Linux kernel repository, reporting real time here: Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null Before: 0.71 s After: 0.05 s Now, we have dropped the full topo-order of all ~910,000 commits before reporting the first result. The remaining performance improvements then are: 1. Update the parent-rewriting logic to be incremental similar to how "git log --graph" behaves. 2. Use changed-path Bloom filters to reduce the time spend in the tree-diff to see if the path(s) changed. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-05-11 09:33:56 -07:00

1 2 3 4 5 ...

836 Commits