git-commit-vandalism/builtin
Shaoxuan Yuan 7cae7627c4 builtin/grep.c: integrate with sparse index
Turn on sparse index and remove ensure_full_index().

Before this patch, `git-grep` utilizes the ensure_full_index() method to
expand the index and search all the entries. Because this method
requires walking all the trees and constructing the index, it is the
slow part within the whole command.

To achieve better performance, this patch uses grep_tree() to search the
sparse directory entries and get rid of the ensure_full_index() method.

Why grep_tree() is a better choice over ensure_full_index()?

1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
   into every sparse-directory entry (represented by a tree) recursively
   when looping over the index, and the result of doing so matches the
   result of expanding the index.

2) grep_tree() utilizes pathspecs to limit the scope of searching.
   ensure_full_index() always expands the index, which means it will
   always walk all the trees and blobs in the repo without caring if
   the user only wants a subset of the content, i.e. using a pathspec.
   On the other hand, grep_tree() will only search the contents that
   match the pathspec, and thus possibly walking fewer trees.

3) grep_tree() does not construct and copy back a new index, while
   ensure_full_index() does. This also saves some time.

----------------
Performance test

- Summary:

p2000 tests demonstrate a ~71% execution time reduction for
`git grep --cached bogus -- "f2/f1/f1/*"` using tree-walking logic.
However, notice that this result varies depending on the pathspec
given. See below "Command used for testing" for more details.

Test                              HEAD~   HEAD
-------------------------------------------------------
2000.78: git grep ... (full-v3)   0.35    0.39 (≈)
2000.79: git grep ... (full-v4)   0.36    0.30 (≈)
2000.80: git grep ... (sparse-v3) 0.88    0.23 (-73.8%)
2000.81: git grep ... (sparse-v4) 0.83    0.26 (-68.6%)

- Command used for testing:

	git grep --cached bogus -- "f2/f1/f1/*"

The reason for specifying a pathspec is that, if we don't specify a
pathspec, then grep_tree() will walk all the trees and blobs to find the
pattern, and the time consumed doing so is not too different from using
the original ensure_full_index() method, which also spends most of the
time walking trees. However, when a pathspec is specified, this latest
logic will only walk the area of trees enclosed by the pathspec, and the
time consumed is reasonably a lot less.

Generally speaking, because the performance gain is acheived by walking
less trees, which are specified by the pathspec, the HEAD time v.s.
HEAD~ time in sparse-v[3|4], should be proportional to
"pathspec enclosed area" v.s. "all area", respectively. Namely, the
wider the <pathspec> is encompassing, the less the performance
difference between HEAD~ and HEAD, and vice versa.

That is, if we don't specify a pathspec, the performance difference [1]
is indistinguishable: both methods walk all the trees and take generally
same amount of time (even with the index construction time included for
ensure_full_index()).

[1] Performance test result without pathspec (hence walking all trees):

	Command used:

		git grep --cached bogus

	Test                                HEAD~  HEAD
	---------------------------------------------------
	2000.78: git grep ... (full-v3)     6.17   5.19 (≈)
	2000.79: git grep ... (full-v4)     6.19   5.46 (≈)
	2000.80: git grep ... (sparse-v3)   6.57   6.44 (≈)
	2000.81: git grep ... (sparse-v4)   6.65   6.28 (≈)

--------------------------
NEEDSWORK about submodules

There are a few NEEDSWORKs that belong to improvements beyond this
topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for
more context. The other two NEEDSWORKs in t1092 are also relative.

Suggested-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-23 09:41:27 -07:00
..
add.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
am.c revisions API users: add straightforward release_revisions() 2022-04-13 23:56:08 -07:00
annotate.c
apply.c apply.c: remove unnecessary include 2022-04-06 09:42:14 -07:00
archive.c
bisect--helper.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
blame.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
branch.c branch: drop unused worktrees variable 2022-06-21 08:52:37 -07:00
bugreport.c builtin/bugreport.c: create '--diagnose' option 2022-08-12 13:20:02 -07:00
bundle.c bundle: call strvec_clear() on allocated strvec 2022-03-04 13:24:18 -08:00
cat-file.c Merge branch 'tb/cat-file-z' 2022-08-05 15:52:14 -07:00
check-attr.c
check-ignore.c
check-mailmap.c
check-ref-format.c check-ref-format: fix trivial memory leak 2022-07-01 11:43:42 -07:00
checkout--worker.c
checkout-index.c checkout-index: integrate with sparse index 2022-01-13 13:49:45 -08:00
checkout.c Merge branch 'vd/sparse-reset-checkout-fixes' 2022-08-18 13:07:04 -07:00
clean.c Merge branch 'vd/sparse-clean-etc' 2022-02-17 16:25:05 -08:00
clone.c Merge branch 'jk/clone-unborn-confusion' 2022-07-19 16:40:17 -07:00
column.c
commit-graph.c commit-graph: fix memory leak in misused string_list API 2022-03-04 13:24:18 -08:00
commit-tree.c
commit.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
config.c Merge branch 'mf/fix-type-in-config-h' 2022-03-16 17:53:07 -07:00
count-objects.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
credential-cache--daemon.c
credential-cache.c
credential-store.c
credential.c
describe.c revisions API users: add straightforward release_revisions() 2022-04-13 23:56:08 -07:00
diagnose.c builtin/diagnose.c: add '--mode' option 2022-08-12 13:20:02 -07:00
diff-files.c diff-files: move misplaced cleanup label 2022-07-12 07:17:28 -07:00
diff-index.c revisions API: call diff_free(&revs->pruning) in revisions_release() 2022-04-13 23:56:10 -07:00
diff-tree.c 2.36 gitk/diff-tree --stdin regression fix 2022-04-26 09:26:35 -07:00
diff.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
difftool.c run-command API: rename "env_array" to "env" 2022-06-02 14:31:16 -07:00
env--helper.c
fast-export.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
fast-import.c i18n: fix mismatched camelCase config variables 2022-06-17 10:38:26 -07:00
fetch-pack.c Merge branch 'rc/fetch-refetch' 2022-04-04 10:56:23 -07:00
fetch.c Merge branch 'ab/cocci-unused' 2022-07-18 13:31:57 -07:00
fmt-merge-msg.c
for-each-ref.c
for-each-repo.c
fsck.c fsck: do not dereference NULL while checking resolve-undo data 2022-07-11 16:26:33 -07:00
fsmonitor--daemon.c fsmonitor--daemon: stub in health thread 2022-05-26 15:59:27 -07:00
gc.c gc: fix a memory leak 2022-07-01 11:43:43 -07:00
get-tar-commit-id.c builtin/get-tar-commit-id: make hash size independent 2019-04-01 11:57:39 +09:00
grep.c builtin/grep.c: integrate with sparse index 2022-09-23 09:41:27 -07:00
hash-object.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
help.c git docs: add a category for file formats, protocols and interfaces 2022-08-04 14:12:23 -07:00
hook.c git hook run: add an --ignore-missing flag 2022-01-07 15:19:34 -08:00
index-pack.c i18n: fix mismatched camelCase config variables 2022-06-17 10:38:26 -07:00
init-db.c i18n: refactor "foo and bar are mutually exclusive" 2022-01-05 13:29:23 -08:00
interpret-trailers.c
log.c log: refactor "rev.pending" code in cmd_show() 2022-08-03 10:54:20 -07:00
ls-files.c ls-files: introduce "--format" option 2022-07-23 10:53:55 -07:00
ls-remote.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
ls-tree.c Merge branch 'tl/ls-tree-oid-only' 2022-04-06 15:21:59 -07:00
mailinfo.c
mailsplit.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
merge-base.c merge-base: free() allocated "struct commit **" list 2022-03-04 13:24:17 -08:00
merge-file.c merge-file: fix memory leaks on error path 2022-07-01 11:43:43 -07:00
merge-index.c
merge-ours.c
merge-recursive.c gettext API users: don't explicitly cast ngettext()'s "n" 2022-03-07 11:57:52 -08:00
merge-tree.c merge-tree: add a --allow-unrelated-histories flag 2022-06-22 16:10:06 -07:00
merge.c merge: do not exit restore_state() prematurely 2022-07-22 21:45:23 -07:00
mktag.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
mktree.c mktree: do not check type of remote objects 2022-06-21 10:12:15 -07:00
multi-pack-index.c multi-pack-index: simplify handling of unknown --options 2022-07-10 14:53:48 -07:00
mv.c Merge branch 'jc/builtin-mv-move-array' 2022-07-18 13:31:53 -07:00
name-rev.c name-rev: prefix annotate-stdin with '--' in message 2022-06-20 16:20:45 -07:00
notes.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
pack-objects.c i18n: fix mismatched camelCase config variables 2022-06-17 10:38:26 -07:00
pack-redundant.c tree-wide: apply equals-null.cocci 2022-05-02 09:50:37 -07:00
pack-refs.c
patch-id.c patch-id: fix scan_hunk_header on diffs with 1 line of before/after 2022-02-02 11:24:23 -08:00
prune-packed.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
prune.c revisions API users: add straightforward release_revisions() 2022-04-13 23:56:08 -07:00
pull.c pull: fix a "struct oid_array" memory leak 2022-07-01 11:43:43 -07:00
push.c push: fix capitalisation of the option name autoSetupMerge 2022-06-15 11:45:46 -07:00
range-diff.c
read-tree.c read-tree: make three-way merge sparse-aware 2022-03-01 12:36:01 -08:00
rebase.c rebase: add rebase.updateRefs config option 2022-07-19 12:49:04 -07:00
receive-pack.c Merge branch 'ab/bug-if-bug' 2022-06-10 15:04:15 -07:00
reflog.c Merge branch 'ab/plug-leak-in-revisions' 2022-06-07 14:10:56 -07:00
remote-ext.c
remote-fd.c
remote.c Merge branch 'jc/string-list-cleanup' 2022-08-03 13:36:09 -07:00
repack.c cocci: generalize "unused" rule to cover more than "strbuf" 2022-07-06 12:24:43 -07:00
replace.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
rerere.c
reset.c pathspec.h: move pathspec_needs_expanded_index() from reset.c to here 2022-08-08 13:23:26 -07:00
rev-list.c rev-list: support human-readable output for --disk-usage 2022-08-11 13:45:23 -07:00
rev-parse.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
revert.c revert: free "struct replay_opts" members 2022-07-01 11:43:42 -07:00
rm.c rm: integrate with sparse-index 2022-08-08 13:23:26 -07:00
send-pack.c i18n: factorize "invalid value" messages 2022-02-04 13:58:28 -08:00
shortlog.c shortlog: use a stable sort 2022-07-14 11:24:11 -07:00
show-branch.c Merge branch 'jc/show-branch-g-current' into maint 2022-06-08 14:27:51 -07:00
show-index.c
show-ref.c builtin/show-ref.c: avoid over-iterating with --heads, --tags 2022-06-06 09:56:42 -07:00
sparse-checkout.c Merge branch 'ds/sparse-sparse-checkout' 2022-06-03 14:30:35 -07:00
stash.c Merge branch 'ab/env-array' 2022-06-10 15:04:13 -07:00
stripspace.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
submodule--helper.c revisions API: don't leak memory on argv elements that need free()-ing 2022-08-03 11:12:36 -07:00
symbolic-ref.c symbolic-ref: refuse to set syntactically invalid target 2022-08-01 12:17:13 -07:00
tag.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
unpack-file.c
unpack-objects.c unpack-objects: use stream_loose_object() to unpack large objects 2022-06-13 10:22:36 -07:00
update-index.c Merge branch 'jh/builtin-fsmonitor-part3' 2022-06-10 15:04:15 -07:00
update-ref.c
update-server-info.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
upload-archive.c
upload-pack.c
var.c
verify-commit.c
verify-pack.c
verify-tag.c
worktree.c run-command API: rename "env_array" to "env" 2022-06-02 14:31:16 -07:00
write-tree.c