git-commit-vandalism/Documentation/config
Han Xin aaf81223f4 unpack-objects: use stream_loose_object() to unpack large objects
Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.

We could use the new streaming interface to unpack all blobs, but
doing so would be much slower, as demonstrated e.g. with this
benchmark using git-hyperfine[0]:

	rm -rf /tmp/scalar.git &&
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &&
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &&
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/scalar.git/my.pack'

Here we'll perform worse with lower core.bigFileThreshold settings
with this change in terms of speed, but we're getting lower memory use
in return:

	Summary
	  './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master' ran
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.02 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.02 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.09 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.10 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.11 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'

A better benchmark to demonstrate the benefits of that this one, which
creates an artificial repo with a 1, 25, 50, 75 and 100MB blob:

	rm -rf /tmp/repo &&
	git init /tmp/repo &&
	(
		cd /tmp/repo &&
		for i in 1 25 50 75 100
		do
			dd if=/dev/urandom of=blob.$i count=$(($i*1024)) bs=1024
		done &&
		git add blob.* &&
		git commit -mblobs &&
		git gc &&
		PACK=$(echo .git/objects/pack/pack-*.pack) &&
		cp "$PACK" my.pack
	) &&
	git hyperfine \
		--show-output \
		-L rev origin/master,HEAD -L v "512,50m,100m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum'

Using this test we'll always use >100MB of memory on
origin/master (around ~105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.

The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

  '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master' ran
        Maximum resident set size (kbytes): 107080
    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 106968
    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 107032
    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 107072
    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 55704
    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 4564

This shows that if you have enough memory this new streaming method is
slower the lower you set the streaming threshold, but the benefit is
more bounded memory use.

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" instead of re-using the existing
"core.bigFileThreshold" variable[1]. As noted in a detailed overview
of its users in [2] using it has several different meanings.

Still, we consider it good enough to simply re-use it. While it's
possible that someone might want to e.g. consider objects "small" for
the purposes of diffing but "big" for the purposes of writing them
such use-cases are probably too obscure to worry about. We can always
split up "core.bigFileThreshold" in the future if there's a need for
that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/
2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Derrick Stolee <stolee@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <chiyutianyi@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-13 10:22:36 -07:00
..
add.txt add -i: default to the built-in implementation 2021-12-01 14:34:43 -08:00
advice.txt Merge branch 'tk/ambiguous-fetch-refspec' 2022-04-04 10:56:24 -07:00
alias.txt
am.txt
apply.txt
blame.txt blame: correct name of config option in docs 2021-06-28 10:05:13 -07:00
branch.txt push: default to single remote even when not named origin 2022-04-29 11:20:55 -07:00
browser.txt
checkout.txt parallel-checkout: add configuration options 2021-04-19 11:57:05 -07:00
clean.txt
clone.txt clone, submodule: pass partial clone filters to submodules 2022-02-09 15:38:36 -08:00
color.txt Merge branch 'hm/paint-hits-in-log-grep' 2021-11-01 13:48:08 -07:00
column.txt
commit.txt
commitgraph.txt commit-graph: use config to specify generation type 2021-02-25 15:10:41 -08:00
completion.txt
core.txt unpack-objects: use stream_loose_object() to unpack large objects 2022-06-13 10:22:36 -07:00
credential.txt crendential-store: use timeout when locking file 2020-11-25 12:30:18 -08:00
diff.txt rename: bump limit defaults yet again 2021-07-15 16:54:34 -07:00
difftool.txt
extensions.txt Documentation: add extensions.worktreeConfig details 2022-02-08 09:49:20 -08:00
fastimport.txt
feature.txt protocol: re-enable v2 protocol by default 2020-09-25 11:40:42 -07:00
fetch.txt repo-settings: rename the traditional default fetch.negotiationAlgorithm 2022-02-02 09:36:17 -08:00
filter.txt
fmt-merge-msg.txt config/fmt-merge-msg.txt: drop space in quote 2020-09-27 14:22:41 -07:00
format.txt Merge branch 'jc/format-patch-name-max' 2020-11-21 15:14:38 -08:00
fsck.txt
gc.txt builtin/gc.c: conditionally avoid pruning objects via loose 2022-05-26 15:48:26 -07:00
gitcvs.txt
gitweb.txt
gpg.txt Documentation/config/pgp.txt: add missing apostrophe 2022-01-26 18:31:59 -08:00
grep.txt grep: clarify what grep.patternType=default means 2021-12-05 12:26:43 -08:00
gui.txt docs: use "character encoding" to refer to commit-object encoding 2021-08-27 12:45:45 -07:00
guitool.txt
help.txt help.c: help.autocorrect=prompt waits for user action 2021-08-14 11:20:49 -07:00
http.txt http: add custom hostname to IP address resolutions 2022-05-16 09:46:52 -07:00
i18n.txt
imap.txt
index.txt sparse-index: add index.sparse config option 2021-03-30 12:57:47 -07:00
init.txt clone: respect remote unborn HEAD 2021-02-05 13:49:55 -08:00
instaweb.txt
interactive.txt
log.txt diff-merges: introduce log.diffMerges config variable 2021-04-16 23:38:35 -07:00
lsrefs.txt ls-refs: report unborn targets of symrefs 2021-02-05 13:49:53 -08:00
mailinfo.txt
mailmap.txt
maintenance.txt maintenance: incremental strategy runs pack-refs weekly 2021-02-09 23:09:29 -08:00
man.txt
merge.txt update documentation for new zdiff3 conflictStyle 2021-12-01 14:45:59 -08:00
mergetool.txt vimdiff: add tool documentation 2022-04-03 15:09:52 -07:00
notes.txt
pack.txt midx.c: respect 'pack.writeBitmapHashcache' when writing bitmaps 2021-09-14 16:34:18 -07:00
pager.txt
pretty.txt
protocol.txt protocol: re-enable v2 protocol by default 2020-09-25 11:40:42 -07:00
pull.txt pull: remove support for --rebase=preserve 2021-09-07 21:45:32 -07:00
push.txt push: new config option "push.autoSetupRemote" supports "simple" push 2022-04-29 11:20:55 -07:00
rebase.txt rebase: remove transitory rebase.useBuiltin setting & env 2021-03-23 14:05:58 -07:00
receive.txt receive-pack: new config receive.procReceiveRefs 2020-08-27 12:47:47 -07:00
remote.txt docs: mention --refetch fetch option 2022-03-28 10:25:53 -07:00
remotes.txt
repack.txt builtin/repack.c: allow configuring cruft pack generation 2022-05-26 15:48:26 -07:00
rerere.txt
safe.txt Merge branch 'cb/path-owner-check-with-sudo' 2022-05-26 14:51:32 -07:00
sendemail.txt send-email: remove non-working support for "sendemail.smtpssl" 2021-05-28 18:38:07 +09:00
sequencer.txt
showbranch.txt
sparse.txt repo_read_index: add config to expect files outside sparse patterns 2022-03-01 23:37:48 -08:00
splitindex.txt
ssh.txt
stash.txt stash: remove documentation for stash.useBuiltin 2022-01-27 18:00:37 -08:00
status.txt
submodule.txt branch: add --recurse-submodules option for branch creation 2022-02-04 08:16:39 -08:00
tag.txt separate tar.* config to its own source file 2020-03-18 12:42:09 -07:00
tar.txt separate tar.* config to its own source file 2020-03-18 12:42:09 -07:00
trace2.txt doc: fix some typos 2021-01-04 11:27:48 -08:00
transfer.txt docs: clarify the interaction of transfer.hideRefs and namespaces 2021-09-01 07:54:30 -07:00
uploadarchive.txt
uploadpack.txt list-objects: implement object type filter 2021-04-19 14:09:11 -07:00
url.txt
user.txt ssh signing: support non ssh-* keytypes 2021-11-19 09:05:25 -08:00
versionsort.txt
web.txt
worktree.txt