git-commit-vandalism/Documentation/config
Elijah Newren 94b82d5686 rename: bump limit defaults yet again
These were last bumped in commit 92c57e5c1d (bump rename limit
defaults (again), 2011-02-19), and were bumped both because processors
had gotten faster, and because people were getting ugly merges that
caused problems and reporting it to the mailing list (suggesting that
folks were willing to spend more time waiting).

Since that time:
  * Linus has continued recommending kernel folks to set
    diff.renameLimit=0 (maps to 32767, currently)
  * Folks with repositories with lots of renames were happy to set
    merge.renameLimit above 32767, once the code supported that, to
    get correct cherry-picks
  * Processors have gotten faster
  * It has been discovered that the timing methodology used last time
    probably used too large example files.

The last point is probably worth explaining a bit more:

  * The "average" file size used appears to have been average blob size
    in the linux kernel history at the time (probably v2.6.25 or
    something close to it).
  * Since bigger files are modified more frequently, such a computation
    weights towards larger files.
  * Larger files may be more likely to be modified over time, but are
    not more likely to be renamed -- the mean and median blob size
    within a tree are a bit higher than the mean and median of blob
    sizes in the history leading up to that version for the linux
    kernel.
  * The mean blob size in v2.6.25 was half the average blob size in
    history leading to that point
  * The median blob size in v2.6.25 was about 40% of the mean blob size
    in v2.6.25.
  * Since the mean blob size is more than double the median blob size,
    any file as big as the mean will not be compared to any files of
    median size or less (because they'd be more than 50% dissimilar).
  * Since it is the number of files compared that provides the O(n^2)
    behavior, median-sized files should matter more than mean-sized
    ones.

The combined effect of the above is that the file size used in past
calculations was likely about 5x too large.  Combine that with a CPU
performance improvement of ~30%, and we can increase the limits by
a factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated
time limits.

Keeping the same approximate time limit probably makes sense for
diff.renameLimit (there is no progress feedback in e.g. git log -p),
but the experience above suggests merge.renameLimit could be extended
significantly.  In fact, it probably would make sense to have an
unlimited default setting for merge.renameLimit, but that would
likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/
for details in that area.)  For now, let's just bump the approximate
time limit from 10s to 1m.

(Note: We do not want to use actual time limits, because getting results
that depend on how loaded your system is that day feels bad, and because
we don't discover that we won't get all the renames until after we've
put in a lot of work rather than just upfront telling the user there are
too many files involved.)

Using the original time limit of 2s for diff.renameLimit, and bumping
merge.renameLimit from 10s to 60s, I found the following timings using
the simple script at the end of this commit message (on an AWS c5.xlarge
which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):

      N   Timing
   1300    1.995s
   7100   59.973s

So let's round down to nice even numbers and bump the limits from
400->1000, and from 1000->7000.

Here is the measure_rename_perf script (adapted from
https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/
in particular to avoid triggering the linear handling from
basename-guided rename detection):

    #!/bin/bash

    n=$1; shift

    rm -rf repo
    mkdir repo && cd repo
    git init -q -b main

    mkdata() {
      mkdir $1
      for i in `seq 1 $2`; do
        (sed "s/^/$i /" <../sample
         echo tag: $1
        ) >$1/$i
      done
    }

    mkdata initial $n
    git add .
    git commit -q -m initial

    mkdata new $n
    git add .
    cd new
    for i in *; do git mv $i $i.renamed; done
    cd ..
    git rm -q -rf initial
    git commit -q -m new

    time git diff-tree -M -l0 --summary HEAD^ HEAD

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-15 16:54:34 -07:00
..
add.txt
advice.txt rm: honor sparse checkout patterns 2021-04-08 14:18:03 -07:00
alias.txt
am.txt
apply.txt
blame.txt blame: correct name of config option in docs 2021-06-28 10:05:13 -07:00
branch.txt
browser.txt
checkout.txt parallel-checkout: add configuration options 2021-04-19 11:57:05 -07:00
clean.txt
clone.txt builtin/clone.c: add --reject-shallow option 2021-04-01 12:58:58 -07:00
color.txt doc: explain the use of color.pager 2021-05-20 15:37:10 +09:00
column.txt
commit.txt
commitgraph.txt commit-graph: use config to specify generation type 2021-02-25 15:10:41 -08:00
completion.txt
core.txt Merge branch 'ew/decline-core-abbrev' 2021-01-15 15:20:28 -08:00
credential.txt crendential-store: use timeout when locking file 2020-11-25 12:30:18 -08:00
diff.txt rename: bump limit defaults yet again 2021-07-15 16:54:34 -07:00
difftool.txt
extensions.txt docs: add documentation for extensions.objectFormat 2020-07-30 09:16:49 -07:00
fastimport.txt
feature.txt protocol: re-enable v2 protocol by default 2020-09-25 11:40:42 -07:00
fetch.txt negotiator/noop: add noop fetch negotiator 2020-08-18 13:25:05 -07:00
filter.txt
fmt-merge-msg.txt config/fmt-merge-msg.txt: drop space in quote 2020-09-27 14:22:41 -07:00
format.txt Merge branch 'jc/format-patch-name-max' 2020-11-21 15:14:38 -08:00
fsck.txt
gc.txt gc docs: change --keep-base-pack to --keep-largest-pack 2020-11-21 11:39:55 -08:00
gitcvs.txt
gitweb.txt
gpg.txt
grep.txt
gui.txt
guitool.txt
help.txt help.c: help.autocorrect=never means "do not compute suggestions" 2020-11-25 13:02:15 -08:00
http.txt doc: fix some typos 2021-01-04 11:27:48 -08:00
i18n.txt
imap.txt
index.txt sparse-index: add index.sparse config option 2021-03-30 12:57:47 -07:00
init.txt clone: respect remote unborn HEAD 2021-02-05 13:49:55 -08:00
instaweb.txt
interactive.txt
log.txt diff-merges: introduce log.diffMerges config variable 2021-04-16 23:38:35 -07:00
lsrefs.txt ls-refs: report unborn targets of symrefs 2021-02-05 13:49:53 -08:00
mailinfo.txt
mailmap.txt
maintenance.txt maintenance: incremental strategy runs pack-refs weekly 2021-02-09 23:09:29 -08:00
man.txt
merge.txt rename: bump limit defaults yet again 2021-07-15 16:54:34 -07:00
mergetool.txt mergetool: do not enable hideResolved by default 2021-03-13 15:30:29 -08:00
notes.txt
pack.txt Merge branch 'jk/doc-max-pack-size' 2021-07-08 13:15:03 -07:00
pager.txt
pretty.txt
protocol.txt protocol: re-enable v2 protocol by default 2020-09-25 11:40:42 -07:00
pull.txt
push.txt send-pack: support push negotiation 2021-05-05 10:41:29 +09:00
rebase.txt rebase: remove transitory rebase.useBuiltin setting & env 2021-03-23 14:05:58 -07:00
receive.txt receive-pack: new config receive.procReceiveRefs 2020-08-27 12:47:47 -07:00
remote.txt
remotes.txt
repack.txt
rerere.txt
reset.txt
sendemail.txt git-send-email: die if sendmail.* config is set 2020-07-23 18:00:34 -07:00
sequencer.txt
showbranch.txt
splitindex.txt
ssh.txt
stash.txt stash show: use stash.showIncludeUntracked even when diff options given 2021-05-22 17:56:46 +09:00
status.txt
submodule.txt doc: explain how to deactivate submodule.recurse completely 2020-04-06 13:42:43 -07:00
tag.txt
tar.txt
trace2.txt doc: fix some typos 2021-01-04 11:27:48 -08:00
transfer.txt docs: new transfer.advertiseSID option 2020-11-11 18:26:52 -08:00
uploadarchive.txt
uploadpack.txt list-objects: implement object type filter 2021-04-19 14:09:11 -07:00
url.txt
user.txt
versionsort.txt
web.txt
worktree.txt