git-commit-vandalism/Documentation/config/diff.txt
Elijah Newren 94b82d5686 rename: bump limit defaults yet again
These were last bumped in commit 92c57e5c1d (bump rename limit
defaults (again), 2011-02-19), and were bumped both because processors
had gotten faster, and because people were getting ugly merges that
caused problems and reporting it to the mailing list (suggesting that
folks were willing to spend more time waiting).

Since that time:
  * Linus has continued recommending kernel folks to set
    diff.renameLimit=0 (maps to 32767, currently)
  * Folks with repositories with lots of renames were happy to set
    merge.renameLimit above 32767, once the code supported that, to
    get correct cherry-picks
  * Processors have gotten faster
  * It has been discovered that the timing methodology used last time
    probably used too large example files.

The last point is probably worth explaining a bit more:

  * The "average" file size used appears to have been average blob size
    in the linux kernel history at the time (probably v2.6.25 or
    something close to it).
  * Since bigger files are modified more frequently, such a computation
    weights towards larger files.
  * Larger files may be more likely to be modified over time, but are
    not more likely to be renamed -- the mean and median blob size
    within a tree are a bit higher than the mean and median of blob
    sizes in the history leading up to that version for the linux
    kernel.
  * The mean blob size in v2.6.25 was half the average blob size in
    history leading to that point
  * The median blob size in v2.6.25 was about 40% of the mean blob size
    in v2.6.25.
  * Since the mean blob size is more than double the median blob size,
    any file as big as the mean will not be compared to any files of
    median size or less (because they'd be more than 50% dissimilar).
  * Since it is the number of files compared that provides the O(n^2)
    behavior, median-sized files should matter more than mean-sized
    ones.

The combined effect of the above is that the file size used in past
calculations was likely about 5x too large.  Combine that with a CPU
performance improvement of ~30%, and we can increase the limits by
a factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated
time limits.

Keeping the same approximate time limit probably makes sense for
diff.renameLimit (there is no progress feedback in e.g. git log -p),
but the experience above suggests merge.renameLimit could be extended
significantly.  In fact, it probably would make sense to have an
unlimited default setting for merge.renameLimit, but that would
likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/
for details in that area.)  For now, let's just bump the approximate
time limit from 10s to 1m.

(Note: We do not want to use actual time limits, because getting results
that depend on how loaded your system is that day feels bad, and because
we don't discover that we won't get all the renames until after we've
put in a lot of work rather than just upfront telling the user there are
too many files involved.)

Using the original time limit of 2s for diff.renameLimit, and bumping
merge.renameLimit from 10s to 60s, I found the following timings using
the simple script at the end of this commit message (on an AWS c5.xlarge
which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):

      N   Timing
   1300    1.995s
   7100   59.973s

So let's round down to nice even numbers and bump the limits from
400->1000, and from 1000->7000.

Here is the measure_rename_perf script (adapted from
https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/
in particular to avoid triggering the linear handling from
basename-guided rename detection):

    #!/bin/bash

    n=$1; shift

    rm -rf repo
    mkdir repo && cd repo
    git init -q -b main

    mkdata() {
      mkdir $1
      for i in `seq 1 $2`; do
        (sed "s/^/$i /" <../sample
         echo tag: $1
        ) >$1/$i
      done
    }

    mkdata initial $n
    git add .
    git commit -q -m initial

    mkdata new $n
    git add .
    cd new
    for i in *; do git mv $i $i.renamed; done
    cd ..
    git rm -q -rf initial
    git commit -q -m new

    time git diff-tree -M -l0 --summary HEAD^ HEAD

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-15 16:54:34 -07:00

239 lines
9.7 KiB
Plaintext

diff.autoRefreshIndex::
When using 'git diff' to compare with work tree
files, do not consider stat-only change as changed.
Instead, silently run `git update-index --refresh` to
update the cached stat information for paths whose
contents in the work tree match the contents in the
index. This option defaults to true. Note that this
affects only 'git diff' Porcelain, and not lower level
'diff' commands such as 'git diff-files'.
diff.dirstat::
A comma separated list of `--dirstat` parameters specifying the
default behavior of the `--dirstat` option to linkgit:git-diff[1]
and friends. The defaults can be overridden on the command line
(using `--dirstat=<param1,param2,...>`). The fallback defaults
(when not changed by `diff.dirstat`) are `changes,noncumulative,3`.
The following parameters are available:
+
--
`changes`;;
Compute the dirstat numbers by counting the lines that have been
removed from the source, or added to the destination. This ignores
the amount of pure code movements within a file. In other words,
rearranging lines in a file is not counted as much as other changes.
This is the default behavior when no parameter is given.
`lines`;;
Compute the dirstat numbers by doing the regular line-based diff
analysis, and summing the removed/added line counts. (For binary
files, count 64-byte chunks instead, since binary files have no
natural concept of lines). This is a more expensive `--dirstat`
behavior than the `changes` behavior, but it does count rearranged
lines within a file as much as other changes. The resulting output
is consistent with what you get from the other `--*stat` options.
`files`;;
Compute the dirstat numbers by counting the number of files changed.
Each changed file counts equally in the dirstat analysis. This is
the computationally cheapest `--dirstat` behavior, since it does
not have to look at the file contents at all.
`cumulative`;;
Count changes in a child directory for the parent directory as well.
Note that when using `cumulative`, the sum of the percentages
reported may exceed 100%. The default (non-cumulative) behavior can
be specified with the `noncumulative` parameter.
<limit>;;
An integer parameter specifies a cut-off percent (3% by default).
Directories contributing less than this percentage of the changes
are not shown in the output.
--
+
Example: The following will count changed files, while ignoring
directories with less than 10% of the total amount of changed files,
and accumulating child directory counts in the parent directories:
`files,10,cumulative`.
diff.statGraphWidth::
Limit the width of the graph part in --stat output. If set, applies
to all commands generating --stat output except format-patch.
diff.context::
Generate diffs with <n> lines of context instead of the default
of 3. This value is overridden by the -U option.
diff.interHunkContext::
Show the context between diff hunks, up to the specified number
of lines, thereby fusing the hunks that are close to each other.
This value serves as the default for the `--inter-hunk-context`
command line option.
diff.external::
If this config variable is set, diff generation is not
performed using the internal diff machinery, but using the
given command. Can be overridden with the `GIT_EXTERNAL_DIFF'
environment variable. The command is called with parameters
as described under "git Diffs" in linkgit:git[1]. Note: if
you want to use an external diff program only on a subset of
your files, you might want to use linkgit:gitattributes[5] instead.
diff.ignoreSubmodules::
Sets the default value of --ignore-submodules. Note that this
affects only 'git diff' Porcelain, and not lower level 'diff'
commands such as 'git diff-files'. 'git checkout'
and 'git switch' also honor
this setting when reporting uncommitted changes. Setting it to
'all' disables the submodule summary normally shown by 'git commit'
and 'git status' when `status.submoduleSummary` is set unless it is
overridden by using the --ignore-submodules command-line option.
The 'git submodule' commands are not affected by this setting.
By default this is set to untracked so that any untracked
submodules are ignored.
diff.mnemonicPrefix::
If set, 'git diff' uses a prefix pair that is different from the
standard "a/" and "b/" depending on what is being compared. When
this configuration is in effect, reverse diff output also swaps
the order of the prefixes:
`git diff`;;
compares the (i)ndex and the (w)ork tree;
`git diff HEAD`;;
compares a (c)ommit and the (w)ork tree;
`git diff --cached`;;
compares a (c)ommit and the (i)ndex;
`git diff HEAD:file1 file2`;;
compares an (o)bject and a (w)ork tree entity;
`git diff --no-index a b`;;
compares two non-git things (1) and (2).
diff.noprefix::
If set, 'git diff' does not show any source or destination prefix.
diff.relative::
If set to 'true', 'git diff' does not show changes outside of the directory
and show pathnames relative to the current directory.
diff.orderFile::
File indicating how to order files within a diff.
See the '-O' option to linkgit:git-diff[1] for details.
If `diff.orderFile` is a relative pathname, it is treated as
relative to the top of the working tree.
diff.renameLimit::
The number of files to consider in the exhaustive portion of
copy/rename detection; equivalent to the 'git diff' option
`-l`. If not set, the default value is currently 1000. This
setting has no effect if rename detection is turned off.
diff.renames::
Whether and how Git detects renames. If set to "false",
rename detection is disabled. If set to "true", basic rename
detection is enabled. If set to "copies" or "copy", Git will
detect copies, as well. Defaults to true. Note that this
affects only 'git diff' Porcelain like linkgit:git-diff[1] and
linkgit:git-log[1], and not lower level commands such as
linkgit:git-diff-files[1].
diff.suppressBlankEmpty::
A boolean to inhibit the standard behavior of printing a space
before each empty output line. Defaults to false.
diff.submodule::
Specify the format in which differences in submodules are
shown. The "short" format just shows the names of the commits
at the beginning and end of the range. The "log" format lists
the commits in the range like linkgit:git-submodule[1] `summary`
does. The "diff" format shows an inline diff of the changed
contents of the submodule. Defaults to "short".
diff.wordRegex::
A POSIX Extended Regular Expression used to determine what is a "word"
when performing word-by-word difference calculations. Character
sequences that match the regular expression are "words", all other
characters are *ignorable* whitespace.
diff.<driver>.command::
The custom diff driver command. See linkgit:gitattributes[5]
for details.
diff.<driver>.xfuncname::
The regular expression that the diff driver should use to
recognize the hunk header. A built-in pattern may also be used.
See linkgit:gitattributes[5] for details.
diff.<driver>.binary::
Set this option to true to make the diff driver treat files as
binary. See linkgit:gitattributes[5] for details.
diff.<driver>.textconv::
The command that the diff driver should call to generate the
text-converted version of a file. The result of the
conversion is used to generate a human-readable diff. See
linkgit:gitattributes[5] for details.
diff.<driver>.wordRegex::
The regular expression that the diff driver should use to
split words in a line. See linkgit:gitattributes[5] for
details.
diff.<driver>.cachetextconv::
Set this option to true to make the diff driver cache the text
conversion outputs. See linkgit:gitattributes[5] for details.
diff.tool::
Controls which diff tool is used by linkgit:git-difftool[1].
This variable overrides the value configured in `merge.tool`.
The list below shows the valid built-in values.
Any other value is treated as a custom diff tool and requires
that a corresponding difftool.<tool>.cmd variable is defined.
diff.guitool::
Controls which diff tool is used by linkgit:git-difftool[1] when
the -g/--gui flag is specified. This variable overrides the value
configured in `merge.guitool`. The list below shows the valid
built-in values. Any other value is treated as a custom diff tool
and requires that a corresponding difftool.<guitool>.cmd variable
is defined.
include::../mergetools-diff.txt[]
diff.indentHeuristic::
Set this option to `false` to disable the default heuristics
that shift diff hunk boundaries to make patches easier to read.
diff.algorithm::
Choose a diff algorithm. The variants are as follows:
+
--
`default`, `myers`;;
The basic greedy diff algorithm. Currently, this is the default.
`minimal`;;
Spend extra time to make sure the smallest possible diff is
produced.
`patience`;;
Use "patience diff" algorithm when generating patches.
`histogram`;;
This algorithm extends the patience algorithm to "support
low-occurrence common elements".
--
+
diff.wsErrorHighlight::
Highlight whitespace errors in the `context`, `old` or `new`
lines of the diff. Multiple values are separated by comma,
`none` resets previous values, `default` reset the list to
`new` and `all` is a shorthand for `old,new,context`. The
whitespace errors are colored with `color.diff.whitespace`.
The command line option `--ws-error-highlight=<kind>`
overrides this setting.
diff.colorMoved::
If set to either a valid `<mode>` or a true value, moved lines
in a diff are colored differently, for details of valid modes
see '--color-moved' in linkgit:git-diff[1]. If simply set to
true the default color mode will be used. When set to false,
moved lines are not colored.
diff.colorMovedWS::
When moved lines are colored using e.g. the `diff.colorMoved` setting,
this option controls the `<mode>` how spaces are treated
for details of valid modes see '--color-moved-ws' in linkgit:git-diff[1].