gitdiffcore(7)
==============

NAME
----
gitdiffcore - Tweaking diff output

SYNOPSIS
--------
[verse]
'git diff' *

DESCRIPTION
-----------

The diff commands 'git diff-index', 'git diff-files', and 'git diff-tree'
can be told to manipulate differences they find in
unconventional ways before showing 'diff' output. The manipulation
is collectively called "diffcore transformation". This short note
describes what they are and how to use them to produce 'diff' output
that is easier to understand than the conventional kind.


The chain of operation
----------------------

The 'git diff-{asterisk}' family works by first comparing two sets of
files:

- 'git diff-index' compares contents of a "tree" object and the
working directory (when `--cached` flag is not used) or a
"tree" object and the index file (when `--cached` flag is
used);

- 'git diff-files' compares contents of the index file and the
working directory;

- 'git diff-tree' compares contents of two "tree" objects;

In all of these cases, the commands themselves first optionally limit
the two sets of files by any pathspecs given on their command-lines,
and compare corresponding paths in the two resulting sets of files.

The pathspecs are used to limit the world diff operates in. They remove
the filepairs outside the specified sets of pathnames. For example, if the
input set of filepairs included:

------------------------------------------------
:100644 100644 bcd1234... 0123456... M junkfile
------------------------------------------------

but the command invocation was `git diff-files myfile`, then the
junkfile entry would be removed from the list because only "myfile"
is under consideration.

The result of comparison is passed from these commands to what is
internally called "diffcore", in a format similar to what is output
when the -p option is not used. For example:

------------------------------------------------
in-place edit  :100644 100644 bcd1234... 0123456... M file0
create         :000000 100644 0000000... 1234567... A file4
delete         :100644 000000 1234567... 0000000... D file5
unmerged       :000000 000000 0000000... 0000000... U file6
------------------------------------------------

The diffcore mechanism is fed a list of such comparison results
(each of which is called a "filepair", although at this point each
of them talks about a single file), and transforms such a list
into another list. There are currently 6 such transformations:

- diffcore-break
- diffcore-rename
- diffcore-merge-broken
- diffcore-pickaxe
- diffcore-order
- diffcore-rotate

These are applied in sequence. The set of filepairs that the
'git diff-{asterisk}' commands find is used as the input to diffcore-break, and
the output from diffcore-break is used as the input to the
next transformation. The final result is then passed to the
output routine and generates either diff-raw format (see Output
format sections of the manual for 'git diff-{asterisk}' commands) or
diff-patch format.


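As an illustration only, the filepair records and the chain described
above can be modeled in a few lines of Python. This is a minimal sketch,
not Git's internal representation: `FilePair`, `parse_raw_line` and
`run_chain` are hypothetical names, and the parser only handles
single-path records such as the M, A, D and U examples shown earlier.

```python
from dataclasses import dataclass


@dataclass
class FilePair:
    """One comparison result, mirroring the columns of a raw-format record."""
    old_mode: str
    new_mode: str
    old_id: str
    new_id: str
    status: str
    path: str


def parse_raw_line(line: str) -> FilePair:
    """Parse a record such as ':100644 100644 bcd1234... 0123456... M file0'."""
    if not line.startswith(":"):
        raise ValueError("a raw record is expected to start with ':'")
    # Four fixed columns, then "<status> <path>" as the remainder.
    old_mode, new_mode, old_id, new_id, rest = line[1:].split(None, 4)
    status, path = rest.split(None, 1)
    return FilePair(old_mode, new_mode, old_id, new_id, status, path)


def run_chain(filepairs, transformations):
    """Feed the output of each transformation to the next one, in order."""
    for transform in transformations:
        filepairs = transform(filepairs)
    return filepairs
```

In this model every transformation is simply a function that takes a
list of filepairs and returns a new list, which is why the order of the
chain matters.
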
diffcore-break: For Splitting Up Complete Rewrites
--------------------------------------------------

The first transformation in the chain is diffcore-break, and is
controlled by the -B option to the 'git diff-{asterisk}' commands. This is
used to detect a filepair that represents a "complete rewrite" and
break such a filepair into two filepairs that represent delete and
create. For example, if the input contained this filepair:

------------------------------------------------
:100644 100644 bcd1234... 0123456... M file0
------------------------------------------------

and if it detects that the file "file0" is completely rewritten,
it changes it to:

------------------------------------------------
:100644 000000 bcd1234... 0000000... D file0
:000000 100644 0000000... 0123456... A file0
------------------------------------------------

For the purpose of breaking a filepair, diffcore-break examines
the extent of changes between the contents of the files before
and after modification (i.e. the contents that have "bcd1234..."
and "0123456..." as their SHA-1 content ID, in the above
example). The amount of deletion of original contents and
insertion of new material are added together, and if the total exceeds
the "break score", the filepair is broken into two. The break
score defaults to 50% of the size of the smaller of the original
and the result (i.e. if the edit shrinks the file, the size of
the result is used; if the edit lengthens the file, the size of
the original is used), and can be customized by giving a number
after the "-B" option (e.g. "-B75" to tell it to use 75%).


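The break-score rule above can be sketched as follows. This is a rough
approximation under stated assumptions, not the actual implementation:
it measures the extent of changes in whole lines using Python's
`difflib`, whereas Git uses its own measure of the amount of content
deleted and inserted. The names `change_extent` and `should_break` are
made up for this example.

```python
import difflib


def change_extent(old_lines, new_lines):
    """Return (deleted, inserted) line counts between two versions."""
    deleted = inserted = 0
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted += i2 - i1
        if tag in ("insert", "replace"):
            inserted += j2 - j1
    return deleted, inserted


def should_break(old_lines, new_lines, break_score=0.5):
    """Break when deletions plus insertions exceed the score, taken
    relative to the size of the smaller of the two versions."""
    deleted, inserted = change_extent(old_lines, new_lines)
    smaller = min(len(old_lines), len(new_lines))
    return deleted + inserted > break_score * smaller
```

With this sketch, changing one line of a ten-line file stays a plain
modification, while replacing every line makes `should_break` return
true.
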
diffcore-rename: For Detecting Renames and Copies
-------------------------------------------------

This transformation is used to detect renames and copies, and is
controlled by the -M option (to detect renames) and the -C option
(to detect copies as well) to the 'git diff-{asterisk}' commands. If the
input contained these filepairs:

------------------------------------------------
:100644 000000 0123456... 0000000... D fileX
:000000 100644 0000000... 0123456... A file0
------------------------------------------------

and the contents of the deleted file fileX are similar enough to
the contents of the created file file0, then rename detection
merges these filepairs and creates:

------------------------------------------------
:100644 100644 0123456... 0123456... R100 fileX file0
------------------------------------------------

When the "-C" option is used, the original contents of modified files,
and deleted files (and also unmodified files, if the
"--find-copies-harder" option is used) are considered as candidates
for the source file in a rename/copy operation. If the input were like
these filepairs that talk about a modified file fileY and a newly
created file file0:

------------------------------------------------
:100644 100644 0123456... 1234567... M fileY
:000000 100644 0000000... bcd3456... A file0
------------------------------------------------

the original contents of fileY and the resulting contents of
file0 are compared, and if they are similar enough, they are
changed to:

------------------------------------------------
:100644 100644 0123456... 1234567... M fileY
:100644 100644 0123456... bcd3456... C100 fileY file0
------------------------------------------------

In both rename and copy detection, the same "extent of changes"
algorithm used in diffcore-break is used to determine if two
files are "similar enough", and can be customized to use
a similarity score different from the default of 50% by giving a
number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
8/10 = 80%).

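The "-M8" means "8/10" notation can be sketched like this: without a
percent sign the digits are read as a decimal fraction, so "8" is 0.8
and "75" is 0.75. The helper name `parse_score` is hypothetical, and
the sketch deliberately ignores other spellings the real option parser
may accept.

```python
def parse_score(text: str, default: float = 0.5) -> float:
    """Turn the text after -M or -C into a similarity score in [0, 1]."""
    if text == "":
        return default  # no number given: use the 50% default
    if text.endswith("%"):
        value = float(text[:-1]) / 100.0
    else:
        if not text.isdigit():
            raise ValueError(f"not a score: {text!r}")
        # Digits are read as if a decimal point preceded them.
        value = int(text) / 10 ** len(text)
    return min(value, 1.0)
```

Under this reading "-M8", "-M80" and "-M80%" all ask for the same 80%
threshold, while "-M08" asks for 8%.
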
Note that when rename detection is on but both copy and break
detection are off, rename detection adds a preliminary step that first
checks if files are moved across directories while keeping their
filename the same. If there is a file added to a directory whose
contents are sufficiently similar to a file with the same name that got
deleted from a different directory, it will mark them as renames and
exclude them from the later quadratic step (the one that pairwise
compares all unmatched files to find the "best" matches, determined by
the highest content similarity). So, for example, if a deleted
docs/ext.txt and an added docs/config/ext.txt are similar enough, they
will be marked as a rename and prevent an added docs/ext.md that may
be even more similar to the deleted docs/ext.txt from being considered
as the rename destination in the later step. For this reason, the
preliminary "match same filename" step uses a slightly higher threshold to
mark a file pair as a rename and stop considering other candidates for
better matches. At most, one comparison is done per file in this
preliminary pass; so if there are several remaining ext.txt files
throughout the directory hierarchy after exact rename detection, this
preliminary step will be skipped for those files.

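A minimal sketch of the preliminary "match same filename" pass might
look like the following. Several things here are assumptions made for
illustration: the function names, the use of `difflib` as the
similarity measure, and the 0.75 threshold, which only stands in for
the "slightly higher" threshold mentioned above and is not the value
Git uses.

```python
import difflib
import posixpath
from collections import defaultdict


def similarity(a: str, b: str) -> float:
    """Stand-in content similarity in [0, 1]; not Git's own measure."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def match_by_basename(deleted, added, threshold=0.75):
    """Pair deleted and added paths that share a unique basename.

    `deleted` and `added` map path -> content for the files still
    unmatched after exact rename detection. Returns {old_path: new_path}.
    """
    sources = defaultdict(list)
    targets = defaultdict(list)
    for path in deleted:
        sources[posixpath.basename(path)].append(path)
    for path in added:
        targets[posixpath.basename(path)].append(path)

    renames = {}
    for name, src_paths in sources.items():
        dst_paths = targets.get(name, [])
        # Several files with this basename: skip, leaving them to the
        # later pairwise step, so each file is compared at most once.
        if len(src_paths) != 1 or len(dst_paths) != 1:
            continue
        src, dst = src_paths[0], dst_paths[0]
        if similarity(deleted[src], added[dst]) >= threshold:
            renames[src] = dst
    return renames
```

Files paired here would be removed from the candidate lists before the
quadratic comparison, which is where the time saving comes from.
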
Note. When the "-C" option is used with the `--find-copies-harder`
option, 'git diff-{asterisk}' commands feed unmodified filepairs to
the diffcore mechanism as well as modified ones. This lets the copy
detector consider unmodified files as copy source candidates at
the expense of making it slower. Without `--find-copies-harder`,
'git diff-{asterisk}' commands can detect copies only if the file that was
copied happened to have been modified in the same changeset.


diffcore-merge-broken: For Putting Complete Rewrites Back Together
------------------------------------------------------------------

This transformation is used to merge filepairs broken by
diffcore-break, and not transformed into rename/copy by
diffcore-rename, back into a single modification. This always
runs when diffcore-break is used.

For the purpose of merging broken filepairs back, it uses a
different "extent of changes" computation from the ones used by
diffcore-break and diffcore-rename. It counts only the deletion
from the original, and does not count insertion. If you removed
only 10 lines from a 100-line document, even if you added 910
new lines to make a new 1000-line document, you did not do a
complete rewrite. diffcore-break breaks such a case in order to
help diffcore-rename to consider such filepairs as candidates for
rename/copy detection, but if filepairs broken that way were not
matched with other filepairs to create rename/copy, then this
transformation merges them back into the original
"modification".

The "extent of changes" parameter can be tweaked from the
default 80% (that is, unless more than 80% of the original
material is deleted, the broken pairs are merged back into a
single modification) by giving a second number to -B option,
like these:

* -B50/60 (give 50% "break score" to diffcore-break, use 60%
for diffcore-merge-broken).

* -B/60 (the same as above, since diffcore-break defaults to 50%).

Note that earlier implementations left a broken pair as separate
creation and deletion patches. This was an unnecessary hack and
the latest implementation always merges all the broken pairs
back into modifications, but the resulting patch output is
formatted differently for easier review in case of such
a complete rewrite by showing the entire contents of old version
prefixed with '-', followed by the entire contents of new
version prefixed with '+'.


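The merge-back rule and the two-number form of -B described above can
be sketched as follows. The function names are hypothetical, and the
sketch assumes the numbers are written the same way as for -M, that
is, as digits following a decimal point, so "50" is 50% and "60" is
60%.

```python
def _fraction(text: str, default: float) -> float:
    """Read digits as a decimal fraction ("60" -> 0.60); empty -> default."""
    if not text:
        return default
    return min(int(text) / 10 ** len(text), 1.0)


def parse_break_option(arg: str):
    """Split the text after -B, such as '50/60' or '/60', into
    (break_score, merge_score), using the 50% and 80% defaults."""
    break_text, _, merge_text = arg.partition("/")
    return _fraction(break_text, 0.5), _fraction(merge_text, 0.8)


def should_merge_back(original_size: int, deleted: int, merge_score: float = 0.8) -> bool:
    """Merge a broken pair back unless more than `merge_score` of the
    original material was deleted. Insertions are not counted."""
    if original_size == 0:
        return True
    return deleted / original_size <= merge_score
```

For the 100-line document in the example above, removing 10 lines
gives a deleted fraction of 10%, far below 80%, so the pair is merged
back into a modification no matter how many lines were added.
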
diffcore-pickaxe: For Detecting Addition/Deletion of Specified String
---------------------------------------------------------------------

This transformation limits the set of filepairs to those that change
specified strings between the preimage and the postimage in a certain
way. -S<block of text> and -G<regular expression> options are used to
specify different ways these strings are sought.

"-S<block of text>" detects filepairs whose preimage and postimage
have a different number of occurrences of the specified block of text.
By definition, it will not detect in-file moves. Also, when a
changeset moves a file wholesale without affecting the interesting
string, diffcore-rename kicks in as usual, and `-S` omits the filepair
(since the number of occurrences of that string didn't change in that
rename-detected filepair). When used with `--pickaxe-regex`, the
<block of text> is treated as an extended POSIX regular expression to
match, instead of a literal string.

"-G<regular expression>" (mnemonic: grep) detects filepairs whose
textual diff has an added or a deleted line that matches the given
regular expression. This means that it will detect in-file (or what
rename-detection considers the same file) moves, which is noise. The
implementation runs diff twice and greps, and this can be quite
expensive. To speed things up, binary files without textconv filters
will be ignored.

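The difference between the two criteria can be made concrete with a
small sketch. It is only an approximation: `pickaxe_s` and `pickaxe_g`
are made-up names, Python's `re` module is not the POSIX regular
expression engine, and `difflib` stands in for Git's own textual diff.

```python
import difflib
import re


def pickaxe_s(preimage: str, postimage: str, needle: str, regex: bool = False) -> bool:
    """-S style: keep the filepair when the number of occurrences differs."""
    def count(text):
        if regex:
            return sum(1 for _ in re.finditer(needle, text))
        return text.count(needle)
    return count(preimage) != count(postimage)


def pickaxe_g(preimage: str, postimage: str, pattern: str) -> bool:
    """-G style: keep the filepair when an added or deleted line matches."""
    compiled = re.compile(pattern)
    old, new = preimage.splitlines(), postimage.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Lines removed from the old side plus lines added on the new side.
        changed = old[i1:i2] + new[j1:j2]
        if any(compiled.search(line) for line in changed):
            return True
    return False
```

Moving a line that contains the string within the same file leaves the
occurrence count unchanged, so the -S style check rejects the filepair,
while the -G style check accepts it because the moved line shows up as
one deleted and one added line.
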
When `-S` or `-G` is used without `--pickaxe-all`, only filepairs
that match the respective criterion are kept in the output. When
`--pickaxe-all` is used, if even one filepair matches the respective
criterion in a changeset, the entire changeset is kept. This behavior
is designed to make reviewing changes in the context of the whole
changeset easier.

diffcore-order: For Sorting the Output Based on Filenames
---------------------------------------------------------

This is used to reorder the filepairs according to the user's
(or project's) taste, and is controlled by the -O option to the
'git diff-{asterisk}' commands.

This takes a text file each of whose lines is a shell glob
pattern. Filepairs that match a glob pattern on an earlier line
in the file are output before ones that match a later line, and
filepairs that do not match any glob pattern are output last.

As an example, a typical orderfile for core Git would probably
look like this:

------------------------------------------------
README
Makefile
Documentation
*.h
*.c
t
------------------------------------------------

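The ordering rule can be sketched with Python's `fnmatch` module. The
function names are hypothetical, and one detail is an assumption made
so that directory entries such as "Documentation" and "t" in the
example above have an effect: a pattern is tried against the full path
and against each of its leading directories.

```python
import fnmatch


def order_rank(path: str, patterns) -> int:
    """Index of the first pattern matching the path or a leading directory."""
    parts = path.split("/")
    candidates = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    for rank, pattern in enumerate(patterns):
        if any(fnmatch.fnmatchcase(c, pattern) for c in candidates):
            return rank
    return len(patterns)  # unmatched paths are output last


def apply_orderfile(paths, patterns):
    """Reorder paths by rank; ties keep their input order (stable sort)."""
    return sorted(paths, key=lambda p: order_rank(p, patterns))
```

Given the orderfile above, a change touching `diff.c`, `diff.h` and
`Makefile` would be shown as `Makefile`, then `diff.h`, then `diff.c`.
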
diffcore-rotate: For Changing At Which Path Output Starts
---------------------------------------------------------

This transformation takes one pathname, and rotates the set of
filepairs so that the filepair for the given pathname comes first,
optionally discarding the paths that come before it. This is used
to implement the `--skip-to` and the `--rotate-to` options. It is
an error when the specified pathname is not in the set of filepairs,
but it is not useful to error out when used with the "git log" family of
commands, because it is unreasonable to expect that a given path
would be modified by each and every commit shown by the "git log"
command. For this reason, when used with "git log", the filepair
that sorts the same as, or the first one that sorts after, the given
pathname is where the output starts.

Use of this transformation combined with diffcore-order will produce
unexpected results, as the input to this transformation is likely
not sorted when diffcore-order is in effect.


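The rotation described above can be sketched as follows, assuming the
input is a list of paths already in sorted order. The function name
and its `skip` and `strict` parameters are made up for illustration,
and plain Python string comparison only approximates the order in
which Git sorts paths.

```python
import bisect


def rotate(paths, target, skip=False, strict=True):
    """Start the output at `target`.

    skip=False models --rotate-to (earlier paths move to the end);
    skip=True models --skip-to (earlier paths are discarded).
    strict=False models the "git log" behavior: start at the first
    path that sorts the same as, or after, the target.
    """
    if strict:
        if target not in paths:
            raise ValueError(f"no such path in the diff: {target}")
        start = paths.index(target)
    else:
        start = bisect.bisect_left(paths, target)
    if skip:
        return paths[start:]
    return paths[start:] + paths[:start]
```

Because the non-strict lookup relies on the list being sorted, feeding
it paths that were reordered by an orderfile would pick an arbitrary
starting point, which matches the warning above.
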
SEE ALSO
--------
linkgit:git-diff[1],
linkgit:git-diff-files[1],
linkgit:git-diff-index[1],
linkgit:git-diff-tree[1],
linkgit:git-format-patch[1],
linkgit:git-log[1],
linkgit:gitglossary[7],
link:user-manual.html[The Git User's Manual]

GIT
---
Part of the linkgit:git[1] suite