2005-04-26 18:25:05 +02:00
/*
 * Copyright (C) 2005 Junio C Hamano
 */
2005-04-26 03:22:47 +02:00
#ifndef DIFF_H
#define DIFF_H

2006-03-30 08:55:43 +02:00
#include "tree-walk.h"
2013-07-14 10:35:25 +02:00
#include "pathspec.h"
2015-03-14 00:39:33 +01:00
#include "object.h"
2018-01-04 23:50:42 +01:00
#include "oidset.h"
2006-01-31 23:10:56 +01:00

Log message printout cleanups
On Sun, 16 Apr 2006, Junio C Hamano wrote:
>
> In the mid-term, I am hoping we can drop the generate_header()
> callchain _and_ the custom code that formats commit log in-core,
> found in cmd_log_wc().
Ok, this was nastier than expected, just because the dependencies between
the different log-printing stuff were absolutely _everywhere_, but here's
a patch that does exactly that.
The patch is not very easy to read, and the "--patch-with-stat" thing is
still broken (it does not call the "show_log()" thing properly for
merges). That's not a new bug. In the new world order it _should_ do
something like
	if (rev->logopt)
		show_log(rev, rev->logopt, "---\n");
but it doesn't. I haven't looked at the --with-stat logic, so I left it
alone.
That said, this patch removes more lines than it adds, and in particular,
the "cmd_log_wc()" loop is now a very clean:
	while ((commit = get_revision(rev)) != NULL) {
		log_tree_commit(rev, commit);
		free(commit->buffer);
		commit->buffer = NULL;
	}
so it doesn't get much prettier than this. All the complexity is entirely
hidden in log-tree.c, and any code that needs to flush the log literally
just needs to do the "if (rev->logopt) show_log(...)" incantation.
I had to make the combined_diff() logic take a "struct rev_info" instead
of just a "struct diff_options", but that part is pretty clean.
This does change "git whatchanged" from using "diff-tree" as the commit
descriptor to "commit", and I changed one of the tests to reflect that new
reality. Otherwise everything still passes, and my other tests look fine
too.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-17 20:59:32 +02:00
struct rev_info;
2005-10-21 06:05:05 +02:00
struct diff_options;
2006-09-07 08:35:42 +02:00
struct diff_queue_struct;
2010-05-26 09:08:02 +02:00
struct strbuf;
2010-06-07 17:23:36 +02:00
struct diff_filespec;
struct userdiff_driver;
2017-03-31 03:40:00 +02:00
struct oid_array;
2011-12-17 11:20:07 +01:00
struct commit;
tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Previously diff_tree(), which is now named ll_diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.
In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.
For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .
There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .
That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in "combine-diff: optimize combine_diff_path
sets intersection" I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.
That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
just to only then intersect that huge diff to almost small set of files
from diff(commit,commit^1).
That's because at present, to compute combine-diff, for first finding
paths, that "every parent touches", we use the following combine-diff
property/definition:
D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn) (w.r.t. paths)
where
D(A,P1...Pn) is combined diff between commit A, and parents Pi
and
D(A,Pi) is usual two-tree diff Pi..A
So if any of those D(A,Pi) is huge, treating 1 n-parent combine-diff as n
1-parent diffs and intersecting the results will be slow.
And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.
The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.
Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:
D(T,P1...Pn) calculation scheme
-------------------------------
D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn) (regarding resulting paths set)
D(T,Pj) - diff between T..Pj
D(T,P1...Pn) - combined diff from T to parents P1,...,Pn
We start from all trees, which are sorted, and compare their entries in
lock-step:
     T     P1       Pn
     -     -        -
    |t|   |p1|     |pn|
    |-|   |--| ... |--|      imin = argmin(p1...pn)
    | |   |  |     |  |
    |-|   |--|     |--|
    |.|   |. |     |. |
     .     .        .
     .     .        .
at any time there could be 3 cases:
1) t < p[imin];
2) t > p[imin];
3) t = p[imin].
Schematic deduction of what every case means, and what to do, follows:
1) t < p[imin] -> ∀j t ∉ Pj -> "+t" ∈ D(T,Pj) -> D += "+t"; t↓
2) t > p[imin]
2.1) ∃j: pj > p[imin] -> "-p[imin]" ∉ D(T,Pj) -> D += ø; ∀ pi=p[imin] pi↓
2.2) ∀i pi = p[imin] -> pi ∉ T -> "-pi" ∈ D(T,Pi) -> D += "-p[imin]"; ∀i pi↓
3) t = p[imin]
3.1) ∃j: pj > p[imin] -> "+t" ∈ D(T,Pj) -> only pi=p[imin] remains to investigate
3.2) pi = p[imin] -> investigate δ(t,pi)
|
|
v
3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø ->
⎧δ(t,pi) - if pi=p[imin]
-> D += ⎨
⎩"+t" - if pi>p[imin]
in any case t↓ ∀ pi=p[imin] pi↓
~
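The case analysis above can be sketched in C on simplified data. This is only an illustration under stated assumptions, not the code from tree-diff.c: `struct entry`, `struct tree`, `struct change`, `diff_multi()` and `MAX_PARENTS` are hypothetical names, trees are flat sorted arrays, an integer `id` stands in for the object hash, and there is no recursion into subdirectories. A path is emitted only when it differs from every parent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARENTS 8	/* arbitrary cap for this sketch */

struct entry { const char *name; int id; };	/* id stands in for a hash */
struct tree { const struct entry *e; size_t n; };	/* sorted by name */
struct change { char status; const char *name; };	/* '+', '-' or 'M' */

/* Current entry of a tree, or NULL once the tree is exhausted. */
static const struct entry *cur(const struct tree *t, size_t pos)
{
	return pos < t->n ? &t->e[pos] : NULL;
}

/* Compare names; NULL (exhausted) sorts after every real name. */
static int name_cmp(const char *a, const char *b)
{
	if (!a || !b)
		return !a - !b;
	return strcmp(a, b);
}

/*
 * Emit the paths that differ from every parent. out[] needs room for
 * the total number of entries in t and in all parents.
 */
size_t diff_multi(const struct tree *t, const struct tree *p,
		  size_t nparent, struct change *out)
{
	size_t ti = 0, pi[MAX_PARENTS] = { 0 }, nout = 0;

	assert(nparent >= 1 && nparent <= MAX_PARENTS);
	for (;;) {
		const struct entry *te = cur(t, ti);
		const char *tname = te ? te->name : NULL;
		const char *pmin = NULL;
		size_t i, at_min = 0, differ = 0;
		int cmp;

		/* pmin = smallest current name among all parents */
		for (i = 0; i < nparent; i++) {
			const struct entry *pe = cur(&p[i], pi[i]);
			if (pe && name_cmp(pe->name, pmin) < 0)
				pmin = pe->name;
		}
		if (!tname && !pmin)
			break;	/* every tree is exhausted */

		cmp = name_cmp(tname, pmin);
		if (cmp < 0) {	/* case 1: t is in no parent */
			out[nout++] = (struct change){ '+', tname };
			ti++;
			continue;
		}
		/* cases 2 and 3: advance every parent sitting at pmin */
		for (i = 0; i < nparent; i++) {
			const struct entry *pe = cur(&p[i], pi[i]);
			if (!pe || strcmp(pe->name, pmin))
				continue;
			at_min++;
			if (cmp == 0 && pe->id != te->id)
				differ++;
			pi[i]++;
		}
		if (cmp > 0) {	/* case 2: emit only if all parents had it */
			if (at_min == nparent)
				out[nout++] = (struct change){ '-', pmin };
		} else {	/* case 3: emit if t differs from all at pmin */
			if (differ == at_min)
				out[nout++] = (struct change){ 'M', tname };
			ti++;
		}
	}
	return nout;
}
```

In this sketch 'M' covers both sub-cases 3.1 and 3.2: parents whose current entry is beyond p[imin] count as differing automatically, so only the parents sitting at p[imin] are compared by id.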
For comparison, here is how diff_tree() works:
D(A,B) calculation scheme
-------------------------
     A     B
     -     -
    |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
    |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
    | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓
    |-|   |-|
    |.|   |.|
     .     .
     .     .
~~~~~~~~
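For the two-tree scheme, a minimal C sketch of the same lock-step walk might look as follows. It assumes flat, name-sorted arrays and an integer `id` in place of the object hash; `struct entry`, `struct change` and `diff_two()` are hypothetical names used only for this illustration, not identifiers from the Git sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry {
	const char *name;	/* sort key */
	int id;			/* stand-in for the object hash */
};

struct change {
	char status;		/* '+': only in A, '-': only in B, 'M': differs */
	const char *name;
};

/*
 * Walk two sorted entry lists in lock-step and record the differences.
 * out[] needs room for na + nb changes. Returns the number recorded.
 */
size_t diff_two(const struct entry *a, size_t na,
		const struct entry *b, size_t nb,
		struct change *out)
{
	size_t i = 0, j = 0, n = 0;

	assert(out != NULL);
	while (i < na || j < nb) {
		int cmp;

		if (i == na)
			cmp = 1;	/* A exhausted: rest of B is "-b" */
		else if (j == nb)
			cmp = -1;	/* B exhausted: rest of A is "+a" */
		else
			cmp = strcmp(a[i].name, b[j].name);

		if (cmp < 0) {		/* a < b: a is not in B */
			out[n++] = (struct change){ '+', a[i].name };
			i++;
		} else if (cmp > 0) {	/* a > b: b is not in A */
			out[n++] = (struct change){ '-', b[j].name };
			j++;
		} else {		/* a = b: compare contents */
			if (a[i].id != b[j].id)
				out[n++] = (struct change){ 'M', a[i].name };
			i++;
			j++;
		}
	}
	return n;
}
```

Because both inputs are sorted, each entry is visited once, so the walk is linear in the combined number of entries.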
This patch generalizes diff tree-walker to work with arbitrary number of
parents as described above - i.e. now there is a resulting tree t, and
some parents trees tp[i] i=[0..nparent). The generalization builds on
the fact that usual diff
D(A,B)
is by definition the same as combined diff
D(A,[B]),
so if we could rework the code for common case and make it be not slower
for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
multiparent diff tree-walker would greatly benefit generating
combine-diff.
What we do is as follows:
1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be
a paths generator (new name diff_tree_paths()), with each generated path
being `struct combine_diff_path` with info for path, new sha1,mode and for
every parent which sha1,mode it was in it.
2) From that info, we can still generate usual diff queue with
struct diff_filepairs, via "exporting" generated
combine_diff_path, if we know we run for nparent=1 case.
(see emit_diff() which is now named emit_diff_first_parent_only())
3) In order for diff_can_quit_early(), which checks
DIFF_OPT_TST(opt, HAS_CHANGES)
to work, that exporting has to happen not in bulk, but
incrementally, one diff path at a time.
For such consumers, there is a new callback in diff_options
introduced:
->pathchange(opt, struct combine_diff_path *)
which, if set to !NULL, is called for every generated path.
(see new compat ll_diff_tree_sha1() wrapper around new paths
generator for setup)
4) The paths generation itself, is reworked from previous
ll_diff_tree_sha1() code according to "D(A,P1...Pn) calculation
scheme" provided above:
On the start we allocate [nparent] arrays in place what was
earlier just for one parent tree.
then we just generalize loops, and comparison according to the
algorithm.
Some notes(*):
1) alloca(), for small arrays, is used for "runs not slower for
nparent=1 case than before" goal - if we change it to xmalloc()/free()
the timings get ~1% worse. For alloca() we use just-introduced
xalloca/xalloca_free compatibility wrappers, so it should not be a
portability problem.
2) For every parent tree, we need to keep a tag, whether entry from that
parent equals to entry from minimal parent. For performance reasons I'm
keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.
Not doing so, we'd need to alloca another [nparent] array, which hurts
performance.
3) For emitted paths, memory could be reused, if we know the path was
processed via callback and will not be needed later. We use efficient
hand-made realloc-style path_appendnew(), that saves us from ~1-1.5%
of potential additional slowdown.
4) goto(s) are used in several places, as the code executes a little bit
faster with lowered register pressure.
Also
- we should now check for FIND_COPIES_HARDER not only when two entries
names are the same, and their hashes are equal, but also for a case,
when a path was removed from some of all parents having it.
The reason is, if we don't, that path won't be emitted at all (see
"a > xi" case), and we'll just skip it, and FIND_COPIES_HARDER wants
all paths - with diff or without - to be emitted, to be later analyzed
for being copies sources.
The new check is only necessary for nparent >1, as for nparent=1 case
xmin_eqtotal always =1 =nparent, and a path is always added to diff as
removal.
~~~~~~~~
Timings for
# without -c, i.e. testing only nparent=1 case
`git log --raw --no-abbrev --no-renames`
before and after the patch are as follows:
            navy.git    linux.git v3.10..v3.11
before      0.611s      1.889s
after       0.619s      1.907s
slowdown    1.3%        0.9%
These timings show we did no harm to usual diff(tree1,tree2) generation.
From the table we can see that we actually did ~1% slowdown, but I think
I've "earned" that 1% in the previous patch ("tree-diff: reuse base
str(buf) memory on sub-tree recursion", HEAD~~) so for nparent=1 case,
net timings stay approximately the same.
The output also stayed the same.
(*) If we revert 1)-4) to more usual techniques, for nparent=1 case,
we'll get ~2-2.5% of additional slowdown, which I've tried to avoid, as
"do no harm for nparent=1 case" rule.
For linux.git, combined diff will run an order of magnitude faster and
appropriate timings will be provided in the next commit, as we'll be
taking advantage of the new diff tree-walker for combined-diff
generation there.
P.S. and combined diff is not some exotic/for-play-only stuff - for
example for a program I write to represent Git archives as readonly
filesystem, there is initial scan with
`git log --reverse --raw --no-abbrev --no-renames -c`
to extract log of what was created/changed when, as a result building a
map
{} sha1 -> in which commit (and date) a content was added
that `-c` means also show combined diff for merges, and without them, if
a merge is non-trivial (merges changes from two parents with both having
separate changes to a file), or an evil one, the map will not be full,
i.e. some valid sha1 would be absent from it.
That case was my initial motivation for combined diffs speedup.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-04-06 23:46:26 +02:00
struct combine_diff_path;

typedef int (*pathchange_fn_t)(struct diff_options *options,
		 struct combine_diff_path *path);
2005-10-21 06:05:05 +02:00

typedef void (*change_fn_t)(struct diff_options *options,
		 unsigned old_mode, unsigned new_mode,
2017-05-30 19:30:49 +02:00
		 const struct object_id *old_oid,
		 const struct object_id *new_oid,
		 int old_oid_valid, int new_oid_valid,
2010-01-18 21:26:18 +01:00
		 const char *fullpath,
		 unsigned old_dirty_submodule, unsigned new_dirty_submodule);
2005-10-21 06:05:05 +02:00

typedef void (*add_remove_fn_t)(struct diff_options *options,
		    int addremove, unsigned mode,
2017-05-30 19:30:47 +02:00
		    const struct object_id *oid,
		    int oid_valid,
2010-01-18 21:26:18 +01:00
		    const char *fullpath, unsigned dirty_submodule);
2005-10-21 06:05:05 +02:00

2006-09-07 08:35:42 +02:00
typedef void (*diff_format_fn_t)(struct diff_queue_struct *q,
		struct diff_options *options, void *data);

2010-05-26 09:08:02 +02:00
typedef struct strbuf *(*diff_prefix_fn_t)(struct diff_options *opt, void *data);

2006-06-24 19:21:53 +02:00
#define DIFF_FORMAT_RAW		0x0001
#define DIFF_FORMAT_DIFFSTAT	0x0002
2006-10-12 12:01:00 +02:00
#define DIFF_FORMAT_NUMSTAT	0x0004
#define DIFF_FORMAT_SUMMARY	0x0008
#define DIFF_FORMAT_PATCH	0x0010
2006-12-15 05:15:44 +01:00
#define DIFF_FORMAT_SHORTSTAT	0x0020
Add "--dirstat" for some directory statistics
This adds a new form of overview diffstat output, doing something that I
have occasionally ended up doing manually (and badly, because it's
actually pretty nasty to do), and that I think is very useful for a
project like the kernel that has a fairly deep and well-separated
directory structure with semantic meaning.
What I mean by that is that it's often interesting to see exactly which
sub-directories are impacted by a patch, and to what degree - even if you
don't perhaps care so much about the individual files themselves.
What makes the concept more interesting is that the "impact" is often
hierarchical: in the kernel, for example, something could either have a
very localized impact to "fs/ext3/" and then it's interesting to see that
such a patch changes mostly that subdirectory, but you could have another
patch that changes some generic VFS-layer issue which affects _many_
subdirectories that are all under "fs/", but none - or perhaps just a
couple of them - of the individual filesystems are interesting in
themselves.
So what commonly happens is that you may have big changes in a specific
sub-subdirectory, but still also significant separate changes to the
subdirectory leading up to that - maybe you have significant VFS-level
changes, but *also* changes under that VFS layer in the NFS-specific
directories, for example. In that case, you do want the low-level parts
that are significant to show up, but then the insignificant ones should
show up as under the more generic top-level directory.
This patch shows all of that with "--dirstat". The output can be either
something simple like
commit 81772fe...
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Sun Feb 10 23:57:36 2008 +0100
x86: remove over noisy debug printk
pageattr-test.c contains a noisy debug printk that people reported.
The condition under which it prints (randomly tapping into a mem_map[]
hole and not being able to c_p_a() there) is valid behavior and not
interesting to report.
Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
100.0% arch/x86/mm/
or something much more complex like
commit e231c2e...
Author: David Howells <dhowells@redhat.com>
Date: Thu Feb 7 00:15:26 2008 -0800
Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p)
20.5% crypto/
7.6% fs/afs/
7.6% fs/fuse/
7.6% fs/gfs2/
5.1% fs/jffs2/
5.1% fs/nfs/
5.1% fs/nfsd/
7.6% fs/reiserfs/
15.3% fs/
7.6% net/rxrpc/
10.2% security/keys/
where that latter example is an example of significant work in some
individual fs/*/ subdirectories (like the patches to reiserfs accounting
for 7.6% of the whole), but then discounting those individual filesystems,
there's also 15.3% other "random" things that weren't worth reporting on
their own left over under fs/ in general (either in that directory itself,
or in subdirectories of fs/ that didn't have enough changes to be reported
individually).
I'd like to stress that the "15.3% fs/" mentioned above is the stuff that
is under fs/ but that was _not_ significant enough to report on its own.
So the above does _not_ mean that 15.3% of the work was under fs/ per se,
because that 15.3% does *not* include the already-reported 7.6% of afs,
7.6% of fuse etc.
If you want to enable "cumulative" directory statistics, you can use the
"--cumulative" flag, which adds up percentages recursively even when
they have been already reported for a sub-directory. That cumulative
output is disabled if *all* of the changes in one subdirectory come from
a deeper subdirectory, to avoid repeating subdirectories all the way to
the root.
For an example of the cumulative reporting, the above commit becomes
commit e231c2e...
Author: David Howells <dhowells@redhat.com>
Date: Thu Feb 7 00:15:26 2008 -0800
Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p)
20.5% crypto/
7.6% fs/afs/
7.6% fs/fuse/
7.6% fs/gfs2/
5.1% fs/jffs2/
5.1% fs/nfs/
5.1% fs/nfsd/
7.6% fs/reiserfs/
61.5% fs/
7.6% net/rxrpc/
10.2% security/keys/
in which the commit percentages now obviously add up to much more than
100%: now the changes that were already reported for the sub-directories
under fs/ are then cumulatively included in the whole percentage of fs/
(ie now shows 61.5% as opposed to the 15.3% without the cumulative
reporting).
The default reporting limit has been arbitrarily set at 3%, which seems
to be a pretty good cut-off, but you can specify the cut-off manually by
giving it as an option parameter (eg "--dirstat=5" makes the cut-off be
at 5% instead)
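A rough C sketch of the cut-off arithmetic, under simplifying assumptions: directories are treated as a flat list (no hierarchical roll-up of leftovers into a parent directory), shares are computed in permille and rounded down, and a directory is reported when its share reaches the limit. `struct dir_damage` and `dirstat_select()` are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

struct dir_damage {
	const char *dir;
	unsigned long changed;	/* lines added plus lines removed */
};

/*
 * Store each directory's share of the total change, in permille and
 * rounded down, into share[]. Returns how many directories reach
 * permille_limit and would therefore be reported.
 */
size_t dirstat_select(const struct dir_damage *d, size_t n,
		      unsigned permille_limit, unsigned *share)
{
	unsigned long total = 0;
	size_t i, reported = 0;

	assert(d != NULL && share != NULL);
	for (i = 0; i < n; i++)
		total += d[i].changed;
	for (i = 0; i < n; i++) {
		share[i] = total ? (unsigned)(d[i].changed * 1000 / total) : 0;
		if (share[i] > 0 && share[i] >= permille_limit)
			reported++;
	}
	return reported;
}
```

With this convention the default 3% cut-off corresponds to a limit of 30, and "--dirstat=5" to a limit of 50.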
NOTE! The percentages are purely about the total lines added and removed,
not anything smarter (or dumber) than that. Also note that you should not
generally expect things to add up to 100%: not only does it round down, we
don't report leftover scraps (they add up to the top-level change count,
but we don't even bother reporting that, it only reports subdirectories).
Quite frankly, as a top-level manager this is really convenient for me,
but it's going to be very boring for git itself since there are few
subdirectories. Also, don't expect things to make tons of sense if you
combine this with "-M" and there are cross-directory renames etc.
But even for git itself, you can get some fun statistics. Try out
git log --dirstat
and see the occasional mentions of things like Documentation/, git-gui/,
gitweb/ and gitk-git/. Or try out something like
git diff --dirstat v1.5.0..v1.5.4
which does kind of give an overview that shows *something*. But in general,
the output is more exciting for big projects with deeper structure, and
doing a
git diff --dirstat v2.6.24..v2.6.25-rc1
on the kernel is what I actually wrote this for!
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-12 22:26:31 +01:00
#define DIFF_FORMAT_DIRSTAT	0x0040
2006-06-24 19:21:53 +02:00

/* These override all above */
2006-10-12 12:01:00 +02:00
#define DIFF_FORMAT_NAME	0x0100
#define DIFF_FORMAT_NAME_STATUS	0x0200
#define DIFF_FORMAT_CHECKDIFF	0x0400
2006-06-24 19:21:53 +02:00

/* Same as output_format = 0 but we know that -s flag was given
 * and we should not give default value to output_format.
 */
2006-10-12 12:01:00 +02:00
#define DIFF_FORMAT_NO_OUTPUT	0x0800
2006-06-24 19:21:53 +02:00

2006-10-12 12:01:00 +02:00
#define DIFF_FORMAT_CALLBACK	0x1000
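Since the DIFF_FORMAT_* values are distinct bits, several formats can be combined in one output_format word. The sketch below shows one way the "These override all above" comment could be honored. The macro values are copied from the defines above; `DEMO_BASIC`, `DEMO_OVERRIDING` and `demo_effective_format()` are hypothetical helpers, and the masking rule is an assumption made for illustration rather than a statement of what diff.c does.

```c
#include <assert.h>

/* Values copied from the defines above. */
#define DIFF_FORMAT_RAW		0x0001
#define DIFF_FORMAT_DIFFSTAT	0x0002
#define DIFF_FORMAT_NUMSTAT	0x0004
#define DIFF_FORMAT_SUMMARY	0x0008
#define DIFF_FORMAT_PATCH	0x0010
#define DIFF_FORMAT_SHORTSTAT	0x0020
#define DIFF_FORMAT_DIRSTAT	0x0040
#define DIFF_FORMAT_NAME	0x0100
#define DIFF_FORMAT_NAME_STATUS	0x0200
#define DIFF_FORMAT_CHECKDIFF	0x0400
#define DIFF_FORMAT_NO_OUTPUT	0x0800

/* Hypothetical groupings used only in this sketch. */
#define DEMO_BASIC (DIFF_FORMAT_RAW | DIFF_FORMAT_DIFFSTAT | \
		    DIFF_FORMAT_NUMSTAT | DIFF_FORMAT_SUMMARY | \
		    DIFF_FORMAT_PATCH | DIFF_FORMAT_SHORTSTAT | \
		    DIFF_FORMAT_DIRSTAT)
#define DEMO_OVERRIDING (DIFF_FORMAT_NAME | DIFF_FORMAT_NAME_STATUS | \
			 DIFF_FORMAT_CHECKDIFF)

/* Drop the basic formats whenever an overriding format is requested. */
int demo_effective_format(int output_format)
{
	assert(output_format >= 0);
	if (output_format & DEMO_OVERRIDING)
		output_format &= ~DEMO_BASIC;
	return output_format;
}
```

Under this assumed rule, asking for both a patch and a name-only listing leaves only the name-only bit set.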
2006-09-07 08:35:42 +02:00

2017-10-31 19:19:05 +01:00
#define DIFF_FLAGS_INIT { 0 }
struct diff_flags {
2017-10-31 19:19:11 +01:00
	unsigned recursive:1;
	unsigned tree_in_recursive:1;
	unsigned binary:1;
	unsigned text:1;
	unsigned full_index:1;
	unsigned silent_on_remove:1;
	unsigned find_copies_harder:1;
	unsigned follow_renames:1;
	unsigned rename_empty:1;
	unsigned has_changes:1;
	unsigned quick:1;
	unsigned no_index:1;
	unsigned allow_external:1;
	unsigned exit_with_status:1;
	unsigned reverse_diff:1;
	unsigned check_failed:1;
	unsigned relative_name:1;
	unsigned ignore_submodules:1;
	unsigned dirstat_cumulative:1;
	unsigned dirstat_by_file:1;
	unsigned allow_textconv:1;
	unsigned textconv_set_via_cmdline:1;
	unsigned diff_from_contents:1;
	unsigned dirty_submodules:1;
	unsigned ignore_untracked_in_submodules:1;
	unsigned ignore_dirty_submodules:1;
	unsigned override_submodule_config:1;
	unsigned dirstat_by_line:1;
	unsigned funccontext:1;
	unsigned default_follow_renames:1;
2018-02-24 15:09:59 +01:00
	unsigned stat_with_summary:1;
2018-08-13 13:33:11 +02:00
	unsigned suppress_diff_headers:1;
2018-08-13 13:33:20 +02:00
	unsigned dual_color_diffed_diffs:1;
2017-10-31 19:19:05 +01:00
};

static inline void diff_flags_or(struct diff_flags *a,
				 const struct diff_flags *b)
{
	char *tmp_a = (char *)a;
	const char *tmp_b = (const char *)b;
	int i;

	for (i = 0; i < sizeof(struct diff_flags); i++)
		tmp_a[i] |= tmp_b[i];
}
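The byte-wise OR in diff_flags_or() works because every member of the struct is a one-bit field, so OR-ing the underlying bytes sets exactly the union of the flags. A small self-contained sketch of the same technique on a reduced struct follows; `struct demo_flags` and `demo_flags_or()` are hypothetical stand-ins, and the sketch uses `unsigned char` and `size_t` for the byte walk as a matter of taste.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct diff_flags: only one-bit fields. */
struct demo_flags {
	unsigned recursive:1;
	unsigned binary:1;
	unsigned quick:1;
};

/* Set in *a every flag that is set in *b, one byte at a time. */
void demo_flags_or(struct demo_flags *a, const struct demo_flags *b)
{
	unsigned char *tmp_a = (unsigned char *)a;
	const unsigned char *tmp_b = (const unsigned char *)b;
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < sizeof(struct demo_flags); i++)
		tmp_a[i] |= tmp_b[i];
}
```

The approach avoids listing every flag by name, at the cost of relying on the struct containing nothing but one-bit flags.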

2009-02-17 04:26:49 +01:00
#define DIFF_XDL_TST(opts, flag)    ((opts)->xdl_opts & XDF_##flag)
#define DIFF_XDL_SET(opts, flag)    ((opts)->xdl_opts |= XDF_##flag)
#define DIFF_XDL_CLR(opts, flag)    ((opts)->xdl_opts &= ~XDF_##flag)
2007-11-10 20:05:14 +01:00

2012-02-20 00:36:55 +01:00
#define DIFF_WITH_ALG(opts, flag)   (((opts)->xdl_opts & ~XDF_DIFF_ALGORITHM_MASK) | XDF_##flag)

2010-04-14 17:59:06 +02:00
enum diff_words_type {
	DIFF_WORDS_NONE = 0,
	DIFF_WORDS_PORCELAIN,
	DIFF_WORDS_PLAIN,
	DIFF_WORDS_COLOR
};

2016-09-01 01:27:21 +02:00
enum diff_submodule_format {
	DIFF_SUBMODULE_SHORT = 0,
2016-09-01 01:27:25 +02:00
	DIFF_SUBMODULE_LOG,
	DIFF_SUBMODULE_INLINE_DIFF
2016-09-01 01:27:21 +02:00
};

2005-09-21 09:00:47 +02:00
struct diff_options {
	const char *orderfile;
	const char *pickaxe;
2006-11-02 09:02:11 +01:00
	const char *single_follow;
2007-12-18 20:32:14 +01:00
	const char *a_prefix, *b_prefix;
2016-09-01 01:27:20 +02:00
	const char *line_prefix;
	size_t line_prefix_length;
2017-10-31 19:19:05 +01:00
	struct diff_flags flags;
2013-07-18 00:05:46 +02:00

	/* diff-filter bits */
	unsigned int filter;

2011-08-18 07:03:12 +02:00
	int use_color;
2006-05-13 22:23:48 +02:00
	int context;
2008-12-28 19:45:32 +01:00
	int interhunkcontext;
2005-09-21 09:00:47 +02:00
	int break_opt;
	int detect_rename;
2011-03-01 01:11:55 +01:00
	int irreversible_delete;
git-diff: squelch "empty" diffs
After starting to edit a working tree file but later when your edit ends
up identical to the original (this can also happen when you ran a
wholesale regexp replace with something like "perl -i" that does not
actually modify many of the paths), "git diff" between the index and the
working tree outputs many "empty" diffs that show "diff --git" headers
and nothing else, because these paths are stat-dirty. While it was a
way to warn the user that the earlier action of the user made the index
ineffective as an optimization mechanism, it was felt too loud for the
purpose of warning even to experienced users, and also resulted in
confusing people new to git.
This replaces the "empty" diffs with a single warning message at the
end. Having many such paths hurts performance, and you can run
"git-update-index --refresh" to update the lstat(2) information recorded
in the index in such a case. "git-status" does so as a side effect, and
that is more familiar to the end-user, so we recommend it to them.
The change affects only "git diff" that outputs patch text, because that
is where the annoyance of too many "empty" diff is most strongly felt,
and because the warning message can be safely ignored by downstream
tools without getting mistaken as part of the patch. For the low-level
"git diff-files" and "git diff-index", the traditional behaviour is
retained.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-08-03 22:33:31 +02:00
	int skip_stat_unmatch;
2005-09-21 09:00:47 +02:00
	int line_termination;
	int output_format;
2018-01-04 23:50:39 +01:00
	unsigned pickaxe_opts;
2005-09-21 09:00:47 +02:00
	int rename_score;
2005-09-21 09:18:27 +02:00
	int rename_limit;
2011-02-19 11:20:51 +01:00
	int needed_rename_limit;
2011-01-06 22:50:06 +01:00
	int degraded_cc_to_c;
2011-02-20 10:51:16 +01:00
	int show_rename_progress;
2011-04-29 11:36:20 +02:00
	int dirstat_permille;
2005-09-21 09:00:47 +02:00
	int setup;
2005-12-14 02:21:41 +01:00
	int abbrev;
2016-10-24 12:42:19 +02:00
	int ita_invisible_in_index;
2015-05-26 19:11:28 +02:00
/* white-space error highlighting */
2017-06-30 02:06:53 +02:00
#define WSEH_NEW (1<<12)
#define WSEH_CONTEXT (1<<13)
#define WSEH_OLD (1<<14)
2015-05-26 19:11:28 +02:00
	unsigned ws_error_highlight;
diff --relative: output paths as relative to the current subdirectory
This adds --relative option to the diff family. When you start
from a subdirectory:
$ git diff --relative
shows only the diff that is inside your current subdirectory,
and without $prefix part. People who usually live in
subdirectories may like it.
There are a few things I should also mention about the change:
- This works not just with diff but also works with the log
family of commands, but the history pruning is not affected.
In other words, if you go to a subdirectory, you can say:
$ git log --relative -p
but it will show the log message even for commits that do not
touch the current directory. You can limit it by giving
pathspec yourself:
$ git log --relative -p .
This originally was not a conscious design choice, but we
have a way to affect diff pathspec and pruning pathspec
independently. IOW "git log --full-diff -p ." tells it to
prune history to commits that affect the current subdirectory
but show the changes with full context. I think it makes
more sense to leave pruning independent from --relative than
the obvious alternative of always pruning with the current
subdirectory, which would break the symmetry.
- Because this works also with the log family, you could
format-patch a single change, limiting the effect to your
subdirectory, like so:
$ cd gitk-git
$ git format-patch -1 --relative 911f1eb
But because that is a special purpose usage, this option will
never become the default, with or without repository or user
preference configuration. The risk of producing a partial
patch and sending it out by mistake is too great if we did
so.
- This is inherently incompatible with --no-index, which is a
bolted-on hack that does not have much to do with git
itself. I didn't bother checking and erroring out on the
combined use of the options, but probably I should.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-12 23:26:02 +01:00
	const char *prefix;
	int prefix_length;
2006-05-20 15:40:29 +02:00
	const char *stat_sep;
2006-06-14 17:40:23 +02:00
	long xdl_opts;
2005-10-21 06:05:05 +02:00

2017-11-27 20:47:47 +01:00
	/* see Documentation/diff-options.txt */
	char **anchors;
	size_t anchors_nr, anchors_alloc;

2006-09-27 03:53:02 +02:00
	int stat_width;
	int stat_name_width;
2012-03-01 13:26:45 +01:00
	int stat_graph_width;
2011-05-27 14:36:41 +02:00
	int stat_count;
2009-01-17 17:29:45 +01:00
	const char *word_regex;
2010-04-14 17:59:06 +02:00
	enum diff_words_type word_diff;
2016-09-01 01:27:21 +02:00
	enum diff_submodule_format submodule_format;
2006-09-27 03:53:02 +02:00

2018-01-04 23:50:42 +01:00
	struct oidset *objfind;

2007-02-25 23:34:54 +01:00
	/* this is set by diffcore for DIFF_FORMAT_PATCH */
	int found_changes;

2010-08-13 21:17:45 +02:00
	/* to support internal diff recursion by --follow hack*/
	int found_follow;

2013-05-10 17:10:11 +02:00
	void (*set_default)(struct diff_options *);

2008-03-10 03:43:39 +01:00
	FILE *file;
	int close_file;

2010-12-15 16:02:38 +01:00
	struct pathspec pathspec;
tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Previously diff_tree(), which is now named ll_diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.
In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.
For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .
There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .
That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in "combine-diff: optimize combine_diff_path
sets intersection" I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.
That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
just to only then intersect that huge diff to almost small set of files
from diff(commit,commit^1).
That's because at present, to compute combine-diff, for first finding
paths, that "every parent touches", we use the following combine-diff
property/definition:
D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn) (w.r.t. paths)
where
D(A,P1...Pn) is combined diff between commit A, and parents Pi
and
D(A,Pi) is usual two-tree diff Pi..A
So if any of those D(A,Pi) is huge, treating 1 n-parent combine-diff as n
1-parent diffs and intersecting the results will be slow.
And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.
The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.
Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:
D(T,P1...Pn) calculation scheme
-------------------------------
D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn) (regarding resulting paths set)
D(T,Pj) - diff between T..Pj
D(T,P1...Pn) - combined diff from T to parents P1,...,Pn
We start from all trees, which are sorted, and compare their entries in
lock-step:
     T     P1       Pn
     -     -        -
    |t|   |p1|     |pn|
    |-|   |--| ... |--|      imin = argmin(p1...pn)
    | |   |  |     |  |
    |-|   |--|     |--|
    |.|   |. |     |. |
     .     .        .
     .     .        .
at any time there could be 3 cases:
1) t < p[imin];
2) t > p[imin];
3) t = p[imin].
Schematic deduction of what every case means, and what to do, follows:
1) t < p[imin] -> ∀j t ∉ Pj -> "+t" ∈ D(T,Pj) -> D += "+t"; t↓
2) t > p[imin]
2.1) ∃j: pj > p[imin] -> "-p[imin]" ∉ D(T,Pj) -> D += ø; ∀ pi=p[imin] pi↓
2.2) ∀i pi = p[imin] -> pi ∉ T -> "-pi" ∈ D(T,Pi) -> D += "-p[imin]"; ∀i pi↓
3) t = p[imin]
    3.1) ∃j: pj > p[imin] -> "+t" ∈ D(T,Pj) -> only pi=p[imin] remains to investigate
    3.2) pi = p[imin] -> investigate δ(t,pi)
     |
     |
     v
    3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø ->
                      ⎧δ(t,pi)  - if pi=p[imin]
             ->  D += ⎨
                      ⎩"+t"     - if pi>p[imin]
    in any case t↓  ∀ pi=p[imin]  pi↓
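The three cases can be sketched in C as a flat, non-recursive walk. This is only an illustration under simplifying assumptions: trees are sorted arrays of (name, hash) pairs compared with `strcmp`, there are no subdirectories, and only the resulting path names are collected. The names `struct tree`, `cur`, `name_cmp` and `combined_paths` are hypothetical and do not come from Git.

```c
#include <stddef.h>
#include <string.h>

struct entry { const char *name; unsigned hash; };	/* hash stands in for the object id */
struct tree { const struct entry *e; size_t n, pos; };

/* Current entry name, or NULL when the tree is exhausted. */
static const char *cur(const struct tree *t)
{
	return t->pos < t->n ? t->e[t->pos].name : NULL;
}

/* Compare names; NULL (exhausted) sorts after every real name. */
static int name_cmp(const char *a, const char *b)
{
	if (!a || !b)
		return a ? -1 : (b ? 1 : 0);
	return strcmp(a, b);
}

/* Collect into `out` the paths of t that differ from every parent p[0..nparent). */
size_t combined_paths(struct tree *t, struct tree *p, int nparent,
		      const char **out)
{
	size_t n = 0;

	for (;;) {
		const char *tn = cur(t), *pmin = NULL;
		int i, nmin = 0, ndiff = 0, cmp;

		/* p[imin]: the smallest current entry among the parents */
		for (i = 0; i < nparent; i++)
			if (name_cmp(cur(&p[i]), pmin) < 0)
				pmin = cur(&p[i]);
		if (!tn && !pmin)
			break;			/* all trees exhausted */

		cmp = name_cmp(tn, pmin);
		if (cmp < 0) {			/* case 1: t is in no parent */
			out[n++] = tn;
			t->pos++;
			continue;
		}

		/* count parents at p[imin], and those whose entry differs from t */
		for (i = 0; i < nparent; i++) {
			if (name_cmp(cur(&p[i]), pmin) != 0)
				continue;
			nmin++;
			if (cmp == 0 && p[i].e[p[i].pos].hash != t->e[t->pos].hash)
				ndiff++;
		}

		if (cmp > 0) {			/* case 2: emit only if all parents had it */
			if (nmin == nparent)
				out[n++] = pmin;
		} else {			/* case 3: emit only if every δ(t,pi) is non-empty */
			if (ndiff == nmin)
				out[n++] = tn;
			t->pos++;
		}

		for (i = 0; i < nparent; i++)	/* advance every pi = p[imin] */
			if (name_cmp(cur(&p[i]), pmin) == 0)
				p[i].pos++;
	}
	return n;
}
```

In case 3, parents with pi > p[imin] lack the path altogether, so they already count as "+t" and only the parents sitting at p[imin] need their δ(t,pi) checked.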
~
For comparison, here is how diff_tree() works:
D(A,B) calculation scheme
-------------------------
     A     B
     -     -
    |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
    |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
    | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓
    |-|   |-|
    |.|   |.|
     .     .
     .     .
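The two-tree scheme maps onto a plain sorted merge. A minimal sketch, assuming flat sorted entry lists compared with `strcmp`, whereas real tree entries also carry modes and can be directories; `struct change` and `diff_two` are hypothetical names, and `'M'` marks a non-empty δ(a,b):

```c
#include <stddef.h>
#include <string.h>

struct entry { const char *name; unsigned hash; };	/* hash stands in for the object id */
struct change { char sign; const char *name; };	/* '+', '-' or 'M' */

/*
 * Walk a and b in lock-step and record the differences in `out`
 * (room for na + nb entries is always enough). Returns the count.
 */
size_t diff_two(const struct entry *a, size_t na,
		const struct entry *b, size_t nb,
		struct change *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		int cmp;

		if (i == na)
			cmp = 1;	/* a exhausted: what is left of b was removed */
		else if (j == nb)
			cmp = -1;	/* b exhausted: what is left of a was added */
		else
			cmp = strcmp(a[i].name, b[j].name);

		if (cmp < 0) {			/* a < b: a is not in B */
			out[n].sign = '+';
			out[n++].name = a[i++].name;
		} else if (cmp > 0) {		/* a > b: b is not in A */
			out[n].sign = '-';
			out[n++].name = b[j++].name;
		} else {			/* a = b: compare contents */
			if (a[i].hash != b[j].hash) {
				out[n].sign = 'M';
				out[n++].name = a[i].name;
			}
			i++;
			j++;
		}
	}
	return n;
}
```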
~~~~~~~~
This patch generalizes diff tree-walker to work with arbitrary number of
parents as described above - i.e. now there is a resulting tree t, and
some parents trees tp[i] i=[0..nparent). The generalization builds on
the fact that usual diff
D(A,B)
is by definition the same as combined diff
D(A,[B]),
so if we could rework the code for common case and make it be not slower
for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
multiparent diff tree-walker would greatly benefit generating
combine-diff.
What we do is as follows:
1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be
a paths generator (new name diff_tree_paths()), with each generated path
being `struct combine_diff_path` with info for path, new sha1,mode and for
every parent which sha1,mode it was in it.
2) From that info, we can still generate usual diff queue with
struct diff_filepairs, via "exporting" generated
combine_diff_path, if we know we run for nparent=1 case.
(see emit_diff() which is now named emit_diff_first_parent_only())
3) In order for diff_can_quit_early(), which checks
DIFF_OPT_TST(opt, HAS_CHANGES)
to work, that exporting has to happen not in bulk, but
incrementally, one diff path at a time.
For such consumers, there is a new callback in diff_options
introduced:
->pathchange(opt, struct combine_diff_path *)
which, if set to !NULL, is called for every generated path.
(see new compat ll_diff_tree_sha1() wrapper around new paths
generator for setup)
4) The paths generation itself, is reworked from previous
ll_diff_tree_sha1() code according to "D(A,P1...Pn) calculation
scheme" provided above:
At the start we allocate [nparent] arrays in place of what was
earlier just for one parent tree.
then we just generalize loops, and comparison according to the
algorithm.
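Point 3 is the reason paths are handed over one at a time: a consumer that only wants to know whether anything changed can stop the walk after the first callback. A minimal sketch of that pattern, assuming a simplified options struct and a string path in place of `struct combine_diff_path`; `struct path_opts`, `note_change` and `generate_paths` are hypothetical names for illustration:

```c
#include <stddef.h>

struct path_opts;
typedef int (*path_cb)(struct path_opts *opt, const char *path);

struct path_opts {
	path_cb pathchange;	/* called once per generated path, if non-NULL */
	int has_changes;	/* set by the callback when a change is seen */
	int quick;		/* stop as soon as any change is known */
};

/* Example callback: just remember that something changed. */
int note_change(struct path_opts *opt, const char *path)
{
	(void)path;
	opt->has_changes = 1;
	return 0;
}

/*
 * Hand paths to the callback one by one, checking the early-quit
 * condition before each step. Returns how many paths were visited.
 */
size_t generate_paths(struct path_opts *opt, const char **paths, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (opt->quick && opt->has_changes)
			break;		/* the consumer already has its answer */
		if (opt->pathchange)
			opt->pathchange(opt, paths[i]);
	}
	return i;
}
```

If the paths were exported in bulk after the walk, the flag would only be set once all the work had already been done, and the early-quit check would never fire.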
Some notes(*):
1) alloca(), for small arrays, is used for "runs not slower for
nparent=1 case than before" goal - if we change it to xmalloc()/free()
the timings get ~1% worse. For alloca() we use just-introduced
xalloca/xalloca_free compatibility wrappers, so it should not be a
portability problem.
2) For every parent tree, we need to keep a tag, whether entry from that
parent equals to entry from minimal parent. For performance reasons I'm
keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.
Not doing so, we'd need to alloca another [nparent] array, which hurts
performance.
3) For emitted paths, memory could be reused, if we know the path was
processed via callback and will not be needed later. We use efficient
hand-made realloc-style path_appendnew(), that saves us from ~1-1.5%
of potential additional slowdown.
4) goto(s) are used in several places, as the code executes a little bit
faster with lowered register pressure.
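Note 2 describes keeping a per-parent tag inside the entry's mode field. A minimal sketch of that bit-stashing idea, assuming a 32-bit mode whose top bit is never used by real file modes; `MODE_TAG_NEQ` and the helpers are hypothetical, and the actual value of S_IFXMIN_NEQ is not stated here:

```c
#include <stdint.h>

/* Hypothetical tag bit; assumed to be outside every real file mode. */
#define MODE_TAG_NEQ 0x80000000u

/* Mark the entry as "not equal to the minimal parent entry". */
uint32_t mode_set_neq(uint32_t mode)
{
	return mode | MODE_TAG_NEQ;
}

/* Test the tag without disturbing the real mode bits. */
int mode_is_neq(uint32_t mode)
{
	return (mode & MODE_TAG_NEQ) != 0;
}

/* Recover the real mode before it is shown or stored anywhere. */
uint32_t mode_strip_tag(uint32_t mode)
{
	return mode & ~MODE_TAG_NEQ;
}
```

The trade-off is that every consumer of the mode must strip the tag first. In return, no second [nparent] array has to be allocated.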
Also
- we should now check for FIND_COPIES_HARDER not only when two entries
names are the same, and their hashes are equal, but also for a case,
when a path was removed from some of all parents having it.
The reason is, if we don't, that path won't be emitted at all (see
"a > xi" case), and we'll just skip it, and FIND_COPIES_HARDER wants
all paths - with diff or without - to be emitted, to be later analyzed
for being copies sources.
The new check is only necessary for nparent >1, as for nparent=1 case
xmin_eqtotal always =1 =nparent, and a path is always added to diff as
removal.
~~~~~~~~
Timings for
# without -c, i.e. testing only nparent=1 case
`git log --raw --no-abbrev --no-renames`
before and after the patch are as follows:
                navy.git    linux.git v3.10..v3.11
    before      0.611s      1.889s
    after       0.619s      1.907s
    slowdown    1.3%        0.9%
These timings show we did no harm to usual diff(tree1,tree2) generation.
From the table we can see that we actually did ~1% slowdown, but I think
I've "earned" that 1% in the previous patch ("tree-diff: reuse base
str(buf) memory on sub-tree recursion", HEAD~~) so for nparent=1 case,
net timings stay approximately the same.
The output also stayed the same.
(*) If we revert 1)-4) to more usual techniques, for nparent=1 case,
we'll get ~2-2.5% of additional slowdown, which I've tried to avoid, as
"do no harm for nparent=1 case" rule.
For linux.git, combined diff will run an order of magnitude faster and
appropriate timings will be provided in the next commit, as we'll be
taking advantage of the new diff tree-walker for combined-diff
generation there.
P.S. and combined diff is not some exotic/for-play-only stuff - for
example for a program I write to represent Git archives as readonly
filesystem, there is initial scan with
`git log --reverse --raw --no-abbrev --no-renames -c`
to extract log of what was created/changed when, as a result building a
map
{} sha1 -> in which commit (and date) a content was added
that `-c` means also show combined diff for merges, and without them, if
a merge is non-trivial (merges changes from two parents with both having
separate changes to a file), or an evil one, the map will not be full,
i.e. some valid sha1 would be absent from it.
That case was my initial motivation for combined diffs speedup.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-04-06 23:46:26 +02:00
	pathchange_fn_t pathchange;
2005-10-21 06:05:05 +02:00
	change_fn_t change;
	add_remove_fn_t add_remove;
revision: quit pruning diff more quickly when possible
When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.
But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.
But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:
	perl -e '
		chomp(my $blob = `git hash-object -w --stdin </dev/null`);
		for my $a (1..1000) {
			for my $b (1..1000) {
				print "100644 $blob\t$a/$b\n";
			}
		}
	' | git update-index --index-info
	git commit -qm add

	git rev-list HEAD -- .
This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).
This is reported to slightly speed up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.
This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.
Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).
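The rule described above can be sketched as follows. This is an assumption-laden illustration, not the code in revision.c: `struct walk_state`, `struct quick_opts`, `on_add_remove` and the `TREE_*` constants are hypothetical stand-ins, and only the `change_fn_data` member name is taken from the struct.

```c
/* Hypothetical result codes; DIFFERENT includes the NEW bit so |= accumulates. */
enum { TREE_SAME = 0, TREE_NEW = 1, TREE_DIFFERENT = 3 };

struct walk_state {
	int remove_empty_trees;	/* stands in for revs->remove_empty_trees */
	int tree_difference;	/* accumulated over all callbacks */
};

struct quick_opts {
	int has_changes;	/* once set, a quick diff may stop */
	void *change_fn_data;	/* gives the callback access to walk_state */
};

/* Called for each added ('+') or removed ('-') entry. */
void on_add_remove(struct quick_opts *opt, int addremove)
{
	struct walk_state *st = opt->change_fn_data;

	st->tree_difference |= (addremove == '+') ? TREE_NEW : TREE_DIFFERENT;

	/*
	 * Keep going only while --remove-empty is in effect and every
	 * change seen so far is an addition; otherwise the answer is known.
	 */
	if (!st->remove_empty_trees || st->tree_difference != TREE_NEW)
		opt->has_changes = 1;
}
```

Without --remove-empty the first entry of any kind sets the flag, which is what lets the million-entry example stop after a single callback.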
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-13 17:27:45 +02:00
	void *change_fn_data;
2006-09-07 08:35:42 +02:00
	diff_format_fn_t format_callback;
	void *format_callback_data;
2010-05-26 09:08:02 +02:00
	diff_prefix_fn_t output_prefix;
	void *output_prefix_data;
2013-12-06 00:38:46 +01:00

	int diff_path_counter;
2017-06-30 02:07:06 +02:00

	struct emitted_diff_symbols *emitted_symbols;
diff.c: color moved lines differently
When a patch consists mostly of moving blocks of code around, it can
be quite tedious to ensure that the blocks are moved verbatim, and not
undesirably modified in the move. To that end, color blocks that are
moved within the same patch differently. For example (OM, del, add,
and NM are different colors):
[OM] -void sensitive_stuff(void)
[OM] -{
[OM] - if (!is_authorized_user())
[OM] - die("unauthorized");
[OM] - sensitive_stuff(spanning,
[OM] - multiple,
[OM] - lines);
[OM] -}
void another_function()
{
[del] - printf("foo");
[add] + printf("bar");
}
[NM] +void sensitive_stuff(void)
[NM] +{
[NM] + if (!is_authorized_user())
[NM] + die("unauthorized");
[NM] + sensitive_stuff(spanning,
[NM] + multiple,
[NM] + lines);
[NM] +}
However adjacent blocks may be problematic. For example, in this
potentially malicious patch, the swapping of blocks can be spotted:
[OM] -void sensitive_stuff(void)
[OM] -{
[OMA] - if (!is_authorized_user())
[OMA] - die("unauthorized");
[OM] - sensitive_stuff(spanning,
[OM] - multiple,
[OM] - lines);
[OMA] -}
void another_function()
{
[del] - printf("foo");
[add] + printf("bar");
}
[NM] +void sensitive_stuff(void)
[NM] +{
[NMA] + sensitive_stuff(spanning,
[NMA] + multiple,
[NMA] + lines);
[NM] + if (!is_authorized_user())
[NM] + die("unauthorized");
[NMA] +}
If the moved code is larger, it is easier to hide some permutation in the
code, which is why some alternative coloring is needed.
This patch implements the first mode:
* basic alternating 'Zebra' mode
This conveys all information needed to the user. Defer customization to
later patches.
First I implemented an alternative design, which would try to fingerprint
a line by its neighbors to detect if we are in a block or at the boundary.
This idea is error prone as it inspected each line and its neighboring
lines to determine (a) if the line was moved and (b) if it was deep inside
a hunk by having matching neighboring lines. This is unreliable as
we can construct hunks which have equal neighbors that just exceed the
number of lines inspected. (Think of 'AXYZBXYZCXYZD..' with each letter
as a line, that is permutated to 'AXYZCXYZBXYZD..').
Instead this provides a dynamic programming greedy algorithm that finds
the largest moved hunk and then has several modes on highlighting bounds.
A note on the options '--submodule=diff' and '--color-words/--word-diff':
In the conversion to use emit_line in the prior patches both submodules
as well as word diff output carefully chose to call emit_line with sign=0.
All output with sign=0 is ignored for move detection purposes in this
patch, such that no weird looking output will be generated for these
cases. This leads to another thought: We could pass on '--color-moved' to
submodules such that they color up moved lines for themselves. If we'd do
so only line moves within a repository boundary are marked up.
It is useful to have moved lines colored, but there are annoying corner
cases, such as a single moved line, which is very common. For example
in a typical patch of C code, we have closing braces that end statement
blocks or functions.
While it is technically true that these lines are moved as they show up
elsewhere, it is harmful for the review as the reviewer's attention is
drawn to such a minor side annoyance.
For now let's have a simple solution of hardcoding the number of
moved lines to be at least 3 before coloring them. Note that the
length is applied across all blocks to find the 'lonely' blocks
that pollute new code, but does not interfere with a permutated
block where each permutation has fewer than 3 lines.
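The minimum-length rule can be sketched as a pass over per-line "moved" flags. This is a simplified illustration, assuming the flags are already computed and that a run of consecutive moved lines is measured as a whole, even when it spans several adjacent blocks; `MIN_MOVED_LINES` and `drop_short_runs` are hypothetical names:

```c
#include <stddef.h>

#define MIN_MOVED_LINES 3	/* the hardcoded threshold described above */

/*
 * moved[i] is nonzero if line i was detected as moved. Clear the flag
 * on every run of consecutive moved lines shorter than the threshold,
 * so that lonely lines such as a single closing brace stay uncolored.
 */
void drop_short_runs(int *moved, size_t n)
{
	size_t i = 0;

	while (i < n) {
		size_t start = i;

		if (!moved[i]) {
			i++;
			continue;
		}
		while (i < n && moved[i])
			i++;			/* find the end of this run */
		if (i - start < MIN_MOVED_LINES)
			while (start < i)
				moved[start++] = 0;
	}
}
```

Because the run is measured across adjacent blocks, a permuted region made of short pieces keeps its coloring as long as the pieces together reach the threshold.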
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-30 22:53:07 +02:00
	enum {
		COLOR_MOVED_NO = 0,
2017-06-30 22:53:08 +02:00
		COLOR_MOVED_PLAIN = 1,
2018-07-17 01:05:39 +02:00
		COLOR_MOVED_BLOCKS = 2,
		COLOR_MOVED_ZEBRA = 3,
		COLOR_MOVED_ZEBRA_DIM = 4,
diff.c: color moved lines differently
2017-06-30 22:53:07 +02:00
	} color_moved;
	#define COLOR_MOVED_DEFAULT COLOR_MOVED_ZEBRA
2017-08-16 03:27:39 +02:00
	#define COLOR_MOVED_MIN_ALNUM_COUNT 20
diff.c: add white space mode to move detection that allows indent changes
The option of --color-moved has proven to be useful as observed on the
mailing list. However when refactoring sometimes the indentation changes,
for example when partitioning a function into smaller helper functions
the code is usually mostly moved around except for a decrease in indentation.
To just review the moved code ignoring the change in indentation, a mode
to ignore spaces in the move detection as implemented in a previous patch
would be enough. However the whole move coloring as motivated in commit
2e2d5ac (diff.c: color moved lines differently, 2017-06-30), brought
up the notion of the reviewer being able to trust the move of a "block".
As there are languages such as python, which depend on proper relative
indentation for the control flow of the program, ignoring any white space
change in a block would not uphold the promises of 2e2d5ac that allows
reviewers to pay less attention to the inside of a block, as inside
the reviewer wants to assume the same program flow.
This new mode of white space ignorance will take this into account and will
only allow the same white space changes per line in each block. This patch
even allows only for the same change at the beginning of the lines.
As this is a white space mode, it is made exclusive to other white space
modes in the move detection.
This patch brings some challenges, related to the detection of blocks.
We need a wide net to catch the possible moved lines, but then need to
narrow down to check if the blocks are still intact. Consider this
example (ignoring block sizes):
    - A
    - B
    - C
    +    A
    +    B
    +    C
At the beginning of a block when checking if there is a counterpart
for A, we have to ignore all space changes. However at the following
lines we have to check if the indent change stayed the same.
Checking if the indentation change did stay the same, is done by computing
the indentation change by the difference in line length, and then assume
the change is only in the beginning of the longer line, the common tail
is the same. That is why the test contains lines like:
- <TAB> A
...
+ A <TAB>
...
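The length-difference check can be sketched like this. It is a simplified illustration under stated assumptions: lines are NUL-terminated strings, the extra prefix is not verified to be whitespace, and `indent_delta` and `block_keeps_indent` are hypothetical names rather than functions from diff.c:

```c
#include <stddef.h>
#include <string.h>

/*
 * If the shorter line equals the tail of the longer one, store the
 * signed length difference (new minus old) in *delta and return 1.
 * Otherwise return 0.
 */
int indent_delta(const char *old_line, const char *new_line, int *delta)
{
	size_t ol = strlen(old_line), nl = strlen(new_line);
	size_t tail = ol < nl ? ol : nl;

	if (memcmp(old_line + ol - tail, new_line + nl - tail, tail))
		return 0;		/* the common tail differs */
	*delta = (int)nl - (int)ol;
	return 1;
}

/* A block stays intact only if every line pair shows the same delta. */
int block_keeps_indent(const char **old_lines, const char **new_lines,
		       size_t n)
{
	int first = 0, d;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!indent_delta(old_lines[i], new_lines[i], &d))
			return 0;
		if (i == 0)
			first = d;
		else if (d != first)
			return 0;	/* the indentation change varies */
	}
	return 1;
}
```

The test lines quoted above are the interesting case for such a check: moving the tab from the front to the back keeps the length but breaks the common tail, so the pair is rejected.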
As the first line starting a block is caught using a compare function that
ignores white spaces unlike the rest of the block, where the white space
delta is taken into account for the comparison, we also have to think about
the following situation:
    - A
    - B
    - A
    - B
    +    A
    +    B
    +    A
    +    B
When checking if the first A (both in the + and - lines) is a start of
a block, we have to check all 'A' and record all the white space deltas
such that we can find the example above to be just one block that is
indented.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-18 21:31:55 +02:00

	/* XDF_WHITESPACE_FLAGS regarding block detection are set at 2, 3, 4 */
	#define COLOR_MOVED_WS_ALLOW_INDENTATION_CHANGE (1<<5)
2018-07-17 01:05:40 +02:00
	int color_moved_ws_handling;
2005-09-21 09:00:47 +02:00
};
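Since color_moved_ws_handling packs the whitespace modes and the indentation-change bit into one int, the "exclusive to other white space modes" rule reduces to a mask test. A minimal sketch with hypothetical flag names and bit positions; only the (1<<5) position is taken from the define above, and the three low bits merely follow the comment's "2, 3, 4":

```c
/* Hypothetical stand-ins for the whitespace modes kept in the low bits. */
#define WS_IGNORE_A		(1u << 2)
#define WS_IGNORE_B		(1u << 3)
#define WS_IGNORE_C		(1u << 4)
#define WS_MODE_MASK		(WS_IGNORE_A | WS_IGNORE_B | WS_IGNORE_C)

/* Same bit position as COLOR_MOVED_WS_ALLOW_INDENTATION_CHANGE. */
#define WS_ALLOW_INDENT_CHANGE	(1u << 5)

/*
 * Return 1 if the combination is acceptable: allowing indentation
 * changes must not be mixed with any other whitespace mode.
 */
int ws_handling_valid(unsigned flags)
{
	return !((flags & WS_ALLOW_INDENT_CHANGE) && (flags & WS_MODE_MASK));
}
```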

2017-06-30 02:07:00 +02:00
void diff_emit_submodule_del(struct diff_options *o, const char *line);
void diff_emit_submodule_add(struct diff_options *o, const char *line);
void diff_emit_submodule_untracked(struct diff_options *o, const char *path);
void diff_emit_submodule_modified(struct diff_options *o, const char *path);
void diff_emit_submodule_header(struct diff_options *o, const char *header);
void diff_emit_submodule_error(struct diff_options *o, const char *err);
void diff_emit_submodule_pipethrough(struct diff_options *o,
				     const char *line, int len);

2006-07-23 11:24:18 +02:00
enum color_diff {
	DIFF_RESET = 0,
2015-05-27 22:48:46 +02:00
	DIFF_CONTEXT = 1,
2006-07-23 11:24:18 +02:00
	DIFF_METAINFO = 2,
	DIFF_FRAGINFO = 3,
	DIFF_FILE_OLD = 4,
	DIFF_FILE_NEW = 5,
	DIFF_COMMIT = 6,
2006-09-23 07:48:39 +02:00
	DIFF_WHITESPACE = 7,
diff.c: color moved lines differently
2017-06-30 22:53:07 +02:00
	DIFF_FUNCINFO = 8,
	DIFF_FILE_OLD_MOVED = 9,
	DIFF_FILE_OLD_MOVED_ALT = 10,
2017-06-30 22:53:09 +02:00
	DIFF_FILE_OLD_MOVED_DIM = 11,
	DIFF_FILE_OLD_MOVED_ALT_DIM = 12,
	DIFF_FILE_NEW_MOVED = 13,
	DIFF_FILE_NEW_MOVED_ALT = 14,
	DIFF_FILE_NEW_MOVED_DIM = 15,
range-diff: use dim/bold cues to improve dual color mode
It *is* a confusing thing to look at a diff of diffs. All too easy is it
to mix up whether the -/+ markers refer to the "inner" or the "outer"
diff, i.e. whether a `+` indicates that a line was added by either the
old or the new diff (or both), or whether the new diff does something
different than the old diff.
To make things easier to process for normal developers, we introduced
the dual color mode which colors the lines according to the commit diff,
i.e. lines that are added by a commit (whether old, new, or both) are
colored in green. In non-dual color mode, the lines would be colored
according to the outer diff: if the old commit added a line, it would be
colored red (because that line addition is only present in the first
commit range that was specified on the command-line, i.e. the "old"
commit, but not in the second commit range, i.e. the "new" commit).
However, this dual color mode is still not making things clear enough,
as we are looking at two levels of diffs, and we still only pick a color
according to *one* of them (the outer diff marker is colored
differently, of course, but in particular with deep indentation, it is
easy to lose track of that outer diff marker's background color).
Therefore, let's add another dimension to the mix. Still use
green/red/normal according to the commit diffs, but now also dim the
lines that were only in the old commit, and use bold face for the lines
that are only in the new commit.
That way, it is much easier not to lose track of, say, when we are
looking at a line that was added in the previous iteration of a patch
series but the new iteration adds a slightly different version: the
obsolete change will be dimmed, the current version of the patch will be
bold.
At least this developer has a much easier time reading the range-diffs
that way.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-13 13:33:32 +02:00
	DIFF_FILE_NEW_MOVED_ALT_DIM = 16,
	DIFF_CONTEXT_DIM = 17,
	DIFF_FILE_OLD_DIM = 18,
	DIFF_FILE_NEW_DIM = 19,
	DIFF_CONTEXT_BOLD = 20,
	DIFF_FILE_OLD_BOLD = 21,
	DIFF_FILE_NEW_BOLD = 22,
2006-07-23 11:24:18 +02:00
};
const char *diff_get_color(int diff_use_color, enum color_diff ix);
2007-11-10 20:05:14 +01:00
#define diff_get_color_opt(o, ix) \
2011-08-18 07:03:12 +02:00
	diff_get_color((o)->use_color, ix)
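The enum serves as an index into a table of escape sequences, and the macro only saves callers from spelling out the use_color member. A self-contained sketch of that lookup pattern, using hypothetical names (`get_color`, `get_color_opt`, `struct color_opts`, a three-slot table) in place of the real implementation in diff.c, which is not shown here:

```c
enum color_slot { SLOT_RESET = 0, SLOT_OLD = 1, SLOT_NEW = 2 };

/* One ANSI escape sequence per slot, indexed by the enum value. */
static const char *const slot_colors[] = {
	"\033[m",	/* reset */
	"\033[31m",	/* red, for removed lines */
	"\033[32m",	/* green, for added lines */
};

/* Return the escape sequence for a slot, or "" when color is off. */
const char *get_color(int use_color, enum color_slot ix)
{
	return use_color ? slot_colors[ix] : "";
}

struct color_opts { int use_color; };	/* stand-in for diff_options */

/* Same shape as diff_get_color_opt(): pull use_color out of the options. */
#define get_color_opt(o, ix) get_color((o)->use_color, ix)
```

Returning an empty string when color is disabled lets callers print the set and reset sequences unconditionally around each line.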
2007-11-10 20:05:14 +01:00

2006-07-23 11:24:18 +02:00

2013-02-07 21:15:26 +01:00
const char *diff_line_prefix(struct diff_options *);

2006-05-20 15:40:29 +02:00
extern const char mime_boundary_leader[];

tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Previously diff_tree(), which is now named ll_diff_tree_sha1(), was
generating diff_filepair(s) for two trees t1 and t2, and that was
usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes
a commit introduces.
In Git, however, we have fundamentally built flexibility in that a
commit can have many parents - 1 for a plain commit, 2 for a simple merge,
but also more than 2 for merging several heads at once.
For merges there is a so called combine-diff, which shows diff, a merge
introduces by itself, omitting changes done by any parent. That works
through first finding paths, that are different to all parents, and then
showing generalized diff, with separate columns for +/- for each parent.
The code lives in combine-diff.c .
There is an impedance mismatch, however, in that a commit could
generally have any number of parents, and that while diffing trees, we
divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there
is no special casing for multiple parents commits in e.g.
revision-walker .
That impedance mismatch *hurts* *performance* *badly* for generating
combined diffs - in "combine-diff: optimize combine_diff_path
sets intersection" I've already removed some slowness from it, but from
the timings provided there, it could be seen, that combined diffs still
cost more than an order of magnitude more cpu time, compared to diff for
usual commits, and that would only be an optimistic estimate, if we take
into account that for e.g. linux.git there is only one merge for several
dozens of plain commits.
That slowness comes from the fact that currently, while generating
combined diff, a lot of time is spent computing diff(commit,commit^2)
only to then intersect that huge diff with the comparatively small set of
files from diff(commit,commit^1).
That's because at present, to compute combine-diff, for first finding
paths, that "every parent touches", we use the following combine-diff
property/definition:
D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn) (w.r.t. paths)
where
D(A,P1...Pn) is combined diff between commit A, and parents Pi
and
D(A,Pi) is usual two-tree diff Pi..A
So if any of those D(A,Pi) is huge, treating 1 n-parent combine-diff as n
1-parent diffs and intersecting results will be slow.
And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.
The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.
Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:
D(T,P1...Pn) calculation scheme
-------------------------------
D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn) (regarding resulting paths set)
D(T,Pj) - diff between T..Pj
D(T,P1...Pn) - combined diff from T to parents P1,...,Pn
We start from all trees, which are sorted, and compare their entries in
lock-step:
T P1 Pn
- - -
|t| |p1| |pn|
|-| |--| ... |--| imin = argmin(p1...pn)
| | | | | |
|-| |--| |--|
|.| |. | |. |
. . .
. . .
at any time there could be 3 cases:
1) t < p[imin];
2) t > p[imin];
3) t = p[imin].
Schematic deduction of what every case means, and what to do, follows:
1) t < p[imin] -> ∀j t ∉ Pj -> "+t" ∈ D(T,Pj) -> D += "+t"; t↓
2) t > p[imin]
2.1) ∃j: pj > p[imin] -> "-p[imin]" ∉ D(T,Pj) -> D += ø; ∀ pi=p[imin] pi↓
2.2) ∀i pi = p[imin] -> pi ∉ T -> "-pi" ∈ D(T,Pi) -> D += "-p[imin]"; ∀i pi↓
3) t = p[imin]
3.1) ∃j: pj > p[imin] -> "+t" ∈ D(T,Pj) -> only pi=p[imin] remains to investigate
3.2) pi = p[imin] -> investigate δ(t,pi)
|
|
v
3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø ->
⎧δ(t,pi) - if pi=p[imin]
-> D += ⎨
⎩"+t" - if pi>p[imin]
in any case t↓ ∀ pi=p[imin] pi↓
~
For comparison, here is how diff_tree() works:
D(A,B) calculation scheme
-------------------------
A B
- -
|a| |b| a < b -> a ∉ B -> D(A,B) += +a a↓
|-| |-| a > b -> b ∉ A -> D(A,B) += -b b↓
| | | | a = b -> investigate δ(a,b) a↓ b↓
|-| |-|
|.| |.|
. .
. .
~~~~~~~~
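The lock-step scheme above can be sketched on a much simplified model. This is an illustration only, not the code from the patch: the flat `struct tree`, the integer `id` standing in for a sha1, the `MAX_PARENTS` limit and the name `combined_paths()` are all assumptions made for the example, names are compared with plain strcmp(), and there is no recursion into subdirectories.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat "tree": entries sorted by name, id stands in for a sha1. */
struct entry { const char *name; int id; };
struct tree { const struct entry *e; size_t n; };

#define MAX_PARENTS 8

/* Current name of a tree at position pos, or NULL when it is exhausted. */
static const char *cur_name(struct tree tr, size_t pos)
{
	return pos < tr.n ? tr.e[pos].name : NULL;
}

/*
 * Collect names of paths that differ between t and *every* parent.
 * Stores at most cap names in out[]; returns the number of paths found.
 */
size_t combined_paths(struct tree t, const struct tree *p, int nparent,
		      const char **out, size_t cap)
{
	size_t ti = 0, pi[MAX_PARENTS] = { 0 }, nout = 0;
	int i;

	assert(nparent > 0 && nparent <= MAX_PARENTS);
	for (;;) {
		const char *tn = cur_name(t, ti), *pmin = NULL;
		int cmp, at_min = 0, differ = 1;

		/* p[imin]: the smallest current name among all parents */
		for (i = 0; i < nparent; i++) {
			const char *pn = cur_name(p[i], pi[i]);
			if (pn && (!pmin || strcmp(pn, pmin) < 0))
				pmin = pn;
		}
		if (!tn && !pmin)
			break;	/* all trees exhausted */
		cmp = !tn ? 1 : !pmin ? -1 : strcmp(tn, pmin);

		if (cmp < 0) {
			/* case 1: t is in no parent, so "+t" is in every D(T,Pj) */
			if (nout < cap)
				out[nout] = tn;
			nout++;
			ti++;
			continue;
		}
		/* cases 2 and 3: visit and advance the parents sitting at p[imin] */
		for (i = 0; i < nparent; i++) {
			const char *pn = cur_name(p[i], pi[i]);
			if (!pn || strcmp(pn, pmin))
				continue;	/* pj > p[imin] */
			at_min++;
			if (cmp == 0 && p[i].e[pi[i]].id == t.e[ti].id)
				differ = 0;	/* delta(t,pi) is empty for this parent */
			pi[i]++;
		}
		/* case 2: "-p[imin]" is emitted only if every parent had it (2.2) */
		if (cmp > 0)
			differ = (at_min == nparent);
		if (differ) {
			if (nout < cap)
				out[nout] = cmp > 0 ? pmin : tn;
			nout++;
		}
		if (cmp == 0)
			ti++;	/* case 3: t advances together with the parents */
	}
	return nout;
}
```

With T = {a,b,c,e,f} and two parents where only b, e and f differ from both, and d exists in both parents but not in T, the sketch reports b, d, e and f.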
This patch generalizes the diff tree-walker to work with an arbitrary
number of parents as described above - i.e. now there is a resulting tree
t, and some parent trees tp[i] i=[0..nparent). The generalization builds
on the fact that the usual diff
D(A,B)
is by definition the same as the combined diff
D(A,[B]),
so if we can rework the code for the common case and make it not slower
for the nparent=1 case, usual diff(t1,t2) generation will not be slower,
and the multiparent diff tree-walker will greatly benefit combine-diff
generation.
What we do is as follows:
1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be
a paths generator (new name diff_tree_paths()), with each generated path
being a `struct combine_diff_path` holding the path, its new sha1,mode and,
for every parent, the sha1,mode it had there.
2) From that info, we can still generate the usual diff queue with
struct diff_filepairs, via "exporting" the generated
combine_diff_path, if we know we run for the nparent=1 case.
(see emit_diff() which is now named emit_diff_first_parent_only())
3) In order for diff_can_quit_early(), which checks
DIFF_OPT_TST(opt, HAS_CHANGES)
to work, that exporting has to happen not in bulk, but
incrementally, one diff path at a time.
For such consumers, there is a new callback introduced in
diff_options:
->pathchange(opt, struct combine_diff_path *)
which, if set to !NULL, is called for every generated path.
(see new compat ll_diff_tree_sha1() wrapper around new paths
generator for setup)
4) The paths generation itself is reworked from the previous
ll_diff_tree_sha1() code according to the "D(T,P1...Pn) calculation
scheme" provided above:
At the start we allocate [nparent] arrays in place of what was
earlier just one parent tree,
then we just generalize the loops and comparisons according to the
algorithm.
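Point 3) above, incremental export through a per-path callback so that an early-quit check can fire, can be illustrated with a small sketch. All names here (`struct walk_opts`, `note_change()`, `generate_paths()`) are hypothetical and only model the idea; they are not the diff_options fields or functions of the patch.

```c
#include <assert.h>
#include <stddef.h>

struct walk_opts;
typedef void (*pathchange_fn)(struct walk_opts *, const char *path);

/* Hypothetical, much reduced stand-in for the diff options. */
struct walk_opts {
	pathchange_fn pathchange;	/* called for every generated path if non-NULL */
	int quick;			/* caller only wants to know "any change?" */
	int has_changes;		/* set by the consumer once a change is seen */
	int seen;			/* number of paths the consumer received */
};

/* Models the early-quit check: stop once a change is known in quick mode. */
static int can_quit_early(const struct walk_opts *o)
{
	return o->quick && o->has_changes;
}

/* Example consumer: records that something changed. */
void note_change(struct walk_opts *o, const char *path)
{
	(void)path;
	o->has_changes = 1;
	o->seen++;
}

/*
 * Hand paths to the callback one at a time instead of in bulk, so the
 * early-quit check can take effect between two paths.
 * Returns how many paths were actually generated.
 */
size_t generate_paths(struct walk_opts *o, const char **paths, size_t n)
{
	size_t i;

	assert(o != NULL);
	for (i = 0; i < n; i++) {
		if (can_quit_early(o))
			break;
		if (o->pathchange)
			o->pathchange(o, paths[i]);
	}
	return i;
}
```

If the paths were only exported after the whole walk, the `has_changes` flag would be set too late for the check to save any work.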
Some notes(*):
1) alloca(), for small arrays, is used for the "runs not slower for
nparent=1 case than before" goal - if we change it to xmalloc()/free()
the timings get ~1% worse. For alloca() we use the just-introduced
xalloca/xalloca_free compatibility wrappers, so it should not be a
portability problem.
2) For every parent tree, we need to keep a tag saying whether the entry
from that parent equals the entry from the minimal parent. For performance
reasons I'm keeping that tag in an unused bit of the entry's mode field -
see S_IFXMIN_NEQ. Not doing so, we'd need to alloca another [nparent]
array, which hurts performance.
3) For emitted paths, memory can be reused if we know the path was
processed via callback and will not be needed later. We use an efficient
hand-made realloc-style path_appendnew(), which saves us from ~1-1.5%
of potential additional slowdown.
4) goto(s) are used in several places, as the code executes a little bit
faster with lowered register pressure.
Also
- we should now check for FIND_COPIES_HARDER not only when two entries'
names are the same and their hashes are equal, but also for the case
when a path was removed and only some of the parents had it.
The reason is, if we don't, that path won't be emitted at all (see
"a > xi" case), and we'll just skip it, while FIND_COPIES_HARDER wants
all paths - with diff or without - to be emitted, to be later analyzed
as possible copy sources.
The new check is only necessary for nparent >1, as for the nparent=1 case
xmin_eqtotal always =1 =nparent, and a path is always added to the diff as
a removal.
~~~~~~~~
Timings for
# without -c, i.e. testing only nparent=1 case
`git log --raw --no-abbrev --no-renames`
before and after the patch are as follows:
            navy.git    linux.git v3.10..v3.11
before      0.611s      1.889s
after       0.619s      1.907s
slowdown    1.3%        0.9%
These timings show we did no harm to usual diff(tree1,tree2) generation.
From the table we can see that we actually caused a ~1% slowdown, but I
think I've "earned" that 1% in the previous patch ("tree-diff: reuse base
str(buf) memory on sub-tree recursion", HEAD~~) so for the nparent=1 case,
net timings stay approximately the same.
The output also stayed the same.
(*) If we revert 1)-4) to more usual techniques, for the nparent=1 case,
we'll get ~2-2.5% of additional slowdown, which I've tried to avoid,
following the "do no harm for nparent=1 case" rule.
For linux.git, combined diff will run an order of magnitude faster and
appropriate timings will be provided in the next commit, as we'll be
taking advantage of the new diff tree-walker for combined-diff
generation there.
P.S. and combined diff is not some exotic/for-play-only stuff - for
example, for a program I write to represent Git archives as a read-only
filesystem, there is an initial scan with
`git log --reverse --raw --no-abbrev --no-renames -c`
to extract a log of what was created/changed when, building as a result a
map
{} sha1 -> in which commit (and date) a content was added
That `-c` means also show combined diffs for merges; without them, if
a merge is non-trivial (merges changes from two parents with both having
separate changes to a file), or an evil one, the map will not be full,
i.e. some valid sha1 would be absent from it.
That case was my initial motivation for the combined diffs speedup.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-04-06 23:46:26 +02:00
|
|
|
extern struct combine_diff_path *diff_tree_paths(
|
2017-05-30 19:31:06 +02:00
|
|
|
struct combine_diff_path *p, const struct object_id *oid,
|
|
|
|
const struct object_id **parents_oid, int nparent,
|
tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
2014-04-06 23:46:26 +02:00
|
|
|
struct strbuf *base, struct diff_options *opt);
|
2017-05-30 19:31:03 +02:00
|
|
|
extern int diff_tree_oid(const struct object_id *old_oid,
|
|
|
|
const struct object_id *new_oid,
|
|
|
|
const char *base, struct diff_options *opt);
|
2017-05-30 19:30:57 +02:00
|
|
|
extern int diff_root_tree_oid(const struct object_id *new_oid, const char *base,
|
|
|
|
struct diff_options *opt);
|
2005-10-21 06:05:05 +02:00
|
|
|
|
2006-01-28 09:03:38 +01:00
|
|
|
struct combine_diff_path {
|
|
|
|
struct combine_diff_path *next;
|
|
|
|
char *path;
|
2006-02-06 21:53:07 +01:00
|
|
|
unsigned int mode;
|
2015-03-14 00:39:33 +01:00
|
|
|
struct object_id oid;
|
2006-02-06 21:53:07 +01:00
|
|
|
struct combine_diff_parent {
|
2006-02-10 11:30:52 +01:00
|
|
|
char status;
|
2006-02-06 21:53:07 +01:00
|
|
|
unsigned int mode;
|
2015-03-14 00:39:33 +01:00
|
|
|
struct object_id oid;
|
2006-02-06 21:53:07 +01:00
|
|
|
} parent[FLEX_ARRAY];
|
2006-01-28 09:03:38 +01:00
|
|
|
};
|
2006-02-06 21:53:07 +01:00
|
|
|
#define combine_diff_path_size(n, l) \
|
2016-02-19 12:21:30 +01:00
|
|
|
st_add4(sizeof(struct combine_diff_path), (l), 1, \
|
|
|
|
st_mult(sizeof(struct combine_diff_parent), (n)))
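The size computed by combine_diff_path_size(n, l) covers the fixed header, n per-parent slots and a path of length l plus its terminating NUL, all in one allocation. A minimal sketch of that layout follows, under loud assumptions: `struct path_rec`, `struct parent_info`, `path_rec_size()` and `path_rec_new()` are stand-ins invented for the example, an int replaces struct object_id, and the overflow checking that st_add4()/st_mult() provide is left out. Placing the path bytes right behind the last parent slot is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the structures above; `id` replaces struct object_id. */
struct parent_info {
	char status;
	unsigned int mode;
	int id;
};

struct path_rec {
	struct path_rec *next;
	char *path;
	unsigned int mode;
	int id;
	struct parent_info parent[];	/* C99 flexible array member */
};

/* Header + n parent slots + path of length l + NUL; no overflow checks here. */
#define path_rec_size(n, l) \
	(sizeof(struct path_rec) + (l) + 1 + sizeof(struct parent_info) * (n))

/* Allocate one record for `path` with room for nparent parent slots. */
struct path_rec *path_rec_new(const char *path, int nparent)
{
	size_t len = strlen(path);
	struct path_rec *p;

	assert(nparent >= 0);
	p = calloc(1, path_rec_size((size_t)nparent, len));
	if (!p)
		return NULL;
	/* in this sketch the path bytes live right behind the last parent slot */
	p->path = (char *)&p->parent[nparent];
	memcpy(p->path, path, len + 1);
	return p;
}
```

A single free() releases the header, the parent slots and the path together, which is the main attraction of the single-allocation layout.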
|
2006-01-28 09:03:38 +01:00
|
|
|
|
Log message printout cleanups
2006-04-17 20:59:32 +02:00
|
|
|
extern void show_combined_diff(struct combine_diff_path *elem, int num_parent,
|
|
|
|
int dense, struct rev_info *);
|
2006-01-28 09:03:38 +01:00
|
|
|
|
2017-05-30 19:30:55 +02:00
|
|
|
extern void diff_tree_combined(const struct object_id *oid, const struct oid_array *parents, int dense, struct rev_info *rev);
|
2006-04-29 10:24:49 +02:00
|
|
|
|
2011-12-17 11:20:07 +01:00
|
|
|
extern void diff_tree_combined_merge(const struct commit *commit, int dense, struct rev_info *rev);
|
diff-tree -c: show a merge commit a bit more sensibly.
A new option '-c' to diff-tree changes the way a merge commit is
displayed when generating a patch output. It shows a "combined
diff" (hence the option letter 'c'), which looks like this:
$ git-diff-tree --pretty -c -p fec9ebf1 | head -n 18
diff-tree fec9ebf... (from parents)
Merge: 0620db3... 8a263ae...
Author: Junio C Hamano <junkio@cox.net>
Date: Sun Jan 15 22:25:35 2006 -0800
Merge fixes up to GIT 1.1.3
diff --combined describe.c
@@@ +98,7 @@@
return (a_date > b_date) ? -1 : (a_date == b_date) ? 0 : 1;
}
- static void describe(char *arg)
- static void describe(struct commit *cmit, int last_one)
++ static void describe(char *arg, int last_one)
{
+ unsigned char sha1[20];
+ struct commit *cmit;
There are a few things to note about this feature:
- The '-c' option implies '-p'. It also implies '-m' halfway
in the sense that "interesting" merges are shown, but not all
merges.
- When a blob matches one of the parents, we do not show a diff
for that path at all. For a merge commit, this option shows
paths with real file-level merge (aka "interesting things").
- As a consequence of the above, an "uninteresting" merge is
not shown at all. You can use '-m' in addition to '-c' to
show the commit log for such a merge, but there will be no
combined diff output.
- Unlike "gitk", the output is monochrome.
A '-' character in the nth column means the line is from the nth
parent and does not appear in the merge result (i.e. removed
from that parent's version).
A '+' character in the nth column means the line appears in the
merge result, and the nth parent does not have that line
(i.e. added by the merge itself or inherited from another
parent).
The above example output shows that the function signature was
changed from either parent (hence two "-" lines and a "++"
line), and "unsigned char sha1[20]", prefixed by a " +", was
inherited from the first parent.
The code as sent to the list was buggy in a few corner cases,
which I have fixed since then.
It does not bother to keep track of and show the line numbers
from parent commits, which it probably should.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-01-24 10:22:04 +01:00
|
|
|
|
2008-08-19 05:08:09 +02:00
|
|
|
void diff_set_mnemonic_prefix(struct diff_options *options, const char *a, const char *b);
|
|
|
|
|
2011-05-31 18:14:17 +02:00
|
|
|
extern int diff_can_quit_early(struct diff_options *);
|
|
|
|
|
2005-09-21 09:00:47 +02:00
|
|
|
extern void diff_addremove(struct diff_options *,
|
|
|
|
int addremove,
|
2005-04-27 18:21:00 +02:00
|
|
|
unsigned mode,
|
2017-05-30 19:30:47 +02:00
|
|
|
const struct object_id *oid,
|
|
|
|
int oid_valid,
|
2010-01-18 21:26:18 +01:00
|
|
|
const char *fullpath, unsigned dirty_submodule);
|
2005-04-27 18:21:00 +02:00
|
|
|
|
2005-09-21 09:00:47 +02:00
|
|
|
extern void diff_change(struct diff_options *,
|
|
|
|
unsigned mode1, unsigned mode2,
|
2017-05-30 19:30:49 +02:00
|
|
|
const struct object_id *old_oid,
|
|
|
|
const struct object_id *new_oid,
|
|
|
|
int old_oid_valid, int new_oid_valid,
|
2010-01-18 21:26:18 +01:00
|
|
|
const char *fullpath,
|
|
|
|
unsigned dirty_submodule1, unsigned dirty_submodule2);
|
2005-04-27 18:21:00 +02:00
|
|
|
|
2011-04-23 01:05:58 +02:00
|
|
|
extern struct diff_filepair *diff_unmerge(struct diff_options *, const char *path);
|
2005-04-27 18:21:00 +02:00
|
|
|
|
2005-05-28 00:54:37 +02:00
|
|
|
#define DIFF_SETUP_REVERSE 1
|
2005-05-28 00:56:38 +02:00
|
|
|
#define DIFF_SETUP_USE_CACHE 2
|
|
|
|
#define DIFF_SETUP_USE_SIZE_CACHE 4
|
2005-06-03 10:36:43 +02:00
|
|
|
|
2010-08-05 10:22:52 +02:00
|
|
|
/*
|
2013-10-31 12:08:28 +01:00
|
|
|
* Poor man's alternative to parse-option, to allow both stuck form
|
2010-08-05 10:22:52 +02:00
|
|
|
* (--option=value) and separate form (--option value).
|
|
|
|
*/
|
|
|
|
extern int parse_long_opt(const char *opt, const char **argv,
|
|
|
|
const char **optarg);
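The two accepted spellings mentioned in the comment, stuck (--option=value) and separate (--option value), can be told apart as in the following sketch. `long_opt_sketch()` is a hypothetical helper written for this illustration and is not parse_long_opt() itself; in particular, returning -1 for a missing value and returning the count of consumed argv elements are choices of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper. Returns 0 if argv[0] is not "--<opt>", 1 if the
 * value was stuck to the option, 2 if it was taken from argv[1], and -1
 * if the separate form was used without a value.
 */
int long_opt_sketch(const char *opt, const char **argv, const char **optarg)
{
	const char *arg = argv[0];
	size_t len = strlen(opt);

	assert(optarg != NULL);
	if (strncmp(arg, "--", 2) || strncmp(arg + 2, opt, len))
		return 0;
	arg += 2 + len;
	if (*arg == '=') {	/* stuck form: --option=value */
		*optarg = arg + 1;
		return 1;
	}
	if (*arg != '\0')	/* a longer option that merely shares the prefix */
		return 0;
	if (!argv[1])		/* separate form without a value */
		return -1;
	*optarg = argv[1];	/* separate form: --option value */
	return 2;
}
```

A caller can use the return value to advance its argv index by the right number of elements.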
|
|
|
|
|
2008-05-14 19:46:53 +02:00
|
|
|
extern int git_diff_basic_config(const char *var, const char *value, void *cb);
|
2016-09-05 11:44:53 +02:00
|
|
|
extern int git_diff_heuristic_config(const char *var, const char *value, void *cb);
|
2016-02-25 09:59:21 +01:00
|
|
|
extern void init_diff_ui_defaults(void);
|
2008-05-14 19:46:53 +02:00
|
|
|
extern int git_diff_ui_config(const char *var, const char *value, void *cb);
|
2005-09-21 09:00:47 +02:00
|
|
|
extern void diff_setup(struct diff_options *);
|
2016-01-21 12:48:44 +01:00
|
|
|
extern int diff_opt_parse(struct diff_options *, const char **, int, const char *);
|
2012-08-03 14:16:24 +02:00
|
|
|
extern void diff_setup_done(struct diff_options *);
|
2018-05-02 18:01:14 +02:00
|
|
|
extern int git_config_rename(const char *var, const char *value);
|
2005-04-26 03:22:47 +02:00
|
|
|
|
2005-05-22 19:04:37 +02:00
|
|
|
#define DIFF_DETECT_RENAME 1
|
|
|
|
#define DIFF_DETECT_COPY 2
|
|
|
|
|
2005-05-28 00:55:28 +02:00
|
|
|
#define DIFF_PICKAXE_ALL 1
|
2006-03-29 02:16:33 +02:00
|
|
|
#define DIFF_PICKAXE_REGEX 2
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. The earlier
maximum of 10000 has been raised to 60000; there are no user-visible
changes, but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
|
2010-08-23 19:17:03 +02:00
|
|
|
#define DIFF_PICKAXE_KIND_S 4 /* traditional plumbing counter */
|
|
|
|
#define DIFF_PICKAXE_KIND_G 8 /* grep in the patch */
|
2018-01-04 23:50:42 +01:00
|
|
|
#define DIFF_PICKAXE_KIND_OBJFIND 16 /* specific object IDs */
|
2010-08-23 19:17:03 +02:00
|
|
|
|
2018-01-04 23:50:42 +01:00
|
|
|
#define DIFF_PICKAXE_KINDS_MASK (DIFF_PICKAXE_KIND_S | \
|
|
|
|
DIFF_PICKAXE_KIND_G | \
|
|
|
|
DIFF_PICKAXE_KIND_OBJFIND)
|
2018-01-04 23:50:41 +01:00
|
|
|
|
2018-01-04 23:50:40 +01:00
|
|
|
#define DIFF_PICKAXE_IGNORE_CASE 32
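The pickaxe defines combine "kind" bits (S, G, object find) with modifier bits such as IGNORE_CASE in one flags word, and DIFF_PICKAXE_KINDS_MASK isolates the kinds. As a small sketch of how such a mask can be used, the helper below reports whether more than one kind is requested. The macro values are copied from the defines above; `pickaxe_kinds_conflict()` is a hypothetical name and not a function declared in this header.

```c
#include <assert.h>

/* Values copied from the defines above. */
#define DIFF_PICKAXE_KIND_S		4
#define DIFF_PICKAXE_KIND_G		8
#define DIFF_PICKAXE_KIND_OBJFIND	16
#define DIFF_PICKAXE_KINDS_MASK (DIFF_PICKAXE_KIND_S | \
				 DIFF_PICKAXE_KIND_G | \
				 DIFF_PICKAXE_KIND_OBJFIND)
#define DIFF_PICKAXE_IGNORE_CASE	32

/* Hypothetical helper: nonzero if more than one pickaxe kind bit is set. */
int pickaxe_kinds_conflict(unsigned opts)
{
	unsigned kinds = opts & DIFF_PICKAXE_KINDS_MASK;

	/* the modifier flag must not be mistaken for a kind */
	assert((DIFF_PICKAXE_KINDS_MASK & DIFF_PICKAXE_IGNORE_CASE) == 0);
	/* kinds & (kinds - 1) clears the lowest set bit; nonzero means >1 bit */
	return (kinds & (kinds - 1)) != 0;
}
```

Modifier bits outside the mask, such as IGNORE_CASE, do not affect the result.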
|
2010-08-23 19:17:03 +02:00
|
|
|
|
2005-09-21 09:00:47 +02:00
|
|
|
extern void diffcore_std(struct diff_options *);
|
unpack-trees.c: look ahead in the index
This makes the traversal of index be in sync with the tree traversal.
When unpack_callback() is fed a set of tree entries from trees, it
inspects the name of the entry and checks if an index entry with
the same name could be hiding behind the current index entry, and
(1) if the name appears in the index as a leaf node, it is also
fed to the n_way_merge() callback function;
(2) if the name is a directory in the index, i.e. there are entries
that are underneath it, then nothing is fed to the n_way_merge()
callback function;
(3) otherwise, if the name comes before the first eligible entry in the
index, the index entry is first unpacked alone.
When traverse_trees_recursive() descends into a subdirectory, the
cache_bottom pointer is moved to walk index entries within that directory.
All of these are omitted for diff-index, which does not even want to be
fed an index entry and a tree entry with D/F conflicts.
This fixes 3-way read-tree and exposes a bug in other parts of the system
in t6035, test #5. The test prepares these three trees:
O = HEAD^
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/b-2/c/d
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/b/c/d
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/x
A = HEAD
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/b-2/c/d
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/b/c/d
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb a/x
B = master
120000 blob a36b77384451ea1de7bd340ffca868249626bc52 a/b
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/b-2/c/d
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a/x
With a clean index that matches HEAD, running
git read-tree -m -u --aggressive $O $A $B
now yields
120000 a36b77384451ea1de7bd340ffca868249626bc52 3 a/b
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 a/b-2/c/d
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 1 a/b/c/d
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 2 a/b/c/d
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 a/x
which is correct. "master" created "a/b" symlink that did not exist,
and removed "a/b/c/d" while HEAD did not touch either path.
Before this series, read-tree did not notice the situation and resolved
addition of "a/b" and removal of "a/b/c/d" independently. If A = HEAD had
another path "a/b/c/e" added, this merge should conflict but instead it
silently resolved "a/b" and then immediately overwrote it to add
"a/b/c/e", which was quite bogus.
Tests in t1012 start to work with this.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-09-20 09:03:39 +02:00
|
|
|
extern void diffcore_fix_diff_index(struct diff_options *);
|
2005-06-12 05:57:13 +02:00
|
|
|
|
2005-07-13 21:52:35 +02:00
|
|
|
#define COMMON_DIFF_OPTIONS_HELP \
|
|
|
|
"\ncommon diff options:\n" \
|
2005-09-21 09:18:27 +02:00
|
|
|
" -z output diff-raw with lines terminated with NUL.\n" \
|
|
|
|
" -p output patch format.\n" \
|
|
|
|
" -u synonym for -p.\n" \
|
2006-04-11 13:22:17 +02:00
|
|
|
" --patch-with-raw\n" \
|
|
|
|
" output both a patch and the diff-raw format.\n" \
|
2006-04-14 00:15:30 +02:00
|
|
|
" --stat show diffstat instead of patch.\n" \
|
2006-10-12 12:01:00 +02:00
|
|
|
" --numstat show numeric diffstat instead of patch.\n" \
|
2006-04-15 13:41:18 +02:00
|
|
|
" --patch-with-stat\n" \
|
|
|
|
" output a patch and prepend its diffstat.\n" \
|
2005-09-21 09:18:27 +02:00
|
|
|
" --name-only show only names of changed files.\n" \
|
2005-09-21 09:20:06 +02:00
|
|
|
" --name-status show names and status of changed files.\n" \
|
2005-12-14 02:21:41 +01:00
|
|
|
" --full-index show full object name on index lines.\n" \
|
2005-12-18 11:03:15 +01:00
|
|
|
" --abbrev=<n> abbreviate object names in diff-tree header and diff-raw.\n" \
|
2005-09-21 09:18:27 +02:00
|
|
|
" -R swap input file pairs.\n" \
|
|
|
|
" -B detect complete rewrites.\n" \
|
|
|
|
" -M detect renames.\n" \
|
|
|
|
" -C detect copies.\n" \
|
2005-07-13 21:52:35 +02:00
|
|
|
" --find-copies-harder\n" \
|
2005-09-21 09:18:27 +02:00
|
|
|
" try unchanged files as candidate for copy detection.\n" \
|
|
|
|
" -l<n> limit rename attempts up to <n> paths.\n" \
|
|
|
|
" -O<file> reorder diffs according to the <file>.\n" \
|
|
|
|
" -S<string> find filepair whose only one side contains the string.\n" \
|
2005-07-13 21:52:35 +02:00
|
|
|
" --pickaxe-all\n" \
|
2006-07-07 15:57:08 +02:00
|
|
|
" show all files diff when -S is used and hit is found.\n" \
|
|
|
|
" -a --text treat all files as text.\n"
|
2005-07-13 21:52:35 +02:00
|
|
|
|
2005-05-22 04:40:36 +02:00
|
|
|
extern int diff_queue_is_empty(void);
|
2005-09-21 09:00:47 +02:00
|
|
|
extern void diff_flush(struct diff_options*);
|
2011-01-06 22:50:06 +01:00
|
|
|
extern void diff_warn_rename_limit(const char *varname, int needed, int degraded_cc);
|
2005-04-26 03:22:47 +02:00
|
|
|
|
2005-07-25 22:05:44 +02:00
|
|
|
/* diff-raw status letters */
|
2005-07-25 23:31:19 +02:00
|
|
|
#define DIFF_STATUS_ADDED 'A'
|
2005-07-25 22:05:44 +02:00
|
|
|
#define DIFF_STATUS_COPIED 'C'
|
|
|
|
#define DIFF_STATUS_DELETED 'D'
|
|
|
|
#define DIFF_STATUS_MODIFIED 'M'
|
|
|
|
#define DIFF_STATUS_RENAMED 'R'
|
|
|
|
#define DIFF_STATUS_TYPE_CHANGED 'T'
|
|
|
|
#define DIFF_STATUS_UNKNOWN 'X'
|
|
|
|
#define DIFF_STATUS_UNMERGED 'U'
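As a rough illustration of how a caller might consume the status letters defined above, the sketch below maps each letter to a human-readable label. The helper name status_label() is hypothetical and does not exist in diff.h; only the DIFF_STATUS_* values are taken from the definitions above.

```c
#include <stdio.h>

/* Values copied from the definitions above. */
#define DIFF_STATUS_ADDED		'A'
#define DIFF_STATUS_COPIED		'C'
#define DIFF_STATUS_DELETED		'D'
#define DIFF_STATUS_MODIFIED		'M'
#define DIFF_STATUS_RENAMED		'R'
#define DIFF_STATUS_TYPE_CHANGED	'T'
#define DIFF_STATUS_UNKNOWN		'X'
#define DIFF_STATUS_UNMERGED		'U'

/*
 * Hypothetical helper (not part of diff.h): translate a diff-raw
 * status letter into a short label for display.
 */
static const char *status_label(char status)
{
	switch (status) {
	case DIFF_STATUS_ADDED:		return "added";
	case DIFF_STATUS_COPIED:	return "copied";
	case DIFF_STATUS_DELETED:	return "deleted";
	case DIFF_STATUS_MODIFIED:	return "modified";
	case DIFF_STATUS_RENAMED:	return "renamed";
	case DIFF_STATUS_TYPE_CHANGED:	return "type changed";
	case DIFF_STATUS_UNKNOWN:	return "unknown";
	case DIFF_STATUS_UNMERGED:	return "unmerged";
	default:			return "unrecognized";
	}
}

/* Hypothetical helper: print one "<letter> <label> <path>" line. */
static void print_status_line(FILE *fp, char status, const char *path)
{
	fprintf(fp, "%c %s %s\n", status, status_label(status), path);
}
```

A caller could invoke print_status_line(stdout, DIFF_STATUS_RENAMED, "diff.h") for each changed path. Letters outside the set above fall through to "unrecognized" rather than being treated as an error, which is a design choice of this sketch only.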
|
|
|
|
|
|
|
|
/* these are not diff-raw status letters proper, but used by
|
|
|
|
* diffcore-filter insn to specify additional restrictions.
|
|
|
|
*/
|
2005-10-05 02:44:17 +02:00
|
|
|
#define DIFF_STATUS_FILTER_AON '*'
|
2005-07-25 22:05:44 +02:00
|
|
|
#define DIFF_STATUS_FILTER_BROKEN 'B'
|
|
|
|
|
2016-10-20 08:19:43 +02:00
|
|
|
/*
|
|
|
|
* This is different from find_unique_abbrev() in that
|
|
|
|
* it stuffs the result with dots for alignment.
|
|
|
|
*/
|
2016-10-20 08:20:07 +02:00
|
|
|
extern const char *diff_aligned_abbrev(const struct object_id *sha1, int);
|
2005-12-14 02:21:41 +01:00
|
|
|
|
2007-11-10 09:15:03 +01:00
|
|
|
/* do not report anything on removed paths */
|
|
|
|
#define DIFF_SILENT_ON_REMOVED 01
|
git-add: make the entry stat-clean after re-adding the same contents
Earlier in commit 0781b8a9b2fe760fc4ed519a3a26e4b9bd6ccffe
(add_file_to_index: skip rehashing if the cached stat already
matches), add_file_to_index() was taught not to re-add the path
if it already matches the index.
The change meant well, but was not executed quite right. It
used ie_modified() to see if the file on the work tree is really
different from the index, and skipped adding the contents if the
function says "not modified".
This was wrong. There are three possible comparison results
between the index and the file in the work tree:
- with lstat(2) we _know_ they are different. E.g. if the
length or the owner in the cached stat information is
different from the length we just obtained from lstat(2), we
can tell the file is modified without looking at the actual
contents.
- with lstat(2) we _know_ they are the same. The same length,
the same owner, the same everything (but this has a twist, as
described below).
- we cannot tell from lstat(2) information alone and need to go
to the filesystem to actually compare.
The last case arises from what we call the 'racy git' situation,
which can be caused by this sequence:
$ echo hello >file
$ git add file
$ echo aeiou >file ;# the same length
If the second "echo" is done within the same filesystem
timestamp granularity as the first "echo", then the timestamp
recorded by "git add" and the timestamp we get from lstat(2)
will be the same, and we can mistakenly say the file is not
modified. The path is called 'racily clean'. We need to
reliably detect that racily clean paths are in fact modified.
To solve this problem, when we write out the index, we mark the
index entry that has the same timestamp as the index file itself
(that is the time from the point of view of the filesystem) to
tell any later code that does the lstat(2) comparison not to
trust the cached stat info, and ie_modified() then actually goes
to the filesystem to compare the contents for such a path.
That's all good, but it should not be used for this "git add"
optimization, as the goal of "git add" is to actually update the
path in the index and make it stat-clean. With the false
optimization, we did _not_ cause any data loss (after all, what
we failed to do was only to update the cached stat information),
but it made the following sequence leave the file stat dirty:
$ echo hello >file
$ git add file
$ echo hello >file ;# the same contents
$ git add file
The solution is not to use ie_modified() which goes to the
filesystem to see if it is really clean, but instead use
ie_match_stat() with "assume racily clean paths are dirty"
option, to force re-adding of such a path.
There was another problem with "git add -u". The codepath
shares the same issue when adding the paths that are found to be
modified, but in addition, it asked the run_diff_files()
function (the machinery behind "git diff-files") to list
the paths that are modified. But that machinery
uses the same ie_modified() call, so it does not report
racily clean _and_ actually clean paths as modified, which is
not what we want.
The patch allows the callers of run_diff_files() to pass the
same "assume racily clean paths are dirty" option, and makes
the "git add -u" codepath use that option, to discover and re-add
racily clean _and_ actually clean paths.
We could further optimize on top of this patch to differentiate
the case where the path really needs re-adding (i.e. the content
of the racily clean entry was indeed different) and the case
where only the cached stat information needs to be refreshed
(i.e. the racily clean entry was actually clean), but I do not
think it is worth it.
This patch applies to maint and all the way up.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-10 03:22:52 +01:00
|
|
|
/* report racily-clean paths as modified */
|
|
|
|
#define DIFF_RACY_IS_MODIFIED 02
|
2007-11-10 09:15:03 +01:00
|
|
|
extern int run_diff_files(struct rev_info *revs, unsigned int option);
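The two option bits accepted by run_diff_files() are written in octal and can be OR-ed together. The sketch below is a toy model, not the real implementation: it follows the three lstat(2) outcomes described in the commit message above and shows how DIFF_RACY_IS_MODIFIED would change the verdict for a racily clean path. The enum stat_verdict and the function report_as_modified() are hypothetical names introduced only for this illustration.

```c
#include <stdio.h>

/* Values copied from the definitions above. */
#define DIFF_SILENT_ON_REMOVED	01
#define DIFF_RACY_IS_MODIFIED	02

/* Hypothetical: the three outcomes of comparing cached stat data. */
enum stat_verdict {
	STAT_KNOWN_DIFFERENT,	/* lstat(2) alone proves a change */
	STAT_KNOWN_SAME,	/* lstat(2) alone proves no change */
	STAT_RACILY_CLEAN	/* cannot tell from lstat(2) alone */
};

/*
 * Hypothetical toy model: decide whether a path is reported as
 * modified. "contents_differ" stands in for the expensive content
 * comparison, which is consulted only for racily clean paths when
 * DIFF_RACY_IS_MODIFIED is not set.
 */
static int report_as_modified(enum stat_verdict verdict,
			      int contents_differ,
			      unsigned int option)
{
	switch (verdict) {
	case STAT_KNOWN_DIFFERENT:
		return 1;
	case STAT_KNOWN_SAME:
		return 0;
	case STAT_RACILY_CLEAN:
		if (option & DIFF_RACY_IS_MODIFIED)
			return 1;	/* assume dirty, skip the comparison */
		return contents_differ ? 1 : 0;
	}
	return 0;
}
```

With option set to DIFF_RACY_IS_MODIFIED | DIFF_SILENT_ON_REMOVED, a racily clean path is reported as modified even when its contents are unchanged, which is what the "git add -u" codepath described above relies on to make such entries stat-clean again.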
|
2006-04-22 12:58:04 +02:00
|
|
|
extern int run_diff_index(struct rev_info *revs, int cached);
|
2006-04-22 11:43:00 +02:00
|
|
|
|
2017-05-07 00:10:35 +02:00
|
|
|
extern int do_diff_cache(const struct object_id *, struct diff_options *);
|
2017-05-30 19:30:54 +02:00
|
|
|
extern int diff_flush_patch_id(struct diff_options *, struct object_id *, int);
|
2006-06-25 03:51:08 +02:00
|
|
|
|
2007-12-14 08:40:27 +01:00
|
|
|
extern int diff_result_code(struct diff_options *, int);
|
|
|
|
|
2016-01-20 12:06:02 +01:00
|
|
|
extern void diff_no_index(struct rev_info *, int, const char **);
|
2008-05-24 07:28:56 +02:00
|
|
|
|
2017-10-31 19:19:05 +01:00
|
|
|
extern int index_differs_from(const char *def, const struct diff_flags *flags,
|
|
|
|
int ita_invisible_in_index);
|
2009-02-10 15:30:35 +01:00
|
|
|
|
2016-02-22 19:28:54 +01:00
|
|
|
/*
|
|
|
|
* Fill the contents of the filespec "df", respecting any textconv defined by
|
|
|
|
* its userdiff driver. The "driver" parameter must come from a
|
|
|
|
* previous call to get_textconv(), and therefore should either be NULL or have
|
|
|
|
* textconv enabled.
|
|
|
|
*
|
|
|
|
* Note that the memory ownership of the resulting buffer depends on whether
|
|
|
|
* the driver field is NULL. If it is, then the memory belongs to the filespec
|
|
|
|
* struct. If it is non-NULL, then "outbuf" points to a newly allocated buffer
|
|
|
|
* that should be freed by the caller.
|
|
|
|
*/
|
2010-06-07 17:23:36 +02:00
|
|
|
extern size_t fill_textconv(struct userdiff_driver *driver,
|
|
|
|
struct diff_filespec *df,
|
|
|
|
char **outbuf);
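The ownership rule in the comment above can be easy to get wrong at call sites, so here is a minimal sketch of the calling pattern. It uses toy stand-ins rather than the real types: struct toy_filespec, struct toy_driver, toy_fill_textconv() and toy_first_byte() are all hypothetical, and the "conversion" is simply upper-casing. Only the rule itself (borrowed buffer when the driver is NULL, caller-owned buffer otherwise) comes from the comment.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real filespec and driver structs. */
struct toy_filespec {
	char *data;
	size_t size;
};

struct toy_driver {
	const char *name;
};

/*
 * Toy model of the contract: with a NULL driver, *outbuf borrows the
 * filespec's own memory; with a driver, *outbuf is newly allocated
 * and must be freed by the caller.
 */
static size_t toy_fill_textconv(struct toy_driver *driver,
				struct toy_filespec *df,
				char **outbuf)
{
	size_t i;

	if (!driver) {
		*outbuf = df->data;	/* borrowed, do not free */
		return df->size;
	}

	*outbuf = malloc(df->size + 1);
	if (!*outbuf)
		return 0;
	for (i = 0; i < df->size; i++)
		(*outbuf)[i] = (char)toupper((unsigned char)df->data[i]);
	(*outbuf)[df->size] = '\0';
	return df->size;
}

/* Caller pattern: free the buffer only when a driver produced it. */
static int toy_first_byte(struct toy_driver *driver,
			  struct toy_filespec *df)
{
	char *buf;
	size_t len = toy_fill_textconv(driver, df, &buf);
	int first = len ? (unsigned char)buf[0] : -1;

	if (driver)
		free(buf);
	return first;
}
```

The key point is the conditional free() in the caller: freeing unconditionally would release memory that still belongs to the filespec, while never freeing would leak the converted buffer.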
|
|
|
|
|
2016-02-22 19:28:54 +01:00
|
|
|
/*
|
|
|
|
* Look up the userdiff driver for the given filespec, and return it if
|
|
|
|
* and only if it has textconv enabled (otherwise return NULL). The result
|
|
|
|
* can be passed to fill_textconv().
|
|
|
|
*/
|
2010-06-07 17:23:36 +02:00
|
|
|
extern struct userdiff_driver *get_textconv(struct diff_filespec *one);
|
|
|
|
|
2017-05-24 07:15:10 +02:00
|
|
|
/*
|
|
|
|
* Prepare diff_filespec and convert it using diff textconv API
|
|
|
|
* if the textconv driver exists.
|
|
|
|
* Return 1 if the conversion succeeds, 0 otherwise.
|
|
|
|
*/
|
|
|
|
extern int textconv_object(const char *path, unsigned mode, const struct object_id *oid, int oid_valid, char **buf, unsigned long *buf_size);
|
|
|
|
|
2010-09-28 01:58:25 +02:00
|
|
|
extern int parse_rename_score(const char **cp_p);
|
|
|
|
|
2013-01-16 08:51:58 +01:00
|
|
|
extern long parse_algorithm_value(const char *value);
|
|
|
|
|
2017-06-30 02:07:02 +02:00
|
|
|
extern void print_stat_summary(FILE *fp, int files,
|
|
|
|
int insertions, int deletions);
|
2012-10-26 17:53:52 +02:00
|
|
|
extern void setup_diff_pager(struct diff_options *);
|
2012-02-01 13:55:07 +01:00
|
|
|
|
2005-04-26 03:22:47 +02:00
|
|
|
#endif /* DIFF_H */
|