Commit Graph

34 Commits

Author SHA1 Message Date
Michael Haggerty
433860f3d0 diff: improve positioning of add/delete blocks in diffs
Some groups of added/deleted lines in diffs can be slid up or down,
because lines at the edges of the group are not unique. Picking good
shifts for such groups is not a matter of correctness but definitely has
a big effect on aesthetics. For example, consider the following two
diffs. The first is what standard Git emits:

    --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
    +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
    @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) {
     }

     if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
    +if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {
                            $smtp_server = $_;

The following diff is equivalent, but is obviously preferable from an
aesthetic point of view:

    --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
    +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
    @@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) {
            $initial_reply_to =~ s/(^\s+|\s+$)//g;
     }

    +if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
     if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {

This patch teaches Git to pick better positions for such "diff sliders"
using heuristics that take the positions of nearby blank lines and the
indentation of nearby lines into account.

The existing Git code basically always shifts such "sliders" as far down
in the file as possible. The only exception is when the slider can be
aligned with a group of changed lines in the other file, in which case
Git favors depicting the change as one add+delete block rather than one
add and a slightly offset delete block. This naive algorithm often
yields ugly diffs.

Commit d634d61ed6 improved the situation somewhat by preferring to
position add/delete groups to make their last line a blank line, when
that is possible. This heuristic does more good than harm, but (1) it
can only help if there are blank lines in the right places, and (2)
always picks the last blank line, even if there are others that might be
better. The end result is that it makes perhaps 1/3 as many errors as
the default Git algorithm, but that still leaves a lot of ugly diffs.

This commit implements a new and much better heuristic for picking
optimal "slider" positions using the following approach: First observe
that each hypothetical positioning of a diff slider introduces two
splits: one between the context lines preceding the group and the first
added/deleted line, and the other between the last added/deleted line
and the first line of context following it. It tries to find the
positioning that creates the least bad splits.

Splits are evaluated based only on the presence and locations of nearby
blank lines, and the indentation of lines near the split. Basically, it
prefers to introduce splits adjacent to blank lines, between lines that
are indented less, and between lines with the same level of indentation.
In more detail:

1. It measures the following characteristics of a proposed splitting
   position in a `struct split_measurement`:

   * the number of blank lines above the proposed split
   * whether the line directly after the split is blank
   * the number of blank lines following that line
   * the indentation of the nearest non-blank line above the split
   * the indentation of the line directly below the split
   * the indentation of the nearest non-blank line after that line

2. It combines the measured attributes using a bunch of
   empirically-optimized weighting factors to derive a `struct
   split_score` that measures the "badness" of splitting the text at
   that position.

3. It combines the `split_score` for the top and the bottom of the
   slider at each of its possible positions, and selects the position
   that has the best `split_score`.

I determined the initial set of weighting factors by collecting a corpus
of Git histories from 29 open-source software projects in various
programming languages. I generated many diffs from this corpus, and
determined the best positioning "by eye" for about 6600 diff sliders. I
used about half of the repositories in the corpus (corresponding to
about 2/3 of the sliders) as a training set, and optimized the weights
against this corpus using a crude automated search of the parameter
space to get the best agreement with the manually-determined values.
Then I tested the resulting heuristic against the full corpus. The
results are summarized in the following table, in column `indent-1`:

| repository            | count |      Git 2.9.0 |     compaction | compaction-fixed |       indent-1 |       indent-2 |
| --------------------- | ----- | -------------- | -------------- | ---------------- | -------------- | -------------- |
| afnetworking          |   109 |    89  (81.7%) |    37  (33.9%) |      37  (33.9%) |     2   (1.8%) |     2   (1.8%) |
| alamofire             |    30 |    18  (60.0%) |    14  (46.7%) |      15  (50.0%) |     0   (0.0%) |     0   (0.0%) |
| angular               |   184 |   127  (69.0%) |    39  (21.2%) |      23  (12.5%) |     5   (2.7%) |     5   (2.7%) |
| animate               |   313 |     2   (0.6%) |     2   (0.6%) |       2   (0.6%) |     2   (0.6%) |     2   (0.6%) |
| ant                   |   380 |   356  (93.7%) |   152  (40.0%) |     148  (38.9%) |    15   (3.9%) |    15   (3.9%) | *
| bugzilla              |   306 |   263  (85.9%) |   109  (35.6%) |      99  (32.4%) |    14   (4.6%) |    15   (4.9%) | *
| corefx                |   126 |    91  (72.2%) |    22  (17.5%) |      21  (16.7%) |     6   (4.8%) |     6   (4.8%) |
| couchdb               |    78 |    44  (56.4%) |    26  (33.3%) |      28  (35.9%) |     6   (7.7%) |     6   (7.7%) | *
| cpython               |   937 |   158  (16.9%) |    50   (5.3%) |      49   (5.2%) |     5   (0.5%) |     5   (0.5%) | *
| discourse             |   160 |    95  (59.4%) |    42  (26.2%) |      36  (22.5%) |    18  (11.2%) |    13   (8.1%) |
| docker                |   307 |   194  (63.2%) |   198  (64.5%) |     253  (82.4%) |     8   (2.6%) |     8   (2.6%) | *
| electron              |   163 |   132  (81.0%) |    38  (23.3%) |      39  (23.9%) |     6   (3.7%) |     6   (3.7%) |
| git                   |   536 |   470  (87.7%) |    73  (13.6%) |      78  (14.6%) |    16   (3.0%) |    16   (3.0%) | *
| gitflow               |   127 |     0   (0.0%) |     0   (0.0%) |       0   (0.0%) |     0   (0.0%) |     0   (0.0%) |
| ionic                 |   133 |    89  (66.9%) |    29  (21.8%) |      38  (28.6%) |     1   (0.8%) |     1   (0.8%) |
| ipython               |   482 |   362  (75.1%) |   167  (34.6%) |     169  (35.1%) |    11   (2.3%) |    11   (2.3%) | *
| junit                 |   161 |   147  (91.3%) |    67  (41.6%) |      66  (41.0%) |     1   (0.6%) |     1   (0.6%) | *
| lighttable            |    15 |     5  (33.3%) |     0   (0.0%) |       2  (13.3%) |     0   (0.0%) |     0   (0.0%) |
| magit                 |    88 |    75  (85.2%) |    11  (12.5%) |       9  (10.2%) |     1   (1.1%) |     0   (0.0%) |
| neural-style          |    28 |     0   (0.0%) |     0   (0.0%) |       0   (0.0%) |     0   (0.0%) |     0   (0.0%) |
| nodejs                |   781 |   649  (83.1%) |   118  (15.1%) |     111  (14.2%) |     4   (0.5%) |     5   (0.6%) | *
| phpmyadmin            |   491 |   481  (98.0%) |    75  (15.3%) |      48   (9.8%) |     2   (0.4%) |     2   (0.4%) | *
| react-native          |   168 |   130  (77.4%) |    79  (47.0%) |      81  (48.2%) |     0   (0.0%) |     0   (0.0%) |
| rust                  |   171 |   128  (74.9%) |    30  (17.5%) |      27  (15.8%) |    16   (9.4%) |    14   (8.2%) |
| spark                 |   186 |   149  (80.1%) |    52  (28.0%) |      52  (28.0%) |     2   (1.1%) |     2   (1.1%) |
| tensorflow            |   115 |    66  (57.4%) |    48  (41.7%) |      48  (41.7%) |     5   (4.3%) |     5   (4.3%) |
| test-more             |    19 |    15  (78.9%) |     2  (10.5%) |       2  (10.5%) |     1   (5.3%) |     1   (5.3%) | *
| test-unit             |    51 |    34  (66.7%) |    14  (27.5%) |       8  (15.7%) |     2   (3.9%) |     2   (3.9%) | *
| xmonad                |    23 |    22  (95.7%) |     2   (8.7%) |       2   (8.7%) |     1   (4.3%) |     1   (4.3%) | *
| --------------------- | ----- | -------------- | -------------- | ---------------- | -------------- | -------------- |
| totals                |  6668 |  4391  (65.9%) |  1496  (22.4%) |    1491  (22.4%) |   150   (2.2%) |   144   (2.2%) |
| totals (training set) |  4552 |  3195  (70.2%) |  1053  (23.1%) |    1061  (23.3%) |    86   (1.9%) |    88   (1.9%) |
| totals (test set)     |  2116 |  1196  (56.5%) |   443  (20.9%) |     430  (20.3%) |    64   (3.0%) |    56   (2.6%) |

In this table, the numbers are the count and percentage of human-rated
sliders that the corresponding algorithm got *wrong*. The columns are

* "repository" - the name of the repository used. I used the diffs
  between successive non-merge commits on the HEAD branch of the
  corresponding repository.

* "count" - the number of sliders that were human-rated. I chose most,
  but not all, sliders to rate from those among which the various
  algorithms gave different answers.

* "Git 2.9.0" - the default algorithm used by `git diff` in Git 2.9.0.

* "compaction" - the heuristic used by `git diff --compaction-heuristic`
  in Git 2.9.0.

* "compaction-fixed" - the heuristic used by `git diff
  --compaction-heuristic` after the fixes from earlier in this patch
  series. Note that the results are not dramatically different than
  those for "compaction". Both produce non-ideal diffs only about 1/3 as
  often as the default `git diff`.

* "indent-1" - the new `--indent-heuristic` algorithm, using the first
  set of weighting factors, determined as described above.

* "indent-2" - the new `--indent-heuristic` algorithm, using the final
  set of weighting factors, determined as described below.

* `*` - indicates that repo was part of training set used to determine
  the first set of weighting factors.

The fact that the heuristic performed nearly as well on the test set as
on the training set in column "indent-1" is a good indication that the
heuristic was not over-trained. Given that fact, I ran a second round of
optimization, using the entire corpus as the training set. The resulting
set of weights gave the results in column "indent-2". These are the
weights included in this patch.

The final result gives consistently and significantly better results
across the whole corpus than either `git diff` or `git diff
--compaction-heuristic`. It makes only about 1/30 as many errors as the
former and about 1/10 as many errors as the latter. (And a good fraction
of the remaining errors are for diffs that involve weirdly-formatted
code, sometimes apparently machine-generated.)

The tools that were used to do this optimization and analysis, along
with the human-generated data values, are recorded in a separate project
[1].

This patch adds a new command-line option `--indent-heuristic`, and a
new configuration setting `diff.indentHeuristic`, that activate this
heuristic. This interface is only meant for testing purposes, and should
be finalized before including this change in any release.

[1] https://github.com/mhagger/diff-slider-tools

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-19 10:25:11 -07:00
Junio C Hamano
0018da1088 Merge branch 'jk/diff-compact-heuristic'
Patch output from "git diff" and friends has been tweaked to be
more readable by using a blank line as a strong hint that the
contents before and after it belong to a logically separate unit.

* jk/diff-compact-heuristic:
  diff: undocument the compaction heuristic knobs for experimentation
  xdiff: implement empty line chunk heuristic
  xdiff: add recs_match helper function
2016-05-06 14:45:46 -07:00
Stefan Beller
d634d61ed6 xdiff: implement empty line chunk heuristic
In order to produce the smallest possible diff and combine several diff
hunks together, we implement a heuristic from GNU Diff which moves diff
hunks forward as far as possible when we find common context above and
below a diff hunk. This sometimes produces less readable diffs when
writing C, Shell, or other programming languages, ie:

...
 /*
+ *
+ *
+ */
+
+/*
...

instead of the more readable equivalent of

...
+/*
+ *
+ *
+ */
+
 /*
...

Implement the following heuristic to (optionally) produce the desired
output.

  If there are diff chunks which can be shifted around, shift each hunk
  such that the last common empty line is below the chunk with the rest
  of the context above.

This heuristic appears to resolve the above example and several other
common issues without producing significantly weird results. However, as
with any heuristic it is not really known whether this will always be
more optimal. Thus, it can be disabled via diff.compactionHeuristic.

Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-19 10:53:34 -07:00
Jeff King
907681e940 xdiff: drop XDL_EMIT_COMMON
There are no more callers that use this mode, and none
likely to be added (as our xdl_merge() eliminates the common
use of it for generating 3-way merge bases).

This is effectively a revert of a9ed376 (xdiff: generate
"anti-diffs" aka what is common to two files, 2006-06-28),
though of course trying to revert that ancient commit
directly produces many textual conflicts.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22 22:36:09 -08:00
Antoine Pelisse
36617af7ed diff: add --ignore-blank-lines option
The goal of the patch is to introduce the GNU diff
-B/--ignore-blank-lines as closely as possible. The short option is not
available because it's already used for "break-rewrites".

When this option is used, git-diff will not create hunks that simply
add or remove empty lines, but will still show empty lines
addition/suppression if they are close enough to "valuable" changes.

There are two differences between this option and GNU diff -B option:
- GNU diff doesn't have "--inter-hunk-context", so this must be handled
- The following sequence looks like a bug (context is displayed twice):

    $ seq 5 >file1
    $ cat <<EOF >file2
    change
    1
    2

    3
    4
    5
    change
    EOF
    $ diff -u -B file1 file2
    --- file1	2013-06-08 22:13:04.471517834 +0200
    +++ file2	2013-06-08 22:13:23.275517855 +0200
    @@ -1,5 +1,7 @@
    +change
     1
     2
    +
     3
     4
     5
    @@ -3,3 +5,4 @@
     3
     4
     5
    +change

So here is a more thorough description of the option:
- real changes are interesting
- blank lines that are close enough (less than context size) to
interesting changes are considered interesting (recursive definition)
- "context" lines are used around each hunk of interesting changes
- If two hunks are separated by less than "inter-hunk-context", they
will be merged into one.

The implementation does the "interesting changes selection" in a single
pass.

Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-06-19 15:17:45 -07:00
René Scharfe
3319e60633 xdiff: remove emit_func() and xdi_diff_hunks()
The functions are unused now, remove them.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-09 14:08:42 -07:00
René Scharfe
467d348c19 xdiff: add hunk_func()
Add a way to register a callback function that is gets passed the
start line and line count of each hunk of a diff.  Only standard
types are used.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-09 14:00:15 -07:00
Junio C Hamano
307ab20b33 xdiff: PATIENCE/HISTOGRAM are not independent option bits
Because the default Myers, patience and histogram algorithms cannot be in
effect at the same time, XDL_PATIENCE_DIFF and XDL_HISTOGRAM_DIFF are not
independent bits.  Instead of wasting one bit per algorithm, define a few
macros to access the few bits they occupy and update the code that access
them.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-19 15:36:55 -08:00
Junio C Hamano
e5b06629de xdiff: remove XDL_PATCH_* macros
These are not used anywhere in our codebase, and the bit assignment
definition is merely confusing.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-19 14:32:25 -08:00
René Scharfe
14937c2c06 diff: add option to show whole functions as context
Add the option -W/--function-context to git diff.  It is similar to
the same option of git grep and expands the context of change hunks
so that the whole surrounding function is shown.  This "natural"
context can allow changes to be understood better.

Note: GNU patch doesn't like diffs generated with the new option;
it seems to expect context lines to be the same before and after
changes.  git apply doesn't complain.

This implementation has the same shortcoming as the one in grep,
namely that there is no way to explicitly find the end of a
function.  That means that a few lines of extra context are shown,
right up to the next recognized function begins.  It's already
useful in its current form, though.

The function get_func_line() in xdiff/xemit.c is extended to work
forward as well as backward to find post-context as well as
pre-context.  It returns the position of the first found matching
line.  The func_line parameter is made optional, as we don't need
it for -W.

The enhanced function is then used in xdl_emit_diff() to extend
the context as needed.  If the added context overlaps with the
next change, it is merged into the current hunk.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-10 12:05:07 -07:00
Tay Ray Chuan
739864b1ff xdiff: do away with xdl_mmfile_next()
Given our simple mmfile structure, xdl_mmfile_next() calls are
redundant. Do away with calls to them.

Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-03 10:15:16 -07:00
Tay Ray Chuan
8c912eea94 teach --histogram to diff
Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show
that it is faster than its --patience cousin, as well as the default
Meyers algorithm.

The implementation has been reworked to use structs and pointers,
instead of bitmasks, thus doing away with JGit's 2^28 line limit.

We also use xdiff's default hash table implementation (xdl_hash_bits()
with XDL_HASHLONG()) for convenience.

Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-12 09:29:20 -07:00
Jonathan Nieder
a4b5e91c49 xdl_merge(): move file1 and file2 labels to xmparam structure
The labels for the three participants in a potential conflict are all
optional arguments for the xdiff merge routine; if they are NULL, then
xdl_merge() can cope by omitting the labels from its output.  Move
them to the xmparam structure to allow new callers to save some
keystrokes where they are not needed.

This also has the virtue of making the xdiff merge interface more
similar to merge_trees, which might make it easier to learn.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-20 20:36:10 -07:00
Jonathan Nieder
8a161433a0 xdl_merge(): add optional ancestor label to diff3-style output
The ‘git checkout --conflict=diff3’ command can be used to
present conflicts hunks including text from the common ancestor:

	<<<<<<< ours
	ourside
	|||||||
	original
	=======
	theirside
	>>>>>>> theirs

The added information is helpful for resolving merges by hand, and
merge tools can usually grok it because it is very similar to the
output from diff3 -m.

A subtle change can help more tools to understand the output.  ‘diff3’
includes the name of the merge base on the ||||||| line of the output,
and some tools misparse the conflict hunks without it.  Add a new
xmp->ancestor parameter to xdl_merge() for use with conflict style
XDL_MERGE_DIFF3 as a label on the ||||||| line for any conflict hunks.

If xmp->ancestor is NULL, the output format is unchanged.  Thus, this
change only provides unexposed plumbing for the new feature; it does
not affect the outward behavior of git.

Requested-by: Stefan Monnier <monnier@iro.umontreal.ca>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Acked-by: Bert Wesarg <Bert.Wesarg@googlemail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-20 20:36:10 -07:00
Bert Wesarg
560119b9ab refactor merge flags into xmparam_t
Include the merge level, favor, and style flags into the xmparam_t struct.
This removes the bit twiddling with these three values into the one flags
parameter.

Signed-off-by: Bert Wesarg <bert.wesarg@googlemail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-02 11:51:48 -08:00
Bert Wesarg
cd1d61c44f make union merge an xdl merge favor
The current union merge driver is implemented as an post process.  But the
xdl_merge code is quite capable to produce the result by itself.  Therefore
move it there.

Signed-off-by: Bert Wesarg <bert.wesarg@googlemail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-02 11:43:40 -08:00
Junio C Hamano
06dbc1ea57 Merge branch 'jc/conflict-marker-size'
* jc/conflict-marker-size:
  rerere: honor conflict-marker-size attribute
  rerere: prepare for customizable conflict marker length
  conflict-marker-size: new attribute
  rerere: use ll_merge() instead of using xdl_merge()
  merge-tree: use ll_merge() not xdl_merge()
  xdl_merge(): allow passing down marker_size in xmparam_t
  xdl_merge(): introduce xmparam_t for merge specific parameters
  git_attr(): fix function signature

Conflicts:
	builtin-merge-file.c
	ll-merge.c
	xdiff/xdiff.h
	xdiff/xmerge.c
2010-01-20 20:28:51 -08:00
Junio C Hamano
9914cf4689 xdl_merge(): allow passing down marker_size in xmparam_t
This allows the callers of xdl_merge() to pass marker_size (defaults to 7)
in xmparam_t argument, to use conflict markers of non-default length.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-16 23:45:29 -08:00
Junio C Hamano
00f8f97d30 xdl_merge(): introduce xmparam_t for merge specific parameters
So far we have only needed to be able to pass an option that is generic to
xdiff family of functions to this function.  Extend the interface so that
we can give it merge specific parameters.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-16 21:33:13 -08:00
Junio C Hamano
73eb40eeaa git-merge-file --ours, --theirs
Sometimes people want their conflicting merges autoresolved by
favouring upstream changes.  The standard answer they are given is
to run "git diff --name-only | xargs git checkout MERGE_HEAD --" in
such a case.  This is to accept automerge results for the paths that
are fully resolved automatically, while taking their version of the
file in full for paths that have conflicts.

This is problematic on two counts.

One is that this is not exactly what these people want.  It discards
all changes they did on their branch for any paths that conflicted.
They usually want to salvage as much automerge result as possible in
a conflicted file, and want to take the upstream change only in the
conflicted part.

This patch teaches two new modes of operation to the lowest-lever
merge machinery, xdl_merge().  Instead of leaving the conflicted
lines from both sides enclosed in <<<, ===, and >>> markers, the
conflicts are resolved favouring our side or their side of changes.

A larger problem is that this tends to encourage a bad workflow by
allowing people to record such a mixed up half-merged result as a
full commit without auditing.  This commit does not tackle this
issue at all.  In git, we usually give long enough rope to users
with strange wishes as long as the risky features are not enabled by
default, and this is such a risky feature.

Signed-off-by: Avery Pennarun <apenwarr@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-29 23:11:46 -08:00
Junio C Hamano
5dc1308562 Merge branch 'js/patience-diff'
* js/patience-diff:
  bash completions: Add the --patience option
  Introduce the diff option '--patience'
  Implement the patience diff algorithm

Conflicts:
	contrib/completion/git-completion.bash
2009-01-23 21:51:38 -08:00
Johannes Schindelin
92b7de93fb Implement the patience diff algorithm
The patience diff algorithm produces slightly more intuitive output
than the classic Myers algorithm, as it does not try to minimize the
number of +/- lines first, but tries to preserve the lines that are
unique.

To this end, it first determines lines that are unique in both files,
then the maximal sequence which preserves the order (relative to both
files) is extracted.

Starting from this initial set of common lines, the rest of the lines
is handled recursively, with Myers' algorithm as a fallback when
the patience algorithm fails (due to no common unique lines).

This patch includes memory leak fixes by Pierre Habouzit.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-07 13:35:44 -08:00
René Scharfe
6d0e674a57 diff: add option to show context between close hunks
Merge two hunks if there is only the specified number of otherwise unshown
context between them.  For --inter-hunk-context=1, the resulting patch has
the same number of lines but shows uninterrupted context instead of a
context header line in between.

Patches generated with this option are easier to read but are also more
likely to conflict if the file to be patched contains other changes.

This patch keeps the default for this option at 0.  It is intended to just
make the feature available in order to see its advantages and downsides.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-12-29 01:05:21 -08:00
Brian Downing
ef2e62fe23 Allow alternate "low-level" emit function from xdl_diff
For some users (e.g. git blame), getting textual patch output is just
extra work, as they can get all the information they need from the low-
level diff structures.  Allow for an alternate low-level emit function
to be defined to allow bypassing the textual patch generation; set
xemitconf_t's emit_func member to enable this.

The (void (*)()) type is pretty ugly, but the alternative would be to
include most of the private xdiff headers in xdiff.h to get the types
required for the "proper" function prototype.  Also, a (void *) won't
work, as ANSI C doesn't allow a function pointer to be cast to an
object pointer.

Signed-off-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-10-25 12:09:31 -07:00
Junio C Hamano
e0af48e496 xdiff-merge: optionally show conflicts in "diff3 -m" style
When showing conflicting merges, we traditionally followed RCS's merge
output format.  The output shows:

 <<<<<<<
 postimage from one side;
 =======
 postimage of the other side; and
 >>>>>>>

Some poeple find it easier to be able to understand what is going on when
they can view the common ancestor's version, which is used by "diff3 -m",
which shows:

 <<<<<<<
 postimage from one side;
 |||||||
 shared preimage;
 =======
 postimage of the other side; and
 >>>>>>>

This is an initial step to bring that as an optional feature to git.
Only "git merge-file" has been converted, with "--diff3" option.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-08-30 19:41:43 -07:00
Johannes Schindelin
ee95ec5d58 xdl_merge(): introduce XDL_MERGE_ZEALOUS_ALNUM
When a merge conflicts, there are often common lines that are not really
common, such as empty lines or lines containing a single curly bracket.

With XDL_MERGE_ZEALOUS_ALNUM, we use the following heuristics: when a
hunk does not contain any letters or digits, it is treated as conflicting.

In other words, a conflict which used to look like this:

	<<<<<<<
					a = 1;
	=======
					output();
	>>>>>>>
				}
			}
		}

	<<<<<<<
		output();
	=======
		b = 1;
	>>>>>>>

will look like this with ZEALOUS_ALNUM:

	<<<<<<<
					a = 1;
				}
			}
		}

		output();
	=======
					output();
				}
			}
		}

		b = 1;
	>>>>>>>

To demonstrate this, git-merge-file has been switched from
XDL_MERGE_ZEALOUS to XDL_MERGE_ZEALOUS_ALNUM.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-18 00:10:37 -08:00
Junio C Hamano
f258475a6e Per-path attribute based hunk header selection.
This makes"diff -p" hunk headers customizable via gitattributes mechanism.
It is based on Johannes's earlier patch that allowed to define a single
regexp to be used for everything.

The mechanism to arrive at the regexp that is used to define hunk header
is the same as other use of gitattributes.  You assign an attribute, funcname
(because "diff -p" typically uses the name of the function the patch is about
as the hunk header), a simple string value.  This can be one of the names of
built-in pattern (currently, "java" is defined) or a custom pattern name, to
be looked up from the configuration file.

  (in .gitattributes)
  *.java   funcname=java
  *.perl   funcname=perl

  (in .git/config)
  [funcname]
    java = ... # ugly and complicated regexp to override the built-in one.
    perl = ... # another ugly and complicated regexp to define a new one.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-07-06 01:20:47 -07:00
Junio C Hamano
a6080a0a44 War on whitespace
This uses "git-apply --whitespace=strip" to fix whitespace errors that have
crept in to our source files over time.  There are a few files that need
to have trailing whitespaces (most notably, test vectors).  The results
still passes the test, and build result in Documentation/ area is unchanged.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-07 00:04:01 -07:00
Johannes Schindelin
859f9c4581 teach diff machinery about --ignore-space-at-eol
`git diff --ignore-space-at-eol` will ignore whitespace at the
line ends.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-13 21:40:42 -08:00
Johannes Schindelin
857b933e04 xdiff: add xdl_merge()
This new function implements the functionality of RCS merge, but
in-memory. It returns < 0 on error, otherwise the number of conflicts.

Finding the conflicting lines can be a very expensive task. You can
control the eagerness of this algorithm:

- a level value of 0 means that all overlapping changes are treated
  as conflicts,
- a value of 1 means that if the overlapping changes are identical,
  it is not treated as a conflict.
- If you set level to 2, overlapping changes will be analyzed, so that
  almost identical changes will not result in huge conflicts. Rather,
  only the conflicting lines will be shown inside conflict markers.

With each increasing level, the algorithm gets slower, but more accurate.
Note that the code for level 2 depends on the simple definition of
mmfile_t specific to git, and therefore it will be harder to port that
to LibXDiff.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-02 17:28:19 -08:00
Linus Torvalds
a9ed376b15 xdiff: generate "anti-diffs" aka what is common to two files
This fairly trivial patch adds a new XDL_EMIT_xxx flag to tell libxdiff
that we don't want to generate the _diff_ between two files, we want to
see the lines that are _common_ to two files.

So when you set XDL_EMIT_COMMON, xdl_diff() will do everything exactly
like it used to do, but the output records it generates just contain the
lines that aren't part of the diff.

This is for doing things like generating the common base case for a file
that was added in both branches.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-28 22:24:32 -07:00
Johannes Schindelin
0d21efa51c Teach diff about -b and -w flags
This adds -b (--ignore-space-change) and -w (--ignore-all-space) flags to
diff. The main part of the patch is teaching libxdiff about it.

[jc: renamed xdl_line_match() to xdl_recmatch() since the former is used
 for different purposes in xpatchi.c which is in the parts of the upstream
 source we do not use.]

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-23 17:35:27 -07:00
Mark Wooding
acb7257729 xdiff: Show function names in hunk headers.
The speed of the built-in diff generator is nice; but the function names
shown by `diff -p' are /really/ nice.  And I hate having to choose.  So,
we hack xdiff to find the function names and print them.

xdiff has grown a flag to say whether to dig up the function names.  The
builtin_diff function passes this flag unconditionally.  I suppose it
could parse GIT_DIFF_OPTS, but it doesn't at the moment.  I've also
reintroduced the `function name' into the test suite, from which it was
removed in commit 3ce8f089.

The function names are parsed by a particularly stupid algorithm at the
moment: it just tries to find a line in the `old' file, from before the
start of the hunk, whose first character looks plausible.  Still, it's
most definitely a start.

Signed-off-by: Mark Wooding <mdw@distorted.org.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-27 18:43:51 -08:00
Linus Torvalds
3443546f6e Use a *real* built-in diff generator
This uses a simplified libxdiff setup to generate unified diffs _without_
doing  fork/execve of GNU "diff".

This has several huge advantages, for example:

Before:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null

	real    0m24.818s
	user    0m13.332s
	sys     0m8.664s

After:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null

	real    0m4.563s
	user    0m2.944s
	sys     0m1.580s

and the fact that this should be a lot more portable (ie we can ignore all
the issues with doing fork/execve under Windows).

Perhaps even more importantly, this allows us to do diffs without actually
ever writing out the git file contents to a temporary file (and without
any of the shell quoting issues on filenames etc etc).

NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the
current "diff-core" code actually will always write the temp-files,
because it used to be something that you simply had to do. So this current
one actually writes a temp-file like before, and then reads it into memory
again just to do the diff. Stupid.

But if this basic infrastructure is accepted, we can start switching over
diff-core to not write temp-files, which should speed things up even
further, especially when doing big tree-to-tree diffs.

Now, in the interest of full disclosure, I should also point out a few
downsides:

 - the libxdiff algorithm is different, and I bet GNU diff has gotten a
   lot more testing. And the thing is, generating a diff is not an exact
   science - you can get two different diffs (and you will), and they can
   both be perfectly valid. So it's not possible to "validate" the
   libxdiff output by just comparing it against GNU diff.

 - GNU diff does some nice eye-candy, like trying to figure out what the
   last function was, and adding that information to the "@@ .." line.
   libxdiff doesn't do that.

 - The libxdiff thing has some known deficiencies. In particular, it gets
   the "\No newline at end of file" case wrong. So this is currently for
   the experimental branch only. I hope Davide will help fix it.

That said, I think the huge performance advantage, and the fact that it
integrates better is definitely worth it. But it should go into a
development branch at least due to the missing newline issue.

Technical note: this is based on libxdiff-0.17, but I did some surgery to
get rid of the extraneous fat - stuff that git doesn't need, and seriously
cutting down on mmfile_t, which had much more capabilities than the diff
algorithm either needed or used. In this version, "mmfile_t" is just a
trivial <pointer,length> tuple.

That said, I tried to keep the differences to simple removals, so that you
can do a diff between this and the libxdiff origin, and you'll basically
see just things getting deleted. Even the mmfile_t simplifications are
left in a state where the diffs should be readable.

Apologies to Davide, whom I'd love to get feedback on this all from (I
wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old
complex format had a helper function for that, but I did my surgery with
the goal in mind that eventually we _should_ just do

	mmfile_t mf;

	buf = read_sha1_file(sha1, type, &size);
	mf->ptr = buf;
	mf->size = size;
	.. use "mf" directly ..

which was really a nightmare with the old "helpful" mmfile_t, and really
is that easy with the new cut-down interfaces).

[ Btw, as any hawk-eye can see from the diff, this was actually generated
  with itself, so it is "self-hosting". That's about all the testing it
  has gotten, along with the above kernel diff, which eye-balls correctly,
  but shows the newline issue when you double-check it with "git-apply" ]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 16:49:58 -08:00