115 Commits

Author SHA1 Message Date
Junio C Hamano
2a5fe25458 Merge branch 'jc/rename'
* 'jc/rename' (early part):
  Optimize rename detection for a huge diff
2008-04-09 01:09:12 -07:00
Jeff King
ee542ee3fc rename: warn user when we have turned off rename detection
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-03-01 01:30:15 -08:00
Junio C Hamano
6d24ad971c Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.

This sorting is done so that when two new files in the postimage
are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lesser match.  This matters to a change introduced between the
1.5.3.X series and 1.5.4-rc, that lets the code favor unused
matches first and then fall back to using already used
matches.

This instead allocates and keeps only a handful of rename source
candidates per new file in the postimage.  I.e. it reduces the
memory requirement from O(N x M) to O(M).

For each dst, we compute similarity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep only a handful of the best src candidates for each dst.
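
A minimal sketch of the "keep a handful of candidates per dst" idea, not
the code from this commit.  The names NUM_CANDIDATE_PER_DST, struct
candidate, init_candidates() and record_if_better() are made up for
illustration, and the slot count of 4 is an assumption.  Every (src,
score) pair is still computed, but only the best few survive per
destination, so storage is O(M) rather than O(N x M).

```c
/* Hypothetical size of the "handful"; an assumption for this sketch. */
#define NUM_CANDIDATE_PER_DST 4

struct candidate {
	int src;	/* index of the source path, -1 while the slot is empty */
	int score;	/* similarity estimate, higher is better */
};

/* Reset all candidate slots of one destination. */
void init_candidates(struct candidate *c)
{
	int i;

	for (i = 0; i < NUM_CANDIDATE_PER_DST; i++) {
		c[i].src = -1;
		c[i].score = 0;
	}
}

/*
 * Offer one (src, score) pair to a destination.  The pair replaces the
 * weakest slot when it scores strictly higher; otherwise it is dropped.
 * Returns 1 when the pair was recorded, 0 when it was dropped.
 */
int record_if_better(struct candidate *c, int src, int score)
{
	int i, worst = 0;

	for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)
		if (c[i].score < c[worst].score)
			worst = i;
	if (score <= c[worst].score)
		return 0;
	c[worst].src = src;
	c[worst].score = score;
	return 1;
}
```

A caller would run init_candidates() once per dst and then call
record_if_better() for each src it scores against that dst.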

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-13 15:44:20 -08:00
Linus Torvalds
9ae8fcb36a Fix a pathological case in git detecting proper renames
On Thu, 29 Nov 2007, Jeff King wrote:
>
> I think it will get worse, because you are simultaneously calculating
> all of the similarity scores bit by bit rather than doing a loop. Though
> perhaps you mean at the end you will end up with a list of src/dst pairs
> sorted by score, and you can loop over that.

Well, after thinking about this a bit, I think there's a solution that may
work well with the current thing too: instead of looping just *once* over
the list of rename pairs, loop twice - and simply refuse to do copies on
the first loop.

This trivial patch does that, and turns Kumar's test-case into a perfect
rename list.

It's not pretty, it's not smart, but it seems to work. There's something
to be said for keeping it simple and stupid.

And it should not be nearly as expensive as it may _look_. Yes, the loop
is "(i = 0; i < num_create * num_src; i++)", but the important part is
that the whole array is sorted by rename score, and we have a

	if (mx[i].score < minimum_score)
		break;

in it, so the loop actually would tend to terminate rather quickly.
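
The "loop twice, refuse copies on the first loop" idea can be sketched
roughly as follows.  This is an illustration under assumptions, not the
patch itself: struct score_entry, match_two_pass() and the src_used[] and
dst_src[] bookkeeping arrays are hypothetical names.

```c
struct score_entry {
	int src;	/* index of the deleted (source) path */
	int dst;	/* index of the created (destination) path */
	int score;	/* similarity; the array is sorted by this, best first */
};

/*
 * mx[] must be sorted by descending score.  dst_src[d] must start at -1
 * and receives the source chosen for destination d; src_used[s] counts
 * how often source s has been picked.  Returns the number of
 * destinations that were matched.
 */
int match_two_pass(const struct score_entry *mx, int n, int minimum_score,
		   int *src_used, int *dst_src)
{
	int pass, i, matched = 0;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			if (mx[i].score < minimum_score)
				break;		/* sorted, so nothing better follows */
			if (dst_src[mx[i].dst] >= 0)
				continue;	/* destination already matched */
			if (pass == 0 && src_used[mx[i].src])
				continue;	/* first pass refuses copies */
			dst_src[mx[i].dst] = mx[i].src;
			src_used[mx[i].src]++;
			matched++;
		}
	}
	return matched;
}
```

On the first pass every source is handed out at most once, so renames get
spread over the sources; only the second pass lets a leftover destination
reuse a source, which is what shows up as a copy.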

Anyway, Kumar, the thing to take away from this is:

 - git really doesn't even *care* about the whole "rename detection"
   internally, and any commits you have done with renames are totally
   independent of the heuristics we then use to *show* the renames.

 - the rename detection really is for just two reasons: (a) keep humans
   happy, and keep the diffs small and (b) help automatic merging across
   renames. So getting renames right is certainly good, but it's more of a
   "politeness" issue than a "correctness" issue, although the merge
   portion of it does matter a lot sometimes.

 - the important thing here is that you can commit your changes and not
   worry about them being somehow "corrupted" by lack of rename detection,
   even if you commit them with a version of git that doesn't do rename
   detection the way you expected it. The rename detection is an
   "after-the-fact" thing, not something that actually gets saved in the
   repository, which is why we can change the heuristics _after_ seeing
   examples, and the examples magically correct themselves!

 - try out the two patches I've posted, and see if they work for you. They
   pass the test-suite, and the output for your example commit looks sane,
   but hey, if you have other test-cases, try them out.

Here's Kumar's pretty diffstat with both my patches:

	 Makefile                                         |    6 +++---
	 board/{cds => freescale}/common/cadmus.c         |    0
	 board/{cds => freescale}/common/cadmus.h         |    0
	 board/{cds => freescale}/common/eeprom.c         |    0
	 board/{cds => freescale}/common/eeprom.h         |    0
	 board/{cds => freescale}/common/ft_board.c       |    0
	 board/{cds => freescale}/common/via.c            |    0
	 board/{cds => freescale}/common/via.h            |    0
	 board/{cds => freescale}/mpc8541cds/Makefile     |    0
	 board/{cds => freescale}/mpc8541cds/config.mk    |    0
	 board/{cds => freescale}/mpc8541cds/init.S       |    0
	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c |    0
	 board/{cds => freescale}/mpc8541cds/u-boot.lds   |    4 ++--
	 board/{cds => freescale}/mpc8548cds/Makefile     |    0
	 board/{cds => freescale}/mpc8548cds/config.mk    |    0
	 board/{cds => freescale}/mpc8548cds/init.S       |    0
	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c |    0
	 board/{cds => freescale}/mpc8548cds/u-boot.lds   |    4 ++--
	 board/{cds => freescale}/mpc8555cds/Makefile     |    0
	 board/{cds => freescale}/mpc8555cds/config.mk    |    0
	 board/{cds => freescale}/mpc8555cds/init.S       |    0
	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c |    0
	 board/{cds => freescale}/mpc8555cds/u-boot.lds   |    4 ++--
	 23 files changed, 9 insertions(+), 9 deletions(-)

and here it is before:

	 Makefile                                           |    6 +-
	 board/cds/mpc8548cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/init.S                        |  255 --------------------
	 board/cds/mpc8555cds/u-boot.lds                    |  150 ------------
	 board/{cds => freescale}/common/cadmus.c           |    0
	 board/{cds => freescale}/common/cadmus.h           |    0
	 board/{cds => freescale}/common/eeprom.c           |    0
	 board/{cds => freescale}/common/eeprom.h           |    0
	 board/{cds => freescale}/common/ft_board.c         |    0
	 board/{cds => freescale}/common/via.c              |    0
	 board/{cds => freescale}/common/via.h              |    0
	 board/{cds => freescale}/mpc8541cds/Makefile       |    0
	 board/{cds => freescale}/mpc8541cds/config.mk      |    0
	 board/{cds => freescale}/mpc8541cds/init.S         |    0
	 board/{cds => freescale}/mpc8541cds/mpc8541cds.c   |    0
	 board/{cds => freescale}/mpc8541cds/u-boot.lds     |    4 +-
	 .../mpc8541cds => freescale/mpc8548cds}/Makefile   |    0
	 board/{cds => freescale}/mpc8548cds/config.mk      |    0
	 board/{cds => freescale}/mpc8548cds/init.S         |    0
	 board/{cds => freescale}/mpc8548cds/mpc8548cds.c   |    0
	 board/{cds => freescale}/mpc8548cds/u-boot.lds     |    4 +-
	 .../mpc8541cds => freescale/mpc8555cds}/Makefile   |    0
	 board/{cds => freescale}/mpc8555cds/config.mk      |    0
	 .../mpc8541cds => freescale/mpc8555cds}/init.S     |    0
	 board/{cds => freescale}/mpc8555cds/mpc8555cds.c   |    0
	 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds |    4 +-
	 27 files changed, 9 insertions(+), 534 deletions(-)

so it certainly makes the diffs prettier.

		Linus

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-30 15:49:17 -08:00
Linus Torvalds
32d75d29f9 Fix a pathological case in git detecting proper renames
Kumar Gala had a case in the u-boot archive with multiple renames of files
with identical contents, and git would turn those into multiple "copy"
operations of one of the sources, and just deleting the other sources.

This patch makes the git exact rename detection prefer to spread out the
renames over the multiple sources, rather than do multiple copies of one
source.

NOTE! The changes are a bit larger than required, because I also renamed
the variables named "one" and "two" to "target" and "source" respectively.
That makes the logic easier to follow, especially as the "one" was
illogically the target and not the source, for purely historical reasons
(this piece of code used to traverse over sources and targets in the wrong
order, and when we fixed that, we didn't fix the names back then. So I
fixed them now).

The important part of this change is just the trivial score calculations
for when files have identical contents:

	/* Give higher scores to sources that haven't been used already */
	score = !source->rename_used;
	score += basename_same(source, target);

and when we have multiple choices we'll now pick the choice that gets the
best rename score, rather than only looking at whether the basename
matched.
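
To make the selection concrete, here is a self-contained sketch built
around the two score lines quoted above.  Only those two lines come from
the patch; struct spec, base_name(), this basename_same() and
pick_source() are simplified stand-ins written for illustration, and all
sources passed in are assumed to have contents identical to the target.

```c
#include <string.h>

struct spec {
	const char *path;
	int rename_used;	/* how often this entry was already used as a source */
};

/* Return a pointer to the last component of a path. */
static const char *base_name(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/* 1 when both paths end in the same file name, 0 otherwise. */
int basename_same(const struct spec *a, const struct spec *b)
{
	return !strcmp(base_name(a->path), base_name(b->path));
}

/*
 * Among n sources with the same contents as target, pick the one with
 * the best score and mark it as used.  Ties go to the earliest entry.
 * Returns the chosen index, or -1 when n is 0.
 */
int pick_source(struct spec *sources, int n, const struct spec *target)
{
	int i, best = -1, best_score = -1;

	for (i = 0; i < n; i++) {
		/* Give higher scores to sources that haven't been used already */
		int score = !sources[i].rename_used;

		score += basename_same(&sources[i], target);
		if (score > best_score) {
			best = i;
			best_score = score;
		}
	}
	if (best >= 0)
		sources[best].rename_used++;
	return best;
}
```

With two identical init.S sources and two init.S targets, the first call
picks source 0 and the second call picks source 1, because source 0 has
lost its "unused" point by then.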

It's worth noting a few gotchas:

 - this scoring is currently only done for the "exact match" case.

   In particular, in Kumar's example, even after this patch, the inexact
   match case is still done as a copy+delete rather than as two renames:

	 delete mode 100644 board/cds/mpc8555cds/u-boot.lds
	 copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%)
	 rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%)

   because apparently the "cds/mpc8541cds/u-boot.lds" copy looked
   a bit more similar to both end results. That said, I *suspect* we just
   have the exact same issue there - the similarity analysis just gave
   identical (or at least very _close_ to identical) similarity points,
   and we do not have any logic to prefer multiple renames over a
   copy/delete there.

   That is a separate patch.

 - When you have identical contents and identical basenames, the actual
   entry that is chosen is still picked fairly "at random" for the first
   one (but the subsequent ones will prefer entries that haven't already
   been used).

   It's not actually really random, in that it actually depends on the
   relative alphabetical order of the files (which in turn will have
   impacted the order that the entries got hashed!), so it gives
   consistent results that can be explained. But I wanted to point it out
   as an issue for when anybody actually does cross-renames.

   In Kumar's case the choice is the right one (and for a single normal
   directory rename it should always be, since the relative alphabetical
   sorting of the files will be identical), and we now get:

	 rename board/{cds => freescale}/mpc8541cds/init.S (100%)
	 rename board/{cds => freescale}/mpc8548cds/init.S (100%)

   which is the "expected" answer. However, it might still be better to
   change the pedantic "exact same basename" on/off choice into a more
   graduated "how similar are the pathnames" scoring situation, in order
   to be more likely to get the exact rename choice that people *expect*
   to see, rather than other alternatives that may *technically* be
   equally good, but are surprising to a human.

It's also unclear whether we should consider "basenames are equal" or
"have already used this as a source" to be more important. This gives them
equal weight, but I suspect we might want to just multiply the "basenames
are equal" weight by two, or something, to prefer equal basenames even if
that causes a copy/delete pair. I dunno.

Anyway, what I'm just saying in a really long-winded manner is that I
think this is right as-is, but it's not the complete solution, and it may
want some further tweaking in the future.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-30 15:49:17 -08:00
Linus Torvalds
42899ac898 Do the fuzzy rename detection limits with the exact renames removed
When we do the fuzzy rename detection, we don't care about the
destinations that we already handled with the exact rename detector.
And, in fact, the code already knew that - but the rename limiter, which
used to run *before* exact renames were detected, did not.

This fixes it so that the rename detection limiter now bases its
decisions on the *remaining* rename counts, rather than the original
ones.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:06 -07:00
Linus Torvalds
81ac051d6a Fix ugly magic special case in exact rename detection
For historical reasons, the exact rename detection had populated the
filespecs for the entries it compared, and the rest of the similarity
analysis depended on that.  I hadn't even bothered to debug why that was
the case when I re-did the rename detection, I just made the new one
have the same broken behaviour, with a note about this special case.

This fixes that fixme.  The reason the exact rename detector needed to
fill in the file sizes of the files it checked was that the _inexact_
rename detector was broken, and started comparing file sizes before it
filled them in.

Fixing that allows the exact phase to do the sane thing of never even
caring (since all *it* cares about is really just the SHA1 itself, not
the size nor the contents).

It turns out that this also indirectly fixes a bug: trying to populate
all the filespecs will run out of virtual memory if there are tons and
tons of possible rename options.  The fuzzy similarity analysis does the
right thing in this regard, and frees the blob info after it has
generated the hash tables, so the special case code caused more trouble
than just some extra illogical code.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:06 -07:00
Linus Torvalds
17559a643e Do exact rename detection regardless of rename limits
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.

In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.

With the new code, that resulted in

	[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
	400006

	real    0m2.071s
	user    0m1.520s
	sys     0m0.576s

ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:

	diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
	similarity index 100%
	rename from really-big-dir/file-1-1-1-1-1
	rename to moved-big-dir/file-1-1-1-1-1

and the extra six lines are from a one-liner commit message and all the
commit information and spacing).

Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.

With the old code, this wouldn't have been practically possible.  Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive.  In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:

	[torvalds@woody big-rename]$ time git show -C | wc -l
	1400006

	real    0m12.337s
	user    0m12.285s
	sys     0m0.192s

ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:06 -07:00
Linus Torvalds
9027f53cb5 Do linear-time/space rename logic for exact renames
This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collisions, which we just cull aggressively).
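
A rough sketch of the hashing approach, under simplifying assumptions.
It does not use the generic hash.[ch] code: the fixed-size
open-addressing table, the FNV-1a string hash, struct file and
find_exact_renames() are all illustrative choices, and a hex string
stands in for the blob SHA1.

```c
#include <string.h>

#define TABLE_SIZE 64	/* power of two; must exceed the number of sources */

struct file {
	const char *path;
	const char *id;	/* stand-in for the blob SHA1, as a hex string */
};

/* FNV-1a string hash; used only to spread ids over the table. */
static unsigned int hash_id(const char *id)
{
	unsigned int h = 2166136261u;

	while (*id) {
		h ^= (unsigned char)*id++;
		h *= 16777619u;
	}
	return h;
}

/*
 * For each destination i, store in match[i] the index of a source with
 * the same id, or -1.  Sources are inserted once (O(n)) and each
 * destination is looked up once (O(m)); ids are compared only against
 * entries that landed in the same probe run.
 */
void find_exact_renames(const struct file *src, int n,
			const struct file *dst, int m, int *match)
{
	int table[TABLE_SIZE];
	unsigned int pos;
	int i;

	for (i = 0; i < TABLE_SIZE; i++)
		table[i] = -1;
	for (i = 0; i < n; i++) {
		pos = hash_id(src[i].id) & (TABLE_SIZE - 1);
		while (table[pos] >= 0)		/* linear probing */
			pos = (pos + 1) & (TABLE_SIZE - 1);
		table[pos] = i;
	}
	for (i = 0; i < m; i++) {
		match[i] = -1;
		pos = hash_id(dst[i].id) & (TABLE_SIZE - 1);
		while (table[pos] >= 0) {
			if (!strcmp(src[table[pos]].id, dst[i].id)) {
				match[i] = table[pos];
				break;
			}
			pos = (pos + 1) & (TABLE_SIZE - 1);
		}
	}
}
```

The sketch needs n to stay below TABLE_SIZE so that probing always finds
an empty slot; a real table would grow instead.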

Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less. A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.

In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too.  In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:06 -07:00
Linus Torvalds
644797119d copy vs rename detection: avoid unnecessary O(n*m) loops
The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).

That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).

Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:06 -07:00
Linus Torvalds
9fb88419ba Ref-count the filespecs used by diffcore
Rather than copy the filespecs when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.

This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection.  In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:05 -07:00
Linus Torvalds
cb1491b6bf Split out "exact content match" phase of rename detection
This makes the exact content match a separate function of its own.
Partly to cut down a bit on the size of the diffcore_rename() function
(which is too complex as it is), and partly because there are smarter
ways to do this than an O(m*n) loop over it all, and that function
should be rewritten to take that into account.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-26 23:18:05 -07:00
Junio C Hamano
8ae92e6389 rename diff_free_filespec_data_large() to diff_free_filespec_blob()
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-02 21:02:09 -07:00
Jeff King
eede7b7d11 diffcore-rename: cache file deltas
We find rename candidates by computing a fingerprint hash of
each file, and then comparing those fingerprints. There are
inherently O(n^2) comparisons, so it pays in CPU time to
hoist the (rather expensive) computation of the fingerprint
out of that loop (or to cache it once we have computed it once).

Previously, we didn't keep the filespec information around
because then we had the potential to consume a great deal of
memory. However, instead of keeping all of the filespec
data, we can instead just keep the fingerprint.

This patch implements and uses diff_free_filespec_data_large
to accomplish that goal. We also have to change
estimate_similarity not to needlessly repopulate the
filespec data when we already have the hash.

Practical tests showed a 4.5x speedup for a 10% memory usage
increase.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-02 21:02:03 -07:00
Linus Torvalds
0024a54923 Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
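
One way to picture a matrix-size check that cannot overflow is the sketch
below.  It is an assumption-laden illustration, not the code from this
commit: too_many_rename_candidates() is a hypothetical name, the cap of
32767 per side is an assumed value, and treating a non-positive limit as
"use the cap" is a choice made for the sketch.

```c
/*
 * Returns 1 when a (num_create x num_src) rename matrix is larger than
 * a square with rename_limit entries on a side, 0 otherwise.
 */
int too_many_rename_candidates(int num_create, int num_src, int rename_limit)
{
	const int max_limit = 32767;	/* assumed cap for this sketch */

	if (rename_limit <= 0 || rename_limit > max_limit)
		rename_limit = max_limit;
	/* Multiply in 64 bits so that large counts cannot wrap around. */
	return (long long)num_create * num_src >
	       (long long)rename_limit * rename_limit;
}
```

With a limit of 100, a 100 x 100 matrix passes while 2 x 10,000 does not,
even though only one side is large.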

This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.

The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 12:12:57 -07:00
Junio C Hamano
e1bc8dc66d Merge branch 'jc/diffcore'
* jc/diffcore:
  diffcore-delta.c: Ignore CR in CRLF for text files
  diffcore-delta.c: update the comment on the algorithm.
  diffcore_filespec: add is_binary
  diffcore_count_changes: pass diffcore_filespec
2007-07-02 01:45:12 -07:00
Junio C Hamano
d8c3d03a0b diffcore_count_changes: pass diffcore_filespec
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure.  Existing
callers always call this function with parameters taken from a
filespec anyway, so there are no functional changes.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-30 20:51:31 -07:00
René Scharfe
cfc0aef1ff diffcore-rename: don't change similarity index based on basename equality
This implements a suggestion from Johannes.  It uses a separate field in
struct diff_score to keep the result of the file name comparison in the
rename detection logic.  This reverts the value of the similarity index
to be a function of file contents, only, and basename comparison is only
used to decide between files with equal amounts of content changes.
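
As an illustration of keeping the two values apart, a comparison function
for sorting candidates could look like the sketch below.  The struct name
diff_score_sketch and its layout are assumptions made for the example;
only the idea of a separate name score used as a tie-breaker is taken
from the description above.

```c
#include <stdlib.h>

struct diff_score_sketch {
	int src;	/* index of the rename source */
	int dst;	/* index of the rename target */
	int score;	/* similarity index, a function of contents only */
	int name_score;	/* 1 when the basenames are equal, else 0 */
};

/* qsort() callback: best content score first, basename match breaks ties. */
int score_compare(const void *a_, const void *b_)
{
	const struct diff_score_sketch *a = a_, *b = b_;

	if (a->score != b->score)
		return b->score - a->score;
	return b->name_score - a->name_score;
}
```

Sorting with qsort(mx, n, sizeof(*mx), score_compare) then puts a
candidate with an equal basename ahead of others that have the same
content score, without changing the similarity index that gets reported.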

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-24 23:12:31 -07:00
Johannes Schindelin
0ce396431e diffcore-rename: favour identical basenames
When there are several candidates for a rename source, and one of them
has an identical basename to the rename target, take that one.

Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch
by Linus Torvalds.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-22 22:43:51 -07:00
Junio C Hamano
50b2b53897 diff -M: release the preimage candidate blobs after rename detection.
We released the postimage candidate blobs after we are done to reduce
memory pressure.  Do the same for preimage candidate blobs.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-07 15:54:32 -07:00
Shawn O. Pearce
dc49cd769b Cast 64 bit off_t to 32 bit size_t
Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
This implies that we are able to access and work on files whose
maximum length is around 2^63-1 bytes, but we can only malloc or
mmap somewhat less than 2^32-1 bytes of memory.

On such a system an implicit conversion of off_t to size_t can cause
the size_t to wrap, resulting in unexpected and exciting behavior.
Right now we are working around all gcc warnings generated by the
-Wshorten-64-to-32 option by passing the off_t through xsize_t().

In the future we should make xsize_t on such problematic platforms
detect the wrapping and die if such a file is accessed.
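
A checked conversion along those lines might look like the sketch below.
It is hypothetical: size_from_off() is not an existing helper, int64_t
stands in for a 64-bit off_t, and the sketch reports failure to the
caller instead of dying.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a 64-bit file length to size_t.  Returns 0 and stores the
 * value in *out on success; returns -1 when the length is negative or
 * does not fit in size_t on this platform.
 */
int size_from_off(int64_t len, size_t *out)
{
	if (len < 0 || (uint64_t)len > SIZE_MAX)
		return -1;
	*out = (size_t)len;
	return 0;
}
```

On a platform where size_t is 32 bits wide, a 5 GB length would be
rejected here rather than silently wrapped.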

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-07 11:15:26 -08:00
Shawn O. Pearce
7da41f48c8 Bypass expensive content comparison during rename detection.
When comparing file contents during the second loop through a rename
detection attempt we can skip the expensive byte-by-byte comparison
if both source and destination files have valid SHA1 values.  This
improves performance by avoiding either an expensive open/mmap to
read the working tree copy, or an expensive inflate of a blob object.

Unfortunately we still have to at least initialize the sizes of the
source and destination files even if the SHA1 values don't match.
Failing to initialize the sizes causes a number of test cases to fail
and start reporting different copy/rename behavior than was expected.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-14 02:40:33 -08:00
Junio C Hamano
2f3f8b218a git-pickaxe: rename detection optimization
The idea is that we are interested in renaming into only one path, so
we do not care about renames that happen elsewhere.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-11-04 12:18:12 -08:00
David Rientjes
a89fccd281 Do not use memcmp(sha1_1, sha1_2, 20) with hardcoded length.
Introduces global inline:

	hashcmp(const unsigned char *sha1, const unsigned char *sha2)

Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).
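
A minimal sketch of such an inline, assuming the length is still a
compile-time constant.  HASH_LEN is a name invented for this example; the
point is only that callers stop spelling out the 20 themselves.

```c
#include <string.h>

#define HASH_LEN 20	/* SHA-1 digest length in bytes; assumed constant here */

/* Compare two binary object names; same return convention as memcmp(). */
static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
{
	return memcmp(sha1, sha2, HASH_LEN);
}
```

A call site then changes from memcmp(a, b, 20) to hashcmp(a, b), so a
later change of the hash length touches one place only.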

Acked-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-17 14:23:53 -07:00
Junio C Hamano
ef677686ef diff.c: do not use pathname comparison to tell renames
The final output from diff used to compare pathnames between
preimage and postimage to tell if the filepair is a rename/copy.
By explicitly marking the filepair created by diffcore_rename(),
the output routine, resolve_rename_copy(), does not have to do
so anymore.  This helps feeding a filepair that has different
pathnames in one and two elements to the diff machinery (most
notably, comparing two blobs).

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-03 14:41:53 -07:00
Junio C Hamano
17e6019a2a diffcore-rename: try matching up renames without populating filespec first.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-07-06 17:03:52 -07:00
Junio C Hamano
fc5807190e diffcore-rename: fix merging back a broken pair.
When a broken pair is matched up by rename detector to be merged
back, we do not want to say it is "dissimilar" with the
similarity index.  The output should just say they were changed,
taking the break score left by the earlier diffcore-break run if
any.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-08 20:32:41 -07:00
Linus Torvalds
90bd932c81 Fix up diffcore-rename scoring
The "score" calculation for diffcore-rename was totally broken.

It scaled "score" as

	score = src_copied * MAX_SCORE / dst->size;

which means that you got a 100% similarity score even if src and dest were
different, if just every byte of dst was copied from src, even if source
was much larger than dst (eg we had copied 85% of the bytes, but _deleted_
the remaining 15%).

That's clearly bogus. We should do the score calculation relative not to
the destination size, but to the max size of the two.
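
The difference is easy to see with numbers.  The sketch below is
illustrative only: score_vs_dst() and score_vs_max() are hypothetical
helper names, MAX_SCORE is assumed to be 60000 (meaning 100%), and
returning 0 for zero sizes is a choice made to keep the sketch from
dividing by zero.

```c
#define MAX_SCORE 60000	/* assumed scale: MAX_SCORE means 100% similar */

/* Scaling as described for the old code: relative to dst size only. */
int score_vs_dst(unsigned long long src_copied, unsigned long long dst_size)
{
	if (!dst_size)
		return 0;	/* sketch-only guard against division by zero */
	return (int)(src_copied * MAX_SCORE / dst_size);
}

/* Scaling relative to the larger of the two sizes. */
int score_vs_max(unsigned long long src_copied,
		 unsigned long long src_size, unsigned long long dst_size)
{
	unsigned long long max_size = src_size > dst_size ? src_size : dst_size;

	if (!max_size)
		return 0;	/* sketch-only guard against division by zero */
	return (int)(src_copied * MAX_SCORE / max_size);
}
```

For a 100-byte source of which 85 bytes survive as the whole 85-byte
destination, score_vs_dst(85, 85) yields 60000 (100%), while
score_vs_max(85, 100, 85) yields 51000 (85%).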

This seems to fix it.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-12 23:02:00 -08:00
Junio C Hamano
2821104db7 diffcore-delta: make the hash a bit denser.
To reduce wasted memory, wait until the hash fills up more
densely before we rehash.  This reduces the working set size a
bit further.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-12 17:26:32 -08:00
Junio C Hamano
c06c79667c diffcore-rename: somewhat optimized.
This changes diffcore-rename to reuse statistics information
gathered during similarity estimation, and updates the hashtable
implementation used to keep track of the statistics to be
denser.  This seems to give better performance.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-12 03:22:10 -08:00
Junio C Hamano
1706306a54 diffcore-rename: similarity estimator fix.
The "similarity" logic was giving added material way too much
negative weight.  What we wanted to see was how similar the
post-change image was compared to the pre-change image, so the
natural definition of similarity is how much material the two have in
common, relative to the post-change image's size.

This simplifies things a lot.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-02 22:12:33 -08:00
Junio C Hamano
65416758cd diffcore-rename: split out the delta counting code.
This is to rework diffcore break/rename/copy detection code
so that it is not affected when the deltifier code gets improved.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-28 20:20:04 -08:00
Junio C Hamano
09a5d72d8e diffcore-rename: plug memory leak.
Spotted by Nicolas Pitre.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-22 19:45:48 -08:00
Eric Wong
7d6fb370bc short circuit out of a few places where we would allocate zero bytes
dietlibc versions of malloc, calloc and realloc all return NULL if
they're told to allocate 0 bytes, which causes the x* wrappers to die().

There are several more places where these calls could end up asking
for 0 bytes, too...

Maybe simply not die()-ing in the x* wrappers if 0/NULL is returned
when the requested size is zero is a safer and easier way to go.
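
One possible shape for such a wrapper is sketched below.  It is an
assumption, not what this commit does: xmalloc_sketch() is a made-up
name, and retrying a zero-byte request as a one-byte request is just one
way to keep callers that rely on a non-NULL result working.

```c
#include <stdio.h>
#include <stdlib.h>

/* malloc() wrapper that never returns NULL, even for a zero-byte request. */
void *xmalloc_sketch(size_t size)
{
	void *ret = malloc(size);

	if (!ret && !size)
		ret = malloc(1);	/* some libcs return NULL for 0 bytes */
	if (!ret) {
		fprintf(stderr, "fatal: out of memory\n");
		exit(128);
	}
	return ret;
}
```

With this shape the call sites would not need their own short-circuit
for the zero-size case.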

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-26 08:59:21 -08:00
Junio C Hamano
9f70b80692 rename detection with -M100 means "exact renames only".
When the user is interested in pure renames, there is no point
doing the similarity scores.  This changes the score argument
parsing to special case -M100 (otherwise, it is a precision
scaled value 0 <= v < 1 and would mean 0.1, not 1.0 --- if you
do mean 0.1, you can say -M1), and optimizes the diffcore_rename
transformation to only look at pure renames in that case.
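
The parsing rule can be sketched as below.  This is a simplified
illustration under assumptions: sketch_parse_score() is a hypothetical
name, MAX_SCORE is assumed to be 60000, and only plain digit strings are
handled.

```c
#include <string.h>

#define MAX_SCORE 60000	/* assumed scale: MAX_SCORE means 1.0 */

/*
 * Parse the digits that follow -M.  Digits are read as a decimal
 * fraction ("1" is 0.1, "75" is 0.75), except that "100" is special
 * cased to mean 1.0.  Returns the scaled score, or -1 on bad input.
 */
int sketch_parse_score(const char *arg)
{
	long long num = 0, scale = 1;

	if (!*arg)
		return -1;
	if (!strcmp(arg, "100"))
		return MAX_SCORE;	/* exact renames only */
	for (; *arg; arg++) {
		if (*arg < '0' || *arg > '9')
			return -1;
		if (scale < 100000) {	/* ignore digits beyond the precision */
			num = num * 10 + (*arg - '0');
			scale *= 10;
		}
	}
	return (int)(MAX_SCORE * num / scale);
}
```

Under these assumptions "1" maps to 6000 (0.1), "50" to 30000 (0.5) and
"100" to 60000 (1.0); without the special case "100" would also come out
as 6000.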

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-11-21 12:21:24 -08:00
Junio C Hamano
3299c6f6a8 diff: make default rename detection limit configurable.
A while ago, a rename-detection limit logic was implemented as a
response to this thread:

	http://marc.theaimsgroup.com/?l=git&m=112413080630175

where gitweb was found to be using a lot of time and memory to
detect renames on huge commits.  The git-diff family takes a -l<num>
flag, and if the number of paths that are rename destination
candidates (i.e. new paths with -M, or modified paths with -C)
is larger than that number, it skips rename/copy detection even
when -M or -C is specified on the command line.

This commit makes the rename detection limit easier to use.  You
can have:

	[diff]
		renamelimit = 30

in your .git/config file to specify the default rename detection
limit.  You can override this from the command line; giving 0
means 'unlimited':

	git diff -M -l0

We might want to change the default behaviour, when you do not
have the configuration, to limit it to say 20 paths or so.  This
would also help the diffstat generation after a big 'git pull'.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-11-15 15:08:27 -08:00
Junio C Hamano
8082d8d305 Diff: -l<num> to limit rename/copy detection.
When many paths are modified, rename detection takes a lot of time.
The new option -l<num> can be used to disable rename detection when
more than <num> paths are possibly created as renames.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-09-24 23:50:44 -07:00
Junio C Hamano
5098bafb75 Plug diff leaks.
It is a bit embarrassing that it took this long for a fix since the
problem was first reported on Aug 13th.

    Message-ID: <87y876gl1r.wl@mail2.atmark-techno.com>
    From: Yasushi SHOJI <yashi@atmark-techno.com>
    Newsgroups: gmane.comp.version-control.git
    Subject: [patch] possible memory leak in diff.c::diff_free_filepair()
    Date: Sat, 13 Aug 2005 19:58:56 +0900

This time I used valgrind to make sure that it does not overeagerly
discard memory that is still being used.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-09-15 16:13:43 -07:00
Junio C Hamano
6bac10d89d Fix copy marking from diffcore-rename.
When (A,B) ==> (B,C) rename-copy was detected, we incorrectly said
that C was created by copying B.  This is because we only check if the
path of rename/copy source still exists in the resulting tree to see
if the file is renamed out of existence.  In this case, the new B is
created by copying or renaming A, so the original B is lost and we
should say C is a rename of B not a copy of B.
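
A rough sketch of the corrected rule, under stated assumptions: pairs is
a list of (src, dst) paths found by the detector, result_paths is the
set of paths in the resulting tree, and the "last user of a vanished
source is the rename" convention is taken from older commits in this
log.  The name label_pairs is hypothetical.

```python
from collections import Counter


def label_pairs(pairs, result_paths):
    """Label each (src, dst) pair as 'rename' or 'copy' (illustrative only)."""
    # Paths in the result that were produced from some other path.
    created_from_other = {dst for _, dst in pairs}
    remaining = Counter(src for src, _ in pairs)
    labels = []
    for src, dst in pairs:
        remaining[src] -= 1
        # The original src survives only if its path is still present
        # AND that path was not taken over by another rename/copy.
        original_survives = src in result_paths and src not in created_from_other
        if original_survives or remaining[src] > 0:
            kind = "copy"
        else:
            kind = "rename"
        labels.append((src, dst, kind))
    return labels
```

For (A,B) ==> (B,C), path B exists in the result but was created from A,
so the pair B -> C comes out as a rename.  Checking only "src in
result_paths" would have called it a copy.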

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-09-10 12:42:32 -07:00
Junio C Hamano
75c660ac93 [PATCH] Use enhanced diff_delta() in the similarity estimator.
The diff_delta() interface was extended to reject generating too big a
delta while we were working on the packed GIT archive format.

Take advantage of that when generating a delta in the similarity
estimator used in diffcore-rename.c.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 17:13:32 -07:00
Linus Torvalds
75c42d8cc3 Add a "max_size" parameter to diff_delta()
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta if it ends up being bigger than some
specified size, and this allows us to stop delta generation early when
that happens.
2005-06-25 19:30:20 -07:00
Junio C Hamano
2210100ac0 [PATCH] Fix rename/copy when dealing with temporarily broken pairs.
When rename/copy uses a file that was broken by diffcore-break
as the source, and the broken filepair gets merged back later,
the output was mislabeled as a rename.  In this case, the source
file ends up staying in the output, so we should label it as a
copy instead.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-12 20:40:19 -07:00
Junio C Hamano
0e3994fa97 [PATCH] diff: Clean up diff_scoreopt_parse().
This cleans up the diff_scoreopt_parse() function that is used to
parse the fractional notation that the -B, -C and -M options take.
The callers are modified to check for errors and complain.  Earlier
they silently ignored malformed input and fell back on the
default.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano
355e76a4a3 [PATCH] Tweak count-delta interface
Make it return copied source and insertion separately, so that
later implementation of heuristics can use them more flexibly.

This does not change the heuristics implemented in
diffcore-rename nor diffcore-break in any way.
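
The shape of such an interface can be sketched as below.  This is not
the count-delta code: Python's difflib stands in for the delta
generator, and the name count_delta and its return convention are
assumptions made for illustration.

```python
from difflib import SequenceMatcher


def count_delta(src: bytes, dst: bytes):
    """Return (copied, inserted): bytes of dst taken from src vs. newly added."""
    copied = 0
    inserted = 0
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            copied += j2 - j1          # material reused from the source
        elif tag in ("insert", "replace"):
            inserted += j2 - j1        # material that exists only in dst
    return copied, inserted
```

Returning the two numbers separately lets a rename heuristic look at how
much of the source survived, and a break heuristic look at how much new
material was added, without recomputing the delta.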

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-03 11:23:03 -07:00
Junio C Hamano
f345b0a066 [PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.

When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation.  This
makes it easier to review such a complete rewrite patch.

The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.

As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used.  For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:

    The original change looks like this:

	file0     --> file0' (quite different from file0)
	file1     --> /dev/null

    After diffcore-break runs, it would become this:

	file0     --> /dev/null
	/dev/null --> file0'
	file1     --> /dev/null

    Then diffcore-rename matches them up:

	file1     --> file0'
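
    The two stages above can be sketched as follows.  This is a toy
    model, not the diffcore code: filepairs are (src_path, src_text,
    dst_path, dst_text) tuples with None standing for /dev/null,
    difflib is a stand-in similarity estimator, and the function names
    and 0.5 thresholds are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Line-based similarity in [0, 1]; difflib is only a stand-in."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines(),
                           autojunk=False).ratio()


def break_rewrites(pairs, max_similarity=0.5):
    """Split in-place modifications that look like complete rewrites."""
    out = []
    for src_path, src, dst_path, dst in pairs:
        is_modify = src is not None and dst is not None and src_path == dst_path
        if is_modify and similarity(src, dst) < max_similarity:
            out.append((src_path, src, None, None))   # deletion half
            out.append((None, None, dst_path, dst))   # creation half
        else:
            out.append((src_path, src, dst_path, dst))
    return out


def match_renames(pairs, min_similarity=0.5):
    """Pair each creation with the most similar deletion, if similar enough."""
    deletions = [p for p in pairs if p[1] is not None and p[3] is None]
    out = []
    used = set()
    for p in pairs:
        if p[1] is None and p[3] is not None:          # a creation
            best = max(deletions, key=lambda d: similarity(d[1], p[3]),
                       default=None)
            if best is not None and similarity(best[1], p[3]) >= min_similarity:
                out.append((best[0], best[1], p[2], p[3]))
                used.add(best[0])
                continue
        out.append(p)
    # Drop the deletions that were consumed as rename sources.
    return [p for p in out if not (p[3] is None and p[0] in used)]
```

    Feeding it a file0 that is rewritten to resemble a deleted file1
    yields a file1 --> file0 pair.  The sketch leaves the deletion half
    of file0 in the list; what happens to unused halves is not modelled
    here.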

The internal score values are finer grained now.  The earlier
maximum of 10000 has been raised to 60000; there are no
user-visible changes, but there is no reason to waste available bits.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 10:35:49 -07:00
Junio C Hamano
2cd68882ee [PATCH] diff: fix the culling of unneeded delete record.
The commit 15d061b435

    [PATCH] Fix the way diffcore-rename records unremoved source.

still leaves unneeded delete records in its output stream by
mistake, which was covered up by having an extra check to turn
such a delete into a no-op downstream.  Fix the check in the
diffcore-rename to simplify the output routine.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 10:35:49 -07:00
Junio C Hamano
01c4e70f63 [PATCH] diff: code clean-up and removal of rename hack.
A new macro, DIFF_PAIR_RENAME(), is introduced to distinguish a
filepair that is a rename/copy (the definition of which is src
and dst are different paths, of course).  This removes the hack
used in the record_rename_pair() to always put a non-zero value
in the score field.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 10:35:49 -07:00
Junio C Hamano
f0c6b2a2fd [PATCH] Optimize diff-tree -[CM] --stdin
This attempts to optimize "diff-tree -[CM] --stdin", which
compares successive tree pairs.  This optimization does not
make much sense for other commands in the diff-* brothers.

When reading from --stdin and using rename/copy detection, the
patch makes diff-tree read the current index file first.
This is done to reuse the optimization used by diff-cache in the
non-cached case.  The similarity estimator can avoid expanding a
blob if the index says that the work tree already has an exact
copy of that blob expanded.

Another optimization the patch makes is to check only file sizes
first to terminate similarity estimation early.  In order for
this to work, it needs a way to tell the size of the blob
without expanding it.  Since an obvious way of doing it, which
is to keep all the blobs previously used in the memory, is too
costly, it does so by keeping the filesize for each object it
has already seen in memory.
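
The size-first check and the per-object size cache might look roughly
like this.  All names are hypothetical, and "smaller size divided by
larger size" as the best achievable similarity is an assumption chosen
for the sketch, not the formula used by the patch.

```python
_size_cache = {}  # object id -> size in bytes; cheap compared to keeping blobs


def blob_size(object_id, read_size):
    """Return a blob's size, calling read_size() only the first time."""
    if object_id not in _size_cache:
        _size_cache[object_id] = read_size(object_id)
    return _size_cache[object_id]


def sizes_rule_out_match(src_size, dst_size, min_similarity):
    """True when the size difference alone makes the pair too dissimilar."""
    larger = max(src_size, dst_size)
    if larger == 0:
        return False  # two empty blobs are trivially identical
    # Even if every byte of the smaller file is shared, similarity
    # cannot exceed smaller / larger under this assumed measure.
    best_possible = min(src_size, dst_size) / larger
    return best_possible < min_similarity
```

A caller would consult sizes_rule_out_match() before expanding either
blob, and run the expensive comparison only when it returns False.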

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-29 11:17:44 -07:00
Junio C Hamano
15d061b435 [PATCH] Fix the way diffcore-rename records unremoved source.
Earlier versions of diffcore-rename used to keep unmodified
filepairs in their output so that the last stage of the processing,
which tells renames from copies, could mark all of the rename/copy
pairs as copies.  However, this had a bad interaction with other
diffcore filters that wanted to run after diffcore-rename, in that
such unmodified filepairs must be retained for proper distinction
between renames and copies to happen.

This patch fixes the problem by changing the way diffcore-rename
records the information needed to distinguish "all are copies"
case and "the last one is a rename" case.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-29 11:17:43 -07:00
Junio C Hamano
1a0756ffe4 [PATCH] Remove unused rank field from diff_score structure.
This removes a field that is no longer used from the diff_score
structure.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-29 11:17:43 -07:00
Junio C Hamano
226406f693 [PATCH] Introduce diff_free_filepair() function.
This introduces a new function to free a common data structure,
and plugs some leaks.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-29 11:17:43 -07:00
Junio C Hamano
a00d7d106a [PATCH] Fix math thinko in similarity estimator.
The math to reject delta that is too big was confused.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-29 11:17:43 -07:00
Junio C Hamano
8597697458 [PATCH] Update rename/copy similarity estimator.
The second round similarity estimator simply used the size of
the xdelta itself to estimate the extent of damage.  This patch
keeps that logic to detect big insertions to terminate the check
early, but otherwise looks at the generated delta in order to
estimate the extent of edit more accurately.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-24 17:47:05 -07:00
Junio C Hamano
25d5ea410f [PATCH] Redo rename/copy detection logic.
The earlier implementation had a major screw-up in the memory
management area.  Rename/copy logic sometimes borrowed a pointer
to a structure without any provision for downstream to determine
which pointer is shared and which is not.  This caused the
later clean-up code to sometimes double-free such a structure,
resulting in a segfault.  This made -M and -C useless.

Another problem the earlier implementation had was that it
reordered the patches, and forced the logic to differentiate
renames and copies to depend on that particular order.  This
problem was fixed by teaching rename/copy detection logic not to
do any reordering, and rename-copy differentiator not to depend
on the order of the patches.  The diffs will leave rename/copy
detector in the same destination path order as the patch that
was fed into it.  Some test vectors have been reordered to
accommodate this change.

It also adds a sanity check logic to the human-readable diff-raw
output to detect paths with embedded TAB and LF characters,
which cannot be expressed with that format.  This idea came up
during a discussion with Chris Wedgwood.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-24 01:26:26 -07:00
Junio C Hamano
bceafe752c [PATCH] Fix diff-pruning logic which was running prune too early.
For later stages to reorder patches, pruning logic and rename detection
logic should not decide which delete to discard (because another entry
said it will take over the file as a rename) until the very end.

Also fix some tests that were assuming the earlier "last one is rename
or keep, everything else is copy" semantics of diff-raw format, which is
no longer true.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-23 19:17:06 -07:00
Junio C Hamano
f7c1512af8 [PATCH] Rename/copy detection fix.
The rename/copy detection logic in the earlier round was only good
enough to show patch output, and discussion on the mailing list
about the diff-raw format updates revealed many problems with
it.  This patch fixes all the ones known to me, without making
things I want to do later impossible, mostly related to patch
reordering.

 (1) Earlier rename/copy detector determined which one is rename
     and which one is copy too early, which made it impossible
     to later introduce diffcore transformers to reorder
     patches.  This patch fixes it by moving that logic to the
     very end of the processing.

 (2) Earlier output routine diff_flush() was pruning all the
     "no-change" entries indiscriminately.  This was done due
     to my false assumption that one of the requirements in the
     diff-raw output was not to show such an entry (which
     resulted in my incorrect comment about "diff-helper never
     being able to be equivalent to built-in diff driver").  My
     special thanks go to Linus for correcting me about this.
     When we produce diff-raw output, for the downstream to be
     able to tell renames from copies, sometimes it _is_
     necessary to output "no-change" entries, and this patch
     adds diffcore_prune() function for doing it.

 (3) Earlier diff_filepair structure was trying to be not too
     specific about rename/copy operations, but the purpose of
     the structure was to record one or two paths, which _was_
     indeed about rename/copy.  This patch discards xfrm_msg
     field which was trying to be generic for this wrong reason,
     and introduces a couple of fields (rename_score and
     rename_rank) that are explicitly specific to rename/copy
     logic.  One thing to note is that the information in a
     single diff_filepair structure _still_ does not distinguish
     renames from copies, and it is deliberately so.  This is to
     allow patches to be reordered in later stages.

 (4) This patch also adds some tests about diff-raw format
     output and makes sure that necessary "no-change" entries
     appear on the output.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-23 11:49:30 -07:00
Junio C Hamano
60896c7bfe [PATCH] Be careful with symlinks when detecting renames and copies.
The earlier round was not treating symbolic links carefully enough,
and would have produced diff output that renamed/copied then
edited the contents of a symbolic link, which made no practical
sense.  Change it to detect only pure renames.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-23 11:49:30 -07:00
Junio C Hamano
6b14d7faf0 [PATCH] Diffcore updates.
This moves the path selection logic from individual programs to a new
diffcore transformer (diff-tree still needs to have its own for
performance reasons).  Also the header printing code in diff-tree was
tweaked not to produce anything when pickaxe is in effect and there is
nothing interesting to report.  An interesting example is the following
in the GIT archive itself:

    $ git-whatchanged -p -C -S'or something in a real script'

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-22 10:17:50 -07:00
Junio C Hamano
26dee0adfc [PATCH] Add the code to set default minimum score back in.
When the minimum score is specified as 0 (meaning "use default
value"), set it to the default as we are told.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-22 09:46:06 -07:00
Junio C Hamano
cd1870edb6 [PATCH] Fix tweak in similarity estimator.
There was a screwy math bug in the estimator that confused what
-C1 meant and what -C9 meant, only in one of the early "cheap"
checks, which resulted in quite confusing behaviour.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-22 09:38:26 -07:00
Junio C Hamano
81e50eabf0 [PATCH] The diff-raw format updates.
Update the diff-raw format as Linus and I discussed, except that
it does not use sequence of underscore '_' letters to express
nonexistence.  All '0' mode is used for that purpose instead.

The new diff-raw format can express rename/copy, and the earlier
restriction that -M and -C _must_ be used with the patch format
output is no longer necessary.  The patch makes -M and -C flags
independent of -p flag, so you need to say git-whatchanged -M -p
to get the diff/patch format.

Both the documentation and the tests are updated.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 22:49:19 -07:00
Junio C Hamano
38c6f78059 [PATCH] Prepare diffcore interface for diff-tree header suppression.
This does not actually suppress the extra headers when pickaxe is
used, but prepares enough support for diff-tree to implement it.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 22:49:19 -07:00
Junio C Hamano
58b103f55d [PATCH] Tweak diffcore-rename heuristics.
The heuristic so far was to compare the file size change and the
xdelta size against the average of the file sizes before and after
the change.  This patch uses the smaller of the pre- and post-change
file sizes instead.

It also makes a very small performance fix.  I didn't measure
it; I do not expect it to make any practical difference, but
while scanning an already sorted list, breaking out in the
middle is the right thing.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 16:22:57 -07:00
Junio C Hamano
52e9578985 [PATCH] Introducing software archaeologist's tool "pickaxe".
This steals the "pickaxe" feature from JIT and makes it available
to the bare Plumbing layer.  From the command line, the user
gives a string he is interested in.

Using the diff-core infrastructure previously introduced, it
filters the differences to limit the output only to the diffs
between <src> and <dst> where the string appears only in one but
not in the other.  For example:

 $ ./git-rev-list HEAD | ./git-diff-tree -Sdiff-tree-helper --stdin -M

would show the diffs that touch the string "diff-tree-helper".
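
The filtering rule can be sketched in a few lines.  This is an
illustration only: filepairs are modelled as (src_text, dst_text)
tuples with None for a missing side, and the function name pickaxe is
just borrowed from the feature name.

```python
def pickaxe(filepairs, needle):
    """Keep only the pairs where needle occurs on exactly one side."""
    kept = []
    for src, dst in filepairs:
        in_src = src is not None and needle in src
        in_dst = dst is not None and needle in dst
        if in_src != in_dst:          # present in one side but not the other
            kept.append((src, dst))
    return kept
```

A pair that mentions the string on both sides, or on neither, is
dropped, so only the changes that introduce or remove the string
remain.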

In a real software-archaeologist application, you would typically
look for a few to several lines of code and see where that code
came from.

The "pickaxe" module runs after "rename/copy detection" module,
so it even crosses the file rename boundary, as the above
example demonstrates.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 09:58:03 -07:00
Junio C Hamano
427dcb4bca [PATCH] Diff overhaul, adding half of copy detection.
This introduces the diff-core, the layer between the diff-tree
family and the external diff interface engine.  The interface
calls that the diff-tree family uses (diff_change and diff_addremove)
have not changed and will not change.  The purpose of the
diff-core layer is to provide an infrastructure to transform the
set of differences sent from the applications, before sending
them to the external diff interface.

The recently introduced rename detection code has been rewritten
to use the diff-core facility.  When applications send in
separate creates and deletes, matching ones are transformed into
a single rename-and-edit diff, and sent out to the external diff
interface as such.

This patch also enhances the rename detection code further to be
able to detect copies.  Currently this happens only as long as
copy sources appear as part of the modified files, but there
already is enough provision for callers to report unmodified
files to diff-core, so that they can be also used as copy source
candidates.  Extending the callers this way will be done in a
separate patch.

Please see and marvel at how well this works by trying out the
newly added t/t4003-diff-rename-1.sh test script.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 09:58:03 -07:00