Ok, so on the kernel list, some people noticed that "git log --follow"
doesn't work too well with some files in the x86 merge, because a lot of
files got renamed in very special ways.
In particular, there was a pattern of doing single commits with renames
that looked basically like
- rename "filename.h" -> "filename_64.h"
- create new "filename.c" that includes "filename_32.h" or
"filename_64.h" depending on whether we're 32-bit or 64-bit.
which was preparatory for smushing the two trees together.
Now, there's two issues here:
- "filename.c" *remained*. Yes, it was a rename, but there was a new file
created with the old name in the same commit. This was important,
because we wanted each commit to compile properly, so that it was
bisectable, so splitting the rename into one commit and the "create
helper file" into another was *not* an option.
So we need to break associations where the contents change too much.
Fine. We have the -B flag for that. When we break things up, then the
rename detection will be able to figure out whether there are better
alternatives.
- "git log --follow" didn't with with -B.
Now, the second case was really simple: we use a different "diffopt"
structure for the rename detection than the basic one (which we use for
showing the diffs). So that second case is trivially fixed by a trivial
one-liner that just copies the break_opt values from the "real" diffopts
to the one used for rename following. So now "git log -B --follow" works
fine:
diff --git a/tree-diff.c b/tree-diff.c
index 26bdbdd..7c261fd 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
diff_opts.detect_rename = DIFF_DETECT_RENAME;
diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
diff_opts.single_follow = opt->paths[0];
+ diff_opts.break_opt = opt->break_opt;
paths[0] = NULL;
diff_tree_setup_paths(paths, &diff_opts);
if (diff_setup_done(&diff_opts) < 0)
however, the end result does *not* work. Because our diffcore-break.c
logic is totally bogus!
In particular:
- it used to do
if (base_size < MINIMUM_BREAK_SIZE)
return 0; /* we do not break too small filepair */
which basically says "don't bother to break small files". But that
"base_size" is the *smaller* of the two sizes, which means that if some
large file was rewritten into one that just includes another file, we
would look at the (small) result, and decide that it's smaller than the
break size, so it cannot be worth it to break it up! Even if the other
side was ten times bigger and looked *nothing* like the samell file!
That's clearly bogus. I replaced "base_size" with "max_size", so that
we compare the *bigger* of the filepair with the break size.
- It calculated a "merge_score", which was the score needed to merge it
back together if nothing else wanted it. But even if it was *so*
different that we would never want to merge it back, we wouldn't
consider it a break! That makes no sense. So I added
if (*merge_score_p > break_score)
return 1;
to make it clear that if we wouldn't want to merge it at the end, it
was *definitely* a break.
- It compared the whole "extent of damage", counting all inserts and
deletes, but it based this score on the "base_size", and generated the
damage score with
delta_size = src_removed + literal_added;
damage_score = delta_size * MAX_SCORE / base_size;
but that makes no sense either, since quite often, this will result in
a number that is *bigger* than MAX_SCORE! Why? Because base_size is
(again) the smaller of the two files we compare, and when you start out
from a small file and add a lot (or start out from a large file and
remove a lot), the base_size is going to be much smaller than the
damage!
Again, the fix was to replace "base_size" with "max_size", at which
point the damage actually becomes a sane percentage of the whole.
With these changes in place, not only does "git log -B --follow" work for
the case that triggered this in the first place, ie now
git log -B --follow arch/x86/kernel/vmlinux_64.lds.S
actually gives reasonable results. But I also wanted to verify it in
general, by doing a full-history
git log --stat -B -C
on my kernel tree with the old code and the new code.
There's some tweaking to be done, but generally, the new code generates
much better results wrt breaking up files (and then finding better rename
candidates). Here's a few examples of the "--stat" output:
- This:
include/asm-x86/Kbuild | 2 -
include/asm-x86/debugreg.h | 79 +++++++++++++++++++++++++++++++++++------
include/asm-x86/debugreg_32.h | 64 ---------------------------------
include/asm-x86/debugreg_64.h | 65 ---------------------------------
4 files changed, 68 insertions(+), 142 deletions(-)
Becomes:
include/asm-x86/Kbuild | 2 -
include/asm-x86/{debugreg_64.h => debugreg.h} | 9 +++-
include/asm-x86/debugreg_32.h | 64 -------------------------
3 files changed, 7 insertions(+), 68 deletions(-)
- This:
include/asm-x86/bug.h | 41 +++++++++++++++++++++++++++++++++++++++--
include/asm-x86/bug_32.h | 37 -------------------------------------
include/asm-x86/bug_64.h | 34 ----------------------------------
3 files changed, 39 insertions(+), 73 deletions(-)
Becomes
include/asm-x86/{bug_64.h => bug.h} | 20 +++++++++++++-----
include/asm-x86/bug_32.h | 37 -----------------------------------
2 files changed, 14 insertions(+), 43 deletions(-)
Now, in some other cases, it does actually turn a rename into a real
"delete+create" pair, and then the diff is usually bigger, so truth in
advertizing: it doesn't always generate a nicer diff. But for what -B was
meant for, I think this is a big improvement, and I suspect those cases
where it generates a bigger diff are tweakable.
So I think this diff fixes a real bug, but we might still want to tweak
the default values and perhaps the exact rules for when a break happens.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This fixes "git log --follow" to hopefully not leak memory any more, and
also cleans it up a bit to look more like some of the other functions that
use "diff_queued_diff" (by *not* using it directly as a global in the
code, but by instead just taking a pointer to the diff queue and using
that).
As to "diff_queued_diff", I think it would be better off not as a global
at all, but as being just an entry in the "struct diff_options" structure,
but that's a separate issue, and there may be some subtle reason for why
it's currently a global.
Anyway, no real changes. Instead of having a magical first entry in the
diff-queue, we now end up just keeping the diff-queue clean, and keeping
our "preferred" file pairing in an internal "choice" variable. That makes
it easy to switch the choice around when we find a better one.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ok, I've really held off doing this too damn long, because I'm lazy, and I
was always hoping that somebody else would do it.
But no, people keep asking for it, but nobody actually did anything, so I
decided I might as well bite the bullet, and instead of telling people
they could add a "--follow" flag to "git log" to do what they want to do,
I decided that it looks like I just have to do it for them..
The code wasn't actually that complicated, in that the diffstat for this
patch literally says "70 insertions(+), 1 deletions(-)", but I will have
to admit that in order to get to this fairly simple patch, you did have to
know and understand the internal git diff generation machinery pretty
well, and had to really be able to follow how commit generation interacts
with generating patches and generating the log.
So I suspect that while I was right that it wasn't that hard, I might have
been expecting too much of random people - this patch does seem to be
firmly in the core "Linus or Junio" territory.
To make a long story short: I'm sorry for it taking so long until I just
did it.
I'm not going to guarantee that this works for everybody, but you really
can just look at the patch, and after the appropriate appreciative noises
("Ooh, aah") over how clever I am, you can then just notice that the code
itself isn't really that complicated.
All the real new code is in the new "try_to_follow_renames()" function. It
really isn't rocket science: we notice that the pathname we were looking
at went away, so we start a full tree diff and try to see if we can
instead make that pathname be a rename or a copy from some other previous
pathname. And if we can, we just continue, except we show *that*
particular diff, and ever after we use the _previous_ pathname.
One thing to look out for: the "rename detection" is considered to be a
singular event in the _linear_ "git log" output! That's what people want
to do, but I just wanted to point out that this patch is *not* carrying
around a "commit,pathname" kind of pair and it's *not* going to be able to
notice the file coming from multiple *different* files in earlier history.
IOW, if you use "git log --follow", then you get the stupid CVS/SVN kind
of "files have single identities" kind of semantics, and git log will just
pick the identity based on the normal move/copy heuristics _as_if_ the
history could be linearized.
Put another way: I think the model is broken, but given the broken model,
I think this patch does just about as well as you can do. If you have
merges with the same "file" having different filenames over the two
branches, git will just end up picking _one_ of the pathnames at the point
where the newer one goes away. It never looks at multiple pathnames in
parallel.
And if you understood all that, you probably didn't need it explained, and
if you didn't understand the above blathering, it doesn't really mtter to
you. What matters to you is that you can now do
git log -p --follow builtin-rev-list.c
and it will find the point where the old "rev-list.c" got renamed to
"builtin-rev-list.c" and show it as such.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In addition to optimizing pathspecs that would never match,
which was done earlier, this optimizes pathspecs that would
always match (e.g. "arch/" while the traversal is already in
"arch/i386/" hierarchy).
This patch makes the worst case slightly more palatable, while
improving average case.
Signed-off-by: Junio C Hamano <junkio@cox.net>
If we already know that some of the pathspecs can match later
entries in the tree we are looking at, we do not have to do more
expensive strncmp() upfront before comparing the length of the
match pattern and the path, as a path longer than the match
pattern will not match it, and a path shorter than the match
pattern will match only if the path is a directory-component
wise prefix of the match pattern.
Signed-off-by: Junio C Hamano <junkio@cox.net>
When we are looking at a tree entry with pathspecs, if all the
pathspecs sort strictly earlier than the entry we are currently
looking at, there is no way later entries in the same tree would
match our pathspecs, because the entries are sorted.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This removes slightly more lines than it adds, but the real reason for
doing this is that future optimizations will require more setup of the
tree descriptor, and so we want to do it in one place.
Also renamed the "desc.buf" field to "desc.buffer" just to trigger
compiler errors for old-style manual initializations, making sure I
didn't miss anything.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is mainly just a cleanup patch, and sets up for later changes where
the tree-diff.c "interesting()" function can return more than just a
yes/no value.
In particular, it should be quite possible to say "no subsequent entries
in this tree can possibly be interesting any more", and thus allow the
callers to short-circuit the tree entirely.
In fact, changing the callers to do so is trivial, and is really all this
patch really does, because changing "interesting()" itself to say that
nothing further is going to be interesting is definitely more complicated,
considering that we may have arbitrary pathspecs.
But in cleaning up the callers, this actually fixes a potential small
performance issue in diff_tree(): if the second tree has a lot of
uninterestign crud in it, we would keep on doing the "is it interesting?"
check on the first tree for each uninteresting entry in the second one.
The answer is obviously not going to change, so that was just not helping.
The new code is clearer and simpler and avoids this issue entirely.
I also renamed "interesting()" to "tree_entry_interesting()", because I
got frustrated by the fact that
- we actually had *another* function called "interesting()" in another
file, and I couldn't tell from the profiles which one was the one that
mattered more.
- when rewriting it to return a ternary value, you can't just do
if (interesting(...))
...
any more, but want to assign the return value to a local variable. The
name of choice for that variable would normally be "interesting", so
I just wanted to make the function name be more specific, and avoid
that whole issue (even though I then didn't choose that name for either
of the users, just to avoid confusion in the patch itself ;)
In other words, this doesn't really change anything, but I think it's a
good thing to do, and if somebody comes along and writes the logic for
"yeah, none of the pathspecs you have are interesting", we now support
that trivially.
It could easily be a meaningful optimization for things like "blame",
where there's just one pathspec, and stopping when you've seen it would
allow you to avoid about 50% of the tree traversals on average.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
* ar/diff:
Add tests for --quiet option of diff programs
try-to-simplify-commit: use diff-tree --quiet machinery.
revision.c: explain what tree_difference does
Teach --quiet to diff backends.
diff --quiet
Remove unused diffcore_std_no_resolve
Allow git-diff exit with codes similar to diff(1)
This is a micro-optimization that grew out of the mailing list discussion
about "strlen()" showing up in profiles.
We used to pass regular C strings around to the low-level tree walking
routines, and while this worked fine, it meant that we needed to call
strlen() on strings that the caller always actually knew the size of
anyway.
So pass the length of the string down wih the string, and avoid
unnecessary calls to strlen(). Also, when extracting a pathname from a
tree entry, use "tree_entry_len()" instead of strlen(), since the length
of the pathname is directly calculable from the decoded tree entry itself
without having to actually do another strlen().
This shaves off another ~5-10% from some loads that are very tree
intensive (notably doing commit filtering by a pathspec).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>"
Signed-off-by: Junio C Hamano <junkio@cox.net>
This teaches git-diff-files, git-diff-index and git-diff-tree
backends to exit early under --quiet option.
Signed-off-by: Junio C Hamano <junkio@cox.net>
We currently have two parallel notation for dealing with object types
in the code: a string and a numerical value. One of them is obviously
redundent, and the most used one requires more stack space and a bunch
of strcmp() all over the place.
This is an initial step for the removal of the version using a char array
found in object reading code paths. The patch is unfortunately large but
there is no sane way to split it in smaller parts without breaking the
system.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This patch on top of 'next' makes built-in git-cherry handle root
commits.
It moves the static function log-tree.c::diff_root_tree() to
tree-diff.c and makes it more similar to diff_tree_sha1() by
shuffling around arguments and factoring out the call to
log_tree_diff_flush(). Consequently the name is changed to
diff_root_tree_sha1(). It is a version of diff_tree_sha1() that
compares the empty tree (= root tree) against a single 'real' tree.
This function is then used in get_patch_id() to compute patch IDs
for initial commits instead of SEGFAULTing, as the current code
does if confronted with parentless commits.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Introduces global inline:
hashcmp(const unsigned char *sha1, const unsigned char *sha2)
Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).
Acked-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The way tree-diff was set up assumed we would use only one set
of pathspec during the entire life of the program. Move the
pathspec related static variables out to diff_options structure
so that we can filter commits with one set of paths while show
the actual diffs using different set of paths.
I suspect this breaks blame.c, and makes "git log paths..." to
default to the --full-diff, the latter of which is dealt with
the next commit.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This replaces occurences of "blob", "commit", "tag", and "tree",
where they're really used as type specifiers, which we already
have defined global constants for.
Signed-off-by: Peter Eriksen <s022018@student.dtu.dk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Introduce tree-walk.[ch] and move "struct tree_desc" and
associated functions from various places.
Rename DIFF_FILE_CANON_MODE(mode) macro to canon_mode(mode) and
move it to cache.h. This macro returns the canonicalized
st_mode value in the host byte order for files, symlinks and
directories -- to be compared with a tree_desc entry.
create_ce_mode(mode) in cache.h is similar but is intended to be
used for index entries (so it does not work for directories) and
returns the value in the network byte order.
Signed-off-by: Junio C Hamano <junkio@cox.net>
We have operations to "extract" and "update" a "struct tree_desc", but we
only used them in tree-diff.c and they were static to that file.
But other tree traversal functions can use them to their advantage
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Avoid asking for zero bytes when that change simplifies overall
logic. Later we would change the wrapper to ask for 1 byte on
platforms that return NULL for zero byte request.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This makes the tree diff functionality independent of the "git-diff-tree"
program, by splitting the core functionality up into a library file.
This will be needed for when we teach git-rev-list to only follow a
specified set of pathnames, rather than the global revision history.
Most of it is a fairly straightforward code move, but it also involves
some calling convention cleanup, and moving some of the static variables
from diff-tree.c into the options structure.
The actual tree change callback routines also become paramterized by the
diff_options structure, allowing the library functionality to do something
else than just show the diff on stdout.
Right now the only user of this functionality remains git-diff-tree
itself.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>