Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
    [torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
    400006
    real 0m2.071s
    user 0m1.520s
    sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
    diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
    similarity index 100%
    rename from really-big-dir/file-1-1-1-1-1
    rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines come from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
    [torvalds@woody big-rename]$ time git show -C | wc -l
    1400006
    real 0m12.337s
    user 0m12.285s
    sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collisions, which we just cull aggressively).
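The hashing idea above can be sketched roughly as follows. This is only an
illustration and not the code from this commit: struct file_entry,
bucket_of() and find_exact_renames() are hypothetical names, the content id
is assumed to be a 20-byte hash, and the table has a fixed size for brevity
(a real table would grow with the number of entries to keep lookups cheap).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a file: a path plus a content id. */
struct file_entry {
	const char *path;
	unsigned char id[20];     /* content hash of the file */
	struct file_entry *next;  /* chain of sources sharing a bucket */
	int used;                 /* already paired with a destination? */
};

#define NBUCKETS 1024 /* power of two; fixed size only to keep the sketch short */

static unsigned int bucket_of(const unsigned char *id)
{
	unsigned int h;

	memcpy(&h, id, sizeof(h)); /* the id is already a hash, reuse its bits */
	return h & (NBUCKETS - 1);
}

/*
 * For each destination, store in match[] the index of an unused source
 * with identical content, or -1 if there is none.  Returns the number
 * of exact renames found, or -1 on allocation failure.
 */
static int find_exact_renames(struct file_entry *src, int nsrc,
			      const struct file_entry *dst, int ndst,
			      int *match)
{
	struct file_entry **table;
	int i, found = 0;

	assert(nsrc >= 0 && ndst >= 0);
	table = calloc(NBUCKETS, sizeof(*table));
	if (!table)
		return -1;

	/* Pass 1: hash every source into the table. */
	for (i = 0; i < nsrc; i++) {
		unsigned int b = bucket_of(src[i].id);

		src[i].used = 0;
		src[i].next = table[b];
		table[b] = &src[i];
	}

	/* Pass 2: each destination only looks at sources in its own bucket. */
	for (i = 0; i < ndst; i++) {
		struct file_entry *e;

		match[i] = -1;
		for (e = table[bucket_of(dst[i].id)]; e; e = e->next) {
			if (!e->used && !memcmp(e->id, dst[i].id, 20)) {
				e->used = 1;
				match[i] = (int)(e - src);
				found++;
				break;
			}
		}
	}
	free(table);
	return found;
}
```

Only entries that land in the same bucket are ever compared against each
other, which is what replaces the full m*n pairwise loop.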
Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case nonetheless. A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.
In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too. In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).
That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).
Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).
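The difference between the two counters can be sketched like this. It is a
minimal illustration only: struct filespec and the helper names below are
hypothetical and are not the actual diff_filespec definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared file description. */
struct filespec {
	int refcount;     /* tied to allocation: free the memory at zero */
	int rename_count; /* tied to renames: users of this path still left */
};

static struct filespec *filespec_new(void)
{
	struct filespec *s = calloc(1, sizeof(*s));

	if (s)
		s->refcount = 1;
	return s;
}

/* Share the filespec instead of copying it. */
static struct filespec *filespec_get(struct filespec *s)
{
	s->refcount++;
	return s;
}

/* Drop one reference; returns 1 if the memory was released. */
static int filespec_put(struct filespec *s)
{
	assert(s->refcount > 0);
	if (--s->refcount)
		return 0;
	free(s);
	return 1;
}

/*
 * Called when a rename using this source is found.  Returns 1 when no
 * other user of the source remains (a pure rename), 0 when the source
 * is still used elsewhere (so this one is a copy).
 */
static int use_as_rename_source(struct filespec *s)
{
	assert(s->rename_count > 0);
	return --s->rename_count == 0;
}
```

A single counter test replaces the walk over the whole pathname space, and
the two counters can reach zero at entirely different times.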
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rather than copy the filespecs when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.
This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection. In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This makes the exact content match a separate function of its own.
Partly to cut down a bit on the size of the diffcore_rename() function
(which is too complex as it is), and partly because there are smarter
ways to do this than an O(m*n) loop over it all, and that function
should be rewritten to take that into account.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The diffcore.h header file is included by more than just the internal
diff generation files, and needs to be part of the proper dependencies.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using konsole, I get no colored output at the end of "t7005-editor.sh"
without this patch.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code generating perl/Makefile from Makefile.PL was causing trouble
because it didn't consider NO_PERL_MAKEMAKER and ran makemaker
unconditionally, rewriting perl.mak. Makemaker is FUBAR in ActiveState Perl,
and perl/Makefile has a replacement for it.
Besides, a changed Git.pm is *NOT* a reason to rebuild all the perl scripts,
so remove the dependency too.
Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Without this strbuf_detach(), we get a double free later: the
command is in fact stashed, so detaching it here is not a memory leak.
Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This shuts down the "* ok ##: `test description`" messages.
Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* db/fetch-pack: (60 commits)
Define compat version of mkdtemp for systems lacking it
Avoid scary errors about tagged trees/blobs during git-fetch
fetch: if not fetching from default remote, ignore default merge
Support 'push --dry-run' for http transport
Support 'push --dry-run' for rsync transport
Fix 'push --all branch...' error handling
Fix compilation when NO_CURL is defined
Added a test for fetching remote tags when there is not tags.
Fix a crash in ls-remote when refspec expands into nothing
Remove duplicate ref matches in fetch
Restore default verbosity for http fetches.
fetch/push: readd rsync support
Introduce remove_dir_recursively()
bundle transport: fix an alloc_ref() call
Allow abbreviations in the first refspec to be merged
Prevent send-pack from segfaulting when a branch doesn't match
Cleanup unnecessary break in remote.c
Cleanup style nit of 'x == NULL' in remote.c
Fix memory leaks when disconnecting transport instances
Ensure builtin-fetch honors {fetch,transfer}.unpackLimit
...
Some projects prefer to receive patches via a given email address.
In these cases, it's handy to configure that address once.
Signed-off-by: Miklos Vajna <vmiklos@frugalware.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
martin f krafft <madduck@madduck.net> writes:
> piper:~> git remote show origin
> * remote origin
> URL: ssh://git.madduck.net/~/git/etc/mailplate.git
> Use of uninitialized value in string ne at /usr/local/stow/git/bin/git-remote line 248.
This is because there might not be branch.<name>.remote defined but
the code unconditionally dereferences $branch->{$name}{'REMOTE'} and
compares with another string.
Tested-by: Martin F Krafft <madduck@madduck.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
First, paths ending in a slash were not matching anything. This fixes
path_filter to handle paths ending in a slash (such entries have to
match a directory, and can't match a file, e.g., foo/bar/ can't match
a plain file called foo/bar).
Secondly, clicking in the file list pane (bottom right) was broken
because $treediffs($ids) contained all the files modified by the
commit, not just those within the file list. This fixes that too.
Signed-off-by: Paul Mackerras <paulus@samba.org>
First, we weren't putting "--" between the ids and the paths in the
git diff-tree/diff-index/diff-files command, so if there was a tag
and a file with the same name, we could get an ambiguity in the
command. This puts the "--" in to make it clear that the paths are
paths.
Secondly, this implements the path limiting for merge diffs as well
as the normal 2-way diffs.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This sets the status window when reading commits, searching through
commits, cherry-picking or checking out a head.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This makes the reset function use a progress bar in the same location
as the progress bars for reading in commits and for finding commits,
instead of a progress bar in a separate detached window. The progress
bar for resetting is red.
This also puts "Resetting" in the status window while the reset is in
progress. The setting of the status window is done through an
extension of the interface used for setting the watch cursor.
Signed-off-by: Paul Mackerras <paulus@samba.org>
We weren't restoring the tabstop setting if the user pressed the
Cancel button in the Edit/Preferences window. Also improved the
label for the checkbox (made it "Tab spacing" rather than the laconic
"tabstop") and moved it above the "Display nearby tags" checkbox.
Signed-off-by: Paul Mackerras <paulus@samba.org>
When the user has specified a list of paths, either on the command line
or when creating a view, gitk currently displays the diffs for all files
that a commit has modified, not just the ones that match the path list.
This is different from other git commands such as git log. This change
makes gitk behave the same as these other git commands by default, that
is, gitk only displays the diffs for files that match the path list.
There is now a checkbox labelled "Limit diffs to listed paths" in the
Edit/Preferences pane. If that is unchecked, gitk will display the
diffs for all files as before.
When gitk is run with the --merge flag, it will get the list of unmerged
files at startup, intersect that with the paths listed on the command line
(if any), and use that as the list of paths.
Signed-off-by: Paul Mackerras <paulus@samba.org>
- Remove out call to list_common_cmds_help()
- Send error message to stderr, not stdout.
Signed-off-by: Jari Aalto <jari.aalto@cante.net>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Reword the first sentence of the description of -x, in order to
make it easier to read and understand.
Signed-off-by: Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix size_t vs. unsigned long pointer mismatch warnings introduced
with the addition of strbuf_detach().
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Elsewhere in Git we already use PRIuMAX and cast to uintmax_t when
we need to display a value that is 'very big' and we're not exactly
sure what the largest display size is for this platform.
This particular fix is needed so we can do the incredibly crazy
temporary hack of:
    diff --git a/cache.h b/cache.h
    index e0abcd6..6637fd8 100644
    --- a/cache.h
    +++ b/cache.h
    @@ -6,6 +6,7 @@
     #include SHA1_HEADER
     #include <zlib.h>
    +#define long long long
     #if ZLIB_VERNUM < 0x1200
     #define deflateBound(c,s) ((s) + (((s) + 7) >> 3) + (((s) + 63) >> 6) + 11)
allowing us to more easily look for locations where we are passing
a pointer to an 8 byte value to a function that expects a 4 byte
value. This can occur on some platforms where sizeof(long) == 8
and sizeof(size_t) == 4.
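The PRIuMAX convention mentioned above looks roughly like this in
isolation. format_size() is a hypothetical helper used only for
illustration; it is not a function from this patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size without guessing how wide size_t (or long) is on this
 * platform: widen the value to uintmax_t and print it with PRIuMAX.
 * Returns what snprintf() returns.
 */
static int format_size(char *buf, size_t buflen, size_t value)
{
	assert(buf && buflen > 0);
	return snprintf(buf, buflen, "%" PRIuMAX, (uintmax_t)value);
}
```

Because the argument is always converted to uintmax_t, the format string
stays correct whatever the width of the original type.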
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* maint:
Describe more 1.5.3.5 fixes in release notes
Fix diffcore-break total breakage
Fix directory scanner to correctly ignore files without d_type
Improve receive-pack error message about funny ref creation
fast-import: Fix argument order to die in file_change_m
git-gui: Don't display CR within console windows
git-gui: Handle progress bars from newer gits
git-gui: Correctly report failures from git-write-tree
gitk.txt: Fix markup.
send-pack: respect '+' on wildcard refspecs
git-gui: accept versions containing text annotations, like 1.5.3.mingw.1
git-gui: Don't crash when starting gitk from a browser session
git-gui: Allow gitk to be started on Cygwin with native Tcl/Tk
git-gui: Ensure .git/info/exclude is honored in Cygwin workdirs
git-gui: Handle starting on mapped shares under Cygwin
git-gui: Display message box when we cannot find git in $PATH
git-gui: Avoid using bold text in entire gui for some fonts
Ok, so on the kernel list, some people noticed that "git log --follow"
doesn't work too well with some files in the x86 merge, because a lot of
files got renamed in very special ways.
In particular, there was a pattern of doing single commits with renames
that looked basically like
- rename "filename.h" -> "filename_64.h"
- create new "filename.h" that includes "filename_32.h" or
"filename_64.h" depending on whether we're 32-bit or 64-bit.
which was preparatory for smushing the two trees together.
Now, there's two issues here:
- "filename.h" *remained*. Yes, it was a rename, but there was a new file
created with the old name in the same commit. This was important,
because we wanted each commit to compile properly, so that it was
bisectable, so splitting the rename into one commit and the "create
helper file" into another was *not* an option.
So we need to break associations where the contents change too much.
Fine. We have the -B flag for that. When we break things up, then the
rename detection will be able to figure out whether there are better
alternatives.
- "git log --follow" didn't work with -B.
Now, the second case was really simple: we use a different "diffopt"
structure for the rename detection than the basic one (which we use for
showing the diffs). So that second case is trivially fixed by a trivial
one-liner that just copies the break_opt values from the "real" diffopts
to the one used for rename following. So now "git log -B --follow" works
fine:
    diff --git a/tree-diff.c b/tree-diff.c
    index 26bdbdd..7c261fd 100644
    --- a/tree-diff.c
    +++ b/tree-diff.c
    @@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
         diff_opts.detect_rename = DIFF_DETECT_RENAME;
         diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
         diff_opts.single_follow = opt->paths[0];
    +    diff_opts.break_opt = opt->break_opt;
         paths[0] = NULL;
         diff_tree_setup_paths(paths, &diff_opts);
         if (diff_setup_done(&diff_opts) < 0)
however, the end result does *not* work. Because our diffcore-break.c
logic is totally bogus!
In particular:
- it used to do
    if (base_size < MINIMUM_BREAK_SIZE)
        return 0; /* we do not break too small filepair */
which basically says "don't bother to break small files". But that
"base_size" is the *smaller* of the two sizes, which means that if some
large file was rewritten into one that just includes another file, we
would look at the (small) result, and decide that it's smaller than the
break size, so it cannot be worth it to break it up! Even if the other
side was ten times bigger and looked *nothing* like the small file!
That's clearly bogus. I replaced "base_size" with "max_size", so that
we compare the *bigger* of the filepair with the break size.
- It calculated a "merge_score", which was the score needed to merge it
back together if nothing else wanted it. But even if it was *so*
different that we would never want to merge it back, we wouldn't
consider it a break! That makes no sense. So I added
    if (*merge_score_p > break_score)
        return 1;
to make it clear that if we wouldn't want to merge it at the end, it
was *definitely* a break.
- It compared the whole "extent of damage", counting all inserts and
deletes, but it based this score on the "base_size", and generated the
damage score with
    delta_size = src_removed + literal_added;
    damage_score = delta_size * MAX_SCORE / base_size;
but that makes no sense either, since quite often, this will result in
a number that is *bigger* than MAX_SCORE! Why? Because base_size is
(again) the smaller of the two files we compare, and when you start out
from a small file and add a lot (or start out from a large file and
remove a lot), the base_size is going to be much smaller than the
damage!
Again, the fix was to replace "base_size" with "max_size", at which
point the damage actually becomes a sane percentage of the whole.
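To make the base_size vs max_size difference concrete, here is a small
sketch of the two size checks and of the damage formula quoted above. It is
not the diffcore-break.c code: the helper names are hypothetical, and the
values chosen for MAX_SCORE and MINIMUM_BREAK_SIZE are assumptions made
only so that the numbers can be worked through.

```c
#include <assert.h>

#define MAX_SCORE 60000ULL       /* assumed value, for illustration only */
#define MINIMUM_BREAK_SIZE 400UL /* assumed value, for illustration only */

static unsigned long smaller(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long larger(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Old check: looks at the *smaller* side of the filepair (base_size). */
static int too_small_old(unsigned long src_size, unsigned long dst_size)
{
	return smaller(src_size, dst_size) < MINIMUM_BREAK_SIZE;
}

/* New check: looks at the *bigger* side of the filepair (max_size). */
static int too_small_new(unsigned long src_size, unsigned long dst_size)
{
	return larger(src_size, dst_size) < MINIMUM_BREAK_SIZE;
}

/* delta_size * MAX_SCORE / size, with whichever size the caller picks. */
static unsigned long long damage_score(unsigned long delta_size,
				       unsigned long size)
{
	assert(size > 0);
	return delta_size * MAX_SCORE / size;
}
```

Take a 10000-byte file rewritten into a 100-byte one, with 9900 bytes
removed and 50 bytes of new literal data (delta_size 9950). The old size
check refuses to break the pair because the smaller side is under the
limit, and dividing by base_size (100) gives a score of 5970000, far above
MAX_SCORE. Dividing by max_size (10000) gives 59700, which stays on the
MAX_SCORE scale.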
With these changes in place, not only does "git log -B --follow" work for
the case that triggered this in the first place, ie now
    git log -B --follow arch/x86/kernel/vmlinux_64.lds.S
actually gives reasonable results. But I also wanted to verify it in
general, by doing a full-history
    git log --stat -B -C
on my kernel tree with the old code and the new code.
There's some tweaking to be done, but generally, the new code generates
much better results wrt breaking up files (and then finding better rename
candidates). Here's a few examples of the "--stat" output:
- This:
    include/asm-x86/Kbuild        |    2 -
    include/asm-x86/debugreg.h    |   79 +++++++++++++++++++++++++++++++++++------
    include/asm-x86/debugreg_32.h |   64 ---------------------------------
    include/asm-x86/debugreg_64.h |   65 ---------------------------------
    4 files changed, 68 insertions(+), 142 deletions(-)
  Becomes:
    include/asm-x86/Kbuild                        |    2 -
    include/asm-x86/{debugreg_64.h => debugreg.h} |    9 +++-
    include/asm-x86/debugreg_32.h                 |   64 -------------------------
    3 files changed, 7 insertions(+), 68 deletions(-)
- This:
    include/asm-x86/bug.h    |   41 +++++++++++++++++++++++++++++++++++++++--
    include/asm-x86/bug_32.h |   37 -------------------------------------
    include/asm-x86/bug_64.h |   34 ----------------------------------
    3 files changed, 39 insertions(+), 73 deletions(-)
  Becomes:
    include/asm-x86/{bug_64.h => bug.h} |   20 +++++++++++++-----
    include/asm-x86/bug_32.h            |   37 -----------------------------------
    2 files changed, 14 insertions(+), 43 deletions(-)
Now, in some other cases, it does actually turn a rename into a real
"delete+create" pair, and then the diff is usually bigger, so truth in
advertising: it doesn't always generate a nicer diff. But for what -B was
meant for, I think this is a big improvement, and I suspect those cases
where it generates a bigger diff are tweakable.
So I think this diff fixes a real bug, but we might still want to tweak
the default values and perhaps the exact rules for when a break happens.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
On Fri, 19 Oct 2007, Todd T. Fries wrote:
> If DT_UNKNOWN exists, then we have to do a stat() of some form to
> find out the right type.
That happened in the case of a pathname that was ignored, and we did
not ask for "dir->show_ignored". That test used to be *together*
with the "DTYPE(de) != DT_DIR", but splitting the two tests up
means that we can do that (common) test before we even bother to
calculate the real dtype.
Of course, that optimization only matters for systems that don't
have, or don't fill in DTYPE properly.
I also clarified the real relationship between "exclude" and
"dir->show_ignored". It used to do
    if (exclude != dir->show_ignored) {
        ..
which wasn't exactly obvious, because it triggers for two different
cases:
- the path is marked excluded, but we are not interested in ignored
files: ignore it
- the path is *not* excluded, but we *are* interested in ignored
files: ignore it unless it's a directory, in which case we might
have ignored files inside the directory and need to recurse
into it).
so this splits them into those two cases, since the first case
doesn't even care about the type.
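The split into two cases can be sketched as below. This is an illustration
under assumptions, not the dir.c code: struct entry, entry_is_dir() and
skip_entry() are hypothetical names, and the stat_calls counter exists only
to show when the (possibly expensive) type lookup actually happens.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical directory entry; stat_calls counts type lookups. */
struct entry {
	int is_dir;
	int stat_calls;
};

/* Stand-in for the work needed to find the real type of an entry. */
static int entry_is_dir(struct entry *e)
{
	e->stat_calls++;
	return e->is_dir;
}

/* Returns 1 if the entry should be skipped, 0 if it should be kept. */
static int skip_entry(int exclude, int show_ignored, struct entry *e)
{
	assert(e != NULL);

	/*
	 * Case 1: the path is excluded and we are not interested in
	 * ignored files.  The type does not matter, so do not look it up.
	 */
	if (exclude && !show_ignored)
		return 1;

	/*
	 * Case 2: the path is not excluded but we only want ignored
	 * files.  Skip it unless it is a directory, which may still
	 * contain ignored files and needs to be recursed into.
	 */
	if (!exclude && show_ignored && !entry_is_dir(e))
		return 1;

	return 0;
}
```

With the tests separated like this, the common first case returns before
any type lookup is attempted.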
I also made the DT_UNKNOWN case a separate helper function,
and added some commentary to the cases.
Linus
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When apply_filter() runs the external (clean or smudge) filter program, it
needs to pass the writable end of a pipe as its stdout. For this purpose,
it used to dup2(2) the file descriptor explicitly to stdout. Now we use
the facilities of start_command() to do it for us.
Furthermore, the path argument of a subordinate function, filter_buffer(),
was not used, so here we replace it to pass the fd instead.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This test uses a rot13 filter, which is its own inverse. It tested only
that the content was the same as the original after both the 'clean' and
the 'smudge' filter were applied. This way it would not detect whether
any filter was run at all. Hence, here we add another test that checks
that the repository contained content that was processed by the 'clean'
filter.
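Why a round-trip check alone proves nothing can be seen from a small rot13
sketch. The function below is only an illustration written for this note;
it is not the filter script used by the test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rotate ASCII letters by 13 places in place; other bytes are untouched. */
static void rot13(char *s)
{
	size_t i, len;

	assert(s != NULL);
	len = strlen(s);
	for (i = 0; i < len; i++) {
		char c = s[i];

		if (c >= 'a' && c <= 'z')
			s[i] = (char)('a' + (c - 'a' + 13) % 26);
		else if (c >= 'A' && c <= 'Z')
			s[i] = (char)('A' + (c - 'A' + 13) % 26);
	}
}
```

Applying rot13() twice gives back the original text, and so does applying
it zero times. Only looking at the intermediate (stored) content can tell
the two situations apart.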
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This gets rid of an explicit fork().
Since upload-pack has to coordinate two processes (rev-list and
pack-objects), we cannot use the normal finish_async(), but have to monitor
the process explicitly. Hence, there are no changes at this front.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This allows us later to use start_async() with this function, and at
the same time is a nice cleanup that makes a long function
(create_pack_file()) shorter.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We run the sideband demultiplexer in an asynchronous function.
Note that earlier there was a check in the child process that closed
xd[1] only if it was different from xd[0]; this test is no longer needed
because git_connect() always returns two different file descriptors
(see ec587fde0a).
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This adds start_async() and finish_async(), which run a function
asynchronously. Communication with the caller happens only via pipes.
For this reason, this implementation forks off a child process that runs
the function.
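The fork-and-pipe approach described above could look roughly like the
following on a POSIX system. This is a sketch, not the run-command.c
implementation: struct async_sketch, the *_sketch function names and
say_hello() are all hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical descriptor for a function that runs asynchronously. */
struct async_sketch {
	int (*proc)(int out_fd, void *data); /* runs in the child */
	void *data;
	int out;   /* filled in by start: read end of the pipe */
	pid_t pid; /* filled in by start: the child process */
};

/* Fork a child that runs a->proc and writes its output to a pipe. */
static int start_async_sketch(struct async_sketch *a)
{
	int fd[2];

	assert(a && a->proc);
	if (pipe(fd) < 0)
		return -1;
	a->pid = fork();
	if (a->pid < 0) {
		close(fd[0]);
		close(fd[1]);
		return -1;
	}
	if (a->pid == 0) {
		/* Child: keep only the write end and run the function. */
		close(fd[0]);
		_exit(a->proc(fd[1], a->data) ? 1 : 0);
	}
	/* Parent: keep only the read end. */
	close(fd[1]);
	a->out = fd[0];
	return 0;
}

/* Wait for the child; returns its exit code, or -1 on abnormal exit. */
static int finish_async_sketch(struct async_sketch *a)
{
	int status;

	if (waitpid(a->pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example function to run asynchronously. */
static int say_hello(int out_fd, void *data)
{
	(void)data;
	return write(out_fd, "hello", 5) == 5 ? 0 : 1;
}
```

The caller reads from the out descriptor while the function runs, closes
it, and then collects the result with the finish call.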
[sp: Style nit fixed by removing unnecessary block on if condition
inside of start_async()]
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This gets rid of an explicit fork/exec.
Since upload-pack has to coordinate two processes (rev-list and
pack-objects), we cannot use the normal finish_command(), but have to
monitor the processes explicitly. Hence, the waitpid() call remains.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This adds another stanza that allocates a pipe that is connected to the
child's stderr and that the caller can read from. In order to request this
pipe, the caller sets cmd->err to -1.
The implementation is not exactly modeled after the stdout case: For stdout
the caller can supply an existing file descriptor, but this facility is
nowhere needed in the stderr case. Additionally, the caller is required to
close cmd->err.
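For readers unfamiliar with the pattern, connecting a pipe to a child's
stderr looks roughly like this with plain POSIX calls. It is a sketch and
not the start_command() code: spawn_with_stderr_pipe() is a hypothetical
helper.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with its stderr connected to a new pipe.  On success the
 * child's pid is returned and *err_fd is the read end of that pipe,
 * which the caller must close.  Returns -1 on failure.
 */
static pid_t spawn_with_stderr_pipe(const char *const argv[], int *err_fd)
{
	int fd[2];
	pid_t pid;

	assert(argv && argv[0] && err_fd);
	if (pipe(fd) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fd[0]);
		close(fd[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: make the write end of the pipe our stderr. */
		close(fd[0]);
		dup2(fd[1], 2);
		close(fd[1]);
		execvp(argv[0], (char *const *)argv);
		_exit(127); /* exec failed */
	}
	/* Parent: keep only the read end. */
	close(fd[1]);
	*err_fd = fd[0];
	return pid;
}
```

As in the description above, the descriptor handed back belongs to the
caller, who reads the child's error output from it and closes it when
done.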
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The previous code already used finish_command() to wait for the process
to terminate, but did not use start_command() to run it.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The child process handling is delegated to start_command() and
finish_command().
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>