Darwin does not define __WORDSIZE, and compiles the 32-bit code path
on 64-bit systems, resulting in a totally broken git.
I could not find an alternative -- other than the platform symbols
(__x86_64__ etc.) -- that does the test in the preprocessor. However,
we can also just test for the size of a 'long', which is what really
matters here. Any compiler worth its salt will leave only the branch
relevant for its platform, and indeed on Linux/GCC the numbers don't
change:
Test tr/darwin-xdl-fast-hash origin/next origin/master
------------------------------------------------------------------------------------------------------------------
4000.1: log -3000 (baseline) 0.09(0.07+0.01) 0.09(0.07+0.01) -5.5%* 0.09(0.07+0.01) -4.1%
4000.2: log --raw -3000 (tree-only) 0.47(0.41+0.05) 0.47(0.40+0.05) -0.5% 0.45(0.38+0.06) -3.5%.
4000.3: log -p -3000 (Myers) 1.81(1.67+0.12) 1.81(1.67+0.13) +0.3% 1.99(1.84+0.12) +10.2%***
4000.4: log -p -3000 --histogram 1.79(1.66+0.11) 1.80(1.67+0.11) +0.4% 1.96(1.82+0.10) +9.2%***
4000.5: log -p -3000 --patience 2.17(2.02+0.13) 2.20(2.04+0.13) +1.3%. 2.33(2.18+0.13) +7.4%***
------------------------------------------------------------------------------------------------------------------
Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001
Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Redo the hashing loop in xdl_hash_record in a way that loads an entire
'long' at a time, using masking tricks to see when and where we found
the terminating '\n'.
I stole inspiration and code from the posts by Linus Torvalds around
https://lkml.org/lkml/2012/3/2/452https://lkml.org/lkml/2012/3/5/6
His method reads the buffers in sizeof(long) increments, and may thus
overrun it by at most sizeof(long)-1 bytes before it sees the final
newline (or hits the buffer length check). I considered padding out
all buffers by a suitable amount to "catch" the overrun, but
* this does not work for mmap()'d buffers: if you map 4096+8 bytes
from a 4096 byte file, accessing the last 8 bytes results in a
SIGBUS on my machine; and
* it would also be extremely ugly because it intrudes deep into the
unpacking machinery.
So I adapted it to not read beyond the buffer at all. Instead, it
reads the final partial word byte-by-byte and strings it together.
Then it can use the same logic as before to finish the hashing.
So far we enable this only on x86_64, where it provides nice speedup
for diff-related work:
Test origin/next tr/xdiff-fast-hash
-----------------------------------------------------------------------------
4000.1: log -3000 (baseline) 0.07(0.05+0.02) 0.08(0.06+0.02) +14.3%
4000.2: log --raw -3000 (tree-only) 0.37(0.33+0.04) 0.37(0.32+0.04) +0.0%
4000.3: log -p -3000 (Myers) 1.75(1.65+0.09) 1.60(1.49+0.10) -8.6%
4000.4: log -p -3000 --histogram 1.73(1.62+0.09) 1.58(1.49+0.08) -8.7%
4000.5: log -p -3000 --patience 2.11(2.00+0.10) 1.94(1.80+0.11) -8.1%
Perhaps other platforms could also benefit. However it does NOT work
on big-endian systems!
[jc: minimum style and compilation fixes]
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Given our simple mmfile structure, xdl_mmfile_next() calls are
redundant. Do away with calls to them.
Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For histogram diff, we can afford a smaller sample size and thus a
poorer estimate of the number of lines, as the hash table (rhash) won't
be filled up/grown. This is safe as the final count of lines (xdf.nrecs)
will be updated correctly anyway by xdl_prepare_ctx().
This gives us a small boost in performance.
Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is in preparation for the histogram diff algorithm, which will also
re-use much of the code to call the default Meyers diff algorithm.
Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ctype functions isspace(), isalnum(), et al take an integer
argument representing an unsigned character, or -1 for EOF. On
platforms with a signed char, it is unsafe to pass a char to them
without casting it to unsigned char first.
Most of git is already shielded against this by the ctype
implementation in git-compat-util.h, but xdiff, which uses libc
ctype.h, ought to be fixed.
Noticed-by: der Mouse <mouse@Rodents-Montreal.ORG>
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In xdl_recmatch, do the memcmp to check if the two lines are equal before
checking if whitespace flags are set. If the lines are identical, then
there is no need to check if they differ only in whitespace.
This makes the common case (there is no whitespace difference) faster.
It costs the case where lines are the same length and contain
whitespace differences, but the common case is more than 20% faster.
Signed-off-by: Dylan Reid <dgreid@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Thell Fowler noticed that various "ignore whitespace" options to git diff
do not work well on an incomplete line.
The loop control of the function responsible for these bugs was extremely
difficult to follow. This patch restructures the loops for three variants
of "ignore whitespace" logic.
The basic idea of the re-written logic is:
- A loop runs while the characters from both strings we are looking at
match. We declare unmatch immediately when we find something that does
not match and return false from the function. We break out of the loop
if we ran out of either side of the string.
The way we skip spaces inside this loop varies depending on the style
of ignoring whitespaces.
- After the above loop breaks, we know that the parts of the strings we
inspected so far match, ignoring the whitespaces. The lines can match
only if the remainder consists of nothing but whitespaces. This part
of the logic is shared across all three styles.
The new code is more obvious and should be much easier to follow.
Tested-by: Thell Fowler <git@tbfowler.name>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Upon seeing a whitespace, xdl_hash_record_with_whitespace() first skipped
the run of whitespaces (excluding LF) that begins there, ensuring that the
pointer points at the last whitespace character in the run, and assumed
that the next character must be LF at the end of the line. This does not
work when hashing an incomplete line, which lacks the LF at the end.
Introduce "at_eol" variable that is true when either we are at the end of
line (looking at LF) or at the end of an incomplete line, and use that
instead throughout the code.
Noticed by Thell Fowler.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code used to misbehave when options to ignore certain whitespaces
(-w -b and --ignore-at-eol) were combined.
Signed-off-by: Keith Cascio <keith@cs.ucla.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Solaris Workshop Compiler found a few unreachable statements.
Signed-off-by: Guido Ostkamp <git@ostkamp.fastmail.fm>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This uses "git-apply --whitespace=strip" to fix whitespace errors that have
crept in to our source files over time. There are a few files that need
to have trailing whitespaces (most notably, test vectors). The results
still passes the test, and build result in Documentation/ area is unchanged.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since in at least one use case, xdl_hash_record() takes over 15% of the
CPU time, it makes sense to even micro-optimize it. For many cases, no
whitespace special handling is needed, and in these cases we should not
even bother to check for whitespace in _every_ iteration of the loop.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
`git diff --ignore-space-at-eol` will ignore whitespace at the
line ends.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is _not_ the same as "treat eol as whitespace", since that would mean
that multiple empty lines would be treated as equal to e.g. a space.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
When whitespace or whitespace change was ignored, the function
xdl_recmatch() returned memcmp() style differences, which is wrong,
since it should return 0 on non-match.
Also, there were three horrible off-by-one bugs, even leading to wrong
hashes in the whitespace special handling.
The issue was noticed by Ray Lehtiniemi.
For good measure, this commit adds a test.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds -b (--ignore-space-change) and -w (--ignore-all-space) flags to
diff. The main part of the patch is teaching libxdiff about it.
[jc: renamed xdl_line_match() to xdl_recmatch() since the former is used
for different purposes in xpatchi.c which is in the parts of the upstream
source we do not use.]
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This reformats the change 621c53cc08
introduced to match what upstream author implemented in libxdiff-0.21
without changing any logic (hopefully ;-). This is to help keep
us in sync with the upstream.
Signed-off-by: Junio C Hamano <junkio@cox.net>
The speed of the built-in diff generator is nice; but the function names
shown by `diff -p' are /really/ nice. And I hate having to choose. So,
we hack xdiff to find the function names and print them.
xdiff has grown a flag to say whether to dig up the function names. The
builtin_diff function passes this flag unconditionally. I suppose it
could parse GIT_DIFF_OPTS, but it doesn't at the moment. I've also
reintroduced the `function name' into the test suite, from which it was
removed in commit 3ce8f089.
The function names are parsed by a particularly stupid algorithm at the
moment: it just tries to find a line in the `old' file, from before the
start of the hunk, whose first character looks plausible. Still, it's
most definitely a start.
Signed-off-by: Mark Wooding <mdw@distorted.org.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This fixes up a couple of minor issues with the real built-in
diff to be more usable:
- Omit ---/+++ header unless we emit diff output;
- Detect and punt binary diff like GNU does;
- Honor GIT_DIFF_OPTS minimally (only -u<number> and
--unified=<number> are currently supported);
- Omit line count of 1 from "@@ -l,k +m,n @@" hunk header
(i.e. when k == 1 or n == 1)
- Adjust testsuite for the lack of -p support.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This uses a simplified libxdiff setup to generate unified diffs _without_
doing fork/execve of GNU "diff".
This has several huge advantages, for example:
Before:
[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null
real 0m24.818s
user 0m13.332s
sys 0m8.664s
After:
[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null
real 0m4.563s
user 0m2.944s
sys 0m1.580s
and the fact that this should be a lot more portable (ie we can ignore all
the issues with doing fork/execve under Windows).
Perhaps even more importantly, this allows us to do diffs without actually
ever writing out the git file contents to a temporary file (and without
any of the shell quoting issues on filenames etc etc).
NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the
current "diff-core" code actually will always write the temp-files,
because it used to be something that you simply had to do. So this current
one actually writes a temp-file like before, and then reads it into memory
again just to do the diff. Stupid.
But if this basic infrastructure is accepted, we can start switching over
diff-core to not write temp-files, which should speed things up even
further, especially when doing big tree-to-tree diffs.
Now, in the interest of full disclosure, I should also point out a few
downsides:
- the libxdiff algorithm is different, and I bet GNU diff has gotten a
lot more testing. And the thing is, generating a diff is not an exact
science - you can get two different diffs (and you will), and they can
both be perfectly valid. So it's not possible to "validate" the
libxdiff output by just comparing it against GNU diff.
- GNU diff does some nice eye-candy, like trying to figure out what the
last function was, and adding that information to the "@@ .." line.
libxdiff doesn't do that.
- The libxdiff thing has some known deficiencies. In particular, it gets
the "\No newline at end of file" case wrong. So this is currently for
the experimental branch only. I hope Davide will help fix it.
That said, I think the huge performance advantage, and the fact that it
integrates better is definitely worth it. But it should go into a
development branch at least due to the missing newline issue.
Technical note: this is based on libxdiff-0.17, but I did some surgery to
get rid of the extraneous fat - stuff that git doesn't need, and seriously
cutting down on mmfile_t, which had much more capabilities than the diff
algorithm either needed or used. In this version, "mmfile_t" is just a
trivial <pointer,length> tuple.
That said, I tried to keep the differences to simple removals, so that you
can do a diff between this and the libxdiff origin, and you'll basically
see just things getting deleted. Even the mmfile_t simplifications are
left in a state where the diffs should be readable.
Apologies to Davide, whom I'd love to get feedback on this all from (I
wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old
complex format had a helper function for that, but I did my surgery with
the goal in mind that eventually we _should_ just do
mmfile_t mf;
buf = read_sha1_file(sha1, type, &size);
mf->ptr = buf;
mf->size = size;
.. use "mf" directly ..
which was really a nightmare with the old "helpful" mmfile_t, and really
is that easy with the new cut-down interfaces).
[ Btw, as any hawk-eye can see from the diff, this was actually generated
with itself, so it is "self-hosting". That's about all the testing it
has gotten, along with the above kernel diff, which eye-balls correctly,
but shows the newline issue when you double-check it with "git-apply" ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>