Commit Graph

40 Commits

Author SHA1 Message Date
Junio C Hamano
fdbe2ac198 Merge branch 'mk/diff-delta-avoid-large-offset'
The delta format used in the packfile cannot reference data at
offset larger than what can be expressed in 4-byte, but the
generator for the data failed to make sure the offset does not
overflow.  This has been corrected.

* mk/diff-delta-avoid-large-offset:
  diff-delta: do not allow delta offset truncation
2017-09-28 14:47:56 +09:00
Martin Koegler
fed1ef9550 diff-delta: do not allow delta offset truncation
Prevent generating delta offsets beyond 4G, as the xdelta used in
the pack format cannot represent such large offset.

Signed-off-by: Martin Koegler <martin.koegler@chello.at>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-11 15:25:40 -07:00
Martin Koegler
3f0a67a1f6 diff-delta: fix encoding size that would not fit in "unsigned int"
The current delta code produces incorrect pack objects for files > 4GB,
because the size is copied from size_t field to "unsigned int" variables
during the encoding process.

Signed-off-by: Martin Koegler <martin.koegler@chello.at>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-10 13:55:22 -07:00
Stefan Beller
f7466e9437 create_delta_index: simplify condition always evaluating to true
The code sequence  ' (1u << i) < hsize && i < 31 ' is a multi step
process, whose first step requires that 'i' is already less that 31,
otherwise the result (1u << i)  is undefined (and  'undef_val < hsize'
can therefore be assumed to be 'false'), and so the later test  i < 31
can always be optimized away as dead code ('i' is already less than 31,
or the short circuit 'and' applies).

So we need to get rid of that code. One way would be to exchange the
order of the conditions, so the expression 'i < 31 && (1u << i) < hsize'
would remove that optimized unstable code already.

However when checking the previous lines in that function, we can deduce
that 'hsize' must always be smaller than (1u<<31), since 506049c7df
(fix >4GiB source delta assertion failure), because 'entries' is
capped at an upper bound of 0xfffffffeU, so 'hsize' contains a maximum
value of 0x3fffffff, which is smaller than (1u<<31), so the value of
'i' will never be larger than 31 and we can remove that condition
entirely.

Signed-off-by: Stefan Beller <stefanbeller@googlemail.com>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Acked-by: Philip Oakley <philipoakley@iee.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-18 12:56:23 -07:00
Nicolas Pitre
506049c7df fix >4GiB source delta assertion failure
When people try insane things such as delta-compressing 4GiB files, they
get this assertion:

diff-delta.c:285: create_delta_index: Assertion `packed_entry - (struct index_entry *)mem == entries' failed.

This happens because:

1) the 'entries' variable is an unsigned int

2) it is assigned with entries = (bufsize - 1) / RABIN_WINDOW
   (that itself is not a problem unless bufsize > 4G * RABIN_WINDOW)

3) the buffer is indexed from top to bottom starting at
   "data = buffer + entries * RABIN_WINDOW" and the multiplication
   here does indeed overflows, making the resulting top of the buffer
   much lower than expected.

This makes the number of actually produced index entries smaller than
what was computed initially, hence the assertion.

Furthermore, the current delta encoding format cannot represent offsets
into a reference buffer with more than 32 bits anyway.  So let's just
limit the number of entries to what the delta format can encode.

Reported-by: Ilari Liusvaara <ilari.liusvaara@elisanet.fi>
Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-08-21 23:53:26 -07:00
Nicolas Pitre
03aa8ff3be Nicolas Pitre has a new email address
Due to problems at cam.org, my nico@cam.org email address is no longer
valid.  From now on, nico@fluxnic.net should be used instead.

Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-09-14 02:23:36 -07:00
Nicolas Pitre
ce85b053d8 fix style of a few comments in diff-delta.c
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-12-18 15:22:28 -08:00
Pierre Habouzit
f9c5a80cdf Fix segfault in diff-delta.c when FLEX_ARRAY is 1
aka don't do pointer arithmetics on structs that have a FLEX_ARRAY member,
or you'll end up believing your array is 1 cell off its real address.

Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-12-17 21:59:26 -08:00
David Kastrup
02e665ce49 diff-delta.c: Rationalize culling of hash buckets
The previous hash bucket culling resulted in a somewhat unpredictable
number of hash bucket entries in the order of magnitude of HASH_LIMIT.

Replace this with a Bresenham-like algorithm leaving us with exactly
HASH_LIMIT entries by uniform culling.

Signed-off-by: David Kastrup <dak@gnu.org>
Acked-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-09 17:16:49 -07:00
David Kastrup
d2100860fd diff-delta.c: pack the index structure
In normal use cases, the performance wins are not overly impressive:
we get something like 5-10% due to the slightly better locality of
memory accesses using the packed structure.

However, since the data structure for index entries saves 33% of
memory on 32-bit platforms and 40% on 64-bit platforms, the behavior
when memory gets limited should be nicer.

This is a rather well-contained change.  One obvious improvement would
be sorting the elements in one bucket according to their hash, then
using binary probing to find the elements with the right hash value.

As it stands, the output should be strictly the same as previously
unless one uses the option for limiting the amount of used memory, in
which case the created packs might be better.

Signed-off-by: David Kastrup <dak@gnu.org>
Acked-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-09 17:16:49 -07:00
David Kastrup
b1d884a9e3 diff-delta.c: Fix broken skip calculation.
A particularly bad case was HASH_LIMIT <= hash_count[i] < 2*HASH_LIMIT:
in that case, only a single hash survived.  For larger cases,
2*HASH_LIMIT was the actual limiting value after pruning.

Signed-off-by: David Kastrup <dak@gnu.org>
Acked-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-08-23 00:04:10 -07:00
Brian Downing
11779e7907 Support fetching the memory usage of a delta index
Delta indices, at least on 64-bit platforms, tend to be larger than
the actual uncompressed data.  As such, keeping track of this storage
is important if you want to successfully limit the memory size of your
pack window.

Squirrel away the total allocation size inside the delta_index struct,
and add an accessor "sizeof_delta_index" to access it.

Signed-off-by: Brian Downing <bdowning@lavos.net>
Acked-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-07-12 14:32:35 -07:00
Martin Koegler
b75c6c6de1 diff-delta: use realloc instead of xrealloc
Commit 83572c1a91 changed many
realloc to xrealloc. This change was made in diff-delta.c too,
although the code can handle an out of memory failure.

This patch reverts this change in diff-delta.c.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-31 00:15:18 -07:00
Nicolas Pitre
366b53c170 update diff-delta.c copyright
There is actually nothing left from the original LibXDiff code I used
over 2 years ago, and even the GIT implementation has diverged quite a
bit from LibXDiff's at this point.  Let's update the copyright notice
to better reflect that fact.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-26 20:28:13 -07:00
Nicolas Pitre
843366961c improve delta long block matching with big files
Martin Koegler noted that create_delta() performs a new hash lookup
after every block copy encoding which are currently limited to 64KB.

In case of larger identical blocks, the next hash lookup would normally
point to the next 64KB block in the reference buffer and multiple block
copy operations will be consecutively encoded.

It is however possible that the reference buffer be sparsely indexed if
hash buckets have been trimmed down in create_delta_index() when hashing
of the reference buffer isn't well balanced.  In that case the hash
lookup following a block copy might fail to match anything and the fact
that the reference buffer still matches beyond the previous 64KB block
will be missed.

Let's rework the code so that buffer comparison isn't bounded to 64KB
anymore.  The match size should be as large as possible up front and
only then should multiple block copy be encoded to cover it all.
Also, fewer hash lookups will be performed in the end.

According to Martin, this patch should reduce his 92MB pack down to 75MB
with the dataset he has.

Tests performed on the Linux kernel repo show a slightly smaller pack and
a slightly faster repack.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-26 20:28:13 -07:00
Junio C Hamano
85023577a8 simplify inclusion of system header files.
This is a mechanical clean-up of the way *.c files include
system header files.

 (1) sources under compat/, platform sha-1 implementations, and
     xdelta code are exempt from the following rules;

 (2) the first #include must be "git-compat-util.h" or one of
     our own header file that includes it first (e.g. config.h,
     builtin.h, pkt-line.h);

 (3) system headers that are included in "git-compat-util.h"
     need not be included in individual C source files.

 (4) "git-compat-util.h" does not have to include subsystem
     specific header files (e.g. expat.h).

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-20 09:51:35 -08:00
Junio C Hamano
29f049a0c2 Revert "move pack creation to version 3"
This reverts commit 16854571aa.
Git as recent as v1.1.6 do not understand version 3 delta.

v1.2.0 is Ok and I personally would say it is old enough, but
the improvement between version 2 and version 3 delta is not
bit enough to justify breaking older clients.

We should resurrect this later, but when we do so we shold
make it conditional.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-10-14 23:38:01 -07:00
Nicolas Pitre
16854571aa move pack creation to version 3
It's been quite a while now that GIT is able to read version 3 packs.
Let's create them at last.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-09-22 19:24:52 -07:00
Jonas Fonseca
83572c1a91 Use xrealloc instead of realloc
Change places that use realloc, without a proper error path, to instead use
xrealloc. Drop an erroneous error path in the daemon code that used errno
in the die message in favour of the simpler xrealloc.

Signed-off-by: Jonas Fonseca <fonseca@diku.dk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-26 17:54:06 -07:00
Pierre Habouzit
b05faa2da9 Fix a comparison bug in diff-delta.c
(1 << i) < hspace is compared in the `int` space rather that in the
unsigned one.  the result will be wrong if hspace is between 0x40000000
and 0x80000000.

Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-23 03:05:27 -07:00
Pavel Roskin
82e5a82fd7 Fix more typos, primarily in the code
The only visible change is that git-blame doesn't understand
"--compability" anymore, but it does accept "--compatibility" instead,
which is already documented.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-07-10 00:36:44 -07:00
Florian Forster
1d7f171c3a Remove all void-pointer arithmetic.
ANSI C99 doesn't allow void-pointer arithmetic. This patch fixes this in
various ways. Usually the strategy that required the least changes was used.

Signed-off-by: Florian Forster <octo@verplant.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-20 01:59:46 -07:00
Florian Forster
63f175693e Initialize FAMs using `FLEX_ARRAY'.
When initializing a `flexible array member' the macro `FLEX_ARRAY' should be
used. This was forgotten in `diff-delta.c'.

Signed-off-by: Florian Forster <octo@verplant.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-18 21:19:09 -07:00
Nicolas Pitre
639ca54972 fix diff-delta bad memory access
It cannot be assumed that the given buffer will never be moved when
shrinking the allocated memory size with realloc().  So let's ignore
that optimization for now.

This patch makes Electric Fence happy on Linux.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Acked-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-10 10:53:59 -07:00
Nicolas Pitre
06a9f92035 improve diff-delta with sparse and/or repetitive data
It is useless to preserve multiple hash entries for consecutive blocks
with the same hash.  Keeping only the first one will allow for matching
the longest string of identical bytes while subsequent blocks will only
allow for shorter matches.  The backward matching code will match the
end of it as necessary.

This improves both performances (no repeated string compare with long
successions of identical bytes, or even small group of bytes), as well
as compression (less likely to need random hash bucket entry culling),
especially with sparse files.

With well behaved data sets this patch doesn't change much.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-02 21:32:39 -07:00
Nicolas Pitre
2d08e5dd73 tiny optimization to diff-delta
This is my assembly freak side looking at generated code again.  And
since create_delta() is certainly pretty high on the radar every bits
count.  In this case shorter code is generated if hash_mask is not
copied to a local variable.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-02 21:32:32 -07:00
Nicolas Pitre
3dc5a9e4cd replace adler32 with Rabin's polynomial in diff-delta
This brings another small repacking speedup for sensibly the same pack
size.  On the Linux kernel repo, git-repack -a -f is 3.7% faster for a
0.4% larger pack.

Credits to Geert Bosch who brought the Rabin's polynomial idea to my
attention.

This also eliminate the issue of adler32() reading past the data buffer,
as noticed by Johannes Schindelin.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-28 22:41:31 -07:00
Nicolas Pitre
08abe669c0 split the diff-delta interface
This patch splits the diff-delta interface into index creation and delta
generation.  A wrapper is provided to preserve the diff-delta() call.

This will allow for an optimization in pack-objects.c where the source
object could be fixed and a full window of objects tentatively tried
against
that same source object without recomputing the source index each time.

This patch only restructure things, plus a couple cleanups for good
measure. There is no performance change yet.

Signed-off-by: Nicolas Pitre <nico@cam.org>
2006-04-24 22:27:33 -07:00
Nicolas Pitre
5a1fb2ca92 3% tighter packs for free
This patch makes for 3.4% smaller pack with the git repository, and
a bit more than 3% smaller pack with the kernel repository.

And so with _no_ measurable CPU difference.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-17 20:42:39 -08:00
Nicolas Pitre
c13c6bf758 diff-delta: bound hash list length to avoid O(m*n) behavior
The diff-delta code can exhibit O(m*n) behavior with some patological
data set where most hash entries end up in the same hash bucket.

To prevent this, a limit is imposed to the number of entries that can
exist in the same hash bucket.

Because of the above the code is a tiny bit more expensive on average,
even if some small optimizations were added as well to atenuate the
overhead. But the problematic samples used to diagnoze the issue are now
orders of magnitude less expensive to process with only a slight loss in
compression.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-09 01:35:14 -08:00
Nicolas Pitre
8e1454b5ad diff-delta: big code simplification
This is much smaller and hopefully clearer code now.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-22 00:36:09 -08:00
Nicolas Pitre
fe474b588b diff-delta: fold two special tests into one plus cleanups
Testing for realloc and size limit can be done with only one test per
loop. Make it so and fix a theoretical off-by-one comparison error in
the process.

The output buffer memory allocation is also bounded by max_size when
specified.

Finally make some variable unsigned to allow the handling of files up to
4GB in size instead of 2GB.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-22 00:36:09 -08:00
Peter Eriksen
04fe2a1706 Use adler32() from zlib instead of defining our own.
Since we already depend on zlib, we don't need to define our
own adler32().  Spotted by oprofile.

Signed-off-by: Peter Eriksen <s022018@student.dtu.dk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-05 13:45:01 -08:00
Nicolas Pitre
e5e3a9d8f9 small cleanup for diff-delta.c
This patch removes unused remnants of the original xdiff source.
No functional change.  Possible tiny speed improvement.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-15 16:19:11 -08:00
Junio C Hamano
c7a45bd20e Revert "diff-delta.c: allow delta with empty blob."
This reverts 962537a3eb commit to
play safe.
2005-12-12 16:42:38 -08:00
Junio C Hamano
962537a3eb diff-delta.c: allow delta with empty blob.
Delta computation with an empty blob used to punt and returned NULL.
This commit allows creation with empty blob; all combination of
empty->empty, empty->something, and something->empty are allowed.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-12 12:57:25 -08:00
Nicolas Pitre
dcde55bc58 [PATCH] assorted delta code cleanup
This is a wrap-up patch including all the cleanups I've done to the
delta code and its usage.  The most important change is the
factorization of the delta header handling code.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-29 09:11:38 -07:00
Nicolas Pitre
69a2d426f0 [PATCH] denser delta header encoding
Since the delta data format is not tied to any actual git object
anymore, now is the time to add a small improvement to the delta data
header as it is been done for packed object header.  This patch allows
for reducing the delta header of about 2 bytes and makes for simpler
code.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 21:38:47 -07:00
Linus Torvalds
75c42d8cc3 Add a "max_size" parameter to diff_delta()
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta ends up being bigger than some specified
size, and this allows us to stop delta generation early when that
happens.
2005-06-25 19:30:20 -07:00
Nicolas Pitre
a310d43494 [PATCH] Deltification library work by Nicolas Pitre.
This patch adds the basic library functions to create and replay delta
information.  Also included is a test-delta utility to validate the
code.

diff-delta was based on LibXDiff written by Davide Libenzi

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-19 08:56:22 -07:00