The bitmap_is_subset() function checks if the 'self' bitmap contains any
bitmaps that are not on in the 'other' bitmap. Up until this patch, it
had a declaration, but no implementation or callers. A subsequent patch
will want this function, so implement it here.
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There's no easy way to make a copy of a bitmap. Obviously a caller can
iterate over the bits and set them one by one in a new bitmap, but we
can go much faster by copying whole words with memcpy().
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have a function to bitwise-OR an ewah into an uncompressed bitmap,
but not to OR two uncompressed bitmaps. Let's add it.
Interestingly, we have a public header declaration going back to
e1273106f6 (ewah: compressed bitmap implementation, 2013-11-14), but the
function was never implemented. That was all OK since there were no
users of 'bitmap_or()', but a first caller will be added in a couple of
patches.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If you ask to set a bit in the Nth word and we haven't yet allocated
that many slots in our array, we'll increase the bitmap size to 2*N.
This means we might frequently end up with bitmaps that are twice the
necessary size (as soon as you ask for the biggest bit, we'll size up to
twice that).
But if we just allocate as many words as were asked for, we may not grow
fast enough. The worst case there is setting bit 0, then 1, etc. Each
time we grow we'd just extend by one more word, giving us linear
reallocations (and quadratic memory copies).
A middle ground is relying on alloc_nr(), which causes us to grow by a
factor of roughly 3/2 instead of 2. That's less aggressive than
doubling, and it may help avoid fragmenting memory. (If we start with N,
then grow twice, our total is N*(3/2)^2 = 9N/4. After growing twice,
that array of size 9N/4 can fit into the space vacated by the original
array and first growth, N+3N/2 = 10N/4 > 9N/4, leading to less
fragmentation in memory).
Our worst case is still 3/2N wasted bits (you set bit N-1, then setting
bit N causes us to grow by 3/2), but our average should be much better.
This isn't usually that big a deal, but it will matter as we shift the
reachability bitmap generation code to store more bitmaps in memory.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We auto-grow bitmaps when somebody asks to set a bit whose position is
outside of our currently allocated range. Other operations besides
single bit-setting might need to do this, too, so let's pull it into its
own function.
Note that we change the semantics a little: you now ask for the number
of words you'd like to have, not the id of the block you'd like to write
to.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
'ewah/ewah_bitmap.c:buffer_grow()' is responsible for growing the buffer
used to store the bits of an EWAH bitmap. It is essentially doing the
same task as the 'ALLOC_GROW()' macro, so use that instead.
This simplifies the callers of 'buffer_grow()', who no longer have to
ask for a specific size, but rather specify how much of the buffer they
need. They also no longer need to guard 'buffer_grow()' behind an if
statement, since 'ALLOC_GROW()' (and, by extension, 'buffer_grow()') is
a noop if the buffer is already large enough.
But, the most significant change is that this fixes a bug when calling
buffer_grow() with both 'alloc_size' and 'new_size' set to 1. In this
case, truncating integer math will leave the new size set to 1, causing
the buffer to never grow.
Instead, let alloc_nr() handle this, which asks for '(new_size + 16) * 3
/ 2' instead of 'new_size * 3 / 2'.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The object reachability bitmap machinery and the partial cloning
machinery were not prepared to work well together, because some
object-filtering criteria that partial clones use inherently rely
on object traversal, but the bitmap machinery is an optimization
to bypass that object traversal. There however are some cases
where they can work together, and they were taught about them.
* jk/object-filter-with-bitmap:
rev-list --count: comment on the use of count_right++
pack-objects: support filters with bitmaps
pack-bitmap: implement BLOB_LIMIT filtering
pack-bitmap: implement BLOB_NONE filtering
bitmap: add bitmap_unset() function
rev-list: use bitmap filters for traversal
pack-bitmap: basic noop bitmap filter infrastructure
rev-list: allow commit-only bitmap traversals
t5310: factor out bitmap traversal comparison
rev-list: allow bitmaps when counting objects
rev-list: make --count work with --objects
rev-list: factor out bitmap-optimized routines
pack-bitmap: refuse to do a bitmap traversal with pathspecs
rev-list: fallback to non-bitmap traversal when filtering
pack-bitmap: fix leak of haves/wants object lists
pack-bitmap: factor out type iterator initialization
We've never needed to unset an individual bit in a bitmap until now.
Typically they start with all bits unset and we bitmap_set() them, or we
are applying another bitmap as a mask. But the easiest way to apply an
object filter to a bitmap result will be to unset the individual bits.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a following commit we will need to allocate a variable
number of bitmap words, instead of always 32, so let's add
bitmap_word_alloc() for this purpose.
Note that we have to adjust the block growth in bitmap_set(),
since a caller could now use an initial size of "0" (we don't
plan to do that, but it doesn't hurt to be defensive).
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'ewok_rlw.h' header file contains the rlw_get_run_bit() function
definition, which is marked as 'static' but not 'inline'. At least when
compiled by gcc, with the default -O2 optimization level, the function
is actually inlined and leaves no static version in the ewah_bitmap.o
and ewah_rlw.o object files. Despite this, add the missing 'inline'
keyword to better describe the intended behaviour.
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Complete the removal of unused 'ewah bitmap' code by removing the now
unused 'rlwit_discharge_empty()' function. Also, the 'ewah_clear()'
function can now be made a file-scope static symbol.
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We don't call this function, and never have. The on-disk
bitmap format uses network-byte-order integers, meaning that
we cannot use the native-byte-order format written here.
Let's drop it in the name of simplicity.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We don't call this function, and in fact never have since it
was added (at least not in iterations of the ewah patches
that got merged). Instead we use ewah_read_mmap().
Let's drop the unused code.
Note to anybody who later wants to resurrect this: it does
not check for integer overflow in the ewah data size,
meaning it may be possible to convince the code to allocate
a too-small buffer and read() into it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The on-disk ewah format tells us how big the ewah data is,
and we blindly read that much from the buffer without
considering whether the mmap'd data is long enough, which
can lead to out-of-bound reads.
Let's make sure we have data available before reading it,
both for the ewah header/footer as well as for the bit data
itself. In particular:
- keep our ptr/len pair in sync as we move through the
buffer, and check it before each read
- check the size for integer overflow (this should be
impossible on 64-bit, as the size is given as a 32-bit
count of 8-byte words, but is possible on a 32-bit
system)
- return the number of bytes read as an ssize_t instead of
an int, again to prevent integer overflow
- compute the return value using a pointer difference;
this should yield the same result as the existing code,
but makes it more obvious that we got our computations
right
The included test is far from comprehensive, as it just
picks a static point at which to truncate the generated
bitmap. But in practice this will hit in the middle of an
ewah and make sure we're at least exercising this code.
Reported-by: Luat Nguyen <root@l4w.io>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The mailing address for the FSF has changed over the years. Rather than
updating the address across all files, refer readers to gnu.org, as the
GNU GPL documentation now suggests for license notices. The mailing
address is retained in the full license files (COPYING and LGPL-2.1).
The old address is still present in t/diff-lib/COPYING. This is
intentional, as the file is used in tests and the contents are not
expected to change.
Signed-off-by: Todd Zullinger <tmz@pobox.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert code that divides and rounds up to use DIV_ROUND_UP to make the
intent clearer and reduce the number of magic constants.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ewah subsystem typedefs eword_t to be uint64_t, but some
code uses a bare uint64_t. This isn't a bug now, but it's a
potential maintenance problem if the definition of eword_t
ever changes. Let's use the correct type.
Note that we can't use COPY_ARRAY() here because the source
and destination point to objects of different sizes. For
that reason we'll also skip the usual "sizeof(*dst)" and use
the real type, which should make it more clear that there's
something tricky going on.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that we're built around xmalloc and friends, we can use
helpers like REALLOC_ARRAY, ALLOC_GROW, and so on to make
the code shorter and protect against integer overflow.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This code was originally written with the idea that it could
be spun off into its own ewah library, and uses the
overrideable ewah_malloc to do allocations.
We plug in xmalloc as our ewah_malloc, of course. But over
the years the ewah code itself has become more entangled
with git, and the return value of many ewah_malloc sites is
not checked.
Let's just drop the level of indirection and use xmalloc and
friends directly. This saves a few lines, and will let us
adapt these sites to our more advanced malloc helpers.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ewah/ewok.h header pollutes the global namespace with
"BITS_IN_WORD", without any specific notion that we are
talking about the bits in an eword_t. We can give this the
more specific name "BITS_IN_EWORD".
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On PowerPC Mac OS X (10.5.8 "Leopard" with Xcode 3.1),
system header /usr/include/ppc/param.h[1] pollutes the
preprocessor namespace with a macro generically named MASK.
This conflicts with the same-named macro in ewah/bitmap.c.
We can avoid this conflict by using a more specific name.
[1]: Included indirectly via:
git-compat-util.h ->
sys/sysctl.h ->
sys/ucred.h ->
sys/param.h ->
machine/param.h ->
ppc/param.h
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The __builtin_ctzll function was added in gcc 3.4.0.
This extends the check for gcc so that use of __builtin_ctzll is only
enabled if gcc >= 3.4.0.
Signed-off-by: Tom G. Christensen <tgc@statsbiblioteket.dk>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When buffer_grow changes the size of the buffer using realloc,
it first computes and saves the rlw pointer's offset into the
buffer using (uint8_t *) math before the realloc but then
restores it using (eword_t *) math.
In order to do this it's necessary to convert the (uint8_t *)
offset into an (eword_t *) offset. It was doing this by
dividing by the sizeof(size_t). Unfortunately sizeof(size_t)
is not same as sizeof(eword_t) on all platforms.
This causes illegal memory accesses and other bad things to
happen when attempting to use bitmaps on those platforms.
Fix this by dividing by the sizeof(eword_t) instead which
will always be correct for all platforms.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit a201c20 tried to optimize out a loop like:
for (i = 0; i < len; i++)
data[i] = ntohll(data[i]);
in the big-endian case, because we know that ntohll is a
noop, and we do not need to pay the cost of the loop at all.
However, it mistakenly assumed that __BYTE_ORDER was always
defined, whereas it may not be on systems which do not
define it by default, and where we did not need to define it
to set up the ntohll macro. This includes OS X and Windows.
We could muck with the ordering in compat/bswap.h to make
sure it is defined unconditionally, but it is simpler to
still to just execute the loop unconditionally. That avoids
the application code knowing anything about these magic
macros, and lets it depend only on having ntohll defined.
And since the resulting loop looks like (on a big-endian
system):
for (i = 0; i < len; i++)
data[i] = data[i];
any decent compiler can probably optimize it out.
Original report and analysis by Brian Gernhardt.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The caller may hand us an unaligned buffer (e.g., because it
is an mmap of a file with many ewah bitmaps). On some
platforms (like SPARC) this can cause a bus error. We can
fix it with a combination of get_be32 and moving the data
into an aligned buffer (which we would do anyway, but we can
move it before fixing the endianness).
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
EWAH is a word-aligned compressed variant of a bitset (i.e. a data
structure that acts as a 0-indexed boolean array for many entries).
It uses a 64-bit run-length encoding (RLE) compression scheme,
trading some compression for better processing speed.
The goal of this word-aligned implementation is not to achieve
the best compression, but rather to improve query processing time.
As it stands right now, this EWAH implementation will always be more
efficient storage-wise than its uncompressed alternative.
EWAH arrays will be used as the on-disk format to store reachability
bitmaps for all objects in a repository while keeping reasonable sizes,
in the same way that JGit does.
This EWAH implementation is a mostly straightforward port of the
original `javaewah` library that JGit currently uses. The library is
self-contained and has been embedded whole (4 files) inside the `ewah`
folder to ease redistribution.
The library is re-licensed under the GPLv2 with the permission of Daniel
Lemire, the original author. The source code for the C version can
be found on GitHub:
https://github.com/vmg/libewok
The original Java implementation can also be found on GitHub:
https://github.com/lemire/javaewah
[jc: stripped debug-only code per Peff's $gmane/239768]
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>