The name-hash used for detecting paths that are different only in
cases (which matter on case insensitive filesystems) has been
optimized to take advantage of multi-threading when it makes sense.
* jh/memihash-opt:
name-hash: add test-lazy-init-name-hash to .gitignore
name-hash: add perf test for lazy_init_name_hash
name-hash: add test-lazy-init-name-hash
name-hash: perf improvement for lazy_init_name_hash
hashmap: document memihash_cont, hashmap_disallow_rehash api
hashmap: add disallow_rehash setting
hashmap: allow memihash computation to be continued
name-hash: specify initial size for istate.dir_hash table
Document memihash_cont() and hashmap_disallow_rehash()
in Documentation/technical/api-hashmap.txt.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On Thu, Jan 19, 2017 at 03:03:46PM +0100, Ulrich Spörlein wrote:
> > I suspect the patch below may fix things for you. It works around it by
> > walking over the lru list (either is fine, as they both contain all
> > entries, and since we're clearing everything, we don't care about the
> > order).
>
> Confirmed. With the patch applied, I can import the whole 55G in one go
> without any crashes or aborts. Thanks much!
Thanks. Here it is rolled up with a commit message.
-- >8 --
Subject: clear_delta_base_cache(): don't modify hashmap while iterating
Removing entries while iterating causes fast-import to
access an already-freed `struct packed_git`, leading to
various confusing errors.
What happens is that clear_delta_base_cache() drops the
whole contents of the cache by iterating over the hashmap,
calling release_delta_base_cache() on each entry. That
function removes the item from the hashmap. The hashmap code
may then shrink the table, but the hashmap_iter struct
retains an offset from the old table.
As a result, the next call to hashmap_iter_next() may claim
that the iteration is done, even though some items haven't
been visited.
The only caller of clear_delta_base_cache() is fast-import,
which wants to clear the cache because it is discarding the
packed_git struct for its temporary pack. So by failing to
remove all of the entries, we still have references to the
freed packed_git.
To make things even more confusing, this doesn't seem to
trigger with the test suite, because it depends on
complexities like the size of the hash table, which entries
got cleared, whether we try to access them before they're
evicted from the cache, etc.
So I've been able to identify the problem with large
imports like freebsd's svn import, or a fast-export of
linux.git. But nothing that would be reasonable to run as
part of the normal test suite.
We can fix this easily by iterating over the lru linked list
instead of the hashmap. They both contain the same entries,
and we can use the "safe" variant of the list iterator,
which exists for exactly this case.
Let's also add a warning to the hashmap API documentation to
reduce the chances of getting bit by this again.
Reported-by: Ulrich Spörlein <uqs@freebsd.org>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The API documentation said that the hashmap_entry structure to be
embedded in the caller's structure is to be treated as opaque, which
left the reader wondering if it can safely be discarded when it no
longer is necessary. If the hashmap_entry structure had references
to external resources such as allocated memory or an open file
descriptor, merely free(3)ing the containing structure (when the
caller's structure is on the heap) or letting it go out of scope
(when it is on the stack) would end up leaking the external
resource.
Document that there is no need for hashmap_entry_clear() that
corresponds to hashmap_entry_init() to give the API users a little
bit of peace of mind.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Interning short strings with high probability of duplicates can reduce the
memory footprint and speed up comparisons.
Add strintern() and memintern() APIs that use a hashmap to manage the pool
of unique, interned strings.
Note: strintern(getenv()) could be used to sanitize git's use of getenv(),
in case we ever encounter a platform where a call to getenv() invalidates
previous getenv() results (which is allowed by POSIX).
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Hashmap entries are typically looked up by just a key. The hashmap_get()
API expects an initialized entry structure instead, to support compound
keys. This flexibility is currently only needed by find_dir_entry() in
name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
(currently five) call sites of hashmap_get() have to set up a near emtpy
entry structure, resulting in duplicate code like this:
struct hashmap_entry keyentry;
hashmap_entry_init(&keyentry, hash(key));
return hashmap_get(map, &keyentry, key);
Add a hashmap_get_from_hash() API that allows hashmap lookups by just
specifying the key and its hash code, i.e.:
return hashmap_get_from_hash(map, hash(key), key);
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Copying the first bytes of a SHA1 is duplicated in six places,
however, the implications (the actual value would depend on the
endianness of the platform) is documented only once.
Add a properly documented API for this.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The highlighting was pretty, but unfortunately, the failure mode
when source-highlight is not installed was that the entire code
block disappears.
See https://bugs.debian.org/745591,
https://bugs.launchpad.net/bugs/1316810.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The existing hashtable implementation (in hash.[ch]) uses open addressing
(i.e. resolve hash collisions by distributing entries across the table).
Thus, removal is difficult to implement with less than O(n) complexity.
Resolving collisions of entries with identical hashes (e.g. via chaining)
is left to the client code.
Add a hashtable implementation that supports O(1) removal and is slightly
easier to use due to builtin entry chaining.
Supports all basic operations init, free, get, add, remove and iteration.
Also includes ready-to-use hash functions based on the public domain FNV-1
algorithm (http://www.isthe.com/chongo/tech/comp/fnv).
The per-entry data structure (hashmap_entry) is piggybacked in front of
the client's data structure to save memory. See test-hashmap.c for usage
examples.
The hashtable is resized by a factor of four when 80% full. With these
settings, average memory consumption is about 2/3 of hash.[ch], and
insertion is about twice as fast due to less frequent resizing.
Lookups are also slightly faster, because entries are strictly confined to
their bucket (i.e. no data of other buckets needs to be traversed).
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>