find_pack_entry_one() uses the fan-out table of pack indexes to find out
which entries match the first byte of the searched hash and does a
binary search on this subset of the main index table.
If there are no matching entries then lo and hi will have the same
value. The binary search still starts and compares the hash of the
following entry (which has a non-matching first byte, so won't cause any
trouble), or whatever comes after the sorted list of entries.
The probability of that stray comparison matching by mistake is low, but
let's not take any chances and check when entering the binary search
loop if we're actually done already.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
FD_CLOEXEC only applies to the file descriptor, so it needs to be
manipuluated via F_GETFD/F_SETFD. F_GETFL/F_SETFL are for file
description flags.
Verified via strace with o_cloexec set to zero.
Signed-off-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update the hashmap API so that data to customize the behaviour of
the comparison function can be specified at the time a hashmap is
initialized.
* sb/hashmap-customize-comparison:
hashmap: migrate documentation from Documentation/technical into header
patch-ids.c: use hashmap correctly
hashmap.h: compare function has access to a data field
Optimize "what are the object names already taken in an alternate
object database?" query that is used to derive the length of prefix
an object name is uniquely abbreviated to.
* rs/sha1-name-readdir-optim:
sha1_file: guard against invalid loose subdirectory numbers
sha1_file: let for_each_file_in_obj_subdir() handle subdir names
p4205: add perf test script for pretty log formats
sha1_name: cache readdir(3) results in find_short_object_filename()
When using the hashmap a common need is to have access to caller provided
data in the compare function. A couple of times we abuse the keydata field
to pass in the data needed. This happens for example in patch-ids.c.
This patch changes the function signature of the compare function
to have one more void pointer available. The pointer given for each
invocation of the compare function must be defined in the init function
of the hashmap and is just passed through.
Documentation of this new feature is deferred to a later patch.
This is a rather mechanical conversion, just adding the new pass-through
parameter. However while at it improve the naming of the fields of all
compare functions used by hashmaps by ensuring unused parameters are
prefixed with 'unused_' and naming the parameters what they are (instead
of 'unused' make it 'unused_keydata').
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
has_sha1_file_with_flags() implements many mechanisms in common with
sha1_object_info_extended(). Make has_sha1_file_with_flags() a
convenience function for sha1_object_info_extended() instead.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, regardless of the contents of the "struct object_info" passed
to sha1_object_info_extended(), that function always accesses the
packfile whenever it returns information about a packed object, since it
needs to populate "u.packed".
Add the ability to pass NULL, and use NULL-ness of the argument to
activate an optimization in which sha1_object_info_extended() does not
needlessly access the packfile. A subsequent patch will make use of this
optimization.
A similar optimization is not made for the cached and loose cases as it
would not cause a significant performance improvement.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Improve sha1_object_info_extended() by supporting additional
flags. This allows has_sha1_file_with_flags() to be modified to use
sha1_object_info_extended() in a subsequent patch.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A common pattern to free a piece of memory and assign NULL to the
pointer that used to point at it has been replaced with a new
FREE_AND_NULL() macro.
* ab/free-and-null:
*.[ch] refactoring: make use of the FREE_AND_NULL() macro
coccinelle: make use of the "expression" FREE_AND_NULL() rule
coccinelle: add a rule to make "expression" code use FREE_AND_NULL()
coccinelle: make use of the "type" FREE_AND_NULL() rule
coccinelle: add a rule to make "type" code use FREE_AND_NULL()
git-compat-util: add a FREE_AND_NULL() wrapper around free(ptr); ptr = NULL
Fix configuration codepath to pay proper attention to commondir
that is used in multi-worktree situation, and isolate config API
into its own header file.
* bw/config-h:
config: don't implicitly use gitdir or commondir
config: respect commondir
setup: teach discover_git_directory to respect the commondir
config: don't include config.h by default
config: remove git_config_iter
config: create config.h
Loose object subdirectories have hexadecimal names based on the first
byte of the hash of contained objects, thus their numerical
representation can range from 0 (0x00) to 255 (0xff). Change the type
of the corresponding variable in for_each_file_in_obj_subdir() and
associated callback functions to unsigned int and add a range check.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function for_each_file_in_obj_subdir() takes a object subdirectory
number and expects the name of the same subdirectory to be included in
the path strbuf. Avoid this redundancy by letting the function append
the hexadecimal subdirectory name itself. This makes it a bit easier
and safer to use the function -- it becomes impossible to specify
different subdirectories in subdir_nr and path.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Read each loose object subdirectory at most once when looking for unique
abbreviated hashes. This speeds up commands like "git log --pretty=%h"
considerably, which previously caused one readdir(3) call for each
candidate, even for subdirectories that were visited before.
The new cache is kept until the program ends and never invalidated. The
same is already true for pack indexes. The inherent racy nature of
finding unique short hashes makes it still fit for this purpose -- a
conflicting new object may be added at any time. Tasks with higher
consistency requirements should not use it, though.
The cached object names are stored in an oid_array, which is quite
compact. The bitmap for remembering which subdir was already read is
stored as a char array, with one char per directory -- that's not quite
as compact, but really simple and incurs only an overhead equivalent to
11 hashes after all.
Suggested-by: Jeff King <peff@peff.net>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
read_object() and sha1_object_info_extended() both implement mechanisms
such as object replacement, retrying the packed store after failing to
find the object in the packed store then the loose store, and being able
to mark a packed object as bad and then retrying the whole process.
Consolidating these mechanisms would be a great help to maintainability.
Therefore, consolidate them by extending sha1_object_info_extended() to
support the functionality needed, and then modifying read_object() to
use sha1_object_info_extended().
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a subsequent patch, packed_object_info() will be modified to use the
delta base cache, so move the relevant code to before
packed_object_info().
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The LOOKUP_REPLACE_OBJECT flag controls whether the
lookup_replace_object() function is invoked by
sha1_object_info_extended(), read_sha1_file_extended(), and
lookup_replace_object_extended(), but it is not immediately clear which
functions accept that flag.
Therefore restrict this flag to only sha1_object_info_extended(),
renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some
documentation. Update read_sha1_file_extended() to have a boolean
parameter instead, and delete lookup_replace_object_extended().
parse_sha1_header() also passes this flag to
parse_sha1_header_extended() since commit 46f0344 ("sha1_file: support
reading from a loose object of unknown type", 2015-05-03), but that has
had no effect since that commit. Therefore this patch also removes this
flag from that invocation.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit 46f0344
("sha1_file: support reading from a loose object of unknown type",
2015-05-03) in order to support a feature in cat-file subsequently
introduced in commit 39e4ae3 ("cat-file: teach cat-file a
'--allow-unknown-type' option", 2015-05-03). Despite its name and
location in cache.h, this flag is used neither in
read_sha1_file_extended() nor in any of the lookup functions, but used
only in sha1_object_info_extended().
Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the
name of the cat-file flag that invokes this feature, and move it closer
to the declaration of sha1_object_info_extended(). Also add
documentation for this flag.
OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to 2, not 1, to avoid
conflicting with LOOKUP_REPLACE_OBJECT. Avoidance of this conflict is
necessary because sha1_object_info_extended() supports both flags.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply the result of the just-added coccinelle rule. This manually
excludes a few occurrences, mostly things that resulted in many
FREE_AND_NULL() on one line, that'll be manually fixed in a subsequent
change.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Stop including config.h by default in cache.h. Instead only include
config.h in those files which require use of the config system.
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In commit 46f0344 ("sha1_file: support reading from a loose object of
unknown type", 2015-05-06), "struct object_info" gained a "typename"
field that could represent a type name from a loose object file, whether
valid or invalid, as opposed to the existing "typep" which could only
represent valid types. Some relatively complex manipulations were added
to avoid breaking packed_object_info() without modifying it, but it is
much easier to just teach packed_object_info() about the new field.
Therefore, teach packed_object_info() as described above.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Clean up fallouts from recent tightening of the set-up sequence,
where Git barfs when repository information is accessed without
first ensuring that it was started in a repository.
* jk/no-looking-at-dotgit-outside-repo:
test-read-cache: setup git dir
has_sha1_file: don't bother if we are not in a repository
Conversion from unsigned char [40] to struct object_id continues.
* bc/object-id:
Documentation: update and rename api-sha1-array.txt
Rename sha1_array to oid_array
Convert sha1_array_for_each_unique and for_each_abbrev to object_id
Convert sha1_array_lookup to take struct object_id
Convert remaining callers of sha1_array_lookup to object_id
Make sha1_array_append take a struct object_id *
sha1-array: convert internal storage for struct sha1_array to object_id
builtin/pull: convert to struct object_id
submodule: convert check_for_new_submodule_commits to object_id
sha1_name: convert disambiguate_hint_fn to take object_id
sha1_name: convert struct disambiguate_state to object_id
test-sha1-array: convert most code to struct object_id
parse-options-cb: convert sha1_array_append caller to struct object_id
fsck: convert init_skiplist to struct object_id
builtin/receive-pack: convert portions to struct object_id
builtin/pull: convert portions to struct object_id
builtin/diff: convert to struct object_id
Convert GIT_SHA1_RAWSZ used for allocation to GIT_MAX_RAWSZ
Convert GIT_SHA1_HEXSZ used for allocation to GIT_MAX_HEXSZ
Define new hash-size constants for allocating memory
Update error handling for codepath that deals with corrupt loose
objects.
* jk/loose-object-info-report-error:
index-pack: detect local corruption in collision check
sha1_loose_object_info: return error for corrupted objects
Most callers to this function already require that they are in a
git repository, but there is an exception: "git apply" uses
has_sha1_file to avoid work if the result of applying a binary
patch is already present in the repository. When run outside any
repository, this produces an error:
fatal: BUG: setup_git_env called without repository
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When sha1_loose_object_info() finds that a loose object file
cannot be stat(2)ed or mmap(2)ed, it returns -1 to signal an
error to the caller. However, if it found that the loose
object file is corrupt and the object data cannot be used
from it, it stuffs OBJ_BAD into "type" field of the
object_info, but returns zero (i.e., success), which can
confuse callers.
This is due to 052fe5eac (sha1_loose_object_info: make type
lookup optional, 2013-07-12), which switched the return to a
strict success/error, rather than returning the type (but
botched the return).
Callers of regular sha1_object_info() don't notice the
difference, as that function returns the type (which is
OBJ_BAD in this case). However, direct callers of
sha1_object_info_extended() see the function return success,
but without setting any meaningful values in the object_info
struct, leading them to access potentially uninitialized
memory.
The easiest way to see the bug is via "cat-file -s", which
will happily ignore the corruption and report whatever
value happened to be in the "size" variable.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
These calls to snprintf should always succeed, because their
input is small and fixed. Let's use xsnprintf to make sure
this is the case (and to make auditing for actual truncation
easier).
These could be candidates for turning into heap buffers, but
they fall into a few broad categories that make it not worth
doing:
- formatting single numbers is simple enough that we can
see the result should fit
- the size of a sha1 is likewise well-known, and I didn't
want to cause unnecessary conflicts with the ongoing
process to convert these constants to GIT_MAX_HEXSZ
- the interface for curl_errorstr is dictated by curl
Signed-off-by: Jeff King <peff@peff.net>
Code clean-up.
* jk/pack-name-cleanups:
index-pack: make pointer-alias fallbacks safer
replace snprintf with odb_pack_name()
odb_pack_keep(): stop generating keepfile name
sha1_file.c: make pack-name helper globally accessible
move odb_* declarations out of git-compat-util.h
Since we will likely be introducing a new hash function at some point,
and that hash function might be longer than 20 bytes, use the constant
GIT_MAX_RAWSZ, which is designed to be suitable for allocations, instead
of GIT_SHA1_RAWSZ. This will ease the transition down the line by
distinguishing between places where we need to allocate memory suitable
for the largest hash from those where we need to handle the current
hash.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since we will likely be introducing a new hash function at some point,
and that hash function might be longer than 40 hex characters, use the
constant GIT_MAX_HEXSZ, which is designed to be suitable for
allocations, instead of GIT_SHA1_HEXSZ. This will ease the transition
down the line by distinguishing between places where we need to allocate
memory suitable for the largest hash from those where we need to handle
the current hash.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up.
* jk/pack-name-cleanups:
index-pack: make pointer-alias fallbacks safer
replace snprintf with odb_pack_name()
odb_pack_keep(): stop generating keepfile name
sha1_file.c: make pack-name helper globally accessible
move odb_* declarations out of git-compat-util.h
A leak in a codepath to read from a packed object in (rare) cases
has been plugged.
* rs/sha1-file-plug-fallback-base-leak:
sha1_file: release fallback base's memory in unpack_entry()
The experimental "split index" feature has gained a few
configuration variables to make it easier to use.
* cc/split-index-config: (22 commits)
Documentation/git-update-index: explain splitIndex.*
Documentation/config: add splitIndex.sharedIndexExpire
read-cache: use freshen_shared_index() in read_index_from()
read-cache: refactor read_index_from()
t1700: test shared index file expiration
read-cache: unlink old sharedindex files
config: add git_config_get_expiry() from gc.c
read-cache: touch shared index files when used
sha1_file: make check_and_freshen_file() non static
Documentation/config: add splitIndex.maxPercentChange
t1700: add tests for splitIndex.maxPercentChange
read-cache: regenerate shared index if necessary
config: add git_config_get_max_percent_split_change()
Documentation/git-update-index: talk about core.splitIndex config var
Documentation/config: add information for core.splitIndex
t1700: add tests for core.splitIndex
update-index: warn in case of split-index incoherency
read-cache: add and then use tweak_split_index()
split-index: add {add,remove}_split_index() functions
config: add git_config_get_split_index()
...
We provide sha1_pack_name() and sha1_pack_index_name(), but
the more generic form (which takes its own strbuf and an
arbitrary extension) is only used to implement the other
two. Let's make it available, but clean up a few things:
1. Name it odb_pack_name(), as the original
sha1_get_pack_name() is long but not all that
descriptive.
2. Switch the strbuf argument to the beginning, so that it
matches similar path-building functions like
git_path_buf().
3. Clean up the out-dated docstring and move it to the
public declaration.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A leak in a codepath to read from a packed object in (rare) cases
has been plugged.
* rs/sha1-file-plug-fallback-base-leak:
sha1_file: release fallback base's memory in unpack_entry()
This function will be used in a commit soon, so let's make
it available globally.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Deletion of a branch "foo/bar" could remove .git/refs/heads/foo
once there no longer is any other branch whose name begins with
"foo/", but we didn't do so so far. Now we do.
* mh/ref-remove-empty-directory: (23 commits)
files_transaction_commit(): clean up empty directories
try_remove_empty_parents(): teach to remove parents of reflogs, too
try_remove_empty_parents(): don't trash argument contents
try_remove_empty_parents(): rename parameter "name" -> "refname"
delete_ref_loose(): inline function
delete_ref_loose(): derive loose reference path from lock
log_ref_write_1(): inline function
log_ref_setup(): manage the name of the reflog file internally
log_ref_write_1(): don't depend on logfile argument
log_ref_setup(): pass the open file descriptor back to the caller
log_ref_setup(): improve robustness against races
log_ref_setup(): separate code for create vs non-create
log_ref_write(): inline function
rename_tmp_log(): improve error reporting
rename_tmp_log(): use raceproof_create_file()
lock_ref_sha1_basic(): use raceproof_create_file()
lock_ref_sha1_basic(): inline constant
raceproof_create_file(): new function
safe_create_leading_directories(): set errno on SCLD_EXISTS
safe_create_leading_directories_const(): preserve errno
...
If a pack entry that's used as a delta base is corrupt, unpack_entry()
marks it as unusable and then searches the object again in the hope that
it can be found in another pack or in a loose file. The memory for this
external base object is never released. Free it after use.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert each_loose_object_fn and each_packed_object_fn to take a pointer
to struct object_id. Update the various callbacks. Convert several
40-based constants to use GIT_SHA1_HEXSZ.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are places in the code where we would like to provide a struct
object_id *, yet read the hash directly from the pack. Provide an
nth_packed_object_oid function that is similar to the
nth_packed_object_sha1 function.
In order to avoid a potentially invalid cast, nth_packed_object_oid
provides a variable into which to store the value, which it returns on
success; on error, it returns NULL, as nth_packed_object_sha1 does.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A crashing bug introduced in v2.11 timeframe has been found (it is
triggerable only in fast-import) and fixed.
* jk/clear-delta-base-cache-fix:
clear_delta_base_cache(): don't modify hashmap while iterating
On Thu, Jan 19, 2017 at 03:03:46PM +0100, Ulrich Spörlein wrote:
> > I suspect the patch below may fix things for you. It works around it by
> > walking over the lru list (either is fine, as they both contain all
> > entries, and since we're clearing everything, we don't care about the
> > order).
>
> Confirmed. With the patch applied, I can import the whole 55G in one go
> without any crashes or aborts. Thanks much!
Thanks. Here it is rolled up with a commit message.
-- >8 --
Subject: clear_delta_base_cache(): don't modify hashmap while iterating
Removing entries while iterating causes fast-import to
access an already-freed `struct packed_git`, leading to
various confusing errors.
What happens is that clear_delta_base_cache() drops the
whole contents of the cache by iterating over the hashmap,
calling release_delta_base_cache() on each entry. That
function removes the item from the hashmap. The hashmap code
may then shrink the table, but the hashmap_iter struct
retains an offset from the old table.
As a result, the next call to hashmap_iter_next() may claim
that the iteration is done, even though some items haven't
been visited.
The only caller of clear_delta_base_cache() is fast-import,
which wants to clear the cache because it is discarding the
packed_git struct for its temporary pack. So by failing to
remove all of the entries, we still have references to the
freed packed_git.
To make things even more confusing, this doesn't seem to
trigger with the test suite, because it depends on
complexities like the size of the hash table, which entries
got cleared, whether we try to access them before they're
evicted from the cache, etc.
So I've been able to identify the problem with large
imports like freebsd's svn import, or a fast-export of
linux.git. But nothing that would be reasonable to run as
part of the normal test suite.
We can fix this easily by iterating over the lru linked list
instead of the hashmap. They both contain the same entries,
and we can use the "safe" variant of the list iterator,
which exists for exactly this case.
Let's also add a warning to the hashmap API documentation to
reduce the chances of getting bit by this again.
Reported-by: Ulrich Spörlein <uqs@freebsd.org>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git grep" has been taught to optionally recurse into submodules.
* bw/grep-recurse-submodules:
grep: search history of moved submodules
grep: enable recurse-submodules to work on <tree> objects
grep: optionally recurse into submodules
grep: add submodules as a grep source type
submodules: load gitmodules file from commit sha1
submodules: add helper to determine if a submodule is initialized
submodules: add helper to determine if a submodule is populated
real_path: canonicalize directory separators in root parts
real_path: have callers use real_pathdup and strbuf_realpath
real_path: create real_pathdup
real_path: convert real_path_internal to strbuf_realpath
real_path: resolve symlinks by hand
A recent update to receive-pack to make it easier to drop garbage
objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot
have a pathname with a colon in it (no surprise!), and this in turn
made it impossible to push into a repository at such a path. This
has been fixed by introducing a quoting mechanism used when
appending such a path to the colon-separated list.
* jk/quote-env-path-list-component:
t5615-alternate-env: double-quotes in file names do not work on Windows
t5547-push-quarantine: run the path separator test on Windows, too
tmp-objdir: quote paths we add to alternates
alternates: accept double-quoted paths
When a loose tree or commit is read by fsck (or any git
program), unpack_sha1_rest() checks whether there is extra
cruft at the end of the object file, after the zlib data.
Blobs that are streamed, however, do not have this check.
For normal git operations, it's not a big deal. We know the
sha1 and size checked out, so we have the object bytes we
wanted. The trailing garbage doesn't affect what we're
trying to do.
But since the point of fsck is to find corruption or other
problems, it should be more thorough. This patch teaches its
loose-sha1 reader to detect extra bytes after the zlib
stream and complain.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's surprisingly hard to ask the sha1_file code to open a
_specific_ incarnation of a loose object. Most of the
functions take a sha1, and loop over the various object
types (packed versus loose) and locations (local versus
alternates) at a low level.
However, some tools like fsck need to look at a specific
file. This patch gives them a function they can use to open
the loose object at a given path.
The implementation unfortunately ends up repeating bits of
related functions, but there's not a good way around it
without some major refactoring of the whole sha1_file stack.
We need to mmap the specific file, then partially read the
zlib stream to know whether we're streaming or not, and then
finally either stream it or copy the data to a buffer.
We can do that by assembling some of the more arcane
internal sha1_file functions, but we end up having to
essentially reimplement unpack_sha1_file(), along with the
streaming bits of check_sha1_signature().
Still, most of the ugliness is contained in the new
function, and the interface is clean enough that it may be
reusable (though it seems unlikely anything but git-fsck
would care about opening a specific file).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we fail to open a corrupt loose object, we report an
error and mention the filename via sha1_file_name().
However, that function will always give us a path in the
local repository, whereas the corrupt object may have come
from an alternate. The result is a very misleading error
message.
Teach the open_sha1_file() and stat_sha1_file() helpers to
pass back the path they found, so that we can report it
correctly.
Note that the pointers we return go to static storage (e.g.,
from sha1_file_name()), which is slightly dangerous.
However, these helpers are static local helpers, and the
names are used for immediately generating error messages.
The simplicity is an acceptable tradeoff for the danger.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The codeflow of setting NOATIME and CLOEXEC on file descriptors Git
opens has been simplified.
We may want to drop the tip one, but we'll see.
* jc/git-open-cloexec:
sha1_file: stop opening files with O_NOATIME
git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback
git_open(): untangle possible NOATIME and CLOEXEC interactions
Add a function that tries to create a file and any containing
directories in a way that is robust against races with other processes
that might be cleaning up empty directories at the same time.
The actual file creation is done by a callback function, which, if it
fails, should set errno to EISDIR or ENOENT according to the convention
of open(). raceproof_create_file() detects such failures, and
respectively either tries to delete empty directories that might be in
the way of the file or tries to create the containing directories. Then
it retries the callback function.
This function is not yet used.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The exit path for SCLD_EXISTS wasn't setting errno, which some callers
use to generate error messages for the user. Fix the problem and
document that the function sets errno correctly to help avoid similar
regressions in the future.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some implementations of free() change errno (even thought they
shouldn't):
https://sourceware.org/bugzilla/show_bug.cgi?id=17924
So preserve the errno from safe_create_leading_directories() across the
call to free().
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A recent update to receive-pack to make it easier to drop garbage
objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot
have a pathname with a colon in it (no surprise!), and this in turn
made it impossible to push into a repository at such a path. This
has been fixed by introducing a quoting mechanism used when
appending such a path to the colon-separated list.
* jk/quote-env-path-list-component:
t5615-alternate-env: double-quotes in file names do not work on Windows
t5547-push-quarantine: run the path separator test on Windows, too
tmp-objdir: quote paths we add to alternates
alternates: accept double-quoted paths
Migrate callers of real_path() who duplicate the retern value to use
real_pathdup or strbuf_realpath.
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We read lists of alternates from objects/info/alternates
files (delimited by newline), as well as from the
GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable
(delimited by colon or semi-colon, depending on the
platform).
There's no mechanism for quoting the delimiters, so it's
impossible to specify an alternate path that contains a
colon in the environment, or one that contains a newline in
a file. We've lived with that restriction for ages because
both alternates and filenames with colons are relatively
rare, and it's only a problem when the two meet. But since
722ff7f87 (receive-pack: quarantine objects until
pre-receive accepts, 2016-10-03), which builds on the
alternates system, every push causes the receiver to set
GIT_ALTERNATE_OBJECT_DIRECTORIES internally.
It would be convenient to have some way to quote the
delimiter so that we can represent arbitrary paths.
The simplest thing would be an escape character before a
quoted delimiter (e.g., "\:" as a literal colon). But that
creates a backwards compatibility problem: any path which
uses that escape character is now broken, and we've just
shifted the problem. We could choose an unlikely escape
character (e.g., something from the non-printable ASCII
range), but that's awkward to use.
Instead, let's treat names as unquoted unless they begin
with a double-quote, in which case they are interpreted via
our usual C-stylke quoting rules. This also breaks
backwards-compatibility, but in a smaller way: it only
matters if your file has a double-quote as the very _first_
character in the path (whereas an escape character is a
problem anywhere in the path). It's also consistent with
many other parts of git, which accept either a bare pathname
or a double-quoted one, and the sender can choose to quote
or not as required.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix a corner-case regression in a topic that graduated during the
v2.11 cycle.
* jk/alt-odb-cleanup:
alternates: re-allow relative paths from environment
Commit 670c359da (link_alt_odb_entry: handle normalize_path
errors, 2016-10-03) regressed the handling of relative paths
in the GIT_ALTERNATE_OBJECT_DIRECTORIES variable. It's not
entirely clear this was ever meant to work, but it _has_
worked for several years, so this commit restores the
original behavior.
When we get a path in GIT_ALTERNATE_OBJECT_DIRECTORIES, we
add it the path to the list of alternate object directories
as if it were found in objects/info/alternates, but with one
difference: we do not provide the link_alt_odb_entry()
function with a base for relative paths. That function
doesn't turn it into an absolute path, and we end up feeding
the relative path to the strbuf_normalize_path() function.
Most relative paths break out of the top-level directory
(e.g., "../foo.git/objects"), and thus normalizing fails.
Prior to 670c359da, we simply ignored the error, and due to
the way normalize_path_copy() was implemented it happened to
return the original path in this case. We then accessed the
alternate objects using this relative path.
By storing the relative path in the alt_odb list, the path
is relative to wherever we happen to be at the time we do an
object lookup. That means we look from $GIT_DIR in a bare
repository, and from the top of the worktree in a non-bare
repository.
If this were being designed from scratch, it would make
sense to pick a stable location (probably $GIT_DIR, or even
the object directory) and use that as the relative base,
turning the result into an absolute path. However, given
the history, at this point the minimal fix is to match the
pre-670c359da behavior.
We can do this simply by ignoring the error when we have no
relative base and using the original value (which we now
reliably have, thanks to strbuf_normalize_path()).
That still leaves us with a relative path that foils our
duplicate detection, and may act strangely if we ever
chdir() later in the process. We could solve that by storing
an absolute path based on getcwd(). That may be a good
future direction; for now we'll do just the minimum to fix
the regression.
The new t5615 script demonstrates the fix in its final three
tests. Since we didn't have any tests of the alternates
environment variable at all, it also adds some tests of
absolute paths.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
When we open object files, we try to do so with O_NOATIME.
This dates back to 144bde78e9 (Use O_NOATIME when opening
the sha1 files., 2005-04-23), which is an optimization to
avoid creating a bunch of dirty inodes when we're accessing
many objects. But a few things have changed since then:
1. In June 2005, git learned about packfiles, which means
we would do a lot fewer atime updates (rather than one
per object access, we'd generally get one per packfile).
2. In late 2006, Linux learned about "relatime", which is
generally the default on modern installs. So
performance around atimes updates is a non-issue there
these days.
All the world isn't Linux, but as it turns out, Linux
is the only platform to implement O_NOATIME in the
first place.
So it's very unlikely that this code is helping anybody
these days.
Helped-by: Jeff King <peff@peff.net>
[jc: took idea and log message from peff]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A platform might not support open(2) with O_CLOEXEC but may support
telling the same with fcntl(2) to flip FD_CLOEXEC bit on on an open
file descriptor. It is a fallback that is inherently racy and this
may not be worth doing, though.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git generally does not explicitly close file descriptors that were
open in the parent process when spawning a child process, but most
of the time the child does not want to access them. As Windows does
not allow removing or renaming a file that has a file descriptor
open, a slow-to-exit child can even break the parent process by
holding onto them. Use O_CLOEXEC flag to open files in various
codepaths.
* ls/git-open-cloexec:
read-cache: make sure file handles are not inherited by child processes
sha1_file: open window into packfiles with O_CLOEXEC
sha1_file: rename git_open_noatime() to git_open()
When fetching from a remote that has many tags that are irrelevant
to branches we are following, we used to waste way too many cycles
when checking if the object pointed at by a tag (that we are not
going to fetch!) exists in our repository too carefully.
* jk/fetch-quick-tag-following:
fetch: use "quick" has_sha1_file for tag following
The way we structured the fallback/retry mechanism for opening with
O_NOATIME and O_CLOEXEC meant that if we failed due to lack of
support to open the file with O_NOATIME option (i.e. EINVAL), we
would still try to drop O_CLOEXEC first and retry, and then drop
O_NOATIME. A platform on which O_NOATIME is defined in the header
without support from the kernel wouldn't have a chance to open with
O_CLOEXEC option due to this code structure.
Arguably, O_CLOEXEC is more important than O_NOATIME, as the latter
is mostly about performance, while the former can affect correctness.
Instead use O_CLOEXEC to open the file, and then use fcntl(2) to set
O_NOATIME on the resulting file descriptor. open(2) itself does not
cause atime to be updated according to Linus [*1*].
The helper to do the former can be usable in the codepath in
ce_compare_data() that was recently added to open a file descriptor
with O_CLOEXEC; use it while we are at it.
*1* <CA+55aFw83E+zOd+z5h-CA-3NhrLjVr-anL6pubrSWttYx3zu8g@mail.gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Updates the way approximate count of total objects is computed
while attempting to come up with a unique abbreviated object name,
which in turn needs to estimate how many hexdigits are necessary to
ensure uniqueness.
* jk/abbrev-auto:
find_unique_abbrev: move logic out of get_short_sha1()
When fetching from a remote that has many tags that are irrelevant
to branches we are following, we used to waste way too many cycles
when checking if the object pointed at by a tag (that we are not
going to fetch!) exists in our repository too carefully.
* jk/fetch-quick-tag-following:
fetch: use "quick" has_sha1_file for tag following
All processes that the Git main process spawns inherit the open file
descriptors of the main process. These leaked file descriptors can
cause problems.
Use the O_CLOEXEC flag similar to 05d1ed61 to fix the leaked file
descriptors.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function is meant to be used when reading from files in the
object store, and the original objective was to avoid smudging atime
of loose object files too often, hence its name. Because we'll be
extending its role in the next commit to also arrange the file
descriptors they return auto-closed in the child processes, rename
it to lose "noatime" part that is too specific.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Codepaths involved in interacting alternate object store have
been cleaned up.
* jk/alt-odb-cleanup:
alternates: use fspathcmp to detect duplicates
sha1_file: always allow relative paths to alternates
count-objects: report alternates via verbose mode
fill_sha1_file: write into a strbuf
alternates: store scratch buffer as strbuf
fill_sha1_file: write "boring" characters
alternates: use a separate scratch space
alternates: encapsulate alt->base munging
alternates: provide helper for allocating alternate
alternates: provide helper for adding to alternates list
link_alt_odb_entry: refactor string handling
link_alt_odb_entry: handle normalize_path errors
t5613: clarify "too deep" recursion tests
t5613: do not chdir in main process
t5613: whitespace/style cleanups
t5613: use test_must_fail
t5613: drop test_valid_repo function
t5613: drop reachable_via function
When we auto-follow tags in a fetch, we look at all of the
tags advertised by the remote and fetch ones where we don't
already have the tag, but we do have the object it peels to.
This involves a lot of calls to has_sha1_file(), some of
which we can reasonably expect to fail. Since 45e8a74
(has_sha1_file: re-check pack directory before giving up,
2013-08-30), this may cause many calls to
reprepare_packed_git(), which is potentially expensive.
This has gone unnoticed for several years because it
requires a fairly unique setup to matter:
1. You need to have a lot of packs on the client side to
make reprepare_packed_git() expensive (the most
expensive part is finding duplicates in an unsorted
list, which is currently quadratic).
2. You need a large number of tag refs on the server side
that are candidates for auto-following (i.e., that the
client doesn't have). Each one triggers a re-read of
the pack directory.
3. Under normal circumstances, the client would
auto-follow those tags and after one large fetch, (2)
would no longer be true. But if those tags point to
history which is disconnected from what the client
otherwise fetches, then it will never auto-follow, and
those candidates will impact it on every fetch.
So when all three are true, each fetch pays an extra
O(nr_tags * nr_packs^2) cost, mostly in string comparisons
on the pack names. This was exacerbated by 47bf4b0
(prepare_packed_git_one: refactor duplicate-pack check,
2014-06-30) which uses a slightly more expensive string
check, under the assumption that the duplicate check doesn't
happen very often (and it shouldn't; the real problem here
is how often we are calling reprepare_packed_git()).
This patch teaches fetch to use HAS_SHA1_QUICK to sacrifice
accuracy for speed, in cases where we might be racy with a
simultaneous repack. This is similar to the fix in 0eeb077
(index-pack: avoid excessive re-reading of pack directory,
2015-06-09). As with that case, it's OK for has_sha1_file()
occasionally say "no I don't have it" when we do, because
the worst case is not a corruption, but simply that we may
fail to auto-follow a tag that points to it.
Here are results from the included perf script, which sets
up a situation similar to the one described above:
Test HEAD^ HEAD
----------------------------------------------------------
5550.4: fetch 11.21(10.42+0.78) 0.08(0.04+0.02) -99.3%
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Codepaths that read from an on-disk loose object were too loose in
validating what they are reading is a proper object file and
sometimes read past the data they read from the disk, which has
been corrected. H/t to Gustavo Grieco for reporting.
* jc/verify-loose-object-header:
unpack_sha1_header(): detect malformed object header
streaming: make sure to notice corrupt object
"git pack-objects" in a repository with many packfiles used to
spend a lot of time looking for/at objects in them; the accesses to
the packfiles are now optimized by checking the most-recently-used
packfile first.
* jk/pack-objects-optim-mru:
pack-objects: use mru list when iterating over packs
pack-objects: break delta cycles before delta-search phase
sha1_file: make packed_object_info public
provide an initializer for "struct object_info"
On a case-insensitive filesystem, we should realize that
"a/objects" and "A/objects" are the same path. We already
use fspathcmp() to check against the main object directory,
but until recently we couldn't use it for comparing against
other alternates (because their paths were not
NUL-terminated strings). But now we can, so let's do so.
Note that we also need to adjust count-objects to load the
config, so that it can see the setting of core.ignorecase
(this is required by the test, but is also a general bugfix
for users of count-objects).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We recursively expand alternates repositories, so that if A
borrows from B which borrows from C, A can see all objects.
For the root object database, we allow relative paths, so A
can point to B as "../B/objects". However, we currently do
not allow relative paths when recursing, so B must use an
absolute path to reach C.
That is an ancient protection from c2f493a (Transitively
read alternatives, 2006-05-07) that tries to avoid adding
the same alternate through two different paths. Since
5bdf0a8 (sha1_file: normalize alt_odb path before comparing
and storing, 2011-09-07), we use a normalized absolute path
for each alt_odb entry.
This means that in most cases the protection is no longer
necessary; we will detect the duplicate no matter how we got
there (but see below). And it's a good idea to get rid of
it, as it creates an unnecessary complication when setting
up recursive alternates (B has to know that A is going to
borrow from it and make sure to use an absolute path).
Note that our normalization doesn't actually look at the
filesystem, so it can still be fooled by crossing symbolic
links. But that's also true of absolute paths, so it's not a
good reason to disallow only relative paths (it's
potentially a reason to switch to real_path(), but that's a
separate and non-trivial change).
We adjust the test script here to demonstrate that this now
works, and add new tests to show that the normalization does
indeed suppress duplicates.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's currently the responsibility of the caller to give
fill_sha1_file() enough bytes to write into, leading them to
manually compute the required lengths. Instead, let's just
write into a strbuf so that it's impossible to get this
wrong.
The alt_odb caller already has a strbuf, so this makes
things strictly simpler. The other caller, sha1_file_name(),
uses a static PATH_MAX buffer and dies when it would
overflow. We can convert this to a static strbuf, which
means our allocation cost is amortized (and as a bonus, we
no longer have to worry about PATH_MAX being too short for
normal use).
This does introduce some small overhead in fill_sha1_file(),
as each strbuf_addchar() will check whether it needs to
grow. However, between the optimization in fec501d
(strbuf_addch: avoid calling strbuf_grow, 2015-04-16) and
the fact that this is not generally called in a tight loop
(after all, the next step is typically to access the file!)
this probably doesn't matter. And even if it did, the right
place to micro-optimize is inside fill_sha1_file(), by
calling a single strbuf_grow() there.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The submodule code wants to temporarily add an alternate
object store to our in-memory alt_odb list, but does it
manually. Let's provide a helper so it can reuse the code in
link_alt_odb_entry().
While we're adding our new add_to_alternates_memory(), let's
document add_to_alternates_file(), as the two are related.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We pre-size the scratch buffer to hold a loose object
filename of the form "xx/yyyy...", which leads to allocation
code that is hard to verify. We have to use some magic
numbers during the initial allocation, and then writers must
blindly assume that the buffer is big enough. Using a strbuf
makes it more clear that we cannot overflow.
Unfortunately, we do still need some magic numbers to grow
our strbuf before calling fill_sha1_path(), but the strbuf
growth is much closer to the point of use. This makes it
easier to see that it's correct, and opens the possibility
of pushing it even further down if fill_sha1_path() learns
to work on strbufs.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The string handling in link_alt_odb_entry() is mostly an
artifact of the original version, which took the path as a
ptr/len combo, and did not have a NUL-terminated string
until we created one in the alternate_object_database
struct. But since 5bdf0a8 (sha1_file: normalize alt_odb
path before comparing and storing, 2011-09-07), the first
thing we do is put the path into a strbuf, which gives us
some easy opportunities for cleanup.
In particular:
- we call strlen(pathbuf.buf), which is silly; we can look
at pathbuf.len.
- even though we have a strbuf, we don't maintain its
"len" field when chomping extra slashes from the
end, and instead keep a separate "pfxlen" variable. We
can fix this and then drop "pfxlen" entirely.
- we don't check whether the path is usable until after we
allocate the new struct, making extra cleanup work for
ourselves. Since we have a NUL-terminated string, we can
bump the "is it usable" checks higher in the function.
While we're at it, we can move that logic to its own
helper, which makes the flow of link_alt_odb_entry()
easier to follow.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function forms a sha1 as "xx/yyyy...", but skips over
the slot for the slash rather than writing it, leaving it to
the caller to do so. It also does not bother to put in a
trailing NUL, even though every caller would want it (we're
forming a path which by definition is not a directory, so
the only thing to do with it is feed it to a system call).
Let's make the lives of our callers easier by just writing
out the internal "/" and the NUL.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we add a new alternate to the list, we try to normalize
out any redundant "..", etc. However, we do not look at the
return value of normalize_path_copy(), and will happily
continue with a path that could not be normalized. Worse,
the normalizing process is done in-place, so we are left
with whatever half-finished working state the normalizing
function was in.
Fortunately, this cannot cause us to read past the end of
our buffer, as that working state will always leave the
NUL from the original path in place. And we do tend to
notice problems when we check is_directory() on the path.
But you can see the nonsense that we feed to is_directory
with an entry like:
this/../../is/../../way/../../too/../../deep/../../to/../../resolve
in your objects/info/alternates, which yields:
error: object directory
/to/e/deep/too/way//ects/this/../../is/../../way/../../too/../../deep/../../to/../../resolve
does not exist; check .git/objects/info/alternates.
We can easily fix this just by checking the return value.
But that makes it hard to generate a good error message,
since we're normalizing in-place and our input value has
been overwritten by cruft.
Instead, let's provide a strbuf helper that does an in-place
normalize, but restores the original contents on error. This
uses a second buffer under the hood, which is slightly less
efficient, but this is not a performance-critical code path.
The strbuf helper can also properly set the "len" parameter
of the strbuf before returning. Just doing:
normalize_path_copy(buf.buf, buf.buf);
will shorten the string, but leave buf.len at the original
length. That may be confusing to later code which uses the
strbuf.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The alternate_object_database struct uses a single buffer
both for storing the path to the alternate, and as a scratch
buffer for forming object names. This is efficient (since
otherwise we'd end up storing the path twice), but it makes
life hard for callers who just want to know the path to the
alternate. They have to remember to stop reading after
"alt->name - alt->base" bytes, and to subtract one for the
trailing '/'.
It would be much simpler if they could simply access a
NUL-terminated path string. We could encapsulate this in a
function which puts a NUL in the scratch buffer and returns
the string, but that opens up questions about the lifetime
of the result. The first time another caller uses the
alternate, the scratch buffer may get other data tacked onto
it.
Let's instead just store the root path separately from the
scratch buffer. There aren't enough alternates being stored
for the duplicated data to matter for performance, and this
keeps things simple and safe for the callers.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The alternate_object_database struct holds a path to the
alternate objects, but we also use that buffer as scratch
space for forming loose object filenames. Let's pull that
logic into a helper function so that we can more easily
modify it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allocating a struct alternate_object_database is tricky, as
we must over-allocate the buffer to provide scratch space,
and then put in particular '/' and NUL markers.
Let's encapsulate this in a function so that the complexity
doesn't leak into callers (and so that we can modify it
later).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The get_short_sha1() is only about reading short sha1s; we
do call it in a loop to check "is this long enough" for each
object, but otherwise it should not need to know about
things like our default_abbrev setting.
So instead of asking it to set default_automatic_abbrev as a
side-effect, let's just have find_unique_abbrev() pick the
right place to start its loop. This requires a separate
approximate_object_count() function, but that naturally
belongs with the rest of sha1_file.c.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Codepaths that read from an on-disk loose object were too loose in
validating what they are reading is a proper object file and
sometimes read past the data they read from the disk, which has
been corrected. H/t to Gustavo Grieco for reporting.
* jc/verify-loose-object-header:
unpack_sha1_header(): detect malformed object header
streaming: make sure to notice corrupt object
When opening a loose object file, we often do this sequence:
- prepare a short buffer for the object header (on stack)
- call unpack_sha1_header() and have early part of the object data
inflated, enough to fill the buffer
- parse that data in the short buffer, assuming that the first part
of the object is <typename> SP <length> NUL
Because the parsing function parse_sha1_header_extended() is not
given the number of bytes inflated into the header buffer, it you
craft a file whose early part inflates a garbage sequence without SP
or NUL, and replace a loose object with it, it will end up reading
past the end of the inflated data.
To correct this, do the following four things:
- rename unpack_sha1_header() to unpack_sha1_short_header() and
have unpack_sha1_header_to_strbuf() keep calling that as its
helper function. This will detect and report zlib errors, but is
not aware of the format of a loose object (as before).
- introduce unpack_sha1_header() that calls the same helper
function, and when zlib reports it inflated OK into the buffer,
check if the inflated data has NUL. This would ensure that
parsing function will terminate within the buffer that holds the
inflated header.
- update unpack_sha1_header_to_strbuf() to check if the resulting
buffer has NUL for the same effect.
- update parse_sha1_header_extended() to make sure that its loop to
find the SP that terminates the <typename> stops at NUL.
Essentially, this makes unpack_*() functions that are asked to
unpack a loose object header to be a bit more strict and detect an
input that cannot possibly be a valid object header, even before the
parsing function kicks in.
Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Recently we updated the code to manage the in-core cache that holds
objects that have recently been used to reconstitute other objects
that are stored as deltas against them, but the update used an
incorrect API function to manage the list of these objects. This
has been fixed.
* jk/delta-base-cache:
add_delta_base_cache: use list_for_each_safe
Sort the linked list of packs directly using llist_mergesort() instead
of building an array, calling qsort(3) and fixing up the list pointers.
This is shorter and less complicated.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "git diff --submodule={short,log}" mechanism has been enhanced
to allow "--submodule=diff" to show the patch between the submodule
commits bound to the superproject.
* jk/diff-submodule-diff-inline:
diff: teach diff to display submodule difference with an inline diff
submodule: refactor show_submodule_summary with helper function
submodule: convert show_submodule_summary to use struct object_id *
allow do_submodule_path to work even if submodule isn't checked out
diff: prepare for additional submodule formats
graph: add support for --line-prefix on all graph-aware output
diff.c: remove output_prefix_length field
cache: add empty_tree_oid object and helper function
We may remove elements from the list while we are iterating,
which requires using a second temporary pointer. Otherwise
stepping to the next element of the list might involve
looking at freed memory (which generally works in practice,
as we _just_ freed it, but of course is wrong to rely on;
valgrind notices it).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git clone --resurse-submodules --reference $path $URL" is a way to
reduce network transfer cost by borrowing objects in an existing
$path repository when cloning the superproject from $URL; it
learned to also peek into $path for presense of corresponding
repositories of submodules and borrow objects from there when able.
* sb/submodule-clone-rr:
clone: recursive and reference option triggers submodule alternates
clone: implement optional references
clone: clarify option_reference as required
clone: factor out checking for an alternate path
submodule--helper update-clone: allow multiple references
submodule--helper module-clone: allow multiple references
t7408: merge short tests, factor out testing method
t7408: modernize style
Similar to is_null_oid(), and is_empty_blob_sha1() add an
empty_tree_oid along with helper function is_empty_tree_oid(). For
completeness, also add an "is_empty_tree_sha1()",
"is_empty_blob_sha1()", "is_empty_tree_oid()" and "is_empty_blob_oid()"
helpers.
To ensure we only get one singleton, implement EMPTY_BLOB_SHA1_BIN as
simply getting the hash of empty_blob_oid structure.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The fundamental data structure of the delta base cache is a
hash table mapping pairs of "(packfile, offset)" into
structs containing the actual object data. The hash table
implementation dates back to e5e0161 (Implement a simple
delta_base cache, 2007-03-17), and uses a fixed-size table.
The current size is a hard-coded 256 entries.
Because we need to be able to remove objects from the hash
table, entry lookup does not do any kind of probing to
handle collisions. Colliding items simply replace whatever
is in their slot. As a result, we have fewer usable slots
than even the 256 we allocate. At half full, each new item
has a 50% chance of displacing another one. Or another way
to think about it: every item has a 1/256 chance of being
ejected due to hash collision, without regard to our LRU
strategy.
So it would be interesting to see the effect of increasing
the cache size on the runtime for some common operations. As
with the previous patch, we'll measure "git log --raw" for
tree-only operations, and "git log -Sfoo --raw" for
operations that touch trees and blobs. All times are
wall-clock best-of-3, done against fully packed repos with
--depth=50, and the default core.deltaBaseCacheLimit of
96MB.
Here are timings for various values of MAX_DELTA_CACHE
against git.git (the asterisk marks the minimum time for
each operation):
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m02.227s 0m12.821s
512 0m02.143s 0m10.602s
1024 0m02.127s 0m08.642s
2048 0m02.148s 0m07.123s
4096 0m02.194s 0m06.448s*
8192 0m02.239s 0m06.504s
16384 0m02.144s* 0m06.502s
32768 0m02.202s 0m06.622s
65536 0m02.230s 0m06.677s
The log-raw case isn't changed much at all here (probably
because our trees just aren't that big in the first place,
or possibly because we have so _few_ trees in git.git that
the 256-entry cache is enough). But once we start putting
blobs in the cache, too, we see a big improvement (almost
50%). The curve levels off around 4096, which means that we
can hold about that many entries before hitting the 96MB
memory limit (or possibly that the workload is small enough
that there is simply no more work to be optimized out by
caching more).
(As a side note, I initially timed my existing git.git pack,
which was a base of --aggressive combined with some pulls on
top. So it had quite a few deeper delta chains. The
256-cache case was more like 15s, and it still dropped to
~6.5s in the same way).
Here are the timings for linux.git:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m41.661s 5m12.410s
512 0m39.547s 5m07.920s
1024 0m37.054s 4m54.666s
2048 0m35.871s 4m41.194s*
4096 0m34.646s 4m51.648s
8192 0m33.881s 4m55.342s
16384 0m35.190s 5m00.122s
32768 0m35.060s 4m58.851s
65536 0m33.311s* 4m51.420s
As we grow we see a nice 20% speedup in the tree traversal,
and more modest 10% in the log-S. This is probably an
indication that we are bound less by the number of entries,
and more by the memory limit (more on that below). What is
interesting is that the numbers bounce around a bit;
increasing the number of entries isn't always a strict
improvement.
Partially this is due to noise in the measurement. But it
may also be an indication that our LRU ejection scheme is
not optimal. The smaller cache sizes introduce some
randomness into the ejection (due to collisions), which may
sometimes work in our favor (and sometimes not!).
So what is the optimal setting of MAX_DELTA_CACHE? The
"bouncing" in the linux.git log-S numbers notwithstanding,
it mostly seems like bigger is better. And even if we were
to try to find a "sweet spot", these are just two
repositories, that are not necessarily representative. The
shape of history, the size of trees and blobs, the memory
limit configuration, etc, all will affect the outcome.
Rather than trying to find the "right" number, another
strategy is to just switch to a hash table that can actually
store collisions: namely our hashmap.h implementation.
Here are numbers for that compared to the "best" we saw from
adjusting MAX_DELTA_CACHE:
| log-raw | log-S
| best hashmap | best hashmap
| --------- --------- | --------- ---------
git | 0m02.144s 0m02.144s | 0m06.448s 0m06.688s
linux | 0m33.311s 0m33.092s | 4m41.194s 4m57.172s
We can see the results are similar in most cases, which is
what we'd expect. We're not ejecting due to collisions at
all, so this is purely representing the LRU. So really, we'd
expect this to model most closely the larger values of the
static MAX_DELTA_CACHE limit. And that does seem to be
what's happening, including the "bounce" in the linux log-S
case.
So while the value for that case _isn't_ as good as the
optimal one measured above (which was 2048 entries), given
the bouncing I'm hesitant to suggest that 2048 is any kind
of optimum (not even for linux.git, let alone as a general
rule). The generic hashmap has the appeal that it drops the
number of tweakable numbers by one, which means we can focus
on tuning other elements, like the LRU strategy or the
core.deltaBaseCacheLimit setting.
And indeed, if we bump the cache limit to 1G (which is
probably silly for general use, but maybe something people
with big workstations would want to do), the linux.git log-S
time drops to 3m32s. That's something you really _can't_ do
easily with the static hash table, because the number of
entries needs to grow in proportion to the memory limit (so
2048 is almost certainly not going to be the right value
there).
This patch takes that direction, and drops the static hash
table entirely in favor of using the hashmap.h API.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the delta base cache runs out of allowed memory, it has
to drop entries. It does so by walking an LRU list, dropping
objects until we are under the memory limit. But we actually
walk the list twice: once to drop blobs, and then again to
drop other objects (which are generally trees). This comes
from 18bdec1 (Limit the size of the new delta_base_cache,
2007-03-19).
This performs poorly as the number of entries grows, because
any time dropping blobs does not satisfy the limit, we have
to walk the _entire_ list, trees included, looking for blobs
to drop, before starting to drop any trees.
It's not generally a problem now, as the cache is limited to
only 256 entries. But as we could benefit from increasing
that in a future patch, it's worth looking at how it
performs as the cache size grows. And the answer is "not
well".
The table below shows times for various operations with
different values of MAX_DELTA_CACHE (which is not a run-time
knob; I recompiled with -DMAX_DELTA_CACHE=$n for each).
I chose "git log --raw" ("log-raw" in the table) because it
will access all of the trees, but no blobs at all (so in a
sense it is a worst case for this problem, because we will
always walk over the entire list of trees once before
realizing there are no blobs to drop). This is also
representative of other tree-only operations like "rev-list
--objects" and "git log -- <path>".
I also timed "git log -Sfoo --raw" ("log-S" in the table).
It similarly accesses all of the trees, but also the blobs
for each commit. It's representative of "git log -p", though
it emphasizes the cost of blob access more, as "-S" is
cheaper than computing an actual blob diff.
All timings are best-of-3 wall-clock times (though they all
were CPU bound, so the user CPU times are similar). The
repositories were fully packed with --depth=50, and the
default core.deltaBaseCacheLimit of 96M was in effect. The
current value of MAX_DELTA_CACHE is 256, so I started there
and worked up by factors of 2.
First, here are values for git.git (the asterisk signals the
fastest run for each operation):
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m02.212s 0m12.634s
512 0m02.136s* 0m10.614s
1024 0m02.156s 0m08.614s
2048 0m02.208s 0m07.062s
4096 0m02.190s 0m06.484s*
8192 0m02.176s 0m07.635s
16384 0m02.913s 0m19.845s
32768 0m03.617s 1m05.507s
65536 0m04.031s 1m18.488s
You can see that for the tree-only log-raw case, we don't
actually benefit that much as the cache grows (all the
differences up through 8192 are basically just noise; this
is probably because we don't actually have that many
distinct trees in git.git). But for log-S, we get a definite
speed improvement as the cache grows, but the improvements
are lost as cache size grows and the linear LRU management
starts to dominate.
Here's the same thing run against linux.git:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ----------
256 0m40.987s 5m13.216s
512 0m37.949s 5m03.243s
1024 0m35.977s 4m50.580s
2048 0m33.855s 4m39.818s
4096 0m32.913s 4m47.299s*
8192 0m32.176s* 5m14.650s
16384 0m32.185s 6m31.625s
32768 0m38.056s 9m31.136s
65536 1m30.518s 17m38.549s
The pattern is similar, though the effect in log-raw is more
pronounced here. The times dip down in the middle, and then
go back up as we keep growing.
So we know there's a problem. What's the solution?
The obvious one is to improve the data structure to avoid
walking over tree entries during the looking-for-blobs
traversal. We can do this by keeping _two_ LRU lists: one
for blobs, and one for other objects. We drop items from the
blob LRU first, and then from the tree LRU (if necessary).
Here's git.git using that strategy:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ----------
256 0m02.264s 0m12.830s
512 0m02.201s 0m10.771s
1024 0m02.181s 0m08.593s
2048 0m02.205s 0m07.116s
4096 0m02.158s 0m06.537s*
8192 0m02.213s 0m07.246s
16384 0m02.155s* 0m10.975s
32768 0m02.159s 0m16.047s
65536 0m02.181s 0m16.992s
The upswing on log-raw is gone completely. But log-S still
has it (albeit much better than without this strategy).
Let's see what linux.git shows:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m42.519s 5m14.654s
512 0m39.106s 5m04.708s
1024 0m36.802s 4m51.454s
2048 0m34.685s 4m39.378s*
4096 0m33.663s 4m44.047s
8192 0m33.157s 4m50.644s
16384 0m33.090s* 4m49.648s
32768 0m33.458s 4m53.371s
65536 0m33.563s 5m04.580s
The results are similar. The tree-only case again performs
well (not surprising; we're literally just dropping the one
useless walk, and not otherwise changing the cache eviction
strategy at all). But the log-S case again does a bit worse
as the cache grows (though possibly that's within the noise,
which is much larger for this case).
Perhaps this is an indication that the "remove blobs first"
strategy is not actually optimal. The intent of it is to
avoid blowing out the tree cache when we see large blobs,
but it also means we'll throw away useful, recent blobs in
favor of older trees.
Let's run the same numbers without caring about object type
at all (i.e., one LRU list, and always evicting whatever is
at the head, regardless of type).
Here's git.git:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m02.227s 0m12.821s
512 0m02.143s 0m10.602s
1024 0m02.127s 0m08.642s
2048 0m02.148s 0m07.123s
4096 0m02.194s 0m06.448s*
8192 0m02.239s 0m06.504s
16384 0m02.144s* 0m06.502s
32768 0m02.202s 0m06.622s
65536 0m02.230s 0m06.677s
Much smoother; there's no dramatic upswing as we increase
the cache size (some remains, though it's small enough that
it's mostly run-to-run noise. E.g., in the log-raw case,
note how 8192 is 50-100ms higher than its neighbors). Note
also that we stop getting any real benefit for log-S after
about 4096 entries; that number will depend on the size of
the repository, the size of the blob entries, and the memory
limit of the cache.
Let's see what linux.git shows for the same strategy:
MAX_DELTA_CACHE log-raw log-S
--------------- --------- ---------
256 0m41.661s 5m12.410s
512 0m39.547s 5m07.920s
1024 0m37.054s 4m54.666s
2048 0m35.871s 4m41.194s*
4096 0m34.646s 4m51.648s
8192 0m33.881s 4m55.342s
16384 0m35.190s 5m00.122s
32768 0m35.060s 4m58.851s
65536 0m33.311s* 4m51.420s
It's similarly good. As with the "separate blob LRU"
strategy, there's a lot of noise on the log-S run here. But
it's certainly not any worse, is possibly a bit better, and
the improvement over "separate blob LRU" on the git.git case
is dramatic.
So it seems like a clear winner, and that's what this patch
implements.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We keep an LRU list of entries for when we need to drop
something from an over-full cache. The list is implemented
as a circular doubly-linked list, which is exactly what
list.h provides. We can save a few lines by using the list.h
macros and functions. More importantly, this makes the code
easier to follow, as the reader sees explicit concepts like
"list_add_tail()" instead of pointer manipulation.
As a bonus, the list_entry() macro lets us place the lru
pointers anywhere inside the delta_base_cache_entry struct
(as opposed to just casting the pointer, which requires it
at the front of the struct). This will be useful in later
patches when we need to place other items at the front of
the struct (e.g., our hashmap implementation requires this).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function drops an entry entirely from the cache,
meaning that aside from the freeing of the buffer, it is
exactly equivalent to detach_delta_base_cache_entry(). Let's
build on top of the detach function, which shortens the code
and will make it simpler when we change out the underlying
storage in future patches.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The delta base cache entries are stored in a fixed-length
hash table. So the way to remove an entry is to "clear" the
slot in the table, and that is what this function does.
However, the name is a leaky abstraction. If we were to
change the hash table implementation, it would no longer be
about "clearing". We should name it after _what_ it does,
not _how_ it does it. I.e., something like "remove" instead
of "clear".
But that does not tell the whole story, either. The subtle
thing about this function is that it removes the entry, but
does not free the entry data. So a more descriptive name is
"detach"; we give ownership of the data buffer to the
caller, and remove any other resources.
This patch uses the name detach_delta_base_cache_entry().
We could further model this after functions like
strbuf_detach(), which pass back all of the detached
information. However, since there are so many bits of
information in the struct (the data, the size, the type),
and so few callers (only one), it's not worth that
awkwardness. The name change and a comment can make the
intent clear.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There is only one caller of cache_or_unpack_entry() and it
always passes 1 for the keep_cache parameter. We can
simplify it by dropping the "!keep_cache" case.
Another call, which did pass 0, was dropped in abe601b
(sha1_file: remove recursion in unpack_entry, 2013-03-27),
as unpack_entry() now does more complicated things than a
simple unpack when there is a cache miss.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a later patch we want to determine if a path is suitable as an
alternate from other commands than builtin/clone. Move the checking
functionality of `add_one_reference` to `compute_alternate_path` that is
defined in cache.h.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some code may have a pack/offset pair for an object, but
would like to look up more information. Using
sha1_object_info() is too heavy-weight; it starts from the
sha1 and has to find the pack again (so not only does it waste
time, it might not even find the same instance).
In some cases, this problem is solved by helpers like
get_size_from_delta(), which is used by pack-objects to take
a shortcut for objects whose packed representation has
already been found. But there's no similar function for
getting the object type, for instance. Rather than introduce
one, let's just make the whole packed_object_info() available.
It is smart enough to spend effort only on the items the
caller wants.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
An all-zero initializer is fine for this struct, but because
the first element is a pointer, call sites need to know to
use "NULL" instead of "0". Otherwise some static checkers
like "sparse" will complain; see d099b71 (Fix some sparse
warnings, 2013-07-18) for example. So let's provide an
initializer to make this easier to get right.
But let's also comment that memset() to zero is explicitly
OK[1]. One of the callers embeds object_info in another
struct which is initialized via memset (expand_data in
builtin/cat-file.c). Since our subset of C doesn't allow
assignment from a compound literal, handling this in any
other way is awkward, so we'd like to keep the ability to
initialize by memset(). By documenting this property, it
should make anybody who wants to change the initializer
think twice before doing so.
There's one other caller of interest. In parse_sha1_header(),
we did not initialize the struct fully in the first place.
This turned out not to be a bug because the sub-function it
calls does not look at any other fields except the ones we
did initialize. But that assumption might not hold in the
future, so it's a dangerous construct. This patch switches
it to initializing the whole struct, which protects us
against unexpected reads of the other fields.
[1] Obviously using memset() to initialize a pointer
violates the C standard, but we long ago decided that it
was an acceptable tradeoff in the real world.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git am -3" calls "git merge-recursive" when it needs to fall back
to a three-way merge; this call has been turned into an internal
subroutine call instead of spawning a separate subprocess.
* js/am-3-merge-recursive-direct:
merge-recursive: flush output buffer even when erroring out
merge_trees(): ensure that the callers release output buffer
merge-recursive: offer an option to retain the output in 'obuf'
merge-recursive: write the commit title in one go
merge-recursive: flush output buffer before printing error messages
am -3: use merge_recursive() directly again
merge-recursive: switch to returning errors instead of dying
merge-recursive: handle return values indicating errors
merge-recursive: allow write_tree_from_memory() to error out
merge-recursive: avoid returning a wholesale struct
merge_recursive: abort properly upon errors
prepare the builtins for a libified merge_recursive()
merge-recursive: clarify code in was_tracked()
die(_("BUG")): avoid translating bug messages
die("bug"): report bugs consistently
t5520: verify that `pull --rebase` shows the helpful advice when failing
"git pack-objects" has a few options that tell it not to pack
objects found in certain packfiles, which require it to scan .idx
files of all available packs. The codepaths involved in these
operations have been optimized for a common case of not having any
non-local pack and/or any .kept pack.
* jk/pack-objects-optim:
pack-objects: compute local/ignore_pack_keep early
pack-objects: break out of want_object loop early
find_pack_entry: replace last_found_pack with MRU cache
add generic most-recently-used list
sha1_file: drop free_pack_by_name
t/perf: add tests for many-pack scenarios
"git pack-objects" and "git index-pack" mostly operate with off_t
when talking about the offset of objects in a packfile, but there
were a handful of places that used "unsigned long" to hold that
value, leading to an unintended truncation.
* nd/pack-ofs-4gb-limit:
fsck: use streaming interface for large blobs in pack
pack-objects: do not truncate result in-pack object size on 32-bit systems
index-pack: correct "offset" type in unpack_entry_data()
index-pack: report correct bad object offsets even if they are large
index-pack: correct "len" type in unpack_data()
sha1_file.c: use type off_t* for object_info->disk_sizep
pack-objects: pass length to check_pack_crc() without truncation
Each pack has an index for looking up entries in O(log n)
time, but if we have multiple packs, we have to scan through
them linearly. This can produce a measurable overhead for
some operations.
We dealt with this long ago in f7c22cc (always start looking
up objects in the last used pack first, 2007-05-30), which
keeps what is essentially a 1-element most-recently-used
cache. In theory, we should be able to do better by keeping
a similar but longer cache, that is the same length as the
pack-list itself.
Since we now have a convenient generic MRU structure, we can
plug it in and measure. Here are the numbers for running
p5303 against linux.git:
Test HEAD^ HEAD
------------------------------------------------------------------------
5303.3: rev-list (1) 31.56(31.28+0.27) 31.30(31.08+0.20) -0.8%
5303.4: repack (1) 40.62(39.35+2.36) 40.60(39.27+2.44) -0.0%
5303.6: rev-list (50) 31.31(31.06+0.23) 31.23(31.00+0.22) -0.3%
5303.7: repack (50) 58.65(69.12+1.94) 58.27(68.64+2.05) -0.6%
5303.9: rev-list (1000) 38.74(38.40+0.33) 31.87(31.62+0.24) -17.7%
5303.10: repack (1000) 367.20(441.80+4.62) 342.00(414.04+3.72) -6.9%
The main numbers of interest here are the rev-list ones
(since that is exercising the normal object lookup code
path). The single-pack case shouldn't improve at all; the
260ms speedup there is just part of the run-to-run noise
(but it's important to note that we didn't make anything
worse with the overhead of maintaining our cache). In the
50-pack case, we see similar results. There may be a slight
improvement, but it's mostly within the noise.
The 1000-pack case does show a big improvement, though. That
carries over to the repack case, as well. Even though we
haven't touched its pack-search loop yet, it does still do a
lot of normal object lookups (e.g., for the internal
revision walk), and so improves.
As a point of reference, I also ran the 1000-pack test
against a version of HEAD^ with the last_found_pack
optimization disabled. It takes ~60s, so that gives an
indication of how much even the single-element cache is
helping.
For comparison, here's a smaller repository, git.git:
Test HEAD^ HEAD
---------------------------------------------------------------------
5303.3: rev-list (1) 1.56(1.54+0.01) 1.54(1.51+0.02) -1.3%
5303.4: repack (1) 1.84(1.80+0.10) 1.82(1.80+0.09) -1.1%
5303.6: rev-list (50) 1.58(1.55+0.02) 1.59(1.57+0.01) +0.6%
5303.7: repack (50) 2.50(3.18+0.04) 2.50(3.14+0.04) +0.0%
5303.9: rev-list (1000) 2.76(2.71+0.04) 2.24(2.21+0.02) -18.8%
5303.10: repack (1000) 13.21(19.56+0.25) 11.66(18.01+0.21) -11.7%
You can see that the percentage improvement is similar.
That's because the lookup we are optimizing is roughly
O(nr_objects * nr_packs). Since the number of packs is
constant in both tests, we'd expect the improvement to be
linear in the number of objects. But the whole process is
also linear in the number of objects, so the improvement
is a constant factor.
The exact improvement does also depend on the contents of
the packs. In p5303, the extra packs all have 5 first-parent
commits in them, which is a reasonable simulation of a
pushed-to repository. But it also means that only 250
first-parent commits are in those packs (compared to almost
50,000 total in linux.git), and the rest are in the huge
"base" pack. So once we start looking at history in taht big
pack, that's where we'll find most everything, and even the
1-element cache gets close to 100% cache hits. You could
almost certainly show better numbers with a more
pathological case (e.g., distributing the objects more
evenly across the packs). But that's simply not that
realistic a scenario, so it makes more sense to focus on
these numbers.
The implementation itself is a straightforward application
of the MRU code. We provide an MRU-ordered list of packs
that shadows the packed_git list. This is easy to do because
we only create and revise the pack list in one place. The
"reprepare" code path actually drops the whole MRU and
replaces it for simplicity. It would be more efficient to
just add new entries, but there's not much point in
optimizing here; repreparing happens rarely, and only after
doing a lot of other expensive work. The key things to keep
optimized are traversal (which is just a normal linked list,
albeit with one extra level of indirection over the regular
packed_git list), and marking (which is a constant number of
pointer assignments, though slightly more than the old
last_found_pack was; it doesn't seem to create a measurable
slowdown, though).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The point of this function is to drop an entry from the
"packed_git" cache that points to a file we might be
overwriting, because our contents may not be the same (and
hence the only caller was pack-objects as it moved a
temporary packfile into place).
In older versions of git, this could happen because the
names of packfiles were derived from the set of objects they
contained, not the actual bits on disk. But since 1190a1a
(pack-objects: name pack files after trailer hash,
2013-12-05), the name reflects the actual bits on disk, and
any two packfiles with the same name can be used
interchangeably.
Dropping this function not only saves a few lines of code,
it makes the lifetime of "struct packed_git" much easier to
reason about: namely, we now do not ever free these structs.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git pack-objects" and "git index-pack" mostly operate with off_t
when talking about the offset of objects in a packfile, but there
were a handful of places that used "unsigned long" to hold that
value, leading to an unintended truncation.
* nd/pack-ofs-4gb-limit:
fsck: use streaming interface for large blobs in pack
pack-objects: do not truncate result in-pack object size on 32-bit systems
index-pack: correct "offset" type in unpack_entry_data()
index-pack: report correct bad object offsets even if they are large
index-pack: correct "len" type in unpack_data()
sha1_file.c: use type off_t* for object_info->disk_sizep
pack-objects: pass length to check_pack_crc() without truncation
The vast majority of error messages in Git's source code which report a
bug use the convention to prefix the message with "BUG:".
As part of cleaning up merge-recursive to stop die()ing except in case of
detected bugs, let's just make the remainder of the bug reports consistent
with the de facto rule.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On 32 bit systems with large file support, unsigned long is 32-bit
while the two offsets in the subtraction expression (pack-objects has
the exact same expression as in sha1_file.c but not shown in diff) are
in 64-bit. If an in-pack object is larger than 2^32 len/datalen is
truncated and we get a misleading "error: bad packed object CRC for
..." as a result.
Use off_t for len and datalen. check_pack_crc() already accepts this
argument as off_t and can deal with 4+ GB.
Noticed-by: Christoph Michelbach <michelbach94@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The experimental "multiple worktree" feature gains more safety to
forbid operations on a branch that is checked out or being actively
worked on elsewhere, by noticing that e.g. it is being rebased.
* nd/worktree-various-heads:
branch: do not rename a branch under bisect or rebase
worktree.c: check whether branch is bisected in another worktree
wt-status.c: split bisect detection out of wt_status_get_state()
worktree.c: check whether branch is rebased in another worktree
worktree.c: avoid referencing to worktrees[i] multiple times
wt-status.c: make wt_status_check_rebase() work on any worktree
wt-status.c: split rebase detection out of wt_status_get_state()
path.c: refactor and add worktree_git_path()
worktree.c: mark current worktree
worktree.c: make find_shared_symref() return struct worktree *
worktree.c: store "id" instead of "git_dir"
path.c: add git_common_path() and strbuf_git_common_path()
dir.c: rename str(n)cmp_icase to fspath(n)cmp
These functions compare two paths that are taken from file system.
Depending on the running file system, paths may need to be compared
case-sensitively or not, and maybe even something else in future. The
current names do not convey that well.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to read the pack data using the offsets stored in the pack
idx file has been made more carefully check the validity of the
data in the idx.
* jk/pack-idx-corruption-safety:
sha1_file.c: mark strings for translation
use_pack: handle signed off_t overflow
nth_packed_object_offset: bounds-check extended offset
t5313: test bounds-checks of corrupted/malicious pack/idx files
Update various codepaths to avoid manually-counted malloc().
* jk/tighten-alloc: (22 commits)
ewah: convert to REALLOC_ARRAY, etc
convert ewah/bitmap code to use xmalloc
diff_populate_gitlink: use a strbuf
transport_anonymize_url: use xstrfmt
git-compat-util: drop mempcpy compat code
sequencer: simplify memory allocation of get_message
test-path-utils: fix normalize_path_copy output buffer size
fetch-pack: simplify add_sought_entry
fast-import: simplify allocation in start_packfile
write_untracked_extension: use FLEX_ALLOC helper
prepare_{git,shell}_cmd: use argv_array
use st_add and st_mult for allocation size computation
convert trivial cases to FLEX_ARRAY macros
use xmallocz to avoid size arithmetic
convert trivial cases to ALLOC_ARRAY
convert manual allocations to argv_array
argv-array: add detach function
add helpers for allocating flex-array structs
harden REALLOC_ARRAY and xcalloc against size_t overflow
tree-diff: catch integer overflow in combine_diff_path allocation
...
A v2 pack index file can specify an offset within a packfile
of up to 2^64-1 bytes. On a system with a signed 64-bit
off_t, we can represent only up to 2^63-1. This means that a
corrupted .idx file can end up with a negative offset in the
pack code. Our bounds-checking use_pack function looks for
too-large offsets, but not for ones that have wrapped around
to negative. Let's do so, which fixes an out-of-bounds
access demonstrated in t5313.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If a pack .idx file has a corrupted offset for an object, we
may try to access an offset in the .idx or .pack file that
is larger than the file's size. For the .pack case, we have
use_pack() to protect us, which realizes the access is out
of bounds. But if the corrupted value asks us to look in the
.idx file's secondary 64-bit offset table, we blindly add it
to the mmap'd index data and access arbitrary memory.
We can fix this with a simple bounds-check compared to the
size we found when we opened the .idx file.
Note that there's similar code in index-pack that is
triggered only during "index-pack --verify". To support
both, we pull the bounds-check into a separate function,
which dies when it sees a corrupted file.
It would be nice if we could return an error, so that the
pack code could try to find a good copy of the object
elsewhere. Currently nth_packed_object_offset doesn't have
any way to return an error, but it could probably use "0" as
a sentinel value (since no object can start there). This is
the minimal fix, and we can improve the resilience later on
top.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If our size computation overflows size_t, we may allocate a
much smaller buffer than we expected and overflow it. It's
probably impossible to trigger an overflow in most of these
sites in practice, but it is easy enough convert their
additions and multiplications into overflow-checking
variants. This may be fixing real bugs, and it makes
auditing the code easier.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Each of these cases can be converted to use ALLOC_ARRAY or
REALLOC_ARRAY, which has two advantages:
1. It automatically checks the array-size multiplication
for overflow.
2. It always uses sizeof(*array) for the element-size,
so that it can never go out of sync with the declared
type of the array.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
$GIT_OBJECT_DIRECTORY/info/alternates is a text file that can be
edited with a DOS editor. We do not want to use the real path with
CR appended at the end.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The strbuf_getline() interface allows a byte other than LF or NUL as
the line terminator, but this is only because I wrote these
codepaths anticipating that there might be a value other than NUL
and LF that could be useful when I introduced line_termination long
time ago. No useful caller that uses other value has emerged.
By now, it is clear that the interface is overly broad without a
good reason. Many codepaths have hardcoded preference to read
either LF terminated or NUL terminated records from their input, and
then call strbuf_getline() with LF or NUL as the third parameter.
This step introduces two thin wrappers around strbuf_getline(),
namely, strbuf_getline_lf() and strbuf_getline_nul(), and
mechanically rewrites these call sites to call either one of
them. The changes contained in this patch are:
* introduction of these two functions in strbuf.[ch]
* mechanical conversion of all callers to strbuf_getline() with
either '\n' or '\0' as the third parameter to instead call the
respective thin wrapper.
After this step, output from "git grep 'strbuf_getline('" would
become a lot smaller. An interim goal of this series is to make
this an empty set, so that we can have strbuf_getline_crlf() take
over the shorter name strbuf_getline().
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"format-patch" has learned a new option to zero-out the commit
object name on the mbox "From " line.
* bc/format-patch-null-from-line:
format-patch: check that header line has expected format
format-patch: add an option to suppress commit hash
sha1_file.c: introduce a null_oid constant
The helper used to iterate over loose object directories to prune
stale objects did not closedir() immediately when it is done with a
directory--a callback such as the one used for "git prune" may want
to do rmdir(), but it would fail on open directory on platforms
such as WinXP.
* jk/prune-mtime:
prune: close directory earlier during loose-object directory traversal
null_oid is the struct object_id equivalent to null_sha1.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add has_object_file, which is a wrapper around has_sha1_file, but for
struct object_id.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Jeff King <peff@peff.net>
Having a leftover .idx file without corresponding .pack file in
the repository hurts performance; "git gc" learned to prune them.
* dk/gc-idx-wo-pack:
gc: remove garbage .idx files from pack dir
t5304: test cleaning pack garbage
prepare_packed_git(): refactor garbage reporting in pack directory
Various compilation fixes and squelching of warnings.
* js/misc-fixes:
Correct fscanf formatting string for I64u values
Silence GCC's "cast of pointer to integer of a different size" warning
Squelch warning about an integer overflow
When calculating hashes from pointers, it actually makes sense to cut
off the most significant bits. In that case, said warning does not make
a whole lot of sense.
So let's just work around it by casting the pointer first to intptr_t
and then casting up/down to the final integral type.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many allocations that is manually counted (correctly) that are
followed by strcpy/sprintf have been replaced with a less error
prone constructs such as xstrfmt.
Macintosh-specific breakage was noticed and corrected in this
reroll.
* jk/war-on-sprintf: (70 commits)
name-rev: use strip_suffix to avoid magic numbers
use strbuf_complete to conditionally append slash
fsck: use for_each_loose_file_in_objdir
Makefile: drop D_INO_IN_DIRENT build knob
fsck: drop inode-sorting code
convert strncpy to memcpy
notes: document length of fanout path with a constant
color: add color_set helper for copying raw colors
prefer memcpy to strcpy
help: clean up kfmclient munging
receive-pack: simplify keep_arg computation
avoid sprintf and strcpy with flex arrays
use alloc_ref rather than hand-allocating "struct ref"
color: add overflow checks for parsing colors
drop strcpy in favor of raw sha1_to_hex
use sha1_to_hex_r() instead of strcpy
daemon: use cld->env_array when re-spawning
stat_tracking_info: convert to argv_array
http-push: use an argv_array for setup_revisions
fetch-pack: use argv_array for index-pack / unpack-objects
...
"git clone --dissociate" runs a big "git repack" process at the
end, and it helps to close file descriptors that are open on the
packs and their idx files before doing so on filesystems that
cannot remove a file that is still open.
* js/clone-dissociate:
clone --dissociate: avoid locking pack files
sha1_file.c: add a function to release all packs
sha1_file: consolidate code to close a pack's file descriptor
t5700: demonstrate a Windows file locking issue with `git clone --dissociate`
On Windows, files that are in use cannot be removed or renamed. That
means that we have to release pack files when we are about to, say,
repack them. Let's introduce a convenient function to close all the
pack files and their idx files.
While at it, we consolidate the close windows/close fd/close index
stanza in `free_pack_by_name()` into the `close_pack()` function that
is used by the new `close_all_packs()` function to avoid repeated code.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There was a lot of repeated code to close the file descriptor of
a given pack. Let's just refactor this code into a single function.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we are allocating a struct with a FLEX_ARRAY member, we
generally compute the size of the array and then sprintf or
strcpy into it. Normally we could improve a dynamic allocation
like this by using xstrfmt, but it doesn't work here; we
have to account for the size of the rest of the struct.
But we can improve things a bit by storing the length that
we use for the allocation, and then feeding it to xsnprintf
or memcpy, which makes it more obvious that we are not
writing more than the allocated number of bytes.
It would be nice if we had some kind of helper for
allocating generic flex arrays, but it doesn't work that
well:
- the call signature is a little bit unwieldy:
d = flex_struct(sizeof(*d), offsetof(d, path), fmt, ...);
You need offsetof here instead of just writing to the
end of the base size, because we don't know how the
struct is packed (partially this is because FLEX_ARRAY
might not be zero, though we can account for that; but
the size of the struct may actually be rounded up for
alignment, and we can't know that).
- some sites do clever things, like over-allocating because
they know they will write larger things into the buffer
later (e.g., struct packed_git here).
So we're better off to just write out each allocation (or
add type-specific helpers, though many of these are one-off
allocations anyway).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When creating a loose object tempfile, we use a fixed
PATH_MAX-sized buffer, and strcpy directly into it. This
isn't buggy, because we do a rough check of the size, but
there's no verification that our guesstimate of the required
space is enough (in fact, it's several bytes too big for the
current naming scheme).
Let's switch to a strbuf, which makes this much easier to
verify. The allocation overhead should be negligible, since
we are replacing a static buffer with a static strbuf, and
we'll only need to allocate on the first call.
While we're here, we can also document a subtle interaction
with mkstemp that would be easy to overlook.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We do some manual memory computation here, and there's no
check that our 60 is not overflowed by the raw sprintf (it
isn't, because the "which" parameter is never longer than
"pack"). We can simplify this greatly with a strbuf.
Technically the end result is not identical, as the original
took care not to rewrite the object directory on each call
for performance reasons. We could do that here, too (by
saving the baselen and resetting to it), but it's not worth
the complexity; this function is not called a lot (generally
once per packfile that we open).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we want to convert "foo.pack" to "foo.idx", we do it by
duplicating the original string and then munging the bytes
in place. Let's use strip_suffix and xstrfmt instead, which
has several advantages:
1. It's more clear what the intent is.
2. It does not implicitly rely on the fact that
strlen(".idx") <= strlen(".pack") to avoid an overflow.
3. We communicate the assumption that the input file ends
with ".pack" (and get a run-time check that this is so).
4. We drop calls to strcpy, which makes auditing the code
base easier.
Likewise, we can do this to convert ".pack" to ".bitmap",
avoiding some manual memory computation.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have the path "foo.idx", and we create a buffer big
enough to hold "foo.pack" and "foo.keep", and then strcpy
straight into it. This isn't a bug (we have enough space),
but it's very hard to tell from the strcpy that this is so.
Let's instead use strip_suffix to take off the ".idx",
record the size of our allocation, and use xsnprintf to make
sure we don't violate our assumptions.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We generally use 32-byte buffers to format git's "type size"
header fields. These should not generally overflow unless
you can produce some truly gigantic objects (and our types
come from our internal array of constant strings). But it is
a good idea to use xsnprintf to make sure this is the case.
Note that we slightly modify the interface to
write_sha1_file_prepare, which nows uses "hdrlen" as an "in"
parameter as well as an "out" (on the way in it stores the
allocated size of the header, and on the way out it returns
the ultimate size of the header).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function assumes that the relative_base path passed
into it is no larger than PATH_MAX, and writes into a
fixed-size buffer. However, this path may not have actually
come from the filesystem; for example, add_submodule_odb
generates a path using a strbuf and passes it in. This is
hard to trigger in practice, though, because the long
submodule directory would have to exist on disk before we
would try to open its info/alternates file.
We can easily avoid the bug, though, by simply creating the
filename on the heap.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>