git-commit-vandalism

Author	SHA1	Message	Date
René Scharfe	6355a76802	sha1_file: avoid comparison if no packed hash matches the first byte find_pack_entry_one() uses the fan-out table of pack indexes to find out which entries match the first byte of the searched hash and does a binary search on this subset of the main index table. If there are no matching entries then lo and hi will have the same value. The binary search still starts and compares the hash of the following entry (which has a non-matching first byte, so won't cause any trouble), or whatever comes after the sorted list of entries. The probability of that stray comparison matching by mistake is low, but let's not take any chances and check when entering the binary search loop if we're actually done already. Signed-off-by: Rene Scharfe <l.s.r@web.de> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-08-09 09:52:25 -07:00
Junio C Hamano	2842e06352	Merge branch 'ew/fd-cloexec-fix' Portability/fallback fix. * ew/fd-cloexec-fix: set FD_CLOEXEC properly when O_CLOEXEC is not supported	2017-07-20 16:30:00 -07:00
Eric Wong	9fb9495dae	set FD_CLOEXEC properly when O_CLOEXEC is not supported FD_CLOEXEC only applies to the file descriptor, so it needs to be manipuluated via F_GETFD/F_SETFD. F_GETFL/F_SETFL are for file description flags. Verified via strace with o_cloexec set to zero. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-07-17 14:52:16 -07:00
Junio C Hamano	91f6922544	Merge branch 'sb/hashmap-customize-comparison' Update the hashmap API so that data to customize the behaviour of the comparison function can be specified at the time a hashmap is initialized. * sb/hashmap-customize-comparison: hashmap: migrate documentation from Documentation/technical into header patch-ids.c: use hashmap correctly hashmap.h: compare function has access to a data field	2017-07-13 16:14:54 -07:00
Junio C Hamano	00b7cf2379	Merge branch 'jt/unify-object-info' Code clean-ups. * jt/unify-object-info: sha1_file: refactor has_sha1_file_with_flags sha1_file: do not access pack if unneeded sha1_file: teach sha1_object_info_extended more flags sha1_file: refactor read_object sha1_file: move delta base cache code up sha1_file: rename LOOKUP_REPLACE_OBJECT sha1_file: rename LOOKUP_UNKNOWN_OBJECT sha1_file: teach packed_object_info about typename	2017-07-05 13:32:57 -07:00
Junio C Hamano	5ab148dda0	Merge branch 'rs/sha1-name-readdir-optim' Optimize "what are the object names already taken in an alternate object database?" query that is used to derive the length of prefix an object name is uniquely abbreviated to. * rs/sha1-name-readdir-optim: sha1_file: guard against invalid loose subdirectory numbers sha1_file: let for_each_file_in_obj_subdir() handle subdir names p4205: add perf test script for pretty log formats sha1_name: cache readdir(3) results in find_short_object_filename()	2017-07-05 13:32:56 -07:00
Stefan Beller	7663cdc86c	hashmap.h: compare function has access to a data field When using the hashmap a common need is to have access to caller provided data in the compare function. A couple of times we abuse the keydata field to pass in the data needed. This happens for example in patch-ids.c. This patch changes the function signature of the compare function to have one more void pointer available. The pointer given for each invocation of the compare function must be defined in the init function of the hashmap and is just passed through. Documentation of this new feature is deferred to a later patch. This is a rather mechanical conversion, just adding the new pass-through parameter. However while at it improve the naming of the fields of all compare functions used by hashmaps by ensuring unused parameters are prefixed with 'unused_' and naming the parameters what they are (instead of 'unused' make it 'unused_keydata'). Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-30 12:49:28 -07:00
Jonathan Tan	e83e71c5e1	sha1_file: refactor has_sha1_file_with_flags has_sha1_file_with_flags() implements many mechanisms in common with sha1_object_info_extended(). Make has_sha1_file_with_flags() a convenience function for sha1_object_info_extended() instead. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-26 10:28:58 -07:00
Jonathan Tan	cd585e2a33	sha1_file: do not access pack if unneeded Currently, regardless of the contents of the "struct object_info" passed to sha1_object_info_extended(), that function always accesses the packfile whenever it returns information about a packed object, since it needs to populate "u.packed". Add the ability to pass NULL, and use NULL-ness of the argument to activate an optimization in which sha1_object_info_extended() does not needlessly access the packfile. A subsequent patch will make use of this optimization. A similar optimization is not made for the cached and loose cases as it would not cause a significant performance improvement. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-26 10:28:58 -07:00
Jonathan Tan	dfdd4afcf9	sha1_file: teach sha1_object_info_extended more flags Improve sha1_object_info_extended() by supporting additional flags. This allows has_sha1_file_with_flags() to be modified to use sha1_object_info_extended() in a subsequent patch. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-26 10:28:42 -07:00
Junio C Hamano	50f03c6676	Merge branch 'ab/free-and-null' A common pattern to free a piece of memory and assign NULL to the pointer that used to point at it has been replaced with a new FREE_AND_NULL() macro. * ab/free-and-null: *.[ch] refactoring: make use of the FREE_AND_NULL() macro coccinelle: make use of the "expression" FREE_AND_NULL() rule coccinelle: add a rule to make "expression" code use FREE_AND_NULL() coccinelle: make use of the "type" FREE_AND_NULL() rule coccinelle: add a rule to make "type" code use FREE_AND_NULL() git-compat-util: add a FREE_AND_NULL() wrapper around free(ptr); ptr = NULL	2017-06-24 14:28:41 -07:00
Junio C Hamano	f31d23a399	Merge branch 'bw/config-h' Fix configuration codepath to pay proper attention to commondir that is used in multi-worktree situation, and isolate config API into its own header file. * bw/config-h: config: don't implicitly use gitdir or commondir config: respect commondir setup: teach discover_git_directory to respect the commondir config: don't include config.h by default config: remove git_config_iter config: create config.h	2017-06-24 14:28:41 -07:00
René Scharfe	70c49050d4	sha1_file: guard against invalid loose subdirectory numbers Loose object subdirectories have hexadecimal names based on the first byte of the hash of contained objects, thus their numerical representation can range from 0 (0x00) to 255 (0xff). Change the type of the corresponding variable in for_each_file_in_obj_subdir() and associated callback functions to unsigned int and add a range check. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-24 11:09:52 -07:00
René Scharfe	0375f472d4	sha1_file: let for_each_file_in_obj_subdir() handle subdir names The function for_each_file_in_obj_subdir() takes a object subdirectory number and expects the name of the same subdirectory to be included in the path strbuf. Avoid this redundancy by letting the function append the hexadecimal subdirectory name itself. This makes it a bit easier and safer to use the function -- it becomes impossible to specify different subdirectories in subdir_nr and path. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-24 11:09:50 -07:00
René Scharfe	cc817ca3ef	sha1_name: cache readdir(3) results in find_short_object_filename() Read each loose object subdirectory at most once when looking for unique abbreviated hashes. This speeds up commands like "git log --pretty=%h" considerably, which previously caused one readdir(3) call for each candidate, even for subdirectories that were visited before. The new cache is kept until the program ends and never invalidated. The same is already true for pack indexes. The inherent racy nature of finding unique short hashes makes it still fit for this purpose -- a conflicting new object may be added at any time. Tasks with higher consistency requirements should not use it, though. The cached object names are stored in an oid_array, which is quite compact. The bitmap for remembering which subdir was already read is stored as a char array, with one char per directory -- that's not quite as compact, but really simple and incurs only an overhead equivalent to 11 hashes after all. Suggested-by: Jeff King <peff@peff.net> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-22 12:07:51 -07:00
Jonathan Tan	c84a1f3ed4	sha1_file: refactor read_object read_object() and sha1_object_info_extended() both implement mechanisms such as object replacement, retrying the packed store after failing to find the object in the packed store then the loose store, and being able to mark a packed object as bad and then retrying the whole process. Consolidating these mechanisms would be a great help to maintainability. Therefore, consolidate them by extending sha1_object_info_extended() to support the functionality needed, and then modifying read_object() to use sha1_object_info_extended(). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-21 18:54:43 -07:00
Jonathan Tan	845b102b99	sha1_file: move delta base cache code up In a subsequent patch, packed_object_info() will be modified to use the delta base cache, so move the relevant code to before packed_object_info(). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-21 18:54:43 -07:00
Jonathan Tan	1f0c0d36c1	sha1_file: rename LOOKUP_REPLACE_OBJECT The LOOKUP_REPLACE_OBJECT flag controls whether the lookup_replace_object() function is invoked by sha1_object_info_extended(), read_sha1_file_extended(), and lookup_replace_object_extended(), but it is not immediately clear which functions accept that flag. Therefore restrict this flag to only sha1_object_info_extended(), renaming it appropriately to OBJECT_INFO_LOOKUP_REPLACE and adding some documentation. Update read_sha1_file_extended() to have a boolean parameter instead, and delete lookup_replace_object_extended(). parse_sha1_header() also passes this flag to parse_sha1_header_extended() since commit `46f0344` ("sha1_file: support reading from a loose object of unknown type", 2015-05-03), but that has had no effect since that commit. Therefore this patch also removes this flag from that invocation. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-21 18:54:43 -07:00
Jonathan Tan	19fc5e84a7	sha1_file: rename LOOKUP_UNKNOWN_OBJECT The LOOKUP_UNKNOWN_OBJECT flag was introduced in commit `46f0344` ("sha1_file: support reading from a loose object of unknown type", 2015-05-03) in order to support a feature in cat-file subsequently introduced in commit `39e4ae3` ("cat-file: teach cat-file a '--allow-unknown-type' option", 2015-05-03). Despite its name and location in cache.h, this flag is used neither in read_sha1_file_extended() nor in any of the lookup functions, but used only in sha1_object_info_extended(). Therefore rename this flag to OBJECT_INFO_ALLOW_UNKNOWN_TYPE, taking the name of the cat-file flag that invokes this feature, and move it closer to the declaration of sha1_object_info_extended(). Also add documentation for this flag. OBJECT_INFO_ALLOW_UNKNOWN_TYPE is defined to 2, not 1, to avoid conflicting with LOOKUP_REPLACE_OBJECT. Avoidance of this conflict is necessary because sha1_object_info_extended() supports both flags. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-21 18:54:43 -07:00
Ævar Arnfjörð Bjarmason	6a83d90207	coccinelle: make use of the "type" FREE_AND_NULL() rule Apply the result of the just-added coccinelle rule. This manually excludes a few occurrences, mostly things that resulted in many FREE_AND_NULL() on one line, that'll be manually fixed in a subsequent change. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-16 12:44:03 -07:00
Brandon Williams	b2141fc1d2	config: don't include config.h by default Stop including config.h by default in cache.h. Instead only include config.h in those files which require use of the config system. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-15 12:56:22 -07:00
Jonathan Tan	285a2984bd	sha1_file: teach packed_object_info about typename In commit `46f0344` ("sha1_file: support reading from a loose object of unknown type", 2015-05-06), "struct object_info" gained a "typename" field that could represent a type name from a loose object file, whether valid or invalid, as opposed to the existing "typep" which could only represent valid types. Some relatively complex manipulations were added to avoid breaking packed_object_info() without modifying it, but it is much easier to just teach packed_object_info() about the new field. Therefore, teach packed_object_info() as described above. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-15 09:51:57 -07:00
Brandon Williams	82b474e025	convert: convert convert_to_git to take an index Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-13 11:40:51 -07:00
Brandon Williams	d6c41c20e6	convert: convert convert_to_git_filter_fd to take an index Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-06-13 11:40:51 -07:00
Junio C Hamano	d7f8a37852	Merge branch 'jk/loose-object-fsck' Code cleanup. * jk/loose-object-fsck: sha1_file: remove an used fd variable	2017-04-23 22:07:50 -07:00
Junio C Hamano	eb3af74e93	Merge branch 'jk/no-looking-at-dotgit-outside-repo' Clean up fallouts from recent tightening of the set-up sequence, where Git barfs when repository information is accessed without first ensuring that it was started in a repository. * jk/no-looking-at-dotgit-outside-repo: test-read-cache: setup git dir has_sha1_file: don't bother if we are not in a repository	2017-04-19 21:37:20 -07:00
Junio C Hamano	b1081e4004	Merge branch 'bc/object-id' Conversion from unsigned char [40] to struct object_id continues. * bc/object-id: Documentation: update and rename api-sha1-array.txt Rename sha1_array to oid_array Convert sha1_array_for_each_unique and for_each_abbrev to object_id Convert sha1_array_lookup to take struct object_id Convert remaining callers of sha1_array_lookup to object_id Make sha1_array_append take a struct object_id * sha1-array: convert internal storage for struct sha1_array to object_id builtin/pull: convert to struct object_id submodule: convert check_for_new_submodule_commits to object_id sha1_name: convert disambiguate_hint_fn to take object_id sha1_name: convert struct disambiguate_state to object_id test-sha1-array: convert most code to struct object_id parse-options-cb: convert sha1_array_append caller to struct object_id fsck: convert init_skiplist to struct object_id builtin/receive-pack: convert portions to struct object_id builtin/pull: convert portions to struct object_id builtin/diff: convert to struct object_id Convert GIT_SHA1_RAWSZ used for allocation to GIT_MAX_RAWSZ Convert GIT_SHA1_HEXSZ used for allocation to GIT_MAX_HEXSZ Define new hash-size constants for allocating memory	2017-04-19 21:37:13 -07:00
Junio C Hamano	dfe46c5ce6	Merge branch 'jk/loose-object-info-report-error' Update error handling for codepath that deals with corrupt loose objects. * jk/loose-object-info-report-error: index-pack: detect local corruption in collision check sha1_loose_object_info: return error for corrupted objects	2017-04-16 23:29:30 -07:00
Sebastian Schuberth	0747fb49fd	sha1_file: remove an used fd variable Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com> Acked-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-04-16 21:29:18 -07:00
Jonathan Nieder	3e8b7d3c77	has_sha1_file: don't bother if we are not in a repository Most callers to this function already require that they are in a git repository, but there is an exception: "git apply" uses has_sha1_file to avoid work if the result of applying a binary patch is already present in the repository. When run outside any repository, this produces an error: fatal: BUG: setup_git_env called without repository Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-04-13 18:33:11 -07:00
Jeff King	93cff9a978	sha1_loose_object_info: return error for corrupted objects When sha1_loose_object_info() finds that a loose object file cannot be stat(2)ed or mmap(2)ed, it returns -1 to signal an error to the caller. However, if it found that the loose object file is corrupt and the object data cannot be used from it, it stuffs OBJ_BAD into "type" field of the object_info, but returns zero (i.e., success), which can confuse callers. This is due to `052fe5eac` (sha1_loose_object_info: make type lookup optional, 2013-07-12), which switched the return to a strict success/error, rather than returning the type (but botched the return). Callers of regular sha1_object_info() don't notice the difference, as that function returns the type (which is OBJ_BAD in this case). However, direct callers of sha1_object_info_extended() see the function return success, but without setting any meaningful values in the object_info struct, leading them to access potentially uninitialized memory. The easiest way to see the bug is via "cat-file -s", which will happily ignore the corruption and report whatever value happened to be in the "size" variable. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-04-01 10:45:16 -07:00
Jeff King	1a168e5c86	convert unchecked snprintf into xsnprintf These calls to snprintf should always succeed, because their input is small and fixed. Let's use xsnprintf to make sure this is the case (and to make auditing for actual truncation easier). These could be candidates for turning into heap buffers, but they fall into a few broad categories that make it not worth doing: - formatting single numbers is simple enough that we can see the result should fit - the size of a sha1 is likewise well-known, and I didn't want to cause unnecessary conflicts with the ongoing process to convert these constants to GIT_MAX_HEXSZ - the interface for curl_errorstr is dictated by curl Signed-off-by: Jeff King <peff@peff.net>	2017-03-30 14:59:50 -07:00
Junio C Hamano	ba5e05ffef	Merge branch 'jk/pack-name-cleanups' into maint Code clean-up. * jk/pack-name-cleanups: index-pack: make pointer-alias fallbacks safer replace snprintf with odb_pack_name() odb_pack_keep(): stop generating keepfile name sha1_file.c: make pack-name helper globally accessible move odb_* declarations out of git-compat-util.h	2017-03-28 13:52:25 -07:00
brian m. carlson	cd02599c48	Convert GIT_SHA1_RAWSZ used for allocation to GIT_MAX_RAWSZ Since we will likely be introducing a new hash function at some point, and that hash function might be longer than 20 bytes, use the constant GIT_MAX_RAWSZ, which is designed to be suitable for allocations, instead of GIT_SHA1_RAWSZ. This will ease the transition down the line by distinguishing between places where we need to allocate memory suitable for the largest hash from those where we need to handle the current hash. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-03-26 22:08:21 -07:00
brian m. carlson	dc01505f7f	Convert GIT_SHA1_HEXSZ used for allocation to GIT_MAX_HEXSZ Since we will likely be introducing a new hash function at some point, and that hash function might be longer than 40 hex characters, use the constant GIT_MAX_HEXSZ, which is designed to be suitable for allocations, instead of GIT_SHA1_HEXSZ. This will ease the transition down the line by distinguishing between places where we need to allocate memory suitable for the largest hash from those where we need to handle the current hash. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-03-26 22:08:21 -07:00
Junio C Hamano	45cbc37c5f	Merge branch 'jk/pack-name-cleanups' Code clean-up. * jk/pack-name-cleanups: index-pack: make pointer-alias fallbacks safer replace snprintf with odb_pack_name() odb_pack_keep(): stop generating keepfile name sha1_file.c: make pack-name helper globally accessible move odb_* declarations out of git-compat-util.h	2017-03-21 15:07:17 -07:00
Junio C Hamano	e36e28e697	Merge branch 'rs/sha1-file-plug-fallback-base-leak' into maint A leak in a codepath to read from a packed object in (rare) cases has been plugged. * rs/sha1-file-plug-fallback-base-leak: sha1_file: release fallback base's memory in unpack_entry()	2017-03-21 15:03:27 -07:00
Junio C Hamano	e1fae93019	Merge branch 'bc/object-id' "uchar [40]" to "struct object_id" conversion continues. * bc/object-id: wt-status: convert to struct object_id builtin/merge-base: convert to struct object_id Convert object iteration callbacks to struct object_id sha1_file: introduce an nth_packed_object_oid function refs: simplify parsing of reflog entries refs: convert each_reflog_ent_fn to struct object_id reflog-walk: convert struct reflog_info to struct object_id builtin/replace: convert to struct object_id Convert remaining callers of resolve_refdup to object_id builtin/merge: convert to struct object_id builtin/clone: convert to struct object_id builtin/branch: convert to struct object_id builtin/grep: convert to struct object_id builtin/fmt-merge-message: convert to struct object_id builtin/fast-export: convert to struct object_id builtin/describe: convert to struct object_id builtin/diff-tree: convert to struct object_id builtin/commit: convert to struct object_id hex: introduce parse_oid_hex	2017-03-17 13:50:25 -07:00
Junio C Hamano	94c9b5af70	Merge branch 'cc/split-index-config' The experimental "split index" feature has gained a few configuration variables to make it easier to use. * cc/split-index-config: (22 commits) Documentation/git-update-index: explain splitIndex.* Documentation/config: add splitIndex.sharedIndexExpire read-cache: use freshen_shared_index() in read_index_from() read-cache: refactor read_index_from() t1700: test shared index file expiration read-cache: unlink old sharedindex files config: add git_config_get_expiry() from gc.c read-cache: touch shared index files when used sha1_file: make check_and_freshen_file() non static Documentation/config: add splitIndex.maxPercentChange t1700: add tests for splitIndex.maxPercentChange read-cache: regenerate shared index if necessary config: add git_config_get_max_percent_split_change() Documentation/git-update-index: talk about core.splitIndex config var Documentation/config: add information for core.splitIndex t1700: add tests for core.splitIndex update-index: warn in case of split-index incoherency read-cache: add and then use tweak_split_index() split-index: add {add,remove}_split_index() functions config: add git_config_get_split_index() ...	2017-03-17 13:50:23 -07:00
Jeff King	1cec8c634f	sha1_file.c: make pack-name helper globally accessible We provide sha1_pack_name() and sha1_pack_index_name(), but the more generic form (which takes its own strbuf and an arbitrary extension) is only used to implement the other two. Let's make it available, but clean up a few things: 1. Name it odb_pack_name(), as the original sha1_get_pack_name() is long but not all that descriptive. 2. Switch the strbuf argument to the beginning, so that it matches similar path-building functions like git_path_buf(). 3. Clean up the out-dated docstring and move it to the public declaration. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-03-16 11:05:17 -07:00
Junio C Hamano	82682e218a	Merge branch 'rs/sha1-file-plug-fallback-base-leak' A leak in a codepath to read from a packed object in (rare) cases has been plugged. * rs/sha1-file-plug-fallback-base-leak: sha1_file: release fallback base's memory in unpack_entry()	2017-03-10 13:24:23 -08:00
Christian Couder	6a5e6f5e44	sha1_file: make check_and_freshen_file() non static This function will be used in a commit soon, so let's make it available globally. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-03-01 13:24:22 -08:00
Junio C Hamano	3ad8b5bf26	Merge branch 'mh/ref-remove-empty-directory' Deletion of a branch "foo/bar" could remove .git/refs/heads/foo once there no longer is any other branch whose name begins with "foo/", but we didn't do so so far. Now we do. * mh/ref-remove-empty-directory: (23 commits) files_transaction_commit(): clean up empty directories try_remove_empty_parents(): teach to remove parents of reflogs, too try_remove_empty_parents(): don't trash argument contents try_remove_empty_parents(): rename parameter "name" -> "refname" delete_ref_loose(): inline function delete_ref_loose(): derive loose reference path from lock log_ref_write_1(): inline function log_ref_setup(): manage the name of the reflog file internally log_ref_write_1(): don't depend on logfile argument log_ref_setup(): pass the open file descriptor back to the caller log_ref_setup(): improve robustness against races log_ref_setup(): separate code for create vs non-create log_ref_write(): inline function rename_tmp_log(): improve error reporting rename_tmp_log(): use raceproof_create_file() lock_ref_sha1_basic(): use raceproof_create_file() lock_ref_sha1_basic(): inline constant raceproof_create_file(): new function safe_create_leading_directories(): set errno on SCLD_EXISTS safe_create_leading_directories_const(): preserve errno ...	2017-02-27 13:57:12 -08:00
René Scharfe	886ddf4777	sha1_file: release fallback base's memory in unpack_entry() If a pack entry that's used as a delta base is corrupt, unpack_entry() marks it as unusable and then searches the object again in the hope that it can be found in another pack or in a loose file. The memory for this external base object is never released. Free it after use. Signed-off-by: Rene Scharfe <l.s.r@web.de> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-02-27 11:00:30 -08:00
brian m. carlson	76c1d9a096	Convert object iteration callbacks to struct object_id Convert each_loose_object_fn and each_packed_object_fn to take a pointer to struct object_id. Update the various callbacks. Convert several 40-based constants to use GIT_SHA1_HEXSZ. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-02-22 10:12:15 -08:00
brian m. carlson	068f85e313	sha1_file: introduce an nth_packed_object_oid function There are places in the code where we would like to provide a struct object_id *, yet read the hash directly from the pack. Provide an nth_packed_object_oid function that is similar to the nth_packed_object_sha1 function. In order to avoid a potentially invalid cast, nth_packed_object_oid provides a variable into which to store the value, which it returns on success; on error, it returns NULL, as nth_packed_object_sha1 does. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-02-22 10:12:15 -08:00
Junio C Hamano	c54ba283fa	Merge branch 'jk/clear-delta-base-cache-fix' A crashing bug introduced in v2.11 timeframe has been found (it is triggerable only in fast-import) and fixed. * jk/clear-delta-base-cache-fix: clear_delta_base_cache(): don't modify hashmap while iterating	2017-01-31 13:14:59 -08:00
Junio C Hamano	42ace93e41	Merge branch 'jk/loose-object-fsck' "git fsck" inspects loose objects more carefully now. * jk/loose-object-fsck: fsck: detect trailing garbage in all object types fsck: parse loose object paths directly sha1_file: add read_loose_object() function t1450: test fsck of packed objects sha1_file: fix error message for alternate objects t1450: refactor loose-object removal	2017-01-31 13:14:57 -08:00
Jeff King	abd5a00268	clear_delta_base_cache(): don't modify hashmap while iterating On Thu, Jan 19, 2017 at 03:03:46PM +0100, Ulrich Spörlein wrote: > > I suspect the patch below may fix things for you. It works around it by > > walking over the lru list (either is fine, as they both contain all > > entries, and since we're clearing everything, we don't care about the > > order). > > Confirmed. With the patch applied, I can import the whole 55G in one go > without any crashes or aborts. Thanks much! Thanks. Here it is rolled up with a commit message. -- >8 -- Subject: clear_delta_base_cache(): don't modify hashmap while iterating Removing entries while iterating causes fast-import to access an already-freed `struct packed_git`, leading to various confusing errors. What happens is that clear_delta_base_cache() drops the whole contents of the cache by iterating over the hashmap, calling release_delta_base_cache() on each entry. That function removes the item from the hashmap. The hashmap code may then shrink the table, but the hashmap_iter struct retains an offset from the old table. As a result, the next call to hashmap_iter_next() may claim that the iteration is done, even though some items haven't been visited. The only caller of clear_delta_base_cache() is fast-import, which wants to clear the cache because it is discarding the packed_git struct for its temporary pack. So by failing to remove all of the entries, we still have references to the freed packed_git. To make things even more confusing, this doesn't seem to trigger with the test suite, because it depends on complexities like the size of the hash table, which entries got cleared, whether we try to access them before they're evicted from the cache, etc. So I've been able to identify the problem with large imports like freebsd's svn import, or a fast-export of linux.git. But nothing that would be reasonable to run as part of the normal test suite. We can fix this easily by iterating over the lru linked list instead of the hashmap. They both contain the same entries, and we can use the "safe" variant of the list iterator, which exists for exactly this case. Let's also add a warning to the hashmap API documentation to reduce the chances of getting bit by this again. Reported-by: Ulrich Spörlein <uqs@freebsd.org> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-19 11:17:20 -08:00
Junio C Hamano	55d128ae06	Merge branch 'bw/grep-recurse-submodules' "git grep" has been taught to optionally recurse into submodules. * bw/grep-recurse-submodules: grep: search history of moved submodules grep: enable recurse-submodules to work on <tree> objects grep: optionally recurse into submodules grep: add submodules as a grep source type submodules: load gitmodules file from commit sha1 submodules: add helper to determine if a submodule is initialized submodules: add helper to determine if a submodule is populated real_path: canonicalize directory separators in root parts real_path: have callers use real_pathdup and strbuf_realpath real_path: create real_pathdup real_path: convert real_path_internal to strbuf_realpath real_path: resolve symlinks by hand	2017-01-18 15:12:11 -08:00
Junio C Hamano	bcaf277b4a	Merge branch 'jk/quote-env-path-list-component' into maint A recent update to receive-pack to make it easier to drop garbage objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot have a pathname with a colon in it (no surprise!), and this in turn made it impossible to push into a repository at such a path. This has been fixed by introducing a quoting mechanism used when appending such a path to the colon-separated list. * jk/quote-env-path-list-component: t5615-alternate-env: double-quotes in file names do not work on Windows t5547-push-quarantine: run the path separator test on Windows, too tmp-objdir: quote paths we add to alternates alternates: accept double-quoted paths	2017-01-17 15:11:06 -08:00
Jeff King	cce044df7f	fsck: detect trailing garbage in all object types When a loose tree or commit is read by fsck (or any git program), unpack_sha1_rest() checks whether there is extra cruft at the end of the object file, after the zlib data. Blobs that are streamed, however, do not have this check. For normal git operations, it's not a big deal. We know the sha1 and size checked out, so we have the object bytes we wanted. The trailing garbage doesn't affect what we're trying to do. But since the point of fsck is to find corruption or other problems, it should be more thorough. This patch teaches its loose-sha1 reader to detect extra bytes after the zlib stream and complain. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-15 15:59:03 -08:00
Jeff King	f6371f9210	sha1_file: add read_loose_object() function It's surprisingly hard to ask the sha1_file code to open a _specific_ incarnation of a loose object. Most of the functions take a sha1, and loop over the various object types (packed versus loose) and locations (local versus alternates) at a low level. However, some tools like fsck need to look at a specific file. This patch gives them a function they can use to open the loose object at a given path. The implementation unfortunately ends up repeating bits of related functions, but there's not a good way around it without some major refactoring of the whole sha1_file stack. We need to mmap the specific file, then partially read the zlib stream to know whether we're streaming or not, and then finally either stream it or copy the data to a buffer. We can do that by assembling some of the more arcane internal sha1_file functions, but we end up having to essentially reimplement unpack_sha1_file(), along with the streaming bits of check_sha1_signature(). Still, most of the ugliness is contained in the new function, and the interface is clean enough that it may be reusable (though it seems unlikely anything but git-fsck would care about opening a specific file). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-15 15:59:03 -08:00
Jeff King	771e7d578e	sha1_file: fix error message for alternate objects When we fail to open a corrupt loose object, we report an error and mention the filename via sha1_file_name(). However, that function will always give us a path in the local repository, whereas the corrupt object may have come from an alternate. The result is a very misleading error message. Teach the open_sha1_file() and stat_sha1_file() helpers to pass back the path they found, so that we can report it correctly. Note that the pointers we return go to static storage (e.g., from sha1_file_name()), which is slightly dangerous. However, these helpers are static local helpers, and the names are used for immediately generating error messages. The simplicity is an acceptable tradeoff for the danger. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-15 15:59:03 -08:00
Junio C Hamano	02d0457eb4	Merge branch 'jc/git-open-cloexec' The codeflow of setting NOATIME and CLOEXEC on file descriptors Git opens has been simplified. We may want to drop the tip one, but we'll see. * jc/git-open-cloexec: sha1_file: stop opening files with O_NOATIME git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback git_open(): untangle possible NOATIME and CLOEXEC interactions	2017-01-10 15:24:26 -08:00
Michael Haggerty	177978f56a	raceproof_create_file(): new function Add a function that tries to create a file and any containing directories in a way that is robust against races with other processes that might be cleaning up empty directories at the same time. The actual file creation is done by a callback function, which, if it fails, should set errno to EISDIR or ENOENT according to the convention of open(). raceproof_create_file() detects such failures, and respectively either tries to delete empty directories that might be in the way of the file or tries to create the containing directories. Then it retries the callback function. This function is not yet used. Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-07 19:30:09 -08:00
Michael Haggerty	204a047f23	safe_create_leading_directories(): set errno on SCLD_EXISTS The exit path for SCLD_EXISTS wasn't setting errno, which some callers use to generate error messages for the user. Fix the problem and document that the function sets errno correctly to help avoid similar regressions in the future. Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-07 19:30:08 -08:00
Michael Haggerty	029443070a	safe_create_leading_directories_const(): preserve errno Some implementations of free() change errno (even thought they shouldn't): https://sourceware.org/bugzilla/show_bug.cgi?id=17924 So preserve the errno from safe_create_leading_directories() across the call to free(). Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-01-07 19:30:08 -08:00
Junio C Hamano	fe05033407	Merge branch 'jk/quote-env-path-list-component' A recent update to receive-pack to make it easier to drop garbage objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot have a pathname with a colon in it (no surprise!), and this in turn made it impossible to push into a repository at such a path. This has been fixed by introducing a quoting mechanism used when appending such a path to the colon-separated list. * jk/quote-env-path-list-component: t5615-alternate-env: double-quotes in file names do not work on Windows t5547-push-quarantine: run the path separator test on Windows, too tmp-objdir: quote paths we add to alternates alternates: accept double-quoted paths	2016-12-21 14:55:02 -08:00
Brandon Williams	4ac9006f83	real_path: have callers use real_pathdup and strbuf_realpath Migrate callers of real_path() who duplicate the retern value to use real_pathdup or strbuf_realpath. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-12-12 15:22:32 -08:00
Jeff King	cf3c635210	alternates: accept double-quoted paths We read lists of alternates from objects/info/alternates files (delimited by newline), as well as from the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable (delimited by colon or semi-colon, depending on the platform). There's no mechanism for quoting the delimiters, so it's impossible to specify an alternate path that contains a colon in the environment, or one that contains a newline in a file. We've lived with that restriction for ages because both alternates and filenames with colons are relatively rare, and it's only a problem when the two meet. But since `722ff7f87` (receive-pack: quarantine objects until pre-receive accepts, 2016-10-03), which builds on the alternates system, every push causes the receiver to set GIT_ALTERNATE_OBJECT_DIRECTORIES internally. It would be convenient to have some way to quote the delimiter so that we can represent arbitrary paths. The simplest thing would be an escape character before a quoted delimiter (e.g., "\:" as a literal colon). But that creates a backwards compatibility problem: any path which uses that escape character is now broken, and we've just shifted the problem. We could choose an unlikely escape character (e.g., something from the non-printable ASCII range), but that's awkward to use. Instead, let's treat names as unquoted unless they begin with a double-quote, in which case they are interpreted via our usual C-stylke quoting rules. This also breaks backwards-compatibility, but in a smaller way: it only matters if your file has a double-quote as the very _first_ character in the path (whereas an escape character is a problem anywhere in the path). It's also consistent with many other parts of git, which accept either a bare pathname or a double-quoted one, and the sender can choose to quote or not as required. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-12-12 15:10:43 -08:00
Junio C Hamano	0538b84027	Merge branch 'jk/alt-odb-cleanup' Fix a corner-case regression in a topic that graduated during the v2.11 cycle. * jk/alt-odb-cleanup: alternates: re-allow relative paths from environment	2016-11-10 13:17:30 -08:00
Jeff King	37a95862c6	alternates: re-allow relative paths from environment Commit `670c359da` (link_alt_odb_entry: handle normalize_path errors, 2016-10-03) regressed the handling of relative paths in the GIT_ALTERNATE_OBJECT_DIRECTORIES variable. It's not entirely clear this was ever meant to work, but it _has_ worked for several years, so this commit restores the original behavior. When we get a path in GIT_ALTERNATE_OBJECT_DIRECTORIES, we add it the path to the list of alternate object directories as if it were found in objects/info/alternates, but with one difference: we do not provide the link_alt_odb_entry() function with a base for relative paths. That function doesn't turn it into an absolute path, and we end up feeding the relative path to the strbuf_normalize_path() function. Most relative paths break out of the top-level directory (e.g., "../foo.git/objects"), and thus normalizing fails. Prior to `670c359da`, we simply ignored the error, and due to the way normalize_path_copy() was implemented it happened to return the original path in this case. We then accessed the alternate objects using this relative path. By storing the relative path in the alt_odb list, the path is relative to wherever we happen to be at the time we do an object lookup. That means we look from $GIT_DIR in a bare repository, and from the top of the worktree in a non-bare repository. If this were being designed from scratch, it would make sense to pick a stable location (probably $GIT_DIR, or even the object directory) and use that as the relative base, turning the result into an absolute path. However, given the history, at this point the minimal fix is to match the pre-670c359da behavior. We can do this simply by ignoring the error when we have no relative base and using the original value (which we now reliably have, thanks to strbuf_normalize_path()). That still leaves us with a relative path that foils our duplicate detection, and may act strangely if we ever chdir() later in the process. We could solve that by storing an absolute path based on getcwd(). That may be a good future direction; for now we'll do just the minimum to fix the regression. The new t5615 script demonstrates the fix in its final three tests. Since we didn't have any tests of the alternates environment variable at all, it also adds some tests of absolute paths. Reported-by: Bryan Turner <bturner@atlassian.com> Signed-off-by: Jeff King <peff@peff.net>	2016-11-08 15:28:22 -05:00
Junio C Hamano	b4d065df03	sha1_file: stop opening files with O_NOATIME When we open object files, we try to do so with O_NOATIME. This dates back to `144bde78e9` (Use O_NOATIME when opening the sha1 files., 2005-04-23), which is an optimization to avoid creating a bunch of dirty inodes when we're accessing many objects. But a few things have changed since then: 1. In June 2005, git learned about packfiles, which means we would do a lot fewer atime updates (rather than one per object access, we'd generally get one per packfile). 2. In late 2006, Linux learned about "relatime", which is generally the default on modern installs. So performance around atimes updates is a non-issue there these days. All the world isn't Linux, but as it turns out, Linux is the only platform to implement O_NOATIME in the first place. So it's very unlikely that this code is helping anybody these days. Helped-by: Jeff King <peff@peff.net> [jc: took idea and log message from peff] Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-11-02 19:34:41 -07:00
Junio C Hamano	1e3001a8e2	git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback A platform might not support open(2) with O_CLOEXEC but may support telling the same with fcntl(2) to flip FD_CLOEXEC bit on on an open file descriptor. It is a fallback that is inherently racy and this may not be worth doing, though. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-11-02 19:34:16 -07:00
Junio C Hamano	906d6906fb	Merge branch 'ls/git-open-cloexec' Git generally does not explicitly close file descriptors that were open in the parent process when spawning a child process, but most of the time the child does not want to access them. As Windows does not allow removing or renaming a file that has a file descriptor open, a slow-to-exit child can even break the parent process by holding onto them. Use O_CLOEXEC flag to open files in various codepaths. * ls/git-open-cloexec: read-cache: make sure file handles are not inherited by child processes sha1_file: open window into packfiles with O_CLOEXEC sha1_file: rename git_open_noatime() to git_open()	2016-10-31 13:15:21 -07:00
Junio C Hamano	39000e8499	Merge branch 'jk/fetch-quick-tag-following' into maint When fetching from a remote that has many tags that are irrelevant to branches we are following, we used to waste way too many cycles when checking if the object pointed at by a tag (that we are not going to fetch!) exists in our repository too carefully. * jk/fetch-quick-tag-following: fetch: use "quick" has_sha1_file for tag following	2016-10-28 09:01:17 -07:00
Junio C Hamano	1b8ac5ead5	git_open(): untangle possible NOATIME and CLOEXEC interactions The way we structured the fallback/retry mechanism for opening with O_NOATIME and O_CLOEXEC meant that if we failed due to lack of support to open the file with O_NOATIME option (i.e. EINVAL), we would still try to drop O_CLOEXEC first and retry, and then drop O_NOATIME. A platform on which O_NOATIME is defined in the header without support from the kernel wouldn't have a chance to open with O_CLOEXEC option due to this code structure. Arguably, O_CLOEXEC is more important than O_NOATIME, as the latter is mostly about performance, while the former can affect correctness. Instead use O_CLOEXEC to open the file, and then use fcntl(2) to set O_NOATIME on the resulting file descriptor. open(2) itself does not cause atime to be updated according to Linus [1]. The helper to do the former can be usable in the codepath in ce_compare_data() that was recently added to open a file descriptor with O_CLOEXEC; use it while we are at it. 1 <CA+55aFw83E+zOd+z5h-CA-3NhrLjVr-anL6pubrSWttYx3zu8g@mail.gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-28 06:23:07 -07:00
Junio C Hamano	d7ae013a31	Merge branch 'jk/abbrev-auto' Updates the way approximate count of total objects is computed while attempting to come up with a unique abbreviated object name, which in turn needs to estimate how many hexdigits are necessary to ensure uniqueness. * jk/abbrev-auto: find_unique_abbrev: move logic out of get_short_sha1()	2016-10-27 14:58:47 -07:00
Junio C Hamano	9fcd14491d	Merge branch 'jk/fetch-quick-tag-following' When fetching from a remote that has many tags that are irrelevant to branches we are following, we used to waste way too many cycles when checking if the object pointed at by a tag (that we are not going to fetch!) exists in our repository too carefully. * jk/fetch-quick-tag-following: fetch: use "quick" has_sha1_file for tag following	2016-10-26 13:14:47 -07:00
Lars Schneider	cd66ada065	sha1_file: open window into packfiles with O_CLOEXEC All processes that the Git main process spawns inherit the open file descriptors of the main process. These leaked file descriptors can cause problems. Use the O_CLOEXEC flag similar to `05d1ed61` to fix the leaked file descriptors. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-25 11:09:54 -07:00
Lars Schneider	a5436b5794	sha1_file: rename git_open_noatime() to git_open() This function is meant to be used when reading from files in the object store, and the original objective was to avoid smudging atime of loose object files too often, hence its name. Because we'll be extending its role in the next commit to also arrange the file descriptors they return auto-closed in the child processes, rename it to lose "noatime" part that is too specific. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-25 10:59:13 -07:00
Junio C Hamano	dec040192f	Merge branch 'jk/alt-odb-cleanup' Codepaths involved in interacting alternate object store have been cleaned up. * jk/alt-odb-cleanup: alternates: use fspathcmp to detect duplicates sha1_file: always allow relative paths to alternates count-objects: report alternates via verbose mode fill_sha1_file: write into a strbuf alternates: store scratch buffer as strbuf fill_sha1_file: write "boring" characters alternates: use a separate scratch space alternates: encapsulate alt->base munging alternates: provide helper for allocating alternate alternates: provide helper for adding to alternates list link_alt_odb_entry: refactor string handling link_alt_odb_entry: handle normalize_path errors t5613: clarify "too deep" recursion tests t5613: do not chdir in main process t5613: whitespace/style cleanups t5613: use test_must_fail t5613: drop test_valid_repo function t5613: drop reachable_via function	2016-10-17 13:25:20 -07:00
Jeff King	5827a03545	fetch: use "quick" has_sha1_file for tag following When we auto-follow tags in a fetch, we look at all of the tags advertised by the remote and fetch ones where we don't already have the tag, but we do have the object it peels to. This involves a lot of calls to has_sha1_file(), some of which we can reasonably expect to fail. Since `45e8a74` (has_sha1_file: re-check pack directory before giving up, 2013-08-30), this may cause many calls to reprepare_packed_git(), which is potentially expensive. This has gone unnoticed for several years because it requires a fairly unique setup to matter: 1. You need to have a lot of packs on the client side to make reprepare_packed_git() expensive (the most expensive part is finding duplicates in an unsorted list, which is currently quadratic). 2. You need a large number of tag refs on the server side that are candidates for auto-following (i.e., that the client doesn't have). Each one triggers a re-read of the pack directory. 3. Under normal circumstances, the client would auto-follow those tags and after one large fetch, (2) would no longer be true. But if those tags point to history which is disconnected from what the client otherwise fetches, then it will never auto-follow, and those candidates will impact it on every fetch. So when all three are true, each fetch pays an extra O(nr_tags * nr_packs^2) cost, mostly in string comparisons on the pack names. This was exacerbated by `47bf4b0` (prepare_packed_git_one: refactor duplicate-pack check, 2014-06-30) which uses a slightly more expensive string check, under the assumption that the duplicate check doesn't happen very often (and it shouldn't; the real problem here is how often we are calling reprepare_packed_git()). This patch teaches fetch to use HAS_SHA1_QUICK to sacrifice accuracy for speed, in cases where we might be racy with a simultaneous repack. This is similar to the fix in `0eeb077` (index-pack: avoid excessive re-reading of pack directory, 2015-06-09). As with that case, it's OK for has_sha1_file() occasionally say "no I don't have it" when we do, because the worst case is not a corruption, but simply that we may fail to auto-follow a tag that points to it. Here are results from the included perf script, which sets up a situation similar to the one described above: Test HEAD^ HEAD ---------------------------------------------------------- 5550.4: fetch 11.21(10.42+0.78) 0.08(0.04+0.02) -99.3% Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-14 11:31:32 -07:00
Junio C Hamano	f7f0a87e0a	Merge branch 'jc/verify-loose-object-header' into maint Codepaths that read from an on-disk loose object were too loose in validating what they are reading is a proper object file and sometimes read past the data they read from the disk, which has been corrected. H/t to Gustavo Grieco for reporting. * jc/verify-loose-object-header: unpack_sha1_header(): detect malformed object header streaming: make sure to notice corrupt object	2016-10-11 14:21:03 -07:00
Junio C Hamano	e6e24c94df	Merge branch 'jk/pack-objects-optim-mru' "git pack-objects" in a repository with many packfiles used to spend a lot of time looking for/at objects in them; the accesses to the packfiles are now optimized by checking the most-recently-used packfile first. * jk/pack-objects-optim-mru: pack-objects: use mru list when iterating over packs pack-objects: break delta cycles before delta-search phase sha1_file: make packed_object_info public provide an initializer for "struct object_info"	2016-10-10 14:03:47 -07:00
Jeff King	ea0fc3b417	alternates: use fspathcmp to detect duplicates On a case-insensitive filesystem, we should realize that "a/objects" and "A/objects" are the same path. We already use fspathcmp() to check against the main object directory, but until recently we couldn't use it for comparing against other alternates (because their paths were not NUL-terminated strings). But now we can, so let's do so. Note that we also need to adjust count-objects to load the config, so that it can see the setting of core.ignorecase (this is required by the test, but is also a general bugfix for users of count-objects). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:37 -07:00
Jeff King	087b6d5840	sha1_file: always allow relative paths to alternates We recursively expand alternates repositories, so that if A borrows from B which borrows from C, A can see all objects. For the root object database, we allow relative paths, so A can point to B as "../B/objects". However, we currently do not allow relative paths when recursing, so B must use an absolute path to reach C. That is an ancient protection from `c2f493a` (Transitively read alternatives, 2006-05-07) that tries to avoid adding the same alternate through two different paths. Since `5bdf0a8` (sha1_file: normalize alt_odb path before comparing and storing, 2011-09-07), we use a normalized absolute path for each alt_odb entry. This means that in most cases the protection is no longer necessary; we will detect the duplicate no matter how we got there (but see below). And it's a good idea to get rid of it, as it creates an unnecessary complication when setting up recursive alternates (B has to know that A is going to borrow from it and make sure to use an absolute path). Note that our normalization doesn't actually look at the filesystem, so it can still be fooled by crossing symbolic links. But that's also true of absolute paths, so it's not a good reason to disallow only relative paths (it's potentially a reason to switch to real_path(), but that's a separate and non-trivial change). We adjust the test script here to demonstrate that this now works, and add new tests to show that the normalization does indeed suppress duplicates. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:37 -07:00
Jeff King	f7b7774f34	fill_sha1_file: write into a strbuf It's currently the responsibility of the caller to give fill_sha1_file() enough bytes to write into, leading them to manually compute the required lengths. Instead, let's just write into a strbuf so that it's impossible to get this wrong. The alt_odb caller already has a strbuf, so this makes things strictly simpler. The other caller, sha1_file_name(), uses a static PATH_MAX buffer and dies when it would overflow. We can convert this to a static strbuf, which means our allocation cost is amortized (and as a bonus, we no longer have to worry about PATH_MAX being too short for normal use). This does introduce some small overhead in fill_sha1_file(), as each strbuf_addchar() will check whether it needs to grow. However, between the optimization in `fec501d` (strbuf_addch: avoid calling strbuf_grow, 2015-04-16) and the fact that this is not generally called in a tight loop (after all, the next step is typically to access the file!) this probably doesn't matter. And even if it did, the right place to micro-optimize is inside fill_sha1_file(), by calling a single strbuf_grow() there. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:37 -07:00
Jeff King	a5b34d2152	alternates: provide helper for adding to alternates list The submodule code wants to temporarily add an alternate object store to our in-memory alt_odb list, but does it manually. Let's provide a helper so it can reuse the code in link_alt_odb_entry(). While we're adding our new add_to_alternates_memory(), let's document add_to_alternates_file(), as the two are related. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	38dbe5f078	alternates: store scratch buffer as strbuf We pre-size the scratch buffer to hold a loose object filename of the form "xx/yyyy...", which leads to allocation code that is hard to verify. We have to use some magic numbers during the initial allocation, and then writers must blindly assume that the buffer is big enough. Using a strbuf makes it more clear that we cannot overflow. Unfortunately, we do still need some magic numbers to grow our strbuf before calling fill_sha1_path(), but the strbuf growth is much closer to the point of use. This makes it easier to see that it's correct, and opens the possibility of pushing it even further down if fill_sha1_path() learns to work on strbufs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	4ea82473aa	link_alt_odb_entry: refactor string handling The string handling in link_alt_odb_entry() is mostly an artifact of the original version, which took the path as a ptr/len combo, and did not have a NUL-terminated string until we created one in the alternate_object_database struct. But since `5bdf0a8` (sha1_file: normalize alt_odb path before comparing and storing, 2011-09-07), the first thing we do is put the path into a strbuf, which gives us some easy opportunities for cleanup. In particular: - we call strlen(pathbuf.buf), which is silly; we can look at pathbuf.len. - even though we have a strbuf, we don't maintain its "len" field when chomping extra slashes from the end, and instead keep a separate "pfxlen" variable. We can fix this and then drop "pfxlen" entirely. - we don't check whether the path is usable until after we allocate the new struct, making extra cleanup work for ourselves. Since we have a NUL-terminated string, we can bump the "is it usable" checks higher in the function. While we're at it, we can move that logic to its own helper, which makes the flow of link_alt_odb_entry() easier to follow. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	afbba2f09a	fill_sha1_file: write "boring" characters This function forms a sha1 as "xx/yyyy...", but skips over the slot for the slash rather than writing it, leaving it to the caller to do so. It also does not bother to put in a trailing NUL, even though every caller would want it (we're forming a path which by definition is not a directory, so the only thing to do with it is feed it to a system call). Let's make the lives of our callers easier by just writing out the internal "/" and the NUL. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	670c359da3	link_alt_odb_entry: handle normalize_path errors When we add a new alternate to the list, we try to normalize out any redundant "..", etc. However, we do not look at the return value of normalize_path_copy(), and will happily continue with a path that could not be normalized. Worse, the normalizing process is done in-place, so we are left with whatever half-finished working state the normalizing function was in. Fortunately, this cannot cause us to read past the end of our buffer, as that working state will always leave the NUL from the original path in place. And we do tend to notice problems when we check is_directory() on the path. But you can see the nonsense that we feed to is_directory with an entry like: this/../../is/../../way/../../too/../../deep/../../to/../../resolve in your objects/info/alternates, which yields: error: object directory /to/e/deep/too/way//ects/this/../../is/../../way/../../too/../../deep/../../to/../../resolve does not exist; check .git/objects/info/alternates. We can easily fix this just by checking the return value. But that makes it hard to generate a good error message, since we're normalizing in-place and our input value has been overwritten by cruft. Instead, let's provide a strbuf helper that does an in-place normalize, but restores the original contents on error. This uses a second buffer under the hood, which is slightly less efficient, but this is not a performance-critical code path. The strbuf helper can also properly set the "len" parameter of the strbuf before returning. Just doing: normalize_path_copy(buf.buf, buf.buf); will shorten the string, but leave buf.len at the original length. That may be confusing to later code which uses the strbuf. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	597f9134de	alternates: use a separate scratch space The alternate_object_database struct uses a single buffer both for storing the path to the alternate, and as a scratch buffer for forming object names. This is efficient (since otherwise we'd end up storing the path twice), but it makes life hard for callers who just want to know the path to the alternate. They have to remember to stop reading after "alt->name - alt->base" bytes, and to subtract one for the trailing '/'. It would be much simpler if they could simply access a NUL-terminated path string. We could encapsulate this in a function which puts a NUL in the scratch buffer and returns the string, but that opens up questions about the lifetime of the result. The first time another caller uses the alternate, the scratch buffer may get other data tacked onto it. Let's instead just store the root path separately from the scratch buffer. There aren't enough alternates being stored for the duplicated data to matter for performance, and this keeps things simple and safe for the callers. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	29ec6af2b8	alternates: encapsulate alt->base munging The alternate_object_database struct holds a path to the alternate objects, but we also use that buffer as scratch space for forming loose object filenames. Let's pull that logic into a helper function so that we can more easily modify it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	7f0fa2c02a	alternates: provide helper for allocating alternate Allocating a struct alternate_object_database is tricky, as we must over-allocate the buffer to provide scratch space, and then put in particular '/' and NUL markers. Let's encapsulate this in a function so that the complexity doesn't leak into callers (and so that we can modify it later). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-10 13:52:36 -07:00
Jeff King	8e3f52d778	find_unique_abbrev: move logic out of get_short_sha1() The get_short_sha1() is only about reading short sha1s; we do call it in a loop to check "is this long enough" for each object, but otherwise it should not need to know about things like our default_abbrev setting. So instead of asking it to set default_automatic_abbrev as a side-effect, let's just have find_unique_abbrev() pick the right place to start its loop. This requires a separate approximate_object_count() function, but that naturally belongs with the rest of sha1_file.c. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-10-03 21:03:14 -07:00
Junio C Hamano	71a57ab32d	Merge branch 'jc/verify-loose-object-header' Codepaths that read from an on-disk loose object were too loose in validating what they are reading is a proper object file and sometimes read past the data they read from the disk, which has been corrected. H/t to Gustavo Grieco for reporting. * jc/verify-loose-object-header: unpack_sha1_header(): detect malformed object header streaming: make sure to notice corrupt object	2016-10-03 13:30:36 -07:00
Junio C Hamano	d21f842690	unpack_sha1_header(): detect malformed object header When opening a loose object file, we often do this sequence: - prepare a short buffer for the object header (on stack) - call unpack_sha1_header() and have early part of the object data inflated, enough to fill the buffer - parse that data in the short buffer, assuming that the first part of the object is <typename> SP <length> NUL Because the parsing function parse_sha1_header_extended() is not given the number of bytes inflated into the header buffer, it you craft a file whose early part inflates a garbage sequence without SP or NUL, and replace a loose object with it, it will end up reading past the end of the inflated data. To correct this, do the following four things: - rename unpack_sha1_header() to unpack_sha1_short_header() and have unpack_sha1_header_to_strbuf() keep calling that as its helper function. This will detect and report zlib errors, but is not aware of the format of a loose object (as before). - introduce unpack_sha1_header() that calls the same helper function, and when zlib reports it inflated OK into the buffer, check if the inflated data has NUL. This would ensure that parsing function will terminate within the buffer that holds the inflated header. - update unpack_sha1_header_to_strbuf() to check if the resulting buffer has NUL for the same effect. - update parse_sha1_header_extended() to make sure that its loop to find the SP that terminates the <typename> stops at NUL. Essentially, this makes unpack_*() functions that are asked to unpack a loose object header to be a bit more strict and detect an input that cannot possibly be a valid object header, even before the parsing function kicks in. Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-26 10:48:22 -07:00
Junio C Hamano	ee19836995	Merge branch 'rs/pack-sort-with-llist-mergesort' Code cleanup. * rs/pack-sort-with-llist-mergesort: sha1_file: use llist_mergesort() for sorting packs	2016-09-21 15:15:25 -07:00
Junio C Hamano	13307145a9	Merge branch 'jk/delta-base-cache' Recently we updated the code to manage the in-core cache that holds objects that have recently been used to reconstitute other objects that are stored as deltas against them, but the update used an incorrect API function to manage the list of these objects. This has been fixed. * jk/delta-base-cache: add_delta_base_cache: use list_for_each_safe	2016-09-21 15:15:20 -07:00
René Scharfe	c4c6effa9b	sha1_file: use llist_mergesort() for sorting packs Sort the linked list of packs directly using llist_mergesort() instead of building an array, calling qsort(3) and fixing up the list pointers. This is shorter and less complicated. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-13 16:26:46 -07:00
Junio C Hamano	305d7f1339	Merge branch 'jk/diff-submodule-diff-inline' The "git diff --submodule={short,log}" mechanism has been enhanced to allow "--submodule=diff" to show the patch between the submodule commits bound to the superproject. * jk/diff-submodule-diff-inline: diff: teach diff to display submodule difference with an inline diff submodule: refactor show_submodule_summary with helper function submodule: convert show_submodule_summary to use struct object_id * allow do_submodule_path to work even if submodule isn't checked out diff: prepare for additional submodule formats graph: add support for --line-prefix on all graph-aware output diff.c: remove output_prefix_length field cache: add empty_tree_oid object and helper function	2016-09-12 15:34:31 -07:00
Jeff King	4b92bae7d3	add_delta_base_cache: use list_for_each_safe We may remove elements from the list while we are iterating, which requires using a second temporary pointer. Otherwise stepping to the next element of the list might involve looking at freed memory (which generally works in practice, as we _just_ freed it, but of course is wrong to rely on; valgrind notices it). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-12 12:09:41 -07:00
Junio C Hamano	02c6c14d6c	Merge branch 'sb/submodule-clone-rr' "git clone --resurse-submodules --reference $path $URL" is a way to reduce network transfer cost by borrowing objects in an existing $path repository when cloning the superproject from $URL; it learned to also peek into $path for presense of corresponding repositories of submodules and borrow objects from there when able. * sb/submodule-clone-rr: clone: recursive and reference option triggers submodule alternates clone: implement optional references clone: clarify option_reference as required clone: factor out checking for an alternate path submodule--helper update-clone: allow multiple references submodule--helper module-clone: allow multiple references t7408: merge short tests, factor out testing method t7408: modernize style	2016-09-08 21:49:50 -07:00
Jacob Keller	8576fde6cb	cache: add empty_tree_oid object and helper function Similar to is_null_oid(), and is_empty_blob_sha1() add an empty_tree_oid along with helper function is_empty_tree_oid(). For completeness, also add an "is_empty_tree_sha1()", "is_empty_blob_sha1()", "is_empty_tree_oid()" and "is_empty_blob_oid()" helpers. To ensure we only get one singleton, implement EMPTY_BLOB_SHA1_BIN as simply getting the hash of empty_blob_oid structure. Signed-off-by: Jacob Keller <jacob.keller@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-31 18:07:08 -07:00
Jeff King	8261e1f139	delta_base_cache: use hashmap.h The fundamental data structure of the delta base cache is a hash table mapping pairs of "(packfile, offset)" into structs containing the actual object data. The hash table implementation dates back to `e5e0161` (Implement a simple delta_base cache, 2007-03-17), and uses a fixed-size table. The current size is a hard-coded 256 entries. Because we need to be able to remove objects from the hash table, entry lookup does not do any kind of probing to handle collisions. Colliding items simply replace whatever is in their slot. As a result, we have fewer usable slots than even the 256 we allocate. At half full, each new item has a 50% chance of displacing another one. Or another way to think about it: every item has a 1/256 chance of being ejected due to hash collision, without regard to our LRU strategy. So it would be interesting to see the effect of increasing the cache size on the runtime for some common operations. As with the previous patch, we'll measure "git log --raw" for tree-only operations, and "git log -Sfoo --raw" for operations that touch trees and blobs. All times are wall-clock best-of-3, done against fully packed repos with --depth=50, and the default core.deltaBaseCacheLimit of 96MB. Here are timings for various values of MAX_DELTA_CACHE against git.git (the asterisk marks the minimum time for each operation): MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.227s 0m12.821s 512 0m02.143s 0m10.602s 1024 0m02.127s 0m08.642s 2048 0m02.148s 0m07.123s 4096 0m02.194s 0m06.448s* 8192 0m02.239s 0m06.504s 16384 0m02.144s* 0m06.502s 32768 0m02.202s 0m06.622s 65536 0m02.230s 0m06.677s The log-raw case isn't changed much at all here (probably because our trees just aren't that big in the first place, or possibly because we have so _few_ trees in git.git that the 256-entry cache is enough). But once we start putting blobs in the cache, too, we see a big improvement (almost 50%). The curve levels off around 4096, which means that we can hold about that many entries before hitting the 96MB memory limit (or possibly that the workload is small enough that there is simply no more work to be optimized out by caching more). (As a side note, I initially timed my existing git.git pack, which was a base of --aggressive combined with some pulls on top. So it had quite a few deeper delta chains. The 256-cache case was more like 15s, and it still dropped to ~6.5s in the same way). Here are the timings for linux.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m41.661s 5m12.410s 512 0m39.547s 5m07.920s 1024 0m37.054s 4m54.666s 2048 0m35.871s 4m41.194s* 4096 0m34.646s 4m51.648s 8192 0m33.881s 4m55.342s 16384 0m35.190s 5m00.122s 32768 0m35.060s 4m58.851s 65536 0m33.311s* 4m51.420s As we grow we see a nice 20% speedup in the tree traversal, and more modest 10% in the log-S. This is probably an indication that we are bound less by the number of entries, and more by the memory limit (more on that below). What is interesting is that the numbers bounce around a bit; increasing the number of entries isn't always a strict improvement. Partially this is due to noise in the measurement. But it may also be an indication that our LRU ejection scheme is not optimal. The smaller cache sizes introduce some randomness into the ejection (due to collisions), which may sometimes work in our favor (and sometimes not!). So what is the optimal setting of MAX_DELTA_CACHE? The "bouncing" in the linux.git log-S numbers notwithstanding, it mostly seems like bigger is better. And even if we were to try to find a "sweet spot", these are just two repositories, that are not necessarily representative. The shape of history, the size of trees and blobs, the memory limit configuration, etc, all will affect the outcome. Rather than trying to find the "right" number, another strategy is to just switch to a hash table that can actually store collisions: namely our hashmap.h implementation. Here are numbers for that compared to the "best" we saw from adjusting MAX_DELTA_CACHE: \| log-raw \| log-S \| best hashmap \| best hashmap \| --------- --------- \| --------- --------- git \| 0m02.144s 0m02.144s \| 0m06.448s 0m06.688s linux \| 0m33.311s 0m33.092s \| 4m41.194s 4m57.172s We can see the results are similar in most cases, which is what we'd expect. We're not ejecting due to collisions at all, so this is purely representing the LRU. So really, we'd expect this to model most closely the larger values of the static MAX_DELTA_CACHE limit. And that does seem to be what's happening, including the "bounce" in the linux log-S case. So while the value for that case _isn't_ as good as the optimal one measured above (which was 2048 entries), given the bouncing I'm hesitant to suggest that 2048 is any kind of optimum (not even for linux.git, let alone as a general rule). The generic hashmap has the appeal that it drops the number of tweakable numbers by one, which means we can focus on tuning other elements, like the LRU strategy or the core.deltaBaseCacheLimit setting. And indeed, if we bump the cache limit to 1G (which is probably silly for general use, but maybe something people with big workstations would want to do), the linux.git log-S time drops to 3m32s. That's something you really _can't_ do easily with the static hash table, because the number of entries needs to grow in proportion to the memory limit (so 2048 is almost certainly not going to be the right value there). This patch takes that direction, and drops the static hash table entirely in favor of using the hashmap.h API. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 15:18:50 -07:00
Jeff King	6d9617f4f7	delta_base_cache: drop special treatment of blobs When the delta base cache runs out of allowed memory, it has to drop entries. It does so by walking an LRU list, dropping objects until we are under the memory limit. But we actually walk the list twice: once to drop blobs, and then again to drop other objects (which are generally trees). This comes from `18bdec1` (Limit the size of the new delta_base_cache, 2007-03-19). This performs poorly as the number of entries grows, because any time dropping blobs does not satisfy the limit, we have to walk the _entire_ list, trees included, looking for blobs to drop, before starting to drop any trees. It's not generally a problem now, as the cache is limited to only 256 entries. But as we could benefit from increasing that in a future patch, it's worth looking at how it performs as the cache size grows. And the answer is "not well". The table below shows times for various operations with different values of MAX_DELTA_CACHE (which is not a run-time knob; I recompiled with -DMAX_DELTA_CACHE=$n for each). I chose "git log --raw" ("log-raw" in the table) because it will access all of the trees, but no blobs at all (so in a sense it is a worst case for this problem, because we will always walk over the entire list of trees once before realizing there are no blobs to drop). This is also representative of other tree-only operations like "rev-list --objects" and "git log -- <path>". I also timed "git log -Sfoo --raw" ("log-S" in the table). It similarly accesses all of the trees, but also the blobs for each commit. It's representative of "git log -p", though it emphasizes the cost of blob access more, as "-S" is cheaper than computing an actual blob diff. All timings are best-of-3 wall-clock times (though they all were CPU bound, so the user CPU times are similar). The repositories were fully packed with --depth=50, and the default core.deltaBaseCacheLimit of 96M was in effect. The current value of MAX_DELTA_CACHE is 256, so I started there and worked up by factors of 2. First, here are values for git.git (the asterisk signals the fastest run for each operation): MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.212s 0m12.634s 512 0m02.136s* 0m10.614s 1024 0m02.156s 0m08.614s 2048 0m02.208s 0m07.062s 4096 0m02.190s 0m06.484s* 8192 0m02.176s 0m07.635s 16384 0m02.913s 0m19.845s 32768 0m03.617s 1m05.507s 65536 0m04.031s 1m18.488s You can see that for the tree-only log-raw case, we don't actually benefit that much as the cache grows (all the differences up through 8192 are basically just noise; this is probably because we don't actually have that many distinct trees in git.git). But for log-S, we get a definite speed improvement as the cache grows, but the improvements are lost as cache size grows and the linear LRU management starts to dominate. Here's the same thing run against linux.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- ---------- 256 0m40.987s 5m13.216s 512 0m37.949s 5m03.243s 1024 0m35.977s 4m50.580s 2048 0m33.855s 4m39.818s 4096 0m32.913s 4m47.299s* 8192 0m32.176s* 5m14.650s 16384 0m32.185s 6m31.625s 32768 0m38.056s 9m31.136s 65536 1m30.518s 17m38.549s The pattern is similar, though the effect in log-raw is more pronounced here. The times dip down in the middle, and then go back up as we keep growing. So we know there's a problem. What's the solution? The obvious one is to improve the data structure to avoid walking over tree entries during the looking-for-blobs traversal. We can do this by keeping _two_ LRU lists: one for blobs, and one for other objects. We drop items from the blob LRU first, and then from the tree LRU (if necessary). Here's git.git using that strategy: MAX_DELTA_CACHE log-raw log-S --------------- --------- ---------- 256 0m02.264s 0m12.830s 512 0m02.201s 0m10.771s 1024 0m02.181s 0m08.593s 2048 0m02.205s 0m07.116s 4096 0m02.158s 0m06.537s* 8192 0m02.213s 0m07.246s 16384 0m02.155s* 0m10.975s 32768 0m02.159s 0m16.047s 65536 0m02.181s 0m16.992s The upswing on log-raw is gone completely. But log-S still has it (albeit much better than without this strategy). Let's see what linux.git shows: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m42.519s 5m14.654s 512 0m39.106s 5m04.708s 1024 0m36.802s 4m51.454s 2048 0m34.685s 4m39.378s* 4096 0m33.663s 4m44.047s 8192 0m33.157s 4m50.644s 16384 0m33.090s* 4m49.648s 32768 0m33.458s 4m53.371s 65536 0m33.563s 5m04.580s The results are similar. The tree-only case again performs well (not surprising; we're literally just dropping the one useless walk, and not otherwise changing the cache eviction strategy at all). But the log-S case again does a bit worse as the cache grows (though possibly that's within the noise, which is much larger for this case). Perhaps this is an indication that the "remove blobs first" strategy is not actually optimal. The intent of it is to avoid blowing out the tree cache when we see large blobs, but it also means we'll throw away useful, recent blobs in favor of older trees. Let's run the same numbers without caring about object type at all (i.e., one LRU list, and always evicting whatever is at the head, regardless of type). Here's git.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.227s 0m12.821s 512 0m02.143s 0m10.602s 1024 0m02.127s 0m08.642s 2048 0m02.148s 0m07.123s 4096 0m02.194s 0m06.448s* 8192 0m02.239s 0m06.504s 16384 0m02.144s* 0m06.502s 32768 0m02.202s 0m06.622s 65536 0m02.230s 0m06.677s Much smoother; there's no dramatic upswing as we increase the cache size (some remains, though it's small enough that it's mostly run-to-run noise. E.g., in the log-raw case, note how 8192 is 50-100ms higher than its neighbors). Note also that we stop getting any real benefit for log-S after about 4096 entries; that number will depend on the size of the repository, the size of the blob entries, and the memory limit of the cache. Let's see what linux.git shows for the same strategy: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m41.661s 5m12.410s 512 0m39.547s 5m07.920s 1024 0m37.054s 4m54.666s 2048 0m35.871s 4m41.194s* 4096 0m34.646s 4m51.648s 8192 0m33.881s 4m55.342s 16384 0m35.190s 5m00.122s 32768 0m35.060s 4m58.851s 65536 0m33.311s* 4m51.420s It's similarly good. As with the "separate blob LRU" strategy, there's a lot of noise on the log-S run here. But it's certainly not any worse, is possibly a bit better, and the improvement over "separate blob LRU" on the git.git case is dramatic. So it seems like a clear winner, and that's what this patch implements. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 14:57:44 -07:00
Jeff King	12d95ef6fc	delta_base_cache: use list.h for LRU We keep an LRU list of entries for when we need to drop something from an over-full cache. The list is implemented as a circular doubly-linked list, which is exactly what list.h provides. We can save a few lines by using the list.h macros and functions. More importantly, this makes the code easier to follow, as the reader sees explicit concepts like "list_add_tail()" instead of pointer manipulation. As a bonus, the list_entry() macro lets us place the lru pointers anywhere inside the delta_base_cache_entry struct (as opposed to just casting the pointer, which requires it at the front of the struct). This will be useful in later patches when we need to place other items at the front of the struct (e.g., our hashmap implementation requires this). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 14:52:00 -07:00
Jeff King	f92dd60f95	release_delta_base_cache: reuse existing detach function This function drops an entry entirely from the cache, meaning that aside from the freeing of the buffer, it is exactly equivalent to detach_delta_base_cache_entry(). Let's build on top of the detach function, which shortens the code and will make it simpler when we change out the underlying storage in future patches. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 14:47:33 -07:00
Jeff King	4a5397ca79	clear_delta_base_cache_entry: use a more descriptive name The delta base cache entries are stored in a fixed-length hash table. So the way to remove an entry is to "clear" the slot in the table, and that is what this function does. However, the name is a leaky abstraction. If we were to change the hash table implementation, it would no longer be about "clearing". We should name it after _what_ it does, not _how_ it does it. I.e., something like "remove" instead of "clear". But that does not tell the whole story, either. The subtle thing about this function is that it removes the entry, but does not free the entry data. So a more descriptive name is "detach"; we give ownership of the data buffer to the caller, and remove any other resources. This patch uses the name detach_delta_base_cache_entry(). We could further model this after functions like strbuf_detach(), which pass back all of the detached information. However, since there are so many bits of information in the struct (the data, the size, the type), and so few callers (only one), it's not worth that awkwardness. The name change and a comment can make the intent clear. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 14:45:29 -07:00
Jeff King	85fe35ab9e	cache_or_unpack_entry: drop keep_cache parameter There is only one caller of cache_or_unpack_entry() and it always passes 1 for the keep_cache parameter. We can simplify it by dropping the "!keep_cache" case. Another call, which did pass 0, was dropped in `abe601b` (sha1_file: remove recursion in unpack_entry, 2013-03-27), as unpack_entry() now does more complicated things than a simple unpack when there is a cache miss. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-23 14:44:26 -07:00
Stefan Beller	9eeea7d2bc	clone: factor out checking for an alternate path In a later patch we want to determine if a path is suitable as an alternate from other commands than builtin/clone. Move the checking functionality of `add_one_reference` to `compute_alternate_path` that is defined in cache.h. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-15 15:28:01 -07:00
Junio C Hamano	e6b8f80653	Merge branch 'vs/typofix' * vs/typofix: Spelling fixes	2016-08-12 09:47:37 -07:00
Ville Skyttä	2e3a16b279	Spelling fixes <BAD> <CORRECTED> accidently accidentally commited committed dependancy dependency emtpy empty existance existence explicitely explicitly git-upload-achive git-upload-archive hierachy hierarchy indegee indegree intial initial mulitple multiple non-existant non-existent precendence. precedence. priviledged privileged programatically programmatically psuedo-binary pseudo-binary soemwhere somewhere successfull successful transfering transferring uncommited uncommitted unkown unknown usefull useful writting writing Signed-off-by: Ville Skyttä <ville.skytta@iki.fi> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-11 14:35:42 -07:00
Jeff King	ca79c98572	sha1_file: make packed_object_info public Some code may have a pack/offset pair for an object, but would like to look up more information. Using sha1_object_info() is too heavy-weight; it starts from the sha1 and has to find the pack again (so not only does it waste time, it might not even find the same instance). In some cases, this problem is solved by helpers like get_size_from_delta(), which is used by pack-objects to take a shortcut for objects whose packed representation has already been found. But there's no similar function for getting the object type, for instance. Rather than introduce one, let's just make the whole packed_object_info() available. It is smart enough to spend effort only on the items the caller wants. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-11 10:43:24 -07:00
Jeff King	27b5c1a065	provide an initializer for "struct object_info" An all-zero initializer is fine for this struct, but because the first element is a pointer, call sites need to know to use "NULL" instead of "0". Otherwise some static checkers like "sparse" will complain; see `d099b71` (Fix some sparse warnings, 2013-07-18) for example. So let's provide an initializer to make this easier to get right. But let's also comment that memset() to zero is explicitly OK[1]. One of the callers embeds object_info in another struct which is initialized via memset (expand_data in builtin/cat-file.c). Since our subset of C doesn't allow assignment from a compound literal, handling this in any other way is awkward, so we'd like to keep the ability to initialize by memset(). By documenting this property, it should make anybody who wants to change the initializer think twice before doing so. There's one other caller of interest. In parse_sha1_header(), we did not initialize the struct fully in the first place. This turned out not to be a bug because the sub-function it calls does not look at any other fields except the ones we did initialize. But that assumption might not hold in the future, so it's a dangerous construct. This patch switches it to initializing the whole struct, which protects us against unexpected reads of the other fields. [1] Obviously using memset() to initialize a pointer violates the C standard, but we long ago decided that it was an acceptable tradeoff in the real world. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-11 10:42:23 -07:00
Junio C Hamano	1a5f1a3f25	Merge branch 'js/am-3-merge-recursive-direct' "git am -3" calls "git merge-recursive" when it needs to fall back to a three-way merge; this call has been turned into an internal subroutine call instead of spawning a separate subprocess. * js/am-3-merge-recursive-direct: merge-recursive: flush output buffer even when erroring out merge_trees(): ensure that the callers release output buffer merge-recursive: offer an option to retain the output in 'obuf' merge-recursive: write the commit title in one go merge-recursive: flush output buffer before printing error messages am -3: use merge_recursive() directly again merge-recursive: switch to returning errors instead of dying merge-recursive: handle return values indicating errors merge-recursive: allow write_tree_from_memory() to error out merge-recursive: avoid returning a wholesale struct merge_recursive: abort properly upon errors prepare the builtins for a libified merge_recursive() merge-recursive: clarify code in was_tracked() die(_("BUG")): avoid translating bug messages die("bug"): report bugs consistently t5520: verify that `pull --rebase` shows the helpful advice when failing	2016-08-10 12:33:20 -07:00
Junio C Hamano	78849622ec	Merge branch 'jk/pack-objects-optim' "git pack-objects" has a few options that tell it not to pack objects found in certain packfiles, which require it to scan .idx files of all available packs. The codepaths involved in these operations have been optimized for a common case of not having any non-local pack and/or any .kept pack. * jk/pack-objects-optim: pack-objects: compute local/ignore_pack_keep early pack-objects: break out of want_object loop early find_pack_entry: replace last_found_pack with MRU cache add generic most-recently-used list sha1_file: drop free_pack_by_name t/perf: add tests for many-pack scenarios	2016-08-08 14:48:39 -07:00
Junio C Hamano	aa9136a87e	Merge branch 'nd/pack-ofs-4gb-limit' into maint "git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation	2016-08-08 14:21:36 -07:00
Jeff King	a73cdd21c4	find_pack_entry: replace last_found_pack with MRU cache Each pack has an index for looking up entries in O(log n) time, but if we have multiple packs, we have to scan through them linearly. This can produce a measurable overhead for some operations. We dealt with this long ago in `f7c22cc` (always start looking up objects in the last used pack first, 2007-05-30), which keeps what is essentially a 1-element most-recently-used cache. In theory, we should be able to do better by keeping a similar but longer cache, that is the same length as the pack-list itself. Since we now have a convenient generic MRU structure, we can plug it in and measure. Here are the numbers for running p5303 against linux.git: Test HEAD^ HEAD ------------------------------------------------------------------------ 5303.3: rev-list (1) 31.56(31.28+0.27) 31.30(31.08+0.20) -0.8% 5303.4: repack (1) 40.62(39.35+2.36) 40.60(39.27+2.44) -0.0% 5303.6: rev-list (50) 31.31(31.06+0.23) 31.23(31.00+0.22) -0.3% 5303.7: repack (50) 58.65(69.12+1.94) 58.27(68.64+2.05) -0.6% 5303.9: rev-list (1000) 38.74(38.40+0.33) 31.87(31.62+0.24) -17.7% 5303.10: repack (1000) 367.20(441.80+4.62) 342.00(414.04+3.72) -6.9% The main numbers of interest here are the rev-list ones (since that is exercising the normal object lookup code path). The single-pack case shouldn't improve at all; the 260ms speedup there is just part of the run-to-run noise (but it's important to note that we didn't make anything worse with the overhead of maintaining our cache). In the 50-pack case, we see similar results. There may be a slight improvement, but it's mostly within the noise. The 1000-pack case does show a big improvement, though. That carries over to the repack case, as well. Even though we haven't touched its pack-search loop yet, it does still do a lot of normal object lookups (e.g., for the internal revision walk), and so improves. As a point of reference, I also ran the 1000-pack test against a version of HEAD^ with the last_found_pack optimization disabled. It takes ~60s, so that gives an indication of how much even the single-element cache is helping. For comparison, here's a smaller repository, git.git: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.3: rev-list (1) 1.56(1.54+0.01) 1.54(1.51+0.02) -1.3% 5303.4: repack (1) 1.84(1.80+0.10) 1.82(1.80+0.09) -1.1% 5303.6: rev-list (50) 1.58(1.55+0.02) 1.59(1.57+0.01) +0.6% 5303.7: repack (50) 2.50(3.18+0.04) 2.50(3.14+0.04) +0.0% 5303.9: rev-list (1000) 2.76(2.71+0.04) 2.24(2.21+0.02) -18.8% 5303.10: repack (1000) 13.21(19.56+0.25) 11.66(18.01+0.21) -11.7% You can see that the percentage improvement is similar. That's because the lookup we are optimizing is roughly O(nr_objects * nr_packs). Since the number of packs is constant in both tests, we'd expect the improvement to be linear in the number of objects. But the whole process is also linear in the number of objects, so the improvement is a constant factor. The exact improvement does also depend on the contents of the packs. In p5303, the extra packs all have 5 first-parent commits in them, which is a reasonable simulation of a pushed-to repository. But it also means that only 250 first-parent commits are in those packs (compared to almost 50,000 total in linux.git), and the rest are in the huge "base" pack. So once we start looking at history in taht big pack, that's where we'll find most everything, and even the 1-element cache gets close to 100% cache hits. You could almost certainly show better numbers with a more pathological case (e.g., distributing the objects more evenly across the packs). But that's simply not that realistic a scenario, so it makes more sense to focus on these numbers. The implementation itself is a straightforward application of the MRU code. We provide an MRU-ordered list of packs that shadows the packed_git list. This is easy to do because we only create and revise the pack list in one place. The "reprepare" code path actually drops the whole MRU and replaces it for simplicity. It would be more efficient to just add new entries, but there's not much point in optimizing here; repreparing happens rarely, and only after doing a lot of other expensive work. The key things to keep optimized are traversal (which is just a normal linked list, albeit with one extra level of indirection over the regular packed_git list), and marking (which is a constant number of pointer assignments, though slightly more than the old last_found_pack was; it doesn't seem to create a measurable slowdown, though). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-29 11:05:07 -07:00
Jeff King	3157c880f6	sha1_file: drop free_pack_by_name The point of this function is to drop an entry from the "packed_git" cache that points to a file we might be overwriting, because our contents may not be the same (and hence the only caller was pack-objects as it moved a temporary packfile into place). In older versions of git, this could happen because the names of packfiles were derived from the set of objects they contained, not the actual bits on disk. But since `1190a1a` (pack-objects: name pack files after trailer hash, 2013-12-05), the name reflects the actual bits on disk, and any two packfiles with the same name can be used interchangeably. Dropping this function not only saves a few lines of code, it makes the lifetime of "struct packed_git" much easier to reason about: namely, we now do not ever free these structs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-29 11:05:06 -07:00
Junio C Hamano	ad2d777604	Merge branch 'nd/pack-ofs-4gb-limit' "git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation	2016-07-28 10:34:42 -07:00
Johannes Schindelin	ef1177d18e	die("bug"): report bugs consistently The vast majority of error messages in Git's source code which report a bug use the convention to prefix the message with "BUG:". As part of cleaning up merge-recursive to stop die()ing except in case of detected bugs, let's just make the remainder of the bug reports consistent with the de facto rule. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-26 11:13:44 -07:00
Nguyễn Thái Ngọc Duy	211c61c6cf	pack-objects: pass length to check_pack_crc() without truncation On 32 bit systems with large file support, unsigned long is 32-bit while the two offsets in the subtraction expression (pack-objects has the exact same expression as in sha1_file.c but not shown in diff) are in 64-bit. If an in-pack object is larger than 2^32 len/datalen is truncated and we get a misleading "error: bad packed object CRC for ..." as a result. Use off_t for len and datalen. check_pack_crc() already accepts this argument as off_t and can deal with 4+ GB. Noticed-by: Christoph Michelbach <michelbach94@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-12 10:14:29 -07:00
Junio C Hamano	352d72a30e	Merge branch 'nd/worktree-various-heads' The experimental "multiple worktree" feature gains more safety to forbid operations on a branch that is checked out or being actively worked on elsewhere, by noticing that e.g. it is being rebased. * nd/worktree-various-heads: branch: do not rename a branch under bisect or rebase worktree.c: check whether branch is bisected in another worktree wt-status.c: split bisect detection out of wt_status_get_state() worktree.c: check whether branch is rebased in another worktree worktree.c: avoid referencing to worktrees[i] multiple times wt-status.c: make wt_status_check_rebase() work on any worktree wt-status.c: split rebase detection out of wt_status_get_state() path.c: refactor and add worktree_git_path() worktree.c: mark current worktree worktree.c: make find_shared_symref() return struct worktree * worktree.c: store "id" instead of "git_dir" path.c: add git_common_path() and strbuf_git_common_path() dir.c: rename str(n)cmp_icase to fspath(n)cmp	2016-05-23 14:54:29 -07:00
Nguyễn Thái Ngọc Duy	7616c6ca9d	sha1_file.c: use {error,die,warning}_errno() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-05-09 12:29:08 -07:00
Nguyễn Thái Ngọc Duy	ba0897e6ae	dir.c: rename str(n)cmp_icase to fspath(n)cmp These functions compare two paths that are taken from file system. Depending on the running file system, paths may need to be compared case-sensitively or not, and maybe even something else in future. The current names do not convey that well. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-04-22 14:09:37 -07:00
Junio C Hamano	090de6b289	Merge branch 'jk/pack-idx-corruption-safety' The code to read the pack data using the offsets stored in the pack idx file has been made more carefully check the validity of the data in the idx. * jk/pack-idx-corruption-safety: sha1_file.c: mark strings for translation use_pack: handle signed off_t overflow nth_packed_object_offset: bounds-check extended offset t5313: test bounds-checks of corrupted/malicious pack/idx files	2016-03-04 13:45:47 -08:00
Nguyễn Thái Ngọc Duy	7465feba51	sha1_file.c: mark strings for translation Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-27 09:54:57 -08:00
Junio C Hamano	11529ecec9	Merge branch 'jk/tighten-alloc' Update various codepaths to avoid manually-counted malloc(). * jk/tighten-alloc: (22 commits) ewah: convert to REALLOC_ARRAY, etc convert ewah/bitmap code to use xmalloc diff_populate_gitlink: use a strbuf transport_anonymize_url: use xstrfmt git-compat-util: drop mempcpy compat code sequencer: simplify memory allocation of get_message test-path-utils: fix normalize_path_copy output buffer size fetch-pack: simplify add_sought_entry fast-import: simplify allocation in start_packfile write_untracked_extension: use FLEX_ALLOC helper prepare_{git,shell}_cmd: use argv_array use st_add and st_mult for allocation size computation convert trivial cases to FLEX_ARRAY macros use xmallocz to avoid size arithmetic convert trivial cases to ALLOC_ARRAY convert manual allocations to argv_array argv-array: add detach function add helpers for allocating flex-array structs harden REALLOC_ARRAY and xcalloc against size_t overflow tree-diff: catch integer overflow in combine_diff_path allocation ...	2016-02-26 13:37:16 -08:00
Jeff King	13e0b0d3dc	use_pack: handle signed off_t overflow A v2 pack index file can specify an offset within a packfile of up to 2^64-1 bytes. On a system with a signed 64-bit off_t, we can represent only up to 2^63-1. This means that a corrupted .idx file can end up with a negative offset in the pack code. Our bounds-checking use_pack function looks for too-large offsets, but not for ones that have wrapped around to negative. Let's do so, which fixes an out-of-bounds access demonstrated in t5313. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-25 11:32:46 -08:00
Jeff King	47fe3f6ef0	nth_packed_object_offset: bounds-check extended offset If a pack .idx file has a corrupted offset for an object, we may try to access an offset in the .idx or .pack file that is larger than the file's size. For the .pack case, we have use_pack() to protect us, which realizes the access is out of bounds. But if the corrupted value asks us to look in the .idx file's secondary 64-bit offset table, we blindly add it to the mmap'd index data and access arbitrary memory. We can fix this with a simple bounds-check compared to the size we found when we opened the .idx file. Note that there's similar code in index-pack that is triggered only during "index-pack --verify". To support both, we pull the bounds-check into a separate function, which dies when it sees a corrupted file. It would be nice if we could return an error, so that the pack code could try to find a good copy of the object elsewhere. Currently nth_packed_object_offset doesn't have any way to return an error, but it could probably use "0" as a sentinel value (since no object can start there). This is the minimal fix, and we can improve the resilience later on top. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-25 11:32:43 -08:00
Jeff King	50a6c8efa2	use st_add and st_mult for allocation size computation If our size computation overflows size_t, we may allocate a much smaller buffer than we expected and overflow it. It's probably impossible to trigger an overflow in most of these sites in practice, but it is easy enough convert their additions and multiplications into overflow-checking variants. This may be fixing real bugs, and it makes auditing the code easier. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-22 14:51:09 -08:00
Jeff King	b32fa95fd8	convert trivial cases to ALLOC_ARRAY Each of these cases can be converted to use ALLOC_ARRAY or REALLOC_ARRAY, which has two advantages: 1. It automatically checks the array-size multiplication for overflow. 2. It always uses sizeof(*array) for the element-size, so that it can never go out of sync with the declared type of the array. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-22 14:51:09 -08:00
Junio C Hamano	3f16396228	clone/sha1_file: read info/alternates with strbuf_getline() $GIT_OBJECT_DIRECTORY/info/alternates is a text file that can be edited with a DOS editor. We do not want to use the real path with CR appended at the end. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-01-15 10:34:53 -08:00
Junio C Hamano	8f309aeb82	strbuf: introduce strbuf_getline_{lf,nul}() The strbuf_getline() interface allows a byte other than LF or NUL as the line terminator, but this is only because I wrote these codepaths anticipating that there might be a value other than NUL and LF that could be useful when I introduced line_termination long time ago. No useful caller that uses other value has emerged. By now, it is clear that the interface is overly broad without a good reason. Many codepaths have hardcoded preference to read either LF terminated or NUL terminated records from their input, and then call strbuf_getline() with LF or NUL as the third parameter. This step introduces two thin wrappers around strbuf_getline(), namely, strbuf_getline_lf() and strbuf_getline_nul(), and mechanically rewrites these call sites to call either one of them. The changes contained in this patch are: * introduction of these two functions in strbuf.[ch] * mechanical conversion of all callers to strbuf_getline() with either '\n' or '\0' as the third parameter to instead call the respective thin wrapper. After this step, output from "git grep 'strbuf_getline('" would become a lot smaller. An interim goal of this series is to make this an empty set, so that we can have strbuf_getline_crlf() take over the shorter name strbuf_getline(). Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-01-15 10:12:51 -08:00
Junio C Hamano	fbe959dde7	Merge branch 'bc/format-patch-null-from-line' "format-patch" has learned a new option to zero-out the commit object name on the mbox "From " line. * bc/format-patch-null-from-line: format-patch: check that header line has expected format format-patch: add an option to suppress commit hash sha1_file.c: introduce a null_oid constant	2015-12-21 10:59:08 -08:00
Junio C Hamano	897b18508b	Merge branch 'jk/prune-mtime' The helper used to iterate over loose object directories to prune stale objects did not closedir() immediately when it is done with a directory--a callback such as the one used for "git prune" may want to do rmdir(), but it would fail on open directory on platforms such as WinXP. * jk/prune-mtime: prune: close directory earlier during loose-object directory traversal	2015-12-15 08:02:16 -08:00
brian m. carlson	3e56e7245c	sha1_file.c: introduce a null_oid constant null_oid is the struct object_id equivalent to null_sha1. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-12-14 13:35:54 -08:00
brian m. carlson	b419aa25d5	sha1_file: introduce has_object_file helper. Add has_object_file, which is a wrapper around has_sha1_file, but for struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>	2015-11-20 08:02:05 -05:00
Jeff King	45014beac0	Merge branch 'dk/gc-idx-wo-pack' Having a leftover .idx file without corresponding .pack file in the repository hurts performance; "git gc" learned to prune them. * dk/gc-idx-wo-pack: gc: remove garbage .idx files from pack dir t5304: test cleaning pack garbage prepare_packed_git(): refactor garbage reporting in pack directory	2015-11-20 06:55:34 -05:00
Junio C Hamano	808d119263	Merge branch 'js/misc-fixes' Various compilation fixes and squelching of warnings. * js/misc-fixes: Correct fscanf formatting string for I64u values Silence GCC's "cast of pointer to integer of a different size" warning Squelch warning about an integer overflow	2015-10-30 13:07:00 -07:00
Johannes Schindelin	56a1a3ab44	Silence GCC's "cast of pointer to integer of a different size" warning When calculating hashes from pointers, it actually makes sense to cut off the most significant bits. In that case, said warning does not make a whole lot of sense. So let's just work around it by casting the pointer first to intptr_t and then casting up/down to the final integral type. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-10-26 13:24:03 -07:00
Junio C Hamano	78891795df	Merge branch 'jk/war-on-sprintf' Many allocations that is manually counted (correctly) that are followed by strcpy/sprintf have been replaced with a less error prone constructs such as xstrfmt. Macintosh-specific breakage was noticed and corrected in this reroll. * jk/war-on-sprintf: (70 commits) name-rev: use strip_suffix to avoid magic numbers use strbuf_complete to conditionally append slash fsck: use for_each_loose_file_in_objdir Makefile: drop D_INO_IN_DIRENT build knob fsck: drop inode-sorting code convert strncpy to memcpy notes: document length of fanout path with a constant color: add color_set helper for copying raw colors prefer memcpy to strcpy help: clean up kfmclient munging receive-pack: simplify keep_arg computation avoid sprintf and strcpy with flex arrays use alloc_ref rather than hand-allocating "struct ref" color: add overflow checks for parsing colors drop strcpy in favor of raw sha1_to_hex use sha1_to_hex_r() instead of strcpy daemon: use cld->env_array when re-spawning stat_tracking_info: convert to argv_array http-push: use an argv_array for setup_revisions fetch-pack: use argv_array for index-pack / unpack-objects ...	2015-10-20 15:24:01 -07:00
Junio C Hamano	db5adf24bf	Merge branch 'js/clone-dissociate' "git clone --dissociate" runs a big "git repack" process at the end, and it helps to close file descriptors that are open on the packs and their idx files before doing so on filesystems that cannot remove a file that is still open. * js/clone-dissociate: clone --dissociate: avoid locking pack files sha1_file.c: add a function to release all packs sha1_file: consolidate code to close a pack's file descriptor t5700: demonstrate a Windows file locking issue with `git clone --dissociate`	2015-10-15 15:43:49 -07:00
Johannes Schindelin	38849a8116	sha1_file.c: add a function to release all packs On Windows, files that are in use cannot be removed or renamed. That means that we have to release pack files when we are about to, say, repack them. Let's introduce a convenient function to close all the pack files and their idx files. While at it, we consolidate the close windows/close fd/close index stanza in `free_pack_by_name()` into the `close_pack()` function that is used by the new `close_all_packs()` function to avoid repeated code. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-10-07 10:47:10 -07:00
Johannes Schindelin	71fe5d7fb0	sha1_file: consolidate code to close a pack's file descriptor There was a lot of repeated code to close the file descriptor of a given pack. Let's just refactor this code into a single function. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-10-05 14:43:58 -07:00
Jeff King	c7ab0ba340	avoid sprintf and strcpy with flex arrays When we are allocating a struct with a FLEX_ARRAY member, we generally compute the size of the array and then sprintf or strcpy into it. Normally we could improve a dynamic allocation like this by using xstrfmt, but it doesn't work here; we have to account for the size of the rest of the struct. But we can improve things a bit by storing the length that we use for the allocation, and then feeding it to xsnprintf or memcpy, which makes it more obvious that we are not writing more than the allocated number of bytes. It would be nice if we had some kind of helper for allocating generic flex arrays, but it doesn't work that well: - the call signature is a little bit unwieldy: d = flex_struct(sizeof(*d), offsetof(d, path), fmt, ...); You need offsetof here instead of just writing to the end of the base size, because we don't know how the struct is packed (partially this is because FLEX_ARRAY might not be zero, though we can account for that; but the size of the struct may actually be rounded up for alignment, and we can't know that). - some sites do clever things, like over-allocating because they know they will write larger things into the buffer later (e.g., struct packed_git here). So we're better off to just write out each allocation (or add type-specific helpers, though many of these are one-off allocations anyway). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-10-05 11:08:05 -07:00
Jeff King	d4b3d11a03	write_loose_object: convert to strbuf When creating a loose object tempfile, we use a fixed PATH_MAX-sized buffer, and strcpy directly into it. This isn't buggy, because we do a rough check of the size, but there's no verification that our guesstimate of the required space is enough (in fact, it's several bytes too big for the current naming scheme). Let's switch to a strbuf, which makes this much easier to verify. The allocation overhead should be negligible, since we are replacing a static buffer with a static strbuf, and we'll only need to allocate on the first call. While we're here, we can also document a subtle interaction with mkstemp that would be easy to overlook. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-10-05 11:08:05 -07:00
Jeff King	ac5190cc48	sha1_get_pack_name: use a strbuf We do some manual memory computation here, and there's no check that our 60 is not overflowed by the raw sprintf (it isn't, because the "which" parameter is never longer than "pack"). We can simplify this greatly with a strbuf. Technically the end result is not identical, as the original took care not to rewrite the object directory on each call for performance reasons. We could do that here, too (by saving the baselen and resetting to it), but it's not worth the complexity; this function is not called a lot (generally once per packfile that we open). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-25 10:18:18 -07:00
Jeff King	9ae97018fb	use strip_suffix and xstrfmt to replace suffix When we want to convert "foo.pack" to "foo.idx", we do it by duplicating the original string and then munging the bytes in place. Let's use strip_suffix and xstrfmt instead, which has several advantages: 1. It's more clear what the intent is. 2. It does not implicitly rely on the fact that strlen(".idx") <= strlen(".pack") to avoid an overflow. 3. We communicate the assumption that the input file ends with ".pack" (and get a run-time check that this is so). 4. We drop calls to strcpy, which makes auditing the code base easier. Likewise, we can do this to convert ".pack" to ".bitmap", avoiding some manual memory computation. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-25 10:18:18 -07:00
Jeff King	48bcc1c3cc	add_packed_git: convert strcpy into xsnprintf We have the path "foo.idx", and we create a buffer big enough to hold "foo.pack" and "foo.keep", and then strcpy straight into it. This isn't a bug (we have enough space), but it's very hard to tell from the strcpy that this is so. Let's instead use strip_suffix to take off the ".idx", record the size of our allocation, and use xsnprintf to make sure we don't violate our assumptions. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-25 10:18:18 -07:00
Jeff King	ef1286d3c0	use xsnprintf for generating git object headers We generally use 32-byte buffers to format git's "type size" header fields. These should not generally overflow unless you can produce some truly gigantic objects (and our types come from our internal array of constant strings). But it is a good idea to use xsnprintf to make sure this is the case. Note that we slightly modify the interface to write_sha1_file_prepare, which nows uses "hdrlen" as an "in" parameter as well as an "out" (on the way in it stores the allocated size of the header, and on the way out it returns the ultimate size of the header). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-25 10:18:18 -07:00
Junio C Hamano	f0bc854623	Sync with 2.5.2	2015-09-09 14:30:35 -07:00
Junio C Hamano	3d3caf0b78	Sync with 2.4.9	2015-09-04 10:43:23 -07:00
Junio C Hamano	ef0e938a1a	Sync with 2.3.9	2015-09-04 10:34:19 -07:00
Junio C Hamano	8267cd11d6	Sync with 2.2.3	2015-09-04 10:29:28 -07:00
Jeff King	5015f01c12	read_info_alternates: handle paths larger than PATH_MAX This function assumes that the relative_base path passed into it is no larger than PATH_MAX, and writes into a fixed-size buffer. However, this path may not have actually come from the filesystem; for example, add_submodule_odb generates a path using a strbuf and passes it in. This is hard to trigger in practice, though, because the long submodule directory would have to exist on disk before we would try to open its info/alternates file. We can easily avoid the bug, though, by simply creating the filename on the heap. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-04 09:36:51 -07:00

1 2 3 4 5 ...

944 Commits