"git cat-file --batch-check=<format>" is added, primarily to allow
on-disk footprint of objects in packfiles (often they are a lot
smaller than their true size, when expressed as deltas) to be
reported.
* jk/in-pack-size-measurement:
pack-revindex: radix-sort the revindex
pack-revindex: use unsigned to store number of objects
cat-file: split --batch input lines on whitespace
cat-file: add %(objectsize:disk) format atom
cat-file: add --batch-check=<format>
cat-file: refactor --batch option parsing
cat-file: teach --batch to stream blob objects
t1006: modernize output comparisons
teach sha1_object_info_extended a "disk_size" query
zero-initialize object_info structs
Using sha1_object_info_extended, a caller can find out the
type of an object, its size, and information about where it
is stored. In addition to the object's "true" size, it can
also be useful to know the size that the object takes on
disk (e.g., to generate statistics about which refs consume
space).
This patch adds a "disk_sizep" field to "struct object_info",
and fills it in during sha1_object_info_extended if it is
non-NULL.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The sha1_object_info_extended function expects the caller to
provide a "struct object_info" which contains pointers to
"query" items that will be filled in. The purpose of
providing pointers rather than storing the response directly
in the struct is so that callers can choose not to incur the
expense in finding particular fields that they do not care
about.
Right now the only query item is "sizep", and all callers
set it explicitly to choose whether or not to query it; they
can then leave the rest of the struct uninitialized.
However, as we add new query items, each caller will have to
be updated to explicitly turn off the new ones (by setting
them to NULL). Instead, let's teach each caller to
zero-initialize the struct, so that they do not have to
learn about each new query item added.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we try to load an object from disk and fail, our
general strategy is to see if we can get it from somewhere
else (e.g., a loose object). That lets users fix corruption
problems by copying known-good versions of objects into the
object database.
We already handle the case where we were not able to read
the delta from disk. However, when we find that the delta we
read does not apply, we simply die. This case is harder to
trigger, as corruption in the delta data itself would
trigger a crc error from zlib. However, a corruption that
pointed us at the wrong delta base might cause it.
We can do the same "fail and try to find the object
elsewhere" trick instead of dying. This not only gives us a
chance to recover, but also puts us on code paths that will
alert the user to the problem (with the current message,
they do not even know which sha1 caused the problem).
Note that unlike some other pack corruptions, we do not
recover automatically from this case when doing a repack.
There is nothing apparently wrong with the delta, as it
points to a valid, accessible object, and we realize the
error only when the resulting size does not match up. And in
theory, one could even have a case where the corrupted size
is the same, and the problem would only be noticed by
recomputing the sha1.
We can get around this by recomputing the deltas with
--no-reuse-delta, which our test does (and this is probably
good advice for anyone recovering from pack corruption).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
5f44324 (core: log offset pack data accesses happened - 2011-07-06)
provides a way to observe pack access patterns via a config
switch. Setting an environment variable looks more obvious than a
config var, especially when you just need to _observe_, and more
inline with other tracing knobs we have.
Document it as it may be useful for remote troubleshooting.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
sha1_object_info() returns -1 (OBJ_BAD) if it cannot find the object
for some reason, which suggests that it wants the _caller_ to report
this error. However, part of its work happens in
sha1_loose_object_info, which _does_ report errors itself. This is
doubly strange because:
* packed_object_info(), which is the other half of the duo, does _not_
report this.
* In the event that an object is packed and pruned while
sha1_object_info_extended() goes looking for it, we would
erroneously show the error -- even though the code of the latter
function purports to handle this case gracefully.
* A caller might invoke sha1_object_info() to find the type of an
object even if that object is not known to exist.
Silence this error. The others remain untouched as a corrupt object
is a much more grave error than it merely being absent.
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the !delta_data error path of unpack_entry(), we run free(base).
This became a window for use-after-free() in abe601b (sha1_file:
remove recursion in unpack_entry, 2013-03-27), as follows:
Before abe601b, we got the 'base' from cache_or_unpack_entry(..., 0);
keep_cache=0 tells it to also remove that entry. So the 'base' is at
this point not cached, and freeing it in the error path is the right
thing.
After abe601b, the structure changed: we use a three-phase approach
where phase 1 finds the innermost base or a base that is already in
the cache. In phase 3 we therefore know that all bases we unpack are
not part of the delta cache yet. (Observe that we pop from the cache
in phase 1, so this is also true for the very first base.) So we make
no further attempts to look up the bases in the cache, and just call
add_delta_base_cache() on every base object we have assembled.
But the !delta_data error path remained unchanged, and now calls
free() on a base that has already been entered in the cache. This
means that there is a use-after-free if we later use the same base
again.
So remove that free(); we are still going to use that data.
Reported-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Attempts to reduce the stack footprint of sha1_object_info()
and unpack_entry() codepaths.
* tr/packed-object-info-wo-recursion:
sha1_file: remove recursion in unpack_entry
Refactor parts of in_delta_base_cache/cache_or_unpack_entry
sha1_file: remove recursion in packed_object_info
Have the streaming interface and other codepaths more carefully
examine for corrupt objects.
* jk/check-corrupt-objects-carefully:
clone: leave repo in place after checkout errors
clone: run check_everything_connected
clone: die on errors from unpack_trees
add tests for cloning corrupted repositories
streaming_write_entry: propagate streaming errors
add test for streaming corrupt blobs
avoid infinite loop in read_istream_loose
read_istream_filtered: propagate read error from upstream
check_sha1_signature: check return value from read_istream
stream_blob_to_fd: detect errors reading from stream
It's possible for read_istream to return an error, in which
case we just end up in an infinite loop (aside from EOF, we
do not even look at the result, but just feed it straight
into our running hash).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Similar to the recursion in packed_object_info(), this leads to
problems on stack-space-constrained systems in the presence of long
delta chains.
We proceed in three phases:
1. Dig through the delta chain, saving each delta object's offsets and
size on an ad-hoc stack.
2. Unpack the base object at the bottom.
3. Unpack and apply the deltas from the stack.
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The delta base cache lookup and test were shared. Refactor them;
we'll need both parts again. Also, we'll use the clearing routine
later.
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If two processes are racing to create the same directory tree, they
will both see that the directory doesn't exist, both try to mkdir(),
and one of them will fail. This is okay, as we only care that the
directory gets created. So, we add a check for EEXIST from mkdir,
and continue when the directory exists, taking the same codepath as
the case where the earlier stat() succeeds and finds a directory.
Signed-off-by: Steven Walter <stevenrwalter@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
packed_object_info() and packed_delta_info() were mutually recursive.
The former would handle ordinary types and defer deltas to the latter;
the latter would use the former to resolve the delta base.
This arrangement, however, leads to trouble with threaded index-pack
and long delta chains on platforms where thread stacks are small, as
happened on OS X (512kB thread stacks by default) with the chromium
repo.
The task of the two functions is not all that hard to describe without
any recursion, however. It proceeds in three steps:
- determine the representation type and size, based on the outermost
object (delta or not)
- follow through the delta chain, if any
- determine the object type from what is found at the end of the delta
chain
The only complication stems from the error recovery. If parsing fails
at any step, we want to mark that object (within the pack) as bad and
try getting the corresponding SHA1 from elsewhere. If that also
fails, we want to repeat this process back up the delta chain until we
find a reasonable solution or conclude that there is no way to
reconstruct the object. (This is conveniently checked by t5303.)
To achieve that within the pack, we keep track of the entire delta
chain in a stack. When things go sour, we process that stack from the
top, marking entries as bad and attempting to re-resolve by sha1. To
avoid excessive malloc(), the stack starts out with a small
stack-allocated array. The choice of 64 is based on the default of
pack.depth, which is 50, in the hope that it covers "most" delta
chains without any need for malloc().
It's much harder to make the actual re-resolving by sha1 nonrecursive,
so we skip that. If you can't afford *that* recursion, your
corruption problems are more serious than your stack size problems.
Reported-by: Stefan Zager <szager@google.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
prepare_packed_git_one() is modified to allow count-objects to hook a
report function to so we don't need to duplicate the pack searching
logic in count-objects.c. When report_pack_garbage is NULL, the
overhead is insignificant.
The garbage is reported with warning() instead of error() in packed
garbage case because it's not an error to have garbage. Loose garbage
is still reported as errors and will be converted to warnings later.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The current loop does
while (...) {
if (it is not an .idx file)
continue;
process .idx file;
}
and is reordered to
while (...) {
if (it is an .idx file) {
process .idx file;
}
}
This makes it easier to add new extension file processing.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change link_alt_odb_entries() to take the length of the "alt"
parameter rather than a pointer to the end of the "alt" string. This
is the more common calling convention and simplifies the code a tiny
bit.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Change link_alt_odb_entry() to take a NUL-terminated string instead of
(char *, len). Use string_list_split_in_place() rather than inline
code in link_alt_odb_entries().
This approach saves some code and also avoids the (probably harmless)
error of passing a non-NUL-terminated string to is_absolute_path().
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Not all platforms have getrlimit(), but there are other ways to see
the maximum number of files that a process can have open. If
getrlimit() is unavailable, fall back to sysconf(_SC_OPEN_MAX) if
available, and use OPEN_MAX from <limits.h>.
Signed-off-by: Joachim Schmitz <jojo@schmitz-digital.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to avoid mistaken attempt to add the object directory
itself as its own alternate could read beyond end of a string while
comparison.
* hv/link-alt-odb-entry:
link_alt_odb_entry: fix read over array bounds reported by valgrind
pfxlen can be longer than the path in objdir when relative_base
contains the path to gits object directory. Here we are interested
in checking if ent->base[] (the part that corresponds to .git/objects)
is the same string as objdir, and the code NUL-terminated ent->base[]
to
LEADING PATH\0XX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\0
in preparation for these "duplicate check" step (before we return
from the function, the first NUL is turned into '/' so that we can
fill XX when probing for loose objects). All we need to do is to
compare the string with the path to our object directory.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When peeking into object stores of submodules, the code forgot that they
might borrow objects from alternate object stores on their own.
By Heiko Voigt
* hv/submodule-alt-odb:
teach add_submodule_odb() to look for alternates
Since we allow to link other object databases when loading a submodules
database we should also load possible alternates.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When write_loose_object() finds that it is unable to
create a temporary file, it complains, for instance:
unable to create temporary sha1 filename : Too many open files
That extra space was supposed to be the name of the file,
and will be an empty string if the git_mkstemps_mode() fails.
The name of the temporary file is unimportant; delete it.
Signed-off-by: Pete Wyckoff <pw@padd.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The error handling routines add a newline. Remove
the duplicate ones in error messages.
Signed-off-by: Pete Wyckoff <pw@padd.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditionally, all the callers of check_sha1_signature() first
called read_sha1_file() to prepare the whole object data in core,
and called this function. The function is used to revalidate what
we read from the object database actually matches the object name we
used to ask for the data from the object database.
Update the API to allow callers to pass NULL as the object data, and
have the function read and hash the object data using streaming API
to recompute the object name, without having to hold everything in
core at the same time. This is most useful in parse_object() that
parses a blob object, because this caller does not have to keep the
actual blob data around in memory after a "struct blob" is returned.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jk/maint-avoid-streaming-filtered-contents:
do not stream large files to pack when filters are in use
teach dry-run convert_to_git not to require a src buffer
teach convert_to_git a "dry run" mode
* jk/maint-avoid-streaming-filtered-contents:
do not stream large files to pack when filters are in use
teach dry-run convert_to_git not to require a src buffer
teach convert_to_git a "dry run" mode
Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.
The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size. Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.
This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement;
before we did not apply the conversion at all.
The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* nd/find-pack-entry-recent-cache-invalidation:
find_pack_entry(): do not keep packed_git pointer locally
sha1_file.c: move the core logic of find_pack_entry() into fill_pack_entry()
* nd/find-pack-entry-recent-cache-invalidation:
find_pack_entry(): do not keep packed_git pointer locally
sha1_file.c: move the core logic of find_pack_entry() into fill_pack_entry()
Since 3ba7a06552 (A loose object is not corrupt if it
cannot be read due to EMFILE), "git fsck" on a repository with an empty
loose object file complains with the error message
fatal: failed to read object <sha1>: Invalid argument
This comes from a failure of mmap on this empty file, which sets errno to
EINVAL. Instead of calling xmmap on empty file, we display a clean error
message ourselves, and return a NULL pointer. The new message is
error: object file .git/objects/09/<rest-of-sha1> is empty
fatal: loose object <sha1> (stored in .git/objects/09/<rest-of-sha1>) is corrupt
The second line was already there before the regression in 3ba7a06552,
and the first is an additional message, that should help diagnosing the
problem for the user.
Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit f7c22cc (always start looking up objects in the last used pack
first - 2007-05-30) introduce a static packed_git* pointer as an
optimization. The kept pointer however may become invalid if
free_pack_by_name() happens to free that particular pack.
Current code base does not access packs after calling
free_pack_by_name() so it should not be a problem. Anyway, move the
pointer out so that free_pack_by_name() can reset it to avoid running
into troubles in future.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The new helper function implements the logic to find the offset for the
object in one pack and fill a pack_entry structure. The next patch will
restructure the loop and will call the helper from two places.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On Solaris the system headers define the "tmpfile" name, which'll
cause Git compiled with Sun Studio 12 Update 1 to whine about us
redefining the name:
"pack-write.c", line 76: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"sha1_file.c", line 2455: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"fast-import.c", line 858: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"builtin/index-pack.c", line 175: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
Just renaming the "tmpfile" variable to "tmp_file" in the relevant
places is the easiest way to fix this.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/stream-to-pack:
bulk-checkin: replace fast-import based implementation
csum-file: introduce sha1file_checkpoint
finish_tmp_packfile(): a helper function
create_tmp_packfile(): a helper function
write_pack_header(): a helper function
Conflicts:
pack.h
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The callers do not use the returned size when the function says
it did not use any bytes and sets the type to OBJ_BAD, so this
should not matter in practice, but it is a good code hygiene
anyway.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jk/maint-pack-objects-compete-with-delete:
downgrade "packfile cannot be accessed" errors to warnings
pack-objects: protect against disappearing packs