Sparse issues some "Using plain integer as NULL pointer" warnings.
Each warning relates to the use of an '{0}' initialiser expression
in the declaration of a 'struct object_info'. The first field of
this structure has pointer type. Thus, in order to suppress these
warnings, we replace the initialiser expression with '{NULL}'.
Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git cat-file --batch-check=<format>" is added, primarily to allow
on-disk footprint of objects in packfiles (often they are a lot
smaller than their true size, when expressed as deltas) to be
reported.
* jk/in-pack-size-measurement:
pack-revindex: radix-sort the revindex
pack-revindex: use unsigned to store number of objects
cat-file: split --batch input lines on whitespace
cat-file: add %(objectsize:disk) format atom
cat-file: add --batch-check=<format>
cat-file: refactor --batch option parsing
cat-file: teach --batch to stream blob objects
t1006: modernize output comparisons
teach sha1_object_info_extended a "disk_size" query
zero-initialize object_info structs
We take in a "struct object_info" which contains pointers to
storage for items the caller cares about. But then rather
than pass the whole object to the low-level loose/packed
helper functions, we pass the individual pointers.
Let's pass the whole struct instead, which will make adding
more items later easier.
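A minimal sketch of the idea (names here are illustrative, not the
exact ones in sha1_file.c):

    /* one struct of "query" pointers travels down to the helpers */
    struct object_info {
            unsigned long *sizep;   /* filled in if non-NULL */
            /* future query items go here, with no signature churn */
    };

    /* before: one parameter per query item */
    static int loose_object_info_v1(const unsigned char *sha1,
                                    unsigned long *sizep);

    /* after: adding a new item no longer touches the signature */
    static int loose_object_info_v2(const unsigned char *sha1,
                                    struct object_info *oi);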
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Each caller of sha1_object_info_extended sets up an
object_info struct to tell the function which elements of
the object it wants to get. Until now, getting the type of
the object has always been required (and it is returned via the
function's return value rather than a pointer in object_info).
This can involve actually opening a loose object file to
determine its type, or following delta chains to determine a
packed file's base type. These effects produce a measurable
slow-down when doing a "cat-file --batch-check" that does
not include %(objecttype).
This patch adds a "typep" query to struct object_info, so
that it can be optionally queried just like size and
disk_size. As a result, the return type of the function is
no longer the object type, but rather 0/-1 for success/error.
As there are only three callers total, we just fix up each
caller rather than keep a compatibility wrapper:
1. The simpler sha1_object_info wrapper continues to
always ask for and return the type field.
2. The istream_source function wants to know the type, and
so always asks for it.
3. The cat-file batch code asks for the type only when
%(objecttype) is part of the format string.
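As a sketch, the resulting calling convention looks roughly like this
(the surrounding helper is illustrative, not code from the patch):

    /* assumes git's internal cache.h API as described above */
    static int report_object(const unsigned char *sha1, int want_type)
    {
            struct object_info oi = {NULL};
            unsigned long disk_size;
            enum object_type type;

            oi.disk_sizep = &disk_size;
            if (want_type)
                    oi.typep = &type;   /* type lookup is now optional */
            /* the return value is now 0/-1, not the object type */
            if (sha1_object_info_extended(sha1, &oi) < 0)
                    return -1;
            return 0;
    }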
On linux.git, the best-of-five for running:

    $ git rev-list --objects --all >objects
    $ time git cat-file --batch-check='%(objectsize:disk)' <objects

on a fully packed repository goes from:

    real    0m8.680s
    user    0m8.160s
    sys     0m0.512s

to:

    real    0m7.205s
    user    0m6.580s
    sys     0m0.608s
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, packed_object_info can save some work by not
calculating the size or disk_size of the object if the
caller is not interested. However, it always calculates the
true object type, whether the caller cares or not, and only
optionally returns the easy-to-get "representation type".
Let's swap these types. The function will now return the
representation type (or OBJ_BAD on failure), and will only
optionally fill in the true type.
There should be no behavior change yet, as the only caller,
sha1_object_info_extended, will always feed it a type
pointer.
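Sketched as a declaration, the new contract reads (the exact signature
at this point in the series may differ):

    /*
     * Returns the representation type (e.g. OBJ_OFS_DELTA) or OBJ_BAD
     * on failure; fills in the true type only when the caller asked
     * for it via the object_info struct.
     */
    static int packed_object_info(struct packed_git *p, off_t obj_offset,
                                  struct object_info *oi);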
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To calculate the type of a packed object, we must walk down
its delta chain until we hit a true base object with a real
type. Most of the code in packed_object_info is for handling
this case.
Let's hoist it out into a separate helper function, which
will make it easier to make the type-lookup optional in the
future (and keep our indentation level sane).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Until recently, the only items to request from
sha1_object_info_extended were type and size. This meant
that we always had to open a loose object file to determine
one or the other. But with the addition of the disk_size
query, it's possible that we can fulfill the query without
even opening the object file at all. However, since the
function interface always returns the type, we have no way
of knowing whether the caller cares about it or not.
This patch modifies only sha1_loose_object_info, making type
lookup optional via an out-parameter, similar to the way
the size is handled (and the return value is "0" or "-1" for
success or error, respectively).
There should be no functional change yet, though, as
sha1_object_info_extended, the only caller, will always ask
for a type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The value we get from each low-level object_info function
(e.g., loose, packed) is actually the object type (or -1 for
error). Let's explicitly call it "type", which will make
further refactorings easier to read.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using sha1_object_info_extended, a caller can find out the
type of an object, its size, and information about where it
is stored. In addition to the object's "true" size, it can
also be useful to know the size that the object takes on
disk (e.g., to generate statistics about which refs consume
space).
This patch adds a "disk_sizep" field to "struct object_info",
and fills it in during sha1_object_info_extended if it is
non-NULL.
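Sketched, the struct grows one more optional query pointer (the exact
layout in cache.h may differ):

    struct object_info {
            unsigned long *sizep;
            unsigned long *disk_sizep;  /* new: on-disk size, if non-NULL */
            /* ... plus the "whence" information added earlier ... */
    };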
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The sha1_object_info_extended function expects the caller to
provide a "struct object_info" which contains pointers to
"query" items that will be filled in. The purpose of
providing pointers rather than storing the response directly
in the struct is so that callers can choose not to incur the
expense of finding particular fields that they do not care
about.
Right now the only query item is "sizep", and all callers
set it explicitly to choose whether or not to query it; they
can then leave the rest of the struct uninitialized.
However, as we add new query items, each caller will have to
be updated to explicitly turn off the new ones (by setting
them to NULL). Instead, let's teach each caller to
zero-initialize the struct, so that they do not have to
learn about each new query item added.
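A sketch of the resulting caller pattern (the wrapper function here is
illustrative):

    static unsigned long loose_size_example(const unsigned char *sha1)
    {
            struct object_info oi = {0};    /* every query item starts off */
            unsigned long size;

            oi.sizep = &size;               /* opt in to just what we need */
            if (sha1_object_info_extended(sha1, &oi) < 0)
                    die("unable to get object info");
            return size;
    }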
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we try to load an object from disk and fail, our
general strategy is to see if we can get it from somewhere
else (e.g., a loose object). That lets users fix corruption
problems by copying known-good versions of objects into the
object database.
We already handle the case where we were not able to read
the delta from disk. However, when we find that the delta we
read does not apply, we simply die. This case is harder to
trigger, as corruption in the delta data itself would
trigger a crc error from zlib. However, a corruption that
pointed us at the wrong delta base might cause it.
We can do the same "fail and try to find the object
elsewhere" trick instead of dying. This not only gives us a
chance to recover, but also puts us on code paths that will
alert the user to the problem (with the current message,
they do not even know which sha1 caused the problem).
Note that unlike some other pack corruptions, we do not
recover automatically from this case when doing a repack.
There is nothing apparently wrong with the delta, as it
points to a valid, accessible object, and we realize the
error only when the resulting size does not match up. And in
theory, one could even have a case where the corrupted size
is the same, and the problem would only be noticed by
recomputing the sha1.
We can get around this by recomputing the deltas with
--no-reuse-delta, which our test does (and this is probably
good advice for anyone recovering from pack corruption).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
5f44324 (core: log offset pack data accesses happened - 2011-07-06)
provides a way to observe pack access patterns via a config
switch. Setting an environment variable looks more obvious than a
config var, especially when you just need to _observe_, and is more
in line with other tracing knobs we have.
Document it as it may be useful for remote troubleshooting.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
sha1_object_info() returns -1 (OBJ_BAD) if it cannot find the object
for some reason, which suggests that it wants the _caller_ to report
this error. However, part of its work happens in
sha1_loose_object_info, which _does_ report errors itself. This is
doubly strange because:
* packed_object_info(), which is the other half of the duo, does _not_
report this.
* In the event that an object is packed and pruned while
sha1_object_info_extended() goes looking for it, we would
erroneously show the error -- even though the code of the latter
function purports to handle this case gracefully.
* A caller might invoke sha1_object_info() to find the type of an
object even if that object is not known to exist.
Silence this error. The others remain untouched as a corrupt object
is a much more grave error than it merely being absent.
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the !delta_data error path of unpack_entry(), we run free(base).
This became a window for use-after-free() in abe601b (sha1_file:
remove recursion in unpack_entry, 2013-03-27), as follows:
Before abe601b, we got the 'base' from cache_or_unpack_entry(..., 0);
keep_cache=0 tells it to also remove that entry. So the 'base' is at
this point not cached, and freeing it in the error path is the right
thing.
After abe601b, the structure changed: we use a three-phase approach
where phase 1 finds the innermost base or a base that is already in
the cache. In phase 3 we therefore know that all bases we unpack are
not part of the delta cache yet. (Observe that we pop from the cache
in phase 1, so this is also true for the very first base.) So we make
no further attempts to look up the bases in the cache, and just call
add_delta_base_cache() on every base object we have assembled.
But the !delta_data error path remained unchanged, and now calls
free() on a base that has already been entered in the cache. This
means that there is a use-after-free if we later use the same base
again.
So remove that free(); we are still going to use that data.
Reported-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Attempts to reduce the stack footprint of the sha1_object_info()
and unpack_entry() codepaths.
* tr/packed-object-info-wo-recursion:
sha1_file: remove recursion in unpack_entry
Refactor parts of in_delta_base_cache/cache_or_unpack_entry
sha1_file: remove recursion in packed_object_info
Have the streaming interface and other codepaths check more
carefully for corrupt objects.
* jk/check-corrupt-objects-carefully:
clone: leave repo in place after checkout errors
clone: run check_everything_connected
clone: die on errors from unpack_trees
add tests for cloning corrupted repositories
streaming_write_entry: propagate streaming errors
add test for streaming corrupt blobs
avoid infinite loop in read_istream_loose
read_istream_filtered: propagate read error from upstream
check_sha1_signature: check return value from read_istream
stream_blob_to_fd: detect errors reading from stream
It's possible for read_istream to return an error, in which
case we just end up in an infinite loop (aside from EOF, we
do not even look at the result, but just feed it straight
into our running hash).
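The shape of the fix, as a sketch (st and ctx come from the
surrounding function; this is not the literal diff):

    char buf[1024 * 16];
    for (;;) {
            ssize_t readlen = read_istream(st, buf, sizeof(buf));
            if (readlen < 0)
                    return -1;          /* propagate the stream error */
            if (!readlen)
                    break;              /* EOF */
            git_SHA1_Update(&ctx, buf, readlen);
    }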
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Similar to the recursion in packed_object_info(), this leads to
problems on stack-space-constrained systems in the presence of long
delta chains.
We proceed in three phases:
1. Dig through the delta chain, saving each delta object's offsets and
size on an ad-hoc stack.
2. Unpack the base object at the bottom.
3. Unpack and apply the deltas from the stack.
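A toy, self-contained illustration of that explicit-stack shape (this
is not git's code; the real unpack_entry() deals with pack windows,
the delta cache, and real delta application):

    #include <stdio.h>

    struct obj { int base; int delta; };    /* base < 0 marks a real base */

    static int resolve(const struct obj *o, int i)
    {
            int stack[64], top = 0;

            while (o[i].base >= 0) {        /* phase 1: dig down the chain */
                    stack[top++] = i;
                    i = o[i].base;
            }
            int value = o[i].delta;         /* phase 2: unpack the base */
            while (top > 0)                 /* phase 3: apply saved deltas */
                    value += o[stack[--top]].delta;
            return value;
    }

    int main(void)
    {
            struct obj chain[] = { { -1, 10 }, { 0, 1 }, { 1, 2 } };
            printf("%d\n", resolve(chain, 2));      /* prints 13 */
            return 0;
    }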
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The delta base cache lookup and test were shared. Refactor them;
we'll need both parts again. Also, we'll use the clearing routine
later.
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If two processes are racing to create the same directory tree, they
will both see that the directory doesn't exist, both try to mkdir(),
and one of them will fail. This is okay, as we only care that the
directory gets created. So, we add a check for EEXIST from mkdir,
and continue when the directory exists, taking the same codepath as
the case where the earlier stat() succeeds and finds a directory.
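A minimal sketch of the tolerant mkdir (the function name is
illustrative; the real code also verifies that the existing path is
indeed a directory):

    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int mkdir_may_exist(const char *path)
    {
            if (mkdir(path, 0777) && errno != EEXIST)
                    return -1;      /* a real failure */
            /* created by us or by the racing process; both are fine */
            return 0;
    }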
Signed-off-by: Steven Walter <stevenrwalter@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
packed_object_info() and packed_delta_info() were mutually recursive.
The former would handle ordinary types and defer deltas to the latter;
the latter would use the former to resolve the delta base.
This arrangement, however, leads to trouble with threaded index-pack
and long delta chains on platforms where thread stacks are small, as
happened on OS X (512kB thread stacks by default) with the chromium
repo.
The task of the two functions is not all that hard to describe without
any recursion, however. It proceeds in three steps:
- determine the representation type and size, based on the outermost
object (delta or not)
- follow through the delta chain, if any
- determine the object type from what is found at the end of the delta
chain
The only complication stems from the error recovery. If parsing fails
at any step, we want to mark that object (within the pack) as bad and
try getting the corresponding SHA1 from elsewhere. If that also
fails, we want to repeat this process back up the delta chain until we
find a reasonable solution or conclude that there is no way to
reconstruct the object. (This is conveniently checked by t5303.)
To achieve that within the pack, we keep track of the entire delta
chain in a stack. When things go sour, we process that stack from the
top, marking entries as bad and attempting to re-resolve by sha1. To
avoid excessive malloc(), the stack starts out with a small
stack-allocated array. The choice of 64 is based on the default of
pack.depth, which is 50, in the hope that it covers "most" delta
chains without any need for malloc().
It's much harder to make the actual re-resolving by sha1 nonrecursive,
so we skip that. If you can't afford *that* recursion, your
corruption problems are more serious than your stack size problems.
Reported-by: Stefan Zager <szager@google.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
prepare_packed_git_one() is modified to allow count-objects to hook in a
report function, so that we do not need to duplicate the pack searching
logic in count-objects.c. When report_pack_garbage is NULL, the
overhead is insignificant.
The garbage is reported with warning() instead of error() in the packed
garbage case because it is not an error to have garbage. Loose garbage
is still reported as errors and will be converted to warnings later.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The current loop does

    while (...) {
            if (it is not an .idx file)
                    continue;
            process .idx file;
    }

and is reordered to

    while (...) {
            if (it is an .idx file) {
                    process .idx file;
            }
    }
This makes it easier to add new extension file processing.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change link_alt_odb_entries() to take the length of the "alt"
parameter rather than a pointer to the end of the "alt" string. This
is the more common calling convention and simplifies the code a tiny
bit.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Change link_alt_odb_entry() to take a NUL-terminated string instead of
(char *, len). Use string_list_split_in_place() rather than inline
code in link_alt_odb_entries().
This approach saves some code and also avoids the (probably harmless)
error of passing a non-NUL-terminated string to is_absolute_path().
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Not all platforms have getrlimit(), but there are other ways to see
the maximum number of files that a process can have open. If
getrlimit() is unavailable, fall back to sysconf(_SC_OPEN_MAX) if
available, and failing that, use OPEN_MAX from <limits.h>.
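A sketch of that fallback chain (the NO_GETRLIMIT build knob is an
assumption, not necessarily git's actual Makefile switch):

    #include <limits.h>
    #include <unistd.h>
    #ifndef NO_GETRLIMIT
    #include <sys/resource.h>
    #endif

    static unsigned int max_fd_limit(void)
    {
    #ifndef NO_GETRLIMIT
            struct rlimit lim;
            if (!getrlimit(RLIMIT_NOFILE, &lim))
                    return lim.rlim_cur;
    #endif
    #ifdef _SC_OPEN_MAX
            {
                    long open_max = sysconf(_SC_OPEN_MAX);
                    if (open_max > 0)
                            return open_max;
            }
    #endif
    #ifdef OPEN_MAX
            return OPEN_MAX;
    #else
            return 1;       /* a conservative floor when nothing works */
    #endif
    }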
Signed-off-by: Joachim Schmitz <jojo@schmitz-digital.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to avoid a mistaken attempt to add the object directory
itself as its own alternate could read beyond the end of a string
during comparison.
* hv/link-alt-odb-entry:
link_alt_odb_entry: fix read over array bounds reported by valgrind
pfxlen can be longer than the path in objdir when relative_base
contains the path to git's object directory. Here we are interested
in checking if ent->base[] (the part that corresponds to .git/objects)
is the same string as objdir, and the code NUL-terminated ent->base[] to

    LEADING PATH\0XX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\0

in preparation for this "duplicate check" step (before we return
from the function, the first NUL is turned into '/' so that we can
fill XX when probing for loose objects). All we need to do is to
compare the string with the path to our object directory.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When peeking into object stores of submodules, the code forgot that they
might borrow objects from alternate object stores on their own.
By Heiko Voigt
* hv/submodule-alt-odb:
teach add_submodule_odb() to look for alternates
Since we allow linking other object databases when loading a submodule's
database, we should also load possible alternates.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When write_loose_object() finds that it is unable to
create a temporary file, it complains, for instance:
    unable to create temporary sha1 filename : Too many open files
That extra space was supposed to be the name of the file,
and will be an empty string if git_mkstemps_mode() fails.
The name of the temporary file is unimportant; delete it.
Signed-off-by: Pete Wyckoff <pw@padd.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The error handling routines add a newline. Remove
the duplicate ones in error messages.
Signed-off-by: Pete Wyckoff <pw@padd.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditionally, all the callers of check_sha1_signature() first
called read_sha1_file() to prepare the whole object data in core,
and called this function. The function is used to revalidate that what
we read from the object database actually matches the object name we
used to ask for the data.
Update the API to allow callers to pass NULL as the object data, and
have the function read and hash the object data using streaming API
to recompute the object name, without having to hold everything in
core at the same time. This is most useful in parse_object() that
parses a blob object, because this caller does not have to keep the
actual blob data around in memory after a "struct blob" is returned.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jk/maint-avoid-streaming-filtered-contents:
do not stream large files to pack when filters are in use
teach dry-run convert_to_git not to require a src buffer
teach convert_to_git a "dry run" mode
Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.
The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size. Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.
This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement;
before we did not apply the conversion at all.
The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* nd/find-pack-entry-recent-cache-invalidation:
find_pack_entry(): do not keep packed_git pointer locally
sha1_file.c: move the core logic of find_pack_entry() into fill_pack_entry()
Since 3ba7a06552 (A loose object is not corrupt if it
cannot be read due to EMFILE), "git fsck" on a repository with an empty
loose object file complains with the error message

    fatal: failed to read object <sha1>: Invalid argument

This comes from a failure of mmap on this empty file, which sets errno to
EINVAL. Instead of calling xmmap on an empty file, we display a clean error
message ourselves, and return a NULL pointer. The new message is

    error: object file .git/objects/09/<rest-of-sha1> is empty
    fatal: loose object <sha1> (stored in .git/objects/09/<rest-of-sha1>) is corrupt
The second line was already there before the regression in 3ba7a06552,
and the first is an additional message that should help the user
diagnose the problem.
Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit f7c22cc (always start looking up objects in the last used pack
first - 2007-05-30) introduced a static packed_git* pointer as an
optimization. The kept pointer, however, may become invalid if
free_pack_by_name() happens to free that particular pack.
The current code base does not access packs after calling
free_pack_by_name(), so it should not be a problem. Anyway, move the
pointer out so that free_pack_by_name() can reset it, to avoid running
into trouble in the future.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The new helper function implements the logic to find the offset for the
object in one pack and fill a pack_entry structure. The next patch will
restructure the loop and will call the helper from two places.
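Sketched (the pack_entry fields are common knowledge about git's
internals, but the exact body of the helper may differ):

    static int fill_pack_entry(const unsigned char *sha1,
                               struct pack_entry *e,
                               struct packed_git *p)
    {
            off_t offset = find_pack_entry_one(sha1, p);
            if (!offset)
                    return 0;       /* not in this pack */
            e->offset = offset;
            e->p = p;
            hashcpy(e->sha1, sha1);
            return 1;
    }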
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On Solaris the system headers define the "tmpfile" name, which'll
cause Git compiled with Sun Studio 12 Update 1 to whine about us
redefining the name:
"pack-write.c", line 76: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"sha1_file.c", line 2455: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"fast-import.c", line 858: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
"builtin/index-pack.c", line 175: warning: name redefined by pragma redefine_extname declared static: tmpfile (E_PRAGMA_REDEFINE_STATIC)
Just renaming the "tmpfile" variable to "tmp_file" in the relevant
places is the easiest way to fix this.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/stream-to-pack:
bulk-checkin: replace fast-import based implementation
csum-file: introduce sha1file_checkpoint
finish_tmp_packfile(): a helper function
create_tmp_packfile(): a helper function
write_pack_header(): a helper function
Conflicts:
    pack.h
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The callers do not use the returned size when the function says
it did not use any bytes and sets the type to OBJ_BAD, so this
should not matter in practice, but it is good code hygiene
anyway.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jk/maint-pack-objects-compete-with-delete:
downgrade "packfile cannot be accessed" errors to warnings
pack-objects: protect against disappearing packs
These can happen if another process simultaneously prunes a
pack. But that is not usually an error condition, because a
properly-running prune should have repacked the object into
a new pack. So we will notice that the pack has disappeared
unexpectedly, print a message, try other packs (possibly
after re-scanning the list of packs), and find it in the new
pack.
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's possible that while pack-objects is running, a
simultaneously running prune process might delete a pack
that we are interested in. Because we load the pack indices
early on, we know that the pack contains our item, but by
the time we try to open and map it, it is gone.
Since c715f78, we already protect against this in the normal
object access code path, but pack-objects accesses the packs
at a lower level. In the normal access path, we call
find_pack_entry, which will call find_pack_entry_one on each
pack index, which does the actual lookup. If it gets a hit,
we will actually open and verify the validity of the
matching packfile (using c715f78's is_pack_valid). If we
can't open it, we'll issue a warning and pretend that we
didn't find it, causing us to go on to the next pack (or on
to loose objects).
Furthermore, we will cache the descriptor to the opened
packfile. Which means that later, when we actually try to
access the object, we are likely to still have that packfile
opened, and won't care if it has been unlinked from the
filesystem.
Notice the "likely" above. If there is another pack access
in the interim, and we run out of descriptors, we could
close the pack. And then a later attempt to access the
closed pack could fail (we'll try to re-open it, of course,
but it may have been deleted). In practice, this doesn't
happen because we tend to look up items and then access them
immediately.
Pack-objects does not follow this code path. Instead, it
accesses the packs at a much lower level, using
find_pack_entry_one directly. This means we skip the
is_pack_valid check, and may end up with the name of a
packfile, but no open descriptor.
We can add the same is_pack_valid check here. Unfortunately,
the access patterns of pack-objects are not quite as nice
for keeping lookup and object access together. We look up
each object as we find out about it, and only later, when
writing the packfile, do we actually access it. Which
means that the opened packfile may be closed in the interim.
In practice, however, adding this check still has value, for
three reasons.
1. If you have a reasonable number of packs and/or a
reasonable file descriptor limit, you can keep all of
your packs open simultaneously. If this is the case,
then the race is impossible to trigger.
2. Even if you can't keep all packs open at once, you
may end up keeping the deleted one open (i.e., you may
get lucky).
3. The race window is shortened. You may notice early that
the pack is gone, and not try to access it. Triggering
the problem without this check means deleting the pack
any time after we read the list of index files, but
before we access the looked-up objects. Triggering it
with this check means deleting the pack after we do a
lookup (and successfully access the packfile), but
before we access the object. Which is a smaller window.
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we need to compare and add an alt object path to the
alt_odb_list, we normalize the path first, since it is easier to
get a correct result when comparing normalized paths.
Use strbuf to replace some string operations, since it is cleaner and
safer.
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Hui Wang <Hui.Wang@windriver.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Cloning from a local repository blindly copies or hardlinks all the files
under the objects/ hierarchy. This results in two issues:
- If the repository cloned has an "objects/info/alternates" file, and the
command line of clone specifies --reference, the ones specified on the
command line get overwritten by the copy from the original repository.
- An entry in an "objects/info/alternates" file can specify the object
stores it borrows objects from as a path relative to the "objects/"
directory. When cloning a repository with such an alternates file, if
the new repository is not sitting next to the original repository, such
relative paths need to be adjusted so that they can be used in the new
repository.
This updates add_to_alternates_file() to take the path to the alternate
object store, including the "/objects" part at the end (earlier, it was
taking the path to $GIT_DIR and was adding "/objects" itself), as it is
technically possible to specify in the objects/info/alternates file the path
of a directory whose name does not end with "/objects".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git currently reports loose objects as 'corrupt' if they've been
deflated using a window size less than 32Kb, because the
experimental_loose_object() function doesn't recognise the header
byte as a zlib header. This patch makes the function tolerant of
all valid window sizes (15-bit to 8-bit) - but doesn't sacrifice
its accuracy in distinguishing the standard loose-object format
from the experimental (now abandoned) format.
On memory-constrained systems zlib may use a much smaller window
size - working on Agit, I found that Android uses a 4KB window,
giving a header byte of 0x48, not 0x78. Consequently all loose
objects generated appear 'corrupt', which is why Agit is a read-only
Git client at this time - I don't want my client to generate Git
repos that other clients treat as broken :(
This patch makes Git tolerant of different deflate settings - it
might appear that it changes experimental_loose_object() to the point
where it could incorrectly identify the experimental format as the
standard one, but the two criteria (bitmask & checksum) can only
give a false result for an experimental object where both of the
following are true:
1) object size is exactly 8 bytes when uncompressed (bitmask)
2) ([single-byte in-pack git type&size header] * 256
   + [1st byte of the following zlib header]) % 31 == 0 (checksum)
As it happens, for all possible combinations of valid object type
(1-4) and window bits (0-7), the only time the checksum will be
divisible by 31 is for 0x1838 - i.e. object type *1*, a Commit - which,
due to the fields all Commit objects must contain, could never be as
small as 8 bytes in size.
Given this, the combination of the two criteria (bitmask & checksum)
always correctly determines the buffer format, and is more tolerant
than the previous version.
The alternative to this patch is simply removing support for the
experimental format, which I am also totally cool with.
References:
Android uses a 4KB window for deflation:
http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=luni/src/main/native/java_util_zip_Deflater.cpp;h=c0b2feff196e63a7b85d97cf9ae5bb2583409c28;hb=refs/heads/gingerbread#l53
Code snippet searching for false positives with the zlib checksum:
https://gist.github.com/1118177
Signed-off-by: Roberto Tyley <roberto.tyley@guardian.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/index-pack:
verify-pack: use index-pack --verify
index-pack: show histogram when emulating "verify-pack -v"
index-pack: start learning to emulate "verify-pack -v"
index-pack: a miniscule refactor
index-pack --verify: read anomalous offsets from v2 idx file
write_idx_file: need_large_offset() helper function
index-pack: --verify
write_idx_file: introduce a struct to hold idx customization options
index-pack: group the delta-base array entries also by type
Conflicts:
    builtin/verify-pack.c
    cache.h
    sha1_file.c
* jc/zlib-wrap:
zlib: allow feeding more than 4GB in one go
zlib: zlib can only process 4GB at a time
zlib: wrap deflateBound() too
zlib: wrap deflate side of the API
zlib: wrap inflateInit2 used to accept only for gzip format
zlib: wrap remaining calls to direct inflate/inflateEnd
zlib wrapper: refactor error message formatter
Conflicts:
    sha1_file.c
In a workload other than "git log" (without a pathspec or any option that
causes us to inspect trees and blobs), the recency pack order is said to
cause the access to jump around quite a bit. Add a hook to allow us to
observe how bad it is.
"git config core.logpackaccess /var/tmp/pal.txt" will give you the log
in the specified file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The size of objects we read from the repository and data we try to put
into the repository are represented in "unsigned long", so that on larger
architectures we can handle objects that weigh more than 4GB.
But the interface defined in zlib.h to communicate with inflate/deflate
limits avail_in (how many bytes of input are we calling zlib with) and
avail_out (how many bytes of output from zlib are we ready to accept)
fields effectively to 4GB by defining their type to be uInt.
In many places in our code, we allocate a large buffer (e.g. mmap'ing a
large loose object file) and tell zlib its size by assigning the size to
avail_in field of the stream, but that will truncate the high octets of
the real size. The worst part of this story is that we often pass around
z_stream (the state object used by zlib) to keep track of the number of
used bytes in input/output buffer by inspecting these two fields, which
practically limits our callchain to the same 4GB limit.
Wrap z_stream in another structure git_zstream that can express avail_in
and avail_out in unsigned long. For now, just die() when the caller gives
a size that cannot be given to a single zlib call. In later patches in the
series, we would make git_inflate() and git_deflate() internally loop to
give callers an illusion that our "improved" version of zlib interface can
operate on a buffer larger than 4GB in one go.
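Sketched, the wrapper looks roughly like this (the real definition
lives in git's zlib layer and may carry more fields):

    #include <zlib.h>

    typedef struct {
            z_stream z;                 /* the real zlib state underneath */
            unsigned long avail_in;     /* full-width counters, instead */
            unsigned long avail_out;    /* of zlib's 32-bit uInt fields */
            unsigned long total_in;
            unsigned long total_out;
            unsigned char *next_in;
            unsigned char *next_out;
    } git_zstream;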
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Wrap deflateInit, deflate, and deflateEnd for everybody, and the sole use
of deflateInit2 in remote-curl.c to tell the library to use gzip header
and trailer in git_deflate_init_gzip().
There is only one caller that cares about the status from deflateEnd().
Introduce git_deflate_end_gently() to let that sole caller retrieve the
status and act on it (i.e. die) for now, but we would probably want to
make inflate_end/deflate_end die when they ran out of memory and get
rid of the _gently() kind.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Every time I look at the read-loose-object codepath, the
legacy_loose_object() function makes my brain go through mental
contortions. When we were playing with the experimental loose object
format, it may have made sense to call the traditional format
"legacy", in the hope that the experimental one would some day
replace it to become official, but that never happened.
This renames the function (and negates its return value) to detect if we
are looking at the experimental format, and move the code around in its
caller which used to do "if we are looking at legacy, do this special case,
otherwise the normal case is this". The codepath to read from the loose
objects in experimental format is the "unlikely" case.
Someday after Git 2.0, we should drop the support of this format.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This finally gets rid of the inefficient verify-pack implementation that
walks objects in the packfile in their object name order and replaces it
with a call to index-pack --verify. As a side effect, it also removes
packed_object_info_detail() API which is rather expensive.
As this changes the way errors are reported (verify-pack used to rely on
the usual runtime error detection routine unpack_entry() to diagnose the
CRC errors in an entry in the *.idx file; index-pack --verify checks the
whole *.idx file in one go), update a test that expected the string "CRC"
to appear in the error message.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using an unsigned type, we would fail to detect a read error and then
proceed to try to write (size_t)-1 bytes.
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/bigfile:
Bigfile: teach "git add" to send a large file straight to a pack
index_fd(): split into two helper functions
index_fd(): turn write_object and format_check arguments into one flag
Make map_sha1_file(), parse_sha1_header() and unpack_sha1_header()
available to the streaming read API by exporting them via cache.h header
file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
An object found in the delta-base cache is not guaranteed to
stay there, but we know it came from a pack and it is likely
to give us quick access if we read_sha1_file() it right now,
which is a piece of useful information.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/replacing:
read_sha1_file(): allow selective bypassing of replacement mechanism
inline lookup_replace_object() calls
read_sha1_file(): get rid of read_sha1_file_repl() madness
t6050: make sure we test not just commit replacement
Declare lookup_replace_object() in cache.h, not in commit.h
Conflicts:
    environment.c
The original interface for sha1_object_info() takes an object name and
gives back a type and its size (the latter is given only when it was
asked for). The new interface wraps its implementation and exposes a few
more pieces of information that the interface used to discard, namely:
- where the object is stored (loose? cached? packed?)
- if packed, in which packfile, and where in it?
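A sketch of the extended struct (field and constant names are best
guesses at the real definition in cache.h):

    #include <sys/types.h>

    struct packed_git;                      /* opaque here */

    struct object_info {
            unsigned long *sizep;           /* query item, as before */

            /* new: where the object was found */
            enum { OI_CACHED, OI_LOOSE, OI_PACKED, OI_DBCACHED } whence;
            union {
                    struct {
                            struct packed_git *pack;
                            off_t offset;
                            unsigned int is_delta;
                    } packed;
            } u;
    };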
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* In the earlier round, this used u.pack.delta to record the length of
the delta chain, but the caller is not necessarily interested in the
length of the delta chain per se; it may only want to know if it is a
delta against another object or is stored as deflated data. Calling
packed_object_info_detail() involves walking the reverse index chain to
compute the stored size of the object and is unnecessarily expensive.
We could resurrect the code if a new caller wants to know, but I doubt
it.
Instead return an integer that can be given to typename() if
the caller wants a string, just like everybody else does.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/convert:
convert: make it harder to screw up adding a conversion attribute
convert: make it safer to add conversion attributes
convert: give saner names to crlf/eol variables, types and functions
convert: rename the "eol" global variable to "core_eol"
* jc/bigfile:
Bigfile: teach "git add" to send a large file straight to a pack
index_fd(): split into two helper functions
index_fd(): turn write_object and format_check arguments into one flag
* jc/replacing:
read_sha1_file(): allow selective bypassing of replacement mechanism
inline lookup_replace_object() calls
read_sha1_file(): get rid of read_sha1_file_repl() madness
t6050: make sure we test not just commit replacement
Declare lookup_replace_object() in cache.h, not in commit.h
Since commit c793430 (Limit file descriptors used by packs, 2011-02-28),
the extra parameter added in f2e872aa (Work around EMFILE when there are
too many pack files, 2010-11-01) is not used anymore.
Remove it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Acked-by: Shawn O. Pearce <spearce@spearce.org>
The way "object replacement" mechanism was tucked to the read_sha1_file()
interface was suboptimal in a couple of ways:
- Callers that want it to die with a useful diagnosis upon seeing a corrupt
object do not have a way to say that they do not want any object
replacement.
- Callers who do not want it to die but want to handle the errors
themselves are told to arrange to call read_object(), but the function
does not use the replacement mechanism, and it is also a file-scope
static function that not many callers can call to begin with.
This adds a read_sha1_file_extended() that takes a set of flags; the
callers of read_sha1_file() pass a flag READ_SHA1_FILE_REPLACE to ask
for the object replacement mechanism to kick in.
Later, we could add another flag bit to tell the function to return an
error instead of dying and then remove the misguided "call read_object()
yourself".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Most callers want to silently get a replacement object, and they do not
care what the real name of the replacement object is. Worse yet, no sane
interface to return the underlying object without replacement is provided.
Remove the function and make only the few callers that want the name of
the replacement object find it themselves.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When adding a new content to the repository, we have always slurped
the blob in its entirety in-core first, and computed the object name
and compressed it into a loose object file. Handling large binary
files (e.g. video and audio asset for games) has been problematic
because of this design.
At the middle level of the "git add" callchain is an internal API,
index_fd(), that takes an open file descriptor for the working tree
file being added, together with its size. Teach it to call out to
fast-import when adding a large blob.
The write-out codepath in entry.c::write_entry() should be taught to
stream, instead of reading everything in core. This should not be so
hard to implement, especially if we limit ourselves only to loose
object files and non-delta representation in packfiles.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split out the case where we do not know the size of the input (hence we
read everything into a strbuf before doing anything) to index_pipe(), and
the other case where we mmap or read the whole data to index_bulk().
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "format_check" parameter tucked after the existing parameters is too
ugly an afterthought to live in any reasonable API.
Combine it with the other boolean parameter "write_object" into a single
"flags" parameter.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
I found that some doubled words had snuck back into projects from which
I'd already removed them, so now there's a "syntax-check" makefile rule in
gnulib to help prevent recurrence.
Running the command below spotted a few in git, too:
    git ls-files | xargs perl -0777 -n \
      -e 'while (/\b(then?|[iao]n|i[fst]|but|f?or|at|and|[dt])\s+\1\b/gims)' \
      -e '{$n=($` =~ tr/\n/\n/ + 1); ($v=$&)=~s/\n/\\n/g;' \
      -e 'print "$ARGV:$n:$v\n"}'
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The git-new-workdir script in contrib/ makes a new work tree by sharing
many subdirectories of the .git directory with the original repository.
When rerere.enabled is set in the original repository, but the user has
not encountered any conflicts yet, the original repository may not yet
have .git/rr-cache directory.
When rerere wants to run in a new work tree created from such a young
original repository, it fails to mkdir(2) .git/rr-cache, which is a symlink
to a yet-to-be-created directory.
There are three possible approaches to this:
- A naive solution is not to create a symlink in the git-new-workdir
script to a directory the original does not have (yet). This is not a
solution, as we tend to lazily create subdirectories of .git/, and
having rerere.enabled configuration set is a strong indication that the
user _wants_ to have this lazy creation to happen;
- We could always create .git/rr-cache upon repository creation. This is
tempting but will not help people with existing repositories.
- Detect this case by seeing that mkdir(2) failed with EEXIST, checking
that the path is a symlink, and trying to run mkdir(2) on the link
target.
This patch solves the issue by doing the third one.
Strictly speaking, this is incomplete. It does not attempt to handle
relative symbolic link that points into the original repository, but this
is good enough to help people who use contrib/workdir/git-new-workdir
script.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the spirit of v1.5.0.2~21 (Check for PRIuMAX rather than
NO_C99_FORMAT in fast-import.c, 2007-02-20), use PRIuMAX from
git-compat-util.h on all platforms instead of C99-specific formats
like %zu with dangerous fallbacks to %u or %lu.
So now C99-challenged platforms can build git without provoking
warnings or errors from printf, even if pointers do not have the same
size as an int or long.
The need for a fallback PRIuMAX is detected in git-compat-util.h with
"#ifndef PRIuMAX". So while at it, simplify the Makefile and configure
script by eliminating the NO_C99_FORMAT knob altogether.
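For example, a size that previously needed %zu (with a fallback to %u
or %lu) can now be printed portably; a small standalone illustration:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
            uintmax_t size = (uintmax_t)1 << 40;    /* wider than int or
                                                       long on 32-bit */
            /* git-compat-util.h supplies PRIuMAX itself when the
               platform's <inttypes.h> does not */
            printf("%" PRIuMAX " bytes\n", size);
            return 0;
    }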
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* sp/maint-fd-limit:
sha1_file.c: Don't retain open fds on small packs
mingw: add minimum getrlimit() compatibility stub
Limit file descriptors used by packs
If a pack file is small enough that its entire contents fits within
one mmap window, mmap the file and then immediately close its file
descriptor. This reduces the number of file descriptors that are
needed to read from repositories with many tiny pack files, such
as one that has received 1000 pushes (and created 1000 small pack
files) since its last repack.
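A sketch of the idea (the packed_git field names are guesses at the
bookkeeping involved, not the exact code):

    if (p->pack_size <= one_mmap_window_size) {
            /* the whole pack is mapped; keeping the fd buys nothing */
            close(p->pack_fd);
            p->pack_fd = -1;
    }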
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rather than using 'errno == EMFILE' after a failed open() call
to indicate the process is out of file descriptors and an LRU
pack window should be closed, place a hard upper limit on the
number of open packs based on the actual rlimit of the process.
By using a hard upper limit that is below the rlimit of the current
process, it is not necessary to check for EMFILE on every single
fd-allocating system call. Instead, reserving 25 file descriptors
makes it safe to assume the system call won't fail due to being over
the file descriptor limit. Here 25 is chosen as a WAG, but it considers
3 for stdin/stdout/stderr, and at least a few for other Git code
to operate on temporary files. An additional 20 is reserved as it
is not known what the C library needs to perform other services on
Git's behalf, such as nsswitch or name resolution.
This fixes a case where running `git gc --auto` in a repository
with more than 1024 packs (but an rlimit of 1024 open fds) fails
due to the temporary output file not being able to allocate a
file descriptor. The output file is opened by pack-objects after
object enumeration and delta compression are done, both of which
have already opened all of the packs and fully populated the file
descriptor table.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>