If there are few objects to deltify, they might be split amongst threads
so that there are simply no other objects left to delta against within
the same thread. Let's use the same 2*window threshold as is used for the
final load balancing when deciding whether extra threads may be created.
This fixes the benign t5300 test failure.
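As an illustration of the heuristic (a minimal sketch with hypothetical
names, not the actual pack-objects code), the idea is to stop adding
threads once a thread's share of the object list would fall below
2*window:

    #include <stdio.h>

    /* Shrink the thread count until each thread's chunk holds at least
     * 2*window objects, so every thread keeps enough neighbours to
     * delta against. */
    static unsigned int plan_threads(unsigned int nr_objects,
                                     unsigned int max_threads,
                                     unsigned int window)
    {
        unsigned int threads = max_threads;

        while (threads > 1 && nr_objects / threads < 2 * window)
            threads--;
        return threads;
    }

    int main(void)
    {
        /* hypothetical numbers: 50 objects, window 10, 4 threads requested */
        printf("threads used: %u\n", plan_threads(50, 4, 10)); /* prints 2 */
        return 0;
    }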
Signed-off-by: Nicolas Pitre <nico@cam.org>
Tested-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
... and display the actual number of threads used when locally
repacking. A remote server still won't tell you how many threads it
uses during a fetch though.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Especially on Windows, where an open file cannot be replaced, make
sure pack-objects always closes the packs it is about to replace. Even on
non-Windows systems, this avoids potentially bad results if objects
were ever read from the new pack file using offsets from the old
index.
This should fix t5303 on Windows.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Tested-by: Johannes Sixt <j6t@kdbg.org> (MinGW)
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* bc/maint-keep-pack:
repack: only unpack-unreachable if we are deleting redundant packs
t7700: test that 'repack -a' packs alternate packed objects
pack-objects: extend --local to mean ignore non-local loose objects too
sha1_file.c: split has_loose_object() into local and non-local counterparts
t7700: demonstrate mishandling of loose objects in an alternate ODB
builtin-gc.c: use new pack_keep bitfield to detect .keep file existence
repack: do not fall back to incremental repacking with [-a|-A]
repack: don't repack local objects in packs with .keep file
pack-objects: new option --honor-pack-keep
packed_git: convert pack_local flag into a bitfield and add pack_keep
t7700: demonstrate mishandling of objects in packs with a .keep file
* np/pack-safer:
t5303: fix printf format string for portability
t5303: work around printf breakage in dash
pack-objects: don't leak pack window reference when splitting packs
extend test coverage for latest pack corruption resilience improvements
pack-objects: allow "fixing" a corrupted pack without a full repack
make find_pack_revindex() aware of the nasty world
make check_object() resilient to pack corruptions
make packed_object_info() resilient to pack corruptions
make unpack_object_header() non fatal
better validation on delta base object offsets
close another possibility for propagating pack corruption
* bc/maint-keep-pack:
t7700: test that 'repack -a' packs alternate packed objects
pack-objects: extend --local to mean ignore non-local loose objects too
sha1_file.c: split has_loose_object() into local and non-local counterparts
t7700: demonstrate mishandling of loose objects in an alternate ODB
builtin-gc.c: use new pack_keep bitfield to detect .keep file existence
repack: do not fall back to incremental repacking with [-a|-A]
repack: don't repack local objects in packs with .keep file
pack-objects: new option --honor-pack-keep
packed_git: convert pack_local flag into a bitfield and add pack_keep
t7700: demonstrate mishandling of objects in packs with a .keep file
* maint:
Start 1.6.0.5 cycle
Fix pack.packSizeLimit and --max-pack-size handling
checkout: Fix "initial checkout" detection
Remove the period after the git-check-attr summary
Conflicts:
RelNotes
If the limit was sufficiently low, having a single object written
could bust the limit (by design), but it caused the remaining allowed
size to go negative for subsequent objects, which for an unsigned
variable turns into a rather huge limit.
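A minimal sketch of that failure mode, with made-up numbers and
variable names:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t limit = 1000;    /* say pack.packSizeLimit is 1000 bytes */
        uint64_t written = 1500;  /* one object may exceed it, by design */

        /* unsigned subtraction wraps around instead of going negative */
        uint64_t leftover = limit - written;

        printf("leftover budget: %llu bytes\n",
               (unsigned long long)leftover); /* a huge number, not -500 */
        return 0;
    }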
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With this patch, --local means pack only local objects that are not already
packed.
Additionally, this fixes the t7700 test that checks whether loose objects in
an alternate object database are repacked.
Signed-off-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This adds a new option to pack-objects which will cause it to ignore an
object which appears in a local pack which has a .keep file, even if it
was specified for packing.
This option will be used by the porcelain repack.
Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the pack data to be reused is found to be bad, let's fall back to
full object access through the generic path, which has its own strategies
for finding alternate object sources in that case. This allows for "fixing"
a corrupted pack simply by copying either another pack containing the
object(s) found to be bad, or the loose object itself, into the object
store, and launching a repack without the need for -f.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It currently calls die() whenever the given offset is not found, on the
assumption that such a thing should never happen. But this offset may come
from a corrupted pack, which _could_ happen, and then the offset would not
be found. Callers should deal with this possibility gracefully instead.
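The calling convention change looks roughly like this (a simplified
sketch with hypothetical types, not the real revindex structures):

    #include <stdio.h>

    struct revindex_entry {
        unsigned long offset;
        unsigned int nr;
    };

    /* return the entry for "ofs", or NULL if the offset is not in the
     * revindex (e.g. it came from a corrupted delta base reference) */
    static struct revindex_entry *find_entry(struct revindex_entry *revindex,
                                             unsigned int nr_entries,
                                             unsigned long ofs)
    {
        unsigned int i;

        for (i = 0; i < nr_entries; i++)
            if (revindex[i].offset == ofs)
                return &revindex[i];
        return NULL; /* was: die("internal error: pack revindex corrupt") */
    }

    int main(void)
    {
        struct revindex_entry idx[] = { { 12, 0 }, { 345, 1 } };

        printf("found: %s\n", find_entry(idx, 2, 999) ? "yes" : "no");
        return 0;
    }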
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The check_object() function tries to get away with the least amount of
pack access possible when it already has partial information on the given
object, rather than calling the more costly packed_object_info().
When things don't look right, it should just give up and fall back to
packed_object_info() directly instead of die()'ing.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It is possible to have pack corruption in the object header. Currently
unpack_object_header() simply die()s on it instead of letting the caller
deal with that gracefully.
So let's have unpack_object_header() return an error instead, and find
a better name for unpack_object_header_gently() in that context. All
callers of unpack_object_header() are ready for it.
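The shape of that change, as a self-contained sketch (the header format
below is a simplified varint, not git's exact object header encoding):

    #include <stdio.h>

    /* parse a 7-bits-per-byte size; return 0 on success, -1 if the
     * buffer ends before the header does (i.e. corruption) */
    static int parse_size(const unsigned char *buf, unsigned long len,
                          unsigned long *size)
    {
        unsigned long used = 0;
        unsigned int shift = 0;
        unsigned char c;

        *size = 0;
        do {
            if (used >= len)
                return -1; /* was: die("object header is corrupt") */
            c = buf[used++];
            *size |= (unsigned long)(c & 0x7f) << shift;
            shift += 7;
        } while (c & 0x80);
        return 0;
    }

    int main(void)
    {
        unsigned char truncated[] = { 0x85 }; /* continuation bit set, no next byte */
        unsigned long size;

        if (parse_size(truncated, sizeof(truncated), &size) < 0)
            printf("corrupt header, let the caller decide\n");
        return 0;
    }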
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In one case, it was possible to have a bad offset equal to 0, effectively
pointing a delta onto itself and crashing git after too many recursions.
In the other cases, a negative offset could result because off_t is
signed. Catch those.
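In terms of a check, the added validation amounts to something like this
(a simplified sketch; the real check sits in the delta base resolution
path):

    #include <sys/types.h>  /* off_t */

    /* delta_obj_offset is where the OBJ_OFS_DELTA itself starts;
     * base_offset is delta_obj_offset minus the encoded distance */
    static int base_offset_ok(off_t delta_obj_offset, off_t base_offset)
    {
        /* 0 would point the delta at itself; negative values appear
         * when the signed off_t arithmetic underflows */
        return base_offset > 0 && base_offset < delta_obj_offset;
    }

A caller would compute base_offset from the varint encoded in the pack
and treat the object as corrupt when this returns 0.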
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Abstract
--------
With index v2 we have a per object CRC to allow quick and safe reuse of
pack data when repacking. This, however, doesn't currently prevent a
stealth corruption from being propagated into a new pack when _not_
reusing pack data as demonstrated by the modification to t5302 included
here.
The Context
-----------
The Git database is all checksummed with SHA1 hashes. Any kind of
corruption can be confirmed by verifying this per object hash against
corresponding data. However this can be costly to perform systematically
and therefore this check is often not performed at run time when
accessing the object database.
First, the loose object format is entirely compressed with zlib, which
already provides a CRC verification of its own when inflating data. Any
disk corruption would already be caught in this case.
Then, packed objects are also compressed with zlib, but only for their
actual payload. The object headers and delta base references are not
deflated, for obvious performance reasons, which leaves them
vulnerable to potentially undetected disk corruption. Object types
are often validated against the expected type when they're requested,
and the deflated size must always match the size recorded in the object
header, so those cases are pretty much covered as well.
Where corruptions could go unnoticed is in the delta base reference.
Of course, in the OBJ_REF_DELTA case, the odds for a SHA1 reference to
get corrupted so it actually matches the SHA1 of another object with the
same size (the delta header stores the expected size of the base object
to apply against) are virtually zero. In the OBJ_OFS_DELTA case, the
reference is a pack offset which would have to match the start boundary
of a different base object that still has the same size, and although this
is much more "probable" than in the OBJ_REF_DELTA case, the
probability is still about zero in absolute terms. Still, the possibility
exists, as demonstrated in t5302, and is certainly greater than that of a
SHA1 collision, especially in the OBJ_OFS_DELTA case which is now the
default when repacking.
Again, repacking by reusing existing pack data is OK since the per-object
CRC provided by index v2 guards against any such corruption. What t5302
failed to test is a full repack in such a case.
The Solution
------------
As unlikely as this kind of stealth corruption can be in practice, it
certainly isn't acceptable to propagate it into a freshly created pack.
But, because this is so unlikely, we don't want to pay the run-time cost
of extra validation checks all the time either. Furthermore, the
consequences of such corruption in anything but repacking should be rather
visible, and even if they could be quite unpleasant, they are still far
less severe than actively creating bad packs.
So the best compromise is to check the packed object CRC when unpacking
objects, and only during the compression/writing phase of a repack, and
only when not streaming the result. The cost of this is minimal (less
than 1% CPU time), and visible only with a full repack.
Someone with a stats background could provide an objective evaluation of
this, but I suspect that it's bad RAM that has more potential for data
corruptions at this point, even in those cases where this extra check
is not performed. Still, it is best to prevent a known hole for
corruption when recreating object data into a new pack.
What about the streamed pack case? Well, any client receiving a pack
must always consider that pack untrusted and perform full validation
anyway, hence no such stealth corruption could be propagated to remote
repositories that way. It is therefore pointless to do local validation
in that case.
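To illustrate the check described above under "The Solution": it amounts
to recomputing the CRC over the raw packed bytes and comparing it with
the value recorded in the v2 index. A minimal sketch using zlib's
crc32(), with a hypothetical helper standing in for the index lookup:

    #include <stdint.h>
    #include <zlib.h>       /* crc32() */

    /* assumed helper: CRC recorded for this object in the .idx v2 file */
    extern uint32_t recorded_crc_for_object(unsigned int obj_nr);

    static int packed_data_ok(unsigned int obj_nr,
                              const unsigned char *raw, unsigned long len)
    {
        uint32_t crc = crc32(0L, Z_NULL, 0);

        crc = crc32(crc, raw, (uInt)len); /* sketch: assumes len fits in uInt */
        return crc == recorded_crc_for_object(obj_nr);
    }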
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* maint:
Start 1.6.0.4 cycle
add instructions on how to send patches to the mailing list with Gmail
Documentation/gitattributes: Add subsection header for each attribute
git send-email: avoid leaking directory file descriptors.
send-pack: do not send out single-level refs such as refs/stash
fix overlapping memcpy in normalize_absolute_path
pack-objects: avoid reading uninitialized data
correct cache_entry allocation
Conflicts:
RelNotes
In the main loop of find_deltas, we do:
    struct object_entry *entry = *list++;
    ...
    if (!*list_size)
            ...
            break
Because we look at and increment *list _before_ the check of
list_size, in the very last iteration of the loop we will
look at uninitialized data, and increment the pointer beyond
one past the end of the allocated space. Since we don't
actually do anything with the data until after the check,
this is not a problem in practice.
But since it technically violates the C standard, and
because it provokes a spurious valgrind warning, let's just
move the initialization of entry to a safe place.
This fixes valgrind errors in t5300, t5301, t5302, t5303, and
t9400.
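The fix pattern, shown on a self-contained toy loop rather than the real
find_deltas code: check the remaining size first, and only then read
through and advance the cursor.

    #include <stdio.h>

    int main(void)
    {
        int data[3] = { 10, 20, 30 };
        int *list = data;
        unsigned int list_size = 3;

        for (;;) {
            int entry;

            if (!list_size)       /* check first ... */
                break;
            entry = *list++;      /* ... then read and advance */
            list_size--;
            printf("%d\n", entry);
        }
        return 0;
    }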
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many call sites immediately initialize allocated memory with zero after
calling xmalloc. A single call to xcalloc can replace this two-call
sequence.
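The pattern being replaced, sketched with the standard allocator calls
(git's xmalloc/xcalloc wrappers behave the same except that they die on
allocation failure):

    #include <stdlib.h>
    #include <string.h>

    struct object_entry { unsigned long offset; unsigned int depth; };

    int main(void)
    {
        size_t n = 16;

        /* before: allocate, then zero in a second call */
        struct object_entry *a = malloc(n * sizeof(*a));
        if (a)
            memset(a, 0, n * sizeof(*a));

        /* after: one call returning already-zeroed memory */
        struct object_entry *b = calloc(n, sizeof(*b));

        free(a);
        free(b);
        return 0;
    }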
Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
A comment on top of create_tmpfile() describes caveats ('can have
problems on various systems (FAT, NFS, Coda)') that should apply
in this situation as well. In the end this did not solve
any of my personal problems, but it might be a useful cleanup patch
nevertheless.
Signed-off-by: Petr Baudis <pasky@suse.cz>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
User notifications are presented as 'git cmd', and code comments
are presented as '"cmd"' or 'git's cmd', rather than 'git-cmd'.
Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* np/maint-safer-pack:
fixup_pack_header_footer(): use nicely aligned buffer sizes
index-pack: use fixup_pack_header_footer()'s validation mode
pack-objects: use fixup_pack_header_footer()'s validation mode
improve reliability of fixup_pack_header_footer()
pack-objects: improve returned information from write_one()
This improves commit 6d6f9cddbe a bit by simply not including missing
bases in the list of objects to process at all.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* np/maint-safer-pack:
fixup_pack_header_footer(): use nicely aligned buffer sizes
index-pack: use fixup_pack_header_footer()'s validation mode
pack-objects: use fixup_pack_header_footer()'s validation mode
improve reliability of fixup_pack_header_footer()
pack-objects: improve returned information from write_one()
When limiting the pack size, a new header has to be written to the
pack and a new SHA1 computed. Make sure that the SHA1 of what is being
read back matches the SHA1 of what was written.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, this function has the potential to read corrupted pack data
from disk and give it a valid SHA1 checksum. Let's add the ability to
validate the SHA1 checksum of the existing data along the way, including
before and after any arbitrary point in the pack.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function returns 0 when the current object couldn't be written
due to the pack size limit, and otherwise the current offset in the pack.
There is a problem with this approach, however, since the current object
could be a delta and its delta base might just have been written in
the same write_one() call, but those successfully written objects are
not accounted for in the offset variable tracked by the caller. Currently
this is not an issue, but a subsequent patch will need this.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The index-pack command, when processing a thin pack, fixed up the pack
after the fact. It forgot to fsync the result, because it only did so
in one path rather than in all cases of fixup.
This moves the fsync_or_die() to the fix-up routine itself, rather than
doing it in one of the callers, so that all cases are covered.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If we are building a thin pack and one of the base objects we would
consider for deltification is missing, it's OK: the other side already
has that base object. We may be able to get a delta from another
object, or we can simply send the new object whole (no delta).
This change allows a shallow clone to store only the objects which
are unique to it, as well as the boundary commit and its trees, but
avoids storing the boundary blobs. This special form of a shallow
clone is able to represent just the difference between two trees.
Pack objects change suggested by Nicolas Pitre.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Acked-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When printing values of type uint32_t, we should use PRIu32, and should
not assume that it is unsigned int. On 32-bit platforms, it could be
defined as unsigned long. The same caution applies to ntohl().
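For example (a minimal standalone illustration):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t nr_objects = 1234;

        /* not portable: "%u" assumes uint32_t is exactly unsigned int */
        /* portable: let <inttypes.h> supply the right conversion */
        printf("%" PRIu32 " objects\n", nr_objects);
        return 0;
    }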
Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To do so, check_pack_crc() is moved from builtin-pack-objects.c to
pack-check.c, where it is more logical to share it.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This makes life much easier for the next patch, as well as being more
efficient when the revindex is actually not used.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since the pack-files are now always created stably on disk, there is no
need to sync() before pruning loose objects or old stale pack-files.
[jc: with Nico's clean-up]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This means that we can depend on packs always being stable on disk,
simplifying a lot of the object serialization worries. And unlike loose
objects, serializing pack creation IO isn't going to be a performance
killer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* bc/repack:
Documentation/git-repack.txt: document new -A behaviour
let pack-objects do the writing of unreachable objects as loose objects
add a force_object_loose() function
builtin-gc.c: deprecate --prune, it now really has no effect
git-gc: always use -A when manually repacking
repack: modify behavior of -A option to leave unreferenced objects unpacked
Conflicts:
builtin-pack-objects.c
git_config() only had a function parameter, but no callback data
parameter. This assumes that all callback functions only modify
global variables.
With this patch, every callback gets a void * parameter, and it is hoped
that this will help the libification effort.
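The resulting callback shape, as a simplified sketch (the struct, the
config key handling, and the signatures below are illustrative, not the
exact git prototypes):

    #include <stdio.h>
    #include <string.h>

    struct repack_opts { int use_delta_base_offset; };

    static int config_cb(const char *key, const char *value, void *data)
    {
        struct repack_opts *opts = data;

        if (!strcmp(key, "repack.usedeltabaseoffset"))
            opts->use_delta_base_offset = value && strcmp(value, "false");
        return 0;
    }

    int main(void)
    {
        struct repack_opts opts = { 1 };

        /* a real caller would now pass its state: git_config(config_cb, &opts) */
        config_cb("repack.usedeltabaseoffset", "false", &opts);
        printf("use_delta_base_offset = %d\n", opts.use_delta_base_offset);
        return 0;
    }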
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit ccc1297226 changed the behavior
of 'git repack -A' so that unreachable objects are stored as loose objects.
However it did so in a naive and inefficient way, by making the packs
about to be deleted inaccessible and feeding their content through
'git unpack-objects'. While this works, there are major flaws with
this approach:
- It is unacceptably sloooooooooooooow.
In the Linux kernel repository with no actual unreachable objects,
doing 'git repack -A -d' before:
real 2m33.220s
user 2m21.675s
sys 0m3.510s
And with this change:
real 0m36.849s
user 0m24.365s
sys 0m1.950s
For reference, here's the timing for 'git repack -a -d':
real 0m35.816s
user 0m22.571s
sys 0m2.011s
This is explained by the fact that 'git unpack-objects' was used to
unpack _every_ object, even if (almost) 100% of them were thrown away.
- There is a black-out period.
Between the removal of the .idx file for the redundant pack and the
completion of its unpacking, the unreachable objects become completely
inaccessible. This is not a big issue as we're talking about unreachable
objects, but some consistency is always good.
- There is no way to easily set a sensible mtime for the newly created
unreachable loose objects.
So, while having a command called "pack-objects" perform object
unpacking looks really odd, this is probably the best compromise to be
able to solve the above issues in an efficient way.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'depth' variable doesn't reflect the actual maximum depth used
when other objects already depend on the current one.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the delta data is cached in memory until it is written to a pack
file on disk, it is best to compress it right away in find_deltas() for
the following reasons:
- we have to compress that data anyway;
- this allows for caching more deltas with the same cache size limit;
- compression is potentially threaded.
This last point is especially relevant for SMP run time. For example,
repacking the Linux repo on a quad-core processor using 4 threads with
all default settings produced the following results before this change:
real 2m27.929s
user 4m36.492s
sys 0m3.091s
And with this change applied:
real 2m13.787s
user 4m37.486s
sys 0m3.159s
So the CPU time spent stayed more or less the same but the
wall clock time is shorter.
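As a rough sketch of what compressing the cached delta right away means,
using plain zlib (git has its own deflate helpers; the names here are
illustrative):

    #include <stdlib.h>
    #include <zlib.h>

    /* return a malloc'ed deflated copy of the delta, or NULL on failure;
     * *out_len receives the compressed size */
    static void *deflate_delta(const void *delta, unsigned long len,
                               unsigned long *out_len)
    {
        uLongf zlen = compressBound(len);
        void *z = malloc(zlen);

        if (!z)
            return NULL;
        if (compress(z, &zlen, delta, len) != Z_OK) {
            free(z);
            return NULL;
        }
        *out_len = zlen;
        return z;
    }

The cached entry would then hold the compressed bytes and be written out
verbatim at pack-writing time.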
This is, however, not a good thing to do when generating a pack for
network transmission. In that case, the network is most likely to
throttle the data throughput, so it is best to make find_deltas()
faster in order to start writing data ASAP, since we can afford to
spend more time between writes to compress the data
at that point.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Parsing !no_reuse_delta everywhere makes my brain spend extra
cycles wondering each time.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>