Commit Graph

72 Commits

Author SHA1 Message Date
Vicent Marti
ae4f07fbcc pack-bitmap: implement optional name_hash cache
When we use pack bitmaps rather than walking the object
graph, we end up with the list of objects to include in the
packfile, but we do not know the path at which any tree or
blob objects would be found.

In a recently packed repository, this is fine. A fetch would
use the paths only as a heuristic in the delta compression
phase, and a fully packed repository should not need to do
much delta compression.

As time passes, though, we may acquire more objects on top
of our large bitmapped pack. If clients fetch frequently,
then they never even look at the bitmapped history, and all
works as usual. However, a client who has not fetched since
the last bitmap repack will have "have" tips in the
bitmapped history, but "want" newer objects.

The bitmaps themselves degrade gracefully in this
circumstance. We manually walk the more recent bits of
history, and then use bitmaps when we hit them.

But we would also like to perform delta compression between
the newer objects and the bitmapped objects (both to delta
against what we know the user already has, but also between
"new" and "old" objects that the user is fetching). The lack
of pathnames makes our delta heuristics much less effective.

This patch adds an optional cache of the 32-bit name_hash
values to the end of the bitmap file. If present, a reader
can use it to match bitmapped and non-bitmapped names during
delta compression.

Here are perf results for p5310:

Test                      origin/master       HEAD^                      HEAD
-------------------------------------------------------------------------------------------------
5310.2: repack to disk    36.81(37.82+1.43)   47.70(48.74+1.41) +29.6%   47.75(48.70+1.51) +29.7%
5310.3: simulated clone   30.78(29.70+2.14)   1.08(0.97+0.10) -96.5%     1.07(0.94+0.12) -96.5%
5310.4: simulated fetch   3.16(6.10+0.08)     3.54(10.65+0.06) +12.0%    1.70(3.07+0.06) -46.2%
5310.6: partial bitmap    36.76(43.19+1.81)   6.71(11.25+0.76) -81.7%    4.08(6.26+0.46) -88.9%

You can see that the time spent on an incremental fetch goes
down, as our delta heuristics are able to do their work.
And we save time on the partial bitmap clone for the same
reason.

Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-12-30 12:19:23 -08:00
Vicent Marti
7cc8f97108 pack-objects: implement bitmap writing
This commit extends more the functionality of `pack-objects` by allowing
it to write out a `.bitmap` index next to any written packs, together
with the `.idx` index that currently gets written.

If bitmap writing is enabled for a given repository (either by calling
`pack-objects` with the `--write-bitmap-index` flag or by having
`pack.writebitmaps` set to `true` in the config) and pack-objects is
writing a packfile that would normally be indexed (i.e. not piping to
stdout), we will attempt to write the corresponding bitmap index for the
packfile.

Bitmap index writing happens after the packfile and its index has been
successfully written to disk (`finish_tmp_packfile`). The process is
performed in several steps:

    1. `bitmap_writer_set_checksum`: this call stores the partial
       checksum for the packfile being written; the checksum will be
       written in the resulting bitmap index to verify its integrity

    2. `bitmap_writer_build_type_index`: this call uses the array of
       `struct object_entry` that has just been sorted when writing out
       the actual packfile index to disk to generate 4 type-index bitmaps
       (one for each object type).

       These bitmaps have their nth bit set if the given object is of
       the bitmap's type. E.g. the nth bit of the Commits bitmap will be
       1 if the nth object in the packfile index is a commit.

       This is a very cheap operation because the bitmap writing code has
       access to the metadata stored in the `struct object_entry` array,
       and hence the real type for each object in the packfile.

    3. `bitmap_writer_reuse_bitmaps`: if there exists an existing bitmap
       index for one of the packfiles we're trying to repack, this call
       will efficiently rebuild the existing bitmaps so they can be
       reused on the new index. All the existing bitmaps will be stored
       in a `reuse` hash table, and the commit selection phase will
       prioritize these when selecting, as they can be written directly
       to the new index without having to perform a revision walk to
       fill the bitmap. This can greatly speed up the repack of a
       repository that already has bitmaps.

    4. `bitmap_writer_select_commits`: if bitmap writing is enabled for
       a given `pack-objects` run, the sequence of commits generated
       during the Counting Objects phase will be stored in an array.

       We then use that array to build up the list of selected commits.
       Writing a bitmap in the index for each object in the repository
       would be cost-prohibitive, so we use a simple heuristic to pick
       the commits that will be indexed with bitmaps.

       The current heuristics are a simplified version of JGit's
       original implementation. We select a higher density of commits
       depending on their age: the 100 most recent commits are always
       selected, after that we pick 1 commit of each 100, and the gap
       increases as the commits grow older. On top of that, we make sure
       that every single branch that has not been merged (all the tips
       that would be required from a clone) gets their own bitmap, and
       when selecting commits between a gap, we tend to prioritize the
       commit with the most parents.

       Do note that there is no right/wrong way to perform commit
       selection; different selection algorithms will result in
       different commits being selected, but there's no such thing as
       "missing a commit". The bitmap walker algorithm implemented in
       `prepare_bitmap_walk` is able to adapt to missing bitmaps by
       performing manual walks that complete the bitmap: the ideal
       selection algorithm, however, would select the commits that are
       more likely to be used as roots for a walk in the future (e.g.
       the tips of each branch, and so on) to ensure a bitmap for them
       is always available.

    5. `bitmap_writer_build`: this is the computationally expensive part
       of bitmap generation. Based on the list of commits that were
       selected in the previous step, we perform several incremental
       walks to generate the bitmap for each commit.

       The walks begin from the oldest commit, and are built up
       incrementally for each branch. E.g. consider this dag where A, B,
       C, D, E, F are the selected commits, and a, b, c, e are a chunk
       of simplified history that will not receive bitmaps.

            A---a---B--b--C--c--D
                     \
                      E--e--F

       We start by building the bitmap for A, using A as the root for a
       revision walk and marking all the objects that are reachable
       until the walk is over. Once this bitmap is stored, we reuse the
       bitmap walker to perform the walk for B, assuming that once we
       reach A again, the walk will be terminated because A has already
       been SEEN on the previous walk.

       This process is repeated for C, and D, but when we try to
       generate the bitmaps for E, we can reuse neither the current walk
       nor the bitmap we have generated so far.

       What we do now is resetting both the walk and clearing the
       bitmap, and performing the walk from scratch using E as the
       origin. This new walk, however, does not need to be completed.
       Once we hit B, we can lookup the bitmap we have already stored
       for that commit and OR it with the existing bitmap we've composed
       so far, allowing us to limit the walk early.

       After all the bitmaps have been generated, another iteration
       through the list of commits is performed to find the best XOR
       offsets for compression before writing them to disk. Because of
       the incremental nature of these bitmaps, XORing one of them with
       its predecesor results in a minimal "bitmap delta" most of the
       time. We can write this delta to the on-disk bitmap index, and
       then re-compose the original bitmaps by XORing them again when
       loaded.

       This is a phase very similar to pack-object's `find_delta` (using
       bitmaps instead of objects, of course), except the heuristics
       have been greatly simplified: we only check the 10 bitmaps before
       any given one to find best compressing one. This gives good
       results in practice, because there is locality in the ordering of
       the objects (and therefore bitmaps) in the packfile.

     6. `bitmap_writer_finish`: the last step in the process is
	serializing to disk all the bitmap data that has been generated
	in the two previous steps.

	The bitmap is written to a tmp file and then moved atomically to
	its final destination, using the same process as
	`pack-write.c:write_idx_file`.

Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-12-30 12:19:22 -08:00
Vicent Marti
6b8fda2db1 pack-objects: use bitmaps when packing objects
In this patch, we use the bitmap API to perform the `Counting Objects`
phase in pack-objects, rather than a traditional walk through the object
graph. For a reasonably-packed large repo, the time to fetch and clone
is often dominated by the full-object revision walk during the Counting
Objects phase. Using bitmaps can reduce the CPU time required on the
server (and therefore start sending the actual pack data with less
delay).

For bitmaps to be used, the following must be true:

  1. We must be packing to stdout (as a normal `pack-objects` from
     `upload-pack` would do).

  2. There must be a .bitmap index containing at least one of the
     "have" objects that the client is asking for.

  3. Bitmaps must be enabled (they are enabled by default, but can be
     disabled by setting `pack.usebitmaps` to false, or by using
     `--no-use-bitmap-index` on the command-line).

If any of these is not true, we fall back to doing a normal walk of the
object graph.

Here are some sample timings from a full pack of `torvalds/linux` (i.e.
something very similar to what would be generated for a clone of the
repository) that show the speedup produced by various
methods:

    [existing graph traversal]
    $ time git pack-objects --all --stdout --no-use-bitmap-index \
			    </dev/null >/dev/null
    Counting objects: 3237103, done.
    Compressing objects: 100% (508752/508752), done.
    Total 3237103 (delta 2699584), reused 3237103 (delta 2699584)

    real    0m44.111s
    user    0m42.396s
    sys     0m3.544s

    [bitmaps only, without partial pack reuse; note that
     pack reuse is automatic, so timing this required a
     patch to disable it]
    $ time git pack-objects --all --stdout </dev/null >/dev/null
    Counting objects: 3237103, done.
    Compressing objects: 100% (508752/508752), done.
    Total 3237103 (delta 2699584), reused 3237103 (delta 2699584)

    real    0m5.413s
    user    0m5.604s
    sys     0m1.804s

    [bitmaps with pack reuse (what you get with this patch)]
    $ time git pack-objects --all --stdout </dev/null >/dev/null
    Reusing existing pack: 3237103, done.
    Total 3237103 (delta 0), reused 0 (delta 0)

    real    0m1.636s
    user    0m1.460s
    sys     0m0.172s

Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-12-30 12:19:22 -08:00
Jeff King
ce2bc42456 pack-objects: split add_object_entry
This function actually does three things:

  1. Check whether we've already added the object to our
     packing list.

  2. Check whether the object meets our criteria for adding.

  3. Actually add the object to our packing list.

It's a little hard to see these three phases, because they
happen linearly in the rather long function. Instead, this
patch breaks them up into three separate helper functions.

The result is a little easier to follow, though it
unfortunately suffers from some optimization
interdependencies between the stages (e.g., during step 3 we
use the packing list index from step 1 and the packfile
information from step 2).

More importantly, though, the various parts can be
composed differently, as they will be in the next patch.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-12-30 12:19:22 -08:00
Vicent Marti
68fb36eb92 pack-objects: factor out name_hash
As the pack-objects system grows beyond the single
pack-objects.c file, more parts (like the soon-to-exist
bitmap code) will need to compute hashes for matching
deltas. Factor out name_hash to make it available to other
files.

Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-24 15:44:52 -07:00
Vicent Marti
2834bc27c1 pack-objects: refactor the packing list
The hash table that stores the packing list for a given `pack-objects`
run was tightly coupled to the pack-objects code.

In this commit, we refactor the hash table and the underlying storage
array into a `packing_data` struct. The functionality for accessing and
adding entries to the packing list is hence accessible from other parts
of Git besides the `pack-objects` builtin.

This refactoring is a requirement for further patches in this series
that will require accessing the commit packing list from outside of
`pack-objects`.

The hash table implementation has been minimally altered: we now
use table sizes which are always a power of two, to ensure a uniform
index distribution in the array.

Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-24 15:44:48 -07:00
Junio C Hamano
eeb8e8373f Merge branch 'jc/pack-objects'
* jc/pack-objects:
  pack-objects: shrink struct object_entry
2013-10-23 13:21:26 -07:00
Junio C Hamano
238504b014 Merge branch 'nd/fetch-into-shallow'
When there is no sufficient overlap between old and new history
during a fetch into a shallow repository, we unnecessarily sent
objects the sending side knows the receiving end has.

* nd/fetch-into-shallow:
  Add testcase for needless objects during a shallow fetch
  list-objects: mark more commits as edges in mark_edges_uninteresting
  list-objects: reduce one argument in mark_edges_uninteresting
  upload-pack: delegate rev walking in shallow fetch to pack-objects
  shallow: add setup_temporary_shallow()
  shallow: only add shallow graft points to new shallow file
  move setup_alternate_shallow and write_shallow_commits to shallow.c
2013-09-20 12:25:32 -07:00
Nguyễn Thái Ngọc Duy
e76a5fb459 list-objects: reduce one argument in mark_edges_uninteresting
mark_edges_uninteresting() is always called with this form

  mark_edges_uninteresting(revs->commits, revs, ...);

Remove the first argument and let mark_edges_uninteresting figure that
out by itself. It helps answer the question "are this commit list and
revs related in any way?" when looking at mark_edges_uninteresting
implementation.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-28 11:54:18 -07:00
Brandon Casey
7c3ecb3254 Don't close pack fd when free'ing pack windows
Now that close_one_pack() has been introduced to handle file
descriptor pressure, it is not strictly necessary to close the
pack file descriptor in unuse_one_window() when we're under memory
pressure.

Jeff King provided a justification for leaving the pack file open:

   If you close packfile descriptors, you can run into racy situations
   where somebody else is repacking and deleting packs, and they go away
   while you are trying to access them. If you keep a descriptor open,
   you're fine; they last to the end of the process. If you don't, then
   they disappear from under you.

   For normal object access, this isn't that big a deal; we just rescan
   the packs and retry. But if you are packing yourself (e.g., because
   you are a pack-objects started by upload-pack for a clone or fetch),
   it's much harder to recover (and we print some warnings).

Let's do so (or uh, not do so).

Signed-off-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-02 09:27:26 -07:00
Junio C Hamano
63cdcfa40f pack-objects: shrink struct object_entry
Turn some boolean fields into bitfields and use uint32_t for name
hash.  This shrinks the size of the structure from 128 bytes to 120
bytes.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-04 15:23:35 -08:00
Jeff King
315ea32f1b Merge branch 'jk/peel-ref'
Speeds up "git upload-pack" (what is invoked by "git fetch" on the
other side of the connection) by reducing the cost to advertise the
branches and tags that are available in the repository.

* jk/peel-ref:
  upload-pack: use peel_ref for ref advertisements
  peel_ref: check object type before loading
  peel_ref: do not return a null sha1
  peel_ref: use faster deref_tag_noverify
2012-10-25 06:42:27 -04:00
Jeff King
e6dbffa67b peel_ref: do not return a null sha1
The idea of the peel_ref function is to dereference tag
objects recursively until we hit a non-tag, and return the
sha1. Conceptually, it should return 0 if it is successful
(and fill in the sha1), or -1 if there was nothing to peel.

However, the current behavior is much more confusing. For a
regular loose ref, the behavior is as described above. But
there is an optimization to reuse the peeled-ref value for a
ref that came from a packed-refs file. If we have such a
ref, we return its peeled value, even if that peeled value
is null (indicating that we know the ref definitely does
_not_ peel).

It might seem like such information is useful to the caller,
who would then know not to bother loading and trying to peel
the object. Except that they should not bother loading and
trying to peel the object _anyway_, because that fallback is
already handled by peel_ref. In other words, the whole point
of calling this function is that it handles those details
internally, and you either get a sha1, or you know that it
is not peel-able.

This patch catches the null sha1 case internally and
converts it into a -1 return value (i.e., there is nothing
to peel). This simplifies callers, which do not need to
bother checking themselves.

Two callers are worth noting:

  - in pack-objects, a comment indicates that there is a
    difference between non-peelable tags and unannotated
    tags. But that is not the case (before or after this
    patch). Whether you get a null sha1 has to do with
    internal details of how peel_ref operated.

  - in show-ref, if peel_ref returns a failure, the caller
    tries to decide whether to try peeling manually based on
    whether the REF_ISPACKED flag is set. But this doesn't
    make any sense. If the flag is set, that does not
    necessarily mean the ref came from a packed-refs file
    with the "peeled" extension. But it doesn't matter,
    because even if it didn't, there's no point in trying to
    peel it ourselves, as peel_ref would already have done
    so. In other words, the fallback peeling is guaranteed
    to fail.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-10-04 20:34:28 -07:00
Nguyễn Thái Ngọc Duy
4c6881204b i18n: pack-objects: mark parseopt strings for translation
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-20 12:23:18 -07:00
Junio C Hamano
0958a24d73 Merge branch 'jc/sha1-name-more'
Teaches the object name parser things like a "git describe" output
is always a commit object, "A" in "git log A" must be a committish,
and "A" and "B" in "git log A...B" both must be committish, etc., to
prolong the lifetime of abbreviated object names.

* jc/sha1-name-more: (27 commits)
  t1512: match the "other" object names
  t1512: ignore whitespaces in wc -l output
  rev-parse --disambiguate=<prefix>
  rev-parse: A and B in "rev-parse A..B" refer to committish
  reset: the command takes committish
  commit-tree: the command wants a tree and commits
  apply: --build-fake-ancestor expects blobs
  sha1_name.c: add support for disambiguating other types
  revision.c: the "log" family, except for "show", takes committish
  revision.c: allow handle_revision_arg() to take other flags
  sha1_name.c: introduce get_sha1_committish()
  sha1_name.c: teach lookup context to get_sha1_with_context()
  sha1_name.c: many short names can only be committish
  sha1_name.c: get_sha1_1() takes lookup flags
  sha1_name.c: get_describe_name() by definition groks only commits
  sha1_name.c: teach get_short_sha1() a commit-only option
  sha1_name.c: allow get_short_sha1() to take other flags
  get_sha1(): fix error status regression
  sha1_name.c: restructure disambiguation of short names
  sha1_name.c: correct misnamed "canonical" and "res"
  ...
2012-07-22 12:55:07 -07:00
Junio C Hamano
8e676e8ba5 revision.c: allow handle_revision_arg() to take other flags
The existing "cant_be_filename" that tells the function that the
caller knows the arg is not a path (hence it does not have to be
checked for absense of the file whose name matches it) is made into
a bit in the flag word.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-09 16:42:22 -07:00
Nguyễn Thái Ngọc Duy
cf2ba13ac6 pack-objects: use streaming interface for reading large loose blobs
git usually streams large blobs directly to packs. But there are cases
where git can create large loose blobs (unpack-objects or hash-object
over pipe). Or they can come from other git implementations.
core.bigfilethreshold can also be lowered down and introduce a new
wave of large loose blobs.

Use streaming interface to read/compress/write these blobs in one
go. Fall back to normal way if somehow streaming interface cannot be
used.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-29 10:50:56 -07:00
Nguyễn Thái Ngọc Duy
c9018b0305 pack-objects: refactor write_object() into helper functions
The function first decides if we want to copy data taken from existing
pack verbatim or we want to encode the data ourselves for the packfile
we are creating and then carries out the decision.  Separate the latter
phase into two helper functions, one for the case the data is reused,
the other for the case the data is produced anew.

A little twist is that it can later turn out that we cannot reuse the
data after we initially decide to do so; in such a case, the "reuse"
helper makes a call to "generate" helper.  It is easier to follow than
the current fallback code that uses "goto" inside a single large
function.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-18 14:22:15 -07:00
Nguyễn Thái Ngọc Duy
754980d023 pack-objects, streaming: turn "xx >= big_file_threshold" to ".. > .."
This is because all other places do "xx > big_file_threshold"

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-18 14:21:19 -07:00
Jeff King
7e52f5660e gc: do not explode objects which will be immediately pruned
When we pack everything into one big pack with "git repack
-Ad", any unreferenced objects in to-be-deleted packs are
exploded into loose objects, with the intent that they will
be examined and possibly cleaned up by the next run of "git
prune".

Since the exploded objects will receive the mtime of the
pack from which they come, if the source pack is old, those
loose objects will end up pruned immediately. In that case,
it is much more efficient to skip the exploding step
entirely for these objects.

This patch teaches pack-objects to receive the expiration
information and avoid writing these objects out. It also
teaches "git gc" to pass the value of gc.pruneexpire to
repack (which in turn learns to pass it along to
pack-objects) so that this optimization happens
automatically during "git gc" and "git gc --auto".

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-11 11:09:49 -07:00
Michał Kiedrowicz
2b34e486bc pack-objects: Fix compilation with NO_PTHREDS
It looks like commit 99fb6e04 (pack-objects: convert to use
parse_options(), 2012-02-01) moved the #ifdef NO_PTHREDS around but
hasn't noticed that the 'arg' variable no longer is available.

Signed-off-by: Michał Kiedrowicz <michal.kiedrowicz@gmail.com>
Acked-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-26 17:46:00 -08:00
Nguyễn Thái Ngọc Duy
99fb6e04cb pack-objects: convert to use parse_options()
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-01 13:05:00 -08:00
Nguyễn Thái Ngọc Duy
3a2ec52e99 pack-objects: remove bogus comment
The comment was introduced in b5d97e6 (pack-objects: run rev-list
equivalent internally. - 2006-09-04), stating that

git pack-objects [options] base-name <refs...>

is acceptable and refs should be passed into rev-list. But that's not
true. All arguments after base-name are ignored.

Remove the comment and reject this syntax (i.e. no more arguments after
base name)

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-01 13:04:11 -08:00
Nguyễn Thái Ngọc Duy
6a301345a5 pack-objects: do not accept "--index-version=version,"
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-01 13:03:46 -08:00
Junio C Hamano
c4a01a3cbb Merge branch 'maint'
* maint:
  Update draft release notes to 1.7.8.4
  Update draft release notes to 1.7.7.6
  Update draft release notes to 1.7.6.6
  thin-pack: try harder to use preferred base objects as base
2012-01-12 23:33:39 -08:00
Junio C Hamano
5a6a939481 Merge branch 'maint-1.7.7' into maint
* maint-1.7.7:
  Update draft release notes to 1.7.7.6
  Update draft release notes to 1.7.6.6
  thin-pack: try harder to use preferred base objects as base
2012-01-12 23:31:46 -08:00
Junio C Hamano
901c907d83 Merge branch 'maint-1.7.6' into maint-1.7.7
* maint-1.7.6:
  Update draft release notes to 1.7.6.6
  thin-pack: try harder to use preferred base objects as base
2012-01-12 23:31:05 -08:00
Jeff King
15f07e061e thin-pack: try harder to use preferred base objects as base
When creating a pack using objects that reside in existing packs, we try
to avoid recomputing futile delta between an object (trg) and a candidate
for its base object (src) if they are stored in the same packfile, and trg
is not recorded as a delta already. This heuristics makes sense because it
is likely that we tried to express trg as a delta based on src but it did
not produce a good delta when we created the existing pack.

As the pack heuristics prefer producing delta to remove data, and Linus's
law dictates that the size of a file grows over time, we tend to record
the newest version of the file as inflated, and older ones as delta
against it.

When creating a thin-pack to transfer recent history, it is likely that we
will try to send an object that is recorded in full, as it is newer.  But
the heuristics to avoid recomputing futile delta effectively forbids us
from attempting to express such an object as a delta based on another
object. Sending an object in full is often more expensive than sending a
suboptimal delta based on other objects, and it is even more so if we
could use an object we know the receiving end already has (i.e. preferred
base object) as the delta base.

Tweak the recomputation avoidance logic, so that we do not punt on
computing delta against a preferred base object.

The effect of this change can be seen on two simulated upload-pack
workloads. The first is based on 44 reflog entries from my git.git
origin/master reflog, and represents the packs that kernel.org sent me git
updates for the past month or two. The second workload represents much
larger fetches, going from git's v1.0.0 tag to v1.1.0, then v1.1.0 to
v1.2.0, and so on.

The table below shows the average generated pack size and the average CPU
time consumed for each dataset, both before and after the patch:

                  dataset
            | reflog | tags
---------------------------------
     before | 53358  | 2750977
size  after | 32398  | 2668479
     change |   -39% |      -3%
---------------------------------
     before |  0.18  | 1.12
CPU   after |  0.18  | 1.15
     change |    +0% |      +3%

This patch makes a much bigger difference for packs with a shorter slice
of history (since its effect is seen at the boundaries of the pack) though
it has some benefit even for larger packs.

Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-01-12 23:06:20 -08:00
Junio C Hamano
48b303675a Merge branch 'jc/stream-to-pack'
* jc/stream-to-pack:
  bulk-checkin: replace fast-import based implementation
  csum-file: introduce sha1file_checkpoint
  finish_tmp_packfile(): a helper function
  create_tmp_packfile(): a helper function
  write_pack_header(): a helper function

Conflicts:
	pack.h
2011-12-16 22:33:40 -08:00
Junio C Hamano
2e8722fc9e Merge branch 'jc/maint-pack-object-cycle' into maint
* jc/maint-pack-object-cycle:
  pack-object: tolerate broken packs that have duplicated objects

Conflicts:
	builtin/pack-objects.c
2011-12-13 22:04:50 -08:00
Junio C Hamano
df6246ed78 Merge branch 'nd/misc-cleanups' into maint
* nd/misc-cleanups:
  unpack_object_header_buffer(): clear the size field upon error
  tree_entry_interesting: make use of local pointer "item"
  tree_entry_interesting(): give meaningful names to return values
  read_directory_recursive: reduce one indentation level
  get_tree_entry(): do not call find_tree_entry() on an empty tree
  tree-walk.c: do not leak internal structure in tree_entry_len()
2011-12-13 22:02:51 -08:00
Junio C Hamano
cddec4f8ae Merge branch 'jc/maint-pack-object-cycle'
* jc/maint-pack-object-cycle:
  pack-object: tolerate broken packs that have duplicated objects

Conflicts:
	builtin/pack-objects.c
2011-12-05 15:19:34 -08:00
Junio C Hamano
62cdb6b23a Merge branch 'nd/misc-cleanups'
* nd/misc-cleanups:
  unpack_object_header_buffer(): clear the size field upon error
  tree_entry_interesting: make use of local pointer "item"
  tree_entry_interesting(): give meaningful names to return values
  read_directory_recursive: reduce one indentation level
  get_tree_entry(): do not call find_tree_entry() on an empty tree
  tree-walk.c: do not leak internal structure in tree_entry_len()
2011-12-05 15:10:20 -08:00
Junio C Hamano
568508e765 bulk-checkin: replace fast-import based implementation
This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-12-01 11:46:09 -08:00
Junio C Hamano
f63c79dbc8 pack-object: tolerate broken packs that have duplicated objects
When --reuse-delta is in effect (which is the default), and an existing
pack in the repository has the same object registered twice (e.g. one copy
in a non-delta format and the other copy in a delta against some other
object), an attempt to repack the repository can result in a cyclic delta
dependency, causing write_one() function to infinitely recurse into
itself.

Detect such a case and break the loopy dependency by writing out an object
that is involved in such a loop in the non-delta format.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-11-16 22:06:08 -08:00
Junio C Hamano
84a9ea90e1 Merge branch 'dm/pack-objects-update'
* dm/pack-objects-update:
  pack-objects: don't traverse objects unnecessarily
  pack-objects: rewrite add_descendants_to_write_order() iteratively
  pack-objects: use unsigned int for counter and offset values
  pack-objects: mark add_to_write_order() as inline
2011-11-01 15:20:07 -07:00
Junio C Hamano
0e990530ae finish_tmp_packfile(): a helper function
Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c.

This changes the order of finishing multi-pack generation slightly. The
code used to

 - adjust shared perm of temporary packfile
 - rename temporary packfile to the final name
 - update mtime of the packfile under the final name
 - adjust shared perm of temporary idxfile
 - rename temporary idxfile to the final name

but because the helper does not want to do the mtime thing, the updated
code does that step first and then all the rest.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-28 12:34:09 -07:00
Junio C Hamano
cdf9db3c83 create_tmp_packfile(): a helper function
Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-28 11:52:14 -07:00
Junio C Hamano
c0ad465725 write_pack_header(): a helper function
Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-28 11:40:48 -07:00
Nguyễn Thái Ngọc Duy
0de1633783 tree-walk.c: do not leak internal structure in tree_entry_len()
tree_entry_len() does not simply take two random arguments and return
a tree length. The two pointers must point to a tree item structure,
or struct name_entry. Passing random pointers will return incorrect
value.

Force callers to pass struct name_entry instead of two pointers (with
hope that they don't manually construct struct name_entry themselves)

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-27 11:08:26 -07:00
Junio C Hamano
2070950633 Merge branch 'jk/maint-pack-objects-compete-with-delete'
* jk/maint-pack-objects-compete-with-delete:
  downgrade "packfile cannot be accessed" errors to warnings
  pack-objects: protect against disappearing packs
2011-10-21 16:04:33 -07:00
Dan McGee
38d4debb6d pack-objects: don't traverse objects unnecessarily
This brings back some of the performance lost in optimizing recency
order inside pack objects. We were doing extreme amounts of object
re-traversal: for the 2.14 million objects in the Linux kernel
repository, we were calling add_to_write_order() over 1.03 billion times
(a 0.2% hit rate, making 99.8% of of these calls extraneous).

Two optimizations take place here- we can start our objects array
iteration from a known point where we left off before we started trying
to find our tags, and we don't need to do the deep dives required by
add_family_to_write_order() if the object has already been marked as
filled.

These two optimizations bring some pretty spectacular results via `perf
stat`:

task-clock:   83373 ms        --> 43800 ms         (50% faster)
cycles:       221,633,461,676 --> 116,307,209,986  (47% fewer)
instructions: 149,299,179,939 --> 122,998,800,184  (18% fewer)

Helped-by: Ramsay Jones (format string fix in "die" message)
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-20 17:17:49 -07:00
Dan McGee
f380872f0a pack-objects: rewrite add_descendants_to_write_order() iteratively
This removes the need to call this function recursively, shinking the
code size slightly and netting a small performance increase.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-18 00:16:32 -07:00
Dan McGee
92bef1a14a pack-objects: use unsigned int for counter and offset values
This is done in some of the new pack layout code introduced in commit
1b4bb16b9e. This more closely matches the nr_objects global that is
unsigned that these variables are based off of and bounded by.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-18 00:16:32 -07:00
Dan McGee
be12681896 pack-objects: mark add_to_write_order() as inline
This function is a whole 26 bytes when compiled on x86_64, but is
currently invoked over 1.037 billion times when running pack-objects on
the Linux kernel git repository. This is hitting the point where
micro-optimizations do make a difference, and inlining it only increases
the object file size by 38 bytes.

As reported by perf, this dropped task-clock from 84183 to 83373 ms, and
total cycles from 223.5 billion to 221.6 billion. Not astronomical, but
worth getting for adding one word.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-18 00:16:31 -07:00
Jeff King
58a6a9cc43 downgrade "packfile cannot be accessed" errors to warnings
These can happen if another process simultaneously prunes a
pack. But that is not usually an error condition, because a
properly-running prune should have repacked the object into
a new pack. So we will notice that the pack has disappeared
unexpectedly, print a message, try other packs (possibly
after re-scanning the list of packs), and find it in the new
pack.

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-14 11:43:09 -07:00
Jeff King
4c08018204 pack-objects: protect against disappearing packs
It's possible that while pack-objects is running, a
simultaneously running prune process might delete a pack
that we are interested in. Because we load the pack indices
early on, we know that the pack contains our item, but by
the time we try to open and map it, it is gone.

Since c715f78, we already protect against this in the normal
object access code path, but pack-objects accesses the packs
at a lower level.  In the normal access path, we call
find_pack_entry, which will call find_pack_entry_one on each
pack index, which does the actual lookup. If it gets a hit,
we will actually open and verify the validity of the
matching packfile (using c715f78's is_pack_valid). If we
can't open it, we'll issue a warning and pretend that we
didn't find it, causing us to go on to the next pack (or on
to loose objects).

Furthermore, we will cache the descriptor to the opened
packfile. Which means that later, when we actually try to
access the object, we are likely to still have that packfile
opened, and won't care if it has been unlinked from the
filesystem.

Notice the "likely" above. If there is another pack access
in the interim, and we run out of descriptors, we could
close the pack. And then a later attempt to access the
closed pack could fail (we'll try to re-open it, of course,
but it may have been deleted). In practice, this doesn't
happen because we tend to look up items and then access them
immediately.

Pack-objects does not follow this code path. Instead, it
accesses the packs at a much lower level, using
find_pack_entry_one directly. This means we skip the
is_pack_valid check, and may end up with the name of a
packfile, but no open descriptor.

We can add the same is_pack_valid check here. Unfortunately,
the access patterns of pack-objects are not quite as nice
for keeping lookup and object access together. We look up
each object as we find out about it, and the only later when
writing the packfile do we necessarily access it. Which
means that the opened packfile may be closed in the interim.

In practice, however, adding this check still has value, for
three reasons.

  1. If you have a reasonable number of packs and/or a
     reasonable file descriptor limit, you can keep all of
     your packs open simultaneously. If this is the case,
     then the race is impossible to trigger.

  2. Even if you can't keep all packs open at once, you
     may end up keeping the deleted one open (i.e., you may
     get lucky).

  3. The race window is shortened. You may notice early that
     the pack is gone, and not try to access it. Triggering
     the problem without this check means deleting the pack
     any time after we read the list of index files, but
     before we access the looked-up objects.  Triggering it
     with this check means deleting the pack means deleting
     the pack after we do a lookup (and successfully access
     the packfile), but before we access the object. Which
     is a smaller window.

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-14 11:42:37 -07:00
Junio C Hamano
2e2e7e9dd0 Merge branch 'jc/fetch-verify'
* jc/fetch-verify:
  fetch: verify we have everything we need before updating our ref
  rev-list --verify-object
  list-objects: pass callback data to show_objects()
2011-10-05 12:36:20 -07:00
Junio C Hamano
4947367267 list-objects: pass callback data to show_objects()
The traverse_commit_list() API takes two callback functions, one to show
commit objects, and the other to show other kinds of objects. Even though
the former has a callback data parameter, so that the callback does not
have to rely on global state, the latter does not.

Give the show_objects() callback the same callback data parameter.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-09-01 15:46:12 -07:00
Junio C Hamano
324b6b1678 Merge branch 'mh/check-attr-relative'
* mh/check-attr-relative: (29 commits)
  test-path-utils: Add subcommand "prefix_path"
  test-path-utils: Add subcommand "absolute_path"
  git-check-attr: Normalize paths
  git-check-attr: Demonstrate problems with relative paths
  git-check-attr: Demonstrate problems with unnormalized paths
  git-check-attr: test that no output is written to stderr
  Rename git_checkattr() to git_check_attr()
  git-check-attr: Fix command-line handling to match docs
  git-check-attr: Drive two tests using the same raw data
  git-check-attr: Add an --all option to show all attributes
  git-check-attr: Error out if no pathnames are specified
  git-check-attr: Process command-line args more systematically
  git-check-attr: Handle each error separately
  git-check-attr: Extract a function error_with_usage()
  git-check-attr: Introduce a new variable
  git-check-attr: Extract a function output_attr()
  Allow querying all attributes on a file
  Remove redundant check
  Remove redundant call to bootstrap_attr_stack()
  Extract a function collect_all_attrs()
  ...
2011-08-17 17:36:22 -07:00