93 Commits

Author SHA1 Message Date
Junio C Hamano
ab7cd7bb8c pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series.  This may become necessary if
repeated repacking makes delta chain too long.  With this, the
output of the command becomes identical to that of the older
implementation.  But the performance suffers greatly.

It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.

It also adds a couple more statistics to the output, while
squelching them under the -q flag, which the last round forgot to do.

  $ time old-git-pack-objects --stdout >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects....................
  real    12m8.530s       user    11m1.450s       sys     0m57.920s
  $ time git-pack-objects --stdout >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects.....................
  Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
  real    0m59.549s       user    0m56.670s       sys     0m2.400s
  $ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects.....................
  Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
  real    11m13.830s      user    9m45.240s       sys     0m44.330s

There is one remaining issue when --no-reuse-delta option is not
used.  It can create delta chains that are deeper than specified.

    A<--B<--C<--D   E   F   G

Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.

B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:

    E<--F<--G<--A

Oops.  We ended up making D a bit too deep, didn't we?  B, C and
D form a chain on top of A!

This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta.  Unfortunately, deferring the decision until just before
the deltification is not an option.  To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.

To prevent this from happening, we should keep A from being
deltified.  But how would we tell that, cheaply?

To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3).  Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.

However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-22 13:14:57 -08:00
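The arithmetic behind the "depth/4" cheat in the commit above can be sketched as follows. This is a minimal illustration, not the code from pack-objects; the helper names `depth_limit` and `tip_depth` are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper (not git's actual code): the deepest position at
 * which an object may itself be stored as a delta.  An object that is
 * already the base of a reused delta gets only a quarter of the budget,
 * which leaves room for the chain that hangs off it.
 */
int depth_limit(int max_depth, int is_reused_base)
{
	assert(max_depth >= 0);
	return is_reused_base ? max_depth / 4 : max_depth;
}

/* Final depth of the tip of a reused chain once its base is placed. */
int tip_depth(int base_depth, int reused_chain_len)
{
	assert(base_depth >= 0 && reused_chain_len >= 0);
	return base_depth + reused_chain_len;
}
```

In the E<--F<--G<--A example, A lands at depth 3 and D, three reused deltas further out, ends up at depth 6. With a maximum depth of 10, the cheat would cap A at depth 2, so D could be no deeper than 5.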
Junio C Hamano
3f9ac8d259 pack-objects: reuse data from existing packs.
When generating a new pack, notice if the objects we need are
already in existing packs.  If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.

Also, notice if what we are going to write out exactly matches
what is already in an existing pack (either deltified or just
compressed).  In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.

Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:

    $ git-rev-list --objects v2.6.16-rc3 >RL
    $ wc -l RL
    184141 RL
    $ time git-pack-objects p <RL
    Generating pack...
    Done counting 184141 objects.
    Packing 184141 objects....................
    a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2

    real    12m4.323s
    user    11m2.560s
    sys     0m55.950s

With this patch, the same input:

    $ time ../git.junio/git-pack-objects q <RL
    Generating pack...
    Done counting 184141 objects.
    Packing 184141 objects.....................
    a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
    Total 184141, written 184141, reused 182441

    real    1m2.608s
    user    0m55.090s
    sys     0m1.830s

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-22 13:14:56 -08:00
Junio C Hamano
7a979d99ba Thin pack - create packfile with missing delta base.
This goes together with "rev-list --object-edge" change, to feed
pack-objects list of edge commits in addition to the usual
object list.  Upon seeing such list, pack-objects loosens the
usual "self contained delta" constraints, and can produce delta
against blobs and trees contained in the edge commits without
storing the delta base objects themselves.

The resulting packfile is not usable in .git/objects/pack, but
is a good way to implement "delta-only" transfer.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-19 22:27:39 -08:00
Junio C Hamano
e4c9327a77 pack-objects: avoid delta chains that are too long.
This tries to rework the solution for the excess delta chain
problem. An earlier commit worked around it ``cheaply'', but
repeated repacking risks unbounded growth of delta chains.

This version counts the length of delta chain we are reusing
from the existing pack, and makes sure a base object that has
sufficiently long delta chain does not get deltified.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-17 21:48:48 -08:00
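One way to count the reused chain length, as the commit above describes, might look like the sketch below. The struct and field names (`entry`, `reused_base`, `chain_below`) are assumptions made for illustration and are not taken from the pack-objects source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-object bookkeeping; field names are illustrative. */
struct entry {
	struct entry *reused_base; /* base of the delta we keep as-is, or NULL */
	int chain_below;           /* longest reused chain hanging off this object */
};

/*
 * Walk from a reused delta up to its root base and record, at every
 * base on the way, how many reused deltas sit on top of it.
 */
void note_reused_chain(struct entry *delta)
{
	int len = 1;
	struct entry *base;

	assert(delta != NULL);
	for (base = delta->reused_base; base; base = base->reused_base) {
		if (base->chain_below < len)
			base->chain_below = len;
		len++;
	}
}

/*
 * A base may itself be deltified only this deep, so that the tip of
 * its reused chain still fits within max_depth.
 */
int allowed_depth(const struct entry *e, int max_depth)
{
	int room = max_depth - e->chain_below;
	return room > 0 ? room : 0;
}
```

For the chain A<--B<--C<--D, calling `note_reused_chain()` on D marks A with 3, so with a maximum depth of 10 A may be placed no deeper than 7. A base whose reused chain already uses up the whole budget gets 0 and is not deltified at all.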
Junio C Hamano
ca5381d43e pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series.  This may become necessary if
repeated repacking makes delta chain too long.  With this, the
output of the command becomes identical to that of the older
implementation.  But the performance suffers greatly.

It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.

It also adds a couple more statistics to the output, while
squelching them under the -q flag, which the last round forgot to do.

  $ time old-git-pack-objects --stdout >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects....................
  real    12m8.530s       user    11m1.450s       sys     0m57.920s
  $ time git-pack-objects --stdout >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects.....................
  Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
  real    0m59.549s       user    0m56.670s       sys     0m2.400s
  $ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
  Generating pack...
  Done counting 184141 objects.
  Packing 184141 objects.....................
  Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
  real    11m13.830s      user    9m45.240s       sys     0m44.330s

There is one remaining issue when --no-reuse-delta option is not
used.  It can create delta chains that are deeper than specified.

    A<--B<--C<--D   E   F   G

Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.

B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:

    E<--F<--G<--A

Oops.  We ended up making D a bit too deep, didn't we?  B, C and
D form a chain on top of A!

This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta.  Unfortunately, deferring the decision until just before
the deltification is not an option.  To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.

To prevent this from happening, we should keep A from being
deltified.  But how would we tell that, cheaply?

To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3).  Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.

However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-17 02:11:38 -08:00
Junio C Hamano
a49dd05fd0 pack-objects: reuse data from existing packs.
When generating a new pack, notice if the objects we need are
already in existing packs.  If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.

Also, notice if what we are going to write out exactly matches
what is already in an existing pack (either deltified or just
compressed).  In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.

Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:

    $ git-rev-list --objects v2.6.16-rc3 >RL
    $ wc -l RL
    184141 RL
    $ time git-pack-objects p <RL
    Generating pack...
    Done counting 184141 objects.
    Packing 184141 objects....................
    a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2

    real    12m4.323s
    user    11m2.560s
    sys     0m55.950s

With this patch, the same input:

    $ time ../git.junio/git-pack-objects q <RL
    Generating pack...
    Done counting 184141 objects.
    Packing 184141 objects.....................
    a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
    Total 184141, written 184141, reused 182441

    real    1m2.608s
    user    0m55.090s
    sys     0m1.830s

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-17 02:11:38 -08:00
Junio C Hamano
024701f1d8 Make pack-objects chattier.
You could give -q to squelch it, but currently no tool does it.
This would make 'git clone host:repo here' over ssh not silent
again.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-12 13:01:54 -08:00
Junio C Hamano
21fcd1bdea fetch-clone progress: finishing touches.
This makes fetch-pack also report the progress of packing part.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-11 17:54:18 -08:00
Junio C Hamano
82f9d58a39 code comments: spell
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-29 01:32:56 -08:00
Nikolai Weibull
63ae26f87a Document the --non-empty command-line option to git-pack-objects.
This provides (minimal) documentation for the --non-empty command-line
option to the pack-objects command.

Signed-off-by: Nikolai Weibull <nikolai@bitwi.se>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-08 15:50:13 -08:00
Junio C Hamano
53228a5fb8 Make the rest of commands work from a subdirectory.
These commands are converted to run from a subdirectory.

    commit-tree convert-objects merge-base merge-index mktag
    pack-objects pack-redundant prune-packed read-tree tar-tree
    unpack-file unpack-objects update-server-info write-tree

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-11-28 23:13:02 -08:00
Linus Torvalds
ef07618fdd git-repack: Properly abort in corrupt repository
In a corrupt repository, git-repack produces a pack that does not
contain needed objects without complaining, and the result of this
combined with -d flag can be very painful -- e.g. a lossage of one
tree object can lead to lossage of blobs reachable only through that
tree.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-11-21 14:08:49 -08:00
Junio C Hamano
f3123c4ab3 pack-objects: Allow use of pre-generated pack.
git-pack-objects can reuse pack files stored in $GIT_DIR/pack-cache
directory, when a necessary pack is found.  This is hopefully useful
when upload-pack (called from git-daemon) is expected to receive
requests for the same set of objects many times (e.g. full cloning
request of any project, or updates from the set of heads previous day
to the latest for a slow moving project).

Currently git-pack-objects does *not* keep pack files it creates for
reusing.  It might be useful to add --update-cache option to it,
which would allow it to store pack files it created in the pack-cache
directory, and prune rarely used ones from it.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-10-26 12:37:49 -07:00
Linus Torvalds
4546738b58 Unlocalized isspace and friends
Do our own ctype.h, just to get the sane semantics: we want
locale-independence, _and_ we want the right signed behaviour. Plus we
only use a very small subset of ctype.h anyway (isspace, isalpha,
isdigit and isalnum).

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-10-14 17:17:27 -07:00
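A locale-independent replacement for the four ctype.h helpers named in the commit above could be sketched like this. The function names and the exact set of characters treated as whitespace are assumptions, and the code assumes an ASCII-compatible character set; it is not the table-driven version from the git source.

```c
#include <assert.h>

/*
 * Illustrative, locale-independent character tests.  Converting to
 * unsigned char first gives well-defined results even when plain
 * char is signed and the byte has its high bit set.
 */
int sane_isspace(int c)
{
	unsigned char x = (unsigned char)c;
	return x == ' ' || x == '\t' || x == '\n' || x == '\r';
}

int sane_isdigit(int c)
{
	unsigned char x = (unsigned char)c;
	return x >= '0' && x <= '9';
}

int sane_isalpha(int c)
{
	/* Fold ASCII upper case onto lower case before the range check. */
	unsigned char x = (unsigned char)c | 0x20;
	return x >= 'a' && x <= 'z';
}

int sane_isalnum(int c)
{
	return sane_isalpha(c) || sane_isdigit(c);
}
```

Because nothing here consults the current locale, a byte such as 0xE9 is never reported as alphabetic, whatever LC_CTYPE is set to.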
Linus Torvalds
64560374cc Add support for "local" packing
This adds the "--local" flag to git-pack-objects, which acts like
"--incremental", except that instead of ignoring all packed objects, it
only ignores objects that are packed and in an alternate object tree.

As a result, it effectively only does a local re-pack: any remote-packed
objects will stay in the alternate object directories.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-10-13 15:38:28 -07:00
Junio C Hamano
84c8d8aec5 Fix packname hash generation.
This changes how the hash that packfiles carry in their names is generated, from
"hash of object names as fed to us" to "hash of object names in the
resulting pack, in the order they appear in the index file".  The new
"git-index-pack" command is taught to output the computed hash value
to its standard output.

With this, we can store downloaded pack in a temporary file without
knowing its final name, run git-index-pack to generate idx for it
while finding out its final name, and then rename the pack and idx to
their final names.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-10-12 18:32:02 -07:00
Sergey Vlasov
adee7bdf50 [PATCH] Plug memory leak in git-pack-objects
find_deltas() should free its temporary objects before returning.

[jc: Sergey, if you have [PATCH] title on the Subject line of your
e-mail, please do not repeat it on the first line in your message
body.  Thanks.]

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-08-08 22:51:46 -07:00
Linus Torvalds
5f3de58ff8 Make the name of a pack-file depend on the objects packed there-in.
This means that the .git/objects/pack directory is also rsync'able,
since the filenames created there-in are either unique or refer to the
same data.

Otherwise you might not be able to pull from a directory that is partly
packed without having to worry about missing objects due to pack-file
name clashes.
2005-07-03 15:34:04 -07:00
Linus Torvalds
1c4a291202 Add "--non-empty" flag to git-pack-objects
It skips writing the pack-file if it ends up being empty.
2005-07-03 13:36:58 -07:00
Linus Torvalds
eb019375ab Add "--incremental" flag to git-pack-objects
It won't add an object that is already in a pack to the new pack.
2005-07-03 13:08:40 -07:00
Nicolas Pitre
dcde55bc58 [PATCH] assorted delta code cleanup
This is a wrap-up patch including all the cleanups I've done to the
delta code and its usage.  The most important change is the
factorization of the delta header handling code.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-29 09:11:38 -07:00
Linus Torvalds
01247d8742 Make git pack files use little-endian size encoding
This makes it match the new delta encoding, and admittedly makes the
code easier to follow.

This also updates the PACK file version to 2, since this (and the delta
encoding change in the previous commit) are incompatible with the old
format.
2005-06-28 22:15:57 -07:00
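The little-endian size encoding mentioned in the commit above can be sketched as follows. The layout shown is the per-object header of pack version 2 as it is documented today (type in bits 4 to 6 of the first byte, size in 4 + 7 + 7... bit groups, least significant first). That the code at this exact commit matched it byte for byte is an assumption, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Encode (type, size) into out; returns the number of bytes written. */
size_t encode_header(unsigned type, unsigned long size, unsigned char *out)
{
	size_t n = 0;
	unsigned char c;

	assert(type >= 1 && type <= 7);
	c = (unsigned char)((type << 4) | (size & 0x0f));
	size >>= 4;
	while (size) {
		out[n++] = c | 0x80;            /* high bit: more bytes follow */
		c = (unsigned char)(size & 0x7f);
		size >>= 7;
	}
	out[n++] = c;
	return n;
}

/* Decode a header written by encode_header(); returns bytes consumed. */
size_t decode_header(const unsigned char *in, unsigned *type,
		     unsigned long *size)
{
	size_t n = 0;
	unsigned shift = 4;
	unsigned char c = in[n++];

	*type = (c >> 4) & 7;
	*size = c & 0x0f;
	while (c & 0x80) {
		c = in[n++];
		*size |= (unsigned long)(c & 0x7f) << shift;
		shift += 7;
	}
	return n;
}
```

Small objects (under 16 bytes) need a single header byte, and each further byte adds 7 bits of size, so the low-order bits always come first.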
Junio C Hamano
9d5ab9625d [PATCH] Emit base objects of a delta chain when the delta is output.
Deltas are useless by themselves and when you use them you need to get
to their base objects.  A base object should inherit recency from the
most recent deltified object that is based on it and that is what this
patch teaches git-pack-objects.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 20:37:42 -07:00
Junio C Hamano
e1ddc97684 [PATCH] Fix unpack-objects for header length information.
Standalone unpack-objects command was not adjusted for header length
encoding change when dealing with deltified entry.  This fixes it.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 17:12:18 -07:00
Linus Torvalds
a733cb606f Change pack file format. Hopefully for the last time.
This also adds a header with a signature, version info, and the number
of objects to the pack file.  It also encodes the file length and type
more efficiently.
2005-06-28 14:21:02 -07:00
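The header described in the commit above (signature, version, number of objects) can be sketched as a 12-byte record. The layout below follows the pack format as documented today, with the "PACK" signature followed by two 32-bit big-endian fields. Whether the header written by this particular commit was identical is an assumption, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PACK_HEADER_LEN 12

/* Store and load a 32-bit value in big-endian (network) byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

void write_pack_header(unsigned char *out, uint32_t version,
		       uint32_t nr_objects)
{
	assert(out != NULL);
	memcpy(out, "PACK", 4);
	put_be32(out + 4, version);
	put_be32(out + 8, nr_objects);
}

/* Returns 0 on success, -1 if the signature does not match. */
int read_pack_header(const unsigned char *in, uint32_t *version,
		     uint32_t *nr_objects)
{
	if (memcmp(in, "PACK", 4) != 0)
		return -1;
	*version = get_be32(in + 4);
	*nr_objects = get_be32(in + 8);
	return 0;
}
```

Knowing the object count up front lets a reader size its tables and report progress before it has seen any object data.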
Linus Torvalds
d22b9290ab git-pack-objects: add "--stdout" flag to write the pack file to stdout
This also suppresses creation of the index file.
2005-06-28 11:10:48 -07:00
Linus Torvalds
a69d094366 Teach packing about "tag" objects
(And teach sha1_file and unpack-object how to unpack them too, of
course)
2005-06-28 09:58:23 -07:00
Junio C Hamano
36e4d74a21 [PATCH] Enhance sha1_file_size() into sha1_object_info()
This lets us eliminate one use of map_sha1_file() outside
sha1_file.c, to bring us one step closer to the packed GIT.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:27:51 -07:00
Junio C Hamano
c4584ae3fd [PATCH] Remove "delta" object representation.
Packed delta files created by git-pack-objects seem to be the
way to go, and existing "delta" object handling code has exposed
the object representation details to too many places.  Remove it
while we refactor code to come up with a proper interface in
sha1_file.c.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:27:51 -07:00
Linus Torvalds
e18088451d csum-file interface updates: return resulting SHA1
Also, make the writing of the SHA1 as an end-header be conditional: not
every user will necessarily want to write the SHA1 to the file itself,
even though current users do (but we might end up using the same helper
functions for the object files themselves, that don't do this).

This also makes the packed index file contain the SHA1 of the packed
data file at the end (just before its own SHA1).  That way you can
validate the pairing of the two if you want to.
2005-06-26 22:01:46 -07:00
Linus Torvalds
c38138cd78 git-pack-objects: write the pack files with a SHA1 csum
We want to be able to check their integrity later, and putting the
sha1-sum of the contents at the end is a good thing.  The writing
routines are generic, so we could try to re-use them for the index file,
instead of having the same logic duplicated.

Update unpack-objects to know about the extra 20 bytes at the end
of the index.
2005-06-26 20:27:56 -07:00
Linus Torvalds
27225f2e87 git-pack-objects: use name information (if any) to sort objects for packing.
This is incredibly cheesy. But it's cheap, and it works pretty well.
2005-06-26 15:27:28 -07:00
Linus Torvalds
521a4f4cf4 git-pack-objects: do the delta search in reverse size order
Starting from big objects and going backwards means that we end up
picking a delta that goes from a bigger object to a smaller one.  That's
advantageous for two reasons: the bigger object is likely the newer one
(since things tend to grow, rather than shrink), and doing a delete
tends to be smaller than doing an add.

So the deltas don't tend to be top-of-tree, and the packed end result is
just slightly smaller.
2005-06-26 13:43:41 -07:00
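The ordering described in the commit above can be sketched with a qsort() comparator that puts the biggest objects first. This is a simplified illustration that sorts on size alone; the struct `obj` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative object record; only the size matters for this sketch. */
struct obj {
	unsigned long size;
	const char *name;
};

/* Order larger objects before smaller ones. */
static int by_size_desc(const void *a, const void *b)
{
	const struct obj *x = a;
	const struct obj *y = b;

	if (x->size < y->size)
		return 1;
	if (x->size > y->size)
		return -1;
	return 0;
}

void sort_for_delta_search(struct obj *objs, size_t n)
{
	assert(objs != NULL || n == 0);
	if (n > 1)
		qsort(objs, n, sizeof(*objs), by_size_desc);
}
```

Walking the sorted array front to back, each object is compared against ones that are at least as big, so a delta found this way goes from a bigger object to a smaller one.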
Linus Torvalds
c4fb06c0d0 Fix object packing/unpacking.
This actually successfully packed and unpacked a git archive down to
1.3MB (17MB unpacked).

Right now unpacking is way too noisy, lots of debug messages left.
2005-06-26 08:40:08 -07:00
Junio C Hamano
8ee378a0f0 [PATCH] Finish initial cut of git-pack-object/git-unpack-object pair.
This finishes the initial round of git-pack-object /
git-unpack-object pair.  They are now good enough to be used as
a transport medium:

 - Fix delta direction in pack-objects; the original was
   computing delta to create the base object from the object to
   be squashed, which was quite unfriendly for unpacker ;-).

 - Add a script to test the very basics.

 - Implement unpacker for both regular and deltified objects.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-26 07:33:23 -07:00
Linus Torvalds
d116a45a9a Add "--depth=N" parameter to git-pack-objects to limit maximum delta depth
It too defaults to 10. A nice round random number.
2005-06-25 20:17:59 -07:00
Linus Torvalds
f846bbff15 git-pack-objects: make "--window=x" semantics more logical.
A zero disables delta generation (like before), but we make the window
be one bigger than specified, since we use one entry for the one to be
tested (it used to be that "--window=1" was meaningless, since we'd have
used up the single-entry window with the entry to be tested, and had no
chance of actually ever finding a delta).

The default window remains at 10, but now it really means "test the 10
closest objects", not "test the 9 closest objects".
2005-06-25 19:35:47 -07:00
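The "--window=x" semantics in the commit above amount to a small piece of arithmetic, sketched below. The helper names are hypothetical and the code is an illustration of the description, not the loop in pack-objects.

```c
#include <assert.h>

/*
 * Number of slots the sliding window needs: one for the object being
 * tested plus `window` candidates.  Zero disables delta search.
 */
int window_slots(int window)
{
	assert(window >= 0);
	return window > 0 ? window + 1 : 0;
}

/*
 * How many earlier objects the object at position idx (0-based, in
 * sorted order) gets compared against.
 */
int candidates_for(int idx, int window)
{
	assert(idx >= 0 && window >= 0);
	return idx < window ? idx : window;
}
```

With the default of 10, the window holds 11 entries and an object far enough into the list is tested against exactly 10 neighbours. With "--window=1" it is tested against one, which under the old semantics would have been none.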
Linus Torvalds
75c42d8cc3 Add a "max_size" parameter to diff_delta()
Anything that generates a delta to see if two objects are close usually
isn't interested in the delta if it ends up being bigger than some
specified size, and this allows us to stop delta generation early when
that happens.
2005-06-25 19:30:20 -07:00
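The early exit that a "max_size" parameter allows can be illustrated with a toy cost function. This is a stand-in and NOT diff_delta(): it charges one byte per mismatching position instead of building a real delta, and treating a max_size of 0 as "no limit" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for a delta generator.  Returns the "delta size", or -1
 * as soon as it is known to exceed max_size (0 means unlimited).
 */
long toy_delta_size(const unsigned char *src, size_t src_len,
		    const unsigned char *dst, size_t dst_len,
		    size_t max_size)
{
	size_t cost = 0;
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < dst_len; i++) {
		if (i >= src_len || src[i] != dst[i])
			cost++;
		if (max_size && cost > max_size)
			return -1; /* already too big: stop early */
	}
	return (long)cost;
}
```

A caller that already holds a delta of some size can pass that size as the limit, so that any worse candidate is abandoned partway through instead of being computed in full.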
Linus Torvalds
78817c15de Fix delta "sliding window" code
When Junio fixed the lack of a successful error code from try_delta(),
that uncovered an off-by-one error in the caller.

Also, some testing made it clear that we now find a lot more deltas,
because we used to (incorrectly) break early on bogus "failure"
cases.
2005-06-25 18:29:23 -07:00
Junio C Hamano
eb41ab11e8 [PATCH] (patchlet) pack-objects.c: try_delta()
Return value of try_delta is checked for negativeness, but the
success path does not return anything, letting compiler warn and
presumably return garbage.

Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 18:12:07 -07:00
Linus Torvalds
d38c3721a1 git-pack-objects: mark the delta packing with a 'D'.
When writing a delta, we take the real type from the object we're
doing the delta against, and just write a 'D' as the type of the
current object.
2005-06-25 15:58:42 -07:00
Linus Torvalds
49397104f2 git-pack-objects: fix typo
("<" should be "=")
2005-06-25 15:24:30 -07:00
Linus Torvalds
c323ac7d9c git-pack-objects: create a packed object representation.
This is kind of like a tar-ball for a set of objects, ready to be
shipped off to another end.  Alternatively, you could use it as a packed
representation of the object database directly, if you changed
"read_sha1_file()" to read these kinds of packs.

The latter is particularly useful to generate a "packed history", i.e. you
could pack up your old history efficiently, but still have it available
(at a performance hit, of course).

I haven't actually written an unpacker yet, so the end result has not
been verified in any way yet.  I obviously always write bug-free code,
so it just has to work, no?
2005-06-25 14:42:43 -07:00