2005-06-25 23:42:43 +02:00
|
|
|
#include "cache.h"
|
|
|
|
#include "object.h"
|
|
|
|
#include "delta.h"
|
2005-06-28 23:21:02 +02:00
|
|
|
#include "pack.h"
|
2005-06-27 05:27:56 +02:00
|
|
|
#include "csum-file.h"
|
2006-02-12 02:54:18 +01:00
|
|
|
#include <sys/time.h>
|
2005-06-25 23:42:43 +02:00
|
|
|
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
static const char pack_usage[] = "git-pack-objects [-q] [--no-reuse-delta] [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list";
|
2005-06-25 23:42:43 +02:00
|
|
|
|
|
|
|
struct object_entry {
|
|
|
|
unsigned char sha1[20];
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
unsigned long size; /* uncompressed size */
|
2006-02-18 05:58:45 +01:00
|
|
|
unsigned long offset; /* offset into the final pack file;
|
|
|
|
* nonzero if already written.
|
|
|
|
*/
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
unsigned int depth; /* delta depth */
|
2006-02-18 05:58:45 +01:00
|
|
|
unsigned int delta_limit; /* base adjustment for in-pack delta */
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
unsigned int hash; /* name hint hash */
|
2005-06-28 23:21:02 +02:00
|
|
|
enum object_type type;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
enum object_type in_pack_type; /* could be delta */
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
unsigned long delta_size; /* delta data size (uncompressed) */
|
|
|
|
struct object_entry *delta; /* delta base object */
|
|
|
|
struct packed_git *in_pack; /* already in pack */
|
|
|
|
unsigned int in_pack_offset;
|
2006-02-18 05:58:45 +01:00
|
|
|
struct object_entry *delta_child; /* delitified objects who bases me */
|
|
|
|
struct object_entry *delta_sibling; /* other deltified objects who
|
|
|
|
* uses the same base as me
|
|
|
|
*/
|
2005-06-25 23:42:43 +02:00
|
|
|
};
|
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
/*
|
|
|
|
* Objects we are going to pack are colected in objects array (dynamically
|
|
|
|
* expanded). nr_objects & nr_alloc controls this array. They are stored
|
|
|
|
* in the order we see -- typically rev-list --objects order that gives us
|
|
|
|
* nice "minimum seek" order.
|
|
|
|
*
|
|
|
|
* sorted-by-sha ans sorted-by-type are arrays of pointers that point at
|
|
|
|
* elements in the objects array. The former is used to build the pack
|
|
|
|
* index (lists object names in the ascending order to help offset lookup),
|
|
|
|
* and the latter is used to group similar things together by try_delta()
|
|
|
|
* heuristics.
|
|
|
|
*/
|
|
|
|
|
2005-07-04 00:34:04 +02:00
|
|
|
static unsigned char object_list_sha1[20];
|
2005-07-03 22:36:58 +02:00
|
|
|
static int non_empty = 0;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
static int no_reuse_delta = 0;
|
2005-10-14 00:38:28 +02:00
|
|
|
static int local = 0;
|
2005-07-03 22:08:40 +02:00
|
|
|
static int incremental = 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
static struct object_entry **sorted_by_sha, **sorted_by_type;
|
|
|
|
static struct object_entry *objects = NULL;
|
|
|
|
static int nr_objects = 0, nr_alloc = 0;
|
|
|
|
static const char *base_name;
|
2005-06-27 07:01:46 +02:00
|
|
|
static unsigned char pack_file_sha1[20];
|
2006-02-12 22:01:54 +01:00
|
|
|
static int progress = 1;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
/*
|
|
|
|
* The object names in objects array are hashed with this hashtable,
|
|
|
|
* to help looking up the entry by object name. Binary search from
|
|
|
|
* sorted_by_sha is also possible but this was easier to code and faster.
|
|
|
|
* This hashtable is built after all the objects are seen.
|
|
|
|
*/
|
|
|
|
static int *object_ix = NULL;
|
|
|
|
static int object_ix_hashsz = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pack index for existing packs give us easy access to the offsets into
|
|
|
|
* corresponding pack file where each object's data starts, but the entries
|
|
|
|
* do not store the size of the compressed representation (uncompressed
|
|
|
|
* size is easily available by examining the pack entry header). We build
|
|
|
|
* a hashtable of existing packs (pack_revindex), and keep reverse index
|
|
|
|
* here -- pack index file is sorted by object name mapping to offset; this
|
|
|
|
* pack_revindex[].revindex array is an ordered list of offsets, so if you
|
|
|
|
* know the offset of an object, next offset is where its packed
|
|
|
|
* representation ends.
|
|
|
|
*/
|
|
|
|
struct pack_revindex {
|
|
|
|
struct packed_git *p;
|
|
|
|
unsigned long *revindex;
|
|
|
|
} *pack_revindex = NULL;
|
|
|
|
static int pack_revindex_hashsz = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* stats
|
|
|
|
*/
|
|
|
|
static int written = 0;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
static int written_delta = 0;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
static int reused = 0;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
static int reused_delta = 0;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
|
|
|
static int pack_revindex_ix(struct packed_git *p)
|
|
|
|
{
|
|
|
|
unsigned int ui = (unsigned int) p;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ui = ui ^ (ui >> 16); /* defeat structure alignment */
|
|
|
|
i = (int)(ui % pack_revindex_hashsz);
|
|
|
|
while (pack_revindex[i].p) {
|
|
|
|
if (pack_revindex[i].p == p)
|
|
|
|
return i;
|
|
|
|
if (++i == pack_revindex_hashsz)
|
|
|
|
i = 0;
|
|
|
|
}
|
|
|
|
return -1 - i;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void prepare_pack_ix(void)
|
|
|
|
{
|
|
|
|
int num;
|
|
|
|
struct packed_git *p;
|
|
|
|
for (num = 0, p = packed_git; p; p = p->next)
|
|
|
|
num++;
|
|
|
|
if (!num)
|
|
|
|
return;
|
|
|
|
pack_revindex_hashsz = num * 11;
|
|
|
|
pack_revindex = xcalloc(sizeof(*pack_revindex), pack_revindex_hashsz);
|
|
|
|
for (p = packed_git; p; p = p->next) {
|
|
|
|
num = pack_revindex_ix(p);
|
|
|
|
num = - 1 - num;
|
|
|
|
pack_revindex[num].p = p;
|
|
|
|
}
|
|
|
|
/* revindex elements are lazily initialized */
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cmp_offset(const void *a_, const void *b_)
|
|
|
|
{
|
|
|
|
unsigned long a = *(unsigned long *) a_;
|
|
|
|
unsigned long b = *(unsigned long *) b_;
|
|
|
|
if (a < b)
|
|
|
|
return -1;
|
|
|
|
else if (a == b)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ordered list of offsets of objects in the pack.
|
|
|
|
*/
|
|
|
|
static void prepare_pack_revindex(struct pack_revindex *rix)
|
|
|
|
{
|
|
|
|
struct packed_git *p = rix->p;
|
|
|
|
int num_ent = num_packed_objects(p);
|
|
|
|
int i;
|
|
|
|
void *index = p->index_base + 256;
|
|
|
|
|
|
|
|
rix->revindex = xmalloc(sizeof(unsigned long) * (num_ent + 1));
|
|
|
|
for (i = 0; i < num_ent; i++) {
|
|
|
|
long hl = *((long *)(index + 24 * i));
|
|
|
|
rix->revindex[i] = ntohl(hl);
|
|
|
|
}
|
|
|
|
/* This knows the pack format -- the 20-byte trailer
|
|
|
|
* follows immediately after the last object data.
|
|
|
|
*/
|
|
|
|
rix->revindex[num_ent] = p->pack_size - 20;
|
|
|
|
qsort(rix->revindex, num_ent, sizeof(unsigned long), cmp_offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long find_packed_object_size(struct packed_git *p,
|
|
|
|
unsigned long ofs)
|
|
|
|
{
|
|
|
|
int num;
|
|
|
|
int lo, hi;
|
|
|
|
struct pack_revindex *rix;
|
|
|
|
unsigned long *revindex;
|
|
|
|
num = pack_revindex_ix(p);
|
|
|
|
if (num < 0)
|
|
|
|
die("internal error: pack revindex uninitialized");
|
|
|
|
rix = &pack_revindex[num];
|
|
|
|
if (!rix->revindex)
|
|
|
|
prepare_pack_revindex(rix);
|
|
|
|
revindex = rix->revindex;
|
|
|
|
lo = 0;
|
|
|
|
hi = num_packed_objects(p) + 1;
|
|
|
|
do {
|
|
|
|
int mi = (lo + hi) / 2;
|
|
|
|
if (revindex[mi] == ofs) {
|
|
|
|
return revindex[mi+1] - ofs;
|
|
|
|
}
|
|
|
|
else if (ofs < revindex[mi])
|
|
|
|
hi = mi;
|
|
|
|
else
|
|
|
|
lo = mi + 1;
|
|
|
|
} while (lo < hi);
|
|
|
|
die("internal error: pack revindex corrupt");
|
|
|
|
}
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
static void *delta_against(void *buf, unsigned long size, struct object_entry *entry)
|
|
|
|
{
|
|
|
|
unsigned long othersize, delta_size;
|
|
|
|
char type[10];
|
|
|
|
void *otherbuf = read_sha1_file(entry->delta->sha1, type, &othersize);
|
|
|
|
void *delta_buf;
|
|
|
|
|
|
|
|
if (!otherbuf)
|
|
|
|
die("unable to read %s", sha1_to_hex(entry->delta->sha1));
|
2005-06-26 13:29:18 +02:00
|
|
|
delta_buf = diff_delta(otherbuf, othersize,
|
2005-06-29 08:49:56 +02:00
|
|
|
buf, size, &delta_size, 0);
|
2005-06-26 04:30:20 +02:00
|
|
|
if (!delta_buf || delta_size != entry->delta_size)
|
2005-06-25 23:42:43 +02:00
|
|
|
die("delta size changed");
|
|
|
|
free(buf);
|
|
|
|
free(otherbuf);
|
|
|
|
return delta_buf;
|
|
|
|
}
|
|
|
|
|
2005-06-28 23:21:02 +02:00
|
|
|
/*
|
|
|
|
* The per-object header is a pretty dense thing, which is
|
|
|
|
* - first byte: low four bits are "size", then three bits of "type",
|
|
|
|
* and the high bit is "size continues".
|
|
|
|
* - each byte afterwards: low seven bits are size continuation,
|
|
|
|
* with the high bit being "size continues"
|
|
|
|
*/
|
|
|
|
static int encode_header(enum object_type type, unsigned long size, unsigned char *hdr)
|
|
|
|
{
|
2005-06-29 07:15:57 +02:00
|
|
|
int n = 1;
|
2005-06-28 23:21:02 +02:00
|
|
|
unsigned char c;
|
|
|
|
|
|
|
|
if (type < OBJ_COMMIT || type > OBJ_DELTA)
|
|
|
|
die("bad type %d", type);
|
|
|
|
|
2005-06-29 07:15:57 +02:00
|
|
|
c = (type << 4) | (size & 15);
|
|
|
|
size >>= 4;
|
|
|
|
while (size) {
|
2005-06-28 23:21:02 +02:00
|
|
|
*hdr++ = c | 0x80;
|
2005-06-29 07:15:57 +02:00
|
|
|
c = size & 0x7f;
|
|
|
|
size >>= 7;
|
|
|
|
n++;
|
2005-06-28 23:21:02 +02:00
|
|
|
}
|
|
|
|
*hdr = c;
|
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
2005-06-27 05:27:56 +02:00
|
|
|
static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
|
2005-06-25 23:42:43 +02:00
|
|
|
{
|
|
|
|
unsigned long size;
|
|
|
|
char type[10];
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
void *buf;
|
2005-06-28 23:21:02 +02:00
|
|
|
unsigned char header[10];
|
2005-06-25 23:42:43 +02:00
|
|
|
unsigned hdrlen, datalen;
|
2005-06-28 23:21:02 +02:00
|
|
|
enum object_type obj_type;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
int to_reuse = 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
2005-06-28 23:21:02 +02:00
|
|
|
obj_type = entry->type;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (! entry->in_pack)
|
|
|
|
to_reuse = 0; /* can't reuse what we don't have */
|
|
|
|
else if (obj_type == OBJ_DELTA)
|
|
|
|
to_reuse = 1; /* check_object() decided it for us */
|
|
|
|
else if (obj_type != entry->in_pack_type)
|
|
|
|
to_reuse = 0; /* pack has delta which is unusable */
|
|
|
|
else if (entry->delta)
|
|
|
|
to_reuse = 0; /* we want to pack afresh */
|
|
|
|
else
|
|
|
|
to_reuse = 1; /* we have it in-pack undeltified,
|
|
|
|
* and we do not need to deltify it.
|
|
|
|
*/
|
|
|
|
|
|
|
|
if (! to_reuse) {
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
buf = read_sha1_file(entry->sha1, type, &size);
|
|
|
|
if (!buf)
|
|
|
|
die("unable to read %s", sha1_to_hex(entry->sha1));
|
|
|
|
if (size != entry->size)
|
|
|
|
die("object %s size inconsistency (%lu vs %lu)",
|
|
|
|
sha1_to_hex(entry->sha1), size, entry->size);
|
|
|
|
if (entry->delta) {
|
|
|
|
buf = delta_against(buf, size, entry);
|
|
|
|
size = entry->delta_size;
|
|
|
|
obj_type = OBJ_DELTA;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* The object header is a byte of 'type' followed by zero or
|
|
|
|
* more bytes of length. For deltas, the 20 bytes of delta
|
|
|
|
* sha1 follows that.
|
|
|
|
*/
|
|
|
|
hdrlen = encode_header(obj_type, size, header);
|
|
|
|
sha1write(f, header, hdrlen);
|
|
|
|
|
|
|
|
if (entry->delta) {
|
|
|
|
sha1write(f, entry->delta, 20);
|
|
|
|
hdrlen += 20;
|
|
|
|
}
|
|
|
|
datalen = sha1write_compressed(f, buf, size);
|
|
|
|
free(buf);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
else {
|
|
|
|
struct packed_git *p = entry->in_pack;
|
|
|
|
use_packed_git(p);
|
|
|
|
|
|
|
|
datalen = find_packed_object_size(p, entry->in_pack_offset);
|
|
|
|
buf = p->pack_base + entry->in_pack_offset;
|
|
|
|
sha1write(f, buf, datalen);
|
|
|
|
unuse_packed_git(p);
|
|
|
|
hdrlen = 0; /* not really */
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (obj_type == OBJ_DELTA)
|
|
|
|
reused_delta++;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
reused++;
|
2005-06-28 23:21:02 +02:00
|
|
|
}
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (obj_type == OBJ_DELTA)
|
|
|
|
written_delta++;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
written++;
|
2005-06-25 23:42:43 +02:00
|
|
|
return hdrlen + datalen;
|
|
|
|
}
|
|
|
|
|
2005-06-29 02:49:27 +02:00
|
|
|
static unsigned long write_one(struct sha1file *f,
|
|
|
|
struct object_entry *e,
|
|
|
|
unsigned long offset)
|
|
|
|
{
|
|
|
|
if (e->offset)
|
|
|
|
/* offset starts from header size and cannot be zero
|
|
|
|
* if it is written already.
|
|
|
|
*/
|
|
|
|
return offset;
|
|
|
|
e->offset = offset;
|
|
|
|
offset += write_object(f, e);
|
2005-12-29 10:30:08 +01:00
|
|
|
/* if we are deltified, write out its base object. */
|
2005-06-29 02:49:27 +02:00
|
|
|
if (e->delta)
|
|
|
|
offset = write_one(f, e->delta, offset);
|
|
|
|
return offset;
|
|
|
|
}
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
static void write_pack_file(void)
|
|
|
|
{
|
|
|
|
int i;
|
2005-06-28 20:10:48 +02:00
|
|
|
struct sha1file *f;
|
2005-06-28 23:21:02 +02:00
|
|
|
unsigned long offset;
|
|
|
|
struct pack_header hdr;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
2005-06-28 20:10:48 +02:00
|
|
|
if (!base_name)
|
|
|
|
f = sha1fd(1, "<stdout>");
|
|
|
|
else
|
2005-07-04 00:34:04 +02:00
|
|
|
f = sha1create("%s-%s.%s", base_name, sha1_to_hex(object_list_sha1), "pack");
|
2005-06-28 23:21:02 +02:00
|
|
|
hdr.hdr_signature = htonl(PACK_SIGNATURE);
|
2005-06-29 07:15:57 +02:00
|
|
|
hdr.hdr_version = htonl(PACK_VERSION);
|
2005-06-28 23:21:02 +02:00
|
|
|
hdr.hdr_entries = htonl(nr_objects);
|
|
|
|
sha1write(f, &hdr, sizeof(hdr));
|
|
|
|
offset = sizeof(hdr);
|
2005-06-29 02:49:27 +02:00
|
|
|
for (i = 0; i < nr_objects; i++)
|
|
|
|
offset = write_one(f, objects + i, offset);
|
|
|
|
|
2005-06-27 07:01:46 +02:00
|
|
|
sha1close(f, pack_file_sha1, 1);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
static void write_index_file(void)
|
|
|
|
{
|
|
|
|
int i;
|
2005-07-04 00:34:04 +02:00
|
|
|
struct sha1file *f = sha1create("%s-%s.%s", base_name, sha1_to_hex(object_list_sha1), "idx");
|
2005-06-25 23:42:43 +02:00
|
|
|
struct object_entry **list = sorted_by_sha;
|
|
|
|
struct object_entry **last = list + nr_objects;
|
|
|
|
unsigned int array[256];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Write the first-level table (the list is sorted,
|
|
|
|
* but we use a 256-entry lookup to be able to avoid
|
2005-06-27 07:01:46 +02:00
|
|
|
* having to do eight extra binary search iterations).
|
2005-06-25 23:42:43 +02:00
|
|
|
*/
|
|
|
|
for (i = 0; i < 256; i++) {
|
|
|
|
struct object_entry **next = list;
|
|
|
|
while (next < last) {
|
|
|
|
struct object_entry *entry = *next;
|
|
|
|
if (entry->sha1[0] != i)
|
|
|
|
break;
|
|
|
|
next++;
|
|
|
|
}
|
|
|
|
array[i] = htonl(next - sorted_by_sha);
|
|
|
|
list = next;
|
|
|
|
}
|
2005-06-27 05:27:56 +02:00
|
|
|
sha1write(f, array, 256 * sizeof(int));
|
2005-06-25 23:42:43 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Write the actual SHA1 entries..
|
|
|
|
*/
|
|
|
|
list = sorted_by_sha;
|
2005-06-26 00:24:30 +02:00
|
|
|
for (i = 0; i < nr_objects; i++) {
|
2005-06-25 23:42:43 +02:00
|
|
|
struct object_entry *entry = *list++;
|
|
|
|
unsigned int offset = htonl(entry->offset);
|
2005-06-27 05:27:56 +02:00
|
|
|
sha1write(f, &offset, 4);
|
|
|
|
sha1write(f, entry->sha1, 20);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
2005-06-27 07:01:46 +02:00
|
|
|
sha1write(f, pack_file_sha1, 20);
|
|
|
|
sha1close(f, NULL, 1);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
2005-07-04 00:34:04 +02:00
|
|
|
static int add_object_entry(unsigned char *sha1, unsigned int hash)
|
2005-06-25 23:42:43 +02:00
|
|
|
{
|
|
|
|
unsigned int idx = nr_objects;
|
|
|
|
struct object_entry *entry;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
struct packed_git *p;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
unsigned int found_offset = 0;
|
|
|
|
struct packed_git *found_pack = NULL;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
|
|
|
for (p = packed_git; p; p = p->next) {
|
|
|
|
struct pack_entry e;
|
|
|
|
if (find_pack_entry_one(sha1, &e, p)) {
|
|
|
|
if (incremental)
|
|
|
|
return 0;
|
|
|
|
if (local && !p->pack_local)
|
|
|
|
return 0;
|
|
|
|
if (!found_pack) {
|
|
|
|
found_offset = e.offset;
|
|
|
|
found_pack = e.p;
|
2005-10-14 00:38:28 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2005-07-03 22:08:40 +02:00
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
if (idx >= nr_alloc) {
|
|
|
|
unsigned int needed = (idx + 1024) * 3 / 2;
|
|
|
|
objects = xrealloc(objects, needed * sizeof(*entry));
|
|
|
|
nr_alloc = needed;
|
|
|
|
}
|
|
|
|
entry = objects + idx;
|
|
|
|
memset(entry, 0, sizeof(*entry));
|
|
|
|
memcpy(entry->sha1, sha1, 20);
|
2005-06-27 00:27:28 +02:00
|
|
|
entry->hash = hash;
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
if (found_pack) {
|
|
|
|
entry->in_pack = found_pack;
|
|
|
|
entry->in_pack_offset = found_offset;
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
nr_objects = idx+1;
|
2005-07-04 00:34:04 +02:00
|
|
|
return 1;
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
static int locate_object_entry_hash(unsigned char *sha1)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
unsigned int ui;
|
|
|
|
memcpy(&ui, sha1, sizeof(unsigned int));
|
|
|
|
i = ui % object_ix_hashsz;
|
|
|
|
while (0 < object_ix[i]) {
|
|
|
|
if (!memcmp(sha1, objects[object_ix[i]-1].sha1, 20))
|
|
|
|
return i;
|
|
|
|
if (++i == object_ix_hashsz)
|
|
|
|
i = 0;
|
|
|
|
}
|
|
|
|
return -1 - i;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct object_entry *locate_object_entry(unsigned char *sha1)
|
|
|
|
{
|
|
|
|
int i = locate_object_entry_hash(sha1);
|
|
|
|
if (0 <= i)
|
|
|
|
return &objects[object_ix[i]-1];
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
static void check_object(struct object_entry *entry)
|
|
|
|
{
|
2005-06-27 12:34:06 +02:00
|
|
|
char type[20];
|
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
if (entry->in_pack) {
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
unsigned char base[20];
|
|
|
|
unsigned long size;
|
|
|
|
struct object_entry *base_entry;
|
|
|
|
|
|
|
|
/* We want in_pack_type even if we do not reuse delta.
|
|
|
|
* There is no point not reusing non-delta representations.
|
|
|
|
*/
|
|
|
|
check_reuse_pack_delta(entry->in_pack,
|
|
|
|
entry->in_pack_offset,
|
|
|
|
base, &size,
|
|
|
|
&entry->in_pack_type);
|
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
/* Check if it is delta, and the base is also an object
|
|
|
|
* we are going to pack. If so we will reuse the existing
|
|
|
|
* delta.
|
|
|
|
*/
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (!no_reuse_delta &&
|
|
|
|
entry->in_pack_type == OBJ_DELTA &&
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
(base_entry = locate_object_entry(base))) {
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
|
|
|
/* Depth value does not matter - find_deltas()
|
|
|
|
* will never consider reused delta as the
|
|
|
|
* base object to deltify other objects
|
|
|
|
* against, in order to avoid circular deltas.
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
*/
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
|
|
|
/* uncompressed size of the delta data */
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
entry->size = entry->delta_size = size;
|
|
|
|
entry->delta = base_entry;
|
|
|
|
entry->type = OBJ_DELTA;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
2006-02-18 05:58:45 +01:00
|
|
|
entry->delta_sibling = base_entry->delta_child;
|
|
|
|
base_entry->delta_child = entry;
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
/* Otherwise we would do the usual */
|
2005-06-27 12:34:06 +02:00
|
|
|
}
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
|
|
|
if (sha1_object_info(entry->sha1, type, &entry->size))
|
2005-06-27 12:34:06 +02:00
|
|
|
die("unable to get type of object %s",
|
|
|
|
sha1_to_hex(entry->sha1));
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
|
|
|
if (!strcmp(type, "commit")) {
|
|
|
|
entry->type = OBJ_COMMIT;
|
|
|
|
} else if (!strcmp(type, "tree")) {
|
|
|
|
entry->type = OBJ_TREE;
|
|
|
|
} else if (!strcmp(type, "blob")) {
|
|
|
|
entry->type = OBJ_BLOB;
|
|
|
|
} else if (!strcmp(type, "tag")) {
|
|
|
|
entry->type = OBJ_TAG;
|
|
|
|
} else
|
|
|
|
die("unable to pack object %s of type %s",
|
|
|
|
sha1_to_hex(entry->sha1), type);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void hash_objects(void)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct object_entry *oe;
|
|
|
|
|
|
|
|
object_ix_hashsz = nr_objects * 2;
|
|
|
|
object_ix = xcalloc(sizeof(int), object_ix_hashsz);
|
|
|
|
for (i = 0, oe = objects; i < nr_objects; i++, oe++) {
|
|
|
|
int ix = locate_object_entry_hash(oe->sha1);
|
|
|
|
if (0 <= ix) {
|
|
|
|
error("the same object '%s' added twice",
|
|
|
|
sha1_to_hex(oe->sha1));
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ix = -1 - ix;
|
|
|
|
object_ix[ix] = i + 1;
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
2006-02-18 05:58:45 +01:00
|
|
|
static unsigned int check_delta_limit(struct object_entry *me, unsigned int n)
|
|
|
|
{
|
|
|
|
struct object_entry *child = me->delta_child;
|
|
|
|
unsigned int m = n;
|
|
|
|
while (child) {
|
|
|
|
unsigned int c = check_delta_limit(child, n + 1);
|
|
|
|
if (m < c)
|
|
|
|
m = c;
|
|
|
|
child = child->delta_sibling;
|
|
|
|
}
|
|
|
|
return m;
|
|
|
|
}
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
static void get_object_details(void)
|
|
|
|
{
|
|
|
|
int i;
|
2006-02-18 05:58:45 +01:00
|
|
|
struct object_entry *entry;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
hash_objects();
|
|
|
|
prepare_pack_ix();
|
2006-02-18 05:58:45 +01:00
|
|
|
for (i = 0, entry = objects; i < nr_objects; i++, entry++)
|
|
|
|
check_object(entry);
|
|
|
|
for (i = 0, entry = objects; i < nr_objects; i++, entry++)
|
|
|
|
if (!entry->delta && entry->delta_child)
|
|
|
|
entry->delta_limit =
|
|
|
|
check_delta_limit(entry, 1);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
|
|
|
|
|
|
|
|
static entry_sort_t current_sort;
|
|
|
|
|
|
|
|
static int sort_comparator(const void *_a, const void *_b)
|
|
|
|
{
|
|
|
|
struct object_entry *a = *(struct object_entry **)_a;
|
|
|
|
struct object_entry *b = *(struct object_entry **)_b;
|
|
|
|
return current_sort(a,b);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct object_entry **create_sorted_list(entry_sort_t sort)
|
|
|
|
{
|
|
|
|
struct object_entry **list = xmalloc(nr_objects * sizeof(struct object_entry *));
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < nr_objects; i++)
|
|
|
|
list[i] = objects + i;
|
|
|
|
current_sort = sort;
|
|
|
|
qsort(list, nr_objects, sizeof(struct object_entry *), sort_comparator);
|
|
|
|
return list;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int sha1_sort(const struct object_entry *a, const struct object_entry *b)
|
|
|
|
{
|
|
|
|
return memcmp(a->sha1, b->sha1, 20);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int type_size_sort(const struct object_entry *a, const struct object_entry *b)
|
|
|
|
{
|
|
|
|
if (a->type < b->type)
|
|
|
|
return -1;
|
|
|
|
if (a->type > b->type)
|
|
|
|
return 1;
|
2005-06-27 00:27:28 +02:00
|
|
|
if (a->hash < b->hash)
|
|
|
|
return -1;
|
|
|
|
if (a->hash > b->hash)
|
|
|
|
return 1;
|
2005-06-25 23:42:43 +02:00
|
|
|
if (a->size < b->size)
|
|
|
|
return -1;
|
|
|
|
if (a->size > b->size)
|
|
|
|
return 1;
|
|
|
|
return a < b ? -1 : (a > b);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct unpacked {
|
|
|
|
struct object_entry *entry;
|
|
|
|
void *data;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
2005-06-26 22:43:41 +02:00
|
|
|
* We search for deltas _backwards_ in a list sorted by type and
|
|
|
|
* by size, so that we see progressively smaller and smaller files.
|
|
|
|
* That's because we prefer deltas to be from the bigger file
|
|
|
|
* to the smaller - deletes are potentially cheaper, but perhaps
|
|
|
|
* more importantly, the bigger file is likely the more recent
|
|
|
|
* one.
|
2005-06-25 23:42:43 +02:00
|
|
|
*/
|
2005-06-26 05:17:59 +02:00
|
|
|
static int try_delta(struct unpacked *cur, struct unpacked *old, unsigned max_depth)
|
2005-06-25 23:42:43 +02:00
|
|
|
{
|
|
|
|
struct object_entry *cur_entry = cur->entry;
|
|
|
|
struct object_entry *old_entry = old->entry;
|
2005-06-26 22:43:41 +02:00
|
|
|
unsigned long size, oldsize, delta_size, sizediff;
|
2005-06-26 04:30:20 +02:00
|
|
|
long max_size;
|
2005-06-25 23:42:43 +02:00
|
|
|
void *delta_buf;
|
|
|
|
|
|
|
|
/* Don't bother doing diffs between different types */
|
|
|
|
if (cur_entry->type != old_entry->type)
|
|
|
|
return -1;
|
|
|
|
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
/* If the current object is at edge, take the depth the objects
|
|
|
|
* that depend on the current object into account -- otherwise
|
|
|
|
* they would become too deep.
|
|
|
|
*/
|
2006-02-18 05:58:45 +01:00
|
|
|
if (cur_entry->delta_child) {
|
|
|
|
if (max_depth <= cur_entry->delta_limit)
|
|
|
|
return 0;
|
|
|
|
max_depth -= cur_entry->delta_limit;
|
|
|
|
}
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
size = cur_entry->size;
|
2005-06-26 04:30:20 +02:00
|
|
|
if (size < 50)
|
|
|
|
return -1;
|
2005-06-25 23:42:43 +02:00
|
|
|
oldsize = old_entry->size;
|
2005-06-26 22:43:41 +02:00
|
|
|
sizediff = oldsize > size ? oldsize - size : size - oldsize;
|
|
|
|
if (sizediff > size / 8)
|
2005-06-25 23:42:43 +02:00
|
|
|
return -1;
|
2005-06-26 05:17:59 +02:00
|
|
|
if (old_entry->depth >= max_depth)
|
|
|
|
return 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE!
|
|
|
|
*
|
|
|
|
* We always delta from the bigger to the smaller, since that's
|
|
|
|
* more space-efficient (deletes don't have to say _what_ they
|
|
|
|
* delete).
|
|
|
|
*/
|
2005-06-26 04:30:20 +02:00
|
|
|
max_size = size / 2 - 20;
|
|
|
|
if (cur_entry->delta)
|
|
|
|
max_size = cur_entry->delta_size-1;
|
2005-06-27 00:27:28 +02:00
|
|
|
if (sizediff >= max_size)
|
|
|
|
return -1;
|
2005-06-26 13:29:18 +02:00
|
|
|
delta_buf = diff_delta(old->data, oldsize,
|
|
|
|
cur->data, size, &delta_size, max_size);
|
2005-06-25 23:42:43 +02:00
|
|
|
if (!delta_buf)
|
2005-06-26 04:30:20 +02:00
|
|
|
return 0;
|
|
|
|
cur_entry->delta = old_entry;
|
|
|
|
cur_entry->delta_size = delta_size;
|
2005-06-26 05:17:59 +02:00
|
|
|
cur_entry->depth = old_entry->depth + 1;
|
2005-06-25 23:42:43 +02:00
|
|
|
free(delta_buf);
|
2005-06-26 02:36:26 +02:00
|
|
|
return 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
2005-06-26 05:17:59 +02:00
|
|
|
static void find_deltas(struct object_entry **list, int window, int depth)
|
2005-06-25 23:42:43 +02:00
|
|
|
{
|
2005-06-26 22:43:41 +02:00
|
|
|
int i, idx;
|
2005-06-25 23:42:43 +02:00
|
|
|
unsigned int array_size = window * sizeof(struct unpacked);
|
|
|
|
struct unpacked *array = xmalloc(array_size);
|
2006-02-12 02:54:18 +01:00
|
|
|
int eye_candy;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
|
|
|
memset(array, 0, array_size);
|
2005-06-26 22:43:41 +02:00
|
|
|
i = nr_objects;
|
|
|
|
idx = 0;
|
2006-02-12 02:54:18 +01:00
|
|
|
eye_candy = i - (nr_objects / 20);
|
|
|
|
|
2005-06-26 22:43:41 +02:00
|
|
|
while (--i >= 0) {
|
2005-06-25 23:42:43 +02:00
|
|
|
struct object_entry *entry = list[i];
|
|
|
|
struct unpacked *n = array + idx;
|
|
|
|
unsigned long size;
|
|
|
|
char type[10];
|
|
|
|
int j;
|
|
|
|
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress && i <= eye_candy) {
|
|
|
|
eye_candy -= nr_objects / 20;
|
|
|
|
fputc('.', stderr);
|
|
|
|
}
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
|
|
|
if (entry->delta)
|
|
|
|
/* This happens if we decided to reuse existing
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
* delta from a pack. "!no_reuse_delta &&" is implied.
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
free(n->data);
|
|
|
|
n->entry = entry;
|
|
|
|
n->data = read_sha1_file(entry->sha1, type, &size);
|
|
|
|
if (size != entry->size)
|
|
|
|
die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
|
2005-06-26 03:29:23 +02:00
|
|
|
j = window;
|
|
|
|
while (--j > 0) {
|
|
|
|
unsigned int other_idx = idx + j;
|
2005-06-25 23:42:43 +02:00
|
|
|
struct unpacked *m;
|
2005-06-26 03:29:23 +02:00
|
|
|
if (other_idx >= window)
|
|
|
|
other_idx -= window;
|
2005-06-25 23:42:43 +02:00
|
|
|
m = array + other_idx;
|
|
|
|
if (!m->entry)
|
|
|
|
break;
|
2005-06-26 05:17:59 +02:00
|
|
|
if (try_delta(n, m, depth) < 0)
|
2005-06-25 23:42:43 +02:00
|
|
|
break;
|
|
|
|
}
|
2005-06-26 22:43:41 +02:00
|
|
|
idx++;
|
|
|
|
if (idx >= window)
|
|
|
|
idx = 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
2005-08-08 20:46:58 +02:00
|
|
|
|
|
|
|
for (i = 0; i < window; ++i)
|
|
|
|
free(array[i].data);
|
|
|
|
free(array);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
|
|
|
|
2005-10-22 10:28:13 +02:00
|
|
|
static void prepare_pack(int window, int depth)
|
|
|
|
{
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress)
|
|
|
|
fprintf(stderr, "Packing %d objects", nr_objects);
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
get_object_details();
|
|
|
|
if (progress)
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
fputc('.', stderr);
|
pack-objects: reuse data from existing packs.
When generating a new pack, notice if we have already needed
objects in existing packs. If an object is stored deltified,
and its base object is also what we are going to pack, then
reuse the existing deltified representation unconditionally,
bypassing all the expensive find_deltas() and try_deltas()
calls.
Also, notice if what we are going to write out exactly match
what is already in an existing pack (either deltified or just
compressed). In such a case, we can just copy it instead of
going through the usual uncompressing & recompressing cycle.
Without this patch, in linux-2.6 repository with about 1500
loose objects and a single mega pack:
$ git-rev-list --objects v2.6.16-rc3 >RL
$ wc -l RL
184141 RL
$ time git-pack-objects p <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
real 12m4.323s
user 11m2.560s
sys 0m55.950s
With this patch, the same input:
$ time ../git.junio/git-pack-objects q <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
a1fc7b3e537fcb9b3c46b7505df859f0a11e79d2
Total 184141, written 184141, reused 182441
real 1m2.608s
user 0m55.090s
sys 0m1.830s
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 02:34:29 +01:00
|
|
|
|
2005-10-22 10:28:13 +02:00
|
|
|
sorted_by_type = create_sorted_list(type_size_sort);
|
|
|
|
if (window && depth)
|
|
|
|
find_deltas(sorted_by_type, window+1, depth);
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress)
|
|
|
|
fputc('\n', stderr);
|
2005-10-22 10:28:13 +02:00
|
|
|
write_pack_file();
|
|
|
|
}
|
|
|
|
|
|
|
|
static int reuse_cached_pack(unsigned char *sha1, int pack_to_stdout)
|
|
|
|
{
|
|
|
|
static const char cache[] = "pack-cache/pack-%s.%s";
|
|
|
|
char *cached_pack, *cached_idx;
|
|
|
|
int ifd, ofd, ifd_ix = -1;
|
|
|
|
|
|
|
|
cached_pack = git_path(cache, sha1_to_hex(sha1), "pack");
|
|
|
|
ifd = open(cached_pack, O_RDONLY);
|
|
|
|
if (ifd < 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (!pack_to_stdout) {
|
|
|
|
cached_idx = git_path(cache, sha1_to_hex(sha1), "idx");
|
|
|
|
ifd_ix = open(cached_idx, O_RDONLY);
|
|
|
|
if (ifd_ix < 0) {
|
|
|
|
close(ifd);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (progress)
|
|
|
|
fprintf(stderr, "Reusing %d objects pack %s\n", nr_objects,
|
|
|
|
sha1_to_hex(sha1));
|
2005-10-22 10:28:13 +02:00
|
|
|
|
|
|
|
if (pack_to_stdout) {
|
|
|
|
if (copy_fd(ifd, 1))
|
|
|
|
exit(1);
|
|
|
|
close(ifd);
|
|
|
|
}
|
|
|
|
else {
|
|
|
|
char name[PATH_MAX];
|
|
|
|
snprintf(name, sizeof(name),
|
|
|
|
"%s-%s.%s", base_name, sha1_to_hex(sha1), "pack");
|
|
|
|
ofd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0666);
|
|
|
|
if (ofd < 0)
|
|
|
|
die("unable to open %s (%s)", name, strerror(errno));
|
|
|
|
if (copy_fd(ifd, ofd))
|
|
|
|
exit(1);
|
|
|
|
close(ifd);
|
|
|
|
|
|
|
|
snprintf(name, sizeof(name),
|
|
|
|
"%s-%s.%s", base_name, sha1_to_hex(sha1), "idx");
|
|
|
|
ofd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0666);
|
|
|
|
if (ofd < 0)
|
|
|
|
die("unable to open %s (%s)", name, strerror(errno));
|
|
|
|
if (copy_fd(ifd_ix, ofd))
|
|
|
|
exit(1);
|
|
|
|
close(ifd_ix);
|
|
|
|
puts(sha1_to_hex(sha1));
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
int main(int argc, char **argv)
|
|
|
|
{
|
2005-07-04 00:34:04 +02:00
|
|
|
SHA_CTX ctx;
|
2005-06-27 00:27:28 +02:00
|
|
|
char line[PATH_MAX + 20];
|
2005-06-28 20:10:48 +02:00
|
|
|
int window = 10, depth = 10, pack_to_stdout = 0;
|
2005-10-13 01:54:19 +02:00
|
|
|
struct object_entry **list;
|
2005-06-25 23:42:43 +02:00
|
|
|
int i;
|
2006-02-12 02:54:18 +01:00
|
|
|
struct timeval prev_tv;
|
|
|
|
int eye_candy = 0;
|
|
|
|
int eye_candy_incr = 500;
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
|
2005-11-26 09:50:02 +01:00
|
|
|
setup_git_directory();
|
|
|
|
|
2005-06-25 23:42:43 +02:00
|
|
|
for (i = 1; i < argc; i++) {
|
|
|
|
const char *arg = argv[i];
|
|
|
|
|
|
|
|
if (*arg == '-') {
|
2005-07-03 22:36:58 +02:00
|
|
|
if (!strcmp("--non-empty", arg)) {
|
|
|
|
non_empty = 1;
|
|
|
|
continue;
|
|
|
|
}
|
2005-10-14 00:38:28 +02:00
|
|
|
if (!strcmp("--local", arg)) {
|
|
|
|
local = 1;
|
|
|
|
continue;
|
|
|
|
}
|
2005-07-03 22:08:40 +02:00
|
|
|
if (!strcmp("--incremental", arg)) {
|
|
|
|
incremental = 1;
|
|
|
|
continue;
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
if (!strncmp("--window=", arg, 9)) {
|
|
|
|
char *end;
|
2005-06-26 04:35:47 +02:00
|
|
|
window = strtoul(arg+9, &end, 0);
|
2005-06-25 23:42:43 +02:00
|
|
|
if (!arg[9] || *end)
|
|
|
|
usage(pack_usage);
|
|
|
|
continue;
|
|
|
|
}
|
2005-06-26 05:17:59 +02:00
|
|
|
if (!strncmp("--depth=", arg, 8)) {
|
|
|
|
char *end;
|
|
|
|
depth = strtoul(arg+8, &end, 0);
|
|
|
|
if (!arg[8] || *end)
|
|
|
|
usage(pack_usage);
|
|
|
|
continue;
|
|
|
|
}
|
2006-02-12 22:01:54 +01:00
|
|
|
if (!strcmp("-q", arg)) {
|
|
|
|
progress = 0;
|
|
|
|
continue;
|
|
|
|
}
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (!strcmp("--no-reuse-delta", arg)) {
|
|
|
|
no_reuse_delta = 1;
|
|
|
|
continue;
|
|
|
|
}
|
2005-06-28 20:10:48 +02:00
|
|
|
if (!strcmp("--stdout", arg)) {
|
|
|
|
pack_to_stdout = 1;
|
|
|
|
continue;
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
usage(pack_usage);
|
|
|
|
}
|
|
|
|
if (base_name)
|
|
|
|
usage(pack_usage);
|
|
|
|
base_name = arg;
|
|
|
|
}
|
|
|
|
|
2005-06-28 20:10:48 +02:00
|
|
|
if (pack_to_stdout != !base_name)
|
2005-06-25 23:42:43 +02:00
|
|
|
usage(pack_usage);
|
|
|
|
|
2005-10-14 00:38:28 +02:00
|
|
|
prepare_packed_git();
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress) {
|
|
|
|
fprintf(stderr, "Generating pack...\n");
|
|
|
|
gettimeofday(&prev_tv, NULL);
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
while (fgets(line, sizeof(line), stdin) != NULL) {
|
2005-06-27 00:27:28 +02:00
|
|
|
unsigned int hash;
|
|
|
|
char *p;
|
2005-06-25 23:42:43 +02:00
|
|
|
unsigned char sha1[20];
|
2005-06-27 00:27:28 +02:00
|
|
|
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress && (eye_candy <= nr_objects)) {
|
|
|
|
fprintf(stderr, "Counting objects...%d\r", nr_objects);
|
|
|
|
if (eye_candy && (50 <= eye_candy_incr)) {
|
|
|
|
struct timeval tv;
|
|
|
|
int time_diff;
|
|
|
|
gettimeofday(&tv, NULL);
|
|
|
|
time_diff = (tv.tv_sec - prev_tv.tv_sec);
|
|
|
|
time_diff <<= 10;
|
|
|
|
time_diff += (tv.tv_usec - prev_tv.tv_usec);
|
|
|
|
if ((1 << 9) < time_diff)
|
|
|
|
eye_candy_incr += 50;
|
|
|
|
else if (50 < eye_candy_incr)
|
|
|
|
eye_candy_incr -= 50;
|
|
|
|
}
|
|
|
|
eye_candy += eye_candy_incr;
|
|
|
|
}
|
2005-06-25 23:42:43 +02:00
|
|
|
if (get_sha1_hex(line, sha1))
|
2005-11-21 21:38:31 +01:00
|
|
|
die("expected sha1, got garbage:\n %s", line);
|
2005-06-27 00:27:28 +02:00
|
|
|
hash = 0;
|
|
|
|
p = line+40;
|
|
|
|
while (*p) {
|
|
|
|
unsigned char c = *p++;
|
|
|
|
if (isspace(c))
|
|
|
|
continue;
|
|
|
|
hash = hash * 11 + c;
|
|
|
|
}
|
2005-10-13 01:54:19 +02:00
|
|
|
add_object_entry(sha1, hash);
|
2005-06-25 23:42:43 +02:00
|
|
|
}
|
2006-02-12 02:54:18 +01:00
|
|
|
if (progress)
|
|
|
|
fprintf(stderr, "Done counting %d objects.\n", nr_objects);
|
2005-07-03 22:36:58 +02:00
|
|
|
if (non_empty && !nr_objects)
|
|
|
|
return 0;
|
2005-06-25 23:42:43 +02:00
|
|
|
|
|
|
|
sorted_by_sha = create_sorted_list(sha1_sort);
|
2005-10-13 01:54:19 +02:00
|
|
|
SHA1_Init(&ctx);
|
|
|
|
list = sorted_by_sha;
|
|
|
|
for (i = 0; i < nr_objects; i++) {
|
|
|
|
struct object_entry *entry = *list++;
|
|
|
|
SHA1_Update(&ctx, entry->sha1, 20);
|
|
|
|
}
|
|
|
|
SHA1_Final(object_list_sha1, &ctx);
|
|
|
|
|
2005-10-22 10:28:13 +02:00
|
|
|
if (reuse_cached_pack(object_list_sha1, pack_to_stdout))
|
|
|
|
;
|
|
|
|
else {
|
|
|
|
prepare_pack(window, depth);
|
|
|
|
if (!pack_to_stdout) {
|
|
|
|
write_index_file();
|
|
|
|
puts(sha1_to_hex(object_list_sha1));
|
|
|
|
}
|
2005-07-04 00:34:04 +02:00
|
|
|
}
|
pack-objects: finishing touches.
This introduces --no-reuse-delta option to disable reusing of
existing delta, which is a large part of the optimization
introduced by this series. This may become necessary if
repeated repacking makes delta chain too long. With this, the
output of the command becomes identical to that of the older
implementation. But the performance suffers greatly.
It still allows reusing non-deltified representations; there is
no point uncompressing and recompressing the whole text.
It also adds a couple more statistics output, while squelching
it under -q flag, which the last round forgot to do.
$ time old-git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects....................
real 12m8.530s user 11m1.450s sys 0m57.920s
$ time git-pack-objects --stdout >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 138297), reused 178833 (delta 134081)
real 0m59.549s user 0m56.670s sys 0m2.400s
$ time git-pack-objects --stdout --no-reuse-delta >/dev/null <RL
Generating pack...
Done counting 184141 objects.
Packing 184141 objects.....................
Total 184141, written 184141 (delta 134833), reused 47904 (delta 0)
real 11m13.830s user 9m45.240s sys 0m44.330s
There is one remaining issue when --no-reuse-delta option is not
used. It can create delta chains that are deeper than specified.
A<--B<--C<--D E F G
Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object. B is depth1 delta relative to A,
C is depth2 delta relative to B...) with loose objects E, F, G.
And we are going to pack all of them.
B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined for deltification, and let's say
we decided to keep E expanded, and store the rest as deltas like
this:
E<--F<--G<--A
Oops. We ended up making D a bit too deep, didn't we? B, C and
D form a chain on top of A!
This is because we did not know what the final depth of A would
be, when we checked objects and decided to keep the existing
delta. Unfortunately, deferring the decision until just before
the deltification is not an option. To be able to make B, C,
and D candidates for deltification with the rest, we need to
know the type and final unexpanded size of them, but the major
part of the optimization comes from the fact that we do not read
the delta data to do so -- getting the final size is quite an
expensive operation.
To prevent this from happening, we should keep A from being
deltified. But how would we tell that, cheaply?
To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3). Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.
However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-02-16 20:55:51 +01:00
|
|
|
if (progress)
|
|
|
|
fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
|
|
|
|
nr_objects, written, written_delta, reused, reused_delta);
|
2005-06-25 23:42:43 +02:00
|
|
|
return 0;
|
|
|
|
}
|