Serving objects from a shallow repository needs to write a
temporary file to be used, but the serving upload-pack may not have
write access to the repository which is meant to be read-only.
Instead feed these temporary shallow bounds from the standard input
of pack-objects so that we do not have to use a temporary file.
* nd/upload-pack-shallow:
upload-pack: send shallow info over stdin to pack-objects
Replace open-coded reallocation with ALLOC_GROW() macro.
* dd/use-alloc-grow:
sha1_file.c: use ALLOC_GROW() in pretend_sha1_file()
read-cache.c: use ALLOC_GROW() in add_index_entry()
builtin/mktree.c: use ALLOC_GROW() in append_to_tree()
attr.c: use ALLOC_GROW() in handle_attr_line()
dir.c: use ALLOC_GROW() in create_simplify()
reflog-walk.c: use ALLOC_GROW()
replace_object.c: use ALLOC_GROW() in register_replace_object()
patch-ids.c: use ALLOC_GROW() in add_commit()
diffcore-rename.c: use ALLOC_GROW()
diff.c: use ALLOC_GROW()
commit.c: use ALLOC_GROW() in register_commit_graft()
cache-tree.c: use ALLOC_GROW() in find_subtree()
bundle.c: use ALLOC_GROW() in add_to_ref_list()
builtin/pack-objects.c: use ALLOC_GROW() in check_pbase_path()
A warning from "git pack-objects" were generated by referring to an
incorrect variable when forming the filename that we had trouble
with.
* sh/write-pack-file-warning-message-fix:
write_pack_file: use correct variable in diagnostic
* mh/replace-refs-variable-rename:
Document some functions defined in object.c
Add docstrings for lookup_replace_object() and do_lookup_replace_object()
rename read_replace_refs to check_replace_refs
Before cdab485 (upload-pack: delegate rev walking in shallow fetch to
pack-objects - 2013-08-16) upload-pack does not write to the source
repository. cdab485 starts to write $GIT_DIR/shallow_XXXXXX if it's a
shallow fetch, so the source repo must be writable.
git:// servers do not need write access to repos and usually don't
have it, which means cdab485 breaks shallow clone over git://
Instead of using a temporary file as the media for shallow points, we
can send them over stdin to pack-objects as well. Prepend shallow
SHA-1 with --shallow so pack-objects knows what is what.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The old version fixes a maximum length on the buffer, which could be a problem
if one is not certain of the length of get_object_directory().
Using strbuf can avoid the protential bug.
Helped-by: Michael Haggerty <mhagger@alum.mit.edu>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Sun He <sunheehnus@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
'pack_tmp_name' is the subject of the utime() check, so report it in the
warning, not the uninitialized 'tmpname'
Signed-off-by: Sun He <sunheehnus@gmail.com>
Reviewed-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Borrow the bitmap index into packfiles from JGit to speed up
enumeration of objects involved in a commit range without having to
fully traverse the history.
* jk/pack-bitmap: (26 commits)
ewah: unconditionally ntohll ewah data
ewah: support platforms that require aligned reads
read-cache: use get_be32 instead of hand-rolled ntoh_l
block-sha1: factor out get_be and put_be wrappers
do not discard revindex when re-preparing packfiles
pack-bitmap: implement optional name_hash cache
t/perf: add tests for pack bitmaps
t: add basic bitmap functionality tests
count-objects: recognize .bitmap in garbage-checking
repack: consider bitmaps when performing repacks
repack: handle optional files created by pack-objects
repack: turn exts array into array-of-struct
repack: stop using magic number for ARRAY_SIZE(exts)
pack-objects: implement bitmap writing
rev-list: add bitmap mode to speed up object lists
pack-objects: use bitmaps when packing objects
pack-objects: split add_object_entry
pack-bitmap: add support for bitmap indexes
documentation: add documentation for the bitmap format
ewah: compressed bitmap implementation
...
The semantics of this flag was changed in commit
e1111cef23 inline lookup_replace_object() calls
but wasn't renamed at the time to minimize code churn. Rename it now,
and add a comment explaining its use.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we use pack bitmaps rather than walking the object
graph, we end up with the list of objects to include in the
packfile, but we do not know the path at which any tree or
blob objects would be found.
In a recently packed repository, this is fine. A fetch would
use the paths only as a heuristic in the delta compression
phase, and a fully packed repository should not need to do
much delta compression.
As time passes, though, we may acquire more objects on top
of our large bitmapped pack. If clients fetch frequently,
then they never even look at the bitmapped history, and all
works as usual. However, a client who has not fetched since
the last bitmap repack will have "have" tips in the
bitmapped history, but "want" newer objects.
The bitmaps themselves degrade gracefully in this
circumstance. We manually walk the more recent bits of
history, and then use bitmaps when we hit them.
But we would also like to perform delta compression between
the newer objects and the bitmapped objects (both to delta
against what we know the user already has, but also between
"new" and "old" objects that the user is fetching). The lack
of pathnames makes our delta heuristics much less effective.
This patch adds an optional cache of the 32-bit name_hash
values to the end of the bitmap file. If present, a reader
can use it to match bitmapped and non-bitmapped names during
delta compression.
Here are perf results for p5310:
Test origin/master HEAD^ HEAD
-------------------------------------------------------------------------------------------------
5310.2: repack to disk 36.81(37.82+1.43) 47.70(48.74+1.41) +29.6% 47.75(48.70+1.51) +29.7%
5310.3: simulated clone 30.78(29.70+2.14) 1.08(0.97+0.10) -96.5% 1.07(0.94+0.12) -96.5%
5310.4: simulated fetch 3.16(6.10+0.08) 3.54(10.65+0.06) +12.0% 1.70(3.07+0.06) -46.2%
5310.6: partial bitmap 36.76(43.19+1.81) 6.71(11.25+0.76) -81.7% 4.08(6.26+0.46) -88.9%
You can see that the time spent on an incremental fetch goes
down, as our delta heuristics are able to do their work.
And we save time on the partial bitmap clone for the same
reason.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit extends more the functionality of `pack-objects` by allowing
it to write out a `.bitmap` index next to any written packs, together
with the `.idx` index that currently gets written.
If bitmap writing is enabled for a given repository (either by calling
`pack-objects` with the `--write-bitmap-index` flag or by having
`pack.writebitmaps` set to `true` in the config) and pack-objects is
writing a packfile that would normally be indexed (i.e. not piping to
stdout), we will attempt to write the corresponding bitmap index for the
packfile.
Bitmap index writing happens after the packfile and its index has been
successfully written to disk (`finish_tmp_packfile`). The process is
performed in several steps:
1. `bitmap_writer_set_checksum`: this call stores the partial
checksum for the packfile being written; the checksum will be
written in the resulting bitmap index to verify its integrity
2. `bitmap_writer_build_type_index`: this call uses the array of
`struct object_entry` that has just been sorted when writing out
the actual packfile index to disk to generate 4 type-index bitmaps
(one for each object type).
These bitmaps have their nth bit set if the given object is of
the bitmap's type. E.g. the nth bit of the Commits bitmap will be
1 if the nth object in the packfile index is a commit.
This is a very cheap operation because the bitmap writing code has
access to the metadata stored in the `struct object_entry` array,
and hence the real type for each object in the packfile.
3. `bitmap_writer_reuse_bitmaps`: if there exists an existing bitmap
index for one of the packfiles we're trying to repack, this call
will efficiently rebuild the existing bitmaps so they can be
reused on the new index. All the existing bitmaps will be stored
in a `reuse` hash table, and the commit selection phase will
prioritize these when selecting, as they can be written directly
to the new index without having to perform a revision walk to
fill the bitmap. This can greatly speed up the repack of a
repository that already has bitmaps.
4. `bitmap_writer_select_commits`: if bitmap writing is enabled for
a given `pack-objects` run, the sequence of commits generated
during the Counting Objects phase will be stored in an array.
We then use that array to build up the list of selected commits.
Writing a bitmap in the index for each object in the repository
would be cost-prohibitive, so we use a simple heuristic to pick
the commits that will be indexed with bitmaps.
The current heuristics are a simplified version of JGit's
original implementation. We select a higher density of commits
depending on their age: the 100 most recent commits are always
selected, after that we pick 1 commit of each 100, and the gap
increases as the commits grow older. On top of that, we make sure
that every single branch that has not been merged (all the tips
that would be required from a clone) gets their own bitmap, and
when selecting commits between a gap, we tend to prioritize the
commit with the most parents.
Do note that there is no right/wrong way to perform commit
selection; different selection algorithms will result in
different commits being selected, but there's no such thing as
"missing a commit". The bitmap walker algorithm implemented in
`prepare_bitmap_walk` is able to adapt to missing bitmaps by
performing manual walks that complete the bitmap: the ideal
selection algorithm, however, would select the commits that are
more likely to be used as roots for a walk in the future (e.g.
the tips of each branch, and so on) to ensure a bitmap for them
is always available.
5. `bitmap_writer_build`: this is the computationally expensive part
of bitmap generation. Based on the list of commits that were
selected in the previous step, we perform several incremental
walks to generate the bitmap for each commit.
The walks begin from the oldest commit, and are built up
incrementally for each branch. E.g. consider this dag where A, B,
C, D, E, F are the selected commits, and a, b, c, e are a chunk
of simplified history that will not receive bitmaps.
A---a---B--b--C--c--D
\
E--e--F
We start by building the bitmap for A, using A as the root for a
revision walk and marking all the objects that are reachable
until the walk is over. Once this bitmap is stored, we reuse the
bitmap walker to perform the walk for B, assuming that once we
reach A again, the walk will be terminated because A has already
been SEEN on the previous walk.
This process is repeated for C, and D, but when we try to
generate the bitmaps for E, we can reuse neither the current walk
nor the bitmap we have generated so far.
What we do now is resetting both the walk and clearing the
bitmap, and performing the walk from scratch using E as the
origin. This new walk, however, does not need to be completed.
Once we hit B, we can lookup the bitmap we have already stored
for that commit and OR it with the existing bitmap we've composed
so far, allowing us to limit the walk early.
After all the bitmaps have been generated, another iteration
through the list of commits is performed to find the best XOR
offsets for compression before writing them to disk. Because of
the incremental nature of these bitmaps, XORing one of them with
its predecesor results in a minimal "bitmap delta" most of the
time. We can write this delta to the on-disk bitmap index, and
then re-compose the original bitmaps by XORing them again when
loaded.
This is a phase very similar to pack-object's `find_delta` (using
bitmaps instead of objects, of course), except the heuristics
have been greatly simplified: we only check the 10 bitmaps before
any given one to find best compressing one. This gives good
results in practice, because there is locality in the ordering of
the objects (and therefore bitmaps) in the packfile.
6. `bitmap_writer_finish`: the last step in the process is
serializing to disk all the bitmap data that has been generated
in the two previous steps.
The bitmap is written to a tmp file and then moved atomically to
its final destination, using the same process as
`pack-write.c:write_idx_file`.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In this patch, we use the bitmap API to perform the `Counting Objects`
phase in pack-objects, rather than a traditional walk through the object
graph. For a reasonably-packed large repo, the time to fetch and clone
is often dominated by the full-object revision walk during the Counting
Objects phase. Using bitmaps can reduce the CPU time required on the
server (and therefore start sending the actual pack data with less
delay).
For bitmaps to be used, the following must be true:
1. We must be packing to stdout (as a normal `pack-objects` from
`upload-pack` would do).
2. There must be a .bitmap index containing at least one of the
"have" objects that the client is asking for.
3. Bitmaps must be enabled (they are enabled by default, but can be
disabled by setting `pack.usebitmaps` to false, or by using
`--no-use-bitmap-index` on the command-line).
If any of these is not true, we fall back to doing a normal walk of the
object graph.
Here are some sample timings from a full pack of `torvalds/linux` (i.e.
something very similar to what would be generated for a clone of the
repository) that show the speedup produced by various
methods:
[existing graph traversal]
$ time git pack-objects --all --stdout --no-use-bitmap-index \
</dev/null >/dev/null
Counting objects: 3237103, done.
Compressing objects: 100% (508752/508752), done.
Total 3237103 (delta 2699584), reused 3237103 (delta 2699584)
real 0m44.111s
user 0m42.396s
sys 0m3.544s
[bitmaps only, without partial pack reuse; note that
pack reuse is automatic, so timing this required a
patch to disable it]
$ time git pack-objects --all --stdout </dev/null >/dev/null
Counting objects: 3237103, done.
Compressing objects: 100% (508752/508752), done.
Total 3237103 (delta 2699584), reused 3237103 (delta 2699584)
real 0m5.413s
user 0m5.604s
sys 0m1.804s
[bitmaps with pack reuse (what you get with this patch)]
$ time git pack-objects --all --stdout </dev/null >/dev/null
Reusing existing pack: 3237103, done.
Total 3237103 (delta 0), reused 0 (delta 0)
real 0m1.636s
user 0m1.460s
sys 0m0.172s
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function actually does three things:
1. Check whether we've already added the object to our
packing list.
2. Check whether the object meets our criteria for adding.
3. Actually add the object to our packing list.
It's a little hard to see these three phases, because they
happen linearly in the rather long function. Instead, this
patch breaks them up into three separate helper functions.
The result is a little easier to follow, though it
unfortunately suffers from some optimization
interdependencies between the stages (e.g., during step 3 we
use the packing list index from step 1 and the packfile
information from step 2).
More importantly, though, the various parts can be
composed differently, as they will be in the next patch.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The sha1write function returns an int, but it will always be
"0". The failure-prone parts of the function happen in the
"flush" callback, which cannot pass an error back to us. So
we just end up calling die() during the flush.
Let's just drop the return value altogether, as it only
confuses callers into thinking that it might be useful.
Only one call site actually checked the return value. We can
drop that check, since it just led to a die() anyway.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Leaving only the function definitions and declarations so that any
new topic in flight can still make use of the old functions, replace
existing uses of the prefixcmp() and suffixcmp() with new API
functions.
The change can be recreated by mechanically applying this:
$ git grep -l -e prefixcmp -e suffixcmp -- \*.c |
grep -v strbuf\\.c |
xargs perl -pi -e '
s|!prefixcmp\(|starts_with\(|g;
s|prefixcmp\(|!starts_with\(|g;
s|!suffixcmp\(|ends_with\(|g;
s|suffixcmp\(|!ends_with\(|g;
'
on the result of preparatory changes in this series.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As the pack-objects system grows beyond the single
pack-objects.c file, more parts (like the soon-to-exist
bitmap code) will need to compute hashes for matching
deltas. Factor out name_hash to make it available to other
files.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The hash table that stores the packing list for a given `pack-objects`
run was tightly coupled to the pack-objects code.
In this commit, we refactor the hash table and the underlying storage
array into a `packing_data` struct. The functionality for accessing and
adding entries to the packing list is hence accessible from other parts
of Git besides the `pack-objects` builtin.
This refactoring is a requirement for further patches in this series
that will require accessing the commit packing list from outside of
`pack-objects`.
The hash table implementation has been minimally altered: we now
use table sizes which are always a power of two, to ensure a uniform
index distribution in the array.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When there is no sufficient overlap between old and new history
during a fetch into a shallow repository, we unnecessarily sent
objects the sending side knows the receiving end has.
* nd/fetch-into-shallow:
Add testcase for needless objects during a shallow fetch
list-objects: mark more commits as edges in mark_edges_uninteresting
list-objects: reduce one argument in mark_edges_uninteresting
upload-pack: delegate rev walking in shallow fetch to pack-objects
shallow: add setup_temporary_shallow()
shallow: only add shallow graft points to new shallow file
move setup_alternate_shallow and write_shallow_commits to shallow.c
mark_edges_uninteresting() is always called with this form
mark_edges_uninteresting(revs->commits, revs, ...);
Remove the first argument and let mark_edges_uninteresting figure that
out by itself. It helps answer the question "are this commit list and
revs related in any way?" when looking at mark_edges_uninteresting
implementation.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that close_one_pack() has been introduced to handle file
descriptor pressure, it is not strictly necessary to close the
pack file descriptor in unuse_one_window() when we're under memory
pressure.
Jeff King provided a justification for leaving the pack file open:
If you close packfile descriptors, you can run into racy situations
where somebody else is repacking and deleting packs, and they go away
while you are trying to access them. If you keep a descriptor open,
you're fine; they last to the end of the process. If you don't, then
they disappear from under you.
For normal object access, this isn't that big a deal; we just rescan
the packs and retry. But if you are packing yourself (e.g., because
you are a pack-objects started by upload-pack for a clone or fetch),
it's much harder to recover (and we print some warnings).
Let's do so (or uh, not do so).
Signed-off-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Turn some boolean fields into bitfields and use uint32_t for name
hash. This shrinks the size of the structure from 128 bytes to 120
bytes.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Speeds up "git upload-pack" (what is invoked by "git fetch" on the
other side of the connection) by reducing the cost to advertise the
branches and tags that are available in the repository.
* jk/peel-ref:
upload-pack: use peel_ref for ref advertisements
peel_ref: check object type before loading
peel_ref: do not return a null sha1
peel_ref: use faster deref_tag_noverify
The idea of the peel_ref function is to dereference tag
objects recursively until we hit a non-tag, and return the
sha1. Conceptually, it should return 0 if it is successful
(and fill in the sha1), or -1 if there was nothing to peel.
However, the current behavior is much more confusing. For a
regular loose ref, the behavior is as described above. But
there is an optimization to reuse the peeled-ref value for a
ref that came from a packed-refs file. If we have such a
ref, we return its peeled value, even if that peeled value
is null (indicating that we know the ref definitely does
_not_ peel).
It might seem like such information is useful to the caller,
who would then know not to bother loading and trying to peel
the object. Except that they should not bother loading and
trying to peel the object _anyway_, because that fallback is
already handled by peel_ref. In other words, the whole point
of calling this function is that it handles those details
internally, and you either get a sha1, or you know that it
is not peel-able.
This patch catches the null sha1 case internally and
converts it into a -1 return value (i.e., there is nothing
to peel). This simplifies callers, which do not need to
bother checking themselves.
Two callers are worth noting:
- in pack-objects, a comment indicates that there is a
difference between non-peelable tags and unannotated
tags. But that is not the case (before or after this
patch). Whether you get a null sha1 has to do with
internal details of how peel_ref operated.
- in show-ref, if peel_ref returns a failure, the caller
tries to decide whether to try peeling manually based on
whether the REF_ISPACKED flag is set. But this doesn't
make any sense. If the flag is set, that does not
necessarily mean the ref came from a packed-refs file
with the "peeled" extension. But it doesn't matter,
because even if it didn't, there's no point in trying to
peel it ourselves, as peel_ref would already have done
so. In other words, the fallback peeling is guaranteed
to fail.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teaches the object name parser things like a "git describe" output
is always a commit object, "A" in "git log A" must be a committish,
and "A" and "B" in "git log A...B" both must be committish, etc., to
prolong the lifetime of abbreviated object names.
* jc/sha1-name-more: (27 commits)
t1512: match the "other" object names
t1512: ignore whitespaces in wc -l output
rev-parse --disambiguate=<prefix>
rev-parse: A and B in "rev-parse A..B" refer to committish
reset: the command takes committish
commit-tree: the command wants a tree and commits
apply: --build-fake-ancestor expects blobs
sha1_name.c: add support for disambiguating other types
revision.c: the "log" family, except for "show", takes committish
revision.c: allow handle_revision_arg() to take other flags
sha1_name.c: introduce get_sha1_committish()
sha1_name.c: teach lookup context to get_sha1_with_context()
sha1_name.c: many short names can only be committish
sha1_name.c: get_sha1_1() takes lookup flags
sha1_name.c: get_describe_name() by definition groks only commits
sha1_name.c: teach get_short_sha1() a commit-only option
sha1_name.c: allow get_short_sha1() to take other flags
get_sha1(): fix error status regression
sha1_name.c: restructure disambiguation of short names
sha1_name.c: correct misnamed "canonical" and "res"
...
The existing "cant_be_filename" that tells the function that the
caller knows the arg is not a path (hence it does not have to be
checked for absense of the file whose name matches it) is made into
a bit in the flag word.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
git usually streams large blobs directly to packs. But there are cases
where git can create large loose blobs (unpack-objects or hash-object
over pipe). Or they can come from other git implementations.
core.bigfilethreshold can also be lowered down and introduce a new
wave of large loose blobs.
Use streaming interface to read/compress/write these blobs in one
go. Fall back to normal way if somehow streaming interface cannot be
used.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function first decides if we want to copy data taken from existing
pack verbatim or we want to encode the data ourselves for the packfile
we are creating and then carries out the decision. Separate the latter
phase into two helper functions, one for the case the data is reused,
the other for the case the data is produced anew.
A little twist is that it can later turn out that we cannot reuse the
data after we initially decide to do so; in such a case, the "reuse"
helper makes a call to "generate" helper. It is easier to follow than
the current fallback code that uses "goto" inside a single large
function.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is because all other places do "xx > big_file_threshold"
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we pack everything into one big pack with "git repack
-Ad", any unreferenced objects in to-be-deleted packs are
exploded into loose objects, with the intent that they will
be examined and possibly cleaned up by the next run of "git
prune".
Since the exploded objects will receive the mtime of the
pack from which they come, if the source pack is old, those
loose objects will end up pruned immediately. In that case,
it is much more efficient to skip the exploding step
entirely for these objects.
This patch teaches pack-objects to receive the expiration
information and avoid writing these objects out. It also
teaches "git gc" to pass the value of gc.pruneexpire to
repack (which in turn learns to pass it along to
pack-objects) so that this optimization happens
automatically during "git gc" and "git gc --auto".
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It looks like commit 99fb6e04 (pack-objects: convert to use
parse_options(), 2012-02-01) moved the #ifdef NO_PTHREDS around but
hasn't noticed that the 'arg' variable no longer is available.
Signed-off-by: Michał Kiedrowicz <michal.kiedrowicz@gmail.com>
Acked-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The comment was introduced in b5d97e6 (pack-objects: run rev-list
equivalent internally. - 2006-09-04), stating that
git pack-objects [options] base-name <refs...>
is acceptable and refs should be passed into rev-list. But that's not
true. All arguments after base-name are ignored.
Remove the comment and reject this syntax (i.e. no more arguments after
base name)
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* maint:
Update draft release notes to 1.7.8.4
Update draft release notes to 1.7.7.6
Update draft release notes to 1.7.6.6
thin-pack: try harder to use preferred base objects as base
When creating a pack using objects that reside in existing packs, we try
to avoid recomputing futile delta between an object (trg) and a candidate
for its base object (src) if they are stored in the same packfile, and trg
is not recorded as a delta already. This heuristics makes sense because it
is likely that we tried to express trg as a delta based on src but it did
not produce a good delta when we created the existing pack.
As the pack heuristics prefer producing delta to remove data, and Linus's
law dictates that the size of a file grows over time, we tend to record
the newest version of the file as inflated, and older ones as delta
against it.
When creating a thin-pack to transfer recent history, it is likely that we
will try to send an object that is recorded in full, as it is newer. But
the heuristics to avoid recomputing futile delta effectively forbids us
from attempting to express such an object as a delta based on another
object. Sending an object in full is often more expensive than sending a
suboptimal delta based on other objects, and it is even more so if we
could use an object we know the receiving end already has (i.e. preferred
base object) as the delta base.
Tweak the recomputation avoidance logic, so that we do not punt on
computing delta against a preferred base object.
The effect of this change can be seen on two simulated upload-pack
workloads. The first is based on 44 reflog entries from my git.git
origin/master reflog, and represents the packs that kernel.org sent me git
updates for the past month or two. The second workload represents much
larger fetches, going from git's v1.0.0 tag to v1.1.0, then v1.1.0 to
v1.2.0, and so on.
The table below shows the average generated pack size and the average CPU
time consumed for each dataset, both before and after the patch:
dataset
| reflog | tags
---------------------------------
before | 53358 | 2750977
size after | 32398 | 2668479
change | -39% | -3%
---------------------------------
before | 0.18 | 1.12
CPU after | 0.18 | 1.15
change | +0% | +3%
This patch makes a much bigger difference for packs with a shorter slice
of history (since its effect is seen at the boundaries of the pack) though
it has some benefit even for larger packs.
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/stream-to-pack:
bulk-checkin: replace fast-import based implementation
csum-file: introduce sha1file_checkpoint
finish_tmp_packfile(): a helper function
create_tmp_packfile(): a helper function
write_pack_header(): a helper function
Conflicts:
pack.h
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>