This is not a lot of code, but it's a logical construct that
should not need to be repeated (and we are about to add a
third repetition).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the next commits, we will enhance the fsck_tag() function to check
tag objects more thoroughly. To this end, we need a function to verify
that a given string is a valid object type, but that does not die() in
the negative case.
While at it, prepare type_from_string() for counted strings, i.e. strings
with an explicitly specified length rather than a NUL termination.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make sure all in-core commit objects are assigned a unique number
so that they can be annotated using the commit-slab API.
* jk/alloc-commit-id:
diff-tree: avoid lookup_unknown_object
object_as_type: set commit index
alloc: factor out commit index
add object_as_type helper for casting objects
parse_object_buffer: do not set object type
move setting of object->type to alloc_* functions
alloc: write out allocator definitions
alloc.c: remove the alloc_raw_commit_node() function
The point of the "index" field of struct commit is that
every allocated commit would have one. It is supposed to be
an invariant that whenever object->type is set to
OBJ_COMMIT, we have a unique index.
Commit 969eba6 (commit: push commit_index update into
alloc_commit_node, 2014-06-10) covered this case for
newly-allocated commits. However, we may also allocate an
"unknown" object via lookup_unknown_object, and only later
convert it to a commit. We must make sure that we set the
commit index when we switch the type field.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we call lookup_commit, lookup_tree, etc, the logic goes
something like:
1. Look for an existing object struct. If we don't have
one, allocate and return a new one.
2. Double check that any object we have is the expected
type (and complain and return NULL otherwise).
3. Convert an object with type OBJ_NONE (from a prior
call to lookup_unknown_object) to the expected type.
We can encapsulate steps 2 and 3 in a helper function which
checks whether we have the expected object type, converts
OBJ_NONE as appropriate, and returns the object.
Not only does this shorten the code, but it also provides
one central location for converting OBJ_NONE objects into
objects of other types. Future patches will use that to
enforce type-specific invariants.
Since this is a refactoring, we would want it to behave
exactly as the current code. It takes a little reasoning to
see that this is the case:
- for lookup_{commit,tree,etc} functions, we are just
pulling steps 2 and 3 into a function that does the same
thing.
- for the call in peel_object, we currently only do step 3
(but we want to consolidate it with the others, as
mentioned above). However, step 2 is a noop here, as the
surrounding conditional makes sure we have OBJ_NONE
(which we want to keep to avoid an extraneous call to
sha1_object_info).
- for the call in lookup_commit_reference_gently, we are
currently doing step 2 but not step 3. However, step 3
is a noop here. The object we got will have just come
from deref_tag, which must have figured out the type for
each object in order to know when to stop peeling.
Therefore the type will never be OBJ_NONE.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The only way that "obj" can be non-NULL is if it came from
one of the lookup_* functions. These functions always ensure
that the object has the expected type (and return NULL
otherwise), so there is no need for us to set the type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "struct object" type implements basic object
polymorphism. Individual instances are allocated as
concrete types (or as a union type that can store any
object), and a "struct object *" can be cast into its real
type after examining its "type" enum. This means it is
dangerous to have a type field that does not match the
allocation (e.g., setting the type field of a "struct blob"
to "OBJ_COMMIT" would mean that a reader might read past the
allocated memory).
In most of the current code this is not a problem; the first
thing we do after allocating an object is usually to set its
type field by passing it to create_object. However, the
virtual commits we create in merge-recursive.c do not ever
get their type set. This does not seem to have caused
problems in practice, though (presumably because we always
pass around a "struct commit" pointer and never even look at
the type).
We can fix this oversight and also make it harder for future
code to get it wrong by setting the type directly in the
object allocation functions.
This will also make it easier to fix problems with commit
index allocation, as we know that any object allocated by
alloc_commit_node will meet the invariant that an object
with an OBJ_COMMIT type field will have a unique index
number.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Copying the first bytes of a SHA1 is duplicated in six places,
however, the implications (the actual value would depend on the
endianness of the platform) is documented only once.
Add a properly documented API for this.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Most callsites which use the commit buffer try to use the
cached version attached to the commit, rather than
re-reading from disk. Unfortunately, that interface provides
only a pointer to the NUL-terminated buffer, with no
indication of the original length.
For the most part, this doesn't matter. People do not put
NULs in their commit messages, and the log code is happy to
treat it all as a NUL-terminated string. However, some code
paths do care. For example, when checking signatures, we
want to be very careful that we verify all the bytes to
avoid malicious trickery.
This patch just adds an optional "size" out-pointer to
get_commit_buffer and friends. The existing callers all pass
NULL (there did not seem to be any obvious sites where we
could avoid an immediate strlen() call, though perhaps with
some further refactoring we could).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some call sites check commit->buffer to see whether we have
a cached buffer, and if so, do some work with it. In the
long run we may want to switch these code paths to make
their decision on a different boolean flag (because checking
the cache may get a little more expensive in the future).
But for now, we can easily support them by converting the
calls to use get_cached_commit_buffer.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Right now this is just a one-liner, but abstracting it will
make it easier to change later.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make "git grep" and "git show" pay attention to --textconv when
dealing with blob objects.
* mg/more-textconv:
grep: honor --textconv for the case rev:path
grep: allow to use textconv filters
t7008: demonstrate behavior of grep with textconv
cat-file: do not die on --textconv without textconv filters
show: honor --textconv for blobs
diff_opt: track whether flags have been set explicitly
t4030: demonstrate behavior of show with textconv
hashtable_index() appears to be a close duplicate of hash_obj().
Keep only the later and make it usable for all cases.
Also remove the modulus as this is an expensive operation.
The size argument is always a power of 2 anyway, so a simple
mask operation provides the same result.
On a 'git rev-list --all --objects' run this decreased the time spent
in lookup_object from 27.5% to 24.1%.
[jc: with a few comments on "modulus turned into mask" by Peff]
Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If we exit early in the function parse_object_buffer, we did not
write to *eaten_p. Then the calling function parse_object, which looks
like the following with respect to the eaten variable, cannot rely on a
proper value set in eaten, hence the freeing of the buffer depends
on random values in memory.
struct object *parse_object(const unsigned char *sha1)
{
int eaten;
...
obj = parse_object_buffer(sha1, type, size, buffer, &eaten);
if (!eaten)
free(buffer);
}
This change makes sure, the buffer freeing condition is deterministic.
Signed-off-by: Stefan Beller <stefanbeller@googlemail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Define memory ownership and lifetime rules for what for-each-ref
feeds to its callbacks (in short, "you do not own it, so make a
copy if you want to keep it").
* mh/reflife: (25 commits)
refs: document the lifetime of the args passed to each_ref_fn
register_ref(): make a copy of the bad reference SHA-1
exclude_existing(): set existing_refs.strdup_strings
string_list_add_refs_by_glob(): add a comment about memory management
string_list_add_one_ref(): rename first parameter to "refname"
show_head_ref(): rename first parameter to "refname"
show_head_ref(): do not shadow name of argument
add_existing(): do not retain a reference to sha1
do_fetch(): clean up existing_refs before exiting
do_fetch(): reduce scope of peer_item
object_array_entry: fix memory handling of the name field
find_first_merges(): remove unnecessary code
find_first_merges(): initialize merges variable using initializer
fsck: don't put a void*-shaped peg in a char*-shaped hole
object_array_remove_duplicates(): rewrite to reduce copying
revision: use object_array_filter() in implementation of gc_boundary()
object_array: add function object_array_filter()
revision: split some overly-long lines
cmd_diff(): make it obvious which cases are exclusive of each other
cmd_diff(): rename local variable "list" -> "entry"
...
Previously, the memory management of the object_array_entry::name
field was inconsistent and undocumented. object_array_entries are
ultimately created by a single function, add_object_array_with_mode(),
which has an argument "const char *name". This function used to
simply set the name field to reference the string pointed to by the
name parameter, and nobody on the object_array side ever freed the
memory. Thus, it assumed that the memory for the name field would be
managed by the caller, and that the lifetime of that string would be
at least as long as the lifetime of the object_array_entry. But
callers were inconsistent:
* Some passed pointers to constant strings or argv entries, which was
OK.
* Some passed pointers to newly-allocated memory, but didn't arrange
for the memory ever to be freed.
* Some passed the return value of sha1_to_hex(), which is a pointer to
a statically-allocated buffer that can be overwritten at any time.
* Some passed pointers to refnames that they received from a
for_each_ref()-type iteration, but the lifetimes of such refnames is
not guaranteed by the refs API.
Bring consistency to this mess by changing object_array to make its
own copy for the object_array_entry::name field and free this memory
when an object_array_entry is deleted from the array.
Many callers were passing the empty string as the name parameter, so
as a performance optimization, treat the empty string specially.
Instead of making a copy, store a pointer to a statically-allocated
empty string to object_array_entry::name. When deleting such an
entry, skip the free().
Change the callers that were already passing copies to
add_object_array_with_mode() to either skip the copy, or (if the
memory needed to be allocated anyway) freeing the memory itself.
A part of this commit effectively reverts
70d26c6e76 read_revisions_from_stdin: make copies for handle_revision_arg
because the copying introduced by that commit (which is still
necessary) is now done at a deeper level.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Optimizes object lookup when the object hashtable starts to become
crowded.
* jk/lookup-object-prefer-latest:
lookup_object: prioritize recently found objects
The old version copied one entry to its destination position, then
deleted any matching entries from the tail of the array. This
required the tail of the array to be copied multiple times. It didn't
affect the complexity of the algorithm because the whole tail has to
be searched through anyway. But all the copying was unnecessary.
Instead, check for the existence of an entry with the same name in the
*head* of the list before copying an entry to its final position.
This way each entry has to be copied at most one time.
Extract a helper function contains_name() to do a bit of the work.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a function that allows unwanted entries in an object_array to be
removed. This encapsulation is a step towards giving object_array
ownership of its entries' name memory.
Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Make "grep" honor the "--textconv" option also for the object case, i.e.
when used with an argument "rev:path".
Signed-off-by: Michael J Gruber <git@drmicha.warpmail.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The lookup_object function is backed by a hash table of all
objects we have seen in the program. We manage collisions
with a linear walk over the colliding entries, checking each
with hashcmp(). The main cost of lookup is in these
hashcmp() calls; finding our item in the first slot is
cheaper than finding it in the second slot, which is cheaper
than the third, and so on.
If we assume that there is some locality to the object
lookups (e.g., if X and Y collide, and we have just looked
up X, the next lookup is more likely to be for X than for
Y), then we can improve our average lookup speed by checking
X before Y.
This patch does so by swapping a found item to the front of
the collision chain. The p0001 perf test reveals that this
does indeed exploit locality in the case of "rev-list --all
--objects":
Test origin this tree
-------------------------------------------------------------------------
0001.1: rev-list --all 0.40(0.38+0.02) 0.40(0.36+0.03) +0.0%
0001.2: rev-list --all --objects 2.24(2.17+0.05) 1.86(1.79+0.05) -17.0%
This is not surprising, as the full object traversal will
hit the same tree entries over and over (e.g., for every
commit that doesn't change "Documentation/", we will have to
look up the same sha1 just to find out that we already
processed it).
The reason why this technique works (and does not violate
any properties of the hash table) is subtle and bears some
explanation. Let's imagine we get a lookup for sha1 `X`, and
it hashes to bucket `i` in our table. That stretch of the
table may look like:
index | i-1 | i | i+1 | i+2 |
-----------------------------------
entry ... | A | B | C | X | ...
-----------------------------------
We start our probe at i, see that B does not match, nor does
C, and finally find X. There may be multiple C's in the
middle, but we know that there are no empty slots (or else
we would not find X at all).
We do not know the original index of B; it may be `i`, or it
may be less than i (e.g., if it were `i-1`, it would collide
with A and spill over into the `i` bucket). So it is
acceptable for us to move it to the right of a contiguous
stretch of entries (because we will find it from a linear
walk starting anywhere at `i` or before), but never to the
left (if we moved it to `i-1`, we would miss it when
starting our walk at `i`).
We do know the original index of X; it is `i`, so it is safe
to place it anywhere in the contiguous stretch between `i`
and where we found it (`i+2` in the this case).
This patch does a pure swap; after finding X in the
situation above, we would end with:
index | i-1 | i | i+1 | i+2 |
-----------------------------------
entry ... | A | X | C | B | ...
-----------------------------------
We could instead bump X into the `i` slot, and then shift
the whole contiguous chain down by one, resulting in:
index | i-1 | i | i+1 | i+2 |
-----------------------------------
entry ... | A | X | B | C | ...
-----------------------------------
That puts our chain in true most-recently-used order.
However, experiments show that it is not any faster (and in
fact, is slightly slower due to the extra manipulation).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many call-sites of parse_object assume that they will get a
non-NULL return value; this is not the case if we encounter
an error while parsing the object.
This patch adds a wrapper function around parse_object that
handles dying automatically, and uses it anywhere we
immediately try to access the return value as a non-NULL
pointer (i.e., anywhere that we would currently segfault).
This wrapper may also be useful in other places. The most
obvious one is code like:
o = parse_object(sha1);
if (!o)
die(...);
However, these should not be mechanically converted to
parse_object_or_die, as the die message is sometimes
customized. Later patches can address these sites on a
case-by-case basis.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The error handling routines add a newline. Remove
the duplicate ones in error messages.
Signed-off-by: Pete Wyckoff <pw@padd.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git push --recurse-submodules" learns to optionally look into the
histories of submodules bound to the superproject and push them out.
By Heiko Voigt
* hv/submodule-recurse-push:
push: teach --recurse-submodules the on-demand option
Refactor submodule push check to use string list instead of integer
Teach revision walking machinery to walk multiple times sequencially
Previously it was not possible to iterate revisions twice using the
revision walking api. We add a reset_revision_walk() which clears the
used flags. This allows us to do multiple sequencial revision walks.
We add the appropriate calls to the existing submodule machinery doing
revision walks. This is done to avoid surprises if future code wants to
call these functions more than once during the processes lifetime.
Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditionally, all the callers of check_sha1_signature() first
called read_sha1_file() to prepare the whole object data in core,
and called this function. The function is used to revalidate what
we read from the object database actually matches the object name we
used to ask for the data from the object database.
Update the API to allow callers to pass NULL as the object data, and
have the function read and hash the object data using streaming API
to recompute the object name, without having to hold everything in
core at the same time. This is most useful in parse_object() that
parses a blob object, because this caller does not have to keep the
actual blob data around in memory after a "struct blob" is returned.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When parse_object is called, we do the following:
1. read the object data into a buffer via read_sha1_file
2. call parse_object_buffer, which then:
a. calls the appropriate lookup_{commit,tree,blob,tag}
to either create a new "struct object", or to find
an existing one. We know the appropriate type from
the lookup in step 1.
b. calls the appropriate parse_{commit,tree,blob,tag}
to parse the buffer for the new (or existing) object
In step 2b, all of the called functions are no-ops for
object "X" if "X->object.parsed" is set. I.e., when we have
already parsed an object, we end up going to a lot of work
just to find out at a low level that there is nothing left
for us to do (and we throw away the data from read_sha1_file
unread).
We can optimize this by moving the check for "do we have an
in-memory object" from 2a before the expensive call to
read_sha1_file in step 1.
This might seem circular, since step 2a uses the type
information determined in step 1 to call the appropriate
lookup function. However, we can notice that all of the
lookup_* functions are backed by lookup_object. In other
words, all of the objects are kept in a master hash table,
and we don't actually need the type to do the "do we have
it" part of the lookup, only to do the "and create it if it
doesn't exist" part.
This can save time whenever we call parse_object on the same
sha1 twice in a single program. Some code paths already
perform this optimization manually, with either:
if (!obj->parsed)
obj = parse_object(obj->sha1);
if you already have a "struct object", or:
struct object *obj = lookup_unknown_object(sha1);
if (!obj || !obj->parsed)
obj = parse_object(sha1);
if you don't. This patch moves the optimization into
parse_object itself.
Most git operations won't notice any impact. Either they
don't parse a lot of duplicate sha1s, or the calling code
takes special care not to re-parse objects. I timed two
code paths that do benefit (there may be more, but these two
were immediately obvious and easy to time).
The first is fast-export, which calls parse_object on each
object it outputs, like this:
object = parse_object(sha1);
if (!object)
die(...);
if (object->flags & SHOWN)
return;
which means that just to realize we have already shown an
object, we will read the whole object from disk!
With this patch, my best-of-five time for "fast-export --all" on
git.git dropped from 26.3s to 21.3s.
The second case is upload-pack, which will call parse_object
for each advertised ref (because it needs to peel tags to
show "^{}" entries). This doesn't matter for most
repositories, because they don't have a lot of refs pointing
to the same objects. However, if you have a big alternates
repository with a shared object db for a number of child
repositories, then the alternates repository will have
duplicated refs representing each of its children.
For example, GitHub's alternates repository for git.git has
~120,000 refs, of which only ~3200 are unique. The time for
upload-pack to print its list of advertised refs dropped
from 3.4s to 0.76s.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When receive-pack & fetch-pack are run and store the pack obtained over
the wire to a local repository, they internally run the index-pack command
with the --strict option. Make sure that we reject incoming packfile that
records objects twice to avoid spreading such a damage.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Most callers want to silently get a replacement object, and they do not
care what the real name of the replacement object is. Worse yet, no sane
interface to return the underlying object without replacement is provided.
Remove the function and make only the few callers that want the name of
the replacement object find it themselves.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change some expanded tabs (spaces) to tabs in object.c.
Signed-off-by: Jared Hance <jaredhance@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 0e87c36 (object: call "check_sha1_signature" with the
replacement sha1) changed the first argument passed to
parse_object_buffer() from "sha1" to "repl". With that change,
the returned obj pointer has the replacement SHA1 in obj->sha1,
not the original one.
But when using lookup_commit() and then parse_commit() on a
commit, we get an object pointer with the original sha1, but
the commit content comes from the replacement commit.
So the result we get from using parse_object() is different
from the we get from using lookup_commit() followed by
parse_commit().
It looks much simpler and safer to fix this inconsistency by
passing "sha1" to parse_object_bufer() instead of "repl".
The commit comment should be used to tell the the replacement
commit is replacing another commit and why. So it should be
easy to see that we have a replacement commit instead of an
original one.
And it is not a problem if the content of the commit is not
consistent with the sha1 as cat-file piped to hash-object can
be used to see the difference.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When passed an empty list, objects_array_remove_duplicates() corrupts it
by changing the number of entries from 0 to 1.
The problem lies in the condition of its main loop:
for (ref = 0; ref < array->nr - 1; ref++) {
The loop body manipulates the supplied object array. In the case of an
empty array, it should not be doing anything at all. But array->nr is an
unsigned quantity, so the code enters the loop, in particular increasing
array->nr. Fix this by comparing (ref + 1 < array->nr) instead.
This bug can be triggered by git bundle --stdin:
$ echo HEAD | git bundle create some.bundle --stdin’
Segmentation fault (core dumped)
The list of commits to bundle appears to be empty because of another bug:
by the time the revision-walking machinery gets to look at it, standard
input has already been consumed by rev-list, so this function gets an
empty list of revisions.
After this patch, git bundle --stdin still does not work; it just doesn’t
segfault any more.
Reported-by: Joey Hess <joey@kitenet.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Otherwise we get a "sha1 mismatch" error for replaced objects.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our hash_obj and hashtable_index calls and functions were doing a lot of
funny things with signedness. Unify all of it to 'unsigned int'.
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In these two places we are casting part of our unsigned char sha1 array into
an unsigned int, which violates GCCs strict-aliasing rules (and probably
other compilers).
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git bundle create x master master" used to create a bundle that lists
the same branch (master) twice. Cloning from such a bundle resulted in
a needless warning "warning: Duplicated ref: refs/remotes/origin/master".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the case of an malformed object, the object specific parsing functions
would return an error, which is currently ignored. The object can be partial
initialized in this case.
This patch make parse_object_buffer propagate such errors.
Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead, signal the error just like the case we do upon encountering
an object with an unknown type.
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When scanning the trees in track_tree_refs() there is a "lazy" test
that assumes that entries are either directories or files. Don't do
that.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When check_sha1_signature fails, program is not terminated:
it prints an error message and returns NULL, so the
buffer returned by read_sha1_file should be freed before.
Signed-off-by: Carlos Rica <jasampler@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Each object in struct object_array is extended with the mode.
If not specified, S_IFINVALID is used. An object with an mode value
can be added with add_object_array_with_mode.
Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This replaces the fairly odd "created_object()" function that did _most_
of the object setup with a more complete "create_object()" function that
also has a more natural calling convention.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We used to use a different allocator scheme for when we didn't know the
object type. That meant that objects that were created without any
up-front knowledge of the type would not go through the same allocation
paths as normal object allocations, and would miss out on the statistics.
But perhaps more importantly than the statistics (that are useful when
looking at memory usage but not much else), if we want to make the
object hash tables use a denser object pointer representation, we need
to make sure that they all go through the same blocking allocator.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Looking at the SHA1 validation code due to the corruption that Alexander
Litvinov is seeing under Cygwin, I notice that one of the most central
places where we read objects, we actually do end up verifying the SHA1 of
the result, but then we happily parse it anyway.
And using "printf" to write the error message means that it not only can
get lost, but will actually mess up stdout, and cause other strange and
hard-to-debug failures downstream.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This function is called only once in the whole source tree. Let's move
its code inline instead, which is also in the spirit of removing as much
object type char arrays as possible (not that this patch does anything for
that but at least it is now a local matter).
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We currently have two parallel notation for dealing with object types
in the code: a string and a numerical value. One of them is obviously
redundent, and the most used one requires more stack space and a bunch
of strcmp() all over the place.
This is an initial step for the removal of the version using a char array
found in object reading code paths. The patch is unfortunately large but
there is no sane way to split it in smaller parts without breaking the
system.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Sometime typename() is used, sometimes type_names[] is accessed directly.
Let's enforce typename() all the time which allows for validating the
type.
Also let's add a function to go from a name to a type and use it instead
of manual memcpy() when appropriate.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This adds a new command, git-for-each-ref. You can have it iterate
over refs and have it output various aspects of the objects they
refer to.
Signed-off-by: Junio C Hamano <junkio@cox.net>
This abstracts away the size of the hash values when copying them
from memory location to memory location, much as the introduction
of hashcmp abstracted away hash value comparsion.
A few call sites were using char* rather than unsigned char* so
I added the cast rather than open hashcpy to be void*. This is a
reasonable tradeoff as most call sites already use unsigned char*
and the existing hashcmp is also declared to be unsigned char*.
[jc: Splitted the patch to "master" part, to be followed by a
patch for merge-recursive.c which is not in "master" yet.
Fixed the cast in the latter hunk to combine-diff.c which was
wrong in the original.
Also converted ones left-over in combine-diff.c, diff-lib.c and
upload-pack.c ]
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Introduces global inline:
hashcmp(const unsigned char *sha1, const unsigned char *sha2)
Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).
Acked-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This updates the type-enumeration constants introduced to reduce
the memory footprint of "struct object" to match the type bits
already used in the packfile format, by removing the former
(i.e. TYPE_* constant macros) and using the latter (i.e. enum
object_type) throughout the code for consistency.
Eventually we can stop passing around the "type strings"
entirely, and this will help - no confusion about two different
integer enumeration.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This IMNSHO cleans up the object hashing.
The hash expansion is separated out into a function of its own, the hash
array (and size) names are made more obvious, and the code is generally
made to look a bit more like the object-ref hashing.
It also gets rid of "find_object()" returning an index (or negative
position if no object is found), since that is made redundant by the
simplified object rehashing. The basic operation is now "lookup_object()"
which just returns the object itself.
There's an almost unmeasurable speed increase, but more importantly, I
think the end result is more readable.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
There are a few special places where some programs accessed the object
hash array directly, which bothered me because I wanted to play with some
simple re-organizations.
So this patch makes the object hash array data structures all entirely
local to object.c, and the few users who wanted to look at it now get to
use a function to query how many object index entries there can be, and to
actually access the array.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We've had this notion of a "object_list" for a long time, which eventually
grew a "name" member because some users (notably git-rev-list) wanted to
name each object as it is generated.
That object_list is great for some things, but it isn't all that wonderful
for others, and the "name" member is generally not used by everybody.
This patch splits the users of the object_list array up into two: the
traditional list users, who want the list-like format, and who don't
actually use or want the name. And another class of users that really used
the list as an extensible array, and generally wanted to name the objects.
The patch is fairly straightforward, but it's also biggish. Most of it
really just cleans things up: switching the revision parsing and listing
over to the array makes things like the builtin-diff usage much simpler
(we now see exactly how many members the array has, and we don't get the
objects reversed from the order they were on the command line).
One of the main reasons for doing this at all is that the malloc overhead
of the simple object list was actually pretty high, and the array is just
a lot denser. So this patch brings down memory usage by git-rev-list by
just under 3% (on top of all the other memory use optimizations) on the
mozilla archive.
It does add more lines than it removes, and more importantly, it adds a
whole new infrastructure for maintaining lists of objects, but on the
other hand, the new dynamic array code is pretty obvious. The change to
builtin-diff-tree.c shows a fairly good example of why an array interface
is sometimes more natural, and just much simpler for everybody.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This shrinks "struct object" to the absolutely minimal size possible.
It now contains /only/ the object flags and the SHA1 hash name of the
object.
The "refs" field, which is really needed only for fsck, is maintained in
a separate hashed lookup-table, allowing all normal users to totally
ignore it.
This helps memory usage, although not as much as I hoped: it looks like
the allocation overhead of malloc (and the alignment constraints in
particular) means that while the structure size shrinks, the actual
allocation overhead mostly does not.
[ That said: memory usage is actually down, but not as much as it should
be: I suspect just one of the object types actually ended up shrinking
its effective allocation size.
To get to the next level, we probably need specialized allocators that
don't pad the allocation more than necessary. ]
The separation makes for some code cleanup, though, and makes the ref
tracking that fsck wants a clearly separate thing.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This shrinks "struct object" by a small amount, by getting rid of the
"struct type *" pointer and replacing it with a 3-bit bitfield instead.
In addition, we merge the bitfields and the "flags" field, which
incidentally should also remove a useless 4-byte padding from the object
when in 64-bit mode.
Now, our "struct object" is still too damn large, but it's now less
obviously bloated, and of the remaining fields, only the "util" (which is
not used by most things) is clearly something that should be eventually
discarded.
This shrinks the "git-rev-list --all" memory use by about 2.5% on the
kernel archive (and, perhaps more importantly, on the larger mozilla
archive). That may not sound like much, but I suspect it's more on a
64-bit platform.
There are other remaining inefficiencies (the parent lists, for example,
probably have horrible malloc overhead), but this was pretty obvious.
Most of the patch is just changing the comparison of the "type" pointer
from one of the constant string pointers to the appropriate new TYPE_xxx
small integer constant.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This is preparatory work for further cleanups, where we try to make
tree_entry look more like the more efficient tree-walk descriptor.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This allows us to avoid allocating information for names etc, because
we can just use the information from the tree buffer directly.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
This replaces occurences of "blob", "commit", "tag", and "tree",
where they're really used as type specifiers, which we already
have defined global constants for.
Signed-off-by: Peter Eriksen <s022018@student.dtu.dk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The hashed object lookup had a subtle bug in re-hashing: it did
for (i = 0; i < count; i++)
if (objs[i]) {
.. rehash ..
where "count" was the old hash couny. Oon the face of it is obvious, since
it clearly re-hashes all the old objects.
However, it's wrong.
If the last old hash entry before re-hashing was in use (or became in use
by the re-hashing), then when re-hashing could have inserted an object
into the hash entries with idx >= count due to overflow. When we then
rehash the last old entry, that old entry might become empty, which means
that the overflow entries should be re-hashed again.
In other words, the loop has to be fixed to either traverse the whole
array, rather than just the old count.
(There's room for a slight optimization: instead of counting all the way
up, we can break when we see the first empty slot that is above the old
"count". At that point we know we don't have any collissions that we might
have to fix up any more. This patch only does the trivial fix)
[jc: with trivial fix on trivial fix]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Calling hashtable_index from find_object before objs is created
would result in division by zero failure. Avoid it.
Also the given object name may not be aligned suitably for
unsigned int; avoid dereferencing casted pointer.
Signed-off-by: Junio C Hamano <junkio@cox.net>
In a simple test, this brings down the CPU time from 47 sec to 22 sec.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
ISO C99 (and GCC 3.x or later) lets you write a flexible array
at the end of a structure, like this:
struct frotz {
int xyzzy;
char nitfol[]; /* more */
};
GCC 2.95 and 2.96 let you to do this with "char nitfol[0]";
unfortunately this is not allowed by ISO C90.
This declares such construct like this:
struct frotz {
int xyzzy;
char nitfol[FLEX_ARRAY]; /* more */
};
and git-compat-util.h defines FLEX_ARRAY to 0 for gcc 2.95 and
empty for others.
If you are using a C90 C compiler, you should be able
to override this with CFLAGS=-DFLEX_ARRAY=1 from the
command line of "make".
Signed-off-by: Junio C Hamano <junkio@cox.net>
Morten Welinder <mwelinder@gmail.com> writes:
> The code looks wrong. It assumes that pointers are no larger than ints.
> If pointers are larger than ints, the code does not necessarily compute
> a consistent ordering and qsort is allowed to do whatever it wants.
>
> Morten
>
> static int compare_object_pointers(const void *a, const void *b)
> {
> const struct object * const *pa = a;
> const struct object * const *pb = b;
> return *pa - *pb;
> }
Signed-off-by: Junio C Hamano <junkio@cox.net>
Store pointers to referenced objects in a variable sized array instead
of linked list. This cuts down memory usage of utilities which use
object references; e.g., git-fsck-objects --full on the git.git
repository consumes about 2 MB of memory tracked by Massif instead of
7 MB before the change. Object refs are still the biggest consumer of
memory (57%), but the malloc overhead for a single block instead of a
linked list is substantially smaller.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Junio C Hamano <junkio@cox.net>
The object parsing code builds a generic "this object references that
object" because doing a full connectivity check for fsck requires it.
However, nothing else really needs it, and it's quite expensive for
git-rev-list that can have tons of objects in flight.
So, exactly like the commit buffer save thing, add a global flag to
disable it, and use it in git-rev-list.
Before:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
12.28user 0.29system 0:12.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+26718minor)pagefaults 0swaps
59124
After this change:
$ /usr/bin/time git-rev-list --objects v2.6.12..HEAD | wc -l
10.33user 0.18system 0:10.54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+18509minor)pagefaults 0swaps
59124
and note how the number of pages touched by git-rev-list for this
particular object list has shrunk from 26,718 (104 MB) to 18,509 (72 MB).
Calculating the total object difference between two git revisions is still
clearly the most expensive git operation (both in memory and CPU time),
but it's now less than 40% of what it used to be.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Add function to look up an object which is entirely unknown, so that
it can be put in a list. Various other functions related to lists of
objects.
Signed-off-by: Daniel Barkalow <barkalow@iabervon.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Packed delta files created by git-pack-objects seems to be the
way to go, and existing "delta" object handling code has exposed
the object representation details to too many places. Remove it
while we refactor code to come up with a proper interface in
sha1_file.c.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle parsing a tag for a non-present object. This adds a function to lookup
an object with lookup_* for * in a string, so that it can get the right storage
based on the "type" line in the tag.
Signed-off-by: Daniel Barkalow <barkalow@iabervon.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make 'sha1' parameters const where possible
Signed-off-by: Jason McMullan <jason.mcmullan@timesys.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add <limits.h> to the include files handled by "cache.h", and remove
extraneous #include directives from various .c files. The rule is that
"cache.h" gets all the basic stuff, so that we'll have as few system
dependencies as possible.
This adds knowledge of delta objects to fsck-cache and various object
parsing code. A new switch to git-fsck-cache is provided to display the
maximum delta depth found in a repository.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It turns out that parse_object() is loading and decompressing given
object to free it just before calling the specific object parsing
function which does mmap and decompress the same object again. This
patch introduces the ability to parse specific objects directly from a
memory buffer.
Without this patch, running git-fsck-cache on the kernel repositorytake:
real 0m13.006s
user 0m11.421s
sys 0m1.218s
With this patch applied:
real 0m8.060s
user 0m7.071s
sys 0m0.710s
The performance increase is significant, and this is kind of a
prerequisite for sane delta object support with fsck.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes memory leaks in parse_object() and related functions;
these leaks were very noticeable when running git-fsck-cache.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a function that parses an object from the database when we have
to look up its actual type. It also checks the hash of the file, due to
its heritage as part of fsck-cache.
Signed-Off-By: Daniel Barkalow <barkalow@iabervon.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce xmalloc and xrealloc to die gracefully with a descriptive
message when out of memory, rather than taking a SIGSEGV.
Signed-off-by: Christopher Li<chrislgit@chrisli.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>