2018-03-23 18:20:55 +01:00
|
|
|
#ifndef OBJECT_STORE_H
|
|
|
|
#define OBJECT_STORE_H
|
|
|
|
|
2018-08-15 19:54:05 +02:00
|
|
|
#include "cache.h"
|
2018-04-12 02:21:05 +02:00
|
|
|
#include "oidmap.h"
|
2018-07-12 00:42:38 +02:00
|
|
|
#include "list.h"
|
|
|
|
#include "sha1-array.h"
|
|
|
|
#include "strbuf.h"
|
object-store: allow threaded access to object reading
Allow object reading to be performed by multiple threads protecting it
with an internal lock, the obj_read_mutex. The lock usage can be toggled
with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the
functions which can be safely called in parallel are:
read_object_file_extended(), repo_read_object_file(),
read_object_file(), read_object_with_reference(), read_object(),
oid_object_info() and oid_object_info_extended(). It's also possible
to use obj_read_lock() and obj_read_unlock() to protect other sections
that cannot execute in parallel with object reading.
Probably there are many spots in the functions listed above that could
be executed unlocked (and thus, in parallel). But, for now, we are most
interested in allowing parallel access to zlib inflation. This is one of
the sections where object reading spends most of its time (e.g. up to
one-third of git-grep's execution time in the chromium repo corresponds
to inflation) and it's already thread-safe. So, to take advantage of
that, the obj_read_mutex is released when calling git_inflate() and
re-acquired right after, for every calling spot in
oid_object_info_extended()'s call chain. We may refine this lock to also
exploit other possible parallel spots in the future, but for now,
threaded zlib inflation should already give great speedups for threaded
object reading callers.
Note that add_delta_base_cache() was also modified to skip adding
already present entries to the cache. Duplicate insertions weren't
possible before, but they would be now, with the parallel inflation.
Take for example the
following situation, where two threads - A and B - are executing the
code at unpack_entry():
1. Thread A is performing the decompression of a base O (which is not
   yet in the cache) at PHASE II. Thread B is simultaneously trying to
   unpack O, but just starting at PHASE I.
2. Since O is not yet in the cache, B will go to PHASE II to also
   perform the decompression.
3. When they finish decompressing, one of them will get the object
   reading mutex and go to PHASE III while the other waits for the
   mutex. Let's say A got the mutex first.
4. Thread A will add O to the cache, go through the rest of PHASE III
   and return.
5. Thread B gets the mutex, would also add O to the cache (if the check
   weren't there) and returns.
Finally, it is also important to highlight that the object reading lock
can only ensure thread-safety in the mentioned functions thanks to two
complementary mechanisms: the use of 'struct raw_object_store's
replace_mutex, which guards sections in the object reading machinery
that would otherwise be thread-unsafe; and the 'struct pack_window's
inuse_cnt, which protects window reading operations (such as the one
performed during the inflation of a packed object), allowing them to
execute without the acquisition of the obj_read_mutex.
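For illustration only, the lock handoff and the duplicate check described
above could be sketched as follows. This is not the code in packfile.c:
read_mutex, cache_get(), cache_add(), expensive_decode() and unpack() are
hypothetical names, and the cache is a toy array standing in for the
delta base cache.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins; none of these names exist in Git. */
static pthread_mutex_t read_mutex = PTHREAD_MUTEX_INITIALIZER;

#define CACHE_SLOTS 8
static int cache_keys[CACHE_SLOTS];
static int cache_vals[CACHE_SLOTS];
static int cache_len;

/* Caller must hold read_mutex. Returns 1 and sets *val on a hit. */
static int cache_get(int key, int *val)
{
	for (int i = 0; i < cache_len; i++) {
		if (cache_keys[i] == key) {
			*val = cache_vals[i];
			return 1;
		}
	}
	return 0;
}

/* Caller must hold read_mutex. Skips keys that are already present. */
static void cache_add(int key, int val)
{
	int unused;

	if (cache_get(key, &unused))
		return;
	assert(cache_len < CACHE_SLOTS);
	cache_keys[cache_len] = key;
	cache_vals[cache_len] = val;
	cache_len++;
}

/* Stand-in for an expensive but thread-safe step such as inflation. */
static int expensive_decode(int key)
{
	return key * 2;
}

static int unpack(int key)
{
	int val;

	pthread_mutex_lock(&read_mutex);
	if (cache_get(key, &val)) {          /* PHASE I: look in the cache */
		pthread_mutex_unlock(&read_mutex);
		return val;
	}
	pthread_mutex_unlock(&read_mutex);   /* PHASE II runs unlocked */
	val = expensive_decode(key);
	pthread_mutex_lock(&read_mutex);     /* PHASE III: update the cache */
	cache_add(key, val);
	pthread_mutex_unlock(&read_mutex);
	return val;
}
```

If two threads both miss in PHASE I and decode the same key, the second
cache_add() call finds the existing entry and returns without inserting
a duplicate.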
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16 03:39:53 +01:00
|
|
|
#include "thread-utils.h"
|
2018-04-12 02:21:05 +02:00
|
|
|
|
2018-11-12 15:48:47 +01:00
|
|
|
struct object_directory {
|
|
|
|
struct object_directory *next;
|
2018-03-23 18:20:56 +01:00
|
|
|
|
|
|
|
/*
|
2018-11-12 15:50:56 +01:00
|
|
|
* Used to store the results of readdir(3) calls when we are OK
|
|
|
|
* sacrificing accuracy due to races for speed. That includes
|
sha1-file: use loose object cache for quick existence check
In cases where we expect to ask has_sha1_file() about a lot of objects
that we are not likely to have (e.g., during fetch negotiation), we
already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
a simultaneous write or repack) for speed (we avoid re-scanning the pack
directory).
However, even checking for loose objects can be expensive, as we will
stat() each one. On many systems this cost isn't too noticeable, but
stat() can be particularly slow on some operating systems, or due to
network filesystems.
Since the QUICK flag already tells us that we're OK with a slightly
stale answer, we can use that as a cue to look in our in-memory cache of
each object directory. That basically trades an in-memory binary search
for a stat() call.
Note that it is possible for this to actually be _slower_. We'll do a
full readdir() to fill the cache, so if you have a very large number of
loose objects and a very small number of lookups, that readdir() may end
up more expensive.
This shouldn't be a big deal in practice. If you have a large number of
reachable loose objects, you'll already run into performance problems
(which you should remedy by repacking). You may have unreachable objects
which wouldn't otherwise impact performance. Usually these would go away
with the prune step of "git gc", but they may be held for up to 2 weeks
in the default configuration.
So it comes down to how many such objects you might reasonably expect to
have, and how much slower readdir() on N entries is versus M stat() calls
(and here we really care about the syscall backing readdir(), like
getdents() on Linux, but I'll just call this readdir() below).
If N is much smaller than M (a typical packed repo), we know this is a
big win (few readdirs() followed by many uses of the resulting cache).
When N and M are similar in size, it's also a win. We care about the
latency of making a syscall, and readdir() should be giving us many
values in a single call. How many?
On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
entries per call (which is 64 bytes per entry; the name itself is 38
bytes, plus there are some other fields). So we can imagine that this is
always a win as long as the number of loose objects in the repository is
a factor of 500 less than the number of lookups you make. It's hard to
auto-tune this because we don't generally know up front how many lookups
we're going to do. But it's unlikely for this to perform significantly
worse.
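As a rough sketch of the trade described above (one directory scan up
front, then an in-memory binary search per lookup instead of a stat()
call), consider the following. It is not Git's implementation: struct
loose_cache, scan_directory() and quick_has() are hypothetical, and the
directory listing is passed in as an array rather than read with
readdir().

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 16

/* Hypothetical in-memory cache for one object subdirectory. */
struct loose_cache {
	const char *names[MAX_NAMES];
	size_t nr;
	int loaded;   /* set once the directory has been scanned */
	int scans;    /* how many times we paid for a full scan */
};

static int cmp_names(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Stand-in for one readdir() pass over the directory. */
static void scan_directory(struct loose_cache *c,
			   const char **listing, size_t nr)
{
	c->nr = nr < MAX_NAMES ? nr : MAX_NAMES;
	for (size_t i = 0; i < c->nr; i++)
		c->names[i] = listing[i];
	qsort(c->names, c->nr, sizeof(*c->names), cmp_names);
	c->loaded = 1;
	c->scans++;
}

/* Quick existence check: scan once, then binary search per lookup. */
static int quick_has(struct loose_cache *c,
		     const char **listing, size_t nr, const char *name)
{
	if (!c->loaded)
		scan_directory(c, listing, nr);
	return bsearch(&name, c->names, c->nr,
		       sizeof(*c->names), cmp_names) != NULL;
}
```

The answer can be stale if the directory changes after the scan, which
is the accuracy the QUICK flag already agrees to give up.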
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 15:54:42 +01:00
|
|
|
* object existence with OBJECT_INFO_QUICK, as well as
|
2018-11-12 15:50:56 +01:00
|
|
|
* our search for unique abbreviated hashes. Don't use it for tasks
|
|
|
|
* requiring greater accuracy!
|
|
|
|
*
|
|
|
|
* Be sure to call odb_loose_cache() before using.
|
2018-03-23 18:20:56 +01:00
|
|
|
*/
|
|
|
|
char loose_objects_subdir_seen[256];
|
2019-01-06 17:45:52 +01:00
|
|
|
struct oid_array loose_objects_cache[256];
|
2018-03-23 18:20:56 +01:00
|
|
|
|
2018-03-23 18:21:08 +01:00
|
|
|
/*
|
|
|
|
* Path to the alternative object store. If this is a relative path,
|
|
|
|
* it is relative to the current working directory.
|
|
|
|
*/
|
sha1-file: use an object_directory for the main object dir
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
  do_something(r->objects->objectdir);
  for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
        do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
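To make the unified list concrete, here is a minimal sketch of a caller
that no longer distinguishes the main directory from the alternates. The
names struct odb_sketch, for_each_dir() and count_dir() are hypothetical
and only mirror the shape of the real structs.

```c
#include <stddef.h>

/* Simplified, hypothetical mirror of the list described above. */
struct odb_sketch {
	struct odb_sketch *next;
	const char *path;
};

typedef int dir_fn(const struct odb_sketch *, void *);

/* Visit the main directory (the head) and every alternate in one loop. */
static int for_each_dir(const struct odb_sketch *head, dir_fn fn, void *data)
{
	for (const struct odb_sketch *odb = head; odb; odb = odb->next) {
		int ret = fn(odb, data);
		if (ret)
			return ret;   /* a non-zero return stops the walk */
	}
	return 0;
}

/* Example callback: count the directories that were visited. */
static int count_dir(const struct odb_sketch *odb, void *data)
{
	(void)odb;
	(*(int *)data)++;
	return 0;
}
```

A caller that does care about the distinction can treat the head of the
list as the main store and everything after it as an alternate.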
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 15:50:39 +01:00
|
|
|
char *path;
|
2018-03-23 18:20:57 +01:00
|
|
|
};
|
2018-11-12 15:50:39 +01:00
|
|
|
|
2018-03-23 18:21:09 +01:00
|
|
|
void prepare_alt_odb(struct repository *r);
|
2018-03-23 18:20:56 +01:00
|
|
|
char *compute_alternate_path(const char *path, struct strbuf *err);
|
2018-11-12 15:48:47 +01:00
|
|
|
typedef int alt_odb_fn(struct object_directory *, void *);
|
2018-03-23 18:20:56 +01:00
|
|
|
int foreach_alt_odb(alt_odb_fn, void*);
|
2019-07-01 15:17:40 +02:00
|
|
|
typedef void alternate_ref_fn(const struct object_id *oid, void *);
|
|
|
|
void for_each_alternate_ref(alternate_ref_fn, void *);
|
2018-03-23 18:20:56 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the directory to the on-disk alternates file; the new entry will also
|
|
|
|
* take effect in the current process.
|
|
|
|
*/
|
|
|
|
void add_to_alternates_file(const char *dir);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the directory to the in-memory list of alternates (along with any
|
|
|
|
* recursive alternates it points to), but do not modify the on-disk alternates
|
|
|
|
* file.
|
|
|
|
*/
|
|
|
|
void add_to_alternates_memory(const char *dir);
|
|
|
|
|
2019-01-06 17:45:30 +01:00
|
|
|
/*
|
|
|
|
* Populate and return the loose object cache array corresponding to the
|
|
|
|
* given object ID.
|
|
|
|
*/
|
|
|
|
struct oid_array *odb_loose_cache(struct object_directory *odb,
|
|
|
|
const struct object_id *oid);
|
|
|
|
|
2019-01-06 17:45:39 +01:00
|
|
|
/* Empty the loose object cache for the specified object directory. */
|
|
|
|
void odb_clear_loose_cache(struct object_directory *odb);
|
|
|
|
|
2018-03-23 18:20:59 +01:00
|
|
|
struct packed_git {
|
2019-11-27 23:24:53 +01:00
|
|
|
struct hashmap_entry packmap_ent;
|
2018-03-23 18:20:59 +01:00
|
|
|
struct packed_git *next;
|
|
|
|
struct list_head mru;
|
|
|
|
struct pack_window *windows;
|
|
|
|
off_t pack_size;
|
|
|
|
const void *index_data;
|
|
|
|
size_t index_size;
|
|
|
|
uint32_t num_objects;
|
|
|
|
uint32_t num_bad_objects;
|
|
|
|
unsigned char *bad_object_sha1;
|
|
|
|
int index_version;
|
|
|
|
time_t mtime;
|
|
|
|
int pack_fd;
|
2018-04-14 17:35:05 +02:00
|
|
|
int index; /* for builtin/pack-objects.c */
|
2018-03-23 18:20:59 +01:00
|
|
|
unsigned pack_local:1,
|
|
|
|
pack_keep:1,
|
2018-04-15 17:36:13 +02:00
|
|
|
pack_keep_in_core:1,
|
2018-03-23 18:20:59 +01:00
|
|
|
freshened:1,
|
|
|
|
do_not_close:1,
|
midx: add packs to packed_git linked list
The multi-pack-index allows searching for objects across multiple
packs using one object list. The original design gains many of
these performance benefits by keeping the packs in the
multi-pack-index out of the packed_git list.
Unfortunately, this has one major drawback. If the multi-pack-index
covers thousands of packs, and a command loads many of those packs,
then we can hit the limit for open file descriptors. The
close_one_pack() method is used to limit this resource, but it
only looks at the packed_git list, and uses an LRU cache to prevent
thrashing.
Instead of complicating this close_one_pack() logic to include
direct references to the multi-pack-index, simply add the packs
opened by the multi-pack-index to the packed_git list. This
immediately solves the file-descriptor limit problem, but requires
some extra steps to avoid performance issues or other problems:
1. Create a multi_pack_index bit in the packed_git struct that is
   one if and only if the pack was loaded from a multi-pack-index.
2. Skip packs with the multi_pack_index bit when doing object
   lookups and abbreviations. These algorithms already check the
   multi-pack-index before the packed_git struct. This has a very
   small performance hit, as we need to walk more packed_git
   structs. This is acceptable, since these operations run binary
   search on the other packs, so this walk-and-ignore logic is
   very fast by comparison.
3. When closing a multi-pack-index file, do not close its packs,
   as those packs will be closed using close_all_packs(). In some
   cases, such as 'git repack', we run 'close_midx()' without also
   closing the packs, so we need to un-set the multi_pack_index bit
   in those packs. This is necessary, and caught by running
   t6501-freshen-objects.sh with GIT_TEST_MULTI_PACK_INDEX=1.
To manually test this change, I inserted trace2 logging into
close_pack_fd() and set pack_max_fds to 10, then ran 'git rev-list
--all --objects' on a copy of the Git repo with 300+ pack-files and
a multi-pack-index. The logs verified the packs are closed as
we read them beyond the file descriptor limit.
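The walk-and-ignore logic from step 2 could look roughly like the sketch
below. It is only an illustration: struct pack_sketch and
find_in_packs() are hypothetical, and has_object stands in for a real
per-pack index lookup.

```c
#include <stddef.h>

struct pack_sketch {
	struct pack_sketch *next;
	unsigned multi_pack_index:1;  /* set if loaded from a midx */
	int has_object;               /* stand-in for an index lookup */
};

/* Return the first pack outside the midx that has the object, or NULL. */
static struct pack_sketch *find_in_packs(struct pack_sketch *head)
{
	for (struct pack_sketch *p = head; p; p = p->next) {
		if (p->multi_pack_index)
			continue;  /* the midx lookup already covered it */
		if (p->has_object)
			return p;
	}
	return NULL;
}
```

Skipping a flagged pack costs one bit test, which is small next to the
binary search run on each remaining pack.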
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-29 18:18:56 +02:00
|
|
|
pack_promisor:1,
|
|
|
|
multi_pack_index:1;
|
2019-02-19 01:05:03 +01:00
|
|
|
unsigned char hash[GIT_MAX_RAWSZ];
|
2018-03-23 18:20:59 +01:00
|
|
|
struct revindex_entry *revindex;
|
|
|
|
/* something like ".git/objects/pack/xxxxx.pack" */
|
|
|
|
char pack_name[FLEX_ARRAY]; /* more */
|
|
|
|
};
|
|
|
|
|
2018-07-12 21:39:23 +02:00
|
|
|
struct multi_pack_index;
|
|
|
|
|
2019-11-27 23:24:53 +01:00
|
|
|
static inline int pack_map_entry_cmp(const void *unused_cmp_data,
|
|
|
|
const struct hashmap_entry *entry,
|
|
|
|
const struct hashmap_entry *entry2,
|
|
|
|
const void *keydata)
|
|
|
|
{
|
|
|
|
const char *key = keydata;
|
|
|
|
const struct packed_git *pg1, *pg2;
|
|
|
|
|
|
|
|
pg1 = container_of(entry, const struct packed_git, packmap_ent);
|
|
|
|
pg2 = container_of(entry2, const struct packed_git, packmap_ent);
|
|
|
|
|
|
|
|
return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
|
|
|
|
}
|
|
|
|
|
2018-03-23 18:20:55 +01:00
|
|
|
struct raw_object_store {
|
|
|
|
/*
|
2018-11-12 15:50:39 +01:00
|
|
|
* Set of all object directories; the main directory is first (and
|
|
|
|
* cannot be NULL after initialization). Subsequent directories are
|
|
|
|
* alternates.
|
2018-03-23 18:20:55 +01:00
|
|
|
*/
|
2018-11-12 15:50:39 +01:00
|
|
|
struct object_directory *odb;
|
|
|
|
struct object_directory **odb_tail;
|
|
|
|
int loaded_alternates;
|
2018-03-23 18:20:55 +01:00
|
|
|
|
2018-11-12 15:50:39 +01:00
|
|
|
/*
|
|
|
|
* A list of alternate object directories loaded from the environment;
|
|
|
|
* this should not generally need to be accessed directly, but will
|
|
|
|
* populate the "odb" list when prepare_alt_odb() is run.
|
|
|
|
*/
|
2018-03-23 18:20:55 +01:00
|
|
|
char *alternate_db;
|
2018-03-23 18:20:57 +01:00
|
|
|
|
2018-04-12 02:21:05 +02:00
|
|
|
/*
|
|
|
|
* Objects that should be substituted by other objects
|
|
|
|
* (see git-replace(1)).
|
|
|
|
*/
|
2018-04-12 02:21:07 +02:00
|
|
|
struct oidmap *replace_map;
|
replace-object: make replace operations thread-safe
replace-object functions are very close to being thread-safe: the only
current racy section is the lazy initialization at
prepare_replace_object(). The following patches will protect some object
reading operations so they can be called from threads, but before that, replace
functions must be protected. To do so, add a mutex to struct
raw_object_store and acquire it before lazy initializing the
replace_map. This won't cause any noticeable performance drop as the
mutex will no longer be used after the replace_map is initialized.
Later, when the replace functions are called in parallel, thread
debuggers might flag our use of the added replace_map_initialized flag
as a data race. However, as this boolean variable is initialized as
false and is only updated once, there's no real harm. It's perfectly
fine if the value is updated right after a thread reads it in
replace-map.h:lookup_replace_object() (there'll only be a performance
penalty for the affected threads at that moment). We could silence the
debugger warning by protecting the variable read in said function.
However, this would negatively affect performance for all threads
calling it, at any time, so it's not really worth it since the warning
doesn't represent a real problem. Instead, to make sure we don't get
false positives (at ThreadSanitizer, at least) an entry for the
respective function is added to .tsan-suppressions.
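A minimal sketch of this lazy initialization pattern, assuming a
simplified store, is shown below. struct store_sketch and prepare_map()
are hypothetical names and do not reproduce the real functions. The
unlocked read of the flag is the access a thread debugger may report.

```c
#include <pthread.h>

struct store_sketch {
	int map_value;                 /* stand-in for the replace map */
	unsigned map_initialized : 1;
	pthread_mutex_t mutex;
	int init_runs;                 /* counts how often we initialized */
};

static void prepare_map(struct store_sketch *s)
{
	if (s->map_initialized)
		return;                /* fast path: no lock once set up */

	pthread_mutex_lock(&s->mutex);
	if (!s->map_initialized) {     /* re-check under the lock */
		s->map_value = 42;
		s->init_runs++;
		s->map_initialized = 1;
	}
	pthread_mutex_unlock(&s->mutex);
}
```

Once the flag is set, later callers return before touching the mutex,
which is why the lock adds no lasting cost.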
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16 03:39:52 +01:00
|
|
|
unsigned replace_map_initialized : 1;
|
|
|
|
pthread_mutex_t replace_mutex; /* protect object replace functions */
|
2018-04-12 02:21:05 +02:00
|
|
|
|
2018-07-12 00:42:41 +02:00
|
|
|
struct commit_graph *commit_graph;
|
|
|
|
unsigned commit_graph_attempted : 1; /* if loading has been attempted */
|
|
|
|
|
2018-07-12 21:39:33 +02:00
|
|
|
/*
|
|
|
|
* private data
|
|
|
|
*
|
|
|
|
* should only be accessed directly by packfile.c and midx.c
|
|
|
|
*/
|
|
|
|
struct multi_pack_index *multi_pack_index;
|
|
|
|
|
2018-03-23 18:20:59 +01:00
|
|
|
/*
|
|
|
|
* private data
|
|
|
|
*
|
|
|
|
* should only be accessed directly by packfile.c
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct packed_git *packed_git;
|
|
|
|
/* A most-recently-used ordered version of the packed_git list. */
|
|
|
|
struct list_head packed_git_mru;
|
2018-03-23 18:21:01 +01:00
|
|
|
|
2019-11-27 23:24:53 +01:00
|
|
|
/*
|
|
|
|
* A map of packfiles to packed_git structs for tracking which
|
|
|
|
* packs have been loaded already.
|
|
|
|
*/
|
|
|
|
struct hashmap pack_map;
|
|
|
|
|
2018-03-23 18:21:02 +01:00
|
|
|
/*
|
|
|
|
* A fast, rough count of the number of objects in the repository.
|
|
|
|
* These two fields are not meant for direct access. Use
|
|
|
|
* approximate_object_count() instead.
|
|
|
|
*/
|
|
|
|
unsigned long approximate_object_count;
|
|
|
|
unsigned approximate_object_count_valid : 1;
|
|
|
|
|
2018-03-23 18:21:01 +01:00
|
|
|
/*
|
|
|
|
* Whether packed_git has already been populated with this repository's
|
|
|
|
* packs.
|
|
|
|
*/
|
|
|
|
unsigned packed_git_initialized : 1;
|
2018-03-23 18:20:55 +01:00
|
|
|
};
|
|
|
|
|
|
|
|
struct raw_object_store *raw_object_store_new(void);
|
|
|
|
void raw_object_store_clear(struct raw_object_store *o);
|
|
|
|
|
2018-03-23 18:21:10 +01:00
|
|
|
/*
|
|
|
|
* Put in `buf` the name of the file in the local object database that
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 09:35:42 +01:00
|
|
|
* would be used to store a loose object with the specified oid.
|
2018-03-23 18:21:10 +01:00
|
|
|
*/
|
2019-01-07 09:35:42 +01:00
|
|
|
const char *loose_object_path(struct repository *r, struct strbuf *buf,
|
|
|
|
const struct object_id *oid);
|
2018-03-23 18:21:10 +01:00
|
|
|
|
2019-01-07 09:35:42 +01:00
|
|
|
void *map_loose_object(struct repository *r, const struct object_id *oid,
|
|
|
|
unsigned long *size);
|
2018-03-23 18:21:14 +01:00

2019-04-29 10:28:14 +02:00
void *read_object_file_extended(struct repository *r,
2019-04-29 10:28:23 +02:00
				const struct object_id *oid,
				enum object_type *type,
				unsigned long *size, int lookup_replace);
2018-11-14 01:12:47 +01:00
static inline void *repo_read_object_file(struct repository *r,
					  const struct object_id *oid,
					  enum object_type *type,
					  unsigned long *size)
2018-05-16 01:42:15 +02:00
{
2018-11-14 01:12:47 +01:00
	return read_object_file_extended(r, oid, type, size, 1);
2018-05-16 01:42:15 +02:00
}
2018-11-14 01:12:47 +01:00
#ifndef NO_THE_REPOSITORY_COMPATIBILITY_MACROS
#define read_object_file(oid, type, size) repo_read_object_file(the_repository, oid, type, size)
#endif
2018-05-16 01:42:15 +02:00

/* Read and unpack an object file into memory, write memory to an object file */
int oid_object_info(struct repository *r, const struct object_id *, unsigned long *);

2019-04-29 10:28:14 +02:00
int hash_object_file(const void *buf, unsigned long len,
2019-04-29 10:28:23 +02:00
		     const char *type, struct object_id *oid);
2018-05-16 01:42:15 +02:00

2019-04-29 10:28:14 +02:00
int write_object_file(const void *buf, unsigned long len,
2019-04-29 10:28:23 +02:00
		      const char *type, struct object_id *oid);
2018-05-16 01:42:15 +02:00

2019-04-29 10:28:14 +02:00
int hash_object_file_literally(const void *buf, unsigned long len,
2019-04-29 10:28:23 +02:00
			       const char *type, struct object_id *oid,
			       unsigned flags);
2018-05-16 01:42:15 +02:00

2019-04-29 10:28:14 +02:00
int pretend_object_file(void *, unsigned long, enum object_type,
2019-04-29 10:28:23 +02:00
			struct object_id *oid);
2018-05-16 01:42:15 +02:00

2019-04-29 10:28:14 +02:00
int force_object_loose(const struct object_id *oid, time_t mtime);
2018-05-16 01:42:15 +02:00

/*
 * Open the loose object at path, check its hash, and return the contents,
 * type, and size. If the object is a blob, then "contents" may return NULL,
 * to allow streaming of large blobs.
 *
 * Returns 0 on success, negative on error (details may be written to stderr).
 */
int read_loose_object(const char *path,
		      const struct object_id *expected_oid,
		      enum object_type *type,
		      unsigned long *size,
		      void **contents);

2018-11-14 01:12:48 +01:00
#ifndef NO_THE_REPOSITORY_COMPATIBILITY_MACROS
#define has_sha1_file_with_flags(sha1, flags) repo_has_sha1_file_with_flags(the_repository, sha1, flags)
#define has_sha1_file(sha1) repo_has_sha1_file(the_repository, sha1)
#endif

2018-05-16 01:42:15 +02:00
/* Same as the above, except for struct object_id. */
2018-11-14 01:12:48 +01:00
int repo_has_object_file(struct repository *r, const struct object_id *oid);
int repo_has_object_file_with_flags(struct repository *r,
				    const struct object_id *oid, int flags);
#ifndef NO_THE_REPOSITORY_COMPATIBILITY_MACROS
#define has_object_file(oid) repo_has_object_file(the_repository, oid)
#define has_object_file_with_flags(oid, flags) repo_has_object_file_with_flags(the_repository, oid, flags)
#endif
2018-05-16 01:42:15 +02:00

/*
 * Return true iff an alternate object database has a loose object
 * with the specified name. This function does not respect replace
 * references.
 */
2019-04-29 10:28:14 +02:00
int has_loose_object_nonlocal(const struct object_id *);
2018-05-16 01:42:15 +02:00

2019-04-29 10:28:14 +02:00
void assert_oid_type(const struct object_id *oid, enum object_type expect);
2018-05-16 01:42:15 +02:00
2020-01-16 03:39:53 +01:00
/*
 * Enabling the object read lock allows multiple threads to safely call the
 * following functions in parallel: repo_read_object_file(), read_object_file(),
 * read_object_file_extended(), read_object_with_reference(), read_object(),
 * oid_object_info() and oid_object_info_extended().
 *
 * obj_read_lock() and obj_read_unlock() may also be used to protect other
 * sections that cannot execute in parallel with object reading. Since the
 * lock is a recursive mutex, these sections can even contain calls to object
 * reading functions. However, beware that in these cases zlib inflation won't
 * be performed in parallel, losing performance.
 *
 * TODO: oid_object_info_extended()'s call stack has a recursive behavior. If
 * any of its callees ends up calling it, this recursive call won't benefit
 * from parallel inflation.
 */
void enable_obj_read_lock(void);
void disable_obj_read_lock(void);

extern int obj_read_use_lock;
extern pthread_mutex_t obj_read_mutex;

static inline void obj_read_lock(void)
{
	if (obj_read_use_lock)
		pthread_mutex_lock(&obj_read_mutex);
}

static inline void obj_read_unlock(void)
{
	if (obj_read_use_lock)
		pthread_mutex_unlock(&obj_read_mutex);
}
|
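The toggled, recursive lock described in the comment above can be illustrated with a standalone sketch. It is not Git's implementation: every `demo_` name is hypothetical, and the idea that enabling the lock means initializing a recursive pthread mutex is an assumption drawn only from the comment's statement that the lock is recursive. The sketch shows that a section already holding the lock can call a "reading" function that takes the same lock again without deadlocking.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for obj_read_use_lock and obj_read_mutex. */
static int demo_use_lock;
static pthread_mutex_t demo_mutex;
static long demo_counter;

/* Assumption: enabling initializes a recursive mutex and sets the flag. */
static void demo_enable_lock(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&demo_mutex, &attr);
	pthread_mutexattr_destroy(&attr);
	demo_use_lock = 1;
}

static void demo_disable_lock(void)
{
	demo_use_lock = 0;
	pthread_mutex_destroy(&demo_mutex);
}

/* Same shape as obj_read_lock(): a no-op unless the lock is enabled. */
static void demo_lock(void)
{
	if (demo_use_lock)
		pthread_mutex_lock(&demo_mutex);
}

static void demo_unlock(void)
{
	if (demo_use_lock)
		pthread_mutex_unlock(&demo_mutex);
}

/* Stand-in for an object reading function that takes the lock itself. */
static void demo_read(void)
{
	demo_lock();
	demo_counter++;
	demo_unlock();
}

static void *demo_worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < 10000; i++) {
		/* The outer section re-enters the lock via demo_read(). */
		demo_lock();
		demo_read();
		demo_unlock();
	}
	return NULL;
}

/* Runs four workers and returns the final counter value. */
long demo_run(void)
{
	pthread_t threads[4];
	int i, rc;

	demo_counter = 0;
	demo_enable_lock();
	for (i = 0; i < 4; i++) {
		rc = pthread_create(&threads[i], NULL, demo_worker, NULL);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	demo_disable_lock();
	return demo_counter;
}
```

With a non-recursive mutex the nested `demo_lock()` call inside `demo_read()` would deadlock or be undefined, which is why the recursive attribute matters for the pattern the comment describes.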
2018-05-16 01:42:15 +02:00
struct object_info {
	/* Request */
	enum object_type *typep;
	unsigned long *sizep;
	off_t *disk_sizep;
	unsigned char *delta_base_sha1;
	struct strbuf *type_name;
	void **contentp;

	/* Response */
	enum {
		OI_CACHED,
		OI_LOOSE,
		OI_PACKED,
		OI_DBCACHED
	} whence;
	union {
		/*
		 * struct {
		 * 	... Nothing to expose in this case
		 * } cached;
		 * struct {
		 * 	... Nothing to expose in this case
		 * } loose;
		 */
		struct {
			struct packed_git *pack;
			off_t offset;
			unsigned int is_delta;
		} packed;
	} u;
};

/*
 * Initializer for a "struct object_info" that wants no items. You may
 * also memset() the memory to all-zeroes.
 */
#define OBJECT_INFO_INIT {NULL}

/* Invoke lookup_replace_object() on the given hash */
#define OBJECT_INFO_LOOKUP_REPLACE 1
/* Allow reading from a loose object file of unknown/bogus type */
#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
/* Do not check cached storage */
#define OBJECT_INFO_SKIP_CACHED 4
/* Do not retry packed storage after checking packed and loose storage */
#define OBJECT_INFO_QUICK 8
/* Do not check loose objects */
#define OBJECT_INFO_IGNORE_LOOSE 16
2019-03-29 22:39:27 +01:00
/*
 * Do not attempt to fetch the object if missing (even if fetch_is_missing is
2019-05-28 17:19:07 +02:00
 * nonzero).
2019-03-29 22:39:27 +01:00
 */
2019-05-28 17:19:07 +02:00
#define OBJECT_INFO_SKIP_FETCH_OBJECT 32
/*
 * This is meant for bulk prefetching of missing blobs in a partial
 * clone. Implies OBJECT_INFO_SKIP_FETCH_OBJECT and OBJECT_INFO_QUICK.
 */
#define OBJECT_INFO_FOR_PREFETCH (OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK)
|
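The flag values above are single bits, so they combine with bitwise OR and can be tested with bitwise AND. The following standalone sketch copies the values from the definitions above and checks that `OBJECT_INFO_FOR_PREFETCH` carries both of the bits it is documented to imply. The `demo_` helpers are hypothetical and are not part of the header.

```c
#include <assert.h>

/* Values copied from the definitions above. */
#define OBJECT_INFO_LOOKUP_REPLACE 1
#define OBJECT_INFO_ALLOW_UNKNOWN_TYPE 2
#define OBJECT_INFO_SKIP_CACHED 4
#define OBJECT_INFO_QUICK 8
#define OBJECT_INFO_IGNORE_LOOSE 16
#define OBJECT_INFO_SKIP_FETCH_OBJECT 32
#define OBJECT_INFO_FOR_PREFETCH (OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK)

/* Hypothetical helper: nonzero if every bit in "wanted" is set in "flags". */
int demo_has_all(unsigned flags, unsigned wanted)
{
	return (flags & wanted) == wanted;
}

/* Hypothetical self-check of the "implies" relationship. */
void demo_check_prefetch(void)
{
	unsigned flags = OBJECT_INFO_FOR_PREFETCH;

	assert(demo_has_all(flags, OBJECT_INFO_SKIP_FETCH_OBJECT));
	assert(demo_has_all(flags, OBJECT_INFO_QUICK));
	/* Bits that were not requested stay clear. */
	assert(!demo_has_all(flags, OBJECT_INFO_IGNORE_LOOSE));
}
```

A caller that wants an extra behavior on top of the prefetch defaults would OR in one more bit, for example `OBJECT_INFO_FOR_PREFETCH | OBJECT_INFO_SKIP_CACHED`.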
2018-05-16 01:42:15 +02:00

int oid_object_info_extended(struct repository *r,
			     const struct object_id *,
			     struct object_info *, unsigned flags);

|
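The "Request" and "Response" comments in `struct object_info`, together with the note that `OBJECT_INFO_INIT` "wants no items", suggest a convention in which a NULL request pointer means "not wanted". The sketch below illustrates that convention with a hypothetical, simplified struct and lookup function. It does not call Git, and `demo_info`, `DEMO_INFO_INIT`, and `demo_lookup()` are invented for illustration, as are the fixed type and size values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified analogue of "struct object_info". */
struct demo_info {
	int *typep;           /* request: filled only if non-NULL */
	unsigned long *sizep; /* request: filled only if non-NULL */
	int whence;           /* response: always filled */
};

/* Same shape as OBJECT_INFO_INIT: remaining members are zeroed. */
#define DEMO_INFO_INIT {NULL}

/* Hypothetical lookup that pretends every object has type 3 and size 42. */
int demo_lookup(struct demo_info *info)
{
	if (info->typep)
		*info->typep = 3;
	if (info->sizep)
		*info->sizep = 42;
	info->whence = 1;
	return 0;
}

/* Asks only for the size; the type pointer stays NULL and is skipped. */
unsigned long demo_size_only(void)
{
	struct demo_info info = DEMO_INFO_INIT;
	unsigned long size = 0;

	assert(info.typep == NULL && info.sizep == NULL);
	info.sizep = &size;
	if (demo_lookup(&info) < 0)
		return 0;
	return size;
}
```

The design lets a caller pay only for the items it asks for, since the lookup can skip any work whose result pointer is NULL.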
2018-08-14 20:21:18 +02:00
/*
 * Iterate over the files in the loose-object parts of the object
 * directory "path", triggering the following callbacks:
 *
 *  - loose_object is called for each loose object we find.
 *
 *  - loose_cruft is called for any files that do not appear to be
 *    loose objects. Note that we only look in the loose object
 *    directories "objects/[0-9a-f]{2}/", so we will not report
 *    "objects/foobar" as cruft.
 *
 *  - loose_subdir is called for each top-level hashed subdirectory
 *    of the object directory (e.g., "$OBJDIR/f0"). It is called
 *    after the objects in the directory are processed.
 *
 * Any callback that is NULL will be ignored. Callbacks returning non-zero
 * will end the iteration.
 *
 * In the "buf" variant, "path" is a strbuf which will also be used as a
 * scratch buffer, but restored to its original contents before
 * the function returns.
 */
typedef int each_loose_object_fn(const struct object_id *oid,
				 const char *path,
				 void *data);
typedef int each_loose_cruft_fn(const char *basename,
				const char *path,
				void *data);
typedef int each_loose_subdir_fn(unsigned int nr,
				 const char *path,
				 void *data);
int for_each_file_in_obj_subdir(unsigned int subdir_nr,
				struct strbuf *path,
				each_loose_object_fn obj_cb,
				each_loose_cruft_fn cruft_cb,
				each_loose_subdir_fn subdir_cb,
				void *data);
int for_each_loose_file_in_objdir(const char *path,
				  each_loose_object_fn obj_cb,
				  each_loose_cruft_fn cruft_cb,
				  each_loose_subdir_fn subdir_cb,
				  void *data);
int for_each_loose_file_in_objdir_buf(struct strbuf *path,
				      each_loose_object_fn obj_cb,
				      each_loose_cruft_fn cruft_cb,
				      each_loose_subdir_fn subdir_cb,
				      void *data);
|
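The callback contract stated above (a NULL callback is ignored, and a non-zero return ends the iteration) can be sketched in isolation. The iterator below walks a plain array rather than an object directory. All `demo_` names are hypothetical, and returning the callback's non-zero value to the caller is an assumption, since the comment only says that the iteration ends.

```c
#include <assert.h>
#include <stddef.h>

typedef int demo_item_fn(const char *name, void *data);

/* Hypothetical iterator following the callback contract described above. */
int demo_for_each(const char **names, size_t nr, demo_item_fn *cb, void *data)
{
	size_t i;
	int ret;

	assert(names || nr == 0);
	if (!cb)
		return 0; /* a NULL callback is ignored */
	for (i = 0; i < nr; i++) {
		ret = cb(names[i], data);
		if (ret)
			return ret; /* non-zero ends the iteration */
	}
	return 0;
}

struct demo_count {
	int seen;
	int limit;
};

/* Counts items and asks to stop once "limit" items have been seen. */
int demo_count_cb(const char *name, void *data)
{
	struct demo_count *c = data;

	(void)name;
	c->seen++;
	return c->seen >= c->limit;
}

/* Returns how many items were visited before the callback stopped. */
int demo_count_up_to(int limit)
{
	static const char *names[] = { "aa", "bb", "cc", "dd" };
	struct demo_count c = { 0, limit };

	demo_for_each(names, 4, demo_count_cb, &c);
	return c.seen;
}
```

The `void *data` parameter plays the same role as in the typedefs above: it carries caller state into the callback without global variables.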
/* Flags for for_each_*_object() below. */
enum for_each_object_flags {
	/* Iterate only over local objects, not alternates. */
	FOR_EACH_OBJECT_LOCAL_ONLY = (1<<0),

	/* Only iterate over packs obtained from the promisor remote. */
	FOR_EACH_OBJECT_PROMISOR_ONLY = (1<<1),

	/*
	 * Visit objects within a pack in packfile order rather than .idx order
	 */
	FOR_EACH_OBJECT_PACK_ORDER = (1<<2),
};

/*
 * Iterate over all accessible loose objects without respect to
 * reachability. By default, this includes both local and alternate objects.
 * The order in which objects are visited is unspecified.
 *
 * Any flags specific to packs are ignored.
 */
int for_each_loose_object(each_loose_object_fn, void *,
			  enum for_each_object_flags flags);

/*
 * Iterate over all accessible packed objects without respect to reachability.
 * By default, this includes both local and alternate packs.
 *
 * Note that some objects may appear twice if they are found in multiple packs.
 * Each pack is visited in an unspecified order. By default, objects within a
 * pack are visited in pack-idx order (i.e., sorted by oid).
 */
typedef int each_packed_object_fn(const struct object_id *oid,
				  struct packed_git *pack,
				  uint32_t pos,
				  void *data);
int for_each_object_in_pack(struct packed_git *p,
			    each_packed_object_fn, void *data,
			    enum for_each_object_flags flags);
int for_each_packed_object(each_packed_object_fn, void *,
			   enum for_each_object_flags flags);

2018-03-23 18:20:55 +01:00
#endif /* OBJECT_STORE_H */