When downloading a list of bundles in "all" mode, Git has no
understanding of the dependencies between the bundles. Git attempts to
unbundle the bundles in some order, but some may not pass the
verify_bundle() step because of missing prerequisites. This is passed as
error messages to the user, even when they eventually succeed in later
attempts after their dependent bundles are unbundled.
Add a new VERIFY_BUNDLE_QUIET flag to verify_bundle() that avoids the
error messages from the missing prerequisite commits. The method still
returns the number of missing prerequisit commits, allowing callers to
unbundle() to notice that the bundle failed to apply.
Use this flag in bundle-uri.c and test that the messages go away for
'git clone --bundle-uri' commands.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The verify_bundle() method has a 'verbose' option, but we will want to
extend this method to have more granular control over its output. First,
replace this 'verbose' option with a new 'flags' option with a single
possible value: VERIFY_BUNDLE_VERBOSE.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the content at a given bundle URI is not understood as a bundle
(based on inspecting the initial content), then Git currently gives up
and ignores that content. Independent bundle providers may want to split
up the bundle content into multiple bundles, but still make them
available from a single URI.
Teach Git to attempt parsing the bundle URI content as a Git config file
providing the key=value pairs for a bundle list. Git then looks at the
mode of the list to see if ANY single bundle is sufficient or if ALL
bundles are required. The content at the selected URIs are downloaded
and the content is inspected again, creating a recursive process.
To guard the recursion against malformed or malicious content, limit the
recursion depth to a reasonable four for now. This can be converted to a
configured value in the future if necessary. The value of four is twice
as high as expected to be useful (a bundle list is unlikely to point to
more bundle lists).
To test this scenario, create an interesting bundle topology where three
incremental bundles are built on top of a single full bundle. By using a
merge commit, the two middle bundles are "independent" in that they do
not require each other in order to unbundle themselves. They each only
need the base bundle. The bundle containing the merge commit requires
both of the middle bundles, though. This leads to some interesting
decisions when unbundling, especially when we later implement heuristics
that promote downloading bundles until the prerequisite commits are
satisfied.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The verify_bundle() method checks two things for a bundle's
prerequisites:
1. Are these objects in the object store?
2. Are these objects reachable from our references?
In this second question, multiple uses of verify_bundle() in the same
process can report an invalid bundle even though it is correct. The
reason is due to not clearing all of the commit marks on the commits
previously walked.
The revision walk machinery was first introduced in-process by
fb9a54150d (git-bundle: avoid fork() in verify_bundle(), 2007-02-22).
This implementation used "-1" as the set of flags to clear. The next
meaningful change came in 2b064697a5 (revision traversal: retire
BOUNDARY_SHOW, 2007-03-05), which introduced the PREREQ_MARK flag
instead of a flag normally controlled by the revision-walk machinery.
In 86a0a408b9 (commit: factor out
clear_commit_marks_for_object_array, 2011-10-01), the loop over the
array of commits was replaced with a new
clear_commit_marks_for_object_array(), but simultaneously the "-1" value
was replaced with "ALL_REV_FLAGS", which stopped un-setting the
PREREQ_MARK flag. This means that if multiple commits were marked by the
PREREQ_MARK in a previous run of verify_bundle(), then this loop could
terminate early due to 'i' going to zero:
while (i && (commit = get_revision(&revs)))
if (commit->object.flags & PREREQ_MARK)
i--;
The flag clearing work was changed again in 63647391e6 (bundle: avoid
using the rev_info flag leak_pending, 2017-12-25), but that was only
cosmetic and did not change the behavior.
It may seem that it would be sufficient to add the PREREQ_MARK flag to
the clear_commit_marks() call in its current location. However, we
actually need to do it in the "cleanup:" step, since the first loop
checking "Are these objects in the object store?" might add the
PREREQ_MARK flag to some objects and then terminate without performing a
walk due to one missing object. By clearing the flags in all cases, we
avoid this issue when running verify_bundle() multiple times in the same
process.
Moving this loop to the cleanup step alone would cause a segfault when
running 'git bundle verify' outside of a repository, but this is because
of that error condition using "goto cleanup" when returning is perfectly
safe. Nothing has been initialized at that point, so we can return
immediately without causing any leaks.
This behavior is verified carefully by a test that will be added soon
when Git learns to download bundle lists in a 'git clone --bundle-uri'
command.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The next change will start allowing us to parse bundle lists that are
downloaded from a provided bundle URI. Those lists might point to other
lists, which could proceed to an arbitrary depth (and even create
cycles). Restructure fetch_bundle_uri() to have an internal version that
has a recursion depth. Compare that to a new max_bundle_uri_depth
constant that is twice as high as we expect this depth to be for any
legitimate use of bundle list linking.
We can consider making max_bundle_uri_depth a configurable value if
there is demonstrated value in the future.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When a bundle provider wants to operate independently from a Git remote,
they want to provide a single, consistent URI that users can use in
their 'git clone --bundle-uri' commands. At this point, the Git client
expects that URI to be a single bundle that can be unbundled and used to
bootstrap the rest of the clone from the Git server. This single bundle
cannot be re-used to assist with future incremental fetches.
To allow for the incremental fetch case, teach Git to understand a
bundle list that could be advertised at an independent bundle URI. Such
a bundle list is likely to be inspected by human readers, even if only
by the bundle provider creating the list. For this reason, we can take
our expected "key=value" pairs and instead format them using Git config
format.
Create bundle_uri_parse_config_format() to parse a file in config format
and convert that into a 'struct bundle_list' filled with its
understanding of the contents.
Be careful to use error_action CONFIG_ERROR_ERROR when calling
git_config_from_file_with_options() because the default action for
git_config_from_file() is to die() on a parsing error. The current
warning isn't particularly helpful if it arises to a user, but it will
be made more verbose at a higher layer later.
Update 'test-tool bundle-uri' to take this config file format as input.
It uses a filename instead of stdin because there is no existing way to
parse a FILE pointer in the config machinery. Using
git_config_from_mem() is overly complicated and more likely to introduce
bugs than this simpler version.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create a new 'test-tool bundle-uri' test helper. This helper will assist
in testing logic deep in the bundle URI feature.
This change introduces the 'parse-key-values' subcommand, which parses
an input file as a list of lines. These are fed into
bundle_uri_parse_line() to test how we construct a 'struct bundle_list'
from that data. The list is then output to stdout as if the key-value
pairs were a Git config file.
We use an input file instead of stdin because of a future change to
parse in config-file format that works better as an input file.
Co-authored-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When advertising a bundle list over Git's protocol v2, we will use
packet lines. Each line will be of the form "key=value" representing a
bundle list. Connect the API necessary for Git's transport to the
key-value pair parsing created in the previous change.
We are not currently implementing this protocol v2 functionality, but
instead preparing to expose this parsing to be unit-testable.
Co-authored-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There will be two primary ways to advertise a bundle list: as a list of
packet lines in Git's protocol v2 and as a config file served from a
bundle URI. Both of these fundamentally use a list of key-value pairs.
We will use the same set of key-value pairs across these formats.
Create a new bundle_list_update() method that is currently unusued, but
will be used in the next change. It inspects each key to see if it is
understood and then applies it to the given bundle_list. Here are the
keys that we teach Git to understand:
* bundle.version: This value should be an integer. Git currently
understands only version 1 and will ignore the list if the version is
any other value. This version can be increased in the future if we
need to add new keys that Git should not ignore. We can add new
"heuristic" keys without incrementing the version.
* bundle.mode: This value should be one of "all" or "any". If this
mode is not understood, then Git will ignore the list. This mode
indicates whether Git needs all of the bundle list items to make a
complete view of the content or if any single item is sufficient.
The rest of the keys use a bundle identifier "<id>" as part of the key
name. Keys using the same "<id>" describe a single bundle list item.
* bundle.<id>.uri: This stores the URI of the bundle item. This
currently is expected to be an absolute URI, but will be relaxed to be
a relative URI in the future.
While parsing, return an error if a URI key is repeated, since we can
make that restriction with bundle lists.
Make the git_parse_int() method global so we can parse the integer
version value carefully.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It will likely be rare where a user uses a single bundle URI and expects
that URI to point to a bundle. Instead, that URI will likely be a list
of bundles provided in some format. Alternatively, the Git server could
advertise a list of bundles.
In anticipation of these two ways of advertising multiple bundles,
create a data structure that represents such a list. This will be
populated using a common API, but for now focus on what data can be
represented.
Each list contains a number of remote_bundle_info structs. These contain
an 'id' that is used to uniquely identify them in the list, and also a
'uri' that contains the location of its data. Finally, there is a strbuf
containing the filename used when Git downloads the contents to disk.
The list itself stores these remote_bundle_info structs in a hashtable
using 'id' as the key. The order of the structs in the input is
considered unimportant, but future modifications to the format and these
data structures will place ordering possibilities on the set. The list
also has a few "global" properties, including the version (used when
parsing the list) and the mode. The mode is one of these two options:
1. BUNDLE_MODE_ALL: all listed URIs are intended to be combined
together. The client should download all of the advertised data to
have a complete copy of the data.
2. BUNDLE_MODE_ANY: any one listed item is sufficient to have a complete
copy of the data. The client can choose arbitrarily from these
options. In the future, the client may use pings to find the closest
URI among geodistributed replicas, or use some other heuristic
information added to the format.
This API is currently unused, but will soon be expanded with parsing
logic and then be consumed by the bundle URI download logic.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The find_temp_filename() method was created in 53a50892be (bundle-uri:
create basic file-copy logic, 2022-08-09) and uses odb_mkstemp() to
create a temporary filename. The odb_mkstemp() method uses a strbuf in
its interface, but we do not need to continue carrying a strbuf
throughout the bundle URI code.
Convert the find_temp_filename() method to use a 'char *' and modify its
only caller. This makes sense that we don't actually need to modify this
filename directly later, so using a strbuf is overkill.
This change will simplify the data structure for tracking a bundle list
to use plain strings instead of strbufs.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ds/bundle-uri-clone:
clone: warn on failure to repo_init()
clone: --bundle-uri cannot be combined with --depth
bundle-uri: add support for http(s):// and file://
clone: add --bundle-uri option
bundle-uri: create basic file-copy logic
remote-curl: add 'get' capability
The --bundle-uri option was added in 5556891961 (clone: add
--bundle-uri option, 2022-08-09), but this also introduced a call to
repo_init() whose return value was ignored. Fix that ignored value by
warning that the bundle URI process could not continue if it failed.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git rev-list --disk-usage" learned to take an optional value
"human" to show the reported value in human-readable format, like
"3.40MiB".
* ll/disk-usage-humanise:
rev-list: support human-readable output for `--disk-usage`
"git rm" has become more aware of the sparse-index feature.
* sy/sparse-rm:
rm: integrate with sparse-index
rm: expand the index only when necessary
pathspec.h: move pathspec_needs_expanded_index() from reset.c to here
t1092: add tests for `git-rm`
Fixes to sparse index compatibility work for "reset" and "checkout"
commands.
* vd/sparse-reset-checkout-fixes:
unpack-trees: unpack new trees as sparse directories
cache.h: create 'index_name_pos_sparse()'
oneway_diff: handle removed sparse directories
checkout: fix nested sparse directory diff in sparse index
"git fsck" reads mode from tree objects but canonicalizes the mode
before passing it to the logic to check object sanity, which has
hid broken tree objects from the checking logic. This has been
corrected, but to help exiting projects with broken tree objects
that they cannot fix retroactively, the severity of anomalies this
code detects has been demoted to "info" for now.
* jk/fsck-tree-mode-bits-fix:
fsck: downgrade tree badFilemode to "info"
fsck: actually detect bad file modes in trees
tree-walk: add a mechanism for getting non-canonicalized modes
Platform-specific code that determines if a directory is OK to use
as a repository has been taught to report more details, especially
on Windows.
* js/safe-directory-plus:
mingw: handle a file owned by the Administrators group correctly
mingw: be more informative when ownership check fails on FAT32
mingw: provide details about unsafe directories' ownership
setup: prepare for more detailed "dubious ownership" messages
setup: fix some formatting
Avoid repeatedly running getconf to ask libc version in the test
suite, and instead just as it once per script.
* pw/use-glibc-tunable-for-malloc-optim:
tests: cache glibc version check
Expose a lot of "tech docs" via "git help" interface.
* ab/tech-docs-to-help:
docs: move http-protocol docs to man section 5
docs: move cruft pack docs to gitformat-pack
docs: move pack format docs to man section 5
docs: move signature docs to man section 5
docs: move index format docs to man section 5
docs: move protocol-related docs to man section 5
docs: move commit-graph format docs to man section 5
git docs: add a category for file formats, protocols and interfaces
git docs: add a category for user-facing file, repo and command UX
git help doc: use "<doc>" instead of "<guide>"
help.c: remove common category behavior from drop_prefix() behavior
help.c: refactor drop_prefix() to use a "switch" statement"
Plug a bit more leaks in the revisions API.
* ab/plug-revisions-leak:
revisions API: don't leak memory on argv elements that need free()-ing
bisect.c: partially fix bisect_rev_setup() memory leak
log: refactor "rev.pending" code in cmd_show()
log: fix a memory leak in "git show <revision>..."
test-fast-rebase helper: use release_revisions() (again)
bisect.c: add missing "goto" for release_revisions()
Extend SANITIZE=leak checking and declare more tests "currently leak-free".
* ab/leak-check:
CI: use "GIT_TEST_SANITIZE_LEAK_LOG=true" in linux-leaks
upload-pack: fix a memory leak in create_pack_file()
leak tests: mark passing SANITIZE=leak tests as leak-free
leak tests: don't skip some tests under SANITIZE=leak
test-lib: have the "check" mode for SANITIZE=leak consider leak logs
test-lib: add a GIT_TEST_PASSING_SANITIZE_LEAK=check mode
test-lib: simplify by removing test_external
tests: move copy/pasted PERL + Test::More checks to a lib-perl.sh
t/Makefile: don't remove test-results in "clean-except-prove-cache"
test-lib: add a SANITIZE=leak logging mode
t/README: reword the "GIT_TEST_PASSING_SANITIZE_LEAK" description
test-lib: add a --invert-exit-code switch
test-lib: fix GIT_EXIT_OK logic errors, use BAIL_OUT
test-lib: don't set GIT_EXIT_OK before calling test_atexit_handler
test-lib: use $1, not $@ in test_known_broken_{ok,failure}_
"git symbolic-ref symref non..sen..se" is now diagnosed as an error.
* lt/symbolic-ref-sanity:
symbolic-ref: refuse to set syntactically invalid target
The '--disk-usage' option for git-rev-list was introduced in 16950f8384
(rev-list: add --disk-usage option for calculating disk usage, 2021-02-09).
This is very useful for people inspect their git repo's objects usage
infomation, but the resulting number is quit hard for a human to read.
Teach git rev-list to output a human readable result when using
'--disk-usage'.
Signed-off-by: Li Linchao <lilinchao@oschina.cn>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There was a bug in the codepath to upgrade generation information
in commit-graph from v1 to v2 format, which has been corrected.
source: <cover.1657667404.git.me@ttaylorr.com>
* tb/commit-graph-genv2-upgrade-fix:
commit-graph: fix corrupt upgrade from generation v1 to v2
commit-graph: introduce `repo_find_commit_pos_in_graph()`
t5318: demonstrate commit-graph generation v2 corruption
Fix for a bug that makes write-tree to fail to write out a
non-existent index as a tree, introduced in 2.37.
source: <20220722212232.833188-1-martin.agren@gmail.com>
* tk/untracked-cache-with-uall:
read-cache: make `do_read_index()` always set up `istate->repo`
"git checkout" miscounted the paths it updated, which has been
corrected.
source: <cover.1657799213.git.matheus.bernardino@usp.br>
* mt/checkout-count-fix:
checkout: fix two bugs on the final count of updated entries
checkout: show bug about failed entries being included in final report
checkout: document bug where delayed checkout counts entries twice
"rerere-train" script (in contrib/) used to honor commit.gpgSign
while recreating the throw-away merges.
source: <PH7PR14MB5594A27B9295E95ACA4D6A69CE8F9@PH7PR14MB5594.namprd14.prod.outlook.com>
* cl/rerere-train-with-no-sign:
contrib/rerere-train: avoid useless gpg sign in training
"git p4" did not handle non-ASCII client name well, which has been
corrected.
source: <pull.1285.v3.git.git.1658394440.gitgitgadget@gmail.com>
* kk/p4-client-name-encoding-fix:
git-p4: refactoring of p4CmdList()
git-p4: fix bug with encoding of p4 client name
"git p4" working on UTF-16 files on Windows did not implement
CRLF-to-LF conversion correctly, which has been corrected.
source: <pull.1294.v2.git.git.1658341065221.gitgitgadget@gmail.com>
* mb/p4-utf16-crlf:
git-p4: fix CR LF handling for utf16 files
A corner case bug where lazily fetching objects from a promisor
remote resulted in infinite recursion has been corrected.
source: <cover.1656593279.git.hanxin.hx@bytedance.com>
* hx/lookup-commit-in-graph-fix:
t5330: remove run_with_limited_processses()
commit-graph.c: no lazy fetch in lookup_commit_in_graph()
The resolve-undo information in the index was not protected against
GC, which has been corrected.
source: <xmqq35f7kzad.fsf@gitster.g>
* jc/resolve-undo:
fsck: do not dereference NULL while checking resolve-undo data
revision: mark blobs needed for resolve-undo as reachable
The previous commit un-broke the "badFileMode" check; before then it was
literally testing nothing. And as far as I can tell, it has been so
since the very initial version of fsck.
The current severity of "badFileMode" is just "warning". But in the
--strict mode used by transfer.fsckObjects, that is elevated to an
error. This will potentially cause hassle for users, because historical
objects with bad modes will suddenly start causing pushes to many server
operators to be rejected.
At the same time, these bogus modes aren't actually a big risk. Because
we canonicalize them everywhere besides fsck, they can't cause too much
mischief in the real world. The worst thing you can do is end up with
two almost-identical trees that have different hashes but are
interpreted the same. That will generally cause things to be inefficient
rather than wrong, and is a bug somebody working on a Git implementation
would want to fix, but probably not worth inconveniencing users by
refusing to push or fetch.
So let's downgrade this to "info" by default, which is our setting for
"mention this when fscking, but don't ever reject, even under strict
mode". If somebody really wants to be paranoid, they can still adjust
the level using config.
Suggested-by: Xavier Morel <xavier.morel@masklinn.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We use the normal tree_desc code to iterate over trees in fsck, meaning
we only see the canonicalized modes it returns. And hence we'd never see
anything unexpected, since it will coerce literally any garbage into one
of our normal and accepted modes.
We can use the new RAW_MODES flag to see the real modes, and then use
the existing code to actually analyze them. The existing code is written
as allow-known-good, so there's not much point in testing a variety of
breakages. The one tested here should be S_IFREG but with nonsense
permissions.
Do note that the error-reporting here isn't great. We don't mention the
specific bad mode, but just that the tree has one or more broken modes.
But when you go to look at it with "git ls-tree", we'll report the
canonicalized mode! This isn't ideal, but given that this should come up
rarely, and that any number of other tree corruptions might force you
into looking at the binary bytes via "cat-file", it's not the end of the
world. And it's something we can improve on top later if we choose.
Reported-by: Xavier Morel <xavier.morel@masklinn.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When using init_tree_desc() and tree_entry() to iterate over a tree, we
always canonicalize the modes coming out of the tree. This is a good
thing to prevent bugs or oddities in normal code paths, but it's
counter-productive for tools like fsck that want to see the exact
contents.
We can address this by adding an option to avoid the extra
canonicalization. A few notes on the implementation:
- I've attached the new option to the tree_desc struct itself. The
actual code change is in decode_tree_entry(), which is in turn
called by the public update_tree_entry(), tree_entry(), and
init_tree_desc() functions, plus their "gently" counterparts.
By letting it ride along in the struct, we can avoid changing the
signature of those functions, which are called many times. Plus it's
conceptually simpler: you really want a particular iteration of a
tree to be "raw" or not, rather than individual calls.
- We still have to set the new option somewhere. The struct is
initialized by init_tree_desc(). I added the new flags field only to
the "gently" version. That avoids disturbing the much more numerous
non-gentle callers, and it makes sense that anybody being careful
about looking at raw modes would also be careful about bogus trees
(i.e., the caller will be something like fsck in the first place).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A previous change added the '--bundle-uri' option, but did not check
if the --depth parameter was included. Since bundles are not compatible
with shallow clones, provide an error message to the user who is
attempting this combination.
I am leaving this as its own change, separate from the one that
implements '--bundle-uri', because this is more of an advisory for the
user. There is nothing wrong with bootstrapping with bundles and then
fetching a shallow clone. However, that is likely going to involve too
much work for the client _and_ the server. The client will download all
of this bundle information containing the full history of the
repository only to ignore most of it. The server will get a shallow
fetch request, but with a list of haves that might cause a more painful
computation of that shallow pack-file.
Reviewed-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>