"git sparse-checkout" learned a new "add" subcommand.
* ds/sparse-add:
sparse-checkout: allow one-character directories in cone mode
sparse-checkout: work with Windows paths
sparse-checkout: create 'add' subcommand
sparse-checkout: extract pattern update from 'set' subcommand
sparse-checkout: extract add_patterns_from_input()
In 9e6d3e64 (sparse-checkout: detect short patterns, 2020-01-24), a
condition on the minimum length of a cone-mode pattern was introduced.
However, this condition was off-by-one.
If we have a directory with a single character, say "b", then the
command
git sparse-checkout set b
will correctly add the pattern "/b/" to the sparse-checkout file. When
this is interpeted in dir.c, the pattern is "/b" with the
PATTERN_FLAG_MUSTBEDIR flag. This string has length two, which satisfies
our inclusive inequality (<= 2).
The reason for this inequality is that we will start to read the pattern
string character-by-character using three char pointers: prev, cur,
next. In particular, next is set to the current pattern plus two. The
mistake was that next will still be a valid pointer when the pattern
length is two, since the string is null-terminated.
Make this inequality strict so these patterns work.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some codepaths were given a repository instance as a parameter to
work in the repository, but passed the_repository instance to its
callees, which has been cleaned up (somewhat).
* mt/use-passed-repo-more-in-funcs:
sha1-file: allow check_object_signature() to handle any repo
sha1-file: pass git_hash_algo to hash_object_file()
sha1-file: pass git_hash_algo to write_object_file_prepare()
streaming: allow open_istream() to handle any repo
pack-check: use given repo's hash_algo at verify_packfile()
cache-tree: use given repo's hash_algo at verify_one()
diff: make diff_populate_filespec() honor its repo argument
Some rough edges in the sparse-checkout feature, especially around
the cone mode, have been cleaned up.
* ds/sparse-checkout-harden:
sparse-checkout: fix cone mode behavior mismatch
sparse-checkout: improve docs around 'set' in cone mode
sparse-checkout: escape all glob characters on write
sparse-checkout: use C-style quotes in 'list' subcommand
sparse-checkout: unquote C-style strings over --stdin
sparse-checkout: write escaped patterns in cone mode
sparse-checkout: properly match escaped characters
sparse-checkout: warn on globs in cone patterns
sparse-checkout: detect short patterns
sparse-checkout: cone mode does not recognize "**"
sparse-checkout: fix documentation typo for core.sparseCheckoutCone
clone: fix --sparse option with URLs
sparse-checkout: create leading directories
t1091: improve here-docs
t1091: use check_files to reduce boilerplate
In cone mode, the sparse-checkout feature uses hashset containment
queries to match paths. Make this algorithm respect escaped asterisk
(*) and backslash (\) characters.
Create dup_and_filter_pattern() method to convert a pattern by
removing escape characters and dropping an optional "/*" at the end.
This method is available in dir.h as we will use it in
builtin/sparse-checkout.c in a later change.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In cone mode, the sparse-checkout commmand will write patterns that
allow faster pattern matching. This matching only works if the patterns
in the sparse-checkout file are those written by that command. Users
can edit the sparse-checkout file and create patterns that cause the
cone mode matching to fail.
The cone mode patterns may end in "/*" but otherwise an un-escaped
asterisk or other glob character is invalid. Add checks to disable
cone mode when seeing these values.
A later change will properly handle escaped globs.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow hash_object_file() to work on arbitrary repos by introducing a
git_hash_algo parameter. Change callers which have a struct repository
pointer in their scope to pass on the git_hash_algo from the said repo.
For all other callers, pass on the_hash_algo, which was already being
used internally at hash_object_file(). This functionality will be used
in the following patch to make check_object_signature() be able to work
on arbitrary repos (which, in turn, will be used to fix an
inconsistency at object.c:parse_object()).
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In cone mode, the shortest pattern the sparse-checkout command will
write into the sparse-checkout file is "/*". This is handled carefully
in add_pattern_to_hashsets(), so warn if any other pattern is this
short. This will assist future pattern checks by allowing us to assume
there are at least three characters in the pattern.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When core.sparseCheckoutCone is enabled, the 'git sparse-checkout set'
command creates a restricted set of possible patterns that are used
by a custom algorithm to quickly match those patterns.
If a user manually edits the sparse-checkout file, then they could
create patterns that do not match these expectations. The cone-mode
matching algorithm can return incorrect results. The solution is to
detect these incorrect patterns, warn that we do not recognize them,
and revert to the standard algorithm.
Check each pattern for the "**" substring, and revert to the old
logic if seen. While technically a "/<dir>/**" pattern matches
the meaning of "/<dir>/", it is not one that would be written by
the sparse-checkout builtin in cone mode. Attempting to accept that
pattern change complicates the logic and instead we punt and do
not accept any instance of "**".
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 777b420347 (dir: synchronize treat_leading_path() and
read_directory_recursive(), 2019-12-19) tried to add two warning
comments in those functions, pointing at each other. But the one in
treat_leading_path() just points at itself.
Let's fix that. Since the comment also redirects the reader for more
details to "the commit that added this warning", and since we're now
modifying the warning (creating a new commit without those details),
let's mention the actual commit id.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Restructure the code slightly to avoid passing around a struct dirent
anywhere, which also enables us to avoid trying to manufacture one.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
I was going to title this "dir: more synchronizing of
treat_leading_path() and read_directory_recursive()", a nod to commit
777b420347 ("dir: synchronize treat_leading_path() and
read_directory_recursive()", 2019-12-19), but the title was too long.
Anyway, first the backstory...
fill_directory() has always had a slightly error-prone interface: it
returns a subset of paths which *might* match the specified pathspec; it
was intended to prune away some paths which didn't match the specified
pathspec and keep at least all the ones that did match it. Given this
interface, callers were responsible to post-process the results and
check whether each actually matched the pathspec.
builtin/clean.c did this. It would first prune out duplicates (e.g. if
"dir" was returned as well as all files under "dir/", then it would
simplify this to just "dir"), and after pruning duplicates it would
compare the remaining paths to the specified pathspec(s). This
post-processing itself could run into problems, though, as noted in
commit 404ebceda0 ("dir: also check directories for matching
pathspecs", 2019-09-17):
For the case of git-clean and a set of pathspecs of "dir/file" and
"more", this caused a problem because we'd end up with dir entries
for both of
"dir"
"dir/file"
Then correct_untracked_entries() would try to helpfully prune
duplicates for us by removing "dir/file" since it's under "dir",
leaving us with
"dir"
Since the original pathspec only had "dir/file", the only entry left
doesn't match and leaves nothing to be removed. (Note that if only
one pathspec was specified, e.g. only "dir/file", then the
common_prefix_len optimizations in fill_directory would cause us to
bypass this problem, making it appear in simple tests that we could
correctly remove manually specified pathspecs.)
That commit fixed the issue -- when multiple pathspecs were specified --
by making sure fill_directory() wouldn't return both "dir" and
"dir/file" outside the common_prefix_len optimization path. This is
where it starts to get fun.
In commit b9670c1f5e ("dir: fix checks on common prefix directory",
2019-12-19), we noticed that the common_prefix_len wasn't doing
appropriate checks and letting all kinds of stuff through, resulting in
recursing into .git/ directories and other craziness. So it started
locking down and doing checks on pathnames within that code path. That
continued with commit 777b420347 ("dir: synchronize
treat_leading_path() and read_directory_recursive()", 2019-12-19), which
noted the following:
Our optimization to avoid calling into read_directory_recursive()
when all pathspecs have a common leading directory mean that we need
to match the logic that read_directory_recursive() would use if we
had just called it from the root. Since it does more than call
treat_path() we need to copy that same logic.
...and then it more forcefully addressed the issue with this wonderfully
ironic statement:
Needing to duplicate logic like this means it is guaranteed someone
will eventually need to make further changes and forget to update
both locations. It is tempting to just nuke the leading_directory
special casing to avoid such bugs and simplify the code, but
unpack_trees' verify_clean_subdirectory() also calls
read_directory() and does so with a non-empty leading path, so I'm
hesitant to try to restructure further. Add obnoxious warnings to
treat_leading_path() and read_directory_recursive() to try to warn
people of such problems.
You would think that with such a strongly worded description, that its
author would have actually ensured that the logic in
treat_leading_path() and read_directory_recursive() did actually match
and that *everything* that was needed had at least been copied over at
the time that this paragraph was written. But you'd be wrong, I messed
it up by missing part of the logic.
Copy the missing bits to fix the new final test in t7300.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Assorted fixes to the directory traversal API.
* en/fill-directory-fixes:
dir.c: use st_add3() for allocation size
dir: consolidate similar code in treat_directory()
dir: synchronize treat_leading_path() and read_directory_recursive()
dir: fix checks on common prefix directory
dir: break part of read_directory_recursive() out for reuse
dir: exit before wildcard fall-through if there is no wildcard
dir: remove stray quote character in comment
Revert "dir.c: make 'git-status --ignored' work within leading directories"
t3011: demonstrate directory traversal failures
Management of sparsely checked-out working tree has gained a
dedicated "sparse-checkout" command.
* ds/sparse-cone: (21 commits)
sparse-checkout: improve OS ls compatibility
sparse-checkout: respect core.ignoreCase in cone mode
sparse-checkout: check for dirty status
sparse-checkout: update working directory in-process for 'init'
sparse-checkout: cone mode should not interact with .gitignore
sparse-checkout: write using lockfile
sparse-checkout: use in-process update for disable subcommand
sparse-checkout: update working directory in-process
sparse-checkout: sanitize for nested folders
unpack-trees: add progress to clear_ce_flags()
unpack-trees: hash less in cone mode
sparse-checkout: init and set in cone mode
sparse-checkout: use hashmaps for cone patterns
sparse-checkout: add 'cone' mode
trace2: add region in clear_ce_flags
sparse-checkout: create 'disable' subcommand
sparse-checkout: add '--stdin' option to set subcommand
sparse-checkout: 'set' subcommand
clone: add --sparse mode
sparse-checkout: create 'init' subcommand
...
When preparing a manufactured dirent instance, we add a length of
path to the size of struct to decide how many bytes to allocate.
Make sure this addition does not wrap-around to cause us
underallocate.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the DIR_SKIP_NESTED_GIT and DIR_NO_GITLINKS cases were checking for
whether a path was actually a nonbare repository. That code could be
shared, with just the result of how to act differing between the two
cases.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our optimization to avoid calling into read_directory_recursive() when
all pathspecs have a common leading directory mean that we need to match
the logic that read_directory_recursive() would use if we had just
called it from the root. Since it does more than call treat_path() we
need to copy that same logic.
Alternatively, we could try to change treat_path to return path_recurse
for an untracked directory under the given special circumstances that
this logic checks for, but a simple switch results in many test failures
such as 'git clean -d' not wiping out untracked but empty directories.
To work around that, we'd need the caller of treat_path to check for
path_recurse and sometimes special case it into path_untracked. In
other words, we'd still have extra logic in both places.
Needing to duplicate logic like this means it is guaranteed someone will
eventually need to make further changes and forget to update both
locations. It is tempting to just nuke the leading_directory special
casing to avoid such bugs and simplify the code, but unpack_trees'
verify_clean_subdirectory() also calls read_directory() and does so with
a non-empty leading path, so I'm hesitant to try to restructure further.
Add obnoxious warnings to treat_leading_path() and
read_directory_recursive() to try to warn people of such problems.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many years ago, the directory traversing logic had an optimization that
would always recurse into any directory that was a common prefix of all
the pathspecs without walking the leading directories to get down to
the desired directory. Thus,
git ls-files -o .git/ # case A
would notice that .git/ was a common prefix of all pathspecs (since
it is the only pathspec listed), and then traverse into it and start
showing unknown files under that directory. Unfortunately, .git/ is not
a directory we should be traversing into, which made this optimization
problematic. This also affected cases like
git ls-files -o --exclude-standard t/ # case B
where t/ was in the .gitignore file and thus isn't interesting and
shouldn't be recursed into. It also affected cases like
git ls-files -o --directory untracked_dir/ # case C
where untracked_dir/ is indeed untracked and thus interesting, but the
--directory flag means we only want to show the directory itself, not
recurse into it and start listing untracked files below it.
The case B class of bugs were noted and fixed in commits 16e2cfa909
("read_directory(): further split treat_path()", 2010-01-08) and
48ffef966c ("ls-files: fix overeager pathspec optimization",
2010-01-08), with the idea being that we first wanted to check whether
the common prefix was interesting. The former patch noted that
treat_path() couldn't be used when checking the common prefix because
treat_path() requires a dir_entry() and we haven't read any directories
at the point we are checking the common prefix. So, that patch split
treat_one_path() out of treat_path(). The latter patch then created a
new treat_leading_path() which duplicated by hand the bits of
treat_path() that couldn't be broken out and then called
treat_one_path() for the remainder. There were three problems with this
approach:
* The duplicated logic in treat_leading_path() accidentally missed the
check for special paths (such as is_dot_or_dotdot and matching
".git"), causing case A types of bugs to continue to be an issue.
* The treat_leading_path() logic assumed we should traverse into
anything where path_treatment was not path_none, i.e. it perpetuated
class C types of bugs.
* It meant we had split logic that needed to kept in sync, running the
risk that people introduced new inconsistencies (such as in commit
be8a84c526, which we reverted earlier in this series, or in commit
df5bcdf83a which we'll fix in a subsequent commit)
Fix most these problems by making treat_leading_path() not only loop
over each leading path component, but calling treat_path() directly on
each. To do so, we have to create a synthetic dir_entry, but that only
takes a few lines. Then, pay attention to the path_treatment result we
get from treat_path() and don't treat path_excluded, path_untracked, and
path_recurse all the same as path_recurse.
This leaves one remaining problem, the new inconsistency from commit
df5bcdf83a. That will be addressed in a subsequent commit.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* hw/doc-in-header:
trace2: move doc to trace2.h
submodule-config: move doc to submodule-config.h
tree-walk: move doc to tree-walk.h
trace: move doc to trace.h
run-command: move doc to run-command.h
parse-options: add link to doc file in parse-options.h
credential: move doc to credential.h
argv-array: move doc to argv-array.h
cache: move doc to cache.h
sigchain: move doc to sigchain.h
pathspec: move doc to pathspec.h
revision: move doc to revision.h
attr: move doc to attr.h
refs: move doc to refs.h
remote: move doc to remote.h and refspec.h
sha1-array: move doc to sha1-array.h
merge: move doc to ll-merge.h
graph: move doc to graph.h and graph.c
dir: move doc to dir.h
diff: move doc to diff.h and diffcore.h
When a user uses the sparse-checkout feature in cone mode, they
add patterns using "git sparse-checkout set <dir1> <dir2> ..."
or by using "--stdin" to provide the directories line-by-line over
stdin. This behaviour naturally looks a lot like the way a user
would type "git add <dir1> <dir2> ..."
If core.ignoreCase is enabled, then "git add" will match the input
using a case-insensitive match. Do the same for the sparse-checkout
feature.
Perform case-insensitive checks while updating the skip-worktree
bits during unpack_trees(). This is done by changing the hash
algorithm and hashmap comparison methods to optionally use case-
insensitive methods.
When this is enabled, there is a small performance cost in the
hashing algorithm. To tease out the worst possible case, the
following was run on a repo with a deep directory structure:
git ls-tree -d -r --name-only HEAD |
git sparse-checkout set --stdin
The 'set' command was timed with core.ignoreCase disabled or
enabled. For the repo with a deep history, the numbers were
core.ignoreCase=false: 62s
core.ignoreCase=true: 74s (+19.3%)
For reproducibility, the equivalent test on the Linux kernel
repository had these numbers:
core.ignoreCase=false: 3.1s
core.ignoreCase=true: 3.6s (+16%)
Now, this is not an entirely fair comparison, as most users
will define their sparse cone using more shallow directories,
and the performance improvement from eb42feca97 ("unpack-trees:
hash less in cone mode" 2019-11-21) can remove most of the
hash cost. For a more realistic test, drop the "-r" from the
ls-tree command to store only the first-level directories.
In that case, the Linux kernel repository takes 0.2-0.25s in
each case, and the deep repository takes one second, plus or
minus 0.05s, in each case.
Thus, we _can_ demonstrate a cost to this change, but it is
unlikely to matter to any reasonable sparse-checkout cone.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create an add_path_to_appropriate_result_list() function from the code
at the end of read_directory_recursive() so we can use it elsewhere.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The DO_MATCH_LEADING_PATHSPEC had a fall-through case for if there was a
wildcard, noting that we don't yet have enough information to determine
if a further paths under the current directory might match due to the
presence of wildcards. But if we have no wildcards in our pathspec,
then we shouldn't get to that fall-through case.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit be8a84c526 ("dir.c: make 'git-status --ignored' work within
leading directories", 2013-04-15) noted that
git status --ignored <SOMEPATH>
would not list ignored files and directories within <SOMEPATH> if
<SOMEPATH> was untracked, and modified the behavior to make it show
them. However, it did so via a hack that broke consistency; it would
show paths under <SOMEPATH> differently than a simple
git status --ignored | grep <SOMEPATH>
would show them. A correct fix is slightly more involved, and
complicated slightly by this hack, so we revert this commit (but keep
corrected versions of the testcases) and will later fix the original
bug with a subsequent patch.
Some history may be helpful:
A very, very similar case to the commit we are reverting was raised in
commit 48ffef966c ("ls-files: fix overeager pathspec optimization",
2010-01-08); but it actually went in somewhat the opposite direction. In
that commit, it mentioned how
git ls-files -o --exclude-standard t/
used to show untracked files under t/ even when t/ was ignored, and then
changed the behavior to stop showing untracked files under an ignored
directory. More importantly, this commit considered keeping this
behavior but noted that it would be inconsistent with the behavior when
multiple pathspecs were specified and thus rejected it.
The reason for this whole inconsistency when one pathspec is specified
versus zero or two is because common prefixes of pathspecs are sent
through a different set of checks (in treat_leading_path()) than normal
file/directory traversal (those go through read_directory_recursive()
and treat_path()). As such, for consistency, one needs to check that
both codepaths produce the same result.
Revert commit be8a84c526, except instead
of removing the testcase it added, modify it to check for correct and
consistent behavior. A subsequent patch in this series will fix the
testcase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The sparse-checkout feature in "cone mode" can use the fact that
the recursive patterns are "connected" to the root via parent
patterns to decide if a directory is entirely contained in the
sparse-checkout or entirely removed.
In these cases, we can skip hashing the paths within those
directories and simply set the skipworktree bit to the correct
value.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To make the cone pattern set easy to use, update the behavior of
'git sparse-checkout (init|set)'.
Add '--cone' flag to 'git sparse-checkout init' to set the config
option 'core.sparseCheckoutCone=true'.
When running 'git sparse-checkout set' in cone mode, a user only
needs to supply a list of recursive folder matches. Git will
automatically add the necessary parent matches for the leading
directories.
When testing 'git sparse-checkout set' in cone mode, check the
error stream to ensure we do not see any errors. Specifically,
we want to avoid the warning that the patterns do not match
the cone-mode patterns.
Helped-by: Eric Wong <e@80x24.org>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The parent and recursive patterns allowed by the "cone mode"
option in sparse-checkout are restrictive enough that we
can avoid using the regex parsing. Everything is based on
prefix matches, so we can use hashsets to store the prefixes
from the sparse-checkout file. When checking a path, we can
strip path entries from the path and check the hashset for
an exact match.
As a test, I created a cone-mode sparse-checkout file for the
Linux repository that actually includes every file. This was
constructed by taking every folder in the Linux repo and creating
the pattern pairs here:
/$folder/
!/$folder/*/
This resulted in a sparse-checkout file sith 8,296 patterns.
Running 'git read-tree -mu HEAD' on this file had the following
performance:
core.sparseCheckout=false: 0.21 s (0.00 s)
core.sparseCheckout=true: 3.75 s (3.50 s)
core.sparseCheckoutCone=true: 0.23 s (0.01 s)
The times in parentheses above correspond to the time spent
in the first clear_ce_flags() call, according to the trace2
performance traces.
While this example is contrived, it demonstrates how these
patterns can slow the sparse-checkout feature.
Helped-by: Eric Wong <e@80x24.org>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move the documentation from Documentation/technical/api-directory-listing.txt
to dir.h as it's easier for the developers to find the usage information
beside the code instead of looking for it in another doc file.
Also documentation/technical/api-directory-listing.txt is removed because
the information it has is now redundant and it'll be hard to keep it up to
date and synchronized with the documentation in the header files.
Signed-off-by: Heba Waly <heba.waly@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git clean" fixes.
* en/clean-nested-with-ignored:
dir: special case check for the possibility that pathspec is NULL
clean: fix theoretical path corruption
clean: rewrap overly long line
clean: avoid removing untracked files in a nested git repository
clean: disambiguate the definition of -d
git-clean.txt: do not claim we will delete files with -n/--dry-run
dir: add commentary explaining match_pathspec_item's return value
dir: if our pathspec might match files under a dir, recurse into it
dir: make the DO_MATCH_SUBMODULE code reusable for a non-submodule case
dir: also check directories for matching pathspecs
dir: fix off-by-one error in match_pathspec_item
dir: fix typo in comment
t7300: add testcases showing failure to clean specified pathspecs
Commits 404ebceda0 ("dir: also check directories for matching
pathspecs", 2019-09-17) and 89a1f4aaf7 ("dir: if our pathspec might
match files under a dir, recurse into it", 2019-09-17) added calls to
match_pathspec() and do_match_pathspec() passing along their pathspec
parameter. Both match_pathspec() and do_match_pathspec() assume the
pathspec argument they are given is non-NULL. It turns out that
unpack-tree.c's verify_clean_subdirectory() calls read_directory() with
pathspec == NULL, and it is possible on case insensitive filesystems for
that NULL to make it to these new calls to match_pathspec() and
do_match_pathspec(). Add appropriate checks on the NULLness of pathspec
to avoid a segfault.
In case the negation throws anyone off (one of the calls was to
do_match_pathspec() while the other was to !match_pathspec(), yet no
negation of the NULLness of pathspec is used), there are two ways to
understand the differences:
* The code already handled the pathspec == NULL cases before this
series, and this series only tried to change behavior when there was
a pathspec, thus we only want to go into the if-block if pathspec is
non-NULL.
* One of the calls is for whether to recurse into a subdirectory, the
other is for after we've recursed into it for whether we want to
remove the subdirectory itself (i.e. the subdirectory didn't match
but something under it could have). That difference in situation
leads to the slight differences in logic used (well, that and the
slightly unusual fact that we don't want empty pathspecs to remove
untracked directories by default).
Denton found and analyzed one issue and provided the patch for the
match_pathspec() call, SZEDER figured out why the issue only reproduced
for some folks and not others and provided the testcase, and I looked
through the remainder of the series and noted the do_match_pathspec()
call that should have the same check.
Co-authored-by: Denton Liu <liu.denton@gmail.com>
Co-authored-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The internal code originally invented for ".gitignore" processing
got reshuffled and renamed to make it less tied to "excluding" and
stress more that it is about "matching", as it has been reused for
things like sparse checkout specification that want to check if a
path is "included".
* ds/include-exclude:
unpack-trees: rename 'is_excluded_from_list()'
treewide: rename 'exclude' methods to 'pattern'
treewide: rename 'EXCL_FLAG_' to 'PATTERN_FLAG_'
treewide: rename 'struct exclude_list' to 'struct pattern_list'
treewide: rename 'struct exclude' to 'struct path_pattern'
Users expect files in a nested git repository to be left alone unless
sufficiently forced (with two -f's). Unfortunately, in certain
circumstances, git would delete both tracked (and possibly dirty) files
and untracked files within a nested repository. To explain how this
happens, let's contrast a couple cases. First, take the following
example setup (which assumes we are already within a git repo):
git init nested
cd nested
>tracked
git add tracked
git commit -m init
>untracked
cd ..
In this setup, everything works as expected; running 'git clean -fd'
will result in fill_directory() returning the following paths:
nested/
nested/tracked
nested/untracked
and then correct_untracked_entries() would notice this can be compressed
to
nested/
and then since "nested/" is a directory, we would call
remove_dirs("nested/", ...), which would
check is_nonbare_repository_dir() and then decide to skip it.
However, if someone also creates an ignored file:
>nested/ignored
then running 'git clean -fd' would result in fill_directory() returning
the same paths:
nested/
nested/tracked
nested/untracked
but correct_untracked_entries() will notice that we had ignored entries
under nested/ and thus simplify this list to
nested/tracked
nested/untracked
Since these are not directories, we do not call remove_dirs() which was
the only place that had the is_nonbare_repository_dir() safety check --
resulting in us deleting both the untracked file and the tracked (and
possibly dirty) file.
One possible fix for this issue would be walking the parent directories
of each path and checking if they represent nonbare repositories, but
that would be wasteful. Even if we added caching of some sort, it's
still a waste because we should have been able to check that "nested/"
represented a nonbare repository before even descending into it in the
first place. Add a DIR_SKIP_NESTED_GIT flag to dir_struct.flags and use
it to prevent fill_directory() and friends from descending into nested
git repos.
With this change, we also modify two regression tests added in commit
91479b9c72 ("t7300: add tests to document behavior of clean and nested
git", 2015-06-15). That commit, nor its series, nor the six previous
iterations of that series on the mailing list discussed why those tests
coded the expectation they did. In fact, it appears their purpose was
simply to test _existing_ behavior to make sure that the performance
changes didn't change the behavior. However, these two tests directly
contradicted the manpage's claims that two -f's were required to delete
files/directories under a nested git repository. While one could argue
that the user gave an explicit path which matched files/directories that
were within a nested repository, there's a slippery slope that becomes
very difficult for users to understand once you go down that route (e.g.
what if they specified "git clean -f -d '*.c'"?) It would also be hard
to explain what the exact behavior was; avoid such problems by making it
really simple.
Also, clean up some grammar errors describing this functionality in the
git-clean manpage.
Finally, there are still a couple bugs with -ffd not cleaning out enough
(e.g. missing the nested .git) and with -ffdX possibly cleaning out the
wrong files (paying attention to outer .gitignore instead of inner).
This patch does not address these cases at all (and does not change the
behavior relative to those flags), it only fixes the handling when given
a single -f. See
https://public-inbox.org/git/20190905212043.GC32087@szeder.dev/ for more
discussion of the -ffd[X?] bugs.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The way match_pathspec_item() handles names and pathspecs with trailing
slash characters, in conjunction with special options like
DO_MATCH_DIRECTORY and DO_MATCH_LEADING_PATHSPEC were non-obvious, and
broken until this patch series. Add a table in a comment explaining the
intent of how these work.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For git clean, if a directory is entirely untracked and the user did not
specify -d (corresponding to DIR_SHOW_IGNORED_TOO), then we usually do
not want to remove that directory and thus do not recurse into it.
However, if the user manually specified specific (or even globbed) paths
somewhere under that directory to remove, then we need to recurse into
the directory to make sure we remove the relevant paths under that
directory as the user requested.
Note that this does not mean that the recursed-into directory will be
added to dir->entries for later removal; as of a few commits earlier in
this series, there is another more strict match check that is run after
returning from a recursed-into directory before deciding to add it to the
list of entries. Therefore, this will only result in files underneath
the given directory which match one of the pathspecs being added to the
entries list.
Two notes of potential interest to future readers:
* If we wanted to only recurse into a directory when it is specifically
matched rather than matched-via-glob (e.g. '*.c'), then we could do
so via making the final non-zero return in match_pathspec_item be
MATCHED_RECURSIVELY instead of MATCHED_RECURSIVELY_LEADING_PATHSPEC.
(Note that the relative order of MATCHED_RECURSIVELY_LEADING_PATHSPEC
and MATCHED_RECURSIVELY are important for such a change.) I was
leaving open that possibility while writing an RFC asking for the
behavior we want, but even though we don't want it, that knowledge
might help you understand the code flow better.
* There is a growing amount of logic in read_directory_recursive() for
deciding whether to recurse into a subdirectory. However, there is a
comment immediately preceding this logic that says to recurse if
instructed by treat_path(). It may be better for the logic in
read_directory_recursive() to ultimately be moved to treat_path() (or
another function it calls, such as treat_directory()), but I have
left that for someone else to tackle in the future.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The specific checks done in match_pathspec_item for the DO_MATCH_SUBMODULE
case are useful for other cases which have nothing to do with submodules.
Rename this constant; a subsequent commit will make use of this change.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Even if a directory doesn't match a pathspec, it is possible, depending
on the precise pathspecs, that some file underneath it might. So we
special case and recurse into the directory for such situations. However,
we previously always added any untracked directory that we recursed into
to the list of untracked paths, regardless of whether the directory
itself matched the pathspec.
For the case of git-clean and a set of pathspecs of "dir/file" and "more",
this caused a problem because we'd end up with dir entries for both of
"dir"
"dir/file"
Then correct_untracked_entries() would try to helpfully prune duplicates
for us by removing "dir/file" since it's under "dir", leaving us with
"dir"
Since the original pathspec only had "dir/file", the only entry left
doesn't match and leaves nothing to be removed. (Note that if only one
pathspec was specified, e.g. only "dir/file", then the common_prefix_len
optimizations in fill_directory would cause us to bypass this problem,
making it appear in simple tests that we could correctly remove manually
specified pathspecs.)
Fix this by actually checking whether the directory we are about to add
to the list of dir entries actually matches the pathspec; only do this
matching check after we have already returned from recursing into the
directory.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For a pathspec like 'foo/bar' comparing against a path named "foo/",
namelen will be 4, and match[namelen] will be 'b'. The correct location
of the directory separator is namelen-1.
However, other callers of match_pathspec_item() such as builtin/grep.c's
submodule_path_match() will compare against a path named "foo" instead of
"foo/". It might be better to change all the callers to be consistent,
as discussed at
https://public-inbox.org/git/xmqq7e6cdnkr.fsf@gitster-ct.c.googlers.com/
and
https://public-inbox.org/git/CABPp-BERWUPCPq-9fVW1LNocqkrfsoF4BPj3gJd9+En43vEkTQ@mail.gmail.com/
but there are many cases to audit, so for now just make sure we handle
both cases with and without a trailing slash.
The reason the code worked despite this sometimes-off-by-one error was
that the subsequent code immediately checked whether the first matchlen
characters matched (which they do) and then bailed and return
MATCHED_RECURSIVELY anyway since wildmatch doesn't have the ability to
check if "name" can be matched as a directory (or prefix) against the
pathspec.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The first consumer of pattern-matching filenames was the
.gitignore feature. In that context, storing a list of patterns
as a 'struct exclude_list' makes sense. However, the
sparse-checkout feature then adopted these structures and methods,
but with the opposite meaning: these patterns match the files
that should be included!
Now that this library is renamed to use 'struct pattern_list'
and 'struct pattern', we can now rename the method used by
the sparse-checkout feature to determine which paths should
appear in the working directory.
The method is_excluded_from_list() is only used by the
sparse-checkout logic in unpack-trees and list-objects-filter.
The confusing part is that it returned 1 for "excluded" (i.e.
it matches the list of exclusions) but that really manes that
the path matched the list of patterns for _inclusion_ in the
working directory.
Rename the method to be path_matches_pattern_list() and have
it return an explicit 'enum pattern_match_result'. Here, the
values MATCHED = 1, UNMATCHED = 0, and UNDECIDED = -1 agree
with the previous integer values. This shift allows future
consumers to better understand what the retur values mean,
and provides more type checking for handling those values.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The first consumer of pattern-matching filenames was the
.gitignore feature. In that context, storing a list of patterns
as a 'struct exclude_list' makes sense. However, the
sparse-checkout feature then adopted these structures and methods,
but with the opposite meaning: these patterns match the files
that should be included!
It would be clearer to rename this entire library as a "pattern
matching" library, and the callers apply exclusion/inclusion
logic accordingly based on their needs.
This commit renames several methods defined in dir.h to make
more sense with the renamed 'struct exclude_list' to 'struct
pattern_list' and 'struct exclude' to 'struct path_pattern':
* last_exclude_matching() -> last_matching_pattern()
* parse_exclude() -> parse_path_pattern()
In addition, the word 'exclude' was replaced with 'pattern'
in the methods below:
* add_exclude_list()
* add_excludes_from_file_to_list()
* add_excludes_from_file()
* add_excludes_from_blob_to_list()
* add_exclude()
* clear_exclude_list()
A few methods with the word "exclude" remain. These will
be handled seperately. In particular, the method
"is_excluded()" is concretely about the .gitignore file
relative to a specific directory. This is the important
boundary between library and consumer: is_excluded() cares
about .gitignore, but is_excluded() calls
last_matching_pattern() to make that decision.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The first consumer of pattern-matching filenames was the
.gitignore feature. In that context, storing a list of patterns
as a 'struct exclude_list' makes sense. However, the
sparse-checkout feature then adopted these structures and methods,
but with the opposite meaning: these patterns match the files
that should be included!
It would be clearer to rename this entire library as a "pattern
matching" library, and the callers apply exclusion/inclusion
logic accordingly based on their needs.
This commit replaces 'EXCL_FLAG_' to 'PATTERN_FLAG_' in the
names of the flags used on 'struct path_pattern'.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The first consumer of pattern-matching filenames was the
.gitignore feature. In that context, storing a list of patterns
as a 'struct exclude_list' makes sense. However, the
sparse-checkout feature then adopted these structures and methods,
but with the opposite meaning: these patterns match the files
that should be included!
It would be clearer to rename this entire library as a "pattern
matching" library, and the callers apply exclusion/inclusion
logic accordingly based on their needs.
This commit renames 'struct exclude_list' to 'struct pattern_list'
and renames several variables called 'el' to 'pl'.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The first consumer of pattern-matching filenames was the
.gitignore feature. In that context, storing a list of patterns
as a list of 'struct exclude' items makes sense. However, the
sparse-checkout feature then adopted these structures and methods,
but with the opposite meaning: these patterns match the files
that should be included!
It would be clearer to rename this entire library as a "pattern
matching" library, and the callers apply exclusion/inclusion
logic accordingly based on their needs.
This commit renames 'struct exclude' to 'struct path_pattern'
and renames several variable names to match. 'struct pattern'
was already taken by attr.c, and this more completely describes
that the patterns are specific to file paths.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Calculating the sum of two array indexes to find the midpoint between
them can overflow, i.e. code like this is unsafe for big arrays:
mid = (first + last) >> 1;
Make sure the intermediate value stays within the boundaries instead,
like this:
mid = first + ((last - first) >> 1);
The loop condition of the binary search makes sure that 'last' is
always greater than 'first', so this is safe as long as 'first' is
not negative. And that can be verified easily using the pre-context
of each change, except for name-hash.c, so add an assertion to that
effect there.
The unsafe calculations were found with:
git grep '(.*+.*) *>> *1'
This is a continuation of 19716b21a4 (cleanup: fix possible overflow
errors in binary search, 2017-10-08).
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up.
* jk/untracked-cache-more-fixes:
untracked-cache: simplify parsing by dropping "len"
untracked-cache: simplify parsing by dropping "next"
untracked-cache: be defensive about missing NULs in index
Further code clean-up to allow the lowest level of name-to-object
mapping layer to work with a passed-in repository other than the
default one.
* nd/sha1-name-c-wo-the-repository: (34 commits)
sha1-name.c: remove the_repo from get_oid_mb()
sha1-name.c: remove the_repo from other get_oid_*
sha1-name.c: remove the_repo from maybe_die_on_misspelt_object_name
submodule-config.c: use repo_get_oid for reading .gitmodules
sha1-name.c: add repo_get_oid()
sha1-name.c: remove the_repo from get_oid_with_context_1()
sha1-name.c: remove the_repo from resolve_relative_path()
sha1-name.c: remove the_repo from diagnose_invalid_index_path()
sha1-name.c: remove the_repo from handle_one_ref()
sha1-name.c: remove the_repo from get_oid_1()
sha1-name.c: remove the_repo from get_oid_basic()
sha1-name.c: remove the_repo from get_describe_name()
sha1-name.c: remove the_repo from get_oid_oneline()
sha1-name.c: add repo_interpret_branch_name()
sha1-name.c: remove the_repo from interpret_branch_mark()
sha1-name.c: remove the_repo from interpret_nth_prior_checkout()
sha1-name.c: remove the_repo from get_short_oid()
sha1-name.c: add repo_for_each_abbrev()
sha1-name.c: store and use repo in struct disambiguate_state
sha1-name.c: add repo_find_unique_abbrev_r()
...
Running "git add" on a repository created inside the current
repository is an explicit indication that the user wants to add it
as a submodule, but when the HEAD of the inner repository is on an
unborn branch, it cannot be added as a submodule. Worse, the files
in its working tree can be added as if they are a part of the outer
repository, which is not what the user wants. These problems are
being addressed.
* km/empty-repo-is-still-a-repo:
add: error appropriately on repository with no commits
dir: do not traverse repositories with no commits
submodule: refuse to add repository with no commits
An underallocation in the code to read the untracked cache
extension has been corrected.
* js/untracked-cache-allocfix:
untracked cache: fix off-by-one
Conversion from unsigned char[20] to struct object_id continues.
* bc/hash-transition-16: (35 commits)
gitweb: make hash size independent
Git.pm: make hash size independent
read-cache: read data in a hash-independent way
dir: make untracked cache extension hash size independent
builtin/difftool: use parse_oid_hex
refspec: make hash size independent
archive: convert struct archiver_args to object_id
builtin/get-tar-commit-id: make hash size independent
get-tar-commit-id: parse comment record
hash: add a function to lookup hash algorithm by length
remote-curl: make hash size independent
http: replace sha1_to_hex
http: compute hash of downloaded objects using the_hash_algo
http: replace hard-coded constant with the_hash_algo
http-walker: replace sha1_to_hex
http-push: remove remaining uses of sha1_to_hex
http-backend: allow 64-character hex names
http-push: convert to use the_hash_algo
builtin/pull: make hash-size independent
builtin/am: make hash size independent
...