Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead. It shortens the code and infers the
element size automatically.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update support for invalid UTF-8 in PCRE2.
* ab/grep-pcre-invalid-utf8:
grep/pcre2: better support invalid UTF-8 haystacks
grep/pcre2 tests: don't rely on invalid UTF-8 data test
The support for deprecated PCRE1 library has been dropped.
* ab/retire-pcre1:
Remove support for v1 of the PCRE library
config.mak.uname: remove redundant NO_LIBPCRE1_JIT flag
Remove the hidden "grep --debug" and "log --grep-debug" options added
in 17bf35a3c7 (grep: teach --debug option to dump the parse tree,
2012-09-13).
At the time these options seem to have been intended to go along with
a documentation discussion and to help the author of relevant tests to
perform ad-hoc debugging on them[1].
Reasons to want this gone:
1. They were never documented, and the only (rather trivial) use of
them in our own codebase for testing is something I removed back
in e01b4dab01 (grep: change non-ASCII -i test to stop using
--debug, 2017-05-20).
2. Googling around doesn't show any in-the-wild uses I could dig up,
and on the Git ML the only mentions after the original discussion
seem to have been when they came up in unrelated diff contexts, or
that test commit of mine.
3. An exception to that is c581e4a749 (grep: under --debug, show
whether PCRE JIT is enabled, 2019-08-18) where we added the
ability to dump out when PCREv2 has the JIT in effect.
The combination of that and my earlier b65abcafc7 (grep: use PCRE
v2 for optimized fixed-string search, 2019-07-01) means Git prints
this out in its most common in-the-wild configuration:
$ git log --grep-debug --grep=foo --grep=bar --grep=baz --all-match
pcre2_jit_on=1
pcre2_jit_on=1
pcre2_jit_on=1
[all-match]
(or
pattern_body<body>foo
(or
pattern_body<body>bar
pattern_body<body>baz
)
)
$ git grep --debug \( -e foo --and -e bar \) --or -e baz
pcre2_jit_on=1
pcre2_jit_on=1
pcre2_jit_on=1
(or
(and
patternfoo
patternbar
)
patternbaz
)
I.e. for each pattern we're considering for the and/or/--all-match
etc. debugging we'll now diligently spew out another identical line
saying whether the PCREv2 JIT is on or not.
I think that nobody's complained about that rather glaringly obviously
bad output says something about how much this is used, i.e. it's
not.
The need for this debugging aid for the composed grep/log patterns
seems to have passed, and the desire to dump the JIT config seems to
have been another one-off around the time we had JIT-related issues on
the PCREv2 codepath. That the original author of this debugging
facility seemingly hasn't noticed the bad output since then[2] is
probably some indicator.
1. https://lore.kernel.org/git/cover.1347615361.git.git@drmicha.warpmail.net/
2. https://lore.kernel.org/git/xmqqk1b8x0ac.fsf@gitster-ct.c.googlers.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Improve the support for invalid UTF-8 haystacks given a non-ASCII
needle when using the PCREv2 backend.
This is a more complete fix for a bug I started to fix in
870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
2019-07-26), now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode we
can make use of it.
This fixes the sort of case described in 8a5999838e (grep: stess test
PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.:
- The subject string is non-ASCII (e.g. "ævar")
- We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
- We are using --ignore-case, or we're a non-fixed pattern
If those conditions were satisfied and we matched found non-valid
UTF-8 data PCREv2 might bark on it, in practice this only happened
under the JIT backend (turned on by default on most platforms).
Ultimately this fixes a "regression" in b65abcafc7 ("grep: use PCRE v2
for optimized fixed-string search", 2019-07-01), I'm putting that in
scare-quotes because before then we wouldn't properly support these
complex case-folding, locale etc. cases either, it just broke in
different ways.
There was a bug related to this the PCRE2_NO_START_OPTIMIZE flag fixed
in PCREv2 10.36. It can be worked around by setting the
PCRE2_NO_START_OPTIMIZE flag. Let's do that in those cases, and add
tests for the bug.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove support for using version 1 of the PCRE library. Its use has
been discouraged by upstream for a long time, and it's in a
bugfix-only state.
Anyone who was relying on v1 in particular got a nudge to move to v2
in e6c531b808 (Makefile: make USE_LIBPCRE=YesPlease mean v2, not v1,
2018-03-11), which was first released as part of v2.18.0.
With this the LIBPCRE2 test prerequisites is redundant to PCRE. But
I'm keeping it for self-documentation purposes, and to avoid conflict
with other in-flight PCRE patches.
I'm also not changing all of our own "pcre2" names to "pcre", i.e. the
inverse of 6d4b5747f0 (grep: change internal *pcre* variable &
function names to be *pcre1*, 2017-05-25). I don't see the point, and
it makes the history/blame harder to read. Maybe if there's ever a
PCRE v3...
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have a `struct grep_opt` with our defaults which we then copy into
the caller's struct. Rather than zeroing the target struct and copying
each element one by one, just copy everything at once. This leaves the
code simpler and more maintainable.
We don't have any ownership issues with what we're copying now and can
just greedily copy the whole thing. If and when we do need to handle
such elements (`char *`?), we must and can handle it appropriately. Make
sure to leave a comment to our future selves.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 15fabd1bbd ("builtin/grep.c: make configuration callback more
reusable", 2012-10-09), we learned to fill a `static struct grep_opt
grep_defaults` which we can use as a blueprint for other such structs.
At the time, we didn't consider designated initializers to be widely
useable, but these days, we do. (See, e.g., cbc0f81d96 ("strbuf: use
designated initializers in STRBUF_INIT", 2017-07-10).)
Use designated initializers to let the compiler set up the struct and so
that we don't need to remember to call `init_grep_defaults()`.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`init_grep_defaults()` fills a `static struct grep_opt grep_defaults`.
This struct is then used by `grep_init()` as a blueprint for other such
structs. Notably, `grep_init()` takes a `struct repo *` and assigns it
into the target struct.
As a result, it is unnecessary for us to take a `struct repo *` in
`init_grep_defaults()` as well. We assign it into the default struct and
never look at it again. And in light of how we return early if we have
already set up the default struct, it's not just unnecessary, but is
also a bit confusing: If we are called twice and with different repos,
is it a bug or a feature that we ignore the second repo?
Drop the repo parameter for `init_grep_defaults()`.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit fixes a couple of minor spelling mistakes inside
comments.
Signed-off-by: Steve Kemp <steve@steve.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
git-grep uses 'grep_read_mutex' to protect its calls to object reading
operations. But these have their own internal lock now, which ensures a
better performance (allowing parallel access to more regions). So, let's
remove the former and, instead, activate the latter with
enable_obj_read_lock().
Sections that are currently protected by 'grep_read_mutex' but are not
internally protected by the object reading lock should be surrounded by
obj_read_lock() and obj_read_unlock(). These guarantee mutual exclusion
with object reading operations, keeping the current behavior and
avoiding race conditions. Namely, these places are:
In grep.c:
- fill_textconv() at fill_textconv_grep().
- userdiff_get_textconv() at grep_source_1().
In builtin/grep.c:
- parse_object_or_die() and the submodule functions at
grep_submodule().
- deref_tag() and gitmodules_config_oid() at grep_objects().
If these functions become thread-safe, in the future, we might remove
the locking and probably get some speedup.
Note that some of the submodule functions will already be thread-safe
(or close to being thread-safe) with the internal object reading lock.
However, as some of them will require additional modifications to be
removed from the critical section, this will be done in its own patch.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
git-grep uses an internal grep_read_mutex to protect object reading
operations. Similarly, there's a grep_attr_mutex to protect calls to the
gitattributes machinery. However, two of the three functions protected
by the last mutex may also perform object reading, as seen below:
- userdiff_get_textconv() > notes_cache_init() >
notes_cache_match_validity() > lookup_commit_reference_gently() >
parse_object() > repo_has_object_file() >
repo_has_object_file_with_flags() > oid_object_info_extended()
- userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
prepare_attr_stack() > read_attr() > read_attr_from_index() >
read_blob_data_from_index() > read_object_file()
As these calls are not protected by grep_read_mutex, there might be race
conditions with other threads performing object reading (e.g. threads
calling fill_textconv() at grep.c:fill_textconv_grep()). To prevent
that, let's make sure to acquire the lock before both of these calls.
Note: this patch might slow down the threaded grep in worktree, for the
sake of thread-safeness. However, in the following patches, we should
regain performance by replacing grep_read_mutex for an internal object
reading lock and allowing parallel inflation during object reading.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Previously, the void pcre2_free() function in grep.c returned free().
While free() itself is void, afaict it's still an expression as per
section A.2.3, subsection 6.8.6 (jump-statement) in both C99 [1] and C11
[2]:
> return expression
Section 6.8.6.4 in C99 [1] and C11 [2] says that:
> A return statement with an expression shall not appear in a function
> whose return type is void.
The consequence of the old behavior was that developer builds with
pedantic errors enabled broke Git if PCRE2 was enabled and a
smart-enough compiler to detect these errors was used. This commit
fixes pedantic builds of Git that enables --with-libpcre.
[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
[2] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf
Signed-off-by: Hans Jerry Illikainen <hji@dyntopia.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Leakfix.
* cb/pcre2-chartables-leakfix:
grep: avoid leak of chartables in PCRE2
grep: make PCRE2 aware of custom allocator
grep: make PCRE1 aware of custom allocator
94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) introduced
a small memory leak visible with valgrind in t7813.
Complete the creation of a PCRE2 specific variable that was missing from
the original change and free the generated table just like it is done
for PCRE1.
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
94da9193a6 (grep: add support for PCRE v2, 2017-06-01) didn't include
a way to override the system allocator, and so it is incompatible with
custom allocators (e.g. nedmalloc). This problem became obvious when we
tried to plug a memory leak by `free()`ing a data structure allocated by
PCRE2, triggering a segfault in Windows (where we use nedmalloc by
default).
PCRE2 requires the use of a general context to override the allocator
and therefore, there is a lot more code needed than in PCRE1, including
a couple of wrapper functions.
Extend the grep API with a "destructor" that could be called to cleanup
any objects that were created and used globally.
Update `builtin/grep.c` to use that new API, but any other future users
should make sure to have matching `grep_init()`/`grep_destroy()` calls
if they are using the pattern matching functionality.
Move some of the logic that was before done per thread (in the workers)
into an earlier phase to avoid degrading performance, but as the use of
PCRE2 with custom allocators is better understood it is expected more of
its functions will be instructed to use the custom allocator as well as
was done in the original code[1] this work was based on.
[1] https://public-inbox.org/git/3397e6797f872aedd18c6d795f4976e1c579514b.1565005867.git.gitgitgadget@gmail.com/
Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
63e7e9d8b6 ("git-grep: Learn PCRE", 2011-05-09) didn't include a way
to override the system alocator, and so it is incompatible with
USE_NED_ALLOCATOR as reported by Dscho[1] (in similar code from PCRE2)
Note that nedmalloc, as well as other custom allocators like jemalloc
and mi-malloc, can be configured at runtime (via `LD_PRELOAD`),
therefore we cannot know at compile time whether a custom allocator is
used or not.
Make the minimum change possible to ensure this combination is supported
by extending `grep_init()` to set the PCRE1 specific functions to Git's
idea of `malloc()` and `free()` and therefore making sure all
allocations are done inside PCRE1 with the same allocator than the rest
of Git.
This change has negligible performance impact: PCRE needs to allocate
memory once per program run for the character table and for each pattern
compilation. These are both rare events compared to matching patterns
against lines. Actual measurements[2] show that the impact is lost in
the noise.
[1] https://public-inbox.org/git/pull.306.git.gitgitgadget@gmail.com
[2] https://public-inbox.org/git/7f42007f-911b-c570-17f6-1c6af0429586@web.de
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A few simplification and bugfixes to PCRE interface.
* ab/pcre-jit-fixes:
grep: under --debug, show whether PCRE JIT is enabled
grep: do not enter PCRE2_UTF mode on fixed matching
grep: stess test PCRE v2 on invalid UTF-8 data
grep: create a "is_fixed" member in "grep_pat"
grep: consistently use "p->fixed" in compile_regexp()
grep: stop using a custom JIT stack with PCRE v1
grep: stop "using" a custom JIT stack with PCRE v2
grep: remove overly paranoid BUG(...) code
grep: use PCRE v2 for optimized fixed-string search
grep: remove the kwset optimization
grep: drop support for \0 in --fixed-strings <pattern>
grep: make the behavior for NUL-byte in patterns sane
grep tests: move binary pattern tests into their own file
grep tests: move "grep binary" alongside the rest
grep: inline the return value of a function call used only once
t4210: skip more command-line encoding tests on MinGW
grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
log tests: test regex backends in "--encode=<enc>" tests
Make sure the grep machinery does not abort when seeing a payload
that is not UTF-8 even when JIT is not in use with PCRE1.
* cb/skip-utf8-check-with-pcre1:
grep: skip UTF8 checks explicitly
18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was released
with git 2.10 added the PCRE_UTF8 flag to PCRE1 matching including a
call to has_non_ascii() to try to avoid breakage if there was non-utf8
encoded content in the haystack.
Usually PCRE is compiled with JIT support (even if is not the default),
and therefore the codepath used includes calling pcre_jit_exec, which
skips UTF-8 validation by design (which might result in crashes or hangs)
but when JIT support wasn't compiled we use pcre_exec instead with the
posibility that grep might be aborted if invalid UTF-8 is found in the
haystack.
PCRE1 provides a flag since Mar 5, 2007 that could be used to skip the
checks explicitly so use that to make both codepaths equivalent (the
flag is ignored by pcre1_jit_exec)
this fix is only implemented for PCRE1 because PCRE2 is likely to have
a better solution (without the risks) instead in the future
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code used both a macro and a variable to keep track if JIT
support was desired and relied on the fact that a non JIT
enabled library will ignore a request for JIT compilation
(as defined by the second parameter of the call to pcre_study)
Cleanup the multiple levels of macros used and call pcre_study
with the right parameter after JIT support has been confirmed
and unless it was requested to be disabled with NO_LIBPCRE1_JIT
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This information is useful and not visible anywhere else, so show it.
Signed-off-by: Beat Bolli <dev+git@drbeat.li>
Suggested-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As discussed in the last commit partially fix a bug introduced in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input
we'd die on e.g.:
fatal: pcre2_match failed with error code -22: UTF-8 error:
isolated byte with 0x80 bit set
When grepping a non-ASCII fixed string. This is a more general problem
that's hard to fix, but we can at least fix the most common case of
grepping for a fixed string without "-i". I can't think of a reason
for why we'd turn on PCRE2_UTF when matching byte-for-byte like that.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since my b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-07-01) we've been dying on invalid UTF-8 data when
grepping for fixed strings if the following are all true:
* The subject string is non-ASCII (e.g. "ævar")
* We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
* We compiled with PCRE v2
* That PCRE v2 did not have JIT support
The last of those is why this wasn't caught earlier, per pcre2jit(3):
"unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
for validity. In the interests of speed, these checks do not
happen on the JIT fast path, and if invalid data is passed, the
result is undefined."
I.e. the subject being matched against our pattern was invalid, but we
were lucky and getting away with it on the JIT path, but the non-JIT
one is stricter.
This patch does nothing to fix that, instead we sneak in support for
fixed patterns starting with "(*NO_JIT)", this disables the PCRE v2
jit with implicit fixed-string matching for testing, see
pcre2syntax(3) the syntax.
This is technically a change in behavior, but it's so obscure that I
figured it was OK. We'd previously consider this an invalid regular
expression as regcomp() would die on it, now we feed it to the PCRE v2
fixed-string path. I thought this was better than introducing yet
another GIT_TEST_* environment variable.
We're also relying on a behavior of PCRE v2 that technically could
change, but I think the test coverage is worth dipping our toe into
some somewhat undefined behavior.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This change paves the way for later using this value the regex compile
functions themselves.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
At the start of this function we do:
p->fixed = opt->fixed;
It's less confusing to use that variable consistently that switch back
& forth between the two.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.
Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.
This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:
GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh
Gives us this, just the /perl/ results:
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.19(0.67+0.52) 0.19(0.65+0.52) +0.0%
7820.7: perl grep '^how to' 0.19(0.78+0.44) 0.19(0.72+0.49) +0.0%
7820.11: perl grep '[how] to' 0.39(2.13+0.43) 0.40(2.10+0.46) +2.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.44(2.55+0.37) 0.45(2.47+0.41) +2.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.23(1.06+0.42) 0.22(1.03+0.43) -4.3%
It will also implicitly re-enable UTF-8 validation for PCRE v1. As
noted in [1] we now have cases as a result where PCRE v1 is more eager
to error out. Subsequent patches will fix that for v2, and I think
it's fair to tell v1 users "just upgrade" and not worry about that
edge case for v1.
1. https://public-inbox.org/git/CAPUEsphZJ_Uv9o1-yDpjNLA_q-f7gWXz9g1gCY2pYAYN8ri40g@mail.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.
Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2]. The reason I allocated a custom
stack at all was this advice in pcrejit(3) (same in pcre2jit(3)):
"By default, it uses 32KiB on the machine stack. However, some
large or complicated patterns need more than this"
Since we've haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".
For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):
GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh
Just the /perl/ results are:
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.17(0.27+0.65) 0.17(0.24+0.68) +0.0%
7820.7: perl grep '^how to' 0.16(0.23+0.66) 0.16(0.23+0.67) +0.0%
7820.11: perl grep '[how] to' 0.18(0.35+0.62) 0.18(0.33+0.65) +0.0%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.17(0.45+0.54) 0.17(0.49+0.50) +0.0%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.16(0.33+0.58) 0.16(0.29+0.62) +0.0%
So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.
As noted in [3] there are known regexes that will fail with the lower
stack limit, the way GNU grep fixed it is interesting, although I
believe the implementation is overly verbose, they could make PCRE v2
handle that gradual re-allocation, that's what min/max memory is
for.
So we might end up bringing this back, I'm more inclined to just kick
such cases upstairs to PCRE maintainers as a bug, perhaps they'll add
some overall "just allocate more then" flag to make this easier. In
any case there's no functional change here, we didn't have a custom
stack, so let's apply this first, we can always revert it later.
1. https://public-inbox.org/git/20190721194052.15440-1-carenas@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
code again what happened is that I cargo-culted some of PCRE v2's
own test code which was meant to test re-allocations. It's more
sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
does.
3. https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bZXktfpLgkgiu34HTjKLhm-cmEE04FE-A@mail.gmail.com/
Reported-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.
I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in f002532784 ("grep: print
the pcre2_jit_on value", 2019-07-22) incorrectly copy/pasted some of
it in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01).
Let's just remove this code. Being this paranoid about the
pcre2?_config() function itself being broken is crossing the line into
unreasonable paranoia.
Reported-by: Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When pcre2_jit_on is neither 1 nor 0, the BUG() call printed the value
of pcre1_jit_on.
Print the value of pcre2_jit_on instead.
Signed-off-by: Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1] with kwset we were
slower than PCRE v1 and v2 JIT with the kwset backend, so that
optimization was counterproductive.
This brings back the optimization for "--fixed-strings", without
changing the semantics of having a NUL-byte in patterns. As seen in
previous commits in this series we could support it now, but I'd
rather just leave that edge-case aside so we don't have one behavior
or the other depending what "--fixed-strings" backend we're using. It
makes the behavior harder to understand and document, and makes tests
for the different backends more painful.
This does change the behavior under non-C locales when "log"'s
"--encoding" option is used and the heystack/needle in the
content/command-line doesn't have a matching encoding. See the recent
change in "t4210: skip more command-line encoding tests on MinGW" in
this series. I think that's OK. We did nothing sensible before
then (just compared raw bytes that had no hope of matching). At least
now the user will get some idea why their grep/log never matches in
that edge case.
I could also support the PCRE v1 backend here, but that would make the
code more complex. I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.
The difference between this series of commits and the current "master"
is, using the same t/perf commands shown in the last commit:
plain grep:
Test origin/master HEAD
-------------------------------------------------------------------------
7821.1: fixed grep int 0.55(1.67+0.56) 0.41(0.98+0.60) -25.5%
7821.2: basic grep int 0.58(1.65+0.52) 0.41(0.96+0.57) -29.3%
7821.3: extended grep int 0.57(1.66+0.49) 0.42(0.93+0.60) -26.3%
7821.4: perl grep int 0.54(1.67+0.50) 0.43(0.88+0.65) -20.4%
7821.6: fixed grep uncommon 0.21(0.52+0.42) 0.16(0.24+0.51) -23.8%
7821.7: basic grep uncommon 0.20(0.49+0.45) 0.17(0.28+0.47) -15.0%
7821.8: extended grep uncommon 0.20(0.54+0.39) 0.16(0.25+0.50) -20.0%
7821.9: perl grep uncommon 0.20(0.58+0.36) 0.16(0.23+0.50) -20.0%
7821.11: fixed grep æ 0.35(1.24+0.43) 0.16(0.23+0.50) -54.3%
7821.12: basic grep æ 0.36(1.29+0.38) 0.16(0.20+0.54) -55.6%
7821.13: extended grep æ 0.35(1.23+0.44) 0.16(0.24+0.50) -54.3%
7821.14: perl grep æ 0.35(1.33+0.34) 0.16(0.28+0.46) -54.3%
grep with -i:
Test origin/master HEAD
----------------------------------------------------------------------------
7821.1: fixed grep -i int 0.62(1.81+0.70) 0.47(1.11+0.64) -24.2%
7821.2: basic grep -i int 0.67(1.90+0.53) 0.46(1.07+0.62) -31.3%
7821.3: extended grep -i int 0.62(1.92+0.53) 0.53(1.12+0.58) -14.5%
7821.4: perl grep -i int 0.66(1.85+0.58) 0.45(1.10+0.59) -31.8%
7821.6: fixed grep -i uncommon 0.21(0.54+0.43) 0.17(0.20+0.55) -19.0%
7821.7: basic grep -i uncommon 0.20(0.52+0.45) 0.17(0.29+0.48) -15.0%
7821.8: extended grep -i uncommon 0.21(0.52+0.44) 0.17(0.26+0.50) -19.0%
7821.9: perl grep -i uncommon 0.21(0.53+0.44) 0.17(0.20+0.56) -19.0%
7821.11: fixed grep -i æ 0.26(0.79+0.44) 0.16(0.29+0.46) -38.5%
7821.12: basic grep -i æ 0.26(0.79+0.42) 0.16(0.20+0.54) -38.5%
7821.13: extended grep -i æ 0.26(0.84+0.39) 0.16(0.24+0.50) -38.5%
7821.14: perl grep -i æ 0.16(0.24+0.49) 0.17(0.25+0.51) +6.3%
plain log:
Test origin/master HEAD
--------------------------------------------------------------------------------
4221.1: fixed log --grep='int' 7.24(6.95+0.28) 7.20(6.95+0.18) -0.6%
4221.2: basic log --grep='int' 7.31(6.97+0.22) 7.20(6.93+0.21) -1.5%
4221.3: extended log --grep='int' 7.37(7.04+0.24) 7.22(6.91+0.25) -2.0%
4221.4: perl log --grep='int' 7.31(7.04+0.21) 7.19(6.89+0.21) -1.6%
4221.6: fixed log --grep='uncommon' 6.93(6.59+0.32) 7.04(6.66+0.37) +1.6%
4221.7: basic log --grep='uncommon' 6.92(6.58+0.29) 7.08(6.75+0.29) +2.3%
4221.8: extended log --grep='uncommon' 6.92(6.55+0.31) 7.00(6.68+0.31) +1.2%
4221.9: perl log --grep='uncommon' 7.03(6.59+0.33) 7.12(6.73+0.34) +1.3%
4221.11: fixed log --grep='æ' 7.41(7.08+0.28) 7.05(6.76+0.29) -4.9%
4221.12: basic log --grep='æ' 7.39(6.99+0.33) 7.00(6.68+0.25) -5.3%
4221.13: extended log --grep='æ' 7.34(7.00+0.25) 7.15(6.81+0.31) -2.6%
4221.14: perl log --grep='æ' 7.43(7.13+0.26) 7.01(6.60+0.36) -5.7%
log with -i:
Test origin/master HEAD
------------------------------------------------------------------------------------
4221.1: fixed log -i --grep='int' 7.31(7.07+0.24) 7.23(7.00+0.22) -1.1%
4221.2: basic log -i --grep='int' 7.40(7.08+0.28) 7.19(6.92+0.20) -2.8%
4221.3: extended log -i --grep='int' 7.43(7.13+0.25) 7.27(6.99+0.21) -2.2%
4221.4: perl log -i --grep='int' 7.34(7.10+0.24) 7.10(6.90+0.19) -3.3%
4221.6: fixed log -i --grep='uncommon' 7.07(6.71+0.32) 7.11(6.77+0.28) +0.6%
4221.7: basic log -i --grep='uncommon' 6.99(6.64+0.28) 7.12(6.69+0.38) +1.9%
4221.8: extended log -i --grep='uncommon' 7.11(6.74+0.32) 7.10(6.77+0.27) -0.1%
4221.9: perl log -i --grep='uncommon' 6.98(6.60+0.29) 7.05(6.64+0.34) +1.0%
4221.11: fixed log -i --grep='æ' 7.85(7.45+0.34) 7.03(6.68+0.32) -10.4%
4221.12: basic log -i --grep='æ' 7.87(7.49+0.29) 7.06(6.69+0.31) -10.3%
4221.13: extended log -i --grep='æ' 7.87(7.54+0.31) 7.09(6.69+0.31) -9.9%
4221.14: perl log -i --grep='æ' 7.06(6.77+0.28) 6.91(6.57+0.31) -2.1%
So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change "-f <file>" to not support patterns with a NUL-byte in them
under --fixed-strings. We'll now only support these under
"--perl-regexp" with PCRE v2.
A previous change to grep's documentation changed the description of
"-f <file>" to be vague enough as to not promise that this would work.
By dropping support for this we make it a whole lot easier to move
away from the kwset backend, which we'll do in a subsequent change.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The behavior of "grep" when patterns contained a NUL-byte has always
been haphazard, and has served the vagaries of the implementation more
than anything else. A pattern containing a NUL-byte can only be
provided via "-f <file>". Since pickaxe (log search) has no such flag
the NUL-byte in patterns has only ever been supported by "grep" (and
not "log --grep").
Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.
Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because they
contain a NUL-byte. Instead die if the backend in question can't
handle them, e.g. --basic-regexp is combined with such a pattern.
This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.
Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), let's just inline it there.
This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blindspot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.
We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.
Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.
But for now matching this as non-UTF8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
isn't true when we've explicitly encoded it to be in a different
encoding.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
grep_source(), which performs much of the work for Git's grep library,
allows passing an arbitrary struct grep_source which represents the text
which grep_source() should search to match a pattern in the provided
struct grep_opt. In most callers, the grep_source::name field is set to
an appropriate prefix to print before a colon when a result matches:
README:Git is an Open Source project covered by the GNU General
One caller, grep_buffer(), leaves the grep_source::name field set to
NULL because there isn't enough context to determine an appropriate name
for this kind of output line. In practice, this has been fine: the only
caller of grep_buffer() is "git log --grep", and that caller sets
grep_opt::status_only, which disables output and only checks whether a
match exists. But this is brittle: a future caller can call
grep_buffer() without grep_opt::status_only set, and as soon as it hits
a match, grep_source() will try to print the match and segfault:
(null):Git is an Open Source project covered by the GNU General
For example, a future caller might want to print all matching lines from
commits which match a regex.
Futureproof by diagnosing early a use of the API that could trigger that
condition, before we know whether the pattern matches:
BUG: grep.c:1783: grep call which could print a name requires
grep_source.name be non-NULL
Aborted
This way, the caller's author gets an indication of how to fix the issue
- by providing grep_source::name or setting grep_opt::status_only - and
they are warned of the potential for a segfault unconditionally, rather
than only if there is a match.
Noticed while adding such a call to a tutorial on revision walks.
Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is a faithful conversion without attempting to improve
anything. That comes later.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
[jc: squashed in missing forward decl in userdiff.h found by Ramsay]
Helped-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git grep" learned the "--column" option that gives not just the
line number but the column number of the hit.
* tb/grep-column:
contrib/git-jump/git-jump: jump to exact location
grep.c: add configuration variables to show matched option
builtin/grep.c: add '--column' option to 'git-grep(1)'
grep.c: display column number of first match
grep.[ch]: extend grep_opt to allow showing matched column
grep.c: expose {,inverted} match column in match_line()
Documentation/config.txt: camel-case lineNumber for consistency
The conversion to pass "the_repository" and then "a_repository"
throughout the object access API continues.
* sb/object-store-grafts:
commit: allow lookup_commit_graft to handle arbitrary repositories
commit: allow prepare_commit_graft to handle arbitrary repositories
shallow: migrate shallow information into the object parser
path.c: migrate global git_path_* to take a repository argument
cache: convert get_graft_file to handle arbitrary repositories
commit: convert read_graft_file to handle arbitrary repositories
commit: convert register_commit_graft to handle arbitrary repositories
commit: convert commit_graft_pos() to handle arbitrary repositories
shallow: add repository argument to is_repository_shallow
shallow: add repository argument to check_shallow_file_for_update
shallow: add repository argument to register_shallow
shallow: add repository argument to set_alternate_shallow_file
commit: add repository argument to lookup_commit_graft
commit: add repository argument to prepare_commit_graft
commit: add repository argument to read_graft_file
commit: add repository argument to register_commit_graft
commit: add repository argument to commit_graft_pos
object: move grafts to object parser
object-store: move object access functions to object-store.h
Teach 'git grep --only-matching', a new option to only print the
matching part(s) of a line.
For instance, a line containing the following (taken from README.md:27):
(`man gitcvs-migration` or `git help cvs-migration` if git is
Is printed as follows:
$ git grep --line-number --column --only-matching -e git -- \
README.md | grep ":27"
README.md:27:7:git
README.md:27:16:git
README.md:27:38:git
The patch works mostly as one would expect, with the exception of a few
considerations that are worth mentioning here.
Like GNU grep, this patch ignores --only-matching when --invert (-v) is
given. There is a sensible answer here, but parity with the behavior of
other tools is preferred.
Because a line might contain more than one match, there are special
considerations pertaining to when to print line headers, newlines, and
how to increment the match column offset. The line header and newlines
are handled as a special case within the main loop to avoid polluting
the surrounding code with conditionals that have large blocks.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The grep code invokes show_line() to display the contents of a matched
or context line in its output. Part of this execution is to print a line
header that includes information such as the kind, the line- and
column-number and etc. of that match.
To prepare for the addition of an option to print only the matching
component(s) of a non-context line, we must prepare for the possibility
that a single line may contain multiple matching parts, and thus will
need multiple headers printed for a single line.
Extracting show_line_header allows us to do just that. In the subsequent
commit, it will be used within the colorization loop to print out only
the matching parts of a line, optionally with LFs delimiting
sub-matches.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>