git-commit-vandalism

Author	SHA1	Message	Date
Ævar Arnfjörð Bjarmason	95ca1f987e	grep/pcre2: better support invalid UTF-8 haystacks Improve the support for invalid UTF-8 haystacks given a non-ASCII needle when using the PCREv2 backend. This is a more complete fix for a bug I started to fix in `870eea8166` (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26), now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode we can make use of it. This fixes the sort of case described in `8a5999838e` (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.: - The subject string is non-ASCII (e.g. "ævar") - We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C" - We are using --ignore-case, or we're a non-fixed pattern If those conditions were satisfied and we matched found non-valid UTF-8 data PCREv2 might bark on it, in practice this only happened under the JIT backend (turned on by default on most platforms). Ultimately this fixes a "regression" in `b65abcafc7` ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01), I'm putting that in scare-quotes because before then we wouldn't properly support these complex case-folding, locale etc. cases either, it just broke in different ways. There was a bug related to this the PCRE2_NO_START_OPTIMIZE flag fixed in PCREv2 10.36. It can be worked around by setting the PCRE2_NO_START_OPTIMIZE flag. Let's do that in those cases, and add tests for the bug. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-24 16:09:17 -08:00
Ævar Arnfjörð Bjarmason	a4fea08b6e	grep/pcre2 tests: don't rely on invalid UTF-8 data test As noted in [1] when I originally added this test in [2] the test was completely broken as it lacked a redirect[3]. I now think this whole thing is overly fragile. Let's only test if we have a segfault here. Before this the first test's "test_cmp" was pretty meaningless. We were only testing if PCREv2 was so broken that it would spew out something completely unrelated on stdout, which isn't very plausible. In the second test we're relying on PCREv2 forever holding to the current behavior of the PCRE_UTF8 flag, as opposed to learning some optimistic graceful fallback to PCRE2_MATCH_INVALID_UTF in the future. If that happens having this test broken under bisecting would suck. A follow-up commit will actually test this case in a meaningful way under the PCRE2_MATCH_INVALID_UTF flag. Let's run this one unconditionally, and just make sure we don't segfault. 1. `e714b898c6` (t7812: expect failure for grep -i with invalid UTF-8 data, 2019-11-29) 2. `8a5999838e` (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26) 3. `c74b3cbb83` (t7812: add missing redirects, 2019-11-26) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-24 16:09:15 -08:00
Todd Zullinger	e714b898c6	t7812: expect failure for grep -i with invalid UTF-8 data When the 'grep with invalid UTF-8 data' tests were added/adjusted in `8a5999838e` (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26) and `870eea8166` (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26) they lacked a redirect which caused them to falsely succeed on most systems. The 'grep -i' test failed on systems where JIT was disabled as it never reached the portion which was missing the redirect. A recent patch added the missing redirect and exposed the fact that the 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails regardless of whether JIT is enabled. Based on the final paragraph in in `870eea8166`: When grepping a non-ASCII fixed string. This is a more general problem that's hard to fix, but we can at least fix the most common case of grepping for a fixed string without "-i". I can't think of a reason for why we'd turn on PCRE2_UTF when matching byte-for-byte like that. it seems that we don't expect that the case-insensitive grep will succeed. Adjust the test to reflect that expectation. Signed-off-by: Todd Zullinger <tmz@pobox.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-12-01 22:09:07 -08:00
Andreas Schwab	c74b3cbb83	t7812: add missing redirects Two tests in t7812, added in `8a599983` ("grep: stess test PCRE v2 on invalid UTF-8 data", 2019-07-26), were missing redirects, failing to actually test the produced output. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-11-27 11:33:55 +09:00
Ævar Arnfjörð Bjarmason	870eea8166	grep: do not enter PCRE2_UTF mode on fixed matching As discussed in the last commit partially fix a bug introduced in `b65abcafc7` ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input we'd die on e.g.: fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set When grepping a non-ASCII fixed string. This is a more general problem that's hard to fix, but we can at least fix the most common case of grepping for a fixed string without "-i". I can't think of a reason for why we'd turn on PCRE2_UTF when matching byte-for-byte like that. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-07-26 13:56:40 -07:00
Ævar Arnfjörð Bjarmason	8a5999838e	grep: stess test PCRE v2 on invalid UTF-8 data Since my `b65abcafc7` ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) we've been dying on invalid UTF-8 data when grepping for fixed strings if the following are all true: * The subject string is non-ASCII (e.g. "ævar") * We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C" * We compiled with PCRE v2 * That PCRE v2 did not have JIT support The last of those is why this wasn't caught earlier, per pcre2jit(3): "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined." I.e. the subject being matched against our pattern was invalid, but we were lucky and getting away with it on the JIT path, but the non-JIT one is stricter. This patch does nothing to fix that, instead we sneak in support for fixed patterns starting with "(NO_JIT)", this disables the PCRE v2 jit with implicit fixed-string matching for testing, see pcre2syntax(3) the syntax. This is technically a change in behavior, but it's so obscure that I figured it was OK. We'd previously consider this an invalid regular expression as regcomp() would die on it, now we feed it to the PCRE v2 fixed-string path. I thought this was better than introducing yet another GIT_TEST_ environment variable. We're also relying on a behavior of PCRE v2 that technically could change, but I think the test coverage is worth dipping our toe into some somewhat undefined behavior. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-07-26 13:56:40 -07:00
Nguyễn Thái Ngọc Duy	9038531f1b	t/helper: merge test-regex into test-tool Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-03-27 08:45:47 -07:00
Ævar Arnfjörð Bjarmason	e01b4dab01	grep: change non-ASCII -i test to stop using --debug Change a non-ASCII case-insensitive test case to stop using --debug, and instead simply test for the expected results. The test coverage remains the same with this change, but the test won't break due to internal refactoring. This test was added in commit `793dc676e0` ("grep/icase: avoid kwsset when -F is specified", 2016-06-25). It was asserting that the regex must be compiled with compile_fixed_regexp(), instead test for the expected results, allowing the underlying implementation to change. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-05-21 08:25:37 +09:00
Ævar Arnfjörð Bjarmason	3eb585c112	test-lib: rename the LIBPCRE prerequisite to PCRE Rename the LIBPCRE prerequisite to PCRE. This is for preparation for libpcre2 support, where having just "LIBPCRE" would be confusing as it implies v1 of the library. None of these tests are incompatible between versions 1 & 2 of libpcre, it's less confusing to give them a more general name to make it clear that they work on both library versions. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-05-21 08:25:37 +09:00
Nguyễn Thái Ngọc Duy	b51a9c1479	diffcore-pickaxe: support case insensitive match on non-ascii Similar to the "grep -F -i" case, we can't use kws on icase search outside ascii range, so we quote the string and pass it to regcomp as a basic regexp and let regex engine deal with case sensitivity. The new test is put in t7812 instead of t4209-log-pickaxe because lib-gettext.sh might cause problems elsewhere, probably. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:57 -07:00
Nguyễn Thái Ngọc Duy	18547aacf5	grep/pcre: support utf-8 In the previous change in this function, we add locale support for single-byte encodings only. It looks like pcre only supports utf-* as multibyte encodings, the others are left in the cold (which is fine). We need to enable PCRE_UTF8 so pcre can find character boundary correctly. It's needed for case folding (when --ignore-case is used) or '*', '+' or similar syntax is used. The "has_non_ascii()" check is to be on the conservative side. If there's non-ascii in the pattern, the searched content could still be in utf-8, but we can treat it just like a byte stream and everything should work. If we force utf-8 based on locale only and pcre validates utf-8 and the file content is in non-utf8 encoding, things break. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: Plamen Totev <plamen.totev@abv.bg> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:57 -07:00
Nguyễn Thái Ngọc Duy	793dc676e0	grep/icase: avoid kwsset when -F is specified Similar to the previous commit, we can't use kws on icase search outside ascii range. But we can't simply pass the pattern to regcomp/pcre like the previous commit because it may contain regex special characters, so we need to quote the regex first. To avoid misquote traps that could lead to undefined behavior, we always stick to basic regex engine in this case. We don't need fancy features for grepping a literal string anyway. basic_regex_quote_buf() assumes that if the pattern is in a multibyte encoding, ascii chars must be unambiguously encoded as single bytes. This is true at least for UTF-8. For others, let's wait until people yell up. Chances are nobody uses multibyte, non utf-8 charsets anymore. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: René Scharfe <l.s.r@web.de> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:30 -07:00
Nguyễn Thái Ngọc Duy	5c1ebcca4d	grep/icase: avoid kwsset on literal non-ascii strings When we detect the pattern is just a literal string, we avoid heavy regex engine and use fast substring search implemented in kwsset.c. But kws uses git-ctype which is locale-independent so it does not know how to fold case properly outside ascii range. Let regcomp or pcre take care of this case instead. Slower, but accurate. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: René Scharfe <l.s.r@web.de> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-06-27 07:31:35 -07:00

13 Commits