git-commit-vandalism

Author	SHA1	Message	Date
Ævar Arnfjörð Bjarmason	870eea8166	grep: do not enter PCRE2_UTF mode on fixed matching As discussed in the last commit partially fix a bug introduced in `b65abcafc7` ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input we'd die on e.g.: fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set When grepping a non-ASCII fixed string. This is a more general problem that's hard to fix, but we can at least fix the most common case of grepping for a fixed string without "-i". I can't think of a reason for why we'd turn on PCRE2_UTF when matching byte-for-byte like that. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-07-26 13:56:40 -07:00
Ævar Arnfjörð Bjarmason	8a5999838e	grep: stess test PCRE v2 on invalid UTF-8 data Since my `b65abcafc7` ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) we've been dying on invalid UTF-8 data when grepping for fixed strings if the following are all true: * The subject string is non-ASCII (e.g. "ævar") * We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C" * We compiled with PCRE v2 * That PCRE v2 did not have JIT support The last of those is why this wasn't caught earlier, per pcre2jit(3): "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined." I.e. the subject being matched against our pattern was invalid, but we were lucky and getting away with it on the JIT path, but the non-JIT one is stricter. This patch does nothing to fix that, instead we sneak in support for fixed patterns starting with "(NO_JIT)", this disables the PCRE v2 jit with implicit fixed-string matching for testing, see pcre2syntax(3) the syntax. This is technically a change in behavior, but it's so obscure that I figured it was OK. We'd previously consider this an invalid regular expression as regcomp() would die on it, now we feed it to the PCRE v2 fixed-string path. I thought this was better than introducing yet another GIT_TEST_ environment variable. We're also relying on a behavior of PCRE v2 that technically could change, but I think the test coverage is worth dipping our toe into some somewhat undefined behavior. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-07-26 13:56:40 -07:00
Nguyễn Thái Ngọc Duy	9038531f1b	t/helper: merge test-regex into test-tool Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-03-27 08:45:47 -07:00
Ævar Arnfjörð Bjarmason	e01b4dab01	grep: change non-ASCII -i test to stop using --debug Change a non-ASCII case-insensitive test case to stop using --debug, and instead simply test for the expected results. The test coverage remains the same with this change, but the test won't break due to internal refactoring. This test was added in commit `793dc676e0` ("grep/icase: avoid kwsset when -F is specified", 2016-06-25). It was asserting that the regex must be compiled with compile_fixed_regexp(), instead test for the expected results, allowing the underlying implementation to change. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-05-21 08:25:37 +09:00
Ævar Arnfjörð Bjarmason	3eb585c112	test-lib: rename the LIBPCRE prerequisite to PCRE Rename the LIBPCRE prerequisite to PCRE. This is for preparation for libpcre2 support, where having just "LIBPCRE" would be confusing as it implies v1 of the library. None of these tests are incompatible between versions 1 & 2 of libpcre, it's less confusing to give them a more general name to make it clear that they work on both library versions. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2017-05-21 08:25:37 +09:00
Nguyễn Thái Ngọc Duy	b51a9c1479	diffcore-pickaxe: support case insensitive match on non-ascii Similar to the "grep -F -i" case, we can't use kws on icase search outside ascii range, so we quote the string and pass it to regcomp as a basic regexp and let regex engine deal with case sensitivity. The new test is put in t7812 instead of t4209-log-pickaxe because lib-gettext.sh might cause problems elsewhere, probably. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:57 -07:00
Nguyễn Thái Ngọc Duy	18547aacf5	grep/pcre: support utf-8 In the previous change in this function, we add locale support for single-byte encodings only. It looks like pcre only supports utf-* as multibyte encodings, the others are left in the cold (which is fine). We need to enable PCRE_UTF8 so pcre can find character boundary correctly. It's needed for case folding (when --ignore-case is used) or '*', '+' or similar syntax is used. The "has_non_ascii()" check is to be on the conservative side. If there's non-ascii in the pattern, the searched content could still be in utf-8, but we can treat it just like a byte stream and everything should work. If we force utf-8 based on locale only and pcre validates utf-8 and the file content is in non-utf8 encoding, things break. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: Plamen Totev <plamen.totev@abv.bg> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:57 -07:00
Nguyễn Thái Ngọc Duy	793dc676e0	grep/icase: avoid kwsset when -F is specified Similar to the previous commit, we can't use kws on icase search outside ascii range. But we can't simply pass the pattern to regcomp/pcre like the previous commit because it may contain regex special characters, so we need to quote the regex first. To avoid misquote traps that could lead to undefined behavior, we always stick to basic regex engine in this case. We don't need fancy features for grepping a literal string anyway. basic_regex_quote_buf() assumes that if the pattern is in a multibyte encoding, ascii chars must be unambiguously encoded as single bytes. This is true at least for UTF-8. For others, let's wait until people yell up. Chances are nobody uses multibyte, non utf-8 charsets anymore. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: René Scharfe <l.s.r@web.de> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-01 12:44:30 -07:00
Nguyễn Thái Ngọc Duy	5c1ebcca4d	grep/icase: avoid kwsset on literal non-ascii strings When we detect the pattern is just a literal string, we avoid heavy regex engine and use fast substring search implemented in kwsset.c. But kws uses git-ctype which is locale-independent so it does not know how to fold case properly outside ascii range. Let regcomp or pcre take care of this case instead. Slower, but accurate. Noticed-by: Plamen Totev <plamen.totev@abv.bg> Helped-by: René Scharfe <l.s.r@web.de> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-06-27 07:31:35 -07:00

9 Commits