log: re-encode commit messages before grepping

If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.

Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.

As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:

    git commit -m 'text1' --allow-empty
    git commit -m 'text2' --allow-empty
    git log --graph --no-walk --grep 'text2'

which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 21:59:58 +01:00
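To make the byte-level mismatch concrete, here is a minimal sketch that uses plain grep on raw bytes rather than git itself. The names utf8_msg, latin1_msg and count_matches are illustrative only, and the sketch assumes a POSIX shell with printf and grep; LC_ALL=C is forced so that matching is strictly byte-wise.

```shell
#!/bin/sh
# Illustrative only: plain grep on raw bytes, not git's own code path.
# "é" is two bytes in UTF-8 (0xC3 0xA9) and one byte in ISO-8859-1 (0xE9).
utf8_e=$(printf '\303\251')
latin1_e=$(printf '\351')

# The same subject line, "tést", stored once in each encoding.
utf8_msg="t${utf8_e}st"
latin1_msg="t${latin1_e}st"

# count_matches PATTERN: how many of the two messages contain PATTERN
# when grep compares raw bytes (LC_ALL=C, fixed-string match).
count_matches () {
	printf '%s\n' "$utf8_msg" "$latin1_msg" |
	LC_ALL=C grep -c -F -e "$1"
}

utf8_hits=$(count_matches "$utf8_e")
latin1_hits=$(count_matches "$latin1_e")
echo "UTF-8 pattern matched $utf8_hits of 2 messages"
echo "Latin-1 pattern matched $latin1_hits of 2 messages"
```

Each pattern finds only the message stored in its own encoding, which is the confusion described above. Converting both messages to one output encoding before matching is what lets a single pattern find both.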
#!/bin/sh

test_description='test log with i18n features'
. ./lib-gettext.sh
# two forms of é
utf8_e=$(printf '\303\251')
latin1_e=$(printf '\351')

# invalid UTF-8
invalid_e=$(printf '\303\50)') # ")" at end to close opening "("

have_reg_illseq=
if test_have_prereq GETTEXT_LOCALE &&
	! LC_ALL=$is_IS_locale test-tool regex --silent $latin1_e
then
	have_reg_illseq=1
fi
test_expect_success 'create commits in different encodings' '
	test_tick &&
	cat >msg <<-EOF &&
	utf8

	t${utf8_e}st
	EOF
	git add msg &&
	git -c i18n.commitencoding=utf8 commit -F msg &&
	cat >msg <<-EOF &&
	latin1

	t${latin1_e}st
	EOF
	git add msg &&
	git -c i18n.commitencoding=ISO-8859-1 commit -F msg
'

test_expect_success 'log --grep searches in log output encoding (utf8)' '
	cat >expect <<-\EOF &&
	latin1
	utf8
	EOF
	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
	test_cmp expect actual
'

test_expect_success !MINGW 'log --grep searches in log output encoding (latin1)' '
	cat >expect <<-\EOF &&
	latin1
	utf8
	EOF
	git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
	test_cmp expect actual
'

test_expect_success !MINGW 'log --grep does not find non-reencoded values (utf8)' '
	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
	test_must_be_empty actual
'

test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
	test_must_be_empty actual
'

triggers_undefined_behaviour () {
	local engine=$1

	case $engine in
	fixed)
		if test -n "$have_reg_illseq" &&
			! test_have_prereq LIBPCRE2
		then
			return 0
		fi
		;;
	basic|extended)
		if test -n "$have_reg_illseq"
		then
			return 0
		fi
		;;
	esac
	return 1
}

mismatched_git_log () {
	local pattern=$1

	LC_ALL=$is_IS_locale git log --encoding=ISO-8859-1 --format=%s \
		--grep=$pattern
}

for engine in fixed basic extended perl
do
	prereq=
	if test $engine = "perl"
	then
		prereq=PCRE
	fi
	force_regex=
	if test $engine != "fixed"
	then
		force_regex='.*'
	fi

	test_expect_success $prereq "config grep.patternType=$engine" "
		git config grep.patternType $engine
	"

	test_expect_success GETTEXT_LOCALE,$prereq "log --grep does not find non-reencoded values (latin1 + locale)" "
		mismatched_git_log '$force_regex$utf8_e' >actual &&
		test_must_be_empty actual
	"

	if ! triggers_undefined_behaviour $engine
	then
		test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "log --grep searches in log output encoding (latin1 + locale)" "
			cat >expect <<-\EOF &&
			latin1
			utf8
			EOF
			mismatched_git_log '$force_regex$latin1_e' >actual &&
			test_cmp expect actual
		"

		test_expect_success GETTEXT_LOCALE,$prereq "log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
			mismatched_git_log '$force_regex$invalid_e' >actual &&
			test_must_be_empty actual
		"
	fi
done
logmsg_reencode(): warn when iconv() fails

If the user asks for a pretty-printed commit to be converted (either
explicitly with --encoding=foo, or implicitly because the commit is
non-utf8 and we want to convert it), we pass it through iconv(). If that
fails, we fall back to showing the input verbatim, but don't tell the
user that the output may be bogus.

Let's add a warning to do so, along with a mention in the documentation
for --encoding. Two things to note about the implementation:

  - we could produce the warning closer to the call to iconv() in
    reencode_string_len(), which would let us relay the value of errno.
    But this is not actually very helpful. reencode_string_len() does
    not know we are operating on a commit, and indeed does not know that
    the caller won't produce an error of its own. And the errno values
    from iconv() are seldom helpful (iconv_open() only ever produces
    EINVAL; perhaps EILSEQ from iconv() might be illuminating, but it
    can also return EINVAL for incomplete sequences).

  - if the reason for the failure is that the output charset is not
    supported, then the user will see this warning for every commit we
    try to display. That might be ugly and overwhelming, but on the
    other hand it is making it clear that every one of them has not been
    converted (and the likely outcome anyway is to re-try the command
    with a supported output encoding).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-27 20:30:15 +02:00
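The warn-and-fall-back behaviour can be sketched with the iconv command-line tool standing in for git's internal logmsg_reencode(). The helper name reencode_or_warn is hypothetical, and the sketch assumes the stored text is UTF-8; it only mirrors the idea of emitting a warning and then showing the input verbatim, not git's actual code.

```shell
#!/bin/sh
# Hypothetical helper, not git's implementation: convert TEXT from UTF-8
# to ENCODING with the iconv CLI; if conversion fails, warn on stderr and
# fall back to printing TEXT unchanged.
reencode_or_warn () {
	text=$1
	encoding=$2
	if converted=$(printf '%s' "$text" |
		iconv -f UTF-8 -t "$encoding" 2>/dev/null)
	then
		printf '%s\n' "$converted"
	else
		echo "warning: unable to reencode commit to '$encoding'" >&2
		printf '%s\n' "$text"
	fi
}

# An encoding name that iconv is not expected to recognise.
enc=this-encoding-does-not-exist

# Capture stdout and stderr separately to show both halves of the fallback.
out=$(reencode_or_warn 'subject line' "$enc" 2>/dev/null)
warning=$(reencode_or_warn 'subject line' "$enc" 2>&1 >/dev/null)
echo "stdout: $out"
echo "stderr: $warning"
```

The input still reaches stdout unchanged, so existing callers keep working, while the warning on stderr signals that what is shown may not be in the requested encoding.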
test_expect_success 'log shows warning when conversion fails' '
	enc=this-encoding-does-not-exist &&
	git log -1 --encoding=$enc 2>err &&
	echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect &&
	test_cmp expect err
'
test_done