fsck: do not assume NUL-termination of buffers

The fsck code operates on an object buffer represented as a pointer/len
combination. However, the parsing of commits and tags is a little bit
loose; we mostly scan left-to-right through the buffer, without checking
whether we've gone past the length we were given.

This has traditionally been OK because the buffers we feed to fsck
always have an extra NUL after the end of the object content, which ends
any left-to-right scan. That has always been true for objects we read
from the odb, and we made it true for incoming index-pack/unpack-objects
checks in a1e920a0a7 (index-pack: terminate object buffers with NUL,
2014-12-08).

However, we recently added an exception: hash-object asks index_fd() to
do fsck checks. That _may_ have an extra NUL (if we read from a pipe
into a strbuf), but it might not (if we read the contents from the
file). Nor can we just teach it to always add a NUL. We may mmap the
on-disk file, which will not have any extra bytes (if it's a multiple of
the page size). Not to mention that this is a rather subtle assumption
for the fsck code to make.

Instead, let's make sure that the fsck parsers don't ever look past the
size of the buffer they've been given. This _almost_ works already,
thanks to earlier work in 4d0d89755e (Make sure fsck_commit_buffer()
does not run out of the buffer, 2014-09-11). The theory there is that we
check up front whether we have the end of header double-newline
separator. And then any left-to-right scanning we do is OK as long as it
stops when it hits that boundary.

However, we later softened that in 84d18c0bcf (fsck: it is OK for a tag
and a commit to lack the body, 2015-06-28), which allows the
double-newline header to be missing, but does require that the header
ends in a newline. That was OK back then, because of the NUL-termination
guarantees (including the one from a1e920a0a7 mentioned above).

Because 84d18c0bcf guarantees that any header line does end in a
newline, we are still OK with most of the left-to-right scanning. We
only need to take care after completing a line, to check that there is
another line (and we didn't run out of buffer).

Most of these just need to check "buffer < buffer_end" (where
buffer is advanced as we parse) before scanning for the next header
line. But here are a few notes:

  - we don't technically need to check for remaining buffer before
    parsing the very first line ("tree" for a commit, or "object" for a
    tag), because verify_headers() rejects a totally empty buffer. But
    we'll do so in the name of consistency and defensiveness.

  - there are some calls to strchr('\n'). These are actually OK by the
    "the final header line must end in a newline" guarantee from
    verify_headers(). They will always find that rather than run off the
    end of the buffer. Curiously, they do check for a NULL return and
    complain, but I believe that condition can never be reached.

    However, I converted them to use memchr() with a proper size and
    retained the NULL checks. Using memchr() is not much longer and
    makes it more obvious what is going on. Likewise, retaining the NULL
    checks serves as a defensive measure in case my analysis is wrong.

  - commit 9a1a3a4d4c (mktag: allow omitting the header/body \n
    separator, 2021-01-05), does check for the end-of-buffer condition,
    but does so with "!*buffer", relying explicitly on the NUL
    termination. We can accomplish the same thing with a pointer
    comparison. I also folded it into the follow-on conditional that
    checks the contents of the buffer, for consistency with the other
    checks.

  - fsck_ident() uses parse_timestamp(), which is based on strtoumax().
    That function will happily skip past leading whitespace, including
    newlines, which makes it a risk. We can fix this by scanning to the
    first digit ourselves, and then using parse_timestamp() to do the
    actual numeric conversion.

    Note that as a side effect this fixes the fact that we missed
    zero-padded timestamps preceded by extra whitespace, like
    "<email>   0123" (whereas we would complain about "<email> 0123").
    I doubt anybody cares, but I mention it here for completeness.

  - fsck_tree() does not need any modifications. It relies on
    decode_tree_entry() to do the actual parsing, and that function
    checks both that there are enough bytes in the buffer to represent
    an entry, and that there is a NUL at the appropriate spot (one
    hash-length from the end; this may not be the NUL for the entry we
    are parsing, but we know that in the worst case, everything from our
    current position to that NUL is a filename, so we won't run out of
    bytes).

In addition to fixing the code itself, we'd like to make sure our rather
subtle assumptions are not violated in the future. So this patch does
two more things:

  - add comments around verify_headers() documenting the link between
    what it checks and the memory safety of the callers. I don't expect
    this code to be modified frequently, but this may keep somebody from
    accidentally breaking things.

  - add a thorough set of tests covering truncations at various key
    spots (e.g., for a "tree $oid" line, in the middle of the word
    "tree", right after it, after the space, in the middle of the $oid,
    and right at the end of the line). Most of these are fine already (it
    is only truncating right at the end of the line that is currently
    broken). And some of them are not even possible with the current
    code (we parse "tree " as a unit, so truncating before the space is
    equivalent). But I aimed here to consider the code a black box and
    look for any truncations that would be a problem for a left-to-right
    parser.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-20 00:13:29 +01:00
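The truncation spots named in the last bullet of the message above can be enumerated mechanically. The following is only a sketch: truncation_points is a hypothetical helper that is not part of the test script, the two-character and four-character cut widths are arbitrary choices, and the object ID is just a sample value (the well-known SHA-1 of the empty tree).

```shell
# Hypothetical helper (not in the test script): print one truncated variant
# of a "<keyword> <value>" header line per output line.
truncation_points () {
	keyword=$1
	value=$2
	# middle of the keyword: keep only its first two characters
	printf '%s\n' "$(printf '%s' "$keyword" | cut -c1-2)"
	# right after the keyword
	printf '%s\n' "$keyword"
	# after the separating space
	printf '%s\n' "$keyword "
	# middle of the value: keep only its first four characters
	printf '%s\n' "$keyword $(printf '%s' "$value" | cut -c1-4)"
	# the complete line
	printf '%s\n' "$keyword $value"
}

truncation_points tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The script below spells out each case by hand instead of generating them, which keeps the expected fsck message next to the input that triggers it.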
#!/bin/sh

test_description='fsck on buffers without NUL termination

The goal here is to make sure that the various fsck parsers never look
past the end of the buffer they are given, even when encountering broken
or truncated objects.

We have to use "hash-object" for this because most code paths that read objects
append an extra NUL for safety after the buffer. But hash-object, since it is
reading straight from a file (and possibly even mmap-ing it) cannot always do
so.

These tests _might_ catch such overruns in normal use, but should be run with
ASan or valgrind for more confidence.
'
tests: mark tests as passing with SANITIZE=leak

When the "ab/various-leak-fixes" topic was merged in [1] only t6021
would fail if the tests were run in the
"GIT_TEST_PASSING_SANITIZE_LEAK=check" mode, i.e. to check whether we
marked all leak-free tests with "TEST_PASSES_SANITIZE_LEAK=true".

Since then we've had various tests starting to pass under
SANITIZE=leak. Let's mark those as passing. This is when they started
to pass, narrowed down with "git bisect":

  - t5317-pack-objects-filter-objects.sh: In
    faebba436e6 (list-objects-filter: plug pattern_list leak, 2022-12-01).

  - t3210-pack-refs.sh, t5613-info-alternate.sh,
    t7403-submodule-sync.sh: In 189e97bc4ba (diff: remove parseopts member
    from struct diff_options, 2022-12-01).

  - t1408-packed-refs.sh: In ab91f6b7c42 (Merge branch
    'rs/diff-parseopts', 2022-12-19).

  - t0023-crlf-am.sh, t4152-am-subjects.sh, t4254-am-corrupt.sh,
    t4256-am-format-flowed.sh, t4257-am-interactive.sh,
    t5403-post-checkout-hook.sh: In a658e881c13 (am: don't pass strvec to
    apply_parse_options(), 2022-12-13).

  - t1301-shared-repo.sh, t1302-repo-version.sh: In b07a819c05f (reflog:
    clear leftovers in reflog_expiry_cleanup(), 2022-12-13).

  - t1304-default-acl.sh, t1410-reflog.sh,
    t5330-no-lazy-fetch-with-commit-graph.sh, t5502-quickfetch.sh,
    t5604-clone-reference.sh, t6014-rev-list-all.sh,
    t7701-repack-unpack-unreachable.sh: In b0c61be3209 (Merge branch
    'rs/reflog-expiry-cleanup', 2022-12-26).

  - t3800-mktag.sh, t5302-pack-index.sh, t5306-pack-nobase.sh,
    t5573-pull-verify-signatures.sh, t7612-merge-verify-signatures.sh: In
    69bbbe484ba (hash-object: use fsck for object checks, 2023-01-18).

  - t1451-fsck-buffer.sh: In 8e4309038f0 (fsck: do not assume
    NUL-termination of buffers, 2023-01-19).

  - t6501-freshen-objects.sh: In abf2bb895b4 (Merge branch
    'jk/hash-object-fsck', 2023-01-30).

1. 9ea1378d046 (Merge branch 'ab/various-leak-fixes', 2022-12-14)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-07 00:07:36 +01:00
TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh

# the general idea for tags and commits is to build up the "base" file
# progressively, and then test new truncations on top of it.
reset () {
	test_expect_success 'reset input to empty' '
		>base
	'
}

add () {
	content="$1"
	type=${content%% *}
	test_expect_success "add $type line" '
		echo "$content" >>base
	'
}

check () {
	type=$1
	fsck=$2
	content=$3
	test_expect_success "truncated $type ($fsck, \"$content\")" '
		# do not pipe into hash-object here; we want to increase
		# the chance that it uses a fixed-size buffer or mmap,
		# and a pipe would be read into a strbuf.
		{
			cat base &&
			echo "$content"
		} >input &&
		test_must_fail git hash-object -t "$type" input 2>err &&
		grep "$fsck" err
	'
}
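To make the shape of each test input concrete, here is a minimal standalone sketch of the file that check() hands to hash-object. The build_input helper, the temporary directory, and the sample "tree 1234" line are all hypothetical; the sketch only mirrors the "cat base && echo" grouping above and never calls git.

```shell
# Hypothetical stand-in for the input-building half of check(): emit the
# lines accepted so far from the file named by $1, then the truncated line $2.
build_input () {
	cat "$1" &&
	echo "$2"
}

dir=$(mktemp -d) &&
printf 'tree 1234\n' >"$dir/base" &&
build_input "$dir/base" "par" >"$dir/input" &&
# Dump the raw bytes, including the newline that echo adds after "par".
od -c "$dir/input"
rm -rf "$dir"
```

Because echo appends a newline, even an empty $content produces a newline-terminated final line. The truncation being tested is therefore in what the last header line contains, not in a missing line terminator.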
test_expect_success 'create valid objects' '
	git commit --allow-empty -m foo &&
	commit=$(git rev-parse --verify HEAD) &&
	tree=$(git rev-parse --verify HEAD^{tree})
'

reset
check commit missingTree ""
check commit missingTree "tr"
check commit missingTree "tree"
check commit badTreeSha1 "tree "
check commit badTreeSha1 "tree 1234"
add "tree $tree"

# these expect missingAuthor because "parent" is optional
check commit missingAuthor ""
check commit missingAuthor "par"
check commit missingAuthor "parent"
check commit badParentSha1 "parent "
check commit badParentSha1 "parent 1234"
add "parent $commit"

check commit missingAuthor ""
check commit missingAuthor "au"
check commit missingAuthor "author"
ident_checks () {
	check $1 missingEmail "$2 "
	check $1 missingEmail "$2 name"
	check $1 badEmail "$2 name <"
	check $1 badEmail "$2 name <email"
	check $1 missingSpaceBeforeDate "$2 name <email>"
	check $1 badDate "$2 name <email> "
	check $1 badDate "$2 name <email> 1234"
	check $1 badTimezone "$2 name <email> 1234 "
	check $1 badTimezone "$2 name <email> 1234 +"
}
ident_checks commit author
add "author name <email> 1234 +0000"

check commit missingCommitter ""
check commit missingCommitter "co"
check commit missingCommitter "committer"
ident_checks commit committer
add "committer name <email> 1234 +0000"

reset
check tag missingObject ""
check tag missingObject "obj"
check tag missingObject "object"
check tag badObjectSha1 "object "
check tag badObjectSha1 "object 1234"
add "object $commit"

check tag missingType ""
check tag missingType "ty"
check tag missingType "type"
check tag badType "type "
check tag badType "type com"
add "type commit"

check tag missingTagEntry ""
check tag missingTagEntry "ta"
check tag missingTagEntry "tag"
check tag badTagName "tag "
add "tag foo"

check tag missingTagger ""
check tag missingTagger "ta"
check tag missingTagger "tagger"
ident_checks tag tagger

# trees are a binary format and can't use our earlier helpers
test_expect_success 'truncated tree (short hash)' '
	printf "100644 foo\0\1\1\1\1" >input &&
	test_must_fail git hash-object -t tree input 2>err &&
	grep badTree err
'

test_expect_success 'truncated tree (missing nul)' '
	# these two things are indistinguishable to the parser. The important
	# thing about this example is that there are enough bytes to
	# make up a hash, and that there is no NUL (and we confirm that the
	# parser does not walk past the end of the buffer).
	printf "100644 a long filename, or a hash with missing nul?" >input &&
	test_must_fail git hash-object -t tree input 2>err &&
	grep badTree err
'
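The two tree inputs differ in how many bytes they carry and whether they contain a NUL at all. The sketch below measures both; byte_count and nul_count are hypothetical helpers that are not part of the test script, and the printf strings are copied from the two tests above.

```shell
# Hypothetical inspection helpers; both read the bytes to measure on stdin.
byte_count () {
	# $(( )) normalizes any padding that wc may print
	echo $(( $(wc -c) ))
}
nul_count () {
	# delete everything except NUL bytes, then count what is left
	echo $(( $(tr -cd '\000' | wc -c) ))
}

printf "100644 foo\0\1\1\1\1" | byte_count
printf "100644 foo\0\1\1\1\1" | nul_count
printf "100644 a long filename, or a hash with missing nul?" | byte_count
printf "100644 a long filename, or a hash with missing nul?" | nul_count
```

The first input is 15 bytes with a single NUL followed by only four bytes, which is too few for an object ID (20 bytes, assuming a SHA-1 repository). The second is 51 bytes with no NUL, so there are plenty of bytes after the mode but nothing to terminate the filename.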
test_done