2014-08-27 19:01:28 +02:00
|
|
|
#!/bin/sh
|
|
|
|
|
|
|
|
test_description='basic tests for fast-export --anonymize'
|
2020-11-19 00:44:42 +01:00
|
|
|
GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
|
tests: mark tests relying on the current default for `init.defaultBranch`
In addition to the manual adjustment to let the `linux-gcc` CI job run
the test suite with `master` and then with `main`, this patch makes sure
that GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME is set in all test scripts
that currently rely on the initial branch name being `master` by default.
To determine which test scripts to mark up, the first step was to
force-set the default branch name to `master` in
- all test scripts that contain the keyword `master`,
- t4211, which expects `t/t4211/history.export` with a hard-coded ref to
initialize the default branch,
- t5560 because it sources `t/t556x_common` which uses `master`,
- t8002 and t8012 because both source `t/annotate-tests.sh` which also
uses `master`.
This trick was performed by this command:
$ sed -i '/^ *\. \.\/\(test-lib\|lib-\(bash\|cvs\|git-svn\)\|gitweb-lib\)\.sh$/i\
GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master\
export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME\
' $(git grep -l master t/t[0-9]*.sh) \
t/t4211*.sh t/t5560*.sh t/t8002*.sh t/t8012*.sh
After that, careful, manual inspection revealed that some of the test
scripts containing the needle `master` do not actually rely on a
specific default branch name: either they mention `master` only in a
comment, or they initialize that branch specifically, or they do not
actually refer to the current default branch. Therefore, the
aforementioned modification was undone in those test scripts thusly:
$ git checkout HEAD -- \
t/t0027-auto-crlf.sh t/t0060-path-utils.sh \
t/t1011-read-tree-sparse-checkout.sh \
t/t1305-config-include.sh t/t1309-early-config.sh \
t/t1402-check-ref-format.sh t/t1450-fsck.sh \
t/t2024-checkout-dwim.sh \
t/t2106-update-index-assume-unchanged.sh \
t/t3040-subprojects-basic.sh t/t3301-notes.sh \
t/t3308-notes-merge.sh t/t3423-rebase-reword.sh \
t/t3436-rebase-more-options.sh \
t/t4015-diff-whitespace.sh t/t4257-am-interactive.sh \
t/t5323-pack-redundant.sh t/t5401-update-hooks.sh \
t/t5511-refspec.sh t/t5526-fetch-submodules.sh \
t/t5529-push-errors.sh t/t5530-upload-pack-error.sh \
t/t5548-push-porcelain.sh \
t/t5552-skipping-fetch-negotiator.sh \
t/t5572-pull-submodule.sh t/t5608-clone-2gb.sh \
t/t5614-clone-submodules-shallow.sh \
t/t7508-status.sh t/t7606-merge-custom.sh \
t/t9302-fast-import-unpack-limit.sh
We excluded one set of test scripts in these commands, though: the range
of `git p4` tests. The reason? `git p4` stores the (foreign) remote
branch in the branch called `p4/master`, which is obviously not the
default branch. Manual analysis revealed that only five of these tests
actually require a specific default branch name to pass; they were
modified thusly:
$ sed -i '/^ *\. \.\/lib-git-p4\.sh$/i\
GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master\
export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME\
' t/t980[0167]*.sh t/t9811*.sh
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
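As a rough illustration of what the `sed` invocations above do, the following sketch applies a simplified variant to a throwaway file. The file name `t0000-sample.sh` is made up, only the `test-lib.sh` alternative of the pattern is kept, the result is captured in a variable instead of being written back with `-i`, and the trailing blank line that the original command inserts is left out.

```shell
# Scratch directory so that nothing in a real checkout is touched.
tmp=$(mktemp -d)

# A made-up miniature test script (hypothetical file name and contents).
printf '%s\n' '#!/bin/sh' 'test_description=sample' '. ./test-lib.sh' 'test_done' \
	>"$tmp/t0000-sample.sh"

# Insert the two assignment lines before the line that sources test-lib.sh.
patched=$(sed '/^ *\. \.\/test-lib\.sh$/i\
GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master\
export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
' "$tmp/t0000-sample.sh")

printf '%s\n' "$patched"
rm -rf "$tmp"
```

The two new lines end up directly in front of the `. ./test-lib.sh` line, so the variable is already exported when the test library is sourced.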
2020-11-19 00:44:19 +01:00
|
|
|
export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
|
|
|
|
|
2014-08-27 19:01:28 +02:00
|
|
|
. ./test-lib.sh
|
|
|
|
|
|
|
|
test_expect_success 'setup simple repo' '
|
|
|
|
test_commit base &&
|
|
|
|
test_commit foo &&
|
2020-06-25 21:48:32 +02:00
|
|
|
test_commit retain-me &&
|
2014-08-27 19:01:28 +02:00
|
|
|
git checkout -b other HEAD^ &&
|
|
|
|
mkdir subdir &&
|
|
|
|
test_commit subdir/bar &&
|
|
|
|
test_commit subdir/xyzzy &&
|
fast-export: use xmemdupz() for anonymizing oids
Our anonymize_mem() function is careful to take a ptr/len pair to allow
storing binary tokens like object ids, as well as partial strings (e.g.,
just "foo" of "foo/bar"). But it duplicates the hash key using
xstrdup()! That means that:
- for a partial string, we'd store all bytes up to the NUL, even
though we'd never look at anything past "len". This didn't produce
wrong behavior, but was wasteful.
- for a binary oid that doesn't contain a zero byte, we'd copy garbage
bytes off the end of the array (though as long as nothing complained
about reading uninitialized bytes, further reads would be limited by
"len", and we'd produce the correct results)
- for a binary oid that does contain a zero byte, we'd copy _fewer_
bytes than intended into the hashmap struct. When we later try to
look up a value, we'd access uninitialized memory and potentially
falsely claim that a particular oid is not present.
The most common reason to store an oid is an anonymized gitlink, but our
test case doesn't have any gitlinks at all. So let's add one whose oid
contains a NUL and is present at two different paths. ASan catches the
memory error, but even without it we can detect the bug because the oid
is not anonymized the same way for both paths.
And of course the fix is to copy the correct number of bytes. We don't
technically need the appended NUL from xmemdupz(), but it doesn't hurt
as an extra protection against anybody treating it like a string (plus a
future patch will push us more in that direction).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
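To make the "oid contains a NUL" part concrete, here is a small sketch of how the fake gitlink oid in the setup test is derived. In the real test `ZERO_OID` is provided by `test-lib.sh`; the hard-coded 40-digit value below is an assumption that only holds for SHA-1 repositories.

```shell
# Assumption: SHA-1, so the all-zero oid is 40 hex digits.
# In the test suite this variable comes from test-lib.sh.
ZERO_OID=0000000000000000000000000000000000000000

# Without the "g" flag, sed replaces only the first "0".
fake_commit=$(echo $ZERO_OID | sed s/0/a/)

echo "$fake_commit"
```

Every hex digit after the leading `a` is still `0`, so all bytes of the binary oid after the first one are zero bytes. That is the case in which a string copy that stops at the first NUL stores fewer bytes than intended.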
2020-06-23 17:24:49 +02:00
|
|
|
fake_commit=$(echo $ZERO_OID | sed s/0/a/) &&
|
|
|
|
git update-index --add --cacheinfo 160000,$fake_commit,link1 &&
|
|
|
|
git update-index --add --cacheinfo 160000,$fake_commit,link2 &&
|
|
|
|
git commit -m "add gitlink" &&
|
2021-08-31 17:55:54 +02:00
|
|
|
git tag -m "annotated tag" mytag &&
|
|
|
|
git tag -m "annotated tag with long message" longtag
|
2014-08-27 19:01:28 +02:00
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'export anonymized stream' '
|
2020-06-25 21:48:32 +02:00
|
|
|
git fast-export --anonymize --all \
|
|
|
|
--anonymize-map=retain-me \
|
fast-export: de-obfuscate --anonymize-map handling
When we handle an --anonymize-map option, we parse the orig/anon pair,
and then feed the "orig" string to anonymize_str(), along with a
generator function that duplicates the "anon" string to be cached in the
map.
This works, because anonymize_str() says "ah, there is no mapping yet
for orig; I'll add one from the generator". But there are some
downsides:
1. It's a bit too clever, as it's not obvious what the code is trying
to do or why it works.
2. It requires allowing generator functions to take an extra void
pointer, which is not something any of the normal callers of
anonymize_str() want.
3. It does the wrong thing if the same token is provided twice.
When there are conflicting options, like:
git fast-export --anonymize \
--anonymize-map=foo:one \
--anonymize-map=foo:two
we usually let the second one override the first. But by using
anonymize_str(), which has first-one-wins logic, we do the
opposite.
So instead of relying on anonymize_str(), let's directly add the entry
ourselves. We can tweak the tests to show that we handle overridden
options correctly now.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
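The "second one overrides the first" convention from point 3 can be sketched in plain shell. This is only an illustration of the option semantics, not of how fast-export implements them; the helper name `resolve_map` is hypothetical.

```shell
# Hypothetical helper: print the replacement that a given token ends up
# with when later --anonymize-map options override earlier ones.
resolve_map () {
	token=$1
	shift
	result=
	for arg in "$@"
	do
		case "$arg" in
		--anonymize-map="$token":*)
			# Keep overwriting, so the last matching option wins.
			result=${arg#*:}
			;;
		esac
	done
	printf '%s\n' "$result"
}

resolve_map foo --anonymize-map=foo:one --anonymize-map=foo:two
```

A first-one-wins variant would only assign `result` while it is still empty, which corresponds to the behavior the commit message describes as wrong for conflicting options.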
2023-03-22 18:42:13 +01:00
|
|
|
--anonymize-map=xyzzy:should-not-appear \
|
2020-06-25 21:48:32 +02:00
|
|
|
--anonymize-map=xyzzy:custom-name \
|
fast-export: anonymize "master" refname
Running "fast-export --anonymize" will leave "refs/heads/master"
untouched in the output, for two reasons:
- it helped to have some known reference point between the original
and anonymized repository
- since it's historically the default branch name, it doesn't leak any
information
Now that we can ask fast-export to retain particular tokens, we have a
much better tool for the first one (because it works for any ref, not
just master).
For the second, the notion of "default branch name" is likely to become
configurable soon, at which point the name _does_ leak information.
Let's drop this special case in preparation.
Note that we have to adjust the test a bit, since it relied on using the
name "master" in the anonymized repos. We could just use
--anonymize-map=master to keep the same output, but then we wouldn't
know if it works because of our hard-coded master or because of the
explicit map.
So let's flip the test a bit, and confirm that we anonymize "master",
but keep "other" in the output.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-25 21:48:35 +02:00
|
|
|
--anonymize-map=other \
|
2020-06-25 21:48:32 +02:00
|
|
|
>stream
|
2014-08-27 19:01:28 +02:00
|
|
|
'
|
|
|
|
|
|
|
|
# this also covers commit messages
|
|
|
|
test_expect_success 'stream omits path names' '
|
|
|
|
! grep base stream &&
|
|
|
|
! grep foo stream &&
|
|
|
|
! grep subdir stream &&
|
|
|
|
! grep bar stream &&
|
|
|
|
! grep xyzzy stream
|
|
|
|
'
|
|
|
|
|
2020-06-25 21:48:32 +02:00
|
|
|
test_expect_success 'stream contains user-specified names' '
|
|
|
|
grep retain-me stream &&
|
2023-03-22 18:42:13 +01:00
|
|
|
! grep should-not-appear stream &&
|
2020-06-25 21:48:32 +02:00
|
|
|
grep custom-name stream
|
|
|
|
'
|
|
|
|
|
2020-06-23 17:24:49 +02:00
|
|
|
test_expect_success 'stream omits gitlink oids' '
|
|
|
|
# avoid relying on the whole oid to remain hash-agnostic; this is
|
|
|
|
# plenty to be unique within our test case
|
|
|
|
! grep a000000000000000000 stream
|
|
|
|
'
|
|
|
|
|
2020-06-25 21:48:35 +02:00
|
|
|
test_expect_success 'stream retains other as refname' '
|
|
|
|
grep other stream
|
2014-08-27 19:01:28 +02:00
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'stream omits other refnames' '
|
2020-11-19 00:44:42 +01:00
|
|
|
! grep main stream &&
|
2021-08-31 17:55:54 +02:00
|
|
|
! grep mytag stream &&
|
|
|
|
! grep longtag stream
|
2014-08-27 19:01:28 +02:00
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'stream omits identities' '
|
|
|
|
! grep "$GIT_COMMITTER_NAME" stream &&
|
|
|
|
! grep "$GIT_COMMITTER_EMAIL" stream &&
|
|
|
|
! grep "$GIT_AUTHOR_NAME" stream &&
|
|
|
|
! grep "$GIT_AUTHOR_EMAIL" stream
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'stream omits tag message' '
|
|
|
|
! grep "annotated tag" stream
|
|
|
|
'
|
|
|
|
|
|
|
|
# NOTE: we chdir to the new, anonymized repository
|
|
|
|
# after this. All further tests should assume this.
|
|
|
|
test_expect_success 'import stream to new repository' '
|
|
|
|
git init new &&
|
|
|
|
cd new &&
|
|
|
|
git fast-import <../stream
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'result has two branches' '
|
|
|
|
git for-each-ref --format="%(refname)" refs/heads >branches &&
|
|
|
|
test_line_count = 2 branches &&
|
2020-06-25 21:48:35 +02:00
|
|
|
other_branch=refs/heads/other &&
|
|
|
|
main_branch=$(grep -v $other_branch branches)
|
2014-08-27 19:01:28 +02:00
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'repo has original shape and timestamps' '
|
|
|
|
shape () {
|
|
|
|
git log --format="%m %ct" --left-right --boundary "$@"
|
|
|
|
} &&
|
2020-11-19 00:44:42 +01:00
|
|
|
(cd .. && shape main...other) >expect &&
|
2020-06-25 21:48:35 +02:00
|
|
|
shape $main_branch...$other_branch >actual &&
|
2014-08-27 19:01:28 +02:00
|
|
|
test_cmp expect actual
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'root tree has original shape' '
|
|
|
|
# the output entries are not necessarily in the same
|
2020-06-23 17:24:47 +02:00
|
|
|
# order, but we should at least have the same set of
|
|
|
|
# object types.
|
|
|
|
git -C .. ls-tree HEAD >orig-root &&
|
|
|
|
cut -d" " -f2 <orig-root | sort >expect &&
|
2014-08-27 19:01:28 +02:00
|
|
|
git ls-tree $other_branch >root &&
|
|
|
|
cut -d" " -f2 <root | sort >actual &&
|
|
|
|
test_cmp expect actual
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'paths in subdir ended up in one tree' '
|
2020-06-23 17:24:47 +02:00
|
|
|
git -C .. ls-tree other:subdir >orig-subdir &&
|
|
|
|
cut -d" " -f2 <orig-subdir | sort >expect &&
|
2014-08-27 19:01:28 +02:00
|
|
|
tree=$(grep tree root | cut -f2) &&
|
|
|
|
git ls-tree $other_branch:$tree >tree &&
|
|
|
|
cut -d" " -f2 <tree >actual &&
|
|
|
|
test_cmp expect actual
|
|
|
|
'
|
|
|
|
|
2020-06-23 17:24:49 +02:00
|
|
|
test_expect_success 'identical gitlinks got identical oid' '
|
|
|
|
awk "/commit/ { print \$3 }" <root | sort -u >commits &&
|
|
|
|
test_line_count = 1 commits
|
|
|
|
'
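The check above can be tried outside a repository with a hand-written stand-in for the `root` file. The sketch below assumes the usual `ls-tree` line layout (mode, type and oid separated by spaces, then a tab and the path); the oids and path names are made up.

```shell
tmp=$(mktemp -d)

# Two gitlink entries that share one oid, plus an unrelated blob entry.
printf '160000 commit %s\t%s\n' \
	a000000000000000000000000000000000000000 link1 \
	a000000000000000000000000000000000000000 link2 >"$tmp/root"
printf '100644 blob %s\t%s\n' \
	b000000000000000000000000000000000000000 file.t >>"$tmp/root"

# Same pipeline as in the test: pick the oid column of the commit
# entries and collapse duplicates.
awk '/commit/ { print $3 }' <"$tmp/root" | sort -u >"$tmp/commits"
commit_count=$(wc -l <"$tmp/commits")

echo "distinct gitlink oids: $commit_count"
rm -rf "$tmp"
```

If the two gitlinks had been anonymized differently, `sort -u` would leave two lines and the line-count check would fail. Note that `/commit/` matches anywhere on the line, so a path containing that word would be picked up as well.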
|
|
|
|
|
2021-08-31 17:55:54 +02:00
|
|
|
test_expect_success 'all tags point to branch tip' '
|
2014-08-27 19:01:28 +02:00
|
|
|
git rev-parse $other_branch >expect &&
|
2021-08-31 17:55:54 +02:00
|
|
|
git for-each-ref --format="%(*objectname)" | grep . | uniq >actual &&
|
2014-08-27 19:01:28 +02:00
|
|
|
test_cmp expect actual
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'idents are shared' '
|
|
|
|
git log --all --format="%an <%ae>" >authors &&
|
|
|
|
sort -u authors >unique &&
|
|
|
|
test_line_count = 1 unique &&
|
|
|
|
git log --all --format="%cn <%ce>" >committers &&
|
|
|
|
sort -u committers >unique &&
|
|
|
|
test_line_count = 1 unique &&
|
|
|
|
! test_cmp authors committers
|
|
|
|
'
|
|
|
|
|
|
|
|
test_done
|