do not stream large files to pack when filters are in use

Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.

The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size. Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.

This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement;
before we did not apply the conversion at all.

The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-24 23:10:17 +01:00
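
The size constraint described in the commit message can be seen by rebuilding a blob ID by hand. The following is only a rough sketch, assuming a SHA-1 repository and that `sha1sum` is installed; `blob_id` and `hello.txt` are hypothetical names used for illustration and are not part of git or of this test script.

```shell
# Sketch only: blob_id is a hypothetical helper, and sha1sum is assumed
# to be available. A repository using SHA-256 would hash differently.
blob_id () {
	# The byte count is part of the header that precedes the content,
	# so it has to be known before the first content byte is hashed.
	size=$(wc -c <"$1" | tr -d ' ') &&
	{
		printf 'blob %s\0' "$size" &&
		cat "$1"
	} | sha1sum | cut -d' ' -f1
}

printf 'hello\n' >hello.txt
blob_id hello.txt
```

In a SHA-1 repository the printed ID should agree with `git hash-object hello.txt`. A conversion filter may change the number of bytes (stripping a CR from each line, for example), which is why the final size cannot be taken from stat() when a filter applies.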

#!/bin/sh

test_description='test conversion filters on large files'

TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh

# Apply the given attributes to every path for the duration of one test.
set_attr() {
	test_when_finished 'rm -f .gitattributes' &&
	echo "* $*" >.gitattributes
}

# Add both files, then check that the first line of the converted "large"
# blob matches the converted "small" blob, i.e. both were converted alike.
check_input() {
	git read-tree --empty &&
	git add small large &&
	git cat-file blob :small >small.index &&
	git cat-file blob :large | head -n 1 >large.index &&
	test_cmp small.index large.index
}

# Check out both files again, then check that the first line of "large"
# in the working tree matches "small".
check_output() {
	rm -f small large &&
	git checkout small large &&
	head -n 1 large >large.head &&
	test_cmp small large.head
}

# "small" is 11 bytes and "large" is 22, so a threshold of 20 puts only
# "large" over core.bigfilethreshold.
test_expect_success 'setup input tests' '
	printf "\$Id: foo\$\\r\\n" >small &&
	cat small small >large &&
	git config core.bigfilethreshold 20 &&
	git config filter.test.clean "sed s/.*/CLEAN/"
'

test_expect_success 'autocrlf=true converts on input' '
	test_config core.autocrlf true &&
	check_input
'

test_expect_success 'eol=crlf converts on input' '
	set_attr eol=crlf &&
	check_input
'

test_expect_success 'ident converts on input' '
	set_attr ident &&
	check_input
'

test_expect_success 'user-defined filters convert on input' '
	set_attr filter=test &&
	check_input
'

# The blobs added here are 5 and 10 bytes, so a threshold of 7 again
# leaves only "large" over core.bigfilethreshold.
test_expect_success 'setup output tests' '
	echo "\$Id\$" >small &&
	cat small small >large &&
	git add small large &&
	git config core.bigfilethreshold 7 &&
	git config filter.test.smudge "sed s/.*/SMUDGE/"
'

test_expect_success 'autocrlf=true converts on output' '
	test_config core.autocrlf true &&
	check_output
'

test_expect_success 'eol=crlf converts on output' '
	set_attr eol=crlf &&
	check_output
'

test_expect_success 'user-defined filters convert on output' '
	set_attr filter=test &&
	check_output
'

test_expect_success 'ident converts on output' '
	set_attr ident &&
	rm -f small large &&
	git checkout small large &&
	sed -n "s/Id: .*/Id: SHA/p" <small >small.clean &&
	head -n 1 large >large.head &&
	sed -n "s/Id: .*/Id: SHA/p" <large.head >large.clean &&
	test_cmp small.clean large.clean
'

# This smudge filter prepends 5GB of zeros to the file it checks out. This
# ensures that smudging doesn't mangle large files on 64-bit Windows.
test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
		'files over 4GB convert on output' '
	test_commit test small "a small file" &&
	small_size=$(test_file_size small) &&
	test_config filter.makelarge.smudge \
		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
	echo "small filter=makelarge" >.gitattributes &&
	rm small &&
	git checkout -- small &&
	size=$(test_file_size small) &&
	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
'

# This clean filter writes down the size of input it receives. By checking against
# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
		'files over 4GB convert on input' '
	test-tool genzeros $((5*1024*1024*1024)) >big &&
	test_config filter.checklarge.clean "wc -c >big.size" &&
	echo "big filter=checklarge" >.gitattributes &&
	git add big &&
	test $(test_file_size big) -eq $(cat big.size)
'
test_done